It's not my idea. I've been doing some reading and this is called Integrate and Dump. It could be used elsewhere in the smart pins, probably. In the differentiator diff1 stores the previous value of acc3, but if acc3 is reset after being read then diff1 is redundant.
I like this idea. In typical operation we would just be calculating the difference anyway. This saves an instruction or two.
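A quick way to convince yourself of the integrate-and-dump equivalence described above is to run both forms side by side. This is only a sketch in Python, not the smart pin Verilog: the function names, the 16-clock period and the random stand-in bitstream are all assumptions made for illustration. Only `acc3` and `diff1` are names taken from the post.

```python
import random

def sinc3_with_diff1(bits, period):
    """Sinc3 decimator with free-running integrators and three differentiators."""
    acc1 = acc2 = acc3 = 0
    diff1 = diff2 = diff3 = 0
    out = []
    for i, b in enumerate(bits, 1):
        acc1 += b
        acc2 += acc1
        acc3 += acc2
        if i % period == 0:
            d1 = acc3 - diff1      # diff1 holds the previous reading of acc3
            diff1 = acc3
            d2 = d1 - diff2
            diff2 = d1
            d3 = d2 - diff3
            diff3 = d2
            out.append(d3)
    return out

def sinc3_integrate_and_dump(bits, period):
    """Same filter, but acc3 is cleared after each read, so diff1 is not needed."""
    acc1 = acc2 = acc3 = 0
    diff2 = diff3 = 0
    out = []
    for i, b in enumerate(bits, 1):
        acc1 += b
        acc2 += acc1
        acc3 += acc2
        if i % period == 0:
            d1 = acc3              # already the sum over this period only
            acc3 = 0               # the "dump"
            d2 = d1 - diff2
            diff2 = d1
            d3 = d2 - diff3
            diff3 = d2
            out.append(d3)
    return out

rng = random.Random(1)             # fixed seed: repeatable stand-in bitstream
bits = [rng.randint(0, 1) for _ in range(16 * 20)]
reference = sinc3_with_diff1(bits, 16)
dumped = sinc3_integrate_and_dump(bits, 16)
print(reference == dumped)
```

The two lists match because subtracting the previous reading of a free-running `acc3` gives the same number as summing `acc2` over one period from zero.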
The proposed Tukey window is better than a sinc3 length 8, mostly because the impulse response is longer. Comparing it to a length 16 sinc3 it's mixed.
It has lost a little resolution. The slow starting ramp up of that 50000 clocks now has a flat three samples, then a one-sample overreaction before settling.
I'm not quite following - how much lost resolution is 'a little' ? and what you describe sounds like a longer settling time, too ?
Is all this pruning/truncating/shifting/masking increasing the risk of a Verilog bug ? (or worse, a FPGA <> ASIC bug...)
How will Chip test the verilog ?
He could I guess connect one of the external CLK.DAT ADCs ?
eg an AMC1035D - looks to include a 0.2% VREF 2.5V but cannot connect straight to that, with the +/- 1V Analog IN
Can come on an Eval Board, AMC1035EVM, tho fitted with somewhat disappointing 1% resistors.....
So maybe that 0.2% 2.5V VREF can connect to a simple chain of maybe 5~10 decent resistors like
ERA-6ARW102V Panasonic RES SMD 1K OHM 0.05% 1/8W 0805 21,935 stk $0.46650/100 ±10ppm/°C
(and of course, a P2 DAC pin can )
In that case 33 source, if I change just two masks, ACC2MASK to ACC1MASK and ACC3MASK to ACC1MASK, then the red line matches the original yellow line exactly. I can't see even a single reading different.
Oh, that's 24-bit readings from 256 clocks. So, not yet trimmed for nominal 16-bit ENOB.
I'm just taking the ratio of same-sample-number outputs, the unexpected thing is the long time to converge here, that seems to have arrived with the trim/prune ?
Okay, it's not actually a settling time problem. It's some sort of resolution limitation that is slightly more sensitive to the extremely slow ramp up on those early bits in the bitstream. The NCO takes time to fill up at the start and may not be an ideal bitstream source.
I just did some experiments with SincN possibilities and it seems you must run the differentiator at 16 samples, or more, to get 8-bit resolution, regardless of the N in SincN. I was thinking going higher-order (SINC9) might have allowed for more frequent decimations (like, one every clock), but that's not the case. It looks like the Tukey-window approach is most practical, by far.
There's one more experiment I want to try: Running SINC3 at 16 samples, but with 16 interleaved calculations woven together, in order to get an 8-bit sample on every clock. This would be impossibly expensive in hardware, but would represent something ideal, if possible. I want to see what it looks like!
For the Tukey, the max could be anywhere from 1020-1023. It doesn't have to be 1023.
The cutoff frequency seems just a little low for analog video. Although I wonder what the response is on the analog part of the chip.
Saucy, I'm going to do a quick test to see how SINC8 would work with decimation on every clock. Any predictions?
Since the signals you're looking at are below 3 MHz, you won't see much difference whether you sample them at 25MHz or 200MHz. Sampling must be done at more than 2x the bandwidth to meet Nyquist. For oscilloscope applications it's good to oversample 5-10x or more.
Processing samples at the full system clock rate is wasteful and expensive. But there is one benefit to generating them all: the streamer can get filtered samples whenever it wants. The streamer's numerically controlled oscillator gives great flexibility in controlling the sample rate. Since the sample rate is so much higher than the signal bandwidth it's ok to grab a sample whenever. It would not be unreasonable to couple it to a software PLL to lock the sampling rate to external data. Example: getting a fixed number of samples in a line of video.
The sinc filters can also be implemented as several moving average filters cascaded together.
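To make the moving-average view concrete, here is a small sketch that builds a sinc3 impulse response by convolving three boxcars. The length of 16 is an assumption picked to match the "16 samples" figures mentioned in this thread, and `convolve` is just a helper written for this example.

```python
def convolve(a, b):
    """Plain discrete convolution of two integer sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

LENGTH = 16                           # assumed moving-average length
boxcar = [1] * LENGTH                 # one moving-average stage
sinc2_taps = convolve(boxcar, boxcar)      # triangle
sinc3_taps = convolve(sinc2_taps, boxcar)  # bell-shaped sinc3 window

print(len(sinc3_taps), sum(sinc3_taps), max(sinc3_taps))
```

The tap sum is the DC gain of the filter (LENGTH cubed), which is what sets how many bits the output needs.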
To output every sample you would need to increase the delay at the diffs. So instead of subtracting the previous sample, you'd need to subtract the value 8 samples ago.
It might be less resource intensive to calculate it directly as moving average filters.
stage 1: shift register 16 bits, output 5 bits
stage 2: shift register 5x16 bits, output 9 bits
stage 3: shift register 9x16 bits, output 13 bits
It should be more efficient to add the new sample and subtract the old one coming out of the FIFO.
Unfortunately the max output of the first stage is 16 which adds a bit everywhere. Or we could ignore it and let all ones output a zero. Or make it saturate. Or truncate the stage 2-3 connection. Or we could use different lengths for each stage. If the lengths are chosen carefully the performance is good. It might be harder to control the scaling of the output with a setup like this.
It might be good to have some adjustment to the filter bandwidth. We could have a mux to select the delay feeding the diffs.
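The add-new, subtract-old idea and the stage widths above can be checked with a short model. This is a sketch only: the `MovingSum` class and the all-ones test input are assumptions for illustration, and a Python `deque` stands in for the hardware shift register.

```python
from collections import deque

class MovingSum:
    """Running sum over the last `length` samples: one add, one subtract per step."""
    def __init__(self, length):
        self.fifo = deque([0] * length)   # stands in for the shift register
        self.total = 0

    def step(self, sample):
        oldest = self.fifo.popleft()      # value falling out of the FIFO
        self.fifo.append(sample)
        self.total += sample - oldest
        return self.total

LENGTH = 16
stages = [MovingSum(LENGTH) for _ in range(3)]
peaks = [0, 0, 0]

# Worst case: an all-ones bitstream, long enough to fill all three stages.
for bit in [1] * 64:
    value = bit
    for k, stage in enumerate(stages):
        value = stage.step(value)
        peaks[k] = max(peaks[k], value)

widths = [p.bit_length() for p in peaks]
print(peaks, widths)
```

The peak of exactly 16 out of the first stage is the value that needs the extra bit; any input short of 16 consecutive ones would fit in 4 bits.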
Yes, 7th sample now converges to ~30 LSB's instead of 277 above. 10th sample is now ~ 3.5 LSB's
However, if you are multiplexing, those are still quite long settling times.
There are differences throughout. You can call it a convergence if you like but it's not a settling thing. Settling is done in three samples for all Sinc3. Even the really badly pruned ones.
Erna's posted spreadsheet has three pyramids of tiny adders working on shifted bits (11-bit window), with the three tops summed for sample reading.
EDIT: I can't find when he posted it but here's the spreadsheet. The filter is in columns S to Y.
Okay. I did a COST-IS-NO-OBJECT exploration into what is the best 8-bit per-clock sampling we could possibly get, maximizing the bandwidth, using any topology of SincN filters that seemed optimal.
Here is the VERY BEST performance I could obtain using SincN filtering:
- 10 separate Sinc8 filters (eighth order!!!)
- Each processing 10 ADC samples per decimation
- Filters each working at different ADC bitstream offsets (0..9)
- Filters' outputs are interleaved to form a continuous sample stream
This would be horrendously expensive in silicon.
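For anyone wanting to play with the interleaving idea, here is a rough model of it. It is a sketch under assumptions, not the simulation used for the plots: the function names are made up for this example, the bitstream is random rather than real ADC output, and the integrators are modelled without pipeline delay. It also shows that interleaving the decimated filters gives the same numbers as a full-rate cascade of moving sums.

```python
import random

def cic_decimate(bits, order, rate, offset):
    """Order-N sinc decimator emitting one (index, value) pair every `rate` inputs."""
    integ = [0] * order
    comb = [0] * order
    out = []
    for i, b in enumerate(bits):
        v = b
        for k in range(order):            # integrators run on every clock
            integ[k] += v
            v = integ[k]
        if i % rate == offset:            # differentiators run at the decimated rate
            for k in range(order):
                v, comb[k] = v - comb[k], v
            out.append((i, v))
    return out

def interleaved(bits, order, rate):
    """Run `rate` decimators at offsets 0..rate-1 and merge them into one stream."""
    merged = []
    for offset in range(rate):
        merged.extend(cic_decimate(bits, order, rate, offset))
    merged.sort()                         # order by input index
    return [v for _, v in merged]

def boxcar_cascade(bits, order, rate):
    """Full-rate reference: `order` moving sums of length `rate` in series."""
    sig = list(bits)
    for _ in range(order):
        acc, out = 0, []
        for i, v in enumerate(sig):
            acc += v
            if i >= rate:
                acc -= sig[i - rate]
            out.append(acc)
        sig = out
    return sig

rng = random.Random(2)                    # fixed seed: repeatable stand-in bitstream
bits = [rng.randint(0, 1) for _ in range(200)]
woven = interleaved(bits, order=8, rate=10)
direct = boxcar_cascade(bits, order=8, rate=10)
print(woven == direct)
```

So the ten-filter arrangement is, numerically, one long FIR window evaluated on every clock, which puts it on the same footing as the tap-chain approach when comparing cost.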
I was totally surprised. Compare this exotic solution to the simple hand-tuned Tukey window I made last night using TonyB_'s 17*/24 pattern:
You can see we have a harder limit on our slew rate.
One thing I found is that the SincN window had better be at least 10 bits, in order to cover the data cycling in the ADC bitstream, which is about 7 bits, rise-to-rise, worst case.
I'm just taking a simple ratio of your different filter data outputs, and then comparing any difference, with an equivalent 1/2^16 LSB size.
Actually looking closer, it starts larger and reduces, but seems to not totally converge, but oscillates about... (ref : 1/2^16 is 15.3ppm)
1-5889219/5888960 = -43.980ppm
1-5803256/5801472 = -307.508ppm
1-5717881/5719296 = 247.408ppm
1-5632628/5632640 = 2.130ppm
1-5544874/5544256 = -111.466ppm
With the same stages, and same additions, I would not expect group delay type changes ?
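The ratio arithmetic above is easy to reproduce. The sketch below uses the five pairs of readings from this post plus one pair from the earlier settling comparison; the helper names are made up for this example.

```python
def ratio_error(test, reference):
    """Fractional deviation between two same-sample-number filter outputs."""
    return 1 - test / reference

def to_ppm(err):
    return err * 1e6

def to_lsb16(err):
    """Express a fractional error in units of one 16-bit LSB (1/2**16)."""
    return err * 2 ** 16

pairs = [
    (5889219, 5888960),
    (5803256, 5801472),
    (5717881, 5719296),
    (5632628, 5632640),
    (5544874, 5544256),
]

LSB16_PPM = 1e6 / 2 ** 16                 # about 15.26 ppm per 16-bit LSB

for test, reference in pairs:
    err = ratio_error(test, reference)
    print(f"{to_ppm(err):9.3f} ppm  {to_lsb16(err):7.2f} LSB")
```

Dividing any of the ppm figures by `LSB16_PPM` gives the same LSB count as `to_lsb16`.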
Given there are differences, it looks that way. I'm just doing a simple ratio of Evanh's tables first line / last line.
Maybe the prune is still not quite done right, as I'm puzzled to see this type of rather subtle deviation, from what should be a coarse storage-vessel change.
in ADC terms however, those differences are large. (I'm assuming the sinc3 is ideal here)
Looking at the AD7403 data, it specs 16b, but shows typical DNL of < 0.2 LSB and typical INL inside 0.4 LSB - those are 3~6 ppm regions.
So I'd think you would want your filters to 'not make that worse' ?
Addit: AD7403 specs say
AD7403 VDD1 = 4.5 V to 5.5 V, VDD2 = 3 V to 5.5 V, VIN+ = −250 mV to +250 mV, VIN− = 0 V, TA = −40°C to +125°C, fMCLKIN1 = 5 MHz to 20 MHz, tested with sinc3 filter, 256 decimation rate, as defined by Verilog code, unless otherwise noted.
So I think they hit those DNL of < 0.2 LSB and INL < 0.4 LSB, using a Sinc3 filter.
That last change made the three accumulators 30-bit, 28-bit and 24-bit.
The top line (sinc3) in your tables is then 30-bit, 30-bit, 30-bit ?
Given the first one is a counter that is there anyway, and the second one is an adder that is there anyway, the saving is looking slight - but has downsides.
Here's data with all three accumulators at 30-bit on sixth line "Sinc3-rolling_P-acc30-dif32".
EDIT: Looks an exact match to original 32-bit version.
EDIT2: And 24-bit accumulators is also an exact match! Because the sample period is 256 clocks.
EDIT3: Refreshed csv6 file after a bug fix. No change in data but just being thorough.
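The reason 24 bits is enough at 256 clocks per sample is that a sinc3 has a gain of 256^3 = 2^24, and the integrators and differentiators only need to be right modulo the register width. Here is a sketch that checks this. The `sinc3` function and the random bitstream are assumptions for illustration, and every register is wrapped to the same width, which is not necessarily how the pruned test code above is arranged.

```python
import random

def sinc3(bits, period, width=None):
    """Sinc3 decimator; if `width` is given, every register wraps to that many bits."""
    mask = (1 << width) - 1 if width is not None else -1   # -1 keeps all bits
    acc1 = acc2 = acc3 = 0
    diff1 = diff2 = diff3 = 0
    out = []
    for i, b in enumerate(bits, 1):
        acc1 = (acc1 + b) & mask
        acc2 = (acc2 + acc1) & mask
        acc3 = (acc3 + acc2) & mask
        if i % period == 0:
            d1 = (acc3 - diff1) & mask
            diff1 = acc3
            d2 = (d1 - diff2) & mask
            diff2 = d1
            d3 = (d2 - diff3) & mask
            diff3 = d2
            out.append(d3)
    return out

rng = random.Random(3)                    # fixed seed: repeatable stand-in bitstream
bits = [rng.randint(0, 1) for _ in range(256 * 8)]

wide = sinc3(bits, 256)                   # unlimited-precision reference
narrow24 = sinc3(bits, 256, width=24)
narrow16 = sinc3(bits, 256, width=16)
print(narrow24 == wide, narrow16 == wide)
```

One caveat: an input that is all ones for the whole window produces exactly 2^24, which wraps to zero in 24 bits, so the match holds for anything short of that.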
I'm compiling the Tukey filter to see how big it is. This module handles its own configuration via 'set' and 'd', then picks a pin (0..63) and runs it into the tap chain. At each clock, all the taps are added up and an 8-bit sample is output:
This is a brute-force approach, but it might be decomposed during compilation, so that some efficient implementation is made. If not, I'll break it down somehow.
So, it takes 80 ALMs, which is less than the Sinc3's 100 ALMs. It killed timing, though. I'll need to break it down and pipeline it a little.
So what are the actual final sizes for the numbers - in Verilog and in the SW-post filter ?
Interesting how the settling time has degraded quite a bit, with just a prune of some number-depths.
Can you plot Yellow - Blue, to see a zoom-in of the differences ?
Thanks, that shows a long settling convergence, still significant even after 27 samples. Not what I'd expect from something that otherwise works ?
examples of convergence / settling times, in 16b LSB equivalents
1-472456/470464 = -0.004234 appx 7th sample error (~277 LSBs at 16b)
1-2190504/2190400 = -4.747e-5 appx 27th sample, ~3 LSBs different at 16b.
( ref 1/2^16 = 1.525e-5)
Not looking usable with Chip's auto-calibrate ideas ?
How to round in less logic than just adding the pruned register bits back in?
EDIT: Made a big difference just by unpruning a single bit. Changed the +3 to a +2. Nothing else.
And that is the reason why he is doing that amazing work.
I can't even imagine how you are handling this, but overall it seems to work out.
When Chip is ready it will be a wonderful product.
Enjoy!
Mike
PS: There's no guarantee that the reference is accurate. I should plot the "dclevel" line as well, it's the simulated ideal, ...