ADC Sampling Breakthrough

cgracey · 2018-11-28 05:16

SaucySoliton wrote: »
TonyB_ wrote: »

It's not my idea. I've been doing some reading and this is called Integrate and Dump. It could be used elsewhere in the smart pins, probably. In the differentiator diff1 stores the previous value of acc3, but if acc3 is reset after being read then diff1 is redundant.

I like this idea. In typical operation we would just be calculating the difference anyway. This saves an instruction or two.
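
To illustrate the integrate-and-dump point, here is a minimal Python sketch. It assumes acc3 is a plain running sum that is read once every `period` clocks; the function names and the input values (standing in for acc2) are hypothetical and are not taken from the smart pin Verilog.

```python
def read_with_diff(acc2_values, period):
    """Free-running acc3; diff1 holds the previous reading and is subtracted."""
    acc3 = 0
    diff1 = 0
    out = []
    for n, x in enumerate(acc2_values, start=1):
        acc3 += x
        if n % period == 0:
            out.append(acc3 - diff1)  # difference against the stored reading
            diff1 = acc3
    return out


def read_with_dump(acc2_values, period):
    """Integrate and dump: acc3 is cleared after each read, so diff1 is not needed."""
    acc3 = 0
    out = []
    for n, x in enumerate(acc2_values, start=1):
        acc3 += x
        if n % period == 0:
            out.append(acc3)
            acc3 = 0                  # the "dump"
    return out


values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]  # made-up stand-in for acc2
print(read_with_diff(values, 4))
print(read_with_dump(values, 4))
```

Both functions return the same readings, which is the sense in which diff1 becomes redundant. Any later differentiator stages would still be needed.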

The proposed Tukey window is better than a sinc3 of length 8, mostly because the impulse response is longer. Compared with a length 16 sinc3, the results are mixed.
Top line = Tukey, full scale = 1023, truncate 2 LSBs (/4)
Middle line = Sinc^3 8, full scale = 512, truncate 1 LSB (/2)
Bottom line = Sinc^3 16, full scale = 4096, truncate 4 LSBs (/16)
Mean Value (channels 1-8 gio) , (channels 1-8 vio)
    45.367    48.506    44.407    45.596    43.596    43.394    48.906    46.799   212.372   214.971   214.590   214.560   211.250   212.677   215.867   215.058
    45.468    48.796    44.694    45.739    43.672    43.452    49.247    47.111   212.613   215.312   214.924   214.897   211.585   212.879   216.276   215.406
    45.236    48.339    44.605    45.468    43.485    43.265    48.929    46.685   212.550   215.054   214.720   214.702   211.387   212.827   215.931   215.120
Standard Deviation (channels 1-8 gio) , (channels 1-8 vio)
   0.55194   0.51324   0.55921   0.50285   0.53237   0.61839   0.35290   0.47419   0.72835   0.40526   0.52742   0.53002   0.49377   0.54610   0.40421   0.31277
   1.56901   1.88077   1.74451   1.51102   1.77894   1.71822   1.88983   1.51053   1.35328   1.56201   1.64810   1.65715   1.73280   1.35994   1.06898   1.49403
   0.43681   0.51541   0.50886   0.50528   0.56974   0.69142   0.70789   0.49365   0.71113   0.53671   0.47227   0.50249   0.53767   0.51045   0.38889   0.42351
For the Tukey, the max could be anywhere from 1020-1023. It doesn't have to be 1023.

The cutoff frequency seems just a little low for analog video. Although I wonder what the response is on the analog part of the chip.

Saucy, I'm going to do a quick test to see how SINC8 would work with decimation on every clock. Any predictions?

jmg · 2018-11-28 05:25

evanh wrote: »

cgracey wrote: »

This pruning is slightly lossy, right?

It has lost a little resolution. The slow starting ramp up of that 50000 clocks now has a flat three samples then a one sample over reaction before settling.

I'm not quite following: how much lost resolution is 'a little'? And what you describe sounds like a longer settling time, too?

Is all this pruning/truncating/shifting/masking increasing the risk of a Verilog bug? (or worse, an FPGA <> ASIC bug...)

How will Chip test the Verilog?

He could I guess connect one of the external CLK.DAT ADCs ?
eg an AMC1035D - looks to include a 0.2% VREF 2.5V but cannot connect straight to that, with the +/- 1V Analog IN
Can come on an Eval Board, AMC1035EVM, tho fitted with somewhat disappointing 1% resistors.....
So maybe that 0.2% 2.5V VREF can connect to a simple chain of maybe 5~10 decent resistors like
ERA-6ARW102V Panasonic RES SMD 1K OHM 0.05% 1/8W 0805 21,935 stk $0.46650/100 ±10ppm/°C
(and of course, a P2 DAC pin can )

jmg · 2018-11-28 05:35

evanh wrote: »

Here's a graph with all three:
- Yellow is my older method using all 32-bit
- Blue is the tightest with diff masking
- Red has no diff masking

So what are the actual final sizes for the numbers - in Verilog and in the SW-post filter ?

Interesting how the settling time has degraded quite a bit, with just a prune of some number-depths.

Can you plot Yellow - Blue, to see a zoom-in of the differences ?

evanh · 2018-11-28 05:48

Here's the data file I imported to spreadsheet. First, third and fifth lines graphed.

evanh · 2018-11-28 05:58

In that case 33 source, if I change just two masks, ACC2MASK to ACC1MASK and ACC3MASK to ACC1MASK, then the red line matches the original yellow line exactly. I can't see even a single reading different.

jmg · 2018-11-28 06:01

evanh wrote: »

Here's the data file I imported to spreadsheet. First, third and fifth lines graphed.

Thanks, that shows a long settling convergence, still significant even after 27 samples. Not what I'd expect from something that otherwise works ?

Examples of convergence / settling times, in 16b LSB equivalents:

1-472456/470464 = -0.004234 at approx. the 7th sample (an error of ~277 LSBs at 16b)
1-2190504/2190400 = -4.747e-5 at approx. the 27th sample (~3 LSBs different at 16b)
( ref 1/2^16 = 1.525e-5)
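
The arithmetic above can be reproduced with a short Python sketch. The function name `error_in_lsbs` is hypothetical; the two number pairs are the ones quoted above, taken as the ratio of the first and last lines of evanh's table.

```python
def error_in_lsbs(first_line, last_line, bits=16):
    """Relative difference 1 - a/b, also expressed in LSBs of a `bits`-wide result."""
    rel = 1 - first_line / last_line
    return rel, rel * (1 << bits)


for label, a, b in [("7th sample", 472456, 470464),
                    ("27th sample", 2190504, 2190400)]:
    rel, lsbs = error_in_lsbs(a, b)
    print(f"{label}: {rel:.4e} relative, {lsbs:.1f} LSBs at 16 bits")
```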

Not looking usable with Chip's auto-calibrate ideas ?

evanh · 2018-11-28 06:06

Oh, that's 24-bit readings from 256 clocks. So, not yet trimmed for nominal 16-bit ENOB.

jmg · 2018-11-28 06:12

evanh wrote: »

Oh, that's 24-bit readings from 256 clocks. So, not yet trimmed for nominal 16-bit ENOB.

I'm just taking the ratio of same-sample-number outputs, the unexpected thing is the long time to converge here, that seems to have arrived with the trim/prune ?

evanh · 2018-11-28 06:18

Right, yep.

Okay, it's not actually a settling time problem. It's some sort of resolution limitation that is slightly more sensitive to the extremely slow ramp up on those early bits in the bitstream. The NCO takes time to fill up at the start and may not be an ideal bitstream source.

evanh · 2018-11-28 06:34

Here's the data file again but with ramp starting from 10000: All three lines settle together at the third sample.

cgracey · 2018-11-28 06:39

I just did some experiments with SincN possibilities and it seems you must run the differentiator at 16 samples, or more, to get 8-bit resolution, regardless of the N in SincN. I was thinking going higher-order (SINC9) might have allowed for more frequent decimations (like, one every clock), but that's not the case. It looks like the Tukey-window approach is most practical, by far.

There's one more experiment I want to try: Running SINC3 at 16 samples, but with 16 interleaved calculations woven together, in order to get an 8-bit sample on every clock. This would be impossibly expensive in hardware, but would represent something ideal, if possible. I want to see what it looks like!

evanh · 2018-11-28 06:41

Huh, there is a tiny oscillation at the full swing step. But again, this is an extremely low bit rate in the bitstream because of the 50000 NCO.

evanh · 2018-11-28 06:51

Okay, so rounding might help.

How to round in less logic than just adding the pruned register bits back in?

EDIT: Made a big difference just by unpruning a single bit.

#define  ACC2MASK  (~((1 << (WORDSIZE + 2 - ACC1SIZE)) - 1))

Changed the +3 to a +2. Nothing else.
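
For anyone following along, here is a small Python sketch of what that one-bit change does to the mask. WORDSIZE = 32 and ACC1SIZE = 30 are assumptions, not values shown in the post; they are chosen only because they give a 28-bit second accumulator, which matches the sizes reported later in the thread.

```python
WORDSIZE = 32   # assumed register width, not given in the post
ACC1SIZE = 30   # assumed first accumulator size, not given in the post


def acc2_mask(offset):
    """Python equivalent of ~((1 << (WORDSIZE + offset - ACC1SIZE)) - 1), limited to WORDSIZE bits."""
    low_bits = WORDSIZE + offset - ACC1SIZE      # number of pruned LSBs
    return ~((1 << low_bits) - 1) & ((1 << WORDSIZE) - 1)


for offset in (3, 2):
    mask = acc2_mask(offset)
    print(f"+{offset}: mask = {mask:#010x}, kept bits = {bin(mask).count('1')}")
```

Under these assumptions, going from +3 to +2 prunes four LSBs instead of five.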

SaucySoliton · 2018-11-28 07:14

cgracey wrote: »
SaucySoliton wrote: »
TonyB_ wrote: »

It's not my idea. I've been doing some reading and this is called Integrate and Dump. It could be used elsewhere in the smart pins, probably. In the differentiator diff1 stores the previous value of acc3, but if acc3 is reset after being read then diff1 is redundant.

I like this idea. In typical operation we would just be calculating the difference anyway. This saves an instruction or two.

The proposed Tukey window is better than a sinc3 of length 8, mostly because the impulse response is longer. Compared with a length 16 sinc3, the results are mixed.
Top line = Tukey, full scale = 1023, truncate 2 LSBs (/4)
Middle line = Sinc^3 8, full scale = 512, truncate 1 LSB (/2)
Bottom line = Sinc^3 16, full scale = 4096, truncate 4 LSBs (/16)
Mean Value (channels 1-8 gio) , (channels 1-8 vio)
    45.367    48.506    44.407    45.596    43.596    43.394    48.906    46.799   212.372   214.971   214.590   214.560   211.250   212.677   215.867   215.058
    45.468    48.796    44.694    45.739    43.672    43.452    49.247    47.111   212.613   215.312   214.924   214.897   211.585   212.879   216.276   215.406
    45.236    48.339    44.605    45.468    43.485    43.265    48.929    46.685   212.550   215.054   214.720   214.702   211.387   212.827   215.931   215.120
Standard Deviation (channels 1-8 gio) , (channels 1-8 vio)
   0.55194   0.51324   0.55921   0.50285   0.53237   0.61839   0.35290   0.47419   0.72835   0.40526   0.52742   0.53002   0.49377   0.54610   0.40421   0.31277
   1.56901   1.88077   1.74451   1.51102   1.77894   1.71822   1.88983   1.51053   1.35328   1.56201   1.64810   1.65715   1.73280   1.35994   1.06898   1.49403
   0.43681   0.51541   0.50886   0.50528   0.56974   0.69142   0.70789   0.49365   0.71113   0.53671   0.47227   0.50249   0.53767   0.51045   0.38889   0.42351
For the Tukey, the max could be anywhere from 1020-1023. It doesn't have to be 1023.

The cutoff frequency seems just a little low for analog video. Although I wonder what the response is on the analog part of the chip.
Saucy, I'm going to do a quick test to see how SINC8 would work with decimation on every clock. Any predictions?

Since the signals you're looking at are below 3 MHz, you won't see much difference whether you sample it at 25MHz or 200MHz. Sampling must be done at 2x the bandwidth to meet Nyquist. For oscilloscope applications it's good to oversample 5-10x or more.

Processing samples at the full system clock rate is wasteful and expensive. But there is one benefit to generating them all: the streamer can get filtered samples whenever it wants. The streamer's numerically controlled oscillator gives great flexibility in controlling the sample rate. Since the sample rate is so much higher than the signal bandwidth it's ok to grab a sample whenever. It would not be unreasonable to couple it to a software PLL to lock the sampling rate to external data. Example: getting a fixed number of samples in a line of video.
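
As a rough illustration of grabbing samples with an NCO, here is a generic phase-accumulator sketch in Python. It is not a model of the actual streamer; the 32-bit width and the name `nco_sample_ticks` are assumptions made for the example.

```python
def nco_sample_ticks(freq_word, n_clocks, width=32):
    """Return the clock indices at which a phase accumulator overflows."""
    modulus = 1 << width
    phase = 0
    ticks = []
    for clk in range(n_clocks):
        phase += freq_word
        if phase >= modulus:      # overflow: take a filtered sample on this clock
            phase -= modulus
            ticks.append(clk)
    return ticks


# A frequency word of 3/16 of full scale takes 3 samples every 16 clocks.
print(nco_sample_ticks(3 << 28, 32))
```

Because a filtered sample is available on every clock, the sample instants can fall wherever the accumulator overflows, and a software PLL could trim `freq_word` to track an external rate.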

The sinc filters can also be implemented as several moving average filters cascaded together.
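
A quick way to check that equivalence is the Python sketch below. It compares three cascaded moving sums of length n against the usual integrator/decimate/differentiator form of sinc3. The function names are hypothetical and both versions start from all-zero state.

```python
import random


def moving_sum(x, n):
    """Sum of the latest n samples, with zeros assumed before the start."""
    out, total = [], 0
    for i, v in enumerate(x):
        total += v
        if i >= n:
            total -= x[i - n]
        out.append(total)
    return out


def sinc3_boxcars(bits, n):
    """Sinc3 as three cascaded moving sums, one output per input clock."""
    return moving_sum(moving_sum(moving_sum(bits, n), n), n)


def sinc3_cic(bits, n):
    """Three integrators at full rate, decimate by n, then three differentiators."""
    a1 = a2 = a3 = 0
    d1 = d2 = d3 = 0
    out = []
    for i, b in enumerate(bits, start=1):
        a1 += b
        a2 += a1
        a3 += a2
        if i % n == 0:
            t1 = a3 - d1
            d1 = a3
            t2 = t1 - d2
            d2 = t1
            t3 = t2 - d3
            d3 = t2
            out.append(t3)
    return out


random.seed(1)
bits = [random.randint(0, 1) for _ in range(64)]
print(sinc3_boxcars(bits, 8)[7::8] == sinc3_cic(bits, 8))
print(sinc3_cic([1] * 32, 8))   # all-ones input settles at 8**3 = 512
```

The decimating version is just the per-clock version sampled every n clocks, so the moving-sum form is the one that can hand out a sample whenever it is asked.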

msrobots · 2018-11-28 07:38

Ken Gracey wrote: »

...Nobody has ever forced Chip successfully to do anything other than what he wants to do...
Ken Gracey

And that is the reason why he is doing that amazing work.

I can't even imagine how you are handling this, but overall it seems to work out.

When Chip is ready it will be a wonderful product.

Enjoy!

Mike

jmg · 2018-11-28 07:47

evanh wrote: »

Okay, so rounding might help.

How to round in less logic than just adding the pruned register bits back in?

EDIT: Made a big difference just by unpruning a single bit.

Yes, 7th sample now converges to ~30 LSB's instead of 277 above. 10th sample is now ~ 3.5 LSB's

However, if you are multiplexing, those are still quite long settling times.

SaucySoliton · 2018-11-28 07:52

cgracey wrote: »

I just did some experiments with SincN possibilities and it seems you must run the differentiator at 16 samples, or more, to get 8-bit resolution, regardless of the N in SincN. I was thinking going higher-order (SINC9) might have allowed for more frequent decimations (like, one every clock), but that's not the case. It looks like the Tukey-window approach is most practical, by far.

There's one more experiment I want to try: Running SINC3 at 16 samples, but with 16 interleaved calculations woven together, in order to get an 8-bit sample on every clock. This would be impossibly expensive in hardware, but would represent something ideal, if possible. I want to see what it looks like!

To output every sample you would need to increase the delay at the diffs. So instead of subtracting the previous sample, you'd need to subtract the value 8 samples ago.

It might be less resource intensive to calculate it directly as moving average filters.
stage 1
shift register 16 bits
output 5 bits

stage 2
shift register 5x16 bits
output 9 bits

stage 3
shift register 9x16 bits
output 13 bits
It should be more efficient to add the new sample and subtract the old one coming out of the FIFO.

Unfortunately the max output of the first stage is 16 which adds a bit everywhere. Or we could ignore it and let all ones output a zero. Or make it saturate. Or truncate the stage 2-3 connection. Or we could use different lengths for each stage. If the lengths are chosen carefully the performance is good. It might be harder to control the scaling of the output with a setup like this.
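
Here is a minimal Python sketch of the three moving-average stages described above, each built as a FIFO with a running total (add the new sample, subtract the one falling out). The class and function names are hypothetical. It also reports the peak value of each stage, which is where the extra bit comes from.

```python
from collections import deque


class MovingSum:
    """Running sum over the last `length` samples, kept by add-new / subtract-old."""

    def __init__(self, length):
        self.fifo = deque([0] * length, maxlen=length)
        self.total = 0

    def step(self, sample):
        oldest = self.fifo[0]
        self.fifo.append(sample)          # maxlen drops the oldest entry
        self.total += sample - oldest
        return self.total


def stage_peaks(length=16, n_clocks=64):
    """Feed an all-ones bitstream through three stages and record each stage's peak."""
    stages = [MovingSum(length) for _ in range(3)]
    peaks = [0, 0, 0]
    for _ in range(n_clocks):
        v = 1
        for k, stage in enumerate(stages):
            v = stage.step(v)
            peaks[k] = max(peaks[k], v)
    return peaks


peaks = stage_peaks()
for k, p in enumerate(peaks, start=1):
    print(f"stage {k}: peak {p}, needs {p.bit_length()} bits")
```

With length 16 the peaks are exact powers of two (16, 256, 4096), so each stage needs one bit more than it would if the peak were one count lower. That is why saturating, truncating, or letting all-ones wrap to zero are the options on the table.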

It might be good to have some adjustment to the filter bandwidth. We could have a mux to select the delay feeding the diffs.

evanh · 2018-11-28 07:52

jmg wrote: »

Yes, 7th sample now converges to ~30 LSB's instead of 277 above. 10th sample is now ~ 3.5 LSB's

However, if you are multiplexing, those are still quite long settling times.

There are differences throughout. You can call it a convergence if you like but it's not a settling thing. Settling is done in three samples for all Sinc3. Even the really badly pruned ones.

evanh · 2018-11-28 08:06

SaucySoliton wrote: »

To output every sample you would need to increase the delay at the diffs. So instead of subtracting the previous sample, you'd need to subtract the value 8 samples ago.

It might be less resource intensive to calculate it directly as moving average filters.

Erna's posted spreadsheet has three pyramids of tiny adders working on shifted bits (11-bit window), with the three tops summed for sample reading.

EDIT: I can't find when he posted it but here's the spreadsheet. The filter is in columns S to Y.

cgracey · 2018-11-28 08:14

Okay. I did a COST-IS-NO-OBJECT exploration into what is the best 8-bit per-clock sampling we could possibly get, maximizing the bandwidth, using any topology of SincN filters that seemed optimal.

Here is the VERY BEST performance I could obtain using SincN filtering:

- 10 separate Sinc8 filters (eighth order!!!)
- Each processing 10 ADC samples per decimation
- Filters each working at different ADC bitstream offsets (0..9)
- Filters' outputs are interleaved to form a continuous sample stream

This would be horrendously expensive in silicon.

I was totally surprised. Compare this exotic solution to the simple hand-tuned Tukey window I made last night using TonyB_'s 17*/24 pattern:

You can see we have a harder limit on our slew rate.

One thing I found is that the SincN window had better be at least 10 bits, in order to cover the data cycling in the ADC bitstream, which is about 7 bits, rise-to-rise, worst case.

jmg · 2018-11-28 08:46

evanh wrote: »

jmg wrote: »

Yes, 7th sample now converges to ~30 LSB's instead of 277 above. 10th sample is now ~ 3.5 LSB's

However, if you are multiplexing, those are still quite long settling times.

There are differences throughout. You can call it a convergence if you like but it's not a settling thing. Settling is done in three samples for all Sinc3. Even the really badly pruned ones.

I'm just taking a simple ratio of your different filter data outputs, and then comparing any difference, with an equivalent 1/2^16 LSB size.

Actually looking closer, it starts larger and reduces, but seems to not totally converge, but oscillates about... (ref : 1/2^16 is 15.3ppm)

1-5889219/5888960 = -43.980ppm
1-5803256/5801472 = -307.508ppm
1-5717881/5719296 = 247.408ppm
1-5632628/5632640 = 2.130ppm
1-5544874/5544256 = -111.466ppm
With the same stages, and same additions, I would not expect group delay type changes ?

cgracey · 2018-11-28 08:54

It sounds like the pruning is a little reckless?

evanh · 2018-11-28 08:59

Chip,
That last change made the three accumulators 30-bit, 28-bit and 24-bit.

PS: There's no guarantee that the reference is accurate. I should plot the "dclevel" line as well, it's the simulated ideal, ...

jmg · 2018-11-28 09:06

cgracey wrote: »

It sounds like the pruning is a little reckless?

Given there are differences, it looks that way. I'm just doing a simple ratio of Evanh's tables first line / last line.
Maybe the prune is still not quite done right, as I'm puzzled to see this type of rather subtle deviation, from what should be a coarse storage-vessel change.
In ADC terms, however, those differences are large. (I'm assuming the sinc3 is ideal here)

Looking at an AD7403 data, it specs 16b, but shows typical DNL of < 0.2 LSB and INL typicals of inside 0.4 LSB - those are 3~6 ppm regions.
So I'd think you would want your filters to 'not make that worse' ?

Addit: AD7403 specs say
AD7403 VDD1 = 4.5 V to 5.5 V, VDD2 = 3 V to 5.5 V, VIN+ = −250 mV to +250 mV, VIN− = 0 V, TA = −40°C to +125°C, fMCLKIN1 = 5 MHz to 20 MHz,
tested with sinc3 filter, 256 decimation rate, as defined by Verilog code, unless otherwise noted.
So I think they hit those DNL of < 0.2 LSB and INL < 0.4 LSB, using a Sinc3 filter.

jmg · 2018-11-28 09:10

evanh wrote: »

That last change made the three accumulators 30-bit, 28-bit and 24-bit.

The top line (sinc3) in your tables is then 30-bit, 30-bit, 30-bit ?
Given the first one is a counter that is there anyway, and the second one is an adder that is there anyway, the saving is looking slight - but has downsides.

cgracey · 2018-11-28 09:39

Evanh, do you think this can be done without any numerical compromise?

evanh · 2018-11-28 09:43

Maybe if rounding can be done cheaply. Tony mentioned doing rounding.

evanh · 2018-11-28 09:54

Here's the graph with DC-level added as a green reference line. The others haven't changed. Yellow's accumulators are 32-32-32.

evanh · 2018-11-28 10:06

Here's data with all three accumulators at 30-bit on sixth line "Sinc3-rolling_P-acc30-dif32".

EDIT: Looks an exact match to original 32-bit version.
EDIT2: And 24-bit accumulators is also an exact match! Because the sample period is 256 clocks.
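
One plausible reading of EDIT2, sketched in Python: if the accumulators and differentiators simply wrap around, a sinc3 over 256 clocks of a 1-bit stream only needs its result modulo 2^24, because the true output never exceeds 256^3 = 2^24. This is an assumption about the arithmetic, not the actual simulation source, and `sinc3` here is a hypothetical helper.

```python
import random


def sinc3(bits, period, width=None):
    """Sinc3 with optional wraparound at `width` bits (None = unbounded integers)."""
    mask = (1 << width) - 1 if width else -1     # x & -1 leaves x unchanged
    a1 = a2 = a3 = 0
    d1 = d2 = d3 = 0
    out = []
    for i, b in enumerate(bits, start=1):
        a1 = (a1 + b) & mask
        a2 = (a2 + a1) & mask
        a3 = (a3 + a2) & mask
        if i % period == 0:
            t1 = (a3 - d1) & mask
            d1 = a3
            t2 = (t1 - d2) & mask
            d2 = t1
            t3 = (t2 - d3) & mask
            d3 = t2
            out.append(t3)
    return out


random.seed(7)
stream = [1 if random.random() < 0.7 else 0 for _ in range(256 * 8)]
print(sinc3(stream, 256, width=24) == sinc3(stream, 256))

# The one exception is a solid all-ones input, where 2**24 wraps to 0.
print(sinc3([1] * 1024, 256)[-1], sinc3([1] * 1024, 256, width=24)[-1])
```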
EDIT3: Refreshed csv6 file after a bug fix. No change in data but just being thorough.

cgracey · 2018-11-28 12:02

I'm compiling the Tukey filter to see how big it is. This module handles its own configuration via 'set' and 'd', then picks a pin (0..63) and runs it into the tap chain. At each clock, all the taps are added up and an 8-bit sample is output:

// cog osc fil

module			cog_osc_fil
(
input			resn,

input			clk,
input			ena,

input			set,
input		 [7:0]	d,
input		[63:0]	pin_in,

output reg	 [7:0]	sample
);


reg  [7:0] cfg;		// configuration

`regscan (cfg, 8'b0, !ena || set, !ena ? 8'b0 : d[7:0])


reg [44:0] tap;		// Tukey window taps

`regscan (tap, 45'b0, cfg[7], {tap[43:0], pin_in[cfg[5:0]]})

wire [9:0] sum	=

	({6{tap[00]}} & 6'h01) +
	({6{tap[01]}} & 6'h03) +
	({6{tap[02]}} & 6'h05) +
	({6{tap[03]}} & 6'h07) +
	({6{tap[04]}} & 6'h0A) +
	({6{tap[05]}} & 6'h0D) +
	({6{tap[06]}} & 6'h10) +
	({6{tap[07]}} & 6'h13) +
	({6{tap[08]}} & 6'h16) +
	({6{tap[09]}} & 6'h19) +
	({6{tap[10]}} & 6'h1B) +
	({6{tap[11]}} & 6'h1D) +
	({6{tap[12]}} & 6'h1F) +
	({6{tap[13]}} & 6'h20) +
	({6{tap[14]}} & 6'h20) +
	({6{tap[15]}} & 6'h20) +
	({6{tap[16]}} & 6'h20) +
	({6{tap[17]}} & 6'h20) +
	({6{tap[18]}} & 6'h20) +
	({6{tap[19]}} & 6'h20) +
	({6{tap[20]}} & 6'h20) +
	({6{tap[21]}} & 6'h20) +
	({6{tap[22]}} & 6'h1F) +
	({6{tap[23]}} & 6'h20) +
	({6{tap[24]}} & 6'h20) +
	({6{tap[25]}} & 6'h20) +
	({6{tap[26]}} & 6'h20) +
	({6{tap[27]}} & 6'h20) +
	({6{tap[28]}} & 6'h20) +
	({6{tap[29]}} & 6'h20) +
	({6{tap[30]}} & 6'h20) +
	({6{tap[31]}} & 6'h20) +
	({6{tap[32]}} & 6'h1F) +
	({6{tap[33]}} & 6'h1D) +
	({6{tap[34]}} & 6'h1B) +
	({6{tap[35]}} & 6'h19) +
	({6{tap[36]}} & 6'h16) +
	({6{tap[37]}} & 6'h13) +
	({6{tap[38]}} & 6'h10) +
	({6{tap[39]}} & 6'h0D) +
	({6{tap[40]}} & 6'h0A) +
	({6{tap[41]}} & 6'h07) +
	({6{tap[42]}} & 6'h05) +
	({6{tap[43]}} & 6'h03) +
	({6{tap[44]}} & 6'h01) ;

`regscan (sample, 8'b0, cfg[7], sum[9:2])

endmodule

This is a brute-force approach, but it might be decomposed during compilation, so that some efficient implementation is made. If not, I'll break it down somehow.

So, it takes 80 ALM's which is less than the Sinc3's 100 ALM's. It killed timing, though. I'll need to break it down and pipeline it a little.
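
For checking the tap values outside of Verilog, here is a small Python model of the module above. The weights are copied from the listing; the function name `tukey_samples` is hypothetical, and the model ignores the configuration logic and the one-clock output register.

```python
# Tap weights copied from the Verilog listing, tap[0] first.
TAP_WEIGHTS = [
    0x01, 0x03, 0x05, 0x07, 0x0A, 0x0D, 0x10, 0x13, 0x16, 0x19, 0x1B, 0x1D, 0x1F,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x1F,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x1F, 0x1D, 0x1B, 0x19, 0x16, 0x13, 0x10, 0x0D, 0x0A, 0x07, 0x05, 0x03, 0x01,
]


def tukey_samples(bits):
    """Shift each ADC bit into the tap chain and output sum[9:2] per clock."""
    taps = [0] * len(TAP_WEIGHTS)          # taps[0] holds the newest bit
    out = []
    for b in bits:
        taps = [b] + taps[:-1]
        total = sum(w for w, t in zip(TAP_WEIGHTS, taps) if t)
        out.append(total >> 2)             # drop the two LSBs for an 8-bit sample
    return out


print(len(TAP_WEIGHTS), sum(TAP_WEIGHTS))
print(tukey_samples([1] * 45)[-1])
```

The weights sum to 1023 (the 0x1F at tap 22 keeps the total inside 10 bits), so an all-ones window gives 255 after the shift.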
