It is the logic, not the flops, that has gone up 20%. Job for you there, Tony: work out why the logic of an inc/dec counter can be smaller than an adder. Maybe this needs handcrafting to fix a deficiency in Quartus.
I wish it were possible to reduce acc1 and acc2 bit-length, but they seem to need to be able to climb to acc3-length values.
The diff's must wash it out. And guess what that means: The diff's need to be the same size, and that means for 30-bit numbers you need 30-bit 2's-complement maths ... which needs more cog instructions to execute!
EDIT: Ah, no problem, they can most-significant align when read with RDPIN. That fixes it.
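To illustrate the most-significant alignment point, here is a minimal Python sketch (not P2 code). It assumes, hypothetically, that RDPIN would present the 30-bit acc3 shifted left by two bits in a 32-bit long; the helper names are made up for illustration. Under that assumption an ordinary 32-bit wrap-around subtract gives the same answer as true 30-bit two's-complement maths, so no extra masking instructions would be needed.

```python
M30 = (1 << 30) - 1   # width of acc3 in the smart pin, per the thread
M32 = (1 << 32) - 1   # width of a cog register

def sub30(a, b):
    """Reference: wrap-around subtraction done in 30 bits."""
    return (a - b) & M30

def sub32_aligned(a, b):
    """Same subtraction done in 32 bits on MSB-aligned operands.
    The << 2 alignment is an assumption about how RDPIN might present acc3."""
    a32 = (a << 2) & M32
    b32 = (b << 2) & M32
    return ((a32 - b32) & M32) >> 2   # shifted back down only for comparison

# acc3 has wrapped past zero between two reads: old read near the top, new read small
old, new = M30 - 5, 10
print(sub30(new, old), sub32_aligned(new, old))
```

Both calls agree even though the accumulator wrapped between the two reads, because 4a - 4b modulo 2^32 is just four times (a - b) modulo 2^30.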
This reminds me of something... shorter measurements have less noise. With this SINC3 filter, we are getting in 256 clocks what used to take 64K clocks, so we might be getting better-quality measurements, already.
I'm a little confused. I thought the sigma-delta hardware sampled at the system clock rate, so you needed 64K clocks to get a 16-bit sample. But the way I read your statement is that using the sinc3 windowing mode will give us a 16-bit sample in only 256 clocks? How is that possible? Or, more likely, what am I misunderstanding?
Searth,
Sinc3 over 256 clocks is effective to 16 bits at least. It needs 24 bits to contain that though. This is why Sinc gets used. It's simple to implement while providing quite high quality filtering. Part of that quality is the natural multiplying of sample size.
EDIT: Sinc1 is how the existing smartpin mode %01111 (Y=0) filters the bitstream. It works as you've described. Chip is looking at including a Sinc3 mode as well, but it needs a decent amount of extra logic.
I was thinking the same as Tony at first but it can be done and with little fuss too. diff would be the D operand and z could be forced into the following instruction's S port, like what ALTS does now. It would save two instructions in execution. And funnily, reduces to a subtraction and move. The addition vanishes.
rdpin z, #adcpin 'fetch z from acc3 of Sinc3 filter
diff diff1, z 'reads in diff1 and z, writes z to diff1 and calc new z
diff diff2, 0-0 'reads in diff2, writes z to diff2 and calc new z
diff diff3, 0-0 'reads in diff3, writes z to diff3 and calc new z
mov z, 0-0 'save z
I can accept why each of the differentiators must be full-width, but it doesn't make sense to me for each of the integrators to be the same size. Sinc3 is Sinc2 + another stage and Sinc2 is Sinc1 + another stage and acc1 for Sinc1 need only be 10-bit if R=1024.
Whatever the size, acc1 can be an incrementer as only 0 or +1 are added to it. I think acc1 could be 10-bit up counter, acc2 a 20-bit adder+register and acc3 a 30-bit adder+register. Have you tried sign-extending 10-bit acc1 to 20-bit to add to acc2, then sign-extending acc2 to 30-bit to add to acc3?
I've edited this to make it clear that I am talking about reducing the accumulators as a whole, adders and registers.
Sinc3 over 256 clocks is effective to 16 bits at least. It needs 24 bits to contain that though. This is why Sinc gets used. It's simple to implement while providing quite high quality filtering. Part of that quality is the natural multiplying of sample size.
EDIT: Sinc1 is how the existing smartpin mode %01111 (Y=0) filters the bitstream. It works as you've described. Chip is looking at including a Sinc3 mode as well, but it needs a decent amount of extra logic.
But 256 clock cycles would still mean that you're only getting an 8-bit value, wouldn't it? I mean, the sigma-delta ADC is still only outputting a stream of zeros and ones, so all you can accumulate in 256 clocks is an 8-bit value.
That's true for one accumulator/integrator (this is Sinc1) but Sinc3 has three cascaded integrators that are updated every sysclock: acc1:=acc1+ADC bit, acc2:=acc2+acc1 and acc3:=acc3+acc2. acc3 is read at a lower sampling rate and differentiated three times with simple sub+add operations as discussed above and the result is a much higher bit precision.
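The cascade described above can be sketched as a small Python simulation (not P2 code). It rests on some assumptions: 30-bit wrapping accumulators as in Chip's description, each integrator updated from the previous clock's values, and the sub/add differentiator pair discussed in this thread run once per stage. The function name `sinc3` and its parameters are made up for illustration.

```python
MASK = (1 << 30) - 1   # 30-bit accumulators, per the smart pin mode described in the thread

def sinc3(bits, R):
    """Feed a 0/1 bitstream through three integrators, read acc3 every R clocks,
    and differentiate that reading three times."""
    acc1 = acc2 = acc3 = 0
    diffs = [0, 0, 0]
    out = []
    for n, b in enumerate(bits, 1):
        # all three registers update on the same clock, each from the old value upstream
        acc3 = (acc3 + acc2) & MASK
        acc2 = (acc2 + acc1) & MASK
        acc1 = (acc1 + b) & MASK
        if n % R == 0:
            z = acc3
            for i in range(3):
                z = (z - diffs[i]) & MASK          # sub z, diff
                diffs[i] = (diffs[i] + z) & MASK   # add diff, z  (diff now holds the old z)
            out.append(z)
    return out

print(sinc3([1] * (256 * 40), 256)[-1])     # all-ones stream
print(sinc3([1, 0] * (128 * 6), 256)[-1])   # half-density stream
```

Once three samples have gone through, an all-ones stream settles at 256**3 = 2^24 and an alternating stream at 2^23, which is where the 24-bit figure for R=256 comes from. The first run is long enough for acc3 to wrap past 2^30 several times, and the differentiators still recover the right value.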
I'm pretty sure I know why the ADC sees more GIO noise than VIO noise.
It's because of the die substrate's many digital-ground tap connections. The substrate is full of digital ground noise and even though our analog ground is brought in via separate bond wires and tap-connected to private deep N-wells, those deep N-wells couple a lot of digital ground noise from the substrate.
Yanomani has pm'd me lots of stuff about on-chip noise-isolation, but I didn't quite get it.
The reason there's even as much noise as there is on the VIO reading is because GIO (local analog ground) is powering the inverters which make up the integrator sense amp, and the ground noise is causing the integrator cap to be read with uncertainty. For the GIO reading, you have the same inverter noise, plus the noise of the ADC input connected to GIO, instead of VIO, so there's even more.
So, the ADC's biggest noise source is from its local analog ground deep N-wells coupling digital-ground noise via the substrate. That's what's limiting ADC resolution right now.
All that would be true, if we were talking about seeing crosstalk issues. (which may still be there, of course, just not tested for yet )
However, even a P2 doing nothing but one ADC run is still noisy.
Another simplistic analysis of the noise levels is to imagine 3v3 as a noise generator (one DAC plot hinted at Vio noise levels).
When you measure GND, the balancing side actually spends most time at Vio, and when measuring Vio, the balancing side spends most time at GND.
That's why I've been keen to see results from low noise regulators.
Last night, I added a smart pin mode to do SINC3 integration with three 30-bit accumulators, an 11-bit reporting counter, and an externally-clocked mode. I packed the four USB modes down to two modes ('host' and 'device') by remaking the slow/full-speed switch from X[15] via WXPIN (that NCO bit was always written to '0', anyway).
So, while this SINC3 mode didn't create any new flops, it did grow the smart pin logic by 20%, which is not trivial. This change might have singularly grown the overall P2 logic by 6%.
...
I think our starting utilization was 65%. This will drive it up to about 69%, which is a little on the high side.
External clock is good to see, as it allows all those external isolated ADC's (10~20MHz SDM) to be used.
You could still add this to every second pin, and use the adjacent-pin MUX to reach any pin's ADC cell, for layout purposes. (so all pins can have access to Filter)
If you use the same adjacent pin rule as digital, the filter can be even more shared, and reduce that logic hit more.
If OnSemi hit a routing/speed wall, this may be a useful plan B to have ready ?
The balancing node stays at VIO/2, within 4mV, at 250MHz. It's the other side of the ~500k input resistor that gets pinned to either GIO or VIO for calibration.
All that digital GND noise from the core is certainly shaking our GIO up.
Yes, that must be plan B. Plan A is to try to reduce the size of acc1 and acc2.
Sinc3 over 256 clocks is effective to 16 bits at least. It needs 24 bits to contain that though. This is why Sinc gets used. It's simple to implement while providing quite high quality filtering. Part of that quality is the natural multiplying of sample size.
EDIT: Sinc1 is how the existing smartpin mode %01111 (Y=0) filters the bitstream. It works as you've described. Chip is looking at including a Sinc3 mode as well, but it needs a decent amount of extra logic.
But 256 clock cycles would still mean that you're only getting an 8-bit value, wouldn't it? I mean, the sigma-delta ADC is still only outputting a stream of zeros and ones, so all you can accumulate in 256 clocks is an 8-bit value.
Searth, I know what you're thinking. It is like magic. Each sample does incorporate two prior samples via the cascaded integrators.
I was thinking the same as Tony at first but it can be done and with little fuss too. diff would be the D operand and z could be forced into the following instruction's S port, like what ALTS does now. It would save two instructions in execution. And funnily, reduces to a subtraction and move. The addition vanishes.
rdpin z, #adcpin 'fetch z from acc3 of Sinc3 filter
diff diff1, z 'reads in diff1 and z, writes z to diff1 and calc new z
diff diff2, 0-0 'reads in diff2, writes z to diff2 and calc new z
diff diff3, 0-0 'reads in diff3, writes z to diff3 and calc new z
mov z, 0-0 'save z
Though, it does require one of those, now extinct, spare dual operand opcode slots.
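A minimal Python model of the idea, for anyone wanting to check the arithmetic. DIFF is only a proposal in this thread, so the behaviour modelled here is an assumption taken from the listing's comments (new z = z - diff, and diff receives the old z); the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF   # cog registers are 32 bits wide

def sub_add(z, d):
    """The existing two-instruction sequence."""
    z = (z - d) & MASK32   # sub z, diff
    d = (d + z) & MASK32   # add diff, z
    return z, d

def diff(z, d):
    """Assumed behaviour of the proposed DIFF: one subtraction plus a move."""
    return (z - d) & MASK32, z

for z, d in [(100, 30), (0, 1), (5, 0xFFFFFFFF)]:
    print(sub_add(z, d) == diff(z, d))
```

The add in the second step always lands back on the old z, because d + (z - d) is z modulo 2^32. That is why the addition vanishes and only a subtraction and a move remain.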
diff would require a third internal register in addition to D and S.
This DIFF instruction is a great idea!
The pipeline uses Q for these purposes, already, such as for XORO32. All we need is a 32-bit subtractor. And those two empty '#D,#S' slots haven't been touched, yet, so we've got the instruction space.
Full adders aren't cheap I guess. Hmm, there should have already been two 32-bit adders in each smartpin before the Sinc3 was included. Surely the optimiser can make use of those.
I wonder if the 20% is all from a single adder.
The pre-existing adders are inc/dec-type, not full-type. The growth makes perfect sense.
The added routing for this is all local, within each smart pin.
I can accept why each of the differentiators must be full-width, but it doesn't make sense to me for each of the integrators to be the same size. Sinc3 is Sinc2 + another stage and Sinc2 is Sinc1 + another stage and acc1 for Sinc1 need only be 10-bit if R=1024.
Whatever the size, acc1 can be an incrementer as only 0 or +1 are added to it. I think acc1 could be 10-bit up counter, acc2 a 20-bit adder+register and acc3 a 30-bit adder+register. Have you tried sign-extending 10-bit acc1 to 20-bit to add to acc2, then sign-extending acc2 to 30-bit to add to acc3?
EDIT:
Edited for clarity.
This sign-extension is a great idea. I will try it out. Makes sense.
cgracey : I wish it were possible to reduce acc1 and acc2 bit-length, but they seem to need to be able to climb to acc3-length values.
I think that makes sense - because even tho early adders do not overflow as fast, a change to shortened adder becomes a sawtooth generator, as it wraps back to zero at some sub-sample timebase.
It is that discontinuity you are needing to avoid.
Someone mentioned that acc1 could be a gated counter, so that could give savings ?
I can accept why each of the differentiators must be full-width, but it doesn't make sense to me for each of the integrators to be the same size. Sinc3 is Sinc2 + another stage and Sinc2 is Sinc1 + another stage and acc1 for Sinc1 need only be 10-bit if R=1024.
Whatever the size, acc1 can be an incrementer as only 0 or +1 are added to it. I think acc1 could be 10-bit up counter, acc2 a 20-bit adder+register and acc3 a 30-bit adder+register. Have you tried sign-extending 10-bit acc1 to 20-bit to add to acc2, then sign-extending acc2 to 30-bit to add to acc3?
I think acc1 can be a gated counter, but I'm less sure it can be 10 bits, as that becomes a sawtooth generator as it goes 1023-000, injecting a big step into what gets added next.
32b counters are already in smart pins, so not much gained in moving to 10 bits anyway.
Assuming two's complement arithmetic, we can ... calculate the number of bits required for the last comb due to bit growth. If Bin is the number of input bits, then the number of output bits, Bout, is
Bout = ⌈N·log2(RM) + Bin⌉
It also turns out that Bout bits are needed for each integrator and comb stage. The input needs to be sign extended to Bout bits, but LSB's can either be truncated or rounded at later stages.
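As a quick worked check of that formula, here is a short Python sketch. It assumes a differential delay M of 1 and a 1-bit input, neither of which is stated in the thread, and the function name is made up for illustration.

```python
import math

def cic_output_bits(n_stages, rate, diff_delay=1, input_bits=1):
    """Bout = ceil(N * log2(R * M) + Bin), from the quoted bit-growth section."""
    return math.ceil(n_stages * math.log2(rate * diff_delay) + input_bits)

for rate in (256, 1024):
    growth = 3 * math.log2(rate)   # the N*log2(RM) term on its own, with M = 1
    print(rate, growth, cic_output_bits(3, rate))
```

For three stages this gives 25 bits at R=256 and 31 bits at R=1024 with a 1-bit input. The 24-bit and 30-bit figures used in this thread match the growth term alone, so they appear not to count the input bit.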
I think acc1 can be a gated counter, but I'm less sure it can be 10 bits, as that becomes a sawtooth generator as it goes 1023-000, injecting a big step into what gets added next.
The arithmetic is two's complement, so 1023 to 0 is actually -1 to 0 and the big step is +511 to -512, which occurs within 10 bits.
It also turns out that Bout bits are needed for each integrator and comb stage. The input needs to be sign extended to Bout bits, but LSB's can either be truncated or rounded at later stages.
Should be easy enough to test, and I guess even tho 32b counters exist already, lowering the number of bits fed into the adders helps reduce routing resource, and any reduction in adder 2 size helps...
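One way to try this without touching the Verilog is a small Python simulation with a separate width for each accumulator. This is a sketch under assumptions: two's-complement wrapping at each register, narrower values sign-extended before being added into the next stage, and differentiators at acc3 width. The names `wrap_signed` and `sinc3_widths` are hypothetical.

```python
def wrap_signed(v, bits):
    """Wrap v into a two's-complement value of the given width."""
    m = 1 << bits
    v &= m - 1
    return v - m if v & (m >> 1) else v

def sinc3_widths(bits_in, R, w1, w2, w3):
    """Sinc3 with accumulator widths w1, w2, w3; output is read from acc3 every R clocks."""
    acc1 = acc2 = acc3 = 0
    d = [0, 0, 0]
    out = []
    for n, b in enumerate(bits_in, 1):
        acc3 = wrap_signed(acc3 + acc2, w3)   # acc2 is already sign-extended
        acc2 = wrap_signed(acc2 + acc1, w2)   # acc1 is already sign-extended
        acc1 = wrap_signed(acc1 + b, w1)
        if n % R == 0:
            z = acc3
            for i in range(3):
                z, d[i] = wrap_signed(z - d[i], w3), z   # new z = z - diff, diff = old z
            out.append(z & ((1 << w3) - 1))
    return out

R = 1024
stream = [1, 0] * (R * 4)   # half-density bitstream, 8 output samples
full = sinc3_widths(stream, R, 30, 30, 30)
narrow = sinc3_widths(stream, R, 10, 20, 30)
print("full  :", full[-1])
print("narrow:", narrow[-1])
print("match :", full == narrow)
```

In a tiny hand-checkable case (R=2, 12-bit registers, all-ones input) the full-width version settles to 8 on the third sample, while shrinking acc1 to 2 bits gives a different value. That is consistent with the bit-growth quote above, but it is only a simulation of one interpretation of the proposal.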
Chip,
I guess the ADC silicon fix is not just a respin? It's in the outer ring layout?
Suggestions:
Save the 6% and do the fix in software. Wait for a hopeful later silicon respin, maybe even at 120 or 90 nm for a silicon fix.
Spend the time testing and understanding where the power is being wasted. This is the biggie and I am certain it's solvable if you put your mind to it! Then maybe some more testing and minor tweaks.
There is a noise floor in the ADC that would take some layout work to improve, but we can drastically improve sampling times by implementing this Sinc3 filter. Imagine getting a 16-bit conversion every 256 clocks, instead of an 8-bit conversion. It's true that 3 or 4 LSBs may be uncertain due to the ADC's noise floor.
Chip,
Is this a change to just verilog stuff? Or does this require changes to the custom laid out pad stuff? I think that is Cluso's concern (and mine).
Comments
JMG, any insights?
I'm off to bed.
Scratch all that. I'd forgotten what Chip said.
The diff's must wash it out. And guess what that means: The diff's need to be the same size, and that means for 30-bit numbers you need 30-bit 2's-complement maths ... which needs more cog instructions to execute!
EDIT: Ah, no problem, they can most-significant align when read with RDPIN. That fixes it.
"sub z,diff1
add diff1,z"
into a single:
"subandadd z, diff1"
?
Jonathan
As z and diff1 both change this operation needs two instructions.
diff would require a third internal register in addition to D and S.
http://dspguru.com/files/cic.pdf says this in section 4 Bit Growth with key phrase emboldened:
The arithmetic is two's complement, so 1023 to 0 is actually -1 to 0 and the big step is +511 to -512, which occurs within 10 bits.
Should be easy enough to test, and I guess even tho 32b counters exist already, lowering the number of bits fed into the adders helps reduce routing resource, and any reduction in adder 2 size helps...
I guess the ADC silicon fix is not just a respin? It's in the outer ring layout?
Suggestions:
Save the 6% and do the fix in software. Wait for a hopeful later silicon respin, maybe even at 120 or 90 nm for a silicon fix.
Spend the time testing and understanding where the power is being wasted. This is the biggie and I am certain it's solvable if you put your mind to it! Then maybe some more testing and minor tweaks.
There is a noise floor in the ADC that would take some layout work to improve, but we can drastically improve sampling times by implementing this Sinc3 filter. Imagine getting a 16-bit conversion every 256 clocks, instead of an 8-bit conversion. It's true that 3 or 4 LSBs may be uncertain due to the ADC's noise floor.
I'll get to reducing power soon.
Is this a change to just verilog stuff? Or does this require changes to the custom laid out pad stuff? I think that is Cluso's concern (and mine).
It is just Verilog.
What is the number of clocks needed for 8-, 10-, and 12-bit equivalent samples?