Erna gave an alternative with a window of just 11 clocks, but it was expensive as hell.
Actually, in hindsight, I might have been too quick to dismiss that. There was a large number of adders in a pyramid to represent multipliers, but they were also very small at the bottom. And, not being circular, they can probably be optimised well.
I've made the NCO module adder generate a discrete Z[31:0]+Y[31:0] sum that the compiler should recognize as also existing in the SINC3 module (as Z[29:0] + Y[29:0]). Before, the NCO mux'd in what was going to be added to Z. This should help reduce logic if SINC3 is implemented. Compiling now...
Here's an idea for how to quickly filter a sigma-delta bitstream using a look-up-table. Output sample rate is decimated by 8 from input bitrate. Supports filters up to 32 taps.
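uint8_t  in;          // next 8 bits of the bitstream
uint32_t accum = 0;   // four packed 8-bit partial sums
uint8_t  out;         // filtered, decimated output sample
uint32_t LUT[256];    // packed filter table, filled in beforehand
for (each input byte) {        // pseudocode loop: one pass per 8 input bits
    accum += LUT[in];          // get 8 new bits, feed to LUT
    out = accum & 0xff;        // output filtered 8 bit sample
    accum >>= 8;               // shift right 8 bits
}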
A 256 entry look-up-table can perform arbitrary operations on 8 bits.
We can use it for FIR filtering.
Since we don't need all 32 bits of the output, we can pack more into each long.
Let's treat it as 4 separate tables, performing 4 different functions at the same time.
[31:24][23:16][15:8][7:0]
Each byte of a table entry holds the partial result for one part of the filter (8 of the 32 taps).
If we are mindful of carry bits, we can operate on 4 bytes at the same time. This is known as soft-SIMD.
This picture is for illustration and is not an optimal filter.
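To make the packing concrete, here is a minimal Python sketch of the same loop. It is only an illustration under stated assumptions, none of which come from the post: the taps are non-negative integers summing to at most 255 (so no lane can carry into its neighbour), taps[0] weights the newest bit, and bit 0 of each input byte is the newest bit. The names build_lut, lut_filter and direct_filter are made up for the example.

```python
def build_lut(taps):
    """Pack a 32-tap filter into 256 32-bit entries (four 8-bit lanes each).

    Assumptions (not from the original post): taps are non-negative ints
    with sum <= 255, taps[0] weights the newest bit, and bit 0 of an input
    byte is the newest bit in that byte.
    """
    if len(taps) != 32 or min(taps) < 0 or sum(taps) > 255:
        raise ValueError("need 32 non-negative taps with sum <= 255")
    lut = []
    for byte in range(256):
        entry = 0
        for lane in range(4):
            # Lane `lane` reaches the output when this byte is `lane` bytes old.
            segment = taps[8 * lane:8 * lane + 8]
            partial = sum(w for bit, w in enumerate(segment) if (byte >> bit) & 1)
            entry |= partial << (8 * lane)
        lut.append(entry)
    return lut


def lut_filter(data, lut):
    """Accumulate / output / shift loop, one pass per input byte."""
    accum = 0
    out = []
    for byte in data:
        accum = (accum + lut[byte]) & 0xFFFFFFFF  # get 8 new bits, feed to LUT
        out.append(accum & 0xFF)                  # output filtered 8-bit sample
        accum >>= 8                               # shift right 8 bits
    return out


def direct_filter(data, taps):
    """Reference: the same filter computed tap by tap, without packing."""
    out = []
    for n in range(len(data)):
        total = 0
        for lane in range(4):
            if n - lane >= 0:
                byte = data[n - lane]
                total += sum(taps[8 * lane + bit]
                             for bit in range(8) if (byte >> bit) & 1)
        out.append(total)
    return out


flat = [7] * 32                                   # arbitrary example taps, sum = 224
print(lut_filter([0xFF] * 5, build_lut(flat)))    # [56, 112, 168, 224, 224]
```

The output ramps up over the first four bytes because each input byte contributes through one lane per step. Keeping the tap sum at or below 255 is one simple way of being "mindful of carry bits"; other schemes (guard bits, signed taps with offsets) are possible but not shown here.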
Saucy, you are just talking about adding up weighted bits in a sample window, right? Is that really going to give us good performance, without any fancy filtering? I can see the concept is extremely simple.
If acc3 in the smart pin is cleared after it is read then the diff1 differentiator can be omitted, saving two instructions and four cycles. The acc3 feedback from register to adder could be zeroed during the cycle after the read, so that acc3 is loaded with acc2+0.
I'm using TonyB_'s Tukey 17*/32 to good effect on a windowed 8-bit-sample-per-clock ADC mode in the cog.
Here is the table. I needed to drop the total table sum by 1, and the only symmetrical point to do it at was the middle value:
tukey   long    01,03,05,07,10,13,16,19,22,25,27,29,31  '13 up, up/down sum = 208 * 2
        long    32[9],31,32[9]                          '19 top, top sum = 607
        long    31,29,27,25,22,19,16,13,10,07,05,03,01  '13 down, total sum = 1023 (>>2 = 255)
This is definitely an application where the Tukey shines. So, all that work is not going to waste.
In order to get sufficient SNR for 8-bit usage, the window needed to be as wide as it is. The window's low-pass effect begins to kick in at around 1MHz at 180MHz Fsys.
This scope mode will go into the cog, where it will run when enabled and the streamer will be able to write samples to memory at anything up to full speed. The samples are always available! It will be 4 lanes wide, like a 4-channel scope. It's just a 45-bit shifter with staged adders to compute the weighted bits' sum on each clock. Bits 9..2 of the sum make the result.
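For anyone who wants to play with the numbers, here is a rough Python model of one lane of that scope mode. It is a sketch, not the Verilog: the 45 weights are copied from the table above, but the bit ordering in the shifter and the function name scope_samples are my own assumptions.

```python
# The 45 window weights from the table above: 13 up, 19 top, 13 down.
TUKEY = ([1, 3, 5, 7, 10, 13, 16, 19, 22, 25, 27, 29, 31]
         + [32] * 9 + [31] + [32] * 9
         + [31, 29, 27, 25, 22, 19, 16, 13, 10, 7, 5, 3, 1])


def scope_samples(bits):
    """Model one lane: shift in one ADC bit per clock, emit one 8-bit sample per clock.

    Assumption: the newest bit enters at index 0 of the window. Since the
    table is symmetrical, the direction does not change the result.
    """
    window = [0] * len(TUKEY)          # the 45-bit shifter
    out = []
    for bit in bits:
        window = [bit] + window[:-1]   # shift the new bit in
        total = sum(w for w, b in zip(TUKEY, window) if b)  # weighted bits' sum, 0..1023
        out.append((total >> 2) & 0xFF)                     # bits 9..2 make the result
    return out


print(len(TUKEY), sum(TUKEY))          # 45 1023
print(scope_samples([1] * 45)[-1])     # 255
```

With all 45 bits high the sum is 1023, so bits 9..2 give 255, which is why the table total had to be dropped by 1 from 1024.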
Here is a picture of a 1.2MHz sawtooth recording that is getting played back at full-speed (250MSPS):
Here is a 1MHz square wave:
And here is a 1MHz sine wave:
The slew is not as fast as it was in an earlier version that used an 8-sample SINC3 filter, but the signal quality is better. I would rather do a SINC3, but it would take an inordinate amount of resources to implement the staggered stages. This is pretty cheap, but lower on bandwidth.
Fun wrecker here. We'd really like Chip to return to the Spin interpreter development. Very soon people are going to have boards in their hands and want to get started.
Chip has to make the decision, but we can also encourage the transition.
Ken,
Remember that we have fastspin, which targets the P2, and the p2gcc thing, which takes existing propgcc output and retargets it to the P2, plus the built-in Forth(-like?) in ROM. So people can use those if Chip's Spin2 isn't ready yet.
Also, people have already been using pnut to do PASM2 stuff for testing.
I do want Chip to get back onto Spin2 also, so I can get OpenSpin2 done in time. Porting is going to take a bit since he's changed a lot.
Looks good and I'm pleased the Tukey work has not been wasted, but I don't know where we are now and what's in the smart pins and what's not. Is hardware Sinc3 dropped? The sign-extending test was doomed to fail without sign-extending everything.
The SINC3 is in the smart pin. Before I move on from this ADC stuff, I want to get the scope mode working, too. We are on the same page, don't worry.
I'm doing the sign-extension test where I sign-extend everything, so that acc3 is full-size, acc2 is 1 bit less, and acc1 is two bits less. It's not working, unfortunately. Everything seems to need to be full-sized, which is too bad. Any other ideas about reducing these acc sizes?
If acc3 in the smart pin is cleared after it is read then the diff1 differentiator can be omitted, saving two instructions and four cycles. The acc3 feedback from register to adder could be zeroed during the cycle after the read, so that acc3 is loaded with acc2+0.
We can do that, no problem. What would the code look like then, with and without the possible DIFF instruction?
Thanks for trying sign-extending again - it just doesn't work.
We could reduce the acc sizes by changing the decimation rate R from 1024 to 256. Is there any point or need for 20-bit resolution if only 16-bit values are written?
Well, in externally-clocked mode, there could be need for 20-bit resolution.
Is there much need for the DIFF instruction, anymore? Does it save only one instruction now if we are clearing ACC3 in the smart pin at each measurement start? Would it have much use outside of this application?
But if it's the difference between fitting (comfortably) or not? R=256 reduces the acc2 and acc3 adders from 30-bit to 24-bit, assuming acc1 uses a counter. Sinc3 could be done in software for > 16-bit.
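As a quick sanity check on those adder widths, here is a small Python sketch. It assumes a 1-bit input and uses the usual rule of thumb that an order-N sinc filter with decimation R has a peak output of R^N, so it needs about N*log2(R) bits. The function name sinc_acc_bits is hypothetical.

```python
import math


def sinc_acc_bits(decimation, order=3):
    """Approximate accumulator width for a 1-bit input: order * log2(decimation).

    Assumption: the single all-ones case that reaches exactly R**order is
    allowed to wrap, as the 30-bit and 24-bit figures quoted above imply.
    """
    return math.ceil(order * math.log2(decimation))


print(sinc_acc_bits(1024))  # 30
print(sinc_acc_bits(256))   # 24
```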
It's not my idea. I've been doing some reading and this is called Integrate and Dump. It could be used elsewhere in the smart pins, probably. In the differentiator, diff1 stores the previous value of acc3, but if acc3 is reset after being read then diff1 is redundant.
DIFF as a separate instruction would save only one instruction and it's not worth the effort. Also, aren't the previously free slots used by SETDAC and another pin instruction?
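Integrate and Dump Differentiator without DIFF
        rdpin   z,#adcpin       'read acc3 (assumed cleared by the smart pin after the read)
        sub     z,diff2         'z = this reading - previous reading
        add     diff2,z         'diff2 = this reading, kept for next time
        sub     z,diff3         'z = this difference - previous difference
        add     diff3,z         'diff3 = this difference; z is the filtered sample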
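To check that dropping diff1 really gives the same answer, here is a small Python model of the three integrators and the software differentiators. It is a sketch under assumptions: plain unbounded integers instead of fixed-width wrapping registers, the integrators update in the order acc1, acc2, acc3 on each clock, and the function name sinc3 and the dump flag are made up for the comparison.

```python
def sinc3(bits, r, dump):
    """Third-order sinc with decimation r over a sequence of 0/1 bits.

    dump=False: classic form, three software differentiators (diff1..diff3).
    dump=True:  integrate and dump, acc3 is cleared after each read, so
                only diff2 and diff3 are needed.
    """
    acc1 = acc2 = acc3 = 0
    diff1 = diff2 = diff3 = 0
    out = []
    for i, bit in enumerate(bits, start=1):
        acc1 += bit
        acc2 += acc1
        acc3 += acc2
        if i % r == 0:
            z = acc3            # rdpin z,#adcpin
            if dump:
                acc3 = 0        # next clock loads acc3 with acc2 + 0
            else:
                z -= diff1      # sub z,diff1
                diff1 += z      # add diff1,z
            z -= diff2          # sub z,diff2
            diff2 += z          # add diff2,z
            z -= diff3          # sub z,diff3
            diff3 += z          # add diff3,z
            out.append(z)
    return out


pattern = [1, 0, 1, 1, 0, 0, 1, 0] * 10
print(sinc3(pattern, 8, dump=True) == sinc3(pattern, 8, dump=False))  # True
print(sinc3([1] * 40, 8, dump=True))  # [120, 456, 512, 512, 512]
```

With an all-ones input the output settles at r**3 (512 for r = 8) in both forms, and the first two outputs are start-up transients. Real hardware registers wrap, but since the two forms are equal as integers they are also equal modulo any register width.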
With this Streamer Capture + Software filter, you will be able to trade off speed against ENOB, and also against processing time, right?
i.e. someone could go faster, but at reduced bits?
IIRC the analog limit was a rise time of ~50ns, and what you show is not far from that, maybe 2x? How many bits would the filter be for a ~50ns rise?
Fun wrecker here. We'd really like Chip to return to the Spin interpreter development. Very soon people are going to have boards in their hands and want to get started.
Chip has to make the decision, but we can also encourage the transition.
Ken Gracey
Is that code for a Verilog freeze? That must be very close now?
Spin is not mandatory to test P2 (just nice to have, and there are other P2 pathways now), so I would move Spin2 priority somewhat, to after the rev B sign-off release.
Before Rev B, Chip is surely better focused on the reported hardware issues, testing Verilog, and getting that rev B as good as it can be?
Even better docs around smart pins would greatly help testing coverage there.
Comments
Actually, in hindsight, I might have been too quick to dismiss that. There was a large number of adders in a pyramid to represent multipliers, but they were also very small at the bottom. And, not being circular, they can probably be optimised well.
Probably. I will look. Will be gone for two hours...
The integrator values are specifically signed in the sign-extension test. Here's how the Verilog might look:
Probably not great Verilog but it shows what I mean.
-Phil
This is bandwidth-limited because of two things: The window's low-pass filter effect and the ADC's analog front end's slowness.
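The window's share of that can be estimated from the table alone. Here is a hedged Python sketch that evaluates the magnitude response of the 45-tap window, normalised to 1.0 at DC. It assumes one tap per Fsys clock at 180MHz and says nothing about the analog front end. The function name window_gain is hypothetical.

```python
import cmath
import math

# The 45 window weights from the Tukey table: 13 up, 19 top, 13 down.
TUKEY = ([1, 3, 5, 7, 10, 13, 16, 19, 22, 25, 27, 29, 31]
         + [32] * 9 + [31] + [32] * 9
         + [31, 29, 27, 25, 22, 19, 16, 13, 10, 7, 5, 3, 1])


def window_gain(freq_hz, fsys_hz=180e6):
    """Magnitude response of the window at freq_hz, relative to DC.

    Assumption: one window tap per system clock (fsys_hz).
    """
    w = 2 * math.pi * freq_hz / fsys_hz          # radians per sample
    total = sum(t * cmath.exp(-1j * w * n) for n, t in enumerate(TUKEY))
    return abs(total) / sum(TUKEY)


for f in (0.0, 1e6, 2e6, 5e6):
    print(f"{f / 1e6:4.1f} MHz  gain {window_gain(f):.3f}")
```

Running it at a few frequencies shows where the roll-off starts to matter for a given Fsys; changing fsys_hz rescales the whole curve.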
Integrate and Dump Differentiator without DIFF
Integrate and Dump Differentiator with DIFF
Looks great, and sounds a good compromise.