Chip, please don't take my lead as an endorsement of this entire thread. But if it works and simplifies what you were going to go ahead and do anyway, I guess I'd be in favor of it.
-Phil
I know, Phil.
I've been studying that code and I kind of get it, but what I don't understand is how would you go from one bit in to something like 8 bits out? You must allow for some bit-length expansion along the way, right? Or, input $FF for 1 and $00 for 0?
Anything to reduce logic in these 4-per-cog ADC channels would be welcome.
TonyB_, thanks for trying out that idea of doubling acc1 before doing acc2 and acc3. It looks to me like it's maybe kind of a sticky compromise. What's your take?
It appears the non-doubling (std Sinc3) is the ideal, as those deviations are significant, so it comes down to logic cost.
With the P2 adders, how many gates / LUTs do they need, relative to a ripple adder, or a MUX and a flip-flop?
I did a quick test on the P1. Breadboarded on my Activity Board with 220pF caps instead of 1nF. Input was grounded.
             Triangle    Rectangle
Mean         1143.3      1153.9
Std Dev      0.13057     4.64507
Triangular window has a standard deviation 2.8% of the rectangular window. Have we been doing ADC wrong for 12 years? I'm probably going to build a second or third order modulator next.
We've been doing it wrong.
I finally tested out the Sinc3 smart pin mode and it works really well.
It seems to me that a 2nd-order modulator would be very sloppy about tracking the input. I want to see if it works for you.
Here's the test code that runs on the FPGA with the SINC3 smart pin and two real I/O pins on the pad ring test chip:
Great to hear.
What ENOB do you get on the test chip ? How does that compare with the logged P2 Devices ?
Well, it's hard to say, because the FPGA board has about 50mV of 471KHz noise on its 3.3V supply that feeds the pins. Even so, the consistency of readings I'm getting is about like this:
But did you see those later pics of the super accurate and quiet ramp and sine signals from the pad ring test chip running on the FPGA board? I was really surprised. This means that the digital ground noise in the substrate of the actual P2 die is demolishing our SNR. If we could quiet down that ground noise, the ADC performance would be fantastic. It should be possible in this next revision to improve noise isolation a little bit.
Does the new P2-Eval board have better ground noise performance than the P2D2 board? Are the layout and capacitors helping to get better SNR? I haven't seen any information about how that board performs. Nobody is using it yet?
All the P2-Eval boards are still at Parallax. There are none in the wild yet. Chip is using a board for tests.
Yes and yes. Chip has commented a couple of times but not detailed yet.
Is that the same (low noise TI) regulator as on the Eval Boards ?
Those are good numbers, i.e. hitting around 13 bits at 1MHz (which I think those numbers say) is quite good.
It would be interesting to run the same test, with the same regulator, on the P2 EV board (even if via sampling) to compare the ADCs.
Q: if you can hit 8b with 16 samples, do you need the scope mode silicon, given the analog IP bandwidth seems to be the ceiling anyway?
I jumpered a regulator from the new eval board over to the FPGA board for its 3.3 volt power.
I, too, am rethinking the live ADC. ON Semi is running some compilation tests today and I'm waiting to hear how much logic the whole live scope thing requires.
I experimented with binomial filters, but they didn't seem to do much filtering for the amount of logic involved. I first made a "1 1" and that wasn't too hot, so I made this "1 2 1" which also was lousy. Perhaps I just didn't do it right. Can anyone look at this that knows about these things and see if there's a problem in my implementation?
...
Our Tukey window is 1/4 the logic and works 10x better. Maybe I didn't do it right, though.
The binomial method seems unsuitable for longer filters. To match the length of the Tukey we would need 44 adders. The first and last 10 samples are basically too low to contribute much to the output. So the effective length is only 25 samples.
Then there is the problem of bit growth. It should be fine to round or truncate the rest once the sums get to 10 bits or so. Maybe that would reduce the logic by half.
Chip, 1-bit input has the best quality, so that should be our choice, i.e. no change. The whole point of trying 2-bit was to reduce the logic, but we could do that with 1-bit and 24-bit counters to a similar extent. If somehow the Sinc3 adders could run at twice the ADC bit rate then I think 2-bit mode would use the least logic.
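For anyone following along, here is a minimal software sketch of what a Sinc3 stage with a 1-bit input does: three accumulators run at the ADC bit rate, and three differences are taken at the decimated rate. The names `Sinc3`, `push` and `run_constant` are made up for illustration and are not taken from the P2 Verilog; 32-bit wraparound arithmetic is assumed.

```cpp
#include <cstdint>

// Sinc3 (third-order CIC) decimator sketch. Illustrative names only.
struct Sinc3 {
    uint32_t acc1 = 0, acc2 = 0, acc3 = 0;  // integrators, updated on every bit
    uint32_t d1 = 0, d2 = 0, d3 = 0;        // previous values for the three differences
    uint32_t decimation;                    // input bits per output sample
    uint32_t count = 0;
    uint32_t result = 0;                    // most recent decimated sample

    explicit Sinc3(uint32_t n) : decimation(n) {}

    // Feed one ADC bit; returns true when a new result is available.
    bool push(bool bit) {
        acc1 += bit ? 1u : 0u;
        acc2 += acc1;
        acc3 += acc2;
        if (++count < decimation) return false;
        count = 0;
        uint32_t c1 = acc3 - d1;  d1 = acc3;  // first difference
        uint32_t c2 = c1 - d2;    d2 = c1;    // second difference
        uint32_t c3 = c2 - d3;    d3 = c2;    // third difference
        result = c3;
        return true;
    }
};

// Helper: feed a constant bit until `outputs` results have been produced.
uint32_t run_constant(Sinc3& f, bool bit, int outputs) {
    int produced = 0;
    while (produced < outputs) {
        if (f.push(bit)) ++produced;
    }
    return f.result;
}
```

Once the filter has settled, a constant stream of 1s gives a result of decimation cubed, so 16 counts per sample spans 0..4096.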
Two channels would need less logic than four, although probably not exactly half.
Note to self: Make a trigger mechanism with hysteresis for scope-like triggering on ADC channel data. It drops the current write address into a register and causes an event.
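One way such a trigger could behave, as a rough software sketch: arm when the sample drops below a low threshold, fire when it then reaches a high threshold, and latch the write address at that moment. The struct, field names and the rising-edge choice are all assumptions for illustration; nothing here is the actual smart pin or cog logic.

```cpp
#include <cstdint>

// Rising-edge trigger with hysteresis (sketch; all names are hypothetical).
struct HystTrigger {
    uint8_t  low, high;          // arm below `low`, fire at or above `high`
    bool     armed = false;
    bool     fired = false;      // stands in for "causes an event"
    uint32_t trigger_addr = 0;   // write address latched when the trigger fires

    HystTrigger(uint8_t lo, uint8_t hi) : low(lo), high(hi) {}

    // Call once per ADC sample; returns true on the sample that fires.
    bool sample(uint8_t value, uint32_t write_addr) {
        if (!armed) {
            if (value < low) armed = true;  // signal went low enough: arm
            return false;
        }
        if (value >= high) {                // armed and crossed the upper level
            armed = false;
            fired = true;
            trigger_addr = write_addr;
            return true;
        }
        return false;
    }
};
```

The gap between `low` and `high` is what keeps noise around a single threshold from firing the trigger repeatedly.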
If there is not enough room for any more, could we have one Tukey, perhaps quite crude, in each cog for triggering?
Actually, it's three bits, a lossless summation of 1+2+1=4. The first summation (the ones in the odd positions) requires a half adder for the ones, which consists of an exclusive OR (which usually comes with an AND gate that also generates a carry for us into the even positions), for a result of 16 three-bit values for every 32 bits in. That gives a [1,2,1] kernel with decimation by 2 (to half sysclock) in one step:
DWORD propeller_adc::decimate1 (DWORD input)
{
    //static bool carry;
    DWORD q0, q1, q2, q3;
    DWORD r0, r1, r2, r3;
    DWORD s0, s1;
    // format b31.....b0, even bits have weight = 1
    // odd bits have weight = 0.5*2, phase one - decimate by 2 using
    // a [1,2,1] convolutional kernel yielding 16 3 bit values from a
    // single DWORD containing 32 individual one bit input samples
    REG[0] = input;
    q0 = input&EVEN_BITS;
    q1 = input&ODD_BITS;
    // multiply all even bits by 2
    q2 = q0<<1;
    // add all of the odd bits with the appropriate
    // alternate neighbor
    q3 = (q1+(q1>>2))>>1;
    // pick off odd pairs and even pairs since the
    // next addition can result in a carry into the
    // third bit - so that in the end we want to pack
    // the resulting 16 three bit values into nibbles
    r0 = q2&EVEN_PAIRS;
    r2 = q3&EVEN_PAIRS;
    s0 = r0+r2;
    r1 = q2&ODD_PAIRS;
    r3 = q3&ODD_PAIRS;
    s1 = (r1+r3)>>2;
    REG[1]=s1;
    REG[2]=s0;
    return 0;
}
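A bit-at-a-time reference is handy for checking a packed version like the one above. This is only a sketch under my own assumptions: each output is centred on an odd bit with its two even neighbours as the weight-1 taps, and bit 32 (which really belongs to the next word) is taken as 0. The function name `decimate121_reference` is hypothetical, and the packed code in the thread may assign phases and weights differently.

```cpp
#include <array>
#include <cstdint>

// Bit-serial [1,2,1] decimate-by-2 over one 32-bit word (reference sketch).
// Assumption: out[i] = b[2i] + 2*b[2i+1] + b[2i+2], with b[32] treated as 0.
std::array<uint8_t, 16> decimate121_reference(uint32_t input) {
    std::array<uint8_t, 16> out{};
    for (int i = 0; i < 16; ++i) {
        unsigned left   = (input >> (2 * i)) & 1u;
        unsigned centre = (input >> (2 * i + 1)) & 1u;
        unsigned right  = (2 * i + 2 < 32) ? ((input >> (2 * i + 2)) & 1u) : 0u;
        out[i] = static_cast<uint8_t>(left + 2u * centre + right);  // range 0..4
    }
    return out;
}
```

Each result fits in three bits (0..4), which matches the "16 three bit values" described for phase one.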
Here is one test case which shows that summation occurs with the correct weights. For those who REALLY want or need to sum 6116 bits to obtain a long-term moving average, you can still do so, but now with half the number of operations, since half the work has been done. I.e., try feeding this stream into a 3113 window where you are now doing 6116 additions; the result should be the same or better, even if I still need to work out carry propagation.
OK - this part is starting to look pretty solid. Phase two and Phase three are a work in progress.
DWORD propeller_adc::decimate2 (DWORD input)
{
    // phase two - apply the same transformation
    // to the 16 three bit values, yielding
    // eight five bit values, having a range [0..32]
    DWORD q0, q1, q2, q3;
    DWORD s0, s1;
    q0 = REG[1]&EVEN_NIBBLES;
    q1 = (REG[1]&ODD_NIBBLES)>>4;
    q2 = REG[2]&EVEN_NIBBLES;
    q3 = (REG[2]&ODD_NIBBLES)>>4;
    // TODO: fix nibble order to get things in the correct bins, although this gives an interesting
    // high peaking response for quick settling time on transient input with no effect on
    // the long term average
    s0 = q0+q1+(q2<<1);
    s1 = q0+q1+(q3<<1);
    REG[3]=s1>>4;
    REG[4]=s0>>4;
    return 0;
}
DWORD propeller_adc::decimate3 (DWORD input)
{
    DWORD acc;
    // finally repack into a 32 bit register
    // for an input rate of 250Mbps - this results
    // in an initial output rate of 31.25
    DWORD q0, q1, q2, q3;
    DWORD s0, s1;
    q0 = REG[3]&EVEN_BYTES;
    q1 = (REG[3]&ODD_BYTES)>>8;
    q2 = REG[4]&EVEN_BYTES;
    q3 = (REG[4]&ODD_BYTES)>>8;
    // HMMM... FIXME? DEFINITELY BROKEN HERE!!
    s0 = q0+q1+(q2<<1);
    s1 = q0+q1+(q3<<1);
    REG[5]=s1;
    REG[6]=s0;
    // parenthesized because + binds tighter than << in C++
    acc = (s0<<16)+s1;
    REG[7]=acc;
    return acc;
}
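A side note on the final packing step: in C++ the `+` operator binds more tightly than `<<`, so an expression written as `s0<<16+s1` is parsed as `s0<<(16+s1)`. Below is a small sketch of packing two 16-bit fields with explicit parentheses and masks. The helper name `pack_halves` is hypothetical, and whether the two sums should be masked to 16 bits first is my assumption, not something the code above states.

```cpp
#include <cstdint>

// Hypothetical helper: pack two 16-bit fields into one 32-bit word.
// Masking keeps an oversized `low` from spilling into the upper half.
uint32_t pack_halves(uint32_t high, uint32_t low) {
    return ((high & 0xFFFFu) << 16) | (low & 0xFFFFu);
}
```

For example, `pack_halves(0x1234, 0x5678)` gives `0x12345678`.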
Wendy at ON Semi ran some test compiles to weigh the new design.
It seems that even without SINC3 and the 4-channel-scope-per-cog, we had already grown quite a bit.
Here is where we are at:
Note: 'sequential' means flipflop
Note: 'area' is square um, so 1,000,000 = 1 square mm
Original Design, current P2 silicon:
Type              Instances            Area   Area %
----------------------------------------------------
timing_model             92    37049021.213     72.3
sequential            58246     4655514.931      9.1
inverter              74356      737807.974      1.4
buffer                15359      242666.189      0.5
logic                469886     8569315.686     16.7
physical_cells            0           0.000      0.0
----------------------------------------------------
total                617939    51254325.994    100.0
New Design with SINC3 and SCOPE:
Type              Instances            Area   Area %
----------------------------------------------------
timing_model             91    36815952.439     69.0
sequential            61112     4871372.083      9.1
inverter              92183      908669.798      1.7
buffer                21647      339857.101      0.6
logic                559646    10389258.163     19.5
physical_cells            0           0.000      0.0
----------------------------------------------------
total                734679    53325109.584    100.0
New Design without SINC3 smart pin:
Type              Instances            Area   Area %
----------------------------------------------------
timing_model             91    36815952.439     69.1
sequential            61112     4881813.709      9.2
inverter              90045      889771.008      1.7
buffer                21978      340910.797      0.6
logic                554207    10318409.651     19.4
physical_cells            0           0.000      0.0
----------------------------------------------------
total                727433    53246857.604    100.0
New Design without 4-channel SCOPE per cog:
Type              Instances            Area   Area %
----------------------------------------------------
timing_model             91    36815952.439     70.1
sequential            59129     4729178.317      9.0
inverter              83262      819040.410      1.6
buffer                19948      308145.869      0.6
logic                530745     9811739.930     18.7
physical_cells            0           0.000      0.0
----------------------------------------------------
total                693175    52484056.964    100.0
Cost of SINC3:
----------------------
sequential           0
inverter          2138
buffer            -331
logic             5439
----------------------
total             7246
area             78252
Cost of SCOPE:
----------------------
sequential        1983
inverter          8921
buffer            1699
logic            28901
----------------------
total            41504
area            841052
Note that SINC3 turns out to be very little logic. 5439/64 is only 85 logic cells per smart pin added. And it didn't use any new flops, since the smart pin supplied them.
The 4-channel scope, on the other hand is a real pig. 28901/4channels/8cogs = 903 logic cells per channel, which seems WAY too big. I'm wondering two things:
(1) Is the tool generating a lot of extra circuitry in order to make timing? If I pipelined the 1's counts before final summing (requires < 36 flipflops per channel), might things relax and would net logic/buffering requirements go down?
(2) If this scope function went into each smart pin, it wouldn't need any new flops and things could be pipelined to relax timing. However, each cog would need to mux in 4 channels of a new 8-bit bus coming from each smart pin. And there would be twice as many Tukey filters, only half of which could ever be used at once. However, they could be instantly mux'd and filtered samples would be forthcoming, without waiting for the Tukey filter to refill. Maybe, rather than 4 random pins, you would select a group of 4 pins, differing only in the two LSBs of their pin numbers. That would lighten the mux'ing problem.
I think the first thing I need to do is see how much I can squeeze the Tukey filter logic.
I'm putting the Tukey into the smart pin to see how it compiles.
Just had a realization that we don't need new 8-bit buses out of the smart pins. Instead, the pin just updates the result on every clock, so that a RDPIN at any time, from any cog, gets the immediate 8-bit conversion for that pin.
So, RDPIN would always read the ADC value and those same result outputs from the smart pins could be gathered, lower bytes, only, to get parallel ADC samples for streamer recording.
One other thing. The scope-trigger mechanism could go into the smart pin to alert when a trigger event occurs, raising IN.
I'm not following all the details here, but the figures you reported earlier on the PAD-Ring test chip had quite small sample counts on Sinc3 (aka high ADC conversion rates, but low bit-counts).
If sinc3 can adjust to low bit values, is that not equivalent to an ADC-Bandwidth limited low-bit-scope Tukey pathway ?
Then, you just need a means to capture the 4 outputs ? (Which I think is what you are saying above?)
As for the new features related to the Tukey filters, could at least some part of the new 4-pin groupings and triggering stuff be leveraged in the streamers too, to ease in some way the communications with QSPI, Octa SPI and even HyperBus-enabled devices?
Tukey had better quality than Sinc8.
Can the four 8-bit Tukey values still be read as one 32-bit value? And can this be streamed?
I'm wondering whether the pair symmetry in the ramp values is reflected in the logic minimization, e.g. 1 & 31, 3 & 29, 5 & 27, etc., all add up to 32. I've looked at making them add up to 31 with bits inverted to halve the taps, e.g. 1 & 30, 3 & 28, 5 & 26, then adding a pair can be done by a simple OR. The problem is that the plateau value is now 31 and n+½ plateau bits are needed for the whole thing to sum to a multiple of 256 minus 1 when all bits are set.
Another thought I had was to set the midpoint of the ramp, currently 16, to zero and having -15 & +15, -13 & +13, etc., as the pairs, again to halve the taps. The previous max tap of 32 would then be +16. However, the arithmetic would not be two's complement as used by other smart pin modes.
I also had an idea for using a counter for the plateau values instead of adding them individually and I'll try to find my post about it.
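To make the pair symmetry concrete, here is a rough software model of a ramp/plateau/ramp window. The tap layout (1,3,...,31, then some number of 32s, then 31,29,...,1), the plateau length, and the scaling by integer division are all my assumptions; the actual tap count and scaling of the hardware filter are not given in this part of the thread. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed tap layout: 1,3,...,31, then `plateau` taps of 32, then 31,29,...,1.
std::vector<int> make_taps(int plateau) {
    std::vector<int> taps;
    for (int v = 1; v <= 31; v += 2) taps.push_back(v);       // 16-tap rising ramp
    for (int i = 0; i < plateau; ++i) taps.push_back(32);     // flat top
    for (int v = 31; v >= 1; v -= 2) taps.push_back(v);       // 16-tap falling ramp
    return taps;
}

// Weighted sum of bits (one per tap), scaled to 8 bits.
// A full-scale $100 is swapped for $FF, as mentioned in the thread.
// `bits` must hold at least taps.size() entries of 0 or 1.
uint8_t filter_output(const std::vector<int>& taps, const std::vector<int>& bits) {
    long sum = 0, total = 0;
    for (std::size_t i = 0; i < taps.size(); ++i) {
        total += taps[i];
        sum += static_cast<long>(taps[i]) * bits[i];
    }
    long scaled = sum * 256 / total;
    return static_cast<uint8_t>(scaled > 255 ? 255 : scaled);
}
```

With this layout, the i-th tap of the rising ramp and the i-th tap of the falling ramp always add up to 32, which is the symmetry that could in principle halve the adder count.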
Breaking my sinc decimator into stages, the first stage turns 32 input bits into 16 three-bit values, packed in odd and even groupings, with 4-bit alignment. Hand optimization of the first stage brings it down to this, with carry propagation and stages two and three yet to be debugged. Eventually you get four eight-bit values from every 32 input bits, which can be summed by whatever windowing method you wish to use in addition to the initial decimation by 2, 4, or 8, as desired. In the meantime I am developing this code both in Visual Studio and in Simple IDE/Propeller GCC, so that I can also pry into the assembly that GCC is generating …
Thus it appears that the first stage requires about 40 words ~ 80 bytes in unrolled GCC-propeller 1 assembly, comprising ~ 26 instructions, with stages two and three expected to be similar when fully debugged and optimized. At that point I am expecting to have noisy 7-bit or something like that data running at sysclock/8, which can be summed to give numbers identical to what others are doing. So instead of counting 6116 samples at sysclock, you would be getting 4 pre-filtered values every 32 clocks, reducing the number of filtered samples that need to be summed for a 12-bit window previously obtained by other means to 764.5. Or you can store the samples and run an FFT, or linear regression, or a median filter, Student's t-test, or whatever you want with the data.
That's a whole new trick for the Prop1. The ramping isn't every sysclock but maybe that isn't so terrible. Certainly his results look great. And, in theory, the shape can be more complex.
Guess what? The Prop2 has no equivalent mode!
Here's a commented snippet: (Inputs A and B are the same pin in James's test code)
            mov     i, looplength
            mov     j, looplength
            mov     frqb, #1            ' start rectangular (flat +1 increment) for input B
uploop      add     frqa, #1            ' start triangular (ramp the increment up and down) for input A
            djnz    i, #uploop          ' ramp up
downloop    sub     frqa, #1
            djnz    j, #downloop        ' ramp down
            mov     frqb, #0            ' stop rectangular for input B
            mov     frqa, #0            ' stop triangular for input A
            wrlong  phsa, adcp_tri      ' post triangular sample
            wrlong  phsb, adcp_rect     ' post rectangular sample
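The effect of ramping the increment can be tried in software. The sketch below is not a model of the P1 counters: it assumes an ideal first-order delta-sigma modulator with a constant input, and it ramps the weight on every sample rather than once per loop iteration. All function names and the test input value are made up for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Ideal first-order delta-sigma bitstream for a constant input in [0,1).
std::vector<int> modulate(double input, int n) {
    std::vector<int> bits(n);
    double acc = 0.0;
    for (int i = 0; i < n; ++i) {
        acc += input;
        bits[i] = (acc >= 1.0) ? 1 : 0;
        if (bits[i]) acc -= 1.0;
    }
    return bits;
}

// Weighted average over one window of 2*half samples starting at `start`.
// triangular=false: weight 1 throughout (like holding frqb at 1).
// triangular=true : weights 1..half, then half-1..0 (like ramping frqa).
double window_reading(const std::vector<int>& bits, int start, int half, bool triangular) {
    double sum = 0.0, weight_total = 0.0;
    for (int k = 0; k < 2 * half; ++k) {
        double w = 1.0;
        if (triangular) w = (k < half) ? (k + 1) : (2 * half - 1 - k);
        sum += w * bits[start + k];
        weight_total += w;
    }
    return sum / weight_total;
}

// Standard deviation of readings taken from back-to-back windows.
double reading_stddev(const std::vector<int>& bits, int half, bool triangular) {
    std::vector<double> readings;
    const int total = static_cast<int>(bits.size());
    for (int s = 0; s + 2 * half <= total; s += 2 * half)
        readings.push_back(window_reading(bits, s, half, triangular));
    double mean = 0.0;
    for (double v : readings) mean += v;
    mean /= readings.size();
    double var = 0.0;
    for (double v : readings) var += (v - mean) * (v - mean);
    return std::sqrt(var / readings.size());
}
```

Calling `reading_stddev(bits, 64, true)` and `reading_stddev(bits, 64, false)` on the same bitstream, for example `modulate(0.381966, 64000)`, lets you compare the spread of the two windows. The triangular window weights the ends of the window lightly, so a bit falling just inside or outside the window edge moves the reading much less.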
You can use RDPIN to read a sample, and then the streamer will be able to group four together, time-aligned.
Very interesting idea about offsetting.
Now that this thing is in the smart pin, it's pipelined, since we have the flops already there. That should relax any timing pressure. It's also no problem to check for $100 and swap out $FF, instead, so I changed that center $1F to $20.
I made this diagram of inc's and dec's for when bits move around. There has got to be some good optimization possible here:
I sent the new scope-in-the-smart-pin file set to ON Semi for a test compile. I was surprised how little logic the SINC3 took in the smart pin, and I'm curious to see how the SCOPE may work there.
We really need to optimize the Tukey-filter summing logic. Some huge optimization(s) must be possible.
What ENOB do you get on the test chip ? How does that compare with the logged P2 Devices ?
Well, it's hard to say, because the FPGA board has about 50mV of 471KHz noise on its 3.3V supply that feeds the pins. Even so, the consistency of readings I'm getting is about like this:
16 counts = 7 bits
32 counts = 8 bits
64 counts = 9 bits
128 counts = 11 bits
256 counts = 12 bits
512 counts = 12 bits (1/f noise really increases)
1024 counts = 13 bits
Okay! I just wired in a quiet 3.3V regulator and things are looking WAY better:
16 counts = 8 bits
32 counts = 10 bits
64 counts = 12 bits
128 counts = 13 bits
256 counts = 13 bits
512 counts = 13 bits (1/f noise really increases)
1024 counts = 13 bits
So, noise doesn't let us get beyond ~13 bits.
"reg [2:0][8:0] f5" means there are three f5 registers (f5[0], f5[1], f5[2]) that are each 9 bits wide.