Chip, please don't take my lead as an endorsement of this entire thread. But if it works and simplifies what you were going to go ahead and do anyway, I guess I'd be in favor of it.
-Phil
I know, Phil.
I've been studying that code and I kind of get it, but what I don't understand is how would you go from one bit in to something like 8 bits out? You must allow for some bit-length expansion along the way, right? Or, input $FF for 1 and $00 for 0?
Anything to reduce the logic in these 4-per-cog ADC channels would be welcome.
TonyB_, thanks for trying out that idea of doubling acc1 before doing acc2 and acc3. It looks to me like it's maybe kind of a sticky compromise. What's your take?
It appears the non-doubling (std Sinc3), is the ideal, as those deviations are significant, so it comes down to logic cost.
With the P2 adders, how many gates / LUT do they need, relative to a ripple adder, or a MUX and a flip flop ?
I did a quick test on the P1. Breadboarded on my Activity Board with 220pF caps instead of 1nF. Input was grounded.
           Triangle   Rectangle
Mean       1143.3     1153.9
Std Dev    0.13057    4.64507
Triangular window has a standard deviation 2.8% of the rectangular window. Have we been doing ADC wrong for 12 years? I'm probably going to build a second or third order modulator next.
We've been doing it wrong.
I finally tested out the Sinc3 smart pin mode and it works really well.
It seems to me that a 2nd-order modulator would be very sloppy about tracking the input. I want to see if it works for you.
Here's the test code that runs on the FPGA with the SINC3 smart pin and two real I/O pins on the pad ring test chip:
Great to hear.
What ENOB do you get on the test chip ? How does that compare with the logged P2 Devices ?
Well, it's hard to say, because the FPGA board has about 50mV of 471KHz noise on its 3.3V supply that feeds the pins. Even so, the consistency of readings I'm getting is about like this:
But did you see those later pics of the super accurate and quiet ramp and sine signals from the pad ring test chip running on the FPGA board? I was really surprised. This means that the digital ground noise in the substrate of the actual P2 die is demolishing our SNR. If we could quiet down that ground noise, the ADC performance would be fantastic. It should be possible in this next revision to improve noise isolation a little bit.
Does the new P2-Eval board have better ground noise performance than the P2D2 board? Are the layout and capacitors helping to get better SNR? I haven't seen any information about how that board performs. Is nobody using it yet?
All the P2-Eval boards are still at Parallax. There are none in the wild yet. Chip is using a board for tests.
Yes and yes. Chip has commented a couple of times but not detailed yet.
I experimented with binomial filters, but they didn't seem to do much filtering for the amount of logic involved. I first made a "1 1" and that wasn't too hot, so I made this "1 2 1" which also was lousy. Perhaps I just didn't do it right. Can anyone look at this that knows about these things and see if there's a problem in my implementation?
Is that the same (low noise TI) regulator as on the Eval Boards ?
Those are good numbers, ie hitting around 13 bits at 1MHz (which I think those numbers say) is quite good.
Be interesting to run the same test, with same regulator, on the P2 EV board, (even if via sampling) to compare the ADCs
Q: if you can hit 8b with 16 samples, do you need the scope mode silicon ? - given the analog IP bandwidth seems to be the ceiling anyway ?
I jumpered a regulator from the new eval board over to the FPGA board for its 3.3 volt power.
I, too, am rethinking the live ADC. ON Semi is running some compilation tests today and I'm waiting to hear how much logic the whole live scope thing requires.
Our Tukey window is 1/4 the logic and works 10x better. Maybe I didn't do it right, though.
The binomial method seems unsuitable for longer filters. To match the length of the Tukey we would need 44 adders. The first and last 10 samples are basically too low to contribute much to the output. So the effective length is only 25 samples.
Then there is the problem of bit growth. It should be fine to round or truncate the rest once the sums get to 10 bits or so. Maybe that would reduce the logic by half.
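To put numbers on the tail and bit-growth points, here is a minimal C++ sketch that builds the impulse response of N cascaded "1 2 1" sections by repeated convolution. The function names are hypothetical and this only models the arithmetic, not any actual logic.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Impulse response of `stages` cascaded [1 2 1] sections (hypothetical helper).
std::vector<uint64_t> binomial_kernel(unsigned stages) {
    std::vector<uint64_t> k{1};
    for (unsigned s = 0; s < stages; ++s) {
        std::vector<uint64_t> next(k.size() + 2, 0);
        for (std::size_t i = 0; i < k.size(); ++i) {
            next[i]     += k[i];        // weight 1
            next[i + 1] += 2 * k[i];    // weight 2
            next[i + 2] += k[i];        // weight 1
        }
        k = next;
    }
    return k;
}

// Sum of all taps; equals 4^stages, the output when every input bit is 1.
uint64_t kernel_sum(const std::vector<uint64_t>& k) {
    uint64_t sum = 0;
    for (uint64_t tap : k) sum += tap;
    return sum;
}

// Bits needed to hold the full-scale value 4^stages without truncation.
unsigned output_bits(unsigned stages) { return 2 * stages + 1; }
```

For 16 stages the kernel has 33 taps; the outer taps are 1, 32, 496, ... while the centre tap is 601080390, so the ends contribute almost nothing. The lossless width grows by two bits per stage, which is where rounding or truncating would save logic.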
TonyB_, thanks for trying out that idea of doubling acc1 before doing acc2 and acc3. It looks to me like it's maybe kind of a sticky compromise. What's your take?
Chip, 1-bit input has the best quality, so that should be our choice, i.e. no change. The whole point of trying 2-bit was to reduce the logic, but we could do that with 1-bit and 24-bit counters to a similar extent. If somehow the Sinc3 adders could run at twice the ADC bit rate then I think 2-bit mode would use the least logic.
Q: if you can hit 8b with 16 samples, do you need the scope mode silicon ? - given the analog IP bandwidth seems to be the ceiling anyway ?
I, too, am rethinking the live ADC. ON Semi is running some compilation tests today and I'm waiting to hear how much logic the whole live scope thing requires.
Two channels would need less logic than four, although probably not exactly half.
Note to self: Make a trigger mechanism with hysteresis for scope-like triggering on ADC channel data. It drops the current write address into a register and causes an event.
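A rough C++ sketch of what such a trigger could look like in software terms. Everything here (struct name, thresholds, rising-edge polarity) is an assumption for illustration, not the planned hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical rising-edge trigger with hysteresis on 8-bit ADC samples.
// It must first see a sample at or below `arm_below`, and then fires on the
// first sample at or above `fire_above`, latching the current write address.
struct HystTrigger {
    uint8_t arm_below;
    uint8_t fire_above;
    bool armed = false;
    std::size_t captured_addr = 0;

    HystTrigger(uint8_t arm, uint8_t fire) : arm_below(arm), fire_above(fire) {}

    // Returns true on the sample that causes the trigger event.
    bool sample(uint8_t value, std::size_t write_addr) {
        if (value <= arm_below) {
            armed = true;                // signal dropped low enough: arm
            return false;
        }
        if (armed && value >= fire_above) {
            armed = false;               // one event per arm/fire cycle
            captured_addr = write_addr;  // "drops the write address into a register"
            return true;
        }
        return false;
    }
};
```

The gap between the two thresholds is what keeps noise around a single level from firing the event repeatedly.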
If there is not enough room for any more, could we have one Tukey, perhaps quite crude, in each cog for triggering?
Actually, it's three bits: a lossless summation with 1+2+1=4. The first summation (the ones in the odd positions) requires a half adder, which consists of an exclusive OR plus (usually) an AND gate that generates the carry into the even positions. The result is 16 three-bit values for every 32 bits in, giving a [1,2,1] kernel with decimation by 2 (to half sysclock) in one step:
DWORD propeller_adc::decimate1(DWORD input){
    //static bool carry;
    DWORD q0, q1, q2, q3;
    DWORD r0, r1, r2, r3;
    DWORD s0, s1;
    // format b31.....b0, even bits have weight = 1
    // odd bits have weight = 0.5*2, phase one - decimate by 2 using
    // a [1,2,1] convolutional kernel yielding 16 3 bit values from a
    // single DWORD containing 32 individual one bit input samples
    REG[0] = input;
    q0 = input&EVEN_BITS;
    q1 = input&ODD_BITS;
    // multiply all even bits by 2
    q2 = q0<<1;
    // add all of the odd bits with the appropriate
    // alternate neighbor
    q3 = (q1+(q1>>2))>>1;
    // pick off odd pairs and even pairs since the
    // next addition can result in a carry into the
    // third bit - so that in the end we want to pack
    // the resulting 16 three bit values into nibbles
    r0 = q2&EVEN_PAIRS;
    r2 = q3&EVEN_PAIRS;
    s0 = r0+r2;
    r1 = q2&ODD_PAIRS;
    r3 = q3&ODD_PAIRS;
    s1 = (r1+r3)>>2;
    REG[1]=s1;
    REG[2]=s0;
    return 0;
}
Here is one test case which shows that summation occurs with the correct weights. For those who REALLY want or need to sum 6116 bits to obtain a long-term moving average, you can still do so, but now with half the number of operations, since half the work has been done; i.e., try feeding this stream into a 3113 window where you are now doing 6116 additions. The result should be the same or better, even if I still need to work out carry propagation.
OK - this part is starting to look pretty solid. Phase two and Phase three are a work in progress.
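For cross-checking the packed version, a plain bit-at-a-time reference can be handy. The sketch below is an assumption-laden stand-in, not the packed code above: it centres each output on an even bit, treats bits outside the word as 0, and returns one small integer per output instead of packing nibbles.

```cpp
#include <array>
#include <cstdint>

// Reference [1,2,1] decimate-by-2 (hypothetical helper):
// out[k] = b[2k-1] + 2*b[2k] + b[2k+1], with b[-1] taken as 0.
std::array<uint8_t, 16> decimate_121_reference(uint32_t input) {
    std::array<uint8_t, 16> out{};
    for (int k = 0; k < 16; ++k) {
        const int c = 2 * k;                                   // centre bit
        const unsigned left  = (c > 0) ? (input >> (c - 1)) & 1u : 0u;
        const unsigned mid   = (input >> c) & 1u;
        const unsigned right = (input >> (c + 1)) & 1u;
        out[k] = static_cast<uint8_t>(left + 2u * mid + right); // range 0..4
    }
    return out;
}
```

Each output lies in 0..4, hence the three bits, and every input bit contributes a total weight of 2 across the outputs (apart from bit 31, whose second neighbour would fall in the next word).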
DWORD propeller_adc::decimate2(DWORD input){
    // phase two - apply the same transformation
    // to the 16 three bit values, yielding
    // eight five bit values, having a range [0..16]
    DWORD q0, q1, q2, q3;
    DWORD s0, s1;
    q0 = REG[1]&EVEN_NIBBLES;
    q1 = (REG[1]&ODD_NIBBLES)>>4;
    q2 = REG[2]&EVEN_NIBBLES;
    q3 = (REG[2]&ODD_NIBBLES)>>4;
    // to do: fix nibble order to get things in correct bins, although this gives an interesting
    // high peaking response for quick settling time on transient input with no effect on
    // the long term average
    s0 = q0+q1+(q2<<1);
    s1 = q0+q1+(q3<<1);
    REG[3]=s1>>4;
    REG[4]=s0>>4;
    return 0;
}
DWORD propeller_adc::decimate3(DWORD input){
    DWORD acc;
    // finally repack into a 32 bit register
    // for an input rate of 250Mbps - this results
    // in an initial output rate of 31.25
    DWORD q0, q1, q2, q3;
    DWORD s0, s1;
    q0 = REG[3]&EVEN_BYTES;
    q1 = (REG[3]&ODD_BYTES)>>8;
    q2 = REG[4]&EVEN_BYTES;
    q3 = (REG[4]&ODD_BYTES)>>8;
    // HMMM... FIXME? DEFINITELY BROKEN HERE!!
    s0 = q0+q1+(q2<<1);
    s1 = q0+q1+(q3<<1);
    REG[5]=s1;
    REG[6]=s0;
    // parenthesized: '+' binds tighter than '<<' in C++
    acc = (s0<<16)+s1;
    REG[7]=acc;
    return acc;
}
Wendy at ON Semi ran some test compiles to weigh the new design.
It seems that even without SINC3 and the 4-channel-scope-per-cog, we had already grown quite a bit.
Here is where we are at:
Note: 'sequential' means flipflop
Note: 'area' is square um, so 1,000,000 = 1 square mm
Original Design, current P2 silicon:
Type             Instances           Area   Area %
---------------------------------------------------
timing_model            92   37049021.213     72.3
sequential           58246    4655514.931      9.1
inverter             74356     737807.974      1.4
buffer               15359     242666.189      0.5
logic               469886    8569315.686     16.7
physical_cells           0          0.000      0.0
---------------------------------------------------
total               617939   51254325.994    100.0
New Design with SINC3 and SCOPE:
Type             Instances           Area   Area %
---------------------------------------------------
timing_model            91   36815952.439     69.0
sequential           61112    4871372.083      9.1
inverter             92183     908669.798      1.7
buffer               21647     339857.101      0.6
logic               559646   10389258.163     19.5
physical_cells           0          0.000      0.0
---------------------------------------------------
total               734679   53325109.584    100.0
New Design without SINC3 smart pin
Type             Instances           Area   Area %
---------------------------------------------------
timing_model            91   36815952.439     69.1
sequential           61112    4881813.709      9.2
inverter             90045     889771.008      1.7
buffer               21978     340910.797      0.6
logic               554207   10318409.651     19.4
physical_cells           0          0.000      0.0
---------------------------------------------------
total               727433   53246857.604    100.0
New Design without 4-channel SCOPE per cog
Type             Instances           Area   Area %
---------------------------------------------------
timing_model            91   36815952.439     70.1
sequential           59129    4729178.317      9.0
inverter             83262     819040.410      1.6
buffer               19948     308145.869      0.6
logic               530745    9811739.930     18.7
physical_cells           0          0.000      0.0
---------------------------------------------------
total               693175   52484056.964    100.0
Cost of SINC3
----------------------
sequential 0
inverter 2138
buffer -331
logic 5439
----------------------
total 7246
area 78252
Cost of SCOPE
----------------------
sequential 1983
inverter 8921
buffer 1669
logic 28901
----------------------
total 41474
area 841052
Note that SINC3 turns out to be very little logic. 5439/64 is only 85 logic cells per smart pin added. And it didn't use any new flops, since the smart pin supplied them.
The 4-channel scope, on the other hand is a real pig. 28901/4channels/8cogs = 903 logic cells per channel, which seems WAY too big. I'm wondering two things:
(1) Is the tool generating a lot of extra circuitry in order to make timing? If I pipelined the 1's counts before final summing (requires < 36 flipflops per channel), might things relax and would net logic/buffering requirements go down?
(2) If this scope function went into each smart pin, it wouldn't need any new flops and things could be pipelined to relax timing. However, each cog would need to mux in 4 channels of a new 8-bit bus coming from each smart pin. And there would be twice as many Tukey filters, only half of which could ever be used at once. However, they could be instantly mux'd and filtered samples would be forthcoming, without waiting for the Tukey filter to refill. Maybe, rather than 4 random pins, you would select a group of 4 pins, differing only in the two LSBs of their pin numbers. That would lighten the mux'ing problem.
I think the first thing I need to do is see how much I can squeeze the Tukey filter logic.
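The per-unit figures quoted above can be reproduced with a few lines of C++. The constants are copied from the cost tables; the function names are just illustrative.

```cpp
#include <cmath>

// Figures taken from the "Cost of SINC3" and "Cost of SCOPE" tables.
constexpr int kSinc3LogicCells      = 5439;
constexpr int kScopeLogicCells      = 28901;
constexpr int kSmartPins            = 64;
constexpr int kCogs                 = 8;
constexpr int kScopeChannelsPerCog  = 4;

// Logic cells added per smart pin by SINC3, rounded to nearest.
long cells_per_smart_pin() {
    return std::lround(static_cast<double>(kSinc3LogicCells) / kSmartPins);
}

// Logic cells per scope channel (4 channels in each of 8 cogs), rounded.
long cells_per_scope_channel() {
    return std::lround(static_cast<double>(kScopeLogicCells) /
                       (kCogs * kScopeChannelsPerCog));
}

// Area is reported in square um; 1,000,000 square um = 1 square mm.
double area_mm2(double area_um2) { return area_um2 / 1.0e6; }
```

By the same conversion the SCOPE area of 841052 comes to about 0.84 square mm and the SINC3 area of 78252 to about 0.08 square mm.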
I'm putting the Tukey into the smart pin to see how it compiles.
Just had a realization that we don't need new 8-bit buses out of the smart pins. Instead, the pin just updates the result on every clock, so that a RDPIN at any time, from any cog, gets the immediate 8-bit conversion for that pin.
So, RDPIN would always read the ADC value and those same result outputs from the smart pins could be gathered, lower bytes, only, to get parallel ADC samples for streamer recording.
One other thing. The scope-trigger mechanism could go into the smart pin to alert when a trigger event occurs, raising IN.
I'm not following all the details here, but the figures you reported earlier on the pad ring test chip had quite small sample counts on Sinc3 (i.e. high ADC conversion rates, but low bit-counts).
If sinc3 can adjust to low bit values, is that not equivalent to an ADC-Bandwidth limited low-bit-scope Tukey pathway ?
Then, you just need a means to capture the 4 outputs ? (Which I think is what you are saying above?)
I'm putting the Tukey into the smart pin to see how it compiles.
As for the new features related to the Tukey filters, could at least some part of the new 4-pin groupings and triggering stuff be leveraged by the streamers too, to somehow ease communications with QSPI, octal SPI and even HyperBus-enabled devices?
Tukey had better quality than Sinc8.
Can the four 8-bit Tukey values still be read as one 32-bit value? And can this be streamed?
I'm wondering whether the pair symmetry in the ramp values is reflected in the logic minimization, e.g. 1& 31, 3 & 29, 5 & 27, etc., all add up to 32. I've looked at making them add up to 31 with bits inverted to halve the taps, e.g. 1 and 30, 3 & 28, 5 & 26, then adding a pair can be done by a simple OR. The problem is that the plateau value is now 31 and n+½ plateau bits are needed for the whole thing to sum to a multiple of 256 minus 1 when all bits are set.
Another thought I had was to set the midpoint of the ramp, currently 16, to zero and having -15 & +15, -13 & +13, etc., as the pairs, again to halve the taps. The previous max tap of 32 would then be +16. However, the arithmetic would not be two's complement as used by other smart pin modes.
I also had an idea for using a counter for the plateau values instead of adding them individually and I'll try to find my post about it.
Breaking my sinc decimator into stages, the first stage turns 32 input bits into 16 3 bit values, packed in odd and even groupings, with 4 bit alignment. Hand optimization of the first stage brings it down to this, with carry propagation and stages two and three yet to be debugged. Eventually you get 4 eight bit values from every 32 input bits, which can be summed by whatever windowing method you wish to use in addition to the initial decimation by 2, 4, or 8 - as desired. In the meantime I am developing this code both in Visual Studio, and in Simple IDE/Propeller GCC - so that I can also pry into the assembly that GCC is generating …
Thus it appears that the first stage requires about 40 words ~ 80 bytes in unrolled GCC-propeller 1 assembly, comprising ~ 26 instructions, with stages two and three expected to be similar when fully debugged and optimized. At that point I am expecting to have noisy 7 bit (or something like that) data running at sysclock/8, which can be summed to give numbers identical to what others are doing. So instead of counting 6116 samples at sysclock, you would be getting 4 pre-filtered values every 32 clocks, reducing the number of filtered samples that need to be summed for a 12 bit window previously obtained by other means to 764.5; or you can store the samples and run an FFT, or linear regression, or a median filter, Student's t-test, or whatever you want with the data.
That's a whole new trick for the Prop1. The ramping isn't every sysclock but maybe that isn't so terrible. Certainly his results look great. And, in theory, the shape can be more complex.
Guess what? The Prop2 has no equivalent mode!
Here's a commented snippet: (Inputs A and B are the same pin in James's test code)
                mov     i, looplength
                mov     j, looplength
                mov     frqb, #1                ' start rectangular (flat +1 increment) for input B
uploop          add     frqa, #1                ' start triangular (ramp the increment up and down) for input A
                djnz    i, #uploop              ' ramp up
downloop        sub     frqa, #1
                djnz    j, #downloop            ' ramp down
                mov     frqb, #0                ' stop rectangular for input B
                mov     frqa, #0                ' stop triangular for input A
                wrlong  phsa, adcp_tri          ' post triangular sample
                wrlong  phsb, adcp_rect         ' post rectangular sample
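Here is a small C++ model of what the two counters end up accumulating, as a sketch only. It assumes one input bit per loop pass, whereas the real counters add FRQx on every clock the pin is high while the loop takes several clocks per step; the names mirror the registers but the function itself is made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct WindowSums {
    uint32_t triangular;   // models PHSA: bits weighted by the ramping FRQA
    uint32_t rectangular;  // models PHSB: bits weighted by a constant 1
};

// Simplified model: `half_len` steps ramping up, then `half_len` ramping down.
WindowSums window_sums(const std::vector<bool>& bits, unsigned half_len) {
    uint32_t frqa = 0, phsa = 0, phsb = 0;
    std::size_t n = 0;
    for (unsigned i = 0; i < half_len && n < bits.size(); ++i, ++n) {
        ++frqa;                                  // add frqa, #1 (ramp up)
        if (bits[n]) { phsa += frqa; phsb += 1; }
    }
    for (unsigned j = 0; j < half_len && n < bits.size(); ++j, ++n) {
        --frqa;                                  // sub frqa, #1 (ramp down)
        if (bits[n]) { phsa += frqa; phsb += 1; }
    }
    return {phsa, phsb};
}
```

With half_len = L the triangular weights run 1..L then L-1..0, so full scale is L*L for the triangular sum against 2*L for the rectangular one. Bits near the window edges barely count, which is presumably what knocks the standard deviation down.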
You can use RDPIN to read a sample, and then the streamer will be able to group four together, time-aligned.
Very interesting idea about offsetting.
Now that this thing is in the smart pin, it's pipelined, since we have the flops already there. That should relax any timing pressure. It's also no problem to check for $100 and swap out $FF, instead, so I changed that center $1F to $20.
I made this diagram of inc's and dec's for when bits move around. There has got to be some good optimization possible here:
I sent the new scope-in-the-smart-pin file set to ON Semi for a test compile. I was surprised how little logic the SINC3 took in the smart pin, and I'm curious to see how the SCOPE may work there.
We really need to optimize the Tukey-filter summing logic. Some huge optimization(s) must be possible.
Comments
I finally tested out the Sinc3 smart pin mode and it works really well.
It seems to me that a 2nd-order modulator would be very sloppy about tracking the input. I want to see if it works for you.
' SINC3 test program

con
                adc_pin = 4
                dac_pin = adc_pin+1
                exp     = 5                     'exp = 4..10, period = 1<<exp

dat             org

                hubset  #$FF                    'select 80MHz on FPGA

                wrpin   adc,#adc_pin            'set ADC+SINC3
                wxpin   ##1<<exp,#adc_pin       'set period

                wrpin   dac,#dac_pin            'set DAC+dither
                wxpin   #1, #dac_pin            'always updateable

                dirh    #1<<6 + adc_pin         'enable ADC and DAC smart pins

                setse1  #%001<<6 + adc_pin      'event on ADC period completion

.loop           rep     #8,#0                   'rep gets 16-clock loop for exp=4
                waitse1                         'wait for ADC period
                rdpin   x,#adc_pin              'read acc3
                sub     x,diff1                 'compute diff's
                add     diff1,x
                sub     x,diff2
                add     diff2,x
                ror     x,#(exp*3-16) & $1F     'scale sample
                wypin   x,#dac_pin              'write to DAC pin

adc             long    %100011_0000000_00_11000_0      'ADC + SINC3 mode
dac             long    %10110_00000000_01_00010_0      'DAC + noise dither mode

x               res     1
diff1           res     1
diff2           res     1
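For anyone wanting to check the arithmetic off-chip, here is a minimal C++ model of a textbook third-order sinc (CIC) decimator: three integrators at the bit rate, then three differences once per period. This is only a sketch; the struct and function names are made up, and it is not meant to reflect how the smart pin splits the work between hardware and the diff1/diff2 instructions.

```cpp
#include <cstdint>
#include <vector>

// Textbook Sinc3 decimator model (hypothetical names, unsigned wraparound math).
struct Sinc3Model {
    uint32_t acc1 = 0, acc2 = 0, acc3 = 0;   // integrators, run every bit
    uint32_t d1 = 0, d2 = 0, d3 = 0;         // previous values for the differences
    uint32_t period;
    uint32_t count = 0;

    explicit Sinc3Model(uint32_t p) : period(p) {}

    // Clock in one ADC bit; returns true and sets `out` once per period.
    bool clock(bool bit, uint32_t& out) {
        acc1 += bit ? 1u : 0u;
        acc2 += acc1;
        acc3 += acc2;
        if (++count < period) return false;
        count = 0;
        uint32_t x = acc3, t;
        t = x - d1; d1 = x; x = t;           // first difference
        t = x - d2; d2 = x; x = t;           // second difference
        t = x - d3; d3 = x; x = t;           // third difference
        out = x;
        return true;
    }
};

// Run a whole bitstream through the model and collect the decimated samples.
std::vector<uint32_t> run_sinc3(const std::vector<bool>& bits, uint32_t period) {
    Sinc3Model f(period);
    std::vector<uint32_t> samples;
    uint32_t y = 0;
    for (bool b : bits)
        if (f.clock(b, y)) samples.push_back(y);
    return samples;
}
```

With a constant stream of 1s and period = 32 (exp = 5), the output settles to 32^3 = 32768 from the third sample on. Full scale is therefore 2^(3*exp), which looks consistent with the exp*3-16 rotate used in the test program to scale to 16 bits.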
Well, it's hard to say, because the FPGA board has about 50mV of 471KHz noise on its 3.3V supply that feeds the pins. Even so, the consistency of readings I'm getting is about like this:
16 counts = 7 bits
32 counts = 8 bits
64 counts = 9 bits
128 counts = 11 bits
256 counts = 12 bits
512 counts = 12 bits (1/f noise really increases)
1024 counts = 13 bits
Okay! I just wired in a quiet 3.3V regulator and things are looking WAY better:
16 counts = 8 bits
32 counts = 10 bits
64 counts = 12 bits
128 counts = 13 bits
256 counts = 13 bits
512 counts = 13 bits (1/f noise really increases)
1024 counts = 13 bits
So, noise doesn't let us get beyond ~13 bits.
reg [2:0][0:0] f1;
reg [2:0][2:0] f2;
reg [2:0][4:0] f3;
reg [2:0][6:0] f4;
reg [2:0][8:0] f5;
reg [2:0][10:0] f6;
reg [2:0][12:0] f7;
reg [2:0][14:0] f8;
reg [2:0][16:0] f9;
reg [2:0][18:0] f10;
reg [2:0][20:0] f11;
reg [2:0][22:0] f12;
reg [2:0][24:0] f13;
reg [2:0][26:0] f14;
reg [2:0][28:0] f15;
reg [2:0][30:0] f16;

`regscan (f1[0], 1'b0, ena, pin_in[cfg[5:0]])
`regscan (f1[1], 1'b0, ena, f1[0])
`regscan (f1[2], 1'b0, ena, f1[1])

`regscan (f2[0], 1'b0, ena, f1[0] + (f1[1] << 1) + f1[2])
`regscan (f2[1], 1'b0, ena, f2[0])
`regscan (f2[2], 1'b0, ena, f2[1])

`regscan (f3[0], 1'b0, ena, f2[0] + (f2[1] << 1) + f2[2])
`regscan (f3[1], 1'b0, ena, f3[0])
`regscan (f3[2], 1'b0, ena, f3[1])

`regscan (f4[0], 1'b0, ena, f3[0] + (f3[1] << 1) + f3[2])
`regscan (f4[1], 1'b0, ena, f4[0])
`regscan (f4[2], 1'b0, ena, f4[1])

`regscan (f5[0], 1'b0, ena, f4[0] + (f4[1] << 1) + f4[2])
`regscan (f5[1], 1'b0, ena, f5[0])
`regscan (f5[2], 1'b0, ena, f5[1])

`regscan (f6[0], 1'b0, ena, f5[0] + (f5[1] << 1) + f5[2])
`regscan (f6[1], 1'b0, ena, f6[0])
`regscan (f6[2], 1'b0, ena, f6[1])

`regscan (f7[0], 1'b0, ena, f6[0] + (f6[1] << 1) + f6[2])
`regscan (f7[1], 1'b0, ena, f7[0])
`regscan (f7[2], 1'b0, ena, f7[1])

`regscan (f8[0], 1'b0, ena, f7[0] + (f7[1] << 1) + f7[2])
`regscan (f8[1], 1'b0, ena, f8[0])
`regscan (f8[2], 1'b0, ena, f8[1])

`regscan (f9[0], 1'b0, ena, f8[0] + (f8[1] << 1) + f8[2])
`regscan (f9[1], 1'b0, ena, f9[0])
`regscan (f9[2], 1'b0, ena, f9[1])

`regscan (f10[0], 1'b0, ena, f9[0] + (f9[1] << 1) + f9[2])
`regscan (f10[1], 1'b0, ena, f10[0])
`regscan (f10[2], 1'b0, ena, f10[1])

`regscan (f11[0], 1'b0, ena, f10[0] + (f10[1] << 1) + f10[2])
`regscan (f11[1], 1'b0, ena, f11[0])
`regscan (f11[2], 1'b0, ena, f11[1])

`regscan (f12[0], 1'b0, ena, f11[0] + (f11[1] << 1) + f11[2])
`regscan (f12[1], 1'b0, ena, f12[0])
`regscan (f12[2], 1'b0, ena, f12[1])

`regscan (f13[0], 1'b0, ena, f12[0] + (f12[1] << 1) + f12[2])
`regscan (f13[1], 1'b0, ena, f13[0])
`regscan (f13[2], 1'b0, ena, f13[1])

`regscan (f14[0], 1'b0, ena, f13[0] + (f13[1] << 1) + f13[2])
`regscan (f14[1], 1'b0, ena, f14[0])
`regscan (f14[2], 1'b0, ena, f14[1])

`regscan (f15[0], 1'b0, ena, f14[0] + (f14[1] << 1) + f14[2])
`regscan (f15[1], 1'b0, ena, f15[0])
`regscan (f15[2], 1'b0, ena, f15[1])

`regscan (f16[0], 1'b0, ena, f15[0] + (f15[1] << 1) + f15[2])
`regscan (f16[1], 1'b0, ena, f16[0])
`regscan (f16[2], 1'b0, ena, f16[1])

wire [30:0] sum = (f16[0] + (f16[1] << 1) + f16[2] + 1'b1) >> 1;

`regscan (sample, 8'b0, ena, sum[30:23])
Our Tukey window is 1/4 the logic and works 10x better. Maybe I didn't do it right, though.
"reg [2:0][8:0] f5" means there are three f5 registers (f5[0], f5[1], f5[2]) that are each 9 bits wide.
#define EVEN_BITS (0x55555555)
#define ODD_BITS (0xaaaaaaaa)
#define EVEN_PAIRS (0x33333333)
#define ODD_PAIRS (0xcccccccc)
#define EVEN_NIBBLES (0xf0f0f0f0)
#define ODD_NIBBLES (0x0f0f0f0f)
#define EVEN_BYTES (0x00ff00ff)
#define ODD_BYTES (0xff00ff00)
#define UINT unsigned int
#define DWORD unsigned int
#define MATH_TYPE float
#define HIWORD(arg) ((arg)&(0xffff0000))
#define LOWORD(arg) ((arg)&(0x0000ffff))

class propeller_adc
{
public:
    bool carry;
    unsigned int REG[8];
    void reset ();
    unsigned int iterate (bool sample);
    DWORD decimate1 (DWORD input);
    DWORD decimate2 (DWORD input);
    DWORD decimate3 (DWORD input);
    void print_bytes (int regid);
    void print_bytes2 (int regid);
    void print_nibbles ();
};
Here is one test case which shows that summation occurs with the correct weights. For those who REALLY want or need to sum 6116 bits to obtain a long-term moving average, you can still do so, but now with half the number of operations, since half the work has been done; i.e., try feeding this stream into a 3113 window where you are now doing 6116 additions. The result should be the same or better, even if I still need to work out carry propagation.
R0:00000000000000000000000000000001 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000000010 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000000100 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000001000 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000010000 R1/2: 0 0 0 0 0 0 0 4 R0:00000000000000000000000000100000 R1/2: 0 0 0 0 0 0 0 2 R0:00000000000000000000000001000000 R1/2: 0 0 0 0 0 0 2 2 R0:00000000000000000000000010000000 R1/2: 0 0 0 0 0 0 1 3 R0:00000000000000000000000100000000 R1/2: 0 0 0 0 0 0 4 0 R0:00000000000000000000001000000000 R1/2: 0 0 0 0 0 0 3 1 R0:00000000000000000000010000000000 R1/2: 0 0 0 0 0 0 2 2 R0:00000000000000000000100000000000 R1/2: 0 0 0 0 0 0 3 1 R0:00000000000000000001000000000000 R1/2: 0 0 0 0 0 4 0 0 R0:00000000000000000010000000000000 R1/2: 0 0 0 0 0 2 1 1 R0:00000000000000000100000000000000 R1/2: 0 0 0 0 2 2 0 0 R0:00000000000000001000000000000000 R1/2: 0 0 0 0 1 3 0 0 R0:00000000000000010000000000000000 R1/2: 0 0 0 0 4 0 0 0 R0:00000000000000100000000000000000 R1/2: 0 0 0 0 3 1 0 0 R0:00000000000001000000000000000000 R1/2: 0 0 0 0 2 2 0 0 R0:00000000000010000000000000000000 R1/2: 0 0 0 0 3 1 0 0 R0:00000000000100000000000000000000 R1/2: 0 0 0 4 0 0 0 0 R0:00000000001000000000000000000000 R1/2: 0 0 0 2 1 1 0 0 R0:00000000010000000000000000000000 R1/2: 0 0 2 2 0 0 0 0 R0:00000000100000000000000000000000 R1/2: 0 0 1 3 0 0 0 0 R0:00000001000000000000000000000000 R1/2: 0 0 4 0 0 0 0 0 R0:00000010000000000000000000000000 R1/2: 0 0 3 1 0 0 0 0 R0:00000100000000000000000000000000 R1/2: 0 0 2 2 0 0 0 0 R0:00001000000000000000000000000000 R1/2: 0 0 3 1 0 0 0 0 R0:00010000000000000000000000000000 R1/2: 0 4 0 0 0 0 0 0 R0:00100000000000000000000000000000 R1/2: 0 2 1 1 0 0 0 0 R0:01000000000000000000000000000000 R1/2: 2 2 0 0 0 0 0 0 R0:10000000000000000000000000000000 R1/2: 1 3 0 0 0 0 0 0 R0:00000000000000000000000000000011 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000000110 R1/2: 0 0 0 0 0 0 0 0 
R0:00000000000000000000000000001100 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000011000 R1/2: 0 0 0 0 0 0 0 4 R0:00000000000000000000000000110000 R1/2: 0 0 0 0 0 0 0 6 R0:00000000000000000000000001100000 R1/2: 0 0 0 0 0 0 2 4 R0:00000000000000000000000011000000 R1/2: 0 0 0 0 0 0 3 5 R0:00000000000000000000000110000000 R1/2: 0 0 0 0 0 0 5 3 R0:00000000000000000000001100000000 R1/2: 0 0 0 0 0 0 7 1 R0:00000000000000000000011000000000 R1/2: 0 0 0 0 0 0 5 3 R0:00000000000000000000110000000000 R1/2: 0 0 0 0 0 0 5 3 R0:00000000000000000001100000000000 R1/2: 0 0 0 0 0 4 3 1 R0:00000000000000000011000000000000 R1/2: 0 0 0 0 0 6 1 1 R0:00000000000000000110000000000000 R1/2: 0 0 0 0 2 4 1 1 R0:00000000000000001100000000000000 R1/2: 0 0 0 0 3 5 0 0 R0:00000000000000011000000000000000 R1/2: 0 0 0 0 5 3 0 0 R0:00000000000000110000000000000000 R1/2: 0 0 0 0 7 1 0 0 R0:00000000000001100000000000000000 R1/2: 0 0 0 0 5 3 0 0 R0:00000000000011000000000000000000 R1/2: 0 0 0 0 5 3 0 0 R0:00000000000110000000000000000000 R1/2: 0 0 0 4 3 1 0 0 R0:00000000001100000000000000000000 R1/2: 0 0 0 6 1 1 0 0 R0:00000000011000000000000000000000 R1/2: 0 0 2 4 1 1 0 0 R0:00000000110000000000000000000000 R1/2: 0 0 3 5 0 0 0 0 R0:00000001100000000000000000000000 R1/2: 0 0 5 3 0 0 0 0 R0:00000011000000000000000000000000 R1/2: 0 0 7 1 0 0 0 0 R0:00000110000000000000000000000000 R1/2: 0 0 5 3 0 0 0 0 R0:00001100000000000000000000000000 R1/2: 0 0 5 3 0 0 0 0 R0:00011000000000000000000000000000 R1/2: 0 4 3 1 0 0 0 0 R0:00110000000000000000000000000000 R1/2: 0 6 1 1 0 0 0 0 R0:01100000000000000000000000000000 R1/2: 2 4 1 1 0 0 0 0 R0:11000000000000000000000000000000 R1/2: 3 5 0 0 0 0 0 0 R0:10000000000000000000000000000001 R1/2: 1 3 0 0 0 0 0 0 R0:00000000000000000000000000000111 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000001110 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000011100 R1/2: 0 0 0 0 0 0 0 4 R0:00000000000000000000000000111000 R1/2: 0 0 0 0 0 0 0 6 
R0:00000000000000000000000001110000 R1/2: 0 0 0 0 0 0 2 8 R0:00000000000000000000000011100000 R1/2: 0 0 0 0 0 0 3 7 R0:00000000000000000000000111000000 R1/2: 0 0 0 0 0 0 7 5 R0:00000000000000000000001110000000 R1/2: 0 0 0 0 0 0 8 4 R0:00000000000000000000011100000000 R1/2: 0 0 0 0 0 0 9 3 R0:00000000000000000000111000000000 R1/2: 0 0 0 0 0 0 8 4 R0:00000000000000000001110000000000 R1/2: 0 0 0 0 0 4 5 3 R0:00000000000000000011100000000000 R1/2: 0 0 0 0 0 6 4 2 R0:00000000000000000111000000000000 R1/2: 0 0 0 0 2 8 1 1 R0:00000000000000001110000000000000 R1/2: 0 0 0 0 3 7 1 1 R0:00000000000000011100000000000000 R1/2: 0 0 0 0 7 5 0 0 R0:00000000000000111000000000000000 R1/2: 0 0 0 0 8 4 0 0 R0:00000000000001110000000000000000 R1/2: 0 0 0 0 9 3 0 0 R0:00000000000011100000000000000000 R1/2: 0 0 0 0 8 4 0 0 R0:00000000000111000000000000000000 R1/2: 0 0 0 4 5 3 0 0 R0:00000000001110000000000000000000 R1/2: 0 0 0 6 4 2 0 0 R0:00000000011100000000000000000000 R1/2: 0 0 2 8 1 1 0 0 R0:00000000111000000000000000000000 R1/2: 0 0 3 7 1 1 0 0 R0:00000001110000000000000000000000 R1/2: 0 0 7 5 0 0 0 0 R0:00000011100000000000000000000000 R1/2: 0 0 8 4 0 0 0 0 R0:00000111000000000000000000000000 R1/2: 0 0 9 3 0 0 0 0 R0:00001110000000000000000000000000 R1/2: 0 0 8 4 0 0 0 0 R0:00011100000000000000000000000000 R1/2: 0 4 5 3 0 0 0 0 R0:00111000000000000000000000000000 R1/2: 0 6 4 2 0 0 0 0 R0:01110000000000000000000000000000 R1/2: 2 8 1 1 0 0 0 0 R0:11100000000000000000000000000000 R1/2: 3 7 1 1 0 0 0 0 R0:11000000000000000000000000000001 R1/2: 3 5 0 0 0 0 0 0 R0:10000000000000000000000000000011 R1/2: 1 3 0 0 0 0 0 0 R0:00000000000000000000000000001111 R1/2: 0 0 0 0 0 0 0 0 R0:00000000000000000000000000011110 R1/2: 0 0 0 0 0 0 0 4 R0:00000000000000000000000000111100 R1/2: 0 0 0 0 0 0 0 6 R0:00000000000000000000000001111000 R1/2: 0 0 0 0 0 0 2 8 R0:00000000000000000000000011110000 R1/2: 0 0 0 0 0 0 3 b R0:00000000000000000000000111100000 R1/2: 0 0 0 0 0 0 7 7 
R0:00000000000000000000001111000000 R1/2: 0 0 0 0 0 0 a 6 R0:00000000000000000000011110000000 R1/2: 0 0 0 0 0 0 a 6 R0:00000000000000000000111100000000 R1/2: 0 0 0 0 0 0 c 4 R0:00000000000000000001111000000000 R1/2: 0 0 0 0 0 4 8 4 R0:00000000000000000011110000000000 R1/2: 0 0 0 0 0 6 6 4 R0:00000000000000000111100000000000 R1/2: 0 0 0 0 2 8 4 2 R0:00000000000000001111000000000000 R1/2: 0 0 0 0 3 b 1 1 R0:00000000000000011110000000000000 R1/2: 0 0 0 0 7 7 1 1 R0:00000000000000111100000000000000 R1/2: 0 0 0 0 a 6 0 0 R0:00000000000001111000000000000000 R1/2: 0 0 0 0 a 6 0 0 R0:00000000000011110000000000000000 R1/2: 0 0 0 0 c 4 0 0 R0:00000000000111100000000000000000 R1/2: 0 0 0 4 8 4 0 0 R0:00000000001111000000000000000000 R1/2: 0 0 0 6 6 4 0 0 R0:00000000011110000000000000000000 R1/2: 0 0 2 8 4 2 0 0 R0:00000000111100000000000000000000 R1/2: 0 0 3 b 1 1 0 0 R0:00000001111000000000000000000000 R1/2: 0 0 7 7 1 1 0 0 R0:00000011110000000000000000000000 R1/2: 0 0 a 6 0 0 0 0 R0:00000111100000000000000000000000 R1/2: 0 0 a 6 0 0 0 0 R0:00001111000000000000000000000000 R1/2: 0 0 c 4 0 0 0 0 R0:00011110000000000000000000000000 R1/2: 0 4 8 4 0 0 0 0 R0:00111100000000000000000000000000 R1/2: 0 6 6 4 0 0 0 0 R0:01111000000000000000000000000000 R1/2: 2 8 4 2 0 0 0 0 R0:11110000000000000000000000000000 R1/2: 3 b 1 1 0 0 0 0 R0:11100000000000000000000000000001 R1/2: 3 7 1 1 0 0 0 0 R0:11000000000000000000000000000011 R1/2: 3 5 0 0 0 0 0 0 R0:10000000000000000000000000000111 R1/2: 1 3 0 0 0 0 0 0
OK - this part is starting to look pretty solid. Phase two and Phase three are a work in progress.
DWORD propeller_adc::decimate2 (DWORD input)
{
	// phase two - apply the same transformation
	// to the 16 three bit values, yielding
	// eight five bit values, having a range [0..32]
	DWORD q0, q1, q2, q3;
	DWORD s0, s1;
	q0 = REG[1]&EVEN_NIBBLES;
	q1 = (REG[1]&ODD_NIBBLES)>>4;
	q2 = REG[2]&EVEN_NIBBLES;
	q3 = (REG[2]&ODD_NIBBLES)>>4;
	// to do: fix nibble order to get things in correct bins, although this gives an interesting
	// high peaking response for quick settling time on transient input with no effect on
	// the long term average
	s0 = q0+q1+(q2<<1);
	s1 = q0+q1+(q3<<1);
	REG[3]=s1>>4;
	REG[4]=s0>>4;
	return 0;
}

DWORD propeller_adc::decimate3 (DWORD input)
{
	DWORD acc;
	// finally repack into a 32 bit register
	// for an input rate of 250Mbps - this results
	// in an initial output rate of 31.25
	DWORD q0, q1, q2, q3;
	DWORD s0, s1;
	q0 = REG[3]&EVEN_BYTES;
	q1 = (REG[3]&ODD_BYTES)>>8;
	q2 = REG[4]&EVEN_BYTES;
	q3 = (REG[4]&ODD_BYTES)>>8;
	// HMMM... FIXME? DEFINITELY BROKEN HERE!!
	s0 = q0+q1+(q2<<1);
	s1 = q0+q1+(q3<<1);
	REG[5]=s1;
	REG[6]=s0;
	// parentheses are required: + binds tighter than <<,
	// so s0<<16+s1 would shift s0 by (16+s1)
	acc = (s0<<16)+s1;
	REG[7]=acc;
	return acc;
}
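The final packing step in decimate3 is worth a closer look. In C and C++ the `+` operator binds tighter than `<<`, so an expression of the form `s0<<16+s1` shifts `s0` by `16+s1` bits instead of placing `s0` in the upper half-word. The sketch below contrasts the two readings. `DWORD` as an unsigned 32-bit type and the 16-bit range check are assumptions made for the example, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using DWORD = std::uint32_t;  // assumption: DWORD is an unsigned 32-bit word

// What the compiler makes of `s0<<16+s1`: the shift count is (16 + s1).
DWORD pack_as_parsed(DWORD s0, DWORD s1) {
    assert(s1 < 16u);  // keep the shift count below 32 so the result is defined
    return s0 << (16u + s1);
}

// The packing that was presumably intended: s0 in the high half, s1 in the low half.
DWORD pack_halves(DWORD s0, DWORD s1) {
    assert(s0 <= 0xFFFFu && s1 <= 0xFFFFu);  // assumption: both sums fit in 16 bits
    return (s0 << 16) + s1;
}
```

With s0 = 3 and s1 = 5, pack_halves gives $00030005 while pack_as_parsed gives $00600000.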
It seems that even without SINC3 and the 4-channel-scope-per-cog, we had already grown quite a bit.
Here is where we are at:
Note: 'sequential' means flipflop
Note: 'area' is square um, so 1,000,000 = 1 square mm

Original Design, current P2 silicon:

Type            Instances            Area  Area %
-------------------------------------------------
timing_model           92    37049021.213    72.3
sequential          58246     4655514.931     9.1
inverter            74356      737807.974     1.4
buffer              15359      242666.189     0.5
logic              469886     8569315.686    16.7
physical_cells          0           0.000     0.0
-------------------------------------------------
total              617939    51254325.994   100.0

New Design with SINC3 and SCOPE:

Type            Instances            Area  Area %
-------------------------------------------------
timing_model           91    36815952.439    69.0
sequential          61112     4871372.083     9.1
inverter            92183      908669.798     1.7
buffer              21647      339857.101     0.6
logic              559646    10389258.163    19.5
physical_cells          0           0.000     0.0
-------------------------------------------------
total              734679    53325109.584   100.0

New Design without SINC3 smart pin

Type            Instances            Area  Area %
-------------------------------------------------
timing_model           91    36815952.439    69.1
sequential          61112     4881813.709     9.2
inverter            90045      889771.008     1.7
buffer              21978      340910.797     0.6
logic              554207    10318409.651    19.4
physical_cells          0           0.000     0.0
-------------------------------------------------
total              727433    53246857.604   100.0

New Design without 4-channel SCOPE per cog

Type            Instances            Area  Area %
-------------------------------------------------
timing_model           91    36815952.439    70.1
sequential          59129     4729178.317     9.0
inverter            83262      819040.410     1.6
buffer              19948      308145.869     0.6
logic              530745     9811739.930    18.7
physical_cells          0           0.000     0.0
-------------------------------------------------
total              693175    52484056.964   100.0

Cost of SINC3
----------------------
sequential           0
inverter          2138
buffer            -331
logic             5439
----------------------
total             7246
area             78252

Cost of SCOPE
----------------------
sequential        1983
inverter          8921
buffer            1669
logic            28901
----------------------
total            41474
area            841052
Note that SINC3 turns out to be very little logic. 5439/64 is only 85 logic cells per smart pin added. And it didn't use any new flops, since the smart pin supplied them.
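For anyone who wants to redo the per-unit arithmetic from the synthesis reports, here is a minimal sketch. It only subtracts the logic instance counts quoted in the reports and divides by the number of units. The function names are hypothetical, and the unit counts (64 smart pins, 4 channels x 8 cogs) are the ones used in this thread.

```cpp
#include <cassert>

// Extra cells a feature costs: count with the feature minus count without it.
long cell_delta(long with_feature, long without_feature) {
    return with_feature - without_feature;
}

// Average cost per unit (per smart pin, or per scope channel).
double cells_per_unit(long delta, int units) {
    assert(units > 0);
    return static_cast<double>(delta) / units;
}

// Logic instance counts copied from the reports in this thread.
const long kLogicFull         = 559646;  // with SINC3 and SCOPE
const long kLogicWithoutSinc3 = 554207;
const long kLogicWithoutScope = 530745;
```

cell_delta(kLogicFull, kLogicWithoutSinc3) is 5439, about 85 cells for each of 64 smart pins. The same subtraction against the no-SCOPE build gives 28901, about 903 cells for each of 4 x 8 = 32 channels.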
The 4-channel scope, on the other hand is a real pig. 28901/4channels/8cogs = 903 logic cells per channel, which seems WAY too big. I'm wondering two things:
(1) Is the tool generating a lot of extra circuitry in order to make timing? If I pipelined the 1's counts before final summing (requires < 36 flipflops per channel), might things relax and would net logic/buffering requirements go down?
(2) If this scope function went into each smart pin, it wouldn't need any new flops and things could be pipelined to relax timing. However, each cog would need to mux in 4 channels of a new 8-bit bus coming from each smart pin. And there would be twice as many Tukey filters, only half of which could ever be used at once. However, they could be instantly mux'd and filtered samples would be forthcoming, without waiting for the Tukey filter to refill. Maybe, rather than 4 random pins, you would select a group of 4 pins, differing only in the two LSBs of their pin numbers. That would lighten the mux'ing problem.
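The pin-grouping idea in (2) comes down to masking off the two LSBs of the pin number. A minimal sketch follows, assuming pins are numbered 0..63. The helper names are hypothetical and are not taken from any P2 tool or document.

```cpp
#include <cassert>

// First pin of the 4-pin group that contains `pin` (two LSBs cleared).
unsigned group_base(unsigned pin) {
    assert(pin < 64u);  // assumption: 64 smart pins, numbered 0..63
    return pin & ~3u;
}

// k-th pin (k = 0..3) of the group that contains `pin`.
unsigned group_member(unsigned pin, unsigned k) {
    assert(k < 4u);
    return group_base(pin) | k;
}

// Two pins share a group when they differ only in their two LSBs.
bool same_group(unsigned a, unsigned b) {
    return (a >> 2) == (b >> 2);
}
```

Selecting a group then needs only the upper 4 bits of a pin number, so each cog would pick 1 of 16 groups instead of 4 independent pins out of 64.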
I think the first thing I need to do is see how much I can squeeze the Tukey filter logic.
Just had a realization that we don't need new 8-bit buses out of the smart pins. Instead, the pin just updates the result on every clock, so that a RDPIN at any time, from any cog, gets the immediate 8-bit conversion for that pin.
So, RDPIN would always read the ADC value and those same result outputs from the smart pins could be gathered, lower bytes, only, to get parallel ADC samples for streamer recording.
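Gathering the lower bytes of four results into one long could look like the sketch below. The byte order (channel 0 in the lowest byte) is an assumption for illustration, not something stated in this thread, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Pack four 8-bit samples into one 32-bit word.
// Assumption: channel 0 lands in bits 7..0, channel 3 in bits 31..24.
std::uint32_t pack4(std::uint8_t s0, std::uint8_t s1,
                    std::uint8_t s2, std::uint8_t s3) {
    return  static_cast<std::uint32_t>(s0)
         | (static_cast<std::uint32_t>(s1) << 8)
         | (static_cast<std::uint32_t>(s2) << 16)
         | (static_cast<std::uint32_t>(s3) << 24);
}

// Pull one channel's sample back out of the packed word.
std::uint8_t unpack(std::uint32_t packed, unsigned channel) {
    assert(channel < 4u);
    return static_cast<std::uint8_t>((packed >> (8u * channel)) & 0xFFu);
}
```

Because all four bytes come from the same clock, the samples in one long stay time-aligned.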
One other thing. The scope-trigger mechanism could go into the smart pin to alert when a trigger event occurs, raising IN.
I'm not following all the details here, but the figures you reported earlier on the PAD-Ring test chip had quite small sample counts on Sinc3 (i.e. high ADC conversion rates, but low bit-counts).
If Sinc3 can adjust to low bit values, is that not equivalent to an ADC-bandwidth-limited, low-bit-scope Tukey pathway?
Then you just need a means to capture the 4 outputs? (Which I think is what you are saying above.)
As for the new features related to the Tukey filters, could at least some part of the new 4-pin groupings and triggering stuff be leveraged at the streamers too, to ease in some way the communications with QSPI, octa SPI and even HyperBus-enabled devices?
Tukey had better quality than Sinc8.
Can the four 8-bit Tukey values still be read as one 32-bit value? And can this be streamed?
I'm wondering whether the pair symmetry in the ramp values is reflected in the logic minimization, e.g. 1 & 31, 3 & 29, 5 & 27, etc., all add up to 32. I've looked at making them add up to 31 with bits inverted to halve the taps, e.g. 1 & 30, 3 & 28, 5 & 26; adding a pair can then be done by a simple OR. The problem is that the plateau value is now 31 and n+½ plateau bits are needed for the whole thing to sum to a multiple of 256 minus 1 when all bits are set.
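To make the OR trick concrete: a 5-bit weight and its bitwise complement never share a set bit, so ORing them gives the same result as adding them. The sketch below checks this for every 5-bit weight and also confirms that the original odd ramp pairs total 32. All names are hypothetical and the code only illustrates the arithmetic, not any actual smart pin logic.

```cpp
#include <cassert>

// Odd ramp weights 1, 3, ..., 31 for k = 0..15.
unsigned ramp(unsigned k) {
    assert(k < 16u);
    return 2u * k + 1u;
}

// 5-bit complement: the partner that makes a pair total 31 (1 -> 30, 3 -> 28, ...).
unsigned partner31(unsigned v) {
    assert(v < 32u);
    return ~v & 31u;
}

// Contribution of one tap pair, combined with OR instead of an adder.
unsigned pair_sum_or(bool bit_a, bool bit_b, unsigned weight_a) {
    return (bit_a ? weight_a : 0u) | (bit_b ? partner31(weight_a) : 0u);
}

// True when every mirrored pair of the original ramp adds up to 32.
bool pairs_total_32() {
    for (unsigned k = 0; k < 16u; ++k)
        if (ramp(k) + ramp(15u - k) != 32u) return false;
    return true;
}

// True when OR equals ADD for every 5-bit weight and every input bit combination.
bool or_matches_add() {
    for (unsigned v = 0; v < 32u; ++v)
        for (unsigned a = 0; a < 2u; ++a)
            for (unsigned b = 0; b < 2u; ++b) {
                unsigned sum = (a ? v : 0u) + (b ? partner31(v) : 0u);
                if (pair_sum_or(a != 0u, b != 0u, v) != sum) return false;
            }
    return true;
}
```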
Another thought I had was to set the midpoint of the ramp, currently 16, to zero and having -15 & +15, -13 & +13, etc., as the pairs, again to halve the taps. The previous max tap of 32 would then be +16. However, the arithmetic would not be two's complement as used by other smart pin modes.
I also had an idea for using a counter for the plateau values instead of adding them individually and I'll try to find my post about it.
DWORD propeller_adc::decimate1B (DWORD input)
{
	DWORD q2, q3;
	REG[0] = input;
	q2 = (input&EVEN_BITS)<<1;
	q3 = ((input&ODD_BITS)+((input&ODD_BITS)>>2))>>1;
	REG[1] = ((q2&ODD_PAIRS) + (q3&ODD_PAIRS))>>2;
	REG[2] = (q2&EVEN_PAIRS) + (q3&EVEN_PAIRS);
	return 0;
}

 386:FFT_test.c **** DWORD propeller_adc::decimate1B (DWORD input)
1208 0545 55AAAAAA    mvi     r5,#-1431655766
1208      AA
1209 054a 1514        and     r5, r1
1211 054c 0B7065      xmov    r7,r0 mov r6,r5
1212 054f 2740        add     r7, #4
1214 0551 262A        shr     r6, #2
1215 0553 1650        add     r6, r5
1216 0555 261A        shr     r6, #1
1218 0557 54CCCCCC    mvi     r4,#-858993460
1218      CC
1219 055c D55644      xmov    r5,r6 and r5,r4
1220 055f E33080      xmov    r3,r0 add r3,#8
1222 0562 20C0        add     r0, #12
1225 0564 117F        wrlong  r1, r7
1227 0566 57555555    mvi     r7,#1431655765
1227      55
1229 056b 1714        and     r7, r1
1230 056d 2719        shl     r7, #1
1233 056f 1474        and     r4, r7
1234 0571 1540        add     r5, r4
1235 0573 252A        shr     r5, #2
1236 0575 153F        wrlong  r5, r3
1238 0577 55333333    mvi     r5,#858993459
1238      33
1239 057c 1654        and     r6, r5
1241 057e 1754        and     r7, r5
1243 0580 1670        add     r6, r7
1244 0582 160F        wrlong  r6, r0
1247 0584 B0          mov     r0, #0
1248 0585 02          lret
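The mask constants are not shown in the source, but the immediates in the listing suggest their values: -1431655766 is $AAAAAAAA (ODD_BITS), 1431655765 is $55555555 (EVEN_BITS), -858993460 is $CCCCCCCC (ODD_PAIRS) and 858993459 is $33333333 (EVEN_PAIRS). On that assumption, a standalone sketch of the first stage that can be run on a PC might look like this. The Stage1 struct replaces the REG[] array and is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using DWORD = std::uint32_t;  // assumption: DWORD is an unsigned 32-bit word

// Mask values inferred from the immediates in the assembly listing.
const DWORD EVEN_BITS  = 0x55555555u;
const DWORD ODD_BITS   = 0xAAAAAAAAu;
const DWORD EVEN_PAIRS = 0x33333333u;
const DWORD ODD_PAIRS  = 0xCCCCCCCCu;

// Hypothetical stand-in for REG[1] and REG[2].
struct Stage1 {
    DWORD reg1;
    DWORD reg2;
};

// Same arithmetic as decimate1B, returning the two results instead of storing them.
Stage1 decimate1B(DWORD input) {
    DWORD q2 = (input & EVEN_BITS) << 1;
    DWORD q3 = ((input & ODD_BITS) + ((input & ODD_BITS) >> 2)) >> 1;
    assert((q2 & EVEN_BITS) == 0u);  // the shifted even bits occupy odd positions only
    Stage1 out;
    out.reg1 = ((q2 & ODD_PAIRS) + (q3 & ODD_PAIRS)) >> 2;
    out.reg2 = (q2 & EVEN_PAIRS) + (q3 & EVEN_PAIRS);
    return out;
}
```

Feeding single-bit inputs through this on a desktop compiler is a quick way to see which bin each input bit lands in before committing to the Propeller build.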
Thus it appears that the first stage requires about 40 words (~80 bytes) in unrolled GCC Propeller 1 assembly, comprising ~26 instructions, with stages two and three expected to be similar when fully debugged and optimized. At that point I expect to have noisy data of 7 bits or so running at sysclock/8, which can be summed to give numbers identical to what others are getting. So instead of counting 6116 samples at sysclock, you would get 4 pre-filtered values every 32 clocks, which reduces the number of filtered samples that must be summed for a 12-bit window from 6116 to 764.5. Or you can store the samples and run an FFT, a linear regression, a median filter, a Student's t-test, or whatever you want on the data.
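The sample-count arithmetic is just a division by the decimation factor. A tiny sketch, with hypothetical function names and the figures quoted in the post:

```cpp
#include <cassert>

// Number of pre-filtered values that replace `raw_samples` raw 1-bit samples.
double prefiltered_count(double raw_samples, int decimation) {
    assert(decimation > 0);
    return raw_samples / decimation;
}

// Pre-filtered values delivered in a window of `clocks` system clocks.
int values_per_window(int clocks, int decimation) {
    assert(decimation > 0);
    return clocks / decimation;
}
```

With a decimation of 8, prefiltered_count(6116, 8) is 764.5 and values_per_window(32, 8) is 4.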
That's a whole new trick for the Prop1. The ramping isn't every sysclock but maybe that isn't so terrible. Certainly his results look great. And, in theory, the shape can be more complex.
Guess what? The Prop2 has no equivalent mode!
Here's a commented snippet: (Inputs A and B are the same pin in James's test code)
              mov     i, looplength
              mov     j, looplength
              mov     frqb, #1              ' start rectangular (flat +1 increment) for input B
uploop        add     frqa, #1              ' start triangular (ramp the increment up and down) for input A
              djnz    i, #uploop            ' ramp up
downloop      sub     frqa, #1
              djnz    j, #downloop          ' ramp down
              mov     frqb, #0              ' stop rectangular for input B
              mov     frqa, #0              ' stop triangular for input A
              wrlong  phsa, adcp_tri        ' post triangular sample
              wrlong  phsb, adcp_rect       ' post rectangular sample
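For readers without a Prop1 at hand, the effect of the snippet can be modelled on a PC. The sketch below is a simplification that assumes one bitstream sample per loop iteration, whereas the real counters accumulate on every system clock while the increment changes only once per loop pass. The function names are hypothetical; `frqa`, `phsa` and `phsb` are used only as local variable names that mirror the registers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Increment seen by the counter on each loop pass: 1..n on the way up, n-1..0 on the way down.
std::vector<std::uint32_t> triangle_weights(std::uint32_t n) {
    std::vector<std::uint32_t> w;
    w.reserve(2u * n);
    std::uint32_t frqa = 0;
    for (std::uint32_t i = 0; i < n; ++i) w.push_back(++frqa);  // uploop
    for (std::uint32_t j = 0; j < n; ++j) w.push_back(--frqa);  // downloop
    return w;
}

// Rectangular window: every high bit adds 1.
std::uint64_t rect_sum(const std::vector<std::uint8_t>& bits) {
    std::uint64_t phsb = 0;
    for (std::uint8_t b : bits) phsb += (b & 1u);
    return phsb;
}

// Triangular window: every high bit adds the current ramp value.
std::uint64_t tri_sum(const std::vector<std::uint8_t>& bits) {
    assert(bits.size() % 2 == 0);  // up and down ramps have equal length
    const std::vector<std::uint32_t> w =
        triangle_weights(static_cast<std::uint32_t>(bits.size() / 2));
    std::uint64_t phsa = 0;
    for (std::size_t k = 0; k < bits.size(); ++k)
        if (bits[k] & 1u) phsa += w[k];
    return phsa;
}
```

For an all-ones input of length 2n the triangular sum is n squared while the rectangular sum is 2n, so the two results need different scaling before they can be compared.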
You can use RDPIN to read a sample, and then the streamer will be able to group four together, time-aligned.
Very interesting idea about offsetting.
Now that this thing is in the smart pin, it's pipelined, since we have the flops already there. That should relax any timing pressure. It's also no problem to check for $100 and swap out $FF, instead, so I changed that center $1F to $20.
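The $100 to $FF swap can be sketched as a saturating conversion. This assumes the window weights total 1024 (which is what the tap values posted in this thread add up to) and that the 8-bit result is the weighted sum divided by 4. Both are inferences for illustration, not a description of the actual Verilog, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Convert a weighted sum in 0..1024 to an 8-bit sample, saturating $100 to $FF.
std::uint8_t tukey_to_byte(std::uint32_t weighted_sum) {
    assert(weighted_sum <= 1024u);                   // assumption: weights total 1024
    const std::uint32_t scaled = weighted_sum >> 2;  // 0..$100
    return static_cast<std::uint8_t>(scaled > 0xFFu ? 0xFFu : scaled);
}
```

Only the all-ones input can reach $100, so the clamp affects a single code at the very top of the range.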
I made this diagram of inc's and dec's for when bits move around. There has got to be some good optimization possible here:
tap   value   bit5  bit4  bit3  bit2  bit1  bit0
------------------------------------------------
  0   000001                                  +1
  1   000011                            +1
  2   000101                      +1    -1
  3   000111                            +1
  4   001010                +1    -1          -1
  5   001101                      +1    -1    +1
  6   010000          +1    -1    -1          -1
  7   010011                            +1    +1
  8   010110                      +1          -1
  9   011001                +1    -1    -1    +1
 10   011011                            +1
 11   011101                      +1    -1
 12   011111                            +1
 13   100000    +1    -1    -1    -1    -1    -1
 14   100000
 15   100000
 16   100000
 17   100000
 18   100000
 19   100000
 20   100000
 21   100000
 22   100000
 23   100000
 24   100000
 25   100000
 26   100000
 27   100000
 28   100000
 29   100000
 30   100000
 31   100000
 32   011111    -1    +1    +1    +1    +1    +1
 33   011101                            -1
 34   011011                      -1    +1
 35   011001                            -1
 36   010110                -1    +1    +1    -1
 37   010011                      -1          +1
 38   010000                            -1    -1
 39   001101          -1    +1    +1          +1
 40   001010                      -1    +1    -1
 41   000111                -1    +1          +1
 42   000101                            -1
 43   000011                      -1    +1
 44   000001                            -1
    (000000)                                  -1

value    #   position
------------------------
000001   2   0,44
000011   2   1,43
000101   2   2,42
000111   2   3,41
001010   2   4,40
001101   2   5,39
010000   2   6,38
010011   2   7,37
010110   2   8,36
011001   2   9,35
011011   2   10,34
011101   2   11,33
011111   2   12,32
100000  19   13..31

#bit5 = 19
#bit4 = 14
#bit3 = 12
#bit2 = 12
#bit1 = 14
#bit0 = 20
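The inc/dec entries and the per-bit counts can be regenerated from the tap values alone, which is handy when experimenting with other window shapes. The sketch below rebuilds the 45 taps from the rising edge, totals them, counts how many taps have each bit set, and derives the +1/-1 for a given tap and bit by comparing with the previous tap. The names are hypothetical; the tap values are the ones in the table.

```cpp
#include <cassert>

// Rising edge of the window, taps 0..12; the falling edge (taps 32..44) mirrors it.
const unsigned kRamp[13] = {1, 3, 5, 7, 10, 13, 16, 19, 22, 25, 27, 29, 31};
const unsigned kTapCount = 45;  // taps 0..44
const unsigned kPlateau  = 32;  // %100000, taps 13..31

unsigned tap_value(unsigned tap) {
    assert(tap < kTapCount);
    if (tap < 13u) return kRamp[tap];
    if (tap < 32u) return kPlateau;
    return kRamp[44u - tap];
}

// Sum of all tap weights, i.e. the output for an all-ones input.
unsigned tap_total() {
    unsigned total = 0;
    for (unsigned t = 0; t < kTapCount; ++t) total += tap_value(t);
    return total;
}

// How many taps have the given bit (0..5) set in their weight.
unsigned taps_with_bit(unsigned bit) {
    unsigned count = 0;
    for (unsigned t = 0; t < kTapCount; ++t) count += (tap_value(t) >> bit) & 1u;
    return count;
}

// +1, -1 or 0: how the given bit changes when a set input bit moves into `tap`.
int bit_delta(unsigned tap, unsigned bit) {
    const unsigned prev = (tap == 0u) ? 0u : tap_value(tap - 1u);
    const unsigned cur  = tap_value(tap);
    return static_cast<int>((cur >> bit) & 1u) - static_cast<int>((prev >> bit) & 1u);
}
```

tap_total() comes out at 1024, and taps_with_bit() reproduces the #bit5..#bit0 counts of 19, 14, 12, 12, 14 and 20.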
We really need to optimize the Tukey-filter summing logic. Some huge optimization(s) must be possible.