Digital filtering approaches on the Prop2
Mark_T
Posts: 1,981
in Propeller 2
I thought I'd start a thread to gather any techniques and ideas for implementing digital filters using the P2
There's clearly a variety of approaches available on the P2, whose multiply and CORDIC instructions go well beyond the P1's abilities.
A while back I posted a tool for generating IIR filter code on the P1, and there's Phil Pilgrim's FIR2PASM for the P1: forums.parallax.com/discussion/133173/fir2pasm-automatic-fir-filter-code-generator/p1
Both have to work around the lack of a multiply instruction (with shifts and adds etc).
So lately I've been thinking about ways to implement IIRs (both second-order sections (biquads) and lattice filters) using the CORDIC pipeline. That's still a work in progress.
Over the last couple of days I decided to tackle FIR filters using 16-bit inputs/outputs/coefficients and the MULS instruction. My latest effort is included below. Basically I realized the FIFO memory accesses allow the tapped delay line to live in hub ram, with the coefficients in cog ram, meaning this simple fragment can act as the multiply-accumulate engine at the heart of the filter:
        rfword  t
        muls    t, coeffN
        add     sum, t

given that RDFAST has been set up for the tapped delays. That's a 3-instruction overhead per tap; on a 160MHz sysclk that's about a 250kSPS processing rate for a 100-tap FIR filter.
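As a rough check on that figure, here is a small Python model of the per-sample work and its cost. It assumes each of the three instructions takes 2 clocks (my assumption, not stated in the post) and ignores all per-sample housekeeping, which is why it lands a little above the 250kSPS quoted. The function names are made up for illustration.

```python
def fir_mac(taps, coeffs):
    """Model of the unrolled RFWORD/MULS/ADD chain for one output sample."""
    total = 0
    for t, c in zip(taps, coeffs):  # one rfword + muls + add per tap
        total += t * c              # signed 16-bit x 16-bit product
    return total


def max_sample_rate(sysclk_hz, n_taps, instr_per_tap=3, clocks_per_instr=2):
    """Upper bound on samples/s, ignoring all per-sample overhead.

    clocks_per_instr=2 is an assumption about P2 instruction timing.
    """
    return sysclk_hz / (n_taps * instr_per_tap * clocks_per_instr)


rate_100 = max_sample_rate(160_000_000, 100)
print(round(rate_100))  # upper bound for a 100-tap filter at 160MHz
```

The gap between this bound and the real rate is the input/output handling and the occasional copyback.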
It's never that simple of course: the values have to shift down the array of taps on each sample, which naively would add a lot of overhead for shifting the data. However, once the taps are in hub memory you can slide a window over memory instead, letting the active set of taps slide along. At some point you need to
copy them back to prevent the whole of memory being overwritten, but for a 100-tap filter you can afford to do this only occasionally, and in fact I do it
every 32 samples, as 32 words take up one 64-byte FIFO block for fast reads/writes.
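The sliding-window idea can be modelled in a few lines of Python. This is a simplified sketch, not a transliteration of the PASM: the class and function names are hypothetical, the buffer is a plain list rather than hub ram, and the copyback is a single slice copy rather than FIFO block transfers. It only shows that sliding the window and copying back once per block gives the same output as shifting the whole delay line on every sample.

```python
class SlidingWindowFir:
    """FIR delay line kept in a flat buffer with a sliding start index."""

    def __init__(self, coeffs, block=32):
        self.coeffs = list(coeffs)
        self.n = len(self.coeffs)
        self.block = block
        # n taps plus room for the window to slide block-1 places
        self.buf = [0] * (self.n + block - 1)
        self.pos = block - 1          # where the next sample is written

    def step(self, x):
        self.buf[self.pos] = x        # newest sample at the front of the window
        acc = 0
        for k, c in enumerate(self.coeffs):
            acc += c * self.buf[self.pos + k]
        if self.pos == 0:
            # Window has reached the bottom: copy the n-1 samples still
            # needed back up by one block, then restart the slide.
            self.buf[self.block:self.block + self.n - 1] = self.buf[:self.n - 1]
            self.pos = self.block - 1
        else:
            self.pos -= 1
        return acc


def naive_fir(coeffs, samples):
    """Reference version: shift the whole delay line on every sample."""
    line = [0] * len(coeffs)
    out = []
    for x in samples:
        line = [x] + line[:-1]
        out.append(sum(c * t for c, t in zip(coeffs, line)))
    return out


coeffs = [3, -1, 4, 1, -5]
samples = [((i * 37) % 23) - 11 for i in range(100)]
fir = SlidingWindowFir(coeffs, block=4)
sliding_out = [fir.step(x) for x in samples]
print(sliding_out == naive_fir(coeffs, samples))
```

The copy costs n-1 moves once per block instead of on every sample, which is where the saving comes from.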
I gave up hope of clever ways to use the typical symmetry of FIR coefficients for optimization, as the FIFO memory access is unidirectional.
I also realized the LUT ram could be used for the unrolled loop of the 3-instruction fragment above, allowing 512/3 = 170 max tap count.
Before that I was limited to about 100 taps as coefficients, housekeeping code and unrolled loop all had to coexist in 496 cog longs.
So I store up to 170 coefficients in cog ram, 170+32 words of tap vector in hub ram, and up to 511 instructions' worth of unrolled loop in LUT ram. The code generates the unrolled loop depending on the tap count, and loads the coefficients from a hub ram word array at startup. Arguments are the tap count, the array of word coefficients, and a pointer to the hub ram workspace.
The demonstration code synthesizes a simple step waveform using the CT register as input, and drives a couple of smartpins in dither DAC mode to present the input and output waveforms for 'scope perusal. I've got a 169-tap example filter running at 96kSPS (fairly close to the limit as it happens), so it's ready to play with.
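The LUT budget is easy to check with a sketch that builds the unrolled engine as a text listing. This is only an illustration in Python: `engine_listing` and `max_taps` are hypothetical helpers that emit strings, not an assembler, and the `coeff+k` operand spelling is just a readable stand-in for the per-tap coefficient register that the real code patches in with SETS.

```python
LUT_LONGS = 512  # size of LUT ram in longs


def max_taps(instr_per_tap=3, trailer=1):
    """Largest tap count whose unrolled loop plus final jump fits in LUT."""
    return (LUT_LONGS - trailer) // instr_per_tap


def engine_listing(n_taps):
    """Return the unrolled RFWORD/MULS/ADD sequence as lines of text."""
    if n_taps > max_taps():
        raise ValueError("too many taps for LUT ram")
    lines = []
    for k in range(n_taps):
        lines.append("rfword  t")
        lines.append(f"muls    t, coeff+{k}")   # one coefficient register per tap
        lines.append("add     sum, t")
    lines.append("jmp     #\\engine_done")      # return to the cog-ram code
    return lines


print(max_taps())
print(len(engine_listing(170)))
```

With 3 instructions per tap plus the closing jump, 170 taps use 511 of the 512 LUT longs.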
' fast_fifo_firfilter.spin2
'
' Mark Tillotson, 2019-01-19
'
' 16 bit FIR filter cog test
'
' Use RDFAST and Hub delay tab vector, plus coefficients vector in COG ram, with generated unrolled
' instruction stream in LUT ram to perform RFWORD/MULS/ADD per tap
' hub array is regularly copied down when has moved a FIFO block's worth (64 bytes = 32 words)
' rather than shifting it every sample.
'
' the unidirectional nature of the FIFO hub access prevents easy exploitation of symmetric/antisymmetric
' structure of FIR coefficients.
' Manages 96kSPS rate for test filter of 169 coefficients, one cog
'
' using WAITCT1 and a synthesized step waveform derived from CT register, outputs to 16 bit dithered DAC
' on pin 48
'
' test filter has cutoff at 0.1 * Fs, constructed using Python scipy.signal library's firwin() function

CON
  OSCMODE = $010c3f04
  FREQ = 160_000_000

  MAXN = 170      ' 170 * 3 = 510, so 170 is maximum for 3 instructions/tap in LUT ram

  DACpin = 48     ' output DAC
  DACpin2 = 50    ' input waveform DAC
  dither = %0000_0000_000_10100_00000000_01_00010_0   ' remember to wypin

  FIFO_POWER = 6
  FIFO_BLK = 1 << FIFO_POWER

OBJ

VAR
  WORD temp [MAXN + FIFO_BLK/2]   ' work space for taps vector

  LONG num_taps                   ' arg block to cog
  LONG tap_buffer
  LONG coeff_buffer

PUB demo | i
  clkset (OSCMODE, FREQ)
  ' setup arguments
  num_taps := 169
  tap_buffer := @temp
  coeff_buffer := @coefficients
  ' start filter running
  cognew (@fir_filter, @num_taps)

  repeat
    pausems(10)

DAT
' from Python's scipy.signal library: coeffs = firwin (177, 0.1, window='hanning')
' then scaled by 2^15 and rounded to int and leading/trailing zeroes removed to leave 169 coefficients
coefficients
        WORD 1, 1, 1, 1, 0, -1, -2, -4, -6, -7, -8, -8, -7, -4, 0, 5, 11, 17, 22, 26, 27, 25, 20, 11, 0, -13, -28, -41
        WORD -52, -59, -60, -54, -42, -24, 0, 27, 55, 80, 100, 112, 113, 102, 78, 44, 0, -49, -98, -142, -177, -196, -197
        WORD -177, -136, -75, 0, 84, 169, 245, 305, 340, 342, 308, 238, 133, 0, -151, -307, -452, -571, -647, -665, -615
        WORD -488, -282, 0, 349, 751, 1187, 1635, 2070, 2468, 2805, 3062, 3222, 3277, 3222, 3062, 2805, 2468, 2070, 1635
        WORD 1187, 751, 349, 0, -282, -488, -615, -665, -647, -571, -452, -307, -151, 0, 133, 238, 308, 342, 340, 305, 245
        WORD 169, 84, 0, -75, -136, -177, -197, -196, -177, -142, -98, -49, 0, 44, 78, 102, 113, 112, 100, 80, 55, 27, 0
        WORD -24, -42, -54, -60, -59, -52, -41, -28, -13, 0, 11, 20, 25, 27, 26, 22, 17, 11, 5, 0, -4, -7, -8, -8, -7, -6
        WORD -4, -2, -1, 0, 1, 1, 1, 1

        ORG 0
fir_filter
        WRPIN ##dither, #DACpin     ' smartpin mode for dithered 16 bit DAC
        WRPIN ##dither, #DACpin2
        call #setup                 ' the setup code clashes with where the coeffs get loaded to save space.
        call #setup_coeffs          ' calling this overwrites the setup routine

        GETCT time                  ' test framework uses WAITCT for timing
        ADDCT1 time, ##1667         ' about 96kHz (filter takes just under 10us worst case at 160MHz sysclk)

fir_outer_loop
        mov pass_count, #FIFO_BLK/2 ' every 32 samples copy back the taps as they have crawled up in memory by 64 byte FIFO block size
fir_loop
        call #get_input
        wrword input, taps          ' add to one end of taps
        mov sum, #0                 ' prepare to sum
        RDFAST #0, taps
        jmp jout                    ' into the generated engine code, which is in LUT, so #engine won't work
jout    long engine

engine_done                         ' back here from engine
        cmp remaind, #0 wz          ' complete FIFO block (is this necessary?)
  if_z  jmp #.rem
        REP @.rem, remaind
        RFWORD t
.rem
        sub taps, #2                ' crawl back the tap pointer by one word each pass
        call #put_output
        djnz pass_count, #fir_loop

        call #copyback              ' time to shift the whole taps vector with FIFO fast memory ops (calling this overwrites setup_coeffs)
        jmp #fir_outer_loop

' testing input waveform
get_input
        getct input
        shr input, #17              ' generate step pattern from CT, repeats every 2^20 clocks
        and input, #7
        xor input, #5
        shl input, #13
        sub input, ##$7000
        WYPIN input, #DACpin2
get_input_ret
        ret

' testing output via DAAC smartpin
put_output
        WAITCT1                     ' synchronize for clean output timing
        ADDCT1 time, ##1667         ' about 96kHz
        sar sum, #15                ' scale by 2^-15 to compensate for MULS
        add sum, ##$8000            ' offset for DAC
        WYPIN sum, #DACpin          ' update dithered DAC
put_output_ret
        ret

N          long 0                   ' number of taps / coefficients
Ncopy      long 0                   ' reps for copydown
coeffs     long 0                   ' hub address coefficients
taps0      long 0                   ' address of tap vector + 64
taps       long 0                   ' current taps ptr, slides back from taps0 by 2 each sample until copyback
pass_count long 32                  ' loop counter for triggering copyback
remaind    long 0                   ' fast word reads/writes left over after doing N, to end current block
time       long 0                   ' for waitct1 synch

copyback
        mov taps, taps0             ' every 32 samples we correct for the sliding taps window using fast FIFO
        add taps, N                 ' I'm sure there's a better way to do this on P2
        add taps, N                 ' taps used as working pointer starting at end of taps vector
        sub taps, #FIFO_BLK*2       ' 32 word, copy as 16 longs as aligned to 64 byte blocks for fast read/write
        REP @.done, Ncopy           ' copy Ncopy FIFO blocks
        RDFAST #0, taps
        RFLONG t+0
        RFLONG t+1
        RFLONG t+2
        RFLONG t+3
        RFLONG t+4
        RFLONG t+5
        RFLONG t+6
        RFLONG t+7
        RFLONG t+8
        RFLONG t+9
        RFLONG t+10
        RFLONG t+11
        RFLONG t+12
        RFLONG t+13
        RFLONG t+14
        RFLONG t+15
        add taps, #FIFO_BLK
        WRFAST #0, taps
        WFLONG t+0
        WFLONG t+1
        WFLONG t+2
        WFLONG t+3
        WFLONG t+4
        WFLONG t+5
        WFLONG t+6
        WFLONG t+7
        WFLONG t+8
        WFLONG t+9
        WFLONG t+10
        WFLONG t+11
        WFLONG t+12
        WFLONG t+13
        WFLONG t+14
        WFLONG t+15
        sub taps, #FIFO_BLK*2
.done
        mov taps, taps0             ' restore taps for main loop
copyback_ret
        ret

' these temporary vars extend to overwrite setup_coeffs when copyback is called
' 16 longs from t onwards are used in copyback
t       long 0
sum     long 0
cptr    long 0
eptr    long 0
input   long 0
        long 0
        long 0
        long 0

' used at setup to read coeffs from hub ram to coeff registers
setup_coeffs
        RDFAST #0, coeffs           ' for reading coeff values in
        mov cptr, #coeff
        REP @.done, N
        ALTD cptr
        RFWORD 0-0                  ' get coeff into register
        add cptr, #1
.done
        cmp remaind, #0 wz          ' if last FIFO block partial, complete it
  if_z  jmp #setup_coeffs_ret
        REP @.done2, remaind
        RFWORD t
.done2
setup_coeffs_ret
        ret

coeff   ' this setup code gets overwritten with coefficients when setup_coeffs is called after it
setup
        rdlong N, PTRA++
        rdlong taps0, PTRA++
        add taps0, #64
        rdlong coeffs, PTRA++
        call #setup_engine
        mov Ncopy, N
        add Ncopy, #FIFO_BLK/2-1
        shr Ncopy, #FIFO_POWER-1
        mov remaind, N
        and remaind, #FIFO_BLK/2-1
        neg remaind, remaind
        add remaind, #FIFO_BLK/2
        and remaind, #FIFO_BLK/2-1
        mov taps, taps0
        WRFAST #0, taps
        REP @.done, N               ' zero the taps at start
        WFWORD #0
.done
        cmp remaind, #0 wz
  if_z  jmp #setup_ret
        REP @.done2, remaind        ' complete whole FIFO block
        WFWORD #0
.done2
setup_ret
        ret

' copy instructions into the engine area (in LUT ram), depending on number of coefficients
setup_engine
        mov cptr, #coeff
        mov eptr, ##engine          ' engine in LUT so need ##
        REP @.done, N
        SETS muls_instr, cptr       ' unique constant reg per multiply instruction
        WRLUT read_instr, eptr
        add eptr, #1
        WRLUT muls_instr, eptr
        add eptr, #1
        WRLUT add_instr, eptr
        add eptr, #1
        add cptr, #1
.done
        WRLUT jmpback_instr, eptr
setup_engine_ret
        ret

' proforma instructions for engine
read_instr      RFWORD t
muls_instr      MULS t, 0-0         ' source will be a coefficient register
add_instr       add sum, t
jmpback_instr   jmp #\engine_done   ' absolute jmp as is copied to end of engine section

' set aside space for coefficients, allowing for the stuff above that gets overwritten
        res MAXN - 14 - 22 - 4      ' 15 is the instr count for setup_engine, 22 for setup, 4 for proforma
        FIT $1F0                    ' check it fits

        ORG $200                    ' LUT ram
engine  res MAXN*3+1                ' engine instructions go here, RFWORD/MULS/ADD/........./JMP
        FIT $400
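The comment above the coefficient table describes how the table was derived: scale by 2^15, round to integers, then drop leading and trailing zeros. A minimal Python sketch of that post-processing step follows. It is an illustration only: `quantize_coeffs` is a hypothetical helper, the floating-point input would come from a design tool such as firwin (not called here), and Python's `round()` rounds exact halves to even, which may differ from whatever rounding was originally used.

```python
def quantize_coeffs(coeffs, frac_bits=15):
    """Scale float FIR coefficients to signed 16-bit words and trim zeros."""
    scaled = [round(c * (1 << frac_bits)) for c in coeffs]

    # Drop leading and trailing coefficients that rounded to zero.
    first = 0
    while first < len(scaled) and scaled[first] == 0:
        first += 1
    last = len(scaled)
    while last > first and scaled[last - 1] == 0:
        last -= 1
    trimmed = scaled[first:last]

    # MULS works on signed 16-bit operands, so check the range.
    for q in trimmed:
        if not -32768 <= q <= 32767:
            raise ValueError(f"coefficient {q} does not fit in 16 bits")
    return trimmed


example = quantize_coeffs([0.00001, 0.25, -0.5, 0.25, 0.0])
print(example)
```

Trimming the zeros is what takes the 177-point design down to the 169 words in the table, and it shortens the unrolled loop by the same amount.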
Comments
is running at 96kSPS with 169 taps
Fastspin v3.9.12 doesn't complain so I presume all is good there. I changed the pin numbers to 10 and 11 to suit, I already had a couple of scope probes attached there. I also tried adding a DIRH for each pin but that didn't help either.
still worked so I figured it's not needed... Also I can change to pin 10 and it works.
I'm off to bed for now.
cleanly LPF'ing
EDIT: And a nice side effect of using this way of fast block copies is they are entirely independent of the FIFO. You can leave the FIFO operating at the same time.
I thought the output samples were fully synchronized, IIRC I looked at the SA output. Input waveform isn't, so
I guess that might be allowing a varying step waveform into the filter.
SETQ+RD/WRLONG - good call, I was fixated on FIFO ops. I think the copydown speed is OK though,
not anticipating a 4-fold speed up.
I think when writing with the FIFO you need to use at least a multiple of 8 longs to trigger each FIFO write, otherwise
I don't see how the hardware can work.
BTW I'm currently getting IIR filter code to work, using pipelined CORDIC multiplies over second-order sections. The
basics are working now, but scaling is still an issue (you need to compensate for Q at each stage to avoid overflow).
Each second-order stage uses 4 QMULs fired off every 16 cycles, so 400ns/section (plus 400ns for pipeline setup).
So that's about 200ns/pole/sample.
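Those timing figures can be reproduced with a short calculation. The sketch below assumes the same 160MHz sysclk as the FIR demo (the post does not restate it for the IIR code, though 64 cycles in 400ns implies it); the function name and defaults are made up for illustration.

```python
def iir_time_ns(n_sections, sysclk_hz=160_000_000, qmuls_per_section=4,
                cycles_per_qmul=16, setup_ns=400.0):
    """Estimated time per sample for a cascade of second-order sections."""
    cycle_ns = 1e9 / sysclk_hz                        # 6.25ns at 160MHz
    section_ns = qmuls_per_section * cycles_per_qmul * cycle_ns
    return n_sections * section_ns + setup_ns


one_section = iir_time_ns(1, setup_ns=0.0)   # time for a single section
four_pole = iir_time_ns(2)                   # two sections plus pipeline setup
print(one_section, one_section / 2, four_pole)
```

Each section contributes two poles, which is where the 200ns/pole figure comes from; the fixed setup time matters more for short cascades.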
I also intend to look at implementing lattice filters too, allpass and generic IIR.
So on hubRAM reads, RDFAST, the FIFO pre-fills a whole hub rotation before returning to allow the first RFxxxx. And I think it'll top up at every opportunity after that.
On hubRAM writes, WRFAST will simply buffer the WFxxxx's, then write to hubRAM as rotation opportunities allow. It doesn't need to burst. There's no requirement for a cog to complete a rotation block.
This means RDFAST has a FIFO filling stall at the start. WRFAST on the other hand just has an emptying duration at the end that may stall cog execution if quickly reusing the FIFO or other hubRAM access at the same time.
So the backslash in pasm sources is also unique to code addressing, although LOC is useful for hubRAM data referencing.
input waveform:
4-pole Bessel filter:
4-pole Butterworth:
4-pole elliptic:
I figured this out: I was overdriving the input to the FIR filter, which led to sign flipping. This looked exactly like the staircase waveform shifting by half its period. Reducing the amount by which the generated waveform values are shifted left resolved the anomaly.
dynamically reconfigurable too.
Input (blue), output (yellow), for a 4-pole elliptic filter:
The coefficients are in hub ram, and the state is kept in LUT ram (for no deep reason).
You provide a set of second-order sections' a1/a2/b1/b2 coefficients (and a binary scale
factor between sections rather than an a0). fix2.30 format is used for coefficients, with
the slight fudge that +2.0 has to be represented inexactly (coefficients are always in the
range -2.0..+2.0 in a second-order-section minimum-phase filter).
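To make the +2.0 fudge concrete, here is a small Python sketch of one way such a conversion could work. It assumes "fix2.30" means a 32-bit two's-complement word with 30 fractional bits, and that the fudge is a clamp to the largest positive word; both are my reading of the description, not something the post confirms. The helper names are hypothetical.

```python
FRAC_BITS = 30
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def to_fix2_30(x):
    """Convert a float in -2.0..+2.0 to a signed 32-bit fix2.30 word."""
    q = round(x * (1 << FRAC_BITS))
    # +2.0 would need 1 << 31, one more than the largest positive word,
    # so clamp it (the assumed "fudge").
    return max(INT32_MIN, min(INT32_MAX, q))


def from_fix2_30(q):
    """Convert a fix2.30 word back to a float."""
    return q / (1 << FRAC_BITS)


for value in (1.0, -2.0, 2.0):
    word = to_fix2_30(value)
    print(value, hex(word & 0xFFFFFFFF), from_fix2_30(word))
```

Under these assumptions -2.0 is exact while +2.0 comes out one least-significant bit short.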