Digital filtering approaches on the Prop2
I thought I'd start a thread to gather techniques and ideas for implementing digital filters on the P2.
There's clearly a variety of approaches available on the P2, whose multiply and CORDIC instructions go well beyond the P1's abilities.
A while back I posted a tool for generating IIR filter code on the P1, and there's Phil Pilgrim's FIR2PASM (forums.parallax.com/discussion/133173/fir2pasm-automatic-fir-filter-code-generator/p1), also for the P1. Both have to work around the lack of a multiply instruction (with shifts and adds etc).
So lately I've been thinking about ways to implement IIRs (both second-order sections (biquads) and lattice filters) using the CORDIC pipeline. That's still work in progress.
Over the last couple of days I decided to tackle FIR filters using 16-bit inputs/outputs/coefficients and the MULS instruction. My latest effort is included below. Basically I realized the FIFO memory accesses allow the tapped delay line to live in hub RAM, with the coefficients in cog RAM, meaning this simple fragment can act as the multiply-accumulate engine at the heart of the filter:
rfword t
muls t, coeffN
add sum, t
given that RDFAST has been set up for the tapped delays. That's 3 instructions of overhead per tap; at 2 clocks per instruction on a 160MHz sysclk that's about 250kSPS processing rate for a 100-tap FIR filter.
It's never that simple of course: the values have to shift down the array of taps on each sample, which naively would add a lot of overhead for moving the data. However, once the taps live in hub memory you can slide a window over memory instead, letting the active set of taps slide along. At some point you need to copy them back to stop the window walking through the whole of memory, but for a 100-tap filter you can afford to do this only occasionally. In fact I do it every 32 samples, as 32 words take up one 64-byte FIFO block for fast reads/writes.
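The sliding-window idea can be checked off-target with a small model. The Python sketch below is not the PASM code. It is a minimal model, assuming the newest sample sits at the lowest address of the window, the window moves down one word per sample, and the history is copied up by one 32-word block every 32 samples. The function names are made up for illustration.

```python
BLOCK = 32  # one 64-byte FIFO block holds 32 16-bit words


def fir_direct(coeffs, samples):
    """Reference FIR: shift the whole delay line on every sample."""
    delay = [0] * len(coeffs)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]              # newest sample first
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


def fir_sliding(coeffs, samples):
    """Same filter, but the tap window slides through a larger buffer."""
    n = len(coeffs)
    buf = [0] * (n + BLOCK)                   # N + 32 words of workspace
    pos = BLOCK                               # window start right after a copy-back
    out = []
    for x in samples:
        buf[pos] = x                          # newest sample at the low end of the window
        out.append(sum(c * buf[pos + k] for k, c in enumerate(coeffs)))
        pos -= 1                              # crawl the window down by one word
        if pos == 0:                          # 32 samples done: move the history up a block
            buf[BLOCK:] = buf[:n]
            pos = BLOCK
    return out


coeffs = list(range(-20, 21))                 # 41 arbitrary integer taps
samples = [(i * 37) % 101 - 50 for i in range(200)]
print(fir_sliding(coeffs, samples) == fir_direct(coeffs, samples))
```

The model spends one bulk copy per 32 samples instead of N word moves per sample, which is the saving described above. When the copy is done in place on real memory it has to start from the high end of the vector so that data is not overwritten before it is read.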
I gave up hope of finding clever ways to exploit the typical symmetry of FIR coefficients, as the FIFO memory access is unidirectional.
I also realized the LUT RAM could hold the unrolled loop of the 3-instruction fragment above, allowing a maximum tap count of 512/3 = 170.
Before that I was limited to about 100 taps, as coefficients, housekeeping code and unrolled loop all had to coexist in 496 cog longs.
So I store up to 170 coefficients in cog RAM, 170+32 words of tap vector in hub RAM, and up to 511 instructions' worth of unrolled loop in LUT RAM. The code generates the unrolled loop according to the tap count, and loads the coefficients from a hub RAM word array at startup. The arguments are the tap count, the array of word coefficients, and a pointer to the hub RAM workspace.
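The sizing figures above can be reproduced with a few lines of arithmetic. This Python sketch is only a cross-check: `fir_sizing` and its field names are hypothetical, and the block count and padding follow my reading of how the setup code in the listing rounds the tap count up to whole 32-word FIFO blocks.

```python
LUT_LONGS = 512        # LUT RAM size in longs
INSTR_PER_TAP = 3      # RFWORD / MULS / ADD
BLOCK_WORDS = 32       # one 64-byte FIFO block = 32 words


def fir_sizing(num_taps):
    """Return resource figures for a given tap count (illustrative helper)."""
    max_taps = (LUT_LONGS - 1) // INSTR_PER_TAP      # keep one long for the jump back
    if not 1 <= num_taps <= max_taps:
        raise ValueError("tap count must be in 1..%d" % max_taps)
    return {
        "engine_longs": num_taps * INSTR_PER_TAP + 1,              # unrolled loop + jump
        "hub_words": num_taps + BLOCK_WORDS,                       # tap vector + sliding room
        "copy_blocks": (num_taps + BLOCK_WORDS - 1) // BLOCK_WORDS,  # blocks moved per copy-back
        "pad_words": -num_taps % BLOCK_WORDS,                      # words needed to finish the last block
    }


print(fir_sizing(169))
print(fir_sizing(170)["engine_longs"])   # 511, the figure quoted above
```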
The demonstration code synthesizes a simple step waveform from the CT register as input, and drives a couple of smartpins in dither DAC mode to present the input and output waveforms for 'scope perusal. I've got a 169-tap example filter running at 96kSPS (fairly close to the limit as it happens), so it's ready to play with.
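As a rough cross-check of the rates quoted, here is the clock budget of the multiply-accumulate engine alone. It is a sketch under one assumption, that RFWORD, MULS and ADD each take 2 clocks; the helper names are made up, and the 1667-clock sample period is the value used with ADDCT1 in the listing below.

```python
SYSCLK_HZ = 160_000_000
CLOCKS_PER_INSTR = 2      # assumption: RFWORD, MULS and ADD each take 2 clocks
INSTR_PER_TAP = 3
SAMPLE_PERIOD_CLOCKS = 1667   # ADDCT1 increment used by the demo (about 96kHz)


def mac_clocks(num_taps):
    """Clocks spent in the unrolled RFWORD/MULS/ADD engine per sample."""
    return num_taps * INSTR_PER_TAP * CLOCKS_PER_INSTR


def mac_limited_rate(num_taps):
    """Upper bound on sample rate if nothing but the engine ran."""
    return SYSCLK_HZ / mac_clocks(num_taps)


print(mac_clocks(100), round(mac_limited_rate(100)))   # 600 clocks, 266667 samples/s
print(mac_clocks(169), SAMPLE_PERIOD_CLOCKS - mac_clocks(169))
```

For the 169-tap example the engine uses 1014 of the 1667 clocks in each sample period. The remaining 653 clocks have to cover input, output, loop overhead and, on every 32nd sample, the copy-back, which is consistent with the filter being fairly close to the limit.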
' fast_fifo_firfilter.spin2
'
' Mark Tillotson, 2019-01-19
'
' 16 bit FIR filter cog test
'
' Use RDFAST and Hub delay tap vector, plus coefficients vector in COG ram, with generated unrolled
' instruction stream in LUT ram to perform RFWORD/MULS/ADD per tap
' hub array is copied down whenever it has moved a FIFO block's worth (64 bytes = 32 words)
' rather than shifting it every sample.
'
' the unidirectional nature of the FIFO hub access prevents easy exploitation of symmetric/antisymmetric
' structure of FIR coefficients.
' Manages 96kSPS rate for test filter of 169 coefficients, one cog
'
' using WAITCT1 and a synthesized step waveform derived from CT register, outputs to 16 bit dithered DAC
' on pin 48
'
' test filter has cutoff at 0.1 * Nyquist (0.05 * Fs), constructed using Python scipy.signal library's firwin() function
CON
OSCMODE = $010c3f04
FREQ = 160_000_000
MAXN = 170 ' 170 * 3 = 510, so 170 is maximum for 3 instructions/tap in LUT ram
DACpin = 48 ' output DAC
DACpin2 = 50 ' input waveform DAC
dither = %0000_0000_000_10100_00000000_01_00010_0 ' remember to wypin
FIFO_POWER = 6
FIFO_BLK = 1 << FIFO_POWER
OBJ
VAR
WORD temp [MAXN + FIFO_BLK/2] ' work space for taps vector
LONG num_taps ' arg block to cog
LONG tap_buffer
LONG coeff_buffer
PUB demo | i
  clkset (OSCMODE, FREQ)
  ' setup arguments
  num_taps := 169
  tap_buffer := @temp
  coeff_buffer := @coefficients
  ' start filter running
  cognew (@fir_filter, @num_taps)
  repeat
    pausems(10)
DAT
' from Python's scipy.signal library: coeffs = firwin (177, 0.1, window='hanning')
' then scaled by 2^15 and rounded to int and leading/trailing zeroes removed to leave 169 coefficients
coefficients WORD 1, 1, 1, 1, 0, -1, -2, -4, -6, -7, -8, -8, -7, -4, 0, 5, 11, 17, 22, 26, 27, 25, 20, 11, 0, -13, -28, -41
WORD -52, -59, -60, -54, -42, -24, 0, 27, 55, 80, 100, 112, 113, 102, 78, 44, 0, -49, -98, -142, -177, -196, -197
WORD -177, -136, -75, 0, 84, 169, 245, 305, 340, 342, 308, 238, 133, 0, -151, -307, -452, -571, -647, -665, -615
WORD -488, -282, 0, 349, 751, 1187, 1635, 2070, 2468, 2805, 3062, 3222, 3277, 3222, 3062, 2805, 2468, 2070, 1635
WORD 1187, 751, 349, 0, -282, -488, -615, -665, -647, -571, -452, -307, -151, 0, 133, 238, 308, 342, 340, 305, 245
WORD 169, 84, 0, -75, -136, -177, -197, -196, -177, -142, -98, -49, 0, 44, 78, 102, 113, 112, 100, 80, 55, 27, 0
WORD -24, -42, -54, -60, -59, -52, -41, -28, -13, 0, 11, 20, 25, 27, 26, 22, 17, 11, 5, 0, -4, -7, -8, -8, -7, -6
WORD -4, -2, -1, 0, 1, 1, 1, 1
ORG 0
fir_filter WRPIN ##dither, #DACpin ' smartpin mode for dithered 16 bit DAC
WRPIN ##dither, #DACpin2
call #setup ' the setup code clashes with where the coeffs get loaded to save space.
call #setup_coeffs ' calling this overwrites the setup routine
GETCT time ' test framework uses WAITCT for timing
ADDCT1 time, ##1667 ' about 96kHz (filter takes just under 10us worst case at 160MHz sysclk)
fir_outer_loop mov pass_count, #FIFO_BLK/2 ' every 32 samples copy back the taps as they have crawled up in memory by 64 byte FIFO block size
fir_loop call #get_input
wrword input, taps ' add to one end of taps
mov sum, #0 ' prepare to sum
RDFAST #0, taps
jmp jout ' into the generated engine code, which is in LUT, so #engine won't work
jout long engine
engine_done ' back here from engine
cmp remaind, #0 wz ' complete FIFO block (is this necessary?)
if_z jmp #.rem
REP @.rem, remaind
RFWORD t
.rem
sub taps, #2 ' crawl back the tap pointer by one word each pass
call #put_output
djnz pass_count, #fir_loop
call #copyback ' time to shift the whole taps vector with FIFO fast memory ops (calling this overwrites setup_coeffs)
jmp #fir_outer_loop
' testing input waveform
get_input getct input
shr input, #17 ' generate step pattern from CT, repeats every 2^20 clocks
and input, #7
xor input, #5
shl input, #13
sub input, ##$7000
WYPIN input, #DACpin2
get_input_ret ret
' testing output via DAC smartpin
put_output WAITCT1 ' synchronize for clean output timing
ADDCT1 time, ##1667 ' about 96kHz
sar sum, #15 ' scale by 2^-15 to compensate for MULS
add sum, ##$8000 ' offset for DAC
WYPIN sum, #DACpin ' update dithered DAC
put_output_ret ret
N long 0 ' number of taps / coefficients
Ncopy long 0 ' reps for copydown
coeffs long 0 ' hub address coefficients
taps0 long 0 ' address of tap vector + 64
taps long 0 ' current taps ptr, slides back from taps0 by 2 each sample until copyback
pass_count long 32 ' loop counter for triggering copyback
remaind long 0 ' fast word reads/writes left over after doing N, to end current block
time long 0 ' for waitct1 synch
copyback mov taps, taps0 ' every 32 samples we correct for the sliding taps window using fast FIFO
add taps, N ' I'm sure there's a better way to do this on P2
add taps, N ' taps used as working pointer starting at end of taps vector
sub taps, #FIFO_BLK*2 ' 32 word, copy as 16 longs as aligned to 64 byte blocks for fast read/write
REP @.done, Ncopy ' copy Ncopy FIFO blocks
RDFAST #0, taps
RFLONG t+0
RFLONG t+1
RFLONG t+2
RFLONG t+3
RFLONG t+4
RFLONG t+5
RFLONG t+6
RFLONG t+7
RFLONG t+8
RFLONG t+9
RFLONG t+10
RFLONG t+11
RFLONG t+12
RFLONG t+13
RFLONG t+14
RFLONG t+15
add taps, #FIFO_BLK
WRFAST #0, taps
WFLONG t+0
WFLONG t+1
WFLONG t+2
WFLONG t+3
WFLONG t+4
WFLONG t+5
WFLONG t+6
WFLONG t+7
WFLONG t+8
WFLONG t+9
WFLONG t+10
WFLONG t+11
WFLONG t+12
WFLONG t+13
WFLONG t+14
WFLONG t+15
sub taps, #FIFO_BLK*2
.done
mov taps, taps0 ' restore taps for main loop
copyback_ret ret
' these temporary vars extend to overwrite setup_coeffs when copyback is called
' 16 longs from t onwards are used in copyback
t long 0
sum long 0
cptr long 0
eptr long 0
input long 0
long 0
long 0
long 0
' used at setup to read coeffs from hub ram to coeff registers
setup_coeffs RDFAST #0, coeffs ' for reading coeff values in
mov cptr, #coeff
REP @.done, N
ALTD cptr
RFWORD 0-0 ' get coeff into register
add cptr, #1
.done
cmp remaind, #0 wz ' if last FIFO block partial, complete it
if_z jmp #setup_coeffs_ret
REP @.done2, remaind
RFWORD t
.done2
setup_coeffs_ret ret
coeff
' this setup code gets overwritten with coefficients when setup_coeffs is called after it
setup rdlong N, PTRA++
rdlong taps0, PTRA++
add taps0, #64
rdlong coeffs, PTRA++
call #setup_engine
mov Ncopy, N
add Ncopy, #FIFO_BLK/2-1
shr Ncopy, #FIFO_POWER-1
mov remaind, N
and remaind, #FIFO_BLK/2-1
neg remaind, remaind
add remaind, #FIFO_BLK/2
and remaind, #FIFO_BLK/2-1
mov taps, taps0
WRFAST #0, taps
REP @.done, N ' zero the taps at start
WFWORD #0
.done
cmp remaind, #0 wz
if_z jmp #setup_ret
REP @.done2, remaind ' complete whole FIFO block
WFWORD #0
.done2
setup_ret ret
' copy instructions into the engine area (in LUT ram), depending on number of coefficients
setup_engine mov cptr, #coeff
mov eptr, ##engine ' engine in LUT so need ##
REP @.done, N
SETS muls_instr, cptr ' unique constant reg per multiply instruction
WRLUT read_instr, eptr
add eptr, #1
WRLUT muls_instr, eptr
add eptr, #1
WRLUT add_instr, eptr
add eptr, #1
add cptr, #1
.done
WRLUT jmpback_instr, eptr
setup_engine_ret ret
' proforma instructions for engine
read_instr RFWORD t
muls_instr MULS t, 0-0 ' source will be a coefficient register
add_instr add sum, t
jmpback_instr jmp #\engine_done ' absolute jmp as is copied to end of engine section
' set aside space for coefficients, allowing for the stuff above that gets overwritten
res MAXN - 14 - 22 - 4 ' 14 is the long count for setup_engine (13 instructions plus the AUGS prefix for ##engine), 22 for setup, 4 for proforma
FIT $1F0 ' check it fits
ORG $200 ' LUT ram
engine res MAXN*3+1 ' engine instructions go here, RFWORD/MULS/ADD/........./JMP
FIT $400
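The comments in the DAT section say the coefficient table came from scipy's firwin(177, 0.1, window='hanning'), scaled by 2^15, rounded, with the zero end taps stripped. For readers without scipy, the sketch below does the same kind of design by hand: a Hann-windowed sinc normalised to unity DC gain. It is my approximation of what firwin does, not a copy of it, so I would not rely on it matching the table value for value. The function name is hypothetical.

```python
import math


def hann_lowpass_q15(num_taps=177, cutoff=0.1):
    """Windowed-sinc low-pass, cutoff given as a fraction of Nyquist, scaled to Q15."""
    centre = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - centre
        # ideal low-pass impulse response sin(pi*fc*x)/(pi*x), with the x = 0 limit
        ideal = cutoff if x == 0 else math.sin(math.pi * cutoff * x) / (math.pi * x)
        # symmetric Hann window, zero at both ends
        window = 0.5 + 0.5 * math.cos(2 * math.pi * x / (num_taps - 1))
        taps.append(ideal * window)
    dc_gain = sum(taps)                                   # normalise so the taps sum to 1
    q15 = [round(t / dc_gain * 32768) for t in taps]      # scale by 2^15 and round
    while q15 and q15[0] == 0:                            # strip leading zero taps
        q15.pop(0)
    while q15 and q15[-1] == 0:                           # strip trailing zero taps
        q15.pop()
    return q15


coeffs = hann_lowpass_q15()
print(len(coeffs), coeffs[:4], max(coeffs))
```

Note the zero crossings every 10 taps either side of the centre: that spacing corresponds to a cutoff of 0.05 * Fs, which is how firwin interprets 0.1 by default (as a fraction of Nyquist).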
Comments
is running at 96kSPS with 169 taps
Fastspin v3.9.12 doesn't complain so I presume all is good there. I changed the pin numbers to 10 and 11 to suit, as I already had a couple of scope probes attached there. I also tried adding a DIRH for each pin but that didn't help either.
still worked so I figured it's not needed... Also I can change to pin 10 and it works.
I'm off to bed for now.
cleanly LPF'ing
EDIT: And a nice side effect of using this way of fast block copies is they are entirely independent of the FIFO. You can leave the FIFO operating at the same time.
I thought the output samples were fully synchronized, IIRC I looked at the SA output. Input waveform isn't, so
I guess that might be allowing a varying step waveform into the filter.
SETQ+RD/WRLONG - good call, I was fixated on FIFO ops. I think the copydown speed is OK though,
not anticipating a 4-fold speed up.
I think when writing with the FIFO you need to use at least a multiple of 8 longs to trigger each FIFO write, otherwise
I don't see how the hardware can work.
BTW I'm currently getting IIR filter code to work, using pipelined CORDIC multiplies over second-order sections. The
basics are working now, but scaling is still an issue (you need to compensate for Q at each stage to avoid overflow).
Each second-order stage uses 4 QMULs fired off every 16 cycles, so 400ns/section at 160MHz (plus 400ns for pipeline setup).
So that's about 200ns/pole/sample.
I also intend to look at implementing lattice filters too, allpass and generic IIR.
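The timing figures in that comment follow from simple arithmetic, sketched here in Python. It assumes the 160MHz sysclk used in the listings and treats the pipeline setup as costing the same as one extra section, which is how I read the "plus 400ns" remark. The helper name is invented.

```python
SYSCLK_HZ = 160_000_000
QMULS_PER_SECTION = 4
CLOCKS_PER_QMUL = 16          # one QMUL issued every 16 clocks, as stated above
SECTION_CLOCKS = QMULS_PER_SECTION * CLOCKS_PER_QMUL   # 64 clocks per section


def iir_time_ns(num_sections):
    """Estimated time per sample: one section's worth of setup plus the sections."""
    clocks = (num_sections + 1) * SECTION_CLOCKS
    return clocks * 1_000_000_000 / SYSCLK_HZ


section_ns = SECTION_CLOCKS * 1_000_000_000 / SYSCLK_HZ
print(section_ns, section_ns / 2)     # 400.0 ns per section, 200.0 ns per pole
print(iir_time_ns(4))                 # 2000.0 ns for four sections
```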
So on hubRAM reads, RDFAST makes the FIFO pre-fill for a whole hub rotation before returning, to allow the first RFxxxx. And I think it'll top up at every opportunity after that.
On hubRAM writes, WRFAST will simply buffer the WFxxxx's, then write to hubRAM as rotation opportunity allows. It doesn't need to burst. There's no requirement for a cog to complete a rotation block.
This means RDFAST has a FIFO filling stall at the start. WRFAST on the other hand just has an emptying duration at the end, which may stall cog execution if you quickly reuse the FIFO or make another hubRAM access at the same time.
So the backslash in pasm sources is also unique to code addressing, although LOC is useful for hubRAM data referencing.
[scope captures: input waveform; 4-pole Bessel filter; 4-pole Butterworth filter; 4-pole elliptic filter]
I figured this out: I was overdriving the input to the FIR filter, which led to sign flipping. The result looked exactly like the staircase waveform shifted by half its period. Reducing the amount by which the generated waveform values are shifted left resolved the anomaly.
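A quick way to see this kind of sign flip is to model the 16-bit storage of the taps. The Python sketch below is an illustration only: it assumes the staircase is built as in the FIR demo's get_input routine, ((step ^ 5) << shift) - $7000, and that only the low 16 bits survive and are treated as signed by the multiply. The shift of 14 is a hypothetical overdriven setting, not necessarily the one used in the comment above.

```python
def as_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def staircase_levels(shift):
    """Levels of the 3-bit staircase: ((step ^ 5) << shift) - $7000."""
    return [((step ^ 5) << shift) - 0x7000 for step in range(8)]


def stored_levels(shift):
    """What the filter sees once each level has been squeezed into a signed word."""
    return [as_signed16(v) for v in staircase_levels(shift)]


print(staircase_levels(13))
print(stored_levels(13))      # identical: every level fits in 16 bits
print(staircase_levels(14))
print(stored_levels(14))      # the larger levels wrap and change sign
```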
dynamically reconfigurable too.
Input (blue), output (yellow), for a 4-pole elliptic filter:
The coefficients are in hub RAM, and the state is kept in LUT RAM (for no deep reason).
You provide a set of second-order sections' a1/a2/b1/b2 coefficients (and a binary scale factor between sections rather than an a0). fix2.30 format is used for the coefficients, with the slight fudge that +2.0 has to be represented inexactly (coefficients are always in the range -2.0..+2.0 in a minimum-phase second-order section).
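To make the fix2.30 format concrete, here is a small Python sketch that converts between floating point and the 32-bit words used in the coefficient table. The helper names are invented, and clamping +2.0 to $7FFFFFFF is my assumption about how the "inexact" +2.0 would be handled.

```python
FRAC_BITS = 30   # signed 2.30: 2 integer bits (including sign), 30 fraction bits


def to_fix2_30(x):
    """Encode a float in -2.0..+2.0 as an unsigned 32-bit word."""
    raw = round(x * (1 << FRAC_BITS))
    raw = max(-(1 << 31), min((1 << 31) - 1, raw))   # clamp: +2.0 becomes $7FFFFFFF
    return raw & 0xFFFFFFFF


def from_fix2_30(word):
    """Decode a 32-bit word back to a float."""
    raw = word - (1 << 32) if word & 0x80000000 else word
    return raw / (1 << FRAC_BITS)


print(hex(to_fix2_30(1.0)))          # 0x40000000, as used for the b2 entries
print(hex(to_fix2_30(2.0)))          # 0x7fffffff, just short of +2.0
print(from_fix2_30(0x9FF29AB9))      # first a1 entry in the table
```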
'
' Test IIR cordic code
'
CON
  OSCMODE = $010c3f04
  FREQ = 160_000_000
  BAUD = 2*115200
  DACpin_in = 50 ' input waveform
  DACpin_out = 48 ' output DAC
  dither = %0000_0000_000_10100_00000000_01_00010_0 ' remember to wypin
  PERIOD_IN_CLKS = $400
OBJ
  ser: "SmartSerial.spin2"
PUB demo | i
  clkset (OSCMODE, FREQ)
  ser.start (63, 62, 0, BAUD)
  ser.str (string ("IIR cordic test 2, table driven"))
  ser.tx (13)
  ser.tx (10)
  pausems(1000)
  N := 4
  coeffs := @coefficients
  first_state := $200
  cognew (@entry)
DAT
                ORG     0
' test harness, generate stepped waveform from bits of CT register, output input wave and filtered results to 2 dithered DAC smartpins
entry           call    #setup_IIR ' zero out the state (in LUT ram)
                WRPIN   ##dither, #DACpin_out ' smartpin mode for dithered 16 bit DAC
                WRPIN   ##dither, #DACpin_in ' smartpin mode for dithered 16 bit DAC
                GETCT   time ' test framework uses WAITCT for timing
                ADDCT1  time, ##PERIOD_IN_CLKS ' 1024 cycle sample period, 6.4us at 160MHz
looper
get_input       getct   acc
                shr     acc, #1
                and     acc, ##$C000 ' generate step pattern from CT, repeats every 2^20 clocks
                sub     acc, ##$7000
                WYPIN   acc, #DACpin_in
                call    #IIR_filter
                add     acc, ##$6000
                and     acc, ##$FFFF
                WAITCT1 ' synchronize for clean output timing
                ADDCT1  time, ##PERIOD_IN_CLKS
                WYPIN   acc, #DACpin_out
                jmp     #looper
time            long    0
{{ Table-driven IIR Filter code using cordic multiplies }}
N               long    4
first_state     long    $200 ' address of LUT ram (this wraps anyway for rdlut/wrlut instructions, but give the full address)
coeffs          long    @coefficients
' setup state, call once at start
setup_IIR       mov     ctr, N ' N is number of SoS's (second-order-sections)
                shl     ctr, #1 ' two longs of state per SoS
                mov     state_ptr, first_state
                mov     d1, #0
.zeroloop       wrlut   d1, state_ptr
                add     state_ptr, #1
                djnz    ctr, #.zeroloop
setup_IIR_ret   ret
' IIR filter routine, input in acc, returns output also in acc
IIR_filter      mov     state_ptr, first_state
                RDFAST  #0, coeffs
                mov     c, N
.iir_loop       call    #IIR_section ' loop for each SoS, state in LUT, coefficients and scaling shifts from hubram
                djnz    c, #.iir_loop
' if want correct overall gain can do another multiply here.
{
                RFLONG  gain_correct ' a positive value
                QMUL    acc, gain_correct
                testb   acc, #31 wz
        if_nz   mov     gain_correct, #0 ' sign correct
                GETQY   acc
                sub     acc, gain_correct
}
IIR_filter_ret  ret
' perform one second-order-section
IIR_section     rdlut   d1, state_ptr ' state kept in vector in LUT ram
                add     state_ptr, #1
                rdlut   d2, state_ptr
                RFLONG  a1 ' read coefficients
                QMUL    a1, d1 ' start queuing multiples every 8 cycles
                testb   d1, #31 wz ' while doing so get signed multiply corrections into a1/a2/b1/b2
                testb   a1, #31 wc
                RFLONG  a2
                QMUL    a2, d2
        if_nz   mov     a1, #0
        if_c    add     a1, d1 ' a1 = (d1<0 ? a1 : 0)+(a1<0 ? d1 : 0) - the signed mpy correction for a1*d1
                RFLONG  b1
                QMUL    b1, d1
                testb   b1, #31 wc
        if_nz   mov     b1, #0 ' b1 clobbered (after used in QMUL)
                RFLONG  b2
                QMUL    b2, d2 ' last multiply queued
        if_c    add     b1, d1 ' finish doing sign corrections
                testb   d2, #31 wz
                testb   a2, #31 wc
        if_nz   mov     a2, #0
        if_c    add     a2, d2
                testb   b2, #31 wc
        if_nz   mov     b2, #0
        if_c    add     b2, d2
                wrlut   d1, state_ptr ' new d2 is current d1
                RFLONG  shift ' get inter-stage shift
                cmps    shift, #0 wcz ' and set flags for later
                GETQY   r ' stall here for results, but we filled most of the delay slots above
                sub     r, a1 ' apply sign correction
                sub     acc, r ' accumulate
                sub     state_ptr, #1
                GETQY   r
                sub     r, a2
                sub     acc, r
                mov     d1, acc ' next d1 taken after accumulating the a1/a2 feedback coeffs
                GETQY   r
                sub     r, b1
                add     acc, r
                shl     d1, #2 ' scale for fix2.30 format coeffs
                GETQY   r
                sub     r, b2
                add     acc, r
                wrlut   d1, state_ptr ' new d1 now scaled so write to state
                add     state_ptr, #2 ' step to next state pair
        if_a    shl     acc, shift ' apply inter-stage scaling
        if_b    neg     shift
        if_b    sar     acc, shift
IIR_section_ret ret
state_ptr       res     1
acc             res     1
d1              res     1
d2              res     1
a1              res     1
b1              res     1
a2              res     1
b2              res     1
r               res     1
shift           res     1
gain_correct    res     1
ctr             res     1
c               res     1
                FIT     $1F0
coefficients ' example coefficients in signed fix2.30 format, a 4-pole elliptic filter
co_a1_0         long    $9FF29AB9 ' -1.5008176031
co_a2_0         long    $25e07063 ' 0.591823669728
co_b1_0         long    $1977e716 ' 0.397943279136
co_b2_0         long    $40000000 ' 1.0
                long    -3 ' shift between sections, +ve = left, -ve = right
co_a1_1         long    $9D716E8B ' -1.53995167189
co_a2_1         long    $3160055f ' 0.771485655348
co_b1_1         long    $B180F7C3 ' -1.22650342945
co_b2_1         long    $40000000 ' 1.0
                long    -2
co_a1_2         long    $9B5C08BA ' -1.57250768423
co_a2_2         long    $3a586669 ' 0.91164551065
co_b1_2         long    $A09F6B14 ' -1.49026988061
co_b2_2         long    $40000000 ' 1.0
                long    -3
co_a1_3         long    $99EAAF14 ' -1.59505103142
co_a2_3         long    $3ea0f823 ' 0.978574784981
co_b1_3         long    $9CCA6E6D ' -1.5501445708
co_b2_3         long    $40000000 ' 1.0
                long    0
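The sign-correction trick in IIR_section (the comment "a1 = (d1<0 ? a1 : 0)+(a1<0 ? d1 : 0)") can be verified off-target. This Python sketch assumes QMUL performs an unsigned 32x32 multiply and GETQY returns the upper 32 bits of the product; the function names are made up. Subtracting the correction from the unsigned high word yields the signed high word.

```python
MASK32 = 0xFFFFFFFF


def to_u32(x):
    """Two's-complement bit pattern of a signed 32-bit value."""
    return x & MASK32


def to_s32(x):
    """Signed value of a 32-bit pattern."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def unsigned_mul_high(a_bits, b_bits):
    """Model of an unsigned 32x32 multiply returning the upper 32 bits."""
    return (a_bits * b_bits) >> 32


def signed_mul_high(a, b):
    """Signed high word obtained from the unsigned multiply plus a correction."""
    high = unsigned_mul_high(to_u32(a), to_u32(b))
    correction = (a if b < 0 else 0) + (b if a < 0 else 0)
    return to_s32(high - correction)


values = [0, 1, -1, 12345, -98765, 0x40000000, -0x60000000, 0x7FFFFFFF, -0x80000000]
print(all(signed_mul_high(a, b) == (a * b) >> 32 for a in values for b in values))
```

The reason it works: a negative operand's bit pattern is the true value plus 2^32, so the unsigned product's high word is too large by the other operand, and the correction removes exactly that excess modulo 2^32.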