Digital filtering approaches on the Prop2

I thought I'd start a thread to gather any techniques and ideas for implementing digital filters using the P2.

The P2's multiply and CORDIC instructions clearly open up a variety of approaches that go beyond the P1's abilities.
A while back I posted a tool for generating IIR filter code on the P1, and there's Phil Pilgrim's FIR2PASM: forums.parallax.com/discussion/133173/fir2pasm-automatic-fir-filter-code-generator/p1
Both target the P1, so they have to work around the lack of a multiply instruction (with shifts, adds, etc.).

So lately I've been thinking about ways to implement IIRs (both second-order sections (biquads) and lattice filters) using the CORDIC
pipeline. That's still a work in progress.

Over the last couple of days I decided to tackle FIR filters using 16-bit inputs/outputs/coefficients and the MULS instruction. My latest effort is included below. Basically I realized the FIFO memory accesses allow the tapped delay line to live in hub ram, with the coefficients in cog ram, meaning this simple fragment can serve as the multiply-accumulate engine at the heart of the filter:
        rfword  t
        muls    t, coeffN
        add     sum, t
provided RDFAST has been set up for the tapped delays. That's 3 instructions (6 clocks) of overhead per tap, so on a 160MHz sysclk that's about a 250kSPS processing rate for a 100-tap FIR filter.
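
As a rough sanity check on that figure, here is a minimal Python sketch of the arithmetic. It assumes each of the three instructions takes 2 clocks and it ignores the RDFAST setup, the I/O and the copyback, so it gives an upper bound rather than a measured rate. The function name is just for illustration.

```python
SYSCLK_HZ = 160_000_000      # sysclk used in this thread
CLOCKS_PER_INSTR = 2         # assumption: RFWORD, MULS and ADD each take 2 clocks
INSTRS_PER_TAP = 3           # RFWORD / MULS / ADD


def max_sample_rate(num_taps, sysclk_hz=SYSCLK_HZ):
    """Upper bound on samples/s for the unrolled multiply-accumulate engine alone."""
    clocks_per_sample = num_taps * INSTRS_PER_TAP * CLOCKS_PER_INSTR
    return sysclk_hz / clocks_per_sample


if __name__ == "__main__":
    for taps in (100, 169):
        rate = max_sample_rate(taps)
        print(f"{taps} taps: at most {rate:.0f} samples/s (engine only)")
```

For 100 taps the bound is roughly 267kSPS, which is consistent with "about 250kSPS" once per-sample housekeeping is added on top.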

It's never that simple of course: the values have to shift down the array of taps on each sample, which naively would add a lot of overhead for shifting the data. However, once the taps live in hub memory you can slide a window over memory instead, letting the active set of taps slide along. At some point you need to
copy them back to prevent the whole of memory being overwritten, but for a 100-tap filter you can afford to do this only occasionally. In fact I do it
every 32 samples, as 32 words take up one 64-byte FIFO block for fast reads/writes.
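
To make the sliding-window idea concrete, here is a small Python model of it (a sketch of the bookkeeping only, not of the PASM). The buffer holds N taps plus one block of slack, the write pointer crawls down one word per sample, and every `block` samples the taps are moved back up. The class and function names are made up for illustration, and a naive shifting filter is included purely as a reference to compare against.

```python
class SlidingWindowFir:
    """Model of a tap vector that slides through a buffer instead of shifting."""

    def __init__(self, coeffs, block=32):
        self.coeffs = list(coeffs)
        self.n = len(self.coeffs)
        self.block = block                      # samples between copybacks
        self.buf = [0] * (self.n + block)       # N taps plus one block of slack
        self.ptr = block                        # the newest sample is written here
        self.copybacks = 0

    def step(self, x):
        self.buf[self.ptr] = x                  # one write per sample, no shifting
        acc = 0
        for k, c in enumerate(self.coeffs):     # read N words upwards from ptr
            acc += c * self.buf[self.ptr + k]
        self.ptr -= 1                           # window crawls down by one word
        if self.ptr == 0:                       # slack used up: move the taps back up
            # the right-hand slice is copied first, so the overlap is safe
            self.buf[self.block:self.block + self.n] = self.buf[0:self.n]
            self.ptr = self.block
            self.copybacks += 1
        return acc


def shifting_fir(coeffs, samples):
    """Reference: the naive version that shifts the whole delay line each sample."""
    delay = [0] * len(coeffs)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out
```

Both versions produce the same output sequence; the difference is that the sliding version touches all N taps only once per `block` samples instead of on every sample.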

I gave up hope of clever ways to use the typical symmetry of FIR coefficients for optimization, as the FIFO memory access is unidirectional.

I also realized the LUT ram could be used for the unrolled loop of the 3-instruction fragment above, allowing 512/3 = 170 max tap count.

Before that I was limited to about 100 taps as coefficients, housekeeping code and unrolled loop all had to coexist in 496 cog longs.

So I store up to 170 coefficients in cog ram, 170+32 words of tap vector in hub ram, and 511 instructions' worth of unrolled loop in LUT ram. The code generates the unrolled loop depending on the tap count, and loads the coefficients from a hub ram word array at startup. The arguments are the tap count, the array of word coefficients, and a pointer to the hub ram workspace.
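
The LUT budget can be illustrated with a short Python sketch that emits the unrolled engine as text and checks it against the 512-long LUT. This is only a host-side illustration of the layout (the actual code builds the engine on the P2 at startup); the `coeff+k` operand spelling and the function name are assumptions made for the example.

```python
LUT_LONGS = 512          # LUT ram size in longs
INSTRS_PER_TAP = 3       # RFWORD / MULS / ADD


def build_engine(num_taps, coeff_base="coeff"):
    """Return the unrolled engine as a list of instruction strings."""
    needed = num_taps * INSTRS_PER_TAP + 1      # +1 for the jump back to cog code
    if num_taps < 1 or needed > LUT_LONGS:
        raise ValueError(f"{num_taps} taps need {needed} longs, LUT has {LUT_LONGS}")
    lines = []
    for k in range(num_taps):
        lines.append("rfword  t")                       # next delayed sample from the FIFO
        lines.append(f"muls    t, {coeff_base}+{k}")    # one coefficient register per tap
        lines.append("add     sum, t")                  # accumulate
    lines.append("jmp     #\\engine_done")              # absolute jump back to cog ram
    return lines


if __name__ == "__main__":
    print(len(build_engine(170)), "longs used for 170 taps")
```

With 170 taps this uses 511 longs; 171 taps would need 514 and is rejected.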

The demonstration code synthesizes a simple step waveform using the CT register as input, and drives a couple of smartpins in dither DAC mode to present the input and output waveforms for 'scope perusal. I've got a 169-tap example filter running at 96kSPS (fairly close to the limit as it happens), so it's ready to play with.
' fast_fifo_firfilter.spin2
'
' Mark Tillotson, 2019-01-19
'
' 16 bit FIR filter cog test
'
' Use RDFAST and hub delay tap vector, plus coefficients vector in COG ram, with generated unrolled
' instruction stream in LUT ram to perform RFWORD/MULS/ADD per tap
' hub array is regularly copied down when it has moved a FIFO block's worth (64 bytes = 32 words)
' rather than shifting it every sample.
'
' the unidirectional nature of the FIFO hub access prevents easy exploitation of symmetric/antisymmetric
' structure of FIR coefficients.
' Manages 96kSPS rate for test filter of 169 coefficients, one cog
'
' using WAITCT1 and a synthesized step waveform derived from CT register, outputs to 16 bit dithered DAC
' on pin 48
'
' test filter has cutoff at 0.1 * Nyquist (0.05 * Fs), constructed using Python scipy.signal library's firwin() function


CON
    OSCMODE = $010c3f04
    FREQ = 160_000_000

    MAXN = 170   ' 170 * 3 = 510, so 170 is maximum for 3 instructions/tap in LUT ram

    DACpin = 48  ' output DAC
    DACpin2 = 50 ' input waveform DAC
    dither = %0000_0000_000_10100_00000000_01_00010_0    ' remember to wypin

    FIFO_POWER = 6
    FIFO_BLK = 1 << FIFO_POWER
OBJ

VAR
    WORD  temp [MAXN + FIFO_BLK/2] ' work space for taps vector

    LONG  num_taps    ' arg block to cog
    LONG  tap_buffer
    LONG  coeff_buffer
    

PUB demo | i
    clkset (OSCMODE, FREQ)

    ' setup arguments
    num_taps := 169
    tap_buffer := @temp
    coeff_buffer := @coefficients
    ' start filter running
    cognew (@fir_filter, @num_taps)

    repeat
        pausems(10)


DAT

' from Python's scipy.signal library:    coeffs = firwin (177, 0.1, window='hanning')
' then scaled by 2^15 and rounded to int and leading/trailing zeroes removed to leave 169 coefficients
coefficients	WORD 1, 1, 1, 1, 0, -1, -2, -4, -6, -7, -8, -8, -7, -4, 0, 5, 11, 17, 22, 26, 27, 25, 20, 11, 0, -13, -28, -41
		WORD -52, -59, -60, -54, -42, -24, 0, 27, 55, 80, 100, 112, 113, 102, 78, 44, 0, -49, -98, -142, -177, -196, -197
		WORD -177, -136, -75, 0, 84, 169, 245, 305, 340, 342, 308, 238, 133, 0, -151, -307, -452, -571, -647, -665, -615
		WORD -488, -282, 0, 349, 751, 1187, 1635, 2070, 2468, 2805, 3062, 3222, 3277, 3222, 3062, 2805, 2468, 2070, 1635
		WORD 1187, 751, 349, 0, -282, -488, -615, -665, -647, -571, -452, -307, -151, 0, 133, 238, 308, 342, 340, 305, 245
		WORD 169, 84, 0, -75, -136, -177, -197, -196, -177, -142, -98, -49, 0, 44, 78, 102, 113, 112, 100, 80, 55, 27, 0
		WORD -24, -42, -54, -60, -59, -52, -41, -28, -13, 0, 11, 20, 25, 27, 26, 22, 17, 11, 5, 0, -4, -7, -8, -8, -7, -6
		WORD -4, -2, -1, 0, 1, 1, 1, 1

		ORG	0

fir_filter	WRPIN   ##dither, #DACpin   ' smartpin mode for dithered 16 bit DAC
		WRPIN	##dither, #DACpin2

		call	#setup          ' the setup code clashes with where the coeffs get loaded to save space.
		call	#setup_coeffs   ' calling this overwrites the setup routine
		
		GETCT	time            ' test framework uses WAITCT for timing
		ADDCT1	time, ##1667    ' about 96kHz  (filter takes just under 10us worst case at 160MHz sysclk)

fir_outer_loop  mov	pass_count, #FIFO_BLK/2  ' every 32 samples copy back the taps as they have crawled up in memory by 64 byte FIFO block size

fir_loop	call	#get_input
		wrword  input, taps	 ' add to one end of taps
		mov	sum, #0          ' prepare to sum
		RDFAST	#0, taps 	 
		jmp	jout		 ' into the generated engine code, which is in LUT, so #engine won't work
jout		long    engine

engine_done				 ' back here from engine
		cmp	remaind, #0  wz  ' complete FIFO block (is this necessary?)
	if_z	jmp	#.rem
		REP	@.rem, remaind
		RFWORD  t
.rem
		sub	taps, #2	 ' crawl back the tap pointer by one word each pass
		call	#put_output
		djnz	pass_count, #fir_loop

		call	#copyback        ' time to shift the whole taps vector with FIFO fast memory ops (calling this overwrites setup_coeffs)
		jmp	#fir_outer_loop


' testing input waveform
get_input	getct   input
		shr	input, #17  ' generate step pattern from CT, repeats every 2^20 clocks
		and	input, #7
		xor	input, #5
		shl	input, #13
		sub	input, ##$7000
		WYPIN   input, #DACpin2
get_input_ret	ret


' testing output via DAC smartpin
put_output      WAITCT1    ' synchronize for clean output timing
		ADDCT1	time, ##1667    ' about 96kHz

		sar     sum, #15        ' scale by 2^-15 to compensate for MULS
		add	sum, ##$8000    ' offset for DAC
		WYPIN   sum, #DACpin    ' update dithered DAC
put_output_ret  ret

N		long	0       ' number of taps / coefficients
Ncopy		long	0       ' reps for copydown
coeffs		long	0       ' hub address coefficients
taps0		long	0       ' address of tap vector + 64
taps		long	0       ' current taps ptr, slides back from taps0 by 2 each sample until copyback
pass_count	long	32      ' loop counter for triggering copyback
remaind		long	0       ' fast word reads/writes left over after doing N, to end current block
time		long	0       ' for waitct1 synch


copyback	mov	taps, taps0  ' every 32 samples we correct for the sliding taps window using fast FIFO
		add	taps, N      ' I'm sure there's a better way to do this on P2
		add	taps, N      ' taps used as working pointer starting at end of taps vector
		sub	taps, #FIFO_BLK*2   ' 32 word, copy as 16 longs as aligned to 64 byte blocks for fast read/write

		REP	@.done, Ncopy  ' copy Ncopy FIFO blocks
		RDFAST	#0, taps
		RFLONG	t+0
		RFLONG	t+1
		RFLONG	t+2
		RFLONG	t+3
		RFLONG	t+4
		RFLONG	t+5
		RFLONG	t+6
		RFLONG	t+7
		RFLONG	t+8
		RFLONG	t+9
		RFLONG	t+10
		RFLONG	t+11
		RFLONG	t+12
		RFLONG	t+13
		RFLONG	t+14
		RFLONG	t+15
		add	taps, #FIFO_BLK
		WRFAST	#0, taps
		WFLONG	t+0
		WFLONG	t+1
		WFLONG	t+2
		WFLONG	t+3
		WFLONG	t+4
		WFLONG	t+5
		WFLONG	t+6
		WFLONG	t+7
		WFLONG	t+8
		WFLONG	t+9
		WFLONG	t+10
		WFLONG	t+11
		WFLONG	t+12
		WFLONG	t+13
		WFLONG	t+14
		WFLONG	t+15
		sub	taps, #FIFO_BLK*2
.done
		mov	taps, taps0  ' restore taps for main loop
copyback_ret	ret

' these temporary vars extend to overwrite setup_coeffs when copyback is called
' 16 longs from t onwards are used in copyback
t		long	0
sum		long	0
cptr		long	0
eptr		long	0
input		long	0
		long    0
		long    0
		long    0


' used at setup to read coeffs from hub ram to coeff registers
setup_coeffs    RDFAST	#0, coeffs		 ' for reading coeff values in
		mov	cptr, #coeff
		
		REP     @.done, N
		ALTD	cptr
		RFWORD  0-0	' get coeff into register
		add	cptr, #1
.done
		cmp	remaind, #0  wz   ' if last FIFO block partial, complete it
	if_z	jmp	#setup_coeffs_ret
		REP	@.done2, remaind
		RFWORD  t
.done2
setup_coeffs_ret ret




coeff	

' this setup code gets overwritten with coefficients when setup_coeffs is called after it
setup		rdlong	N, PTRA++
		rdlong	taps0, PTRA++
		add	taps0, #64
		rdlong	coeffs, PTRA++

		call	#setup_engine
		
		mov	Ncopy, N
		add	Ncopy, #FIFO_BLK/2-1
		shr	Ncopy, #FIFO_POWER-1
		mov	remaind, N
		and	remaind, #FIFO_BLK/2-1
		neg	remaind, remaind
		add	remaind, #FIFO_BLK/2
		and	remaind, #FIFO_BLK/2-1
		
		mov	taps, taps0
		WRFAST	#0, taps
		REP	@.done, N   ' zero the taps at start
		WFWORD	#0
.done
		cmp	remaind, #0  wz
	if_z	jmp	#setup_ret
		REP	@.done2, remaind  ' complete whole FIFO block
		WFWORD  #0
.done2
setup_ret	ret


' copy instructions into the engine area (in LUT ram), depending on number of coefficients
setup_engine    mov     cptr, #coeff
		mov     eptr, ##engine   ' engine in LUT so need ##

		REP	@.done, N

		SETS	muls_instr, cptr    ' unique constant reg per multiply instruction
	
		WRLUT	read_instr, eptr
		add	eptr, #1
		WRLUT	muls_instr, eptr
		add	eptr, #1
		WRLUT	add_instr, eptr
		add	eptr, #1
	
		add	cptr, #1
.done
		WRLUT   jmpback_instr, eptr
setup_engine_ret ret


' proforma instructions for engine
read_instr	RFWORD	t
muls_instr	MULS	t, 0-0    ' source will be a coefficient register
add_instr	add	sum, t
jmpback_instr	jmp	#\engine_done  ' absolute jmp as is copied to end of engine section

' set aside space for coefficients, allowing for the stuff above that gets overwritten
		res     MAXN - 14 - 22 - 4    ' 14 is the long count for setup_engine (incl. AUGS), 22 for setup, 4 for proforma
	
		FIT	$1F0   ' check it fits

		ORG     $200   ' LUT ram

engine		res	MAXN*3+1   ' engine instructions go here, RFWORD/MULS/ADD/........./JMP
		FIT	$400
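
For anyone without scipy to hand, the coefficient table in the listing can be approximated with a plain windowed-sinc design. The sketch below is a stdlib-only re-derivation, assuming firwin() here amounts to an ideal low-pass (cutoff given as a fraction of Nyquist) multiplied by a symmetric Hann window and normalised to unity gain at DC. It is not the scipy call itself, and the function name is made up.

```python
import math


def hann_lowpass_q15(num_taps=177, cutoff=0.1, scale=1 << 15):
    """Windowed-sinc low-pass, cutoff as a fraction of Nyquist, quantised to integers."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = cutoff * (n - mid)
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))  # symmetric Hann
        taps.append(cutoff * sinc * window)
    gain = sum(taps)                             # normalise for unity gain at DC
    quantised = [round(scale * t / gain) for t in taps]
    # strip leading/trailing zeroes, as described in the listing's comments
    while quantised and quantised[0] == 0:
        quantised.pop(0)
    while quantised and quantised[-1] == 0:
        quantised.pop()
    return quantised


if __name__ == "__main__":
    q = hann_lowpass_q15()
    print(len(q), "coefficients, peak", max(q))
```

The Hann window is zero at both ends and the next few products round to zero as well, which is why 177 designed taps shrink to 169 stored coefficients.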
P1010183.JPG
P1010184.JPG

Comments

  • There's something perplexing me about the first 'scope image; see if you can figure out what. Remember this
    is running at 96kSPS with 169 taps.
  • Window length comes to 1.76 ms, right? But the lag is over 4 ms.
    "We suspect that ALMA will allow us to observe this rare form of CO in many other discs.
    By doing that, we can more accurately measure their mass, and determine whether
    scientists have systematically been underestimating how much matter they contain."
  • Mark_T Posts: 1,981
    edited 2019-01-20 - 01:22:03
    Yup, I'm expecting 1/2 a window's delay, about 800us, but seeing 4ms. Odd. Perhaps I have a bug that allows it to still work but store more samples somehow? Perhaps smartpins are smarter than I thought? Perhaps my 'scope is confused? Perhaps the P2 can travel through time? Fresh attack on this in the morning I think.
  • I just tried your program but no joy. DAC output is flat zero.

    Fastspin v3.9.12 doesn't complain so I presume all is good there. I changed the pin numbers to 10 and 11 to suit, I already had a couple of scope probes attached there. I also tried adding a DIRH for each pin but that didn't help either.
  • Odd. You've had that DAC mode working on your board before?
  • I don't see anywhere in the source where DIR is set high for those two smartpins. Is some of the source code missing?
  • Mark_T Posts: 1,981
    edited 2019-01-20 - 13:13:35
    No, it works as is for me. I remember having DIRH in there before for DAC driving, then commented it out and the code
    still worked, so I figured it's not needed... Also I can change to pin 10 and it works.
  • Huh, I just had to update Fastspin. Now using v3.9.15. And yep, it's fine without DIR set. Also getting the same 4.175 ms lag.
  • evanh Posts: 7,735
    edited 2019-01-20 - 13:33:43
    fast-fifo-filter binary size went up from 2624 to 2848 bytes, so clearly something was being missed out before.

    I'm off to bed for now.
  • Ah. One mystery down, one remains, I also looked at the output signal on the SA and the filter looks to be
    cleanly LPF'ing
  • I'm just starting to look over your code. I note you've got some questions in there. The first clarification I can make is about the JMP #engine workaround you've done. Pnut seems to miscalculate PC-relative addresses when below $400, and I guess the others have followed suit. The recommended solution is JMP #\engine. This tells the assembler you want it encoded as absolute, so it works. Further reading - https://forums.parallax.com/discussion/comment/1457051/#Comment_1457051
  • evanh Posts: 7,735
    edited 2019-01-21 - 01:50:25
    Next up is this:
    complete FIFO block (is this necessary?)
    I haven't done much with the FIFO, but it should only need emptying if using FBLOCK. Since you're using RDFAST every time, the answer should be no.
  • I see you've used a WAITCT for sample timing. I did note that both waveforms have a jitter to them. I can't say if that would be due to the nature of generator calculations or a sample rate problem.
  • evanh Posts: 7,735
    edited 2019-01-21 - 02:39:25
    Oh, that copyback routine should definitely be using SETQ+RDLONG and SETQ+WRLONG. It'll be 4x faster as a guess.

    EDIT: And a nice side effect of using this way of fast block copies is they are entirely independent of the FIFO. You can leave the FIFO operating at the same time.
  • jmp #\engine can only address cog ram surely? I think I tried jmp ##\engine and gave up and used a register instead.

    I thought the output samples were fully synchronized, IIRC I looked at the SA output. Input waveform isn't, so
    I guess that might be allowing a varying step waveform into the filter.

    SETQ+RD/WRLONG - good call, I was fixated on FIFO ops. I think the copydown speed is OK though,
    not anticipating a 4-fold speed up.

    I think when writing with the FIFO you need to at least use a multiple of 8 longs to trigger each fifo write, otherwise
    I don't see how the hardware can work.

    BTW currently getting IIR filter code to work, using pipelined cordic multiplies over second-order-sections and the
    basics are working now, scaling is still an issue (you need to compensate for Q at each stage to avoid overflow).

    Each second order stage uses 4 QMUL's fired off every 16 cycles, so 400ns/section (plus 400ns for pipeline setup).
    So that's about 200ns/pole/sample.

    I also intend to look at implementing lattice filters too, allpass and generic IIR.
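
The timing figures quoted above can be reproduced with a few lines of Python. This is only the arithmetic from the post (4 QMULs per section, one every 16 clocks, plus one section-time of pipeline setup, at 160MHz); the function names are illustrative and nothing here is measured.

```python
SYSCLK_HZ = 160_000_000
QMULS_PER_SECTION = 4
CLOCKS_BETWEEN_QMULS = 16    # figure quoted in the post
POLES_PER_SECTION = 2        # a second-order section contributes two poles


def section_time_ns(sysclk_hz=SYSCLK_HZ):
    """Time for one second-order section's four multiplies."""
    clocks = QMULS_PER_SECTION * CLOCKS_BETWEEN_QMULS
    return clocks * 1e9 / sysclk_hz


def filter_time_ns(num_sections, sysclk_hz=SYSCLK_HZ):
    """Sections back to back, plus one section-time of pipeline setup."""
    return (num_sections + 1) * section_time_ns(sysclk_hz)


if __name__ == "__main__":
    print(section_time_ns(), "ns per section")
    print(section_time_ns() / POLES_PER_SECTION, "ns per pole per sample")
    print(filter_time_ns(2), "ns for a two-section filter")
```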
  • My understanding of the FIFO hardware is it doesn't have to burst whole rotation blocks. At the SRAM level hubRAM locations can be addressed/accessed on single clock cycles.

    So on hubRAM reads, RDFAST, the FIFO pre-fills a whole hub rotation before returning to allow the first RFxxxx. And I think it'll top up at every opportunity after that.
    On hubRAM writes, WRFAST will simply buffer the WFxxxx's and then write to hubRAM as rotation opportunities allow. It doesn't need to burst. There is no requirement for a cog to complete a rotation block.

    This means RDFAST has a FIFO filling stall at the start. WRFAST on the other hand just has an emptying duration at the end that may stall cog execution if quickly reusing the FIFO or other hubRAM access at the same time.
  • evanh Posts: 7,735
    edited 2019-01-21 - 10:07:54
    Mark_T wrote:
    jmp #\engine can only address cog ram surely? I think I tried jmp ##\engine and gave up and used a register instead.
    Code addressing is separate to data addressing. It's a flat map starting from cogRAM to lutRAM to hubRAM. That's why below $400 hubRAM cannot execute in place.

    So the backslash in pasm sources is also unique to code addressing, although LOC is useful for hubRAM data referencing.
  • So I've got my IIR digital filters working (Python script generated P2ASM), hooked up like the FIR code:

    input waveform:
    iir_input_wave.jpg
    4-pole Bessel filter:
    iir_bessel.jpg
    4-pole Butterworth:
    iir_butterworth.jpg
    4-pole elliptic:
    iir_elliptic.jpg

  • Mark_T wrote:
    Yup, I'm expecting 1/2 a window's delay, about 800us, but seeing 4ms. Odd. Perhaps I have a bug that allows it to still work but store more samples somehow? Perhaps smartpins are smarter than I thought? Perhaps my 'scope is confused? Perhaps the P2 can travel through time? Fresh attack on this in the morning I think.

    I figured this out: I was overdriving the input to the FIR filter, which led to sign flipping. This happened to look exactly like the staircase waveform shifted by half its period. Once I reduced the amount by which the generated waveform values are shifted left, the anomaly went away.
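
The sign flip is easy to reproduce on the host. The Python sketch below assumes the filter only sees the low 16 bits of each sample as a two's-complement value (as with 16-bit words and MULS). `step_level` is a hypothetical stand-in for the test generator, and the shift of 14 is just an example of overdriving, not the value actually used at the time.

```python
def as_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def step_level(step, shift):
    """Hypothetical stand-in for the test generator: (step << shift) - $7000."""
    return (step << shift) - 0x7000


if __name__ == "__main__":
    for shift in (13, 14):
        intended = [step_level(s, shift) for s in range(8)]
        seen = [as_signed16(v) for v in intended]
        print(f"shift {shift}: intended {intended}")
        print(f"          as 16-bit {seen}")
```

With a shift of 13 every level fits in 16 bits and passes through unchanged; with 14 the upper levels wrap and change sign, which distorts the staircase the filter actually sees.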
  • Always good to clean those up. Sometimes software goes into production that way. :O
  • Mark_T Posts: 1,981
    edited 2019-01-26 - 23:13:14
    I've reworked my IIR filtering to be table driven (rather than Python-generated), which is more convenient to use and
    dynamically reconfigurable too.

    Input (blue), output (yellow), for a 4-pole elliptic filter:
    iir_2.jpg

    The coefficients are in hub ram, and the state is kept in LUT ram (for no deep reason),
    you provide each second-order section's a1/a2/b1/b2 coefficients (and a binary scale
    factor between sections rather than an a0). fix2.30 format is used for coefficients, with
    the slight fudge that +2.0 has to be represented inexactly (coefficients are always in the
    range -2.0..+2.0 in a second order section minimum phase filter)
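
As an illustration of the fix2.30 format, here is a minimal Python sketch for converting a floating-point coefficient to a 32-bit table entry and back. Saturating +2.0 to the largest positive code is my reading of "represented inexactly" and is an assumption, as are the function names.

```python
def to_fix2_30(x):
    """Convert a float in [-2.0, +2.0] to a 32-bit two's-complement fix2.30 word."""
    if not -2.0 <= x <= 2.0:
        raise ValueError("coefficient outside -2.0..+2.0")
    raw = round(x * (1 << 30))
    raw = min(raw, 0x7FFFFFFF)          # assumption: +2.0 saturates to the largest code
    return raw & 0xFFFFFFFF


def from_fix2_30(word):
    """Inverse conversion, handy for checking a table entry."""
    if word & 0x80000000:
        word -= 1 << 32
    return word / (1 << 30)


if __name__ == "__main__":
    for value in (1.0, -1.5008176031, 0.591823669728, 2.0):
        print(f"{value:+.10f} -> ${to_fix2_30(value):08X}")
```

Running it on 1.0 gives $40000000, the b2 value used throughout the table in the listing.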
    '
    ' Test IIR cordic code
    '
    CON
        OSCMODE = $010c3f04
        FREQ = 160_000_000
        BAUD = 2*115200
    
        DACpin_in = 50  ' input waveform
        DACpin_out = 48  ' output DAC
        dither = %0000_0000_000_10100_00000000_01_00010_0    ' remember to wypin
    
        PERIOD_IN_CLKS = $400
    
    OBJ
        ser: "SmartSerial.spin2"
    
    
    
    PUB demo | i
        clkset (OSCMODE, FREQ)
        ser.start (63, 62, 0, BAUD)
     
        ser.str (string ("IIR cordic test 2, table driven"))
        ser.tx (13)
        ser.tx (10)
        pausems(1000)
    
        N	   := 4
        coeffs := @coefficients
        first_state := $200
        cognew (@entry)
    
    DAT
    
                    ORG     0
    
    ' test harness, generate stepped waveform from bits of CT register, output input wave and filtered results to 2 dithered DAC smartpins
    entry
    		call	#setup_IIR             ' zero out the state (in LUT ram)
    		
    		WRPIN   ##dither, #DACpin_out  ' smartpin mode for dithered 16 bit DAC
    		WRPIN   ##dither, #DACpin_in   ' smartpin mode for dithered 16 bit DAC
    		GETCT	time                   ' test framework uses WAITCT for timing
    		ADDCT1	time, ##PERIOD_IN_CLKS ' 1024 cycle sample period, 6.4us at 160MHz
    looper
    get_input	getct   acc
    		shr     acc, #1
    		and	acc, ##$C000 ' generate step pattern from CT bits 15..16, repeats every 2^17 clocks
    		sub	acc, ##$7000
    		WYPIN   acc, #DACpin_in
    		
    		call	#IIR_filter
    		
    		add	acc, ##$6000
    		and	acc, ##$FFFF
    
    		WAITCT1    ' synchronize for clean output timing
    		ADDCT1	time, ##PERIOD_IN_CLKS
    		WYPIN   acc, #DACpin_out
    		jmp	#looper
    		
    time		long    0		
    
    
    
    {{ Table-driven IIR Filter code using cordic multiplies  }}
    
    N		long	4
    first_state	long	$200		' address of LUT ram (this wraps anyway for rdlut/wrlut instructions, but give the full address)
    coeffs		long	@coefficients
    
    
    ' setup state, call once at start
    setup_IIR	mov	ctr, N          ' N is number of SoS's (second-order-sections)
    		shl	ctr, #1         ' two longs of state per SoS
    		mov	state_ptr, first_state
    		mov	d1, #0
    .zeroloop
    		wrlut	d1, state_ptr
    		add	state_ptr, #1
    		djnz	ctr, #.zeroloop
    setup_IIR_ret   ret
    
    
    ' IIR filter routine, input in acc, returns output also in acc
    IIR_filter
    		mov	state_ptr, first_state
    		RDFAST	#0, coeffs
    		mov	c, N
    .iir_loop
    		call	#IIR_section      ' loop for each SoS, state in LUT, coefficients and scaling shifts from hubram
    		djnz	c, #.iir_loop
    
    		' if want correct overall gain can do another multiply here.
    		{
    		RFLONG  gain_correct	' a positive value
    		QMUL	acc, gain_correct
    		testb	acc, #31  wz
    	if_nz	mov	gain_correct, #0 ' sign correct
    		GETQY	acc
    		sub	acc, gain_correct
    		}
    IIR_filter_ret	ret
    
    
    ' perform one second-order-section
    IIR_section	rdlut	d1, state_ptr	' state kept in vector in LUT ram
    		add	state_ptr, #1
    		rdlut	d2, state_ptr
    		
    		RFLONG  a1		' read coefficients
    
    		QMUL	a1, d1		' start queuing multiplies every 8 cycles
    
    		testb	d1, #31  wz	' while doing so get signed multiply corrections into a1/a2/b1/b2
    		testb	a1, #31	 wc
    		RFLONG	a2
    
    		QMUL	a2, d2
    
    	if_nz	mov	a1, #0
    	if_c	add	a1, d1		' a1 = (d1<0 ? a1 : 0)+(a1<0 ? d1 : 0) - the signed mpy correction for a1*d1
    		RFLONG	b1
    
    		QMUL	b1, d1
    		
    		testb	b1, #31	 wc
    	if_nz	mov	b1, #0		' b1 clobbered (after used in QMUL)
    		RFLONG	b2
    
    		QMUL	b2, d2		' last multiply queued
    		
    	if_c	add	b1, d1          ' finish doing sign corrections
    		testb	d2, #31  wz
    		testb	a2, #31	 wc
    	if_nz	mov	a2, #0
    	if_c	add	a2, d2
    		testb	b2, #31	 wc
    	if_nz	mov	b2, #0
    	if_c	add	b2, d2
    
    		wrlut	d1, state_ptr   ' new d2 is current d1
    
    		RFLONG	shift		' get inter-stage shift
    		cmps	shift, #0  wcz  ' and set flags for later
    
    		GETQY	r		' stall here for results, but we filled most of the delay slots above	
    		sub	r, a1		' apply sign correction
    		sub	acc, r		' accumulate
    		sub	state_ptr, #1
    
    		GETQY   r	
    		sub	r, a2
    		sub	acc, r
    		mov	d1, acc		' next d1 taken after accumulating the a1/a2 feedback coeffs
    
    		GETQY	r
    		sub	r, b1
    		add	acc, r
    		shl	d1, #2		' scale for fix2.30 format coeffs
    		
    		GETQY   r	
    		sub	r, b2
    		add	acc, r
    		wrlut	d1, state_ptr   ' new d1 now scaled so write to state
    		add	state_ptr, #2	' step to next state pair
    		
    	if_a	shl	acc, shift	' apply inter-stage scaling
    	if_b	neg	shift
    	if_b	sar	acc, shift
    IIR_section_ret	ret
    
    
    state_ptr	res	1
    acc		res	1
    d1		res	1
    d2		res	1
    a1		res	1
    b1		res	1
    a2		res	1
    b2		res	1
    r		res	1
    shift		res	1
    gain_correct	res	1
    ctr		res	1
    c		res	1
    
    		FIT	$1F0
    
    coefficients		' example coefficients in signed fix2.30 format, a 4-pole elliptic filter
    co_a1_0         long    $9FF29AB9  ' -1.5008176031
    co_a2_0         long    $25e07063  ' 0.591823669728
    co_b1_0         long    $1977e716  ' 0.397943279136
    co_b2_0         long    $40000000  ' 1.0
    		long	-3   ' shift between sections, +ve = left, -ve = right
    co_a1_1         long    $9D716E8B  ' -1.53995167189
    co_a2_1         long    $3160055f  ' 0.771485655348
    co_b1_1         long    $B180F7C3  ' -1.22650342945
    co_b2_1         long    $40000000  ' 1.0
    		long    -2
    co_a1_2         long    $9B5C08BA  ' -1.57250768423
    co_a2_2         long    $3a586669  ' 0.91164551065
    co_b1_2         long    $A09F6B14  ' -1.49026988061
    co_b2_2         long    $40000000  ' 1.0
    		long	-3
    co_a1_3         long    $99EAAF14  ' -1.59505103142
    co_a2_3         long    $3ea0f823  ' 0.978574784981
    co_b1_3         long    $9CCA6E6D  ' -1.5501445708
    co_b2_3         long    $40000000  ' 1.0
    		long	0
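
The signed-multiply correction used in IIR_section (the a1 = (d1<0 ? a1 : 0)+(a1<0 ? d1 : 0) comment) can be checked on the host. The Python sketch below models the upper 32 bits of an unsigned 32x32 multiply, which is what the listing reads back with GETQY, applies the same correction, and compares against a direct signed multiply. The function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(word):
    """Interpret a 32-bit register image as a two's-complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word


def signed_high_word(a, d):
    """Upper 32 bits of the signed product, built from an unsigned multiply.

    a and d are 32-bit register images (0..2^32-1).
    """
    high_unsigned = (a * d) >> 32       # upper half of the unsigned 64-bit product
    correction = 0
    if d & 0x80000000:                  # d negative: the unsigned result is too big by a
        correction += a
    if a & 0x80000000:                  # a negative: the unsigned result is too big by d
        correction += d
    return (high_unsigned - correction) & MASK32


def reference_high_word(a, d):
    """Same quantity computed directly with Python's signed integers."""
    return ((to_signed32(a) * to_signed32(d)) >> 32) & MASK32


if __name__ == "__main__":
    a, d = 0x9FF29AB9, 0xFFFF0000       # a negative coefficient times a negative state
    print(hex(signed_high_word(a, d)), hex(reference_high_word(a, d)))
```

The correction works because a negative 32-bit value read as unsigned is 2^32 too large, so each negative operand adds the other operand to the upper word of the product.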
    