digital filtering using CORDIC
ManAtWork
in Propeller 2
I've been experimenting a bit with digital filters. IIR filters need far fewer taps and mul/add instructions than FIR filters. So I thought I could use floating point math to get rid of all scaling and overflow issues. Unfortunately, computing a 3rd order Bessel filter took around 7900 cycles, which is just a bit too slow for my application. I'm not sure if 16 bit fixed point math would be sufficient. So I had the idea of using the CORDIC unit for the filtering. I don't know if it's the best possible solution, but at least the code looks cool and IMHO is a good demonstration of how to make full use of the pipeline of the CORDIC unit.
I'll post the full source code after I've made some fine tuning...
```
' The CORDIC unit doesn't support signed multiplication.
' We do a trick and compute y = 2*x*sin(arcsin(c/2))
' which is the same as y = x*c for -2 <= c <= +2.
' QROTATE computes y = D * sin(S), x is ignored.
' The arcsines of the coefficients are pre-calculated.
        qrotate in,a0   // 0 start feeding pipeline
        qrotate x1,a1   // +8 clocks
        qrotate y1,b1   // +16
        qrotate x2,a2   // +24
        qrotate y2,b2   // +32
        qrotate x3,a3   // +40
        qrotate y3,b3   // +48
        getqy   y0      // +56 result of in*a0
        getqy   r       // +64 result of x1*a1
        add     y0,r
        getqy   r       // +72 result of y1*b1
        sub     y0,r
        getqy   r       // +80 result of x2*a2
        add     y0,r
        getqy   r       // +88 result of y2*b2
        sub     y0,r
        getqy   r       // +96 result of x3*a3
        add     y0,r
        getqy   r       // +104 result of y3*b3
        sub     y0,r
```
Execution time should be 106+something clocks, but calling it from a high-level language and first having to place all the FIFO data in registers adds quite some overhead. I measured 960 clocks without any optimizations, which is roughly an 8 times speedup.
Comments
I think most of the time is wasted sorting out how variables are passed as arguments. I tried to do it the simple way and pass every single array entry as a function argument to be sure it's loaded into a cog register.
I tried to optimize this by passing pointers instead and shuffling the data inside the assembler part. But that doesn't help much. The function call and those 3 rd/wrlong instructions alone take more than 400 clocks!
SETQ + RD/WRLONG is very efficient when executed from COGRAM and can read/write one longword every clock cycle, but it seems to be very slow when executed from HUBRAM. The reason is that both hub execution and rd/wrlong compete for the hubram FIFO, which forces constant reloading of the FIFO, like an old PC with too little RAM swapping pages all the time.
Correct. Have a look at TestFilterCordic.p2asm. Yikes! I don't know if fastspin would use a block read if the args were in order in memory. I think the best way would be to pass a pointer to the iirFilter structure and read the whole thing as one block, if the compiler will agree with that. Then do the time shifting in the same pasm code as the filter.
Edit: If the shifting of old samples is interleaved with the cordic operation then it would take no time at all.
If I could understand the implications of that first for( ) loop, the copy over from structure f into the int32_t could be included. It seems like it's only copying [1] = [0] given coefficient size 2.
EDIT: Of course, if f.n only equals 4 then there's not much saving to make.
And all array contents have to be copied to cogram anyway otherwise the timing of the pipeline couldn't be met. All data is already arranged in a single block in the filter structure. So a single block move is necessary to swap everything in. The shifting could be done interleaved with the CORDIC code and then a single block move can copy everything back to the struct in hubram.
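To make the data flow concrete, here is a small C reference model of one filter step, including the shifting of old samples. It uses the same sign convention as the CORDIC code (feed-forward terms added, feedback terms subtracted). The struct `iir3_t` and the function `iir3_step` are hypothetical names for illustration; the real iirFilter layout is not shown in this thread, and doubles are used here instead of the fixed point values the cog would work on.

```c
#include <stdint.h>

/* Hypothetical 3rd order filter state, all in one block so it could be
   moved in and out with a single block copy. */
typedef struct {
    double a[4];   /* feed-forward coefficients a0..a3 */
    double b[4];   /* feedback coefficients b1..b3, b[0] unused */
    double x[4];   /* x[1..3]: previous inputs, x[0] unused */
    double y[4];   /* y[1..3]: previous outputs, y[0] unused */
} iir3_t;

/* One step: y0 = a0*in + sum over k=1..3 of (a[k]*x[k] - b[k]*y[k]) */
double iir3_step(iir3_t *f, double in)
{
    double y0 = f->a[0] * in;
    for (int k = 1; k <= 3; k++) {
        y0 += f->a[k] * f->x[k] - f->b[k] * f->y[k];
    }

    /* Time shifting, oldest sample first so nothing is overwritten early.
       On the P2 these moves could sit between the CORDIC instructions. */
    f->x[3] = f->x[2];  f->x[2] = f->x[1];  f->x[1] = in;
    f->y[3] = f->y[2];  f->y[2] = f->y[1];  f->y[1] = y0;

    return y0;
}
```

A model like this is handy for comparing against the output of the assembler version with the same coefficients.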
But the block move with SETQ+RDLONG seems to be the bottleneck and is not very efficient in hub exec mode. I have to do some experiments to see what works best.
I wonder if a similar optimization could be done to the floating point library. I found out that a float multiplication or add takes around 800 cycles. But the actual "core" code for the math operation is only 30 or 40 instructions, so it should execute in about 80 cycles. The rest is wasted on packing/unpacking the IEEE number format and on copying arguments and local variables. >1 MFLOPS (per cog!) would be damn good for an embedded processor without an FPU.
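To show what the packing/unpacking step involves, here is a rough C sketch for IEEE 754 single precision. It is not the code from the floating point library; `unpacked_t`, `fp_unpack` and `fp_pack` are invented names, and the sketch only handles normal numbers (no zero, denormals, infinities or NaN).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical unpacked form that the "core" math would operate on. */
typedef struct {
    uint32_t sign;      /* 0 = positive, 1 = negative */
    int32_t  exponent;  /* unbiased exponent */
    uint32_t mantissa;  /* 24 bits, implicit leading 1 restored */
} unpacked_t;

/* Split a normal float into sign, exponent and mantissa. */
unpacked_t fp_unpack(float f)
{
    uint32_t bits;
    unpacked_t u;
    memcpy(&bits, &f, sizeof bits);            /* safe type punning */
    u.sign     = bits >> 31;
    u.exponent = (int32_t)((bits >> 23) & 0xFF) - 127;
    u.mantissa = (bits & 0x7FFFFF) | 0x800000;
    return u;
}

/* Reassemble a float from the unpacked form (normal numbers only). */
float fp_pack(unpacked_t u)
{
    uint32_t bits = (u.sign << 31)
                  | ((uint32_t)(u.exponent + 127) << 23)
                  | (u.mantissa & 0x7FFFFF);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

If several operations are chained, keeping intermediate values in the unpacked form and packing only once at the end would avoid repeating this work for every multiply or add.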
I was trying some evil stuff like #define SCRATCHPAD ((volatile iirFilter*) 0x4) and, upon usage, getting "error: Internal Error: emitting move to immediate".
Unfortunately, these days almost everything is "cheating" in the sense of "using features that are not officially documented". I'm looking forward to the time when everything about the P2 is documented as well as we are used to with the P1 and we don't have to browse the forum for solutions.