digital filtering using CORDIC
ManAtWork
Posts: 2,262
in Propeller 2
I've been experimenting a bit with digital filters. IIR filters need much less taps and mul/add instructions than FIR filters. So I thought I could use floating point math to get rid of all scaling and overflow issues. Unfortunatelly, computing a 3rd order Bessel filter took around 7900 cycles and this is just a bit too slow for my application. I'm not sure if 16 bit fixed point math would be sufficient. So I had the idea of using the CODRIC unit for the filtering. I don't know if it's the best possible solution but at least the code looks cool and IMHO is a good demonstration of how make full use of the pipeline of the CORDIC unit.
I'll post the full source code after I've made some fine tuning...
' The CORDIC unit doesn't support signed multiplication. ' We do a trick and compute y = 2*x*sin(arcsin(c/2)) ' which is the same as y = x*c for -2 <= c <= +2. ' QROTATE computes y = D * sin(S), x is ignored. ' The arcsine of the coefficients are pre-calculated. qrotate in,a0 // 0 start feeding pipeline qrotate x1,a1 // +8 clocks qrotate y1,b1 // +16 qrotate x2,a2 // +24 qrotate y2,b2 // +32 qrotate x3,a3 // +40 qrotate y3,b3 // +48 getqy y0 // +56 result of in*a0 getqy r // +64 result of x1*a1 add y0,r getqy r // +72 result of y1*b1 sub y0,r getqy r // +80 result of x2*a2 add y0,r getqy r // +88 result of y2*b2 sub y0,r getqy r // +96 result of x3*a3 add y0,r getqy r // +104 result of y3*b3 sub y0,rExecution time should be 106+something clocks but calling it from a high-level languange and having to place all FIFO data in registers, first, adds quite some overhead. I measured 960 clocks without any optimizations which is roughly an 8 times speedup.
I'll post the full source code after I've made some fine tuning...

Comments
I think most of the time is wasted sorting out how variables are passed as arguments. I tried to do it the simple way and pass every single array entry as function argument to be sure it's loaded into a cog register.
// definition of internal function: uint32_t CordicIIRfilter (a0, in, a1, x1, a2, x2, a3, x3, b1, y1, b2, y2, b3, y3) { __asm { ... // call it: f.y[0]= CordicIIRfilter (f.a[0], in, f.a[1], f.x[1], f.a[2], f.x[2], f.a[3], f.x[3], f.b[1], f.y[1], f.b[2], f.y[2], f.b[3], f.y[3]);I tried to optimize this by passing pointers instead and shuffling the data inside the assembler part.
uint32_t CordicIIRfilter (int32_t in, int32_t* x, int32_t* y, int32_t* a, int32_t* b) { __asm { setq #6 rdlong 1,x // transfer arrays to hubram setq #2 // first 16 longs of cogram are scratch space setq #7 rdlong 8,a mov 0,in // x0= in setq #7 wrlong 0,x // write shifted FIFO back to hubramBut that doesn't help much. The call of the function and those 3 rd/wrlong instructions alone take more than 400 clocks!SETQ + RD/WRLONG is very efficient when executed from COGRAM and can read/write one longword every clock cycle but it seems to be very slow when executed from HUBRAM. The reason is that both hub execution and rd/wrlong struggle about the hubram FIFO which forces constant reloading of the FIFO like an old PC with too little RAM swapping pages all the time.
Correct. Have a look at TestFilterCordic.p2asm. Yikes! I don't know if fastspin would use a block read if the args were in order in memory. I think the best way would be to pass a pointer to the iirFilter structure and read the whole thing as one block. If the compiler will agree with that.
Edit: If the shifting of old samples is interleaved with the cordic operation then it would take no time at all.
int32_t GetFilterOutput (iirFilter& f, int32_t in) { for (unsigned int i=f.n-1; i>0; i--) { f.x[i]= f.x[i-1]; f.y[i]= f.y[i-1]; } f.x[0]= in<<1; int32_t x0 = f.x[0]; int32_t x1 = f.x[1]; int32_t x2 = f.x[2]; int32_t x3 = f.x[3]; int32_t y0 = f.y[0]; int32_t y1 = f.y[1]; int32_t y2 = f.y[2]; int32_t y3 = f.y[3]; int32_t a0 = f.a[0]; int32_t a1 = f.a[1]; int32_t a2 = f.a[2]; int32_t a3 = f.a[3]; int32_t b0 = f.b[0]; int32_t b1 = f.b[1]; int32_t b2 = f.b[2]; int32_t b3 = f.b[3]; int32_t r; __asm { //shl in,#1 // compensate gain (c/2) use x0 instead qrotate x0,a0 // 0 start feeding pipeline qrotate x1,a1 // +8 clocks shl y1,#1 qrotate y1,b1 // +16 qrotate x2,a2 // +24 shl y2,#1 qrotate y2,b2 // +32 qrotate x3,a3 // +40 shl y3,#1 qrotate y3,b3 // +48 getqy y0 // +56 result of in*a0 <- x0*a0 getqy r // +64 result of x1*a1 add y0,r getqy r // +72 result of y1*b1 sub y0,r getqy r // +80 result of x2*a2 add y0,r getqy r // +88 result of y2*b2 sub y0,r getqy r // +96 result of x3*a3 add y0,r getqy r // +104 result of y3*b3 sub y0,r } f.y[0] = y0; return f.y[0]; }If I could understand the implications of that first for( ) loop, the copy over from structure f into the int32_t could be included. It seems like it's only copying [1] = [0] given coefficient size 2.
EDIT: Of course, if f.n only equals 4 then there's not much saving to make.
And all array contents have to be copied to cogram anyway otherwise the timing of the pipeline couldn't be met. All data is already arranged in a single block in the filter structure.
typedef struct { unsigned int n; // number of coefficients = order+1 int32_t x[maxFilterCoeff]; // input samples int32_t y[maxFilterCoeff]; // output samples int32_t a[maxFilterCoeff]; // A coefficients int32_t b[maxFilterCoeff]; // B coefficients } iirFilter;So a single block move is necessary to swap everything in. The shifting could be done interleaved with the CORDIC code and then a single block move can copy everything back to the struct in hubram.But the block move with SETQ+RDLONG seems to be the bottleneck and is not very efficient in hub exec mode. I have to do some experiments what works best.
I wonder if a similar optimization could be done to the floating point library. I found out that a float multiplication or add takes around 800 cycles. But the actual "core" code for the math operation is only 30 or 40 instructions so it should execute in 80 cycles. The rest is wasted by packing/unpacking the IEEE number format and by copying arguments and local variables. >1MFLOP (per cog!) would be damn good for an embedded processor without FPU.
I was trying some evil stuff like #define SCRATCHPAD ((volatile iirFilter*) 0x4) and upon usage getting "error: Internal Error: emitting move to immediate"
Unfortunatelly, theese days almost everything is "cheating" in the sense of "using features that are not officially documented". I'm looking forward to the time when everything about the P2 will be documented as well as we are used to with the P1 and don't have to browse the forum for solutions.