Shop OBEX P1 Docs P2 Docs Learn Events
digital filtering using CORDIC — Parallax Forums

digital filtering using CORDIC

I've been experimenting a bit with digital filters. IIR filters need much less taps and mul/add instructions than FIR filters. So I thought I could use floating point math to get rid of all scaling and overflow issues. Unfortunatelly, computing a 3rd order Bessel filter took around 7900 cycles and this is just a bit too slow for my application. I'm not sure if 16 bit fixed point math would be sufficient. So I had the idea of using the CODRIC unit for the filtering. I don't know if it's the best possible solution but at least the code looks cool and IMHO is a good demonstration of how make full use of the pipeline of the CORDIC unit.
' The CORDIC unit doesn't support signed multiplication.
' We do a trick and compute y = 2*x*sin(arcsin(c/2))
' which is the same as y = x*c for -2 <= c <= +2.
' QROTATE computes y = D * sin(S), x is ignored.
' The arcsine of the coefficients are pre-calculated.

	qrotate in,a0 // 0 start feeding pipeline
	qrotate x1,a1 // +8 clocks
	qrotate y1,b1 // +16
	qrotate x2,a2 // +24
	qrotate y2,b2 // +32
	qrotate x3,a3 // +40
	qrotate y3,b3 // +48
	getqy   y0    // +56 result of in*a0
	getqy	r     // +64 result of x1*a1
	add	y0,r
	getqy	r     // +72 result of y1*b1
	sub	y0,r
	getqy	r     // +80 result of x2*a2
	add	y0,r
	getqy	r     // +88 result of y2*b2
	sub	y0,r
	getqy	r     // +96 result of x3*a3
	add	y0,r
	getqy	r     // +104 result of y3*b3
	sub	y0,r
Execution time should be 106+something clocks but calling it from a high-level languange and having to place all FIFO data in registers, first, adds quite some overhead. I measured 960 clocks without any optimizations which is roughly an 8 times speedup.

I'll post the full source code after I've made some fine tuning...

Comments

  • evanhevanh Posts: 16,075
    edited 2020-05-11 14:43
    Ah, there's a real optimisation that the compilers could potentially be improved upon here. And that is to block copy the local register variables for this using SETQ+RD/WRLONG. EDIT: Of course that would need an ordered block in hubRAM as well.
  • Ok, here is the source code. It looks quite ugly as I've tried out many different things. The compiler and me were both very stubborn about the syntax but we finally found a way that works. 8)

    I think most of the time is wasted sorting out how variables are passed as arguments. I tried to do it the simple way and pass every single array entry as function argument to be sure it's loaded into a cog register.
    // definition of internal function:
    uint32_t CordicIIRfilter (a0, in, a1, x1, a2, x2, a3, x3, b1, y1, b2, y2, b3, y3)
    {
      __asm {
      ...
    
    // call it:
        f.y[0]= CordicIIRfilter (f.a[0], in, f.a[1], f.x[1], f.a[2], f.x[2], f.a[3], f.x[3],
                                 f.b[1], f.y[1], f.b[2], f.y[2], f.b[3], f.y[3]);
    

    I tried to optimize this by passing pointers instead and shuffling the data inside the assembler part.
    uint32_t CordicIIRfilter (int32_t in, int32_t* x, int32_t* y, int32_t* a, int32_t* b)
    {
      __asm {
    	setq	#6
    	rdlong	1,x	// transfer arrays to hubram
    	setq	#2	// first 16 longs of cogram are scratch space
    	setq	#7
    	rdlong	8,a
    	mov	0,in	// x0= in
    	setq	#7
    	wrlong	0,x	// write shifted FIFO back to hubram
    
    But that doesn't help much. The call of the function and those 3 rd/wrlong instructions alone take more than 400 clocks!

    SETQ + RD/WRLONG is very efficient when executed from COGRAM and can read/write one longword every clock cycle but it seems to be very slow when executed from HUBRAM. The reason is that both hub execution and rd/wrlong struggle about the hubram FIFO which forces constant reloading of the FIFO like an old PC with too little RAM swapping pages all the time.
  • SaucySolitonSaucySoliton Posts: 525
    edited 2020-05-12 02:49
    ManAtWork wrote: »
    Ok, here is the source code. It looks quite ugly as I've tried out many different things. The compiler and me were both very stubborn about the syntax but we finally found a way that works. 8)

    I think most of the time is wasted sorting out how variables are passed as arguments. I tried to do it the simple way and pass every single array entry as function argument to be sure it's loaded into a cog register.

    Correct. Have a look at TestFilterCordic.p2asm. Yikes! I don't know if fastspin would use a block read if the args were in order in memory. I think the best way would be to pass a pointer to the iirFilter structure and read the whole thing as one block. If the compiler will agree with that. :wink: Then do the time shifting in the same pasm code as the filter.

    _GetFilterOutput
    	mov	COUNT_, #6
    	calla	#pushregs_
    	mov	local01, arg01
    	mov	local02, arg02
    	rdlong	local03, local01
    	sub	local03, #1
    LR__0008
    	cmp	local03, #0 wcz
     if_be	jmp	#LR__0009
    	mov	local04, local03
    	shl	local04, #2
    	add	local01, #4
    	add	local04, local01
    	mov	local05, local03
    	sub	local05, #1
    	shl	local05, #2
    	add	local05, local01
    	rdlong	local06, local05
    	wrlong	local06, local04
    	mov	local04, local03
    	shl	local04, #2
    	add	local01, #16
    	add	local04, local01
    	mov	local05, local03
    	sub	local05, #1
    	shl	local05, #2
    	add	local05, local01
    	rdlong	local06, local05
    	wrlong	local06, local04
    	sub	local03, #1
    	sub	local01, #20
    	jmp	#LR__0008
    LR__0009
    	mov	local04, local02
    	shl	local04, #1
    	add	local01, #4
    	wrlong	local04, local01
    	add	local01, #32
    	rdlong	arg01, local01
    	add	local01, #4
    	rdlong	arg03, local01
    	sub	local01, #32
    	rdlong	arg04, local01
    	add	local01, #36
    	rdlong	arg05, local01
    	sub	local01, #32
    	rdlong	arg06, local01
    	add	local01, #36
    	rdlong	arg07, local01
    	sub	local01, #32
    	rdlong	arg08, local01
    	add	local01, #40
    	rdlong	arg09, local01
    	sub	local01, #32
    	rdlong	arg10, local01
    	add	local01, #36
    	rdlong	arg11, local01
    	sub	local01, #32
    	rdlong	arg12, local01
    	add	local01, #36
    	rdlong	arg13, local01
    	sub	local01, #32
    	rdlong	arg14, local01
    	sub	local01, #32
    	mov	arg02, local02
    	calla	#_CordicIIRfilter
    	add	local01, #20
    	wrlong	result1, local01
    	mov	ptra, fp
    	calla	#popregs_
    _GetFilterOutput_ret
    	reta
    

    Edit: If the shifting of old samples is interleaved with the cordic operation then it would take no time at all. :yum:
  • The call to the asm can be eliminated, for a gain of t=653 instead of t=949:
    int32_t GetFilterOutput (iirFilter& f, int32_t in)
    {
        for (unsigned int i=f.n-1; i>0; i--)
        {
          f.x[i]= f.x[i-1];
          f.y[i]= f.y[i-1];
        }
        f.x[0]= in<<1;
    
        int32_t x0 = f.x[0];
        int32_t x1 = f.x[1];
        int32_t x2 = f.x[2];
        int32_t x3 = f.x[3];
    
        int32_t y0 = f.y[0];
        int32_t y1 = f.y[1];
        int32_t y2 = f.y[2];
        int32_t y3 = f.y[3];
    
        int32_t a0 = f.a[0];
        int32_t a1 = f.a[1];
        int32_t a2 = f.a[2];
        int32_t a3 = f.a[3];
    
        int32_t b0 = f.b[0];
        int32_t b1 = f.b[1];
        int32_t b2 = f.b[2];
        int32_t b3 = f.b[3];
        int32_t r;
    
      __asm {
    	//shl	in,#1 // compensate gain (c/2) use x0 instead
    	qrotate x0,a0 // 0 start feeding pipeline
    	qrotate x1,a1 // +8 clocks
    	shl	y1,#1
    	qrotate y1,b1 // +16
    	qrotate x2,a2 // +24
    	shl	y2,#1
    	qrotate y2,b2 // +32
    	qrotate x3,a3 // +40
    	shl	y3,#1
    	qrotate y3,b3 // +48
    	getqy   y0    // +56 result of in*a0 <- x0*a0
    	getqy	r     // +64 result of x1*a1
    	add	y0,r
    	getqy	r     // +72 result of y1*b1
    	sub	y0,r
    	getqy	r     // +80 result of x2*a2
    	add	y0,r
    	getqy	r     // +88 result of y2*b2
    	sub	y0,r
    	getqy	r     // +96 result of x3*a3
    	add	y0,r
    	getqy	r     // +104 result of y3*b3
    	sub	y0,r
      }
        f.y[0] = y0;
        return f.y[0];
    }
    

    If I could understand the implications of that first for( ) loop, the copy over from structure f into the int32_t could be included. It seems like it's only copying [1] = [0] given coefficient size 2.
  • evanhevanh Posts: 16,075
    edited 2020-05-12 07:52
    It's a processor expensive FIFO. Change it to a progressing index with wraparound to eliminate the copying.

    EDIT: Of course, if f.n only equals 4 then there's not much saving to make.

  • The number of FIFO entries is limited to 4 (max. order of filter is 3) simply because I don't need more and it would be difficult to implement higher order filters pipelined. So I think any optimization would take more time than always process all 4 entires.

    And all array contents have to be copied to cogram anyway otherwise the timing of the pipeline couldn't be met. All data is already arranged in a single block in the filter structure.
    typedef struct
    {
      unsigned int n; // number of coefficients = order+1
      int32_t x[maxFilterCoeff]; // input samples
      int32_t y[maxFilterCoeff]; // output samples
      int32_t a[maxFilterCoeff]; // A coefficients
      int32_t b[maxFilterCoeff]; // B coefficients
    } iirFilter;
    
    So a single block move is necessary to swap everything in. The shifting could be done interleaved with the CORDIC code and then a single block move can copy everything back to the struct in hubram.

    But the block move with SETQ+RDLONG seems to be the bottleneck and is not very efficient in hub exec mode. I have to do some experiments what works best.
  • ManAtWorkManAtWork Posts: 2,178
    edited 2020-05-12 11:50
    Ok, now I got it down to 265 cycles which is <1.5µs@180MHz and pretty good, I think.

    I wonder if a similar optimization could be done to the floating point library. I found out that a float multiplication or add takes around 800 cycles. But the actual "core" code for the math operation is only 30 or 40 instructions so it should execute in 80 cycles. The rest is wasted by packing/unpacking the IEEE number format and by copying arguments and local variables. >1MFLOP (per cog!) would be damn good for an embedded processor without FPU.
  • Well that's an impressive speedup.

    I was trying some evil stuff like #define SCRATCHPAD ((volatile iirFilter*) 0x4) and upon usage getting "error: Internal Error: emitting move to immediate"
  • evanhevanh Posts: 16,075
    edited 2020-05-13 06:17
    Huh, very good indeed. How do the x0,1,2,3 and y0,1,2,3 and a0,1,2,3 variables get passed to the assembly? Doh! I see, they're #defines, that's cheating! :)

  • Well, I took advantage of a semi-secret feature of the compiler I found here. Eric said that the first 16 longs of cog RAM can be safely used for temporary variables. So luckily, all the coefficient and FIFO arrays fit in as one block.

    Unfortunatelly, theese days almost everything is "cheating" in the sense of "using features that are not officially documented". I'm looking forward to the time when everything about the P2 will be documented as well as we are used to with the P1 and don't have to browse the forum for solutions.
  • evanhevanh Posts: 16,075
    Oh, double doh, I didn't even read the values of the #defines. I see they are register addresses. Not the sort of cheating I was thinking of, lol. Excellent use really.

Sign In or Register to comment.