digital filtering using CORDIC

ManAtWork · 2020-05-11 14:21

I've been experimenting a bit with digital filters. IIR filters need much less taps and mul/add instructions than FIR filters. So I thought I could use floating point math to get rid of all scaling and overflow issues. Unfortunatelly, computing a 3rd order Bessel filter took around 7900 cycles and this is just a bit too slow for my application. I'm not sure if 16 bit fixed point math would be sufficient. So I had the idea of using the CODRIC unit for the filtering. I don't know if it's the best possible solution but at least the code looks cool and IMHO is a good demonstration of how make full use of the pipeline of the CORDIC unit.

' The CORDIC unit doesn't support signed multiplication.
' We do a trick and compute y = 2*x*sin(arcsin(c/2))
' which is the same as y = x*c for -2 <= c <= +2.
' QROTATE computes y = D * sin(S), x is ignored.
' The arcsine of the coefficients are pre-calculated.

	qrotate in,a0 // 0 start feeding pipeline
	qrotate x1,a1 // +8 clocks
	qrotate y1,b1 // +16
	qrotate x2,a2 // +24
	qrotate y2,b2 // +32
	qrotate x3,a3 // +40
	qrotate y3,b3 // +48
	getqy   y0    // +56 result of in*a0
	getqy	r     // +64 result of x1*a1
	add	y0,r
	getqy	r     // +72 result of y1*b1
	sub	y0,r
	getqy	r     // +80 result of x2*a2
	add	y0,r
	getqy	r     // +88 result of y2*b2
	sub	y0,r
	getqy	r     // +96 result of x3*a3
	add	y0,r
	getqy	r     // +104 result of y3*b3
	sub	y0,r

Execution time should be 106+something clocks but calling it from a high-level languange and having to place all FIFO data in registers, first, adds quite some overhead. I measured 960 clocks without any optimizations which is roughly an 8 times speedup.

I'll post the full source code after I've made some fine tuning...

evanh · 2020-05-11 14:41

Ah, there's a real optimisation that the compilers could potentially be improved upon here. And that is to block copy the local register variables for this using SETQ+RD/WRLONG. EDIT: Of course that would need an ordered block in hubRAM as well.

ManAtWork · 2020-05-11 17:36

Ok, here is the source code. It looks quite ugly as I've tried out many different things. The compiler and me were both very stubborn about the syntax but we finally found a way that works. 8)

I think most of the time is wasted sorting out how variables are passed as arguments. I tried to do it the simple way and pass every single array entry as function argument to be sure it's loaded into a cog register.

// definition of internal function:
uint32_t CordicIIRfilter (a0, in, a1, x1, a2, x2, a3, x3, b1, y1, b2, y2, b3, y3)
{
  __asm {
  ...

// call it:
    f.y[0]= CordicIIRfilter (f.a[0], in, f.a[1], f.x[1], f.a[2], f.x[2], f.a[3], f.x[3],
                             f.b[1], f.y[1], f.b[2], f.y[2], f.b[3], f.y[3]);

I tried to optimize this by passing pointers instead and shuffling the data inside the assembler part.

uint32_t CordicIIRfilter (int32_t in, int32_t* x, int32_t* y, int32_t* a, int32_t* b)
{
  __asm {
	setq	#6
	rdlong	1,x	// transfer arrays to hubram
	setq	#2	// first 16 longs of cogram are scratch space
	setq	#7
	rdlong	8,a
	mov	0,in	// x0= in
	setq	#7
	wrlong	0,x	// write shifted FIFO back to hubram

But that doesn't help much. The call of the function and those 3 rd/wrlong instructions alone take more than 400 clocks!

SETQ + RD/WRLONG is very efficient when executed from COGRAM and can read/write one longword every clock cycle but it seems to be very slow when executed from HUBRAM. The reason is that both hub execution and rd/wrlong struggle about the hubram FIFO which forces constant reloading of the FIFO like an old PC with too little RAM swapping pages all the time.

SaucySoliton · 2020-05-12 01:17

ManAtWork wrote: »

Ok, here is the source code. It looks quite ugly as I've tried out many different things. The compiler and me were both very stubborn about the syntax but we finally found a way that works. 8)

I think most of the time is wasted sorting out how variables are passed as arguments. I tried to do it the simple way and pass every single array entry as function argument to be sure it's loaded into a cog register.

Correct. Have a look at TestFilterCordic.p2asm. Yikes! I don't know if fastspin would use a block read if the args were in order in memory. I think the best way would be to pass a pointer to the iirFilter structure and read the whole thing as one block. If the compiler will agree with that.

Then do the time shifting in the same pasm code as the filter.

_GetFilterOutput
	mov	COUNT_, #6
	calla	#pushregs_
	mov	local01, arg01
	mov	local02, arg02
	rdlong	local03, local01
	sub	local03, #1
LR__0008
	cmp	local03, #0 wcz
 if_be	jmp	#LR__0009
	mov	local04, local03
	shl	local04, #2
	add	local01, #4
	add	local04, local01
	mov	local05, local03
	sub	local05, #1
	shl	local05, #2
	add	local05, local01
	rdlong	local06, local05
	wrlong	local06, local04
	mov	local04, local03
	shl	local04, #2
	add	local01, #16
	add	local04, local01
	mov	local05, local03
	sub	local05, #1
	shl	local05, #2
	add	local05, local01
	rdlong	local06, local05
	wrlong	local06, local04
	sub	local03, #1
	sub	local01, #20
	jmp	#LR__0008
LR__0009
	mov	local04, local02
	shl	local04, #1
	add	local01, #4
	wrlong	local04, local01
	add	local01, #32
	rdlong	arg01, local01
	add	local01, #4
	rdlong	arg03, local01
	sub	local01, #32
	rdlong	arg04, local01
	add	local01, #36
	rdlong	arg05, local01
	sub	local01, #32
	rdlong	arg06, local01
	add	local01, #36
	rdlong	arg07, local01
	sub	local01, #32
	rdlong	arg08, local01
	add	local01, #40
	rdlong	arg09, local01
	sub	local01, #32
	rdlong	arg10, local01
	add	local01, #36
	rdlong	arg11, local01
	sub	local01, #32
	rdlong	arg12, local01
	add	local01, #36
	rdlong	arg13, local01
	sub	local01, #32
	rdlong	arg14, local01
	sub	local01, #32
	mov	arg02, local02
	calla	#_CordicIIRfilter
	add	local01, #20
	wrlong	result1, local01
	mov	ptra, fp
	calla	#popregs_
_GetFilterOutput_ret
	reta

Edit: If the shifting of old samples is interleaved with the cordic operation then it would take no time at all.

whicker · 2020-05-12 05:23

The call to the asm can be eliminated, for a gain of t=653 instead of t=949:

int32_t GetFilterOutput (iirFilter& f, int32_t in)
{
    for (unsigned int i=f.n-1; i>0; i--)
    {
      f.x[i]= f.x[i-1];
      f.y[i]= f.y[i-1];
    }
    f.x[0]= in<<1;

    int32_t x0 = f.x[0];
    int32_t x1 = f.x[1];
    int32_t x2 = f.x[2];
    int32_t x3 = f.x[3];

    int32_t y0 = f.y[0];
    int32_t y1 = f.y[1];
    int32_t y2 = f.y[2];
    int32_t y3 = f.y[3];

    int32_t a0 = f.a[0];
    int32_t a1 = f.a[1];
    int32_t a2 = f.a[2];
    int32_t a3 = f.a[3];

    int32_t b0 = f.b[0];
    int32_t b1 = f.b[1];
    int32_t b2 = f.b[2];
    int32_t b3 = f.b[3];
    int32_t r;

  __asm {
	//shl	in,#1 // compensate gain (c/2) use x0 instead
	qrotate x0,a0 // 0 start feeding pipeline
	qrotate x1,a1 // +8 clocks
	shl	y1,#1
	qrotate y1,b1 // +16
	qrotate x2,a2 // +24
	shl	y2,#1
	qrotate y2,b2 // +32
	qrotate x3,a3 // +40
	shl	y3,#1
	qrotate y3,b3 // +48
	getqy   y0    // +56 result of in*a0 <- x0*a0
	getqy	r     // +64 result of x1*a1
	add	y0,r
	getqy	r     // +72 result of y1*b1
	sub	y0,r
	getqy	r     // +80 result of x2*a2
	add	y0,r
	getqy	r     // +88 result of y2*b2
	sub	y0,r
	getqy	r     // +96 result of x3*a3
	add	y0,r
	getqy	r     // +104 result of y3*b3
	sub	y0,r
  }
    f.y[0] = y0;
    return f.y[0];
}

If I could understand the implications of that first for( ) loop, the copy over from structure f into the int32_t could be included. It seems like it's only copying [1] = [0] given coefficient size 2.

evanh · 2020-05-12 07:48

It's a processor expensive FIFO. Change it to a progressing index with wraparound to eliminate the copying.

EDIT: Of course, if f.n only equals 4 then there's not much saving to make.

ManAtWork · 2020-05-12 08:25

The number of FIFO entries is limited to 4 (max. order of filter is 3) simply because I don't need more and it would be difficult to implement higher order filters pipelined. So I think any optimization would take more time than always process all 4 entires.

And all array contents have to be copied to cogram anyway otherwise the timing of the pipeline couldn't be met. All data is already arranged in a single block in the filter structure.

typedef struct
{
  unsigned int n; // number of coefficients = order+1
  int32_t x[maxFilterCoeff]; // input samples
  int32_t y[maxFilterCoeff]; // output samples
  int32_t a[maxFilterCoeff]; // A coefficients
  int32_t b[maxFilterCoeff]; // B coefficients
} iirFilter;

So a single block move is necessary to swap everything in. The shifting could be done interleaved with the CORDIC code and then a single block move can copy everything back to the struct in hubram.

But the block move with SETQ+RDLONG seems to be the bottleneck and is not very efficient in hub exec mode. I have to do some experiments what works best.

ManAtWork · 2020-05-12 11:49

Ok, now I got it down to 265 cycles which is <1.5µs@180MHz and pretty good, I think.

I wonder if a similar optimization could be done to the floating point library. I found out that a float multiplication or add takes around 800 cycles. But the actual "core" code for the math operation is only 30 or 40 instructions so it should execute in 80 cycles. The rest is wasted by packing/unpacking the IEEE number format and by copying arguments and local variables. >1MFLOP (per cog!) would be damn good for an embedded processor without FPU.

whicker · 2020-05-13 05:06

Well that's an impressive speedup.

I was trying some evil stuff like #define SCRATCHPAD ((volatile iirFilter*) 0x4) and upon usage getting "error: Internal Error: emitting move to immediate"

evanh · 2020-05-13 06:16

Huh, very good indeed. How do the x0,1,2,3 and y0,1,2,3 and a0,1,2,3 variables get passed to the assembly? Doh! I see, they're #defines, that's cheating!

ManAtWork · 2020-05-13 09:07

Well, I took advantage of a semi-secret feature of the compiler I found here. Eric said that the first 16 longs of cog RAM can be safely used for temporary variables. So luckily, all the coefficient and FIFO arrays fit in as one block.

Unfortunatelly, theese days almost everything is "cheating" in the sense of "using features that are not officially documented". I'm looking forward to the time when everything about the P2 will be documented as well as we are used to with the P1 and don't have to browse the forum for solutions.

evanh · 2020-05-13 09:59

Oh, double doh, I didn't even read the values of the #defines. I see they are register addresses. Not the sort of cheating I was thinking of, lol. Excellent use really.

digital filtering using CORDIC

Comments