Fixed point math on P2

Rayman · 2018-11-26 17:53

Wish I had looked more closely at this earlier; I might have asked for a cordic option...

Looks like Q16.16 is going to make the most sense for a lot of applications.
I think this is how you do multiplication on P2:

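                qmul    f2,f1
                getqx   f1          'lower 32 bits of the product
                getqy   f2          'upper 32 bits of the product
                shr     f1,#16      'keep the top half of the lower long (fraction)
                shl     f2,#16      'move the integer part up
                add     f1,f2       'make 16.16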

Might have been nicer if there was a QMUL2 that did the SHR, SHL and ADD. Would that have added a lot?
It is nice to have the 32x32 multiply... Puts it close to ARM Cortex M3, apparently...
It's looking to me like both have to be positive for this to work right (or at least the same sign)...
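
A minimal C sketch of the sign issue, assuming the usual approach of multiplying magnitudes and restoring the sign afterwards. QMUL is an unsigned multiply, so the middle 32 bits of the raw product are only right for non-negative inputs. The names `q16_16`, `q16_mul_unsigned` and `q16_mul_signed` are made up for illustration, and the `lo`/`hi` split only models what GETQX and GETQY would return.

```c
#include <stdint.h>

typedef int32_t q16_16;  /* hypothetical name for a Q16.16 value */

/* Unsigned core: same data flow as QMUL + GETQX/GETQY + shifts. */
static uint32_t q16_mul_unsigned(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * b;        /* 32x32 -> 64 */
    uint32_t lo = (uint32_t)product;           /* models GETQX (lower long) */
    uint32_t hi = (uint32_t)(product >> 32);   /* models GETQY (upper long) */
    return (hi << 16) | (lo >> 16);            /* middle 32 bits */
}

/* Signed wrapper: multiply the magnitudes, then restore the sign.
 * No overflow check and no rounding in this sketch. */
static q16_16 q16_mul_signed(q16_16 a, q16_16 b)
{
    int negative = (a < 0) != (b < 0);
    uint32_t ua = (a < 0) ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = (b < 0) ? 0u - (uint32_t)b : (uint32_t)b;
    uint32_t r = q16_mul_unsigned(ua, ub);
    return negative ? (q16_16)(0u - r) : (q16_16)r;
}
```

For example, `q16_mul_signed(-0x18000, 0x20000)` (that is, -1.5 times 2.0) gives `-0x30000`, which is -3.0 in Q16.16.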

I found "libfixmath" that does some clever stuff with the leftover bits... The upper bits can be used for overflow detection and the lower bits for rounding:

/* 64-bit implementation for fix16_mul. Fastest version for e.g. ARM Cortex M3.
 * Performs a 32*32 -> 64bit multiplication. The middle 32 bits are the result,
 * bottom 16 bits are used for rounding, and upper 16 bits are used for overflow
 * detection.
 */
 
#if !defined(FIXMATH_NO_64BIT) && !defined(FIXMATH_OPTIMIZE_8BIT)
fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1)
{
	int64_t product = (int64_t)inArg0 * inArg1;
	
	#ifndef FIXMATH_NO_OVERFLOW
	// The upper 17 bits should all be the same (the sign).
	uint32_t upper = (product >> 47);
	#endif
	
	if (product < 0)
	{
		#ifndef FIXMATH_NO_OVERFLOW
		if (~upper)
				return fix16_overflow;
		#endif
		
		#ifndef FIXMATH_NO_ROUNDING
		// This adjustment is required in order to round -1/2 correctly
		product--;
		#endif
	}
	else
	{
		#ifndef FIXMATH_NO_OVERFLOW
		if (upper)
				return fix16_overflow;
		#endif
	}
	
	#ifdef FIXMATH_NO_ROUNDING
	return product >> 16;
	#else
	fix16_t result = product >> 16;
	result += (product & 0x8000) >> 15;
	
	return result;
	#endif
}
#endif
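
To see what the bottom 16 bits buy you, here is a stripped-down sketch comparing a truncating multiply with a rounding one. It is not the libfixmath code: it rounds by adding half an LSB before the shift, which treats negative ties differently from the listing above, and it skips overflow detection. The function names are hypothetical.

```c
#include <stdint.h>

/* Truncating Q16.16 multiply: just drop the low 16 bits.
 * Relies on arithmetic right shift for negative products,
 * which is what common compilers do. */
static int32_t q16_mul_trunc(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 16);
}

/* Rounding Q16.16 multiply: add half an LSB (0x8000) first. */
static int32_t q16_mul_round(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;
    return (int32_t)((product + 0x8000) >> 16);
}
```

Multiplying the smallest positive value (raw 1) by 0.5 (raw 0x8000) gives 0 with truncation and 1 with rounding.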

ersmith · 2018-11-26 18:39

Just as an FYI, fastspin has a --fixed option that treats all floating point constants as 16.16 fixed point instead, and converts floating point operations to fixed point (in BASIC and C, since Spin doesn't have built-in floating point).
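
For readers new to the format, a generic sketch of what turning a floating point constant into 16.16 means. This only illustrates the scaling by 65536; it is not taken from fastspin, and the helper names are hypothetical.

```c
#include <stdint.h>

#define Q16_ONE 65536  /* 1.0 in Q16.16 */

/* Convert a double to Q16.16, rounding to nearest (ties away from zero).
 * The caller must keep x inside the representable range. */
static int32_t q16_from_double(double x)
{
    return (int32_t)(x * Q16_ONE + (x >= 0 ? 0.5 : -0.5));
}

/* Convert back, mainly useful for printing and debugging. */
static double q16_to_double(int32_t q)
{
    return (double)q / Q16_ONE;
}
```

So 1.5 becomes `0x18000` and -0.25 becomes `-0x4000`.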

Rayman · 2018-11-26 19:12

I did notice that earlier... Does it use QMUL in P2 mode?

ersmith · 2018-11-26 19:21

Rayman wrote: »

I did notice that earlier... Does it use QMUL in P2 mode?

Yes, although I think I may change it to a routine that uses a series of 16x16 MUL instructions -- I did a test a while back and doing 3 MULs plus the appropriate shifts and adds is slightly faster than QMUL.

Rayman · 2018-11-26 19:26

That is surprising... I imagine if there were a QMUL2 option that let you get the 16.16 result via one GETQX instruction, that would be faster, right?

Still, cordic has this pipeline thing that might speed things up for me, we'll see...

Electrodude · 2018-11-26 19:38

ersmith wrote: »

Yes, although I think I may change it to a routine that uses a series of 16x16 MUL instructions -- I did a test a while back and doing 3 MULs plus the appropriate shifts and adds is slightly faster than QMUL.

Are you attempting to interleave other instructions in between the QMUL and the GETQX and GETQY?

ersmith · 2018-11-26 21:02

Rayman wrote: »

That is surprising... I imagine if there were a QMUL2 option that let you get the 16.16 result via one GETQX instruction, that would be faster, right?

Probably not, but I'm not sure. My tests were of pure multiplies, not 16.16 multiplies, so the QMUL was at full speed. The MUL version had to do shifts anyway to get the partial results in the right place -- for a 16.16 multiply the value of the shift would change, but not the time it takes.
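
A rough C model of the partial-product approach, as a sketch rather than what fastspin actually emits. Each `x * y` on 16-bit halves stands in for one 16x16 MUL. The low 32 bits of a plain multiply need three of them; in this straightforward decomposition the 16.16 result needs a fourth, because the high-by-high term lands inside the kept bits. Both function names are hypothetical, and both are unsigned only.

```c
#include <stdint.h>

/* Low 32 bits of a 32x32 product from three 16x16 multiplies. */
static uint32_t mul32_lo_3mul(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;
    uint32_t cross = ah * bl + al * bh;   /* wraps mod 2^32, harmless here */
    return al * bl + (cross << 16);
}

/* Unsigned Q16.16 product (bits 47..16 of the full product)
 * from four 16x16 multiplies. */
static uint32_t q16_mul_4mul(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;
    return ((ah * bh) << 16) + ah * bl + al * bh + ((al * bl) >> 16);
}
```

As a check, 1.5 times 2.5 (`0x18000` and `0x28000`) comes out as `0x3C000`, which is 3.75.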

IIRC CORDIC is kind of slow for multiplies (I think 16 cycles?? Plus 2 cycles to issue the qmul and 4 cycles to fetch the results). But it's been a while since I ran the tests.

ersmith · 2018-11-26 21:03

Electrodude wrote: »

Are you attempting to interleave other instructions in between the QMUL and the GETQX and GETQY?

No, and that would potentially improve performance at the cost of a more complicated compiler (and assuming there are useful things to do between the QMUL and GETQX).

ozpropdev · 2018-11-26 22:00

Have a look at the SCA and SCAS instructions.

SCA     D,{#}S          {WZ}
Next instruction's S value = unsigned (D[15:0] * S[15:0]) >> 16.
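
In C terms, the arithmetic described above is roughly the following. This only models the value computed; it does not model how the result is handed to the next instruction, and `sca_model` is a made-up name.

```c
#include <stdint.h>

/* Unsigned 16x16 multiply, keeping the upper 16 bits of the product.
 * Only the low 16 bits of each input take part. */
static uint32_t sca_model(uint32_t d, uint32_t s)
{
    return ((d & 0xFFFF) * (s & 0xFFFF)) >> 16;
}
```

Read as a 0.16 fraction times a 16-bit value, `sca_model(0x8000, 0x4000)` is one half of `0x4000`, so `0x2000`.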

Rayman · 2018-11-26 22:13

SCA would be great for doing 16-bit integer math, but I think I need 32-bit fixed point with 16 bits in front of the binary point and 16 bits behind (Q16.16).

evanh · 2018-11-30 03:34

Regarding execution speed, even without filling the pipeline, the shifts and adds are free. With the cordic operating in parallel, the cog has lots of thumb twiddling time to format the numbers to and from the cordic. Just have to pre-prep the next cordic command to be ready for issuing the moment the current result arrives.

Only time this fails is if it's a simple recursion that has to be serially processed.

evanh · 2018-11-30 03:45

And that keeps it on the interrupt compatibility side of things too.
