# Fixed point math on P2

Posts: 9,716
edited 2018-11-26 - 19:13:32
Wish I had looked more closely at this earlier, I might have asked for a CORDIC option...

Looks like Q16.16 is going to make the most sense for a lot of applications.
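I think this is how you do multiplication on P2:

    qmul    f2,f1
    getqx   f1          'lower 32 bits of the 64-bit product
    getqy   f2          'upper 32 bits of the 64-bit product
    shr     f1,#16      'keep the top half of the lower long
    shl     f2,#16      'move the bottom half of the upper long up
    or      f1,f2       'f1 = middle 32 bits = 16.16 result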

Might have been nicer if there was a QMUL2 that did the SHR, SHL and ADD. Would that have added a lot?
It is nice to have the 32x32 multiply... Puts it close to ARM Cortex M3, apparently...
It's looking to me like both have to be positive for this to work right (or at least the same sign)...
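QMUL is an unsigned multiply, so feeding two's-complement negatives straight in generally gives the wrong middle 32 bits, even when both operands are negative. One common workaround is to multiply the absolute values and fix up the sign afterwards. Here is a minimal C sketch of that idea. The names `umul32x32`, `q16_umul` and `q16_smul` are made up for illustration, `umul32x32` only stands in for the QMUL/GETQX/GETQY trio, and there is no rounding or overflow check.

```c
#include <stdint.h>

/* Hypothetical stand-in for QMUL + GETQX + GETQY:
 * unsigned 32x32 -> 64-bit multiply, returned as two 32-bit halves. */
void umul32x32(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    uint64_t p = (uint64_t)a * b;
    *lo = (uint32_t)p;          /* what GETQX would return */
    *hi = (uint32_t)(p >> 32);  /* what GETQY would return */
}

/* Unsigned Q16.16 multiply: keep the middle 32 bits of the 64-bit product. */
uint32_t q16_umul(uint32_t a, uint32_t b)
{
    uint32_t lo, hi;
    umul32x32(a, b, &lo, &hi);
    return (lo >> 16) | (hi << 16);
}

/* Signed Q16.16 multiply built on the unsigned one.
 * Truncates toward zero; overflow is not detected. */
int32_t q16_smul(int32_t a, int32_t b)
{
    /* Unsigned negation avoids undefined behaviour for INT32_MIN. */
    uint32_t ua = (a < 0) ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = (b < 0) ? 0u - (uint32_t)b : (uint32_t)b;
    uint32_t ur = q16_umul(ua, ub);

    if ((a < 0) != (b < 0))
        ur = 0u - ur;           /* signs differ: negate the result */

    /* Conversion back to int32_t assumes the usual two's-complement wrap. */
    return (int32_t)ur;
}
```

For example, `q16_smul(-0x00020000, 0x00018000)` is -2.0 times 1.5 and should come back as `-0x00030000`. On the P2 side the same fix-up would cost a few extra instructions around the QMUL.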

I found "libfixmath" that does some clever stuff with the leftover bits... The upper bits can be used for overflow detection and the lower bits for rounding:

    /* 64-bit implementation for fix16_mul. Fastest version for e.g. ARM Cortex M3.
     * Performs a 32*32 -> 64bit multiplication. The middle 32 bits are the result,
     * bottom 16 bits are used for rounding, and upper 16 bits are used for overflow
     * detection.
     */

    #if !defined(FIXMATH_NO_64BIT) && !defined(FIXMATH_OPTIMIZE_8BIT)
    fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1)
    {
        int64_t product = (int64_t)inArg0 * inArg1;

        #ifndef FIXMATH_NO_OVERFLOW
        // The upper 17 bits should all be the same (the sign).
        uint32_t upper = (product >> 47);
        #endif

        if (product < 0)
        {
            #ifndef FIXMATH_NO_OVERFLOW
            if (~upper)
                return fix16_overflow;
            #endif

            #ifndef FIXMATH_NO_ROUNDING
            // This adjustment is required in order to round -1/2 correctly
            product--;
            #endif
        }
        else
        {
            #ifndef FIXMATH_NO_OVERFLOW
            if (upper)
                return fix16_overflow;
            #endif
        }

        #ifdef FIXMATH_NO_ROUNDING
        return product >> 16;
        #else
        fix16_t result = product >> 16;
        result += (product & 0x8000) >> 15;

        return result;
        #endif
    }
    #endif
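To see what the rounding step buys you, here is a small stripped-down C sketch (no overflow check) that compares plain truncation against the bit-15 rounding trick used above. `mul_trunc` and `mul_round` are names made up for this example, not libfixmath functions, and like libfixmath the code assumes `>>` on a negative `int64_t` is an arithmetic shift.

```c
#include <stdint.h>

typedef int32_t fix16_t;

/* Q16.16 multiply that simply drops the bottom 16 bits of the product. */
fix16_t mul_trunc(fix16_t a, fix16_t b)
{
    int64_t product = (int64_t)a * b;
    return (fix16_t)(product >> 16);
}

/* Q16.16 multiply that rounds to nearest using bit 15 of the product.
 * The decrement for negative products makes -1/2 LSB round away from
 * zero, mirroring how +1/2 LSB rounds up. */
fix16_t mul_round(fix16_t a, fix16_t b)
{
    int64_t product = (int64_t)a * b;
    if (product < 0)
        product--;

    fix16_t result = (fix16_t)(product >> 16);
    result += (fix16_t)((product & 0x8000) >> 15);
    return result;
}
```

With `a = 3` (three LSBs) and `b = 0x8000` (0.5), the exact answer is 1.5 LSBs: `mul_trunc` gives 1 and `mul_round` gives 2. With `a = -3`, `mul_round` gives -2.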
Prop Info and Apps: http://www.rayslogic.com/

• Posts: 3,399
Just as an FYI, fastspin has a --fixed option that treats all floating point constants as 16.16 fixed point instead, and converts floating point operations to fixed point (in BASIC and C, since Spin doesn't have built in floating point).
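For anyone unsure what "treating a floating point constant as 16.16" means in practice, it boils down to scaling by 65536. A minimal C sketch follows; this is not fastspin's code, and `q16_from_double` and `q16_to_double` are hypothetical helper names.

```c
#include <stdint.h>

typedef int32_t q16_t;

#define Q16_ONE 65536   /* 1.0 in Q16.16 */

/* Convert a double to Q16.16, rounding to the nearest representable value.
 * The caller is assumed to keep x within roughly -32768.0 .. 32767.99998. */
q16_t q16_from_double(double x)
{
    return (q16_t)(x * Q16_ONE + (x >= 0 ? 0.5 : -0.5));
}

/* Convert Q16.16 back to a double (exact, since 65536 is a power of two). */
double q16_to_double(q16_t q)
{
    return (double)q / Q16_ONE;
}
```

So a constant like 1.5 would be stored as `0x00018000`, and -2.25 as `-0x00024000`.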
• Posts: 9,716
edited 2018-11-26 - 19:12:32
I did notice that earlier... Does it use QMUL in P2 mode?
• Posts: 3,399
Rayman wrote: »
I did notice that earlier... Does it use QMUL in P2 mode?

Yes, although I think I may change it to a routine that uses a series of 16x16 MUL instructions -- I did a test a while back and doing 3 MULs plus the appropriate shifts and adds is slightly faster than QMUL.
• Posts: 9,716
That is surprising... I imagine if there were a QMUL2 option that let you get the 16.16 result via one GETQX instruction, that would be faster, right?

Still, cordic has this pipeline thing that might speed things up for me, we'll see...
• Posts: 1,288
edited 2018-11-26 - 19:38:14
ersmith wrote: »
Rayman wrote: »
I did notice that earlier... Does it use QMUL in P2 mode?

Yes, although I think I may change it to a routine that uses a series of 16x16 MUL instructions -- I did a test a while back and doing 3 MULs plus the appropriate shifts and adds is slightly faster than QMUL.
Are you attempting to interleave other instructions in between the QMUL and the GETQX and GETQY?
• Posts: 3,399
Rayman wrote: »
That is surprising... I imagine if there were a QMUL2 option that let you get the 16.16 result via one GETQX instruction, that would be faster, right?
Probably not, but I'm not sure. My tests were of pure multiplies, not 16.16 multiplies, so the QMUL was at full speed. The MUL version had to do shifts anyway to get the partial results in the right place -- for a 16.16 multiply the value of the shift would change, but not the time it takes.

IIRC CORDIC is kind of slow for multiplies (I think 16 cycles?? Plus 2 cycles to issue the qmul and 4 cycles to fetch the results). But it's been a while since I ran the tests.
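The 16x16 partial-product approach can be sketched in C like this. It is only a model of the arithmetic, not fastspin's actual routine, and the names `mul16` and `q16_umul_partial` are invented here. Note that this version uses four partial products: a plain 32-bit multiply can presumably skip the high-by-high term (hence three MULs), but a 16.16 result needs it for the integer part. The sketch is unsigned and does no rounding or overflow detection.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16x16 -> 32-bit unsigned MUL instruction. */
uint32_t mul16(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* Unsigned Q16.16 multiply from four 16x16 partial products.
 * Equivalent to taking bits 16..47 of the full 64-bit product. */
uint32_t q16_umul_partial(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;

    uint32_t hh = mul16(a_hi, b_hi);  /* weight 2^32: shifts up 16 in result */
    uint32_t hl = mul16(a_hi, b_lo);  /* weight 2^16: lands as-is            */
    uint32_t lh = mul16(a_lo, b_hi);  /* weight 2^16: lands as-is            */
    uint32_t ll = mul16(a_lo, b_lo);  /* weight 2^0:  shifts down 16         */

    return (hh << 16) + hl + lh + (ll >> 16);
}
```

Only the shift amounts differ from the pure-multiply case, which matches the point above that the shifts change in value but not in cost.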
• Posts: 3,399
ersmith wrote: »
Rayman wrote: »
I did notice that earlier... Does it use QMUL in P2 mode?

Yes, although I think I may change it to a routine that uses a series of 16x16 MUL instructions -- I did a test a while back and doing 3 MULs plus the appropriate shifts and adds is slightly faster than QMUL.
Are you attempting to interleave other instructions in between the QMUL and the GETQX and GETQY?

No, and that would potentially improve performance at the cost of a more complicated compiler (and assuming there are useful things to do between the QMUL and GETQX).
• Posts: 2,567
edited 2018-11-26 - 22:40:36
Have a look at the SCA and SCAS instructions.
SCA     D,{#}S          {WZ}
Next instruction's S value = unsigned (D[15:0] * S[15:0]) >> 16.
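In C terms, the value SCA hands to the next instruction, going by the description above, would look roughly like this. `sca_model` is a made-up name and this only models the arithmetic, not the way the result is substituted for the next instruction's S.

```c
#include <stdint.h>

/* Model of the SCA result: unsigned (D[15:0] * S[15:0]) >> 16.
 * Only the low 16 bits of each operand take part. */
uint32_t sca_model(uint32_t d, uint32_t s)
{
    return ((d & 0xFFFFu) * (s & 0xFFFFu)) >> 16;
}
```

Treating both inputs as 0.16 fractions, `sca_model(0x8000, 0x8000)` is 0.5 times 0.5 and gives `0x4000` (0.25).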

Melbourne, Australia
• Posts: 9,716
edited 2018-11-26 - 22:13:43
SCA would be great for doing 16-bit integer math, but I think I need 32-bit fixed point, with 16 bits in front of the point and 16 bits behind (Q16.16).
• Posts: 7,912
Regarding execution speed, even without filling the pipeline, the shifts and adds are free. With the cordic operating in parallel, the cog has lots of thumb twiddling time to format the numbers to and from the cordic. Just have to pre-prep the next cordic command to be ready for issuing the moment the current result arrives.

Only time this fails is if it's a simple recursion that has to be serially processed.

"We suspect that ALMA will allow us to observe this rare form of CO in many other discs.
By doing that, we can more accurately measure their mass, and determine whether
scientists have systematically been underestimating how much matter they contain."
• Posts: 7,912
And that keeps it on the interrupt compatibility side of things too.