Welcome to the Parallax Discussion Forums, sign-up to participate.
qmul f2,f1 getqx f1 getqy f2 shr f2,#16 'lower bytes shl f1,#16 'make 16.16 add f1,f2
/* 64-bit implementation for fix16_mul. Fastest version for e.g. ARM Cortex M3. * Performs a 32*32 -> 64bit multiplication. The middle 32 bits are the result, * bottom 16 bits are used for rounding, and upper 16 bits are used for overflow * detection. */ #if !defined(FIXMATH_NO_64BIT) && !defined(FIXMATH_OPTIMIZE_8BIT) fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1) { int64_t product = (int64_t)inArg0 * inArg1; #ifndef FIXMATH_NO_OVERFLOW // The upper 17 bits should all be the same (the sign). uint32_t upper = (product >> 47); #endif if (product < 0) { #ifndef FIXMATH_NO_OVERFLOW if (~upper) return fix16_overflow; #endif #ifndef FIXMATH_NO_ROUNDING // This adjustment is required in order to round -1/2 correctly product--; #endif } else { #ifndef FIXMATH_NO_OVERFLOW if (upper) return fix16_overflow; #endif } #ifdef FIXMATH_NO_ROUNDING return product >> 16; #else fix16_t result = product >> 16; result += (product & 0x8000) >> 15; return result; #endif } #endif
Comments
Yes, although I think I may change it to a routine that uses a series of 16x16 MUL instructions -- I did a test a while back and doing 3 MULs plus the appropriate shifts and adds is slightly faster than QMUL.
Still, cordic has this pipeline thing that might speed things up for me, we'll see...
IIRC CORDIC is kind of slow for multiplies (I think 16 cycles?? Plus 2 cycles to issue the qmul and 4 cycles to fetch the results). But it's been a while since I ran the tests.
No, and that would potentially improve performance at the cost of a more complicated compiler (and assuming there are useful things to do between the QMUL and GETQX).
Only time this fails is if it's a simple recursion that has to be serially processed.
hidden in the rings that we can't see,
the rings are almost pure ice."
hidden in the rings that we can't see,
the rings are almost pure ice."