Slightly faster integer multiplication in PASM
lonesock
Posts: 917
Hi, all.
(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.
So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:
thanks,
Jonathan
(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.
So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:
{{ vRes (32-bit) := v1 (32-bit) * v2 (32-bit) ------------------------------------------ Break the multiplication of 2 32-bit numbers into 4 multiplications of the 4x 16-bit portions: v1 * v2 = (v1_hi * v2_hi) << 32 + (v1_hi * v2_lo) << 16 + (v1_lo * v2_hi) << 16 + (v1_lo * v2_lo) << 0 Note that the first term can not fit in our result so we ignore it, and I can re-combine v1_hi and v1_lo: v1 * v2 (fit into 32 bits) = (v1 * v2_lo) + (v1_lo * v2_hi) << 16 }} newmult ' setup mov vRes, #0 ' Primary accumulator (and final result) mov tmp1, v1 ' Both my secondary accumulator, shl tmp1, #16 ' and the lower 16 bits of v1. mov tmp2, v2 ' This is the upper 16 bits of v2, shr tmp2, #16 ' which will sum into my 2nd accumulator. mov i, #16 ' Instead of 4 instructions 32x, do 6 instructions 16x. :loop ' v1_hi_lo * v2_lo shr v2, #1 wc ' get the low bit of v2 if_c add vRes, v1 ' (conditionally) sum v1 into my 1st accumulator shl v1, #1 ' bit align v1 for the next pass ' v1_lo * v2_hi shl tmp1, #1 wc ' get the high bit of v1_lo, *AND* shift my 2nd accumulator if_c add tmp1, tmp2 ' (conditionally) add v2_hi into the 2nd accumulator ' repeat 16x djnz i, #:loop ' I can't think of a way to early exit this ' finalize shl tmp1, #16 ' align my 2nd accumulator add vRes, tmp1 ' and add its contribution newmult_ret retThis always takes 428 clocks, including the call and return. I can't think of a way to do early exit, so if you spot anything please let me know.
thanks,
Jonathan
Comments
I will check the code in my faster spin interpreter. IIRC Chip posted a faster multiply that I included in it. Again iirc it had an early exit. So it may suit you purpose. If you cannot wait, check my code posted in the faster spin interpreter thread.
Jonathan
I must have missed the Size values for the new Mult, and the Size value for the Old Mult in #1.
Can you point to them ?
From my listing you can see that my code has 15 instructions (including the return, but not including a call). The main loop is 6 instructions that run 16 times, so that's 384 clocks plus 4 clocks to fall through the DJNZ. It also has 8 instructions for setup and cleanup, plus the return. The run time with the call, setup, and return will always be 428 clocks. I am using 6 local variables: argument 1, argument 2, result, scratch 1, scratch 2, and a loop variable. Presumably those can be reused from whatever code you wish integrated, but on the off chance you run this completely solo, that's 21 longs used.
The Spin Interpreter has 15 setup & clean-up instructions in the MUL section. I am unsure how many of those could be stripped for a MUL-only code. Then there is the heart of the algorithm, which is a block of 4 instructions, executed 32 times. It also uses 5 local variables. The runtime for it looks to be about 580 clocks. Total size is about 19 longs, plus 5 local vars, or 24 total.
My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.
thanks,
Jonathan
Thanks, I asked because I know Cluso is working on a faster Spin, and the Speed/Size trade off will matter there.
yes that is the latest spin interpreter.
For interest, here is the faster spin interpreter. However, there have been no mods to the mult as far as I can see (ie haven't verified with original)
I also posted the sqrt section as it has been improved in case anyone is interested.
Jonathan
Unroll the loop by 4, then body is 13 instructions run 8 times, total 104 instructions (plus 4 for setup
gives 108 instructions, static instruction count = 15
Your method is 16*6 plus 11 for setup = 107 instructions, static count 15 too, so basically the same,
a whisker faster.
Standard method unrolled by 8 is 104 instructions dynamic, 27 static.
Unroll your loop by 2 gains you 8 instructions, by 4 gains you 12...
One can test for one of the arguments being 16 bit only and branch to a shorter version as
this is the common case
Here is a typical timing of the UM* instruction which returns a 64-bit result.
123 CLKFREQ LAP UM* LAP .LAP 2.500us ok
The best speed for the UM* is 833ns while the worst case is multiplying two $FFFF.FFFFs together to get a 64bit $FFFF.FFFE.0000.0001 result (or -#1FFFFFFFF) in 9.833us!
This was run on a +P8 running at 96MHz.
1. Save resultant sign
2. Make both 32bit numbers positive
3. Make the multiplier the smaller number
4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop
5. Reinsert resultant sign
The result will be 64bits, preventing overflow.
This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.
Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.
http://www.chiark.greenend.org.uk/~theom/riscos/docs/ultimate/a252div.txt
Bear in mind that while you use a fixed sized loop it will waste unnecessary cycles when there is nothing left to process, most of the time. My 32x32 multiply with 64-bit result usually only takes about 3us.
Here's an interactive example with timing report in Tachyon.
80MHz clock - 32x32 multiply with 64-bit result.
most values to be "large" as they will be fixed-point or floating point samples, not a statistically nice set of mainly
small integers. My experience of this is handling 24 bit audio samples and doing correlation, rms, that sort of thing.
Occassionally the values are small but the worst-case performance is all that matters for real-time samples..