Slightly faster integer multiplication in PASM
Hi, all.
(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.
So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:
thanks,
Jonathan
(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.
So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:
{{
vRes (32-bit) := v1 (32-bit) * v2 (32-bit)
------------------------------------------
Break the multiplication of 2 32-bit numbers into 4 multiplications
of the 4x 16-bit portions:
v1 * v2 =
(v1_hi * v2_hi) << 32
+ (v1_hi * v2_lo) << 16
+ (v1_lo * v2_hi) << 16
+ (v1_lo * v2_lo) << 0
Note that the first term can not fit in our result so we ignore it,
and I can re-combine v1_hi and v1_lo:
v1 * v2 (fit into 32 bits) =
(v1 * v2_lo)
+ (v1_lo * v2_hi) << 16
}}
newmult ' setup
mov vRes, #0 ' Primary accumulator (and final result)
mov tmp1, v1 ' Both my secondary accumulator,
shl tmp1, #16 ' and the lower 16 bits of v1.
mov tmp2, v2 ' This is the upper 16 bits of v2,
shr tmp2, #16 ' which will sum into my 2nd accumulator.
mov i, #16 ' Instead of 4 instructions 32x, do 6 instructions 16x.
:loop ' v1_hi_lo * v2_lo
shr v2, #1 wc ' get the low bit of v2
if_c add vRes, v1 ' (conditionally) sum v1 into my 1st accumulator
shl v1, #1 ' bit align v1 for the next pass
' v1_lo * v2_hi
shl tmp1, #1 wc ' get the high bit of v1_lo, *AND* shift my 2nd accumulator
if_c add tmp1, tmp2 ' (conditionally) add v2_hi into the 2nd accumulator
' repeat 16x
djnz i, #:loop ' I can't think of a way to early exit this
' finalize
shl tmp1, #16 ' align my 2nd accumulator
add vRes, tmp1 ' and add its contribution
newmult_ret ret
This always takes 428 clocks, including the call and return. I can't think of a way to do early exit, so if you spot anything please let me know.thanks,
Jonathan

Comments
I will check the code in my faster spin interpreter. IIRC Chip posted a faster multiply that I included in it. Again iirc it had an early exit. So it may suit you purpose. If you cannot wait, check my code posted in the faster spin interpreter thread.
Jonathan
I must have missed the Size values for the new Mult, and the Size value for the Old Mult in #1.
Can you point to them ?
From my listing you can see that my code has 15 instructions (including the return, but not including a call). The main loop is 6 instructions that run 16 times, so that's 384 clocks plus 4 clocks to fall through the DJNZ. It also has 8 instructions for setup and cleanup, plus the return. The run time with the call, setup, and return will always be 428 clocks. I am using 6 local variables: argument 1, argument 2, result, scratch 1, scratch 2, and a loop variable. Presumably those can be reused from whatever code you wish integrated, but on the off chance you run this completely solo, that's 21 longs used.
The Spin Interpreter has 15 setup & clean-up instructions in the MUL section. I am unsure how many of those could be stripped for a MUL-only code. Then there is the heart of the algorithm, which is a block of 4 instructions, executed 32 times. It also uses 5 local variables. The runtime for it looks to be about 580 clocks. Total size is about 19 longs, plus 5 local vars, or 24 total.
My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.
thanks,
Jonathan
Thanks, I asked because I know Cluso is working on a faster Spin, and the Speed/Size trade off will matter there.
yes that is the latest spin interpreter.
For interest, here is the faster spin interpreter. However, there have been no mods to the mult as far as I can see (ie haven't verified with original)
I also posted the sqrt section as it has been improved in case anyone is interested.
'$F3 decode math_F3 mov x,#1 '\\ decode shl x,y '// <=== could jmp to math_E0 jmp #push '$F4..F7 = MUL/DIV lower/upper result math_F4 and a,#%11111 '<== and mask (need in a) mov t1,#0 mov t2,#32 'multiply/divide abs x,x wc muxc a,#%01100 abs y,y wc,wz if_c xor a,#%00100 test a,#%00010 wc if_c_and_nz jmp #mdiv 'if divide and y=0, do multiply so result=0 shr x,#1 wc 'multiply mmul if_c add t1,y wc rcr t1,#1 wc rcr x,#1 wc djnz t2,#mmul test a,#%00100 wz if_nz neg t1,t1 if_nz neg x,x wz if_nz sub t1,#1 test a,#%00001 wz if_nz mov x,t1 jmp #push mdiv shr y,#1 wc,wz 'divide rcr t1,#1 if_nz djnz t2,#mdiv mdiv2 cmpsub x,t1 wc rcl y,#1 shr t1,#1 djnz t2,#mdiv2 test a,#%01000 wc negc x,x test a,#%00100 wc test a,#%00001 wz if_z negc x,y jmp #push '$F8=SQR square root ' '<=== can improve from Chips code in sqrt forum { math_F8 mov x,#0 mov t1,#0 mov t2,#16 msqr shl y,#1 wc rcl t1,#1 shl y,#1 wc rcl t1,#1 shl x,#2 or x,#1 cmpsub t1,x wc shr x,#2 rcl x,#1 djnz t2,#msqr jmp #push } ' 'Chip's smaller version: masksqrt = $40000000 gets used directly and rotated all the way back to orig. value math_F8 mov x,#0 'reset root msqr or x,masksqrt 'set trial bit cmpsub y,x wc 'subtract root from input if fits sumnc x,masksqrt 'cancel trial bit, set root bit if fit shr x,#1 'shift root down ror masksqrt,#2 wc 'shift mask down (wraps on last iteration) if_nc jmp #msqr 'loop until mask restored jmp #push 'Jonathan
Unroll the loop by 4, then body is 13 instructions run 8 times, total 104 instructions (plus 4 for setup
gives 108 instructions, static instruction count = 15
Your method is 16*6 plus 11 for setup = 107 instructions, static count 15 too, so basically the same,
a whisker faster.
Standard method unrolled by 8 is 104 instructions dynamic, 27 static.
Unroll your loop by 2 gains you 8 instructions, by 4 gains you 12...
One can test for one of the arguments being 16 bit only and branch to a shorter version as
this is the common case
Here is a typical timing of the UM* instruction which returns a 64-bit result.
123 CLKFREQ LAP UM* LAP .LAP 2.500us ok
The best speed for the UM* is 833ns while the worst case is multiplying two $FFFF.FFFFs together to get a 64bit $FFFF.FFFE.0000.0001 result (or -#1FFFFFFFF) in 9.833us!
This was run on a +P8 running at 96MHz.
{HELP UM* ( u1 u2 -- u1*u2L u1*u2H ) DESC: unsigned 32bit * 32bit multiply -- 64bit result TIME: 1..11.8us @80MHz } UMMUL mov R0,tos+1 min R0,tos ' max(tos, tos+1) mov R2,tos+1 max R2,tos ' min(tos, tos+1) mov R1,#0 mov tos,#0 ' zero result mov tos+1,#0 UMMULLP shr R2,#1 wz,wc ' test next bit of u1 if_nc jmp #UMMUL1 ' skip if no bit here add tos+1,R0 wc ' add in shifted u2 addx tos,R1 ' carry into upper long UMMUL1 add R0,R0 wc ' shift u2 left addx R1,R1 ' carry into 64-bits if_nz jmp #UMMULLP ' exhausted u1? jmp unext1. Save resultant sign
2. Make both 32bit numbers positive
3. Make the multiplier the smaller number
4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop
5. Reinsert resultant sign
The result will be 64bits, preventing overflow.
This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.
Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.
http://www.chiark.greenend.org.uk/~theom/riscos/docs/ultimate/a252div.txt
' ' Multiply ' ' in: q1 = 16-bit multiplicand (q1[31..16] must be 0) ' q2 = 16-bit multiplier ' ' out: q1 = 32-bit product ' multiply_ ' setup mov q4, q1 xor q4, q2 abs q1, q1 abs q2, q2 mov vRes, #0 ' Primary accumulator (and final result) mov tmp1, q1 ' Both my secondary accumulator, shl tmp1, #16 ' and the lower 16 bits of v1. mov tmp2, q2 ' This is the upper 16 bits of v2, shr tmp2, #16 ' which will sum into my 2nd accumulator. mov temp, #16 ' Instead of 4 instructions 32x, do 6 instructions 16x. :loop ' v1_hi_lo * v2_lo shr q2, #1 wc ' get the low bit of v2 if_c add vRes, q1 ' (conditionally) sum v1 into my 1st accumulator shl q1, #1 ' bit align v1 for the next pass ' v1_lo * v2_hi shl tmp1, #1 wc ' get the high bit of v1_lo, *AND* shift my 2nd accumulator if_c add tmp1, tmp2 ' (conditionally) add v2_hi into the 2nd accumulator ' repeat 16x djnz temp, #:loop ' I can't think of a way to early exit this ' finalize shl tmp1, #16 ' align my 2nd accumulator add vRes, tmp1 ' and add its contribution mov q1, vRes shl q4,#1 wc if_c neg q1, q1 multiply__ret retBear in mind that while you use a fixed sized loop it will waste unnecessary cycles when there is nothing left to process, most of the time. My 32x32 multiply with 64-bit result usually only takes about 3us.
Here's an interactive example with timing report in Tachyon.
80MHz clock - 32x32 multiply with 64-bit result.
most values to be "large" as they will be fixed-point or floating point samples, not a statistically nice set of mainly
small integers. My experience of this is handling 24 bit audio samples and doing correlation, rms, that sort of thing.
Occassionally the values are small but the worst-case performance is all that matters for real-time samples..