Welcome to the Parallax Discussion Forums, sign-up to participate.

# Slightly faster integer multiplication in PASM

Posts: 903
edited 2015-04-16 - 19:46:27
Hi, all.

(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.

So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:
```{{
vRes (32-bit) := v1 (32-bit) * v2 (32-bit)
------------------------------------------

Break the multiplication of 2 32-bit numbers into 4 multiplications
of the 4x 16-bit portions:
v1 * v2 =
(v1_hi * v2_hi) << 32
+ (v1_hi * v2_lo) << 16
+ (v1_lo * v2_hi) << 16
+ (v1_lo * v2_lo) << 0

Note that the first term can not fit in our result so we ignore it,
and I can re-combine v1_hi and v1_lo:
v1 * v2 (fit into 32 bits) =
(v1 * v2_lo)
+ (v1_lo * v2_hi) << 16
}}
newmult       ' setup
mov       vRes, #0      ' Primary accumulator (and final result)
mov       tmp1, v1      ' Both my secondary accumulator,
shl       tmp1, #16     ' and the lower 16 bits of v1.
mov       tmp2, v2      ' This is the upper 16 bits of v2,
shr       tmp2, #16     ' which will sum into my 2nd accumulator.
mov       i, #16        ' Instead of 4 instructions 32x, do 6 instructions 16x.
:loop         ' v1_hi_lo * v2_lo
shr       v2, #1 wc     ' get the low bit of v2
if_c  add       vRes, v1      ' (conditionally) sum v1 into my 1st accumulator
shl       v1, #1        ' bit align v1 for the next pass
' v1_lo * v2_hi
shl       tmp1, #1 wc   ' get the high bit of v1_lo, *AND* shift my 2nd accumulator
if_c  add       tmp1, tmp2    ' (conditionally) add v2_hi into the 2nd accumulator
' repeat 16x
djnz      i, #:loop     ' I can't think of a way to early exit this
' finalize
shl       tmp1, #16     ' align my 2nd accumulator
newmult_ret   ret
```
This always takes 428 clocks, including the call and return. I can't think of a way to do early exit, so if you spot anything please let me know.

thanks,
Jonathan
Free time status: see my avatar [8^)
F32 - fast & concise floating point: OBEX, Thread
Unrelated to the prop: KISSlicer

• Posts: 14,392
edited 2015-04-15 - 12:56:24
Jonathan,
I will check the code in my faster spin interpreter. IIRC Chip posted a faster multiply that I included in it. Again iirc it had an early exit. So it may suit you purpose. If you cannot wait, check my code posted in the faster spin interpreter thread.
My Prop boards:
Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
Website: www.clusos.com
• Posts: 903
edited 2015-04-15 - 13:26:42
Thanks. Is version "260C_007F" the latest (it was rather hard to find, my Forum-Fu is weak)? If so, it has the classical 32 reps of 4 instructions at the core and no early exit, so this version should be ~25% faster in all cases.

Jonathan
Free time status: see my avatar [8^)
F32 - fast & concise floating point: OBEX, Thread
Unrelated to the prop: KISSlicer
• Posts: 12,909
edited 2015-04-15 - 13:54:46
lonesock wrote: »
If so, it has the classical 32 reps of 4 instructions at the core and no early exit, so this version should be ~25% faster in all cases.
When rating code, it is a good idea to include size, along with speed figures.
• Posts: 9,691
edited 2015-04-15 - 15:44:47
There is a listing, along with instructions per loop detail in the first post containing all anyone needs for metrics.
Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
Parallax colors simplified: https://forums.parallax.com/discussion/123709/commented-graphics-demo-spin<br>
• Posts: 12,909
edited 2015-04-15 - 15:50:40
There is a listing, along with instructions per loop detail in the first post containing all anyone needs for metrics.
?
I must have missed the Size values for the new Mult, and the Size value for the Old Mult in #1.
Can you point to them ?
• Posts: 903
edited 2015-04-15 - 15:55:51
Here's the comparison, with a caveat: the Spin interpreter has all four ops (* ** / //) semi-interleaved, so there are some things that would be inefficient if you were to pull the MUL code out by itself.

From my listing you can see that my code has 15 instructions (including the return, but not including a call). The main loop is 6 instructions that run 16 times, so that's 384 clocks plus 4 clocks to fall through the DJNZ. It also has 8 instructions for setup and cleanup, plus the return. The run time with the call, setup, and return will always be 428 clocks. I am using 6 local variables: argument 1, argument 2, result, scratch 1, scratch 2, and a loop variable. Presumably those can be reused from whatever code you wish integrated, but on the off chance you run this completely solo, that's 21 longs used.

The Spin Interpreter has 15 setup & clean-up instructions in the MUL section. I am unsure how many of those could be stripped for a MUL-only code. Then there is the heart of the algorithm, which is a block of 4 instructions, executed 32 times. It also uses 5 local variables. The runtime for it looks to be about 580 clocks. Total size is about 19 longs, plus 5 local vars, or 24 total.

My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.

thanks,
Jonathan
Free time status: see my avatar [8^)
F32 - fast & concise floating point: OBEX, Thread
Unrelated to the prop: KISSlicer
• Posts: 12,909
edited 2015-04-15 - 16:01:42
lonesock wrote: »
My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.

Thanks, I asked because I know Cluso is working on a faster Spin, and the Speed/Size trade off will matter there.
• Posts: 14,392
edited 2015-04-15 - 18:36:42
lonesock,
yes that is the latest spin interpreter.
For interest, here is the faster spin interpreter. However, there have been no mods to the mult as far as I can see (ie haven't verified with original)

I also posted the sqrt section as it has been improved in case anyone is interested.
```'\$F3 decode
math_F3                 mov     x,#1                    '\\ decode
shl     x,y                     '//                    <=== could jmp to math_E0
jmp     #push

'\$F4..F7 = MUL/DIV lower/upper result
math_F4                 and     a,#%11111               '<== and mask (need in a)
mov     t1,#0
mov     t2,#32                  'multiply/divide
abs     x,x             wc
muxc    a,#%01100
abs     y,y             wc,wz
if_c            xor     a,#%00100
test    a,#%00010       wc
if_c_and_nz     jmp     #mdiv                   'if divide and y=0, do multiply so result=0

shr     x,#1            wc      'multiply
rcr     t1,#1           wc
rcr     x,#1            wc
djnz    t2,#mmul
test    a,#%00100       wz
if_nz           neg     t1,t1
if_nz           neg     x,x             wz
if_nz           sub     t1,#1
test    a,#%00001       wz
if_nz           mov     x,t1
jmp     #push

mdiv                    shr     y,#1            wc,wz   'divide
rcr     t1,#1
if_nz           djnz    t2,#mdiv
mdiv2                   cmpsub  x,t1            wc
rcl     y,#1
shr     t1,#1
djnz    t2,#mdiv2
test    a,#%01000       wc
negc    x,x
test    a,#%00100       wc
test    a,#%00001       wz
if_z            negc    x,y
jmp     #push

'\$F8=SQR  square root
'&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;
'<=== can improve from Chips code in sqrt forum
{
math_F8                 mov     x,#0
mov     t1,#0
mov     t2,#16
msqr                    shl     y,#1            wc
rcl     t1,#1
shl     y,#1            wc
rcl     t1,#1
shl     x,#2
or      x,#1
cmpsub  t1,x            wc
shr     x,#2
rcl     x,#1
djnz    t2,#msqr
jmp     #push
}
'&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;
'Chip's smaller version: masksqrt = \$40000000 gets used directly and rotated all the way back to orig. value
math_F8                 mov     x,#0                    'reset root
msqr                    or      x,masksqrt              'set trial bit
cmpsub  y,x             wc      'subtract root from input if fits
sumnc   x,masksqrt              'cancel trial bit, set root bit if fit
shr     x,#1                    'shift root down
if_nc           jmp     #msqr                   'loop until mask restored
jmp     #push
'&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;

```
My Prop boards:
Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
Website: www.clusos.com
• Posts: 903
edited 2015-04-15 - 20:55:50
Heh, I remember the original thread for the sqrt optimization! I'm getting old. [8^)

Jonathan
Free time status: see my avatar [8^)
F32 - fast & concise floating point: OBEX, Thread
Unrelated to the prop: KISSlicer
• Posts: 1,789
edited 2015-04-16 - 03:58:55
Standard 32x32 loop body is 3 instructions plus the djnz, so 4 in total, run 32 times = 128 instructions
Unroll the loop by 4, then body is 13 instructions run 8 times, total 104 instructions (plus 4 for setup
gives 108 instructions, static instruction count = 15

Your method is 16*6 plus 11 for setup = 107 instructions, static count 15 too, so basically the same,
a whisker faster.

Standard method unrolled by 8 is 104 instructions dynamic, 27 static.

Unroll your loop by 2 gains you 8 instructions, by 4 gains you 12...

One can test for one of the arguments being 16 bit only and branch to a shorter version as
this is the common case
• Posts: 8,014
edited 2015-04-16 - 08:26:47
The routine I use in Tachyon does not setup a loop counter, it simply tests for exhaustion of the multiplicand. It also optimizes the order of the operands as well as that only adds another two instructions.

Here is a typical timing of the UM* instruction which returns a 64-bit result.

123 CLKFREQ LAP UM* LAP .LAP 2.500us ok

The best speed for the UM* is 833ns while the worst case is multiplying two \$FFFF.FFFFs together to get a 64bit \$FFFF.FFFE.0000.0001 result (or -#1FFFFFFFF) in 9.833us!

This was run on a +P8 running at 96MHz.
```{HELP UM* ( u1 u2 -- u1*u2L u1*u2H )
DESC: unsigned 32bit * 32bit multiply -- 64bit result
TIME: 1..11.8us @80MHz
}
UMMUL        mov    R0,tos+1
min    R0,tos           ' max(tos, tos+1)
mov    R2,tos+1
max    R2,tos           ' min(tos, tos+1)
mov    R1,#0
mov    tos,#0           ' zero result
mov    tos+1,#0
UMMULLP      shr    R2,#1 wz,wc      ' test next bit of u1
if_nc     jmp    #UMMUL1          ' skip if no bit here
addx   tos,R1           ' carry into upper long
UMMUL1       add    R0,R0 wc         ' shift u2 left
addx   R1,R1            ' carry into 64-bits
if_nz     jmp    #UMMULLP         ' exhausted u1?
jmp    unext
```

Tachyon Forth - compact, fast, forthwright and interactive

--->CLICK THE LOGO for more links<---

P2 +++++ TAQOZ INTRO & LINKS +++++ P2 SHORTFORM DATASHEET
P1 +++++ Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
Brisbane, Australia
• Posts: 14,392
edited 2015-04-16 - 14:51:03
Thinking mathematically...
1. Save resultant sign
2. Make both 32bit numbers positive
3. Make the multiplier the smaller number
4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop
5. Reinsert resultant sign

The result will be 64bits, preventing overflow.

This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.

Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.
My Prop boards:
Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
Website: www.clusos.com
• Posts: 1,559
edited 2015-04-16 - 19:40:46
Cluso99 wrote: »
Thinking mathematically...
1. Save resultant sign
2. Make both 32bit numbers positive
3. Make the multiplier the smaller number
4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop
5. Reinsert resultant sign

The result will be 64bits, preventing overflow.

This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.

Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.
For divides there are many extremely fast tight algorithms implemented in ARM assembly, mostly by RISC OS coders (they/we used to challenge each other to do better than the known fastest and smallest divide routines in assembly). A quick search should provide these, and you should be able to reimplement these on the prop fairly simply.
Hoping for the Prop II. I like ARM, though need a Prop II to complement it.
• Posts: 1,559
edited 2015-04-16 - 19:46:27
For some of the fast ARM division routines see:
http://www.chiark.greenend.org.uk/~theom/riscos/docs/ultimate/a252div.txt
Hoping for the Prop II. I like ARM, though need a Prop II to complement it.
• Posts: 3
Thankyou lonesock! I needed a 32-bit signed multiplication function. I modified your code for signed multiplication. I believe it works fairly well, however it may need a few changes to work in another implementation.
```'
' Multiply
'
'   in:         q1 = 16-bit multiplicand (q1[31..16] must be 0)
'               q2 = 16-bit multiplier
'
'   out:        q1 = 32-bit product
'
multiply_       ' setup

mov     q4, q1
xor     q4, q2
abs     q1, q1
abs     q2, q2

mov       vRes, #0      ' Primary accumulator (and final result)
mov       tmp1, q1      ' Both my secondary accumulator,
shl       tmp1, #16     ' and the lower 16 bits of v1.
mov       tmp2, q2      ' This is the upper 16 bits of v2,
shr       tmp2, #16     ' which will sum into my 2nd accumulator.
mov       temp, #16        ' Instead of 4 instructions 32x, do 6 instructions 16x.
:loop                   ' v1_hi_lo * v2_lo
shr       q2, #1 wc     ' get the low bit of v2
if_c  add       vRes, q1      ' (conditionally) sum v1 into my 1st accumulator
shl       q1, #1        ' bit align v1 for the next pass
' v1_lo * v2_hi
shl       tmp1, #1 wc   ' get the high bit of v1_lo, *AND* shift my 2nd accumulator
if_c  add       tmp1, tmp2    ' (conditionally) add v2_hi into the 2nd accumulator
' repeat 16x
djnz      temp, #:loop     ' I can't think of a way to early exit this
' finalize
shl       tmp1, #16     ' align my 2nd accumulator
mov       q1, vRes

shl     q4,#1           wc
if_c      neg     q1, q1

multiply__ret             ret
```
• Posts: 8,014
Thankyou lonesock! I needed a 32-bit signed multiplication function. I modified your code for signed multiplication. I believe it works fairly well, however it may need a few changes to work in another implementation.

Bear in mind that while you use a fixed sized loop it will waste unnecessary cycles when there is nothing left to process, most of the time. My 32x32 multiply with 64-bit result usually only takes about 3us.
Here's an interactive example with timing report in Tachyon.

80MHz clock - 32x32 multiply with 64-bit result.
```...  1,234,567,912 100 LAP UM* LAP .LAP SPACE D. 224 cycles = 2.800us  123456791200 ok
```

Tachyon Forth - compact, fast, forthwright and interactive

--->CLICK THE LOGO for more links<---

P2 +++++ TAQOZ INTRO & LINKS +++++ P2 SHORTFORM DATASHEET
P1 +++++ Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
Brisbane, Australia
• Posts: 1,789
In some situations early exit doesn't give you much/any gain - for instance if doing signal processing you might expect
most values to be "large" as they will be fixed-point or floating point samples, not a statistically nice set of mainly
small integers. My experience of this is handling 24 bit audio samples and doing correlation, rms, that sort of thing.
Occassionally the values are small but the worst-case performance is all that matters for real-time samples..