Slightly faster integer multiplication in PASM

lonesock · 2015-04-15 12:31

Hi, all.

(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.

So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:

{{
        vRes (32-bit) := v1 (32-bit) * v2 (32-bit)
        ------------------------------------------

        Break the multiplication of 2 32-bit numbers into 4 multiplications
        of the 4x 16-bit portions:
        v1 * v2 =
              (v1_hi * v2_hi) << 32
            + (v1_hi * v2_lo) << 16
            + (v1_lo * v2_hi) << 16
            + (v1_lo * v2_lo) << 0

        Note that the first term can not fit in our result so we ignore it,
        and I can re-combine v1_hi and v1_lo:
        v1 * v2 (fit into 32 bits) =
              (v1 * v2_lo)
            + (v1_lo * v2_hi) << 16   
}}
newmult       ' setup
              mov       vRes, #0      ' Primary accumulator (and final result)
              mov       tmp1, v1      ' Both my secondary accumulator,
              shl       tmp1, #16     ' and the lower 16 bits of v1.
              mov       tmp2, v2      ' This is the upper 16 bits of v2,
              shr       tmp2, #16     ' which will sum into my 2nd accumulator.
              mov       i, #16        ' Instead of 4 instructions 32x, do 6 instructions 16x.          
:loop         ' v1_hi_lo * v2_lo
              shr       v2, #1 wc     ' get the low bit of v2          
        if_c  add       vRes, v1      ' (conditionally) sum v1 into my 1st accumulator
              shl       v1, #1        ' bit align v1 for the next pass 
              ' v1_lo * v2_hi
              shl       tmp1, #1 wc   ' get the high bit of v1_lo, *AND* shift my 2nd accumulator
        if_c  add       tmp1, tmp2    ' (conditionally) add v2_hi into the 2nd accumulator
              ' repeat 16x
              djnz      i, #:loop     ' I can't think of a way to early exit this
              ' finalize
              shl       tmp1, #16     ' align my 2nd accumulator
              add       vRes, tmp1    ' and add its contribution          
newmult_ret   ret

This always takes 428 clocks, including the call and return. I can't think of a way to do early exit, so if you spot anything please let me know.

thanks,
Jonathan

Cluso99 · 2015-04-15 12:56

Jonathan,
I will check the code in my faster spin interpreter. IIRC Chip posted a faster multiply that I included in it. Again iirc it had an early exit. So it may suit you purpose. If you cannot wait, check my code posted in the faster spin interpreter thread.

lonesock · 2015-04-15 13:26

Thanks. Is version "260C_007F" the latest (it was rather hard to find, my Forum-Fu is weak)? If so, it has the classical 32 reps of 4 instructions at the core and no early exit, so this version should be ~25% faster in all cases.

Jonathan

jmg · 2015-04-15 13:54

lonesock wrote: »

If so, it has the classical 32 reps of 4 instructions at the core and no early exit, so this version should be ~25% faster in all cases.

When rating code, it is a good idea to include size, along with speed figures.

potatohead · 2015-04-15 15:44

There is a listing, along with instructions per loop detail in the first post containing all anyone needs for metrics.

jmg · 2015-04-15 15:50

potatohead wrote: »

There is a listing, along with instructions per loop detail in the first post containing all anyone needs for metrics.

?
I must have missed the Size values for the new Mult, and the Size value for the Old Mult in #1.
Can you point to them ?

lonesock · 2015-04-15 15:55

Here's the comparison, with a caveat: the Spin interpreter has all four ops (* ** / //) semi-interleaved, so there are some things that would be inefficient if you were to pull the MUL code out by itself.

From my listing you can see that my code has 15 instructions (including the return, but not including a call). The main loop is 6 instructions that run 16 times, so that's 384 clocks plus 4 clocks to fall through the DJNZ. It also has 8 instructions for setup and cleanup, plus the return. The run time with the call, setup, and return will always be 428 clocks. I am using 6 local variables: argument 1, argument 2, result, scratch 1, scratch 2, and a loop variable. Presumably those can be reused from whatever code you wish integrated, but on the off chance you run this completely solo, that's 21 longs used.

The Spin Interpreter has 15 setup & clean-up instructions in the MUL section. I am unsure how many of those could be stripped for a MUL-only code. Then there is the heart of the algorithm, which is a block of 4 instructions, executed 32 times. It also uses 5 local variables. The runtime for it looks to be about 580 clocks. Total size is about 19 longs, plus 5 local vars, or 24 total.

My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.

thanks,
Jonathan

jmg · 2015-04-15 16:01

lonesock wrote: »

My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.

Thanks, I asked because I know Cluso is working on a faster Spin, and the Speed/Size trade off will matter there.

Cluso99 · 2015-04-15 18:36

lonesock,
yes that is the latest spin interpreter.
For interest, here is the faster spin interpreter. However, there have been no mods to the mult as far as I can see (ie haven't verified with original)

I also posted the sqrt section as it has been improved in case anyone is interested.

'$F3 decode 
math_F3                 mov     x,#1                    '\\ decode
                        shl     x,y                     '//                    <=== could jmp to math_E0
                        jmp     #push

'$F4..F7 = MUL/DIV lower/upper result
math_F4                 and     a,#%11111               '<== and mask (need in a)
                        mov     t1,#0
                        mov     t2,#32                  'multiply/divide
                        abs     x,x             wc
                        muxc    a,#%01100
                        abs     y,y             wc,wz
        if_c            xor     a,#%00100
                        test    a,#%00010       wc
        if_c_and_nz     jmp     #mdiv                   'if divide and y=0, do multiply so result=0

                        shr     x,#1            wc      'multiply
mmul    if_c            add     t1,y            wc
                        rcr     t1,#1           wc
                        rcr     x,#1            wc
                        djnz    t2,#mmul
                        test    a,#%00100       wz
        if_nz           neg     t1,t1
        if_nz           neg     x,x             wz
        if_nz           sub     t1,#1
                        test    a,#%00001       wz
        if_nz           mov     x,t1
                        jmp     #push

mdiv                    shr     y,#1            wc,wz   'divide
                        rcr     t1,#1
        if_nz           djnz    t2,#mdiv
mdiv2                   cmpsub  x,t1            wc
                        rcl     y,#1
                        shr     t1,#1
                        djnz    t2,#mdiv2
                        test    a,#%01000       wc
                        negc    x,x
                        test    a,#%00100       wc
                        test    a,#%00001       wz
        if_z            negc    x,y
                        jmp     #push

'$F8=SQR  square root
'&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;
'<=== can improve from Chips code in sqrt forum
{
math_F8                 mov     x,#0                    
                        mov     t1,#0                   
                        mov     t2,#16
msqr                    shl     y,#1            wc
                        rcl     t1,#1
                        shl     y,#1            wc
                        rcl     t1,#1
                        shl     x,#2
                        or      x,#1
                        cmpsub  t1,x            wc
                        shr     x,#2
                        rcl     x,#1
                        djnz    t2,#msqr
                        jmp     #push
}                        
'&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;
'Chip's smaller version: masksqrt = $40000000 gets used directly and rotated all the way back to orig. value
math_F8                 mov     x,#0                    'reset root
msqr                    or      x,masksqrt              'set trial bit
                        cmpsub  y,x             wc      'subtract root from input if fits
                        sumnc   x,masksqrt              'cancel trial bit, set root bit if fit
                        shr     x,#1                    'shift root down
                        ror     masksqrt,#2     wc      'shift mask down (wraps on last iteration)
        if_nc           jmp     #msqr                   'loop until mask restored
                        jmp     #push
'&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;

lonesock · 2015-04-15 20:55

Heh, I remember the original thread for the sqrt optimization! I'm getting old. [8^)

Jonathan

Mark_T · 2015-04-16 03:58

Standard 32x32 loop body is 3 instructions plus the djnz, so 4 in total, run 32 times = 128 instructions
Unroll the loop by 4, then body is 13 instructions run 8 times, total 104 instructions (plus 4 for setup
gives 108 instructions, static instruction count = 15

Your method is 16*6 plus 11 for setup = 107 instructions, static count 15 too, so basically the same,
a whisker faster.

Standard method unrolled by 8 is 104 instructions dynamic, 27 static.

Unroll your loop by 2 gains you 8 instructions, by 4 gains you 12...

One can test for one of the arguments being 16 bit only and branch to a shorter version as
this is the common case

Peter Jakacki · 2015-04-16 08:26

The routine I use in Tachyon does not setup a loop counter, it simply tests for exhaustion of the multiplicand. It also optimizes the order of the operands as well as that only adds another two instructions.

Here is a typical timing of the UM* instruction which returns a 64-bit result.

123 CLKFREQ LAP UM* LAP .LAP 2.500us ok

The best speed for the UM* is 833ns while the worst case is multiplying two $FFFF.FFFFs together to get a 64bit $FFFF.FFFE.0000.0001 result (or -#1FFFFFFFF) in 9.833us!

This was run on a +P8 running at 96MHz.

{HELP UM* ( u1 u2 -- u1*u2L u1*u2H )
DESC: unsigned 32bit * 32bit multiply -- 64bit result
TIME: 1..11.8us @80MHz
}
UMMUL        mov    R0,tos+1
             min    R0,tos           ' max(tos, tos+1)
             mov    R2,tos+1
             max    R2,tos           ' min(tos, tos+1)
             mov    R1,#0
             mov    tos,#0           ' zero result
             mov    tos+1,#0
UMMULLP      shr    R2,#1 wz,wc      ' test next bit of u1
   if_nc     jmp    #UMMUL1          ' skip if no bit here
             add    tos+1,R0 wc      ' add in shifted u2
             addx   tos,R1           ' carry into upper long
UMMUL1       add    R0,R0 wc         ' shift u2 left
             addx   R1,R1            ' carry into 64-bits
   if_nz     jmp    #UMMULLP         ' exhausted u1?
             jmp    unext

Cluso99 · 2015-04-16 14:51

Thinking mathematically...
1. Save resultant sign
2. Make both 32bit numbers positive
3. Make the multiplier the smaller number
4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop
5. Reinsert resultant sign

The result will be 64bits, preventing overflow.

This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.

Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.

davidsaunders · 2015-04-16 19:40

Cluso99 wrote: »

Thinking mathematically...
1. Save resultant sign
2. Make both 32bit numbers positive
3. Make the multiplier the smaller number
4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop
5. Reinsert resultant sign

The result will be 64bits, preventing overflow.

This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.

Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.

For divides there are many extremely fast tight algorithms implemented in ARM assembly, mostly by RISC OS coders (they/we used to challenge each other to do better than the known fastest and smallest divide routines in assembly). A quick search should provide these, and you should be able to reimplement these on the prop fairly simply.

davidsaunders · 2015-04-16 19:46

For some of the fast ARM division routines see:
http://www.chiark.greenend.org.uk/~theom/riscos/docs/ultimate/a252div.txt

Nuclear_Man_D · 2018-04-03 21:53

Thankyou lonesock! I needed a 32-bit signed multiplication function. I modified your code for signed multiplication. I believe it works fairly well, however it may need a few changes to work in another implementation.

'
' Multiply
'
'   in:         q1 = 16-bit multiplicand (q1[31..16] must be 0)
'               q2 = 16-bit multiplier
'
'   out:        q1 = 32-bit product
'                    
multiply_       ' setup
                                       
                        mov     q4, q1
                        xor     q4, q2
                        abs     q1, q1
                        abs     q2, q2

                        mov       vRes, #0      ' Primary accumulator (and final result)   
                        mov       tmp1, q1      ' Both my secondary accumulator,
                        shl       tmp1, #16     ' and the lower 16 bits of v1.    
                        mov       tmp2, q2      ' This is the upper 16 bits of v2,
                        shr       tmp2, #16     ' which will sum into my 2nd accumulator.
                        mov       temp, #16        ' Instead of 4 instructions 32x, do 6 instructions 16x.          
:loop                   ' v1_hi_lo * v2_lo
                        shr       q2, #1 wc     ' get the low bit of v2          
                  if_c  add       vRes, q1      ' (conditionally) sum v1 into my 1st accumulator
                        shl       q1, #1        ' bit align v1 for the next pass 
                        ' v1_lo * v2_hi
                        shl       tmp1, #1 wc   ' get the high bit of v1_lo, *AND* shift my 2nd accumulator
                  if_c  add       tmp1, tmp2    ' (conditionally) add v2_hi into the 2nd accumulator
                        ' repeat 16x
                        djnz      temp, #:loop     ' I can't think of a way to early exit this
                        ' finalize
                        shl       tmp1, #16     ' align my 2nd accumulator
                        add       vRes, tmp1    ' and add its contribution
                        mov       q1, vRes
                       
                        shl     q4,#1           wc
              if_c      neg     q1, q1
                     
multiply__ret             ret

Peter Jakacki · 2018-04-04 02:35

Nuclear_Man_D wrote: »

Thankyou lonesock! I needed a 32-bit signed multiplication function. I modified your code for signed multiplication. I believe it works fairly well, however it may need a few changes to work in another implementation.

Bear in mind that while you use a fixed sized loop it will waste unnecessary cycles when there is nothing left to process, most of the time. My 32x32 multiply with 64-bit result usually only takes about 3us.
Here's an interactive example with timing report in Tachyon.

80MHz clock - 32x32 multiply with 64-bit result.

...  1,234,567,912 100 LAP UM* LAP .LAP SPACE D. 224 cycles = 2.800us  123456791200 ok

Mark_T · 2018-04-04 09:59

In some situations early exit doesn't give you much/any gain - for instance if doing signal processing you might expect
most values to be "large" as they will be fixed-point or floating point samples, not a statistically nice set of mainly
small integers. My experience of this is handling 24 bit audio samples and doing correlation, rms, that sort of thing.
Occassionally the values are small but the worst-case performance is all that matters for real-time samples..

Slightly faster integer multiplication in PASM

Comments