Slightly faster integer multiplication in PASM

lonesocklonesock Posts: 886
edited April 2015 in Propeller 1 Vote Up0Vote Down
Hi, all.

(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.

So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:
{{
        vRes (32-bit) := v1 (32-bit) * v2 (32-bit)
        ------------------------------------------

        Break the multiplication of 2 32-bit numbers into 4 multiplications
        of the 4x 16-bit portions:
        v1 * v2 =
              (v1_hi * v2_hi) << 32
            + (v1_hi * v2_lo) << 16
            + (v1_lo * v2_hi) << 16
            + (v1_lo * v2_lo) << 0

        Note that the first term can not fit in our result so we ignore it,
        and I can re-combine v1_hi and v1_lo:
        v1 * v2 (fit into 32 bits) =
              (v1 * v2_lo)
            + (v1_lo * v2_hi) << 16   
}}
newmult       ' setup
              mov       vRes, #0      ' Primary accumulator (and final result)
              mov       tmp1, v1      ' Both my secondary accumulator,
              shl       tmp1, #16     ' and the lower 16 bits of v1.
              mov       tmp2, v2      ' This is the upper 16 bits of v2,
              shr       tmp2, #16     ' which will sum into my 2nd accumulator.
              mov       i, #16        ' Instead of 4 instructions 32x, do 6 instructions 16x.          
:loop         ' v1_hi_lo * v2_lo
              shr       v2, #1 wc     ' get the low bit of v2          
        if_c  add       vRes, v1      ' (conditionally) sum v1 into my 1st accumulator
              shl       v1, #1        ' bit align v1 for the next pass 
              ' v1_lo * v2_hi
              shl       tmp1, #1 wc   ' get the high bit of v1_lo, *AND* shift my 2nd accumulator
        if_c  add       tmp1, tmp2    ' (conditionally) add v2_hi into the 2nd accumulator
              ' repeat 16x
              djnz      i, #:loop     ' I can't think of a way to early exit this
              ' finalize
              shl       tmp1, #16     ' align my 2nd accumulator
              add       vRes, tmp1    ' and add its contribution          
newmult_ret   ret
This always takes 428 clocks, including the call and return. I can't think of a way to do early exit, so if you spot anything please let me know.

thanks,
Jonathan
Free time status: see my avatar [8^)
F32 - fast & concise floating point: OBEX, Thread
Unrelated to the prop: KISSlicer

Comments

  • 17 Comments sorted by Date Added Votes
  • Cluso99Cluso99 Posts: 13,382
    edited April 2015 Vote Up0Vote Down
    Jonathan,
    I will check the code in my faster spin interpreter. IIRC Chip posted a faster multiply that I included in it. Again iirc it had an early exit. So it may suit you purpose. If you cannot wait, check my code posted in the faster spin interpreter thread.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • lonesocklonesock Posts: 886
    edited April 2015 Vote Up0Vote Down
    Thanks. Is version "260C_007F" the latest (it was rather hard to find, my Forum-Fu is weak)? If so, it has the classical 32 reps of 4 instructions at the core and no early exit, so this version should be ~25% faster in all cases.

    Jonathan
    Free time status: see my avatar [8^)
    F32 - fast & concise floating point: OBEX, Thread
    Unrelated to the prop: KISSlicer
  • jmgjmg Posts: 11,076
    edited April 2015 Vote Up0Vote Down
    lonesock wrote: »
    If so, it has the classical 32 reps of 4 instructions at the core and no early exit, so this version should be ~25% faster in all cases.
    When rating code, it is a good idea to include size, along with speed figures.
  • potatoheadpotatohead Posts: 9,094
    edited April 2015 Vote Up0Vote Down
    There is a listing, along with instructions per loop detail in the first post containing all anyone needs for metrics.
    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: http://forums.parallax.com/showthread.php?123709-Commented-Graphics_Demo.spin<br>
  • jmgjmg Posts: 11,076
    edited April 2015 Vote Up0Vote Down
    potatohead wrote: »
    There is a listing, along with instructions per loop detail in the first post containing all anyone needs for metrics.
    ?
    I must have missed the Size values for the new Mult, and the Size value for the Old Mult in #1.
    Can you point to them ?
  • lonesocklonesock Posts: 886
    edited April 2015 Vote Up0Vote Down
    Here's the comparison, with a caveat: the Spin interpreter has all four ops (* ** / //) semi-interleaved, so there are some things that would be inefficient if you were to pull the MUL code out by itself.

    From my listing you can see that my code has 15 instructions (including the return, but not including a call). The main loop is 6 instructions that run 16 times, so that's 384 clocks plus 4 clocks to fall through the DJNZ. It also has 8 instructions for setup and cleanup, plus the return. The run time with the call, setup, and return will always be 428 clocks. I am using 6 local variables: argument 1, argument 2, result, scratch 1, scratch 2, and a loop variable. Presumably those can be reused from whatever code you wish integrated, but on the off chance you run this completely solo, that's 21 longs used.

    The Spin Interpreter has 15 setup & clean-up instructions in the MUL section. I am unsure how many of those could be stripped for a MUL-only code. Then there is the heart of the algorithm, which is a block of 4 instructions, executed 32 times. It also uses 5 local variables. The runtime for it looks to be about 580 clocks. Total size is about 19 longs, plus 5 local vars, or 24 total.

    My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.

    thanks,
    Jonathan
    Free time status: see my avatar [8^)
    F32 - fast & concise floating point: OBEX, Thread
    Unrelated to the prop: KISSlicer
  • jmgjmg Posts: 11,076
    edited April 2015 Vote Up0Vote Down
    lonesock wrote: »
    My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.

    Thanks, I asked because I know Cluso is working on a faster Spin, and the Speed/Size trade off will matter there.
  • Cluso99Cluso99 Posts: 13,382
    edited April 2015 Vote Up0Vote Down
    lonesock,
    yes that is the latest spin interpreter.
    For interest, here is the faster spin interpreter. However, there have been no mods to the mult as far as I can see (ie haven't verified with original)

    I also posted the sqrt section as it has been improved in case anyone is interested.
    '$F3 decode 
    math_F3                 mov     x,#1                    '\\ decode
                            shl     x,y                     '//                    <=== could jmp to math_E0
                            jmp     #push
    
    '$F4..F7 = MUL/DIV lower/upper result
    math_F4                 and     a,#%11111               '<== and mask (need in a)
                            mov     t1,#0
                            mov     t2,#32                  'multiply/divide
                            abs     x,x             wc
                            muxc    a,#%01100
                            abs     y,y             wc,wz
            if_c            xor     a,#%00100
                            test    a,#%00010       wc
            if_c_and_nz     jmp     #mdiv                   'if divide and y=0, do multiply so result=0
    
                            shr     x,#1            wc      'multiply
    mmul    if_c            add     t1,y            wc
                            rcr     t1,#1           wc
                            rcr     x,#1            wc
                            djnz    t2,#mmul
                            test    a,#%00100       wz
            if_nz           neg     t1,t1
            if_nz           neg     x,x             wz
            if_nz           sub     t1,#1
                            test    a,#%00001       wz
            if_nz           mov     x,t1
                            jmp     #push
    
    mdiv                    shr     y,#1            wc,wz   'divide
                            rcr     t1,#1
            if_nz           djnz    t2,#mdiv
    mdiv2                   cmpsub  x,t1            wc
                            rcl     y,#1
                            shr     t1,#1
                            djnz    t2,#mdiv2
                            test    a,#%01000       wc
                            negc    x,x
                            test    a,#%00100       wc
                            test    a,#%00001       wz
            if_z            negc    x,y
                            jmp     #push
    
    '$F8=SQR  square root
    '&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;
    '<=== can improve from Chips code in sqrt forum
    {
    math_F8                 mov     x,#0                    
                            mov     t1,#0                   
                            mov     t2,#16
    msqr                    shl     y,#1            wc
                            rcl     t1,#1
                            shl     y,#1            wc
                            rcl     t1,#1
                            shl     x,#2
                            or      x,#1
                            cmpsub  t1,x            wc
                            shr     x,#2
                            rcl     x,#1
                            djnz    t2,#msqr
                            jmp     #push
    }                        
    '&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;
    'Chip's smaller version: masksqrt = $40000000 gets used directly and rotated all the way back to orig. value
    math_F8                 mov     x,#0                    'reset root
    msqr                    or      x,masksqrt              'set trial bit
                            cmpsub  y,x             wc      'subtract root from input if fits
                            sumnc   x,masksqrt              'cancel trial bit, set root bit if fit
                            shr     x,#1                    'shift root down
                            ror     masksqrt,#2     wc      'shift mask down (wraps on last iteration)
            if_nc           jmp     #msqr                   'loop until mask restored
                            jmp     #push
    '&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;&#61610;&#61609;
    
    
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • lonesocklonesock Posts: 886
    edited April 2015 Vote Up0Vote Down
    Heh, I remember the original thread for the sqrt optimization! I'm getting old. [8^)

    Jonathan
    Free time status: see my avatar [8^)
    F32 - fast & concise floating point: OBEX, Thread
    Unrelated to the prop: KISSlicer
  • Mark_TMark_T Posts: 1,653
    edited April 2015 Vote Up0Vote Down
    Standard 32x32 loop body is 3 instructions plus the djnz, so 4 in total, run 32 times = 128 instructions
    Unroll the loop by 4, then body is 13 instructions run 8 times, total 104 instructions (plus 4 for setup
    gives 108 instructions, static instruction count = 15

    Your method is 16*6 plus 11 for setup = 107 instructions, static count 15 too, so basically the same,
    a whisker faster.

    Standard method unrolled by 8 is 104 instructions dynamic, 27 static.

    Unroll your loop by 2 gains you 8 instructions, by 4 gains you 12...

    One can test for one of the arguments being 16 bit only and branch to a shorter version as
    this is the common case
  • Peter JakackiPeter Jakacki Posts: 6,903
    edited April 2015 Vote Up0Vote Down
    The routine I use in Tachyon does not setup a loop counter, it simply tests for exhaustion of the multiplicand. It also optimizes the order of the operands as well as that only adds another two instructions.

    Here is a typical timing of the UM* instruction which returns a 64-bit result.

    123 CLKFREQ LAP UM* LAP .LAP 2.500us ok

    The best speed for the UM* is 833ns while the worst case is multiplying two $FFFF.FFFFs together to get a 64bit $FFFF.FFFE.0000.0001 result (or -#1FFFFFFFF) in 9.833us!

    This was run on a +P8 running at 96MHz.
    {HELP UM* ( u1 u2 -- u1*u2L u1*u2H )
    DESC: unsigned 32bit * 32bit multiply -- 64bit result
    TIME: 1..11.8us @80MHz
    }
    UMMUL        mov    R0,tos+1
                 min    R0,tos           ' max(tos, tos+1)
                 mov    R2,tos+1
                 max    R2,tos           ' min(tos, tos+1)
                 mov    R1,#0
                 mov    tos,#0           ' zero result
                 mov    tos+1,#0
    UMMULLP      shr    R2,#1 wz,wc      ' test next bit of u1
       if_nc     jmp    #UMMUL1          ' skip if no bit here
                 add    tos+1,R0 wc      ' add in shifted u2
                 addx   tos,R1           ' carry into upper long
    UMMUL1       add    R0,R0 wc         ' shift u2 left
                 addx   R1,R1            ' carry into 64-bits
       if_nz     jmp    #UMMULLP         ' exhausted u1?
                 jmp    unext
    
    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    Tachyon Forth News Blog
    TACHYON DEMONSTRATOR
    Brisbane, Australia
  • Cluso99Cluso99 Posts: 13,382
    edited April 2015 Vote Up0Vote Down
    Thinking mathematically...
    1. Save resultant sign
    2. Make both 32bit numbers positive
    3. Make the multiplier the smaller number
    4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop
    5. Reinsert resultant sign

    The result will be 64bits, preventing overflow.

    This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.

    Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • davidsaundersdavidsaunders Posts: 1,559
    edited April 2015 Vote Up0Vote Down
    Cluso99 wrote: »
    Thinking mathematically...
    1. Save resultant sign
    2. Make both 32bit numbers positive
    3. Make the multiplier the smaller number
    4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop
    5. Reinsert resultant sign

    The result will be 64bits, preventing overflow.

    This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.

    Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.
    For divides there are many extremely fast tight algorithms implemented in ARM assembly, mostly by RISC OS coders (they/we used to challenge each other to do better than the known fastest and smallest divide routines in assembly). A quick search should provide these, and you should be able to reimplement these on the prop fairly simply.
    Hoping for the Prop II. I like ARM, though need a Prop II to complement it.
  • davidsaundersdavidsaunders Posts: 1,559
    edited April 2015 Vote Up0Vote Down
    For some of the fast ARM division routines see:
    http://www.chiark.greenend.org.uk/~theom/riscos/docs/ultimate/a252div.txt
    Hoping for the Prop II. I like ARM, though need a Prop II to complement it.
  • Thankyou lonesock! I needed a 32-bit signed multiplication function. I modified your code for signed multiplication. I believe it works fairly well, however it may need a few changes to work in another implementation.
    '
    ' Multiply
    '
    '   in:         q1 = 16-bit multiplicand (q1[31..16] must be 0)
    '               q2 = 16-bit multiplier
    '
    '   out:        q1 = 32-bit product
    '                    
    multiply_       ' setup
                                           
                            mov     q4, q1
                            xor     q4, q2
                            abs     q1, q1
                            abs     q2, q2
    
                            mov       vRes, #0      ' Primary accumulator (and final result)   
                            mov       tmp1, q1      ' Both my secondary accumulator,
                            shl       tmp1, #16     ' and the lower 16 bits of v1.    
                            mov       tmp2, q2      ' This is the upper 16 bits of v2,
                            shr       tmp2, #16     ' which will sum into my 2nd accumulator.
                            mov       temp, #16        ' Instead of 4 instructions 32x, do 6 instructions 16x.          
    :loop                   ' v1_hi_lo * v2_lo
                            shr       q2, #1 wc     ' get the low bit of v2          
                      if_c  add       vRes, q1      ' (conditionally) sum v1 into my 1st accumulator
                            shl       q1, #1        ' bit align v1 for the next pass 
                            ' v1_lo * v2_hi
                            shl       tmp1, #1 wc   ' get the high bit of v1_lo, *AND* shift my 2nd accumulator
                      if_c  add       tmp1, tmp2    ' (conditionally) add v2_hi into the 2nd accumulator
                            ' repeat 16x
                            djnz      temp, #:loop     ' I can't think of a way to early exit this
                            ' finalize
                            shl       tmp1, #16     ' align my 2nd accumulator
                            add       vRes, tmp1    ' and add its contribution
                            mov       q1, vRes
                           
                            shl     q4,#1           wc
                  if_c      neg     q1, q1
                         
    multiply__ret             ret
    
  • Thankyou lonesock! I needed a 32-bit signed multiplication function. I modified your code for signed multiplication. I believe it works fairly well, however it may need a few changes to work in another implementation.

    Bear in mind that while you use a fixed sized loop it will waste unnecessary cycles when there is nothing left to process, most of the time. My 32x32 multiply with 64-bit result usually only takes about 3us.
    Here's an interactive example with timing report in Tachyon.

    80MHz clock - 32x32 multiply with 64-bit result.
    ...  1,234,567,912 100 LAP UM* LAP .LAP SPACE D. 224 cycles = 2.800us  123456791200 ok
    
    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    Tachyon Forth News Blog
    TACHYON DEMONSTRATOR
    Brisbane, Australia
  • In some situations early exit doesn't give you much/any gain - for instance if doing signal processing you might expect
    most values to be "large" as they will be fixed-point or floating point samples, not a statistically nice set of mainly
    small integers. My experience of this is handling 24 bit audio samples and doing correlation, rms, that sort of thing.
    Occassionally the values are small but the worst-case performance is all that matters for real-time samples..
Sign In or Register to comment.