Substantially faster/shorter multiply ?

Hi all,

Working on my Oberon compiler for the P2, i need a set of multiply subroutines
(or even better a set of templates that the compiler can use to generate inline multiply code).

I know that i can use the cordic option but i was hoping to find something substantially faster,
that can be used when you cant overlap the cordic operation with other code.

I took a routine posted by ersmith in the "Multiply, multiply, multiply" forum post as the basis
for these routines. (See mul.spin)

mul.spin provides:

32 x 32 bit multiply giving 64 bit result - 36 clock cycles
32 x 32 multiply 32 - 20 clock cycles
32 x 16 multiply 32 - 12 clock cycles

Also see mul1.spin which provides: (but needs a half register add instruction)

32 x 32 bit multiply giving 64 bit result - 24 clock cycles
32 x 32 multiply 32 - 16 clock cycles
32 x 16 multiply 32 - 8 clock cycles


I dont know the P2 instruction set in detail yet, are there existing instructions in the P2
which allow add and addx to only add a half register ? (matching the half size multiplier)
or are there other instructions that can achieve this ?

Optimizing the multiply routines make a big difference to code generated by the compiler
to support array of structures and multi dimensional arrays. My compiler already knows
how to use shifts for any multiply or divide by a power of 2.

Thanks in advance.
«1

Comments

  • markmpx wrote: »
    Optimizing the multiply routines make a big difference to code generated by the compiler
    to support array of structures and multi dimensional arrays. My compiler already knows
    how to use shifts for any multiply or divide by a power of 2.
    Note that any multiply by a constant can be converted to shifts and adds/subtracts, and in some cases this will be a big win. For example:
    int times10(int x) { return x * 10; }
    
    can be compiled as:
    _times10
            mov     result1, arg01
            shl     result1, #2
            add     result1, arg01
            shl     result1, #1
    _times10_ret
            reta
    

  • You can cut the number of shifts and adds even further for constant multiplication by using Canonical Signed Digit notation:

    https://www.allaboutcircuits.com/technical-articles/an-introduction-to-canonical-signed-digit-representation/

    That's what I used in my FIR2PASM filter code.

    -Phil
    “Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away. -Antoine de Saint-Exupery
  • ersmith wrote: »
    markmpx wrote: »
    Optimizing the multiply routines make a big difference to code generated by the compiler
    to support array of structures and multi dimensional arrays. My compiler already knows
    how to use shifts for any multiply or divide by a power of 2.
    Note that any multiply by a constant can be converted to shifts and adds/subtracts, and in some cases this will be a big win. For example:
    int times10(int x) { return x * 10; }
    
    can be compiled as:
    _times10
            mov     result1, arg01
            shl     result1, #2
            add     result1, arg01
            shl     result1, #1
    _times10_ret
            reta
    

    Yes that is often a big win.

    The same can also be done for integer division by shift and subtract for division by small constants.

    If your multiply is way faster then your divide (often the case) you can do integer division by a constant
    using a multiply by a magic number. (see https://www.hackersdelight.org/magic.htm and
    http://libdivide.com/). These are often used in x86 compilers.

    It may also be possible to use the ENCOD instruction on both 32 bit operands of a multiply to
    decide if a 2 clock mul or a 12 clock mul could be used. I havent figured out the overhead yet
    to determine if its actually worthy the overhead.
  • You can cut the number of shifts and adds even further for constant multiplication by using Canonical Signed Digit notation:

    https://www.allaboutcircuits.com/technical-articles/an-introduction-to-canonical-signed-digit-representation/

    That's what I used in my FIR2PASM filter code.

    -Phil

    Thanks for the link.
  • Here is another way of speeding up multiply:

    SCA currently does :- Next instruction's S value = unsigned (D[15:0] * S[15:0]) >> 16

    Change SCA WZ so it does Next instruction's S value = unsigned (D[15:0] * S[15:0]) << 16

    The Z bit can still be written as normal (assuming it does that now and its desirable)

    32 x 32 multiply->64 bit result WAS 36 clocks NOW 28
    32 x 32 multiply->32 bit result WAS 20 clocks NOW 16
    32 x 16 multiply->32 bit result WAS 12 clocks NOW 10 (can be 8 if return result in a)

    This also has the advantage that the a and b registers are no longer changed,
    reducing the number of surrounding move instructions.

    What do you think about this idea ?

    I am NOT suggesting holding up the next spin to do this.
  • TonyB_TonyB_ Posts: 1,200
    edited 2019-02-02 - 22:29:48
    Having to use WZ in this way is not ideal. If SCA rotated the MUL result by 16 bits and put it in the next instruction's D value then 32x32 -> 64-bit could take 32 clocks and WZ would operate as now. The saving is less than markmpx's idea above but still a worthwhile 12.5% performance boost.
    Formerly known as TonyB
  • markmpx wrote: »
    What do you think about this idea ?
    It's a cool new instruction, but just that. Say MULH for high 16 bits, or MULD for D port insertion, or MULQ for implicit setq on next instruction.
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • evanh wrote: »
    markmpx wrote: »
    What do you think about this idea ?
    It's a cool new instruction, but just that. Say MULH for high 16 bits, or MULD for D port insertion, or MULQ for implicit setq on next instruction.

    I understand how the MULH and MULD would work, but i don't understand the MULQ you suggested.

    Could you explain in a bit more detail please ? I am still a P2 newbie.

    Thanks
  • Oh, those are just suggested mnemonics for your one new instruction. They're all saying the same thing but with different terminology just as a selection for preferred spelling of the mnemonic.
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • evanhevanh Posts: 6,964
    edited 2019-02-02 - 10:59:00
    The setq description is my understanding of where SCA stashes its result, in the hidden Q register, that is subsequently forced into the ALU S-port for the following instruction. Q gets used in a few different ways.

    PS: It's the S-port, not the D-port I listed above. So MULD is no good, and MULS is already taken. There has been some discussion about changing this type of instruction to feeding D instead of S but that hasn't happened.
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • AleAle Posts: 2,352
    I do not know which compiler are you using to test but I used the AyaCompiler to do the development of the Oberon compiler. I had to update several of the original files to suit the AyaCompiler and windows, that worked well. Sadly I did not progress beyond that :(.
  • evanh wrote: »
    The setq description is my understanding of where SCA stashes its result, in the hidden Q register, that is subsequently forced into the ALU S-port for the following instruction. Q gets used in a few different ways.

    Evan, have you tested this? SETQ, then SCA or XORO32, then instruction that uses SETQ.
    Formerly known as TonyB
  • TonyB_TonyB_ Posts: 1,200
    edited 2019-02-02 - 12:54:51
    markmpx wrote: »
    Here is another way of speeding up multiply:

    SCA currently does :- Next instruction's S value = unsigned (D[15:0] * S[15:0]) >> 16

    Change SCA WZ so it does Next instruction's S value = unsigned (D[15:0] * S[15:0]) << 16

    The Z bit can still be written as normal (assuming it does that now and its desirable)

    Mark,

    Your idea to extend SCA is a clever one that would make 32x32 multiplies a lot faster. It's a pity the C flag opcode bit has been used already to distinguish SCA from SCAS. The issue with WZ is that it's either not available anymore (for high word) or must be used all the time (for low word) and it might be better to swap the WZ words. My understanding of your proposed change is as follows:

    SCA = {16'b0, MUL[31:16]}
    SCA WZ = {MUL[15:0],16'b0}

    If the instruction were modified to be

    SCA {WZ} = {MUL[15:0], MUL[31:16]}

    then WZ could be used only when desired. The drawback is an extra instruction is required to zero the low or high word (if that is necessary for the latter). There would be a smaller time saving for 32x32 -> 64-bit than your solution but none at all for the two smaller multiplies you mention.

    I can't see a way to zero the low word with any net time saving if the result goes to next instruction's S value, but it could be done using the D value and the two extra instructions for masking out the low word would increase the cycle count from 28 to 32, showing that my suggestion is not as optimal as yours.
    Formerly known as TonyB
  • Ale wrote: »
    I do not know which compiler are you using to test but I used the AyaCompiler to do the development of the Oberon compiler. I had to update several of the original files to suit the AyaCompiler and windows, that worked well. Sadly I did not progress beyond that :(.

    The Oberon compiler that i am currently using to host the P2 Oberon compiler is voc. You can find it on
    github by searching for Vishaps/voc. This Oberon compiler can run on Windows/Linux and Mac.
  • evanh wrote: »
    The setq description is my understanding of where SCA stashes its result, in the hidden Q register, that is subsequently forced into the ALU S-port for the following instruction. Q gets used in a few different ways.

    PS: It's the S-port, not the D-port I listed above. So MULD is no good, and MULS is already taken. There has been some discussion about changing this type of instruction to feeding D instead of S but that hasn't happened.

    Thanks evan. I understand your post now.
  • TonyB_ wrote: »
    Evan, have you tested this? SETQ, then SCA or XORO32, then instruction that uses SETQ.

    I've noticed that the SETQ operation only persists, excepting AUGx, until the next instruction anyway. So it would be extinguished by an inserted ADD even.
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • PS: There must be at least two additional state bits associated with the Q register. Off the top of my head:
    - Idle
    - SETQ operation
    - ALTx operation
    - Override next S input
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • TonyB_,

    Thanks for your response.

    I guess i will handle the issue by placing restriction on my oberon compiler for now, so that a 2 clock MUL
    instruction can always be used to calculate offsets from the base address of a structure for

    a[i,,j,k] := ?? etc

    Improvements in this area will ultimately benefit any high level language compiler (Ie: gcc)

    Hopefully, folks are a bit more aware of the issue, and maybe one day things will improve.

    Of my 2 suggestions, the ability to support a word add to a long (mul1.spin in my first post) would result in the largest improvement, but a modified SCA comes close.

    Incidentally, the original oberon compiler i am modifying, does produce code for a 3 operand instruction set.
    (I mentioned this as you seem interested in a 3 operand format eventually).
  • Thinking aloud here...
    You have 2 x 32bit numbers you wish to multiply to give a 64bit answer...
    AH(16bits), AL(16bits)
    BH(16bits), BL(16bits)
    result..
    RH(32bits), RL(32bits)
    
    So to do this
    
    X1(32bits) = AL * BL
    X2(32bits) = AH * BL (not required if AH=0)
    X3(32bits) = AL * BH (not required if BH=0)
    X4(32bits) = AH * BH (not required if AH=0 or BH=0)
    
    then align and add 
    
    R(64bits) = 
     X1 +
     X2 << 16 +
     X3 << 16 +
     X4 << 32
    

    Do I have this right to start with ???
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • You should be able to do a 64x64 multiply with only three 32x32 multiplications using something like Karatsuba multiplication.
  • Cluso99 wrote: »
    Do I have this right to start with ???

    Yes and adding X2 and X3 are problematical.
    Formerly known as TonyB
  • Cluso99 wrote: »
    Thinking aloud here...
    You have 2 x 32bit numbers you wish to multiply to give a 64bit answer...
    AH(16bits), AL(16bits)
    BH(16bits), BL(16bits)
    result..
    RH(32bits), RL(32bits)
    
    So to do this
    
    X1(32bits) = AL * BL
    X2(32bits) = AH * BL (not required if AH=0)
    X3(32bits) = AL * BH (not required if BH=0)
    X4(32bits) = AH * BH (not required if AH=0 or BH=0)
    
    then align and add 
    
    R(64bits) = 
     X1 +
     X2 << 16 +
     X3 << 16 +
     X4 << 32
    

    Do I have this right to start with ???

    Ray,

    Yes you have it correct.

    Adding X2 and X3 will each require a SHL and a SHR as they are aligned as follows
    where each item is a 16 bit quantity. Carry must be propagated between the add of the lo and hi longs.

    X4 X4 X1 X1 +
    00 X2 X2 00 +
    00 X3 X3 00

    To save instructions and cycles, my question was "Is there a way to do a half word add" ?
    This was mul1.spin (which had a typo i have now corrected and attached). There are so many
    unusual instructions in the P2 i just didn't know, as i don't know them all intimately yet.

    My second idea was when i realized the SCA could do the job if it had a symmetrical opposite
    (the WZ proposal)

    The current use of SCA is to do a fused multiply add, so as in chip's example
    to calculate (x * y) + a you would do SCA x,y then ADD a,a
    The SCA instruction writes its result to the hidden Q register, which is fed into the next instruction as the S.

    I had hoped that if someone actually wrote SCA a,b WZ, they wanted to test if the result of the multiply
    was zero and then would ignore the result, hence my proposed use of the WZ would not be a problem.

    However if my better understanding of the hidden Q register is correct, you cant ignore the result, you can
    only know that the result of the multiply was zero.

    Most machines that have a 32 bit register and a 16 bit multiplier have a way to address a half register (hi & lo)
    hence my question about the half word add. The P2 is an unusual cpu and is certainly fabulous for writing
    in assembler. Its a joy to use ! in my opinion.
  • In P2 asm that would be something like
    	getword	al,areg,#0
    	getword	bl,breg,#0
    	getword	ah,areg,#1
    	getword	bh,breg,#1
    	mov	x1,al		'al * bl
    	mul	x1,bl
    	mov	x2,ah		'ah * bl
    	mul	x2,bl
    	mov	x3,al		'al * bh
    	mul	x3,bh
    	mov	x4,ah		'ah * bh
    	mul	x4,bh
    	mov	temp,x2		'+ x2 << 16
    	shl	x2,#16
    	shr	temp,#16
    	add	x1,x2	wc
    	addx	x4,temp
    	mov	temp,x3		'+ x3 << 16
    	shl	x3,#16
    	shr	temp,#16
    	add	x1,x3	wc
    	addx	x4,temp		'result = x4:x1
    
    
    compared to cordic
    qmul areg,breg
    getqx x1
    getqy x4
    
    ~60 clocks compared to ~42 clocks
    Melbourne, Australia
  • I recommend people read the mul/mul1/mul2.spin files above.
    Formerly known as TonyB
  • TonyB_TonyB_ Posts: 1,200
    edited 2019-02-03 - 05:23:07
    markmpx wrote: »
    To save instructions and cycles, my question was "Is there a way to do a half word add" ?
    This was mul1.spin (which had a typo i have now corrected and attached). There are so many
    unusual instructions in the P2 i just didn't know, as i don't know them all intimately yet.

    My second idea was when i realized the SCA could do the job if it had a symmetrical opposite
    (the WZ proposal)

    The current use of SCA is to do a fused multiply add, so as in chip's example
    to calculate (x * y) + a you would do SCA x,y then ADD a,a
    The SCA instruction writes its result to the hidden Q register, which is fed into the next instruction as the S.

    I had hoped that if someone actually wrote SCA a,b WZ, they wanted to test if the result of the multiply
    was zero and then would ignore the result, hence my proposed use of the WZ would not be a problem.

    However if my better understanding of the hidden Q register is correct, you cant ignore the result, you can
    only know that the result of the multiply was zero.

    Most machines that have a 32 bit register and a 16 bit multiplier have a way to address a half register (hi & lo)
    hence my question about the half word add. The P2 is an unusual cpu and is certainly fabulous for writing
    in assembler. Its a joy to use ! in my opinion.

    Exactly how the SCA result gets into the next S is unimportant really to this discussion. If you only want to know whether the result is zero, the next instruction could dump it somewhere or ignore it by not using S at all.
    	sca	a,b	wz
    	mov	dump,0
    
    'or
    
    	sca	a,b	wz
    	nop
    
    Formerly known as TonyB
  • evanhevanh Posts: 6,964
    edited 2019-02-03 - 07:36:06
    By copying above I've just pieced together a 64x32 using 7 MULs, 7 ADDs, 6 shifts, 9 moves, 3 GETWORDs and a RET. It takes 72 clocks including the CALL/RET within a cog. Could shave 2 clocks with a _RET_ but I prefer to preserve flags.

    An inline cordic version uses 2 QMUL, 1 ADD and 3 GETQx. It takes 70-77 clocks depending on hub sync. The cordic version has additional opportunity to prep while waiting for result.

    So, very similar overall.
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • ersmithersmith Posts: 2,991
    edited 2019-02-03 - 11:11:12
    markmpx wrote: »
    I guess i will handle the issue by placing restriction on my oberon compiler for now, so that a 2 clock MUL
    instruction can always be used to calculate offsets from the base address of a structure for

    a[i,,j,k] := ?? etc

    Usually address lookups involve multiplies by a constant value known at compile time, which is why I suggested changing the multiplies to shift and adds -- this will provide a good result even without any restriction on the sizes of structures or arrays. Both gcc and fastspin do this already.

  • ersmith wrote: »
    markmpx wrote: »
    I guess i will handle the issue by placing restriction on my oberon compiler for now, so that a 2 clock MUL
    instruction can always be used to calculate offsets from the base address of a structure for

    a[i,,j,k] := ?? etc

    Usually address lookups involve multiplies by a constant value known at compile time, which is why I suggested changing the multiplies to shift and adds -- this will provide a good result even without any restriction on the sizes of structures or arrays. Both gcc and fastspin do this already.

    Thanks for the advice.

    You are indeed correct, and my example was perhaps not a good one. and its unlikely gcc would benefit.

    Are you aware of the object oriented features of oberon ? It is possible in oberon to pass an array of structures
    to a procedure where the size of the structure can only be determined at runtime.
  • I think this would work for 32(A)x32(B)=>64(resH:resL):
    getword tmp1, A, #1     ' Ahi
    mul     tmp1, B         ' * Blo
    getword resH, B, #1     ' Bhi
    mul     resH, A         ' * Alo
    add     resH, tmp1 wC   ' sum, track carry
    mov     tmp1, resH      ' low 16 bits of that in temp
    shl     tmp1, #16       ' are now at the top word
    rcr     resH, #1        ' keep the intermediate carry bit
    shr     resH, #15       ' and align with the lower 16 bits
    mov     resL, B         ' result Lo = B (lo)
    mul     resL, A         ' * A (lo)
    add     resL, tmp1 wC   ' add the low half of the intermediate result
    shr     A, #16          ' now Ahi
    shr     B, #16          ' and Bhi
    mul     A, B            ' multiplied
    addx    resH, A         ' and add to existing component
    
    16 instructions, and it feels like we could shave off 2 more if I could reuse the storing of the hi words from some intermediate calc. None of this is tested, sorry in advance for any mistakes.

    Jonathan
    Free time status: see my avatar [8^)
    F32 - fast & concise floating point: OBEX, Thread
    Unrelated to the prop: KISSlicer
  • TonyB_TonyB_ Posts: 1,200
    edited 2019-02-04 - 04:22:34
    I think 32-bit A*B -> 64-bit result can be done in 28 cycles with MUL and without SCA. I haven't looked at smaller sizes. The key is to add AL*BH to AH*BL. An extra long is needed to hold a constant. I won't post the code yet, it's quite simple to derive.
    Formerly known as TonyB
Sign In or Register to comment.