Substantially faster/shorter multiply ?

markmpx · 2019-01-30 22:16

Hi all,

Working on my Oberon compiler for the P2, i need a set of multiply subroutines
(or even better a set of templates that the compiler can use to generate inline multiply code).

I know that i can use the cordic option but i was hoping to find something substantially faster,
that can be used when you cant overlap the cordic operation with other code.

I took a routine posted by ersmith in the "Multiply, multiply, multiply" forum post as the basis
for these routines. (See mul.spin)

mul.spin provides:

32 x 32 bit multiply giving 64 bit result - 36 clock cycles
32 x 32 multiply 32 - 20 clock cycles
32 x 16 multiply 32 - 12 clock cycles

Also see mul1.spin which provides: (but needs a half register add instruction)

32 x 32 bit multiply giving 64 bit result - 24 clock cycles
32 x 32 multiply 32 - 16 clock cycles
32 x 16 multiply 32 - 8 clock cycles

I dont know the P2 instruction set in detail yet, are there existing instructions in the P2
which allow add and addx to only add a half register ? (matching the half size multiplier)
or are there other instructions that can achieve this ?

Optimizing the multiply routines make a big difference to code generated by the compiler
to support array of structures and multi dimensional arrays. My compiler already knows
how to use shifts for any multiply or divide by a power of 2.

Thanks in advance.

ersmith · 2019-01-30 22:28

markmpx wrote: »

Optimizing the multiply routines make a big difference to code generated by the compiler
to support array of structures and multi dimensional arrays. My compiler already knows
how to use shifts for any multiply or divide by a power of 2.

Note that any multiply by a constant can be converted to shifts and adds/subtracts, and in some cases this will be a big win. For example:

int times10(int x) { return x * 10; }

can be compiled as:

_times10
        mov     result1, arg01
        shl     result1, #2
        add     result1, arg01
        shl     result1, #1
_times10_ret
        reta

Phil Pilgrim (PhiPi) · 2019-01-30 23:31

You can cut the number of shifts and adds even further for constant multiplication by using Canonical Signed Digit notation:

https://www.allaboutcircuits.com/technical-articles/an-introduction-to-canonical-signed-digit-representation/

That's what I used in my FIR2PASM filter code.

-Phil

markmpx · 2019-01-31 00:34

ersmith wrote: »
markmpx wrote: »

Optimizing the multiply routines make a big difference to code generated by the compiler
to support array of structures and multi dimensional arrays. My compiler already knows
how to use shifts for any multiply or divide by a power of 2.

Note that any multiply by a constant can be converted to shifts and adds/subtracts, and in some cases this will be a big win. For example:
int times10(int x) { return x * 10; }
can be compiled as:
_times10
        mov     result1, arg01
        shl     result1, #2
        add     result1, arg01
        shl     result1, #1
_times10_ret
        reta

Yes that is often a big win.

The same can also be done for integer division by shift and subtract for division by small constants.

If your multiply is way faster then your divide (often the case) you can do integer division by a constant
using a multiply by a magic number. (see https://www.hackersdelight.org/magic.htm and
http://libdivide.com/). These are often used in x86 compilers.

It may also be possible to use the ENCOD instruction on both 32 bit operands of a multiply to
decide if a 2 clock mul or a 12 clock mul could be used. I havent figured out the overhead yet
to determine if its actually worthy the overhead.

markmpx · 2019-01-31 00:36

Phil Pilgrim (PhiPi) wrote: »

You can cut the number of shifts and adds even further for constant multiplication by using Canonical Signed Digit notation:

https://www.allaboutcircuits.com/technical-articles/an-introduction-to-canonical-signed-digit-representation/

That's what I used in my FIR2PASM filter code.

-Phil

Thanks for the link.

markmpx · 2019-02-01 07:58

Here is another way of speeding up multiply:

SCA currently does :- Next instruction's S value = unsigned (D[15:0] * S[15:0]) >> 16

Change SCA WZ so it does Next instruction's S value = unsigned (D[15:0] * S[15:0]) << 16

The Z bit can still be written as normal (assuming it does that now and its desirable)

32 x 32 multiply->64 bit result WAS 36 clocks NOW 28
32 x 32 multiply->32 bit result WAS 20 clocks NOW 16
32 x 16 multiply->32 bit result WAS 12 clocks NOW 10 (can be 8 if return result in a)

This also has the advantage that the a and b registers are no longer changed,
reducing the number of surrounding move instructions.

What do you think about this idea ?

I am NOT suggesting holding up the next spin to do this.

TonyB_ · 2019-02-01 20:42

Having to use WZ in this way is not ideal. If SCA rotated the MUL result by 16 bits and put it in the next instruction's D value then 32x32 -> 64-bit could take 32 clocks and WZ would operate as now. The saving is less than markmpx's idea above but still a worthwhile 12.5% performance boost.

evanh · 2019-02-02 04:36

markmpx wrote: »

What do you think about this idea ?

It's a cool new instruction, but just that. Say MULH for high 16 bits, or MULD for D port insertion, or MULQ for implicit setq on next instruction.

markmpx · 2019-02-02 07:30

evanh wrote: »

markmpx wrote: »

What do you think about this idea ?

It's a cool new instruction, but just that. Say MULH for high 16 bits, or MULD for D port insertion, or MULQ for implicit setq on next instruction.

I understand how the MULH and MULD would work, but i don't understand the MULQ you suggested.

Could you explain in a bit more detail please ? I am still a P2 newbie.

Thanks

evanh · 2019-02-02 07:39

Oh, those are just suggested mnemonics for your one new instruction. They're all saying the same thing but with different terminology just as a selection for preferred spelling of the mnemonic.

evanh · 2019-02-02 10:50

The setq description is my understanding of where SCA stashes its result, in the hidden Q register, that is subsequently forced into the ALU S-port for the following instruction. Q gets used in a few different ways.

PS: It's the S-port, not the D-port I listed above. So MULD is no good, and MULS is already taken. There has been some discussion about changing this type of instruction to feeding D instead of S but that hasn't happened.

Ale · 2019-02-02 11:32

I do not know which compiler are you using to test but I used the AyaCompiler to do the development of the Oberon compiler. I had to update several of the original files to suit the AyaCompiler and windows, that worked well. Sadly I did not progress beyond that

.

TonyB_ · 2019-02-02 11:59

evanh wrote: »

The setq description is my understanding of where SCA stashes its result, in the hidden Q register, that is subsequently forced into the ALU S-port for the following instruction. Q gets used in a few different ways.

Evan, have you tested this? SETQ, then SCA or XORO32, then instruction that uses SETQ.

TonyB_ · 2019-02-02 12:39

markmpx wrote: »

Here is another way of speeding up multiply:

SCA currently does :- Next instruction's S value = unsigned (D[15:0] * S[15:0]) >> 16

Change SCA WZ so it does Next instruction's S value = unsigned (D[15:0] * S[15:0]) << 16

The Z bit can still be written as normal (assuming it does that now and its desirable)

Mark,

Your idea to extend SCA is a clever one that would make 32x32 multiplies a lot faster. It's a pity the C flag opcode bit has been used already to distinguish SCA from SCAS. The issue with WZ is that it's either not available anymore (for high word) or must be used all the time (for low word) and it might be better to swap the WZ words. My understanding of your proposed change is as follows:

SCA = {16'b0, MUL[31:16]}
SCA WZ = {MUL[15:0],16'b0}

If the instruction were modified to be

SCA {WZ} = {MUL[15:0], MUL[31:16]}

then WZ could be used only when desired. The drawback is an extra instruction is required to zero the low or high word (if that is necessary for the latter). There would be a smaller time saving for 32x32 -> 64-bit than your solution but none at all for the two smaller multiplies you mention.

I can't see a way to zero the low word with any net time saving if the result goes to next instruction's S value, but it could be done using the D value and the two extra instructions for masking out the low word would increase the cycle count from 28 to 32, showing that my suggestion is not as optimal as yours.

markmpx · 2019-02-02 20:10

Ale wrote: »

I do not know which compiler are you using to test but I used the AyaCompiler to do the development of the Oberon compiler. I had to update several of the original files to suit the AyaCompiler and windows, that worked well. Sadly I did not progress beyond that .

The Oberon compiler that i am currently using to host the P2 Oberon compiler is voc. You can find it on
github by searching for Vishaps/voc. This Oberon compiler can run on Windows/Linux and Mac.

markmpx · 2019-02-02 20:17

evanh wrote: »

The setq description is my understanding of where SCA stashes its result, in the hidden Q register, that is subsequently forced into the ALU S-port for the following instruction. Q gets used in a few different ways.

PS: It's the S-port, not the D-port I listed above. So MULD is no good, and MULS is already taken. There has been some discussion about changing this type of instruction to feeding D instead of S but that hasn't happened.

Thanks evan. I understand your post now.

evanh · 2019-02-02 20:49

TonyB_ wrote: »

Evan, have you tested this? SETQ, then SCA or XORO32, then instruction that uses SETQ.

I've noticed that the SETQ operation only persists, excepting AUGx, until the next instruction anyway. So it would be extinguished by an inserted ADD even.

evanh · 2019-02-02 20:59

PS: There must be at least two additional state bits associated with the Q register. Off the top of my head:
- Idle
- SETQ operation
- ALTx operation
- Override next S input

markmpx · 2019-02-02 21:12

TonyB_,

Thanks for your response.

I guess i will handle the issue by placing restriction on my oberon compiler for now, so that a 2 clock MUL
instruction can always be used to calculate offsets from the base address of a structure for

a[i,,j,k] := ?? etc

Improvements in this area will ultimately benefit any high level language compiler (Ie: gcc)

Hopefully, folks are a bit more aware of the issue, and maybe one day things will improve.

Of my 2 suggestions, the ability to support a word add to a long (mul1.spin in my first post) would result in the largest improvement, but a modified SCA comes close.

Incidentally, the original oberon compiler i am modifying, does produce code for a 3 operand instruction set.
(I mentioned this as you seem interested in a 3 operand format eventually).

Cluso99 · 2019-02-03 03:05

Thinking aloud here...

You have 2 x 32bit numbers you wish to multiply to give a 64bit answer...
AH(16bits), AL(16bits)
BH(16bits), BL(16bits)
result..
RH(32bits), RL(32bits)

So to do this

X1(32bits) = AL * BL
X2(32bits) = AH * BL (not required if AH=0)
X3(32bits) = AL * BH (not required if BH=0)
X4(32bits) = AH * BH (not required if AH=0 or BH=0)

then align and add 

R(64bits) = 
 X1 +
 X2 << 16 +
 X3 << 16 +
 X4 << 32

Do I have this right to start with ???

Electrodude · 2019-02-03 04:12

You should be able to do a 64x64 multiply with only three 32x32 multiplications using something like Karatsuba multiplication.

TonyB_ · 2019-02-03 04:18

Cluso99 wrote: »

Do I have this right to start with ???

Yes and adding X2 and X3 are problematical.

markmpx · 2019-02-03 04:32

Cluso99 wrote: »

Thinking aloud here...

You have 2 x 32bit numbers you wish to multiply to give a 64bit answer...
AH(16bits), AL(16bits)
BH(16bits), BL(16bits)
result..
RH(32bits), RL(32bits)

So to do this

X1(32bits) = AL * BL
X2(32bits) = AH * BL (not required if AH=0)
X3(32bits) = AL * BH (not required if BH=0)
X4(32bits) = AH * BH (not required if AH=0 or BH=0)

then align and add 

R(64bits) = 
 X1 +
 X2 << 16 +
 X3 << 16 +
 X4 << 32

Do I have this right to start with ???

Ray,

Yes you have it correct.

Adding X2 and X3 will each require a SHL and a SHR as they are aligned as follows
where each item is a 16 bit quantity. Carry must be propagated between the add of the lo and hi longs.

X4 X4 X1 X1 +
00 X2 X2 00 +
00 X3 X3 00

To save instructions and cycles, my question was "Is there a way to do a half word add" ?
This was mul1.spin (which had a typo i have now corrected and attached). There are so many
unusual instructions in the P2 i just didn't know, as i don't know them all intimately yet.

My second idea was when i realized the SCA could do the job if it had a symmetrical opposite
(the WZ proposal)

The current use of SCA is to do a fused multiply add, so as in chip's example
to calculate (x * y) + a you would do SCA x,y then ADD a,a
The SCA instruction writes its result to the hidden Q register, which is fed into the next instruction as the S.

I had hoped that if someone actually wrote SCA a,b WZ, they wanted to test if the result of the multiply
was zero and then would ignore the result, hence my proposed use of the WZ would not be a problem.

However if my better understanding of the hidden Q register is correct, you cant ignore the result, you can
only know that the result of the multiply was zero.

Most machines that have a 32 bit register and a 16 bit multiplier have a way to address a half register (hi & lo)
hence my question about the half word add. The P2 is an unusual cpu and is certainly fabulous for writing
in assembler. Its a joy to use ! in my opinion.

ozpropdev · 2019-02-03 04:35

In P2 asm that would be something like

	getword	al,areg,#0
	getword	bl,breg,#0
	getword	ah,areg,#1
	getword	bh,breg,#1
	mov	x1,al		'al * bl
	mul	x1,bl
	mov	x2,ah		'ah * bl
	mul	x2,bl
	mov	x3,al		'al * bh
	mul	x3,bh
	mov	x4,ah		'ah * bh
	mul	x4,bh
	mov	temp,x2		'+ x2 << 16
	shl	x2,#16
	shr	temp,#16
	add	x1,x2	wc
	addx	x4,temp
	mov	temp,x3		'+ x3 << 16
	shl	x3,#16
	shr	temp,#16
	add	x1,x3	wc
	addx	x4,temp		'result = x4:x1

compared to cordic

qmul areg,breg
getqx x1
getqy x4

~60 clocks compared to ~42 clocks

TonyB_ · 2019-02-03 05:08

I recommend people read the mul/mul1/mul2.spin files above.

TonyB_ · 2019-02-03 05:22

markmpx wrote: »

To save instructions and cycles, my question was "Is there a way to do a half word add" ?
This was mul1.spin (which had a typo i have now corrected and attached). There are so many
unusual instructions in the P2 i just didn't know, as i don't know them all intimately yet.

My second idea was when i realized the SCA could do the job if it had a symmetrical opposite
(the WZ proposal)

The current use of SCA is to do a fused multiply add, so as in chip's example
to calculate (x * y) + a you would do SCA x,y then ADD a,a
The SCA instruction writes its result to the hidden Q register, which is fed into the next instruction as the S.

I had hoped that if someone actually wrote SCA a,b WZ, they wanted to test if the result of the multiply
was zero and then would ignore the result, hence my proposed use of the WZ would not be a problem.

However if my better understanding of the hidden Q register is correct, you cant ignore the result, you can
only know that the result of the multiply was zero.

Most machines that have a 32 bit register and a 16 bit multiplier have a way to address a half register (hi & lo)
hence my question about the half word add. The P2 is an unusual cpu and is certainly fabulous for writing
in assembler. Its a joy to use ! in my opinion.

Exactly how the SCA result gets into the next S is unimportant really to this discussion. If you only want to know whether the result is zero, the next instruction could dump it somewhere or ignore it by not using S at all.

	sca	a,b	wz
	mov	dump,0

'or

	sca	a,b	wz
	nop

evanh · 2019-02-03 07:12

By copying above I've just pieced together a 64x32 using 7 MULs, 7 ADDs, 6 shifts, 9 moves, 3 GETWORDs and a RET. It takes 72 clocks including the CALL/RET within a cog. Could shave 2 clocks with a _RET_ but I prefer to preserve flags.

An inline cordic version uses 2 QMUL, 1 ADD and 3 GETQx. It takes 70-77 clocks depending on hub sync. The cordic version has additional opportunity to prep while waiting for result.

So, very similar overall.

ersmith · 2019-02-03 11:10

markmpx wrote: »

I guess i will handle the issue by placing restriction on my oberon compiler for now, so that a 2 clock MUL
instruction can always be used to calculate offsets from the base address of a structure for

a[i,,j,k] := ?? etc

Usually address lookups involve multiplies by a constant value known at compile time, which is why I suggested changing the multiplies to shift and adds -- this will provide a good result even without any restriction on the sizes of structures or arrays. Both gcc and fastspin do this already.

markmpx · 2019-02-03 14:13

ersmith wrote: »

markmpx wrote: »

I guess i will handle the issue by placing restriction on my oberon compiler for now, so that a 2 clock MUL
instruction can always be used to calculate offsets from the base address of a structure for

a[i,,j,k] := ?? etc

Usually address lookups involve multiplies by a constant value known at compile time, which is why I suggested changing the multiplies to shift and adds -- this will provide a good result even without any restriction on the sizes of structures or arrays. Both gcc and fastspin do this already.

Thanks for the advice.

You are indeed correct, and my example was perhaps not a good one. and its unlikely gcc would benefit.

Are you aware of the object oriented features of oberon ? It is possible in oberon to pass an array of structures
to a procedure where the size of the structure can only be determined at runtime.

lonesock · 2019-02-04 03:51

I think this would work for 32(A)x32(B)=>64(resH:resL):

getword tmp1, A, #1     ' Ahi
mul     tmp1, B         ' * Blo
getword resH, B, #1     ' Bhi
mul     resH, A         ' * Alo
add     resH, tmp1 wC   ' sum, track carry
mov     tmp1, resH      ' low 16 bits of that in temp
shl     tmp1, #16       ' are now at the top word
rcr     resH, #1        ' keep the intermediate carry bit
shr     resH, #15       ' and align with the lower 16 bits
mov     resL, B         ' result Lo = B (lo)
mul     resL, A         ' * A (lo)
add     resL, tmp1 wC   ' add the low half of the intermediate result
shr     A, #16          ' now Ahi
shr     B, #16          ' and Bhi
mul     A, B            ' multiplied
addx    resH, A         ' and add to existing component

16 instructions, and it feels like we could shave off 2 more if I could reuse the storing of the hi words from some intermediate calc. None of this is tested, sorry in advance for any mistakes.

Jonathan

TonyB_ · 2019-02-04 03:57

I think 32-bit A*B -> 64-bit result can be done in 28 cycles with MUL and without SCA. I haven't looked at smaller sizes. The key is to add AL*BH to AH*BL. An extra long is needed to hold a constant. I won't post the code yet, it's quite simple to derive.

Substantially faster/shorter multiply ?

Comments