# Substantially faster/shorter multiply ?

markmpx
Posts:

**28**
Hi all,

Working on my Oberon compiler for the P2, i need a set of multiply subroutines

(or even better a set of templates that the compiler can use to generate inline multiply code).

I know that i can use the cordic option but i was hoping to find something substantially faster,

that can be used when you cant overlap the cordic operation with other code.

I took a routine posted by ersmith in the "Multiply, multiply, multiply" forum post as the basis

for these routines. (See mul.spin)

mul.spin provides:

32 x 32 bit multiply giving 64 bit result - 36 clock cycles

32 x 32 multiply 32 - 20 clock cycles

32 x 16 multiply 32 - 12 clock cycles

Also see mul1.spin which provides: (but needs a half register add instruction)

32 x 32 bit multiply giving 64 bit result - 24 clock cycles

32 x 32 multiply 32 - 16 clock cycles

32 x 16 multiply 32 - 8 clock cycles

I dont know the P2 instruction set in detail yet, are there existing instructions in the P2

which allow add and addx to only add a half register ? (matching the half size multiplier)

or are there other instructions that can achieve this ?

Optimizing the multiply routines make a big difference to code generated by the compiler

to support array of structures and multi dimensional arrays. My compiler already knows

how to use shifts for any multiply or divide by a power of 2.

Thanks in advance.

Working on my Oberon compiler for the P2, i need a set of multiply subroutines

(or even better a set of templates that the compiler can use to generate inline multiply code).

I know that i can use the cordic option but i was hoping to find something substantially faster,

that can be used when you cant overlap the cordic operation with other code.

I took a routine posted by ersmith in the "Multiply, multiply, multiply" forum post as the basis

for these routines. (See mul.spin)

mul.spin provides:

32 x 32 bit multiply giving 64 bit result - 36 clock cycles

32 x 32 multiply 32 - 20 clock cycles

32 x 16 multiply 32 - 12 clock cycles

Also see mul1.spin which provides: (but needs a half register add instruction)

32 x 32 bit multiply giving 64 bit result - 24 clock cycles

32 x 32 multiply 32 - 16 clock cycles

32 x 16 multiply 32 - 8 clock cycles

I dont know the P2 instruction set in detail yet, are there existing instructions in the P2

which allow add and addx to only add a half register ? (matching the half size multiplier)

or are there other instructions that can achieve this ?

Optimizing the multiply routines make a big difference to code generated by the compiler

to support array of structures and multi dimensional arrays. My compiler already knows

how to use shifts for any multiply or divide by a power of 2.

Thanks in advance.

## Comments

5,30423,297https://www.allaboutcircuits.com/technical-articles/an-introduction-to-canonical-signed-digit-representation/

That's what I used in my FIR2PASM filter code.

-Phil

28Yes that is often a big win.

The same can also be done for integer division by shift and subtract for division by small constants.

If your multiply is way faster then your divide (often the case) you can do integer division by a constant

using a multiply by a magic number. (see https://www.hackersdelight.org/magic.htm and

http://libdivide.com/). These are often used in x86 compilers.

It may also be possible to use the ENCOD instruction on both 32 bit operands of a multiply to

decide if a 2 clock mul or a 12 clock mul could be used. I havent figured out the overhead yet

to determine if its actually worthy the overhead.

28Thanks for the link.

28SCA currently does :- Next instruction's S value = unsigned (D[15:0] * S[15:0]) >> 16

Change SCA WZ so it does Next instruction's S value = unsigned (D[15:0] * S[15:0]) << 16

The Z bit can still be written as normal (assuming it does that now and its desirable)

32 x 32 multiply->64 bit result WAS 36 clocks NOW 28

32 x 32 multiply->32 bit result WAS 20 clocks NOW 16

32 x 16 multiply->32 bit result WAS 12 clocks NOW 10 (can be 8 if return result in a)

This also has the advantage that the a and b registers are no longer changed,

reducing the number of surrounding move instructions.

What do you think about this idea ?

I am NOT suggesting holding up the next spin to do this.

1,909Dvalue then 32x32 -> 64-bit could take 32 clocks and WZ would operate as now. The saving is less than markmpx's idea above but still a worthwhile 12.5% performance boost.12,93328I understand how the MULH and MULD would work, but i don't understand the MULQ you suggested.

Could you explain in a bit more detail please ? I am still a P2 newbie.

Thanks

12,93312,933PS: It's the S-port, not the D-port I listed above. So MULD is no good, and MULS is already taken. There has been some discussion about changing this type of instruction to feeding D instead of S but that hasn't happened.

2,3621,909Evan, have you tested this? SETQ, then SCA or XORO32, then instruction that uses SETQ.

1,909Mark,

Your idea to extend SCA is a clever one that would make 32x32 multiplies a lot faster. It's a pity the C flag opcode bit has been used already to distinguish SCA from SCAS. The issue with WZ is that it's either not available anymore (for high word) or must be used all the time (for low word) and it might be better to swap the WZ words. My understanding of your proposed change is as follows:

SCA = {16'b0, MUL[31:16]}

SCA WZ = {MUL[15:0],16'b0}

If the instruction were modified to be

SCA {WZ} = {MUL[15:0], MUL[31:16]}

then WZ could be used only when desired. The drawback is an extra instruction is required to zero the low or high word (if that is necessary for the latter). There would be a smaller time saving for 32x32 -> 64-bit than your solution but none at all for the two smaller multiplies you mention.

I can't see a way to zero the low word with any net time saving if the result goes to next instruction's S value, but it could be done using the D value and the two extra instructions for masking out the low word would increase the cycle count from 28 to 32, showing that my suggestion is not as optimal as yours.

28The Oberon compiler that i am currently using to host the P2 Oberon compiler is voc. You can find it on

github by searching for Vishaps/voc. This Oberon compiler can run on Windows/Linux and Mac.

28Thanks evan. I understand your post now.

12,933I've noticed that the SETQ operation only persists, excepting AUGx, until the next instruction anyway. So it would be extinguished by an inserted ADD even.

12,933- Idle

- SETQ operation

- ALTx operation

- Override next S input

28Thanks for your response.

I guess i will handle the issue by placing restriction on my oberon compiler for now, so that a 2 clock MUL

instruction can always be used to calculate offsets from the base address of a structure for

a[i,,j,k] := ?? etc

Improvements in this area will ultimately benefit any high level language compiler (Ie: gcc)

Hopefully, folks are a bit more aware of the issue, and maybe one day things will improve.

Of my 2 suggestions, the ability to support a word add to a long (mul1.spin in my first post) would result in the largest improvement, but a modified SCA comes close.

Incidentally, the original oberon compiler i am modifying, does produce code for a 3 operand instruction set.

(I mentioned this as you seem interested in a 3 operand format eventually).

18,059Do I have this right to start with ???

1,4691,909Yes and adding X2 and X3 are problematical.

28Ray,

Yes you have it correct.

Adding X2 and X3 will each require a SHL and a SHR as they are aligned as follows

where each item is a 16 bit quantity. Carry must be propagated between the add of the lo and hi longs.

X4 X4 X1 X1 +

00 X2 X2 00 +

00 X3 X3 00

To save instructions and cycles, my question was "Is there a way to do a half word add" ?

This was mul1.spin (which had a typo i have now corrected and attached). There are so many

unusual instructions in the P2 i just didn't know, as i don't know them all intimately yet.

My second idea was when i realized the SCA could do the job if it had a symmetrical opposite

(the WZ proposal)

The current use of SCA is to do a fused multiply add, so as in chip's example

to calculate (x * y) + a you would do SCA x,y then ADD a,a

The SCA instruction writes its result to the hidden Q register, which is fed into the next instruction as the S.

I had hoped that if someone actually wrote SCA a,b WZ, they wanted to test if the result of the multiply

was zero and then would ignore the result, hence my proposed use of the WZ would not be a problem.

However if my better understanding of the hidden Q register is correct, you cant ignore the result, you can

only know that the result of the multiply was zero.

Most machines that have a 32 bit register and a 16 bit multiplier have a way to address a half register (hi & lo)

hence my question about the half word add. The P2 is an unusual cpu and is certainly fabulous for writing

in assembler. Its a joy to use ! in my opinion.

2,7831,9091,909Exactly how the SCA result gets into the next S is unimportant really to this discussion. If you only want to know whether the result is zero, the next instruction could dump it somewhere or ignore it by not using S at all.

12,933An inline cordic version uses 2 QMUL, 1 ADD and 3 GETQx. It takes 70-77 clocks depending on hub sync. The cordic version has additional opportunity to prep while waiting for result.

So, very similar overall.

5,304Usually address lookups involve multiplies by a constant value known at compile time, which is why I suggested changing the multiplies to shift and adds -- this will provide a good result even without any restriction on the sizes of structures or arrays. Both gcc and fastspin do this already.

28Thanks for the advice.

You are indeed correct, and my example was perhaps not a good one. and its unlikely gcc would benefit.

Are you aware of the object oriented features of oberon ? It is possible in oberon to pass an array of structures

to a procedure where the size of the structure can only be determined at runtime.

917Jonathan

1,909