Thanks your hint about adding the 2 intermediate cross products,
it shortens the code considerably, The only tricky bit is dealing with
any carry generated by this add.

My thanks to lonesock and TonyB_ for your further work.

By copying above I've just pieced together a 64x32 using 7 MULs, 7 ADDs, 6 shifts, 9 moves, 3 GETWORDs and a RET. It takes 72 clocks including the CALL/RET within a cog. Could shave 2 clocks with a _RET_ but I prefer to preserve flags.

Well, I've made improvements with this one as well. I've knocked it down from 64 clocks to 52 clocks, ignoring the 8-clock CALL/RET overhead.

One saving was eliminating a shift with Tony's idea but the rest were just optimising away 5 of the 9 moves.

Thanks your hint about adding the 2 intermediate cross products,
it shortens the code considerably, The only tricky bit is dealing with
any carry generated by this add.

My thanks to lonesock and TonyB_ for your further work.

Modifying SCA instruction is not necessary for 32x32 -> 64-bit.

' 32-bit * 32-bit -> 64-bit
' a * b = {x1, x0}
'
' 28 cycles maximum
' 26 cycles if a[31] = b[31] = 0
getword x2,a,#1 'x2 = ah
getword x3,b,#1 'x3 = bh
mov x0,a
mul x0,b 'x0 = al * bl
mov x1,x2
mul x1,x3 'x1 = ah * bh
mul x2,b 'x2 = ah * bl
mul x3,a 'x3 = bh * al
add x2,x3 wc 'x2 = ah * bl + bh * al
if_c add x1,xc 'not needed if a[31] = b[31] = 0
getword x3,x2,#1
shl x2,#16
add x0,x2 wc 'x0 = result(low)
addx x1,x3 'x1 = result(high)
x0 res 1
x1 res 1
x2 res 1
x3 res 1
xc long $0001_0000 'not needed if a[31] = b[31] = 0

Here's one fewer instruction, temp variable, and constant:

getword x0,a,#1
getword x1,b,#1
mov x2, x0
mul x2, x1
mul x0, b
mul x1, a
add x0, x1 wC
getword x1,x0,#1
bitc x1, #16
shl x0,#16
mul a, b
add x0,a wC
addx x1,x2

I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

Jonathan

Free time status: see my avatar [8^)
F32 - fast & concise floating point: OBEX, Thread
Unrelated to the prop: KISSlicer

Here's one fewer instruction, temp variable, and constant:

getword x0,a,#1
getword x1,b,#1
mov x2, x0
mul x2, x1
mul x0, b
mul x1, a
add x0, x1 wC
getword x1,x0,#1
bitc x1, #16
shl x0,#16
mul a, b
add x0,a wC
addx x1,x2

I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

Jonathan

Jonathan,

Excellent! However the 'cost' of saving an instruction is that a is modified. bitc is much nicer than what I was doing and it's not needed if high bits of a and b are both zero. I doubt the code can be any smaller with the existing instruction set.

Summary
Preserve a and b: cycles = 28/26*, variables = 4
Preserve b: cycles = 26/24*, variables = 3
Overwrite a and b with result: cycles = 26/24*, variables = 3
* If a[31] = b[31] = 0

Here's the code using bitc that preserves a and b:

' 32-bit * 32-bit -> 64-bit
' a * b = {x1, x0}
'
' 28 cycles maximum
' 26 cycles if a[31] = b[31] = 0
getword x2,a,#1 'x2 = ah
getword x3,b,#1 'x3 = bh
mov x0,a
mul x0,b 'x0 = al * bl
mov x1,x2
mul x1,x3 'x1 = ah * bh
mul x2,b 'x2 = ah * bl
mul x3,a 'x3 = bh * al
add x2,x3 wc 'x2 = ah * bl + bh * al
getword x3,x2,#1
bitc x3,#16 'not needed if a[31] = b[31] = 0
shl x2,#16
add x0,x2 wc 'x0 = result(low)
addx x1,x3 'x1 = result(high)
x0 res 1
x1 res 1
x2 res 1
x3 res 1

Can you use the Z flag from some of the MUL's to short-cut some of the code? Perhaps by ordering
the arguments by size with FLE/FGE you can then only bother checking the smallest argument for being 16 bits or less?
Most multiplies have one or two small arguments so this may be worth doing, but you need a representative
test vector to measure against as the worst-case time goes up.

It would be nice if SKIPF could be used for multiplication, but I don't think its possible other than for multiply
by a constant in a rather contorted fashion.

But more like 60 elapsed from start to finish, even worse for signed (which can only be pipelined
at a rate of 16+ clocks plus per multiply due to the sign correction instructions).

If you want to form a product of several values you need to chain multiplies in series or
a binary tree, and then the efficient way to use cordic is run logs in parallel, sum them, then
do exponential function on the sum, which is better than 3 or more chained multiplies at
the expense of being lossy.

There are lots of tradeoffs to be considered I feel.

## Comments

28Thanks your hint about adding the 2 intermediate cross products,

it shortens the code considerably, The only tricky bit is dealing with

any carry generated by this add.

My thanks to lonesock and TonyB_ for your further work.

6,964One saving was eliminating a shift with Tony's idea but the rest were just optimising away 5 of the 9 moves.

hidden in the rings that we can't see,

the rings are almost pure ice."

1,200Modifying SCA instruction is not necessary for 32x32 -> 64-bit.

1,20026or24cycles.EDIT:Corrected cycles.

908Here's one fewer instruction, temp variable, and constant: I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

Jonathan

F32 - fast & concise floating point: OBEX, Thread

Unrelated to the prop: KISSlicer

1,200Jonathan,

Excellent! However the 'cost' of saving an instruction is that a is modified. bitc is much nicer than what I was doing and it's not needed if high bits of a and b are both zero. I doubt the code can be any smaller with the existing instruction set.

SummaryPreserve a and b: cycles = 28/26*, variables = 4

Preserve b: cycles = 26/24*, variables = 3

Overwrite a and b with result: cycles = 26/24*, variables = 3

* If a[31] = b[31] = 0

Here's the code using bitc that preserves a and b:

1,975the arguments by size with FLE/FGE you can then only bother checking the smallest argument for being 16 bits or less?

Most multiplies have one or two small arguments so this may be worth doing, but you need a representative

test vector to measure against as the worst-case time goes up.

It would be nice if SKIPF could be used for multiplication, but I don't think its possible other than for multiply

by a constant in a rather contorted fashion.

14,911A new unsigned multiply instruction that works as follows

It may be possible that this could run in 10 clocks, and does not necessarily destroy the input values D & S.

My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)6,964hidden in the rings that we can't see,

the rings are almost pure ice."

1,975at a rate of 16+ clocks plus per multiply due to the sign correction instructions).

If you want to form a product of several values you need to chain multiplies in series or

a binary tree, and then the efficient way to use cordic is run logs in parallel, sum them, then

do exponential function on the sum, which is better than 3 or more chained multiplies at

the expense of being lossy.

There are lots of tradeoffs to be considered I feel.

14,911Oh I wish

My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)1,975every instruction pretty much.

I tried FIR filter with muls instruction, haven't tried to do that with cordic yet, but I don't think you'd

get 8 cycles per tap.