Thanks for your hint about adding the 2 intermediate cross products;
it shortens the code considerably. The only tricky bit is dealing with
any carry generated by this add.
My thanks to lonesock and TonyB_ for your further work.
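For reference, here is a minimal C model of the halfword decomposition being discussed (the function name and layout are just for illustration, not the PASM itself). It shows where the extra $0001_0000 comes from: a carry out of the 32-bit cross-product sum is worth 2^48, i.e. bit 16 of the high result long.

#include <stdint.h>

/* a * b -> {hi, lo}; same halfword decomposition as the PASM versions below */
static void mul32x32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;

    uint32_t low   = al * bl;             /* low partial product            */
    uint32_t high  = ah * bh;             /* goes straight to the top long  */
    uint32_t cross = ah * bl;
    uint32_t t     = bh * al;

    cross += t;
    if (cross < t)                        /* carry out of the cross add...  */
        high += 0x00010000;               /* ...is 2^48 = $0001_0000 in hi  */

    high += cross >> 16;                  /* upper 16 bits of the cross sum */
    t = cross << 16;                      /* lower 16 bits, shifted up      */

    low += t;
    if (low < t)                          /* carry into the high long       */
        high++;

    *hi = high;
    *lo = low;
}

The BITC variant that appears later does the same job by setting bit 16 of the high cross word directly from C.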
By copying the above I've just pieced together a 64x32 using 7 MULs, 7 ADDs, 6 shifts, 9 moves, 3 GETWORDs and a RET. It takes 72 clocks, including the CALL/RET, within a cog. I could shave 2 clocks with a _RET_ but I prefer to preserve the flags.
Well, I've made improvements with this one as well. I've knocked it down from 64 clocks to 52 clocks, ignoring the 8-clock CALL/RET overhead.
One saving was eliminating a shift with Tony's idea but the rest were just optimising away 5 of the 9 moves.
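The 64x32 source isn't shown here, but assuming the result is truncated to the low 64 bits, seven 16x16 partial products are enough, since the eighth lands entirely above bit 63, which would account for the 7-MUL count. A C sketch of that decomposition, purely as a cross-check on the arithmetic (the function name is illustrative, and the real routine may keep more of the result or order the work differently):

#include <stdint.h>

/* a (64-bit) * b (32-bit) -> low 64 bits of the product, from 16x16 MULs.
   Limbs: a = a3:a2:a1:a0, b = b1:b0 (16 bits each); a3*b1 is omitted
   because it contributes only above bit 63.  Assumed shape only. */
static uint64_t mul64x32_lo(uint64_t a, uint32_t b)
{
    uint32_t a0 = (uint32_t)a & 0xFFFF, a1 = (uint32_t)(a >> 16) & 0xFFFF;
    uint32_t a2 = (uint32_t)(a >> 32) & 0xFFFF, a3 = (uint32_t)(a >> 48);
    uint32_t b0 = b & 0xFFFF, b1 = b >> 16;

    uint64_t r = a0 * b0;                         /* << 0  */
    r += (uint64_t)(a1 * b0) << 16;               /* << 16 */
    r += (uint64_t)(a0 * b1) << 16;
    r += (uint64_t)(a2 * b0) << 32;               /* << 32 */
    r += (uint64_t)(a1 * b1) << 32;
    r += (uint64_t)(a3 * b0) << 48;               /* << 48 */
    r += (uint64_t)(a2 * b1) << 48;
    return r;
}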
Modifying the SCA instruction is not necessary for 32x32 -> 64-bit.
' 32-bit * 32-bit -> 64-bit
' a * b = {x1, x0}
'
' 28 cycles maximum
' 26 cycles if a[31] = b[31] = 0
getword x2,a,#1 'x2 = ah
getword x3,b,#1 'x3 = bh
mov x0,a
mul x0,b 'x0 = al * bl
mov x1,x2
mul x1,x3 'x1 = ah * bh
mul x2,b 'x2 = ah * bl
mul x3,a 'x3 = bh * al
add x2,x3 wc 'x2 = ah * bl + bh * al
if_c add x1,xc 'not needed if a[31] = b[31] = 0
getword x3,x2,#1
shl x2,#16
add x0,x2 wc 'x0 = result(low)
addx x1,x3 'x1 = result(high)
x0 res 1
x1 res 1
x2 res 1
x3 res 1
xc long $0001_0000 'not needed if a[31] = b[31] = 0
Here's one fewer instruction, temp variable, and constant:
getword x0,a,#1 'x0 = ah
getword x1,b,#1 'x1 = bh
mov x2,x0
mul x2,x1 'x2 = ah * bh
mul x0,b 'x0 = ah * bl
mul x1,a 'x1 = bh * al
add x0,x1 wc 'x0 = ah * bl + bh * al
getword x1,x0,#1 'x1 = top word of cross sum
bitc x1,#16 'fold carry in at bit 16 (= 2^48)
shl x0,#16 'low word of cross sum, shifted up
mul a,b 'a = al * bl (a is overwritten)
add x0,a wc 'x0 = result(low)
addx x1,x2 'x1 = result(high)
I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)
Jonathan
Jonathan,
Excellent! However, the 'cost' of saving an instruction is that a is modified. bitc is much nicer than what I was doing, and it's not needed if the high bits of a and b are both zero. I doubt the code can be any smaller with the existing instruction set.
Summary
Preserve a and b: cycles = 28/26*, variables = 4
Preserve b: cycles = 26/24*, variables = 3
Overwrite a and b with result: cycles = 26/24*, variables = 3
* If a[31] = b[31] = 0
Here's the code using bitc that preserves a and b:
' 32-bit * 32-bit -> 64-bit
' a * b = {x1, x0}
'
' 28 cycles maximum
' 26 cycles if a[31] = b[31] = 0
getword x2,a,#1 'x2 = ah
getword x3,b,#1 'x3 = bh
mov x0,a
mul x0,b 'x0 = al * bl
mov x1,x2
mul x1,x3 'x1 = ah * bh
mul x2,b 'x2 = ah * bl
mul x3,a 'x3 = bh * al
add x2,x3 wc 'x2 = ah * bl + bh * al
getword x3,x2,#1
bitc x3,#16 'not needed if a[31] = b[31] = 0
shl x2,#16
add x0,x2 wc 'x0 = result(low)
addx x1,x3 'x1 = result(high)
x0 res 1
x1 res 1
x2 res 1
x3 res 1
Can you use the Z flag from some of the MULs to short-cut some of the code? Perhaps by ordering
the arguments by size with FLE/FGE you could then only bother checking the smallest argument for being 16 bits or less?
Most multiplies have one or two small arguments, so this may be worth doing, but you need a representative
test vector to measure against, as the worst-case time goes up.
It would be nice if SKIPF could be used for multiplication, but I don't think it's possible other than for multiplying
by a constant, in a rather contorted fashion.
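A C sketch of the kind of shortcut being suggested, with illustrative names (the fallback path uses a 64-bit temporary just to keep the sketch short): put the smaller operand second, as FLE/FGE would, and when it fits in 16 bits the bh terms vanish, so two of the MULs and the carry handling can be skipped. The extra compare/swap is what pushes the worst case up.

#include <stdint.h>

static void mul32x32_shortcut(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    if (b > a) { uint32_t s = a; a = b; b = s; }   /* smaller operand in b (FLE/FGE's job) */

    if (b <= 0xFFFF) {                             /* common case: one small argument */
        uint32_t al = a & 0xFFFF, ah = a >> 16;
        uint32_t low   = al * b;
        uint32_t cross = ah * b;                   /* only two 16x16 MULs needed */
        uint32_t t     = cross << 16;

        *lo = low + t;
        *hi = (cross >> 16) + (*lo < t);           /* carry into the high long */
        return;
    }

    /* otherwise fall back to the full four-MUL split shown earlier */
    uint64_t p = (uint64_t)a * b;
    *lo = (uint32_t)p;
    *hi = (uint32_t)(p >> 32);
}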
But a CORDIC multiply is more like 60 clocks elapsed from start to finish, and even worse for signed (which can only be pipelined
at a rate of 16+ clocks per multiply due to the sign-correction instructions).
If you want to form a product of several values you need to chain multiplies in series or in
a binary tree, and then the efficient way to use the CORDIC is to run the logs in parallel, sum them, then
do the exponential function on the sum, which is better than 3 or more chained multiplies at
the expense of being lossy.
There are lots of tradeoffs to be considered, I feel.
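The log route is just the identity a*b*c = exp(ln a + ln b + ln c). A throwaway C illustration of the structure and of the rounding loss, using double-precision log/exp as a stand-in for the CORDIC's fixed-point log and exp operations:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 12345.0, b = 678.0, c = 91.0;

    double chained = a * b * c;                      /* three chained multiplies    */
    double vialog  = exp(log(a) + log(b) + log(c));  /* logs can run in parallel,
                                                        then one sum and one exp    */
    printf("chained = %.6f\n", chained);
    printf("via log = %.6f (lossy)\n", vialog);
    return 0;
}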
The SETQ description reflects my understanding of where SCA stashes its result: in the hidden Q register, which is subsequently forced onto the ALU S-port for the following instruction. Q gets used in a few different ways.
Evan, have you tested this? SETQ, then SCA or XORO32, then an instruction that uses the SETQ value.
EDIT2: Other than SETQ, I'm not sure why some prefixing instructions do modify Q and some don't.
In later thinking, I figured it was likely because those that don't use the Q register, namely the ALT instructions, are modifying the subsequent instruction's bit fields, whereas the Q users are redirecting data to an ALU input port instead. The ALU input ports are a whole stage later in the pipeline.
BTW, I tested this yesterday as hubexec and cogexec, and also against the cordic qmul version which runs from hub.
There is a slight overhead in calling this as assembly and also in replacing the stack parameters a,b with the result.
TAQOZ# code MUL ' ( b a -- a*b. )
071D8 F938_1622 getword x,a,#1
071DC F938_1823 getword y,b,#1
071E0 F600_1A0B mov z,x
071E4 FA00_1A0C mul z,y
071E8 FA00_1623 mul x,b
071EC FA00_1822 mul y,a
071F0 F110_160C add x,y wc
071F4 F938_180B getword y,x,#1
071F8 F444_1810 bitc y, #16
071FC F064_1610 shl x,#16
07200 FA00_4423 mul a,b
07204 F110_1622 add x,a wc
07208 F160_180D addx y,z
0720C F600_460B mov b,x
07210 0600_440C _ret_ mov a,y
end --- ok
Then make a copy in cog memory that can be called by COGMOD
TAQOZ# AT MUL 2+ 15 LOADMOD --- ok
Then test MUL in hubexec mode, against UM* which uses qmul in hub, and then the cogexec version of MUL
TAQOZ# 12345678 1000000 LAP MUL LAP D. SPACE .LAP --- 12345678000000 128 cycles= 640ns @200MHz ok
TAQOZ# 12345678 1000000 LAP UM* LAP D. SPACE .LAP --- 12345678000000 112 cycles= 560ns @200MHz ok
TAQOZ# 12345678 1000000 LAP COGMOD LAP D. SPACE .LAP --- 12345678000000 64 cycles= 320ns @200MHz ok
So as long as I can spare the cog memory I could have a version that is twice as fast as the cordic qmul.
The first and last speed measures are the same code, right? It's straight line execution so hubexec wouldn't have much impact. Why the large speed difference?
EDIT: Huh, the last one should be only 36ish sysclocks.
That had me scratching my head too. I ran the hubexec vs cogexec version again.
TAQOZ# 12345678 1000000 lap mul lap d. space .lap --- 12345678000000 120 cycles= 600ns @200MHz ok
TAQOZ# AT MUL 2+ 15 LOADMOD --- ok
TAQOZ# 12345678 1000000 lap cogmod lap d. space .lap --- 12345678000000 56 cycles= 280ns @200MHz ok
For some reason it is running faster than the previous test. That, and the hubexec vs cogexec speed difference, is a mystery.
Arrggh, I should have known better. The hubexec mode is actually running in threaded wordcode space and has a "call asm" wordcode before it calls the hubexec. This is the overhead mucking up the results.
TAQOZ# code Q1
07214 FD64_002D ret
--- ok end
TAQOZ# LAP Q1 LAP .LAP --- 96 cycles= 480ns @200MHz ok
All wordcode at addresses below "end of compiled hubexec", which follows the cog and LUT code, is always called directly, whereas a high-level call and exit are needed for threaded code.
The dummy Q1 has a hidden asm call.
' execute assembly code following this ASM instruction '
_ASM call PTRA
jmp #EXIT
If you can arrange all of your CORDIC operations, including multiplies and divides, ahead of time, you can stuff them into the pipe every eight clocks and get the results out every eight clocks. This works really well for groups of computations.
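A rough cycle model of why batching pays off, using the figures quoted in this thread (a result every 8 clocks once the pipe is full, roughly 60 clocks from start to finish for a single operation); the exact latency used below is an assumption:

#include <stdio.h>

static unsigned serial_clocks(unsigned n, unsigned latency)
{
    return n * latency;                 /* wait for each result before issuing the next */
}

static unsigned pipelined_clocks(unsigned n, unsigned latency, unsigned issue)
{
    return latency + (n - 1) * issue;   /* overlap: one new result every 'issue' clocks */
}

int main(void)
{
    unsigned n = 8, latency = 60, issue = 8;   /* assumed/approximate figures */
    printf("serial   : %u clocks\n", serial_clocks(n, latency));
    printf("pipelined: %u clocks\n", pipelined_clocks(n, latency, issue));
    return 0;
}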