Substantially faster/shorter multiply ?

markmpx · 2019-02-04 11:22

Thanks your hint about adding the 2 intermediate cross products,
it shortens the code considerably, The only tricky bit is dealing with
any carry generated by this add.

My thanks to lonesock and TonyB_ for your further work.

evanh · 2019-02-04 13:15

evanh wrote: »

By copying above I've just pieced together a 64x32 using 7 MULs, 7 ADDs, 6 shifts, 9 moves, 3 GETWORDs and a RET. It takes 72 clocks including the CALL/RET within a cog. Could shave 2 clocks with a _RET_ but I prefer to preserve flags.

Well, I've made improvements with this one as well. I've knocked it down from 64 clocks to 52 clocks, ignoring the 8-clock CALL/RET overhead.

One saving was eliminating a shift with Tony's idea but the rest were just optimising away 5 of the 9 moves.

'===============================================
'  input:  a64, pb  (all remain intact)
' result:  r64
'scratch:  temp1, temp2, a1, a3, b1
'descrip:  Integer multiply 64-bit x 32-bit.  r64 = a64 x pb.  52 clocks excluding CALL/RET
'
mul64x32
		getword	a1, a64, #1
		getword	a3, a64h, #1
		getword	b1, pb, #1

		mov	temp1, a64h
		mul	temp1, pb		'a2 * b0
		mov	r64h, a1
		mul	r64h, b1		'6:  high, a1 * b1
		add	r64h, temp1		'3:  high

		mul	a3, pb			'a3 * b0
		mov	temp2, a64h
		mul	temp2, b1		'a2 * b1
		add	temp2, a3		'4:  high
		shl	temp2, #16
		add	r64h, temp2		'7:  high

		mov	r64, a64
		mul	r64, pb			'1:  low, a0 * b0

		mul	a1, pb			'a1 * b0
		mul	b1, a64			'a0 * b1
		getword	temp1, a1, #1
		getword	temp2, b1, #1
		shl	a1, #16
		shl	b1, #16
		add	r64, a1		wc	'2:  low
		addx	r64h, temp1		'2:  high
		add	r64, b1		wc	'5:  low
		addx	r64h, temp2		'5:  high

		ret			wcz

TonyB_ · 2019-02-04 14:13

markmpx wrote: »

Thanks your hint about adding the 2 intermediate cross products,
it shortens the code considerably, The only tricky bit is dealing with
any carry generated by this add.

My thanks to lonesock and TonyB_ for your further work.

Modifying SCA instruction is not necessary for 32x32 -> 64-bit.

' 32-bit * 32-bit -> 64-bit
' a * b = {x1, x0}
'
' 28 cycles maximum
' 26 cycles if a[31] = b[31] = 0

	getword	x2,a,#1  		'x2 = ah
	getword	x3,b,#1  		'x3 = bh
	mov	x0,a
	mul	x0,b			'x0 = al * bl
	mov	x1,x2
	mul	x1,x3			'x1 = ah * bh
	mul	x2,b			'x2 = ah * bl
	mul	x3,a			'x3 = bh * al
	add	x2,x3		wc	'x2 = ah * bl + bh * al
  if_c	add	x1,xc			'not needed if a[31] = b[31] = 0
	getword	x3,x2,#1
	shl	x2,#16
	add	x0,x2		wc	'x0 = result(low)
	addx	x1,x3			'x1 = result(high)

x0	res	1
x1	res	1
x2	res	1
x3	res	1
xc	long	$0001_0000		'not needed if a[31] = b[31] = 0

TonyB_ · 2019-02-04 14:58

a * b = {a, b} overwriting a and b takes 26 or 24 cycles.

EDIT:
Corrected cycles.

lonesock · 2019-02-05 00:12

Nice, TonyB_!

Here's one fewer instruction, temp variable, and constant:

	getword	x0,a,#1 
	getword	x1,b,#1 
	mov	x2, x0
	mul	x2, x1 
	mul	x0, b 
	mul	x1, a 
	add	x0, x1  wC 
	getword	x1,x0,#1 
	bitc	x1, #16 
	shl	x0,#16 
	mul  	a,  b 
	add	x0,a  wC 
	addx	x1,x2

I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

Jonathan

TonyB_ · 2019-02-05 02:10

lonesock wrote: »
Nice, TonyB_!

Here's one fewer instruction, temp variable, and constant:
	getword	x0,a,#1 
	getword	x1,b,#1 
	mov	x2, x0
	mul	x2, x1 
	mul	x0, b 
	mul	x1, a 
	add	x0, x1  wC 
	getword	x1,x0,#1 
	bitc	x1, #16 
	shl	x0,#16 
	mul  	a,  b 
	add	x0,a  wC 
	addx	x1,x2
I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

Jonathan

Jonathan,

Excellent! However the 'cost' of saving an instruction is that a is modified. bitc is much nicer than what I was doing and it's not needed if high bits of a and b are both zero. I doubt the code can be any smaller with the existing instruction set.

Summary
Preserve a and b: cycles = 28/26*, variables = 4
Preserve b: cycles = 26/24*, variables = 3
Overwrite a and b with result: cycles = 26/24*, variables = 3
* If a[31] = b[31] = 0

Here's the code using bitc that preserves a and b:

' 32-bit * 32-bit -> 64-bit
' a * b = {x1, x0}
'
' 28 cycles maximum
' 26 cycles if a[31] = b[31] = 0

	getword	x2,a,#1  		'x2 = ah
	getword	x3,b,#1  		'x3 = bh
	mov	x0,a
	mul	x0,b			'x0 = al * bl
	mov	x1,x2
	mul	x1,x3			'x1 = ah * bh
	mul	x2,b			'x2 = ah * bl
	mul	x3,a			'x3 = bh * al
	add	x2,x3		wc	'x2 = ah * bl + bh * al
	getword	x3,x2,#1
	bitc	x3,#16			'not needed if a[31] = b[31] = 0
	shl	x2,#16
	add	x0,x2		wc	'x0 = result(low)
	addx	x1,x3			'x1 = result(high)

x0	res	1
x1	res	1
x2	res	1
x3	res	1

Mark_T · 2019-02-05 13:21

Can you use the Z flag from some of the MUL's to short-cut some of the code? Perhaps by ordering
the arguments by size with FLE/FGE you can then only bother checking the smallest argument for being 16 bits or less?
Most multiplies have one or two small arguments so this may be worth doing, but you need a representative
test vector to measure against as the worst-case time goes up.

It would be nice if SKIPF could be used for multiplication, but I don't think its possible other than for multiply
by a constant in a rather contorted fashion.

Cluso99 · 2019-02-05 21:32

Perhaps for P3...

A new unsigned multiply instruction that works as follows

i[64:0] is an internal 65bit register

 MUL32 D,S                                'new 32*32 unsigned multiply instruction
'' MUL32 runs internally using the existing 16*16 MUL instruction as follows
 i :=   d[15:0]  * s[15:0]
 i += ( d[31:16] * s[15:0]  ) << 16
 i += ( d[15:0]  * s[31:16] ) << 16
 i += ( d[31:16] * s[31:16] ) << 32
'' now get the results
 GETL D {WC/WZ/WCZ}                       ' new instruction returns D=i[15:0],  C=i[64], Z=i[65:0]
                                          '                                -or- C=i[32], Z=i[32:0] ?????
 GETH D {WC/WZ/WCZ}                       ' new instruction returns D=i]63:32], C=i[64], Z=i[64:0]

It may be possible that this could run in 10 clocks, and does not necessarily destroy the input values D & S.

evanh · 2019-02-05 23:31

Cordic can already pull 8 clocks, with parallel execution to boot.

Mark_T · 2019-02-06 00:44

But more like 60 elapsed from start to finish, even worse for signed (which can only be pipelined
at a rate of 16+ clocks plus per multiply due to the sign correction instructions).

If you want to form a product of several values you need to chain multiplies in series or
a binary tree, and then the efficient way to use cordic is run logs in parallel, sum them, then
do exponential function on the sum, which is better than 3 or more chained multiplies at
the expense of being lossy.

There are lots of tradeoffs to be considered I feel.

Cluso99 · 2019-02-06 01:00

evanh wrote: »

Cordic can already pull 8 clocks, with parallel execution to boot.

Oh I wish

Mark_T · 2019-02-06 01:10

I can get 16 clocks per signed x unsigned multiply for IIR filter processing, and that's squeezing
every instruction pretty much.

I tried FIR filter with muls instruction, haven't tried to do that with cordic yet, but I don't think you'd
get 8 cycles per tap.

evanh · 2020-01-09 00:19

Here's link to original topic referenced in opening post - https://forums.parallax.com/discussion/169670/multiply-multiply-multiply/p1

evanh · 2020-01-09 00:23

TonyB_ wrote: »

evanh wrote: »

The setq description is my understanding of where SCA stashes its result, in the hidden Q register, that is subsequently forced into the ALU S-port for the following instruction. Q gets used in a few different ways.

Evan, have you tested this? SETQ, then SCA or XORO32, then instruction that uses SETQ.

Hmm, rereading this I see I was way off base.

The hidden Q register, filled by SETQ instruction, has proven to be independent of the other prefixing instructions like XORO32, SCA and all the ALTx instructions. Further reading at https://forums.parallax.com/discussion/comment/1465673/#Comment_1465673
and http://forums.parallax.com/discussion/comment/1479657/#Comment_1479657

EDIT: Err, Chip, has confirmed, in that second link, that XORO32, along with some others, is another exception and does in fact output to Q.

EDIT2: Other than SETQ, I'm not sure why some prefixing instructions do modify Q and some don't.

evanh · 2020-03-13 04:43

evanh wrote: »

EDIT2: Other than SETQ, I'm not sure why some prefixing instructions do modify Q and some don't.

In later thinking, I figured it was likely because those that don't use Q register, namely the ALT instructions, are modifying the subsequent instruction's bit fields. Whereas the Q users are redirecting data to an ALU input port instead. The ALU input ports are a whole stage later in the pipeline.

Cluso99 · 2020-03-13 07:11

Knowing the use of the Q register in hindsight another register or two would have been nice for the various uses instead of using the same Q.

Peter Jakacki · 2020-03-13 23:19

BTW, I tested this yesterday as hubexec and cogexec and also the cordic qmul version which is running from hub.

This has a slight overhead calling this as assembly and also replacing the stack parameters a,b with the result.

  TAQOZ# code MUL ' ( b a -- a*b. )
  071D8 F938_1622         getword x,a,#1
  071DC F938_1823         getword y,b,#1
  071E0 F600_1A0B         mov     z,x
  071E4 FA00_1A0C         mul     z,y
  071E8 FA00_1623         mul     x,b
  071EC FA00_1822         mul     y,a
  071F0 F110_160C         add     x,y  wc
  071F4 F938_180B         getword y,x,#1
  071F8 F444_1810         bitc    y, #16
  071FC F064_1610         shl     x,#16
  07200 FA00_4423         mul     a,b
  07204 F110_1622         add     x,a  wc
  07208 F160_180D         addx    y,z
  0720C F600_460B         mov     b,x
  07210 0600_440C   _ret_ mov     a,y
                          end ---  ok

Then make a copy in cog memory that can be called by COGMOD

TAQOZ# AT MUL 2+ 15 LOADMOD ---  ok

Then test MUL in hubexec mode, against UM* which uses qmul in hub, and then the cogexec version of MUL

TAQOZ# 12345678 1000000 LAP MUL LAP D. SPACE .LAP --- 12345678000000 128 cycles= 640ns @200MHz ok
  TAQOZ# 12345678 1000000 LAP UM* LAP D. SPACE .LAP --- 12345678000000 112 cycles= 560ns @200MHz ok
  TAQOZ# 12345678 1000000 LAP COGMOD LAP D. SPACE .LAP --- 12345678000000 64 cycles= 320ns @200MHz ok

So as long as I can spare the cog memory I could have a version that is twice as fast as the cordic qmul.

evanh · 2020-03-13 23:41

The first and last speed measures are the same code, right? It's straight line execution so hubexec wouldn't have much impact. Why the large speed difference?

EDIT: Huh, the last one should be only 36ish sysclocks.

Peter Jakacki · 2020-03-13 23:53

evanh wrote: »

The first and last speed measures are the same code, right? It's straight line execution so hubexec wouldn't have much impact. Why the large speed difference?

EDIT: Huh, the last one should be only 36ish sysclocks.

That had me scratching my head too. I ran the hubexec vs cogexec version again.

TAQOZ# 12345678 1000000 lap mul lap d. space .lap --- 12345678000000 120 cycles= 600ns @200MHz ok
TAQOZ# AT MUL 2+ 15 LOADMOD ---  ok
TAQOZ# 12345678 1000000 lap cogmod lap d. space .lap --- 12345678000000 56 cycles= 280ns @200MHz ok

For some reason it is running faster than the previous test. That and the hubexec vs cogexec speed, it's a mystery.

Arrggh, I should have known better. The hubexec mode is actually running in threaded wordcode space and has a "call asm" wordcode before it calls the hubexec. This is the overhead mucking up the results.

TAQOZ# code Q1
07214 FD64_002D         ret
 ---  ok         end
TAQOZ#  LAP Q1 LAP .LAP --- 96 cycles= 480ns @200MHz ok

All wordcode that is less than "end of compiled hubexec" which follows cog and lut code, is always called directly whereas a high level call and exit are needed for threaded code.

The dummy Q1 has a hidden asm call.

' execute assembly code following this ASM instruction '
_ASM			call	PTRA
			jmp	#EXIT

TAQOZ# ' Q1 $10 DUMP --- 
07212: 52 00 2D 00  64 FD 3C 01  12 72 10 F8  44 1A 4D 00     'R.-.d.<..r..D.M.' ok

cgracey · 2020-03-14 01:05

If you can arrange all of your CORDIC operations, including multiplies and divides, ahead of time, you can stuff them into the pipe every eight clocks and get the results out every eight clocks. This works really well for groups of computations.

Substantially faster/shorter multiply ?

Comments