
Substantially faster/shorter multiply ?


Comments


  • Thanks for your hint about adding the 2 intermediate cross products;
    it shortens the code considerably. The only tricky bit is dealing with
    any carry generated by this add (see the C sketch below).

    My thanks to lonesock and TonyB_ for your further work.
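
    A plain-C sketch of the decomposition being discussed (a reference model only, not P2 code; the
    function and variable names are mine): each operand is split into 16-bit halves, the two cross
    products are summed, and the carry out of that 32-bit sum belongs at bit 48 of the 64-bit result,
    i.e. bit 16 of the high word.

    #include <stdint.h>

    static void mul32x32_model(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
    {
        uint32_t al = a & 0xFFFF, ah = a >> 16;
        uint32_t bl = b & 0xFFFF, bh = b >> 16;

        uint32_t p_lo  = al * bl;               /* result bits  0..31           */
        uint32_t p_hi  = ah * bh;               /* result bits 32..63           */
        uint32_t cross = ah * bl;               /* cross products, bits 16..47  */
        uint32_t p1    = al * bh;
        uint32_t carry = 0;

        cross += p1;                            /* 32-bit add may wrap ...      */
        if (cross < p1)                         /* ... the wrap is the carry;   */
            carry = 1;                          /* impossible if a[31]=b[31]=0  */

        uint64_t r = (uint64_t)p_lo
                   + ((uint64_t)cross << 16)
                   + ((uint64_t)carry << 48)    /* +$0001_0000 in the high word */
                   + ((uint64_t)p_hi  << 32);
        *lo = (uint32_t)r;
        *hi = (uint32_t)(r >> 32);
    }

    The carry term is the same correction that the code further down applies, either by adding
    $0001_0000 to the high word or, later, with bitc.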


  • evanh Posts: 15,126
    edited 2019-02-04 13:34
    evanh wrote: »
    By copying above I've just pieced together a 64x32 using 7 MULs, 7 ADDs, 6 shifts, 9 moves, 3 GETWORDs and a RET. It takes 72 clocks including the CALL/RET within a cog. Could shave 2 clocks with a _RET_ but I prefer to preserve flags.
    Well, I've made improvements with this one as well. I've knocked it down from 64 clocks to 52 clocks, ignoring the 8-clock CALL/RET overhead.

    One saving came from eliminating a shift with Tony's idea; the rest came from optimising away 5 of the 9 moves. A plain-C reference model follows the listing below.
    '===============================================
    '  input:  a64, pb  (all remain intact)
    ' result:  r64
    'scratch:  temp1, temp2, a1, a3, b1
    'descrip:  Integer multiply 64-bit x 32-bit.  r64 = a64 x pb.  52 clocks excluding CALL/RET
    '
    mul64x32
    		getword	a1, a64, #1
    		getword	a3, a64h, #1
    		getword	b1, pb, #1
    
    		mov	temp1, a64h
    		mul	temp1, pb		'a2 * b0
    		mov	r64h, a1
    		mul	r64h, b1		'6:  high, a1 * b1
    		add	r64h, temp1		'3:  high
    
    		mul	a3, pb			'a3 * b0
    		mov	temp2, a64h
    		mul	temp2, b1		'a2 * b1
    		add	temp2, a3		'4:  high
    		shl	temp2, #16
    		add	r64h, temp2		'7:  high
    
    		mov	r64, a64
    		mul	r64, pb			'1:  low, a0 * b0
    
    		mul	a1, pb			'a1 * b0
    		mul	b1, a64			'a0 * b1
    		getword	temp1, a1, #1
    		getword	temp2, b1, #1
    		shl	a1, #16
    		shl	b1, #16
    		add	r64, a1		wc	'2:  low
    		addx	r64h, temp1		'2:  high
    		add	r64, b1		wc	'5:  low
    		addx	r64h, temp2		'5:  high
    
    		ret			wcz
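
    For cross-checking, here's a plain-C model of the same truncated 64x32 multiply (not P2 code;
    it just mirrors the limb arithmetic, and the name mul64x32_model is mine). The a3*b1 product
    would start at bit 64 and is dropped, which is why 7 MULs suffice, and only the low 16 bits of
    the two products at bit 48 survive.

    #include <stdint.h>

    static uint64_t mul64x32_model(uint64_t a64, uint32_t pb)
    {
        uint32_t a0 = (uint32_t)a64 & 0xFFFF;
        uint32_t a1 = (uint32_t)(a64 >> 16) & 0xFFFF;
        uint32_t a2 = (uint32_t)(a64 >> 32) & 0xFFFF;
        uint32_t a3 = (uint32_t)(a64 >> 48);
        uint32_t b0 = pb & 0xFFFF;
        uint32_t b1 = pb >> 16;

        uint64_t r = (uint64_t)(a0 * b0);       /* bits  0..31                    */
        r += (uint64_t)(a1 * b0) << 16;         /* bits 16..47                    */
        r += (uint64_t)(a0 * b1) << 16;         /* bits 16..47                    */
        r += (uint64_t)(a2 * b0) << 32;         /* bits 32..63                    */
        r += (uint64_t)(a1 * b1) << 32;         /* bits 32..63                    */
        r += (uint64_t)(a3 * b0) << 48;         /* only the low 16 bits survive   */
        r += (uint64_t)(a2 * b1) << 48;         /* only the low 16 bits survive   */
        return r;                               /* wraps modulo 2^64, as r64 does */
    }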
    
    
  • TonyB_ Posts: 2,108
    edited 2019-02-04 14:22
    markmpx wrote: »
    Thanks for your hint about adding the 2 intermediate cross products;
    it shortens the code considerably. The only tricky bit is dealing with
    any carry generated by this add.

    My thanks to lonesock and TonyB_ for your further work.

    Modifying the SCA instruction is not necessary for 32x32 -> 64-bit.
    ' 32-bit * 32-bit -> 64-bit
    ' a * b = {x1, x0}
    '
    ' 28 cycles maximum
    ' 26 cycles if a[31] = b[31] = 0
    
    	getword	x2,a,#1  		'x2 = ah
    	getword	x3,b,#1  		'x3 = bh
    	mov	x0,a
    	mul	x0,b			'x0 = al * bl
    	mov	x1,x2
    	mul	x1,x3			'x1 = ah * bh
    	mul	x2,b			'x2 = ah * bl
    	mul	x3,a			'x3 = bh * al
    	add	x2,x3		wc	'x2 = ah * bl + bh * al
      if_c	add	x1,xc			'not needed if a[31] = b[31] = 0
    	getword	x3,x2,#1
    	shl	x2,#16
    	add	x0,x2		wc	'x0 = result(low)
    	addx	x1,x3			'x1 = result(high)
    
    x0	res	1
    x1	res	1
    x2	res	1
    x3	res	1
    xc	long	$0001_0000		'not needed if a[31] = b[31] = 0
    
  • TonyB_ Posts: 2,108
    edited 2019-02-04 19:43
    Overwriting a and b with the result, i.e. a * b = {a, b}, takes 26 or 24 cycles.

    EDIT:
    Corrected cycles.
  • Nice, TonyB_!

    Here's a version with one fewer instruction, temp variable, and constant:
    	getword	x0,a,#1 
    	getword	x1,b,#1 
    	mov	x2, x0
    	mul	x2, x1 
    	mul	x0, b 
    	mul	x1, a 
    	add	x0, x1  wC 
    	getword	x1,x0,#1 
    	bitc	x1, #16 
    	shl	x0,#16 
    	mul  	a,  b 
    	add	x0,a  wC 
    	addx	x1,x2
    
    I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

    Jonathan
  • TonyB_ Posts: 2,108
    edited 2019-02-05 02:13
    lonesock wrote: »
    Nice, TonyB_!

    Here's a version with one fewer instruction, temp variable, and constant:
    	getword	x0,a,#1 
    	getword	x1,b,#1 
    	mov	x2, x0
    	mul	x2, x1 
    	mul	x0, b 
    	mul	x1, a 
    	add	x0, x1  wC 
    	getword	x1,x0,#1 
    	bitc	x1, #16 
    	shl	x0,#16 
    	mul  	a,  b 
    	add	x0,a  wC 
    	addx	x1,x2
    
    I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

    Jonathan

    Jonathan,

    Excellent! However, the 'cost' of saving an instruction is that a is modified. bitc is much nicer than what I was doing (a short C note on why it works follows the listing below), and it's not needed if the high bits of a and b are both zero. I doubt the code can be made any smaller with the existing instruction set.

    Summary
    Preserve a and b: cycles = 28/26*, variables = 4
    Preserve b: cycles = 26/24*, variables = 3
    Overwrite a and b with result: cycles = 26/24*, variables = 3
    * If a[31] = b[31] = 0

    Here's the code using bitc that preserves a and b:
    ' 32-bit * 32-bit -> 64-bit
    ' a * b = {x1, x0}
    '
    ' 28 cycles maximum
    ' 26 cycles if a[31] = b[31] = 0
    
    	getword	x2,a,#1  		'x2 = ah
    	getword	x3,b,#1  		'x3 = bh
    	mov	x0,a
    	mul	x0,b			'x0 = al * bl
    	mov	x1,x2
    	mul	x1,x3			'x1 = ah * bh
    	mul	x2,b			'x2 = ah * bl
    	mul	x3,a			'x3 = bh * al
    	add	x2,x3		wc	'x2 = ah * bl + bh * al
    	getword	x3,x2,#1
    	bitc	x3,#16			'not needed if a[31] = b[31] = 0
    	shl	x2,#16
    	add	x0,x2		wc	'x0 = result(low)
    	addx	x1,x3			'x1 = result(high)
    
    x0	res	1
    x1	res	1
    x2	res	1
    x3	res	1
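
    A small plain-C note on the bitc step (a model only, not P2 code; split_cross is my name for it):
    the 33-bit cross sum is added into the result at bit 16, so its low 16 bits shift up into the low
    word while bits 16..32, with the C flag from "add x2,x3 wc" as bit 32, become the high-word addend.
    bitc simply drops that C into bit 16.

    #include <stdint.h>

    /* cross_lo = x2 after "add x2,x3 wc"; c = 0 or 1, the carry from that add */
    static void split_cross(uint32_t cross_lo, uint32_t c,
                            uint32_t *lo_add, uint32_t *hi_add)
    {
        *lo_add = cross_lo << 16;                       /* shl     x2,#16   */
        *hi_add = (cross_lo >> 16) | (c << 16);         /* getword x3,x2,#1 */
                                                        /* bitc    x3,#16   */
    }

    The final "add x0,x2 wc" / "addx x1,x3" pair then accumulates these two addends into the 64-bit
    result, propagating the low-word carry into the high word.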
    
  • Can you use the Z flag from some of the MULs to short-cut some of the code? Perhaps by ordering
    the arguments by size with FLE/FGE you could then check only the smallest argument for being 16 bits or less?
    Most multiplies have one or two small arguments, so this may be worth doing, but you need a representative
    test vector to measure against, as the worst-case time goes up. (A rough C sketch of the idea follows this post.)

    It would be nice if SKIPF could be used for multiplication, but I don't think it's possible other than for multiplying
    by a constant in a rather contorted fashion.
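
    Here's a rough plain-C sketch of that dispatch idea (not P2 code, untimed, and the function name
    and 16-bit threshold test are just mine): order the operands, and if the smaller one fits in 16
    bits its high-half partial products vanish, leaving two 16x16 multiplies.

    #include <stdint.h>

    static void mul32x32_small(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
    {
        if (a > b) { uint32_t t = a; a = b; b = t; }    /* order the arguments, cf. FLE/FGE */

        uint64_t r;
        if (a <= 0xFFFF) {
            /* ah = 0: only two partial products are needed */
            r = (uint64_t)(a * (b & 0xFFFF)) + ((uint64_t)(a * (b >> 16)) << 16);
        } else {
            /* both operands wide: fall back to the full four-product routine above */
            r = (uint64_t)a * b;
        }
        *lo = (uint32_t)r;
        *hi = (uint32_t)(r >> 32);
    }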
  • Cluso99 Posts: 18,066
    Perhaps for P3...

    A new unsigned multiply instruction that works as follows:
    i[64:0] is an internal 65-bit register
    
     MUL32 D,S                                'new 32*32 unsigned multiply instruction
    '' MUL32 runs internally using the existing 16*16 MUL instruction as follows
     i :=   d[15:0]  * s[15:0]
     i += ( d[31:16] * s[15:0]  ) << 16
     i += ( d[15:0]  * s[31:16] ) << 16
     i += ( d[31:16] * s[31:16] ) << 32
    '' now get the results
     GETL D {WC/WZ/WCZ}                       ' new instruction returns D=i[31:0],  C=i[64], Z=i[64:0]
                                              '                                -or- C=i[32], Z=i[32:0] ?????
     GETH D {WC/WZ/WCZ}                       ' new instruction returns D=i[63:32], C=i[64], Z=i[64:0]
    

    It may be possible that this could run in 10 clocks, and it would not necessarily destroy the input values D and S. (A plain-C model of the proposed semantics is sketched below.)
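
    A plain-C model of the proposed MUL32/GETL/GETH semantics (hypothetical instructions, not part of
    the existing P2; flag outputs are omitted): a uint64_t is enough for the model because a single
    32x32 product never reaches bit 64.

    #include <stdint.h>

    static uint64_t mul32_i;                            /* stands in for i[64:0]   */

    static void MUL32(uint32_t d, uint32_t s)           /* four 16x16 partial sums */
    {
        mul32_i  = (uint64_t)((d & 0xFFFF) * (s & 0xFFFF));
        mul32_i += (uint64_t)((d >> 16)    * (s & 0xFFFF)) << 16;
        mul32_i += (uint64_t)((d & 0xFFFF) * (s >> 16))    << 16;
        mul32_i += (uint64_t)((d >> 16)    * (s >> 16))    << 32;
    }

    static uint32_t GETL(void) { return (uint32_t)mul32_i; }          /* i[31:0]  */
    static uint32_t GETH(void) { return (uint32_t)(mul32_i >> 32); }  /* i[63:32] */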
  • evanh Posts: 15,126
    Cordic can already pull 8 clocks, with parallel execution to boot.
  • But it's more like 60 clocks elapsed from start to finish, and even worse for signed (which can only be pipelined
    at a rate of 16+ clocks per multiply due to the sign-correction instructions).

    If you want to form a product of several values you need to chain multiplies in series or in
    a binary tree, and then the efficient way to use the cordic is to run the logs in parallel, sum them, then
    do the exponential function on the sum, which is better than 3 or more chained multiplies at
    the expense of being lossy (a conceptual C sketch follows this post).

    There are lots of trade-offs to be considered, I feel.
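
    For illustration only, a floating-point C sketch of that log/sum/exp idea (conceptual, not P2
    cordic code, and lossy exactly as described): the product of several factors becomes one
    exponential of a sum of logs, so the logs can be issued in parallel.

    #include <math.h>
    #include <stdio.h>

    static double product_via_logs(const double *x, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += log2(x[i]);                  /* these logs could run in parallel */
        return exp2(sum);                       /* one exponential of the sum       */
    }

    int main(void)
    {
        double v[3] = { 3.0, 5.0, 7.0 };
        printf("%.6f\n", product_via_logs(v, 3));   /* ~105, within rounding */
        return 0;
    }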
  • Cluso99 Posts: 18,066
    evanh wrote: »
    Cordic can already pull 8 clocks, with parallel execution to boot.

    Oh I wish :(
  • I can get 16 clocks per signed x unsigned multiply for IIR filter processing, and that's pretty much squeezing
    every instruction.

    I tried an FIR filter with the muls instruction; I haven't tried to do that with the cordic yet, but I don't think you'd
    get 8 cycles per tap.
  • evanh Posts: 15,126
    Here's a link to the original topic referenced in the opening post: https://forums.parallax.com/discussion/169670/multiply-multiply-multiply/p1
  • evanh Posts: 15,126
    edited 2020-01-09 01:00
    TonyB_ wrote: »
    evanh wrote: »
    The setq description is my understanding of where SCA stashes its result, in the hidden Q register, that is subsequently forced into the ALU S-port for the following instruction. Q gets used in a few different ways.
    Evan, have you tested this? SETQ, then SCA or XORO32, then instruction that uses SETQ.
    Hmm, rereading this I see I was way off base.

    The hidden Q register, filled by the SETQ instruction, has proven to be independent of the other prefixing instructions like XORO32, SCA and all the ALTx instructions. Further reading at https://forums.parallax.com/discussion/comment/1465673/#Comment_1465673
    and http://forums.parallax.com/discussion/comment/1479657/#Comment_1479657

    EDIT: Err, Chip has confirmed, in that second link, that XORO32, along with some others, is an exception and does in fact output to Q.

    EDIT2: Other than SETQ, I'm not sure why some prefixing instructions do modify Q and some don't.

  • evanh Posts: 15,126
    edited 2020-03-13 04:46
    evanh wrote: »
    EDIT2: Other than SETQ, I'm not sure why some prefixing instructions do modify Q and some don't.
    In later thinking, I figured it was likely because those that don't use the Q register, namely the ALT instructions, are modifying the subsequent instruction's bit fields, whereas the Q users are redirecting data to an ALU input port instead. The ALU input ports are a whole stage later in the pipeline.

  • Cluso99 Posts: 18,066
    Knowing the uses of the Q register now, in hindsight another register or two would have been nice for the various purposes instead of sharing the same Q.
  • Peter Jakacki Posts: 10,193
    edited 2020-03-13 23:19
    BTW, I tested this yesterday as hubexec and cogexec, and also against the cordic qmul version, which runs from hub.

    There is a slight overhead in calling this as assembly and also in replacing the stack parameters a,b with the result.
      TAQOZ# code MUL ' ( b a -- a*b. )
      071D8 F938_1622         getword x,a,#1
      071DC F938_1823         getword y,b,#1
      071E0 F600_1A0B         mov     z,x
      071E4 FA00_1A0C         mul     z,y
      071E8 FA00_1623         mul     x,b
      071EC FA00_1822         mul     y,a
      071F0 F110_160C         add     x,y  wc
      071F4 F938_180B         getword y,x,#1
      071F8 F444_1810         bitc    y, #16
      071FC F064_1610         shl     x,#16
      07200 FA00_4423         mul     a,b
      07204 F110_1622         add     x,a  wc
      07208 F160_180D         addx    y,z
      0720C F600_460B         mov     b,x
      07210 0600_440C   _ret_ mov     a,y
                              end ---  ok
    

    Then make a copy in cog memory that can be called by COGMOD
    TAQOZ# AT MUL 2+ 15 LOADMOD ---  ok
    

    Then test MUL in hubexec mode, against UM* which uses qmul in hub, and then the cogexec version of MUL
    TAQOZ# 12345678 1000000 LAP MUL LAP D. SPACE .LAP --- 12345678000000 128 cycles= 640ns @200MHz ok
      TAQOZ# 12345678 1000000 LAP UM* LAP D. SPACE .LAP --- 12345678000000 112 cycles= 560ns @200MHz ok
      TAQOZ# 12345678 1000000 LAP COGMOD LAP D. SPACE .LAP --- 12345678000000 64 cycles= 320ns @200MHz ok
    

    So as long as I can spare the cog memory I could have a version that is twice as fast as the cordic qmul.

  • evanh Posts: 15,126
    edited 2020-03-13 23:44
    The first and last speed measures are the same code, right? It's straight line execution so hubexec wouldn't have much impact. Why the large speed difference?

    EDIT: Huh, the last one should be only 36ish sysclocks.
  • Peter Jakacki Posts: 10,193
    edited 2020-03-14 00:05
    evanh wrote: »
    The first and last speed measures are the same code, right? It's straight line execution so hubexec wouldn't have much impact. Why the large speed difference?

    EDIT: Huh, the last one should be only 36ish sysclocks.

    That had me scratching my head too. I ran the hubexec vs cogexec version again.
    TAQOZ# 12345678 1000000 lap mul lap d. space .lap --- 12345678000000 120 cycles= 600ns @200MHz ok
    TAQOZ# AT MUL 2+ 15 LOADMOD ---  ok
    TAQOZ# 12345678 1000000 lap cogmod lap d. space .lap --- 12345678000000 56 cycles= 280ns @200MHz ok
    
    For some reason it is running faster than the previous test. That, plus the hubexec vs cogexec difference, is a mystery.


    Arrggh, I should have known better. The hubexec mode is actually running in threaded wordcode space and has a "call asm" wordcode before it calls the hubexec code. That is the overhead mucking up the results.
    TAQOZ# code Q1
    07214 FD64_002D         ret
                            end ---  ok
    TAQOZ#  LAP Q1 LAP .LAP --- 96 cycles= 480ns @200MHz ok
    

    All wordcode at addresses less than "end of compiled hubexec", which follows the cog and lut code, is always called directly, whereas a high-level call and exit are needed for threaded code.

    The dummy Q1 has a hidden asm call.
    ' execute assembly code following this ASM instruction '
    _ASM			call	PTRA
    			jmp	#EXIT
    
    TAQOZ# ' Q1 $10 DUMP --- 
    07212: 52 00 2D 00  64 FD 3C 01  12 72 10 F8  44 1A 4D 00     'R.-.d.<..r..D.M.' ok
    
  • cgracey Posts: 14,133
    If you can arrange all of your CORDIC operations, including multiplies and divides, ahead of time, you can stuff them into the pipe every eight clocks and get the results out every eight clocks. This works really well for groups of computations.