Substantially faster/shorter multiply ?

2»

Comments


  • Thanks your hint about adding the 2 intermediate cross products,
    it shortens the code considerably, The only tricky bit is dealing with
    any carry generated by this add.

    My thanks to lonesock and TonyB_ for your further work.


  • evanhevanh Posts: 6,964
    edited 2019-02-04 - 13:34:15
    evanh wrote: »
    By copying above I've just pieced together a 64x32 using 7 MULs, 7 ADDs, 6 shifts, 9 moves, 3 GETWORDs and a RET. It takes 72 clocks including the CALL/RET within a cog. Could shave 2 clocks with a _RET_ but I prefer to preserve flags.
    Well, I've made improvements with this one as well. I've knocked it down from 64 clocks to 52 clocks, ignoring the 8-clock CALL/RET overhead.

    One saving was eliminating a shift with Tony's idea but the rest were just optimising away 5 of the 9 moves.
    '===============================================
    '  input:  a64, pb  (all remain intact)
    ' result:  r64
    'scratch:  temp1, temp2, a1, a3, b1
    'descrip:  Integer multiply 64-bit x 32-bit.  r64 = a64 x pb.  52 clocks excluding CALL/RET
    '
    mul64x32
    		getword	a1, a64, #1
    		getword	a3, a64h, #1
    		getword	b1, pb, #1
    
    		mov	temp1, a64h
    		mul	temp1, pb		'a2 * b0
    		mov	r64h, a1
    		mul	r64h, b1		'6:  high, a1 * b1
    		add	r64h, temp1		'3:  high
    
    		mul	a3, pb			'a3 * b0
    		mov	temp2, a64h
    		mul	temp2, b1		'a2 * b1
    		add	temp2, a3		'4:  high
    		shl	temp2, #16
    		add	r64h, temp2		'7:  high
    
    		mov	r64, a64
    		mul	r64, pb			'1:  low, a0 * b0
    
    		mul	a1, pb			'a1 * b0
    		mul	b1, a64			'a0 * b1
    		getword	temp1, a1, #1
    		getword	temp2, b1, #1
    		shl	a1, #16
    		shl	b1, #16
    		add	r64, a1		wc	'2:  low
    		addx	r64h, temp1		'2:  high
    		add	r64, b1		wc	'5:  low
    		addx	r64h, temp2		'5:  high
    
    		ret			wcz
    
    
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • TonyB_TonyB_ Posts: 1,200
    edited 2019-02-04 - 14:22:31
    markmpx wrote: »
    Thanks your hint about adding the 2 intermediate cross products,
    it shortens the code considerably, The only tricky bit is dealing with
    any carry generated by this add.

    My thanks to lonesock and TonyB_ for your further work.

    Modifying SCA instruction is not necessary for 32x32 -> 64-bit.
    ' 32-bit * 32-bit -> 64-bit
    ' a * b = {x1, x0}
    '
    ' 28 cycles maximum
    ' 26 cycles if a[31] = b[31] = 0
    
    	getword	x2,a,#1  		'x2 = ah
    	getword	x3,b,#1  		'x3 = bh
    	mov	x0,a
    	mul	x0,b			'x0 = al * bl
    	mov	x1,x2
    	mul	x1,x3			'x1 = ah * bh
    	mul	x2,b			'x2 = ah * bl
    	mul	x3,a			'x3 = bh * al
    	add	x2,x3		wc	'x2 = ah * bl + bh * al
      if_c	add	x1,xc			'not needed if a[31] = b[31] = 0
    	getword	x3,x2,#1
    	shl	x2,#16
    	add	x0,x2		wc	'x0 = result(low)
    	addx	x1,x3			'x1 = result(high)
    
    x0	res	1
    x1	res	1
    x2	res	1
    x3	res	1
    xc	long	$0001_0000		'not needed if a[31] = b[31] = 0
    
    Formerly known as TonyB
  • TonyB_TonyB_ Posts: 1,200
    edited 2019-02-04 - 19:43:20
    a * b = {a, b} overwriting a and b takes 26 or 24 cycles.

    EDIT:
    Corrected cycles.
    Formerly known as TonyB
  • Nice, TonyB_!

    Here's one fewer instruction, temp variable, and constant:
    	getword	x0,a,#1 
    	getword	x1,b,#1 
    	mov	x2, x0
    	mul	x2, x1 
    	mul	x0, b 
    	mul	x1, a 
    	add	x0, x1  wC 
    	getword	x1,x0,#1 
    	bitc	x1, #16 
    	shl	x0,#16 
    	mul  	a,  b 
    	add	x0,a  wC 
    	addx	x1,x2
    
    I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

    Jonathan
    Free time status: see my avatar [8^)
    F32 - fast & concise floating point: OBEX, Thread
    Unrelated to the prop: KISSlicer
  • TonyB_TonyB_ Posts: 1,200
    edited 2019-02-05 - 02:13:02
    lonesock wrote: »
    Nice, TonyB_!

    Here's one fewer instruction, temp variable, and constant:
    	getword	x0,a,#1 
    	getword	x1,b,#1 
    	mov	x2, x0
    	mul	x2, x1 
    	mul	x0, b 
    	mul	x1, a 
    	add	x0, x1  wC 
    	getword	x1,x0,#1 
    	bitc	x1, #16 
    	shl	x0,#16 
    	mul  	a,  b 
    	add	x0,a  wC 
    	addx	x1,x2
    
    I'd love for this to shrink even more, as I plan to use this code a bunch. [8^)

    Jonathan

    Jonathan,

    Excellent! However the 'cost' of saving an instruction is that a is modified. bitc is much nicer than what I was doing and it's not needed if high bits of a and b are both zero. I doubt the code can be any smaller with the existing instruction set.

    Summary
    Preserve a and b: cycles = 28/26*, variables = 4
    Preserve b: cycles = 26/24*, variables = 3
    Overwrite a and b with result: cycles = 26/24*, variables = 3
    * If a[31] = b[31] = 0

    Here's the code using bitc that preserves a and b:
    ' 32-bit * 32-bit -> 64-bit
    ' a * b = {x1, x0}
    '
    ' 28 cycles maximum
    ' 26 cycles if a[31] = b[31] = 0
    
    	getword	x2,a,#1  		'x2 = ah
    	getword	x3,b,#1  		'x3 = bh
    	mov	x0,a
    	mul	x0,b			'x0 = al * bl
    	mov	x1,x2
    	mul	x1,x3			'x1 = ah * bh
    	mul	x2,b			'x2 = ah * bl
    	mul	x3,a			'x3 = bh * al
    	add	x2,x3		wc	'x2 = ah * bl + bh * al
    	getword	x3,x2,#1
    	bitc	x3,#16			'not needed if a[31] = b[31] = 0
    	shl	x2,#16
    	add	x0,x2		wc	'x0 = result(low)
    	addx	x1,x3			'x1 = result(high)
    
    x0	res	1
    x1	res	1
    x2	res	1
    x3	res	1
    
    Formerly known as TonyB
  • Can you use the Z flag from some of the MUL's to short-cut some of the code? Perhaps by ordering
    the arguments by size with FLE/FGE you can then only bother checking the smallest argument for being 16 bits or less?
    Most multiplies have one or two small arguments so this may be worth doing, but you need a representative
    test vector to measure against as the worst-case time goes up.

    It would be nice if SKIPF could be used for multiplication, but I don't think its possible other than for multiply
    by a constant in a rather contorted fashion.
  • Perhaps for P3...

    A new unsigned multiply instruction that works as follows
    i[64:0] is an internal 65bit register
    
     MUL32 D,S                                'new 32*32 unsigned multiply instruction
    '' MUL32 runs internally using the existing 16*16 MUL instruction as follows
     i :=   d[15:0]  * s[15:0]
     i += ( d[31:16] * s[15:0]  ) << 16
     i += ( d[15:0]  * s[31:16] ) << 16
     i += ( d[31:16] * s[31:16] ) << 32
    '' now get the results
     GETL D {WC/WZ/WCZ}                       ' new instruction returns D=i[15:0],  C=i[64], Z=i[65:0]
                                              '                                -or- C=i[32], Z=i[32:0] ?????
     GETH D {WC/WZ/WCZ}                       ' new instruction returns D=i]63:32], C=i[64], Z=i[64:0]
    

    It may be possible that this could run in 10 clocks, and does not necessarily destroy the input values D & S.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • Cordic can already pull 8 clocks, with parallel execution to boot.
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • But more like 60 elapsed from start to finish, even worse for signed (which can only be pipelined
    at a rate of 16+ clocks plus per multiply due to the sign correction instructions).

    If you want to form a product of several values you need to chain multiplies in series or
    a binary tree, and then the efficient way to use cordic is run logs in parallel, sum them, then
    do exponential function on the sum, which is better than 3 or more chained multiplies at
    the expense of being lossy.

    There are lots of tradeoffs to be considered I feel.
  • evanh wrote: »
    Cordic can already pull 8 clocks, with parallel execution to boot.

    Oh I wish :(
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • I can get 16 clocks per signed x unsigned multiply for IIR filter processing, and that's squeezing
    every instruction pretty much.

    I tried FIR filter with muls instruction, haven't tried to do that with cordic yet, but I don't think you'd
    get 8 cycles per tap.
Sign In or Register to comment.