Welcome to the Parallax Discussion Forums, sign-up to participate.

lonesock
Posts: **891**

Hi, all.

(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.

So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:

thanks,

Jonathan

(Excluding counter-based solutions) 32-bit integer multiplication always boils down a worst case scenario of 4 instructions in a 32x loop, with a base cost of 512 clocks. You can make this faster on average with an early exit, sorting the absolute value of the numbers, etc. However, the worst case remains.

So, instead of doing 32 passes on 4 instructions, here is a technique for doing 16 passes on 6 instructions:

{{ vRes (32-bit) := v1 (32-bit) * v2 (32-bit) ------------------------------------------ Break the multiplication of 2 32-bit numbers into 4 multiplications of the 4x 16-bit portions: v1 * v2 = (v1_hi * v2_hi) << 32 + (v1_hi * v2_lo) << 16 + (v1_lo * v2_hi) << 16 + (v1_lo * v2_lo) << 0 Note that the first term can not fit in our result so we ignore it, and I can re-combine v1_hi and v1_lo: v1 * v2 (fit into 32 bits) = (v1 * v2_lo) + (v1_lo * v2_hi) << 16 }} newmult ' setup mov vRes, #0 ' Primary accumulator (and final result) mov tmp1, v1 ' Both my secondary accumulator, shl tmp1, #16 ' and the lower 16 bits of v1. mov tmp2, v2 ' This is the upper 16 bits of v2, shr tmp2, #16 ' which will sum into my 2nd accumulator. mov i, #16 ' Instead of 4 instructions 32x, do 6 instructions 16x. :loop ' v1_hi_lo * v2_lo shr v2, #1 wc ' get the low bit of v2 if_c add vRes, v1 ' (conditionally) sum v1 into my 1st accumulator shl v1, #1 ' bit align v1 for the next pass ' v1_lo * v2_hi shl tmp1, #1 wc ' get the high bit of v1_lo, *AND* shift my 2nd accumulator if_c add tmp1, tmp2 ' (conditionally) add v2_hi into the 2nd accumulator ' repeat 16x djnz i, #:loop ' I can't think of a way to early exit this ' finalize shl tmp1, #16 ' align my 2nd accumulator add vRes, tmp1 ' and add its contribution newmult_ret retThis always takes 428 clocks, including the call and return. I can't think of a way to do early exit, so if you spot anything please let me know.

thanks,

Jonathan

## Comments

17 Commentssorted by Date Added Votes13,9160Vote UpVote DownI will check the code in my faster spin interpreter. IIRC Chip posted a faster multiply that I included in it. Again iirc it had an early exit. So it may suit you purpose. If you cannot wait, check my code posted in the faster spin interpreter thread.

My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)8910Vote UpVote DownJonathan

F32 - fast & concise floating point: OBEX, Thread

Unrelated to the prop: KISSlicer

12,0940Vote UpVote Down9,3740Vote UpVote DownDo not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One AnotherSKYPE = acuity_dougParallax colors simplified: http://forums.parallax.com/showthread.php?123709-Commented-Graphics_Demo.spin<br>

12,0940Vote UpVote DownI must have missed the

Sizevalues for the new Mult, and theSizevalue for the Old Mult in #1.Can you point to them ?

8910Vote UpVote DownFrom my listing you can see that my code has 15 instructions (including the return, but not including a call). The main loop is 6 instructions that run 16 times, so that's 384 clocks plus 4 clocks to fall through the DJNZ. It also has 8 instructions for setup and cleanup, plus the return. The run time with the call, setup, and return will always be 428 clocks. I am using 6 local variables: argument 1, argument 2, result, scratch 1, scratch 2, and a loop variable. Presumably those can be reused from whatever code you wish integrated, but on the off chance you run this completely solo, that's 21 longs used.

The Spin Interpreter has 15 setup & clean-up instructions in the MUL section. I am unsure how many of those could be stripped for a MUL-only code. Then there is the heart of the algorithm, which is a block of 4 instructions, executed 32 times. It also uses 5 local variables. The runtime for it looks to be about 580 clocks. Total size is about 19 longs, plus 5 local vars, or 24 total.

My intent wasn't to specifically compare it against the Spin Interpreter's version, btw. I just wanted to present another algorithm for people who could spare 2 instructions to have a faster guaranteed worst case scenario.

thanks,

Jonathan

F32 - fast & concise floating point: OBEX, Thread

Unrelated to the prop: KISSlicer

12,0940Vote UpVote DownThanks, I asked because I know Cluso is working on a faster Spin, and the Speed/Size trade off will matter there.

13,9160Vote UpVote Downyes that is the latest spin interpreter.

For interest, here is the faster spin interpreter. However, there have been no mods to the mult as far as I can see (ie haven't verified with original)

I also posted the sqrt section as it has been improved in case anyone is interested.

My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)8910Vote UpVote DownJonathan

F32 - fast & concise floating point: OBEX, Thread

Unrelated to the prop: KISSlicer

1,6830Vote UpVote DownUnroll the loop by 4, then body is 13 instructions run 8 times, total 104 instructions (plus 4 for setup

gives 108 instructions, static instruction count = 15

Your method is 16*6 plus 11 for setup = 107 instructions, static count 15 too, so basically the same,

a whisker faster.

Standard method unrolled by 8 is 104 instructions dynamic, 27 static.

Unroll your loop by 2 gains you 8 instructions, by 4 gains you 12...

One can test for one of the arguments being 16 bit only and branch to a shorter version as

this is the common case

7,7190Vote UpVote DownHere is a typical timing of the UM* instruction which returns a 64-bit result.

123 CLKFREQ LAP UM* LAP .LAP 2.500us ok

The best speed for the UM* is 833ns while the worst case is multiplying two $FFFF.FFFFs together to get a 64bit $FFFF.FFFE.0000.0001 result (or -#1FFFFFFFF) in 9.833us!

This was run on a +P8 running at 96MHz.

Tachyon Forth- compact, fast, forthwright and interactive--->CLICK THE LOGO for more links<---Latest binary V5.4 includes EASYFILE+++++Tachyon Forth News BlogP2 SHORTFORM DATASHEET+++++TAQOZ documentationBrisbane, Australia13,9160Vote UpVote Down1. Save resultant sign

2. Make both 32bit numbers positive

3. Make the multiplier the smaller number

4. Now perform the add loop, decrementing the multiplier and testing for zero to exit the loop

5. Reinsert resultant sign

The result will be 64bits, preventing overflow.

This would be the best size/speed compromise. Unravelling the loop would now sacrifice space for speed.

Next question, is this code common for divide? This is my consideration for my faster Spin interpreter.

My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)1,5590Vote UpVote Down1,5590Vote UpVote Downhttp://www.chiark.greenend.org.uk/~theom/riscos/docs/ultimate/a252div.txt

30Vote UpVote Down7,7190Vote UpVote DownBear in mind that while you use a fixed sized loop it will waste unnecessary cycles when there is nothing left to process, most of the time. My 32x32 multiply with 64-bit result usually only takes about 3us.

Here's an interactive example with timing report in Tachyon.

80MHz clock - 32x32 multiply with 64-bit result.

Tachyon Forth- compact, fast, forthwright and interactive--->CLICK THE LOGO for more links<---Latest binary V5.4 includes EASYFILE+++++Tachyon Forth News BlogP2 SHORTFORM DATASHEET+++++TAQOZ documentationBrisbane, Australia1,6830Vote UpVote Downmost values to be "large" as they will be fixed-point or floating point samples, not a statistically nice set of mainly

small integers. My experience of this is handling 24 bit audio samples and doing correlation, rms, that sort of thing.

Occassionally the values are small but the worst-case performance is all that matters for real-time samples..