Fast, Faster, Fastest Code: Superfast Integer Multiplication (from Kenya) — Parallax Forums

# Fast, Faster, Fastest Code: Superfast Integer Multiplication (from Kenya)

Posts: 182
edited 2013-04-15 12:49
Hi all,

In the deep core of many programs there are integer multiplication routines. They usually work straightforwardly, like elementary-school arithmetic. But they usually do a little more work than necessary, as in:

'Unsigned INT 32x32 multiply into 32                              METHOD 1
'Result = Multiplicand * Multiplier
'Result(I32) = arg1(I32) * arg2(I32)
'Result in r1
'Prepare working registers
MOV      r1,             #0            'Clear 32-bit product
MOV      r2,             #32           'Loop counter for multiply
SHR      arg2,           #1 WC         'Get initial multiplier bit
'Start multiplication
:loop
IF_C ADD r1,             arg1 WC       'Add multiplicand to product
SHL      arg1,           #1            'Shift multiplicand left (x2)
SHR      arg2,           #1 WC         'Shift next multiplier bit into C
DJNZ     r2,             #:loop
'Result in r1

or, somewhat trickier, with the result placed into arg2:

'Unsigned INT 32x32 multiply into 32                              METHOD 2
'Result = Multiplicand * Multiplier
'Result(I32) = arg1(I32) * arg2(I32)
'Result in arg2
'Prepare working registers
MOV      r1,             #0            'Clear 32-bit product
MOV      r2,             #32           'Loop counter for multiply
SHR      arg2,           #1 WC         'Get initial multiplier bit
:loop
IF_C ADD r1,             arg1 WC       'Add multiplicand to product
RCR      r1,             #1 WC         'Shift LSB of product into C
RCR      arg2,           #1 WC         'Load LSB of product into MSB of
                                       'arg2 (via C), rotate arg2 right,
                                       'shift next multiplier bit into C

DJNZ     r2,             #:loop
'Result in arg2

METHOD 1 and METHOD 2 both take 132 machine cycles and use the same COG space.

They take 132 machine cycles to multiply 12345*67890,
they take 132 machine cycles to multiply 12345678*9,
they take 132 machine cycles to multiply 9*123456789, and
they take 132 machine cycles to perform even the basic 2*2 test of a computer... Capiche, HAL?

A long, long time ago, a sorcerer came out of his tent and asked the people around:
-Hey men, are you bored of the long and dull multiplications?
-Yes, sure, they answered.
-OK, listen, then...

The Kenyan algorithm of multiplication
=====================================
Product = Multiplicand times Multiplier.

Empty a bucket.

Step 1: If the current multiplier is odd, add the current multiplicand into the bucket.

Halve the multiplier and double the multiplicand.

Ignore the non-integer part of the new multiplier.

If the new multiplier is at least one, go to Step 1.

Sum the bucket to get the product, and have lunch.
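The recipe above translates directly into code. A minimal Python sketch (the function and variable names are mine, for illustration; the thread's real code is PASM):

```python
def kenyan_mul(multiplicand, multiplier):
    """Kenyan (a.k.a. Russian peasant) multiplication, step by step."""
    bucket = 0
    while multiplier >= 1:
        if multiplier % 2 == 1:        # Step 1: odd multiplier?
            bucket += multiplicand     # ...add the multiplicand into the bucket
        multiplier //= 2               # halve, ignoring the non-integer part
        multiplicand *= 2              # double
    return bucket                      # the bucket now holds the product
```

Halving an integer repeatedly just walks its binary digits, which is why this turns into a plain shift-and-add loop in assembly.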

And the simple equivalent code clearly shows the binary nature of the Kenyan algorithm:

'Unsigned INT 32x32 multiply into 32  with Kenyan algorithm       METHOD 3
'Result = Multiplicand * Multiplier
'Result(I32) = arg1(I32) * arg2(I32)
'Result in r1
'Prepare working registers
MOV      r1,             #0            'Clear 32-bit product
'Start Kenyan multiplication
:loop
SHR      arg2,           #1 WC, WZ     'Halve multiplier and get LSB of it
IF_C ADD r1,             arg1          'Add multiplicand to product on C
SHL      arg1,           #1            'Double multiplicand
IF_NZ    JMP             #:loop        'Check nonzero multiplier to cont. mult.
'Result in r1

With no counter, the code is not only shorter but also much faster than the previous ones.

Can we do better? Yes, we can. Note that the smaller the multiplier, the shorter the Kenyan algorithm's execution time. So we should always use the smaller of the two arguments as the multiplier. So it goes, with "vollautomatische" zero testing (as a by-product):

'Unsigned INT 32x32 multiply into 32  with improved Kenyan        METHOD 4
'Result = Multiplicand * Multiplier
'Result(I32) = arg1(I32) * arg2(I32)
'Result in r1
'Prepare arguments
'We would like to have arg2 (the multiplier) to be less than arg1
CMP       arg1,          arg2 WC       'If arg1 is less than arg2 C is set
IF_C  XOR arg1,          arg2          'Swap arguments
IF_C  XOR arg2,          arg1
IF_C  XOR arg1,          arg2
'Start Kenyan multiplication
'Prepare working registers
MOV      r1,             #0            'Clear 32-bit product
:loop
SHR      arg2,           #1 WC, WZ     'Halve multiplier and get LSB of it
IF_C ADD r1,             arg1          'Add multiplicand to product on C
SHL      arg1,           #1            'Double multiplicand
IF_NZ    JMP             #:loop        'Check nonzero multiplier to
'continue multiplication
'Result in r1
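The ordering trick of METHOD 4 can be modeled in Python like this (a sketch under my own naming, not a translation of the PASM flags):

```python
def kenyan_mul_fast(a, b):
    """Improved Kenyan multiply: always use the smaller argument as the
    multiplier, so the loop runs once per bit of that argument only."""
    if a < b:
        a, b = b, a        # models METHOD 4's conditional XOR-swap
    res = 0
    while b:               # the zero test comes free, as in the PASM loop
        if b & 1:          # low bit of the multiplier set?
            res += a
        b >>= 1            # halve the multiplier
        a <<= 1            # double the multiplicand
    return res
```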

When overflow is not allowed, in other words, when the product is smaller than 2^32, the proposed Kenyan multiplication, even in the worst case (65535*65536), is about twice as fast as the traditional codes!
An exhaustive test of the average execution time would take about 2 million years on the Prop in PASM. I have no time for that at the moment, but with a good 32-bit random number generator one can check the average execution time, say, for a million random arg1, arg2 pairs. I guess the average speed will be about 3-4(!) times higher than that of the traditional methods, used and wired in almost everywhere. Happy testing! Let me know the result, please.
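Short of running it on the Prop, the comparison can be modeled in Python by counting loop iterations: METHOD 1/2 always run 32, while the improved Kenyan loop runs once per bit of the smaller argument. The function names and the 10,000-pair sample size are my own choices:

```python
import random

def loop_iterations(a, b):
    """Improved-Kenyan iteration count: one pass per bit of the
    smaller argument (the loop body runs at least once, even for 0)."""
    return max(min(a, b).bit_length(), 1)

def estimate_average(pairs=10_000, seed=2009):
    """Average iteration count over random 32-bit pairs (fixed seed,
    so the same sequence can be replayed against other algorithms)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(pairs):
        total += loop_iterations(rng.getrandbits(32), rng.getrandbits(32))
    return total / pairs
```

Note that for uniformly random 32-bit pairs the smaller argument still tends to be large, so the big wins show up for mixed-size workloads (e.g. 9*123456789 needs only 4 iterations instead of 32).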

This superfast integer multiplication will be in the core of a new fixed point library for Obex. You can check the post and the codes with the attached utility.

If someone can come up with a 10% faster method on the Propeller (on average over 1 million random 32-bit pairs), I will give her/him 100 bucks (US) for the first valid code that does it.

Cheers,

istvan

PS: Next time, integer division (not only by constants, of course) is to be improved.

Post Edited (cessnapilot) : 8/1/2009 10:28:12 AM GMT

## Comments

• Posts: 1,929
edited 2009-08-01 02:59
Outstanding!!
cessnapilot said...
In the deep core of codes there are sometimes integer multiplication codes...

• Posts: 182
edited 2009-08-01 10:37
Hi hover1,

Those in the picture look beautiful. Can you ride them on the streets, or only on water? Well, you can ride them, I'm sure, but are they fast enough to escape cars of a special kind? Let's get back to calm waters. What is their top speed?

Istvan
• Posts: 3,370
edited 2009-08-01 12:28
Amazing, never heard of such a thing. Any chance of the Kenyan division as described here www.sxlist.com/TECHREF/method/math/muldiv.htm?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
• Posts: 1,929
edited 2009-08-01 12:50
The hovercraft are registered as boats, so it is illegal to operate them on roads. We do race on land/water on closed courses.

http://www.youtube.com/watch?v=qlM3Z9-yCEQ

Top speed on some crafts is 80 MPH, mine are in the 40-50 MPH range.

I really enjoy your 6DOF IMU, H48C, and GPS contributions!

Jim
cessnapilot said...
What is their top speed?

• Posts: 182
edited 2009-08-01 15:09
Hi Heater,

Yes, exactly. Kenyan division seems to work on the Prop like a charm. For a future "FFFC: Superfast Integer Division II" thread, maybe that will be the winner. However, I have to check other algorithms to decide: for example, reciprocals, division with CORDIC, or a simple LUT with Newton-Raphson. To prepare a real Propeller niche for those algorithmic engines, I am making a flexible and general Monte Carlo test pad. This SPIN application will throw random arguments at tested PASM algorithms to check their correctness, precision and, last but not least, speed. Histograms of execution times and accuracies (cumulative or over ranges) will be computed. Everything will run with only the Propeller + PST. And there is (huge) room for improvements in the calculations of SINE, COSINE, ATAN2 and other math beasts. A superfast Winograd FFT (with that somewhere-mentioned DFT update trick), for example, to treat sound in real time on the Prop.

In secondary school a math teacher had mentioned such a thing, but he called it 'Russian multiplication'. Hungary was occupied at that time, and as part of my fierce resistance, I just ignored it, along with many other things at school.

Cheers,

Istvan
• Posts: 1,132
edited 2009-08-01 16:27
Nice work, Istvan!

> Everything will go with only the Propeller + PST. And, there is (huge) room for improvements
> in the calculations of SINE, COSINE, ATAN2 and for other math beasts.

No kidding.

btw - what are you using on the prop to gen the randoms?

> Superfast Winograd FFT (with that somewhere mentioned DFT update trick), for example, to treat sound real-time with Prop.

Looking forward to that.

thanks for your efforts!
- Howard

• Posts: 182
edited 2009-08-02 09:38
Hi CounterRotatingProps,

I use SPIN's ? operator to generate random values. The 'pseudorandom' property of the values is very convenient for submitting the same sequence of arguments to different algorithms. In this way the comparison is 'fair', and this helps in locating bugs, weak points for speed, or regions of bad accuracy, etc. These tests will be useful especially for fixed-point Q numbers: to select the correct representation for a task, or to check the validity of a claimed 'quick' approximation of a function with those Q numbers. The uniformity of the sequence is the most important point for these PASM tests, to cover the selected intervals of the arguments consistently.

The user, however, will have an option to choose 'Real' random sequences. Like

<Empty Rx buffer>
<Display a polite question on PST screen>
Something like: Real random sequence (Y/N)?

After the user hits the keyboard:
<In case of (Y/y) : CNT will be the seed, to generate seeds for arg1, arg2,...>
For an average user, there is no real chance in a lifetime of getting the same sequence again.

<In case of OTHERS : Seed will be, e.g. 2009 for arg1, 2010 for arg2 and so on.>

I will use another trick to model typical sensor reading values. I will add 5..10 uniform randoms to get normally distributed values. These can be shifted/stretched to cover and mimic the characteristic outputs of a sensor. This might sound strange, but imagine that you have to read reference voltages between 4.5 and 5.5 V and divide by these values at very high speed to collect data. In a given number format, you can use fast division with pre-wired bit shifts. To compare and test the 'general' and the 'pre-wired' algorithms, such a reading generator comes in handy.
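The sum-of-uniforms trick can be sketched as follows (the 4.5..5.5 V range is from the example above; the names and the choice of n=8 are mine):

```python
import random

def sensor_reading(rng, lo=4.5, hi=5.5, n=8):
    """Approximately normal sample: sum n uniform randoms (central
    limit theorem), then shift/stretch the result into [lo, hi]."""
    s = sum(rng.random() for _ in range(n))   # sum of n uniforms, mean n/2
    return lo + (hi - lo) * s / n             # rescale; mean is (lo+hi)/2
```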

Cheers,

Istvan
• Posts: 1,451
edited 2013-04-08 08:42
Would anyone know if this (Method 4) is still considered the fastest 32x32 integer multiply on the P1?
• Posts: 6,347
edited 2013-04-08 09:00
This is similar to the multiplier code used in the PropGCC kernel.
__MULSI
mov	__TMP0, r0
min	__TMP0, r1
max	r1, r0
mov	r0, #0
__MULSI_loop
shr	r1, #1	wz, wc
IF_C	add	r0, __TMP0
add	__TMP0, __TMP0
IF_NZ	jmp	#__MULSI_loop
__MULSI_ret	ret

• Posts: 1,451
edited 2013-04-08 10:45
Thanks, Dave. Perfect.
• Posts: 2,200
edited 2013-04-08 14:34
Stupid question... but are multiply signed and multiply unsigned the same operation?
• Posts: 917
edited 2013-04-08 16:20
Can be, with the addition of 2 lines just before the loop, something like this:
abs arg2, arg2 wc
if_c neg arg1, arg1

Jonathan

EDIT: of course if it's signed then the MSbit has a different meaning, but the rest of the code is the same.
EDIT2: the CMP logic gets a bit more complicated too (finding the value with fewer 1 bits when positive).
• Posts: 2,200
edited 2013-04-08 16:36
I should have just thought through the obvious example...

-1 * 1

Very different if unsigned...

Thanks,
• Posts: 6,347
edited 2013-04-09 05:57
I don't think you need to check the sign bit for multiplication. It works for both signed and unsigned numbers. It just depends on how you interpret the MSB.
• Posts: 1,748
edited 2013-04-09 06:41
Did Istvan ever publish his whole integer math library?
Jim
• Posts: 5,894
edited 2013-04-09 07:31
Kye wrote: »
Stupid question... but are multiply signed and multiply unsigned the same operation?
For 32x32 -> 32 they are the same (that is, the lower 32 bits of the result are identical for signed and unsigned multiplication). If you calculate the full 64 bits then they will differ. For example, -1 is stored as FFFFFFFF, so:
-1 * 2 = -2 =               FFFFFFFF_FFFFFFFE (full 64 bit result, signed multiply)
FFFFFFFF * 2 =              00000001_FFFFFFFE (full 64 bit result, unsigned multiply)

This is one of the neat properties of the 2's complement representation of negative numbers.
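The property is easy to check in Python, where integers are unbounded (names are mine):

```python
MASK = 0xFFFFFFFF                      # keep only the low 32 bits

def low32(x):
    return x & MASK

# -1 is stored as FFFFFFFF; the two products share their low 32 bits
signed_product   = low32(-1 * 2)       # 0xFFFFFFFE
unsigned_product = 0xFFFFFFFF * 2      # 0x1FFFFFFFE; only the upper long differs
```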
• Posts: 917
edited 2013-04-10 14:33
So, I was playing with this and I can't claim the prize, but I can help with a pathological worst case. Here are the basic cases when multiplying numbers (where large means > ~30 bits {and random 32-bit numbers tend to be large}, and small is < ~10 bits):
• small * small : ordering doesn't really matter...it's data-dependent but way below the worst case
• large * large : ordering doesn't really matter (~500 clocks)
• large * small : simple ordering matters (~285 clocks)
• large * -small : simple ordering doesn't help, since a small negative number still has to do 32 iterations (~525 clocks)
• -small * -small : simple ordering doesn't help here either (~500 clocks)
The simple way to organize the values so that v1 has a lower MSb than v2 is:
mov       vRes, v1
max       v1, v2
min       v2, vRes

This handles all cases well except when small negative numbers are involved. For negative numbers you can change the sign of both numbers and still get the identical result. So the goal becomes, find the number with the lowest MSb of { v1, -v1, v2, -v2 }, use that as the value that gets shifted down until 0, and switch the sign on both v1 and v2 if needed. Here is the code I'm using:
abs       vRes, v1 wc
negc      sign, #1
mov       v1, vRes
abs       vRes, v2 wc
negc      sign, sign
mov       v2, vRes
min       v2, v1
max       v1, vRes
shl       sign, #1 wc
negc      v2, v2

For the case where you have a large number * a small negative number, this brings the clocks back down to ~ 188 again, at the cost of a little more code and overhead. Also, somebody will find a way to optimize my code, so it can only get better! [8^)

thanks,
Jonathan

EDIT: I instrumented the PASM function to get actual clocks instead of timing in Spin.
• Posts: 917
edited 2013-04-12 09:41
Here's a slightly smaller version of the signed swap function, with the whole multiplication code:
' shuffle the values and signs so v1 has fewer high 1 bits, and v2 is the proper sign to make the product correct
abs       v1, v1 wc
if_c  neg       v2, v2
abs       v2, v2 wc
mov       vRes, v2
min       v2, v1
max       v1, vRes
if_c  neg       v2, v2

mov       vRes, #0
' begin the multiplication loop
:loop         shr       v1, #1 wc, wz
if_c  add       vRes, v2
shl       v2, #1
if_nz jmp       #:loop

' The product is now in vRes
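The same logic, modeled in Python as a correctness check (my sketch; it mirrors the shuffle, not the PASM flag handling):

```python
def mulsi32(v1, v2):
    """Signed multiply: give v1 the smaller magnitude and make it
    non-negative, moving any needed sign change onto v2."""
    if v1 < 0:
        v1, v2 = -v1, -v2    # abs v1; neg v2 to preserve the product
    flipped = v2 < 0
    v2 = abs(v2)             # abs v2, remembering whether we flipped it
    if v2 < v1:
        v1, v2 = v2, v1      # the min/max pair: v1 gets the smaller value
    if flipped:
        v2 = -v2             # flip the (possibly swapped) v2 back
    res = 0                  # the accumulator
    while v1:                # shift-and-add on the non-negative v1
        if v1 & 1:
            res += v2
        v1 >>= 1
        v2 <<= 1
    return res
```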

Jonathan
• Posts: 1,930
edited 2013-04-12 17:29
Why do the negation before the mult?

If you XOR the two inputs together, the high bit is the sign of the result. ABS both inputs, do the mult, then if the result is supposed to be negative, flip it then. That way, multiplying -1 x -1 is as fast as 1 x 1.

Or am I misunderstanding how this is supposed to work?
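A Python sketch of that alternative, for comparison (my code, not a translation of anyone's PASM):

```python
def mul_sign_after(a, b):
    """Multiply absolute values, then negate the result if exactly one
    input was negative (the XOR-of-signs rule)."""
    negative = (a < 0) != (b < 0)      # XOR of the two sign bits
    a, b = abs(a), abs(b)
    if a > b:
        a, b = b, a                    # shift down the smaller magnitude
    res = 0
    while a:                           # plain shift-and-add on positives
        if a & 1:
            res += b
        a >>= 1
        b <<= 1
    return -res if negative else res
```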
• Posts: 917
edited 2013-04-14 00:17
JasonDorie wrote: »
Why do the negation before the mult?
You got it, I just couldn't find a way to make that approach come out shorter. I would need to use another intermediate storage long, perform the XOR, shift out the top bit, then finally invert the result: MOV, XOR, SHL, NEG. The current method only uses 2 extra lines (in the precondition code, the lines with if_c neg). All the other lines are needed to make the values absolute and pick the smallest absolute value as v1. The internal 4-instruction loop remains untouched. I'm very likely still missing something, though, and I welcome any speedup (and reducing the instruction count is important too).

thanks,
Jonathan
• Posts: 18,066
edited 2013-04-14 10:32
It is going to be fastest to multiply after converting both to positive, then multiply the largest by the smallest, and finally correct sign. Only if one was negative will the result be negative - i.e. not both negative.
• Posts: 917
edited 2013-04-15 08:17
Cluso99 wrote: »
It is going to be fastest to multiply after converting both to positive...
Exactly: I added some comments:
' The loop can exit once v1 == 0, therefore
' we want v1 to have the lowest high bit. A
' negative integer's high bit is always bit
' 31, the worst case scenario.
abs       v1, v1 wc               ' make v1 positive
if_c  neg       v2, v2                  ' neg v2 if I did v1 (preserve the product)
abs       v2, v2 wc               ' make v2 positive, and flag if I had to neg
mov       vRes, v2                ' temporary variable
min       v2, v1                  ' use min to make v2 the larger of v1,v2
max       v1, vRes                ' use max to make v1 the smaller of v1,vRes(v2)
if_c  neg       v2, v2                  ' if I had to neg v2 earlier, neg it back

mov       vRes, #0                ' zero the accumulator
' begin the core multiplication loop
:loop         shr       v1, #1 wc, wz           ' get the lowest bit of v1
if_c  add       vRes, v2                ' if it was a 1, add v2 to the accumulator
shl       v2, #1                  ' shift v2 left by 1 (to prep for the next v1 bit)
if_nz jmp       #:loop                  ' loop as long as v1 is not 0

Jonathan
• Posts: 1,930
edited 2013-04-15 10:15
What Cluso means (and I mean) is: don't do the 2's-complement negation until after, and you'll spend less time in the loop in the average case.
• Posts: 917
edited 2013-04-15 10:24
The number of high bits in v2 (and thus the sign of v2) is entirely unimportant to the runtime. Only v1 needs to have a low high bit (if that makes sense). As long as v1 is selected / manipulated to have the lowest 1st bit possible, the runtime is as fast as possible. So, whether I change v2 up front, or the product at the end has no impact, and means I don't have to preserve the C flag over the main multiplication loop.

Jonathan
• Posts: 917
edited 2013-04-15 12:42
Btw, unrolling the loop does in fact help.

1x : 502 clocks
2x : 452 clocks
3x : 437 clocks
...
8x : 431 clocks

That's with full 32-bit random numbers, positive or negative. Here is the 3x unrolled version (note the flags all work out with a straight copy-n-paste!!):
' The loop can exit once v1 == 0, therefore
' we want v1 to have the lowest high bit. A
' negative integer's high bit is always bit
' 31, the worst case scenario.
abs       v1, v1 wc               ' make v1 positive
if_c  neg       v2, v2                  ' neg v2 if I did v1 (preserve the product)
abs       v2, v2 wc               ' make v2 positive, and flag if I had to neg
mov       vRes, v2                ' temporary variable
min       v2, v1                  ' use min to make v2 the larger of v1,v2
max       v1, vRes                ' use max to make v1 the smaller of v1,vRes(v2)
if_c  neg       v2, v2                  ' if I had to neg v2 earlier, neg it back

mov       vRes, #0                ' zero the accumulator
' begin the core multiplication loop
:loop         shr       v1, #1 wc, wz           ' get the lowest bit of v1
if_c  add       vRes, v2                ' if it was a 1, add v2 to the accumulator
shl       v2, #1                  ' shift v2 left by 1 (to prep for the next v1 bit)

shr       v1, #1 wc, wz           ' get the lowest bit of v1
if_c  add       vRes, v2                ' if it was a 1, add v2 to the accumulator
shl       v2, #1                  ' shift v2 left by 1 (to prep for the next v1 bit)

shr       v1, #1 wc, wz           ' get the lowest bit of v1
if_c  add       vRes, v2                ' if it was a 1, add v2 to the accumulator
shl       v2, #1                  ' shift v2 left by 1 (to prep for the next v1 bit)

if_nz jmp       #:loop                  ' loop as long as v1 is not 0

Jonathan
• Posts: 1,930
edited 2013-04-15 12:45
Ahhhhh.... Ok - I *did* misread the code. In the case where both numbers are negative, you make them both positive anyway. If one or the other is negative, you just pick the smaller absolute number, and sign flip the other. Is that right? That makes sense now.
• Posts: 917
edited 2013-04-15 12:49
Exactly! Seems to work, at least over a couple of million pseudo random numbers (seed with RealRandom, then run with the spin ? operator).

Jonathan
• Posts: 78
edited 2016-04-22 04:02
This is amazing! Absolutely... wow! I just stumbled on this page! What are the reasons not to use this method? There must be a catch.
• Posts: 917
In the worst case, the method here is slightly faster:

forums.parallax.com/discussion/160804/slightly-faster-integer-multiplication-in-pasm

But it doesn't early-exit, so it's slower for most numbers. This thread's approach is better for average speed (usually), or for fewer bits (always).

Jonathan
• Posts: 1,981
Usually, if an application is arithmetic-intensive, it's generating values as the result of calculations that get fed into new calculations, and eventually many of the numbers are large (typically the mantissa of floating-point numbers). So average time is what matters.

The fact that you can multiply small integers together faster is rarely important to overall speed, I reckon.

Look at things like FFTs, statistical calculations, digital filtering, control loops - all are going to be using fixed- or floating-point with most bits used most of the time.