Fast, Faster, Fastest Code: Superfast Integer Multiplication (from Kenya)
cessnapilot
Posts: 182
Hi all,
In the deep core of code there are sometimes integer multiplication routines. They usually go straightforwardly, like elementary-school math. But they also usually go a little further than necessary, as in
'Unsigned INT 32x32 multiply into 32 METHOD 1
'Result = Multiplicand * Multiplier
'Result(I32) = arg1(I32) * arg2(I32)
'Result in r1
'Prepare working registers
              MOV     r1, #0          'Clear 32-bit product
              MOV     r2, #32         'Loop counter for multiply
              SHR     arg2, #1 WC     'Get initial multiplier bit
'Start multiplication
:loop
IF_C          ADD     r1, arg1 WC     'Add multiplicand to product
              SHL     arg1, #1        'Shift multiplicand left (x2)
              SHR     arg2, #1 WC     'Shift next multiplier bit into C
              DJNZ    r2, #:loop
'Result in r1
or, somewhat trickier, with the result placed into arg2
'Unsigned INT 32x32 multiply into 32 METHOD 2
'Result = Multiplicand * Multiplier
'Result(I32) = arg1(I32) * arg2(I32)
'Result in arg2
'Prepare working registers
              MOV     r1, #0          'Clear 32-bit product
              MOV     r2, #32         'Loop counter for multiply
              SHR     arg2, #1 WC     'Get initial multiplier bit
:loop
IF_C          ADD     r1, arg1 WC     'Add multiplicand to product
              RCR     r1, #1 WC       'Shift LSB of product into C
              RCR     arg2, #1 WC     'Load LSB of product into MSB of
                                      'arg2 (via C), rotate arg2 right,
                                      'shift next multiplier bit into C
              DJNZ    r2, #:loop
'Result in arg2
METHOD 1 and METHOD 2 take 132 machine cycles each, and use the same COG space.
They take 132 machine cycles to multiply 12345*67890, and
they take 132 machine cycles to multiply 12345678*9, and
they take 132 machine cycles to multiply 9*123456789 and
they take 132 machine cycles to perform even the basic 2*2 test of a computer... Capiche, HAL?
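The constant timing is no accident: the DJNZ loop always makes 32 passes regardless of the operands. A small Python model of METHOD 1's loop (the function name and the returned iteration counter are mine, for illustration) makes this visible:

```python
def shift_add_mul(arg1, arg2):
    """Model of METHOD 1: classic 32-pass shift-add multiply.

    Returns (product mod 2^32, number of loop iterations)."""
    MASK = 0xFFFFFFFF
    r1 = 0
    iterations = 0
    for _ in range(32):                  # DJNZ always runs 32 passes
        if arg2 & 1:                     # C flag: current multiplier bit
            r1 = (r1 + arg1) & MASK
        arg1 = (arg1 << 1) & MASK        # SHL arg1, #1
        arg2 >>= 1                       # SHR arg2, #1 WC
        iterations += 1
    return r1, iterations

# Whatever the operands, the iteration count is the same:
print(shift_add_mul(12345, 67890))   # (838102050, 32)
print(shift_add_mul(2, 2))           # (4, 32)
```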
A long, long time ago, a sorcerer came out of his tent and asked the people around:
-Hey men, are you bored of the long and dull multiplications?
-Yes, sure, they answered.
-OK, then listen...
The Kenyan algorithm of multiplication
=====================================
Product = Multiplicand times Multiplier.
Empty a bucket.
Step 1: If the actual multiplier is odd, then add the actual multiplicand into the bucket.
Halve the multiplier and double the multiplicand.
Ignore the non-integer part of the new multiplier.
If the new multiplier is greater than zero, goto Step 1.
Count the product from the bucket, and have lunch.
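The steps above can be sketched in a few lines of Python (an illustrative model, not the PASM):

```python
def kenyan_mul(multiplicand, multiplier):
    """Kenyan (a.k.a. Russian peasant) multiplication, step by step."""
    bucket = 0                        # Empty a bucket
    while multiplier > 0:
        if multiplier % 2 == 1:       # Step 1: actual multiplier is odd
            bucket += multiplicand    #   add the actual multiplicand
        multiplier //= 2              # halve (keep the integer part only)
        multiplicand *= 2             # double
    return bucket                     # Count the product from the bucket

print(kenyan_mul(12345, 67890))   # 838102050
```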
And the simple equivalent code clearly shows the binary nature of the Kenyan algorithm.
'Unsigned INT 32x32 multiply into 32 with Kenyan algorithm METHOD 3
'Result = Multiplicand * Multiplier
'Result(I32) = arg1(I32) * arg2(I32)
'Result in r1
'Prepare working registers
              MOV     r1, #0          'Clear 32-bit product
'Start Kenyan multiplication
:loop
              SHR     arg2, #1 WC, WZ 'Halve multiplier and get its LSB
IF_C          ADD     r1, arg1        'Add multiplicand to product on C
              SHL     arg1, #1        'Double multiplicand
IF_NZ         JMP     #:loop          'Continue while multiplier nonzero
'Result in r1
With no loop counter, the code is not only shorter, but much faster than the previous ones.
Can we do better? Yes, we can. Note that the smaller the multiplier, the shorter the execution time of the Kenyan algorithm. So we should always use the smaller of the two arguments as the multiplier. So it goes, with "vollautomatische" zero testing (as a by-product)
'Unsigned INT 32x32 multiply into 32 with improved Kenyan METHOD 4
'Result = Multiplicand * Multiplier
'Result(I32) = arg1(I32) * arg2(I32)
'Result in r1
'Prepare arguments
'We would like arg2 (the multiplier) to be less than arg1
              CMP     arg1, arg2 WC   'If arg1 is less than arg2, C is set
IF_C          XOR     arg1, arg2      'Swap arguments
IF_C          XOR     arg2, arg1
IF_C          XOR     arg1, arg2
'Start Kenyan multiplication
'Prepare working registers
              MOV     r1, #0          'Clear 32-bit product
:loop
              SHR     arg2, #1 WC, WZ 'Halve multiplier and get its LSB
IF_C          ADD     r1, arg1        'Add multiplicand to product on C
              SHL     arg1, #1        'Double multiplicand
IF_NZ         JMP     #:loop          'Continue while multiplier nonzero
'Result in r1
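In Python terms, METHOD 4's argument swap and loop look roughly like this (a sketch; the XOR swap mirrors the three conditional XOR instructions):

```python
def kenyan_mul_fast(arg1, arg2):
    """Improved Kenyan multiply: the smaller argument becomes the multiplier."""
    if arg1 < arg2:      # CMP arg1, arg2 WC sets C when arg1 < arg2
        arg1 ^= arg2     # three XORs swap the registers without a temp
        arg2 ^= arg1
        arg1 ^= arg2
    r1 = 0
    while arg2:          # WZ: stop as soon as the multiplier hits zero
        if arg2 & 1:     # WC: LSB of the multiplier
            r1 += arg1
        arg2 >>= 1       # halve the multiplier
        arg1 <<= 1       # double the multiplicand
    return r1

print(kenyan_mul_fast(9, 123456789))   # 1111111101, in only 4 loop passes
```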
When overflow does not occur, in other words, when the product is smaller than 2^32, the proposed Kenyan multiplication, even in the worst case (65535*65536), is about twice as fast as the traditional code!
An exhaustive test of the average execution time would take about 2 million years on the Prop in PASM. I have no time for that at the moment, but with a good 32-bit random number generator one can check the average execution time, say, for a million random arg1, arg2 pairs. I guess the average speed will be about 3-4(!) times higher than that of the traditional methods, used and wired in almost everywhere. Happy testing! Let me know the result, please.
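As a cheap sanity check before any PASM timing, one can at least count loop iterations in Python over random pairs (a rough model: METHOD 4's loop runs once per bit of the smaller argument; the helper names are mine):

```python
import random

def loop_passes(arg1, arg2):
    """METHOD 4 iterates once per bit of the smaller argument."""
    return min(arg1, arg2).bit_length()

random.seed(2009)   # fixed seed -> repeatable 'random' sequence
pairs = [(random.getrandbits(32), random.getrandbits(32))
         for _ in range(100_000)]
avg = sum(loop_passes(a, b) for a, b in pairs) / len(pairs)
print(round(avg, 1))   # roughly 30 for uniform 32-bit pairs
```

For truly uniform 32-bit pairs the smaller argument is still usually large; the savings grow quickly when at least one argument has few significant bits, as in typical sensor and fixed-point workloads.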
This superfast integer multiplication will be at the core of a new fixed-point library for the Obex. You can check the post and the codes with the attached utility.
If someone comes up with a 10% faster method on the Propeller (on average over 1 million random 32-bit pairs), I will give her/him 100 bucks (US) for the first valid code.
Cheers,
istvan
PS: Next time, integer division (not only by constants, of course) is to be improved.
Post Edited (cessnapilot) : 8/1/2009 10:28:12 AM GMT
Comments
Those in the picture look beautiful. Can you ride them on the streets or only on water? Well, you can ride them, I'm sure, but are they fast enough to escape cars of a special kind? Let's get back to calm waters. What is their top speed?
Istvan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
http://www.youtube.com/watch?v=qlM3Z9-yCEQ
Top speed on some crafts is 80 MPH, mine are in the 40-50 MPH range.
I really enjoy your 6DOF IMU, H48C, and GPS contributions!
Jim
Yes, exactly. Kenyan division seems to work on the Prop like a charm. For a future "FFFC: Superfast Integer Division II" thread, maybe that will be the winner. However, I have to check other algorithms to decide, for example reciprocals, or division with CORDIC, or a simple LUT with Newton-Raphson. To prepare a real Propeller niche for those algorithmic engines, I am making a flexible and general Monte-Carlo testpad. This SPIN application will throw random arguments at tested PASM algorithms to check their correctness, precision and, last but not least, their speed. Histograms of execution times and accuracies (cumulated or over ranges) will be computed. Everything will run with only the Propeller + PST. And there is (huge) room for improvement in the calculations of SINE, COSINE, ATAN2 and other math beasts. A superfast Winograd FFT (with that somewhere-mentioned DFT update trick), for example, to process sound in real time on the Prop.
In secondary school a math teacher had mentioned such a thing, but he called it 'Russian multiplication'. Hungary was occupied at that time and, as part of my fierce resistance, I just ignored it, along with many other things at school.
Cheers,
Istvan
> Everything will go with only the Propeller + PST. And, there is (huge) room for improvements
> in the calculations of SINE, COSINE, ATAN2 and for other math beasts.
No kidding.
btw - what are you using on the prop to gen the randoms?
> Superfast Winograd FFT (with that somewhere mentioned DFT update trick), for example, to treat sound real-time with Prop.
Looking forward to that.
thanks for your efforts!
- Howard
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I use Spin's ? operator to generate random values. The 'pseudorandom' property of the values is very convenient for submitting the same sequence of arguments to different algorithms. In this way the comparison is 'fair', and this helps in locating bugs, weak points in speed, or regions of bad accuracy, etc. These tests will be useful, especially for fixed-point Q numbers, to select the correct representation for a task, or to check the validity of a claimed 'quick' approximation of a function with those Q numbers. The uniformity of the sequence is the most important point for these PASM tests, to cover the selected intervals of the arguments consistently.
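The idea of a seeded, repeatable pseudorandom sequence can be sketched with a generic 32-bit xorshift generator. Note this is only a stand-in: Spin's ? operator uses its own LFSR with different taps, so the values differ, but the "same seed, same sequence" property it relies on is the same:

```python
def xorshift32(state):
    """One step of a 32-bit xorshift PRNG (a stand-in for Spin's ? operator;
    not its actual LFSR). Each step is a bijection on 32-bit values."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state

def sequence(seed, n):
    """n pseudorandom 32-bit values from a given seed."""
    out = []
    for _ in range(n):
        seed = xorshift32(seed)
        out.append(seed)
    return out

# The same seed feeds identical arguments to every algorithm under test,
# so speed and accuracy comparisons are 'fair':
assert sequence(2009, 5) == sequence(2009, 5)
assert sequence(2009, 5) != sequence(2010, 5)
```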
The user, however, will have an option to choose 'Real' random sequences. Like
<Empty Rx buffer>
<Display a polite question on PST screen>
Something like: Real random sequence (Y/N)?
After the user hits the keyboard:
<In case of (Y/y): CNT will be the seed, used to generate seeds for arg1, arg2, ...>
For an average user, there is no real chance in a lifetime of getting the same sequence again.
<In case of OTHERS : Seed will be, e.g. 2009 for arg1, 2010 for arg2 and so on.>
I will use another trick to model typical sensor readings. I will add 5..10 uniform randoms to get approximately normally distributed values. These can be shifted/stretched to cover and mimic the characteristic outputs of a sensor. This might sound strange, but imagine that you have to read reference voltages between 4.5 and 5.5 V and you have to divide by these values at very high speed to collect data. In a given number format, you can use fast division with pre-wired bit shifts. To compare and test the 'general' and the 'pre-wired' algorithms, such a reading generator comes in handy.
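The sum-of-uniforms trick leans on the central limit theorem: adding a handful of uniform randoms gives an approximately normal value, which can then be shifted and stretched into the 4.5..5.5 V span. A Python sketch (the function name and the choice of 8 summands are mine):

```python
import random

def sensor_reading(lo=4.5, hi=5.5, n=8):
    """Approximately normal sample in [lo, hi], centered at the midpoint,
    built by summing n uniform randoms (central limit theorem)."""
    s = sum(random.random() for _ in range(n))   # mean n/2, range 0..n
    return lo + (hi - lo) * s / n                # shift/stretch into lo..hi

random.seed(1)
samples = [sensor_reading() for _ in range(10_000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))   # clusters tightly around 5.0 V
```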
Cheers,
Istvan
EDIT: of course if it's signed then the MSbit has a different meaning, but the rest of the code is the same.
EDIT2: the CMP logic gets a bit more complicated too (finding the value with fewer 1 bits when positive).
-1 * 1
Very different if unsigned...
Thanks,
Jim
This is one of the neat properties of the 2's complement representation of negative numbers.
- small * small : ordering doesn't really matter...it's dependent but way below worst case
- large * large : ordering doesn't really matter (~500 clocks)
- large * small : simple ordering matters (~285 clocks)
- large * -small : simple ordering doesn't help, since a small negative number still has to do 32 iterations (~525 clocks)
- -small * -small : simple ordering doesn't help here either (~500 clocks)
The simple way to organize the values so that v1 has a lower MSb than v2 is:

This handles all cases well except when small negative numbers are involved. For negative numbers you can change the sign of both numbers and still get the identical result. So the goal becomes: find the number with the lowest MSb of { v1, -v1, v2, -v2 }, use that as the value that gets shifted down until 0, and switch the sign on both v1 and v2 if needed. Here is the code I'm using:

For the case where you have a large number * a small negative number, this brings the clocks back down to ~188 again, at the cost of a little more code and overhead. Also, somebody will find a way to optimize my code, so it can only get better! [8^)

thanks,
Jonathan
EDIT: I instrumented the PASM function to get actual clocks instead of timing in Spin.
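The min-MSb search over { v1, -v1, v2, -v2 } described above can be sketched in Python (an illustrative model with Python ints standing in for 32-bit registers; all names are mine, not the PASM's):

```python
def signed_kenyan_mul(v1, v2):
    """Pick, among (v1, v2) and (-v1, -v2), the ordering whose multiplier
    has the lowest MSb in its 32-bit pattern, then run the Kenyan loop.
    Negating BOTH operands leaves the product unchanged."""
    def msb(x):
        return (x & 0xFFFFFFFF).bit_length()   # MSb of the 32-bit pattern

    candidates = []
    for a, b in [(v1, v2), (-v1, -v2)]:        # same product either way
        candidates += [(a, b), (b, a)]
    multiplicand, multiplier = min(candidates, key=lambda p: msb(p[1]))

    r = 0
    multiplier &= 0xFFFFFFFF                   # shift the bit pattern down
    while multiplier:
        if multiplier & 1:
            r += multiplicand
        multiplier >>= 1
        multiplicand <<= 1
    return ((r & 0xFFFFFFFF) ^ 0x80000000) - 0x80000000   # wrap to signed 32

print(signed_kenyan_mul(100000, -3))   # -300000, in only 2 loop passes
```

A large number times a small negative number now loops only as many times as the small magnitude has bits, matching the clock reduction reported above.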
If you XOR the two inputs together, the high bit is the sign of the result. ABS both inputs, do the mult, then if the result is supposed to be negative, flip it at the end. That way, multiplying -1 x -1 is as fast as 1 x 1.
Or am I misunderstanding how this is supposed to work?
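That sign trick can be sketched as (a Python model; names are mine):

```python
def mul_via_abs(a, b):
    """XOR the inputs: the sign bit of a ^ b is the sign of the product.
    Multiply the absolute values with the Kenyan loop, negate if needed."""
    negative = (a ^ b) < 0        # signs differ -> negative result
    a, b = abs(a), abs(b)
    if a < b:
        a, b = b, a               # the smaller magnitude drives the loop
    r = 0
    while b:
        if b & 1:
            r += a
        b >>= 1
        a <<= 1
    return -r if negative else r

print(mul_via_abs(-1, -1))   # 1 - as fast as 1 * 1
```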
thanks,
Jonathan
1x : 502 clocks
2x : 452 clocks
3x : 437 clocks
...
8x : 431 clocks
That's with full 32-bit random numbers, positive or negative. Here is the 3x unrolled version (note the flags all work out with a straight copy-n-paste!!):
Jonathan
forums.parallax.com/discussion/160804/slightly-faster-integer-multiplication-in-pasm
But it doesn't early-exit, so it's slower for most numbers. This thread's approach is better for average speed (usually), or for fewer bits (always).
Jonathan
that get fed into new calculations, and eventually many of the numbers are large (typically the mantissa of floating-point numbers). So average time is what matters. The fact that you can multiply small integers together faster is rarely important to overall speed, I reckon. Look at things like FFT, statistical calculations, digital filtering, control loops - all are going to be using fixed or floating point, with most bits used most of the time.