Fast, Faster, Fastest Code: 8x8 MUL - MISSION ACCOMPLISHED, faster than unrolle
cessnapilot
Posts: 182
Hi All,
I have found Mike Green's unrolled multiplication code on the forum as a probable speed champion for that special task of 8-bit times 8-bit multiplication·
This published code needs a little bit of optimization, as the product is to be shifted left in its register upon finish. To save this final right shift, it's enough to start with a smaller left shift at the beginning. So, unrolled 8x8 multiplication goes in full speed, as
The fast I32xI32 Optimized Kenyan Multiplication routine beats every of its I32xI32 competitors, so its high time to teach it a lesson. It can do the 8x8 multiplication table at an average 30 machine cycles per multiplication. Maybe it deals too much with that argument checking/swapping at the beginning. Yes, the simple Kenyan does the job in 29 average machine cycles on these small numbers. But, their timing results are far away from the 17 machine cycle speed of the unrolled I8xI8 multiplication. Mike's code is so straightforward, that I do not see too many chances to make a faster 8x8 multiplier. Or, maybe...
A Look up Table?
============
The fastest way of doing a multiplication not to do it at all. At least not in action time. Instead we can calculate all possible results in advance and store them in the memory. Now, whenever we need the result of a multiplication, we simply look it up in the result-table. The multiplication of two 8-bit numbers has, however, 65_536 results of 16 bits each, resulting in a table of 128Kb length. Not to mention the indexing, which would need another 8x8 multiplier. The commutativity of the multiplication may save us half of these bytes, but the necessary 64K are a lot for this simple job on the Propeller. And to be fast, the real·challenge is to squeeze that 64K into a COG. Before one says, that's impossible...
Secondary school math can help to squeeze a 64K table into a COG
============================================
To prove such algebraic equation like this one
a*b = ((a+b)/2)^2 - ((a-b)/2)^2
was very boring that time, and it is boring even today. So, let us believe it is true. It is better to be so, since then we will need only 512 items in the tables. The (a+b) expression can take 512 different inputs [noparse][[/noparse]0...511] and (a-b) can take 512 different [noparse][[/noparse]-128,...,127] inputs, too. Both expressions are squared to obtain the table values, so the second table can be merged into the first table. So, we need a 1 dimensional table with index [noparse][[/noparse]0...511] where each element contains the square of the index. That takes 512 registers. Unfortunately, that compression is not is not enough for a COG, as there are some system registers, and we will have other ones to run our code, too. We can make some further compression, however. The first half the square numbers of the table fits into 16-bit, so there is a chance to compress the table a further 25%. Finally, those square numbers can be stored in (256+128) COG registers. As a result, we can 'implode' the original 128K 2-D multiplication table into a 1-D array of 384 32-bit registers of a COG. Well, we will need some code to address, unpack the table and generate the result of the product with a subtraction. The code should take less than 17 machine cycles to be faster than the unrolled multiplication. If not, all this will be only a fair attempt to make the (almost) impossible.
Assuming the arguments A in a1 and B in a2
···· Pseudocode·············· Machine cycle
Calculate··· a1=(A + ·········· 1
Calculate··· a2=ABS(B - A)····· 2
Table read·· r1=TABLE[noparse][[/noparse]a1]····· X
Table read·· r2=TABLE[noparse][[/noparse]a2]····· X
Result········ r1=(r1 - r2)······ ·· 1··
If the 2 table lookups can be done in 4-10 machine cycles, then the speed of the 8x8 multiplication can be inreased a little bit. Although, the price of this 20-25% speed gain, in Propeller architecture, is maybe too high. 40% sounds better. If only someone could make a 5-6 machine cycle table lookup...
Cheers,
Istvan
Post Edited (cessnapilot) : 8/16/2009 8:17:37 PM GMT
I have found Mike Green's unrolled multiplication code on the forum as a probable speed champion for that special task of 8-bit times 8-bit multiplication·
'I8xI8-->I16 multiply, unrolled SHL arg1, #16 ;A little bit more than necessary SHR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC
This published code needs a little bit of optimization, as the product is to be shifted left in its register upon finish. To save this final right shift, it's enough to start with a smaller left shift at the beginning. So, unrolled 8x8 multiplication goes in full speed, as
'I8xI8-->I16 multiply, unrolled SHL arg1, #7 ;That's enough SHR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC RCR arg2, #1 WC IF_C ADD arg2, arg1 WC
The fast I32xI32 Optimized Kenyan Multiplication routine beats every of its I32xI32 competitors, so its high time to teach it a lesson. It can do the 8x8 multiplication table at an average 30 machine cycles per multiplication. Maybe it deals too much with that argument checking/swapping at the beginning. Yes, the simple Kenyan does the job in 29 average machine cycles on these small numbers. But, their timing results are far away from the 17 machine cycle speed of the unrolled I8xI8 multiplication. Mike's code is so straightforward, that I do not see too many chances to make a faster 8x8 multiplier. Or, maybe...
A Look up Table?
============
The fastest way of doing a multiplication not to do it at all. At least not in action time. Instead we can calculate all possible results in advance and store them in the memory. Now, whenever we need the result of a multiplication, we simply look it up in the result-table. The multiplication of two 8-bit numbers has, however, 65_536 results of 16 bits each, resulting in a table of 128Kb length. Not to mention the indexing, which would need another 8x8 multiplier. The commutativity of the multiplication may save us half of these bytes, but the necessary 64K are a lot for this simple job on the Propeller. And to be fast, the real·challenge is to squeeze that 64K into a COG. Before one says, that's impossible...
Secondary school math can help to squeeze a 64K table into a COG
============================================
To prove such algebraic equation like this one
a*b = ((a+b)/2)^2 - ((a-b)/2)^2
was very boring that time, and it is boring even today. So, let us believe it is true. It is better to be so, since then we will need only 512 items in the tables. The (a+b) expression can take 512 different inputs [noparse][[/noparse]0...511] and (a-b) can take 512 different [noparse][[/noparse]-128,...,127] inputs, too. Both expressions are squared to obtain the table values, so the second table can be merged into the first table. So, we need a 1 dimensional table with index [noparse][[/noparse]0...511] where each element contains the square of the index. That takes 512 registers. Unfortunately, that compression is not is not enough for a COG, as there are some system registers, and we will have other ones to run our code, too. We can make some further compression, however. The first half the square numbers of the table fits into 16-bit, so there is a chance to compress the table a further 25%. Finally, those square numbers can be stored in (256+128) COG registers. As a result, we can 'implode' the original 128K 2-D multiplication table into a 1-D array of 384 32-bit registers of a COG. Well, we will need some code to address, unpack the table and generate the result of the product with a subtraction. The code should take less than 17 machine cycles to be faster than the unrolled multiplication. If not, all this will be only a fair attempt to make the (almost) impossible.
Assuming the arguments A in a1 and B in a2
···· Pseudocode·············· Machine cycle
Calculate··· a1=(A + ·········· 1
Calculate··· a2=ABS(B - A)····· 2
Table read·· r1=TABLE[noparse][[/noparse]a1]····· X
Table read·· r2=TABLE[noparse][[/noparse]a2]····· X
Result········ r1=(r1 - r2)······ ·· 1··
If the 2 table lookups can be done in 4-10 machine cycles, then the speed of the 8x8 multiplication can be inreased a little bit. Although, the price of this 20-25% speed gain, in Propeller architecture, is maybe too high. 40% sounds better. If only someone could make a 5-6 machine cycle table lookup...
Cheers,
Istvan
Post Edited (cessnapilot) : 8/16/2009 8:17:37 PM GMT
Comments
The kenyan multiplication will help out on my audio engine driver.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,
Thanks. What is your audio driver doing? Does it filter, synthetise or does it· change the pitches of the sound? (and not of that sound of the pitches (Oops, sorry)). Let us know of your advance in it.
Cheers,
Istvan
The same technique could be used up to 16x16 with the appropriate unroll. Heck, the 8x8 routine is applicable up to 25x8. Hrmm... if you used SAR instead of RCR/SHR, then arg1 could be signed. With a little pre-processing of arg2, the same routine could be used for signed multiplication too.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Composite NTSC sprite driver: Forum
NTSC & PAL driver templates: ObEx Forum
OnePinTVText driver: ObEx Forum
But, I need to to be able to run fast so that the user can change the tempo up to 48000Khz and everywhere in between.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,
That 8x8 unrolled code comes from Mike Green. I only tested it and it works. The addition before the RCR clears the carry (for byte arguments), so zero is shifted into the MSB. On the other side, you are·rigth, SHR would be·safer·for the job to get the LSB. With the 'signed' variant, you are rigth, too. Thanks for the tipps.
However, when you unroll the I16xI16 multiplication, it runs in 33 cycles.·Both·I32xI32 kenyans run faster, on average, than that, and they are smaller.··Imagine what happens when you unroll the I32xI32 standard code: the bigger the size, the slower the code will be. Or, if one of the the incoming bytes is usually smaller, than 16, the improved kenyan boosts its rockets and runs within 15, or so cycles. If you know, which one will be the smaller byte, It runs within 12 machine cycles.
Cheers,
Istvan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
propmod_us and propmod_1x1 are in stock. Only $30. PCB available for $5
Want to make projects and have Gadget Gangster sell them for you? propmod-us_ps_sd and propmod-1x1 are now available for use in your Gadget Gangster Projects.
Need to upload large images or movies for use in the forum. you can do so at uploader.propmodule.com for free.
Yes, but less time or less speed sometimes make a difference. The 8x8, 16x8m, 24x8 and the 32x8 unrolled multiplications take practically the same number of machine cycles on a 32-bit controller, if you unroll them in the right direction.
Cheers,
Istvan
Implode a 128KB multiplication table into a COG. (1KB is enough)
Checked!
Do a faster 8x8 multiplication than the· brute force unrolled code.
Checked!
Speed gain 23%
Here is the 128KB table imploded into 1KB in a COG
SPIN did it with this code
And here is the ultra fast 8x8 multiplication PASM code
I think this is maybe the fastes ever 8x8 multiplication·code for the Propeller.·Attachment, of course, contains a working version.
I will be happy to see an ever faster one, so 100$ goes to the first who can do it faster with another method. (Code size arbitrary, input bytes in arg1, arg2, result anywhere, unshifted)
Cheers,
Istvan
with
That makes it faster. Do I win $100?
BTW, you'll need to invert the sense of your tests for carry later in the program.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 8/16/2009 9:14:01 PM GMT
Here is the new and·faster version, according to your idea
It works now 30% faster than unrolled code. More than 1.5 million 8x8 multiplications per second.
Altogether we profit much-much more than 100$ from·your posts. Thank you.
Cheers,
Istvan
PS.: E-mail me how to transfer the bucks.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
It works now 30% faster than unrolled code. More than 1.5 million 8x8 multiplications per second.
HOLY POOP!!!! 1.5 MILLION multiplies/sec?! DAAAAAAAAAAAAAAAAAAAAAAAAAAAMMMN thats quick!! Good work!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Quicker answers in the #propeller chat channel on freenode.net. Don't know squat about IRC? Download Pigin! So easy a caveman could do it...
http://folding.stanford.edu/ - Donating some CPU/GPU downtime just might lead to a cure for cancer! My team stats.
Forget the $100. I just do this for fun! Here's a 12-instruction version:
The table is arranged 16:16 instead of %00:16:14. This allows the ANDing to be performed on the difference instead of on each argument. I've also eliminated the temp variables.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 8/16/2009 11:01:00 PM GMT
That's, 1_666_666 Multiplications per seconds.... Wait, you'll need jmps and other stuff so it won't be that fast.
Nice job.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,
-P.
You are right. It works, see first attachment that shows PST output. ·Your code goes into the "Library of Proven Fast Codes", along with the unrolled MUL, of course, which is not so fast, but very small. The attached "PASM_TestPad.spin" contains the code with the new table.
It will take some time for me to figure out, how does that gap of 2 zero bits act to make that simplification possible.
As for the $100, Thank you. I keep the note for further challenges, but I shall invest the sum into BlueTooth modules from Parallax. During the calibration of the gyro outputs of the 6DOF IMU, cables twist up quickly and everything becomes a mess with broken connectors and wires. (Or, I should not answer phones in the next room, while at work with it.)
Cheers,
Istvan