Fast, Faster, Fastest Code: 8x8 MUL - MISSION ACCOMPLISHED, faster than unrolle

cessnapilot · 2009-08-05 16:32

Hi All,

I found Mike Green's unrolled multiplication code on the forum. It looks like the probable speed champion for the special task of 8-bit times 8-bit multiplication:

'I8xI8-->I16 multiply, unrolled 
SHL arg1, #16          'A little bit more than necessary 
SHR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC

This published code can be optimized a little, because the product ends up shifted left in its register when the routine finishes, so a final right shift would be needed. To save that final right shift, it is enough to start with a smaller left shift at the beginning. The unrolled 8x8 multiplication then runs at full speed:

'I8xI8-->I16 multiply, unrolled 
SHL arg1, #7           'That's enough 
SHR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC
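
To see what the smaller pre-shift changes, here is a minimal Python sketch that models the 17-instruction listing on 32-bit registers. The function name `unrolled_mul` and the `preshift` parameter are hypothetical and used only for illustration. The carry handling follows the usual reading of SHR/RCR/ADD with WC, which is an assumption of this model and not a substitute for running the code on a Propeller.

```python
MASK32 = 0xFFFFFFFF

def unrolled_mul(a, b, preshift):
    """Model of the unrolled routine for bytes a (arg1) and b (arg2)."""
    arg1 = (a << preshift) & MASK32      # SHL arg1, #preshift
    carry = b & 1                        # SHR arg2, #1 WC -> C = old bit 0
    arg2 = b >> 1
    for step in range(8):                # 8 conditional ADDs in total
        if carry:                        # IF_C ADD arg2, arg1 WC
            total = arg2 + arg1
            carry = total >> 32          # C = unsigned overflow of the ADD
            arg2 = total & MASK32
        if step < 7:                     # 7 RCRs, none after the last ADD
            new_carry = arg2 & 1         # RCR ... WC -> C = old bit 0
            arg2 = (arg2 >> 1) | (carry << 31)
            carry = new_carry
    return arg2

if __name__ == "__main__":
    print(unrolled_mul(200, 100, 7))        # plain product
    print(unrolled_mul(200, 100, 16) >> 9)  # same product after a final shift
```

In this model the `#16` variant leaves the product shifted left by 9 bits (16 minus the 7 rotates that follow the first ADD), while the `#7` variant leaves it unshifted.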

The fast I32xI32 Optimized Kenyan Multiplication routine beats every one of its I32xI32 competitors, so it's high time to teach it a lesson. It does the 8x8 multiplication table at an average of 30 machine cycles per multiplication. Maybe it spends too much time on the argument checking/swapping at the beginning: the simple Kenyan does the job in 29 machine cycles on average on these small numbers. Both timings, however, are far from the 17 machine cycles of the unrolled I8xI8 multiplication. Mike's code is so straightforward that I do not see many chances to make a faster 8x8 multiplier. Or, maybe...

A Look up Table?
============
The fastest way of doing a multiplication is not to do it at all, at least not at run time. Instead, we can calculate all possible results in advance and store them in memory. Then, whenever we need the result of a multiplication, we simply look it up in the result table. The multiplication of two 8-bit numbers has, however, 65_536 results of 16 bits each, which gives a table 128 KB long. Not to mention the indexing, which would need another 8x8 multiplier. The commutativity of multiplication may save us half of these bytes, but the remaining 64 KB is still a lot for this simple job on the Propeller. And to be fast, the real challenge is to squeeze that 64 KB into a COG. Before anyone says that's impossible...

Secondary school math can help to squeeze a 64K table into a COG
============================================
Proving an algebraic identity like this one

a*b = ((a+b)/2)^2 - ((a-b)/2)^2

was very boring back then, and it is boring even today. So let us just believe it is true. It had better be, since then we need only 512 items in the tables. For byte arguments, (a+b) takes values in [0...510] and (a-b) takes values in [-255...255]. Both expressions are squared to obtain the table values, so the sign of (a-b) does not matter and the second table can be merged into the first. So we need a 1-dimensional table with index [0...511], where each element contains the square of the index divided by 4. That takes 512 registers. Unfortunately, that compression is not enough for a COG, as there are some system registers, and we need others to run our code, too. We can compress further, however. The square numbers in the first half of the table fit into 16 bits, so there is a chance to shrink the table by a further 25%. In the end, those square numbers can be stored in (256+128) COG registers. As a result, we can 'implode' the original 128 KB 2-D multiplication table into a 1-D array of 384 32-bit COG registers. We will still need some code to address and unpack the table and to generate the product with a subtraction. That code should take less than 17 machine cycles to be faster than the unrolled multiplication. If not, all this will only be a fair attempt at the (almost) impossible.
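
For readers who would rather check than believe, here is a small Python sketch that tests the identity for every pair of bytes, using integer (floor) division by 4 as a table of whole numbers would. The helper names `quarter_square` and `identity_holds` are made up for this illustration. The floor is harmless because a+b and a-b always have the same parity, so the fraction dropped from one term is also dropped from the other.

```python
def quarter_square(n):
    """Floor of n^2 / 4, the value a table entry would hold for index n."""
    return (n * n) >> 2

def identity_holds():
    """Check a*b == qs(a+b) - qs(|a-b|) for all unsigned bytes a, b."""
    return all(
        quarter_square(a + b) - quarter_square(abs(a - b)) == a * b
        for a in range(256)
        for b in range(256)
    )

# Largest entries in the two halves of the table
lower_max = max(quarter_square(n) for n in range(256))       # indexes 0..255
upper_max = max(quarter_square(n) for n in range(256, 511))  # indexes 256..510

if __name__ == "__main__":
    print(identity_holds(), lower_max.bit_length(), upper_max.bit_length())
```

Under these assumptions the lower half needs at most 14 bits per entry and the upper half at most 16 bits, which is what makes packing two entries into one 32-bit register possible.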

Assuming the arguments A in a1 and B in a2

Pseudocode                        Machine cycles

Calculate    a1=(A + B)           1
Calculate    a2=ABS(B - A)        2
Table read   r1=TABLE[a1]         X
Table read   r2=TABLE[a2]         X
Result       r1=(r1 - r2)         1

If the 2 table lookups can be done in 4-10 machine cycles, then the speed of the 8x8 multiplication can be increased a little bit. However, on the Propeller architecture the price of this 20-25% speed gain is maybe too high. 40% sounds better. If only someone could make a 5-6 machine cycle table lookup...

Cheers,

Istvan

Post Edited (cessnapilot) : 8/16/2009 8:17:37 PM GMT

Kye · 2009-08-05 19:44

Great Stuff, great stuff.

The kenyan multiplication will help out on my audio engine driver.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,

cessnapilot · 2009-08-06 09:22

Hi Kye,

Thanks. What is your audio driver doing? Does it filter, synthesize, or change the pitch of the sound? (And not the sound of the pitches. Oops, sorry.) Let us know how you get on with it.

Cheers,

Istvan

ericball · 2009-08-06 13:51

Why are you using RCR instead of SHR? That will shift in the carry from the previous step, which isn't what you want.

The same technique could be used up to 16x16 with the appropriate unroll. Heck, the 8x8 routine is applicable up to 25x8. Hrmm... if you used SAR instead of RCR/SHR, then arg1 could be signed. With a little pre-processing of arg2, the same routine could be used for signed multiplication too.


Kye · 2009-08-06 14:12

Much simpler. It's just a full-duplex audio DAC and ADC.

But I need it to be able to run fast, so that the user can change the tempo up to 48000Khz and everywhere in between.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,

cessnapilot · 2009-08-06 15:09

Hi ericball,

That 8x8 unrolled code comes from Mike Green. I only tested it, and it works. For byte arguments the addition before each RCR leaves the carry clear, so zero is shifted into the MSB. On the other hand, you are right: SHR would be safer for getting the LSB. You are right about the 'signed' variant, too. Thanks for the tips.
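
The carry argument can be checked with a small Python model of the SHL #7 variant that also records whether any ADD ever overflowed 32 bits. The name `unrolled_mul_trace` is hypothetical, and the flag behaviour is the usual reading of the WC effects, so treat this as a sketch and not as a replacement for a hardware test.

```python
MASK32 = 0xFFFFFFFF

def unrolled_mul_trace(a, b):
    """Return (result, carry_seen) for the SHL #7 unrolled routine."""
    arg1 = (a << 7) & MASK32             # SHL arg1, #7
    carry = b & 1                        # SHR arg2, #1 WC
    arg2 = b >> 1
    carry_seen = False
    for step in range(8):
        if carry:                        # IF_C ADD arg2, arg1 WC
            total = arg2 + arg1
            carry = total >> 32          # set only if the ADD overflowed
            arg2 = total & MASK32
            carry_seen = carry_seen or bool(carry)
        if step < 7:                     # RCR arg2, #1 WC
            new_carry = arg2 & 1
            arg2 = (arg2 >> 1) | (carry << 31)  # C rotates into the MSB
            carry = new_carry
    return arg2, carry_seen

if __name__ == "__main__":
    print(unrolled_mul_trace(255, 255))
    print(unrolled_mul_trace(0x1FFFFFF, 3))
```

In this model no ADD overflows when both arguments are bytes, so RCR and SHR behave the same. With a 25-bit arg1 an ADD can overflow, and RCR then rotates that bit back into the MSB, which is consistent with ericball's remark about wider first arguments as long as the final product still fits into 32 bits.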

However, when you unroll the I16xI16 multiplication, it runs in 33 cycles. Both I32xI32 Kenyans run faster than that on average, and they are smaller. Imagine what happens when you unroll the standard I32xI32 code: the bigger the size, the slower the code will be. And if one of the incoming bytes is usually smaller than 16, the improved Kenyan fires its rockets and runs within 15 or so cycles. If you know which one will be the smaller byte, it runs within 12 machine cycles.

Cheers,

Istvan

mctrivia · 2009-08-06 15:14

the only place your method seems to do less is the unrolled I24xI8


cessnapilot · 2009-08-07 03:57

@mctrivia

Yes, but less time or less speed sometimes makes a difference. The 8x8, 16x8, 24x8 and the 32x8 unrolled multiplications take practically the same number of machine cycles on a 32-bit controller, if you unroll them in the right direction.

Cheers,
Istvan

cessnapilot · 2009-08-16 20:39

Checklist:

Implode a 128KB multiplication table into a COG. (1KB is enough)

Checked!

Do a faster 8x8 multiplication than the brute force unrolled code.

Checked!

Speed gain 23%

Here is the 128KB table imploded into 1KB in a COG

'256x256 (8-bit)x(8-bit) unsigned multiplication table,
'128KB data imploded into 256 32-bit COG registers (1KB)
 
 
LONG  $10000000,  $10200000,  $10404001,  $10608002 
LONG  $10810004,  $10A18006,  $10C24009,  $10E3000C 
LONG  $11040010,  $11250014,  $11464019,  $1167801E 
LONG  $11890024,  $11AA802A,  $11CC4031,  $11EE0038 
LONG  $12100040,  $12320048,  $12544051,  $1276805A 
LONG  $12990064,  $12BB806E,  $12DE4079,  $13010084 
LONG  $13240090,  $1347009C,  $136A40A9,  $138D80B6 
LONG  $13B100C4,  $13D480D2,  $13F840E1,  $141C00F0 
LONG  $14400100,  $14640110,  $14884121,  $14AC8132 
LONG  $14D10144,  $14F58156,  $151A4169,  $153F017C 
LONG  $15640190,  $158901A4,  $15AE41B9,  $15D381CE 
LONG  $15F901E4,  $161E81FA,  $16444211,  $166A0228 
LONG  $16900240,  $16B60258,  $16DC4271,  $1702828A 
LONG  $172902A4,  $174F82BE,  $177642D9,  $179D02F4 
LONG  $17C40310,  $17EB032C,  $18124349,  $18398366 
LONG  $18610384,  $188883A2,  $18B043C1,  $18D803E0 
LONG  $19000400,  $19280420,  $19504441,  $19788462 
LONG  $19A10484,  $19C984A6,  $19F244C9,  $1A1B04EC 
LONG  $1A440510,  $1A6D0534,  $1A964559,  $1ABF857E 
LONG  $1AE905A4,  $1B1285CA,  $1B3C45F1,  $1B660618 
LONG  $1B900640,  $1BBA0668,  $1BE44691,  $1C0E86BA 
LONG  $1C3906E4,  $1C63870E,  $1C8E4739,  $1CB90764 
LONG  $1CE40790,  $1D0F07BC,  $1D3A47E9,  $1D658816 
LONG  $1D910844,  $1DBC8872,  $1DE848A1,  $1E1408D0 
LONG  $1E400900,  $1E6C0930,  $1E984961,  $1EC48992 
LONG  $1EF109C4,  $1F1D89F6,  $1F4A4A29,  $1F770A5C 
LONG  $1FA40A90,  $1FD10AC4,  $1FFE4AF9,  $202B8B2E 
LONG  $20590B64,  $20868B9A,  $20B44BD1,  $20E20C08 
LONG  $21100C40,  $213E0C78,  $216C4CB1,  $219A8CEA 
LONG  $21C90D24,  $21F78D5E,  $22264D99,  $22550DD4 
LONG  $22840E10,  $22B30E4C,  $22E24E89,  $23118EC6 
LONG  $23410F04,  $23708F42,  $23A04F81,  $23D00FC0 
LONG  $24001000,  $24301040,  $24605081,  $249090C2 
LONG  $24C11104,  $24F19146,  $25225189,  $255311CC 
LONG  $25841210,  $25B51254,  $25E65299,  $261792DE 
LONG  $26491324,  $267A936A,  $26AC53B1,  $26DE13F8 
LONG  $27101440,  $27421488,  $277454D1,  $27A6951A 
LONG  $27D91564,  $280B95AE,  $283E55F9,  $28711644 
LONG  $28A41690,  $28D716DC,  $290A5729,  $293D9776 
LONG  $297117C4,  $29A49812,  $29D85861,  $2A0C18B0 
LONG  $2A401900,  $2A741950,  $2AA859A1,  $2ADC99F2 
LONG  $2B111A44,  $2B459A96,  $2B7A5AE9,  $2BAF1B3C 
LONG  $2BE41B90,  $2C191BE4,  $2C4E5C39,  $2C839C8E 
LONG  $2CB91CE4,  $2CEE9D3A,  $2D245D91,  $2D5A1DE8 
LONG  $2D901E40,  $2DC61E98,  $2DFC5EF1,  $2E329F4A 
LONG  $2E691FA4,  $2E9F9FFE,  $2ED66059,  $2F0D20B4 
LONG  $2F442110,  $2F7B216C,  $2FB261C9,  $2FE9A226 
LONG  $30212284,  $3058A2E2,  $30906341,  $30C823A0 
LONG  $31002400,  $31382460,  $317064C1,  $31A8A522 
LONG  $31E12584,  $3219A5E6,  $32526649,  $328B26AC 
LONG  $32C42710,  $32FD2774,  $333667D9,  $336FA83E 
LONG  $33A928A4,  $33E2A90A,  $341C6971,  $345629D8 
LONG  $34902A40,  $34CA2AA8,  $35046B11,  $353EAB7A 
LONG  $35792BE4,  $35B3AC4E,  $35EE6CB9,  $36292D24 
LONG  $36642D90,  $369F2DFC,  $36DA6E69,  $3715AED6 
LONG  $37512F44,  $378CAFB2,  $37C87021,  $38043090 
LONG  $38403100,  $387C3170,  $38B871E1,  $38F4B252 
LONG  $393132C4,  $396DB336,  $39AA73A9,  $39E7341C 
LONG  $3A243490,  $3A613504,  $3A9E7579,  $3ADBB5EE 
LONG  $3B193664,  $3B56B6DA,  $3B947751,  $3BD237C8 
LONG  $3C103840,  $3C4E38B8,  $3C8C7931,  $3CCAB9AA 
LONG  $3D093A24,  $3D47BA9E,  $3D867B19,  $3DC53B94 
LONG  $3E043C10,  $3E433C8C,  $3E827D09,  $3EC1BD86 
LONG  $3F013E04,  $3F40BE82,  $3F807F01,  $3FC03F80

SPIN generated it with this code:

'This prints imploded 8x8 multiplication table on PST screen
'I used copy/paste to insert this table into SPIN code
 
PST.Char(PST#NL)
REPEAT i FROM 0 TO 63
  PST.Str(STRING("LONG  "))  
  REPEAT j FROM 0 TO 3
    vl := (i * 4) + j
    vh := vl + 256
    h := (vh * vh) >> 2
    l := (vl * vl) >> 2
    r := (h<<14) + l
    PST.Str(STRING("$"))
    PST.Hex(r,8)
    IF (j < 3)
      PST.Str(STRING(",  "))
  PST.Char(PST#NL)
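
For anyone without a Propeller at hand, the same packing can be reproduced on a PC. The following Python sketch mirrors the SPIN loop (upper quarter-square shifted left by 14 bits, lower quarter-square in the low 14 bits). The function names `imploded_table` and `format_rows` are invented for this example.

```python
def imploded_table():
    """Build the 256 packed longs the way the SPIN loop does."""
    table = []
    for vl in range(256):
        vh = vl + 256
        h = (vh * vh) >> 2          # quarter-square for index 256..511
        l = (vl * vl) >> 2          # quarter-square for index 0..255
        table.append((h << 14) + l)
    return table

def format_rows(table):
    """Format four values per line in the same style as the listing."""
    rows = []
    for i in range(0, len(table), 4):
        values = ",  ".join("$%08X" % v for v in table[i:i + 4])
        rows.append("LONG  " + values)
    return rows

if __name__ == "__main__":
    for row in format_rows(imploded_table()):
        print(row)
```

The first and last rows produced this way match the first and last rows of the table listed in the post.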

And here is the ultra fast 8x8 multiplication PASM code

'Imploded Table lookup multiplication of 2 unsigned bytes. The table of
'this code fits into a COG. It is 23% faster than the unrolled code.
 
'Prepare Table addresses (A+B) and |A-B| from arg1=A and arg2=B
MOV      r1,             arg1        'Save arg1
ADD      arg1,           arg2        'arg1=A+B
SUB      arg2,           r1          'arg2=B-A
ABS      arg2,           arg2        'arg2=|B-A|=|A-B|
 
'Now indexes are prepared. For A+B we have to check upper or lower region
CMP      arg1,           #256 WC     'Check for 'Upper index'     
IF_NC SUB arg1,          #256        'If so, denorm it
 
'Write these indices into the sfield of the table readout commands
'Table starts at 0 address 
MOVS     :GetAplusB2,    arg1         'Index for ((A+B)^2)/4 
MOVS     :GetAminusB2,   arg2         'Index for ((A-B)^2)/4
 
'Read data table for folded index |A+B|
:GetAplusB2
MOV      r1,             0-0         'When this executes 0-0 is arg1
IF_NC SHR r1,            #14         'Original index was greater than 255,
                                     'Shift table item right 14-bit
IF_C AND  r1,            masklow14   'Original index was less than 256.
                                     'Mask lower 14 bits
 
'Read data table at straight index |A-B|
:GetAminusB2
MOV      r2,             0-0          'When this executes 0-0 is arg2
'|A-B| is always less than 256, so we do not have to shift 
AND      r2,             masklow14    'But we have to mask   
 
'Data prepared
SUB      r1,             r2           'Get final result
 
'Result in r1 after 14 machine cycles

I think this may be the fastest 8x8 multiplication code ever for the Propeller. The attachment, of course, contains a working version.
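
The data flow of the 14-instruction routine can be followed in a short Python model. It is only a sketch: `table_mul` is a hypothetical name, the self-modifying MOVS lookups are replaced by ordinary list indexing, and `MASKLOW14` is assumed to be $3FFF because the post does not show the definition of masklow14.

```python
MASKLOW14 = 0x3FFF  # assumed value of masklow14 (low 14 bits set)

# Packed table: upper quarter-square in bits 14..29, lower in bits 0..13
TABLE = [((((i + 256) ** 2) >> 2) << 14) + ((i * i) >> 2) for i in range(256)]

def table_mul(a, b):
    """Model of the imploded-table multiply for unsigned bytes a and b."""
    s = a + b                    # ADD arg1, arg2
    d = abs(b - a)               # SUB + ABS
    upper = s >= 256             # CMP arg1, #256 WC (no carry means upper)
    if upper:
        s -= 256                 # IF_NC SUB arg1, #256
    r1 = TABLE[s]                # MOV r1, 0-0 after MOVS
    if upper:
        r1 >>= 14                # take the upper-region entry
    else:
        r1 &= MASKLOW14          # take the lower-region entry
    r2 = TABLE[d] & MASKLOW14    # |A-B| is always below 256
    return r1 - r2               # SUB r1, r2

if __name__ == "__main__":
    print(table_mul(200, 100))
```

Checking all 65_536 byte pairs against the ordinary product is a cheap way to gain confidence in the table layout before loading it into a COG.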

I will be happy to see an even faster one, so $100 goes to the first person who can do it faster with another method. (Code size arbitrary, input bytes in arg1 and arg2, result anywhere, unshifted.)

Cheers,

Istvan

Phil Pilgrim (PhiPi) · 2009-08-16 21:01

You can replace:

          CMP    arg1,#256 WC     'Check for 'Upper index'     
    IF_NC SUB    arg1, #256        'If so, denorm it

with

          CMPSUB arg1,#256 WC

That makes it faster. Do I win $100?

BTW, you'll need to invert the sense of your tests for carry later in the program.

-Phil

Post Edited (Phil Pilgrim (PhiPi)) : 8/16/2009 9:14:01 PM GMT

cessnapilot · 2009-08-16 21:57

Yes Phil, you do, of course.

Here is the new and faster version, according to your idea:

'Imploded Table lookup multiplication of 2 unsigned bytes. The table of
'this code fits into a COG. It is 30% faster than the unrolled code.
'Prepare Table addresses (A+B) and |A-B| from arg1=A and arg2=B

MOV      r1,             arg1        'Save arg1
ADD      arg1,           arg2        'arg1=A+B
SUB      arg2,           r1          'arg2=B-A
ABS      arg2,           arg2        'arg2=|B-A|=|A-B|
 
'Now indexes are prepared. For A+B we have to check upper or lower region
CMPSUB   arg1,           #256 WC     'Check for 'Upper index' efficiently
                                     'Thanks to Phil Pilgrim's idea
 
'Write these indices into the sfield of the table readout commands
'Table starts at 0 address 
MOVS     :GetAplusB2,    arg1         'Index for ((A+B)^2)/4 
MOVS     :GetAminusB2,   arg2         'Index for ((A-B)^2)/4
 
'Read data table for folded index |A+B|
:GetAplusB2
MOV      r1,             0-0         'When this executes 0-0 is arg1
IF_C SHR r1,             #14         'Original index was greater than 255,
                                     'Shift table item right 14-bit
IF_NC AND r1,            masklow14   'Original index was less than 256.
                                     'Mask lower 14 bits
 
'Read data table at straight index |A-B|
:GetAminusB2
MOV      r2,             0-0          'When this executes 0-0 is arg2
'|A-B| is always less than 256, so we do not have to shift 
AND      r2,             masklow14    'But we have to mask   
 
'Data prepared
SUB      r1,             r2           'Get final result
 
'Result in r1 after 13 machine cycles

It works now 30% faster than unrolled code. More than 1.5 million 8x8 multiplications per second.

Altogether, we profit much, much more than $100 from your posts. Thank you.

Cheers,

Istvan

PS.: E-mail me how to transfer the bucks.

heater · 2009-08-16 22:04

Wow, just what I need for the MUL instruction of the MoCog 6809 emulator !

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

RinksCustoms · 2009-08-16 22:14

"It works now 30% faster than unrolled code. More than 1.5 million 8x8 multiplications per second."

HOLY POOP!!!!

1.5 MILLION multiplies/sec?! DAAAAAAAAAAAAAAAAAAAAAAAAAAAMMMN that's quick!! Good work!


Phil Pilgrim (PhiPi) · 2009-08-16 22:24

cessnapilot,

Forget the $100. I just do this for fun!

Here's a 12-instruction version:

DAT

'Imploded Table lookup multiplication of 2 unsigned bytes. The table of
'this code fits into a COG. It is 23% faster than the unrolled code.
 
'Prepare Table addresses (A+B) and |A-B| from arg1=A and arg2=B

                                                ' arg1       arg2
                                                '   A          B
              add       arg1,arg2               ' A+B          B
              shl       arg2,#1                 ' A+B         2B
              sub       arg2,arg1               ' A+B        B-A
              abs       arg2,arg2               ' A+B       |B-A|
 
'Now indexes are prepared. For A+B we have to check upper or lower region

              cmpsub    arg1,#256 wc            'Check for 'Upper index'     
 
'Write these indices into the sfield of the table readout commands
'Table starts at 0 address
 
              movs      :GetAplusB2,arg1        'Index for ((A+B)^2)/4 
              movs      :GetAminusB2,arg2       'Index for ((A-B)^2)/4
 
'Read data table for folded index |A+B|

:GetAplusB2   mov       arg1,0-0                'When this executes 0-0 is arg1
        if_c  shr       arg1,#16                'Original index was greater than 255,
                                                '  so shift table item right 16-bit
 
'Read data table at straight index |A-B|

:GetAminusB2  mov       arg2,0-0                '|A-B| is always less than 256, so we do not have to shift
 
'Data prepared

              sub       arg1,arg2               'Get final result
              and       arg1,_0xffff            'AND out top 16 bits.
 
'Result in arg1 after 12 machine cycles.

'______________________________________________________

'Table

              long      $4000_0000,$4080_0000,$4101_0001,$4182_0002
              '...
              long      $fc04_3e04,$fd02_3e82,$fe01_3f01,$ff00_3f80

The table is arranged 16:16 instead of %00:16:14. This allows the ANDing to be performed on the difference instead of on each argument. I've also eliminated the temp variables.
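
Why a single AND on the difference is enough can be seen in a Python model of the 16:16 layout. This is a sketch with hypothetical names (`TABLE16`, `table_mul_16_16`), and the constant `_0xffff` from the listing is assumed to hold $FFFF. The subtraction works modulo 2^32, and the low 16 bits of a difference depend only on the low 16 bits of the operands. Whatever the unused upper halfwords contribute therefore lands in bits 16..31, where the final AND removes it.

```python
MASK32 = 0xFFFFFFFF

# 16:16 layout: upper quarter-square in the high word, lower in the low word
TABLE16 = [((((i + 256) ** 2) >> 2) << 16) | ((i * i) >> 2) for i in range(256)]

def table_mul_16_16(a, b):
    """Model of the 12-instruction version for unsigned bytes a and b."""
    s = a + b                      # add arg1, arg2
    d = abs(b - a)                 # shl, sub, abs
    upper = s >= 256               # cmpsub arg1, #256 wc sets C when it subtracts
    if upper:
        s -= 256
    r1 = TABLE16[s]                # mov arg1, 0-0
    if upper:
        r1 >>= 16                  # if_c shr arg1, #16
    r2 = TABLE16[d]                # mov arg2, 0-0 (no shift, no mask)
    diff = (r1 - r2) & MASK32      # sub arg1, arg2 wraps at 32 bits
    return diff & 0xFFFF           # and arg1, _0xffff

if __name__ == "__main__":
    print(table_mul_16_16(200, 100))
```

The model reproduces the first and last table entries shown in the post and returns the ordinary product for every pair of bytes.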

-Phil

Post Edited (Phil Pilgrim (PhiPi)) : 8/16/2009 11:01:00 PM GMT

Kye · 2009-08-16 23:02

Mmm, takes 48 clock cycles. (Each instruction takes 4 cycles.)

That's 1_666_666 multiplications per second... Wait, you'll need jmps and other stuff, so it won't be that fast.

Nice job.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,

Phil Pilgrim (PhiPi) · 2009-08-16 23:13

Right: 1 machine cycle = 4 clock cycles.
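
The throughput figures quoted in the thread follow from simple arithmetic. The sketch below assumes an 80 MHz system clock, which is what the quoted numbers imply but is not stated in the posts, and it ignores call, return and loop overhead. The function name `multiplies_per_second` is made up for the example.

```python
CLOCK_HZ = 80_000_000        # assumed system clock, not stated in the thread
CLOCKS_PER_INSTRUCTION = 4   # 1 machine cycle = 4 clock cycles

def multiplies_per_second(instructions, clock_hz=CLOCK_HZ):
    """Upper bound on multiplies per second for a straight-line routine."""
    return clock_hz // (instructions * CLOCKS_PER_INSTRUCTION)

if __name__ == "__main__":
    for n in (17, 13, 12):
        print(n, multiplies_per_second(n))
```

Under that assumption the 13-instruction version gives a little over 1.5 million multiplies per second and the 12-instruction version about 1.67 million, before any overhead.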

-P.

cessnapilot · 2009-08-17 07:58

Phil,

You are right. It works; see the first attachment, which shows the PST output. Your code goes into the "Library of Proven Fast Codes", along with the unrolled MUL, of course, which is not as fast but very small. The attached "PASM_TestPad.spin" contains the code with the new table.

It will take me some time to figure out how that gap of 2 zero bits makes the simplification possible.

As for the $100, Thank you. I keep the note for further challenges, but I shall invest the sum into BlueTooth modules from Parallax. During the calibration of the gyro outputs of the 6DOF IMU, cables twist up quickly and everything becomes a mess with broken connectors and wires. (Or, I should not answer phones in the next room, while at work with it.)

Cheers,

Istvan
