Fast, Faster, Fastest Code: 8x8 MUL - MISSION ACCOMPLISHED, faster than unrolle — Parallax Forums

cessnapilot Posts: 182
edited 2009-08-17 07:58 in Propeller 1
Hi All,

I have found Mike Green's unrolled multiplication code on the forum. It is a probable speed champion for the special task of 8-bit times 8-bit multiplication:

'I8xI8-->I16 multiply, unrolled 
SHL arg1, #16          'A little bit more than necessary 
SHR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC

This published code needs a little optimization: the product ends up shifted left (by 9 bits) in its register when the routine finishes, so a final right shift is still required. To save this final right shift, it's enough to start with a smaller left shift at the beginning. So the unrolled 8x8 multiplication runs at full speed as

'I8xI8-->I16 multiply, unrolled 
SHL arg1, #7           'That's enough 
SHR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC 
RCR arg2, #1 WC 
IF_C ADD arg2, arg1 WC
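
The effect of the smaller initial shift can be checked off-chip. What follows is only a minimal Python model of the two listings above, not PASM. The function name `unrolled_mul` and the way the carry flag is modelled are assumptions made for illustration. Under those assumptions, `SHL arg1, #16` leaves the product shifted left by 9 bits, while `SHL arg1, #7` leaves it unshifted.

```python
MASK32 = 0xFFFFFFFF  # COG registers are 32 bits wide


def unrolled_mul(a, b, pre_shift):
    """Model of the unrolled 8x8 listing; pre_shift is the initial SHL amount."""
    arg1 = (a << pre_shift) & MASK32      # SHL arg1, #pre_shift
    arg2 = b
    carry = arg2 & 1                      # SHR arg2, #1 WC
    arg2 >>= 1
    for step in range(8):
        if carry:                         # IF_C ADD arg2, arg1 WC
            total = arg2 + arg1
            carry = total >> 32           # carry out of the 32-bit add
            arg2 = total & MASK32
        if step < 7:                      # RCR arg2, #1 WC (7 of them)
            next_carry = arg2 & 1         # bit 0 goes to the carry flag
            arg2 = (arg2 >> 1) | (carry << 31)
            carry = next_carry
    return arg2


print(unrolled_mul(200, 123, 7))          # product, unshifted
print(unrolled_mul(200, 123, 16) >> 9)    # same product after the extra shift
```

The model only covers byte arguments, which is the case the thread is about.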

The fast I32xI32 Optimized Kenyan Multiplication routine beats every one of its I32xI32 competitors, so it's high time to teach it a lesson. It can do the 8x8 multiplication table at an average of 30 machine cycles per multiplication. Maybe it spends too much time on argument checking/swapping at the beginning. Indeed, the simple Kenyan does the job in 29 machine cycles on average on these small numbers. But both timings are far from the 17 machine cycles of the unrolled I8xI8 multiplication. Mike's code is so straightforward that I do not see many chances to make a faster 8x8 multiplier. Or, maybe...

A Look up Table?
============
The fastest way of doing a multiplication is not to do it at all. At least not at run time. Instead, we can calculate all possible results in advance and store them in memory. Then, whenever we need the result of a multiplication, we simply look it up in the result table. The multiplication of two 8-bit numbers has, however, 65_536 results of 16 bits each, resulting in a table 128 KB long. Not to mention the indexing, which would need another 8x8 multiplier. The commutativity of multiplication may save us half of these bytes, but the remaining 64 KB is still a lot for this simple job on the Propeller. And to be fast, the real challenge is to squeeze that 64K into a COG. Before anyone says that's impossible...

Secondary school math can help to squeeze a 64K table into a COG
============================================
Proving an algebraic identity like this one

a*b = ((a+b)/2)^2 - ((a-b)/2)^2

was very boring at the time, and it is boring even today. So let us just believe it is true. It had better be, since then we need only 512 items in the table. The (a+b) expression takes values in [0...510], and (a-b) takes values in [-255...255]. Both expressions are squared to obtain the table values, so only |a-b| in [0...255] matters and the second table can be merged into the first. So we need a one-dimensional table with index [0...511], where each element contains the square of the index. That takes 512 registers.

Unfortunately, that compression is not enough for a COG, as there are some system registers, and we will need others to run our code, too. We can compress further, however. The first half of the square numbers in the table fit into 16 bits, so there is a chance to compress the table by a further 25%: those square numbers can be stored in (256+128) COG registers. As a result, we can 'implode' the original 128K 2-D multiplication table into a 1-D array of 384 32-bit registers of a COG. We will still need some code to address and unpack the table and to generate the product with a subtraction. That code should take fewer than 17 machine cycles to be faster than the unrolled multiplication. If not, all this will be only a fair attempt to do the (almost) impossible.
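
For anyone who would rather test the identity than prove it, here is a small Python sketch (not Propeller code; `QUARTER_SQUARES` and `table_mul` are names made up for it). It assumes the table stores floor(n^2/4) instead of (n/2)^2. That should be safe because a+b and a-b always have the same parity, so the two discarded fractions are either both 0 or both 1/4, and they cancel in the subtraction.

```python
# One table of floor(n*n/4) for every possible sum a+b of two bytes (0..510).
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]


def table_mul(a, b):
    """Multiply two unsigned bytes with two lookups and one subtraction."""
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]


# Bits needed per entry when quarter squares are stored:
# lower half (index 0..255) versus upper half (index 256..510).
low_bits = max(QUARTER_SQUARES[:256]).bit_length()
high_bits = max(QUARTER_SQUARES[256:]).bit_length()
print(low_bits, high_bits)
```

With quarter squares instead of full squares, the lower half of the table needs fewer bits per entry than the upper half, which is what makes packing two entries into one 32-bit register plausible.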

Assuming the arguments A in a1 and B in a2

     Pseudocode                      Machine cycles
Calculate    a1 = (A + B)                 1
Calculate    a2 = ABS(B - A)              2
Table read   r1 = TABLE[a1]               X
Table read   r2 = TABLE[a2]               X
Result       r1 = (r1 - r2)               1

If the 2 table lookups can be done in 4-10 machine cycles, then the speed of the 8x8 multiplication can be increased a little bit. However, the price of this 20-25% speed gain on the Propeller architecture is maybe too high. 40% sounds better. If only someone could make a 5-6 machine cycle table lookup...

Cheers,


Istvan



Post Edited (cessnapilot) : 8/16/2009 8:17:37 PM GMT

Comments

  • Kye Posts: 2,200
    edited 2009-08-05 19:44
    Great Stuff, great stuff.

    The kenyan multiplication will help out on my audio engine driver.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Nyamekye,
  • cessnapilot Posts: 182
    edited 2009-08-06 09:22
    Hi Kye,

    Thanks. What is your audio driver doing? Does it filter, synthesize, or does it change the pitches of the sound? (And not that sound of the pitches. Oops, sorry.) Let us know about your progress with it.

    Cheers,

    Istvan
  • ericball Posts: 774
    edited 2009-08-06 13:51
    Why are you using RCR instead of SHR? That will shift in the carry from the previous step, which isn't what you want.

    The same technique could be used up to 16x16 with the appropriate unroll. Heck, the 8x8 routine is applicable up to 25x8. Hrmm... if you used SAR instead of RCR/SHR, then arg1 could be signed. With a little pre-processing of arg2, the same routine could be used for signed multiplication too.

  • Kye Posts: 2,200
    edited 2009-08-06 14:12
    Much simpler. It's just a full-duplex audio DAC and ADC.

    But I need it to be able to run fast so that the user can change the tempo up to 48000Khz and everywhere in between.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Nyamekye,
  • cessnapilot Posts: 182
    edited 2009-08-06 15:09
    Hi ericball,

    That 8x8 unrolled code comes from Mike Green. I only tested it, and it works. The addition before the RCR clears the carry (for byte arguments), so zero is shifted into the MSB. On the other hand, you are right: SHR would be safer for the job of getting the LSB. You are right about the 'signed' variant, too. Thanks for the tips.

    However, when you unroll the I16xI16 multiplication, it runs in 33 cycles. Both I32xI32 Kenyans run faster than that on average, and they are smaller. Imagine what happens when you unroll the I32xI32 standard code: the bigger the size, the slower the code will be. Or, if one of the incoming bytes is usually smaller than 16, the improved Kenyan fires its rockets and runs within 15 or so cycles. If you know which one will be the smaller byte, it runs within 12 machine cycles.

    Cheers,

    Istvan
  • mctrivia Posts: 3,772
    edited 2009-08-06 15:14
    the only place your method seems to do less is the unrolled I24xI8

  • cessnapilot Posts: 182
    edited 2009-08-07 03:57
    @mctrivia

    Yes, but less time or more speed sometimes makes a difference. The 8x8, 16x8, 24x8 and 32x8 unrolled multiplications take practically the same number of machine cycles on a 32-bit controller, if you unroll them in the right direction.

    Cheers,
    Istvan
  • cessnapilot Posts: 182
    edited 2009-08-16 20:39
    Checklist:

    Implode a 128KB multiplication table into a COG. (1KB is enough)
    Checked!
    Do a faster 8x8 multiplication than the brute force unrolled code.
    Checked!

    Speed gain 23%

    Here is the 128KB table imploded into 1KB in a COG

    '256x256 (8-bit)x(8-bit) unsigned multiplication table,
    '128KB data imploded into 256 32-bit COG registers (1KB)
     
     
    LONG  $10000000,  $10200000,  $10404001,  $10608002 
    LONG  $10810004,  $10A18006,  $10C24009,  $10E3000C 
    LONG  $11040010,  $11250014,  $11464019,  $1167801E 
    LONG  $11890024,  $11AA802A,  $11CC4031,  $11EE0038 
    LONG  $12100040,  $12320048,  $12544051,  $1276805A 
    LONG  $12990064,  $12BB806E,  $12DE4079,  $13010084 
    LONG  $13240090,  $1347009C,  $136A40A9,  $138D80B6 
    LONG  $13B100C4,  $13D480D2,  $13F840E1,  $141C00F0 
    LONG  $14400100,  $14640110,  $14884121,  $14AC8132 
    LONG  $14D10144,  $14F58156,  $151A4169,  $153F017C 
    LONG  $15640190,  $158901A4,  $15AE41B9,  $15D381CE 
    LONG  $15F901E4,  $161E81FA,  $16444211,  $166A0228 
    LONG  $16900240,  $16B60258,  $16DC4271,  $1702828A 
    LONG  $172902A4,  $174F82BE,  $177642D9,  $179D02F4 
    LONG  $17C40310,  $17EB032C,  $18124349,  $18398366 
    LONG  $18610384,  $188883A2,  $18B043C1,  $18D803E0 
    LONG  $19000400,  $19280420,  $19504441,  $19788462 
    LONG  $19A10484,  $19C984A6,  $19F244C9,  $1A1B04EC 
    LONG  $1A440510,  $1A6D0534,  $1A964559,  $1ABF857E 
    LONG  $1AE905A4,  $1B1285CA,  $1B3C45F1,  $1B660618 
    LONG  $1B900640,  $1BBA0668,  $1BE44691,  $1C0E86BA 
    LONG  $1C3906E4,  $1C63870E,  $1C8E4739,  $1CB90764 
    LONG  $1CE40790,  $1D0F07BC,  $1D3A47E9,  $1D658816 
    LONG  $1D910844,  $1DBC8872,  $1DE848A1,  $1E1408D0 
    LONG  $1E400900,  $1E6C0930,  $1E984961,  $1EC48992 
    LONG  $1EF109C4,  $1F1D89F6,  $1F4A4A29,  $1F770A5C 
    LONG  $1FA40A90,  $1FD10AC4,  $1FFE4AF9,  $202B8B2E 
    LONG  $20590B64,  $20868B9A,  $20B44BD1,  $20E20C08 
    LONG  $21100C40,  $213E0C78,  $216C4CB1,  $219A8CEA 
    LONG  $21C90D24,  $21F78D5E,  $22264D99,  $22550DD4 
    LONG  $22840E10,  $22B30E4C,  $22E24E89,  $23118EC6 
    LONG  $23410F04,  $23708F42,  $23A04F81,  $23D00FC0 
    LONG  $24001000,  $24301040,  $24605081,  $249090C2 
    LONG  $24C11104,  $24F19146,  $25225189,  $255311CC 
    LONG  $25841210,  $25B51254,  $25E65299,  $261792DE 
    LONG  $26491324,  $267A936A,  $26AC53B1,  $26DE13F8 
    LONG  $27101440,  $27421488,  $277454D1,  $27A6951A 
    LONG  $27D91564,  $280B95AE,  $283E55F9,  $28711644 
    LONG  $28A41690,  $28D716DC,  $290A5729,  $293D9776 
    LONG  $297117C4,  $29A49812,  $29D85861,  $2A0C18B0 
    LONG  $2A401900,  $2A741950,  $2AA859A1,  $2ADC99F2 
    LONG  $2B111A44,  $2B459A96,  $2B7A5AE9,  $2BAF1B3C 
    LONG  $2BE41B90,  $2C191BE4,  $2C4E5C39,  $2C839C8E 
    LONG  $2CB91CE4,  $2CEE9D3A,  $2D245D91,  $2D5A1DE8 
    LONG  $2D901E40,  $2DC61E98,  $2DFC5EF1,  $2E329F4A 
    LONG  $2E691FA4,  $2E9F9FFE,  $2ED66059,  $2F0D20B4 
    LONG  $2F442110,  $2F7B216C,  $2FB261C9,  $2FE9A226 
    LONG  $30212284,  $3058A2E2,  $30906341,  $30C823A0 
    LONG  $31002400,  $31382460,  $317064C1,  $31A8A522 
    LONG  $31E12584,  $3219A5E6,  $32526649,  $328B26AC 
    LONG  $32C42710,  $32FD2774,  $333667D9,  $336FA83E 
    LONG  $33A928A4,  $33E2A90A,  $341C6971,  $345629D8 
    LONG  $34902A40,  $34CA2AA8,  $35046B11,  $353EAB7A 
    LONG  $35792BE4,  $35B3AC4E,  $35EE6CB9,  $36292D24 
    LONG  $36642D90,  $369F2DFC,  $36DA6E69,  $3715AED6 
    LONG  $37512F44,  $378CAFB2,  $37C87021,  $38043090 
    LONG  $38403100,  $387C3170,  $38B871E1,  $38F4B252 
    LONG  $393132C4,  $396DB336,  $39AA73A9,  $39E7341C 
    LONG  $3A243490,  $3A613504,  $3A9E7579,  $3ADBB5EE 
    LONG  $3B193664,  $3B56B6DA,  $3B947751,  $3BD237C8 
    LONG  $3C103840,  $3C4E38B8,  $3C8C7931,  $3CCAB9AA 
    LONG  $3D093A24,  $3D47BA9E,  $3D867B19,  $3DC53B94 
    LONG  $3E043C10,  $3E433C8C,  $3E827D09,  $3EC1BD86 
    LONG  $3F013E04,  $3F40BE82,  $3F807F01,  $3FC03F80 
    


    SPIN did it with this code

    'This prints imploded 8x8 multiplication table on PST screen
    'I used copy/paste to insert this table into SPIN code
     
    PST.Char(PST#NL)
    REPEAT i FROM 0 TO 63
      PST.Str(STRING("LONG  "))  
      REPEAT j FROM 0 TO 3
        vl := (i * 4) + j
        vh := vl + 256
        h := (vh * vh) >> 2
        l := (vl * vl) >> 2
        r := (h<<14) + l
        PST.Str(STRING("$"))
        PST.Hex(r,8)
        IF (j < 3)
          PST.Str(STRING(",  "))
      PST.Char(PST#NL)      
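
As a cross-check of the listing, the generator can be re-run off-chip. This is a rough Python port of the Spin loop above, not the original tool; the function name `packed_table` is hypothetical. It packs floor((n+256)^2/4) into bits 29..14 and floor(n^2/4) into bits 13..0, exactly as `r := (h<<14) + l` does, and prints rows in the same `LONG` format.

```python
def packed_table():
    """Return 256 longs: bits 29..14 hold (n+256)^2/4, bits 13..0 hold n^2/4."""
    table = []
    for vl in range(256):
        vh = vl + 256
        h = (vh * vh) >> 2          # upper-region quarter square (16 bits)
        l = (vl * vl) >> 2          # lower-region quarter square (14 bits)
        table.append((h << 14) + l)
    return table


TABLE = packed_table()
for row in range(64):
    longs = TABLE[4 * row:4 * row + 4]
    print("LONG  " + ",  ".join("$%08X" % v for v in longs))
```

The first and last values it produces can be compared against the first and last entries of the table posted above.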
    


    And here is the ultra fast 8x8 multiplication PASM code

    'Imploded Table lookup multiplication of 2 unsigned bytes. The table of
    'this code fits into a COG. It is 23% faster than the unrolled code.
     
    'Prepare Table addresses (A+B) and |A-B| from arg1=A and arg2=B
    MOV      r1,             arg1        'Save arg1
    ADD      arg1,           arg2        'arg1=A+B
    SUB      arg2,           r1          'arg2=B-A
    ABS      arg2,           arg2        'arg2=|B-A|=|A-B|
     
    'Now indexes are prepared. For A+B we have to check upper or lower region
    CMP      arg1,           #256 WC     'Check for 'Upper index'     
    IF_NC SUB arg1,          #256        'If so, denorm it
     
    'Write these indices into the sfield of the table readout commands
    'Table starts at 0 address 
    MOVS     :GetAplusB2,    arg1         'Index for ((A+B)^2)/4 
    MOVS     :GetAminusB2,   arg2         'Index for ((A-B)^2)/4
     
    'Read data table for folded index |A+B|
    :GetAplusB2
    MOV      r1,             0-0         'When this executes 0-0 is arg1
    IF_NC SHR r1,            #14         'Original index was greater than 255,
                                         'Shift table item right 14-bit
    IF_C AND  r1,            masklow14   'Original index was less than 256.
                                         'Mask lower 14 bits
     
    'Read data table at straight index |A-B|
    :GetAminusB2
    MOV      r2,             0-0          'When this executes 0-0 is arg2
    '|A-B| is always less than 256, so we do not have to shift 
    AND      r2,             masklow14    'But we have to mask   
     
    'Data prepared
    SUB      r1,             r2           'Get final result
     
    'Result in r1 after 14 machine cycles
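
The data flow of this routine can be followed in a short Python model. It is only a sketch for reasoning about the logic, not a cycle-accurate simulation: the name `imploded_mul` is made up, and the value `$3FFF` for the `masklow14` register is an assumption, since its definition is not shown in the post.

```python
# Packed table as in the post: upper quarter square at bit 14, lower at bit 0.
TABLE = [((((n + 256) ** 2) >> 2) << 14) + ((n * n) >> 2) for n in range(256)]
MASKLOW14 = 0x3FFF  # assumed contents of the masklow14 register


def imploded_mul(a, b):
    index_sum = a + b                 # ADD arg1, arg2
    index_diff = abs(b - a)           # SUB arg2, r1 then ABS arg2, arg2
    upper = index_sum >= 256          # CMP arg1, #256 WC leaves C clear here
    if upper:
        index_sum -= 256              # IF_NC SUB arg1, #256
    r1 = TABLE[index_sum]             # MOV r1, 0-0 (source patched by MOVS)
    if upper:
        r1 >>= 14                     # IF_NC SHR r1, #14
    else:
        r1 &= MASKLOW14               # IF_C AND r1, masklow14
    r2 = TABLE[index_diff] & MASKLOW14    # MOV r2, 0-0 then AND r2, masklow14
    return r1 - r2                    # SUB r1, r2


print(imploded_mul(255, 255), imploded_mul(17, 3))
```

Each table register serves two indices, n and n+256, which is why the carry from the compare has to survive until the shift-or-mask step.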
    


    I think this is maybe the fastest ever 8x8 multiplication code for the Propeller. The attachment, of course, contains a working version.

    I will be happy to see an even faster one, so $100 goes to the first person who can do it faster with another method. (Code size arbitrary, input bytes in arg1, arg2, result anywhere, unshifted.)

    Cheers,

    Istvan
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2009-08-16 21:01
    You can replace:

              CMP    arg1,#256 WC     'Check for 'Upper index'     
        IF_NC SUB    arg1, #256        'If so, denorm it
    
    
    


    with

              CMPSUB arg1,#256 WC
    
    
    


    That makes it faster. Do I win $100? :)

    BTW, you'll need to invert the sense of your tests for carry later in the program.

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 8/16/2009 9:14:01 PM GMT
  • cessnapilot Posts: 182
    edited 2009-08-16 21:57
    Yes Phil, you do, of course.

    Here is the new and faster version, according to your idea

    'Imploded Table lookup multiplication of 2 unsigned bytes. The table of
    'this code fits into a COG. It is 30% faster than the unrolled code.
    'Prepare Table addresses (A+B) and |A-B| from arg1=A and arg2=B
    
    MOV      r1,             arg1        'Save arg1
    ADD      arg1,           arg2        'arg1=A+B
    SUB      arg2,           r1          'arg2=B-A
    ABS      arg2,           arg2        'arg2=|B-A|=|A-B|
     
    'Now indexes are prepared. For A+B we have to check upper or lower region
    CMPSUB   arg1,           #256 WC     'Check for 'Upper index' efficiently
                                         'Thanks to Phil Pilgrim's idea
     
    'Write these indices into the sfield of the table readout commands
    'Table starts at 0 address 
    MOVS     :GetAplusB2,    arg1         'Index for ((A+B)^2)/4 
    MOVS     :GetAminusB2,   arg2         'Index for ((A-B)^2)/4
     
    'Read data table for folded index |A+B|
    :GetAplusB2
    MOV      r1,             0-0         'When this executes 0-0 is arg1
    IF_C SHR r1,             #14         'Original index was greater than 255,
                                         'Shift table item right 14-bit
    IF_NC AND r1,            masklow14   'Original index was less than 256.
                                         'Mask lower 14 bits
     
    'Read data table at straight index |A-B|
    :GetAminusB2
    MOV      r2,             0-0          'When this executes 0-0 is arg2
    '|A-B| is always less than 256, so we do not have to shift 
    AND      r2,             masklow14    'But we have to mask   
     
    'Data prepared
    SUB      r1,             r2           'Get final result
     
    'Result in r1 after 13 machine cycles
    


    It works now 30% faster than unrolled code. More than 1.5 million 8x8 multiplications per second.

    Altogether we profit much, much more than $100 from your posts. Thank you.


    Cheers,

    Istvan


    PS.: E-mail me how to transfer the bucks.
  • heater Posts: 3,370
    edited 2009-08-16 22:04
    Wow, just what I need for the MUL instruction of the MoCog 6809 emulator !

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • RinksCustoms Posts: 531
    edited 2009-08-16 22:14
    "It works now 30% faster than unrolled code. More than 1.5 million 8x8 multiplications per second."

    HOLY POOP!!!! 1.5 MILLION multiplies/sec?! DAAAAAAAAAAAAAAAAAAAAAAAAAAAMMMN that's quick!! Good work!

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2009-08-16 22:24
    cessnapilot,

    Forget the $100. I just do this for fun! :) Here's a 12-instruction version:

    DAT

    'Imploded Table lookup multiplication of 2 unsigned bytes. The table of
    'this code fits into a COG. It is 23% faster than the unrolled code.

    'Prepare Table addresses (A+B) and |A-B| from arg1=A and arg2=B

                                            ' arg1       arg2
                                            '   A          B
                  add       arg1,arg2       ' A+B          B
                  shl       arg2,#1         ' A+B         2B
                  sub       arg2,arg1       ' A+B        B-A
                  abs       arg2,arg2       ' A+B       |B-A|

    'Now indexes are prepared. For A+B we have to check upper or lower region

                  cmpsub    arg1,#256 wc            'Check for 'Upper index'

    'Write these indices into the sfield of the table readout commands
    'Table starts at 0 address

                  movs      :GetAplusB2,arg1        'Index for ((A+B)^2)/4
                  movs      :GetAminusB2,arg2       'Index for ((A-B)^2)/4

    'Read data table for folded index |A+B|

    :GetAplusB2   mov       arg1,0-0                'When this executes 0-0 is arg1
            if_c  shr       arg1,#16                'Original index was greater than 255,
                                                    '  so shift table item right 16-bit

    'Read data table at straight index |A-B|

    :GetAminusB2  mov       arg2,0-0                '|A-B| is always less than 256, so we do not have to shift

    'Data prepared

                  sub       arg1,arg2               'Get final result
                  and       arg1,_0xffff            'AND out top 16 bits.

    'Result in arg1 after 12 machine cycles.

    '______________________________________________________

    'Table

                  long      $4000_0000,$4080_0000,$4101_0001,$4182_0002
                  '...
                  long      $fc04_3e04,$fd02_3e82,$fe01_3f01,$ff00_3f80
    
    
    


    The table is arranged 16:16 instead of %00:16:14. This allows the ANDing to be performed on the difference instead of on each argument. I've also eliminated the temp variables.
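
Why a single AND at the end is enough can be seen in a small Python model of the 16:16 layout. This is a sketch under assumptions, not the posted code: `mul_16_16` is a made-up name, and `_0xffff` is assumed to hold `$FFFF`, as its name suggests. The low 16 bits of a subtraction depend only on the low 16 bits of its operands, because borrows only travel upward. So whatever sits in the upper halves of the two registers ends up in the top 16 bits of the difference, where the final mask removes it.

```python
# 16:16 layout: upper quarter square in bits 31..16, lower one in bits 15..0.
TABLE16 = [((((n + 256) ** 2) >> 2) << 16) | ((n * n) >> 2) for n in range(256)]
MASK32 = 0xFFFFFFFF


def mul_16_16(a, b):
    index_sum = a + b                 # add arg1, arg2
    index_diff = abs(b - a)           # shl / sub / abs sequence
    upper = index_sum >= 256          # cmpsub arg1, #256 wc sets C when it subtracts
    if upper:
        index_sum -= 256
    arg1 = TABLE16[index_sum]         # mov arg1, 0-0
    if upper:
        arg1 >>= 16                   # if_c shr arg1, #16
    arg2 = TABLE16[index_diff]        # mov arg2, 0-0 (read unmasked)
    diff = (arg1 - arg2) & MASK32     # sub arg1, arg2 wraps at 32 bits
    return diff & 0xFFFF              # and arg1, _0xffff (assumed to be $FFFF)


print(mul_16_16(255, 255), mul_16_16(12, 12))
```

In the %00:16:14 layout the upper field starts at bit 14, so its bits overlap the 16-bit result range and each operand has to be masked or shifted before the subtraction.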

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 8/16/2009 11:01:00 PM GMT
  • Kye Posts: 2,200
    edited 2009-08-16 23:02
    Mmm, takes 48 clock cycles. (Each instruction takes 4 cycles.)

    That's 1_666_666 multiplications per second... Wait, you'll need jmps and other stuff, so it won't be that fast.

    Nice job.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Nyamekye,
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2009-08-16 23:13
    Right: 1 machine cycle = 4 clock cycles.

    -P.
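
The throughput figures quoted in this thread follow from simple arithmetic, sketched below in Python. The 80 MHz system clock is an assumption (it is not stated in the thread, but it is the value that reproduces the 1_666_666 figure), and `multiplies_per_second` is a made-up helper name. Call, return and loop overhead are ignored, so these are best-case numbers.

```python
CLOCKS_PER_INSTRUCTION = 4  # per the posts above: 1 machine cycle = 4 clock cycles


def multiplies_per_second(instructions, clock_hz=80_000_000):
    """Best-case rate for a straight-line routine, ignoring call/loop overhead."""
    return clock_hz // (instructions * CLOCKS_PER_INSTRUCTION)


for name, count in (("unrolled", 17), ("imploded, CMPSUB", 13), ("16:16 table", 12)):
    print(name, multiplies_per_second(count))
```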
  • cessnapilot Posts: 182
    edited 2009-08-17 07:58
    Phil,

    You are right. It works; see the first attachment, which shows the PST output. Your code goes into the "Library of Proven Fast Codes", along with the unrolled MUL, of course, which is not so fast, but very small. The attached "PASM_TestPad.spin" contains the code with the new table.

    It will take some time for me to figure out how that gap of 2 zero bits makes that simplification possible.

    As for the $100, Thank you. I keep the note for further challenges, but I shall invest the sum into BlueTooth modules from Parallax. During the calibration of the gyro outputs of the 6DOF IMU, cables twist up quickly and everything becomes a mess with broken connectors and wires. (Or, I should not answer phones in the next room, while at work with it.)

    Cheers,


    Istvan