Shop OBEX P1 Docs P2 Docs Learn Events
Fast, Faster, Fastest Code: Ultrafast 32-bit integer division — Parallax Forums

Fast, Faster, Fastest Code: Ultrafast 32-bit integer division

cessnapilotcessnapilot Posts: 182
edited 2012-06-02 00:44 in Propeller 1
Continuing this series of threads a new version of the ultrafast
Kenyan 32-bit integer division algorithm attached to this post.
'=========================================================================
'==================Start of tested code of Function_1=====================
'=========================================================================

'--------Unsigned 32-bit ultrafast Kenyan integer division----------------

'   Arguments: 32-bit dividend and divisor
'     Results: 32-bit quotient and remainder
 
'Prepare work registers
MOV      r1,             #0            'Clear 32-bit quotient
MOV      r2,             #1            'Prepare "divide back" loop cntr 

'First round of Kenyan division: double divisor repeatedly until it
'becomes larger then dividend. Count the number of doublings

:first_loop

SHL      arg2,           #1            'Double divisor  
CMP      arg2,           arg1 WZ, WC   'Compare doubled divisor w dividend
                                       'If new divisor smaller or equal,
                                       'C is set, when equals Z is set

IF_C_AND_NZ ADD r2,      #1            'IF_C_AND_NZ increment counter                       
IF_C_AND_NZ JMP          #:first_loop  'IF_C repeat first_loop

'If new divisor same as dividend, Z flag is rised, then the result is
'some power of two. Calculate it and finish 

IF_Z     MOV r1,         #1
IF_Z     ROL r1,         r2
IF_Z     MOV arg1,       #0
IF_Z     JMP             #:finish       

'Second Round of Kenyan division: Half back divisor and double quotient
'repeatedly and subtract divisor from dividend when dividend is larger
'or equal of divisor. In this case increment quotient with one, after
'doubling

:second_loop

SHR      arg2,           #1            'Half divisor
CMPSUB   arg1,           arg2 WC, WZ   'Compare dividend with divisor,
                                       'If dividend is greater or equal
                                       'than divisor, then subtract
                                       'divisor from dividend and set C
         RCL r1,         #1            'Double quotient and rotate C into
                                       'LSB of quotient
IF_NZ DJNZ r2,           #:second_loop 'Continue division if something
                                       'remained and some steps are left
                                       
IF_Z SUB r2,             #1            'Remainder is zero, no more "half"                           
IF_Z ROL r1,             r2            'steps needed. Finish the job

:finish

'=========================================================================
'====================End of tested code of Function_1=====================
'=========================================================================

The separate handling of the "remainder=0" cases makes it even faster.
Average execution time for random 32-bit integers is less than 30 machine
cycles. This is almost 10(!) times faster than the conventional "Do the
same code for all 32 bits" algorithms, even when rolled out.

Improvements with more efficient PASM coding or with a faster method
are greatly appreciated. Till those improvements, however, the here
presented one is the fastest 32-bit integer division with PASM. I hope that
someone can improve it.

This code and the ultrafast Kenyan 32-bit multiplication is used in the
PASM version of the poker probability calculator. The algorithm, which
is based on prime numbers, uses 32-bit modulo/division/multiplication
operations for fast poker hand evaluations. 5 thousand poker situations can
be evaluated with a Propeller in 3 seconds. The 1% accuracy of the obtained
winning probabilities and this speed is more than enough to respond within
10 seconds in a typical online gameplay.

The attached SPIN/PASM wrapper PST application allows the user to test the
speed and correctness of the 32-bit ultrafast Kenyan division immediately. Enjoy...

Cheers,

Istvan

Comments

  • LeonLeon Posts: 7,620
    edited 2012-04-27 00:01
    Use:

    [ code ]

    [ /code ]

    to insert code in your post. Remove the spaces.
  • cessnapilotcessnapilot Posts: 182
    edited 2012-04-27 00:17
    Leon: Thanks.
  • kuronekokuroneko Posts: 3,623
    edited 2012-04-27 01:29
    @Istvan: Have you tried dividing $C0000000 by 1? It seems to take an eternity ...
  • cessnapilotcessnapilot Posts: 182
    edited 2012-04-27 03:18
    kuroneko: You are right. The full unsigned division needs more coding. The posted version in my programs works
    perfectly as a core of a signed 32-bit routine, where the dividend absolute value is always less than $8000_0000
    and sign is managed separately.

    Anyway, I see your point, and I am working on an unsigned 32-bit version with the ROL ... WC and CMPX or with
    some other tricky PASM way to cure the problem that rotating 1 to left 32 times results in zero in the register.

    I appreciate any help.

    Thanks beforehand,

    Istvan
  • cessnapilotcessnapilot Posts: 182
    edited 2012-05-08 01:50
    I have modified the ultrafast 32-bit unsigned integer division algorithm, according to
    kuroneko's remark.
    '=========================================================================
    '==================Start of tested code of Function_1=====================
    '=========================================================================
    
    '--------Unsigned 32-bit ultrafast Kenyan integer division----------------
    
    '   Arguments: 32-bit dividend and divisor
    '     Results: 32-bit quotient and remainder
     
    'Prepare work registers
    MOV      r1,             #0            'Clear 32-bit quotient
    MOV      r2,             #1            'Prepare "divide back" loop cntr 
    
    'First round of Kenyan division: double divisor repeatedly until it
    'becomes larger then dividend. Count the number of doublings
    
    first_loop1
    
    ROL      arg2,           #1   WC       'Double divisor
    IF_C     JMP             #second_Loop1 'Overflow:jump to second loop
                                           'with carry set
    
    CMP      arg2,           arg1 WZ, WC   'Compare doubled divisor w divident
                                           'If new divisor smaller or equal,
                                           'C is set, when equals Z is set
    
    IF_C_AND_NZ ADD r2,      #1            'IF_C_AND_NZ increment counter                       
    IF_C_AND_NZ JMP          #first_loop1  'IF_C repeat first_loop
    
    'If new divisor same as dividend, Z flag is rised, then the result is
    'some power of two. Calculate it and finish 
    
    IF_Z     MOV r1,         #1
    IF_Z     ROL r1,         r2
    IF_Z     MOV arg1,       #0
    IF_Z     JMP             #finish1       
    
    'Second Round of Kenyan division: Half back divisor and double quotient
    'repeatedly and subtract divisor from dividend when dividend is larger
    'or equal of divisor. In this case increment quotient with one, after
    'doubling
    
    second_loop1
    
    RCR      arg2,           #1            'Half divisor when C not set
                                           'or restore divisor when C set
    CMPSUB   arg1,           arg2 WC, WZ   'Compare dividend with divisor,
                                           'If dividend is greater or equal
                                           'than divisor, then subtract
                                           'divisor from dividend and set C
    RCL r1,                  #1  WC        'Double quotient and rotate C into
                                           'LSB of quotient
    IF_NZ DJNZ r2,           #second_loop1 'Continue division if something
                                           'remained and some steps are left
                                           
    IF_Z SUB r2,             #1            'Remainder is zero, no more "half"                           
    IF_Z ROL r1,             r2            'steps needed. Just finish the job
    
    finish1
    
    '=========================================================================
    '====================End of tested code of Function_1=====================
    '=========================================================================
    

    The new version is about 4 (!) times faster on random integers, than the traditional algorithms
    with 32 iterations. So, average speed is quadrupled. I think this is remarkable with such
    a basic and over-studied operation.

    For the comparison the compact and (probably) fully optimized traditional algorithm was
    borrowed from Kyle. It takes always 32 iterations and its average execution steps is about
    106 machine cycles on 10_000 pairs of pseudo-random integers.

    The ultrafast Kenyan method takes only 26 machine cycles on average with the same
    10_000 pairs of random integers. You can verify this with the attached PST application.

    The check for division with zero was omitted from both test algorithms. Division with zero
    should not be performed (only Chuck Norris can divide with zero) and this operation should
    be caught on higher levels to create an exception.

    However, there is room for further improvements. When the quotient is a very large number,
    the traditional algorithm is somewhat faster (albeit far less than 4 times). Next step should be
    the merge of the effectiveness of the Kenyan ultrafast method on random inputs with the
    effectiveness of the traditional algorithm with very large quotients. With your help on this forum
    almost everything is possible in PASM. Remember the 12 machine cycle 8x8 bit multiplication
    which outperformed the unrolled (17 cycle) method.

    Until that enjoy the benefits of the just 4 times faster (on average) new division algorithm

    Cheers,
    Istvan
  • Mark_TMark_T Posts: 1,981
    edited 2012-05-08 05:04
    There is a problem with this benchmark - it doesn't represent the reality of calling integer division.

    Usually calls to integer division involve a dividend larger in magnitude than the divisor. If you alter the benchmark to limit the divisor to 20 bits then the standard division routine starts to win, and if you limit the divisor to 12 bits the standard division is much quicker...

    Dividing by zero hangs, BTW, and dividing by small constants like 3 and 5 is slowest of all (this is probably the most common case in real programs).
  • cessnapilotcessnapilot Posts: 182
    edited 2012-05-08 05:55
    @Mark_T:

    The check for the division by zero was left out on purpose. What you noticed on the size of the divisor I noticed, too.
    That is why I mentioned there is room for improvements an I am working on it. The improved version will run something like:


    Check for small divisors (up to 16) with a jump table (This reflects to your last comment)

    Rotate divisor left and right simultaneously to select faster method

    Apply faster method using the already rotated data.


    As for the "reality" of integer division I have no idea. Or, better to say, there are other realities. For example if you check the
    integer arithmetic code sections of floating point packages you may realize other situations, because of the very normalized
    nature of floating point representation.

    Thanks for the tests and comments,

    Istvan
  • ersmithersmith Posts: 6,097
    edited 2012-05-08 06:05
    I agree with Mark -- it's very rare for the inputs to division to be distributed randomly. Far more common is the case where the divisor is small (e.g. 10) and dividend is larger.

    For this case you can optimize the division algorithm by using a "count leading zeros" operation on the divisor and/or dividend. That's effectively what the initial loop of the Kenyan division algorithm is doing, but it's taking 5 instructions per iteration and up to 31 iterations (so worst case is 155 instructions). Using a divide and conquer approach you can do the same thing in a total of 16 instructions. This isn't as good as the Kenyan division algorithm in the best case, but it's constant and better for the common case of large dividend small divisor.

    Here's the unsigned division routine used by propgcc. It gives pretty good performance over a broad range of inputs. Inputs are r0 and r1; the output r0/r1 is placed in r0, with remainder in r1. If you're using this in stand-alone PASM you could eliminate a bit of register shuffling by having separate inputs and outputs; you could also inline the __CLZSI function that counts the number of leading zero bits.
    __UDIVSI
        mov    __DIVR, r0
        call    #__CLZSI
        neg    __DIVCNT, r0
        mov    r0, r1 wz
     IF_Z   jmp    #__UDIV_BY_ZERO
        call    #__CLZSI
        add    __DIVCNT, r0
        mov    r0, #0
        cmps    __DIVCNT, #0    wz, wc
     IF_C    jmp    #__UDIVSI_done
        shl    r1, __DIVCNT
        add    __DIVCNT, #1
    __UDIVSI_loop
        cmpsub    __DIVR, r1    wz, wc
        addx    r0, r0
        shr    r1, #1
        djnz    __DIVCNT, #__UDIVSI_loop
    __UDIVSI_done
        mov    r1, __DIVR
    __UDIVSI_ret
        ret
    
        '' come here on divide by zero
        '' we probably should raise a signal
    __UDIV_BY_ZERO
        neg    r0,#1
        mov    r1,#0
        jmp    #__UDIVSI_ret
    
    __TMP0    long    0
    __MASK_00FF00FF    long    0x00FF00FF
    __MASK_0F0F0F0F    long    0x0F0F0F0F
    __MASK_33333333    long    0x33333333
    __MASK_55555555    long    0x55555555
    __CLZSI    rev    r0, #0
    __CTZSI    neg    __TMP0, r0
        and    __TMP0, r0    wz
        mov    r0, #0
     IF_Z    mov    r0, #1
        test    __TMP0, __MASK_0000FFFF    wz
     IF_Z    add    r0, #16
        test    __TMP0, __MASK_00FF00FF    wz
     IF_Z    add    r0, #8
        test    __TMP0, __MASK_0F0F0F0F    wz
     IF_Z    add    r0, #4
        test    __TMP0, __MASK_33333333    wz
     IF_Z    add    r0, #2
        test    __TMP0, __MASK_55555555    wz
     IF_Z    add    r0, #1
    __CLZSI_ret    ret
    __DIVR    long    0
    __TMP1
    __DIVCNT
        long    0
    
  • Mark_TMark_T Posts: 1,981
    edited 2012-05-08 15:11
    Actually I've realized there are circumstances when divide is called effectively on random input, and thats in a multi-precision arithmetic library - but there the dominant loop is the multiply-and-subtract loop that follows the test-division.
  • cessnapilotcessnapilot Posts: 182
    edited 2012-05-08 22:40
    @ersmith: Than you very much for the PASM code. I shall implement it in the PASM_Bench.
    The program will be modified according to Mark_T's suggestion, to test different random input distributions,
    either for the dividend and/or for the divisor. The typical small divisor (<11) cases will be tested too. In this
    way the user can select the the most appropriate division algorithm for a given kind of arithmetics.

    I especially like the divide and conquer method to figure out the count of leading zeros. I want (or just hope)
    always for the best and this seems to be. There should be a direct PASM code for that as it is so useful in
    many algorithms.

    Cheers,

    Istvan
  • cessnapilotcessnapilot Posts: 182
    edited 2012-05-30 02:48
    I attached a comprehensive PASM bench for 32-bit unsigned division.

    The tested algorithms are:

    - Fast Kenyan

    - Zerocounter (The one used in propgcc, provided by ersmith)

    - Traditional (The one provided by Kye in one of his posts)

    The following table summarises the average machine cycle execution times for different relative size dividend/divisor combinations.
    Arguments were selected randomly from the given ranges 10_000 times.

    FastKenyan____ZeroCounter__Traditional_____Dividend / Divisor
    ===============================================
    ______36__________55_________106________32-bit / 32-bit
    _____101_________119_________122________32-bit / 16-bit
    _____133_________150_________130________32-bit / 8-bit
    _____149_________164_________133________32-bit / 4-bit
    ______36__________55_________122________16-bit / 16-bit
    ______68__________86_________130________16-bit / 8-bit
    ______84_________100_________133________16-bit / 4-bit
    ______36__________54_________130_________8-bit / 8-bit
    ______52__________68_________133_________8-bit / 4-bit
    ______77__________95_________127 <
    Average execution time in machine cycles (4 clock ticks)

    The Kenyan is the fastest in 7 of the 9 cases and it is allways faster here than the ZeroCounter algorithm used in propgcc.
    On average the Kenyan is 23% faster than the ZeroCounter and is 65% faster than the Traditional (similar that used in SPIN interpreter).

    The Kenyan code runs as follows:
    '=========================================================================
    '==================Start of tested code of Function_1=====================
    '=========================================================================
    
    '-------------Unsigned 32-bit fast Kenyan integer division----------------
    
    '   Arguments: 32-bit dividend and divisor
    '     Results: 32-bit quotient and remainder
     
    'Prepare work registers
                   MOV     r1,       #0          'Clear 32-bit quotient
                   MOV     r2,       #0          'Prepare "divide back" cntr 
    
    'Check for relatively small divisors               
                   MOV     r0,       arg1        'Copy dividend to a temp reg
                   SHR     r0,       #16          
                   CMP     arg2,     r0 WZ, WC
    
    IF_NC          JMP     #prep_shift_8
    
    IF_C_AND_NZ    ADD     r2,       #16
    
                   JMP     #shift_8
    
    prep_shift_8
    
                   MOV     r0,       arg1        'Copy dividend to a temp reg 
    
    shift_8
    
                   SHR     r0,       #8
                   CMP     arg2,     r0 WZ, WC
                   
    IF_NC          JMP     #prep_shift_4
    
    IF_C_AND_NZ    ADD r2, #8
    
                   JMP     #shift_4
    
    prep_shift_4
    
                   MOV     r0,       arg1        'Copy dividend to a temp reg 
                   SHR     r0,       r2          'Apply accumulated shifts
    
    shift_4
    
                   SHR     r0,       #4
                   CMP     arg2,     r0 WZ, WC
                   
    IF_NC          JMP     #prep_1stloop
    
    IF_C_AND_NZ    ADD     r2,       #4
    
    prep_1stloop
    
                   SHL      arg2,           r2   'Apply accumulated shifts
                                                 'on divisor 
                   ADD      r2,             #1 
    
    'First round of Kenyan division: double divisor repeatedly until it
    'becomes larger then dividend. Count the number of doublings
    first_loop1
    
                   ROL     arg2,     #1 WC       'Double divisor
    IF_C           JMP     #second_Loop1           
                   CMP     arg2,     arg1 WZ, WC 'Compare doubled divisor with
                                                 'divident. If new divisor is
                                                 'smaller or equal, C is set,
                                                 'when equals Z is set
    
    IF_C_AND_NZ    ADD     r2,       #1          'IF_C_AND_NZ increment cntr                       
    IF_C_AND_NZ    JMP     #first_loop1          'IF_C repeat first_loop
    
    'Second Round of Kenyan division: Half back divisor and double quotient
    'repeatedly and subtract divisor from dividend when dividend is larger
    'or equal of divisor. In this case increment quotient with one, after
    'doubling
    second_loop1
    
                   RCR     arg2,     #1          'Half divisor when C not set
                                                 'or restore divisor when C
                   CMPSUB  arg1,     arg2 WC, WZ 'Compare dividend with divsor
                                                 'If dividend is greater or
                                                 'equal, then subtract divisor
                                                 'from dividend and set C
                   RCL     r1,       #1 WC       'Double quotient and rotate C
                                                 'into LSB of quotient
    IF_NZ          DJNZ    r2,       #second_loop1 'Continue division if some
                                                 'remained and steps are left
                                           
    IF_Z           SUB     r2,       #1          'Finish division with no                           
    IF_Z           ROL     r1,       r2          'remainder  
    
    finish1
    '=========================================================================
    '====================End of tested code of Function_1=====================
    '=========================================================================
    

    Interesting to see, that such over-studied and over-optimized important algorithms like the ones for binary integer division can be overcame
    using PASM commands and some imagination.

    Bug reports, further optimization ideas or more efficient PASM codes are welcome.

    Cheers,

    Istvan
  • Heater.Heater. Posts: 21,230
    edited 2012-05-30 03:03
    cesnapilot,

    When developing a real-time control system one would be interested in fast execution, of course, but of prime importance is the worst case execution time. If that is too long to meet some deadline then it's no use that all the other cases are much faster.

    Can you present the worst case run time for all the different algos?
  • jmgjmg Posts: 15,183
    edited 2012-05-30 03:41
    Bug reports, further optimization ideas or more efficient PASM codes are welcome.

    I'd agree that worse case time matters more, and rather than random you should be able to predict the values that drive that worse case.
    Then, a worse case for some constrained cases is useful too, as someone may be doing /10 or /SomeSmallConstant for example.

    I'd also be interested in how this scales to a 64/32 case ?
  • cessnapilotcessnapilot Posts: 182
    edited 2012-05-30 05:12
    Heater:

    The worst case run time in machine cycles for the 32-bit, 16-bit and 8-bit dividends are:


    FastKenyan_____ZeroCounter______Traditional_____Dividend
    ================================================
    ____166___________170____________135_________32-bit
    ____104___________106____________135_________16-bit
    _____72____________74____________135__________8-bit


    As you can see, in this respect Kenyan is allways better than the ZeroCounter algorithm
    in propgcc and with the 16/8 bit arguments it is far most better than the Traditional one.

    In real-time control systems with e.g 12-bit ADCs and servos the 16 bit arguments are common.
    Here you can take great advantage with the the new divison algorithm where the worst case
    execution time is only 77% of that of the traditional code, not to mention the much faster
    speed over the others.

    However, as I was lazy to program the last 2-bit shift, the Traditional has 19% shorter
    worst case (/1) execution time than the Kenyan at 32-bit arguments. Completing the Kenyan
    code with those shifts will noticably reduce the execution time for /1 (and for /2) divisions for 32-bit
    dividends.

    Anyway, the division with 1 or 2 should be done directly in a time critical real-time system. (A weak
    excuse for my laziness.)

    jmg:

    I can only guess that with those effective first shifts, the Kenyan will keep its relative speed advantage with
    those algorithms even on the 64-bit scale


    Cheers,

    Istvan
  • KyeKye Posts: 2,200
    edited 2012-05-30 07:30
    I optimized my code for size. I ran out of space in the cog I'm using it in so bad that I'm using shadow registers for general purpose variables to avoid having to alias "res" variables in the code.
  • turbosupraturbosupra Posts: 1,088
    edited 2012-05-30 18:40
    Very nice work!
  • lonesocklonesock Posts: 917
    edited 2012-06-02 00:44
    Kye wrote: »
    I optimized my code for size. I ran out of space in the cog I'm using it in so bad that I'm using shadow registers for general purpose variables to avoid having to alias "res" variables in the code.
    You can save two longs with this:
    divide        ' divide 2 32-bit numbers: arg1 /= arg2 with remainder in r2
                  mov       r1, #32                 ' bit count
                  mov       r2, #0                  ' buffer & remainder
    :dl1          shl       arg1, #1 wc             ' shift numerator left until we get a bit
            if_nc djnz      r1, #:dl1               ' keep going while no high bit...NOTE: can comment out to save one long
    :dl2          rcl       r2, #1                  ' shift the last bit into the remainder
                  cmpsub    r2, arg2 wc             ' see if we can sub the divisor
                  rcl       arg1, #1 wc             ' keep the result, while shifting out the high bit again
                  djnz      r1, #:dl2               ' finish out the rest of the bits
    
    You can save one extra long by commenting out the 1st djnz, which lowers the worst case by 4 clocks, but loses a bit in speed if there are leading 0's in the numerator.

    Jonathan
Sign In or Register to comment.