Fast, Faster, Fastest Code: Ultrafast 32-bit integer division

cessnapilot · 2012-04-26 23:18

Continuing this series of threads a new version of the ultrafast
Kenyan 32-bit integer division algorithm attached to this post.

'=========================================================================
'==================Start of tested code of Function_1=====================
'=========================================================================

'--------Unsigned 32-bit ultrafast Kenyan integer division----------------

'   Arguments: 32-bit dividend and divisor
'     Results: 32-bit quotient and remainder
 
'Prepare work registers
MOV      r1,             #0            'Clear 32-bit quotient
MOV      r2,             #1            'Prepare "divide back" loop cntr 

'First round of Kenyan division: double divisor repeatedly until it
'becomes larger then dividend. Count the number of doublings

:first_loop

SHL      arg2,           #1            'Double divisor  
CMP      arg2,           arg1 WZ, WC   'Compare doubled divisor w dividend
                                       'If new divisor smaller or equal,
                                       'C is set, when equals Z is set

IF_C_AND_NZ ADD r2,      #1            'IF_C_AND_NZ increment counter                       
IF_C_AND_NZ JMP          #:first_loop  'IF_C repeat first_loop

'If new divisor same as dividend, Z flag is rised, then the result is
'some power of two. Calculate it and finish 

IF_Z     MOV r1,         #1
IF_Z     ROL r1,         r2
IF_Z     MOV arg1,       #0
IF_Z     JMP             #:finish       

'Second Round of Kenyan division: Half back divisor and double quotient
'repeatedly and subtract divisor from dividend when dividend is larger
'or equal of divisor. In this case increment quotient with one, after
'doubling

:second_loop

SHR      arg2,           #1            'Half divisor
CMPSUB   arg1,           arg2 WC, WZ   'Compare dividend with divisor,
                                       'If dividend is greater or equal
                                       'than divisor, then subtract
                                       'divisor from dividend and set C
         RCL r1,         #1            'Double quotient and rotate C into
                                       'LSB of quotient
IF_NZ DJNZ r2,           #:second_loop 'Continue division if something
                                       'remained and some steps are left
                                       
IF_Z SUB r2,             #1            'Remainder is zero, no more "half"                           
IF_Z ROL r1,             r2            'steps needed. Finish the job

:finish

'=========================================================================
'====================End of tested code of Function_1=====================
'=========================================================================

The separate handling of the "remainder=0" cases makes it even faster.
Average execution time for random 32-bit integers is less than 30 machine
cycles. This is almost 10(!) times faster than the conventional "Do the
same code for all 32 bits" algorithms, even when rolled out.

Improvements with more efficient PASM coding or with a faster method
are greatly appreciated. Till those improvements, however, the here
presented one is the fastest 32-bit integer division with PASM. I hope that
someone can improve it.

This code and the ultrafast Kenyan 32-bit multiplication is used in the
PASM version of the poker probability calculator. The algorithm, which
is based on prime numbers, uses 32-bit modulo/division/multiplication
operations for fast poker hand evaluations. 5 thousand poker situations can
be evaluated with a Propeller in 3 seconds. The 1% accuracy of the obtained
winning probabilities and this speed is more than enough to respond within
10 seconds in a typical online gameplay.

The attached SPIN/PASM wrapper PST application allows the user to test the
speed and correctness of the 32-bit ultrafast Kenyan division immediately. Enjoy...

Cheers,

Istvan

Leon · 2012-04-27 00:01

Use:

[ code ]

[ /code ]

to insert code in your post. Remove the spaces.

cessnapilot · 2012-04-27 00:17

Leon: Thanks.

kuroneko · 2012-04-27 01:29

@Istvan: Have you tried dividing $C0000000 by 1? It seems to take an eternity ...

cessnapilot · 2012-04-27 03:18

kuroneko: You are right. The full unsigned division needs more coding. The posted version in my programs works
perfectly as a core of a signed 32-bit routine, where the dividend absolute value is always less than $8000_0000
and sign is managed separately.

Anyway, I see your point, and I am working on an unsigned 32-bit version with the ROL ... WC and CMPX or with
some other tricky PASM way to cure the problem that rotating 1 to left 32 times results in zero in the register.

I appreciate any help.

Thanks beforehand,

Istvan

cessnapilot · 2012-05-08 01:50

I have modified the ultrafast 32-bit unsigned integer division algorithm, according to
kuroneko's remark.

'=========================================================================
'==================Start of tested code of Function_1=====================
'=========================================================================

'--------Unsigned 32-bit ultrafast Kenyan integer division----------------

'   Arguments: 32-bit dividend and divisor
'     Results: 32-bit quotient and remainder
 
'Prepare work registers
MOV      r1,             #0            'Clear 32-bit quotient
MOV      r2,             #1            'Prepare "divide back" loop cntr 

'First round of Kenyan division: double divisor repeatedly until it
'becomes larger then dividend. Count the number of doublings

first_loop1

ROL      arg2,           #1   WC       'Double divisor
IF_C     JMP             #second_Loop1 'Overflow:jump to second loop
                                       'with carry set

CMP      arg2,           arg1 WZ, WC   'Compare doubled divisor w divident
                                       'If new divisor smaller or equal,
                                       'C is set, when equals Z is set

IF_C_AND_NZ ADD r2,      #1            'IF_C_AND_NZ increment counter                       
IF_C_AND_NZ JMP          #first_loop1  'IF_C repeat first_loop

'If new divisor same as dividend, Z flag is rised, then the result is
'some power of two. Calculate it and finish 

IF_Z     MOV r1,         #1
IF_Z     ROL r1,         r2
IF_Z     MOV arg1,       #0
IF_Z     JMP             #finish1       

'Second Round of Kenyan division: Half back divisor and double quotient
'repeatedly and subtract divisor from dividend when dividend is larger
'or equal of divisor. In this case increment quotient with one, after
'doubling

second_loop1

RCR      arg2,           #1            'Half divisor when C not set
                                       'or restore divisor when C set
CMPSUB   arg1,           arg2 WC, WZ   'Compare dividend with divisor,
                                       'If dividend is greater or equal
                                       'than divisor, then subtract
                                       'divisor from dividend and set C
RCL r1,                  #1  WC        'Double quotient and rotate C into
                                       'LSB of quotient
IF_NZ DJNZ r2,           #second_loop1 'Continue division if something
                                       'remained and some steps are left
                                       
IF_Z SUB r2,             #1            'Remainder is zero, no more "half"                           
IF_Z ROL r1,             r2            'steps needed. Just finish the job

finish1

'=========================================================================
'====================End of tested code of Function_1=====================
'=========================================================================

The new version is about 4 (!) times faster on random integers, than the traditional algorithms
with 32 iterations. So, average speed is quadrupled. I think this is remarkable with such
a basic and over-studied operation.

For the comparison the compact and (probably) fully optimized traditional algorithm was
borrowed from Kyle. It takes always 32 iterations and its average execution steps is about
106 machine cycles on 10_000 pairs of pseudo-random integers.

The ultrafast Kenyan method takes only 26 machine cycles on average with the same
10_000 pairs of random integers. You can verify this with the attached PST application.

The check for division with zero was omitted from both test algorithms. Division with zero
should not be performed (only Chuck Norris can divide with zero) and this operation should
be caught on higher levels to create an exception.

However, there is room for further improvements. When the quotient is a very large number,
the traditional algorithm is somewhat faster (albeit far less than 4 times). Next step should be
the merge of the effectiveness of the Kenyan ultrafast method on random inputs with the
effectiveness of the traditional algorithm with very large quotients. With your help on this forum
almost everything is possible in PASM. Remember the 12 machine cycle 8x8 bit multiplication
which outperformed the unrolled (17 cycle) method.

Until that enjoy the benefits of the just 4 times faster (on average) new division algorithm

Cheers,
Istvan

Mark_T · 2012-05-08 05:04

There is a problem with this benchmark - it doesn't represent the reality of calling integer division.

Usually calls to integer division involve a dividend larger in magnitude than the divisor. If you alter the benchmark to limit the divisor to 20 bits then the standard division routine starts to win, and if you limit the divisor to 12 bits the standard division is much quicker...

Dividing by zero hangs, BTW, and dividing by small constants like 3 and 5 is slowest of all (this is probably the most common case in real programs).

cessnapilot · 2012-05-08 05:55

@Mark_T:

The check for the division by zero was left out on purpose. What you noticed on the size of the divisor I noticed, too.
That is why I mentioned there is room for improvements an I am working on it. The improved version will run something like:

Check for small divisors (up to 16) with a jump table (This reflects to your last comment)

Rotate divisor left and right simultaneously to select faster method

Apply faster method using the already rotated data.

As for the "reality" of integer division I have no idea. Or, better to say, there are other realities. For example if you check the
integer arithmetic code sections of floating point packages you may realize other situations, because of the very normalized
nature of floating point representation.

Thanks for the tests and comments,

Istvan

ersmith · 2012-05-08 06:05

I agree with Mark -- it's very rare for the inputs to division to be distributed randomly. Far more common is the case where the divisor is small (e.g. 10) and dividend is larger.

For this case you can optimize the division algorithm by using a "count leading zeros" operation on the divisor and/or dividend. That's effectively what the initial loop of the Kenyan division algorithm is doing, but it's taking 5 instructions per iteration and up to 31 iterations (so worst case is 155 instructions). Using a divide and conquer approach you can do the same thing in a total of 16 instructions. This isn't as good as the Kenyan division algorithm in the best case, but it's constant and better for the common case of large dividend small divisor.

Here's the unsigned division routine used by propgcc. It gives pretty good performance over a broad range of inputs. Inputs are r0 and r1; the output r0/r1 is placed in r0, with remainder in r1. If you're using this in stand-alone PASM you could eliminate a bit of register shuffling by having separate inputs and outputs; you could also inline the __CLZSI function that counts the number of leading zero bits.

__UDIVSI
    mov    __DIVR, r0
    call    #__CLZSI
    neg    __DIVCNT, r0
    mov    r0, r1 wz
 IF_Z   jmp    #__UDIV_BY_ZERO
    call    #__CLZSI
    add    __DIVCNT, r0
    mov    r0, #0
    cmps    __DIVCNT, #0    wz, wc
 IF_C    jmp    #__UDIVSI_done
    shl    r1, __DIVCNT
    add    __DIVCNT, #1
__UDIVSI_loop
    cmpsub    __DIVR, r1    wz, wc
    addx    r0, r0
    shr    r1, #1
    djnz    __DIVCNT, #__UDIVSI_loop
__UDIVSI_done
    mov    r1, __DIVR
__UDIVSI_ret
    ret

    '' come here on divide by zero
    '' we probably should raise a signal
__UDIV_BY_ZERO
    neg    r0,#1
    mov    r1,#0
    jmp    #__UDIVSI_ret

__TMP0    long    0
__MASK_00FF00FF    long    0x00FF00FF
__MASK_0F0F0F0F    long    0x0F0F0F0F
__MASK_33333333    long    0x33333333
__MASK_55555555    long    0x55555555
__CLZSI    rev    r0, #0
__CTZSI    neg    __TMP0, r0
    and    __TMP0, r0    wz
    mov    r0, #0
 IF_Z    mov    r0, #1
    test    __TMP0, __MASK_0000FFFF    wz
 IF_Z    add    r0, #16
    test    __TMP0, __MASK_00FF00FF    wz
 IF_Z    add    r0, #8
    test    __TMP0, __MASK_0F0F0F0F    wz
 IF_Z    add    r0, #4
    test    __TMP0, __MASK_33333333    wz
 IF_Z    add    r0, #2
    test    __TMP0, __MASK_55555555    wz
 IF_Z    add    r0, #1
__CLZSI_ret    ret
__DIVR    long    0
__TMP1
__DIVCNT
    long    0

Mark_T · 2012-05-08 15:11

Actually I've realized there are circumstances when divide is called effectively on random input, and thats in a multi-precision arithmetic library - but there the dominant loop is the multiply-and-subtract loop that follows the test-division.

cessnapilot · 2012-05-08 22:40

@ersmith: Than you very much for the PASM code. I shall implement it in the PASM_Bench.
The program will be modified according to Mark_T's suggestion, to test different random input distributions,
either for the dividend and/or for the divisor. The typical small divisor (<11) cases will be tested too. In this
way the user can select the the most appropriate division algorithm for a given kind of arithmetics.

I especially like the divide and conquer method to figure out the count of leading zeros. I want (or just hope)
always for the best and this seems to be. There should be a direct PASM code for that as it is so useful in
many algorithms.

Cheers,

Istvan

cessnapilot · 2012-05-30 02:48

I attached a comprehensive PASM bench for 32-bit unsigned division.

The tested algorithms are:

- Fast Kenyan

- Zerocounter (The one used in propgcc, provided by ersmith)

- Traditional (The one provided by Kye in one of his posts)

The following table summarises the average machine cycle execution times for different relative size dividend/divisor combinations.
Arguments were selected randomly from the given ranges 10_000 times.

FastKenyan____ZeroCounter__Traditional_____Dividend / Divisor
===============================================
______36__________55_________106________32-bit / 32-bit
_____101_________119_________122________32-bit / 16-bit
_____133_________150_________130________32-bit / 8-bit
_____149_________164_________133________32-bit / 4-bit

______36__________55_________122________16-bit / 16-bit
______68__________86_________130________16-bit / 8-bit
______84_________100_________133________16-bit / 4-bit

______36__________54_________130_________8-bit / 8-bit
______52__________68_________133_________8-bit / 4-bit

______77__________95_________127 <

Average execution time in machine cycles (4 clock ticks)

The Kenyan is the fastest in 7 of the 9 cases and it is allways faster here than the ZeroCounter algorithm used in propgcc.
On average the Kenyan is 23% faster than the ZeroCounter and is 65% faster than the Traditional (similar that used in SPIN interpreter).

The Kenyan code runs as follows:

'=========================================================================
'==================Start of tested code of Function_1=====================
'=========================================================================

'-------------Unsigned 32-bit fast Kenyan integer division----------------

'   Arguments: 32-bit dividend and divisor
'     Results: 32-bit quotient and remainder
 
'Prepare work registers
               MOV     r1,       #0          'Clear 32-bit quotient
               MOV     r2,       #0          'Prepare "divide back" cntr 

'Check for relatively small divisors               
               MOV     r0,       arg1        'Copy dividend to a temp reg
               SHR     r0,       #16          
               CMP     arg2,     r0 WZ, WC

IF_NC          JMP     #prep_shift_8

IF_C_AND_NZ    ADD     r2,       #16

               JMP     #shift_8

prep_shift_8

               MOV     r0,       arg1        'Copy dividend to a temp reg 

shift_8

               SHR     r0,       #8
               CMP     arg2,     r0 WZ, WC
               
IF_NC          JMP     #prep_shift_4

IF_C_AND_NZ    ADD r2, #8

               JMP     #shift_4

prep_shift_4

               MOV     r0,       arg1        'Copy dividend to a temp reg 
               SHR     r0,       r2          'Apply accumulated shifts

shift_4

               SHR     r0,       #4
               CMP     arg2,     r0 WZ, WC
               
IF_NC          JMP     #prep_1stloop

IF_C_AND_NZ    ADD     r2,       #4

prep_1stloop

               SHL      arg2,           r2   'Apply accumulated shifts
                                             'on divisor 
               ADD      r2,             #1 

'First round of Kenyan division: double divisor repeatedly until it
'becomes larger then dividend. Count the number of doublings
first_loop1

               ROL     arg2,     #1 WC       'Double divisor
IF_C           JMP     #second_Loop1           
               CMP     arg2,     arg1 WZ, WC 'Compare doubled divisor with
                                             'divident. If new divisor is
                                             'smaller or equal, C is set,
                                             'when equals Z is set

IF_C_AND_NZ    ADD     r2,       #1          'IF_C_AND_NZ increment cntr                       
IF_C_AND_NZ    JMP     #first_loop1          'IF_C repeat first_loop

'Second Round of Kenyan division: Half back divisor and double quotient
'repeatedly and subtract divisor from dividend when dividend is larger
'or equal of divisor. In this case increment quotient with one, after
'doubling
second_loop1

               RCR     arg2,     #1          'Half divisor when C not set
                                             'or restore divisor when C
               CMPSUB  arg1,     arg2 WC, WZ 'Compare dividend with divsor
                                             'If dividend is greater or
                                             'equal, then subtract divisor
                                             'from dividend and set C
               RCL     r1,       #1 WC       'Double quotient and rotate C
                                             'into LSB of quotient
IF_NZ          DJNZ    r2,       #second_loop1 'Continue division if some
                                             'remained and steps are left
                                       
IF_Z           SUB     r2,       #1          'Finish division with no                           
IF_Z           ROL     r1,       r2          'remainder  

finish1
'=========================================================================
'====================End of tested code of Function_1=====================
'=========================================================================

Interesting to see, that such over-studied and over-optimized important algorithms like the ones for binary integer division can be overcame
using PASM commands and some imagination.

Bug reports, further optimization ideas or more efficient PASM codes are welcome.

Cheers,

Istvan

Heater. · 2012-05-30 03:03

cesnapilot,

When developing a real-time control system one would be interested in fast execution, of course, but of prime importance is the worst case execution time. If that is too long to meet some deadline then it's no use that all the other cases are much faster.

Can you present the worst case run time for all the different algos?

jmg · 2012-05-30 03:41

cessnapilot wrote: »

Bug reports, further optimization ideas or more efficient PASM codes are welcome.

I'd agree that worse case time matters more, and rather than random you should be able to predict the values that drive that worse case.
Then, a worse case for some constrained cases is useful too, as someone may be doing /10 or /SomeSmallConstant for example.

I'd also be interested in how this scales to a 64/32 case ?

cessnapilot · 2012-05-30 05:12

Heater:

The worst case run time in machine cycles for the 32-bit, 16-bit and 8-bit dividends are:

FastKenyan_____ZeroCounter______Traditional_____Dividend
================================================
____166___________170____________135_________32-bit
____104___________106____________135_________16-bit
_____72____________74____________135__________8-bit

As you can see, in this respect Kenyan is allways better than the ZeroCounter algorithm
in propgcc and with the 16/8 bit arguments it is far most better than the Traditional one.

In real-time control systems with e.g 12-bit ADCs and servos the 16 bit arguments are common.
Here you can take great advantage with the the new divison algorithm where the worst case
execution time is only 77% of that of the traditional code, not to mention the much faster
speed over the others.

However, as I was lazy to program the last 2-bit shift, the Traditional has 19% shorter
worst case (/1) execution time than the Kenyan at 32-bit arguments. Completing the Kenyan
code with those shifts will noticably reduce the execution time for /1 (and for /2) divisions for 32-bit
dividends.

Anyway, the division with 1 or 2 should be done directly in a time critical real-time system. (A weak
excuse for my laziness.)

jmg:

I can only guess that with those effective first shifts, the Kenyan will keep its relative speed advantage with
those algorithms even on the 64-bit scale

Cheers,

Istvan

Kye · 2012-05-30 07:30

I optimized my code for size. I ran out of space in the cog I'm using it in so bad that I'm using shadow registers for general purpose variables to avoid having to alias "res" variables in the code.

turbosupra · 2012-05-30 18:40

Very nice work!

lonesock · 2012-06-02 00:44

Kye wrote: »

I optimized my code for size. I ran out of space in the cog I'm using it in so bad that I'm using shadow registers for general purpose variables to avoid having to alias "res" variables in the code.

You can save two longs with this:

divide        ' divide 2 32-bit numbers: arg1 /= arg2 with remainder in r2
              mov       r1, #32                 ' bit count
              mov       r2, #0                  ' buffer & remainder
:dl1          shl       arg1, #1 wc             ' shift numerator left until we get a bit
        if_nc djnz      r1, #:dl1               ' keep going while no high bit...NOTE: can comment out to save one long
:dl2          rcl       r2, #1                  ' shift the last bit into the remainder
              cmpsub    r2, arg2 wc             ' see if we can sub the divisor
              rcl       arg1, #1 wc             ' keep the result, while shifting out the high bit again
              djnz      r1, #:dl2               ' finish out the rest of the bits

You can save one extra long by commenting out the 1st djnz, which lowers the worst case by 4 clocks, but loses a bit in speed if there are leading 0's in the numerator.

Jonathan

Fast, Faster, Fastest Code: Ultrafast 32-bit integer division

Comments