Fast, Faster, Fastest Code: Ultrafast 32-bit integer division
Continuing this series of threads a new version of the ultrafast
Kenyan 32-bit integer division algorithm attached to this post.
The separate handling of the "remainder=0" cases makes it even faster.
Average execution time for random 32-bit integers is less than 30 machine
cycles. This is almost 10(!) times faster than the conventional "Do the
same code for all 32 bits" algorithms, even when rolled out.
Improvements with more efficient PASM coding or with a faster method
are greatly appreciated. Till those improvements, however, the here
presented one is the fastest 32-bit integer division with PASM. I hope that
someone can improve it.
This code and the ultrafast Kenyan 32-bit multiplication is used in the
PASM version of the poker probability calculator. The algorithm, which
is based on prime numbers, uses 32-bit modulo/division/multiplication
operations for fast poker hand evaluations. 5 thousand poker situations can
be evaluated with a Propeller in 3 seconds. The 1% accuracy of the obtained
winning probabilities and this speed is more than enough to respond within
10 seconds in a typical online gameplay.
The attached SPIN/PASM wrapper PST application allows the user to test the
speed and correctness of the 32-bit ultrafast Kenyan division immediately. Enjoy...
Cheers,
Istvan
Kenyan 32-bit integer division algorithm attached to this post.
'=========================================================================
'==================Start of tested code of Function_1=====================
'=========================================================================
'--------Unsigned 32-bit ultrafast Kenyan integer division----------------
' Arguments: 32-bit dividend and divisor
' Results: 32-bit quotient and remainder
'Prepare work registers
MOV r1, #0 'Clear 32-bit quotient
MOV r2, #1 'Prepare "divide back" loop cntr
'First round of Kenyan division: double divisor repeatedly until it
'becomes larger then dividend. Count the number of doublings
:first_loop
SHL arg2, #1 'Double divisor
CMP arg2, arg1 WZ, WC 'Compare doubled divisor w dividend
'If new divisor smaller or equal,
'C is set, when equals Z is set
IF_C_AND_NZ ADD r2, #1 'IF_C_AND_NZ increment counter
IF_C_AND_NZ JMP #:first_loop 'IF_C repeat first_loop
'If new divisor same as dividend, Z flag is rised, then the result is
'some power of two. Calculate it and finish
IF_Z MOV r1, #1
IF_Z ROL r1, r2
IF_Z MOV arg1, #0
IF_Z JMP #:finish
'Second Round of Kenyan division: Half back divisor and double quotient
'repeatedly and subtract divisor from dividend when dividend is larger
'or equal of divisor. In this case increment quotient with one, after
'doubling
:second_loop
SHR arg2, #1 'Half divisor
CMPSUB arg1, arg2 WC, WZ 'Compare dividend with divisor,
'If dividend is greater or equal
'than divisor, then subtract
'divisor from dividend and set C
RCL r1, #1 'Double quotient and rotate C into
'LSB of quotient
IF_NZ DJNZ r2, #:second_loop 'Continue division if something
'remained and some steps are left
IF_Z SUB r2, #1 'Remainder is zero, no more "half"
IF_Z ROL r1, r2 'steps needed. Finish the job
:finish
'=========================================================================
'====================End of tested code of Function_1=====================
'=========================================================================
The separate handling of the "remainder=0" cases makes it even faster.
Average execution time for random 32-bit integers is less than 30 machine
cycles. This is almost 10(!) times faster than the conventional "Do the
same code for all 32 bits" algorithms, even when rolled out.
Improvements with more efficient PASM coding or with a faster method
are greatly appreciated. Till those improvements, however, the here
presented one is the fastest 32-bit integer division with PASM. I hope that
someone can improve it.
This code and the ultrafast Kenyan 32-bit multiplication is used in the
PASM version of the poker probability calculator. The algorithm, which
is based on prime numbers, uses 32-bit modulo/division/multiplication
operations for fast poker hand evaluations. 5 thousand poker situations can
be evaluated with a Propeller in 3 seconds. The 1% accuracy of the obtained
winning probabilities and this speed is more than enough to respond within
10 seconds in a typical online gameplay.
The attached SPIN/PASM wrapper PST application allows the user to test the
speed and correctness of the 32-bit ultrafast Kenyan division immediately. Enjoy...
Cheers,
Istvan
Comments
[ code ]
[ /code ]
to insert code in your post. Remove the spaces.
perfectly as a core of a signed 32-bit routine, where the dividend absolute value is always less than $8000_0000
and sign is managed separately.
Anyway, I see your point, and I am working on an unsigned 32-bit version with the ROL ... WC and CMPX or with
some other tricky PASM way to cure the problem that rotating 1 to left 32 times results in zero in the register.
I appreciate any help.
Thanks beforehand,
Istvan
kuroneko's remark.
'========================================================================= '==================Start of tested code of Function_1===================== '========================================================================= '--------Unsigned 32-bit ultrafast Kenyan integer division---------------- ' Arguments: 32-bit dividend and divisor ' Results: 32-bit quotient and remainder 'Prepare work registers MOV r1, #0 'Clear 32-bit quotient MOV r2, #1 'Prepare "divide back" loop cntr 'First round of Kenyan division: double divisor repeatedly until it 'becomes larger then dividend. Count the number of doublings first_loop1 ROL arg2, #1 WC 'Double divisor IF_C JMP #second_Loop1 'Overflow:jump to second loop 'with carry set CMP arg2, arg1 WZ, WC 'Compare doubled divisor w divident 'If new divisor smaller or equal, 'C is set, when equals Z is set IF_C_AND_NZ ADD r2, #1 'IF_C_AND_NZ increment counter IF_C_AND_NZ JMP #first_loop1 'IF_C repeat first_loop 'If new divisor same as dividend, Z flag is rised, then the result is 'some power of two. Calculate it and finish IF_Z MOV r1, #1 IF_Z ROL r1, r2 IF_Z MOV arg1, #0 IF_Z JMP #finish1 'Second Round of Kenyan division: Half back divisor and double quotient 'repeatedly and subtract divisor from dividend when dividend is larger 'or equal of divisor. In this case increment quotient with one, after 'doubling second_loop1 RCR arg2, #1 'Half divisor when C not set 'or restore divisor when C set CMPSUB arg1, arg2 WC, WZ 'Compare dividend with divisor, 'If dividend is greater or equal 'than divisor, then subtract 'divisor from dividend and set C RCL r1, #1 WC 'Double quotient and rotate C into 'LSB of quotient IF_NZ DJNZ r2, #second_loop1 'Continue division if something 'remained and some steps are left IF_Z SUB r2, #1 'Remainder is zero, no more "half" IF_Z ROL r1, r2 'steps needed. Just finish the job finish1 '========================================================================= '====================End of tested code of Function_1===================== '=========================================================================
The new version is about 4 (!) times faster on random integers, than the traditional algorithms
with 32 iterations. So, average speed is quadrupled. I think this is remarkable with such
a basic and over-studied operation.
For the comparison the compact and (probably) fully optimized traditional algorithm was
borrowed from Kyle. It takes always 32 iterations and its average execution steps is about
106 machine cycles on 10_000 pairs of pseudo-random integers.
The ultrafast Kenyan method takes only 26 machine cycles on average with the same
10_000 pairs of random integers. You can verify this with the attached PST application.
The check for division with zero was omitted from both test algorithms. Division with zero
should not be performed (only Chuck Norris can divide with zero) and this operation should
be caught on higher levels to create an exception.
However, there is room for further improvements. When the quotient is a very large number,
the traditional algorithm is somewhat faster (albeit far less than 4 times). Next step should be
the merge of the effectiveness of the Kenyan ultrafast method on random inputs with the
effectiveness of the traditional algorithm with very large quotients. With your help on this forum
almost everything is possible in PASM. Remember the 12 machine cycle 8x8 bit multiplication
which outperformed the unrolled (17 cycle) method.
Until that enjoy the benefits of the just 4 times faster (on average) new division algorithm
Cheers,
Istvan
Usually calls to integer division involve a dividend larger in magnitude than the divisor. If you alter the benchmark to limit the divisor to 20 bits then the standard division routine starts to win, and if you limit the divisor to 12 bits the standard division is much quicker...
Dividing by zero hangs, BTW, and dividing by small constants like 3 and 5 is slowest of all (this is probably the most common case in real programs).
The check for the division by zero was left out on purpose. What you noticed on the size of the divisor I noticed, too.
That is why I mentioned there is room for improvements an I am working on it. The improved version will run something like:
Check for small divisors (up to 16) with a jump table (This reflects to your last comment)
Rotate divisor left and right simultaneously to select faster method
Apply faster method using the already rotated data.
As for the "reality" of integer division I have no idea. Or, better to say, there are other realities. For example if you check the
integer arithmetic code sections of floating point packages you may realize other situations, because of the very normalized
nature of floating point representation.
Thanks for the tests and comments,
Istvan
For this case you can optimize the division algorithm by using a "count leading zeros" operation on the divisor and/or dividend. That's effectively what the initial loop of the Kenyan division algorithm is doing, but it's taking 5 instructions per iteration and up to 31 iterations (so worst case is 155 instructions). Using a divide and conquer approach you can do the same thing in a total of 16 instructions. This isn't as good as the Kenyan division algorithm in the best case, but it's constant and better for the common case of large dividend small divisor.
Here's the unsigned division routine used by propgcc. It gives pretty good performance over a broad range of inputs. Inputs are r0 and r1; the output r0/r1 is placed in r0, with remainder in r1. If you're using this in stand-alone PASM you could eliminate a bit of register shuffling by having separate inputs and outputs; you could also inline the __CLZSI function that counts the number of leading zero bits.
__UDIVSI mov __DIVR, r0 call #__CLZSI neg __DIVCNT, r0 mov r0, r1 wz IF_Z jmp #__UDIV_BY_ZERO call #__CLZSI add __DIVCNT, r0 mov r0, #0 cmps __DIVCNT, #0 wz, wc IF_C jmp #__UDIVSI_done shl r1, __DIVCNT add __DIVCNT, #1 __UDIVSI_loop cmpsub __DIVR, r1 wz, wc addx r0, r0 shr r1, #1 djnz __DIVCNT, #__UDIVSI_loop __UDIVSI_done mov r1, __DIVR __UDIVSI_ret ret '' come here on divide by zero '' we probably should raise a signal __UDIV_BY_ZERO neg r0,#1 mov r1,#0 jmp #__UDIVSI_ret __TMP0 long 0 __MASK_00FF00FF long 0x00FF00FF __MASK_0F0F0F0F long 0x0F0F0F0F __MASK_33333333 long 0x33333333 __MASK_55555555 long 0x55555555 __CLZSI rev r0, #0 __CTZSI neg __TMP0, r0 and __TMP0, r0 wz mov r0, #0 IF_Z mov r0, #1 test __TMP0, __MASK_0000FFFF wz IF_Z add r0, #16 test __TMP0, __MASK_00FF00FF wz IF_Z add r0, #8 test __TMP0, __MASK_0F0F0F0F wz IF_Z add r0, #4 test __TMP0, __MASK_33333333 wz IF_Z add r0, #2 test __TMP0, __MASK_55555555 wz IF_Z add r0, #1 __CLZSI_ret ret __DIVR long 0 __TMP1 __DIVCNT long 0
The program will be modified according to Mark_T's suggestion, to test different random input distributions,
either for the dividend and/or for the divisor. The typical small divisor (<11) cases will be tested too. In this
way the user can select the the most appropriate division algorithm for a given kind of arithmetics.
I especially like the divide and conquer method to figure out the count of leading zeros. I want (or just hope)
always for the best and this seems to be. There should be a direct PASM code for that as it is so useful in
many algorithms.
Cheers,
Istvan
The tested algorithms are:
- Fast Kenyan
- Zerocounter (The one used in propgcc, provided by ersmith)
- Traditional (The one provided by Kye in one of his posts)
The following table summarises the average machine cycle execution times for different relative size dividend/divisor combinations.
Arguments were selected randomly from the given ranges 10_000 times.
FastKenyan____ZeroCounter__Traditional_____Dividend / Divisor
===============================================
______36__________55_________106________32-bit / 32-bit
_____101_________119_________122________32-bit / 16-bit
_____133_________150_________130________32-bit / 8-bit
_____149_________164_________133________32-bit / 4-bit
______36__________55_________122________16-bit / 16-bit
______68__________86_________130________16-bit / 8-bit
______84_________100_________133________16-bit / 4-bit
______36__________54_________130_________8-bit / 8-bit
______52__________68_________133_________8-bit / 4-bit
______77__________95_________127 <
Average execution time in machine cycles (4 clock ticks)
The Kenyan is the fastest in 7 of the 9 cases and it is allways faster here than the ZeroCounter algorithm used in propgcc.
On average the Kenyan is 23% faster than the ZeroCounter and is 65% faster than the Traditional (similar that used in SPIN interpreter).
The Kenyan code runs as follows:
'========================================================================= '==================Start of tested code of Function_1===================== '========================================================================= '-------------Unsigned 32-bit fast Kenyan integer division---------------- ' Arguments: 32-bit dividend and divisor ' Results: 32-bit quotient and remainder 'Prepare work registers MOV r1, #0 'Clear 32-bit quotient MOV r2, #0 'Prepare "divide back" cntr 'Check for relatively small divisors MOV r0, arg1 'Copy dividend to a temp reg SHR r0, #16 CMP arg2, r0 WZ, WC IF_NC JMP #prep_shift_8 IF_C_AND_NZ ADD r2, #16 JMP #shift_8 prep_shift_8 MOV r0, arg1 'Copy dividend to a temp reg shift_8 SHR r0, #8 CMP arg2, r0 WZ, WC IF_NC JMP #prep_shift_4 IF_C_AND_NZ ADD r2, #8 JMP #shift_4 prep_shift_4 MOV r0, arg1 'Copy dividend to a temp reg SHR r0, r2 'Apply accumulated shifts shift_4 SHR r0, #4 CMP arg2, r0 WZ, WC IF_NC JMP #prep_1stloop IF_C_AND_NZ ADD r2, #4 prep_1stloop SHL arg2, r2 'Apply accumulated shifts 'on divisor ADD r2, #1 'First round of Kenyan division: double divisor repeatedly until it 'becomes larger then dividend. Count the number of doublings first_loop1 ROL arg2, #1 WC 'Double divisor IF_C JMP #second_Loop1 CMP arg2, arg1 WZ, WC 'Compare doubled divisor with 'divident. If new divisor is 'smaller or equal, C is set, 'when equals Z is set IF_C_AND_NZ ADD r2, #1 'IF_C_AND_NZ increment cntr IF_C_AND_NZ JMP #first_loop1 'IF_C repeat first_loop 'Second Round of Kenyan division: Half back divisor and double quotient 'repeatedly and subtract divisor from dividend when dividend is larger 'or equal of divisor. In this case increment quotient with one, after 'doubling second_loop1 RCR arg2, #1 'Half divisor when C not set 'or restore divisor when C CMPSUB arg1, arg2 WC, WZ 'Compare dividend with divsor 'If dividend is greater or 'equal, then subtract divisor 'from dividend and set C RCL r1, #1 WC 'Double quotient and rotate C 'into LSB of quotient IF_NZ DJNZ r2, #second_loop1 'Continue division if some 'remained and steps are left IF_Z SUB r2, #1 'Finish division with no IF_Z ROL r1, r2 'remainder finish1 '========================================================================= '====================End of tested code of Function_1===================== '=========================================================================
Interesting to see, that such over-studied and over-optimized important algorithms like the ones for binary integer division can be overcame
using PASM commands and some imagination.
Bug reports, further optimization ideas or more efficient PASM codes are welcome.
Cheers,
Istvan
When developing a real-time control system one would be interested in fast execution, of course, but of prime importance is the worst case execution time. If that is too long to meet some deadline then it's no use that all the other cases are much faster.
Can you present the worst case run time for all the different algos?
I'd agree that worse case time matters more, and rather than random you should be able to predict the values that drive that worse case.
Then, a worse case for some constrained cases is useful too, as someone may be doing /10 or /SomeSmallConstant for example.
I'd also be interested in how this scales to a 64/32 case ?
The worst case run time in machine cycles for the 32-bit, 16-bit and 8-bit dividends are:
FastKenyan_____ZeroCounter______Traditional_____Dividend
================================================
____166___________170____________135_________32-bit
____104___________106____________135_________16-bit
_____72____________74____________135__________8-bit
As you can see, in this respect Kenyan is allways better than the ZeroCounter algorithm
in propgcc and with the 16/8 bit arguments it is far most better than the Traditional one.
In real-time control systems with e.g 12-bit ADCs and servos the 16 bit arguments are common.
Here you can take great advantage with the the new divison algorithm where the worst case
execution time is only 77% of that of the traditional code, not to mention the much faster
speed over the others.
However, as I was lazy to program the last 2-bit shift, the Traditional has 19% shorter
worst case (/1) execution time than the Kenyan at 32-bit arguments. Completing the Kenyan
code with those shifts will noticably reduce the execution time for /1 (and for /2) divisions for 32-bit
dividends.
Anyway, the division with 1 or 2 should be done directly in a time critical real-time system. (A weak
excuse for my laziness.)
jmg:
I can only guess that with those effective first shifts, the Kenyan will keep its relative speed advantage with
those algorithms even on the 64-bit scale
Cheers,
Istvan
divide ' divide 2 32-bit numbers: arg1 /= arg2 with remainder in r2 mov r1, #32 ' bit count mov r2, #0 ' buffer & remainder :dl1 shl arg1, #1 wc ' shift numerator left until we get a bit if_nc djnz r1, #:dl1 ' keep going while no high bit...NOTE: can comment out to save one long :dl2 rcl r2, #1 ' shift the last bit into the remainder cmpsub r2, arg2 wc ' see if we can sub the divisor rcl arg1, #1 wc ' keep the result, while shifting out the high bit again djnz r1, #:dl2 ' finish out the rest of the bits
You can save one extra long by commenting out the 1st djnz, which lowers the worst case by 4 clocks, but loses a bit in speed if there are leading 0's in the numerator.Jonathan