Fast, Faster, Fastest Code: Ultrafast 32-bit integer division
cessnapilot
Posts: 182
Continuing this series of threads a new version of the ultrafast
Kenyan 32-bit integer division algorithm attached to this post.
The separate handling of the "remainder=0" cases makes it even faster.
Average execution time for random 32-bit integers is less than 30 machine
cycles. This is almost 10(!) times faster than the conventional "Do the
same code for all 32 bits" algorithms, even when rolled out.
Improvements with more efficient PASM coding or with a faster method
are greatly appreciated. Till those improvements, however, the here
presented one is the fastest 32-bit integer division with PASM. I hope that
someone can improve it.
This code and the ultrafast Kenyan 32-bit multiplication is used in the
PASM version of the poker probability calculator. The algorithm, which
is based on prime numbers, uses 32-bit modulo/division/multiplication
operations for fast poker hand evaluations. 5 thousand poker situations can
be evaluated with a Propeller in 3 seconds. The 1% accuracy of the obtained
winning probabilities and this speed is more than enough to respond within
10 seconds in a typical online gameplay.
The attached SPIN/PASM wrapper PST application allows the user to test the
speed and correctness of the 32-bit ultrafast Kenyan division immediately. Enjoy...
Cheers,
Istvan
Kenyan 32-bit integer division algorithm attached to this post.
'========================================================================= '==================Start of tested code of Function_1===================== '========================================================================= '--------Unsigned 32-bit ultrafast Kenyan integer division---------------- ' Arguments: 32-bit dividend and divisor ' Results: 32-bit quotient and remainder 'Prepare work registers MOV r1, #0 'Clear 32-bit quotient MOV r2, #1 'Prepare "divide back" loop cntr 'First round of Kenyan division: double divisor repeatedly until it 'becomes larger then dividend. Count the number of doublings :first_loop SHL arg2, #1 'Double divisor CMP arg2, arg1 WZ, WC 'Compare doubled divisor w dividend 'If new divisor smaller or equal, 'C is set, when equals Z is set IF_C_AND_NZ ADD r2, #1 'IF_C_AND_NZ increment counter IF_C_AND_NZ JMP #:first_loop 'IF_C repeat first_loop 'If new divisor same as dividend, Z flag is rised, then the result is 'some power of two. Calculate it and finish IF_Z MOV r1, #1 IF_Z ROL r1, r2 IF_Z MOV arg1, #0 IF_Z JMP #:finish 'Second Round of Kenyan division: Half back divisor and double quotient 'repeatedly and subtract divisor from dividend when dividend is larger 'or equal of divisor. In this case increment quotient with one, after 'doubling :second_loop SHR arg2, #1 'Half divisor CMPSUB arg1, arg2 WC, WZ 'Compare dividend with divisor, 'If dividend is greater or equal 'than divisor, then subtract 'divisor from dividend and set C RCL r1, #1 'Double quotient and rotate C into 'LSB of quotient IF_NZ DJNZ r2, #:second_loop 'Continue division if something 'remained and some steps are left IF_Z SUB r2, #1 'Remainder is zero, no more "half" IF_Z ROL r1, r2 'steps needed. Finish the job :finish '========================================================================= '====================End of tested code of Function_1===================== '=========================================================================
The separate handling of the "remainder=0" cases makes it even faster.
Average execution time for random 32-bit integers is less than 30 machine
cycles. This is almost 10(!) times faster than the conventional "Do the
same code for all 32 bits" algorithms, even when rolled out.
Improvements with more efficient PASM coding or with a faster method
are greatly appreciated. Till those improvements, however, the here
presented one is the fastest 32-bit integer division with PASM. I hope that
someone can improve it.
This code and the ultrafast Kenyan 32-bit multiplication is used in the
PASM version of the poker probability calculator. The algorithm, which
is based on prime numbers, uses 32-bit modulo/division/multiplication
operations for fast poker hand evaluations. 5 thousand poker situations can
be evaluated with a Propeller in 3 seconds. The 1% accuracy of the obtained
winning probabilities and this speed is more than enough to respond within
10 seconds in a typical online gameplay.
The attached SPIN/PASM wrapper PST application allows the user to test the
speed and correctness of the 32-bit ultrafast Kenyan division immediately. Enjoy...
Cheers,
Istvan
Comments
[ code ]
[ /code ]
to insert code in your post. Remove the spaces.
perfectly as a core of a signed 32-bit routine, where the dividend absolute value is always less than $8000_0000
and sign is managed separately.
Anyway, I see your point, and I am working on an unsigned 32-bit version with the ROL ... WC and CMPX or with
some other tricky PASM way to cure the problem that rotating 1 to left 32 times results in zero in the register.
I appreciate any help.
Thanks beforehand,
Istvan
kuroneko's remark.
The new version is about 4 (!) times faster on random integers, than the traditional algorithms
with 32 iterations. So, average speed is quadrupled. I think this is remarkable with such
a basic and over-studied operation.
For the comparison the compact and (probably) fully optimized traditional algorithm was
borrowed from Kyle. It takes always 32 iterations and its average execution steps is about
106 machine cycles on 10_000 pairs of pseudo-random integers.
The ultrafast Kenyan method takes only 26 machine cycles on average with the same
10_000 pairs of random integers. You can verify this with the attached PST application.
The check for division with zero was omitted from both test algorithms. Division with zero
should not be performed (only Chuck Norris can divide with zero) and this operation should
be caught on higher levels to create an exception.
However, there is room for further improvements. When the quotient is a very large number,
the traditional algorithm is somewhat faster (albeit far less than 4 times). Next step should be
the merge of the effectiveness of the Kenyan ultrafast method on random inputs with the
effectiveness of the traditional algorithm with very large quotients. With your help on this forum
almost everything is possible in PASM. Remember the 12 machine cycle 8x8 bit multiplication
which outperformed the unrolled (17 cycle) method.
Until that enjoy the benefits of the just 4 times faster (on average) new division algorithm
Cheers,
Istvan
Usually calls to integer division involve a dividend larger in magnitude than the divisor. If you alter the benchmark to limit the divisor to 20 bits then the standard division routine starts to win, and if you limit the divisor to 12 bits the standard division is much quicker...
Dividing by zero hangs, BTW, and dividing by small constants like 3 and 5 is slowest of all (this is probably the most common case in real programs).
The check for the division by zero was left out on purpose. What you noticed on the size of the divisor I noticed, too.
That is why I mentioned there is room for improvements an I am working on it. The improved version will run something like:
Check for small divisors (up to 16) with a jump table (This reflects to your last comment)
Rotate divisor left and right simultaneously to select faster method
Apply faster method using the already rotated data.
As for the "reality" of integer division I have no idea. Or, better to say, there are other realities. For example if you check the
integer arithmetic code sections of floating point packages you may realize other situations, because of the very normalized
nature of floating point representation.
Thanks for the tests and comments,
Istvan
For this case you can optimize the division algorithm by using a "count leading zeros" operation on the divisor and/or dividend. That's effectively what the initial loop of the Kenyan division algorithm is doing, but it's taking 5 instructions per iteration and up to 31 iterations (so worst case is 155 instructions). Using a divide and conquer approach you can do the same thing in a total of 16 instructions. This isn't as good as the Kenyan division algorithm in the best case, but it's constant and better for the common case of large dividend small divisor.
Here's the unsigned division routine used by propgcc. It gives pretty good performance over a broad range of inputs. Inputs are r0 and r1; the output r0/r1 is placed in r0, with remainder in r1. If you're using this in stand-alone PASM you could eliminate a bit of register shuffling by having separate inputs and outputs; you could also inline the __CLZSI function that counts the number of leading zero bits.
The program will be modified according to Mark_T's suggestion, to test different random input distributions,
either for the dividend and/or for the divisor. The typical small divisor (<11) cases will be tested too. In this
way the user can select the the most appropriate division algorithm for a given kind of arithmetics.
I especially like the divide and conquer method to figure out the count of leading zeros. I want (or just hope)
always for the best and this seems to be. There should be a direct PASM code for that as it is so useful in
many algorithms.
Cheers,
Istvan
The tested algorithms are:
- Fast Kenyan
- Zerocounter (The one used in propgcc, provided by ersmith)
- Traditional (The one provided by Kye in one of his posts)
The following table summarises the average machine cycle execution times for different relative size dividend/divisor combinations.
Arguments were selected randomly from the given ranges 10_000 times.
FastKenyan____ZeroCounter__Traditional_____Dividend / Divisor
===============================================
______36__________55_________106________32-bit / 32-bit
_____101_________119_________122________32-bit / 16-bit
_____133_________150_________130________32-bit / 8-bit
_____149_________164_________133________32-bit / 4-bit
______36__________55_________122________16-bit / 16-bit
______68__________86_________130________16-bit / 8-bit
______84_________100_________133________16-bit / 4-bit
______36__________54_________130_________8-bit / 8-bit
______52__________68_________133_________8-bit / 4-bit
______77__________95_________127 <
Average execution time in machine cycles (4 clock ticks)
The Kenyan is the fastest in 7 of the 9 cases and it is allways faster here than the ZeroCounter algorithm used in propgcc.
On average the Kenyan is 23% faster than the ZeroCounter and is 65% faster than the Traditional (similar that used in SPIN interpreter).
The Kenyan code runs as follows:
Interesting to see, that such over-studied and over-optimized important algorithms like the ones for binary integer division can be overcame
using PASM commands and some imagination.
Bug reports, further optimization ideas or more efficient PASM codes are welcome.
Cheers,
Istvan
When developing a real-time control system one would be interested in fast execution, of course, but of prime importance is the worst case execution time. If that is too long to meet some deadline then it's no use that all the other cases are much faster.
Can you present the worst case run time for all the different algos?
I'd agree that worse case time matters more, and rather than random you should be able to predict the values that drive that worse case.
Then, a worse case for some constrained cases is useful too, as someone may be doing /10 or /SomeSmallConstant for example.
I'd also be interested in how this scales to a 64/32 case ?
The worst case run time in machine cycles for the 32-bit, 16-bit and 8-bit dividends are:
FastKenyan_____ZeroCounter______Traditional_____Dividend
================================================
____166___________170____________135_________32-bit
____104___________106____________135_________16-bit
_____72____________74____________135__________8-bit
As you can see, in this respect Kenyan is allways better than the ZeroCounter algorithm
in propgcc and with the 16/8 bit arguments it is far most better than the Traditional one.
In real-time control systems with e.g 12-bit ADCs and servos the 16 bit arguments are common.
Here you can take great advantage with the the new divison algorithm where the worst case
execution time is only 77% of that of the traditional code, not to mention the much faster
speed over the others.
However, as I was lazy to program the last 2-bit shift, the Traditional has 19% shorter
worst case (/1) execution time than the Kenyan at 32-bit arguments. Completing the Kenyan
code with those shifts will noticably reduce the execution time for /1 (and for /2) divisions for 32-bit
dividends.
Anyway, the division with 1 or 2 should be done directly in a time critical real-time system. (A weak
excuse for my laziness.)
jmg:
I can only guess that with those effective first shifts, the Kenyan will keep its relative speed advantage with
those algorithms even on the 64-bit scale
Cheers,
Istvan
Jonathan