Fast, Faster, Fastest Code: Cheetah Integer Division (Kenyan style)
Hi All,
Here is the Kenyan style fast integer division algorithm in a PASM like pseudocode
Notation: QUOTIENT = DIVIDEND / DIVISOR
:First Round
Double DIVISIOR until Greater Than DIVIDEND, count the number of doublings in COUNTER
:Second Round
Half·· DIVISOR
CMPSUB DIVIDEND, DIVISOR WC······ 'Subtract DIVISOR if possible
RCL··· QUOTIEND, #1···················· · 'Collect bits of QUOTIENT
DJNZ·· COUNTER,· #:Second_Round
Here is the native·PASM version
It handles the DIVIDEND<DIVISOR case automatically with 11 machine cycles, but does not check for zero DIVISOR. The attached file contains the code.·Whet the DIVIDENT is only 1-3 magnitude bigger than the DIVISOR the algorithm is quick. When the DIVIDENT is close to the DIVISOR the algorithm as fast as lightning.
This small and quick David of Integer Division Algorithms is hard as a coffin nail (remember Snatch with Pitt) in fighting with other warriors of division algorithms. It dehonested 2 approximations badly and overtake in average speed the textbook style I32xI32 integer multiplication, as well. In these matches arg1 and arg2 was uniformly random, with the restriction of arg1 => arg2.
I know that the issues of overflow, signs and other important things in integer and fixed-point arithmetic have to be dealt with. These simple examples are aimed· only to catch the idea.
The final form of these fast algorithms will reach out into the 64-bit regime.· Beside the basic 32 bit versions, there will be a fast I32xI32 multiplication with I64-bit results, and a fast I64/I32 division variant with I32 results. These internal extended precision routines among other useful things, for example,· will help to compute (a*b/c) triads in high precision or will assist library functions to compute (a*b+c*d) like expressions with internally extended precision.
A hint on further improvement of this fast Kenyan division:
======================================
Note, that the smaller the divisor the longest the execution time is. A simple check for that case will cost only a few cycles· but can save dozens of dozens of cycles. If one can figure out a very fast method to· divide exactly with small (say < 16) integers with only shifts, subtractions and additions, that will boost the average performance of our·PASM divider. This is yet to be done. I mentioned some similar methods in the first "Fast, Faster, Fastest Codes" thread but they were for integers of limited range or for fixed-point numbers.
A concrete example:
The division of 80_000_000 with 10 takes 187 machine cycles with Kenyan division
To divide with 10 can be done quickly with 6 shifts and 4 additions (12 machine cycles)
quotient = ((dividend >> 1) + dividend) >> 1
quotient = ((quotient >> 4) + quotient)
quotient = ((quotient >> 8) + quotient)
quotient = ((quotient >> 16) + quotient) >> 3
But there remains some rounding error as the result is 7_999_999, and we don't have the remainder. Maybe we can take is as an approximation, multiply it with 10 (2 shifts +1 addition) and figure out the correct result with the comparison with the original DIVIDEND. To make this for other 11 or so cases (why not for 1, 2, 4, 8 and 16?), seems a bit messy for me, although it will be fast.··
Cheers,
Istvan
Here is the Kenyan style fast integer division algorithm in a PASM like pseudocode
Notation: QUOTIENT = DIVIDEND / DIVISOR
:First Round
Double DIVISIOR until Greater Than DIVIDEND, count the number of doublings in COUNTER
:Second Round
Half·· DIVISOR
CMPSUB DIVIDEND, DIVISOR WC······ 'Subtract DIVISOR if possible
RCL··· QUOTIEND, #1···················· · 'Collect bits of QUOTIENT
DJNZ·· COUNTER,· #:Second_Round
Here is the native·PASM version
'Unsigned INT 32/32 divide into (32;32) with Kenyan algorithm 'res1 = quotient(I32) of (arg1(I32)/arg2(I32)) 'res2 = remainder(I32) of (arg1(I32)/arg2(I32)) 'Prepare working registers MOV r1, #0 'Clear 32-bit quuotient MOV r2, #1 'Loop counter for divide back steps 'First round of Kenyan division :first_loop SHL arg2, #1 'Double divisior CMP arg2, arg1 WZ, WC 'Compare divisor with divident 'If divisor is smaller, C is set 'when they are equal Z is set IF_C_OR_Z ADD r2, #1 'IF_C_OR_Z increment counter IF_C JMP #:first_loop 'IF_C continue first round 'Second round of Kenyan division :second_loop SHR arg2, #1 'Half divisor CMPSUB arg1, arg2 WC 'Compare divident with divisor, 'If divident is greater or equal 'than divisor, then subtract 'divisor and set C RCL r1, #1 'Double quotient and rotate C into 'LSB of quotient DJNZ r2, #:second_loop 'Continue division if steps remained 'Quotient in r1 'Reamainder is what left in arg1
It handles the DIVIDEND<DIVISOR case automatically with 11 machine cycles, but does not check for zero DIVISOR. The attached file contains the code.·Whet the DIVIDENT is only 1-3 magnitude bigger than the DIVISOR the algorithm is quick. When the DIVIDENT is close to the DIVISOR the algorithm as fast as lightning.
This small and quick David of Integer Division Algorithms is hard as a coffin nail (remember Snatch with Pitt) in fighting with other warriors of division algorithms. It dehonested 2 approximations badly and overtake in average speed the textbook style I32xI32 integer multiplication, as well. In these matches arg1 and arg2 was uniformly random, with the restriction of arg1 => arg2.
I know that the issues of overflow, signs and other important things in integer and fixed-point arithmetic have to be dealt with. These simple examples are aimed· only to catch the idea.
The final form of these fast algorithms will reach out into the 64-bit regime.· Beside the basic 32 bit versions, there will be a fast I32xI32 multiplication with I64-bit results, and a fast I64/I32 division variant with I32 results. These internal extended precision routines among other useful things, for example,· will help to compute (a*b/c) triads in high precision or will assist library functions to compute (a*b+c*d) like expressions with internally extended precision.
A hint on further improvement of this fast Kenyan division:
======================================
Note, that the smaller the divisor the longest the execution time is. A simple check for that case will cost only a few cycles· but can save dozens of dozens of cycles. If one can figure out a very fast method to· divide exactly with small (say < 16) integers with only shifts, subtractions and additions, that will boost the average performance of our·PASM divider. This is yet to be done. I mentioned some similar methods in the first "Fast, Faster, Fastest Codes" thread but they were for integers of limited range or for fixed-point numbers.
A concrete example:
The division of 80_000_000 with 10 takes 187 machine cycles with Kenyan division
To divide with 10 can be done quickly with 6 shifts and 4 additions (12 machine cycles)
quotient = ((dividend >> 1) + dividend) >> 1
quotient = ((quotient >> 4) + quotient)
quotient = ((quotient >> 8) + quotient)
quotient = ((quotient >> 16) + quotient) >> 3
But there remains some rounding error as the result is 7_999_999, and we don't have the remainder. Maybe we can take is as an approximation, multiply it with 10 (2 shifts +1 addition) and figure out the correct result with the comparison with the original DIVIDEND. To make this for other 11 or so cases (why not for 1, 2, 4, 8 and 16?), seems a bit messy for me, although it will be fast.··
Cheers,
Istvan
Comments
This is very interesting -- keep up the good work.
I wanted to point back at this thread,
http://forums.parallax.com/showthread.php?p=611410
Re: optimization of multiplication by fractional constants. The book "Hacker's delight" talks about the same type of folding, as well as more general fast integer division and multiplication. But I don't see the Kenyan type of algorithm there.
Fast integer square root was shaved to the bone in:
http://forums.parallax.com/showthread.php?p=727686
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com
Thanks for the links. The fast square root I have seen, but the first was new to me on the Forum. I usually try to locate a similar one before triggering a new topic, but I am not very effective in searching the Forum.
I hope that a useful collection of tested and very quick algorithms, written in PASM, SPIN and uM-FPU ASM will emerge from the sporadic ideas presented here and there, nowadays and maybe long-time ago. I will categorize these, according the illusions, by which we would like to represent the vast infinity of mathematical numbers with only four bytes. Those illusions are the Integer math, the Fixed-point math and the IEEE 32-bit floats. The winner algorithms should not only be the fastest within a category, but should be precise enough to maintain·the illusion in us of a given representation, that mathematical numbers are being used during the calculations.
6000 year algorithms can be interesting if they are faster then the current ones. Here is an example of Old Babylon mathematics, from where the 7 day periods, the 360 degree of a circle, 60 minutes and the idea of taxes to the central government came.
How to compute ATAN2 quickly with integers?
=================================
I only show the main points in my vague translation of a Babylonian tablet into SPIN like pseudocode. They were not very sophisticated, so they called ATAN2 as ANGLE, X and Y meant steps forward/backward and left/rigth, and they measured the·ANGLE in turn from right to the left. So, 10 steps forward, the turnig to the left, and another 10 steps gives 45 deg ANGLE.
To calculate ANGLE from the X,Y coordinates (within +-1 degree)
·(With a loupe·I figured out·some·scratched lines,·and I got totally confused, but the final version is maybe)
IF (Y==0) RETURN 0
IF (X==0)
· IF (Y>0) RETURN 90
· RETURN -90
ENDIF
IF (ABS(Y)=<ABS(X))
· ANGLE := (3667 * X * Y) / (64 * X * X + 17 * Y * Y)
ELSE
· ANGLE := 90 - (3667 * X * Y) / (64 * Y * Y + 17 * X * X)
IF·(X<0)
· IF (Y<0) ANGLE :=·180·+ ANGLE
· ANGLE := 180 - ANGLE
ELSE
· IF (Y<0) ANGLE := 360 -ANGLE
RETURN ANGLE
I·might mixed up the quadrants, as there were scratches on the tablet. That must be Babylonian, as the ANGLE is in degree.· Although they know the 355/113 number.
This or similar code can save you 2 COGs sometime.
Cheers,
Istvan
Post Edited (cessnapilot) : 8/4/2009 8:00:36 AM GMT