The quickest way of computing 2/3 of an arbitrary number

Paul Baker · 2004-11-04 19:58

I am faced with the dilemma of trying to compute 2/3*x in the fastest (and most importantly code efficient) manner. Computing 3/2*x is child's play: rr((rl x)+x). However dividing by 3 is another matter.·Playing with the numbers, I stumbled upon the power series: x/2 + x/4 - x/8 + x/16 - ... = 2/3*x,·the series is computed for n iterations where·x/(2^(n+1))=0 (ie 2/3 of any 8 bit number can be computed in 7 or fewer iterations). Is this the fastest method for computing 2/3*x? Also in preliminary calculations I am finding it difficult to tease out the remainder from the calculation. Additionally the computation doesn't seem to always work (this maybe due to incorrect binary calculations since its been more than a decade since I was doing it in college). For example 2/3*21 is properly calculated (2/3*10101 = 1110) but 2/3*22 seems to not be correct (2/3*10110 = 1101). Does anyone care to lend a hand in helping me figure this puzzle out?

TIA,
Paul

BrianW · 2004-11-04 20:47

Hi,

Many years ago when working with Z80 assembler I used a subtraction method.
To divide by 3 a simple example if x is an unsigned 8 bit value then max it could be is 255

so if result = r

subtract 30 if no underflow then add 10 to r and repeat
otherwise add 30 back to x
subtract 3 if no underflow then add 1 to r and repeat
otherwise add 3 and remainder is what is left

on the SX the carry flag is cleared for underflow

for larger variables ie. 16 bit start with subtract 30000 then 3000, 300, 30, 3

as a maximum each loop would be 3 itterations

This method can be coded to divide by any value, sorry not had time to look at coding it in SX asm but it's time I was out for a beer or two.

Regards
Brian

JamesC · 2004-11-04 21:05

I too had the fortune (or misfortune) of also having to do this back in the Z-80 days (trash-80 model 1), and the bit-shift-add multiply and bit-shift-subtract divide were generally regarded as the best simple methods for general purpose multiplication and division. they are also very simple to implement. Don't know if there is a particular short cut that would be more efficient for 1/3rds, but I cant think of one.

I was going to type an explanation of the algorithm, but a quick google search revealed this site which already has an excellent tutorial. should be a snap to implement on the sx.

http://cobb.host.sk/z80guide/part4.htm

Cheers

James Newton · 2004-11-04 21:34

http://www.sxlist.com/cgi-bin/constdivmul.exe?Acc=fr37p&Bits=8&endian=little&Const=0.333333333333333&ConstErr=0.5&Temp=TEMP&cpu=sx28

http://www.sxlist.com/techref/expeval2.asp?exp=2+*+X+%2F+3

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
---
James Newton, Host of SXList.com
james@sxlist.com 1-619-652-0593 fax:1-208-279-8767
SX FAQ / Code / Tutorials / Documentation:
http://www.sxlist.com Pick faster!

James Newton · 2004-11-04 21:37

Actually

http://www.sxlist.com/cgi-bin/constdivmul.exe?Acc=ACC&Bits=8&endian=little&Const=0.66666666666666666666666666666667&ConstErr=0.5&Temp=TEMP&cpu=sx28

And you may wish to play with the error, width of the variable, etc... at
http://www.sxlist.com/cgi-bin/constdivmul.exe

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
---
James Newton, Host of SXList.com
james@sxlist.com 1-619-652-0593 fax:1-208-279-8767
SX FAQ / Code / Tutorials / Documentation:
http://www.sxlist.com Pick faster!

dkemppai · 2004-11-04 21:38

On the SX list site, there is a code generator for multiplication by a constant.·
http://www.sxlist.com/cgi-bin/constdivmul.exe?cpu=sx

If you want the remainder, multiply the result
by 3 and subtract from the number you started with...

-Dan

·

BrianW · 2004-11-05 11:05

Just thought I would see how the super free basic compiler does it :-

; x = x / 3
MOV __PARAM1,x
MOV __PARAM2,#3
CLR __PARAM3
TEST __PARAM2
JZ $+23
CLR __PARAM4
CLC
SKIP
RL __PARAM2
INC __PARAM4
JNB __PARAM2.7,$-2
JMP $+4
CLC
RL __PARAM3
RR __PARAM2
INC __PARAM3
SUB __PARAM1,__PARAM2
JC $+5
DEC __PARAM3
ADD __PARAM1,__PARAM2
DJNZ __PARAM4,$-11
MOV x,__PARAM3

James Newton · 2004-11-05 19:08

Well, the code generator at sxlist.com beats that all to heck, but the SX/B routine is a more general purpose solution, so if you were doing a lot of different divides it would end up being more code efficient although slower.

; fr37p = fr37p * 0.333333
; Temp = TEMP
; fr37p size = 8 bits
; Error = 0.5 %
; Bytes order = little endian
; Round = no

; ALGORITHM:
; Clear accumulator
; Add input / 4 to accumulator
; Add input / 16 to accumulator
; Add input / 64 to accumulator
; Add input / 256 to accumulator
; Move accumulator to result
;
; Approximated constant: 0.332031, Error: 0.390625 %

; Input: fr37p0, 8 bits
; Output: fr37p0, 7 bits
; Code size: 17 instructions

fr37p0 DS 1

;copy accumulator to temporary
mov w, fr37p0

;shift accumulator right 2 times
clc
rr fr37p0
clc
rr fr37p0

;add temporary to accumulator
add fr37p0, w

;shift accumulator right 2 times
rr fr37p0
clc
rr fr37p0

;add temporary to accumulator
add fr37p0, w

;shift accumulator right 2 times
rr fr37p0
clc
rr fr37p0

;add temporary to accumulator
add fr37p0, w

;shift accumulator right 2 times
rr fr37p0
clc
rr fr37p0

Actually, Marks 8/8 divide beats the SX/B code as well: 16 vs 22 words (and that 16 includes the return so it is actually 15 vs 22) and I believe the cycles will be about the same.

http://www.markworld.com/Div8x8.txt
; Divide W by MpyDiv1
; ret
; ret
;
Div8x8 mov MpyDiv4, W ;numerator
mov W, #8
mov Count, W
clr MpyDiv2 ;quotient
clr MpyDiv3 ;remainder
div1 clrb C
rl MpyDiv4
rl MpyDiv3
mov W, MpyDiv1
mov W, MpyDiv3-w
snb C
mov MpyDiv3, W
rl MpyDiv2
decsz Count
jmp div1
retw 0
;

And his code generator
http://www.markworld.com/evaluate8.html
is better than the sxlist.com one, but it is for the PIC 16C54 so you have to translate the result with
http://www.sxlist.com/cgi-bin/mpasm2sasm2.exe

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
---
James Newton, Host of SXList.com
james@sxlist.com 1-619-652-0593 fax:1-208-279-8767
SX FAQ / Code / Tutorials / Documentation:
http://www.sxlist.com Pick faster!

JamesC · 2004-11-06 16:46

It looks like the general purpose routine could be optimized even further...

Consider what it does (I've re-written it here with more legible labels)

; Divide W by denominator
; Return quotient in quotient 
; Return remainder in remainder
;

Div8x8  
    MOV numerator, W      ; store numerator
    MOV W, #8
    MOV Count, W          ; loop 8 times
    CLR quotient          ; quotient = 0
    CLR remainder         ; remainder = 0

div1    
    CLRB    C             ; clear carry flag
    RL  numerator         ; rotate left C <- numerator <-C
    RL  remainder         ; rotate left C <- remainder <-C
    MOV W, denominator    ; 
    MOV W, remainder-w    ;   is denom > remainder ?    
    SNB C                 ; skip next if underflow 
    MOV remainder, W      ; save remainder - denominator
    RL  quotient          ; rotate left C <-quotient <- C (result from mov W,remainder-w)
    
    DECSZ   Count         ; loop
    JMP div1              ; 
    
    RETW    0

First thing I noticed is that the CLRB C instruction appears redundant,
it only serves to ensure that the numerator variable is 0 when the
routine completes, and this is pre-set at the start to 0 anyway.

That said, it looks like its possible to combine the numerator and
quotient variables. It may look a little confusing at first, but essentially,
the variable will start out holding the numerator, which will get shifted
out while the quotient is shifted in.


Div8x8  
    MOV quotient, W       ; store numerator
    MOV W, #8
    MOV Count, W          ; loop 8 times
    CLR remainder         ; remainder = 0

div1    
    RL  quotient          ; rotate left C <- quotient <-C (numerator out/quotient in from last loop)
    RL  remainder         ; rotate left C <- remainder <-C
    MOV W, denominator    ; 
    MOV W, remainder-w    ;   is denom > remainder ?    
    SNB C                 ; skip next if underflow 
    MOV remainder, W      ; save remainder - denominator
                          ; C flag contains bit to shift into quotient
    
    DECSZ   Count         ; loop
    JMP div1              ; 
    
    RL  quotient          ; needed to shift final bit into quotient... 
    
    RETW    0

The upshoot is that 2 code words are saved, but more importantly
less work is done inside the loop (two less instructions per pass = 16 less cycles),
and the extra temp variable for the numerator is no longer needed.

BTW, none of these routines work if CARRYX is turned on. I also
had problems getting the correct result with SXSIM, but a quick preliminary
test with an actual chip seems to verify the result.

I should point out here that I am new to the SX chip, so I may be making
an assumption here that does not always hold. I had a lot of initial confusion about
the way the carry flag is handled on this processor. I also have not exhaustively
tested my optimization as of yet.

Also, note that while the general purpose solution is still slower than the dedicated
*.6666 or *.3333, it is generally more accurate: The *.3333 will not return the
correct result of, say, 6/3, due to the rounding. This would be a problem with any
repeating decimal multiplication (is it more correct to say repeating binary?).
The general purpose routine returns correct integer quotients in this case, and a remainder if
the result of the division is not an integer. Depending on the application,
this may or may not be important.

Paul Baker · 2004-11-08 16:19

Wow, I want to start off by thanking everyone so much for their assistance, the plethora of responses helped me attack the problem from several different angles and perspectives. After viewing the code produced by the constmult utility, I realized that my power series is the same if I were to combine the subtraction couplets. After careful consideration of all the solutions provided, I hope not to disappoint any of you by stating that after repeated code-ups and numerical evaluations of each of the proposed solutions, I have decided to choose option E) None of the above. By giving a brief background to the underlying problem that generated this perceived need and a brief walkthrough some of the proposed solutions, I hope you will concur that I chose correctly.

I am creating a token driven system where each token is a multiple of 8 bits, the vast majority of tokens are 8 bits with control tokens occupying 16 bits (goto, conditional goto etc, the 4 msb of the first 8bits is the type identifier, the 4 lsb of the first 8 bits and the second 8 bits are combined to produce a 12 bit address). I have chosen the virtual PC counter to coincide with the actual address of the token in program memory, therefore at the onset I believed that a jump to a target which straddles two adjacent program words·would necessitate the translation of the jump address to a real address, hence the need for computing·2/3 of an arbitrary number.·However, the 2/3 computation must be exact.

Computing 2/3 of·an 8 bit number using the constmult utility introduces a rounding error, but by my computations the error should within 3 (less than 3 bits) which can be corrected in computing the modulo portion of the computation. But computing 2/3 of a 12 bit number can produce a rounding error·greater than 2 bits, ie a modulo calculation cannot easily recover the correct result. This is even taking into account the extra terms required for 2/3 of a 12 bit number. Further analysis showed that by accounting for extra fractional bits (the shifted lsb is kept in an additional register rather than thrown away and is added·in fixed point style)·the correct result can be produced by rounding up to the next number at the end of calculation, but this greatly complicates the required code and is no longer the elegant solution I was in search of. So nix on the constmult solution.

The SX/B code (I know this was proffered only as a "I wonder" solution) is too generalized and not efficient enough for my liking.

I was impressed with the straight-forwardness of BrianW's solution, especially since it inherently produces the modulo as a byproduct and is always exact. But code-up show that this solution is too slow for a 12 bit number (1·x 3000, 9 x 300,·9 x 30, 9 x 3 for a possibility of 22 total conditional subtractions). Even using a temp register so that the addback is not necessary and·pop-out of each level·when the overflow occurs still·requires an average of 11 subtractions (the average for·in-system calculations is higher since my tokens are placed in higher program memory). Having to perform this whenever a branch occurs is too much of a slow-down on the system performance.

So here is my simple solution: byte alignment, a target address cannot straddle two adjacent code words. The token generator·program I will be producing will understand this and place a special $00 nop token in the straddled token so that the target will always occupy the 8 lsb of a program word. Therefore the actual address will be the "branch to" address.

I will likely incorporate JamesC's optimized routines if I also conclude that it works in all circumstances. I will have generalized mul and div routines within my code for 8 and 16 bit operands, but these routines (along with all data operation routines) will only be included if the user program uses them (after all if a user program never needs to divide two 16 bit numbers, why incorporate the code?).

Once again I want to thank everyone for their input. Although the code-ups and testing was somewhat arduous, it produced a fun weekend.

Thanks,
Paul

Post Edited (Paul Baker) : 11/8/2004 4:21:33 PM GMT

Tracy Allen · 2004-11-13 07:36

I'm a little late tuning in on this thread, and I don't really have anything to add for Paul's special application.

However, I just wanted to bring in the perspective of the Stamps' */ and ** operators, vis a vis the issue of multiplying by a fractional constant.

The */ operator would approximate 2/3 as 170/256, that is, multiply by 170 followed by shift right 8. (or, 171/256 is a little closer on the high side of 2/3). In binary, 171 is 10101011, so, with an input 8-bit argument abcdefgh, the answer is czzzzzzz in the following:

                        abcdefgh
                     * 10101011
                     ----------------
                        abcdefgh
                      abcdefgh
                    abcdefgh
                  abcdefgh
                abcdefgh
                ----------------------------
               czzzzzzzrrrrrrrh

The remainder rrrrrrrh is discarded by the division by 256 (shift right, middle byte of a 24 bit accumulator). There might be a carry (up to binary 100) out of the the discarded term. I don't know that that is going to line up right in the box, but you get the idea.

Now, compare that with the code generator method. It is similar, but it uses shift rights, the following addition:

                000000ab
                0000abcd
                00abcdef
                abcdefgh
                ----------
               cxxxxxxxr

followed by one more shift right to discard the bit r and bring in the carry c. That answer cxxxxxx differs from the above czzzzzzz by whatever carry might occur out of the discarded terms. Of course the full multiply requires a higher precision accumulator, which the code generator avoids by working in from the left.

If more accuracy is needed, then the same logic applies when using 2730/4096, that is, multiply by binary 101010101010 followed by shift right 12. Or, as in the Stamp ** operator, approximate 2/3 as 43690/65536, multiply by 1010101010101010 followed by shift right 16. The error can be reduced, but there is always a possible one bit error, as the following shows:

    6** 43690 = 3,       not 4, a one bit error
    6**43691=4
    60000**43690 = 39999   still only a one bit error
    60000**43691 = 40000

  6 */ 170 = 3       not 4, again a one bit error
  6 */ 171 = 4
  60000 */ 170 = 39843       worse approximation, /256, bigger error   
  60000 */ 171 = 40078

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com

The quickest way of computing 2/3 of an arbitrary number

Comments