PASM Devision routine that works for full 32 bit devision,Help Fix this one or

TJHJ · 2009-04-02 02:51

The current routine I have been using for devision in PASM is

                mov   t,#16
                shl   y,#15
loop
                cmpsub x,y   wc
                rcl   x,#1
                djnz  t,#loop

Thanks Propeller WIKI [noparse]:)[/noparse]

The problem I have run into is that for small devision it works great, anything like 17/2, 20/3, 45/7 just some I tried. But when I try to divide 153_333_333/21_343. I get crazy results. Any suggestions for a routine that can do full 32 bit devision.

If you have a Pasm program you would be willing to share that would save me a ton of time and a headache, I cant thank you enough.

TJ

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I owe everyone here a bunch, So thanks again for answering my dumb questions.
Projects. RG500 ECU system. PropCopter. Prop CanSat. Prop Paste Gun.
Suzuki RG500 in a RGV 250 frame.
Bimota V-Due (Running on the fuel injection system)
Aprilia RS250

Post Edited (TJHJ) : 4/2/2009 5:35:05 AM GMT

virtuPIC · 2009-04-02 07:27

TJHJ said...
But when I try to divide 153_333_333/21_343. I get crazy results.

Tell us more! Would be really nice to know what you get exactly...

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Airspace V - international hangar flying!
www.airspace-v.com/ggadgets for tools & toys

Brian Fairchild · 2009-04-02 12:32

Are you only wanting to do unsigned integer division with an unsigned integer result or do you require a fractional result? Or do you need to handle signed numbers? Division of a 32-bit integer by another 32-bit integer will give a 64-bit result.

TJHJ · 2009-04-02 14:43

Sorry for not being more specific on this one.
I am working on a full table interploate that has to be done in a limited time. So I wrote the following program to test and see if spin would be fast enough.

Con
_clkmode = xtal1 + pll16x
_xinfreq = 5_000_000
LedPinNum = 4 
Num = 25_000'153_333_333
Div = 24_683   
obj
Term : "Serial_Terminal"
Var
Long StartPoint,ASMResult, CNT1, CNT2, ResultStore
 
Pub Main
StartPoint := LedPinNum   'Set the pin number to be correct
dira[noparse][[/noparse]LedPinNum]~~   'Test chip by turing on LED It comes on
outa[noparse][[/noparse]LedPinNum]~~
waitcnt(clkfreq+cnt)         'wait 1 second
outa[noparse][[/noparse]LedPinNum]~             'Turn off led, start up asm cog
waitcnt(clkfreq + cnt)       'Wait 1 second. This lets me make sure my ASM code isnt causing full lock

cognew(@DivTest, @StartPoint) 'Start the new asm cog
Term.start                    'Start up the terminal
term.getc                     'Wait for a char

 
CNT1 := cnt                   'Store the cnt before spin div
ResultStore := Num/Div        'Do spin Div
CNT2 := cnt                   'Store cnt after
 
waitcnt(clkfreq + cnt)        'wait 1 second
Term.out(0)                   'Clear screen
 

term.str(string("Spin took : ")) 'Output the result
term.dec(CNT2-CNT1)              'Delta CNT Spin
Term.str(string(" And the result was : "))
Term.dec(ResultStore)         
Term.out(13)
term.str(String("ASM took : "))
term.dec(StartPoint)
Term.str(String(" And the result was : "))
Term.dec(ASMResult)
repeat                  'Forever repeat to keep chip running
  waitcnt(clkfreq+cnt)
 
 
Dat
org
DivTest
                mov _cnt1, cnt    'Ssve the counter time before devision
                mov   t,#16
                shl   y,#15
loop
                cmpsub x,y  wc 
                rcl   x,#1
                djnz  t,#loop
                
                mov _cnt2, cnt      'Save the second counter time
                sub     _cnt2, _cnt1    'Find out how long it took
                
                mov     t, par          'Use t for the adress of Par
                add     t, #4           ' add 4 to move 1 long
                wrlong  _cnt2, par      'Write long at par address
                wrlong  x, t            'Write address at 1 long + par
jmpforever      jmp       #JmpForever            'Just keep repeating so it never stops
 
x       long    Num
y       long    Div
t       res       1
_cnt1   res       1
_cnt2   res       1

So here are the results I get that are odd. It seems anything over about 20,000 in the numerator makes it blow up.

Num = 25_000
Div = 24_683
Spin took : 1600 And the result was : 1
ASM took : 208 And the result was : 20774913

Num = 153_333_333
Div = 24_683
Spin took : 1664 And the result was : 6212
ASM took : 208 And the result was : 166271044

Now intrestingly these numbers work
Num = 20_000
Div = 10_000
Spin took : 1600 And the result was : 2
ASM took : 208 And the result was : 2

Yet does not
Num = 21_000
Div = 10_000
Spin took : 1600 And the result was : 2
ASM took : 208 And the result was : 65536002

Num = 20_000
Div = 15_000
Spin took : 1600 And the result was : 1
ASM took : 208 And the result was : 327680001

Note : Smaller number seem to work just fine.

So I have been trying to read and·create a new div routine·but my asm is just not that good.

·@Brian
I am totally happy with a truncated result, but it would be nice if I could handle signed numbers. I dont think any of the results of the numbers that woud be devided would ever grow beyond 32 bits.

Thank you for the time and help to look at this,
TJ

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I owe everyone here a bunch, So thanks again for answering my dumb questions.
Projects. RG500 ECU system. PropCopter. Prop CanSat. Prop Paste Gun.
Suzuki RG500 in a RGV 250 frame.
Bimota V-Due (Running on the fuel injection system)
Aprilia RS250

Post Edited (TJHJ) : 4/2/2009 3:26:03 PM GMT

Phil Pilgrim (PhiPi) · 2009-04-02 15:55

TJ,

You haven't said for sure whether Spin would be fast enough for your app. If it is, I have an object in the OBEX that will help called "umath". It's entirely written in Spin, so you don't have to use up another cog, and includes a 32 * 32 / 32 "fractional" multiply that retains the 64-bit intermediate product for the divide. This may be just right for your interpolation problem. You can either use it as an object or copy the relevant sections into your own program.

-Phil

TJHJ · 2009-04-02 16:02

Phil My aplogies. Although the spin devision is correct it takes to long. The problem is I am having to do this live, so the aprrox 1600 clk cycles to get result is to long as for the interpolate it has 4 devision routines + 2 multiply commands.

TJ

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I owe everyone here a bunch, So thanks again for answering my dumb questions.
Projects. RG500 ECU system. PropCopter. Prop CanSat. Prop Paste Gun.
Suzuki RG500 in a RGV 250 frame.
Bimota V-Due (Running on the fuel injection system)
Aprilia RS250

Brian Fairchild · 2009-04-02 16:45

Your PASM routine isn't a 32-bit divide, it's only 16-bit. The clue is t1, the loop counter, which only iterates 16 times.

Jack Crenshaw · 2009-04-02 16:51

TJHJ said...
The problem I have run into is that for small devision it works great, anything like 17/2, 20/3, 45/7 just some I tried. But when I try to divide 153_333_333/21_343. I get crazy results. Any suggestions for a routine that can do full 32 bit devision.

First off, it looks like you're not shifting far enough. You're shifting only 16 bits, where 153_333_333 is a 27-bit number.

As one of your tests, you should always be able to divide by 1, and get back the original dividend. Looks like your function won't do that.

In general, if you want to divide two general 32-bit numbers, you need a dividend register that's 64 (actually 65) bits long. Is the issue really that you want to divide 16-bit numbers, using a 32-bit dividend register? If so, 153_333_333 is too big for an input value.

Whatever method you use, and however long the input words, it's critically important that every trial subtraction end up leaving no non-zero bits at the "high end." As with decimal long division,
you have to shift the operands at the beginning, so you can guarantee that the first subtraction doesn't leave any 1-bits at the high end.

It looks like your routine doesn't do this preconditioning. Granted, it's going to take extra cycles, but it's necessary.

Years ago I wrote a 32x16 div routine for the z80. Recently, the topic came up again in comp.os.cpm. I no longer have a z80 system, so couldn't show them the z80 code, but I wrote a C version to show them the algorithm. I'm attaching it so it may help.

I see that Phil Pilgrim's ASM code has the right algorithm, but again, there needs to be that initial preconditioning.

Let me explain again, and I'll try to be as clear as I can: In that first comparison, the divisor must be aligned so that it either just fits into the top bits of the dividend, or is just barely too large. After the subtraction, the high bit of the result _MUST_ be a zero. If you're dividing x by 1, the 1 must be left justified so this is true. Otherwise, you won't get back the original dividend, and non-zero high bits end up shifted off into outer space.

Jack

Jack

Phil Pilgrim (PhiPi) · 2009-04-02 16:58

I deleted my post ahead of Jack's comments. It's a routine I've used, but I wasn't entirely satisfied with it upon looking at it again and needed my morning coffee before proceeding further.

-Phil

Phil Pilgrim (PhiPi) · 2009-04-02 18:49

Here's the code I deleted, but with some preconditioning to check for invalid operands. Attached is the same with a Spin wrapper for testing.

'-----------[noparse][[/noparse]* Divide 64 / 32 = 32 ]--------------------------------------------

' rb := ra:rb / rx, where ra < rx.

divabx        [b]cmpsub[/b]    ra,rx [b]wc[/b],[b]nr[/b]             'Check to make sure ra < rx. Carry if not.
        [b]if_c[/b]  [b]jmp[/b]       divabx_ret              'Return with error in carry.
        
              [b]mov[/b]       cctr,#32 [b]wc[/b]             'Initialize counter.
        
:loop         [b]rcl[/b]       rb,#1 [b]wc[/b]                'Shift dividend left.
              [b]rcl[/b]       ra,#1 [b]wc[/b]
        [b]if_c[/b]  [b]sub[/b]       ra,rx                   'If bit shifted out, then definitely subtract.
       [b]if_nc[/b]  [b]cmpsub[/b]    ra,rx [b]wc[/b]                'Else conditionally subtract, setting carry.
              [b]djnz[/b]      cctr,#:loop             'Back for more.
              
              [b]rcl[/b]       rb,#1                   'Shift in last bit of quotient.
              [b]test[/b]      cctr,cctr [b]wc[/b]            'Clear carry (error flag).
divabx_ret    [b]ret[/b]

It works, but I'm not terribly enamoured with that extra sub. There has to be a more efficient way...

-Phil

Post Edited (Phil Pilgrim (PhiPi)) : 4/2/2009 6:57:13 PM GMT

Brian Fairchild · 2009-04-02 19:11

There's something been bugging me about TJHJs original code and I've put my finger on it. It's wrong, or to be more accurate it only works if you know what the input and output format is. That information is missing from the Wiki.

The code has been lifted from one of the early parallax documents and is missing the routines header. Here is the correct code...

Here is an unsigned divider routine that divides a 32-bit value by a 16-bit value to yield a 16-bit quotient and a 16-bit remainder: 
 
' 
' Divide x[noparse][[/noparse]31..0] by y[noparse][[/noparse]15..0] (y[noparse][[/noparse]16] must be 0) 
' on exit, quotient is in x[noparse][[/noparse]15..0] and remainder is in x[noparse][[/noparse]31..16] 
' 
divide          shl     y,#15           'get divisor into y[noparse][[/noparse]30..15] 
                mov     t,#16           'ready for 16 quotient bits 
 
:loop           cmpsub  x,y     wc      'if y =< x then subtract it, quotient bit into c 
                rcl     x,#1            'rotate c into quotient, shift dividend 
                djnz    t,#:loop        'loop until done 
 
divide_ret      ret                     'quotient in x[noparse][[/noparse]15..0], remainder in x[noparse][[/noparse]31..16]

No wonder his answers were wrong.

TJHJ · 2009-04-02 19:28

Awesome thank you so much everyone for the help.

Phil and only 676 cycles for div. I Think it will fit now.

Thank you all again.

TJ

EDIT Brian, That helps a whole bunch..... Thanks for finding that for me

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I owe everyone here a bunch, So thanks again for answering my dumb questions.
Projects. RG500 ECU system. PropCopter. Prop CanSat. Prop Paste Gun.
Suzuki RG500 in a RGV 250 frame.
Bimota V-Due (Running on the fuel injection system)
Aprilia RS250

TJHJ · 2009-04-02 19:35

Brain, Which knowing that makes the whole routine work.

((ASMResult<<16)>>16)  ' Ditch out the remainder and the number makes sense.

FYI·if takes 208 cycles to process. Ill have to see if I can get the wiki updated.

Thanks again for finding that.

TJ

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I owe everyone here a bunch, So thanks again for answering my dumb questions.
Projects. RG500 ECU system. PropCopter. Prop CanSat. Prop Paste Gun.
Suzuki RG500 in a RGV 250 frame.
Bimota V-Due (Running on the fuel injection system)
Aprilia RS250

parsko · 2009-04-02 19:39

The regular 32bit divide routine takes ~208 clocks for 32 / 16 = 16.16
The new 64bit divide routine takes ~676 clocks for 64 / 32 = 32

Just wanted to point this out for anyone referencing this post in the future.

TJHJ, if you know what kind of math you're doing (aka converting something like temperature repetatively), you can use the 32 bit routine without losing much accuracy. I've done this with a few different sensors. It all depends on just how accurate you need the divide. But, at 676 cycles, it's nice to have the 64 bit divide.

-Parsko

parsko · 2009-04-02 19:42

This is also quite convenient cause it will gladly divide a 32bit number by a 32bit number!!!

TJHJ · 2009-04-02 23:29

Hang on a second here, so maybe Im thinking about this wrong, but why would a 32 bit divided by a 32 bit number yield a 64 bit number?

Example
Worst case is 2^32/1 = 2^32 a 32 bit number. And the simplest cast is 2^32/2^32 = 1.

I cant think of a case that causes a result larger than 32 bits if its only for devision because of the fact its integer math.

Maybe Im looking at this wrong. Or is it just the fact we need a 64 bit register to start with?

Parsko,
Thanks, Its a balance of accuracy versus time, if I can use the full 64 bit routine, why not keep the accuracy, if I cant make it crunch in time then Ill sacrifice the accuracy for the 32 bit routine.
TJ

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I owe everyone here a bunch, So thanks again for answering my dumb questions.
Projects. RG500 ECU system. PropCopter. Prop CanSat. Prop Paste Gun.
Suzuki RG500 in a RGV 250 frame.
Bimota V-Due (Running on the fuel injection system)
Aprilia RS250

TJHJ · 2009-04-03 00:36

I just ran this in ICCV7 To see how long it would take.

Also just for the record ICCV7 C took 1008.

So Your choices are

32bit/16 bit at 208 clks

32/32 signed at 676 clks

ICCV7 At 1008 clks

And Spin at 1600-1604

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I owe everyone here a bunch, So thanks again for answering my dumb questions.
Projects. RG500 ECU system. PropCopter. Prop CanSat. Prop Paste Gun.
Suzuki RG500 in a RGV 250 frame.
Bimota V-Due (Running on the fuel injection system)
Aprilia RS250

Phil Pilgrim (PhiPi) · 2009-04-03 05:58

TJHJ said...
...why would a 32 bit divided by a 32 bit number yield a 64 bit number?

It doesn't. Brian was thinking of multiplication.

-Phil

Brian Fairchild · 2009-04-03 06:31

Phil Pilgrim (PhiPi) said...

TJHJ said...
...why would a 32 bit divided by a 32 bit number yield a 64 bit number?

It doesn't. Brian was thinking of multiplication.

Actually, I was thinking of a 32-bit integer dividend and a 32-bit integer divider producing a 64-bit fractional result.

Jack Crenshaw · 2009-04-04 15:21

PhiPi said...
Here's the code I deleted, but with some preconditioning to check for invalid operands. Attached is the same with a Spin wrapper for testing.

'
[noparse][[/noparse]* Divide 64 / 32 = 32 ]

' rb := ra:rb / rx, where ra < rx.

divabx cmpsub ra,rx wc,nr 'Check to make sure ra < rx. Carry if not.
if_c jmp divabx_ret 'Return with error in carry.

Actually, Phil, ra doesn't have to be smaller than rx. It only has to be smaller than twice rx,
so when you compute ra - rx, the high bit is always zero.

Phil Pilgrim (PhiPi) · 2009-04-04 16:22

Jack,

Consider this example, though: ra:rb = $0000_0001:0000_0000, rx = $0000_0001 (i.e. ra = rx). The quotient ($1_0000_0000) overflows 32 bits. I believe this would be the case any time ra > rx.

-Phil

Tracy Allen · 2009-04-04 16:35

TJ, in the context of interpolation in a table, there are specific conditions met. It comes down to c * a/b, where a/b is less than unity. In many interpolation problems involving lookup (as opposed to lookdown), b is a constant, the spacing between table entries.

I think Phil pointed out above the possibility of a 32 * 32 / 32 "fractional" multiply, which can be faster than a genera purpose division.

In terms of interpolation, you enter with a value x, and then from lookup or lookdown operation find that it lies between two entries xm and xn, where xm = x <= xn. You grab the corresponding values ym and yn. Then, by linear interpolation,
y = ym + (yn - ym) * (x - xm) / (xn - xm)

The term on the right after the + is of the form c * a/b with a/b<1.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com

TJHJ · 2009-04-05 02:31

Thanks for the thought Tracy,
Im having a hard enough time keeping the math in the original formula context, let alone trying to convert it for efficiency. Im trying to get it working 1st.
Its a monster, I am working on an intelligent step estimator that takes a know data set from any two sensors, compiling them into a 3d table based on time. Then based on the previous data set results develop a least squares 6th order equation defining the system. Make the system move to the next point most likely containing the goal based on the predicted fit of the spline. Rinse and repeat, ditching outliers that are no longer be prevalent to the current data.
If I have the math correct it would basically be a self tuning PID system capable of accepting multiple inputs and balancing them automatically to achieve the desired goal. Ideally its a pid loop that is not a wave, and therefore correctly responds to each change in conditions with a very close estimate of its desired goal.

Who knows, if it works NO MORE PID TUNING [noparse]:)[/noparse] or Non stable loops (That in my case tend to tear a bunch of stuff up always, why do they always go unstable when your not watching them?)

TJ

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I owe everyone here a bunch, So thanks again for answering my dumb questions.
Projects. RG500 ECU system. PropCopter. Prop CanSat. Prop Paste Gun.
Suzuki RG500 in a RGV 250 frame.
Bimota V-Due (Running on the fuel injection system)
Aprilia RS250

Tracy Allen · 2009-04-05 16:28

Wow! That's a much much bigger monster than I imagined. Good luck!

Nonetheless, the operation of taking proportions right up there with multiplication and division. It is indeed an optimization, because the original terms in the formulas have to be evaluated a priori to see where a proportion is involved. Then adjust the order of operations to avoid double precision, that is, calculate (a/b)*c rather than (c*a)/b. Well, it depends too on how much precision the problem really needs.

PRI ratio(a, b, c) : f            ' calculate c *a/b 
                                ' enter with a<b; 0<=a<2^30, 0<b<=2^30 (positive)
                                ' c can be any valid twos complement number.
  repeat 31                      'calculate result such that  result / (2^31) = a / b , approximates a/b
    a <<= 1
    result <<= 1
    if a => b
      a -= b
      result++    
  result :=  c ** result * 2    ' return this value, division by 2^32 is implicit in **
                                  '  for smaller values |c|<2^30, the *2 can precede the **.

In spin, ** is a double precision operation, but you get to hand that off to the interpreter.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com

Jack Crenshaw · 2009-04-07 19:58

PhiPi said...
Jack,

Consider this example, though: ra:rb = $0000_0001:0000_0000, rx = $0000_0001 (i.e. ra = rx). The quotient ($1_0000_0000) overflows 32 bits. I believe this would be the case any time ra > rx.

It doesn't work that way, Phil. Or at least, it shouldn't. rx should always be left justified, so that its high bit is always
a one. With that arrangement, rx ends up as $1000_0000. Even if rx = $1111_1111, subtracting rx gives
$0111_1111. The high bit has been reset, the 1 has been recorded as the first bit of the result, and the process
can continue.

This is nothing new. Remember that in ordinary long division, we had to "point off" the divisor so that it was as large
as possible and still "go into" the dividend. The binary algorithm works exactly the same way, except that the trial
divisions are much easier because the partial quotient is always either a 1 or a 0. You never have to back up to a
smaller quotient digit, as sometimes happens with decimal division.

Jack

Phil Pilgrim (PhiPi) · 2009-04-07 20:25

Jack Crenshaw said...
rx should always be left justified, so that its high bit is always a one.

??? Are you sure you're not thinking about floating point? In unsigned integer 64/32=32 division, anytime the most significant long of the dividend is greater than or equal to the divisor, you will get an overflow (i.e. the quotient won't fit in 32 bits).

-Phil

PASM Devision routine that works for full 32 bit devision,Help Fix this one or

Comments