F32 - Concise floating point code for the Propeller

lonesock · 2010-11-20 08:04

Hi, All.

I just uploaded F32 to the OBEX (http://obex.parallax.com/objects/689/). It is basically a rewrite of Float32Full, faster and fitting into a single cog. The Spin calling functions are identical, so it should make a convenient drop-in replacement.

Please try it out if you have any code that uses Float32 or Float32Full currently, and let me know if you have any problems, or feature requests.

thanks,
Jonathan

RossH · 2010-11-20 15:09

Faster AND smaller?

Very impressive!

Ross.

Sapieha · 2010-11-20 15:35

Hi lonesock.

NICE Work

Humanoido · 2010-11-21 03:14

Excellent!

lonesock · 2010-11-22 09:38

Thanks for the kind words. There is one final thing I need to document, and that is the domain and accuracy for each function. (Some of them still use the lookup tables in prop ROM, with linear interpolation between points, as did the original Float32Full implementation.)

Jonathan

scurrier · 2010-11-22 09:42

Cool. I'll have to try these when I need more speed. Can you comment on how you achieved the improvements?

lonesock · 2010-11-22 09:57

Sure. I don't remember everything, but here's a few:

* All of the Arc* functions used to be polynomial approximations, IIRC using 6 FP adds and 6 FP multiplies, plus some other preconditioning functions, and a SQRT. I switched to an efficient CORDIC routine to calculate ATan2 directly (ATan2 originally computed the division, then called ATan), gaining much more speed, and better accuracy as well. Now all the Arc* functions use the ATan2 implementation.

* Both multiply and divide had some inefficiencies in their main loops, and computed more bits than necessary (fp32 only needs 24 bits of mantissa, so multiply now uses 24 iterations, while divide needs 26 iterations to get the rounding right).

* the command dispatch table no longer needs to fit inside the cog's RAM, and embeds the call command directly, so the dispatch routine is smaller too.

* Sqrt used an iterative scheme with embedded FP multiplications...I switched to calculating it directly (with a nifty sliding window, which I have never seen before...might have to write a mini-whitepaper on it [8^)

* the table interpolation is a bit faster now

* the Sin and Cosine code use the faster table interpolation, and the Tangent code reuses some preliminary results (the angle scaling) from Sin when calling Cos.

* various small tweaks.

Jonathan
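
For readers who have not met CORDIC before, here is a minimal Python sketch of the vectoring-mode idea behind an ATan2 routine. It is not the F32 PASM code: the name cordic_atan2, the 24-iteration count, and the use of floating-point arithmetic are assumptions made for illustration. A fixed-point version would replace each multiplication by 2.0 ** -i with a right shift, which is what makes the method cheap on a chip without a hardware multiplier.

```python
import math

# Assumed iteration count for illustration; roughly one result bit per step.
ITERATIONS = 24

# Table of atan(2**-i), the rotation angle applied at step i.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]


def cordic_atan2(y, x):
    """Approximate atan2(y, x) by rotating the vector (x, y) onto the +x axis."""
    if x == 0 and y == 0:
        return 0.0
    angle = 0.0
    # Pre-rotate by 90 degrees so the vector lies in the right half-plane,
    # where the CORDIC iteration is guaranteed to converge.
    if x < 0:
        if y >= 0:
            x, y, angle = y, -x, math.pi / 2
        else:
            x, y, angle = -y, x, -math.pi / 2
    for i in range(ITERATIONS):
        factor = 2.0 ** -i          # stands in for a right shift by i bits
        if y > 0:
            # Rotate clockwise, and record the angle that was removed.
            x, y = x + y * factor, y - x * factor
            angle += ATAN_TABLE[i]
        else:
            # Rotate counter-clockwise.
            x, y = x - y * factor, y + x * factor
            angle -= ATAN_TABLE[i]
    return angle
```

Because y and x are handled together, no division is needed before the angle is computed, which matches the point above about ATan2 no longer calling a divide first.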

CastIrony · 2010-11-22 13:49

lonesock wrote: »

* Sqrt used an iterative scheme with embedded FP multiplications...I switched to calculating it directly (with a nifty sliding window, which I have never seen before...might have to write a mini-whitepaper on it [8^)

Is it anything like this?

lonesock · 2010-11-22 14:24

Nope! [8^)

The calculation itself is a pretty typical bit-by-bit solution (if remainder >= ((root+delta)^2 - root^2), then you can subtract that term from the remainder and add delta to the root). The cool part is adjusting the remainder and root values in situ to keep from overflowing, and to keep all significant bits (for integers, the sqrt of a 32-bit value is a 16-bit value, but for floating point, I have 24 significant bits in the input and need 24 significant bits in the output).

Jonathan
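
The bit-by-bit rule described above can be sketched in a few lines of Python. This shows only the textbook integer version; the function name sqrt_bit_by_bit and its bits parameter are made up for this example, and the in-situ "sliding window" adjustment that F32 uses is not reproduced here (Python integers cannot overflow, so the sketch does not need it).

```python
def sqrt_bit_by_bit(value, bits=16):
    """Integer square root, deciding one result bit per step (high bit first).

    Invariant: remainder == value - root * root at the top of every step.
    Valid for 0 <= value < 2 ** (2 * bits).
    """
    root = 0
    remainder = value
    for shift in range(bits - 1, -1, -1):
        delta = 1 << shift
        # (root + delta)^2 - root^2 == (2 * root + delta) * delta
        term = ((root << 1) + delta) << shift
        if remainder >= term:
            remainder -= term
            root += delta
    return root
```

For example, sqrt_bit_by_bit(17) walks down from delta = 32768 and only accepts delta = 4, leaving root = 4 and a remainder of 1. Calling it with bits=24 on a 48-bit input gives the 24 significant output bits mentioned above, at the cost of needing a 48-bit remainder, which is presumably where the in-place adjustment earns its keep on a 32-bit machine.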

AJM · 2010-11-22 18:38

Just noticed this - thanks Jonathan

scurrier · 2010-11-22 19:08

Very cool stuff, thanks a lot.

prof_braino · 2010-11-22 19:42

lonesock wrote: »

...with a nifty sliding window, which I have never seen before...might have to write a mini-whitepaper on it ....

*lonesock is now a rockstar among propellerheads*

Please post the whitepaper, and join the Fourier for dummies thread, we need your help!

MacTuxLin · 2010-11-23 03:21

Thanks Jonathan.

Your previous F32 was already a great help in comparison to FloatMath. I use FDiv a lot. In one of my functions, the old calculation took 140528 cycles, and when I replaced it with yours, it dropped to 7376 cycles.

I appreciate your support on this.

lonesock · 2010-11-23 10:56

Thanks for the kind words, everyone. I'm glad / I hope it's useful. [8^)

Jonathan

Chris_D · 2010-11-27 07:53

This object is VERY fascinating and the timing is really good for what I am working on.

Unfortunately, I am not sure if I can use it for what I want to do. As most of what I am doing is speed sensitive, I am coding as much as I can in PASM. The reality, though, is that I really don't know PASM well enough to understand these math routines. So, I am wondering if these routines can be called from a PASM routine running in another COG? If it is possible, how is this done?

Thanks

Chris

lonesock · 2010-11-27 10:33

Hi, Chris.

Great question. I updated the F32 object to more easily support calling from your own PASM cog, and updated the demo to show off a simple incrementing application. Btw, to get maximum speed out of your cog, I recommend setting up and starting the float operation, then doing some other stuff, then waiting to get the result back just before you need it. In the demo code I just loop until the data is ready, but that's not an overly efficient use of your cog's time [8^)

Jonathan

Chris_D · 2010-11-27 10:53

Hi Jonathan,

Thanks for responding. I was hoping it would be possible to call these routines from PASM in another COG.

Can you provide a sample code of how this is done? Nothing complicated, just an example of how I would multiply or divide two numbers from within a PASM COG.

Thanks

Chris

lonesock · 2010-11-27 10:55

Sure! This is what was added to the file "demo_F32.spin" in the latest version of the F32 object in the OBEX:
(of course it is showing Add instead of Mul or Div, but you get the picture)

{{

  This portion of the demo shows the calling convention used by PASM.
  F32 expects a vector of 3 sequential longs in Hub RAM, call them:
      "result a b"
  F32 also has a pointer long "f32_Cmd".  The calling goes like this:

  result := the dispatch call command (e.g. cmdFAdd)
  a := the first floating point encoded parameter
  b := the second floating point encoded parameter

  Now the vector is initialized, you set "f32_Cmd" to the address of
  the start of the vector:

  f32_Cmd := @result

  Just wait until f32_Cmd equals 0; by then the F32 object has written
  the result of the floating point operation into "result" (the
  head of the input vector).
}}
PUB demo_F32_pasm | timeout
  ' The PASM calling cog needs to know 2 base addresses
  F32_call_vector[0] := f32.Cmd_ptr
  F32_call_vector[1] := f32.Call_ptr
  ' start up the demo cog
  cognew( @F32_pasm_eg, @F32_call_vector )

  ' and just print some stuff
  timeout := cnt
  repeat
    ' print it out
    term.str( fs.FloatToString( F32_call_vector[2] ) )
    term.tx( 13 )
    ' wait
    waitcnt( timeout += clkfreq )   
  
DAT     ' this is the F32 call vector (3 longs)
        ' Note: all 3 are initialized to 0.0 (the Spin compiler can do fp32 constants)
F32_call_vector         long    0.0[3]

ORG 0
F32_pasm_eg             ' read my pointer values in
                        mov     t1, par
                        rdlong  cmd_ptr, t1
                        add     t1, #4
                        rdlong  call_ptr, t1

                        ' initialize vector[1] (1st parameter) to the increment value (1.0e6)
                        wrlong  increment, t1

demo_loop               ' load the dispatch call into vector[0]
                        mov     t1, #f32#offAdd
                        add     t1, call_ptr
                        rdlong  t1, t1
                        wrlong  t1, par

                        ' call the F32 routine by setting the command pointer to non-0
                        mov     t1, par
                        wrlong  t1, cmd_ptr

                        ' now wait till it's done!
:waiting_loop           rdlong  t1, cmd_ptr     wz
              if_nz     jmp     #:waiting_loop

                        ' Done!  vector[0] = vector[1] + vector[2]
                        rdlong  t1, par

                        ' update my 2nd parameter, and do this all over again
                        mov     t2, par
                        add     t2, #8
                        wrlong  t1, t2

                        jmp     #demo_loop

increment     long      1.0e6                        
cmd_ptr       res       1
call_ptr      res       1
t1            res       1
t2            res       1

Jonathan
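
If the PASM is hard to follow, the handshake itself can be modeled on a PC. The Python sketch below is only a model of the protocol described in the demo comment (write "result a b", point the command long at the vector, wait for it to return to 0). Everything named here is hypothetical: HubModel, f32_server, start_op, wait_result, and the numeric command codes are stand-ins, since the real F32 places a dispatch instruction read from its call table in vector[0] rather than a small integer.

```python
import struct
import threading
import time

CMD_FADD = 1  # hypothetical command codes, for this model only
CMD_FMUL = 2


def to_bits(x):
    """Encode a Python float as the 32-bit pattern of an fp32 value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def from_bits(bits):
    """Decode a 32-bit pattern back into a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


class HubModel:
    """Stand-in for hub RAM: a list of longs plus the shared command long."""

    def __init__(self, longs=16):
        self.ram = [0] * longs
        self.cmd = 0          # models f32_Cmd; 0 means the math cog is idle
        self.stop = False


def f32_server(hub):
    """Models the F32 cog: poll the command long, serve one request at a time."""
    while not hub.stop:
        addr = hub.cmd
        if addr == 0:
            time.sleep(0)     # nothing to do, let the caller run
            continue
        op = hub.ram[addr]
        a = from_bits(hub.ram[addr + 1])
        b = from_bits(hub.ram[addr + 2])
        result = a + b if op == CMD_FADD else a * b
        hub.ram[addr] = to_bits(result)   # result overwrites the command slot
        hub.cmd = 0                       # clearing the command signals "done"


def start_op(hub, addr, op, a, b):
    """Fill in the 3-long vector, then publish its address (must be non-zero)."""
    hub.ram[addr] = op
    hub.ram[addr + 1] = to_bits(a)
    hub.ram[addr + 2] = to_bits(b)
    hub.cmd = addr


def wait_result(hub, addr):
    """Block until the command long is 0 again, then read vector[0]."""
    while hub.cmd != 0:
        time.sleep(0)
    return from_bits(hub.ram[addr])


def run_demo():
    hub = HubModel()
    worker = threading.Thread(target=f32_server, args=(hub,), daemon=True)
    worker.start()
    start_op(hub, 4, CMD_FADD, 1.5, 2.25)
    # ... the calling cog could do unrelated work here ...
    total = wait_result(hub, 4)
    start_op(hub, 4, CMD_FMUL, total, 2.0)
    product = wait_result(hub, 4)
    hub.stop = True
    worker.join()
    return total, product
```

Splitting the call into start_op and wait_result mirrors the earlier advice: start the float operation, do something else, and only wait for the result just before it is needed.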

Thric · 2010-11-27 11:20

Cool!!
Now I get another cog!

Chris_D · 2010-11-27 11:53

Jonathan,

I think I got it now.

Thanks much!

Chris

Chris_D · 2010-11-29 03:30

Jonathan,

I am sorry to keep bugging you but I need more help :-( I thought I could figure out the calling sequence based on the PASM example you provided, but I couldn't.

Apparently I only know the very basics of PASM (and spin for that matter) so even though I studied your example for a couple hours I could not figure it out.

I think what could help is if you showed me where in that example you place the two values being operated on.

I probably need more details too, but I just don't know enough to figure out what I don't know :-(

Chris

lonesock · 2010-11-29 12:28

No problem! As it turns out, I screwed up my dispatch table offsets (4 bytes per long...who would have guessed? [8^)

So, the F32 object is updated again to version 1.2, and there is another file "PASM_demo_F32.spin".

Jonathan

Here's the PASM portion of the example:

DAT     ' this is the F32 call vector (3 longs)
F32_call_vector         long    0.0[3]          ' initialized to 0.0, for no good reason
demo_return_value       long    0               ' sequentially next, after the call vector

ORG 0
F32_pasm_eg             ' read my pointer values in
                        mov     vec_ptr, par
                        rdlong  cmd_ptr, vec_ptr
                        add     vec_ptr, #4
                        rdlong  call_ptr, vec_ptr

                        ' let's calculate the area of a circle: A = Pi * r * r

                        ' do Pi * r
                        mov     vec_ptr, par            ' initialize my vector pointer
                        ' I want a multiplication operation (in vector[0])
                        mov     t1, #f32#offMul         ' Multiplication offset in my dispatch table
                        add     t1, call_ptr            ' add in the base address of the dispatch table
                        rdlong  t1, t1                  ' read the actual dispatch call into t1
                        wrlong  t1, vec_ptr             ' write t1 into vector[0]
                        ' I want Pi as parameter # 1 (in vector[1])
                        add     vec_ptr, #4
                        wrlong  val_Pi, vec_ptr
                        ' I want the radius as parameter # 2 (in vector[2])
                        add     vec_ptr, #4
                        wrlong  val_Radius, vec_ptr
                        ' call the F32 routine by setting the command pointer to the address of the vector
                        mov     vec_ptr, par
                        wrlong  vec_ptr, cmd_ptr
                        ' now wait till it's done!
:waiting_loop1          rdlong  t1, cmd_ptr     wz
              if_nz     jmp     #:waiting_loop1

                        ' halfway done!  vector[0] holds the result (Pi * Radius)

                        ' do (Pi * r) * r
                        mov     vec_ptr, par            ' initialize my vector pointer
                        ' read my intermediate result
                        rdlong  val_Temp, vec_ptr
                        ' I want a multiplication operation (in vector[0])
                        mov     t1, #f32#offMul         ' Multiplication offset in my dispatch table
                        add     t1, call_ptr            ' add in the base address of the dispatch table
                        rdlong  t1, t1                  ' read the actual dispatch call into t1
                        wrlong  t1, vec_ptr             ' write t1 into vector[0]
                        ' I want my intermediate as parameter # 1 (in vector[1])
                        add     vec_ptr, #4
                        wrlong  val_Temp, vec_ptr
                        ' I want the radius as parameter # 2 (in vector[2])
                        add     vec_ptr, #4
                        wrlong  val_Radius, vec_ptr
                        ' call the F32 routine by setting the command pointer to the address of the vector
                        mov     vec_ptr, par
                        wrlong  vec_ptr, cmd_ptr
                        ' now wait till it's done!
:waiting_loop2          rdlong  t1, cmd_ptr     wz
              if_nz     jmp     #:waiting_loop2

                        ' Done!  The final result is in vector[0]
                        mov     vec_ptr, par
                        rdlong  val_Temp, vec_ptr

                        ' write it to the output value (which is the long right after vector[2])
                        add     vec_ptr, #12
                        wrlong  val_Temp, vec_ptr


                        ' done here, press the self-destruct button
                        cogid   t1
                        cogstop t1

' the Spin compiler can handle floating point constants
val_Pi        long      pi
val_Radius    long      7.5
val_Temp      res       1
cmd_ptr       res       1
call_ptr      res       1
t1            res       1
vec_ptr       res       1
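
To sanity-check what this demo should leave in demo_return_value, the same two multiplications can be reproduced on a PC with each intermediate rounded to fp32. This is a host-side reference only; to_f32 and fmul32 are helper names invented for this sketch, and F32's own rounding might differ from it in the last bit.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fmul32(a, b):
    """Multiply two fp32 values and round the product back to fp32.

    The product of two 24-bit mantissas fits exactly in a double,
    so only the final conversion rounds.
    """
    return to_f32(to_f32(a) * to_f32(b))


val_pi = to_f32(math.pi)      # same role as val_Pi in the PASM demo
val_radius = 7.5              # same value as val_Radius

val_temp = fmul32(val_pi, val_radius)    # first call:  Pi * r
area = fmul32(val_temp, val_radius)      # second call: (Pi * r) * r
```

The area comes out close to 176.71, so anything wildly different printed by the Spin side would suggest the call vector was set up incorrectly.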

Chris_D · 2010-11-30 03:05

Thanks Jonathan,

I suspect that demo should do the trick!

Chris

Jay Kickliter · 2011-02-27 16:59

I'm trying to cut and paste F32 routines into my PASM program. But, except for _FFloat, whenever I call a routine, the cog locks up. The file is compiling fine. What's the minimum that needs to be cut and pasted to get _FAdd, _FMul, _FDiv, and _FSqr to work? I already have _Unpack, _Unpack2, _Pack, _FFloat, local variables, constants, and the CON block above included in my code.

Thanks

Jay Kickliter · 2011-02-27 17:32

Disregard, I accidentally deleted 'ret' off the end of '_unpack2_ret'.

Martin_H · 2011-03-10 17:20

This is exactly what I am looking for. Thanks!

Kye · 2011-03-10 18:14

Use private functions to decrease the size of the object's Spin code! Massive waste of space there... Good job otherwise.

Humanoido · 2011-03-11 01:34

Good job! Keeping it in one Cog is good and glad to see you're making additions to the code. Do you think Mike Green will make a float version of FemtoBasic that's transparent?

mpark · 2011-03-11 07:36

Kye wrote: »

Use private functions to decrease the size of the object's Spin code! Massive waste of space there... Good job otherwise.

Huh?

lonesock · 2011-03-11 08:50

Thanks, everybody!

@Kye: I assume you mean using a function to encapsulate the "repeat / while f32_Cmd" lines. I left each in their own function for speed purposes, but if anyone wished to decrease the massive waste of space that is a good starting point.

Speaking of speed, it turns out that:

  repeat
  while f32_Cmd

Is faster than:

  repeat while f32_Cmd

Not sure why. It didn't seem important enough to bump the revision and upload a v1.3, but if I make any other changes that will be included in the list. If we had conditional compilation in the vanilla Propeller Tool I'd definitely have a smaller vs faster tradeoff flag.

Jonathan

lonesock · 2011-03-11 09:08

OK, following up on (my interpretation of) Kye's suggestion, I tested the timing and code size of 3 variations of FMul( pi * pi ):

PUB FMul(a, b)
  result  := cmdFMul
  f32_Cmd := @result
  repeat while f32_Cmd

12 bytes, ~3504 clocks

PUB FMul(a, b)
  result  := cmdFMul
  f32_Cmd := @result
  repeat
  while f32_Cmd

10 bytes, ~3152 clocks

PRI wait( address )
  f32_Cmd := address
  repeat
  while f32_Cmd

PUB FMul(a, b)
  result  := cmdFMul
  wait( @result )

9 bytes, ~3904 clocks

The clocks are approximate as the timing will depend a bit on how the spin repeat loop aligns with when the PASM code is done. Note that going from 10 to 9 bytes means I'd save 1 byte per PUB math function, so approximately 25 bytes, and the wait function takes up 6 bytes. However, going from the current version (1.2) would be 12 down to 9 bytes, so ~70 bytes of savings. My personal leaning is to just upload the faster/smaller version (with repeat and while on distinct lines).

Jonathan
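
The size arithmetic in the post above can be written out as a tiny Python calculation. The figure of 25 public math functions is the approximate count given in the post, and net_savings is just a helper name chosen for this sketch.

```python
FUNCTIONS = 25          # approximate number of PUB math functions (from the post)
WAIT_HELPER_BYTES = 6   # size of the shared PRI wait() function


def net_savings(bytes_before, bytes_after, functions, helper_bytes):
    """Bytes saved across all functions, minus the cost of any shared helper."""
    return (bytes_before - bytes_after) * functions - helper_bytes


# v1.2 (12 bytes each) -> two-line repeat/while (10 bytes each), no helper needed
two_line_vs_v12 = net_savings(12, 10, FUNCTIONS, 0)

# two-line variant (10 bytes) -> shared wait() helper (9 bytes each)
helper_vs_two_line = net_savings(10, 9, FUNCTIONS, WAIT_HELPER_BYTES)

# v1.2 (12 bytes) -> shared wait() helper (9 bytes each)
helper_vs_v12 = net_savings(12, 9, FUNCTIONS, WAIT_HELPER_BYTES)

# Approximate per-call cost of the helper relative to the two-line variant
extra_clocks_per_call = 3904 - 3152
```

With these inputs the two-line form saves about 50 bytes over v1.2 while also being the fastest, and the shared helper saves only about 19 bytes more at a cost of roughly 750 clocks per call, which fits the stated leaning toward the two-line version.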
