PDA

View Full Version : F32 - Concise floating point code for the Propeller



lonesock
11-20-2010, 03:04 PM
Hi, All.

I just uploaded F32 to the OBEX (http://obex.parallax.com/objects/689/). It is basically a rewrite of Float32Full, faster and fitting into a single cog. The Spin calling functions are identical, so it should make a convenient drop-in replacement.

Please try it out if you have any code that uses Float32 or Float32Full currently, and let me know if you have any problems, or feature requests.

thanks,
Jonathan

RossH
11-20-2010, 10:09 PM
Faster AND smaller?

Very impressive!

Ross.

Sapieha
11-20-2010, 10:35 PM
Hi lonesock.


NICE Work



Hi, All.

I just uploaded F32 to the OBEX (http://obex.parallax.com/objects/689/). It is basically a rewrite of Float32Full, faster and fitting into a single cog. The Spin calling functions are identical, so it should make a convenient drop-in replacement.

Please try it out if you have any code that uses Float32 or Float32Full currently, and let me know if you have any problems, or feature requests.

thanks,
Jonathan

Humanoido
11-21-2010, 10:14 AM
Hi, All.

I just uploaded F32 to the OBEX (http://obex.parallax.com/objects/689/). It is basically a rewrite of Float32Full, faster and fitting into a single cog. The Spin calling functions are identical, so it should make a convenient drop-in replacement.

Please try it out if you have any code that uses Float32 or Float32Full currently, and let me know if you have any problems, or feature requests.

thanks,
JonathanExcellent!

lonesock
11-22-2010, 04:38 PM
Thanks for the kind words. There is one final thing I need to document, and that is the domain and accuracy for each function. (Some of them still use the lookup tables in prop ROM, with linear interpolation between points, as did the original Float32Full implementation.)

Jonathan

scurrier
11-22-2010, 04:42 PM
Cool. I'll have to try these when I need more speed. Can you comment on how you achieved the improvements?

lonesock
11-22-2010, 04:57 PM
Sure. I don't remember everything, but here's a few:

* All of the Arc* functions used to be polynomial approximations, IIRC using 6 FP adds and 6 FP Multiplies, plus some other preconditioning functions, and a SQRT. I switched to an efficient CORDIC routine to calculate ATan2 directly (ATan2 originally computed the division, then called ATan), gaining much more speed, and better accuracy as well. Now all the Arc* functions use the ATan 2 implementation.

* Both multiply and divide had some inefficiencies in their main loops, and computed more bits than necessary (fp32 only needs 24 bits of mantissa, so multiply uses 24 iterations, while divide needed 26 iterations to get the rounding right)

* the command dispatch table no longer needs to fit inside the cog's RAM, and embeds the call command directly, so the dispatch routine is smaller too.

* Sqrt used an iterative scheme with embedded FP multiplications...I switched to calculating it directly (with a nifty sliding window, which I have never seen before...might have to write a mini-whitepaper on it [8^)

* the table interpolation is a bit faster now

* the Sin and Cosine code use the faster table interpolation, and the Tangent code reuses some preliminary results (the angle scaling) from Sin when calling Cos.

* various small tweaks.

Jonathan

CastIrony
11-22-2010, 08:49 PM
* Sqrt used an iterative scheme with embedded FP multiplications...I switched to calculating it directly (with a nifty sliding window, which I have never seen before...might have to write a mini-whitepaper on it [8^)

Is it anything like this (http://cbloomrants.blogspot.com/2010/11/11-20-10-function-approximation-by.html)?

lonesock
11-22-2010, 09:24 PM
Nope! [8^)

The calculation itself is a pretty typical bit-by-bit solution (if remainder >= ((root+delta)^2 - root^2) then you can subtract that term from the remainder, and add delta to the root). The cool part is the adjusting the remainder and root values in situ to keep from overflowing, and to keep all significant bits (for integers, sqrt of a 32 bit value is a 16 bit value, but for floating point, I have 24 significant bits in the input, and need 24 significant bits in the output).

Jonathan

AJM
11-23-2010, 01:38 AM
Just noticed this - thanks Jonathan

scurrier
11-23-2010, 02:08 AM
Very cool stuff, thanks a lot.

prof_braino
11-23-2010, 02:42 AM
...with a nifty sliding window, which I have never seen before...might have to write a mini-whitepaper on it ....

*lonesock is now a rockstar among propellerheads*

Please post the whitepaper, and join the Fourier for dummies thread, we need your help!

MacTuxLin
11-23-2010, 10:21 AM
Thanks Jonathan.

Your previous F32 was already a great help in comparison to FloatMath. I use FDiv a lot. In one of my function, the old calculation took 140528 cycles & when I replaced with your's, it became 7376 cycles.

Appreciated your support on this.

lonesock
11-23-2010, 05:56 PM
Thanks for the kind words, everyone. I'm glad / I hope it's useful. [8^)

Jonathan

Chris_D
11-27-2010, 02:53 PM
This object is VERY facinating and the timing is really good for what I am working on.

Unfortunatly, I am not sure if I can use it for what I want to. As most of what I am doing is speed sensitve, I am coding as much as I can in PASM. The reality is though that I really don't PASM enough to understand these math routines. So, I am wondering if these routines can be called from a PASM routine running in another COG? If it is possible, how is this done?

Thanks

Chris

lonesock
11-27-2010, 05:33 PM
Hi, Chris.

Great question. I updated the F32 object to more easily support calling from your own PASM cog, and updated the demo to show off a simple incrementing application. Btw, to get maximum speed out of your cog, I recommend setting up and starting the float operation, then doing some other stuff, then waiting to get the result back just before you need it. In the demo code I just loop until the data is ready, but that's not an overly efficient use of your cog's time [8^)

Jonathan

Chris_D
11-27-2010, 05:53 PM
Hi Jonathan,

Thanks for responding. I was hoping it would be possible to call these routines from PASM in another COG.

Can you provide a sample code of how this is done? Nothing complicated, just an example of how I would multiply or divide two numbers from within a PASM COG.

Thanks

Chris

lonesock
11-27-2010, 05:55 PM
Sure! This is what was added to the file "demo_F32.spin" in the latest version of the F32 object in the OBEX:
(of course it is showing Add instead of Mul or Div, but you get the picture)



{{

This portion of the demo shows the calling convention used by PASM.
F32 expects a vector of 3 sequential longs in Hub RAM, call them:
"result a b"
F32 also has a pointer long "f32_Cmd". The calling goes like this:

result := the dispatch call command (e.g. cmdFAdd)
a := the first floating point encoded parameter
b := the second floating point encoded parameter

Now the vector is initialized, you set "f32_Cmd" to the address of
the start of the vector:

f32_Cmd := @result

Just wait until f32_Cmd equals 0, and by then the F32 object wrote
the result of the floating point operation into "result" (the
head of the input vector).
}}
PUB demo_F32_pasm | timeout
' The PASM calling cog needs to know 2 base addresses
F32_call_vector[0] := f32.Cmd_ptr
F32_call_vector[1] := f32.Call_ptr
' start up the demo cog
cognew( @F32_pasm_eg, @F32_call_vector )

' and just print some stuff
timeout := cnt
repeat
' print it out
term.str( fs.FloatToString( F32_call_vector[2] ) )
term.tx( 13 )
' wait
waitcnt( timeout += clkfreq )

DAT ' this is the F32 call vector (3 longs)
' Note: all 3 are initialized to 0.0 (the Spin compiler can do fp32 constants)
F32_call_vector long 0.0[3]

ORG 0
F32_pasm_eg ' read my pointer values in
mov t1, par
rdlong cmd_ptr, t1
add t1, #4
rdlong call_ptr, t1

' initialize vector[1] (1st parameter) to 1.0
wrlong increment, t1

demo_loop ' load the dispatch call into vector[0]
mov t1, #f32#offAdd
add t1, call_ptr
rdlong t1, t1
wrlong t1, par

' call the F32 routine by setting the command pointer to non-0
mov t1, par
wrlong t1, cmd_ptr

' now wait till it's done!
:waiting_loop rdlong t1, cmd_ptr wz
if_nz jmp #:waiting_loop

' Done! vector[0] = vector[1] + vector[2]
rdlong t1, par

' update my 2nd parameter, and do this all over again
mov t2, par
add t2, #8
wrlong t1, t2

jmp #demo_loop

increment long 1.0e6
cmd_ptr res 1
call_ptr res 1
t1 res 1
t2 res 1


Jonathan

Thric
11-27-2010, 06:20 PM
Cool!!
Now I get another cog!

Chris_D
11-27-2010, 06:53 PM
Jonathan,

I think I got it now.

Thank much!

Chris

Chris_D
11-29-2010, 10:30 AM
Jonathan,

I am sorry to keep bugging you but I need more help :-( I thought I could figure out the calling sequence based on your PASM example you provided, but I couldn't.

Apparently I only know the very basics of PASM (and spin for that matter) so even though I studied your example for a couple hours I could not figure it out.

I think what could help is if you showed me where in that example you place the two values being operated on.

I probably need more details too, but I just don't know enough to figure out what I don't know :-(

Chris

lonesock
11-29-2010, 07:28 PM
No problem! As it turns out, I screwed up my dispatch table offsets (4 bytes per long...who would have guessed? [8^)

So, the F32 object is updated again to version 1.2, and there is another file "PASM_demo_F32.spin".

Jonathan

Here's the PASM portion of the example:


DAT ' this is the F32 call vector (3 longs)
F32_call_vector long 0.0[3] ' initialized to 0.0, for no good reason
demo_return_value long 0 ' sequentially next, after the call vector

ORG 0
F32_pasm_eg ' read my pointer values in
mov vec_ptr, par
rdlong cmd_ptr, vec_ptr
add vec_ptr, #4
rdlong call_ptr, vec_ptr

' let's calculate the area of a circle: A = Pi * r * r

' do Pi * r
mov vec_ptr, par ' initialize my vector pointer
' I want a multiplication operation (in vector[0])
mov t1, #f32#offMul ' Multiplication offset in my dispatch table
add t1, call_ptr ' add in the base address of the dispatch table
rdlong t1, t1 ' read the actual dispatch call into t1
wrlong t1, vec_ptr ' write t1 into vector[0]
' I want Pi as parameter # 1 (in vector[1])
add vec_ptr, #4
wrlong val_Pi, vec_ptr
' I want the radius as parameter # 2 (in vector[2])
add vec_ptr, #4
wrlong val_Radius, vec_ptr
' call the F32 routine by setting the command pointer to the address of the vector
mov vec_ptr, par
wrlong vec_ptr, cmd_ptr
' now wait till it's done!
:waiting_loop1 rdlong t1, cmd_ptr wz
if_nz jmp #:waiting_loop1

' halfway done! vector[0] holds the result (Pi * Radius)

' do (Pi * r) * r
mov vec_ptr, par ' initialize my vector pointer
' read my intermediate result
rdlong val_Temp, vec_ptr
' I want a multiplication operation (in vector[0])
mov t1, #f32#offMul ' Multiplication offset in my dispatch table
add t1, call_ptr ' add in the base address of the dispatch table
rdlong t1, t1 ' read the actual dispatch call into t1
wrlong t1, vec_ptr ' write t1 into vector[0]
' I want my intermediate as parameter # 1 (in vector[1])
add vec_ptr, #4
wrlong val_Temp, vec_ptr
' I want the radius as parameter # 2 (in vector[2])
add vec_ptr, #4
wrlong val_Radius, vec_ptr
' call the F32 routine by setting the command pointer to the address of the vector
mov vec_ptr, par
wrlong vec_ptr, cmd_ptr
' now wait till it's done!
:waiting_loop2 rdlong t1, cmd_ptr wz
if_nz jmp #:waiting_loop2

' Done! The final result is in vector[0]
mov vec_ptr, par
rdlong val_Temp, vec_ptr

' write it to the output value (which is the long right after vector[2])
add vec_ptr, #12
wrlong val_Temp, vec_ptr


' done here, press the self-destruct button
cogid t1
cogstop t1

' the Spin compiler can handle floating point constants
val_Pi long pi
val_Radius long 7.5
val_Temp res 1
cmd_ptr res 1
call_ptr res 1
t1 res 1
vec_ptr res 1

Chris_D
11-30-2010, 10:05 AM
Thanks Jonathan,

I suspect that demo should do the trick!

Chris

Jay Kickliter
02-27-2011, 11:59 PM
I'm trying to cut and paste F32 routines into my PASM program. But, except for _FFloat, when ever I call a routine, the cog locks up. The file is compiling fine. What's the minimum that needs to be cut and pasted to get _FAdd, _FMul, _FDiv, and _FSqr to work? I already have _Unpack, _Unpack2, _Pack, _FFloat, local variables, constants, and the CON block above included in my code.

Thanks

Jay Kickliter
02-28-2011, 12:32 AM
Disregard, I accidentally deleted 'ret' off the end of '_unpack2_ret'.

Martin_H
03-11-2011, 12:20 AM
This is exactly what I am looking for. Thanks!

Kye
03-11-2011, 01:14 AM
Use private functions to decrease the size of the objects SPIN code! Massive waste of space there... Good job otherwise.

Humanoido
03-11-2011, 08:34 AM
Good job! Keeping it in one Cog is good and glad to see you're making additions to the code. Do you think Mike Green will make a float version of FemtoBasic that's transparent?

mpark
03-11-2011, 02:36 PM
Use private functions to decrease the size of the objects SPIN code! Massive waste of space there... Good job otherwise.

Huh?

lonesock
03-11-2011, 03:50 PM
Thanks, everybody!

@Kye: I assume you mean using a function to encapsulate the "repeat / while f32_Cmd" lines. I left each in their own function for speed purposes, but if anyone wished to decrease the massive waste of space that is a good starting point.

Speaking of speed, it turns out that:


repeat
while f32_Cmd
Is faster than:


repeat while f32_Cmd
Not sure why. It didn't seem important enough to bump the revision and upload a v1.3, but if I make any other changes that will be included in the list. If we had conditional compilation in the vanilla Propeller Tool I'd definitely have a smaller vs faster tradeoff flag.

Jonathan

lonesock
03-11-2011, 04:08 PM
OK, following up on (my interpretation of) Kye's suggestion, I tested the timing and code size of 3 variations of FMul( pi * pi ):


PUB FMul(a, b)
result := cmdFMul
f32_Cmd := @result
repeat while f32_Cmd

12 bytes, ~3504 clocks



PUB FMul(a, b)
result := cmdFMul
f32_Cmd := @result
repeat
while f32_Cmd

10 bytes, ~3152 clocks



PRI wait( address )
f32_Cmd := address
repeat
while f32_Cmd

PUB FMul(a, b)
result := cmdFMul
wait( @result )

9 bytes, ~3904 clocks

The clocks are approximate as the timing will depend a bit on how the spin repeat loop aligns with when the PASM code is done. Note that going from 10 to 9 bytes means I'd save 1 byte per PUB math function, so approximately 25 bytes, and the wait function takes up 6 bytes. However, going from the current version (1.2) would be 12 down to 9 bytes, so ~70 bytes of savings. My personal leaning is to just upload the faster/smaller version (with repeat and while on distinct lines).

Jonathan

John Abshier
03-11-2011, 07:06 PM
I was surprised at the magnitude of the speed up. I vote for your suggestion for change. For me 70 bytes is not worth the slow down of a function call.

John Abshier

lonesock
05-02-2011, 02:48 PM
Version 1.3 is now in the OBEX. It contains the faster "repeat" loops, and also fixes a bug in FRound found by John Abshier...THANKS!!

Jonathan

Kye
05-02-2011, 02:58 PM
I'm all about space savings.

@You got the idea Lonesock.

lonesock
08-02-2011, 10:58 PM
I'm thinking of adding the following 'atof' function to F32. The weak link is that it uses Exp10 to correct the exponent, which in turn uses the prop's internal log/exp tables, so I'll probably need to swap that out for a loop multiplying by 10.0 or 0.1. Can anyone see any bugs, or ways to speed it up?


PUB atof( strptr ) : f | int, sign, dmag, mag, get_exp, b
' get all the digits as if this is an integer (but track the exponent)
' int := sign := dmag := mag := get_exp := 0
longfill( @int, 0, 5 )
repeat
case b := byte[strptr++]
"-": sign := $8000_0000
"0".."9":
int := int*10 + b - "0"
mag += dmag
".": dmag := -1
other: ' either done, or about to do exponent
if get_exp
' we just finished processing the exponent
if sign
int := -int
mag += int
quit
else
' convert int to a (signed) float
f := FFloat( int ) | sign
' should we continue?
if (b == "E") or (b == "e")
' int := sign := dmag := 0
longfill( @int, 0, 3 )
get_exp := 1
else
quit
' Exp10 is the weak link...uses the Log table in P1 ROM
f := FMul( f, Exp10( FFloat( mag ) ) )

thanks,
Jonathan

James Newman
02-14-2012, 12:04 AM
About to start using this, and noticed that some of the comments in the asm are from the old Float objects. An example is this line ':execCmd nop ' execute command, which was replaced by getCommand'

Just thought I'd let you know. Great object btw.

[EDIT] Ohh also the subtract function has an unneeded jmp as far as I can tell:


_FSub xor fnumB, Bit31 ' negate B
jmp #_FAdd ' add values

_FAdd call #_Unpack2 ' unpack two variables

lonesock
04-27-2012, 08:25 PM
Hi, everyone! Been away for a while, but I'm back ;-)

Here is an update to F32. I'm calling it 1.5 leaving 1.4 for the code fixes that Marty Lawson did in the table interpolation code...THANKS!

Here's what's different:
* fixed table interp (buggy LOG function, and maybe sine?), for certain input values
* FCmp is faster and smaller
* replaced "jmp #label_ret" with "jmp label_ret", saving 4 clocks...thanks kuroneko!
* fixed the PASM dispatch offset table bug (in counting, I think I had skipped 10). Only would have seen this if calling directly from PASM. I can't find a record of who found this! If if was you please let me know so I can credit you!

That's basically it. I'll leave it here for some testing. If no bugs crop up, I'll update the OBEX.

Thanks, everyone!
Jonathan

MacTuxLin
04-28-2012, 01:18 AM
Thank you Jonathan! I'll be putting this to test on my side too.

SRLM
05-21-2012, 08:11 AM
I tried to make a version of F32 that could be parallel, but in the end it operated faster in a single cog!

I have a bunch of math that could be parallelized, and so I edited F32 a bit to make it so that the wait for a result is in an external function. This requires some new variables and a bit of moving data around. By parallelizing F32, I can have four instances of the object running in four cogs, and have them all crunching numbers at the same time. So here's the multiply function that I came up with:


VAR
long func_result, func_a, func_b
PUB FMulPar(a, b)
{{
Multiplication: result = a * b
Parameters:
a 32-bit floating point value
b 32-bit floating point value
Returns: 32-bit floating point value
}}
func_result := cmdFMul
func_a := a
func_b := b
f32_Cmd := @func_result

PUB Wait

repeat
while f32_Cmd

result := func_result


And for my code that calls it, I basically had something like


fp.FMulPar(w1,w2)
fp1.FMulPar(w1,x2)
fp2.FMulPar(w1,y2)
fp3.FMulPar(w1,z2)

w := fp.wait
x := fp1.wait
y := fp2.wait
z := fp3.wait


In the end, the parallelization overhead added about 50 microseconds to the total execution time (of the four multiplies above). A slight benefit might be seen for functions that take longer (such as cos/atan/sqrt?).

Any suggestions on how to make it faster? If not, then I might look into porting the interpreter thing from Float32.

Heater.
05-21-2012, 08:36 AM
Every spin byte code takes 50 to 100 Prop instructions to execute and every Spin statement generates lots of byte codes so we can guess that the speed of F32 is just lost in the noise here. If I remember F32 pretty much fills a COG so you would have to throw out some operations in order to fit the "interpreter thing" which would be a shame.
If I were needing speed and floating point and provided my application is not too big I would consider writing in C. The code would much prettier not having to call functions to perform all the operations and it would run at about one quarter of native PASM speed.
I'm not sure if the propgcc compiler can make use of F32 "out of the box" but I have used it from the zpugcc compiler with Zog without much difficulty.

Lawson
05-21-2012, 07:46 PM
@SLRM

When I looked through the F32 code to fix the table bug, I noticed that several of the functions are made by chaining a few core functions. I.e. x^y is done as exp(y*ln(x)). These types of functions could be re-implimented using the "interpreter thing" from Float32 possibly freeing up enough space to add an interpreter in? (last I checked F32 only has a couple of longs free)

Lawson