lonesock
Posts: **876**

Hi, All.

I just uploaded F32 to the OBEX (http://obex.parallax.com/objects/689/). It is basically a rewrite of Float32Full, faster and fitting into a single cog. The Spin calling functions are identical, so it should make a convenient drop-in replacement.

Please try it out if you have any code that uses Float32 or Float32Full currently, and let me know if you have any problems, or feature requests.

thanks,

Jonathan

Free time status: see my avatar [8^)

F32 - fast & concise floating point: OBEX, Thread

Unrelated to the prop: KISSlicer

## Comments

smaller? AND Very impressive!

Ross.

Catalina - a FREE ANSI C compiler for the Propeller. Download it from http://catalina-c.sourceforge.net/

NICE Work

Sapieha

_____________________________________________________

Nothing is impossible, there are only different degrees of difficulty.

For every stupid question there is at least one intelligent answer.

Don't guess - ask instead.

If you don't ask you won't know.

If you're gonna construct something, make it as simple as possible yet as versatile/usable as possible.

Jonathan

* All of the Arc* functions used to be polynomial approximations, IIRC using 6 FP adds and 6 FP multiplies, plus some other preconditioning functions, and a SQRT. I switched to an efficient CORDIC routine to calculate ATan2 directly (ATan2 originally computed the division, then called ATan), gaining much more speed, and better accuracy as well. Now all the Arc* functions use the ATan2 implementation.

* Both multiply and divide had some inefficiencies in their main loops, and computed more bits than necessary (fp32 only needs 24 bits of mantissa, so multiply uses 24 iterations, while divide needed 26 iterations to get the rounding right)

* the command dispatch table no longer needs to fit inside the cog's RAM, and embeds the call command directly, so the dispatch routine is smaller too.

* Sqrt used an iterative scheme with embedded FP multiplications...I switched to calculating it directly (with a nifty sliding window, which I have never seen before...might have to write a mini-whitepaper on it [8^)

* the table interpolation is a bit faster now

* the Sin and Cosine code use the faster table interpolation, and the Tangent code reuses some preliminary results (the angle scaling) from Sin when calling Cos.

* various small tweaks.
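For readers unfamiliar with CORDIC, here is a minimal Python sketch of the vectoring-mode idea behind computing ATan2 directly, without a division. It is not the F32 PASM code: the names `cordic_atan2`, `ITERATIONS` and `ANGLES` are made up for illustration, and it uses floating point multiplies by 2^-i where a Propeller routine would presumably use integer shifts.

```python
import math

ITERATIONS = 24  # assumed: roughly one iteration per mantissa bit
# Table of the elementary rotation angles atan(2^-i).
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

def cordic_atan2(y, x):
    """Approximate atan2(y, x) by rotating the vector onto the +x axis."""
    if x == 0 and y == 0:
        return 0.0
    angle = 0.0
    # Pre-rotate by 90 degrees so the vector starts in the right half-plane,
    # where the iteration is guaranteed to converge.
    if x < 0:
        if y >= 0:
            x, y, angle = y, -x, math.pi / 2
        else:
            x, y, angle = -y, x, -math.pi / 2
    for i in range(ITERATIONS):
        step = 2.0 ** -i
        if y > 0:
            # Rotate clockwise by atan(2^-i); remember the angle we removed.
            x, y = x + y * step, y - x * step
            angle += ANGLES[i]
        elif y < 0:
            # Rotate counter-clockwise by atan(2^-i).
            x, y = x - y * step, y + x * step
            angle -= ANGLES[i]
        else:
            break  # vector already lies on the x axis
    return angle
```

The vector length grows with each step, but only the accumulated angle matters here, so no gain correction is needed.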

Jonathan

Is it anything like this?

The calculation itself is a pretty typical bit-by-bit solution (if remainder >= ((root+delta)^2 - root^2) then you can subtract that term from the remainder, and add delta to the root). The cool part is adjusting the remainder and root values in situ to keep from overflowing, and to keep all significant bits (for integers, sqrt of a 32 bit value is a 16 bit value, but for floating point, I have 24 significant bits in the input, and need 24 significant bits in the output).
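The bit-by-bit rule described above can be sketched for plain integers as follows. This is only the textbook part: it does not attempt the in-situ rescaling ("sliding window") that F32 uses to keep 24 significant bits, and the function name `isqrt_bitwise` and its `bits` parameter are illustrative assumptions.

```python
def isqrt_bitwise(n, bits=16):
    """Integer square root of n (n < 2**(2*bits)), one result bit per step."""
    root = 0
    remainder = n              # invariant: remainder == n - root*root
    delta = 1 << (bits - 1)    # try the highest result bit first
    while delta:
        # (root + delta)^2 - root^2, expanded so no squaring of the sum is needed
        term = 2 * root * delta + delta * delta
        if remainder >= term:
            remainder -= term
            root += delta
        delta >>= 1
    return root
```

With `bits=16` this covers any 32 bit input, matching the "sqrt of a 32 bit value is a 16 bit value" remark.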

Jonathan

*lonesock is now a rockstar among propellerheads*

Please post the whitepaper, and join the Fourier for dummies thread, we need your help!

Your previous F32 was already a great help in comparison to FloatMath. I use FDiv a lot. In one of my functions, the old calculation took 140528 cycles, and when I replaced it with yours, it dropped to 7376 cycles.

I appreciate your support on this.

Jonathan

Unfortunately, I am not sure if I can use it for what I want to do. As most of what I am doing is speed sensitive, I am coding as much as I can in PASM. The reality is, though, that I really don't know PASM well enough to understand these math routines. So, I am wondering if these routines can be called from a PASM routine running in another COG? If it is possible, how is this done?

Thanks

Chris

Great question. I updated the F32 object to more easily support calling from your own PASM cog, and updated the demo to show off a simple incrementing application. Btw, to get maximum speed out of your cog, I recommend setting up and starting the float operation, then doing some other stuff, then waiting to get the result back just before you need it. In the demo code I just loop until the data is ready, but that's not an overly efficient use of your cog's time [8^)
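The "start it, do other work, then wait" pattern can be illustrated with a small Python simulation, using a thread in place of the floating point cog. Everything here is hypothetical (the `Mailbox` layout, the command value `1`, and the helper names); it is not F32's actual hub-memory interface, only the shape of the handshake.

```python
import threading
import time

class Mailbox:
    """Hypothetical shared mailbox: cmd is nonzero while a request is pending."""
    def __init__(self):
        self.cmd = 0        # 0 = idle, result (if any) is ready
        self.a = 0.0
        self.b = 0.0
        self.result = 0.0
        self.running = True

def worker(box):
    # Stands in for the math cog: poll for a command, execute, clear it.
    while box.running:
        if box.cmd == 1:                 # 1 = "add" in this sketch
            box.result = box.a + box.b
            box.cmd = 0                  # clearing cmd signals completion
        else:
            time.sleep(0)                # yield to other threads

def start_add(box, a, b):
    box.a, box.b = a, b
    box.cmd = 1                          # write the command last

def wait_result(box):
    while box.cmd != 0:                  # spin only when the result is needed
        time.sleep(0)
    return box.result

def demo():
    box = Mailbox()
    t = threading.Thread(target=worker, args=(box,), daemon=True)
    t.start()
    start_add(box, 1.5, 2.25)            # kick off the operation
    other_work = sum(range(1000))        # useful work overlaps the math
    total = wait_result(box)             # collect just before it is needed
    box.running = False
    t.join(timeout=1)
    return total, other_work
```

The point of the split between `start_add` and `wait_result` is that the caller is free between the two calls, instead of spinning the whole time.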

Jonathan

Thanks for responding. I was hoping it would be possible to call these routines from PASM in another COG.

Can you provide some sample code showing how this is done? Nothing complicated, just an example of how I would multiply or divide two numbers from within a PASM COG.

Thanks

Chris

(of course it is showing Add instead of Mul or Div, but you get the picture)

Jonathan

Now I get another cog!

I think I got it now.

Thanks much!

Chris

I am sorry to keep bugging you but I need more help :-( I thought I could figure out the calling sequence based on the PASM example you provided, but I couldn't.

Apparently I only know the very basics of PASM (and spin for that matter) so even though I studied your example for a couple hours I could not figure it out.

I think what could help is if you showed me where in that example you place the two values being operated on.

I probably need more details too, but I just don't know enough to figure out what I don't know :-(

Chris

So, the F32 object is updated again to version 1.2, and there is another file "PASM_demo_F32.spin".

Jonathan

Here's the PASM portion of the example:

I suspect that demo should do the trick!

Chris

Thanks

www.chasingtrons.com (projects and tutorials)

Huh?

@Kye: I assume you mean using a function to encapsulate the "repeat / while f32_Cmd" lines. I left each in their own function for speed purposes, but if anyone wishes to decrease the massive waste of space, that is a good starting point.

Speaking of speed, it turns out that writing the wait loop with "repeat" and "while" on distinct lines is faster than the form in the current version. Not sure why. It didn't seem important enough to bump the revision and upload a v1.3, but if I make any other changes that will be included in the list. If we had conditional compilation in the vanilla Propeller Tool I'd definitely have a smaller vs faster tradeoff flag.

Jonathan

10 bytes, ~3152 clocks

9 bytes, ~3904 clocks

The clocks are approximate as the timing will depend a bit on how the spin repeat loop aligns with when the PASM code is done. Note that going from 10 to 9 bytes means I'd save 1 byte per PUB math function, so approximately 25 bytes, and the wait function takes up 6 bytes. However, going from the current version (1.2) would be 12 down to 9 bytes, so ~70 bytes of savings. My personal leaning is to just upload the faster/smaller version (with repeat and while on distinct lines).
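The byte and clock tradeoff above can be checked with a few lines of arithmetic. The count of 25 PUB math functions is an assumption inferred from "1 byte per PUB math function, so approximately 25 bytes", and all variable names are illustrative.

```python
PUB_FUNCTIONS = 25        # assumed count, inferred from the post
WAIT_FUNCTION_BYTES = 6   # one-time cost of the shared wait function

BYTES_V12 = 12            # per-function wait code in the current version (1.2)
BYTES_INLINE = 10         # repeat and while on distinct lines, ~3152 clocks
BYTES_SHARED = 9          # call to a shared wait function, ~3904 clocks

# Net savings of the shared-function variant, after paying for the function itself.
net_vs_inline = PUB_FUNCTIONS * (BYTES_INLINE - BYTES_SHARED) - WAIT_FUNCTION_BYTES
net_vs_v12 = PUB_FUNCTIONS * (BYTES_V12 - BYTES_SHARED) - WAIT_FUNCTION_BYTES

# Savings of simply switching v1.2 to the inline two-line loop.
inline_vs_v12 = PUB_FUNCTIONS * (BYTES_V12 - BYTES_INLINE)

# Approximate speed cost of the shared wait function per call.
extra_clocks = 3904 - 3152

print(net_vs_inline, net_vs_v12, inline_vs_v12, extra_clocks)
```

Under these assumptions the shared wait function nets about 19 bytes over the inline loop (69 bytes over v1.2, in line with the "~70 bytes" figure) at a cost of roughly 750 clocks per call, while the inline loop alone already recovers about 50 bytes.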

Jonathan
