Any ways to speed up Float32?

lonesock · 2010-09-23 18:44

Hmmm, I plotted the errors for sine, and the results are weird and localized, as in it's most likely a programming bug on my part. I will look into this. I should have the same accuracy as the original version. And, of course, the cos and tan code use the same algorithm, so hopefully if I fix one I'll fix them all.

Thanks so much for the testing!

Jonathan

lonesock · 2010-09-24 08:30

I just updated the attachment in the first post. I had a bug in my sine table lookup/interpolation code, which only happened in 2 of the 4 quadrants. Given that error, I just went through all the functions I modified, and checked the values of my code and the original code against Excel's authoritative answer key. Results:

Same error, but faster: sin, cos, tan, sqrt, mul, div, log, exp

Less error: ATan2, ATan, ASin, ACos

There were no code changes to the other functions (that I remember). So, basically for the arc* routines, the CORDIC routine is both faster and more accurate than the 5th order polynomial approximation (and the needed sqrt(1-x*2) conversions pre and post).

Thanks again for kicking over this issue, John!

Jonathan

lonesock · 2010-09-27 13:08

New version uploaded. I changed the name to F32 (where the brevity is _supposed_ to imply smaller code). The CORDIC routine is faster for all the arc* stuff (used a longer table but shortened the inner loop so it's faster). The table interpolation code is a bit faster, so all the log/pow/trig functions should be a bit faster, and Tan is faster in that I reused a portion of the processing results from _Sin when computing _Cos.

I did implement a full CORDIC version of Sin/Cos/Tan, and it did deliver an order of magnitude accuracy improvement, but the speed was ~3x slower (also slowed down the arc* stuff, as I had to share the CORDIC code to get it to fit in-cog), so I pulled that code back out. If somebody wants a high-precision F32, I can make that an option, but would require more testing, and I would want to convert the Pow/Log stuff to CORDIC as well. Any *need* for this? I'm not really inclined to go much farther on it.

Finally, I have not implemented the user-defined function mechanism. Float32Full has it, and I was wondering if anybody needs and/or wants it. It seems a reasonably useful thing to have, and I'm inclined to use the remaining 26 longs in-cog to yield this functionality...any votes? (Note that the exact mechanism will probably have to change a bit.)

Jonathan

Heater. · 2010-09-28 01:47

Lonesock,

More amazing progress.

One thing I noticed missing is a rounding that returns a float as appears in the C maths library.

float roundf(float x);

I can make one from cmdFTruncRound and cmdFFloat but it is slow.

Any chance you can convert F32.spin to normal ASCII so that simple text editors don't throw up?

Zog votes for speed over accuracy. If a C user wants the accuracy with Zog they can use the normal math library sin/cos/tan and take the speed hit.

No opinion on user defined functions. Does anybody use them?

lonesock · 2010-09-28 10:49

OK, FloatTrunc and FloatRound added, ASCII only, CRLF endings (so notepad likes it).

Jonathan

jazzed · 2010-09-28 10:55

Nice work Jonathan. Can't wait to use it with ZOG.

John Abshier · 2010-09-28 12:52

What is the difference between FTrunc and FloatTrunc? Comments say they do the same thing, but b has a different value.

John Abshier

lonesock · 2010-09-28 13:17

Thanks, all. Sorry, John...copy-n-paste laziness has struck again. [8^) the FTrunc & FRound versions return a whole number as a signed integer. FloatTrunc and FloatRound return the whole number in floating-point format. Comments have now been updated...sorry again.

Jonathan

Heater. · 2010-09-28 13:28

I hope the comments are wrong.

One of them should return the integer result as an actual integer type, the other should return it in floating point format.

Not sure if we need them both.
The GCC soft-float calls are need are:

int __fixsfsi (float a) - convert a to a signed integer, rounding toward zero

unsigned int __fixunssfsi (float a) - convert a to an unsigned integer, rounding toward zero. Negative values all become zero.

The C standard math functions are:

float truncf(float x) - round x to the nearest integer not larger in absolute value.

float roundf(float x) - round x to the nearest integer, but round halfway cases away from zero

lonesock · 2010-09-28 14:18

Heater, right now the only thing you are missing is the "unsigned int __fixunssfsi (float a)" functionality...do you need this? (You can get most of the way there with FTrunc followed by a PASM min statement, but of course you lose that top bit of your range.)

Jonathan

Heater. · 2010-09-28 16:51

Does Zog need __fixunssfsi?

Possibly yes:

Using the ZPU soft float implementation of __fixunssfsi takes 116 zpu instructions which make 4 calls. So it's big and slow.

That top bit of range seems quite important.

Just now I have no way to get any PASM in the path. As I'm just using C code to read/write the memory interface LONGs.

Just for fun here is the ZPU softfloat __fixunssfsi:

__fixunssfsi():
    193d:    8c              im 12
    193e:    08              load
    193f:    02              pushsp
    1940:    8c              im 12
    1941:    0c              store
    1942:    fd              im -3
    1943:    3d              pushspadd
    1944:    0d              popsp
    1945:    81              im 1
    1946:    f2              im -14
    1947:    0a              flip
    1948:    0b              nop
    1949:    83              im 3
    194a:    3d              pushspadd
    194b:    0c              store
    194c:    8c              im 12
    194d:    08              load
    194e:    88              im 8
    194f:    05              add
    1950:    08              load
    1951:    02              pushsp
    1952:    84              im 4
    1953:    05              add
    1954:    0c              store
    1955:    e0              im -32
    1956:    e2              im -30
    1957:    3f              callpcrel
    1958:    80              im 0
    1959:    08              load
    195a:    53              storesp 12
    195b:    72              loadsp 8
    195c:    80              im 0
    195d:    25              lessthanorequal
    195e:    83              im 3
    195f:    38              neqbranch
    1960:    ae              im 46
    1961:    39              poppcrel

00001962 <.L3>:
.L3():
    1962:    81              im 1
    1963:    f3              im -13
    1964:    0a              flip
    1965:    0b              nop
    1966:    83              im 3
    1967:    3d              pushspadd
    1968:    0c              store
    1969:    8c              im 12
    196a:    08              load
    196b:    88              im 8
    196c:    05              add
    196d:    08              load
    196e:    02              pushsp
    196f:    84              im 4
    1970:    05              add
    1971:    0c              store
    1972:    db              im -37
    1973:    d1              im -47
    1974:    3f              callpcrel
    1975:    80              im 0
    1976:    08              load
    1977:    53              storesp 12
    1978:    72              loadsp 8
    1979:    02              pushsp
    197a:    84              im 4
    197b:    05              add
    197c:    0c              store
    197d:    dd              im -35
    197e:    c5              im -59
    197f:    3f              callpcrel
    1980:    80              im 0
    1981:    08              load
    1982:    81              im 1
    1983:    0a              flip
    1984:    11              addsp 4
    1985:    70              loadsp 0
    1986:    8c              im 12
    1987:    08              load
    1988:    fc              im -4
    1989:    05              add
    198a:    0c              store
    198b:    51              storesp 4
    198c:    53              storesp 12
    198d:    96              im 22
    198e:    39              poppcrel

0000198f <.L2>:
.L2():
    198f:    8c              im 12
    1990:    08              load
    1991:    88              im 8
    1992:    05              add
    1993:    08              load
    1994:    02              pushsp
    1995:    84              im 4
    1996:    05              add
    1997:    0c              store
    1998:    dd              im -35
    1999:    aa              im 42
    199a:    3f              callpcrel
    199b:    80              im 0
    199c:    08              load
    199d:    70              loadsp 0
    199e:    8c              im 12
    199f:    08              load
    19a0:    fc              im -4
    19a1:    05              add
    19a2:    0c              store
    19a3:    53              storesp 12

000019a4 <.L1>:
.L1():
    19a4:    8c              im 12
    19a5:    08              load
    19a6:    fc              im -4
    19a7:    05              add
    19a8:    08              load
    19a9:    80              im 0
    19aa:    0c              store
    19ab:    85              im 5
    19ac:    3d              pushspadd
    19ad:    0d              popsp
    19ae:    8c              im 12
    19af:    0c              store
    19b0:    04              poppc

lonesock · 2010-09-29 10:14

OK, I just uploaded a new version with an added function: UintTrunc. It returns an unsigned int from the floating point input. (I'm still not sure how Zog will handle this, as I thought all Zog variables were treated as signed ints, but that's probably much easier for GCC to workaround than the semi-hideous cast code you showed me [8^).

There are only 4 longs free in the cog right now...some more can be saved by re-slowing some of the functions, but unless more functionality needs to be crammed it there, we might be at the release point (after some serious testing).

Speaking of release, I have a question (especially if any mods are watching this thread): What is the best way to publish this? I could stick it in the OBEX, but it might be nice to bundle it with the Propeller Tool (and maybe replace the Float32/Float32All code once it get's approved...any way to reach of Cam?).

thanks for your input,
Jonathan

Mike Green · 2010-09-29 10:23

I'd start by posting it to the OBEX. That way it'd be available in a formal fashion in the same place as Float32.

You might use Float32's documentation as a model to document your version, copy what you can, change what needs to be changed, and leave out what doesn't apply.

Heater. · 2010-09-29 11:26

lonesock,

You are right, the ZPU is tuned for signed 32 bit ints.

I have noticed that zpu-gcc generates some pretty bloaty code when using unsigned ints. For example it will insert long winded routines for unsigned mod and div operators. I have not really explored this fully.

This is something we just have to live with. It could be tackled by adding yet another COG to handle all those GCC generated routines instead. Or they could be handled by LMM PASM by the Zog GOG. I'm really not into doing either of those.

So it is possible I have put you to the trouble of creating UIntTrunc for nothing, that is to say the benefit it gives to Zog is overshadowed by all the other bloat that using unsigned its entails. My apologies if so.

Now back to integrating and testing...

lonesock · 2010-09-29 16:51

Hey, Heater, no worries, I'm just glad everything fit into a single cog!

And thanks, Mike. Once all the features are in, I'll polish up the documentation, call for final testing, then post it to the OBEX.

Jonathan

Heater. · 2010-10-02 15:38

I've just been testing the C soft-float function __fixunssfsi which is using the cmdUintTrunc command from F32.

Now as far as I can tell it works as per the GCC soft float documentation "... convert a to an unsigned integer, rounding toward zero. Negative values all become zero. "

But I found that the using the zpu-gcc softfloat version cause my test to fail. More strangely calling __fixunssfsi on my Debian Linux box also fails!!!

Here is a cut down version of my test which fails on my PC:

#include <stdio.h>

void passif(char* test, int status)
{
  printf("%s ", test);
  if (status)
    printf ("OK\n");
  else
    printf ("FAIL\n");
}

int main (int c, char* argv[])
{
  float t;
  unsigned int u;

  // Test __fixunssfsi
  t = 147.6;
  u = __fixunssfsi(t);
  passif ("__fixunssfsi (147.6)", u == 147); 
  printf("%u\n", u);
  
  t = -147.6;
  u = __fixunssfsi(t);
  passif ("__fixunssfsi (-147.6)", u == 0); 
  printf("%u\n", u);

  return(0);
}

and here is the result:

__fixunssfsi (147.6) FAIL
2
__fixunssfsi (-147.6) FAIL
2

Looking at some assembler produced when compiling with -msoft-float I can see that __fixunssfsi is indeed what GCC inserts to assign a float to an unsigned it.

Anyone got any idea what goes on here?

lonesock · 2010-10-05 10:47

That is bizarre...any updates on this issue?

Jonathan

Heater. · 2010-10-05 11:06

Lonesock,

Not yet. I've been stuck for ages tracking down a total Zog failure which now looks like it's down to the zpu-gcc optimizer screwing up on -Os (optimise for size). Hope I can start to move forward again now.

lonesock · 2010-10-05 11:19

Gotcha. Out of curiosity, are you using Zylin's "latest stable" or "latest snapshot" (or something else entirely) version of the toolchain?

Jonathan

Heater. · 2010-10-05 11:41

Don't remember. It was a source snapshot from a while back I think. I'm doing a git clone at this moment.

Heater. · 2010-10-05 13:38

No luck with the latest zpu-gcc still fails with -Os.

Lawson · 2012-01-26 11:30

lonesock wrote: »

I did implement a full CORDIC version of Sin/Cos/Tan, and it did deliver an order of magnitude accuracy improvement, but the speed was ~3x slower (also slowed down the arc* stuff, as I had to share the CORDIC code to get it to fit in-cog), so I pulled that code back out. If somebody wants a high-precision F32, I can make that an option, but would require more testing, and I would want to convert the Pow/Log stuff to CORDIC as well. Any *need* for this? I'm not really inclined to go much farther on it.

Jonathan

I'd really love a CORDIC log function. I'm using F32 with high precision thermistor measurements and I'm linearising with the Stienhart-Hart equation. I've got at least 16 stable bits in my measurements so getting the full ~23-bits of precision of a 32-bit float is important to keep rounding errors small.

Also, I found a bug in Log. It's highly localized to numbers between 4095 and 4096 but the error is large. (input is almost 2^12... some issue with log-table wrap around?)

input                 F32 result         Matlab result
4095.761		7.791079		8.317707815
4095.466		7.995431		8.317635787
4095.135		8.224144		8.317554963
4095.196		8.18219 		8.317569858
4095.934		7.670633		8.317750053
4095.947		7.662513		8.317753227
4095.066		8.272864		8.317538113
4095.27 		8.130764		8.317587928
4095.477		7.987311		8.317638473
4095.596		7.904758		8.317667529
4095.745		7.801905		8.317703909
4095.971		7.646273		8.317759087
4095.674		7.850625		8.317686574
4095.308		8.105051		8.317597207
4095.651		7.866865		8.317680958
4095.977		7.640859		8.317760551

Lawson

Lawson · 2012-01-31 12:50

Since Lonesock seems to be missing from the internet, I went ahead and fixed the LOG function.

As I suspected, the problem was with the table interpolation code. It did not correctly deal with the second table address exceeding the bounds of the table. The fix was made more complex because the LOG and ALOG tables are 2048 entries saturating at $1_0000, while the SINE table is 2049 entries saturating at $FFFF. The fix uses 3 instructions and 2 masks using up all the free space in the F32 cog.

The attached .zip file contains my test program that I used to verify that LOG was fixed, without messing up the accuracy of SIN.
The attached F32.spin contains my fixes. (along with a version update note, and comments on my fixes)

Is there any way to get the first post of this thread and OBEX entries for F32.spin updated with these fixes?

Lawson

lonesock · 2012-01-31 12:55

Awesome! My son is due any day now, so it will be a little while before I do the update, but this will definitely go in...thanks!

Jonathan

Duane Degn · 2012-01-31 13:16

Thanks Lawson and Jonathan,

I use F32 a lot. I just replaced my earlier version with this new one.

@Jonathan, I hope all goes well on your son's birthday.

Duane Degn · 2012-12-14 22:10

I've been working on some inverse kinematic code for my hexapod and kept getting a glitch across certain servo angles. I've tracked it down to what looks like a bug in F32.

Here's the output from a little test program I wrote.

Tan2(31, 24.5) | F32 = 0.7548) | Float32Full = 0.902
Tan2(31.1, 24.5) | F32 = 0.7548) | Float32Full = 0.9035
Tan2(31.2, 24.5) | F32 = 0.7548) | Float32Full = 0.9051
Tan2(31.3, 24.5) | F32 = 0.7548) | Float32Full = 0.9067
Tan2(31.4, 24.5) | F32 = 0.7548) | Float32Full = 0.9082
Tan2(31.5, 24.5) | F32 = 0.7548) | Float32Full = 0.9098
Tan2(31.6, 24.5) | F32 = 0.7548) | Float32Full = 0.9113
Tan2(31.7, 24.5) | F32 = 0.7548) | Float32Full = 0.9128
Tan2(31.8, 24.5) | F32 = 0.7548) | Float32Full = 0.9143
Tan2(31.9, 24.5) | F32 = 0.7548) | Float32Full = 0.9159
Tan2(31, 24.4) | F32 = 1.005) | Float32Full = 0.904
Tan2(31.1, 24.4) | F32 = 0.7548) | Float32Full = 0.9055
Tan2(31.2, 24.4) | F32 = 0.7548) | Float32Full = 0.9071
Tan2(31.3, 24.4) | F32 = 0.7548) | Float32Full = 0.9086
Tan2(31.4, 24.4) | F32 = 0.7548) | Float32Full = 0.9102
Tan2(31.5, 24.4) | F32 = 0.7548) | Float32Full = 0.9117
Tan2(31.6, 24.4) | F32 = 0.7548) | Float32Full = 0.9133
Tan2(31.7, 24.4) | F32 = 0.7548) | Float32Full = 0.9148
Tan2(31.8, 24.4) | F32 = 0.7548) | Float32Full = 0.9163
Tan2(31.9, 24.4) | F32 = 0.7548) | Float32Full = 0.9178
Tan2(31, 24.3) | F32 = 1.005) | Float32Full = 0.906
Tan2(31.1, 24.3) | F32 = 0.7548) | Float32Full = 0.9075
Tan2(31.2, 24.3) | F32 = 0.7548) | Float32Full = 0.9091
Tan2(31.3, 24.3) | F32 = 0.7548) | Float32Full = 0.9106
Tan2(31.4, 24.3) | F32 = 0.7548) | Float32Full = 0.9122
Tan2(31.5, 24.3) | F32 = 0.7548) | Float32Full = 0.9137
Tan2(31.6, 24.3) | F32 = 0.7548) | Float32Full = 0.9153
Tan2(31.7, 24.3) | F32 = 0.7548) | Float32Full = 0.9168
Tan2(31.8, 24.3) | F32 = 0.7548) | Float32Full = 0.9183
Tan2(31.9, 24.3) | F32 = 0.7548) | Float32Full = 0.9198
Tan2(31, 24.2) | F32 = 1.005) | Float32Full = 0.908
Tan2(31.1, 24.2) | F32 = 1.005) | Float32Full = 0.9095
Tan2(31.2, 24.2) | F32 = 0.7548) | Float32Full = 0.9111
Tan2(31.3, 24.2) | F32 = 0.7548) | Float32Full = 0.9126
Tan2(31.4, 24.2) | F32 = 0.7548) | Float32Full = 0.9142
Tan2(31.5, 24.2) | F32 = 0.7548) | Float32Full = 0.9157
Tan2(31.6, 24.2) | F32 = 0.7548) | Float32Full = 0.9172
Tan2(31.7, 24.2) | F32 = 0.7548) | Float32Full = 0.9188
Tan2(31.8, 24.2) | F32 = 0.7548) | Float32Full = 0.9203
Tan2(31.9, 24.2) | F32 = 0.7548) | Float32Full = 0.9218
Tan2(31, 24.1) | F32 = 1.005) | Float32Full = 0.91
Tan2(31.1, 24.1) | F32 = 1.005) | Float32Full = 0.9115
Tan2(31.2, 24.1) | F32 = 1.005) | Float32Full = 0.9131
Tan2(31.3, 24.1) | F32 = 0.7548) | Float32Full = 0.9146
Tan2(31.4, 24.1) | F32 = 0.7548) | Float32Full = 0.9162
Tan2(31.5, 24.1) | F32 = 0.7548) | Float32Full = 0.9177
Tan2(31.6, 24.1) | F32 = 0.7548) | Float32Full = 0.9192
Tan2(31.7, 24.1) | F32 = 0.7548) | Float32Full = 0.9208
Tan2(31.8, 24.1) | F32 = 0.7548) | Float32Full = 0.9223
Tan2(31.9, 24.1) | F32 = 0.7548) | Float32Full = 0.9238

You can see how Float32Full has a nice gradual change in the angle from the function while F32 has a rather sharp jump.

Here's my test program:

CON
  _CLKMODE = XTAL1 + PLL16X 
  _XINFREQ = 5_000_000
 
OBJ
 
  F32 : "F32V1_3"
  FloatFull : "Float32Full"
  Pst : "Parallax Serial Terminal"
  FloatStr : "FloatString"
    
PUB Setup | localIndex
                                                                              
  Pst.Start(115200)                                                      
  F32.start                                                                  
  FloatFull.start                                                              
  FloatStr.SetPrecision(4)
                                              
  MainLoop
   
PUB MainLoop | index1, index2
    
  'Pst.str(string(13, "Start of MainLoop Method "))
  index1 := 25.0 
  repeat
    index2 := 31.0
    repeat 10
      Pst.str(string(13, "Tan2("))
      Pst.str(FloatStr.FloatToString(index2))
      Pst.str(string(", "))
      Pst.str(FloatStr.FloatToString(index1))
      Pst.str(string(") | F32 = "))
      result := F32.ATan2(index2, index1)
      Pst.str(FloatStr.FloatToString(result))
      Pst.str(string(") | Float32Full = "))
      result := FloatFull.ATan2(index2, index1)
      Pst.str(FloatStr.FloatToString(result))
      index2 := F32.FAdd(index2, 0.1)
    index1 := F32.FSub(index1, 0.1)
    Pst.charin

I usually use version 1.5 posted in this thread. Version 1.3 from the OBEX gives the same results.

The "25" and "31" in the above test just happen to be the numbers I was using when I noticed this problem.

I'll attempt to understand what the ATan2 method is doing, but I'm afraid this fix if beyond me.

I really like how F32 only uses one cog and I'm hoping I don't have to resort to using the two cog Float32Full.

If Jonathan or someone else can fix this bug, I'd really appreciate it.

Thanks,

Duane Degn · 2012-12-15 08:09

In an attempt to give better test numbers, I changed the test program to use "1.0" as the "b" value in the ATan2 calls. By using "1.0" (essentially turning it into an ATAN call, it returned expected (correct) results.

It also appears to work correctly using "-1.0" for the second argument.

For now, I'll just just divide both arguments by the absolute value of the "b" term.

Duane Degn · 2012-12-15 08:58

Here's my ugly patch of the ATan2 method.

Edit: The patch doesn't check for a zero "b" term. This is not a good fix. I'll try to fix my patch.
See post #92 below for a better patch.

PUB ATan2(a, b)
{{
  Arc Tangent of vector a, b (in radians, no division is performed, so b==0 is legal).
  Parameters:
    a        32-bit floating point value
    b        32-bit floating point value
  Returns:   32-bit floating point value (angle in radians)
}}
 [B] result := b & $7FFF_FFFF      ' absolute value
  if b == result
    b := 1.0
  else
    b := -1.0  
  a := FDiv(a, result)
[/B]result  := cmdATan2
  f32_Cmd := @result
  repeat
  while f32_Cmd

Testing to see if the absolute value of b equals b is 1632 clocks faster than just dividing b by the absolute value of b.

I hope Jonathan doesn't mind my posting a patched version of F32 here until it can be fixed properly.

I also have a modified version of F32 that allows it to be used from multiple objects and multiple cogs. In order to do so, I added a lock. I'm sure the lock slows things down a bit so I wouldn't suggest using my modified version unless you need it.

Edit: I haven't tested the multi cog version well enough. I suspect it has problems so I removed it.
Edit(12/17/12): I've also removed the modified F32 object from this post. A better modified version is attached to post #92.

Duane Degn · 2012-12-15 11:13

It looks like one person has downloaded my supposed "fix". My patch doesn't check for a zero "b" term. I'll try to fix my patch soon, but be warned the current version doesn't handle all possible values correctly.

SRLM · 2012-12-16 22:59

Duane Degn wrote: »

It looks like one person has downloaded my supposed "fix". My patch doesn't check for a zero "b" term. I'll try to fix my patch soon, but be warned the current version doesn't handle all possible values correctly.

I'm interested

I have a version of F32 that I'm getting ready to post that has a built in "interpreter" to do sequences of float operations at PASM speed without needing the Spin calling cog. However, it relies on the fact that all of the F32 commands are standardized, and don't have extra functionality outside of the PASM.