F32 - Concise floating point code for the Propeller

John Abshier · 2011-03-11 12:06

I was surprised at the magnitude of the speedup. I vote for your suggested change. For me, 70 bytes is not worth the slowdown of a function call.

John Abshier

lonesock · 2011-05-02 07:48

Version 1.3 is now in the OBEX. It contains the faster "repeat" loops, and also fixes a bug in FRound found by John Abshier...THANKS!!

Jonathan

Kye · 2011-05-02 07:58

I'm all about space savings.

@Lonesock: You got the idea.

lonesock · 2011-08-02 15:58

I'm thinking of adding the following 'atof' function to F32. The weak link is that it uses Exp10 to correct the exponent, which in turn uses the prop's internal log/exp tables, so I'll probably need to swap that out for a loop multiplying by 10.0 or 0.1. Can anyone see any bugs, or ways to speed it up?

PUB atof( strptr ) : f | int, sign, dmag, mag, get_exp, b
  ' get all the digits as if this is an integer (but track the exponent)
  ' int := sign := dmag := mag := get_exp := 0
  longfill( @int, 0, 5 )
  repeat
    case b := byte[strptr++]
      "-": sign := $8000_0000
      "0".."9":
           int := int*10 + b - "0"
           mag += dmag
      ".": dmag := -1
      other: ' either done, or about to do exponent
           if get_exp
             ' we just finished processing the exponent
             if sign
               int := -int
             mag += int
             quit
           else
             ' convert int to a (signed) float
             f := FFloat( int ) | sign
             ' should we continue?
             if (b == "E") or (b == "e")
               ' int := sign := dmag := 0
               longfill( @int, 0, 3 )
               get_exp := 1
             else
               quit
  ' Exp10 is the weak link...uses the Log table in P1 ROM
  f := FMul( f, Exp10( FFloat( mag ) ) )

thanks,
Jonathan
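For readers who want to trace the atof logic above off-target, here is a rough Python port. It is a sketch, not lonesock's code: Python floats stand in for FFloat and FMul, the name atof_sketch and its variable names are made up for illustration, and the final Exp10 call is swapped for the multiply-by-10.0-or-0.1 loop that the post suggests as a replacement.

```python
def atof_sketch(text):
    """Hypothetical Python port of the posted Spin atof logic."""
    mantissa = 0         # 'int' in the Spin code
    negative = False     # 'sign'
    dmag = 0             # becomes -1 once "." has been seen
    mag = 0              # decimal exponent correction
    in_exponent = False  # 'get_exp'
    value = 0.0
    for ch in text + "\0":          # Spin strings are zero-terminated
        if ch == "-":
            negative = True
        elif "0" <= ch <= "9":
            mantissa = mantissa * 10 + ord(ch) - ord("0")
            mag += dmag
        elif ch == ".":
            dmag = -1
        else:
            if in_exponent:
                # the digits just read were the exponent
                mag += -mantissa if negative else mantissa
                break
            value = -float(mantissa) if negative else float(mantissa)
            if ch in ("E", "e"):
                mantissa, negative, dmag = 0, False, 0
                in_exponent = True
            else:
                break
    # loop-based exponent correction (assumed stand-in for Exp10)
    while mag > 0:
        value *= 10.0
        mag -= 1
    while mag < 0:
        value *= 0.1
        mag += 1
    return value

print(atof_sketch("-12.5e2"))
```

Like the Spin original, this treats any character other than a digit, "-", ".", "E" or "e" as a terminator. As far as I can tell, that means an explicit "+" in the exponent (as in "1e+5") ends parsing early with an exponent of zero.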

James Newman · 2012-02-13 17:04

About to start using this, and noticed that some of the comments in the asm are from the old Float objects. An example is this line ':execCmd nop ' execute command, which was replaced by getCommand'

Just thought I'd let you know. Great object btw.

[EDIT] Ohh also the subtract function has an unneeded jmp as far as I can tell:

_FSub                   xor     fnumB, Bit31            ' negate B
                        jmp     #_FAdd                  ' add values

_FAdd                   call    #_Unpack2               ' unpack two variables

lonesock · 2012-04-27 13:25

Hi, everyone! Been away for a while, but I'm back ;-)

Here is an update to F32. I'm calling it 1.5, leaving 1.4 for the code fixes that Marty Lawson made in the table interpolation code...THANKS!

Here's what's different:
* fixed table interp (buggy LOG function, and maybe sine?), for certain input values
* FCmp is faster and smaller
* replaced "jmp #label_ret" with "jmp label_ret", saving 4 clocks...thanks kuroneko!
* fixed the PASM dispatch offset table bug (in counting, I think I had skipped 10). You would only have seen this if calling directly from PASM. I can't find a record of who found this! If it was you, please let me know so I can credit you!

That's basically it. I'll leave it here for some testing. If no bugs crop up, I'll update the OBEX.

Thanks, everyone!
Jonathan

MacTuxLin · 2012-04-27 18:18

Thank you Jonathan! I'll be putting this to test on my side too.

SRLM · 2012-05-21 01:11

I tried to make a version of F32 that could be parallel, but in the end it operated faster in a single cog!

I have a bunch of math that could be parallelized, and so I edited F32 a bit to make it so that the wait for a result is in an external function. This requires some new variables and a bit of moving data around. By parallelizing F32, I can have four instances of the object running in four cogs, and have them all crunching numbers at the same time. So here's the multiply function that I came up with:

VAR
	long	func_result, func_a, func_b
PUB FMulPar(a, b)
{{
  Multiplication: result = a * b
  Parameters:
    a        32-bit floating point value
    b        32-bit floating point value
  Returns:   32-bit floating point value
}}
  func_result  := cmdFMul
  func_a := a
  func_b := b  
  f32_Cmd := @func_result 
  
PUB Wait
  
  repeat
  while f32_Cmd

  result := func_result

And for my code that calls it, I basically had something like

	fp.FMulPar(w1,w2)
	fp1.FMulPar(w1,x2)
	fp2.FMulPar(w1,y2)
	fp3.FMulPar(w1,z2)
	
	w := fp.wait
	x := fp1.wait
	y := fp2.wait
	z := fp3.wait

In the end, the parallelization overhead added about 50 microseconds to the total execution time (of the four multiplies above). A slight benefit might be seen for functions that take longer (such as cos/atan/sqrt?).

Any suggestions on how to make it faster? If not, then I might look into porting the interpreter thing from Float32.

Heater. · 2012-05-21 01:36

Every Spin bytecode takes 50 to 100 Prop instructions to execute, and every Spin statement generates lots of bytecodes, so we can guess that the speed of F32 is just lost in the noise here. If I remember correctly, F32 pretty much fills a COG, so you would have to throw out some operations in order to fit the "interpreter thing", which would be a shame.
If I needed speed and floating point, and provided my application was not too big, I would consider writing in C. The code would be much prettier, not having to call functions to perform all the operations, and it would run at about one quarter of native PASM speed.
I'm not sure if the propgcc compiler can make use of F32 "out of the box" but I have used it from the zpugcc compiler with Zog without much difficulty.

Lawson · 2012-05-21 12:46

@SRLM

When I looked through the F32 code to fix the table bug, I noticed that several of the functions are made by chaining a few core functions. E.g. x^y is done as exp(y*ln(x)). These types of functions could be re-implemented using the "interpreter thing" from Float32, possibly freeing up enough space to add an interpreter in? (last I checked F32 only has a couple of longs free)

Lawson
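The chaining Lawson mentions can be illustrated in a couple of lines of Python. This is only a sketch of the identity x^y = exp(y*ln(x)) using Python's math module, not F32's implementation; pow_chained is a made-up name, and the identity as written only holds for x > 0.

```python
import math

def pow_chained(x, y):
    """x**y built from two core functions, exp and ln (requires x > 0)."""
    return math.exp(y * math.log(x))

print(pow_chained(2.0, 10.0))  # close to 1024, up to rounding error
```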

Lawson · 2014-04-24 09:18

It'd be nice if one cog running F32 could be shared between several cogs. My first thought for doing this would be to protect the Spin interface with locks, but that would be slow. My second thought is to have F32 check 3-4 command mailboxes in round-robin fashion (instead of the one box currently used). Assuming that the Spin interface code is taking most of the time, this could be quite fast. For a software interface, I think it'd be simplest to have the extra objects "register" to a given mailbox and then have one object start the F32 cog.

Marty
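To make the round-robin mailbox idea concrete, here is a small Python model of the polling loop. Everything in it is hypothetical: the Mailbox class, the command names and service_once are invented for illustration, Python arithmetic stands in for the F32 cog, and clearing the command field to signal completion is assumed by analogy with the "while f32_Cmd" wait shown earlier in the thread.

```python
import operator

# Hypothetical command set; Python operators stand in for F32 routines.
OPS = {"FAdd": operator.add, "FMul": operator.mul}

class Mailbox:
    """One command slot per client cog (illustrative only)."""
    def __init__(self):
        self.cmd = None
        self.a = 0.0
        self.b = 0.0
        self.result = None

def service_once(mailboxes, start=0):
    """Visit every mailbox once in round-robin order; return how many were served."""
    served = 0
    count = len(mailboxes)
    for i in range(count):
        box = mailboxes[(start + i) % count]
        if box.cmd is not None:
            box.result = OPS[box.cmd](box.a, box.b)
            box.cmd = None  # cleared command tells the client its result is ready
            served += 1
    return served

boxes = [Mailbox() for _ in range(3)]
boxes[0].cmd, boxes[0].a, boxes[0].b = "FMul", 3.0, 4.0
boxes[2].cmd, boxes[2].a, boxes[2].b = "FAdd", 1.5, 2.0
print(service_once(boxes), boxes[0].result, boxes[2].result)  # 2 12.0 3.5
```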

Duane Degn · 2015-02-06 19:24

I'm using two instances of F32 in my hexapod code. F32 is sure a useful object. Thanks for writing it Jonathan.

I was starting to run out of RAM so I wanted to move the PASM section of F32 to the EEPROM and then temporarily move the F32 PASM code (and PASM code from other objects) into the hub prior to launching it into a cog.

I created a modified version of the call table which can exist independent of the PASM code.

I discuss some of the details in this thread. An example of how to save the PASM section to the EEPROM is given as well as the modified call table.

JasonDorie · 2015-08-21 02:04

I think I've found a way to speed up your comparisons and get you a couple longs back.

You do this:

_FCmp                   mov     t1, fnumA               ' if both values...
                        and     t1, fnumB               '  are negative...
                        shl     t1, #1 wc               ' (bit 31 high)...

but I think this has the same effect, does it not?

                        add     fNumA, fNumB  nr, wc

If the highest bit (sign) is set in both values, doing an unsigned add will overflow, setting the carry flag. Specifying 'NR' tells the add not to write its result, so you preserve the value of fNumA.

It's a tiny change (it's only two longs) but I'm trying to squeeze as much into this cog as I can and I thought I'd pass that along.

J

Electrodude · 2015-08-21 17:47

I don't think that will work. A counter example is (fnumA = $FFFF_FFFF) + (fnumB = $0000_0002) = $1_0000_0001. The carry bit gets set in that case, although only one of the two numbers is negative.

How about this? It only saves one long, though.

                        shl     fnumA, #1  wc, nr       ' if fnumA is negative
        if_c            shl     fnumB, #1  wc, nr       ' and fnumB is negative (don't run if fnumA was positive)
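The counterexample can be checked off-target with a few lines of Python. This is only a model of the flag behaviour, assuming ADD with WC sets C on unsigned 32-bit overflow and SHL with WC sets C to the destination's original bit 31; the helper names are made up for illustration.

```python
MASK32 = 0xFFFF_FFFF
SIGN = 0x8000_0000

def add_carry(a, b):
    """Modelled C flag after an unsigned 32-bit add of a and b."""
    return (a & MASK32) + (b & MASK32) > MASK32

def both_negative(a, b):
    """What the original mov/and/shl sequence tests: bit 31 set in both values."""
    return (a & b & SIGN) != 0

def shl_pair(a, b):
    """The two-instruction version: test a, then test b only if C is set."""
    c = (a & SIGN) != 0
    if c:
        c = (b & SIGN) != 0
    return c

a, b = 0xFFFF_FFFF, 0x0000_0002
print(add_carry(a, b), both_negative(a, b), shl_pair(a, b))  # True False False
```

In this model the add sets carry even though only one operand has bit 31 set, while the conditional pair of shifts agrees with the original three-instruction test.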

JasonDorie · 2015-08-21 20:19

D'oh - I didn't think that through far enough apparently. I'll happily use yours.

lonesock · 2015-08-21 20:27

I like Electrodude's solution, though you probably want to either SHL by #0, or include the NR flag.

Jonathan

Electrodude · 2015-08-22 04:52

Right. I'll edit my post to use nr.
