Any ways to speed up Float32?

M. K. Borri · 2010-09-11 14:03

Hm... good point. Can I play with it?

Heater. · 2010-09-11 23:15

Lonesock,

Had a quick peek at your code and a snooze, woke up with the following idea:

Your dispatch table is cunning. Brilliant, in fact: it solves the problem with my technique that we have been discussing. With what you have there it would now be possible to do away with the command lookup in getCommand and save 10 LONGs in the main loop:)

1) Delete all the command definitions in CON and use those command names to label the entries in your dispatch table.

2) Rearrange the mail box interface to three longs:
command, oper1, oper2

Then in the Spin side we use something like:

oper1 := something 
oper2 := somethingelse
command := FAddCmd   'Sets command to the required PASM call instruction
repeat while command

Then replace the getCommand loop with something that has no dispatch table look up but uses the command (now a complete call instruction) directly:

DAT
'PASM CODE starts here
enter       'Set up oper1_addr,
            'oper2_addr and command_addr
            'from PAR here
            '.....
            'N.B. Any init code placed here can double up as variables
            'and so wastes no space.
            '.....
getCommand  rdlong  :execCmd, command_addr wz 'Wait for command
       if_z jmp     #getCommand

            rdlong  fnumA, oper1_addr         'Load fnumA
            rdlong  fnumB, oper2_addr         'Load fnumB

:execCmd    nop                               'Call command with call instruction placed
                                              'here by getCommand
:finishCmd  wrlong  fnumA, oper1_addr         'Return result
            wrlong  Zero, command_addr        'Clear command status (call ins)
            jmp     #getCommand               'Wait for next command

'------------------------------------------------------------------------------
'...
' Bla bla bla, the rest of Float32 here
'...
'------------------------------------------------------------------------------
            fit     496

'Command definitions (used by Spin), command values are actual call instructions
FAddCmd     call    #_FAdd
FSubCmd     call    #_FSub
FMulCmd     call    #_FMul
FDivCmd     call    #_FDiv
FFloatCmd   call    #_FFloat
FTruncCmd   call    #_FTrunc
FRoundCmd   call    #_FRound
FSqrCmd     call    #_FSqr
FCmpCmd     call    #cmd_FCmp
SinCmd      call    #_Sin
CosCmd      call    #_Cos
TanCmd      call    #_Tan
LogCmd      call    #_Log
Log10Cmd    call    #_Log10
ExpCmd      call    #_Exp
Exp10Cmd    call    #_Exp10
PowCmd      call    #_Pow
FracCmd     call    #_Frac
'------------------------------------------------------------------------------

Note: there is a little init section now to set up command and opers from PAR on start up. This area can be overlaid with vars and so takes no space.

Note: The command call instruction list is outside of the FIT and so takes no space in COG.

Note: Because the dispatch list is at the end of the PASM, C code in ZOG that uses the PASM extracted as a binary blob can find the required command constants at the end of the blob. This provides the required "linkage".

Note: This requires changing all the Spin interface routines and slowing them down a bit. Might be worth it for those 10 LONGs though.
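
A minimal host-side sketch in Python (not PASM or Spin) of the mailbox scheme described above. Everything here is an illustrative assumption: the `Mailbox` class, `service_once`, `call`, and the use of Python callables standing in for the `call #_FAdd` style instruction longs. What it tries to show is that the command word is itself the thing that gets executed, so the service loop needs no table lookup.

```python
import operator

class Mailbox:
    """Hypothetical stand-in for the three hub longs: command, oper1, oper2."""
    def __init__(self):
        self.command = None   # None plays the role of 0 (idle)
        self.oper1 = 0.0
        self.oper2 = 0.0

# Stand-ins for the call-instruction longs (FAddCmd, FSubCmd, ...).
# In the real scheme each would be an encoded "call #_FAdd" instruction.
FAddCmd = operator.add
FSubCmd = operator.sub
FMulCmd = operator.mul
FDivCmd = operator.truediv

def service_once(box):
    """One pass of the getCommand loop; returns True if a command ran."""
    exec_cmd = box.command            # rdlong :execCmd, command_addr wz
    if exec_cmd is None:              # if_z jmp #getCommand
        return False
    a, b = box.oper1, box.oper2       # rdlong fnumA / fnumB
    box.oper1 = exec_cmd(a, b)        # the :execCmd slot runs the command directly
    box.command = None                # wrlong Zero, command_addr
    return True

def call(box, cmd, a, b):
    """What the Spin side would do: fill operands, post command, wait."""
    box.oper1, box.oper2 = a, b
    box.command = cmd
    while box.command is not None:    # repeat while command
        service_once(box)             # on hardware the cog runs concurrently
    return box.oper1

box = Mailbox()
print(call(box, FAddCmd, 1.5, 2.25))  # 3.75
```

On the real chip the Spin caller and the cog run in parallel; the model simply drives the cog side from inside the wait loop.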

lonesock · 2010-09-12 11:11

Got it, great optimization! I will work on this tomorrow. I do plan on changing the calling convention as follows...please let me know if it will work for Zog:

The Spin code has 2 longs back to back: F32CmdCall & F32ResultPtr. You would call a function as follows:

F32ResultPtr := @result
F32CmdCall := F32AddCmd
repeat while F32CmdCall

In Spin, the return value is a long, just in front of the input calling parameters. So a wrlong to the address in F32ResultPtr will save my result into the calling function's result value automatically. And a rdlong from the (address in F32ResultPtr)+4 gets me the 1st parameter to the calling Spin function, and +8 gets me the second parameter, if any.

So, still only 2 assignments to call, which should be fairly close to 1 assignment and 1 addition. In Zog's case, you just need 3 Hub longs in a row: result, op1, op2.
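
A small Python model of that layout, as a sketch only. The toy `hub` list, the helper names (`rdlong`, `wrlong`, `f2l`, `l2f`, `handle_fadd`) and the frame address are assumptions made for illustration; the only thing taken from the post is that result, first parameter, and second parameter are three consecutive longs and that the cog receives the address of the result.

```python
import struct

def f2l(x):
    """Pack a Python float into the 32-bit pattern a hub long would hold."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def l2f(bits):
    """Interpret a 32-bit pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

hub = [0] * 16                      # toy hub RAM, one list entry per long

def rdlong(addr):
    return hub[addr // 4]

def wrlong(value, addr):
    hub[addr // 4] = value & 0xFFFFFFFF

def handle_fadd(result_ptr):
    """What the cog would do given the address stored in F32ResultPtr."""
    a = l2f(rdlong(result_ptr + 4))      # first parameter
    b = l2f(rdlong(result_ptr + 8))      # second parameter
    wrlong(f2l(a + b), result_ptr)       # result long sits just before them

# Assumed frame at byte address 20: result, op1, op2 in a row.
frame = 20
wrlong(0, frame)
wrlong(f2l(1.5), frame + 4)
wrlong(f2l(2.25), frame + 8)
handle_fadd(frame)
print(l2f(rdlong(frame)))                # 3.75
```

The same three-long block is all a non-Spin caller such as Zog would need to provide.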

Jonathan

Heater. · 2010-09-12 11:57

Lonesock,

"the return value is a long, just in front of the input calling parameters."

Ouch, I had to think about that for a minute. Normally my mind reels at such solutions. Dependent on unwritten features of the tools, order of parameters, position of return results etc.

However, as the Spin interpreter is cast in silicon why not go with it?

As for Zog, as long as we have 3 longs in a row we are in business.

lonesock · 2010-09-13 12:06

Yep, it's a bit scary. As I understand it, when performing a subroutine call Spin pushes a 0 on the stack as the return value (which is why 'result' is already initialized), then pushes the parameters on as well. The thing is, now that my "dispatch table" consists of the full "call #some_address" long, there is no room to specify both the command and the address of the parameters, so I need a bit of extra code (including an extra rdlong) inside the call handler.

I'm going to update the attachment in the 1st post in just a minute. Here's how the call handler looks now:

f32_entry               ' zero out the command register
                        mov     t1, #0
                        wrlong  t1, par                 ' clear command status

                        ' try to keep 2 or fewer instructions between rd/wrlong
getCommand              rdlong  :execCmd, par wz        ' wait for command to be non-zero, and store it in the call location
              if_z      jmp     #getCommand

                        rdlong  ret_ptr, ret_ptr_ptr
                        add     ret_ptr, #4

                        rdlong  fNumA, ret_ptr
                        add     ret_ptr, #4

                        rdlong  fNumB, ret_ptr
                        sub     ret_ptr, #8

:execCmd                nop                             ' execute command, which was replaced by getCommand

:finishCmd              wrlong  fnumA, ret_ptr          ' store the result
                        jmp     #f32_entry              ' wait for next command

If anyone has a way to speed up or shorten this, I would appreciate it (note that defining a long named Zero helps neither the speed (2 instructions between rdlongs) nor the size (as Zero itself takes up a long)). However, with this stuff and a few other optimizations in place, we now have 71 longs free in the Cog!

Jonathan

lonesock · 2010-09-13 15:28

FFloat is quite a bit shorter (we now have 80 longs free in the Cog), but I am holding off on uploading this version for now because...

Has anyone tried using FTrunc or FRound? The functions inside the unmodified Float32 don't call _Pack again after manipulating bits, and I'm getting weird answers. Can anyone verify for me?

Jonathan

John Abshier · 2010-09-13 15:44

FTrunc worked for the following test program.

CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000
  #1, HOME, #8, BKSP, TAB, LF, CLREOL, CLRLB, CR, #16, CLS      ' PST formatting control
OBJ
  Sio  : "FullDuplexSerialPlus"        
  F: "Float32"
Pub Main | i, a, b
  Sio.start(31,30,0,115200)  ' Rx,Tx, Mode, Baud
  repeat until Sio.rxCheck <> -1                        ' wait for PST 
  F.Start
  a := -2.25
  b := 0.25
  repeat 21
    a :=  F.FAdd(a,b) 
    sio.dec(F.FTrunc(a))
    sio.tx(13)

John Abshier

John Abshier · 2010-09-13 15:49

Changed F.FTrunc to F.FRound in program above. Worked OK.

John Abshier

lonesock · 2010-09-13 16:46

Ah, I see my mistake. For some reason I was expecting the output to _be_ a float, and was using FloatString to aid me in my mistake. OK, back to the trenches.

thanks,
Jonathan

lonesock · 2010-09-13 19:46

Re: Trunc and Round, try them with a really large number: e.g.

FTrunc( 323258300000.0 ) truncates to 75.

Jonathan

lonesock · 2010-09-13 20:11

Trunc and Round now do the right thing for large numbers (returning with Maxint if they can't keep up, and preserving the sign).
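
A sketch in Python of the saturating behaviour described here, working on the raw IEEE-754 single bit pattern. This is not the F32 PASM code; the function name `ftrunc`, the constant `MAXINT = $7FFF_FFFF`, and the choice to return 0 for magnitudes below 1 are assumptions chosen to match the description (clamp to Maxint, keep the sign).

```python
import struct

MAXINT = 0x7FFFFFFF

def ftrunc(x):
    """Truncate a single-precision value toward zero, saturating at +/-MAXINT."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = -1 if bits >> 31 else 1
    exponent = ((bits >> 23) & 0xFF) - 127        # unbiased exponent
    mantissa = (bits & 0x7FFFFF) | 0x800000       # restore the implied 1
    if exponent < 0:
        return 0                                  # |x| < 1 (also zero, denormals)
    if exponent >= 31:
        return sign * MAXINT                      # will not fit in 32 bits
    shift = exponent - 23                         # mantissa carries 23 fraction bits
    magnitude = mantissa << shift if shift >= 0 else mantissa >> -shift
    return sign * magnitude

print(ftrunc(2.75))               # 2
print(ftrunc(-2.25))              # -2
print(ftrunc(323258300000.0))     # 2147483647
```

Without the `exponent >= 31` check, the shift count for a value like 323258300000.0 exceeds what a 32-bit register can hold, which is the kind of case that produces a meaningless small result.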

83 longs available now, I'm almost ready to start shoehorning in some more functionality.

Jonathan

Heater. · 2010-09-13 22:26

Lonesock,

83 !

Astounding piece of work.

That's nearly enough to put all the missing functions from Float32A in there. Zog is going to love this. And pretty much everyone else.

I presume you have removed the redundant log functions already.

If everything does not quite fit it might be time to put all the code in there and have some #ifdefs around so that users can select which functions they may need for a particular application.

Not sure I follow your new Spin interface set up yet. Need some morning coffee.

lonesock · 2010-09-14 16:20

OK, we now have 110 longs free in the Cog!! I will update the attachment in the 1st post.

I would love to have some volunteer testing on this version. I had to rewrite the sin/cos/log/exp stuff to all use the same table lookup / interpolation code. (The original version of Float32 had some issues with large values for exp, both positive and negative.)

Other than looking at pow and compare, I'm done for now. If anyone has any feedback / issues, please let me know. The next step is to start adding functionality.

Jonathan

Heater. · 2010-09-15 01:26

"110 longs free in the Cog!"

What?!

Oh yeah, I forgot, this is Propeller Land where impossible things happen every day.

I'll get on with some testing ASAP.

Heater. · 2010-09-15 02:24

lonesock,

Zog has a little problem with ret_ptr_ptr.

ret_ptr_ptr is a LONG somewhere in DAT that is set up by Spin prior to launching the COG.

C or C++ code has no "linkage" with PASM like Spin does, so C/C++ does not know where ret_ptr_ptr is in order to set it. This makes creating a C/C++ wrapper for float32 a bit tricky.

Is it possible to get ret_ptr_ptr passed in at start up via PAR instead?

P.S.

Not sure if you are aware of this but the procedure I use is to compile Spin objects with BSTC with an option that outputs only the compiled PASM as a binary blob throwing away all the Spin stuff. This blob is then converted to an object file and linked in with the C code. Then a C wrapper is created for it replacing the Spin interface.

In this way it is possible to create C++ objects that mirror the original Spin objects.

This does not work so well if there are locations in the blob that need values "poking" in before it is started.

Heater. · 2010-09-15 03:10

lonesock,

You can save another two longs in your version, just put variable names as labels on the first two instructions and delete those variables:

f32_entry               ' zero out the command register
t1                      mov     t1, #0
t2                      wrlong  t1, par                 ' clear command status

Heater. · 2010-09-15 03:17

And another?

t5 does not seem to be used anywhere.

Heater. · 2010-09-15 04:15

Am I doing this right?

PUB test_fp | a, b, c, d
  a := fp.FFloat(400)
  b := fp.FFloat(800)
  c := fp.Fadd(a, b)
  d := fp.FTrunc(c)
  crlf
  ser.dec(d)
  crlf

Gives me the result of 800 instead of 1200.

Edit: However this:

PUB test_fp | a, b, c, d
  b := fp.FFloat(800)
  a := fp.FFloat(400)
  c := fp.Fadd(a, b)
  d := fp.FTrunc(c)
  crlf
  ser.dec(d)
  crlf

Which should be the same, gives me 400.

Edit: More however, this works:

  c := fp.Fadd(400.0, 300.0)
  d := fp.FTrunc(c)
  crlf
  ser.dec(d)

lonesock · 2010-09-15 08:23

@Heater: Well, those 1st 2 instructions aren't initialization, per se. They are used to clear the command variable to 0, signifying that we're done with the requested operation. Since I have to do that every time, I just put those 2 instructions in the entry code so I don't _also_ have to zero the command long when I call the .Start routine. Since I need to keep them as instructions, I can not use them as variables as well. And you are right, t5 is no longer needed, nice catch.

The example code you posted works fine for me, both under the propeller tool (1.2.7v2) and bst (0.19.3). Not sure what to say there.

EDIT: the address f32_RetAB_ptr is just par+4, but let me think about this...for Zog it would probably be fastest to have: command, result, op1, op2 as 4 consecutive longs, so I'd like a calling method that works equally well for Zog and for Spin, not requiring a second pointer in Zog's case, but allowing one in Spin's case.

EDIT 2: ah, t1 & t2 are getting stomped, try putting those back as "long 0"'s at the end of the cog.

Jonathan

Heater. · 2010-09-15 08:57

This is weird then.
The attached little floating point test can give me 700, the correct result, or 0, depending on whether I comment out a couple of unrelated lines before the FAdd.
I'm using BST 0.19.4 pre 2 and downloaded a fresh copy of your code to be sure I had not buggered it up.

lonesock · 2010-09-15 09:30

You know the optimization I spoke of to zero the command long on PASM entry, saving a single Spin assignment? Yeah, that was dumb. If you called start, then sent a command before the cog had loaded, that initial setting cmd to 0 would cause the 1st function to fail, as the caller code thought that Float32 had finished already!
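
The failure can be replayed as a fixed sequence of steps. The Python below is a sketch under assumptions: the dictionary mailbox, the `run` function, and the exact ordering of events are made up for illustration, with only the ordering "command posted before the cog's entry code runs" taken from the description above.

```python
def run(cog_zeroes_on_entry):
    """Return the value the caller reads back for 400.0 + 300.0."""
    mailbox = {"command": 0, "a": 0.0, "b": 0.0, "result": None}

    # 1. Spin calls Start; the cog begins loading but is not yet running.
    # 2. Spin immediately posts a command before the cog is ready.
    mailbox["a"], mailbox["b"] = 400.0, 300.0
    mailbox["command"] = "FAdd"

    # 3. The cog finishes loading and runs its entry code.
    if cog_zeroes_on_entry:
        mailbox["command"] = 0          # wipes the pending request

    # 4. The caller polls: "repeat while command". A zero looks like "done".
    if mailbox["command"] == 0:
        return mailbox["result"]        # nothing was computed yet

    # 5. Otherwise the cog services the request and then clears the command.
    mailbox["result"] = mailbox["a"] + mailbox["b"]
    mailbox["command"] = 0
    return mailbox["result"]

print(run(cog_zeroes_on_entry=True))    # None: the request was lost
print(run(cog_zeroes_on_entry=False))   # 700.0
```

Clearing the command long from the Spin side before launching the cog removes step 3 from the picture, so a request posted early simply waits until the cog is up.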

Fixed, saved another long, removed the pointer to a pointer, updating attachment now, give me 30 seconds.

Jonathan

Bill Henning · 2010-09-17 14:20

Great work!

I should go away on vacation more often - come back after a week, more great Prop stuff!

lonesock wrote: »

You know the optimization I spoke of to zero the command long on PASM entry, saving a single Spin assignment? Yeah, that was dumb. If you called start, then sent a command before the cog had loaded, that initial setting cmd to 0 would cause the 1st function to fail, as the caller code thought that Float32 had finished already!

Fixed, saved another long, removed the pointer to a pointer, updating attachment now, give me 30 seconds.

Jonathan

Heater. · 2010-09-17 14:33

Yep, I can't believe Lonesock is about to squeeze all of 2 COGs worth of float32 into one COG!

I almost have a C wrapper for it ready, perhaps some testing tomorrow.

Looks like Zog is about to give Catalina a run for its money in the Whetstone races:)

TinkersALot · 2010-09-17 14:52

Is the attachment in the first post in this thread the latest version? The reason I ask is that the change comments in the attached file do not appear to match the history shown in the first post.

Heater. · 2010-09-17 14:59

Yep. Lonesock is too busy hacking code to change the comments. That's normal.

TinkersALot · 2010-09-17 15:10

Okay. Thank you

lonesock · 2010-09-20 13:40

Thanks, everybody. A new version is up, which supposedly has full functionality! PLEASE TEST! I implemented ATan2 using CORDIC, then rewrote ATan and ASin and ACos in terms of ATan2...should be faster, and hopefully more accurate.

@T.A.L.: sorry, I should have added in some bare minimum comments. There is now a tiny disclaimer at the top, noting this is a modified version, and neither Cam nor Parallax should be blamed for anything! [8^)

Jonathan

Heater. · 2010-09-20 14:21

Lonesock, that's incredible. I have a C wrapper around your float32 for zog that almost works. I'm away from home for a couple of days so testing is on hold until Thursday.

lonesock · 2010-09-21 12:03

Thanks, Heater.

OK, faster multiply and divide. (You will notice if calling from PASM, not Spin). I only needed to do 26 iterations in the divide routine instead of 30, and only 24 iterations in multiply instead of 32. There are only 24 bits in the mantissa, after all, including the implied 1, so after packing I'm getting bit identical results. If we were keeping intermediate values around, there would be some value to keeping more mantissa bits, but that would require a major overhaul to the architecture, and the calling would be complex.
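
The multiply half of that argument can be checked with a short Python sketch. It assumes a plain shift-and-add loop that consumes one multiplier bit per iteration, least significant bit first; the real PASM routine may be organised differently, and `mul_mantissas` is a made-up name. Because a mantissa with its implied 1 occupies only 24 bits, iterations 25 through 32 would only ever see zero bits.

```python
def mul_mantissas(ma, mb, iterations=24):
    """Shift-and-add multiply, one multiplier bit per iteration."""
    product = 0
    for i in range(iterations):
        if (mb >> i) & 1:          # test bit i of the multiplier
            product += ma << i     # add the shifted multiplicand
    return product

# Two 24-bit mantissas (implied 1 in bit 23): 1.5 and 1.25.
ma, mb = 0xC00000, 0xA00000
assert mul_mantissas(ma, mb, 24) == mul_mantissas(ma, mb, 32) == ma * mb

# Worst case: every mantissa bit set.
full = 0xFFFFFF
assert mul_mantissas(full, full, 24) == full * full
```

The divide case (26 iterations instead of 30) follows a similar line of reasoning about how many quotient bits survive packing, but is not modelled here.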

There are now 26 longs free. If no one has any need of them, I can put them into the CORDIC table for ATan2 and speed that up a bit. Right now I only use 11 table entries for the arctan of the angles, because all entries after that are simply a >>1 of the previous table entry. This adds 3 longs to my CORDIC main loop, though, so the current version saves about 16 longs. I can gain speed at the expense of those 16 longs, but if they aren't in use for anything anyway...
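
The table trick can be illustrated in Python. This is a sketch with assumed parameters: angles are held as fixed-point radians scaled by 2**30 and 24 iterations are used, neither of which is stated in the post. Only the idea comes from the post: store the first 11 arctangent entries and derive each later one by shifting the previous one right by 1, since atan(x) is very nearly x for small x.

```python
import math

SCALE = 1 << 30          # assumed fixed-point scale: 2**30 units per radian
ENTRIES = 24             # assumed number of CORDIC iterations

# Full table: atan(2**-i) for every iteration.
full_table = [round(math.atan(2.0 ** -i) * SCALE) for i in range(ENTRIES)]

# Short table: store only the first 11 entries, derive the rest by halving.
STORED = 11
short_table = full_table[:STORED]

def angle(i):
    """Return the rotation angle for iteration i from the short table."""
    if i < STORED:
        return short_table[i]
    return short_table[STORED - 1] >> (i - STORED + 1)

# Largest disagreement between the derived and the fully tabulated angles.
worst = max(abs(angle(i) - full_table[i]) for i in range(ENTRIES))
print(worst)
```

With this assumed scale the derived entries should agree with the full table to within one unit. A different scale would move the point at which halving becomes exact, so the count of 11 stored entries depends on the fixed-point format in use.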

Please let me know if anybody has any problems, ideas, or requests.

thanks,
Jonathan

John Abshier · 2010-09-23 16:06

I did a little testing to compare original and one cog version.

Sin, -360 to 360 in steps of 1: max difference = 0.0012 between original and new. Original is more accurate when compared to XCALC and Excel. New has errors in the 3rd decimal, 0.00x.

Tan: most are close, but off by 4 at 89 degrees: 57.2889 (old) vs. 61.34064 (new).

For the arc trig functions I did arcF(F(x)). Most are not too different. If you round to integer degrees they match. A truncation will sometimes result in a 1 degree difference.

I didn't check for speed differences. Gaining a cog for a slight reduction in accuracy is a good trade-off.

John Abshier
