Any ways to speed up Float32? - Page 2 — Parallax Forums

Any ways to speed up Float32?


  • M. K. BorriM. K. Borri Posts: 279
    edited 2010-09-11 14:03
    hm.... good point, can I play with it?
  • Heater.Heater. Posts: 21,230
    edited 2010-09-11 23:15
    Lonesock,

    Had a quick peek at your code and a snooze, woke up with the following idea:

    Your dispatch table is cunning. Brilliant in fact, it solves the problem with my technique that we have been discussing. With what you have there it would now be possible to do away with command look up in getCommand and save 10 LONGs in the main loop:)

    1) Delete all the command definitions in CON and use those command names to label the entries in your dispatch table.

    2) Rearrange the mail box interface to three longs:
    command, oper1, oper2

    Then in the Spin side we use something like:
    oper1 := something 
    oper2 := somethingelse
    command := FAddCmd   'Sets command to the required PASM call instruction
    repeat while command
    
    Then replace the getCommand loop with something that has no dispatch table look up but uses the command (now a complete call instruction) directly:
    DAT
    'PASM CODE starts here
    enter       'Set up oper1_addr,
                'oper2_addr and command_addr
                'from PAR here
                '.....
                'N.B. Any init code placed here can double up as variables
                'and so wastes no space.
                '.....
    getCommand  rdlong  :execCmd, command_addr wz 'Wait for command
           if_z jmp     #getCommand
    
                rdlong  fnumA, oper1_addr         'Load fnumA
                rdlong  fnumB, oper2_addr         'Load fnumB
    
    :execCmd    nop                               'Call command with call instruction placed
                                                  'here by getCommand
    :finishCmd  wrlong  fnumA, oper1_addr         'Return result
                wrlong  Zero, command_addr        'Clear command status (call ins)
                jmp     #getCommand               'Wait for next command
    
    '------------------------------------------------------------------------------
    '...
    ' Bla bla bla, the rest of Float32 here
    '...
    '------------------------------------------------------------------------------
                fit     496
    
    'Command definitions (used by Spin), command values are actual call instructions
    FAddCmd     call    #_FAdd
    FSubCmd     call    #_FSub
    FMulCmd     call    #_FMul
    FDivCmd     call    #_FDiv
    FFloatCmd   call    #_FFloat
    FTruncCmd   call    #_FTrunc
    FRoundCmd   call    #_FRound
    FSqrCmd     call    #_FSqr
    FCmpCmd     call    #cmd_FCmp
    SinCmd      call    #_Sin
    CosCmd      call    #_Cos
    TanCmd      call    #_Tan
    LogCmd      call    #_Log
    Log10Cmd    call    #_Log10
    ExpCmd      call    #_Exp
    Exp10Cmd    call    #_Exp10
    PowCmd      call    #_Pow
    FracCmd     call    #_Frac
    '------------------------------------------------------------------------------
    
    Note: there is a little init section now to set up command and opers from PAR on start up. This area can be overlaid with vars and so takes no space.

    Note: The command call instruction list is outside of the FIT and so takes no space in COG.

    Note: Because the dispatch list is at the end of the PASM, C code in Zog can use the PASM extracted as a binary blob and find the required command constants at the end of the blob. This provides the required "linkage".

    Note: This requires changing all the Spin interface routines and slowing them down a bit. Might be worth it for those 10 LONGs though.
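
The three-long mailbox handshake proposed above (command, oper1, oper2) can be modelled on a host machine. What follows is only a minimal Python sketch, not Propeller code: the PASM call instruction stored in the command long is replaced by a small integer key, the cog is stepped from inside the caller's wait loop instead of running in parallel, and the names `Mailbox`, `cog_poll_once`, and `spin_call` are hypothetical.

```python
import struct

def f2bits(x):
    """Pack a Python float into IEEE-754 single-precision bits."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits2f(b):
    """Unpack IEEE-754 single-precision bits into a Python float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

# Stand-ins for the "call #_FAdd" style command longs (hypothetical keys;
# on the Propeller these would be complete PASM call instructions).
COMMANDS = {
    1: lambda a, b: a + b,   # FAddCmd
    2: lambda a, b: a - b,   # FSubCmd
    3: lambda a, b: a * b,   # FMulCmd
}

class Mailbox:
    """Three consecutive hub longs: command, oper1, oper2."""
    def __init__(self):
        self.command = 0
        self.oper1 = 0
        self.oper2 = 0

def cog_poll_once(mb):
    """One pass of the getCommand loop; returns False if nothing was pending."""
    cmd = mb.command
    if cmd == 0:
        return False
    a, b = bits2f(mb.oper1), bits2f(mb.oper2)
    mb.oper1 = f2bits(COMMANDS[cmd](a, b))   # result is returned in oper1
    mb.command = 0                            # zero signals completion
    return True

def spin_call(mb, cmd, a, b):
    """Caller side: fill operands, post the command, wait for it to clear."""
    mb.oper1, mb.oper2 = f2bits(a), f2bits(b)
    mb.command = cmd
    while mb.command:          # the Spin "repeat while command"
        cog_poll_once(mb)      # on hardware the cog runs concurrently
    return bits2f(mb.oper1)
```

Calling `spin_call(Mailbox(), 1, 1.5, 2.25)` walks through the same sequence as the Spin fragment: write the operands, write the command last, then spin until the command long reads zero.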
  • lonesocklonesock Posts: 917
    edited 2010-09-12 11:11
    Got it, great optimization! I will work on this tomorrow. I do plan on changing the calling convention as follows...please let me know if it will work for Zog:

    The Spin code has 2 longs back to back: F32CmdCall & F32ResultPtr. You would call a function as follows:

    F32ResultPtr := @result
    F32CmdCall := F32AddCmd
    repeat while F32CmdCall

    In Spin, the return value is a long, just in front of the input calling parameters. So a wrlong to the address in F32ResultPtr will save my result into the calling function's result value automatically. And a rdlong from (the address in F32ResultPtr)+4 gets me the 1st parameter to the calling Spin function, and +8 gets me the second parameter, if any.

    So, still only 2 assignments to call, which should be fairly close to 1 assignment and 1 addition. In Zog's case, you just need 3 Hub longs in a row: result, op1, op2.

    Jonathan
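
The frame layout Jonathan relies on (result, then the first parameter at +4, then the second at +8) can be pictured with a small host-side model. This is a hedged Python sketch only: hub memory is a plain list of longs, the byte address `frame` is made up, and `cog_handle` is a hypothetical stand-in for the PASM handler.

```python
import struct

def f2bits(x):
    """Pack a Python float into IEEE-754 single-precision bits."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits2f(b):
    """Unpack IEEE-754 single-precision bits into a Python float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def cog_handle(hub, result_addr, op):
    """Read operands at result_addr+4 and +8, write the answer at result_addr."""
    a = bits2f(hub[(result_addr + 4) // 4])
    b = bits2f(hub[(result_addr + 8) // 4])
    hub[result_addr // 4] = f2bits(op(a, b))

# Hub memory as a list of 32-bit longs; byte address = index * 4.
hub = [0] * 16
frame = 0x10                          # hypothetical byte address of 'result'
hub[frame // 4 + 1] = f2bits(1.5)     # first parameter, at frame + 4
hub[frame // 4 + 2] = f2bits(0.25)    # second parameter, at frame + 8

cog_handle(hub, frame, lambda a, b: a + b)
answer = bits2f(hub[frame // 4])      # the caller's 'result' now holds the sum
```

Only one pointer has to be handed over per call, which is why the same handler also suits a C caller that keeps three longs in a row.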
  • Heater.Heater. Posts: 21,230
    edited 2010-09-12 11:57
    Lonesock,
    "the return value is a long, just in front of the input calling parameters."

    Ouch, I had to think about that for a minute. Normally my mind reels at such solutions. Dependent on unwritten features of the tools, order of parameters, position of return results etc.

    However, as the Spin interpreter is cast in silicon why not go with it?

    As for Zog, as long as we have 3 longs in a row we are in business.
  • lonesocklonesock Posts: 917
    edited 2010-09-13 12:06
    Yep, it's a bit scary. As I understand it, when performing a subroutine call Spin pushes a 0 on the stack as the return value (which is why 'result' is already initialized), then pushes the parameters on as well. The thing is, now that my "dispatch table" consists of the full "call #some_address" long, there is no room to specify both the command and the address of the parameters, so I need a bit of extra code (including an extra rdlong) inside the call handler.

    I'm going to update the attachment in the 1st post in just a minute. Here's how the call handler looks now:
    f32_entry               ' zero out the command register
                            mov     t1, #0
                            wrlong  t1, par                 ' clear command status
    
                            ' try to keep 2 or fewer instructions between rd/wrlong
    getCommand              rdlong  :execCmd, par wz        ' wait for command to be non-zero, and store it in the call location
                  if_z      jmp     #getCommand
    
                            rdlong  ret_ptr, ret_ptr_ptr
                            add     ret_ptr, #4
    
                            rdlong  fNumA, ret_ptr
                            add     ret_ptr, #4
    
                            rdlong  fNumB, ret_ptr
                            sub     ret_ptr, #8
    
    :execCmd                nop                             ' execute command, which was replaced by getCommand
    
    :finishCmd              wrlong  fnumA, ret_ptr          ' store the result
                            jmp     #f32_entry              ' wait for next command
    
    If anyone has a way to speed up / shorten this, I would appreciate it (note that defining a long named Zero helps neither the speed (2 instructions btwn rdlongs) nor the size (as Zero itself takes up a long)). However, with this stuff and a few other optimizations in place, we now have 71 longs free in the Cog!

    Jonathan
  • lonesocklonesock Posts: 917
    edited 2010-09-13 15:28
    FFloat is quite a bit shorter (we now have 80 longs free in the Cog), but I am holding off uploading this version yet because...

    Has anyone tried using FTrunc or FRound? The functions inside the unmodified Float32 don't call _Pack again after manipulating bits, and I'm getting weird answers. Can anyone verify for me?

    Jonathan
  • John AbshierJohn Abshier Posts: 1,116
    edited 2010-09-13 15:44
    FTrunc worked for the following test program.
    CON
      _clkmode = xtal1 + pll16x
      _xinfreq = 5_000_000
      #1, HOME, #8, BKSP, TAB, LF, CLREOL, CLRLB, CR, #16, CLS      ' PST formatting control  
    OBJ
      Sio  : "FullDuplexSerialPlus"        
      F: "Float32"
    Pub Main | i, a, b
      Sio.start(31,30,0,115200)  ' Rx,Tx, Mode, Baud
      repeat until Sio.rxCheck <> -1                        ' wait for PST 
      F.Start
      a := -2.25
      b := 0.25
      repeat 21
        a :=  F.FAdd(a,b) 
        sio.dec(F.FTrunc(a))
        sio.tx(13)      
    

    John Abshier
  • John AbshierJohn Abshier Posts: 1,116
    edited 2010-09-13 15:49
    Changed F.FTrunc to F.FRound in program above. Worked OK.

    John Abshier
  • lonesocklonesock Posts: 917
    edited 2010-09-13 16:46
    Ah, I see my mistake. For some reason I was expecting the output to _be_ a float, and was using FloatString to aid me in my mistake. OK, back to the trenches.

    thanks,
    Jonathan
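
The mix-up described here, printing an integer result with a float formatter, is easy to reproduce off-target. The Python sketch below is only an illustration of the bit-level effect; the helper names are hypothetical and no Float32 or FloatString code is involved.

```python
import struct

def int_bits_as_float(n):
    """Reinterpret a 32-bit signed integer's bit pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<i", n))[0]

def float_as_int_bits(x):
    """Return the raw 32-bit pattern of a single-precision float as a signed int."""
    return struct.unpack("<i", struct.pack("<f", x))[0]

truncated = int(2.75)                      # conceptually what a truncation returns: integer 2
misread = int_bits_as_float(truncated)     # what a float formatter would be handed
```

Here `truncated` is the ordinary integer 2, but `misread` is a tiny denormal float, nowhere near 2.0, which matches the "weird answers" symptom.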
  • lonesocklonesock Posts: 917
    edited 2010-09-13 19:46
    Re: Trunc and Round, try them with a really large number: e.g.

    FTrunc( 323258300000.0 ) truncates to 75.

    Jonathan
  • lonesocklonesock Posts: 917
    edited 2010-09-13 20:11
    Trunc and Round now do the right thing for large numbers (returning with Maxint if they can't keep up, and preserving the sign).

    83 longs available now, I'm almost ready to start shoe-horn-ing in some more functionality.

    Jonathan
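
A saturating truncation of the kind described above might look like the following Python sketch. It is a model under stated assumptions, not the PASM routine: the post only says the result is Maxint with the sign preserved, so returning `-MAXINT` for large negative inputs is an assumption, and NaN inputs are not handled.

```python
import math
import struct

MAXINT = 0x7FFF_FFFF

def to_single(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ftrunc_saturating(x):
    """Truncate toward zero; clamp to +/-MAXINT when the value needs more than 32 bits."""
    x = to_single(x)
    if abs(x) >= 2.0 ** 31:
        # Assumption: the sign is kept and the magnitude is MAXINT.
        return MAXINT if x > 0 else -MAXINT
    return math.trunc(x)
```

With this model, `ftrunc_saturating(323258300000.0)` returns `MAXINT` rather than the 75 reported in the earlier post, while ordinary values such as -2.75 still truncate toward zero.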
  • Heater.Heater. Posts: 21,230
    edited 2010-09-13 22:26
    Lonesock,

    83 !

    Astounding piece of work.

    That's nearly enough to put all the missing functions from Float32A in there. Zog is going to love this. And pretty much everyone else.

    I presume you have removed the redundant log functions already.

    If everything does not quite fit it might be time to put all the code in there and have some #ifdefs around so that users can select which functions they may need for a particular application.

    Not sure I follow your new Spin interface set up yet. Need some morning coffee.
  • lonesocklonesock Posts: 917
    edited 2010-09-14 16:20
    OK, we now have 110 longs free in the Cog!! I will update the attachment in the 1st post.

    I would love to have some volunteer testing on this version. I had to rewrite the sin/cos/log/exp stuff to all use the same table lookup / interpolation code. (The original version of Float32 had some issues with large values for exp, both positive and negative.)

    Other than looking at pow and compare, I'm done for now. If anyone has any feedback / issues, please let me know. The next step is to start adding functionality.

    Jonathan
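
The shared "table lookup / interpolation" idea can be sketched on a host machine. The Python below is an illustration only: the thread does not show the actual interpolation code, so the table shape (2,049 samples of 16 bits over 0 to 90 degrees, loosely modelled on the Propeller ROM sine table) and the function names are assumptions.

```python
import math

N = 2048                                   # intervals across 0..90 degrees
TABLE = [round(math.sin(math.pi / 2 * i / N) * 0xFFFF) for i in range(N + 1)]

def quarter_sin(t):
    """sin for t in [0, 1] quarter-turns, by linear interpolation in TABLE."""
    pos = t * N
    i = min(int(pos), N - 1)
    frac = pos - i
    return (TABLE[i] + (TABLE[i + 1] - TABLE[i]) * frac) / 0xFFFF

def table_sin(deg):
    """sin(deg) built from quarter_sin using quadrant symmetry."""
    quadrant, rem = divmod(deg % 360.0, 90.0)
    quadrant = int(quadrant) % 4
    t = rem / 90.0
    if quadrant == 0:
        return quarter_sin(t)
    if quadrant == 1:
        return quarter_sin(1.0 - t)
    if quadrant == 2:
        return -quarter_sin(t)
    return -quarter_sin(1.0 - t)

def table_cos(deg):
    """cos(deg) reuses the same table by shifting the angle a quarter turn."""
    return table_sin(deg + 90.0)
```

One table serves both sine and cosine here; log and exp would need their own tables but could share the interpolation step in the same way.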
  • Heater.Heater. Posts: 21,230
    edited 2010-09-15 01:26
    "110 longs free in the Cog!"

    What?!

    Oh yeah, I forgot, this is Propeller Land where impossible things happen every day.

    I'll get on with some testing ASAP.
  • Heater.Heater. Posts: 21,230
    edited 2010-09-15 02:24
    lonesock,

    Zog has a little problem with ret_ptr_ptr.

    ret_ptr_ptr is a LONG somewhere in DAT that is set up by Spin prior to launching the COG.

    C or C++ code has no "linkage" with PASM like Spin does, so C/C++ does not know where ret_ptr_ptr is in order to set it. This makes creating a C/C++ wrapper for float32 a bit tricky.

    Is it possible to get ret_ptr_ptr passed in at start up via PAR instead?

    P.S.

    Not sure if you are aware of this but the procedure I use is to compile Spin objects with BSTC with an option that outputs only the compiled PASM as a binary blob throwing away all the Spin stuff. This blob is then converted to an object file and linked in with the C code. Then a C wrapper is created for it replacing the Spin interface.

    In this way it is possible to create C++ objects that mirror the original Spin objects.

    This does not work so well if there are locations in the blob that need values "poking" in before it is started.
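
The blob-based linkage Heater describes, with the command longs sitting at the end of the PASM image, could be consumed by a host-side tool roughly as follows. This Python sketch assumes a layout the thread only outlines: the table is taken to be the last 18 little-endian longs of the blob, in the order of the command list posted earlier, and `read_command_table` is a hypothetical helper.

```python
import struct

# Order copied from the command list in the thread; the "last N longs of the
# blob" layout is an assumption made for this illustration.
COMMAND_NAMES = [
    "FAddCmd", "FSubCmd", "FMulCmd", "FDivCmd", "FFloatCmd", "FTruncCmd",
    "FRoundCmd", "FSqrCmd", "FCmpCmd", "SinCmd", "CosCmd", "TanCmd",
    "LogCmd", "Log10Cmd", "ExpCmd", "Exp10Cmd", "PowCmd", "FracCmd",
]

def read_command_table(blob, names=COMMAND_NAMES):
    """Return {name: 32-bit command long} read from the tail of a PASM blob."""
    size = 4 * len(names)
    if len(blob) < size or len(blob) % 4:
        raise ValueError("blob too short or not long-aligned")
    longs = struct.unpack("<%dI" % len(names), blob[-size:])
    return dict(zip(names, longs))
```

A C wrapper generator could emit these values as constants, so nothing has to be poked into the blob before the cog is started.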
  • Heater.Heater. Posts: 21,230
    edited 2010-09-15 03:10
    lonesock,

    You can save another two longs in your version, just put variable names as labels on the first two instructions and delete those variables:
    f32_entry               ' zero out the command register
    t1                        mov     t1, #0
    t2                        wrlong  t1, par                 ' clear command status
    
  • Heater.Heater. Posts: 21,230
    edited 2010-09-15 03:17
    And another?

    t5 does not seem to be used anywhere.
  • Heater.Heater. Posts: 21,230
    edited 2010-09-15 04:15
    Am I doing this right?
    PUB test_fp | a, b, c, d
      a := fp.FFloat(400)
      b := fp.FFloat(800)
      c := fp.Fadd(a, b)
      d := fp.FTrunc(c)
      crlf
      ser.dec(d)
      crlf
    
    Gives me the result of 800 instead of 1200.

    Edit: However this:
    PUB test_fp | a, b, c, d
      b := fp.FFloat(800)
      a := fp.FFloat(400)
      c := fp.Fadd(a, b)
      d := fp.FTrunc(c)
      crlf
      ser.dec(d)
      crlf
    
    Which should be the same gives me 400.

    Edit: More however, this works:
      c := fp.Fadd(400.0, 300.0)
      d := fp.FTrunc(c)
      crlf
      ser.dec(d)
    
  • lonesocklonesock Posts: 917
    edited 2010-09-15 08:23
    @Heater: Well, those 1st 2 instructions aren't initialization, per se. They are used to clear the command variable to 0, signifying that we're done with the requested operation. Since I have to do that every time, I just put those 2 instructions in the entry code so I don't _also_ have to zero the command long when I call the .Start routine. Since I need to keep them as instructions, I can not use them as variables as well. And you are right, t5 is no longer needed, nice catch.

    The example code you posted works fine for me, both under the propeller tool (1.2.7v2) and bst (0.19.3). Not sure what to say there.

    EDIT: the address f32_RetAB_ptr is just par+4, but let me think about this...for Zog it would probably be fastest to have: command, result, op1, op2 as 4 consecutive longs, so I'd like a calling method that works equally well for Zog and for Spin, not requiring a second pointer in Zog's case, but allowing one in Spin's case.

    EDIT 2: ah, t1 & t2 are getting stomped, try putting those back as "long 0"'s at the end of the cog.

    Jonathan
  • Heater.Heater. Posts: 21,230
    edited 2010-09-15 08:57
    This is weird then.
    The attached little floating point test can give me 700, the correct result, or 0, depending on whether I comment out a couple of unrelated lines before the FAdd.
    I'm using BST 0.19.4 pre 2 and downloaded a fresh copy of your code to be sure I had not buggered it up.
  • lonesocklonesock Posts: 917
    edited 2010-09-15 09:30
    You know the optimization I spoke of to zero the command long on PASM entry, saving a single Spin assignment? Yeah, that was dumb. If you called start, then sent a command before the cog had loaded, that initial setting cmd to 0 would cause the 1st function to fail, as the caller code thought that Float32 had finished already!

    Fixed, saved another long, removed the pointer to a pointer, updating attachment now, give me 30 seconds.

    Jonathan
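
The startup race described in this post can be replayed as a fixed sequence of events. The Python below is a deliberately simplified sketch: one hard-coded interleaving, a single mailbox variable, and a hypothetical `replay` function, with no real concurrency or Propeller timing involved.

```python
def replay(clear_on_entry):
    """Replay: the caller posts a command before the cog has finished loading."""
    command, result = 0, None
    events = []

    command = 1                        # caller posts a command right after Start
    if clear_on_entry:
        command = 0                    # cog entry code wipes the pending command
        events.append("entry cleared mailbox")

    if command:                        # the cog's getCommand poll
        result = "computed"
        command = 0
        events.append("cog ran command")

    # The caller's wait loop exits as soon as the command long reads zero.
    events.append("caller resumes")
    return result, events
```

With `clear_on_entry=True` the caller resumes without any result having been computed, which is the failure mode reported; with `False` the command survives until the cog picks it up.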
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-17 14:20
    Great work!

    I should go away on vacation more often - come back after a week, more great Prop stuff!
    lonesock wrote: »
    You know the optimization I spoke of to zero the command long on PASM entry, saving a single Spin assignment? Yeah, that was dumb. If you called start, then sent a command before the cog had loaded, that initial setting cmd to 0 would cause the 1st function to fail, as the caller code thought that Float32 had finished already!

    Fixed, saved another long, removed the pointer to a pointer, updating attachment now, give me 30 seconds.

    Jonathan
  • Heater.Heater. Posts: 21,230
    edited 2010-09-17 14:33
    Yep, I can't believe Lonesock is about to squeeze all of 2 COGs worth of float32 into one COG!

    I almost have a C wrapper for it ready, perhaps some testing tomorrow.

    Looks like Zog is about to give Catalina a run for its money in the Whetstone races:)
  • TinkersALotTinkersALot Posts: 535
    edited 2010-09-17 14:52
    Is the attachment in the first post in this thread the latest version? The reason I ask is that the change comments in the attached file do not appear to match the history shown in the first post.
  • Heater.Heater. Posts: 21,230
    edited 2010-09-17 14:59
    Yep. Lonesock is too busy hacking code to change the comments. That's normal.
  • TinkersALotTinkersALot Posts: 535
    edited 2010-09-17 15:10
    Okay. Thank you
  • lonesocklonesock Posts: 917
    edited 2010-09-20 13:40
    Thanks, everybody. A new version is up, which supposedly has full functionality! PLEASE TEST! I implemented ATan2 using CORDIC, then rewrote ATan and ASin and ACos in terms of ATan2...should be faster, and hopefully more accurate.

    @T.A.L.: sorry, I should have added in some bare minimum comments. There is now a tiny disclaimer at the top, noting this is a modified version, and neither Cam nor Parallax should be blamed for anything! [8^)

    Jonathan
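
Expressing the other inverse trig functions through ATan2 can be done with standard identities. The Python sketch below shows one common set; the thread does not say which formulation the new code uses, so treat these as an assumption, with `math.atan2` standing in for the CORDIC routine.

```python
import math

def atan_via_atan2(x):
    """atan(x) is the angle of the point (1, x)."""
    return math.atan2(x, 1.0)

def asin_via_atan2(x):
    """asin(x) for -1 <= x <= 1: the angle of the point (sqrt(1 - x^2), x)."""
    return math.atan2(x, math.sqrt((1.0 - x) * (1.0 + x)))

def acos_via_atan2(x):
    """acos(x) for -1 <= x <= 1: the angle of the point (x, sqrt(1 - x^2))."""
    return math.atan2(math.sqrt((1.0 - x) * (1.0 + x)), x)
```

Writing the radicand as `(1 - x) * (1 + x)` instead of `1 - x * x` is a small design choice that loses less precision when x is close to plus or minus 1.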
  • Heater.Heater. Posts: 21,230
    edited 2010-09-20 14:21
    Lonesock, that's incredible. I have a C wrapper around your float32 for zog that almost works. I'm away from home for a couple of days so testing is on hold until Thursday.
  • lonesocklonesock Posts: 917
    edited 2010-09-21 12:03
    Thanks, Heater.

    OK, faster multiply and divide. (You will notice if calling from PASM, not Spin). I only needed to do 26 iterations in the divide routine instead of 30, and only 24 iterations in multiply instead of 32. There are only 24 bits in the mantissa, after all, including the implied 1, so after packing I'm getting bit identical results. If we were keeping intermediate values around, there would be some value to keeping more mantissa bits, but that would require a major overhaul to the architecture, and the calling would be complex.

    There are now 26 longs free. If no one has any need of them, I can put them into the CORDIC table for ATan2 and speed that up a bit. Right now I only use 11 table entries for the arctan of the angles, because all entries after that are simply a >>1 of the previous table entry. This adds 3 longs to my CORDIC main loop, though, so the current version saves about 16 longs. I can gain speed at the expense of those 16 longs, but if they aren't in use for anything anyway...

    Please let me know if anybody has any problems, ideas, or requests.

    thanks,
    Jonathan
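
The table-size trade-off for the CORDIC ATan2 can be tried out on a host machine. The following Python sketch is a floating-point model, not the fixed-point PASM routine: it stores 11 arctangent entries as the post describes and derives each later entry by halving the previous one, while the iteration count of 24 and the quadrant handling are assumptions.

```python
import math

TABLE_SIZE = 11                  # entries stored explicitly, as in the post
ITERATIONS = 24                  # hypothetical iteration count
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(TABLE_SIZE)]

def cordic_atan2(y, x):
    """Vectoring-mode CORDIC; entries past TABLE_SIZE come from halving."""
    if x == 0.0 and y == 0.0:
        return 0.0
    angle = 0.0
    if x < 0.0:                  # turn the vector 180 degrees into the right half-plane
        x, y = -x, -y
        angle = math.pi if y <= 0.0 else -math.pi
    step = 0.0
    for i in range(ITERATIONS):
        # Past the stored table, each step is half the previous one (the ">>1").
        step = ATAN_TABLE[i] if i < TABLE_SIZE else step / 2.0
        shift = 2.0 ** -i
        if y > 0.0:              # rotate clockwise, so the input angle was larger
            x, y = x + y * shift, y - x * shift
            angle += step
        else:                    # rotate counter-clockwise
            x, y = x - y * shift, y + x * shift
            angle -= step
    return angle
```

For small angles atan(2^-i) is very nearly 2^-i, which is why halving the previous entry is an adequate substitute for a stored value once i is large.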
  • John AbshierJohn Abshier Posts: 1,116
    edited 2010-09-23 16:06
    I did a little testing to compare original and one cog version.

    Sin, -360 to 360 in steps of 1: max difference = 0.0012 between original and new. The original is more accurate when compared to XCALC and Excel. The new version has errors in the 3rd decimal place (0.00x).

    Tan: most values are close, but it is off by 4 at 89 degrees: 57.2889 (old) vs. 61.34064 (new).

    For the arc trig functions I did arcF(F(x)). Most are not too different. If you round to integer degrees they match. A truncation will sometimes result in a 1 degree difference.

    I didn't check for speed differences. Gaining a cog for a slight reduction in accuracy is a good trade off.

    John Abshier
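
A comparison sweep like the one John describes can be scripted on a host machine. The Python below is a hedged sketch of the method only: `coarse_sin_deg` is a hypothetical stand-in for a lower-accuracy implementation (the reference rounded to three decimals), not the one-cog Float32 code, and the numbers it produces are not the figures reported above.

```python
import math

def max_abs_error(candidate, reference, inputs):
    """Largest absolute difference between two implementations over inputs."""
    return max(abs(candidate(v) - reference(v)) for v in inputs)

def ref_sin_deg(d):
    return math.sin(math.radians(d))

def coarse_sin_deg(d):
    """Hypothetical reduced-accuracy sine: the reference rounded to 3 decimals."""
    return round(ref_sin_deg(d), 3)

def asin_round_trip_deg(d, sin_fn):
    """arcsin(sin(d)) in degrees, meaningful for -90 <= d <= 90."""
    s = max(-1.0, min(1.0, sin_fn(d)))
    return math.degrees(math.asin(s))

sweep = range(-360, 361)
sin_error = max_abs_error(coarse_sin_deg, ref_sin_deg, sweep)

# Rounding versus truncating the round-trip result for one angle.
angle = asin_round_trip_deg(45, coarse_sin_deg)
rounded, truncated = round(angle), math.trunc(angle)
```

The round trip for 45 degrees lands just below 45, so rounding recovers 45 while truncation gives 44, the same 1 degree effect noted in the post.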