Had a quick peek at your code and a snooze, woke up with the following idea:
Your dispatch table is cunning. Brilliant, in fact: it solves the problem with my technique that we have been discussing. With what you have there, it would now be possible to do away with the command lookup in getCommand and save 10 LONGs in the main loop :)
1) Delete all the command definitions in CON and use those command names to label the entries in your dispatch table.
2) Rearrange the mailbox interface to three longs:
command, oper1, oper2
Then in the Spin side we use something like:
    oper1 := something
    oper2 := somethingelse
    command := FAddCmd              'Sets command to the required PASM call instruction
    repeat while command
Then replace the getCommand loop with something that has no dispatch table look up but uses the command (now a complete call instruction) directly:
DAT
                        'PASM CODE starts here
enter                   'Set up oper1_addr,
                        'oper2_addr and command_addr
                        'from PAR here
                        '.....
                        'N.B. Any init code placed here can double up as variables
                        'and so wastes no space.
                        '.....
getCommand              rdlong  :execCmd, command_addr wz   'Wait for command
              if_z      jmp     #getCommand
                        rdlong  fnumA, oper1_addr           'Load fnumA
                        rdlong  fnumB, oper2_addr           'Load fnumB
:execCmd                nop                                 'Call command with call instruction placed
                                                            'here by getCommand
:finishCmd              wrlong  fnumA, oper1_addr           'Return result
                        wrlong  Zero, command_addr          'Clear command status (call ins)
                        jmp     #getCommand                 'Wait for next command
'------------------------------------------------------------------------------
'...
' Bla bla bla, the rest of Float32 here
'...
'------------------------------------------------------------------------------
                        fit     496
'Command definitions (used by Spin), command values are actual call instructions
FAddCmd                 call    #_FAdd
FSubCmd                 call    #_FSub
FMulCmd                 call    #_FMul
FDivCmd                 call    #_FDiv
FFloatCmd               call    #_FFloat
FTruncCmd               call    #_FTrunc
FRoundCmd               call    #_FRound
FSqrCmd                 call    #_FSqr
FCmpCmd                 call    #cmd_FCmp
SinCmd                  call    #_Sin
CosCmd                  call    #_Cos
TanCmd                  call    #_Tan
LogCmd                  call    #_Log
Log10Cmd                call    #_Log10
ExpCmd                  call    #_Exp
Exp10Cmd                call    #_Exp10
PowCmd                  call    #_Pow
FracCmd                 call    #_Frac
'------------------------------------------------------------------------------
Note: there is a little init section now to set up command and opers from PAR on start up. This area can be overlaid with vars and so takes no space.
Note: The command call instruction list is outside of the FIT and so takes no space in COG.
Note: Because the dispatch list is at the end of the PASM, C code in ZOG that uses the PASM extracted as a binary blob can find the required command constants at the end of the blob. This provides the required "linkage".
Note: This requires changing all the Spin interface routines and slowing them down a bit. Might be worth it for those 10 LONGs though.
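For the C/Zog side, the three-long mailbox described above might be driven roughly like this. It is only a sketch: the struct, the function names and the stand-in "cog" are invented for illustration, and on real hardware the command word would be one of the call instructions read from the end of the blob rather than the placeholder used when trying it out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C view of the proposed mailbox: command, oper1, oper2. */
typedef struct {
    volatile uint32_t command;  /* a complete "call #_Xxx" instruction, 0 = idle */
    volatile uint32_t oper1;    /* first operand in, result out */
    volatile uint32_t oper2;    /* second operand in */
} f32_mailbox;

/* Post a request. Operands are written first; the command goes last
   because a non-zero command is what starts the cog working. */
static void f32_post(f32_mailbox *mb, uint32_t call_ins, uint32_t a, uint32_t b)
{
    assert(mb->command == 0);   /* no request may be in flight */
    mb->oper1 = a;
    mb->oper2 = b;
    mb->command = call_ins;
}

/* Non-zero once the cog has cleared the command long. */
static int f32_done(const f32_mailbox *mb)
{
    return mb->command == 0;
}

/* Host-side stand-in for the cog, for trying the protocol without hardware.
   It pretends every command is an integer add, just to give a visible result. */
static void fake_cog_service(f32_mailbox *mb)
{
    if (mb->command != 0) {
        mb->oper1 = mb->oper1 + mb->oper2;  /* result comes back in oper1 */
        mb->command = 0;                    /* clear command last */
    }
}
```

On a real Propeller the caller would spin on `f32_done()` after `f32_post()` and then read the result from `oper1`, which mirrors the `repeat while command` line on the Spin side.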
Got it, great optimization! I will work on this tomorrow. I do plan on changing the calling convention as follows...please let me know if it will work for Zog:
The Spin code has 2 longs back to back: F32CmdCall & F32ResultPtr. You would call a function as follows:
F32ResultPtr := @result
F32CmdCall := F32AddCmd
repeat while F32CmdCall
In Spin, the return value is a long, just in front of the input calling parameters. So a wrlong to the address in F32ResultPtr will save my result into the calling function's result value automatically. And a rdlong from (the address in F32ResultPtr)+4 gets me the 1st parameter to the calling Spin function, and +8 gets me the second parameter, if any.
So, still only 2 assignments to call, which should be fairly close to 1 assignment and 1 addition. In Zog's case, you just need 3 Hub longs in a row: result, op1, op2.
"the return value is a long, just in front of the input calling parameters."
Ouch, I had to think about that for a minute. Normally my mind reels at such solutions. Dependent on unwritten features of the tools, order of parameters, position of return results etc.
However, as the Spin interpreter is cast in silicon why not go with it?
As for Zog, as long as we have 3 longs in a row we are in business.
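To make the "3 longs in a row" idea concrete for C, here is a small host-side model of the convention described above: a command long, a result pointer, and a block of result, op1, op2. Hub RAM is faked with an array, and all names and addresses are made up for the sketch. The stand-in handler does an integer add instead of real floating point.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of hub RAM as longs; a byte address is index * 4. */
static uint32_t hub[16];

static uint32_t hub_rdlong(uint32_t addr)             { return hub[addr / 4]; }
static void     hub_wrlong(uint32_t v, uint32_t addr) { hub[addr / 4] = v; }

/* Hypothetical layout: two mailbox longs, then one argument block. */
enum {
    CMD_CALL   = 0,   /* F32CmdCall: the call instruction, 0 = idle   */
    RESULT_PTR = 4,   /* F32ResultPtr: hub address of the result long */
    ARGS       = 16   /* result, op1, op2 as three longs in a row     */
};

/* Caller side: fill op1/op2, then the same two assignments as in Spin. */
static void f32_request(uint32_t call_ins, uint32_t a, uint32_t b)
{
    assert(hub_rdlong(CMD_CALL) == 0);  /* previous call must be finished */
    hub_wrlong(a, ARGS + 4);
    hub_wrlong(b, ARGS + 8);
    hub_wrlong(ARGS, RESULT_PTR);
    hub_wrlong(call_ins, CMD_CALL);     /* written last: starts the cog */
}

/* Stand-in for the cog's handler; integer add replaces the real maths. */
static void fake_cog_handler(void)
{
    if (hub_rdlong(CMD_CALL) == 0)
        return;
    uint32_t ret_ptr = hub_rdlong(RESULT_PTR);
    uint32_t a = hub_rdlong(ret_ptr + 4);   /* first parameter   */
    uint32_t b = hub_rdlong(ret_ptr + 8);   /* second parameter  */
    hub_wrlong(a + b, ret_ptr);             /* result long       */
    hub_wrlong(0, CMD_CALL);                /* signal completion */
}
```

The point of the layout is that the cog only ever needs one address: the parameters are found at fixed offsets from the result long, whether that long lives on a Spin stack or in a C struct.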
Yep, it's a bit scary. As I understand it, when performing a subroutine call Spin pushes a 0 on the stack as the return value (which is why 'result' is already initialized), then pushes the parameters on as well. The thing is, now that my "dispatch table" consists of the full "call #some_address" long, there is no room to specify both the command and the address of the parameters, so I need a bit of extra code (including an extra rdlong) inside the call handler.
I'm going to update the attachment in the 1st post in just a minute. Here's how the call handler looks now:
f32_entry               ' zero out the command register
                        mov     t1, #0
                        wrlong  t1, par                 ' clear command status
                        ' try to keep 2 or fewer instructions between rd/wrlong
getCommand              rdlong  :execCmd, par wz        ' wait for command to be non-zero, and store it in the call location
              if_z      jmp     #getCommand
                        rdlong  ret_ptr, ret_ptr_ptr
                        add     ret_ptr, #4
                        rdlong  fNumA, ret_ptr
                        add     ret_ptr, #4
                        rdlong  fNumB, ret_ptr
                        sub     ret_ptr, #8
:execCmd                nop                             ' execute command, which was replaced by getCommand
:finishCmd              wrlong  fnumA, ret_ptr          ' store the result
                        jmp     #f32_entry              ' wait for next command
If anyone has a way to speed up / shorten this, I would appreciate it (note that defining a long named Zero helps neither the speed (2 instructions between rdlongs) nor the size (as Zero itself takes up a long)). However, with this stuff and a few other optimizations in place, we now have 71 longs free in the Cog!
FFloat is quite a bit shorter (we now have 80 longs free in the Cog), but I am holding off uploading this version yet because...
Has anyone tried using FTrunc or FRound? The functions inside the unmodified Float32 don't call _Pack again after manipulating bits, and I'm getting weird answers. Can anyone verify for me?
Ah, I see my mistake. For some reason I was expecting the output to _be_ a float, and was using FloatString to aid me in my mistake. OK, back to the trenches.
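For anyone who trips over the same thing: a truncating conversion hands back a 32-bit integer, so feeding that result to a float formatter reinterprets integer bits as a float. A rough C illustration of the effect follows (plain C casts, not the F32 code; the function names are invented):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Truncate toward zero to a 32-bit integer, as a truncating conversion does.
   The caller must keep the value inside the int32_t range. */
static int32_t trunc_to_int(float f)
{
    return (int32_t)f;
}

/* Misread an integer result as if its 32 bits were an IEEE-754 float. */
static float misread_as_float(int32_t n)
{
    float f;
    assert(sizeof f == sizeof n);   /* both must be 32 bits wide */
    memcpy(&f, &n, sizeof f);       /* reinterpret the bits, no conversion */
    return f;
}
```

Here `trunc_to_int(75.9f)` is the integer 75, while `misread_as_float(75)` is a tiny denormal nowhere near 75.0, which is the kind of "weird answer" a float-to-string routine would then print.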
That's nearly enough to put all the missing functions from Float32A in there. Zog is going to love this. And pretty much everyone else.
I presume you have removed the redundant log functions already.
If everything does not quite fit it might be time to put all the code in there and have some #ifdefs around so that users can select which functions they may need for a particular application.
Not sure I follow your new Spin interface set up yet. Need some morning coffee.
OK, we now have 110 longs free in the Cog!! I will update the attachment in the 1st post.
I would love to have some volunteer testing on this version. I had to rewrite the sin/cos/log/exp stuff to all use the same table lookup / interpolation code. (The original version of Float32 had some issues with large values for exp, both positive and negative.)
Other than looking at pow and compare, I'm done for now. If anyone has any feedback / issues, please let me know. The next step is to start adding functionality.
ret_ptr_ptr is a LONG somewhere in DAT that is set up by Spin prior to launching the COG.
C or C++ code has no "linkage" with PASM like Spin does, so C/C++ does not know where ret_ptr_ptr is in order to set it. This makes creating a C/C++ wrapper for float32 a bit tricky.
Is it possible to get ret_ptr_ptr passed in at start up via PAR instead?
P.S.
Not sure if you are aware of this but the procedure I use is to compile Spin objects with BSTC with an option that outputs only the compiled PASM as a binary blob throwing away all the Spin stuff. This blob is then converted to an object file and linked in with the C code. Then a C wrapper is created for it replacing the Spin interface.
In this way it is possible to create C++ objects that mirror the original Spin objects.
This does not work so well if there are locations in the blob that need values "poking" in before it is started.
@Heater: Well, those 1st 2 instructions aren't initialization, per se. They are used to clear the command variable to 0, signifying that we're done with the requested operation. Since I have to do that every time, I just put those 2 instructions in the entry code so I don't _also_ have to zero the command long when I call the .Start routine. Since I need to keep them as instructions, I can not use them as variables as well. And you are right, t5 is no longer needed, nice catch.
The example code you posted works fine for me, both under the propeller tool (1.2.7v2) and bst (0.19.3). Not sure what to say there.
EDIT: the address f32_RetAB_ptr is just par+4, but let me think about this...for Zog it would probably be fastest to have: command, result, op1, op2 as 4 consecutive longs, so I'd like a calling method that works equally well for Zog and for Spin, not requiring a second pointer in Zog's case, but allowing one in Spin's case.
EDIT 2: ah, t1 & t2 are getting stomped, try putting those back as "long 0"'s at the end of the cog.
This is weird then.
The attached little floating point test can give me 700, the correct result or 0 depending on if I comment out a couple of unrelated lines before the FAdd or not.
I'm using BST 0.19.4 pre 2 and downloaded a fresh copy of your code to be sure I had not buggered it up.
You know the optimization I spoke of to zero the command long on PASM entry, saving a single Spin assignment? Yeah, that was dumb. If you called start, then sent a command before the cog had loaded, that initial setting cmd to 0 would cause the 1st function to fail, as the caller code thought that Float32 had finished already!
Fixed, saved another long, removed the pointer to a pointer, updating attachment now, give me 30 seconds.
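The failure mode is easy to reproduce on paper. Below is a tiny host-side model in C, with invented names and a placeholder result value, that replays the sequence "start, post a command before the cog is running, cog clears the command on entry":

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t command; uint32_t result; } mailbox;

/* What the cog does when it finally starts running.
   clear_on_entry models the withdrawn "zero the command in PASM" idea. */
static void cog_boot(mailbox *mb, int clear_on_entry)
{
    if (clear_on_entry)
        mb->command = 0;        /* wipes out any request already posted */
}

/* One pass of the command loop; the "work" is just storing a marker. */
static void cog_poll(mailbox *mb)
{
    if (mb->command != 0) {
        mb->result = 0xF32;     /* placeholder result */
        mb->command = 0;
    }
}

/* Caller posts a request before the cog has finished loading.
   Returns 1 if the request was really serviced, 0 if it was lost. */
static int early_request_survives(int clear_on_entry)
{
    mailbox mb = { 0, 0 };      /* start code zeroes the command long */
    mb.command = 1;             /* caller is faster than the cog load */
    cog_boot(&mb, clear_on_entry);
    cog_poll(&mb);
    assert(mb.command == 0);    /* either way the caller sees "done" */
    return mb.result == 0xF32;
}
```

With `clear_on_entry` set the caller still sees the command go to zero, but no work was done, which matches the first call failing as described. Zeroing the command long before launching the cog avoids it.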
Is the attachment in the first post in this thread the latest version? The reason I ask is that the change comments in the attached file do not appear to match the history shown in the first post.
Thanks, everybody. A new version is up, which supposedly has full functionality! PLEASE TEST! I implemented ATan2 using CORDIC, then rewrote ATan and ASin and ACos in terms of ATan2...should be faster, and hopefully more accurate.
@T.A.L.: sorry, I should have added in some bare minimum comments. There is now a tiny disclaimer at the top, noting this is a modified version, and neither Cam nor Parallax should be blamed for anything! [8^)
Lonesock, that's incredible. I have a C wrapper around your float32 for zog that almost works. I'm away from home for a couple of days so testing is on hold until Thursday.
OK, faster multiply and divide. (You will notice if calling from PASM, not Spin). I only needed to do 26 iterations in the divide routine instead of 30, and only 24 iterations in multiply instead of 32. There are only 24 bits in the mantissa, after all, including the implied 1, so after packing I'm getting bit identical results. If we were keeping intermediate values around, there would be some value to keeping more mantissa bits, but that would require a major overhaul to the architecture, and the calling would be complex.
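The iteration count for multiply follows from the mantissa width: a shift-and-add loop needs one pass per multiplier bit. A minimal C sketch of that idea is below. It is a generic shift-and-add multiply written for illustration, not a transcription of the PASM routine, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiply of two 24-bit mantissas (implied 1 included).
   One loop pass per multiplier bit, so 24 passes cover every bit. */
static uint64_t mul_mantissa24(uint32_t a, uint32_t b)
{
    assert(a < (1u << 24) && b < (1u << 24));
    uint64_t product = 0;
    uint64_t addend = a;
    for (int i = 0; i < 24; i++) {
        if (b & 1u)
            product += addend;  /* add a << i when bit i of b is set */
        addend <<= 1;
        b >>= 1;
    }
    return product;             /* full 48-bit product */
}
```

Any passes beyond the 24th would only ever see zero bits in the multiplier, so they cannot change the product.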
There are now 26 longs free. If no one has any need of them, I can put them into the CORDIC table for ATan2 and speed that up a bit. Right now I only use 11 table entries for the arctan of the angles, because all entries after that are simply a >>1 of the previous table entry. This adds 3 longs to my CORDIC main loop, though, so the current version saves about 16 longs. I can gain speed at the expense of those 16 longs, but if they aren't in use for anything anyway...
Please let me know if anybody has any problems, ideas, or requests.
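For readers who have not met CORDIC, here is a rough C sketch of the vectoring-mode ATan2 idea with an 11-entry table and halving for the later steps, as described above. It uses doubles where the cog code would use integer shifts, and every name in it is invented, so treat it as an illustration of the technique rather than the F32 routine.

```c
#include <assert.h>

#define CORDIC_ITERS 24
#define TABLE_SIZE   11

/* atan(2^-i) in radians for i = 0..10; later entries are derived by halving. */
static const double atan_table[TABLE_SIZE] = {
    0.7853981633974483, 0.4636476090008061, 0.24497866312686414,
    0.12435499454676144, 0.06241880999595735, 0.031239833430268277,
    0.015623728620476831, 0.007812341060101111, 0.0039062301319669718,
    0.0019531225164788188, 0.0009765621895593195
};

/* Vectoring-mode CORDIC: rotate (x, y) onto the positive x axis and
   accumulate the rotation angle, which approximates atan2(y, x). */
static double cordic_atan2(double y, double x)
{
    const double pi = 3.141592653589793;
    double angle = 0.0;
    assert(x != 0.0 || y != 0.0);       /* angle of (0, 0) is undefined */
    if (x < 0.0) {                      /* fold the left half-plane over */
        angle = (y >= 0.0) ? pi : -pi;
        x = -x;
        y = -y;
    }
    double step = atan_table[0];
    double scale = 1.0;                 /* 2^-i, a right shift in fixed point */
    for (int i = 0; i < CORDIC_ITERS; i++) {
        if (i > 0)
            step = (i < TABLE_SIZE) ? atan_table[i] : step * 0.5;
        double nx, ny;
        if (y > 0.0) {                  /* rotate clockwise */
            nx = x + y * scale;
            ny = y - x * scale;
            angle += step;
        } else {                        /* rotate counter-clockwise */
            nx = x - y * scale;
            ny = y + x * scale;
            angle -= step;
        }
        x = nx;
        y = ny;
        scale *= 0.5;
    }
    return angle;
}
```

The halving works because atan(2^-i) is very nearly 2^-i once i is large, so each later entry is almost exactly half the previous one. Storing more real table entries trades longs for removing the halving step from the loop.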
I did a little testing to compare original and one cog version.
Sin, -360 to 360 in steps of 1: max difference = 0.0012 between original and new. The original is more accurate when compared to XCALC and Excel. The new version has errors in the 3rd decimal place (0.00x).
Tan: most are close, but off by 4 at 89 degrees: 57.2889 (old) vs. 61.34064 (new).
For the arc trig functions I did arcF(F(x)). Most are not too different. If you round to an integer degree they match. A truncation will sometimes result in a 1 degree difference.
I didn't check for speed differences. Gaining a cog for a slight reduction in accuracy is a good trade off.
Comments
Had a quick peek at your code and a snooze, woke up with the following idea:
Your dispatch table is cunning. Brilliant, in fact: it solves the problem with my technique that we have been discussing. With what you have there, it would now be possible to do away with the command lookup in getCommand and save 10 LONGs in the main loop :)
1) Delete all the command definitions in CON and use those command names to label the entries in your dispatch table.
2) Rearrange the mailbox interface to three longs:
command, oper1, oper2
Then in the Spin side we use something like:
Then replace the getCommand loop with something that has no dispatch table look up but uses the command (now a complete call instruction) directly:
Note: there is a little init section now to set up command and opers from PAR on start up. This area can be overlaid with vars and so takes no space.
Note: The command call instruction list is outside of the FIT and so takes no space in COG.
Note: Because the dispatch list is at the end of the PASM, C code in ZOG that uses the PASM extracted as a binary blob can find the required command constants at the end of the blob. This provides the required "linkage".
Note: This requires changing all the Spin interface routines and slowing them down a bit. Might be worth it for those 10 LONGs though.
The Spin code has 2 longs back to back: F32CmdCall & F32ResultPtr. You would call a function as follows:
F32ResultPtr := @result
F32CmdCall := F32AddCmd
repeat while F32CmdCall
In Spin, the return value is a long, just in front of the input calling parameters. So a wrlong to the address in F32ResultPtr will save my result into the calling function's result value automatically. And a rdlong from (the address in F32ResultPtr)+4 gets me the 1st parameter to the calling Spin function, and +8 gets me the second parameter, if any.
So, still only 2 assignments to call, which should be fairly close to 1 assignment and 1 addition. In Zog's case, you just need 3 Hub longs in a row: result, op1, op2.
Jonathan
Ouch, I had to think about that for a minute. Normally my mind reels at such solutions. Dependent on unwritten features of the tools, order of parameters, position of return results etc.
However, as the Spin interpreter is cast in silicon why not go with it?
As for Zog, as long as we have 3 longs in a row we are in business.
I'm going to update the attachment in the 1st post in just a minute. Here's how the call handler looks now:
If anyone has a way to speed up / shorten this, I would appreciate it (note that defining a long named Zero helps neither the speed (2 instructions between rdlongs) nor the size (as Zero itself takes up a long)). However, with this stuff and a few other optimizations in place, we now have 71 longs free in the Cog!
Jonathan
Has anyone tried using FTrunc or FRound? The functions inside the unmodified Float32 don't call _Pack again after manipulating bits, and I'm getting weird answers. Can anyone verify for me?
Jonathan
John Abshier
John Abshier
thanks,
Jonathan
FTrunc( 323258300000.0 ) truncates to 75.
Jonathan
83 longs available now, I'm almost ready to start shoe-horn-ing in some more functionality.
Jonathan
83 !
Astounding piece of work.
That's nearly enough to put all the missing functions from Float32A in there. Zog is going to love this. And pretty much everyone else.
I presume you have removed the redundant log functions already.
If everything does not quite fit it might be time to put all the code in there and have some #ifdefs around so that users can select which functions they may need for a particular application.
Not sure I follow your new Spin interface set up yet. Need some morning coffee.
I would love to have some volunteer testing on this version. I had to rewrite the sin/cos/log/exp stuff to all use the same table lookup / interpolation code. (The original version of Float32 had some issues with large values for exp, both positive and negative.)
Other than looking at pow and compare, I'm done for now. If anyone has any feedback / issues, please let me know. The next step is to start adding functionality.
Jonathan
What?!
Oh yeah, I forgot, this is Propeller Land where impossible things happen every day.
I'll get on with some testing ASAP.
Zog has a little problem with ret_ptr_ptr.
ret_ptr_ptr is a LONG somewhere in DAT that is set up by Spin prior to launching the COG.
C or C++ code has no "linkage" with PASM like Spin does, so C/C++ does not know where ret_ptr_ptr is in order to set it. This makes creating a C/C++ wrapper for float32 a bit tricky.
Is it possible to get ret_ptr_ptr passed in at start up via PAR instead?
P.S.
Not sure if you are aware of this but the procedure I use is to compile Spin objects with BSTC with an option that outputs only the compiled PASM as a binary blob throwing away all the Spin stuff. This blob is then converted to an object file and linked in with the C code. Then a C wrapper is created for it replacing the Spin interface.
In this way it is possible to create C++ objects that mirror the original Spin objects.
This does not work so well if there are locations in the blob that need values "poking" in before it is started.
You can save another two longs in your version: just put variable names as labels on the first two instructions and delete those variables:
t5 does not seem to be used anywhere.
Gives me the result of 800 instead of 1200.
Edit: However this:
Which should be the same gives me 400.
Edit: More however, this works:
The example code you posted works fine for me, both under the propeller tool (1.2.7v2) and bst (0.19.3). Not sure what to say there.
EDIT: the address f32_RetAB_ptr is just par+4, but let me think about this...for Zog it would probably be fastest to have: command, result, op1, op2 as 4 consecutive longs, so I'd like a calling method that works equally well for Zog and for Spin, not requiring a second pointer in Zog's case, but allowing one in Spin's case.
EDIT 2: ah, t1 & t2 are getting stomped, try putting those back as "long 0"'s at the end of the cog.
Jonathan
The attached little floating point test can give me 700, the correct result or 0 depending on if I comment out a couple of unrelated lines before the FAdd or not.
I'm using BST 0.19.4 pre 2 and downloaded a fresh copy of your code to be sure I had not buggered it up.
Fixed, saved another long, removed the pointer to a pointer, updating attachment now, give me 30 seconds.
Jonathan
I should go away on vacation more often - come back after a week, more great Prop stuff!
I almost have a C wrapper for it ready, perhaps some testing tomorrow.
Looks like Zog is about to give Catalina a run for its money in the Whetstone races :)
@T.A.L.: sorry, I should have added in some bare minimum comments. There is now a tiny disclaimer at the top, noting this is a modified version, and neither Cam nor Parallax should be blamed for anything! [8^)
Jonathan
OK, faster multiply and divide. (You will notice if calling from PASM, not Spin). I only needed to do 26 iterations in the divide routine instead of 30, and only 24 iterations in multiply instead of 32. There are only 24 bits in the mantissa, after all, including the implied 1, so after packing I'm getting bit identical results. If we were keeping intermediate values around, there would be some value to keeping more mantissa bits, but that would require a major overhaul to the architecture, and the calling would be complex.
There are now 26 longs free. If no one has any need of them, I can put them into the CORDIC table for ATan2 and speed that up a bit. Right now I only use 11 table entries for the arctan of the angles, because all entries after that are simply a >>1 of the previous table entry. This adds 3 longs to my CORDIC main loop, though, so the current version saves about 16 longs. I can gain speed at the expense of those 16 longs, but if they aren't in use for anything anyway...
Please let me know if anybody has any problems, ideas, or requests.
thanks,
Jonathan
Sin, -360 to 360 in steps of 1: max difference = 0.0012 between original and new. The original is more accurate when compared to XCALC and Excel. The new version has errors in the 3rd decimal place (0.00x).
Tan: most are close, but off by 4 at 89 degrees: 57.2889 (old) vs. 61.34064 (new).
For the arc trig functions I did arcF(F(x)). Most are not too different. If you round to an integer degree they match. A truncation will sometimes result in a 1 degree difference.
I didn't check for speed differences. Gaining a cog for a slight reduction in accuracy is a good trade off.
John Abshier