Any ways to speed up Float32?
lonesock
Posts: 917
EDIT
faster routines: _Pack, Add, Sub, Sqr, Sin, Cos, Tan, Float, Exp*, Log*, Pow, Mul, Div
corrected routines: Trunc, Round, Exp*
added routines: Exp2, Log2 (Spin calling code only, no cog bloat), FMod, ATan, ATan2, ACos, ASin, Floor, Ceil
Sept 29, 2010 - fixed some comments, added UintTrunc, only 4 longs free [8^(
Sept 28, 2010 - Added FloatTrunc & FloatRound, converted to all ASCII with CRLF end of line.
Sept 27, 2010 - renamed to F32, faster table interpolation and Tan, added changelog to file
Sept 24, 2010 - fixed the trig functions (had a bug in my table interpolation code)
Sept 21, 2010 - faster multiply and divide - 26 longs free
Sept 20, 2010 - added in all the functionality from Float32Full - 14 longs free
Sept 15, 2010 - fixed race condition on startup, 111 longs free
Sept 14, 2010 - fixed Exp*, sped up Exp*, Log*, Pow, Cos and Sin again
Sept 13, 2010 (PM) - fixed Trunc and Round to now do the right thing for large integers. 83 longs available
Sept 13, 2010 (AM) - new calling convention. 71 longs available in-Cog
/EDIT
OK, knowing full well that much of the time in Float32 is wasted in Spin calling overhead (oh how I wish for a 'waitlock' command!), I'm still wondering if we can pare it down for space reasons, and for a bit of a speed bump if being called from assembly.
Here's a smaller & faster version of the _Pack PASM routine:
'------------------------------------------------------------------------------
' input:   flagA: fnumA flag bits (NaN, Infinity, Zero, Sign)
'          expA:  fnumA exponent (no bias)
'          manA:  fnumA mantissa (aligned to bit 29)
' output:  fnumA: 32-bit floating point value
' changes: fnumA, flagA, expA, manA
'------------------------------------------------------------------------------
_Pack                   cmp     manA, #0 wz             ' check for zero
          if_z          mov     expA, #0
          if_z          jmp     #:exit1
                        sub     expA, #380              ' take us out of the danger range for djnz
:normalize              shl     manA, #1 wc             ' normalize the mantissa
          if_nc         djnz    expA, #:normalize       ' adjust exponent and jump
                        add     manA, #$100 wc          ' round up by 1/2 lsb
                        addx    expA, #(380 + 127 + 2)  ' add bias to exponent, account for rounding (in flag C, above)
                        mins    expA, Minus23
                        maxs    expA, #255
                        abs     expA, expA wc,wz        ' check for subnormals, and get the abs in case it is
          if_a          jmp     #:exit1
:subnormal              or      manA, #1                ' adjust mantissa
                        ror     manA, #1
                        shr     manA, expA
                        mov     expA, #0                ' biased exponent = 0
:exit1                  mov     fnumA, manA             ' bits 22:0 mantissa
                        shr     fnumA, #9
                        movi    fnumA, expA             ' bits 23:30 exponent
                        shl     flagA, #31
                        or      fnumA, flagA            ' bit 31 sign
_Pack_ret               ret
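For readers more comfortable in C, the field assembly at :exit1 maps directly onto the IEEE 754 single layout: sign in bit 31, biased exponent in bits 30:23, mantissa in bits 22:0. Here is a sketch of just that final packing step (a hypothetical helper, not part of F32, and without the normalization/rounding/subnormal handling above):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the final field packing in _Pack (IEEE 754 binary32):
   bit 31 = sign, bits 30:23 = biased exponent, bits 22:0 = mantissa. */
static uint32_t pack_fields(uint32_t sign, uint32_t biased_exp, uint32_t mantissa23)
{
    return (sign << 31) | ((biased_exp & 0xFF) << 23) | (mantissa23 & 0x7FFFFF);
}
```

For example, pack_fields(0, 127, 0) gives the bit pattern of 1.0f, and pack_fields(1, 128, 0x400000) gives -3.0f.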
I'm sure we could save space and time if we didn't handle NaN, +- infinity, etc., but that would basically require maintaining two versions, and I'm sure many people want as close to IEEE 754 compliance as possible.
Any thoughts?
Jonathan
Comments
If raw speed is important, how about using the user defined function mechanism? You essentially have a floating point interpreter with an assembly language engine designed for short sequences of floating point operations with low overhead.
Jonathan
Jonathan
Not bit identical. In ~0.5% of my test cases, the least significant bit of the original Float32 result is high while mine is low, yielding about a 1e-5% delta. In the other 99.5% of cases the outputs were bit identical. I'm not sure which is more accurate.
Jonathan
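That last-bit disagreement can be quantified by comparing raw bit patterns: one ULP at these magnitudes is a relative error of about 2^-23, i.e. roughly the 1e-5% mentioned. A small C helper for measuring it (my own illustration, not part of Float32; valid for same-sign finite floats):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Distance in least-significant-bit steps between two same-sign floats,
   found by comparing their raw IEEE 754 bit patterns. */
static uint32_t ulp_distance(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);  /* type-pun safely via memcpy */
    memcpy(&ub, &b, sizeof ub);
    return ua > ub ? ua - ub : ub - ua;
}
```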
You would pass it a string that contains the operations to be performed, in order, and the 32-bit address of each number it needs to use in the operation. Then the Spin caller would return with the result of the function after waiting for the ASM code to finish.
E.g.
FP.FPCompute(string("(%1 + %2) / %3"), @valueOf1, @valueOf2, @valueOf3)
But... Spin can't handle variable-length expressions.
Here's the add/sub code, 3 longs shorter:
The last functions to be optimized use the tables heavily (log, exp, trig stuff, etc.), so I'm not sure I can do much, though I haven't looked at them yet. Does anybody care about / use those?
Jonathan
John Abshier
Gosh, I'm just now looking at how to get Zog to use float32 and you are already optimising it.
Zog will appreciate all the speed it can get from trig functions for the whetstone benchmark:)
Here's a tiny bit slower (3 instructions, so 12 clocks 1-time cost) dispatch table, but it saves 15 longs. It was a bit tricky, in that it's a 'call' dispatch table, instead of the simpler 'jmp' dispatch table, so there is an extra layer of indirection via self-modifying code.
Jonathan
I've been looking at the trig in Float32: it uses the builtin tables with interpolation, so there are places where you only get 5 digits of accuracy. For example:
sin( -3.337947 ) yields:
actual => 0.195095035
float32 => 0.1950913
mine => 0.1950914
I sped up Sin/Cos (and thus Tan) while still using the table. My implementation suffers in one extra area: for tiny values, sin(t) ~= t. The old code handles this well, whereas my new code ends up truncating to 0.
So, for people actually using the trig stuff, how much accuracy do you expect? I was hoping for another 2 digits or so, but the sine table in ROM is only 16 bits, so we really are limited there.
Options for moving forward:
1 - leave it as is
2 - use my faster version and don't expect sin( 1e-10 ) to equal 1e-10
3 - switch to a CORDIC implementation
4 - still use the table, but try quadratic interpolation instead of linear...would be slower
Any votes?
Jonathan
P.S. Here's the faster version:
start := cnt
a := f.FSin(b)
end := cnt
and report end - start for new and old versions.
How small is "tiny"? 0.1 degrees, 0.01, 0.001? For me 1e-10 is zero.
John Abshier
I'm using trig functions for navigation with GPS.
At the moment I'm using an integer solution.
In my case I usually need an approximate set of functions (heading, VMG can accept a lower precision than cross track, due to error propagation).
Would it be possible to implement a fast and dirty set of trigs and a set of accurate but slow, and use them accordingly with the needs?
Massimo
@John: ~ 2x faster. Sin and Cos go from about 7100 counts to 3900 counts, and Tan goes from ~13400 to ~6000 counts. However I'm also thinking of Zog calling, not just Spin calling. Sin(t) goes to 0 for t (in radians) smaller than ~ 1e-7 (which makes sense given a 23-bit mantissa). It would be fairly easy to special case, but it won't increase the accuracy over the rest of the input range.
@Massimo: Certainly. The thing is, there _are_ no accurate trig functions on the Propeller right now. I had just assumed that the sine and cosine calcs in Float32 were as accurate as possible while fitting into the single-precision float representation. I guess I'm asking: is the status quo good enough, or should I work on a more accurate version?
Thanks for the feedback, everybody.
Jonathan
John Abshier
It's great you are looking into this.
Having done a lot of lab work as a physics undergrad, it was drummed into my head not to quote your results to a higher accuracy than your measurements. Just because you have 10 digits on your calculator does not mean you should write them all down at the end. An awareness of sources of error and analysis of the accuracy of the final result is all part of the exercise.
So I'm prepared for lesser floating point trig precision on the Prop, which generally gets its inputs from real-world sensors of limited accuracy. With Zog, if you want more accuracy you can use the zpu-gcc math lib functions and accept the speed hit.
When working with Spin or Zog I'm not even sure speed is such a major issue. Things are pretty leisurely anyway.
What would be great is to get all these functions into one Cog.
Looking at Zog currently we might have all these COGs in use:
1) Zog interpreter.
2) VMCog
3) Float part 1
4) Float part 2
5) Debug serial port
At this point we don't have much of a Prop left for any real application and might as well be using an ARM.
Folding all functions into one Cog may be a tall order.
What about this:
Do not use jump tables!
Change the interface so it is: command, op1, op2 but arrange that the command is actually the PASM address of the float function we want. Then the command loop only has to grab the command and jump to it.
Edit: Bah! Scratch that, just realized why all those immediate ops are there.
I will keep working on ensmallenating the math stuff, and see what else I can fit into that cog. Does anyone have a priority list (personal preference is fine) of what should be shoehorned into the remaining space?
I could also remove all but one version of log and exp, etc., knowing log base x of y is ln(y)/ln(x)...which is how it's calculated internally anyway. We don't need log_2, log_e, log_10, etc. in the PASM code; those could be wrapped from the Spin calling code.
Finally, I bet you could also fit a limited serial engine into another cog, say 115_200 only, half duplex (maybe even tx only) for debugging.
Jonathan
Direct jumps would work for a jump table, but Float32 uses calls (as many functions in there require other routines as support), and the compiler needs to know the _ret address at compile time, so it gets tricky, unless I'm missing something?
Jonathan
This reminds me that the first scientific calculator I ever saw was a Sinclair Scientific. It only displayed a 5 digit mantissa and most of those digits were wrong after a few steps of trig functions:)
http://en.wikipedia.org/wiki/Sinclair_Scientific
"ensmallenating" brilliant. Must be the opposite to my usual "enormolizing":)
What, you already have space to spare?
For sure dump all the redundant logs. Zog can work it out.
As for the jump table, if I understand correctly currently each operation is passed two longs via PAR which are addresses of the operands. But one of them also holds command bits at the high end.
I'm suggesting that those command bits be the actual PASM address of the function to be called instead of an index into the command table.
So the get command loop would be something like:
This drops the test for out of range/invalid command values but we are not going to do that are we:)
It's a lot quicker and smaller.
But how does Spin or Zog know the command function addresses?
a) We can compile the PASM, note the cog addresses, and hard-wire them manually into the Spin/Zog code.
b) Spin can determine them beforehand from address calculations.
c) Put a list at a known offset in the COG that Zog can read.
d) Extract them from the compiler list file and massage them into Zog source at build time.
Either way it can be done.
I would love to get rid of the dispatch table, but even though the PASM 'call' instruction only takes one address, the compiler needs to know the address of the 'ret' instruction as well (which is why the naming scheme has to be myfunc...myfunc_ret); see prop manual 1.1, pg 268. This is why I had to go through the two-layer indirection in my shortened call dispatch table. I guess we could store both the call and ret addresses as the command ID, though it would probably take some parsing and such to regenerate the actual jmpret to be used.
I'll keep working on it though, and I'll certainly share if a solution presents itself.
Jonathan
It could be done if we used JMPRET instead of CALL. This way we can store the return address anywhere, say R. Functions like _FAdd would not use RET but jump back through R.
Now _FMod, say, would move R to R', do the same JMPRET to _FAdd, and then move R' back to R. A two-level stack, if you like.
Could be made to work if calls were not nested more than that. But I guess the extra instructions required to implement it negate the removal of the jump table.
Oh well.
I'm adding the modified version to the 1st post, in case anyone wants to play with it.
Jonathan
Me, right, how?
I thought I'd convinced myself that what I suggested was impossible or not worth while.
Too tired to look at code just now but here is a list of all the float functions required by the Whetstone benchmark. Which, as you know, is the most important thing Zog has to run at speed because Catalina can:)
Should be clear what they are; if not, they are specified here: http://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
Guess we just have to forget all the functions that operate on doubles (8-byte floats).
Yeah I noticed that, very cunning.
Problem is that it might be a bit tough to use with the Zog byte code interpreter for C if the C code has taken over all of HUB RAM and there is no Spin code or the original PASM left in HUB.