Any ways to speed up Float32?
EDIT
faster routines: _Pack, Add, Sub, Sqr, Sin, Cos, Tan, Float, Exp*, Log*, Pow, Mul, Div
corrected routines: Trunc, Round, Exp*
added routines: Exp2, Log2 (Spin calling code only, no cog bloat), FMod, ATan, ATan2, ACos, ASin, Floor, Ceil
Sept 29, 2010 - fixed some comments, added UintTrunc, only 4 longs free [8^(
Sept 28, 2010 - Added FloatTrunc & FloatRound, converted to all ASCII with CRLF end of line.
Sept 27, 2010 - renamed to F32, faster table interpolation and Tan, added changelog to file
Sept 24, 2010 - fixed the trig functions (had a bug in my table interpolation code)
Sept 21, 2010 - faster multiply and divide - 26 longs free
Sept 20, 2010 - added in all the functionality from Float32Full - 14 longs free
Sept 15, 2010 - fixed race condition on startup, 111 longs free
Sept 14, 2010 - fixed Exp*, sped up Exp*, Log*, Pow, Cos and Sin again
Sept 13, 2010 (PM) - fixed Trunc and Round to now do the right thing for large integers. 83 longs available
Sept 13, 2010 (AM) - new calling convention. 71 longs available in-Cog
/EDIT
OK, knowing full well that much of the time in Float32 is wasted in Spin calling overhead (oh how I wish for a 'waitlock' command!), I'm still wondering if we can pare it down for space reasons, and for a bit of a speed bump if being called from assembly.
Here's a smaller & faster version of the _Pack PASM routine:
'------------------------------------------------------------------------------
' input: flagA fnumA flag bits (Nan, Infinity, Zero, Sign)
' expA fnumA exponent (no bias)
' manA fnumA mantissa (aligned to bit 29)
' output: fnumA 32-bit floating point value
' changes: fnumA, flagA, expA, manA
'------------------------------------------------------------------------------
_Pack cmp manA, #0 wz ' check for zero
if_z mov expA, #0
if_z jmp #:exit1
sub expA, #380 ' take us out of the danger range for djnz
:normalize shl manA, #1 wc ' normalize the mantissa
if_nc djnz expA, #:normalize ' adjust exponent and jump
add manA, #$100 wc ' round up by 1/2 lsb
addx expA, #(380 + 127 + 2) ' add bias to exponent, account for rounding (in flag C, above)
mins expA, Minus23
maxs expA, #255
abs expA, expA wc,wz ' check for subnormals, and get the abs in case it is
if_a jmp #:exit1
:subnormal or manA, #1 ' adjust mantissa
ror manA, #1
shr manA, expA
mov expA, #0 ' biased exponent = 0
:exit1 mov fnumA, manA ' bits 22:0 mantissa
shr fnumA, #9
movi fnumA, expA ' bits 23:30 exponent
shl flagA, #31
or fnumA, flagA ' bit 31 sign
_Pack_ret ret
I'm sure we could save space and time if we didn't handle NaN, +/- infinity, etc., but that would basically require maintaining two versions, and I'm sure many people want as close to IEEE 754 compliance as possible.
Any thoughts?
Jonathan


Comments
If raw speed is important, how about using the user defined function mechanism? You essentially have a floating point interpreter with an assembly language engine designed for short sequences of floating point operations with low overhead.
Jonathan
'------------------------------------------------------------------------------
' _FDiv    fnumA = fnumA / fNumB
' _FDivI   fnumA = fnumA / {Float immediate}
' changes: fnumA, flagA, expA, manA, fnumB, flagB, expB, manB, t1, t2
'------------------------------------------------------------------------------
_FDivI        movs    :getB, _FDivI_ret       ' get immediate value
              add     _FDivI_ret, #1
:getB         mov     fnumB, 0
_FDiv         call    #_Unpack2               ' unpack two variables
  if_c_or_z   mov     fnumA, NaN              ' check for NaN or divide by 0
  if_c_or_z   jmp     #_FDiv_ret
              xor     flagA, flagB            ' get sign of result
              sub     expA, expB              ' subtract exponents
              mov     t1, #0                  ' clear quotient
              mov     t2, #30                 ' loop counter for divide
:divide                                       ' divide the mantissas
              cmpsub  manA, manB wc
              rcl     t1, #1
              shl     manA, #1
              djnz    t2, #:divide
              mov     manA, t1                ' get result and exit
              call    #_Pack
_FDivI_ret
_FDiv_ret     ret
Jonathan
Not bit identical: in about 0.5% of my test cases, the least significant bit of the original Float32 result is high while mine is low, yielding about a 1e-5% delta (one LSB of the 23-bit mantissa is a relative difference of roughly 2^-23, i.e. about 1e-5 percent). In the other 99.5% of cases the outputs were bit identical. I'm not sure which is more accurate.
'------------------------------------------------------------------------------
' _FSqr    fnumA = sqrt(fnumA)
' changes: fnumA, flagA, expA, manA, t1, t2, t3, t4, t5
'------------------------------------------------------------------------------
_FSqr         call    #_Unpack                ' unpack floating point value
  if_nc       mov     fnumA, #0               ' set initial result to zero
  if_c_or_z   jmp     #_FSqr_ret              ' check for NaN or zero
              test    flagA, #signFlag wz     ' check for negative
  if_nz       mov     fnumA, NaN              ' yes, then return NaN
  if_nz       jmp     #_FSqr_ret
              test    expA, #1 wz             ' if even exponent, shift mantissa
  if_z        shr     manA, #1
              sar     expA, #1                ' get exponent of root
              add     expA, #1
              mov     fnumA, #0
              mov     t2, #30
:sqrt                                         ' what is the delta root^2 if we add in this bit?
              mov     t3, fnumA
              shl     t3, #2
              add     t3, #1
              shl     t3, t2
                                              ' is the remainder >= delta?
              cmpsub  manA, t3 wc
              rcl     fnumA, #1
                                              ' get the remainder ready for the next loop, and loop
              shl     manA, #1
              sub     t2, #1 wc
  if_nc       jmp     #:sqrt
              mov     manA, fnumA             ' store new mantissa value and exit
              shr     manA, #1
              call    #_Pack
_FSqr_ret     ret
Jonathan
You would pass it a string containing the operations to be performed, in order, plus the 32-bit address of each number it needs to use in the operation. The Spin caller would then return the result of the function after waiting for the ASM code to finish.
E.g.
FP.FPCompute(string("(%1 + %2) / %3"), @valueOf1, @valueOf2, @valueOf3)
But... Spin can't handle variable-length expressions.
Here's the add/sub code, 3 longs shorter:
'------------------------------------------------------------------------------
' _FAdd    fnumA = fnumA + fNumB
' _FAddI   fnumA = fnumA + {Float immediate}
' _FSub    fnumA = fnumA - fNumB
' _FSubI   fnumA = fnumA - {Float immediate}
' changes: fnumA, flagA, expA, manA, fnumB, flagB, expB, manB, t1
'------------------------------------------------------------------------------
_FSubI        movs    :getB, _FSubI_ret       ' get immediate value
              add     _FSubI_ret, #1
:getB         mov     fnumB, 0
_FSub         xor     fnumB, Bit31            ' negate B
              jmp     #_FAdd                  ' add values
_FAddI        movs    :getB, _FAddI_ret       ' get immediate value
              add     _FAddI_ret, #1
:getB         mov     fnumB, 0
_FAdd         call    #_Unpack2               ' unpack two variables
  if_c_or_z   jmp     #_FAdd_ret              ' check for NaN or B = 0
              test    flagA, #SignFlag wz     ' negate A mantissa if negative
  if_nz       neg     manA, manA
              test    flagB, #SignFlag wz     ' negate B mantissa if negative
  if_nz       neg     manB, manB
              mov     t1, expA                ' align mantissas
              sub     t1, expB
              abs     t1, t1 wc
              max     t1, #31
  if_nc       sar     manB, t1
  if_c        sar     manA, t1
  if_c        mov     expA, expB
              add     manA, manB              ' add the two mantissas
              abs     manA, manA wc           ' store the absolute value,
              muxc    flagA, #SignFlag        ' and flag if it was negative
              call    #_Pack                  ' pack result and exit
_FSubI_ret
_FSub_ret
_FAddI_ret
_FAdd_ret     ret

The last functions to be optimized use the tables heavily (log, exp, trig stuff, etc.), so I'm not sure I can do much, though I haven't looked at them yet. Does anybody care about / use those?
Jonathan
John Abshier
Gosh, I'm just now looking at how to get Zog to use float32 and you are already optimising it.
Zog will appreciate all the speed it can get from trig functions for the whetstone benchmark:)
Here's a slightly slower (3 extra instructions, so a 12-clock one-time cost) dispatch table, but it saves 15 longs. It was a bit tricky, in that it's a 'call' dispatch table instead of the simpler 'jmp' dispatch table, so there is an extra layer of indirection via self-modifying code.
              add     :setCmd, t1             ' 1st command is offset 1
              movs    :setCmd, #:cmdTable-1   ' won't take effect until after the next op, so reset the lookup to the table's initial address
:setCmd       mov     :execCmd, :cmdTable-1   ' move the appropriate call into :execCmd
              nop                             ' need this nop to let the previous modification take effect
:execCmd      nop                             ' execute command: was replaced by :setCmd
              jmp     #endCommand             ' finish

:cmdTable     call    #_FAdd                  ' command dispatch table
              call    #_FSub
              call    #_FMul
              call    #_FDiv
              call    #_FFloat
              call    #_FTrunc
              call    #_FRound
              call    #_FSqr
              call    #cmd_FCmp
              call    #_Sin
              call    #_Cos
              call    #_Tan
              call    #_Log
              call    #_Log10
              call    #_Exp
              call    #_Exp10
              call    #_Pow
              call    #_Frac
Jonathan
(C++ software is always efficient, since some compilers treat the Propeller as a VLIW processor, getting the best in return.)
I've been looking at the trig in Float32: it uses the builtin tables with interpolation, so there are places where you only get 5 digits of accuracy. For example:
sin( -3.337947 ) yields:
actual => 0.195095035
float32 => 0.1950913
mine => 0.1950914
I sped up Sin/Cos (and thus Tan) while still using the table. My implementation suffers in one extra area: for tiny arguments, sin(t) ~= t; the old code handles this well, whereas my new code ends up truncating to 0.
So, for people actually using the trig stuff, how much accuracy do you expect? I was hoping for another 2 digits or so, but the sine table in ROM is only 16 bits (a step of 1/65536, roughly 1.5e-5), so we really are limited there.
Options for moving forward:
1 - leave it as is
2 - use my faster version and don't expect sin( 1e-10 ) to equal 1e-10
3 - switch to a CORDIC implementation
4 - still use the table, but try quadratic interpolation instead of linear...would be slower
Any votes?
Jonathan
P.S. Here's the faster version:
'------------------------------------------------------------------------------
' _Sin     fnumA = sin(fnumA)
' _Cos     fnumA = cos(fnumA)
' changes: fnumA, flagA, expA, manA, fnumB, flagB, expB, manB
' changes: t1, t2, t3, t4, t5, t6
'------------------------------------------------------------------------------
' convert sin to cos, since cos(-t) == cos(t) means I can ignore the sign of t
_Sin          call    #_FSubI                 ' cos(x) = sin(x + pi/2)
              long    pi / 2.0                ' sin(x) = cos(x - pi/2)
_Cos          call    #_FDivI                 ' rescale angle from [0..2pi] to [0..1]
              long    2.0 * pi
              ' now, work with the raw value
              call    #_Unpack                ' get the whole and fractional bits
              add     expA, #3                ' bias the exponent by 3 so the resulting data will be full 32-bit aligned
              ' fractional
              mov     t1, expA                ' make a copy of the exponent
              mov     t5, manA                ' and a copy of the mantissa
              add     t1, #13                 ' bias by the number of bits I will lose
              abs     t1, t1 wc               ' find out which direction I need to shift
              max     t1, #31                 ' limit it so no weird rollover stuff happens
  if_c        shr     t5, t1                  ' -exp: shift right to bring down to 1.0
  if_nc       shl     t5, t1                  ' +exp: shift left to throw away the high bits
              abs     expA, expA wc           ' was the exponent positive or negative?
              max     expA, #31               ' limit to 31, otherwise we do weird wrapping things
  if_c        shr     manA, expA              ' -exp: shift right to bring down to 1.0
  if_nc       shl     manA, expA              ' +exp: shift left to throw away the high bits
              ' manA now holds angle-bits
              add     manA, bit30             ' convert from cos back to sin
              'mov     t5, manA               ' get the fractional part by discarding the top 13 bits
              'shl     t5, #13
              mov     fnumA, manA             ' and keep the integer part (top 13 bits) in manA
              shr     fnumA, #(32-13)
              ' get the 1st point
              test    fnumA, Sin_90 wc        ' set C flag for quadrant 2 or 4
              test    fnumA, Sin_180 wz       ' set Z flag for quadrant 3 or 4
              negc    fnumA, fnumA            ' if quadrant 2 or 4, negate offset
              or      fnumA, SineTable        ' blend in sine table address
              shl     fnumA, #1               ' get table offset
              rdword  t2, fnumA               ' get first table value
              shl     t2, #14                 ' align to bit 29
              negnz   t2, t2                  ' if quadrant 3 or 4, negate
              sumc    fnumA, #2               ' get second table value
              rdword  t3, fnumA
              shl     t3, #14
              negnz   t3, t3                  ' if quadrant 3 or 4, negate
              ' do the linear interpolation
              sub     t3, t2
:interp       sar     t3, #1
              shl     t5, #1 wc, wz
  if_c        add     t2, t3
  if_nz       jmp     #:interp
              ' the table goes to $FFFF, I wanted it to go to $10000
              mov     t5, t2
              sar     t5, #16
              add     t2, t5
              ' hack, to see the status
              'mov     fnumA, t2
              abs     manA, t2 wc
              muxc    flagA, #SignFlag
              neg     expA, #1
              call    #_Pack
_Cos_ret
_Sin_ret      ret

start := cnt
a := f.FSin(b)
end := cnt
and report end - start for the new and old versions.
How small is "tiny"? 0.1 degrees, 0.01, 0.001? For me 1e-10 is zero.
John Abshier
I'm using trig functions for navigation with GPS.
At the moment I'm using an integer solution.
In my case I usually need an approximate set of functions (heading and VMG can accept lower precision than cross-track, due to error propagation).
Would it be possible to implement one fast-and-dirty set of trig functions and one accurate-but-slow set, and use whichever fits the need?
Massimo
@John: ~ 2x faster. Sin and Cos go from about 7100 counts to 3900 counts, and Tan goes from ~13400 to ~6000 counts. However I'm also thinking of Zog calling, not just Spin calling. Sin(t) goes to 0 for t (in radians) smaller than ~ 1e-7 (which makes sense given a 23-bit mantissa). It would be fairly easy to special case, but it won't increase the accuracy over the rest of the input range.
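The special case could be as simple as a Spin-side guard along these lines (a sketch only: TableSin and the exact threshold are mine, not F32's). Below about |t| < 2^-11 radians, sin(t) equals t to within half an LSB of a single float, so the argument can just be handed back.

PUB Sin(a)
  if (a & $7F80_0000) < constant(116 << 23)   ' biased exponent < 116 means |a| < 2^-11 radians
    return a                                  ' sin(a) = a to full single precision there
  return TableSin(a)                          ' otherwise use the table-based PASM routine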
@Massimo: Certainly. The thing is, there _are_ no accurate trig functions on the Propeller right now. I had just assumed that the sine and cosine calcs in Float32 were as accurate as possible while still fitting into the single-precision float representation. I guess what I'm asking is: is the status quo good enough, or should I work on a more accurate version?
Thanks for the feedback, everybody.
Jonathan
John Abshier
It's great you are looking into this.
Having done a lot of lab work as a physics undergrad, it was drummed into my head not to quote results to a higher accuracy than the measurements. Just because you have 10 digits on your calculator does not mean you should write them all down at the end. An awareness of sources of error, and analysis of the accuracy of the final result, is all part of the exercise.
So I'm prepared for lesser floating point trig precision on the Prop, which generally gets its inputs from real-world sensors of limited accuracy. With Zog, if you want more accuracy you can use the zpu-gcc math lib functions and accept the speed hit.
When working with Spin or Zog I'm not even sure speed is such a major issue. Things are pretty leisurely anyway.
What would be great is to get all these functions into one Cog.
Looking at Zog currently we might have all these COGs in use:
1) Zog interpreter.
2) VMCog
3) Float part 1
4) Float part 2
5) Debug serial port
At this point we don't have much of a Prop left for any real application and might as well be using an ARM.
Folding all functions into one Cog may be a tall order.
What about this:
Do not use jump tables!
Change the interface so it is: command, op1, op2 but arrange that the command is actually the PASM address of the float function we want. Then the command loop only has to grab the command and jump to it.
Edit: Bah! Scratch that, just realized why all those immediate ops are there.
I will keep working on ensmallenating the math stuff, and see what else I can fit into that cog. Does anyone have a priority list (personal preference is fine) of what should be shoehorned into the remaining space?
I could also remove all but one version of log and exp, etc., since log base x of y is ln(y)/ln(x)... which is how it's calculated internally anyway. We don't need log_2, log_e, log_10, etc. in the PASM code; those could be wrapped from the Spin calling code, for example:
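(A sketch only: the method names are illustrative rather than the final F32 API, and it assumes the cog exposes just natural-log Log and Exp plus FMul/FDiv.)

PUB Log2(a)
  return FDiv(Log(a), 0.693147181)            ' log2(a) = ln(a) / ln(2)

PUB Log10(a)
  return FDiv(Log(a), 2.302585093)            ' log10(a) = ln(a) / ln(10)

PUB Exp2(a)
  return Exp(FMul(a, 0.693147181))            ' 2^a = e^(a * ln(2))

PUB Exp10(a)
  return Exp(FMul(a, 2.302585093))            ' 10^a = e^(a * ln(10))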
Finally, I bet you could also fit a limited serial engine into another cog, say 115_200 only, half duplex (maybe even tx only) for debugging.
Jonathan
Direct jumps would work for a jump table, but Float32 uses calls (as many functions in there require other routines as support), and the compiler needs to know the _ret address at compile time, so it gets tricky... unless I'm missing something?
Jonathan
This reminds me that the first scientific calculator I ever saw was a Sinclair Scientific. It only displayed a 5-digit mantissa, and most of those digits were wrong after a few steps of trig functions:)
http://en.wikipedia.org/wiki/Sinclair_Scientific
"ensmallenating" brilliant. Must be the opposite to my usual "enormolizing":)
What, you already have space to spare?
For sure dump all the redundant logs. Zog can work it out.
As for the jump table: if I understand correctly, each operation is currently passed two longs via PAR, which are the addresses of the operands. But one of them also holds command bits at the high end.
I'm suggesting that those command bits be the actual PASM address of the function to be called instead of an index into the command table.
So the get command loop would be something like:
getCommand    rdlong  t1, par wz              ' wait for command
  if_z        jmp     #getCommand
              mov     t2, t1                  ' load fnumA
              rdlong  fnumA, t2
              add     t2, #4
              rdlong  fnumB, t2               ' load fnumB
              shr     t1, #16 wz              ' get command COG address
              call    t1                      ' call command  N.B. No hash here
endCommand    mov     t1, par                 ' return result
              add     t1, #4
              wrlong  fnumA, t1
              wrlong  Zero, par               ' clear command status
              jmp     #getCommand             ' wait for next command

This drops the test for out of range/invalid command values, but we are not going to do that, are we:) It's a lot quicker and smaller.
But how does Spin or Zog know the command function addresses?
a) We can compile the PASM note the cog addresses and hard wire them manually into the Spin/Zog code.
b) Spin can determine them up front from address calculations (see the sketch after this list).
c) Put a list at a known offset in the COG that Zog can read.
d) Extract them from the compiler list file and massage them into Zog source at build time.
One way or another, it can be done.
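Here is roughly what option b) could look like; the label and variable names are made up, the only real idea being that a DAT label's cog address is its long offset from the start of the cog image:

VAR
  long  command                               ' mailbox long polled by the float cog
  long  cmdFAdd, cmdFMul                      ' cog register addresses of the routines we want

PUB Start
  cmdFAdd := (@_FAdd - @f32_entry) >> 2       ' byte offset from the cog image start, in longs
  cmdFMul := (@_FMul - @f32_entry) >> 2
  cognew(@f32_entry, @command)                ' launch the float cog as usual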
BTW, the Zog callback is still nice. (More features in the Propeller!)
I would love to get rid of the dispatch table, but even though the PASM 'call' instruction only takes one address, the compiler needs to know the address of the 'ret' instruction as well (which is why the naming scheme has to be myfunc...myfunc_ret); see the Propeller Manual 1.1, pg 268. This is why I had to go through the two-layer indirection in my shortened call dispatch table. I guess we could store both the call and ret addresses as the command ID, though it would probably take some parsing and such to regenerate the actual jmpret to be used.
I'll keep working on it though, and I'll certainly share if a solution presents itself.
Jonathan
It could be done if we used JMPRET instead of CALL. This way we can store the return address anywhere, say R. Functions like _FAdd would not use RET but jump back through R.
Now _FMod, say, would move R to R', do the same JMPRET to _FAdd, and then move R' back to R. A two-level stack, if you like.
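Roughly like this, untested, with made-up register names (fRet standing in for R above):

              mov     t1, #_FAdd              ' t1 = cog address of the routine to call
              jmpret  fRet, t1                ' save PC+1 into fRet's source field and jump via t1
              ' execution resumes here when _FAdd jumps back

_FAdd         nop                             ' (function body goes here)
              jmp     fRet                    ' return through the saved address: no RET, no _ret label needed

fRet          long    0                       ' return address lives in the low 9 bits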
Could be made to work if calls were not nested more than that. But I guess the extra instructions required to implement it negate the removal of the jump table.
Oh well.
I'm adding the modified version to the 1st post, in case anyone wants to play with it.
Jonathan
Me, right, how?
I thought I'd convinced myself that what I suggested was impossible or not worth while.
Too tired to look at code just now, but here is a list of all the float functions required by the Whetstone benchmark, which, as you know, is the most important thing Zog has to run at speed, because Catalina can:)
Should be clear what they are; if not, they are specified here: http://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
Guess we just have to forget all the functions that operate on doubles (8-byte floats).
Yeah I noticed that, very cunning.
Problem is that it might be a bit tough to use with the Zog byte code interpreter for C if the C code has taken over all of HUB RAM and there is no Spin code or the original PASM left in HUB.