Any ways to speed up Float32?
EDIT
faster routines: _Pack, Add, Sub, Sqr, Sin, Cos, Tan, Float, Exp*, Log*, Pow, Mul, Div
corrected routines: Trunc, Round, Exp*
added routines: Exp2, Log2 (Spin calling code only, no cog bloat), FMod, ATan, ATan2, ACos, ASin, Floor, Ceil
Sept 29, 2010 - fixed some comments, added UintTrunc, only 4 longs free [8^(
Sept 28, 2010 - Added FloatTrunc & FloatRound, converted to all ASCII with CRLF end of line.
Sept 27, 2010 - renamed to F32, faster table interpolation and Tan, added changelog to file
Sept 24, 2010 - fixed the trig functions (had a bug in my table interpolation code)
Sept 21, 2010 - faster multiply and divide - 26 longs free
Sept 20, 2010 - added in all the functionality from Float32Full - 14 longs free
Sept 15, 2010 - fixed race condition on startup, 111 longs free
Sept 14, 2010 - fixed Exp*, sped up Exp*, Log*, Pow, Cos and Sin again
Sept 13, 2010 (PM) - fixed Trunc and Round to now do the right thing for large integers. 83 longs available
Sept 13, 2010 (AM) - new calling convention. 71 longs available in-Cog
/EDIT
OK, knowing full well that much of the time in Float32 is wasted in Spin calling overhead (oh how I wish for a 'waitlock' command!), I'm still wondering if we can pare it down for space reasons, and for a bit of a speed bump if being called from assembly.
Here's a smaller & faster version of the _Pack PASM routine:
'------------------------------------------------------------------------------
' input: flagA fnumA flag bits (Nan, Infinity, Zero, Sign)
' expA fnumA exponent (no bias)
' manA fnumA mantissa (aligned to bit 29)
' output: fnumA 32-bit floating point value
' changes: fnumA, flagA, expA, manA
'------------------------------------------------------------------------------
_Pack cmp manA, #0 wz ' check for zero
if_z mov expA, #0
if_z jmp #:exit1
sub expA, #380 ' take us out of the danger range for djnz
:normalize shl manA, #1 wc ' normalize the mantissa
if_nc djnz expA, #:normalize ' adjust exponent and jump
add manA, #$100 wc ' round up by 1/2 lsb
addx expA, #(380 + 127 + 2) ' add bias to exponent, account for rounding (in flag C, above)
mins expA, Minus23
maxs expA, #255
abs expA, expA wc,wz ' check for subnormals, and get the abs in case it is
if_a jmp #:exit1
:subnormal or manA, #1 ' adjust mantissa
ror manA, #1
shr manA, expA
mov expA, #0 ' biased exponent = 0
:exit1 mov fnumA, manA ' bits 22:0 mantissa
shr fnumA, #9
movi fnumA, expA ' bits 23:30 exponent
shl flagA, #31
or fnumA, flagA ' bit 31 sign
_Pack_ret ret
I'm sure we could save space and time if we didn't handle NaN, +/- infinity, etc., but that would basically require maintaining two versions, and I'm sure many people want as close to IEEE 754 compliance as possible.
Any thoughts?
Jonathan


Comments
If raw speed is important, how about using the user defined function mechanism? You essentially have a floating point interpreter with an assembly language engine designed for short sequences of floating point operations with low overhead.
Jonathan
'------------------------------------------------------------------------------
' _FDiv    fnumA = fnumA / fNumB
' _FDivI   fnumA = fnumA / {Float immediate}
' changes: fnumA, flagA, expA, manA, fnumB, flagB, expB, manB, t1, t2
'------------------------------------------------------------------------------
_FDivI        movs    :getB, _FDivI_ret       ' get immediate value
              add     _FDivI_ret, #1
:getB         mov     fnumB, 0
_FDiv         call    #_Unpack2               ' unpack two variables
  if_c_or_z   mov     fnumA, NaN              ' check for NaN or divide by 0
  if_c_or_z   jmp     #_FDiv_ret
              xor     flagA, flagB            ' get sign of result
              sub     expA, expB              ' subtract exponents
              mov     t1, #0                  ' clear quotient
              mov     t2, #30                 ' loop counter for divide
:divide                                       ' divide the mantissas
              cmpsub  manA, manB wc
              rcl     t1, #1
              shl     manA, #1
              djnz    t2, #:divide
              mov     manA, t1                ' get result and exit
              call    #_Pack
_FDivI_ret
_FDiv_ret     ret
Jonathan
Not bit identical: in about 0.5% of my test cases, the least significant bit of the original Float32 result is high while mine is low, yielding about a 1e-5% delta (one LSB of the 23-bit mantissa is a relative difference of roughly 2^-23, i.e. about 1e-5 percent). In the other 99.5% of cases the outputs were bit identical. I'm not sure which is more accurate.
'------------------------------------------------------------------------------
' _FSqr    fnumA = sqrt(fnumA)
' changes: fnumA, flagA, expA, manA, t1, t2, t3, t4, t5
'------------------------------------------------------------------------------
_FSqr         call    #_Unpack                ' unpack floating point value
  if_nc       mov     fnumA, #0               ' set initial result to zero
  if_c_or_z   jmp     #_FSqr_ret              ' check for NaN or zero
              test    flagA, #signFlag wz     ' check for negative
  if_nz       mov     fnumA, NaN              ' yes, then return NaN
  if_nz       jmp     #_FSqr_ret
              test    expA, #1 wz             ' if even exponent, shift mantissa
  if_z        shr     manA, #1
              sar     expA, #1                ' get exponent of root
              add     expA, #1
              mov     fnumA, #0
              mov     t2, #30
:sqrt                                         ' what is the delta root^2 if we add in this bit?
              mov     t3, fnumA
              shl     t3, #2
              add     t3, #1
              shl     t3, t2
                                              ' is the remainder >= delta?
              cmpsub  manA, t3 wc
              rcl     fnumA, #1
                                              ' get the remainder ready for the next loop, and loop
              shl     manA, #1
              sub     t2, #1 wc
  if_nc       jmp     #:sqrt
              mov     manA, fnumA             ' store new mantissa value and exit
              shr     manA, #1
              call    #_Pack
_FSqr_ret     ret
Jonathan
You would pass it a string containing the operations to be performed, in order, plus the 32-bit address of each number it needs to use in the operation. The Spin caller would then return the result of the function after waiting for the ASM code to finish.
E.g.
FP.FPCompute(string("(%1 + %2) / %3"), @valueOf1, @valueOf2, @valueOf3)
But... Spin can't handle variable-length expressions.
Here's the add/sub code, 3 longs shorter:
'------------------------------------------------------------------------------
' _FAdd    fnumA = fnumA + fNumB
' _FAddI   fnumA = fnumA + {Float immediate}
' _FSub    fnumA = fnumA - fNumB
' _FSubI   fnumA = fnumA - {Float immediate}
' changes: fnumA, flagA, expA, manA, fnumB, flagB, expB, manB, t1
'------------------------------------------------------------------------------
_FSubI        movs    :getB, _FSubI_ret       ' get immediate value
              add     _FSubI_ret, #1
:getB         mov     fnumB, 0
_FSub         xor     fnumB, Bit31            ' negate B
              jmp     #_FAdd                  ' add values
_FAddI        movs    :getB, _FAddI_ret       ' get immediate value
              add     _FAddI_ret, #1
:getB         mov     fnumB, 0
_FAdd         call    #_Unpack2               ' unpack two variables
  if_c_or_z   jmp     #_FAdd_ret              ' check for NaN or B = 0
              test    flagA, #SignFlag wz     ' negate A mantissa if negative
  if_nz       neg     manA, manA
              test    flagB, #SignFlag wz     ' negate B mantissa if negative
  if_nz       neg     manB, manB
              mov     t1, expA                ' align mantissas
              sub     t1, expB
              abs     t1, t1 wc
              max     t1, #31
  if_nc       sar     manB, t1
  if_c        sar     manA, t1
  if_c        mov     expA, expB
              add     manA, manB              ' add the two mantissas
              abs     manA, manA wc           ' store the absolute value,
              muxc    flagA, #SignFlag        ' and flag if it was negative
              call    #_Pack                  ' pack result and exit
_FSubI_ret
_FSub_ret
_FAddI_ret
_FAdd_ret     ret

The last functions to be optimized use the tables heavily (log, exp, trig stuff, etc.), so I'm not sure I can do much, though I haven't looked at them yet. Does anybody care about / use those?
Jonathan
John Abshier
Gosh, I'm just now looking at how to get Zog to use float32 and you are already optimising it.
Zog will appreciate all the speed it can get from trig functions for the whetstone benchmark:)
Here's a slightly slower (3 extra instructions, so a 12-clock one-time cost) dispatch table, but it saves 15 longs. It was a bit tricky, in that it's a 'call' dispatch table instead of the simpler 'jmp' dispatch table, so there is an extra layer of indirection via self-modifying code.
              add     :setCmd, t1             ' 1st command is offset 1
              movs    :setCmd, #:cmdTable-1   ' won't take effect until after the next op, so reset the lookup to the table's initial address
:setCmd       mov     :execCmd, :cmdTable-1   ' move the appropriate call into :execCmd
              nop                             ' need this nop to let the previous modification take effect
:execCmd      nop                             ' execute command: was replaced by :setCmd
              jmp     #endCommand             ' finish

:cmdTable     call    #_FAdd                  ' command dispatch table
              call    #_FSub
              call    #_FMul
              call    #_FDiv
              call    #_FFloat
              call    #_FTrunc
              call    #_FRound
              call    #_FSqr
              call    #cmd_FCmp
              call    #_Sin
              call    #_Cos
              call    #_Tan
              call    #_Log
              call    #_Log10
              call    #_Exp
              call    #_Exp10
              call    #_Pow
              call    #_Frac
Jonathan
(C++ software is always efficient, since some compilers treat the Propeller as a VLIW processor, getting the best in return.)
I've been looking at the trig in Float32: it uses the builtin tables with interpolation, so there are places where you only get 5 digits of accuracy. For example:
sin( -3.337947 ) yields:
actual => 0.195095035
float32 => 0.1950913
mine => 0.1950914
I sped up Sin/Cos (and thus Tan) while still using the table. My implementation suffers in one extra area: for tiny arguments, sin(t) ~= t; the old code handles this well, whereas my new code ends up truncating to 0.
So, for people actually using the trig stuff, how much accuracy do you expect? I was hoping for another 2 digits or so, but the sine table in ROM is only 16 bits (a step of 1/65536, roughly 1.5e-5), so we really are limited there.
Options for moving forward:
1 - leave it as is
2 - use my faster version and don't expect sin( 1e-10 ) to equal 1e-10
3 - switch to a CORDIC implementation
4 - still use the table, but try quadratic interpolation instead of linear...would be slower
Any votes?
Jonathan
P.S. Here's the faster version:
'------------------------------------------------------------------------------
' _Sin     fnumA = sin(fnumA)
' _Cos     fnumA = cos(fnumA)
' changes: fnumA, flagA, expA, manA, fnumB, flagB, expB, manB
' changes: t1, t2, t3, t4, t5, t6
'------------------------------------------------------------------------------
' convert sin to cos, since cos(-t) == cos(t) means I can ignore the sign of t
_Sin          call    #_FSubI                 ' cos(x) = sin(x + pi/2)
              long    pi / 2.0                ' sin(x) = cos(x - pi/2)
_Cos          call    #_FDivI                 ' rescale angle from [0..2pi] to [0..1]
              long    2.0 * pi
              ' now, work with the raw value
              call    #_Unpack                ' get the whole and fractional bits
              add     expA, #3                ' bias the exponent by 3 so the resulting data will be full 32-bit aligned
              ' fractional
              mov     t1, expA                ' make a copy of the exponent
              mov     t5, manA                ' and a copy of the mantissa
              add     t1, #13                 ' bias by the number of bits I will lose
              abs     t1, t1 wc               ' find out which direction I need to shift
              max     t1, #31                 ' limit it so no weird rollover stuff happens
  if_c        shr     t5, t1                  ' -exp: shift right to bring down to 1.0
  if_nc       shl     t5, t1                  ' +exp: shift left to throw away the high bits
              abs     expA, expA wc           ' was the exponent positive or negative?
              max     expA, #31               ' limit to 31, otherwise we do weird wrapping things
  if_c        shr     manA, expA              ' -exp: shift right to bring down to 1.0
  if_nc       shl     manA, expA              ' +exp: shift left to throw away the high bits
              ' manA now holds angle-bits
              add     manA, bit30             ' convert from cos back to sin
              'mov     t5, manA               ' get the fractional part by discarding the top 13 bits
              'shl     t5, #13
              mov     fnumA, manA             ' and keep the integer part (top 13 bits) in manA
              shr     fnumA, #(32-13)
              ' get the 1st point
              test    fnumA, Sin_90 wc        ' set C flag for quadrant 2 or 4
              test    fnumA, Sin_180 wz       ' set Z flag for quadrant 3 or 4
              negc    fnumA, fnumA            ' if quadrant 2 or 4, negate offset
              or      fnumA, SineTable        ' blend in sine table address
              shl     fnumA, #1               ' get table offset
              rdword  t2, fnumA               ' get first table value
              shl     t2, #14                 ' align to bit 29
              negnz   t2, t2                  ' if quadrant 3 or 4, negate
              sumc    fnumA, #2               ' get second table value
              rdword  t3, fnumA
              shl     t3, #14
              negnz   t3, t3                  ' if quadrant 3 or 4, negate
              ' do the linear interpolation
              sub     t3, t2
:interp       sar     t3, #1
              shl     t5, #1 wc, wz
  if_c        add     t2, t3
  if_nz       jmp     #:interp
              ' the table goes to $FFFF, I wanted it to go to $10000
              mov     t5, t2
              sar     t5, #16
              add     t2, t5
              ' hack, to see the status
              'mov     fnumA, t2
              abs     manA, t2 wc
              muxc    flagA, #SignFlag
              neg     expA, #1
              call    #_Pack
_Cos_ret
_Sin_ret      ret

start := cnt
a := f.FSin(b)
end := cnt
and report end - start for the new and old versions.
How small is "tiny"? 0.1 degrees, 0.01, 0.001? For me 1e-10 is zero.
John Abshier
I'm using trig functions for navigation with GPS.
At the moment I'm using an integer solution.
In my case I usually need an approximate set of functions (heading and VMG can accept lower precision than cross-track, due to error propagation).
Would it be possible to implement one fast-and-dirty set of trig functions and one accurate-but-slow set, and use whichever fits the need?
Massimo
@John: ~ 2x faster. Sin and Cos go from about 7100 counts to 3900 counts, and Tan goes from ~13400 to ~6000 counts. However I'm also thinking of Zog calling, not just Spin calling. Sin(t) goes to 0 for t (in radians) smaller than ~ 1e-7 (which makes sense given a 23-bit mantissa). It would be fairly easy to special case, but it won't increase the accuracy over the rest of the input range.
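The special case could be as simple as a Spin-side guard along these lines (a sketch only: TableSin and the exact threshold are mine, not F32's). Below about |t| < 2^-11 radians, sin(t) equals t to within half an LSB of a single float, so the argument can just be handed back.

PUB Sin(a)
  if (a & $7F80_0000) < constant(116 << 23)   ' biased exponent < 116 means |a| < 2^-11 radians
    return a                                  ' sin(a) = a to full single precision there
  return TableSin(a)                          ' otherwise use the table-based PASM routine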
@Massimo: Certainly. The thing is, there _are_ no accurate trig functions on the Propeller right now. I had just assumed that the sine and cosine calcs in Float32 were as accurate as possible while still fitting into the single-precision float representation. I guess what I'm asking is: is the status quo good enough, or should I work on a more accurate version?
Thanks for the feedback, everybody.
Jonathan
John Abshier
It's great you are looking into this.
Having done a lot of lab work as a physics undergrad, it was drummed into my head not to quote results to a higher accuracy than the measurements. Just because you have 10 digits on your calculator does not mean you should write them all down at the end. An awareness of sources of error, and analysis of the accuracy of the final result, is all part of the exercise.
So I'm prepared for lesser floating point trig precision on the Prop, which generally gets its inputs from real-world sensors of limited accuracy. With Zog, if you want more accuracy you can use the zpu-gcc math lib functions and accept the speed hit.
When working with Spin or Zog I'm not even sure speed is such a major issue. Things are pretty leisurely anyway.
What would be great is to get all these functions into one Cog.
Looking at Zog currently we might have all these COGs in use:
1) Zog interpreter.
2) VMCog
3) Float part 1
4) Float part 2
5) Debug serial port
At this point we don't have much of a Prop left for any real application and might as well be using an ARM.
Folding all functions into one Cog may be a tall order.
What about this:
Do not use jump tables!
Change the interface so it is: command, op1, op2 but arrange that the command is actually the PASM address of the float function we want. Then the command loop only has to grab the command and jump to it.
Edit: Bah! Scratch that, just realized why all those immediate ops are there.
I will keep working on ensmallenating the math stuff, and see what else I can fit into that cog. Does anyone have a priority list (personal preference is fine) of what should be shoehorned into the remaining space?
I could also remove all but one version of log and exp, etc., since log base x of y is ln(y)/ln(x)... which is how it's calculated internally anyway. We don't need log_2, log_e, log_10, etc. in the PASM code; those could be wrapped from the Spin calling code, for example:
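(A sketch only: the method names are illustrative rather than the final F32 API, and it assumes the cog exposes just natural-log Log and Exp plus FMul/FDiv.)

PUB Log2(a)
  return FDiv(Log(a), 0.693147181)            ' log2(a) = ln(a) / ln(2)

PUB Log10(a)
  return FDiv(Log(a), 2.302585093)            ' log10(a) = ln(a) / ln(10)

PUB Exp2(a)
  return Exp(FMul(a, 0.693147181))            ' 2^a = e^(a * ln(2))

PUB Exp10(a)
  return Exp(FMul(a, 2.302585093))            ' 10^a = e^(a * ln(10))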
Finally, I bet you could also fit a limited serial engine into another cog, say 115_200 only, half duplex (maybe even tx only) for debugging.
Jonathan
Direct jumps would work for a jump table, but Float32 uses calls (as many functions in there require other routines as support), and the compiler needs to know the _ret address at compile time, so it gets tricky... unless I'm missing something?
Jonathan
This reminds me that the first scientific calculator I ever saw was a Sinclair Scientific. It only displayed a 5-digit mantissa, and most of those digits were wrong after a few steps of trig functions:)
http://en.wikipedia.org/wiki/Sinclair_Scientific
"ensmallenating" brilliant. Must be the opposite to my usual "enormolizing":)
What, you already have space to spare?
For sure dump all the redundant logs. Zog can work it out.
As for the jump table: if I understand correctly, each operation is currently passed two longs via PAR, which are the addresses of the operands. But one of them also holds command bits at the high end.
I'm suggesting that those command bits be the actual PASM address of the function to be called instead of an index into the command table.
So the get command loop would be something like:
getCommand    rdlong  t1, par wz              ' wait for command
  if_z        jmp     #getCommand
              mov     t2, t1                  ' load fnumA
              rdlong  fnumA, t2
              add     t2, #4
              rdlong  fnumB, t2               ' load fnumB
              shr     t1, #16 wz              ' get command COG address
              call    t1                      ' call command  N.B. No hash here
endCommand    mov     t1, par                 ' return result
              add     t1, #4
              wrlong  fnumA, t1
              wrlong  Zero, par               ' clear command status
              jmp     #getCommand             ' wait for next command

This drops the test for out of range/invalid command values, but we are not going to do that, are we:) It's a lot quicker and smaller.
But how does Spin or Zog know the command function addresses?
a) We can compile the PASM note the cog addresses and hard wire them manually into the Spin/Zog code.
b) Spin can determine them up front from address calculations (see the sketch after this list).
c) Put a list at a known offset in the COG that Zog can read.
d) Extract them from the compiler list file and massage them into Zog source at build time.
One way or another, it can be done.
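Here is roughly what option b) could look like; the label and variable names are made up, the only real idea being that a DAT label's cog address is its long offset from the start of the cog image:

VAR
  long  command                               ' mailbox long polled by the float cog
  long  cmdFAdd, cmdFMul                      ' cog register addresses of the routines we want

PUB Start
  cmdFAdd := (@_FAdd - @f32_entry) >> 2       ' byte offset from the cog image start, in longs
  cmdFMul := (@_FMul - @f32_entry) >> 2
  cognew(@f32_entry, @command)                ' launch the float cog as usual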
BTW, the Zog callback is still nice. (More features in the Propeller!)
I would love to get rid of the dispatch table, but even though the PASM 'call' instruction only takes one address, the compiler needs to know the address of the 'ret' instruction as well (which is why the naming scheme has to be myfunc...myfunc_ret); see the Propeller Manual 1.1, pg 268. This is why I had to go through the two-layer indirection in my shortened call dispatch table. I guess we could store both the call and ret addresses as the command ID, though it would probably take some parsing and such to regenerate the actual jmpret to be used.
I'll keep working on it though, and I'll certainly share if a solution presents itself.
Jonathan
It could be done if we used JMPRET instead of CALL. This way we can store the return address anywhere, say R. Functions like _FAdd would not use RET but jump back through R.
Now _FMod, say, would move R to R', do the same JMPRET to _FAdd, and then move R' back to R. A two-level stack, if you like.
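Roughly like this, untested, with made-up register names (fRet standing in for R above):

              mov     t1, #_FAdd              ' t1 = cog address of the routine to call
              jmpret  fRet, t1                ' save PC+1 into fRet's source field and jump via t1
              ' execution resumes here when _FAdd jumps back

_FAdd         nop                             ' (function body goes here)
              jmp     fRet                    ' return through the saved address: no RET, no _ret label needed

fRet          long    0                       ' return address lives in the low 9 bits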
Could be made to work if calls were not nested more than that. But I guess the extra instructions required to implement it negate the removal of the jump table.
Oh well.
I'm adding the modified version to the 1st post, in case anyone wants to play with it.
Jonathan
Me, right, how?
I thought I'd convinced myself that what I suggested was impossible or not worth while.
Too tired to look at code just now, but here is a list of all the float functions required by the Whetstone benchmark, which, as you know, is the most important thing Zog has to run at speed, because Catalina can:)
Should be clear what they are; if not, they are specified here: http://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
Guess we just have to forget all the functions that operate on doubles (8-byte floats).
Yeah I noticed that, very cunning.
Problem is that it might be a bit tough to use with the Zog byte code interpreter for C if the C code has taken over all of HUB RAM and there is no Spin code or the original PASM left in HUB.