Fast floating point
ManAtWork
Posts: 2,178
in Propeller 2
During my digital filter tests I had an idea for speeding up floating point operations. The actual floating point operations take only ~80 clocks. Most of the time (~700 clocks) is spent on the function calls, passing parameters, and unpacking and re-packing the IEEE 754 format. So I wrote Add, Mul and Div operations for unpacked floating point numbers. I use one longword for the mantissa and another for the exponent and flags (sign + overflow). The basic operations are indeed very fast, but I have to pass twice as many registers, and loading them from hub memory stalls the FIFO, so most of the advantage is lost if the code is executed from hub RAM. One operation takes ~300 cycles, which is still an improvement by a factor of more than 2. But I fear all the overhead is not worth it, not to mention the ugly-looking code of complex function calls vs. plain assignments and expressions in C.
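To make the unpacked format more concrete, here is a minimal C sketch of a multiply on unpacked numbers. It is not the code from this post: the names `ufloat`, `uf_from_double`, `uf_to_double` and `uf_mul` are hypothetical, the sign is kept in its own field instead of sharing a longword with the exponent, and overflow flags, rounding, infinities and NaNs are left out. It only shows why the operation is cheap once the numbers are already unpacked: one 32x32 multiply, one exponent add and at most one normalizing shift.

```c
#include <stdint.h>

/* Hypothetical unpacked float: value = (-1)^sign * (mant / 2^32) * 2^exp.
   A nonzero mantissa is kept normalized with bit 31 set. */
typedef struct {
    uint32_t mant;  /* 32-bit mantissa, 0 means the value is zero */
    int32_t  exp;   /* binary exponent */
    uint32_t sign;  /* 0 = positive, 1 = negative (separate field for clarity) */
} ufloat;

/* Unpack once, outside the time-critical code. Finite inputs only. */
static ufloat uf_from_double(double x)
{
    ufloat r = {0, 0, 0};
    if (x == 0.0) return r;
    if (x < 0.0) { r.sign = 1; x = -x; }
    while (x >= 1.0) { x *= 0.5; r.exp++; }   /* scale into [0.5, 1) */
    while (x < 0.5)  { x *= 2.0; r.exp--; }
    r.mant = (uint32_t)(x * 4294967296.0);    /* scale by 2^32, truncating */
    return r;
}

/* Re-pack for printing or checking results. */
static double uf_to_double(ufloat a)
{
    double v = (double)a.mant / 4294967296.0;
    int32_t e = a.exp;
    while (e > 0) { v *= 2.0; e--; }
    while (e < 0) { v *= 0.5; e++; }
    return a.sign ? -v : v;
}

/* Multiply two unpacked numbers; the result mantissa is truncated to 32 bits. */
static ufloat uf_mul(ufloat a, ufloat b)
{
    ufloat r = {0, 0, 0};
    uint64_t p;
    if (a.mant == 0 || b.mant == 0) return r;
    p = (uint64_t)a.mant * b.mant;            /* 2^62 <= p < 2^64 */
    r.exp = a.exp + b.exp;
    if (p & 0x8000000000000000ULL) {
        r.mant = (uint32_t)(p >> 32);         /* already normalized */
    } else {
        r.mant = (uint32_t)(p >> 31);         /* shift up by one bit */
        r.exp -= 1;
    }
    r.sign = a.sign ^ b.sign;
    return r;
}
```

A call such as `uf_to_double(uf_mul(uf_from_double(1.5), uf_from_double(-2.25)))` gives -3.375. The function-call syntax is exactly the "ugly" part mentioned above.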
To really make use of the full speed it would be necessary to load the code to LUT ram and call it from COG ram.
My ultimate dream was that this could be used together with operator methods. That would make any extension to the built-in math completely transparent, so that fast floating point numbers or even vector and matrix operations could be used as if they were built into the compiler. Unfortunately, as things stand this is theoretically possible but too complicated to justify the required work.
Comments
I have optimized further by placing all the code into LUT ram. Now, a 3rd order FIR filter takes only 1330 cycles. There is still room for optimization. For example the QMUL instruction could be split into several MULs. My Add code is also not very clever.
Using IEEE format would probably double the execution time because of the need for pack/unpack, but would use fewer registers and less memory. <3000 cycles would still be OK for me, but I'll probably stick to the FFP format. Mantissa resolution is 32 instead of 24 bits, which is good for higher order filters. Sometimes the coefficients are < 0.001, so with only 24 bits the result resolution would probably be less than 16 bits.
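For reference, a 3rd order FIR filter has four taps, so each output sample costs four multiplies and about as many adds. The sketch below uses plain C `float` to show the structure only. The names `fir3` and `fir3_step` are hypothetical, and this is not the LUT-resident code measured above.

```c
#define FIR_ORDER 3
#define FIR_TAPS  (FIR_ORDER + 1)

/* Hypothetical filter state: coefficients plus the last FIR_TAPS inputs. */
typedef struct {
    float coeff[FIR_TAPS];
    float hist[FIR_TAPS];   /* hist[0] is the newest sample */
} fir3;

/* Feed one input sample and return one output sample. */
static float fir3_step(fir3 *f, float x)
{
    float y = 0.0f;
    int i;
    for (i = FIR_TAPS - 1; i > 0; i--)   /* shift the delay line */
        f->hist[i] = f->hist[i - 1];
    f->hist[0] = x;
    for (i = 0; i < FIR_TAPS; i++)       /* one multiply-accumulate per tap */
        y += f->coeff[i] * f->hist[i];
    return y;
}
```

With all four coefficients set to 0.25 the filter is a moving average: feeding a constant 4.0 into a cleared filter produces 1.0, 2.0, 3.0 and then 4.0.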
Ross.
If more people than just me were interested and if I had to build something that competes with other "products", then it would be a completely different story. Then the goal would be to make it as fast and efficient as possible and as universally usable as possible.
That doesn't mean that I'm not considering looking for further improvement just for fun or for the elegance of the design. As I said in the first post, my dream is the implementation of operator methods and abstract data types. That would lift coding on the P2 to a new level. I just don't have the time right now. And I think you and Eric don't have it either.
True! Re-inventing the wheel is fun as a hobby, but it is also time consuming and not very profitable!
With only 8 cogs it is going to be important to save cogs where possible.
Or combine several things into one cog...
I haven't re-invented the wheel, I just re-arranged the wheels so that they run faster. And laziness is the mother of all inventions. You invest some time in the hope of saving more later.
The idea behind having fast floating point routines is: I need to do some calculations for the velocity and position control loop in my servo controller. The loop runs 5000 times a second, so I have 200µs or 36000 clocks for one iteration. I'll probably need one or two dozen multiplications and some more adds. This would be absolutely no problem with fixed point math coded in assembler. The problem is that I can't foresee all possible applications. It's not just built for one specific type of motor. I want to be as flexible as possible. Induction or stepper motors need different calculations than BLDC motors. Parameters like inertia or power can vary by a factor of 100:1 or more from application to application and from customer to customer.
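The budget arithmetic can be written out as a small C sketch. The 5000 Hz loop rate and the 36000 clocks are from the post; together they imply a 180 MHz system clock (36000 x 5000), which is an inference and not a stated figure. The per-operation costs plugged in below are the rough numbers from the first post (~300 cycles for an unpacked operation from hub RAM, roughly 80 + 700 clocks for a built-in one), and the function names are hypothetical.

```c
#include <stdint.h>

/* Clocks available for one control loop iteration. */
static uint32_t clocks_per_iteration(uint32_t sysclk_hz, uint32_t loop_hz)
{
    return sysclk_hz / loop_hz;
}

/* How many floating point operations fit into that budget (rounded down). */
static uint32_t ops_per_iteration(uint32_t budget_clocks, uint32_t clocks_per_op)
{
    return budget_clocks / clocks_per_op;
}
```

With these figures, `clocks_per_iteration(180000000, 5000)` is 36000, which leaves room for 120 operations at 300 clocks each but only 46 at 780 clocks each, before counting anything else the loop has to do.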
With fixed point math this would cause trouble sooner or later. Each time you change a formula you have to check if anything can cause an overflow and debug very carefully.
With floating point math and a high level language everything is much easier. Signal too noisy? Add a filter. Too little gain or power? Adjust a scaling factor here and there without having to worry about overflows. The worst that could happen is that the output stage goes into saturation. And you can use natural SI units for all variables, so that speed is expressed in revolutions per second, for example, and not in encoder steps per loop cycle time or something. This makes debugging and tuning much easier. You don't have to care about the resolution of the encoder when you check a speed value.
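The overflow concern with fixed point can be shown with a small C sketch. The Q16.16 format and the names `q16_16`, `q16_from_int` and `q16_mul_checked` are assumptions chosen for illustration, not the format used in the servo controller. The point is that a product that is harmless in floating point may not fit the fixed-point range at all, so every changed formula needs a range check.

```c
#include <stdint.h>

/* Hypothetical Q16.16 fixed point: 16 integer bits, 16 fraction bits,
   representable range roughly -32768 .. +32767.99998. */
typedef int32_t q16_16;

static q16_16 q16_from_int(int32_t v)
{
    return (q16_16)(v * 65536);
}

/* Multiply with a 64-bit intermediate. Returns 1 and stores the product
   in *out if it fits, returns 0 and leaves *out untouched on overflow. */
static int q16_mul_checked(q16_16 a, q16_16 b, q16_16 *out)
{
    int64_t p = ((int64_t)a * (int64_t)b) / 65536;
    if (p > INT32_MAX || p < INT32_MIN)
        return 0;
    *out = (q16_16)p;
    return 1;
}
```

Here 100 x 100 fits (10000 is within range), but 300 x 300 = 90000 does not, so `q16_mul_checked` reports an overflow. The same product in `float`, `300.0f * 300.0f`, is simply 90000.0f.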
Operator methods (I know I repeat myself...) would make the appearance of an expression independent of the underlying implementation. You could write ... no matter if the variables a, b, c are of a basic type (built into the compiler) or an abstract data type implemented in a user library (vectors, sets, polynomials or whatever). That's unreachable luxury at the moment, but never say never...
For now, we have to write ... if the data type is not a basic type. C++ offers operator methods, but its full implementation is too bloated to really make sense on a microprocessor.
Dedicating one cog as FPU would be acceptable. But what if you need 4 cogs for computations? You would have to either time-multiplex the FPU between them or spend 4 FPU cogs. Both are difficult.
But the LUT RAM is free in almost every cog as long as you don't need special things like color lookup tables. Loading the time-critical math library code into the LUT while leaving the high level code in hub exec mode speeds up the calculations almost as much as a dedicated cog.
I don't know how fast Catalina is. But with Fastspin two filters would already have eaten up almost my full time budget of 30000 cycles if I used the built in float math. Now, I can do it >10 times faster which leaves much headroom for later extensions.
Actually, I don't know either! I haven't done any benchmarks on the P2 yet. However, I think you may have missed the point of my original post. I wasn't suggesting you use Catalina, I was just suggesting that you could use its modified P2 version of Cam Thompson's original P1 floating point code. You don't need to dedicate a cog - on the P2, it is trivial to convert such Cog code to Hub code, and it would only be slightly more difficult to convert it to LUT code. This code has already been tested against the classic "paranoia" test program for IEEE floating point implementations.
Interesting idea. But what would change by using XBYTE instead of direct calls to LUT code? I mean, is there anything that would be possible by using XBYTE that isn't without?
I think creating a bytecode interpreter specialized for mathematical and logical (PLC) applications would be really cool. Two friends of mine actually did this. They have built a servo controller that can be programmed with a script language. This makes it extremely flexible and powerful (at least in theory). You can add modules (filters, timers, logic) and signal paths on the fly and without having to re-compile the firmware.
BUT... They made a big mistake (apart from using the wrong processor, of course). They haven't had the time (or motivation) to document everything and haven't supplied a convenient user interface for it. So they have an absolutely ingenious product which, unfortunately, nobody is able to use except its creators. This renders the original advantage useless.
If I need some adjustment or extension I can't do it myself because I don't know how due to lack of documentation. I have to call them up and ask them how to do it. The common procedure is to establish a Teamviewer session and they hack the solution into the script interpreter leaving me standing there like an idiot. In the time this takes I could also make modifications to a (hardcoded) firmware written in C or Spin, re-compile it and send it to the customer for download.
What I want to say is that flexibility is great. But the work required increases enormously with the level of flexibility desired. And you have to do it right and go all the way.
No, I haven't missed the point. I just haven't explained my thoughts enough. I could surely learn something by examining the floating point code. This is exactly what I did when writing my code. I just did it with the code generated by Fastspin instead of Catalina. And I don't think the code is much different, BTW. But even if it were, it wouldn't make much difference to speed, because the majority of the time is spent passing the parameters and on the overhead of the function calls themselves (stack handling, jump, FIFO reload...) instead of the actual computations.
So it would be interesting if Catalina does this more efficiently than the Fastspin compiler.
I tried:
best regards
Reinhard
thank you
I also looked at RossH's other suggestion; a lot has already been implemented there.
I still don't know which way to go. Maybe IEEE floating point will come to Fastspin.
Have a nice weekend
Reinhard