Ways to speed up code
amalfiCoast
Posts: 132
in Propeller 1
Hi,
Are there any ways to speed up code within a while loop? For example, I am running a Kalman filter in a continuous loop and a WAG for loop time is maybe 50 Hz. If I wanted the loop to run at 200 Hz, what are some ways to achieve this?
Comments
Sometimes you can get a smaller improvement by optimizing the code, eliminating redundant subscripting or other operations, changing the order of calculations, etc. You won't see a 4x change there in most cases.
The C/C++ compiler produces faster and larger code than the Spin compiler. If you're using Spin, you can use "spin2cpp" to translate the Spin parts to C++, then do your tight loop directly in C++ and compile the whole thing using SimpleIDE.
David
Can you accept some lag?
Use more COGS to run the math like a pipeline.
One cog at 50 Hz, two at 100 Hz, four at 200 Hz. It won't scale perfectly, but it will approximate those kinds of gains.
Incoming data gets buffered and handed off to the COGS using ring buffers. Same for output.
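A minimal sketch of the ring buffer hand-off described above, written as plain C so it can be tried on a PC first. The names (ring_t, rb_put, rb_get) and the buffer size are made up for illustration. It assumes exactly one producer and one consumer per buffer; on the Propeller each side would run in its own cog.

```c
#include <stdint.h>

#define RB_SIZE 8u                      /* hypothetical size, must be a power of two */

typedef struct {
    volatile uint32_t head;             /* written only by the producer */
    volatile uint32_t tail;             /* written only by the consumer */
    volatile int32_t  slot[RB_SIZE];
} ring_t;

/* Producer side: returns 1 on success, 0 if the buffer is full. */
static int rb_put(ring_t *rb, int32_t v)
{
    if (rb->head - rb->tail == RB_SIZE)
        return 0;                       /* full, caller decides what to drop */
    rb->slot[rb->head & (RB_SIZE - 1u)] = v;
    rb->head = rb->head + 1u;           /* publish only after the slot is written */
    return 1;
}

/* Consumer side: returns 1 and stores the oldest value, 0 if empty. */
static int rb_get(ring_t *rb, int32_t *out)
{
    if (rb->head == rb->tail)
        return 0;                       /* empty */
    *out = rb->slot[rb->tail & (RB_SIZE - 1u)];
    rb->tail = rb->tail + 1u;           /* free the slot only after reading it */
    return 1;
}
```

The sensor cog would call rb_put for each new sample and a math cog would call rb_get, with a second buffer of the same kind carrying results back out.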
Mike,
I'm sorry for my ignorance, but how would I make a separate function? Your idea sounds great, I just don't know how to implement it.
Here are the Propeller C tutorials:
http://learn.parallax.com/tutorials/propeller-c
Functions are here:
http://learn.parallax.com/tutorials/language/propeller-c/propeller-c-functions
In Tachyon Forth I use integer maths for speed. The simple way to multiply a number by this value is a mixed multiply-and-divide operation with a 64-bit intermediate, using the Forth */ operator: 305432 1000000000 */ takes only 38.4us.
Out of curiosity though I converted part of your code into Tachyon, since I like to see if there is anything I can improve in Tachyon. Here are two subroutines and the start of the inner loop:
This is something I can do relatively easily. I will try this and report back with results. Thanks.
Genetix,
Thank you for this information.
David
I am relatively new to programming and to the Propeller, so it will take me some time to digest what you have said. One thing I believe you are saying is to reduce the mathematical operations in my loop by hard coding values. This is something I can do pretty easily. The .0175 is a scale factor from the datasheet, and I was converting the data from deg/sec (output by the sensor) to radians/sec. All of the calculations in the filter are done in radians.
David
As well as using fixed point, simple things like splitting the serial code into a separate COG remove that dead time (and use burst access, as suggested above).
eg a read over i2c is not super fast: a 100kHz burst read of 3x16 bits, with some overhead, can consume ~20% of your targeted 5ms budget. 400kHz i2c is better, but still ~5%.
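A rough way to check that budget is to count bus clocks. This sketch assumes a typical register read framing (device address, register address, repeated start, device address again, then the data bytes, each byte costing 8 bits plus an ACK). The function names are made up, and start/stop conditions and software overhead are ignored.

```c
#include <stdint.h>

/* Bus clocks for one burst read of n data bytes.
   Assumed framing: 3 address/command bytes plus n data bytes,
   each costing 8 data bits + 1 ACK bit. */
static uint32_t i2c_burst_clocks(uint32_t n_data_bytes)
{
    return (3u + n_data_bytes) * 9u;
}

/* Transfer time in microseconds (rounded down) at a bus clock given in Hz. */
static uint32_t i2c_burst_us(uint32_t n_data_bytes, uint32_t bus_hz)
{
    return (uint32_t)(((uint64_t)i2c_burst_clocks(n_data_bytes) * 1000000u) / bus_hz);
}
```

Under these assumptions i2c_burst_us(6, 100000) is 810, about 16% of a 5000 us loop, and i2c_burst_us(6, 400000) is 202, about 4%. Real transfers land somewhat higher once the ignored overhead is added, which is in line with the figures quoted above.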
You can also use shift and add, over multiply.
A sensor 'gain' example I find is defined within a band of 3.5(min) < 3.9(typ) < 4.3(max) - roughly 5%, with 10b precision
Taking your example of
.0175*PI/180 = 0.000305432
you can approximate that with fixed point and shifts.
eg if we take
MF=(10000*.0175*pi/180)
MF = 3.0543261 - that does not prune LSBs in fixed point, and takes the 10b values expanding 0~1024 to 0~3072
a simple << 1 and add gives ((2+1)-MF)/MF = 1.77866% gain error, & that could already be good enough, given sensor is 5% gain accurate.
or, you can improve more
((2+1+1/16-1/128)-MF)/MF = 1.182941e-4 (0.0118%) or one part in 8453 gain error, way better than the sensor.
The compiler should be reducing the expression to a constant anyway, but it is still in floating point, which is relatively slow since it has to be done in software, whereas an integer operation using scaling before division is much faster. However, your SPI routines are extremely inefficient and the compiler cannot optimize that for you. You can, though, if you write %01101000 so as to address register $28 with the autoincrement bit set (bit 6), and thereafter simply read 2 bytes as X, 2 bytes as Y, 2 bytes as Z, then deselect chip-select. This is much faster since SPI operations are also done in software. Alternatively, you could have another cog continuously reading the gyro and updating global variables so that your loop doesn't have to wait for SPI.
These are just some quick tips.
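Here is a small sketch of the bookkeeping for the burst read suggested above, leaving out the actual SPI transfer, which depends on the library in use. The register number and auto-increment bit are taken from the post. The helper names, and the assumption that each axis arrives low byte first, are mine and should be checked against the sensor datasheet.

```c
#include <stdint.h>

#define GYRO_OUT_X_L  0x28u   /* first output register ($28 in the post) */
#define GYRO_AUTOINC  0x40u   /* bit 6: auto-increment the register address */

/* Command byte for a burst access starting at 'reg'.
   Any read/write flag the device expects is left to the caller. */
static uint8_t burst_cmd(uint8_t reg)
{
    return (uint8_t)(reg | GYRO_AUTOINC);
}

/* Combine six burst-read bytes into signed X, Y, Z values.
   Assumes the low byte of each axis arrives first. */
static void unpack_xyz(const uint8_t raw[6], int16_t xyz[3])
{
    for (int i = 0; i < 3; i++) {
        uint16_t u = (uint16_t)(raw[2 * i] | ((uint16_t)raw[2 * i + 1] << 8));
        xyz[i] = (int16_t)u;   /* reinterpret as two's complement */
    }
}
```

The loop would then send burst_cmd(GYRO_OUT_X_L) once, clock in six bytes, and call unpack_xyz, instead of addressing each register separately.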
And build with lmm instead of cmm, that's a really easy one if you are not already.
Following that, I see some other easy things to change:
Don't send 7 bits followed by 1 bit. Just send all 8 bits at once and avoid a lot of overhead related to calling the same function twice.
Do this outside of your loop (the pwm_start call only needs to happen once).
Don't forget to comment out that last print line in the loop when you're testing the timing. The print function is severely limited by your UART speed.
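To put a number on how much a print can cost, here is a back-of-envelope helper. It assumes 8N1 framing (10 bits on the wire per character). The function name, the 115200 baud rate and the 40-character line used in the note below are only example values, not taken from the original code.

```c
#include <stdint.h>

/* Microseconds (rounded down) to transmit n characters,
   assuming 1 start bit + 8 data bits + 1 stop bit per character. */
static uint32_t uart_tx_us(uint32_t n_chars, uint32_t baud)
{
    return (uint32_t)(((uint64_t)n_chars * 10u * 1000000u) / baud);
}
```

With those example values, uart_tx_us(40, 115200) is 3472 us, which would already eat most of a 5 ms loop before any math is done.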
I'm going to ignore the huge blocks of math - it just isn't my forte. Maybe there is room for improvement, maybe not.
Okay, so you're compiling with lmm, you've fixed the floating point to be fixed point, and you've combined your redundant shift_out calls, and you moved pwm_start, and you commented out the last print line and your math algorithms are as good as they can be.... but it still isn't fast enough. What's next?
These functions from simpletools.h are pretty slow. They're nice and easy to understand, and it's great that they come built in with SimpleIDE, but they are slow. There are faster options out there - namely PropWare - but it will require more effort to convert your code base, so let's save it for a last resort.
I'm guessing based on the code comments that you might be using an L3G gyroscope, so here's PropWare's demo code for communication with an L3G: https://david.zemon.name/PropWare/api-develop/L3G_Demo_8cpp-example.xhtml
Thanks for the reply. Can you post code showing how to implement? I'm sorry I don't have much programming experience.
Thanks,
David
In Spin,
Interpolation is possible but slows things down. I don't do C, but it will be easy to translate.
For cosine, shift 90 degrees (+2048). Tangent is sine/cosine, scaled as needed.
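Since C code was asked for, here is a hedged sketch of a quarter-wave table lookup of the kind described above. It assumes a 13-bit angle (8192 steps per turn, so 90 degrees is 2048) and a table of 2049 unsigned 16-bit entries for 0 to 90 degrees. The table here is a locally built stand-in so the code runs on a PC; on the Propeller the ROM table would be read instead. All function names are made up.

```c
#include <stdint.h>

#define QUARTER 2048                      /* table steps in 90 degrees */

static uint16_t sine_tab[QUARTER + 1];    /* host stand-in for the ROM table */

/* Fill the stand-in table with sin(k * d) scaled so that 1.0 -> 65535.
   Uses an angle-addition recurrence so no math library is needed. */
static void sine_tab_init(void)
{
    const double d  = 3.14159265358979323846 / 4096.0;             /* 90 deg / 2048 */
    const double d2 = d * d;
    const double sd = d * (1.0 - d2 / 6.0 * (1.0 - d2 / 20.0));    /* sin(d), Taylor series */
    const double cd = 1.0 - d2 / 2.0 * (1.0 - d2 / 12.0);          /* cos(d), Taylor series */
    double s = 0.0, c = 1.0;
    for (int k = 0; k <= QUARTER; k++) {
        sine_tab[k] = (uint16_t)(s * 65535.0 + 0.5);
        double s_next = s * cd + c * sd;
        c = c * cd - s * sd;
        s = s_next;
    }
}

/* Sine of a 13-bit angle, result in -65535..65535. No interpolation. */
static int32_t sin13(uint32_t angle)
{
    uint32_t a   = angle & 0x1FFFu;
    uint32_t idx = a & 0x7FFu;
    if (a & 0x800u)
        idx = 0x800u - idx;               /* quadrants 2 and 4: mirror the index */
    int32_t v = sine_tab[idx];
    return (a & 0x1000u) ? -v : v;        /* quadrants 3 and 4: negate */
}

/* Cosine is sine shifted by 90 degrees (+2048). */
static int32_t cos13(uint32_t angle)
{
    return sin13(angle + QUARTER);
}
```

Call sine_tab_init() once at startup in the host version. Tangent can then be formed as sin13/cos13 with whatever scaling the rest of the filter uses.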
Thanks.
The sin/cos table in the Propeller ROM is stored as fixed point, I think with 15 bits after the decimal point.
(1) Try compiling with different options. As I think David said above, LMM is faster than CMM. The -O3 option is faster than the usual -Os option, although it may make even bigger code.
(2) Make sure you use -m32bit-doubles as an option.
(3) Try to avoid computing the same value more than once. Instead of calling cos(theta) in many places, create variables like "float ct, st;". Every time theta changes (not very often in your code), recalculate "ct = cos(theta); st = sin(theta)", and use "ct" and "st" in all the expressions in place of "cos(theta)" and "sin(theta)". At optimization level -O3 the compiler may be able to do this for you, but it doesn't hurt to try to help it out.
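To illustrate tip (3), here is a toy sketch. slow_cos is a made-up stand-in for a slow software cos() (a crude polynomial, not accurate) with a call counter added so the saving is visible. The blend functions are hypothetical and not taken from the filter code.

```c
static int trig_calls = 0;               /* counts calls to the slow routine */

/* Hypothetical stand-in for a slow software cos(); illustration only. */
static float slow_cos(float x)
{
    trig_calls++;
    return 1.0f - 0.5f * x * x;
}

/* Recomputes the cosine for every term that needs it. */
static float blend_naive(float a, float b, float theta)
{
    return a * slow_cos(theta) + b * slow_cos(theta);
}

/* Computes the cosine once and reuses the cached value. */
static float blend_cached(float a, float b, float theta)
{
    float ct = slow_cos(theta);
    return a * ct + b * ct;
}
```

Both versions return the same result, but the cached one calls the slow routine once instead of twice. In a filter with many trig terms per iteration the difference adds up.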
You also need to scale, which is what fixed-point means, and you need to take more care with overflow/underflow.
Taking the example above, you read an integer value 1..1023, and multiply by .0175*PI/180 = 0.000305432
That gives you floating point numbers from 0.000305432 to 0.31245756 at that stage, but many of those digits are a mirage, as the gain error is ~ 5% and the reading quanta is ~ 0.1%
If instead you do this MF=(10000*.0175*pi/180)
Now you create 3.0543261 to 3124.575, or as integers, that can be 3~3125. That *10000 has moved the decimal point a fixed amount.
Because that scale is a known constant, you can even speed the Multiply here, by doing shifts and adds.
eg a very fast (3*N+N/16-N/128) has 0.0118% gain error. Checking with : N=1023; (N+N+N+(N>>4)-(N>>7)) = 3125
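As a C sketch, the shift-and-add multiply above could look like this. The function name is made up, and it is meant for non-negative readings such as the 0..1023 range discussed here.

```c
#include <stdint.h>

/* Approximates n * 3.0543261 (= 10000 * 0.0175 * pi / 180)
   as n * (3 + 1/16 - 1/128), using only shifts and adds.
   Intended for non-negative n. */
static int32_t scale_shift_add(int32_t n)
{
    return n + n + n + (n >> 4) - (n >> 7);
}
```

scale_shift_add(1023) gives 3125, matching the check above. The result is in units of 1/10000 rad/sec because of the *10000 scale.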
From there, you can further multiply and divide, & you need to check multiply does not overflow 32b and divides do not truncate too many bits.
That's why it's useful to litmus-check against what you start with, which here is ~10 bit readings.
In some cases, a 32*32 -> 64, then 64/32 mathop is useful. (Peter mentions a 64-bit intermediate above)
So yes, it is more work, but if you need a significant jump in speed, effort is needed
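The 32*32 -> 64, then 64/32 operation mentioned above can be sketched in C with a 64-bit intermediate, similar in spirit to the Forth */ operator. The function name is made up. The example in the note scales a raw reading to microradians/sec (305432/1000), which is my choice of output unit and not something from the original code.

```c
#include <stdint.h>

/* Computes x * mul / div using a 64-bit intermediate, so the product
   cannot overflow 32 bits. The caller must make sure the final
   result still fits in 32 bits. */
static int32_t scale_muldiv(int32_t x, int32_t mul, int32_t div)
{
    return (int32_t)(((int64_t)x * mul) / div);
}
```

For example, scale_muldiv(32767, 305432, 1000) is 10008090 even though the intermediate product (about 1.0e10) is too big for 32 bits.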
Fractions in fixed point are often dealt with as rational values or approximations. For example, the value 0.0175 * 180 can be dealt with ad hoc as 7/400 * 180000, then arranged to do the multiplication before the division and scaled up like that to maintain 3 digits precision.
In Spin, general precision can be had using the ** operator, high multiply, 180000 ** 75161928. That is implicitly 180000 * 75161928/(2^32). The divide by 2^32 comes from an implicit shift right by 32 bits after the multiply into 64 bits.
In any case, all the scaling and keeping track of the decimal point can be a bear in a complex problem and will make you appreciate floating point, which takes care of all that (until it doesn't!)
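For anyone working in C rather than Spin, a rough analogue of the high multiply above is below. It is an unsigned version meant for non-negative operands; the names mul_high and K_0175 are made up for this sketch.

```c
#include <stdint.h>

/* Upper 32 bits of the 64-bit product, i.e. (a * b) / 2^32. */
static uint32_t mul_high(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* 0.0175 expressed as a fraction of 2^32: round(0.0175 * 4294967296). */
#define K_0175  75161928u
```

mul_high(180000u, K_0175) gives 3150, i.e. 0.0175 * 180 carried to three decimal places, the same value as the 7/400 * 180000 arrangement.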
The trick is getting your head around fixed-point numbers. You have to convert everything over. You choose a number of decimal places, or bits, to be used for fractional values.
Convert the constants to suit this: In the case of 180/pi = 57.295779513: If choosing to use 4 decimal places in fixed point decimal then this becomes an integer of 572958, which stands in for 57.2958. If done in, say, 16.16 fixed point binary, it would be $0039 4bb8, which stands in for 57.295776367.
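A minimal sketch of the 16.16 binary format in C, using the 180/pi constant from above. The type and macro names are made up. The multiply relies on right-shifting a signed 64-bit value, which behaves as an arithmetic shift on GCC but is implementation-defined for negative values in C generally.

```c
#include <stdint.h>

typedef int32_t q16_t;                        /* 16.16 fixed point */

#define Q16_ONE  65536                        /* 1.0 in 16.16 */

/* Convert a non-negative real constant to 16.16 with rounding. */
#define Q16_CONST(x)  ((q16_t)((x) * 65536.0 + 0.5))

/* Multiply two 16.16 values; the 64-bit intermediate keeps the full product. */
static q16_t q16_mul(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * b) >> 16);
}

static const q16_t RAD_TO_DEG = Q16_CONST(57.295779513);   /* $0039 4bb8 */
```

Multiplying 0.5 rad (32768 in this format) by RAD_TO_DEG gives 1877468, which stands in for about 28.6479 degrees.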
Be wary that you are now effectively using integers, which means all calculations will naturally truncate rather than round. For positives, this rounds toward zero. For negatives, I think it depends on the operation: in C a division truncates toward zero, while an arithmetic right shift rounds toward negative infinity.
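A quick way to see the rounding behaviour is to try it. This sketch uses made-up helper names. The division rule is standard C99; the right shift of a negative value is implementation-defined in C, and GCC documents it as an arithmetic (sign-extending) shift, which is what the comment assumes.

```c
#include <stdint.h>

/* Integer division in C99 and later truncates toward zero. */
static int32_t div_trunc(int32_t x, int32_t d)
{
    return x / d;
}

/* An arithmetic right shift rounds toward negative infinity
   (assumes the compiler sign-extends, as GCC does). */
static int32_t shift_floor(int32_t x, int s)
{
    return x >> s;
}

/* Round to nearest by adding half the divisor first.
   Only valid for non-negative x and positive d. */
static int32_t div_round(int32_t x, int32_t d)
{
    return (x + d / 2) / d;
}
```

So -7 / 2 is -3 but -7 >> 1 is -4, which matters if shifts and divides are mixed in the same fixed-point calculation.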
Using the same format as the sine tables is sensible. So, not really your choice now.
PS: At some point you'll likely have dynamic range issues. Cross that hurdle when you get there.
PPS: Lol, I see everyone has answered. I'm gonna leave this one posted anyway.