Wolfram Alpha Expression Calculator Expression optimization - Got Lucky!
```c
if (tooth == 36) {
    _waitus(intertooth_us);
    tooth = 1;
    // microseconds in a minute divided by revs per minute
    // divided by teeth on wheel gives microseconds between teeth.
    intertooth_us = 5000000/(3*sim_rpm); // 0.95us Thanks Wolfram Alpha Expression Calculator!
    //intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us
}
```
I asked Wolfram to look at my initial expression. The numbers speak for themselves. This was from a couple of years back and I had been meaning to post this but I did not. I think the clock was either 240 or 320 MHz. Just a thought and a data point. This was compiled with Flexprop and has proved to be reliable and accurate. I'm aware this is very isolated code without a lot of context; it's for simulating a 36-1 tone wheel on an internal combustion engine for bench testing of the actual live code.
Comments
I tried the 3 backticks; obviously I missed a minor detail. Apologies. Here's the surrounding code, which runs in its own cog and is passed sim_rpm from main, where sim_rpm can be set from the terminal.
If you want more speed and precision, don't use `_waitus`. It performs a multiplication internally. Calculate the time in CPU cycles directly and you can skip that.
Backticks go at both start and end of the code block.
TY Evan! I'm not a big fan of sloppy messaging and like to avoid it.
-Mike
TY for the suggestion - so perhaps I should calc the number of cycles and then `_waitx(cycles)`? I've been grepping the hell out of things and haven't yet found the macro or code fragment which implements `_waitus` so I can look under the hood of the compiler/libraries. I know it's there!
Signed, The Mad Grepper....
-Mike
You won't find it because it is precompiled from Spin.
You can see that there are some calculations which tend to slow things down. With a small value you could just use waitcnt, which is a native instruction.
Mike
OK, now I understand. Ada's suggestion works phenomenally. What a difference! The generated code in the listing file looks, and proves to be, much more efficient and accurate. Thanks to all. I'll clean up that code fragment and post it for comment once hot tea hits my gray matter.
-Mike R.
My initial point was that Wolfram was able to simplify my original expression with a nice speed-up.
There's lots of excellent advice in this thread that had nothing to do with that, but it is much appreciated all the same. I'll be posting my bilinear interpolation code in the next several days to horrify everyone. Peace and good vibes to all...
-Mike R.
Wolfram is probably guessing that multiplies are faster executing than divides. That isn't so for the 32-bit Cordic ops.
The only reason that will be any faster is from the multiply routine doing a runtime check on the value of `sim_rpm` to know it can use the fast 16x16 `MUL` instruction instead of the slower Cordic `QMUL`. So it's not a direct optimisation but rather you're lucky Flexspin has that check.

EDIT: If you declare `sim_rpm` as a short integer then the compiler can directly use `MUL` and ditch the check. Then it is a direct optimisation.

EDIT2: Maybe there is another way it works. I haven't actually looked: where the compiler chooses between a few different algorithms for multiply. It's reasonably quick to multiply a 32x16 using a couple of `MUL`s instead of the full 32x32 `QMUL`.
Either way, declaring `sim_rpm` as a short integer will make it a little faster again.

There is no runtime check. `MUL`/`MULS` vs `QMUL` vs a function call is determined statically. `MUL`/`MULS` is used when arguments are known to be in 16-bit range, a function is used when a full 64-bit result from a signed multiply is needed, and `QMUL` otherwise. There currently isn't a 32x16 mode (it'd be decently fast but take quite a lot of instructions). `QMUL`/`GETQX` can be reordered to run other instructions in parallel, so it doesn't always take the full 50-something cycles.
Good to know.
Applies equally to `QDIV` too. So I guess somehow the optimiser is deciding that `sim_rpm` is within 16-bit range so that `MUL` gets used instead of `QMUL`.

No, in that case it's a multiply by 3, which gets converted into a shift/add sequence. Forgot that path in the previous post.
Oh, neat, I had known that too at some stage. That's pretty tight when it's a fixed multiply. Fair enough, Wolfram probably assumed that sort of optimisation would be used.
I'm pretty sure it just simplifies the expression using algebra - one should exercise caution with regard to range/precision issues of intermediate results.
I believe you're right. Wolfram had no idea its result was going to be used as source code. As for caution regarding range/precision, I ran all possible inputs through both equations, graphed them, and compared to verify there were no disagreements. I can easily see where an optimization could break things very badly.
-Mike R.
In that case, you just got lucky that it helped at all.
Oh yes! However, the conversation we've had here has helped me to further understand the P2 in a very positive and substantial way. Baby steps, baby steps, and I very much appreciate all of the comments everyone has provided. Thanks to all!
-Mike R.