Wolfram Alpha Expression Calculator Expression optimization - Got Lucky!

pmrobert · 2024-02-01 21:57

if (tooth == 36)
{
    _waitus(intertooth_us);
    tooth = 1;
    // microseconds in a minute divided by revs per minute
    // divided by teeth on wheel gives microseconds between teeth.
    intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
    //intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us
}

I asked Wolfram to look at my initial expression. The numbers speak for themselves. This was from a couple of years back and I had been meaning to post this but I did not. I think the clock was either 240 or 320 MHz. Just a thought and a data point. This was compiled with Flexprop and has proved to be reliable and accurate. I'm aware this is very isolated code without a lot of context; it's for simulating a 36-1 tone wheel on an internal combustion engine for bench testing of the actual live code.

pmrobert · 2024-02-01 22:38

I tried the 3 backticks; obviously I missed a minor detail. Apologies. Here's the surrounding code, which runs in its own cog and is passed sim_rpm from main, where sim_rpm can be set from the terminal.

void vrsim()
{
    uint32_t intertooth_us;
    uint32_t tooth = 0;
    // Initialize to something sane: 2778 = ((60000000 / 600) / 36);
    intertooth_us = 2778;
    while (1)
    {
        if (sim_rpm < 50)
        {
            sim_rpm = 400;
        }
        if (tooth < 36)
        {
            _pinh(vr_sim_pin);
            _waitus(49);
            _pinl(vr_sim_pin);
            _waitus(intertooth_us - 51);
            tooth++;
        }
        if (tooth == 36)
        {
            _waitus(intertooth_us);
            tooth = 1;
            // microseconds in a minute divided by revs per minute
            // divided by teeth on wheel gives microseconds between teeth.
            intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
            //intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us
        }
    }
}

Wuerfel_21 · 2024-02-02 01:31

If you want more speed and precision, don't use waitus. waitus performs a multiplication internally. Calculate the time in CPU cycles directly and you can skip that.
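
As a rough sketch of what "calculate the time in CPU cycles directly" could look like for this tooth-wheel case: the helper below is hypothetical (its name and signature are not from FlexProp or the thread) and is written as portable C so it can be checked off-target. It assumes a 36-position wheel and folds the constants the same way as the microsecond version, so ticks = clkfreq * 60 / (36 * rpm) = clkfreq * 5 / (3 * rpm). On the P2 the result would presumably be handed to a cycle-based wait.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a FlexProp API): system-clock ticks between
   teeth of a 36-position wheel at the given rpm.
   ticks = clkfreq * 60 / (36 * rpm) = clkfreq * 5 / (3 * rpm) */
static uint32_t intertooth_ticks(uint32_t clkfreq, uint32_t rpm)
{
    assert(rpm > 0);  /* avoid division by zero */
    /* 64-bit intermediate so clkfreq * 5 cannot overflow 32 bits.
       Assumption: the quotient fits in 32 bits, which holds for the
       clock speeds and rpm range discussed here. */
    return (uint32_t)(((uint64_t)clkfreq * 5u) / (3u * (uint64_t)rpm));
}
```

At 240 MHz and 600 rpm this yields 666666 ticks. Rounding to whole microseconds first gives 2777 us, or 666480 ticks at 240 ticks per microsecond, so working in ticks also keeps a little more precision.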

evanh · 2024-02-02 10:02

Backticks go at both start and end of the code block.

pmrobert · 2024-02-02 10:05

@evanh said:
Backticks go at both start and end of the code block.

TY Evan! I'm not a big fan of sloppy messaging and like to avoid it.

-Mike

pmrobert · 2024-02-02 10:52

@Wuerfel_21 said:
If you want more speed and precision, don't use waitus. waitus performs a multiplication internally. Calculate the time in CPU cycles directly and you can skip that.

TY for the suggestion - so perhaps I should calc the number of cycles then _waitx(cycles)? I've been grepping the hell out of things and haven't yet found the macro or code fragment which implements waitus so I can look under the hood of the compiler/libraries. I know it's there!
Signed, The Mad Grepper....

-Mike

iseries · 2024-02-02 11:18

You won't find it because it is precompiled from spin.

'' pause for m microseconds
pri _waitus(m=long) | freq, c, offset
  c := _getcnt
  freq := __clkfreq_var
  repeat while m => 1000000
    waitcnt(c += freq)
    m -= 1000000
  if m > 0
    m := _muldiv64(m, freq, 1000000)
    if (__propeller__ == 1)
      ' watch out for wrapping around
      if m > 20000
        waitcnt( c + m )
    else
      waitcnt( c + m )

You can see that there are some calculations which tend to slow things down. With a small value you could just use waitcnt, which is a native instruction.
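
To make the cost of that scaling step concrete, here is a minimal C sketch of what the `_muldiv64(m, freq, 1000000)` line appears to compute. This assumes `_muldiv64(a, b, c)` means (a * b) / c with a 64-bit intermediate product, which is inferred from its name and use above rather than confirmed. The function name `us_to_ticks` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: convert a sub-second microsecond count into clock
   ticks, as the scaling step in the Spin routine above seems to do.
   Assumed meaning of _muldiv64(a, b, c): (a * b) / c in 64 bits. */
static uint32_t us_to_ticks(uint32_t us, uint32_t clkfreq)
{
    assert(us < 1000000u);  /* whole seconds are handled by the loop above */
    return (uint32_t)(((uint64_t)us * clkfreq) / 1000000u);
}
```

Every call pays for a wide multiply and a divide. If the tick count is worked out once, outside the timing-critical path, that work disappears from the wait itself.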

Mike

pmrobert · 2024-02-02 11:49

@iseries said:
You won't find it because it is precompiled from spin.
'' pause for m microseconds
pri _waitus(m=long) | freq, c, offset
  c := _getcnt
  freq := __clkfreq_var
  repeat while m => 1000000
    waitcnt(c += freq)
    m -= 1000000
  if m > 0
    m := _muldiv64(m, freq, 1000000)
    if (__propeller__ == 1)
      ' watch out for wrapping around
      if m > 20000
        waitcnt( c + m )
    else
      waitcnt( c + m )
You can see that there are some calculations which tend to slow things down. With a small value you could just use waitcnt, which is a native instruction.

Mike

OK, now I understand. Ada's suggestion works phenomenally. What a difference! The generated code in the listing file looks, and proves to be, much more efficient and accurate. Thanks to all. I'll clean up that code fragment and post for comment once hot tea hits my gray matter.

// OIP2024
#include <stdio.h>
#include <propeller2.h>

enum { _clkfreq = 240000000 };

// size of stack for other COG bytes (should be at least 128)
#define STACKSIZE 256

unsigned char stack[STACKSIZE];

volatile int sim_rpm;
int vr_sim_pin;
int cog;
int x;



void vrsim() __attribute__((cog))
    {
    uint32_t cntsperus;
    cntsperus = _clkfreq/1000000;
    uint32_t intertooth_us;
    uint32_t tooth = 0;
    //sim_rpm = 600;
    intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
    while (1)
        {
        intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
        if (tooth < 36)
            {
            _pinh(vr_sim_pin);
            _waitx(cntsperus*50);
            _pinl(vr_sim_pin);
            _waitx(cntsperus*intertooth_us);
            tooth++;
            }
        if (tooth == 36)
            {
            //_waitus(intertooth_us);
            tooth = 1;
            // microseconds in a minute divided by revs per minute
            // divided by teeth on wheel gives microseconds between teeth.
            //intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us
            intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
            _waitx(cntsperus*intertooth_us);
            }
        }
    }

// and now the main program
void main()
    {
    vr_sim_pin = 33;
    printf("Started OIP2024\n\r");
    sim_rpm = 600;
    cog = __builtin_cogstart(vrsim(), &stack[0]);
    printf("vrsim started in cog %d\n\r",cog);
    while(1)
        {
        printf("Current simrpm is %d \n\r",sim_rpm);
        printf("Enter new simrpm: ");
        scanf("%d", &sim_rpm);
        }
    }

-Mike R.

pmrobert · 2024-02-05 01:48

My initial point was that Wolfram was able to simplify the following with a nice speed up.

// microseconds in a minute divided by revs per minute
// divided by teeth on wheel gives microseconds between teeth.
intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
//intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us

Lots of excellent advice in this thread had nothing to do with that whatsoever but is so much appreciated. I'll be posting my bilinear interpolation code in the next several days to horrify everyone. Peace and good vibes to all...

-Mike R.

evanh · 2024-02-05 06:11

Wolfram is probably guessing that multiplies are faster executing than divides. That isn't so for the 32-bit Cordic ops.

The only reason that will be any faster is from the multiply routine doing a runtime check on the value of sim_rpm to know it can use the fast 16x16 MUL instruction instead of the slower Cordic QMUL. So it's not a direct optimisation but rather you're lucky Flexspin has that check.

EDIT: If you declare sim_rpm as a short integer then the compiler can directly use MUL and ditch the check. Then it is a direct optimisation.

EDIT2: Maybe there is another way it works. I haven't actually looked: Where the compiler chooses between a few different algorithms for multiply. It's reasonably quick to multiply a 32x16 using a couple of MULs instead of the full 32x32 QMUL.

Either way, declaring sim_rpm as a short integer will make it a little faster again.
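
For readers wondering what "a 32x16 using a couple of MULs" would mean arithmetically, here is a small portable C sketch. It only illustrates the decomposition and says nothing about what Flexspin actually emits. `mul16` is a made-up stand-in for a 16x16 hardware multiply with a 32-bit result.

```c
#include <stdint.h>

/* Stand-in for a 16x16 -> 32-bit hardware multiply (illustrative name). */
static uint32_t mul16(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* Low 32 bits of a (32-bit) * b (16-bit), built from two 16x16 products:
   a * b = a_lo * b + ((a_hi * b) << 16), taken modulo 2^32. */
static uint32_t mul32x16(uint32_t a, uint16_t b)
{
    uint32_t lo = mul16((uint16_t)(a & 0xFFFFu), b);
    uint32_t hi = mul16((uint16_t)(a >> 16), b);
    return lo + (hi << 16);
}
```

The result matches an ordinary 32-bit unsigned multiply because the bits shifted out of the high partial product are exactly the ones a 32-bit result would discard anyway.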

Wuerfel_21 · 2024-02-05 13:17

@evanh said:
The only reason that will be any faster is from the multiply routine doing a runtime check on the value of sim_rpm to know it can use the fast 16x16 MUL instruction instead of the slower Cordic QMUL. So it's not a direct optimisation but rather you're lucky Flexspin has that check.

EDIT: If you declare sim_rpm as a short integer then the compiler can directly use MUL and ditch the check. Then it is a direct optimisation.

EDIT2: Maybe there is another way it works. I haven't actually looked: Where the compiler chooses between a few different algorithms for multiply. It's reasonably quick to multiply a 32x16 using a couple of MULs instead of the full 32x32 QMUL.

There is no runtime check. MUL(S) vs QMUL vs function call is determined statically. MUL(S) is used when arguments are known to be in 16 bit range. function is used when a full 64 bit result from a signed multiply is needed, QMUL otherwise. There currently isn't a 32x16 mode (it'd be decently fast but take quite a lot of instructions). QMUL/GETQX can be reordered to run other instructions in parallel, so it doesn't always take the full 50-something cycles.

evanh · 2024-02-05 13:27

@Wuerfel_21 said:
There is no runtime check. MUL(S) vs QMUL vs function call is determined statically. MUL(S) is used when arguments are known to be in 16 bit range. function is used when a full 64 bit result from a signed multiply is needed, QMUL otherwise. There currently isn't a 32x16 mode (it'd be decently fast but take quite a lot of instructions).

Good to know.

QMUL/GETQX can be reordered to run other instructions in parallel, so it doesn't always take the full 50-something cycles.

Applies equally to QDIV too. So I guess somehow the optimiser is deciding that sim_rpm is within 16-bit value so that MUL gets used instead of QMUL.

Wuerfel_21 · 2024-02-05 13:30

No, in that case it's a multiply by 3, which gets converted into a shift/add sequence. Forgot that path in the previous post.
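
As a minimal illustration of that conversion, written in C rather than as the instructions the compiler actually generates (which are not shown in this thread), a multiply by the constant 3 can be expressed as one shift and one add. The function name is made up for the example.

```c
#include <stdint.h>

/* x * 3 rewritten as (x * 2) + x: one shift and one add, no multiplier. */
static uint32_t times3(uint32_t x)
{
    return (x << 1) + x;
}
```

So in `5000000/(3*sim_rpm)` the multiply is nearly free, and the expression needs one divide where the original `(60000000 / sim_rpm) / 36` needs two.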

evanh · 2024-02-05 13:49

Oh, neat, I had known that too at some stage. That's pretty tight when it's a fixed multiply. Fair enough, Wolfram probably assumed that sort of optimisation would be used.

Wuerfel_21 · 2024-02-05 14:47

I'm pretty sure it just simplifies the expression using algebra - one should possibly exercise caution with regards to range/precision issues of intermediate results.

pmrobert · 2024-02-05 15:25

@Wuerfel_21 said:
I'm pretty sure it just simplifies the expression using algebra - one should possibly exercise caution with regards to range/precision issues of intermediate results.

I believe you're right. Wolfram had no idea its result was going to be used as source code. Yes, as far as caution re range/precision goes, I ran all possible inputs through both equations, then graphed and compared them to verify there were no disagreements. I can easily see where an optimization could break things very badly.
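
A check along those lines might look like the sketch below. It is not the code actually used for the comparison; the function name and the tested range are assumptions. It counts the rpm values for which the two integer expressions disagree.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative check: number of rpm values in [lo, hi] for which the
   original and simplified expressions give different integer results. */
static uint32_t count_mismatches(uint32_t lo, uint32_t hi)
{
    /* rpm must be nonzero, and 3 * rpm must not overflow 32 bits */
    assert(lo >= 1u && hi < UINT32_MAX / 3u);
    uint32_t bad = 0;
    for (uint32_t rpm = lo; rpm <= hi; rpm++) {
        uint32_t original   = (60000000u / rpm) / 36u;
        uint32_t simplified = 5000000u / (3u * rpm);
        if (original != simplified) {
            bad++;
        }
    }
    return bad;
}
```

The two forms agree for every rpm in that range because 60000000/36 and 5000000/3 are the same ratio and nested truncating division by positive integers matches a single division by their product. The range caveat only bites once `3*sim_rpm` overflows, far beyond any plausible engine speed.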

-Mike R.

evanh · 2024-02-05 22:14

In that case, you just got lucky that it helped at all.

pmrobert · 2024-02-06 01:50

@evanh said:
In that case, you just got lucky that it helped at all.

Oh yes! However the conversation we've had here has helped me to further understand the P2 in a very positive and substantial way. Baby steps, baby steps and I very much appreciate all of the comments all have provided. Thanks to all!

-Mike R.
