Wolfram Alpha Expression Calculator Expression optimization - Got Lucky! — Parallax Forums

pmrobert Posts: 673
edited 2024-02-13 11:33 in C/C++
if (tooth == 36)
{
    _waitus(intertooth_us);
    tooth = 1;
    // microseconds in a minute divided by revs per minute
    // divided by teeth on wheel gives microseconds between teeth.
    intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
    //intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us
}

I asked Wolfram to look at my initial expression. The numbers speak for themselves. This is from a couple of years back; I had been meaning to post it but never did. I think the clock was either 240 or 320 MHz. Just a thought and a data point. This was compiled with Flexprop and has proved to be reliable and accurate. I'm aware this is very isolated code without a lot of context; it's for simulating a 36-1 tone wheel on an internal combustion engine for bench testing of the actual live code.
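
For reference, the algebra behind the rewrite can be written out as below. The symbol $r$ is shorthand for `sim_rpm` and is not from the original post; the step is simply dividing numerator and denominator by 12.

```latex
\[
\text{intertooth\_us}
  = \frac{60\,000\,000 / r}{36}
  = \frac{60\,000\,000}{36\,r}
  = \frac{5\,000\,000}{3\,r}
\]
```

At $r = 600$ this gives $5\,000\,000 / 1800 \approx 2777.8$ microseconds, which integer division truncates to 2777.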

Comments

  • pmrobert Posts: 673
    edited 2024-02-02 10:06

    I tried the 3 backticks; obviously I missed a minor detail. Apologies. Here's the surrounding code, which runs in its own cog and is passed sim_rpm from main, where sim_rpm can be set from the terminal.

    void vrsim()
    {
        uint32_t intertooth_us;
        uint32_t tooth = 0;
        // Initialize to something sane: 2778 = ((60000000 / 600) / 36);
        intertooth_us = 2778;
        while (1)
        {
            if (sim_rpm < 50)
            {
                sim_rpm = 400;
            }
            if (tooth < 36)
            {
                _pinh(vr_sim_pin);
                _waitus(49);
                _pinl(vr_sim_pin);
                _waitus(intertooth_us - 51);
                tooth++;
            }
            if (tooth == 36)
            {
                _waitus(intertooth_us);
                tooth = 1;
                // microseconds in a minute divided by revs per minute
                // divided by teeth on wheel gives microseconds between teeth.
                intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
                //intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us
            }
        }
    }
    
  • If you want more speed and precision, don't use waitus. waitus performs a multiplication internally. Calculate the time in CPU cycles directly and you can skip that.

  • evanh Posts: 15,912

    Backticks go at both start and end of the code block.

  • @evanh said:
    Backticks go at both start and end of the code block.

    TY Evan! I'm not a big fan of sloppy messaging and like to avoid it.

    -Mike

  • @Wuerfel_21 said:
    If you want more speed and precision, don't use waitus. waitus performs a multiplication internally. Calculate the time in CPU cycles directly and you can skip that.

    TY for the suggestion - so perhaps I should calc the number of cycles then _waitx(cycles)? I've been grepping the hell out of things and haven't yet found the macro or code fragment which implements waitus so I can look under the hood of the compiler/libraries. I know it's there!
    Signed, The Mad Grepper....

    -Mike
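
The idea raised here (work out the cycle count yourself, then wait on cycles) might be sketched as below. This is only an illustration: `us_to_ticks` and `CLKFREQ_HZ` are hypothetical names, and the 240 MHz clock is an assumption taken from the "240 or 320 MHz" remark in the first post. The P2 call itself is left out so the sketch builds on any host compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed clock frequency: 240 MHz (the first post says 240 or 320 MHz). */
#define CLKFREQ_HZ 240000000u

/* Hypothetical helper: microseconds -> clock ticks, using a precomputed
   ticks-per-microsecond factor instead of a 64-bit multiply/divide. */
static uint32_t us_to_ticks(uint32_t us)
{
    const uint32_t ticks_per_us = CLKFREQ_HZ / 1000000u;  /* 240 at 240 MHz */
    assert(us <= UINT32_MAX / ticks_per_us);  /* product must fit in 32 bits */
    return us * ticks_per_us;
}
```

On the P2 the result would then be handed to `_waitx`, for example `_waitx(us_to_ticks(50))`. The shortcut only works cleanly when the clock is a whole number of MHz; otherwise the integer division loses the fractional ticks per microsecond.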

  • iseries Posts: 1,492

    You won't find it because it is precompiled from spin.

    '' pause for m microseconds
    pri _waitus(m=long) | freq, c, offset
      c := _getcnt
      freq := __clkfreq_var
      repeat while m => 1000000
        waitcnt(c += freq)
        m -= 1000000
      if m > 0
        m := _muldiv64(m, freq, 1000000)
        if (__propeller__ == 1)
          ' watch out for wrapping around
          if m > 20000
            waitcnt( c + m )
        else
          waitcnt( c + m )
    

    You can see that there are some calculations which tend to slow things down. With a small value you could just use waitcnt, which is a native instruction.

    Mike
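
To make the cost in that listing concrete, here is a rough host-side C sketch of what a multiply-then-divide with a 64-bit intermediate looks like. `muldiv64_sketch` is a hypothetical stand-in, not the library's actual `_muldiv64`, and its behaviour (compute `a * b / c` without overflowing the product) is an assumption based on how the Spin code above uses it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for _muldiv64: (a * b) / c with a 64-bit
   intermediate, so the product cannot overflow 32 bits. */
static uint32_t muldiv64_sketch(uint32_t a, uint32_t b, uint32_t c)
{
    assert(c != 0);                       /* guard against divide by zero */
    uint64_t product = (uint64_t)a * b;   /* widen before multiplying */
    return (uint32_t)(product / c);       /* caller ensures the quotient fits */
}
```

For example, 2778 microseconds at an assumed 240 MHz needs the product 2778 * 240000000, which is far above 2^32. That is why the general-purpose routine pays for a wide multiply and a divide on every call.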

  • pmrobert Posts: 673
    edited 2024-02-02 16:40

    @iseries said:
    You won't find it because it is precompiled from spin.

    '' pause for m microseconds
    pri _waitus(m=long) | freq, c, offset
      c := _getcnt
      freq := __clkfreq_var
      repeat while m => 1000000
        waitcnt(c += freq)
        m -= 1000000
      if m > 0
        m := _muldiv64(m, freq, 1000000)
        if (__propeller__ == 1)
          ' watch out for wrapping around
          if m > 20000
            waitcnt( c + m )
        else
          waitcnt( c + m )
    

    You can see that there are some calculations which tend to slow things down. With a small value you could just use waitcnt, which is a native instruction.

    Mike

    OK, now I understand. Ada's suggestion works phenomenally. What a difference! The generated code in the listing file looks, and proves to be, much more efficient and accurate. Thanks to all. I'll clean up that code fragment and post it for comment once hot tea hits my gray matter.

    // OIP2024
    #include <stdio.h>
    #include <stdint.h>   // uint32_t
    #include <propeller2.h>
    
    enum { _clkfreq = 240000000 };
    
    // size of stack for other COG bytes (should be at least 128)
    //#define STACKSIZE 256
    
    unsigned char stack[256];  // at least 128 bytes, per the note above
    
    volatile int sim_rpm;
    int vr_sim_pin;
    int cog;
    int x;
    
    
    
    void vrsim() __attribute__((cog))
        {
        uint32_t cntsperus;
        cntsperus = _clkfreq/1000000;
        uint32_t intertooth_us;
        uint32_t tooth = 0;
        //sim_rpm = 600;
        intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
        while (1)
            {
            intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
            if (tooth < 36)
                {
                _pinh(vr_sim_pin);
                _waitx(cntsperus*50);
                _pinl(vr_sim_pin);
                _waitx(cntsperus*(intertooth_us - 50));  // low time = tooth period minus the 50us high pulse
                tooth++;
                }
            if (tooth == 36)
                {
                //_waitus(intertooth_us);
                tooth = 1;
                // microseconds in a minute divided by revs per minute
                // divided by teeth on wheel gives microseconds between teeth.
                //intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us
                intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
                _waitx(cntsperus*intertooth_us);
                }
            }
        }
    
    // and now the main program
    void main()
        {
        vr_sim_pin = 33;
        printf("Started OIP2024\n\r");
        sim_rpm = 600;
        cog = __builtin_cogstart(vrsim(), &stack[0]);
        printf("vrsim started in cog %d\n\r",cog);
        while(1)
            {
            printf("Current simrpm is %d \n\r",sim_rpm);
            printf("Enter new simrpm: ");
            scanf("%d", &sim_rpm);
            }
        }
    

    -Mike R.

  • pmrobert Posts: 673
    edited 2024-02-05 01:49

    My initial point was that Wolfram was able to simplify the following with a nice speed up.

    // microseconds in a minute divided by revs per minute
    // divided by teeth on wheel gives microseconds between teeth.
    intertooth_us = 5000000/(3*sim_rpm);  // 0.95us Thanks Wolfram Alpha Expression Calculator!
    //intertooth_us = ((60000000 / sim_rpm) / 36); // 5.35us
    

    Lots of excellent advice in this thread had nothing to do with that whatsoever, but it is very much appreciated. I'll be posting my bilinear interpolation code in the next several days to horrify everyone. Peace and good vibes to all...

    -Mike R.

  • evanh Posts: 15,912
    edited 2024-02-05 06:29

    Wolfram is probably guessing that multiplies are faster executing than divides. That isn't so for the 32-bit Cordic ops.

    The only reason that will be any faster is from the multiply routine doing a runtime check on the value of sim_rpm to know it can use the fast 16x16 MUL instruction instead of the slower Cordic QMUL. So it's not a direct optimisation but rather you're lucky Flexspin has that check.

    EDIT: If you declare sim_rpm as a short integer then the compiler can directly use MUL and ditch the check. Then it is a direct optimisation.

    EDIT2: Maybe there is another way it works. I haven't actually looked: Where the compiler chooses between a few different algorithms for multiply. It's reasonably quick to multiply a 32x16 using a couple of MULs instead of the full 32x32 QMUL.

    Either way, declaring sim_rpm as a short integer will make it a little faster again.

  • @evanh said:
    The only reason that will be any faster is from the multiply routine doing a runtime check on the value of sim_rpm to know it can use the fast 16x16 MUL instruction instead of the slower Cordic QMUL. So it's not a direct optimisation but rather you're lucky Flexspin has that check.

    EDIT: If you declare sim_rpm as a short integer then the compiler can directly use MUL and ditch the check. Then it is a direct optimisation.

    EDIT2: Maybe there is another way it works. I haven't actually looked: Where the compiler chooses between a few different algorithms for multiply. It's reasonably quick to multiply a 32x16 using a couple of MULs instead of the full 32x32 QMUL.

    There is no runtime check. MUL(S) vs QMUL vs function call is determined statically. MUL(S) is used when arguments are known to be in 16 bit range. A function call is used when a full 64 bit result from a signed multiply is needed, QMUL otherwise. There currently isn't a 32x16 mode (it'd be decently fast but take quite a lot of instructions). QMUL/GETQX can be reordered to run other instructions in parallel, so it doesn't always take the full 50-something cycles.

  • evanh Posts: 15,912

    @Wuerfel_21 said:
    There is no runtime check. MUL(S) vs QMUL vs function call is determined statically. MUL(S) is used when arguments are known to be in 16 bit range. function is used when a full 64 bit result from a signed multiply is needed, QMUL otherwise. There currently isn't a 32x16 mode (it'd be decently fast but take quite a lot of instructions).

    Good to know.

    QMUL/GETQX can be reordered to run other instructions in parallel, so it doesn't always take the full 50-something cycles.

    Applies equally to QDIV too. So I guess somehow the optimiser is deciding that sim_rpm is within 16-bit value so that MUL gets used instead of QMUL.

  • No, in that case it's a multiply by 3, which gets converted into a shift/add sequence. Forgot that path in the previous post.
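
As a small illustration of the shift/add conversion mentioned here, multiplying by 3 can be written as one shift and one add. This is a hand-written sketch with hypothetical function names; it is not the compiler's actual output.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 3 as shift-and-add: 3*x == (x << 1) + x.
   Unsigned arithmetic wraps the same way in both forms. */
static uint32_t mul3_shift_add(uint32_t x)
{
    return (x << 1) + x;
}

/* Self-check: compare against an ordinary multiply for x in [0, limit). */
static void check_mul3(uint32_t limit)
{
    for (uint32_t x = 0; x < limit; x++) {
        assert(mul3_shift_add(x) == 3u * x);
    }
}
```

With a constant multiplier the whole operation stays in a couple of simple ALU instructions, so no Cordic multiply is involved for the `3*sim_rpm` term.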

  • evanh Posts: 15,912

    Oh, neat, I had known that too at some stage. That's pretty tight when it's a fixed multiply. Fair enough, Wolfram probably assumed that sort of optimisation would be used.

  • I'm pretty sure it just simplifies the expression using algebra - one should possibly exercise caution with regards to range/precision issues of intermediate results.

  • @Wuerfel_21 said:
    I'm pretty sure it just simplifies the expression using algebra - one should possibly exercise caution with regards to range/precision issues of intermediate results.

    I believe you're right. Wolfram had no idea its result was going to be used as source code. As for caution regarding range/precision: I ran all possible inputs through both equations, then graphed and compared the results to verify there were no disagreements. I can easily see where an optimization could break things very badly.

    -Mike R.
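
A check along those lines could look roughly like the host-side sketch below. The function names are hypothetical, and the range limit is an assumption chosen so that `3 * rpm` cannot overflow; the poster's actual test range and method are not given in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Original expression from the thread. */
static uint32_t intertooth_original(uint32_t rpm)
{
    assert(rpm > 0);
    return (60000000u / rpm) / 36u;
}

/* Simplified expression from the thread. */
static uint32_t intertooth_simplified(uint32_t rpm)
{
    assert(rpm > 0);
    return 5000000u / (3u * rpm);
}

/* Count the rpm values in [lo, hi] where the two expressions disagree. */
static uint32_t count_mismatches(uint32_t lo, uint32_t hi)
{
    assert(lo > 0 && hi <= 1000000u);  /* assumed bound; keeps 3*rpm in range */
    uint32_t mismatches = 0;
    for (uint32_t rpm = lo; rpm <= hi; rpm++) {
        if (intertooth_original(rpm) != intertooth_simplified(rpm)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Calling `count_mismatches(1, 100000)` should report zero: truncating twice in `(60000000 / rpm) / 36` gives the same result as truncating once in `60000000 / (36 * rpm)`, and that fraction equals `5000000 / (3 * rpm)` exactly.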

  • evanh Posts: 15,912

    In that case, you just got lucky that it helped at all.

  • @evanh said:
    In that case, you just got lucky that it helped at all.

    Oh yes! However the conversation we've had here has helped me to further understand the P2 in a very positive and substantial way. Baby steps, baby steps and I very much appreciate all of the comments all have provided. Thanks to all!

    -Mike R.
