Ways to speed up code

Hi,

Are there any ways to speed up code within a while loop? For example, I am running a Kalman filter in a continuous loop and a WAG for loop time is maybe 50 Hz. If I wanted the loop to run at 200 Hz, what are some ways to achieve this?
«1

Comments

  • 57 Comments sorted by Date Added Votes
  • It would sure be helpful if we could see the code.
    David
    PropWare: C++ HAL (Hardware Abstraction Layer) for PropGCC; Robust build system using CMake; Integrated Simple Library, libpropeller, and libPropelleruino (Arduino port); Instructions for Eclipse and JetBrain's CLion; Example projects; Doxygen documentation
    CI Server: http://david.zemon.name:8111/?guest=1
  • Mike GreenMike Green Posts: 22,753
    edited August 8 Vote Up0Vote Down
    As David said, it would be helpful to see what you're trying to do in the loop. In general, you'll need to convert the loop to assembly to get a 4x improvement or better. If your program is in Spin, it's a little trickier since you can't put assembly code in-line. If your program is in C, the compiler can mix C and assembly easily.

    Sometimes you can get a smaller improvement by optimizing the code, eliminating redundant subscripting or other operations, changing the order of calculations, etc. You won't see a 4x change there in most cases.

    The C/C++ compiler produces faster and larger code than the Spin compiler. If you're using Spin, you can use "spin2cpp" to translate the Spin parts to C++, then do your tight loop directly in C++ and compile the whole thing using SimpleIDE.
  • Mike Green wrote: »
    As David said, it would be helpful to see what you're trying to do in the loop. In general, you'll need to convert the loop to assembly to get a 4x improvement or better. If your program is in Spin, it's a little trickier since you can't put assembly code in-line. If your program is in C, the compiler can mix C and assembly easily.
    If your program is in Spin you could try spin2cpp to make it run faster.

  • If your entire program is running in cog 0, try offloading some operations to several other cogs, especially operations that have delays in them, if possible.
    Florida, between St. Petersburg and the Gulf of Mexico

    Do not look directly into laser with remaining good eye...
  • Thank you for the suggestions. I will post my code when I get home this evening. It is written in C. It sounds like I should convert to assembly for a dramatic increase in speed. I am looking forward to learning how to do this. Is there a way to know for sure the loop time? I know Arduino has millis but haven't seen that function on the Propeller.

    David
  • Thank you for the suggestions. I will post my code when I get home this evening. It is written in C. It sounds like I should convert to assembly for a dramatic increase in speed. I am looking forward to learning how to do this. Is there a way to know for sure the loop time? I know Arduino has millis but haven't seen that function on the Propeller.

    David
    You can get some performance improvement by selecting the LMM memory model instead of CMM. However, that will make your program a lot bigger so if it's already large it may not fit in hub memory. The other thing you can do is write some of the code using the COG memory model. That code runs directly in COG memory and is nearly as fast as PASM. Finally, you can just write in PASM and get the maximum performance.

  • Look at your math. Where can operations be done at the same time? Use more COGS to do that.

    Can you accept some lag?

    Use more COGS to run the math like a pipeline.

    One at 50hz, two at 100hz, 4 at 200hz. It won't double perfectly, but will approximate those kinds of gains
    Incoming data gets buffered and handed off to the COGS using ring buffers. Same for output.

    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: http://forums.parallax.com/showthread.php?123709-Commented-Graphics_Demo.spin<br>
  • Take your inner loop (of the filter) and make a separate function out of it. Compile that using the COG memory model. Compile the other parts of the program using the CMM model. This way you don't need to convert the inner loop code into assembler. Let the compiler do the work.
  • My code inside the while loop is given below. Where would be a good place to use more cogs? Thanks for all the replies.
    while(1){
      
        // get gyroscope data
        // x-axis
        low(CS);                                            //CS low selects chip
        shift_out(SDA, SCL, MSBFIRST, 8, 0b10101000);         //Send read register address (OUT_X_L)
        xL = shift_in(SDA, SCL, MSBPRE, 8);                   //Get value from register
        high(CS);                                           //De-select chip with CS line high
        
        low(CS);
        shift_out(SDA, SCL, MSBFIRST, 8, 0b10101001);         //Send read register address (OUT_X_H)
        xH = shift_in(SDA, SCL, MSBPRE, 8);                   //Get value from register
        high(CS);
        
        gx = (short) (((xH<<8) | xL)-gxAvg)*.0175*PI/180;
        
        // y-axis
        low(CS);                                            //CS low selects chip
        shift_out(SDA, SCL, MSBFIRST, 8, 0b10101010);         //Send read register address (OUT_Y_L)
        yL = shift_in(SDA, SCL, MSBPRE, 8);                   //Get value from register
        high(CS);                                           //De-select chip with CS line high
        
        low(CS);
        shift_out(SDA, SCL, MSBFIRST, 8, 0b10101011);         //Send read register address (OUT_Y_H)
        yH = shift_in(SDA, SCL, MSBPRE, 8);                   //Get value from register
        high(CS);
        
        gy = (short) (((yH<<8) | yL)-gyAvg)*.0175*PI/180;
        
        // z-axis
        low(CS);                                            //CS low selects chip
        shift_out(SDA, SCL, MSBFIRST, 8, 0b11101100);         //Send read register address (OUT_Z_L)
        zL = shift_in(SDA, SCL, MSBPRE, 8);                   //Get value from register
        high(CS);                                           //De-select chip with CS line high
        
        low(CS);
        shift_out(SDA, SCL, MSBFIRST, 8, 0b10101101);         //Send read register address (OUT_Z_H)
        zH = shift_in(SDA, SCL, MSBPRE, 8);                   //Get value from register
        high(CS);
        
        gz = (short) (((zH<<8) | zL)-gzAvg)*.0175*PI/180;            //2's complement operation, subtract zero rate offset, multiply by 
                                                            //sensitivity value for 500deg/s, "short" needed for 2's complement
        //print("gx = %f  gy = %f  gz = %f\n",gx,gy,gz);
        pause(1);                                          //Cycle at 20Hz
      
        // get accel data
        
        //x-axis
        low(6);                                                                  // CS low selects chip
        shift_out(7, 8, MSBFIRST, 7, 0b0000110);                                  // Send read register address, z-axis
        shift_out(7, 8, MSBFIRST, 1, 0b0);                                        // Send don't care value
        
        x = shift_in(7, 8, MSBPRE, 8);                                            // Get value from register
        x = x-xOffset;
        float ax = (float) x;
        ax = ax/64;
                  
        high(6);                         
        
        //y-axis
        low(6);                                                                  // CS low selects chip
        shift_out(7, 8, MSBFIRST, 7, 0b0000111);                                  // Send read register address, z-axis
        shift_out(7, 8, MSBFIRST, 1, 0b0);                                        // Send don't care value
        
        y = shift_in(7, 8, MSBPRE, 8);                                            // Get value from register
        y = y-yOffset;
        float ay = (float) y;
        ay = ay/64;
                  
        high(6);                         
        
        //z-axis
        low(6);                                                                   // CS low selects chip
        shift_out(7, 8, MSBFIRST, 7, 0b0001000);                                  // Send read register address, z-axis
        shift_out(7, 8, MSBFIRST, 1, 0b0);                                        // Send don't care value
        
        z = shift_in(7, 8, MSBPRE, 8);                                            // Get value from register
        z = z-zOffset;
        float az = (float) z;
        az = az/64;
                  
        high(6);                                                                  // De-select chip
        // print("ax = %f  ay = %f  az = %f\n",ax,ay,az);                                                   // Display measurement
        pause(1);                                                                // Loop at 20 Hz    
      
        // Extended Kalman filter implmentation
        /////////////////////
      
        // Do Prediction Step, note: filter is not using accel measurements at this point, only gyro, so it drifts
        int N = 1;
        for(int i = 0; i < 1; i++){
        ////////////////////
        // State derivatives
        float f[2] = {
          {gx+gy*sin(phi)*tan(theta)+gz*cos(phi)*tan(theta)},
          {gy*cos(phi)-gz*sin(phi)}
        };  
        ///////////////////
        // Linearization of state derivatives
        float A[2][2] = {
          {gy*cos(phi)*tan(theta)-gz*sin(phi)*tan(theta), gz*cos(phi)*(square(tan(theta))+1)+gy*sin(phi)*(square(tan(theta))+1)},
          {-gz*cos(phi)-gy*sin(phi)                     ,                            0                                         }
        };
        //////////////////////
        // Calculate xhat
        // xhat = xhat + (Ts/N)*f
        xhat[0]=xhat[0]+(Ts/N)*f[0];
        xhat[1]=xhat[1]+(Ts/N)*f[1];
        //phi=xhat[0];
        //theta=xhat[1];
        
        /////////////////////
        // Calculate covariance
        // P = P+(Ts/N)*(A*P+P*A'+Q);
        P[0][0]=P[0][0]+(Ts/N)*(2*A[0][0]*P[0][0]+A[0][1]*P[0][1]+A[0][1]*P[1][0]+Q[0][0]);
        P[0][1]=P[0][1]+(Ts/N)*(A[0][0]*P[0][1]+A[1][0]*P[0][0]+A[0][1]*P[1][1]+A[1][1]*P[0][1]+Q[0][1]);
        P[1][0]=P[1][0]+(Ts/N)*(A[0][0]*P[1][0]+A[1][0]*P[0][0]+A[0][1]*P[1][1]+A[1][1]*P[1][0]+Q[1][0]);
        P[1][1]=P[1][1]+(Ts/N)*(A[1][0]*P[0][1]+A[1][0]*P[1][0]+2*A[1][1]*P[1][1]+Q[1][1]);
        
        //////////////////////
        /*
          // Print covariance matrix, P
          print("P =\n");
          for(int i = 0; i < 2; i++){
            for(int k = 0; k < 2; k++){
              print("%f ", P[i][k]*180/PI);
            }
            print("\n");
          } */
        }
      
        phi=xhat[0];
        theta=xhat[1];
      
        // Do Update Step
        /////////////////////
        // Declare vector of actual measurements
        float y[3][1] = {
          {ax*9.81},
          {ay*9.81},
          {az*-9.81}
        };  
        ////////////////////
        // Define mathematical model of measurements based on xhat
        float h[3][1] = {
          {gy*Va*sin(theta)+gravity*sin(theta)},
          {gz*Va*cos(theta)-gx*Va*sin(theta)-gravity*cos(theta)*sin(phi)},
          {-gy*Va*cos(theta)-gravity*cos(theta)*cos(phi)}
        };
        ///////////////////
        // Linearization of mathematical model of measurements, note: this works and gets values from xhat above
        float C[3][2] = {
          {             0,                gravity*cos(theta)+Va*gy*cos(theta)},
          {-gravity*cos(phi)*cos(theta),  gravity*sin(phi)*sin(theta)-Va*gz*sin(theta)-Va*gx*cos(theta)},
          {gravity*cos(theta)*sin(phi),   Va*gy*sin(theta)+gravity*cos(phi)*sin(theta)}
        };
        
        ///////////////////
        // Kalman gain matrix
        // L = P*C'*(C*P*C'+R)^-1
        // Compute s=C*P*C'+R
        
        s[0][0]=R[0][0] + C[0][0]*(C[0][0]*P[0][0] + C[0][1]*P[1][0]) + C[0][1]*(C[0][0]*P[0][1] + C[0][1]*P[1][1]);
        s[0][1]=R[0][1] + C[1][0]*(C[0][0]*P[0][0] + C[0][1]*P[1][0]) + C[1][1]*(C[0][0]*P[0][1] + C[0][1]*P[1][1]);
        s[0][2]=R[0][2] + C[2][0]*(C[0][0]*P[0][0] + C[0][1]*P[1][0]) + C[2][1]*(C[0][0]*P[0][1] + C[0][1]*P[1][1]);
        s[1][0]=R[1][0] + C[0][0]*(C[1][0]*P[0][0] + C[1][1]*P[1][0]) + C[0][1]*(C[1][0]*P[0][1] + C[1][1]*P[1][1]);
        s[1][1]=R[1][1] + C[1][0]*(C[1][0]*P[0][0] + C[1][1]*P[1][0]) + C[1][1]*(C[1][0]*P[0][1] + C[1][1]*P[1][1]);
        s[1][2]=R[1][2] + C[2][0]*(C[1][0]*P[0][0] + C[1][1]*P[1][0]) + C[2][1]*(C[1][0]*P[0][1] + C[1][1]*P[1][1]);
        s[2][0]=R[2][0] + C[0][0]*(C[2][0]*P[0][0] + C[2][1]*P[1][0]) + C[0][1]*(C[2][0]*P[0][1] + C[2][1]*P[1][1]);
        s[2][1]=R[2][1] + C[1][0]*(C[2][0]*P[0][0] + C[2][1]*P[1][0]) + C[1][1]*(C[2][0]*P[0][1] + C[2][1]*P[1][1]);
        s[2][2]=R[2][2] + C[2][0]*(C[2][0]*P[0][0] + C[2][1]*P[1][0]) + C[2][1]*(C[2][0]*P[0][1] + C[2][1]*P[1][1]);
        
        det_s[0] = s[0][0]*s[1][1]*s[2][2]+s[0][1]*s[1][2]*s[2][0]+s[0][2]*s[1][0]*s[2][1]    // Calculate determinant of d
                  -s[0][2]*s[1][1]*s[2][0]-s[0][1]*s[1][0]*s[2][2]-s[0][0]*s[1][2]*s[2][1];
        
        adj_s[0][0] = s[1][1]*s[2][2]-s[1][2]*s[2][1];      // Calculate adjoint of d
        adj_s[0][1] = s[0][2]*s[2][1]-s[0][1]*s[2][2];
        adj_s[0][2] = s[0][1]*s[1][2]-s[1][1]*s[0][2];
        adj_s[1][0] = s[2][0]*s[1][2]-s[1][0]*s[2][2];
        adj_s[1][1] = s[0][0]*s[2][2]-s[2][0]*s[0][2]; // adjoint code messes up accel x data --> adj_s and S were initialized with wrong size
        adj_s[1][2] = s[1][0]*s[0][2]-s[0][0]*s[1][2];
        adj_s[2][0] = s[1][0]*s[2][1]-s[2][0]*s[1][1];
        adj_s[2][1] = s[2][0]*s[0][1]-s[0][0]*s[2][1];
        adj_s[2][2] = s[0][0]*s[1][1]-s[0][1]*s[1][0];
        
        // Compute S=(C*P*C'+R)^-1
        S[0][0]=adj_s[0][0]*1/det_s[0];
        S[0][1]=adj_s[0][1]*1/det_s[0];
        S[0][2]=adj_s[0][2]*1/det_s[0];
        S[1][0]=adj_s[1][0]*1/det_s[0];
        S[1][1]=adj_s[1][1]*1/det_s[0];
        S[1][2]=adj_s[1][2]*1/det_s[0];
        S[2][0]=adj_s[2][0]*1/det_s[0];
        S[2][1]=adj_s[2][1]*1/det_s[0];
        S[2][2]=adj_s[2][2]*1/det_s[0];
        
        // Compute L = P*C'*(C*P*C'+R)^-1
        L[0][0]=S[1][0]*(C[1][0]*P[0][0]+C[1][1]*P[0][1])+S[2][0]*(C[2][0]*P[0][0]+C[2][1]*P[0][1])+C[0][1]*P[0][1]*S[0][0];
        L[0][1]=S[1][1]*(C[1][0]*P[0][0]+C[1][1]*P[0][1])+S[2][1]*(C[2][0]*P[0][0]+C[2][1]*P[0][1])+C[0][1]*P[0][1]*S[0][1];
        L[0][2]=S[1][2]*(C[1][0]*P[0][0]+C[1][1]*P[0][1])+S[2][2]*(C[2][0]*P[0][0]+C[2][1]*P[0][1])+C[0][1]*P[0][1]*S[0][2];
        L[1][0]=S[1][0]*(C[1][0]*P[1][0]+C[1][1]*P[1][1])+S[2][0]*(C[2][0]*P[1][0]+C[2][1]*P[1][1])+C[0][1]*P[1][1]*S[0][0];
        L[1][1]=S[1][1]*(C[1][0]*P[1][0]+C[1][1]*P[1][1])+S[2][1]*(C[2][0]*P[1][0]+C[2][1]*P[1][1])+C[0][1]*P[1][1]*S[0][1];
        L[1][2]=S[1][2]*(C[1][0]*P[1][0]+C[1][1]*P[1][1])+S[2][2]*(C[2][0]*P[1][0]+C[2][1]*P[1][1])+C[0][1]*P[1][1]*S[0][2];
        
        //////////////////
        // Define identity matrix the length of xhat (2x2)
        float I[2][2] = {
          {1,0},
          {0,1}
        };
        /////////////////
        
        // Covariance matrix updated with measurement information (2x2)
        // P = (I-L*C)*P
        P[0][0]=-P[1][0]*(C[0][1]*L[0][0]+C[1][1]*L[0][1]+C[2][1]*L[0][2])-P[0][0]*(C[1][0]*L[0][1]+C[2][0]*L[0][2]-1);
        P[0][1]=-P[1][1]*(C[0][1]*L[0][0]+C[1][1]*L[0][1]+C[2][1]*L[0][2])-P[0][1]*(C[1][0]*L[0][1]+C[2][0]*L[0][2]-1);
        P[1][0]=-P[1][0]*(C[0][1]*L[1][0]+C[1][1]*L[1][1]+C[2][1]*L[1][2]-1)-P[0][0]*(C[1][0]*L[1][1]+C[2][0]*L[1][2]);
        P[1][1]=-P[1][1]*(C[0][1]*L[1][0]+C[1][1]*L[1][1]+C[2][1]*L[1][2]-1)-P[0][1]*(C[1][0]*L[1][1]+C[2][0]*L[1][2]);
        
        //xhat=xhat+L*(y-h)
        xhat[0]=xhat[0]-L[0][0]*(h[0][0]-y[0][0])-L[0][1]*(h[1][0]-y[1][0])-L[0][2]*(h[2][0]-y[2][0]);
        xhat[1]=xhat[1]-L[1][0]*(h[0][0]-y[0][0])-L[1][1]*(h[1][0]-y[1][0])-L[1][2]*(h[2][0]-y[2][0]);          
        
        phi=xhat[0];   //Pitch
        theta=xhat[1]; //Roll
        
        /////////////////
          
        /*
          // Print covariance matrix, P
          print("y-h =\n");
          for(int i = 0; i < 3; i++){
            for(int k = 0; k < 1; k++){
              print("%f ", y[i][k]-h[i][k]);
            }
            print("\n");
          } 
          */
        
        //PI w/ rate feedback
       
        float phi_mod = fmod(phi+PI,2*PI)-PI;
        float theta_mod = fmod(theta+PI,2*PI)-PI;
        
        float Error=theta*180/PI;//-(-65);
        float Error_int=Error_int+Ts*Error;
        float servo=kp*Error+ki*Error_int+kd*gx*180/PI;    
        
        // Add PWM control for motors
        pwm_start(255);
    
        if(servo<0){
          low(1);
          low(3);
          pwm_set(2,0,-servo+4.); // when pwmstart = 100, set these to 1.8
          pwm_set(4,1,-servo+3.);
        }
        else if(servo>0){
          high(1);
          high(3);
          pwm_set(2,0,servo+4.);
          pwm_set(4,1,servo+2);
        }           
        
        print("servo=%f\n",servo);
      }  
    }
    
  • Mike Green wrote: »
    Take your inner loop (of the filter) and make a separate function out of it. Compile that using the COG memory model. Compile the other parts of the program using the CMM model. This way you don't need to convert the inner loop code into assembler. Let the compiler do the work.

    Mike,

    I'm sorry for my ignorance but how would I make a separate function? Your idea sounds great, I just don't know how to implement.
  • The biggest performance killer in your loop is probably the use of floating point. If you can change it to fixed point (use integers for everything, perhaps scaled by some amount) the performance will likely improve quite a bit. I'd definitely address that before trying to do anything in another cog.
  • Peter JakackiPeter Jakacki Posts: 7,711
    edited August 9 Vote Up0Vote Down
    Also you are wasting a lot of time addressing individual registers when you could send out the $28 address with bit6 set to indicate autoincrement and then read three pairs of SPI reads before deselecting. I was a bit confused about your scaling because you have .0175*PI/180 = 0.000305432 but then again I'm not familiar with gyros and what you are trying to achieve.
    In Tachyon Forth I use integer maths for speed and the simple way to multiply a number by this value is to perform a mixed multiply and division operation with a 64-bit intermediate using the */ in Forth with 305432 1000000000 */ so that only takes 38.4us.

    Out of curiosity though I converted part of your code into Tachyon, since I like to see if there is anything I can improve in Tachyon. Here are two subroutines and the start of the inner loop:
    pub GyroReg ( addr -- )
    	$40 OR SPIWRB DROP
    	;
    pub ReadGyro16 ( -- 16bits )
    	0 SPIRD 0 SPIRD 8 << OR
    	;
    pub FN1 ( val -- val2 )		305432 1,000,000,000 */ ;
    
    ( INNER LOOP - PARTIAL )
    	BEGIN
    	  \ get XYZ gyroscope data 
    	  $28 GyroReg
    	  ReadGyro16 gxAVG @ - FN1 gx !
    	  ReadGyro16 gyAVG @ - FN1 gy !
    	  ReadGyro16 gyAVG @ - FN1 gz !
              \ deselect
    	  SPICE
    

    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    --->CLICK THE LOGO for more links<---
    Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
    P2 SHORTFORM DATASHEET +++++ TAQOZ documentation
    Brisbane, Australia
  • ersmith wrote: »
    The biggest performance killer in your loop is probably the use of floating point. If you can change it to fixed point (use integers for everything, perhaps scaled by some amount) the performance will likely improve quite a bit. I'd definitely address that before trying to do anything in another cog.

    This is something I can do relatively easily. I will try this and report back with results. Thanks.
  • Genetix wrote: »

    Genetix,

    Thank you for this information.

    David
  • Peter,

    I am a relatively new to programming and using the Propeller so it will take me some time to digest what you have said. One thing I believe you are saying is to reduce the mathematical operations in my loop by hard coding values. This is something I can do pretty easily. The .0175 is a scale factor from the datasheet and I was converting the data from deg/sec (outputted from the sensor) to radians/sec. All of the calculations in the filter are done in radians.

    David
  • jmgjmg Posts: 12,052
    ersmith wrote: »
    The biggest performance killer in your loop is probably the use of floating point. If you can change it to fixed point (use integers for everything, perhaps scaled by some amount) the performance will likely improve quite a bit. I'd definitely address that before trying to do anything in another cog.

    This is something I can do relatively easily. I will try this and report back with results. Thanks.

    As well as using fixed point, simple things like splitting the serial code, into a separate COG, removes that dead time (and use burst access, as suggested above)
    eg Read over i2c is not super fast, and burst 100kHz read of 3x16 with some overhead, can consume ~ 20% of your targeted 5ms budget. 400kHz i2c is better but still ~5%

    You can also use shift and add, over multiply.
    A sensor 'gain' example I find is defined within a band of 3.5(min) < 3.9(typ) < 4.3(max) - roughly 5%, with 10b precision

    Taking your example of
    .0175*PI/180 = 0.000305432
    you can approximate that with fixed point and shifts.
    eg if we take
    MF=(10000*.0175*pi/180)
    MF = 3.0543261 - that does not prune LSBs in fixed point, and takes the 10b values expanding 0~1024 to 0~3072
    a simple << 1 and add gives ((2+1)-MF)/MF = 1.77866% gain error, & that could already be good enough, given sensor is 5% gain accurate.
    or, you can improve more
    ((2+1+1/16-1/128)-MF)/MF = 1.182941e-4 (0.0183%) or one part in 8453 gain error, way better than the sensor.
  • Peter JakackiPeter Jakacki Posts: 7,711
    edited August 9 Vote Up0Vote Down
    Peter,

    I am a relatively new to programming and using the Propeller so it will take me some time to digest what you have said. One thing I believe you are saying is to reduce the mathematical operations in my loop by hard coding values. This is something I can do pretty easily. The .0175 is a scale factor from the datasheet and I was converting the data from deg/sec (outputted from the sensor) to radians/sec. All of the calculations in the filter are done in radians.

    David

    The compiler should be reducing the expression to a constant anyway but it is still in floating point which is relatively slow since it has to be done in software whereas an integer operation using scaling before division is much faster. However your SPI routines are extremely inefficient and the compiler cannot optimize that for you but you can if you write %01101000 so as to address register $28 with the autoincrement bit set (bit 6) and thereafter simply read 2 bytes as X, 2 bytes as Y, 2 bytes as Z, then deselect chip-select. Much faster since SPI operations are also done in software. Alternatively you could have another cog continuously reading the gyro and updating global variables so that your loop doesn't have to wait for SPI.

    These are just some quick tips.



    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    --->CLICK THE LOGO for more links<---
    Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
    P2 SHORTFORM DATASHEET +++++ TAQOZ documentation
    Brisbane, Australia
  • I like ersmith's suggestion of removing floating point numbers for two reasons: 1) I agree that it should have a significant affect on performance and 2) it's something you know how to do right away. I'd start there and see how far that gets you.

    And build with lmm instead of cmm, that's a really easy one if you are not already.

    Following that, I see some other easy things to change:
    shift_out(7, 8, MSBFIRST, 7, 0b0000110);                                  // Send read register address, z-axis
    shift_out(7, 8, MSBFIRST, 1, 0b0);                                        // Send don't care value
    
    Don't send 7 bits followed by 1 bit. Just send all 8 bits at once and avoid a lot of overhead related to calling the same function twice.
    // Add PWM control for motors
    pwm_start(255);
    
    Do this outside of your loop

    Don't forget to comment out that last print line in the loop when you're testing the timing. The print function is severely limited by your UART speed.

    I'm going to ignore the huge blocks of math - it just isn't my forte. Maybe there is room for improvement, maybe not.




    Okay, so you're compiling with lmm, you've fixed the floating point to be fixed point, and you've combined your redundant shift_out calls, and you moved pwm_start, and you commented out the last print line and your math algorithms are as good as they can be.... but it still isn't fast enough. What's next?
    low(CS);
    shift_out(SDA, SCL, MSBFIRST, 8, 0b10101000);
    xL = shift_in(SDA, SCL, MSBPRE, 8);
    high(CS);
    

    These functions from simpletools.h are pretty slow. They're nice and easy to understand, and it's great that they come built in with SimpleIDE, but they are slow. There are faster options out there - namely PropWare - but it will require more effort to convert your code base, so let's save it for a last resort.
    I'm guessing based on the code comments that you might be using an L3G gyroscope, so here's PropWare's demo code for communication with an L3G: https://david.zemon.name/PropWare/api-develop/L3G_Demo_8cpp-example.xhtml
    David
    PropWare: C++ HAL (Hardware Abstraction Layer) for PropGCC; Robust build system using CMake; Integrated Simple Library, libpropeller, and libPropelleruino (Arduino port); Instructions for Eclipse and JetBrain's CLion; Example projects; Doxygen documentation
    CI Server: http://david.zemon.name:8111/?guest=1
  • amalfiCoastamalfiCoast Posts: 103
    edited August 9 Vote Up0Vote Down
    Quick question: how can I avoid the use of floats if I am using trig functions? For example, cos(60 deg)=.5, which is a float. Is there a way around this?
  • The sine table in the Prop comprises 2049 words in ROM, from $E000 to $F001. To use it you have to scale your input to 13 bits, to cover 360 degrees. Use 11 of those bits for the table lookup, and two bits to put it in the correct quadrant. The floating point object uses the table with interpolation for trig functions, and either CORDIC or a Taylor series for the inverse trig functions.
  • Tracy,

    Thanks for the reply. Can you post code showing how to implement? I'm sorry I don't have much programming experience.

    Thanks,
    David
  • Tracy AllenTracy Allen Posts: 6,276
    edited August 9 Vote Up0Vote Down
    The sine table is described on page 385 of the Propeller manual version 1.2, along with an example in PASM.

    In Spin,
    PUB sine(x) | y, q                                  ' x is phase 0--2^13 representing 0--360 degrees
      q := y >> 11                                          ' quadrant, 0 to 3
      y := (x & $7ff) << 1                                    ' 0 to 90- degrees, adjust to 11 bits, *2 for word address
      case q                                                ' by quadrant
        0 : result := word[$E000 + y]                       ' 
        1 : result := word[$F000 - y]
        2 : result := -word[$E000 + y]
        3 : result := -word[$F000 - y]
      return                                                ' word value, +/- 32768 as plus/minus unity, that is, binary radix in front of the bit 15.
    

    Interpolation is possible but slows things down. I don't do C, but it will be easy to translate.

    For cosine, shift 90 degrees (+2048). Tangent is sine/cosine, scaled as needed.
  • I'm just not understanding how I can use ints when many of my values are floating points. When multiplying by 180/pi, for example, it does not work as an int. What am I missing here?

    Thanks.
  • If all your values fit in a certain range, they can be expressed as "fixed point" rather than floating point. For example, if you know that x and y are between -32767 and +32767, they can be stored as "16.16 fixed point". Basically you multiply all of the values by 0x00010000. So for example 0.5 is stored as "0.5 * 0x010000", or "0x8000". Then addition and subtraction work the same, but when multiplying you have to shift the final result by 16 (to do "x = a * b" you have to do something like "x = (a * (int64_t)b) >> 16". This is a lot more work for the programmer, but it is enormously faster on the computer than floating point. Google "fixed point arithmetic" for more details.

    The sin/cos table in the Propeller ROM is stored as fixed point, I think with 15 bits after the decimal point.
  • Another way of speeding up your code might be to get rid of the pause() statements and re-arrange things so that the computations are taking place when you would otherwise be paused. You'll still have to make sure that there is a delay there, perhaps by using waitcnt(). I'm not sure what "pause(1)" does, but if it waits for a millisecond you could replace it with:
       uint32_t waituntil = CNT + _clkfreq / 1000; // calculate time we have to wait until
       ...
       // do some calculations here
       ...
       // now pause until at least 1 milisecond has elapsed since the calculations started
       while ((int)(CNT - waituntil) < 0)
           ;
    
  • Some other simple things you can do to speed up your code:

    (1) Try compiling with different options. As I think David said above, LMM is faster than CMM. The -O3 option is faster than the usual -Os option, although it may make even bigger code.

    (2) Make sure you use -m32bit-doubles as an option.

    (3) Try to avoid computing the same value more than once. Instead of calling cos(theta) in many places, create variables like "float ct, st;". Every time theta changes (not very often in your code, recalculate "ct = cos(theta); st = sin(theta)", and use "ct" and "st" in all the expressions in place of "cos(theta)" and "sin(theta)". At optimization level -O3 the compiler may be able to do this for you, but it doesn't hurt to try to help it out.

  • jmgjmg Posts: 12,052
    edited August 10 Vote Up0Vote Down
    I'm just not understanding how I can use ints when many of my values are floating points. When multiplying by 180/pi, for example, it does not work as an int. What am I missing here?

    Thanks.

    You need to also scale, which is what fixed-point means. - and you need to take more care with overflow/underflow.


    Taking the example above, you read an integer value 1..1023, and multiply by .0175*PI/180 = 0.000305432
    That gives you floating point numbers from 0.000305432 to 0.31245756 at that stage, but many of those digits are a mirage, as the gain error is ~ 5% and the reading quanta is ~ 0.1%

    If instead you do this MF=(10000*.0175*pi/180)
    Now you create = 3.0543261 to 3124.575, or as integers, that can be 3~3125 That *10000 has moved the decimal point a fixed amount.

    Because that scale is a known constant, you can even speed the Multiply here, by doing shifts and adds.
    eg a very fast (3*N+N/16-N/128) has 0.0183% gain error. Checking with : N=1023; (N+N+N+(N>>4)-(N>>7)) = 3125

    From there, you can further multiply and divide, & you need to check multiply does not overflow 32b and divides do not truncate too many bits.
    That's why it's useful to litmus-check against what you start with, which here is ~10 bit readings.

    In some cases, a 32*32 -> 64, then 64/32 mathop is useful. (Peter mentions a 64-bit intermediate above)

    So yes, it is more work, but if you need a significant jump in speed, effort is needed :)
  • Tracy AllenTracy Allen Posts: 6,276
    edited August 10 Vote Up0Vote Down
    I see 180/PI and PI/180 in the snippet that you posted above. Am I jumping to the wrong conclusion to think that PI=180 degrees, eh, 180/PI = PI/180 = 1?

    Fractions in fixed point are often dealt with as rational values or approximations. For example, the value 0.0175 * 180 can be dealt with ad hoc as 7/400 * 180000, then arranged to do the multiplication before the division and scaled up like that to maintain 3 digits precision.

    In Spin, general precision can be had using the ** operator, high multiply, 180000 ** 75161928. That is implicitly 180000 * 75161928/(2^32). The divide by 2^32 comes from an implicit shift right by 32 bits after the multiply into 64 bits.

    In any case, all the scaling and keeping track of the decimal point can be a bear in a complex problem and will make you appreciate floating point, which takes care of all that (until it doesn't!)

  • I'm jumping in without reading much ... David (amalfi),
    The trick is getting your head around fixed-point numbers. You have to convert everything over. You choose a number of decimal places, or bits, to be used for fractional values.

    Convert the constants to suit this: In the case of 180/pi = 57.295779513: If choosing to use 4 decimal places in fixed point decimal then this becomes an integer of 572958, which stands in for 57.2958. If done in, say, 16.16 fixed point binary, it would be $0039 4bb8, which stands in for 57.295776367.

    Be wary that you are now effectively using integers, which means all calculations will naturally truncate rounding. For positives, this rounds toward zero. For negatives, this rounds toward negative infinity I think.

    Using the same format as the sine tables is sensible. So, not really your choice now.


    PS: At some point you'll likely have dynamic range issues. Cross that hurdle when you get there.

    PPS: Lol, I see everyone has answered. I'm gonna leave this one posted anyway.
    "Are we alone in the universe?"
    "Yes," said the Oracle.
    "So there's no other life out there?"
    "There is. They're alone too."
Sign In or Register to comment.