Shop OBEX P1 Docs P2 Docs Learn Events
bsr or (int)log2 in Prop C / PASM help too (Optimization Problem) - Page 2 — Parallax Forums

bsr or (int)log2 in Prop C / PASM help too (Optimization Problem)

2»

Comments

  • That makes sense!

    -Phil
  • AribaAriba Posts: 2,682
    If you convert my PASM to C, I'm not sure how you will be able to express the rev operator.

    -Phil
    See here: forums.parallax.com/discussion/comment/1326955

    mikeologist:

    I think you just explored the magic of the GCC optimizer with the 48 cycle results. This is about the time to load a constant in LMM mode, so I guess you have the testvalues hardcoded as constants, and the compiler realizes that and optimizes the whole procedure away to a single constant load.

    Andy

  • Ariba wrote: »
    If you convert my PASM to C, I'm not sure how you will be able to express the rev operator.

    -Phil
    See here: forums.parallax.com/discussion/comment/1326955

    mikeologist:

    I think you just explored the magic of the GCC optimizer with the 48 cycle results. This is about the time to load a constant in LMM mode, so I guess you have the testvalues hardcoded as constants, and the compiler realizes that and optimizes the whole procedure away to a single constant load.

    Andy

    I agree, I'm also seeing that using PASM inline has very little gain for tiny razor sharp code. All the conversions and frame references, or whatever, are getting in the way
        unsigned int a = num_in;
        start = CNT;
        __asm__ volatile (
                  //move count start to here
                  "rev       %[x],#0           \n\t"        //Reverse the bits in x.
                  "neg       %[y],%[x]         \n\t"        //Isolate least-significant "one" bit.
                  "and       %[x],%[y] wz      \n\t"
                  "mov       %[y],#31          \n\t"
            "if_z  sub       %[y],#1           \n\t"        //Locate that "one" bit, if any.
                  "test      %[x],%[_0000ffff] wz \n\t"
            "if_z  sub       %[y],#16          \n\t"
                  "test      %[x],%[_00ff00ff] wz \n\t"
            "if_z  sub       %[y],#8           \n\t"
                  "test      %[x],%[_0f0f0f0f] wz \n\t"
            "if_z  sub       %[y],#4           \n\t"
                  "test      %[x],%[_33333333] wz \n\t"
            "if_z  sub       %[y],#2           \n\t"
                  "test      %[x],%[_55555555] wz \n\t"
            "if_z  sub       %[y],#1           \n\t"
                  //move count end to here
         
        : /*outputs (+inputs) */
          [y]         "+r" (pow2)
        : /*inputs */
          [x]         "r"  (a),
          [_0000ffff] "r"  (_0000ffff),
          [_00ff00ff] "r"  (_00ff00ff),
          [_0f0f0f0f] "r"  (_0f0f0f0f),
          [_33333333] "r"  (_33333333),
          [_55555555] "r"  (_55555555)
      
        );
        end = CNT;
      
        print("inline PASM) Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // display the results
    
    
    I made
    -Phil
    Phil's code and I think that the reason for these timing results:
    bsr)         Arg: 0 Characteristic: -1 time: 160
    log/log)     Arg: 0 Characteristic: -2147483648 time: 4528
    inline PASM) Arg: 0 Characteristic: -1 time: 544
    Ariba)       Arg: 0 Characteristic: -1 time: 48
    BST)         Arg: 0 Characteristic: -1 time: 368
    Provide an integer: 1
    bsr)         Arg: 1 Characteristic: 0 time: 1264
    log/log)     Arg: 1 Characteristic: 0 time: 5664
    inline PASM) Arg: -2147483648 Characteristic: 0 time: 544
    Ariba)       Arg: 1 Characteristic: 0 time: 48
    BST)         Arg: 1 Characteristic: 0 time: 368
    Provide an integer: 2
    bsr)         Arg: 2 Characteristic: 1 time: 1136
    log/log)     Arg: 2 Characteristic: 1 time: 16080
    inline PASM) Arg: 1073741824 Characteristic: 1 time: 544
    Ariba)       Arg: 2 Characteristic: 1 time: 48
    BST)         Arg: 2 Characteristic: 1 time: 304
    Provide an integer: 4
    bsr)         Arg: 4 Characteristic: 2 time: 1008
    log/log)     Arg: 4 Characteristic: 2 time: 16080
    inline PASM) Arg: 536870912 Characteristic: 2 time: 544
    Ariba)       Arg: 4 Characteristic: 2 time: 48
    BST)         Arg: 4 Characteristic: 2 time: 320
    Provide an integer: 8
    bsr)         Arg: 8 Characteristic: 3 time: 896
    log/log)     Arg: 8 Characteristic: 3 time: 16240
    inline PASM) Arg: 268435456 Characteristic: 3 time: 544
    Ariba)       Arg: 8 Characteristic: 3 time: 48
    BST)         Arg: 8 Characteristic: 3 time: 320
    Provide an integer: 16
    bsr)         Arg: 16 Characteristic: 4 time: 768
    log/log)     Arg: 16 Characteristic: 4 time: 16080
    inline PASM) Arg: 134217728 Characteristic: 4 time: 544
    Ariba)       Arg: 16 Characteristic: 4 time: 48
    BST)         Arg: 16 Characteristic: 4 time: 320
    Provide an integer: 32
    bsr)         Arg: 32 Characteristic: 5 time: 640
    log/log)     Arg: 32 Characteristic: 5 time: 16080
    inline PASM) Arg: 67108864 Characteristic: 5 time: 544
    Ariba)       Arg: 32 Characteristic: 5 time: 48
    BST)         Arg: 32 Characteristic: 5 time: 320
    Provide an integer: 64
    bsr)         Arg: 64 Characteristic: 6 time: 512
    log/log)     Arg: 64 Characteristic: 6 time: 16240
    inline PASM) Arg: 33554432 Characteristic: 6 time: 544
    Ariba)       Arg: 64 Characteristic: 6 time: 48
    BST)         Arg: 64 Characteristic: 6 time: 320
    Provide an integer: 128
    bsr)         Arg: 128 Characteristic: 7 time: 368
    log/log)     Arg: 128 Characteristic: 7 time: 16352
    inline PASM) Arg: 16777216 Characteristic: 7 time: 544
    Ariba)       Arg: 128 Characteristic: 7 time: 48
    BST)         Arg: 128 Characteristic: 7 time: 320
    Provide an integer: 256
    bsr)         Arg: 256 Characteristic: 8 time: 1248
    log/log)     Arg: 256 Characteristic: 8 time: 16080
    inline PASM) Arg: 8388608 Characteristic: 8 time: 544
    Ariba)       Arg: 256 Characteristic: 8 time: 48
    BST)         Arg: 256 Characteristic: 8 time: 368
    Provide an integer: 512
    bsr)         Arg: 512 Characteristic: 9 time: 1120
    log/log)     Arg: 512 Characteristic: 9 time: 16080
    inline PASM) Arg: 4194304 Characteristic: 9 time: 544
    Ariba)       Arg: 512 Characteristic: 9 time: 48
    BST)         Arg: 512 Characteristic: 9 time: 368
    Provide an integer: 8192
    bsr)         Arg: 8192 Characteristic: 13 time: 624
    log/log)     Arg: 8192 Characteristic: 12 time: 16384
    inline PASM) Arg: 262144 Characteristic: 13 time: 544
    Ariba)       Arg: 8192 Characteristic: 13 time: 48
    BST)         Arg: 8192 Characteristic: 13 time: 400
    Provide an integer: 66000
    bsr)         Arg: 66000 Characteristic: 16 time: 1264
    log/log)     Arg: 66000 Characteristic: 16 time: 68848
    inline PASM) Arg: 32768 Characteristic: 16 time: 544
    Ariba)       Arg: 66000 Characteristic: 16 time: 48
    BST)         Arg: 66000 Characteristic: 16 time: 432
    Provide an integer: 1000000
    bsr)         Arg: 1000000 Characteristic: 19 time: 896
    log/log)     Arg: 1000000 Characteristic: 19 time: 70544
    inline PASM) Arg: 4096 Characteristic: 19 time: 544
    Ariba)       Arg: 1000000 Characteristic: 19 time: 48
    BST)         Arg: 1000000 Characteristic: 19 time: 432
    Provide an integer: 20000000
    bsr)         Arg: 20000000 Characteristic: 24 time: 1264
    log/log)     Arg: 20000000 Characteristic: 24 time: 70352
    inline PASM) Arg: 128 Characteristic: 24 time: 544
    Ariba)       Arg: 20000000 Characteristic: 24 time: 48
    BST)         Arg: 20000000 Characteristic: 24 time: 432
    Provide an integer: 2000000000
    bsr)         Arg: 2000000000 Characteristic: 30 time: 528
    log/log)     Arg: 2000000000 Characteristic: 30 time: 70992
    inline PASM) Arg: 2 Characteristic: 30 time: 544
    Ariba)       Arg: 2000000000 Characteristic: 30 time: 48
    BST)         Arg: 2000000000 Characteristic: 30 time: 384
    

    I am experiencing an odd result in printing the argument after the PASM function, probably a simple mistake on my part.

    Thanks everyone for your help so far
  • Ariba wrote: »
    If you convert my PASM to C, I'm not sure how you will be able to express the rev operator.

    -Phil
    See here: forums.parallax.com/discussion/comment/1326955

    mikeologist:

    I think you just explored the magic of the GCC optimizer with the 48 cycle results. This is about the time to load a constant in LMM mode, so I guess you have the testvalues hardcoded as constants, and the compiler realizes that and optimizes the whole procedure away to a single constant load.

    Andy

    I misread your last comment and have to disagree.
    There are no hard coded values except those used in the inline PASM
    Here is the entire file:
    #include "simpletools.h"                      // scan(), print()
    #include "propeller.h"                        // CNT
    
    int main() {                                  // Entry point
      int index;                                  // location within the binary integer
      int pow2;                                   // Characteristic sans mantissa, first test
      unsigned int num_in;                                 // log2 argument
      unsigned int start;
      unsigned int end;
      long _0000ffff = 0x0000ffff;
      long _00ff00ff = 0x00ff00ff;
      long _0f0f0f0f = 0x0f0f0f0f;
      long _33333333 = 0x33333333;
      long _55555555 = 0x55555555;
     
      while(1) {                                  // Primary loop
        print("Provide an integer: ");            // Prompt user
        scan("%d\n", &num_in);                    // Collect input
        
        start = CNT;
        pow2 = -1;                                // flag value
        if(num_in != 0) {
          if(num_in < 0x0000ffff) {
            if(num_in < 0x000000ff) {
              index = 8;
            } else {
              index = 16;
            }                    
          } else {
            if(num_in < 0x00ffffff) {
              index = 24;
            } else { 
              index = 32;
            }                    
          } 
          if(pow2 == -1){                        
            while(index-- > 0) {     // While no characteristic and bits remain, decrement position
              if(num_in & (1 << index)) {             // if the current bit is on
                pow2 = index;                         // record its location as the log2 characteristic
                break;                                // and exit
              }
            }
          }        
        }      
        end = CNT;
        
        print("bsr)         Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // display the results
        
        start = CNT;
        pow2 = (int)(log(num_in)/log(2));
        end = CNT; 
        
        print("log/log)     Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // display the results
       
        unsigned int a = num_in;
        start = CNT;
        __asm__ volatile (
                  //move count start to here
                  "rev       %[x],#0           \n\t"        //Reverse the bits in x.
                  "neg       %[y],%[x]         \n\t"        //Isolate least-significant "one" bit.
                  "and       %[x],%[y] wz      \n\t"
                  "mov       %[y],#31          \n\t"
            "if_z  sub       %[y],#1           \n\t"        //Locate that "one" bit, if any.
                  "test      %[x],%[_0000ffff] wz \n\t"
            "if_z  sub       %[y],#16          \n\t"
                  "test      %[x],%[_00ff00ff] wz \n\t"
            "if_z  sub       %[y],#8           \n\t"
                  "test      %[x],%[_0f0f0f0f] wz \n\t"
            "if_z  sub       %[y],#4           \n\t"
                  "test      %[x],%[_33333333] wz \n\t"
            "if_z  sub       %[y],#2           \n\t"
                  "test      %[x],%[_55555555] wz \n\t"
            "if_z  sub       %[y],#1           \n\t"
                  //move count end to here
         
        : /*outputs (+inputs) */
          [y]         "+r" (pow2)
        : /*inputs */
          [x]         "r"  (a),
          [_0000ffff] "r"  (_0000ffff),
          [_00ff00ff] "r"  (_00ff00ff),
          [_0f0f0f0f] "r"  (_0f0f0f0f),
          [_33333333] "r"  (_33333333),
          [_55555555] "r"  (_55555555)
      
        );
        end = CNT;
      
        print("inline PASM) Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // display the results
     
        a = num_in;
        start = CNT;
        pow2 = 31;
    //    if(a == 0) {pow2 = 0;} 
    //    else {
          if(!(a & 0xFFFF0000)) {pow2 -= 16; a <<= 16;}
          if(!(a & 0xFF000000)) {pow2 -= 8; a <<= 8;}
          if(!(a & 0xF0000000)) {pow2 -= 4; a <<= 4;}
          if(!(a & 0xC0000000)) {pow2 -= 2; a <<= 2;}
          if(!(a & 0x80000000)) pow2 -= 1;
          if(a == 0) pow2 = -1;
    //    }
        end = CNT;
        
        print("Ariba)       Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // display the results
    
        pow2 = -2;
        
        
        
        start = CNT;
        if(num_in < 0x00010000) {             // 0-15
          if(num_in < 0x00000100) {           // 0-7
            if(num_in < 0x00000010) {         // 0-3
              if(num_in < 0x0000004) {        // 0-1
                if(num_in < 0x00000002) { 
                    if(num_in == 0) { pow2 = -1; }
                    else { pow2 = 0; } 
                  }                  
                  else { pow2 = 1; }                                
              } else {                        // 2-3
                if(num_in < 0x00000008) { pow2 = 2; }
                  else { pow2 = 3; }                                
              }                            
            } else {                          // 4-7
              if(num_in < 0x00000040) {       // 4-5
                if(num_in < 0x00000020) { pow2 = 4; } 
                  else { pow2 = 5; }                                
              } else {                        // 6-7
                if(num_in < 0x00000080) { pow2 = 6; } 
                  else { pow2 = 7; }                                
              }       
            }                        
          } else {                            // 8-15
            if(num_in < 0x00001000) {         // 8-11
              if(num_in < 0x00000400) {       // 8-9
                if(num_in < 0x00000200) { pow2 = 8; } 
                  else { pow2 = 9; }                                
              } else {                        // 10-11
                if(num_in < 0x00000800) { pow2 = 10; } 
                  else { pow2 = 11; }
              }                            
            } else {                          // 12-15
              if(num_in < 0x00004000) {       // 12-13
                if(num_in < 0x00002000) { pow2 = 12; } 
                  else { pow2 = 13; }                                
              } else {                        // 14-15
                if(num_in < 0x00008000) { pow2 = 14; } 
                  else { pow2 = 15; }
              } 
            }                        
          }                    
        } else {                              // 16-31
          if(num_in < 0x01000000) {           // 16-23
            if(num_in < 0x00100000) {         // 16-19
              if(num_in < 0x0040000) {        // 16-17
                if(num_in < 0x00020000) { pow2 = 16; } 
                  else { pow2 = 17; }                                
              } else {                        // 18-19
                if(num_in < 0x00080000) { pow2 = 18; } 
                  else { pow2 = 19; }                                
              }                            
            } else {                          // 20-23
              if(num_in < 0x00400000) {       // 20-21
                if(num_in < 0x00200000) { pow2 = 20; } 
                  else { pow2 = 21; }                                
              } else {                        // 22-23
                if(num_in < 0x00800000) { pow2 = 22; } 
                  else { pow2 = 23; }                                
              }       
            }                        
          } else {                            // 24-31
            if(num_in < 0x10000000) {         // 24-27
              if(num_in < 0x04000000) {       // 24-25
                if(num_in < 0x02000000) { pow2 = 24; } 
                  else { pow2 = 25; }                                
              } else {                        // 26-27
                if(num_in < 0x08000000) { pow2 = 26; } 
                  else { pow2 = 27; }
              }                            
            } else {                          // 28-31
              if(num_in < 0x40000000) {       // 28-29
                if(num_in < 0x20000000) { pow2 = 28; } 
                  else { pow2 = 29; }                                
              } else {                        // 30-31
                if(num_in < 0x80000000) { pow2 = 30; } 
                  else { pow2 = 31; }
              } 
            }                        
          }
        }                           
        end = CNT;
        
        print("BST)         Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // display the result   
      }         
    }
    
    It can't be caching the values. Your code and my tweak just run that fast.
  • AribaAriba Posts: 2,682
    Okay, I have tried your code in SimpleIDE and got the same results as you. Have you noticed that you can show the generated Assembly code with a right click on the C file?
    If you do this, you can see that it's really the optimizer that allows this result, but not in the way I thought. What happens is that the read of the start CNT value is delayed until a after a big part of the AND and SHIFT calculations, just 2 instructions before the end CNT value is read. For the optimizer the CNT and the bit caclulations are two different things and can be moved to the best position to get the smallest or fastest code.

    Here is the generated Assembly code:
      90:Tests.c       ****  
      91:Tests.c       ****     a = num_in;
     203              		.loc 1 91 0
     204 0200 1000FCA0 		mov	r4, #16
     205 0204 0000BC80 		add	r4, sp
      92:Tests.c       ****     start = CNT;
      93:Tests.c       ****     pow2 = 31;
      94:Tests.c       **** //    if(a == 0) {pow2 = 0;} 
      95:Tests.c       **** //    else {
      96:Tests.c       ****       if(!(a & 0xFFFF0000)) {pow2 -= 16; a <<= 16;}
     206              		.loc 1 96 0
     207 0208 00007C5C 		mvi	r7,#-65536
     207      0000FFFF 
      97:Tests.c       ****       if(!(a & 0xFF000000)) {pow2 -= 8; a <<= 8;}
     208              		.loc 1 97 0
     209 0210 00007C5C 		mvi	r3,#-16777216
     209      000000FF 
      91:Tests.c       ****     a = num_in;
     210              		.loc 1 91 0
     211 0218 0000BC08 		rdlong	r5, r4
     212              	.LVL19
      96:Tests.c       ****       if(!(a & 0xFFFF0000)) {pow2 -= 16; a <<= 16;}
     213              		.loc 1 96 0
     214 021c 0000BCA0 		mov	r6, r5
     215 0220 00003C62 		test	r5,r7 wz
     216 0224 1000E82C 		IF_E  shl	r6, #16
     217 0228 0F00E8A0 		IF_E  mov	r7, #15
      93:Tests.c       ****     pow2 = 31;
     218              		.loc 1 93 0
     219 022c 1F00D4A0 		IF_NE mov	r7, #31
     220              		.loc 1 97 0
     221 0230 00003C62 		test	r6,r3 wz
     222 0234 0800E82C 		IF_E  shl	r6, #8
      98:Tests.c       ****       if(!(a & 0xF0000000)) {pow2 -= 4; a <<= 4;}
     223              		.loc 1 98 0
     224 0238 00007C5C 		mvi	r3,#-268435456
     224      000000F0 
      97:Tests.c       ****       if(!(a & 0xFF000000)) {pow2 -= 8; a <<= 8;}
     225              		.loc 1 97 0
     226 0240 0800E884 		IF_E  sub	r7, #8
     227              		.loc 1 98 0
     228 0244 00003C62 		test	r6,r3 wz
     229 0248 0400E82C 		IF_E  shl	r6, #4
      99:Tests.c       ****       if(!(a & 0xC0000000)) {pow2 -= 2; a <<= 2;}
     230              		.loc 1 99 0
     231 024c 00007C5C 		mvi	r3,#-1073741824
     231      000000C0 
      98:Tests.c       ****       if(!(a & 0xF0000000)) {pow2 -= 4; a <<= 4;}
     232              		.loc 1 98 0
     233 0254 0400E884 		IF_E  sub	r7, #4
     234              		.loc 1 99 0
     235 0258 00003C62 		test	r6,r3 wz
     236 025c 0200E82C 		IF_E  shl	r6, #2
     100:Tests.c       ****       if(!(a & 0x80000000)) pow2 -= 1;
     101:Tests.c       ****       if(a == 0) pow2 = -1;
     102:Tests.c       **** //    }
     103:Tests.c       ****     end = CNT;
     104:Tests.c       ****     
     105:Tests.c       ****     print("Ariba)       Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // disp
     237              		.loc 1 105 0
     238 0260 00007C5C 		mvi	r3,#.LC6
     238      00000000 
      99:Tests.c       ****       if(!(a & 0xC0000000)) {pow2 -= 2; a <<= 2;}
     239              		.loc 1 99 0
     240 0268 0200E884 		IF_E  sub	r7, #2
     100:Tests.c       ****       if(!(a & 0x80000000)) pow2 -= 1;
     241              		.loc 1 100 0
     242 026c 00007CC3 		cmps	r6, #0 wz,wc
      92:Tests.c       ****     start = CNT;
     243              		.loc 1 92 0
     244 0270 0000BCA0 		mov	r4, CNT                         <------ Read start CNT
     245              	.LVL20
     100:Tests.c       ****       if(!(a & 0x80000000)) pow2 -= 1;
     246              		.loc 1 100 0
     247 0274 0100CC84 		IF_AE sub	r7, #1
     248              	.LVL21
     101:Tests.c       ****       if(a == 0) pow2 = -1;
     249              		.loc 1 101 0
     250 0278 00007CC3 		cmps	r6, #0 wz,wc
     103:Tests.c       ****     end = CNT;
     251              		.loc 1 103 0
     252 027c 0000BCA0 		mov	r6, CNT                         <------- Read end CNT (2 instructions in between)
     253              	.LVL22
     101:Tests.c       ****       if(a == 0) pow2 = -1;
     254              		.loc 1 101 0
     255 0280 0100E8A4 		IF_E  neg	r7,#1
     256              	.LVL23
     257              		.loc 1 105 0
     258 0284 0000BC84 		sub	r6, r4
     259              	.LVL24
     260 0288 00003C08 		wrlong	r3, sp
     261 028c 00003C08 		wrlong	r5, r14
     262 0290 0000BCA0 		mov	r5, sp
     263 0294 0800FC80 		add	r5, #8
     264 0298 00003C08 		wrlong	r7, r5
     265 029c 0000BCA0 		mov	r7, sp
     266              	.LVL25
     267 02a0 0C00FC80 		add	r7, #12
     268 02a4 00003C08 		wrlong	r6, r7
     269 02a8 00007C5C 		lcall	#_print
     269      00000000 
    

    Andy
  • Ariba wrote: »
    Okay, I have tried your code in SimpleIDE and got the same results as you. Have you noticed that you can show the generated Assembly code with a right click on the C file?
    ~~~
    Andy

    No, I don't know how to do that, but I do now.
    I'm going to build a proper library for these different approaches and try to time them as method calls instead.
    You, heater, and Phil really gave me some new tools in this thread, thank you :smile:
  • If you want to burn some memory, you can do it in about 8 instructions
    Make a 256 value lookup table starting at memory location 0 in Cog-Ram that gives you the value of the msb. Call our value under test x


    shift x right 24 bits
    if prev result was not zero then add 24 to result
    if previous result was zero, shift x right by 16 bits
    if previous result was not zero then add 16 to result
    if previous result was zero shift x right by 8 bits
    if previous result was not zero then add 8 to result
    move the source field of the previous result to the source field of the following instruction
    add the value in the location specified by the source field to the result.

    and we are done in 8 instructions, but we had to use 256 words of ram.





  • If you want to burn some memory, you can do it in about 8 instructions
    Make a 256 value lookup table starting at memory location 0 in Cog-Ram that gives you the value of the msb. Call our value under test x


    shift x right 24 bits
    if prev result was not zero then add 24 to result
    if previous result was zero, shift x right by 16 bits
    if previous result was not zero then add 16 to result
    if previous result was zero shift x right by 8 bits
    if previous result was not zero then add 8 to result
    move the source field of the previous result to the source field of the following instruction
    add the value in the location specified by the source field to the result.

    and we are done in 8 instructions, but we had to use 256 words of ram.

    Why 256 values when my desired results are 0-31 & -1? Is it a sparse table?
    Thanks for the cool contribution :smile:
  • MIchael_MichalskiMIchael_Michalski Posts: 138
    edited 2017-12-29 00:43
    This should be a bit clearer. Let x be the long word we are testing. y is a variable, a 32bit long word we use in the algorithm. R is the result, its probably stored in a 32 bit long word,but it will only ranged between 0 and 32. R will be the position of the most significant 1 when we are finished (unless I have made some sort of mistake) z is the zero flag. Its a processor flag that indicates when a result is zero. It is only set if the wz flag is set in the instruction, so we can protect its value.



    shift x right 24 bits and store it as y /this moves the most significant 8 bits into the least significant 8 bits. Bits 9-31 are set to zero.

    if y was not zero then add 24 to R (do not affect z) /If any bit was 1 in positions 24-31 then the previous result was NOT 0 and 24 is added to the result.

    if y was zero, shift x right by 16 bits /if the first shift returned zero, we move bits 16-23 to locations 0-7.
    if y was not zero then add 16 to result (do not affect z) /We have a new y! If any bit was 1 in position 16-23, then 16 is added to the result.

    if y was zero shift x right by 8 bits /if the first two shifts were zero then we move bits 8-15 into locations 0-7.

    if previous result was not zero then add 8 to result (do not affect z) /if any bit was 1 in position 8-15, then 8 is added to the result.

    move the source field of y to the source field of the following instruction. /All possibilities have been covered, so no matter what,the block of 8 bits that contained the most significant 1 is now in the lowest 8 bits of y. The result, R so far, is either 24,16,8 or 0 depending on which block of 8 bits the most signficant 1 was found. So now we are ready to use the lookup table. We move the lowest 9 bits of 7,which is the source field for all instructions, to the source field of the NEXT instruction,which is an ADD.


    add the value in the location specified by the source field to the result. / The source field of the add instruction is now y, which is simply x shifted by either 24,16 or 8 or 0 bits to the right. (whichever found the most significant 1) The 9th bit (bit 8 ) is always 0 because that's the way the shift instruction that we chose works. That's why we locate the lookup table starting at 0) We just take the table entry referenced by that lower 9 bits and add it to the result. There are actually 33 possible values that we can find in the lookup table. 1-32,and 0. (Or if you want,you can return bits 0-31 and some arbitrary negative number for "no bits set". A entry in the table simply contains the location of the leading one in its address.

    here is an example some of the table entries.
    Address Value
    00000000 00000000
    00000001 00000001
    00000010 00000010
    00000011 00000010
    00000100 00000011
    00000101 00000011
    00000110 00000011
    00000111 00000011
    00001000 00000100
    00001001 00000100
    .
    .
    .
    01111110 00000111
    10000000 00001000
    10000001 00001000
    .
    .
    .,
    11111110 00001000
    11111111 00001000


    We just add that value from the table to the result. So for instance, if X was

    F4:65:CE:01 then the first bit set is clearly #32. That means that the very first shift would NOT have returned zero so we add 24 and SKIP the second shift. And now we have a problem. I see that there is a flaw in my algorithm. While we will skip the remaining shifts we will add 16 and 8 as well,which will get the wrong answer. The problem is that we only skip the add instructions BEFORE we find the most signifcant 1. All the adds after run. So we should add 8 each time. Then 3x8 is 24, 2x8 is 16 and 1x8 is 8 and it all works.

    So now we have:


    shift x right 24 bits and store it as y

    if y was not zero then add 8 to R (do not affect z)

    if y was zero, shift x right by 16 bits

    if y was not zero then add 8 to result (do not affect z)

    if y was zero shift x right by 8 bits

    if previous result was not zero then add 8 to result (do not affect z)

    move the source field of y to the source field of the following instruction.

    add the value in the location specified by the source field to the result.



    So lets look at this again for F4:65:CE:01

    The first bit was clearly 32. That means the first shift does NOT return zero.

    So we add 8 but do not update the Z flag.

    since the Z flag is NOT set we skip the instruction shifting left by 16 bits

    since the Z flag is NOT set we add 8 (again) but do not update the z flag

    since the z flag is not set we SKIP the shift right by 8 instruction

    since the z flag is not set we add 8 (again) but do not update the z flag.

    now we move the source field to the following add instruction

    our add instruction now adds the value of memory location 0F4 to the result. 0F4 will contain 8 so we add that to R and get 32.




    But what if it was 00:65:CE:01

    The first bit is now 23

    So the first shift returns zero, so we DONT add 8.

    The next shift does NOT return zero

    so we DO add 8.

    we skip the third shift

    but the zero flag is still NOT set so we add another 8

    we then move the source field to the add instruction

    then we add the value at memory location 001100101 to the result. the value at that location will be 7, added to R,which is 16, we get 23.



    This is btw very wasteful of memory. You really use 4 bytes for each value in your lookup table because the memory is not byte addressable,you end up having to store an entire 32 bit word when you only need 8.

  • This should be a bit clearer. Let x be the long word we are testing. y is a variable, a 32bit long word we use in the algorithm. R is the result, its probably stored in a 32 bit long word,but it will only ranged between 0 and 32. R will be the position of the most significant 1 when we are finished (unless I have made some sort of mistake) z is the zero flag. Its a processor flag that indicates when a result is zero. It is only set if the wz flag is set in the instruction, so we can protect its value.



    shift x right 24 bits and store it as y /this moves the most significant 8 bits into the least significant 8 bits. Bits 9-31 are set to zero.

    if y was not zero then add 24 to R (do not affect z) /If any bit was 1 in positions 24-31 then the previous result was NOT 0 and 24 is added to the result.

    if y was zero, shift x right by 16 bits /if the first shift returned zero, we move bits 16-23 to locations 0-7.
    if y was not zero then add 16 to result (do not affect z) /We have a new y! If any bit was 1 in position 16-23, then 16 is added to the result.

    if y was zero shift x right by 8 bits /if the first two shifts were zero then we move bits 8-15 into locations 0-7.

    if previous result was not zero then add 8 to result (do not affect z) /if any bit was 1 in position 8-15, then 8 is added to the result.

    move the source field of y to the source field of the following instruction. /All possibilities have been covered, so no matter what,the block of 8 bits that contained the most significant 1 is now in the lowest 8 bits of y. The result, R so far, is either 24,16,8 or 0 depending on which block of 8 bits the most signficant 1 was found. So now we are ready to use the lookup table. We move the lowest 9 bits of 7,which is the source field for all instructions, to the source field of the NEXT instruction,which is an ADD.


    add the value in the location specified by the source field to the result. / The source field of the add instruction is now y, which is simply x shifted by either 24,16 or 8 or 0 bits to the right. (whichever found the most significant 1) The 9th bit (bit 8 ) is always 0 because that's the way the shift instruction that we chose works. That's why we locate the lookup table starting at 0) We just take the table entry referenced by that lower 9 bits and add it to the result. There are actually 33 possible values that we can find in the lookup table. 1-32,and 0. (Or if you want,you can return bits 0-31 and some arbitrary negative number for "no bits set". A entry in the table simply contains the location of the leading one in its address.

    here is an example some of the table entries.
    Address Value
    00000000 00000000
    00000001 00000001
    00000010 00000010
    00000011 00000010
    00000100 00000011
    00000101 00000011
    00000110 00000011
    00000111 00000011
    00001000 00000100
    00001001 00000100
    .
    .
    .
    01111110 00000111
    10000000 00001000
    10000001 00001000
    .
    .
    .,
    11111110 00001000
    11111111 00001000

    Thank you for the detailed explanation. I'm going to have to write it out on paper to fully understand it, just follow some cases through until it becomes completely clear.
  • Just keep in mind, while I think i got that right,there could be a mistake lurking in there.
  • MIchael_MichalskiMIchael_Michalski Posts: 138
    edited 2017-12-29 18:21
    heres another variant

    Create a 2048 byte long lookup table in hub memory starting at memory location 0. Thats an 11 bit address.

    shift x right 22 bits and store it as y (you could even shift using the carry bit,giving you a handy 33 bits)

    if y was not zero then add 11 to R (do not affect z)

    if y was zero, shift x right by 11 bits

    if y was not zero then add 11 to result (do not affect z)

    read a byte from hub memory at the address in y

    add the value you just read to the result.

    This takes 6 instructions. The hub instruction will take 8-23 clocks,but if your careful coding you can make sure that after the first access you can access the hub in only 8. Since we are worried about how long this algorithm takes,presumably we are going to be doing it repeatedly,so I assume that the execution time will normally be the 8 cycles rather than the larger number. Therefore this is roughly equivalent to doing it in 7 instructions (because of the extra 4 cycles), however you get an extra bit of resolution by using the carry bit,and don't waste all that cog ram. (You don't waste any actually because hub ram is byte accessible)



    Are the port B registers actually there just not connected to the outside world? Ive never tried it. if it is you could try two cogs.(I guess if you didnt need IO during the routine you could use port a)

    Cog 1:

    inst 0 :shift x right 22 bits and store it in the port B output register (you could even shift using the carry bit,giving you a handy 33 bits)
    inst 1:if y was not zero then add 11 to port b direction register (do not affect z)
    inst 3:if y was zero, shift x right by 11 bits and store it in the port b output register.
    inst 4:if y was not zero then add 11 to port b direction register (do not affect z)

    Cog 2:
    inst 5: read byte from hub memory at the address in the port b direction register
    inst 6: complete hub memory operation
    inst 7: add value you just read to the port b direction register.
    inst 8: free to do anything with you want,or just put a noop here. You could for instance copy the value from the direction register to cog ram.


    using the two cogs, once you have the timing aligned you could do it in 4 cycles each if your doing many in repetition. I can imagine having all of cog ram filled with unrolled loops so you'd get everything synchronized then turn it lose and it would calculate like 100 or so values at a time in around 20us. A third cog could grab the values and do whatever you wanted with them,like for example lookup logarithms and copy the results to hub ram.
  • Are the port B registers actually there just not connected to the outside world? Ive never tried it. if it is you could try two cogs.(I guess if you didnt need IO during the routine you could use port a)

    That is my understanding, the port b i/o are just registers and no pins because pins 32-63 don't exist yet. I've barely scratched the surface in PASM so I haven't used them but I did read about them in the manual.

    Thanks for all this. Regardless of which option I employ I will be sure to understand, build, and share the different methods that you've taken the time to provide here.
  • If the registers are there we can use them for fast intercog communications.
  • Heater.Heater. Posts: 21,230
    The registers are not there. You cannot use the phantom port B for inter-COG communication.
  • mikeologistmikeologist Posts: 337
    edited 2017-12-29 20:49
    Heater. wrote: »
    The registers are not there. You cannot use the phantom port B for inter-COG communication.

    Does Spin do something else with the data in outb because the register is available in Spin?
    OBJ
    
      system : "Propeller Board of Education"
      pst    : "Parallax Serial Terminal Plus"
    
    VAR
    
      long i
    
    PUB go
    
      system.Clock(80_000_000)
      
      repeat i from 0 to 15
        outb[i] := i//2
        outa[i] := 0
    
      repeat i from 16 to 31
        outa[i] := 0
        if i//3 == 1
          outb[i] := 1
        
      repeat i from 0 to 31                           
        pst.Dec(outa[i])
    
      pst.NewLine
    
      repeat i from 0 to 31                           
        pst.Dec(outb[i])
    

    produces this output:
    00000000000000000000000000000000
    01010101010101011001001001001001
    

    admittedly this says nothing about intercog comms
  • Heater.Heater. Posts: 21,230
    Well, alright, you are writing to and reading from port B from the same COG.

    If I recall correctly there is RAM at the port B location so why not?

    As an I/O register, even connected internally, not so much.
  • Heater. wrote: »
    Well, alright, you are writing to and reading from port B from the same COG.

    If I recall correctly there is RAM at the port B location so why not?

    As an I/O register, even connected internally, not so much.

    I figured you'd know. I tested it on 2 cogs and you're correct that it doesn't work
    OBJ
      system : "Propeller Board of Education"
      pst    : "Parallax Serial Terminal Plus"
    
    VAR
      long i, stack[32]
    
    PUB go
      system.Clock(80_000_000)
      
      repeat i from 0 to 15
        outb[i] := i//2
        outa[i] := 0
    
      repeat i from 16 to 31
        outa[i] := 0
        if i//3 == 1
          outb[i] := 1
    
      pst.Dec(COGNEW(copyab, @stack))
    
      printoutab
    
      copyab
    
      printoutab
    
    PUB copyab
      repeat i from 0 to 15
        outa[i] := outb[i]
      
    PUB printoutab
      pst.NewLine
      
      repeat i from 0 to 31                           
        pst.Dec(outa[i])
    
      pst.NewLine
    
      repeat i from 0 to 31                           
        pst.Dec(outb[i])
    

    produces the output:
    1
    
    00000000000000000000000000000000
    01010101010101011001001001001001
    01010101010101010000000000000000
    01010101010101011001001001001001
    

    So, is outb ram just there as a placeholder?
    Thanks
  • Heater.Heater. Posts: 21,230
    As far as I recall, yes.
  • Heater. wrote: »
    As far as I recall, yes.

    Appreciate it.
    Glad you helped me rule that out before I invested any real time in it. I will have to employ the cog ram instead as I cannot count on my pins being blank.
  • Heater. wrote: »
    As far as I recall, yes.

    Do you know how outa works by comparison?
    Is there a register in each cog or is there a single outa register that all cogs can access?
  • heres another variant

    Create a 2048 byte long lookup table in hub memory starting at memory location 0. Thats an 11 bit address.

    shift x right 22 bits and store it as y (you could even shift using the carry bit,giving you a handy 33 bits)

    if y was not zero then add 11 to R (do not affect z)

    if y was zero, shift x right by 11 bits

    if y was not zero then add 11 to result (do not affect z)

    read a byte from hub memory at the address in y

    add the value you just read to the result.

    This takes 6 instructions. The hub instruction will take 8-23 clocks,but if your careful coding you can make sure that after the first access you can access the hub in only 8. Since we are worried about how long this algorithm takes,presumably we are going to be doing it repeatedly,so I assume that the execution time will normally be the 8 cycles rather than the larger number. Therefore this is roughly equivalent to doing it in 7 instructions (because of the extra 4 cycles), however you get an extra bit of resolution by using the carry bit,and don't waste all that cog ram. (You don't waste any actually because hub ram is byte accessible)



    Are the port B registers actually there just not connected to the outside world? Ive never tried it. if it is you could try two cogs.(I guess if you didnt need IO during the routine you could use port a)

    Cog 1:

    inst 0 :shift x right 22 bits and store it in the port B output register (you could even shift using the carry bit,giving you a handy 33 bits)
    inst 1:if y was not zero then add 11 to port b direction register (do not affect z)
    inst 3:if y was zero, shift x right by 11 bits and store it in the port b output register.
    inst 4:if y was not zero then add 11 to port b direction register (do not affect z)

    Cog 2:
    inst 5: read byte from hub memory at the address in the port b direction register
    inst 6: complete hub memory operation
    inst 7: add value you just read to the port b direction register.
    inst 8: free to do anything with you want,or just put a noop here. You could for instance copy the value from the direction register to cog ram.


    using the two cogs, once you have the timing aligned you could do it in 4 cycles each if your doing many in repetition. I can imagine having all of cog ram filled with unrolled loops so you'd get everything synchronized then turn it lose and it would calculate like 100 or so values at a time in around 20us. A third cog could grab the values and do whatever you wanted with them,like for example lookup logarithms and copy the results to hub ram.

    I do plan on having multiple cogs running math routines but without the outb, my project restrictions of outa, and the fact that this will be used regularly but sporadically, this method will not work for my regular library, but I bet it will kick some serious butt in my vector library.

    Thanks a bunch.
  • Heater.Heater. Posts: 21,230
    All the details of the registers and logic involved in the Propeller I/O are clearly described in the Propeller Manual.

    You are not going to use port A for inter-cog communications either. That is a precious I/O resource you will need.

    COGS can share data through HUB. Which is slow. Which is why the Prop does not make a great parallel processing number cruncher. Which of course it was not designed to be.





  • Since port B is not connected outside the cogs, that unfortunately is out. Port A depends on whether you have a break where you dont need IO. Disabling IO would be the hard part. But only useful is particular cases. Like some old devices where all sorts of stuff would happen during vblank but you didnt care.
  • Since port B is not connected outside the cogs, that unfortunately is out. Port A depends on whether you have a break where you dont need IO. Disabling IO would be the hard part. But only useful is particular cases. Like some old devices where all sorts of stuff would happen during vblank but you didnt care.

    That's the idea with the parallel SRAM bank. I want to add a few pins to identify which cog is responding and a lock bit and I can use it for RAM, persistent memory, and inter-cog communication. I figured out a memory shortcut for a cogmem array-like thing today. I still have some work to do on it but it increased my sequential 8 bit read from 1.9MB to nearly 5MB with the 8 bit RAM; should double with the 16 bit RAM. With all that bandwidth and flag bits I can copy the data to as many cogs as will listen and the RAM at the same time. I'll just read OUTA instead of INA for input from other cogs. When data does not need to be written to RAM, such as hypervisor instructions, they can still be read on OUTA and triggered with and by the same flag bits.

    If I add a bus disconnect than I can have the memory controller page the SRAM to the SPI flash independent of the propeller and while it uses the same pins. This way the concepts are integrated by design and use the same helper methods.

    I'm trying it with 3 different controllers.

    Still rereading the manual :smile:
Sign In or Register to comment.