I think you just explored the magic of the GCC optimizer with the 48 cycle results. This is about the time to load a constant in LMM mode, so I guess you have the test values hardcoded as constants, and the compiler realizes that and optimizes the whole procedure away to a single constant load.
Andy
I agree, I'm also seeing that using PASM inline has very little gain for tiny razor-sharp code. All the conversions and frame references, or whatever, are getting in the way. I made Phil's code, and I think that's the reason for these timing results:
unsigned int a = num_in;
start = CNT;
__asm__ volatile (
//move count start to here
"rev %[x],#0 \n\t" //Reverse the bits in x.
"neg %[y],%[x] \n\t" //Isolate least-significant "one" bit.
"and %[x],%[y] wz \n\t"
"mov %[y],#31 \n\t"
"if_z sub %[y],#1 \n\t" //Locate that "one" bit, if any.
"test %[x],%[_0000ffff] wz \n\t"
"if_z sub %[y],#16 \n\t"
"test %[x],%[_00ff00ff] wz \n\t"
"if_z sub %[y],#8 \n\t"
"test %[x],%[_0f0f0f0f] wz \n\t"
"if_z sub %[y],#4 \n\t"
"test %[x],%[_33333333] wz \n\t"
"if_z sub %[y],#2 \n\t"
"test %[x],%[_55555555] wz \n\t"
"if_z sub %[y],#1 \n\t"
//move count end to here
: /*outputs (+inputs) */
[y] "+r" (pow2),
[x] "+r" (a) /* the asm overwrites x (rev/and), so it must be an in/out operand, not a plain input */
: /*inputs */
[_0000ffff] "r" (_0000ffff),
[_00ff00ff] "r" (_00ff00ff),
[_0f0f0f0f] "r" (_0f0f0f0f),
[_33333333] "r" (_33333333),
[_55555555] "r" (_55555555)
);
end = CNT;
print("inline PASM) Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // display the results
I misread your last comment and have to disagree.
There are no hard coded values except those used in the inline PASM. It can't be caching the values; your code and my tweak just run that fast.
Here is the entire file:
Okay, I have tried your code in SimpleIDE and got the same results as you. Have you noticed that you can show the generated Assembly code with a right click on the C file?
If you do this, you can see that it's really the optimizer that allows this result, but not in the way I thought. What happens is that the read of the start CNT value is delayed until after a big part of the AND and SHIFT calculations, just 2 instructions before the end CNT value is read. For the optimizer, the CNT reads and the bit calculations are two different things and can be moved to the best positions to get the smallest or fastest code.
Here is the generated Assembly code:
Andy
No, I don't know how to do that, but I do now.
I'm going to build a proper library for these different approaches and try to time them as method calls instead.
You, heater, and Phil really gave me some new tools in this thread, thank you
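On the reordering Andy describes: the "//move count start to here" and "//move count end to here" comments in the snippet above hint at one hedged fix, which is to take both CNT timestamps inside the asm block itself, so the optimizer cannot slide the measured work outside the two reads. A minimal sketch, assuming the assembler resolves CNT as the system counter register (which is how propeller.h binds it); variable names are illustrative:

unsigned int start, end;

__asm__ volatile (
    "mov %[start], CNT \n\t"    /* first timestamp, taken inside the asm     */
    /* ... the instructions under test go here ... */
    "mov %[end], CNT \n\t"      /* second timestamp, still inside the asm    */
    : [start] "=r" (start), [end] "=r" (end)
    :
    : "memory"                  /* keep surrounding memory accesses in place */
);
print("time: %d\n", end - start);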
If you want to burn some memory, you can do it in about 8 instructions
Make a 256 value lookup table starting at memory location 0 in Cog-Ram that gives you the value of the msb. Call our value under test x
shift x right 24 bits
if prev result was not zero then add 24 to result
if previous result was zero, shift x right by 16 bits
if previous result was not zero then add 16 to result
if previous result was zero shift x right by 8 bits
if previous result was not zero then add 8 to result
move the source field of the previous result to the source field of the following instruction
add the value in the location specified by the source field to the result.
and we are done in 8 instructions, but we had to use 256 words of ram.
Why 256 values when my desired results are 0-31 & -1? Is it a sparse table?
Thanks for the cool contribution
This should be a bit clearer. Let x be the long word we are testing. y is a variable, a 32-bit long word we use in the algorithm. R is the result; it's probably stored in a 32-bit long word, but it will only range between 0 and 32. R will be the position of the most significant 1 when we are finished (unless I have made some sort of mistake). z is the zero flag. It's a processor flag that indicates when a result is zero. It is only updated if the wz effect is specified in the instruction, so we can protect its value.
shift x right 24 bits and store it as y /this moves the most significant 8 bits into the least significant 8 bits. Bits 8-31 are set to zero.
if y was not zero then add 24 to R (do not affect z) /If any bit was 1 in positions 24-31 then the previous result was NOT 0 and 24 is added to the result.
if y was zero, shift x right by 16 bits /if the first shift returned zero, we move bits 16-23 to locations 0-7.
if y was not zero then add 16 to result (do not affect z) /We have a new y! If any bit was 1 in positions 16-23, then 16 is added to the result.
if y was zero shift x right by 8 bits /if the first two shifts were zero then we move bits 8-15 into locations 0-7.
if previous result was not zero then add 8 to result (do not affect z) /if any bit was 1 in positions 8-15, then 8 is added to the result.
move the source field of y to the source field of the following instruction. /All possibilities have been covered, so no matter what, the block of 8 bits that contained the most significant 1 is now in the lowest 8 bits of y. The result R so far is either 24, 16, 8 or 0, depending on which block of 8 bits the most significant 1 was found in. So now we are ready to use the lookup table. We move the lowest 9 bits of y, which is where the source field sits in every instruction, to the source field of the NEXT instruction, which is an ADD.
add the value in the location specified by the source field to the result. /The source field of the add instruction is now y, which is simply x shifted right by either 24, 16, 8 or 0 bits (whichever shift found the most significant 1). The 9th bit (bit 8) is always 0 because that's the way the shift instruction we chose works. (That's why we locate the lookup table starting at 0.) We just take the table entry referenced by those lower 9 bits and add it to the result. There are actually 33 possible values that we can find in the lookup table: 1-32, and 0. (Or if you want, you can return bits 0-31 and some arbitrary negative number for "no bits set".) An entry in the table simply contains the position of the leading one of its address.
Here is an example of some of the table entries.
Address Value
00000000 00000000
00000001 00000001
00000010 00000010
00000011 00000010
00000100 00000011
00000101 00000011
00000110 00000011
00000111 00000011
00001000 00000100
00001001 00000100
.
.
.
01111110 00000111
10000000 00001000
10000001 00001000
.
.
.
11111110 00001000
11111111 00001000
We just add that value from the table to the result. So for instance, if X was F4:65:CE:01, then the first bit set is clearly #32. That means that the very first shift would NOT have returned zero, so we add 24 and SKIP the second shift. And now we have a problem. I see that there is a flaw in my algorithm. While we will skip the remaining shifts, we will add 16 and 8 as well, which will get the wrong answer. The problem is that we only skip the add instructions BEFORE we find the most significant 1; all the adds after it run. So we should add 8 each time. Then 3x8 is 24, 2x8 is 16, and 1x8 is 8, and it all works.
So now we have:
shift x right 24 bits and store it as y
if y was not zero then add 8 to R (do not affect z)
if y was zero, shift x right by 16 bits
if y was not zero then add 8 to result (do not affect z)
if y was zero shift x right by 8 bits
if previous result was not zero then add 8 to result (do not affect z)
move the source field of y to the source field of the following instruction.
add the value in the location specified by the source field to the result.
So let's look at this again for F4:65:CE:01
The first bit was clearly 32. That means the first shift does NOT return zero.
So we add 8 but do not update the Z flag.
since the Z flag is NOT set we skip the instruction shifting right by 16 bits
since the Z flag is NOT set we add 8 (again) but do not update the z flag
since the z flag is not set we SKIP the shift right by 8 instruction
since the z flag is not set we add 8 (again) but do not update the z flag.
now we move the source field to the following add instruction
our add instruction now adds the value of memory location 0F4 to the result. 0F4 will contain 8 so we add that to R and get 32.
But what if it was 00:65:CE:01
The first bit is now 23
So the first shift returns zero, so we DON'T add 8.
The next shift does NOT return zero
so we DO add 8.
we skip the third shift
but the zero flag is still NOT set so we add another 8
we then move the source field to the add instruction
then we add the value at memory location 001100101 to the result. The value at that location will be 7; added to R, which is 16, we get 23.
This is, btw, very wasteful of memory. You really use 4 bytes for each value in your lookup table: because cog memory is not byte addressable, you end up having to store an entire 32-bit word when you only need 8 bits.
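A hedged C model of the corrected table method above may help when working it through on paper. In real PASM the last two steps are self-modifying code (movs copies the low 9 bits of y into the source field of the following add); here that becomes a plain array index. One extra fall-through is added for values below 256, which the step list leaves implicit, and all names are illustrative:

static unsigned char msb_table[256];      /* msb_table[v] = 1-based position of the
                                             highest 1 in v (1..8), 0 for v == 0     */
static void build_msb_table(void)
{
    int v, p;
    msb_table[0] = 0;
    for (v = 1; v < 256; v++) {
        for (p = 7; (v >> p) == 0; p--)   /* find the highest set bit of v           */
            ;
        msb_table[v] = (unsigned char)(p + 1);
    }
}

static int msb_1based(unsigned int x)
{
    unsigned int y = x >> 24;             /* "shift x right 24 bits, store as y" wz  */
    int R = 0, z = (y == 0);
    if (!z) R += 8;                       /* if_nz add R,#8  (does not affect z)     */
    if (z) { y = x >> 16; z = (y == 0); } /* if_z shift, wz                          */
    if (!z) R += 8;
    if (z) { y = x >> 8;  z = (y == 0); }
    if (!z) R += 8;
    if (z) y = x;                         /* x < 256: look the low byte up directly  */
    return R + msb_table[y];              /* movs + add, via the table               */
}

For F4:65:CE:01 this gives 8 + 8 + 8 + msb_table[0xF4] = 32, and for 00:65:CE:01 it gives 8 + 8 + msb_table[0x65] = 23, matching the walk-through above.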
Thank you for the detailed explanation. I'm going to have to write it out on paper to fully understand it, just follow some cases through until it becomes completely clear.
Create a 2048 byte long lookup table in hub memory starting at memory location 0. That's an 11-bit address.
shift x right 22 bits and store it as y (you could even shift using the carry bit, giving you a handy 33 bits)
if y was not zero then add 11 to R (do not affect z)
if y was zero, shift x right by 11 bits
if y was not zero then add 11 to result (do not affect z)
read a byte from hub memory at the address in y
add the value you just read to the result.
This takes 6 instructions. The hub instruction will take 8-23 clocks, but with careful coding you can make sure that after the first access you can access the hub in only 8. Since we are worried about how long this algorithm takes, presumably we are going to be doing it repeatedly, so I assume that the execution time will normally be the 8 cycles rather than the larger number. Therefore this is roughly equivalent to doing it in 7 instructions (because of the extra 4 cycles); however, you get an extra bit of resolution by using the carry bit, and you don't waste all that cog RAM. (You don't waste any, actually, because hub RAM is byte accessible.)
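For comparison, a hedged C model of this hub-table variant, built the same way as the 256-entry table but indexed by 11 bits. As before, one extra fall-through is added for values below 2^11, which the step list leaves implicit; names are illustrative:

static unsigned char hub_table[2048];     /* hub_table[v] = 1..11, 0 for v == 0   */

static void build_hub_table(void)
{
    int v, p;
    hub_table[0] = 0;
    for (v = 1; v < 2048; v++) {
        for (p = 10; (v >> p) == 0; p--)
            ;
        hub_table[v] = (unsigned char)(p + 1);
    }
}

static int msb_1based_hub(unsigned int x)
{
    unsigned int y = x >> 22;             /* 10-bit index when x >= 2^22          */
    int R = 0, z = (y == 0);
    if (!z) R += 11;
    if (z) { y = x >> 11; z = (y == 0); } /* 11-bit index when 2^11 <= x < 2^22   */
    if (!z) R += 11;
    if (z) y = x;                         /* x < 2^11: index with x directly      */
    return R + hub_table[y];              /* rdbyte from hub, then add            */
}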
Are the port B registers actually there, just not connected to the outside world? I've never tried it. If they are, you could try two cogs. (I guess if you didn't need I/O during the routine you could use port A.)
Cog 1:
inst 0: shift x right 22 bits and store it in the port B output register (you could even shift using the carry bit, giving you a handy 33 bits)
inst 1: if y was not zero then add 11 to the port B direction register (do not affect z)
inst 3: if y was zero, shift x right by 11 bits and store it in the port B output register.
inst 4: if y was not zero then add 11 to the port B direction register (do not affect z)
Cog 2:
inst 5: read a byte from hub memory at the address in the port B direction register
inst 6: complete the hub memory operation
inst 7: add the value you just read to the port B direction register.
inst 8: free to do anything you want, or just put a NOP here. You could, for instance, copy the value from the direction register to cog RAM.
Using the two cogs, once you have the timing aligned you could do it in 4 cycles each if you're doing many in repetition. I can imagine having all of cog RAM filled with unrolled loops so you'd get everything synchronized, then turn it loose, and it would calculate 100 or so values at a time in around 20us. A third cog could grab the values and do whatever you wanted with them, for example look up logarithms and copy the results to hub RAM.
That is my understanding: the port B I/O locations are just registers with no pins, because pins 32-63 don't exist yet. I've barely scratched the surface in PASM so I haven't used them, but I did read about them in the manual.
Thanks for all this. Regardless of which option I employ I will be sure to understand, build, and share the different methods that you've taken the time to provide here.
The registers are not there. You cannot use the phantom port B for inter-COG communication.
Does Spin do something else with the data in outb because the register is available in Spin?
OBJ
  system : "Propeller Board of Education"
  pst    : "Parallax Serial Terminal Plus"

VAR
  long i

PUB go
  system.Clock(80_000_000)
  repeat i from 0 to 15
    outb[i] := i//2
    outa[i] := 0
  repeat i from 16 to 31
    outa[i] := 0
    if i//3 == 1
      outb[i] := 1
  repeat i from 0 to 31
    pst.Dec(outa[i])
  pst.NewLine
  repeat i from 0 to 31
    pst.Dec(outb[i])

produces this output:
admittedly this says nothing about intercog comms
Well, alright, you are writing to and reading from port B from the same COG.
If I recall correctly there is RAM at the port B location so why not?
As an I/O register, even connected internally, not so much.
I figured you'd know. I tested it on 2 cogs and you're correct that it doesn't work.

OBJ
  system : "Propeller Board of Education"
  pst    : "Parallax Serial Terminal Plus"

VAR
  long i, stack[32]

PUB go
  system.Clock(80_000_000)
  repeat i from 0 to 15
    outb[i] := i//2
    outa[i] := 0
  repeat i from 16 to 31
    outa[i] := 0
    if i//3 == 1
      outb[i] := 1
  pst.Dec(COGNEW(copyab, @stack))
  printoutab
  copyab
  printoutab

PUB copyab
  repeat i from 0 to 15
    outa[i] := outb[i]

PUB printoutab
  pst.NewLine
  repeat i from 0 to 31
    pst.Dec(outa[i])
  pst.NewLine
  repeat i from 0 to 31
    pst.Dec(outb[i])

produces the output:
So, is outb ram just there as a placeholder?
Thanks
Appreciate it.
Glad you helped me rule that out before I invested any real time in it. I will have to employ the cog RAM instead, as I cannot count on my pins being blank.
Do you know how outa works by comparison?
Is there a register in each cog or is there a single outa register that all cogs can access?
I do plan on having multiple cogs running math routines, but without outb, with my project's restrictions on outa, and given that this will be used regularly but sporadically, this method will not work for my regular library. I bet it will kick some serious butt in my vector library, though.
Thanks a bunch.
All the details of the registers and logic involved in the Propeller I/O are clearly described in the Propeller Manual.
You are not going to use port A for inter-cog communications either. That is a precious I/O resource you will need.
COGS can share data through HUB. Which is slow. Which is why the Prop does not make a great parallel processing number cruncher. Which of course it was not designed to be.
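A hedged sketch of the hub sharing heater describes, for completeness: a volatile long in hub RAM acts as a mailbox between two cogs. It assumes PropGCC's cogstart() from propeller.h; the names, the posted value, and the stack size are illustrative.

#include <propeller.h>
#include "simpletools.h"

static volatile unsigned int mailbox;     /* lives in hub RAM, visible to every cog */
static unsigned int worker_stack[64];     /* stack size is a guess for this tiny task */

static void worker(void *arg)
{
    (void)arg;
    mailbox = 12345;                      /* a hub write: the slow, shared path     */
}

int main(void)
{
    cogstart(worker, NULL, worker_stack, sizeof(worker_stack));
    while (mailbox == 0)                  /* poll hub RAM until the other cog posts */
        ;
    print("got %d from the other cog\n", mailbox);
    return 0;
}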
Since port B is not connected outside the cogs, that unfortunately is out. Port A depends on whether you have a break where you don't need I/O. Disabling I/O would be the hard part. And it's only useful in particular cases, like some old devices where all sorts of stuff would happen during vblank but you didn't care.
That's the idea with the parallel SRAM bank. I want to add a few pins to identify which cog is responding and a lock bit, and I can use it for RAM, persistent memory, and inter-cog communication. I figured out a memory shortcut for a cogmem array-like thing today. I still have some work to do on it, but it increased my sequential 8-bit read from 1.9 MB/s to nearly 5 MB/s with the 8-bit RAM; it should double with the 16-bit RAM. With all that bandwidth and flag bits I can copy the data to as many cogs as will listen and to the RAM at the same time. I'll just read OUTA instead of INA for input from other cogs. When data does not need to be written to RAM, such as hypervisor instructions, it can still be read on OUTA and triggered with and by the same flag bits.
If I add a bus disconnect then I can have the memory controller page the SRAM to the SPI flash independently of the Propeller while it uses the same pins. This way the concepts are integrated by design and use the same helper methods.
I'm trying it with 3 different controllers.
Still rereading the manual.