Bit priority in C...

Fastrobot · 2015-10-25 05:13

Hi,
I'm interested in exploring C optimizations a little bit and learn what causes various optimizations in the propeller1.
So I wanted to implement, in C, a program to find the bit number of the least significant bit of a value.
And after compiling, with gcc -Os, wanted to find out which code is the smallest and fastest.
(Perhaps I ought to compile with -O3 for more speed??

I know SPIN has a command to locate a bit, but I'm not sure how that can be called from C or even if it is very fast.

I made two complete attempts, the second routine is pretty fast... and I have a third idea of how it might be accomplished
even quicker, but I'm not sure how to code it properly using the Propellers multiply/log table.
Comments about the code, and ways to improve it would be appreciated.

My first attempt is the simple linear search:

int bottomBitNN( unsigned long value ) {
	int n=-1;
	if (!value) return n;
	while (~value&1) { ++n; value>>=1; }
	return n;
}

But compiling it to assembly proved it it wasn't all that tiny, or efficient;
There is some overhead for FCACHE load which is quite a bit of the program's time.
Even to count to two, the routine part will take ~9 instructions, + ~6
instruction times/bit counted. The jump is assumed to happen,
which means the OF_E jmp #__LMM_FCACHE_START+(.L11-.L14) is
generally optimized wrong and takes two instruction times.

This program probably could be improved by changing to a do {} while( ~value & 1 ) instead, but that would only faster by about ~2 instruction times/loop.

I get, eitehr: 9 + 6*31 = 195 or 9+4*31 = 133 optimized instruction times for finding bit 31, around 11 for bit 1.
Average score for random bit is around ~72 instruction times.

_bottomBitNN
        jmp     #__LMM_FCACHE_LOAD
        long    .L15-.L14
.L14
        mov     pc,lr
        mov     lr,__LMM_RET
'' The above Code is FCACHE -- the actual routine follows:

        cmps    r0, #0 wz,wc
        neg     r7, #1
        IF_E    jmp     #__LMM_FCACHE_START+(.L11-.L14)
.L12
        test    r0,#0x1 wz
        IF_NE   jmp     #__LMM_FCACHE_START+(.L11-.L14)
        add     r7, #1
        shr     r0, #1
        jmp     #__LMM_FCACHE_START+(.L12-.L14)
.L11
        mov     r0, r7
        jmp     lr

For my next attempt, I figured I'd code a non-looped binary search to find the least significant set bit.
I'm pretty happy with this code:

int bottomBitN( unsigned long value ) {
	// Quickly find the bit number of the bottom most bit.
	int n=0;
	if (!value) return -1;
	if (!(value&0xFFFF)) { n+=16; value>>=16; }
	if (!(value&0xFF)) { n+=8; value>>=8; }
	if (!(value&0xF)) { n+=4; value>>=4; }
	if (!(value&0x3)) { n+=2; value>>=2; }
	if (!(value&0x1)) { n+=1; value>>=1; }
	return n;
}

_bottomBitN
        cmps    r0, #0 wz,wc
        mov     r7, r0
        IF_E    brs     #.L8
        mvi     r6,#65535
        test    r0,r6 wz
        IF_E  shr       r7, #16
        mov     r6, r7
        IF_E  mov       r0, #16
        IF_NE mov       r0, #0
        and     r6,#255 wz
        IF_E  shr       r7, #8
        IF_E  add       r0, #8
        test    r7,#0xf wz
        IF_E  shr       r7, #4
        IF_E  add       r0, #4
        test    r7,#0x3 wz
        IF_E  shr       r7, #2
        IF_E  add       r0, #2
        test    r7,#0x1 wz
        IF_NE   brs     #.L3
        add     r0, #1
        lret
.L8
        neg     r0, #1
.L3

It takes 22 instructions to execute, pretty much regardless of where the bit actually is.
So, it's about 25% more code, but 3.27 times faster execution.

One of the things that makes me curious, why did the smaller looped code generate FCACHE,
but this larger chunk of linear code did not?

last attempt:
This third piece of code, I am not sure how to write for propeller GCC; I know there is a log table of some kind in the processor,
but I'm not sure how to access it from C and I'm not sure what base it is in. I think the code would look something like the following,
but without knowing exactly how the log table is laid out -- I can't finish the code as I'm sure it can't look up 32 bit numbers.

int bottomBitNNN( unsigned long value ) {
    short int* table = (short int*)(void*)0x8000; // I am not sure exactly where is the table in memory
    value=value & ~(value-1); // isolate a single bit, the LSBIT, so a base 2 logarithm is exact.
    if (!value) return -1;
    return table[value]; 
}

I suppose I could call the log2 operation from the math library, but that seems overkill.
The call overhead would likely be very slow.
Anyone have suggestions?

DavidZemon · 2015-10-25 05:38

Fastrobot wrote: »

One of the things that makes me curious, why did the smaller looped code generate FCACHE,
but this larger chunk of linear code did not?

Just a guess: GCC doesn't know how many iterations of that loop you'll be taking. So, it's going to assume lots, and that the overhead for FCACHE will pay off. Without a loop, it must be smart enough to know that it's not worth taking the time to cache.

DavidZemon · 2015-10-25 05:40

Fastrobot wrote: »

I suppose I could call the log2 operation from the math library, but that seems overkill.
The call overhead would likely be very slow.
Anyone have suggestions?

I was curious about this too. Here's my old thread: http://forums.parallax.com/discussion/160252/log-antilog-table-in-c-c and David Hein jumped in with a code sample here.

It's also important to note that the math library (anything in libm) does not use the tables in ROM, because they are not precise enough.

DavidZemon · 2015-10-25 05:41

And finally, thanks for the write up! Very informative

Fastrobot · 2015-10-27 03:06

DavidZemon wrote: »

And finally, thanks for the write up! Very informative

Thanks for the reply, David.

Discussing optimization is always a good thing.... gets the creative juices flowing....
I never would have found your thread, as doing a search on log for propeller here on the forums has such a large hit ratio for useless threads that it's essentially buried.

The result basically proves that the log table is useless, as the task of normalizing the number before taking the log automatically requires the very routine i'm trying to replace!

I see, in the thread, that David Hein actually used the exact same linear search that wastes so many CPU cycles, too. ! (ouch!)
So, the code I'm showing here could seriously improve the performance of his routine as well.

I think I can probably improve the routine I did a little more.
The case where '0' is detected is generally a waste because it's seldom called -- so I figured I could get rid of the alternate return path code, and the wrong jump optimization, by just initializing n to a number which will cancel to -1 after a futile binary search. I can also take advantage of the barrel shifter in the comparisons to avoid storing constants larger than #511. That should also shrink code size.

// This code should work near optimally on any processor with a barrel shifter.
// Returns -1 if value is 0 otherwise, returns bit number of least significant bit, on any processer up to 128 bit word lengths.

int bottomBitNN( unsigned long value ) {
	int n=0; 
	if (!value) n=-((int)sizeof(value)*8);
	if (!(value<<(sizeof(value)*8)-64)) {n+=64; value>>=64; } // Triggers a propeller GCC bug
        if (!(value<<(sizeof(value)*8)-32)) {n+=32; value>>=32; } // Triggers a propeller GCC bug
	if (!(value<<(sizeof(value)*8)-16)) {n+=16; value>>=16; }
	if (!(value<<(sizeof(value)*8)-8))  {n+=8;  value>>=8;  }
	if (!(value<<(sizeof(value)*8)-4))  {n+=4;  value>>=4;  }
	if (!(value<<(sizeof(value)*8)-2))  {n+=2;  value>>=2;  }
	if (!(value<<(sizeof(value)*8)-1))  {n+=1;  value>>=1;  }
	return n;
}

Note: this code is buggy: The first two commented lines should optimize out on the propeller, as shifting a 32 bit value by 32 bits or more always generates the value 0. GCC does notice that, and correctly decides to optimize out those lines which are for CPU sizes bigger than the propeller. BUT -- the propGCC also erroneously removes the if statement before the 32 or 64 bit shift. Therefore, the program doesn't return -1 when value==0. However, if I hand remove the lines which should optimize out anyway, the program works correctly. So who do I bump about GCC bugs?

With the broken optimizer lines removed, the code is only 19 instructions long on the propeller, and is fixed run length. So it is both smaller and faster on average than my previous routine and no fcache code penalites.

I suppose, since the propeller has the bit reverse assembly language command, that people could just use the above routine with bit reverse to do logarithm normalization. That would be more efficient than having an lsbit finding routine written separate from a msbit finding routine. But, I would need to code one special assembly language instruction -- does anyone know how to do this, properly?
eg: something like:

static inline
int topBitNN( unsigned long value ) {
    asm( "   reverse value,value" ); // I'm not sure what the command needs to be to access a stack variable/parameter... this is a guess
   return (sizeof(long)*8-1)-bottomBitNN( value );
}

That's not quite portable... but it's an option.
The other way I might do that which would be portable, and probably slightly more efficient overall is to recode the binary search so it finds the most significant bit directly, and then I can isolate the LSBIT using portable C rather than having a separate routine to find LS bit.
eg:

static inline bottomBitNN( unsigned long value ) {
    return topBitNN( value & ~(value-1) );
}

I'll have to play with the code and see if I can make a fast log lookup library, too.

Heater. · 2015-10-27 06:15

If you care for experimenting there is a big collection of least significant bit finding algorithms here:
http://stackoverflow.com/questions/757059/position-of-least-significant-bit-that-is-set

Fastrobot · 2015-10-27 10:53

Heater. wrote: »

If you care for experimenting there is a big collection of least significant bit finding algorithms here:
http://stackoverflow.com/questions/757059/position-of-least-significant-bit-that-is-set

Pretty cool.
Although, all the techniques using a multiply or converting to float aren't going to help on the propeller, as those just use the built in C library functions. Those have a lot of call overhead and there's no hardware multiply in the propeller....

I'm not even sure the lookup table method will be faster on the propeller, because the conditionals to locate the first non-empty byte going to eat several instructions...

My original simple looped code is only 14 instructions + 1 constant = 15 instructions worth of space. That's about the smallest code possible.

Comparing to the others I've tried:

You can see I have a method with 24 assembly instructions already, and I have gotten it down to 19 instructions (I'll post that later) when doing a MSB search. (18 instructions if I hand code the assembly). And no long constants... so that's total size.

Just for reference, The stock example by parallax for taking logarithms is on pp. 382-383 of the propeller manual.
It uses 28 instructions worth of memory, and that techinique would still take 18 instruction times without fixing the -1 bug it would cause. So, I've already matched the stock algorithm for speed, and used about 1/3 less memory space without hand coding the assembly language, but just using C.

If I could figure out how to encode inline assembly and access a function paramater/variable by name... I could get the absolute max performance. GCC's optimizer actually goes the wrong way in a couple of situations.

Fastrobot · 2015-10-27 12:16

My absolute best routine is a hybrid between bit shifting and lookup table.
It's only two longs larger than a pure binary search at 16 inst+5 data=21 instruction slots used.
Execution is fixed at ~17 instruction cycles. So two cycles faster than a 19 longs based pure binary search.
and I was wrong, 8 bit lookup does work and is in fact slightly faster @ ~14 cycles, but it uses up 78 longs and probably isn't worth the slight speed advantage.

Best routine I can come up with are these two:

int topBitN( unsigned long val ) {
	int n=-1;
	static const char nybble[16]={0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4};
	if ((val>>32-16)) n+=16; else val<<=16;
	if ((val>>32-8 )) n+=8 ; else val<<=8;
	if ((val>>32-4 )) n+=4 ; else val<<=4; 
	return n+nybble[val>>32-4];
}

int bottomBitN( unsigned long val ) {
	int n=-1;
	static const char nybble[16]={0,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1};
	if (!(val<<32-16))  { val>>=16; n+=16; }
	if (!(val<<32-8 ))  { val>>=8;  n+=8;  }
	if (!(val<<32-4 ))  { val>>=4;  n+=4;  }
	return n+nybble[val & 0xF];
}

DavidZemon · 2015-10-27 12:48

I'm assuming this is all open source, no license code, but please let me know if it isn't. I think I'll be stealing it to put in PropWare next time I have it open

Heater. · 2015-10-27 13:28

If you can't get that code in lined by the compiler how about making it into a macro ?

Fastrobot · 2015-10-27 17:49

DavidZemon wrote: »

I'm assuming this is all open source, no license code, but please let me know if it isn't. I think I'll be stealing it to put in PropWare next time I have it open

yes, it's open source

Although I would appreciate a comment pointing to this thread and saying the algorithm is developed by Andrew Robinson of Scappoose.
I can use a little advertisement for the future when I eventually need to earn money / do job interviews.

And if anyone improves the code, which is probably hard to do -- they have the option of positing here, of course.

DavidZemon · 2015-10-27 17:51

Fastrobot wrote: »

DavidZemon wrote: »

I'm assuming this is all open source, no license code, but please let me know if it isn't. I think I'll be stealing it to put in PropWare next time I have it open

yes, it's open source
Although I would appreciate a comment pointing to this thread and saying the algorithm is developed by Andrew Robinson of Scappoose.
I can use a little advertisement for the future when I eventually need to earn money / do job interviews.
And if anyone improves the code, which is probably hard to do -- they have the option of positing here, of course.

I was definitely planning to give credit. I'll be sure to add your real name in there instead of just a forum username :P

ersmith · 2015-10-27 17:53

GCC has a primitive called __builtin_clz (count leading zeros); using the rev instruction and that will allow you to produce a pretty fast bottomBit2 like so:

int bottomBitClz( unsigned long value ) {
    unsigned long val;
    if (!value) return -1;
    val = __builtin_propeller_rev(value, 32);
    return __builtin_clz(val);
}

Internally __builtin_clz does the counting using methods similar to the ones above, but since it's implemented in the LMM (or CMM) kernel it'll be faster in practice since there won't be LMM cache misses. In my testing I get 480 cycles for bottomBitN versus 272 cycles for bottomBitClz; those are machine cycles from CNT, not instructions.

Fastrobot · 2015-10-27 19:05

Heater. wrote: »

If you can't get that code in lined by the compiler how about making it into a macro ?

The problem isn't getting the code to expand inline, the problem is figuring out how to get the assembly language to find the correct variable.

In my code, the parameter "val" is the parameter to the routine.
GCC somehow chooses a register for val and another register for n; eg: to two different COG registers depending on the optimization level and version of the compiler.

When I expand inline assembly, I can type in a register name but not a variable name;
eg: I can do:
asm(" rev r0,r0")
but I can't do:
asm(" rev n,n" )

The first technique requires me to guess where GCC is going to put var and n, which can change with GCC version or optimization level.
The second technique doesn't work but is desirable to prevent me from making the code dependent on GCC version or compiler switches.

On my x86 version of GCC, I remember seeing inline assembly used with a special symbol to tell the assembler to look up whatever register the variable name is held in. I think it was "%" on the x86 version of GCC -- but I can't remember -- and propeller GCC doesn't accept asm(" rev %var,%var")

I know it probably can be done -- but I'm just not sure how to get the propeller gcc to do it. I don't know where to look to find out.

DavidZemon · 2015-10-27 19:21

You can find a one-liner for inline assembly here: https://github.com/DavidZemon/PropWare/blob/release-2.0/PropWare/port.h#L208

And a much bigger block here (ignore the fcache stuff at the top)
https://github.com/DavidZemon/PropWare/blob/release-2.0/PropWare/uart/abstractsimplexuart.h#L462
Notice how line 462 allows you to present a specific variable for both read/write access.

Fastrobot · 2015-10-27 20:00

Excellent! Thanks David.
So all I need to do is:
__asm__(" rev %0,%0" : : "r" (val) );
And that will tell the compiler that macro 0 is the register that val is found in. EDIT: It compiles!!! YES!!!

DavidZemon · 2015-10-27 20:12

It... shouldn't... compile....

The way you have it written, that should mark "val" as a read-only parameter. I'm confused! lol

Fastrobot · 2015-10-27 20:33

Another GCC bug... ?! Well, it DOES compile....

I assumed r was "register", because you pointed that code out as an example for what I was trying to do --but if it's readonly -- what's readwrite?
"rw" or just blank? In fact just about anything put in quotes seems to compile... even __asm__(" rev %0,%0" : : "boogeyman" (val) );

of course it only works with the __asm__() built in and not asm(), but that's fine.

I've found a number of gcc bugs trying to work these examples out.... not sure who to tell about them.
I also find it silly when I comple with -Os -S to inspect the generated assembly and notice really simple optimization oversights like:

    cmps r0,#0
    IF_E add r1,#1
    cmps r0,#0
    IF_B add r1,#1

I mean, without a WZ,WC -- isn't the add instruction not affecting condition codes? so there's absolutely no reason to do the same compare twice, is there?

Another issue I've come across is that GCC sometimes recognizes that it has already computed an expression, and then decides to "save" the caluculation, and conditionally reuse it when it's computed a second time. That would be fine if it took more than one instruction to make the calculation, but when saving a calculation and restoring it takes two instructions but the calculation only one instruction... that's not good. Recomputing is clearly going to be faster in that case.

I mean I would expect the following snippet to compile as three instructions, not four or five.
But in a chain of similar statements, GCC is often deciding to save and restore rather than recompute.

if (x>>16) {x>>=16; n+=16;}

Sometimes, in the name of optimization, propeller GCC notices too much.... !

Electrodude · 2015-10-27 20:36

Fastrobot wrote: »
I've found a number of gcc bugs trying to work these examples out.... not sure who to tell about them.
I also find it silly when I comple with -Os -S to inspect the code that I'm finding optimization oversights like:
    cmps r0,#0
    IF_E add r1,#1
    cmps r0,#0
    IF_B add r1,#1
I mean, without a WZ,WC -- isn't the add instruction not affecting condition codes? so there's absolutely no reason to do the same compare twice, is there?

Does it always set WZ,WC? If so, why? Is there any possible workaround? Could GCC be told to treat them as single-bit registers instead of as flags, and have variants of instructions that have different combinations of {w,n}{z,c,r}?

DavidZemon · 2015-10-27 20:38

Fastrobot wrote: »

Another GCC bug... ?! Well, it DOES compile....

I assumed r was "register", because you pointed that code out as an example for what I was trying to do --but if it's readonly -- what's readwrite?
"rw" or just blank? In fact just about anything put in quotes seems to compile... even __asm__(" rev %0,%0" : : "boogeyman" (val) );

of course it only works with the __asm__() built in and not asm(), but that's fine.

"+r" is for read/write, and it would go between the two colons instead of after the second (see the second and longer block of assembly linked in my other post).

Fastrobot wrote: »
I've found a number of gcc bugs trying to work these examples out.... not sure who to tell about them.
I also find it silly when I comple with -Os -S to inspect the code that I'm finding optimization oversights like:
    cmps r0,#0
    IF_E add r1,#1
    cmps r0,#0
    IF_B add r1,#1
I mean, without a WZ,WC -- isn't the add instruction not affecting condition codes? so there's absolutely no reason to do the same compare twice, is there?

or another issue I've come across is that GCC sometimes recognizes that it has already computed an expression, and then decides to "save" the caluculation, and conditionally reuse it when it's computed a second time. That would be fine if it took more than one instruction to make the calculation, but when saving a calculation and restoring it takes two instructions but the calculation only one instruction... that's not good. Recomputing is clearly going to be faster in that case.

I mean I would expect this to compile as three instructions, not four or five.
But in a chain of them, GCC is clearly saving and resoring rather than recomputing.
if (x>>16) {x>>=16; n+=16;}
Sometimes, in the name of optimization, GCC notices too much.

Yep - GCC is pretty cool, but 4.6 is getting old, and even the newest versions certainly aren't perfect

The central repository is on github and I would think that's a good place to report issues: https://github.com/parallaxinc/propgcc/issues

ersmith · 2015-10-27 21:52

Fastrobot wrote: »

Excellent! Thanks David.
So all I need to do is:
__asm__(" rev %0,%0" : : "r" (val) );
And that will tell the compiler that macro 0 is the register that val is found in. EDIT: It compiles!!! YES!!!

That's probably not what you want -- the REV D, S instruction reverses S bits of register D.

Also note that all (I think) Propeller instructions that don't have obvious C analogues already have __builtin_propeller_xxx versions. So you could write something like:

  val = __builtin_propeller_rev(val, 32);

to reverse all 32 bits of "val".

Eric

ersmith · 2015-10-27 22:00

Also, Fastrobot, you don't mention which version of gcc you're using, but if it's the one that comes with SimpleIDE then it's really ancient. The ones that David Zemon links to from his home page are much more up to date and will do a better job of optimizing (although still not perfect, of course).

Eric

ersmith · 2015-10-27 22:05

Electrodude wrote: »
Fastrobot wrote: »
    cmps r0,#0
    IF_E add r1,#1
    cmps r0,#0
    IF_B add r1,#1
I mean, without a WZ,WC -- isn't the add instruction not affecting condition codes? so there's absolutely no reason to do the same compare twice, is there?
Does it always set WZ,WC? If so, why? Is there any possible workaround?

No, it doesn't always set WZ,WC -- the Propeller port of GCC tries to set a "minimum" number of condition bits, which may be why the sequence above occured (the first test was a test for 0, so set only WZ, and the second was a test for < 0 and set WZ and WC). I'm not sure, because there's a typo in the code as posted -- the poster forgot to include the WZ,WC on the cmp, so we don't know what the compiler was setting.
[/quote]

Fastrobot · 2015-10-27 23:31

Ok.

What I think what I saw and forgot in x86 and some other processors is a slightly different format string.
There are typically three colon groups to inform GCC about what is going on.

_asm_ ( "\tinstruction dest,src options" : [macroname1] "desination constraints" (var name1), ... : [macroname2] "source constraints" (name2), ... : "clobbered register", ... )

For example, in x86, the destination flags might be "=rm" for destination flags meaning gcc is allowed to write to either a register or memory, and source constraints might be "rm" also meaning the source can be either register or memory.

I think a blank field generally means no constraints at all. So, the command I wrote has no constraints on the destination but does have some constraints on the input field. That's why GCC did not kick out with an error. The variable %0 was defined to have constraints when used as an input field, but not when used as a destination.

https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

I don't think you actually need to set read or write permissions because GCC can figure that out from the variable's type definition; eg: const char* xxx ; should not be written to. I'm not sure you actually need to put "r" or "m" or "+" in the constraints at all, unless the type defintions are wrong.

For example, gcc flags the first asm as an error, but the second is fine. GCC is happy reading from variables whether or not "r" exists.

int test(unsigned long val, const unsigned long c, const long *co ) {
    int n = 0; 
    __asm__ ( " rev %[c],%[co] wc"   : [c] "+" (c) : [co] "" (co) : "cc" );
    __asm__ ( " rev %[n],%[co] wc"   : [n] "+" (n) : [co] "" (co) : "cc" );
   
	return c;	
}

// run by:
# propeller-elg-gcc -v
COLLECT_GCC=propeller-elf-gcc
COLLECT_LTO_WRAPPER=/mnt/parallax/bin/../libexec/gcc/propeller-elf/4.6.1/lto-wrapper
Target: propeller-elf
Configured with: /var/www/vhosts/david.zemon.name/TeamCity/buildAgent/work/71b5f7d6548ebb35/gcc4/configure --target=propeller-elf --prefix=/var/www/vhosts/david.zemon.name/TeamCity/buildAgent/work/propeller-gcc4-build/target --disable-nls --disable-shared --with-pkgversion=propellergcc-alpha_v1_9_0_ --with-bugurl=http://code.google.com/p/propgcc/issues
Thread model: single
gcc version 4.6.1 (propellergcc-alpha_v1_9_0_) 


# propeller-elf-gcc -S -Os trajectoryqueue.c
trajectoryqueue.c: In function 'test':
trajectoryqueue.c:84:5: error: read-only parameter 'c' use as 'asm' output
bash-4.2$

Electrodude · 2015-10-28 02:04

Fastrobot wrote: »

I don't think you actually need to set read or write permissions because GCC can figure that out from the variable's type definition; eg: const char* xxx ; should not be written to. I'm not sure you actually need to put "r" or "m" or "+" in the constraints at all, unless the type defintions are wrong.

I disagree, although I have no proof. It might not be necessary, but specifying read and write permissions can give GCC information that will allow it to better optimize the code surrounding your assembly.

If you don't specify permissions but your code only reads a register, GCC might not know that it doesn't need to save a copy before giving it to your assembly code, and so it will needlessly make a copy of it. If your code only ever writes a register but doesn't read it and you don't tell GCC, GCC won't know that the register can go in a new, more convenient place and doesn't have to be the same as the old register. That can avoid forcing GCC to do a mov later on if it needs to.

Fastrobot · 2015-10-28 03:36

ersmith wrote: »

Also, Fastrobot, you don't mention which version of gcc you're using, but if it's the one that comes with SimpleIDE then it's really ancient. The ones that David Zemon links to from his home page are much more up to date and will do a better job of optimizing (although still not perfect, of course).

Eric

See above post, it's gcc version 4.6.1 and roughly 1 year old.

I don't remember if it came with simple IDE or not; I think I may have recompiled it from source on github. I do remember having to upgrade several things, including recompiling gdb separately to get it to work at all.

I have a version of gdb that has bugs like, printing an array will print the pointers to each of the elements instead of the elements and any attempt to call a function on the propeller with print /x myfunction() doesn't work. Unfortunately, there's no obvious way to inspect the cog's native registers without being able to call a function in the program. There are also times, even with gcc -O0 chosen, that the debug trace will follow the wrong branch path after an if (condition) {}; but only during single step mode, never if a breakpoint is used and continue command issued. Still, gdb works well enough that it is useable -- and I'm pretty sure the simple IDE version does not work at all.

I probably ought to upgrade my tools in any event, as a lot can happen in a year. I'll definitely do that before reporting any bugs, officially.

Thanks for all the comments. ( and the benchmark. I need to learn to do that... optimized code residing in inside the KERNEL obviously has a massive advantage. I need to benchmark the cached linear bit search to see what it's performance really looks like due to good caching. )

I wasn't really paying attention to how the rev command works when I wrote out the example, I was simply trying to see if gcc would recognize a variable name and compile with it at all. I'm not all that familiar with cog assembly yet. But yes, obviously the way I had it wouldn't reverse the bits for finding priority. I need " rev %0,#32" instead. I haven't written inline assembly in probably 10 years, so I'm a bit rusty to say the least...

I am porting code from other processors that I've used in the past, and want to replace proprietary routines from those systems with more generic ones to make the code portable. I don't want to rewrite the code again in the future, if possible....
Some of the features in the propeller are pretty common, eg: My x86, and arm processors, also have barrel shifters -- so the same principle works on most machines.

Ultimately, I will probably want to create a library source tree with #if statements, so that I can select optimized routines based on architecture or cpu. But I want a good general purpose routine first that has the least penalties due to sheer stupidity.... eg: This program is a simple enough problem that it lets me get used to the propeller's quirks while exploring.

As to typos in that code... yes, I get ahead of myself sometimes... my apologies.
I was typing it from memory and forgot the CC flags; there was a "cmps r0,#0 wz,wc" in both places.
I had already deleted the source code which generated it so all I can do is try to recreate it, now.
The general idea I was doing at the time was a variaton on the theme:

int testNN( unsigned long val ) {
	int n=-1;
	if (!(val<<32-16)) { n+=16; val>>=16; }
	if (!(val<<32-8 ))  { n+=8 ;  val>>=8; }
	if (!(val<<32-4 ))  { n+=4 ;  val>>=4; }
	if (!(val<<32-2 ))  { n+=2 ;  val>>=2; }
	if (!(val<<32-1 ))  { n+=1 ;  val>>=1; }
	return val+n; // this is a mistake, really, it probably should be n+!!val; but as long as val has only one bit set, the algorithm should work.
}

So let's see what happens: gcc -S -Os trajectoryqueue.c
then I copy and paste the output so there are NO typos this time:

     _testNN
        shl     r0, #16 wz,nr
        IF_E  shr       r0, #16
        shl     r0, #24 wz,nr
        IF_E  shr       r0, #8
        shl     r0, #28 wz,nr
        IF_E  shr       r0, #4
        shl     r0, #30 wz,nr
        IF_E  shr       r0, #2
        test    r0,#0x1 wz
        IF_E  shr       r0, #1
        lret

It's different so this is obviously not the exact same variation as before.... but oh, nelly! what happened to the add 16,8,4,2, and 1 ?!
In fact, what the heck happened to n, and why?!
I dont' see what I did wrong that they should disappear. Did I make a mistake in my code?

OTOH:
Here's the other optimization issue that I saw, gcc is actually making the code larger here for no reason.
It could just discard the computation in parenthesis, using NR. Then there would be no need to do an extra move every if statement.
This is an optimization that generally would be helpful, especially when two or more calculations exist inside the if() , but here it's counter-productive.

int testNN( unsigned long val ) {
	int n=-1;
	if ((val>>16)) { n+=16; val>>=16; } // since val>>16 is computed twice, GCC decides to compute, "save", conditionally move.
	if ((val>>8 )) { n+=8 ; val>>=8; } 
	if ((val>>4 )) { n+=4 ; val>>=4; }
	if ((val>>2 )) { n+=2 ; val>>=2; }
	if ((val>>1 )) { n+=1 ; val>>=1; }
	return n + val;
}

_testNN
        mov     r7, r0
        shr     r7, #16 wz
        IF_NE mov       r0,r7
        mov     r6, r0
        IF_E  neg       r7,#1
        IF_NE mov       r7,#15
        shr     r6, #8 wz
        IF_NE mov       r0, r6
        mov     r6, r0
        IF_NE add       r7, #8
        shr     r6, #4 wz
        IF_NE mov       r0, r6
        mov     r6, r0
        IF_NE add       r7, #4
        shr     r6, #2 wz
        IF_NE mov       r0, r6
        mov     r6, r0
        IF_NE add       r7, #2
        shr     r6, #1 wz
        IF_NE add       r7, #1
        IF_NE mov       r0, r6
        add     r0, r7
        lret

Optimization faults don't generate bad code, just slightly slower or larger code. So this last isn't really a problem for me.
Thes that bother me are the ones which remove working code and change the operation of the function eg: when a shift of larger than 32 bits happens, the result ought to be considered zero -- and it's OK to remove the statement if it effectively doesn't execute; but that should not cause any previous if statement's optimization state to be corrupted and generate bad code. That's what happened in the example I gave much earlier in the thread, and unless I'm really tired the first example in this post might be another example of gcc misery.

Fastrobot · 2015-10-28 04:32

that

ersmith wrote: »

Also, Fastrobot, you don't mention which version of gcc you're using, but if it's the one that comes with SimpleIDE then it's really ancient. The ones that David Zemon links to from his home page are much more up to date and will do a better job of optimizing (although still not perfect, of course).

Eric

I'm not seeing a link when I click on his profile page. So you must mean another homepage, and the only other one I see doesn't have links obvious either, although I should say Congratulations to David on getting married; for I did see that next to the picture of the cool looking car and school info.

Hmm... I do see a git hub repository that David owns and cloned from prop-gcc, but it looks to have no code base changes but only makescript/insall patch changes 8 months ago. Is that too old?

So -- I humbly ask, could someone post a more direct link; I don't want to recompile the wrong thing.... thanks!

Fastrobot · 2015-10-28 04:53

Electrodude wrote: »

Fastrobot wrote: »

I don't think you actually need to set read or write permissions because GCC can figure that out from the variable's type definition; eg: const char* xxx ; should not be written to. I'm not sure you actually need to put "r" or "m" or "+" in the constraints at all, unless the type defintions are wrong.

I disagree, although I have no proof. It might not be necessary, but specifying read and write permissions can give GCC information that will allow it to better optimize the code surrounding your assembly.

I think those flags are called "constraints" in the manual for a reason, and generally when coding programs there is a danger in over-constraining the compiler's optimizer.

As I'm just starting out with the propeller, I would rather error on the side of allowing the compiler to make decisions as much as possible rather than trying to force it to do so. As I become more certain of how prop-gcc is using those flags, then I'll start adding more of them as appropriate. Right now, I'm not even certain the compiler really is using those flags as the memory and registers in a cog are really identical; unlike an x86 processor which has a very limited set of accumulator registers capable of doing arithmetic, the COG essentially has 400+ general purpose registers which are also memory. So, those optimization flags don't really seem critical to me as long as the data resides inside the cog.

As to Hub ram instructions... well, I still have to study that... and you may well be right, there. constraints may be very appropriate.

Electrodude · 2015-10-28 05:35

Except that propgcc doesn't let you use all 496 registers as general purpose registers. Many are used for the LMM/CMM kernel code, and many others are used for things like multiply, divide, and other functions that GCC expects to be native operations. In CMM you only get to use 16 registers, and I assume it's the same for LMM.

DavidZemon · 2015-10-28 12:56

Fastrobot wrote: »

that

ersmith wrote: »

Also, Fastrobot, you don't mention which version of gcc you're using, but if it's the one that comes with SimpleIDE then it's really ancient. The ones that David Zemon links to from his home page are much more up to date and will do a better job of optimizing (although still not perfect, of course).

Eric

I'm not seeing a link when I click on his profile page. So you must mean another homepage, and the only other one I see doesn't have links obvious either, although I should say Congratulations to David on getting married; for I did see that next to the picture of the cool looking car and school info.

Hmm... I do see a git hub repository that David owns and cloned from prop-gcc, but it looks to have no code base changes but only makescript/insall patch changes 8 months ago. Is that too old?

So -- I humbly ask, could someone post a more direct link; I don't want to recompile the wrong thing.... thanks!

Ha! Sorry about the confusion :P
Eric is referring to PropWare's homewage, where I list a number of "Useful Links" at the bottom including links to my build server, where PropGCC is being built: http://david.zemon.name/PropWare
And thank you, I'm pretty excited

My clone of PropGCC is dead - I'll probably delete it momentarily. I created it a while ago so I could create some pull requests and they've since all been accepted.

Electrodude wrote: »

Except that propgcc doesn't let you use all 496 registers as general purpose registers. Many are used for the LMM/CMM kernel code, and many others are used for things like multiply, divide, and other functions that GCC expects to be native operations. In CMM you only get to use 16 registers, and I assume it's the same for LMM.

Unless explicitly declared with the "COGRAM" keyword, PropGCC only allocates 16 registers in cogram for use. As I understand it, this is because GCC has these "registers" built into its very core. Most every processor out there has some set of general purpose registers like R0-R7 or R0-R16 or something like that. It's such a popular concept that GCC has it as a core concept, and not using that concept would require significantly more work by the PropGCC folks. So, rather than rewrite the whole compiler, they simply dedicated a few random addresses in COGRAM to be used as R0-R16. A bit more address space is used by either the LMM or CMM kernel, and then a bit more is reserved for FCACHE space.

I am not one of the PropGCC developers so take this with a grain of salt. This comes from reading lots of docs and forum threads on PropGCC, not first hand knowledge of the code base.

DavidZemon · 2015-10-28 13:01

Fastrobot wrote: »

ersmith wrote: »

Also, Fastrobot, you don't mention which version of gcc you're using, but if it's the one that comes with SimpleIDE then it's really ancient. The ones that David Zemon links to from his home page are much more up to date and will do a better job of optimizing (although still not perfect, of course).

Eric

See above post, it's gcc version 4.6.1 and roughly 1 year old.

Actually, all PropGCC versions are based on GCC 4.6.1 (there's work happening on a PropGCC based on GCC v5, but it's not ready yet). What he was asking for was the PropGCC version, not GCC version.

dzweb@davidzemonname:~$ propgcc4-bin/bin/propeller-elf-gcc --version
propeller-elf-gcc (propellergcc-alpha_v1_9_0_) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

'Course... this brings up an interesting point that builds from my build server aren't reporting the tweak/patch version. I'll post in the PropGCC thread about how to fix that.