Bit priority in C...
Fastrobot
Posts: 51
Hi,
I'm interested in exploring C optimizations a little and learning what triggers various optimizations on the Propeller 1.
So I wanted to implement, in C, a program to find the bit number of the least significant set bit of a value.
Then, after compiling with gcc -Os, I wanted to find out which code is the smallest and fastest.
(Perhaps I ought to compile with -O3 for more speed?)
I know SPIN has a command to locate a bit, but I'm not sure how that can be called from C or even if it is very fast.
I made two complete attempts; the second routine is pretty fast... and I have a third idea of how it might be accomplished
even quicker, but I'm not sure how to code it properly using the Propeller's multiply/log table.
Comments about the code, and ways to improve it would be appreciated.
My first attempt is the simple linear search:
int bottomBitNN( unsigned long value )
{
    int n = 0;
    if (!value) return -1;
    while (~value & 1) {
        ++n;
        value >>= 1;
    }
    return n;
}
But compiling it to assembly proved it wasn't all that tiny, or efficient:
There is some overhead for FCACHE load which is quite a bit of the program's time.
Even to count to two, the routine part will take ~9 instructions, plus ~6
instruction times per bit counted. The jump is assumed to happen,
which means the IF_E jmp #__LMM_FCACHE_START+(.L11-.L14) is
generally optimized wrong and takes two instruction times.
This program probably could be improved by changing to a do {} while( ~value & 1 ) instead, but that would only be faster by about ~2 instruction times per loop.
I get either 9 + 6*31 = 195 or 9 + 4*31 = 133 optimized instruction times for finding bit 31, and around 11 for bit 1.
Average score for random bit is around ~72 instruction times.
_bottomBitNN
        jmp     #__LMM_FCACHE_LOAD
        long    .L15-.L14
.L14
        mov     pc,lr
        mov     lr,__LMM_RET
        '' The above code is FCACHE -- the actual routine follows:
        cmps    r0, #0 wz,wc
        neg     r7, #1
 IF_E   jmp     #__LMM_FCACHE_START+(.L11-.L14)
.L12
        test    r0,#0x1 wz
 IF_NE  jmp     #__LMM_FCACHE_START+(.L11-.L14)
        add     r7, #1
        shr     r0, #1
        jmp     #__LMM_FCACHE_START+(.L12-.L14)
.L11
        mov     r0, r7
        jmp     lr
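The do {} while restructuring mentioned above might be phrased like this -- a hedged sketch of the idea, not code from the original post (bottomBitDW is my name for it; the post-test loop saves the unconditional back-jump each iteration):

```c
#include <assert.h>

/* Sketch of the suggested do/while variant: the loop body always runs
   at least once, so the compiler emits only one conditional jump per
   iteration instead of a test at the top plus a jump at the bottom. */
int bottomBitDW( unsigned long value )
{
    int n = -1;
    if (!value) return -1;
    do {
        ++n;                           /* first pass tests bit 0 */
    } while (!((value >> n) & 1));     /* post-test: loop while bit n is clear */
    return n;
}
```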
For my next attempt, I figured I'd code a non-looped binary search to find the least significant set bit.
I'm pretty happy with this code:
int bottomBitN( unsigned long value )
{
    // Quickly find the bit number of the bottom-most set bit.
    int n = 0;
    if (!value) return -1;
    if (!(value & 0xFFFF)) { n += 16; value >>= 16; }
    if (!(value & 0xFF))   { n += 8;  value >>= 8;  }
    if (!(value & 0xF))    { n += 4;  value >>= 4;  }
    if (!(value & 0x3))    { n += 2;  value >>= 2;  }
    if (!(value & 0x1))    { n += 1;  value >>= 1;  }
    return n;
}
_bottomBitN
        cmps    r0, #0 wz,wc
        mov     r7, r0
 IF_E   brs     #.L8
        mvi     r6,#65535
        test    r0,r6 wz
 IF_E   shr     r7, #16
        mov     r6, r7
 IF_E   mov     r0, #16
 IF_NE  mov     r0, #0
        and     r6,#255 wz
 IF_E   shr     r7, #8
 IF_E   add     r0, #8
        test    r7,#0xf wz
 IF_E   shr     r7, #4
 IF_E   add     r0, #4
        test    r7,#0x3 wz
 IF_E   shr     r7, #2
 IF_E   add     r0, #2
        test    r7,#0x1 wz
 IF_NE  brs     #.L3
        add     r0, #1
        lret
.L8
        neg     r0, #1
.L3
It takes 22 instructions to execute, pretty much regardless of where the bit actually is.
So, it's about 25% more code, but 3.27 times faster execution.
One of the things that makes me curious, why did the smaller looped code generate FCACHE,
but this larger chunk of linear code did not?
last attempt:
This third piece of code, I am not sure how to write for propeller GCC; I know there is a log table of some kind in the processor,
but I'm not sure how to access it from C and I'm not sure what base it is in. I think the code would look something like the following,
but without knowing exactly how the log table is laid out -- I can't finish the code as I'm sure it can't look up 32 bit numbers.
int bottomBitNNN( unsigned long value )
{
    short int* table = (short int*)(void*)0x8000;  // I am not sure exactly where the table is in memory
    value = value & ~(value-1);  // isolate a single bit, the LSBIT, so a base 2 logarithm is exact.
    if (!value) return -1;
    return table[value];
}
I suppose I could call the log2 operation from the math library, but that seems overkill.
The call overhead would likely be very slow.
Anyone have suggestions?
Comments
Just a guess: GCC doesn't know how many iterations of that loop you'll be taking. So, it's going to assume lots, and that the overhead for FCACHE will pay off. Without a loop, it must be smart enough to know that it's not worth taking the time to cache.
I was curious about this too. Here's my old thread: http://forums.parallax.com/discussion/160252/log-antilog-table-in-c-c and David Hein jumped in with a code sample here.
It's also important to note that the math library (anything in libm) does not use the tables in ROM, because they are not precise enough.
Thanks for the reply, David. Discussing optimization is always a good thing.... gets the creative juices flowing....
I never would have found your thread, as searching the forums for "log" on the Propeller turns up so many useless threads that it's essentially buried.
The result basically proves that the log table is useless, as the task of normalizing the number before taking the log already requires the very routine I'm trying to replace!
I see, in the thread, that David Hein actually used the exact same linear search that wastes so many CPU cycles, too (ouch!).
So, the code I'm showing here could seriously improve the performance of his routine as well.
I think I can probably improve the routine I did a little more.
The case where '0' is detected is generally a waste because that path is seldom taken -- so I figured I could get rid of the alternate return path, and the mis-optimized jump, by just initializing n to a number which will cancel to -1 after a futile binary search. I can also take advantage of the barrel shifter in the comparisons to avoid storing constants larger than #511. That should also shrink code size.
Note: this code is buggy: The first two commented lines should optimize out on the propeller, as shifting a 32 bit value by 32 bits or more always generates the value 0. GCC does notice that, and correctly decides to optimize out those lines which are for CPU sizes bigger than the propeller. BUT -- the propGCC also erroneously removes the if statement before the 32 or 64 bit shift. Therefore, the program doesn't return -1 when value==0. However, if I hand remove the lines which should optimize out anyway, the program works correctly. So who do I bump about GCC bugs?
With the broken optimizer lines removed, the code is only 19 instructions long on the propeller, and runs in fixed time. So it is both smaller and faster on average than my previous routine, with no FCACHE code penalties.
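One branch-free way to fold the zero case into the search looks like the sketch below. This is a guess at the general idea, not the author's posted code -- and unlike his version, it leans on long mask constants rather than the barrel shifter, so it wouldn't meet his "no constants above #511" goal:

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free LSB finder: isolate the lowest set bit, then test it
   against fixed masks. Each "if" compiles to a test plus a conditional
   add on the Propeller -- no jumps, so the run time is fixed. The
   -(v == 0) term pre-loads -1 so a zero input needs no separate path. */
int bottomBitBF( uint32_t v )
{
    int n = -(v == 0);            /* -1 when v is zero, 0 otherwise */
    v &= (uint32_t)(0 - v);       /* v & -v: keep only the lowest set bit */
    if (v & 0xFFFF0000UL) n += 16;
    if (v & 0xFF00FF00UL) n += 8;
    if (v & 0xF0F0F0F0UL) n += 4;
    if (v & 0xCCCCCCCCUL) n += 2;
    if (v & 0xAAAAAAAAUL) n += 1;
    return n;
}
```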
I suppose, since the propeller has the bit reverse assembly language command, that people could just use the above routine with bit reverse to do logarithm normalization. That would be more efficient than having an LSB-finding routine written separately from an MSB-finding routine. But I would need to code one special assembly language instruction -- does anyone know how to do this properly?
eg: something like:
That's not quite portable... but it's an option.
The other way I might do that, which would be portable and probably slightly more efficient overall, is to recode the binary search so it finds the most significant bit directly; then I can isolate the LSBIT using portable C rather than having a separate routine to find the LS bit.
eg:
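The portable approach described above might look like the following sketch (my reconstruction of the idea, not the code from the post): isolate the lowest set bit first, then binary-search for its position from the top down, since the isolated value has exactly one bit set.

```c
#include <assert.h>
#include <stdint.h>

/* Find the LSB position via an MSB-direction search on the isolated bit. */
int bottomBitViaMsb( uint32_t value )
{
    int n = 0;
    value &= (uint32_t)(0 - value);   /* value & ~(value-1): keep only the lowest set bit */
    if (!value) return -1;
    /* Exactly one bit is set now, so an MSB search finds the LSB position. */
    if (value & 0xFFFF0000UL) { n += 16; value >>= 16; }
    if (value & 0xFF00UL)     { n += 8;  value >>= 8;  }
    if (value & 0xF0UL)       { n += 4;  value >>= 4;  }
    if (value & 0xCUL)        { n += 2;  value >>= 2;  }
    if (value & 0x2UL)        { n += 1; }
    return n;
}
```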
I'll have to play with the code and see if I can make a fast log lookup library, too.
http://stackoverflow.com/questions/757059/position-of-least-significant-bit-that-is-set
Pretty cool.
Although, all the techniques using a multiply or converting to float aren't going to help on the propeller, as those just use the built-in C library functions. Those have a lot of call overhead, and there's no hardware multiply on the propeller....
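For reference, the multiply-based method from the linked page is the de Bruijn lookup (the table is the standard one from Sean Anderson's bit-twiddling collection). It's shown here only for comparison -- per the point above, the software 32x32 multiply makes it a poor fit for the Propeller 1:

```c
#include <assert.h>
#include <stdint.h>

/* Standard de Bruijn table for 0x077CB531 (Sean Anderson's collection). */
static const int deBruijnTable[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

int bottomBitDeBruijn( uint32_t v )
{
    if (!v) return -1;
    /* Isolate the LSB, multiply by the de Bruijn constant (mod 2^32),
       and use the top 5 bits of the product as the table index. */
    uint32_t prod = (v & (uint32_t)(0 - v)) * UINT32_C(0x077CB531);
    return deBruijnTable[prod >> 27];
}
```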
I'm not even sure the lookup table method will be faster on the propeller, because the conditionals to locate the first non-empty byte are going to eat several instructions...
My original simple looped code is only 14 instructions + 1 constant = 15 instructions worth of space. That's about the smallest code possible.
Comparing to the others I've tried:
You can see I have a method with 24 assembly instructions already, and I have gotten it down to 19 instructions (I'll post that later) when doing an MSB search. (18 instructions if I hand-code the assembly.) And no long constants... so that's total size.
Just for reference, The stock example by parallax for taking logarithms is on pp. 382-383 of the propeller manual.
It uses 28 instructions worth of memory, and that technique would still take 18 instruction times without fixing the -1 bug it would cause. So I've already matched the stock algorithm for speed, and used about 1/3 less memory, without hand-coding the assembly language -- just using C.
If I could figure out how to encode inline assembly and access a function parameter/variable by name... I could get the absolute max performance. GCC's optimizer actually goes the wrong way in a couple of situations.
It's only two longs larger than a pure binary search at 16 inst+5 data=21 instruction slots used.
Execution is fixed at ~17 instruction cycles, so two cycles faster than the 19-long pure binary search.
And I was wrong: 8-bit lookup does work, and is in fact slightly faster at ~14 cycles, but it uses up 78 longs and probably isn't worth the slight speed advantage.
The best routines I can come up with are these two:
yes, it's open source
Although I would appreciate a comment pointing to this thread and saying the algorithm was developed by Andrew Robinson of Scappoose.
I can use a little advertisement for the future when I eventually need to earn money / do job interviews.
And if anyone improves the code, which is probably hard to do -- they have the option of posting here, of course.
I was definitely planning to give credit. I'll be sure to add your real name in there instead of just a forum username :P
Internally __builtin_clz does the counting using methods similar to the ones above, but since it's implemented in the LMM (or CMM) kernel it'll be faster in practice since there won't be LMM cache misses. In my testing I get 480 cycles for bottomBitN versus 272 cycles for bottomBitClz; those are machine cycles from CNT, not instructions.
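As a point of comparison, GCC also provides __builtin_ctz alongside the __builtin_clz Eric mentions, and it counts trailing zeros directly. A guarded wrapper is one way to write the same routine (a sketch of my own -- bottomBitCtz is my name for it, and I haven't benchmarked it on the Propeller):

```c
#include <assert.h>

/* LSB finder built on a GCC intrinsic. __builtin_ctzl is undefined for
   a zero argument, hence the explicit guard. */
int bottomBitCtz( unsigned long value )
{
    if (!value) return -1;
    return __builtin_ctzl( value );   /* count of trailing zero bits */
}
```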
The problem isn't getting the code to expand inline, the problem is figuring out how to get the assembly language to find the correct variable.
In my code, the parameter "val" is the parameter to the routine.
GCC somehow chooses a register for val and another register for n; i.e. two different COG registers, depending on the optimization level and version of the compiler.
When I expand inline assembly, I can type in a register name but not a variable name;
eg: I can do:
asm(" rev r0,r0")
but I can't do:
asm(" rev n,n" )
The first technique requires me to guess where GCC is going to put val and n, which can change with GCC version or optimization level.
The second technique doesn't work but is desirable to prevent me from making the code dependent on GCC version or compiler switches.
On my x86 version of GCC, I remember seeing inline assembly used with a special symbol to tell the assembler to look up whatever register the variable name is held in. I think it was "%" on the x86 version of GCC -- but I can't remember -- and propeller GCC doesn't accept asm(" rev %var,%var")
I know it probably can be done -- but I'm just not sure how to get the propeller gcc to do it. I don't know where to look to find out.
And a much bigger block here (ignore the fcache stuff at the top)
https://github.com/DavidZemon/PropWare/blob/release-2.0/PropWare/uart/abstractsimplexuart.h#L462
Notice how line 462 allows you to present a specific variable for both read/write access.
So all I need to do is:
__asm__(" rev %0,%0" : : "r" (val) );
And that will tell the compiler that macro 0 is the register that val is found in. EDIT: It compiles!!! YES!!!
The way you have it written, that should mark "val" as a read-only parameter. I'm confused! lol
I assumed r was "register", because you pointed that code out as an example of what I was trying to do -- but if it's read-only, what's read/write?
"rw" or just blank? In fact just about anything put in quotes seems to compile... even __asm__(" rev %0,%0" : : "boogeyman" (val) );
of course it only works with the __asm__() built in and not asm(), but that's fine.
I've found a number of gcc bugs trying to work these examples out.... not sure who to tell about them.
I also find it silly when I compile with -Os -S to inspect the generated assembly and notice really simple optimization oversights like:
I mean, without a WZ,WC, the add instruction doesn't affect the condition codes -- so there's absolutely no reason to do the same compare twice, is there?
Another issue I've come across is that GCC sometimes recognizes that it has already computed an expression, and then decides to "save" the calculation and conditionally reuse it when it's needed a second time. That would be fine if it took more than one instruction to make the calculation; but when saving and restoring a calculation takes two instructions and the calculation itself only one... that's not good. Recomputing is clearly going to be faster in that case.
I mean, I would expect the following snippet to compile to three instructions, not four or five.
But in a chain of similar statements, GCC is often deciding to save and restore rather than recompute.
Sometimes, in the name of optimization, propeller GCC notices too much.... !
Does it always set WZ,WC? If so, why? Is there any possible workaround? Could GCC be told to treat them as single-bit registers instead of as flags, and have variants of instructions that have different combinations of {w,n}{z,c,r}?
"+r" is for read/write, and it would go between the two colons instead of after the second (see the second and longer block of assembly linked in my other post).
Yep - GCC is pretty cool, but 4.6 is getting old, and even the newest versions certainly aren't perfect
The central repository is on github and I would think that's a good place to report issues: https://github.com/parallaxinc/propgcc/issues
That's probably not what you want -- the REV D, S instruction reverses the bottom 32-S bits of register D (and zero-extends the rest).
Also note that all (I think) Propeller instructions that don't have obvious C analogues already have __builtin_propeller_xxx versions. So you could write something like: to reverse all 32 bits of "val".
Eric
No, it doesn't always set WZ,WC -- the Propeller port of GCC tries to set a "minimum" number of condition bits, which may be why the sequence above occurred (the first test was a test for 0, so set only WZ, and the second was a test for < 0 and set WZ and WC). I'm not sure, because there's a typo in the code as posted -- the poster forgot to include the WZ,WC on the cmp, so we don't know what the compiler was setting.
What I think I saw (and forgot) in x86 and some other processors is a slightly different format string.
There are typically three colon groups to inform GCC about what is going on.
__asm__ ( "\tinstruction dest,src options" : [macroname1] "destination constraints" (varname1), ... : [macroname2] "source constraints" (varname2), ... : "clobbered register", ... )
For example, in x86, the destination constraint might be "=rm", meaning GCC is allowed to write to either a register or memory, and the source constraint might be "rm", meaning the source can be either a register or memory.
I think a blank field generally means no constraints at all. So, the command I wrote has no constraints on the destination but does have some constraints on the input field. That's why GCC did not kick out with an error. The variable %0 was defined to have constraints when used as an input field, but not when used as a destination.
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
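A minimal concrete use of all three groups from that template, written as a portable no-op (my illustrative example, not from the thread): one output, one input tied to it, and a clobber. The "0" matching constraint tells GCC the input must live in the same register as operand 0, so with an empty template the output simply receives the input.

```c
#include <assert.h>

/* One output ("=r", write-only), one input ("0", tied to operand 0's
   register), and a "memory" clobber shown only to illustrate the third
   colon group. The empty template means no instructions are emitted. */
static unsigned pass_through( unsigned a )
{
    unsigned out;
    __asm__( "" : "=r"(out) : "0"(a) : "memory" );
    return out;
}
```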
I don't think you actually need to set read or write permissions, because GCC can figure that out from the variable's type definition; e.g. const char* xxx; should not be written to. I'm not sure you actually need to put "r" or "m" or "+" in the constraints at all, unless the type definitions are wrong.
For example, gcc flags the first asm as an error, but the second is fine. GCC is happy reading from variables whether or not "r" exists.
I disagree, although I have no proof. It might not be necessary, but specifying read and write permissions can give GCC information that will allow it to better optimize the code surrounding your assembly.
If you don't specify permissions but your code only reads a register, GCC might not know that it doesn't need to save a copy before giving it to your assembly code, and so it will needlessly make a copy of it. If your code only ever writes a register but doesn't read it and you don't tell GCC, GCC won't know that the register can go in a new, more convenient place and doesn't have to be the same as the old register. That can avoid forcing GCC to do a mov later on if it needs to.
See above post, it's gcc version 4.6.1 and roughly 1 year old.
I don't remember if it came with simple IDE or not; I think I may have recompiled it from source on github. I do remember having to upgrade several things, including recompiling gdb separately to get it to work at all.
I have a version of gdb that has bugs: printing an array will print the pointers to each of the elements instead of the elements, and any attempt to call a function on the propeller with print /x myfunction() doesn't work. Unfortunately, there's no obvious way to inspect the cog's native registers without being able to call a function in the program. There are also times, even with gcc -O0 chosen, that the debug trace will follow the wrong branch path after an if (condition) {} -- but only during single-step mode, never if a breakpoint is used and a continue command issued. Still, gdb works well enough that it is usable -- and I'm pretty sure the SimpleIDE version does not work at all.
I probably ought to upgrade my tools in any event, as a lot can happen in a year. I'll definitely do that before reporting any bugs, officially.
Thanks for all the comments. (And the benchmark -- I need to learn to do that... optimized code residing inside the KERNEL obviously has a massive advantage. I need to benchmark the cached linear bit search to see what its performance really looks like with good caching.)
I wasn't really paying attention to how the rev command works when I wrote out the example; I was simply trying to see if gcc would recognize a variable name and compile with it at all. I'm not all that familiar with cog assembly yet. But yes, obviously the way I had it wouldn't reverse the bits for finding priority. I need " rev %0,#0" instead (since REV reverses the bottom 32-S bits, S must be 0 to reverse all 32). I haven't written inline assembly in probably 10 years, so I'm a bit rusty to say the least...
I am porting code from other processors that I've used in the past, and want to replace proprietary routines from those systems with more generic ones to make the code portable. I don't want to rewrite the code again in the future, if possible....
Some of the features in the propeller are pretty common, eg: My x86, and arm processors, also have barrel shifters -- so the same principle works on most machines.
Ultimately, I will probably want to create a library source tree with #if statements, so that I can select optimized routines based on architecture or cpu. But I want a good general purpose routine first that has the least penalties due to sheer stupidity.... eg: This program is a simple enough problem that it lets me get used to the propeller's quirks while exploring.
As to typos in that code... yes, I get ahead of myself sometimes... my apologies.
I was typing it from memory and forgot the CC flags; there was a "cmps r0,#0 wz,wc" in both places.
I had already deleted the source code which generated it so all I can do is try to recreate it, now.
The general idea I was doing at the time was a variation on the theme:
So let's see what happens: gcc -S -Os trajectoryqueue.c
then I copy and paste the output so there are NO typos this time:
It's different, so this is obviously not the exact same variation as before... but oh, Nelly! What happened to the add 16, 8, 4, 2, and 1?!
In fact, what the heck happened to n, and why?!
I don't see what I did wrong that they should disappear. Did I make a mistake in my code?
OTOH:
Here's the other optimization issue that I saw, gcc is actually making the code larger here for no reason.
It could just discard the computation in parentheses, using NR. Then there would be no need to do an extra move for every if statement.
This is an optimization that generally would be helpful, especially when two or more calculations exist inside the if() , but here it's counter-productive.
Optimization faults don't generate bad code, just slightly slower or larger code. So this last isn't really a problem for me.
The ones that bother me are those which remove working code and change the operation of the function. E.g. when a shift of 32 bits or larger happens, the result ought to be considered zero -- and it's OK to remove the statement if it effectively doesn't execute; but that should not corrupt any previous if statement's optimization state and generate bad code. That's what happened in the example I gave much earlier in the thread, and unless I'm really tired, the first example in this post might be another case of GCC misery.
I'm not seeing a link when I click on his profile page. So you must mean another homepage, and the only other one I see doesn't have obvious links either -- although I should say congratulations to David on getting married, for I did see that next to the picture of the cool-looking car and school info.
Hmm... I do see a GitHub repository that David owns, cloned from prop-gcc, but it looks to have no code-base changes -- only makescript/install patch changes from 8 months ago. Is that too old?
So -- I humbly ask, could someone post a more direct link; I don't want to recompile the wrong thing.... thanks!
I think those flags are called "constraints" in the manual for a reason, and generally when coding programs there is a danger in over-constraining the compiler's optimizer.
As I'm just starting out with the propeller, I would rather err on the side of allowing the compiler to make decisions as much as possible, rather than trying to force it. As I become more certain of how prop-gcc uses those flags, I'll start adding more of them as appropriate. Right now, I'm not even certain the compiler really needs those flags, as the memory and registers in a cog are really identical; unlike an x86 processor, which has a very limited set of accumulator registers capable of doing arithmetic, the COG essentially has 400+ general-purpose registers which are also memory. So those constraints don't really seem critical to me as long as the data resides inside the cog.
As to Hub ram instructions... well, I still have to study that... and you may well be right, there. constraints may be very appropriate.
Ha! Sorry about the confusion :P
Eric is referring to PropWare's homepage, where I list a number of "Useful Links" at the bottom, including links to my build server, where PropGCC is being built: http://david.zemon.name/PropWare
And thank you, I'm pretty excited
My clone of PropGCC is dead - I'll probably delete it momentarily. I created it a while ago so I could create some pull requests and they've since all been accepted.
Unless explicitly declared with the "COGRAM" keyword, PropGCC only allocates 16 registers in cog RAM for use. As I understand it, this is because GCC has these "registers" built into its very core. Almost every processor out there has some set of general-purpose registers like R0-R7 or R0-R15, and it's such a popular concept that GCC assumes it; not using that concept would require significantly more work by the PropGCC folks. So, rather than rewrite the whole compiler, they simply dedicated a few addresses in COGRAM to be used as the register set. A bit more address space is used by either the LMM or CMM kernel, and then a bit more is reserved for FCACHE space.
I am not one of the PropGCC developers so take this with a grain of salt. This comes from reading lots of docs and forum threads on PropGCC, not first hand knowledge of the code base.
Actually, all PropGCC versions are based on GCC 4.6.1 (there's work happening on a PropGCC based on GCC v5, but it's not ready yet). What he was asking for was the PropGCC version, not GCC version.
'Course... this brings up an interesting point: builds from my build server aren't reporting the tweak/patch version. I'll post in the PropGCC thread about how to fix that.