Bit priority in C...

ersmith · 2015-10-28 13:17

Fastrobot wrote: »

Ok.

What I think I saw (and forgot) in x86 and some other processors is a slightly different format string.
There are typically three colon groups to inform GCC about what is going on.

__asm__ ( "\tinstruction dest,src options" : [macroname1] "destination constraints" (var name1), ... : [macroname2] "source constraints" (name2), ... : "clobbered register", ... )

For example, in x86, the destination constraints might be "=rm", meaning gcc is allowed to write to either a register or memory, and the source constraints might be "rm", also meaning the source can be either register or memory.

Yes, this is a good summary of how constraints are written. But the GCC documentation may not be sufficiently clear on what constraints are for. GCC itself doesn't understand the assembly language, so the constraints are to tell it what are valid inputs and outputs for the instructions. For example, if the constraint says "r" (register) but the value is currently on the stack, GCC will know that it needs to load from the stack into a temporary and use that as input.
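To make the three colon groups concrete, here is a minimal sketch of an x86 inline add. The names add_with_constraints and check_add, and the plain-C fallback for non-x86 hosts, are assumptions made up for illustration; none of this is PropGCC-specific.

```c
#include <assert.h>

/* Hypothetical helper: returns a + b, using x86 inline assembly when available. */
static unsigned add_with_constraints(unsigned a, unsigned b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "+r": a is both read and written, and must be in a register.
       "rm": b may be in a register or in memory; GCC picks one.
       "cc": the add instruction changes the condition flags. */
    __asm__ ("addl %1, %0" : "+r" (a) : "rm" (b) : "cc");
    return a;
#else
    return a + b;   /* fallback so the sketch still builds on other hosts */
#endif
}

/* Hypothetical self-check for the sketch above. */
static void check_add(void)
{
    assert(add_with_constraints(2u, 3u) == 5u);
}
```

If b happens to live on the stack, the "rm" constraint lets GCC use it in place; with a plain "r" it would first load it into a temporary register.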

I think a blank field generally means no constraints at all. So, the command I wrote has no constraints on the destination but does have some constraints on the input field. That's why GCC did not kick out with an error. The variable %0 was defined to have constraints when used as an input field, but not when used as a destination.

I think it's probably safest to always provide the appropriate constraints. If you don't, it's possible that sometimes GCC will emit invalid assembly language (e.g. trying to pass an immediate value to an instruction that doesn't support it).

For the Propeller pretty much all outputs can have constraint "rC" ("r" means a register, which for PropGCC means one of the 16 registers reserved for the compiler, and "C" means a COG memory location). Inputs are generally "rCI", where "I" stands for an immediate value in the range 0-31. Obviously there are a few exceptions to this.

Only the rdlong, rdword, and rdbyte instructions can support an "m" input constraint ("m" means memory, which for PropGCC means HUB memory) and only wrlong, wrword, and wrbyte allow an "m" output constraint.

Eric

ersmith · 2015-10-28 13:29

Fastrobot wrote: »

I don't remember if it came with simple IDE or not; I think I may have recompiled it from source on github. I do remember having to upgrade several things, including recompiling gdb separately to get it to work at all.

If you're comfortable with compiling it from source (and it sounds like you are; you seem to know your way around code!) then I'd recommend grabbing David Betz's propeller-gcc repository at https://github.com/dbetz/propeller-gcc. That has up-to-date versions of everything. The gdb in there is probably a bit better than the one you struggled with, although it's still pretty much alpha -- there hasn't been much interest in gdb, alas. That repo also gives you the choice of building "gcc" (the cutting edge gcc 6 repo) or "gcc4" (the stable gcc 4.6.1 repo).

The general idea I was doing at the time was a variation on the theme:
int testNN( unsigned long val ) {
	int n=-1;
	if (!(val<<32-16)) { n+=16; val>>=16; }
	if (!(val<<32-8 ))  { n+=8 ;  val>>=8; }
	if (!(val<<32-4 ))  { n+=4 ;  val>>=4; }
	if (!(val<<32-2 ))  { n+=2 ;  val>>=2; }
	if (!(val<<32-1 ))  { n+=1 ;  val>>=1; }
	return val+n; // this is a mistake, really, it probably should be n+!!val; but as long as val has only one bit set, the algorithm should work.
}
So let's see what happens: gcc -S -Os trajectoryqueue.c
then I copy and paste the output so there are NO typos this time:

Hmmm. The output you posted is definitely buggy, but I'm not getting that with the current gcc4. I get instead:

	.balign	4
	.global	_testNN
_testNN
	shl	r0, #16 wz,nr
	IF_E  shr	r0, #16
	IF_E  mov	r7, #15
	IF_NE neg	r7, #1
	shl	r0, #24 wz,nr
	IF_E  shr	r0, #8
	IF_E  add	r7, #8
	shl	r0, #28 wz,nr
	IF_E  shr	r0, #4
	IF_E  add	r7, #4
	shl	r0, #30 wz,nr
	IF_E  shr	r0, #2
	IF_E  add	r7, #2
	test	r0,#0x1 wz
	IF_E  add	r7, #1
	IF_E  shr	r0, #1
	add	r0, r7
	lret

which should at least work. I don't recall all the bugs fixed in gcc4, but there have been some, and one of them must have bitten you.
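For anyone who wants to check compiler output like this against the C source on a desktop machine, here is a rough sketch. It assumes uint32_t in place of unsigned long (which is 32 bits on the Propeller but often 64 bits on a PC), uses the "n + !!val" return that the code comment suggests, and adds a slow reference loop to compare against. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0..31) of the lowest set bit, by binary search on the low bits.
   val must be nonzero. */
static int lowest_bit_index(uint32_t val)
{
    int n = -1;
    assert(val != 0);
    if (!(uint32_t)(val << 16)) { n += 16; val >>= 16; }  /* low 16 bits clear? */
    if (!(uint32_t)(val << 24)) { n += 8;  val >>= 8;  }  /* low 8 bits clear?  */
    if (!(uint32_t)(val << 28)) { n += 4;  val >>= 4;  }
    if (!(uint32_t)(val << 30)) { n += 2;  val >>= 2;  }
    if (!(uint32_t)(val << 31)) { n += 1;  val >>= 1;  }
    return n + !!val;   /* unlike val + n, this ignores any higher set bits */
}

/* Slow reference: shift right until bit 0 is set. val must be nonzero. */
static int lowest_bit_index_ref(uint32_t val)
{
    int i = 0;
    assert(val != 0);
    while (!(val & 1u)) { val >>= 1; ++i; }
    return i;
}
```

With a single bit set, both this version and the original "val + n" return give the bit position; they only differ when more than one bit is set.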

OTOH:
Here's the other optimization issue that I saw: gcc is actually making the code larger here for no reason.
It could just discard the computation in parentheses, using NR. Then there would be no need to do an extra move for every if statement.

Unfortunately most of gcc's optimization is processor independent, and not many processors have the NR feature, so that's not something they thought of. There's not really any way to force the recomputation instead of saving results, or at least not any way I've seen short of re-writing the machine independent code (which nobody really wants to do!)

Thanks for your comments and suggestions,
Eric

ersmith · 2015-10-28 13:34

Fastrobot wrote: »

I think those flags are called "constraints" in the manual for a reason, and generally when coding programs there is a danger in over-constraining the compiler's optimizer.

The documentation isn't as helpful as it could be. The constraints are really there to tell the compiler about the assembly -- what inputs and outputs are permitted in the machine's assembly language (e.g. this instruction takes a register input and produces a memory output). It's only secondarily used for guiding optimization. Incorrect or incomplete constraints can lead to incorrect code. Sometimes this would cause an assembly time error, but there are some cases where the code would assemble but not run correctly if the constraints are wrong (for example if you forget to mention that a register is modified by an instruction the compiler might think it can re-use the values in that register later). It's safest to provide as much detail as possible to the compiler.
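As a rough illustration of the clobber case, here is a sketch in x86 terms (not Propeller). The names add_via_scratch and check_scratch, the choice of ecx as the scratch register, and the plain-C fallback are all assumptions for the example.

```c
#include <assert.h>

/* Hypothetical helper: a + b, deliberately routed through a scratch register. */
static unsigned add_via_scratch(unsigned a, unsigned b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ ("movl %1, %%ecx\n\t"     /* copy b into the scratch register */
             "addl %%ecx, %0"         /* a += scratch */
             : "+r" (a)
             : "rm" (b)
             : "ecx", "cc");          /* ecx and the flags are overwritten */
    return a;
#else
    return a + b;   /* fallback so the sketch still builds on other hosts */
#endif
}

/* Hypothetical self-check for the sketch above. */
static void check_scratch(void)
{
    assert(add_via_scratch(7u, 8u) == 15u);
}
```

If "ecx" were left out of the clobber list, this would still assemble, but the compiler would be free to keep a live value in ecx across the asm statement and then read back garbage.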

Eric

Fastrobot · 2015-10-28 19:52

Eric,
I just downloaded from David's previously linked homepage the (recommended) propgcc / GCC4 for LINUX.
I had to move the default untar directory from parallax to parallaxOct2015, because my old gcc was in parallax and I didn't want to wreck it yet. That made me suspicious -- so I did some md5's and looked at timestamps; and his package is definitely newer than mine: August of 2015.

The newer compiler does compile exactly as yours does, which is in fact -- the correct way for it to compile.

Now I need to test gdb and see if it works better... this could be a really good day.

Edit: GDB does correctly show arrays, now, too. So whatever caused the bug disappeared in the last year.

But GDB still can't call functions like this:

...

unsigned long gdbio(void) {
	return ina();
}

// The main program to initialize everything and call
int main(void) {
// All static variables, being in BSS, are presumed zeroed by C startup.
// All chip initialization is done here, in main.

	int tmp, dir, out=gdbio();
       ...

Breakpoint 1, main () at trajectoryqueue.c:291
291     int main(void) {
(gdb) print /x gdbio()
The program being debugged stopped while in a function called from GDB.
Evaluation of the expression containing the function
(gdbio) will be abandoned.
When the function is done executing, GDB will silently stop.

Unfortunately most of gcc's optimization is processor independent, and not many processors have the NR feature, so that's not something they thought of. There's not really any way to force the recomputation instead of saving results, or at least not any way I've seen short of re-writing the machine independent code (which nobody really wants to do!)

Interesting, so -- why do you think the example you posted before it, which shows my compiler was outdated, uses NR in the expressions doing if () ?

If GCC doesn't really understand assembly language as you mentioned before, and I'm willing to accept that as likely, then it follows that GCC must use a generic table of some kind to describe the size and nature of an operation in terms of assembly language. The compiler, then, selects a particular assembly language command from the table based on size, and other constraints without actually understanding the command. But -- you seem to be suggesting that either gcc doesn't pay attention to size when doing optimizations (at which point someone needs to complain, because it ought to !!!) , or else whoever ported GCC didn't fill out the table with all the information GCC needed.

Am I overlooking something?

I see the GitHub repository for the source code of GCC, and may do a source recompile a bit later on.
If GCC is really using a table for assembly commands, I'd like to at least learn where it is in the source code. I've never looked that closely before, and gcc is a pretty big package.

ersmith · 2015-10-29 12:29

Fastrobot wrote: »

Unfortunately most of gcc's optimization is processor independent, and not many processors have the NR feature, so that's not something they thought of. There's not really any way to force the recomputation instead of saving results, or at least not any way I've seen short of re-writing the machine independent code (which nobody really wants to do!)

Interesting, so -- why do you think the example you posted before it, which shows my compiler was outdated, uses NR in the expressions doing if () ?

We do have some Propeller specific peephole optimizations and combinations that will tell the compiler to output assembly sequences containing NR. But these will only catch some fairly local, specific things. The higher level optimization of the compiler still thinks that it's always better to save a value in a register rather than calculating it twice. As you pointed out this is usually true, but there are some situations on the Propeller where it isn't. We might be able to add additional peepholes to catch more of these, but changing the compiler's default assumption (better to save values than recalculate them) is not really practical.

If GCC doesn't really understand assembly language as you mentioned before, and I'm willing to accept that as likely, then it follows that GCC must use a generic table of some kind to describe the size and nature of an operation in terms of assembly language. The compiler, then, selects a particular assembly language command from the table based on size, and other constraints without actually understanding the command.

Basically, yes, although rather than a plain table there's a fancy pattern matching engine. The patterns used by PropGCC are found in gcc/config/propeller/propeller.md (".md" stands for "machine description"). For example, the instruction to subtract 2 integers to produce a third has the pattern:

(define_insn "subsi3"
  [(set (match_operand:SI 0 "propeller_dst_operand" "=rC")
	  (minus:SI
	   (match_operand:SI 1 "propeller_dst_operand" "0")
	   (match_operand:SI 2 "propeller_src_operand" "rCI")))]
  ""
  "sub\t%0, %2"
  [(set_attr "predicable" "yes")]
)

The syntax here is fairly similar to the syntax used for inline assembly. The last parameter of each match_operand is a constraint. So in this case operand 0 is modified ("=") and may be either a register or a COG memory address ("rC"); the "0" for operand 1 says that it has to be the same as operand 0, since on the Propeller we only have a 2 operand sub instruction; and operand 2 may be either a register, COG memory address, or small immediate value ("rCI"). The final string "sub\t%0, %2" is the actual assembly language produced, with %0 replaced by the appropriate assembly representation of operand 0 and %2 replaced by the representation of operand 2.

When GCC goes to output a subtraction like "a = b - c" it will produce a pattern like "(set a (minus b c))" and then later will try to match that pattern with assembly language. The constraints will cause the register allocator to assign a and b to the same register, or generate an appropriate move if it can't do that.
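The effect of that tied "0" constraint can be modelled with a toy emitter. This is only a sketch of the idea, not how GCC is implemented: emit_sub is a made-up function, operands are plain strings, and the case where a and c share a register (which would need a temporary) is simply ruled out with an assert.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model: write two-operand assembly for "a = b - c" into out.
   Returns the value returned by snprintf. */
static int emit_sub(char *out, size_t size,
                    const char *a, const char *b, const char *c)
{
    if (strcmp(a, b) == 0) {
        /* a and b already share a location: one instruction is enough */
        return snprintf(out, size, "\tsub\t%s, %s\n", a, c);
    }
    /* a real allocator would need a temporary if a and c were the same */
    assert(strcmp(a, c) != 0);
    /* otherwise copy b into a first, then subtract in place */
    return snprintf(out, size, "\tmov\t%s, %s\n\tsub\t%s, %s\n", a, b, a, c);
}
```

Calling it with a and b both "r0" gives a single sub, while a different destination gives the extra mov that the register allocator tries to avoid.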

But -- you seem to be suggesting that either gcc doesn't pay attention to size when doing optimizations (at which point someone needs to complain, because it ought to !!!) , or else whoever ported GCC didn't fill out the table with all the information GCC needed.

See above -- the problem is that the decision that we should calculate the value and move it into a register is made at a higher level than the instruction matching, which is frustrating because sometimes the Propeller code would be better if this wasn't done. To some degree we can add patterns that match multiple operations, but it's hard to catch everything. (The size of the instructions is actually included in the patterns, in the set_attr portion at the end, but it doesn't explicitly appear in the subsi3 pattern above because that one has the default size of 4 bytes; only a few multi-instruction patterns are different from that and need an explicit size.)

Fastrobot · 2015-11-07 06:34

ersmith wrote: »

Unfortunately most of gcc's optimization is processor independent, and not many processors have the NR feature, so that's not something they thought of. There's not really any way to force the recomputation instead of saving results, or at least not any way I've seen short of re-writing the machine independent code (which nobody really wants to do!)

I see.

We do have some Propeller specific peephole optimizations and combinations that will tell the compiler to output assembly sequences containing NR. But these will only catch some fairly local, specific things.

So, when a local and highly specific event happens repeatedly, you can make a peephole optimization?

I assume these peephole optimizations are designed for the most common cases encountered.
GCC is designed to work with standard processors... and detect certain kinds of C constructions to optimize them....
so, optimizations that can be detected by GCC on other processors should be possible on the Propeller.

I've been generating and looking at code, and in my code a very common design pattern is to check data to see if a null pointer has been encountered;
or to count down to zero, and then branch once zero is reached. Generally it happens most times when an if statement has no logical operators or only the ! in the parenthesis...

GCC usually generates extremely efficient code on most modern processors for this idea, because zero seldom needs a compare.
So anytime a variable is read from memory, and used as a condition without any logical operators -- I usually get very compact assembly generated.

Here's some example code, with comments to focus your attention on where I use the idea: eg: < ~~~~~ comments

unsigned pop( trajectoryState_t *state, trajectoryState_t *states ) {
	trajectoryFlag_t flag;

	if (!state->iPop) return 0; // If no state, then no data, obviously.
	if (state->header && --(state->iPop)) state->iPop=QUEUESIZE; // next data
	while ( (flag=flagStack[state->iPop]).operation == STACK_HEADER ) {
		if (!(state->header)) { // open header & check for repartition   < ~~~~~
			
			// Repartition before using & freeing the header
			if (!repartition( states, state->flag.mask, flag.mask)) return 0;

			// Begin execution of the new header
			state->header=dataStack[state->iPop].next.series;
			if (!state->header) return 0; // jam!, header is not filled out < ~~~~~~

			++stackMem; // keep track of freed memory
			flagStack[state->iPop].mask = 0; // mark opening header as freed-mem

		} else { // Closing header detected, do not free it.

			if ( ! ~state->header ) { // Check for stack stop
				partitionState( states, STATE_UNLINK, state->flag.mask );
				return 0;
			}
		}
		if (--(state->iPop)) state->iPop=QUEUESIZE;
	}
	return state->iPop;
}

I used it explicitly twice in that piece of code, and possibly used it implicitly at least twice more.
It will generate propeller assembly language like this:

        rdword  r11, r1
        cmps    r11, #0 wz,wc
        IF_NE   jmp     #__LMM_FCACHE_START+(.L77-.L85)
        rdlong  r11, r2
        cmps    r11, #0 wz,wc
        IF_E    jmp     #__LMM_FCACHE_START+(.L82-.L85)

But normally, I would expect something like this to be generated by GCC.

       rdword r11,r1 wz
       IF_NE jmp #_LMM_...
       rdlong  r11,r2 wz
       IF_E jmp #_LMM_...

I am not sure if the condition codes are set correctly for hub reads of bytes; but it ought to work for at least long reads.
I use that idea a ton... so it kind of stinks that the propeller doesn't have an efficient way to ever use it.

Is this something simple enough that could be done with a peephole optimization?
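For testing, the pattern can be cut down to something much smaller than pop(). This is a hypothetical reduced case (the function name classify and its return codes are invented here) that should show the same load-then-compare sequence when compiled with gcc -S -Os:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced case: two loads, each followed by a branch on zero. */
static int classify(const unsigned short *flag, const unsigned long *header)
{
    assert(flag != NULL && header != NULL);
    if (*flag)    return 1;   /* 16-bit load, then branch if not zero */
    if (!*header) return 2;   /* long load, then branch if zero */
    return 0;
}
```

A small function like this is easier to diff between compiler versions than the full queue code.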
