Compiler ignoring order of operations.
TomUdale
Posts: 75
in Propeller 1
Hi All,
In my quest to find ways to prevent the compiler from reordering all my operations around timing instructions, I discovered what looks to me like an error.
To prevent code movement I came up with the idea of enclosing the code I did not want to move in an if() block that is always taken but cannot be optimized away. The template is this:
extern volatile int alwaysTrue;

void foo(void)
{
    unsigned ct;
    unsigned time;

    ct=CNT;
    if(alwaysTrue)
    {
        // do a bunch of stuff.
    }
    time=CNT-ct;
}
This seems to pretty much prevent any of the // do a bunch of stuff code from floating outside the timing block. I wrapped this up in a macro:
#define BEGIN_COMPILE_BARRIER(name) if(alwaysTrue){
#define END_COMPILE_BARRIER(name) }
(The "name" parameter is for my second implementation which removes the overhead of the if but is not shown here).
However, when I started applying this concept in general, I suddenly got some broken code. It turns out that the error does not seem to be related to my efforts but is more general in nature.
Below is a section of the C code in question.
BEGIN_COMPILE_BARRIER(MEASURE_CALC_LOADS);
extraBase=CNT;
END_COMPILE_BARRIER(MEASURE_CALC_LOADS);
if (status)
{
    status->stimLoopTime=(extraBase-ct)+extraTime; // <This goes wrong.
    data->doneStatus=status;
}
extraTime=CNT-extraBase;
The error is around the calculation of stimLoopTime.
The generated asm is:
412:cpx_stimdriver.cogc ****    BEGIN_COMPILE_BARRIER(MEASURE_CALC_LOADS);
 603                    .loc 1 412 0
 604 0534 0000BC08      rdlong  r7, r12
 605 0538 00007CC3      cmps    r7, #0 wz,wc
413:cpx_stimdriver.cogc ****    extraBase=CNT;
 606                    .loc 1 413 0
 607 053c 00001408      IF_NE wrlong  CNT, sp
414:cpx_stimdriver.cogc ****    END_COMPILE_BARRIER(MEASURE_CALC_LOADS);
415:cpx_stimdriver.cogc ****    if (status)
 608                    .loc 1 415 0
 609 0540 00007CC3      cmps    r14, #0 wz,wc
 610 0544 0000685C      IF_E jmp    #.L33
416:cpx_stimdriver.cogc ****    {
417:cpx_stimdriver.cogc ****        status->stimLoopTime=(extraBase-ct)+extraTime;
 611                    .loc 1 417 0
 612 0548 0000BCA0      mov r7, r14        ; load address of status
 613 054c 0000BC84      sub r11, r8        ; subtract ct (in r8) from extraTime (r11)
 614                    .LVL31
 615 0550 1000FC80      add r7, #16        ; calculate address of stimLooptime
 616 0554 0000BC08      rdlong  r3, sp     ; read extraBase back.
 617 0558 0000BC80      add r11, r3        ; add extraBase to extraTime-ct.
 618 055c 00003C08      wrlong  r11, r7    ; write result to stimLoopTime
418:cpx_stimdriver.cogc ****        data->doneStatus=status;
 619                    .loc 1 418 0
 620 0560 0000BCA0      mov r7, r13
 621 0564 0C00FC80      add r7, #12
 622 0568 00003C08      wrlong  r14, r7
 623                .L33
419:cpx_stimdriver.cogc ****    }
420:cpx_stimdriver.cogc ****    extraTime=CNT-extraBase;
 624                    .loc 1 420 0
 625 056c 0000BCA0      mov r11, CNT       ; save CNT
 626 0570 0000BC08      rdlong  r5, sp     ; read extraBase
 627 0574 0000BC84      sub r11, r5        ; calculate extraTime
 628                    .LVL32
 629                .LBE3
421:cpx_stimdriver.cogc ****    }
 630                    .loc 1 421 0
 631 0578 00007C5C      jmp #.L34
I have commented the relevant areas. You can see that for some reason the compiler has decided to change the order of operations from
status->stimLoopTime=(extraBase-ct)+extraTime;
to
status->stimLoopTime=(extraTime-ct)+extraBase;
This is not really OK to do. extraTime has a value of about 20; ct is a raw CNT clock reading, as is extraBase. So extraTime-ct goes negative, which then wraps to a big positive number. Then we add another big positive number in the form of extraBase, and now our total elapsed time, which should be on the order of 2500, is massive.
If I comment out the BEGIN_COMPILE_BARRIER(MEASURE_CALC_LOADS) block around extraBase=CNT, the code changes to the following:
412:cpx_stimdriver.cogc ****    //BEGIN_COMPILE_BARRIER(MEASURE_CALC_LOADS);
413:cpx_stimdriver.cogc ****    extraBase=CNT;
 602                    .loc 1 413 0
 603 0524 0000BCA0      mov r7, CNT        ; save extraBase into r7
 604                    .LVL33
414:cpx_stimdriver.cogc ****    //END_COMPILE_BARRIER(MEASURE_CALC_LOADS);
415:cpx_stimdriver.cogc ****    if (status)
 605                    .loc 1 415 0
 606 0528 00007CC3      cmps    r14, #0 wz,wc
 607 052c 0000685C      IF_E jmp    #.L32
416:cpx_stimdriver.cogc ****    {
417:cpx_stimdriver.cogc ****        status->stimLoopTime=(extraBase-ct)+extraTime;
 608                    .loc 1 417 0
 609 0530 0000BCA0      mov r6, r14        ; save address of status
 610 0534 0000BC80      add r12, r7        ; add extraTime (r12) and extraBase (r7)
 611                    .LVL34
 612 0538 1000FC80      add r6, #16        ; calculate offset of stimLoopTime
 613 053c 0000BC84      sub r12, r8        ; subtract ct from extraTime+extraBase
 614 0540 00003C08      wrlong  r12, r6    ; save result
418:cpx_stimdriver.cogc ****        data->doneStatus=status;
 615                    .loc 1 418 0
 616 0544 0000BCA0      mov r6, r13
 617 0548 0C00FC80      add r6, #12
 618 054c 00003C08      wrlong  r14, r6
 619                .L32
419:cpx_stimdriver.cogc ****    }
420:cpx_stimdriver.cogc ****    extraTime=CNT-extraBase;
 620                    .loc 1 420 0
 621 0550 0000BCA0      mov r12, CNT
 622 0554 0000BC84      sub r12, r7        ; calculate extraTime
 623                    .LVL35
 624                    .LBE3
421:cpx_stimdriver.cogc ****    }
 625                    .loc 1 421 0
 626 0558 00007C5C      jmp #.L33
It has now recast the math to
status->stimLoopTime=(extraBase+extraTime)-ct;
which is still wrong but at least works in this instance.
But the underlying issue is that it is completely ignoring the ordering specified by the parentheses. I know that compilers can do lots of optimizing, but I don't think they get to redefine things like this. Order of operations involving subtraction is important.
Any thoughts?
Best regards,
Tom
Comments
What results do you actually get?
I'll also note that it looks like the code generated is incorrect in at least one place: The wrlong is using CNT as a destination register, but I think that will cause the shadow register attached to CNT to be used instead (I could be misremembering, but I think CNT only returns the right result when used as a source register).
Working around that bug may be awkward: you'll want to make sure that extraBase is in a register. Putting the CNT read in a subroutine call might help (as then it will be read into r0 instead of onto the stack).
Is that right? I guess that makes sense. I must confess that the subtraction-is-addition-of-two's-complement stuff never really made intuitive sense to me. I convinced myself in part because the results were indeed completely different. In the one case I had the expected answer. In the other case I had either negative or very large positive values, depending on how printf was feeling about things.
To be perfectly honest, this particular cog has gotten so screwed up I can barely make heads or tails of it. All things impossible are starting to happen and I am about to start back at zero with it. I thought I had one small victory with this particular impossibility but I guess not. I will keep looking at it.
All the best,
Tom
x + y - z
Might overflow during the addition. But then the subtraction could bring the result back into reality. For example:
0x7FFFFFFF + 1 - 1
Yields 0x80000000 after the addition, which has gone negative due to overflow. But the subtraction then yields 0x7FFFFFFF, the expected result.
BUT...
The C standard says the result of an overflow is undefined.
That means the addition above is free to produce any random number as a result.
That means it's pretty much impossible to write an expression in C, like x + y + z, that has a defined result!
Or am I missing a point here?
Does the C standard say anything about the order in which a bunch of additions and subtractions are done?
This is in fact entirely consistent with something I have been seeing for a long time. Indeed the counts have always gone wrong when the intermediate variable was on the stack. I always had the feeling there was something inherently flaky with the stack and have often done what you said. I am not sure why I did not remember this in this case. I guess because I kept increasing the stack size (thinking there was an overflow) and nothing changed.
Also, the variables are in fact all unsigned. They are counts after all and (I thought) there would be no chance of negative values because I did the math in a specific order.
My understanding was that it did not. That is why I specified the order with the () because I knew that doing it in that order would not result in an overflow.
Although, in retrospect, propeller code relies on the overflow/underflows working the way you understand. Every timing or wait is either
CNT-previousCnt
or
CNT+somevalue
For those two to work in general, it has to be the way you say.
I just looked into this and you are totally right. CNT can only be used as a source register. Indeed you practically quoted the manual:
"CNT is a read-only pseudo-register; when used as an instruction’s source operand, it reads the current value of the System Counter. Do not use CNT as the destination operand; that only results in reading and modifying the shadow register whose address CNT occupies. "
x + y - z
I might know that the addition may overflow. And I might know that the final result will always be in range. So I write:
x - z + y
To avoid the intermediate overflow.
BUT if the compiler can rearrange the order of those operations, the compiled code may well do the addition first and we are effectively back to:
x + y - z
Which does overflow and the final result is therefore undefined.
Ergo, it's damn hard to write an arithmetic expression in C that actually has a defined result!
Of course C defines operator associativity, which defines their evaluation order, and that is left-to-right for + and -, so the compiler should not be doing those in any order other than the one written.
As far as I can tell that is.
I had always assumed that using brackets would dictate the order of operations. Besides, operator associativity rules suggest order is preserved. Statements above indicated that the optimizer could still reorder things.
Which is it then?
But at least '*' and '/' have higher precedence than '+' and '-', which is often enough to sort out an expression to avoid overflows etc. But maybe not always.
That section (C90, 6.3) does not spell it out for me. What is a subexpression? I take that to mean the parts in parentheses. For example:
(a + b) + (c + d) + (e + f)
has three subexpressions.
We don't care which order the "a + b" or "c + d" or "e + f" is performed in.
We may care that the results of those three subexpressions are added in the order given. As the associativity rules suggest.
What I'm getting is what does the standard actually say?
We know it says the result of overflow is undefined.
If it were to also allow for reordering of operations, as hinted at earlier in the thread, then that would allow overflow where the programmer had carefully ordered operations to avoid it.
That would imply that the standard says that pretty much any expression is undefined!
This is all theoretical. As we know overflowing integer addition on any sane modern processor is defined.
I always imagined that at -O0 the compiler did no optimization and both the addition and division would get generated in code. On the other hand when optimizing, source code analysis can reduce it to just an assignment.
But that means that if overflow can occur, then optimized and unoptimized code would produce different results.
I get terrible headache whenever start reading standards documents.
This "random number" generating feature of integer arithmetic has always bugged me. Why don't processors throw an exception on overflow like they do for divide by zero?
Ada is the only language I know that takes care of this.
I was thinking about this on the drive in, and it occurred to me that because this particular processor (and indeed almost all modern ones) has non-destructive overflow rules, and because the compiler knows this, the compiler is allowed to ignore programmer-specified ordering. I believe the rule for optimizations is "don't change observable behavior". On these machines addition and subtraction order has no influence on the result, so changing that order will not change observable behavior.
However on a machine where the order did matter (say it has saturating logic on arithmetic) then I expect that the compiler will _not_ ignore user specified ordering under the assumption that the user is trying to avoid overflow.
The practical question therefore is should programmers of "normal" machines even worry about this stuff? It is probably folly to think you can write "overflow aware" code on a machine where it does not matter so maybe you should not even bother?
All the best,
Tom
So they all got negative results with a >= 0x40000000
SGI always were great computers. And IRIX.
By the way, is there some bug tracker I should be creating an entry in for this?
Of course it is expected that the compiler implementations define those things.
This leads to all the familiar "surprises" we get when moving code from one system to another, say from a 16 bit machine to a 32, or 32 to 64. Or working on an opposite endianness machine.
It's amazing anything works at all.
I was bitten by this a few years back when trying to create security certificates with openssl. For some reason they were created pre-expired despite my setting the expiry date far into the future. I was busy fighting with the VPN configuration so I did not think about it much. Later I found the same procedure worked on a 64 bit PC. Then it hit me: I had set the expiry date 40 years into the future, well past 2038. It just happens that the Unix time_t overflows its 32 bit integer size in 2038, so my certs were pre-expired! On investigation I found that there had indeed been a bug raised against this issue a year or so after it bit me.
As you may know, OpenSSL has become famous for being riddled with bugs recently, many of which can be attributed to the sloppy nature of C.
But looking at this discussion it's pretty clear that those who claim that C is just some kind of 'high-level' assembler are wrong. With assembly code you are in control of instruction sequence (there are out-of-order CPUs, but we don't want to go there), with C you don't, to some extent at least, as with high-level languages in general. And that's one of the distinguishing differences between a 'compiler' and an 'assembler' too.
As for OpenSSL, I took a look at that code back then (where the heartbeat error was), and I attribute that to horrible programming practice and not much else really.
Yes, writing endian-independent and 32/64-agnostic code is not all that hard. Don't forget 8 and 16 bit machines though.
However you might be surprised at how often I have had to sort out such problems when projects have migrated from machine to machine. It is often a problem when those brought up in the Windows-only world are involved.
I agree C is a lot more than a portable assembler; I think there is more to it than that. Let's look at Apple's SSL bug: apart from the bug pointed out there, I see a lot wrong with this code that I don't think the compiler should even accept.
1) Didn't Dijkstra point out that goto is harmful in 1968! Why is there even a goto in any high level language still?
2) Syntax inconsistency. See how an "if" can apply to a single statement or a block of statements enclosed in braces? This is a famous cause of bugs. The language should enforce the use of blocks after if, while, etc.
3) Those "if" statements are assigning to err as well as testing the call returns. Such side effects should not be allowed in a high-level language.
Enforcing the above would have prevented this bug.
One can say it's "horrible programming practice" to write like that. Which is true. But it would not take much for the compiler to spot such errors or even not support the feature at all.
Luckily we have linters and other such tools to find such problems.
I have not looked at the heartbleed bug closely but I bet we can find similar language issues there.
Probably the github issue tracker for parallaxinc/propgcc. Unfortunately the SimpleIDE version of PropGCC hasn't been updated in years, so it may be a long time before any bug fixes make it out to the general public.
Oh!! I did not realize that there were multiple versions of propgcc. Where is the location of the maintained propgcc? Is it this one:
https://github.com/parallaxinc/propgcc
I am not married to SimpleIDE at all. I just use it to run the disassembler. Nevertheless it would be ideal to just drop in the newest propgcc. Can it be used with SimpleIDE? Or are there some specific patches that SimpleIDE wants?
Tom
Eric
However, that extra 'goto' should have been easily detected; the compiler provides a warning about unreachable code. With the right flags (common now, in default automake setups) that should have failed the build.
It's not really a side-effect problem there, and it's not that bad: the coder has at least used the practice of extra braces around the expression, to tell GCC (and some other compilers) what the intention is. It could definitely have been written differently.
[..] The heartbleed bug was caused by someone trying to save typing by something like this (from memory): and passing that styp around to other functions; the only apparent purpose was to avoid having to write those long lines. Nothing wrong with shortcuts, as long as it's kept inside that single block. But the code was full of those kinds of pointer-copies.
Ah, you remember the original question! That is a nice solution. It is lightweight and does not confuse the compiler much. I found a separate approach that is akin to the if(alwaysTrue) path I was describing earlier. It is this: You use it the same as in the original post in this thread: Unwinding the macro you get: The asm goto statement tells the compiler that the associated asm "code" might jump to either the begin or end label. Because the compiler does not know what will happen, it cannot move any statements across either the begin or end labels in either direction. I included the name parameter to allow multiple blocks within a function or even nested blocks.
This seems to work and does not generate any overhead other than missed optimization opportunities.
The advantage compared to your approach is that for situations where there are large numbers of statements in the //do stuff block, you don't have to USE() a bunch of times to hold them all back. The major disadvantage is that, depending on the code, you can get lots of spurious "variable may not be initialized before use" warnings since the compiler thinks various assignments may get skipped over even though they will not. You do not get this with USE().
I expect there may be some other clever asm trick to "touch" parameters without touching them, thereby making it look like they were assigned, that might solve that problem. I have decided to ignore that for the moment.
I have learned several things in the last few days: how to block optimization movement, why my val=CNT was failing when val was a stack variable and that propgcc is not being updated with SimpleIDE; all good things. Furthermore, not 2 hours ago I finally found a bug I have been chasing for about a week. Faith is being restored to the land.
Thanks and
all the best,
Tom