Shop OBEX P1 Docs P2 Docs Learn Events
Memory Barrier — Parallax Forums

Memory Barrier

Greetings,

Does this actually work on propgcc:
#define COMPILER_BARRIER() asm volatile("" ::: "memory")
? I get very strange behavior in that it will compile under SimpleIDE, but when I try to look at the generated asm, I get:

error: 'asm' undeclared (first use in this function)

so I cannot actually tell if it did anything. In my experience with timings, the compiler is so sensitive to changes in code that there is no logic to looking at the actual number you get back and you really have to examine the assembly to figure out why a time did what it did.

I am trying to prevent movement of timing measurements when compiling with -Os optimizations in a COGC. For example:
cnt=CNT;
//do a bunch of stuff
externalStruct->time=CNT-cnt;
The compiler feels free to move both timing lines pretty much anywhere it wants since there are no dependencies between them and anything that happens in the "do a bunch of stuff" code. Indeed, strictly speaking I believe the following is an allowable reordering under the rules of C:
cnt=CNT;
externalStruct->time=CNT-cnt;
//do a bunch of stuff
Obviously it typically does not move them nearly that far, but enough to be annoying when tuning things. So I want to do something like:
cnt=CNT;
COMPILER_BARRIER() ;
//do a bunch of stuff
COMPILER_BARRIER() ;
externalStruct->time=CNT-cnt;
to lock those lines down.

All the best,

Tom


Comments

  • ElectrodudeElectrodude Posts: 1,657
    edited 2016-05-05 22:21
    Are you sure the compiler is moving stuff around how you say it is? What kind of timing results are you getting?

    Did you try "__asm" instead of "asm"?
  • Yes, that definition of barrier should work fine. spin2cpp generates something similar to force memory synchronization in empty loops.

    You mentioned that it worked for you under SimpleIDE, but you couldn't look at the generated asm? How are you generating the asm?

    One oddity I have found with GCC is that "asm volatile" is not allowed outside of functions (i.e. at top level), but plain "asm" is. That shouldn't be an issue for your #define as long as COMPILER_BARRIER() is used inside of function bodies only.
  • TomUdaleTomUdale Posts: 75
    edited 2016-05-06 02:46
    Hi ersmith,

    For normal builds we have a makefile we run under Visual Studio. We started on SimpleIDE and moved to this approach when we needed multiple build configurations and could not find a nice way to make SimpleIDE do it.

    I now use SimpleIDE only for generating asm listings and map files because we have not put that stuff into our makefile and I am too lazy to do so. So I just right click on the file and select the assembly option.

    I was surprised when it choked so I tried to build from within SimpleIDE and it was fine. Hence my confusion.

    I have not tried the asm volatile barrier outside a function yet. I will make sure not to.

    Any chance __asm as suggested by ElectroDude will sort out SimpleIDE? Maybe it does not know about the asm keyword when generating assembly listings for some inexplicable reason? Or would I be better off just adding the ability to generate listings straight from my makefile?

    All the best,

    Tom
  • TomUdale wrote: »
    Any chance __asm as suggested by ElectroDude will sort out SimpleIDE? Maybe it does not know about the asm keyword when generating assembly listings for some inexplicable reason? Or would I be better off just adding the ability to generate listings straight from my makefile?

    Have you tried it?

    According to Google, asm isn't an official C keyword. You can tell GCC whether it should treat asm as a reserved word or as a keyword denoting a block of inline assembly. But since __asm and __asm__ start with double underscores, and anything starting with a double underscore is reserved to the compiler, the compiler can safely always enable them. I think you want __asm__ and not __asm , but I'm not really sure of the difference. David Zemon's PropWare library, which is full of inline fcache'd assembly, uses __asm__ volatile, so I guess that's what you want.
  • TomUdaleTomUdale Posts: 75
    edited 2016-05-06 12:37
    Have you tried it?

    According to Google, asm isn't an official C keyword. You can tell GCC whether it should treat asm as a reserved word or as a keyword denoting a block of inline assembly. But since __asm and __asm__ start with double underscores, and anything starting with a double underscore is reserved to the compiler, the compiler can safely always enable them. I think you want __asm__ and not __asm , but I'm not really sure of the difference. David Zemon's PropWare library, which is full of inline fcache'd assembly, uses __asm__ volatile, so I guess that's what you want.

    I thought it was an extreme outside chance that the compiler would know the asm alias for compiling and not for disassembling but I was wrong. __asm does "fix" the problem in that I no longer get errors. I use "fix" because the barrier does not work.

    Here is the macro that compiles:
    #define COMPILER_BARRIER()   __asm volatile("":::"memory")
    

    Here is the end of my loop where I use it:
            // Record the actual voltage/dacval we programmed above.
            // We want that so we will have a consistent data set for p_calcLoad next time around the loop.
            calcLoad_vDes=vDes;
            calcLoad_vDesShortProt = vDesShortProt;
            calcLoad_rawDacVal32 = rawDacVal32;
    
            // Increment counters.
            loopIndex++;
            status->stimLoopIndex=loopIndex;
            nextCycleTime+=DBUS_LOOP_TIME;
            status->checkSum=CalcStimStatusCheckSum(status);
    
            // Record total loop time.
            COMPILER_BARRIER();
            status->stimLoopTime=CNT-ct;
        }  // while (1)
    

    You can imagine of course that ct is set to CNT at the top of the loop.

    Here is the assembly listing of the generated code:
     341:cpx_stimdriver.cogc ****         COMPILER_BARRIER();
     627              		.loc 2 341 0
     342:cpx_stimdriver.cogc ****         status->stimLoopTime=CNT-ct;
     628              		.loc 2 342 0
     629 0514 0000BCA0 		mov	r7, CNT
     630 0518 0000BCA0 		mov	r5, r14
     631 051c 0000BC84 		sub	r7, r8
     632 0520 1800FC80 		add	r5, #24
     633 0524 1800FCA0 		mov	r4, #24
     634 0528 0000BC80 		add	r4, sp
     337:cpx_stimdriver.cogc ****         nextCycleTime+=DBUS_LOOP_TIME;
     635              		.loc 2 337 0
     636 052c 0000BCA0 		mov	r8, r13
     637              	.LVL65
     332:cpx_stimdriver.cogc ****         calcLoad_rawDacVal32 = rawDacVal32;
     638              		.loc 2 332 0
     639 0530 0000BCA0 		mov	r10, r12
     342:cpx_stimdriver.cogc ****         status->stimLoopTime=CNT-ct;
     640              		.loc 2 342 0
     641 0534 00003C08 		wrlong	r7, r5
     331:cpx_stimdriver.cogc ****         calcLoad_vDesShortProt = vDesShortProt;
     642              		.loc 2 331 0
     643 0538 0800FCA0 		mov	r5, #8
     644 053c 0000BC80 		add	r5, sp
     645 0540 00003C08 		wrlong	r11, r5
     211:cpx_stimdriver.cogc **** /*<*/   vDes=src->desiredV;
     646              		.loc 2 211 0
     647 0544 0400FCA0 		mov	r7, #4
     648 0548 1C00FCA0 		mov	r5, #28
     649 054c 0000BC08 		rdlong	r6, sp
     650 0550 0000BC80 		add	r7, sp
     651 0554 0000BC80 		add	r5, sp
     652 0558 00003C08 		wrlong	r6, r7
     653 055c 0000BC08 		rdlong	r1, r4
     654 0560 0000BC08 		rdlong	lr, r5
     655 0564 00007C5C 		jmp	#.L12
    

    You can see that CNT is recorded into a register on line 629. The difference and the address of stimLoopTime are calculated immediately afterward (but it is not saved until line 641). Then there is a pile of code from before the barrier.

    What I expect is that lines 629-632 and 641 should be all together immediately before line 655. By my count, my loop time is 15 instructions shy of what it should be. This is not that big a deal because I have more slop in my downstream timings than 15 instructions. But I don't think that anything in principal prevents those 15 instructions from being 50 or 100. What if those instructions are a loop? That would be bad.

    Any thoughts anyone?

    All the best,

    Tom

  • I think the problem is that the barrier is only a memory barrier, so the compiler feels free to move instructions that don't touch memory.

    I've had luck with declaring a function and having that return CNT, and then using the function everywhere in place of CNT:
    unsigned getcnt() { return CNT; }
    ...
    loopTime = getcnt() - lastCnt;
    
    This seems to discourage code motion, I think because the function call acts as a sequence point.

    Eric
  • TomUdaleTomUdale Posts: 75
    edited 2016-05-06 17:55
    Hi Eric,
    ersmith wrote: »
    I think the problem is that the barrier is only a memory barrier, so the compiler feels free to move instructions that don't touch memory.

    I've had luck with declaring a function and having that return CNT, and then using the function everywhere in place of CNT:
    unsigned getcnt() { return CNT; }
    ...
    loopTime = getcnt() - lastCnt;
    
    This seems to discourage code motion, I think because the function call acts as a sequence point.

    And I gather then that there is no __asm volatile("":::"code|memory") or whatever? It is very strange that this is not working because I saw this asm volatile code on several different blogs (e.g. http://preshing.com/20120625/memory-ordering-at-compile-time/) and this sort of movement was exactly the kind of thing that was being discussed. Indeed it seems somewhat impossible to write lock free code with GCC if this barrier only prevents movement of the memory access and not that of the subsequent register handling.

    Anyway, I will try your getcnt() suggestion. I had thought of that before but did not even try because the ability of the gcc optimizer to unwind tricks like this is staggering. I thought for sure it would see through that function, inline it straight away and then apply the same optimizations as it does with the straight line code.

    Thanks for the suggestion.

    Best,

    Tom



  • Other processors have out-of-order execution and caches and other fun that requires those processors to have actual memory barrier instructions that actually get run by the processor. The Propeller has none of this, which might be why it isn't working.


    GCC has a noinline attribute, if you're afraid of it inlining getcnt():
    unsigned __attribute__ ((noinline)) getcnt() { return CNT; }
    
  • Other processors have out-of-order execution and caches and other fun that requires those processors to have actual memory barrier instructions that actually get run by the processor. The Propeller has none of this, which might be why it isn't working.

    The GCC documentation explicitly states that using the memory "clobberer" creates a compiler memory barrier:

    https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers

    While it might conceivably be a bug in the propgcc implementation, I think the likely problem here is that we are not contending with a memory barrier issue at all but rather, as Eric said, it is a code reordering issue.
    GCC has a noinline attribute, if you're afraid of it inlining getcnt():
    unsigned __attribute__ ((noinline)) getcnt() { return CNT; }
    

    Thanks but unfortunately, it does not help. Here is the new code with the getcnt function:
            // Record the actual voltage/dacval we programmed above.
            // We want that so we will have a consistent data set for p_calcLoad next time around the loop.
            calcLoad_vDes=vDes;
            calcLoad_vDesShortProt = vDesShortProt;
            calcLoad_rawDacVal32 = rawDacVal32;
    
            // Increment counters.
            loopIndex++;
            status->stimLoopIndex=loopIndex;
            nextCycleTime+=DBUS_LOOP_TIME;
            status->checkSum=CalcStimStatusCheckSum(status);
    
            // Record total loop time.
            //COMPILER_BARRIER();
            status->stimLoopTime=getcnt()-ct;
        }  // while (1)
    

    Here is what you get with a bare getcnt definition:
     356:cpx_stimdriver.cogc ****         status->stimLoopTime=getcnt()-ct;
     638              		.loc 2 356 0
     639 051c 0000BCA0 		mov	r5, r14
     640 0520 1800FC80 		add	r5, #24
     641 0524 1800FCA0 		mov	r4, #24
     642 0528 0000BC80 		add	r4, sp
     352:cpx_stimdriver.cogc ****         status->checkSum=CalcStimStatusCheckSum(status);
     643              		.loc 2 352 0
     644 052c 00003C08 		wrlong	r0, r7
     645              	.LBB18
     646              	.LBB19
     157:cpx_stimdriver.cogc ****   return CNT;
     647              		.loc 2 157 0
     648 0530 0000BCA0 		mov	r7, CNT
     649              	.LBE19
     650              	.LBE18
     356:cpx_stimdriver.cogc ****         status->stimLoopTime=getcnt()-ct;
     651              		.loc 2 356 0
     652 0534 0000BC84 		sub	r7, r8
     653 0538 00003C08 		wrlong	r7, r5
     345:cpx_stimdriver.cogc ****         calcLoad_vDesShortProt = vDesShortProt;
     654              		.loc 2 345 0
     655 053c 0800FCA0 		mov	r5, #8
     656 0540 0000BC80 		add	r5, sp
     657 0544 00003C08 		wrlong	r11, r5
     225:cpx_stimdriver.cogc **** /*<*/   vDes=src->desiredV;
     658              		.loc 2 225 0
     659 0548 0400FCA0 		mov	r7, #4
     660 054c 1C00FCA0 		mov	r5, #28
     661 0550 0000BC08 		rdlong	r6, sp
     662 0554 0000BC80 		add	r7, sp
     663 0558 0000BC80 		add	r5, sp
     351:cpx_stimdriver.cogc ****         nextCycleTime+=DBUS_LOOP_TIME;
     664              		.loc 2 351 0
     665 055c 0000BCA0 		mov	r8, r13
     666              	.LVL65
     225:cpx_stimdriver.cogc **** /*<*/   vDes=src->desiredV;
     667              		.loc 2 225 0
     668 0560 00003C08 		wrlong	r6, r7
     669 0564 0000BC08 		rdlong	r1, r4
     670 0568 0000BC08 		rdlong	lr, r5
     671 056c 00007C5C 		jmp	#.L13
    
    You can see the function is indeed inlined at a very low level. All the remains is the return CNT;

    Now here is what you get if you noinline getcnt:
     356:cpx_stimdriver.cogc ****         status->stimLoopTime=getcnt()-ct;
     634              		.loc 2 356 0
     635 0528 0000FC5C 		jmpret	lr,#_getcnt
     636 052c 0000BCA0 		mov	r7, r0
     637 0530 0000BCA0 		mov	r4, r14
     638 0534 0000BC84 		sub	r7, r8
     639 0538 1800FC80 		add	r4, #24
     225:cpx_stimdriver.cogc **** /*<*/   vDes=src->desiredV;
     640              		.loc 2 225 0
     641 053c 1000FCA0 		mov	r5, #16
     642 0540 0000BC80 		add	r5, sp
     351:cpx_stimdriver.cogc ****         nextCycleTime+=DBUS_LOOP_TIME;
     643              		.loc 2 351 0
     644 0544 0000BCA0 		mov	r8, r13
     645              	.LVL59
     346:cpx_stimdriver.cogc ****         calcLoad_rawDacVal32 = rawDacVal32;
     646              		.loc 2 346 0
     647 0548 0000BCA0 		mov	lr, r12
     356:cpx_stimdriver.cogc ****         status->stimLoopTime=getcnt()-ct;
     648              		.loc 2 356 0
     649 054c 00003C08 		wrlong	r7, r4
     345:cpx_stimdriver.cogc ****         calcLoad_vDesShortProt = vDesShortProt;
     650              		.loc 2 345 0
     651 0550 0800FCA0 		mov	r7, #8
     652 0554 0000BC80 		add	r7, sp
     225:cpx_stimdriver.cogc **** /*<*/   vDes=src->desiredV;
     653              		.loc 2 225 0
     654 0558 0000BC08 		rdlong	r4, sp
     345:cpx_stimdriver.cogc ****         calcLoad_vDesShortProt = vDesShortProt;
     655              		.loc 2 345 0
     656 055c 00003C08 		wrlong	r10, r7
     225:cpx_stimdriver.cogc **** /*<*/   vDes=src->desiredV;
     657              		.loc 2 225 0
     658 0560 00003C08 		wrlong	r4, r5
     659 0564 00007C5C 		jmp	#.L13
    
    Indeed it does not inline getcnt. But it changes nothing. Combining getcnt with the memory barrier is also completely ineffective.

    I think the issue is that all the statements in the code sequence before my getcnt() line are almost completely independent of each other. There are no data dependencies that might enforce some order. So GCC in its wisdom just decides to reorder them based on who knows what.

    The only thing I can think to do now is either put everything except the timing calls into a big function so I can enforce the order or to see if there is some specific optimiziation that can be turned off (not just going from Os to O0, but rather turning off "statement-reorder" or some other hopefully existing thing). I know that Os is actually shorthand for 20 different specific things so perhaps one is the one that causes this top level reordering.

    I dare say that GCC is too clever by half for microcontrollers.

    All the best,

    Tom

Sign In or Register to comment.