Surprising propgcc code generation.
TomUdale
Posts: 75
in Propeller 1
Hi All,
I am just putting the finishing touches on the firmware for a commercial Propeller application using propgcc. We are very tight on memory in several of our cogc cogs, so I was messing around trying to optimize things. We normally pass pointers to data for each cog via the PAR pointer in cogstart rather than access global variables because, well, that is how I think - you never know when you might want two cogs running the same routine. Of course, in the end all the cogs here are single instances, so I figured the compiler could probably do better looking at global data rather than having to load from a pointer all the time, since it can (in principle) precalculate the exact address of the data it is reading (this logic is based on PC programming, not my vast - ahem - prop knowledge).
Well, I discovered I was completely wrong. Not only is reading from global variables not smaller than reading via a passed pointer, it is actually quite a bit larger. After poking at the assembly for a while, I discovered that the global variable seems to disable a very reasonable optimization.
Given a function call like this:
foo(pointer->a,pointer->b,pointer->c);
where pointer points to a struct with byte members a,b and c,
the compiler generates something like this:
    mov     r7,r12    ;save address of pointer (previously saved in r12)
    rdbyte  r0,r7     ;move a into r0
    add     r7,#1     ;point r7 to b
    rdbyte  r1,r7     ;move b into r1
    add     r7,#1     ;point r7 to c
    rdbyte  r2,r7     ;move c into r2
    call    foo       ;call foo (which I guess takes parameters in registers r0, r1...)
Now, if I simply change the code to be
foo(globalVar.a,globalVar.b,globalVar.c);
where globalVar is the name of the variable pointed to by "pointer" in the first example, then we get this:
    mov     r7,r12    ;save address of pointer (previously saved in r12)
    rdbyte  r0,r7     ;move a into r0
    mov     r7,r12    ;save address of pointer (previously saved in r12)
    add     r7,#1     ;point r7 to b
    rdbyte  r1,r7     ;move b into r1
    mov     r7,r12    ;save address of pointer (previously saved in r12)
    add     r7,#2     ;point r7 to c
    rdbyte  r2,r7     ;move c into r2
    call    foo       ;call foo
You can see that the nice optimization that simply increments r7 to point from one member of the struct to the next is replaced by code that reloads and explicitly calculates the offset of each member every time.
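For anyone who wants to reproduce the comparison, here is a minimal sketch of the two cases in C. The struct name `Settings`, the initial values, and the body of `foo` are hypothetical stand-ins (the post only says the struct has byte members a, b and c). The third function is an experiment rather than a known fix: it takes the address of the global into a local pointer, and whether that changes the generated code will depend on the compiler version and flags.

```c
#include <stdint.h>

/* Hypothetical struct: the post only states that it has
   byte members a, b and c. */
typedef struct {
    uint8_t a;
    uint8_t b;
    uint8_t c;
} Settings;

/* Hypothetical global standing in for "globalVar" in the post. */
Settings globalVar = { 1, 2, 3 };

/* Stand-in for foo(); the real function is not shown in the post. */
int foo(uint8_t a, uint8_t b, uint8_t c)
{
    return a + b + c;
}

/* Case 1: members read through a pointer parameter,
   as when the address arrives via PAR. */
int call_via_pointer(const Settings *pointer)
{
    return foo(pointer->a, pointer->b, pointer->c);
}

/* Case 2: members read from the global by name. */
int call_via_global(void)
{
    return foo(globalVar.a, globalVar.b, globalVar.c);
}

/* Experiment: take the global's address into a local pointer first.
   The compiler may well see through this, so check the assembly. */
int call_via_local_pointer(void)
{
    const Settings *p = &globalVar;
    return foo(p->a, p->b, p->c);
}
```

All three functions compute the same result, so any difference is purely in code size. Compiling with something like `propeller-elf-gcc -Os -mcog -S` and reading the emitted assembly for each function should show whether the incrementing-pointer pattern appears.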
Now, given the way that COGC is implemented, I am not completely surprised that you could not achieve my hoped-for optimal result of
    rdbyte  r0,#456   ;move a into r0
    rdbyte  r1,#457   ;move b into r1
    rdbyte  r2,#458   ;move c into r2
    call    foo       ;call foo
as I expect that would require quite some linker trickery to resolve the final address of the struct, but I certainly did not expect things to get worse.
This is obviously a very minor point and I only bring it up because the compiler is in general so excellent in terms of optimizations that this really stood out to me. By and large, the source level optimizations I have been trying have already been performed by the compiler and don't get me anything. But I figured you might want to know about this (if not by design) so maybe you can see why there is the divergence in the optimization results for the two cases.
All the best,
Tom
Comments
As for your ideal dream of referencing the exact memory address, not only is it difficult, it's not always possible. The source and destination fields of a Propeller Assembly instruction are each only 9 bits wide, so they cannot directly address most of HUB memory.
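To put numbers on the 9-bit limit, here is a small sketch in C. The function name `fits_in_source_field` and the macros are made up for illustration. A 9-bit field can hold values 0 through 511, while the Propeller 1 has 32 KB of hub RAM, so only a struct that happens to sit in the first 512 bytes could be reached with an immediate address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Width of the source/destination field, as stated in the comment above. */
#define FIELD_BITS 9u

/* Largest value a 9-bit field can encode: 2^9 - 1 = 511. */
#define FIELD_MAX ((1u << FIELD_BITS) - 1u)

/* Hypothetical helper: true if a hub address could be written
   directly into a 9-bit immediate field. */
bool fits_in_source_field(uint32_t hub_address)
{
    return hub_address <= FIELD_MAX;
}
```

The addresses 456 to 458 used in the wished-for listing would fit, but that is only because the example picked small numbers. Wherever the linker places the struct above address 511, the address has to be loaded into a register first.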
Thanks. I have been lazily just getting whatever comes with SimpleIDE but I will check to see what the most recent version does for me.
Ah, it is worse than I thought. Well, it was indeed just a dream.
All the best,
Tom
And here are some official docs on fcache (I just wrote some preprocessor macros to make it a little easier, and to make the same code compatible when compiled in COG or LMM/CMM mode): https://sites.google.com/site/propellergcc/documentation/faq#TOC-Q:-How-do-I-put-an-inline-assembly-loop-into-the-fcache-
And a description of fcache at the bottom of this page: http://propgcc.googlecode.com/hg/doc/Memory.html
Thanks for your post. It is interesting, but perhaps not as surprising as one might think at first. In general I think modern compilers will be able to optimize expressions involving local variables (including parameters) better than ones involving global variables. The reason is that globals can be changed in all sorts of ways beyond the compiler's control (other threads may be writing to them, or pointer aliasing may cause unexpected updates). In this particular case the compiler probably could have produced the same code for both, but usually the local variable / parameter will do better.
Regards,
Eric
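The aliasing point can be illustrated with a short C sketch. The names `shared` and `read_store_read` are hypothetical. The function reads a global, stores through a pointer it knows nothing about, then reads the global again. Because the pointer might refer to the global itself, the compiler cannot simply reuse the first value and has to reload it.

```c
#include <stdint.h>

/* Hypothetical global used only for this illustration. */
uint8_t shared = 10;

/* Reads the global, writes through an unknown pointer, then reads
   the global again. If out points at shared, the second read sees
   the new value, so the compiler must not reuse the first one. */
int read_store_read(uint8_t *out)
{
    int first = shared;
    *out = 99;            /* may or may not modify shared */
    int second = shared;  /* has to be reloaded after the store */
    return first + second;
}
```

Called with the address of an unrelated variable, the function returns 20 (10 + 10). Called with `&shared`, it returns 109 (10 + 99). The compiler has to generate code that is correct for both calls, which is one reason accesses to globals can end up more conservative than accesses to locals.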
I see. I gather then that you all are not in control of the code paths that would be making those optimization decisions.
It is still interesting to me because I would think that the data pointed to by a local pointer to an unknown object would be subject to at least as many caveats and surprises as the global name. Certainly I could see a huge difference for a local object, but the local pointer strikes me as very similar to the global name (which I think is basically just an address in the end anyway).
But it is all moot if you are at the mercy of the gcc guys on this anyway.
All the best,
Tom