"Interesting" problem in C: data pointers
ImageCraft
Posts: 348
Here's an interesting problem we are facing with Runtime Architecture design for the ImageCraft C. Quick summary: C is compiled into large memory model where the COG is used for executing instructions in the HUB. It's native Propeller code so the overhead is ~4x of pure asm and not 80x ala SPIN.
Anyway, to make code to run as fast as possible, the initial design was to use COG RAM for registers and local stack, e.g. your local variables will either be allocated as "registers," or as locations in a software stack. Global/static variables go to HUB RAM.
Unfortunately, there is an address aliasing problem. Given a C pointer, e.g. int *p, *p is either a MOV if "p" points to COG RAM, or rdlong/wrlong if it points to HUB RAM!
So possible solutions are:
a) use COG RAM only for registers and HUB RAM for stack and global/static. Disadvantages: slower execution and not utilizing COG RAM to the fullest.
b) at runtime, somehow distinguish pointer type. For example, we can tag HUB RAM addresses and a pointer access would involve discovering the pointer type at runtime. Disadvantages: may not be any faster than a) when all said and done!
The current preferred solution, barring further analysis saying otherwise, is to use a pseudo sliding register window scheme, basically a variation of a) to possibly gain back the loss of speed:
Instead of a small fixed set of virtual registers (e.g. original design called for 6) that are allocated to variables and the rest uses the stack, make the number of virtual registers larger (e.g. up to 16) so most local variables would just end up in registers (*). At function prolog, all the live registers are copied / slided down by the amount that the function uses. At function epilog, the registers are slided up. While this may look like slower than stacking only the ones that are used by the function onto a stack, this should end up being faster since a) Propeller doesn't have stacking instructions, so stacking in COG RAM is slow and stack in HUB RAM is even slower, and b) it's a tight copy loop where the execution is 2-3 instructions per live register. Stack then solely lives on HUB RAM, but should rarely be used. *p is always a rd/wr-long (or word/byte).
While the linker will not detect register window overflow, we should be able to provide a tool that detects such situation after you build your project.
(*) We also have advanced register allocator anyway where the compiler packs as many variables into the same register as long as their uses do not overlap.
Anyway, to make code to run as fast as possible, the initial design was to use COG RAM for registers and local stack, e.g. your local variables will either be allocated as "registers," or as locations in a software stack. Global/static variables go to HUB RAM.
Unfortunately, there is an address aliasing problem. Given a C pointer, e.g. int *p, *p is either a MOV if "p" points to COG RAM, or rdlong/wrlong if it points to HUB RAM!
So possible solutions are:
a) use COG RAM only for registers and HUB RAM for stack and global/static. Disadvantages: slower execution and not utilizing COG RAM to the fullest.
b) at runtime, somehow distinguish pointer type. For example, we can tag HUB RAM addresses and a pointer access would involve discovering the pointer type at runtime. Disadvantages: may not be any faster than a) when all said and done!
The current preferred solution, barring further analysis saying otherwise, is to use a pseudo sliding register window scheme, basically a variation of a) to possibly gain back the loss of speed:
Instead of a small fixed set of virtual registers (e.g. original design called for 6) that are allocated to variables and the rest uses the stack, make the number of virtual registers larger (e.g. up to 16) so most local variables would just end up in registers (*). At function prolog, all the live registers are copied / slided down by the amount that the function uses. At function epilog, the registers are slided up. While this may look like slower than stacking only the ones that are used by the function onto a stack, this should end up being faster since a) Propeller doesn't have stacking instructions, so stacking in COG RAM is slow and stack in HUB RAM is even slower, and b) it's a tight copy loop where the execution is 2-3 instructions per live register. Stack then solely lives on HUB RAM, but should rarely be used. *p is always a rd/wr-long (or word/byte).
While the linker will not detect register window overflow, we should be able to provide a tool that detects such situation after you build your project.
(*) We also have advanced register allocator anyway where the compiler packs as many variables into the same register as long as their uses do not overlap.
Comments
You would need to be able to specify where variables get allocated so that statically allocated objects can have the type of pointer you want, but it sounds like you would have that already.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Post Edited (Paul Baker (Parallax)) : 11/30/2007 9:57:59 PM GMT
-Phil
This is listed as option b) in my post
Thanks for the comments!
This method will likely work in the prop2. I think one can assume that the prop2's proportions will stay quite similar. The hub capacity may even get much bigger. It makes no sense to provide a juiced up propeller with a hub memory less than the current prop's (1024 longs per cog).
Personally, I'd do something like follows:
- use the 'register' prefix for cog variables
- use program flow analysis to find "leaf" functions, and for leaf functions, use cog memory for local vars
- for recursive routines, flush "local" cog vars of parents to stack on entry to function, restore on exit, inbetween use cog vars
- keep stack in the hub
Hope this helps [noparse]:)[/noparse]
Bill
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers
Our compilers are much smarter than that Thanks though!!!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers
COG RAM:
4 parameter registers
7 variables registers
5 expression evaluation registers
Why these numbers? From all the compilers we have done, these seem to give fairly good results for most programs. As I said before, the compiler uses advanced register allocation technique to pack as many variables as possible to the "variable registers" so most function variables will end up there.
COG RAM also has:
60-80 longs of internal stack
ISP internal stack pointer
ESP external stack pointer
The internal stack is for local variables that do not have their addresses taken and do not get allocated to registers and for saving / restoring variable registers.
HUB RAM:
External stack
The external stack is for local variables that have their addresses taken.
Variable argument functions like printf would be "interesting."
An alternative plan is to have the linker to do call tree analysis and forgo the whole register/internal stack and just allocates local variables to COG RAM addresses. This is what is needed for something like th 8051 or PIC. However, to support function pointers and recursion, a stack must be used anyway, so we *may* switch to this in later releases, but will stick with the register/internal stack for the First Release.
// richard
Though that did leave me wondering: what if the function is called from two different cogs? If the function and all of its local variables are entirely in cog RAM, again, not an issue. If they're in the shared hub RAM, though...
My hat's off to you. I've dabbled with compilers a bit, and I can see how implementing even a straightforward (no-frills optimizations) C compiler for the prop is difficult.
Ah, here's where the Large Memory Model helps. With just Propeller asm, I don't think (I could be wrong) one can implement function pointers or reentrant functions at all. However, withe LMM virtual machine, we will be doing @FCALL with does store the virtual PC. The programmers still have to worry about about HUB RAM variables sharing, but that's just an artifact of having 8 CPUs at your disposal. Everything comes with a price!
// richard
the stack. Any local variables/arguments that no one takes the address of can be assigned
a COG register, and the usual register allocation/save/clear should be used. So that's your
(a) option.
This will make function calls not *quite* as fast, but inner loops of leaf functions should be
quite fast (at least for those variables no one takes addresses of).
Frankly, I think we want a C compiler that is as functional and simple as possible; since we
have 8 cogs and they run at 20MIPs each now (and if the Prop 2 comes out we'll get even
more speed), I don't think execution speed is the most important thing. I think just being
able to program in C (and use as much preexisting C code as possible) is the most important
thing.
If you guys want to do a fancy sliding window thing later with an optimization phase, sounds
great, but I would have to think the simplest possible solution wins for now.
Consider where we are at now. Spin is great, but it's a whole new language and is (just
because it is interpreted) quite slow. Assembly is great, but has a steep learning curve.
Thus, anyone *considering* the prop at all has to learn a whole new language. Your C
compiler, even if it's 10X slower than native assembly, is *still* huge amounts faster than
Spin, and much more approachable than assembly.
And if people know to not write tiny tiny subroutines that don't do anything, but instead
have reasonable inner loops, you're not really losing anything.
Introducing a baroque pointer model will just be a source of bugs and confusion for a
speed gain that might not turn out to be that much after all (except for those people
doing "fib" benchmarks).
If you look at a few responses above, we have come to the same conclusion