"Interesting" problem in C: data pointers

ImageCraft · 2007-11-30 20:59

Here's an interesting problem we are facing with Runtime Architecture design for the ImageCraft C. Quick summary: C is compiled into large memory model where the COG is used for executing instructions in the HUB. It's native Propeller code so the overhead is ~4x of pure asm and not 80x ala SPIN.

Anyway, to make code to run as fast as possible, the initial design was to use COG RAM for registers and local stack, e.g. your local variables will either be allocated as "registers," or as locations in a software stack. Global/static variables go to HUB RAM.

Unfortunately, there is an address aliasing problem. Given a C pointer, e.g. int *p, *p is either a MOV if "p" points to COG RAM, or rdlong/wrlong if it points to HUB RAM!

So possible solutions are:
a) use COG RAM only for registers and HUB RAM for stack and global/static. Disadvantages: slower execution and not utilizing COG RAM to the fullest.
b) at runtime, somehow distinguish pointer type. For example, we can tag HUB RAM addresses and a pointer access would involve discovering the pointer type at runtime. Disadvantages: may not be any faster than a) when all said and done!

The current preferred solution, barring further analysis saying otherwise, is to use a pseudo sliding register window scheme, basically a variation of a) to possibly gain back the loss of speed:
Instead of a small fixed set of virtual registers (e.g. original design called for 6) that are allocated to variables and the rest uses the stack, make the number of virtual registers larger (e.g. up to 16) so most local variables would just end up in registers (*). At function prolog, all the live registers are copied / slided down by the amount that the function uses. At function epilog, the registers are slided up. While this may look like slower than stacking only the ones that are used by the function onto a stack, this should end up being faster since a) Propeller doesn't have stacking instructions, so stacking in COG RAM is slow and stack in HUB RAM is even slower, and b) it's a tight copy loop where the execution is 2-3 instructions per live register. Stack then solely lives on HUB RAM, but should rarely be used. *p is always a rd/wr-long (or word/byte).

While the linker will not detect register window overflow, we should be able to provide a tool that detects such situation after you build your project.

(*) We also have advanced register allocator anyway where the compiler packs as many variables into the same register as long as their uses do not overlap.

Mike Green · 2007-11-30 21:08

Another option would be to have two classes of pointers, those that point into cog memory and those that point into main memory with the default being pointers to main memory. The model would be that of 16-bit and 32-bit pointers in the 8086 architecture, but, in this case, the sizes would be the same. The types and generated code would be different.

You would need to be able to specify where variables get allocated so that statically allocated objects can have the type of pointer you want, but it sounds like you would have that already.

Paul Baker · 2007-11-30 21:47

If I remember correctly some variations of C have a·register directive; while the context is slightly·different, it should be adaptable for this situation. It would provide a convenient method for the programmer to choose which variables are best stored in cog memory (any pointer to a local variable will use the MOV method). If there are unused local registers, then the compiler can implement some means for determining which variables are placed in local registers (they would be elevated to register status and any pointers to them would use MOV). Something interesting to note, pointers to local variables would only need 3 or 4 bits depending on how many local variables are implemented, this is a situation ripe for packing multiple pointers into a single memory location (though it may present too much of a performance hit to implement).

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Post Edited (Paul Baker (Parallax)) : 11/30/2007 9:57:59 PM GMT

Phil Pilgrim (PhiPi) · 2007-11-30 21:57

Unless you want to support pointers into hub ROM as well as hub RAM, why not just use bit 15 of the pointer to disambiguate which address space it points into? Dereferencing would then be done via a subroutine that checks this bit and retrieves the data appropriately.

-Phil

ImageCraft · 2007-12-01 01:16

Phil Pilgrim (PhiPi) said...
Unless you want to support pointers into hub ROM as well as hub RAM, why not just use bit 15 of the pointer to disambiguate which address space it points into? Dereferencing would then be done via a subroutine that checks this bit and retrieves the data appropriately.

-Phil

This is listed as option b) in my post

ImageCraft · 2007-12-01 01:24

One post to answer both Mike and Paul: first of all, our compilers do pack multiple variables into "registers" already, as long as their uses do not overlap. Second, yes, I can introduce a __cogram or __hubram modifier, sort of similar to the __flash we just added to the AVR compiler to indicate the separate flash address space (we have been overloading "const" to mean that before we added this keyword), so this should work too. I would like to avoid that sort of programmer specified optimization if possible though as decision made for Propeller V1 may not be applicable to V2. If the compiler and runtime makes the decision, then the same program compiled for different Propellers can get version specific optimizations done without source code changes.

Thanks for the comments!

Fred Hawkins · 2007-12-01 04:36

For the current prop my preference is to assume all addresses that fit in a cog are cog addresses, and those greater the cog's memory size are hub accesses. For a forth this is a simple distinction that can be made at compile time. For those accesses that are contrary to this method, merely define hub@ and hub! words.

This method will likely work in the prop2. I think one can assume that the prop2's proportions will stay quite similar. The hub capacity may even get much bigger. It makes no sense to provide a juiced up propeller with a hub memory less than the current prop's (1024 longs per cog).

Bill Henning · 2007-12-01 05:03

Hi Richard,

Personally, I'd do something like follows:

- use the 'register' prefix for cog variables
- use program flow analysis to find "leaf" functions, and for leaf functions, use cog memory for local vars
- for recursive routines, flush "local" cog vars of parents to stack on entry to function, restore on exit, inbetween use cog vars
- keep stack in the hub

Hope this helps [noparse]:)[/noparse]

Bill

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers

ImageCraft · 2007-12-01 06:07

Bill Henning said...
Hi Richard,

Personally, I'd do something like follows:

- use the 'register' prefix for cog variables
- use program flow analysis to find "leaf" functions, and for leaf functions, use cog memory for local vars
- for recursive routines, flush "local" cog vars of parents to stack on entry to function, restore on exit, inbetween use cog vars
- keep stack in the hub

Hope this helps [noparse]:)[/noparse]

Bill

Our compilers are much smarter than that

Thanks though!!!

Bill Henning · 2007-12-01 06:34

Glad to hear that [noparse]:)[/noparse]

ImageCraft said...
Our compilers are much smarter than that Thanks though!!!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers

ImageCraft · 2007-12-05 06:36

Well, the sliding register window is just too clever, and probably not worth the cost

So the current design calls for

COG RAM:
4 parameter registers
7 variables registers
5 expression evaluation registers

Why these numbers? From all the compilers we have done, these seem to give fairly good results for most programs. As I said before, the compiler uses advanced register allocation technique to pack as many variables as possible to the "variable registers" so most function variables will end up there.

COG RAM also has:
60-80 longs of internal stack
ISP internal stack pointer
ESP external stack pointer
The internal stack is for local variables that do not have their addresses taken and do not get allocated to registers and for saving / restoring variable registers.

HUB RAM:
External stack
The external stack is for local variables that have their addresses taken.

Variable argument functions like printf would be "interesting."

An alternative plan is to have the linker to do call tree analysis and forgo the whole register/internal stack and just allocates local variables to COG RAM addresses. This is what is needed for something like th 8051 or PIC. However, to support function pointers and recursion, a stack must be used anyway, so we *may* switch to this in later releases, but will stick with the register/internal stack for the First Release.

// richard

David Cuthbert · 2007-12-05 08:00

ImageCraft said...
However, to support function pointers and recursion, a stack must be used anyway, so we *may* switch to this in later releases, but will stick with the register/internal stack for the First Release.

I was wondering about this. From what I've seen in the Spin examples provided with the Prop tool, many of the assembly functions appear to rely heavily on self-modifying code. Quite interesting, but also quite non-reentrant. Obviously not an issue if there's no way to reenter the function itself...

Though that did leave me wondering: what if the function is called from two different cogs? If the function and all of its local variables are entirely in cog RAM, again, not an issue. If they're in the shared hub RAM, though...

My hat's off to you. I've dabbled with compilers a bit, and I can see how implementing even a straightforward (no-frills optimizations) C compiler for the prop is difficult.

ImageCraft · 2007-12-05 22:06

David Cuthbert said...
...
I was wondering about this. From what I've seen in the Spin examples provided with the Prop tool, many of the assembly functions appear to rely heavily on self-modifying code. Quite interesting, but also quite non-reentrant. Obviously not an issue if there's no way to reenter the function itself...

Though that did leave me wondering: what if the function is called from two different cogs? If the function and all of its local variables are entirely in cog RAM, again, not an issue. If they're in the shared hub RAM, though...

My hat's off to you. I've dabbled with compilers a bit, and I can see how implementing even a straightforward (no-frills optimizations) C compiler for the prop is difficult.

Ah, here's where the Large Memory Model helps. With just Propeller asm, I don't think (I could be wrong) one can implement function pointers or reentrant functions at all. However, withe LMM virtual machine, we will be doing @FCALL with does store the virtual PC. The programmers still have to worry about about HUB RAM variables sharing, but that's just an artifact of having 8 CPUs at your disposal. Everything comes with a price!

// richard

rokicki · 2007-12-05 23:17

I think the simplest approach should be used. In this case that would be to use main RAM for
the stack. Any local variables/arguments that no one takes the address of can be assigned
a COG register, and the usual register allocation/save/clear should be used. So that's your
(a) option.

This will make function calls not *quite* as fast, but inner loops of leaf functions should be
quite fast (at least for those variables no one takes addresses of).

Frankly, I think we want a C compiler that is as functional and simple as possible; since we
have 8 cogs and they run at 20MIPs each now (and if the Prop 2 comes out we'll get even
more speed), I don't think execution speed is the most important thing. I think just being
able to program in C (and use as much preexisting C code as possible) is the most important
thing.

If you guys want to do a fancy sliding window thing later with an optimization phase, sounds
great, but I would have to think the simplest possible solution wins for now.

Consider where we are at now. Spin is great, but it's a whole new language and is (just
because it is interpreted) quite slow. Assembly is great, but has a steep learning curve.
Thus, anyone *considering* the prop at all has to learn a whole new language. Your C
compiler, even if it's 10X slower than native assembly, is *still* huge amounts faster than
Spin, and much more approachable than assembly.

And if people know to not write tiny tiny subroutines that don't do anything, but instead
have reasonable inner loops, you're not really losing anything.

Introducing a baroque pointer model will just be a source of bugs and confusion for a
speed gain that might not turn out to be that much after all (except for those people
doing "fib" benchmarks).

ImageCraft · 2007-12-05 23:39

rokicki said...
I think the simplest approach should be used. In this case that would be to use main RAM for
the stack. Any local variables/arguments that no one takes the address of can be assigned
a COG register, and the usual register allocation/save/clear should be used. So that's your
(a) option.

...

If you look at a few responses above, we have come to the same conclusion

"Interesting" problem in C: data pointers

Comments