RFC: Propeller C Runtime Architecture

ImageCraft · 2007-10-10 09:33

At first glance, I know Bill Henning and others may be disappointed. However, as mentioned below, the goal is to converge on a common large memory / multiprocessing / multithreading / OS model when it is appropriate. This proposal is a modest first draft to get Propeller C off the ground.

****
Introduction
V1 of Propeller C will use Large Memory Model only. Propeller C generated code is standalone and does not rely on SPIN.

Propeller Basics
COG - Each COG is an independent CPU. The current generation of Propeller has 8 COGs. Each COG has 2KB of RAM. COG RAM is long addressable (i.e. 512 longs). COG instructions can load/store HUB RAM, but all other instructions, including jump/call/etc., reference COG RAM only.
HUB - HUB contains 32KB of shared RAM. A COG may access a HUB location with a maximum latency of 22 cycles. HUB RAM is byte addressable.
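As a rough illustration of the two addressing schemes just described, here is a minimal C sketch. The sizes come from the text above; the helper names (cog_ram_longs, hub_long_aligned) are made up for this example and are not part of any Propeller C library.

```c
#include <assert.h>
#include <stdint.h>

enum {
    COG_RAM_BYTES  = 2 * 1024,   /* 2KB per COG, as stated above */
    HUB_RAM_BYTES  = 32 * 1024,  /* 32KB of shared HUB RAM */
    BYTES_PER_LONG = 4           /* a Propeller long is 32 bits */
};

/* Number of long-addressable COG locations: 2048 / 4. */
static uint32_t cog_ram_longs(void)
{
    return COG_RAM_BYTES / BYTES_PER_LONG;
}

/* Hypothetical helper: HUB RAM is byte addressable, so this rounds a
 * byte address down to the start of the long that contains it. */
static uint32_t hub_long_aligned(uint32_t hub_byte_addr)
{
    assert(hub_byte_addr < HUB_RAM_BYTES);
    return hub_byte_addr & ~(uint32_t)(BYTES_PER_LONG - 1);
}
```

So a COG address is an index into 512 longs, while a HUB address counts bytes and may point into the middle of a long.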

To keep V1 of Propeller C simple, it purposefully does not use Bill Henning's large memory model and his usage of the COG/HUB RAM. It is expected that when Propeller C is in production and when Bill's proposed systems are stable, then a future release of Propeller C will migrate to use a VM compatible with Bill's model.

Propeller C
Propeller C IDE uses a traditional project based paradigm. A project consists of multiple C source files and optional assembly routines (generally to be executed in separate COGs). The C compiler compiles C code into assembly code. The assembler assembles the asm code into relocatable object files. The linker then links all object files together with library code.

It is expected that existing and future SPIN Objects will be rewritten in C source or released in object format.

Program Loading
By default, the program image lives in the 32KB HUB RAM. It can either be loaded from an external 32KB EEPROM or downloaded from a PC at reset. In theory, additional program storage may be loaded from other external sources such as another EEPROM.

Relocatable Loader
The current scheme does not require additional load time relocation. This may change in later releases.

Multithreading
Multithreading is not currently supported. This may change in later releases.

Multiprocessing
COGNEW and COGINIT are used for running concurrent routines in another COG. To run short (<2KB) assembler routines, a spare COG can be used. The user is responsible for managing COG usage.

Large Memory Model
To get around the 2KB limits of a COG, the COG is used to implement a Virtual Machine (VM).

COG RAM usage
000: PC
001: SP
002: @next
003: @fcall
004: @fret
005: @fcache
006: @fload
007: @fsystem
008: @debug
009-00e: reserved
00f: FP
010-07f: kernel primitives
080-0ff: FCACHE code buffer
100-10f: Virtual registers R0 to R15
110-17f: FCALL and data stack, stack grows downward
180-1cf: debug and multiprocessing support
1d0-1ef: reserved
1f0-1ff: Special Purpose Registers (used by Parallax)
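
The map above can be written down as a table and checked mechanically. The following is only a sketch: the region boundaries are copied from the list, while the struct and function names are invented for the example. The individual vector entries at 000-008 are collapsed into one region for brevity.

```c
#include <assert.h>
#include <stddef.h>

struct cog_region {
    const char *name;
    unsigned first;   /* first COG long address, inclusive */
    unsigned last;    /* last COG long address, inclusive */
};

static const struct cog_region cog_map[] = {
    { "PC, SP and kernel vectors",  0x000, 0x008 },
    { "reserved",                   0x009, 0x00e },
    { "FP",                         0x00f, 0x00f },
    { "kernel primitives",          0x010, 0x07f },
    { "FCACHE code buffer",         0x080, 0x0ff },
    { "virtual registers R0-R15",   0x100, 0x10f },
    { "FCALL and data stack",       0x110, 0x17f },
    { "debug and multiprocessing",  0x180, 0x1cf },
    { "reserved",                   0x1d0, 0x1ef },
    { "special purpose registers",  0x1f0, 0x1ff },
};

enum { COG_REGIONS = sizeof cog_map / sizeof cog_map[0] };

/* Size of region i in longs. */
static unsigned region_longs(size_t i)
{
    assert(i < COG_REGIONS);
    return cog_map[i].last - cog_map[i].first + 1;
}

/* Returns 1 if the regions are back to back and cover 000-1ff exactly. */
static int cog_map_is_contiguous(void)
{
    unsigned expect = 0x000;
    for (size_t i = 0; i < COG_REGIONS; i++) {
        if (cog_map[i].first != expect || cog_map[i].last < cog_map[i].first)
            return 0;
        expect = cog_map[i].last + 1;
    }
    return expect == 0x200;
}
```

By this arithmetic the FCACHE buffer is 128 longs and the FCALL/data stack region (110-17f) is 112 longs.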

HUB RAM usage
0000-xxxx: program image
xxxx-yyyy: global variables
yyyy-7fff: software stack, grows from top to bottom

Calling Convention
R0-R3 are used for parameter passing. R4-R10 for function variables. R11-R15 are for expression evaluation.

A function must preserve R4-R10 (if used) on the stack.

The stack is limited to 112 longs (COG locations 110-17f). Care must be taken by the user not to overrun the stack. Optionally, a slower @next kernel primitive can be linked in for stack checking.
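
To make the register roles concrete, here is a small C sketch of the convention as stated above (R0-R3 parameters, R4-R10 function variables that the callee must preserve, R11-R15 expression temporaries). The enum and function names are hypothetical, and the bitmask representation of "registers used by a function" is an assumption made for the example, not something the compiler is known to use.

```c
#include <assert.h>

enum reg_role { REG_PARAM, REG_LOCAL, REG_TEMP };

/* Classify virtual register Rn according to the calling convention. */
static enum reg_role role_of(unsigned r)
{
    assert(r <= 15);
    if (r <= 3)
        return REG_PARAM;   /* R0-R3: parameter passing */
    if (r <= 10)
        return REG_LOCAL;   /* R4-R10: function variables */
    return REG_TEMP;        /* R11-R15: expression evaluation */
}

/* How many longs a function entry would have to push, given a bitmask
 * (bit n set = Rn is used). Only R4-R10 need to be preserved. */
static unsigned prologue_save_count(unsigned used_mask)
{
    unsigned count = 0;
    for (unsigned r = 4; r <= 10; r++) {
        if (used_mask & (1u << r))
            count++;
    }
    return count;
}
```

A function that touches every register would therefore save at most 7 longs on entry.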

deSilva · 2007-10-10 14:25

Hm, I always appreciate a feasible approach.

However, I must say I shall not be the only one somewhat confused / disappointed at this highly conventional / defensive / un-innovative concept. But before shedding further tears, I should like to have two bits of relevant information, not explicitly given in the forecast:

(a) What kind of intermediate code and interpretation concept will be used?
(b) How will "inline assembly code" be handled?

Mike Green · 2007-10-10 14:54

Looks like a very respectable model for a VM for a C system. It's very similar to many microcoded processors from the 1960's and 70's. You might clarify the difference in usage between the software stack and VM stack. I assume that local variables get allocated in the software stack, what other sorts of things?

I haven't looked at your linker, but I'm assuming it's got a pretty standard set of features that would include some basic support for making overlays. In particular, that you can create multiple binary image files, direct specific groups of modules to specific files, and relocate these groups of modules starting at specific absolute addresses. This way, the user can create overlays and explicitly load them into a statically allocated global variable. It would also simplify creating assembly I/O drivers to be loaded into cogs through a temporary I/O buffer from an SD card or EEPROM. Along that line, is there a linker directive or option to explicitly set the top of software stack address? For software that does a lot of overlay stuff, it would be useful to just allocate an overlay buffer at the top of memory and this could be used for loading I/O drivers as well. In fact, the library initialization routine might use this mechanism to load I/O drivers, then load some commonly used library routines there to stay for the rest of the execution time (in a 2K area).

ImageCraft · 2007-10-10 15:00

"..highly conventional/ defensive / un-innovative concept."

??? Our goal is to provide a C compiler for the Propeller. Were you expecting something else?

Once we have a functional product, we may consider things like builtin coroutines, multiprocessing enhancements etc. We sell products that people can build commercial products with.

a) perhaps you misunderstand - the code is native Propeller code, so there is no intermediate code or interpretation concept. The compiler uses intermediate code of course as part of the compilation process, but I don't think that's what you meant.

b) inline asm will be handled. What's your question?

Clearly you have some misunderstanding of our products. Perhaps it may be better to ask for clarification before being so negative? This does not show the community in a good light.

// richard

Mike Green · 2007-10-10 15:02

deSilva,
Look at the thread on the Large Memory Model started by Bill Henning (http://forums.parallax.com/showthread.php?p=615022). I believe ImageCraft is using the earlier, simpler version of this, modified slightly. Since the C compiler produces assembly source, I assume that you can embed assembly source in your program using a pragma or something like "asm { }" and the compiler would just copy it, much like in other C compilers.
Mike

ImageCraft · 2007-10-10 15:12

Mike Green said...
Looks like a very respectable model for a VM for a C system. It's very similar to many microcoded processors from the 1960's and 70's. You might clarify the difference in usage between the software stack and VM stack. I assume that local variables get allocated in the software stack, what other sorts of things?

I haven't looked at your linker, but I'm assuming it's got a pretty standard set of features that would include some basic support for making overlays. In particular, that you can create multiple binary image files, direct specific groups of modules to specific files, and relocate these groups of modules starting at specific absolute addresses. This way, the user can create overlays and explicitly load them into a statically allocated global variable. It would also simplify creating assembly I/O drivers to be loaded into cogs through a temporary I/O buffer from an SD card or EEPROM. Along that line, is there a linker directive or option to explicitly set the top of software stack address? For software that does a lot of overlay stuff, it would be useful to just allocate an overlay buffer at the top of memory and this could be used for loading I/O drivers as well. In fact, the library initialization routine might use this mechanism to load I/O drivers, then load some commonly used library routines there to stay for the rest of the execution time (in a 2K area).

Local variables are allocated in the virtual registers first (R4-R10), then on the software stack. We have an advanced register allocator, so there is no need for users to declare variables as "register." Of course, at function entry, any VR used this way will need to be saved. Alternatively, we could use the linker to do static analysis and lay out as many local variables as possible in COG RAM, but I am not yet sure what the "best" model for multiprocessing on the Propeller ultimately is. COGNEW/COGINIT etc. can get you started, but perhaps later on a more "Parallel C" type of approach is better. An 8-core controller is kind of new, so I think there will be a lot of opportunities to play around...
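
The "registers first, then software stack" rule can be sketched in C as below. This is a deliberately naive first-come placement, not the advanced register allocator mentioned above. It assumes every local is one long and that software-stack slots are addressed by byte offset (HUB RAM being byte addressable); the names are invented for the example.

```c
#include <assert.h>

struct local_slot {
    int in_register;        /* 1 if the local lives in a virtual register */
    unsigned reg;           /* register number, valid if in_register */
    unsigned stack_offset;  /* byte offset on the software stack otherwise */
};

enum { FIRST_LOCAL_REG = 4, LAST_LOCAL_REG = 10, BYTES_PER_LONG = 4 };

/* Place the index-th local (0-based) of a function. */
static struct local_slot place_local(unsigned index)
{
    const unsigned reg_count = LAST_LOCAL_REG - FIRST_LOCAL_REG + 1; /* 7 */
    struct local_slot slot = { 0, 0, 0 };

    if (index < reg_count) {
        slot.in_register = 1;
        slot.reg = FIRST_LOCAL_REG + index;
    } else {
        slot.stack_offset = (index - reg_count) * BYTES_PER_LONG;
    }
    return slot;
}
```

With this scheme the first seven locals land in R4-R10 and the eighth becomes the first slot on the software stack.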

As for overlays and such, the support is planned, but I want to see how things pan out, especially since there is currently no standard for external storage space.

// richard

deSilva · 2007-10-10 15:21

@Richard:
Yes I TOTALLY misunderstood your posting! I could not read into it that you use native Propeller code. This is fine and obsoletes most likely my question (b)

I apologize for this extreme misunderstanding. Never happened before to this extent....
Just a tiny suggestion, perhaps you could add something to your posting so that no more people will be led astray...

And please do not blame the forum! I am absolutely untypical for this forum!

Post Edited (deSilva) : 10/10/2007 4:03:06 PM GMT

Mark Swann · 2007-10-10 15:24

This is a respectable beginning, and clearly places the Prop in the field of processors with C support.

Mike Green · 2007-10-10 15:43

Richard,
The linker is producing a 32K binary image file suitable for downloading to the Propeller. There are already several Propeller boards that include larger EEPROMs. There are already I/O drivers that can load blocks of data from EEPROM (given an address) or from PC-compatible SD card files. By including in the linker the ability to direct output to different 32K image files, this enables user-managed overlay handling. If you don't have this basic capability, overlaying is out. You can still run multiple programs, but that's different.

deSilva · 2007-10-10 15:55

Mike, the link you gave - Bill Henning's thread from nearly a year ago - is most relevant for any understanding and assessment of this C compiler. I have known the Propeller for three months now and just read that they did NOT use Bill's LMM.
As the HUB - given some 8 KB of Video Memory - can just hold 6k 32 bit instructions, this will obviously become the next bottleneck. So keeping the eye on overlay concepts seems of utmost importance.

potatohead · 2007-10-10 15:55

I like it.

My only question (right now) is:

Say one wants to use the Propeller IDE to develop some COG native code (and we are gonna have to manage terminology here quick!), then use it in tandem with LMM C compiled code. How can this be done? Can it be done? Maybe the C compiler can assemble standard COG native code also? This would leave the passing of parameters to the programmer. No biggie, IMHO. We have that now. I realize SPIN stuff is going to have to be rewritten, but assembly objects really have the potential to be mostly re-used in this environment. (maybe?)

My thought was to leverage some objects, say a video driver or something, and use C instead of SPIN for the main program.

Actually, I've another one... or two!

So I've compiled a C program, and it's running in the LMM model, in a VM run by one of the COGs. Now, I want to utilize some of the COG registers, counters, etc... How does that work? Does the VM manage this? If so, how? Shadow registers, shadowed again, somewhere in HUB space? Does the VM consume the COG, meaning its on-board functions are largely lost, due to either inability to run / build COG native assembly code, or just space?

No biggie, given one can integrate COG native objects / code, with C program.

Just wondering how all that comes together.

(this is exciting!!)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!

Post Edited (potatohead) : 10/10/2007 4:01:25 PM GMT

acy.stapp · 2007-10-10 15:56

It looks like you will just support the "large-mode" instruction streaming and avoid use of the FCACHE? Are there any plans to allow specification of particular code to be compiled to live in fcache? Perhaps an __attribute((fcache)) or coopt __inline? You could also just compile the innermost basic blocks of loops and leaf functions into fcache assembly. Or something like an overlay manager for basic blocks being swapped into and out of cache ram.

Mike Green · 2007-10-10 16:12

potatohead,
Read the LMM thread (see earlier post). The VM is really really very very simple. It only does a few things for you. Bill has been working on a multi-threaded version that's more complex, but that's not what we're talking about here. This LMM just copies instructions one at a time (or a block at a time) from HUB to COG and lets them execute. Since this doesn't handle jumps or calls, there are pseudo-jumps and pseudo-call/return that are interpreted by subroutines internal to the VM. The whole idea is to have what mostly behaves like native Propeller instructions and runs at near normal speed, but resides in HUB RAM.
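
To illustrate the idea of copying instructions one at a time from HUB and letting the kernel interpret pseudo-jumps, here is a toy simulation in C. It is not the actual LMM kernel or ImageCraft's VM: the three-opcode instruction set is invented for the example, and jump targets are array indices rather than HUB addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy instruction set (hypothetical, for illustration only). */
enum toy_op {
    OP_ADD_IMM,  /* stands in for an ordinary native instruction */
    OP_FJMP,     /* pseudo-jump: handled by the kernel, not executed in place */
    OP_HALT
};

struct toy_insn {
    enum toy_op op;
    int32_t arg;
};

/* Fetch/execute loop: 'hub' plays the role of program code in HUB RAM. */
static int32_t lmm_run(const struct toy_insn *hub, size_t n)
{
    size_t pc = 0;     /* VM program counter, an index into hub[] */
    int32_t acc = 0;   /* stands in for a COG register */

    while (pc < n) {
        struct toy_insn slot = hub[pc++];  /* copy one instruction into the "COG" */
        switch (slot.op) {
        case OP_ADD_IMM:                   /* runs as if it were native code */
            acc += slot.arg;
            break;
        case OP_FJMP:                      /* kernel routine rewrites the VM PC */
            assert(slot.arg >= 0 && (size_t)slot.arg < n);
            pc = (size_t)slot.arg;
            break;
        case OP_HALT:
            return acc;
        }
    }
    return acc;
}

/* Small program: add 1, jump over the "add 100", add 10, halt. */
static int32_t lmm_demo(void)
{
    static const struct toy_insn program[] = {
        { OP_ADD_IMM, 1 },
        { OP_FJMP, 3 },
        { OP_ADD_IMM, 100 },   /* skipped by the pseudo-jump */
        { OP_ADD_IMM, 10 },
        { OP_HALT, 0 },
    };
    return lmm_run(program, sizeof program / sizeof program[0]);
}
```

The point of the sketch is that ordinary instructions need no interpretation at all; only control flow goes through kernel routines, which is why the scheme can run at close to native speed.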

Mark Swann · 2007-10-10 16:15

Richard,

I can use IDE's, but often prefer to use the command line and compile/link using a make file. Does your compiler and linker send messages to stdout and stderr or directly to the console? The tool I use for compiling captures stdout and stderr, but will not work with compilers/linkers that do not use stdout or stderr.

Mark

Post Edited (lucidman) : 10/10/2007 4:20:54 PM GMT

hippy · 2007-10-10 16:48

potatohead said...
So I've compiled a C program, and it's running in the LMM model, in a VM run by one of the COGs. Now, I want to utilize some of the COG registers, counters, etc... How does that work?

I think we can assume the VM will not be touching any COG registers because it has no need to except when it's executing your compiled-C code which is loaded from Hub into a Cog and is being executed.

Your high level "CTRA = 0x1234;" will compile to the necessary PASM instruction, which although held in Hub to start with will be in the Cog when it comes to executing it. That's the case for all of your compiled code.

deSilva · 2007-10-10 16:56

After having done my homework....
Yes that is most likely so, the LMM would allow that transparency and the implementation of ImageCraft will be most likely close to it.

Even wait instructions are possible! It remains to be seen whether they will wrap this into library calls, use (pseudo-) memory mapping, or leave it to "inline code".

potatohead · 2007-10-10 19:50

Thanks Mike, Hippy.

Been a while since I read that. Of course, the instruction will just hit the register as it normally would, COG native mode. It is COG native for a moment, while executing. (shakes head) I remember that whole discussion now. Got it confused with the multi-threaded stuff, later on.

That leaves the other question, and maybe it's answered with: "A project consists of multiple C source files and optional assembly routines (generally to be executed in separate COGs)."

I'm good to go, for the moment, but for the waiting!

tap...tap...tap...


hippy · 2007-10-10 20:33

@ potatohead : Regards your other question, integrating or using in tandem Spin, C, Assembler and the ImageCraft LMM implementation ...

Not sure how that would work, if at all. It would require some mechanism to integrate the two tools together and create a combined image file. It could potentially be done by stripping the code from .eeprom/.binary files each tool produces and re-combining them. The easiest option is probably to have the Propeller Tool include an ImageCraft Image somehow; that would need to be a co-operative venture to make it work.

I expect that in the short to medium term it will be a choice of Spin/PASM or C/PASM and you choose one or the other to write your application in. Hopefully Spin Objects will migrate to C libraries, but resource limitations ( Max of 8K PASM code in Hub ) could start to rear their heads.

ImageCraft · 2007-10-10 20:34

Mike Green said...
Richard,
The linker is producing a 32K binary image file suitable for downloading to the Propeller. There are already several Propeller boards that include larger EEPROMs. There are already I/O drivers that can load blocks of data from EEPROM (given an address) or from PC-compatible SD card files. By including in the linker the ability to direct output to different 32K image files, this enables user-managed overlay handling. If you don't have this basic capability, overlaying is out. You can still run multiple programs, but that's different.

Yes, I agree overlay is an important capability. The issue is whether it will be out on V1 or not. Allow me to stand on a soapbox for a moment *climb up*

I have not used the Propeller extensively yet, but it's clear that many of you are using it to do some pretty amazing things. The concept of multiprocessing on the cheap, especially with 8 symmetric cores, is so new that I think there is a lot of room for different ideas to play out. We know the Propeller has immense potential, but I don't know how to really best tap that power yet. This is why I purposefully want to set modest goals. Get a usable and useful compiler out there and see what people do with it. Rather than giving users a solution on Day One, I would rather see what the problems are, at least in the domain of C programming on the Propeller, and then come out with the solutions. You and Bill and others are very smart folks, and I hope to be able to leverage your expertise before long.

*climb down*

But yay, overlay is definitely important. I would agree to that.

Thanks
// richard

ImageCraft · 2007-10-10 20:37

lucidman said...
Richard,

I can use IDE's, but often prefer to use the command line and compile/link using a make file. Does your compiler and linker send messages to stdout and stderr or directly to the console? The tool I use for compiling captures stdout and stderr, but will not work with compilers/linkers that do not use stdout or stderr.

Mark

Yes, command line compilers are supported. Our IDE just captures the stdout/stderr and displays the errors etc.

ImageCraft · 2007-10-10 21:20

acy.stapp said...
It looks like you will just support the "large-mode" instruction streaming and avoid use of the FCACHE? Are there any plans to allow specification of particular code to be compiled to live in fcache? Perhaps an __attribute((fcache)) or coopt __inline? You could also just compile the innermost basic blocks of loops and leaf functions into fcache assembly. Or something like an overlay manager for basic blocks being swapped into and out of cache ram.

The current plan is to FCACHE any straight line block of code that is longer than X instructions and also any loops that fit into the FCACHE. I believe the former is still a win since the @next kernel has some overhead. One major drawback of FCACHE, especially with automatic FCACHE'ing by the compiler, is that the instruction timing is no longer deterministic. I mean that's true as soon as you use LMM, but automatic FCACHE makes it worse.

For now we will probably just flush the cache for every FCACHE. One possible enhancement later is to actually allocate as large a cache buffer as possible. Right now it's only 128 longs, which is sort of too long for most loops and single block cache but not long enough to hold multiple of them to make it worthwhile to keep track whether something has already been cache'd.
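
The caching rule described above could be sketched as a simple predicate. This is a guess at the shape of the decision, not the compiler's actual code: the threshold "X" is left as a parameter because no value is given, and the requirement that a straight-line block must also fit in the 128-long buffer is an assumption.

```c
#include <assert.h>

enum { FCACHE_LONGS = 128 };   /* size of the FCACHE buffer (080-0ff) */

/* Decide whether a block should be copied into FCACHE.
 *   longs             - size of the block in instructions (longs)
 *   is_loop           - nonzero if the block is a loop
 *   min_straight_line - the unspecified threshold "X" from the post
 */
static int should_fcache(unsigned longs, int is_loop, unsigned min_straight_line)
{
    if (longs == 0 || longs > FCACHE_LONGS)
        return 0;                      /* nothing to cache, or it cannot fit */
    if (is_loop)
        return 1;                      /* any loop that fits is cached */
    return longs > min_straight_line;  /* straight-line code: only if long enough */
}
```

For example, with a hypothetical threshold of 8, a 20-instruction loop is cached, a 200-instruction loop is not, and a 5-instruction straight-line run stays on the one-at-a-time @next path.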

ImageCraft · 2007-10-10 21:25

deSilva said...
@Richard:
Yes I TOTALLY misunderstood your posting! I could not read into it that you use native Propeller code. This is fine and obsoletes most likely my question (b)

I apologize for this extreme misunderstanding. Never happened before to this extent....
Just a tiny suggestion, perhaps you could add something to your posting so that no more people will be led astray...

And please do not blame the forum! I am absolutely untypical for this forum!

Apology accepted. It's all good. Propeller C will go a long way to get Propeller accepted into the wider user base. I hope I will get you and other people's support. Thanks

Mike Green · 2007-10-10 23:46

richard,
When I talk about overlays, I'm not talking about actively supporting them yet. I'm just talking about the linker being able to put the relocated absolute image into one 32K file, then stop with that file and begin building the 32K absolute image in a different file. There need to be directives to change the output file and to change the origin for the relocator. You'd already have directives to indicate the order of the various relocatables in constructing the output. Something like:

output mainprog.binary
include program1.rel
include program2.rel
output overlay.binary
origin 0x4000,0
include program3.rel
include program4.rel

This would build two 32K binary image files. The first would contain the relocatables for program1 and program2 starting at location zero. The second would contain the relocatables for program3 and program4 which would be relocated with an origin at location 0x4000, but would be written starting at location zero of the 32K binary image. It would be the user's responsibility to read the beginning of that binary image into location 0x4000 before anything in that segment is referenced. Most likely, this second 32K image would be copied by some utility program into the 2nd 32K of a boot EEPROM, but it could be loaded directly off an SD card from the image file.
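
The relationship between the relocation origin and the position in the image file can be sketched in C as follows. HUB RAM is simulated with a plain array here, and the function names are invented for the example. On real hardware the copy would be done by whatever EEPROM or SD card driver the user has.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { HUB_RAM_BYTES = 32 * 1024 };

/* Code relocated for 'origin' but written from offset zero of the image:
 * a HUB address inside the overlay maps to (address - origin) in the file. */
static uint32_t overlay_file_offset(uint32_t hub_addr, uint32_t origin)
{
    assert(hub_addr >= origin);
    return hub_addr - origin;
}

/* Copy 'len' bytes from the start of the overlay image to HUB at 'origin'.
 * Returns 0 on success, -1 if the overlay would not fit in HUB RAM. */
static int load_overlay(uint8_t *hub, const uint8_t *image,
                        uint32_t len, uint32_t origin)
{
    if (origin > HUB_RAM_BYTES || len > HUB_RAM_BYTES - origin)
        return -1;
    memcpy(hub + origin, image, len);
    return 0;
}

/* Loads a 4-byte dummy overlay at 0x4000 and returns its last byte. */
static int overlay_demo(void)
{
    static uint8_t hub[HUB_RAM_BYTES];
    const uint8_t image[4] = { 0xde, 0xad, 0xbe, 0xef };

    if (load_overlay(hub, image, (uint32_t)sizeof image, 0x4000) != 0)
        return -1;
    return hub[0x4000 + 3];
}
```

Nothing here needs compiler support, which matches the point that the linker only has to relocate for one address while writing to another.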

It's very hard to do overlays if this functionality is missing from the linker. There's no functionality required in the compiler. It would be nice to have a way to tell the compiler or the linker where to start the top of the stack. This way, a nice fixed overlay / driver loading area could be allocated there or the space could be reused for buffers or something.

Bill Henning · 2007-10-14 04:49

Hi Richard,

Sorry for the delayed response; I've been kind of busy, and I am more puzzled than disappointed...

ImageCraft said...
To keep V1 of Propeller C simple, it purposefully does not use Bill Henning's large memory model and his usage of the COG/HUB RAM. It is expected that when Propeller C is in production and when Bill's proposed systems are stable, then a future release of Propeller C will migrate to use a VM compatible with Bill's model.

That's the part that puzzles me; reading your detailed message, unless I mis-understood it, it seems that you are still using the large model execution engine and most ops (albeit an older model), with some changes:

- you had PC & SP at cog locations 0&1 as in my original model instead of being $1e0 & $1e1 as it is now :)
- you changed the jump vector table, dropping some of the ops I defined and adding a debug call
- changed from 12 general purpose registers (R0-R11) to 16 (R0-R15)
- used an on-cog stack like Mike Green proposed
- had C inspired different ideas for $100-$1df

I'll update the original thread with the latest kernel, I'd like it if you kept compatible with the following even initially:

- vector table
- registers at $1e0-$1ef
- $80-$FF as the code cache area

You could use $100-$1DF for function variables and expression evaluation, and your on-cog stack while keeping 90%+ compatibility with the stuff I am writing; I do have my own uses for DCACHE, and intended $180-$1DF for "global" fast variables, however any such use is still fairly far down the pipeline.

Now what you said about not using my model for now only makes sense to me if you are avoiding the cute fetch/exec unrolled loop, and you were instead paging in code into the $80-$ff area en masse then executing it in place linearly. If you are going for the paging approach, why not instead page in up to 256 longs? Also, if you are paging, no need for fcache :) as essentially all you would be doing would be fcaching code - linear or looping. I forget his name, but someone proposed such a paging scheme in the forum a few months ago.

For parameter passing, I was going to use R8-R11, and I have reserved R0 as "TOS", and R1 as a scratch register

Here is my latest map:

CON

version = 0_90 ' kernel version

finit = 0 ' initialize kernel
fnext = 1 ' execute next instruction
fjmp = 2 ' far jump
fcall = 3 ' far call
fret = 4 ' far return
fldi = 5 ' load immediate long into r0
fpushi = 6 ' push immediate long onto stack
fpush = 7 ' push r0 onto stack
fpop = 8 ' pop top of stack into r0
fcache = 9 ' load & execute following cog code, ends with 0
fyield = 10 ' yield time slice on multi-threaded kernels
fhalt = 11 ' halt kernel
fsvc = 12 ' call OS service
flib = 13 ' call library routine

icache = $080 ' 128 long fcache area
dcache = $100 ' 128 long dcache area
global = $180 ' 112 long "register variable" area
vmregs = $1e0 ' virtual machine registers

pc = $1e0 ' program counter
sp = $1e1 ' stack pointer
bp = $1e2 ' base pointer
flags = $1e3 ' flags register
r0 = $1e4 ' general purpose register 0-11
r1 = $1e5
r2 = $1e6
r3 = $1e7
r4 = $1e8
r5 = $1e9
r6 = $1ea
r7 = $1eb
r8 = $1ec
r9 = $1ed
r10 = $1ee
r11 = $1ef

ioregs = $1f0

That's directly from my latest lmm.spin file
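
For readers comparing this map with the ImageCraft layout, the region sizes can be checked with a few lines of C. The addresses are copied from the listing above; the helper names are invented for the example. Note that $180 up to vmregs at $1e0 works out to 96 longs by this arithmetic, whereas the listing's comment says 112.

```c
#include <assert.h>

/* Region start addresses, copied from the CON block above. */
enum {
    ICACHE = 0x080,
    DCACHE = 0x100,
    GLOBAL = 0x180,
    VMREGS = 0x1e0,
    IOREGS = 0x1f0
};

/* Number of longs between the start of one region and the start of the next. */
static unsigned span(unsigned start, unsigned next_start)
{
    assert(next_start > start);
    return next_start - start;
}

/* COG address of general purpose register rN (r0 sits at $1e4, after
 * pc, sp, bp and flags). */
static unsigned vm_reg_addr(unsigned n)
{
    assert(n <= 11);
    return VMREGS + 4 + n;
}
```

The 12 general purpose registers plus pc, sp, bp and flags fill the 16 longs from $1e0 to $1ef exactly.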

And here is the jump table:

DAT

{
*****************************************************************************************

Large Model Optimized Kernel

Copyright 2006, William Henning
http://www.mikronauts.com
webdevsupport@gmail.com

*****************************************************************************************
}

org 0 ' set the origin within the cog

init jmp #initialize ' kernel initialization routine
long next_op ' pointer to execute next instruction
long fjmpx ' tested
long fcallx ' tested
long fretx ' tested

long fldix ' implemented, not tested
long fpushix ' implemented, not tested
long fpushx ' implemented, not tested
long fpopx ' implemented, not tested

long fcachex ' tested
long fyieldx ' nop - only for multi-tasking kernel

long fhaltx ' not tested

long fsvcx ' not yet implemented
long flibx ' not yet implemented

I am expecting to have the 'not tested' calls tested by Tuesday the 16th, and I hope to release a running kernel, with a demo, a week after.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers

Bill Henning · 2007-10-14 05:05

Hi Mike,

Yep, I have a rough multi-threaded version, but I have gone back to basics (now that a large consulting project that tied me up for many months is over) - I have a running single threaded kernel, and am working on a simple demo - I have an incredibly stupid simple demo running, but I have an idea for a better demo... watch this forum in a bit over a week

I have given thought to the multi-threading; and have come up with malloc/free primitives; but Spin is really not the appropriate tool to generate full LMM programs, so after the demo is done I will resume debugging my large model macro assembler, and after that is done, write the relocating linker I talked about a while ago.

Mike Green said...
potatohead,
Read the LMM thread (see earlier post). The VM is really really very very simple. It only does a few things for you. Bill has been working on a multi-threaded version that's more complex, but that's not what we're talking about here. This LMM just copies instructions one at a time (or a block at a time) from HUB to COG and lets them execute. Since this doesn't handle jumps or calls, there are pseudo-jumps and pseudo-call/return that are interpreted by subroutines internal to the VM. The whole idea is to have what mostly behaves like native Propeller instructions and runs at near normal speed, but resides in HUB RAM.


Bill Henning · 2007-10-14 05:07

I like command line tools

Will there be a Linux version? Dare I hope?

ImageCraft said...

lucidman said...
Richard,

I can use IDE's, but often prefer to use the command line and compile/link using a make file. Does your compiler and linker send messages to stdout and stderr or directly to the console? The tool I use for compiling captures stdout and stderr, but will not work with compilers/linkers that do not use stdout or stderr.

Mark

Yes, command line compilers are supported. Our IDE just captures the stdout/stderr and displays the errors etc.

