"Small C" Compiler — Parallax Forums

"Small C" Compiler

Christof Eb.Christof Eb. Posts: 1,161
edited 2007-12-30 20:12 in Propeller 1
Hello all,

I have read the discussion over perhaps making a new C-compiler. This seems to be a very big project.... At least, if you have only a few hours a week.

There seem to be several reasons for wanting a C compiler.
For me, the reason would be to have something between the Spin interpreter and assembler, both in speed and in ease of use.
I would include the compiler's assembler output in Spin just like hand-written assembler.

Some years ago I used Small C, a subset of C, for some work with the 68HC11, and I found it nice compared to assembler.

I think it could be ported to the Propeller with moderate effort:
* There is a complete description: http://web.archive.org/web/20020803105843/www.ddjembedded.com/languages/smallc/
* The compiler itself already exists. It would "only" be necessary to generate assembler output for the pcodes.
* Perhaps you could first do a compiler for the small memory model (cog RAM), and later a version for the large memory model as a second step?
* Disadvantages: the compiler is restricted and it does not support 32-bit longs.

What is your opinion? Would this make sense? How many hours would be needed?

Comments

  • IWriteCode Posts: 16
    edited 2007-03-10 21:39
    In another thread, http://forums.parallax.com/showthread.php?p=635438, they are working on a complete commercial version... so I'm not sure how useful it would be, but it would certainly be interesting :)
  • Stan671 Posts: 103
    edited 2007-03-11 00:32
    Keep in mind, Christof Eb., that assembly language programs are not run from main memory. They are loaded into and run from a Cog's memory. And each cog has 2K of RAM which is 512 longs or 512 Assembly language instructions. So no single Assembly language program can be longer than that.

    Personally, I really don't see the attraction of trying to shave the square corners off the C peg and jamming it into the round hole of the Propeller. The Propeller's architecture was not designed for C. There are two excellent languages (Spin and Assembly) available for the Propeller that are perfectly optimized for the Propeller's architecture. Even when (if) someone develops a C environment for the Propeller, I suspect that it will be so limited and/or non-standard that most of the reasons to use C in the first place will not even come true. IMHO, of course.

    I know C and am leveraging that knowledge to work my way up the Spin learning curve. So, rather than trying to hammer the Propeller into the shape of something I already know, I am expanding my field of knowledge to see the Propeller's uniqueness for what it is.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Stan Dobrowski

    Post Edited (Stan671) : 3/11/2007 12:45:41 AM GMT
  • Bergamot Posts: 185
    edited 2007-03-11 05:52
    There are workarounds to the memory issue as long as you're willing to put up with modest speed cuts (though still better than SPIN), so I think a C compiler would be feasible.

    Would it be useful though? I have absolutely no idea; since the propeller has a ton of general-purpose registers, and no need for interrupts, a lot of the complexity of assembly programming just goes away.
  • Mike Green Posts: 23,101
    edited 2007-03-11 06:21
    Stan671,
    Spin is actually fairly close to C. I imagine C could indeed be fairly easily translated into either Spin or Spin byte codes. The Spin byte codes have been mostly "reverse engineered". Other than making it easier to port existing C libraries to the Propeller, I'm not sure how useful such a compiler would be. Existing C programs can be easily hand transliterated into Spin. The difficulty is in the absence of the equivalent libraries. Until recently, there was really no mass storage I/O. That's now changed with Rokicki's SD card support routines.

    Bergamot,
    The main usefulness would be in the support of a "large memory model" that really doesn't have any reasonable support whether with a compiler or even a good macro front end on the existing assembler. It sounds like there may be a commercial compiler in the near future that'll support this (see IWriteCode's link above).
  • Peter Verkaik Posts: 3,956
    edited 2007-03-11 08:47
    I do not know much about the Propeller yet (that may change when the Spin Stamp becomes
    available), so I am not sure if Propeller assembly is usable with the package I attached.
    It is the Retargetable Concurrent Small C (RCSC). I used it with the SX28.
    The smallc compiler itself generates pseudo instructions for a stack-based virtual
    machine. By default it uses 16-bit words, but you can recompile the compiler to use
    32-bit words. You then use the M4 macro processor (included) to convert the pseudo
    instructions to the target assembler instructions. It took me about a week to get to
    know M4 well enough to produce a working M4 file that generates SX assembly code from a c source.
    Since the RCSC compiler has built-in support for concurrency, it may well be
    suitable for the propeller.
    Inside the zip are documents, ConcurrentSmallC.pdf and RetargetableConcurrentSmallC.pdf,
    that explain the concurrency support added to smallc.
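
    To give a feel for what the compiler produces: for a c statement like the one below it emits a short sequence of stack-machine pseudo instructions, and the M4 file then expands each of them into target assembly. The pseudo-op names in the comments are only illustrative, not the actual rcsc mnemonics:

    /* Illustration only: the pseudo-op names below are invented,
       the real rcsc output uses its own mnemonics. */
    int a, b, c, x;

    void calc(void)
    {
        x = a * b + c;  /* PUSH a    push a onto the VM stack   */
                        /* PUSH b                               */
                        /* MUL       pop two, push the product  */
                        /* PUSH c                               */
                        /* ADD       pop two, push the sum      */
                        /* POP  x    store the result in x      */
    }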

    regards peter

    Post Edited (Peter Verkaik) : 3/11/2007 8:52:00 AM GMT
  • IWriteCode Posts: 16
    edited 2007-03-11 09:25
    I had a look at the package, looks like it shouldn't be that hard to do...
  • Gavin Posts: 134
    edited 2007-03-11 13:20
    That's great stuff Peter, if I was not so tired it would be bedtime reading.
    Going to have to wait until tomorrow.
    I had forgotten about Andy's stuff, did not know about or had forgotten the concurrent version of Small C.

    I do remember someone had done a Pcode C Compiler, I even got it many years ago, Dave Dunfield?

    Gavin
  • Christof Eb. Posts: 1,161
    edited 2007-03-11 20:44
    Wow, Peter,
    very interesting! Thank you for the info! I will have a close look at this M4 approach. Perhaps I'll find the time to try something, partly just because it's interesting.....

    Stan, I agree with you. Spin as a language is OK for me; I don't see big advantages in C. But Spin with its large memory model is significantly slower than assembler with its small memory model.

    I have now written two project programs (which you can find somewhere). In both of them I have a loop doing something; not the same thing, but the tasks seem comparable. In the one where I use Spin I reach about 9 kHz loop frequency. In the one that uses assembler, I reach about 500 kHz.

    For me it would make sense to have a compiler for the small memory model, perhaps 5 to 10 times faster than the Spin interpreter. I fear that if you make a compiler for the large model, it will not be faster than Spin.
  • ImageCraft Posts: 348
    edited 2007-03-12 10:40
    Even if I don't have a commercial interest in this, can people *please* read the thread cited and at least see Bill Henning's posts on how a Large Programming Model C program may run at 90-95% of NATIVE speed? I won't be THAT optimistic, but I think a 2x to 4x slowdown compared to COG asm is doable. My understanding is that SPIN is MUCH worse than that.

    As for the merits of SPIN vs. C, or C vs. <XYZ>, I'd leave it to other people. I am a pragmatic guy - I write C compilers, I sell C compilers. People buy my compilers. I am happy. :)

    // richard
  • Bill Henning Posts: 6,445
    edited 2007-03-12 22:32
    Richard is right :)

    The worst-case slowdown of large model C generated code is 4.5x (18 cycles vs. 4 cycles) using a four-way unrolled kernel.

    Proper use of FCACHE, including using part of the FCACHE'd area for local variables, can lead to 95%+ of the performance of "small" model compiler-generated code; however, very carefully hand-crafted code using lots of tricks can still be faster.

    FCACHE is your friend.

    Spin is approx. 20x-200x slower than cog assembly.

    I would expect Richard's compiler (which I can't wait to play with!) to be 4x-50x as fast as Spin, depending on the code; and he is far too pessimistic (or not thinking about FCACHE enough) if he expects an average 4x slowdown vs. assembly.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com - a new blog about microcontrollers
  • ImageCraft Posts: 348
    edited 2007-03-13 01:06
    Bill Henning said...
    Richard is right :)

    Worst case slow down of large model C generated code is 4.5x (18 cycles vs. 4 cycles) using a four way unrolled kernel.
    ...
    I would expect Richard's compiler (which I can't wait to play with!) to be 4x-50x as fast as Spin, depending on the code; and he is far too pessimistic (or not thinking about FCACHE enough) if he expects a 4x on average slow down vs. assembly.

    Well, 20%-25% performance is the lower bound, so this is still much faster than SPIN. (BTW, there is no reason one cannot have a SPIN large programming model compiler (!)) If I don't promise anything higher and deliver higher performance, then people should be even more happy, right :)? I am still thinking about the best way to use the FCACHE. In the simplest model, every straight line block larger than X instructions (with X being as small as, I don't know, 4 to match the minimum size of the unrolled kernel?) or every loop that can fit inside the FCACHE should be prefixed by the FCACHE loading code. Large blocks will have FCACHE loading instructions inserted at the FCACHE block boundary. I think this should get most of the performance. If you have ideas to push the performance further, let me know :)
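
    In rough pseudocode, that selection pass might look something like this (just a sketch; the block structure, the FCACHE size, and the threshold X here are made-up placeholders, not anything from the actual compiler):

    /* Sketch only: block layout, sizes, and helper names are invented for illustration. */
    #define FCACHE_SIZE  64   /* longs assumed available in the cog for fcached code      */
    #define MIN_STRAIGHT  4   /* threshold X: shortest straight-line run worth caching    */

    typedef struct block {
        int num_instructions;
        int is_loop;               /* innermost loop that fits entirely in the fcache area */
        struct block *next;
    } block_t;

    static void emit_fcache_prefix(block_t *b)        { (void)b;            /* placeholder hook */ }
    static void split_and_prefix(block_t *b, int max) { (void)b; (void)max; /* placeholder hook */ }

    static void mark_fcache(block_t *prog)
    {
        for (block_t *b = prog; b != NULL; b = b->next) {
            if (b->is_loop && b->num_instructions <= FCACHE_SIZE)
                emit_fcache_prefix(b);            /* whole loop runs from cog RAM           */
            else if (!b->is_loop && b->num_instructions >= MIN_STRAIGHT)
                split_and_prefix(b, FCACHE_SIZE); /* big runs split at fcache boundaries    */
            /* everything else is left as ordinary large model code in hub memory           */
        }
    }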
  • Stan671 Posts: 103
    edited 2007-03-13 03:28
    Mike Green said...
    Spin is actually fairly close to C. I imagine C could indeed be fairly easily translated into either Spin or Spin byte codes.
    I understand that, Mike. And it kind of makes my point. Many of the arguments for C are that it will be faster than Spin. Well, if the C code is translated into Spin bytecodes, then it will not be faster than Spin. So, then, that removes a big reason to bother with C implemented in that way.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Stan Dobrowski
  • Bill Henning Posts: 6,445
    edited 2007-03-13 07:41
    Straight-line code does not really benefit from FCACHE unless it is enclosed in a loop :)

    I don't know what your compiler's parse tree looks like, but what I'd suggest is analyzing from the innermost loop (in any block of code) outwards, to see how much will fit in the FCACHE area. Then copy the local variables for the largest practical FCACHE'd block to the end of the FCACHE area, and run the loop purely within the cog. If you can fit two or three nested loops, so much the better.

    A trivial example would be something like strcpy

    void strcpy(char *dest, char *src) {
        while (*dest++ = *src++);
    }

    leaving aside the semantics of stacking the arguments, the interesting piece of code is:

    while (*dest++ = *src++);

    This code would be enclosed in an FCACHE'd block, something like:

    ' in-line code before fcache would set up src and dest
    FCACHE
          rdbyte  t1, src  wz
          add     src, #1
          wrbyte  t1, dst
          add     dst, #1
    if_nz jmp     #$80      ' start of fcached block
          jmp     #next
    t1    long    1         ' remember, 0 stops loading fcached block
    src   long    1         ' remember, 0 stops loading fcached block
    dst   long    1         ' remember, 0 stops loading fcached block
          long    0         ' indicates end of fcache'd block in hub memory

    (Note: if the local copies of the variables do not need to be initialized, you do not have to allocate space for them in hub memory, and can put the null terminating word after the last required initialized variable.)

    On a 100 byte string, it would run at over 95% of pure cog asm speed, as only the function call overhead would be running as pure large model code. Of course fancier strcpy's could be written, but the point of this little example was to show how FCACHE should be used.

    The more work is done in the loops being fcache'd, the more closely pure cog assembly speed can be approached.

    Given that probably 95%+ of time in any code is spent executing loops or library functions... you can see why I think almost pure cog asm speeds are within reach.
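
    A quick back-of-the-envelope check (with purely illustrative numbers): if 95% of the time is spent in FCACHE'd loops running at essentially cog speed and the remaining 5% runs as large model code at the 4.5x worst case, the whole program still comes out at roughly 85% of pure cog speed:

    #include <stdio.h>

    /* Illustrative only: plug in whatever fractions and slowdowns you like. */
    int main(void)
    {
        double loop_fraction   = 0.95;  /* share of time spent in FCACHE'd loops      */
        double loop_slowdown   = 1.0;   /* ~cog speed inside the fcached loops        */
        double kernel_slowdown = 4.5;   /* worst-case large model slowdown (18 vs 4)  */

        double overall = loop_fraction * loop_slowdown
                       + (1.0 - loop_fraction) * kernel_slowdown;

        printf("overall slowdown vs. pure cog code: %.2fx (about %.0f%% of cog speed)\n",
               overall, 100.0 / overall);
        return 0;
    }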

    Basically, my large model essentially treats each cog as a cpu with a software managed I+D cache; I think I remember some papers from the early 80's about similar schemes for RISC architectures; and if I correctly recall, later Transmeta did something similar too.
    ImageCraft said...
    Bill Henning said...
    Richard is right :)

    Worst case slow down of large model C generated code is 4.5x (18 cycles vs. 4 cycles) using a four way unrolled kernel.
    ...
    I would expect Richard's compiler (which I can't wait to play with!) to be 4x-50x as fast as Spin, depending on the code; and he is far too pessimistic (or not thinking about FCACHE enough) if he expects a 4x on average slow down vs. assembly.

    Well, 20%-25% performance is the lower bound, so this is still much faster than SPIN. (BTW, there is no reason one cannot have a SPIN large programming model compiler (!)) If I don't promise anything higher and deliver higher performance, then people should be even more happy right :)? I am still thinking the best way to use the FCACHE. In the simplest model, every straight line block larger than X instructions (with X being as small as, I don't know, 4 to match the minimum size of the unrolled kernel?) or every loop that can fit inside the FCACHE should be prefixed by the FCACHE loading code. Large blocks will have FCACHE loading instructions inserted at the FCACHE block boundary. I think this should get most of the performance. If you have ideas to push the performance further, let me know :)

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com - a new blog about microcontrollers

    Post Edited (Bill Henning) : 3/13/2007 7:53:55 AM GMT
  • Bill Henning Posts: 6,445
    edited 2007-03-13 07:56
    Spin will always have one advantage over the large model - more compact code. The Spin bytecode may be much slower than large model code, but you can fit a lot more of it into the available memory; so Spin definitely still has its place.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com - a new blog about microcontrollers
  • IWriteCode Posts: 16
    edited 2007-03-13 08:55
    I didn't read anywhere that Spin doesn't have its place... a C compiler which generates (a lot) faster code is just another great addition to the Propeller tool chain. Without the need to write assembler directly, you can still get assembler-like speeds... (well, almost)
  • ImageCraft Posts: 348
    edited 2007-03-13 09:40
    Bill Henning said...
    Straight-line code does not really benefit from FCACHE unless it is enclosed in a loop :)...

    Ah wait, in our private conversations, I said:
    ***
    Or maybe there IS benefit anyway, because you are doing a block copy into COG memory, and you will be eliminating the overhead of the NXT and PC update etc. If so, then we should be FCACHE'ing more aggressively.
    ***
    To which you replied:
    &&&&
    EXACTLY!
    &&&&

    :)

    So which is it?

    BTW, don't worry about what our parse tree looks like. There are MANY code generation methods and optimization technologies, and our backends use a framework that generates more or less pseudo-asm code and then allows extensive optimization of the instructions and the blocks.

    // richard
  • Bill Henning Posts: 6,445
    edited 2007-03-14 03:17
    It's both, in a way.

    Seriously.

    If you copy a large enough chunk of linear code, it WILL have some loops - then you benefit.

    But all you really need to do is fcache loops; that's where you get the biggest bang for the buck.

    Sorry if I was unclear before; straight line code that is not enclosed in a loop receives no benefit from being fcached. I based my design loosely on treating a cog's local memory as a software controlled I(nstruction) and D(ata) cache; however, if you look at most code, there will be LOTS of opportunities to use FCACHE on loops.

    By the way, it can be counter-productive to fcache a loop that calls a library function; as a matter of fact, you should not FCACHE any code block that uses any extended F* pseudo-instructions in the body of the loop; however library functions should make a LOT of use of FCACHE'd blocks.

    You should also not keep local char arrays in cog memory; the shifting/masking takes almost as much time as using RDBYTE/WRBYTE from hub memory.

    You should *ALWAYS* try to use cog variables for pointers; MAJOR speedup for C code.
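
    For example, plain C like the second version below gives the compiler a chance to keep both pointers in cog registers for the whole fcached loop (nothing Propeller-specific in the source itself; this is just a generic illustration):

    /* Indexed form: every iteration recomputes base + i. */
    void copy_indexed(long *dst, const long *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* Pointer form: dst and src can live in cog registers inside the fcached loop. */
    void copy_pointers(long *dst, const long *src, int n)
    {
        while (n-- > 0)
            *dst++ = *src++;
    }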
    ImageCraft said...
    Bill Henning said...
    Straight-line code does not really benefit from FCACHE unless it is enclosed in a loop :)...

    Ah wait, in our private conversations, I said:
    ***
    Or may be there IS benefit anyway because you are doing a block copy into COG memory, and you will be eliminating the overhead of the NXT and PC update etc. If so, then we should be FCACHE'ing more aggressively.
    ***
    In which you reply:
    &&&&
    EXACTLY!
    &&&&

    :)

    So which is it?

    BTW, don't worry about what our parse tree look like. There are MANY methods of code generations and optimization technologies, and our backends use a framework that generate more or less pseudo-asm code, and then allow extensive optimizations of the instructions and the blocks.

    // richard
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com - a new blog about microcontrollers
  • ImageCraft Posts: 348
    edited 2007-03-14 07:52
    Bill Henning said...
    Its both, i a way.

    Seriously.

    If you copy a large enough chunk of linear code, it WILL have some loops - then you benefit.

    But all you really need to do is fcache loops; that's where you get the biggest bang for the buck.

    Sorry if I was unclear before, straight line code that is not enclosed in a loop receives no benefit from being fcached; I based my design loosly on treating a cog's local memory as a software controlled I(nstruction) and D(ata) cache; however if you look at most code, there will be LOTS of opportunities to use FCACHE on loops.

    By the way, it can be counter-productive to fcache a loop that calls a library function; as a matter of fact, you should not FCACHE any code block that uses any extended F* pseudo-instructions in the body of the loop; however library functions should make a LOT of use of FCACHE'd blocks.

    You should also not keep local char arrays in cog memory, the shifting/masking takes almost as much time as using RD/WR BYTE from hub memory.

    You should *ALWAYS* try to use cog variables for pointers; MAJOR speedup for C code.



    We should probably take this off-line. I understand the manually controlled cache bits. I worked on some unnameable HP projects that translated code from one ISA to another and dynamically linked the blocks together (google "trace cache"; Intel put it in hardware, as did Transmeta). Gets quite good performance. Anyway, the question still remains: if I cache a straight line block, it eliminates the NXT and PC increment, so clearly there is a small performance gain even at that?

    Unlike PC programs, most embedded programs I see don't call library functions all the time. Hence my not being as optimistic as you are. YMMV of course.
  • Bill Henning Posts: 6,445
    edited 2007-03-14 08:44
    Hi Richard,
    Actually, FCACHEing code that is just straight, in-line code will tend to execute *slower* than if it were run out of hub memory.
    Why?
    The time required for incrementing the PC register and for NXT is "hidden" in the cycles that would otherwise be wasted synchronizing to the hub.
    Copying a block of straight line code into the cog (ignoring initial hub sync) takes 32 cycles per instruction, plus 4 cycles to execute each instruction (again ignoring hub sync in case an instruction accesses the hub). So a straight line block of the form
    FCACHE
    n in-line instructions
    0
    takes 18*2 (the FCACHE and the null) + 32n cycles to load + 4n cycles to execute the n instructions (again ignoring additional initial hub sync delays),
    which is 40 + 36n cycles for n instructions,
    whereas if they were "executed directly from hub memory", it would take 18n cycles (ignoring hub sync).
    Therefore, while it may seem counter-intuitive, for in-line sequential large memory model code that is not enclosed in a loop, the non-FCACHEd code will on average execute twice as fast as if it were fcached!
    i.e. 40+36n vs. 18n for n in-lined, non-looped instructions
    Of course, this is all turned on its head if there are loops involved.
    The way I was planning on generating code was to generate large model code for all function entry/exit code and any in-line code (or switch statements etc.), but aggressively FCACHE loops, and write tight string and memory library functions as hand-crafted large model assembly code that definitely uses FCACHE'd blocks.
    As long as FCACHE'd loops execute at least twice, it's a win to FCACHE them; obviously the more times a loop executes, and the more code inside of it, the closer it gets to pure native cog speed. For example, if you had a FCACHE'd loop that did not need to access hub memory, with 100 instructions in the loop being executed 100 times - admittedly a contrived, artificially simplistic example:
    - generating pure large model code would take approximately 18 * (100 * 100) = 180,000 cycles
    - if the loop was properly FCACHE'd it would take 40 + 32*100 + 4*100*100 = 43,240 cycles (the 32-cycles-per-instruction load already allows for the first execution)
    - if the loop was launched with CogInit it would take approx. 496*16 + 4*100*100 = 47,936 cycles
    NOTE: the FCACHE'd code would even beat launching the task with CogInit, due to not having to load all 496 words into the Cog! In this case, FCACHE'd large model code would be *FASTER* than native cog code, due to not having to spend 496*16 cycles to load every word of the cog's memory - some 10% faster!
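
    For anyone who wants to play with the numbers, here are the same three estimates in a throwaway program (using the assumed per-instruction costs quoted above, not measured figures):

    #include <stdio.h>

    /* Reproduces the estimates above; all cycle costs are the assumed ones
       from this thread, not measurements. */
    int main(void)
    {
        int n_instr = 100;   /* instructions in the loop body */
        int n_iter  = 100;   /* times the loop executes       */

        long large_model = 18L * n_instr * n_iter;                       /* ~18 cycles per LMM instruction   */
        long fcached     = 40L + 32L * n_instr + 4L * n_instr * n_iter;  /* load once, then run at cog speed */
        long coginit     = 496L * 16L + 4L * n_instr * n_iter;           /* load the whole cog, then run     */

        printf("pure large model: %ld cycles\n", large_model);
        printf("fcache'd loop:    %ld cycles\n", fcached);
        printf("coginit launch:   %ld cycles\n", coginit);
        return 0;
    }
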
    btw... I *REALLY* hope Chip adds a length parameter to CogInit, for small cog programs it would save startup cycles...
    On a different topic, I believe I figured out how Chip managed to fit the Spin interpreter into one cog.. or I figured out another mechanism to do it... he cheated :) I'm betting the whole interpreter is more like 1100-1500 instructions long. (or at least the way I'd do it would cheat... it bugged me until I figured it out, because I could not see how to implement as many byte codes as have been reverse engineered in less than 496 instructions)
    Sure, we can take this off line. I've been waiting for an email response from Chip before writing up some docs.
    ImageCraft said...

    We should probably take this off-line. I understand the manually controlled cache bits. I worked on some unname-able HP projects that translate code from one ISA to another ISA and it dynamically links blocks together (google "trace cache" Intel put it in hardware, as did transmeta). Gets quite good performance. Anyway, the question still remains: if I cache straight line block, it eliminates NXT and PC increment, clearly there is a small performance even at that?

    Unlike PC programs, most embedded programs I see don't call library functions all the time. Hence my not as optimistic as you are. YMMV of course.
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com - a new blog about microcontrollers

    Post Edited (Bill Henning) : 3/14/2007 8:58:46 AM GMT
  • Bill Henning Posts: 6,445
    edited 2007-03-14 09:01
    another note before I hit the sack...

    I suspect a Spin interpreter written to the large memory model specification could be noticeably *faster* than the one in the ROM - perhaps 2x-4x in some cases!

    (if I am right in how Chip managed to make it fit)

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com - a new blog about microcontrollers
  • Christof Eb. Posts: 1,161
    edited 2007-03-16 20:34
    Hello again,

    I have now made some first experiments with the "small small c compiler for prop" ssp.exe.
    And - astonishingly - in just a few evenings I got it working a little bit.... (NOT MORE!)
    Half of the work is modifying the pcodes in ssp4.c. The other half is making it produce the special format of the .spin files.

    Small Small C: ssp.exe is compiled with ccssp.bat using the Small C compiler.
    It makes assembler code that works in the very limited cog RAM and resides in a .spin file.

    test1.c is compiled with cct.bat which uses ssp.exe. Sem2az.exe translates ; to ' for the comments.

    I spent about two evenings with the M4 macro tool. This gave me some positive results, but the very special RETURN method of the prop, for example, made it necessary to modify the C compiler a little bit.

    The very simple program test1.c gives an output frequency of about 13 kHz at bit 7. That is about 30 times faster than the corresponding Spin program, which reaches about 0.4 kHz. The compiler does not generate ideal assembler code...

    Well, I do not yet know how to deal with large constants, how to multiply and divide, and a lot of other things....

    Christof


    '/* Test1.c for SSP Small SmallC for Prop*/
    ' Small SmallC for Prop CE 2007
    '/* 13 kHz at bit 7 */
    '
    '/* spin code is in #asm # endasm ====================================== */
    '#asm
    'CON
    CON
    ' _clkmode = xtal1 + pll16x
    _clkmode = xtal1 + pll16x
    ' _xinfreq = 5_000_000
    _xinfreq = 5_000_000
    '
    'VAR
    VAR
    ' long parameter
    long parameter
    ' long stack[500]
    long stack[500]
    '
    'PUB Start_SSP
    PUB Start_SSP
    ' SP := @stack
    SP := @stack
    ' cognew(@asm_entry,@parameter)
    cognew(@asm_entry,@parameter)
    '
    'DAT
    DAT
    'asm_entry ORG
    asm_entry ORG
    ' JMP # Main
    JMP # Main
    '
    ' P long '/* Primary Register */
    P long '/* Primary Register */
    ' S long '/* Second Register */
    S long '/* Second Register */
    ' SP long '/* Stack increment after wr */
    SP long '/* Stack increment after wr */
    ' BP long '/* aux reg */
    BP long '/* aux reg */
    ' CL long '/* aux reg */
    CL long '/* aux reg */
    '#endasm
    '
    'extern int ina, outa, dira, cnt'
    'DATA SEGMENT PUBLIC
    'EXTRN ina:WORD
    'EXTRN outa:WORD
    'EXTRN dira:WORD
    'EXTRN cnt:WORD
    '
    '/* Variables ===================================== */
    'char c_char=3' /* all vars are longs!!! */
    'PUBLIC c_char
    c_char long 3
    '
    'int c_outb=255' /*constants marked with c_*/
    'PUBLIC c_outb
    c_outb long 255
    '
    'int v_count=0'
    'PUBLIC v_count
    v_count long 0
    '
    '
    '
    '/* Functions ======================================*/
    '
    '/*
    'void Fun()
    '{
    ' v_count++'
    ' outa=v_count'
    '}
    '*/
    '
    'void Main() /*

    */
    'DATA ENDS
    'CODE SEGMENT PUBLIC
    'PUBLIC Main
    Main
    WRLONG BP,SP
    ADD SP,#4
    MOV BP,SP
    '{
    ' dira=c_outb' /* low byte output */
    MOV P,c_outb
    MOV dira,P
    '
    ' /* Fun()' */
    '
    ' for('')
    L_4
    JMP #L_5
    L_2
    JMP #L_4
    L_5
    ' {
    ' outa=v_count++'
    MOV P,v_count
    ADD v_count,#1
    MOV outa,P
    ' }
    JMP #L_2
    L_3
    '
    '
    '}
    MOV SP,BP
    SUB SP,#4
    RDLONG BP,SP
    Main_ret RET
    '
    '
    '
    '
    '/* EndCode ===================================== */
    '#asm
    ' FIT
    FIT
    '#endasm
    'CODE ENDS
    'END

  • Peter Verkaik Posts: 3,956
    edited 2007-03-17 05:03
    Christof,

    Why did you decide to use the standard sc with 8086 output, and change only some
    output strings to propeller asm/spin format? I fear that will not get you the result you
    want.

    Did you give up on the M4?
    The M4 lets you have any output you want. You can even have different code generation,
    based on register names.
    I attached my SX28_16.M4 file (open with notepad), this is a rewritten 8086.M4, targeted to the SX28.
    You could do the same, or alter this one, to target the propeller.
    Note that the DATA SEGMENT and CODE SEGMENT directives, and a few others, generate comments (for the
    assembler) as they have no meaning for an SX28.
    Basically I defined working registers to replace the AX,BX,BP and SP 8086 registers.
    Sometimes the code generated calls a runtime routine to perform the instruction.

    regards peter
  • Christof Eb. Posts: 1,161
    edited 2007-03-17 09:07
    Peter,

    yes, at first I had rather quick positive results and this was very encouraging. But then difficulties came up that I did not know how to handle with M4. I am not familiar with it.
    1. I had to write a simple program that converts ; to '. I did not find an M4 function to convert single characters.
    2. I had to write a second program that would deal with colons in the wrong place, because of the special assembler file format of the prop system.
    3. The prop assembler cannot handle "_6" as a label.
    4. The prop has no stack and therefore has to have a very special return format for subroutines:

    function_name
    assembler stuff
    function_name_ret RET

    main
    call function_name


    The compiler has to store the function_name and place the label function_name_ret exactly on the line with the RET instruction. I do not know how to implement this with M4. I think the assembler code itself could be converted easily from 8086 to Propeller, but the formatting peculiarities of this assembler file are really difficult, at least for me.

    I have read the PDFs about rcsc and I have to say that I did not understand the concept and its benefits completely. (Probably partly a language problem too, I'm German.) I am not convinced that the concept will work, or be easily implemented, on 8 cogs. And I wanted to start with the small memory model first.

    Trying M4 first was nevertheless very valuable for me, because it was encouraging to go on!

    Do you want to work on a small c for prop too?

    Christof
  • Peter Verkaik Posts: 3,956
    edited 2007-03-17 10:13
    1. rcsc does not generate ;, so why do you need to convert ; into '?
    2. rcsc generates a : after a label name, but this can be changed by M4,
    or replaced in rcsc itself; that is easy and preferable if prop asm never uses :
    3. same as 2; _ can be replaced by other text, for example L, so _6: becomes L6
    4. M4 can redefine macros, so it can remember the last function name (macros are just text) and just output
    this function name when a ret is encountered. No problem.

    I would not deal with the rcsc task support for now. Concentrate on converting
    the virtual machine instructions. Once you have these, deal with the labels.
    Then figure out the ret problem.

    I might work on this too, provided I have a propeller to work with, which I haven't.
    Like I said, it took me a week to get to know M4. It is not the easiest macro language.
    However, changing the rcsc compiler itself will take much longer, because simply
    changing output strings will not suffice. And you need to check the compiler operation
    after changes. Using M4 only requires you to check the generated assembly, changing
    the M4 file until you get the correct output.

    regards peter
  • Peter Verkaik Posts: 3,956
    edited 2007-03-17 10:52
    I checked the propeller manual for assembly syntax:

    Common Syntax Elements
    When reading the syntax definitions in this chapter, keep in mind that all Propeller Assembly
    instructions have three common, optional elements; a label, a condition, and effects. Each
    Propeller Assembly instruction has the following basic syntax:
    〈Label〉 〈Condition〉 Instruction 〈Effects〉
    Label is an optional statement label. Label can be global (starting with an underscore
    '_' or a letter) or can be local (starting with a colon ':'). Local Labels must be
    separated from other same-named local labels by at least one global label. Label is
    used by instructions like JMP, CALL and COGINIT to designate the target destination.
    Condition is an optional execution condition (IF_C, IF_Z, etc.) that causes Instruction
    to be executed or not. See Conditions on page 368 for more information.
    Instruction is a Propeller Assembly instruction (MOV, ADD, COGINIT, etc.) and its
    operands.
    Effects is an optional list of one to three execution effects (WZ, WC, WR, and NR) to apply
    to the instruction, if executed. They cause the Instruction to modify the Z flag, C
    flag, and to write, or not write, the instruction's result value to the destination
    register, respectively. See Effects on page 371 for more information.

    Labels may start with an underscore, which is what rcsc already uses.
    Local labels start with : but rcsc does not generate local labels.
    So it looks like only the trailing : needs to be removed from labels.
    It may also be easiest to let rcsc output the function name in front of a return.
    The third change for rcsc would be the number of letters a name may have.
    This is currently 8, but could be increased. I could not find in the manual how many
    letters a prop assembly label may have; I assume 32 is sufficient.
    These changes can be done easily in the rcsc source. Then just recompile and
    rename the generated rcsc.exe to rcsc_prop.exe.

    regards peter
  • Peter Verkaik Posts: 3,956
    edited 2007-03-17 11:17
    I checked the CALL instruction. The address to which a called routine must return
    is encoded in the routine's RET instruction. I think that implies a routine must have
    a single exit point.
    This may well be a problem for rcsc, since rcsc will generate
    a simple RET when you use return in a c file. So a c function can have
    multiple exit points. If you write your c code so that there is only one return,
    at the closing } of a c function, it will work. But that makes writing c code
    quite tedious. To overcome this, rcsc should be altered to always jump to the
    exit point at the closing } whenever a return is used. Not so simple.
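
    To illustrate at the C level what that rewrite amounts to (just an example transformation, not something rcsc does today):

    /* Original: two exit points, so a naive translation would need two RET sites. */
    int clamp(int x, int limit)
    {
        if (x > limit)
            return limit;
        return x;
    }

    /* Rewritten: every return funnels through the single exit at the closing brace,
       which is what the propeller's stackless CALL/RET scheme asks for. */
    int clamp_single_exit(int x, int limit)
    {
        int result;
        if (x > limit) {
            result = limit;
            goto done;
        }
        result = x;
    done:
        return result;
    }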

    regards peter
  • Mike Green Posts: 23,101
    edited 2007-03-17 13:24
    Peter,
    CALL and RET are just special cases of the JMPRET and JMP instructions, respectively. The reason there's a single exit is that the CALL actually stores the return address in the RET instruction's source field. You might want to implement this by allocating a long to hold the return address as an implicit parameter, doing a "JMPRET <exit variable>,<entry point>" for a call, then using a "JMP <exit variable>" for a return.
  • Peter Verkaik Posts: 3,956
    edited 2007-03-17 13:47
    Mike,
    So the prop asm routines do have a single exit point. That's what I figured.
    Your suggestion cannot be constructed from the c source using the current
    rcsc compiler, hence I suggested writing c functions with a single exit point
    (just before the closing } of a c function). The other option is to adapt the rcsc
    source to let it generate a single exit point, meaning a return elsewhere in a function body
    must do a jmp to the function's exit point. This can be done of course, but requires in-depth knowledge
    of the rcsc source. This is not an easy task. Since the prop is stackless, and rcsc generates
    code for a stack-based virtual machine, there is also a need for a stack, because
    all the c math expressions use a (RPN) stack for calculations.
    This is just to inform Christof what he is up against. I have no intention to take up
    this challenge now that I know the prop is stackless.

    regards peter
  • Peter Verkaik Posts: 3,956
    edited 2007-03-18 04:39
    There is a way to implement calls without changing the compiler:
    a call is substituted by pushing the return address on the stack (that must be implemented),
    then jumping to the routine. Every return in the c source is replaced with a jump to a runtime routine
    that pops the return address off the stack and places it in an exit variable. Then that runtime routine
    does a jmp exitvariable.
    So CALL and JMPRET will never be used in the generated assembly. This also means
    there is no requirement for having functionname_ret.
    Using this method, M4 can generate the assembly and the compiler can be used as is.

    regards peter
  • Christof Eb. Posts: 1,161
    edited 2007-03-21 20:06
    Hello Peter,

    many thanks for your input!
    As I have stated, FOR ME it is easier to make little modifications in the source than in M4, because I don't know all the possibilities of M4, and this way I can at least try to understand why SmallC produces this or that code.
    Implementing the data stack is not difficult. But I have now seen some new difficulties:
    1. The Prop does not support indexed addressing modes. Small C generates a lot of addresses of the form (constant + register), with the constant negative or positive. The Prop only supports 9-bit positive literals, so I had to find a way to deal with negative offsets (detect that it is negative, emit the absolute value followed by NEG register,register). But the offset still has to be in the +/- 9-bit range.
    2. The Prop does not support relative jumps.
    3. SmallC does not support 32-bit constants. It would be a pity to bring the system down to 16-bit only.

    Perhaps I will start to use some large memory model as suggested by Bill. Then relative jumps will be possible, and with that I hope inline long offsets too. Perhaps I can get Small C to generate additional labels.

    Generate an address with a constant offset from base BP and load register P:
    MOV HELP,L_x
    JMP #L_y
    L_x long offset_value
    L_y ADD HELP,BP
    MOV P,HELP

    Well, I'm learning a lot about assembler. Perhaps one day I will start to like it....

    Christof