DIY Propeller C
hippy
Posts: 1,981
It seems that no matter what ImageCraft do there will be people who want to create their own C for the Propeller as an alternative or for the fun of doing it, so my question is; how do we turn that into a reality ?
Options seem to be GCC, SDCC, RCSC, PCC, LCC and there may be others. I'm not familiar with any of those but my favoured version is LCC as that is ANSI-C, well documented and can create textual bytecode output which needs only an assembler plus a Virtual Machine to execute the bytecode image generated.
From http://forums.parallax.com/showthread.php?p=741929 ...
Ale said...
Well I have added a new port to LCC, it is crude, it needs some serious help from a kernel but it works (sort of)... but I'll do my best to have something more working really soon. Stay tuned.
I'm intrigued because that seems to be generating code for a register-based machine. I used ID Software's Quake 3 port of LCC and that appears to generate bytecode for a stack machine, but I have no idea of the licensing position doing that or how the Quake compiler diverges from pure-LCC.
While a Stack Machine may be more inefficient when implemented on a Propeller ( I'm not entirely convinced of that ), it should deliver better code density which IMO is important for the Propeller Mk I.
A reasonable and quite simple project plan seems to be ...
1) Find / build LCC compiler executables which can generate stack-based machine bytecode and be freely distributed.
2) Define a Stack Machine (VM) for the Propeller suited to LCC bytecode.
3) Create a translator to turn LCC bytecode into Assembly Language for what will be our Stack Machine.
4) Create an Assembler to turn Assembly Language into a Binary Image.
5) Create the VM to execute the Binary Image in Spin.
6) Create a Linker to combine the Binary Image and VM code into a downloadable .binary / .eeprom file.
7) Migrate the VM from Spin to PASM/LMM
That should give us a single-Cog C execution environment. Adding support for multi-Cog operation and Propeller specifics will probably require a change to the compiler, back-end and the rest of the tool chain. Code optimisation and so on can be added as extra stages later or within the compiler.
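To make steps 2 and 5 of the plan a little more concrete, here is a minimal sketch of a byte-coded stack machine, written in C rather than Spin so it can be tried on a PC. The opcode names and values (`OP_PUSH8`, `OP_ADD`, `OP_SUB`, `OP_HALT`) and the 16-entry stack are assumptions made up for illustration; they are not LCC's bytecodes, nor the ones produced by any tool in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, invented for this sketch only. */
enum { OP_HALT = 0, OP_PUSH8 = 1, OP_ADD = 2, OP_SUB = 3 };

#define STACK_DEPTH 16

/* Interpret a bytecode image until OP_HALT and return the top of stack. */
int32_t vm_run(const uint8_t *code, size_t len)
{
    int32_t stack[STACK_DEPTH];
    size_t sp = 0;               /* index of the next free stack slot */
    size_t pc = 0;               /* index of the next byte to fetch */

    while (pc < len) {
        uint8_t op = code[pc++];
        switch (op) {
        case OP_PUSH8:           /* one-byte signed constant follows the opcode */
            assert(pc < len && sp < STACK_DEPTH);
            stack[sp++] = (int8_t)code[pc++];
            break;
        case OP_ADD:             /* pop b, pop a, push a + b */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_SUB:             /* pop b, pop a, push a - b */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] -= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
    assert(0 && "ran off the end of the image");
    return 0;
}
```

The image `{OP_PUSH8, 2, OP_PUSH8, 3, OP_ADD, OP_HALT}` is six bytes long and leaves 5 on the stack, which gives a feel for the code density a single-byte opcode set can offer.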
ImageCraft are well ahead of the curve on code optimisation and reducing bloat and have a lot of background experience so I don't consider this to be facing-up ICC head-on, and that's certainly not the goal for me - I like writing tools and VM's so this is a good excuse to do that and I don't think it's going to over-stretch me ( unfortunately that was the case with the JVM ).
I can see a problem with any shared / co-operative development on this - I don't use C and others don't use what I do and we'll all likely want to get on rather than change position to something we're not familiar with - so I would suggest we all work together to define the interfaces between each stage and accept we'll likely do our own thing in creating tools for those stages, let the users decide at the end of the day which specific tools to use. That doesn't preclude people choosing to co-operate on any part or even going off to pursue GCC or other ports instead if they prefer. It's more 'let everyone get involved however they want to'.
Seeing the pace of JVM progress I think it should be possible to have something credible working within a month even if it is rough around the edges.
Are people interested in such a project based on LCC; whether that's in writing tools, helping design, giving advice and comment or just cheering from the sidelines ?
Comments
Note: I defined some "registers" in COG memory for compiler's use, 32 in total just as a direct translation from mips assembler.
I already have an assembler that works. It is written in Java rather than C, but it can either be used as it is or translated to C. I'll continue to work on this, but some help is always useful. It seems your pace is a bit faster than mine (guessing from your proplist advances). Plenty to do.
Post Edited (Ale) : 8/24/2008 5:23:54 PM GMT
OBC
Is it possible to reduce code-size relatively speaking? Can the VM use a byte or word-wide instruction set instead of longs like PASM? I thought the JVM byte-codes were nicely defined except for some obfuscation.
Whatever happens, I would really like to see some kind of single step debugger (with variable change trap ability). An independently developed IDE would be nice, but Eclipse could be used until an IDE is made. Eclipse is widely accepted and much more powerful than other IDE I've seen ... one could write Eclipse plug-ins for specialized features if necessary.
One thing that Visual Basic adds that is just wonderful to me at least (and Java in some Eclipse versions), is the ability to modify code on the fly. With a VM implementation, it seems that might be possible.
I'll take on a piece if the project can be sub-divided among developers working independently. You have outlined major components already ... if there was a way to sub-divide these that would be great. I wrote an assembler about 30 years ago and it was nothing more than a token parser, case statement, and file IO ... I might look at that later. I'm busier than a one-legged man right now though. Wonder what GNUASM looks like under the hood.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
That makes sense now; it's effectively LMM code.
No. It's feasible but I'm not your man for that. On the plus side, with a C compiler, one could perhaps bootstrap a tiny-C that way. Maybe already possible with ICC ?
Yes, and I think that's the key to success although it may not give faster than Spin execution, at least not to start with.
A register-to-register instruction set needs a byte to hold source and destination, another to hold the opcode, plus any other operand(s). Fast PASM/LMM would tend towards two longs for that. With a stack machine ( as demonstrated with Spin ) opcodes can be a single byte, and an operand is usually just another byte or word. Only 32-bit numbers need a long and there's optimisation to be had there. The price to pay is slower execution.
With the Quake LCC port, all opcodes can be single byte except two for adjustment on entering and leaving functions and the push constant opcodes, so optimising push constants is the key. Typical of RISC, multiple opcodes have to execute to do anything useful so the bytes add up but again there's an option to optimise there, especially for jump and calls. Code density should be similar to a traditional 8-bitter, 6502, 6800, Z80 etc.
Having looked at stretching a byte-oriented bytecode to word or long to get gains in speed, except for well aligned words and longs I don't think there's a lot of difference; fetching multiple bytes isn't much slower than fetching a single long and extracting the necessary fields from it. To get speed up means aligning everything word or long so extraction time goes down but density reduces. I'm not convinced the speed gains outweigh the density loss.
I'm personally only really interested in the "proving it works" side and hopefully delivering something which is more usable than just proof of concept. LCC emits information which can relate source to image so a source-level debugger should be possible - That might even be something which proves very useful in debugging and proving the VM itself.
Integrating into IDEs I'd leave to others as I'm more than happy with a command line.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
I don't know how to adapt a language to an IDE like people did for the PIC & AVR, but I would sure like to learn.
Can the free (open source) C compilers, assembler, linker or parsers from MinGW & MSYS that some versions of Eclipse use on PCs be adapted for the Prop?
Post Edited (plx88) : 8/25/2008 7:05:34 AM GMT
I tried using Eclipse when doing some development on a Philips ARM chip using the GNU C toolchain. I found it truly hideous and ended up using Notepad++ and a makefile.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I had the same experience... I ended up testing something with a makefile, but the download process and debugging via JTAG did not work :-( so I think I'll get rid of the board. I even tried a commercial solution that did not detect my cable (a usb-ocd by Olimex). ARM brought me only frustration. I think I will pursue my "unix on the propeller" idea even if it is dog-slow (I mean a slow dog). Did I say I need a C compiler ? I'll have one.
Has anybody written a plugin/macro to make Notepad++ recognize Spin files (for syntax highlighting)?
regards peter
I do think that unix on Propeller 1 will be an extreme false economy! It will just prove what sort of capabilities the Propeller doesn't have! Now Propeller II may be a different story - but I'm going to wait for a real device to be released before I even bother thinking about something like that.
Thanks for the warning of your troubles .... I have an ARM board (by accident) and JTAG cable that I haven't tried yet. Guess I'll do that before my return for refund window closes.
Ale, where are you with LCC ?
Mirror, I tend to agree with your "false economy" statement, but no one has *really* tried yet. Even with unlimited code "text" storage and using almost all of Propeller memory just for variables, unix/linux doesn't seem possible. If it does work, it would be dog slow with SDIO access instruction fetch using LMM ... maybe a 4x or 8x speedup would be bearable.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
LCC:
- I have most instructions converted. Some three-register instructions, like add r0, r1, r2, are decomposed into two instructions (I used the MIPS port).
- I have to extract, or better, merge part of the MIPS port with the x86 port to avoid those three-operand instructions. (I bought the book on LCC; I hope it shows up soon.)
- The whole thing will need an LMM kernel. I already have something basic working (call, jmp, ret and push/pop), for only 32 KB of RAM (hub RAM).
- An assembler: I'll use my pLMMAssembler (I wrote it in Java, which does not matter as it works from the command line). I have to add the Spin wrapper to load the LMM kernel and fix the checksum issue, and then it is more or less ready.
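As an illustration of the decomposition mentioned in the first two points, the sketch below lowers a three-operand `add rd, rs, rt` into the two-operand form that PASM-style instructions use. It is only a guess at the idea, not the code in the LCC port: the function name `lower_add3`, the textual output format and the 32-register limit (taken from the earlier note about 32 "registers" in Cog memory) are assumptions.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_REGS 32   /* assumed register count */

/*
 * Lower "add rd, rs, rt" (rd = rs + rt) to two-operand instructions of the
 * form "op dest, src". Writes the text into out and returns the number of
 * instructions emitted.
 */
int lower_add3(int rd, int rs, int rt, char *out, size_t outlen)
{
    assert(rd >= 0 && rd < NUM_REGS);
    assert(rs >= 0 && rs < NUM_REGS);
    assert(rt >= 0 && rt < NUM_REGS);

    if (rd == rs) {                       /* rd += rt needs no copy */
        snprintf(out, outlen, "add r%d, r%d\n", rd, rt);
        return 1;
    }
    if (rd == rt) {                       /* addition commutes: rd += rs */
        snprintf(out, outlen, "add r%d, r%d\n", rd, rs);
        return 1;
    }
    /* General case: copy the first source, then add the second. */
    snprintf(out, outlen, "mov r%d, r%d\nadd r%d, r%d\n", rd, rs, rd, rt);
    return 2;
}
```

The `rd == rt` shortcut only works because addition is commutative; a subtract with the same register pattern would need a temporary register instead.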
All that can take a while, especially generating useful and optimized code, but if it works... we are in business.
I'll post all these tools and things in a wiki page at propellerwiki if you do not mind, so a sort of how-to can be achieved.
- Eclipse plug-in: it is a neat idea, but we can start without it. If someone figures out how to do it, great; it is not something I'm proficient at.
Have fun.
Inspired, I've worked out how to convert the bytecode to something a Propeller VM can use, have an LCC bytecode converter which includes assembler and even does some optimisations, so that's a huge chunk out the way. Next step is generating an actual image ( there's a listing of what it will be so not too hard to do ) and then coding up a VM. Then the big slog of debugging.
I'm quite impressed with the ease of using the LCC bytecodes and they seem to map quite well to a Propeller architecture, though I'm sure I've overlooked something and I'll also have to get to grips with the call stack framing. But looking good so far. Code density looks better than I expected and there are other tweaks possible. The observant will note I've stolen a few of Chip's optimisation tricks used in Spin bytecode.
No documentation on my own bytecodes but anyone familiar with micros should be able to make a good guess - "GTI" Get Indirect ( Pop adr, Push what's at adr ), "NUM.I32 #MEM+offset" Push a constant, type I ( Unsigned Int, S = Signed, P = Ptr (Int), F=Float ) size 32 bits, in this case the address of a global unsigned int. Opcodes are allocated dynamically so I can auto-minimise the kernel size.
It will be interesting to see how a stack version stands up against a register version. I'm feeling confident it will be reasonably fast. Going to take a bit of a rest and read up on LCC but there will be more soon.
I've included the current compiler executable ( source will be released later ), an example generated listing and instructions for compiling your own C files. Consider any bugs a special bonus.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
I keep thinking about converting to Spin bytecodes. It's possible but I'm dithering on that. I think performance would be higher with a dedicated kernel. A lot does depend on the call/return mechanisms and I really haven't got to grips with LCC on that topic. It's always a pain getting a VM up and running but hopefully plain sailing after that.
@ Jazzed : It'll all run in one cog but I'm not sure how big it will be. I'll start with PASM and include LMM if I run out of space and for things like floating point ( I'm not planning on actually coding any FP support ). There looks to be a lot of opcodes but mostly it's informational rather than functional. 160 of the bytecodes are one-byte packed numbers, dealt with in a dozen lines of code. The LCC bytecode uses a long/32-bit stack which is perfect, saves a lot of messing about.
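The post does not say how the 160 one-byte packed numbers are laid out, so the following is only a sketch of one way such a scheme could work. It assumes, purely for illustration, that opcodes 96 to 255 each push a fixed constant in the range -32 to 127; the names `PACKED_BASE`, `PACKED_MIN`, `packed_value` and `encode_packed` are made up here.

```c
#include <assert.h>
#include <stdint.h>

#define PACKED_BASE  96     /* assumed first packed-number opcode */
#define PACKED_COUNT 160    /* number of packed opcodes, as stated in the post */
#define PACKED_MIN   (-32)  /* assumed smallest constant that can be packed */

/* True if this opcode is one of the packed-number opcodes. */
int is_packed_number(uint8_t op)
{
    return op >= PACKED_BASE;
}

/* Decode: the constant a packed-number opcode pushes. */
int32_t packed_value(uint8_t op)
{
    assert(is_packed_number(op));
    return (int32_t)(op - PACKED_BASE) + PACKED_MIN;
}

/*
 * Encode: if v fits the packed range, store the single-byte opcode in *op
 * and return 1; otherwise return 0 so the caller can emit a longer form.
 */
int encode_packed(int32_t v, uint8_t *op)
{
    if (v < PACKED_MIN || v >= PACKED_MIN + PACKED_COUNT) {
        return 0;
    }
    *op = (uint8_t)(v - PACKED_MIN + PACKED_BASE);
    return 1;
}
```

With a layout like this the decode side is a range check and a subtraction, which is consistent with the "dozen lines of code" mentioned above, although the real encoding may well differ.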
The idea of running code from Eeprom appeals to me and I do want to try that at some point just to compare performance. With the compiler itself building the kernel it should be easy to have it compiled conditionally so calls to GetByte can either translate to rdbyte or a routine which does read Eeprom.
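The conditional build described above might look something like this when expressed in C for a PC. Only the name `GetByte` comes from the post; the `CODE_IN_EEPROM` switch, the two arrays standing in for hub RAM and the EEPROM, and `eeprom_read_byte` are assumptions, and on real hardware the EEPROM path would be an I2C routine rather than an array lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IMAGE_SIZE (32u * 1024u)

uint8_t hub_ram[IMAGE_SIZE];  /* stands in for Propeller hub RAM */
uint8_t eeprom[IMAGE_SIZE];   /* stands in for the boot EEPROM */

/* Placeholder for a real I2C read routine; here it just indexes an array. */
uint8_t eeprom_read_byte(uint32_t addr)
{
    assert(addr < IMAGE_SIZE);
    return eeprom[addr];
}

/* Fetch one bytecode byte from whichever store the kernel was built for. */
uint8_t GetByte(uint32_t addr)
{
#ifdef CODE_IN_EEPROM
    return eeprom_read_byte(addr);   /* slow path: code stays in EEPROM */
#else
    assert(addr < IMAGE_SIZE);
    return hub_ram[addr];            /* fast path: a plain rdbyte on the Propeller */
#endif
}
```

Building with `-DCODE_IN_EEPROM` switches every fetch to the EEPROM path without touching the rest of the interpreter, which is what makes a like-for-like speed comparison easy.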
The plan is, as with the Spin VM clone and JVM, to get basic looping working, pushing and popping, then the maths working, it's all much of a muchness regardless of language. Where I lost on the JVM was not understanding its quite complicated and inter-linked data structures. This only needs brute force and ignorance except for the call stack stuff.
The first milestone will be incrementing a variable and see how fast it can do that.
Not finished but it's ticking along, first milestone reached, incrementing a long. Not sure if it's as fast as I'd hoped for but not bad, coming in about 55% faster than Spin. Times to count up to 10 million at 118MHz -
That does depend on how well aligned the variable is. 25% faster than Spin if alignment is really poor. Further optimisations to take advantage of fortunate alignment to come which should squeeze more out of it.
Only the few bytecodes needed tested so far. Kernel is around 430 bytes. Now have a bytecode to Spin file generator, once the kernel is working I'll move on to generating a .eeprom image directly.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
somehow this *thing* does not recognize files with .zip extension ?? So I cannot upload (Ubuntu 7.04, Firefox 3.0 :-( ).
Take my word for it
And the test file :
Well, this takes around 0x8f00_0000 cycles, which @ 80 MHz is around 30 s... add 30% for errors in the cycle count of rdlong/wrlong and you are around ~40s. Not bad.
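The arithmetic behind that estimate can be checked with a few lines of C. 0x8F00_0000 is 2,399,141,888 cycles, which at 80 MHz is just under 30 seconds, and adding the 30% margin gives roughly 39 seconds. The function name is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to seconds at the given system clock frequency. */
double cycles_to_seconds(uint32_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}
```

For example, `cycles_to_seconds(0x8F000000u, 80e6) * 1.3` gives the padded estimate; the same function can be reused with a different clock value to compare results from boards running at other speeds.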
All this was tested with pPropellerSim and pLMMAss.
Files can be sent per email or if this thing starts to accept zips from me (.tar.bz2 are also not accepted :-( )
Note: This was only a test for the lmm kernel. No actual code was created with LCC, ok ? It is just not there yet (constant loading and so on do not work yet).
Have fun
Ale
Post Edited (Ale) : 8/28/2008 5:34:38 AM GMT
The forum only parses the very last extension. From there, people will know what to do.
with mine ) is that there's an awful lot of code needed to move fields around, and still a few rdlong / wrlong as well. This is what made me think that a stack machine, with all opcodes except number load being a single byte, could be quite fast although that's not usually expected to be the case.
I've broken my compiler ( getting confused with allocating opcode values ) so no update but there are a few new things working now, including the ability to include a C program within Spin - thanks to Carl Jacobs for that idea from JDForth. The C program Main can either be executed in its own Cog or any of the C program functions can be called individually and results returned. It's possible to run multiple C programs simultaneously, each with their own stack. I've got a message passing interface so C can effectively call Spin methods ( well, request them to be called ) but that has to be done on a case by case basis in the Spin program.
Optimised the number loading for nearby branching and that got the increment test up to being nearly twice as fast as Spin, a ratio of around 13:1 against PASM.
I've got the code set up for executing from Eeprom, with all 32KB available for data/stack, but have not written the PASM Eeprom routines yet. Because I2C is quite slow, I can run that as LMM and that shouldn't take away too much of the RAM - I can move the routines to top of RAM above the stack so maybe 30KB instead of 32.
It's interesting to note that on the following wiki page http://en.wikipedia.org/wiki/Parallax_Propeller as well as on the Parallax web site http://www.parallax.com/Store/Microcontrollers/PropellerTools/tabid/143/ProductID/510/List/1/Default.aspx?SortField=ProductName,ProductName there is a claim of 5 - 10 times speedup over Spin. I'm not sure what benchmark was used as I believe the speedup achieved with DeSilva's Fibonacci test was only about 3 times.
It probably means that there needs to be a unified set of tests for benchmarking the various language offerings that are becoming available for the Propeller.
BTW: what was the Spin equivalent of your variable increment test? I'd like to see how JDForth compares. I'm sure Cluso would also like to know - then he can do some performance gain testing on his "faster" Spin interpreter. That and of course the Fibonacci test which is already well documented.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Carl Jacobs
JDForth - Forth to Spin Compiler http://www.jacobsdesign.com.au/software/jdforth/jdforth.php
Includes: 32-bit floating point maths. Simple Serial at 57.6K. Fib(28) in 0.86 seconds. ~3x faster than spin, ~40% larger than spin.
The Fibonacci test is one of the worst tests in the sense that all it does is to test the function call overhead. ICC generates native Propeller code executed in a LMM environment. So the approximate speed is about 20-25% of native code, but of course it can access all of Hub memory. With the newly added FCACHE, small loops run in near native speed, after the code is fetched from the Hub memory.
Branching can be further optimized using add/sub for near jumps (+/-511 instructions) and the same for call, taking advantage of the link register. Using these optimizations the cycle count can be reduced from 0x8f00_0000 to 0x8583b16a. A modest improvement, but an improvement nonetheless, and 2 longs shorter.
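A sketch of how an assembler or code generator might choose between the near and far forms follows. It assumes the branch distance is measured in the same unit the kernel's program counter uses and that the limit is the 511 quoted above (the largest value a 9-bit PASM immediate can hold); `pick_branch` and `near_immediate` are hypothetical names, not part of pLMMAssembler.

```c
#include <assert.h>
#include <stdint.h>

#define NEAR_LIMIT 511   /* largest value a 9-bit immediate can hold */

typedef enum { BR_NEAR_ADD, BR_NEAR_SUB, BR_FAR } branch_kind;

/* Choose the cheapest branch form for a signed distance from the current pc. */
branch_kind pick_branch(int32_t delta)
{
    if (delta >= 0 && delta <= NEAR_LIMIT) {
        return BR_NEAR_ADD;              /* add pc, #delta */
    }
    if (delta < 0 && delta >= -NEAR_LIMIT) {
        return BR_NEAR_SUB;              /* sub pc, #-delta */
    }
    return BR_FAR;                       /* needs a full jump with a long operand */
}

/* Immediate operand for a near branch: the magnitude of the distance. */
uint32_t near_immediate(int32_t delta)
{
    assert(pick_branch(delta) != BR_FAR);
    return (uint32_t)(delta < 0 ? -delta : delta);
}
```

For scale, going from 0x8f00_0000 to 0x8583b16a cycles is a saving of roughly 6.6%.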
I also test within subroutine as well ...
The equivalent PASM benchmarks ...
How useful the tests are is debatable but in the absence of anything else there's at least something to compare against. They give an indication of if some compiled code / VM is similar, better or worse than the reference in these cases.
I agree with Richard/ImageCraft that Fibonacci isn't much of a benchmark either but it does test function call overhead and most programs tend to use those.
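For reference, a plain recursive Fibonacci in C looks like the sketch below. It is a generic version and not necessarily the exact code used in DeSilva's test; the indexing convention (fib(0) = 0, fib(1) = 1) is an assumption. Almost all of its running time goes on calls and returns, which is why it mostly measures function call overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Naive recursive Fibonacci with fib(0) = 0 and fib(1) = 1. */
uint32_t fib(uint32_t n)
{
    assert(n <= 47);   /* fib(48) no longer fits in 32 bits */
    if (n < 2) {
        return n;
    }
    return fib(n - 1) + fib(n - 2);
}
```

With this convention fib(28) is 317811.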
32-Bit LMM like ImageCraft LMM-C should be the next fastest to PASM and FCACHE improves that further, although it's then down to the compiler to not FCACHE anything which takes longer to load and execute than were it run as LMM. I had quite disappointing results in my Spin Interpreter when I tried FCACHE but that wasn't looping code.
As the LMM code moves further from being PASM, and becomes an instruction set for its own VM, performance starts to drop as instruction decoding overhead increases.
I think there's a perception ( I know I believed it ) that a rdword is faster than two rdbyte but for an instruction set of 'opcode reg' where both are bytes, reading separately isn't that much extra overhead and it opens the door to byte operands whereas the rdword needs all operands to be word length and word aligned so 8-bit operands have poorer code densities.
Of course, multi-byte operands will take longer to load, but optimisation there helps. Starting with byte code which may be badly aligned one can force word and long operands to be correctly aligned by adding padding bytes. It increases code bloat but improves fetching. The nice thing is it's easy to turn on and off so it's a simple way to have the compiler generate code for highest code density or best speed. There again one has to be careful that adjusting the pc and using a rdword or rdlong doesn't take longer than it would have to read multiple bytes anyway, but if people want speed then even a few cycle savings will help.
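The padding trade-off can be sketched as an emitter with an on/off switch. This is an illustrative guess at the mechanism rather than the converter's actual code: `pad_for`, `emit_operand` and the use of zero bytes as padding are assumptions, and a matching VM would have to skip the padding when it adjusts its pc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of padding bytes needed to bring pc up to a multiple of align. */
size_t pad_for(size_t pc, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);   /* power of two */
    return (align - (pc & (align - 1))) & (align - 1);
}

/*
 * Append a little-endian operand of `size` bytes at image[pc]. If `aligned`
 * is nonzero, zero padding is inserted first so the operand starts on a
 * multiple of its own size. Returns the new pc.
 */
size_t emit_operand(uint8_t *image, size_t pc, uint32_t value,
                    size_t size, int aligned)
{
    assert(size == 1 || size == 2 || size == 4);
    if (aligned) {
        size_t pad = pad_for(pc, size);
        while (pad--) {
            image[pc++] = 0;
        }
    }
    for (size_t i = 0; i < size; i++) {
        image[pc++] = (uint8_t)(value >> (8 * i));
    }
    return pc;
}
```

Emitting a long operand at pc = 1 takes 4 bytes with the switch off and 7 with it on, which is the density cost paid for being able to fetch it with a single aligned read.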
Fastest VM comes from matching the VM instruction set to what PASM can execute quickest and optimising the instruction set to achieve that. It also comes down to how good code optimisation is in the compiler. LCC generates atrocious 'var+=1' code, my 'compiler' optimises that down to just an address load and an increment opcode. In the beta of ImageCraft ICC I recall that R0 was used as a temporary staging register ( mov r0,#long / add r6,r0 or similar ) and it's not an unexpected piece of code to generate. For 32-bit LMM that's three longs and two instruction decodes so to me the obvious optimisations would be to reduce the decode overhead by adding a faster decoded instruction 'movr0 #long' or better still 'add r6,#long' which is faster and saves an entire long (33%), however it's no longer PASM so pure LMM won't work which increases the kernel code and the instruction decoding overhead thus slower speed.
One can compress 'mov r0,#long / add r6,r0' to 'push #long / add r6,<pushed>' then compress it to 'push #long / push #r6 / adi'. This is the theory my bytecode converter is based on with only push having multi-byte overhead, every other opcode single-byte. Hence my belief that a stack-based machine isn't any slower than a register to register one; one still has to read and write hub memory so may as well read write stack, the only overhead is the stack pointer adjustment having done so. It's been quite an eye opener.
Added : Another thing to consider is, that if one's going to miss a hub access sweet spot anyway, that extra cycle time wasted while waiting for the hub access spinner to come around again can be put to good use decoding more complicated instructions with what is effectively zero-overhead, in that it won't be any slower.
Unlike other processors where adding an instruction will slow things down, it may not on the Propeller. If that free time can be used to gain a speed improvement elsewhere then it's a winner. If it brings a later hub access back to a sweet spot it's a real gain and time to crack open a beer. Trying to analyse that though isn't easy.
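A very simplified model shows why extra instructions can be free. On the Propeller 1 each cog gets its hub slot once every 16 system clocks, so a loop containing one hub access effectively takes a whole number of 16-clock windows. The sketch below assumes exactly that and nothing more; `busy` is the number of clocks of actual work per iteration, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define HUB_WINDOW 16u   /* clocks between a cog's hub slots on the Propeller 1 */

/* Clocks one loop iteration really takes once it resynchronises to the hub. */
uint32_t loop_clocks(uint32_t busy)
{
    assert(busy > 0);
    return (busy + HUB_WINDOW - 1) / HUB_WINDOW * HUB_WINDOW;
}

/* Clocks per iteration spent waiting, which extra decoding could use up. */
uint32_t free_clocks(uint32_t busy)
{
    return loop_clocks(busy) - busy;
}
```

In this model a loop with 20 clocks of work and one with 28 clocks of work both take 32 clocks per iteration, so two extra 4-clock instructions cost nothing, while a loop with 33 clocks of work slips to 48.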
Post Edited (hippy) : 8/28/2008 2:29:27 PM GMT
(*) I'll worry about what least used means later; probably some compiler option to force certain code into Cog rather than LMM. To start with I'm just going to bale out if Cog gets full.
I would personally love to get Interactive-C implemented on the Propeller chip. It has already been ported to several controllers like the 68HC11 HandyBoard, Lego Brick, and the GBC. I've used it on several projects and it is pretty cool. Below is a link to the software with another link to the source code:
http://www.handyboard.com/software/base.html
The nice part is that I'm sure a program on the Propeller could emulate what one of the other controllers looks like, and the existing code/tools may just work.
Robert
1) Fibonacci - gives an indication of function call overhead and has been the de facto benchmark on this forum until now.
2) Increment global variable in loop:
3) Increment global variable in subroutine:
4) An implementation of SimpleSerial - in high level code. Spin does 19200 baud. JDForth does 57600 baud. C does ??? (Richard??).
5) I'm open to suggestions for other tests - in the long run, the whole Propeller community will benefit from the results.
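For a C entry in such a comparison, tests 2 and 3 could be written along the following lines. This is a hypothetical C counterpart, not a translation of any Spin or Forth code posted in the thread; the function names are invented, and the limit would be 10 million to match the increment timings quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

/* volatile so an optimising compiler cannot remove the loops entirely */
volatile uint32_t counter;

/* Test 2: increment a global variable directly in the loop. */
void bench_loop(uint32_t limit)
{
    counter = 0;
    while (counter < limit) {
        counter++;
    }
    assert(counter == limit);
}

/* Helper for test 3: the increment lives in its own function. */
void bump(void)
{
    counter++;
}

/* Test 3: increment the same global through a subroutine call. */
void bench_sub(uint32_t limit)
{
    counter = 0;
    while (counter < limit) {
        bump();
    }
    assert(counter == limit);
}
```

The difference between the two timings gives a rough figure for call and return overhead, which complements what the Fibonacci test measures.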
BTW I'm making this up as I go - I haven't yet done these tests, so do not know the result. In all cases Spin would be the benchmark, as it is the only way we can compensate for different systems. Eg: I run my processor at a standard 80MHz, but it seems that Hippy runs his processor at 118MHz.
I'm open to any other suggestions for tests. Obviously Cluso with his faster Spin interpreter also needs some benchmarks to test against, and it will give us a speed comparison gauge once the Prop II arrives - although to me that sounds like it will be a little while yet.
Hippy, regarding your comment about reading misaligned words. I think I found about the same result with JDForth. Being careful with algorithms has had a much larger effect than the couple of extra clocks required to do multiple rdword's to simulate a rdlong. The Propeller is quite memory constrained compared to its abilities, so I'm taking the opportunity to save as many bytes as possible. Your idea of loading bits of code out of EEPROM is interesting, you may have noticed that the JDForth byte-codes (word-codes) are completely relocatable in memory. So, there may be some examples of dynamically loaded code in the future, although nothing soon.