DIY Propeller C

hippy · 2008-08-24 16:58

It seems that no matter what ImageCraft do there will be people who want to create their own C for the Propeller as an alternative or for the fun of doing it, so my question is; how do we turn that into a reality ?

Options seem to be GCC, SDCC, RCSC, PCC, LCC and there may be others. I'm not familiar with any of those but my favoured version is LCC as that is ANSI-C, well documented and can create textual bytecode output which needs only an assembler plus a Virtual Machine to execute the bytecode image generated.

From http://forums.parallax.com/showthread.php?p=741929 ...

Ale said...
Well I have added a new port to LCC, it is crude, it needs some serious help from a kernel but it works (sort of)... but I'll do my best to have something more working really soon. Stay tuned.

I'm intrigued because that seems to be generating code for a register-based machine. I used id Software's Quake 3 port of LCC and that appears to generate bytecode for a stack machine, but I have no idea of the licensing position of doing that or how the Quake compiler diverges from pure-LCC.

While a Stack Machine may be more inefficient when implemented on a Propeller ( I'm not entirely convinced of that ), it should deliver better code density which IMO is important for the Propeller Mk I.

A reasonable and quite simple project plan seems to be ...

1) Find / build LCC compiler executables which can generate stack-based machine bytecode and be freely distributed.

2) Define a Stack Machine (VM) for the Propeller suited to LCC bytecode.

3) Create a translator to turn LCC bytecode into Assembly Language for what will be our Stack Machine.

4) Create an Assembler to turn Assembly Language into a Binary Image.

5) Create the VM to execute the Binary Image in Spin.

6) Create a Linker to combine the Binary Image and VM code into a downloadable .binary / .eeprom file.

7) Migrate the VM from Spin to PASM/LMM

That should give us a single-Cog C execution environment. Adding support for multi-Cog operation and Propeller specifics will probably require a change to the compiler, back-end and the rest of the tool chain. Code optimisation and so on can be added as extra stages later or within the compiler.
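As a rough illustration of steps 2 and 5 of the plan above, the core of a stack VM is just a fetch/dispatch loop. The sketch below is a Python model, not Propeller code; the opcode names and numbering are invented for the example and are not the bytecodes of LCC or of the Quake port.

```python
# Toy stack machine. Opcode names and values are hypothetical.
PUSH, ADD, LOAD, STORE, HALT = range(5)

def run(code, memory):
    """Execute `code` (a list of ints) against `memory` (a list of ints)."""
    stack, pc = [], 0
    while True:
        op = code[pc]
        pc += 1
        if op == PUSH:                    # the next cell is an immediate constant
            stack.append(code[pc])
            pc += 1
        elif op == ADD:                   # pop two values, push their 32-bit sum
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)
        elif op == LOAD:                  # pop an address, push memory[address]
            stack.append(memory[stack.pop()])
        elif op == STORE:                 # pop an address, then the value to store
            addr = stack.pop()
            memory[addr] = stack.pop()
        elif op == HALT:
            return memory
        else:
            raise ValueError(f"unknown opcode {op}")

# memory[0] += 1, written for the toy machine
increment = [PUSH, 0, LOAD, PUSH, 1, ADD, PUSH, 0, STORE, HALT]
```

Calling `run(increment, [41])` leaves 42 in the first memory cell. A real kernel would do the same dispatch in PASM, with the stack and the variables in hub memory.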

ImageCraft are well ahead of the curve on code optimisation and reducing bloat and have a lot of background experience so I don't consider this to be facing-up ICC head-on, and that's certainly not the goal for me - I like writing tools and VM's so this is a good excuse to do that and I don't think it's going to over-stretch me ( unfortunately that was the case with the JVM ).

I can see a problem with any shared / co-operative development on this - I don't use C and others don't use what I do and we'll all likely want to get on rather than change position to something we're not familiar with - so I would suggest we all work together to define the interfaces between each stage and accept we'll likely do our own thing in creating tools for those stages, let the users decide at the end of the day which specific tools to use. That doesn't preclude people choosing to co-operate on any part or even going off to pursue GCC or other ports instead if they prefer. It's more 'let everyone get involved however they want to'.

Seeing the pace of JVM progress I think it should be possible to have something credible working within a month even if it is rough around the edges.

Are people interested in such a project based on LCC; whether that's in writing tools, helping design, giving advice and comment or just cheering from the sidelines ?

Ale · 2008-08-24 17:17

hippy, I already started with LCC, translated some opcodes and made some modifications to the code. I even posted a small and sort of not bad looking example. I even bought the book, I hope it will show up this week

Note: I defined some "registers" in COG memory for compiler's use, 32 in total just as a direct translation from mips assembler.

An assembler that works, I already have, it is not C it is java but it can be either used as it is or translated to C. I'll continue to work on this, but some help as always is useful. It seems your pace is a bit faster than mine. (guessing from your proplist advances). Plenty to do.

Post Edited (Ale) : 8/24/2008 5:23:54 PM GMT

Oldbitcollector (Jeff) · 2008-08-24 17:24

hippy, could we talk you into an onboard tiny-c compiler?

OBC


jazzed · 2008-08-24 17:50

If advice == opinion, you should have no shortage of advice :)

Is it possible to reduce code-size relatively speaking? Can the VM use a byte or word-wide instruction set instead of longs like PASM? I thought the JVM byte-codes were nicely defined except for some obfuscation.

Whatever happens, I would really like to see some kind of single step debugger (with variable change trap ability). An independently developed IDE would be nice, but Eclipse could be used until an IDE is made. Eclipse is widely accepted and much more powerful than other IDE I've seen ... one could write Eclipse plug-ins for specialized features if necessary.

One thing that Visual Basic adds that is just wonderful to me at least (and Java in some Eclipse versions), is the ability to modify code on the fly. With a VM implementation, it seems that might be possible.

I'll take on a piece if the project can be sub-divided among developers working independently. You have outlined major components already ... if there was a way to sub-divide these that would be great. I wrote an assembler about 30 years ago and it was nothing more than a token parser, case statement, and file IO ... I might look at that later. I'm busier than a one legged man right now though. Wonder what GNUASM looks like under the hood.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

hippy · 2008-08-24 20:29

Ale said...
Note: I defined some "registers" in COG memory for compiler's use, 32 in total just as a direct translation from mips assembler.

That makes sense now; it's effectively LMM code.

Oldbitcollector said...
hippy, could we talk you into an onboard tiny-c compiler?

No. It's feasible but I'm not your man for that. On the plus side, with a C compiler, one could perhaps bootstrap a tiny-C that way. Maybe already possible with ICC ?

jazzed said...
Is it possible to reduce code-size relatively speaking? Can the VM use a byte or word-wide instruction set instead of longs like PASM?

Yes, and I think that's the key to success although it may not give faster than Spin execution, at least not to start with.

A register-to-register based instruction set needs a byte to hold source and destination, another to hold the opcode, plus any other operand(s). Fast PASM/LMM would tend towards two longs for that. With a stack machine ( as demonstrated with Spin ) opcodes can be single byte, an operand usually just another byte or word. Only 32-bit numbers need a long and there's optimisation to be had there. The price to pay is slower execution.

With the Quake LCC port, all opcodes can be single byte except two for adjustment on entering and leaving functions and the push constant opcodes, so optimising push constants is the key. Typical of RISC, multiple opcodes have to execute to do anything useful so the bytes add up but again there's an option to optimise there, especially for jump and calls. Code density should be similar to a traditional 8-bitter, 6502, 6800, Z80 etc.
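To make the push-constant point concrete, here is a minimal sketch of a size-optimised constant encoder. The three opcode values and the little-endian operand layout are assumptions made for illustration; they are not the encoding used by the Quake LCC port or by any tool in this thread.

```python
import struct

# Hypothetical opcode values for three widths of "push constant".
PUSH8, PUSH16, PUSH32 = 0x01, 0x02, 0x03

def encode_push(value):
    """Encode a 32-bit constant using the shortest operand that holds it."""
    value &= 0xFFFFFFFF
    if value < 0x100:
        return bytes([PUSH8, value])                      # 2 bytes in total
    if value < 0x10000:
        return bytes([PUSH16]) + struct.pack("<H", value)  # 3 bytes in total
    return bytes([PUSH32]) + struct.pack("<I", value)      # 5 bytes in total
```

With this scheme a hub address such as $7FFC costs three bytes rather than five, which is where most of the density gain for addresses and small numbers would come from.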

Having looked at stretching a byte-oriented bytecode to word or long to get gains in speed, except for well aligned words and longs I don't think there's a lot of difference; fetching multiple bytes isn't much slower than fetching a single long and extracting the necessary fields from it. To get speed up means aligning everything word or long so extraction time goes down but density reduces. I'm not convinced the speed gains outweigh the density loss.

jazzed said...
Whatever happens, I would really like to see some kind of single step debugger (with variable change trap ability). An independently developed IDE would be nice, but Eclipse could be used until an IDE is made.

I'm personally only really interested in the "proving it works" side and hopefully delivering something which is more usable than just proof of concept. LCC emits information which can relate source to image so a source-level debugger should be possible - That might even be something which proves very useful in debugging and proving the VM itself.

Integrating into IDEs I'd leave to others as I'm more than happy with a command line.

jazzed · 2008-08-24 23:36

If command line tools like c compiler, linker, archiver, etc... and gdb client/server for example are available, Eclipse is mostly a cake-walk. No idea if Eclipse or Java can run on Windows 98 though.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

plx88 · 2008-08-25 06:58

Here is a link to how Eclipse was adapted for use with PIC's and C18 compiler. Eclipse has C development tools. Also here is a link to the AVR plugin for Eclipse

I don't know how to adapt a language to an IDE like people did for the PIC & AVR, but I would sure like to learn.

Can the free (open source) C compilers, assemblers, linkers or parsers from MinGW & MSYS that some versions of Eclipse use on computers be adapted for the Prop?

Post Edited (plx88) : 8/25/2008 7:05:34 AM GMT

mirror · 2008-08-25 07:54

It's funny how people have different opinions on things...

I tried using Eclipse when doing some development on a Philips ARM chip using the GNU C toolchain. I found it truly hideous and ended up using NotePad++ and a makefile.


Ale · 2008-08-25 08:36

Mirror:

I had a same experience... ended up testing something with a makefile but the download process and debug via jtag did not work :-(, I think I'll get rid of the board. I even used some commercial solution that did not detect my cable (and usb-ocd by Olimex). ARM brought me only frustration. I think I will pursue my "unix on the propeller" idea even if it is dog-slow (I mean a slow dog). did I say I need a C compiler ? I'll have one.

Peter Verkaik · 2008-08-25 08:37

I would prefer notepad++ even for spin files.
Anybody wrote a plugin/macro to make notepad++ recognize spin files?
(for syntax highlighting).

regards peter

mirror · 2008-08-25 10:20

Ale said...
Mirror:

I had a same experience... ended up testing something with a makefile but the download process and debug via jtag did not work :-(, I think I'll get rid of the board. I even used some commercial solution that did not detect my cable (and usb-ocd by Olimex). ARM brought me only frustration. I think I will pursue my "unix on the propeller" idea even if it is dog-slow (I mean a slow dog). did I say I need a C compiler ? I'll have one.

Ale, this is now getting seriously OT, but I also used an Olimex board. I have a usb-ocd but ended up doing all the debug via a serial port. I'm used to debugging embedded stuff via a serial port, and it can't be too shabby as I ended up with a working product that included ethernet with TCP/IP.

I do think that unix on Propeller 1 will be an extremely false economy! It will just prove what sort of capabilities the Propeller doesn't have! Now Propeller II may be a different story - but I'm going to wait for a real device to be released before I even bother thinking about something like that.

jazzed · 2008-08-25 12:48

So you condemn Eclipse because of ARM? I used it for Java long before C/C++. Eclipse with standard C/C++ environment plug-in works fine with ICC/download ... no idea how it will work with debug when that's available.

Thanks for the warning of your troubles .... I have an ARM board (by accident) and JTAG cable that I haven't tried yet. Guess I'll do that before my return for refund window closes.

Ale, where are you with LCC ?

Mirror, I tend to agree with your "false economy" statement, but no one has *really* tried yet. Even with unlimited code "text" storage and using almost all of Propeller memory just for variables, unix/linux doesn't seem possible. If it does work, it would be dog slow with SDIO access instruction fetch using LMM ... maybe a 4x or 8x speedup would be bearable.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Ale · 2008-08-25 13:05

Jazzed: I was planning on using external RAM, the protoboard is already wired. I do not care about speed in that case, remember that C was developed in a 8K/24K machine... the prop is faster with a similar amount of RAM, but I just wanted to develop something like that for the fun of it (like a uP with relays).

LCC:
- well I have most instructions converted. Some multi-register instructions are decomposed into 2 instructions... like add r0, r1, r2 (I used the mips port).
- I have to extract or better merge part of the mips with the x86 port to avoid those 3 operand instructions. (I bought the book on LCC I hope it shows up soon).
- The whole thing will need some lmm kernel, something basic call, jmp, ret and push/pop I have already working, for only 32k RAM (HUB RAM).
- An assembler: I'll use my pLMMAssembler (I wrote it in java, it does not matter as it works from the command line). I have to add the spin wrapper to load the lmm kernel and fix the checksum issue, and then it is sort of ready.

All that can take a while, especially generating useful and optimized code, but if it works... we are in business.

I'll post all these tools and things in a wiki page at propellerwiki if you do not mind, so a sort of how-to can be achieved.

-Eclipse plug-in : it is a neat idea, but we can start without it. If someone figures the how to, great, it is not something I'm proficient at.

Have fun.

hippy · 2008-08-25 15:11

Well, it looks like it could be easier than I thought ... if you have time on your hands -- and who needs sleep when having fun !

Inspired, I've worked out how to convert the bytecode to something a Propeller VM can use, have an LCC bytecode converter which includes assembler and even does some optimisations, so that's a huge chunk out the way. Next step is generating an actual image ( there's a listing of what it will be so not too hard to do ) and then coding up a VM. Then the big slog of debugging.

I'm quite impressed with the ease of using the LCC bytecodes and they seem to map quite well to a Propeller architecture, though I'm sure I've overlooked something and I'll also have to get to grips with the call stack framing. But looking good so far. Code density looks better than I expected and there are other tweaks possible. The observant will note I've stolen a few of Chip's optimisation tricks used in Spin bytecode.

No documentation on my own bytecodes but anyone familiar with micros should be able to make a good guess - "GTI" Get Indirect ( Pop adr, Push what's at adr ), "NUM.I32 #MEM+offset" Push a constant, type I ( Unsigned Int, S = Signed, P = Ptr (Int), F=Float ) size 32 bits, in this case the address of a global unsigned int. Opcodes are allocated dynamically so I can auto-minimise the kernel size.

It will be interesting to see how a stack version stands up against a register version. I'm feeling confident it will be reasonably fast. Going to take a bit of a rest and read up on LCC but there will be more soon.

I've included the current compiler executable ( source will be released later ) and example generated listing and instructions for compiling your own C files. Consider any bugs a special bonus.

Ale · 2008-08-25 15:53

Very interesting indeed ! That was... fast ! Is it not possible to use the Spin interpreter's byte code ?... bad idea... the calls and rets are... special. But you already have sort of a vm written after you translated the spin interpreter to RAM launch so it should not be that difficult... or am I wrong ?

jazzed · 2008-08-25 16:22

Looks good. How big would you estimate a PASM VM with all planned features would be? Can you do all the VM in one COG? If so, that would leave 32KB Prop memory for bss segment data if text segment data could be put in EEPROM or SDCARD or some other I2C attached storage (access would be in another COG of course). Separating segments would require a special instruction fetcher of course. Do you plan starting with LMM or just straight PASM?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

hippy · 2008-08-25 19:24

@ Ale : I had looked at LCC earlier but abandoned that when ICC was promised and the LCC bytecode is quite similar to Pascal P4 in a number of ways. The peculiar bytecode I'm emitting is based on previous work I've done.

I keep thinking about converting to Spin bytecodes. It's possible but I'm dithering on that. I think performance would be higher with a dedicated kernel. A lot does depend on the call/return mechanisms and I really haven't got to grips with LCC on that topic. It's always a pain getting a VM up and running but hopefully plain sailing after that.

@ Jazzed : It'll all run in one cog but I'm not sure how big it will be. I'll start with PASM and include LMM if I run out of space and for things like floating point ( I'm not planning on actually coding any FP support ). There looks to be a lot of opcodes but mostly it's informational rather than functional. 160 of the bytecodes are one-byte packed numbers, dealt with in a dozen lines of code. The LCC bytecode uses a long/32-bit stack which is perfect, saves a lot of messing about.

The idea of running code from Eeprom appeals to me and I do want to try that at some point just to compare performance. With the compiler itself building the kernel it should be easy to have it compiled conditionally so calls to GetByte can either translate to rdbyte or a routine which does read Eeprom.
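A small Python model of that idea: the interpreter only ever calls a get-byte function, and the build decides which one it is given. The names `hub_reader` and `eeprom_reader` are hypothetical, and the list append merely stands in for a slow I2C transaction.

```python
def hub_reader(image):
    """Fetch straight from the in-memory image (models a plain rdbyte)."""
    return lambda addr: image[addr]

def eeprom_reader(eeprom, log):
    """Fetch through a routine that would talk to the EEPROM."""
    def read(addr):
        log.append(addr)          # placeholder for the I2C read sequence
        return eeprom[addr]
    return read

def fetch_bytes(get_byte, start, count):
    """The VM side: identical code whichever reader it was built with."""
    return bytes(get_byte(addr) for addr in range(start, start + count))
```

Because `fetch_bytes` never knows where the byte came from, the two variants can be timed against each other without touching the rest of the kernel.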

The plan is, as with the Spin VM clone and JVM, to get basic looping working, pushing and popping, then the maths working, it's all much a muchness regardless of language. Where I lost on the JVM was not understanding its quite complicated and inter-linked data structures. This only needs brute force and ignorance except for the call stack stuff.

The first milestone will be incrementing a variable and see how fast it can do that.

hippy · 2008-08-27 04:32

Another It's Alive ! moment ...

Not finished but it's ticking along, first milestone reached, incrementing a long. Not sure if it's as fast as I'd hoped for but not bad, coming in about 55% faster than Spin. Times to count up to 100 million at 118MHz -

Pasm :  28s  3,571,428 increments per second        1:1
Spin : 640s    156,250 increments per second       23:1
Lcc  : 410s    243,900 increments per second       15:1

That does depend on how well aligned the variable is. 25% faster than Spin if alignment is really poor. Further optimisations to take advantage of fortunate alignment to come which should squeeze more out of it.
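The ratio column can be checked from the times alone. A quick Python check, using only the three times quoted above:

```python
# Run times in seconds, copied from the table.
times_s = {"Pasm": 28, "Spin": 640, "Lcc": 410}

def ratio_to_pasm(name):
    """How many times slower than PASM the named entry is."""
    return times_s[name] / times_s["Pasm"]

def percent_faster(fast, slow):
    """How much faster `fast` is than `slow`, as a percentage."""
    return (times_s[slow] / times_s[fast] - 1) * 100

for name in times_s:
    print(f"{name:5s} {ratio_to_pasm(name):5.1f}:1")
print(f"Lcc is {percent_faster('Lcc', 'Spin'):.0f}% faster than Spin")
```

The ratios round to 23:1 and 15:1, and 640/410 works out at roughly 56% faster, in line with the "about 55%" quoted.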

Only the few bytecodes needed tested so far. Kernel is around 430 bytes. Now have a bytecode to Spin file generator, once the kernel is working I'll move on to generating a .eeprom image directly.

Ale · 2008-08-27 04:58

Pretty neat !, well written and documented !. Now it is my turn no ?

jazzed · 2008-08-27 05:15

Looking good so far. All I can do is cheer for you right now. Can't wait to see your next installment.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Ale · 2008-08-27 12:24

Well,

somehow this *thing* does not recognize files with .zip extension ?? So I cannot upload (Ubuntu 7.04, Firefox 3.0 :-( ).

Take my word for it

' Simple LMM kernel for LCC
' (c) 2008 Pacito.Sys

' Hei Hippy I hope this works well enough ;)

.section cog cog0

krnl_init       mov      krnl_pc,PAR
                jmp      #krnl_fetch

krnl_r0         long 0
krnl_r1         long 0
krnl_r2         long 0
krnl_r3         long 0
krnl_r4         long 0
krnl_r5         long 0
krnl_r6         long 0
krnl_r7         long 0
krnl_r8         long 0
krnl_r9         long 0
krnl_r10        long 0
krnl_r11        long 0
krnl_r12        long 0
krnl_r13        long 0
krnl_lr
krnl_r14        long 0
krnl_sp
krnl_r15        long 0
krnl_pc
krnl_r16        long 0
krnl_t1         long 0        ' temporal 1
krnl_t2         long 0
krnl_cnt_ofsmask long $7fff
                long 0
                long 0
                long 0
                long 0
' Scratch for call
krnl_call_ret
krnl_jmp_ret
krnl_rts_ret
krnl_push_lr_ret
krnl_pop_lr_ret
krnl_get_arg_ret
krnl_put_arg_ret
krnl_load_cnt_ret
krnl_exit_ret
                long   0

' Kernel fetch routine

krnl_inc_pc     add     krnl_pc,#4
krnl_fetch      rdlong  krnl_inst,krnl_pc
                add     krnl_pc,#4
krnl_inst       nop
                jmp     #krnl_fetch

' address 0x20
krnl_call       mov     krnl_lr,krnl_pc ' save PC in link register
                add     krnl_lr,#4
                rdlong  krnl_pc,krnl_pc ' address of destination
                jmp     #krnl_fetch

krnl_jmp        rdlong  krnl_pc,krnl_pc   ' reads address of destination
                jmp     #krnl_fetch

                long    0,0 ' spacer

' returns from a call using the value in the link register
' this could be avoided later on using just a mov
krnl_rts        mov     krnl_pc,krnl_lr
                jmp     #krnl_fetch
                long    0,0     ' spacer

' pushes the link register to the stack
krnl_push_lr    sub     krnl_sp,#4
                wrlong  krnl_lr,krnl_sp
                jmp     #krnl_fetch

                nop     ' spacer

' pops the link register from the stack
krnl_pop_lr     rdlong  krnl_lr,krnl_sp
                add     krnl_sp,#4
                jmp     #krnl_fetch     ' stack underflow should be checked here

                nop     ' spacer

' gets an argument from stack
' the long that follows has 2 arguments
'
' 31       23 22 18 17     0
' +----------+-----+--------+
' | Addr_reg |  0  | offset |
' +----------+-----+--------+

krnl_get_arg    rdlong  krnl_t1,krnl_pc
                mov     krnl_t2,krnl_t1
                shr     krnl_t1,#23             ' address of destination register!
                movd    krnl_get_arg_g,krnl_t1  ' sets dest address
                and     krnl_t2,krnl_cnt_ofsmask
                add     krnl_t2,krnl_sp
krnl_get_arg_g  rdlong  0,krnl_t2
                jmp     #krnl_inc_pc            ' next instruction

' puts an argument back to the stack
' the long that follows has 2 arguments
'
' 31       23 22 18 17     0
' +----------+-----+--------+
' | Addr_reg |  0  | offset |
' +----------+-----+--------+

krnl_put_arg    rdlong  krnl_t1,krnl_pc
                mov     krnl_t2,krnl_t1
                shr     krnl_t1,#23             ' address of source register!
                movd    krnl_put_arg_g,krnl_t1  ' sets the register wrlong will store
                and     krnl_t2,krnl_cnt_ofsmask
                add     krnl_t2,krnl_sp
krnl_put_arg_g  wrlong  0,krnl_t2
                jmp     #krnl_inc_pc            ' next instruction

' loads a constant into a register
' the long that follows has 2 arguments
'
' 31       23 22 18 17     0
' +----------+-----+--------+
' | Addr_reg |  0  | offset |
' +----------+-----+--------+
krnl_load_cnt   rdlong  krnl_t1,krnl_pc
                mov     krnl_t2,krnl_t1
                shr     krnl_t1,#23             ' address of destination register!
                movd    krnl_load_cnt_g,krnl_t1 ' sets dest address
                and     krnl_t2,krnl_cnt_ofsmask
krnl_load_cnt_g rdlong  0,krnl_t2
                jmp     #krnl_inc_pc            ' next instruction

                long    0     ' spacer


' Program termination
' A cogstop should be issued, or not ?
krnl_exit       jmp     #krnl_exit              ' end program
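
The operand long used by krnl_get_arg, krnl_put_arg and krnl_load_cnt can be modelled in a few lines of Python. This is only a sketch of what the listing above does (shift right by 23 for the register, mask with $7fff for the offset); the function names are made up for the example.

```python
OFFSET_MASK = 0x7FFF      # krnl_cnt_ofsmask in the kernel listing
REG_SHIFT = 23            # the kernel does: shr krnl_t1,#23

def pack_operand(reg, offset):
    """Build the long that follows a kernel call: (reg << 23) + offset."""
    if not 0 <= reg < 512:
        raise ValueError("cog register address must fit in 9 bits")
    if offset & ~OFFSET_MASK:
        raise ValueError("offset must fit in the 15-bit mask")
    return (reg << REG_SHIFT) + offset

def unpack_operand(word):
    """Recover (register address, offset) the same way the kernel does."""
    return word >> REG_SHIFT, word & OFFSET_MASK
```

For instance, packing cog address 17 (where krnl_sp sits in the listing) with an offset of 8 and unpacking it again returns the same pair.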

And the test file :


' LMM example

.section lmm lmm

' LMM kernel definitions
' registers
krnl_r0 = 2
krnl_r1 = 3
krnl_r2 = 4
krnl_r3 = 5
krnl_r4 = 6
krnl_r5 = 7
krnl_r6 = 8
krnl_r7 = 9
krnl_r8 = 10
krnl_r9 = 11
krnl_r10 = 12
krnl_r11 = 13
krnl_r12 = 14
krnl_r13 = 15
krnl_lr = 16
krnl_sp = 17
krnl_pc = 18
' kernel functions
krnl_call = $20
krnl_jmp = $24
krnl_rts = $28
krnl_push_lr = $2c
krnl_pop_lr = $30
krnl_get_arg = $34
krnl_put_arg = $3c
krnl_load_cnt = $44
krnl_exit = $4c
' returns
krnl_call_ret = $1a
krnl_jmp_ret = $1a
krnl_rts_ret = $1a
krnl_push_lr_ret = $1a
krnl_pop_lr_ret = $1a
krnl_get_arg_ret = $1a
krnl_put_arg_ret = $1a
krnl_load_cnt_ret = $1a
krnl_exit_ret = $1a

        call    #krnl_load_cnt
        long    (krnl_sp<<23)+L003
        call    #krnl_push_lr

        sub    krnl_sp,#4    ' reserves space

        mov    krnl_r0,#0
        call    #krnl_put_arg
        long    (krnl_r0<<23)+0

        call    #krnl_load_cnt
        long    (krnl_r1<<23)+L004

        mov    krnl_r0,#0

L001        cmp    krnl_r1,krnl_r0 wc, wz
    if_z    call    #krnl_jmp
        long    L002

        call    #krnl_get_arg
        long    (krnl_r2<<23)+0

        add    krnl_r2,#1

        call    #krnl_put_arg
        long    (krnl_r2<<23)+0

        add    krnl_r0,#1

        call    #krnl_jmp
        long    L001

L002        call    #krnl_pop_lr
        call    #krnl_exit

L003        long    $7ffc
L004        long    10_000_000

Well this takes around 0x8f00_0000 cycles, which @ 80 MHz is around 30 s... add 30% for errors in the cycle count of rdlong/wrlong and you are around ~40s. Not bad.
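The conversion from simulated cycles to wall-clock time is simple enough to check. A short Python sketch using the figures from this post (the cycle count, the 80 MHz clock and the 10_000_000 loop count at L004):

```python
CLOCK_HZ = 80_000_000
LOOPS = 10_000_000            # value stored at L004 in the test file

def seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count into seconds at the given clock rate."""
    return cycles / clock_hz

cycles = 0x8F00_0000
print(f"{seconds(cycles):.1f} s, {cycles / LOOPS:.0f} cycles per loop")
print(f"with a 30% margin: {seconds(cycles) * 1.3:.0f} s")
```

That comes to just under 30 s, or about 240 cycles per pass through the loop, and roughly 39 s once the 30% margin is added.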

All this was tested with pPropellerSim and pLMMAss.

Files can be sent per email or if this thing starts to accept zips from me (.tar.bz2 are also not accepted :-( )

Note: This was only a test for the lmm kernel. No actual code was created with LCC, ok ? It is just not there yet (constant loading and so on do not work yet).

Have fun

Ale

Post Edited (Ale) : 8/28/2008 5:34:38 AM GMT

potatohead · 2008-08-27 16:53

Just rename the file [filename].zip.safe!

The forum only parses the very last extension. From there, people will know what to do.


hippy · 2008-08-28 05:10

@ Ale : That looks interesting. One thing I note with register based LMMs ( and it's the same with mine ) is that there's an awful lot of code needed to move fields around, and still a few rdlong / wrlong as well. This is what made me think that a stack machine, with all opcodes except number load being a single byte, could be quite fast although that's usually not expected to be the case.

I've broken my compiler ( getting confused with allocating opcode values ) so no update but there are a few new things working now, including the ability to include a C program within Spin - thanks to Carl Jacobs for that idea from JDForth. The C program Main can either be executed in its own Cog or any of the C program functions can be called individually and results returned. It's possible to run multiple C programs simultaneously, each with their own stack. I've got a message passing interface so C can effectively call Spin methods ( well, request them to be called ) but that has to be done on a case by case basis in the Spin program.

Optimised the number loading for nearby branching and that got the increment test up to being nearly twice as fast as Spin, a ratio of around 13:1 against PASM.

I've got the code set up for executing from Eeprom, all 32KB available for data/stack but not written the PASM Eeprom routines yet. Because I2C is quite slow, I can run that as LMM and that shouldn't take away too much of the RAM - I can move the routines to top of RAM above the stack so maybe 30KB instead of 32.

Carl Jacobs · 2008-08-28 06:35

Hippy, I think you're on a winner with C co-existing with spin. I always thought that the death of the other forth offerings was their non-cooperative use of the chip - hence JDForth. It seems a great loss to abandon the rich resources already in the object exchange!

It's interesting to note that on the following wiki page http://en.wikipedia.org/wiki/Parallax_Propeller as well as on the Parallax web site http://www.parallax.com/Store/Microcontrollers/PropellerTools/tabid/143/ProductID/510/List/1/Default.aspx?SortField=ProductName,ProductName there is a claim of 5 - 10 times speedup over spin. I'm not sure what benchmark was used as I believe the speedup achieved with DeSilva's Fibonacci test was only about 3 times.

It probably means that there needs to be a unified set of tests for benchmarking the various language offerings that are becoming available for the Propeller.

BTW: what was the spin equivalent of your variable increment test? I'd like to see how JDForth compares. I'm sure Cluso would also like to know - then he can do some performance gain testing on his "faster" spin interpreter. That and of course the Fibonacci test which is already well documented.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Carl Jacobs

JDForth - Forth to Spin Compiler http://www.jacobsdesign.com.au/software/jdforth/jdforth.php
Includes: 32-bit floating point maths. Simple Serial at 57.6K. Fib(28) in 0.86 seconds. ~3x faster than spin, ~40% larger than spin.

ImageCraft · 2008-08-28 07:00

@Carl
The Fibonacci test is one of the worst tests in the sense that all it does is to test the function call overhead. ICC generates native Propeller code executed in a LMM environment. So the approximate speed is about 20-25% of native code, but of course it can access all of Hub memory. With the newly added FCACHE, small loops run in near native speed, after the code is fetched from the Hub memory.

Ale · 2008-08-28 08:15

Executing from EEPROM is a great idea and a cheap way to increase the amount of memory. Either stack or register based. I think Mike was exploring that for BASIC programs.

Branching can be further optimized using add/sub for near jumps (+/-511 instructions) and the same for call taking advantage of the link register. Using these optimizations the cycle count can be reduced from 0x8f00_0000 to 0x8583b16a. A modest improvement, but an improvement nonetheless and 2 longs shorter.

hippy · 2008-08-28 14:09

@ Carl : The Spin increment benchmark I use is, launched in its own Cog ...

PUB SpinIncBenchmark
  repeat
    long[ $7FFC ]++

I also test within subroutine as well ...

PUB SpinIncSubBenchmark
  repeat
    IncSub
PRI IncSub
  long[ $7FFC ]++

The equivalent PASM benchmarks ...

PasmIncBenchmark     rdlong acc,k_7FFC
                     add    acc,#1
                     wrlong acc,k_7FFC
                     jmp    #PasmIncBenchmark

acc                  long   0
k_7FFC               long   $7FFC

PasmIncSubBenchmark  call   #IncSub
                     jmp    #PasmIncSubBenchmark

IncSub               rdlong acc,k_7FFC
                     add    acc,#1
                     wrlong acc,k_7FFC
IncSub_Ret           ret

acc                  long   0
k_7FFC               long   $7FFC

How useful the tests are is debatable but in the absence of anything else there's at least something to compare against. They give an indication of if some compiled code / VM is similar, better or worse than the reference in these cases.

I agree with Richard/ImageCraft that Fibonacci isn't much of a benchmark either but it does test function call overhead and most programs tend to use those.

32-Bit LMM like ImageCraft LMM-C should be the next fastest to PASM and FCACHE improves that further, although it's then down to the compiler to not FCACHE anything which takes longer to load and execute than were it run as LMM. I had quite disappointing results in my Spin Interpreter when I tried FCACHE but that wasn't looping code.

As the LMM code moves further from being PASM, and becomes an instruction set for its own VM, performance starts to drop as instruction decoding overhead increases.

I think there's a perception ( I know I believed it ) that a rdword is faster than two rdbyte but for an instruction set of 'opcode reg' where both are bytes, reading separately isn't that much extra overhead and it opens the door to byte operands whereas the rdword needs all operands to be word length and word aligned so 8-bit operands have poorer code densities.

rdword   opc,pc                rdbyte   opc,pc
add      pc,#2                 add      pc,#1
mov      reg,opc               add      opc,#JmpTable
shr      reg,#8                rdbyte   reg,pc
and      opc,#$FF              add      pc,#1
add      opc,#JmpTable         jmp      opc
jmp      opc
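The same two fetch strategies can be modelled in C. This is only a sketch: the Decoded struct and both function names are made up for illustration, and it assumes the opcode sits in the low byte of the word ( the Propeller is little-endian ) with the register number in the high byte.

```c
#include <stdint.h>

/* Hypothetical decoded form of an 'opcode reg' instruction. */
typedef struct {
    uint8_t opc;
    uint8_t reg;
} Decoded;

/* One word fetch, then split it with a shift and a mask.
   pc must be word aligned for the real rdword to behave like this. */
Decoded fetch_word(const uint8_t *code, uint32_t *pc)
{
    uint16_t w = (uint16_t)(code[*pc] | (code[*pc + 1] << 8));
    Decoded d;
    *pc += 2;
    d.reg = (uint8_t)(w >> 8);       /* mov reg,opc / shr reg,#8 */
    d.opc = (uint8_t)(w & 0xFF);     /* and opc,#$FF             */
    return d;
}

/* Two byte fetches, no shifting or masking, no alignment requirement. */
Decoded fetch_bytes(const uint8_t *code, uint32_t *pc)
{
    Decoded d;
    d.opc = code[(*pc)++];
    d.reg = code[(*pc)++];
    return d;
}
```

Both return the same opcode and register; the difference on the Propeller is only in hub access timing and in whether the operands have to be aligned.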

Of course, multi-byte operands will take longer to load, but optimisation there helps. Starting with byte code which may be badly aligned one can force word and long operands to be correctly aligned by adding padding bytes. It increases code bloat but improves fetching. The nice thing is it's easy to turn on and off so it's a simple way to have the compiler generate code for highest code density or best speed. There again one has to be careful that adjusting the pc and using a rdword or rdlong doesn't take longer than it would have to read multiple bytes anyway, but if people want speed then even a few cycle savings will help.
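As a rough sketch of the padding idea, assuming an assembler that tracks the current code address and pads before each word or long operand ( align_up and padding_needed are hypothetical helper names, not part of any existing tool ):

```c
#include <stdint.h>

/* Round pc up to the next multiple of size; size must be 1, 2 or 4. */
uint32_t align_up(uint32_t pc, uint32_t size)
{
    return (pc + size - 1) & ~(size - 1);
}

/* Number of padding bytes to emit before an operand of the given size.
   With padding switched off the assembler would simply emit none and
   the VM would fetch the operand a byte at a time instead. */
uint32_t padding_needed(uint32_t pc, uint32_t size)
{
    return align_up(pc, size) - pc;
}
```

A long operand following an opcode at address 4 would start at address 5, so three padding bytes are needed before a rdlong can fetch it; that is the code bloat being traded for speed.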

The fastest VM comes from matching the VM instruction set to what PASM can execute quickest and optimising the instruction set to achieve that. It also comes down to how good code optimisation is in the compiler. LCC generates atrocious 'var+=1' code; my 'compiler' optimises that down to just an address load and an increment opcode. In the beta of ImageCraft ICC I recall that R0 was used as a temporary staging register ( mov r0,#long / add r6,r0 or similar ) and it's not an unexpected piece of code to generate. For 32-bit LMM that's three longs and two instruction decodes, so to me the obvious optimisation would be to reduce the decode overhead by adding a faster-decoded instruction 'movr0 #long', or better still 'add r6,#long', which is faster and saves an entire long (33%). However, that's no longer PASM, so pure LMM won't work; the kernel grows and the instruction decoding overhead increases, which costs speed.

One can compress 'mov r0,#long / add r6,r0' to 'push #long / add r6,<pushed>' then compress it to 'push #long / push #r6 / adi'. This is the theory my bytecode converter is based on, with only push having multi-byte overhead and every other opcode being a single byte. Hence my belief that a stack-based machine isn't any slower than a register-to-register one; one still has to read and write hub memory so may as well read and write the stack, and the only overhead is the stack pointer adjustment having done so. It's been quite an eye opener.
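A minimal sketch of how 'push #long / push #r6 / adi' might execute on a stack VM. The opcode numbers, the Vm struct and the reading of 'adi' as "pop a register number, pop a value, add the value to that register" are all assumptions made for illustration; the actual bytecode converter may differ.

```c
#include <stdint.h>

/* Hypothetical opcode encodings. */
enum { OP_HALT = 0, OP_PUSH = 1, OP_ADI = 2 };

typedef struct {
    int32_t reg[16];     /* VM registers, r0..r15          */
    int32_t stack[32];   /* evaluation stack               */
    int     sp;          /* number of items on the stack   */
} Vm;

void vm_run(Vm *vm, const uint8_t *code)
{
    uint32_t pc = 0;
    for (;;) {
        uint8_t opc = code[pc++];
        switch (opc) {
        case OP_PUSH: {
            /* The only multi-byte opcode: a 32-bit little-endian
               operand follows, read a byte at a time so it need
               not be aligned. */
            uint32_t v = (uint32_t)code[pc]
                       | ((uint32_t)code[pc + 1] << 8)
                       | ((uint32_t)code[pc + 2] << 16)
                       | ((uint32_t)code[pc + 3] << 24);
            pc += 4;
            vm->stack[vm->sp++] = (int32_t)v;
            break;
        }
        case OP_ADI: {
            /* Single-byte opcode: both operands come off the stack. */
            int32_t r = vm->stack[--vm->sp];
            int32_t v = vm->stack[--vm->sp];
            vm->reg[r & 15] += v;
            break;
        }
        default:
            return;      /* OP_HALT, or anything unrecognised */
        }
    }
}
```

Here push always carries a full long for simplicity; a real converter would presumably have shorter push forms for small values, which is where most of the code density gain would come from.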

Added : Another thing to consider is that, if one's going to miss a hub access sweet spot anyway, the extra cycle time wasted while waiting for the hub access spinner to come around again can be put to good use decoding more complicated instructions at what is effectively zero overhead, in that it won't be any slower.

Unlike other processors where adding an instruction will slow things down, it may not on the Propeller. If that free time can be used to gain a speed improvement elsewhere then it's a winner. If it brings a later hub access back to a sweet spot it's a real gain and time to crack open a beer. Trying to analyse that though isn't easy.
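A very simplified model of that effect, assuming a hub access window every 16 clocks per Cog, 4 clocks per ordinary instruction, and 8 clocks for a hub instruction which hits its window exactly ( the last figure is an assumption; check it against the Propeller documentation ). The function name is made up for illustration.

```c
#include <stdint.h>

enum {
    HUB_PERIOD = 16,   /* clocks between hub windows for one Cog      */
    HUB_OP_MIN = 8,    /* assumed cost of a hub op that hits a window */
    INSTR      = 4     /* clocks per ordinary instruction             */
};

/* Clocks per iteration of a loop containing one hub instruction and
   n_instr ordinary instructions. The hub instruction stalls until the
   next window, so the loop time rounds up to a multiple of 16. */
uint32_t loop_clocks(uint32_t n_instr)
{
    uint32_t busy = HUB_OP_MIN + n_instr * INSTR;
    return ((busy + HUB_PERIOD - 1) / HUB_PERIOD) * HUB_PERIOD;
}
```

Under this model a loop with three ordinary instructions costs the same 32 clocks as one with six, so up to three extra decoding instructions are free, while a seventh pushes the loop out to 48 clocks. Real code with several hub accesses per loop is harder to reason about, as noted above.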

Post Edited (hippy) : 8/28/2008 2:29:27 PM GMT

Ale · 2008-08-28 17:57

Hippy, do you know that the floating point library as it is fits in what is left unused of my simple kernel ?... maybe by throwing out a few small things it can all be fitted into 496 longs...

hippy · 2008-08-28 21:04

My Cog's full but no reason floating point couldn't be done as LMM. The kernel is built by my compiler so it only includes what it needs and optimises that and the plan is to have it move least used code(*) into LMM totally transparently if needed. That means floating point can be added, I've never looked at the library but cannot see many problems with doing that.

(*) I'll worry about what least used means later; probably some compiler option to force certain code into Cog rather than LMM. To start with I'm just going to bale out if Cog gets full.

RobotWorkshop · 2008-08-28 21:29

Oldbitcollector said...
hippy, could we talk you into an onboard tiny-c compiler?

OBC

I would personally love to get Interactive-C implemented on the Propeller chip. It has already been ported to several controllers like the 68HC11 HandyBoard, Lego Brick, and the GBC. I've used it on several projects and it is pretty cool. Below is a link to the software with another link to the source code:

http://www.handyboard.com/software/base.html

The nice part is that I'm sure a program on the Propeller could emulate what one of the other controllers looks like, and the existing code/tools may just work.

Robert

Carl Jacobs · 2008-08-28 22:29

Hippy, thanks for the test. It's a bit hard to wrap a counter round that one, so I'll propose a change that allows a timer to be wrapped around it. How about the following as a starter list of tests:

1) Fibonacci - gives an indication of function call overhead and has been the de facto benchmark on this forum until now.

2) Increment global variable in loop:

SPIN:
PUB Test
  repeat 1000000
    long[ $7FFC ] ++
 
C:
long var;
void Test(void)
{
  long i;                                /* Loop counter */
  for(i=0;i<1000000;i++)                 /* Should benefit from FCACHE */
    var++;
}
 
JDForth:
: Test  1000000 0 DO $7FFC @ 1+ $7FFC ! LOOP ;  \ Direct memory access 
 
  - or -
 
VAR32 var
: Test 1000000 0 DO var@ 1+ var! LOOP ;         \ Variable access

3) Increment global variable in subroutine:

SPIN:
PUB Test
  repeat 1000000
    IncSub
PUB IncSub
  long[ $7FFC ] ++
 
C:
long var;
void IncSub(void)   /* Declaring as inline would be invalid for this test */
{
  var++;
}
void Test(void)
{
  long i;           /* Loop counter */
  for(i=0;i<1000000;i++)
    IncSub();
}
 
JDForth:
: IncSub $7FFC @ 1+ $7FFC ! ;    \ Direct memory access 
: Test  1000000 0 DO IncSub LOOP ;  
 
  - or -
 
VAR32 var
: IncSub var@ 1+ var! ;           \ Variable access
: Test 1000000 0 DO IncSub LOOP ;

4) An implementation of SimpleSerial - in high level code. Spin does 19200 baud. JDForth does 57600 baud. C does ??? (Richard??).

5) I'm open to suggestions for other tests - in the long run, the whole Propeller community will benefit from the results.

BTW I'm making this up as I go - I haven't yet done these tests, so do not know the results. In all cases Spin would be the benchmark, as it is the only way we can compensate for different systems. E.g. I run my processor at a standard 80MHz, but it seems that Hippy runs his processor at 118MHz.
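The normalisation could be as simple as the following sketch, assuming both timings are taken on the same board at the same clock so that the clock frequency cancels out. The function names are made up for illustration.

```c
/* Speed relative to Spin on the same board; a result above 1.0 means
   faster than Spin. Both times must be measured at the same clock. */
double speed_vs_spin(double spin_seconds, double candidate_seconds)
{
    return spin_seconds / candidate_seconds;
}

/* Convert a measured time into clock cycles, for the cases where an
   absolute figure is wanted rather than a ratio. */
double cycles(double seconds, double clock_hz)
{
    return seconds * clock_hz;
}
```

Someone running at 80MHz and someone running at 118MHz can then compare ratios directly, even though their raw times differ.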

I'm open to any other suggestions for tests. Obviously Cluso with his faster Spin interpreter also needs some benchmarks to test against, and it will give us a speed comparison gauge once the Prop II arrives - although to me that sounds like it will be a little while yet.

Hippy, regarding your comment about reading misaligned words: I think I found about the same result with JDForth. Being careful with algorithms has had a much larger effect than the couple of extra clocks required to do multiple rdwords to simulate a rdlong. The Propeller is quite memory constrained compared to its abilities, so I'm taking the opportunity to save as many bytes as possible. Your idea of loading bits of code out of EEPROM is interesting; you may have noticed that the JDForth byte-codes (word-codes) are completely relocatable in memory. So there may be some examples of dynamically loaded code in the future, although nothing soon.
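In C terms, building a long from two word reads looks something like the sketch below. It assumes little-endian memory and an address which is word aligned but not necessarily long aligned; read_word and read_long_as_words are hypothetical names and this is not taken from the JDForth source.

```c
#include <stdint.h>

/* Model of rdword: addr is assumed to be word aligned. */
uint16_t read_word(const uint8_t *mem, uint32_t addr)
{
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}

/* Build a 32-bit value from two word reads, low word first, so the
   operand only needs word alignment rather than long alignment. */
uint32_t read_long_as_words(const uint8_t *mem, uint32_t addr)
{
    uint32_t lo = read_word(mem, addr);
    uint32_t hi = read_word(mem, addr + 2);
    return lo | (hi << 16);
}
```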

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Carl Jacobs

JDForth - Forth to Spin Compiler http://www.jacobsdesign.com.au/software/jdforth/jdforth.php
Includes: 32-bit floating point maths. Simple Serial at 57.6K. Fib(28) in 0.86 seconds. ~3x faster than spin, ~40% larger than spin.
