C/Spin benchmarks?

RossH · 2012-07-10 21:47

jmg wrote: »

Interesting. I can see debug would get complicated quickly, but this could still be useful. A user could prove first in a slower mode, and then once they know where the real bottle necks are, move that small section into a COG ?

Eventually ... maybe!

jmg wrote: »

Did you mean ROM ?

Yes.

jmg wrote: »

I still like to know all the resource that is being called on: If some is 'passed through' to a COG, even if you get that Main RAM back, then that should be included in the usage maps. [ROM / EEPROM / COG / RAM ]
{ ROM is only there in a Prop 1 }

I think doing this also helps to clarify to users, exactly what is happening.

This kind of detail is available if you need to know it, but it wouldn't typically show up in a memory map or anything like that. In most cases, it is the loaders that do all the "magic" and the C program itself remains blissfully unaware. With Catalina, that's deliberate - I don't really like having to "adorn" my C programs with stuff about where the code will live or how it should be compiled. I know gcc takes a different approach, and you can both specify this in the program and also create linkage maps or something - to be honest I'm not really sure of all the details.

Ross.

RossH · 2012-07-11 02:25

Another interesting demo ...

The C standard I/O library (stdio) is notoriously large - waaaay to big to be useful on a Prop 1 ... or is it?

Catalina includes a simple stdio test program (called test_stdio_fs.c). All this program does is is read a specified file from SD card (and dump it to the screen), or write a new file to to SD card.

How hard could that be?

When compiled using the normal Catalina LMM code generator, this simple program weighs in at a whopping 37k - fine if you have XMM, but way to big to run without it. But when compiled by adding -C COMPACT to the command, the program compiles down to a cool 22k - easily small enough to execute on a bare Propeller! The attached example binary is compiled for the C3 - but it uses serial I/O and should work on any Prop with similar SD card pinouts and a 5Mhz clock.

What's interesting about this example is that the whole shebang - i.e. the DOS file system, the stdio library and the test program itself - is coded in C. So this is a really good large scale test of how much C code can be compressed by using the new compact kernel - i.e. code sizes down to 55% of the equivalent LMM size.

Ross.

potatohead · 2012-07-11 08:31

That's damn cool Ross.

Dave Hein · 2012-07-11 09:07

Ross,

Could you try the filetest.c program? I'm curious how much smaller this is with your new code. Under PropGCC it just barely fits using LMM, and that requires using the simplified printf routines in printf.c. Note, the "ls" command uses some non-standard code to get the filename, length and attributes. I need to clean that up so it uses a standard method.

Dave

RossH · 2012-07-11 15:17

Dave Hein wrote: »

Ross,

Could you try the filetest.c program? I'm curious how much smaller this is with your new code. Under PropGCC it just barely fits using LMM, and that requires using the simplified printf routines in printf.c. Note, the "ls" command uses some non-standard code to get the filename, length and attributes. I need to clean that up so it uses a standard method.

Dave

Hi Dave,

I'll have a go when I get home. I'll have to comment out the non-ANSI stuff, but it should still be reasonably accurate.

Ross.

RossH · 2012-07-12 06:06

Dave Hein wrote: »

Ross,

Could you try the filetest.c program? I'm curious how much smaller this is with your new code. Under PropGCC it just barely fits using LMM, and that requires using the simplified printf routines in printf.c. Note, the "ls" command uses some non-standard code to get the filename, length and attributes. I need to clean that up so it uses a standard method.

Dave

Hi Dave,

Here are some approximate numbers for filetest.c. I had to remove a small amount of non-ANSI code, but these numbers should be reasonably accurate for the overall program size:

LMM = 36k
CMM = 23k

This program would not be loadable with Catalina if compiled to use LMM, but would easily be loadable if compiled to use CMM. Since you said it was loadable with gcc, I spent some time looking into why it was so large with Catalina - to be honest, this was a bit unexpected. It turns out that a large part of the reason is that when you include file support (something I generally used to do only when compiling to use XMM) then Catalina pulls in 4-6kb of library code that is quite unnecessary on the Propeller (this would be around 2-3kb in CMM mode). Not a big deal when you are using XMM, but a real issue if all you have available is the 32k of Hub RAM!

I never noticed this before, so thanks for bringing it to my attention. Given the Prop's limited Hub RAM, Catalina's file support actually amounts to a severe case of overkill. Afer all, how many Prop programs are ever going to need to have 20 files open at once? - the buffers and file blocks alone would occupy nearly half the available Hub RAM! And Catalina manages these buffers dynamically! - no wonder it's implementation of stdio is so much larger than gcc's! It was probably a bit dumb of me not to realize this before - but before CMM was available I never really expected to be able to run programs that use files from Hub RAM. Like your program, my file system demo programs just barely fit - and only then when you made severe comprimises (like yours using a cut down version of printf!).

I'll fix all this once I finish the CMM support - it will really be worthwhile then, since this example demonstrates once again that CMM makes it perfectly feasible to run substantial C programs from Hub RAM without having to resort to XMM - even if the program does use stdio.

But that's for another day - and anyway it makes little difference to the relative file sizes of LMM vs CMM. In that area the overall results for your program are reasonably consistent with the other examples I have so far - in your case using a CMM code generator would result in a code size reduction of around 40% over an LMM code generator. I see it go up to 50% in some cases, but 40% is not too bad.

Ross.

Update - some tweaks to stdio have removed a few kb already. Lots of room for improvment left.

Dave Hein · 2012-07-12 06:52

PropGCC produces a binary size of 30,848 bytes. This leaves just enough space for the stack and heap that is used by this program. If I get a chance I'll try converting this to Spin to see how it compares.

Ross, CMM looks like a really nice feature. That will enable many more C applications to fit into the Prop.

jazzed · 2012-07-12 07:05

Interesting developments.

Dave Hein · 2012-07-12 12:11

I converted filetest.c to Spin, and the binary is about 16K in size.

RossH · 2012-07-12 21:00

Dave Hein wrote: »

I converted filetest.c to Spin, and the binary is about 16K in size.

Hi Dave,

It's always difficult to compare code sizes by just looking at binary sizes, because presumably the Spin binary also includes at least two PASM drivers - i.e. an SD Card driver and a Serial driver - but if we assume these are 2K each (which is the maximum size they could be) then we would get an actual Spin code size of something like 12k - this fits in very well with my experience with CMM code sizes being about 2x Spin code sizes. In this case the CMM code size is 24K

I'm impressed that your gcc code size for filetest.c is 30K. But when I look at the GCC stdio library I can see why - it's very much more streamlined than the ACK library I use. I believe I can easily remove around 6K out of the ACK library just be removing the dynamic memory management of files structures and file buffers - but even so, Catalina LMM will struggle to get it under 32K. But I'll do it anyway, since it so will also help the CMM code size - this should come down to around 20K.

Hi Jazzed,

Yes, it is interesting, isn't it? Based on current experience, if gcc also adopted CMM, it could potentially fit the filetest.c program in around 18K. And the best part is that for this type of program (i.e. one that uses complex I/O) you don't even suffer much of a performance hit, since so much of the time-critical code is already handled in PASM.

RossH · 2012-07-15 03:23

Another interesting milestone ... well, interesting to me anyway!

The game othello (aka reversi) was one of the very first programs I ever ported to Catalina. From memory, the first LMM version ended up with a code size somewhere around 10-12kb! Even with the latest (released) version of Catalina - and using the Optimizer - the code size is still around 6-7 kb..

Well, here is exactly the same program compiled in CMM mode - the program code is now down to 3900 bytes (the remainder of the binary is mostly the serial driver and the kernel itself).

Like all the binaries in this thread, this example is compiled for a C3 using a serial terminal emulator at 115 kbaud - but it should run on just about any Propeller that uses a 5Mhz clock.

Ross.

jazzed · 2012-07-15 21:25

RossH wrote: »

Hi Jazzed,

Yes, it is interesting, isn't it? Based on current experience, if gcc also adopted CMM, it could potentially fit the filetest.c program in around 18K. And the best part is that for this type of program (i.e. one that uses complex I/O) you don't even suffer much of a performance hit, since so much of the time-critical code is already handled in PASM.

Hi Ross.

Sure, I'd like to see more details.

We discussed a compressed mode months ago - It's just a matter of having free time.
This kind of thing is simpler to do having all the other infrastructure in place.

Cheers.
--Steve

RossH · 2012-07-15 22:55

jazzed wrote: »

Hi Ross.

Sure, I'd like to see more details.

You'll get them soon, since I've decided compact mode will be released as part of Catalina 3.7 - especially since I have now also made some of the stdio changes needed to take a few kb off Catalina programs that use stdio for file access in all modes (and more coming)!.

At the moment, the main thing missing from compact mode is debugger support - and I may decide to release a version without that first anyway - after all, the first few versions of Catalina didn't have full C source level debugging ... then Bob Anderson came along!

jazzed wrote: »

We discussed a compressed mode months ago - It's just a matter of having free time.

I don't actually know what your "compressed" mode does - can you elucidate? Or point me to the discussion thread? I suspect you're talking about some kind of code compression technique - which would be quite different to what I mean by "compact" mode. But I'm not sure.

jazzed wrote: »

This kind of thing is simpler to do having all the other infrastructure in place.

Ross.

jazzed · 2012-07-15 23:40

RossH wrote: »

I don't actually know what your "compressed" mode does - can you elucidate? Or point me to the discussion thread? I suspect you're talking about some kind of code compression technique - which would be quite different to what I mean by "compact" mode. But I'm not sure.

There were several ideas thrown around in another forum in March.

Basically I believe what it comes down to is using the current instruction set in a less "propellery" way to achieve better code density. Here are some ideas I've looked at that draw on some of the previous discussion. This has not been vetted or tested.

The assembly code would have access to 8 kernel registers: r0..6, lr, sp . It means that only 8 immediate jump macro types (jmp #macro) would be available; however, the D field could be stuffed for 8 sub types if that's faster than a 2 step instruction. Access to special purpose registers would be via jump macros.

Tiny branches can be made with add/sub pc, #value, jump macros would be used for other branch conditions.

The fetch, decode, execute loop would execute in as few as 3 hub cycles and would look like this:

dat     org     0

{{
16 bit "Beany" instruction coding: DDDSSSiiiiiizcrI

D = destination register
S = source register
i = instruction opcode bit
z = affect Z flag
c = affect C flag
r = store result in D
I = load immediate from S
}}
next1
        rdword  cinst,  pc      ' get instruction word  : 0000 0000 0000 0000 DDDS SSii iiii zcrI
        ror     cinst,  #10     ' place instruction     : iiii iizc rI00 0000 0000 0000 00DD DSSS
        or      cinst,  always  ' set condition always  : iiii iizc rI11 1100 0000 0000 00DD DSSS
        
        mov     inst,   cinst   ' stuff instruction     : iiii iizc rI11 1100 0000 0000 00DD DSSS
        andn    inst,   #$38    ' clear D form S field  : iiii iizc rI11 1100 0000 0000 0000 0SSS
        ror     cinst,  #3      ' put D in lower bits   : SSSi iiii izcr I111 1000 0000 0000 0DDD
        movd    inst,   cinst   ' stuff instruction     : iiii iizc rI11 1100 0000 DDD0 0000 0SSS
        
        add     pc,     #2      ' queue next instruction
inst    nop                     ' execute instruction
        jmp     #next1          ' fetch next
        
always  long    _1100_0000_0000_0000_0000


pc      long    0


cinst   res     1

RossH · 2012-07-16 01:30

jazzed wrote: »

There were several ideas thrown around in another forum in March.

Basically I believe what it comes down to is using the current instruction set in a less "propellery" way to achieve better code density. Here are some ideas I've looked at that draw on some of the previous discussion. This has not been vetted or tested.

Cute - a 16-bit "mini" Propeller!

Yes, I use some similar techniques for aspects of my "hybrid" kernel - but I think you'll find this particular technique won't by itself get you a significant code reduction. Just the fact that you are reducing the number of general purpose registers from 16 (16 for gcc - is that right?) down to 8 in this beany kernel means that much of the saving you might expect to get from having a 16 bit instruction set will get eaten away (at least when when used for non-trivial programs) by the need to generate many more instructions just to shuffle registers around, or (even worse) to use the stack in more instances for temporary storage when you run out of temorary registers. Similarly, only having support for 8 "macros" (or "primitives", as Catalina calls them) would mean more instructions need to be generated to do common tasks (like stack operations).

My hybrid kernel currently supports 32 bit instructions, 32 registers, and (roughly) the same set of primitives as the existing Catalina kernel (I say "currently" because while I'm generally satisfied with the code sizes, I'm still not happy with the current design in terms of performance - so I may change it). There's no question your beany kernel would be faster than my hybrid kernel on some operations - but it would be slower on others (e.g. accessing the "special" registers). I also don't think you'd achieve code size reductions anywhere near 50% - assuming that's what you'd be expecting by reducing the instruction size from 32 bits to 16. In fact it wouldn't surprise me to find this kernel actually increased overall code sizes over a standard LMM kernel in many instances, rather than reducing it.

But I've been wrong before

. I suggest you go ahead and implement it so we can all find out!

Ross.

jazzed · 2012-07-16 07:29

RossH wrote: »

Cute - a 16-bit "mini" Propeller!

Thanks for your encouragement.

Yup a 50% reduction is certainly not going to happen, but size reduced prospects are not so dismal either. Prioritizing on a smaller solution by any means has value. Comparing with spin performance is a good metric - and that is mostly a stack machine. I figure since most propeller folks are pretty happy with spin performance and the driver paradigm, then great. We already have relatively high performance solutions. Too bad Spin byte codes do not lend very well to a C back end - that would be perfect.

Wish I could pour myself into this, but I have higher priorities and high pressure deadlines.

Cheers.

Dave Hein · 2012-07-16 08:12

jazzed wrote: »

Too bad Spin byte codes do not lend very well to a C back end - that would be perfect.

I assume the problem with the Spin VM is that it is stack-based, and not register-based. However, isn't the ZOG VM a stack-based machine? I may be wrong about that, but it's worth looking into.

I've done some recent experiments on the Spin VM, and I found that I can double the speed of the Spin interpreter if I only implement a subset of the instruction set. The reason Spin is slow is because it implements too many instructions, and the interpreter must expend more cycles to fit into a single COG. I think a reduced-instruction Spin VM would be more efficient than the current Spin interpreter.

The stack-based VM is slower than a register-based VM, but it allows for more compact code. A local variable can be pushed to the stack by a single-byte instruction, versus the 4-byte instruction that is needed to load a register. Also, Spin jump instructions can be done in 2 bytes for short jumps and 3 bytes for long jumps. An LMM jump requires 4 bytes for a short jump and 8 bytes for a long jump.

Heater. · 2012-07-16 08:22

I thought the problem with compiling C to Spin byte codes was to do with the fact that Spin byte codes do not support unsigned arithmetic.
The ZPU virtual machine implemented by Zog is indeed stack based.
Not sure if we were ever convinced that the ZPU byte codes are more compact than LMM in the end. The ZPU has a very tiny instruction set designed for implementing on small FPGA's so it uses more instructions to get anything done.

jazzed · 2012-07-16 08:36

Heater. wrote: »

Not sure if we were ever convinced that the ZPU byte codes are more compact than LMM in the end.

It was (and may still be) the smallest C solution. I wish you would finish the little-endian version (in another thread perhaps?).

Dave Hein · 2012-07-16 08:40

There are only a few operators where unsigned versus signed int matters, such as >>, <, <=, >, >=, *, / and %. These would require extra operations to implement in Spin bytecodes. In the case of the right-shift operator, Spin does support signed and unsigned right-shifts. Also, accessing signed char and short variables will require sign extension. That's why I made char unsigned by default in CSPIN.

jazzed · 2012-07-16 08:49

Dave Hein wrote: »

There are only a few operators where unsigned versus signed int matters, such as >>, <, <=, >, >=, *, / and %. These would require extra operations to implement in Spin bytecodes. In the case of the right-shift operator, Spin does support signed and unsigned right-shifts. Also, accessing signed char and short variables will require sign extension. That's why I made char unsigned by default in CSPIN.

Hmm. Maybe there is hope.

David Betz · 2012-07-16 09:29

Heater. wrote: »

I thought the problem with compiling C to Spin byte codes was to do with the fact that Spin byte codes do not support unsigned arithmetic.
The ZPU virtual machine implemented by Zog is indeed stack based.
Not sure if we were ever convinced that the ZPU byte codes are more compact than LMM in the end. The ZPU has a very tiny instruction set designed for implementing on small FPGA's so it uses more instructions to get anything done.

It seems to me that I was able to get programs to run under ZOG that wouldn't fit using the LMM kernel in Catalina at the time. I think ZOG does get better code density than LMM.

RossH · 2012-07-16 15:55

Heater. wrote: »

Not sure if we were ever convinced that the ZPU byte codes are more compact than LMM in the end. The ZPU has a very tiny instruction set designed for implementing on small FPGA's so it uses more instructions to get anything done.

Yes, that's my recollection as well - I seem to recall at the time you were disappointed that the ZPU code sizes were not much smaller than LMM - but I can't find the code size stats. Do you still have them?

Ross.

RossH · 2012-07-17 00:35

jazzed wrote: »

Hmm. Maybe there is hope.

If you mean hope for a compiler that compiles C source to Spin byte codes, so they can be executed by the Spin interpreter embedded in the Prop 1 - then no, I don't think there is any hope.

Dave's CSPIN program is probably the best option here. This program could be brought closer to being C with more work - but it would never reach 100%. And I also doubt whether it would ever attract a substantial user base even if it did (no criticism intended, Dave!).

Enough people have looked at this idea in enough detail over enough years by now that if it was practical then it would have been done. The problem is that it soon becomes apparent that unless you are willing to restrict yourself to a Spin-like subset of C, then the resulting binaries will most likely be both larger and slower than a hand-coded Spin equivalent (i.e. hand-coded to overcome the limitations of Spin). Catalina and GCC both generate executables that are substantially larger than Spin, but they are both also substantially faster, so people are willing to accept the trade-off and use them - mostly for the sake of getting 100% C compatibility. If this hypothetical C compiler not only failed to offer that benefit, but also generated programs that were both larger and slower, who would bother using it?.

I know we've all learned (sometimes the hard way!) never to say "never" when it comes to the Prop 1, but I really can't see this one happening. Even if some fanatical hobbyist took it on as a personal crusade it would probably take more years than the Prop 1 has left! But it may well happen on the Prop II - memory is not at such a premium, so it won't matter so much if C compiled to Spin is slightly larger, and the chip itself is faster, so it won't be such a problem if the resulting program is a bit slower. But on the Prop II there is a different problem - there is no "embedded" spin interpreter anyway, so there is no reason not to "tweak" the interpreter to make it more suitable for C.

Ross.

David Betz · 2012-07-17 05:45

RossH wrote: »

Yes, that's my recollection as well - I seem to recall at the time you were disappointed that the ZPU code sizes were not much smaller than LMM - but I can't find the code size stats. Do you still have them?

Ross.

I'm not convinced this is true. I know I had cases where the ZOG code was significantly smaller than the LMM code for the same program. I think the program I was compiling at the time is what is now called propgcc/demos/ebasic. I'll be happy to try it again but I'd like to do it with a little-endian version of ZOG. The version that is checked into Google Code is still the big-endian version.

Heater: Can you point me to the changes that need to be made to convert it to little-endian?

Heater. · 2012-07-17 07:38

David,

The pressure is mounting for me to do something and wake up Zog. At risk of upsetting by boss and/or "her indoors" and/or Zog himself I will find some time to do that in the next few days.

Seems I had already prepared a Zog version 2.0 pre 1 package for little-endian to post but now I have multiple versions of that on different machines so I have to check which one works best. Also I have to remind myself how to patch zpugcc for little-endian.

Someone suggested I start a new thread for little-endian Zog so that is what I will try and do.

David Betz · 2012-07-17 08:03

Heater. wrote: »

David,

The pressure is mounting for me to do something and wake up Zog. At risk of upsetting by boss and/or "her indoors" and/or Zog himself I will find some time to do that in the next few days.

Seems I had already prepared a Zog version 2.0 pre 1 package for little-endian to post but now I have multiple versions of that on different machines so I have to check which one works best. Also I have to remind myself how to patch zpugcc for little-endian.

Someone suggested I start a new thread for little-endian Zog so that is what I will try and do.

Thanks! As soon as you post your changes I'll try to integrate them with the work I did a while ago with big-endian ZOG that is checked into Google Code.

http://code.google.com/p/propeller-zpu-vm/

Dave Hein · 2012-07-17 09:55

RossH wrote: »

Dave's CSPIN program is probably the best option here. This program could be brought closer to being C with more work - but it would never reach 100%. And I also doubt whether it would ever attract a substantial user base even if it did (no criticism intended, Dave!).

In theory, CSPIN could produce 100% ANSI compliant code, but I agree that it will never get there. It would just be too large of an effort to write an ANSI-compiant compiler from scratch. I do think it's possible to achieve the same code compaction as Spin, and with twice the speed of Spin, or better. That's what I'm seeing with the reduced-instruction Spin interpreter that I'm tinkering with.

I've currently implemented about 100 out of the 255 Spin opcodes, and 33 of the 44 extended codes with twice the performance of the standard Spin interpreter. I'm using the standard set of Spin bytecodes for now, but it should be possible to develop a new set of bytecodes that would be a better fit for C. Some additional signed and unsigned instructions would be good. It would also be good to remap the opcodes so they translate more directly into PASM opcodes. The current math bytecode don't allow for an easy translation to a PASM instruction. We also wouldn't need PBASE and VBASE and the bytecodes that support them. I think we only need three addressing modes based on DBASE, a zero-offset immediate address and a zero-offset address from the stack.

RossH · 2012-07-17 19:28

Dave Hein wrote: »

In theory, CSPIN could produce 100% ANSI compliant code, but I agree that it will never get there. It would just be too large of an effort to write an ANSI-compiant compiler from scratch. I do think it's possible to achieve the same code compaction as Spin, and with twice the speed of Spin, or better. That's what I'm seeing with the reduced-instruction Spin interpreter that I'm tinkering with.

I've currently implemented about 100 out of the 255 Spin opcodes, and 33 of the 44 extended codes with twice the performance of the standard Spin interpreter. I'm using the standard set of Spin bytecodes for now, but it should be possible to develop a new set of bytecodes that would be a better fit for C. Some additional signed and unsigned instructions would be good. It would also be good to remap the opcodes so they translate more directly into PASM opcodes. The current math bytecode don't allow for an easy translation to a PASM instruction. We also wouldn't need PBASE and VBASE and the bytecodes that support them. I think we only need three addressing modes based on DBASE, a zero-offset immediate address and a zero-offset address from the stack.

Sorry, Dave - I was specifically talking about CSPIN with the current pnut Spin interpreter. With the pnut interpreter, 100% ANSI compliance is theoretically possible (it's theoretically possible with Turing machine!) but completely impractical. If you used another interpreter or "tweak" the existing one, then it would become practical, and I also agree you should be able to do better (performance-wise) than Spin - getting speeds around twice as fast as Spin should be achievable. However, this will be difficult to achieve this at Spin code sizes. In fact, I'd go so far as to say "impossible" - so I now look forward to being proven wrong

Ross.

RossH · 2012-07-17 19:30

David Betz wrote: »

Thanks! As soon as you post your changes I'll try to integrate them with the work I did a while ago with big-endian ZOG that is checked into Google Code.

http://code.google.com/p/propeller-zpu-vm/

Oh no! Zog is re-emerging from his block of ice! Run away!!!!

C/Spin benchmarks?

Comments