The FSRW file driver was originally written in C and then translated to Spin. The C code is also in the object's ZIP, so it may be good for comparing code size. But performance is a bit harder to measure, because it is mainly the SPI driver (and, with fast SPI code, the SD card itself) that determines SD access speed, not the file driver.
Andy
Thanks, Andy - there are plenty of benchmarks available for comparing performance, but this is going to be good for comparing code sizes.
Isn't the best approach there one that allows small sections of C to compile to PASM, rather than to byte-code equivalents?
(I think the GCC port is following this already?)
- or at least, this should be part of any solution.
If I knew what the best approach was, you can bet I'd already be doing it!
If there is another switch choice that allows memory to 'go further', at the cost of some speed, that is also useful.
One of the good things about Catalina is that it lets you switch code generators as you need to, and still benefit from all the underlying shared "plugin" infrastructure. Eventually, I'd expect you to be able to simply recompile without a single change to the source code, or even mix and match different cogs running different kernels.
Is the memory you run short of, Code or Data memory ?
If it is code, then QuadSPI memory (even DDR) could remove that barrier ?
Yes, it's code space - data space changes very little from language to language. Catalina can already use QuadSPI, and the only experiment in DDR I am aware of seemed only to prove it took too many pins and required crazy overheads to make it work. But any form of XMM makes the Propeller a losing alternative cost-wise when compared to the competition. Want proof? Take a look at the cost of a Raspberry Pi board vs a C3!
Basically it sounds like you are talking about retargeting to a middle-ground bytecode and implementing an interpreter.
Correct. At the moment my "hybrid" kernel is a mix of both - but I think the eventual solution will lean more and more to the "bytecodes" , and less and less to the "LMM" side of the equation. LMM is simply too expensive in code size.
Perhaps what I am really saying is that Chip was right all along. The Spin interpreter is really a marvel - without it the Propeller would just be empty unused silicon.
But just imagine how successful the Propeller would have been if that interpreter had implemented C rather than Spin ...
I think some great benchmarks would be jpg or png decode rates, mp3 decode rate, and video framerates...
But, I think a Prop2 is required for all of these.
Actually, I think Prop1 can do jpg, png, or mp3 decoding with external memory, but it would be painfully slow...
Maybe some Mandelbrot generation rates would also be good..
The dhrystone 1.1 and the xxtea Spin code were never posted to the benchmark thread. I had to hunt around to find the dhrystone Spin source. The Spin version could be tweaked to remove the dependency on CLIB, which would reduce the size of the program.
I couldn't find the Spin version of the xxtea program, so I converted it to Spin and attached it below.
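For anyone wanting to repeat the conversion, the usual C starting point is the widely circulated XXTEA `btea` routine, reproduced here from the public-domain reference version as a sketch - the exact source Dave converted may differ in its details:

```c
#include <stdint.h>

#define DELTA 0x9e3779b9
#define MX (((z>>5^y<<2) + (y>>3^z<<4)) ^ ((sum^y) + (key[(p&3)^e] ^ z)))

/* Encodes (n > 1) or decodes (n < -1) n 32-bit words at v in place,
 * using a 128-bit key. This is the corrected XXTEA reference routine. */
void btea(uint32_t *v, int n, uint32_t const key[4]) {
    uint32_t y, z, sum;
    unsigned p, rounds, e;
    if (n > 1) {            /* encoding */
        rounds = 6 + 52/n;
        sum = 0;
        z = v[n-1];
        do {
            sum += DELTA;
            e = (sum >> 2) & 3;
            for (p = 0; p < n-1; p++) {
                y = v[p+1];
                z = v[p] += MX;
            }
            y = v[0];
            z = v[n-1] += MX;
        } while (--rounds);
    } else if (n < -1) {    /* decoding */
        n = -n;
        rounds = 6 + 52/n;
        sum = rounds * DELTA;
        y = v[0];
        do {
            e = (sum >> 2) & 3;
            for (p = n-1; p > 0; p--) {
                z = v[p-1];
                y = v[p] -= MX;
            }
            z = v[n-1];
            y = v[0] -= MX;
            sum -= DELTA;
        } while (--rounds);
    }
}
```

The inner `MX` step is why this makes a good benchmark - it exercises shifts, XORs, adds, and indexed memory accesses all at once.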
But just imagine how successful the Propeller would have been if that interpreter had implemented C rather than Spin ...
Ross.
Perhaps you should simply retarget a C compiler that generates bytecodes for the SPIN interpreter. The interpreter is already there, and the SPIN language already implements all of the functional constructs of C, plus a bunch of esoteric operators.
It wouldn't be trivial, but you have half the equation sorted out. With the optimizers that exist for C, you might improve on the execution performance by a significant amount.
Furthermore, SPIN already implements LMM, and you could emulate some of the other things transparently in the language.
Hi pedward,
This has been considered by several people (including me) and pretty much ruled out. Putting it simply, the Pnut interpreter is heavily optimized for the Spin language, and Spin is a smaller language than C - that's why Pnut can fit in a single cog. The language elements that C has and Spin does not are quite difficult to implement using the available Pnut opcodes - so if you use C-specific features, not only will your program execute more slowly than the same program hand-crafted in Spin, it will probably blow out in code size as well. Spin forces you to code using techniques that map very well to the Pnut interpreter, but which would not naturally be used by C programmers - and I'm not really sure anyone would use a C compiler that generates programs that are both slower and larger than Spin.
This is also a reason why benchmarking C programs that have been automatically translated from Spin is a bit futile - such programs do not represent how a C programmer would program the same algorithm in C.
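To make that concrete (my own illustration, not code from the thread): an idiom that is second nature in C, such as walking an array with a roving pointer, has no direct counterpart among the Pnut opcodes as I understand them - the interpreter provides indexed byte/word/long accesses, so a C compiler targeting it would have to synthesize the address arithmetic from slower indexing operations:

```c
#include <stddef.h>

/* Sum a buffer using a roving pointer - idiomatic C, but "increment a
 * pointer and dereference it" has no single Spin bytecode equivalent;
 * a compiler targeting the Spin interpreter would have to rebuild this
 * from indexed accesses. */
int ptr_sum(const int *buf, size_t n) {
    int total = 0;
    const int *end = buf + n;   /* address arithmetic */
    while (buf < end)
        total += *buf++;        /* pointer post-increment + dereference */
    return total;
}
```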
The dhrystone 1.1 and the xxtea Spin code was never posted to the benchmark thread. I had to hunt around to find the dhrystone Spin source. The Spin version could be tweaked to remove the dependency on CLIB, which would reduce the size of the program.
I couldn't find the Spin version of the xxtea program, so I converted it to Spin and attached it below.
Thanks, Dave.
I'd love it! Want to program a Spin version???
Ross.
http://forums.parallax.com/showthread.php?126429-RIP-Benoit-Mandelbrot-Propeller-demo-attached...&highlight=Mandelbrot
Andy
Gosh, that was quick! I have some C sources for generating Mandelbrot sets, but if you converted this from a particular C version, please let me know which one and I will use that instead!
Just thought I'd mark this thread "Solved", because I think I now have enough C/Spin programs to get pretty good indicative results. Of course, you can all feel free to keep contributing!
I hope to have some actual numbers within a week or so.
Just a quick update. I am calling the Catalina mode that uses the new hybrid kernel the "Compact Memory Mode" (CMM).
I started with the simplest programs - Fibonacci and Dhrystone. These programs are a bit atypical because there is an almost 1-1 correspondence between the individual C statements and the individual Spin statements - but for that reason they are the easiest to compare!
For such programs, the results on code size are pretty much as I expected - i.e. the CMM code generator produces compiled C code that is very close to twice the size of the equivalent Spin code. This may not sound so good, but it is actually quite an improvement over traditional LMM code generators, which typically produce C code four times the size of the Spin code.
However, the initial performance figures are not so good - at least not yet. The CMM kernel executes C code only around 20%-30% faster than the Spin interpreter executing the equivalent Spin code. I had hoped for better even in these simple cases, but I now think even getting to 50% faster than Spin is going to prove difficult. I believe the reason is that these programs map almost directly from C to Spin, so it is hard for CMM to do much better - wherever CMM moves away from being an LMM kernel and towards being an interpreted kernel, it is difficult for it to beat Spin at performing the equivalent operations!
However, I have early indications that the picture changes significantly when you compile more complex C programs. I have instances where the size of the compiled C code is roughly the same size as the equivalent Spin code, and yet the CMM kernel executes the code at nearly three times the speed of the equivalent Spin code. However, I only have fairly pathological instances of this as yet. I may need to find some more real world examples that demonstrate this.
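For reference, the Fibonacci test used in Propeller benchmarking is typically the naive doubly recursive version - this is a sketch, and the exact source used for these measurements may differ in details. Nearly all of its time goes into call/return and stack overhead, which is what makes it a good probe of kernel function-call cost:

```c
/* Naive doubly recursive Fibonacci. Almost all execution time is
 * function-call overhead, so it measures how cheaply a kernel
 * (LMM, CMM, or the Spin interpreter) can perform calls/returns. */
unsigned fibo(unsigned n) {
    if (n < 2)
        return n;
    return fibo(n - 1) + fibo(n - 2);
}
```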
When comparing code here, you probably should split into two numbers, the 'Kernel-created' and the 'user-created' sizes.
Spin would include the ROM size, and that becomes more important on Prop 2 where it is not 'there for free', but becomes a load cost.
Hi jmg,
For both C and Spin, I'm only considering "user created" code size (i.e. C or Spin application code) in this thread - no drivers (which are the same for Spin and C anyway) and no kernel overhead (which is zero if you use the Catalina "2 phase" loaders). This will be true on the Prop 2 as well.
Your improvements are of great benefit for another model. Perhaps there are ways to tweak your code too.
Nothing is free... well except Catalina
Hi Cluso,
Yes, I haven't quite met my original targets yet, but I already think the new CMM support is likely to be released as part of the next version of Catalina. Since it uses all the existing Catalina infrastructure, it's a simple command line option (-C COMPACT) to use the new code generator and CMM kernel in place of the previous ones, and it would be very handy to have this available as a last resort if your C program simply won't fit any other way!
Ross: With the Props cores, it might even be possible (with of course more work) to have some cores running CMM and another running the LMM or XMM model. I am sure as you develop the idea there will be other improvements to the mix.
Not everything needs to be small and not everything needs to be fast. Just like spin and pasm, there will be a place for these options.
Hi Cluso,
Yes, this is possible - as long as you launch the C functions on different cogs using the correct kernel, then all will work as expected - i.e. code compiled with one code generator will be able to launch a function compiled with a different code generator on another cog.
And of course all the cogs can share the same plugins (or drivers) - the plugins don't care how the program that calls them was compiled, so they can be called equally well from the CMM, LMM, or XMM kernels (or from Spin, for that matter).
The only limitation is that you won't ever be able to mix different code generators within a single cog - e.g. to call a function compiled with the CMM code generator directly from a function compiled with the LMM code generator.
Adding the capability to make this easy to do from the Catalina "wrapper" program (i.e. catalina.exe) will be a bit difficult - it doesn't care which code generator you use, but it does currently expect all the objects it links to have been compiled with the same one. Currently, the way you would have to do it is to turn the parts of your C program that need to be executed under a different kernel into a binary "blob" - in much the same way you currently have to do to launch Spin programs from C (via the spinc.exe program). Perhaps I could add another utility (I'd probably call it something like blobc.exe) to help out with this. But I don't think this will make it into the next release - perhaps the one after that.
A progress update ... and a demo!
Attached is a C version of the Parallax graphics demo program - compiled from C source code using the new "compact" Catalina code generator. It is 100% equivalent to the Parallax demo, including full mouse support. The attached binary is compiled for a C3, but should work on any platform with the same TV and mouse pin configurations (and a 5MHz clock).
Catalina has included a C version of the Parallax graphics library for several releases - but previously its use was limited because the C code generated by Catalina was too large to let you do much graphics programming with it. For instance, the Parallax demo program requires 24kb of free Hub RAM just for the graphics buffers, so on a bare bones Prop 1, that leaves only 8kb for all other program code and data! With Catalina's pure LMM code generator, I was never able to squeeze the demo program code size down under around 10kb (the Spin version of the same code is around 2.5kb!). Previously, I had to work around this by either running the code from XMM, or reducing the graphics resolution, or disabling double buffering - but no more!
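A quick sanity check on that memory budget, using the sizes quoted above (the only assumption being that "kb" here means 1024 bytes):

```c
/* Hub RAM budget for the graphics demo on a bare Prop 1 (bytes).
 * The 24KB graphics-buffer figure is quoted in the text above. */
int free_hub_bytes(void) {
    int hub_ram = 32 * 1024;    /* total Prop 1 hub RAM */
    int buffers = 24 * 1024;    /* graphics buffers for the demo */
    return hub_ram - buffers;   /* 8KB left for all other code and data */
}
```

So ~10KB of LMM code can never fit in the remaining 8KB, while compact code under 5KB fits with room to spare.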
Now, with the new compact code generator, the same C code is under 5kb (note the binary file size is larger because of the various drivers the program needs - but these consume no Hub space at run time). So the pattern that became apparent with Dhrystone & Fibo holds for this program as well - i.e. the compact code generator produces compiled C code that is well under half the size of that produced by an LMM code generator. And while the resulting programs are not as fast as LMM code, they are faster than Spin.
I'd always felt the Parallax graphics demo program was a significant one for demonstrating the power of the Propeller, so it always annoyed me that I could never get the code size small enough with an LMM code generator to duplicate the demo in C. Well, now I can.
I've been concentrating on getting the code size down, since I think that's the biggest obstacle for the adoption of C on the Propeller. But now that I've reached this code size milestone I'll move to working on improving the performance instead. It's currently not great - it sits around 10-20% faster than Spin - but I hope to do better. (Note that the speed of the attached demo program is really determined mainly by the PASM portions, not the C/Spin portions - however, even in this case, this version is slightly faster than the Spin version).
Giving gcc a run for its money.
Zog will not be happy either...
With the Prop I guess you can't ever say anything is impossible, but a pure LMM solution would really struggle to get small enough code sizes to run a C program like this on the Prop 1. Catalina's original code generator certainly struggled, and I think gcc would struggle even more. A hybrid solution (like the one the demo uses) or a fully interpreted solution (like Zog or Spin) may be the only practical options.
By the way, do you have any records of code sizes generated by Zog? I thought you had posted some, but when I tried to find them I couldn't. Can you point me to them?
...(the Spin version of the same code is around 2.5kb!).
Now, with the new compact code generator, the same C code is under 5kb (note the binary file size is larger because of the various drivers the program needs - but these consume no Hub space at run time).
I see the ROM allocated to Spin is 4K, so you could say the Spin is ~2.5k+4k
If size and speed end up being a trade-off, then a source-level compiler directive could be useful.
Not sure if what you are doing lends itself to some code blocks being size-optimised and some speed-optimised?
The next 'big jump' occurs when you can select a small target code area to 'compile to PASM' and run in a COG.
I think GCC is going to allow this - is this also possible with what you are doing here?
There is no need to include the size of the Spin interpreter in your program size. It lives in ROM and does not detract from the 32K RAM space.
That is a bit tough on attempts to use different byte codes or even LMM, as then you have to consume space in RAM for the interpreter/kernel.
One could alleviate that by recycling the interpreter image space in RAM once all cogs are up and running. But that precludes dynamically starting and stopping cogs later on.
One could get around that by giving the kernel the ability to clone itself, via a buffer in HUB, into another cog. The buffer being otherwise available for general use.
The Prop II has no ROM interpreter so there you can add the kernel size to your program size. That puts alternative kernels on an equal footing.
Exactly, which is why mentioning the Kernel is doubly useful.
I've just amended the demo graphics program above (here) - I thought of a really simple way to speed it up, and now the C version is faster than the Spin version!
For those interested in the details, I have added an FCACHE capability to the new compact kernel - now the Catalina compact mode really is a hybrid - partly interpreted, partly LMM PASM, and partly COG PASM (via FCACHE). I need to do some more work on integrating this into the code generator, but the hardest bit is figuring out exactly when to use each technique.
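For readers unfamiliar with the FCACHE idea, it can be sketched in plain C - this is purely illustrative, since the real mechanism copies a short stretch of PASM into cog RAM and jumps to it. The point is that the kernel pays the cost of copying a hot loop body out of slow (hub) memory into fast local storage once, then executes every iteration from the fast copy:

```c
#include <string.h>
#include <stdint.h>

/* Toy model of FCACHE: instead of fetching each "instruction" from slow
 * hub memory on every iteration, the kernel copies the loop body into a
 * small fast buffer (standing in for cog RAM) once, then runs all
 * iterations out of the fast copy. The opcode set here is invented for
 * the illustration. */
enum { OP_ADD, OP_END };

#define FCACHE_SIZE 16

int run_cached(const uint8_t *hub_code, int len, int iterations) {
    uint8_t fcache[FCACHE_SIZE];
    int acc = 0;

    memcpy(fcache, hub_code, len);      /* one-off load into "cog RAM" */

    for (int i = 0; i < iterations; i++)        /* hot loop ...        */
        for (int pc = 0; fcache[pc] != OP_END; pc++)
            if (fcache[pc] == OP_ADD)           /* ... fetches locally */
                acc++;
    return acc;
}
```

The hard part, as noted above, is the compiler's decision of which loops are worth the copy - that is the "when to use each technique" question.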
jmg ...
This goes some way to answering your question - yes, I could now generate C code to be executed "in cog". In fact, it would be possible to allow the program to specify which coding technique to use "on the fly" (i.e. interpreted, COG PASM, or LMM PASM). Or it could specify this per function. However, this would really complicate many other things - not least the BlackBox source level debugger - so I doubt this will be something I will be adding in the short term.
Also, I'm not really sure of your point in insisting on including the size of the Spin interpreter in the code sizes, but you can do that if you want to. Just add 2k for Spin programs. If you want to do the same for the Catalina kernel, then just add 2k there as well. However, this is really a bit misleading - in neither case does this 2k detract from your ability to use all the available Hub RAM - in Spin because the interpreter is loaded direct from EEPROM, and in Catalina because of the various "two phase loaders" that load the kernel before loading the application program. In fact Catalina's approach has other advantages, because it does this for all plugins (not just the kernel). This means Catalina can use the whole of Hub RAM for code space. However, Spin cannot - it can't reclaim memory that contained the PASM code of various drivers for use as additional Spin code space.
This goes some way to answering your question - yes, I could now generate C code to be executed "in cog". In fact, it would be possible to allow the program to specify which coding technique to use "on the fly" (i.e. interpreted, COG PASM, or LMM PASM). Or it could specify this per function. However, this would really complicate many other things - not least the BlackBox source level debugger - so I doubt this will be something I will be adding in the short term.
Interesting. I can see debug would get complicated quickly, but this could still be useful. A user could prove the code first in a slower mode, and then, once they know where the real bottlenecks are, move that small section into a COG?
... - in neither case does this 2k detract from your ability to use all the available Hub RAM - in Spin because the interpreter is loaded direct from EEPROM
Did you mean ROM ?
I'd still like to know all the resources being called on: if some is 'passed through' to a COG, even if you get that Main RAM back, then that should be included in the usage maps. [ROM / EEPROM / COG / RAM]
{ ROM is only there in a Prop 1 }
I think doing this also helps to clarify to users, exactly what is happening.
Ross, can you post the C source that you used? It would be interesting to compare the results produced by PropGCC and Zog.
Just use the C code for the graphics library and the demo program that is in the current Catalina release. That version is close enough for comparison purposes. The main change you need to make is to edit the macros I previously used to scale the screen resolution in graphics_demo.c - i.e. change this ... ... to this ...