Some multi-language benchmarks
ersmith
It's been a while since the last benchmark thread, and a lot of things have happened since then. So I thought it might be interesting to test the performance of various compilers and interpreters that are in active development.
Of course there's always a debate about what makes a good benchmark, and benchmarks can be very misleading. But they can give a very rough picture of the performance of different solutions, for situations where performance matters. In other situations, of course, other things like ease of use, interactivity, documentation, and availability of help may matter more. So caveat emptor.
I'll follow up this post with a few basic benchmarks: toggling a pin, the Fibonacci benchmark, and some kind of compute-intensive benchmark like an FFT. Heater's fft_bench would be ideal, but I don't think there's a Forth version of it and my Forth skills are not up to porting it, so perhaps some other compute-intensive task might be used instead -- any suggestions? So far I've tested Spin, C (Catalina and PropGCC), Basic (PropBasic), Forth (PropForth and Tachyon Forth) and PASM. If your favorite language/compiler is missing, please feel free to try it out -- these benchmarks are pretty straightforward.
Comments
The theoretical limit of this is about 4000 cycles (assuming 1 instruction per toggle). In PASM one could get close to this by unrolling the loop. In the implementations below I've chosen to adopt the most natural looping construct for each language, so for the PASM I've counted the time for a basic djnz loop with one toggle per loop, plus the timing and subroutine call overhead.
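As a rough illustration of where the 4000 figure comes from, here is a minimal C sketch. It assumes 4 clocks per instruction (true of most Propeller 1 instructions) and uses a plain variable in place of the OUTA register so that it also runs on a PC. The names `toggle_pin` and `min_toggle_cycles` are made up for this sketch and are not taken from the attached sources.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed clocks per instruction (most Propeller 1 instructions). */
#define CLOCKS_PER_INSTRUCTION 4u

/* Lower bound in clocks if every toggle costs exactly one instruction. */
uint32_t min_toggle_cycles(uint32_t toggles)
{
    return toggles * CLOCKS_PER_INSTRUCTION;
}

/* Toggle the bits in `mask` on the register at `out`, `count` times.
 * `out` stands in for OUTA; on real hardware it would be the pin register. */
void toggle_pin(volatile uint32_t *out, uint32_t mask, uint32_t count)
{
    while (count-- > 0) {
        *out ^= mask;
    }
}
```

With 1000 toggles the bound works out to 1000 * 4 = 4000 clocks; any loop counter and branch add to that, which is why unrolling gets closer to the limit.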
I'm not very experienced with FORTH, and was unable to find an easy way to toggle a pin other than explicitly doing pinhi/pinlo. This means the loop only runs 500 times for the FORTH implementations, giving them somewhat of an advantage over the other ones (which run the loop 1000 times). So the FORTH results are skewed slightly in FORTH's favor; on the other hand, my relative inexperience with FORTH probably means that I'm missing some obvious way to improve the loop.
Compilers tested:
PropGCC CMM preview
Catalina 3.8
PropBasic 1.14 (appears to give the same results as PropBasic 1.27)
Tachyon Forth as of 9/30/2012
PropForth 5.0
fastspin 3.2.0
Here are the results in number of cycles to toggle a pin 1000 times (source code is attached to this post):
This test doesn't have as many compilers as the toggle test. PropBasic doesn't really support recursive functions (technically it does in LMM mode, but without local variables recursive functions are not really practical). PropForth accepted a recursive definition, but did not produce correct results. That may be the fault of my own inexperience with Forth rather than anything else. Tachyon Forth did allow recursion, but after fibo(10) the timing results became very suspicious, and after fibo(11) the answers came out wrong, presumably because of stack overflow. To add a bit of margin of error I've printed the results for fibo(8) only.
I've also included the size of the fibonacci function itself. Only the fibo() function is measured, not the runtime library or timing code. I've also added tests of Catalina's SMALL memory model and GCC's -mxmm memory model, both of which use external RAM and were tested on a C3 board.
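For readers without the attachment, the benchmark is presumably the usual doubly recursive definition. The sketch below assumes fibo(0) = 0 and fibo(1) = 1; the attached sources may differ in detail. The helper `fibo_calls` is hypothetical and only shows how quickly the number of calls grows.

```c
#include <assert.h>
#include <stdint.h>

/* Naive doubly recursive Fibonacci, assuming fibo(0) = 0 and fibo(1) = 1. */
uint32_t fibo(uint32_t n)
{
    if (n < 2) {
        return n;
    }
    return fibo(n - 1) + fibo(n - 2);
}

/* Number of fibo() invocations needed to evaluate fibo(n),
 * counting the outermost call. */
uint32_t fibo_calls(uint32_t n)
{
    if (n < 2) {
        return 1;
    }
    return 1 + fibo_calls(n - 1) + fibo_calls(n - 2);
}
```

Under these assumptions fibo(8) is 21 and takes 67 calls. The call count grows exponentially with n, while the deepest nesting grows only linearly (about n frames), so a stack overflow at small n points to a very small return or data stack.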
Absolute size and speed for fibo program; time is in cycles to compute fibo(8), size is in bytes (for both, smaller is better):
Relative Speed and Size for fibo(8) compared to SPIN
(higher speed is better, lower size is better)
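A small sketch of how such relative figures can be derived from the absolute ones. The definitions are an assumption based on the "higher speed is better, lower size is better" note: speed is taken as Spin cycles divided by cycles, and size as bytes divided by Spin bytes.

```c
#include <assert.h>

/* Relative speed versus the Spin baseline: above 1.0 means faster than Spin. */
double relative_speed(double spin_cycles, double cycles)
{
    return spin_cycles / cycles;
}

/* Relative size versus the Spin baseline: below 1.0 means smaller than Spin. */
double relative_size(double spin_bytes, double bytes)
{
    return bytes / spin_bytes;
}
```

For example, with made-up numbers, a compiler that needs 250 cycles where Spin needs 1000 has a relative speed of 4.0, and one that emits 80 bytes where Spin emits 40 has a relative size of 2.0.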
Code is attached.
results for PropBasic, Tachyon Forth, or PropForth (yet... hopefully
some interested readers will take up the challenge of porting the
benchmark!).
This time I have posted total size in bytes of the executable,
including run time libraries, although I have told the C compilers to
use the smallest libraries they have for I/O (-ltiny). This makes for
an interesting comparison of HUB memory used.
Notes:
(1) The PASM code size includes the size of the Spin wrapper and test
harness.
(2) The GCC results are reported with -Os (optimize for size) and -O2
(optimize for speed). Strangely, in this particular benchmark the -Os
code ends up bigger, probably due to -O2 being able to inline and then
eliminate some functions that are only called once.
(3) The Catalina LMM results are compiled with -ltiny -lci; Catalina CMM
adds -C COMPACT -O3.
(4) Size for Catalina is calculated as code size + cnst size + init
size, as that more accurately reflects HUB memory available. Size for
the others is the total size of the .binary file downloaded, and so
for GCC it includes the run-time kernel, which is 1940 bytes in LMM
mode and 1892 bytes in CMM mode.
(5) The JDForth result is as quoted in
http://forums.parallax.com/showthread.php?129972-fft_bench-An-MCU-benchmark-using-a-simple-FFT-algorithm-in-Spin-C-and-.../page6. The
size was not specified there, and I don't have JDForth to reproduce it.
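The size accounting in note (4) can be sketched in C as follows. The kernel sizes are the ones quoted in the note; the function names and any segment sizes passed in are hypothetical and are not measured results.

```c
#include <assert.h>
#include <stdint.h>

/* GCC run-time kernel sizes in bytes, as quoted in note (4). */
#define GCC_LMM_KERNEL_BYTES 1940u
#define GCC_CMM_KERNEL_BYTES 1892u

/* Catalina size as described in note (4): code + cnst + init segments. */
uint32_t catalina_size(uint32_t code, uint32_t cnst, uint32_t init)
{
    return code + cnst + init;
}

/* Size of a GCC .binary file with the run-time kernel subtracted,
 * for readers who want a figure closer to the Catalina one. */
uint32_t gcc_size_without_kernel(uint32_t binary_bytes, uint32_t kernel_bytes)
{
    assert(binary_bytes >= kernel_bytes);
    return binary_bytes - kernel_bytes;
}
```

Subtracting the kernel is only one possible way to put the GCC and Catalina numbers on a similar footing; the results in this thread report the GCC binary size unadjusted.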
Results sorted by speed:
Results sorted by size:
There is no way I am going to live long enough to understand Forth well enough to do this task.
I was not really thinking of benchmarking when I made that request; I was just curious to see how on earth it might look.
thepin = ~thepin
instead of
TOGGLE thepin
because TOGGLE sets the pin direction to output each time.
I like to see that the GCC -Os -mcog option gets so close to the PASM code. Is there any way to post the PASM code generated?
Bean
Yes, those are impressive.
I presume the generated PASM can be manually tweaked and included for an even smaller result, for those cases where the effort is warranted.
e.g. PInToggle looks so close in size that speed is likely to be the same, but the next example has a 1.3816x speed and 1.25x size delta. Pretty good, but a possible candidate for manual optimization.
General comment:
On benchmarks I think the size should include all resources used, otherwise novices might think Spin/Forth are tiny.
(On a Prop 2, Spin has a different loaded cost, and opens up compacted/expanded Spin options.)
You mention Catalina "5.8" instead of "3.8". Finally, for consistency it would be good to add code size to these results as well.
EDIT: That's strange - code just appeared. Thanks.
Ross.
Sure. The -S option to propeller-elf-gcc causes it to emit assembly language (in GAS syntax instead of PASM, but the two are very close). Here is the code for the toggle function: There's a bit of fussing around at the start to make sure the parameter is positive and hence suitable for djnz, which accounts for the slightly longer time for GCC versus PASM.
In LMM mode GCC does not generate djnz, so its loop is a bit bigger (using a cmp and branch). The loop does fit into the FCACHE area in the COG and so executes without LMM overhead, which explains why GCC's LMM time is close to PropBasic's COG time.
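The C shape that maps naturally onto a decrement-and-jump-if-not-zero instruction is a do/while count-down loop behind a guard. This is only a sketch of the idea: the function name is made up, a plain variable stands in for the pin register, and what the real generated code does for non-positive counts is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Toggle `mask` on `*out` exactly `n` times, doing nothing for n <= 0.
 * The do/while with a pre-decrement test has the same shape as a
 * decrement-and-jump-if-not-zero instruction such as djnz. */
void toggle_n(volatile uint32_t *out, uint32_t mask, int32_t n)
{
    if (n <= 0) {
        return;              /* guard: a count-down loop needs a positive count */
    }
    do {
        *out ^= mask;        /* one toggle per iteration */
    } while (--n != 0);
}
```

The guard runs once before the loop, which fits the observation that the GCC version is only slightly slower than the hand-written PASM.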
Oops, sorry -- thanks for catching that typo! I've fixed it in the original post.
I guess I can update the toggle code to include size, although frankly for such a trivial function none of the languages should emit very much code.
Eric
Yes, fair point.
That's a tough one. You're right that in the real world the resources required for the kernel matter. On the other hand, for tiny benchmarks like these the size of the benchmark code is dwarfed by the size of the kernels. In practice, once you've decided to use a given language/compiler, the kernel and run-time library overhead are basically fixed, and the cost of the program itself is incremental to that. Also, of course, the kernel may be stored in ROM (for Spin) or EEPROM (for some of the other languages) and so not really affect the HUB RAM availability.
For now I've chosen just to focus on the "incremental" code size of the program or function being benchmarked, but of course comparing total program sizes for various tasks would also be an interesting benchmark -- feel free to post your results!
I agree this is tough to handle properly, but I also agree that Eric has taken the best approach. I have also had the view for a while that for high-level languages what we should be counting is just the high-level language code size - or, to be more specific, the amount of Hub RAM that has to be dedicated to holding the application program that is to be executed. This is where languages differ considerably, and this is the statistic most useful to potential users of those languages when comparing one against another (although they may not at first realize it, especially if they look simply at the binary file size).
I could be convinced to also include const data size, since this often represents code trade-offs (a good example is jump tables often used to implement switch statements in C) - but you would have to separate this stuff out from real const data (e.g. constant strings) that are generally language independent - and this is often quite difficult. So including const data understates the difference between languages, and excluding it overstates the difference - but (on balance) I think excluding it is going to be more accurate.
I also think the kernel size should be excluded (with a clarification that should be noted below) as should anything else that gets loaded into Cog RAM and thereafter does not consume Hub RAM - i.e. plugins or drivers. On the Propeller (especially the Prop v1) the resource that everyone runs out of first is Hub RAM. Most programs do not use all the available Cog RAM, or all the available cogs. This is especially true now because most C code can be loaded using various multi-phase load techniques that load up all the Cog RAM before loading the Hub RAM with the application code to be executed (with Catalina this is possible even on a bare-bones Prop v1, provided you have a 64kb EEPROM - not sure about GCC, but I think it may be the same). Again, including this stuff understates the difference between languages, and excluding it may overstate the difference, but again on balance I think excluding it is going to give you a more accurate indication of the relative sizes of using different languages.
The one case where kernel (or driver, or plugin) size can make a difference is when a copy of the cog code has to be kept at run time to be dynamically loaded into cogs - here, Spin has a clear advantage when it comes to the kernel because its kernel is built into the ROM, whereas other languages typically have to keep a copy in Hub RAM. However, this type of dynamic loading is probably rare enough that it can simply be footnoted - including it for all cases would be quite misleading to most people (who will never write programs that use this capability).
Ross.
Oops, I did not write that very clearly. I was not expecting ONE number, but rather more columns added that show the base resource load -- and a % full of each resource would be useful too.
That way, the code size itself remains very clear, which is the main objective, but the added columns also make it clear what else is going on in the background, to actually run the reported test.
I'm partly thinking of someone new to all this, reading a benchmark report, and trying to get their head around the details.
The risk with excluding stuff is that one 'mental scaling' exercise readers will do is "How many copies of this can this chip run?"
If a resource that was used is excluded, that estimate will be badly skewed.
That's why I prefer a multi-column report, as you cannot hope to squash all this into a single number.
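One row of such a multi-column report could be modelled roughly like this in C. The struct and field names are hypothetical; the 32 KB hub RAM figure is for the Prop 1, and 496 longs is the usable cog RAM mentioned elsewhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define HUB_RAM_BYTES 32768u   /* Prop 1 hub RAM */
#define COG_RAM_LONGS 496u     /* usable longs per cog */

/* One benchmark result with the background resource load kept separate. */
struct bench_row {
    uint32_t cycles;        /* measured run time in clocks */
    uint32_t code_bytes;    /* incremental size of the benchmarked code */
    uint32_t kernel_bytes;  /* hub RAM held by kernel/runtime while running */
    uint32_t cog_longs;     /* cog RAM longs used by the kernel */
};

/* Percentage of hub RAM taken by code plus resident kernel/runtime. */
double hub_percent_full(const struct bench_row *r)
{
    return 100.0 * (double)(r->code_bytes + r->kernel_bytes) / HUB_RAM_BYTES;
}

/* Percentage of one cog's RAM taken by the kernel. */
double cog_percent_full(const struct bench_row *r)
{
    return 100.0 * (double)r->cog_longs / COG_RAM_LONGS;
}
```

Keeping `code_bytes` as its own column preserves the incremental figure, while the percentages answer the "how many copies fit" question.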
On the Propeller we use a lot of bit banging drivers so my proposal for a practical benchmark is:
Output an array of 512 bytes with a bit-banging SPI-Out routine as fast as possible.
Many applications need that, for example SPI displays, ADCs, serial RAM and so on. And the code is a good mix of hub variable access, subroutine calls and bit handling.
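To make the proposal concrete, here is a PC-runnable C sketch of a bit-banged SPI-out loop. It assumes MSB-first data sampled on the rising clock edge, which this post does not specify, and it drives a simulated receiver (`struct spi_sim`, a made-up name) instead of real pins so the output can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated SPI pins plus a receiver that samples MOSI on rising clock edges. */
struct spi_sim {
    int clk;            /* current clock level */
    int mosi;           /* current data level */
    uint8_t shift;      /* bits received so far for the current byte */
    unsigned bits;      /* number of bits collected in `shift` */
    uint8_t rx[512];    /* bytes captured by the receiver */
    size_t rx_len;
};

void spi_set_mosi(struct spi_sim *s, int level)
{
    s->mosi = level;
}

void spi_set_clk(struct spi_sim *s, int level)
{
    if (level && !s->clk) {                 /* rising edge: sample MOSI */
        s->shift = (uint8_t)((s->shift << 1) | (s->mosi & 1));
        if (++s->bits == 8) {
            if (s->rx_len < sizeof s->rx) {
                s->rx[s->rx_len++] = s->shift;
            }
            s->bits = 0;
        }
    }
    s->clk = level;
}

/* Shift one byte out MSB first: set the data pin, then pulse the clock. */
void spi_out_byte(struct spi_sim *s, uint8_t value)
{
    for (int bit = 7; bit >= 0; bit--) {
        spi_set_mosi(s, (value >> bit) & 1);
        spi_set_clk(s, 1);
        spi_set_clk(s, 0);
    }
}

/* Send a whole buffer, one subroutine call per byte. */
void spi_out_buffer(struct spi_sim *s, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        spi_out_byte(s, data[i]);
    }
}
```

On a Propeller the two pin setters would become writes to the output register, and the benchmark figure would be the clocks taken by `spi_out_buffer` for a 512-byte array.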
In Spin, possible code looks like this:
The results for this Spin code are:
11_732_816 clocks
code size: 96 bytes
Andy
When I run it:
The code size here includes the array and constants so that means besides the 512 bytes that there are another 104 bytes used in code etc. The size of Main is actually 71 bytes long including the call to the formatted print word .NUM.
HERE ' Main - .DEC 0071 ok
CODE @$5581 - bytes added = 0616 and 10879 bytes free
NAMES @$226C - bytes added = 0039 and 1564 bytes free
Is this SPIO module done in Forth or is it a PASM routine inside the Forth-Kernel ?
For sure if Spin uses a PASM SPI routine, then it is also much faster, but then it is more a PASM benchmark result.
Andy
Well then you must discount all references to PASM in any other benchmark! The SPI module is just 10 PASM instructions that help support fast I/O but since space is limited in the kernel (it doesn't use a separate PASM cog) there are 16 longs reserved for special PASM helper modules for repetitive functions. The whole point of Tachyon vs traditional Forths is to take into account the architecture of the Propeller and how it is typically used. There is also the SHROUT instruction which minimizes stack pushing and popping and so without the perfectly valid SPI module even this is what we get:
If this SPI module runs in the same cog as the kernel and can be replaced at runtime with other short PASM routines, then I also see this as a valid Tachyon benchmark result (but not a valid Forth result) - it's just a special feature of Tachyon, like the cache routines in GCC. I wish Spin could do this also (OK, something like that is doable with Spin Interpreter patching at runtime, but with a lot of overhead).
Andy
Wow, that is really cool that the compiler is "smart" enough to rearrange things to use DJNZ. I'm impressed.
Bean
Good grief so it was. And I even replied to you. It had totally slipped my mind.
I totally agree: in a language benchmark, the guts of the benchmark should not rely on parts written in a language that is not the subject of the benchmark.
I think we are in the right room for an argument here:)
Looking at the fft_bench in Forth we see that a great speed up can be achieved by coding the work of the inner most loop in PASM instructions. The same can be seen in the SPI code in this thread.
I now understand why guys have been touting the speed of Forth for decades: they have been cheating:) This is akin to using inline assembler in C.
"Ah", you might say, "In that case GCC is not allowed to use FCACHE in LMM code to secretly stash instructions into the kernel of the LMM interpreter where they run much faster".
I would disagree and say that FCACHE is perfectly acceptable because:
1) ALL the source code of the benchmark is still written in C.
2) The FCACHEed parts are still compiled to instructions by the compiler, not written by the author.
3) It's automatic, the compiler has decided what and how to FCACHE.
4) In Java and such they have "Just In Time" compilation which compiles Java byte codes to native machine instructions on the fly at run time. I don't think benchmarkers ever rejected that idea.
Anyway we can now add fft_bench to the results on this thread.
As for PASM being disqualified: of course you couldn't really disqualify it if it is part of the instruction set or library, even without having to delve into inline PASM. Otherwise Spin and every other language would be disqualified; they are all machine code at the core.
I purposely avoid having an assembler in Forth as that defeats the purpose of programming in Forth in the first place but well chosen instructions that avoid inline PASM are pursued and implemented as need dictates. Unfortunately 496 longs do limit what you can implement as pure VM instructions so that is why the bare-bones SPI instruction is paged into the cog as needed.
Besides speed there's code compactness, and it looks like the byte code languages (Spin and Tachyon Forth) win here. I suppose this should be obvious from the buzz I've been hearing about C on the Propeller, but you're effectively reducing hub RAM by a third by using C in LMM. For the cog model it's pretty close to PASM.
I don't disagree with that. At least when building your applications, if you have to cheat to get the job done then you have to. If you find yourself cheating too much, though, it might be a sign that you have the wrong approach, wrong language, wrong hardware architecture, wrong processor or something. Cheating has a price in making things overly complex, in destroying portability, maintainability etc.
However when it comes to benchmarks it's just not cricket:)
So it does and I use that instruction in the PASM version of fft_bench. I do not use it in the C version of fft_bench simply because there is no C language equivalent operator or standard function/macro. I could use inline assembler to make use of that instruction but that is no longer C, it's not portable, it's not in the spirit of the benchmark. Can't remember what happened in the Spin version now, does Spin have a bit reverse operator?
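For reference, a portable C routine for this kind of bit reversal might look like the sketch below. It is not taken from fft_bench; the name and the "reverse the low `bits` bits" interface are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the low `bits` bits of `x` (1 <= bits <= 32).
 * Bits of `x` above that range are discarded. */
uint32_t bit_reverse(uint32_t x, unsigned bits)
{
    uint32_t result = 0;

    assert(bits >= 1 && bits <= 32);
    for (unsigned i = 0; i < bits; i++) {
        result = (result << 1) | (x & 1u);  /* move the lowest bit of x into result */
        x >>= 1;
    }
    return result;
}
```

This stays within standard C, at the cost of a loop where a machine with a dedicated instruction needs a single operation.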
Here I disagree. Yes, it's all machine code at the core, but if you are comparing languages you should only be using the language under test to write the benchmark source code in. The whole point of these higher level languages is to be portable between architectures; the benchmarks should be portable too. Forth code should be runnable on my PC as well as on the Prop, for example, as C is.
Now that is a bit of a grey area, if Forth or C had a bit reverse operation defined as a macro or word in a standard library I would say it's OK to use it, even if that results in a single instruction on some machines or a long subroutine on others. Standards are there to aid portability of source so I accept that this can happen.
So, if for example you have a special Forth word for bit reversal in your Forth engine, that is OK with me as long as you have a similar Forth engine that runs the same benchmark code on my PC or elsewhere. Of course that Forth may not be "standard" Forth anymore.
Did I say we were in the right room for an argument?:)
That is true, and it is the price you pay for speed. However, both Catalina and propgcc support a compressed memory model (CMM) where the C is compiled to a more compact, bytecode-like binary rather than raw 32 bit instructions. The resulting code is somewhat faster than Spin and only a bit bigger, as far as I can tell.
Is anyone up to porting the JDForth version of FFT to Tachyon and/or PropForth? I started to take a stab at it, but my Forth-fu is weak and it's likely that even if I succeeded it would not be a very good port.
I don't disagree with that. At least when building your applications, if you have to cheat to get the job done then you have to. If you find yourself cheating too much, though, it might be a sign that you have the wrong approach, wrong language, wrong hardware architecture, wrong processor or something. Cheating has a price in making things overly complex, in destroying portability, maintainability etc.
I remember a really old movie about Christopher Columbus and how he was sitting at a dinner table asking the person next to him if they could make an egg stand up straight without any supports or anything. Well they couldn't but Chris showed em how by smacking the egg onto the table so that it "sat flat" by itself. Now that is cheating but having an optimized instruction set is sensible and despite the reference to cheating is anything but.
However when it comes to benchmarks it's just not cricket:)
So it does and I use that instruction in the PASM version of fft_bench. I do not use it in the C version of fft_bench simply because there is no C language equivalent operator or standard function/macro. I could use inline assembler to make use of that instruction but that is no longer C, it's not portable, it's not in the spirit of the benchmark. Can't remember what happened in the Spin version now, does Spin have a bit reverse operator?
So the problem is the benchmark and expecting everything to fit to it. Would it be a good idea for Chip to leave a lot of these "non-standard" instructions out for portability or standards sake?????
Here I disagree. Yes, it's all machine code at the core, but if you are comparing languages you should only be using the language under test to write the benchmark source code in. The whole point of these higher level languages is to be portable between architectures; the benchmarks should be portable too. Forth code should be runnable on my PC as well as on the Prop, for example, as C is.
I don't ever expect to port many applications I write for the Prop as the Prop is rather unique. Come on, you really expect Prop OBEX software to run on a PC!?
Now that is a bit of a grey area, if Forth or C had a bit reverse operation defined as a macro or word in a standard library I would say it's OK to use it, even if that results in a single instruction on some machines or a long subroutine on others. Standards are there to aid portability of source so I accept that this can happen.
Oh, what a tiny cage some people make for themselves
So, if for example you have a special Forth word for bit reversal in your Forth engine, that is OK with me as long as you have a similar Forth engine that runs the same benchmark code on my PC or elsewhere. Of course that Forth may not be "standard" Forth anymore.
May Tachyon never be standard Forth and given the nature of the Prop I very much doubt it could ever be anyway.
Did I say we were in the right room for an argument?:)
Absolutely, but this room is a little dry for a good argument, it needs something to wet the whistle!