Compiler Benchmarks

heater · 2010-07-26 13:26

After ferreting around, I find my original statement about a pulled-in library routine for xxtea was incorrect. The btea function is totally self-contained and 369 (decimal) bytes in size.

See attachment.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Dave Hein · 2010-07-26 14:00

btea results for CSPIN

No printing: 416 bytes
Printing: 4724 bytes (using CLIB 1.01)
btea: 324 bytes
Cycles: 985,120

As usual, I had to modify the source code to handle features that CSPIN doesn't support, such as do loops and compound statements. I verified that it decodes the 44-byte message correctly.

Ross and heater, could you supply cycle times for Catalina and Zog? I would expect Catalina to be substantially faster.

Dave

heater · 2010-07-26 14:39

Dave, are you counting clock cycles at 80 MHz (or whatever), or instruction cycles?

That's an impressive result.

Just now I've totally broken my version of Zog, so it will take a while to get an older one up and running again.

Dave Hein · 2010-07-26 15:50

I am counting system clock cycles by reading cnt before I call btea and subtracting that from the value of cnt after the call. The cycle count should be independent of the processor clock frequency. At 80 MHz it works out to about 12 ms.
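
The before/after counter measurement described above can be sketched in portable C. This is only an illustration: the function names are hypothetical, and the two counter readings are passed in as plain values rather than read from the Propeller's cnt register.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two reads of a free-running 32-bit counter.
   Unsigned subtraction wraps modulo 2^32, so the result is still
   correct if the counter rolled over once between the two reads. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a tick count to milliseconds for a given clock rate in Hz.
   The tick count itself does not depend on the clock rate; only this
   conversion does. */
double ticks_to_ms(uint32_t ticks, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    return (double)ticks * 1000.0 / (double)clock_hz;
}
```

With the figures from this thread, ticks_to_ms(985120, 80000000) comes out at roughly 12.3 ms.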

If I understand correctly, the inner loop of btea runs 100 times for a block size of 11 longs. There are approximately 20 operations in the inner loop, for a total of about 2,000 operations. I would expect PASM to do this in 60,000 cycles or less with all the variables in hub RAM. If most of the variables are in cog RAM, PASM might be able to do it in fewer than 20,000 cycles. Spin is about 40 to 80 times slower than PASM, so 985,120 seems like a reasonable number for Spin.
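
The figure of 100 inner-loop passes can be checked with a small sketch. It assumes the attached btea follows the published reference XXTEA code, where the number of rounds is 6 + 52/n and the inner for-loop runs n-1 times per round. The function names are hypothetical, and the 20 operations per pass is the rough estimate from this post, not a measured value.

```c
#include <assert.h>

/* Outer rounds in the reference XXTEA btea: 6 + 52/n (integer division). */
unsigned xxtea_rounds(unsigned n)
{
    assert(n > 1);
    return 6 + 52 / n;
}

/* Passes through the inner for-loop: n-1 per round. The reference code
   also mixes one more word per round outside the loop, which is not
   counted here. */
unsigned xxtea_inner_iterations(unsigned n)
{
    return xxtea_rounds(n) * (n - 1);
}

/* Rough operation count for a given per-pass estimate. */
unsigned long estimate_ops(unsigned n, unsigned ops_per_iteration)
{
    return (unsigned long)xxtea_inner_iterations(n) * ops_per_iteration;
}
```

For n = 11 this gives 10 rounds, 100 inner-loop passes and, at 20 operations per pass, about 2,000 operations.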

Dave

jazzed · 2010-07-26 16:33

Very nice results, Dave. We are closer to a real answer now :)

3x smaller with ZOG and more than 3x smaller with CSPIN.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Pages: Propeller JVM

heater · 2010-07-26 18:57

Results just in:

Zog xxtea cycles (Prop counter ticks) for block size of 11 = 1,069,390.

So about 8 percent behind CSPIN. Actually better than I expected.

heater · 2010-07-26 20:52

Seems the xxtea test needs rephrasing a bit. As written, the thing is sensitive to endianness.

That is, the key is given as a string and is not converted to four longs in an endian-safe manner.
Then the output is 11 longs, which again are not printed in an endian-safe manner.

Result being that Zog totally gets it wrong.
I'll do some fixes on the test when I have a moment.
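
One way to make the key conversion and the output independent of host byte order is to build each long from individual bytes with shifts instead of casting the string pointer. The sketch below is an assumption about how such a fix could look, not the change actually made to the test; the helper names are hypothetical, and little-endian order is chosen only because that is the Propeller's native order (the ZPU that Zog emulates is big-endian).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit word from four bytes, least significant byte first.
   The result is the same on little-endian and big-endian hosts. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Split a 32-bit word into four bytes, least significant byte first. */
void store_le32(unsigned char *p, uint32_t w)
{
    p[0] = (unsigned char)(w & 0xFF);
    p[1] = (unsigned char)((w >> 8) & 0xFF);
    p[2] = (unsigned char)((w >> 16) & 0xFF);
    p[3] = (unsigned char)((w >> 24) & 0xFF);
}

/* Convert a 16-byte key string into the four 32-bit words btea expects. */
void key_from_bytes(const unsigned char *key_bytes, uint32_t key[4])
{
    assert(key_bytes != NULL);
    for (size_t i = 0; i < 4; i++) {
        key[i] = load_le32(key_bytes + 4 * i);
    }
}
```

Printing the 11 output longs through store_le32, one byte at a time, would give the same byte stream on every target.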

RossH · 2010-07-26 23:04

@all,

Looks like I need to get back to work on that Optimizer!

3x byte code size is a little worse than I had expected, although (having had a look at btea) I can see why - btea is a 'compute intensive' function.

I've not had a chance to generate execution times yet - I'll post them when I do.

Ross.

heater · 2010-07-26 23:47

"Compute intensive" - Isn't that what computers are for?

I suggested xxtea because it is full of shifting and adding and array indexing and looping. I was looking for typical MCU stuff.

We should find a handful of other code samples representative of different use cases.

RossH · 2010-07-27 00:17

@heater,

I'm not complaining - I think it is a much better benchmark (and exercises much more of the C language) than 'hello world' does!

However, 'btea' makes no function calls, which does make it slightly 'atypical' for C. It also uses no globals, no structures, no chars, no 'switch' , no 'goto', no 'return', no 'break', no 'continue' etc etc. These are quite typical of MCU programs as well.

I agree that we need to find benchmarks that execute the other parts of the C language.

Ross.

jazzed · 2010-07-27 01:14

RossH said...
I'm not complaining - I think it is a much better benchmark (and exercises much more of the C language) than 'hello world' does!

Gee Ross, I did suggest that other programs would be necessary.
At least the hello world experiments revealed lots of interesting stuff.
Whatever the result becomes will be posted in your thread to close the issue.
Until better data is available (if any) the answer is 3x.

Cheers.
--Steve

Dave Hein · 2010-07-27 03:20

The CSPIN version of btea generated 277 bytecode instructions. 225 were single-byte codes, 41 used 2 bytes, 8 used 3 bytes and 3 used 5 bytes. That works out to 1.25 bytes per instruction. PASM instructions are 4 bytes long, which is a little more than 3 times the size.
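
The bytes-per-instruction arithmetic can be reproduced with a short sketch. The struct and function names are hypothetical; the counts are the ones quoted in this post.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the breakdown: how many instructions have a given length. */
struct opcode_bucket {
    unsigned count;
    unsigned bytes_each;
};

unsigned total_instructions(const struct opcode_bucket *b, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += b[i].count;
    }
    return total;
}

unsigned total_bytes(const struct opcode_bucket *b, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += b[i].count * b[i].bytes_each;
    }
    return total;
}

double mean_bytes_per_instruction(const struct opcode_bucket *b, size_t n)
{
    unsigned instructions = total_instructions(b, n);
    assert(instructions > 0);
    return (double)total_bytes(b, n) / (double)instructions;
}
```

Fed with {225, 1}, {41, 2}, {8, 3} and {3, 5}, this gives 277 instructions, 346 bytes and a mean of about 1.25 bytes per instruction, so a 4-byte PASM instruction is about 3.2 times that size.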

However, I bet that the Catalina code is much faster.

RossH · 2010-07-27 03:51

@jazzed,

Yes, I agree this is interesting stuff. And yes, the answer for btea is 3x. Let's see how the next few go.

I think Dave said he's implementing Dhrystone. That would be good.

Ross.

RossH · 2010-07-27 03:57

@Dave,

Thanks for the info.

Let me know when you get Dhrystone working. I have enclosed a version in this post. Is this the same as the version you looked at?

Ross.

heater · 2010-07-27 04:05

RossH would like to see a benchmark that uses globals, structures, chars, 'switch', 'goto', 'return', 'break', 'continue' etc etc.

It just occurred to me that there is one little program that has much of that and uses a lot of different operators, my C version of the ZPU byte code interpreter.

But to run that in a way that exercises all the features for a timing test, we would need to find a ZPU program for it to run that uses globals, structures, chars, 'switch', 'goto', 'return', 'break', 'continue' etc etc...

heater · 2010-07-27 04:20

Great Ross, that's Dhrystone V1.1, never seen it before. It looks much easier to deal with than the v2.1 Zog has been running. I'll have a go at building that.

RossH · 2010-07-27 04:20

@heater,

Don't laugh - I've done things only slightly less silly as part of Catalyst - I use Catalina to compile the P5 Pascal compiler which is written in C, then use P5 to compile a BASIC interpreter written in Pascal, then use the BASIC interpreter to run a program. The final program executes at about the rate of 3 BASIC statements per second. Now if I could only find a C compiler written in BASIC!

heater · 2010-07-27 04:37

That's: Catalina runs P5 Pascal runs BASIC runs a BASIC program.

Given that I have a ZPU interpreter written in C, you could have:

Catalina runs ZPU, runs P5 Pascal, runs BASIC runs a BASIC program.

I did read about a competition for the deepest stack of running emulators. I recall the winner did even better. Can't find a link for it now.

RossH · 2010-07-27 05:06

@heater,

Re Dhrystone 2.1 - I didn't suggest that because, while I can compile it, I can't run it in LMM mode, only in XMM mode - it's too large!

But purely for ad-hoc code size comparisons, how big is 2.1 when compiled for Zog?

Ross.

heater · 2010-07-27 07:20

dhrystone.bin from Dhrystone v2.1 for Zog is a svelte 17528 bytes :)

Zog wins that speed trial by default; the favourite could not even fit in the stadium :)

RossH · 2010-07-27 08:28

@heater,

That's more like it. Catalina's code size is about 2.5 times Zog's code size. This is closer to what I expected.

Any idea how much of Zog's size is actual Dhrystone 'code' and how much is 'driver' (I presume you're using a serial driver)?

I think the general answer may be between 2x and 2.5x. Getting Catalina under 2x will be difficult, but might occur in some 'atypical' cases.

Ross.

RossH · 2010-07-27 08:57

@All,

I have added the execution time of 'btea' for Catalina: 3.7 times faster than CSPIN and 4 times faster than Zog!

I reckon that it's still entirely possible that in a more general benchmark, we could see Catalina come in between 2 and 2.5 times the code size, but 4 times the speed!

Ross.

RossH · 2010-07-27 10:26

Dhrystone 1.1 results for Catalina ...

Dhrystone pulls in quite a lot of the standard C library, so I think the fairest thing to do is measure both the total file size and also the size of the actual Dhrystone code:

Binary file size: 27,176 bytes (including all drivers, plugins and libraries)
Dhrystone code size: 2,988 bytes (Proc0 .. Proc8, Func1 .. Func3)
Dhrystones/second (d/s): 1,923

For us veterans, that's somewhere between a Vax 11/780 (1646 d/s) and a Vax 11/785 (2083 d/s).

Just out of interest, it's worth pointing out that you could in fact run 8 Dhrystones simultaneously at no extra cost, for a combined Dhrystones/sec rating of 15,384 d/s. Unless I've got my maths wrong (which is entirely possible!) this makes the Propeller running Catalina faster than a CRAY-1A (13,888 d/s)!

heater · 2010-07-27 10:40

Faster than a CRAY-1A ! And Humanoido can run 40 of them in parallel !

Dhrystone 1.1 is going to take me a while to sort out.

One obvious test for an MCU like the Prop that we should have is how fast you can wiggle the pins. What is the maximum frequency that can be generated? And say bit banging a stream of serial data out through one pin and clocking some parallel data out over an 8 bit bus. Perhaps that's three different tests already.

I guess the code in each implementation may be a bit different depending on how access to pins is done but that can be forgiven in this case.

Zog is at a disadvantage here, he can't see any pins yet.

RossH · 2010-07-27 10:54

@heater,

No wonder - for a 1,000 foot high ice monster, wiggling those teeny-weeny pins must be darn near impossible!

Ross.

heater · 2010-07-27 13:05

Zog can do 5000 runs through Dhrystone v2.1 in about 10 seconds.

I don't have a working timer routine for dhrystone, I timed it by eye and stopwatch a while back.
How would that translate into DMIPS ?

A quick perusal of the list file shows dhrystone modules dhry_1.c and dhry_2.c take 2893 (decimal) bytes for code excluding any library functions they may use but including a small printf.
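
On the DMIPS question: the usual convention is to divide Dhrystones per second by 1757, the figure commonly taken as the score of a VAX 11/780 (a nominal 1 MIPS machine). The sketch below applies that convention to the stopwatch numbers above; the function names are hypothetical, and hand timing makes the result approximate at best.

```c
#include <assert.h>

/* Conventional reference score: Dhrystones per second for a VAX 11/780. */
#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0

/* Dhrystones per second from a run count and a measured duration. */
double dhrystones_per_second(unsigned long runs, double seconds)
{
    assert(seconds > 0.0);
    return (double)runs / seconds;
}

/* DMIPS by the usual convention: score relative to the VAX 11/780. */
double dmips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC;
}
```

5000 runs in 10 seconds is 500 Dhrystones per second, or roughly 0.28 DMIPS. Published results lists for individual Dhrystone versions quote somewhat different VAX 11/780 scores, so the ratio shifts slightly depending on which baseline is used.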

heater · 2010-07-27 14:27

Just compiled dhrystone v1.1 for zog. This gives me:
48K bytes as it uses printf
23K bytes if I change that to iprintf
10K bytes if I remove printing

Amazing, because the actual dhrystone code is only 840 bytes on its own.

Dave Hein · 2010-07-27 18:20

I had to make quite a few changes to get Dhrystone 1.1 to work under CSPIN, but I finally got it working. I verified that my changes were correct by compiling the original and the modified versions with the Watcom compiler. I got the same Dhrystone score on my PC with both versions.

CSPIN code running on an 80 MHz Prop produced 454 Dhrystones/second, which is 4.2 times slower than Catalina.

The code size without the printfs and the library functions is 1,020 bytes, which is 2.9 times smaller than Catalina.

The binary image with printfs and the other library functions is 5,508 bytes.

There are 10,724 bytes of VAR data that are not included in the binary file. The total executable size with this data is 16,232 bytes.

heater · 2010-07-27 18:56

Man, that's confusing. I was wondering how we know V1.1 is working correctly, as it has no output.
Are V1 and V2 Dhrystones comparable? Zog seems to turn in 500 Dhrystones per second on V2. If so, we are running quite close.

Is that the same as DMIPS?

I'll have to try and find time to get Dhrystone V1.1 working.

Meanwhile I wondered...we haven't included any floating point work yet.

I thought I'd suggest calculating some terms of the Euler series to kick off with. It has a balance of floating point multiplication, division and addition. The main questions here are:

1. How big are the functions used? See attached code.
2. How much floating point lib code gets pulled in to support it?
3. How fast does it go ?

Any use of a cog as a floating point coprocessor is forbidden :)

For zog:
1. Is 116 (dec) bytes including the main (no printing)
2. Is about 5700 bytes. (Changed all the floats to ints and took the difference in size as a measure).
3. Is unknown yet....
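
The attached code is not reproduced in this thread, so the following is only a guess at the kind of test meant. It assumes the series is the one for Euler's number, e = 1/0! + 1/1! + 1/2! + ..., and the function name is hypothetical. Each term costs one floating point multiplication, one division and one addition.

```c
#include <assert.h>

/* Sum the first `terms` terms of 1/0! + 1/1! + 1/2! + ...
   Single precision is used so that a target without an FPU has to
   pull in its software float library. */
float euler_series(unsigned terms)
{
    assert(terms > 0);
    float factorial = 1.0f;   /* 0! */
    float sum = 1.0f;         /* first term, 1/0! */
    for (unsigned k = 1; k < terms; k++) {
        factorial *= (float)k;      /* multiplication */
        sum += 1.0f / factorial;    /* division and addition */
    }
    return sum;
}
```

Ten terms are already enough to get within single-precision rounding of 2.71828, so a timing run would need to repeat the call many times to get a measurable interval.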

jazzed · 2010-07-27 19:22

heater said...
Any use of a cog as a floating point coprocessor is forbidden :)

Wow :)
It's probably easier to require that you use a cog than to ask others to add non-cog support.
