
Compiler Benchmarks


Comments

  • Ale Posts: 2,363
    edited 2010-08-20 00:45
    A bit OT:

    @Bill can you give a bit more detail about which scope you bought?
    For the screen I'd say a DPO3000 by Tek (but the screen is 9" and not 7"), but the rest fits Tek's DPO2000 better (the resolution is 480x234)... Any hints? (I love oscis and I still want a better one than the one I have... and some hints about how good they perform would help me).
  • Heater. Posts: 21,230
    edited 2010-08-20 01:22
    God, I'm so jealous. The last scope I owned was a Tektronix, probably a 545B if I remember correctly. It was the size of a washing machine.

    http://www.thevalvepage.com/testeq/tek/545b/545b.htm
  • Toby Seckshund Posts: 2,027
    edited 2010-08-20 04:36
    Heater

    Perhaps we should have some flags added to this forum. You know, the sort that count thank-yous given, thank-yous received, being called a human being, and times being called a .........
  • Bill Henning Posts: 6,445
    edited 2010-08-20 09:56
    Hi Ale,

    Here is a link to the scope I bought:

    http://cgi.ebay.com/Uni-T-wide-screen-OSCILLOSCOPE-100MHz-UT2102CEL-1G-/170496579775?pt=BI_Oscilloscopes

    1Gs/sec, 100MHz bandwidth, max 3.9ns rise/fall time, 7" 800x480 screen

    So far, I REALLY like it. It's not a Tek, but it will do very nicely for my current projects - much better than my old 25MHz Tek analog scope, or my shiny new PropScope (which I also really like! ... love the DSO feature)

    It was easy to find a ground fault on an IC socket pin with it - the 0.4V DC offset on the ground pin was a dead giveaway :)

    I also really like the 600K sample buffer.

    There is a "print screen" feature that captures a jpeg of the 7" 800x480 screen.

    There are math functions, FFT etc., X-Y mode and much more.
    Ale wrote: »
    A bit OT:

    @Bill can you give a bit more detail about which scope you bought?
    For the screen I'd say a DPO3000 by Tek (but the screen is 9" and not 7"), but the rest fits Tek's DPO2000 better (the resolution is 480x234)... Any hints? (I love oscis and I still want a better one than the one I have... and some hints about how good they perform would help me).
  • RossH Posts: 5,520
    edited 2010-09-08 06:14
    I noticed some Fibonacci times in the VMCOG thread, so I thought I'd add Catalina's times here:

    Here is the fibo.c program:
    #include <time.h>
    #include <stdio.h>
    
    long fibo (unsigned long n)
    {
       if (n <= 1)
       {
          return n;
       }
       else
       {
          return fibo(n - 1) + fibo(n - 2);
       }  
    }
    
    int main()
    {
       long result;
       int n;
       clock_t time;
    
       printf("Press a key to begin\n");
       getchar();
    
       for (n = 0; n < 30; n++)
       {
          time = clock();
          result = fibo(n);
          time = clock() - time;
    
          printf ("fibo (%d) = %d (%dms)\n", n, result, time);
       }
    
       printf("Press a key to exit\n");
       getchar();
    
       return(0);
    }
    
    The command I used to compile the program (for a HYBRID) was:

    catalina -D HYBRID fibo.c -lci -D CLOCK

    With no optimization, the size of the code segment was 7892 bytes, and this is the output:
    fibo (0) = 0 (0ms)
    fibo (1) = 1 (0ms)
    fibo (2) = 1 (0ms)
    fibo (3) = 2 (0ms)
    fibo (4) = 3 (0ms)
    fibo (5) = 5 (0ms)
    fibo (6) = 8 (0ms)
    fibo (7) = 13 (0ms)
    fibo (8) = 21 (1ms)
    fibo (9) = 34 (1ms)
    fibo (10) = 55 (2ms)
    fibo (11) = 89 (4ms)
    fibo (12) = 144 (6ms)
    fibo (13) = 233 (9ms)
    fibo (14) = 377 (15ms)
    fibo (15) = 610 (24ms)
    fibo (16) = 987 (39ms)
    fibo (17) = 1597 (63ms)
    fibo (18) = 2584 (102ms)
    fibo (19) = 4181 (165ms)
    fibo (20) = 6765 (267ms)
    fibo (21) = 10946 (431ms)
    fibo (22) = 17711 (698ms)
    fibo (23) = 28657 (1129ms)
    fibo (24) = 46368 (1826ms)
    fibo (25) = 75025 (2954ms)
    fibo (26) = 121393 (4779ms)
    fibo (27) = 196418 (7733ms)
    fibo (28) = 317811 (12513ms)
    fibo (29) = 514229 (20246ms)
    
    With optimization (-O3), the size of the code segment was 6936 bytes, and this is the output:
    fibo (0) = 0 (0ms)
    fibo (1) = 1 (0ms)
    fibo (2) = 1 (0ms)
    fibo (3) = 2 (0ms)
    fibo (4) = 3 (0ms)
    fibo (5) = 5 (0ms)
    fibo (6) = 8 (0ms)
    fibo (7) = 13 (0ms)
    fibo (8) = 21 (1ms)
    fibo (9) = 34 (1ms)
    fibo (10) = 55 (2ms)
    fibo (11) = 89 (3ms)
    fibo (12) = 144 (5ms)
    fibo (13) = 233 (9ms)
    fibo (14) = 377 (14ms)
    fibo (15) = 610 (23ms)
    fibo (16) = 987 (38ms)
    fibo (17) = 1597 (60ms)
    fibo (18) = 2584 (98ms)
    fibo (19) = 4181 (158ms)
    fibo (20) = 6765 (255ms)
    fibo (21) = 10946 (413ms)
    fibo (22) = 17711 (669ms)
    fibo (23) = 28657 (1082ms)
    fibo (24) = 46368 (1751ms)
    fibo (25) = 75025 (2833ms)
    fibo (26) = 121393 (4583ms)
    fibo (27) = 196418 (7416ms)
    fibo (28) = 317811 (11999ms)
    fibo (29) = 514229 (19415ms)
    
    So in this case the Catalina optimizer gives a 12% improvement in code segment size and a 4% improvement in speed.
  • Heater. Posts: 21,230
    edited 2010-09-08 06:37
    Not bad.

    Zog does fibo(26) in 12288ms here, so about 2.5 times slower, which is excellent considering it is a byte-code interpreter.

    On the plus side, Zog's executable is only 3.5K, so less than half the size :) But then my fibo does not use printf. I have another 1K of redundant junk to remove from the executables.

    fibo is a really bad case for Zog. For more typical programs Zog is a factor of 2 or so slower than Spin, but the recursive nature of fibo drags it down to 4 or 6 times slower.

    How about Catalina running fibo from external RAM?
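    The point about recursion is easy to see if the same sequence is computed without function calls. Below is a minimal iterative fibo sketch in plain C (illustrative only - not code from this thread, and not what either benchmark measures): it does one addition per step, whereas the recursive version makes on the order of fibo(n) calls, and it is that call/return traffic the interpreter pays for.
     /* Illustrative sketch only: iterative fibo, same arithmetic, no recursion. */
     #include <stdio.h>
     
     long fibo_iter(unsigned long n)
     {
        long a = 0, b = 1, t;
        unsigned long i;
        for (i = 0; i < n; i++)
        {
           t = a + b;   /* one addition per step instead of ~fibo(n) calls */
           a = b;
           b = t;
        }
        return a;
     }
     
     int main()
     {
        int n;
        for (n = 0; n < 30; n++)
           printf("fibo (%d) = %ld\n", n, fibo_iter(n));
        return 0;
     }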
  • Leon Posts: 7,620
    edited 2010-09-08 08:27
    Try Ackermann's function A(m, n):

    http://en.wikipedia.org/wiki/Ackermann_function

    It's doubly-recursive and is used a lot for testing compilers; I've had a couple I've been beta-testing blow up on it.
  • lonesock Posts: 917
    edited 2010-09-08 08:32
    @heater: regarding a float lib for Zog, are you thinking just a subset of common ops (+-*/), with all the other stuff emulated? What is the minimum subset of "accelerated" operations needed?

    Jonathan
  • Heater. Posts: 21,230
    edited 2010-09-08 09:00
    Leon,

    I have been dreading the day when someone would suggest the Ackermann function as a benchmark :)

    http://rosettacode.org/wiki/Ackermann_Function#C
    #include <stdio.h>
    #include <sys/types.h>
    u_int ackermann(u_int m, u_int n)
    {
       if ( m == 0 ) return n+1;
       if ( n == 0 )
       {
           return ackermann(m-1, 1);
       }
       return ackermann(m-1, ackermann(m, n-1));
    }
     
    int main()
    {
      int m, n;
     
      for(n=0; n < 7; n++)
      {
        for(m=0; m < 4; m++)
        { 
           printf("A(%d,%d) = %d\n", m, n, ackermann(m,n));
         }
         printf("\n");
      }
    }
    

    This could take some time...
  • Heater. Posts: 21,230
    edited 2010-09-08 09:11
    lonesock:
    ....are you thinking just a subset of common ops (+-*/), with all the other stuff emulated?

    That was my original plan. Well almost.

    I thought at one point that we might have room for the four basic functions in the Zog Cog itself. Not so.

    Then I thought to borrow the four basic functions from the Catalina kernel and use LMM to run them in the Zog Cog. Still perhaps a reasonable approach if Cogs are to be preserved.

    Then I thought, why not just use the code from float32 in OBEX in its entirety and have a floating point "coprocessor" Cog.

    That last solution pulls in a lot of trig and log functions. The major missing items are the "arcxxx" functions, if I remember correctly. We can redirect all the C maths functions to those for running at max speed and just let normal ZPU code handle the rest.
    What is the minimum subset of "accelerated" operations needed?

    No idea, but as you see, I look at it the other way around: what do we get in that float32 COG? :)

    One could pull up another COG for the missing functions, as is done in the float object, but I'm not inclined to do so.
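    On the "arcxxx" gap: if atan2 and sqrt do end up among the accelerated float32 operations (an assumption - check the object), the missing inverse-trig functions can be written as ordinary code using standard identities, which is exactly the "let normal ZPU code handle the rest" idea. A minimal sketch in plain C (my_asin/my_acos are hypothetical names, not part of any library here):
     /* Illustrative sketch: inverse trig via identities, assuming atan2f()
        and sqrtf() are the (accelerated) primitives available. */
     #include <stdio.h>
     #include <math.h>
     
     float my_asin(float x)   /* asin(x) = atan2(x, sqrt(1 - x*x)), -1 <= x <= 1 */
     {
        return atan2f(x, sqrtf(1.0f - x * x));
     }
     
     float my_acos(float x)   /* acos(x) = atan2(sqrt(1 - x*x), x) */
     {
        return atan2f(sqrtf(1.0f - x * x), x);
     }
     
     int main()
     {
        printf("asin(0.5) = %f, acos(0.5) = %f\n", my_asin(0.5f), my_acos(0.5f));
        return 0;
     }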
  • K2 Posts: 693
    edited 2010-09-08 12:25
    Heater,

    Have you tested Zog with Ackermann's function? If so, with what results?
  • Heater. Posts: 21,230
    edited 2010-09-08 17:43
    K2,

    No, not yet.

    When I sober up I will be on it, just now I have to deal with Rosh Hashanah.
  • RossH Posts: 5,520
    edited 2010-09-09 02:04
    Here are the Catalina results for Ackermann.

    Here's the program:
    #include <stdio.h>
    #include <sys/types.h>
    #include <time.h>
    #define u_int int
    
    u_int ackermann(u_int m, u_int n)
    {
       if ( m == 0 ) return n+1;
       if ( n == 0 )
       {
           return ackermann(m-1, 1);
       }
       return ackermann(m-1, ackermann(m, n-1));
    }
     
     int main()
    {
      int m, n;
      int time;
    
      printf("Press a key to start\n");
      getchar();
    
      time = clock(); 
      for(n=0; n < 7; n++)
      {
        for(m=0; m < 4; m++)
        { 
          printf("A(%d,%d) = %d\n", m, n, ackermann(m,n));
        }
        printf("\n");
      }
      time = clock() - time;
    
      printf("ackerman took %d msec\n", time);
    
      printf("Press a key to exit\n");
      getchar();
       return(0);
     }
    
    Without optimization, compiled using the command:
    catalina ackerman.c -D HYBRID -lci -D CLOCK
    
    Code segment size = 8016 bytes

    Result:
    A(0,0) = 1
    A(1,0) = 2
    A(2,0) = 3
    A(3,0) = 5
    
    A(0,1) = 2
    A(1,1) = 3
    A(2,1) = 5
    A(3,1) = 13
    
    A(0,2) = 3
    A(1,2) = 4
    A(2,2) = 7
    A(3,2) = 29
    
    A(0,3) = 4
    A(1,3) = 5
    A(2,3) = 9
    A(3,3) = 61
    
    A(0,4) = 5
    A(1,4) = 6
    A(2,4) = 11
    A(3,4) = 125
    
    A(0,5) = 6
    A(1,5) = 7
    A(2,5) = 13
    A(3,5) = 253
    
    A(0,6) = 7
    A(1,6) = 8
    A(2,6) = 15
    A(3,6) = 509
    
    ackerman took 2854ms
    
    With optimization, compiled using the command:
    catalina ackerman.c -D HYBRID -lci -D CLOCK -O3
    
    Code segment size = 7040 bytes

    Result:
    A(0,0) = 1
    A(1,0) = 2
    A(2,0) = 3
    A(3,0) = 5
    
    A(0,1) = 2
    A(1,1) = 3
    A(2,1) = 5
    A(3,1) = 13
    
    A(0,2) = 3
    A(1,2) = 4
    A(2,2) = 7
    A(3,2) = 29
    
    A(0,3) = 4
    A(1,3) = 5
    A(2,3) = 9
    A(3,3) = 61
    
    A(0,4) = 5
    A(1,4) = 6
    A(2,4) = 11
    A(3,4) = 125
    
    A(0,5) = 6
    A(1,5) = 7
    A(2,5) = 13
    A(3,5) = 253
    
    A(0,6) = 7
    A(1,6) = 8
    A(2,6) = 15
    A(3,6) = 509
    
    ackerman took 2699ms
    
    So optimization reduced code size by 12% and increased performance by 5%.
  • RossH Posts: 5,520
    edited 2010-09-09 02:38
    As requested, here are Catalina results for Fibonacci executed from external RAM (on a RAMBLADE).

    Compiled with the command:
    catalina fibo.c -x5 -D RAMBLADE -D PC -lci -D CLOCK -O3
    
    Results:
    fibo (0) = 0 (0ms)
    fibo (1) = 1 (0ms)
    fibo (2) = 1 (0ms)
    fibo (3) = 2 (1ms)
    fibo (4) = 3 (1ms)
    fibo (5) = 5 (1ms)
    fibo (6) = 8 (2ms)
    fibo (7) = 13 (2ms)
    fibo (8) = 21 (2ms)
    fibo (9) = 34 (4ms)
    fibo (10) = 55 (7ms)
    fibo (11) = 89 (12ms)
    fibo (12) = 144 (19ms)
    fibo (13) = 233 (31ms)
    fibo (14) = 377 (51ms)
    fibo (15) = 610 (83ms)
    fibo (16) = 987 (134ms)
    fibo (17) = 1597 (217ms)
    fibo (18) = 2584 (352ms)
    fibo (19) = 4181 (569ms)
    fibo (20) = 6765 (921ms)
    fibo (21) = 10946 (1491ms)
    fibo (22) = 17711 (2412ms)
    fibo (23) = 28657 (3902ms)
    fibo (24) = 46368 (6314ms)
    fibo (25) = 75025 (10216ms)
    fibo (26) = 121393 (16531ms)
    fibo (27) = 196418 (26747ms)
    fibo (28) = 317811 (43278ms)
    fibo (29) = 514229 (70024ms)
    
    
    So about 3.5 times slower executed from XMM RAM than executed from Hub RAM.
  • Heater. Posts: 21,230
    edited 2010-09-09 02:48
    A surprisingly good result for Zog.

    The Ackermann binary is 3632 bytes; however, I'm not using printf, which would bloat it out to 17K. Code is attached.
    ZOG v1.6 (HUB)
    zpu memory at 0000008C
    A(0,0) = 1
    A(1,0) = 2
    A(2,0) = 3
    A(3,0) = 5
    
    A(0,1) = 2
    A(1,1) = 3
    A(2,1) = 5
    A(3,1) = 13
    
    A(0,2) = 3
    A(1,2) = 4
    A(2,2) = 7
    A(3,2) = 29
    
    A(0,3) = 4
    A(1,3) = 5
    A(2,3) = 9
    A(3,3) = 61
    
    A(0,4) = 5
    A(1,4) = 6
    A(2,4) = 11
    A(3,4) = 125
    
    A(0,5) = 6
    A(1,5) = 7
    A(2,5) = 13
    A(3,5) = 253
    
    A(0,6) = 7
    A(1,6) = 8
    A(2,6) = 15
    A(3,6) = 509
    
    ackerman took 6127ms
    

    Ackermann is really making Catalina squirm - only a bit more than twice the speed of Zog :)

    Am I cheating by using a 104MHz clock?
  • RossH Posts: 5,520
    edited 2010-09-09 03:06
    Hi Heater,

    I just recompiled Ackermann for the RAMBLADE (100MHz clock). The time comes down a bit to 2572ms. Taking into account the 4% difference in clock speed, Catalina is 2.4-2.5 times faster than Zog.

    Catalina is beginning to feel threatened!

    Ross.
  • Heater. Posts: 21,230
    edited 2010-09-09 03:15
    Wow, Catalina is running fibo from ext RAM about as fast as ZOG does from HUB.

    Is that everything in ext RAM or data/stack in HUB?

    Zog runs ackermann from ext RAM but the timer overflows by 819ms.
  • RossH Posts: 5,520
    edited 2010-09-09 03:25
    Heater,

    In these examples, it makes very little difference whether Catalina is running code and data from external RAM, or just code - because Catalina uses register (cog) or stack (hub) variables for nearly all data in either case.

    I just recompiled ackerman to run from external RAM. Compiled using the command:
    catalina ackerman.c -D RAMBLADE -lci -D CLOCK -O3 -x5
    
    Result is 10620 ms - about 4 times slower than Catalina from hub RAM, and about 1.7 times slower than Zog from hub RAM.
  • Heater. Posts: 21,230
    edited 2010-09-09 03:39
    That's right. Everything is happening on the stack.

    Catalina has the advantage of being able to have code in ext RAM but stack in nice fast HUB.

    Zog puts everything out in ext RAM. This is unlikely to change.
  • RossH Posts: 5,520
    edited 2010-09-09 04:05
    Yes, Zog is at a disadvantage here - thank goodness! I don't really fancy a 1,000 foot ice monster breathing too closely down my neck!
  • Heater. Posts: 21,230
    edited 2010-09-13 01:19
    RossH,

    I don't understand the results displayed by the Whetstone benchmark. Specifically I don't understand the results you presented for Catalina here:

    http://forums.parallax.com/showthread.php?t=124168&page=5

    In the results columns and the final "results to be loaded to spreadsheet" lines, we see mostly the same numbers for mflops1, mflops2, etc., whether using software floating point or two extra COGs as coprocessors.

    Only the MWIPS numbers are different.

    What does it all mean, and what is an MWIP anyway?
  • RossH Posts: 5,520
    edited 2010-09-13 02:51
    Hi Heater,

    You expect me to know? All I do is compile 'em, run 'em, and then compare 'em with other results! And then post them here if they make Catalina look good (or quietly ignore them if they don't :lol:).

    But here's my best guess ...

    If the results are the same when using 2 cogs (i.e. implementing floating point functions in 2 extra cogs) as when using no cogs (i.e. using software emulation for all floating point functions), then this probably just means that this particular part of the benchmark didn't actually use any of those functions.

    For example, check out the N4 (fixed point) and N5 (sin, cos etc.) results. N4 is the same in both benchmarks at 3.777s - this means that part of the benchmark is probably only using the basic functions (i.e. +, -, *, /), which are built into the kernel in all cases. But N5 is massively different at 42.593s vs 8.130s - which shows the benefit of using the extra cogs!

    I presume the "results to load to spreadsheet" are just different weightings of the N1 - N8 results (e.g. some with trig functions included, some without). What they actually mean is pretty arbitrary - as are all benchmarks taken in isolation. They only mean anything when compared with other results. For a complete set of posted results, go here.

    You can figure out more details by examining the Whetstone program itself.

    Ross.
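    A quick way to check which operations actually benefit from the coprocessor cogs, without wading through the Whetstone source, is to time one operation at a time with clock(), the same way the fibo and ackerman programs above do. A minimal sketch (illustrative only; it assumes the same clock()-in-milliseconds convention as the listings above, that a standard C math library is linked, and the loop count is arbitrary):
     /* Illustrative micro-benchmark sketch (not from the Whetstone source):
        time a loop of multiplies vs a loop of sin() calls to see which
        operations actually gain from a floating point coprocessor cog. */
     #include <stdio.h>
     #include <math.h>
     #include <time.h>
     
     #define LOOPS 10000
     
     int main()
     {
        volatile float x = 0.5f;   /* volatile so the loops are not optimized away */
        float y = 0.0f;
        int i;
        clock_t t;
     
        t = clock();
        for (i = 0; i < LOOPS; i++)
           y += x * 1.000001f;      /* basic op: in the kernel in all cases */
        t = clock() - t;
        printf("%d multiplies: %dms (y=%f)\n", LOOPS, (int)t, y);
     
        t = clock();
        for (i = 0; i < LOOPS; i++)
           y += sinf(x);            /* trig op: this is where extra cogs help */
        t = clock() - t;
        printf("%d sin calls:  %dms (y=%f)\n", LOOPS, (int)t, y);
     
        return 0;
     }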
  • Leon Posts: 7,620
    edited 2010-09-13 03:02
    MWIPS - Millions of Whetstone Instructions Per Second.

    The Whetstone benchmark was developed by English Electric-LEO-Marconi Computers whilst I was working for them in the 1960s. I worked in the lab at Kidsgrove where the prototype KDF9 mainframe was developed. The Whetstone compiler was developed for the KDF9, and the benchmark was developed to test it.

    http://www.findlayw.plus.com/KDF9/The%20English%20Electric%20KDF9.pdf

    IIRC, they never got the prototype to work properly, although the machines they shipped were OK. Several friends of mine spent endless days taking it apart and putting it together, trying to fix it.

    One was sold to the Chinese, ostensibly for weather forecasting, but it had a similar configuration to the machine that was supplied to the UKAEA for developing nuclear weapons (Project Egdon - someone must have liked reading Thomas Hardy's novels). The Chinese KDF9 had all the transistors painted black, as they were made in the USA. Egdon had a very large budget, if anyone ever wanted anything out of the stores for personal use it got charged to Egdon.
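    So the headline figure is just rated work divided by elapsed time. A rough worked example of what an MWIPS number means (the constants below are made up for illustration, not taken from the actual whets.c source):
     /* Rough illustration of the MWIPS calculation (made-up numbers). */
     #include <stdio.h>
     
     int main()
     {
        double whetstone_instructions = 100.0e6;  /* work rated for one pass   */
        double passes = 10.0;                     /* passes actually executed  */
        double seconds = 25.0;                    /* measured elapsed time     */
        double mwips = (whetstone_instructions * passes) / (seconds * 1.0e6);
        printf("score = %.1f MWIPS\n", mwips);    /* prints: score = 40.0 MWIPS */
        return 0;
     }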
  • Heater. Posts: 21,230
    edited 2010-09-13 03:05
    Ok I'm a bit slow this morning.

    All those numbers under the MFLOPS and MOPS columns can be found in the "data to be loaded to spreadsheet" lines, albeit with different names and in a different order.

    I was confused by the fact that the Whetstone version I have here has a calibrate loop at the start that tries to make the total run time about 100 seconds.

    This version does not get out of the calibrate loop on my Linux PC...
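    For reference, that kind of self-calibration is usually just a doubling loop: time one pass, and keep doubling the pass count until the measurement is long enough to scale up to the ~100 second target. A minimal sketch of the general pattern (illustrative only, not the actual Whetstone calibration code); one way such a loop fails to terminate is when the timer keeps reading zero or wraps:
     /* Illustrative sketch of a typical benchmark self-calibration loop
        (not the actual Whetstone code). The pass count doubles until a
        single timed run is long enough to scale up to the target time.
        If the timer always reads 0 (or wraps), the loop never exits. */
     #include <stdio.h>
     #include <time.h>
     
     static void one_pass(long n)
     {
        volatile double x = 1.0;
        long i;
        for (i = 0; i < n * 100000L; i++)
           x = x * 1.0000001 + 0.0000001;   /* stand-in for the real workload */
     }
     
     int main()
     {
        long passes = 1;
        double seconds = 0.0;
        double target = 100.0;               /* aim for ~100 s total, as above */
     
        while (seconds < 1.0)                /* calibrate: need >= 1 s to trust */
        {
           clock_t t = clock();
           one_pass(passes);
           seconds = (double)(clock() - t) / CLOCKS_PER_SEC;
           printf("passes = %ld, time = %.3f s\n", passes, seconds);
           if (seconds < 1.0)
              passes *= 2;                   /* too fast to measure: double up */
        }
        printf("scaled count for ~%.0f s: %ld passes\n",
               target, (long)(passes * target / seconds));
        return 0;
     }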
  • Heater. Posts: 21,230
    edited 2010-09-15 14:14
    Dhrystone results for Zog.

    Executable binary size = 13300 bytes.
    Dhrystones/second running from HUB = 555
    Dhrystones/second running from ext mem = 135

    For HUB running this is a bit more than one quarter the speed of Catalina, not bad. About the speed of a DEC PRO 380 11/73 or a 10MHz 68000 machine.

    For external RAM running it's half the speed of an IBM PC/XT (8088 at 4.77MHz).
    ZOG v1.6 (HUB)
    zpu memory at 0000008C
    Dhrystone(1.1) time for 5000 passes = 9
    This machine benchmarks at 555 dhrystones/second
    
    #pc,opcode,sp,top_of_stack,next_on_stack
    #----------
    
    0X0000E53 0X00 0X000017B0 0X00000C9B
    BREAKPOINT
    
    ZOG v1.6 (VM, No SD)
    Dhrystone(1.1) time for 5000 passes = 37
    This machine benchmarks at 135 dhrystones/second
    
    #pc,opcode,sp,top_of_stack,next_on_stack
    #----------
    
    0X0000E53 0X00 0X000017B0 0X00000C9B
    BREAKPOINT
    