Empty Loop Benchmarks

David Betz · 2011-07-16 20:04

RossH wrote: »
Hi all,

I agree with Heater - many C compilers would simply optimize the empty loop out altogether, which makes this a particularly unreliable benchmark for comparing real-world program performance. However ...

Catalina: 751 microseconds.
#include <catalina_hmi.h>
#include <catalina_cog.h>

void main() {
   long a, b;
   int i = 1;
   int j = 1000;
 
   a = _cnt();
   for (; i <= j; i++);
   b = _cnt();
   t_printf("Ticks = %d\n", b - a);
   t_printf("Microseconds = %d\n", (b - a)*1000/(_clockfreq()/1000));
   while (1);
}
Compiled for a C3 with the command:
catalina loop.c -lci -D C3 -O3
Ross.

Is that with code in hub memory or in flash?

RossH · 2011-07-16 20:11

David Betz wrote: »

Is that with code in hub memory or in flash?

Hi David,

751 microseconds is executing from Hub RAM.

When recompiled to execute from XMM it takes 5210 microseconds (5.2ms) - still about twice as fast as Spin. The command I used was:

catalina loop.c -lci -D C3 -O3 -D LARGE

EDIT: Sorry, David - the above results are for executing from SPI RAM. not SPI Flash. To execute from Flash I have to use the cache, which slows things down a bit further - 9156 microseconds - i.e. about the same speed as Spin. The command I used for this was:

catalina loop.c -lci -D C3 -O3 -D LARGE -D FLASH -D CACHED_1K

Ross.

CarlJacobs · 2011-07-16 23:04

For JDForth the results for the empty loop are:

1000 0 DO LOOP

in 1.4 milliseconds.

1000 0 DO 1 +LOOP

in 2.4 milliseconds.

Another simple test is calculating Fibonacci numbers:
For Spin:

PRI Fibo(n) : val
  if n =< 1
    val := n
  else
    val := SpinFibo(n-1) + SpinFibo(n-2)

Calculates Fibo(25) in 6773 milliseconds

For JDForth:

: Fibo ( n -- n )
  DUP 1 > 
  IF DUP 1- Fibo1 SWAP 2- Fibo1 + THEN
;

Calculates 25 Fibo in 1529 milliseconds.

prof_braino · 2011-07-17 07:10

jazzed wrote: »

I can't imagine maintaining a substantial sized program in forth.

I just noticed this comment being specific to forth. Not to subvert the thread (I'll remove the post if need be) but this is kind of important in the grand scheme of things:

Any program or application with sufficient process (documentation through testing) can be maintainable;
Any program or application with insufficient process is more likely to be a nightmare.

I wrote the of a user interface for a building control application (control points, alarm point monitoring) running forth, which spanned multiple buildings on multiple campuses on multiple sites. Very huge and complex for its time, couldn't be done using any other existing tools in the time allotted, never had a problem to my knowledge, and was maintained by the client.

I've done SQA work (much more recently) for many clients with products NOT written in forth (C mostly, but only because its so common) some of which have more bugs than functions, and "fixing" one function breaks another. This is mostly due to conflicting requirements, and poor process management, and neglect of suitable tests. This has NOTHING to do with the choice of language or environment.

While a particular language may "lend" itself well to a particular class of application, the language ultimately has very little to anything regarding the success of the effort. It all comes down to the skill of the people managing and engineering the project. Poor process yields result inversely proportional to the complexity of the project. Neither C nor forth nor Erlang can save us from this. I don't know much, but I'm pretty sure about this process stuff.

jazzed · 2011-07-17 07:31

prof_braino wrote: »

While a particular language may "lend" itself well to a particular class of application, the language ultimately has very little to anything regarding the success of the effort. It all comes down to the skill of the people managing and engineering the project. Poor process yields result inversely proportional to the complexity of the project. Neither C nor forth nor Erlang can save us from this. I don't know much, but I'm pretty sure about this process stuff.

I agree with most of what you say. Still knowing the language is key to successfully using it

I've zero interest in learning a language that is so different from all the others languages I know.

David Betz · 2011-07-17 08:00

jazzed wrote: »

I agree with most of what you say. Still knowing the language is key to successfully using it
I've zero interest in learning a language that is so different from all the others languages I know.

I have almost the opposite opinion. Why bother learning a language that is almost the same as one you already know unless you *need* to learn it to get something done. The languages I find the most interesting are the ones that take a different approach and are nothing like what I'm used to.

prof_braino · 2011-07-17 09:50

Ariba wrote: »

The Problem with an EEPROM benchmark is that the faster languages are too fast for the EEPROM. They need some delays in the read loop to not violate the max. EEPROM read clock. So the benchmark result is more for the EEPROM than for the language.

Oh, I see now. Thanks for clarifying. I'll get the hang of this yet.

I've zero interest in learning a language that is so different from all the others languages I know.

Me, I'm just lazy. Although I might eventually try J, which is Ken Iverson's "corrections" to to his APL, this might be fun. But not today.

As a benchmark, FFT is still too difficult for me, I haven't gotten my head around the fourier for dummies thread.
Nick Lordi did Fast Hartly, supposed to be similar to FFT but doesn't use "complex number"; and is already posted in forth and spin. Is anybody care to take a stab at that?

The fibonacci looks simple enough for me to try, I'm looking into that. I'm interested to compare to JDForth, which I considered but never tried due to costs.

Oh, thanks for the Catalina number, that is really interesting.

Since cnt is updated every cycle, would it be better to list the results in cycles instead of milliseconds or micro seconds? That way, we don't have to worry if somebody has a different crystal, etc?

jazzed · 2011-07-17 11:49

prof_braino wrote: »

Since cnt is updated every cycle, would it be better to list the results in cycles instead of milliseconds or micro seconds? That way, we don't have to worry if somebody has a different crystal, etc?

The standard crystal shipped on Parallax products is 80MHz. Let's stick with that.

FIBO is interesting, but it is more a measure of stack performance than anything for recursive functions. Most "language people" here have declared it not very interesting, but it is much more interesting than loop 1000.

I agree that FFT is difficult, but there are a few examples from SPIN/PASM that seem to work fine.

It all comes down to being able to do something useful in a reasonable amount of time. If folks like Forth, then that's the language they will choose. Same goes for C or some variation of BASIC.

prof_braino · 2011-07-17 15:41

jazzed wrote: »

It all comes down to being able to do something useful in a reasonable amount of time. If folks like Forth, then that's the language they will choose. Same goes for C or some variation of BASIC.

Jazzed, you give the impression that someone is trying to change your mind. Nobody wants you to do anything you don't want to do, please stick to what you do best and are comfortable with.

The purpose of this thread is to get a feel for what each tool is capable of. If a kid asks "Can I do X on the prop in basic?", it is handy to know that the overhead for just the loop is 33 ms. If a kid asks "can I do X in less than 4 microseconds on the prop? ", its handy to know that the overhead for the loop in pure assembler is 5 microseconds. Its all about being able to select the appropriate tool. In a case where no tool is declared the "favorite, forsaking all others", this might be useful information. But nobody wants to force anything you. Chill, baby! We're cool, dig the rebop.

Nobody really cares what you tool you use personally, as long as you do neat stuff.

mindrobots · 2011-07-17 15:43

This was never intended to be a serious language benchmark. Just a fun little example to see the same code in a bunch of different languages.

The various discussions that broke out were interesting (as always).

Benchmarking is always challenging to make sure you are benchmarking the proper components and comparing like to like (PC way to avoid the apples/apples, oranges/oranges controversy).

As jazzed stated, the goal is to write code to do something useful - loop 1000 is not useful and the ultimate optimization is to just not do it!

As for learning languages, since I'm not a programmer to put bread on the table, I like trying/learning different language families to play and expand my tiny head!

RossH · 2011-07-17 15:50

jazzed wrote: »

FIBO is interesting, but it is more a measure of stack performance than anything for recursive functions.

Fibonacci is mainly interesting for highlighting the fact that some prop compilers don't fully implement recursion. :frown:

While proposing new benchmarks is fine, we should also point out for newcomers that we have already had several threads on this topic, with results available for many languages and compilers.

We have results for all of the following benchmarks, which are (or perhaps "were" is more accurate) accepted as "industry standard":

Dhrystone
Whetstone
FFT

For more details (and results) see here and here.

Ross.

jazzed · 2011-07-17 21:24

Since FIBO and FFT have been mentioned ...

Here are some new numbers for xbasic on C3 and SpinSocket-Flash modules.

HUB memory 80MHz summary:

loop 1000 15ms
fibo(25) 6482ms
FFT (no PASM) 3101ms

External C3* 1MB Flash 80MHz summary:

loop 1000 21ms
fibo(25) 8643ms
FFT (no PASM) 4574ms

HUB memory 96MHz summary:

loop 1000 12ms
fibo(25) 5401ms
FFT (no PASM) 2584ms

External SpinSocket 4MB Flash 96MHz summary:

loop 1000 17ms
fibo(25) 7202ms
FFT (no PASM) 3820ms

SPIN running from HUB at 80MHz

loop 1000 9ms
fibo(25) 6069ms
FFT (no PASM) 1787ms

FFT is heater_fft_2.0 with SPIN BUTTERFLIES

*C3 has 1MB of code space, SS or SpinSocket-Flash has 4MB of code space.
C3 Flash is simple SPI flash. SpinSocket-Flash is a byte-wide SPI flash.

The xbasic language is designed for BASIC users to run large programs on cheap hardware.
xbasic is a compiled BASIC language dialect that doesn't use line numbers.

C3 and SpinSocket-Flash keeps programs in external Flash memory and data in HUB.
Hardware variations are added to xbasic by specifying a compliant driver in the xbasic.cfg file.
VM and driver files are loaded at startup and do not consume HUB code space.

prof_braino · 2011-07-18 05:41

RossH wrote: »

Fibonacci is mainly interesting for highlighting the fact that some prop compilers don't fully implement recursion. :frown:

I must doing Fibonacci wrong. I don't know how to implement recursion, though I have read that it is do-able in forth; usually there isn't enough memory to do much memory, which is the big reason why forth was chosen in the first place. What i was doing as similar to CarlJacobs:

Start with two Fibonacci's (on the stack)
Copy the top one
Add the two original numbers
Leave the result (on the top of the stack, with the previous top one under it)

It only uses two stack memory locations, and calculates up to the 47th Fibonacci, which is the last one that fits in 32bits signed long.

Am I doing this wrong?

(Sorry, didn't get to post code or results, kids had to go to the pool)

Also, do we really care about compiler benchmarks? I thought execution benchmarks were what matters. Or do I understand this incorrectly?

Heater. · 2011-07-18 06:19

The fibo test as we have seen here in C and Spin many times is purposely written to use recursion.
That is to say for any element of the fibionacci series the fibo() function is going to add two numbers. Each one of those two numbers is itself a member of the fibonacci series so fibo() calls itself twice to get them. Until you get down to the beginning of the series and fibo() can just return 1.
Of course fibo() can be written to not use recursion but that is not what we want here.
Whilst this is a very recursive algorithm it doesn't not use such a huge amount of stack.
I leave it as an exercise for the reader to determine how much stack fibo(26) uses, for example:)

Heater. · 2011-07-18 06:28

prof_braino,

Also, do we really care about compiler benchmarks? I thought execution benchmarks were what matters. Or do I understand this incorrectly?

I'm not sure what you mean here.

When benchmarking compilers people sometimes worry about how quickly the compiler converts source code into executable binary. For large programs on slow machines that can be a real issue as anyone who developed software for early micros will remember and makes your development experience miserable.

However here we don't care, Prop programs are small and the PC's we develop on are fast. I can't see anyone getting frustrated by the compile times.

So what we are talking about is execution speed of the resulting binary on the target hardware.

If my compiler generates twice as much code to do the same job as yours it may well execute my program half as fast on the same machine. So you may well want to benchmark your compilers. Can I do this application in C++ using GCC and the Zog interpreter, or do I need to go to LMM and Catalina C or do I need to go to some BASIC generating raw PASM or do I have to hand craft the PASM myself?

David Betz · 2011-07-18 07:33

jazzed wrote: »

The xbasic language is designed for BASIC users to run large programs on cheap hardware.
xbasic is a compiled BASIC language dialect that doesn't use line numbers.

If anyone is interested in this strange xbasic language that Steve keeps talking about you can find it checked into Google Code. The project name is 'xbasic'. The loader currently only works under Windows because I haven't debugged the serial port code for Mac OSX or Linux yet. However, the compiler itself will run on any platform that supports an ANSI C compiler. I've even run it on the Propeller compiled with Catalina C but it was very slow because it requires external memory and the SPI flash/SRAM on the C3 is pretty slow.

http://code.google.com/p/xbasic/

jazzed · 2011-07-18 07:38

Here's the SPIN FIBO code I've been using. Translate it to your favorite language.

{{
FiboTest.spin
}}
con
  _clkmode = xtal1 + pll16x ' 5MHz crystal
' _clkmode = xtal1 + pll8x ' 10MHz crystal
  _clkfreq = 80_000_000

obj
sx : "FullDuplexSerial"

pub main
  sx.start(31,30,0,115200)
  waitcnt(_clkfreq+cnt)
  sx.str(string($d,"Spin FIBO Test",$d))
  repeat
    'sx.rx ' press enter to continue
    fibotest(0,25)

pub fibotest(begin,end) | n, t1, t2

  repeat n from begin to end
   
    sx.str(string($d,$a,"FIBO "))
    sx.dec(n)
    
    t1 := cnt
    result := fibo(n)
    t2 := cnt-t1

    sx.str(string(" = "))
    sx.dec(result)
    sx.tx(" ")
    sx.dec(t2/(clkfreq/1000))
    sx.str(string("ms"))
     
    sx.tx($d)
     
pub fibo(num)
  if (num < 2)
    return num
  else
    return fibo(num-1) + fibo(num-2)

Here it is translated to xbasic.

REM ===============================================
REM basic fibo test
REM ===============================================

REM ------------------------------
REM Need a deeper stack for fibo
REM
option stacksize=256

include "print.bas"
include "propeller.bas"

def fibo(n)
    if n < 2 then
        return n
    else
        return fibo(n-1) + fibo(n-2)
    end if
end def

def test(count, endcount)
    dim result
    dim ms, ticks
    dim start
    dim passed

    do while (count <= endcount)
        print "FIBO( ";count;" )",
        start = CNT
        result = fibo(count)
        passed = CNT
        ticks = (passed-start)
        ms = ticks/(clkfreq/1000)
        print " = ", result, " ("; ticks; "  ", "ticks)", " ("; ms; " ", "ms)"
        count = count + 1
    loop
end def


REM ===============================================
REM main program loop
REM

print
do
    dim ticks
    ticks = CNT
    test(0,25)
    ticks = CNT-ticks
    print "Total run time: "; ticks/(clkfreq/1000); " ms"
loop

REM ===============================================

My original xbasic fibo tests used a slightly different algorithm that produced much better results.

Empty Loop Benchmarks

Comments