Empty Loop Benchmarks

mindrobots · 2011-07-14 09:59

Simple benchmarking started by Bean's example here:

http://forums.parallax.com/showthread.php?123678-PE-Basic-Version-0.16-July-11-2011&p=1017831&viewfull=1#post1017831

Mostly as an exercise in my FORTH learning, I wondered what the same algorithm would look like in PropFORTH and then, of course, what the results would be. It simply grabs the CNT register before and after an empty 1-to-1000 loop and then prints the difference between the start and end times.

BASIC version ran in 33 milliseconds in PE-Basic:

10 a=CNT
20 FOR b=1 TO 1000
30 NEXT b
40 a=CNT-a
50 PRINT a/80000;" milliseconds."

If I did it correctly (I'm still at that stage in my learning), the FORTH version ran in 2 milliseconds in PropFORTH v4.5:

: timetest 
 cnt COG@ 
 1000 0 do loop 
 cnt COG@ 
 swap - 
 80000 / . 
 c"  milliseconds. " .concstr cr ;

Feel free to add your version in your favorite language. It will be fun to see the same simple program in different languages, if nothing else.
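
The `a/80000` and `80000 /` steps above both turn CNT ticks into milliseconds, which assumes an 80 MHz system clock (80,000 ticks per millisecond). A minimal sketch of that conversion, written in Python purely as neutral notation; the helper name is made up for illustration, and integer division is assumed, as in the Forth version:

```python
# Hypothetical helper: convert Propeller CNT ticks to whole milliseconds.
# Assumes an 80 MHz system clock, as the divisor 80000 in the posts implies.
CLKFREQ_HZ = 80_000_000

def ticks_to_ms(ticks, clkfreq_hz=CLKFREQ_HZ):
    """Whole milliseconds, truncated the way an integer divide truncates."""
    ticks_per_ms = clkfreq_hz // 1000   # 80_000 at 80 MHz
    return ticks // ticks_per_ms

print(ticks_to_ms(160_000))   # 2
print(ticks_to_ms(239_999))   # 2 (the fractional part is dropped)
```

Because of the truncation, a reported "2 milliseconds" can mean anything from 2.0 up to just under 3.0 ms, so the whole-millisecond figures in this thread are only coarse comparisons.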

jazzed · 2011-07-14 11:44

PropellerJVM.

33 milliseconds

Not bad for Java LOL

import stamp.core.CPU;
import stamp.core.Terminal;
import stamp.core.Timer;
import stamp.core.ScaledTimer16;

public class EmptyLoop
{
    public static void main()
    {
        int ticks = 0;
        int time  = 0;
        int n;
        // setup timer
        Timer clock = new Timer();
        ScaledTimer16 scaledTimer = new ScaledTimer16 (400, 100);
        // wait for user input
        while(!Terminal.byteAvailable())
            ;
        // start time measurement
        scaledTimer.mark();

        // do empty loop
        for(n = 0; n < 1000; n++)
            ;
        // measure time
        ticks = scaledTimer.passedTicks();
        time  = scaledTimer.passedTime();
        // print time
        System.out.print(time);
        System.out.println(" milliseconds");
    }
}

JonnyMac · 2011-07-14 11:52

9 ms in straight Spin @ 80MHz (5 x 16)

pub main | elapsed, idx

  term.start(RX1, TX1, %0000, 115_200)
  pause(1)

  term.str(string(CLS, "Loop Timer", CR, CR))

  elapsed := -cnt

  repeat idx from 1 to 1000
    ' empty loop
  
  elapsed := cnt + elapsed - 544

  term.dec(elapsed / MS_001)

  repeat
    waitcnt(0)
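
The `elapsed := -cnt` followed by `cnt + elapsed` idiom above is an end-minus-start subtraction on a free-running 32-bit counter, so it stays correct even if CNT rolls over during the measurement. The sketch below models that in Python; the function name is hypothetical, and reading the constant 544 as a fixed measurement overhead in ticks is an interpretation, since the post does not explain it.

```python
# Sketch: elapsed ticks from a free-running 32-bit counter such as CNT.
MASK32 = 0xFFFFFFFF

def elapsed_ticks(start, end, overhead=0):
    """Ticks between two 32-bit counter readings, minus a fixed overhead."""
    # Masking to 32 bits reproduces the wraparound of 32-bit arithmetic.
    return ((end - start) & MASK32) - overhead

# The counter wrapped between the two readings, yet the difference is right.
print(elapsed_ticks(0xFFFFFF00, 0x00000100))         # 512
# 720,000 ticks after removing 544 ticks of overhead: 9 ms at 80 MHz.
print(elapsed_ticks(1_000, 721_544, overhead=544))   # 720000
```

At 80 MHz a 32-bit counter wraps roughly every 53.7 seconds, so this only works for intervals shorter than that.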

mindrobots · 2011-07-14 11:56

At least we can all read it and maintain it since it's Java!

jazzed · 2011-07-14 12:44

mindrobots wrote: »

At least we can all read it and maintain it since it's Java!

That's a very good answer.
I can't imagine maintaining a substantial-sized program in Forth.

Of course to each their own ...

Here are results for 4 different "platforms" in xBasic.

HUB only mode (5MHz)

28 milliseconds

C3 Flash (1MB 4 pins) mode (5MHz)

44 milliseconds

HUB only mode (6MHz)

23 milliseconds

External SpinSocket-Flash (4MB 10 pins) mode (6MHz)

34 milliseconds

REM xBasic loop test
REM
include "propeller.bas"
include "print.bas"

def test
  dim i, start, elapsed
  start = cnt

  for i = 1 to 1000
  next i
  elapsed = cnt - start

  print elapsed / (clkfreq / 1000); " milliseconds"
end def

REM Run loop test
test

do
loop

David Betz · 2011-07-14 12:55

Actually, xbasic doesn't do a very good job of compiling FOR loops so here is a test that uses WHILE loops along with the FOR loop version. The FOR loop version runs in 28ms and the WHILE loop version runs in 22ms on a C3 out of hub memory.

include "propeller.bas"
include "print.bas"

def test_for
  dim i, start = cnt

  for i = 1 to 1000
  next i

  return cnt - start
end def

def test_while
  dim i, start = cnt

  i = 0
  do while i < 1000
    i = i + 1
  loop

  return cnt - start
end def

print "for: "; test_for / (clkfreq / 1000)
print "while: "; test_while / (clkfreq / 1000)

JonnyMac · 2011-07-14 13:12

While I wouldn't call PASM my favorite language, it really should be the benchmark, right?

Results:

Ticks = 4004
Microseconds = 5

The Code:

dat

                        org     0

entry                   mov     idx, _1000
                        neg     elapsed, cnt
                        
                        djnz    idx, #$

                        add     elapsed, cnt
                        sub     elapsed, #4

                        wrlong  elapsed, par

unload                  cogid   id
                        cogstop id

' -----------------------------------------------------------------------------

_1000                   long    1000

idx                     res     1
elapsed                 res     1
id                      res     1

                        fit     496

mindrobots · 2011-07-14 13:19

I would think that the native assembler for any processor should be like the speed of light. If you can go faster than that with a benchmark, you've got some explaining to do!

Thanks all for your contributions (so far)!!

Mark_T · 2011-07-14 13:49

JonnyMac wrote: »
While I wouldn't call PASM my favorite language, it really should be the benchmark, right?

Results:
Ticks = 4004
Microseconds = 5

You mean, of course, 50us... I'm not aware of an 800MHz Prop!
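
The arithmetic behind this correction can be checked directly. The sketch below assumes an 80 MHz clock; the helper names are illustrative, and the cycle model (4 clocks per taken DJNZ, 8 for the final fall-through) is the commonly quoted Propeller 1 timing rather than something stated in the thread.

```python
CLKFREQ_HZ = 80_000_000  # assumed 80 MHz system clock

def ticks_to_us(ticks, clkfreq_hz=CLKFREQ_HZ):
    """Convert clock ticks to microseconds."""
    return ticks * 1_000_000 / clkfreq_hz

def implied_clock_mhz(ticks, microseconds):
    """Clock rate (MHz) that would be needed to fit `ticks` into `microseconds`."""
    return ticks / microseconds

def djnz_loop_clocks(iterations):
    """Assumed cost model: 4 clocks per taken DJNZ, 8 for the last one."""
    return (iterations - 1) * 4 + 8

print(ticks_to_us(4004))            # 50.05
print(implied_clock_mhz(4004, 5))   # 800.8
print(djnz_loop_clocks(1000))       # 4004
```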

Bean · 2011-07-14 14:09

PE-Basic is interpreted, so I HAD to try PropBasic

DEVICE P8X32A,XTAL1,PLL16X
FREQ 80_000_000

SOut PIN 30 HIGH

a     VAR LONG
b     VAR LONG
ascii HUB STRING(100)

PROGRAM Start

Start:
  PAUSE 5000 ' Wait to start PST
  a = CNT
  FOR b=1 to 1000
  NEXT b
  a=CNT-a
  a=a/80
  ascii = STR a,8
  ascii = ascii + " microseconds."
  SEROUT SOut, T115200, ascii
END

Got 150 microseconds (or 0.15 milliseconds if you like).

Bean

mindrobots · 2011-07-14 15:30

I seriously see PropBASIC in my future!
Very COOL!!!

Thanks!

kevin@cachia.com · 2011-07-14 16:28

Thanks for starting this thread. I would have loved to have had it when I started my project and was lost in the sea of Prop languages. I was looking for a speed comparison before I started my project, cause I tried Spin but it just wasn't cutting it for what I needed. It's slooooow!

I had to switch to Bean's awesome PropBASIC! I wanted to try doing this bench but Bean beat me to it!

My original assumption/guess from Spin to PropBasic was about 40 times, but looks closer to 60x. I wonder where Catalina C would come in at?
The only thing I miss with PropBasic is not having all the OBEX libraries to "copy and paste" from. But reinventing the wheel is helping me understand and appreciate how the Prop uC works.

Thanks guys for taking the time to do this and your great support!

-Kevin

jazzed · 2011-07-14 16:50

@Bean, Can you post numbers for PropBasic using LMM?

COG code limited to 496 instructions will always be faster than any virtual machine that uses 32KB+.

The only contender that I know of with PropBasic LMM right now is Catalina.
Ross will be happy to post Catalina numbers I'm sure.

Aside: while this is an apples/apples comparison, your mileage will vary with other algorithms.

Cheers

Bean · 2011-07-14 17:21

jazzed,
Same code with 1 line changed "PROGRAM Start LMM" to use LMM code generation gives 900 microseconds or 0.9 milliseconds.

Bean

jazzed · 2011-07-14 17:30

Bean wrote: »

jazzed,
Same code with 1 line changed "PROGRAM Start LMM" to use LMM code generation gives 900 microseconds or 0.9 milliseconds.

Nice number

Are you using an overlay to cache routines?
Have you ever considered making PropBasic open source?

Bean · 2011-07-14 18:30

jazzed,
No, no cache pretty much just straight out of the hub.
Oh trust me...You don't WANT to see the source code. It's a real mess because it wasn't really thought through; it was just created by the seat of my pants, on the fly.
I guess it could be cleaned up, but that would be a lot of work.

It is in Delphi if anyone cares. I send it to BradC and he compiles it with Lazarus.

Bean

David Betz · 2011-07-14 18:49

Bean wrote: »

Oh trust me...You don't WANT to see the source code. It's a real mess because it wasn't really thought through; it was just created by the seat of my pants, on the fly.

Well, it may be messy but it sure seems to work well! You may be being too critical of it.

jazzed · 2011-07-14 19:10

Bean wrote: »

Oh trust me...You don't WANT to see the source code. It's a real mess because it wasn't really thought through; it was just created by the seat of my pants, on the fly.
I guess it could be cleaned up, but that would be a lot of work.

This is why XBASIC is coming. It is open and supports external memory so propeller can run big BASIC programs.

David Betz · 2011-07-14 19:12

But xbasic isn't even close to the speed of PropBasic. :-(

idbruce · 2011-07-15 05:04

Thanks for the comparisons guys, I definitely appreciate it.

Bruce

prof_braino · 2011-07-15 06:34

mindrobots wrote: »

It simply grabs the CNT register before and after an empty 1-to-1000 loop
The FORTH version ran in 2 milliseconds in PropFORTH v4.5

JonnyMac wrote: »
... PASM .... benchmark
Ticks = 4004
Microseconds = 5

Mark_T wrote: »

You mean, of course, 50us... I'm not aware of an 800MHz Prop!

Are all these '1000's in decimal, or in hex?

PropForth 4.5 can be confusing, as each cog can have its own number BASE setting, and it gets reset to decimal after a reset in the dev kernel. (So it's 'fixed' in the next version: decimal by default, and hex numbers are prefixed with an x, as in x0123A000E.)
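
Why the base matters for this benchmark: the same literal gives a different loop count depending on the current BASE. A small illustration in Python (not PropForth; the helper name is made up):

```python
def loop_count(literal, base):
    """Interpret a numeric literal the way a Forth would under a given BASE."""
    return int(literal, base)

decimal_count = loop_count("1000", 10)
hex_count = loop_count("1000", 16)

print(decimal_count)              # 1000
print(hex_count)                  # 4096
print(hex_count / decimal_count)  # 4.096
```

So a loop entered as 1000 while the cog is in hex would run about four times longer than intended.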

Also, can you try optimizing the Forth in assembler? It should get close to the PASM, depending on what gets put into assembler. If just the loop is in assembler, the overhead for fetching cnt to the stack should be noticeable, compared to the fetch of cnt being in assembler also.

50 microseconds is .05 milliseconds, so this says PASM is about 40 times faster than straight Forth. Have I got that right?
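
Taking the figures posted in the thread up to this point at face value, the 40x does work out, as does the roughly 60x between Spin and PropBasic mentioned earlier. The tabulation below uses Python only as a calculator; the dictionary and function names are made up, the PASM entry uses the corrected 50 microseconds, and the millisecond results are truncated whole numbers, so the ratios are approximate.

```python
# Reported times for the 1000-iteration empty loop, in microseconds.
REPORTED_US = {
    "PASM": 50,
    "PropBasic (cog)": 150,
    "PropBasic (LMM)": 900,
    "PropForth 4.5": 2_000,
    "Spin": 9_000,
    "PE-Basic": 33_000,
    "PropellerJVM": 33_000,
}

def speed_ratio(slower, faster, table=REPORTED_US):
    """How many times longer `slower` took than `faster`."""
    return table[slower] / table[faster]

for name, micros in sorted(REPORTED_US.items(), key=lambda item: item[1]):
    print(f"{name:16s} {micros:>6d} us  {speed_ratio(name, 'PASM'):6.0f}x PASM")

print(speed_ratio("PropForth 4.5", "PASM"))     # 40.0
print(speed_ratio("Spin", "PropBasic (cog)"))   # 60.0
```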

mindrobots · 2011-07-15 12:11

Prof.B,

The 1000 in the FORTH example is decimal. My kernel defaults to decimal.

I'd love to be able to optimize it in assembler....just need to figure out how!

prof_braino · 2011-07-15 19:15

mindrobots wrote: »

I'd love to be able to optimize it in assembler....just need to figure out how!

Might be better to wait to 5.0, assembler optimization should be a lot simpler, and there will be some examples.

The "straight forth" version might be faster in 5.0, if the optimizations work as intended.

Ariba · 2011-07-16 00:18

FemtoBasic Benchmark:

10 c = CNT 
20 FOR i=1 TO 1000
30 NEXT i
40 c = CNT - c
50 PRINT c/80000;" ms"
READY.
run
732 ms
READY.

Heater. · 2011-07-16 04:41

In many languages, C for example, this benchmark can be very silly, as the optimizer will notice that the loop does not do anything and completely remove it.
In C one can add the "volatile" attribute to the loop counter to tell the compiler not to optimize it away.
I'll have a go with Zog when I get a moment.

Toby Seckshund · 2011-07-16 05:56

And there was me, joyous in the resurrection of the blessed Nascom.

A 1000 loop on that takes 1.7 seconds (on the equivalent of a 2 MHz clock rate).

Happy days.

prof_braino · 2011-07-16 07:42

Heater. wrote: »

In C one can add the "volatile" attribute to the loop counter to tell the compiler not to optimize it away.
I'll have a go with Zog when I get a moment.

We could add "increment number on the stack" to make the loop not empty.

: timetest-not-empty 
 cnt COG@ 
 0 
 1000 0 do 
  1+ 
 loop 
 cnt COG@ 
 swap drop 
 swap - 
 80000 / . 
 c"  milliseconds. " .concstr cr ;

I would be interested in any results you care to run; I'd like to see default, optimized, volatile, and non empty loop.

The performance of the C code is largely a reflection of the skill of the person doing the optimizing, so the C should be very fast.
But I don't think anybody should bother to optimize empty-loop performance; we can try to get an oranges-to-oranges comparison by setting up an appropriate algorithm as a benchmark. (I don't do apples-to-apples comparisons; I don't have a Mac.)

What would be an appropriate benchmark anyway?
How about "using a stock proto board, read the first 128 consecutive bytes from upper 32k of EEPROM, flip the bits for each byte and write them back" and get the timing for this?
By "flip the bits" I mean each byte has a unique bit pattern and the bit pattern is reversed, so 10101100 becomes 00110101
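
The per-byte operation described here can be sketched as follows. This is Python for illustration only (the function name is hypothetical), and it covers just the bit reversal, not the EEPROM access.

```python
def reverse_bits8(value):
    """Reverse the bit order of one byte, so bit 7 becomes bit 0 and so on."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (value & 1)  # move the lowest bit of value into result
        value >>= 1
    return result

print(format(reverse_bits8(0b10101100), "08b"))   # 00110101
```

Reversing twice returns the original byte, which gives an easy self-check for any port of the benchmark.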

jazzed · 2011-07-16 12:05

Benchmarking is difficult. A list of requirements should be considered and approved by participants, but removing a particular language's strengths should never be on the negotiating table.

A sufficiently complicated program must be used to demonstrate real value, but porting such a program is a lot of work. One could port Heater's FFT or one of the Dhrystone algorithms. Integer performance is probably more important for an MCU than floating point.

prof_braino wrote: »

The performance of the C code is largely a reflection of the skill of the person doing the optimizing, so the C should be very fast.

As it should be, and I see no reason at all to give up such an advantage. It is no different from using LMM-like overlays -vs- a virtual byte-code machine interpreter.

prof_braino wrote: »

What would be an appropriate benchmark anyway?
How about "using a stock proto board, read the first 128 consecutive bytes from upper 32k of EEPROM, flip the bits for each byte and write them back" and get the timing for this?
By "flip the bits" I mean each byte has a unique bit pattern and the bit pattern is reversed, so 10101100 becomes 00110101

Putting extra hardware into the equation just makes a benchmark more difficult to compare for different languages and similar to your complaint about optimizations, reflects the skill of the device driver developer.

prof_braino · 2011-07-16 15:57

jazzed wrote: »

A list of requirements should be considered and approved by participants, but removing a particular language's strengths should never be on the negotiating table.

Agreed, please consider the suggestion for your approval, make corrections as needed. Has someone suggested removing a language's strength? My post intended to say "don't waste time on empty loops if the compiler is already smart enough to remove them". Why un-clever something?

A sufficiently complicated program must be used to demonstrate real value, but porting such a program is a lot of work.

Reading and writing EEPROM is sufficiently complicated for somebody who wants to read and write EEPROM (which is pretty much everybody who uses the prop); can we go with that? Since every language on the prop has to do this already, it should be a fairly straightforward common denominator.

Putting extra hardware into the equation just makes a benchmark more difficult to compare for different languages and similar to your complaint about optimizations, reflects the skill of the device driver developer.

Sorry, didn't mean to be complaining.

I was thinking that specifying exactly the same action on exactly the same hardware for each test would tell us (me at least) what we (me at least) are interested in, which is how long does it take to get something done on the prop using a given language. EEPROM is the single bit of hardware that is common to pretty much all prop configurations. The EEPROM and the prop chip will be identical in all tests; the difference would be in how a given language goes about exercising the hardware. I thought this was the sole point of the benchmark exercise. If there is a better method for comparison, I am interested to hear it; I don't know much about these things. Of course, these are only benchmarks and are not worth much trouble. I'm just curious. I think it could be handy to know what kind of time resolution one could expect in a given environment.

In any case, I bet heater comes back with something interesting.

Ariba · 2011-07-16 18:01

The problem with an EEPROM benchmark is that the faster languages are too fast for the EEPROM. They need some delays in the read loop to not violate the max. EEPROM read clock. So the benchmark result is more for the EEPROM than for the language.

Andy
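
A rough estimate shows how large that floor is. The numbers below are assumptions, not figures from the thread: a 400 kHz I2C clock, 9 clock periods per byte (8 data bits plus an acknowledge), and 4 extra bytes to address a sequential read. Start and stop conditions are ignored, and the actual limit depends on the EEPROM fitted to a given board.

```python
# Assumed bus parameters (not from the thread).
SCL_HZ = 400_000        # I2C clock rate
CLOCKS_PER_BYTE = 9     # 8 data bits + 1 acknowledge bit
ADDRESSING_BYTES = 4    # device address, two address bytes, device address again

def min_read_time_us(n_bytes, scl_hz=SCL_HZ):
    """Lower bound on the time to read n_bytes sequentially, in microseconds."""
    clocks = (n_bytes + ADDRESSING_BYTES) * CLOCKS_PER_BYTE
    return clocks * 1_000_000 / scl_hz

print(min_read_time_us(128))   # 2970.0
```

Under these assumptions the 128-byte read alone takes about 3 ms no matter how fast the language is, which dwarfs the 50 microsecond PASM loop reported earlier.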

RossH · 2011-07-16 19:57

Hi all,

I agree with Heater - many C compilers would simply optimize the empty loop out altogether, which makes this a particularly unreliable benchmark for comparing real-world program performance. However ...

Catalina: 751 microseconds.

#include <catalina_hmi.h>
#include <catalina_cog.h>

void main() {
   long a, b;
   int i = 1;
   int j = 1000;
 
   a = _cnt();
   for (i; i <= j; i++);
   b = _cnt();
   t_printf("Ticks = %d\n", b - a);
   t_printf("Microseconds = %d\n", (b - a)*1000/(_clockfreq()/1000));
   while (1);
}

Compiled for a C3 with the command:

catalina loop.c -lci -D C3 -O3

Ross.
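
One thing to watch in the microseconds formula above: `(b - a)*1000` is computed before the divide, so with 32-bit signed arithmetic it can overflow on longer measurements. The Python model below assumes 32-bit wrapping integers and an 80 MHz clock; both are assumptions (signed overflow is formally undefined in C, and the post does not give the clock or the tick count). The 60,080-tick input is chosen only because it corresponds to 751 microseconds at 80 MHz.

```python
def to_int32(x):
    """Wrap an integer to a signed 32-bit value (models overflow wraparound)."""
    x &= 0xFFFFFFFF
    return x - 0x1_0000_0000 if x & 0x8000_0000 else x

def c_div(a, b):
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def micros_multiply_first(ticks, clkfreq_hz=80_000_000):
    """Models (b - a)*1000/(clkfreq/1000) in 32-bit signed arithmetic."""
    return c_div(to_int32(ticks * 1000), clkfreq_hz // 1000)

def micros_divide_only(ticks, clkfreq_hz=80_000_000):
    """Alternative: divide by ticks per microsecond, no intermediate product."""
    return ticks // (clkfreq_hz // 1_000_000)

print(micros_multiply_first(60_080))      # 751
print(micros_multiply_first(2_400_000))   # -23687 (a 30 ms interval overflows)
print(micros_divide_only(2_400_000))      # 30000
```

For the 751 microsecond result the formula is fine; under these assumptions the product only overflows once the interval exceeds about 26.8 ms.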
