Empty Loop Benchmarks

mindrobotsmindrobots Posts: 6,506
edited 2011-07-18 07:38 in Propeller 1
Simple benchmarking started by Bean's example here:

http://forums.parallax.com/showthread.php?123678-PE-Basic-Version-0.16-July-11-2011&p=1017831&viewfull=1#post1017831

Mostly as an exercise in my FORTH learning, I wondered what the same algorithm would look like in PropFORTH and, of course, what the results would be. It simply grabs the CNT register before and after an empty 1-to-1000 loop and then prints the difference between the start and end times.

BASIC version ran in 33 milliseconds in PE-Basic:
10 a=CNT
20 FOR b=1 TO 1000
30 NEXT b
40 a=CNT-a
50 PRINT a/80000;" milliseconds."

If I did it correctly (I'm still at that stage in my learning), the FORTH version ran in 2 milliseconds in PropFORTH v4.5:
: timetest 
 cnt COG@ 
 1000 0 do loop 
 cnt COG@ 
 swap - 
 80000 / . 
 c"  milliseconds. " .concstr cr ;

Feel free to add your version in your favorite language. It will be fun to see the same simple program in different languages, if nothing else.
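
For anyone checking the numbers on a PC, here is a minimal Python sketch of the measurement described above: sample the counter before and after, subtract, and divide by ticks per millisecond. It assumes an 80 MHz system clock (the 80000 divisor in both listings) and treats CNT as a 32-bit free-running counter; the function names are made up for illustration.

```python
CLKFREQ_HZ = 80_000_000          # assumed system clock (matches the 80000 divisor)
CNT_MASK = 0xFFFF_FFFF           # CNT is treated as a 32-bit free-running counter


def elapsed_ticks(start: int, end: int) -> int:
    """Ticks between two CNT samples, tolerating one counter rollover."""
    return (end - start) & CNT_MASK


def ticks_to_ms(ticks: int, clkfreq_hz: int = CLKFREQ_HZ) -> int:
    """Whole milliseconds, using the same integer division as the listings."""
    return ticks // (clkfreq_hz // 1000)


if __name__ == "__main__":
    # 2,640,000 ticks is what a 33 ms run would look like at 80 MHz.
    print(ticks_to_ms(elapsed_ticks(1_000, 2_641_000)), "milliseconds.")
    # A rollover between the two samples still gives the right difference.
    print(elapsed_ticks(0xFFFF_FF00, 0x0000_0100))
```

Because the division is integer division, a printed "2 milliseconds" can stand for anything from 2.0 up to just under 3.0 ms, so the millisecond figures are only coarse.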

Comments

  • jazzedjazzed Posts: 11,803
    edited 2011-07-14 11:44
    PropellerJVM.
    33 milliseconds
    

    Not bad for Java LOL :D
    import stamp.core.CPU;
    import stamp.core.Terminal;
    import stamp.core.Timer;
    import stamp.core.ScaledTimer16;
    
    public class EmptyLoop
    {
        public static void main()
        {
            int ticks = 0;
            int time  = 0;
            int n;
            // setup timer
            Timer clock = new Timer();
            ScaledTimer16 scaledTimer = new ScaledTimer16 (400, 100);
            // wait for user input
            while(!Terminal.byteAvailable())
                ;
            // start time measurement
            scaledTimer.mark();
    
            // do empty loop
            for(n = 0; n < 1000; n++)
                ;
            // measure time
            ticks = scaledTimer.passedTicks();
            time  = scaledTimer.passedTime();
            // print time
            System.out.print(time);
            System.out.println(" milliseconds");
        }
    }
    
  • JonnyMacJonnyMac Posts: 9,208
    edited 2011-07-14 11:52
    9 ms in straight Spin @ 80MHz (5 x 16)
    pub main | elapsed, idx
    
      term.start(RX1, TX1, %0000, 115_200)
      pause(1)
    
      term.str(string(CLS, "Loop Timer", CR, CR))
    
      elapsed := -cnt
    
      repeat idx from 1 to 1000
        ' empty loop
      
      elapsed := cnt + elapsed - 544
    
      term.dec(elapsed / MS_001)
    
      repeat
        waitcnt(0)
    
  • mindrobotsmindrobots Posts: 6,506
    edited 2011-07-14 11:56
    At least we can all read it and maintain it since it's Java! :smile:
  • jazzedjazzed Posts: 11,803
    edited 2011-07-14 12:44
    mindrobots wrote: »
    At least we can all read it and maintain it since it's Java! :smile:
    That's a very good answer.
    I can't imagine maintaining a substantial sized program in forth. :smile:
    Of course to each their own ...

    Here are results for 4 different "platforms" in xBasic.

    HUB only mode (5MHz)
    28 milliseconds
    

    C3 Flash (1MB 4 pins) mode (5MHz)
    44 milliseconds
    

    HUB only mode (6MHz)
    23 milliseconds
    

    External SpinSocket-Flash (4MB 10 pins) mode (6MHz)
    34 milliseconds
    
    REM xBasic loop test
    REM
    include "propeller.bas"
    include "print.bas"
    
    def test
      dim i, start, elapsed
      start = cnt
    
      for i = 1 to 1000
      next i
      elapsed = cnt - start
    
      print elapsed / (clkfreq / 1000); " milliseconds"
    end def
    
    REM Run loop test
    test
    
    do
    loop
    
  • David BetzDavid Betz Posts: 14,516
    edited 2011-07-14 12:55
    Actually, xbasic doesn't do a very good job of compiling FOR loops so here is a test that uses WHILE loops along with the FOR loop version. The FOR loop version runs in 28ms and the WHILE loop version runs in 22ms on a C3 out of hub memory.
    include "propeller.bas"
    include "print.bas"
    
    def test_for
      dim i, start = cnt
    
      for i = 1 to 1000
      next i
    
      return cnt - start
    end def
    
    def test_while
      dim i, start = cnt
    
      i = 0
      do while i < 1000
        i = i + 1
      loop
    
      return cnt - start
    end def
    
    print "for: "; test_for / (clkfreq / 1000)
    print "while: "; test_while / (clkfreq / 1000)
    
  • JonnyMacJonnyMac Posts: 9,208
    edited 2011-07-14 13:12
    While I wouldn't call PASM my favorite language, it really should be the benchmark, right?

    Results:
    Ticks = 4004
    Microseconds = 5
    

    The Code:
    dat
    
                            org     0
    
    entry                   mov     idx, _1000
                            neg     elapsed, cnt
                            
                            djnz    idx, #$
    
                            add     elapsed, cnt
                            sub     elapsed, #4
    
                            wrlong  elapsed, par
    
    unload                  cogid   id
                            cogstop id
    
    ' -----------------------------------------------------------------------------
    
    _1000                   long    1000
    
    idx                     res     1
    elapsed                 res     1
    id                      res     1
    
                            fit     496
    
  • mindrobotsmindrobots Posts: 6,506
    edited 2011-07-14 13:19
    I would think that the native assembler for any processor should be like the speed of light. If you can go faster than that with a benchmark, you've got some explaining to do!

    Thanks all for your contributions (so far)!!
  • Mark_TMark_T Posts: 1,981
    edited 2011-07-14 13:49
    JonnyMac wrote: »
    While I wouldn't call PASM my favorite language, it really should be the benchmark, right?

    Results:
    Ticks = 4004
    Microseconds = 5
    

    You mean, of course, 50us... I'm not aware of an 800MHz Prop!
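
The correction can be checked with a short Python sketch. It assumes an 80 MHz clock and the usual Propeller 1 timing for djnz (4 clocks when the jump is taken, 8 when it falls through); neither is stated in the post, so treat both as assumptions.

```python
CLKFREQ_HZ = 80_000_000   # assumed 80 MHz system clock


def djnz_loop_ticks(iterations: int, taken: int = 4, not_taken: int = 8) -> int:
    """Clocks spent in a `djnz idx, #$` loop under the assumed timings."""
    return (iterations - 1) * taken + not_taken


def ticks_to_us(ticks: int, clkfreq_hz: int = CLKFREQ_HZ) -> float:
    """Convert clock ticks to microseconds."""
    return ticks * 1_000_000 / clkfreq_hz


if __name__ == "__main__":
    ticks = djnz_loop_ticks(1000)
    print(ticks)                 # 4004
    print(ticks_to_us(ticks))    # 50.05
```

Under those assumptions the loop accounts for exactly the 4004 ticks reported, which is about 50 microseconds at 80 MHz; a 5 microsecond result would need a clock of roughly 800 MHz.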
  • BeanBean Posts: 8,129
    edited 2011-07-14 14:09
    PE-Basic is interpreted, so I HAD to try PropBasic
    DEVICE P8X32A,XTAL1,PLL16X
    FREQ 80_000_000
    
    SOut PIN 30 HIGH
    
    a     VAR LONG
    b     VAR LONG
    ascii HUB STRING(100)
    
    PROGRAM Start
    
    Start:
      PAUSE 5000 ' Wait to start PST
      a = CNT
      FOR b=1 to 1000
      NEXT b
      a=CNT-a
      a=a/80
      ascii = STR a,8
      ascii = ascii + " microseconds."
      SEROUT SOut, T115200, ascii
    END
    

    Got 150 microseconds (or 0.15 milliseconds if you like).

    Bean
  • mindrobotsmindrobots Posts: 6,506
    edited 2011-07-14 15:30
    I seriously see PropBASIC in my future!
    Very COOL!!!

    Thanks!
  • kevin@cachia.comkevin@cachia.com Posts: 23
    edited 2011-07-14 16:28
    Thanks for starting this thread, I would have loved to have had it when I started my project and was lost in the sea of Prop languages. I was looking for a speed comparison before I started my project, because I tried Spin but it just wasn't cutting it for what I needed. It's slooooow!

    I had to switch to Bean's awesome PropBASIC! I wanted to try doing this bench but Bean beat me to it!

    My original assumption/guess from Spin to PropBasic was about 40 times, but looks closer to 60x. I wonder where Catalina C would come in at?
    The only thing I miss with PropBasic is not having all the OBEX libraries to "copy and paste" from. But reinventing the wheel is helping me understand and appreciate how the Prop uC works.

    Thanks guys for taking the time to do this and your great support!

    -Kevin
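
The 60x figure follows from the times posted so far. Here is a rough Python sketch that tabulates them; the numbers are copied from the posts above (with PASM taken as 50 microseconds), while the dictionary and function names are only illustrative. The results come from different boards and clock settings, so the ratios are indicative at best.

```python
# Times reported earlier in this thread, in microseconds.
reported_us = {
    "PE-Basic": 33_000,
    "PropellerJVM": 33_000,
    "xBasic (hub, FOR)": 28_000,
    "Spin": 9_000,
    "PropFORTH 4.5": 2_000,
    "PropBasic": 150,
    "PASM": 50,   # 4004 ticks at 80 MHz, rounded down
}


def speedup(slow: str, fast: str) -> float:
    """How many times faster `fast` ran than `slow`."""
    return reported_us[slow] / reported_us[fast]


if __name__ == "__main__":
    print(speedup("Spin", "PropBasic"))   # 60.0
    for name, us in sorted(reported_us.items(), key=lambda kv: kv[1]):
        print(f"{name:20s} {us:>7d} us  {us / reported_us['PASM']:>6.0f}x PASM")
```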
  • jazzedjazzed Posts: 11,803
    edited 2011-07-14 16:50
    @Bean, Can you post numbers for PropBasic using LMM?

    COG code limited to 496 instructions will always be faster than any virtual machine that uses 32KB+.

    The only contender that I know of with PropBasic LMM right now is Catalina.
    Ross will be happy to post Catalina numbers I'm sure.

    Aside: while this is an apples/apples comparison, your mileage will vary with other algorithms.

    Cheers :)
  • BeanBean Posts: 8,129
    edited 2011-07-14 17:21
    jazzed,
    Same code with 1 line changed "PROGRAM Start LMM" to use LMM code generation gives 900 microseconds or 0.9 milliseconds.

    Bean
  • jazzedjazzed Posts: 11,803
    edited 2011-07-14 17:30
    Bean wrote: »
    jazzed,
    Same code with 1 line changed "PROGRAM Start LMM" to use LMM code generation gives 900 microseconds or 0.9 milliseconds.
    Nice number :) Are you using an overlay to cache routines?
    Have you ever considered making PropBasic open source?
  • BeanBean Posts: 8,129
    edited 2011-07-14 18:30
    jazzed,
    No, no cache pretty much just straight out of the hub.
    Oh trust me...You don't WANT to see the source code. It's a real mess because it wasn't really thought through; it was just created by the seat of my pants, on the fly.
    I guess it could be cleaned up, but that would be a lot of work.

    It is in Delphi if anyone cares. I send it to BradC and he compiles it with lazarus.

    Bean
  • David BetzDavid Betz Posts: 14,516
    edited 2011-07-14 18:49
    Bean wrote: »
    Oh trust me...You don't WANT to see the source code. It's a real mess because it wasn't really thought through; it was just created by the seat of my pants, on the fly.

    Well, it may be messy but it sure seems to work well! You may be being too critical of it.
  • jazzedjazzed Posts: 11,803
    edited 2011-07-14 19:10
    Bean wrote: »
    Oh trust me...You don't WANT to see the source code. It's a real mess because it wasn't really thought through; it was just created by the seat of my pants, on the fly.
    I guess it could be cleaned up, but that would be a lot of work.
    This is why XBASIC is coming. It is open and supports external memory so propeller can run big BASIC programs.
  • David BetzDavid Betz Posts: 14,516
    edited 2011-07-14 19:12
    But xbasic isn't even close to the speed of PropBasic. :-(
  • idbruceidbruce Posts: 6,197
    edited 2011-07-15 05:04
    Thanks for the comparisons guys, I definitely appreciate it.

    Bruce
  • prof_brainoprof_braino Posts: 4,313
    edited 2011-07-15 06:34
    mindrobots wrote: »
    It simply grabs the CNT register before and after an empty 1-to-1000 loop
    The FORTH version ran in 2 milliseconds in PropFORTH v4.5
    JonnyMac wrote: »
    ... PASM .... benchmark
    Ticks = 4004
    Microseconds = 5
    
    Mark_T wrote: »
    You mean, of course, 50us... I'm not aware of an 800MHz Prop!

    Are all these '1000's in decimal, or in hex?

    PropForth 4.5 can be confusing, as each cog can have its own number BASE setting, and it gets reset to decimal after a reset in the dev kernel. (So it's 'fixed' in the next version: decimal by default, and hex numbers are prefixed with an x, as in x0123A000E.)

    Also, can you try optimizing the forth in assembler? It should get close to the PASM, depending on what gets put into assembler. If just the loop is in assembler, the overhead for fetching cnt to the stack should be noticeable, compared to the fetch of cnt being in assembler also.

    50 microseconds is .05 milliseconds so this says PASM is about 40 times faster than straight forth, have I got that right?
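
To show why the number base matters here, a small Python sketch: it reads the two literals from the Forth listing (1000 and 80000) in base 10 and in base 16. This is just arithmetic on the literals, not PropForth behaviour, and the helper name is hypothetical.

```python
def literal_value(literal: str, base: int) -> int:
    """Value an interpreter would see for a numeric literal in a given base."""
    return int(literal, base)


if __name__ == "__main__":
    # Loop limit: 1000 iterations in decimal, 4096 if read as hex.
    print(literal_value("1000", 10), literal_value("1000", 16))    # 1000 4096
    # Divisor: 80000 ticks per ms in decimal, 524288 if read as hex.
    print(literal_value("80000", 10), literal_value("80000", 16))  # 80000 524288
    # Extra work done by the loop if the base were hex.
    print(literal_value("1000", 16) / literal_value("1000", 10))   # 4.096
```

So with BASE set to hex, the loop would run about four times as long and the divisor would no longer be ticks per millisecond, which is why it is worth confirming the base before comparing results.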
  • mindrobotsmindrobots Posts: 6,506
    edited 2011-07-15 12:11
    Prof.B,

    The 1000 in the FORTH example is decimal. My kernel defaults to decimal.

    I'd love to be able to optimize it in assembler....just need to figure out how! :lol:
  • prof_brainoprof_braino Posts: 4,313
    edited 2011-07-15 19:15
    mindrobots wrote: »
    I'd love to be able to optimize it in assembler....just need to figure out how! :lol:

    Might be better to wait to 5.0, assembler optimization should be a lot simpler, and there will be some examples.

    The "straight forth" version might be faster in 5.0, if the optimizations work as intended.
  • AribaAriba Posts: 2,690
    edited 2011-07-16 00:18
    FemtoBasic Benchmark:
    10 c = CNT 
    20 FOR i=1 TO 1000
    30 NEXT i
    40 c = CNT - c
    50 PRINT c/80000;" ms"
    READY.
    run
    732 ms
    READY.
    
  • Heater.Heater. Posts: 21,230
    edited 2011-07-16 04:41
    In many languages, C for example, this benchmark can be very silly as the optimizer will notice that the loop does not do anything and completely remove it.
    In C one can add the "volatile" attribute to the loop counter to tell the compiler not to optimize it away.
    I'll have a go with Zog when I get a moment.
  • Toby SeckshundToby Seckshund Posts: 2,027
    edited 2011-07-16 05:56
    And there was me, joyous in the resurrection of the blessed Nascom.

    A 1000 loop on that takes 1.7 seconds (on the equivalent of a 2 MHz clock rate)

    Happy days.
  • prof_brainoprof_braino Posts: 4,313
    edited 2011-07-16 07:42
    Heater. wrote: »
    In C one can add the "volatile" attribute to the loop counter to tell the compiler not to optimize it away.
    I'll have a go with Zog when I get a moment.

    We could add "increment number on the stack" to make the loop not empty.
    : timetest-not-empty 
      cnt COG@ 
      0
      1000 0 do 
        1+
      loop 
      cnt COG@ 
      swap drop
      swap - 
      80000 / . 
      c"  milliseconds. " .concstr cr ;
    

    I would be interested in any results you care to run; I'd like to see default, optimized, volatile, and non empty loop.

    The performance of the C code is largely a reflection of the skill of the person doing the optimizing, so the C should be very fast.
    But I don't think anybody should bother to optimize empty loop performance; we can try to get an oranges-to-oranges comparison by setting up an appropriate algorithm as a benchmark. (I don't do apples to apples comparisons, I don't have a Mac.)

    What would be an appropriate benchmark anyway?
    How about "using a stock proto board, read the first 128 consecutive bytes from upper 32k of EEPROM, flip the bits for each byte and write them back" and get the timing for this?
    By "flip the bits" I mean each byte has a unique bit pattern and the bit pattern is reversed, so 10101100 becomes 00110101
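
For reference, a minimal Python sketch of the "flip the bits" step as defined above, leaving out the EEPROM access entirely. The function names are hypothetical; the 128-byte block size is the one from the proposal.

```python
def reverse_bits8(value: int) -> int:
    """Reverse the bit order of one byte (bit 7 swaps with bit 0, and so on)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (value & 1)   # shift in the current lowest bit
        value >>= 1
    return result


def flip_block(data: bytes) -> bytes:
    """Apply the bit reversal to every byte of a block."""
    return bytes(reverse_bits8(b) for b in data)


if __name__ == "__main__":
    print(format(reverse_bits8(0b10101100), "08b"))   # 00110101
    block = bytes(range(128))
    # Reversing twice must give the original block back.
    print(flip_block(flip_block(block)) == block)     # True
```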
  • jazzedjazzed Posts: 11,803
    edited 2011-07-16 12:05
    Bench-marking is difficult. A list of requirements should be considered and approved by participants, but removing a particular language's strengths should never be on the negotiating table.

    A sufficiently complicated program must be used to demonstrate real value, but porting such a program is a lot of work. One could port heater's fft or one of the dhrystone algorithms. Integer performance is probably more important for an MCU than floating point.
    The performance of the C code is largely a reflection of the skill of the person doing the optimizing, so the C should be very fast.
    As it should be, and I see no reason at all to give up such an advantage. It is no different from using LMM-like overlays -vs- a virtual byte-code machine interpreter.
    What would be an appropriate benchmark anyway?
    How about "using a stock proto board, read the first 128 consecutive bytes from upper 32k of EEPROM, flip the bits for each byte and write them back" and get the timing for this?
    By "flip the bits" I mean each byte has a unique bit pattern and the bit pattern is reversed, so 10101100 becomes 00110101
    Putting extra hardware into the equation just makes a benchmark more difficult to compare for different languages and similar to your complaint about optimizations, reflects the skill of the device driver developer.
  • prof_brainoprof_braino Posts: 4,313
    edited 2011-07-16 15:57
    jazzed wrote: »
    A list of requirements should be considered and approved by participants, but removing a particular language's strengths should never be on the negotiating table.

    Agreed, please consider the suggestion for your approval, make corrections as needed. Has someone suggested removing a language's strength? My post intended to say "don't waste time on empty loops if the compiler is already smart enough to remove them". Why un-clever something?
    A sufficiently complicated program must be used to demonstrate real value, but porting such a program is a lot of work.

    Reading and writing EEPROM is sufficiently complicated for somebody who wants to read and write EEPROM (which is pretty much everybody that uses the prop), can we go with that? Since every language on the prop has to do this already, it should be a fairly straightforward common denominator.
    Putting extra hardware into the equation just makes a benchmark more difficult to compare for different languages and similar to your complaint about optimizations, reflects the skill of the device driver developer.

    Sorry, didn't mean to be complaining. :)

    I was thinking that specifying exactly the same action on exactly the same hardware for each test would tell us (me at least) what we (me at least) are interested in, which is how long it takes to get something done on the prop using a given language. EEPROM is the single bit of hardware that is common to pretty much all prop configurations. The EEPROM and the prop chip will be identical in all tests; the difference would be in how a given language goes about exercising the hardware. I thought this was the sole point of the benchmark exercise. If there is a better method for comparison, I am interested to hear it; I don't know much about these things. Of course, these are only benchmarks and are not worth much trouble. I'm just curious. I think it could be handy to know what kind of time resolution one could expect in a given environment.

    In any case, I bet heater comes back with something interesting.
  • AribaAriba Posts: 2,690
    edited 2011-07-16 18:01
    The Problem with an EEPROM benchmark is that the faster languages are too fast for the EEPROM. They need some delays in the read loop to not violate the max. EEPROM read clock. So the benchmark result is more for the EEPROM than for the language.

    Andy
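
A back-of-the-envelope Python sketch of that point. Every constant here is an assumption chosen for illustration (a 400 kHz I2C clock, 9 clock periods per byte, 64-byte pages, 5 ms internal write time per page), not a figure from this thread, and addressing overhead is ignored.

```python
I2C_CLOCK_HZ = 400_000     # assumed bus clock
BITS_PER_BYTE = 9          # 8 data bits plus the acknowledge bit
PAGE_BYTES = 64            # assumed EEPROM page size
WRITE_CYCLE_S = 0.005      # assumed internal write time per page


def transfer_seconds(n_bytes: int) -> float:
    """Bus time for the data bytes alone (addressing overhead ignored)."""
    return n_bytes * BITS_PER_BYTE / I2C_CLOCK_HZ


def benchmark_floor_seconds(n_bytes: int = 128) -> float:
    """Lower bound for read plus write-back, regardless of language."""
    pages = -(-n_bytes // PAGE_BYTES)          # ceiling division
    return 2 * transfer_seconds(n_bytes) + pages * WRITE_CYCLE_S


if __name__ == "__main__":
    print(f"{benchmark_floor_seconds() * 1000:.2f} ms")
```

With these assumed figures the floor is on the order of 15 ms for 128 bytes, far more than the tens or hundreds of microseconds the faster languages need for the loop itself, so the EEPROM would dominate the result.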
  • RossHRossH Posts: 5,519
    edited 2011-07-16 19:57
    Hi all,

    I agree with Heater - many C compilers would simply optimize the empty loop out altogether, which makes this a particularly unreliable benchmark for comparing real-world program performance. However ...

    Catalina: 751 microseconds.
    #include <catalina_hmi.h>
    #include <catalina_cog.h>
    
    void main() {
       long a, b;
       int i = 1;
       int j = 1000;
     
       a = _cnt();
       for (i; i <= j; i++);
       b = _cnt();
       t_printf("Ticks = %d\n", b - a);
       t_printf("Microseconds = %d\n", (b - a)*1000/(_clockfreq()/1000));
       while (1);
    }
    
    Compiled for a C3 with the command:
    catalina loop.c -lci -D C3 -O3
    
    Ross.
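
A small Python sketch of the tick-to-microsecond arithmetic used in the listing, `(b - a)*1000/(_clockfreq()/1000)`. It assumes an 80 MHz clock and a 32-bit signed `long`; both are assumptions here. The 60,080-tick input is simply a count consistent with 751 microseconds at 80 MHz, not a reported measurement.

```python
INT32_MAX = 2**31 - 1
CLKFREQ_HZ = 80_000_000    # assumed system clock


def ticks_to_us(ticks: int, clkfreq_hz: int = CLKFREQ_HZ) -> int:
    """Same integer arithmetic as the listing: ticks*1000 / (clkfreq/1000)."""
    return ticks * 1000 // (clkfreq_hz // 1000)


def max_safe_ticks() -> int:
    """Largest tick count for which ticks*1000 still fits in a signed 32-bit value."""
    return INT32_MAX // 1000


if __name__ == "__main__":
    print(ticks_to_us(60_080))                 # 751
    print(max_safe_ticks())                    # 2147483
    print(ticks_to_us(max_safe_ticks()))       # 26843
```

If `long` is 32 bits, the multiply-first form stays exact up to roughly 26.8 ms of elapsed time at 80 MHz, which is comfortable for this loop but worth remembering before reusing the expression on slower code.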