Empty Loop Benchmarks
mindrobots
Simple benchmarking started by Bean's example here:
http://forums.parallax.com/showthread.php?123678-PE-Basic-Version-0.16-July-11-2011&p=1017831&viewfull=1#post1017831
Mostly as an exercise in my FORTH learning, I wondered what the same algorithm would look like in PropFORTH and then, of course, what the results would be. It simply grabs the CNT register before and after an empty 1-to-1000 loop and then prints the difference between the start and end times.
The BASIC version ran in 33 milliseconds in PE-Basic:
10 a=CNT
20 FOR b=1 TO 1000
30 NEXT b
40 a=CNT-a
50 PRINT a/80000;" milliseconds."
If I did it correctly (I'm still at that stage in my learning), the FORTH version ran in 2 milliseconds in PropFORTH v4.5:
: timetest cnt COG@ 1000 0 do loop cnt COG@ swap - 80000 / . c" milliseconds. " .concstr cr ;
Feel free to add your version in your favorite language. It will be fun to see the same simple program in different languages, if nothing else.
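Both listings above divide the raw CNT difference by 80000, which assumes an 80 MHz system clock (80,000 ticks per millisecond). With integer division the result is truncated to whole milliseconds, so "2 milliseconds" could be anything from 2.0 up to just under 3.0. A minimal Python sketch of that arithmetic follows; the function names and the sample readings are illustrative assumptions, not values from the thread, and the 32-bit mask models CNT being a 32-bit counter that wraps around.

```python
CLOCK_HZ = 80_000_000  # assumed 80 MHz system clock, implied by the 80000 divisor

def elapsed_ticks(start, end):
    """Ticks between two CNT readings, allowing for 32-bit wraparound."""
    return (end - start) & 0xFFFFFFFF

def ticks_to_ms(ticks):
    """Whole milliseconds, truncated like the integer a/80000 in the listings."""
    return ticks // (CLOCK_HZ // 1000)

def ticks_to_us(ticks):
    """Whole microseconds, for finer resolution than milliseconds."""
    return ticks // (CLOCK_HZ // 1_000_000)

# Hypothetical readings (not measured): the counter wraps between them.
start = 0xFFFF0000
end = (start + 232_000) & 0xFFFFFFFF

ticks = elapsed_ticks(start, end)
print(ticks_to_ms(ticks), "milliseconds")   # truncated value
print(ticks_to_us(ticks), "microseconds")   # same interval, finer unit
```

Reporting microseconds instead of milliseconds avoids most of the truncation error for loops this short.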
Comments
Not bad for Java LOL
I can't imagine maintaining a substantial program in Forth.
Of course to each their own ...
Here are results for 4 different "platforms" in xBasic.
HUB only mode (5MHz)
C3 Flash (1MB 4 pins) mode (5MHz)
HUB only mode (6MHz)
External SpinSocket-Flash (4MB 10 pins) mode (6MHz)
Results:
The Code:
Thanks all for your contributions (so far)!!
You mean, of course, 50us... I'm not aware of an 800MHz Prop!
Got 150 microseconds (or 0.15 milliseconds if you like).
Bean
Very COOL!!!
Thanks!
I had to switch to Bean's awesome PropBASIC! I wanted to try doing this bench but Bean beat me to it!
My original assumption/guess from Spin to PropBasic was about 40 times, but looks closer to 60x. I wonder where Catalina C would come in at?
The only thing I miss with PropBasic is not having all the OBEX libraries to "copy and paste" from. But reinventing the wheel is helping me understand and appreciate how the Prop uC works.
Thanks guys for taking the time to do this and your great support!
-Kevin
COG code limited to 496 instructions will always be faster than any virtual machine that uses 32KB+.
The only contender that I know of with PropBasic LMM right now is Catalina.
Ross will be happy to post Catalina numbers I'm sure.
Aside: while this is an apples/apples comparison, your mileage will vary with other algorithms.
Cheers
Same code with 1 line changed "PROGRAM Start LMM" to use LMM code generation gives 900 microseconds or 0.9 milliseconds.
Bean
Have you ever considered making PropBasic open source?
No, no cache pretty much just straight out of the hub.
Oh trust me... You don't WANT to see the source code. It's a real mess because it wasn't really thought through; it was just created by the seat of my pants, on the fly.
I guess it could be cleaned up, but that would be a lot of work.
It is in Delphi if anyone cares. I send it to BradC and he compiles it with Lazarus.
Bean
Well, it may be messy but it sure seems to work well! You may be being too critical of it.
Bruce
Are all these '1000' values in decimal, or in hex?
PropForth 4.5 can be confusing, as each cog can have its own number BASE setting, and it gets reset to decimal after a reset in the dev kernel. (So it's 'fixed' in the next version: decimal by default, and hex numbers are prefixed with an X, as in 'x0123A000E'.)
Also, can you try optimizing the Forth in assembler? It should get close to the PASM, depending on what gets put into assembler. If just the loop is in assembler, the overhead for fetching cnt to the stack should be noticeable, compared to the fetch of cnt being in assembler also.
50 microseconds is .05 milliseconds so this says PASM is about 40 times faster than straight forth, have I got that right?
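To check that arithmetic, here is a small Python sketch that turns the times reported so far into clocks per loop iteration and a slowdown factor relative to PASM. It assumes an 80 MHz clock and 1000 iterations, and the pairing of each time with a tool follows the posts above as far as they can be read; the function names are illustrative. Note that the 33 ms and 2 ms figures were truncated to whole milliseconds, so their ratios are lower bounds.

```python
CLOCK_MHZ = 80      # assumed 80 MHz system clock (80 ticks per microsecond)
ITERATIONS = 1000   # loop count used in the benchmark

# Times in microseconds, as reported in the thread up to this point.
reported_us = {
    "PE-Basic": 33_000,
    "PropFORTH 4.5": 2_000,
    "PropBasic LMM": 900,
    "PropBasic": 150,
    "PASM": 50,
}

def clocks_per_iteration(time_us):
    """Average system clocks spent on each pass through the loop."""
    return time_us * CLOCK_MHZ / ITERATIONS

def slowdown(time_us, baseline_us):
    """How many times slower a result is than the baseline."""
    return time_us / baseline_us

baseline = reported_us["PASM"]
for name, t in reported_us.items():
    print(f"{name:14s} {t:6d} us "
          f"{clocks_per_iteration(t):6.0f} clocks/iter "
          f"{slowdown(t, baseline):5.0f}x PASM")
```

On these numbers, 2000 / 50 = 40, so "about 40 times" holds, and 50 microseconds works out to 4 clocks per iteration, which matches a single 4-clock PASM instruction per pass.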
The 1000 in the FORTH example is decimal. My kernel defaults to decimal.
I'd love to be able to optimize it in assembler....just need to figure out how!
Might be better to wait for 5.0; assembler optimization should be a lot simpler, and there will be some examples.
The "straight forth" version might be faster in 5.0, if the optimizations work as intended.
In C one can add the "volatile" attribute to the loop counter to tell the compiler not to optimize it away.
I'll have a go with Zog when I get a moment.
A 1000 loop on that takes 1.7 seconds (on the equivalent of a 2 MHz clock rate).
Happy days.
We could add "increment number on the stack" to make the loop not empty.
I would be interested in any results you care to run; I'd like to see default, optimized, volatile, and non empty loop.
The performance of the C code is largely a reflection of the skill of the person doing the optimizing, so the C should be very fast.
But I don't think anybody should bother to optimize empty loop performance; we can try to get an oranges-to-oranges comparison by setting up an appropriate algorithm as a benchmark. (I don't do apples-to-apples comparisons; I don't have a Mac.)
What would be an appropriate benchmark anyway?
How about "using a stock proto board, read the first 128 consecutive bytes from the upper 32k of EEPROM, flip the bits for each byte and write them back" and get the timing for this?
By "flip the bits" I mean each byte has a unique bit pattern and the bit pattern is reversed, so 10101100 becomes 00110101
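The bit flip itself is easy to pin down independently of any EEPROM driver. A minimal Python sketch of the per-byte reversal described above follows; the function names are illustrative, and the 128-byte buffer is a hypothetical stand-in for data read from EEPROM, not real hardware access.

```python
def reverse_bits8(value):
    """Reverse the bit order of one byte, so 0b10101100 becomes 0b00110101."""
    result = 0
    for _ in range(8):
        # Shift the result left and append the current lowest bit of value.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def flip_block(data):
    """Apply the bit reversal to every byte in a block."""
    return bytes(reverse_bits8(b) for b in data)

# Hypothetical stand-in for the 128 bytes read from EEPROM:
# each byte has a unique bit pattern.
block = bytes(range(128))
flipped = flip_block(block)

print(f"{reverse_bits8(0b10101100):08b}")
```

Reversing twice restores the original block, which gives a simple correctness check for any language's version of the benchmark.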
A sufficiently complicated program must be used to demonstrate real value, but porting such a program is a lot of work. One could port Heater's FFT or one of the Dhrystone algorithms. Integer performance is probably more important for an MCU than floating point.
As it should be, and I see no reason at all to give up such an advantage. It is no different from using LMM-like overlays -vs- a virtual byte-code machine interpreter.
Putting extra hardware into the equation just makes a benchmark more difficult to compare for different languages and similar to your complaint about optimizations, reflects the skill of the device driver developer.
Agreed, please consider the suggestion for your approval, make corrections as needed. Has someone suggested removing a language's strength? My post intended to say "don't waste time on empty loops if the compiler is already smart enough to remove them". Why un-clever something?
Reading and writing EEPROM is sufficiently complicated for somebody who wants to read and write EEPROM (which is pretty much everybody that uses the prop); can we go with that? Since every language on the prop has to do this already, it should be a fairly straightforward common denominator.
Sorry, didn't mean to be complaining.
I was thinking that specifying exactly the same action on exactly the same hardware for each test would tell us (me at least) what we (me at least) are interested in, which is how long it takes to get something done on the prop using a given language. EEPROM is the single bit of hardware that is common to pretty much all prop configurations. The EEPROM and the prop chip will be identical in all tests; the difference would be in how a given language goes about exercising the hardware. I thought this was the sole point of the benchmark exercise. If there is a better method for comparison, I am interested to hear it; I don't know much about these things. Of course, these are only benchmarks and are not worth much trouble. I'm just curious. I think it could be handy to know what kind of time resolution one could expect in a given environment.
In any case, I bet heater comes back with something interesting.
Andy
I agree with Heater - many C compilers would simply optimize the empty loop out altogether, which makes this a particularly unreliable benchmark for comparing real-world program performance. However ...
Catalina: 751 microseconds.
Compiled for a C3 with the command:
Ross.