Benchmarks
hippy
Posts: 1,981
How slow is Spin compared to PASM? It's a question which crops up fairly regularly, directly or indirectly, and it would be nice to have some reasonably definitive, meaningful, ballpark answer which could be agreed on, even if there is no absolute answer.
Does anyone have any ideas on creating a sensible, easy-to-use benchmark or suite of benchmarks for the Propeller chip, or is anyone interested in doing any of that?
There's nothing available in Spin which isn't available in PASM, such as multiply or floating point ( all ultimately use similar PASM code to achieve the same ) so we are not necessarily looking at which delivers the most FLOPS or is fastest with math ... or maybe we are ?
I believe we need one or more fairly straightforward ( that is, simple to implement ) programs which test a wide range of operations standalone so they can run on any Propeller development platform.
My thought is to choose such a program, code it functionally ( without fine-tuning ) in Spin, then translate line-by-line to PASM with no real optimisation. Finally optimise both the Spin and PASM as much as possible. Optimisation can continue after the first round of results but we will always have a base un-tuned Spin and PASM to reference against. Those references can also be used as benchmarks for other Propeller programming languages, again line-by-line and optimised conversions.
From a PASM perspective, hub access and register-indirect are more expensive than other operations so should be a necessary part of any benchmark test but a PASM program shouldn't be deliberately crippled by those. Having to fit PASM in the 496 longs of Cog memory is a limitation for any benchmark.
Benchmarking is not my field so any thoughts appreciated.
Comments
A good benchmark is really a bunch of tests.
Dhrystone 1.1
Whetstone
Sieve
And maybe the time it takes to do things like Sine, Add, Square Root and Multiply.
With standard tests... we could even compare with other processors and interpreters!
A good test between SPIN and PASM would make use of the math functions and memory functions together. :)
Maybe building a lookup table of prime numbers would be a good place to start.
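As a rough illustration of the prime lookup-table idea, here is a minimal sketch in Python (not Spin or PASM) of a Sieve of Eratosthenes. The function name `build_prime_table` and the limit are assumptions made for the example; the point is only that the workload mixes arithmetic with a lot of memory reads and writes, which is what such a test would exercise on the Propeller.

```python
def build_prime_table(limit):
    """Sieve of Eratosthenes: return a list of all primes <= limit."""
    if limit < 2:
        return []
    # One flag byte per candidate number; 1 means "still assumed prime".
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = 0
    is_prime[1] = 0
    n = 2
    while n * n <= limit:
        if is_prime[n]:
            # Strike out every multiple of n, starting at n squared.
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = 0
        n += 1
    return [i for i in range(limit + 1) if is_prime[i]]


if __name__ == "__main__":
    print(build_prime_table(30))
```

A Spin or PASM version would keep the flag array in hub RAM, so the same algorithm would also bring hub access time into the measurement.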
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Share the knowledge: propeller.wikispaces.com
Lets make some music: www.andrewarsenault.com/hss
If you want to have a look, I used the computation of the 29th Fibonacci number as a reference for a very SPIN-friendly benchmark: http://forums.parallax.com/showthread.php?p=601870 (near the end of page 2)...
It took some time to implement that in PASM, so it would not meet hippy's constraints...
Just adding "synthetic constructs" is of very little use... but this is my opinion, though not mine alone.
A much more useful approach is "natural benchmarking", i.e. define a real problem constructed around one (or more) parameters and analyse which part of the parameter space is accessible to a certain implementation. The simplest parameter is run time, but memory used is nearly equally important. You can do the most unbelievable things with masses of memory, e.g. sort in linear time... The best known application of this is loop unrolling....
O.k., I would suggest defining 3 or 4 such real problems, maybe things already available.
(1) Asynchronous serial transmission: Parameter: Bits/Second in transmission and reception.
(2) Calculate fibo(29) using a recursive procedure: Parameter: Time
(3) Remove all spaces from a given text (around 10kB): Parameter: Time
(4) Substitute in a given text ("template") all occurrences of a special pattern (e.g. all numbers) by a given set of different strings, longer and shorter than the original pattern.
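To make problem (2) concrete, here is a minimal sketch in Python of the recursive Fibonacci reference with a simple timing wrapper. The names `fibo` and `time_call` are made up for this example, and the indexing convention (fibo(0) = 0, fibo(1) = 1) is an assumption; the linked thread may count differently.

```python
import time


def fibo(n):
    """Deliberately naive recursive Fibonacci, assuming fibo(0)=0 and fibo(1)=1."""
    if n < 2:
        return n
    return fibo(n - 1) + fibo(n - 2)


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    value, seconds = time_call(fibo, 29)
    print("fibo(29) =", value, "in", round(seconds, 3), "s")
```

The recursion is kept on purpose: the benchmark is meant to measure call and stack overhead, so an iterative rewrite would only be a fair comparison if it were applied to every language being compared.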
Post Edited (deSilva) : 1/3/2008 8:16:55 PM GMT
Why doesn't PropIDE assemble Spin into PASM instead of interpreted bytecodes if it does the same thing much slower?
Can you put this into the respective thread please.
I think that's a good starting point. Simple / time to implement was really a concern of "don't expect us 'math amateurs' to implement CORDIC functions from first principles" etc. If it's too complex most players walk away from the field. A simple to follow algorithm to implement is I think what really counts, and that captures the essence precisely.
It uses register-indirect ( self-modifying code ) and I like the fact it uses a software stack. Simply using iterative methods would be equally valid but only if applied to both Spin and PASM, that's a "fairness" I think we would agree on, compare like with like.
"Line-by-line" wasn't entirely what I meant in practice either, more how a programmer would take an algorithm and implement it for the language used, without spending ages over optimisation. If a PASM programmer would do a set of Spin statements as one instruction then that's fine by me if it's an obvious way of doing it. A line-by-line reference version was really an idea for trying to take out any subconscious tendency of a good PASM coder to use complicated tricks others would not automatically be familiar with.
Perhaps another way of looking at it, more fun than cold benchmarking, is "here's a basic algorithm implemented as a reference, make it as fast as possible in Spin and/or PASM".
Edit: This will be the advantage of the LMM. As PASM is limited to 300 instructions or so it has to be algorithmically simple. Not so an assembly program with 3000 instructions....
Post Edited (deSilva) : 1/3/2008 9:19:43 PM GMT
An example: SPIN has 32-bit arithmetic and under no circumstances will do less. Depending on the intelligence of the algorithm (but because of memory constraints there can't be any) this will take 3 to 4 times longer than an unrolled 16x16 multiplication or 32/16 division. Given a problem solvable with 16-bit arithmetic, should there be a handicap?
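A small sketch of why operand width matters, written in Python rather than Spin or PASM. It implements plain shift-and-add multiplication with a fixed number of steps equal to the operand width; `shift_add_multiply` is a name invented for this example, and it does not model how the Spin interpreter actually multiplies.

```python
def shift_add_multiply(a, b, bits):
    """Unsigned shift-and-add multiply using exactly `bits` steps.

    Returns (product, steps). Operands are truncated to `bits` bits.
    """
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    product = 0
    steps = 0
    for _ in range(bits):
        if b & 1:
            product += a  # add the shifted multiplicand when the low bit is set
        a <<= 1
        b >>= 1
        steps += 1
    return product, steps


if __name__ == "__main__":
    print(shift_add_multiply(300, 200, 16))
    print(shift_add_multiply(300, 200, 32))
```

Both calls give the same product, but the 32-bit version takes twice as many steps. Step count alone does not explain a factor of 3 to 4; unrolling the 16-bit loop also removes the per-iteration loop overhead, which this sketch does not count.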
Leon
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle
FullDuplexSerial = claims operation to 230400 and is full duplex.
So, for this example a naive speedup is 2 * 230400/19200 = 24.
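The arithmetic above can be written out as a tiny Python helper. The function name `naive_speedup` is invented for the example, and the two baud rates are simply the figures quoted in the post; the factor of 2 is read as credit for full duplex operation, which is an interpretation rather than something the post spells out.

```python
def naive_speedup(fast_baud, slow_baud, duplex_factor=1):
    """Ratio of two serial throughputs, optionally scaled for full duplex."""
    return duplex_factor * fast_baud / slow_baud


if __name__ == "__main__":
    # Figures from the post: 230400 baud full duplex versus 19200 baud.
    print(naive_speedup(230400, 19200, duplex_factor=2))  # 24.0
```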
I only just read your last posting in the LMM thread about benchmarking and the instability you noticed with even small changes. The theory can be put quite simply.
We are interested in two elements
* instruction execution time ('et')
* instruction performance/usefulness ('p')
You can visualize that easily in a 4-quadrant sketch (to be inserted sometime ...)
The obvious optimization strategy is to avoid instructions of low 'p'/'et' .... if you can
The two successful strategies for machines (be it H/W or S/W) are called CISC (increasing 'p' more than 'et') and RISC (decreasing 'et' more than 'p').
The benchmark results will shift considerably when you manage to include high 'p' instructions in the SPIN algorithm (e.g. LOOKDOWN, or really needed 32-bit multiplication) or you force PASM to wait for the HUB. This is trivia of course.
But synthetic benchmarks tend to be quite susceptible to such effects; small benchmark suites can even become systematically biased...
I shall take this to an extreme: You made the suggestion some time ago to "configure" the VM according to the needed instructions. This is a step in the right direction. But you can go further! You GENERATE the most time consuming computations for each specific program. In fact that is exactly what we do by hand today: "Ah, I need serial communication - I incorporate FullDuplexSerial." This can also be done on micro-level:
Compiler: "Oh, I notice this user keeps adding 4 in many places - I best generate a PLUS4 instruction for him this time."
This is nothing more than "advanced optimization for dynamic VMs" (TM by deSilva)
So what when the compiler detects that the user wants to compute the Fibonacci numbers? Well, it can include the fastest algorithm. And when it detects the user needs the 29th Fibonacci number? Well, it can include that too.
So it is very difficult to decide what a "fair" benchmark should be in such cases.
Don't laugh! There was a time when machines were built to perform best for specific synthetic benchmark suites. And this time is by no means over!
Post Edited (deSilva) : 1/3/2008 10:52:07 PM GMT
1. Serial coms (bits/second)
2. Parse GPS strings (strings/second)
3. Calculate bearing and distance from point A to point B.
4. Number to string (numbers/second)
The best benchmark is the program you are going to use or at least significant parts of it. The above are some related to robotics. A problem or at least an item to consider is not only speed but COG usage. Is the extra speed worth an extra cog?
If you use PASM you always need an extra Cog to run the assembly code, and you use a small Spin interface to communicate with your assembly routines. So I think that could in some cases be a disadvantage for PASM. But it should not be relevant in a comparison with Spin.
So how you access the cells is your own matter. Other assembly COGs will use machine code. When you are using a SPIN program it uses SPIN - what else.
Don't be fooled by the fact that there is some SPIN code in the Object where the machine code is located. This has different reasons.
Post Edited (deSilva) : 1/4/2008 8:24:16 PM GMT
It would be neat to see where a COG stands up next to a say... z80 or z380, 68000, 386, SX etc.
Those would be interesting and fun numbers!
Also Spin could be compared to Basic and other interpreter systems. :)
It's apples and oranges, but still they're all processors so there has to be a way to compare them all on the same grounds. ;)
Guess it would make for some fun.
Post Edited (Ym2413a) : 1/4/2008 8:02:52 PM GMT