Amdahl's law and the Propeller

cessnapilot · 2012-06-06 03:50

When you define speedup (SUP) as the original execution time divided by the
shorter (improved) execution time, then Amdahl's law states for N parallel
processors (COGs)

SUP(N) = 1 / [f + (1 - f) / N]

where (1 - f) is the portion of the code (steps) running in N processors parallel, and
f is the portion of the sequential code. With numbers: if your program uses
4 COGs for 50% of the necessary code steps, you can expect a

1 / (0.5 + 0.5 / 4) = 1.6

speedup of the program. At first glance, Amdahl's law seems to be somewhat
restricting about the speed performance of multicore processing. For example,
with only 10% of the code is serial, maximum speedup is 10, even with infinite
number of COGs.

On the other hand, I guess, f can be very-very small, if your parallelized code
runs during most of the execution time. E.g. when you load 2 COGs within
1 sec but those 2 COGs run their code for about an hour, then the code fraction
consumed by sequential operations, is about 0.000277. Now SUP(2) is 1.9994.
Almost double speedup, if I am not mistaken.

It would be interesting to see some actual speedup figures for SPIN and/or PASM
programs to check the validity of Amdahl's law, which contains a lot of assumptions.
Those assumptions may not be applicable for Propeller's architecture.

I attached an article covering some uses and abuses of the law.

Cheers,

Istvan

Heater. · 2012-06-06 04:08

On one extreme we could have: 8 Cogs doing parallel processing, none of them use HUB RAM or other HUB ops, and all I/O is done via pins. In that case we have an 8 times speed up over one processor. In fact we might say the speed up is greater because it all those tasks are different executing them on a single CPU would require some kind of task scheduler which has overheads. (This is the f = 0, N = 8 case).

On the other extreme we might have: Parallel code is operating through shared RAM, or even just using it occasionally for I/O. Then with 8 processors the speed up is dragged down by the number of processors sharing the resource as they all have to wait their turn. (f > 0. N=8)

With an infinite number of processors your speed up will be 0.0. Your performance drops to zero as every processor now has to wait an infinite amount of time for it's next go at a shared resource. (f > 0, N=infinity).

Heater. · 2012-06-06 08:04

A while back Chip asked here if we would like more RAM or sixteen COGs in the extra silicon available in the Prop II. My vote was for more RAM has increasing the COG count would have diminishing returns. Now we can see from Amdahl's law how that might have worked out.

In the above graph we plot Amdahl's formula for speed gain verses processor count from 1 to 32 for various amounts of sequential code. Sequential code for the Prop really just means those accesses to HUB that have to be serialized by the HUB switch. The red line is one 25th of the instructions sequential, the green is one 50th and the blue one 100th.

We see that if we had 16 cogs working on one 25th sequential code (HUB access) we only get the useful power of about 8 cogs. For one 50th HUB access we have perhaps the power of 11 or 12 COGS.

This is about what we could expect when running the Spin interpreter I think. LMM execution would suffer even more where the is a high amount of HUB access compared to cog execution.

How closely this models the Prop and Prop II situation might be debatable but I think Chip was wise not to go for 16 COGs.

g3cwi · 2012-06-06 08:32

This is an interesting discussion. But there is perhaps an unspoken tradeoff between performance and usability. My Lamborghini is a poor choice for trips to Walmart but the Toyota is fine. Very occasionally the Lambo is the best choice but most often the SUV does the job better...

Regards

Richard

prof_braino · 2012-06-06 08:34

Can you folks suggest a simple (?) algorithm to test this?

I think we can have an unlimited number of props working in parallel due to the MCS channels, but I'm not clever enough to com up with a program to test it.

If you lay out the structure I would code it in forth.

Heater. · 2012-06-06 09:19

prof_braino,

Easy. I'd do it with a bunch of threads running on my PC. Of course that's not real parallel execution because only one thread can run at a time. At least when using a single x86 core. The solution to that is not use real time but simulated time ticks. For example when simulating parallel execution arrange for all threads to complete some block of work before the time counter is moved on to the next tick. When simulating sequential execution arrange for only the running thread to to complete and then advance the time counter to the next tick. Now you can count how many blocks of work get done per time tick over some large amount of time ticks.

The blocks of work could be anything I guess but a usual fibo or some such code would do there.

All that's left is to arrange for a variable number of blocks of work to be done in "parallel" relative to "sequential" in some long running loop in each thread. So for example each thread could calculate some fibo nine times as simulated parallel and the once as simulated sequential giving an f of 1/10. And of course arrange to be able to run a variable number of threads in each experiment.

Sounds like an excellent job for the Go language.

Amdahl's law and the Propeller

Comments