Is Parallel Programming Hard, ...
Phil Pilgrim (PhiPi)
... And, If So, What Can You Do About It?
That's the title of a new book that you can download for free. Here's the announcement:
And here's the download page:
-Phil
Comments
Parallel Programming is extremely hard.
I find that if I have to work on two or more programs at the same time, it's better to concentrate on each one for a number of hours or days at a time and then switch to the other. If I switch too rapidly between programs it's really hard to keep all the details of each in mind; it gets very confusing and progress slows down. I have a hard time explaining this to my boss.
I know there are chess grandmasters who can play multiple opponents simultaneously, move by move, but I can't pull that off when programming: say, one line here, one line there, etc.
Now we can see there are 4 basic types of programming:
1) Single Programmer Single Program (SPSP).
2) Single Programmer Multiple Programs (SPMP)
3) Multiple Programmers Single Program (MPSP)
4) Multiple Programmers Multiple Programs (MPMP)
Oh, wait a minute... you meant programming for Parallel Execution.
Yep, that's hard too.
On switching back and forth between working on more than one program I agree even more. I am much more productive if I can concentrate on one program at a time. As Heater says, it's hard to keep the details of each in mind. A bit like switching a production line from one product to another. There is always some setup time required.
True, working with multi-bit buses, parallel gates, etc., and having to deal with multi-path gate delays and such tends to provide one with the right mindset for parallel programming. Even so, it ain't easy.
-Phil
That may well be true in some cases. Especially when stacking up programmers vs hardware engineers. In general though I might disagree. Are we talking about programmers, who are traditionally taught to think about software in a serial manner, or are we talking about computer scientists and algorithm designers who are of a more mathematical bent?
Well, sure, that makes the mechanics of writing the programs easier, but does it make designing efficient algorithms for parallel machines any easier?
Example 1
Let's take a case in point, topical here recently: the Fourier Transform. Calculated the naive old-fashioned way, an FT takes a long time, growing with the square of the data set size. Now assume you have a sea of processors with an easy way to start threads on them and an easy way to communicate data between them. What happens? You divide the work up among the processors. You communicate the data to and from all of them. You arrive at a solution that is still damn slow.
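To make the "square of the data set size" point concrete, here is a minimal C sketch of the naive transform (not any particular library's code). The nested loop is where the N-squared cost comes from, and handing slices of the outer loop to P processors only divides it by P:

/* Naive discrete Fourier transform: every output bin visits every input
 * sample, so the work grows as N*N.  Splitting the outer loop across P
 * processors still leaves N*N/P operations: better, but still slow. */
#include <complex.h>
#include <math.h>
#include <stddef.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

void naive_dft(const double complex *in, double complex *out, size_t n)
{
    for (size_t k = 0; k < n; k++) {          /* one output bin...            */
        double complex sum = 0;
        for (size_t j = 0; j < n; j++)        /* ...touches every input point */
            sum += in[j] * cexp(-2.0 * M_PI * I * (double)(k * j) / (double)n);
        out[k] = sum;
    }
}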
Enter a mathematician, Tukey. He sees a way to optimize the whole thing and achieve speed gains of thousands or millions for large data sets.
OK, let's use Tukey's idea on a parallel machine. But wait, there are issues with that: How to divide the work? With what granularity? In what order? How to create an algorithm that automatically balances compute time on each node against communication time between nodes...
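One common, if unsatisfying, answer to the granularity question is a hand-tuned cutoff: recurse on halves, and spawn a thread only while the piece of work is still big enough to pay for the thread. A hypothetical C/pthreads skeleton (CUTOFF and the per-chunk work are made up for illustration):

/* Hypothetical divide-and-conquer skeleton.  CUTOFF is the hand-tuned
 * answer to "with what granularity do we divide the work?", which is
 * exactly the kind of machine-dependent knob being complained about. */
#include <pthread.h>
#include <stddef.h>

#define CUTOFF 4096        /* below this, a thread costs more than it saves */

struct chunk { double *data; size_t len; };

static void process_serially(double *data, size_t len)
{
    for (size_t i = 0; i < len; i++)   /* stand-in for the real per-chunk work */
        data[i] *= 2.0;
}

static void process(double *data, size_t len);

static void *worker(void *arg)
{
    struct chunk *c = arg;
    process(c->data, c->len);
    return NULL;
}

static void process(double *data, size_t len)
{
    if (len <= CUTOFF) {                    /* too small to be worth a thread */
        process_serially(data, len);
        return;
    }
    struct chunk right = { data + len / 2, len - len / 2 };
    pthread_t t;
    if (pthread_create(&t, NULL, worker, &right) != 0) {
        process_serially(data, len);        /* no thread available: do it here */
        return;
    }
    process(data, len / 2);                 /* left half in the calling thread */
    pthread_join(t, NULL);
}

Pick CUTOFF too small and the threading overhead dominates; pick it too big and some cores sit idle. That balancing act is the whole problem in miniature.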
Example 2
The Mandelbrot set. For each pixel of a Mandelbrot set image you need to iterate a simple function hundreds, thousands, millions or more times. There is no way to speed that up with parallel processors, because every iteration depends on the result of the previous one.
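The per-pixel loop shows why: each step feeds the next, so within one pixel there is nothing to hand to another processor (a sketch, with a made-up escape test and iteration limit):

/* One Mandelbrot pixel: z = z*z + c, repeated until it escapes or we
 * give up.  Each iteration needs the previous z, so this loop is
 * inherently serial; only whole pixels can run in parallel. */
unsigned mandel_iterations(double cr, double ci, unsigned max_iter)
{
    double zr = 0.0, zi = 0.0;
    unsigned n = 0;
    while (n < max_iter && zr * zr + zi * zi < 4.0) {
        double t = zr * zr - zi * zi + cr;   /* depends on the previous zr, zi */
        zi = 2.0 * zr * zi + ci;
        zr = t;
        n++;
    }
    return n;
}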
That's OK though: the Mandelbrot set image can be divided into groups of pixels, and the work for each group farmed out and done in parallel.
But wait, there are those problems again: What size groups of pixels? That depends on the speed of your compute nodes balanced against the communication speed. Sending one pixel's worth of work out to each node might slow the whole thing down, as it takes longer to communicate the parameters and collect the result than it does to calculate it. But that decision depends on comms speed, compute speed and the number of iterations required for that pixel. That last thing depends on the "zoom" factor of the image you are looking at.
So how to build a parallel Mandelbrot set algorithm that automatically optimizes what work goes where in order to maximize speed?
Conclusion: creating simple parallel solutions may now be easier with the right processors, languages and operating systems, but designing efficient, optimized, self-organizing parallel algorithms does not seem to be.
P.S. I just got my first dev board for the "processor that shall remain nameless". I do wish the Prop had such hardware/software assistance with parallelism. As Phil points out, despite having multiple cores the Prop gives no other aids to the creation of parallel algorithms.
Regarding languages for parallel processing, I was thinking of MIMD architectures, as used by Inmos and XMOS. They make algorithm implementation quite straightforward using appropriate languages. Tools are available for optimising programs. It'll be a long time before that can be done automatically, though.
Phil is exactly right here. Even in tightly coupled things like multi-cog video drivers, there's a logic behind what's going on that isn't generic. What is usually called "parallel programming" is breaking up a generic program -- the kind we would think of as sequential -- and implementing the same functionality faster and more efficiently because we have multiple processors. And most "parallel programming" models don't assume something that is central to the Propeller, deterministic timing. Thus there are all kinds of scheduling and synchronization schemes. (Not that those don't exist in Propland too, but they tend to be much simpler because they're doing specific, not generic, things.)
I've always found fixing deadlocks to be much easier than finding the place where someone forgot to take out a lock. It is especially problematic in group projects where someone doesn't understand multi-threaded programming.
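For what it's worth, the "easier to fix" kind usually looks like this classic lock-ordering bug (a hypothetical minimal example, nothing from the book). Run it often enough or under load and it can hang, which at least tells you where to look, whereas the missing lock just corrupts data quietly:

/* Classic lock-ordering deadlock: thread A takes lock1 then lock2,
 * thread B takes lock2 then lock1.  If each grabs its first lock before
 * the other finishes, both block forever waiting for the second. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;

static void *thread_a(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);      /* may block forever if B holds lock2 */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

static void *thread_b(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock2);      /* opposite order: the bug            */
    pthread_mutex_lock(&lock1);
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no deadlock this time");
    return 0;
}

The usual cure is simply agreeing on one global lock order and sticking to it.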
I've seen an implementation of a multi-processor Mandelbrot set calculator that divides the overall screen area into groups of pixels, like 8x8 or 16x16, then pushes all the groups into a queue. Then it spawns jobs on all the processors that simply pull a group from the queue and compute the results. The thing is 'self balancing' to a certain degree - the very complex groups with deep iteration will take longer to process, but other jobs are free to continue pulling simple groups off the queue while that happens. I realize this is a little over-simplified, and doesn't deal efficiently with load balancing in the case where you have almost as many cores as you have, say, pixels, but it works well.
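Something like this hypothetical pthreads sketch captures the idea (tile size, image size, thread count and iteration limit are all made-up numbers). A shared counter hands out tiles, so a worker that lands a cheap tile immediately grabs another while its neighbour is still grinding through a deep one:

/* Hypothetical self-balancing Mandelbrot: tiles are handed out through a
 * shared atomic counter standing in for the queue. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define WIDTH   640
#define HEIGHT  480
#define TILE    16                       /* 16x16-pixel work units */
#define MAXITER 1000
#define NTHREAD 4

static unsigned short iters[HEIGHT][WIDTH];
static atomic_int next_tile = 0;
static const int tiles_x = WIDTH / TILE, tiles_y = HEIGHT / TILE;

static unsigned short pixel(double cr, double ci)
{
    double zr = 0.0, zi = 0.0;
    unsigned short n = 0;
    while (n < MAXITER && zr * zr + zi * zi < 4.0) {  /* serial: each step  */
        double t = zr * zr - zi * zi + cr;            /* needs the previous */
        zi = 2.0 * zr * zi + ci;
        zr = t;
        n++;
    }
    return n;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_tile, 1);      /* pull next work unit */
        if (t >= tiles_x * tiles_y)
            return NULL;
        int x0 = (t % tiles_x) * TILE, y0 = (t / tiles_x) * TILE;
        for (int y = y0; y < y0 + TILE; y++)
            for (int x = x0; x < x0 + TILE; x++)
                iters[y][x] = pixel(-2.5 + 3.5 * x / WIDTH,
                                    -1.25 + 2.5 * y / HEIGHT);
    }
}

int main(void)
{
    pthread_t th[NTHREAD];
    for (int i = 0; i < NTHREAD; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREAD; i++)
        pthread_join(th[i], NULL);
    printf("centre pixel needed %u iterations\n",
           (unsigned)iters[HEIGHT / 2][WIDTH / 2]);
    return 0;
}

The counter plays the role of the queue; a real version would also ship results somewhere and cope with more cores than tiles, but the self-balancing behaviour is all in that fetch-and-add.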
Yep, I've seen that done with the groups of pixels being complete lines of the display. One could see the "self-balancing" aspect happening as lines would start to be drawn from top to bottom, but there would, temporarily, be gaps of undrawn lines that were taking longer to complete than the following lines.
Again the Fourier Transform has me thinking about this "self-balancing" of algorithms. Although in this case it is to do with cache memory not cores.
I have a recursive implementation of the FFT which is half as fast as my other normal non-recursive approach for small data sets. However when the data set size grows past a million points the recursive version is neck and neck. When you get to 4 million data points the recursive version is twice as fast.
Eventually I realized that the reason for this is that the recursive algorithm divides up the data set in such a way that it can keep more working data in cache memory for longer. The traditional version is very cache unfriendly, always wanting data from places that are not in cache. Magically the recursive algorithm always automatically makes good use of cache no matter the cache size or the data set size. It does not need to be tweaked or configured in any way for different job/cache sizes.
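For the curious, the recursive structure being described looks roughly like this radix-2 sketch (not the actual code in question). Once a recursive call is working on a sub-array small enough to fit in cache, everything below that point in the recursion stays in cache, whatever size the cache happens to be:

/* Recursive radix-2 FFT sketch.  Call as fft_rec(data, scratch, n) with
 * n a power of two and scratch at least n elements.  Each call splits
 * its data into even and odd samples, transforms the halves, then
 * combines them. */
#include <complex.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

void fft_rec(double complex *x, double complex *scratch, size_t n)
{
    if (n <= 1)
        return;

    size_t half = n / 2;
    for (size_t i = 0; i < half; i++) {          /* separate even/odd samples */
        scratch[i]        = x[2 * i];
        scratch[i + half] = x[2 * i + 1];
    }
    memcpy(x, scratch, n * sizeof *x);

    fft_rec(x,        scratch, half);            /* transform each half       */
    fft_rec(x + half, scratch, half);

    for (size_t k = 0; k < half; k++) {          /* combine with the twiddles */
        double complex w = cexp(-2.0 * M_PI * I * (double)k / (double)n);
        double complex e = x[k], o = w * x[k + half];
        x[k]        = e + o;
        x[k + half] = e - o;
    }
}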
I'm sure such phenomena can happen with different algorithms making use of large numbers of processors.
I like to think of parallel processing in terms of large numbers of connected CPUs, hundreds, thousands, millions, rather than the handful of cores sharing memory that we have had access to so far.
http://www.youtube.com/watch?v=NiDiZTvlEsA
On the Propeller, parallel programming is MUCH easier than multitasking on another microcomputer.
On elaborate parallel processors, I've no idea how to even get started - but this book seems like an opportune entry point by a serious author.
BTW, NVIDIA is now claiming to market "Personal Supercomputers" that are essentially huge parallel processing computers. So we might actually see a lot more people jumping into parallel processing in the near future. Already Russian hackers have been using their video cards, not for video, but for breaking password codes through the use of their rather extensive and fast parallel processing arrays.
As usual, computers are fascinating, and I find myself once again falling far behind the leading edge, but having a lot of fun with them.
-Phil
The current Prop is great for Mandelbrot...up to a point. The bottleneck is the connection between the Prop and an external frame buffer. Prop II seems very promising in this regard.
Thanks for linking this, Phil. I've found it very interesting reading. SGI does a lot with NUMA systems, "Non-Uniform Memory Architecture", where they've got a single OS image and tons of CPUs. Their chief scientist once said that supercomputing was all about turning compute problems into I/O problems. That seems to resonate with the discussion here.
The bottom line is how you approach your given problem to achieve its solution. I agree with both Heater and Phil, but I also have a "different" view of how to approach a program. The Prop makes it really easy to keep "like tasks" together and different tasks separated. It's not a question of whether it's easier or harder; it's how you need to view a task and discover how best to approach it. It's a case of what your program's (or hardware's, for that matter) overall goals are.
Just my humble opinion.
That's because it's typeset with TeX (or LaTeX). I imagine it can be tweaked to not produce such an identifiable output, but most people don't bother since the defaults work so well.