Is Parallel Programming Hard, ...
Phil Pilgrim (PhiPi)
... And, If So, What Can You Do About It?
That's the title of a new book that you can download for free. Here's the announcement:
And here's the download page:
-Phil
Comments
Parallel Programming is extremely hard.
I find that if I have to work on two or more programs at the same time, it's better to concentrate on each one for a number of hours or days at a time and then switch to the other. If I switch too rapidly between programs it's really hard to keep all the details of each in mind; it gets very confusing and progress slows down. I have a hard time explaining this to my boss.
I know there are chess grandmasters who can play multiple opponents simultaneously, move by move, but I can't pull that off when programming: say, one line here, one line there, etc.
Now we can see there are 4 basic types of programming:
1) Single Programmer Single Program (SPSP).
2) Single Programmer Multiple Programs (SPMP)
3) Multiple Programmers Single Program (MPSP)
4) Multiple Programmers Multiple Programs (MPMP)
Oh, wait a minute... you meant programming for Parallel Execution.
Yep, that's hard too.
On switching back and forth between working on more than one program I agree even more. I am much more productive if I can concentrate on one program at a time. As Heater says, it's hard to keep the details of each in mind. A bit like switching a production line from one product to another. There is always some setup time required.
True, working with multi-bit buses, parallel gates, etc., and having to deal with multi-path gate delays and such tends to provide one with the right mindset for parallel programming. Even so, it ain't easy.
-Phil
That may well be true in some cases. Especially when stacking up programmers vs hardware engineers. In general though I might disagree. Are we talking about programmers, who are traditionally taught to think about software in a serial manner, or are we talking about computer scientists and algorithm designers who are of a more mathematical bent?
Well, sure, that makes the mechanics of writing the programs easier, but does it make designing efficient algorithms for parallel machines any easier?
Example 1
Let's take a case in point, topical here recently: the Fourier Transform. Calculated the naive old-fashioned way, an FT takes a long time, growing with the square of the data set size. Now assume you have a sea of processors with an easy way to start threads on them and an easy way to communicate data between them. What happens? You divide the work up among the processors. You communicate the data to and from all of them. You arrive at a solution that is still damn slow.
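To make the "square of the data set size" point concrete, here is a minimal C sketch of the naive transform (not any particular library's code). The nested loop is where the N-squared cost comes from, and handing slices of the outer loop to P processors only divides it by P:

/* Naive discrete Fourier transform: every output bin visits every input
 * sample, so the work grows as N*N.  Splitting the outer loop across P
 * processors still leaves N*N/P operations: better, but still slow. */
#include <complex.h>
#include <math.h>
#include <stddef.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

void naive_dft(const double complex *in, double complex *out, size_t n)
{
    for (size_t k = 0; k < n; k++) {          /* one output bin...            */
        double complex sum = 0;
        for (size_t j = 0; j < n; j++)        /* ...touches every input point */
            sum += in[j] * cexp(-2.0 * M_PI * I * (double)(k * j) / (double)n);
        out[k] = sum;
    }
}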
Enter a mathematician, Tukey. He sees a way to optimize the whole thing and achieve speed gains of thousands or millions for large data sets.
OK, let's use Tukey's idea on a parallel machine. But wait, there are issues with that: How to divide the work? With what granularity? In what order? How to create an algorithm that automatically balances compute time on each node against communication time between nodes...
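One common, if unsatisfying, answer to the granularity question is a hand-tuned cutoff: recurse on halves, and spawn a thread only while the piece of work is still big enough to pay for the thread. A hypothetical C/pthreads skeleton (CUTOFF and the per-chunk work are made up for illustration):

/* Hypothetical divide-and-conquer skeleton.  CUTOFF is the hand-tuned
 * answer to "with what granularity do we divide the work?", which is
 * exactly the kind of machine-dependent knob being complained about. */
#include <pthread.h>
#include <stddef.h>

#define CUTOFF 4096        /* below this, a thread costs more than it saves */

struct chunk { double *data; size_t len; };

static void process_serially(double *data, size_t len)
{
    for (size_t i = 0; i < len; i++)   /* stand-in for the real per-chunk work */
        data[i] *= 2.0;
}

static void process(double *data, size_t len);

static void *worker(void *arg)
{
    struct chunk *c = arg;
    process(c->data, c->len);
    return NULL;
}

static void process(double *data, size_t len)
{
    if (len <= CUTOFF) {                    /* too small to be worth a thread */
        process_serially(data, len);
        return;
    }
    struct chunk right = { data + len / 2, len - len / 2 };
    pthread_t t;
    if (pthread_create(&t, NULL, worker, &right) != 0) {
        process_serially(data, len);        /* no thread available: do it here */
        return;
    }
    process(data, len / 2);                 /* left half in the calling thread */
    pthread_join(t, NULL);
}

Pick CUTOFF too small and the threading overhead dominates; pick it too big and some cores sit idle. That balancing act is the whole problem in miniature.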
Example 2
The Mandelbrot set. For each pixel of a Mandelbrot set image you need to iterate a simple function hundreds, thousands, millions or more times. There is no way to speed that up with parallel processors, because every iteration depends on the result of the previous one.
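The per-pixel loop shows why: each step feeds the next, so within one pixel there is nothing to hand to another processor (a sketch, with a made-up escape test and iteration limit):

/* One Mandelbrot pixel: z = z*z + c, repeated until it escapes or we
 * give up.  Each iteration needs the previous z, so this loop is
 * inherently serial; only whole pixels can run in parallel. */
unsigned mandel_iterations(double cr, double ci, unsigned max_iter)
{
    double zr = 0.0, zi = 0.0;
    unsigned n = 0;
    while (n < max_iter && zr * zr + zi * zi < 4.0) {
        double t = zr * zr - zi * zi + cr;   /* depends on the previous zr, zi */
        zi = 2.0 * zr * zi + ci;
        zr = t;
        n++;
    }
    return n;
}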
That's OK though: the Mandelbrot set image can be divided into groups of pixels, and the work for each group farmed out and done in parallel.
But wait, there are those problems again: What size groups of pixels? That depends on the speed of your compute nodes balanced against the communication speed. Sending one pixel's worth of work out to each node might slow the whole thing down, as it takes longer to communicate the parameters and collect the result than it does to calculate it. But that decision depends on comms speed, compute speed and the number of iterations required for that pixel. That last thing depends on the "zoom" factor of the image you are looking at.
So how to build a parallel Mandelbrot set algorithm that automatically optimizes what work goes where in order to maximize speed?
Conclusion: creating simple parallel solutions may now be easier with the right processors, languages and operating systems, but designing efficient, optimized, self-organizing parallel algorithms does not seem to be.
P.S. I just got my first dev board for the "processor that shall remain nameless". I do wish the Prop had such hardware/software assistance with parallelism. As Phil points out, despite having multiple cores the Prop gives no other aids to the creation of parallel algorithms.
Regarding languages for parallel processing, I was thinking of MIMD architectures, as used by Inmos and XMOS. They make algorithm implementation quite straightforward using appropriate languages. Tools are available for optimising programs. It'll be a long time before that can be done automatically, though.
Phil is exactly right here. Even in tightly coupled things like multi-cog video drivers, there's a logic behind what's going on that isn't generic. What is usually called "parallel programming" is breaking up a generic program -- the kind we would think of as sequential -- and implementing the same functionality faster and more efficiently because we have multiple processors. And most "parallel programming" models don't assume something that is central to the Propeller, deterministic timing. Thus there are all kinds of scheduling and synchronization schemes. (Not that those don't exist in Propland too, but they tend to be much simpler because they're doing specific, not generic, things.)
I've always found fixing deadlocks to be much easier than finding the place where someone forgot to take out a lock. It is especially problematic in group projects where someone doesn't understand multi-threaded programming.
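For what it's worth, the "easier to fix" kind usually looks like this classic lock-ordering bug (a hypothetical minimal example, nothing from the book). Run it often enough or under load and it can hang, which at least tells you where to look, whereas the missing lock just corrupts data quietly:

/* Classic lock-ordering deadlock: thread A takes lock1 then lock2,
 * thread B takes lock2 then lock1.  If each grabs its first lock before
 * the other finishes, both block forever waiting for the second. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;

static void *thread_a(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);      /* may block forever if B holds lock2 */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

static void *thread_b(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock2);      /* opposite order: the bug            */
    pthread_mutex_lock(&lock1);
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no deadlock this time");
    return 0;
}

The usual cure is simply agreeing on one global lock order and sticking to it.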
I've seen an implementation of a multi-processor Mandelbrot set calculator that divides the overall screen area into groups of pixels, like 8x8 or 16x16, then pushes all the groups into a queue. Then it spawns jobs on all the processors that simply pull a group from the queue and compute the results. The thing is 'self balancing' to a certain degree - the very complex groups with deep iteration will take longer to process, but other jobs are free to continue pulling simple groups off the queue while that happens. I realize this is a little over-simplified, and doesn't deal efficiently with load balancing in the case where you have almost as many cores as you have, say, pixels, but it works well.
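Something like this hypothetical pthreads sketch captures the idea (tile size, image size, thread count and iteration limit are all made-up numbers). A shared counter hands out tiles, so a worker that lands a cheap tile immediately grabs another while its neighbour is still grinding through a deep one:

/* Hypothetical self-balancing Mandelbrot: tiles are handed out through a
 * shared atomic counter standing in for the queue. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define WIDTH   640
#define HEIGHT  480
#define TILE    16                       /* 16x16-pixel work units */
#define MAXITER 1000
#define NTHREAD 4

static unsigned short iters[HEIGHT][WIDTH];
static atomic_int next_tile = 0;
static const int tiles_x = WIDTH / TILE, tiles_y = HEIGHT / TILE;

static unsigned short pixel(double cr, double ci)
{
    double zr = 0.0, zi = 0.0;
    unsigned short n = 0;
    while (n < MAXITER && zr * zr + zi * zi < 4.0) {  /* serial: each step  */
        double t = zr * zr - zi * zi + cr;            /* needs the previous */
        zi = 2.0 * zr * zi + ci;
        zr = t;
        n++;
    }
    return n;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_tile, 1);      /* pull next work unit */
        if (t >= tiles_x * tiles_y)
            return NULL;
        int x0 = (t % tiles_x) * TILE, y0 = (t / tiles_x) * TILE;
        for (int y = y0; y < y0 + TILE; y++)
            for (int x = x0; x < x0 + TILE; x++)
                iters[y][x] = pixel(-2.5 + 3.5 * x / WIDTH,
                                    -1.25 + 2.5 * y / HEIGHT);
    }
}

int main(void)
{
    pthread_t th[NTHREAD];
    for (int i = 0; i < NTHREAD; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREAD; i++)
        pthread_join(th[i], NULL);
    printf("centre pixel needed %u iterations\n",
           (unsigned)iters[HEIGHT / 2][WIDTH / 2]);
    return 0;
}

The counter plays the role of the queue; a real version would also ship results somewhere and cope with more cores than tiles, but the self-balancing behaviour is all in that fetch-and-add.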
Yep, I've seen that done with the groups of pixels being complete lines of the display. One could see the "self-balancing" aspect happening as lines would start to be drawn from top to bottom, but there would, temporarily, be gaps of undrawn lines that were taking longer to complete than the following lines.
Again the Fourier Transform has me thinking about this "self-balancing" of algorithms. Although in this case it is to do with cache memory not cores.
I have a recursive implementation of the FFT which is half as fast as my other normal non-recursive approach for small data sets. However when the data set size grows past a million points the recursive version is neck and neck. When you get to 4 million data points the recursive version is twice as fast.
Eventually I realized that the reason for this is that the recursive algorithm divides up the data set in such a way that it can keep more working data in cache memory for longer. The traditional version is very cache unfriendly, always wanting data from places that are not in cache. Magically the recursive algorithm always automatically makes good use of cache no matter the cache size or the data set size. It does not need to be tweaked or configured in any way for different job/cache sizes.
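For the curious, the recursive structure being described looks roughly like this radix-2 sketch (not the actual code in question). Once a recursive call is working on a sub-array small enough to fit in cache, everything below that point in the recursion stays in cache, whatever size the cache happens to be:

/* Recursive radix-2 FFT sketch.  Call as fft_rec(data, scratch, n) with
 * n a power of two and scratch at least n elements.  Each call splits
 * its data into even and odd samples, transforms the halves, then
 * combines them. */
#include <complex.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

void fft_rec(double complex *x, double complex *scratch, size_t n)
{
    if (n <= 1)
        return;

    size_t half = n / 2;
    for (size_t i = 0; i < half; i++) {          /* separate even/odd samples */
        scratch[i]        = x[2 * i];
        scratch[i + half] = x[2 * i + 1];
    }
    memcpy(x, scratch, n * sizeof *x);

    fft_rec(x,        scratch, half);            /* transform each half       */
    fft_rec(x + half, scratch, half);

    for (size_t k = 0; k < half; k++) {          /* combine with the twiddles */
        double complex w = cexp(-2.0 * M_PI * I * (double)k / (double)n);
        double complex e = x[k], o = w * x[k + half];
        x[k]        = e + o;
        x[k + half] = e - o;
    }
}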
I'm sure such phenomena can happen with different algorithms making use of large numbers of processors.
I like to think of parallel processing in terms of large numbers of connected CPUs, hundreds, thousands, millions, rather than the handful of cores sharing memory that we have had access to so far.
http://www.youtube.com/watch?v=NiDiZTvlEsA
On the Propeller, parallel programming is MUCH easier than multitasking on another microcomputer.
On elaborate parallel processors, I've no idea how to even get started - but this book seems like an opportune entry point by a serious author.
BTW, NVIDIA is now claiming to market "Personal Supercomputers" that are essentially huge parallel processing computers. So we might actually see a lot more people jumping into parallel processing in the near future. Already Russian hackers have been using their video cards, not for video, but for breaking password codes through the use of their rather extensive and fast parallel processing arrays.
As usual, computers are fascinating, and I find myself once again falling far behind the leading edge, but having a lot of fun with them.
-Phil
The current Prop is great for Mandelbrot...up to a point. The bottleneck is the connection between the Prop and an external frame buffer. Prop II seems very promising in this regard.
Thanks for linking this, Phil. I've found it very interesting reading. SGI does a lot with NUMA systems, "Non-Uniform Memory Architecture", where they've got a single OS image and tons of CPUs. Their chief scientist once said that supercomputing was all about turning compute problems into I/O problems. That seems to resonate with the discussion here.
The bottom line is how you approach your given problem to achieve its solution. I agree with both Heater and Phil, but I also have a "different" view of how to approach a program. The Prop makes it really easy to keep "like tasks" together and different tasks separated. It's not a question of whether it's easier or harder; it's how you need to view a task and discover how best to approach it. It's a case of what your program's (or hardware's, for that matter) overall goals are.
Just my humble opinion.
That's because it's typeset with TeX (or LaTeX). I imagine it can be tweaked to not produce such an identifiable output, but most people don't bother since the defaults work so well.