Parallel algorithm

prof_braino · 2010-10-04 19:00

Has anybody posted an example of a parallel algorithm using the prop?

I think Leon suggested one that parallelizes (is this the correct term?) easily,
but I'm still a bit lost on how to get started.

Anybody working on this?

pharseid · 2010-10-04 20:01

I would think the quicksort would parallelize (don't know if that's a word) rather easily. First one cog picks a key and divides an array into two sections based on a comparison to the key. One section of the array is handed off to another cog and the other is handled by the first cog. They then repeat the procedure, dividing their array sections and handing off a new section to another cog and continuing to work on one section themselves and so on until all the cogs are involved, at which point they simply do a standard quicksort on the section of the array they have.

-phar

Phil Pilgrim (PhiPi) · 2010-10-04 20:11

"Son, I've pulled you over because you were weaving across the centerline. Unfortunately my breathalyzer is in the shop for repairs. So I'm going to give you a field sobriety test. First, before you step out of the car, I want you to say 'parallelize.'"

-Phil

Leon · 2010-10-04 21:53

British police supposedly used "The Leith police dismissith us" in pre-breathalizer days.

I think I saw "parallelize" used many years ago in connection with the Inmos transputer, the first generally available parallel processing device. It's quite commonly used these days.

Standard demos for the transputer included the Mandelbrot set and ray-tracing; they were quite mind-blowing at the time. Solving a large set of simultaneous equations using Gaussian elimination would be a good one which doesn't involve graphics; several parallel algorithms are available.

Humanoido · 2010-10-04 23:58

prof_braino wrote: »

Has anybody posted an example of a parallel algorithm using the prop? I think Leon suggested one that parallelizes (is this the correct term?) easily, but I'm still a bit lost on how to get started. Anybody working on this?

I'm working on three under NDA. I have another posted for one of the BASIC Stamp Supercomputers. Before converting it, I'm holding out for a bigger machine. The idea is well explained in the handbook and postings. One program is loaded into the entire number of available computers and all programs become parallel in the prop machine. In a 40 prop machine, that's 320 cogs and 320 programs. The programs are identical. However, once they reach their destination processors, things change. They are permitted to evolve.

ericball · 2010-10-05 04:06

It depends on what you mean by parallel algorithm. My Sprite Driver uses 2-5 cogs in parallel to generate the display. While one cog is generating output, the other 4 are generating the data they will display. I believe some of the hi-res VGA drivers also have multiple cogs working in parallel.

One challenge with the Propeller is there is a significant startup delay for each cog. (And there are only 8 cogs.) This isn't a good match for parallel algorithms which have sequential and parallel portions of a single instruction stream.

Leon · 2010-10-05 04:17

I think that the OP meant algorithms for parallel computation, which are somewhat different from your application. I don't think that the Propeller is all that suitable for that sort of thing, there are far better devices available.

Humanoido · 2010-10-05 20:29

This could be a broad category. Parallel algorithms, as we have already shown, can entail use of hardware devices along with software, i.e. pins, counters, etc. There's lots of challenges to be met, and fun, in working with software only algorithms. Another type of algorithm to develop for clustered machines is enumeration. Chip talked about one method. You can try other methods as well. Last time, I used an rc method for multiple processors but this time the goal is to keep more in software and less in hardware and circuit design.

Humanoido · 2010-10-05 21:07

One worthy algorithm for clustered machines is a process of subdividing an application so the work can be shared by the other processors.

I don't know of any other prop heads working on this, at this time. However, I have a simple program that I'm developing to handle a neuron. It uses a 4D construct. I seem to have garnered an interest in neural nets.

When one n works, another, and then more can be added, until finally a tiny neural net is born (said in simple terms). I'm sure there will be numerous algorithms as a result of working with prop chips for this application.

At the same time, we would like to see code to share work across processors.

There's another program version that can take work to be shared as a specific application and send decided pieces to specific processors for shared cooperation.

This can function in both parallel and multi-threaded processing environments. A master is needed to decide how and who reported and when, or if memory permits, the program of each processor can hold the instructions.

It's also possible that someone will create a new language more suitable for this.

Then work can be shared as a process in the background. Who will be the first to get one of the multiple processor programming languages to work on the prop?

Humanoido · 2010-10-10 06:03

I know some people working on these things and unfortunately they are unwilling to share, unless you're willing to afford and pay for their work and currently unrewarded ideas. This is unfortunate. Even if it took longer to reach goals, I would rather see this open source to advance science and improve and elevate the fun of hobby projects.

Christof Eb. · 2010-10-10 11:17

Hi Prof,
what do you mean exactly by "Paralell algorithm"?
Every application of the prop could be called with these words?

I have written a version of Mandelbrot's "Apfelm

prof_braino · 2010-10-10 12:26

Christof Eb. wrote: »

Hi Prof,
what do you mean exactly by "Paralell algorithm"?
Every application of the prop could be called with these words?

I have written a version of Mandelbrot's "Apfelm

Clock Loop · 2010-10-10 16:39

prof_braino wrote: »

Also, needed is an example that is easy to compare in a multi-prop configuration. The propforth team are about to test an inter-cog high speed communication protocol, so we should be able to assign all available cogs on N props in the configuration. We don't know what the maximum N is yet.

But I'm new at this. After I figure out what I'm doing with the primes I would like to try something like fractals, if you would care to share your work.

Eventually, I would like to come up with something that Humanoid could put on his tower and use all 320 cogs

I will post versions of Sieve of Eratosthenes for single cog, multiple cogs, and multiple prop chips as I get them done.

Don't forget my thread on parallelism programming when it comes to multiple props.

http://forums.parallax.com/showthread.php?t=124343

prof_braino · 2010-10-10 21:27

Clock Loop wrote: »

Don't forget my thread on parallelism programming when it comes to multiple props.

http://forums.parallax.com/showthread.php?t=124343

Hey, that's pretty cool.

Did you ever get it to run on the tower?

Clock Loop · 2010-10-10 21:46

prof_braino wrote: »

Hey, that's pretty cool.

Did you ever get it to run on the tower?

I don't know if he did or not, I provided a simplistic program he could test with.

I don't have a tower of any sort, only 8 or so prop chips on a fairly large breadboard that I was able to test it with.

Its up to Humanoido to take it the extra step and perfect/enhance the concept to suit his needs.

The funny thing about this concept, is that it could theoretically be applied to any chip that is programmable. (just as long as it doesn't have unique ID's or use unique replys)

One could call this "Nth slave dummy programming"

This method saves one from having to do gang programming, and or sequential programming. (which would take quite some time with hundereds of props)

If I had access to 50+ prop chips I would design a socketed pcb to allow some parallel programming along with parallel program testing.

Humanoido · 2010-10-11 02:34

Clock Loop wrote: »

I don't know if he did or not, I provided a simplistic program he could test with. I don't have a tower of any sort, only 8 or so prop chips on a fairly large breadboard that I was able to test it with. Its up to Humanoido to take it the extra step and perfect/enhance the concept to suit his needs. The funny thing about this concept, is that it could theoretically be applied to any chip that is programmable. (just as long as it doesn't have unique ID's or use unique replys) One could call this "Nth slave dummy programming" This method saves one from having to do gang programming, and or sequential programming. (which would take quite some time with hundereds of props) if I had access to 50+ prop chips I would design a socketed pcb to allow some parallel programming along with parallel program testing.

Working on some updates and revisions... I've remained busy cleaning up the waveform on the technique described there at the mentioned thread, and have resorted to the success provided by blocking which I am currently exploring. I do believe that any larger props machine will need to address this and additional issues in some fashion. For example, when I block ten, the load is four times, but if I block up to a 100, the action is ten times. Well, ten times is probably my limit. I used it on the SEED project. Currently I have short term temp pulled boards for two experiments that begin with 10, then 12 and spiral upwards. Then, I pulled the twin props from the existing boards and now incorporating those, designing a new printed circuit board which much bigger and can be multiplied. I probably won't stop until the signal is 100%, then I can solder up all the remaining boards and connect it together. Maybe there's ten experiments all going on at the same time. Lots of fun working on this. For parallel algorithms, many of these I begin with one chip and eight cogs, then try to expand the concept. I am not always successful.

Clock Loop · 2010-10-11 07:38

Humanoido wrote: »

Working on some updates and revisions... I've remained busy cleaning up the waveform on the technique described there at the mentioned thread, and have resorted to the success provided by blocking which I am currently exploring. I do believe that any larger props machine will need to address this and additional issues in some fashion. For example, when I block ten, the load is four times, but if I block up to a 100, the action is ten times. Well, ten times is probably my limit. I used it on the SEED project. Currently I have short term temp pulled boards for two experiments that begin with 10, then 12 and spiral upwards. Then, I pulled the twin props from the existing boards and now incorporating those, designing a new printed circuit board which much bigger and can be multiplied. I probably won't stop until the signal is 100%, then I can solder up all the remaining boards and connect it together. Maybe there's ten experiments all going on at the same time. Lots of fun working on this. For parallel algorithms, many of these I begin with one chip and eight cogs, then try to expand the concept. I am not always successful.

I have no idea what your trying to say here, or how it relates to my post, as you quoted my post when making this reply, perhaps you should try to reply in a different way, or make it more simplistic? I would like to help you as you seem to have issues with some "signal" when it comes to my post about parallel programming? I don't know, again, your post was quite confusing. Perhaps you weren't replying to my post specifically?

Humanoido · 2010-10-11 23:50

Clock Loop wrote: »

I have no idea what your trying to say here...

Clock Loop, I was referring to your quote:

Once in a while I will see artifacts from my serial line... possibly indicating some issue with synchronicity.

As a result, I'm investigating the signal quality and synchronicity of several different interfaces and circuit designs, that involve the same and more prop chips. Part of this project is temporarily on hold pending the oscilloscope.

I'm sure your help will be greatly appreciated when I work on the new wired machine with the new boards. Thank you very much. You've done some fine pioneering work with multiple prop chips. Keep at it.

Humanoido · 2010-10-12 11:25

Here's a simple machine you can build to test some basic parallel algorithms.

http://forums.parallax.com/showthread.php?p=945724#post945724

Christof Eb. · 2010-10-13 04:04

Hi Prof_braino again,
these days I stumbled over this:
http://www.mathworks.com/company/newsletters/news_notes/clevescorner/
There are two articles about parallel processing. It becomes clear, that the benefit of parallel computing is vermuch affected by:
* the problem: how long can each cog work independantly of the others with full speed
* the size of the memory that is available to all
* the speed of exchanging orders and results

My mandelbrot fractal program (I have to dig for it a little bit, but you can find the single cog version of it somewhere here in the forum.)
* Is written in Propbasic, so that the cogs can work from compiled code, that resides all the time in their private =fast cog ram using fast cog ram variables.
* They only need the new coordinates of the screen area as input (from hub ram) and a signal to start, which is an output pin, they scan.
* They can write the results directly into the hub ram screen buffer. At the end of their job they can use an output pin to signal that they are ready.
* The job is devided for the cogs so that every cog has different lines of the screen to calculate. This ensures, that there are no problems with bit set instructions for the pixels in the same long.
* There is a master cog that will collect the ready signals, calculate a new screen area and signal to begin with the task.

Alltogether the efficiency of the jobs are very high in this special case, because each cog can work for many instructions even without accessing hub ram.

Christof

prof_braino · 2010-10-18 20:55

Hanno posted a fractal program in honor of Benoit Mandelbrot died yesterday at 85.

http://forums.parallax.com/showthread.php?t=126429

To parallelize, would one take the main loop and divide it into four smaller loops, each for a different arrea of the screen, and run them on four different cogs?

Heater. · 2010-10-18 22:28

prof_braino,

Be aware that different areas of the image take different amounts of time to compute. Perhasp does not matter for the "top level" image but as you zoom in it can be dramatically different execution times from, pixel to pixel.

This means you might end up dividing the image into 4 parts. Then find that 3 of those parts are complete in a short time, while the last part takes all week:) This does not make good use of the available computing resources.

On technique I have seen used to share the work load more evenly is to split the job up by lines. Have each available processor(COG) calculate all the pixels for a given line on the screen.

So:

A master process manages the whole thing. It knows the lines that have been drawn and which are yet to be done. It also knows which COGS are working and which are idle.

When ever a worker COG is idle the master determines the next line to be drawn. It sends the coordinates in the complex plane of the first pixel in that line to the idle COG. Together with the "step size" to get from one pixel in the line to the next and the required iteration count. And whatever other params may be needed for that line.

The COG then computes all the pixels of the line and returns the result to the master for drawing on the screen.

This continues until all lines are drawn.

This is interesting to watch because the lines are filled in and appear on the screen out of order as the computation progresses. You can see how the work is being parallelized and different lines are being completed in different times.

Leon · 2010-10-19 00:03

That's how it was done on transputers:

http://maven.smith.edu/~thiebaut/transputer/chapter8/chap8-1.html

Heater. · 2010-10-19 00:21

Leon,

Yep, that's where I have seen it before. Strange that the paper you linked to is dated 1995. I thought I saw that Mandlebrot demo'ed when they featured the Transputer on the "Tomorrows World" TV show. That must have been in the 80's

Tomorrow has been a long time coming:) See my sig.

Leon · 2010-10-19 00:28

People are still playing with them, see the comp.sys.transputer Usenet group. The first T414 transputer was developed in 1984; I bought a couple of them for £400 each in 1985, IIRC! The T800 with floating-point was launched in 1987 and cost £750.

Heater. · 2010-10-19 00:35

Where on earth would one get Transputers to play with now a days?

I'd love to have one to add to my collection of memorabilia, preferably more than one for some parallel action.

Sadly I seem to have lost all my Inmos Transputer/Occam data books and manuals somewhere along the way. Still we have the "chip that shall remain nameless" now.

Leon · 2010-10-19 00:42

Lots of people bought up stocks of chips and hardware when production ceased, there is still a lot of stuff around.

K2 · 2010-10-19 07:27

Heater, I've got a few transputers and boards I could sell you! I actually designed the boards myself. We once (87-88) crammed more than 100 transputers and their respective memory into two PC expansion boxes stacked atop one another. It was pretty cool at the time.

More impressive nowadays is Quickman1.10 on a dual- or quad-core Athlon.

Humanoido · 2010-10-19 07:34

It's really small. You could easily ship it. I think the value is in the populated board, not one chip. Maybe you can try to collect some boards and assemble your own Transputer then connect it to a prop stack.

Leon · 2010-10-19 07:41

K2 wrote: »

Heater, I've got a few transputers and boards I could sell you! I actually designed the boards myself. We once (87-88) crammed more than 100 transputers and their respective memory into two PC expansion boxes stacked atop one another. It was pretty cool at the time.

More impressive nowadays is Quickman1.10 on a dual- or quad-core Athlon.

I did something like that with 20 of my transputer modules on a double-height ISA card. Several of those were built into a special high PC cabinet (60 transputers each with 1 Mbyte of RAM). It was intended for IC design, but the university team writing the software couldn't get it to work properly.

Leon · 2010-10-19 07:44

Humanoido wrote: »

It's really small. You could easily ship it. I think the value is in the populated board, not one chip. Maybe you can try to collect some boards and assemble your own Transputer then connect it to a prop stack.

It would be more sensible to use one of the chips which must not be mentioned. I already have one interfaced to a Propeller.

Parallel algorithm

Comments