Parallel algorithm
prof_braino
Has anybody posted an example of a parallel algorithm using the prop?
I think Leon suggested one that parallelizes (is this the correct term?) easily,
but I'm still a bit lost on how to get started.
Anybody working on this?
Comments
I think I saw "parallelize" used many years ago in connection with the Inmos transputer, the first generally available parallel processing device. It's quite commonly used these days.
Standard demos for the transputer included the Mandelbrot set and ray-tracing; they were quite mind-blowing at the time. Solving a large set of simultaneous equations using Gaussian elimination would be a good one which doesn't involve graphics; several parallel algorithms are available.
One challenge with the Propeller is that there is a significant startup delay for each cog. (And there are only 8 cogs.) This isn't a good match for parallel algorithms that have sequential and parallel portions within a single instruction stream.
I don't know of any other prop heads working on this at this time. However, I have a simple program that I'm developing to handle a neuron. It uses a 4D construct. I seem to have garnered an interest in neural nets.
When one neuron works, another can be added, and then more, until finally a tiny neural net is born (said in simple terms). I'm sure there will be numerous algorithms as a result of working with prop chips for this application.
At the same time, we would like to see code to share work across processors.
There's another program version that can take the work to be shared for a specific application and send the chosen pieces to specific processors for cooperative processing.
This can function in both parallel and multi-threaded processing environments. A master is needed to decide who does what, who reports, and when; or, if memory permits, each processor's program can hold its own instructions.
It's also possible that someone will create a new language more suitable for this.
Then work can be shared as a process in the background. Who will be the first to get one of the multiple processor programming languages to work on the prop?
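To make that master/worker idea concrete, here is a rough Spin sketch. It is not from any real project; the mailbox layout, all the names, and the placeholder "work" are made up. Every worker cog holds the same code and polls a hub-RAM mailbox, while a master cog decides who gets which piece and when.

CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

  WORKERS  = 4
  NO_WORK  = -1

VAR
  long mailbox[WORKERS]                 ' one long per worker: NO_WORK, or a piece of work to do
  long answer[WORKERS]
  long stack[WORKERS * 32]

PUB Main | i, piece
  repeat i from 0 to WORKERS - 1
    mailbox[i] := NO_WORK
    cognew(Worker(i), @stack[i * 32])   ' every worker cog runs the same code

  piece := 0
  repeat                                ' the master decides who gets what, and when
    repeat i from 0 to WORKERS - 1
      if mailbox[i] == NO_WORK
        ' a real program would read answer[i] here before handing out new work
        mailbox[i] := piece++

PRI Worker(id) | piece
  repeat
    repeat while (piece := mailbox[id]) == NO_WORK   ' wait for the master's assignment
    answer[id] := piece * piece                      ' placeholder "work": square the piece number
    mailbox[id] := NO_WORK                           ' report back: this cog is idle again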
What do you mean exactly by "parallel algorithm"?
Couldn't almost every application of the prop be described with these words?
I have written a version of Mandelbrot's "Apfelmännchen" (the Mandelbrot set).
Don't forget my thread on parallel programming with multiple props.
http://forums.parallax.com/showthread.php?t=124343
Hey, that's pretty cool.
Did you ever get it to run on the tower?
I don't know if he did or not, I provided a simplistic program he could test with.
I don't have a tower of any sort, only 8 or so prop chips on a fairly large breadboard that I was able to test it with.
It's up to Humanoido to take it the extra step and perfect/enhance the concept to suit his needs.
The funny thing about this concept is that it could theoretically be applied to any chip that is programmable (just as long as it doesn't have unique IDs or use unique replies).
One could call this "Nth slave dummy programming"
This method saves one from having to do gang programming and/or sequential programming (which would take quite some time with hundreds of props).
If I had access to 50+ prop chips I would design a socketed pcb to allow some parallel programming along with parallel program testing.
I have no idea what you're trying to say here, or how it relates to my post, since you quoted my post when making this reply. Perhaps you should try to reply in a different way, or make it simpler? I would like to help, as you seem to have some issue with a "signal" related to my post about parallel programming, but again, your post was quite confusing. Perhaps you weren't replying to my post specifically?
Once in a while I will see artifacts on my serial line... possibly indicating some issue with synchronization.
As a result, I'm investigating the signal quality and synchronization of several different interfaces and circuit designs involving the same number of prop chips and more. Part of this project is temporarily on hold pending the oscilloscope.
I'm sure your help will be greatly appreciated when I work on the new wired machine with the new boards. Thank you very much. You've done some fine pioneering work with multiple prop chips. Keep at it.
http://forums.parallax.com/showthread.php?p=945724#post945724
Recently I stumbled over this:
http://www.mathworks.com/company/newsletters/news_notes/clevescorner/
There are two articles about parallel processing. It becomes clear that the benefit of parallel computing is very much affected by:
* the problem: how long each cog can work independently of the others at full speed (see the rough numbers after this list)
* the size of the memory that is available to all
* the speed of exchanging orders and results
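A quick back-of-the-envelope way to see the weight of the first point (this is just Amdahl's law, nothing prop-specific): if a fraction s of the job stays serial (handing out work, waiting on hub RAM, collecting results), the best speedup on N cogs is 1 / (s + (1 - s)/N). With 8 cogs and only 10% serial overhead that gives 1 / (0.1 + 0.9/8), roughly 4.7x instead of 8x, which is why keeping each cog busy in its own cog RAM matters so much.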
My Mandelbrot fractal program (I have to dig for it a little bit, but you can find the single cog version of it somewhere here in the forum):
* It is written in PropBasic, so the cogs can work from compiled code that resides all the time in their private (fast) cog RAM, using fast cog RAM variables.
* They only need the new coordinates of the screen area as input (from hub RAM) and a signal to start, which is an output pin that they scan.
* They can write the results directly into the hub RAM screen buffer. At the end of their job they can use an output pin to signal that they are ready.
* The job is divided so that every cog calculates different lines of the screen. This ensures that there are no problems with bit-set instructions for pixels in the same long.
* There is a master cog that collects the ready signals, calculates a new screen area, and signals the cogs to begin the task.
Altogether the efficiency is very high in this special case, because each cog can execute many instructions without accessing hub RAM.
Christof
http://forums.parallax.com/showthread.php?t=126429
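This is not Christof's PropBasic code, but a rough Spin sketch of the start/ready pin handshake he describes above, just to show the shape of it. The pin numbers, the worker count, and the four-phase handshake details are all made up:

CON
  _clkmode  = xtal1 + pll16x
  _xinfreq  = 5_000_000

  WORKERS   = 3
  START_PIN = 16                        ' master raises this to start a new screen area
  DONE_BASE = 17                        ' worker i raises pin DONE_BASE + i when its lines are finished

VAR
  long x0, y0, stepSize                 ' screen-area parameters shared through hub RAM
  long stack[WORKERS * 32]

PUB Main | i, allDone
  allDone := (|< WORKERS) - 1           ' %111 when all three done pins are high
  dira[START_PIN]~~                     ' the master cog drives the start pin
  repeat i from 0 to WORKERS - 1
    cognew(Worker(i), @stack[i * 32])

  repeat
    ' ... put the next screen area's coordinates into x0, y0, stepSize ...
    outa[START_PIN]~~                                           ' tell the workers to go
    repeat until ina[DONE_BASE + WORKERS - 1..DONE_BASE] == allDone
    ' ... the pixels are already in the hub RAM screen buffer ...
    outa[START_PIN]~                                            ' drop start so the workers can reset
    repeat until ina[DONE_BASE + WORKERS - 1..DONE_BASE] == 0

PRI Worker(id)
  dira[DONE_BASE + id]~~                ' each worker drives only its own done pin
  repeat
    repeat until ina[START_PIN]         ' wait for the master's start signal
    ' ... render every WORKERS'th screen line, starting at line "id", into hub RAM ...
    outa[DONE_BASE + id]~~              ' signal that this cog is finished
    repeat while ina[START_PIN]         ' wait for the master to acknowledge
    outa[DONE_BASE + id]~               ' and go back to idle

Because each cog's DIRA/OUTA bits are OR'd onto the pins, the workers can each drive their own done pin without interfering with the master.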
To parallelize, would one take the main loop and divide it into four smaller loops, each for a different area of the screen, and run them on four different cogs?
Be aware that different areas of the image take different amounts of time to compute. Perhaps it does not matter for the "top level" image, but as you zoom in the execution time can differ dramatically from pixel to pixel.
This means you might end up dividing the image into 4 parts, then find that 3 of those parts are complete in a short time, while the last part takes all week :) This does not make good use of the available computing resources.
One technique I have seen used to share the work load more evenly is to split the job up by lines: have each available processor (COG) calculate all the pixels for a given line on the screen.
So:
A master process manages the whole thing. It knows the lines that have been drawn and which are yet to be done. It also knows which COGS are working and which are idle.
Whenever a worker COG is idle, the master determines the next line to be drawn. It sends the coordinates in the complex plane of the first pixel in that line to the idle COG, together with the "step size" to get from one pixel in the line to the next, the required iteration count, and whatever other params may be needed for that line.
The COG then computes all the pixels of the line and returns the result to the master for drawing on the screen.
This continues until all lines are drawn.
This is interesting to watch because the lines are filled in and appear on the screen out of order as the computation progresses. You can see how the work is being parallelized and different lines are being completed in different times.
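A minimal Spin sketch of that line-at-a-time scheme might look like the following. None of this is from a real program: the screen size, the 16.16 fixed-point view constants, and the placeholder Iterate routine are invented, and "drawing" is reduced to storing an iteration count per pixel in a hub-RAM buffer.

CON
  _clkmode   = xtal1 + pll16x
  _xinfreq   = 5_000_000

  WORKERS    = 4
  WIDTH      = 128
  HEIGHT     = 96
  IDLE       = -1

  X_MIN      = -(2 << 16)               ' made-up 16.16 fixed-point view window
  Y_MIN      = -(1 << 16)
  PIXEL_STEP = (3 << 16) / WIDTH

VAR
  long lineNo[WORKERS]                   ' IDLE, or the screen line this worker should compute
  long cr0[WORKERS]                      ' real part of the line's first pixel (16.16 fixed point)
  long ci0[WORKERS]                      ' imaginary part of the line (16.16 fixed point)
  long stepSz[WORKERS]                   ' distance between pixels along the line
  long maxIter[WORKERS]
  byte screen[WIDTH * HEIGHT]            ' one iteration count per pixel, filled in by the workers
  long stack[WORKERS * 32]

PUB Main | i, nextLine
  repeat i from 0 to WORKERS - 1
    lineNo[i] := IDLE
    cognew(Worker(i), @stack[i * 32])

  nextLine := 0
  repeat while nextLine < HEIGHT         ' hand out lines as workers become idle
    repeat i from 0 to WORKERS - 1
      if lineNo[i] == IDLE and nextLine < HEIGHT
        cr0[i]     := X_MIN
        ci0[i]     := Y_MIN + nextLine * PIXEL_STEP
        stepSz[i]  := PIXEL_STEP
        maxIter[i] := 64
        lineNo[i]  := nextLine++         ' written last: this is what wakes the worker up

  repeat i from 0 to WORKERS - 1         ' wait for the last few lines to finish
    repeat until lineNo[i] == IDLE

PRI Worker(id) | x, cr, n
  repeat
    repeat while lineNo[id] == IDLE      ' poll the mailbox until the master assigns a line
    cr := cr0[id]
    repeat x from 0 to WIDTH - 1
      n := Iterate(cr, ci0[id], maxIter[id])
      screen[lineNo[id] * WIDTH + x] := n          ' write the pixel straight into hub RAM
      cr += stepSz[id]
    lineNo[id] := IDLE                   ' report the line done: this cog is free again

PRI Iterate(cr, ci, limit)
  ' placeholder only: a real version would run the z := z*z + c iteration in 16.16 fixed point
  return ||(cr ^ ci) // limit

The important part is the lineNo mailbox: the master fills in the parameters first and writes the line number last, and a worker writing IDLE back is what reports "done".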
http://maven.smith.edu/~thiebaut/transputer/chapter8/chap8-1.html
Yep, that's where I have seen it before. Strange that the paper you linked to is dated 1995. I thought I saw that Mandelbrot demo'ed when they featured the Transputer on the "Tomorrow's World" TV show. That must have been in the 80's.
Tomorrow has been a long time coming:) See my sig.
I'd love to have one to add to my collection of memorabilia, preferably more than one for some parallel action.
Sadly I seem to have lost all my Inmos Transputer/Occam data books and manuals somewhere along the way. Still we have the "chip that shall remain nameless" now.
More impressive nowadays is Quickman1.10 on a dual- or quad-core Athlon.
I did something like that with 20 of my transputer modules on a double-height ISA card. Several of those cards were built into a special tall PC cabinet (60 transputers, each with 1 MByte of RAM). It was intended for IC design, but the university team writing the software couldn't get it to work properly.
It would be more sensible to use one of the chips which must not be mentioned. I already have one interfaced to a Propeller.