Working COGs in parallel?
Rsadeika
Posts: 3,837
in Propeller 1
I was just reading a post in another thread, and there was mention of COGs in parallel. I guess the first question is, if the Propeller could do it, what the heck would you do with it?
I am not even sure as to how that concept could be applied with a Propeller, and how efficient would that be. And if somebody was to try it, what language would you use, Spin, PASM, PropGCC, or what? Just curious.
Ray
Comments
I wrote a parallel Mandelbrot set renderer in PASM. Obviously, it's not terribly useful, but it is parallel. It uses 6 cogs and is really slow (at high zoom levels, at least).
It doesn't double buffer because I like watching things happen and not double buffering allows a higher resolution, but if you find all of the comments that say "double buffer" and do what they say, and then decrease x_tiles and y_tiles so it fits in RAM, it should do double buffering.
I think it needs a 6.25MHz crystal. It might work with a 5MHz crystal, but I forget if it does. Hook up a VGA display on the usual pins.
COGs run in parallel; we have been using them in parallel since forever. In Spin, C, PASM, whatever.
But if you mean tackling a single problem by distributing its algorithm over many COGs, then things get a bit complicated.
Imagine you have a loop that adds up 30,000 numbers. One could make that into two loops that add up 15000 of those numbers each, running on different cogs, and then add the two results together. Or split it up into 3 or 4 or more separate loops running on 3 or 4 COGs in parallel.
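To make the splitting concrete, here is a minimal sketch in plain C of the decomposition described above. It only shows how the work is divided and recombined: the chunks run one after another here, whereas on a Propeller each `partial_sum` call would be handed to its own cog. The names `partial_sum` and `chunked_sum` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the elements in the half-open range [begin, end).
   This is the piece of work one cog would be given. */
static int64_t partial_sum(const int32_t *data, size_t begin, size_t end)
{
    int64_t total = 0;
    for (size_t i = begin; i < end; i++)
        total += data[i];
    return total;
}

/* Split n elements into `workers` nearly equal chunks, sum each chunk,
   and add the partial results together. The chunks are processed
   sequentially in this sketch; only the partitioning is the point. */
static int64_t chunked_sum(const int32_t *data, size_t n, size_t workers)
{
    assert(workers > 0);
    int64_t total = 0;
    for (size_t w = 0; w < workers; w++) {
        size_t begin = n * w / workers;        /* first index of chunk w */
        size_t end   = n * (w + 1) / workers;  /* one past its last index */
        total += partial_sum(data, begin, end);
    }
    return total;
}
```

Calling `chunked_sum(data, 30000, 2)` gives the two 15,000-element loops from the example, and changing the last argument to 3 or 4 gives the other splits. The result is the same whatever the worker count, which is what makes this kind of loop easy to parallelize.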
One can write code like that in many languages. The problem is managing the parallelism. What if you want the same code to automatically adjust itself to 2 or 4 or more parallel threads? Whatever is available at the time. It gets worse if your algorithm is more complicated than just adding up a load of numbers.
Enter OpenMP. OpenMP can be used with prop-gcc to automatically parallelize your algorithm. Well, I say "automatically", but you do have to add some annotations to your C code to tell OpenMP what it can parallelize safely. I made a Fast Fourier Transform using C and OpenMP, just for fun. Later it turned out a couple of people have used that FFT for interesting projects.
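For a flavour of what those annotations look like, here is a minimal sketch of the number-summing loop with a standard OpenMP pragma. This is not the FFT code mentioned above, and `omp_sum` is a made-up name. With stock GCC, OpenMP is enabled with `-fopenmp`; I am assuming prop-gcc works the same way, so check its documentation. If OpenMP is not enabled, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum n values. The pragma tells OpenMP that the iterations are
   independent and that `total` is a reduction variable: each thread
   keeps a private partial sum and the partial sums are added together
   when the loop ends. The number of threads is chosen by the runtime. */
static int64_t omp_sum(const int32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    int64_t total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; i++)
        total += data[i];
    return total;
}
```

The loop body is exactly what you would write for one thread. The split into 2, 4 or more pieces is left to the OpenMP runtime, which is the part you would otherwise have to manage by hand.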
PASM is rather misleading. I'm guessing that, when you say "PASM" you think of a DAT section in a .spin file. Don't forget that you can get exactly the same result from a .S ("assembly") file which is compiled via PropGCC or even inline assembly. I also recently learned that PropBASIC compiles inline assembly.
I won't disagree that assembly would be a good choice for anything computationally intensive on the Propeller, but I would encourage you to recognize that you can do that no matter your high-level language and toolchain choice.
This is as opposed to "asm" or whatever, which refers to whatever assembler syntax is used in C/C++ and other compilers and assemblers.
I can see that. But it doesn't make any sense to me to make an architectural choice based on this distinction. You can get the same thing done with both and have nearly identical binaries.
I'm not sure I'd want to get into parallelizing anything complex in assembler though.
I was amazed to find that the C version of my FFT ran nearly as fast as the PASM version, thanks to the fcache feature of prop-gcc that stashes the inner loop entirely into COG and runs it there. (I would also say thanks to my crappily optimized PASM, but actually a number of people here helped out in optimizing it, so it's probably not bad.)
So then, parallelizing that C code with OpenMP was pretty straightforward. Having it adjust for two or four cogs is easy. I would not want to mess with all that in PASM.
So now I am back to wondering what kind of practical problem you could have the Propeller COGs solve by working in parallel, and I mean practical.
Ray
-Phil
Be careful what you wish here, it might remove ALL of your source...
@Rsadeika,
I am still not sure yet where you want to go here with your proposal.
A lot of examples for the Propeller use COGs in a way that the calling COG waits until the called COG has finished its job. That is something I would not call parallel execution. But there are other examples really running in parallel, like FSRW's read-ahead/write-behind. There the calling COG only waits for the data transfer between the COGs, and the read/write operation runs in parallel with the execution of the calling COG.
The same is mostly true for serial drivers, video drivers, PWM, servo drivers, sound, LEDs, and the like, where the calling COG just sets some parameters and goes on doing something else while the called COG does its job.
Then there is the case of shared parallel operation, like in some video drivers, where multiple COGs each render some lines and one COG displays them.
This parallelism is the main point of the propeller. Say your video driver displays some Hub-buffer as text on the screen. Just running in a loop and doing that. Completely undisturbed by any other COG running. Same for a serial driver with HUB buffer, just tagging along, taking care of the serial bit stream. And your main program can now read one buffer, write into the other, process stuff and still serial and video are not missing any step.
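A shared buffer like that is often organized as a ring buffer with one writer and one reader. Below is a minimal sketch in C, assuming exactly one producer (say, the serial driver COG) and one consumer (the main program). All names here (`ring_t`, `ring_put`, `ring_get`, `RING_SIZE`) are made up for illustration and are not taken from any particular Propeller driver. It uses `volatile` only and relies on each index having a single writer; on a multi-core PC you would want C11 atomics instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 16u   /* illustrative size; must be a power of two */

typedef struct {
    volatile uint32_t head;            /* written only by the producer */
    volatile uint32_t tail;            /* written only by the consumer */
    volatile uint8_t  data[RING_SIZE];
} ring_t;

/* Producer side: store one byte. Returns 1 on success, 0 if full. */
static int ring_put(ring_t *r, uint8_t byte)
{
    assert(r != NULL);
    uint32_t head = r->head;
    if (head - r->tail == RING_SIZE)
        return 0;                      /* full: consumer has not caught up */
    r->data[head & (RING_SIZE - 1u)] = byte;
    r->head = head + 1u;               /* publish only after the byte is stored */
    return 1;
}

/* Consumer side: fetch one byte. Returns 1 on success, 0 if empty. */
static int ring_get(ring_t *r, uint8_t *out)
{
    assert(r != NULL && out != NULL);
    uint32_t tail = r->tail;
    if (tail == r->head)
        return 0;                      /* empty: nothing new from the producer */
    *out = r->data[tail & (RING_SIZE - 1u)];
    r->tail = tail + 1u;               /* free the slot for the producer */
    return 1;
}
```

Neither side ever waits for the other. The driver keeps calling `ring_put` as bytes arrive, and the main program calls `ring_get` whenever it has time, which is the "not missing any step" behaviour described above.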
As for which language to choose? It does not matter at all. All available ones are able to run code in different COGs at the same time. You just need a different programming style to take advantage of having 8 COGs instead of one core and interrupts.
Enjoy!
Mike