Working COGs in parallel?

Rsadeika · 2016-02-13 18:17

I was just reading a post in another thread, and there was mention of COGs in parallel. I guess the first question is, if the Propeller could do it, what the heck would you do with it?

I am not even sure how that concept could be applied with a Propeller, or how efficient it would be. And if somebody were to try it, what language would you use: Spin, PASM, PropGCC, or what? Just curious.

Ray

Electrodude · 2016-02-13 19:08

I would probably use PASM or Tachyon for anything that needed to be parallel. If it needs to be parallel, it obviously also needs to be fast, and those are the two fastest languages for the Propeller.

I wrote a parallel Mandelbrot set renderer in PASM. Obviously, it's not terribly useful, but it is parallel. It uses 6 cogs and is really slow (at high zoom levels, at least).

It doesn't double buffer because I like watching things happen and not double buffering allows a higher resolution, but if you find all of the comments that say "double buffer" and do what they say, and then decrease x_tiles and y_tiles so it fits in RAM, it should do double buffering.

I think it needs a 6.25MHz crystal. It might work with a 5MHz crystal, but I forget if it does. Hook up a VGA display on the usual pins.

potatohead · 2016-02-13 19:32

Sprites are another parallel case that has been done.

Heater. · 2016-02-13 19:35

Ray,

...COGs in parallel...

I'm not sure I understand the question.

COGs run in parallel, we have been using them in parallel since forever. In Spin, C, PASM, whatever.

But if you mean tackling a single problem by distributing its algorithm over many COGs then things get a bit complicated.

Imagine you have a loop that adds up 30,000 numbers. One could make that into two loops that add up 15000 of those numbers each, running on different cogs, and then add the two results together. Or split it up into 3 or 4 or more separate loops running on 3 or 4 COGs in parallel.
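A minimal sketch of that split in plain C, with nothing Propeller-specific in it. The names `sum_range` and `split_sum` are hypothetical, and the chunks are summed one after another here; on a Propeller each call to `sum_range` would be handed to its own COG and the partial results combined afterwards.

```c
#include <stddef.h>

/* Sum values[begin..end). This is the piece each COG would run on its own chunk. */
long sum_range(const int *values, size_t begin, size_t end)
{
    long total = 0;
    for (size_t i = begin; i < end; i++)
        total += values[i];
    return total;
}

/* Split the array into `workers` near-equal chunks and add up the partial sums.
   The chunks are processed sequentially here; only the partitioning is shown. */
long split_sum(const int *values, size_t count, size_t workers)
{
    long total = 0;
    if (workers == 0)
        workers = 1;
    for (size_t w = 0; w < workers; w++) {
        size_t begin = count * w / workers;
        size_t end = count * (w + 1) / workers;
        total += sum_range(values, begin, end);
    }
    return total;
}
```

Because the chunk boundaries are computed from `workers`, the same code covers the 2, 3 or 4 COG cases, including counts that do not divide evenly.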

One can write code like that in many languages. The problem is managing the parallelism. What if you want the same code to automatically adjust itself to 2 or 4 or more parallel threads? Whatever is available at the time. It gets worse if your algorithm is more complicated than just adding up a load of numbers.

Enter OpenMP. OpenMP can be used with prop-gcc to automatically parallelize your algorithm. Well, I say "automatically", you do have to add some annotations to your C code to tell OpenMP what it can parallelize safely.
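For the summing example, such an annotation could look roughly like the sketch below. It assumes a compiler with OpenMP enabled (for GCC that is the `-fopenmp` option); the function name `omp_sum` is made up for illustration. Without OpenMP the pragma is simply ignored and the loop runs serially with the same result.

```c
/* Sum an array. The pragma tells OpenMP that the iterations are independent
   and that `total` is a reduction variable: each thread keeps a private
   partial sum, and the partial sums are added together at the end. */
long omp_sum(const int *values, int count)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

The number of threads is not written into the code anywhere; the OpenMP runtime decides how to divide the loop, which is the "adjust itself" part.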

...what the heck would you do with it?

I made a fast Fourier Transform using C and OpenMP, just for fun. Later it turned out a couple of people have used that FFT for interesting projects.

DavidZemon · 2016-02-13 19:35

What exactly does "working cogs in parallel" mean? Do you mean two cogs computing a single result? That surely can be done, and has been done many times in many different languages. You only need to choose a communication method for the two cogs.

PASM is rather misleading. I'm guessing that, when you say "PASM" you think of a DAT section in a .spin file. Don't forget that you can get exactly the same result from a .S ("assembly") file which is compiled via PropGCC or even inline assembly. I also recently learned that PropBASIC compiles inline assembly.

I won't disagree that assembly would be a good choice for anything computationally intensive on the Propeller, but I would encourage you to recognize that you can do that no matter your high-level language and toolchain choice.

Heater. · 2016-02-13 19:51

Over the years I have understood "PASM", around here, to mean that assembler syntax you find in Spin. Of course that includes everything in the DAT sections, and the constants in CON and the variables in VAR. Which is basically Spin syntax. So everything but the PUB and PRI sections.

This is as opposed to "asm" or whatever, which refers to whatever assembler syntax is used in C/C++ and other compilers and assemblers.

DavidZemon · 2016-02-13 20:01

Heater. wrote: »

Over the years I have understood "PASM", around here, to mean that assembler syntax you find in Spin. Of course that includes everything in the DAT sections, and the constants in CON and the variables in VAR. Which is basically Spin syntax. So everything but the PUB and PRI sections.

This is as opposed to "asm" or whatever, which refers to whatever assembler syntax is used in C/C++ and other compilers and assemblers.

I can see that. But it doesn't make any sense to me to make an architectural choice based on this distinction. You can get the same thing done with both and have nearly identical binaries.

Heater. · 2016-02-13 20:26

Yep.

I'm not sure I'd want to get into parallelizing anything complex in assembler though.

I was amazed to find that the C version of my FFT ran nearly as fast as the PASM version. Thanks to the fcache feature of prop-gcc that stashes the inner loop entirely into COG and runs it there. (I would also say thanks to my crappily optimized PASM, but actually a number of people here helped out in optimizing it, so it's probably not bad.)

So then, parallelizing that C code with OpenMP was pretty straightforward. Having it adjust for two or four cogs is easy. I would not want to mess with all that in PASM.

Rsadeika · 2016-02-13 20:58

But if you mean tackling a single problem by distributing its algorithm over many COGs then things get a bit complicated.

That is what I was thinking about, many COGs to solve a problem. I have always thought along the lines of, you have a problem to work out, start up another COG, and get it done.

So now I am back to wondering what kind of practical problem you could have the Propeller COGs, working in parallel, solve for you, and I mean practical.

Ray

ErNa · 2016-02-13 21:00

There is a classification: SISD (single instruction, single data), MISD (multiple instruction, single data), SIMD (single instruction, multiple data), and MIMD (multiple instruction, multiple data). This classification groups different kinds of computational problems. The first is obviously not a parallel processing issue, as a single instruction applied to a single data item cannot be parallelized. Multiple instructions are normally applied to a single data item serially. A single instruction applied to multiple data can be parallelized, like calculating a scalar product; this is where multiply-add is used. The Propeller is optimized for multiple instruction/multiple data problems: this is the case when complex tasks can run in parallel, like running a display, a keyboard, sound, etc.

Phil Pilgrim (PhiPi) · 2016-02-13 21:39

For me the holy grail of parallelism would be a garbage collector running in a separate cog, such that the foreground tasks can run deterministically using dynamic data, without having to stop for garbage collection passes.

-Phil

ErNa · 2016-02-13 23:06

Yes, a garbage collector that checks my code for nonsense and bugs and eliminates what is not needed ;-)

msrobots · 2016-02-13 23:09

ErNa wrote: »

Yes, a garbage collector that checks my code for nonsense and bugs and eliminates what is not needed ;-)

Be careful what you wish for here, it might remove ALL of your source...

@Rsadeika,

I am still not sure yet where you want to go here with your proposal.

A lot of examples for the Propeller use COGs in such a way that the calling COG waits until the called COG has finished its job. That is something I would not call parallel execution. But there are other examples that really run in parallel, like FSRW's read-ahead/write-behind. There the calling COG only waits for the data transfer between the COGs, and the read/write operation runs in parallel with the execution of the calling COG.

The same is mostly true for serial drivers, video drivers, PWM, servo drivers, sound, LEDs, and the like, where the calling COG just sets some parameter and goes on doing something else while the called COG does its job.
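That "set a parameter and carry on" pattern can be sketched as a small mailbox in shared memory. Everything here is hypothetical (`mailbox_t`, `mailbox_post`, `driver_poll`, `mailbox_done`), the "work" is a placeholder, and the sketch assumes single 32-bit writes to shared memory are atomic. On a real Propeller `driver_poll` would be the body of an endless loop running in the driver's own COG.

```c
#include <stdint.h>

/* Hypothetical mailbox living in shared (hub) memory, one per driver. */
typedef struct {
    volatile int32_t command;  /* 0 = idle, nonzero = request from the caller */
    volatile int32_t result;   /* written by the driver before it clears command */
} mailbox_t;

/* Caller side: post a request and return immediately.
   Returns 1 if the request was accepted, 0 if the driver is still busy. */
int mailbox_post(mailbox_t *box, int32_t command)
{
    if (command == 0 || box->command != 0)
        return 0;
    box->command = command;
    return 1;
}

/* Driver side: one pass of the loop the driver COG would repeat forever.
   Returns 1 if a request was served, 0 if there was nothing to do. */
int driver_poll(mailbox_t *box)
{
    int32_t command = box->command;
    if (command == 0)
        return 0;
    box->result = command * 2;  /* placeholder for the real work */
    box->command = 0;           /* clearing the slot signals completion */
    return 1;
}

/* Caller side: check later whether the last request has been completed. */
int mailbox_done(const mailbox_t *box)
{
    return box->command == 0;
}
```

The caller never blocks: it posts, does other work, and checks `mailbox_done` only when it actually needs the result.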

Then there is the case of shared parallel operation, as in some video drivers, where multiple COGs each render some lines and one COG displays them.

This parallelism is the main point of the propeller. Say your video driver displays some Hub-buffer as text on the screen. Just running in a loop and doing that. Completely undisturbed by any other COG running. Same for a serial driver with HUB buffer, just tagging along, taking care of the serial bit stream. And your main program can now read one buffer, write into the other, process stuff and still serial and video are not missing any step.

As for what language to choose? It does not matter at all. All the available ones are able to run code in different COGs at the same time. You just need a different programming style to take advantage of having 8 COGs instead of one core and interrupts.

Enjoy!

Mike
