In lieu of a presentation in the "Early Adopters" series ...

RossH · 2020-07-07 11:51

Hi all

I feel a bit annoyed that my crappy low-bandwidth, high-latency satellite internet will probably preclude me from ever hosting a presentation in the Propeller 2 "Early Adopters" series, so I thought I would instead package up something I have been working on recently with Catalina - a tutorial on how to take advantage of the parallel processing capabilities of the Propeller in C by turning a classic sequential algorithm into a parallel algorithm. The point is to see how much benefit we can get just by throwing cogs at a problem - using Catalina, of course!
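For anyone who wants a feel for the idea before opening the document, here is a minimal sketch of a sequential sieve next to a parallel one. It is not the code from the tutorial and it does not use Catalina's thread or cog functions: it assumes a desktop C compiler with POSIX threads as a stand-in for cogs. The names `count_primes_sequential`, `count_primes_parallel`, `job_t` and `NUM_WORKERS` are hypothetical, chosen only for this illustration.

```c
#include <pthread.h>
#include <stdlib.h>

#define NUM_WORKERS 4  /* hypothetical: stands in for the number of spare cogs */

/* Classic sequential sieve: returns the number of primes <= n, or -1 if out of memory. */
long count_primes_sequential(long n)
{
    if (n < 2) return 0;
    unsigned char *composite = calloc((size_t)n + 1, 1);
    if (!composite) return -1;
    long count = 0;
    for (long i = 2; i <= n; i++) {
        if (composite[i]) continue;
        count++;
        if (i <= n / i)  /* guards i * i against overflow */
            for (long j = i * i; j <= n; j += i) composite[j] = 1;
    }
    free(composite);
    return count;
}

/* One worker's share of the job: mark the composites in [lo, hi]. */
typedef struct {
    unsigned char *composite;  /* shared flags; each worker writes only its own range */
    const long *base;          /* primes <= sqrt(n), read-only while workers run */
    long nbase;
    long lo, hi;
} job_t;

static void *worker(void *arg)
{
    job_t *job = arg;
    for (long k = 0; k < job->nbase; k++) {
        long p = job->base[k];
        long start = ((job->lo + p - 1) / p) * p;  /* first multiple of p >= lo */
        if (start < p * p) start = p * p;          /* smaller multiples have a smaller prime factor */
        for (long j = start; j <= job->hi; j += p) job->composite[j] = 1;
    }
    return NULL;
}

/* Parallel version: same result as above, or -1 if out of memory. */
long count_primes_parallel(long n)
{
    if (n < 2) return 0;
    unsigned char *composite = calloc((size_t)n + 1, 1);
    if (!composite) return -1;

    /* Step 1 (sequential): find the base primes up to r = floor(sqrt(n)). */
    long r = 0;
    while ((r + 1) * (r + 1) <= n) r++;
    long *base = malloc((size_t)(r + 1) * sizeof *base);
    if (!base) { free(composite); return -1; }
    long nbase = 0;
    for (long i = 2; i <= r; i++) {
        if (composite[i]) continue;
        base[nbase++] = i;
        for (long j = i * i; j <= r; j += i) composite[j] = 1;
    }

    /* Step 2 (parallel): split (r, n] into at most NUM_WORKERS disjoint chunks. */
    pthread_t tid[NUM_WORKERS];
    job_t job[NUM_WORKERS];
    int njobs = 0, nthreads = 0;
    long chunk = (n - r + NUM_WORKERS - 1) / NUM_WORKERS;
    for (long lo = r + 1; lo <= n; lo += chunk) {
        long hi = (lo + chunk - 1 > n) ? n : lo + chunk - 1;
        job[njobs] = (job_t){ composite, base, nbase, lo, hi };
        if (pthread_create(&tid[nthreads], NULL, worker, &job[njobs]) == 0)
            nthreads++;
        else
            worker(&job[njobs]);  /* no thread available: do this chunk ourselves */
        njobs++;
    }
    for (int t = 0; t < nthreads; t++) pthread_join(tid[t], NULL);

    /* Step 3 (sequential): count what is left unmarked. */
    long count = 0;
    for (long i = 2; i <= n; i++)
        if (!composite[i]) count++;
    free(base);
    free(composite);
    return count;
}
```

Calling `count_primes_parallel(n)` should give the same answer as `count_primes_sequential(n)` for any `n`. The design choice that keeps it simple is that each worker owns a disjoint slice of the array, so no locking is needed.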

I have written a document that goes through the process step-by-step. Some of the introductory information may seem a bit lame to Early Adopters, but it is intended to be accessible to newbies. It is also intended to be usable by either Propeller 1 or Propeller 2 users.

The document mentions Catalina 4.3 (which is now available on SourceForge) but it is also usable with Catalina 4.2.

Attached is the document, and also the zip file containing all the files you need to work through the tutorial yourself (the document is also in the zip file).

Comments welcome!

Ross.

EDIT: While the process described in these documents is still worth studying for application to other programs, "parallelizing" a serial program in this way has now been automated. It can be accomplished by adding just a few "pragmas" to the original program.

For full details of the new process, including the amended Sieve program, see here.

Ramon · 2020-07-07 13:28

Thank you for that great tutorial/example.

I always wondered if 'standard' C is enough to exploit the parallelism of the P1/P2. XMOS had to create their own 'keywords' for doing that on their chips.

I think it could be a great learning experience if someone continued that example with an 'Appendix A' showing a full assembly listing, or an assembly listing of the critical section, and tricks on how to improve speed.

And as I think there are also some debuggers out there, it would be great if someone could continue that example with an 'Appendix B' showing how it could be debugged, and whether there is any 'profiling' tool to check which functions are called most and how much time each cog spends on those calls.

ctwardell · 2020-07-07 14:19

Thanks Ross!

RossH · 2020-07-07 22:44

I just noticed a couple of typos in the original document. Also, I forgot to add that if you are using a P2 you must use a baud rate of 230400 to run the programs.

I will update the zip file and document.

Ross.

RossH · 2020-07-07 23:01

Oh, and the reason I will be releasing Catalina 4.3 is that these programs don't work if they are compiled using CMM - Catalina 4.2 has a bug in the compact version of "malloc" that crashes a program if you try to allocate large arrays. The programs work ok in LMM or NATIVE mode.

Since fixing this requires all the libraries to be recompiled, I will instead release 4.3 early.

Ross.

RossH · 2020-07-08 00:27

Ramon wrote: »

Thank you for that great tutorial/example.

I always wondered if 'standard' C is enough to exploit the parallelism of the P1/P2. XMOS had to create their own 'keywords' for doing that on their chips.

I think it could be a great learning experience if someone continued that example with an 'Appendix A' showing a full assembly listing, or an assembly listing of the critical section, and tricks on how to improve speed.

And as I think there are also some debuggers out there, it would be great if someone could continue that example with an 'Appendix B' showing how it could be debugged, and whether there is any 'profiling' tool to check which functions are called most and how much time each cog spends on those calls.

I thought about adding new keywords to C, or possibly just some preprocessor directives indicating which parts of an algorithm could be "parallelized" - but in the end I decided that if you are going to do that you may as well design a whole new language.

It is on my "to do" list to make Catalina's BlackBox source-level debugger work with threads - but before I do, I need to spend some time investigating the new debugging capabilities of the P2.

RossH · 2020-08-29 03:20

RossH wrote: »

I thought about adding new keywords to C, or possibly just some preprocessor directives indicating which parts of an algorithm could be "parallelized" - but in the end I decided that if you are going to do that you may as well design a whole new language.

Ha! Famous last words!

I have decided (after exploring several alternatives) that the appropriate method is indeed to just add a few new preprocessor directives to C.

It turns out that this is fairly easy to do, and makes the whole process described in this thread completely unnecessary - it can now be accomplished using just 3 or 4 lines added to the original program source code.
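To give a feel for what "a few lines added to the original program" can look like, here is a rough analogy. It deliberately does not show the Catalina directives (those are described in the linked post). Instead it assumes OpenMP, a different and widely available pragma-based system on desktop compilers, and the function name `count_primes_pragma` is hypothetical. A compiler without OpenMP enabled ignores the pragmas and runs the same code sequentially.

```c
#include <stdlib.h>

/* Sieve with two added pragma lines; returns the number of primes <= n,
   or -1 if out of memory. OpenMP is used only as an analogy here. */
long count_primes_pragma(long n)
{
    if (n < 2) return 0;
    unsigned char *composite = calloc((size_t)n + 1, 1);
    if (!composite) return -1;

    for (long p = 2; p <= n / p; p++) {
        if (composite[p]) continue;
        /* Added line 1: share the marking of p's multiples among threads.
           Every iteration writes a different element, so there is no race. */
        #pragma omp parallel for
        for (long j = p * p; j <= n; j += p)
            composite[j] = 1;
    }

    long count = 0;
    /* Added line 2: count in parallel and combine the partial sums. */
    #pragma omp parallel for reduction(+:count)
    for (long i = 2; i <= n; i++)
        if (!composite[i]) count++;

    free(composite);
    return count;
}
```

With the two pragma lines deleted this is just the ordinary sequential sieve, which is the appeal of the approach: the original program stays readable and still compiles on a compiler that knows nothing about the directives.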

For full details of the new process, including the new Sieve program, see here.

Some might be tempted to call this entire exercise a complete waste of time, but it wasn't - until I had "parallelized" a few different programs, I didn't realize how the new factory/worker paradigm (as described in this thread) would be such a game-changer!

Ross.

Cluso99 · 2020-08-29 04:13

Ain't hindsight wonderful

RossH · 2020-08-29 05:18

Cluso99 wrote: »

Ain't hindsight wonderful

Indeed. If only we had the foresight to see how things would look in hindsight, we could probably all save ourselves an awful lot of effort!