Parallel Processing with C
RossH
Posts: 5,503
All,
Here is something I have been working on, and which I now have in a fit state for initial release - i.e. a way to make parallel processing on the Propeller almost completely trivial, using C.
Obviously, this method currently only works with Catalina (use Catalina 4.4) but if any of the other C compiler builders here are interested, we could have a common method (and tool) for implementing multiprocessing across all C compilers. And, potentially, other languages as well.
This method works equally well with Catalina on the Propeller 1 and the Propeller 2.
See the document in the attached zip file for full details and a tutorial.
Feedback welcome!
Ross.
EDIT: Attached file updated to version 1.6 - new functionality has been added, which is described below.
Here is something I have been working on, and which I now have in a fit state for initial release - i.e. a way to make parallel processing on the Propeller almost completely trivial, using C.
Obviously, this method currently only works with Catalina (use Catalina 4.4) but if any of the other C compiler builders here are interested, we could have a common method (and tool) for implementing multiprocessing across all C compilers. And, potentially, other languages as well.
This method works equally well with Catalina on the Propeller 1 and the Propeller 2.
See the document in the attached zip file for full details and a tutorial.
Feedback welcome!
Ross.
EDIT: Attached file updated to version 1.6 - new functionality has been added, which is described below.
- Requires Catalina 4.5, which has just been released to fix some issues found when using locks in a multi-threaded environment. - Source code is now included. The parallelizer is an AWKA script, and must be compiled with awka 0.7.5. It could be modified for other awk scripting implementations if required. The script is MIT licensed. - More error checking and more warning messages. - The factory related pragmas are now all optional. Factories will be defined and started automatically if there are no explicit factory or start pragmas specified. Now, only one worker pragma, and one begin and end pragma, may be required to parallelize a serial program. - New conditional options for worker and wait pragmas. In worker pragmas, conditionals can be used to control the output of variables, which will only take place if the condition is true when the worker completes. In wait pragmas they allow the wait to be terminated early if a specified condition is true. In both cases the condition can be any arbitrary boolean expression, but is normally something that is signaled by a worker thread to indicate that the main thread should continue, even if there are still worker threads yet to complete. - New foreman option on the factory pragma, which allows you to use your own foreman function. This may be useful if your program is already multi-threaded and you already have a thread function that can be used as a foreman function. - New memory management option, which can be specified on any pragma. Memory management for factories and workers can now be specified as either static or dynamic. No matter where it is specified, the option currently applies to ALL workers and factories. Static is faster, and has smaller code sizes, but all memory is pre-allocated at compile time, which may be wasteful (e.g. if you start and stop different factories at different times, all memory for all of them will still be pre-allocated when static memory management is used. In such cases, dynamic memory management may be preferable - but it may be slower and also require more code space, because the malloc() and free() library functions must be included). - New and modified test programs to demonstrate the new options.
Comments
I just fixed a stupid typo in the document and updated the zip file - on page 14 the original said:
This should have read (and now does):
It doesn't matter how many times you proofread ..
I discovered a potential problem when "parallelizing" a program consisting of multiple C source files, which I have fixed by defining some new pragmas. The zip file at the top of this thread has been updated to version 1.1.
One benefit of this is that there should now be no cases where you have to modify the original C source code just to parallelize an existing C program. Everything can now be done just by adding a few pragmas.
Ross.
No, I mean truly parallel. Here is a trivial example. Consider the following program:
Here is what you would normally expect to see when you run it:
We can "parallelize" that so that all the iterations of the "for" loop execute in parallel on multiple cogs by adding a few pragmas, as follows:
Now, here is what you see when you run it:
When you run the the "parallelized" version, the 10 "printf" statements are actually executed on 5 different cogs - 2 run on each cog.
This is actually the first demo in the zip file. Note that by using pragmas, the second version of program can still be compiled and run using any C compiler - it just won't run in parallel, so you will see the first output, not the second.
In general, that's true - but certain types of inter-dependency are ok - you can add "exclusive" and "shared" pragmas to the parallel code segments, which identify critical code segments.
Yes. Or Posix threads.
Every instance of parallel execution is assigned to a thread. You can specify how many threads can execute at a time (available memory is the only limiting factor, not the number of available cogs - if you try to start more than the maximum number of threads, the program simply stalls until a free thread becomes available).
And you can specify how many cogs will be used to execute the threads. The threads will be distributed amongst all those cogs. If you have multiple parallel segments defined, then the same set of cogs can be used to execute all the threads, or you can dedicate specific cogs to run specific types of thread.
This is as "parallel" as you can get.
I just updated the zip file at the top of this thread to version 1.2.
A few minor changes (mainly the addition of a ticks option to the worker pragma, which allows you to adjust the time slice allocated to a worker type). Plus I have now added Linux support.
This completes the parallelizer functionality, since you can now do pretty much everything that you used to do using the thread library, just by using a few pragmas instead. Of course, you can still use the thread library directly, even if you also plan to use the parallelizer.
The Linux executable was compiled on Ubuntu 18.04, 64 bit. If the executable doesn't work on your flavor of Linux, then you will have to wait for the next Catalina release, which will include the parallelizer source code.
Ross.
Is it correct that the parallelizer can in principle work with any C compiler?
I will install Catalina soon.
First I take a look under the hook.
In theory, yes. As you can see, the output of the parallelizer is just plain old C. Of course, the current parallelizer only knows how to generate code that uses the Catalina threads library - but adapting it to generate the appropriate function calls for a different thread library should be a fairly straightforward process. There are several Posix thread implementations that could probably be used. We don't even need the whole library to be implemented (it's quite large!) - just the fundamental ability to start a thread.
Ross.
The file at the top of this thread has been updated to version 1.4.
No functionality changes, just a few improvements to the parallelizer's "robustness", and also to stop the compiler generating warning messages when compiling some programs - e.g. test_8.c.
Also, that test program has been simplified slightly to make it more obvious what's actually going on.
Ross.
test_1.bin also delivers the desired result. As soon as time allows, I want to get the philosopher problem, a prime example of parallel processing, working on the P2.
On Linux it is compiled like this: gcc phil2.c -o phil2 -lpthread
I believe libtool is required to build the srecord utility, which is the last thing to be built. Probably not required if you don't use Catalina's -F command-line option.
EDIT: I think it may also be required to build Geany.
Catalina comes with a couple of versions of the Dining Philosopher problem - e.g. see demos/multithread/dining_philosophers.c
However, I have not yet rewritten it to use the new pragmas. Let me know how you go with your version!
Now I have installed all the needed stuff and start to learn Catalina.
Thank you !
I recently had to reinstall Linux, so I have an updated list of what is required to build Catalina and Geany:
I will include this updated information in the next release.
But this is not so important at the moment (for me). IDE is nice, but not elementary.
I will get along with the language because I have been programming skills in C for several years.
I've already read some of your documents and that gives me a deeper look into the chip than the origin documents about SPIN and pasm.
At the moment I still have the problem that I can only execute payload as sudo. But even I will find a solution.
Try adding yourself to the "dialout" group, as described in README.Linux:
Since you are interested in exploring the Dining Philosophers, I decided to create a version where everything is configurable on the command-line (but note that it still doesn't use the new pragmas!):
You can now specify the number of philosophers, courses and a delay that improves or reduces the likelihood of deadlock.
To compile this version, use a command like:
To raise the chances of deadlock, raise the interfork wait, or reduce the number of philosophers. To reduce the chance, reduce the interfork wait, or increase the number of philosophers.
Also, I noticed that the original version had a bug on line 144, which was:
It should have been:
I've fixed this in the above version.
Ross.
Thank you.
I add me to the dialgroup, as suggested in README Linux.
But:
this works
Well, it's definitely a permission problem. Try executing the "groups" command, and see what groups you belong to. The group you need to belong to is "dialout". Also, you need to log out and log back in again for any group changes to take effect.
As a last resort, you can change the permissions on the /dev/ttyUSB0 device to allow all users to read and write to it, or change the owner of the device to be you - but these are drastic steps, and shouldn't really be necessary.
Ross.
Reinhard
I programmed a Sudoku solver with Catalina and then loaded it onto the P2 EvalBoard with p2load (please don't ask me why, at the moment I have reasons for it).
The result is correct, the speed - well, moderate.
My question is, is this suitable for parallelization?
The solver function is called recursively, is that suitable for #pragmas?
the goodie is, this piece of code can compiled with Catalina and gcc w/o changes.
Nice algorithm! However, on a quick examination, I think not parallelizable - or at least not much.
The problem is that this program modifies entries in the "feld" array in a manner which means the algorithm won't work unless the recursive calls (in this case, to "solve") are done in a specific order. If you parallelize it, the changes being made to "feld" by one parallel task will interfere with the operation of all the other tasks.
Parallelizing recursive algorithms is certainly possible - but mainly in cases where the recursion is used to subdivide the task into smaller independent tasks, not if the execution of these tasks has to be done in a particular order. Also, the number of threads you may need would tend to make it impractical - you need a thread for each potential recursive call, and each thread needs a stack, so you can very quickly run out of memory.
However, your algorithm has given me an idea - if I can get it working without complicating the current pragmas too much, I could get you some improvement in this algorithm - perhaps as much as 50% improvement.
I'll let you now how I go.
Ross.
I know, OpenMP looks complicated. But honestly your #pragma propeller also looks complicated when one first looks at it (looking at the whole documentation). The use of "#pragma propeller" for simple cases is not hard, of course. But neither is "#pragma omp" hard for simple cases.
It's easier for you to *implement* your own pragmas, but it's much easier for users to *use* standard pragmas (even if they're only a subset) because, well, they're standard. OpenMP is implemented in a lot of compiilers (Visual C, gcc, clang, Intel C, etc.) and examples and documentation are widely available on the web.
The parallelize tool already seems to do much of what an OpenMP implementation would have to do. It looks like you would have to add the ability to automatically generate #pragma propeller worker statements from #pragma omp parallel statements, but the syntax of the two is pretty similar (you would have to make parallelize understand where C blocks of code start and end, but it probably has to parse the code already to find the #pragmas, so I don't think that would be an insurmountable barrier).
With that change your parallelize tool would be interesting not just in the propeller world but for arduino and other embedded systems too! I urge you to consider this.
Is the source code for parallelize available anywhere?
I just read the OpenMP spec in more detail, and yes, there is more similarity than I at first realized. I suppose that was inevitable, seeing as we are both using similar techniques to solve a similar problem. However, while I can see that it might be possible to implement the propeller pragmas using OpenMP pragmas, the reverse is probably not true - OpenMP is extremely complex, and requires implementation within the compiler, whereas the propeller pragmas are relatively simple and do not.
Yet the propeller pragmas seem to be quite adequate.
I think the reason for this is that OpenMP is designed to make it possible add parallelism to a multitude of different architectures, whereas the propeller pragmas are only trying to support just one architecture - one which (thanks to Parallax!) already has parallelism built in, with all the necessary supporting infrastructure (basic parallelism, locks, shared memory, etc) available at a much higher conceptual level than is true of most architectures.
All we need to do is expose that capability, not invent it, and not simulate it where it doesn't naturally exist.
I am also very strongly against implementing "subsets" of standards - if you implement something, you should implement it fully, or you are not being honest with your users - and I doubt we would ever be able to claim full OpenMP conformance - certainly not for the Propeller 1 - there is simply not enough room to support all those features.
So I will continue to work with the propeller pragmas - they are quite trivial to implement, as you can see from the code the parallelizer generates - and see how much additional complexity is required to make them more generally useful. My current belief is that not much more is needed, but I may be underestimating.
I will release the source code of the parallelizer once I think it is stable functionality-wise (it is just a single massive "awk" script - but my awk skills are rudimentary at best, so it is not the prettiest program you will ever see!).
Have you plans to add threading to your compiler (or, come to think of it, do you already have it - I haven't checked!). If so, let me know which library you use (I assume it would be posix threads) and I will see if I can add support for it to the parallelizer so you can try it out.
Ross.