Oh, and I probably should let people know (who might be trying out the parallelizer) that I have finally found a Catalina bug that has plagued me for a while - one which could make multi-threaded NATIVE mode programs on the Propeller 2 lock up occasionally.
So I will be releasing Catalina 4.5 shortly. I had hoped to make the next release the one where I incorporated the parallelizer, but that may have to wait a bit longer.
My question is, is this suitable for parallelization?
I'll let you know how I go.
Hello @Reinhard
Well, I have done some work on parallelizing this algorithm, but not very successfully - apart from the main recursion, the opportunities for parallelizing it are quite limited due to the small size of the matrix, and when I do so it generally turns out to make it slower, not faster.
You have given me some ideas which I will continue to work on, but I don't hold out much hope for this particular algorithm!
I just read the OpenMP spec in more detail, and yes, there is more similarity than I at first realized. I suppose that was inevitable, seeing as we are both using similar techniques to solve a similar problem. However, while I can see that it might be possible to implement the propeller pragmas using OpenMP pragmas, the reverse is probably not true - OpenMP is extremely complex, and requires implementation within the compiler, whereas the propeller pragmas are relatively simple and do not.
I don't think that OpenMP *has* to be done in the compiler, you could use a preprocessor similar to your parallelize one. In fact I think that's how the original OpenMP implementations happened. The preprocessor would have to understand enough C to be able to recognize a { } block to pull out the worker functions, but that's not too hard.
I think the reason for this is that OpenMP is designed to make it possible to add parallelism to a multitude of different architectures, whereas the propeller pragmas are trying to support just one architecture - one which (thanks to Parallax!) already has parallelism built in, with all the necessary supporting infrastructure (basic parallelism, locks, shared memory, etc.) available at a much higher conceptual level than is true of most architectures.
I have to strongly disagree here. We often like to think of the Propeller as being somehow uniquely parallel, but that's only true in a very limited domain (that of inexpensive microcontrollers). In the "real world", and especially in the high performance computing world that OpenMP came out of, parallelism has been a normal part of computer architectures for many, many years. A typical desktop PC has vastly more parallel computing capabilities than a propeller: the CPU has multiple cores, each of which can run multiple threads, and the GPU has hundreds or even thousands of hardware threads. Both have shared memory and much more sophisticated hardware locks than the simple ones in the propeller.
I am also very strongly against implementing "subsets" of standards - if you implement something, you should implement it fully, or you are not being honest with your users
We have a philosophical difference here: as long as you fully document the limitations I think it's much more useful for the end user to have something familiar (if incomplete) than to have to learn some completely new API. But if it makes you uncomfortable then by all means use a different #pragma (e.g. "#pragma pmp" instead of "#pragma omp"). The main thing that I think would help users is to implement the same API, just as a subset of printf (like t_printf) uses the familiar %d, %u escapes, even if not all of them are implemented.
and I doubt we would ever be able to claim full OpenMP conformance - certainly not for the Propeller 1 - there is simply not enough room to support all those features.
To the contrary, most of the complexity is in the PC side tools, not on the chip. PropGCC for the Propeller 1 fully implements OpenMP 4.5, as far as I can see.
So I will continue to work with the propeller pragmas - they are quite trivial to implement
As compiler writers, it's very very tempting for us to focus on what's easy for us to implement as opposed to what's easy for end users to learn and to use. (I'm guilty of this myself all too often!)
I really like what you've done with the parallelizer, and I think it could be a very broadly useful tool even outside of the Parallax community. But it would be much more useful if it implemented a known API (like OpenMP or Cilk) and targeted a fairly general thread library (like pthreads or the new C standard threads).
Have you plans to add threading to your compiler (or, come to think of it, do you already have it - I haven't checked!). If so, let me know which library you use (I assume it would be posix threads) and I will see if I can add support for it to the parallelizer so you can try it out.
Basic "launching a C function in another COG via cogstart" functionality is there already. I haven't implemented a higher level API yet. I'm leaning towards doing Posix threads, but the C11 <threads.h> standard (which is very similar) is also a reasonable starting point. Have you looked at these? Any preferences for which standards to adopt? I think we should have some standard thread API available for the P2, even if there are also compiler specific non-standard libraries.
I don't think that OpenMP *has* to be done in the compiler
Fair enough. OpenMP doesn't *have* to be done in the compiler ... provided that compiler implements a suitably comprehensive threads library.
We often like to think of the Propeller as being somehow uniquely parallel, but that's only true in a very limited domain (that of inexpensive microcontrollers).
The Propeller is unique, in the parallel computing "bang for buck" stakes, if nothing else. This presents developers like us with both a challenge and an opportunity.
We have a philosophical difference here: as long as you fully document the limitations I think it's much more useful for the end user to have something familiar (if incomplete) than to have to learn some completely new API.
Yes, this is a philosophical difference. Implementing half a standard is simply not an option, as far as I am concerned.
To the contrary, most of the complexity is in the PC side tools, not on the chip. PropGCC for the Propeller 1 fully implements OpenMP 4.5, as far as I can see.
Really? I should investigate this. Is there a good example program you can point me to?
I really like what you've done with the parallelizer, and I think it could be a very broadly useful tool even outside of the Parallax community. But it would be much more useful if it implemented a known API (like OpenMP or Cilk) and targeted a fairly general thread library (like pthreads or the new C standard threads).
To be honest, I have little interest in what happens "outside of the Parallax community". Let them look to us for a change! However, I do agree that something like pthreads would provide a useful "common ground" to entice newcomers to explore the world of real parallel processing. Are you aware of any implementations of pthreads for the Propeller? I have looked at doing this myself several times, but been put off by the complexity of implementing it at kernel level, and the memory requirements of implementing it at user level. But kudos to anyone who has done either one!
Basic "launching a C function in another COG via cogstart" functionality is there already.
That's cheating! This functionality is essentially built into the chip. To implement either OpenMP or the propeller pragmas you need to implement a threads library.
I agree with Eric on using the OpenMP pragmas. This helps to make it easier to run programs on different platforms. When I develop code for the Propeller I find it useful to test the code on my PC first. Even half a standard is better than no standard, as long as it's documented. In a lot of standards half the features are almost never used. So if you implement the correct half of the standard you might be able to handle most uses of OpenMP.
I don't think that OpenMP *has* to be done in the compiler
Fair enough. OpenMP doesn't *have* to be done in the compiler ... provided that compiler implements a suitably comprehensive threads library.
Which doesn't really have to be done "in the compiler"...
We often like to think of the Propeller as being somehow uniquely parallel, but that's only true in a very limited domain (that of inexpensive microcontrollers).
The Propeller is unique, in the parallel computing "bang for buck" stakes, if nothing else. This presents developers like us with both a challenge and an opportunity.
Alas, even in "bang for the buck" I think there are GPUs (on the high end) and multi-core ARM and RISC-V chips (on the low end) which give the Propeller a run for its money. The propeller architecture is particularly nice and clean, but it's no longer unique.
We have a philosophical difference here: as long as you fully document the limitations I think it's much more useful for the end user to have something familiar (if incomplete) than to have to learn some completely new API.
Yes, this is a philosophical difference. Implementing half a standard is simply not an option, as far as I am concerned.
Well, you've provided "tiny" libraries for Catalina which (quite properly) reduce code bloat by providing the most useful subset of the standard C libraries. Similarly, virtually no floating point libraries for the Propeller actually implement the whole of the IEEE floating point standard; rather they just do what programmers mostly expect. These kinds of things make porting code from other platforms much easier, and hence help speed adoption of the Propeller platform. And I think they totally make sense: if you're going to implement floating point in this day and age, you might as well start with the IEEE standard rather than creating a new standard from scratch. This is true even if for space or performance reasons you can't actually support all of the rounding modes and precision guarantees in that standard.
To the contrary, most of the complexity is in the PC side tools, not on the chip. PropGCC for the Propeller 1 fully implements OpenMP 4.5, as far as I can see.
Really? I should investigate this. Is there a good example program you can point me to?
The canonical example is fftbench, which you've been working with. The OpenMP support in PropGCC is why Heater added the OpenMP support to that benchmark.
Are you aware of any implementations of pthreads for the Propeller? I have looked at doing this myself several times, but been put off by the complexity of implementing it at kernel level, and the memory requirements of implementing it at user level.
I implemented it for PropGCC a long time ago. I've attached a zip file with the source code; it probably has some GCC dependencies, alas. Also in the zip file is a very basic COG thread library (sys_threads.c) that pthreads is built atop, and the "tinyomp.c" file which PropGCC uses by default in place of the GNU libgomp for doing OpenMP threads. libgomp does work on the Propeller 1 (it's built on pthreads) but full libgomp+pthreads consumes a lot of RAM, and most of the pthreads features aren't needed for OpenMP.
I don't think that OpenMP *has* to be done in the compiler
Fair enough. OpenMP doesn't *have* to be done in the compiler ... provided that compiler implements a suitably comprehensive threads library.
Which doesn't really have to be done "in the compiler"...
We often like to think of the Propeller as being somehow uniquely parallel, but that's only true in a very limited domain (that of inexpensive microcontrollers).
The Propeller is unique, in the parallel computing "bang for buck" stakes, if nothing else. This presents developers like us with both a challenge and an opportunity.
Alas, even in "bang for the buck" I think there are GPUs (on the high end) and multi-core ARM and RISC-V chips (on the low end) which give the Propeller a run for its money. The propeller architecture is particularly nice and clean, but it's no longer unique.
We have a philosophical difference here: as long as you fully document the limitations I think it's much more useful for the end user to have something familiar (if incomplete) than to have to learn some completely new API.
Yes, this is a philosophical difference. Implementing half a standard is simply not an option, as far as I am concerned.
Well, you've provided "tiny" libraries for Catalina which (quite properly) reduce code bloat by providing the most useful subset of the standard C libraries. Similarly, virtually no floating point libraries for the Propeller actually implement the whole of the IEEE floating point standard; rather they just do what programmers mostly expect. These kinds of things make porting code from other platforms much easier, and hence help speed adoption of the Propeller platform. And I think they totally make sense: if you're going to implement floating point in this day and age, you might as well start with the IEEE standard rather than creating a new standard from scratch. This is true even if for space or performance reasons you can't actually support all of the rounding modes and precision guarantees in that standard.
To the contrary, most of the complexity is in the PC side tools, not on the chip. PropGCC for the Propeller 1 fully implements OpenMP 4.5, as far as I can see.
Really? I should investigate this. Is there a good example program you can point me to?
The canonical example is fftbench, which you've been working with. The OpenMP support in PropGCC is why Heater added the OpenMP support to that benchmark.
Are you aware of any implementations of pthreads for the Propeller? I have looked at doing this myself several times, but been put off by the complexity of implementing it at kernel level, and the memory requirements of implementing it at user level.
I implemented it for PropGCC a long time ago. I've attached a zip file with the source code; it probably has some GCC dependencies, alas. Also in the zip file is a very basic COG thread library (sys_threads.c) that pthreads is built atop, and the "tinyomp.c" file which PropGCC uses by default in place of the GNU libgomp for doing OpenMP threads. libgomp does work on the Propeller 1 (it's built on pthreads) but full libgomp+pthreads consumes a lot of RAM, and most of the pthreads features aren't needed for OpenMP.
Thanks, I'll have a play with PropGCC, and also look at implementing the parallelizer for your threads library (once it is a bit more stable - I am currently rewriting some of the internals to make it more robust).
I am still looking at how much more complexity I may need to add to the parallelizer. If the answer is "not much", then I think the propeller pragmas are likely to be a superior (and simpler) solution to using OpenMP**.
I have reproduced some of the things I can do with the propeller pragmas using the OpenMP pragmas, and the results are very disappointing - the code bloat is massive, and the performance improvements are modest. This is what makes me think that OpenMP is not a good solution for the Propeller. However, this was done using GCC on my desktop, not PropGCC on the Propeller itself, so I will do the comparison again using PropGCC and see what the results look like.
However, I am not optimistic. Using GCC, the code size even for trivial OpenMP examples blows out by many times: when statically linked, executables up to 6 times larger are not uncommon even for very simple examples. And the performance improvements can be nonexistent; in fact, some of my examples run slower when using OpenMP. But this is also sometimes true when using my own pragmas, so I can't really complain about that - it probably has more to do with my examples being too small and trivial for the benefit of parallelizing them to outweigh the cost of doing so. I can't see PropGCC being much different to plain old GCC with respect to relative code size, but I will try it and see. Perhaps Heater has already done so?
On the "philosophical issue" about compliance with standards, I believe Catalina fully implements the IEEE standard for floating point (except for some obscure underflow issues that I don't have the expertise to solve), and also all the other library functions required by the relevant C standards. Offering "additional" libraries or alternative libraries that use less RAM is not a deviation from the standard.
Ross.
** However, I do like the OpenMP approach of assuming the parallel code segment ends at the end of the next block - this would mean I could make the "propeller end" pragma optional, and you would only need to use it if the end of the parallel segment is not the end of the next block. Thanks for pointing this out.
I have reproduced some of the things I can do with the propeller pragmas using the OpenMP pragmas, and the results are very disappointing - the code bloat is massive, and the performance improvements are modest.
Don't confuse an implementation of the standard with the standard itself! If you implement the OpenMP pragmas using the same code as your #propeller pragmas, then the performance should be the same...
I can't see PropGCC being much different to plain old GCC with respect to relative code size, but I will try it and see. Perhaps Heater has already done so?
Here's what I get for fftbench with PropGCC 6.0 for P1
            size    time
original:   19824   141105 us
-fopenmp:   22908    82318 us
The code speedup isn't great, but I didn't spend any time trying to tune the number of COGs, so I don't know how many it's actually using.
On the "philosophical issue" about compliance with standards, I believe Catalina fully implements the IEEE standard for floating point (except for some obscure underflow issues that I don't have the expertise to solve), and also all the other library functions required by the relevant C standards.
I don't think Catalina fully implements the IEEE standard. I'm not criticizing: I doubt that any floating point library for the P1, or any other micro, fully implements the standard. For example, the directed rounding modes (round to 0, round to +- infinity) which are required by the standard, are rarely implemented on micros; almost everyone just uses "round to nearest". Similarly, the standard requires that exceptions like UNDERFLOW and OVERFLOW be signalled in some way, but P1 libraries typically don't bother to do this.
As another example, Catalina includes some features from the C99 standard but doesn't fully implement C99. I think that's sensible: you've fully (or almost) implemented C89, and if you're going to extend the libraries in some way that C99 has already specified then it makes sense to use the later standard as guidance. That's not a bug, it's a feature! The alternative is to re-invent the wheel and create some completely new libraries or language features that are non-standard. That's not as good for users, IMHO.
I'd argue that the same considerations apply to threading libraries and multiprocessing pragmas. We may not have the time or energy to implement all of the C11 standard on the P1, but using C11 <threads.h> might be a good solution to provide portability for C code. Similarly, implementing all of OpenMP may not be practical, but using some parts of it may make sense.
I have reproduced some of the things I can do with the propeller pragmas using the OpenMP pragmas, and the results are very disappointing - the code bloat is massive, and the performance improvements are modest.
Don't confuse an implementation of the standard with the standard itself! If you implement the OpenMP pragmas using the same code as your #propeller pragmas, then the performance should be the same...
Similar, true - provided you restrict yourself to the intersection of the capabilities of the OpenMP pragmas and the propeller pragmas. But since my pragmas do not map directly to OpenMP pragmas, we will need to do a bit more work to tell. We currently only have a single example to look at. I will try to generate some more.
Of course, we could implement not just a subset of OpenMP pragmas, but a subset of the functionality of each OpenMP pragma ... but then all you are really doing is using similar pragma names for something different. You will probably not be surprised to hear that I don't want to go down that path.
I can't see PropGCC being much different to plain old GCC with respect to relative code size, but I will try it and see. Perhaps Heater has already done so?
Here's what I get for fftbench with PropGCC 6.0 for P1
            size    time
original:   19824   141105 us
-fopenmp:   22908    82318 us
The code speedup isn't great, but I didn't spend any time trying to tune the number of COGs, so I don't know how many it's actually using.
You may be using only 2 cogs - that is about the improvement you might expect from 2 cogs. However, the code size increment is not bad. I will do a direct comparison when I get some time. I am deep in the internals of the parallelizer at present.
On the "philosophical issue" about compliance with standards, I believe Catalina fully implements the IEEE standard for floating point (except for some obscure underflow issues that I don't have the expertise to solve), and also all the other library functions required by the relevant C standards.
I don't think Catalina fully implements the IEEE standard.
I think you are nit-picking here. Catalina implements the full ANSI C standard, using 32 bit IEEE 754 floating point. Yes, Catalina does not implement the "full" IEEE 754 standard because it does not (for example) implement 64 bit floating point. But this is not required, and I am fairly careful to spell out that Catalina only offers 32 bit floating point support.
As another example, Catalina includes some features from the C99 standard but doesn't fully implement C99. I think that's sensible: you've fully (or almost) implemented C89, and if you're going to extend the libraries in some way that C99 has already specified then it makes sense to use the later standard as guidance. That's not a bug, it's a feature! The alternative is to re-invent the wheel and create some completely new libraries or language features that are non-standard. That's not as good for users, IMHO.
But again, Catalina doesn't claim to be C99 compliant. That's the point.
I'd argue that the same considerations apply to threading libraries and multiprocessing pragmas. We may not have the time or energy to implement all of the C11 standard on the P1, but using C11 <threads.h> might be a good solution to provide portability for C code. Similarly, implementing all of OpenMP may not be practical, but using some parts of it may make sense.
I don't think we are ever going to agree here. It is largely a philosophical difference, as I have said. But I have also read more about the OpenMP standards, and I continue to think that they are not a good fit for the Propeller - they take the conceptually simple Propeller architecture (which already supports parallelism constructs at a high level) and turn it into something that is complex and difficult to use.
I concede that this is likely to be a problem with the C language rather than the OpenMP pragmas, but this is not a sufficient excuse to complicate things unnecessarily for users. Indeed, I worry that the propeller pragmas may already be too complex, and I am looking to simplify them if I can do so, not add to them.
I will look at the C11 threads - they may be a good basis for implementing the propeller pragmas on your compiler.
My question is, is this suitable for parallelization?
I'll let you know how I go.
Hello @Reinhard
Well, I have done some work on parallelizing this algorithm, but not very successfully - apart from the main recursion, the opportunities for parallelizing it are quite limited due to the small size of the matrix, and when I do so it generally turns out to make it slower, not faster.
You have given me some ideas which I will continue to work on, but I don't hold out much hope for this particular algorithm!
Ross.
I'm still interested in parallel processing in C on the P2. Even if the right killer application has not yet been found, the journey is the goal. I don't have that much time right now, but it stays in the back of my mind. Maybe something very simple, like a demo for the LED matrix (just a thought).
I don't think Catalina fully implements the IEEE standard.
I think you are nit-picking here. Catalina implements the full ANSI C standard, using 32 bit IEEE 754 floating point. Yes, Catalina does not implement the "full" IEEE 754 standard because it does not (for example) implement 64 bit floating point. But this is not required, and I am fairly careful to spell out that Catalina only offers 32 bit floating point support.
Catalina violates the IEEE 754 floating point standard because it does not implement some required features of that standard. For example, see section 4.2 (Directed Roundings) of ANSI/IEEE Std 754-1985:
"An implementation shall also provide three user-selectable directed rounding modes: round toward +INFINITY, round toward -INFINITY, and round toward 0."
Note the word "shall"; it is a requirement, not an optional feature. Similarly, I do not think Catalina implements section 7 of IEEE 754 (Exceptions), which requires that certain conditions like overflow be detected and cause either a trap to occur or a status flag to be set.
Again, I'm not trying to single out Catalina here: very few microprocessor floating point libraries conform to IEEE 754. Instead, they largely implement a useful subset of the standard.
As another example, Catalina includes some features from the C99 standard but doesn't fully implement C99. I think that's sensible: you've fully (or almost) implemented C89, and if you're going to extend the libraries in some way that C99 has already specified then it makes sense to use the later standard as guidance. That's not a bug, it's a feature! The alternative is to re-invent the wheel and create some completely new libraries or language features that are non-standard. That's not as good for users, IMHO.
But again, Catalina doesn't claim to be C99 compliant. That's the point.
And my point is that even when we don't implement a whole standard (C99 in this case) it can be useful to implement some parts of it (e.g. the stdint.h header file that Catalina includes). And in practice you seem to agree, since that's what you've done in Catalina! We're in violent agreement that we shouldn't claim conformance and should clearly document when only parts of a standard are implemented. But an incomplete implementation of a standard is often better for users than a one-off API invented from scratch.
Here's what I get for fftbench with PropGCC 6.0 for P1
            size    time
original:   19824   141105 us
-fopenmp:   22908    82318 us
The code speedup isn't great, but I didn't spend any time trying to tune the number of COGs, so I don't know how many it's actually using.
I still haven't installed PropGCC, but I have done some comparable figures for Catalina on the Propeller 1. I am not sure what your "size" number refers to, so I have included both code size only (no data) and total file size.
Catalina is certainly no speed demon, but it is the improvement that we are interested in here, because if we implement the propeller pragmas for other compilers, the absolute times will differ, but the improvement should be very similar.
Your reported improvement using OpenMP pragmas was 1.71 times, which is what makes me think you must be using only 2 slices (i.e. 2 cogs). Change the fftbench source code to use 4 slices (cogs) and re-run it.
I don't think Catalina fully implements the IEEE standard.
I think you are nit-picking here. Catalina implements the full ANSI C standard, using 32 bit IEEE 754 floating point. Yes, Catalina does not implement the "full" IEEE 754 standard because it does not (for example) implement 64 bit floating point. But this is not required, and I am fairly careful to spell out that Catalina only offers 32 bit floating point support.
Catalina violates the IEEE 754 floating point standard because it does not implement some required features of that standard. For example, see section 4.2 (Directed Roundings) of ANSI/IEEE Std 754-1985:
"An implementation shall also provide three user-selectable directed rounding modes: round toward +INFINITY, round toward -INFINITY, and round toward 0."
Note the word "shall"; it is a requirement, not an optional feature. Similarly, I do not think Catalina implements section 7 of IEEE 754 (Exceptions), which requires that certain conditions like overflow be detected and cause either a trap to occur or a status flag to be set.
Again, I'm not trying to single out Catalina here: very few microprocessor floating point libraries conform to IEEE 754. Instead, they largely implement a useful subset of the standard.
Nit-picking.
As another example, Catalina includes some features from the C99 standard but doesn't fully implement C99. I think that's sensible: you've fully (or almost) implemented C89, and if you're going to extend the libraries in some way that C99 has already specified then it makes sense to use the later standard as guidance. That's not a bug, it's a feature! The alternative is to re-invent the wheel and create some completely new libraries or language features that are non-standard. That's not as good for users, IMHO.
But again, Catalina doesn't claim to be C99 compliant. That's the point.
And my point is that even when we don't implement a whole standard (C99 in this case) it can be useful to implement some parts of it (e.g. the stdint.h header file that Catalina includes). And in practice you seem to agree, since that's what you've done in Catalina! We're in violent agreement that we shouldn't claim conformance and should clearly document when only parts of a standard are implemented. But an incomplete implementation of a standard is often better for users than a one-off API invented from scratch.
The cases are not really comparable. You are comparing a complete implementation of stdint.h to a partial implementation of OpenMP. I did not add stdint.h to claim compliance with C99, I added it because someone asked for it, and it can fairly easily be done in ANSI C.
Partial implementations are no better than custom solutions, and if we are going to have a custom solution, we may as well have one that is better suited to the Propeller.
The cases are not really comparable. You are comparing a complete implementation of stdint.h to a partial implementation of OpenMP.
Now who's nit-picking? I'm comparing a partial implementation of C99 to a partial implementation of OpenMP.
I did not add stdint.h to claim compliance with C99, I added it because someone asked for it, and it can fairly easily be done in ANSI C.
OK, perhaps we're talking past one another a bit. I do not want, in any way, to claim that we should "comply" with OpenMP, any more than we should sweat "complying" with IEEE 754.
I'm merely saying that if you want to implement a pragma to, for example, parallelize a block of code, and OpenMP already has such a pragma, then perhaps we should implement that specific pragma. Just as, for example, C99 has a useful header file (stdint.h) and we can implement that even in a compiler that does not conform to C99.
Partial implementations are no better than custom solutions.
Catalina's partial implementation of IEEE 754 is better than a custom floating point solution, because it can leverage a wide variety of existing code and users immediately know many things about Catalina's floating point (even if they have to be aware that some parts of the standard are not in fact implemented).
Similarly, an implementation of some of the OpenMP pragmas would immediately allow users to write code that can run on both PCs and on their Propeller. Such an implementation of some pragmas would not be an "OpenMP implementation", any more than Catalina is a "C99 implementation" because it provides stdint.h. But it would still be more useful than a completely custom solution, IMHO.
The main point here - IMHO - is that Catalina, FlexC, PropGCC and the upcoming Clang implementation should be as compatible to each other as possible.
Besides not liking C that much, I have used Catalina on @Cluso99's RAMblade and really like the integration of XMM. My silent hope is that @RossH will implement an XMM driver for the Hyperram on the P2 and that FlexWhatever also would be able to do so.
Sure one could claim that 512K are enough, but history has shown that this is not always true.
And I really, really would like to somehow use the C files generated by GnuCOBOL to run on a P2. Sadly my C fu is very undeveloped and I don't even know what would be needed to do so. GnuCOBOL is very Linux centered, but besides using MinGW it can use Visual Studio on Windows to compile, so there can't be that much POSIX in there.
I've just spent hours trying to get PropGCC to compile from source on either Windows or Linux. No success.
Are there any already-built executables for PropGCC for either Windows or Linux? I looked in the forums, but couldn't find any obvious places to download executables, just the sources (which don't build).
I presume there must be executables somewhere? Can someone point me to them?
EDIT: Found some older PropGCC binaries - they come with SimpleIDE. But I still can't compile later versions of PropGCC from source.
Are there any already-built executables for PropGCC for either Windows or Linux? I looked in the forums, but couldn't find any obvious places to download executables, just the sources (which don't build).
Ok! Found some binaries. I didn't realize PropGCC was part of SimpleIDE. And what a palaver to get that working!
But anyway, here are some more PropGCC results. As I suspected, @ersmith was only using 2 cogs when he could have been using 4:
My actual numbers are different to @ersmith - we may be using a different version of PropGCC, or different options. I couldn't get some options (e.g. some memory models and optimization levels) working at all with PropGCC.
But the important point is that the OpenMP pragmas and the propeller pragmas achieve almost identical performance improvements - at least in this instance.
I will continue working on the propeller pragmas and see where we end up. I believe I can already do much of what the OpenMP pragmas can do, but with a smaller and simpler set of pragmas that will be easily portable to any other C compiler.
The experiences with XMM on the P1 were not kind. Way too much time wasted on a technology that lots of people demanded, but very few actually used.
Yes, XMM was an idea that seemed good at the time, but few people adopted. The P2 has a different set of optimizations and trade-offs, of course, so perhaps it might work better on P2. OTOH with 512K memory there is even less need for XMM than on P1.
But it would still be more useful than a completely custom solution, IMHO.
We will probably never agree on this very fundamental point, so I guess we should just agree to disagree and move on.
Once I have implemented the propeller pragmas, and you have implemented the OpenMP pragmas, then it might make sense to return to it.
Sure, that makes sense. If the propeller pragmas are at least similar in spirit to the OpenMP ones then perhaps I can implement both.
Although honestly there are other things I should work on first. fastspin's performance is good, but it still has some distance to go to catch up to Catalina in terms of stability and completeness.
I have upgraded the version of the Catalina parallelizer/pragma processor to version 1.6 - see the first post in this thread for the zip file, which now includes the source code, and for details of the new functionality that has been added. Also, Catalina 4.5 has just been released, to fix some lock and memory management issues that were exposed when testing the parallelizer. You should upgrade to that version when using version 1.6 of the parallelizer, or some of the test programs may not function correctly.
It is likely that this will be the last separate release of the parallelizer, and that future releases will be part of Catalina. This does not mean it cannot be used independently, or support other C compilers, but it does mean that Catalina will be able to use it more effectively, such as allowing it to be invoked in any compilation command, or in the Geany or Code::Blocks IDEs, just by specifying an additional command-line option.
As discussed earlier, I considered doing away with the begin and end pragmas and just assuming that the block following the worker pragma represents the parallelizable code segment (as OpenMP does), but in practice it turns out that this means you eventually need MORE pragmas, not fewer - and it also means you may have to modify your source code in cases where the current propeller pragmas do not require you to do so. So that idea is currently on hold. I may revisit it later.
I hear your frustration on downloading PropGCC. It has not been compilable on modern Linux systems for many years now, and the only way to still build it is with old (not even LTS support) distributions. It seems to require Ubuntu 14.04 IIRC. Docker is the only way I'm comfortable doing it.
Basic "launching a C function in another COG via cogstart" functionality is there already. I haven't implemented a higher level API yet. I'm leaning towards doing Posix threads, but the C11 <threads.h> standard (which is very similar) is also a reasonable starting point. Have you looked at these? Any preferences for which standards to adopt? I think we should have some standard thread API available for the P2, even if there are also compiler specific non-standard libraries.
Hi @ersmith
I've only just now had time to get back to thinking about this. But having done a bit more reading, I agree that Posix threads is probably the way to go. I read that both C11 threads and OpenMP can be implemented using only Posix threads, and I believe the propeller pragmas can as well.
However, it seems your threads library is only implemented for PropGCC? And if I am reading David's post correctly, PropGCC seems to no longer be supported. So before I spend any time on it, have you also implemented your threads library for FlexC? Or do you have plans to do so?
I haven't decided yet what kind of threads to implement in FlexC.
No worries. The propeller pragmas will be able to work with whatever you end up implementing. When I get some spare time, I will make the pragma pre-processor configurable as to how it invokes the thread library. There is really only one thing it needs to know, which is how to call the thread library function that "starts a C function as a thread on a specified cog". I can do this using a fairly simple configuration file that you can customize to suit your library. I'll provide a Posix threads one as an example.
As discussed in my previous post, I have just released a new version of Catalina (4.6) that ...
(a) integrates the new parallelizer (aka the pragma preprocessor) into Catalina, but ...
(b) still allows the parallelizer to be used "stand-alone" with other C compilers and thread libraries.
To use the parallelizer with Catalina, you just include a new command-line option (-Z) before the files to be parallelized. Not only does this make it easy to use the parallelizer on the command line, it also means it is now trivial to use it from either the Geany or Code::Blocks IDEs.
For example, here is the command that might be used to compile a serial version of the sieve demo program:
catalina -lc sieve.c
And here is the command that might be used to compile a parallel version of the same program:
catalina -lc -lthreads -Z sieve.c
Using the parallelizer with other compilers and thread libraries is a little more complex, but I have included an example of using it with GCC and Posix threads (only on the Propeller 1 - I don't think any compiler implements Posix threads for the Propeller 2 yet. Happy to be corrected!).
Here are the files that contain everything necessary to use the parallelizer in this case:
custom_threads.h: this file defines the macros and functions necessary for the parallelizer to use GCC and Posix threads:
/*****************************************************************************
* *
* This is an example "custom_threads.h" file that enables the Catalina *
* Parallelizer to be used with another thread library - in this case, *
* with GCC and Posix threads. *
* *
* To use this file with a C program (which we will assume is called *
* "custom_demo.c"), first use the Catalina parallelizer to produce a *
* "parallelized" version of the same program. For example: *
* *
* parallelize custom_demo.c -o parallel_demo.c *
* *
* Then compile the parallel version of the program using GCC, including *
* the file "custom_threads.c", using a command such as: *
* *
* gcc -lpthread -D _REENTRANT -D CUSTOM_THREADS -D LOCAL_THREADS \ *
* parallel_demo.c custom_threads.c *
* *
* Defining the _REENTRANT flag is necessary to use reentrant versions of *
* the GCC libraries. *
* *
* Defining CUSTOM_THREADS tells the program to include "custom_threads.h" *
* (instead of "catalina_threads.h") and defining LOCAL_THREADS tells it *
* to include this file from the current directory instead of from the *
* system include directories. The functions in "custom_threads.c" could *
* also be included from the system libraries, and in that case you could *
* compile programs just by using a command like: *
* *
* gcc -lpthread -D _REENTRANT -D CUSTOM_THREADS parallel_demo.c *
* *
* NOTE: When using GCC and Posix threads on the Propeller 1, you may *
* need to reduce the number of threads executed in parallel, and *
* also use the COMPACT mode of the compiler. For instance, the *
* example program custom_demo.c was created from test_1.c by *
* modifying the worker pragma to start at most 5 parallel threads *
* (not the default of 10) - i.e. the worker pragma was modified *
* to: *
* *
* #pragma propeller worker(int i) threads(5) *
* *
*****************************************************************************/
// include standard C functions:
#include <stdio.h>
#include <stdlib.h>
// include propeller functions:
#include <cog.h>
#include <propeller.h>
// include posix thread functions:
#include <pthread.h>
// specify the names of some common cog functions (Catalina puts an underscore
// in front of such names, other compilers may not):
#define _LOCKSET lockset
#define _LOCKCLR lockclr
#define _LOCKNEW locknew
#define _COGSTOP cogstop
// Define the name of the function pointer type (the type is defined below).
// This name must be defined as the name of the type that describes what a
// pointer to a thread function looks like. If it is not defined, then the
// Catalina thread function pointer type will be used:
#define _THREAD _PTHREAD
// Define the thread function pointer type, which defines the type of a
// pointer to a function that can be executed as a thread. This must match
// the macro _THREAD_FUNCTION, which is used to actually declare instances
// of this type, and the name must also match the name defined for _THREAD
// macro, above:
typedef void *(* _PTHREAD)(void *);
// What a function must look like to be run as a thread (if the function
// does not accept an integer as its first parameter, you must define the
// _GET_ID_FUNCTION macro to return the unique integer id of the thread).
// This macro definition must match the thread function type defined above
// (note that type is a pointer to this type):
#define _THREAD_FUNCTION(pthread) void *pthread(void *argc)
// If the thread function does not accept an integer as its first parameter,
// define a _GET_ID_FUNCTION that returns a unique integer id for the calling
// thread. The name passed to this macro is the name of the worker type, and
// the unique integer id must start at zero and increment by one for each
// instance of the type created. However, the integer id is only unique to
// the type, so the name may need to be used to specify different functions
// or global variables to be used for each type. Posix threads do accept a
// parameter, but it is type void *, so we need to cast it to an integer:
#define _GET_ID_FUNCTION(name) ((int)argc)
// The function that releases CPU to other threads. This is important for
// threads implementations (such as Posix threads) that do not implement
// true preemptive multitasking:
#define _THREAD_YIELD() pthread_yield()
// The function to start a new threaded cog, with an initial "foreman"
// thread. Must return the cog number or -1 if no cogs available. This
// function is only used to start the foreman thread, which does not
// require a parameter. Worker threads are instead started using the
// _THREAD_ON_COG function:
#define _THREADED_COG(foreman, stack, stack_size) \
new_threaded_cog(foreman, stack, stack_size)
// The function to start a worker thread on a specified cog - the id passed
// in is the unique id of this worker thread. It will be passed in the first
// parameter if the thread function accepts parameters, otherwise the thread
// must fetch its unique worker id via the _GET_ID_FUNCTION:
#define _THREAD_ON_COG(worker, stack, size, cog, argc, ticks) \
new_thread_on_cog(worker, stack, size, cog, argc, ticks)
// The function to set the global thread lock (this should be a null macro
// if a global lock is not required, such as when using Posix threads, which
// do not use preemptive multitasking)
#define _THREAD_SET_LOCK(lock)
// The function to get the global thread lock (this should be a constant
// macro that just returns -1 if a global lock is not required, such as when
// using Posix threads, which do not use preemptive multitasking)
#define _THREAD_GET_LOCK() (-1)
// Forward declaration of the function to start a new cog, with a specified
// foreman, stack and stack size. Must return the cog number, or -1 if no
// cog is available.
int new_threaded_cog(_PTHREAD foreman, long *stack, int stack_size);
// Forward declaration of the function to start a new thread on a cog, with
// the specified stack size, id and ticks (ticks may be ignored by Posix
// threads, which does not implement preemptive multitasking). Must return
// (as a void *) a unique handle for the thread, or NULL if the thread
// cannot be started.
void *new_thread_on_cog(_PTHREAD worker,
long *stack, int stack_size,
int cog, int id, int ticks);
custom_threads.c: this file actually implements the functions required to use GCC and Posix threads. As you can see, the code is quite trivial:
/*****************************************************************************
* *
* This is an example "custom_threads.c" file that enables the Catalina *
* Parallelizer to be used with another thread library - in this case, *
* with GCC and Posix threads. *
* *
* See the file "custom_threads.h" for details on how to use this file. *
* *
*****************************************************************************/
#include "custom_threads.h"
// When using GCC and Posix threads, a function to be started on a cog
// is different to a function to be started as a thread, so we must
// define the former type here, as we expect to be passed the latter.
typedef void (* foreman_t)(void *arg);
// Function to start a foreman function on a new cog. Must return the
// cog number, or -1 if no cogs are available. In this case, all we
// need to do is cast the thread type to be a foreman thread type, and
// then we can just use the cogstart function. This is safe because
// the only difference is the return type, and foreman functions never
// return.
int new_threaded_cog(_PTHREAD foreman, long *stack, int stack_size) {
return cogstart((foreman_t)foreman, NULL, stack, stack_size);
}
// Function to start a worker function on a specified cog. Must return a
// void * that is the unique handle of the thread, or NULL if the thread
// cannot be started. Thread functions accept a void * parameter, so we
// cast our argument to be a void * - the worker must cast it back to an
// integer to get its unique worker id.
void *new_thread_on_cog(_PTHREAD worker,
long *stack, int stack_size,
int cog, int argc, int ticks) {
pthread_t t = NULL;
pthread_attr_t ta;
ta.stack = (void *)stack;
ta.stksiz = stack_size;
if (pthread_create(&t, &ta, worker, (void *)argc) == 0) {
pthread_set_cog_affinity_np(t, ~(1<<cog));
}
return t;
}
There are also some minor improvements to the parallelizer in the latest release. See the README.TXT file in the release for details.
It seems my Posix thread start function should perhaps have specified the thread to start as "detached" - i.e.
void *new_thread_on_cog(_PTHREAD worker,
long *stack, int stack_size,
int cog, int argc, int ticks) {
pthread_t t = NULL;
pthread_attr_t ta;
ta.stack = (void *)stack;
ta.stksiz = stack_size;
ta.flags = _PTHREAD_DETACHED;
if (pthread_create(&t, &ta, worker, (void *)argc) == 0) {
pthread_set_cog_affinity_np(t, ~(1<<cog));
}
return t;
}
Doing so certainly changes the output of my demo program, but I am not really sure why - it shouldn't really make any difference whether a thread is created as "detached" or not, if the thread is never either terminated or "joined" - but apparently it does.
Can anyone who is more familiar with Posix threads than I am confirm this?
Comments
I have to strongly disagree here. We often like to think of the Propeller as being somehow uniquely parallel, but that's only true in a very limited domain (that of inexpensive microcontrollers). In the "real world", and especially in the high performance computing world that OpenMP came out of, parallelism has been a normal part of computer architectures for many, many years. A typical desktop PC has vastly more parallel computing capabilities than a propeller: the CPU has multiple cores, each of which can run multiple threads, and the GPU has hundreds or even thousands of hardware threads. Both have shared memory and much more sophisticated hardware locks than the simple ones in the propeller.
We have a philosophical difference here: as long as you fully document the limitations I think it's much more useful for the end user to have something familiar (if incomplete) than to have to learn some completely new API. But if it makes you uncomfortable then by all means use a different #pragma (e.g. "#pragma pmp" instead of "#pragma omp"). The main thing that I think would help users is to implement the same API, just as a subset of printf (like t_printf) uses the familiar %d, %u escapes, even if not all of them are implemented.
On the contrary, most of the complexity is in the PC side tools, not on the chip. PropGCC for the Propeller 1 fully implements OpenMP 4.5, as far as I can see.
As compiler writers, it's very very tempting for us to focus on what's easy for us to implement as opposed to what's easy for end users to learn and to use. (I'm guilty of this myself all too often!)
I really like what you've done with the parallelizer, and I think it could be a very broadly useful tool even outside of the Parallax community. But it would be much more useful if it implemented a known API (like OpenMP or Cilk) and targetted a fairly general thread library (like pthreads or the new C standard threads).
Basic "launching a C function in another COG via cogstart" functionality is there already. I haven't implemented a higher level API yet. I'm leaning towards doing Posix threads, but the C11 <thread.h> standard (which is very similar) is also a reasonable starting point. Have you looked at these? Any preferences for which standards to adopt? I think we should have some standard thread API available for the P2, even if there are also compiler specific non-standard libraries.
Regards,
Eric
Alas, even in "bang for the buck" I think there are GPUs (on the high end) and multi-core ARM and RISC-V chips (on the low end) which give the Propeller a run for its money. The propeller architecture is particularly nice and clean, but it's no longer unique.
Well, you've provided "tiny" libraries for Catalina which (quite properly) reduce code bloat by providing the most useful subset of the standard C libraries. Similarly, virtually no floating point libraries for the Propeller actually implement the whole of the IEEE floating point standard; rather they just do what programmers mostly expect. These kinds of things make porting code from other platforms much easier, and hence help speed adoption of the Propeller platform. And I think they totally make sense: if you're going to implement floating point in this day and age, you might as well start with the IEEE standard rather than creating a new standard from scratch. This is true even if for space or performance reasons you can't actually support all of the rounding modes and precision guarantees in that standard.
The canonical example is fftbench, which you've been working with. The OpenMP support in PropGCC is why Heater added the OpenMP support to that benchmark.
I implemented it for PropGCC a long time ago. I've attached a zip file with the source code; it probably has some GCC dependencies, alas. Also in the zip file is a very basic COG thread library (sys_threads.c) that pthreads is built atop, and the "tinyomp.c" file which PropGCC uses by default in place of the GNU libgomp for doing OpenMP threads. libgomp does work on the Propeller 1 (it's built on pthreads) but full libgomp+pthreads consumes a lot of RAM, and most of the pthreads features aren't needed for OpenMP.
Thanks, I'll have a play with PropGCC, and also look at implementing the parallelizer for your threads library (once it is a bit more stable - I am currently rewriting some of the internals to make it more robust).
I am still looking at how much more complexity I may need to add to the parallelizer. If the answer is "not much", then I think the propeller pragmas are likely to be a superior (and simpler) solution than using OpenMP**.
I have reproduced some of the things I can do with the propeller pragmas using the OpenMP pragmas, and the results are very disappointing - the code bloat is massive, and the performance improvements are modest. This is what makes me think that OpenMP is not a good solution for the Propeller. However, this was done using GCC on my desktop, not PropGCC on the Propeller itself, so I will do the comparison again using PropGCC and see what the results look like.
However, I am not optimistic. Using GCC, the code size even for trivial OpenMP examples blows out by many times - when statically linked, executables up to 6 times larger are not uncommon even for very simple examples - and the performance improvements can be nonexistent. In fact, some of my examples run slower when using OpenMP - but this is also sometimes true when using my own pragmas, so I can't really complain about that; it probably has more to do with my examples being too small and trivial for the benefit of parallelizing them to outweigh the cost of doing so. I can't see PropGCC being much different to plain old GCC with respect to relative code size, but I will try it and see. Perhaps Heater has already done so?
On the "philosophical issue" about compliance with standards, I believe Catalina fully implements the IEEE standard for floating point (except for some obscure underflow issues that I don't have the expertise to solve), and also all the other library functions required by the relevant C standards. Offering "additional" libraries or alternative libraries that use less RAM is not a deviation from the standard.
Ross.
** However, I do like the OpenMP approach of assuming the parallel code segment ends at the end of the next block - this would mean I could make the "propeller end" pragma optional, and you would only need to use it if end of the parallel segment is not the end of the next block. Thanks for pointing this out.
Here's what I get for fftbench with PropGCC 6.0 for P1
The code speedup isn't great, but I didn't spend any time trying to tune the number of COGs, so I don't know how many it's actually using.
I don't think Catalina fully implements the IEEE standard. I'm not criticizing: I doubt that any floating point library for the P1, or any other micro, fully implements the standard. For example, the directed rounding modes (round to 0, round to +- infinity) which are required by the standard, are rarely implemented on micros; almost everyone just uses "round to nearest". Similarly, the standard requires that exceptions like UNDERFLOW and OVERFLOW be signalled in some way, but P1 libraries typically don't bother to do this.
As another example, Catalina includes some features from the C99 standard but doesn't fully implement C99. I think that's sensible: you've fully (or almost) implemented C89, and if you're going to extend the libraries in some way that C99 has already specified then it makes sense to use the later standard as guidance. That's not a bug, it's a feature! The alternative is to re-invent the wheel and create some completely new libraries or language features that are non-standard. That's not as good for users, IMHO.
I'd argue that the same considerations apply to threading libraries and multiprocessing pragmas. We may not have the time or energy to implement all of the C11 standard on the P1, but using C11 <threads.h> might be a good solution to provide portability for C code. Similarly, implementing all of OpenMP may not be practical, but using some parts of it may make sense.
Of course, we could implement not just a subset of OpenMP pragmas, but a subset of the functionality of each OpenMP pragma ... but then all you are really doing is using similar pragma names for something different. You will probably not be surprised to hear that I don't want to go down that path.
You may be using only 2 cogs - that is about the improvement you might expect from 2 cogs. However, the code size increment is not bad. I will do a direct comparison when I get some time. I am deep in the internals of the parallelizer at present.

I think you are nit-picking here. Catalina implements the full ANSI C standard, using 32 bit IEEE 754 floating point. Yes, Catalina does not implement the "full" IEEE 754 standard because it does not (for example) implement 64 bit floating point. But this is not required, and I am fairly careful to spell out that Catalina only offers 32 bit floating point support.
But again, Catalina doesn't claim to be C99 compliant. That's the point. I don't think we are ever going to agree here. It is largely a philosophical difference, as I have said. But I have also read more about the OpenMP standards, and I continue to think that they are not a good fit for the Propeller - they take the conceptually simple Propeller architecture (which already supports parallelism constructs at a high level) and turn it into something that is complex and difficult to use.
I concede that this is likely to be a problem with the C language rather than the OpenMP pragmas, but this is not a sufficient excuse to complicate things unnecessarily for users. Indeed, I worry that the propeller pragmas may already be too complex, and I am looking to simplify them if I can do so, not add to them.
I will look at the C11 threads - they may be a good basis for implementing the propeller pragmas on your compiler.
Ross.
I'm still interested in parallel processing in C on the P2. Even if the right killer application has not yet been found, the journey is the goal. I don't have that much time right now, but it stays in the back of my mind. Maybe something very simple, like a demo for the LED matrix (just a thought).
Reinhard
Catalina violates the IEEE 754 floating point standard because it does not implement some required features of that standard. For example, see section 4.2 (Directed Roundings) of ANSI/IEEE Std 754-1985:
"An implementation shall also provide three user-selectable directed rounding modes: round toward +INFINITY, round toward -INFINITY, and round toward 0."
Note the word "shall"; it is a requirement, not an optional feature. Similarly, I do not think Catalina implements section 7 of IEEE 754 (Exceptions), which requires that certain conditions like overflow be detected and cause either a trap to occur or a status flag to be set.
Again, I'm not trying to single out Catalina here: very few microprocessor floating point libraries conform to IEEE 754. Instead, they largely implement a useful subset of the standard.
And my point is that even when we don't implement a whole standard (C99 in this case) it can be useful to implement some parts of it (e.g. the stdint.h header file that Catalina includes). And in practice you seem to agree, since that's what you've done in Catalina! We're in violent agreement that we shouldn't claim conformance and should clearly document when only parts of a standard are implemented. But an incomplete implementation of a standard is often better for users than a completely one-off API invented from scratch.
I still haven't installed PropGCC, but I have done some comparable figures for Catalina on the Propeller 1. I am not sure what your "size" number refers to, so I have included both code size only (no data) and total file size.
Catalina is certainly no speed demon, but it is the improvement that we are interested in here, because if we implement the propeller pragmas for other compilers, the absolute times will differ, but the improvement should be very similar.
Your reported improvement using OpenMP pragmas was 1.71 times, which is what makes me think you must be using only 2 slices (i.e. 2 cogs). Change the fftbench source code to use 4 slices (cogs) and re-run it.
Ross.
EDIT: fixed typo in numbers. Results unchanged.
The main point here - IMHO - is that Catalina, FlexC, PropGCC and the upcoming Clang implementation should be as compatible to each other as possible.
Besides not liking C that much, I have used Catalina on @Cluso99's RAMblade and really like the integration of XMM. My silent hope is that @RossH will implement an XMM driver for the HyperRAM on the P2 and that FlexWhatever would also be able to do so.
Sure, one could claim that 512K is enough, but history has shown that this is not always true.
And I really, really would like to somehow use the C files generated by GnuCOBOL to run on a P2. Sadly, my C fu is very undeveloped and I don't even know what would be needed to do so. GnuCOBOL is very Linux-centered, but besides using MinGW it can use Visual Studio on Windows to compile, so there can't be that much POSIX in there.
Enjoy!
Mike
I've just spent hours trying to get PropGCC to compile from source on either Windows or Linux. No success.
Are there any already-built executables of PropGCC for either Windows or Linux? I looked in the forums, but couldn't find any obvious places to download executables, just the sources (which don't build).
I presume there must be executables somewhere? Can someone point me to them?
EDIT: Found some older PropGCC binaries - they come with SimpleIDE. But I still can't compile later versions of PropGCC from source.
I am keeping my powder dry, but I will wait until there is a proven need for it on the P2 before I proceed.
The experiences with XMM on the P1 were not kind. Way too much time wasted on a technology that lots of people demanded, but very few actually used.
We will probably never agree on this very fundamental point, so I guess we should just agree to disagree and move on.
Once I have implemented the propeller pragmas, and you have implemented the OpenMP pragmas, then it might make sense to return to it.
Ross.
Ok! Found some binaries. I didn't realize PropGCC was part of SimpleIDE. And what a palaver to get that working!
But anyway, here are some more PropGCC results. As I suspected, @ersmith was only using 2 cogs when he could have been using 4:
My actual numbers are different from @ersmith's - we may be using different versions of PropGCC, or different options. I couldn't get some options (e.g. some memory models and optimization levels) working at all with PropGCC.
But the important point is that the OpenMP pragmas and the propeller pragmas achieve almost identical performance improvements - at least in this instance.
I will continue working on the propeller pragmas and see where we end up. But I believe I can already do much of what the OpenMP pragmas can do, but with a smaller and simpler set of pragmas which will be easily portable to any other C compiler.
Ross.
Yes, XMM was an idea that seemed good at the time, but few people adopted. The P2 has a different set of optimizations and trade-offs, of course, so perhaps it might work better on P2. OTOH with 512K memory there is even less need for XMM than on P1.
Sure, that makes sense. If the propeller pragmas are at least similar in spirit to the OpenMP ones then perhaps I can implement both.
Although honestly there are other things I should work on first. fastspin's performance is good, but it still has some distance to go to catch up to Catalina in terms of stability and completeness.
I have upgraded the version of the Catalina parallelizer/pragma processor to version 1.6 - see the first post in this thread for the zip file, which now includes the source code, and for details of the new functionality that has been added. Also, Catalina 4.5 has just been released, to fix some lock and memory management issues that were exposed when testing the parallelizer. You should upgrade to that version when using version 1.6 of the parallelizer, or some of the test programs may not function correctly.
It is likely that this will be the last separate release of the parallelizer, and that future releases will be part of Catalina. This does not mean it cannot be used independently, or support other C compilers, but it does mean that Catalina will be able to use it more effectively, such as allowing it to be invoked in any compilation command, or in the Geany or Code::Blocks IDEs, just by specifying an additional command-line option.
As discussed earlier, I considered doing away with the begin and end pragmas and just assuming that the block following the worker pragma represents the parallelizable code segment (as OpenMP does), but in practice it turns out that this means you eventually need MORE pragmas, not fewer - and it also means you may have to modify your source code in cases where the current propeller pragmas do not require you to do so. So that idea is currently on hold. I may revisit it again later.
Ross.
I hear your frustration on downloading PropGCC. It has not been compilable on modern Linux systems for many years now, and the only way to still build it is with old (not even LTS support) distributions. It seems to require Ubuntu 14.04 IIRC. Docker is the only way I'm comfortable doing it.
Anyway, you can grab my CI builds from https://ci.zemon.name/project/PropGCC?mode=builds&guest=1. Last build was from July of 2019.
Hi @ersmith
I've only just now had time to get back to thinking about this. But having done a bit more reading, I agree that Posix threads is probably the way to go. I read that both C11 threads and OpenMP can be implemented using only Posix threads, and I believe the propeller pragmas can as well.
However, it seems your threads library is only implemented for PropGCC? And if I am reading David's post correctly, PropGCC seems to no longer be supported. So before I spend any time on it, have you also implemented your threads library for FlexC? Or do you have plans to do so?
Ross.
No worries. The propeller pragmas will be able to work with whatever you end up implementing. When I get some spare time, I will make the pragma pre-processor configurable as to how it invokes the thread library. There is really only one thing it needs to know, which is how to call the thread library function that "starts a C function as a thread on a specified cog". I can do this using a fairly simple configuration file that you can customize to suit your library. I'll provide a Posix threads one as an example.
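As an illustration only (everything below is a made-up placeholder - the actual file format will be whatever the pre-processor ends up defining), such a configuration file could be as small as:

```
# hypothetical parallelizer configuration - all names are placeholders
thread_header = my_threads.h
thread_start  = my_thread_start(%FUNC%, %STACK%, %COG%)
```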
(a) integrates the new parallelizer (aka the pragma preprocessor) into Catalina, but ...
(b) still allows the parallelizer to be used "stand-alone" with other C compilers and thread libraries.
To use the parallelizer with Catalina, you just include a new command-line option (-Z) before the files to be parallelized. Not only does this make it easy to use the parallelizer on the command line, it also means it is now trivial to use it from either the Geany or Code::Blocks IDEs.
For example, here is the command that might be used to compile a serial version of the sieve demo program:
And here is the command that might be used to compile a parallel version of the same program:
Using the parallelizer with other compilers and thread libraries is a little more complex, but I have included an example of using it with GCC and Posix threads (only on the Propeller 1 - I don't think any compiler implements Posix threads for the Propeller 2 yet. Happy to be corrected!).
Here are the files that contain everything necessary to use the parallelizer in this case:
custom_threads.h: this file defines the macros and functions necessary for the parallelizer to use GCC and Posix threads:
custom_threads.c: this file actually implements the functions required to use GCC and Posix threads. As you can see, the code is quite trivial:
There are also some minor improvements to the parallelizer in the latest release. See the README.TXT file in the release for details.
Comments and feedback welcome.
Ross.
Doing so certainly changes the output of my demo program, but I am not really sure why - it shouldn't really make any difference whether a thread is created as "detached" or not, if the thread is never either terminated or "joined" - but apparently it does.
Can anyone who is more familiar with Posix threads than I am confirm this?