improving CMM performance
ersmith
Posts: 6,054
There are two big issues in code: size and speed. We want the code to be as small as possible and as fast as possible. CMM mode gives us a pretty good improvement in size, but is there anything we can do about speed? Well, there are a few things that we may be able to do in future PropGCC releases. If you're willing to build the compiler from source, and live on the "bleeding edge" then you can check out the "performance" branch from the propgcc repository and give these new features a try.
DISCLAIMER: the performance branch is experimental and not ready for prime time. There's no guarantee of when, or even if, the changes discussed here will make it into an official release, and all of the timings are preliminary and subject to change.
There are two innovations in the performance branch that can give CMM (and LMM) programs a pretty good speed boost.
(1) Kernel extensions. Based on an idea by Bill Henning, "kernel extensions" are overlays for the LMM (or CMM) kernel. There are a lot of functions, like floating point, string handling, and integer math, which are called frequently by C programs, but not all programs need all functions at the same time. We've put a few basic routines like multiply and divide in the kernel, but there is not enough room for them all. That's particularly true in the CMM kernel, where the interpreter is quite large and so there's not much space for anything else.
Enter kernel extensions. The linker sets aside 256 bytes at the end of cog memory for use by overlays. Any code that's in a section with a name ending in ".kerext" is linked to run in this area (using a mechanism very similar to the one used now for .cog sections). At runtime the generated code can call a __load_extension function to load one of these overlays, and then make calls directly into it. Note that the kernel extension area is not intended for general user use; the code running in the kernel extension cannot call other kernel extensions, so it has to be written carefully.
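To make that concrete, here is a very rough sketch of how a routine might be placed in an extension and invoked. The section-attribute form, the __load_extension prototype, and the overlay handle are all my guesses from the description above, not a documented interface:

/* sketch only -- names and signatures below are assumptions */
__attribute__((section(".umath.kerext")))
int fast_mod(int a, int b)                   /* linked to run in the 256-byte overlay area */
{
    return a - (a / b) * b;
}

extern void __load_extension(void *overlay); /* assumed prototype */
extern char __umath_kerext_start[];          /* hypothetical linker symbol */

int use_it(int x, int y)
{
    __load_extension(__umath_kerext_start);  /* swap the overlay into cog memory */
    return fast_mod(x, y);                   /* now a direct call into cog memory */
}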
Our first uses for kernel extensions are for integer math (the divide function and a few other less frequently called integer operations are moved from the main kernel to an extension to free up room), memory operations, and floating point math. Lonesock's excellent F32 floating point code doesn't quite fit in the space available, but some of the basic primitives (Pack and Unpack) do, and some of the other primitives can be run from the already existing FCACHE area. This results in a dramatic improvement in the performance of 32 bit floating point.
Consider the euler_series benchmark. Here's the code:
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>

int terms = 1000;

float euler_term(int term)
{
    float val;
    val = 1.0 / ((float)term * (float)term);
    return (val);
}

float euler_series(int terms)
{
    float result = 0;
    int term;
    for (term = 1; term <= terms; term++) {
        result += euler_term(term);
    }
    return (result);
}

double dtime()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + ((double)tv.tv_usec / 1000000.0);
}

int main(int argc, char *argv[])
{
    double elapsed;
    float es;

    elapsed = dtime();
    es = euler_series(terms);
    elapsed = dtime() - elapsed;
    printf("The first %d terms of the Euler series sum to %f\n", terms, es);
    printf("Elapsed time: %f seconds\n", elapsed);
    return (0);
}
Here's how it performs with the CMM preview compiler:
The first 1000 terms of the Euler series sum to 1.643935
Elapsed time: 0.806850 seconds
Here's how it performs with the performance branch compiler, with floating point kernel extensions:
The first 1000 terms of the Euler series sum to 1.643935
Elapsed time: 0.125790 seconds
This is a pretty good improvement; but we're still not as fast as LMM mode. Can we do more? Well, yes we can...
(2) Multiprocessing. There are 8 cogs in the Propeller, and often some of them are sitting unused. Wouldn't it be nice to use those spare cogs for any intensive calculations? Of course, we can do that already by manually launching threads and managing workloads. But that's the kind of thing that computers are good at. Couldn't a compiler automatically do this?
It can, but it's hard to do completely automatically. GCC does have some limited support for automatically parallelizing loops, but it requires both runtime and host side compile support. We haven't yet got all of the host side auto loop code implemented in PropGCC. But we have implemented some of the runtime code in the performance branch and with a few hints the compiler can indeed split loops up to run in parallel. GCC implements the OpenMP API for running code in multiple processors. All we have to do to the euler series example is to add one #pragma directive to the euler_series function. Here's the new function:
float euler_series(int terms)
{
    float result = 0;
    int term;

    #pragma omp parallel for reduction(+:result)
    for (term = 1; term <= terms; term++) {
        result += euler_term(term);
    }
    return (result);
}

Note that the pragma will be ignored by compilers that don't understand OpenMP, so it's harmless. But compilers that do understand OpenMP (and this includes Visual C++, the Intel C compiler, and of course GCC) can use the omp pragmas to parallelize the program. We then compile with the -fopenmp switch to tell GCC to enable the OpenMP run time.
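For reference, compiling for CMM with OpenMP enabled would look something like this (propeller-elf-gcc and -mcmm are the usual PropGCC spellings; the exact flag set for the performance branch may differ):

propeller-elf-gcc -mcmm -Os -fopenmp -o euler.elf euler.c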
The pragma tells the compiler that the next loop should be run in parallel on as many threads (cogs) as are free, and that the value each cog gets in "result" should be added together to make the final result.
Here's the result with OpenMP enabled (and again, CMM mode):
The first 1000 terms of the Euler series sum to 1.643935
Elapsed time: 0.017831 seconds
We've achieved a speedup of over 45 times compared to the original CMM time!
Both of these changes help LMM mode, too. Here's the LMM mode time from the performance branch, using kernel float extensions and OpenMP:
The first 1000 terms of the Euler series sum to 1.643935
Elapsed time: 0.006399 seconds

This is about a 16x speedup over the original LMM speed; not as dramatic as the CMM improvement, but pretty respectable.
Comments
Good grief, there is a lot of effort going on here; very admirable. I just can't help thinking it's trying to make the Prop into something it is not. If someone really wants good C performance, would they not be better off using a chip that supports C better?
Anyway, there is a lot to ingest there. My initial reaction is: the kernel extensions/overlays idea seems quite sound and usable. How does a user specify what goes in a kernel extension? Perhaps I want some floating point help in there, or perhaps I want some integer divide in there, or what about 64 bit arithmetic help? How do I select what gets "turbo charged"?
The automatic parallelization of loops seems a bit more troubling. How does the compiler know which COGs are free to use for that? How does it know that I might want a COG after it has started such a loop? This all looks like it belongs on a multicore general purpose machine rather than an MCU. Again, if you really want that kind of performance, why would you be using an MCU?
But this is an awesome effort, I must say. Something useful will surely come of it.
It may sound like there's a potential for thrashing here, but in practice that doesn't happen. The LMM overhead is pretty significant, so if you do even 2 floating point operations in a row before switching back to integer then the cost of the load has already been saved (we use kuroneko's high speed loading routine, so there's very little extra overhead in the kernel extension load compared to running the code through once in LMM mode).
Well, the compiler will never do parallelization unless you explicitly specify -fopenmp, and even then only on loops with #pragmas inserted (until we get the -floop-parallelize option working, in which case you'll have to give that option too -- it's not implied by any of the other options).
It's not the compiler, it's the runtime... the compiler inserts a call to GOMP_start_parallel, and that function tries to start as many COGs as it can at that point. After the loop has finished, those COGs are all shut down by a call to GOMP_end_parallel.
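Roughly, gcc outlines the loop body into a helper function and brackets it with those runtime calls. A sketch of the generated code, using the names Eric gives above (the argument lists are my guesses, not tinyomp's actual signatures):

/* sketch of what the compiler emits for "#pragma omp parallel for";
   signatures are assumptions */
static void loop_chunk(void *data);         /* outlined loop body, generated by gcc */

void run_loop(void *data)
{
    GOMP_start_parallel(loop_chunk, data);  /* grab whatever COGs are free */
    loop_chunk(data);                       /* the calling COG does its share too */
    GOMP_end_parallel();                    /* wait, then shut the extra COGs down */
}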
You can also use #pragmas to place a limit on the number of COGs allocated for any loop.
The Propeller is a highly parallel machine, this is just an easier way to expose that parallelism. But we'll never be forcing any code to be run in parallel behind the programmer's back.
Eric
For extensions/overlays it was that thrashing issue that already had me worried. From what you say, for a given set of standard operations (integer divide, float help, etc.) that is not an issue in most cases. Still, it looks like it might be a hindrance in mutant cases of: int divide - float something - int divide - float something, etc. But overall it looks like a winner.
I'm still not clear on the parallel thing. OK, the run time does it as and when needed; after all, perhaps the compiler cannot see that at compile time. But the question sticks in my mind: how does the run time know which COGs are free? So far I can do a cognew at any time; does the run time record which COGs are in use? Also, after a parallel loop has been started, potentially eating all free COGs, I might have another thread that wants to do a cognew. What then?
You can also explicitly specify how many COGs to use for a loop; for example we could have said:
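(The snippet is missing here; the standard OpenMP form, and presumably what was shown, is the num_threads clause:)

/* reconstructed from context -- standard OpenMP num_threads clause */
#pragma omp parallel for reduction(+:result) num_threads(2)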
to limit it to 2 COGs.
Finally, don't just think in terms of loops; how about something like:
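(This snippet is also missing; the standard OpenMP sections form it describes, consistent with the GOMP_sections_start mention below, would be:)

/* reconstructed from context -- standard OpenMP parallel sections */
#pragma omp parallel sections
{
    #pragma omp section
    task1();
    #pragma omp section
    task2();
    #pragma omp section
    task3();
}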
as a really simple way to launch 3 cogs, one running each of "task1", "task2", and "task3"? That example doesn't quite work with the current performance branch, because the library is missing the runtime function GOMP_sections_start; as a quick hack you can make it work by adding that function yourself. Then run it on your Quickstart board to see 3 pins toggling independently.
This is a nice easy way to start extra cogs, isn't it?
Eric
Don't think it has much practical value, but it would be neat.
It could possibly be useful on Prop2 in LMM mode...
Wow, I am sold on the idea.
Simply allowing the programmer to limit how many COGs are used does the trick.
Still curious as to what happens when the cognew fails: does it just silently go all wrong?
That example of starting tasks in COGs is neat. Years back there was a parallel C for the Inmos Transputer which could start threads with a dedicated syntax, and today XMOS has its variation on parallel C, XC, for their multicore chips, which will start threads on one core or on different cores in much the same way. Your example looks eerily similar, with a more cumbersome syntax. The construct looked like so:
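/* from memory, and not valid standard C -- the Occam-style par
   construct as it appears in XC */
par {
    task1();
    task2();
    task3();
}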
Do you really need a second method apart from fcache? How do you find the right balance of cog memory space between fcache and overlays?
Those superfeatures have to be documented and understandable as well...
Sorry, Christof
Loading stuff into fcache takes time, which is OK for small loops of code that will fit in cog memory and run for a while; then the overhead of loading the loop code is small compared to the execution time. The speed up was quite pronounced for the Fast Fourier Transform code. In these cases the compiler can have a look at the loops in the code and decide if they are suitable to load into fcache, i.e. small and self contained, with no calls to functions etc.
But fcache does not work so well (or at all) for, say, small functions like division; here the overhead of loading the code to cache may be significant compared to its run time. And calls to such a division might be made in lots of places in your code that are otherwise not cached. In this case it's perhaps better to have the division loaded as an overlay/extension at start up; then it is always there for use as the program runs, with no loading overhead each time.
Well that's my guess and I'm sure Eric will put us right.
Now I start to worry how fcache and overlays fight with each other in a program.
I also worry about things getting very complex, but I guess they are there for use if you need them and are ready for them. The parallel processing features, for example, are available to all GCC users; I work in C all the time and have never used them:)
If there are 3 tasks and only 2 COGs started, then tasks 1 and 2 will start immediately, and the first COG finished will then start task 3.
Eric
Another example: suppose there is a loop doing a lot of string operations. In the old way of doing things the compiler cannot put that loop into fcache. If the string functions are in a kernel extension, though, then the loop can be placed in fcache since all the function calls are into the kernel.
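For instance (an illustrative sketch, not code from the branch):

#include <string.h>

/* with the str* functions living in a kernel extension, every call in
   this loop lands in cog memory rather than hub code, so the compiler
   is free to place the whole loop in FCACHE */
size_t total_length(char *words[], int n)
{
    size_t len = 0;
    int i;
    for (i = 0; i < n; i++)
        len += strlen(words[i]);
    return len;
}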
Kernel extensions themselves are not really intended to be an end user feature. Rather, they are a way we can allow further optimizations in the compiler and libraries.
A large part of the space for kernel extensions comes from moving functions that were always resident in the kernel (like division) into overlays. So having this space does not take away from what's available for FCACHE -- in fact the FCACHE size hasn't changed. The only cost is that there isn't quite as much memory free now for _COGMEM variables and _NATIVE functions.
Eric
Another reason is to group small related functions together. strcpy, strcat, and strcmp, for example, could all fit in a kernel extension and all be resident at the same time; but only one function at a time can be in the FCACHE.
They're quite independent -- they are two separate areas of COG memory. Even if an FCACHE loop calls functions in different overlays it would work, and the performance would still end up being better than it is now, since as it is now the loop couldn't go in FCACHE at all because of the function calls!
I'm sorry, I guess my original message was too long and didn't explain things very well. You guys did notice that the benchmark sped up 45x, right? I think that's a good thing? :-)
Eric
Being curious, I had to try this out on my PC, which has 4 cores. Sure enough, I can run 4 threads and push my reported CPU load to 286%. I also discovered what you said: I tried 5 threads and sure enough the fifth one did not get run until one of the other 4 had finished. This means that when starting 5 threads that all run forever, one of them will silently never run.
Now I'm curious how that works on the Prop. If I want to start 4 threads on 4 COGs that all run forever, but only 3 COGs are available because something else has the others already, how does the system get notified when a COG becomes free so that it can run my fourth thread? If those three threads are all running my code, there is no one left to notice when a thread becomes free, and anyway there is no mechanism for such a notification in the Prop hardware ("waitcog" instruction, anyone?)
Sure did. Initially I was cold to the idea of OpenMP on the Prop; it can get you performance, yes, but it smacked of trying to make a Prop into a performance number cruncher, which it is not. However, I am hot on the idea of using OpenMP to organize my tasks into COGs as shown.
As I said, there is a lot there to ingest, and we are a bit slow:)
This is a disaster: you have diverted me from all my projects until I have played with OpenMP for a while...
On my four core PC my fifth thread gets run if any of the four initial threads finishes. In that way when (if) all the threads finish the whole job is completed correctly. Nothing is missed out.
If the Prop implementation does not do this, is that correct behaviour for OpenMP?
(Not that I worry about that so much under the circumstances)
So what you are saying is:
1) That if all is under OpenMP control then it will work as on my PC. New cogs will be started if need be as cogs become free. The total job runs to completion.
2) If some COGs are started other than by OpenMP, say I just do a cognew, then the OpenMP team will run to completion correctly, but my rogue cog will never be used if it happens to stop.
Sounds fair enough.
I think I have some checking out and building to do, I'll be back in a while....
Help! The build of the performance branch of propgcc fails here.
I just spent the weekend genning up on OpenMP. I must say it is simple in concept and ideal for starting up COGs as you showed. When it comes to parallelizing existing code, or even writing new parallel code, it looks like it can be a pig. It's easy to create race conditions and such; many ways to shoot yourself in the foot.
Anyway, the result is that I now have a parallelized version of fft_bench that runs as much as three times faster on my four core PC than it does when single threaded. (Well, actually that's a fib: fft_bench operates on a small data set and only takes ~100us normally, so the overheads of starting threads etc. make it many times slower. But when I insert a nice big delay loop into the butterfly and push the run time out to seconds, we begin to see the benefit.)
So of course I want to run this on the Prop. But I can't build the performance version of propgcc on my Debian box; make fails somewhere in building libsupc++ or such. Here is the tail of the build messages:
Great, that will have to wait until tomorrow though.
Meanwhile I'm starting to wonder what to expect on the Prop. What I found on my PC is that the whole fft_bench job takes less than 200us on a single core. Parallelizing the middle for loop pushed the execution time up by a factor of 10 rather than speeding anything up. I put that down to the overheads of launching threads under Linux. And I notice that OpenMP sits on top of pthreads, adding a layer of abstraction code. And I guess any thread started under Linux has a latency due to the granularity of the Linux jiffy clock when scheduling can occur.
How does this go on the Prop? Does OpenMP also sit on pthreads there? We have the overheads of loading a COG for each new thread but we don't have a jiffy clock to wait for. Or what?
Does this work with FCACHE stashing away the inner loop(s)?
Yes, I think so.
I'm kind of glad to see omp for the Prop does not sit on pthreads; I was already convinced from my experiments on the PC that doing that would be much bigger and slower. So tinyomp launching COGs directly sounds great. Amazing things you are doing there.
As for it not being fully omp compliant, that's not something that would worry me too much as long as it is small and fast.
However, as I'm going to be doing my omp experiments on a PC for a while, it might be helpful to know what might not work on the Prop.
Thinking about my effort to parallelize fft_bench, I think I have to try a new approach. That FFT is a traditional three-nested-loop job and I have just put a "parallel for" around the middle loop. That means that for every pass around the outer loop, omp has to divvy up the work of the inner loop afresh, which introduces a lot of thread creation overhead.
Ideally I'd want to split the work over four cogs, say, at the top level and let those cogs run all the way through. FCACHE would really fly there.
However, it is in the nature of those three loops that the outer loop is best done by two threads, and then one thread for the last two iterations, else there is a terrible race condition. And that means rearranging the code somewhat from its fft_bench standard.
Well, I guess that's for me to worry about.
I'm not sure how much overhead there is for thread creation -- it's pretty much a direct call to cognew. I guess it would be interesting to time it.
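Something like this would do for a first measurement (a sketch only; CNT comes from PropGCC's propeller.h, and the empty parallel region isolates the pure start/stop cost):

#include <propeller.h>
#include <stdio.h>

int main(void)
{
    unsigned t0 = CNT;
    #pragma omp parallel
    {
        /* empty region: anything measured is thread start/stop overhead */
    }
    unsigned elapsed = CNT - t0;
    printf("parallel start+stop: %u clocks\n", elapsed);
    return 0;
}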
Your work on fft_bench sounds pretty exciting -- I'm looking forward to seeing it!
propgcc now compiles my parallel fft_bench, but I have no _omp_/_gomp_ functions, and no tinyomp.c or omp libraries.
Strangely, propgcc does not have an omp.h include file. It compiles fine without it, though.
tinyomp.c should be in propgcc/lib/sys/propeller/tinyomp.c
Are you sure you're up to date and on the performance branch? "hg branch" should print "performance".
Silly me, forgot to "pull" first.
Everything now compiles with the following undefineds, most of which I can live without.
_omp_get_max_threads - I don't need it. I guess that's kind of icky anyway, as we don't know how many cogs are in use by others outside of omp. And I guess we don't want multiple threads within a single cog, the way I can have more threads than cores on my PC.
_omp_get_num_procs - Easy, just return 8:)
_omp_set_num_threads - I don't need it.
_omp_get_nested - Easy, just return false. I get the impression there are not many implementations that support nested threads anyway.
_omp_set_nested - Not needed if there is no nesting.
_GOMP_loop_dynamic_start
_GOMP_loop_dynamic_next
_GOMP_loop_end_nowait - These all go away when I remove "schedule(dynamic)" from the "parallel for" pragma or use schedule(static). Actually, I have not really understood what "dynamic", "guided" etc. are supposed to do yet.
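(For reference, the schedule clause just controls how loop iterations are handed to the threads, e.g.:)

#pragma omp parallel for schedule(static)   /* iterations split into equal chunks up front */
#pragma omp parallel for schedule(dynamic)  /* threads grab a chunk at a time as they finish */
#pragma omp parallel for schedule(guided)   /* like dynamic, but chunk sizes shrink as work runs out */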
Here is the only omp pragma in my fft_bench:
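(The pragma itself didn't survive the post; from the description earlier it was presumably just the plain form, after dropping schedule(dynamic):)

/* reconstruction, not the original attachment */
#pragma omp parallel for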
Are you sure about that?:) Well here it is attached.
What is it? Well, basically it's my first crude attempt at splitting the FFT butterflies routine over multiple cores, currently a maximum of 4.
The butterflies() function has gained some parameters to indicate which part of the data to work on, how many levels of butterflies it should do and how many ways the data has been split (i.e. number of threads).
Then the call to the butterflies routine has been replaced by many calls placed in omp parallel sections, so each call can run as a different thread, on its own core if available. First there are 4 threads, hopefully in parallel, doing the bulk of the work. Then there are 2 threads combining those 4 outputs. Finally a single thread combines those 2 outputs into the final output.
Now it's a bit cheeky of me to post this, as I'm far away from any Propeller gear to test on and the only PC here has just a single core. So testing is lacking, but as a bonus, if it does work on a Prop you will be the first person in the world to see it:)
However I did test like this:
1) Use omp_set_num_threads(4) to get 4 threads available even on a single core machine.
2) Print the thread id from the butterfly function.
3) Add a 1ms delay to the butterfly
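In code, those three steps look roughly like this (my reconstruction, not the attached source; butterfly_stub stands in for the real butterfly routine):

#include <omp.h>
#include <stdio.h>
#include <unistd.h>

static void butterfly_stub(void)
{
    printf("butterfly on thread %d\n", omp_get_thread_num()); /* step 2 */
    usleep(1000);                                             /* step 3: 1ms delay */
}

int main(void)
{
    omp_set_num_threads(4);          /* step 1: force 4 threads, even on one core */
    #pragma omp parallel
    butterfly_stub();
    return 0;
}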
With all that in place I can see that it does indeed fire up multiple threads, do the work, and give the correct result.
Also, reducing the number of threads down through 3, 2, 1 continues to work correctly. As each thread has an independent sleep time, I can see performance go from 15.5 seconds with one thread to 5.4 seconds with 4 threads. Sweet:)
Now this is all rude and crude by OpenMP standards, I'm sure, what with having a fixed arrangement of parallel sections and a fixed maximum number of threads. But it has given me a severe headache to get working, and I'm not sure I'm up to having it automatically adapt itself to various core quantities. It would be easy enough to extend the maximum to 8, though that might impact performance with smaller numbers of cores.
I had a sneak peek at an OpenMP FFT I found on the net and was dumbfounded by its complexity. When I ran it on a 4 core PC it only seemed to gain speed well beyond an FFT size of 1024. So perhaps my dumb approach is better suited to what we are doing here anyway.
So if you have a minute I'd love to know if this flies. I'll be back with my Props at the weekend.
However the results are pretty hopeless.
With optimizations off it completes in 239us with four cores vs 208us with one core. Hardly a speed up at all.
With size optimization (-Os) it completes in 110us with four cores vs 83us with one core. A slow down!
Seems this problem is so small for a modern PC that the overheads of creating threads to run it in parallel outweigh the gains of parallelism.
P.S. propgcc is lacking an omp.h and a means of setting the number of threads available, omp_set_num_threads().
Sadly no Props here to run it on.
Thanks for doing this work! It's very cool.
I've updated the performance branch with omp.h and set_num_threads. On the propeller OpenMP does seem to help: the time goes from 63ms without -fopenmp to 25ms with it (that's with -Os... -O2 and -fopenmp don't seem to get along, I'm going to have to investigate why).
Nice work!
Eric
Fantastic.
My pleasure. As I said when propgcc alpha test started "I like to break things" and now you have something to fix
Any idea how many COGs that actually ended up using? It's not quite the performance gain we are looking for if it takes 4 COGs to get only a 2.5x boost.
I suspect it might be better to use my earlier version that only supported a max of 2 cores. That means less slicing-and-dicing overhead. I'll dig it out this evening.