OpenMP
Heater.
Posts: 21,230
Some time ago Eric let it out of the bag that he had implemented OpenMP support in propgcc.
This has been driving me nuts ever since as I decided that what the world really needs is a parallel Fast Fourier Transform for the Propeller that can spread the work over 2 or 4 COGs.
Initial efforts were dismal. It was hard enough for me to get an FFT working at all, making it execute chunks in parallel gives me real headache. Anyway I now have a version of fft_bench that uses the OpenMP "parallel for" construct to divide the work up over 4 COGs. It runs just fine on my PC but is not going correctly on the propeller.
Here are the issues:
1) Using just a single COG the thing runs fine. Just remove the -fopenmp option, this still divides the work up into chunks but uses only a single thread to do the junks sequentially.
2) If you hack a couple of lines where "slices" is set up and run it for two COGs it runs fine.
2) Using 4 COGs and -Os the results are nearly correct. See below.
3) Turning off optimizations -O0 causes it to hang totally!
Always suspecting the issue is with my code, it usually is, I have spent a long time futzing with it. Chiefly:
1) It could be that parallel chunks are stepping on each others data. To check for this I now have a range check in the butterfly loops. So far this has never showed any out of bounds behaviour.
2) It could be that there is some odd timing issue, especially if 1) above were happening. To that end I have run versions of this with random time delays in the past. Currently if you hack a " printf(".............\n");" into the start of the butterfly loop it produces the correct results!
3) I implemented the same algorithms in Google's Go language using goroutines to perform the parallel threads. The nice thing about this is that apart from a few syntax changes the code is almost statement for statement the same. Importantly Go has range checking on accesses to arrays and slices of arrays. I never get a bounds check error here, It works fine.
So I'm getting almost confident that the algorithm is good and OpenMP in propgcc is not quite there yet.
I have put the parallel fft in C and Go up on github together with the SimpleIDE project file I use.
https://github.com/ZiCog/fftbench
Anyone any ideas how I can run this problem down further?
Here are the expected results:
Here is what I get with the code as in github:
Tantalizingly close...
This has been driving me nuts ever since as I decided that what the world really needs is a parallel Fast Fourier Transform for the Propeller that can spread the work over 2 or 4 COGs.
Initial efforts were dismal. It was hard enough for me to get an FFT working at all, making it execute chunks in parallel gives me real headache. Anyway I now have a version of fft_bench that uses the OpenMP "parallel for" construct to divide the work up over 4 COGs. It runs just fine on my PC but is not going correctly on the propeller.
Here are the issues:
1) Using just a single COG the thing runs fine. Just remove the -fopenmp option, this still divides the work up into chunks but uses only a single thread to do the junks sequentially.
2) If you hack a couple of lines where "slices" is set up and run it for two COGs it runs fine.
2) Using 4 COGs and -Os the results are nearly correct. See below.
3) Turning off optimizations -O0 causes it to hang totally!
Always suspecting the issue is with my code, it usually is, I have spent a long time futzing with it. Chiefly:
1) It could be that parallel chunks are stepping on each others data. To check for this I now have a range check in the butterfly loops. So far this has never showed any out of bounds behaviour.
2) It could be that there is some odd timing issue, especially if 1) above were happening. To that end I have run versions of this with random time delays in the past. Currently if you hack a " printf(".............\n");" into the start of the butterfly loop it produces the correct results!
3) I implemented the same algorithms in Google's Go language using goroutines to perform the parallel threads. The nice thing about this is that apart from a few syntax changes the code is almost statement for statement the same. Importantly Go has range checking on accesses to arrays and slices of arrays. I never get a bounds check error here, It works fine.
So I'm getting almost confident that the algorithm is good and OpenMP in propgcc is not quite there yet.
I have put the parallel fft in C and Go up on github together with the SimpleIDE project file I use.
https://github.com/ZiCog/fftbench
Anyone any ideas how I can run this problem down further?
Here are the expected results:
fft_bench v1.2 OpenMP version = 3.0 Freq. Magnitude 00000000 000001FE 000000C0 000001FF 00000140 000001FF 00000200 000001FF 1024 point bit-reversal and butterfly run time = 493477 us
Here is what I get with the code as in github:
fft_bench v1.2 OpenMP version = 3.0 Freq. Magnitude 00000000 000001FF 00000001 000001FE 000000C0 000001FF 00000100 000001FE 00000101 00000470 00000140 000001FF 00000200 000001FF 1024 point bit-reversal and butterfly run time = 54636 us
Tantalizingly close...
Comments
The -O0 hang is pretty odd... can you tell where it's hanging? gdb should be working on the performance branch, so that may help out.
Eric
Yeah, sorry , this is a bit off the track for propgcc development I'm sure.
Looks like we have a race somewhere for sure, given that adding a printf gets us the correct result.
As I said I always suspect my code first, so I try to analyse it as much as possible. Just now I ran out of ideas.
Any advice on the steps to take with gdb? It's a long time since I used gdb for remote debugging and never with a multi-threaded app.