Shop OBEX P1 Docs P2 Docs Learn Events
OpenMP — Parallax Forums

OpenMP

Heater.Heater. Posts: 21,230
edited 2012-12-06 09:44 in Propeller 1
Some time ago Eric let it out of the bag that he had implemented OpenMP support in propgcc.

This has been driving me nuts ever since as I decided that what the world really needs is a parallel Fast Fourier Transform for the Propeller that can spread the work over 2 or 4 COGs.

Initial efforts were dismal. It was hard enough for me to get an FFT working at all, making it execute chunks in parallel gives me real headache. Anyway I now have a version of fft_bench that uses the OpenMP "parallel for" construct to divide the work up over 4 COGs. It runs just fine on my PC but is not going correctly on the propeller.

Here are the issues:

1) Using just a single COG the thing runs fine. Just remove the -fopenmp option, this still divides the work up into chunks but uses only a single thread to do the junks sequentially.

2) If you hack a couple of lines where "slices" is set up and run it for two COGs it runs fine.

2) Using 4 COGs and -Os the results are nearly correct. See below.

3) Turning off optimizations -O0 causes it to hang totally!

Always suspecting the issue is with my code, it usually is, I have spent a long time futzing with it. Chiefly:

1) It could be that parallel chunks are stepping on each others data. To check for this I now have a range check in the butterfly loops. So far this has never showed any out of bounds behaviour.

2) It could be that there is some odd timing issue, especially if 1) above were happening. To that end I have run versions of this with random time delays in the past. Currently if you hack a " printf(".............\n");" into the start of the butterfly loop it produces the correct results!

3) I implemented the same algorithms in Google's Go language using goroutines to perform the parallel threads. The nice thing about this is that apart from a few syntax changes the code is almost statement for statement the same. Importantly Go has range checking on accesses to arrays and slices of arrays. I never get a bounds check error here, It works fine.

So I'm getting almost confident that the algorithm is good and OpenMP in propgcc is not quite there yet.

I have put the parallel fft in C and Go up on github together with the SimpleIDE project file I use.
https://github.com/ZiCog/fftbench

Anyone any ideas how I can run this problem down further?

Here are the expected results:
fft_bench v1.2
OpenMP version = 3.0
Freq.    Magnitude

00000000 000001FE
000000C0 000001FF
00000140 000001FF
00000200 000001FF
1024 point bit-reversal and butterfly run time = 493477 us



Here is what I get with the code as in github:
fft_bench v1.2
OpenMP version = 3.0
Freq.    Magnitude
00000000 000001FF
00000001 000001FE
000000C0 000001FF
00000100 000001FE
00000101 00000470
00000140 000001FF
00000200 000001FF
1024 point bit-reversal and butterfly run time = 54636 us


Tantalizingly close...

Comments

  • ersmithersmith Posts: 6,105
    edited 2012-12-06 07:36
    Sorry, Heater, I haven't had a chance to look at this yet. It's certainly possible that the OpenMP implementation in tinyomp.c has a bug (I'd guess a race condition in this case). On the other hand it's also quite possible that the fft has a bug.

    The -O0 hang is pretty odd... can you tell where it's hanging? gdb should be working on the performance branch, so that may help out.

    Eric
  • Heater.Heater. Posts: 21,230
    edited 2012-12-06 09:44
    Eric,

    Yeah, sorry , this is a bit off the track for propgcc development I'm sure.
    Looks like we have a race somewhere for sure, given that adding a printf gets us the correct result.
    As I said I always suspect my code first, so I try to analyse it as much as possible. Just now I ran out of ideas.
    Any advice on the steps to take with gdb? It's a long time since I used gdb for remote debugging and never with a multi-threaded app.
Sign In or Register to comment.