static unsigned int bitReverse(unsigned int x, unsigned int length)
{
x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
x = (x >> 16) | (x << 16);
return (x >> (32 - length));
}
Are you aware the Prop has a built-in "REV" assembler instruction that can reverse some or all bits in a byte in 4 clocks?
I don't know if there's an intrinsic function declared for this (there probably should be!), but if there isn't, the code above can be replaced by something like
__asm__ (
"REV %1, %2"
);
See page 343 of the Propeller Manual v1.2.
Disclaimer: I don't know much about FFT, except for the basic concept and for the fact that it involves bit reversing, so I didn't review the code thoroughly. I just knew that the Prop can do this in one instruction just because FFT is a common algorithm and needs it and I was curious as to whether you had used this in your implementation. :-)
Ah yes, I know there is a rev instruction, I use it in the PASM version of fft_bench and heater_fft. I did not use it in the C version of FFT bench because as a benchmark it is supposed to be pure C only. However now that you mention it I'll probably put it into heater_fft and see what we gain from it.
But you must admit that integer bitReverse() is a damn cunning piece of code. Not mine of course. I thought it made a good bit of benchmarking code for an MCU what with all those logical ops going on.
Here is a version of parallel fft_bench that only supports 2 parallel cores. May be more efficient than the 4 core version. Sorry it uploads with the same shortened file name as the fft_bench_omp_4.c from this XP machine, no idea why.
Edit: deleted the attached version, see next post for a better one.
Whist tying to figure out how to parallelize this thing I started out by simplifying the code. Which kind of un-optimized it by introducing three multiplies calculating some array indices. That made it simpler to rearrange for parallelizing but slower because those muls were in the inner loops. I have now removed the muls by going back to adding a step size to the indices on every iteration. Seems to still work:) See attached.
Edit: Remove attachment, it turned out be buggy. See later post for update.
Whist tying to figure out how to parallelize this thing I started out by simplifying the code. Which kind of un-optimized it by introducing three multiplies calculating some array indices. That made it simpler to rearrange for parallelizing but slower because those muls were in the inner loops. I have now removed the muls by going back to adding a step size to the indices on every iteration. Seems to still work:) See attached.
That version seems a bit slower (28ms). Still, it's a definite improvement on the original.
(I'm guessing that GCC can probably strength-reduce multiplies by loop indexes into adds, but I'm not certain.)
Problem is that the max four core version produces the wrong results when limited to 2 cores (omp_set_num_threads(2)).
Like so:
OpenMP version = 3.0
Freq. Magnitude
00000000 0000017F
00000080 0000003F
000000C0 000001C4
00000100 0000007F
00000140 000001FF
00000180 0000003F
000001C0 0000003F
00000200 0000017F
1024 point bit-reversal and butterfly run time = 30145 us
As such I cannot compare the 2 core speeds for each version.
I don't get that erroneous output when running on my four core Intel box. But potentially if there is a race condition in that case it may not show up on the PC what with having such different timing.
Turns out get correct results when running the max for cores version with 2 cores if I turn off FCACHE. Like so:
fft_bench v1.1
OpenMP version = 3.0
Freq. Magnitude
00000000 000001FE
000000C0 000001FF
00000140 000001FF
00000200 000001FF
1024 point bit-reversal and butterfly run time = 75727 us
Originally Posted by RossH
Page 566? This is a joke .... isn't it?
Nope, right there in section 6.54.13.
Prosser: But you did find the plans.
Authur: Yes, I found them. In a locked filing cabinet in a disused lavatory behind a door that said "Beware of the tiger".
Prosser: That's our propgcc department.
Prosser: But you did find the plans.
Arthur: Yes, I found them. In a locked filing cabinet in a disused lavatory behind a door that said "Beware of the tiger".
Prosser: That's our propgcc department.
LOL
Some might say that PDF is the C/C++ programmer's equivalent of Vogon poetry ;-)
(And I really think I should read it some time, too!)
That manual has all sorts of little treats, such as gcov and gprof. Thanks for posting.
Not everything in the manual is applicable to the Propeller, and unfortunately gcov and gprof are among those, at least at this time :-(. (Well, you can try them ... they *might* work, but I strongly suspect they won't, since we haven't done any work to support them.)
Speaking of the GCC manual (and going further off-topic, sorry)
I was reminded that gcc has a __builtin_unreachable( ). As far as I can tell, this works as expected. Doesn't that kinda make _NAKED (really: __attribute__((naked)) ) unnecessary? As far as I can tell, the following two code fragments generate pretty much the same code:
_NAKED void stopme1(void)
{
cogstop(cogid());
}
// stopme1( ); generates JMPRET lr, #_stopme1 in the assembly
_NATIVE void stopme2(void)
{
cogstop(cogid());
__builtin_unreachable();
}
// stopme2( ); generates CALL #_stopme2 in the assembly, and interestingly enough, compiles correctly even though there is no _stopme2_ret symbol???
And while we're talking about builtin functions, shouldn't there also be a __builtin_propeller_min(unsigned) , __builtin_propeller_mins(int), __builtin_propeller_max(unsigned) and __builtin_propeller_maxs(int) ?
I discovered the last parallel FFT version (for 2 cores) was buggy. It would fail if you tried to tweak it back to using 4 cores.
Attached is a fixed version that has a #define to select a max of 2 or 4 cores. Define OMP_FOUR_CORES or OMP_TWO_CORES as desired. With no define it defaults to a single core.
Speaking of the GCC manual (and going further off-topic, sorry)
I was reminded that gcc has a __builtin_unreachable( ). As far as I can tell, this works as expected. Doesn't that kinda make _NAKED (really: __attribute__((naked)) ) unnecessary?
Not quite. __attribute__((naked)) disables both the function prologue (saving of registers and setup) and epilogue (register restore and return). __builtin_unreachable() will only disable the epilogue. Note that __attribute__((naked)) is implemented on quite a few platforms, it's not really propeller specific.
And while we're talking about builtin functions, shouldn't there also be a __builtin_propeller_min(unsigned) , __builtin_propeller_mins(int), __builtin_propeller_max(unsigned) and __builtin_propeller_maxs(int) ?
You can achieve those with inline functions, e.g.:
/* implement the propeller MIN instruction (not the same as mathematical minimum, rather its opposite) */
inline unsigned MIN(unsigned a, unsigned b) { return (a >= b) ? a : b; }
and similarly for the others. ABS and NEG are obvious too, I think :-). For the rotates use the well known:
inline unsigned ROL(unsigned a, unsigned b) { return (a<<b) | (a>>(32-b)); }
One of these days I intend to post a list of how to achieve all the Propeller instructions in gcc, although it will have to wait for a few compiler changes (SUMZ and SUMNZ, for example, are only generated on the performance branch right now).
I discovered the last parallel FFT version (for 2 cores) was buggy. It would fail if you tried to tweak it back to using 4 cores.
Attached is a fixed version that has a #define to select a max of 2 or 4 cores. Define OMP_FOUR_CORES or OMP_TWO_CORES as desired. With no define it defaults to a single core.
That looks very cool. Thanks, Heater! I'll try to look at it later -- right now I've been distracted by trying to get gdb set up and working (so as to better debug this kind of thing).
This is getting silly. Attached is the fft_bench with option for a maximum of eight cores. Well the Prop has eight COGs so it has to be done. Is this even runnable? Hopefully if there are only 5, 6 or 7 COGs available it will still work just using what it can find. Just define OMP_EIGHT_CORES.
It's getting silly because all that paralleization code is getting a bit unwieldy and is not generally scaleable, what happens when Chip makes a 16 or 32 COG Propeller? Time to try and think how to make that auto-scaling.
Never mind GDB. Our great leader Linus Torvalds has good arguments against debuggers, which I will not restate here, and never uses them. My experience is that when I really need a debugger they don't help or even make the problem worse, often by hiding it. Normally that involves timing and race condition problems in interrupt or threaded code.
Not quite. __attribute__((naked)) disables both the function prologue (saving of registers and setup) and epilogue (register restore and return). __builtin_unreachable() will only disable the epilogue. Note that __attribute__((naked)) is implemented on quite a few platforms, it's not really propeller specific.
Thanks! I was not aware of that.
You can achieve those [min/mins/max/maxs] with inline functions, e.g.:...
I just verified that "a = (a > b) ? b : a;" is replaced with "MAX a, b" . Very impressive! It does mean that the C code for this construct is harder to read than the assembly, though :-)
Actually, the C version is clear to me and the PASM version is confusing because it uses the term "MAX" to mean "limit the maximum value". I would have named this instruction "MIN" instead since this is the normal meaning of the min function in every other computer language, mathematical description and standards document.
In an effort to make my OpenMP FFT a bit less silly I have swapped from using "omp parallel sections" to "omp parallel for".
That makes it a lot less verbose and gives me clues as to how to get the FFT to automatically scale to the number of cores actually available, perhaps even more than 8.
Please see attachment.
Yay! For those who have a 16 COG Propeller or other machines the attached FFT bench will automatically scale the number of cores/COGs it uses from 1 to 16 depending on what you have available, with no code changes required!
Of course it may be more efficient to set an upper limit on the number of COGs it tries to use as that will cut down on the amount of slicing and dicing of the data and core scheduling it tries to do.
I have yet to try this on a Propeller. Perhaps later today.
Comments
Are you aware the Prop has a built-in "REV" assembler instruction that can reverse some or all bits in a byte in 4 clocks?
I don't know if there's an intrinsic function declared for this (there probably should be!), but if there isn't, the code above can be replaced by something like
See page 343 of the Propeller Manual v1.2.
Disclaimer: I don't know much about FFT, except for the basic concept and for the fact that it involves bit reversing, so I didn't review the code thoroughly. I just knew that the Prop can do this in one instruction just because FFT is a common algorithm and needs it and I was curious as to whether you had used this in your implementation. :-)
===Jac
Ah yes, I know there is a rev instruction, I use it in the PASM version of fft_bench and heater_fft. I did not use it in the C version of FFT bench because as a benchmark it is supposed to be pure C only. However now that you mention it I'll probably put it into heater_fft and see what we gain from it.
But you must admit that integer bitReverse() is a damn cunning piece of code. Not mine of course. I thought it made a good bit of benchmarking code for an MCU what with all those logical ops going on.
Eric, we need REV in C.
Edit: deleted the attached version, see next post for a better one.
Whist tying to figure out how to parallelize this thing I started out by simplifying the code. Which kind of un-optimized it by introducing three multiplies calculating some array indices. That made it simpler to rearrange for parallelizing but slower because those muls were in the inner loops. I have now removed the muls by going back to adding a step size to the indices on every iteration. Seems to still work:) See attached.
Edit: Remove attachment, it turned out be buggy. See later post for update.
Is there a list of all the intrinsic functions for PropGCC somewhere? If not, can someone point me to the source file where they are implemented?
I think it's something I would like to add to my wiki page
===Jac
That version seems a bit slower (28ms). Still, it's a definite improvement on the original.
(I'm guessing that GCC can probably strength-reduce multiplies by loop indexes into adds, but I'm not certain.)
Eric
Page 566? This is a joke .... isn't it?
Nope, right there in section 6.54.13.
I just managed to get hold of a Gadget Gangster Prop board and pull your latest propgcc performance branch. Here are the results I'm getting. Seems you fixed whatever was up with -O2.
Problem is that the max four core version produces the wrong results when limited to 2 cores (omp_set_num_threads(2)).
Like so:
As such I cannot compare the 2 core speeds for each version.
I don't get that erroneous output when running on my four core Intel box. But potentially if there is a race condition in that case it may not show up on the PC what with having such different timing.
Prosser: But you did find the plans.
Authur: Yes, I found them. In a locked filing cabinet in a disused lavatory behind a door that said "Beware of the tiger".
Prosser: That's our propgcc department.
LOL
Some might say that PDF is the C/C++ programmer's equivalent of Vogon poetry ;-)
(And I really think I should read it some time, too!)
Thanks for finding that page!
===Jac
Most people (including Ross apparently) have never needed to look at the GCC manual just to build programs. That's a great testament to me
I've read through a bunch of it with great admiration for you and your creation!
Not everything in the manual is applicable to the Propeller, and unfortunately gcov and gprof are among those, at least at this time :-(. (Well, you can try them ... they *might* work, but I strongly suspect they won't, since we haven't done any work to support them.)
I was reminded that gcc has a __builtin_unreachable( ). As far as I can tell, this works as expected. Doesn't that kinda make _NAKED (really: __attribute__((naked)) ) unnecessary? As far as I can tell, the following two code fragments generate pretty much the same code:
And while we're talking about builtin functions, shouldn't there also be a __builtin_propeller_min(unsigned) , __builtin_propeller_mins(int), __builtin_propeller_max(unsigned) and __builtin_propeller_maxs(int) ?
===Jac
Attached is a fixed version that has a #define to select a max of 2 or 4 cores. Define OMP_FOUR_CORES or OMP_TWO_CORES as desired. With no define it defaults to a single core.
Not quite. __attribute__((naked)) disables both the function prologue (saving of registers and setup) and epilogue (register restore and return). __builtin_unreachable() will only disable the epilogue. Note that __attribute__((naked)) is implemented on quite a few platforms, it's not really propeller specific.
You can achieve those with inline functions, e.g.: and similarly for the others. ABS and NEG are obvious too, I think :-). For the rotates use the well known:
One of these days I intend to post a list of how to achieve all the Propeller instructions in gcc, although it will have to wait for a few compiler changes (SUMZ and SUMNZ, for example, are only generated on the performance branch right now).
Eric
That looks very cool. Thanks, Heater! I'll try to look at it later -- right now I've been distracted by trying to get gdb set up and working (so as to better debug this kind of thing).
Eric
It's getting silly because all that paralleization code is getting a bit unwieldy and is not generally scaleable, what happens when Chip makes a 16 or 32 COG Propeller? Time to try and think how to make that auto-scaling.
Never mind GDB. Our great leader Linus Torvalds has good arguments against debuggers, which I will not restate here, and never uses them. My experience is that when I really need a debugger they don't help or even make the problem worse, often by hiding it. Normally that involves timing and race condition problems in interrupt or threaded code.
OpenMP support is much more important:)
Thanks! I was not aware of that.
I just verified that "a = (a > b) ? b : a;" is replaced with "MAX a, b" . Very impressive! It does mean that the C code for this construct is harder to read than the assembly, though :-)
===Jac
That makes it a lot less verbose and gives me clues as to how to get the FFT to automatically scale to the number of cores actually available, perhaps even more than 8.
Please see attachment.
Of course it may be more efficient to set an upper limit on the number of COGs it tries to use as that will cut down on the amount of slicing and dicing of the data and core scheduling it tries to do.
I have yet to try this on a Propeller. Perhaps later today.