improving CMM performance

jac_goudsmit · 2012-10-04 07:23

static unsigned int bitReverse(unsigned int x,  unsigned int length)
{
    x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
    x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
    x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
    x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
    x = (x >> 16) | (x << 16);
    return (x  >> (32 - length));
}

Are you aware the Prop has a built-in "REV" assembler instruction that can reverse some or all bits in a byte in 4 clocks?
I don't know if there's an intrinsic function declared for this (there probably should be!), but if there isn't, the code above can be replaced by something like

__asm__ (
  "REV %1, %2"
);

See page 343 of the Propeller Manual v1.2.

Disclaimer: I don't know much about FFT, except for the basic concept and for the fact that it involves bit reversing, so I didn't review the code thoroughly. I just knew that the Prop can do this in one instruction just because FFT is a common algorithm and needs it and I was curious as to whether you had used this in your implementation. :-)

===Jac

Heater. · 2012-10-04 07:44

jac_goudsmit,

Ah yes, I know there is a rev instruction, I use it in the PASM version of fft_bench and heater_fft. I did not use it in the C version of FFT bench because as a benchmark it is supposed to be pure C only. However now that you mention it I'll probably put it into heater_fft and see what we gain from it.

But you must admit that integer bitReverse() is a damn cunning piece of code. Not mine of course. I thought it made a good bit of benchmarking code for an MCU what with all those logical ops going on.

Eric, we need REV in C.

Dave Hein · 2012-10-04 08:18

You can use __builtin_propeller_rev in PropGCC.

Heater. · 2012-10-04 14:17

Here is a version of parallel fft_bench that only supports 2 parallel cores. May be more efficient than the 4 core version. Sorry it uploads with the same shortened file name as the fft_bench_omp_4.c from this XP machine, no idea why.

Edit: deleted the attached version, see next post for a better one.

Heater. · 2012-10-04 21:22

A further optimization to parallel FFT.

Whist tying to figure out how to parallelize this thing I started out by simplifying the code. Which kind of un-optimized it by introducing three multiplies calculating some array indices. That made it simpler to rearrange for parallelizing but slower because those muls were in the inner loops. I have now removed the muls by going back to adding a step size to the indices on every iteration. Seems to still work:) See attached.

Edit: Remove attachment, it turned out be buggy. See later post for update.

jac_goudsmit · 2012-10-04 22:34

Dave Hein wrote: »

You can use __builtin_propeller_rev in PropGCC.

Is there a list of all the intrinsic functions for PropGCC somewhere? If not, can someone point me to the source file where they are implemented?

I think it's something I would like to add to my wiki page

===Jac

ersmith · 2012-10-05 04:55

Heater. wrote: »

A further optimization to parallel FFT.

Whist tying to figure out how to parallelize this thing I started out by simplifying the code. Which kind of un-optimized it by introducing three multiplies calculating some array indices. That made it simpler to rearrange for parallelizing but slower because those muls were in the inner loops. I have now removed the muls by going back to adding a step size to the indices on every iteration. Seems to still work:) See attached.

That version seems a bit slower (28ms). Still, it's a definite improvement on the original.

(I'm guessing that GCC can probably strength-reduce multiplies by loop indexes into adds, but I'm not certain.)

ersmith · 2012-10-05 05:00

jac_goudsmit wrote: »

Is there a list of all the intrinsic functions for PropGCC somewhere? If not, can someone point me to the source file where they are implemented?

It's on page 566 of the GCC manual (http://code.google.com/p/propgcc/downloads/detail?name=gcc.pdf).

Eric

RossH · 2012-10-05 06:09

ersmith wrote: »

It's on page 566 of the GCC manual (http://code.google.com/p/propgcc/downloads/detail?name=gcc.pdf).

Eric

Page 566? This is a joke .... isn't it?

ctwardell · 2012-10-05 06:13

RossH wrote: »

Page 566? This is a joke .... isn't it?

Nope, right there in section 6.54.13.

RossH · 2012-10-05 06:39

ctwardell wrote: »

Nope, right there in section 6.54.13.

And here I was thinking the Dewey Decimal system was old hat! Glad to see it is still alive and well

Heater. · 2012-10-05 06:49

Eric,

I just managed to get hold of a Gadget Gangster Prop board and pull your latest propgcc performance branch. Here are the results I'm getting.

fft_bech_omp_2.c (Max 2 cores)
----------------
Opt     Threads     ms
---     -------     --
-O2     1           48
-O2     2           28

-Os     1          135
-Os     2           75

fft_bech_omp_4.c (Max 4 cores)
----------------
Opt     Threads     ms
---     -------     --
-O2     1           51
-O2     2           30    * Erroneous output
-O2     3           29
-O2     4           21.6

-Os     1           63.3
-Os     2           36.5  * Erroneous output
-Os     3           35.7
-Os     4           24.4

Seems you fixed whatever was up with -O2.

Problem is that the max four core version produces the wrong results when limited to 2 cores (omp_set_num_threads(2)).
Like so:

OpenMP version = 3.0
Freq.    Magnitude
00000000 0000017F
00000080 0000003F
000000C0 000001C4
00000100 0000007F
00000140 000001FF
00000180 0000003F
000001C0 0000003F
00000200 0000017F
1024 point bit-reversal and butterfly run time = 30145 us

As such I cannot compare the 2 core speeds for each version.

I don't get that erroneous output when running on my four core Intel box. But potentially if there is a race condition in that case it may not show up on the PC what with having such different timing.

Heater. · 2012-10-05 07:10

Turns out get correct results when running the max for cores version with 2 cores if I turn off FCACHE. Like so:

fft_bench v1.1
OpenMP version = 3.0
Freq.    Magnitude
00000000 000001FE
000000C0 000001FF
00000140 000001FF
00000200 000001FF
1024 point bit-reversal and butterfly run time = 75727 us

Heater. · 2012-10-05 07:50

Originally Posted by RossH
Page 566? This is a joke .... isn't it?

Nope, right there in section 6.54.13.

Prosser: But you did find the plans.
Authur: Yes, I found them. In a locked filing cabinet in a disused lavatory behind a door that said "Beware of the tiger".
Prosser: That's our propgcc department.

jac_goudsmit · 2012-10-05 07:59

Heater. wrote: »

Prosser: But you did find the plans.
Arthur: Yes, I found them. In a locked filing cabinet in a disused lavatory behind a door that said "Beware of the tiger".
Prosser: That's our propgcc department.

LOL

Some might say that PDF is the C/C++ programmer's equivalent of Vogon poetry ;-)
(And I really think I should read it some time, too!)

Thanks for finding that page!

===Jac

jazzed · 2012-10-05 08:06

The GCC manual's size is the result of the parameter flexibility and amount of processor support built into it over many years.

Most people (including Ross apparently) have never needed to look at the GCC manual just to build programs. That's a great testament to me

SRLM · 2012-10-05 13:04

That manual has all sorts of little treats, such as gcov and gprof. Thanks for posting.

mindrobots · 2012-10-05 13:09

jazzed wrote: »

Most people (including Ross apparently) have never needed to look at the GCC manual just to build programs. That's a great testament to me

I've read through a bunch of it with great admiration for you and your creation!

ersmith · 2012-10-05 13:17

SRLM wrote: »

That manual has all sorts of little treats, such as gcov and gprof. Thanks for posting.

Not everything in the manual is applicable to the Propeller, and unfortunately gcov and gprof are among those, at least at this time :-(. (Well, you can try them ... they *might* work, but I strongly suspect they won't, since we haven't done any work to support them.)

jazzed · 2012-10-05 13:37

mindrobots wrote: »

I've read through a bunch of it with great admiration for you and your creation!

Ugh ! Oh please :sick: ... I meant "a testament in my opinion" ....

jac_goudsmit · 2012-10-06 00:08

Speaking of the GCC manual (and going further off-topic, sorry)

I was reminded that gcc has a __builtin_unreachable( ). As far as I can tell, this works as expected. Doesn't that kinda make _NAKED (really: __attribute__((naked)) ) unnecessary? As far as I can tell, the following two code fragments generate pretty much the same code:

_NAKED void stopme1(void)
{
    cogstop(cogid());
}

// stopme1( ); generates JMPRET lr, #_stopme1 in the assembly

_NATIVE void stopme2(void)
{
    cogstop(cogid());

    __builtin_unreachable();
}

// stopme2( ); generates CALL #_stopme2 in the assembly, and interestingly enough, compiles correctly even though there is no _stopme2_ret symbol???

And while we're talking about builtin functions, shouldn't there also be a __builtin_propeller_min(unsigned) , __builtin_propeller_mins(int), __builtin_propeller_max(unsigned) and __builtin_propeller_maxs(int) ?

===Jac

Heater. · 2012-10-06 03:40

I discovered the last parallel FFT version (for 2 cores) was buggy. It would fail if you tried to tweak it back to using 4 cores.
Attached is a fixed version that has a #define to select a max of 2 or 4 cores. Define OMP_FOUR_CORES or OMP_TWO_CORES as desired. With no define it defaults to a single core.

ersmith · 2012-10-06 04:58

jac_goudsmit wrote: »

Speaking of the GCC manual (and going further off-topic, sorry)

I was reminded that gcc has a __builtin_unreachable( ). As far as I can tell, this works as expected. Doesn't that kinda make _NAKED (really: __attribute__((naked)) ) unnecessary?

Not quite. __attribute__((naked)) disables both the function prologue (saving of registers and setup) and epilogue (register restore and return). __builtin_unreachable() will only disable the epilogue. Note that __attribute__((naked)) is implemented on quite a few platforms, it's not really propeller specific.

And while we're talking about builtin functions, shouldn't there also be a __builtin_propeller_min(unsigned) , __builtin_propeller_mins(int), __builtin_propeller_max(unsigned) and __builtin_propeller_maxs(int) ?

You can achieve those with inline functions, e.g.:

/* implement the propeller MIN instruction (not the same as mathematical minimum, rather its opposite) */
inline unsigned MIN(unsigned a, unsigned b) { return (a >= b) ? a : b; }

and similarly for the others. ABS and NEG are obvious too, I think :-). For the rotates use the well known:

inline unsigned ROL(unsigned a, unsigned b) { return (a<<b) | (a>>(32-b)); }

One of these days I intend to post a list of how to achieve all the Propeller instructions in gcc, although it will have to wait for a few compiler changes (SUMZ and SUMNZ, for example, are only generated on the performance branch right now).

Eric

ersmith · 2012-10-06 04:59

Heater. wrote: »

I discovered the last parallel FFT version (for 2 cores) was buggy. It would fail if you tried to tweak it back to using 4 cores.
Attached is a fixed version that has a #define to select a max of 2 or 4 cores. Define OMP_FOUR_CORES or OMP_TWO_CORES as desired. With no define it defaults to a single core.

That looks very cool. Thanks, Heater! I'll try to look at it later -- right now I've been distracted by trying to get gdb set up and working (so as to better debug this kind of thing).

Eric

Heater. · 2012-10-06 15:55

This is getting silly. Attached is the fft_bench with option for a maximum of eight cores. Well the Prop has eight COGs so it has to be done. Is this even runnable? Hopefully if there are only 5, 6 or 7 COGs available it will still work just using what it can find. Just define OMP_EIGHT_CORES.

It's getting silly because all that paralleization code is getting a bit unwieldy and is not generally scaleable, what happens when Chip makes a 16 or 32 COG Propeller? Time to try and think how to make that auto-scaling.

Heater. · 2012-10-06 15:59

Eric,

Never mind GDB. Our great leader Linus Torvalds has good arguments against debuggers, which I will not restate here, and never uses them. My experience is that when I really need a debugger they don't help or even make the problem worse, often by hiding it. Normally that involves timing and race condition problems in interrupt or threaded code.

OpenMP support is much more important:)

jac_goudsmit · 2012-10-06 21:49

ersmith wrote: »

Not quite. __attribute__((naked)) disables both the function prologue (saving of registers and setup) and epilogue (register restore and return). __builtin_unreachable() will only disable the epilogue. Note that __attribute__((naked)) is implemented on quite a few platforms, it's not really propeller specific.

Thanks! I was not aware of that.

You can achieve those [min/mins/max/maxs] with inline functions, e.g.:...

I just verified that "a = (a > b) ? b : a;" is replaced with "MAX a, b" . Very impressive! It does mean that the C code for this construct is harder to read than the assembly, though :-)

===Jac

Dave Hein · 2012-10-07 06:12

Actually, the C version is clear to me and the PASM version is confusing because it uses the term "MAX" to mean "limit the maximum value". I would have named this instruction "MIN" instead since this is the normal meaning of the min function in every other computer language, mathematical description and standards document.

Heater. · 2012-10-07 12:58

In an effort to make my OpenMP FFT a bit less silly I have swapped from using "omp parallel sections" to "omp parallel for".
That makes it a lot less verbose and gives me clues as to how to get the FFT to automatically scale to the number of cores actually available, perhaps even more than 8.
Please see attachment.

Heater. · 2012-10-07 22:55

Yay! For those who have a 16 COG Propeller or other machines the attached FFT bench will automatically scale the number of cores/COGs it uses from 1 to 16 depending on what you have available, with no code changes required!

Of course it may be more efficient to set an upper limit on the number of COGs it tries to use as that will cut down on the amount of slicing and dicing of the data and core scheduling it tries to do.

I have yet to try this on a Propeller. Perhaps later today.

improving CMM performance

Comments