Shop OBEX P1 Docs P2 Docs Learn Events
improving CMM performance - Page 2 — Parallax Forums

improving CMM performance

24

Comments

  • jac_goudsmitjac_goudsmit Posts: 418
    edited 2012-10-04 07:23
    static unsigned int bitReverse(unsigned int x,  unsigned int length)
    {
        x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
        x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
        x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
        x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
        x = (x >> 16) | (x << 16);
        return (x  >> (32 - length));
    }
    

    Are you aware the Prop has a built-in "REV" assembler instruction that can reverse some or all bits in a byte in 4 clocks?
    I don't know if there's an intrinsic function declared for this (there probably should be!), but if there isn't, the code above can be replaced by something like
    __asm__ (
      "REV %1, %2"
    );
    

    See page 343 of the Propeller Manual v1.2.

    Disclaimer: I don't know much about FFT, except for the basic concept and for the fact that it involves bit reversing, so I didn't review the code thoroughly. I just knew that the Prop can do this in one instruction just because FFT is a common algorithm and needs it and I was curious as to whether you had used this in your implementation. :-)

    ===Jac
  • Heater.Heater. Posts: 21,230
    edited 2012-10-04 07:44
    jac_goudsmit,

    Ah yes, I know there is a rev instruction, I use it in the PASM version of fft_bench and heater_fft. I did not use it in the C version of FFT bench because as a benchmark it is supposed to be pure C only. However now that you mention it I'll probably put it into heater_fft and see what we gain from it.

    But you must admit that integer bitReverse() is a damn cunning piece of code. Not mine of course. I thought it made a good bit of benchmarking code for an MCU what with all those logical ops going on.

    Eric, we need REV in C.
  • Dave HeinDave Hein Posts: 6,347
    edited 2012-10-04 08:18
    You can use __builtin_propeller_rev in PropGCC.
  • Heater.Heater. Posts: 21,230
    edited 2012-10-04 14:17
    Here is a version of parallel fft_bench that only supports 2 parallel cores. May be more efficient than the 4 core version. Sorry it uploads with the same shortened file name as the fft_bench_omp_4.c from this XP machine, no idea why.

    Edit: deleted the attached version, see next post for a better one.
  • Heater.Heater. Posts: 21,230
    edited 2012-10-04 21:22
    A further optimization to parallel FFT.

    Whist tying to figure out how to parallelize this thing I started out by simplifying the code. Which kind of un-optimized it by introducing three multiplies calculating some array indices. That made it simpler to rearrange for parallelizing but slower because those muls were in the inner loops. I have now removed the muls by going back to adding a step size to the indices on every iteration. Seems to still work:) See attached.

    Edit: Remove attachment, it turned out be buggy. See later post for update.
  • jac_goudsmitjac_goudsmit Posts: 418
    edited 2012-10-04 22:34
    Dave Hein wrote: »
    You can use __builtin_propeller_rev in PropGCC.

    Is there a list of all the intrinsic functions for PropGCC somewhere? If not, can someone point me to the source file where they are implemented?

    I think it's something I would like to add to my wiki page

    ===Jac
  • ersmithersmith Posts: 6,093
    edited 2012-10-05 04:55
    Heater. wrote: »
    A further optimization to parallel FFT.

    Whist tying to figure out how to parallelize this thing I started out by simplifying the code. Which kind of un-optimized it by introducing three multiplies calculating some array indices. That made it simpler to rearrange for parallelizing but slower because those muls were in the inner loops. I have now removed the muls by going back to adding a step size to the indices on every iteration. Seems to still work:) See attached.

    That version seems a bit slower (28ms). Still, it's a definite improvement on the original.

    (I'm guessing that GCC can probably strength-reduce multiplies by loop indexes into adds, but I'm not certain.)
  • ersmithersmith Posts: 6,093
    edited 2012-10-05 05:00
    Is there a list of all the intrinsic functions for PropGCC somewhere? If not, can someone point me to the source file where they are implemented?
    It's on page 566 of the GCC manual (http://code.google.com/p/propgcc/downloads/detail?name=gcc.pdf).

    Eric
  • RossHRossH Posts: 5,510
    edited 2012-10-05 06:09
    ersmith wrote: »
    It's on page 566 of the GCC manual (http://code.google.com/p/propgcc/downloads/detail?name=gcc.pdf).

    Eric

    Page 566? This is a joke .... isn't it?
  • ctwardellctwardell Posts: 1,716
    edited 2012-10-05 06:13
    RossH wrote: »
    Page 566? This is a joke .... isn't it?

    Nope, right there in section 6.54.13.
  • RossHRossH Posts: 5,510
    edited 2012-10-05 06:39
    ctwardell wrote: »
    Nope, right there in section 6.54.13.
    And here I was thinking the Dewey Decimal system was old hat! Glad to see it is still alive and well :lol:
  • Heater.Heater. Posts: 21,230
    edited 2012-10-05 06:49
    Eric,

    I just managed to get hold of a Gadget Gangster Prop board and pull your latest propgcc performance branch. Here are the results I'm getting.
    fft_bech_omp_2.c (Max 2 cores)
    ----------------
    Opt     Threads     ms
    ---     -------     --
    -O2     1           48
    -O2     2           28
    
    -Os     1          135
    -Os     2           75
    
    fft_bech_omp_4.c (Max 4 cores)
    ----------------
    Opt     Threads     ms
    ---     -------     --
    -O2     1           51
    -O2     2           30    * Erroneous output
    -O2     3           29
    -O2     4           21.6
    
    -Os     1           63.3
    -Os     2           36.5  * Erroneous output
    -Os     3           35.7
    -Os     4           24.4
    
    Seems you fixed whatever was up with -O2.

    Problem is that the max four core version produces the wrong results when limited to 2 cores (omp_set_num_threads(2)).
    Like so:
    OpenMP version = 3.0
    Freq.    Magnitude
    00000000 0000017F
    00000080 0000003F
    000000C0 000001C4
    00000100 0000007F
    00000140 000001FF
    00000180 0000003F
    000001C0 0000003F
    00000200 0000017F
    1024 point bit-reversal and butterfly run time = 30145 us
    
    As such I cannot compare the 2 core speeds for each version.

    I don't get that erroneous output when running on my four core Intel box. But potentially if there is a race condition in that case it may not show up on the PC what with having such different timing.
  • Heater.Heater. Posts: 21,230
    edited 2012-10-05 07:10
    Turns out get correct results when running the max for cores version with 2 cores if I turn off FCACHE. Like so:
    fft_bench v1.1
    OpenMP version = 3.0
    Freq.    Magnitude
    00000000 000001FE
    000000C0 000001FF
    00000140 000001FF
    00000200 000001FF
    1024 point bit-reversal and butterfly run time = 75727 us
    
  • Heater.Heater. Posts: 21,230
    edited 2012-10-05 07:50
    quote_icon.png Originally Posted by RossH viewpost-right.png
    Page 566? This is a joke .... isn't it?



    Nope, right there in section 6.54.13.


    Prosser: But you did find the plans.
    Authur: Yes, I found them. In a locked filing cabinet in a disused lavatory behind a door that said "Beware of the tiger".
    Prosser: That's our propgcc department.
  • jac_goudsmitjac_goudsmit Posts: 418
    edited 2012-10-05 07:59
    Heater. wrote: »
    Prosser: But you did find the plans.
    Arthur: Yes, I found them. In a locked filing cabinet in a disused lavatory behind a door that said "Beware of the tiger".
    Prosser: That's our propgcc department.

    LOL

    Some might say that PDF is the C/C++ programmer's equivalent of Vogon poetry ;-)
    (And I really think I should read it some time, too!)

    Thanks for finding that page!

    ===Jac
  • jazzedjazzed Posts: 11,803
    edited 2012-10-05 08:06
    The GCC manual's size is the result of the parameter flexibility and amount of processor support built into it over many years.

    Most people (including Ross apparently) have never needed to look at the GCC manual just to build programs. That's a great testament to me :)
  • SRLMSRLM Posts: 5,045
    edited 2012-10-05 13:04
    That manual has all sorts of little treats, such as gcov and gprof. Thanks for posting.
  • mindrobotsmindrobots Posts: 6,506
    edited 2012-10-05 13:09
    jazzed wrote: »
    Most people (including Ross apparently) have never needed to look at the GCC manual just to build programs. That's a great testament to me :)

    I've read through a bunch of it with great admiration for you and your creation! :lol:
  • ersmithersmith Posts: 6,093
    edited 2012-10-05 13:17
    SRLM wrote: »
    That manual has all sorts of little treats, such as gcov and gprof. Thanks for posting.

    Not everything in the manual is applicable to the Propeller, and unfortunately gcov and gprof are among those, at least at this time :-(. (Well, you can try them ... they *might* work, but I strongly suspect they won't, since we haven't done any work to support them.)
  • jazzedjazzed Posts: 11,803
    edited 2012-10-05 13:37
    mindrobots wrote: »
    I've read through a bunch of it with great admiration for you and your creation! :lol:
    Ugh ! Oh please :sick: ... I meant "a testament in my opinion" ....
  • jac_goudsmitjac_goudsmit Posts: 418
    edited 2012-10-06 00:08
    Speaking of the GCC manual (and going further off-topic, sorry)

    I was reminded that gcc has a __builtin_unreachable( ). As far as I can tell, this works as expected. Doesn't that kinda make _NAKED (really: __attribute__((naked)) ) unnecessary? As far as I can tell, the following two code fragments generate pretty much the same code:
    _NAKED void stopme1(void)
    {
        cogstop(cogid());
    }
    
    // stopme1( ); generates JMPRET lr, #_stopme1 in the assembly
    
    _NATIVE void stopme2(void)
    {
        cogstop(cogid());
    
        __builtin_unreachable();
    }
    
    // stopme2( ); generates CALL #_stopme2 in the assembly, and interestingly enough, compiles correctly even though there is no _stopme2_ret symbol???
    


    And while we're talking about builtin functions, shouldn't there also be a __builtin_propeller_min(unsigned) , __builtin_propeller_mins(int), __builtin_propeller_max(unsigned) and __builtin_propeller_maxs(int) ?

    ===Jac
  • Heater.Heater. Posts: 21,230
    edited 2012-10-06 03:40
    I discovered the last parallel FFT version (for 2 cores) was buggy. It would fail if you tried to tweak it back to using 4 cores.
    Attached is a fixed version that has a #define to select a max of 2 or 4 cores. Define OMP_FOUR_CORES or OMP_TWO_CORES as desired. With no define it defaults to a single core.
    c
    c
    21K
  • ersmithersmith Posts: 6,093
    edited 2012-10-06 04:58
    Speaking of the GCC manual (and going further off-topic, sorry)

    I was reminded that gcc has a __builtin_unreachable( ). As far as I can tell, this works as expected. Doesn't that kinda make _NAKED (really: __attribute__((naked)) ) unnecessary?

    Not quite. __attribute__((naked)) disables both the function prologue (saving of registers and setup) and epilogue (register restore and return). __builtin_unreachable() will only disable the epilogue. Note that __attribute__((naked)) is implemented on quite a few platforms, it's not really propeller specific.
    And while we're talking about builtin functions, shouldn't there also be a __builtin_propeller_min(unsigned) , __builtin_propeller_mins(int), __builtin_propeller_max(unsigned) and __builtin_propeller_maxs(int) ?

    You can achieve those with inline functions, e.g.:
    /* implement the propeller MIN instruction (not the same as mathematical minimum, rather its opposite) */
    inline unsigned MIN(unsigned a, unsigned b) { return (a >= b) ? a : b; }
    
    and similarly for the others. ABS and NEG are obvious too, I think :-). For the rotates use the well known:
    inline unsigned ROL(unsigned a, unsigned b) { return (a<<b) | (a>>(32-b)); }
    

    One of these days I intend to post a list of how to achieve all the Propeller instructions in gcc, although it will have to wait for a few compiler changes (SUMZ and SUMNZ, for example, are only generated on the performance branch right now).

    Eric
  • ersmithersmith Posts: 6,093
    edited 2012-10-06 04:59
    Heater. wrote: »
    I discovered the last parallel FFT version (for 2 cores) was buggy. It would fail if you tried to tweak it back to using 4 cores.
    Attached is a fixed version that has a #define to select a max of 2 or 4 cores. Define OMP_FOUR_CORES or OMP_TWO_CORES as desired. With no define it defaults to a single core.

    That looks very cool. Thanks, Heater! I'll try to look at it later -- right now I've been distracted by trying to get gdb set up and working (so as to better debug this kind of thing).

    Eric
  • Heater.Heater. Posts: 21,230
    edited 2012-10-06 15:55
    This is getting silly. Attached is the fft_bench with option for a maximum of eight cores. Well the Prop has eight COGs so it has to be done. Is this even runnable? Hopefully if there are only 5, 6 or 7 COGs available it will still work just using what it can find. Just define OMP_EIGHT_CORES.

    It's getting silly because all that paralleization code is getting a bit unwieldy and is not generally scaleable, what happens when Chip makes a 16 or 32 COG Propeller? Time to try and think how to make that auto-scaling.
    c
    c
    24K
  • Heater.Heater. Posts: 21,230
    edited 2012-10-06 15:59
    Eric,

    Never mind GDB. Our great leader Linus Torvalds has good arguments against debuggers, which I will not restate here, and never uses them. My experience is that when I really need a debugger they don't help or even make the problem worse, often by hiding it. Normally that involves timing and race condition problems in interrupt or threaded code.

    OpenMP support is much more important:)
  • jac_goudsmitjac_goudsmit Posts: 418
    edited 2012-10-06 21:49
    ersmith wrote: »
    Not quite. __attribute__((naked)) disables both the function prologue (saving of registers and setup) and epilogue (register restore and return). __builtin_unreachable() will only disable the epilogue. Note that __attribute__((naked)) is implemented on quite a few platforms, it's not really propeller specific.

    Thanks! I was not aware of that.
    You can achieve those [min/mins/max/maxs] with inline functions, e.g.:...

    I just verified that "a = (a > b) ? b : a;" is replaced with "MAX a, b" . Very impressive! It does mean that the C code for this construct is harder to read than the assembly, though :-)

    ===Jac
  • Dave HeinDave Hein Posts: 6,347
    edited 2012-10-07 06:12
    Actually, the C version is clear to me and the PASM version is confusing because it uses the term "MAX" to mean "limit the maximum value". I would have named this instruction "MIN" instead since this is the normal meaning of the min function in every other computer language, mathematical description and standards document.
  • Heater.Heater. Posts: 21,230
    edited 2012-10-07 12:58
    In an effort to make my OpenMP FFT a bit less silly I have swapped from using "omp parallel sections" to "omp parallel for".
    That makes it a lot less verbose and gives me clues as to how to get the FFT to automatically scale to the number of cores actually available, perhaps even more than 8.
    Please see attachment.
    c
    c
    22K
  • Heater.Heater. Posts: 21,230
    edited 2012-10-07 22:55
    Yay! For those who have a 16 COG Propeller or other machines the attached FFT bench will automatically scale the number of cores/COGs it uses from 1 to 16 depending on what you have available, with no code changes required!

    Of course it may be more efficient to set an upper limit on the number of COGs it tries to use as that will cut down on the amount of slicing and dicing of the data and core scheduling it tries to do.

    I have yet to try this on a Propeller. Perhaps later today.
    c
    c
    21K
Sign In or Register to comment.