improving CMM performance

RossH · 2012-10-08 01:53

Heater. wrote: »

Yay! For those who have a 16 COG Propeller or other machines the attached FFT bench will automatically scale the number of cores/COGs it uses from 1 to 16 depending on what you have available, with no code changes required!

Of course it may be more efficient to set an upper limit on the number of COGs it tries to use as that will cut down on the amount of slicing and dicing of the data and core scheduling it tries to do.

I have yet to try this on a Propeller. Perhaps later today.

A 16 cog Propeller? I want one of those!

Seriously, though - this is quite interesting. I'll be interested to see how much it speeds things up.

Heater. · 2012-10-08 02:55

RossH,

A 16 cog Propeller? I want one of those!

Me too. Just preparing for the future:) Well actually I voted against 16 COGs when Chip asked the forum if we wanted more COGs OR more RAM in Prop II. At the time I reasoned that halving the bandwidth from COG to HUB was not a nice idea.

Seriously, though - this is quite interesting. I'll be interested to see how much it speeds things up.

I guess you will have to become a GCC user to be able to try it out:) On the other hand what about OpenMP for Catalina?

Over the years I have read many times that "Parallel processing is hard". I would think, pah, I have been working on parallel processing solutions using microprocessors since 1980, and it wasn't so hard.

Well, taking that FFT and parallelizing it, especially in a scalable way, gave me a real headache. I start to see what they mean. I did find an existing OpenMP FFT on the net and it was impossible to understand and actually slowed performance for such a small data set on my four core PC.

RossH · 2012-10-08 03:37

Heater. wrote: »

RossH,
I guess you will have to become a GCC user to be able to try it out:)

Do I have to read the 730 page manual just to compile it? Thanks, I'll pass.

Heater. wrote: »

On the other hand what about OpenMP for Catalina?

Yes, I've given it some thought, and played around with a few ideas. In terms of standards, I was leaning towards an implementation of Cilk, but OpenMP is still a possibility. But the reality is that the overheads involved in a full implementation mean neither is likely to be very practical on a "bare bones" Propeller - except possibly for smallish benchmark or tutorial programs (of which your FFT is an outstandingly elegant and stunningly efficient example, of course!).

The annoying thing is that the Propeller offers a unique opportunity for symmetric multiprocessing, but trying to comply with one of the "standard" implementations adds so much overhead that it becomes impractical for most real applications. A better alternative may be to come up with something that works well on the Propeller because it maps directly to the Propeller architecture - and that's the one I'm currently most inclined towards. Not least because that's also the one that would be the most fun to implement!

Ross.

ersmith · 2012-10-08 04:47

RossH wrote: »

Yes, I've given it some thought, and played around with a few ideas. In terms of standards, I was leaning towards an implementation of Cilk, but OpenMP is still a possibility. But the reality is that the overheads involved in a full implementation mean neither is likely to be very practical on a "bare bones" Propeller - except possibly for smallish benchmark or tutorial programs (of which your FFT is an outstandingly elegant and stunningly efficient example, of course!).

There doesn't have to be that much overhead on the Propeller side. The current "tinyomp" implementation in PropGCC is 1420 bytes of code in LMM mode and 592 bytes in CMM mode. Granted, that's not a full OpenMP 3.0 implementation, but it implements enough to be very useful. Threads are mapped directly to COGs, so it's very efficient and low-overhead.
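For illustration, a minimal sketch of what this looks like from the user's side, with an upper limit on the number of threads. `omp_set_num_threads` is the call Heater uses later in this thread; `omp_get_num_procs` is standard OpenMP, but whether tinyomp provides it is an assumption here. `MAX_WORKERS`, `pick_workers` and `square_all` are hypothetical names. The `_OPENMP` guard lets the same file build without OpenMP, in which case the loop simply runs on one core.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_WORKERS 4  /* hypothetical upper limit on threads/COGs */

/* Clamp the number of workers to the range [1, cap]. */
int pick_workers(int available, int cap)
{
    if (available < 1)
        return 1;
    return available < cap ? available : cap;
}

/* Square every element. Each iteration writes only its own slot,
   so the iterations can be handed to different threads safely. */
void square_all(const int32_t *in, int32_t *out, int n)
{
    int i;
    assert(n >= 0);
#ifdef _OPENMP
    /* Assumption: the runtime reports how many cores/COGs are available. */
    omp_set_num_threads(pick_workers(omp_get_num_procs(), MAX_WORKERS));
#endif
#pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}
```

Because every iteration is independent, the result is the same whether one thread or several run the loop; only the run time changes.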

It is a bit of a pain on the compiler side, but fortunately other people did all that work for GCC. I'm not sure if anyone's done it for LCC yet, but it's certainly possible. Or, you could always switch to using GCC as the compiler engine for Catalina...

Eric

Heater. · 2012-10-08 07:43

Eric,

What we need here is the main program, including the tinyomp library, to be compiled as CMM for small code size, but the code/function that is parallelized, butterflies in this case, to be FCACHEd or native COG code. In that way we get maximum performance for smallest code size. Rather like starting COGs running PASM in an otherwise Spin program.

I'm not so much thinking about number crunching code like the FFT here but just normal applications that would like to have some parts running at max speed whilst the whole needs to fit in the limited HUB space.

Is there any way to achieve this?

Next question might be: given my code has some functions that I want to end up as native COG code running in a COG of their own, can the compiler tell me that it has actually done that? I notice that when I allow FCACHE I get no indication at build time whether anything is FCACHEd or not.

ersmith · 2012-10-08 07:47

Heater. wrote: »

What we need here is the main program, including the tinyomp library, to be compiled as CMM for small code size, but the code/function that is parallelized, butterflies in this case, to be FCACHEd or native COG code. In that way we get maximum performance for smallest code size.

CMM mode does this already. The libraries are all compiled with -Os to minimize size (which also disables FCACHE). But the user code may be compiled with -O2 or with -mfcache to enable FCACHE mode.

Next question might be: given my code has some functions that I want to end up as native COG code running in a COG of their own, can the compiler tell me that it has actually done that? I notice that when I allow FCACHE I get no indication at build time whether anything is FCACHEd or not.

There should be a warning if any code with __attribute__((fcache)) does not end up being FCACHEd. If there isn't then there's a bug!

Eric

Heater. · 2012-10-08 08:54

Eric,

Fantastic.

Now where do I put this "__attribute__((fcache))" in my FFT? Would it be like so:

__attribute__((fcache))
static void butterflies(int32_t* bx, int32_t* by, int32_t firstLevel, int32_t lastLevel, int32_t fix)

I guess I cannot just drop that attribute on a for loop that I want cached.

Sadly I cannot test the latest FFT on a real Propeller until tomorrow. Has anyone tried it?

Edit: Changed code snippet as I had cut and pasted the wrong thing.

ersmith · 2012-10-08 09:42

Heater. wrote: »

Now where do I put this "__attribute__((fcache))" in my FFT?

It's a function attribute, it goes on the function (and indicates that the whole function should go in the FCACHE, not just any loops inside it -- this makes it very useful for things like serial communications where you want a known timing between the setup code and the first loop iteration).
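As a rough sketch of the placement: the attribute sits on the function itself, not on a loop inside it. The `FCACHED` macro and `increment_all` are hypothetical, and the guard assumes the toolchain defines `__PROPELLER__`; on any other compiler the macro expands to nothing, so the code still builds. The body is deliberately tiny, since a function that does not fit cannot go in FCACHE.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: PropGCC defines __PROPELLER__. Elsewhere the attribute
   is dropped so the example compiles with any C compiler. */
#if defined(__PROPELLER__)
#define FCACHED __attribute__((fcache))
#else
#define FCACHED
#endif

/* Add 1 to each of n consecutive 32-bit values.
   The attribute applies to the whole function, setup code included. */
FCACHED
void increment_all(int32_t *p, int32_t n)
{
    assert(p != 0);
    while (n-- > 0) {
        *p += 1;
        p++;
    }
}
```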

I see that in fact there isn't any warning printed when the function cannot go in FCACHE, so there is a bug somewhere. I'll try to fix it.

In general the way to figure out if code is in FCACHE or not is to look at the assembly output. An FCACHE'd loop starts with a jump to __LMM_FCACHE_LOAD, like:

        jmp     #__LMM_FCACHE_LOAD
        long    .L6-.L5
.L5
        jmp     #__LMM_FCACHE_START+(.L2-.L5)
.L3
        rdlong  r6, r0
        add     r6, #1
        wrlong  r6, r0
        add     r0, #4
.L2
        djnz    r7,#__LMM_FCACHE_START+(.L3-.L5)
        jmp     __LMM_RET
        .compress default
.L6

Sadly I cannot test the latest FFT on a real Propeller until tomorrow. Has anyone tried it?

I tried it, but it hangs -- I guess it's running into the same bug we saw earlier.

Eric

RossH · 2012-10-09 01:57

ersmith wrote: »

Or, you could always switch to using GCC as the compiler engine for Catalina...
Eric

LOL!

David Betz · 2012-10-09 04:42

RossH wrote: »

LOL!

The idea of using GCC as the compiler engine for Catalina doesn't seem that absurd. It seems like the biggest things that distinguish Catalina from PropGCC are its simple installer, its IDE choices, and its runtime environment with HMI, registry, plug-ins, etc. These could all be retained even with a different compiler. Anyway, I know it isn't going to happen. I'm just saying that Catalina could still retain its unique "flavor" even if the compiler engine were replaced.

Heater. · 2012-10-09 05:14

Eric,

Help!

I have managed to get fft_bench_omp working, or at least giving the correct results. Please see attached.

In this version I have used only 4 threads (omp_set_num_threads(4)) and configured the fft to use four threads maximum.
I have removed the "omp parallel for" from around the bit reversal decimation loop.
I added __attribute__((fcache)) to the butterflies() function.

BUT here is the biggie: to get it to work I added a delay loop prior to calling butterflies(). By changing the length of this delay I can get different results out; eventually, with a long enough delay, the correct result comes out as shown below. Note that moving the delay to after the butterflies call brings us back to failure. This does not seem to care about CMM/LMM or -O2/-Os.

Also turns out that butterflies is now a couple of statements too big to fit in FCACHE.

This does run on my 4 core PC, but in case I have a race condition I have been inserting delays here and there in the butterflies in the hope of triggering a similar error situation. So far it just works.

delay of zero loops
--------------------------
fft_bench v1.5
OpenMP version = 3.0
Freq.    Magnitude
00000000 000001FF
00000001 000001FE
000000C0 000001FF
00000100 000001FE
00000101 00000470
00000140 000001FF
00000200 000001FF
1024 point bit-reversal and butterfly run time = 226565 us

delay of 100 loops (gets a bit better)
-------------------------
fft_bench v1.5
OpenMP version = 3.0
Freq.    Magnitude
00000000 000001FE
000000C0 000001FF
00000100 000001FE
00000140 000001FF
00000200 000001FF
1024 point bit-reversal and butterfly run time = 228141 us

delay of 200 loops (no change)
-------------------------
fft_bench v1.5
OpenMP version = 3.0
Freq.    Magnitude
00000000 000001FE
000000C0 000001FF
00000100 000001FE
00000140 000001FF
00000200 000001FF
1024 point bit-reversal and butterfly run time = 229699 us

delay of 300 loops (Ahh, perfect !!!)
-------------------------
fft_bench v1.5
OpenMP version = 3.0
Freq.    Magnitude
00000000 000001FE
000000C0 000001FF
00000140 000001FF
00000200 000001FF
1024 point bit-reversal and butterfly run time = 231314 us

ersmith · 2012-10-09 05:49

Are the last few iterations of butterflies really going to work correctly with 4 processors working on them? It seems like there may be a race condition there, but I'm not sure. Trying to figure out which processor is reading and writing which array elements is really nasty, since the arrays are modified in place. If there were some way to change that so that we ping-pong buffers or something and never actually modify the array we're reading from it would make the parallelization much easier, but that's probably hard.
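To make the ping-pong idea concrete, here is a minimal sketch under stated assumptions. It does not use the real FFT butterflies; it swaps in a simpler add/subtract pass (a Walsh-Hadamard style transform) purely to show the buffer handling. `pass` and `transform_pingpong` are hypothetical names. Each pass reads only from `src` and writes only to `dst`, so no thread can read a value that another thread has already overwritten.

```c
#include <assert.h>
#include <stdint.h>

/* One pass: every dst[i] is computed from src only, never from dst. */
static void pass(const int32_t *src, int32_t *dst, int n, int h)
{
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++) {
        int partner = i ^ h;  /* the other element of this pair */
        dst[i] = (i & h) ? src[partner] - src[i]
                         : src[i] + src[partner];
    }
}

/* Run log2(n) passes, swapping the roles of the two buffers after each.
   n must be a power of two. Returns whichever buffer holds the result. */
int32_t *transform_pingpong(int32_t *a, int32_t *b, int n)
{
    int32_t *src = a, *dst = b, *tmp;
    int h;
    assert(n > 0 && (n & (n - 1)) == 0);
    for (h = 1; h < n; h <<= 1) {
        pass(src, dst, n, h);
        tmp = src;  /* the buffer just written becomes the next source */
        src = dst;
        dst = tmp;
    }
    return src;
}
```

The price is a second buffer of the same size as the data, which is a real cost when the whole program has to fit in limited HUB RAM.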

As for the fcache, there's really no need to put __attribute__((fcache)) on the butterflies function, since the whole thing is a loop -- if it will fit in fcache the compiler will do it automatically, if it won't then it will still try to put the inner loop into fcache.

Eric

Heater. · 2012-10-09 07:55

Eric,

Are the last few iterations of butterflies really going to work correctly
with 4 processors working on them?

A good question, and as far as I know, NO!
This is what was giving me the headache I referred to above. You will see that
I have a "slices" variable which determines how many ways the data is sliced and
hence how many threads, one slice each.

So, we start out with a 4 way slice, but we only do butterflies for levels 0 to
LOG2_FFT_SIZE - 3.

Then we do a 2 way slice, this time only acting on level LOG2_FFT_SIZE - 2.

And finally a 1 way slice (the whole thing on one thread), acting on only the
last level LOG2_FFT_SIZE - 1.

In this way no thread should ever operate on data outside its slice.

That "while (slices >= 1)" loop and the "omp parallel for" inside it should
manage all of that. Hopefully it makes the same sequence of calls to
butterflies() as I was previously doing manually with the "omp parallel
sections" version, which did actually work in most cases.
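The schedule described above can be sketched roughly as follows. This is not the code from the attached benchmark: `butterflies_stub` is a hypothetical stand-in with a different signature from the real butterflies(), and it only records which levels were applied to which elements so the schedule can be checked. The assert encodes the rule that a butterfly group at the last level of a pass must fit inside one slice.

```c
#include <assert.h>
#include <stdint.h>

#define LOG2_FFT_SIZE 10
#define FFT_SIZE (1 << LOG2_FFT_SIZE)

/* Hypothetical bookkeeping: how often each level touched each element. */
int32_t touched[LOG2_FFT_SIZE][FFT_SIZE];

/* Stand-in for butterflies(): covers elements [first, first + size). */
static void butterflies_stub(int32_t first, int32_t size,
                             int32_t firstLevel, int32_t lastLevel)
{
    int32_t level, i;
    /* A group at lastLevel spans 2^(lastLevel + 1) elements. */
    assert(((int32_t)1 << (lastLevel + 1)) <= size);
    for (level = firstLevel; level <= lastLevel; level++)
        for (i = first; i < first + size; i++)
            touched[level][i]++;
}

/* slices must be a power of two, e.g. 4 for four threads. */
void run_schedule(int32_t slices)
{
    int32_t log2Slices = 0, s, firstLevel, lastLevel;
    assert(slices >= 1 && (slices & (slices - 1)) == 0 && slices <= FFT_SIZE);
    for (s = slices; s > 1; s >>= 1)
        log2Slices++;
    firstLevel = 0;
    lastLevel = LOG2_FFT_SIZE - 1 - log2Slices;
    while (slices >= 1) {
        int32_t sliceSize = FFT_SIZE / slices;
#pragma omp parallel for
        for (s = 0; s < slices; s++)
            butterflies_stub(s * sliceSize, sliceSize, firstLevel, lastLevel);
        /* Halve the slice count and do exactly one more level per pass. */
        firstLevel = lastLevel + 1;
        lastLevel = firstLevel;
        slices /= 2;
    }
}
```

With `run_schedule(4)` this makes four calls for levels 0 to LOG2_FFT_SIZE - 3, two calls for level LOG2_FFT_SIZE - 2 and one call for level LOG2_FFT_SIZE - 1, so every level is applied to every element exactly once.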

I can try and check that call sequence is correct again, I did have some
printf's in there previously to check what is going on.

The ping pong thing is tricky.
I will remove that fcache attribute, it did at least show that your new warning
message does come out:)

RossH · 2012-10-10 04:51

David Betz wrote: »

The idea of using GCC as the compiler engine for Catalina doesn't seem that absurd. It seems like the biggest things that distinguish Catalina from PropGCC are its simple installer, its IDE choices, and its runtime environment with HMI, registry, plug-ins, etc. These could all be retained even with a different compiler. Anyway, I know it isn't going to happen. I'm just saying that Catalina could still retain its unique "flavor" even if the compiler engine were replaced.

Hi David,

The PropGCC team is free - as is anyone else - to pick up and run with the concept of the registry, plugins etc - as proposed here. It happens to work brilliantly with Catalina, but I have kept it independent of the actual language and/or compiler for exactly this purpose.

Ross.

David Betz · 2012-10-10 06:12

RossH wrote: »

Hi David,

The PropGCC team is free - as is anyone else - to pick up and run with the concept of the registry, plugins etc - as proposed here. It happens to work brilliantly with Catalina, but I have kept it independent of the actual language and/or compiler for exactly this purpose.

Ross.

Thanks but that isn't really what I meant.

RossH · 2012-10-11 00:50

David Betz wrote: »

Thanks but that isn't really what I meant.

Then I guess I don't understand what you did mean. I'd say around 50% of the "flavor" of Catalina is the ease with which you can integrate existing Spin/PASM objects into the runtime environment. The other 50% is the ease with which you can add new code-generation features (such as CMM). Since I don't think anybody could seriously claim that GCC is easier in this area (it is in fact notoriously difficult) then without the registry/plugin stuff, what would be the benefit? Yes, I suppose GCC would get an IDE and a Windows installer, but Catalina would lose a source level debugger.

Ross.

David Betz · 2012-10-11 06:14

RossH wrote: »

Then I guess I don't understand what you did mean. I'd say around 50% of the "flavor" of Catalina is the ease with which you can integrate existing Spin/PASM objects into the runtime environment. The other 50% is the ease with which you can add new code-generation features (such as CMM). Since I don't think anybody could seriously claim that GCC is easier in this area (it is in fact notoriously difficult) then without the registry/plugin stuff, what would be the benefit? Yes, I suppose GCC would get an IDE and a Windows installer, but Catalina would lose a source level debugger.

Ross.

Sorry, I guess I didn't realize that the ease of modifying the LCC code generator was a big factor in the success of Catalina. I was only looking at things that are apparent from the outside. So maybe the idea of replacing LCC with GCC in Catalina was a crazy idea after all! :-(

Dave Hein · 2012-10-11 06:24

David, I don't think it was a crazy idea. Looking at it objectively, it's an excellent idea. This would combine the best features of both compilers into one powerful development system. However, due to other factors, it will probably never happen. It's a shame that we can't put aside our differences and produce a single development system that's better than either one by itself.

David Betz · 2012-10-11 06:25

Dave Hein wrote: »

David, I don't think it was a crazy idea. Looking at it objectively, it's an excellent idea. This would combine the best features of both compilers into one powerful development system. However, due to other factors, it will probably never happen. It's a shame that we can't put aside our differences and produce a single development system that's better than either one by itself.

Yes, it is.

jazzed · 2012-10-11 09:06

RossH wrote: »

then without the registry/plugin stuff, what would be the benefit?

Well since there is some consensus that working together may have benefit, I look forward to someone implementing registry stuff since it's "so beneficial." I don't recall what a catalina plugin is exactly. Maybe we can get a refresher.

When are you going to make it easy for anyone other than Ross to add external memory solutions? A few people have already done this with propeller-gcc - one with zero guidance, and another with a few Q&A. I tried using your solution for several months and gave up even with your guidance.

When are you going to add a way to compile COG C code?

When are you going to provide a single simple way to specify a board without having to dig through all your sources?

RossH wrote: »

Yes, I suppose GCC would get an IDE and a Windows installer ....

You are saying that propeller-gcc doesn't have an IDE or Windows installer?
At best this is a distortion of the truth.

AntoineDoinel · 2012-10-11 11:01

jazzed wrote: »

When are you going to make it easy for anyone other than Ross to add external memory solutions? A few people have already done this with propeller-gcc - one with zero guidance, and another with a few Q&A. I tried using your solution for several months and gave up even with your guidance.

Steve, to be fair I've tried both, and sure, PropGCC drivers are more streamlined.
But overall it's not much more difficult with Catalina.

It used to be hard in earlier Catalina releases, because you had to keep code aligned in different files.
That changed long ago, maybe version 2.4 or earlier.

jazzed · 2012-10-11 12:21

AntoineDoinel wrote: »

Steve, to be fair I've tried both, and sure, PropGCC drivers are more streamlined.
But overall it's not much more difficult with Catalina.

It used to be hard in earlier Catalina releases, because you had to keep code aligned in different files.
That changed long ago, maybe version 2.4 or earlier.

Maybe a detailed comparison would be useful in some thread.

RossH · 2012-10-11 22:48

jazzed wrote: »

Well since there is some consensus that working together may have benefit, I look forward to someone implementing registry stuff since it's "so beneficial." I don't recall what a catalina plugin is exactly. Maybe we can get a refresher.

When are you going to make it easy for anyone other than Ross to add external memory solutions? A few people have already done this with propeller-gcc - one with zero guidance, and another with a few Q&A. I tried using your solution for several months and gave up even with your guidance.

When are you going to add a way to compile COG C code?

When are you going to provide a single simple way to specify a board without having to dig through all your sources?

You are saying that propeller-gcc doesn't have an IDE or Windows installer?
At best this is a distortion of the truth.

Hi Steve,

I'm really sorry you had such a bad experience with Catalina. Others have done what you could not, and with little help from me.

As to this particular topic - I was merely responding to a post from David. To quote:

It seems like the biggest things that distinguish Catalina from PropGCC are its simple installer, its IDE choices, and its runtime environment with HMI, registry, plug-ins, etc.

If you have a problem with David's characterization of Catalina, please take it up with him, not me. I thought it was fair. I may have been incorrect in assuming GCC did not have a Windows installer or an IDE, since I have not downloaded a version since one of the early alpha releases - perhaps as long as a year ago. As far as I recall the installer was "unzip". I am sure it has improved since then, but I do not know the details. I do recall you were planning to implement an Eclipse plugin, but I did not know you had finished it.

As for working on your GCC team, despite my high regard for a few of the people involved, I would rather poke my eyes out with a stick.

Ross.

Rsadeika · 2012-10-12 00:43

...I would rather poke my eyes out with a stick.

... and hopefully that is the last thing we will hear from you. If you have nothing positive or constructive to say here, then why don't you just say nothing? That just might be a new experience for you.

Ray

RossH · 2012-10-12 01:44

David Betz wrote: »

Sorry, I guess I didn't realize that the ease of modifying the LCC code generator was a big factor in the success of Catalina. I was only looking at things that are apparent from the outside. So maybe the idea of replacing LCC with GCC in Catalina was a crazy idea after all! :-(

Yes, I don't claim to be a compiler guru - much of the success of Catalina is due to the fact that LCC has a beautifully simple code generator that anyone could learn to modify in a couple of weeks. I didn't even buy the LCC book until I had built several versions of Catalina - everything you really need to know is documented in just a few pages. However (as I have acknowledged before) GCC does have a better optimizer - and (best of all) you get the optimizer for nothing with GCC, whereas with LCC I had to write my own - and it is not very pretty :frown:.

But you are quite right - the main reason I would not personally be interested in moving to GCC is that I would lose the ability to implement new ideas so quickly and easily.

However, I'd be perfectly happy for someone else to do so - which is why I suggested that the logical first step would be the adoption of a language-independent driver/plugin/runtime architecture. This would simplify things for everyone.

Ross.

Heater. · 2012-10-12 01:48

Now, now, guys. There is nothing wrong with a bit of friendly rivalry.

As for Catalina adopting GCC, whilst it may well be doable it may also be a lot of work. Further, from the perspective of us users, having options in compilers, IDEs, or whatever in general is a good thing.

Ross is not the only one who thinks that GCC might be a bit hard to deal with; that is why we have llvm and clang.

mindrobots · 2012-10-12 02:13

Heater. wrote: »

Now, now, guys. There is nothing wrong with a bit of friendly rivalry.

As for Catalina adopting GCC, whilst it may well be doable it may also be a lot of work.

Gosh, Heater, you make it sound like Catalina is written in Forth! It's all C or C++, why would changing underlying compilers be a lot of work????

:0)

RossH · 2012-10-12 02:30

mindrobots wrote: »

Gosh, Heater, you make it sound like Catalina is written in Forth! It's all C or C++, why would changing underlying compilers be a lot of work????

:0)

You are correct on one level - C is C, whether compiled by Catalina or GCC or ImageCraft.

But the runtime stuff is not implemented in C, so there is work required - the main part of it being adapting GCC to use the Catalina registry/plugin/run-time architecture. But that's only step one - there is more: Catalina uses a Spin compiler to do the final compilation (I happen to use Homespun, but could just as easily have used BST or even the Parallax compiler if they fixed some of the limitations - originally I supported all three!) - but GCC does not. I believe even GCC's implementation of plain old PASM is not 100% compatible with Parallax's version (only going by what I have read in these forums, not by personal experience, so I am happy to be corrected). This means much of the Catalina runtime stuff would simply fail to compile under GCC, and have to be rewritten.

Ross.

Heater. · 2012-10-12 02:38

mindrobots,

"Gosh, Heater, you make it sound like Catalina is written in Forth! It's all C or C++, why would changing underlying compilers be a lot of work????"

Catalina may as well be written in Forth or any other language for that matter. As long as it compiles C source into binaries that run on the Prop. (Has any one ever been crazy enough to write a C compiler in a language that is not C?)

I don't know that it would be a lot of work, just guessing. I did say "it may.."

Now, you have raised an interesting point. propgcc is a C compiler and Catalina is a C compiler. However I'm willing to bet that most of the demos from one will not work out of the box with the other and vice versa.

On the one hand Catalina programs will be relying on the Catalina plugin system to provide services like serial I/O, SD card etc. On the other hand propgcc programs may be relying on propgcc built in functions, and GCC specific language extension features that Catalina does not have.

For sure my version of FullDuplexSerial in C will not be usable with Catalina, as it is compiled to native code to run in a COG and it uses the GCC "labels as values" extension to create the TX and RX threads.
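For readers who have not met that extension, a minimal sketch of the general technique, not the actual FullDuplexSerial code: GCC lets a program take the address of a label with `&&label` and jump to it with `goto *ptr`, so a function can remember where it left off and resume there on the next call. `coro_t` and `coro_step` are hypothetical names, and the two "phases" just return a letter instead of doing serial I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread state: where to continue on the next call. */
typedef struct {
    void *resume;  /* address of a label inside coro_step, NULL at start */
} coro_t;

/* Runs one step of the cooperative thread and returns to the caller.
   Relies on the GCC "labels as values" extension (&&label, goto *ptr). */
int coro_step(coro_t *c)
{
    assert(c != NULL);
    if (c->resume != NULL)
        goto *c->resume;  /* pick up where the previous call stopped */

    for (;;) {
        c->resume = &&phase_b;  /* next call continues at phase_b */
        return 'A';
phase_b:
        c->resume = &&phase_a;  /* next call continues at phase_a */
        return 'B';
phase_a:
        ;  /* fall through to the top of the loop */
    }
}
```

Two independent `coro_t` values, say one for TX and one for RX, can be stepped alternately from a single loop, and each keeps its own position. This only compiles with GCC-compatible compilers, which is exactly why such code does not port to a strictly standard C compiler.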

On the other hand, C code that is not working down at such low levels and adheres to the standards is quite interchangeable; fft_bench runs on both without modification, for example.

P.S. In answer to my question about C compilers not written in C, I recall that the GCC project has recently decided to adopt C++ for use in the compiler. They have clearly lost their minds:)

Heater. · 2012-10-12 02:44

RossH,

I believe even GCC's implementation of plain old PASM is not 100% compatible

GCC does not support "plain old PASM" as in the assembler syntax used in Spin programs. GCC compiles C down to its own assembler syntax, which is then assembled by the gas assembler.

There are "plain old PASM" modules used in the propgcc system, especially the loader. They are assembled using BST, but that is about to be swapped out for Roy's new open source Spin compiler, which already produces identical code to the Prop tool (barring any bugs that will no doubt be fixed if they show up in the future).
