More than one cogs running xmm code.

RossH · 2014-01-26 21:08

Hi David,

I said in response to one of your posts above that ...

RossH wrote: »

When using parallel XMM, the performance can be so good that you simply don't need to bother using the cache - but of course this is dependent on the details of the parallel interface.

I just ran some tests on this. This was using a RamBlade3 board, which has a very fast parallel interface to its XMM RAM. The numbers are in milliseconds for Heater's fft_bench.c, so the lower the better - but other than that the actual numbers are a bit irrelevant. The results are more interesting for what they show about the advantages or otherwise of using caches of various sizes and in various modes:

+--------------+-------+------+
|  CACHE MODE  | SIZE  | TIME |
+--------------+-------+------+
|  code only   | none  | 1555 |
|  code only   |  2K   | 1632 |
|  code only   |  4K   | 1404 |
+--------------+-------+------+
| code & data  | none  | 1878 |
| code & data  |  2K   | 2921 |
| code & data  |  4K   | 3258 |
+--------------+-------+------+

Not directly on topic (but quite interesting nonetheless) is the fact that the Spin execution time for this program is 1465 ms. This shows that executing C code from XMM can actually be faster than executing Spin from Hub RAM. I've mentioned this fact before, but I don't think I've ever shown an example.

When comparing cached to uncached access, the best case is when using a 4k cache for code only access (XMM SMALL) - in this case, you can get a 10% speedup over direct (i.e. uncached) access. A modest improvement, but perhaps worthwhile - provided of course that you have an extra cog and 4K of Hub RAM to spare!

But the more interesting case is when trying to use the cache for both code and data access (XMM LARGE) - in this case, using the cache actually slows you down. Also, with the data in the cache, the larger the cache, the larger the slowdown - I think this is because code access is generally sequential, but when the data is also in the cache, the random nature of typical data access means it is a disadvantage to have to load a larger chunk each time you need just one location in that chunk.

In short, the advantage of using a cache over fast parallel XMM RAM access is questionable at best, given that it is so expensive - and it can even slow you down if you are not very careful. In fact, it is quite likely that your program will run substantially slower if it is forced to use the cache.

Ross.

Cluso99 · 2014-01-26 23:44

FYI, we have a commercial product that uses Catalina XMM in one prop (a RamBlade3 circuit - 512KB 55ns parallel SRAM & SD card) of the 3 props in the product. Caching is not used.
The product has completed beta field testing and has agency approval. Production is now almost completed and shipping will take place within the next 2 weeks. Sorry I cannot give any more details, at least for now.

RossH · 2014-01-27 01:08

Cluso99 wrote: »

Sorry I cannot give any more details, at least for now.

Thanks for the info, Cluso.

I hope you will post more details when you can.

Ross.

Heater. · 2014-01-27 02:00

Interesting results there Ross.

Glad to see the old fft_bench doing good service.

Those numbers are quite an eye opener. Firstly because it seems that using a cache is almost completely a waste of effort in the best case. Secondly of course because it turns out to be detrimental in the most useful case, big program big data all in ext RAM.

Makes me wonder why I ever accepted the idea of a cache in ZiCog without ever benchmarking it!

I wonder if you can stomach doing similar tests with other benchmarks like fibo(). I'm thinking that the FFT is a really bad case for caches because it has terrible locality of reference. It likes to access all parts of its data all the time, thus thrashing the cache.

What is the run time of the fft from HUB? Looks like we are in for a 40 times slow down going to ext RAM.

Dr_Acula · 2014-01-27 03:27

I know this takes things slightly off topic, but not really, as it seems that cache drivers need two cogs, and Ross is saying that one cog using parallel memory might be faster. And if you have two cogs per XMM, presumably you can only have a maximum of 4 threads, but there are 8 if you only have one cog per XMM.

Ross said

When using parallel XMM, the performance can be so good that you simply don't need to bother using the cache

Presumably that is using an entire propeller connected to a parallel ram chip? They were the experiments done a few years ago, but now with the new SPI ram, I wonder how they compare?

Specifically, the 8 pin SPI ram chip running in 4 bit mode. So ok, 8 clocks to read a 32 bit long (cf an 8 bit parallel ram chip which would be 4 clocks). But the SPI ram chip can also be set up to feed out the n nibbles with one per clock, so it may be similar in speed as there is slightly less code.

This may or may not be faster than caching. But if it saves a cog that might be helpful.

RossH · 2014-01-27 03:27

Heater. wrote: »

I wonder if you can stomach doing similar tests with other benchmarks like fibo(). I'm thinking that the FFT is a really bad case for caches because it has terrible locality of reference. It likes to access all parts of its data all the time, thus thrashing the cache.

I'm not doing much benchmarking just at the moment. I have some plans to improve the speed of Catalina, but they will take me some time to implement (who has the time!) - and anyway, Catalina is already as fast as I need. It is also optimal on hub and cog usage (which is more important for the programs I'm running just at the moment anyway).

And yes, I think this program is a slightly pathological case, for precisely the reason you mention - not all programs will be as bad as this one. It also highlights the flaws in the general cache design for this type of program - a different cache design may help in this case ... but then I'm sure there would be programs that would do badly under each different design.

Ross.

RossH · 2014-01-27 03:30

Dr_Acula wrote: »

Specifically, the 8 pin SPI ram chip running in 4 bit mode. So ok, 8 clocks to read a 32 bit long (cf an 8 bit parallel ram chip which would be 4 clocks). But the SPI ram chip can also be set up to feed out the n nibbles with one per clock, so it may be similar in speed as there is slightly less code.

Sadly, in my experience these fancy serial modes (multiple SPI, QPI etc) turn out to be much slower and require much larger code sizes than a simple parallel interface. And the parallel interface doesn't really require that many more pins!

Ross.

David Betz · 2014-01-27 04:10

Dr_Acula wrote: »

I know this takes things slightly off topic, but not really, as it seems that cache drivers need two cogs, and Ross is saying that one cog using parallel memory might be faster. And if you have two cogs per XMM, presumably you can only have a maximum of 4 threads, but there are 8 if you only have one cog per XMM.

Neither of these is completely correct with respect to propgcc. In propgcc you have a single COG managing access to external memory and up to 7 COGs running XMM code. So there is only one COG of overhead no matter how many XMM COGs your program uses. However, each XMM COG does require a separate cache. That is where you're likely to be limited in how many you can run simultaneously. If you give each XMM COG a 2K cache, you can run four with 8K total hub memory consumed by the caches and still have three COGs left to run PASM or COGC drivers.

However, running more than one XMM COG really works best if you use XMMC mode (what Ross calls XMM SMALL), since otherwise you wind up with cache coherency issues. You can manage these in your code by flushing the cache as necessary, but the current propgcc library code is not written to do this, so many library functions that make use of global variables will give you trouble if you try to use an XMM memory model where data is in external memory as well as code.

David Betz · 2014-01-27 04:12

RossH wrote: »

I have some plans to improve the speed of Catalina, but they will take me some time to implement (who has the time!) - and anyway, Catalina is already as fast as I need.

You seem to put quite a lot of time into Catalina as it is. I wish I had that much time to put into propgcc! You do an excellent job of supporting Catalina C.

I'm looking forward to hearing about your speed improvements when you have a chance to do them.

Heater. · 2014-01-27 05:47

I wonder if Ross ever remembered to get back to the original project he wanted a C compiler for in the first place.

RossH · 2014-01-27 12:10

Heater. wrote: »

I wonder if Ross ever remembered to get back to the original project he wanted a C compiler for in the first place.

Actually, I did!

But even though I finally managed to get Catalina execution fast enough and the code sizes small enough (this happened a couple of releases ago) I still like messing about with it. Not much else to do while we all wait to see if the P2 will ever emerge! :frown:

Ross.

David Betz · 2014-01-27 12:14

RossH wrote: »

Actually, I did!

But even though I finally managed to get Catalina execution fast enough and the code sizes small enough (this happened a couple of releases ago) I still like messing about with it. Not much else to do while we all wait to see if the P2 will ever emerge! :frown:

Ross.

Grab an FPGA. The DE0-Nano is less than $100. I'm sure you'll have fun with hub execution mode!

RossH · 2014-01-27 13:45

David Betz wrote: »

Grab an FPGA. The DE0-Nano is less than $100. I'm sure you'll have fun with hub execution mode!

I have had a Nano for a long time now. I started working with it then realized it was a bit of a waste of time since the P2 design was not stable enough. But I keep a periodic eye on the P2 forums - maybe this is the year! I believe there is also now a Spin compiler for the P2 available somewhere, but I haven't actually seen a copy yet - have you?

Ross.

David Betz · 2014-01-27 13:49

RossH wrote: »

I have had a Nano for a long time now. I started working with it then realized it was a bit of a waste of time since the P2 design was not stable enough. But I keep a periodic eye on the P2 forums - maybe this is the year! I believe there is also now a Spin compiler for the P2 available somewhere, but I haven't actually seen a copy yet - have you?

Ross.

I think the instruction set that Chip posted recently is probably nearly complete. I'm still hoping he'll add the CALL_LR instruction though. I've heard him mention a Spin2 compiler but I'm not sure if his current release of PNut.exe includes it or not. As far as I know, OpenSpin has not been updated to support P2 yet. I'll be updating GAS soon to match Chip's new instruction encodings but I don't imagine you'd want to use that since it has a different syntax than PASM. For that matter, if all you care about is PASM then I'm sure PNut.exe supports that.

RossH · 2014-01-27 16:29

David Betz wrote: »

I've heard him mention a Spin2 compiler but I'm not sure if his current release of PNut.exe includes it or not. As far as I know, OpenSpin has not been updated to support P2 yet.

No, I don't think it has. It is fairly trivial to update OpenSpin to use the new P2 PASM instructions, but my feeling is that it is still too early to start serious work on "Catalina 2" yet. I can't see the P2 being released anytime soon (Chip never seems to be able to say "No!" whenever someone asks for a new feature!)

Ross.
