More than one cog running XMM code.
aldorifor
Hello,
Is it possible, with the different existing IDEs (Catalina, PropGCC, ... and others?), to write large external memory code (XMM) that can be run by more than one COG at the same time (i.e. each COG running its own XMM code)?
Either way, feel free to argue whether this seems useful to you or not.
Comments
What would be the point anyway? It would be mind-numbingly slow as your multiple COGs thrash whatever memory caching is going on between the external memory and the COG execution kernel.
Not with Catalina. While it might be useful, the overheads are simply horrendous. Just to give you a simple actual example - the Hydra HX512 XMM RAM card actually relies for speed on sequential access, where you don't need to specify the address each time you read from XMM - the card assumes the next address is always the "last address + 1" unless you explicitly specify otherwise by setting a new address. This "last address + 1" assumption is true in a significant majority of cases when a single cog is reading program code from XMM. But you lose that advantage entirely when multiple cogs are doing so - which means XMM access slows down for all cogs by a factor of 3 or more.
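To make that concrete, here is a toy cost model in C - not real driver code, and the cost constants and function names are invented - showing why interleaved fetches from two code streams defeat the "last address + 1" optimisation:

#include <stdint.h>
#include <stdio.h>

/* Toy cost model only - not real driver code. The cost constants are
 * invented; the point is the ratio, not the absolute numbers.        */
#define ADDR_SETUP_COST 3   /* latching a new address on the card     */
#define BYTE_READ_COST  1   /* clocking out one byte                  */

static uint32_t next_addr = 0;          /* the card's "last address + 1" */

static int xmm_read_cost(uint32_t addr)
{
    int cost = BYTE_READ_COST;
    if (addr != next_addr)              /* non-sequential: set the address */
        cost += ADDR_SETUP_COST;
    next_addr = addr + 1;
    return cost;
}

int main(void)
{
    int one_cog = 0, two_cogs = 0;
    uint32_t a = 0x1000, b = 0x8000;

    for (int i = 0; i < 100; i++)       /* one cog, one code stream  */
        one_cog += xmm_read_cost(a++);

    a = 0x1000; b = 0x8000; next_addr = 0;
    for (int i = 0; i < 50; i++) {      /* two cogs, interleaved     */
        two_cogs += xmm_read_cost(a++);
        two_cogs += xmm_read_cost(b++);
    }
    printf("1 cog: %d units, 2 cogs: %d units\n", one_cog, two_cogs);
    return 0;
}

With these made-up costs, 100 sequential fetches cost 103 units, while the same 100 fetches interleaved from two streams cost 400 - roughly the factor of 3 or more mentioned above.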
On the P2, the situation will be better. But on the P1 it really doesn't make sense to support XMM access from more than one cog (believe me, I've tried!).
Ross.
Thanks for quick reply.
At this time, is there a dedicated COG to access XMM with Catalina?
You can get threading through pthreads and you can use any unused cogs for cogc code (I believe) and also for PASM code.
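For example, here is a minimal sketch using the standard pthreads calls (the task name and its payload are just placeholders - check the propgcc demos for the exact build settings for your memory model):

#include <pthread.h>
#include <stdio.h>

/* Placeholder task - the name and payload are arbitrary. */
static void *blink_task(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t;
    int id = 1;
    if (pthread_create(&t, NULL, blink_task, &id) == 0)
        pthread_join(t, NULL);
    return 0;
}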
As Heater said, XMM is slow enough with a single instance. It would be hard to find a use case that would warrant the work and the decreased performance.
I'm sure I have more "official" explanations but my notes are being rearranged for better access
Do you have an idea (or an approximation) of the equivalent Z80 speed of your emulator?
Will a single instance using XMM ever be faster than SPIN code?
Great work, PropGCC team!
Good news! What is the amount of cache in hub memory per COG?
It just about matched the speed of an original 8080.
Perceptually it can seem faster than an old CP/M machine with floppy disks because we have a nice fast SD card file system!
Cool! I remember you mentioning some time ago, in the thread where we talked about the PMC and the issues I had, that you were planning to implement that feature. Now it's already implemented! :-)
Btw: Is there a document describing how to start cogs in all different memory models?
The info is mostly spread across different docs or just examples. A single howto for each memory model would be good.
For example (a rough sketch of the second case follows this list):
Starting a cogc program in LMM, XMM
Starting a cog by handing over the address of a function in LMM, XMM
Starting a cog in XMM mode and COG uses XMM cache driver
...
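As a starting point, here is a rough sketch of the second case for LMM, using cogstart() from PropGCC's propeller.h (the stack size and the worker's payload are arbitrary) - the XMM variants are exactly what such a howto would need to spell out:

#include <propeller.h>
#include <stdio.h>

static int stack[64];                 /* stack for the new cog        */
static volatile int counter = 0;      /* shared through hub RAM       */

static void worker(void *arg)
{
    while (1)
        counter++;                    /* trivial payload              */
}

int main(void)
{
    int cog = cogstart(worker, NULL, stack, sizeof(stack));
    printf("started worker in cog %d\n", cog);   /* -1 on failure */
    while (1) {
        waitcnt(CNT + CLKFREQ);       /* report once a second         */
        printf("counter = %d\n", counter);
    }
    return 0;
}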
Is real XMM mode for more than one cog the default configuration when launching new cogs in XMM mode, or is there a new way to start the cogs?
Another question, a bit off topic in this thread... In a cogc driver, are local variables placed in cog memory by default, or do I have to add the COGMEM tag?
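For reference, here is the kind of driver I mean - a sketch using the _COGMEM and _NAKED macros from propeller.h, with a hypothetical LED pin. The file-scope variable is placed in cog memory explicitly; my question is whether that attribute is actually needed:

/* toggle.cogc - sketch of a cogc driver. _COGMEM and _NAKED come from
 * propeller.h; the pin number is hypothetical.                        */
#include <propeller.h>

_COGMEM unsigned int pin_mask;        /* explicitly in cog memory      */

_NAKED int main(void)                 /* cogc drivers start at main    */
{
    pin_mask = 1 << 15;               /* hypothetical LED pin          */
    DIRA |= pin_mask;
    while (1) {
        OUTA ^= pin_mask;
        waitcnt(CNT + CLKFREQ / 2);   /* toggle twice a second         */
    }
    return 0;
}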
Regards,
Christian
It's your choice. You can ask Catalina to put the XMM access code into the existing kernel cog, or to use a dedicated cog. Using the kernel cog is the default. In some cases (e.g. when serial XMM RAM is used, like the Propeller Memory Card) the XMM access code may be too large to fit in the kernel cog, so a separate cog must be allocated to handle the XMM access. This cog also manages a chunk of Hub RAM used as a cache to speed up performance, since using a dedicated cog is slower than using the kernel cog.
Whether the kernel cog or a dedicated cog is used for XMM access is controlled by the CACHED command line symbol. If this is specified, a dedicated cog is allocated - otherwise the kernel cog is used.
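For anyone wondering what the dedicated cog does with that chunk of Hub RAM, here is an illustrative direct-mapped cache lookup in C. The sizes and names are assumptions for the sketch, not Catalina's actual layout:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define LINE_SIZE 32                          /* bytes per cache line    */
#define NUM_LINES 64                          /* 64 * 32 = a 2k cache    */

static uint8_t  cache[NUM_LINES][LINE_SIZE];  /* the hub RAM cache       */
static uint32_t tags[NUM_LINES];              /* XMM address of each line*/
static uint8_t  valid[NUM_LINES];

static uint8_t  xmm[65536];                   /* stand-in for the XMM RAM*/

static void xmm_read_line(uint32_t addr, uint8_t *dst)
{
    memcpy(dst, &xmm[addr], LINE_SIZE);       /* the slow external read  */
}

static uint8_t cached_read(uint32_t addr)
{
    uint32_t line = addr & ~(uint32_t)(LINE_SIZE - 1);
    int      i    = (line / LINE_SIZE) % NUM_LINES;

    if (!valid[i] || tags[i] != line) {       /* miss: fill from XMM     */
        xmm_read_line(line, cache[i]);
        tags[i]  = line;
        valid[i] = 1;
    }
    return cache[i][addr & (LINE_SIZE - 1)];  /* hit: serve from hub RAM */
}

int main(void)
{
    xmm[0x1234] = 42;
    printf("%d\n", cached_read(0x1234));      /* miss, then line fill    */
    printf("%d\n", cached_read(0x1235));      /* hit on the same line    */
    return 0;
}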
So in short, accessing XMM RAM from multiple cogs is feasible - but on the P1 it is simply not practical.
Ross.
I look forward to the propgcc vs Catalina benchmark results :)
No, I mean it has no practical application. Or perhaps I should say "very little". The reason for this is that if you are writing in C/C++, in almost all cases you would be much better off simply using a different processor with more internal RAM and using multi-threading instead. Your hardware would be cheaper, your software would be faster to develop, and it would probably execute faster anyway.
That's why I eventually gave up on multi-cog XMM on the P1 - it ended up costing so many cogs and so much Hub RAM (to support the cache sizes necessary to get reasonable execution speed) that in the end I couldn't actually run the applications anyway - even with XMM.
This was the main driver behind my developing CMM instead - it gave me the ability to run larger C programs on multiple cogs (or with multiple threads) at faster speeds than multi-cog XMM allowed, using fewer cogs and less Hub RAM.
Of course, this will change on the P2. But on the P1, even single-cog XMM has very limited use. I'm only aware of a couple of commercial applications that are using it.
Ross.
Edit: Sorry, I should have said "use the Propeller at all for applications that don't fit in hub memory."
Hi David,
Of course I am only speaking from my own experience - who can do otherwise? Also, I never said "useless". In fact, my initial response was:
What I said was "not practical".
Single-cog XMM on the P1 is only marginally practical - it consumes (at least with Catalina) as little as one cog and no Hub RAM. But to get just two XMM cogs running, multi-cog XMM would consume around half your available Propeller resources - i.e. half your cogs and half your Hub RAM!
I'm happy to be proven wrong on this one. But I don't think that's likely.
Ross.
There is, of course, more hub memory used. This is where we will have to see whether multi-COG XMM is really practical. If each XMM COG requires an 8k cache then it will be hopeless with four XMM COGs since the caches would use all of hub memory. However, if an XMM COG can perform well with a smaller cache, say 2k, then four XMM COGs might perform well for some applications. I know we've done some testing on 2k caches in XMMC mode and the performance wasn't too bad. This suggests to me that it might be possible to run 4 XMM COGs each with a 2k cache and get acceptable performance for some applications and no worse than a single XMM COG with a 2k cache.
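For concreteness: 4 x 8k caches = 32k, which is the P1's entire hub RAM, while 4 x 2k = 8k, leaving 24k free for everything else.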
Anyway, we have no experience with applying this technology yet so you may very well be right. As I've said before, even single-COG XMM has few users at the moment that I'm aware of.
I am aware of a couple of commercial applications under development that use Catalina's single-cog XMM. But they are all using very fast parallel access RAM, not serial RAM (e.g. not SPI or QPI).
With Catalina, this means you need no additional cogs and no additional Hub RAM whether your program is compiled as LMM, CMM or XMM. You can choose your memory model depending only on how large your C application is, and how fast it must execute (and in every case, the C main program still executes faster than SPIN).
I think this is the only really practical "real world" development model for C on the Propeller - i.e. a single-threaded main control program in C, and (just as with SPIN) using the 7 other cogs for the low-level grunt work (generally, these have to be programmed in PASM for speed anyway). And those 7 cogs still have the entire 32KB of Hub RAM available, apart from whatever stack space the C main control program needs (which can be zero, but is typically of the order of a few hundred bytes).
Multi-cog XMM can't offer that. But it would be fun if it could, so I hope someone manages it!
Ross.
Anything that uses parallel access, or only uses a single SPI serial chip. It is only serial FLASH or QPI SRAMs that need too much space to fit in the kernel itself (e.g. on the C3, using only the SPI SRAM as XMM RAM needs only the kernel cog, but using FLASH requires the additional cog).
Ross.
Sorry, I may have given you the wrong impression. The parallel interfaces tend to be fast enough to use without the cache. The serial interfaces are definitely much slower when used without the cache, but may still be fast enough for some applications.
Ross.
Correct. When using serial XMM (e.g. SPI RAM and/or FLASH), the cache is nice because all XMM boards then give you pretty much the same performance. Without the cache, some give you acceptable performance and some do not.
When using parallel XMM, the performance can be so good that you simply don't need to bother using the cache - but of course this is dependent on the details of the parallel interface.
Ross.