More than one cog running XMM code
Hello,
Is it possible, with the various existing toolchains (Catalina, PropGCC, or others), to write large external memory (XMM) code that can be run by more than one COG at the same time (i.e. each COG running its own XMM code)?
And feel free to argue whether or not this seems useful to you.
Comments
What would be the point anyway? It would be mind-numbingly slow as your multiple COGs thrash whatever memory caching is going on between the external memory and the COG execution kernel.
Not with Catalina. While it might be useful, the overheads are simply horrendous. To give a simple actual example - the Hydra HX512 XMM RAM card relies on sequential access for speed: you don't need to specify the address each time you read from XMM, because the card assumes the next address is always the "last address + 1" unless you explicitly set a new address. This "last address + 1" assumption holds in the significant majority of cases when a single cog is reading program code from XMM. But you lose that advantage entirely when multiple cogs are doing so - which means XMM access slows down for all cogs by a factor of 3 or more.
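A rough sketch of that idea in C (illustrative only - not Catalina's actual HX512 driver; the helper names hx512_set_address and hx512_read_next are made up):

    /* The card auto-increments its address latch, so a read at
       (last address + 1) needs no address setup. */

    extern void hx512_set_address(unsigned int addr);   /* hypothetical helper: slow, multi-cycle setup */
    extern unsigned char hx512_read_next(void);         /* hypothetical helper: fast auto-increment read */

    static unsigned int last_addr = 0xffffffff;         /* force address setup on first access */

    unsigned char xmm_read_byte(unsigned int addr)
    {
        if (addr != last_addr + 1)
            hx512_set_address(addr);                    /* pay the address setup cost */
        last_addr = addr;
        return hx512_read_next();                       /* auto-incrementing fast path */
    }

With one cog fetching code, most reads take the fast path; with several cogs interleaving requests, almost every read misses the "last address + 1" case and pays the setup cost.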
On the P2, the situation will be better. But on the P1 it really doesn't make sense to support XMM access from more than one cog (believe me, I've tried!).
Ross.
Thanks for the quick reply.
At this time, is there a dedicated COG to access XMM with Catalina?
You can get threading through pthreads and you can use any unused cogs for cogc code (I believe) and also for PASM code.
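For the start-a-C-function-on-another-cog case, a minimal sketch using PropGCC's cogstart() from propeller.h (LMM/CMM code; the worker function and stack size here are just placeholders):

    #include <propeller.h>

    static int stack[64];                 /* stack for the new cog */
    static volatile unsigned int counter;

    static void worker(void *arg)
    {
        while (1)
            counter++;                    /* placeholder work */
    }

    int main(void)
    {
        int cog = cogstart(worker, NULL, stack, sizeof(stack));
        (void)cog;                        /* cog id, or -1 if no cog was free */
        while (1)
            ;
        return 0;
    }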
As Heater said, XMM is slow enough with a single instance. It would be hard to find a use case that would warrant the work and the decreased performance.
I'm sure I have more "official" explanations, but my notes are being rearranged for better access.
Do you have an idea (or an approximation) of the equivalent Z80 speed of your emulator?
Will a single instance using XMM ever be faster than SPIN code?
Great work, PropGCC team!
Good news! What is the amount of cache in hub memory per COG?
It just about matched the speed of an original 8080.
Perceptually it can seem faster than an old CP/M machine with floppy disks because we have a nice fast SD card file system!
{
  External Memory Driver Common Code
  Copyright (c) 2013 by David Betz
  Based on code from Chip Gracey's Propeller II SDRAM Driver
  Copyright (c) 2013 by Chip Gracey

  TERMS OF USE: MIT License

  Permission is hereby granted, free of charge, to any person obtaining a copy of this
  software and associated documentation files (the "Software"), to deal in the Software
  without restriction, including without limitation the rights to use, copy, modify,
  merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  permit persons to whom the Software is furnished to do so, subject to the following
  conditions:

  The above copyright notice and this permission notice shall be included in all copies
  or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
  INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
  PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
  HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
  CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
  OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
}

PUB image
  return @init_xmem

DAT
        org     $0

init_xmem
        jmp     #init_continue

xmem_param1 long 0
xmem_param2 long 0
xmem_param3 long 0
xmem_param4 long 0

init_continue
        ' cmdbase is the base of an array of mailboxes
        mov     cmdbase, par

        ' initialize the read/write functions
        call    #init

        ' start the command loop
waitcmd
#ifdef RELEASE_PINS
        mov     dira, #0                 ' release the pins for other SPI clients
#endif
:reset  mov     cmdptr, cmdbase
:loop
#ifdef REFRESH
        djnz    refresh_cnt, #:norefresh ' check to see if it's time to refresh
        call    #refresh_memory          ' if refresh timer expired, reload and refresh
:norefresh
#endif
        rdlong  t1, cmdptr wz
  if_z  jmp     #:next                   ' skip this mailbox if it's zero
        cmp     t1, #$8 wz               ' check for the end of list marker
  if_z  jmp     #:reset
        mov     hubaddr, t1              ' get the hub address
        andn    hubaddr, #$f
        mov     stsptr, cmdptr           ' get the external address and status pointer
        add     stsptr, #4
        rdlong  extaddr, stsptr          ' get the external address
        mov     t2, t1                   ' get the byte count
        and     t2, #7
        mov     count, #8
        shl     count, t2
#ifdef RELEASE_PINS
        mov     dira, pindir             ' setup the pins so we can use them
#endif
        test    t1, #$8 wz               ' check the write flag
  if_z  jmp     #:read                   ' do read if the flag is zero
        call    #write_bytes             ' do write if the flag is one
        jmp     #:done
:read   call    #read_bytes
:done
#ifdef RELEASE_PINS
        mov     dira, #0                 ' release the pins for other SPI clients
#endif
        wrlong  t1, stsptr               ' return completion status
        wrlong  zero, cmdptr
:next   add     cmdptr, #8
        jmp     #:loop

' pointers to mailbox array
cmdbase long    0                        ' base of the array of mailboxes
cmdptr  long    0                        ' pointer to the current mailbox
stsptr  long    0                        ' pointer to where to store the completion status

' input parameters to read_bytes and write_bytes
extaddr long    0                        ' external address
hubaddr long    0                        ' hub address
count   long    0

zero    long    0                        ' zero constant
t1      long    0                        ' temporary variable
t2      long    0                        ' temporary variable
t3      long    0                        ' temporary variable

'----------------------------------------------------------------------------------------------------
'
' init - initialize external memory
'
' on input:
'   xmem_param1 - xmem_param4 are initialization parameters filled in by the loader from the .cfg file
'
'----------------------------------------------------------------------------------------------------

'----------------------------------------------------------------------------------------------------
'
' read_bytes - read data from external memory
'
' on input:
'   extaddr is the external memory address to read
'   hubaddr is the hub memory address to write
'   count is the number of bytes to read
'
'----------------------------------------------------------------------------------------------------

'----------------------------------------------------------------------------------------------------
'
' write_bytes - write data to external memory
'
' on input:
'   extaddr is the external memory address to write
'   hubaddr is the hub memory address to read
'   count is the number of bytes to write
'
'----------------------------------------------------------------------------------------------------

'----------------------------------------------------------------------------------------------------
'
' refresh_memory - refresh external memory and reset refresh_cnt
'
' Note: only required if REFRESH is defined
'
'----------------------------------------------------------------------------------------------------
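Going by the command loop above, here is roughly how a client in C would talk to one of those mailboxes (the struct and function names are mine; the layout - hub address in the upper bits, write flag in bit 3, size code in bits 2..0 giving count = 8 << code - is read straight from the PASM):

    typedef struct {
        volatile unsigned int cmd;  /* hub address | write flag | size code; driver zeroes it when done */
        volatile unsigned int ext;  /* external address in; completion status out */
    } xmem_mailbox_t;

    #define XMEM_WRITE 0x8          /* bit 3 set = write, clear = read */

    /* hubaddr must be 16-byte aligned, since the low 4 bits carry the flags */
    static void xmem_read(xmem_mailbox_t *m, unsigned int extaddr,
                          void *hubaddr, unsigned int sizecode)
    {
        m->ext = extaddr;
        m->cmd = ((unsigned int)hubaddr & ~0xfU) | (sizecode & 7);
        while (m->cmd != 0)         /* driver writes zero here on completion */
            ;
    }

Note also that the driver scans the mailbox array in order and treats a command long of 8 as the end-of-list marker.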
Cool! I remember you mentioned some time ago, in the thread where we talked about the PMC and the issues I had, that you were planning to implement that feature. Now it's already implemented! :-)
Btw: Is there a document describing how to start cogs in all different memory models?
The info is mostly spread across different docs or just examples. A single how-to for each memory model would be good.
For example (a sketch of the cogc case follows the list):
Starting a cogc program in LMM, XMM
Starting a cog by handing over the address of a function in LMM, XMM
Starting a cog in XMM mode and COG uses XMM cache driver
...
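For the cogc case, my understanding of the usual PropGCC pattern is below (the driver name toggle.cogc and the mailbox are illustrative; the _load_start_*_cog symbol is what the linker emits for a compiled .cogc file):

    #include <propeller.h>

    /* image of the compiled cogc driver, e.g. from toggle.cogc */
    extern unsigned int _load_start_toggle_cog[];

    static volatile unsigned int mailbox;   /* parameter shared with the driver */

    int main(void)
    {
        int cog = cognew(_load_start_toggle_cog, (void *)&mailbox);
        (void)cog;                          /* cog id, or -1 if none free */
        while (1)
            ;
        return 0;
    }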
Is real XMM mode for more than one cog the default configuration when launching new cogs in XMM mode, or is there a new way to start the cogs?
Another question, a bit off topic in this thread... In a cogc driver, are there any local variables in cog memory by default or do I have to add the COGMEM tag?
Regards,
Christian
It's your choice. You can ask Catalina to put the XMM access code into the existing kernel cog, or to use a dedicated cog. Using the kernel cog is the default. In some cases (e.g. when serial XMM RAM is used, like the Propeller Memory Card) the XMM access code may be too large to fit in the kernel cog, so a separate cog must be allocated to handle the XMM access. This cog also manages a chunk of Hub RAM used as a cache to speed up performance, since using a dedicated cog is slower than using the kernel cog.
Whether the kernel cog or a dedicated cog is used for XMM access is controlled by the CACHED command line symbol. If this is specified, a dedicated cog is allocated - otherwise the kernel cog is used.
So in short, accessing XMM RAM from multiple cogs is feasible - but on the P1 it is simply not practical.
Ross.
I look forward to the propgcc vs Catalina benchmark results:)
No, I mean it has no practical application. Or perhaps I should say "very little". The reason for this is that if you are writing in C/C++, in almost all cases you would be much better off simply using a different processor with more internal RAM and using multi-threading instead. Your hardware would be cheaper, your software would be faster to develop, and it would probably execute faster anyway.
That's why I eventually gave up on multi-cog XMM on the P1 - it ended up costing so many cogs and so much Hub RAM (to support the cache sizes necessary to get reasonable execution speed) that in the end I couldn't actually run the applications anyway - even with XMM.
This was the main driver behind my developing CMM instead - it gave me the ability to run larger C programs on multiple cogs (or with multiple threads) at faster speeds than multi-cog XMM allowed, using fewer cogs and less Hub RAM.
Of course, this will change on the P2. But on the P1, even single-cog XMM has very limited use. I'm only aware of a couple of commercial applications that are using it.
Ross.
Edit: Sorry, I should have said "use the Propeller at all for applications that don't fit in hub memory."
Hi David,
Of course I am only speaking from my own experience - who can do otherwise? Also, I never said "useless". In fact, my initial response was:
What I said was "not practical".
Single-cog XMM on the P1 is only marginally practical - it consumes (at least with Catalina) as little as one cog and no Hub RAM. But to get just two XMM cogs running, multi-cog XMM would consume around half your available Propeller resources - i.e. half your cogs and half your Hub RAM!
I'm happy to be proven wrong on this one. But I don't think that's likely.
Ross.
There is, of course, more hub memory used. This is where we will have to see whether multi-COG XMM is really practical. If each XMM COG requires an 8k cache then it will be hopeless with four XMM COGs since the caches would use all of hub memory. However, if an XMM COG can perform well with a smaller cache, say 2k, then four XMM COGs might perform well for some applications. I know we've done some testing on 2k caches in XMMC mode and the performance wasn't too bad. This suggests to me that it might be possible to run 4 XMM COGs each with a 2k cache and get acceptable performance for some applications and no worse than a single XMM COG with a 2k cache.
Anyway, we have no experience with applying this technology yet so you may very well be right. As I've said before, even single-COG XMM has few users at the moment that I'm aware of.
I am aware of a couple of commercial applications under development that use Catalina's single-cog XMM. But they are all using very fast parallel access RAM, not serial RAM (e.g. not SPI or QPI).
With Catalina, this means you need no additional cogs and no additional Hub RAM whether your program is compiled as LMM, CMM or XMM. You can choose your memory model depending only on how large your C application is, and how fast it must execute (and in every case, the C main program still executes faster than SPIN).
I think this is the only really practical "real world" development model for C on the Propeller - i.e. a single-threaded main control program in C, and (just as with SPIN) using the 7 other cogs for the low-level grunt work (generally, these have to be programmed in PASM for speed anyway). And those 7 cogs still have the entire 32KB of Hub RAM available, apart from whatever stack space the C main control program needs (which can be zero, but is typically of the order of a few hundred bytes).
Multi-cog XMM can't offer that. But it would be fun if it could, so I hope someone manages it!
Ross.
Anything that uses parallel access, or only uses a single SPI serial chip. It is only serial FLASH or QPI SRAMs that need too much space to fit in the kernel itself (e.g. on the C3, using only the SPI SRAM as XMM RAM needs just the kernel cog, but using FLASH requires the additional cog).
Ross.
Sorry, I may have given you the wrong impression. The parallel interfaces tend to be fast enough to use without the cache. The serial interfaces are definitely much slower when used without the cache, but may still be fast enough for some applications.
Ross.
Correct. When using serial XMM (e.g. SPI RAM and/or FLASH), the cache is nice because all XMM boards then give you pretty much the same performance. Without the cache, some give you acceptable performance and some do not.
When using parallel XMM, the performance can be so good that you simply don't need to bother using the cache - but of course this is dependent on the details of the parallel interface.
Ross.