More than one cog running XMM code.
aldorifor
Hello,
Is it possible, with the different existing IDEs (Catalina, PropGCC, ... and others?), to write large external memory code (XMM) that can be run by more than one COG at the same time (i.e. each COG running its own XMM code)?
Either way, feel free to argue whether this seems useful to you or not.
Comments
What would be the point anyway? It would be mind-numbingly slow as your multiple COGs thrash whatever memory caching is going on between the external memory and the COG execution kernel.
Not with Catalina. While it might be useful, the overheads are simply horrendous. Just to give you a simple actual example - the Hydra HX512 XMM RAM card actually relies for speed on sequential access, where you don't need to specify the address each time you read from XMM - the card assumes the next address is always the "last address + 1" unless you explicitly specify otherwise by setting a new address. This "last address + 1" assumption is true in a significant majority of cases when a single cog is reading program code from XMM. But you lose that advantage entirely when multiple cogs are doing so - which means XMM access slows down for all cogs by a factor of 3 or more.
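To make that concrete, here is a toy cost model in C - not real driver code, and the cost constants and function names are invented - showing why interleaved fetches from two code streams defeat the "last address + 1" optimisation:

#include <stdint.h>
#include <stdio.h>

/* Toy cost model only - not real driver code. The cost constants are
 * invented; the point is the ratio, not the absolute numbers.        */
#define ADDR_SETUP_COST 3   /* latching a new address on the card     */
#define BYTE_READ_COST  1   /* clocking out one byte                  */

static uint32_t next_addr = 0;          /* the card's "last address + 1" */

static int xmm_read_cost(uint32_t addr)
{
    int cost = BYTE_READ_COST;
    if (addr != next_addr)              /* non-sequential: set the address */
        cost += ADDR_SETUP_COST;
    next_addr = addr + 1;
    return cost;
}

int main(void)
{
    int one_cog = 0, two_cogs = 0;
    uint32_t a = 0x1000, b = 0x8000;

    for (int i = 0; i < 100; i++)       /* one cog, one code stream  */
        one_cog += xmm_read_cost(a++);

    a = 0x1000; b = 0x8000; next_addr = 0;
    for (int i = 0; i < 50; i++) {      /* two cogs, interleaved     */
        two_cogs += xmm_read_cost(a++);
        two_cogs += xmm_read_cost(b++);
    }
    printf("1 cog: %d units, 2 cogs: %d units\n", one_cog, two_cogs);
    return 0;
}

With these made-up costs, 100 sequential fetches cost 103 units, while the same 100 fetches interleaved from two streams cost 400 - roughly the factor of 3 or more mentioned above.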
On the P2, the situation will be better. But on the P1 it really doesn't make sense to support XMM access from more than one cog (believe me, I've tried!).
Ross.
Thanks for quick reply.
At this time, is there a dedicated COG to access XMM with Catalina?
You can get threading through pthreads and you can use any unused cogs for cogc code (I believe) and also for PASM code.
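For example, here is a minimal sketch using the standard pthreads calls (the task name and its payload are just placeholders - check the propgcc demos for the exact build settings for your memory model):

#include <pthread.h>
#include <stdio.h>

/* Placeholder task - the name and payload are arbitrary. */
static void *blink_task(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t;
    int id = 1;
    if (pthread_create(&t, NULL, blink_task, &id) == 0)
        pthread_join(t, NULL);
    return 0;
}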
As Heater said, XMM is slow enough with a single instance. It would be hard to find a use case that would warrant the work and the decreased performance.
I'm sure I have more "official" explanations but my notes are being rearranged for better access
Do you have an idea (or an approximation) of the equivalent Z80 speed of your emulator?
Will a single instance using XMM ever be faster than SPIN code?
Great work, PropGCC team!
Good news! What is the amount of cache in hub memory per COG?
It just about matched the speed of an original 8080.
Perceptually it can seem faster than an old CP/M machine with floppy disks because we have a nice fast SD card file system!
Cool! I remember you mentioning some time ago, in the thread where we talked about the PMC and the issues I had, that you were planning to implement that feature. Now it's already implemented! :-)
Btw: Is there a document describing how to start cogs in all different memory models?
The info is mostly spread across different docs or just examples. A single howto for each memory model would be good.
For example (a rough sketch of the second case follows this list):
Starting a cogc program in LMM, XMM
Starting a cog by handing over the address of a function in LMM, XMM
Starting a cog in XMM mode and COG uses XMM cache driver
...
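As a starting point, here is a rough sketch of the second case for LMM, using cogstart() from PropGCC's propeller.h (the stack size and the worker's payload are arbitrary) - the XMM variants are exactly what such a howto would need to spell out:

#include <propeller.h>
#include <stdio.h>

static int stack[64];                 /* stack for the new cog        */
static volatile int counter = 0;      /* shared through hub RAM       */

static void worker(void *arg)
{
    while (1)
        counter++;                    /* trivial payload              */
}

int main(void)
{
    int cog = cogstart(worker, NULL, stack, sizeof(stack));
    printf("started worker in cog %d\n", cog);   /* -1 on failure */
    while (1) {
        waitcnt(CNT + CLKFREQ);       /* report once a second         */
        printf("counter = %d\n", counter);
    }
    return 0;
}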
Is real XMM mode for more than one cog the default configuration when launching new cogs in XMM mode, or is there a new way to start the cogs?
Another question, a bit off topic in this thread... In a cogc driver, are local variables placed in cog memory by default, or do I have to add the COGMEM tag?
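For reference, here is the kind of driver I mean - a sketch using the _COGMEM and _NAKED macros from propeller.h, with a hypothetical LED pin. The file-scope variable is placed in cog memory explicitly; my question is whether that attribute is actually needed:

/* toggle.cogc - sketch of a cogc driver. _COGMEM and _NAKED come from
 * propeller.h; the pin number is hypothetical.                        */
#include <propeller.h>

_COGMEM unsigned int pin_mask;        /* explicitly in cog memory      */

_NAKED int main(void)                 /* cogc drivers start at main    */
{
    pin_mask = 1 << 15;               /* hypothetical LED pin          */
    DIRA |= pin_mask;
    while (1) {
        OUTA ^= pin_mask;
        waitcnt(CNT + CLKFREQ / 2);   /* toggle twice a second         */
    }
    return 0;
}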
Regards,
Christian
It's your choice. You can ask Catalina to put the XMM access code into the existing kernel cog, or to use a dedicated cog. Using the kernel cog is the default. In some cases (e.g. when serial XMM RAM is used, like the Propeller Memory Card) the XMM access code may be too large to fit in the kernel cog, so a separate cog must be allocated to handle the XMM access. This cog also manages a chunk of Hub RAM used as a cache to speed up performance, since using a dedicated cog is slower than using the kernel cog.
Whether the kernel cog or a dedicated cog is used for XMM access is controlled by the CACHED command line symbol. If this is specified, a dedicated cog is allocated - otherwise the kernel cog is used.
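For anyone wondering what the dedicated cog does with that chunk of Hub RAM, here is an illustrative direct-mapped cache lookup in C. The sizes and names are assumptions for the sketch, not Catalina's actual layout:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define LINE_SIZE 32                          /* bytes per cache line    */
#define NUM_LINES 64                          /* 64 * 32 = a 2k cache    */

static uint8_t  cache[NUM_LINES][LINE_SIZE];  /* the hub RAM cache       */
static uint32_t tags[NUM_LINES];              /* XMM address of each line*/
static uint8_t  valid[NUM_LINES];

static uint8_t  xmm[65536];                   /* stand-in for the XMM RAM*/

static void xmm_read_line(uint32_t addr, uint8_t *dst)
{
    memcpy(dst, &xmm[addr], LINE_SIZE);       /* the slow external read  */
}

static uint8_t cached_read(uint32_t addr)
{
    uint32_t line = addr & ~(uint32_t)(LINE_SIZE - 1);
    int      i    = (line / LINE_SIZE) % NUM_LINES;

    if (!valid[i] || tags[i] != line) {       /* miss: fill from XMM     */
        xmm_read_line(line, cache[i]);
        tags[i]  = line;
        valid[i] = 1;
    }
    return cache[i][addr & (LINE_SIZE - 1)];  /* hit: serve from hub RAM */
}

int main(void)
{
    xmm[0x1234] = 42;
    printf("%d\n", cached_read(0x1234));      /* miss, then line fill    */
    printf("%d\n", cached_read(0x1235));      /* hit on the same line    */
    return 0;
}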
So in short, accessing XMM RAM from multiple cogs is feasible - but on the P1 it is simply not practical.
Ross.
I look forward to the propgcc vs Catalina benchmark results :)
No, I mean it has no practical application. Or perhaps I should say "very little". The reason for this is that if you are writing in C/C++, in almost all cases you would be much better off simply using a different processor with more internal RAM and using multi-threading instead. Your hardware would be cheaper, your software would be faster to develop, and it would probably execute faster anyway.
That's why I eventually gave up on multi-cog XMM on the P1 - it ended up costing so many cogs and so much Hub RAM (to support the cache sizes necessary to get reasonable execution speed) that in the end I couldn't actually run the applications anyway - even with XMM.
This was the main driver behind my developing CMM instead - it gave me the ability to run larger C programs on multiple cogs (or with multiple threads) at faster speeds than multi-cog XMM allowed, using fewer cogs and less Hub RAM.
Of course, this will change on the P2. But on the P1, even single-cog XMM has very limited use. I'm only aware of a couple of commercial applications that are using it.
Ross.
Edit: Sorry, I should have said "use the Propeller at all for applications that don't fit in hub memory."
Hi David,
Of course I am only speaking from my own experience - who can do otherwise? Also, I never said "useless". In fact, my initial response was:
What I said was "not practical".
Single-cog XMM on the P1 is only marginally practical - it consumes (at least with Catalina) as little as one cog and no Hub RAM. But to get just two XMM cogs running, multi-cog XMM would consume around half your available Propeller resources - i.e. half your cogs and half your Hub RAM!
I'm happy to be proven wrong on this one. But I don't think that's likely.
Ross.
There is, of course, more hub memory used. This is where we will have to see whether multi-COG XMM is really practical. If each XMM COG requires an 8k cache then it will be hopeless with four XMM COGs since the caches would use all of hub memory. However, if an XMM COG can perform well with a smaller cache, say 2k, then four XMM COGs might perform well for some applications. I know we've done some testing on 2k caches in XMMC mode and the performance wasn't too bad. This suggests to me that it might be possible to run 4 XMM COGs each with a 2k cache and get acceptable performance for some applications and no worse than a single XMM COG with a 2k cache.
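For concreteness: 4 x 8k caches = 32k, which is the P1's entire hub RAM, while 4 x 2k = 8k, leaving 24k free for everything else.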
Anyway, we have no experience with applying this technology yet so you may very well be right. As I've said before, even single-COG XMM has few users at the moment that I'm aware of.
I am aware of a couple of commercial applications under development that use Catalina's single-cog XMM. But they are all using very fast parallel access RAM, not serial RAM (e.g. not SPI or QPI).
With Catalina, this means you need no additional cogs and no additional Hub RAM whether your program is compiled as LMM, CMM or XMM. You can choose your memory model depending only on how large your C application is, and how fast it must execute (and in every case, the C main program still executes faster than SPIN).
I think this is the only really practical "real world" development model for C on the Propeller - i.e. a single-threaded main control program in C, and (just as with SPIN) using the 7 other cogs for the low-level grunt work (generally, these have to be programmed in PASM for speed anyway). And those 7 cogs still have the entire 32KB of Hub RAM available, apart from whatever stack space the C main control program needs (which can be zero, but is typically of the order of a few hundred bytes).
Multi-cog XMM can't offer that. But it would be fun if it could, so I hope someone manages it!
Ross.
Anything that uses parallel access, or only uses a single SPI serial chip. It is only serial FLASH or QPI SRAMs that need too much space to fit in the kernel itself (e.g. on the C3, using only the SPI SRAM as XMM RAM needs only the kernel cog, but using FLASH requires the additional cog).
Ross.
Sorry, I may have given you the wrong impression. The parallel interfaces tend to be fast enough to use without the cache. The serial interfaces are definitely much slower when used without the cache, but may still be fast enough for some applications.
Ross.
Correct. When using serial XMM (e.g. SPI RAM and/or FLASH), the cache is nice because all XMM boards then give you pretty much the same performance. Without the cache, some give you acceptable performance and some do not.
When using parallel XMM, the performance can be so good that you simply don't need to bother using the cache - but of course this is dependent on the details of the parallel interface.
Ross.