Board design and performance — Parallax Forums

Board design and performance

macca Posts: 784
edited 2013-04-04 09:46 in Propeller 1
Hello All,

I would like to design a board that could use GCC's XMM modes in order to allow more than 32k of code and data.
I'm most concerned about the performance of this solution, and it seems I can't find many details about the impact of the various designs.

What I would like to do is:

- Being able to write programs with more than 32k of code and data; these programs may reside on an EEPROM or an SD card (not yet decided which may be best)
- Code should run from external memory, but some portions must run from hub memory for maximum performance; if I understand correctly, this is already possible by using type modifiers like HUBTEXT
- Data should reside in HUB RAM for maximum performance, but with the ability to swap (or overwrite) variables with data stored in external memory. Say, for example, the program needs to work on different data arrays: it moves the first array into HUB memory, works with it, then reads another array from external memory, works on it, and so on. It doesn't necessarily need to save the data back to external memory; I think these will be mostly read-only data.
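To make the access pattern concrete, here is a rough sketch in C of what I have in mind, with a hypothetical xmem_read() standing in for whatever driver the final board would end up using (here it just copies from a stand-in array, so it compiles anywhere):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical external-memory reader: on the real board this would be the
 * SPI RAM / EEPROM / SD driver; here it just copies from a stand-in array. */
static const int16_t external_store[2][4] = {
    { 1, 2, 3, 4 },      /* "array A" living in external memory */
    { 10, 20, 30, 40 },  /* "array B" living in external memory */
};

static void xmem_read(void *hub_dst, const void *xmem_src, size_t n)
{
    memcpy(hub_dst, xmem_src, n);
}

/* One working buffer in hub RAM, reused for each array in turn. */
static int16_t work[4];

/* Work on whatever array is currently loaded into the hub buffer. */
static int32_t sum_work(void)
{
    int32_t s = 0;
    for (int i = 0; i < 4; ++i)
        s += work[i];
    return s;
}
```

The program would load array A, process it, then overwrite the same hub buffer with array B, and so on; nothing is written back since the data is read-only.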

What kind of design do you suggest?
Is there some performance data I could use to compare the impact of each design? For example, the time needed to run a chunk of code from HUB RAM vs. external SPI RAM vs. EEPROM, etc.

Thanks for your help.

Regards,
Marco.

Comments

  • David Betz Posts: 14,516
    edited 2013-04-03 09:51
    While you can use a >32k EEPROM as external memory for PropGCC, or you could use an SD card, neither of those options performs particularly well, and in addition they can only be used with xmmc mode, where code is in external memory. They can't be used to place data in external memory. The fastest way of providing more than 32k of space for code is using SPI or SQI flash. Parallel memories are faster but require more pins. For example, Steve's SDRAM module for the Propeller Platform is quite fast, as is the C3 Synapse board, but they also consume many Propeller pins for the parallel interface.

    However, if you go with SPI or SQI flash, keep in mind that this only gives you more code space. Your data and stack will still have to fit in hub memory, reduced by the size of the cache. We usually configure for an 8k cache, although smaller cache sizes can be used.
  • jazzed Posts: 11,803
    edited 2013-04-03 10:07
    A dual quad-SPI flash solution only uses 10 pins, is one of the fastest XMMC hardware models, and offers reasonable density (2MB to 16MB). SD card XMMC is horribly slow and is generally a last resort. That being said, there is an SD-card-footprint flash solution that would work just fine (interest is the only barrier to availability).

    Using XMM-SPLIT or XMM-SINGLE slows things significantly. If you only need to read tables or other data from external memory, they can be forced into the code section the same way we add PASM driver code or by other means.
  • David Betz Posts: 14,516
    edited 2013-04-03 10:22
    Good point about forcing constant data into the .text section. It would be interesting to see if the n-way cache code would make a dual quad-spi SRAM solution fast enough to be practical. I'm working on code that factors out the tag handling so it's easier to experiment with different cache configurations. With all of the discussion of the 128x8 SPI SRAM chips I'm going to add support for those next. It will allow RayMan's RamPage2 board and David Carrier's new board to work well with PropGCC.
  • Rayman Posts: 14,665
    edited 2013-04-03 10:51
    Thanks David. Sounds like I need to figure out this trick of separating constant and dynamic data to minimize the HUB RAM requirement...
  • David Betz Posts: 14,516
    edited 2013-04-03 10:58
    It might be as simple as just declaring the constant data "const".
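    A minimal sketch of what I mean (hypothetical table names; the exact section placement depends on the memory model and the linker script):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Declared const, so the compiler can keep it in the (cached) code/rodata
 * image instead of taking up precious hub RAM. */
static const uint8_t square_table[8] = { 0, 1, 4, 9, 16, 25, 36, 49 };

/* Hub-RAM copy, filled only while the table is actively in use. */
static uint8_t hub_table[8];

static void load_table(void)
{
    memcpy(hub_table, square_table, sizeof hub_table);
}
```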
  • macca Posts: 784
    edited 2013-04-03 11:26
    Forcing data into the code segment seems a very good option. I was thinking just that: after all, constant data may be compiled as code, and a memcpy could move it to HUB RAM into the appropriate location / structure.

    Still, I don't understand the performance impact. Compared to the LMM model, how does XMM with, say, quad-SPI RAM affect the speed? Half? 1/10th? 1/100th?
  • David Betz Posts: 14,516
    edited 2013-04-03 11:40
    Steve has posted some good comparisons before. Maybe he'll post a link to them. Unfortunately, the performance you actually get will depend a lot on your code as it does on any cached system. There are actually two levels of cache involved with an XMM solution. The code is cached in hub memory and GCC will cache code in COG memory if it's small enough to fit into its fcache buffer. This means that you can get performance that is nearly as good as LMM or even CMM if you have small loops that fit in fcache. On the other hand, you can get horrible performance if your code branches a lot over a wide range of addresses. It's hard to predict the performance of a program without trying it. Sorry, I guess that isn't a very helpful answer.
  • macca Posts: 784
    edited 2013-04-03 11:51
    No, it is helpful, because it wasn't clear to me how the cache worked. I know the fcache feature, but it wasn't clear how the XMM cache works.
    So basically, with a cache of 8k, which seems to be the default setting, if the code is within that range I could expect performance comparable to LMM, right? Code can be organized so that functions that call each other frequently are placed near each other. This could work.
    How does the cache read other code chunks? Does it always read the whole 8k, or smaller segments as needed?
  • David Betz Posts: 14,516
    edited 2013-04-03 12:01
    Actually, the 8k cache is broken up into 32-byte cache lines, each of which maps to some chunk of external memory. It's a "direct mapped" cache at the moment, so many external memory addresses map to the same cache line. Since only one of those blocks can occupy a given line at a time, a cache miss will cause whatever is in that line to be written back to external memory (if it has been modified) and a new 32-byte block read in. For more details on direct-mapped caches, check Wikipedia. Also, even if all of your code is in hub memory, it still won't execute as fast as LMM, since there is a memory access layer between your code and the cache that translates external addresses into cache line addresses. This often requires a request to be sent to the cache COG to resolve the address.
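    The address arithmetic looks roughly like this (just the numbers, to show how addresses split into offset, index, and tag; the real lookup happens in the cache driver):

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE   32u                        /* bytes per cache line */
#define CACHE_SIZE  8192u                      /* total hub cache, 8k */
#define NUM_LINES   (CACHE_SIZE / LINE_SIZE)   /* 256 lines */

/* Byte offset within a line: the low 5 bits of the address. */
static uint32_t line_offset(uint32_t addr) { return addr & (LINE_SIZE - 1); }

/* Direct-mapped line index: the next 8 bits. Many external addresses that
 * differ only in the tag map to the same line, which is what causes
 * thrashing when code branches over a wide address range. */
static uint32_t line_index(uint32_t addr) { return (addr / LINE_SIZE) % NUM_LINES; }

/* Tag: the remaining high bits, stored per line so a hit can be recognized. */
static uint32_t line_tag(uint32_t addr) { return addr / CACHE_SIZE; }
```

Two addresses exactly 8k apart land on the same line with different tags, so touching them alternately forces a miss every time.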
  • jazzed Posts: 11,803
    edited 2013-04-03 12:55
    The performance of direct-mapped cached XMMC depends on the "address locality" and size of the code being used.

    XMMC can be about the same speed as LMM with fcached code.
    XMMC (single-bit SPI flash) can be up to 10 times slower than LMM depending on locality and size.
    Dual quad-SPI flash (10 pin interface) is at least 30% faster in most cases than regular SPI flash.

    HUBTEXT-tagged XMMC functions are the same speed as LMM, of course, since they really are LMM.

    Optimized LMM is at least 10x faster than SPIN according to Eric.
  • Rayman Posts: 14,665
    edited 2013-04-03 13:11
    Can that 32-byte number be easily increased?
    I imagine, although I don't really know, that dual SQI would be faster with larger cache lines...

    Also, for xmm-split mode, is the cache still 8k? Or do you make it bigger then?
  • David Betz Posts: 14,516
    edited 2013-04-03 13:13
    The cache geometry can be configured to pretty much whatever you want as long as cache line size and number of cache lines are powers of 2.
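    A quick way to sanity-check a geometry before trying it (a hypothetical helper, not part of the toolchain):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_pow2(uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

/* A geometry is usable when both the line size and the number of lines
 * are powers of 2, so the offset and index are simple bit fields. */
static bool geometry_ok(uint32_t total_size, uint32_t line_size)
{
    return is_pow2(line_size)
        && total_size % line_size == 0
        && is_pow2(total_size / line_size);
}
```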
  • David Betz Posts: 14,516
    edited 2013-04-03 13:16
    Sorry, I didn't answer your second question. The default for xmm-split (only supported on the C3 at the moment) is 4k for flash and 4k for SRAM. You can configure the total amount of space used for the cache but it is always evenly divided by two to make the two separate caches. When we move to an n-way cache, we might want to get rid of the split flash and SRAM caches.
  • macca Posts: 784
    edited 2013-04-04 09:46
    Thanks jazzed, the performance doesn't seem too bad. I think now I need to build a test board and do some measurements with the code I want to run.

    Many thanks to you all for the information.