How to use XMM memory models for SD Card + small SPI SRAM layout?

matrixsmaster · 2016-04-14 15:27

Hello!

I'm new to the Propeller chip, but I have a lot of background in other microcontrollers. I've searched through a lot of Propeller and PropGCC documentation, but I still can't answer a few simple questions:
1) What is a "cache"? I know the word well, but I have a strange impression that it serves multiple definitions in the Propeller context, like "bootstrap temporary storage", "RAM-like device to hold all the Data (not Code, though)", or "cache" (as we know it). I'm probably wrong, but I'd like to read something that explains the different situations or use cases of this "cache".
2) Can I use SD card space to hold the Data? I've done it many times before with AVR chips, and those solutions work great (but slowly, of course).
3) I know I can use the SD card to hold the Code (XMMC), but can I use it as RAM as well (I thought that is what xmm-single is for, besides its main purpose of serving RAM)?
4) Can I use a cache with the "SD card RAM" solution? It's a very attractive solution; I used it for one of my AVR projects, and it works surprisingly fast!
5) What propeller-load options can be (and should be) used for:

a) XMM-SINGLE with some SRAM chip
b) XMM-SINGLE with some Flash memory (is it possible?)
c) XMM-SPLIT with SD card as Code storage and some SPI SRAM as Data storage

and the last question,
6) Where can I find documentation on cache drivers?

I know these questions are obvious to those of you who have been using the Propeller for years, but for me it's total darkness. I'm totally confused by the bits and pieces of information here and there, after googling for three days and two nights...
Please, if someone can, help me to understand these memory models and layouts!

David Betz · 2016-04-14 15:42

The only XMM mode supported for SD cards is XMMC. In that model, code is run from the SD card through a hub cache but all program data must fit in the remainder of hub RAM not used by the cache. In fact, XMMC using the SD card as external storage isn't really very performant. The best way to use external memory is to use XMMC with an external SPI flash chip.
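To make the "hub cache" idea concrete, here is a minimal host-side sketch of a read-only, direct-mapped cache with 512-byte lines (the line size mentioned later in this thread for the SD cache driver). It is not the PropGCC driver code: the names code_cache, cache_read_byte and fake_sd_read are hypothetical, and the 8-line cache size is an assumption chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE  512u  /* one SD sector per cache line */
#define LINE_COUNT 8u    /* assumed: 8 * 512 = 4 KB of hub RAM given to the cache */

/* Hypothetical backing-store callback: copy LINE_SIZE bytes starting at the
   line-aligned external address `addr` into `dst`. */
typedef void (*line_reader)(uint32_t addr, uint8_t *dst);

typedef struct {
    line_reader read_line;
    uint32_t tag[LINE_COUNT];    /* external line number held in each slot */
    uint8_t  valid[LINE_COUNT];
    uint8_t  data[LINE_COUNT][LINE_SIZE];
    unsigned misses;             /* number of slow external fetches so far */
} code_cache;

void cache_init(code_cache *c, line_reader rd)
{
    assert(rd != NULL);
    memset(c, 0, sizeof *c);
    c->read_line = rd;
}

/* Return the byte at external address `addr`, fetching its line on a miss. */
uint8_t cache_read_byte(code_cache *c, uint32_t addr)
{
    uint32_t line = addr / LINE_SIZE;
    uint32_t slot = line % LINE_COUNT;   /* direct-mapped: one possible slot */

    if (!c->valid[slot] || c->tag[slot] != line) {
        c->read_line(line * LINE_SIZE, c->data[slot]);
        c->tag[slot] = line;
        c->valid[slot] = 1;
        c->misses++;
    }
    return c->data[slot][addr % LINE_SIZE];
}

/* Stand-in for the SD card: the byte at address a is the low byte of a * 7. */
void fake_sd_read(uint32_t addr, uint8_t *dst)
{
    uint32_t i;
    for (i = 0; i < LINE_SIZE; i++)
        dst[i] = (uint8_t)((addr + i) * 7u);
}
```

In this sketch, addresses 0 and 3 share one line and cost a single miss, while address 4096 maps to the same slot as address 0 and evicts it. With few slots and large lines, that kind of eviction is what makes every miss expensive.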

Electrodude · 2016-04-14 17:32

There's also Spin. It's very different from C (it's more like Pascal), but in my (and many others') humble opinion it feels a lot more natural than C on the Propeller and makes everything so much easier.

Spin was designed specifically for the Propeller, while C had to be shoehorned in (via XMM, LMM, and CMM). Spin bytecode is only slightly slower than CMM and is probably faster than XMM, and it's very compact so you don't need to worry about XMM or caches or anything like that, unless you try really pushing the limit. You write code, feed it to the compiler, and it just works... No linker errors, no obscure compiler flags, no need to worry about XMM/LMM/CMM, no massive libraries that eat up a large part of hub ram and force you to use XMM or CMM, and less overhead since the Spin interpreter is in ROM.

If you read the Spin tutorials, you'll be up and running in hours. Or, since you are already experienced with microcontrollers, a good way to learn Spin is just by reading other people's Spin programs.

http://learn.parallax.com/propeller-spin-tutorials

There's the older Propeller Tool and the newer PropellerIDE for Spin. Propeller Tool is Chip Gracey's original compiler and IDE, while PropellerIDE has more powerful IDE features and uses the OpenSpin compiler which is based off of Chip Gracey's compiler but has some optimizations and other nice features. There's also BST, which works great but hasn't been maintained in years.

matrixsmaster · 2016-04-14 17:54

Thanks for the reply!

David Betz wrote: »

The only XMM mode supported for SD cards is XMMC. In that model, code is run from the SD card through a hub cache but all program data must fit in the remainder of hub RAM not used by the cache. In fact, XMMC using the SD card as external storage isn't really very performant. The best way to use external memory is to use XMMC with an external SPI flash chip.

Thanks, David! I suspected something like that. But what about SINGLE and SPLIT modes? (If we temporarily forget about SD card)

Electrodude wrote: »

There's also Spin.
Spin was designed specifically for the Propeller, while C had to be shoehorned in (via XMM, LMM, and CMM). Spin bytecode is only slightly slower than CMM and is probably faster than XMM, and it's very compact so you don't need to worry about XMM or caches or anything like that, unless you try really pushing the limit.

Thank you! But I rely on a huge amount of already written C code. For this option I had to reinvent the wheel with this strange configuration of "SD card RAM" with other microcontrollers. And for both MSP430 and AVR this method worked perfectly!
So, Spin isn't an option.

I still want to hear at least something about XMM-SPLIT and XMM-SINGLE modes with SPI SRAM, SDRAM, SPI Flash and SPI EEPROM configurations.

David Betz · 2016-04-14 19:34

XMM-single means that code and data go in the same external memory. XMM-split is used for putting code in flash and data in RAM. Mostly, XMM-SPLIT got used with the C3 board that has both SPI flash and SPI SRAM.

matrixsmaster · 2016-04-14 23:49

David Betz wrote: »

XMM-single means that code and data go in the same external memory. XMM-split is used for putting code in flash and data in RAM. Mostly, XMM-SPLIT got used with the C3 board that has both SPI flash and SPI SRAM.

OK, I've got it. I've looked through the C3 schematics. My last (stupid) question: is it possible to use the "spi_sram_cache" driver with the SDXMMC solution to buffer code transactions from the SD card? The SPI SRAM chip is the well-known 23K256, and I've already tested it (program loaded to RAM, like on the quickplayer board).

David Betz · 2016-04-15 10:48

matrixsmaster wrote: »

David Betz wrote: »

XMM-single means that code and data go in the same external memory. XMM-split is used for putting code in flash and data in RAM. Mostly, XMM-SPLIT got used with the C3 board that has both SPI flash and SPI SRAM.

OK, I've got it. I've looked through the C3 schematics. My last (stupid) question: is it possible to use the "spi_sram_cache" driver with the SDXMMC solution to buffer code transactions from the SD card? The SPI SRAM chip is the well-known 23K256, and I've already tested it (program loaded to RAM, like on the quickplayer board).

It would probably be possible to use a SPI SRAM to buffer data from an SD card. I haven't tried it though and there is no existing cache driver that does that. Is your idea to do this to avoid having to use 512 byte cache lines in the SD cache driver? I guess it would work but going through SPI SRAM on the way to the hub cache would probably slow things down considerably and the SD cache driver is already pretty slow.

matrixsmaster · 2016-04-15 19:57

David Betz wrote: »

Is your idea to do this to avoid having to use 512 byte cache lines in the SD cache driver? I guess it would work but going through SPI SRAM on the way to the hub cache would probably slow things down considerably and the SD cache driver is already pretty slow.

Oh, I'm sorry, until today I hadn't noticed that sd_cache is an automatically enabled part of the whole SDXMMC scenario. So no, I don't want to slow things down.

By now I think I know what I need to know. One question remains: is it possible to actively use the SD filesystem driver while using SDXMMC? I tested it today and it seems to work fine. The debugging output, the file system check results, and even the logic analyzer diagrams look surprisingly right! Is it intended to be so, or is it just good luck?

David Betz · 2016-04-15 20:25

matrixsmaster wrote: »

David Betz wrote: »

Is your idea to do this to avoid having to use 512 byte cache lines in the SD cache driver? I guess it would work but going through SPI SRAM on the way to the hub cache would probably slow things down considerably and the SD cache driver is already pretty slow.

Oh, I'm sorry, until today I hadn't noticed that sd_cache is an automatically enabled part of the whole SDXMMC scenario. So no, I don't want to slow things down.

By now I think I know what I need to know. One question remains: is it possible to actively use the SD filesystem driver while using SDXMMC? I tested it today and it seems to work fine. The debugging output, the file system check results, and even the logic analyzer diagrams look surprisingly right! Is it intended to be so, or is it just good luck?

Yes, you should be able to use the SD cache driver and the SD filesystem at the same time.

matrixsmaster · 2016-04-18 23:35

Thank you, David! Now my test board works just as planned!

David Betz · 2016-04-19 01:45

matrixsmaster wrote: »

Thank you, David! Now my test board works just as planned!

What kind of performance are you getting? The SD cache driver is the worst of all of them because of the large cache line size and XMM is a lot slower than LMM or CMM anyway.

matrixsmaster · 2016-04-19 16:28

David Betz wrote: »

What kind of performance are you getting? The SD cache driver is the worst of all of them because of the large cache line size and XMM is a lot slower than LMM or CMM anyway.

Propeller performance in the XMMC setup is good enough that the processing overhead goes unnoticed by a human user. But my project uses a Pseudo-ZPU soft processor inside, and PZPU performance in this setup is VERY poor. In fact, there's a 5 min (!) gap between entering the standard C library printf() function and the actual printing of the first character, even with a dual SD card setup and O2 optimizations. Now I'm thinking about using my own cache driver for the Propeller and thus reducing the dependency on PZPU. As I noticed from the propgcc sources, I can lay out data sections however I want, and a "cache" driver is just a virtual RAM interface. Therefore, XMM does the same thing as PZPU does for my other projects. So, hopefully, I can achieve decent performance by eliminating the second virtualization layer.

David Betz · 2016-04-19 17:01

Your project is based on the ZPU? Why don't you use Heater's ZOG? It is an implementation of the ZPU for the Propeller. That is where I first started work on XMM. I'm not sure of the state of the code at this point though.

matrixsmaster · 2016-04-19 17:48

The ZPU has two main disadvantages when emulated:
1) it's slow by design (stack architecture)
2) it's an emulator in this case, so overhead is enormous.

As I figured out, XMM uses another approach: just grab the native code from outside and use it. In this case overhead consists of fetching code (or data, doesn't matter) only. Like a RAM proxy layer.
ZPU has to not only fetch the code and data, but do a bunch of actions: decode instruction, increment/decrement stack pointer, etc...

BTW, I had heard about ZOG, but didn't realize that it is a ZPU emulator. My own PZPU emulator works on different platforms (i386, avr, pic, msp430, and now P1, although only avr and i386 are "officially" supported). It works even on 8-pin ATtiny chips, driving an LCD screen terminal with a keyboard.
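The overhead argument above can be illustrated with a toy stack-machine interpreter. This is only a sketch: the opcodes below are hypothetical and are NOT the real ZPU encoding, and run_bytecode and run_native are made-up names. The point is that every bytecode costs a fetch, a decode, and stack pointer bookkeeping, whereas natively executed code (the XMM approach) only pays for fetching.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration only; not the real ZPU instruction set. */
enum { OP_HALT = 0, OP_PUSH = 1, OP_ADD = 2, OP_MUL = 3 };

#define STACK_WORDS 16

/* Interpret `code` and return the top of the stack when OP_HALT is reached.
   *steps receives the number of fetch/decode iterations that were needed. */
int32_t run_bytecode(const uint8_t *code, size_t len, unsigned *steps)
{
    int32_t stack[STACK_WORDS];
    int sp = 0;
    size_t pc = 0;

    *steps = 0;
    while (pc < len) {
        uint8_t op = code[pc++];               /* fetch */
        (*steps)++;
        switch (op) {                          /* decode */
        case OP_PUSH:                          /* operand: one signed byte */
            assert(pc < len && sp < STACK_WORDS);
            stack[sp++] = (int8_t)code[pc++];
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;                              /* stack pointer bookkeeping */
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
        default:
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }
    return sp > 0 ? stack[sp - 1] : 0;
}

/* The same computation as native code: no decode loop, no software stack. */
int32_t run_native(int32_t a, int32_t b, int32_t c)
{
    return (a + b) * c;
}
```

Computing (2 + 3) * 4 with the program PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT takes six trips through the interpreter loop, each of which is itself many native instructions.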

David Betz · 2016-04-19 17:58

Heater's ZOG implementation of the ZPU is written in PASM and is reasonably fast.

matrixsmaster · 2016-04-19 19:18

David Betz wrote: »

Heater's ZOG implementation of the ZPU is written in PASM and is reasonably fast.

Is it faster than XMM?

UPD: last but not least: I need to use megabytes of data space, and to access it as fast as I can. Yes, I can just grab my STM32F4 Discovery with native SDRAM support via FMC, but there are other considerations.

David Betz · 2016-04-19 19:45

You mean faster than PropGCC in XMM mode? Not sure. I've never compared them. I stopped working on ZOG when the PropGCC project started. However, I doubt it is faster. They both page code in from external memory but ZOG interprets bytecodes where XMM executes PASM instructions directly like LMM.

Dave Hein · 2016-04-19 21:14

One advantage that ZOG might have is that it might require less instruction space, and hence fewer cache misses when fetching instructions. I don't know if that's the case, but it would be true for a hypothetical CMM/XMM hybrid that would use CMM opcodes.

matrixsmaster · 2016-04-19 21:20

David Betz wrote: »

You mean faster than PropGCC in XMM mode? Not sure. I've never compared them. I stopped working on ZOG when the PropGCC project started. However, I doubt it is faster. They both page code in from external memory but ZOG interprets bytecodes where XMM executes PASM instructions directly like LMM.

So, in this case, I'll stick with the XMM solution with a custom cache driver. This way I can use large program code and separate big data chunks into external RAM using something like

__attribute__((nocommon, section ("EXTERNAL_RAM"), packed ))

UPD:

Dave Hein wrote: »

One advantage that ZOG might have is that it might require less instruction space, and hence fewer cache misses when fetching instructions. I don't know if that's the case, but it would be true for a hypothetical CMM/XMM hybrid that would use CMM opcodes.

ZPU uses fixed 1-byte opcodes. The PZPU instruction cache uses the simplest possible cache strategy and has a 32-byte line. It speeds up code execution, yes, but not by much. So I implemented SCache (stack cache), which uses bidirectional loading of the current stack context. And SCache provides a sufficient speedup.

David Betz · 2016-04-19 21:31

matrixsmaster wrote: »
David Betz wrote: »

You mean faster than PropGCC in XMM mode? Not sure. I've never compared them. I stopped working on ZOG when the PropGCC project started. However, I doubt it is faster. They both page code in from external memory but ZOG interprets bytecodes where XMM executes PASM instructions directly like LMM.

So, in this case, I'll stick with the XMM solution with a custom cache driver. This way I can use large program code and separate big data chunks into external RAM using something like
__attribute__((nocommon, section ("EXTERNAL_RAM"), packed ))

You might have to write a custom linker script if you use XMM and only want to put select data in external memory. I'm pretty sure the default script puts .data and .bss in external memory if you use either xmm-single or xmm-split.

matrixsmaster · 2016-04-19 21:58

David Betz wrote: »

You might have to write a custom linker script if you use XMM and only want to put select data in external memory. I'm pretty sure the default script puts .data and .bss in external memory if you use either xmm-single or xmm-split.

Yes, it does. Citing "propeller_xmm_single.x":

  .data   :
  {
    *(.data)
    *(.data*)
    *(.rodata)  /* We need to include .rodata here if gcc is used */
    *(.rodata*) /* with -fdata-sections.  */
    *(.gnu.linkonce.d*)
    . = ALIGN(4);
  }  >ram AT>ram
  .bss   :
  {
    PROVIDE (__bss_start = .) ;
    *(.bss)
    *(.bss*)
    *(COMMON)
    PROVIDE (__bss_end = .) ;
  }  >ram AT>ram

The alternative linker script is just a minor inconvenience.
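On the C side, selecting only certain objects for a dedicated section, as discussed above, might look like the sketch below. The section name EXTERNAL_RAM comes from this thread; the IN_EXTERNAL_RAM macro, the frame buffer, its size, and the accessor names are hypothetical. It assumes GCC targeting an ELF platform, and it does not show the linker script rule that would still be needed to map the EXTERNAL_RAM input section onto an external memory region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_BYTES 4096u   /* hypothetical buffer size */

/* Objects tagged with this macro are emitted into the EXTERNAL_RAM input
   section instead of .data/.bss; where that section ends up in memory is
   decided by the linker script. */
#define IN_EXTERNAL_RAM __attribute__((section("EXTERNAL_RAM")))

/* Large, volatile working data: a candidate for external RAM. */
IN_EXTERNAL_RAM uint8_t frame_buffer[FRAME_BYTES];

/* Small, frequently used state: left untagged so it lands in .bss. */
uint32_t write_count;

void frame_put(size_t offset, uint8_t value)
{
    assert(offset < FRAME_BYTES);
    frame_buffer[offset] = value;
    write_count++;
}

uint8_t frame_get(size_t offset)
{
    assert(offset < FRAME_BYTES);
    return frame_buffer[offset];
}
```

The code that touches frame_buffer does not change at all; only the placement does, which is why a custom linker script is the only extra piece.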

David Betz · 2016-04-19 22:07

Are you planning to write a custom XMM driver that can both read and write the SD card? The current SD cache driver only reads and can only be used for the XMMC memory model. By the way, the interface to memory drivers has changed between the version of PropGCC that is currently distributed with SimpleIDE and what is in the propeller-gcc repository. The new model uses much simpler drivers that only really have a read/write interface. The cache logic is all handled by the XMM kernel now. This was done to allow multiple COGs to run XMM code. That was not possible before because there was no way to share a single cache driver COG.
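The split described above (a driver that only reads and writes external memory, with the cache logic living elsewhere) can be sketched on a host machine as follows. This is not the actual PropGCC driver interface: xmem_driver, xcache and the function names are hypothetical, the SRAM is simulated with an array sized like a 23K256 (256 Kbit, i.e. 32 KB), and the single 32-byte cache line is an assumption that keeps the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XLINE      32u             /* assumed cache line size */
#define XMEM_BYTES (32u * 1024u)   /* 23K256: 256 Kbit = 32 KB */

/* Hypothetical driver interface: block read and block write, nothing else. */
typedef struct {
    void (*read)(uint32_t addr, uint8_t *dst, uint32_t n);
    void (*write)(uint32_t addr, const uint8_t *src, uint32_t n);
} xmem_driver;

/* Host-side stand-in for the SPI SRAM chip. */
uint8_t fake_sram[XMEM_BYTES];
static void sram_read(uint32_t a, uint8_t *d, uint32_t n) { memcpy(d, fake_sram + a, n); }
static void sram_write(uint32_t a, const uint8_t *s, uint32_t n) { memcpy(fake_sram + a, s, n); }
const xmem_driver fake_sram_driver = { sram_read, sram_write };

/* One-line write-back cache that sits above the driver. */
typedef struct {
    const xmem_driver *drv;
    uint32_t base;        /* external address of the cached line */
    int valid, dirty;
    uint8_t line[XLINE];
} xcache;

void xcache_init(xcache *c, const xmem_driver *drv)
{
    assert(drv != NULL);
    memset(c, 0, sizeof *c);
    c->drv = drv;
}

/* Write the cached line back to external memory if it was modified. */
void xcache_flush(xcache *c)
{
    if (c->valid && c->dirty) {
        c->drv->write(c->base, c->line, XLINE);
        c->dirty = 0;
    }
}

/* Make sure the line holding `addr` is cached; return a pointer to the byte. */
uint8_t *xcache_locate(xcache *c, uint32_t addr)
{
    uint32_t base = addr - (addr % XLINE);
    assert(addr < XMEM_BYTES);
    if (!c->valid || c->base != base) {
        xcache_flush(c);                      /* evict the old line first */
        c->drv->read(base, c->line, XLINE);
        c->base = base;
        c->valid = 1;
    }
    return &c->line[addr % XLINE];
}

uint8_t xcache_load(xcache *c, uint32_t addr)
{
    return *xcache_locate(c, addr);
}

void xcache_store(xcache *c, uint32_t addr, uint8_t v)
{
    *xcache_locate(c, addr) = v;
    c->dirty = 1;
}
```

A store only reaches fake_sram when its line is evicted or flushed. Because the driver knows nothing about lines or dirty bits, swapping in a different memory chip would only mean supplying another read/write pair.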

matrixsmaster · 2016-04-19 22:49

David Betz wrote: »

Are you planning to write a custom XMM driver that can both read and write the SD card? The current SD cache driver only reads and can only be used for the XMMC memory model. By the way, the interface to memory drivers has changed between the version of PropGCC that is currently distributed with SimpleIDE and what is in the propeller-gcc repository. The new model uses much simpler drivers that only really have a read/write interface. The cache logic is all handled by the XMM kernel now. This was done to allow multiple COGs to run XMM code. That was not possible before because there was no way to share a single cache driver COG.

No. The original idea (use the SD card for everything) is dropped. It works for AVR (on the ATmega I use hardware SPI, which gets me a 10 MHz SPI clock), but not for the Propeller. Nonetheless, now there's a much better solution. I want to write a custom RAM driver only, to use with custom linker scripts. The only thing I need now is big volatile data storage for runtime internal components. In my opinion, the SD driver works perfectly as it is. There's no need to write anything onto the SD card from the XMM kernel. Regular files on the SD card will be accessed for both read and write, though. But that's working right now and I haven't seen any problems.

P.S. I noticed that many things changed between the release_1_0 branch and master. For now I'm sticking with the release branch for one reason: the loader. I haven't managed to get the new propeller-load to upload SDXMMC code, but the old one (in the release branch) gets the job done. Maybe I missed something? My current Makefile that works with the release-branch propeller-load is here: https://github.com/matrixsmaster/PZPU/blob/propeller/pzpu/Makefile.prop

UPD:

David Betz wrote: »

This was done to allow multiple COGs to run XMM code. That was not possible before because there was no way to share a single cache driver COG.

That's a great feature! I think I'll use it to allow the user to run multiple programs without preemptive multitasking. Moreover, GUI code can run on its own pthread, am I right?

David Betz · 2016-04-20 10:53

Correct. The SD cache driver was never ported to the new version of PropGCC. That's because Parallax decided they weren't interested in XMM anymore. If there is renewed interest there is no reason it can't be made to work. It just seemed like wasted effort at the time.

matrixsmaster · 2016-04-20 11:11

David Betz wrote: »

It just seemed like wasted effort at the time.

Why? XMM seems to be very useful when a large amount of code runs on the Propeller. It's quite a unique feature, and it eliminates the need for the virtualization layers needed on more widely used platforms, like PIC. One exception I can think of is the 8051, but not every current version of the MCS-51 supports external RAM and/or ROM devices.
So it seems I have to port the SD cache to the new propgcc in addition to writing my own RAM driver.

David Betz · 2016-04-20 11:25

matrixsmaster wrote: »

David Betz wrote: »

It just seemed like wasted effort at the time.

Why? XMM seems to be very useful when a large amount of code runs on the Propeller. It's quite a unique feature, and it eliminates the need for the virtualization layers needed on more widely used platforms, like PIC. One exception I can think of is the 8051, but not every current version of the MCS-51 supports external RAM and/or ROM devices.
So it seems I have to port the SD cache to the new propgcc in addition to writing my own RAM driver.

I'm still not sure I understand what you're trying to do. Are you saying you're going to use both the SD cache driver and a RAM cache driver?

matrixsmaster · 2016-04-20 13:36

David Betz wrote: »

I'm still not sure I understand what you're trying to do.

My design is evolving as I dive deeper into the Propeller infrastructure. But the final goal is well defined.

David Betz wrote: »

Are you saying you're going to use both the SD cache driver and a RAM cache driver?

Yes and no. I need the SD card and I certainly need to run my program from it. There's definitely no way to make the whole OS with all its modules fit into hub RAM. So first, I need space for code: hundreds of kilobytes of compiled Propeller code.
The next thing I need is threads. Now I'm completely sure I will not use preemptive multitasking. I've used pthreads on Linux and have some skill in multithreaded programming. So I would love to use pthreads on the Propeller.
Next, I need to access my file storage, which is located on the same SD card as the OS itself (for ease of OS upgrades, making nice little hacks, etc).
And last, there's some big data storage inside my runtime library. This data is highly volatile. I mean, this isn't a static const table or something similar. It's a live image, like a screen buffer.

That's the setup I need to implement.
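For the threading part of this setup, a minimal POSIX threads sketch of "GUI code on its own thread" could look like the following. It uses only standard pthread calls; gui_state, gui_thread and run_gui_demo are hypothetical names, the "drawing" is reduced to a counter, and how threads map onto cogs under PropGCC is not shown here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state between the main thread and the GUI thread. */
typedef struct {
    pthread_mutex_t lock;
    int frames_drawn;
    int frames_wanted;
} gui_state;

/* Thread body: "draws" the requested number of frames, updating the shared
   counter under the mutex. */
static void *gui_thread(void *arg)
{
    gui_state *g = arg;
    int i;

    assert(g != NULL);
    for (i = 0; i < g->frames_wanted; i++) {
        pthread_mutex_lock(&g->lock);
        g->frames_drawn++;
        pthread_mutex_unlock(&g->lock);
    }
    return NULL;
}

/* Start the GUI thread, wait for it to finish, and report how many frames it
   drew. Returns -1 if the mutex or the thread could not be created. */
int run_gui_demo(int frames)
{
    gui_state g;
    pthread_t tid;

    g.frames_drawn = 0;
    g.frames_wanted = frames;
    if (pthread_mutex_init(&g.lock, NULL) != 0)
        return -1;
    if (pthread_create(&tid, NULL, gui_thread, &g) != 0) {
        pthread_mutex_destroy(&g.lock);
        return -1;
    }

    /* The main thread would do its own work here (file access, etc.). */

    pthread_join(tid, NULL);
    pthread_mutex_destroy(&g.lock);
    return g.frames_drawn;
}
```

Calling run_gui_demo(5) starts the thread, joins it, and returns 5. In a real program the main thread would keep working between pthread_create and pthread_join.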

David Betz · 2016-04-20 19:20

The C3 cache driver is sort of like what you're talking about in that it handles both the SPI flash and SPI SRAM chips on the C3. I guess what you'd want is something to handle both the SD card and SPI SRAM chips. Might not be that hard to concoct.

matrixsmaster · 2016-04-20 20:02

David Betz wrote: »

The C3 cache driver is sort of like what you're talking about in that it handles both the SPI flash and SPI SRAM chips on the C3. I guess what you'd want is something to handle both the SD card and SPI SRAM chips. Might not be that hard to concoct.

Yes, something like this. I'll dive into the C3 drivers now. Meanwhile, I have a stupid question: does the C3 board work with the current propgcc?

David Betz wrote: »

The SD cache driver was never ported to the new version of PropGCC. That's because Parallax decided they weren't interested in XMM anymore.

Rewriting or porting the driver seems like a simple puzzle. But what did you say about XMM? Isn't it supported at all nowadays?

UPD: re-reading this thread I noticed

David Betz wrote: »

The cache logic is all handled by the XMM kernel now. This was done to allow multiple COGs to run XMM code. That was not possible before because there was no way to share a single cache driver COG.

So XMM is still alive, and all I need is to write a new C3-like driver. Am I right? If so, it'll be wonderful!

David Betz · 2016-04-20 20:08

XMM support is still there. It just has a different interface with the driver. The old XMM model was to manage the cache in the driver. The new model is to have the driver just handle read/write from external memory. The cache logic is in the XMM kernel. The C3 is still supported under the new model.

matrixsmaster · 2016-04-20 20:24

David Betz wrote: »

The new model is to have the driver just handle read/write from external memory. The cache logic is in the XMM kernel. The C3 is still supported under the new model.

It's the most wonderful news I've heard in about a week! Thank you, thank you very much, David!
