
XMM musings

Dr_Acula Posts: 5,484
edited 2012-04-22 16:15 in Propeller 1
I'm currently working on a touchscreen and eventually would like to be able to run large XMM C programs that are available on the internet (eg Pacman).

Part of the touchscreen system is some very fast ram accesses - 6 pasm instructions for a word, or 3 pasm instructions per byte. However, if the XMM C kernel is fetching data from the same ram, the C kernel needs some way of being paused while a cog goes off and handles some memory transfers. It can get complicated to think about writing and debugging such a system.

So I had this idea that an intermediate step might be to put external ram on the I2C bus. The nice thing about the I2C bus is that the pins are available on practically every propeller board out there, so such an external memory might be almost "universal".

It will be slow of course, much slower than parallel access. But with caching that may not really matter, especially if you devote a decent part of hub ram to a cache (eg 8k or more).

I had a feeling there were I2C SRAM chips out there, but after looking for a while, all there seems to be is I2C flash or EEPROM, or SRAM that is SPI. Maybe there are 8-pin I2C SRAM chips?

Are there other XMM designs already out there that plug into the I2C bus?

Could the I2C expander idea be used to interface with the 32Mb ram chips?

Thoughts would be most appreciated.

Comments

  • jazzed Posts: 11,803
    edited 2012-04-21 10:11
    Dr_Acula wrote: »
    Part of the touchscreen system is some very fast ram accesses - 6 pasm instructions for a word, or 3 pasm instructions per byte. However, if the XMM C kernel is fetching data from the same ram, the C kernel needs some way of being paused while a cog goes off and handles some memory transfers. It can get complicated to think about writing and debugging such a system.

    I believe denominator (Ted) added a locking feature to the GCC caching system for purposes like this. I don't know all the details yet and haven't had a chance to try it myself - it will probably be undocumented for a while. Maybe Ted can jump in and explain how it works.

    Once Propeller-GCC Beta is started, I'll be able to play with this. I also have a project that can benefit from bus sharing.
    Dr_Acula wrote: »
    So I had this idea that an intermediate step might be to put external ram on the I2C bus. The nice thing about the I2C bus is that the pins are available on practically every propeller board out there, so such an external memory might be almost "universal".

    It will be slow of course, much slower than parallel access. But with caching that may not really matter, especially if you devote a decent part of hub ram to a cache (eg 8k or more).

    You can get a feel for what its performance would be by using a 64KB+ linear address space EEPROM with GCC XMMC mode (up to 512KB of code storage is possible). Switching to XMM-single will cause about a 4x slowdown. Implementing an I2C SRAM is probably not very useful or practical.

    Another alternative is to use a less pin-hungry, byte-wide design:
    • Use a byte-wide version of your current design.
    • Use 8xSPI SRAMs (256KB standard or 512KB ipsilog chips).
    To me, practicality trumps using many pins (there may be other, more practical solutions too). Byte-wide synchronous SPI SRAM will be a little slower than a byte-wide equivalent of your current design because of the address setup time, but the cache read/write burst time would be the same, and the footprint is very small. The SpinSocket-SRAM uses this design and is on Tubular's Smorgasboard just above his NINJA propeller board.
  • denominator Posts: 242
    edited 2012-04-21 10:35
    jazzed wrote: »
    I believe denominator (Ted) added a locking feature to the GCC caching system for purposes like this. I don't know all the details yet and haven't had a chance to try it myself - it will probably be undocumented for a while. Maybe Ted can jump in and explain how it works.

    A quick summary of the locking: You can activate the locking for select drivers - currently those that use the I2C or SPI bus, which includes the EEPROM, SD, and C3 cache drivers and the SD file system driver. You activate locking by passing in a Propeller lock id (allocated via locknew). After you activate locking, the driver will acquire a lock before accessing the SPI or I2C bus and then free it when it's done accessing the bus.
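
    In rough outline the usage pattern looks like this (only a sketch - cache_use_lock below is a placeholder name, not the real call; see the issue link below for the actual API):

    #include <propeller.h>
    
    int bus_lock;
    
    void init_sharing(void)
    {
        bus_lock = locknew();          // allocate a Propeller hardware lock
        cache_use_lock(bus_lock);      // placeholder: hand the lock id to the cache/SD drivers
    }
    
    // Any other cog or code sharing the same SPI/I2C pins does the same dance:
    void my_bus_access(void)
    {
        while (lockset(bus_lock))      // spin until we own the bus
            ;
        // ... use the shared SPI/I2C pins here ...
        lockclr(bus_lock);             // release so the drivers can get the bus back
    }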

    The locking is documented in detail in the Issues section of the Google code archive: http://code.google.com/p/propgcc/issues/detail?id=44, including how it works, descriptions of the new API calls, and test code; additionally there is a link to the repository deltas so you can see exactly what was changed. Let me know if something is missing or doesn't make sense.

    The biggest problem for Dr Acula might be the fact that when the cache driver grabs the bus, it is somewhat slow to release the bus for two reasons: (1) It grabs typically 32 bytes or more at a time, and (2) it grabs them using relatively slow code.

    Another idea for something that would work is replacing or augmenting the I2C EEPROMs with I2C FRAM; relatively minor additions to the I2C EEPROM "read-only" cache driver could turn it into an FRAM read/write cache driver. Serial FRAMs like the FM24V10 are available holding 1Mb (128K bytes) each. The downside: they are a bit expensive ($10) and not well stocked (though Future Electronics has them).
  • jazzed Posts: 11,803
    edited 2012-04-21 10:57
    The locking is documented in detail in the Issues section of the Google code archive: http://code.google.com/p/propgcc/issues/detail?id=44, including how it works, descriptions of the new API calls, and test code; additionally there is a link to the repository deltas so you can see exactly what was changed. Let me know if something is missing or doesn't make sense.

    Thanks for pointing that out Ted! I should have known to look there - pardon my lapse.

    As for FRAM, 100 trillion read/writes is probably useful and they are spec'd for the 3.4Mbps I2C speed.

    We could easily add an fram_cache PASM driver. It wouldn't be hard to write the driver as you've described it.
    FRAM cost would be roughly equivalent to an 8x SPI-SRAM design cost. It would be slower but probably practical.
  • denominator Posts: 242
    edited 2012-04-21 13:58
    jazzed wrote: »
    As for FRAM, 100 trillion read/writes is probably useful and they are spec'd for the 3.4Mbps I2C speed.

    I figured that if you hit them at 3.4MHz and did nothing else but read or write them full speed in block mode, they'd wear out after 7.4 years. The fact that they'll be used as cache backing probably adds 20x or more on to their life.
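
    (Presumably the arithmetic is: 3.4Mbit/s is roughly 425,000 byte accesses per second, and 10^14 endurance cycles / 425,000 per second is about 2.35x10^8 seconds, or roughly 7.5 years.)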
  • Dr_Acula Posts: 5,484
    edited 2012-04-21 18:26
    Thanks ++ for all the ideas there. Thanks denominator for that link. It is similar to what I want to do but not quite...

    Implementing an I2C SRAM is probably not very useful or practical.

    Hmm, I suspect you are right there.

    In general terms I would have thought that all serial solutions are slow. But at the same time, are all serial solutions of similar speed? Or is SPI a lot faster than I2C?

    Prof Braino sent me an interesting PM suggesting the program is stored in the SD card. Cache blocks are fetched as required. It sounds like that could work, except I did have a question: how do you handle a C program that happens to be in the middle of writing to the SD card when a cache request comes in? Close the write file, open for read, read the block, then reopen the write file at the old position?

    Thinking about this more: the board I have in front of me has two 512-kilobyte RAM chips on it, most of the time those chips are not being accessed, and access to them is very fast - 3 PASM instructions to get a byte - so is it just a software problem?

    It is kind of similar to the discussion in denominator's thread, except that in this case we don't just want access to the I2C or SPI bus. What we really want is total control of every Propeller pin. This stems from a brilliant suggestion from jazzed a few weeks ago: a more formal separation between groups of Propeller pins. You have a "group" of prop pins, which can be all of them if you like. Then you select another "group". Now you can emulate a Propeller with 64 pins. Or, with 8 groups selected by a latched 3-to-8 decoder (74HC237), 256 pins.

    The clever part is the handover. All drivers and cog code that affect pins must be formally shut down. Then all pins are made HiZ (which puts them into a high state via weak 100k pullups). Then the "group" is changed. Then the new cogs, drivers etc. are restarted. It sounds complicated, but in the end it comes down to one line of Spin code to stop a group and another line to start a group. I'm doing this many times a second while checking for a touch on the touchscreen and updating the display, because there are not enough Propeller pins to do both at the same time.
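
    As a rough sketch of the group switch in C (the '237 select-pin numbers here are made up, and this glosses over the latch enable - it is only meant to show the shape of the handover):

    #include <propeller.h>
    
    // Sketch only: assume the 74HC237 select lines sit on P24..P26 (made-up pins).
    #define GROUP_SHIFT  24
    #define GROUP_MASK   (7 << GROUP_SHIFT)
    
    void select_group(int group)
    {
        // the caller has already stopped every cog/driver that touches the pins
        DIRA = 0;                                    // all pins HiZ, pulled high by 100k resistors
        OUTA = (OUTA & ~GROUP_MASK) | ((group & 7) << GROUP_SHIFT);
        DIRA = GROUP_MASK;                           // drive only the '237 select lines
        // the caller now restarts the cogs/drivers that belong to the new group
    }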

    Essentially, the idea would be to put the GCC kernel to sleep. Then another cog (it could even be a cog running spin) can take over the propeller chip and all the propeller pins.

    So the steps might be:
    Shut down all open files on any SD cards.
    Shut down (or make inactive) any cogs that might be running code that will interfere with the pins you want control of.
    Set a flag (a long in hub) to tell the new cog you are about to start but not to do anything yet.
    Start up a cog that is going to take over the chip. This cog polls the flag above waiting for a green light to say GCC is asleep.
    Make all pins HiZ
    Request the location of a "restart" long from C, and pass this to the cog that will be taking over the chip.
    (All of the above are lines of C in the program, and it is up to the programmer to do things like close all open files. From here on though, this is C kernel code.)
    Call a routine in C that puts the kernel to sleep. The kernel goes into a loop checking the "restart" long to see if it has changed.
    Finally, change the flag to tell the new cog it can take over the chip.

    The new cog runs its code, maybe sending 70 kilobytes to the screen with a new picture.
    When it has finished, it also does a formal handover by putting all pins into HiZ. The last thing it does is change the "restart" long to tell the C kernel to restart.


    This all sounds complicated, but much of it is up to the C programmer, not GCC.

    The new routines I see as being necessary are
    i) A function to request the location of the "restart" long.
    ii) A function to put GCC to sleep until it detects a change in that "restart" long.
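
    To make that concrete, the handshake might look something like this in C (everything below is invented for illustration - none of these routines exist in GCC yet):

    #include <propeller.h>
    
    // Made-up names: these are the two routines (i) and (ii) asked for above.
    extern volatile int *xmm_get_restart_long(void);  // (i) where the "restart" long lives
    extern void xmm_sleep_until_restart(void);        // (ii) loop until that long changes
    
    static volatile int takeover_flag;                // long in hub RAM polled by the new cog
    
    void hand_over_the_chip(void)
    {
        // the caller has already closed files, stopped interfering cogs, and
        // passed xmm_get_restart_long() to the cog that is taking over
        DIRA = 0;                      // every pin HiZ, pulled high by the 100k resistors
        takeover_flag = 1;             // green light: the new cog may take the chip
        xmm_sleep_until_restart();     // kernel sleeps until the new cog writes the restart long
        // GCC resumes here and restarts its own drivers
    }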


    Why bother with all this?

    The two big advantages are that
    1) It gives GCC XMM fast SRAM access via a 16-bit parallel bus.
    2) It can combine the GUI elements of the touchscreen (already working: textbox, checkbox, radio button, buttons, picturebox) with program sizes larger than are possible with Spin.

    Do you think this would be hard to do?
  • David Betz Posts: 14,516
    edited 2012-04-21 19:54
    The biggest problem for Dr Acula might be the fact that when the cache driver grabs the bus, it is somewhat slow to release the bus for two reasons: (1) It grabs typically 32 bytes or more at a time, and (2) it grabs them using relatively slow code.
    Do you have any suggestions for faster cache line reading code? I guess Steve has used counters to improve the performance of SPI code. Is that what you would suggest?
  • jazzed Posts: 11,803
    edited 2012-04-21 21:35
    Dr_Acula wrote: »
    In general terms I would have thought that all serial solutions are slow. But at the same time, are all serial solutions of similar speed? Or is SPI a lot faster than I2C?

    SPI is generally much faster than I2C. SPI on the Propeller with PASM will be slower than SPI specs though.
    Dr_Acula wrote: »
    Prof Braino sent me an interesting PM suggesting the program is stored in the SD card.

    We can do this already.

    It's slower than using a single-bit SPI Flash though, because up to 512 bytes may need to be read to get a cache line from an SD card. SPI Flash just needs the start address and a burst read. I've often wondered about making a SPI Flash SD plug-in module for XMMC.

    Dr_Acula wrote: »
    Cache blocks are fetched as required. It sounds like that could work, except I did have a question: how do you handle a C program that happens to be in the middle of writing to the SD card when a cache request comes in? Close the write file, open for read, read the block, then reopen the write file at the old position?

    We have 2 ways to handle it. One is Ted's locking mechanism. The other, which takes more careful coding, is to use that GCC super power I mentioned a while ago that places key code exclusively in HUB RAM and the rest in the XMMC storage device.

    Dr_Acula wrote: »
    Thinking about this more: the board I have in front of me has two 512-kilobyte RAM chips on it, most of the time those chips are not being accessed, and access to them is very fast - 3 PASM instructions to get a byte - so is it just a software problem?

    In theory Ted's locking can be used.

    Dr_Acula wrote: »
    Essentially, the idea would be to put the GCC kernel to sleep. Then another cog (it could even be a cog running spin) can take over the propeller chip and all the propeller pins.

    Interesting. The GCC kernel could just run a function from HUB RAM and wait for a wake-up signal and then go back to normal operation until it's "put to sleep" again. You could use a second GCC kernel/thread running from HUB RAM to handle that LCD image.
  • denominator Posts: 242
    edited 2012-04-21 22:07
    David Betz wrote: »
    Do you have any suggestions for faster cache line reading code? I guess Steve has used counters to improve the performance of SPI code. Is that what you would suggest?

    David,

    Yes, exactly. If you look at the SPI code in FSRW's PASM routines it reads from the SPI device much faster using counters.

    - Ted
  • Dr_Acula Posts: 5,484
    edited 2012-04-21 22:20
    Interesting. The GCC kernel could just run a function from HUB RAM and wait for a wake-up signal and then go back to normal operation until it's "put to sleep" again. You could use a second GCC kernel/thread running from HUB RAM to handle that LCD image.

    Hey, that is cunning. Then the LCD image code can be in C too (compile C to a cog - I think you can do that already). Everything can be in C!

    Once everything is set up, this is the code that moves two bytes of data from ram chips to the display. It is pasm now, but could it be rewritten in C?
    ramtodisplay_loop       andn    outa,maskP17            ' ILI write low
                            or      outa,maskP17            ' ILI write high
                            andn    outa,maskP19            ' clock 161 low
                            or      outa,maskP19            ' clock 161 high
                            djnz    len,#ramtodisplay_loop
    

    If so, and it compiles to a small number of lines, write it all in C.
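
    As a straw man, a naive C version might look like this (just a sketch reusing the same pin masks - it certainly won't be as tight as the PASM):

    #include <propeller.h>
    
    #define MASK_P17 (1 << 17)          // ILI write strobe
    #define MASK_P19 (1 << 19)          // '161 counter clock
    
    // Sketch: clock 'len' transfers from the RAM/counters out to the display.
    void ramtodisplay(int len)
    {
        while (len-- > 0) {
            OUTA &= ~MASK_P17;          // ILI write low
            OUTA |=  MASK_P17;          // ILI write high
            OUTA &= ~MASK_P19;          // clock 161 low
            OUTA |=  MASK_P19;          // clock 161 high
        }
    }
    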
  • denominator Posts: 242
    edited 2012-04-21 22:44
    Dr_Acula wrote: »
    In general terms I would have thought that all serial solutions are slow. But at the same time, are all serial solutions of similar speed? Or is SPI a lot faster than I2C?

    The current SPI drivers appear to me to read and write using 5-instruction loops - translated, this means 4MHz at an 80MHz clock. The FRAM chip, as previously mentioned, can be clocked at 3.4MHz, but you'd have to use a 6-instruction loop, so it would really be 3.33MHz. So the I2C FRAM would be about 20% slower. In theory some SPI devices (such as SD cards) could be driven faster, but the current drivers don't.
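
    (The arithmetic: at 80MHz the Propeller runs 20 MIPS, so a 5-instruction loop gives 20/5 = 4Mbit/s and a 6-instruction loop gives 20/6 ≈ 3.33Mbit/s.)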
    Dr_Acula wrote: »
    Prof Braino sent me an interesting PM suggesting the program is stored in the SD card. Cache blocks are fetched as required. It sounds like that could work, except I did have a question: how do you handle a C program that happens to be in the middle of writing to the SD card when a cache request comes in? Close the write file, open for read, read the block, then reopen the write file at the old position?

    Steve didn't quite answer your question. First some background information:

    When an XMMC program is being run using the SD card as the cache backing store, the cache driver never "closes" the backing file. Here's what happens: Before the XMMC program is run, the loader runs an LMM program called sd_cache_loader. This program finds AUTORUN.PEX, loads a list of all the cluster addresses that make up the program, then places this "cluster map" right above the stack at the "top of memory". Then, sd_cache_loader loads the sd_cache driver and XMM kernel, and hands over control to the kernel. When the kernel faults on a particular cache line, it asks the sd_cache driver to go get it. The sd_cache uses sector-sized cache lines, so on a cache miss it always just needs to read an entire sector. To do this, it calculates the cluster needed, looks up the necessary cluster number in the "cluster map", calculates the necessary sector number, and reads it directly from the SD card into the cache.

    Also consider these design details:

    - You are not allowed to write to the currently running program (/AUTORUN.PEX)
    - The XMMC kernel runs C code in only a single cog; therefore any operation that reads/writes to the SD card can only happen after any cache misses needed to start the operation are completely fulfilled. (Note that in order to make this 100% true, a small bit of the SD card access code is constrained to always run in HUB RAM.)
    - The SD cache driver and the library SD driver do not conflict, as they read into separate buffers and the SD cache driver does not need to access any directories.
    - Note that in the most efficient mode for using the SD cache driver and library SD routines, you tell the library SD routines to use the SD cache driver for I/O. Either way, the result is the same - there is no conflict, thanks to the XMM kernel synchronizing access and the use of separate buffers.

    You asked "how you handle a C program that happens to be in the middle of writing to the sd card and then a cache request comes in the middle of that?"

    The answer: Given the above design, no additional work is necessary. Reading/writing to an SD card from a program being run from that SD card just works.
  • Dr_Acula Posts: 5,484
    edited 2012-04-22 01:01
    Ah, I think I follow you. So, if you have a loop of C program that happens to be cached into hub ram, it will never need the SD card or external memory.

    So - all you have to do is run a loop of code and in that loop, somehow detect that all outstanding cache reads and writes have happened. Then from that point on, the C program behaves as if it does not have access to external memory. So along can come another cog, take over those pins and do what it likes.

    So - is there a way of forcing all the cache to finalise? Or do you just wait a while?
  • David Betz Posts: 14,516
    edited 2012-04-22 06:16
    David,

    Yes, exactly. If you look at the SPI code in FSRW's PASM routines it reads from the SPI device much faster using counters.

    - Ted
    I haven't gotten my head wrapped around counters yet. I'll take a look at the FSRW code though as an example. Did you write that code?
  • denominator Posts: 242
    edited 2012-04-22 06:58
    Dr_Acula wrote: »
    Ah, I think I follow you. So, if you have a loop of C program that happens to be cached into hub ram, it will never need the SD card or external memory.

    So - all you have to do is run a loop of code and in that loop, somehow detect that all outstanding cache reads and writes have happened. Then from that point on, the C program behaves as if it does not have access to external memory. So along can come another cog, take over those pins and do what it likes.

    So - is there a way of forcing all the cache to finalise? Or do you just wait a while?

    Yes and no. Yes, if you have a loop and the compiler thinks it can "fcache" it, it will. The entire loop will be paged in from the store into HUB ram before executing and C behaves as if it does not have access to external memory, just as you say.

    No, this is not necessary to make the SD library and SD cache play together. The magic is thus:

    1) The lowest-level library routine that accesses the SD card is a very small routine that just dumps a value in a mailbox and waits for the mailbox to be cleared. This routine is forced to live in hub ram - you can make this happen for any routine by decorating the routine definition with __attribute__((section(".hubtext"))). Note the mailbox is connected to a PASM driver in a separate cog: either the SD library driver or the SD cache driver, as specified by you, the programmer.

    2) Because we're being called by a C routine that is already in memory, any cache paging has by definition completed.

    Forcing a routine into hub ram is the easiest and only way I know to make sure there will be no caching.
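
    The shape of that lowest-level routine in (1) is roughly this (a sketch - the name and mailbox layout are invented here, not the actual library code):

    #include <stdint.h>
    
    // Sketch only: a hub-resident stub that talks to the driver cog through a mailbox.
    __attribute__((section(".hubtext")))
    static void sd_mailbox_request(volatile uint32_t *mbox, uint32_t cmd)
    {
        *mbox = cmd;                   // post the request to the PASM driver cog
        while (*mbox != 0)             // wait for the driver to clear the mailbox
            ;                          // nothing here can fault into external memory
    }
    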
  • denominator Posts: 242
    edited 2012-04-22 07:09
    David Betz wrote: »
    I haven't gotten my head wrapped around counters yet. I'll take a look at the FSRW code though as an example. Did you write that code?

    No - it was written by Jonathan Dummer. It appears he's using counter B to generate the clock, and counter A to accumulate the data. He has counter B generating a clock cycle per instruction cycle and this is output to the SPI device's clock in. Each instruction cycle he's changing the counter A's FRQA (accumulator increment value) by shifting it one place, and the counter input is tied to the SD card's data out. Thus he's marching each bit into place, and the result of reading shows up in PHSA. Pretty cool - you get one bit input per instruction cycle. He does something similar on output.
  • David Betz Posts: 14,516
    edited 2012-04-22 07:42
    No - it was written by Jonathan Dummer. It appears he's using counter B to generate the clock, and counter A to accumulate the data. He has counter B generating a clock cycle per instruction cycle and this is output to the SPI device's clock in. Each instruction cycle he's changing the counter A's FRQA (accumulator increment value) by shifting it one place, and the counter input is tied to the SD card's data out. Thus he's marching each bit into place, and the result of reading shows up in PHSA. Pretty cool - you get one bit input per instruction cycle. He does something similar on output.
    Sounds pretty cool! Maybe I'll copy his code! Of course, I'll give him credit for it. Isn't the MIT license nice! :-)
  • jazzed Posts: 11,803
    edited 2012-04-22 12:37
    Dr_Acula wrote: »
    Ah, I think I follow you. So, if you have a loop of C program that happens to be cached into hub ram, it will never need the SD card or external memory.


    It's not about a cached piece. You tell the C function to live in HUB RAM with the HUBTEXT attribute. Example ....
    /**
     * @brief Wait for *addr to equal value before return.
     * @details This function lives in HUB RAM even with XMM memory models like XMMC, XMM-SPLIT, or XMM-SINGLE.
     * @details The function doesn't try to yield or use propeller wait - it simply looks for a match.
     * @param addr : memory location to watch.
     * @param value : value for contents of addr to attain before exiting function.
     * @returns nothing
     */
    HUBTEXT void simpleWaitForValue(int* addr, int value)
    {
        while(*addr != value)
            ; // wait while contents of addr != value
    }
    


    A pretty, documented version of the function is here: http://www.microcsource.com/hubtext/wait_8c.html
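
    In your handover scenario the call might be as simple as this (the flag name is made up):

    static volatile int restart;                 // long in hub RAM written by the other cog
    
    void wait_for_takeover_done(void)
    {
        simpleWaitForValue((int *)&restart, 1);  // park here until the other cog writes 1
    }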

    Dr_Acula wrote: »
    So - all you have to do is run a loop of code and in that loop, somehow detect that all outstanding cache reads and writes have happened. Then from that point on, the C program behaves as if it does not have access to external memory. So along can come another cog, take over those pins and do what it likes.

    So - is there a way of forcing all the cache to finalise? Or do you just wait a while?


    The cache line operation is atomic. It always must complete. You have no control over that.
  • Dr_Acula Posts: 5,484
    edited 2012-04-22 16:15
    You tell the C function to live in HUB RAM with the HUBTEXT attribute. Example ....

    Neat, and very simple! Amazingly, that solves the entire problem. Now the touchscreen memory can also be used as XMM program memory. I like the flexibility of GCC already :)