External Memory Model?

localroger · 2009-08-03 15:17

Conversation about mainframe computers in another thread, and refactoring my own in-house I/O project, got me thinking -- generally a dangerous thing...

In the LMM, jump instructions are actually calls to a little bitty subroutine that resets the PC. So OK, says I to myself says I, suppose instead of just stupidly fetching the next long and using that to set the PC, we have that subroutine automatically page blocks of RAM to external storage, such as a swap file on an SD card or another prop serving as RAM controller? Within a page you would get LMM speed with a slight lookup hit on jumps. On non-timing-critical code -- such as most user interface and application logic code -- the occasional paging bump would hardly be noticeable. You put the timing-critical stuff like video and serial drivers in Hub RAM and compile everything else to paged LMM.

Here's the totally eeeeevil thing... assuming you make a swap file on an SD card that's contiguous and you provide a way for boot code to find the starting block, one of the things you can page out is the FAT file system. All that has to live permanently in the hub is timing-critical stuff like the block drivers.

I am thinking that one might target a one megabyte memory space using 2048 pages of 512 bytes. This would require a permanent 2,048-entry table linking each page to its buffer (if any), which adds just one lookup for a jump or fetch to a cached page. You'd also obviously need a reverse table for however many buffers there's room for in hub RAM (I'm thinking that with all the other code there would probably be room for at least 32 pages). You'd obviously need dirty flags and ageing counters. You could even build a two-level system with both an SD card (slow but nonvolatile) and RAM controller (fast) and it would not change the application code at all.
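
The bookkeeping described above can be modelled in a few lines. This is only a sketch in Python, not Propeller code: the class name `PagedMemory`, the `NOT_RESIDENT` sentinel, the use of a bytearray as the backing store, and the "ageing counter as last-used timestamp" policy are all assumptions made for illustration. Only the 2048 x 512-byte geometry, the forward and reverse tables, and the dirty flags come from the post.

```python
PAGE_SIZE = 512        # bytes per page, from the post
NUM_PAGES = 2048       # 2048 * 512 bytes = 1 MB virtual space
NUM_BUFFERS = 32       # hub buffers ("at least 32" in the post)
NOT_RESIDENT = -1      # sentinel chosen for this sketch


class PagedMemory:
    def __init__(self, backing):
        self.backing = backing                            # stands in for SD card or RAM prop
        self.page_to_buf = [NOT_RESIDENT] * NUM_PAGES     # forward table: page -> buffer
        self.buf_to_page = [NOT_RESIDENT] * NUM_BUFFERS   # reverse table: buffer -> page
        self.dirty = [False] * NUM_BUFFERS
        self.age = [0] * NUM_BUFFERS                      # last-used tick per buffer
        self.buffers = [bytearray(PAGE_SIZE) for _ in range(NUM_BUFFERS)]
        self.clock = 0
        self.faults = 0

    def _load(self, page):
        # Prefer a free buffer; otherwise evict the least recently used one.
        if NOT_RESIDENT in self.buf_to_page:
            buf = self.buf_to_page.index(NOT_RESIDENT)
        else:
            buf = min(range(NUM_BUFFERS), key=lambda b: self.age[b])
            old = self.buf_to_page[buf]
            if self.dirty[buf]:
                start = old * PAGE_SIZE
                self.backing[start:start + PAGE_SIZE] = self.buffers[buf]
            self.page_to_buf[old] = NOT_RESIDENT
        start = page * PAGE_SIZE
        self.buffers[buf][:] = self.backing[start:start + PAGE_SIZE]
        self.buf_to_page[buf] = page
        self.page_to_buf[page] = buf
        self.dirty[buf] = False
        self.faults += 1
        return buf

    def _resolve(self, addr):
        # One table lookup on the fast path, as described in the post.
        page, offset = divmod(addr, PAGE_SIZE)
        buf = self.page_to_buf[page]
        if buf == NOT_RESIDENT:
            buf = self._load(page)
        self.clock += 1
        self.age[buf] = self.clock
        return buf, offset

    def read(self, addr):
        buf, offset = self._resolve(addr)
        return self.buffers[buf][offset]

    def write(self, addr, value):
        buf, offset = self._resolve(addr)
        self.buffers[buf][offset] = value
        self.dirty[buf] = True
```

A resident access costs one index into `page_to_buf`; only a miss touches the backing store, and a dirty victim is written back before its buffer is reused.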

So for the people who've done LMM code -- does this seem feasible, or did I drink too much coffee this morning?

jazzed · 2009-08-03 16:10

It is entirely possible if you have room in the LMM interpreter COG ... I had to optimize the heck out of the ImageCraft kernel just to use it for single COG long-at-a-time fetch ... multi-cog fetch is easier. I'd bet Bill Henning is already doing it :) I'm trying to do something similar with Spin.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Propalyzer: Propeller PC Logic Analyzer
http://forums.parallax.com/showthread.php?p=788230

localroger · 2009-08-03 16:44

I've done some more thinking about this, and I think the end result should remain significantly faster than Spin overall, with in-cache jumps being more Spin-like in performance.

The way I am thinking of approaching this is to add a page server to sdspiqasm; there's plenty of room in there for more functionality, it's already been hacked to support SDHC, and it only takes one cog because all the Spin code goes away once you get the card mounted. And you want SD access to be single-threaded. Then the LMM interpreter used by each individual cog just has to know enough to drop requests to the page server.
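
One way to picture "dropping requests to the page server" is a mailbox slot per client cog that the single server cog polls. The sketch below is a Python model of that idea only; the `PageServer` class, the `IDLE` marker, the one-slot-per-cog layout and the `fetch_page` callback are hypothetical and are not part of sdspiqasm.

```python
IDLE = -1   # mailbox value meaning "no request pending" (assumption)


class PageServer:
    def __init__(self, num_cogs, fetch_page):
        self.mailbox = [IDLE] * num_cogs   # page number requested by each cog
        self.reply = [None] * num_cogs     # page data handed back to each cog
        self.fetch_page = fetch_page       # the only code that touches the card

    def request(self, cog, page):
        # Called by an LMM interpreter on a page fault.
        self.mailbox[cog] = page

    def poll(self):
        # One pass of the server loop: serve pending requests in cog order,
        # so card access stays single-threaded.
        served = 0
        for cog, page in enumerate(self.mailbox):
            if page != IDLE:
                self.reply[cog] = self.fetch_page(page)
                self.mailbox[cog] = IDLE
                served += 1
        return served
```

On real hardware the mailbox would presumably be a few longs in hub RAM and the requesting cog would spin until its slot returned to idle.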

jazzed · 2009-08-03 17:06

As long as it's not dog slow, performance won't matter much in some environments; of course it will matter a lot in others.

If you add a page server, that will be useful for my project too when I get to adding the SD device. For now I'm just using EEPROM buffers for fetch on "code page faults," and eventually I'll use block mode XMM for better performance, at a cost of course. I'll eventually need an intermediate driver for abstraction, one that is passed an address, buffer, operation, and length and doesn't have to worry about the underlying device.
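
A device-independent request of that shape might look like the following. This is a Python sketch under stated assumptions: the names `Op`, `Request`, `RamDevice` and `transfer` are invented for illustration, and the in-memory device merely stands in for EEPROM, SD, or XMM back ends.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    READ = 0
    WRITE = 1


@dataclass
class Request:
    address: int        # byte address on the device
    buffer: bytearray   # caller-owned transfer buffer
    op: Op
    length: int         # number of bytes to move


class RamDevice:
    """In-memory stand-in; a real back end would expose the same transfer()."""

    def __init__(self, size):
        self.data = bytearray(size)

    def transfer(self, req):
        end = req.address + req.length
        if req.address < 0 or end > len(self.data) or req.length > len(req.buffer):
            raise ValueError("request out of range")
        if req.op is Op.READ:
            req.buffer[:req.length] = self.data[req.address:end]
        else:
            self.data[req.address:end] = req.buffer[:req.length]
```

The caller only ever builds a `Request`, so swapping the device behind `transfer()` would not change the calling code.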

Any idea how the swap file will be organized? Swap code exists already of course, but it's likely to be huge.

Bill Henning · 2009-08-03 17:12

I've played with the idea of a "proper" virtual memory version of the LMM interpreter; and as long as only the text segment (code+constants) was virtualized, performance could be quite decent.

I was thinking of using 16-bit words for page entries, laid out like this:

bit 15 = "present" - 0=on disk, 1=in hub
bit 14 = "dirty" (in case i decided to play with virtual data segments)
bits 0-13 = address A9-A23 with 512 byte pages (max 16MB virtual memory), or A8-A22 with 256 byte pages (max 8MB virtual memory)
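
As a rough illustration of the entry layout listed above, here is a Python sketch that packs and unpacks such a 16-bit word. The helper names are made up for this example. It takes the bit positions at face value: a 14-bit field can number 2^14 = 16384 pages, which is 8 MB of 512-byte pages or 4 MB of 256-byte pages, so the larger figures quoted would need one more bit.

```python
PRESENT = 1 << 15            # bit 15: 1 = in hub, 0 = on disk
DIRTY = 1 << 14              # bit 14: page modified since it was loaded
PAGE_MASK = (1 << 14) - 1    # bits 0-13: virtual page number


def pack_entry(page, present=False, dirty=False):
    """Build a 16-bit page entry from a page number and two flags."""
    if not 0 <= page <= PAGE_MASK:
        raise ValueError("page number does not fit in 14 bits")
    return page | (PRESENT if present else 0) | (DIRTY if dirty else 0)


def unpack_entry(entry):
    """Return (page, present, dirty) for a 16-bit page entry."""
    return entry & PAGE_MASK, bool(entry & PRESENT), bool(entry & DIRTY)


def page_of(addr, page_size=512):
    """Virtual page number containing a byte address."""
    return addr // page_size
```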

The reason I was considering 256 byte pages is that it may be more useful to have more small pages resident in the hub (the working set) than fewer larger pages - remember, I am thinking in the Largos context of multiple applications running on the Propeller at once.

I was thinking of using up to 24KB in the hub for the working set (combined for all processes; the goal for Largos is to fit in 8KB) and sacrificing a bit of time in order to have a sparse virtual map. I am thinking of having a 48-entry reverse page table that would have to be searched to see if the page is resident, and a 48-entry "use" count table that would be used to select candidate pages for eviction by an LRU page replacement algorithm.

So in the case of an FCALL or FJMP, the kernel would search the reverse page table to see if the page is present (wasting a bit of time on the 48-entry search, but winning big on the amount of hub memory that a full 8MB/16MB page table would otherwise use).
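
The search-then-evict behaviour can be sketched as follows. This is a Python model, not kernel code: `ResidentSet` and its methods are hypothetical names, and evicting the slot with the lowest use count is only a rough stand-in for true LRU, chosen here because the post mentions a per-slot "use" count.

```python
SLOTS = 48      # reverse page table size from the post
EMPTY = None    # marker for an unused slot (assumption)


class ResidentSet:
    def __init__(self):
        self.pages = [EMPTY] * SLOTS   # reverse table: slot -> virtual page
        self.use = [0] * SLOTS         # per-slot use counts

    def find(self, page):
        # Linear search: the time cost accepted in exchange for a tiny table.
        for slot in range(SLOTS):
            if self.pages[slot] == page:
                return slot
        return None

    def touch(self, page):
        """Return (slot, faulted) for a far call or jump into `page`."""
        slot = self.find(page)
        if slot is not None:
            self.use[slot] += 1
            return slot, False
        slot = self._victim()
        self.pages[slot] = page        # a real kernel would load the page here
        self.use[slot] = 1
        return slot, True

    def _victim(self):
        # Use an empty slot if there is one, else the least-used slot.
        for slot in range(SLOTS):
            if self.pages[slot] is EMPTY:
                return slot
        return min(range(SLOTS), key=lambda s: self.use[s])
```

With 48 slots and 24KB of working set, each slot would hold one 512-byte page; the table itself is only 48 entries regardless of how large the virtual space is.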

Now that I have Morpheus with a LOT of memory, and with 4GB SDHC cards being so cheap, I have given some thought to virtualizing XMM... to give a full 4GB flat address space. This time I would use "classical" page tables, directly indexed by the upper virtual address bits, and I am thinking of 8KB to 32KB as being a good page size.
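
To get a feel for what a directly indexed table costs at those page sizes, here is a small Python calculation. The 2-byte entry size is an assumption carried over from the 16-bit entries discussed earlier, and the function names are made up for this sketch.

```python
ADDRESS_SPACE = 1 << 32   # 4 GB flat virtual space, from the post
ENTRY_BYTES = 2           # assumption: 16-bit page table entries


def table_size(page_size):
    """Return (entries, bytes) for a table indexed by virtual page number."""
    entries = ADDRESS_SPACE // page_size
    return entries, entries * ENTRY_BYTES


def split(addr, page_size):
    """Split an address into (page index, offset); page_size is a power of two."""
    shift = page_size.bit_length() - 1
    return addr >> shift, addr & (page_size - 1)
```

Under these assumptions, 8KB pages need 524,288 entries (1 MB of table) and 32KB pages need 131,072 entries (256 KB), so either table is larger than the 32K hub and would presumably have to live in external memory.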

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Please use mikronauts _at_ gmail _dot_ com to contact me off-forum, my PM is almost totally full
Morpheus & Mem+ Advanced dual Propeller SBC with XMM and 256 Color VGA - PCB, kit, A&T available NOW!
www.mikronauts.com - my site 6.250MHz custom Crystals for running Propellers at 100MHz
Las - Large model assembler for the Propeller Largos - a feature full nano operating system for the Propeller

localroger · 2009-08-03 17:39

@Bill -- Thanks for chipping in, now I know I'm not crazy. I'm thinking of something a bit more modest, as I said maybe targeting 1 megabyte application space with only the Prop and SD. The reason is that I am aiming at a particular type of application, not really an OS, which will be self-contained, and SD support is essential to that app anyway. I have certain code that is very timing dependent, but most of the application logic involves human I/O and is not timing critical. I'm looking at using about half the 32K Hub RAM for stuff that absolutely has to be done there, and was wondering how to ensure that I won't run out of space as I add actual application code to all the device drivers and timers.

@Steve -- I'm thinking the swap file will just be a flat block of sectors which must be contiguous, reserved in the filesystem as a read-only file; once the swapper is told what the first sector is, no more filesystem logic would be necessary to control it. It would of course be possible to include lots of such images on an SD card, but I'm also thinking it would be useful as a fallback plan to use a second prop with parallel RAM chip instead of (or with) the SD, for the 10x gain in performance.
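
Because the file is contiguous, finding a page on the card reduces to arithmetic on the first sector number. A minimal Python sketch, assuming 512-byte SD sectors and using a made-up `SwapFile` class with illustrative sector numbers:

```python
SECTOR_SIZE = 512   # bytes per SD block
PAGE_SIZE = 512     # page size from earlier in the thread


class SwapFile:
    def __init__(self, first_sector, num_pages, page_size=PAGE_SIZE):
        if page_size % SECTOR_SIZE:
            raise ValueError("page size must be a multiple of the sector size")
        self.first_sector = first_sector   # found once by the boot code
        self.num_pages = num_pages
        self.page_size = page_size

    def sectors_for(self, page):
        """Return the range of absolute sector numbers holding `page`."""
        if not 0 <= page < self.num_pages:
            raise IndexError("page outside swap file")
        per_page = self.page_size // SECTOR_SIZE
        start = self.first_sector + page * per_page
        return range(start, start + per_page)
```

For example, with the file starting at sector 8192, page 0 lives in sector 8192 and page 2047 in sector 10239; no FAT lookup is involved after boot.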

Post Edited (localroger) : 8/3/2009 5:44:57 PM GMT

Bill Henning · 2009-08-03 18:14

No, you are not crazy - old mainframes were much slower than such a Propeller VM/LMM scheme would be.

If I were you, I'd do a two-stage boot: stage one in the EEPROM is just enough to load the drivers and start them one at a time, and then it loads your main app. This way there's no need to re-use device images in hub RAM.

Of course, a second prop with fast parallel RAM is a great way to go, which is why Morpheus has it.

Hmm... I wonder if Morpheus+Mem+ would not meet all your basic hardware needs...

jazzed · 2009-08-03 18:38

I'll have great trouble fitting Morpheus into a 1.5"x2.1"x0.9" space (mm 38x53x23) :)

Oops, I meant 1.5"x2.1"x0.09" (mm 38x53x2.3).

At some point though, 8 COGs is just not enough for most people, especially if 3 or more are used just for memory management.

Post Edited (jazzed) : 8/3/2009 6:59:27 PM GMT

Bill Henning · 2009-08-03 18:45

LOL! no kidding!

localroger · 2009-08-03 18:50

@jazzed -- as I'm thinking of it only one cog would be used for memory management, and that cog could also be running the SD card block drivers, which need a cog anyway. Each LMM thread would be responsible for its own in-cache switching, and there is no advantage to a cog separate from the SD drivers if they have to go to the SD card to get pages anyway. I'm playing with code right now and it's starting to look *elegant*.

The typical instrument I'm aiming to replace has a 40 MHz 32-bit processor, no mass storage, and 256K or 1 Mb each firmware and RAM. (There is some variance between manufacturers.) A lot of that firmware is bloat, as not too long ago instruments with 64K or less firmware were common, but I was starting to worry that after blowing half the hub on drivers I wouldn't have enough left for application logic, even in Spin.

This fixes that. Worst case scenario is I throw the second prop at it for RAM buffering, and the beautiful thing about that is if I decide to go that route, it's completely transparent to whatever application code I've written.

I also have a homemade VB-like compiler I could easily adapt to spit out the code.

jazzed · 2009-08-03 19:19

@localroger, sounds like a nice project ... you'll get more feedback on it once the emulator crew wakes up :)

What I meant by 3 or more COGS for memory management was using 3 + 1 for parallel SRAM memory access for 20MB/s burst with sub 25ns SRAM and VMM function like swap. kuroneko has a 5 cog design that could do near 40MB/s burst with sub 12.5ns SRAM. With kuroneko's design for example you get only one COG for other drivers (assuming additional COGS used for VMM and LMM). With a 2 Propeller design that left-over COG would presumably be used for communicating with the 2nd Propeller. I think there is another 2 Propeller possibility that can get 40MB/s burst and have more COGs left over for other tasks, but it's just an idea right now and I'm trying very hard to focus on the BigSpinVMM(tm) project.

localroger · 2009-08-03 19:33

@Jazzed -- yeah, I don't need anything like that level of performance. I'll be logging a stream of operational data that probably won't ever amount to more than a few hundred bytes/sec, but I also want to maintain a fairly elaborate user interface and provide simple web reports and controls over ethernet. Those functions promise to need some elaborate logic but don't need to be very fast.

So far it's looking like I'll need 16 instructions or so to fake a jump. Currently playing with different ways to represent the buffers to see if I can optimize that a bit.

Bill Henning · 2009-08-03 21:29

LOL!

And here I was, foolishly thinking that so far only I figured out how to do 40MB/sec!

Fine, I'll let another cat out of the bag.

Allow me to introduce my SOJ36DIP32 adapter - specifically made to allow use of fast SOJ36 SRAMs on Morpheus, Mem+ and any other board that takes JEDEC standard DIP32 memory.

$7.95 USD + S/H for a pack of four, $24.95 USD + S/H for a pack of sixteen, higher quantities available
