External Memory Model?
localroger
Posts: 3,452
Conversation about mainframe computers in another thread, and refactoring my own in-house I/O project, got me thinking -- generally a dangerous thing...
In the LMM, jump instructions are actually calls to a little bitty subroutine that resets the PC. So OK, says I to myself says I, suppose instead of just stupidly fetching the next long and using that to set the PC, we have that subroutine automatically page blocks of RAM to external storage, such as a swap file on an SD card or another prop serving as RAM controller? Within a page you would get LMM speed with a slight lookup hit on jumps. On non-timing-critical code -- such as most user interface and application logic code -- the occasional paging bump would hardly be noticeable. You put the timing-critical stuff like video and serial drivers in Hub RAM and compile everything else to paged LMM.
Here's the totally eeeeevil thing... assuming you make a swap file on an SD card that's contiguous and you provide a way for boot code to find the starting block, one of the things you can page out is the FAT file system. All that has to live permanently in the hub is timing-critical stuff like the block drivers.
I am thinking that one might target a one megabyte memory space using 2048 pages of 512 bytes. This would require a permanent 2,048 entry table linking each page to its buffer (if any), which adds just one lookup for a jump or fetch to a cached page. You'd also obviously need a reverse table for however many buffers there's room for in hub RAM (I'm thinking that with all the other code there would probably be room for at least 32 pages). You'd obviously need dirty flags and ageing counters. You could even build a two-level system with both an SD card (slow but nonvolatile) and RAM controller (fast) and it would not change the application code at all.
So for the people who've done LMM code -- does this seem feasible, or did I drink too much coffee this morning?
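To make the bookkeeping above concrete, here is a minimal Python model of the idea. It is only a sketch of the logic, not Propeller code: the class name `PagedMemory`, its fields, and the choice of "least recently used" as the meaning of the ageing counters are all assumptions for illustration. The swap file is stood in for by a plain `bytearray`.

```python
PAGE_SIZE = 512        # bytes per page, as proposed in the post
NUM_PAGES = 2048       # 1 MB address space / 512-byte pages
NUM_BUFFERS = 32       # hub buffers the post guesses there is room for
NOT_RESIDENT = -1


class PagedMemory:
    """Hypothetical model: forward table, reverse table, dirty flags, ageing."""

    def __init__(self, backing, num_buffers=NUM_BUFFERS):
        self.backing = backing                            # stands in for the swap file
        self.page_to_buf = [NOT_RESIDENT] * NUM_PAGES     # permanent forward table
        self.buf_to_page = [NOT_RESIDENT] * num_buffers   # reverse table
        self.dirty = [False] * num_buffers
        self.last_used = [0] * num_buffers                # ageing counters
        self.buffers = [bytearray(PAGE_SIZE) for _ in range(num_buffers)]
        self.clock = 0
        self.faults = 0

    def _load(self, page):
        # Prefer a free buffer; otherwise evict the least recently used one.
        if NOT_RESIDENT in self.buf_to_page:
            buf = self.buf_to_page.index(NOT_RESIDENT)
        else:
            buf = min(range(len(self.last_used)), key=self.last_used.__getitem__)
            old = self.buf_to_page[buf]
            if self.dirty[buf]:
                # Write the evicted page back before reusing its buffer.
                start = old * PAGE_SIZE
                self.backing[start:start + PAGE_SIZE] = self.buffers[buf]
            self.page_to_buf[old] = NOT_RESIDENT
        start = page * PAGE_SIZE
        self.buffers[buf][:] = self.backing[start:start + PAGE_SIZE]
        self.buf_to_page[buf] = page
        self.page_to_buf[page] = buf
        self.dirty[buf] = False
        self.faults += 1
        return buf

    def _resolve(self, addr):
        # One table lookup on the fast path; a page load only on a miss.
        page, offset = divmod(addr, PAGE_SIZE)
        buf = self.page_to_buf[page]
        if buf == NOT_RESIDENT:
            buf = self._load(page)
        self.clock += 1
        self.last_used[buf] = self.clock
        return buf, offset

    def read_byte(self, addr):
        buf, offset = self._resolve(addr)
        return self.buffers[buf][offset]

    def write_byte(self, addr, value):
        buf, offset = self._resolve(addr)
        self.buffers[buf][offset] = value
        self.dirty[buf] = True
```

A faked jump would go through the same `_resolve` step as a data fetch: a hit costs one table lookup, and a miss costs a page load plus a possible write-back.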
Comments
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Propalyzer: Propeller PC Logic Analyzer
http://forums.parallax.com/showthread.php?p=788230
The way I am thinking of approaching this is to add a page server to sdspiqasm; there's plenty of room in there for more functionality, it's already been hacked to support SDHC, and it only takes one cog because all the Spin code goes away once you get the card mounted. And you want SD access to be single-threaded. Then the LMM interpreter used by each individual cog just has to know enough to drop requests to the page server.
If you add a page server, that will be useful for my project too when I get to adding the SD device. For now I'm just using EEPROM buffers for fetch on "code page faults," and eventually I'll use block mode XMM for better performance, at a cost of course. Eventually I'll need an intermediate driver as an abstraction layer: pass it an address, buffer, operation, and length, and not worry about the underlying device.
Any idea how the swap file will be organized? Swap code exists already of course, but it's likely to be huge.
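The intermediate driver mentioned above could look something like the following Python sketch. Everything named here (`Op`, `Request`, `RamDevice`, `submit`) is hypothetical and only illustrates the "address, buffer, operation, length" interface; a real driver would sit in front of EEPROM, XMM, or SD rather than a `bytearray`.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    READ = 0
    WRITE = 1


@dataclass
class Request:
    address: int        # byte address on the device
    buffer: bytearray   # caller's buffer (hub RAM in the real system)
    operation: Op
    length: int         # number of bytes to transfer


class RamDevice:
    """Stand-in backend; an EEPROM or SD backend would expose the same submit()."""

    def __init__(self, size):
        self.data = bytearray(size)

    def submit(self, req):
        end = req.address + req.length
        if req.address < 0 or end > len(self.data) or req.length > len(req.buffer):
            raise ValueError("request out of range")
        if req.operation is Op.READ:
            req.buffer[:req.length] = self.data[req.address:end]
        else:
            self.data[req.address:end] = req.buffer[:req.length]
```

The caller only ever builds a `Request`, so swapping the backend would not touch the paging code.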
I was thinking of using 16 bit words for page entries, laid out as:
bit 15 = "present" - 0=on disk, 1=in hub
bit 14 = "dirty" (in case i decided to play with virtual data segments)
bits 0-13 = address A9-A23 with 512 byte pages (max 16MB virtual memory), or A8-A22 with 256 byte pages (max 8MB virtual memory)
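As a sketch of that 16-bit layout, the helpers below pack and unpack an entry in Python. The function names and the choice to call the low field a "page number" are assumptions. Note that a 14-bit field can distinguish 2^14 = 16,384 pages, which works out to 8 MB at 512 bytes per page or 4 MB at 256 bytes; the larger figures quoted above would need one more bit than this layout leaves free.

```python
PRESENT_BIT = 1 << 15          # 1 = in hub, 0 = on disk
DIRTY_BIT = 1 << 14            # page modified since it was loaded
PAGE_MASK = (1 << 14) - 1      # bits 0-13: upper virtual address bits


def pack_entry(present, dirty, page):
    """Build a 16-bit page entry from its three fields."""
    if not 0 <= page <= PAGE_MASK:
        raise ValueError("page number does not fit in 14 bits")
    entry = page
    if present:
        entry |= PRESENT_BIT
    if dirty:
        entry |= DIRTY_BIT
    return entry


def unpack_entry(entry):
    """Return (present, dirty, page) for a 16-bit page entry."""
    return bool(entry & PRESENT_BIT), bool(entry & DIRTY_BIT), entry & PAGE_MASK


def addressable_bytes(page_size):
    """Virtual memory reachable with a 14-bit page field."""
    return (PAGE_MASK + 1) * page_size
```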
The reason I was considering 256 byte pages is that it may be more useful to have more small pages resident in the hub (the working set) than fewer larger pages - remember, I am thinking in the Largos context of multiple applications running on the Propeller at once.
I was thinking of using up to 24KB in the hub for the working set (combined for all processes; the goal for Largos is to fit in 8KB) and sacrificing a bit of time in order to have a sparse virtual map. I am thinking of having a 48 entry reverse page table that would have to be searched to see if the page is resident, and a 48 entry "use" count that would be used to select candidate pages for eviction under an LRU page replacement algorithm.
So in the case of an FCALL or FJMP, the kernel would check through the page table to see if the page is present (wasting a bit of time on the 48 entry search, but winning big on the amount of hub memory used to implement an 8MB/16MB page table).
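The 48-entry search and eviction choice described above might be modelled like this in Python. The class `WorkingSet` and its methods are hypothetical, and the "use" count is implemented here as a last-used timestamp, which is one common way to approximate LRU and may not be what the kernel would actually do.

```python
RESIDENT_SLOTS = 48    # 48 x 512-byte pages = 24 KB of hub RAM


class WorkingSet:
    def __init__(self, slots=RESIDENT_SLOTS):
        self.pages = [None] * slots   # reverse table: slot -> virtual page number
        self.use = [0] * slots        # per-slot "use" stamp for LRU selection
        self.tick = 0

    def _touch(self, slot):
        self.tick += 1
        self.use[slot] = self.tick

    def find(self, page):
        """Linear search of the reverse table; returns the slot or None."""
        for slot, resident in enumerate(self.pages):
            if resident == page:
                self._touch(slot)
                return slot
        return None

    def victim(self):
        """Pick an empty slot if there is one, else the least recently used."""
        for slot, resident in enumerate(self.pages):
            if resident is None:
                return slot
        return min(range(len(self.pages)), key=lambda s: self.use[s])

    def install(self, page):
        """Make `page` resident; returns (slot, evicted_page_or_None)."""
        slot = self.victim()
        evicted = self.pages[slot]
        self.pages[slot] = page
        self._touch(slot)
        return slot, evicted
```

On an FCALL or FJMP the kernel would call something like `find` first and fall back to `install` (plus the actual page transfer) only on a miss.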
Now that I have Morpheus with a LOT of memory, with the price of 4GB SDHC cards being so cheap, I have given some thought to virtualizing XMM... to give a full 4GB flat address space. This time I would use "classical" page tables, directly indexed by the upper virtual address bits, and I am thinking of 8KB to 32KB as being a good page size.
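For the directly indexed ("classical") table, the trade-off between page size and table length is simple arithmetic. The Python sketch below assumes a 32-bit virtual address and power-of-two page sizes; the function names are made up for illustration.

```python
ADDRESS_BITS = 32   # 4 GB flat virtual address space


def table_entries(page_size):
    """Number of page table entries needed to cover the whole address space."""
    return (1 << ADDRESS_BITS) // page_size


def split_address(addr, page_size):
    """Split a virtual address into (page index, offset within the page)."""
    shift = page_size.bit_length() - 1    # log2 of a power-of-two page size
    return addr >> shift, addr & (page_size - 1)


for size_kb in (8, 16, 32):
    print(size_kb, "KB pages:", table_entries(size_kb * 1024), "entries")
```

With 8 KB pages the table has 524,288 entries and with 32 KB pages it has 131,072, so larger pages shrink the table at the cost of moving more data per page fault.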
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Please use mikronauts _at_ gmail _dot_ com to contact me off-forum, my PM is almost totally full
Morpheus & Mem+ Advanced dual Propeller SBC with XMM and 256 Color VGA - PCB, kit, A&T available NOW!
www.mikronauts.com - my site 6.250MHz custom Crystals for running Propellers at 100MHz
Las - Large model assembler for the Propeller Largos - a feature full nano operating system for the Propeller
@Steve -- I'm thinking the swap file will just be a flat block of sectors which must be contiguous, reserved in the filesystem as a read-only file; once the swapper is told what the first sector is, no more filesystem logic would be necessary to control it. It would of course be possible to include lots of such images on an SD card, but I'm also thinking it would be useful as a fallback plan to use a second prop with a parallel RAM chip instead of (or with) the SD, for the 10x gain in performance.
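With a contiguous swap file, finding a page on the card reduces to one addition and one multiplication. A minimal Python sketch, assuming 512-byte SD sectors and a page size that is a whole number of sectors (the function name and the example starting sector are hypothetical):

```python
SECTOR_SIZE = 512   # bytes per SD sector


def page_to_sector(first_sector, page, page_size=512):
    """Return the first sector holding `page` in a contiguous swap file."""
    if page_size % SECTOR_SIZE != 0:
        raise ValueError("page size must be a multiple of the sector size")
    sectors_per_page = page_size // SECTOR_SIZE
    return first_sector + page * sectors_per_page


# Example: swap file starting at (made-up) sector 8192, 512-byte pages.
print(page_to_sector(8192, 5))
```

Pages smaller than a sector, such as 256 bytes, would need a read-modify-write of the enclosing sector, which this sketch deliberately rejects.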
Post Edited (localroger) : 8/3/2009 5:44:57 PM GMT
If I were you, I'd do a two staged boot - stage one in the EEPROM is just enough to load the drivers and start them one at a time, then loads your main app. This way no need to re-use device images in hub ram.
Of course a second prop with fast parallel RAM is a great way to go, which is why Morpheus has it.
Hmm... I wonder if Morpheus+Mem+ would not meet all your basic hardware needs...
Oops, I meant 1.5"x2.1"x0.09" (mm 38x53x2.3).
At some point though, 8 COGs is just not enough for most people, especially if 3 or more are used just for memory management.
Post Edited (jazzed) : 8/3/2009 6:59:27 PM GMT
The typical instrument I'm aiming to replace has a 40 MHz 32-bit processor, no mass storage, and 256K or 1 Mb each of firmware and RAM. (There is some variance between manufacturers.) A lot of that firmware is bloat, as not too long ago instruments with 64K or less firmware were common, but I was starting to worry that after blowing half the hub on drivers I wouldn't have enough left for application logic, even in Spin.
This fixes that. Worst case scenario is I throw the second prop at it for RAM buffering, and the beautiful thing about that is if I decide to go that route, it's completely transparent to whatever application code I've written.
I also have a homemade VB-like compiler I could easily adapt to spit out the code.
What I meant by 3 or more COGS for memory management was using 3 + 1 for parallel SRAM memory access for 20MB/s burst with sub 25ns SRAM and VMM function like swap. kuroneko has a 5 cog design that could do near 40MB/s burst with sub 12.5ns SRAM. With kuroneko's design for example you get only one COG for other drivers (assuming additional COGS used for VMM and LMM). With a 2 Propeller design that left-over COG would presumably be used for communicating with the 2nd Propeller. I think there is another 2 Propeller possibility that can get 40MB/s burst and have more COGs left over for other tasks, but it's just an idea right now and I'm trying very hard to focus on the BigSpinVMM(tm) project.
So far it's looking like I'll need 16 instructions or so to fake a jump. Currently playing with different ways to represent the buffers to see if I can optimize that a bit.
And here I was, foolishly thinking that so far only I figured out how to do 40MB/sec!
Fine, I'll let another cat out of the bag.
Allow me to introduce my SOJ36DIP32 adapter - specifically made to allow use of fast SOJ36 SRAMs on Morpheus, Mem+ and any other board that takes JEDEC standard DIP32 memory.
$7.95 USD + S/H for a pack of four, 24.95 USD + S/H for a pack of sixteen, higher quantities available