VMCOG: Virtual Memory for ZiCog, Zog & more (VMCOG 0.976: PropCade,TriBlade_2,HYDRA HX512,XEDODRAM)
Bill Henning
Posts: 6,445
VMCOG has now entered BETA testing - it works on PropCade, TriBlade_2, XEDODRAM and Hydra HX512 - see Page 9 for details.
Background
I made a suggestion a few days ago on how it might be possible to make a virtual memory manager that could be used by ZiCog to get 'acceptable' performance with slow external memory designs.
As many of you will know, VM implementations bear a striking resemblance to processor cache design - this is a natural consequence of solving a very similar problem, which is mapping a large but slow memory to a small but fast memory, while presenting the illusion that the whole large memory is fast.
Please note that there are two existing SD card based virtual memory projects I've heard about on the forum, but the aim of VMCOG is XMM and SPI RAM. Later who knows?
Here is the original post:
forums.parallax.com/forums/default.aspx?f=25&m=405722&p=15 from the 'Dracblade SBC now with Catalina C, PropBasic and CP/M' thread. I copied it into the second message in this thread.
The discussion then moved to the 'ZiCog a Zilog Z80 emulator in 1 Cog' thread, a far more appropriate place for it at that time.
forums.parallax.com/forums/default.aspx?f=25&p=032&m=332138
Here is Heater's new ZyCog thread:
forums.parallax.com/forums/default.aspx?f=25&m=423939
Heater asked very nicely that I find time to implement my suggestion - thus this project was born.
After Heater's suggestion for ZyCog, and my realizing that VMCOG would also allow for a VMSpin, I created this thread for VMCOG development.
This thread will be the official thread for developing VMCOG, which will be under the MIT license, free for personal or commercial use, as long as I (and future contributors) are credited in any software and documentation for software/hardware using VMCOG.
I will keep this top post updated with links to documentation, samples, and code (when it is ready), and I welcome questions, suggestions, optimizations etc.
Later today, I will post the start of the specifications, for now read the ZiCog and DracBlade threads to see what I have written today. I will combine and edit my postings to make the V0.1 VMCOG specification.
Somehow I will find the time to write it, and I will be demonstrating it at UPEW.
Alternate Usage Model
The VMCOG interface, minus the MMU functionality, could be used to provide a simple, standard interface for XMM implementations - which would allow the same interface to all types of extended memory solutions, regardless of how they were implemented: VMCOG (SPI RAM, SD card using hub caching) or XMCOG (TriBlade, Morpheus, DracBlade, mctrivia's etc. using XMM directly). I believe I will make an XMCOG for Morpheus...
Theory of Operation
It has long been known that 90%+ of the total run time of a program is typically spent in <10% of the code. This is why modern processors use multi-level caches in order to make the main memory (slow compared to the processor clock rate) appear to be almost as fast as the processor's Level 1 cache.
Computers (and operating systems) take this one step further, implementing 'Virtual Memory', which treats a chunk of your hard drive as if it were RAM, and uses strategies very similar (and sometimes identical) to those used by Level 1, 2 and lately 3 caches.
Since the early days of computing, the actual available 'real memory' is divided up into 'pages'. Each executing program is said to have a 'working set' of pages, which is some fraction of the total available 'real memory'. (I am not going to address variable-sized segment-based virtual memory here.)
As a rule of thumb, the larger the 'working set', the more closely the speed of the 'virtual memory' approximates the speed of the 'real memory'.
The 'virtual memory' is stored in the 'backing store', and how fast pages can be read from, and written to, the 'backing store' greatly affects virtual memory operation.
Spin API
PUB start(mailbox,lastpage,numpages)
PUB rdvbyte(addr)
PUB rdvword(addr)
PUB rdvlong(addr)
PUB wrvbyte(addr,data)
PUB wrvword(addr,data)
PUB wrvlong(addr,data)
PUB Flush
PUB Look(addr)
Virtual Memory LUT Specification
In order to translate 'virtual' memory addresses to 'real' hub addresses, VMCOG will use the first 256 longs of cog memory to implement a Look Up Table.
256 pages of 256 bytes gives us 64KB of virtual memory, which fits neatly into two MCP23K256 SPI RAM devices.
LUT Entry Definition
If a LUT entry is zero, the corresponding page of virtual memory is not present in the hub.
If a LUT entry is non-zero, it will be interpreted as follows:
MSB
V
PPPPPPPP PDXCCCCC CCCCCCCC CCCCCCCC
Where
PPPPPPPPP = hub address
The hub address is stored here so that the MOVI instruction can be used to update it without disturbing the rest of the bits in the page table entry
D = Dirty bit
This bit is set whenever a write is performed to any byte(s) in the page
X = Guard bit, must be zero
CCCCC CCCCCCCC CCCCCCCC = 21 bit read access counter
Every time a read is performed to this page, this count is incremented. If the count overflows into the Guard bit, every page count in the address translation table will be divided by two, and the Guard bit cleared, in order to ensure that the LRU page replacement algorithm will work well.
UPDATE: The LUT is very likely to change later this week. While writing some of the code I noticed that checking for counter wraparound is MUCH cheaper using the Carry bit instead of an explicit Guard bit, probably outweighing the benefit of clearing the upper bits simply by shifting the physical page bits down. The revised format is likely to be {count:22,Dirty:1,hubpage:9} but things are still in a state of flux...
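As a sanity check, the LUT entry format and the counter-halving (aging) step described above can be modeled in a few lines of Python. This is a host-side sketch of the format, not the PASM itself, and the helper names are mine:

```python
# Model of the VMCOG LUT entry format described above:
#   bits 31..23 = hub address (9 bits, the MOVI field)
#   bit  22     = Dirty
#   bit  21     = Guard (must stay zero)
#   bits 20..0  = 21-bit read access counter
GUARD = 1 << 21

def make_entry(hubbits_, dirty_=0, count_=0):
    return (hubbits_ << 23) | (dirty_ << 22) | count_

def hubbits(entry): return entry >> 23
def dirty(entry):   return (entry >> 22) & 1
def count(entry):   return entry & (GUARD - 1)

def bump_read(lut, page):
    """Increment a page's read counter; on overflow into the Guard
    bit, halve every counter and clear Guard, so relative ordering
    is preserved for the LRU replacement policy."""
    lut[page] += 1
    if lut[page] & GUARD:
        for i, e in enumerate(lut):
            if e:                       # zero means 'page not present'
                lut[i] = (e & ~0x3FFFFF) | ((e & 0x3FFFFF) >> 1)
```

Halving the counts (rather than clearing them) keeps the recently-hot pages ranked above cold ones after an overflow.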
Minimum hardware requirements
- any Propeller board with five pins available for use
- two MCP23K256 SPI RAM devices (Digi-Key part number 23K256-I/P-ND - currently $1.66 each)
VMCOG will use between 4KB and 16KB of hub memory as 'in-core' storage, and 64KB (possibly more) of external memory (SPI or parallel, latched or non-latched XMM design) as the 'backing store'
Supported 'real memory' (hub cache) sizes for the 'working set'
4KB - 16 pages of 256 bytes - guaranteed to be very slow
8KB - 32 pages of 256 bytes - may work fine for smaller CP/M programs
16KB - 64 pages of 256 bytes - should perform quite well!
Later I will allow a user-settable number of pages (between 16 and 96), but I want to simplify things as much as possible for the first release. Theoretically, later it will be possible to run two (or more) VMCOGs servicing two (or more) ZiCogs. It will also be possible to 'share' the virtual address space, and implement 'shared memory' multi-ZiCog systems - or even a hybrid MotoCog and ZiCog system sharing the same virtual memory.
Supported 'virtual memory' (address space) sizes
Initially only a 64KB memory map will be supported, as for the first version I will use a direct-mapped LUT (virtual to real address translation look up table)
128KB would be easy to support if I switched to 512 byte pages, or used two VMCOGs.
Virtual address spaces larger than 128KB would require a more sophisticated handling of virtual to real address mapping, and while I *WILL* tackle that, I want to get something simple running first!
The easiest way to handle LARGE page tables is to move them to hub memory - something that will be quite feasible on Prop2, and is possible on Prop1 - but it would add 16-22 cycles to each access.
Virtual memory addresses will be 32 bits wide, and the virtual memory will be byte addressable.
I will also host pages and downloads for VMCOG at my site; I will post URLs later.
Using VMCOG
Here is how VMCOG will work (Real Soon Now (tm))
- you wait for vmcommand to become 0 (in case the cog is busy swapping a page in or out, or processing your last command)
- you write vmaddr with the virtual address you want to access (long)
- if you are going to write to the VM, you put your byte/short/long into vmdata (long)
- you write VMWRITE{B|W|L} or VMREAD{B|W|L} into the vmcommand location (short)
- if you were doing a VMREAD, you wait for vmcommand to become 0 before reading vmdata
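The polling sequence above can be sketched as a toy Python model. The command values are placeholders, and the single in-line 'service' call stands in for the real cog-to-cog handshake (where VMCOG clears vmcommand from another cog):

```python
# Toy model of the mailbox handshake: mailbox = [vmcommand, vmaddr, vmdata].
VMREADB, VMREADW, VMREADL = 1, 2, 3     # placeholder command codes
VMWRITEB, VMWRITEW, VMWRITEL = 5, 6, 7

CMD, ADDR, DATA = 0, 1, 2

def vm_service(mailbox, memory):
    """What VMCOG does once it notices a pending command."""
    cmd = mailbox[CMD]
    if cmd in (VMREADB, VMREADW, VMREADL):
        mailbox[DATA] = memory.get(mailbox[ADDR], 0)
    elif cmd in (VMWRITEB, VMWRITEW, VMWRITEL):
        memory[mailbox[ADDR]] = mailbox[DATA]
    mailbox[CMD] = 0                    # 0 signals 'done' to the client

def rdvbyte(mailbox, memory, addr):
    while mailbox[CMD]: pass            # wait until VMCOG is idle
    mailbox[ADDR] = addr                # write the virtual address
    mailbox[CMD] = VMREADB              # issue the command
    vm_service(mailbox, memory)         # (normally done by the VM cog)
    while mailbox[CMD]: pass            # wait for completion
    return mailbox[DATA]

def wrvbyte(mailbox, memory, addr, value):
    while mailbox[CMD]: pass
    mailbox[ADDR] = addr                # address first...
    mailbox[DATA] = value               # ...then data...
    mailbox[CMD] = VMWRITEB             # ...command last, as above
    vm_service(mailbox, memory)
```

Writing vmcommand last matters: it is the flag that tells VMCOG the other mailbox fields are valid.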
TO DO LIST
- get someone to make a small, fast 23K256 driver
Andy (Ariba) contributed one, I just need to make it run with the 23K256
Fast SPI driver for MCP23K256
- perhaps an adaptation of Mike Green's MCP23K256 driver, combined with fast SPI from MIT licensed fast fsrw SPI code?
- ideally using counters to read/write the SPI memory at 10Mbps (or even 20Mbps?)
- See XMM Code Interface Specification for how I need the SPI driver to interface to VMCOG (it will be part of VMCOG)
XMM Code Interface Specification
I invite authors of all existing (and future) XMM solutions who wish to be supported by VMCOG to submit four PASM subroutines as specified below. The code should be short, but fast.
All contributors to this project agree that any submitted code will be under the MIT license, with the understanding that the license does not extend to the underlying hardware - so no worries, you are specifically NOT allowing people to build clones of your hardware (unless you explicitly give permission to allow people to duplicate your hardware). This will be in the Copyright statement for VMCOG.
The reference implementation will be VMCOG/SPI, other implementations will be named as VMCOG/xmm_solution_name
chipsel  LONG 0 ' chip select, initially can only be 0 or 1 to choose from two SPI RAMs; not used for parallel XMM solutions
vmaddr   LONG 0 ' virtual address, initially $0000-$FFFF, later I plan to support at least 24 bits of address space
hubaddr  LONG 0 ' hub memory address to read from or write to
membytes LONG 0 ' number of bytes to read or write to/from the hub
START - assert /CS for the device specified by 'chipsel', initially 2 pins are used to select RAM0 or RAM1 (the MCP23K256's are 32KB devices), later can choose between different XMM's on same prop
END - de-assert /CS for the device specified by 'chipsel'
READ - read 'membytes' number of bytes from virtual (extended) address 'vmaddr' to the hub starting at address 'hubaddr'
WRITE - write 'membytes' number of bytes to virtual (extended) address 'vmaddr' from the hub starting at address 'hubaddr'
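For reference, here is a Python sketch of the byte sequences a READ/WRITE routine would clock out to a 23K256. The opcodes ($03 for READ, $02 for WRITE) and the 16-bit big-endian address come from the 23K256 datasheet; splitting a 64KB vmaddr across two 32KB chips on address bit 15 is my assumption for the two-chip reference setup:

```python
# 23K256 SPI instruction opcodes (sequential-mode transfers continue
# for as many bytes as are clocked after the address).
READ, WRITE = 0x03, 0x02

def chip_and_offset(vmaddr):
    """Map a 64KB virtual address onto (chipsel, 32KB in-chip offset).
    Using bit 15 as the chip select is an assumption, not spec."""
    return (vmaddr >> 15) & 1, vmaddr & 0x7FFF

def read_cmd(vmaddr):
    _, off = chip_and_offset(vmaddr)
    return bytes([READ, off >> 8, off & 0xFF])   # opcode + addr hi + addr lo

def write_cmd(vmaddr, payload):
    _, off = chip_and_offset(vmaddr)
    return bytes([WRITE, off >> 8, off & 0xFF]) + bytes(payload)
```

In the real driver the returned bytes would be shifted out while /CS (per 'chipsel') is held low, with the page data streamed immediately after.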
Why use VMCOG/SPI as the reference implementation?
- because it is the most challenging from a performance point of view
- every XMM design, no matter how many latches are used, is guaranteed to be faster than SPI RAM!
- it is BY FAR the least expensive way to try VMCOG
- I love a challenge
Future Optimizations
- implementing a 'delayed write' strategy (which is why I have the 'DIRTY' bit)
- changing the mailbox format for better performance
- possibly changing TLB format
- possibly checking if access is on the same virtual page as the last access, optimizing access
The new command format I am considering is as follows:
cmd LONG 0 ' the first long in the 4-long mailbox
3 bit command code as bits 29-31
29 bit virtual address (limits us to 512MB virtual address space without additional hub cycle) as bits 0-28
Commands would be encoded as follows:
000 = NOP (required for polling loop to function)
001 = rdvbyte
010 = rdvword
011 = rdvlong
100 = XOP (extended operation)
101 = wrbyte
110 = wrword
111 = wrlong
- To read a byte from $0000_1F3C would require writing $2000_1F3C to the mailbox
- To read a word from $0000_1F3C would require writing $4000_1F3C to the mailbox
- To read a long from $0000_1F3C would require writing $6000_1F3C to the mailbox
To write a byte/word/long, first I'd write the value to the second long in the mailbox, and then write
$A000_1F3C to write a byte, $C000_1F3C to write a word, $E000_1F3C to write a long!
The other operations such as VMFLUSH, VMDUMP, etc. would write a secondary opcode to the mailbox, and $8xxx_xxxx to invoke the Xtended operation.
This would save one hub write and one hub read on every read/write!
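The proposed encoding is easy to check in Python (a model of the packing described above, not driver code):

```python
# Proposed single-long mailbox command:
#   bits 31..29 = 3-bit command code, bits 28..0 = virtual address
NOP, RDVBYTE, RDVWORD, RDVLONG, XOP, WRBYTE, WRWORD, WRLONG = range(8)

def encode(cmd, vaddr):
    return (cmd << 29) | (vaddr & 0x1FFFFFFF)

def decode(long_):
    return long_ >> 29, long_ & 0x1FFFFFFF
```

Note that NOP must encode as 0 so the VM cog's polling loop can treat 'mailbox cleared' and 'no command' identically.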
NOTES
- Theoretically VMCOG could also use an SD card, RAMTRON FRAM, or even SPI flash for the backing store - however I suspect that may be too slow, and wear out FLASH quickly.
- one interesting extension for later versions is to use SPI Flash or SD cards to hold the code, and SPI (or parallel) ram to hold data and stack.
- VSpin is a possible name for a Spin VM I hope someone will make that uses VMCOG - it would allow 64KB for spin code!
- Morpheus users can remove the W25X80 Flash chip and use 23K256's in both FLASH sockets - no need to solder anything!
- it would be entirely possible to write an LMM kernel that accessed XMM through VMCOG
- a 'zero additional chip' reference platform is possible, using a 24LC1024 EEPROM, which would eventually wear out
- another 'zero additional chip' reference platform is possible using a 1Mbit FRAM device for combination boot EEPROM and backing store
- it is NOT possible to use the virtual memory as a 'live' frame buffer that is displayed by video drivers
Code Contributed to VMCOG
- Andy ('Ariba') contributed SPI SRAM handling code, looks good!
- heater contributed TriBlade_2 support and a better memory test and lots of debugging help
- jazzed contributed XEDODRAM support and lots of debugging help
Downloads (below my signature)
- VMCOG Spin API Documentation v0.22
- VMDEBUG + VMCOG v0.970 - working for PropCade, TriBlade_2, XEDODRAM and now Hydra HX512!
- VMACCESS.SPIN - sample pasm code for accessing the virtual memory, not tested
Useful Links
- 23K256 web page www.microchip.com/wwwproducts/Devices.aspx?dDocName=en539039
- 23K256 data sheet ww1.microchip.com/downloads/en/DeviceDoc/22100D.pdf
- introduction to virtual memory, from the University of Alberta webdocs.cs.ualberta.ca/~tony/C379/Notes/PDF/08.4.pdf
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com
My products: Morpheus / Mem+ / PropCade / FlexMem / VMCOG / Propteus / Proteus / SerPlug
and 6.250MHz Crystals to run Propellers at 100MHz & 5.0' OEM TFT VGA LCD modules
Las - Large model assembler Largos - upcoming nano operating system
Post Edited (Bill Henning) : 7/31/2010 8:45:17 PM GMT
Comments
********************************************************************************************
Heater,
With the ever-increasing number of XMM solutions, have you considered decoupling ZiCog from the memory access?
I have not looked at the sources, so I don't know how easy the following would be - or if you have thought of something similar.
Loosely, there are three types of memory accesses:
- code fetch
- data read
- data write
I am thinking of a solution where the memory access is handed off to another cog, and ZiCog requests memory actions through hub locations.
Consider:
The beauty of this approach is that it TOTALLY decouples ZiCog from specific XMM implementation, and the memory cog can try to do all sorts of caching etc.
Adding new XMM targets is trivial.
Frees up some LONGs in ZiCog
Doing split I/D for 128K memory (which I think MP/M supported) is easy.
Doing banked memory on any XMM becomes MUCH easier.
Even better, in any instruction that is not a JUMP/CALL, the next instruction read can be done in parallel with executing the current instruction!
Simply ask for the next instruction before processing the current one.
The hub delay slots can also be used :)
I think it would potentially run faster.
This would also make it trivial to provide breakpoints for execution or data access, and monitoring locations, performance etc.
On the hardware side...
This would also allow a super-cheap ZiCog config I was thinking about, by using two MCP23K256 SPI ram's or FRAMs (with a speed penalty)
What do you guys think?
***********************************************************************
Quick summary of the relevant previous responses in other threads:
Me:
I was thinking of doing a classic LRU page replacement policy. (LRU = Least Recently Used)
In a 64KB address space there are (obviously) 256 pages of 256 bytes.
My paper design uses cog locations 0..255 for the LRU page table. When the cog is started, location 0 would contain a JMP #$100, which would do a MOV 0,#0 to clear the first page table entry.
Bits 0-8 (the source field) of the page table would contain the upper 7 bits (on prop 1) of the hub page where that page resides.
The upper 25 bits would be the access counter, allowing counting up to 32M accesses.
As there is only 32KB of hub ram on the current prop, bit 8 would be used as a 'dirty' bit (set whenever a write occurs to that page).
If a page is not present in memory, the whole register is set to 0.
When any count approaches 16M, all counts will be cut in half.
Say you wanted to read $3F29 in the virtual memory address space.
It might be more efficient to store the page table entries as long {hubpage:7, dirty:1, count:24} because then a simple SHR by #25 would get you the hub page address, and SHL hub page by #8 and it has zeroed the low 8 bits, making it ready to or in the offset. Sorry, I have not spent any time optimizing it yet.
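That alternative packing can be modeled in Python to check the shift arithmetic (helper names are mine):

```python
# Alternative page table entry: {hubpage:7, dirty:1, count:24}.
# SHR #25 recovers the hub page; shifting that left by 8 yields a hub
# byte address with the low 8 bits already zero, ready to OR in the
# in-page offset.
def pack(hubpage, dirty, count):
    return (hubpage << 25) | (dirty << 24) | (count & 0xFFFFFF)

def hub_address(entry, offset):
    return ((entry >> 25) << 8) | offset   # SHR #25, SHL #8, OR offset
```

So for a virtual access at offset $29 into a page cached at hub page $3F, two shifts and an OR produce hub address $3F29.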
With 64 pages (16KB of hub buffer) I'd expect well over 90% hit ratio (test code would actually be able to calculate this).
When there is a hit, the unoptimized code above takes about 11 instructions and one hub access... call it 44+22 cycles worst case, less than 1us any way.
If we assume that the average ZiCog instruction emulation takes 2us for the instruction, and .5us for an unlatched byte read, the total instruction time would increase to 3us on page hits, and something much worse on misses. Say 256 bytes * 8 bits per byte * 100ns (assuming 10Mbps SPI read) = 204.8us + 2us for the instruction - let's call it 207us.
At 2.5us per ZiCog instruction, 1M instructions would take 2.5s
If 90% of the instructions hit, there would be 900,000 hits at 3us, and 100,000 at 207us, for a total of 2.7s + 20.7s = 23.4s - approximately 10.7% of the speed of pure xmm ZiCog
If 95% of the instructions hit, there would be 950,000 hits at 3us, and 50,000 at 207us, for a total of 2.85s + 10.35s = 13.2s - approx. 20% of the speed of pure xmm ZiCog
If the Z80 is like most mainframes, the hit rate would be more like 99%
At 99%, there would be 990,000 at 3us, and 10,000 at 207us, for a total of 2.97s + 2.07s = 5.04s - approx. 49.6% of the speed of pure xmm ZiCog.
Of course, if the average ZiCog instruction (with XMM) took 3.75us (50% more than above) the VM approach could reach 75% of XMM performance.
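These back-of-the-envelope estimates are easy to reproduce in Python, using the 3us hit / 207us miss / 2.5us pure-XMM figures above:

```python
# Run-time estimate for 1M emulated instructions at a given hit rate.
HIT_US, MISS_US = 3.0, 207.0     # per-instruction cost on hit / miss
PURE_XMM_US = 2.5                # assumed pure-XMM ZiCog instruction time
N = 1_000_000

def run_time_s(hit_rate):
    return (N * hit_rate * HIT_US + N * (1 - hit_rate) * MISS_US) / 1e6

for rate in (0.90, 0.95, 0.99):
    t = run_time_s(rate)
    print(f"{rate:.0%} hits: {t:.2f}s, "
          f"{100 * N * PURE_XMM_US / 1e6 / t:.1f}% of pure XMM speed")
```

The dominant term is the miss cost, which is why halving the page-load time (20Mbps SPI) matters so much more than shaving hit-path instructions.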
Note that hits would take at most 0.8us, and that page reads at 20Mbps would take 103.5us.
Optimizing the read code would have good effect.
Using 20Mbps reads from the SPI flash would offer a dramatic improvement.
My best guess?
For "average" software, 50%+ of XMM speed should be attainable.
This will be fun to test :)
Does anyone have any idea how many ZiCog instructions are processed per second? the 2.5us average (including TriBlade unlatched read) was a WAG based on reading almost 400k instructions per sec earlier.
That's the beauty of the LRU algorithm... initialization code, and infrequently used code would automatically be swapped out, and the most used code would always be resident automatically!
Heater
Bill: Re: Your question about moving the Z80's memory access operations to another COG.
Yes it has been considered. Basically it would work like the ZiCog's IN and OUT instructions work now.
I do like the idea of decoupling chunks of software functionality wherever possible, from a software engineering point of view. Straight away it makes life much easier for those who want to port to a different hardware. Like mikediv wants to do for the Hydra. Looks like it saves a few LONGs in the Z80 COG as well.
There are three reasons why I have not pursued that idea:
1) Conservation of COGs. I always looked at COGs as being few in number and precious. Seemed a waste to use a whole 32 bit CPU for just XMM access. Hence "ZiCog" a Z80 emulator in ONE Cog.
2) Speed. I have yet to see how it can be done in a way that does not slow things down. This is perhaps not such a big issue. Given all the PASM that has to be executed per Z80 op the impact may not be so great. On the other hand I like Clusso's RamBlade attitude, "Everything for speed".
3) Simplicity. At least for the early ZiCog versions there was only one XMM solution.
Regarding banked memory for CP/M 3 and MP/M. We have code in place for bank switching the Z80 RAM space. It's very small, tight and fast. I don't see much room for improvement there.
One problem you seem to have glossed over is in the idea that the Cog handling the RAM can somehow do work in the background and hence recover the time lost in COG-COG communications. As far as I can tell this is not possible, or at least won't work as well as one might expect.
Consider: It looks like the memory COG could be (pre)fetching the next Z80 opcode while the current Z80 instruction is executing.
Problem: The current instruction does a data access to memory. Oops, it has to wait in the "data_fetch" until the memory COG gets around to it.
Problem: When a Z80 jump, call or ret is made the prefetched op is now junk and a new op has to be fetched. This throws away the prefetch time saving. It also means the "code_fetch" path has to check if the requested address is already prefetched or not. It has to do this on every code fetch, this eats time. There are a lot of jumps in Z80 code.
Now it could be that with all the "swings and roundabouts" we have going on here that a dedicated XMM Cog solution can be made that is faster than what we have now or at least breaks even. So, Bill, if you would like to experiment with it we would love to see what the results are :)
P.S. I've softened up my stance on "wasting" a COG for XMM. As it is we've eaten up all the Prop pins for RAM and the HUB is pretty full so there is no point in saving COGs that have nowhere to work.
Dr_Acula
Bill, that is a fascinating analysis of the speeds. There would definitely be speed increases for latched versions as you would read in blocks of data and hence most of the time would only be changing the low byte latch. Plus some other code optimisations would make it possibly twice as fast to access memory.
In practice a typical program sitting at 100H is going to be almost always linear with local jumps so that code will be very efficient. There will also be bios calls (keyboard, display output) which will jump to locations in high ram, but these will be the same each time so those blocks will end up on the list fairly early on and then stay there.
As a rough guide the dracblade runs the same as about a 3.5MHz Z80. Cluso's runs faster.
I guess if sphinx does manage to save a whole lot of hub ram we can experiment with what to do with that. Video buffer ram for graphics? Faster speed? Or maybe the user can choose.
Heater
Bill: PLEASE, PLEASE can you find some time to implement your XMM driver COG with "code_fetch", "data_fetch" etc.
I was writing that last post only minutes after waking up so, still tired, the following motivations had not occurred to me:
1) We have a number of XMM options that are just inherently slow. Those that use complicated latching schemes and those that could be made using serial devices. In these cases any speed hit due to COG-COG communications is probably not going to notice much in the final result. We can still have the "all out for speed" Tri/RamBlade option in the ZiCog code wrapped up in #ifdefs so nothing lost.
Using serial memory appeals to me; it may be slow, but I'd love to have some free pins such that ZiCog can do IN/OUT to them directly from Z80 code.
2) If you add operations for reading and writing WORDs we get more speed back. ZiCog does a lot of WORD accesses.
3) This can probably be done without wasting a COG. Just combine it with the TriBlade XMM block move driver or such.
4) MoCog. The MoCog 6809 emulator PASM is getting huge. It will require two COGs. Hopefully only two. If both those COGs need access to XMM (likely) then your suggested XMM handler COG would a) Save duplicating access code in two COGs. b) Make life much easier, saves having two COGs fighting for those RAM pins.
5) ZyCog. Yes "ZyCog" not "ZiCog" See below.
What the heck is ZyCog?
For a long time now I've pondered two things:
1) Is there a nice byte code, like Spin, that could be interpreted in one or two Cogs, like Z80, but more efficient and with much larger address space. For use with code in external memory.
2) Is there such a byte code that exists already and has a nice compiler to go with it. C or whatever. So that we have a ready to run tool chain. Yes there is Java but "no thank you".
Recently I found the answer, the ZPU processor core from ZyLin AS.
Get this:
1) The ZPU processor core is the smallest, in terms of logic blocks, 32 bit CPU.
2) Its instructions are all byte wide, good for XMM.
3) There are only a handful of instructions, so the ZPU can probably be emulated in one COG.
4) There is a version of GCC that generates code for the ZPU.
Yes, that's right, with ZPU emulation we can use the GCC compiler for the Propeller and have huge programs in external RAM.
Hence my new project "ZyCog" the ZyLin ZPU processor in a COG.
ZyCog is as yet unannounced and has no Prop code. It's just an idea so don't tell anyone[noparse]:)[/noparse]
At least I got as far as getting the GCC to generate ZPU code to experiment with.
Heater
Bill: "Does anyone have any idea how many ZiCog instructions are processed per second?"
A long time ago this was measured with a frequency counter whilst the Z80 was executing its op code test program. The results were published here somewhere; I have no idea where, and no memory of the exact numbers. Less than 1 million, more than 500,000 per second.
Cluso99
A few comments on the above...
For ZiCog, the XMM cog should have 2 separate rendezvous locations, one for instruction fetch and one for data fetch. That way the prefetch doesn't get flushed every time a data fetch occurs. Also, it may as well prefetch words, or even longs.
ZyCog - love the idea
Heater
Cluso: Bill already has code and data separated, see the DracBlade thread.
WORD access is suggested above. It's surprising how much WORD access goes on in an 8 bit CPU. All those jumps, calls and rets need WORD access. Then there's the loading, storing, PUSHing and POPing, of 16 bit registers.
Looks like what we are about to design is the world's first 8 bit processor with a 16 bit data bus!
More on ZyCog later....
Dr_Acula
I think Bill might be on to something here.
Take the Dracblade. Remove all the latches. Remove the Sram. Put a 64k serial ram chip on the eeprom bus. Implement a Sphinx OS that frees up 14k of hub ram. Maybe toss out the LCD code for the moment, and the wireless layer, and the upper 512k code, and toss out the ramblade code too. Maybe optimise the VT100 code a bit. I think that should get us to 16k of free hub ram, maybe more.
Put a ram driver in the cog that is currently running the sram driver code. This new ram driver handles a list of 256 ram blocks of 256 bytes each.
The list handling is going to be a priority list. Each time a block is accessed you add 1 to a counter for that block. Rank them in order. If a new block is needed, take the lowest ranking one, put it into serial ram, and then get the new block. Can this all fit into a cog? I think it should. Is the serial ram driver code the same as the eeprom driver code, and if so, is this already somewhere anyway (?? in the sd card object).
Just looking at ram now SPI or I2C. Code exists for both I think.
This could halve the size of the dracblade board for starters, and decrease the chip count from 9 to 4. Plus free up a number of propeller pins for audio or more serial ports.
Agree a block write then read from serial ram will be slow, but that ought to happen only very infrequently. Possibly never for a small sbasic/c/assembly program.
We can't do this now because there are 7 blocks of 2k code sitting in ram in random locations.
A thought? Maybe we can use it without even needing sphinx! Just tell the serial ram driver cog the locations of the 7 blocks of 2k code, and any more free code area. It can then have a simple list of where it keeps each block of 256 bytes.
Heater
Well Bill is THE inventor of the LMM technique for the Prop. So if he thinks he's on to something we should all sit up and pay attention.
This idea of a COG handling external memory with caches etc may not be as fast as the direct xxxBlade approach we have now but for those who want to save pins and for the up and coming ZPU emulator it should be a very good compromise.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
Post Edited (Bill Henning) : 2/3/2010 8:53:57 PM GMT
One change: I've decided to change the command mailbox format.
The mailbox MUST be long aligned.
WORD: VMCOMMAND
WORD: BYTES (MUST be 1, 2, or 4)
LONG: Virtual Address (currently must be $0000-$FFFF)
LONG: data read/written from/to specified virtual address
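The mailbox layout above can be modeled in Python to make the byte layout concrete (illustrative only — the real interface is Spin/PASM, and `READVMB` here is a hypothetical command code, not VMCOG's actual encoding):

```python
import struct

READVMB = 1  # hypothetical command code for "read byte", for illustration only

def pack_mailbox(command, nbytes, vaddr, data=0):
    """Pack the long-aligned VMCOG mailbox: WORD command, WORD byte
    count (must be 1, 2 or 4), LONG virtual address, LONG data."""
    assert nbytes in (1, 2, 4)
    return struct.pack('<HHII', command, nbytes, vaddr, data)

mb = pack_mailbox(READVMB, 2, 0x0100)
assert len(mb) == 12   # 2 + 2 + 4 + 4 bytes, naturally long-aligned
```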
(Reasons why Heater did not decouple memory access into a separate cog)
1) Conservation of COGs
For some applications it does not matter - for example Dr_Acula has a "latch cog" that could be merged into VMCOG
2) Speed
I think it will be fine for most CP/M applications, as long as enough memory is allocated as page buffers. 8K-16K should give good performance
3) Simplicity
You got me there, however with the many existing (and upcoming) XMM solutions, a VMCOG approach will actually be simpler than supporting an ever increasing number of memory interfaces.
"One problem you seem to have glossed over is in the idea that the Cog handling the RAM can some how do work in the background and hence recover the time lost in COG-COG communications. As far as I can tell this is not possible, or at least won't work as well as one might expect."
I think overlapping is possible for a great number of instructions - obviously excepting branches, and quick immediate data loads. I did mention it won't work for branches etc in the original post.
Response to Dr_Acula's post in the ZiCog thread
"Bill, that is a fascinating analysis of the speeds. There would definitely be speed increases for latched versions as you would read in blocks of data and hence most of the time would only be changing the low byte latch. Plus some other code optimisations would make it possibly twice as fast to access memory."
Thanks! I actually believe that performance will be acceptable for most software - after all, all I am doing is implementing a software memory management unit for client cogs. I also think there will definitely be a benefit to solutions like yours, as the latches will only need to be set when reading a memory location that is not currently in the working set.
Response to another post from Heater in the ZiCog thread
"Bill: PLEASE, PLEASE can you find some time to implement your XMM driver COG with "code_fetch", "data_fetch" etc."
I am making the time
I did change the architecture a bit, as when I was implementing it I realized that a direct command dispatch is much faster than polling several different request registers.
For the initial version, I have gone to a combined I/D space (no separate code and data), however later I can easily separate I/D by switching to 512 byte pages or keeping the LUT in the hub.
Re/ 1 - number of slower XMM solutions, plus one faster (Tri/Ramblade), and the appeal of low pin count serial memory
I could not agree more. I started down this path because I was trying to figure out how Dr_Acula's three-latch design could be speeded up by software. Once I thought of adding a software memory management unit, I realized this may make serial SPI memory "fast enough" to have 64KB available to CP/M... thus allowing anyone with any prop board and 5 pins to run ALL the CP/M software out there!
Re/ 2 - supporting word access
It was always planned, I remember coding for the Z80 and loading HL at once. It is a minor pain when the word (or long) crosses a page boundary, but it is not a big deal
Re/ 3 - not wasting a cog, combine with Triblade XMM handler
I intended to merge Dr_Acula's latch driver cog in [noparse]:)[/noparse]
It might be a waste to use VM on non-latched designs such as Tri/RamBlade, I think they would be slowed down by adding the MMU
Re/ 4 - MoCog needing two COG's
I can support two command ports, then we can avoid using LOCKs - but then I have to check both command ports. Tough call as to which is better.
Re/ 5 ZyCog
I love it! I could not resist taking a quick peek at it. Nice simple stack machine.
It will have to wait for a slightly later version; the initial version will only implement a 64KB virtual space. However, if I move the LUT to the HUB, I can implement a multi-megabyte virtual space.
I think the MIPS R2000 would also fit in a single cog.
Thanks for the approx. 500k+ Z80 instructions per sec. I was in the ball park [noparse]:)[/noparse] 2us = 500k instructions [noparse]:)[/noparse]
Response to Cluso99 in the ZiCog thread
"For ZiCog, the XMM cog should have 2 seperate rendezvous locations, one for instruction fetch and one for data fetch. That way the prefetch doesn't get flushed every time a data fetch occurs. Also, it may as well prefetch words, or even longs."
Data is always read a page at a time from the backing store. A fetch can only flush a page if the page it needs is not present in the working set, and when a page must be flushed, the least-accessed page is the one flushed - so no worries about code flushing data, or data flushing code [noparse]:)[/noparse]
It will be possible to request a byte, word or long, even though this may cross page boundaries.
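Handling an access that crosses a page boundary just means splitting it into per-page pieces. A sketch of that split (illustrative Python under the 256-byte page assumption; VMCOG itself does this in PASM):

```python
PAGE_SIZE = 256  # VMCOG's initial page size

def split_access(vaddr, nbytes):
    """Split a 1/2/4-byte access into (page, offset, count) pieces so a
    word or long that straddles a page boundary touches both pages."""
    pieces = []
    while nbytes:
        offset = vaddr % PAGE_SIZE
        count = min(nbytes, PAGE_SIZE - offset)  # bytes left in this page
        pieces.append((vaddr // PAGE_SIZE, offset, count))
        vaddr += count
        nbytes -= count
    return pieces

# A word read at $00FF spans pages 0 and 1; a long at $0100 does not.
assert split_access(0x00FF, 2) == [(0, 255, 1), (1, 0, 1)]
assert split_access(0x0100, 4) == [(1, 0, 4)]
```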
Response to another post from Dr_Acula in the ZiCog thread
"Take the Dracblade. Remove all the latches. Remove the Sram. Put a 64k serial ram chip on the eeprom bus. Implement a Sphinx OS that frees up 14k of hub ram. Maybe toss out the LCD code for the moment, and the wireless layer, and the upper 512k code, and toss out the ramblade code too. Maybe optimise the VT100 code a bit. I think that should get us to 16k of free hub ram, maybe more....."
Exactly! That is exactly the idea, and you obviously got it, as your list handling description is how LRU page replacement strategies work. For "Ranking", a count of how often each page is accessed is maintained, and the least-used page is always the one that is flushed.
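The "ranking" described here - evict the page with the lowest access count - can be sketched in a few lines (illustrative Python model of the policy, not VMCOG's PASM):

```python
def choose_victim(counts):
    """Pick the page to evict: the one with the lowest access count,
    per the least-used replacement policy described above."""
    return min(counts, key=counts.get)

counts = {0x00: 57, 0x01: 3, 0x7F: 12}   # page number -> access count
assert choose_victim(counts) == 0x01      # page 1 is least used
```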
Post Edited (Bill Henning) : 2/3/2010 11:26:37 PM GMT
Now I have been thinking. For an initial test, it would be OK to use the SD card, since we have these.
SD disadvantages...
SD advantages...
Your thoughts?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
- Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
- Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
- Prop Tools under Development or Completed (Index)
- Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
- Prop OS: SphinxOS, PropDos, PropCmd - Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz - MultiBlade Props: www.cluso.bluemagic.biz
I think this idea could be the key to getting large memory models working for a range of current software projects.
Only minor suggestion - re
WORD: VMCOMMAND
WORD: BYTES (MUST be 1, 2, or 4)
LONG: Virtual Address (currently must be $0000-$FFFF)
LONG: data read/written from/to specified virtual address
Can I suggest the virtual address be 00000000 to FFFFFFFF
You may as well use the full long, and for testing it can be 0000 to FFFF (or even 0000 to 8000H). You wouldn't want to get stuck with only 64k *grin*.
Presumably if you asked for a location higher than actual memory it would return an error?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.smarthome.viviti.com/propeller
I will be using SPI RAM, precisely because of the much slower write.
It has however occurred to me that the code segment could be stored in SPI FLASH, SD cards etc without penalty...
In order to keep the initial version as simple and as easy to write as possible, I will be implementing 256 byte pages for now.
Later I will probably move to 512 byte pages as then I can easily support a 128KB virtual address space, however page sizes larger than that will have to wait for Prop2. If we really need a larger virtual address space, I can put the page lookup table into the hub, and just keep resident page LRU counters in the hub, but that will slow VM access down. May be worth it for ZyCog!
I used a LONG for the Virtual Address so that later a full 32 bit address space can be exposed [noparse]:)[/noparse] but for the initial test version, it will be limited to the 64KB supplied by two 23K256's
A much slower version could implement the whole 32 bit (4GB) address space by using an SDHC card; however write performance would be very poor, as not only are writes to flash slow, I'd have to move the page translation table to the hub, slowing things even more. I believe localroger is working on something like that.
If this works well, I will implement a 24 bit address space version, as Morpheus allows up to 16MB of ram
Later (as in MUCH later, at least a year from now) it would even be possible to implement a multi-level store... with XMM or SPI ram used to cache a uSD 4GB SDHC card, providing different levels of performance depending on the amount of XMM memory.
(which, incidentally, is more than enough to run uCLinux, if ZyCog can run it)
Post Edited (Bill Henning) : 2/3/2010 11:43:49 PM GMT
The idea is of course very good for slow "like geological time" SPI/I2C devices.
I wish you luck with implementation.
--Steve
This whole thing started as a thought experiment on how to potentially improve DracBlade 3-latch XMM performance with software, then turned into
"hey, maybe I can make ZiCog and <insert other 8 bit emulator> run fast enough with SPI ram!"
In a nutshell, I agree with you. This approach is basically aimed at slow serial devices such as I2C and SPI based memory of various types, where it should be able to provide "OK" performance for many applications - on the order of 50% - 75% of XMM speed, assuming that the time required to implement the virtual opcodes outweighs the average time required to fetch them from XMM.
As I stated quite early on... there is no way it will ever be as fast as unlatched XMM access, or even somewhat latched XMM access.
It is still a fascinating experiment!
And the potential side effect is a generic medium speed XMM interface, as the same mailbox/command approach could be used to present a unified interface to XMM, allowing new XMM implementations immediate access to XMM software such as ZiCog, Catalina etc, until there is time to write fine-tuned implementation specific code. This may in fact be a more significant long term beneficial result than SPI RAM virtual memory.
Thank you for your wishes!
Obviously you will be using a separate COG for caching. You may be able to take some ideas for sharing between COGs from what I posted before near the bottom of this thread.
http://forums.parallax.com/showthread.php?p=866475
Cheers.
In MoCog only one of its COGs will be doing anything at a time. The two COGs are not there to provide parallel execution, just more PASM code space.
So no need for two memory ports or locks etc.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Thanks for the link! Interesting thread...
For now I am going with a classic VM setup, with VMCOG acting like combination MMU and backing store controller, with 256 byte pages.
My reasons for 256 byte pages are as follows:
- allows me to fit the page translation table, dirty bit, and access counter into just 256 cog longs
- allows more pages in the hub at the same time
- decreases probability of "thrashing"
- "natural" fit for 8/16 bit emulations
- semi-reasonable amount of time to read/write a page
- instruction set / emulation architecture neutral "pseudo-hardware MMU"
Later, I plan on making a version with 512 byte pages - that will allow 128KB of virtual memory which will allow split I/D for 8 bit emulations
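The arithmetic behind these page-size choices is easy to check (plain Python, just restating the numbers above):

```python
PAGE_SIZE = 256
VIRT_SPACE = 64 * 1024

# 64KB of virtual space in 256-byte pages needs exactly 256 table
# entries - one per cog long, so the whole translation table (including
# dirty bit and access counter packed into each entry) fits in the cog.
num_pages = VIRT_SPACE // PAGE_SIZE
assert num_pages == 256

# With 512-byte pages, the same 256-entry table covers 128KB:
assert (128 * 1024) // 512 == 256
```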
heater:
Thanks, that is great news; locking would slow every access down by 16-22 cycles even in ideal circumstances, as would monitoring multiple request ports
For greatest flexibility and performance though I assume VMCOG must be launched separately with parameter control. Inter-COG data transfer as you know can be relatively slow, and I was looking for ways to fix that.
What I ended up with was a 2 COG solution where 1 did the lookup (crude single line), and 1 COG did fetch and store. Fetch and store was done separately so that best possible parallel access performance could be achieved.
1) A VMCOG.spin "driver", that on launch is passed a pointer to the mailbox
My intention is that the mailbox can initially be filled with the startup parameters - such as the start of the page cache (corresponding to physical memory in classic MMU designs) and the number of available pages (of physical memory)
This way, for extreme testing, it will be possible to start up VMCOG with a very small working set - theoretically as low as 1 page, however I think I will hard wire a minimum of 8 pages, and a maximum of 64 on Prop1.
v0.12 of VMCOG adds a "VMDUMP" message, which will save a copy of the whole page translation table to the address provided in the message.
This will also allow VM performance monitoring, as an emulated CPU could occasionally ask for a VMDUMP, and a user app could then see the usage counters for all the active pages, and watch the LRU replacement policy work in pseudo real time.
2) A VMCOG_Demo.spin "application", which will launch VMCOG
VMCOG_Demo provides an old-style serial "monitor" program for manipulating the virtual memory, including being able to view the LUT "real time"
I am using a serial interface to make it as easy as possible for everyone to try it, and so I can write it as quickly as possible [noparse]:)[/noparse]
3) Sample minimal hardware schematic for VMCOG
The minimum required hardware to add to any Propeller board for running VMCOG. It will consist of two 23K256 devices, and five I/O lines:
/CS0 - select $0000-$7FFF SPI Ram
/CS1 - select $8000-$FFFF SPI Ram
CLK - SPI clock
MOSI - Prop output, SPI Ram input pin
MISO - Prop input, SPI Ram output pin
Obviously later it will be possible to add /CS2 and /CS3, and possibly use demultiplexers for them, however I leave that for a later date.
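Since /CS0 covers $0000-$7FFF and /CS1 covers $8000-$FFFF, the chip-select decode is just address bit 15. A sketch of that decode (illustrative Python; the driver does this in PASM):

```python
def chip_select(vaddr):
    """Which 23K256 holds this address: 0 for $0000-$7FFF (/CS0),
    1 for $8000-$FFFF (/CS1). The in-chip address is the low 15 bits."""
    return (vaddr >> 15) & 1

assert chip_select(0x7FFF) == 0
assert chip_select(0x8000) == 1
```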
My intention right now is to prove the concept, and to get a ZiCog running with a 64KB virtual memory setup, with SPI Ram backing store.
Future additions
VMDEBUG, with a "DEBUG" flag being implemented for VMCOG that will make it watch a second port, a "DEBUG" port.
This will allow emulation-neutral breakpoints on any memory access, and single stepping any emulator that uses the VMCOG interface (or the identical future XMCOG interface). VMDEBUG will be able to view/modify any VM memory location (or XMM location with XMCOG), and set breakpoints on access (multiple breakpoints are easy), or have special breakpoints for only reads or writes.
It would also be possible to implement full access trace output logs, with page usage statistics, which would make debugging and optimizing emulations *MUCH* easier.
Once I implement split I/D, it will be possible to have separate break points for data read, data write, code fetch
Post Edited (Bill Henning) : 2/4/2010 5:53:15 PM GMT
Here is an SPI-Ram driver with random single byte access.
I have done this 1 year ago, so I don't remember the details, but it should be simple to expand it to 1..4 bytes per access.
The SPI uses the counters for a speed of 20 MHz for writes and 10 MHz for reads.
The connection is a bit different: I use one CS and SCLK but 2 data lines, and I have MOSI and MISO tied together, so only 4 pins are needed for 64kB. The idea was to have 2 bits per access, but it turned out this is not faster, because you can't use the counters as optimally as with 1 bit.
It's not tested with the Microchip RAMs but with the older AMI RAMs. According to the datasheets they should be similar.
Andy
I think you just might have saved me a whole bunch of work!
I based it on my Propteus board (I have many of those, and only two USB ProtoBoards - easy choice)
Here are the Pxx pins I am using, if you want to match my hardware configuration so you can try VMCOG without changing VMCOG.spin every time I update it
Reason for these pin numbers: I can add a TV out on P16-P18 later.
Post Edited (Bill Henning) : 2/5/2010 12:27:11 AM GMT
I am curious, in "dowrite" why is the last "rol phsb,#1" commented out? I count only 31 other instances.
Thanks,
Bill
Thoughts for later use... Since you have a dirty bit, that obviously shows this is a variable/stack section. When the dirty bit is set, maybe clear the counter so that it counts only write accesses. For page replacement, allow dirty (write) usage a higher priority to remain (shift the count left by some number of bits).
Allow a call to clear all counts. This would be done whenever a new program was loaded.
Why have I suggested the above...
Well I think trying to establish the variable data and keep it in hub would be best for performance. It would then also work best for flash (such as the SD). Obviously, this is going to result in it being better to make variables global to keep them resident.
just my 2c
When there is a write, there are two possibilities:
1) the address being written to is in the working set, and is resident in memory
2) the address being written to is not in memory
I am initially implementing a "write through" strategy, which works as follows:
- in all cases, write the value to the appropriate location in backing store (SPI RAM)
- if the address is NOT in the working set, we are done!
- if the address IS in the working set, write it to the appropriate location in the working set, increment the use count, and set the DIRTY bit
Sometimes VM's implement a "delayed write" strategy (basically similar to buffering output for disk access), in which case dirty pages are only written to the backing store when they become a candidate for being swapped out.
You see, with delayed writes, pages with the "dirty" bit set MUST be written out before the corresponding "real" memory page can be re-used.
Without delayed writes, it does not matter, as every write is reflected in the backing store immediately, so even "dirty" pages can be discarded at will.
Since I am implementing a "write through" VM, the dirty bit is not strictly necessary, but I wanted the option of trying delayed writes later.
The very nature of the LRU strategy will ensure that frequently used areas of memory will be resident, and infrequently used ones will not - so a rarely used big array will likely not be resident in memory most of the time, but the stack will be [noparse]:)[/noparse]
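The write-through policy can be modeled in a few lines (a Python sketch of the policy under the 256-byte page assumption, not VMCOG code):

```python
PAGE_SIZE = 256

def write_through(vaddr, value, backing, working_set, counts, dirty):
    """Write-through as described above: always update the backing
    store; if the page is also resident in the working set, update it
    too, bump its use count, and set its dirty bit."""
    backing[vaddr] = value                       # always hits backing store
    page = vaddr // PAGE_SIZE
    if page in working_set:                      # resident: update the copy
        working_set[page][vaddr % PAGE_SIZE] = value
        counts[page] = counts.get(page, 0) + 1
        dirty.add(page)

backing = {}
ws = {0: bytearray(PAGE_SIZE)}                   # only page 0 is resident
counts, dirty = {}, set()

write_through(0x0010, 0xAB, backing, ws, counts, dirty)
assert backing[0x0010] == 0xAB and ws[0][0x10] == 0xAB and 0 in dirty

write_through(0x4242, 0xCD, backing, ws, counts, dirty)
assert backing[0x4242] == 0xCD and 0x42 not in ws   # non-resident: backing only
```

Because every write reaches the backing store immediately, any resident page can be discarded without a write-back, which is exactly why the dirty bit is optional here.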
Post Edited (Bill Henning) : 2/5/2010 4:14:17 AM GMT
I started with mini-computers only having 10KB maximum (and often only 5KB) per program/partition (like a cog, 20 max) and a common area (like hub memory) of usually 10KB (but 80KB max). And now we complain about the prop only having 2KB per cog and 32KB ram + 32KB rom in hub *grin*. My own mini was the length of my garage and had a whole 110KB core memory which was the max the mini could have. BTW it was 6-bit ascii so 10KB = 10Kx6bits, plus a parity bit handled by hardware.
Counter A is used to generate the SCLK; counter B is "only" used as a multiplexer.
When you write the long into phsb, bit 31 goes to the SO pin. After the first shift (ROL) bit 30 goes to the pin... after 31 shifts bit 0 goes to the pin and there is no need to shift anymore. Just the 32nd SCLK must occur from counter A, then the counter must be stopped with the right timing.
Andy
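Andy's MSB-first bit ordering can be modeled in Python (a sketch of the bit sequence only; the real driver does this with the Propeller's counters and ROL in PASM):

```python
def msb_first_bits(value):
    """MSB-first shift-out of a 32-bit long: bit 31 appears on the pin
    first, and after 31 ROLs bit 0 is on the pin, so only 31 shifts are
    needed - the 32nd clock just latches the last bit."""
    bits = []
    for _ in range(32):
        bits.append((value >> 31) & 1)                      # bit on the SO pin
        value = ((value << 1) | (value >> 31)) & 0xFFFFFFFF  # ROL by 1
    return bits

assert msb_first_bits(0x80000000)[0] == 1    # bit 31 goes out first
assert msb_first_bits(0x00000001)[31] == 1   # bit 0 goes out last
```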
Thanks! Yep, I figured some other "old timers" would understand what I am doing [noparse]:)[/noparse] and CS students into their 3rd year *should* understand it.
I tend to write verbose simplified descriptions so that there is a chance that those without the background for it can understand it [noparse]:)[/noparse]
Andy:
Ah! Ok, I think I got it. Today I'll put just a single chip into my PDB with SI/SO tied together, and get that running before hacking the code... The reason I want to keep SI/SO separate (for now) is to make the code easier to read and not have to mess with changing DIRA all the time.
I am trying to apply the KISS principle for the first version, to get it running ASAP... then we can all go nuts optimizing it like crazy!
I may be changing the page translation table format, but the API defined above is frozen, and illustrates how to interact with VMCOG from PASM.
There will be a number of additional Spin API calls, but those are not necessary for using VMCOG, and are meant to allow me to debug VMCOG, and to allow writing debuggers for virtual machines.
Comments and suggestions are welcome.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
· Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)·
· Prop OS: SphinxOS·, PropDos , PropCmd··· Search the Propeller forums·(uses advanced Google search)
My cruising website is: ·www.bluemagic.biz·· MultiBlade Props: www.cluso.bluemagic.biz
Obviously there is a lot left to do, but assuming vm_flush means clearing the TLB to 0, your routine needs a little work. But you knew that :)
Jazzed: Thanks! Good catch. I must have been half-asleep :) I had even set up my usual "destinc" var at the bottom for incrementing/decrementing dest...
ALL: More updates later today...
1) I've changed the start() method to allow setting the working set address range on startup.
The new syntax is:
PUB start(mailbox,lastpage,numpages)
vm.start(@mailbox,$7C00,61) is the recommended setting for SphinxOS, which requires $7D00-$7FFF for buffers, I/O handles etc.
"61" above is the total number of hub pages allocated as the "physical memory" working set for VMCOG
This leaves $0000-$3FFF available for "regular" spin code, screen buffers etc.
2) VMCOG will now start by pre-initializing the TLB and loading as many virtual pages as are specified by numpages from the start of the virtual address space.
I am considering changing this later so that the last pre-loaded page is the last page of virtual memory, as that is where I would expect the stack to be. For stack-access-heavy emulations such as CogZ, it will probably be better to switch to a delayed-write strategy - good thing I allowed for a DIRTY bit :)
3) I've started writing VMCOG_Debugger, which will let me test VMCOG and can later serve as a basis for virtual machine debuggers.
Later I will also add performance monitoring tools to VMCOG + Debugger.
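The working-set arithmetic in 1) and the delayed-write idea in 2) can be sketched as a toy model. The page size, names, and write-back policy below are illustrative assumptions, not VMCOG's actual internals:

```python
PAGE_SIZE = 256  # assumed hub page size for this sketch

def working_set(lastpage, numpages):
    """Hub address range used as the 'physical memory' working set."""
    first = lastpage - (numpages - 1) * PAGE_SIZE
    last = lastpage + PAGE_SIZE - 1
    return first, last

class TinyVM:
    """Toy write-back cache: pages are written out only if evicted dirty."""
    def __init__(self):
        self.dirty = {}            # virtual page number -> DIRTY flag
        self.external_writes = 0

    def load(self, vpage):
        self.dirty[vpage] = False  # freshly loaded pages start clean

    def write(self, vpage):
        self.dirty[vpage] = True   # delayed write: just set the DIRTY bit

    def evict(self, vpage):
        if self.dirty.pop(vpage):  # flush to external RAM only if modified
            self.external_writes += 1

# vm.start(@mailbox, $7C00, 61) -> pages $4000..$7CFF, $0000-$3FFF left free
assert working_set(0x7C00, 61) == (0x4000, 0x7CFF)

vm = TinyVM()
vm.load(0); vm.load(1)
vm.write(0)                        # page 0 modified, page 1 only read
vm.evict(0); vm.evict(1)
assert vm.external_writes == 1     # only the dirty page was written back
```

This is why the DIRTY bit pays off for stack-heavy workloads: pages that are only read never cost an external write on eviction.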
The good news:
I am working on VMCOG_Debugger, and I've added Andy's SPI RAM code.
The bad news:
Something is wonky; it is not working right.
I'm going to find the data sheet for the part Andy used, and compare it to the Microchip 23K256's data sheet.