
VMCOG: Virtual Memory for ZiCog, Zog & more (VMCOG 0.976: PropCade,TriBlade_2,HYDRA HX512,XEDODRAM)

Bill Henning Posts: 6,445
edited 2011-02-02 22:41 in Propeller 1
VMCOG has now entered BETA testing - it works on PropCade, TriBlade_2, XEDODRAM and Hydra HX512 - see Page 9 for details.

Background

I made a suggestion a few days ago on how it might be possible to make a virtual memory manager that could be used by ZiCog to get 'acceptable' performance with slow external memory designs.

As many of you will know, VM implementations bear a striking resemblance to processor cache design - this is a natural consequence of solving a very similar problem, which is mapping a large but slow memory to a small but fast memory, while presenting the illusion that the whole large memory is fast.

Please note that there are two existing SD card based virtual memory projects I've heard about on the forum, but the aim of VMCOG is XMM and SPI RAM. Later who knows?

Here is the original post:

forums.parallax.com/forums/default.aspx?f=25&m=405722&p=15 from the 'Dracblade SBC now with Catalina C, PropBasic and CP/M' thread. I copied it into the second message in this thread.

The discussion then moved to the 'ZiCog a Zilog Z80 emulator in 1 Cog' thread, a far more appropriate place for it at that time.

forums.parallax.com/forums/default.aspx?f=25&p=032&m=332138

Here is Heater's new ZyCog thread:

forums.parallax.com/forums/default.aspx?f=25&m=423939

Heater asked very nicely that I find time to implement my suggestion - thus this project was born.

After Heater's suggestion for ZyCog, and my realizing that VMCOG would also allow for a VMSpin, I created this thread for VMCOG development.

This thread will be the official thread for developing VMCOG, which will be under the MIT license, free for personal or commercial use, as long as I (and future contributors) are credited in any software and documentation for software/hardware using VMCOG.

I will keep this top post updated with links to documentation, samples, and code (when it is ready), and I welcome questions, suggestions, optimizations etc.

Later today, I will post the start of the specifications, for now read the ZiCog and DracBlade threads to see what I have written today. I will combine and edit my postings to make the V0.1 VMCOG specification.

Somehow I will find the time to write it, and I will be demonstrating it at UPEW.

Alternate Usage Model

The VMCOG interface, minus the MMU functionality, could be used to provide a simple, standard interface for XMM implementations - the same interface to all types of extended memory solutions, regardless of how they are implemented: VMCOG (SPI RAM, or SD card using hub caching) or XMCOG (TriBlade, Morpheus, DracBlade, mctrivia's, etc. using XMM directly). I believe I will make an XMCOG for Morpheus...

Theory of Operation

It has long been known that 90%+ of the total run time of a program is typically spent in <10% of the code. This is why modern processors use multi-level caches in order to make the main (slow compared to processor clock rate) memory appear to be almost as fast as the processor's Level 1 cache.

Computers (and operating systems) take this one step further, implementing 'Virtual Memory', which treats a chunk of your hard drive as if it were RAM, and uses strategies very similar (and sometimes identical) to those used by Level 1, 2 and lately 3 caches.

Since the early days of computing, the available 'real memory' has been divided up into 'pages'. Each executing program is said to have a 'working set' of pages, which is some fraction of the total available 'real memory'. (I am not going to address variable sized segment based virtual memory here)

As a rule of thumb, the larger the 'working set', the more closely the speed of the 'virtual memory' approximates the speed of the 'real memory'.

The 'virtual memory' is stored in the 'backing store', and how fast pages can be read from, and written to, the 'backing store' greatly affects virtual memory operation.

Spin API

PUB start(mailbox,lastpage,numpages)
PUB rdvbyte(addr)
PUB rdvword(addr)
PUB rdvlong(addr)
PUB wrvbyte(addr,data)
PUB wrvword(addr,data)
PUB wrvlong(addr,data)
PUB Flush
PUB Look(addr)
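
To make the calling convention concrete, here is a rough Spin sketch of how a client object could use this API. The parameter meanings (mailbox pointer, last virtual page, number of hub pages) are my reading of the names, and the 4-long mailbox size follows the mailbox notes further down - treat it as illustrative rather than final:

OBJ
  vm : "VMCOG"                          ' the VMCOG driver object

VAR
  long mailbox[4]                       ' command/address/data mailbox shared with VMCOG

PUB demo | b
  vm.start(@mailbox, 255, 64)           ' mailbox, last virtual page, pages of hub working set (assumed meanings)
  vm.wrvbyte($1F3C, $AA)                ' write a byte to virtual address $1F3C
  b := vm.rdvbyte($1F3C)                ' read it back
  vm.flush                              ' write all dirty pages back to the backing store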

Virtual Memory LUT Specification

In order to translate 'virtual' memory addresses to 'real' hub addresses, VMCOG will use its first 256 cog longs to implement a Look Up Table.

256 pages of 256 bytes gives us 64KB of virtual memory, which fits neatly in two MCP 23K256 SPI RAM devices.

LUT Entry Definition

If a LUT entry is zero, the corresponding page of virtual memory is not present in the hub.

If a LUT entry is non-zero, it will be interpreted as follows:

MSB
V
PPPPPPPP PDXCCCCC CCCCCCCC CCCCCCCC

Where

PPPPPPPPP = hub address

The hub address is stored here so that the MOVI instruction can be used to update it without disturbing the rest of the bits in the page table entry

D = Dirty bit

This bit is set whenever a write is performed to any byte(s) in the page

X = Guard bit, must be zero

CCCCC CCCCCCCC CCCCCCCC = 21 bit read access counter

Every time a read is performed to this page, this count is incremented. If the count overflows into the Guard bit, every page count in the address translation table will be divided by two, and the Guard bit cleared, in order to ensure that the LRU page replacement algorithm will work well.

UPDATE: The LUT is very likely to change later this week. While writing some of the code I noticed that checking for counter wraparound is MUCH cheaper using the Carry bit instead of an explicit Guard bit, probably outweighing the benefit of clearing the upper bits simply by shifting the physical page bits down. The revised format is likely to be {count:22,Dirty:1,hubpage:9} but things are still in a state of flux...
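
For clarity, here is how one LUT entry in the format above would be picked apart and updated in PASM. This is only a sketch of the bit arithmetic implied by the {hubaddr:9, D:1, X:1, count:21} layout; given the UPDATE above, the real code may well use a different packing, and the names are mine:

' sketch only: decode and update of one LUT entry, assuming the layout described above
        mov     t1, entry               ' copy the LUT entry
        shr     t1, #23                 ' top 9 bits -> hub address of the resident page
        test    entry, dirty_bit wz     ' Z clear when the Dirty bit is set (page must be written back)
        add     entry, #1               ' count this read access
        test    entry, guard_bit wz     ' Z clear when the counter has overflowed into the Guard bit,
                                        ' the cue to halve every count in the table and clear the Guard bit

dirty_bit long  1 << 22                 ' D sits just below the 9 hub address bits
guard_bit long  1 << 21                 ' X sits just below D
entry     long  0                       ' one LUT entry (in the real cog this is a register in $000-$0FF)
t1        long  0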

Minimum hardware requirements

- any propeller board with five pins available for use
- two MCP23K256 SPI ram devices (Digi-Key part number 23K256-I/P-ND - currently $1.66 each)

VMCOG will use between 4KB and 16KB of hub memory as 'in-core' storage, and 64KB (possibly more) of external memory (SPI or parallel, latched or non-latched XMM design) as the 'backing store'

Supported 'real memory' (hub cache) size of the 'working set'

4KB - 16 pages of 256 bytes - guaranteed to be very slow
8KB - 32 pages of 256 bytes - may work fine for smaller CP/M programs
16KB - 64 pages of 256 bytes - should perform quite well!

Later I will allow a user settable number of pages (between 16 and 96) but I want to simplify things as much as possible for the first release, and theoretically, later it will be possible to run two (or more) VMCOG's servicing two (or more) ZiCogs. It will also be possible to 'share' the virtual address space, and implement 'shared memory' multi-ZiCog systems. Or even a hybrid MotoCog and Zicog system sharing the same virtual memory.

Supported 'virtual memory' (address space) sizes

Initially only a 64KB memory map will be supported as for the first version I will use a direct mapped LUT (virtual to real address translation look up table)

128KB would be easy to support if I switched to 512 byte pages, or used two VMCOGs.

Virtual address spaces larger than 128KB would require a more sophisticated handling of virtual to real address mapping, and while I *WILL* tackle that, I want to get something simple running first!

The easiest way to handle LARGE page tables is to move them to hub memory - something that will be quite feasible on Prop2, and is possible on Prop1 - but it would add 16-22 cycles to each access.

Virtual memory addresses will be 32 bits wide, and the virtual memory will be byte addressable.

I will also host pages and downloads for VMCOG at my site, I will post URL's later.

Using VMCOG

Here is how VMCOG will work (Real Soon Now (tm))

- you wait for vmcommand to become 0 (in case the cog is busy swapping a page in or out, or processing your last command)
- you write vmaddr with the virtual address you want to access (long)
- if you are going to write to the VM, you put your byte/word/long into vmdata (long)
- you write VMWRITE{B|W|L} or VMREAD{B|W|L} into the vmcommand location (word)
- if you were doing a VMREAD, you wait for vmcommand to become 0 before reading vmdata
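
As a sketch, that polling sequence might look like this from a client cog in PASM. The hub locations (vmcommand word, vmaddr long, vmdata long) and the VMREADB command value are placeholders based on the description above; the final mailbox layout and command codes may differ:

' client-side sketch of the sequence above (names and command value are placeholders)
rd_vm
:wait0  rdword  t1, pcommand wz         ' wait until VMCOG is idle / done with the previous command
  if_nz jmp     #:wait0
        wrlong  vaddr, pvmaddr          ' virtual address we want to read
        wrword  c_readb, pcommand       ' issue a VMREADB
:wait1  rdword  t1, pcommand wz         ' wait for VMCOG to complete the read
  if_nz jmp     #:wait1
        rdlong  result, pvmdata         ' the byte is now in vmdata
rd_vm_ret ret

pcommand  long  0                       ' hub address of vmcommand (filled in at startup)
pvmaddr   long  0                       ' hub address of vmaddr
pvmdata   long  0                       ' hub address of vmdata
c_readb   long  1                       ' placeholder VMREADB command code
vaddr     long  0
result    long  0
t1        long  0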

TO DO LIST

- get someone to make a small, fast 23K256 driver :)
Andy (Ariba) contributed one, I just need to make it run with the 23K256

Fast SPI driver for MCP23K256

- perhaps an adaptation of Mike Green's MCP23K256 driver, combined with fast SPI from MIT licensed fast fsrw SPI code?
- ideally using counters to read/write the SPI memory at 10Mbps (or even 20Mbps?)
- See XMM Code Interface Specification for how I need the SPI driver to interface to VMCOG (it will be part of VMCOG)

XMM Code Interface Specification

I invite authors of all existing (and future) XMM solutions who wish to be supported by VMCOG to submit four PASM subroutines as specified below. The code should be short, but fast.

All contributors to this project agree that any submitted code will be under the MIT license, with the understanding that the license does not extend to the underlying hardware - so no worries, you are specifically NOT allowing people to build clones of your hardware (unless you explicitly give permission to allow people to duplicate your hardware). This will be in the Copyright statement for VMCOG.

The reference implementation will be VMCOG/SPI, other implementations will be named as VMCOG/xmm_solution_name
chipsel   LONG 0   ' chip select, initially can only be 0 or 1 to choose from two SPI RAMs, not used for parallel XMM solutions
vmaddr    LONG 0   ' virtual address, initially $0000-$FFFF, later I plan to support at least 24 bits of address space
hubaddr   LONG 0   ' hub memory address to read from or write to
membytes  LONG 0   ' number of bytes to read or write to/from the hub

START - assert /CS for the device specified by 'chipsel', initially 2 pins are used to select RAM0 or RAM1 (the MCP23K256's are 32KB devices), later can choose between different XMM's on same prop :)
END - de-assert /CS for the device specified by 'chipsel'
READ - read 'membytes' number of bytes from virtual (extended) address 'vmaddr' to the hub starting at address 'hubaddr'
WRITE - write 'membytes' number of bytes to virtual (extended) address 'vmaddr' from the hub starting at address 'hubaddr'
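
In PASM terms, a submission boils down to four call/ret routines along these lines. This is only a skeleton showing the expected entry points and the parameter longs above; the routine names are placeholders and the bodies are elided, so it is not part of the specification itself:

' skeleton only - the four backing-store entry points (bodies elided, names are placeholders)
bs_start                                ' assert /CS for the device selected by chipsel
                                        ' ...decode chipsel, drive that /CS pin low...
bs_start_ret  ret

bs_end                                  ' de-assert /CS for the device selected by chipsel
                                        ' ...drive that /CS pin high...
bs_end_ret    ret

bs_read                                 ' read membytes bytes from the backing store at vmaddr into hub at hubaddr
                                        ' ...send read command + address, shift bytes in, wrbyte to hubaddr...
bs_read_ret   ret

bs_write                                ' write membytes bytes from hub at hubaddr to the backing store at vmaddr
                                        ' ...rdbyte from hubaddr, send write command + address, shift bytes out...
bs_write_ret  ret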

Why use VMCOG/SPI as the reference implementation?

- because it is the most challenging from a performance point of view
- every XMM design, no matter how many latches are used, is guaranteed to be faster than SPI RAM!
- it is BY FAR the least expensive way to try VMCOG
- I love a challenge :)

Future Optimizations

- implementing a 'delayed write' strategy (which is why I have the 'DIRTY' bit)
- changing the mailbox format for better performance
- possibly changing TLB format
- possibly checking if the access is on the same virtual page as the last access, optimizing that access

The new command format I am considering is as follows:

cmd LONG 0 ' the first long in the 4-long mailbox

3 bit command code as bits 29-31
29 bit virtual address (limits us to a 512MB virtual address space without an additional hub cycle) as bits 0-28

Commands would be encoded as follows:

000 = NOP (required for polling loop to function)
001 = rdvbyte
010 = rdvword
011 = rdvlong
100 = XOP (extended operation)
101 = wrbyte
110 = wrword
111 = wrlong

- To read a byte from $0000_1F3C would require writing $2000_1F3C to the mailbox
- To read a word from $0000_1F3C would require writing $4000_1F3C to the mailbox
- To read a long from $0000_1F3C would require writing $6000_1F3C to the mailbox

To write a byte/word/long, first I'd write the value to the second long in the mailbox, and then write
$A000_1F3C to write a byte, $C000_1F3C to write a word, $E000_1F3C to write a long!

The other operations such as VMFLUSH, VMDUMP, etc. would write a secondary opcode to the mailbox, and $8xxx_xxxx to invoke the Xtended operation.

This would save one hub write and one hub read on every read/write!
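
Packing that format is one shift and one mask. A quick Spin sketch, using my own constant names for the 3-bit codes listed above:

CON
  ' 3-bit command codes from the table above (names are mine)
  #0, VM_NOP, VM_RDBYTE, VM_RDWORD, VM_RDLONG, VM_XOP, VM_WRBYTE, VM_WRWORD, VM_WRLONG

PUB pack(cmdcode, addr) : cmdval
  ' command in bits 29..31, 29-bit virtual address in bits 0..28
  ' pack(VM_RDBYTE, $1F3C) -> $2000_1F3C, pack(VM_WRLONG, $1F3C) -> $E000_1F3C
  cmdval := (cmdcode << 29) | (addr & $1FFF_FFFF)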


NOTES

- Theoretically VMCOG could also use an SD card, RAMTRON FRAM, or even SPI flash for the backing store - however I suspect that may be too slow, and wear out FLASH quickly.
- one interesting extension for later versions is to use SPI Flash or SD cards to hold the code, and SPI (or parallel) ram to hold data and stack.
- VSpin is a possible name for a Spin VM I hope someone will make that uses VMCOG - it would allow 64KB for spin code!
- Morpheus users can remove the W25X80 Flash chip and use 23K256's in both FLASH sockets - no need to solder anything!
- it would be entirely possible to write an LMM kernel that accessed XMM through VMCOG
- a 'zero additional chip' reference platform is possible, using a 24LC1024 EEPROM, which would eventually wear out
- another 'zero additional chip' reference platform is possible using a 1Mbit FRAM device for combination boot EEPROM and backing store
- it is NOT possible to use the virtual memory as a 'live' frame buffer that is displayed by video drivers

Code Contributed to VMCOG

- Andy ('Ariba') contributed SPI SRAM handling code, looks good!
- heater contributed TriBlade_2 support and a better memory test and lots of debugging help
- jazzed contributed XEDODRAM support and lots of debugging help

Downloads (below my signature)

- VMCOG Spin API Documentation v0.22
- VMDEBUG + VMCOG v0.970 - working for PropCade, TriBlade_2, XEDODRAM and now Hydra HX512!
- VMACCESS.SPIN - sample pasm code for accessing the virtual memory, not tested

Useful Links

- 23K256 web page www.microchip.com/wwwproducts/Devices.aspx?dDocName=en539039
- 23K256 data sheet ww1.microchip.com/downloads/en/DeviceDoc/22100D.pdf
- introduction to virtual memory, from the U of Alberta: webdocs.cs.ualberta.ca/~tony/C379/Notes/PDF/08.4.pdf

www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com
My products: Morpheus / Mem+ / PropCade / FlexMem / VMCOG / Propteus / Proteus / SerPlug
and 6.250MHz Crystals to run Propellers at 100MHz & 5.0' OEM TFT VGA LCD modules
Las - Large model assembler Largos - upcoming nano operating system


Comments

  • Bill Henning Posts: 6,445
    edited 2010-02-03 20:08
    Here is a copy of my original post

    ********************************************************************************************

    Heater,

    With the ever-increasing number of XMM solutions, have you considered decoupling ZiCog from the memory access?

    I have not looked at the sources, so I don't know how easy the following would be - or if you have thought of something similar.

    Loosely, there are three types of memory accesses:

    - code fetch
    - data read
    - data write

    I am thinking of a solution where the memory access is handed off to another cog, and ZiCog requests memory actions through hub locations.

    Consider:

    
    codefetch  long 0
    coder        long 0
    
    dataread   long  0
    datar        long  0
    
    datawrite  long  0
    dataw      long  0
    
    ' When ZiCog wants to read an instruction from 'zpc', it does the following:
    
          wrlong zpc,codefetch
    cfl  rdlong  code, coder wz
      if_z jmp #cfl             ' when NOP (0) read, reader returns $1000_0000
    
    ' When ZiCog wants to read the byte at (HL)
    
          wrlong hl,dataread
    drl  rdlong  acc, datar wz
      if_z jmp #drl            ' when zero data byte read, returns $1000_0000
    
    ' when ZiCog wants to write (HL)
    
         wrlong acc, dataw        ' write the data value first
         wrlong hl, datawrite     ' then the address, which triggers the write
    
    



    The beauty of this approach is that it TOTALLY decouples ZiCog from specific XMM implementation, and the memory cog can try to do all sorts of caching etc.

    Adding new XMM targets is trivial.

    Frees up some LONGs in ZiCog

    Doing split I/D for 128K memory (which I think MP/M supported) is easy.

    Doing banked memory on any XMM becomes MUCH easier.

    Even better, in any instruction that is not a JUMP/CALL, the next instruction read can be done in parallel with executing the current instruction!

    Simply ask for the next instruction before processing the current one.

    The hub delay slots can also be used :)

    I think it would potentially run faster.

    This would also make it trivial to provide breakpoints for execution or data access, and monitoring locations, performance etc.

    On the hardware side...

    This would also allow a super-cheap ZiCog config I was thinking about, by using two MCP23K256 SPI ram's or FRAMs (with a speed penalty)

    What do you guys think?
    heater said...
    mikediv: Brave man. I still don't know anything much about the Hydra RAM expansion. I suspect it takes more PASM code to drive it than will fit in the COG with the Z80 emulation.

    Not sure, so if I were attempting it my approach would be to forget about VGA and keyboard and such. Just get a PASM program reading and writing bytes from the external RAM, random access, first. Then you know exactly how many longs you need to do that and can see if it will fit into zicog.

    If it fits then move on to getting emulation and CP/M running using the Prop Plug or whatever link back to the PC as a serial terminal interface. Use the terminal emulator in BST.

    If that ever gets working then worry about how we are going to do peripherals, keyboard and VGA.

    It's a long road....
  • Bill Henning Posts: 6,445
    edited 2010-02-03 20:10
    Summary of discussion before this thread was started

    ***********************************************************************

    Quick summary of the relevant previous responses in other threads:

    Me:

    I was thinking of doing a classic LRU page replacement policy. (LRU = Least Recently Used)

    In a 64KB address space there are (obviously) 256 pages of 256 bytes.

    My paper design uses cog locations 0..255 for the LRU page table. When the cog is started, location 0 would contain a JMP #$100, which would do a MOV 0,#0 to clear the first page table entry.

    Bits 0-8 (the source field) of the page table would contain the upper 7 bits (on prop 1) of the hub page where that page resides.

    The upper 25 bits would be the access counter, allowing counting up to 32M accesses.

    As there is only 32KB of hub ram on the current prop, bit 8 would be used as a 'dirty' bit (set whenever a write occurs to that page).

    If a page is not present in memory, the whole register is set to 0.

    When any count approached 16M, all counts should be cut in half.

    Say you wanted to read $3F29 in the virtual memory address space.

    
    vmm_addr  long  0   ' the vmm address to be read
    vmm_data  long  0   ' data read / or data to be written
    vmm_tmp   long  0   ' temporary
    vmm_hub   long  0   ' address of virtual byte
    
    ' read a byte - NOT optimized yet, off-the-cuff sample code, I am sure it can be optimized
    
    vmm_read: 
          mov   vmm_tmp, vmm_addr
          shr     vmm_tmp,#8
          movd  vmm_check, vmm_tmp
          movs  vmm_fix, vmm_tmp
    vmm_check: 
          or  0,#0 wz
      if_z  jmp #page_load
    vmm_fix:
          mov     vmm_hub,0             ' source patched above: fetch the LUT entry for this page
          and     vmm_hub,#$FF          ' keep the hub page number (bits 0..7; bit 8 is the dirty bit)
          shl     vmm_hub,#8            ' page number -> hub byte address
          and     vmm_addr,#$FF         ' offset within the page
          or      vmm_hub,vmm_addr      ' full hub address of the requested byte
          rdbyte  vmm_data,vmm_hub      ' read it
    vmm_read_ret: ret
    
    ' page_load: finds the smallest non-zero entry in the first 256 registers, saves that entry into vmm_tmp and that register number in MIN
    ' if bit 8 (the dirty bit) is set, it must write that virtual page back to the backing store first
    ' sets the register whose address is saved in MIN to zero
    ' reads 256 bytes into the HUB page pointed to by vmm_tmp bits 0..7
    ' sets the register pointed to by vmm_tmp bits 0..7 to 512+hub page (access count of one, pointing at the correct hub page)
    
    
    



    It might be more efficient to store the page table entries as long {hubpage:7, dirty:1, count:24} because then a simple SHR by #25 would get you the hub page address, and SHL hub page by #8 and it has zeroed the low 8 bits, making it ready to or in the offset. Sorry, I have not spent any time optimizing it yet.
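
    As a sketch of that alternative, with the entry packed as {hubpage:7, dirty:1, count:24} from the top bit down (the bit order is my assumption, and this is untested):

          mov     vmm_hub, vmm_entry      ' copy the page table entry
          shr     vmm_hub, #25            ' top 7 bits -> hub page number (256 byte pages)
          shl     vmm_hub, #8             ' page number -> hub byte address, low 8 bits already clear
          test    vmm_entry, vmm_dbit wz  ' Z clear when the dirty bit (bit 24) is set
          add     vmm_entry, #1           ' bump the 24 bit access counter in the low bits

    vmm_dbit  long 1 << 24
    vmm_entry long 0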

    With 64 pages (16KB of hub buffer) I'd expect well over 90% hit ratio (test code would actually be able to calculate this).

    When there is a hit, the unoptimized code above takes about 11 instructions and one hub access... call it 44+22 cycles worst case, less than 1us any way.

    If we assume that the average ZiCog instruction emulation takes 2us for the instruction, and .5us for an unlatched byte read, the total instruction time would increase to 3us on page hits, and something much worse on misses. Say 256 bytes * 8 bits per byte * 100ns (assuming 10Mbps SPI read) = 204.8us + 2us for the instruction - let's call it 207us.

    At 2.5us per ZiCog instruction, 1M instructions would take 2.5s

    If 90% of the instructions hit, there would be 900,000 hits at 3us, and 100,000 at 207us, for a total of 2.7s + 20.7s = 23.4s - approximately 10.7% of the speed of pure xmm ZiCog

    If 95% of the instructions hit, there would be 950,000 hits at 3us, and 50,000 at 207us, for a total of 2.85s + 10.35s = 13.2s - approx. 20% the speed of pure xmm ZiCog

    If the Z80 is like most mainframes, the hit rate would be more like 99%

    At 99%, there would be 990,000 at 3us, and 10,000 at 207us, for a total of 2.97s + 2.07s = 5.04s - approx. 49.6% of the speed of pure xmm ZiCog.

    Of course if the average ZiCog instruction (with XMM) took 3.75us (50% more than above) the VM approach could reach 75% of XMM performance.

    Note that hits would take at most 0.8us, and that page reads at 20Mbps would take 103.5us.

    Optimizing the read code would have good effect.

    Using 20Mbps reads from the SPI RAM would offer a dramatic improvement.

    My best guess?

    For "average" software, 50%+ of XMM speed should be attainable.

    This will be fun to test :)

    Does anyone have any idea how many ZiCog instructions are processed per second? the 2.5us average (including TriBlade unlatched read) was a WAG based on reading almost 400k instructions per sec earlier.
    Cluso99 said...
    @Bill: Once the operating model is understood, it is likely some fine tuning of the various ram blocks could be done. Of course, finding this is actually fairly simple at the expense of slowing it down while doing so. Just add a little hub table and increment each time a block is accessed.

    There are obviously only certain sections of the 64KB CPM space that would require "overlaying". From what I understand, the blocks would be from about $FF00 or $F000 downwards, leaving as much as possible from $0000 upwards always resident.

    That's the beauty of the LRU algorithm... initialization code, and infrequently used code would automatically be swapped out, and the most used code would always be resident automatically!

    Heater

    Bill: Re: Your question about moving the Z80's memory access operations to another COG.

    Yes it has been considered. Basically it would work like the ZiCog's IN and OUT instructions work now.

    I do like the idea of decoupling chunks of software functionality wherever possible, from a software engineering point of view. Straight away it makes life much easier for those who want to port to different hardware. Like mikediv wants to do for the Hydra. Looks like it saves a few LONGs in the Z80 COG as well.

    There are two reasons why I have not pursued that idea:

    1) Conservation of COGs. I always looked at COGs as being few in number and precious. Seemed a waste to use a whole 32 bit CPU for just XMM access. Hence "ZiCog", a Z80 emulator in ONE Cog.

    2) Speed. I have yet to see how it can be done in a way that does not slow things down. This is perhaps not such a big issue. Given all the PASM that has to be executed per Z80 op the impact may not be so great. On the other hand I like Cluso's RamBlade attitude, "Everything for speed".

    3) Simplicity. At least for the early ZiCog versions there was only one XMM solution.

    Regarding banked memory for CP/M 3 and MP/M. We have code in place for bank switching the Z80 RAM space. It's very small, tight and fast. I don't see much room for improvement there.

    One problem you seem to have glossed over is in the idea that the Cog handling the RAM can somehow do work in the background and hence recover the time lost in COG-COG communications. As far as I can tell this is not possible, or at least won't work as well as one might expect.

    Consider: It looks like the memory COG could be (pre)fetching the next Z80 opcode while the current Z80 instruction is executing.
    Problem: The current instruction does a data access to memory. Oops, it has to wait in the "data_fetch" until the memory COG gets around to it.
    Problem: When a Z80 jump, call or ret is made the prefetched op is now junk and a new op has to be fetched. This throws away the prefetch time saving. It also means the "code_fetch" path has to check if the requested address is already prefetched or not. It has to do this on every code fetch, and this eats time. There are a lot of jumps in Z80 code.

    Now it could be that with all the "swings and roundabouts" we have going on here that a dedicated XMM Cog solution can be made that is faster than what we have now or at least breaks even. So, Bill, if you would like to experiment with it we would love to see what the results are :)

    P.S. I've softened up my stance on "wasting" a COG for XMM. As it is we've eaten up all the Prop pins for RAM and the HUB is pretty full so there is no point in saving COGs that have nowhere to work.

    Dr_Acula

    Bill, that is a fascinating analysis of the speeds. There would definitely be speed increases for latched versions as you would read in blocks of data and hence most of the time would only be changing the low byte latch. Plus some other code optimisations would make it possibly twice as fast to access memory.

    In practice a typical program sitting at 100H is going to be almost always linear with local jumps so that code will be very efficient. There will also be bios calls (keyboard, display output) which will jump to locations in high ram, but these will be the same each time so those blocks will end up on the list fairly early on and then stay there.

    As a rough guide the dracblade runs the same as about a 3.5MHz Z80. Cluso's runs faster.

    I guess if sphinx does manage to save a whole lot of hub ram we can experiment with what to do with that. Video buffer ram for graphics? Faster speed? Or maybe the user can choose.


    Heater

    Bill: PLEASE, PLEASE can you find some time to implement your XMM driver COG with "code_fetch", "data_fetch" etc.

    I was writing that last post only minutes after waking up so, still tired, the following motivations had not occurred to me:

    1) We have a number of XMM options that are just inherently slow. Those that use complicated latching schemes and those that could be made using serial devices. In these cases any speed hit due to COG-COG communications is probably not going to be noticeable in the final result. We can still have the "all out for speed" Tri/RamBlade option in the ZiCog code wrapped up in #ifdefs so nothing is lost.
    Using serial memory appeals to me, may be slow but I'd love to have some free pins such that ZiCog can do IN/OUT to them directly from Z80 code.

    2) If you add operations for reading and writing WORDs we get more speed back. ZiCog does a lot of WORD accesses.

    3) This can probably be done without wasting a COG. Just combine it with the TriBlade XMM block move driver or such.

    4) MoCog. The MoCog 6809 emulator PASM is getting huge. It will require two COGs. Hopefully only two. If both those COGs need access to XMM (likely) then your suggested XMM handler COG would a) Save duplicating access code in two COGs. b) Make life much easier, saves having two COGs fighting for those RAM pins.

    5) ZyCog. Yes "ZyCog" not "ZiCog" See below.


    What the heck is ZyCog?

    For a long time now I've pondered two things:

    1) Is there a nice byte code, like Spin, that could be interpreted in one or two Cogs, like Z80, but more efficient and with much larger address space. For use with code in external memory.

    2) Is there such a byte code that exists already and has a nice compiler to go with it. C or whatever. So that we have a ready to run tool chain. Yes there is Java but "no thank you".

    Recently I found the answer, the ZPU processor core from ZyLin AS.

    Get this:

    1) The ZPU processor core is, in terms of logic blocks, the smallest 32 bit CPU.
    2) Its instructions are all byte wide, good for XMM.
    3) There are only a handful of instructions, so the ZPU can probably be emulated in one COG.
    4) There is a version of GCC that generates code for the ZPU.

    Yes, that's right, with ZPU emulation we can use the GCC compiler for the Propeller and have huge programs in external RAM.

    Hence my new project "ZyCog" the ZyLin ZPU processor in a COG.

    ZyCog is as yet unannounced and has no Prop code. It's just an idea so don't tell anyone :)
    At least I got as far as getting the GCC to generate ZPU code to experiment with.

    Heater

    Bill: "Does anyone have any idea how many ZiCog instructions are processed per second?"

    A long time ago this was measured with a frequency counter whilst the Z80 was executing its op code test program. Results were published here somewhere. No idea where, and no memory of the numbers. Less than 1 million, more than 500,000 per second.

    Cluso99

    A few comments on the above...

    For ZiCog, the XMM cog should have 2 separate rendezvous locations, one for instruction fetch and one for data fetch. That way the prefetch doesn't get flushed every time a data fetch occurs. Also, it may as well prefetch words, or even longs.

    ZyCog - love the idea :)

    Heater

    Cluso: Bill already has code and data separated, see the DracBlade thread.

    WORD access is suggested above. It's surprising how much WORD access goes on in an 8 bit CPU. All those jumps, calls and rets need WORD access. Then there's the loading, storing, PUSHing and POPing, of 16 bit registers.

    Looks like what we are about to design is the world's first 8 bit processor with a 16 bit data bus!

    More on ZyCog later....

    Dr_Acula

    I think Bill might be on to something here.

    Take the Dracblade. Remove all the latches. Remove the Sram. Put a 64k serial ram chip on the eeprom bus. Implement a Sphinx OS that frees up 14k of hub ram. Maybe toss out the LCD code for the moment, and the wireless layer, and the upper 512k code, and toss out the ramblade code too. Maybe optimise the VT100 code a bit. I think that should get us to 16k of free hub ram, maybe more.

    Put a ram driver in the cog that is currently running the sram driver code. This new ram driver handles a list of 256 ram blocks of 256 bytes each.

    The list handling is going to be a priority list. Each time a block is accessed you add 1 to a counter for that block. Rank them in order. If a new block is needed, take the lowest ranking one, put it into serial ram, and then get the new block. Can this all fit into a cog? I think it should. Is the serial ram driver code the same as the eeprom driver code, and if so, does it already exist somewhere (perhaps in the SD card object)?

    Just looking at ram now SPI or I2C. Code exists for both I think.

    This could halve the size of the dracblade board for starters, and decrease the chip count from 9 to 4. Plus free up a number of propeller pins for audio or more serial ports.

    Agree a block write then read from serial ram will be slow, but that ought to happen only very infrequently. Possibly never for a small sbasic/c/assembly program.

    We can't do this now because there are 7 blocks of 2k code sitting in ram in random locations.

    A thought? Maybe we can use it without even needing sphinx! Just tell the serial ram driver cog the locations of the 7 blocks of 2k code, and any more free code area. It can then have a simple list of where it keeps each block of 256 bytes.

    Heater

    Well Bill is THE inventor of the LMM technique for the Prop. So if he thinks he's on to something we should all sit up and pay attention.

    This idea of a COG handling external memory with caches etc may not be as fast as the direct xxxBlade approach we have now but for those who want to save pins and for the up-and-coming ZPU emulator it should be a very good compromise.


  • Bill Henning Posts: 6,445
    edited 2010-02-03 22:18
    I've quickly fleshed out and commented the skeleton VMCOG.

    One change: I've decided to change the command mailbox format.

    The mailbox MUST be long aligned.

    WORD: VMCOMMAND
    WORD: BYTES (MUST be 1, 2, or 4)
    LONG: Virtual Address (currently must be $0000-$FFFF)
    LONG: data read/written from/to specified virtual address
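
    A sketch of that layout as it might be declared on the client side (labels are mine, not from the driver, and the alignment filler is just one way to guarantee long alignment):

    DAT
                  long                    ' alignment filler so the mailbox starts on a long boundary
    vmcommand     word    0               ' command code, 0 = idle / done
    vmbytes       word    0               ' transfer size, MUST be 1, 2 or 4
    vmaddr        long    0               ' virtual address (currently $0000-$FFFF)
    vmdata        long    0               ' data read from / written to that address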

  • Bill Henning Posts: 6,445
    edited 2010-02-03 22:57
    Response to Heater's post in the ZiCog thread

    (Reasons why Heater did not decouple memory access into a separate cog)

    1) Conservation of COGs

    For some applications it does not matter - for example Dr_Acula has a "latch cog" that could be merged into VMCOG

    2) Speed

    I think it will be fine for most CP/M applications, as long as enough memory is allocated as page buffers. 8K-16K should give good performance

    3) Simplicity

    You got me there, however with the many existing (and upcoming) XMM solutions, a VMCOG approach will actually be simpler than supporting an ever increasing number of memory interfaces.

    "One problem you seem to have glossed over is in the idea that the Cog handling the RAM can some how do work in the background and hence recover the time lost in COG-COG communications. As far as I can tell this is not possible, or at least won't work as well as one might expect."

    I think overlapping is possible for a great number of instructions - obviously excepting branches, and quick immediate data loads. I did mention it won't work for branches etc. in the original post.

    Response to Dr_Acula's post in the ZiCog thread

    "Bill, that is a fascinating analysis of the speeds. There would definitely be speed increases for latched versions as you would read in blocks of data and hence most of the time would only be changing the low byte latch. Plus some other code optimisations would make it possibly twice as fast to access memory."

    Thanks! I actually believe that performance will be acceptable for most software - after all, all I am doing is implementing a software memory management unit for client cogs. I also think there will definitely be a benefit to solutions like yours, as the latches will only need to be set when reading a memory location that is not currently in the working set.

    Response to another post from Heater in the ZiCog thread

    "Bill: PLEASE, PLEASE can you find some time to implement your XMM driver COG with "code_fetch", "data_fetch" etc."

    I am making the time :)

    I did change the architecture a bit, as when I was implementing it I realized that a direct command dispatch is much faster than polling several different request registers.

    For the initial version, I have gone to a combined I/D space (no separate code and data), however later I can easily separate I/D by switching to 512 byte pages or keeping the LUT in the hub.

    Re/ 1 - number of slower XMM solutions, plus one faster (Tri/Ramblade), and the appeal of low pin count serial memory

    I could not agree more. I started down this path because I was trying to figure out how Dr_Acula's three-latch design could be speeded up by software. Once I thought of adding a software memory management unit, I realized this may make serial SPI memory "fast enough" to have 64KB available to CP/M... thus allowing anyone with any prop board and 5 pins to run ALL the CP/M software out there!

    Re/ 2 - supporting word access

    It was always planned, I remember coding for the Z80 and loading HL at once. It is a minor pain when the word (or long) crosses a page boundary, but it is not a big deal

    Re/ 3 - not wasting a cog, combine with Triblade XMM handler

    I intended to merge Dr_Acula's latch driver cog in :)

    It might be a waste to use VM on non-latched designs such as Tri/RamBlade, I think they would be slowed down by adding the MMU

    Re/ 4 - MoCog needing two COG's

    I can support two command ports, then we can avoid using LOCK's - but then I have to check both command ports. Tough call as to which is better.

    Re/ 5 ZyCog

    I love it! I could not resist taking a quick peek at it. Nice simple stack machine.

    It will have to wait for a slightly later version, the initial version will only implement a 64KB virtual space; however if I move the LUT to the HUB, I can implement a multi-megabyte virtual space.

    I think the MIPS R2000 would also fit in a single cog.

    Thanks for the approx. 500k+ Z80 instructions per sec. I was in the ball park :) 2us = 500k instructions :)

    Response to Cluso99 in the ZiCog thread

    "For ZiCog, the XMM cog should have 2 seperate rendezvous locations, one for instruction fetch and one for data fetch. That way the prefetch doesn't get flushed every time a data fetch occurs. Also, it may as well prefetch words, or even longs."

    Data is always read a page at a time from the backing store, and a fetch can only flush a page if the page it is located in is not present in the working set, and when a page must be flushed, the least-accessed page will be flushed - so no worries about code flushing data, or data flushing code :)

    It will be possible to request a byte, word or long, even though this may cross page boundaries.

    Response to another post from Dr_Acula in the ZiCog thread

    "Take the Dracblade. Remove all the latches. Remove the Sram. Put a 64k serial ram chip on the eeprom bus. Implement a Sphinx OS that frees up 14k of hub ram. Maybe toss out the LCD code for the moment, and the wireless layer, and the upper 512k code, and toss out the ramblade code too. Maybe optimise the VT100 code a bit. I think that should get us to 16k of free hub ram, maybe more....."

    Exactly! That is exactly the idea, and you obviously got it, as your list handling description is how LRU page replacement strategies work. For "Ranking", a count of how often each page is accessed is maintained, and the least-used page is always the one that is flushed.


  • Cluso99 Posts: 18,069
    edited 2010-02-03 22:58
    @Bill: Excellent idea. Good luck.

    Now I have been thinking. For an initial test, it would be OK to use the SD card, since we have these.

    SD disadvantages...
    • Wear levelling, but that is a small price to pay for now.
    • Writes will be slower because it's really flash.

    SD advantages...
    • can use the existing fast fsrw
    • the cache can be directly addressed in a contiguous file
    • blocks are 512 bytes
    • ZiCog uses this already for its CP/M disks so it's already coded

    Your thoughts?



    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
    · Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
    · Prop OS: SphinxOS, PropDos, PropCmd    Search the Propeller forums (uses advanced Google search)
    My cruising website is: www.bluemagic.biz    MultiBlade Props: www.cluso.bluemagic.biz
  • Dr_Acula Posts: 5,484
    edited 2010-02-03 23:25
    The thing that intrigues me about this concept is the simplicity of the hardware. For simple testing, it is a single 8 pin ram chip worth $1.50 that you can add to any current propeller board. The hardware is probably even simpler than adding an SD card. So anyone with a prop board could get involved in this. The chip could even be mounted 'dead bug' style if there is no proto area.

    I think this idea could be the key to getting large memory models working for a range of current software projects.

    Only minor suggestion - re
    WORD: VMCOMMAND
    WORD: BYTES (MUST be 1, 2, or 4)
    LONG: Virtual Address (currently must be $0000-$FFFF)
    LONG: data read/written from/to specified virtual address

    Can I suggest the virtual address be 00000000 to FFFFFFFF

    You may as well use the full long and for testing it can be 0000 to FFFF (or even 0000 to 8000H). You wouldn't want to get stuck with only 64k *grin*.

    Presumably if you asked for a location higher than actual memory it would return an error?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.smarthome.viviti.com/propeller
  • Bill Henning Posts: 6,445
    edited 2010-02-03 23:28
    Thank you!

    I will be using SPI RAM, precisely because of the much slower writes on SD/flash.

    It has however occurred to me that the code segment could be stored in SPI FLASH, SD cards etc without penalty...

    In order to keep the initial version as simple and as easy to write as possible, I will be implementing 256 byte pages for now.

    Later I will probably move to 512 byte pages as then I can easily support a 128KB virtual address space, however page sizes larger than that will have to wait for Prop2. If we really need a larger virtual address space, I can put the page lookup table into the hub, and just keep resident page LRU counters in the hub, but that will slow VM access down. May be worth it for ZyCog!
  • Bill Henning Posts: 6,445
    edited 2010-02-03 23:37
    Thanks - I like that part too. Sometime in the near future I will add two 23K256's to a USB ProtoBoard, and show pics of how to do it, as that will be the minimum supported configuration (two 23K256's, and five propeller pins, on any Propeller board - even a breadboard!)

    I used a LONG for the Virtual Address so that later a full 32 bit address space can be exposed :) but for the initial test version, it will be limited to the 64KB supplied by two 23K256's

    A much slower version could implement the whole 32 bit (4GB) address space by using an SDHC card; however write performance would be very poor, as not only are writes to flash slow, but I'd also have to move the page translation table to the hub, slowing things even more. I believe localroger is working on something like that.

    If this works well, I will implement a 24 bit address space version, as Morpheus allows up to 16MB of ram :)

    Later (as in MUCH later, at least a year from now) it would even be possible to implement a multi-level store... with XMM or SPI ram used to cache a uSD 4GB SDHC card, providing different levels of performance depending on the amount of XMM memory.

    (which, incidentally, is more than enough to run uCLinux, if ZyCog can run it)

  • jazzed Posts: 11,803
    edited 2010-02-04 01:11
    Considering the time it takes to determine and return data on a cache hit and miss, the approach for caching between XMM and HUB does not appear more effective than direct fast XMM access. Even with a 3 or 4 instruction hit detector, the delivery mechanism is at best an 8-12 instruction delivery cycle. We will of course see the result of your effort in the end.

    The idea is of course very good for slow "like geological time" SPI/I2C devices.

    I wish you luck with implementation.

    --Steve
  • Bill Henning Posts: 6,445
    edited 2010-02-04 02:39
    Hi Steve,

    This whole thing started as a thought experiment on how to potentially improve DracBlade 3-latch XMM performance with software, then turned into

    "hey, maybe I can make ZiCog and <insert other 8 bit emulator> run fast enough with SPI ram!"

    In a nutshell, I agree with you. This approach is basically aimed at slow serial devices such as I2C and SPI based memory of various types, where it should be able to provide "OK" performance for many applications - on the order of 50% - 75% of XMM speed, assuming that the time required to emulate the virtual opcodes outweighs the average time required to fetch them from XMM.

    As I stated quite early on... there is no way it will ever be as fast as unlatched XMM access, or even somewhat latched XMM access.

    It is still a fascinating experiment!

    And the potential side effect is a generic medium speed XMM interface, as the same mailbox/command approach could be used to present a unified interface to XMM, allowing new XMM implementations immediate access to XMM software such as ZiCog, Catalina etc, until there is time to write fine-tuned implementation specific code. This may in fact be a more significant long term beneficial result than SPI RAM virtual memory.

    Thank you for your wishes!
  • jazzed Posts: 11,803
    edited 2010-02-04 03:10
    Bill Henning said...

    And the potential side effect is a generic medium speed XMM interface, as the same mailbox/command approach could be used to present a unified interface to XMM,
    Being generic is good. There is always resistance to standardization for various reasons.

    Obviously you will be using a separate COG for caching. You may be able to take some ideas for sharing between COGs from what I posted before near the bottom of this thread.
    http://forums.parallax.com/showthread.php?p=866475

    Cheers.
  • heater Posts: 3,370
    edited 2010-02-04 08:56
    Re: MoCog needing two COG's

    In MoCog only one of its COGs will be doing anything at a time. Two COGs are not there to provide parallel execution, just more PASM code space.

    So no need for two memory ports or locks etc.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • Bill Henning Posts: 6,445
    edited 2010-02-04 17:01
    jazzed:

    Thanks for the link! Interesting thread...

    For now I am going with a classic VM setup, with VMCOG acting like combination MMU and backing store controller, with 256 byte pages.

    My reasons for 256 byte pages are as follows:

    - allows me to fit the page translation table, dirty bit, and access counter into just 256 cog longs
    - allows more pages in the hub at the same time
    - decreases probability of "thrashing"
    - "natural" fit for 8/16 bit emulations
    - semi-reasonable amount of time to read/write a page
    - instruction set / emulation architecture neutral "pseudo-hardware MMU"

    Later, I plan on making a version with 512 byte pages - that will allow 128KB of virtual memory which will allow split I/D for 8 bit emulations

    heater:

    Thanks, that is great news; locking would slow every access down by 16-22 cycles even in ideal circumstances, as would monitoring multiple request ports

  • jazzed Posts: 11,803
    edited 2010-02-04 17:32
    Bill Henning said...
    jazzed:

    Thanks for the link! Interesting thread...

    For now I am going with a classic VM setup, with VMCOG acting like combination MMU and backing store controller, with 256 byte pages.

    ....
    Of course, it is very flexible unlike my experiments.

    For greatest flexibility and performance though I assume VMCOG must be launched separately with parameter control. Inter-COG data transfer as you know can be relatively slow, and I was looking for ways to fix that.

    What I ended up with was a 2 COG solution where 1 did the lookup (crude single line), and 1 COG did fetch and store. Fetch and store was done separately so that best possible parallel access performance could be achieved.
  • Bill Henning Posts: 6,445
    edited 2010-02-04 17:47
    Here is what I am working on (interleaved with my commercial projects):

    1) A VMCOG.spin "driver", that on launch is passed a pointer to the mailbox

    My intention is that the mailbox can initially be filled with the startup parameters - such as the start of the page cache (corresponding to physical memory in classic MMU designs) and the number of available pages (of physical memory)

    This way, for extreme testing, it will be possible to start up VMCOG with a very small working set - theoretically as low as 1 page, however I think I will hard wire a minimum of 8 pages, and a maximum of 64 on Prop1.

    v0.12 of VMCOG adds a "VMDUMP" message, which will save a copy of the whole page translation table to the address provided in the message.

    This will also allow VM performance monitoring, as an emulated CPU could occasionally ask for a VMDUMP, and a user app could then see the usage counters for all the active pages, and watch the LRU replacement policy work in pseudo real time.
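
    For example, a monitoring sketch over a dumped table could be as simple as the Spin below. The 256-long snapshot buffer and the entry layout (21-bit access counter in the low bits, 0 meaning "not resident") follow the first post and may change, and the dump request itself is not shown since the exact command encoding isn't final:

    PUB hottest_page(ptable) : page | i, best
      ' scan a 256-long VMDUMP snapshot for the most-read resident page
      repeat i from 0 to 255
        if (long[ptable][i] & $1F_FFFF) > best
          best := long[ptable][i] & $1F_FFFF
          page := i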

    2) A VMCOG_Demo.spin "application", which will launch VMCOG

    VMCOG_Demo provides an old-style serial "monitor" program for manipulating the virtual memory, including being able to view the LUT "real time"

    I am using a serial interface to make it as easy as possible for everyone to try it, and so I can write it as quickly as possible :)

    3) Sample minimal hardware schematic for VMCOG

    The minimum required hardware to add to any Propeller board for running VMCOG. It will consist of two 23K256 devices, and five I/O lines:

    /CS0 - select $0000-$7FFF SPI Ram
    /CS1 - select $8000-$FFFF SPI Ram
    CLK - SPI clock
    MOSI - Prop output, SPI Ram input pin
    MISO - Prop input, SPI Ram output pin

    Obviously later it will be possible to add /CS2 and /CS3, and possibly use demultiplexers for them, however I leave that for a later date.

    My intention right now is to prove the concept, and to get a ZiCog running with a 64KB virtual memory setup, with SPI Ram backing store.

    Future additions

    VMDEBUG, with a "DEBUG" flag being implemented for VMCOG that will make it watch a second port, a "DEBUG" port.

    This will allow emulation-neutral breakpoints on any memory access, and single stepping any emulator that uses the VMCOG interface (or the identical future XMCOG interface). VMDEBUG will be able to view/modify any VM memory location (or XMM location with XMCOG), and set breakpoints on access (multiple breakpoints are easy), or have special breakpoints for only reads or writes.

    It would also be possible to implement full access trace output logs, with page usage statistics, which would make debugging and optimizing emulations *MUCH* easier.

    Once I implement split I/D, it will be possible to have separate break points for data read, data write, code fetch
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system

    Post Edited (Bill Henning) : 2/4/2010 5:53:15 PM GMT
  • AribaAriba Posts: 2,690
    edited 2010-02-04 21:21
    Bill

    Here is an SPI-Ram driver with random single byte access.
    I have done this 1 year ago, so I don't remember the details, but it should be simple to expand it to 1..4 bytes per access.
    The SPI uses the counters for a speed of 20 MHz for writes and 10 MHz for reads.

    The connection is a bit different, I use one CS and SCLK but 2 Data Lines, and I have MOSI and MISO tied together, so only
    4 pins are needed for 64kB. The idea was to have 2 bits per access, but it turned out this is not faster, because you can't use
    the counters as optimal as with 1 bit.

    It's not tested with the Microchip RAMs but with the older AMI RAMs. According to the datasheets they should be similar.

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-04 21:28
    Thanks Andy!

    I think you just might have saved me a whole bunch of work!
    Ariba said...
    Bill

    Here is an SPI-Ram driver with random single byte access.
    I have done this 1 year ago, so I don't remember the details, but it should be simple to expand it to 1..4 bytes per access.
    The SPI uses the counters for a speed of 20 MHz for writes and 10 MHz for reads.

    The connection is a bit different, I use one CS and SCLK but 2 Data Lines, and I have MOSI and MISO tied together, so only
    4 pins are needed for 64kB. The idea was to have 2 bits per access, but it turned out this is not faster, because you can't use
    the counters as optimal as with 1 bit.

    It's not tested with the Microchip RAMs but with the older AMI RAMs. According to the datasheets they should be similar.

    Andy
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-05 00:22
    I've almost finished building a minimal test board for VMCOG; I will post photos when I am done.

    I based it on my Propteus board (I have many of those, and only two USB ProtoBoards - easy choice)

    Here are the Pxx pins I am using, if you want to match my hardware configuration so you can try VMCOG without changing VMCOG.spin every time I update it

    Reason for these pin numbers: I can add a TV out on P16-P18 later.

      '--------------------------------------------------------------------------------------------------
      ' SPI RAM Driver constants
      '--------------------------------------------------------------------------------------------------

      CS0           = 19  ' /CS0 for SPI RAM used for $0000-$7FFF
      CS1           = 20  ' /CS1 for SPI RAM used for $8000-$FFFF
      CLK           = 21  ' CLK for SPI RAM
      MOSI          = 22  ' MOSI for SPI RAM (Prop output, SPI input)
      MISO          = 23  ' MISO for SPI RAM (Prop input, SPI output)

      '--------------------------------------------------------------------------------------------------
      ' SD Driver constants - I put it on a separate bus to make debugging VMCOG easier
      '--------------------------------------------------------------------------------------------------

      CS2           = 24  ' /CS2 for SD card with my 'Hobo' SD holder
      SD_CLK        = 25  ' CLK for SD card - named separately so it can coexist with CLK above; later can be shared with the SPI RAMs
      SD_MOSI       = 26  ' MOSI for SD card - later can be shared with the SPI RAMs
      SD_MISO       = 27  ' MISO for SD card - later can be shared with the SPI RAMs
    
    

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system

    Post Edited (Bill Henning) : 2/5/2010 12:27:11 AM GMT
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-05 02:40
    Hi Andy,

    I am curious, in "dowrite" why is the last "rol phsb,#1" commented out? I count only 31 other instances.

    Thanks,

    Bill
    Ariba said...
    Bill

    Here is an SPI-Ram driver with random single byte access.
    I have done this 1 year ago, so I don't remember the details, but it should be simple to expand it to 1..4 bytes per access.
    The SPI uses the counters for a speed of 20 MHz for writes and 10 MHz for reads.

    The connection is a bit different, I use one CS and SCLK but 2 Data Lines, and I have MOSI and MISO tied together, so only
    4 pins are needed for 64kB. The idea was to have 2 bits per access, but it turned out this is not faster, because you can't use
    the counters as optimal as with 1 bit.

    It's not tested with the Microchip RAMs but with the older AMI RAMs. According to the datasheets they should be similar.

    Andy
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-05 03:01
    Bill, I have been thinking (dangerous I know LOL).

    Thoughts for later use... Since you have a dirty bit, that obviously shows that this is a variable/stack section. When the dirty bit is set, maybe clear the counter so that it counts only write accesses. For page replacement, allow dirty (write) usage higher priority to remain (shift the count left by a few bits).

    Allow a call to clear all counts. This would be done whenever a new program was loaded.

    Why have I suggested the above...
    Well, I think trying to identify the variable data and keep it in hub would be best for performance. It would then also work best for flash (such as the SD). Obviously, this means it will be better to make variables global, to keep them resident.

    just my 2c
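
    One possible reading of that suggestion in code (a sketch only, with an assumed table layout and weighting - not anything in VMCOG today):

      ' Sketch only: weight dirty (written) pages so they survive replacement longer,
      ' and provide a "clear all counts" call for when a new program is loaded.
      CON
        DIRTY_SHIFT = 2                   ' assumed weight: dirty pages score 4x

      PUB page_score(count, dirty) : score
        if dirty
          score := count << DIRTY_SHIFT   ' write-heavy pages look "hotter" to the LRU scan
        else
          score := count

      PUB clear_counts(ptable, n) | i
        repeat i from 0 to n - 1
          long[ptable][i] &= !$FFFF       ' assumed layout: usage count in the low word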

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
    · Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
    · Prop OS: SphinxOS, PropDos, PropCmd · Search the Propeller forums (uses advanced Google search)
    My cruising website is: www.bluemagic.biz · MultiBlade Props: www.cluso.bluemagic.biz
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-05 04:09
    Close!

    When there is a write, there are two possibilities:

    1) the address being written to is in the working set, and is resident in memory

    2) the address being written to is not in memory

    I am initially implementing a "write through" strategy, which works as follows:

    - in all cases, write the value to the appropriate location in backing store (SPI RAM)

    - if the address is NOT in the working set, we are done!

    - if the address IS in the working set, write it to the appropriate location in the working set, increment the use count, and set the DIRTY bit

    Sometimes VMs implement a "delayed write" strategy (basically similar to buffering output for disk access), in which case dirty pages are only written to the backing store when they become candidates for being swapped out.

    You see, with delayed writes, pages with the "dirty" bit set MUST be written out before the corresponding "real" memory page can be re-used.

    Without delayed writes, it does not matter, as every write is reflected in the backing store immediately, so even "dirty" pages can be discarded at will.

    Since I am implementing a "write through" VM, the dirty bit is not strictly necessary, but I wanted the option of trying delayed writes later.

    The very nature of the LRU strategy will ensure that frequently used areas of memory will be resident, and infrequently used ones will not - so a rarely used big array will likely not be resident in memory most of the time, but the stack will be :)
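
    A minimal sketch of that write-through path (the data layout, 256-byte pages, and routine names below are assumptions for illustration, not VMCOG's actual internals):

      ' Sketch only: write-through handling of one byte write, as described above.
      CON
        NUMPAGES = 61                     ' assumed working-set size
        PAGESIZE = 256                    ' assumed page size

      VAR
        long vpage[NUMPAGES]              ' which virtual page each hub page holds (-1 = free)
        long usecnt[NUMPAGES]             ' LRU use counters
        byte dirty[NUMPAGES]              ' DIRTY flags
        long hubbase                      ' hub address of working-set page 0

      PUB wrvbyte(vaddr, value) | i
        spi_wrbyte(vaddr, value)          ' 1) always write through to the SPI RAM backing store
        repeat i from 0 to NUMPAGES - 1   ' 2) linear scan of the translation table (a sketch;
          if vpage[i] == (vaddr >> 8)     '    the real lookup can be much smarter)
            byte[hubbase + i * PAGESIZE + (vaddr & $FF)] := value  ' 3) resident: update hub copy
            usecnt[i]++                   '    bump the use count
            dirty[i] := 1                 '    and set the DIRTY bit
            quit

      PRI spi_wrbyte(vaddr, value)
        ' placeholder - the actual SPI RAM transfer lives in the SPI driver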
    Cluso99 said...
    Bill, I have been thinking (dangerous I know LOL).

    Thoughts for later use... Since you have a dirty bit, that obviously shows that this is a variable/stack section. When the dirty bit is set, maybe clear the counter so that it counts only write accesses. For page replacement, allow dirty (write) usage higher priority to remain (shift the count left by a few bits).

    Allow a call to clear all counts. This would be done whenever a new program was loaded.

    Why have I suggested the above...

    Well, I think trying to identify the variable data and keep it in hub would be best for performance. It would then also work best for flash (such as the SD). Obviously, this means it will be better to make variables global, to keep them resident.

    FLUSHVM has been in the VMCOG.spin file since the beginning :)

    just my 2c
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system

    Post Edited (Bill Henning) : 2/5/2010 4:14:17 AM GMT
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-05 05:43
    Bill: Nice. I have implemented limited cache systems in minis in the '70s, as have others in my field. So I understand what you are doing.

    I started with mini-computers having only 10KB maximum (and often only 5KB) per program/partition (like a cog, 20 max) and a common area (like hub memory) of usually 10KB (but 80KB max). And now we complain about the Prop only having 2KB per cog and 32KB RAM + 32KB ROM in hub *grin*. My own mini was the length of my garage and had a whole 110KB of core memory, which was the max the mini could have. BTW it was 6-bit ASCII, so 10KB = 10K x 6 bits, plus a parity bit handled by hardware.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
    · Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
    · Prop OS: SphinxOS, PropDos, PropCmd · Search the Propeller forums (uses advanced Google search)
    My cruising website is: www.bluemagic.biz · MultiBlade Props: www.cluso.bluemagic.biz
  • AribaAriba Posts: 2,690
    edited 2010-02-05 05:47
    Bill Henning said...
    Hi Andy,

    I am curious, in "dowrite" why is the last "rol phsb,#1" commented out? I count only 31 other instances.

    Counter A is used to generate the SCLK; counter B is "only" used as a multiplexer.
    When you write the long into phsb, bit 31 goes to the SO pin. After the first shift (ROL), bit 30 goes to the pin ...
    after 31 shifts bit 0 goes to the pin and there is no need to shift anymore. Only the 32nd SCLK must still occur from
    counter A, then the counter must be stopped with the right timing.

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-05 17:21
    Cluso:

    Thanks! Yep, I figured some other "old timers" would understand what I am doing :) and CS students in their 3rd year *should* understand it.

    I tend to write verbose, simplified descriptions so that there is a chance that those without the background can understand it :)

    Andy:

    Ah! OK, I think I've got it. Today I'll put just a single chip into my PDB with SI/SO tied together, and get that running before hacking the code... The reason I want to keep SI/SO separate (for now) is to make the code easier to read, and to not have to mess with changing DIRA all the time.

    I am trying to apply the KISS principle for the first version, to get it running ASAP... then we can all go nuts optimizing it like crazy!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-05 20:00
    I added the VMCOG Spin API documentation as an attachment to the first post.

    I may be changing the page translation table format, but the API defined above is frozen, and illustrates how to interact with VMCOG from PASM.

    There will be a number of additional Spin API calls, but those are not necessary for using VMCOG, and are meant to allow me to debug VMCOG, and to allow writing debuggers for virtual machines.

    Comments and suggestions are welcome.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-06 04:54
    Congratulations Bill. Looking good smile.gif

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
    · Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
    · Prop OS: SphinxOS, PropDos, PropCmd · Search the Propeller forums (uses advanced Google search)
    My cruising website is: www.bluemagic.biz · MultiBlade Props: www.cluso.bluemagic.biz
  • jazzedjazzed Posts: 11,803
    edited 2010-02-06 06:36
    Nice work so far.
    Obviously there is a lot left to do, but assuming vm_flush means clear the TLB to 0, your routine needs a little work. But you knew that :)

    vm_flush  movd    flush,#255          ' point the destination at the last of 256 TLB longs
              mov     count,#256
    flush     mov     0-0,#0              ' clear one entry (destination patched by movd/sub)
              sub     flush,x200          ' <- added line: decrement the destination field
              djnz    count,#flush
              mov     nextpage,firstpage

    x200      long    $200                ' 1 << 9 - one step of the destination field (bits 9..17)
    
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-06 16:45
    Cluso: Thanks!

    Jazzed: Thanks! Good catch. I must have been half-asleep :) I had even set up my usual "destinc" var at the bottom for incrementing/decrementing dest...

    ALL: More updates later today...

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-06 18:42
    UPDATES:

    1) I've changed the start() method to allow setting the working set address range on startup.

    The new syntax is:

    PUB start(mailbox,lastpage,numpages)

    vm.start(@mailbox,$7C00,61) is the recommended setting for SphinxOS, which requires $7D00-$7FFF for buffers, I/O handles etc.

    "61" above is the total number of hub pages allocated as the "physical memory" working set for VMCOG

    This leaves $0000-$3FFF available for "regular" spin code, screen buffers etc.
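
    As a quick check that these recommended numbers hang together (assuming 256-byte pages, which is consistent with the figures above; the mailbox size and constant names below are just for illustration):

      ' Sketch only: the recommended SphinxOS settings, with the working-set span computed.
      CON
        PAGESIZE = 256                            ' assumed VMCOG page size
        LASTPAGE = $7C00                          ' hub address of the last working-set page
        NUMPAGES = 61                             ' hub pages handed to VMCOG as "physical memory"

        WS_START = LASTPAGE - (NUMPAGES - 1) * PAGESIZE   ' = $4000
        WS_END   = LASTPAGE + PAGESIZE - 1                ' = $7CFF

      OBJ
        vm : "VMCOG"

      VAR
        long mailbox[4]                           ' mailbox size here is an assumption

      PUB go
        vm.start(@mailbox, LASTPAGE, NUMPAGES)    ' working set spans $4000..$7CFF, leaving
                                                  ' $0000-$3FFF for "regular" Spin code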

    2) VMCOG will now start by pre-initializing the TLB and loading as many virtual pages as are specified by numpages from the start of the virtual address space.

    I am considering changing this later so that the last pre-loaded page is the last page of virtual memory, as that is where I would expect the stack to be. For stack-access-heavy emulations such as CogZ, it will probably be better to switch to a delayed-write strategy - good thing I allowed for a DIRTY bit :)

    3) I've started writing VMCOG_Debugger, which will let me test VMCOG and can later serve as the basis for virtual machine debuggers.

    Later I will also add performance monitoring tools to VMCOG + Debugger.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-08 00:21
    Update:

    The good news:

    I am working on VMCOG_Debugger, and I've added Andy's SPI Ram code.

    The bad news:

    Something is wonky; it is not working right.

    I'm going to find the data sheet for the part Andy used, and compare it to the MCP 23K256's data sheet.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
Sign In or Register to comment.