
VMCOG: Virtual Memory for ZiCog, Zog & more (VMCOG 0.976: PropCade,TriBlade_2,HYDRA HX512,XEDODRAM)


Comments

  • David BetzDavid Betz Posts: 14,516
    edited 2010-09-07 12:21
    How does SDRamCache work? Is it a direct mapped cache or is it associative? What would be the advantage of using it through VMCOG rather than directly?
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-07 12:33
    Hi Steve,

    I am definitely expanding VMCOG to support a larger virtual address space; however, supporting more than 8MB requires a slightly different interface, since only 23 bits are available for an address in the cmd register.

    Actually, I expect to be doing at least two different versions of large VM support, and benchmarking them to see which works better. I think I figured out a way to support a full 256MB without too great an additional speed penalty (guesstimate: a 5-10% reduction in speed).

    I'd prefer it if you kept the cache interface alive; I think it makes more sense for I2C EEPROMs, where paging a full page at a time may be sub-optimal.

    My first pass at a simple large VM will:

    - split the TLB into a "page present" table, moved into the hub, and
    - an access count / DIRTY / LOCK entry for each page in the working set, kept in the cog

    The "page present" table will hold a single byte for each potential page. If the byte is 0, the page is not present in the working set. If it is not zero, then it is used as the direct index into the TLB maintained in the cog.

    The "page present" byte is encoded as: 00nnnnnn where nnnnnn is the upper six bits of the 15 bit hub ram address. The upper two bits are zero in this simple case, but can be used in the second version of the VM.

    A 1KB page present table will support 512KB of VM, a 2KB table 1MB, and so on.
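
    In rough C (a sketch only; the real code is PASM inside VMCOG, and the names here are placeholders), the lookup comes out to something like:

    #include <stdint.h>

    #define PAGE_SHIFT 9                      /* 512-byte pages                  */
    static uint8_t page_present[1024];        /* 1KB table -> 512KB of VM        */

    /* Return the 15-bit hub address holding 'vaddr', or -1 on a page fault. */
    static int vm_lookup(uint32_t vaddr)
    {
        uint32_t vpage = vaddr >> PAGE_SHIFT;          /* virtual page number    */
        uint8_t  entry = page_present[vpage];          /* 00nnnnnn, or 0         */
        if (entry == 0)
            return -1;                                 /* not in the working set */
        /* nnnnnn is the upper six bits of the 15-bit hub address */
        return ((int)entry << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1));
    }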

    I will do this first, in order to get a big VM running ASAP.

    The second pass for the VM gets cute, and supports 8MB VMs with only a 512-byte "page present" table.

    The 23 bit VM address will be broken up as follows:

    tttttpppppppppbbbbbbbbb

    where

    - bbbbbbbbb is the 9-bit "byte within page" offset
    - ppppppppp is the 9-bit "page present table" index
    - ttttt is the five-bit "tag" representing one of the 32 "regions" of 256KB in the virtual address space

    The TLB in the cog will be made two way associative, and will hold the tag for up to two pages in different regions!

    This way, it will be simple & fast to check for page presence, and the two way associativity will help immensely against potential thrashing.

    It would also help if the more frequently used of an associative pair were checked first, by occasionally sorting associative sets by usage count.
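
    As a simplified sketch of the decode and the two-way check in C (in the real design the page present byte points into the cog TLB; here one small set array stands in for both, and all names are placeholders):

    #include <stdint.h>

    typedef struct {
        uint8_t  tag[2];       /* region tag held by each of the two ways      */
        uint16_t hubaddr[2];   /* hub base of the resident page (0 = empty)    */
        uint16_t count[2];     /* usage counts, for ordering the pair          */
    } tlb_set_t;

    static tlb_set_t tlb[512];                    /* one set per 9-bit index   */

    static int vm_lookup23(uint32_t vaddr)        /* hub address, or -1: fault */
    {
        uint32_t offset = vaddr & 0x1FF;          /* bbbbbbbbb (9 bits)        */
        uint32_t set    = (vaddr >> 9) & 0x1FF;   /* ppppppppp (9 bits)        */
        uint32_t tag    = (vaddr >> 18) & 0x1F;   /* ttttt     (5 bits)        */
        tlb_set_t *s = &tlb[set];

        /* way 0 is checked first; occasional sorting by count keeps the
           more frequently used page of the pair there */
        for (int way = 0; way < 2; way++)
            if (s->hubaddr[way] && s->tag[way] == tag) {
                s->count[way]++;
                return s->hubaddr[way] + offset;
            }
        return -1;                                /* page fault                */
    }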

    Now to support 256MB, I'll have to reduce the command bits to 4, to leave 28 bits available for the address, and use a ten-bit tag with a 512-entry page present table, or even better, a nine-bit tag with a 1024-entry page present table.
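
    Packing a 4-bit command with a 28-bit address into one mailbox long would look roughly like this (a sketch; the actual command encoding is not settled):

    #include <stdint.h>

    #define VM_CMD_SHIFT  28
    #define VM_ADDR_MASK  0x0FFFFFFFu        /* 28 bits: up to 256MB of VM */

    static inline uint32_t vm_pack(uint32_t cmd, uint32_t addr)
    {
        return (cmd << VM_CMD_SHIFT) | (addr & VM_ADDR_MASK);
    }

    static inline uint32_t vm_cmd(uint32_t w)  { return w >> VM_CMD_SHIFT; }
    static inline uint32_t vm_addr(uint32_t w) { return w &  VM_ADDR_MASK; }
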
    jazzed wrote: »
    Bill, I plan to port the SDRAM driver to VMCOG. If you can make room, I can put in the fast burst code rather than the slower loop code. Have a look at the last SdramCache.spin file I posted to understand the challenges. If it makes sense performance-wise and VMCOG can be expanded to larger backstore memory support, I can drop the cache interface entirely. Consider it a casual challenge :)

    --Steve
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-07 14:05
    2,830,000 page faults with a single-page working set, and the USB serial driver under Ubuntu crashed!

    Now back to my FlexMem debugging.
  • jazzedjazzed Posts: 11,803
    edited 2010-09-07 16:10
    @Bill. Yes, I have no intention of retiring the cache interface. I already have the non-burst interface driver plugged into an old vmcog, but without room for the burst interface, the performance will suffer. BTW, as it exists today, the cache uses a one-bit command interface in the lower (ignored) bits of the address. The return value is a buffer pointer.
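
    A rough sketch of that convention (the handshake and names here are illustrative guesses, not the actual SdramCache code):

    #include <stdint.h>

    #define CACHE_WRITE_BIT  0x1              /* rides in the low, line-ignored    */
                                              /* address bits: 0 = read, 1 = write */

    volatile uint32_t cache_cmd;              /* mailbox: written by the client    */
    volatile uint8_t *cache_buf;              /* mailbox: set by the cache cog     */

    static uint8_t *cache_request(uint32_t addr, int write)
    {
        cache_cmd = addr | (write ? CACHE_WRITE_BIT : 0);
        while (cache_cmd != 0)                /* assumed: the driver clears cmd    */
            ;                                 /* once the line is ready            */
        return (uint8_t *)cache_buf;          /* pointer to the hub buffer         */
    }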

    @David. There is no value in trying to connect the cache interface to vmcog, since they both eat a cog and performance would essentially be cut in half. The cache is direct mapped.

    I've been posting less because of personal business; I've had little time to read the forums or do other tech stuff recently.

    --Steve
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-07 16:27
    Makes sense... however I need about 16 command codes for VMCOG.
  • David BetzDavid Betz Posts: 14,516
    edited 2010-09-07 18:24
    jazzed wrote: »
    @David. There is no value in trying to connect the cache interface to vmcog, since they both eat a cog and performance would essentially be cut in half. The cache is direct mapped.
    So I guess I misunderstood your original statement about porting your SDRAM driver to VMCOG. I had assumed you meant you would make SDRamCache work with VMCOG in some way and I didn't really understand what the advantage of that would be. Now I see that is not what you were talking about at all.

    Bill & Steve: What do you think the relative advantages are of the VMCOG way of handling memory with large pages and the SDRamCache way of handling memory? I would guess that the transfer size is different between the two and probably also the page replacement algorithm. What do you think are the advantages of each method?
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-07 18:42
    Best guess?

    Steve could answer for RamCache better, but I will take a stab at it. RamCache does not have a page replacement algorithm, because it is a standard two-way associative "direct" mapped cache. It uses a 64-byte cache line. Due to the 64-byte cache line and two-way associativity it will replace cache lines faster, but may thrash more if more than two "mod 64" lines are often in use. As a cache, there is no page replacement policy, so often-used cache lines can still be flushed. Best guess: RamCache will be faster for EEPROMs.

    VMCog is a "classic" virtual memory implementation, loosely based on the PDP-11/VAX model, with a software implementation of an MMU. VMCog uses a 512-byte page because that fits the Prop's architecture very nicely (and also fits SDHC sector sizes), and it is historically a common page size. Currently only 64KB of VM is supported, but soon I will support at least 2MB. As long as the VM is under 2MB, the VM does not have to implement associativity, so it may thrash less than a cache; however, reading/writing a whole page will take longer. For VMs larger than 2MB I will have to add two-way associativity, or implement two-level page tables - which would be far too slow.

    It will be very interesting to benchmark both approaches, and see which types of memory interfaces with what types of applications work best. I can pretty well guarantee that for some apps one will outperform the other!


    David Betz wrote: »
    So I guess I misunderstood your original statement about porting your SDRAM driver to VMCOG. I had assumed you meant you would make SDRamCache work with VMCOG in some way and I didn't really understand what the advantage of that would be. Now I see that is not what you were talking about at all.

    Bill & Steve: What do you think the relative advantages are of the VMCOG way of handling memory with large pages and the SDRamCache way of handling memory? I would guess that the transfer size is different between the two and probably also the page replacement algorithm. What do you think are the advantages of each method?
  • David BetzDavid Betz Posts: 14,516
    edited 2010-09-07 18:52
    Have you considered using an LRU page replacement algorithm? Or does that perform worse than your current access counter based scheme?
  • jazzedjazzed Posts: 11,803
    edited 2010-09-07 19:02
    David Betz wrote: »
    Bill & Steve: What do you think the relative advantages are of the VMCOG way of handling memory with large pages and the SDRamCache way of handling memory? I would guess that the transfer size is different between the two and probably also the page replacement algorithm. What do you think are the advantages of each method?
    Honestly, the real advantage in the cache design is not in the direct-mapped cache and small line load/store, which may favor slow storage. The advantage is in the interface, which reduces access time by keeping a buffer pointer: on a hit, a single access gets the data, rather than a minimum of 3 hub accesses per data element on each access.
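
    The client-side fast path is roughly this (illustrative C only; the real check lives in the interpreter cog, and these names are made up):

    #include <stdint.h>

    #define LINE_SIZE 64u                      /* assumed cache line size        */

    extern uint8_t *cache_request(uint32_t addr, int write);  /* mailbox call   */

    static uint32_t cur_base = (uint32_t)-1;   /* VM address of the current line */
    static uint8_t *cur_buf;                   /* hub pointer to that line       */

    static uint8_t vm_rdbyte(uint32_t addr)
    {
        if ((addr & ~(LINE_SIZE - 1)) != cur_base) {       /* miss: ask the cog  */
            cur_base = addr & ~(LINE_SIZE - 1);
            cur_buf  = cache_request(cur_base, 0);
        }
        return cur_buf[addr & (LINE_SIZE - 1)];            /* hit: one access    */
    }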
  • David BetzDavid Betz Posts: 14,516
    edited 2010-09-07 19:06
    Bill and Steve: Thanks for your comments on the relative advantages of VM and caches. I've been playing with a direct-mapped cache myself as well as using an LRU algorithm with VMCOG but I haven't been able to improve on the current VMCOG performance yet. I guess this will all be academic since there is already a mature implementation of both VM and cache available but it's been fun to do.
  • jazzedjazzed Posts: 11,803
    edited 2010-09-07 20:01
    David, there is nothing wrong with exploring.

    Bill, my SDRAM cache design allows using several line sizes. Lines of 64 or 128 bytes seem optimal, but 32 can also be used. The EEPROM read-only cache design I did for Propeller JVM uses a 16-byte cache line and performs favorably against the HUB-only implementation.
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-07 20:47
    Umm... the counter scheme I use is a modified Least Recently Used page replacement algorithm :)

    The modification is keeping the access count of the last page, to prevent thrashing for new accesses.
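
    A minimal sketch of one way such a counter policy can work (illustrative C, not the VMCOG PASM; having the incoming page inherit the evicted page's count is just one reading of the modification):

    #include <stdint.h>

    #define WS_PAGES 16                        /* assumed working-set size       */

    typedef struct { uint32_t vpage; uint32_t count; int dirty; } ws_entry_t;
    static ws_entry_t ws[WS_PAGES];

    static int pick_victim(void)               /* lowest access count loses      */
    {
        int v = 0;
        for (int i = 1; i < WS_PAGES; i++)
            if (ws[i].count < ws[v].count)
                v = i;
        return v;
    }

    static void page_in(uint32_t new_vpage)
    {
        int v = pick_victim();
        /* ...write back ws[v] if dirty, then read new_vpage into its hub slot... */
        uint32_t inherited = ws[v].count;      /* keep the outgoing page's count  */
        ws[v].vpage = new_vpage;
        ws[v].dirty = 0;
        ws[v].count = inherited;               /* so the fresh page is not        */
    }                                          /* evicted again right away        */
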
    David Betz wrote: »
    Have you considered using an LRU page replacement algorithm? Or does that perform worse than your current access counter based scheme?
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-07 20:48
    Scratching head...

    Unless you pull the cache read code into the application (interpreter) cog, you should need a similar number of hub accesses... assuming you use a mailbox to talk to the cache cog.
    jazzed wrote: »
    Honestly, the real advantage in the cache design is not in the direct-mapped cache and small line load/store, which may favor slow storage. The advantage is in the interface, which reduces access time by keeping a buffer pointer: on a hit, a single access gets the data, rather than a minimum of 3 hub accesses per data element on each access.
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-07 20:50
    You are welcome... I intend to have a lot more fun trying to improve the performance and trying different replacement algorithms!

    I will also try Steve's caching, and a hybrid approach.
    David Betz wrote: »
    Bill and Steve: Thanks for your comments on the relative advantages of VM and caches. I've been playing with a direct-mapped cache myself as well as using an LRU algorithm with VMCOG but I haven't been able to improve on the current VMCOG performance yet. I guess this will all be academic since there is already a mature implementation of both VM and cache available but it's been fun to do.
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-07 20:52
    Tunable cache line sizes are a great idea, as they will allow you to tune for the best performance for a given interpreter and backing store type!

    When I have time, I intend to try your JVM - I thought it was a really nice implementation, based on what I read in the thread.

    Large page / cache line sizes really only shine with relatively fast backing stores.
    jazzed wrote: »
    David, there is nothing wrong with exploring.

    Bill, my SDRAM cache design allows using several line sizes. Lines of 64 or 128 bytes seem optimal, but 32 can also be used. The EEPROM read-only cache design I did for Propeller JVM uses a 16-byte cache line and performs favorably against the HUB-only implementation.
  • jazzedjazzed Posts: 11,803
    edited 2010-09-08 06:04
    Scratching head...

    Unless you pull the cache read code into the application (interpreter) cog, you should need a similar number of hub accesses... assuming you use a mailbox to talk to the cache cog.
    @Bill,

    I went into the details of it in the zog thread. In summary there are actually a few *levels* to the cache driver: 1) zog decides in a few short instructions if the current buffer pointer contains the requested address and uses hub access instructions with an index to get/save data, 2) if the requested address is out of range, it requests a new current buffer from the SdramCache driver which manages the actual cache.
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-08 07:23
    Got it, thanks - so part of it was integrated into ZOG. Makes sense!
    jazzed wrote: »
    @Bill,

    I went into the details of it in the zog thread. In summary there are actually a few *levels* to the cache driver: 1) zog decides in a few short instructions if the current buffer pointer contains the requested address and uses hub access instructions with an index to get/save data, 2) if the requested address is out of range, it requests a new current buffer from the SdramCache driver which manages the actual cache.
  • David BetzDavid Betz Posts: 14,516
    edited 2010-09-09 05:05
    I finally got my direct-mapped cache experiment working. Unfortunately, it doesn't perform as well as Bill's standard VMCOG algorithm. I'll probably try a two-way cache next. Here are the results for a direct-mapped cache with 32 128-byte cache lines. Even though it isn't fast, this version of VMCOG has the feature that it can handle full 32-bit addresses. I guess that could be useful for people who have larger RAMs (mine is limited to 64k anyway).

    I've attached the code in case anyone is interested. Unfortunately, I've stripped out all but the "MORPHEUS1" branch because I found it difficult to understand the code with all of the #ifdefs.
    fibo(00) = 000000 (00000ms)
    fibo(01) = 000001 (00000ms)
    fibo(02) = 000001 (00000ms)
    fibo(03) = 000002 (00000ms)
    fibo(04) = 000003 (00001ms)
    fibo(05) = 000005 (00002ms)
    fibo(06) = 000008 (00003ms)
    fibo(07) = 000013 (00005ms)
    fibo(08) = 000021 (00009ms)
    fibo(09) = 000034 (00014ms)
    fibo(10) = 000055 (00023ms)
    fibo(11) = 000089 (00037ms)
    fibo(12) = 000144 (00061ms)
    fibo(13) = 000233 (00098ms)
    fibo(14) = 000377 (00160ms)
    fibo(15) = 000610 (00258ms)
    fibo(16) = 000987 (00418ms)
    fibo(17) = 001597 (00677ms)
    fibo(18) = 002584 (01096ms)
    fibo(19) = 004181 (01773ms)
    fibo(20) = 006765 (02869ms)
    fibo(21) = 010946 (04643ms)
    fibo(22) = 017711 (07512ms)
    fibo(23) = 028657 (12155ms)
    fibo(24) = 046368 (19667ms)
    fibo(25) = 075025 (31823ms)
    fibo(26) = 121393 (10530ms)
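
    For reference, the direct-mapped geometry above works out roughly like this (a sketch only; the attached Spin/PASM is the actual code, and the names are made up):

    #include <stdint.h>

    #define LINE_SIZE  128u
    #define NUM_LINES  32u                     /* 32 x 128 = 4KB of cache        */

    static uint32_t line_tag[NUM_LINES];       /* upper address bits per line    */
    static int      line_valid[NUM_LINES];
    static uint8_t  line_buf[NUM_LINES][LINE_SIZE];

    static int cache_hit(uint32_t addr)
    {
        uint32_t line = (addr / LINE_SIZE) % NUM_LINES;   /* which of 32 lines   */
        uint32_t tag  = addr / (LINE_SIZE * NUM_LINES);   /* remaining bits      */
        return line_valid[line] && line_tag[line] == tag; /* a miss refills the  */
    }                                                     /* whole 128-byte line */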
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-17 13:22
    Thanks for the results! It is always interesting to see the performance of different approaches.

    Now that I am back from an EXTREMELY relaxing Alaska cruise, I am back on working on my Propeller products and software...
    David Betz wrote: »
    I finally got my direct-mapped cache experiment working. Unfortunately, it doesn't perform as well as Bill's standard VMCOG algorithm. I'll probably try a two-way cache next. Here are the results for a direct-mapped cache with 32 128-byte cache lines. Even though it isn't fast, this version of VMCOG has the feature that it can handle full 32-bit addresses. I guess that could be useful for people who have larger RAMs (mine is limited to 64k anyway).
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-09-18 03:49
    Oooh Bill, that cruise sounds nice :smilewinkgrin:
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-18 06:56
    Thanks... it was!

    That FlexMem driver bug does not stand a chance now...
    Cluso99 wrote: »
    Oooh Bill, that cruise sounds nice :smilewinkgrin:
  • David BetzDavid Betz Posts: 14,516
    edited 2010-09-30 05:35
    I'm trying to merge my C3 SPI SRAM code back into VMCOG after getting it to work in my direct-mapped cache code under ZOG, and am getting errors running the 'x' test. The C3 code is basically a modification of the MORPHEUS1 code. Can you tell me if you have verified that the MORPHEUS1 build works with VMCOG Sep06 v0.985?

    Thanks,
    David
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-30 08:01
    I'll check it out in a bit and let you know... thanks!
    David Betz wrote: »
    I'm trying to merge my C3 SPI SRAM code back into VMCOG after getting it to work in my direct-mapped cache code under ZOG, and am getting errors running the 'x' test. The C3 code is basically a modification of the MORPHEUS1 code. Can you tell me if you have verified that the MORPHEUS1 build works with VMCOG Sep06 v0.985?

    Thanks,
    David
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-12-21 16:03
    UPDATE:

    No new working code yet, but since David asked in another thread about how I plan to handle larger VM's with VMCOG, I thought I'd point at my current design, which is defined in message #423 on the previous page of this thread.

    My first cut at BIG VM only handles 256KB, as it uses a single 512-byte hub page to hold the "page present" values. The nice part of this approach is that it frees up 64 longs in VMCOG (well, at least until I add associativity) and gives me two bits to play with for associativity, or for indicating whether the page is from RAM or FLASH. This version will be VMCOG v2.0.
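
    As a sketch of the entry layout (the bit assignments for the two spare bits are placeholders, not final):

    #include <stdint.h>

    static uint8_t page_present2[512];     /* 512 entries x 512-byte pages = 256KB of VM  */

    #define PP_HUBPAGE(e)  ((e) & 0x3F)    /* nnnnnn: hub page number (x 512 for address) */
    #define PP_SPARE_A     0x40            /* placeholder: e.g. page is in FLASH, not RAM */
    #define PP_SPARE_B     0x80            /* placeholder: e.g. second associative way    */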

    Following v2.0, I am planning a VMCOG 3.0, which will be able to handle Jazzed's 32MB module. This will require the addition of associativity, and also a change to the command interface of VMCOG.

    VMCOG work will resume as soon as I've gotten all the boards I showed at UPEW into production.

    Since mpark added #include to Homespun, I am thinking of moving VMCOG over to it from BST - managing all the platform dependent code in VMCOG has become a nightmare without #include or equivalent.

    (Debugging prototype boards with bad vias, etc. is a PAIN)
  • David BetzDavid Betz Posts: 14,516
    edited 2010-12-21 16:50
    Thanks for pointing me to your original description of the large memory VMCOG. I notice that message is actually a reply to a message I posted. I guess I should have remembered that!
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-12-21 17:07
    You are welcome... and no worries....

    I can't remember how many things I've forgotten too! (pun intended)
  • David BetzDavid Betz Posts: 14,516
    edited 2011-02-02 04:43
    Has anyone ported VMCOG to Dr_Acula's DracBlade board? I'd like to use the 512k SRAM chip on the DracBlade for ZOG external memory. Is there a version of VMCOG that works with the DracBlade?
  • Dr_AculaDr_Acula Posts: 5,484
    edited 2011-02-02 05:02
    The answer is "not yet". See the dracblade thread for the ram driver code.
  • Bill HenningBill Henning Posts: 6,445
    edited 2011-02-02 07:25
    Hi David,

    As Dr_Acula said, no VMCOG driver for DracBlade YET - however I did order a DracBlade last week...

    I should be back to being active in the forums and working more on Prop stuff in about two weeks, I have been too busy for a while now.

    Bill
  • David BetzDavid Betz Posts: 14,516
    edited 2011-02-02 07:41
    Hi David,

    As Dr_Acula said, no VMCOG driver for DracBlade YET - however I did order a DracBlade last week...
    Take a look at the code that Dr_Acula posted in the Dracblade thread. It looks like it should be very easy to use. In fact, it looks like it's based on code for the TriBlade which I believe you already support.