VMCOG: Virtual Memory for ZiCog, Zog & more (VMCOG 0.976: PropCade,TriBlade_2,HYDRA HX512,XEDODRAM)

jazzed · 2010-08-18 11:22

I cut the SDRAM code directly into VMCOG, and it passes the heater test. I had to remove lots of optimizations, such as 32-byte bursts, to make it fit, so this version is much slower than using a separate COG: 4s vs. 3.5s on the benchmark.

I'll try integrated VMCOG SDRAM with ZOG after lunch. If I can make that work, I'll try to patch the SDRAM Cache code directly to ZOG later.

--Steve

Bill Henning · 2010-08-18 11:29

Sounds good!

Did you try it with a few different settings for number of working set pages?

Bill Henning · 2010-08-18 11:37

Just a quick update...

The latest PropCade version of VMCOG adds two new mailbox messages for reading and writing registers in the MCP23S17 that shares the SPI bus with the SPI RAMs.

The reason for this is that PropCade multiplexes eight SPI devices onto one SPI bus: six SPI memory chips, the uSD card, and an MCP23S17 used for two Sega joysticks or two eight-bit I/O ports.

With just a bit of Spin code, this lets me read the two Sega joysticks as if they were NES joysticks :-)

The current code supports:

- UP/DOWN/LEFT/RIGHT/START/A/B/C buttons on the Sega
- the "C" button is mapped to the NES "Select" button.

The Sega X/Y/Z buttons (on six-button joysticks) are currently not supported, as they require PASM code to generate the necessary timing on an output bit.

I plan to release a generic MCP23S17 object RSN.

jazzed · 2010-08-18 11:44

Bill Henning wrote: »

Sounds good!

Did you try it with a few different settings for number of working set pages?

VMDebug fails to start with more than 46 pages. The heater test passes with 1, 2, 10, 19, 32, 40 and 46. Using 1 is funny, but it should work regardless of how much we snicker.

Now if you could only support 32MB

Bill Henning · 2010-08-18 12:18

Thanks for the results!

46 pages = 23k, so it makes sense that VMDebug would fail - it would be getting clobbered
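As a rough check of that arithmetic, here is a small Python sketch. It assumes 512-byte pages (128 longs, matching the pagesiz value used in the VMCOG drivers in this thread) and the Propeller's 32 KB of hub RAM; the function names are made up for illustration.

```python
# Rough hub-RAM budget for the VMCOG working set.
# Assumption: one working-set page is 512 bytes (128 longs).
PAGE_BYTES = 512
HUB_BYTES = 32 * 1024  # Propeller 1 hub RAM

def working_set_bytes(pages):
    """Hub RAM consumed by a working set of `pages` pages."""
    return pages * PAGE_BYTES

def hub_bytes_left(pages):
    """Hub RAM left for Spin code, stacks and buffers (e.g. VMDebug)."""
    return HUB_BYTES - working_set_bytes(pages)

for pages in (1, 20, 46, 47):
    print(pages, working_set_bytes(pages), hub_bytes_left(pages))
```

With 46 pages the working set takes 23,552 bytes, which leaves 9,216 bytes of hub RAM for everything else.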

I can support 32MB, but it will cause a performance hit any way I do it.

The "easiest" way is a simple direct-mapped cache approach; however, this will lead to some thrashing.

Second easiest is a two-way associative scheme; I think there is room in VMCOG for that.

Third is a four-way associative scheme; however, that will require a minimum 2KB table in the hub.

I DO NOT want to get into multi-level page tables, I am certain the performance would be very poor.

An interesting option would be to support say a 2MB VM, but make 30MB available as a very high speed disk...
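To make the thrashing concern concrete, here is a toy Python comparison of direct-mapped and two-way set-associative placement. It is a generic illustration, not VMCOG's lookup code; the slot count, the access pattern and the function name are all assumptions chosen for the example.

```python
# Toy comparison of direct-mapped vs. two-way set-associative page placement.
# Generic illustration only; this is not how VMCOG is implemented.
from collections import OrderedDict

def count_misses(accesses, num_slots, ways):
    """Count page faults for `num_slots` page slots grouped into sets of `ways`."""
    num_sets = num_slots // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # each set keeps LRU order
    misses = 0
    for vpage in accesses:
        s = sets[vpage % num_sets]
        if vpage in s:
            s.move_to_end(vpage)       # mark as most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)  # evict the least recently used page
            s[vpage] = True
    return misses

# Two virtual pages that collide in a 32-slot direct-mapped layout (0 and 32),
# touched alternately, as code and data in a tight loop might be.
pattern = [0, 32] * 100
direct = count_misses(pattern, num_slots=32, ways=1)
two_way = count_misses(pattern, num_slots=32, ways=2)
print(direct, two_way)
```

In this pattern the direct-mapped layout faults on every access, while the two-way layout faults only on the first touch of each page.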

Heater. · 2010-08-18 13:29

Jazzed:

I'll try to patch the SDRAM Cache code directly to ZOG later.

Wow. How much space do you need? Operating with VMCog, there are only 10 LONGs left in Zog.

I'm sure 20 or more LONGs can be recovered by recycling some init code for variables.

If I had had in mind adding direct access to RAM into Zog, I would not have inlined it so much.

jazzed · 2010-08-18 14:35

Bill, I got fibo running on zog with vmcog/sdram up to fibo(20). I reversed the "shr_hits" 075->076 changes and the test gets to fibo(23). The LRU algorithm may still have some issues.

Heater, I think I can cut in the SDRAM code. I'll let you know.

Heater. · 2010-08-18 22:03

Jazzed. Ahh, that answers the question I just put on the Zog thread.

But if you are using vmcog v075 and 20 pages, surely you should get the same OK result as me, assuming SDRAM access is always working correctly?

Jazzed "I think I can cut in the SDRAM code."

Is it possible we should take a little speed hit and un-inline the memory accesses? E.g. opcode fetch would call read_byte instead of going to VMCOG directly.

This would isolate RAM access to a few functions, save some space and make adding direct hardware access much easier.

I was never inclined to add direct hardware access to Zog, but if you start down that road it will continue for TriBlade, RamBlade, DracBlade etc. That leads either to #ifdef soup or to multiple versions. Or can we set up a way to have include files for different hardware codes?
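As a rough picture of what un-inlining could look like, here is a Python sketch in which every memory access goes through a couple of small functions on a swappable backend. The class and method names are hypothetical and are not taken from Zog; only read_byte echoes the name mentioned above.

```python
# Sketch of routing every memory access through a few small functions.
# Class and method names are hypothetical, not taken from Zog.
class ListBackend:
    """Stand-in for a hardware driver (VMCOG mailbox, SDRAM, TriBlade...)."""
    def __init__(self, size):
        self.mem = bytearray(size)

    def read_byte(self, addr):
        return self.mem[addr]

    def write_byte(self, addr, value):
        self.mem[addr] = value & 0xFF

class Interpreter:
    def __init__(self, backend):
        self.backend = backend  # the only hardware-specific piece
        self.pc = 0

    def fetch_opcode(self):
        # Opcode fetch calls read_byte instead of touching the hardware directly.
        op = self.backend.read_byte(self.pc)
        self.pc += 1
        return op

ram = ListBackend(16)
ram.write_byte(0, 0x0B)  # arbitrary byte standing in for an opcode
vm = Interpreter(ram)
print(hex(vm.fetch_opcode()))
```

Supporting another board would then mean supplying another backend rather than editing the interpreter loop, at the cost of an extra call per access.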

Bill Henning · 2010-08-19 08:34

Thanks, good clue

Bill Henning · 2010-08-26 13:46

UPDATE

I almost have VMCOG running with two chips (on separate 4-wire SPI ports) on Morpheus CPU#1.

One bug left to squish then I will upload a new version.

After that:

FlexMem driver for VMCOG!

Bill Henning · 2010-08-26 16:33

VMCOG now runs on Morpheus CPU#1

(After I test the IR in/out on the rev2 pcb, I will add FlexMem support to VMCOG)

Bill Henning · 2010-08-26 16:39

New archive with Morpheus CPU#1 support uploaded into the first post!

David Betz · 2010-08-28 20:05

I'm trying to port the VMCOG MORPHEUS1 mode to my custom Hydra SPI SRAM card, which has two 23K256 chips on it with separate chip select pins but common SI, SO, and CLK pins. I've tested this board using a simple SPI driver written in Spin and it seems to work, but I have problems when I run it with VMCOG. The only changes I've made to the code from the vmdebug-bst-archive-100826-163205.zip file are to change the PLL and clock to match the Hydra:

  _clkmode          = xtal1 + pll8x
  _xinfreq          = 10_000_000

And to change the pin assignments for the SPI SRAM chips:

cs      long  1<<19
clk     long  1<<17
mosi    long  1<<16
miso    long  1<<18
cs_clk  long  (1<<19)|(1<<17)
clk_mosi long (1<<17)|(1<<16)

cs2     long  1<<20
clk2    long  1<<17
mosi2   long  1<<16
miso2   long  1<<18
cs2_clk2  long  (1<<20)|(1<<17)
clk2_mosi2 long (1<<17)|(1<<16)

My board uses the following pin assignments:

SI = P16
SCK = P17
SO = P18
CS = P19
CS2 = P20

Shouldn't that be all I have to do to get this to work? If I try using the 'f' command in vmdebug and then dump page 0 all I get is lots of $1818 words. Any idea what might be going wrong?

Thanks!
David

Bill Henning · 2010-08-29 11:24

Hi David,

That should work...

I will try to wire up chips with your pinout tomorrow. Unfortunately my uncle is in emergency, so I was tied up all day yesterday, and will still be busy today.

Regards,

Bill

David Betz · 2010-08-29 18:33

I found one problem. I hadn't updated the variable spidir to match my pins. In order to make it easier to change pin assignments I made the following changes to vmcog.spin. Unfortunately, setting the spidir variable didn't fix my problem. I still get all $1818 values when I try to fill memory using the 'f' command.

Changed in the CON section:

#ifdef MORPHEUS1

  CS_PIN        = 19
  CLK_PIN       = 17
  MOSI_PIN      = 16
  MISO_PIN      = 18

  CS2_PIN       = 20
  CLK2_PIN      = 17
  MOSI2_PIN     = 16
  MISO2_PIN     = 18
  
  READSTATUS    = 140 ' Read SPI RAM status register

  PIOREAD       = 141
  PIOWRITE      = 142

  '--------------------------------------------------------------------------------------------------
  ' PIO commands
  '--------------------------------------------------------------------------------------------------

  PIOREADK      = %0100_000_1_00000000
  PIOWRITEK     = %0100_000_0_00000000

#endif

Changed in the DAT section:

#ifdef MORPHEUS1
dv        long  0               ' device address, between 0 and 7, however 6&7 are not valid
bits      long  0

read      long  $03000000       ' read command
write     long  $02000000       ' write command
ramseq    long  $01400000       ' %00000001_01000000 << 16 ' set sequential mode
readstat  long  $05000000       ' read status

pagesiz   long 128              ' in longs

spidir    long  (1<<CS_PIN)|(1<<CLK_PIN)|(1<<MOSI_PIN)|(1<<CS2_PIN)|(1<<CLK2_PIN)|(1<<MOSI2_PIN)

pdata     long  0

offs_mask long $7FFF

bit16     long $8000

chip1   mov   tcs,cs
        mov   tclk,clk
        mov   tmosi,mosi
        mov   tmiso,miso
        mov   tcs_clk,cs_clk
        mov   tclk_mosi,clk_mosi
chip1_ret ret

chip2   mov   tcs,cs2
        mov   tclk,clk2
        mov   tmosi,mosi2
        mov   tmiso,miso2
        mov   tcs_clk,cs2_clk2
        mov   tclk_mosi,clk2_mosi2
chip2_ret ret

cs      long  1<<CS_PIN
clk     long  1<<CLK_PIN
mosi    long  1<<MOSI_PIN
miso    long  1<<MISO_PIN
cs_clk  long  (1<<CS_PIN)|(1<<CLK_PIN)
clk_mosi long (1<<CLK_PIN)|(1<<MOSI_PIN)

cs2     long  1<<CS2_PIN
clk2    long  1<<CLK2_PIN
mosi2   long  1<<MOSI2_PIN
miso2   long  1<<MISO2_PIN
cs2_clk2  long  (1<<CS2_PIN)|(1<<CLK2_PIN)
clk2_mosi2 long (1<<CLK2_PIN)|(1<<MOSI2_PIN)

tcs     long 0
tclk    long 0
tmosi   long 0
tmiso   long 0
tcs_clk long 0
tclk_mosi long 0

#endif
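As a sanity check on the masks above, this Python sketch builds them from the pin numbers and computes the value spidir should end up with. The pin numbers are the ones quoted for the Hydra card; the helper name is made up for illustration.

```python
# Build the SPI pin masks from pin numbers, mirroring the CON/DAT changes above.
# Pin numbers are the ones quoted for the Hydra card; the helper is illustrative.
PINS = {"CS": 19, "CLK": 17, "MOSI": 16, "MISO": 18,
        "CS2": 20, "CLK2": 17, "MOSI2": 16, "MISO2": 18}

def mask(*names):
    """OR together the single-bit masks for the named pins."""
    m = 0
    for n in names:
        m |= 1 << PINS[n]
    return m

# Output pins only: both chip selects, clock and MOSI. MISO stays an input.
spidir = mask("CS", "CLK", "MOSI", "CS2", "CLK2", "MOSI2")
print(hex(spidir))

# The shared MISO line must not be driven by the Propeller.
assert spidir & mask("MISO") == 0
```

Because both chips share CLK and MOSI here, only four distinct bits end up set: P16, P17, P19 and P20.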

David Betz · 2010-08-31 05:54

Okay, now I'm completely confused. I pulled the SPI SRAM code out of VMCOG and wrote a simple test program to see if my Hydra SPI SRAM card would work with it. To simplify my testing I changed the page size to 64 but otherwise I'm running the code from VMCOG unchanged and it seems to work just fine with my Hydra SPI SRAM card. I have no idea why it doesn't work with the vmdebug test program. I'll have to try to understand more of the VMCOG code to see if I can figure it out. In the meantime, I've attached my SPI SRAM test program.

Bill Henning · 2010-08-31 10:08

Ok, I am officially confused too!

Btw, I'd love it if you went through the VMCOG code - there may be a bug lurking, that you may find while understanding it, as Fibo under Zog crashes with some working set sizes. I can't seem to find it, even after looking hundreds of times.

Want to hear something else confusing? I've merged my preliminary (slow) FlexMem drivers into VMCOG, and:

- writes to status register don't work
- reads of status register work
- reads of memory (one long at a time) work
- writes to memory (one long at a time) don't work
- can't read/write pages at a time until writes to status register work as it needs setting sequential mode

(the 23K256 does not have a /WP pin, so that can't be it)
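For reference, the byte sequences involved can be sketched in Python as follows. The command values are the ones from the DAT section quoted earlier in the thread (read $03, write $02, status write $01 with mode $40, status read $05); the function names and the byte-oriented framing are illustrative assumptions, not the VMCOG driver itself.

```python
# SPI frames for a 23K256-style serial SRAM, using the command values
# from the DAT section quoted earlier (read $03, write $02, status $01/$05).
CMD_WRITE_STATUS = 0x01
CMD_WRITE = 0x02
CMD_READ = 0x03
CMD_READ_STATUS = 0x05
MODE_SEQUENTIAL = 0x40  # same value as the low byte of ramseq ($0140)

def set_sequential_frame():
    """Bytes clocked out to put the chip in sequential mode."""
    return bytes([CMD_WRITE_STATUS, MODE_SEQUENTIAL])

def read_frame(addr):
    """Command plus 16-bit address; data bytes are then clocked in."""
    return bytes([CMD_READ, (addr >> 8) & 0xFF, addr & 0xFF])

def write_frame(addr, data):
    """Command plus 16-bit address, followed by the data to store."""
    return bytes([CMD_WRITE, (addr >> 8) & 0xFF, addr & 0xFF]) + bytes(data)

print(set_sequential_frame().hex(), read_frame(0x1234).hex())
```

A page transfer relies on the first frame having taken effect, which is why page reads and writes are blocked until status-register writes work.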

And it is basically the same code that works for PropCade and Morpheus1!!!!

Even worse, a scope shows clean signals on all pins, and ViewPort shows correct waveforms in LSA mode!

The good news is that I've sent off 4 of the PCBs I showed at UPEW to production, so I can concentrate on VMCOG now for a few days.

David Betz · 2010-08-31 10:21

Is there a description of your new boards posted somewhere? I had thought about buying Morpheus but somehow I thought there was a new version coming out so I decided to wait. Are these new boards you're talking about new versions of Morpheus and Mem+?

Bill Henning · 2010-08-31 10:21

Hmmm... working fine outside of VMCOG implies that memory within VMCOG is getting corrupted.

One of the many self-modifying indirect stores may be going wild... best bet would be within the BUSERR handling, when it updates the TLB - if it somehow computed a bad cog address, that could easily clobber code within the cog, thus explaining the behavior you report, and the problem with ZOG!

David Betz · 2010-08-31 10:27

Bill Henning wrote: »

Btw, I'd love it if you went through the VMCOG code - there may be a bug lurking, that you may find while understanding it, as Fibo under Zog crashes with some working set sizes. I can't seem to find it, even after looking hundreds of times.

I will look it over tonight, but I'll warn you that I'm far from an expert Spin/PASM programmer, as you can probably tell from the code I wrote in my SPI SRAM test. Any good code in there was probably stolen from either you or Andre' LaMothe. :-)

Bill Henning · 2010-08-31 10:50

Yep, these are the new versions, and there are descriptions!

Morpheus (pcb rev 2) and Mem+ (pcb rev 2) are described towards the end of p.12 in the Morpheus thread:

http://forums.parallax.com/showthread.php?t=113929

I think you'd like the Morpheus Developer's Guide on my downloads page, as it explains the architecture. There is also a page on it on the site.

The Developer's Guide applies to rev.2 pcb's as well, but I will have to add a couple of pages for the new IR features.

PropCade is described in its own thread at:

http://forums.parallax.com/showthread.php?t=121315

The other board that went to production is 485Plug, described on p.13 of the Morpheus thread.

I have 12 other boards going into production over the next month or two, including the high-end Morpheus+ / Mem* combination, and the mysterious "PLC-G"... along with a ton of industrial I/O modules for my boards. If you read the Morpheus thread starting at p.12, you'll see I briefly described all the new boards there except for PLC-G.

Bill Henning · 2010-08-31 10:51

Every extra pair of eyeballs is MUCH appreciated - I figure I am too close to the code, and know too well how it "should" work, thus I might be missing something basic!

David Betz · 2010-09-01 09:01

I've been reading through the VMCOG code trying to understand it and I have a general question about the behavior of the Propeller hub access instructions. Do the RDBYTE/RDWORD/RDLONG and WRBYTE/WRWORD/WRLONG instructions ignore all but the low order 16 bits of their source operands? In other words is RDLONG foo,$1000 interpreted the same as RDLONG foo,$ffff1000?

Bill Henning · 2010-09-01 10:00

You got it - the upper 16 bits are totally ignored

RDLONG also ignores the two lowest bits

RDWORD ignores the lowest bit
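Put another way, the address masking just described can be modelled in a few lines of Python. This is only a model of the Propeller 1 behaviour stated above; the function name is made up.

```python
# Model of how Propeller 1 hub instructions treat the address operand,
# as described just above. The function name is illustrative.
def effective_hub_address(addr, size):
    """Return the hub address actually used for a `size`-byte access (1, 2 or 4)."""
    addr &= 0xFFFF        # upper 16 bits are ignored
    addr &= ~(size - 1)   # RDWORD drops bit 0, RDLONG drops bits 1..0
    return addr

print(hex(effective_hub_address(0xFFFF1000, 4)))
print(hex(effective_hub_address(0x1003, 4)))
print(hex(effective_hub_address(0x1003, 2)))
```

So RDLONG foo,$1000 and RDLONG foo,$ffff1000 touch the same long, and a misaligned long address is rounded down to the long that contains it.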

David Betz · 2010-09-01 16:07

Okay, I'm going to try my hand at offering a suggestion. I think the following code:

shr_hits      ' walk through TLB, divide all non-zero hit counts by two
        movs  jx,#0             ' finding candidate page to sacrifice

forj

jx      mov   tlbi,0-0 wz
 if_z   jmp   #nextj
        movd  updtc,jx
        mov   temp,tlbi
        andn  temp,elevenbits
        shr   tlbi,#1
        andn  tlbi,elevenbits
        or    tlbi,temp

updtc   mov   0-0,tlbi

        ' next ix
nextj   add   jx, #1
        and   jx, #128 nr, wz
 if_z   jmp   #forj
shr_hits_ret  ret

elevenbits long $07FF

Could be changed to this:

shr_hits      ' walk through TLB, divide all non-zero hit counts by two
        movs  jx,#0             ' finding candidate page to sacrifice

forj

jx      mov   tlbi,0-0 wz
 if_z   jmp   #nextj
        movd  updtc,jx
        mov   temp,tlbi
' don't need to mask out the high bits here because we do it below
        shr   tlbi,#1
        andn  tlbi,elevenbits
' need to mask out the current count before combining with the updated count
        and   tlbi,elevenbits
        or    tlbi,temp

updtc   mov   0-0,tlbi

        ' next ix
nextj   add   jx, #1
        and   jx, #128 nr, wz
 if_z   jmp   #forj
shr_hits_ret  ret

elevenbits long $07FF

There is no need to mask the count twice, once before and once after the right shift. On the other hand, we do need to mask out the current count before ORing with the new count. Otherwise we get the combination of both sets of bits.

Also, this subroutine is entered when a count overflows. That means that the entry pointed to by vmpage has a zero count. Nothing is done in this code to adjust that. I'm not sure if a zero count will cause any problems but it will make the most frequently accessed page appear to be the least frequently accessed page. Another possible problem is that the entire entry could be zero if it happens to be pointing to the first page in hub RAM and the DIRTY and LOCK bits are also clear. This would make it look like the page wasn't in the cache.

Bill Henning · 2010-09-01 16:29

Thanks David!

Actually, you found a bug - there is a need to mask twice, but the first one should have been "and" not andn!

The "and" was to preserve the hub page allocated to that VM page, and it was being lost. This is a serious bug that I would have noticed had I not had my nose buried in the code too long... The extra 'n' (i.e. andn instead of and) was the culprit; not fixing the hit count would just have caused a performance hit.

I think there is an excellent chance that this is the cause of the fibo() problems, as it would clear the pointer to the page in the working set when the hit count overflowed - thus clobbering the lowest 512 bytes in memory!

As you noticed, I forgot to put in a fixed count for the count that wrapped; I added code to do that as well.

The routine below should work now, and I would not be at all surprised if it fixes the fibo() problem.

Frankly, I would not be surprised if this measurably improves performance.

THANK YOU!

'----------------------------------------------------------------------------------------------------
'
' SHR_HITS - divide all valid hit counts by two, called when a hit count would wrap around
'
' NOTE: shr_hits is not debugged yet!
'
'----------------------------------------------------------------------------------------------------


shr_hits      ' walk through TLB, divide all non-zero hit counts by two
        movs  jx,#0             ' finding candidate page to sacrifice
        movd  fixup,vmpage
forj

jx      mov   tlbi,0-0 wz
 if_z   jmp   #nextj
        movd  updtc,jx
        mov   temp,tlbi
        and   temp,elevenbits
        shr   tlbi,#1
        andn  tlbi,elevenbits
        or    tlbi,temp

updtc   mov   0-0,tlbi

        ' next ix
nextj   add   jx, #1
        and   jx, #128 nr, wz
 if_z   jmp   #forj

' fix overflow count, give it half-count

fixup   or    0-0,halfcount

shr_hits_ret  ret

elevenbits long $07FF
halfcount  long $80000000
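For anyone following the bit manipulation, here is a small Python model of the per-entry arithmetic before and after the fix. It assumes each TLB entry keeps the hub page pointer and the DIRTY/LOCK flags in the low 11 bits (the elevenbits mask) and the hit count in the bits above; that layout is inferred from the discussion here, not taken from the VMCOG source, and the function names are made up.

```python
# Python model of one shr_hits step on a single 32-bit TLB entry.
# Assumed layout: low 11 bits = hub page pointer and flags, upper bits = hit count.
MASK32 = 0xFFFFFFFF
ELEVENBITS = 0x07FF
HALFCOUNT = 0x80000000

def shr_buggy(tlbi):
    """Original code: 'andn temp' keeps the count and drops the low bits."""
    temp = tlbi & ~ELEVENBITS & MASK32
    tlbi = (tlbi >> 1) & ~ELEVENBITS & MASK32
    return tlbi | temp

def shr_fixed(tlbi):
    """Fixed code: 'and temp' keeps the low bits while the count is halved."""
    temp = tlbi & ELEVENBITS
    tlbi = (tlbi >> 1) & ~ELEVENBITS & MASK32
    return tlbi | temp

def fix_wrapped(tlbi):
    """The 'fixup' step: give the entry whose count wrapped a half-scale count."""
    return tlbi | HALFCOUNT

entry = (40 << 11) | 0x123  # hit count 40, low bits $123 (arbitrary test value)
print(hex(shr_buggy(entry)), hex(shr_fixed(entry)))
```

In this model the buggy version returns an entry whose low 11 bits are all zero and whose count is the OR of the old and halved counts, while the fixed version keeps the low bits and halves the count from 40 to 20.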

Bill Henning · 2010-09-01 16:36

Here is an experimental 0.981 release of VMCOG, with fixes for the bug David just found!

I think this may very well fix the fibo under ZOG under VMCOG issue, and run a bit faster to boot.

David Betz · 2010-09-01 18:51

Bill Henning wrote: »

Actually, you found a bug - there is a need to mask twice, but the first one should have been "and" not andn!

Sorry, I guess there was a bug in my bug fix! I failed to notice that you were shifting the original value not the one you had just ANDNed. I'm glad you caught my error before releasing your fix.

Bill Henning · 2010-09-01 19:41

Thank you for trying to optimize it - going over your suggested change is what made me notice the bug!

Heater. · 2010-09-01 22:23

ZOG no like.

This is worse. Depending on the number of pages I have (I tried 8, 10, 20) it either hangs up around fibo(21) or continues a few more fibos with wrong results and then hangs up.

The heater test in my old vmdebug works OK though.

I cannot compile the new vmdebug, BST is complaining about not finding hex method in FullDuplexSerialPlus. No idea why.

Bill, can you take the TRIBLADE_2 sections from the attached VMCog? It is your last 0.981 version + TRIBLADE_2.
