VMCOG: Virtual Memory for ZiCog, Zog & more (VMCOG 0.976: PropCade,TriBlade_2,HYDRA HX512,XEDODRAM) - Page 7 — Parallax Forums


Comments

  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-07 22:45
    The more supported interfaces, the better :)

    That would work well as long as there was a page miss every 15ms or less.

    It has also been my experience that data sheet minimum refresh intervals are extremely conservative; I've heard reports of DRAM keeping its contents for a second or two!
    AntoineDoinel said...
    @Bill, Jazzed
    I've done a similar experiment with a Fast Page Mode SIMM: http://forums.parallax.com/showthread.php?p=869842

    I went up to promising memory tests and benchmarks, but no further (thanks to VMCog maybe that code could be recovered now...)

    For the refresh, I simply added a CBR at each command wait cycle, ensuring that at least one refresh cycle was performed even in the event of continuous page read requests.

    It only adds a slight latency to the page transfer, and in this case we're only talking of cache misses, not regular memory accesses, so I'd go with the simple and easy one.
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com
    My products: Morpheus / Mem+ / PropCade / FlexMem / VMCOG / Propteus / Proteus / SerPlug
    and 6.250MHz Crystals to run Propellers at 100MHz & 5.0" OEM TFT VGA LCD modules
    Las - Large model assembler Largos - upcoming nano operating system
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-07 22:46
    Thanks! I will fold it in tonight.
    heater said...
    Here is vmcog plus triblade v0.7

    Further optimization of the BREAD/BWRITE loops. Now about as tight as they will ever be.
  • jazzedjazzed Posts: 11,803
    edited 2010-06-07 23:28
    AntoineDoinel said...
    @Bill, Jazzed

    I've done a similar experiment with a Fast Page Mode SIMM: http://forums.parallax.com/showthread.php?p=869842


    :). I have both Samsung and Toshiba 30-pin SIMMs. My Propeller pin-out is probably different. Too bad the SIMMs are obsolete.

    Bill Henning said...
    How long is the longest JNI call, in ms?

    I would be perfectly willing to add a "VMREFRESH" command, that Spin could send to the mailbox to force a refresh...

    also, at the cost of a 200ns performance hit, it would be possible to change the spinner to incorporate an "is it time to refresh yet? / call refresh" two-instruction sequence.

    I am sure a viable solution exists :)

    Funny, I have five of those 32Mx8 TSOP-II parts in my "in box"... strange coincidence, don't you think? They arrived from Digikey a few months ago...

    Bill, the longest JNI for delay is up to the user ... could be 10's of seconds.

    I've looked at those SDRAM over and over; my conclusion is a 2 Propeller solution minimum. We can talk more about that later.

    You have the tightest possible spinner now. A 400ns spinner is possible with a djnz ... considering how long it takes to start a DRAM access, it probably doesn't matter.

    Honestly though I care less about performance and more about getting something that works reliably at any speed right now so I can test the JVM port.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Propeller Pages: Propeller JVM
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-07 23:36
    Ah! if you don't care about performance...

    replace
    waitcmd wrlong zero,pvmcmd
    wl      rdlong vminst, pvmcmd wz ' top 23 bits = address, bottom 9 = command
            mov    vmaddr, vminst
     if_z   jmp    #wl

    with

    waitcmd wrlong zero,pvmcmd
    goon    mov    refcount,#100     ' tune count to taste; each iteration takes 200ns normally when spinning
    wl      rdlong vminst, pvmcmd wz ' top 23 bits = address, bottom 9 = command
            mov    vmaddr, vminst
     if_z   djnz   refcount,#wl      ' take care of the JNI problem... during normal run, costs only 200ns extra per read
     if_z   jmp    #refresh          ' refresh jumps to goon when complete

    And JNI problem is SOLVED!
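    As a sanity check on the numbers, the countdown spinner can be modelled in a few lines. This is a toy sketch: the ~200ns-per-iteration figure and the refcount of 100 are taken from the code comments above, not measured.

    ```python
    # Toy model of the countdown spinner. Figures are assumptions from the
    # post's comments: ~200 ns per idle loop pass, refcount reloaded with
    # 100 ("tune count to taste").
    SPIN_ITER_NS = 200
    REFCOUNT = 100

    def worst_idle_gap_us():
        """Longest stretch of idle spinning before the loop falls
        through to the refresh routine."""
        return REFCOUNT * SPIN_ITER_NS / 1000.0

    # Even if a client blocks in a JNI call for tens of seconds, the cog
    # still reaches #refresh about every 20 us of idle time -- comfortably
    # inside any DRAM refresh window.
    ```
    
    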
    jazzed said...

    Bill, the longest JNI for delay is up to the user ... could be 10's of seconds.

    I've looked at those SDRAM over and over; my conclusion is a 2 Propeller solution minimum. We can talk more about that later.

    You have the tightest possible spinner now. A 400ns spinner is possible with a djnz ... considering how long it takes to start a DRAM access, it probably doesn't matter.

    Honestly though I care less about performance and more about getting something that works reliably at any speed right now so I can test the JVM port.
  • AntoineDoinelAntoineDoinel Posts: 312
    edited 2010-06-08 00:00
    Bill Henning said...
    That would work well as long as there was a page miss every 15ms or less.
    Bill, perhaps I expressed myself badly, but it's actually the opposite: the fewer page faults, the better the refresh:

    loop:
           do_single_cbr_refresh_cycle()
           if !command
               jump #loop
           process(command)
           jump #loop
    

    it's the first thing you suggested, only a single row per cycle.

    even interleaved with page reads this works, at least with a 256 byte page
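    A bit of arithmetic shows why the single-CBR-per-pass loop above holds up. A sketch: the 1024-row/16ms device figures and the ~1µs per pass are assumptions about typical FPM DRAM, not taken from this specific hardware.

    ```python
    # Refresh-coverage check for the loop above. Assumed device: 1024 rows,
    # each needing a CBR refresh within a 16 ms window (typical for FPM
    # DRAM); one loop pass (CBR cycle plus command poll) taken as ~1 us.
    ROWS = 1024
    WINDOW_MS = 16.0
    PASS_US = 1.0

    def full_sweep_ms():
        """Time for the idle loop to refresh every row once."""
        return ROWS * PASS_US / 1000.0

    # ~1 ms to sweep all rows against a 16 ms budget: page transfers can
    # steal the large majority of passes and refresh still keeps up.
    ```
    
    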
  • jazzedjazzed Posts: 11,803
    edited 2010-06-08 00:01
    I almost posted something just like that this morning in an #ifdef XEDODRAM wrapper :)
    However, since the entire driver is packaged in another cog at the moment, I think I'm set for a while.
    Just got to finish debugging some things :)

    Cheers.
    --Steve
    Bill Henning said...
    Ah! if you don't care about performance...

    replace
    waitcmd wrlong zero,pvmcmd
    wl      rdlong vminst, pvmcmd wz ' top 23 bits = address, bottom 9 = command
            mov    vmaddr, vminst
     if_z   jmp    #wl

    with

    waitcmd wrlong zero,pvmcmd
    goon    mov    refcount,#100     ' tune count to taste; each iteration takes 200ns normally when spinning
    wl      rdlong vminst, pvmcmd wz ' top 23 bits = address, bottom 9 = command
            mov    vmaddr, vminst
     if_z   djnz   refcount,#wl      ' take care of the JNI problem... during normal run, costs only 200ns extra per read
     if_z   jmp    #refresh          ' refresh jumps to goon when complete

    And JNI problem is SOLVED!
    jazzed said...

    Bill, the longest JNI for delay is up to the user ... could be 10's of seconds.

    I've looked at those SDRAM over and over; my conclusion is a 2 Propeller solution minimum. We can talk more about that later.

    You have the tightest possible spinner now. A 400ns spinner is possible with a djnz ... considering how long it takes to start a DRAM access, it probably doesn't matter.

    Honestly though I care less about performance and more about getting something that works reliably at any speed right now so I can test the JVM port.
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-08 01:44
    Thanks, I did mis-understand!
    AntoineDoinel said...
    Bill Henning said...

    That would work well as long as there was a page miss every 15ms or less.

    Bill, perhaps I expressed myself badly, but it's actually the opposite: the fewer page faults, the better the refresh:


    loop:
           do_single_cbr_refresh_cycle()
           if !command
               jump #loop
           process(command)
           jump #loop

    it's the first thing you suggested, only a single row per cycle.

    even interleaved with page reads this works, at least with a 256 byte page

  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-08 01:48
    I can't wait to see what you are up to :)

    JVM with 64KB of code space perhaps?

    As soon as this version is "proved out" with ZOG/JVM/ZiCog (at least one or two of them) I will make a version that removes the 64KB VM limit.

    It will be slightly slower (about 200ns per command) but able to support a large virtual space - a 512KB VM address space would only need a 1KB table in the hub (and 64 longs for access count and DIRTY/LOCKED/READONLY flags in the VMCOG).
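    The table sizing works out as a one-liner. This sketch assumes one byte of hub table per virtual page and 512-byte pages, both inferred from the figures quoted in this thread rather than from the VMCOG source:

    ```python
    # Hub TLB sizing rule inferred from the numbers in this thread:
    # one byte of table per virtual page.
    def hub_table_bytes(vm_size, page_size=512):
        return vm_size // page_size   # one entry (one byte) per page

    # 512KB VM -> 1KB table, as quoted above; the 1MB -> 2KB figure in a
    # later post follows from the same rule.
    ```
    
    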
    jazzed said...
    I almost posted something just like that this morning in an #ifdef XEDODRAM wrapper :)
    However, since the entire driver is packaged in another cog at the moment, I think I'm set for a while.
    Just got to finish debugging some things :)

    Cheers.
    --Steve
  • jazzedjazzed Posts: 11,803
    edited 2010-06-08 02:20
    Bill Henning said...
    I can't wait to see what you are up to :)

    JVM with 64KB of code space perhaps?

    I would prefer 1MB for JVM, but I'll take what I can get. It's likely the linker tool would only support 64K until I fix it.
    Some day I'll have a file system that supports long file names and will make a JAVA2 compliant JVM that uses Swing, etc....
    Of course I need a good touch screen LCD with graphics memory to do Swing.

    With the read-only option for embedded flash, I could have a 32MB code segment and 1MB data space using two VMCOGs.

    Not trying to set expectations that I can actually do it all quickly, but those are examples of what I'm shooting for.


    I'm attaching an archive for review. Here are the diffs without XEDODRAM_1M.spin
    I removed the non-ASCII chars from the MIT license so diff/patch can be used.

    Unfortunately I have a problem with heater's march-c test ... I'll fix it later.

    Cheers,
    --Steve

    diff -c vmcog_0607.spin vmcog.spin
    *** vmcog_0607.spin     2010-06-07 18:18:29.054000000 -0700
    --- vmcog.spin  2010-06-07 19:02:38.322000000 -0700
    ***************
    *** 1,6 ****
      '----------------------------------------------------------------------------------------------------
      '
    ! ' VMCOG v0.941 - virtual memory server for the Propeller
      '
      ' Copyright February 3, 2010 by William Henning
      '
    --- 1,6 ----
      '----------------------------------------------------------------------------------------------------
      '
    ! ' VMCOG v0.961 - virtual memory server for the Propeller
      '
      ' Copyright February 3, 2010 by William Henning
      '
    ***************
    *** 86,92 ****
    
      #define EXTRAM
      #ifdef EXTRAM
    ! #define PROPCADE
      '#define FLEXMEM
      '#define MORPHEUS_SPI
      '#define MORPHEUS_XMM
    --- 86,92 ----
    
      #define EXTRAM
      #ifdef EXTRAM
    ! '#define PROPCADE
      '#define FLEXMEM
      '#define MORPHEUS_SPI
      '#define MORPHEUS_XMM
    ***************
    *** 96,102 ****
      '#define TRIBLADE
      '#define TRIBLADE_2
      '#define RAMBLADE
    ! '#define XEDODRAM
      #else
      #warn External Ram access disabled - only use memory up to WS_SIZE
      #endif
    --- 96,102 ----
      '#define TRIBLADE
      '#define TRIBLADE_2
      '#define RAMBLADE
    ! #define XEDODRAM
      #else
      #warn External Ram access disabled - only use memory up to WS_SIZE
      #endif
    ***************
    *** 175,186 ****
    --- 175,208 ----
      long[dataptr]  := nump        ' single byte read/writes are the default
      word[cmdptr]   := 0
    
    + #ifdef XEDODRAM
    +   longfill(@xmailbox,0,2)       ' ensure command starts as 0
    +   long[mailbox+12]:= @xmailbox  ' 3rd vmcog mailbox word used for xm interface
    +   xm.start(@xmailbox)           ' start up inter-cog xmem
    + #endif
    +
        cognew(@vmcog,mailbox)
    
        fcmdptr  := @fakebox
        fdataptr := fcmdptr+4
      repeat while long[cmdptr]     ' should fix startup bug heater found - it was the delay to load/init the cog
    
    + #ifdef XEDODRAM
    + VAR
    + ' xmailbox is a command and data word
    + ' command is %CCCC_LLLL_LLLL_AAAA_AAAA_AAAA_AAAA_AAAA
    + ' C is Command bits up to 16 commands
    + ' L is Length bits up to 256 bytes
    + ' A is Address bits up to 1MB address range
    + ' data is interpreter based on command word context
    + ' data is a pointer in case of buffer read/write
    + ' data is a long/word/byte in other command cases
    + '
    +   long xmailbox
    + OBJ
    +   xm : "XEDODRAM_1MB"
    + #endif
    +
      PUB rdvbyte(adr)
      repeat while long[cmdptr]
      long[cmdptr] := (adr<<9)|READVMB
    ***************
    *** 364,373 ****
              'End of TriBlade code
      #endif
      #ifdef XEDODRAM
    !         mov   XDRAM_cmd,par
    !         add   XDRAM_cmd,#16       ' par+16 for cmd ... vmcog mailbox is 4 longs
    !         mov   XDRAM_dat,XDRAM_cmd
    !         add   XDRAM_dat,#4        ' par+20 for dat
      #endif
      '----------------------------------------------------------------------------------------------------
      ' END OF BINIT SECTION
    --- 386,396 ----
              'End of TriBlade code
      #endif
      #ifdef XEDODRAM
    !         mov     XDRAM_cmd,par
    !         add     XDRAM_cmd,#12        ' par+12 holds the xmailbox pointer
    !         rdlong  XDRAM_cmd,XDRAM_cmd
    !         mov     XDRAM_dat,XDRAM_cmd
    !         add     XDRAM_dat,#4        ' par+n+4 for dat
      #endif
      '----------------------------------------------------------------------------------------------------
      ' END OF BINIT SECTION
    ***************
    *** 998,1007 ****
              wrlong ptr,XDRAM_dat     ' send new hub pointer
              wrlong xcmd,XDRAM_cmd    ' send command
              rdlong temp,XDRAM_cmd wz
    !  if_nz  jmp    #$-1              ' wait for command complete
              add    ptr,#256          ' incr pointers
              add    xcmd,#256
    !         djnz   count,#:next      ' do next 128
      #endif
      ' endif XEDORAM
      #endif
    --- 1021,1030 ----
              wrlong ptr,XDRAM_dat     ' send new hub pointer
              wrlong xcmd,XDRAM_cmd    ' send command
              rdlong temp,XDRAM_cmd wz
    ! if_nz   jmp    #$-1              ' wait for command complete
              add    ptr,#256          ' incr pointers
              add    xcmd,#256
    !         djnz   count,#:next      ' do next 256
      #endif
      ' endif XEDORAM
      #endif
    ***************
    *** 1084,1093 ****
              wrlong ptr,XDRAM_dat     ' send new hub pointer
              wrlong xcmd,XDRAM_cmd    ' send command
              rdlong temp,XDRAM_cmd wz
    !  if_nz  jmp    #$-1              ' wait for command complete
              add    ptr,#256          ' incr pointers
              add    xcmd,#256
    !         djnz   count,#:next      ' do next 128
      #endif
      ' endif XEDODRAM
      #endif
    --- 1107,1116 ----
              wrlong ptr,XDRAM_dat     ' send new hub pointer
              wrlong xcmd,XDRAM_cmd    ' send command
              rdlong temp,XDRAM_cmd wz
    ! if_nz   jmp    #$-1              ' wait for command complete
              add    ptr,#256          ' incr pointers
              add    xcmd,#256
    !         djnz   count,#:next      ' do next 256
      #endif
      ' endif XEDODRAM
      #endif
    ***************
    *** 1201,1209 ****
      xcmd       long 0
      XDRAM_cmd  long 0
      XDRAM_dat  long 0
    ! XDRAM_rbuf long $7000_0000      ' always read 256 at a time ... loops 4 times
    ! XDRAM_wbuf long $8000_0000      ' always write 256 at a time ... loops 4 times
    ! XDRAM_amsk long $0003_ffff      ' address mask ... just allow 256K for now ....
      #endif
      ' endif XEDODRAM
      #endif
    --- 1224,1232 ----
      xcmd       long 0
      XDRAM_cmd  long 0
      XDRAM_dat  long 0
    ! XDRAM_rbuf long $7ff0_0000      ' always read  256 at a time ... loops 2 times
    ! XDRAM_wbuf long $8ff0_0000      ' always write 256 at a time ... loops 2 times
    ! XDRAM_amsk long $000f_ffff      ' address mask up to 1MB
      #endif
      ' endif XEDODRAM
      #endif
    
    

  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-08 03:45
    1MB VM address space is not a problem, but it will cost 2KB of hub ram. I plan that the VM size will be a CONstant, making the hub lookup table size self-adjusting.

    re/ long file name fs... I am planning (when I have time, LOL) to port my flashfs to SDHC cards. It is far superior to FAT: it supports long file names, arbitrarily deep nested directories, unlimited files per directory, unix-style attributes, and huge files.

    Nix on the 32MB code - that would need 64KB for the hub tlb with 512 byte pages. It would take 16KB with 2KB code pages.

    Besides, I shudder to think of 32MB of Java byte codes!

    FYI, on a Morpheus with a Mem+ you could have 1MB code, 1MB data, 512KB frame buffer - all in fast SRAM, all on CPU#2

    CPU#1 can take a FlexMem (or three) as well...

    I'll merge the driver tomorrow. I am dealing with a realtor tonight.
    jazzed said...

    I would prefer 1MB for JVM, but I'll take what I can get. It's likely the linker tool would only support 64K until I fix it.
    Some day I'll have a file system that supports long file names and will make a JAVA2 compliant JVM that uses Swing, etc....
    Of course I need a good touch screen LCD with graphics memory to do Swing.

    With the read-only option for embedded flash, I could have a 32MB code segment and 1MB data space using two VMCOGs.

    Not trying to set expectations that I can actually do it all quickly, but those are examples of what I'm shooting for.


    I'm attaching an archive for review. Here are the diffs without XEDODRAM_1M.spin
    I removed the non-ASCII chars from the MIT license so diff/patch can be used.

    Unfortunately I have a problem with heater's march-c test ... I'll fix it later.

    Cheers,
    --Steve

  • heaterheater Posts: 3,370
    edited 2010-06-08 08:33
    Yep, the first-byte-in-page bug looks to be fixed in my testing.
    The heater test now fails correctly when an error is forced.

    My last TriBlade optimizations were missing from Bill's last release, so here is his last release plus my latest TriBlade mods.

    Now I can start to add this to Zog.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • heaterheater Posts: 3,370
    edited 2010-06-08 09:12
    Something is not right. I did not get very far with my zogging as I could not load an executable image to the VMem.

    On going back to vmdebug I ended up changing the heater3 test from a zero fill at the start to filling with the low byte of address, like so:
      ser.str(string("Zero fill...",13))
      repeat ad from 0 to vm#MEMSIZE - 1 step 1
        vm.wrvbyte(ad, ad)
    
    


    heater3 then fails like so:
    Testing $00010000 bytes
    Zero fill...
    Checking 00 writing FF ...
    Error at 1A01 expected 00 got 01
    
    



    It should fail at 0001. This reproduces the problem I had in the zog code.

    Reading back the memory I see a lot of zeros at the beginning.

    Edit: The fill2 test ("f") does not work either.


    Post Edited (heater) : 6/8/2010 10:05:33 AM GMT
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-08 13:23
    Thanks for merging to the last file! I was busy with other matters last night.

    I cannot reproduce your faulty result - using the zip from your previous message, with your change below, I get:

    Testing $00010000 bytes
    Zero fill...
    Checking 00 writing FF ...
    Error at 0001 expected 00 got 01   
    
    



    Which I believe is the correct result.

    fill2 also works with #define PROPCADE
    heater said...
    Something is not right. I did not get very far with my zogging as I could not load an executable image to the VMem.

    On going back to vmdebug I ended up changing the heater3 test from a zero fill at the start to filling with the low byte of address, like so:
      ser.str(string("Zero fill...",13))
      repeat ad from 0 to vm#MEMSIZE - 1 step 1
        vm.wrvbyte(ad, ad)
    
    


    heater3 then fails like so:
    Testing $00010000 bytes
    Zero fill...
    Checking 00 writing FF ...
    Error at 1A01 expected 00 got 01
    
    



    It should fail at 0001. This reproduces the problem I had in the zog code.

    Reading back the memory I see a lot of zeros at the beginning.

    Edit: The fill2 test ("f") does not work either.

    Post Edited (Bill Henning) : 6/8/2010 1:31:32 PM GMT
  • heaterheater Posts: 3,370
    edited 2010-06-08 14:07
    Bill. Big apologies from me. Those "optimized" versions were actually "brokenized" somewhere along the line.

    Attached is your latest version again but with all my TriBlade optimizations backed out.

    This now works on TriBlade (Honest)

    I have extended the heater3 test with an incrementing counter fill test, a decrementing counter fill test, and a pseudo random number fill test.

    READY>
    
    Testing $00010000 bytes
    Up count fill...
    Up count check...
    Down count fill...
    Down count check...
    Random fill...
    Random check...
    Zero fill...
    Checking 00 writing FF...
    Checking FF writing AA...
    Checking AA writing 55...
    Checking 55 writing 00...
    OK
    
    



    I did learn many years ago that memory testing is not as simple as it may seem at first. And then forgot!
    Looks like I was putting too much faith in that little heater test as I blundered on, optimizing and testing, optimizing and testing...

    I will forgo any optimization attempts for a while and get Zog working with VM whilst this is solid.
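    The extended test sequence above reads, in outline, like a classic fill-and-march pattern. Here is a minimal Python model of the idea — not heater's Spin code, and the `stuck_at` fault-injection hook is invented purely to show how a forced error is caught:

    ```python
    import random

    def run_memory_tests(mem_size=256, stuck_at=None):
        """Return the first failing address, or None if all tests pass.
        stuck_at=(addr, value) models a faulty cell for demonstration."""
        mem = bytearray(mem_size)

        def fill(gen):
            for ad in range(mem_size):
                mem[ad] = gen(ad) & 0xFF
            if stuck_at:                      # simulated hardware fault
                mem[stuck_at[0]] = stuck_at[1]

        def check(gen):
            for ad in range(mem_size):
                if mem[ad] != gen(ad) & 0xFF:
                    return ad                 # first failing address
            return None

        rnd = random.Random(42)               # fixed seed: fill == check
        rand_fill = [rnd.randrange(256) for _ in range(mem_size)]

        # up-count, down-count and pseudo-random fills, each checked
        for gen in (lambda a: a,
                    lambda a: mem_size - 1 - a,
                    lambda a: rand_fill[a]):
            fill(gen)
            bad = check(gen)
            if bad is not None:
                return bad

        # zero fill, then march 00 -> FF -> AA -> 55 -> 00:
        # verify the previous pattern while writing the next one
        fill(lambda a: 0)
        prev = 0
        for patt in (0xFF, 0xAA, 0x55, 0x00):
            for ad in range(mem_size):
                if mem[ad] != prev:
                    return ad
                mem[ad] = patt
            prev = patt
        return None
    ```
    
    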

  • John AbshierJohn Abshier Posts: 1,116
    edited 2010-06-08 14:59
    I did the following test, inspired by a robot or game character reading a map. For hub RAM the map was 31x31. For SPI RAM the map was 90x90, or 32,400 bytes. I only had one SPI RAM chip installed on the robot.

    repeat 1000
      dx := -1, 0, or 1   dy := -1, 0, or 1
      x += dx   y += dy   (stay inside map)
      read the 25 longs centered around x and y
    Timing results per read and assignment (in microseconds):

    HUB - 30
    SPI with sram_contr20 - 402
    VMcog with 2kB hub ram - 262
    VMcog with 4kB hub ram - 186
    VMcog with 8kB hub ram - 120

    edit: clock is 80,000,000

    John Abshier


    Post Edited (John Abshier) : 6/8/2010 3:34:44 PM GMT
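    For reference, the access pattern John describes can be sketched as follows. This is a guess at the shape of his code: the 5x5 window of longs and the clamping used to "stay inside map" are assumptions.

    ```python
    import random

    MAP_W = MAP_H = 90   # SPI RAM map; 90*90 longs = 32,400 bytes
    LONG = 4

    def walk(steps=1000, seed=7):
        rnd = random.Random(seed)
        x = y = MAP_W // 2
        reads = 0
        for _ in range(steps):
            # random step, clamped so the 5x5 window stays on the map
            x = min(MAP_W - 3, max(2, x + rnd.choice((-1, 0, 1))))
            y = min(MAP_H - 3, max(2, y + rnd.choice((-1, 0, 1))))
            for dy in range(-2, 3):          # read the 25 longs around (x, y)
                for dx in range(-2, 3):
                    _addr = ((y + dy) * MAP_W + (x + dx)) * LONG
                    reads += 1
        return reads

    # Successive windows overlap by at least 20 of 25 longs -- exactly the
    # locality a page cache like VMCOG's can exploit.
    ```
    
    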
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-08 16:19
    Great!

    Sorry, I missed this message earlier (before coffee)

    I can't wait to try 64KB (and later larger) Zog!

    I think I will hold off on improvements to VMCOG itself (other than integrating more hardware drivers) until we have some software running under ZOG - at the very least, Fibo and Dhrystone, which can use 64KB or more. Why?

    I can then use that software to test the "big" VM version!
    heater said...
    Yep, the first-byte-in-page bug looks to be fixed in my testing.
    The heater test now fails correctly when an error is forced.

    My last Triblade optimizations were missed from Bill's last release so here is his last release plus my latest TriBlade mods.

    Now I can start to add this to Zog.
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-08 16:25
    Hi John,

    Could you send me your sram_contr20 again? I could not get the copy I have running in the past, but I want to try it again soon as it would speed up page reads/writes.

    Very interesting results!

    So VMCOG was 1/4 of the speed of direct hub reads with 8K working set. This is consistent with what I expected based on the number of hub accesses VMCOG and client have to do.

    Thank you for the smaller working-set results - they show that performance scales with the size of the working set, as expected with more page hits and fewer misses.
    John Abshier said...
    I did the following test inspired by a robot or game character reading a map. For hub ram the map was 31x31. For SPI ram the map was 90x90 or 32,400 bytes. I only had one SPI ram chip installed on the robot.

    repeat 1000
    dx := -1, 0, or 1 dy := -1, 0, 1
    x += dx y += dy (stay inside map)

    read the 25 longs centered around x and y

    Timing results per read and assignment (in microseconds):

    HUB - 30
    SPI with sram_contr20 - 402
    VMcog with 2kB hub ram - 262
    VMcog with 4kB hub ram - 186
    VMcog with 8kB hub ram - 120

    edit: clock is 80,000,000

    John Abshier
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-08 16:39
    Analyzing John's results some more...

    90x90x4 = 32,400 - just barely under 32KB

    31x31x4 = 3,844 - just over 3.75KB

    So with VMCOG, an 8.4x larger map was used.

    With a working set of 8KB (1/4 of the virtual memory) the performance was 25% that of the hub version

    With a working set of 4KB (1/8 of the virtual memory) the performance was 16% that of the hub version

    With a working set of 2KB (1/16 of the virtual memory) the performance was 11% that of the hub version

    And let's not forget - the hub version would not be able to execute the test on a 90x90 map at all!

    The really interesting point is that, by counting cycles, the best-case result for VMCOG is 1.2us per access (if the hub sweet spot is hit by both client and server).

    A bare hub op (best case) takes 0.2us - so best case to best case, there is a 1us penalty per access to using VMCOG... that is, VMCOG is 1/6th the speed at best of a hypothetical large enough hub.

    John's example shows that real clients will not issue back-to-back hub operations, so we won't see a 6x slowdown. The more operations the client does between VMCOG accesses, the less slowdown there will be compared to pure hub reads.
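    The percentages above follow directly from John's microsecond figures; for the record (numbers copied from the posts, nothing measured here):

    ```python
    # Relative-speed figures recomputed from John's per-access timings.
    HUB_US = 30                           # direct hub access, per read+assignment
    VM_US = {'2KB': 262, '4KB': 186, '8KB': 120}
    BEST_HUB_US, BEST_VM_US = 0.2, 1.2    # cycle-counted best cases from the post

    def pct_of_hub(t_us):
        """VMCOG speed as a percentage of direct hub access."""
        return round(100 * HUB_US / t_us)

    # 8KB working set -> 25%, 4KB -> 16%, 2KB -> 11%; and the cycle-counted
    # floor is 0.2/1.2 = 1/6th of hub speed.
    ```
    
    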


    Post Edited (Bill Henning) : 6/8/2010 4:51:23 PM GMT
  • John AbshierJohn Abshier Posts: 1,116
    edited 2010-06-08 19:32
    Archive is attached. Sram_contr20Mod.spin is Andy Schenk's code with pure Spin methods added that call the byte read/write four times and return a long.



    John Abshier
  • heaterheater Posts: 3,370
    edited 2010-06-08 20:44
    Big Zog takes his first little steps:

    ZOG v0.14
    Loading ZPU image...OK
    
    #pc, opcode, sp, top_of_stack
    #----------
    
    0X0000000 0X0B 0X00001FF8 0X00000000
    0X0000001 0X0B 0X00001FF8 0X00000000
    0X0000002 0X0B 0X00001FF8 0X00000000
    0X0000003 0X0B 0X00001FF8 0X00000000
    0X0000004 0X82 0X00001FF8 0X00000000
    0X0000005 0X70 0X00001FF4 0X00000002
    
    



    That is NOP, NOP, NOP, NOP, IM 2

    Was a bit thrown for a while, as the example PASM code in vmaccess.spin is a mess:

    mboxdat is at offset +8 in the mail box
    "if_z jmp #waitdone" should be "if_nz...".

    but otherwise Big Zog is in good shape.

  • jazzedjazzed Posts: 11,803
    edited 2010-06-08 21:58
    Looks like lots of progress on all fronts. :)

    Turns out that one of my old SIMMs is bad. Three out of four work fine.
    Now I can focus on the higher goals.

    @Bill, can you cut in the diffs I posted? Thanks.

    Cheers.
    --Steve

  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-08 22:03
    John:

    Thanks!

    heater:

    Nice! Good progress... sorry about vmaccess; that was thrown together quickly for pullmol weeks ago, and I have not had a chance to test it.

    jazzed:

    I will tonight, I am dealing with a realtor most of today.

  • heaterheater Posts: 3,370
    edited 2010-06-09 09:11
    Just posted a working Zog plus VMCog to the Zog thread.

    This has almost all the functionality of the HUB memory build. The syscall I/O stuff is not there yet.

    Here is the evidence, Zog HUB and then Zog VM running the RC4 crypto test. That's the test that gave me all the grief with endianness issues a while back.

    ZOG v0.14 (HUB)
    RC4...
    bb f3 16 e8 d9 40 af 0a d3
    Done.
    
    #pc,opcode,sp,top_of_stack,next_on_stack
    #----------
    
    0X0000B28 0X00 0X00001FB8 0X00000BA4
    BREAKPOINT
    ZOG v0.14 (VM)
    Loading ZPU image...OK
    RC4...
    bb f3 16 e8 d9 40 af 0a d3
    Done.
    
    #pc,opcode,sp,top_of_stack,next_on_stack
    #----------
    
    0X0000B28 0X00 0X00001FB8 0X00000BA4
    BREAKPOINT
    
    



    The first run of this in VM took 30 seconds as opposed to the HUB version at 1 or 2 seconds!

    Then I thought that using only 4 pages was a bit mean. Using 8 was just as slow. Moving to 16 pages gets us back to 2 or 3 seconds.

    Edit: 12 pages gives about 16 seconds.

    I'll try and get Dhrystone running, but first I need to arrange to pull ZPU executables from SD cards rather than waste space by using "file" statements.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.

    Post Edited (heater) : 6/9/2010 9:17:09 AM GMT
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-09 12:32
    Excellent work heater!

    Now we can really exercise VMCOG.

    Your timing results perfectly reflect CS theory - as soon as you find the "sweet spot" for a given application, VM is not that much slower than using physical RAM. Once there is more hub RAM free, I'll be curious to see the effect of adding more pages (20, 24, 32 total) - I bet the curve will flatten and at some point the speed won't improve.
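    The working-set effect heater measured can be reproduced with a toy model. This is a hedged sketch in C, not VMCOG's actual replacement code: `count_misses` and the reference string are inventions for illustration. Misses drop as frames are added until the working set fits, then the curve flattens:

```c
/* Count page misses for an LRU cache with `frames` page slots
   (frames <= 64). refs[] is the sequence of virtual page numbers
   touched by the program. */
int count_misses(const int *refs, int n, int frames)
{
    int page[64];               /* resident page numbers      */
    int stamp[64];              /* last-use time of each slot */
    int used = 0, misses = 0;
    for (int now = 0; now < n; now++) {
        int hit = -1;
        for (int j = 0; j < used; j++)
            if (page[j] == refs[now]) { hit = j; break; }
        if (hit >= 0) { stamp[hit] = now; continue; }
        misses++;
        if (used < frames) {            /* free slot available */
            page[used] = refs[now];
            stamp[used++] = now;
        } else {                        /* evict least recently used */
            int lru = 0;
            for (int j = 1; j < used; j++)
                if (stamp[j] < stamp[lru]) lru = j;
            page[lru] = refs[now];
            stamp[lru] = now;
        }
    }
    return misses;
}
```

    For the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5, LRU takes 10 misses with 3 frames but only 8 with 4 - the same kind of cliff heater saw going from 12 pages to 16.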

    I have some (paying) work that I need to do this week, but I intend to remove the 64KB VM limit by the weekend, latest. Until then, VMCOG should get a good workout (testing). This will also allow testing the large VM version against a known working VM.

    As I mentioned in the ZOG thread, what we now need is a simple shell, that can launch .ZOG binaries from an SD card.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com
    My products: Morpheus / Mem+ / PropCade / FlexMem / VMCOG / Propteus / Proteus / SerPlug
    and 6.250MHz Crystals to run Propellers at 100MHz & 5.0" OEM TFT VGA LCD modules
    Las - Large model assembler Largos - upcoming nano operating system
  • Dr_AculaDr_Acula Posts: 5,484
    edited 2010-06-09 13:23
    @Bill, if you need a simple bootloader that can load a spin file, take a look at http://forums.parallax.com/showthread.php?p=912403 my second post on the 6th June.

    VGA and SD pins might need changing, but this bootloader does not need any external RAM. It starts off with a list of .bin files and you run them with SPIN MYFILE.BIN

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.smarthome.viviti.com/propeller
  • jazzedjazzed Posts: 11,803
    edited 2010-06-09 20:55
    It seems to me a vmcog-based boot loader would be the best way to write an image into
    memory, since it already understands the hardware. It could also optionally start the program.

    Cheers.
    --Steve

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Propeller Pages: Propeller JVM

    Post Edited (jazzed) : 6/9/2010 11:22:02 PM GMT
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-10 13:58
    Dr_Acula:

    Thanks, I will take a look, it sounds good from your description.

    Jazzed:

    I added an API call to VMCOG to make loading faster & easier:

    PUB GetVirtLoadAddr(vaddr)|va
      va:= vaddr&$7FFE00         ' 23 bit VM address - force start of page
      wrvbyte(vaddr,rdvbyte(va)) ' force page into working set, set dirty bit
      return GetPhysVirt(va)     ' note returned pointer only valid until next vm call
    
    



    So to load, you would do something like the following pseudo-code:

      addr:=0;
      while (filesize>0) {
         fread(f,vm.GetVirtLoadAddr(addr),512)
         addr+=512;
         filesize-=512;
      }
    
    



    This would "read" directly into the VM, 512 bytes at a time.

    NOTE:

    The pointer returned by GetVirtLoadAddr() is ONLY valid until a different VM op is executed, as a different vm op may force that page to be paged out!

    Obviously, this could also be used to save the whole VM image to an SD card, for "swapping" whole VMs, or implementing "suspend" and "hibernation" modes.
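    For illustration, here is a hedged C model of this load pattern (the names `get_virt_load_addr` and `vm_load`, and the single-page working set, are inventions of this sketch, not VMCOG's real code). It shows why the returned pointer is only valid until the next VM operation: fetching a different page reuses the same hub buffer.

```c
#include <string.h>

#define PAGE_SIZE  512
#define VM_PAGES   16                       /* toy VM: 16 pages = 8KB   */

static unsigned char backing[VM_PAGES * PAGE_SIZE]; /* "external RAM"   */
static unsigned char hub_page[PAGE_SIZE];           /* one working page */
static int           resident = -1;                 /* which VM page    */
static int           dirty    = 0;

/* Flush the resident page back to the backing store if it was written. */
static void flush(void)
{
    if (resident >= 0 && dirty)
        memcpy(&backing[resident * PAGE_SIZE], hub_page, PAGE_SIZE);
    dirty = 0;
}

/* Toy analogue of GetVirtLoadAddr: fault the page containing vaddr into
   the hub, mark it dirty, and return a pointer that is valid only until
   the next VM operation. */
unsigned char *get_virt_load_addr(unsigned vaddr)
{
    int page = vaddr / PAGE_SIZE;
    if (page != resident) {
        flush();
        memcpy(hub_page, &backing[page * PAGE_SIZE], PAGE_SIZE);
        resident = page;
    }
    dirty = 1;                      /* caller is about to write into it */
    return hub_page;
}

/* Load `size` bytes from src into the VM, one page at a time - the same
   shape as the fread() pseudo-code above. */
void vm_load(const unsigned char *src, unsigned size)
{
    for (unsigned addr = 0; addr < size; addr += PAGE_SIZE) {
        unsigned n = size - addr < PAGE_SIZE ? size - addr : PAGE_SIZE;
        memcpy(get_virt_load_addr(addr), src + addr, n);
    }
    flush();                        /* make sure the last page lands    */
}
```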

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com
    My products: Morpheus / Mem+ / PropCade / FlexMem / VMCOG / Propteus / Proteus / SerPlug
    and 6.250MHz Crystals to run Propellers at 100MHz & 5.0" OEM TFT VGA LCD modules
    Las - Large model assembler Largos - upcoming nano operating system
  • jazzedjazzed Posts: 11,803
    edited 2010-06-10 15:43
    Bill Henning said...
    I added an API call to VMCOG to make loading faster & easier:

    PUB GetVirtLoadAddr(vaddr)|va
      va:= vaddr&$7FFE00         ' 23 bit VM address - force start of page
      wrvbyte(vaddr,rdvbyte(va)) ' force page into working set, set dirty bit
      return GetPhysVirt(va)     ' note returned pointer only valid until next vm call
    
    



    So to load, you would do something like the following pseudo-code:

      addr:=0;
      while (filesize>0) {
         fread(f,vm.GetVirtLoadAddr(addr),512)
         addr+=512;
         filesize-=512;
      }
    
    



    This would "read" directly into the VM, 512 bytes at a time.

    NOTE:

    The pointer returned by GetVirtLoadAddr() is ONLY valid until a different VM op is executed, as a different vm op may force that page to be paged out!

    Obviously, this could also be used to save the whole VM image to an SD card, for "swapping" whole VM's, or implementing "suspend" and "hibernation" modes

    @Bill, that looks good especially for SRAM. It is a small problem for DRAM as you
    probably know and I'll go into that later. ...

    ... Meanwhile, I think I'll look at adding read-only devices like the serial EEPROM and/or
    the NAND flash which would not flush out on dirty page swaps and use very little precious
    Propeller memory.

    Having "non-flush" EEPROM would enable one of my goals: running the JVM from EEPROM.
    I already have a JVM that can work like that, and it will provide another comparison for
    proving the value and effectiveness of the vmcog cache design smile.gif

    --
    One of the problems with using DRAM is that if the COG running refresh is rebooted,
    the data goes away :-( This is not an issue with SRAM obviously.

    Of course if Propeller is not completely rebooted, it's not a problem for DRAM.
    But how do you partially reboot Propeller? There are a few answers to that question,
    but the easy one is not desirable if you're a "cognostic" (cognew -vs- coginit) user.

    Still, in this case, I think there is merit in using coginit(7) for the device with a disclaimer.

    Maybe in the future the best answer for the DRAM problem while remaining cognostic
    is to have read-only vmcog devices such as SD-CARD, EEPROM, or that NAND flash that
    can be loaded on demand per cog for the task and then be recycled. I'm not saying that
    vmdebug or vmcog has to have the on-demand loader feature [noparse]:)[/noparse], but there can be
    some benefit to having the read-only devices.

    --
    BTW: I also have another low pin count SRAM design that I may want to add later.

    Cheers.
    --Steve

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Propeller Pages: Propeller JVM
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-06-10 16:48
    jazzed said...
    Bill Henning said...
    I added an API call to VMCOG to make loading faster & easier:
      addr:=0;
      while (filesize>0) {
         fread(f,vm.GetVirtLoadAddr(addr),512)
         addr+=512;
         filesize-=512;
      }
    
    



    @Bill, that looks good especially for SRAM. It is a small problem for DRAM as you
    probably know and I'll go into that later. ...

    Thanks - I thought that would handle loading nicely...

    As for DRAM - just have a loop with 128 vm.wrvlong(addr,long) smile.gif
    jazzed said...

    ... Meanwhile, I think I'll look at adding read-only devices like the serial EEPROM and/or
    the NAND flash which would not flush out on dirty page swaps and use very little precious
    Propeller memory.

    I may have time to implement READONLY later today; if not, my plan is to change the LUT entries as follows:

    ' LUT Entry values take the following form:
    '
    ' if 0, page is not present in memory
    '
    ' if not 0, the 32 bits are interpreted as:
    '
    ' CCCCCCCC CCCCCCCC CCCCLWD PPPPPPPPP

    ' where
    '
    ' CCCC...C = 20 bit hit access count
    '
    ' PPPPPPPPP = hub address, upper 9 bits of up to 18 bit address, 000xxxxxx on Prop1
    '
    ' D = Dirty bit - This bit is set whenever a write is performed to any byte(s) in the page
    '
    ' L = Locked bit - set whenever the physical memory page is locked into the hub and may not be swapped out
    '
    ' W = Write protected, probably from FLASH / EEPROM, but can be used to write-protect a RAM page as well
    '

    This does reduce the hit access count to 20 bits, but the (untested) shr_hits routine compensates for this.
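    As a sanity check of the proposed layout, here is a hedged C sketch of packing and unpacking such an entry. The bit positions are one plausible reading of the description (20 hit-count bits on top, L/W/D flags, and a 9-bit hub page number in the low bits), and all the names here are hypothetical:

```c
#include <stdint.h>

/* Proposed LUT entry: CCCCCCCC CCCCCCCC CCCCLWD PPPPPPPPP
   20-bit hit count | Locked | Write-protect | Dirty | 9-bit hub page */
#define LUT_P_BITS   9
#define LUT_D        (1u << 9)
#define LUT_W        (1u << 10)
#define LUT_L        (1u << 11)
#define LUT_C_SHIFT  12

static inline uint32_t lut_pack(uint32_t hits, int locked, int wprot,
                                int dirty, uint32_t page)
{
    return (hits << LUT_C_SHIFT)
         | (locked ? LUT_L : 0)
         | (wprot  ? LUT_W : 0)
         | (dirty  ? LUT_D : 0)
         | (page & ((1u << LUT_P_BITS) - 1));
}

static inline uint32_t lut_hits(uint32_t e) { return e >> LUT_C_SHIFT; }
static inline uint32_t lut_page(uint32_t e) { return e & 0x1FF; }

/* shr_hits analogue: halve the hit count so it never saturates,
   preserving the flag and page bits. */
static inline uint32_t lut_shr_hits(uint32_t e)
{
    return (((e >> LUT_C_SHIFT) >> 1) << LUT_C_SHIFT) | (e & 0xFFF);
}
```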
    jazzed said...

    Having "non-flush" EEPROM would allow one of my goals for running JVM from EEPROM
    especially since I already have a JVM that can work like that, and it will provide another
    comparison for proving the value and effectiveness of the vmcog cache design smile.gif

    If someone wants to submit some nice, small, fast (but speed-tunable for fast/slow EEPROMs) I2C code for BINIT/BREAD/BWRITE/BDATA, there is no reason there can't be an EEPROM VMCOG - however, for such a VMCOG a write-through strategy makes a lot more sense, so BWRITE would have to take a size (1/2/4 bytes), and the DIRTY flag would not be maintained.
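    A minimal sketch of the write-through idea, in C rather than Spin (the `eeprom` array and single cached page are stand-ins, not a real I2C driver): every write goes straight to the backing device, so no dirty bit is ever needed and evicting a page is free.

```c
#include <string.h>

#define PAGE_SIZE 512

/* Toy write-through backend: reads are served from one cached page,
   writes go straight to the backing store. */
static unsigned char eeprom[8 * PAGE_SIZE];  /* stand-in for the EEPROM */
static unsigned char cache[PAGE_SIZE];
static int cached_page = -1;

unsigned char rdvbyte(unsigned vaddr)
{
    int page = vaddr / PAGE_SIZE;
    if (page != cached_page) {               /* miss: fetch the page */
        memcpy(cache, &eeprom[page * PAGE_SIZE], PAGE_SIZE);
        cached_page = page;
    }
    return cache[vaddr % PAGE_SIZE];
}

void wrvbyte(unsigned vaddr, unsigned char v)
{
    eeprom[vaddr] = v;                       /* write-through to device */
    if ((int)(vaddr / PAGE_SIZE) == cached_page)
        cache[vaddr % PAGE_SIZE] = v;        /* keep the cache coherent */
}
```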

    For a JVM, using hub memory for the stack and heap, but VM for code would be an interesting mix.
    jazzed said...

    One of the problems with using DRAM is that if the COG running refresh is rebooted,
    the data goes away :-( This is not an issue with SRAM obviously.

    Of course if Propeller is not completely rebooted, it's not a problem for DRAM.
    But how do you partially reboot Propeller? There are a few answers to that question,
    but the easy one is not desirable if you're a "cognostic" (cognew -vs- coginit) user.

    Still, in this case, I think there is merit in using coginit(7) for the device with a disclaimer.

    I think there is a place for both. Personally, both Minos and Largos will have a "system" cog, for managing resources/cogs and providing some other system services.

    As for reboot - I think you would find that if you did a refresh just before the reboot, the DRAM would survive. I've seen DRAM content survive for seconds without refresh...
    jazzed said...

    Maybe in the future the best answer for the DRAM problem while remaining cognostic
    is to have read-only vmcog devices such as SD-CARD, EEPROM, or that NAND flash that
    can be loaded on demand per cog for the task and then be recycled. I'm not saying that
    vmdebug or vmcog has to have the on-demand loader feature [noparse]:)[/noparse], but there can be
    some benefit to having the read-only devices.

    As soon as I move to an external "page present" table, there will be enough space left in VMCOG for basic I2C capability - which opens interesting possibilities such as your suggested demand loading of drivers (with reusing cogs after the operation is done).

    Actually, right now I am thinking that the low-level SD card stuff (initialize, send command, read status, read sector, write sector) needs to be totally split out from the file system and made mailbox based. This would allow compiling file systems to run out of virtual memory! With Catalina and ZOG, it would also allow running arbitrary file systems written in C.
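    One hedged way to picture such a mailbox interface (this protocol is hypothetical, not an existing VMCOG or SD driver API): the client posts a command block, and the driver cog polls it, performs the transfer, and clears the command word to signal completion.

```c
#include <string.h>

/* Hypothetical mailbox for a low-level SD block driver: the client
   fills in a command, sector, and buffer pointer; the driver cog
   clears `cmd` when done and leaves a status code. */
enum { SD_NONE = 0, SD_INIT, SD_READ, SD_WRITE };

typedef struct {
    volatile int   cmd;      /* command, cleared to SD_NONE when done */
    volatile int   status;   /* 0 = ok, nonzero = error               */
    unsigned       sector;   /* sector number for READ/WRITE          */
    unsigned char *buffer;   /* 512-byte data buffer                  */
} sd_mailbox_t;

#define SECTOR 512
static unsigned char card[8 * SECTOR];   /* stand-in for the SD card */

/* One polling step of the "driver cog": service a pending command. */
void sd_driver_poll(sd_mailbox_t *mb)
{
    switch (mb->cmd) {
    case SD_READ:
        memcpy(mb->buffer, &card[mb->sector * SECTOR], SECTOR);
        mb->status = 0;
        break;
    case SD_WRITE:
        memcpy(&card[mb->sector * SECTOR], mb->buffer, SECTOR);
        mb->status = 0;
        break;
    case SD_INIT:
        mb->status = 0;
        break;
    default:
        return;              /* nothing pending */
    }
    mb->cmd = SD_NONE;       /* signal completion */
}
```

    With the block layer behind a mailbox like this, a file system compiled for ZOG or Catalina could issue sector reads from virtual memory without linking the SPI code into itself.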
    jazzed said...

    BTW: I also have another low pin count SRAM design that I may want to add later.

    Cheers.
    --Steve

    Nice!

    The more, the merrier!

    I have a 12 pin CPLD based design that is just waiting for me to have time after UPEW to test and have boards made.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com
    My products: Morpheus / Mem+ / PropCade / FlexMem / VMCOG / Propteus / Proteus / SerPlug
    and 6.250MHz Crystals to run Propellers at 100MHz & 5.0" OEM TFT VGA LCD modules
    Las - Large model assembler Largos - upcoming nano operating system

    Post Edited (Bill Henning) : 6/10/2010 4:55:24 PM GMT
  • heaterheater Posts: 3,370
    edited 2010-06-10 17:29
    Bill: "...the low-level SD card stuff (initialize, send command, read status, read sector, write sector) needs to be totally split out from the file system and made mailbox based."

    That would be excellent. Useful for the Z80, 6502 emulators as well.

    By the way, shouldn't the Zogs be converted to use mailboxes as well?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.