Bill: "What we need now is a simple shell, that leaves as much hub memory free as possible, but can launch .ZOG binaries from an SD card."
My thoughts exactly.
I was thinking of looking into Kye's all-singing, all-dancing FATE, which seems to be an OS by itself, or SPHINX or such.
On the other hand my mind is pulling in another direction.
I'd like to get little Zog (HUB memory) running all by itself after completely replacing Spin. It would be booted up from some PASM in a COG which then gives little Zog ALL 32K of HUB memory, apart from some mailboxes with which it can talk to the UART and SD card. Little Zog then starts up big Zog in external memory. Big Zog's support is now provided by little Zog rather than Spin.
Big Zog runs the "OS", "shell", whatever.
Still, all that is a long way off for now.
After these few days of concerted effort I have to get back to real work, so perhaps I can just squeeze in loading a ZPU binary from SD for Big Zog for now.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
With 12 pages it takes about 30 seconds (did not use a watch, just eyeballed it), so SPI RAM with the slow 4 Mbps bit-banging driver is definitely slower than TriBlade (DUH!).
With 16 pages it took 2-3 seconds, like you said.
Good idea, except please leave the first 4KB of hub untouched... I've been working a bit (on paper) on Minos (a minimal version of Largos <grin> OK, I like puns).
Initial memory map:
$0000-$01FF - reserved for mailboxes etc
$0200-$07FF - reserved for vmcog's use (needed for larger than 64KB VM's)
$0800-$0FFF - reserved for cog image load buffer
$1000-$7FFF - reserved for working set
The idea is that the initial Spin bootloader uses all the memory to load drivers (SD, video, KB, etc.) into cogs, and all drivers communicate via mailboxes.
Then Spin boots VMCOG, which re-uses all the memory from $1000-$7FFF (28KB, 56 pages) as the working set for a VM.
The file system etc. can be re-written in C for ZOG; then unused routines will not take up precious hub room until needed.
Once vmlock() and vmunlock() work, even video buffers can be allocated from the VM, and released when no longer needed!
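For reference, the map above could be captured in a C header along these lines. This is only a sketch; the symbol names are mine, not anything from Minos or VMCOG, only the address ranges come from the post:

/* Hypothetical hub memory map for Minos (addresses from the post above,
   symbol names are illustrative only). */
#define HUB_MAILBOX_BASE  0x0000  /* $0000-$01FF: mailboxes etc.            */
#define HUB_VMCOG_BASE    0x0200  /* $0200-$07FF: VMCOG private (>64KB VMs) */
#define HUB_COGLOAD_BASE  0x0800  /* $0800-$0FFF: cog image load buffer     */
#define HUB_WORKSET_BASE  0x1000  /* $1000-$7FFF: working set for the VM    */
#define HUB_WORKSET_SIZE  0x7000  /* 28KB                                   */

#define VM_PAGE_SIZE      512
#define VM_WORKSET_PAGES  (HUB_WORKSET_SIZE / VM_PAGE_SIZE)  /* = 56 pages */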
I think I got it working with the Hydra driver for VMCog, after fixing two really stupid mistakes (will post 0.14 with the added Hydra code tomorrow).
@heater
Is there a particular reason for loading 8KB when the zog binary is 4KB?
Now it starts executing (but only if I load 4KB). Before fixing the bugs I noticed it went "BREAK" at two different addresses depending on the memory size variable.
Bill, any particular reason your reserved memory areas for mailboxes etc have to be at the bottom of RAM rather than at the top?
I was quite looking forward to having C code under Little Zog using HUB RAM from 0000 up. No address translations required.
Perhaps I should figure out how to tweak the linker scripts so that the zpu code can be loaded and started from $1000. Or just live with having to translate ZPU to HUB addresses.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Bill: "C code under Little Zog is free to use low addresses if it takes over the whole machine."
Ah yes, but... if someone, like yourself :), were to create a bunch of serial, vmem, low-level SD block, etc. drivers that use mailboxes in HUB, then I'd be wanting to use those, and Little Zog, as a C replacement for Spin, would have to keep out of the way.
It's a shame the mailbox idea has not been in use in the Prop world from the start. Having all COG drivers written to be accessed through Spin methods is a pain. These software drivers in COG should present themselves through registers, like real hardware does. And that's basically what mailboxes do.
The "rambase" thing is what we do now for running from HUB, so no problem actually. I'll have another look at the linker scripts; they always give me a headache.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I am trying to start a trend with VMCOG, with a bit of standardization, to move everyone to mailbox based cog drivers :)
I need them for Minos and Largos, and I don't have time to write every possible driver myself :(
Pretty much all the drivers/code I write now uses the 4 long mailbox model - see my GPU cog for Morpheus, upcoming A/D and industrial drivers etc.
For example, for both Largos and Minos, there is an STDIO mailbox - and depending on what cog driver services it, it can go to serial, VGA/kb, or TV/kb
As I mentioned a few minutes earlier in the VMCOG thread... we *REALLY* need a mailbox based low level SD driver.
(As an aside - in case a driver needs more info than 4 longs can provide, 3 of the 4 can be pointers to tables <grin>)
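To make the 4-long mailbox model concrete, here is a rough C-side sketch of how a client could talk to such a driver cog. The field layout, command convention and names are purely illustrative, not Bill's actual spec:

#include <stdint.h>

/* One mailbox = 4 longs in hub RAM. The driver cog polls 'cmd', does the
   work, writes 'result' and clears 'cmd' to signal completion.
   (Field meanings are an assumption; a real driver defines its own.) */
typedef struct {
    volatile uint32_t cmd;     /* 0 = idle, otherwise a command code        */
    volatile uint32_t arg1;    /* e.g. a hub buffer address                 */
    volatile uint32_t arg2;    /* e.g. a block number or byte count         */
    volatile uint32_t result;  /* status / return value from the driver cog */
} mailbox_t;

/* Blocking request: post a command and spin until the driver clears it. */
static uint32_t mailbox_call(mailbox_t *mb, uint32_t cmd,
                             uint32_t arg1, uint32_t arg2)
{
    mb->arg1 = arg1;
    mb->arg2 = arg2;
    mb->cmd  = cmd;            /* writing cmd last "fires" the request */
    while (mb->cmd != 0)       /* driver clears cmd when finished      */
        ;
    return mb->result;
}

Because the driver cog only ever watches the command long, the other three longs stay free to carry parameters, or, as Bill notes above, pointers to larger tables.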
heater said...
Bill: "C code under Little Zog is free to use low addresses if it takes over the whole machine."
Ah yes, but... if someone, like yourself :), were to create a bunch of serial, vmem, low-level SD block, etc. drivers that use mailboxes in HUB, then I'd be wanting to use those, and Little Zog, as a C replacement for Spin, would have to keep out of the way.
It's a shame the mailbox idea has not been in use in the Prop world from the start. Having all COG drivers written to be accessed through Spin methods is a pain. These software drivers in COG should present themselves through registers, like real hardware does. And that's basically what mailboxes do.
The "rambase" thing is what we do now for running from HUB, so no problem actually. I'll have another look at the linker scripts; they always give me a headache.
FYI, I see Minos as being more like OMU than a full Unix, and Largos will be closer to the original v7 Unix, but mailbox based - shades of Hurd and Plan9
This fixes the zpu_addsp instruction to work correctly with virtual memory read_long.
Added reading of a ZPU image from SD card file.
Runs dhrystone by default (dstone.bin)
One can run the tests "test", "endian", "rc4" and "dstone" from either HUB or ext RAM via VMCog. Just change the #defines at the top, and look for the file names embedded in the SD loader code and in the "file" statement at the end. Sorry, no fancy user interface for program selection.
This is painfully slow from ext RAM with VMCOG and 16 pages. FIBO(26), as performed at the end of test.bin, takes between 1 and 1.5 hours; I went shopping whilst I was waiting :)
Dhrystone takes 20 minutes with 5000 loops. The attached dhrystone only performs 500 loops.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I figured large programs would be glacial with a small working set (unless their accesses were mostly linear and localized) - this will be a great way to test the effect of different working set sizes.
Bill: "I figured large programs would be glacial with a small working set (unless their accesses were mostly linear and localized)"
Actually that's what puzzles me about FIBO. I have a working set of 16 pages, 8KB.
The fibo routine is very small, so the actual code should be resident all the time. All the data is on the stack, which only goes down 26 call levels deep, so that can be resident all the time as well. There's no other data to worry about.
Let's say I expect an order of magnitude slowdown going through the VM access; the HUB version takes 14 seconds, so we'd expect the ext RAM version to take about 3 minutes.
As it actually takes over an hour, that's more like over 300 times slower!
This just does not seem right. Have I missed a point somewhere? Is VMCog reading/writing pages more than it should?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I figured large programs would be glacial with a small working set (unless their accesses were mostly linear and localized) - this will be a great way to test the effect of different working set sizes.
Good work indeed
Maybe a smaller cache line size would help the glacial performance?
@Bill, I'm not purposefully trying to reinvent vmcog or whatever, but I'm really having a hard time dealing with that 512-byte cache line. It should not be a problem to quickly load 512 bytes in an efficient parallel bus design. Still, I'm sharing my findings so that hopefully some experiments can be tried. ...
In my experiments with EEPROM, I use the modulo hash function: tagline = f(physical address % TLB count). Small 16 to 32 byte blocks load contiguously as required, and the time it takes to "probe" and track statistics for a collision on a duplicate hash seems to be more expensive than just losing the line and reloading it if necessary. Other advantages include being able to embed the VM cache in the interpreter cog itself given room, use with a big Propeller emulator/VM program, and use of a large physical backstore address range (though less efficiently) with a small cache block of, say, 2KB + TLB size or less.
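A minimal sketch of that direct-mapped scheme, in C for clarity (the line size, tag handling and the backstore_read() helper are assumptions, not the actual JVM code, and writes are ignored since the backing store here is code in EEPROM):

#include <stdint.h>

#define LINE_SIZE 32   /* small 16-32 byte lines, per the post above */
#define NUM_LINES 64   /* TLB/tag count - illustrative               */

static uint8_t  cache[NUM_LINES][LINE_SIZE];
static uint32_t tag[NUM_LINES];                /* backing-store line number per slot */

/* Placeholder for the real low-level read (EEPROM, SPI RAM, ...). */
extern void backstore_read(uint32_t addr, uint8_t *dst, uint32_t len);

static void cache_init(void)
{
    for (int i = 0; i < NUM_LINES; i++)
        tag[i] = 0xFFFFFFFFu;                  /* mark every slot empty */
}

/* Direct-mapped lookup: the slot is just (address / LINE_SIZE) % NUM_LINES.
   On a miss or collision the old line is simply dropped and reloaded -
   there is no replacement policy to evaluate, which is the whole point. */
static uint8_t cache_read_byte(uint32_t addr)
{
    uint32_t line = addr / LINE_SIZE;
    uint32_t slot = line % NUM_LINES;

    if (tag[slot] != line) {
        backstore_read(line * LINE_SIZE, cache[slot], LINE_SIZE);
        tag[slot] = line;
    }
    return cache[slot][addr % LINE_SIZE];
}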
Cheers,
--Steve
EDIT: Hmm, I just saw your comment heater. That's pretty mysterious.
Thrashing occurs when the working set is too small for the access pattern of the code running in the VM.
It often occurs when a process does just a few reads/writes to a page before having to access another page, with such a frequency that most of the time is spent swapping pages.
Try different values for whitcount.
It defaults to the same weight as rhitcount, but if few writes are done compared to reads of the code, it could be recycling the same page over and over again.
I suggest trying "64 << 11" for whitcount. That will give writes 64 times the effect on access count as reads, making it more likely that read-only pages are swapped.
You can try any value between say 1 and 128 for whitcount.
You could also add something to ZOG so that, say, every 100 instructions it dumps the TLB to the console; perhaps you can then see what is going on.
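In C terms the weighting amounts to something like this (VMCOG itself is PASM; the names below just mirror the Spin constants, and the guess about why the counts are shifted left by 11 is mine):

/* Per-page access counters, bumped on every VM read or write.
   rhitcount / whitcount are the read and write increments; the << 11
   presumably keeps the count clear of lower tag/flag bits in VMCOG's
   TLB entries (an assumption, not taken from the source). */
#define RHITCOUNT  (1 << 11)
#define WHITCOUNT  (64 << 11)   /* weight writes 64x heavier, as suggested above */

static unsigned int access_count[56];   /* one counter per resident page */

static void note_read(int page)  { access_count[page] += RHITCOUNT; }
static void note_write(int page) { access_count[page] += WHITCOUNT; }

A dirty page then accumulates "age" much faster than a clean one, so the clean, read-only pages become the preferred eviction victims, as described above.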
heater said...
Bill: "I figured large programs would be glacial with a small working set (unless their accesses were mostly linear and localized)"
Actually that's what puzzles me about FIBO. I have a working set of 16 pages, 8KB.
The fibo routine is very small, so the actual code should be resident all the time. All the data is on the stack, which only goes down 26 call levels deep, so that can be resident all the time as well. There's no other data to worry about.
Let's say I expect an order of magnitude slowdown going through the VM access; the HUB version takes 14 seconds, so we'd expect the ext RAM version to take about 3 minutes.
As it actually takes over an hour, that's more like over 300 times slower!
This just does not seem right. Have I missed a point somewhere? Is VMCog reading/writing pages more than it should?
For eeprom a caching approach may work better - and while caches and virtual memory are quite similar, they are not exactly the same - the chief difference being the lack of a "page replacement policy" in a cache.
The trick with vm's is to tune the page replacement algorithm to find one that can satisfy most requests from the working set, instead of causing a page fault.
The slower the backing store is, the *much worse* the VM performance gets with a small working set.
This can of course be offset by using very small pages.
The problem is that with small pages, the translation table gets to be very large, or a lot of time has to be spent searching to see if a page is in memory.
There have been whole PhD theses written on the topic of page replacement policies!
To start with, I implemented a simple LRU (least recently used) algorithm; however, I do allow giving different weights to reads and writes, which adds some tunability.
The problem with using a simple hash function is: what if the compiled code happens to use several pages frequently with the same hash code? That also leads to thrashing.
We are just starting to explore VM on the Prop.
jazzed said...
Bill Henning said...
Excellent work!
I figured large programs would be glacial with a small working set (unless their accesses were mostly linear and localized) - this will be a great way to test the effect of different working set sizes.
Good work indeed
Maybe a smaller cache line size would help the glacial performance ?
@Bill, I'm not purposefully trying to reinvent vmcog or whatever, but I'm really having a hard time dealing with that 512 byte cache line. It should not be a problem to quickly load 512 bytes in an efficient parallel bus design. Still, I'm sharing my findings so that hopefully some experiments can be tried. ...
In my experiments with EEPROM, I use the modulo hash function: tagline = f(physical address % TLB count). Small 16 to 32 byte blocks load contiguously as required and the time it takes to "probe" and track statistics for a collision on a duplicate hash seems to be more expensive than just losing the line and reloading it if necessary. The other advantages allow for embedding the vm cache in the interpreter cog itself given room, using with a big Propeller emulator/vm program, and use of a large physical backstore address range (though less efficient) with a small cache block of say 2KB + TLB size or less.
Cheers,
--Steve
EDIT: Hmm, I just saw your comment heater. That's pretty mysterious.
The problem with using a simple hash function is: what if the compiled code happens to use several pages frequently with the same hash code? That also leads to thrashing.
We are just starting to explore VM on the Prop.
Yes, I'm sure thrashing resulting from hash collisions can become severe. So far I'm pretty happy with the results for FIBO24, which is of course small but uses diverse addresses. I just got Propeller JVM EEPROM to out-perform Javelin with FIBO24 :)
Lettuce explore VM more :)
Edit: Actually, Propeller JVM was out-performing Javelin before, but now it does FIBO24 in less than 9.5 seconds, which is only 1.5 times slower than running straight from HUB. Before it was more than 2 times slower.
Indeed, "Lettuce explore VM more :)" <grin>
I am very impressed that you outperformed the Javelin! Well done!
jazzed said...
Bill Henning said...
Hi Steve,
The problem with using a simple hash function is: what if the compiled code happens to use several pages frequently with the same hash code? That also leads to thrashing.
We are just starting to explore VM on the Prop.
Yes, I'm sure thrashing resulting from hash collisions can become severe. So far I'm pretty happy with the results for FIBO24, which is of course small but uses diverse addresses. I just got Propeller JVM EEPROM to out-perform Javelin with FIBO24 :)
As a young thing back in 1981, the washing-machine-sized box containing a 10MB hard drive standing beside the mini-computer I was using started to jump around like... well... a washing machine. "It's thrashing," a sage elder engineer said. I immediately knew that my program was just too big for the available real memory; I would have to reorganize things.
Shortly after that the hard drive suffered a serious head crash and the platter had to be replaced. It was commonly said by the group that this expense was my fault. But I think that was just because I was the new boy on the team.
Since then I have learned precious little about virtual memory and page replacement algorithms but I can "feel" my Propeller is thrashing.
But why?
Bill: "Thrashing occurs when the working set is too small for the access pattern of the code running in the VM."
Well here is the FIBO access pattern:
unsigned int fibo (unsigned int n)
{
5fe: fe im -2
5ff: 3d pushspadd
600: 0d popsp
601: 74 loadsp 16
00000602 <.LM12>:
/home/michael/zog_v0_15/test/test.c:49
if (n <= 1)
{
return (n);
602: 70 loadsp 0
603: 53 storesp 12
604: 53 storesp 12
00000605 <.LM13>:
/home/michael/zog_v0_15/test/test.c:47
605: 81 im 1
606: 73 loadsp 12
607: 27 ulessthanorequal
608: 92 im 18
609: 38 neqbranch
0000060a <.LM14>:
/home/michael/zog_v0_15/test/test.c:53
}
else
{
return fibo(n - 1) + fibo(n - 2);
60a: ff im -1
60b: 13 addsp 12
60c: 51 storesp 4
60d: f0 im -16
60e: 3f callpcrel
60f: 80 im 0
610: 08 load
611: fe im -2
612: 14 addsp 16
613: 52 storesp 8
614: 52 storesp 8
615: e8 im -24
616: 3f callpcrel
617: 80 im 0
618: 08 load
619: 12 addsp 8
61a: 52 storesp 8
0000061b <.L11>:
/home/michael/zog_v0_15/test/test.c:55
}
}
61b: 71 loadsp 4
61c: 80 im 0
61d: 0c store
61e: 84 im 4
61f: 3d pushspadd
620: 0d popsp
621: 04 poppc
FIBO's code is tiny. It could sit in one page permanently in memory.
FIBO's data is not so big, all on the stack, another page is enough to hold the call stack to the depth required.
The access pattern is :
1) Fetch an instruction from some small area of code space.
2) Read or write to stack in a small area of stack space
3) Repeat the above till done.
I can't see any need for thrashing due to the memory access pattern. Everything is very local.
A slowdown by a factor of 320-odd feels like each of steps 1 and 2 is swapping a page.
FIBO with "whitcount long 64 << 11" is running as I type. Forty minutes so far...
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
The degenerate case for the current strategy would be when a write was done to a page, then a write to another page, etc., as it would keep re-using the least recently used page (the most recent write).
After UPEW I intend to finish the page hit / miss statistics in VMCOG, then we will be able to precisely tell what is happening.
One possible strategy is to set the "hitcount" of a newly made available page to the average of all current hitcounts, then we would not always be sacrificing the same page. I'll see if I can hack that in over the next few days, I suspect it would help.
If the problem is what I suspect - that the same page was always chosen for sacrifice - then the new version 0.970 I uploaded into the VMCOG thread should help. A lot. I now set the access count of a new page to the average of all loaded pages.
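A sketch of what that change amounts to, again in C only for clarity (VMCOG is PASM and the real bookkeeping differs; access_count is the same per-page counter idea sketched earlier in the thread):

#define NUM_PAGES 56
static unsigned int access_count[NUM_PAGES];   /* pseudo-LRU counters, one per resident page */

/* Victim selection: evict the page with the lowest access count. */
static int pick_victim(void)
{
    int victim = 0;
    for (int i = 1; i < NUM_PAGES; i++)
        if (access_count[i] < access_count[victim])
            victim = i;
    return victim;
}

/* The 0.970 tweak: start a freshly loaded page at the average count of all
   resident pages, so the newcomer is not automatically the cheapest victim
   and the same slot does not get sacrificed over and over. */
static void init_new_page(int page)
{
    unsigned long long sum = 0;
    for (int i = 0; i < NUM_PAGES; i++)
        sum += access_count[i];
    access_count[page] = (unsigned int)(sum / NUM_PAGES);
}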
It's probably best to set whitcount back to 1 << 11 with this version.
Sorry, it does not have the 0.10 Triblade code yet.
Best of luck in the morning, heater! Oops... was that a joke? I missed it.
Believe it or not, the FIBO Java byte-code is over 8KB with all the JNI, String, and other support classes. The code being executed is spread out quite a lot though.
I assume most if not all of the ZOG print routines are built-in on the Propeller.
Jazzed: That FIBO I'm running is 4420 bytes. It includes a few tens of lines just printing some CPU config/status stuff, as it started out as just a "hello world" test.
If I include a small integer-only version of printf it bloats out to almost 18K. The full printf brings it up to 40K.
All print formatting is done in C libs, statically linked. Eventually they write to an address that they think is a memory mapped UART. This is picked up by Spin and sent to FullDuplexSerial.
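For what it is worth, the C side of that boils down to something like the following; the address and names here are placeholders, not Zog's real I/O map, and the real library may also poll a status flag:

#include <stdint.h>

/* All formatted output eventually funnels down to a single character write
   to what the C library believes is a memory-mapped UART. Spin watches that
   location and forwards each byte to FullDuplexSerial.
   0x80000024 is a placeholder address, not the actual one. */
#define UART_TX (*(volatile uint32_t *)0x80000024)

void uart_putc(char c)
{
    UART_TX = (uint32_t)c;
}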
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
The Java code included the print formatter, but not the floating point support classes.
Have you tried compiling the same with Catalina for "innocent" size comparisons?
Another step forward for Zog, v0.17 is attached to the first post.
This comes with an updated VMCOG and now loads ZPU executables from SD card.
Seems to be quite solid; it runs the fibo, RC4, dhrystone etc. tests OK.
It should run on any Propeller board with external RAM that VMCOG supports. Currently that is:
Homebrew boards using SPI RAM.
Bill Henning's Propcade.
Cluso's TriBlade blade #2
Hydra.
Just set the relevant #define in vmcog.spin
There are #defines to select running from HUB memory with ZPU executable pulled in from a "file" statement or running with external memory and executables read from an SD card.
The SD card is set up for TriBlade as released.
By way of comparison Zog runs the Dhrystone benchmark in about 14 seconds from HUB.
From external memory one can set the number of pages cached in HUB RAM. (look for vm.start)
So for external RAM the Dhrystone times are for example:
Pages Seconds
8 90
10 60
36 37
I don't get a real DMIPS number out of this as Zog does not have any nice time functions yet.
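When Zog does grow a timer, the conversion is trivial; the conventional figure is Dhrystones per second divided by 1757 (the VAX 11/780 reference machine), something like:

/* Conventional conversion: 1 DMIPS = 1757 Dhrystones/second (VAX 11/780). */
double dmips(unsigned int loops, double seconds)
{
    return (loops / seconds) / 1757.0;
}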
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Comments
What we need now is a simple shell, that leaves as much hub memory free as possible, but can launch .ZOG binaries from an SD card.
I will have a "large" VM version of VMCOG by the weekend, exactly when depends on some other work I have to get done first.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
My thoughts exactly.
I was thinking of looking into Kye's all-singing, all-dancing FATE, which seems to be an OS by itself, or SPHINX or such.
On the other hand my mind is pulling in another direction.
I'd like to get little Zog (HUB memory) running all by itself after completely replacing Spin. It would be booted up from some PASM in a COG which then gives little Zog ALL 32K of HUB memory, apart from some mailboxes with which it can talk to the UART and SD card. Little Zog then starts up big Zog in external memory. Big Zog's support is now provided by little Zog rather than Spin.
Big Zog runs the "OS", "shell", whatever.
Still, all that is a long way off for now.
After these few days of concerted effort I have to get back to real work, so perhaps I can just squeeze in loading a ZPU binary from SD for Big Zog for now.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I tried 0.14 here, works fine!
With 12 pages it takes about 30 seconds (did not use a watch, just eyeballed it), so SPI RAM with the slow 4 Mbps bit-banging driver is definitely slower than TriBlade (DUH!).
With 16 pages it took 2-3 seconds, like you said.
Good idea, except please leave the first 4KB of hub untouched... I've been working a bit (on paper) on Minos (a minimal version of Largos <grin> OK, I like puns).
Initial memory map:
$0000-$01FF - reserved for mailboxes etc
$0200-$07FF - reserved for vmcog's use (needed for larger than 64KB VM's)
$0800-$0FFF - reserved for cog image load buffer
$1000-$7FFF - reserved for working set
The idea is that the initial Spin bootloader uses all the memory to load drivers (SD, video, KB, etc.) into cogs, and all drivers communicate via mailboxes.
Then Spin boots VMCOG, which re-uses all the memory from $1000-$7FFF (28KB, 56 pages) as the working set for a VM.
The file system etc. can be re-written in C for ZOG; then unused routines will not take up precious hub room until needed.
Once vmlock() and vmunlock() work, even video buffers can be allocated from the VM, and released when no longer needed!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I think I got it working with the Hydra driver for VMCog, after fixing two really stupid mistakes (will post 0.14 with the added Hydra code tomorrow).
@heater
Is there a particular reason for loading 8KB when the zog binary is 4KB?
Now it starts executing (but only if I load 4KB). Before fixing the bugs I noticed it went "BREAK" at two different addresses depending on the memory size variable.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I was quite looking forward to having C code under Little Zog using HUB RAM from 0000 up. No address translations required.
Perhaps I should figure out how to tweak the linker scripts so that the zpu code can be loaded and started from $1000. Or just live with having to translate ZPU to HUB addresses.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
C code under Little Zog is free to use low addresses if it takes over the whole machine.
I suggest using a 'rambase' variable in little Zog, and adding it to the PC whenever the PC is modified. It only causes a 50ns delay when doing jumps.
Better yet is what you suggest - having the base address settable in the linker script!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Ah yes, but... if someone, like yourself :), were to create a bunch of serial, vmem, low-level SD block, etc. drivers that use mailboxes in HUB, then I'd be wanting to use those, and Little Zog, as a C replacement for Spin, would have to keep out of the way.
It's a shame the mailbox idea has not been in use in the Prop world from the start. Having all COG drivers written to be accessed through Spin methods is a pain. These software drivers in COG should present themselves through registers, like real hardware does. And that's basically what mailboxes do.
The "rambase" thing is what we do now for running from HUB, so no problem actually. I'll have another look at the linker scripts; they always give me a headache.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I need them for Minos and Largos, and I don't have time to write every possible driver myself :(
Pretty much all the drivers/code I write now uses the 4 long mailbox model - see my GPU cog for Morpheus, upcoming A/D and industrial drivers etc.
For example, for both Largos and Minos, there is an STDIO mailbox - and depending on what cog driver services it, it can go to serial, VGA/kb, or TV/kb
As I mentioned a few minutes earlier in the VMCOG thread... we *REALLY* need a mailbox based low level SD driver.
(As an aside - in case a driver needs more info than 4 longs can provide, 3 of the 4 can be pointers to tables <grin>)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
This fixes the zpu_addsp instruction to work correctly with virtual memory read_long.
Added reading of a ZPU image from SD card file.
Runs dhrystone by default (dstone.bin)
One can run the tests "test", "endian", "rc4" and "dstone" from either HUB or ext RAM via VMCog. Just change the #defines at the top, and look for the file names embedded in the SD loader code and in the "file" statement at the end. Sorry, no fancy user interface for program selection.
This is painfully slow from ext RAM with VMCOG and 16 pages. FIBO(26), as performed at the end of test.bin, takes between 1 and 1.5 hours; I went shopping whilst I was waiting :)
Dhrystone takes 20 minutes with 5000 loops. The attached dhrystone only performs 500 loops.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I figured large programs would be glacial with a small working set (unless their accesses were mostly linear and localized) - this will be a great way to test the effect of different working set sizes.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Actually that's what puzzles me about FIBO. I have a working set of 16 pages, 8KB.
The fibo routine is very small, so the actual code should be resident all the time. All the data is on the stack, which only goes down 26 call levels deep, so that can be resident all the time as well. There's no other data to worry about.
Let's say I expect an order of magnitude slowdown going through the VM access; the HUB version takes 14 seconds, so we'd expect the ext RAM version to take about 3 minutes.
As it actually takes over an hour, that's more like over 300 times slower!
This just does not seem right. Have I missed a point somewhere? Is VMCog reading/writing pages more than it should?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Maybe a smaller cache line size would help the glacial performance?
@Bill, I'm not purposefully trying to reinvent vmcog or whatever, but I'm really having a hard time dealing with that 512-byte cache line. It should not be a problem to quickly load 512 bytes in an efficient parallel bus design. Still, I'm sharing my findings so that hopefully some experiments can be tried. ...
In my experiments with EEPROM, I use the modulo hash function: tagline = f(physical address % TLB count). Small 16 to 32 byte blocks load contiguously as required, and the time it takes to "probe" and track statistics for a collision on a duplicate hash seems to be more expensive than just losing the line and reloading it if necessary. Other advantages include being able to embed the VM cache in the interpreter cog itself given room, use with a big Propeller emulator/VM program, and use of a large physical backstore address range (though less efficiently) with a small cache block of, say, 2KB + TLB size or less.
Cheers,
--Steve
EDIT: Hmm, I just saw your comment heater. That's pretty mysterious.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Thrashing occurs when the working set is too small for the access pattern of the code running in the VM.
It often occurs when a process does just a few reads/writes to a page before having to access another page, with such a frequency that most of the time is spent swapping pages.
Try different values for whitcount.
It defaults to the same weight as rhitcount, but if few writes are done compared to reads of the code, it could be recycling the same page over and over again.
I suggest trying "64 << 11" for whitcount. That will give writes 64 times the effect on access count as reads, making it more likely that read-only pages are swapped.
You can try any value between say 1 and 128 for whitcount.
You could also add something to ZOG so that, say, every 100 instructions it dumps the TLB to the console; perhaps you can then see what is going on.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For eeprom a caching approach may work better - and while caches and virtual memory are quite similar, they are not exactly the same - the chief difference being the lack of a "page replacement policy" in a cache.
The trick with vm's is to tune the page replacement algorithm to find one that can satisfy most requests from the working set, instead of causing a page fault.
The slower the backing store is, the *much worse* the VM performance gets with a small working set.
This can of course be offset by using very small pages.
The problem is that with small pages, the translation table gets to be very large, or a lot of time has to be spent searching to see if a page is in memory.
There have been whole PhD theses written on the topic of page replacement policies!
To start with, I implemented a simple LRU (least recently used) algorithm; however, I do allow giving different weights to reads and writes, which adds some tunability.
The problem with using a simple hash function is: what if the compiled code happens to use several pages frequently with the same hash code? That also leads to thrashing.
We are just starting to explore VM on the Prop.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Lettuce explore VM more :)
Edit: Actually, Propeller JVM was out-performing Javelin before, but now it does FIBO24 in less than 9.5 seconds, which is only 1.5 times slower than running straight from HUB. Before it was more than 2 times slower.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I am very impressed that you outperformed the Javelin! Well done!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
As a young thing back in 1981, the washing-machine-sized box containing a 10MB hard drive standing beside the mini-computer I was using started to jump around like... well... a washing machine. "It's thrashing," a sage elder engineer said. I immediately knew that my program was just too big for the available real memory; I would have to reorganize things.
Shortly after that the hard drive suffered a serious head crash and the platter had to be replaced. It was commonly said by the group that this expense was my fault. But I think that was just because I was the new boy on the team.
Since then I have learned precious little about virtual memory and page replacement algorithms but I can "feel" my Propeller is thrashing.
But why?
Bill: "Thrashing occurs when the working set is too small for the access pattern of the code running in the VM."
Well here is the FIBO access pattern:
FIBO's code is tiny. It could sit in one page permanently in memory.
FIBO's data is not so big, all on the stack, another page is enough to hold the call stack to the depth required.
The access pattern is :
1) Fetch an instruction from some small area of code space.
2) Read or write to stack in a small area of stack space
3) Repeat the above till done.
I can't see any need for thrashing due to the memory access pattern. Everything is very local.
A slowdown by a factor of 320-odd feels like each of steps 1 and 2 is swapping a page.
FIBO with "whitcount long 64 << 11" is running as I type. Forty minutes so far...
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
After UPEW I intend to finish the page hit / miss statistics in VMCOG, then we will be able to precisely tell what is happening.
One possible strategy is to set the "hitcount" of a newly made available page to the average of all current hitcounts, then we would not always be sacrificing the same page. I'll see if I can hack that in over the next few days, I suspect it would help.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
From VM, well, we'll see in the morning...
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I just hacked in the "set new page access count to average of all pages" patch.
I suspect this may help a lot. Sorry, have not had time to merge in the .10 TriBlade. I will upload it to the VMCOG thread in a couple of minutes.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
It's probably best to set whitcount back to 1 << 11 with this version.
Sorry, it does not have the 0.10 Triblade code yet.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Believe it or not, the FIBO Java byte-code is over 8KB with all the JNI, String, and other support classes. The code being executed is spread out quite a lot though.
I assume most if not all of the ZOG print routines are built-in on the Propeller.
Cheers.
--Steve
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
If I include a small integer-only version of printf it bloats out to almost 18K. The full printf brings it up to 40K.
All print formatting is done in C libs, statically linked. Eventually they write to an address that they think is a memory mapped UART. This is picked up by Spin and sent to FullDuplexSerial.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
The Java code included the print formatter, but not the floating point support classes.
Have you tried compiling the same with Catalina for "innocent" size comparisons?
--Steve
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
This comes with an updated VMCOG and now loads ZPU executables from SD card.
Seems to be quite solid; it runs the fibo, RC4, dhrystone etc. tests OK.
It should run on any Propeller board with external RAM that VMCOG supports. Currently that is:
Homebrew boards using SPI RAM.
Bill Henning's Propcade.
Cluso's TriBlade blade #2
Hydra.
Just set the relevant #define in vmcog.spin
There are #defines to select running from HUB memory with ZPU executable pulled in from a "file" statement or running with external memory and executables read from an SD card.
The SD card is set up for TriBlade as released.
By way of comparison Zog runs the Dhrystone benchmark in about 14 seconds from HUB.
From external memory one can set the number of pages cached in HUB RAM. (look for vm.start)
So for external RAM the Dhrystone times are for example:
Pages Seconds
8 90
10 60
36 37
I don't get a real DMIPS number out of this as Zog does not have any nice time functions yet.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔