What is the purpose of the first repeat loop? If, after initiating a VMCOG operation, we always wait for the operation to complete, there is no need to wait again at the start of the next VMCOG call.
PUB rdvbyte(adr)
  repeat while long[cmdptr]            ' first wait: spin until any previous command has been consumed
  long[cmdptr] := (adr<<9)|READVMB     ' issue the read-byte command (address in the upper bits, command code in the low bits)
  repeat while long[cmdptr]            ' wait for VMCOG to clear the command long
  return long[dataptr]                 ' result is left in the data mailbox long
I've disabled it in my test version now; I wanted to see if it made any real difference.
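For reference, a minimal sketch of what the access method looks like with that first wait removed (assuming, per the question above, that every access method already waits for completion before returning):

PUB rdvbyte(adr)
  long[cmdptr] := (adr<<9)|READVMB     ' issue the read-byte command straight away
  repeat while long[cmdptr]            ' single wait: spin until VMCOG has completed it
  return long[dataptr]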
Currently I am trying the new 'x' diagnostic to see if I can find the problem. I only have one page in the working set, ensuring that 99% of probes cause a page fault.
I improved the test some more, forcing each probe to be on a different random page, with only one page in the working set. Hopefully this will reduce the length of each test run.
update:
I now also check the initial fill pattern *before* the random test, to make sure the fill went OK.
I've added VMCOG v0.985 to the first post in this thread, and I've synced VMDEBUG to v0.985 as well.
This version passed >1.3M probes of the 'x' test before it was stopped by having to reboot due to the USB mouse crashing, and it also completed fibo(26) on zog v0.14 with 20 pages. I will now try a few smaller working sets!
I also integrated the latest TRIBLADE_2 code from heater, and if no one finds any problems in the next two days, this will be renamed as VMCOG v1.00
UPDATE:
- fibo(26) worked with 5 pages
- fibo(26) worked with 3 pages
heater - can you try 0.985? I am pretty sure the bug is fixed! I am running it now!
Looks like only 6KB of VM is being used (12 pages × 512 bytes = 6KB) - so it should not be surprising that the 12 page working set result is the same as for a 30 page working set.
Clobbering "temp" you say. One of the oldest classes of bugs in the book "Over aggressively recycling variables". Us "many eyeballs" out here should have been all over that from the beginning in the face of random looking failures. Shame on us.
Any way, well done, excellent piece of detective work.
Fibo is also working for me with v0.985. I have tried 20 pages and then 6, 5, 4, 3, 2 and 1 pages. It does not slow down until 2 pages!
Still, this fibo is only a 3K binary, of which I know 1K is not used, so this is not a surprising result.
It seems to run fine in 1 page - it's like watching an old teletype - it hasn't finished yet.
That "minor" optimization for RDVMBYTE is actually quite dramatic. From your results:
before:
fibo(18) = 2584 (1039ms)
after:
fibo(18) = 2584 (981ms)
That's 5.9 or 5.5 percent, depending on how you look at it (58ms saved, measured against the new or the old time).
Why is the store through cmdptr done with word[cmdptr] instead of long[cmdptr]? Doesn't this mean that only the low-order word of the $FFFF0000 constant (i.e. zero) will be stored, and hence the repeat loop at the end of my excerpt will never wait, even if the initialization of the PASM COG is not complete?
long[vaddrptr] := lastp-((nump-1)*512) ' start of the working set (last page address minus nump-1 pages of 512 bytes)
long[dataptr] := nump ' number of pages in the working set
word[cmdptr] := $FFFF0000
#ifdef XEDODRAM
longfill(@xmailbox,0,2) ' ensure command starts as 0
long[mailbox+12]:= @xmailbox ' 3rd vmcog mailbox word used for xm interface
xm.start(@xmailbox) ' start up inter-cog xmem
#endif
cognew(@vmcog,mailbox)
fcmdptr := @fakebox
fdataptr := fcmdptr+4
repeat while long[cmdptr] ' should fix startup bug heater found - it was the delay to load/init the cog
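Bill's later reply ("Thanks for spotting a bug! Fixing...") confirms this is indeed a bug. A minimal sketch of what the fix would presumably look like, assuming the intent is for that final wait loop to spin until the PASM cog clears the mailbox after loading:

long[cmdptr] := $FFFF_0000             ' store the full long (not just the low word, which is zero),
                                       ' so "repeat while long[cmdptr]" really does wait for cog init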
Thanks. This was VERY hard to find; I had to write the really nasty VM exerciser command 'x' and run it repeatedly before I found a pattern that narrowed the search down enough to locate the problem. No shame for anyone - these bugs happen.
VMCOG performance is good, as you noted - running fibo at close to 100% speed with a working set only 1/6th the fibo memory footprint!
Ok, >5% is more than a minor speedup on something like this
Clobbering "temp" you say. One of the oldest classes of bugs in the book "Over aggressively recycling variables". Us "many eyeballs" out here should have been all over that from the beginning in the face of random looking failures. Shame on us.
Any way, well done, excellent piece of detective work.
Fibo is also working for me with v0.985. I have tried 20 pages and then 6,5,4,3,2 and 1 pages. It does not slow down until 2 pages!
Still, this fibo is only a 3K binary of which I know 1K is not used so this is not a surprising result.
It seems to run fine in 1 page, it's like watching an old teletype, hasn't finished yet.
That "minor" optimization for RDVMBYTE is actually quite dramatic. From your results:
before:
fibo(18) = 2584 (1039ms
after:
fibo(18) = 2584 (981ms)
That's 5.9 or 5.5 percent depending how you look at it.
I have some minor cleanup to do before releasing the magic v1.00 ... as you may have noticed, I use VMCOG/VMDEBUG as a generic test platform for all sorts of memory and SPI goodies.
re/2 (the first wait loops in the Spin access methods):
I am considering dumping them. I initially had two sets of waits due to paranoia, and intended to try to remove them around the v1.00 mark.
I've been thinking about this:
1) The fibo test only times the execution of the actual fibo call in question; all the extraneous looping and printing is excluded.
2) The actual fibo routine is small, so it has a good chance of fitting within a page if it happens to land in the right position.
3) The fibo routine uses no data; it's all done on the stack - apart from the fact that, for some strange reason, results are returned through a pseudo register at address zero.
So the stuff that we are timing here only needs 3 pages, one for code, one for stack and one for the pseudo register.
As we see it runs the same speed in 3 pages as 20.
My measurements on fibo therefore show that the overhead of VMCog, when everything is in the working set, is about 150% - i.e. roughly 2.5x the hub-only time. For example:
fibo(18):
423ms from HUB
1074ms from VMCog
What a sensible benchmark for this would be I have no idea - something with more code and data scattered throughout memory.
Still, in this modern world of processors being much faster than their memory, and therefore sprouting huge caches, they do say the best optimizations you can make are to arrange for your code and data to stay in cache as much as possible.
For example, scanning through a two dimensional array can be much faster if you do it in the correct row/column order.
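To make the row/column point concrete, here is a minimal sketch (not from VMCOG or VMDEBUG) of scanning a VM-backed 2D array both ways. vm.rdvbyte() is VMCOG's byte-read method; ROWS, COLS and the object filename are assumptions made up for the illustration:

CON
  ROWS = 32
  COLS = 512                                  ' one 512-byte VM page per row, for illustration

OBJ
  vm : "VMCOG"                                ' assumed object filename

PUB sum_row_major(base) : total | r, c
  repeat r from 0 to ROWS-1
    repeat c from 0 to COLS-1
      total += vm.rdvbyte(base + r*COLS + c)  ' sequential addresses: each page faults in once

PUB sum_column_major(base) : total | r, c
  repeat c from 0 to COLS-1
    repeat r from 0 to ROWS-1
      total += vm.rdvbyte(base + r*COLS + c)  ' strides a whole page per read: with a small
                                              ' working set this can fault on nearly every access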
Is it done yet?
Ha, no, it may never be.
Don't worry, it runs fine, but despite the fact that this entire building has been rewired over the summer, every time the fridge in this apartment switches on or off it resets my TriBlade. The fridge does not leave enough time for the one-page fibo to complete.
By "close to 100% speed" I was referring to basically same speed for 3 or 20 page working set, not comparing to hub-only
I agree with your analysis, fibo really only seems to need 3 pages, due to "locality of reference" - which is why VM worked on mainframes, and works on PC's, as most of the time, most of the code/data of a program really does not need to be in the working set - and as you point out, it can help a lot to organize data/code in a VM friendly manner.
I am actually very pleased that VMCOG seems to run at 2/5th the speed of hub-only interpreters!
I think we need a bunch of benchmarks besides fibo() - the old Byte benchmark set comes to mind; I think it included Dhrystone and Whetstone, which as I recall you have already run.
I am beginning to wonder whether an LCC re-targeted for ZOG might be small enough to run under ZOG/VMCOG once I get large VM support going... it would be slow, but it would be so nice to compile C code right on a Prop!
Re/ 1 page working set - I ran my 'x' test for >1.3M probes before I stopped it, which caused >1.3M page faults...
Yeah, it occurred to me that's what you meant right after I posted.
Don't get me wrong, I think VMCog is doing fine.
I think we need a bunch of benchmarks beside fibo() - the old byte benchmark set comes to mind, I think it included dhrystone and whetstone, which as I recall you have already run.
Yep, I'm on it already. Whetstone is a disaster at the moment because it uses software floating point which is huge and slow. I want to provide a floating point support COG using float32. Not much time to play just now.
I am beginning to wonder whether an LCC re-targeted for ZOG might be small enough to run under ZOG/VMCOG
Now who do we know who has the know how to do that?
It would be great. Actually I was worrying about this recently. Currently C under Zog is totally dependent on the GCC target by ZyLin Inc., which is already based on an old version of GCC. I don't get the impression that ZyLin has the incentive to keep it up to date - already I can't build the compiler under Ubuntu. Slowly it might rot away.
Bill, I plan to port the SDRAM driver to VMCOG. If you can make room, I can put in the fast burst code rather than the slower loop code. Have a look at the last SdramCache.spin file I posted to understand the challenges. If it makes sense performance-wise and VMCOG can be expanded to support a larger backstore memory, I can drop the cache interface entirely. Consider it a casual challenge.
--Steve
No worries... VMCog is performing pretty much as I was hoping it would - being able to run large pieces of code with a roughly 2.5x slowdown, versus not being able to run them at all... BIG WIN.
Dhrystone is a good start
re/LCC ... Paging BradC....
btw,
VMCOG will SCREAM on PropII (when that arrives).
8x faster PASM, 16x faster hub access (compared to the P1), without even considering multi-long reads/writes...
I would not be at all surprised if ZOG/VMCOG hit 10MIPS+
It took 517,684 probes, but I finally got a read error!
Now to change the code to dump the TLB on error and stop running!
Added: Updated VMDEBUG
No idea, but Zog does not have any such first wait loop in its PASM access routines.
Very interesting!
vm.rdvword() returned the wrong result... but the re-read gave the correct one, as the second test puts in the correct value.
I modified 'x' so that it will only run the second test if there is no error in the first... time to run the test again...
I am attaching the latest vmdebug.
I am MUCH happier now that I can reproduce the problem, as it should now be much easier to find the culprit code.
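In outline, the modified 'x' probe works something like the sketch below - this is not the actual VMDEBUG code; vm.rdvword() is the method named above, while vm.wrvword() and the helper/parameter names are assumptions for illustration:

PRI probe(addr, expected, newval) : ok
  ' first test: verify the value we expect to find at this address
  if vm.rdvword(addr) <> expected
    report_error(addr, expected)       ' hypothetical error reporter
    return false                       ' on error, skip the second test
  ' second test: write a new value and read it back
  vm.wrvword(addr, newval)
  return vm.rdvword(addr) == newval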
And yes, it was a dumb "clobbering" of a temporary variable... called "temp", that would only occur in shr_hits!
I am now testing the fix.
Have you tried all of the memory drivers? I'm having trouble getting the MORPHEUS1 build of this new version to work with my Hydra SRAM board, but it might just be the way I edited in the pin differences between your board and mine.
Have you considered adding my pin definition update to the MORPHEUS1 build to make it easier to switch to different pins?
Thanks,
David
No, I just got it going a short while ago on PropCade. Tomorrow I will try MORPHEUS1.
I'll take a peek at your pin definition update too
This was with a 30 page working set.
I can see a few more optimizations if I use more cog memory...
Would you believe it runs in the same time in 6 pages?
UPDATE:
Here is the run with a *THREE* page working set:
SAME AS WITH 12-30 PAGES FOR THE LARGER FIBO RUNS!
Clobbering "temp" you say. One of the oldest classes of bugs in the book "Over aggressively recycling variables". Us "many eyeballs" out here should have been all over that from the beginning in the face of random looking failures. Shame on us.
Any way, well done, excellent piece of detective work.
Fibo is also working for me with v0.985. I have tried 20 pages and then 6,5,4,3,2 and 1 pages. It does not slow down until 2 pages!
Still, this fibo is only a 3K binary of which I know 1K is not used so this is not a surprising result.
It seems to run fine in 1 page, it's like watching an old teletype, hasn't finished yet.
That "minor" optimization for RDVMBYTE is actually quite dramatic. From your results:
before:
fibo(18) = 2584 (1039ms
after:
fibo(18) = 2584 (981ms)
That's 5.9 or 5.5 percent depending how you look at it.
A couple of minor points:
1) Out of the box, vmdebug does not compile when building for TriBlade, as some tests use vmcog methods that don't exist in that case.
2) Are you meaning to do away with those first wait loops in the Spin access methods?
My 1 page fibo run is still running....
Thanks for spotting a bug! Fixing...