- set up one 23K256 on my PDB
- I adapted Andy's original test routine to use FullDuplexSerialPlus
- I changed the test read/write values when I saw it was not working
- tried disconnecting the data line and read all 0's, so when connected it IS getting data
- my best guess is that the write is not writing correctly
I am attaching an archive of the current test code to this post
I updated the test code above, and here is the output of a new run... looks like writes are not happening correctly in some cases.
Best guess: the CLK edge needs to be inverted, or it happens before the data has settled.
Alternate guess: 23K256 can't handle 20Mbps SPI clock, even though the data sheet says it should.
I am attaching the output from my running SPI_Test2 in the archive above.
I just tried my driver again, and it works with my RAMs.
Here are some possible changes at the beginning of dowrite to shift the clock-to-data relation:
1) Clock comes a bit earlier:
mov frqa,freqw 'frequency of SCK
mov phsa,phsr 'start clock
mov ctra,ctramode 'send cmd,address and data with 20MHz clock
rol phsb,#1
rol phsb,#1
...
2) Another way:
mov frqa,freqw 'frequency of SCK
neg phsa,freqw 'try also freqr phsr 'start clock
rol phsb,#1
mov ctra,ctramode 'send cmd,address and data with 20MHz clock
rol phsb,#1
rol phsb,#1
...
Both still work with my RAMs.
And attached is an earlier driver with only 10MHz SCLK.
So it looks like there is a subtle difference between the two RAMs as far as clocking goes.
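For readers without the Propeller counter registers in front of them, the clock-to-data relation being adjusted above can be sketched generically (this is illustrative Python, not the PASM counter code): in SPI mode 0 the data line must be driven and settled before the rising SCK edge on which the 23K256 samples it, which is exactly what moving the clock start earlier or later changes.

```python
def shift_out_msb_first(value, drive_mosi, clock_rising_edge):
    """Bit-bang one byte, MSB first: drive data first, then the sampling edge."""
    for i in range(7, -1, -1):
        drive_mosi((value >> i) & 1)  # data settles while SCK is low...
        clock_rising_edge()           # ...then the slave samples it

class SamplingSlave:
    """Minimal model of a slave that latches MOSI on each rising SCK edge."""
    def __init__(self):
        self.mosi = 0
        self.bits = []
    def drive(self, bit):
        self.mosi = bit
    def rising_edge(self):
        self.bits.append(self.mosi)
    def received_byte(self):
        v = 0
        for b in self.bits[-8:]:
            v = (v << 1) | b
        return v
```

If the clock edge arrives before `drive_mosi` has taken effect, the slave latches the previous bit — the failure mode guessed at above.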
I have tried all three methods you suggested above (and some other variations) - no dice, the best I could get was 226 errors (out of 256 write/reads) with your last suggestion.
I have also tried your older 10Mbps write, and got similar results.
What I will try today:
- run it under ViewPort and look at the waveforms at 80Msps - should show me what is going wrong
- try different 23K256's (I've tried two different ones so far)
- try a different prop (in case I managed to damage this one in the past)
- if that does not let me fix it, I will substitute my own 5Mbps write routine
- if that does not work, I will fall back on 5Mbps read / 5Mbps write for now (without counters)
- I wonder if it could be extra capacitance due to running it on a breadboard?
- if you like, I can snail-mail you a couple of 23K256's to try (PM your address to me)
I will also work on the generic VMCOG part, and VMCOG Debugger - I want this running soon, for CogZ, ZiCog and MotoCog!
- Fixed a dumb bug in the TLB
- various minor optimizations and fixes (VMFLUSH, VMINIT, VMDUMP)
- renamed debugger to VMDebug
- VMDump now works correctly!
- updated docs in first post to v0.22, adds vm.GetPhysVirt and vm.GetVirtPhys
- added GetPhysVirt code and debugged it
- added hex dump routine for showing a page
- fixed a bug in VMFLUSH
- VMREADB works if page present in working set, correctly notices if it is not present (does not page in yet!)
- VMREADB correctly updates access counter in TLB
VMDebug Main Menu
1) View Page Table
2) Flush Page Table
3) VMREADB
4) VMREADW
5) VMREADL
6) VMWRITEB
7) VMWRITEW
8) VMWRITEL
9) Show VM Page
READY>
Menu options 1,2,3,6 and 9 are functional (with the exception of swapping the pages in/out)
Using '1' you can see how many times any resident page has been accessed (the count is incremented on every read and/or write); any write will also set the "DIRTY" bit
You can reset the TLB using '2'
You can view any virtual page that is in the working set by its virtual address using '9'
Read a virtual byte with '3' and write it with '6'
The new archive is attached to the first post. Enjoy!
THEN unaligned versions of rdvword, wrvword, rdvlong, wrvlong
NOTE:
The VM is currently set up for only four physical pages with "vm.start(@mailbox,$7C00,4)"
If you want 16KB of physical memory for your virtual space, use "vm.start(@mailbox,$7C00,64)"
The working set starts at hub page $7C00 and grows down.
For now the working set is NOT swapping in/out to SPI RAM... I want to get all the vm messages working first.
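As a sketch of that layout (assuming $7C00 is the base address of the topmost page; the function name is illustrative, not part of VMCOG):

```python
PAGE_SIZE = 256

def working_set_pages(top_page, npages):
    """Base addresses of the physical pages in a working set that starts
    at hub page `top_page` and grows downward, as in vm.start(@mailbox,$7C00,n)."""
    return [top_page - PAGE_SIZE * i for i in range(npages)]
```

So "vm.start(@mailbox,$7C00,4)" corresponds to pages at $7C00, $7B00, $7A00 and $7900, and 64 pages gives the full 16KB.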
Edit: Minor bug: the page access count in the TLB is shown wrong because it is incremented by $400 instead of 1 after the TLB format change. I have a fix and will upload it later. Fixed in v0.24
- added code aligned WORD and LONG reads
- BYTE/WORD/LONG reads from memory in the working set all work! (for now WORD and LONG must be aligned)
- BYTE write to memory in the working set already works, so....
NEXT:
- aligned versions of WRVWORD and WRVLONG
THEN:
- add support for non-aligned read/write for WORD and LONG
I've just had a strong coffee, my head is clear and I've just read through the latest code. Looking very nice. Clearly there are some basic building blocks to get working first and serial ram is one of those.
Right at the beginning of the thread was a comment that this virtual memory would be slower than latched ram. I am beginning to wonder if it actually could be faster? I've just gone through the dracblade sram driver and it is 24 instructions to read or write a byte.
Consider- the zicog requests a byte. It posts the address to a location in hub ram. Another cog is polling that address. It will find it on the next cycle. It will need to decode that address, check if the 256 byte packet is in propeller ram, read it and return the value. That might still only be half the number of instructions of latched ram, if the packet is available.
A couple of basic questions. Where are you storing cache ram? In the cog or in hub?
If in hub, I'm wondering whether you could reclaim the memory used loading cogs, which will be in random but known locations around the hub ram. Pass the locations to the vm driver at the beginning of the program and the vm program can store 256 byte packets wherever it can. More complex code of course, but potentially up to 14k of ram space in the cache.
The algorithms for handling packets can be simple or complex. Complex is harder to understand so spin may be easier to code first. I'm still trying to understand the least used packet concept. Ok, now my brain is starting to hurt, but I'm going to keep thinking about this as I can see some algorithms that could well be faster than latched ram and which potentially free up a whole lot of pins for better uses, eg analog I/O, more serial ports etc. One algorithm ends up using a sort routine, I'm trying to avoid that one if possible, except that it might end up being the most efficient and given a whole cog has been assigned managing memory, may as well use it to its full potential. brb - off to read en.wikipedia.org/wiki/Cache_algorithms
Thank you... and I agree, there are missing pieces. I wanted to get a good "framework" in place first, cleanly documented (so others can make changes), with a certain level of basic functionality (virtual to physical address mapping, TLB lookups etc) so I can test the read/write routines before attaching the backing store.
I find it easier to debug code piecemeal :)
You are correct, it may end up faster overall than the latched driver. Only time will tell!
Frankly I already see several optimizations for VMCOG, but I will wait until it is running before starting serious optimization. The first one will be to combine the command and virtual address, that way a client program (ie ZiCog) will only need to write one long to the hub to request a read of a byte/word/long. I will add an "Optimizations" heading in the first post to describe what I will do.
I am running seriously short of memory in VMCOG, which is why I refactored it twice yesterday - I need to make enough room for the SPI RAM driver.
The "working set" pages (cache / physical memory) is stored in the hub, from address $7C00 down. The start method of VMCOG sets the top address, and the number of 256 byte pages to use. I recommend using 16..64 pages, and I will later add some performance monitoring tools so that it will be possible to tune the size of the working set to the emulator being run, and the program running under the emulation.
I *REALLY* like your suggestion of adding "random" pages that are used to load cogs with drivers!
There would be one limitation - any added pages would have to be 256 byte aligned, as otherwise the TLB could not hold a pointer to them (I need to keep the use counter as large as possible). A bit of memory would be wasted, but every additional page will help!
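One way to see why the 256-byte alignment matters: if a TLB entry is a single long holding both the page's hub address and its use counter, the packing might look like this (the 10-bit counter width is an assumption for illustration, not VMCOG's actual format):

```python
COUNT_BITS = 10                        # illustrative split, not VMCOG's real one
COUNT_MASK = (1 << COUNT_BITS) - 1

def make_entry(page_addr):
    """Pack a 256-byte-aligned hub page address above the counter bits."""
    assert page_addr & 0xFF == 0, "page must be 256-byte aligned"
    return page_addr << COUNT_BITS     # counter starts at zero

def touch(entry):
    """Bump the use counter, saturating so it cannot spill into the address."""
    if entry & COUNT_MASK != COUNT_MASK:
        entry += 1
    return entry

def page_of(entry):
    return entry >> COUNT_BITS
```

An unaligned page would need more address bits than the entry has room for, which is exactly the trade-off against keeping the use counter as large as possible.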
Later I will modify the start method, and VMFLUSH, to take an array of bytes, which will represent the additional pages to add. The first "0" byte would stop adding pages.
In order to better understand the LRU algorithm, also read the links I put in the first thread - they discuss virtual memory page replacement policies. Also read about the difference between "write through" and "delayed write" designs - initially I am implementing "write through", but later I will also add "delayed write" - I am building in full support for "delayed write" from the start.
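The difference between the two write strategies, in a minimal sketch (names and the 256-byte page object are illustrative):

```python
class Page:
    def __init__(self, size=256):
        self.data = bytearray(size)
        self.dirty = False

def vm_write(page, offset, value, backing, write_through):
    """Write-through updates the backing store on every write; delayed
    write only marks the page dirty and defers the flush."""
    page.data[offset] = value
    if write_through:
        backing[offset] = value       # SPI RAM updated immediately
    else:
        page.dirty = True             # flushed later, at page-out time

def evict(page, backing):
    """Page-out: with delayed write, this is where the SPI RAM gets written."""
    if page.dirty:
        backing[:] = page.data
        page.dirty = False
```

Write-through keeps the backing store always consistent at the cost of an SPI transfer per write; delayed write batches them, which is why the dirty bit is built in from the start.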
Regardless of VMCOG, SPI RAM cannot be faster than parallel ram - even latched parallel ram - however it may be possible to get 90%+ of latched designs performance. Only benchmarks will really tell though.
1) I can regain 11 longs by re-using initialization code for variables
2) I can factor out some more code, at expense of run time - probably not an issue with writes
3) Use 512 byte pages - for a 64KB VM that would free 128 longs in VMCOG
4) Use slower SPI code - the unrolled SPI code Andy provided definitely would not fit
5) Use external SPI cog - but I want a single cog version!
6) Move TLB to HUB, adds performance hit, but allows much larger VM, and frees up (256-(#pages in working set)) longs in the cog
heater said...
Delicious irony: Running out of memory in which to implement virtual memory!
I don't think we can tolerate a two COG VM/external memory system.
I mean, we have one COG running Zog, ZiCog, Catalina or whatever it is that will be using the Virtual Memory.
Then we have the Virtual Memory COG.
Then we have, probably, a COG, taking care of I/O (console, files, UARTS etc) from the Zog/ZiCog/Catalina whatever code.
That's starting to eat a lot of Prop to run some external memory program.
Only down side to 512 byte pages is fewer different pages in the working set.
What I am currently thinking of is this:
- Initial SPI implementation will be with 4Mbps bit-banging rolled up SPI, which I can make fit with 256 byte pages
- Use this to gather the following stats:
# total reads per second
# total writes per second
# of reads handled from working set per second
# of writes through to backing store per second
# of page-ins per second
# of page-outs per second
And have a spin program log those stats while running various CP/M (and ZOG) programs.
By varying the number of pages in the working set we will see how big it has to be for different uses, and how apps slow down with a smaller working set.
I can also do the runs with a write-through strategy, and with a delayed-write strategy, to see which works better for what use.
Then later do a 512 byte page version, re-run stats
Then 512 byte page version, with 10Mbps SPI, re-run stats
Frankly, I don't expect a large difference in overall speed due to 4Mbps or 10Mbps SPI, I expect 95%+ hit rate to working set - which will keep it in one COG :)
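A sketch of the counters involved (field names are illustrative); dividing any of them by elapsed seconds gives the per-second rates listed above:

```python
class VmStats:
    """Raw counters for the planned VMCOG statistics gathering."""
    def __init__(self):
        self.reads = 0        # total reads
        self.writes = 0       # total writes
        self.ws_reads = 0     # reads satisfied from the working set
        self.wt_writes = 0    # writes passed through to the backing store
        self.page_ins = 0
        self.page_outs = 0

    def read_hit_rate(self):
        """Fraction of reads served without touching SPI RAM."""
        return self.ws_reads / self.reads if self.reads else 1.0
```

The read hit rate is the number that decides whether the 95%+ expectation holds, and therefore how much the raw SPI bit rate actually matters.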
I *REALLY* want to keep the "standard" VMCOG in one cog, without a hub based TLB, with SPI routines in the same cog - precisely to conserve cogs.
For some uses, a two-cog approach for extra read/write speed may be appropriate, but the standard VMCOG is for one cog :)
I think that latched address XMM interfaces like DR_Acula's would also fit in VMCOG.
Also, I have already thought of some further optimizations, but I want to get the initial version working first - to have a "reference" platform.
heater said...
What's wrong with 512 byte pages?
In that version you posted I see lots of spin and some pasm. Which bit is running out of memory?
Nothing wrong with 512 byte blocks. Maybe even 1024 or 2048? It means the list of entries gets smaller too.
I would very much imagine that if you started using bits of hub ram that were left over from loading cog code, you would try to make all pasm blocks fit into 2048 bytes (or whatever it is). Is that the FIT command?
The first one that could be used could be the unused hub ram from the vm code itself. Do you need some sort of pointer right before the pasm code so you can find out its memory location in hub once it is compiled, then use an @ on that pointer (and perhaps add 4 bytes) to get the actual pasm location? Maybe there is a simpler way?
I'm very confused by the "*least* recently used" algorithm, especially when the wikipedia article goes on to say another perfectly valid algorithm is to discard the "*most* recently used". Presumably you need a clock of some sort and a counter to store when a block was accessed?
Intuitively, I'd suspect that 'least *frequently* used' would be better, as it puts a higher weighting on say, 100 hits in sequence to one block than 'least *recently* used'. But to muddy the waters even further, there is an algorithm en.wikipedia.org/wiki/Adaptive_Replacement_Cache that combines least recently used and least frequently used.
From a practical perspective, it may come down to which one can be coded in pasm and actually fit in a cog.
I have been pondering an algorithm where there is a list of blocks and an access to block n triggers a swap between block n and block n-1. So each access to a block moves that block one up the list. It is a simple bubble sort. If a block is at the top of the list then it just stays there. Thinking about this more, it might take a few instructions to do this, but if the first thing the vm cog does is return the byte, then (say) the calling program, zicog or whatever, is going to get on with processing that byte so it will be busy for a while and won't be asking for any more. So the vm cog could use this time to do the housekeeping associated with swapping those two blocks (or more to the point, swapping the two pointers in the list).
Another issue is how the vm cog searches for blocks. Presumably it starts at the top of the list as it is most likely to find a cache hit at the top as those are the popular blocks.
Hmm - 2048 byte blocks do decrease the memory needed for the list. Say you only used the 7 leftover bits of hub ram from loading up cogs, and maybe just one 2048 byte block at the top of hub ram to make the count 8, as 8 is kind of neater in a binary way. 64k has 32 blocks, and 8 of those are in hub ram at any one time. The bubble sort list is only 8 bytes, and there is another array with 8 longs that store the hub ram locations of the leftover cog load space. Zicog requests a byte at location 5000H. Divide by 2048 (800H), which is a shift instruction, to get the block number 00 to 1FH (here 0AH). Search the 8 bytes in the cache array list for a match - up to 8 instructions, but it might return a match in 1 or 2. (If no match, branch to much longer code to load a block from SPI.) If there is a match, look up the hub ram location - one or 2 instructions. Find the offset, 5000H - (0AH*800H); there has to be cunning pasm code to find the remainder - possibly as simple as a shift of n bits and then a subtract. Look up the hub byte and return it. Then (and this doesn't count in the timing calculation) do a single bubble sort swap, unless this block is already at the top of the list. Even the list isn't really a list if it is only 8 entries - that is just two longs.
If all that comes in under 24 instructions, this code will be faster than latched sram.
Is 2048 bytes too big? If you were emulating a big spin program, realistically you would have one block for the linear part of the current code, and then the others would probably end up with the popular subroutines, whatever they might be. For CP/M, the most popular ones would be block 0, then probably a couple of blocks in CP/M itself that handle the I/O, and then the remainder would be the program that is loaded in at 0100H. Big blocks certainly makes the lists smaller. For 'big spin', a guide might be the size of the average subroutine, and they do tend to be small. More thinking required.
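The block-lookup arithmetic described above can be sketched as follows (an 8-entry cache, block number by shift, offset by mask; all names are illustrative):

```python
BLOCK_SIZE = 2048            # so the block number is addr >> 11

def lookup(addr, cached_blocks, hub_bases):
    """Return the hub address of `addr` if its 2KB block is cached.
    cached_blocks[i] is the block number held in cache slot i, and
    hub_bases[i] is where that slot lives in hub ram."""
    block = addr >> 11                           # divide by 2048 with a shift
    for slot, b in enumerate(cached_blocks):     # linear search, up to 8 tests
        if b == block:
            return hub_bases[slot] + (addr & (BLOCK_SIZE - 1))
    return None                                  # miss: load the block from SPI RAM
```

Note the "remainder" needs no subtract at all: since the block size is a power of two, masking with BLOCK_SIZE-1 (7FFH) gives the offset directly.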
What is using up all the code space - is it the SPI code?
I prefixed what you wrote with ">>" so I could inter-space my responses.
>> In that version you posted I see lots of spin and some pasm. Which bit is running out of memory?
VMCOG, the actual driver cog is running out.
>> Nothing wrong with 512 byte blocks. Maybe even 1024 or 2048? It means the list of entries gets smaller too.
At some point I am *VERY* likely to switch to 512 byte blocks, even though it cuts in half the number of different pages that can be loaded at once.
I deliberately went with 256 byte pages in order to be able to gather statistics on how many unique pages are heavily used, and how they cluster together (i.e., determine the "natural" page size for ZiCog).
>> I would very much imagine that if you started using bits of hub ram that were left over from loading cog code, you would try to make all pasm blocks fit into 2048 bytes (or whatever it is). Is that the FIT command?
The FIT command can be used to see how many cog locations are left in the current cog image at any point.
I can support adding "leftover" hub ram, but it must be page aligned, and a multiple of pages long... so there will still be some wastage.
>> The first one that could be used could be the unused hub ram from the vm code itself. Do you need some sort of pointer right before the pasm code so you can find out the memory location in hub once it is compiled, then use an @ on that pointer (and ? add 4 bytes) to get the actual pasm location? Maybe there is a simpler way?
Good point!
I am however going to be loading cogs using a shared 2KB buffer, from I2C EEPROM or SPI flash / SD card.
>> I'm very confused by the " *least* recently used" algorithm. Especially when the wikipedia article goes on to say another perfectly valid alorithm is to discared the "*most* recently used". Presumably you need a clock of some sort and a counter to store when a block was accessed?
I am initially implementing LRU because it is easy to implement, and tends to work in a near-optimal fashion, as the most frequently used blocks will be swapped out last. I am also adding "aging" so eventually blocks that were used very frequently in the beginning, but not recently, will be swapped out.
>> Intuitively, I'd suspect that 'least *frequently* used' would be better, as it puts a higher weighting on say, 100 hits in sequence to one block than 'least *recently* used'. But to muddy the waters even further, there is an algorithm en.wikipedia.org/wiki/Adaptive_Replacement_Cache that combines least recently used and least frequently used.
Too complicated for the amount of available cog memory; plus, in my experience LRU + aging works best on limited memory systems.
>> From a practical perspective, it may come down to which one can be coded in pasm and actually fit in a cog.
Exactly!
>> I have been pondering an algorithm where there is a list of blocks and an access to block n triggers a swap between block n and block n-1. So each access to a block moves that block one up the list. It is a simple bubble sort. If a block is at the top of the list then it just stays there. Thinking about this more, it might take a few instructions to do this, but if the first thing the vm cog does is return the byte, then (say) the calling program, zicog or whatever, is going to get on with processing that byte so it will be busy for a while and won't be asking for any more. So the vm cog could use this time to do the housekeeping associated with swapping those two blocks (or more to the point, swapping the two pointers in the list).
I will achieve similar results by "aging" the LRU counters; there are also several different aging algorithms to choose from. Personally, I like dividing all the counts by two at pre-defined intervals (either based on time, or based on number of accesses). This simple strategy tends to work quite well, and takes very little code (cog space) to implement.
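The halving strategy really is tiny in code terms; a sketch (illustrative Python - VMCOG keeps these counts inside the TLB entries rather than in a separate list):

```python
def age(counters):
    """At a fixed interval, halve every page's access count, so pages that
    were hot long ago but are idle now drift toward eviction."""
    for i in range(len(counters)):
        counters[i] >>= 1

def pick_victim(counters):
    """LRU-with-aging eviction: swap out the page with the lowest count."""
    return min(range(len(counters)), key=lambda i: counters[i])
```

After a couple of aging passes, a page with a burst of early hits and no recent activity loses its lead over pages in steady use.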
>> Another issue is how the vm cog searches for blocks. Presumably it starts at the top of the list as it is most likely to find a cache hit at the top as those are the popular blocks.
I implemented a "direct mapped" TLB, which means it finds out whether the requested page is in memory in very few instructions.
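The direct-mapped lookup, sketched for a 64KB VM with 256-byte pages (treating a zero entry as "not resident" is an assumption for illustration, not necessarily VMCOG's encoding):

```python
PAGE_SHIFT = 8               # 256-byte pages

def tlb_lookup(tlb, vaddr):
    """Index the 256-entry TLB directly with the virtual page number:
    residency is decided in a couple of operations, with no searching."""
    entry = tlb[vaddr >> PAGE_SHIFT]
    if entry == 0:
        return None                        # not resident: page it in
    return entry + (vaddr & 0xFF)          # hub address of the requested byte
```

This is why no list search is needed: the virtual page number is itself the index, unlike the linear scan in the bubble-sort scheme.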
>> Hmm - 2048 byte blocks does decrease the memory needed for the list. Say you only used the 7 leftover bits of hub ram from loading up cogs. And maybe just one 2048 byte block at the top of hub ram to make the count 8, as 8 is kind of neater in a binary way. 64k has 32 blocks, and 8 of those are in hub ram at any one time. The bubble sort list is only 8 bytes, and there is another array with 8 longs that store the hub ram locations of the leftover cog load space. Zicog requests a byte at location 5000H. Divide by 2048 (800H) which is a rotate instruction to get the block number 00 to 1FH (0AH) Search the 8 bytes in the cache array list for a match. Up to 8 instructions but might return a match in 1 or 2. (If no match, branch to much longer code to load a block from SPI). If got a match, look up the hub ram location. One or 2 instructions. Find the offset - 5000H - (0AH*800H) and there has to be cunning pasm code to find the remainder - possibly as simple as a rotate n bits and then a subtract. Look up the hub byte and return it. Then (and this doesn't count in the timing calculation, do a single bubble sort swap, unless this block already is at the top of the list). Even the list isn't really a list if it is only 8 entries - that is just two longs.
I strongly suspect that using 2K pages would be very counterproductive on the prop, where less than 32K is available as the working set. If 16KB was allocated to the working set, only 8 unique pages could be resident at once, leading to significantly more swapping.
I believe that 512 byte pages will be the best compromise, however this will be readily testable with statistics gathering that I will add to VMCOG.
>> If all that comes in under 24 instructions, this code will be faster than latched sram.
A request for a byte/word/long that is resident in the working set might very well come in under 24 instructions for bytes, and for aligned words and longs. I won't start counting instructions until everything works :)
>> Is 2048 bytes too big? If you were emulating a big spin program, realistically you would have one block for the linear part of the current code, and then the others would probably end up with the popular subroutines, whatever they might be. For CP/M, the most popular ones would be block 0, then probably a couple of blocks in CP/M itself that handle the I/O, and then the remainder would be the program that is loaded in at 0100H. Big blocks certainly makes the lists smaller. For 'big spin', a guide might be the size of the average subroutine, and they do tend to be small. More thinking required.
Yes, I am certain 2048 byte pages would be far too big.
I suspect 512 bytes will turn out to be the best compromise on Prop 1, and in some cases 1KB pages may work reasonably well.
Everything has to do with how the code accesses memory.
>> What is using up all the code space - is it the SPI code?
The 256 longs for the TLB leave only 240 longs for all the VM code AND the SPI code.
Currently it looks like I should be able to fit "rolled up" SPI code, which should give us about 4Mbps to/from the SRAM.
I think that there will be enough room to port VMCOG to DracBlade, tossing out the SPI code in favor of latched access code.
Later, I will make a 128 entry TLB version (using 512 byte pages) which will leave 368 longs for the VM code and SPI code - allowing unrolled 10Mbps reads and maybe 20Mbps writes.
It will be very interesting to see the performance differential!
v0.32 uploaded - you can find it attached at the end of the first post.
- refactored code to make more room
- see change list at the top of VMCOG.spin for more details
- start of unaligned read/write code
Ok, I made enough room so that the unaligned access code and "rolled up" SPI code should fit
NEXT:
- finish unaligned reads for word and long
- finish unaligned writes for word and long
- add SPI driver
- page in and out :)
- add aging code
I am hoping to release a fully functional v1.0 early next week.
I just counted the best case scenarios (when the addressed page is already in memory):
VMREADB currently takes 21 instructions (best case) if the command is in the hub when it looks at the mailbox
VMREADW and VMREADL currently take 22 instructions (best case) if the command is in the hub when it looks at the mailbox (and the word / long is properly aligned).
I have an optimization in mind that may allow me to shave 2-3 instructions in the best case, but I won't try it yet.
It looks like you are correct, VMCOG (in ideal circumstances) can be faster than your current latched driver cog!
Mind you, it will take a big performance hit whenever it has to read a page into memory.
So, while there is some double handling there, my understanding is that the Spin code becomes the glue for the compiler to link the locations for the cog code to known locations in hub ram. Many objects do this and they do it in different ways. The idea of decoupling cog code and its spin code sounds so simple, but I found after nearly a day's worth of coding that it is a huge job. It entails rewriting practically every object in the Obex to communicate via fixed locations in hub. Not only that, but also persuading all future authors of objects to write code using fixed locations.
Am I missing something there?
My concern is the vm object might be limited if it only could use a small number of obex objects that had been custom written. And how do you custom write objects that you don't understand and which fit in a cog by only a few bytes, like the zicog or the 4 port serial driver? This is the practical problem I came up against.
So this is why I was thinking it might be much easier to 'fill' unused cog space in any cog obex code to a multiple of 512 bytes as that presumably is just some filler bytes at the bottom of the code rather than trying to rewrite and understand the code.
Your thoughts on that issue would be most appreciated.
Second issue:
I am becoming more and more confident that this vm is a *very clever idea*. So much so that I believe it has the potential to be faster than sram. So, I am starting to ponder a new board design. Propeller, sd card, vga, keyboard, 2 serial ports, SPI for serial ram. That leaves some pins free, and those could be used for mouse, another serial port, sound and analog I/O. That makes the board more flexible than the dracblade. I think it could be half the size too (which means cluso might even get it into a matchbox!).
I found 32KB SPI RAMs fairly easily. Are bigger ones more expensive? Can you share pins between two SPI 32KB ram chips?
Dr_Acula said...
The idea of decoupling cog code and its spin code sounds so simple, but I found after nearly a day's worth of coding that it is a huge job. It entails rewriting practically every object in the Obex to communicate via fixed locations in hub. Not only that, but also persuading all future authors of objects to write code using fixed locations.
Where did this come from? If you're using fixed locations then - while doable - you are doing it wrong (in this particular context of SPIN/PASM communication). Just define a parameter block anywhere in HUB and pass its address to the cog when you start it.
I know that not all existing objects would be easy to rewrite with that approach but I'm somehow choking on that fixed location thingy :)
kuroneko, I'm still working out how this works. What I do know is I tried coding it and got very stuck. I also think Cluso was going to have a go - maybe he has some insights?
Putting cog code into somewhere external (eg I2C eeprom or on an SD card) to me entails two processes.
First, compile the object but with no cog code at all. Will it compile?
Second, compile the object with only the cog code and no spin code. Will it compile?
First problem is compiling an object with no cog code. It will fall over on all the cogstart and cogstop instructions. Those would need to be replaced with some way of passing the settings to the generic cog loader code (which presumably is just one cogstart for the whole program). Presumably also you need a smart generic cogstart and cogstop where you name which cog you want to start or stop.
Then compiling just the cog code with no glue spin code. Maybe it needs a fake cogstart/cogstop at the beginning as usually the first few lines of pasm code are reading in parameters.
Ok, for the simpler objects that only ever pass parameters via the start, this ought to work fine.
Maybe I got put off by starting with the hardest one, which is the zicog. This has many instances where the cog code is glued to random locations in hub that change every time you modify the program and recompile. Take this code fragment
variables b_reg, c_reg and d_reg are not defined as longs at the end of the cog code. They are defined as part of a huge list of variables in the Start of the zicog object
PUB start(cpu_params) : okay | reg_base, io_base
'' Start CPU simulator - starts a cog
'' returns false if no cog available
''
'' cpu_par_params is a pointer to CPU parameter list
stop
reg_base := LONG[cpu_params] ' memory locations in hub that contain these variables
c_reg := reg_base + 0
b_reg := reg_base + 1
Ok, maybe that is manageable as it all can be transferred over to the generic loader via the pointer cpu_params.
But then there are all the CON values that come up in pasm eg
CON
'Z80 flag bits
sign_bit = %10000000
and in pasm
muxnz flags, #sign_bit
So you have to have a new version of the object where all the CON values are defined at the beginning of the pasm code. But no need to define them now for the spin code. But what about CON values that were actually for the spin code, not the pasm code? Maybe it doesn't matter defining CON values that never get used in code. Copy and paste is simpler. But messy for someone trying to understand why there is a CON value that never gets used in that local program. So to make it neat you have to go through every CON value and assign them either to the spin code or the pasm code. Tedious, but doable.
Re
>It looks like you are correct, VMCOG (in ideal circumstances) can be faster than your current latched driver cog!
>Mind you, it will take a big performance hit whenever it has to read a page into memory.
Great news on the opcode count. I'm still hopeful the page reads will be quite infrequent.
The timing algorithm sounds intriguing.
Re a new board, the chip count is going to replace a 32 pin sram, four 20 pin latches and one 16 pin 138 with two 8 pin serial ram chips. That is a lot of board real estate that will be saved.
I've been looking at ram chips. They seem to max out at 32 kilobytes. I2C seems more expensive than SPI - is that correct? (I just had a crazy idea to go I2C, which happens to be replicated in the SD driver code. Messy, as it is a cog to cog transfer of information. But the SD code for I2C is likely to already be there, unused. Hmm. Digikey are $1.66 for 32k SPI and $2.74 for 32k I2C.) Maybe this could be an escape clause if the code won't fit?
>> Thanks for the detailed response. This is great brain exercise. I'm going to have to stop all alcohol and change over to coffee instead!
You are most welcome!
>> Ok, I understand and agree with all the points. The only bit I don't understand; "I am however going to be loading cogs using a shared 2KB buffer, from I2C EEPROM or SPI flash / SD card."
No problem really... that is how I plan to load drivers in my projects, no one is obligated to do it the same way :)
>> A practical problem. Take a typical piece of dracblade code, which is really a typical piece of spin code. I'll use the keyboard object.
I understand, however over time I believe the concept of "stdin" and "stdout" will catch on, and more "interface-friendly" objects shall arise
You have to understand that I have a roadmap in mind, which includes Largos, PropellerBasic, and all the assorted projects I've been working on; and when I say something like an i2c or SPI loader, I am looking to the future.
You will find that I intend to "lead by example" where it comes to loading from external storage and easier Spin interface objects.
Case in point:
VMCOG.spin can easily be modified to have the start method start the cog by (made up names):
"i2c_cognew("vmcog.dll",mailbox)" instead of the existing "cognew(@vmcog,mailbox)"
Of course this does require me to polish my loader code and publish it... it is part of my Largos OS.
Largos shares a 2KB buffer between cog images, and mass storage buffers. When a cog needs to be loaded, the (four) 512 byte buffers are flushed, cog image loaded, and cognew'd. After the cognew, the buffer can again be used to hold disk buffers.
I believe when people stumble across the advantages entailed in such handling of drivers they will modify the obex drivers to conform to such loaders.
FYI, all my new cog images will conform to a very simple interface standard: a mailbox consisting of four longs; if more storage is needed, a pointer in one of the longs will point to it.
>> Am I missing something there?
No, you are not missing anything - however VMCOG will be quite usable the "old fashioned way". I am just trying to push the state of the art for the Propeller.
>> My concern is the vm object might be limited if it only could use a small number of obex objects that had been custom written.
The VMCOG object is self-contained, and does not load any external objects.
>> Your thoughts on that issue would be most appreciated.
See above :)
>> I am becoming more and more confident that this vm is a *very clever idea*.
Thank you.
>> So much so that I believe it has the potential to be faster than sram.
I think it will be "fast enough", and allow very inexpensive ZiCog (and other) emulations with the addition of only two 23K256's and four pins.
Now that, in my opinion, is an "enabling technology"
>> So, I am starting to ponder a new board design. Propeller, sd card, vga, keyboard, 2 serial ports, SPI for serial ram. That leaves some pins free, and those could be used for mouse, another serial port, sound and analog I/O. That makes the board more flexible than the dracblade. I think it could be half the size too (which means cluso might even get it into a matchbox!).
I will admit that I have also been working on a low cost design that has a somewhat different approach and target market than the one you describe, but it too leverages off VMCOG and cheap SPI ram... I hope to demonstrate it at UPEW (and perhaps announce it earlier). I should be receiving the first prototypes in early March (an unfortunate side effect of inexpensive prototype PCB manufacturing is longer lead times).
>> I found 32KB SPI RAMs fairly easily. Are bigger ones more expensive? Can you share pins for two SPI 32k ram chips?
Unfortunately the only SPI RAMs larger than 32KB are Ramtron's FRAMs, which cost about $7 for a 64KB device and about $9 for a 128KB device - however, given that they are non-volatile, they are an excellent choice for some applications. They are not available in a DIP form factor, but SOIC8 is not too difficult to solder.
Yes, you can share most pins between SPI RAMs. The sample circuit I am working with is wired as:
- DI/DO from two SRAM's
- CLK for both SRAM's
- /CS0 for $0000-$7FFF
- /CS1 for $8000-$FFFF
This is the same configuration Andy was using for his driver; however, I am having problems with his driver and the Microchip parts.
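The address decode implied by that wiring can be sketched in Python (a toy model, not the actual driver; the function name is made up). Each 23K256 holds 32KB, so bit 15 of the 16-bit address picks the chip select:

```python
# Sketch (assumed scheme, not VMCOG code): two 23K256s share DI/DO and CLK,
# and the top address bit selects /CS0 or /CS1 as in the wiring list above.

def decode(addr):
    """Return (chip_index, local_address) for a $0000..$FFFF address."""
    chip = (addr >> 15) & 1          # 0 -> /CS0 ($0000-$7FFF), 1 -> /CS1
    local = addr & 0x7FFF            # offset within the selected 32KB chip
    return chip, local

assert decode(0x0000) == (0, 0x0000)
assert decode(0x7FFF) == (0, 0x7FFF)
assert decode(0x8000) == (1, 0x0000)
```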
I am adding 4Mbps read/write routines to VMCOG, as those will fit with a 256 entry TLB.
Once I start experimenting with a 128 entry TLB using 512 byte pages, I will have room for 10Mbps SPI code.
I agree - pasm code should take all arguments from a memory block pointed to by PAR
kuroneko said...
Where did this come from? If you're using fixed locations then - while doable - you are doing it wrong (in this particular context of SPIN/PASM communication). Just define a parameter block anywhere in HUB and pass its address to the cog when you start it.
I know that not all existing objects would be easy to rewrite with that approach but I'm somehow choking on that fixed location thingy :)
This just keeps getting better and better. But - I just checked the date, and it isn't Christmas?!
Ok, custom objects sounds great. And like you say, the idea ought to catch on, especially when you see how much memory it will save in a typical project.
I seem to recall that eeproms come in bigger sizes for really not much more money, so presumably the spare space in an eeprom is a good place to put cog code?
I see 4 lines to drive two SPI rams - is that right, and you are joining data in and data out, which would make sense and that saves pins.
Great to hear a new board is in the pipeline. I left one thing off that list before, and that was TV, but on one of the dracblade designs I used the 8 pins 16 to 24 for vga and also pins 16 to 19 went to TV resistors so you could use one or the other by installing a vga plug or an RCA plug plus the appropriate resistors. (I'm not sure, could you ever drive TV and VGA at the same time?)
Ok, well, I'm out of ideas because everything I ever could have dreamed of is going to be in this new design. Life is good!
OBJ
k: "spiconst"
....
outa[k#CS]~
dira[k#CS]~~
...
DAT
cspin long 1<<k#CS
>> Great news on the opcode count. I'm still hopeful the page reads will be quite infrequent.
Thanks, I am hoping for infrequent page reads too.
>> The timing algorithm sounds intriguing.
"aging" makes sure that stale (not recently used) but previously frequently used data does not get stuck in the working set, wasting valuable space
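The aging idea can be sketched as follows (a Python toy model with assumed mechanics - VMCOG's actual counters live in the TLB): bump a counter on every access, and periodically halve all counters so pages that were hot long ago decay toward zero instead of sticking in the working set:

```python
# Toy model of access-counter "aging" (assumed mechanics, not VMCOG's code).

counts = {0x10: 0, 0x20: 0, 0x30: 0}  # page -> use counter

def touch(page):
    counts[page] += 1                 # bumped on every read/write

def age():
    for page in counts:
        counts[page] >>= 1            # stale pages decay, recent ones stay high

for _ in range(100):                  # page $10 was heavily used in the past...
    touch(0x10)
age(); age(); age()                   # ...but three aging passes later
touch(0x20); touch(0x20)              # page $20 is the recently-used one

assert counts[0x10] == 12             # 100 halved three times
assert counts[0x20] == 2
```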
>> Re a new board, the chip count is going to replace a 32 pin sram, four 20 pin latches and one 16 pin 138 with two 8 pin serial ram chips. That is a lot of board real estate that will be saved.
Definitely saves a lot of space - and some money too. Roughly ($3.50+5*$.50) vs. 2*$1.70 ... so $6 vs $3.40 (chips+sockets in both cases)
However it is 1/8th the amount of ram.
re/ I2C
I2C is MUCH slower. It would work, but SPI is cheaper, and MUCH faster. Easy choice :)
Dr_Acula said...
But then there are all the CON values that come up in pasm eg
CON
'Z80 flag bits
sign_bit = %10000000
and in pasm
muxnz flags, #sign_bit
So you have to have a new version of the object where all the CON values are defined at the beginning of the pasm code. But no need to define them now for the spin code. But what about CON values that were actually for the spin code, not the pasm code? Maybe it doesn't matter defining CON values that never get used in code. Copy and paste is simpler. But messy for someone trying to understand why there is a CON value that never gets used in that local program. So to make it neat you have to go through every CON value and assign them either to the spin code or the pasm code. Tedious, but doable.
Re
>It looks like you are correct, VMCOG (in ideal circumstances) can be faster than your current latched driver cog!
>Mind you, it will take a big performance hit whenever it has to read a page into memory.
Great news on the opcode count. I'm still hopeful the page reads will be quite infrequent.
The timing algorithm sounds intriguing.
Re a new board, the chip count is going to replace a 32 pin sram, four 20 pin latches and one 16 pin 138 with two 8 pin serial ram chips. That is a lot of board real estate that will be saved.
I've been looking at ram chips. They seem to max out at 32 kilobytes. I2C seems more expensive than SPI - is that correct? (I just had a crazy idea to go I2C, which happens to be replicated in the SD driver code. Messy, as it is a cog to cog transfer of information. But the SD code for I2C is likely to already be there, unused. Hmm. Digikey are $1.66 for 32k SPI and $2.74 for 32k I2C.) Maybe this could be an escape clause if the code won't fit?
I've just figured out how to optimize VMREADB down to just 16 instructions (best case), eliminating one hub access and four regular instructions!
Alas, VMREADW and VMREADL go down to 19 instructions (best case); I can reduce them to 16 instructions as well by using 6 more longs, so there is hope...
I will not add these optimizations until after everything is running, as it involves changing the message format etc., and I want a "baseline" fully working VMCOG first.
(best case = memory location present in the working set, properly aligned, command present in hub command register when VMCOG checks)
The reason it works much better is that I will combine the command code with the virtual address.
The new command format will be:
Bits 0-8 = command code
Bits 9-31 = virtual address
This limits the virtual address space to 8MB, but it makes the most common case - vmreadb - MUCH faster!
It is possible to increase the address space to 28 bits (256MB) at the cost of one additional instruction per access.
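For illustration, the packing arithmetic as a Python sketch (VMREADB's actual command code is not given in the post, so the value here is made up): the command sits in bits 0-8 and the virtual address occupies the remaining high bits of the long - 23 address bits, which gives the 8MB limit mentioned above.

```python
# Sketch of the combined command word (bit layout derived from the post).

VMREADB = 0x01                       # hypothetical command code

def pack(cmd, vaddr):
    assert 0 <= cmd < (1 << 9) and 0 <= vaddr < (1 << 23)
    return (vaddr << 9) | cmd        # one long: address in the high bits

def unpack(word):
    return word & 0x1FF, word >> 9   # (command, virtual address)

assert unpack(pack(VMREADB, 0x5000)) == (VMREADB, 0x5000)
```

A client cog then only needs one hub write to request a byte read, which is what makes the common case faster.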
- The first pass at the unaligned read code used too many longs, I am working on a better/shorter way of doing it
- Development of VMCOG will slow down a bit for a week as:
- I am building a lot of boards for a customer next week
- I have four Chinese New Year family dinners coming up (my wife is Asian) so that will also slow me down
Comments
Here is what I've done so far:
- set up one 23K256 on my PDB
- I adapted Andy's original test routine to use FullDuplexSerialPlus
- I changed the test read/write values when I saw it was not working
- tried disconnecting the data line, read all 0's so when connected it IS getting data
- my best guess is that the write is not writing correctly
I am attaching an archive of the current test code to this post
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
Post Edited (Bill Henning) : 2/8/2010 1:38:06 AM GMT
Best guess: CLK edge needs to be inverted, or happens before data settles enough.
Alternate guess: 23K256 can't handle 20Mbps SPI clock, even though the data sheet says it should.
I am attaching the output from my running SPI_Test2 in the archive above.
I just tried my driver again, and it works with my RAMs.
Here are some possible changes at the begin of dowrite to shift the clock to data relation:
1) Clock comes a bit earlier:
2) Another way:
Both still work with my RAMs.
And attached is an earlier driver with only 10MHz SCLK.
Andy
I will try all three.
So it looks like there is a subtle difference between the two RAMs as far as clocking goes.
I have tried all three methods you suggested above (and some other variations) - no dice; the best I could get was 226 errors (out of 256 writes/reads) with your last suggestion.
I have also tried your older 10Mbps write, and got similar results.
What I will try today:
- run it under ViewPort and look at the waveforms at 80Msps - should show me what is going wrong
- try different 23K256's (I've tried two different ones so far)
- try a different prop (in case I managed to damage this one in the past)
- if that does not let me fix it, I will substitute my own 5Mbps write routine
- if that does not work, I will fall back on 5Mbps read / 5Mbps write for now (without counters)
- I wonder if it could be extra capacitance due to running it on a breadboard?
- if you like, I can snail-mail you a couple of 23K256's to try (PM your address to me)
I will also work on the generic VMCOG part, and VMCOG Debugger - I want this running soon, for CogZ, ZiCog and MotoCog!
Post Edited (Bill Henning) : 2/8/2010 6:58:36 PM GMT
I've uploaded v0.20 of VMCOG, and v0.20 of VMCOG_Debugger - you can find them in the first post as an archive.
The code is not functional yet (well, VMDUMP almost works), and I am still having SPI RAM issues - but I thought I'd upload what progress I have made.
- Fixed a dumb bug in the TLB
- various minor optimizations and fixes (VMFLUSH, VMINIT, VMDUMP)
- renamed debugger to VMDebug
- VMDump now works correctly!
- updated docs in first post to v0.22, adds vm.GetPhysVirt and vm.GetVirtPhys
- added GetPhysVirt code and debugged it
- added hex dump routine for showing a page
- fixed a bug in VMFLUSH
- VMREADB works if page present in working set, correctly notices if it is not present (does not page in yet!)
- VMREADB correctly updates access counter in TLB
Now I will tackle VMWRITEB...
Post Edited (Bill Henning) : 2/9/2010 10:55:59 PM GMT
Menu options 1,2,3,6 and 9 are functional (with the exception of swapping the pages in/out)
Using '1' you can see how many times any resident page has been accessed (the count is incremented on every read and/or write); any write will also set the "DIRTY" bit
You can reset the TLB using '2'
You can view any virtual page that is in the working set by its virtual address using '9'
Read a virtual byte with '3' and write it with '6'
The new archive is attached to the first post. Enjoy!
NEXT: aligned versions rdvword, wrvword, rdvlong, wrvlong
THEN: SPI ram fixing...
THEN unaligned versions of rdvword, wrvword, rdvlong, wrvlong
NOTE:
The VM is currently set up for only four physical pages with "vm.start(@mailbox,$7C00,4)"
If you want 16KB of physical memory for your virtual space, use "vm.start(@mailbox,$7C00,64)"
The working set starts at hub page $7C00 and grows down.
For now the working set is NOT swapping in/out to SPI RAM... I want to get all the vm messages working first.
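The layout can be sketched as follows (assumed arithmetic based on the description above - pages are 256 bytes and grow down from hub address $7C00):

```python
# Sketch of the working-set layout: physical pages carved out of hub RAM
# growing DOWN from $7C00 (arithmetic assumed from the post, not VMCOG code).

TOP = 0x7C00
PAGE = 256

def page_base(n):
    """Hub address of physical page n (page 0 sits just below TOP)."""
    return TOP - (n + 1) * PAGE

assert page_base(0) == 0x7B00
assert page_base(3) == 0x7800        # vm.start(@mailbox,$7C00,4): 4 pages
assert page_base(63) == 0x3C00       # 64 pages = 16KB of physical memory
```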
Edit: Minor bug, page access count in TLB is shown wrong due to incorrect increment amount ($400) instead of 1 due to TLB format change. I have a fix, will upload later. Fixed in v0.24
Post Edited (Bill Henning) : 2/10/2010 3:05:21 AM GMT
- added code aligned WORD and LONG reads
- BYTE/WORD/LONG reads from memory in the working set all work! (for now WORD and LONG must be aligned)
- BYTE write to memory in the working set already works, so....
NEXT:
- aligned versions of WRVWORD and WRVLONG
THEN:
- add support for non-aligned read/write for WORD and LONG
- added aligned WORD and LONG writes to working set pages
- various optimizations
- refactored to reduce code size
NEXT:
- unaligned WORD/LONG READ's and WRITES
THEN:
- get SPI RAM debugged
Right at the beginning of the thread was a comment that this virtual memory would be slower than latched ram. I am beginning to wonder if it actually could be faster? I've just gone through the dracblade sram driver and it is 24 instructions to read or write a byte.
Consider- the zicog requests a byte. It posts the address to a location in hub ram. Another cog is polling that address. It will find it on the next cycle. It will need to decode that address, check if the 256 byte packet is in propeller ram, read it and return the value. That might still only be half the number of instructions of latched ram, if the packet is available.
A couple of basic questions. Where are you storing cache ram? In the cog or in hub?
If in hub, I'm wondering whether you could reclaim the memory used loading cogs, which will be in random but known locations around the hub ram. Pass the locations to the vm driver at the beginning of the program and the vm program can store 256 byte packets wherever it can. More complex code of course, but potentially up to 14k of ram space in the cache.
The algorithms for handling packets can be simple or complex. Complex is harder to understand so spin may be easier to code first. I'm still trying to understand the least used packet concept. Ok, now my brain is starting to hurt, but I'm going to keep thinking about this as I can see some algorithms that could well be faster than latched ram and which potentially free up a whole lot of pins for better uses, eg analog I/O, more serial ports etc. One algorithm ends up using a sort routine, I'm trying to avoid that one if possible, except that it might end up being the most efficient and given a whole cog has been assigned managing memory, may as well use it to its full potential. brb - off to read en.wikipedia.org/wiki/Cache_algorithms
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.smarthome.viviti.com/propeller
Post Edited (Dr_Acula) : 2/11/2010 6:13:05 AM GMT
Ahh... that first morning coffee is always great!
Thank you... and I agree, there are missing pieces. I wanted to get a good "framework" in place first, cleanly documented (so others can make changes), with a certain level of basic functionality (virtual to physical address mapping, TLB lookups etc) so I can test the read/write routines before attaching the backing store.
I find it easier to debug code piecemeal :)
You are correct, it may end up faster overall than the latched driver. Only time will tell!
Frankly I already see several optimizations for VMCOG, but I will wait until it is running before starting serious optimization. The first one will be to combine the command and virtual address, that way a client program (ie ZiCog) will only need to write one long to the hub to request a read of a byte/word/long. I will add an "Optimizations" heading in the first post to describe what I will do.
I am running seriously short of memory in VMCOG, which is why I refactored it twice yesterday - I need to make enough room for the SPI RAM driver.
The "working set" pages (cache / physical memory) are stored in the hub, from address $7C00 down. The start method of VMCOG sets the top address, and the number of 256 byte pages to use. I recommend using 16..64 pages, and I will later add some performance monitoring tools so that it will be possible to tune the size of the working set to the emulator being run, and the program running under the emulation.
I *REALLY* like your suggestion of adding "random" pages that are used to load cogs with drivers!
There would be one limitation - any added pages would have to be 256 byte aligned, as otherwise the TLB could not hold a pointer to them (I need to keep the use counter as large as possible). A bit of memory would be wasted, but every additional page will help!
Later I will modify the start method, and VMFLUSH, to take an array of bytes, which will represent the additional pages to add. The first "0" byte would stop adding pages.
In order to better understand the LRU algorithm, also read the links I put in the first thread - they discuss virtual memory page replacement policies. Also read about the difference between "write through" and "delayed write" designs - initially I am implementing "write through", but later I will also add "delayed write" - I am building in full support for "delayed write" from the start.
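A toy model of the two write policies (Python, assumed names - not VMCOG code): write-through pushes every write straight to backing store, while delayed write only marks the page dirty and flushes at page-out time, trading fewer SPI writes for a costlier eviction.

```python
# Toy contrast of "write through" vs "delayed write" (assumed mechanics).

backing = {}                         # stands in for the SPI RAM

class Page:
    def __init__(self, data=0):
        self.data, self.dirty = data, False

def write(page_no, page, value, write_through):
    page.data = value
    if write_through:
        backing[page_no] = value     # goes straight to backing store
    else:
        page.dirty = True            # flushed later, at page-out time

def evict(page_no, page):
    if page.dirty:                   # delayed write pays its cost here
        backing[page_no] = page.data
        page.dirty = False

p = Page()
write(7, p, 0xAB, write_through=False)
assert 7 not in backing              # delayed: backing store not touched yet
evict(7, p)
assert backing[7] == 0xAB
```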
Regardless of VMCOG, SPI RAM cannot be faster than parallel ram - even latched parallel ram - however it may be possible to get 90%+ of latched designs performance. Only benchmarks will really tell though.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Very true.
I still have a few tricks up my sleeve:
1) I can regain 11 longs by re-using initialization code for variables
2) I can factor out some more code, at the expense of run time - probably not an issue with writes
3) Use 512 byte pages - for a 64KB VM that would free 128 longs in VMCOG
4) Use slower SPI code - the unrolled SPI code Andy provided definitely would not fit
5) Use an external SPI cog - but I want a single cog version!
6) Move the TLB to HUB; this adds a performance hit, but allows a much larger VM, and frees up (256-(#pages in working set)) longs in the cog
I don't think we can tolerate a two COG VM/external memory system.
I mean, we have one COG running Zog, ZiCog, Catalina or whatever it is that will be using the Virtual Memory.
Then we have the Virtual Memory COG.
Then we have, probably, a COG, taking care of I/O (console, files, UARTS etc) from the Zog/ZiCog/Catalina whatever code.
That's starting to eat a lot of Prop to run some external memory program.
What I am currently thinking of is this:
- Initial SPI implementation will be with 4Mbps bit-banging rolled up SPI, which I can make fit with 256 byte pages
- Use this to gather the following stats:
# total reads per second
# total writes per second
# of reads handled from working set per second
# of writes through to backing store per second
# of page-ins per second
# of page-outs per second
And have a spin program log those stats while running various CP/M (and ZOG) programs.
By varying the number of pages in the working set we will see how big it has to be for different uses, and how apps slow down with smaller working set
I can also do the runs with a write-through strategy, and with a delayed-write strategy, to see which works better for what use.
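A sketch of how such stats fall out of an access trace (Python toy model using a plain LRU working set - VMCOG's actual replacement policy may differ):

```python
# Toy model: replay an access trace against a fixed-size working set and
# count hits vs page-ins, i.e. the stats proposed above.

from collections import OrderedDict

def run_trace(trace, ws_pages):
    ws = OrderedDict()               # page -> None, kept in LRU order
    hits = page_ins = 0
    for addr in trace:
        page = addr >> 8             # 256-byte pages
        if page in ws:
            hits += 1
            ws.move_to_end(page)     # mark most recently used
        else:
            page_ins += 1
            if len(ws) == ws_pages:
                ws.popitem(last=False)   # evict least recently used
            ws[page] = None
    return hits, page_ins

# A tight loop within one page is all hits after the first access:
hits, page_ins = run_trace([0x5000 + (i % 64) for i in range(1000)], ws_pages=4)
assert page_ins == 1 and hits == 999
```

Varying `ws_pages` over real ZiCog traces is exactly the experiment described above.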
Then later do a 512 byte page version, re-run stats
Then 512 byte page version, with 10Mbps SPI, re-run stats
Frankly, I don't expect a large difference in overall speed due to 4Mbps or 10Mbps SPI, I expect 95%+ hit rate to working set - which will keep it in one COG :)
I *REALLY* want to keep the "standard" VMCOG in one cog, without a hub based TLB, with SPI routines in the same cog - precisely to conserve cogs.
For some uses, a two-cog approach for extra read/write speed may be appropriate, but the standard VMCOG is for one cog :)
I think that latched address XMM interfaces like DR_Acula's would also fit in VMCOG.
Also, I have already thought of some further optimizations, but I want to get the initial version working first - to have a "reference" platform.
Post Edited (Bill Henning) : 2/11/2010 9:31:43 PM GMT
Nothing wrong with 512 byte blocks. Maybe even 1024 or 2048? It means the list of entries gets smaller too.
I would very much imagine that if you started using bits of hub ram that were left over from loading cog code, you would try to make all pasm blocks fit into 2048 bytes (or whatever it is). Is that the FIT command?
The first one that could be used could be the unused hub ram from the vm code itself. Do you need some sort of pointer right before the pasm code so you can find out the memory location in hub once it is compiled, then use an @ on that pointer (and ? add 4 bytes) to get the actual pasm location? Maybe there is a simpler way?
I'm very confused by the "*least* recently used" algorithm, especially when the wikipedia article goes on to say another perfectly valid algorithm is to discard the "*most* recently used". Presumably you need a clock of some sort and a counter to store when a block was accessed?
Intuitively, I'd suspect that 'least *frequently* used' would be better, as it puts a higher weighting on say, 100 hits in sequence to one block than 'least *recently* used'. But to muddy the waters even further, there is an algorithm en.wikipedia.org/wiki/Adaptive_Replacement_Cache that combines least recently used and least frequently used.
From a practical perspective, it may come down to which one can be coded in pasm and actually fit in a cog.
I have been pondering an algorithm where there is a list of blocks and an access to block n triggers a swap between block n and block n-1. So each access to a block moves that block one up the list. It is a simple bubble sort. If a block is at the top of the list then it just stays there. Thinking about this more, it might take a few instructions to do this, but if the first thing the vm cog does is return the byte, then (say) the calling program, zicog or whatever, is going to get on with processing that byte so it will be busy for a while and won't be asking for any more. So the vm cog could use this time to do the housekeeping associated with swapping those two blocks (or more to the point, swapping the two pointers in the list).
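The one-step promotion can be sketched as a Python toy model (made-up names): on a hit, the matched block swaps with its predecessor in the search list, so popular blocks bubble toward the front over repeated accesses.

```python
# Toy model of the "swap with predecessor" promotion described above.

def lookup(blocks, block_no):
    """Find block_no in the search list, promoting it one slot toward the front."""
    i = blocks.index(block_no)       # linear search from the front
    if i > 0:
        blocks[i - 1], blocks[i] = blocks[i], blocks[i - 1]
    return blocks.index(block_no)

blocks = [3, 7, 1, 9]
lookup(blocks, 1)                    # block 1 swaps with block 7
assert blocks == [3, 1, 7, 9]
lookup(blocks, 1)                    # a second hit bubbles it to the front
assert blocks == [1, 3, 7, 9]
```

The swap itself can happen after the byte has been returned, in the "housekeeping" window while the caller is busy processing.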
Another issue is how the vm cog searches for blocks. Presumably it starts at the top of the list as it is most likely to find a cache hit at the top as those are the popular blocks.
Hmm - 2048 byte blocks does decrease the memory needed for the list. Say you only used the 7 leftover bits of hub ram from loading up cogs, and maybe just one 2048 byte block at the top of hub ram to make the count 8, as 8 is kind of neater in a binary way. 64k has 32 blocks, and 8 of those are in hub ram at any one time. The bubble sort list is only 8 bytes, and there is another array with 8 longs that stores the hub ram locations of the leftover cog load space.
Zicog requests a byte at location 5000H. Divide by 2048 (800H), which is a rotate instruction, to get the block number 00 to 1FH (0AH here). Search the 8 bytes in the cache array list for a match - up to 8 instructions, but it might return a match in 1 or 2. (If no match, branch to much longer code to load a block from SPI.) If there is a match, look up the hub ram location - one or two instructions. Find the offset, 5000H - (0AH*800H); there has to be cunning pasm code to find the remainder, possibly as simple as a rotate n bits and then a subtract. Look up the hub byte and return it. Then (and this doesn't count in the timing calculation) do a single bubble sort swap, unless this block is already at the top of the list. With only 8 entries the list isn't really a list - that is just two longs.
If all that comes in under 24 instructions, this code will be faster than latched sram.
Is 2048 bytes too big? If you were emulating a big spin program, realistically you would have one block for the linear part of the current code, and then the others would probably end up with the popular subroutines, whatever they might be. For CP/M, the most popular ones would be block 0, then probably a couple of blocks in CP/M itself that handle the I/O, and then the remainder would be the program that is loaded in at 0100H. Big blocks certainly makes the lists smaller. For 'big spin', a guide might be the size of the average subroutine, and they do tend to be small. More thinking required.
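The lookup-and-promote scheme described above can be sketched as a toy model. This is illustrative Python, not Propeller code; all names (`BlockCache`, `read_byte`, the slot layout) are made up for the sketch, and hub RAM is stood in by a `bytearray`:

```python
class BlockCache:
    """Toy model of the 8-slot bubble-promotion cache sketched above.

    blocks[i]   -> external block number cached in slot i (hot blocks first)
    hub_base[i] -> hub RAM base address of slot i's 2048-byte buffer
    A hit at slot i swaps slots i and i-1, so popular blocks drift to the top.
    """
    BLOCK_SHIFT = 11            # log2(2048): block number is a shift, not a divide
    BLOCK_MASK = 0x7FF          # offset within a block is just the low 11 bits

    def __init__(self, blocks, hub_base, hub_ram):
        self.blocks = list(blocks)
        self.hub_base = list(hub_base)
        self.hub_ram = hub_ram  # bytearray standing in for hub RAM

    def read_byte(self, addr):
        blk = addr >> self.BLOCK_SHIFT       # e.g. 0x5000 >> 11 == 0x0A
        off = addr & self.BLOCK_MASK         # e.g. 0x5000 & 0x7FF == 0
        for i, b in enumerate(self.blocks):  # linear search, hot blocks first
            if b == blk:
                value = self.hub_ram[self.hub_base[i] + off]
                if i > 0:                    # bubble the block one slot up
                    self.blocks[i-1], self.blocks[i] = self.blocks[i], self.blocks[i-1]
                    self.hub_base[i-1], self.hub_base[i] = self.hub_base[i], self.hub_base[i-1]
                return value
        raise KeyError("miss: block must be loaded from SPI RAM")
```

Note how the "cunning pasm code to find the remainder" collapses to a single AND with 0x7FF, since the block size is a power of two.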
What is using up all the code space - is it the SPI code?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.smarthome.viviti.com/propeller
Post Edited (Dr_Acula) : 2/11/2010 11:22:01 PM GMT
I prefixed what you wrote with ">>" so I could inter-space my responses.
>> In that version you posted I see lots of spin and some pasm. Which bit is running out of memory?
VMCOG, the actual driver cog is running out.
>> Nothing wrong with 512 byte blocks. Maybe even 1024 or 2048? It means the list of entries gets smaller too.
At some point I am *VERY* likely to switch to 512 byte blocks, even though it cuts in half the number of different pages that can be loaded at once.
I deliberately went with 256 byte pages in order to be able to gather statistics on how many unique pages are heavily used, and how they cluster together (i.e. to determine the "natural" page size for ZiCog).
>> I would very much imagine that if you started using bits of hub ram that were left over from loading cog code, you would try to make all pasm blocks fit into 2048 bytes (or whatever it is). Is that the FIT command?
The FIT command can be used to see how many cog locations are left in the current cog image at any point.
I can support adding "leftover" hub ram, but it must be page aligned, and a multiple of pages long... so there will still be some wastage.
>> The first one that could be used could be the unused hub ram from the vm code itself. Do you need some sort of pointer right before the pasm code so you can find out the memory location in hub once it is compiled, then use an @ on that pointer (and ? add 4 bytes) to get the actual pasm location? Maybe there is a simpler way?
Good point!
I am however going to be loading cogs using a shared 2KB buffer, from I2C EEPROM or SPI flash / SD card.
>> I'm very confused by the " *least* recently used" algorithm. Especially when the wikipedia article goes on to say another perfectly valid alorithm is to discared the "*most* recently used". Presumably you need a clock of some sort and a counter to store when a block was accessed?
I am initially implementing LRU because it is easy to implement, and tends to work in a near optimal fashion, as the most frequently used blocks will be swapped out last. I am also adding "aging", so eventually blocks that were used very frequently in the beginning, but not recently, will be swapped out.
>> Intuitively, I'd suspect that 'least *frequently* used' would be better, as it puts a higher weighting on say, 100 hits in sequence to one block than 'least *recently* used'. But to muddy the waters even further, there is an algorithm en.wikipedia.org/wiki/Adaptive_Replacement_Cache that combines least recently used and least frequently used.
Too complicated for the amount of available cog memory; plus, in my experience, LRU + aging works best on limited memory systems.
>> From a practical perspective, it may come down to which one can be coded in pasm and actually fit in a cog.
Exactly!
>> I have been pondering an algorithm where there is a list of blocks and an access to block n triggers a swap between block n and block n-1. So each access to a block moves that block one up the list. It is a simple bubble sort. If a block is at the top of the list then it just stays there. Thinking about this more, it might take a few instructions to do this, but if the first thing the vm cog does is return the byte, then (say) the calling program, zicog or whatever, is going to get on with processing that byte so it will be busy for a while and won't be asking for any more. So the vm cog could use this time to do the housekeeping associated with swapping those two blocks (or more to the point, swapping the two pointers in the list).
I will achieve similar results by "aging" the LRU counters; there are also several different aging algorithms to choose from. Personally, I like dividing all the counts by two at pre-defined intervals (either based on time, or based on number of accesses). This simple strategy tends to work quite well, and takes very little code (cog space) to implement.
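The "divide all the counts by two" aging strategy can be sketched in a few lines. This is a hypothetical Python model, not VMCOG's actual implementation; the class and method names are made up:

```python
class AgedLRU:
    """Per-page use counters, periodically halved so that pages that were
    hot long ago decay and become eviction candidates."""

    def __init__(self, age_interval=64):
        self.counts = {}            # page number -> access count
        self.accesses = 0
        self.age_interval = age_interval

    def touch(self, page):
        self.counts[page] = self.counts.get(page, 0) + 1
        self.accesses += 1
        if self.accesses % self.age_interval == 0:
            for p in self.counts:   # aging pass: divide every count by two
                self.counts[p] >>= 1

    def victim(self):
        # evict the page with the lowest (aged) count
        return min(self.counts, key=self.counts.get)
```

The right shift is cheap in PASM too (one `shr` per counter), which is why this aging scheme suits a cog-resident driver.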
>> Another issue is how the vm cog searches for blocks. Presumably it starts at the top of the list as it is most likely to find a cache hit at the top as those are the popular blocks.
I implemented a "direct mapped" TLB, which means it will find out if the page it is requesting is in memory or not in very few instructions.
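A direct-mapped TLB decides hit or miss with a single indexed read, no search loop. The sketch below is illustrative only (VMCOG's actual TLB entry layout is not shown here); it assumes 256-byte pages and one entry per page, which matches the 256-entry TLB discussed in this thread:

```python
PAGE_SHIFT = 8                      # 256-byte pages
TLB_SIZE = 256                      # one entry per possible page -> 64KB space

tlb = [0] * TLB_SIZE                # 0 = not resident; else hub base address

def lookup(vaddr):
    page = (vaddr >> PAGE_SHIFT) & (TLB_SIZE - 1)
    hub_base = tlb[page]            # single indexed read, no searching
    if hub_base == 0:
        return None                 # page fault: fetch the page over SPI
    return hub_base + (vaddr & ((1 << PAGE_SHIFT) - 1))
```

This is why the hit path can be counted in a handful of instructions: shift, indexed read, test, add.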
>> Hmm - 2048 byte blocks does decrease the memory needed for the list. Say you only used the 7 leftover bits of hub ram from loading up cogs. And maybe just one 2048 byte block at the top of hub ram to make the count 8, as 8 is kind of neater in a binary way. 64k has 32 blocks, and 8 of those are in hub ram at any one time. The bubble sort list is only 8 bytes, and there is another array with 8 longs that store the hub ram locations of the leftover cog load space. Zicog requests a byte at location 5000H. Divide by 2048 (800H) which is a rotate instruction to get the block number 00 to 1FH (0AH) Search the 8 bytes in the cache array list for a match. Up to 8 instructions but might return a match in 1 or 2. (If no match, branch to much longer code to load a block from SPI). If got a match, look up the hub ram location. One or 2 instructions. Find the offset - 5000H - (0AH*800H) and there has to be cunning pasm code to find the remainder - possibly as simple as a rotate n bits and then a subtract. Look up the hub byte and return it. Then (and this doesn't count in the timing calculation, do a single bubble sort swap, unless this block already is at the top of the list). Even the list isn't really a list if it is only 8 entries - that is just two longs.
I strongly suspect that using 2K pages would be very counter productive on the prop, where less than 32K is available as the working set. If 16KB was allocated to the working set, only 8 unique pages could be resident at once, leading to significantly more swapping.
I believe that 512 byte pages will be the best compromise, however this will be readily testable with statistics gathering that I will add to VMCOG.
>> If all that comes in under 24 instructions, this code will be faster than latched sram.
A request for a byte/word/long that is resident in the working set might very well come in under 24 instructions for bytes, and for aligned words and longs. I won't start counting instructions until everything works [noparse]:)[/noparse]
>> Is 2048 bytes too big? If you were emulating a big spin program, realistically you would have one block for the linear part of the current code, and then the others would probably end up with the popular subroutines, whatever they might be. For CP/M, the most popular ones would be block 0, then probably a couple of blocks in CP/M itself that handle the I/O, and then the remainder would be the program that is loaded in at 0100H. Big blocks certainly makes the lists smaller. For 'big spin', a guide might be the size of the average subroutine, and they do tend to be small. More thinking required.
Yes, I am certain 2048 byte pages would be far too big.
I suspect 512 bytes will turn out to be the best compromise on Prop 1, and in some cases 1KB pages may work reasonably well.
Everything has to do with how the code accesses memory.
>> What is using up all the code space - is it the SPI code?
256 longs for the TLB
leaves only 240 longs for all the VM code AND the SPI code.
Currently it looks like I should be able to fit "rolled up" SPI code, which should give us about 4Mbps to/from the SRAM.
I think that there will be enough room to port VMCOG to DracBlade, tossing out the SPI code in favor of latched access code.
Later, I will make a 128 entry TLB version (using 512 byte pages), which will leave 368 longs for the VM code and SPI code - allowing unrolled 10Mbps reads and maybe 20Mbps writes.
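The cog-space arithmetic above checks out, taking 496 usable longs per cog (512 minus the 16 special registers):

```python
COG_LONGS = 496                  # usable longs per cog

# 256-entry TLB (256-byte pages) leaves 240 longs for VM + SPI code
assert COG_LONGS - 256 == 240

# 128-entry TLB (512-byte pages) leaves 368 longs for VM + SPI code
assert COG_LONGS - 128 == 368
```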
It will be very interesting to see the performance differential!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
Post Edited (Bill Henning) : 2/11/2010 11:46:04 PM GMT
v0.32 uploaded - you can find it attached at the end of the first post.
- refactored code to make more room
- see change list at the top of VMCOG.spin for more details
- start of unaligned read/write code
Ok, I made enough room so that the unaligned access code and "rolled up" SPI code should fit
NEXT:
- finish unaligned reads for word and long
- finish unaligned writes for word and long
- add SPI driver
- page in and out [noparse]:)[/noparse]
- add aging code
I am hoping to release a fully functional v1.0 early next week.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
Post Edited (Bill Henning) : 2/12/2010 12:31:51 AM GMT
I just counted the best case scenarios (when the addressed page is already in memory):
VMREADB currently takes 21 instructions (best case) if the command is in the hub when it looks at the mailbox
VMREADW and VMREADL currently take 22 instructions (best case) if the command is in the hub when it looks at the mailbox (and the word / long is properly aligned).
I have an optimization in mind that may allow me to shave 2-3 instructions in the best case, but I won't try it yet.
It looks like you are correct, VMCOG (in ideal circumstances) can be faster than your current latched driver cog!
Mind you, it will take a big performance hit whenever it has to read a page into memory.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
Post Edited (Bill Henning) : 2/12/2010 12:41:53 AM GMT
Ok, I understand and agree with all the points. The only bit I don't understand:
"I am however going to be loading cogs using a shared 2KB buffer, from I2C EEPROM or SPI flash / SD card."
A practical problem. Take a typical piece of dracblade code, which is really a typical piece of spin code. I'll use the keyboard object.
It receives a variable via the Start which is in Spin. It then passes setup parameters to the cog with something like;
And the cog code takes those parameters in its first line
So, while there is some double handling there, my understanding is that the Spin code becomes the glue for the compiler to link the locations for the cog code to known locations in hub ram. Many objects do this and they do it in different ways. The idea of decoupling cog code and its spin code sounds so simple, but I found after nearly a day's worth of coding that it is a huge job. It entails rewriting practically every object in the Obex to communicate via fixed locations in hub. Not only that, but also persuading all future authors of objects to write code using fixed locations.
Am I missing something there?
My concern is the vm object might be limited if it only could use a small number of obex objects that had been custom written. And how do you custom write objects that you don't understand and which fit in a cog by only a few bytes, like the zicog or the 4 port serial driver? This is the practical problem I came up against.
So this is why I was thinking it might be much easier to 'fill' unused cog space in any cog obex code to a multiple of 512 bytes as that presumably is just some filler bytes at the bottom of the code rather than trying to rewrite and understand the code.
Your thoughts on that issue would be most appreciated.
Second issue:
I am becoming more and more confident that this vm is a *very clever idea*. So much so that I believe it has the potential to be faster than sram. So, I am starting to ponder a new board design. Propeller, sd card, vga, keyboard, 2 serial ports, SPI for serial ram. That leaves some pins free, and those could be used for mouse, another serial port, sound and analog I/O. That makes the board more flexible than the dracblade. I think it could be half the size too (which means cluso might even get it into a matchbox!).
I found 32kb SPI rams fairly easily. Are bigger ones more expensive? Can you share pins for two SPI 32k ram chips?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.smarthome.viviti.com/propeller
I know that not all existing objects would be easy to rewrite with that approach but I'm somehow choking on that fixed location thingy [noparse]:)[/noparse]
Putting cog code into somewhere external (eg I2C eeprom or on an SD card) to me entails two processes.
First, compile the object but with no cog code at all. Will it compile?
Second, compile the object with only the cog code and no spin code. Will it compile?
First problem is compiling an object with no cog code. It will fall over on all the cogstart and cogstop instructions. Those would need to be replaced with some way of passing the settings to the generic cog loader code (which presumably is just one cogstart for the whole program). Presumably also you need a smart generic cogstart and cogstop where you name which cog you want to start or stop.
Then compiling just the cog code with no glue spin code. Maybe it needs a fake cogstart/cogstop at the beginning as usually the first few lines of pasm code are reading in parameters.
Ok, for the simpler objects that only ever pass parameters via the start, this ought to work fine.
Maybe I got put off by starting with the hardest one, which is the zicog. This has many instances where the cog code is glued to random locations in hub that change every time you modify the program and recompile. Take this code fragment
variables b_reg, c_reg and d_reg are not defined as longs at the end of the cog code. They are defined as part of a huge list of variables in the Start of the zicog object
Ok, maybe that is manageable as it all can be transferred over to the generic loader via the pointer cpu_params.
But then there are all the CON values that come up in pasm eg
and in pasm
So you have to have a new version of the object where all the CON values are defined at the beginning of the pasm code. But no need to define them now for the spin code. But what about CON values that were actually for the spin code, not the pasm code? Maybe it doesn't matter defining CON values that never get used in code. Copy and paste is simpler. But messy for someone trying to understand why there is a CON value that never gets used in that local program. So to make it neat you have to go through every CON value and assign them either to the spin code or the pasm code. Tedious, but doable.
Re
>It looks like you are correct, VMCOG (in ideal circumstances) can be faster than your current latched driver cog!
>Mind you, it will take a big performance hit whenever it has to read a page into memory.
Great news on the opcode count. I'm still hopeful the page reads will be quite infrequent.
The timing algorithm sounds intriguing.
Re a new board, the chip count is going to replace a 32 pin sram, four 20 pin latches and one 16 pin 138 with two 8 pin serial ram chips. That is a lot of board real estate that will be saved.
I've been looking at ram chips. They seem to max out at 32 kilobytes. I2C seems more expensive than SPI - is that correct? (I just had a crazy idea to go I2C, which happens to be replicated in the SD driver code. Messy, as it is a cog to cog transfer of information. But the SD code for I2C is likely to already be there, unused. Hmm. Digikey are $1.66 for 32k SPI and $2.74 for 32k I2C.) Maybe this could be an escape clause if the code won't fit?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.smarthome.viviti.com/propeller
Post Edited (Dr_Acula) : 2/12/2010 2:11:51 AM GMT
>> Thanks for the detailed response. This is great brain exercise. I'm going to have to stop all alcohol and change over to coffee instead!
You are most welcome!
>> Ok, I understand and agree with all the points. The only bit I don't understand; "I am however going to be loading cogs using a shared 2KB buffer, from I2C EEPROM or SPI flash / SD card."
No problem really... that is how I plan to load drivers in my projects, no one is obligated to do it the same way [noparse]:)[/noparse]
>> A practical problem. Take a typical piece of dracblade code, which is really a typical piece of spin code. I'll use the keyboard object.
I understand, however over time I believe the concept of "stdin" and "stdout" will catch on, and more "interface-friendly" objects shall arise
You have to understand that I have a roadmap in mind, which includes Largos, PropellerBasic, and all the assorted projects I've been working on; and when I say something like an i2c or SPI loader, I am looking to the future.
You will find that I intend to "lead by example" where it comes to loading from external storage and easier Spin interface objects.
Case in point:
VMCOG.spin can easily be modified to have the start method start the cog by (made up names):
"i2c_cognew("vmcog.dll",mailbox)" instead of the existing "cognew(@vmcog,mailbox)"
Of course this does require me to polish my loader code and publish it... it is part of my Largos OS.
Largos shares a 2KB buffer between cog images, and mass storage buffers. When a cog needs to be loaded, the (four) 512 byte buffers are flushed, cog image loaded, and cognew'd. After the cognew, the buffer can again be used to hold disk buffers.
I believe when people stumble across the advantages entailed in such handling of drivers they will modify the obex drivers to conform to such loaders.
FYI, all my new cog images will conform to a very simple interface standard: a mailbox consisting of four longs; if more storage is needed, a pointer in one of the longs will point to it.
>> Am I missing something there?
No, you are not missing anything - however VMCOG will be quite usable the "old fashioned way". I am just trying to push the state of the art for the Propeller.
>> My concern is the vm object might be limited if it only could use a small number of obex objects that had been custom written.
The VMCOG object is self-contained, and does not load any external objects.
>> Your thoughts on that issue would be most appreciated.
See above [noparse]:)[/noparse]
>> I am becoming more and more confident that this vm is a *very clever idea*.
Thank you.
>> So much so that I believe it has the potential to be faster than sram.
I think it will be "fast enough", and allow very inexpensive ZiCog (and other) emulations with the addition of only two 23K256's and four pins.
Now that, in my opinion, is an "enabling technology"
>> So, I am starting to ponder a new board design. Propeller, sd card, vga, keyboard, 2 serial ports, SPI for serial ram. That leaves some pins free, and those could be used for mouse, another serial port, sound and analog I/O. That makes the board more flexible than the dracblade. I think it could be half the size too (which means cluso might even get it into a matchbox!).
I will admit that I have also been working on a low cost design that has a somewhat different approach and target market than the one you describe, but it too leverages off VMCOG and cheap SPI ram... I hope to demonstrate it at UPEW (and perhaps announce it earlier). I should be receiving the first prototypes in early March (an unfortunate side effect of inexpensive prototype PCB manufacturing is longer lead times).
>> I found 32kb SPI rams fairly easily. Are bigger ones more expensive? Can you share pins for two SPI 32k ram chips?
Unfortunately the only SPI RAMs larger than 32KB are Ramtron's FRAMs, which cost about $7 for a 64KB device and about $9 for a 128KB device - however, given that they are non-volatile, they are an excellent choice for some applications. They are not available in a DIP form factor, but SOIC8 is not too difficult to solder.
Yes, you can share most pins between SPI RAMs. The sample circuit I am working with is wired as:
- DI/DO from two SRAM's
- CLK for both SRAM's
- /CS0 for $0000-$7FFF
- /CS1 for $8000-$FFFF
This is the same configuration Andy was using for his driver; however, I am having problems with his driver and the Microchip parts.
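With two 32KB chips on shared DI/DO and CLK lines, the address decode implied by that wiring is simply address bit 15 choosing the chip select, and the low 15 bits addressing within the selected 23K256. A minimal sketch (names made up for illustration):

```python
def decode(addr):
    """Map a 16-bit virtual address onto (chip select, local address)
    for two 32KB SPI SRAMs sharing data and clock lines."""
    chip = (addr >> 15) & 1         # 0 -> /CS0 ($0000-$7FFF), 1 -> /CS1
    local = addr & 0x7FFF           # address within the selected 32KB chip
    return chip, local
```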
I am adding 4Mbps read/write routines to VMCOG, as those will fit with a 256 entry TLB.
Once I start experimenting with a 128 entry TLB using 512 byte pages, I will have room for 10Mbps SPI code.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
Ok, custom objects sounds great. And like you say, the idea ought to catch on, especially when you see how much memory it will save in a typical project.
I seem to recall that eeproms come in bigger sizes for really not much more money, so presumably the spare space in an eeprom is a good place to put cog code?
I see 4 lines to drive two SPI rams - is that right, and you are joining data in and data out, which would make sense and that saves pins.
Great to hear a new board is in the pipeline. I left one thing off that list before, and that was TV, but on one of the dracblade designs I used the 8 pins 16 to 24 for vga and also pins 16 to 19 went to TV resistors so you could use one or the other by installing a vga plug or an RCA plug plus the appropriate resistors. (I'm not sure, could you ever drive TV and VGA at the same time?)
Ok, well, I'm out of ideas because everything I ever could have dreamed of is going to be in this new design. Life is good!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.smarthome.viviti.com/propeller
>> Great news on the opcode count. I'm still hopeful the page reads will be quite infrequent.
Thanks, I am hoping for infrequent page reads too.
>> The timing algorithm sounds intriguing.
"aging" makes sure that stale (not recently used) but previously frequently used data does not get stuck in the working set, wasting valuable space
>> Re a new board, the chip count is going to replace a 32 pin sram, four 20 pin latches and one 16 pin 138 with two 8 pin serial ram chips. That is a lot of board real estate that will be saved.
Definitely saves a lot of space - and some money too. Roughly ($3.50+5*$.50) vs. 2*$1.70 ... so $6 vs $3.40 (chips+sockets in both cases)
However it is 1/8th the amount of ram.
re/ I2C
I2C is MUCH slower. It would work, but SPI is cheaper, and MUCH faster. Easy choice [noparse]:)[/noparse]
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
I've just figured out how to optimize VMREADB down to just 16 instructions (best case), eliminating one hub access and four regular instructions!
Alas, VMREADW and VMREADL go down to 19 instructions (best case); I can reduce them to 16 instructions as well by using 6 more longs, so there is hope...
I will not add these optimizations until after everything is running, as they involve changing the message format etc., and I want a "baseline" fully working VMCOG first.
(best case = memory location present in the working set, properly aligned, command present in hub command register when VMCOG checks)
The reason it works much better is that I will combine the command code with the virtual address.
The new command format will be:
Bits 0-8 = command code (9 bits)
Bits 9-31 = virtual address (23 bits)
This limits the virtual address space to 8MB, but it makes the most common case - vmreadb - MUCH faster!
It is possible to increase the address space to 28 bits (256MB) at the cost of one additional instruction per access.
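Packing the command into the low bits of the address word can be sketched as follows. This assumes the address field occupies bits 9..31, which is the layout consistent with the stated 8MB (2^23) limit; the function names are made up for illustration:

```python
CMD_BITS = 9
CMD_MASK = (1 << CMD_BITS) - 1      # 0x1FF

def pack(cmd, vaddr):
    """Combine a 9-bit command code with a 23-bit virtual address."""
    return (vaddr << CMD_BITS) | (cmd & CMD_MASK)

def unpack(word):
    """Split a mailbox word back into (command, virtual address)."""
    return word & CMD_MASK, word >> CMD_BITS
```

The win is that the cog reads one long from the mailbox and recovers both fields with a single AND and a single shift, instead of two hub accesses.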
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
Post Edited (Bill Henning) : 2/12/2010 4:19:34 AM GMT
The 16 instruction version does not work, but a 20 instruction version with the combined command+vmaddr as described above does [noparse]:)[/noparse]
See the first post for the new archive. No changes to VMDEBUG, all changes contained within VMCOG.
This change saved 3 longs in the cog, and speeds up reads a bit.
The Spin API also received a speed boost and shrunk courtesy of this change.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
- The first pass at the unaligned read code used too many longs, I am working on a better/shorter way of doing it
- Development of VMCOG will slow down a bit for a week as:
- I am building a lot of boards for a customer next week
- I have four Chinese New Year family dinners coming up (my wife is Asian) so that will also slow me down
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system