Propeller Caching XMM Longs for LMM
jazzed
What I'm hoping to discover is the best practice for caching XMM longs for Propeller LMM execution.
This is not a new idea, of course. Caching is used everywhere. Bill Henning appears to be doing some caching in some Morpheus code. The ICC kernel has data caching. I've no idea what Catalina does. Spin does not cache, of course :)
What is the best practice for Propeller instruction caching?
In the rest of this post, I try to think through the question out loud. Help me make sense of it.
Should a data cache be considered too? ... probably.
In the simplest terms, a cache is a store from which frequently used data can be accessed quickly. There are different cache levels commonly used ... on-chip primary or level 1, secondary or level 2 (on or off chip), tertiary or level 3 (virtual memory for swap, etc.), and others dependent upon the architect's whim.
When thinking about Propeller cache, one could specify one or more of the following:
1. Intra-COG Cache
How would an Intra-COG cache work? ... it would have to be small and simple, and would work best starting at COG address 0. Tags would complicate things too much.
What advantages would it have? ... not having to use HUB instructions for every fetch.
Is it possible to read enough longs from HUB fast enough for an LMM Intra-COG cache to be useful?
Is it possible to use XMM directly? ... possible, but not as fast?
2. HUB Cache
How would a HUB Cache work? ... could maintain a larger cache. Tags are possible?
Does it make sense for one or more COGs to load and maintain a write-back cache with XMM?
With multiple cogs, read rates > 10MB/s are possible but take work.
3. XMM Virtual Memory Cache
Can one manage Virtual Memory in XMM from a HUB Cache manager or other method?
Would this most likely be necessary for larger programs?
4. SD card, other serial, or even tiny HDD Virtual Swap.
When exactly does it become useful or imperative to start using a swap partition on "fixed" media like SD card?
Time:
In any event, we have certain constraints with the Propeller. While it is great to employ COGs for our purposes, the simple design forces us to pay attention to time when making choices. As PASM coders know, there are HUB access windows of opportunity, and missing a window costs a wait until the next "quantum" period.
Consider the simple Propeller 1 LMM loop (skip this if you know it). It takes 32 cycles per instruction because the 16-cycle HUB window is broken. To mitigate this, the LMM loop is "unrolled" N times: instead of every instruction costing 32 cycles, each pass executes N-1 instructions at 16 cycles and one at 32 (the group that carries the loop jump).
nexti   rdlong  inst, PC        ' get instruction
        add     PC, #4          ' increment PC x4 to get next long
inst    nop                     ' execute instruction
        jmp     #nexti          ' 32 cycles
nexti   rdlong  inst1, PC
        add     PC, #4
inst1   nop
        rdlong  inst2, PC
        add     PC, #4
inst2   nop
        jmp     #nexti          ' 16+32 cycles if no LMM jumps are executed
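(The cycle math, for those counting: a COG gets a HUB access slot every 16 clocks, and a HUB instruction costs 7..22 clocks depending on alignment. Once the loop is synchronized, each rdlong + add + execute group fills exactly one 16-clock window; the group that also carries the jmp overshoots its window and waits for the next one, hence 32.)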
So how would a cache work? Clearly, whatever form it takes, more instructions per loop are required.
Let's say that you have an LMM Kernel COG and one or more XMMU COG(s). The LMM loop might become something like this:
nexti   wrlong  PC, CachePtr    ' assumes an address is never 0
        add     PC, #4
        rdlong  cinst, CdataPtr wz
 if_z   jmp     #$-1            ' wait for an instruction ... NOPs are padded non-zero
cinst   nop
        jmp     #nexti          ' 48 or more cycles depending on cache availability
Is there a better way? We have to waste cycles with that. At this point, with no optimizations at all and assuming a very efficient cache hit/miss COG, the instruction rate is at most 1.67 MIPS at 80MHz. This is not bad considering the maximum aggregate unrolled LMM rate is about 2.6 MIPS.
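(Checking the numbers: 80,000,000 clocks/s divided by 48 clocks per fetch/execute pass is about 1.67 MIPS. The 2.6 MIPS aggregate for unrolled LMM works out to roughly 31 clocks per instruction on average.)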
OK, so what does the Cache COG look like? The Cache COG sounds like where the money is, and from what I've seen the cycles add up (and the MIPS shrink).
Let's say you have a Cache COG loop like this:
dat     ' CACHE COG

:nextread
        rdlong  cadr, AddrPtr wz        ' got milk?
 if_z   jmp     #$-1                    ' repeat if nothing
        cmp     cadr, lcache wc,wz      ' C set if address is below the cache window
 if_ae  cmp     hcache, cadr wc,wz      ' else C or Z set if address is at/above the top
 if_c_or_z call #cacheread              ' miss either way: go (re)fill the cache
        and     cadr, CMASK
        movd    :rdch, cadr
        nop
:rdch   wrlong  0-0, DataPtr
        jmp     #:nextread              ' 64 cycles minimum

cacheread                               ' not defined yet
cacheread_ret
        ret

lcache  long    0
hcache  long    0
cadr    long    0
AddrPtr long    0
DataPtr long    0
CMASK   long    $3f
With 64 cycles in the Cache COG, we get maximum 1.25 MIPS. Can we do better than that?
One way to do better would be to optimize the loop above, but I don't see it ... maybe someone else does.
Another way to do better seems to be to have an OnCOG cache. The OnCOG cache would look much like the Cache COG, except that the only time we would need HUB cycles is on a cache miss. So, a routine might look like this:
:nextread
        mov     cadr, PC
        cmp     cadr, lcache wc,wz      ' C set if address is below the cache window
 if_ae  cmp     hcache, cadr wc,wz      ' else C or Z set if address is at/above the top
 if_c_or_z call #cacheread              ' miss either way: go (re)fill the cache
        and     cadr, CMASK
        movs    :rdch, cadr
        nop                             ' make this add PC, #4 ? for a loop?
:rdch   mov     inst, 0-0               ' source field set by the movs above
        jmp     #:nextread              ' 36 cycles minimum

cacheread                               ' not defined yet ... all hub operations go here
cacheread_ret
        ret
Can we do better?
The other types of cache could be discussed here later.
Code fragments are not guaranteed to work.
This is all I have time to write now. I would appreciate thoughts on the subject and optimizations if any.
Thanks,
--Steve
Comments
For the record - Catalina doesn't do any caching ... yet.
I've given it some thought, but unfortunately, I haven't had time to do much about it. But here is my current view:
Although I appreciate the cleverness of it, I decided against the original LMM FCACHE approach (I think this equates to what you call an 'intra-COG' cache) - it works beautifully on textbook examples, but in real-world programs it can be detrimental - this is because the kernel space it consumes can often be better used implementing other things that also speed up program execution. I've benchmarked this on actual C programs, and while there are indeed FCACHED programs that do execute much faster, they are often fairly simple cases - real-world programs often execute slower when FCACHED.
My current thinking is that a "hub cache" is the way to go. This would have another big advantage for Catalina in that one XMM hub cache could service multiple kernels without much additional effort. One of my original design goals for Catalina (not yet achieved) was to support multiple kernels running simultaneously on different cogs - but once I started supporting XMM this became more difficult to achieve, because in most XMM implementations only one cog can access the XMM memory effectively - the overheads of supporting multiple cogs accessing XMM would make it too slow to be of any use at all.
Interesting topic - thanks for starting it. I look forward to hearing what others think.
Ross.
If someone else does, I'll happily adopt it.
Ross.
I have been giving this much thought, and will be trying different approaches - as soon as I have some time :)
The first pre-release of PropellerBasic will not do any sort of LMM caching, but I am planning to add full FCACHE support.
Later, when I add XLMM support, I will look into how to best speed up XLMM - I posted an idea in the other thread that should work, but it would chew up most of the cogs.
Now PropII... I am drooling at what will be possible with it, however I expect it will be quite a while before it is on the market.
Best Regards,
Bill
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com Please use mikronauts _at_ gmail _dot_ com to contact me off-forum, my PM is almost totally full
Morpheusdual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory IO board kit $89.95, both kits $189.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler for the Propeller Largos - a feature full nano operating system for the Propeller
Adding a "pad" for NOP (long 0) before response would make an XMM loop like this possible:
The hardware solutions (morpheus/triblade/dracblade) show it is quite possible to add hardware to increase memory. 512k is a common value. There are the long discussions on speed and essentially speed is proportional to pins so a 28 pin solution using parallel ram is one extreme and a 2 pin I2C serial ram is at the other extreme.
For me, hub code is not speed critical. Hub spin is already so much slower than pasm that slowing it down a little further is not a problem. What I'd like to be able to do is to keep writing spin code without ever having to worry about running out of space. Could this be done?
The first requirement might be a different compiler - but BST has the nuts and bolts to do this. The next criterion might be to make the compiler as similar as possible to standard spin code (so rewriting code is not so daunting).
I wonder (maybe this has been done before?) if you could set up overlays in spin? Set aside 10k of overlay space. Start writing your spin code. When it runs out of space for another subroutine, instead of jumping to it locally, the compiler grabs it from high memory, puts it in that top 10k, runs it, then exits?
If all variables are kept local then shared global variables are not a problem (move a global to a local variable and pass it with the subroutine call?)
Brainstorming further, a 'program' might be a self contained binary file - say it was 100k. The bottom bit fits into the propeller and contains code to call higher bits of itself and move them into that upper 10k (or whatever) as required. You would need a way of storing a 100k program as it won't fit in eeprom, but again that is already possible with hardware that has sd card access.
Even cog overlays could be called from hub which are then called from upper memory.
Maybe you want more control over this than allowing the compiler to do it, but perhaps you could write your standard spin code and for overlay code you add an #ifdef overlay so the compiler knows that sits in high memory.
If this is too hard in Spin, maybe it is possible in C or Basic? And it ought to be possible for different hardware if you define some standard procedures, e.g. read_128_longs and write_128_longs, and then only those need to be recoded for each hardware variant?
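For example, the interface might be no more than this (a sketch; the method names are the ones proposed above, and the bodies are the only board-specific part):

PUB read_128_longs(hub_ptr, ext_addr)
  '' copy 128 longs (512 bytes) from external memory at ext_addr into hub RAM at hub_ptr
  '' board-specific body goes here (latched SRAM, I2C RAM, SD, ...)
  result~

PUB write_128_longs(hub_ptr, ext_addr)
  '' copy 128 longs from hub RAM at hub_ptr out to external memory at ext_addr
  '' board-specific body goes here
  result~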
This idea of caching is very interesting.
Yes it is possible. I started working on exactly this idea which I call "BigSpin", but got distracted
... jazzed looks out the window and shouts: "hey you kids get outta my yard!" ...
Now what was I doing? Oh yea, caching ....
This falls along the lines of items 3 & 4, but not exactly. If you look at my spin debugger, SPUD,
you will see how to steal time and get stack/object base info from the spin interpreter. Like this....
The fetch_pcurr routine is called near the spin Main Loop "loop" label.
Other support for the VMM idea you have described is in the file SpudVMM.spin in the SPUD zip.
http://forums.parallax.com/showthread.php?p=831329
The VMM framework is there but not fully tested. I added a way to do a method lookup based on
a stub method signature. Detecting the signature gives a "pointer" for which method body to load.
Using BSTC listings makes it possible to manage a library method body table, so we know where
to find the code and how big it is. The table and library code can live in fixed storage SDCARD,
EEPROM, Flash, or external memory.
A difficulty, which you have addressed to a point, is object variables and DAT references. For
fixed storage, I would be concerned about wearing out the device ... no problem for external
memory. One could restrict the methods artificially to use only local variables ... but that might
be very unpopular ....
The only real limit to what I've described is the number of methods Spin allows. Maybe that
can be relaxed in the compiler? Is it a function of the interpreter? Not sure.
So in summary, what you suggest is entirely possible. It's just a matter of someone implementing it [noparse]:)[/noparse]
$FE spin methods per object in theory, although just the method table for that alone would consume ~1K.
The first byte in the method table is the number of sub-objects.
The second is the number of Spin methods (PUB+PRI) + 1 to account for the first method table entry
The next word is the relative address of the end of the object (the object length effectively)
So you are limited in the number of objects you can include (although the Parallax compiler limits that to 64 from memory) and the number of spin methods by the table entry size.
These routines do not include an L1 cache or any XMM access.
I have L1 + XMM fetch code that works in a single COG, but since it's all in one COG, I can't
pass the response back to the LMM engine without "blocking" (finishing the entire cache fetch).
I'm not sure at this point whether straight access like below is better than using an inter-cog L2
cache or not. For now it seems like a tie with L2 caching, but the L2 caching doesn't read the XMM yet.
This compares with ~1.1 MIPS with direct fetch and execute XLMM. The cache fetches are
burst, the direct fetches are one at a time. The actual direct XMM access time is 750ns but
overhead makes it worse.
Cache hit time at 80MHz is 600ns. Cache miss time is about 6us ... not great, but 8 longs get
read in that time. Theoretical maximum rate is 1.6 MIPS in the current design. Actual data
read rate on burst to HUB is 6.4MB/s. Cache is un-tagged and provides no write back ability.
Functionally, the LMM engine is in one COG and the Cache engine takes up 5 COGs. Pretty
sad that the data rate is so low "per COG". Currently, it takes 3 HUB access cycles to manage
the Cache pointers. It takes 2 HUB access cycles for the main Cache reader to cause a long fetch.
The main fetch/execute loop looks like the code below. The read32 routine for grabbing a long after FJMP, etc., is about the same.
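(A sketch of the general shape only, assuming a single 32-long HUB buffer and a hub "waterline" variable advanced by the reader COGs; LMM jumps, block wrap, and miss requests are omitted. PC, inst, and WaterPtr are assumed to be defined by the kernel.)

nexti   rdlong  water, WaterPtr         ' waterline = highest address fetched so far
        cmp     PC, water wc            ' C set if PC is below the waterline
 if_ae  jmp     #nexti                  ' not fetched yet ... poll until the reader catches up
        mov     addr, PC
        and     addr, #$7F              ' offset within the 32-long (128 byte) block
        add     addr, CacheBase         ' hub address of the cached copy
        rdlong  inst, addr              ' pull the instruction out of the HUB cache
        add     PC, #4
inst    nop                             ' execute it
        jmp     #nexti

water   long    0
addr    long    0
CacheBase long  0                       ' hub address of the cache block buffer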
The "waterline" set by the cache engine allows the cache reader to operate in parallel with the
execution loop. My conclusion at this point has become: "It's more of a PITA than it's worth"
unless you need it to enhance latched design performance (possible but PITA). Using up so
many cogs (among other things, like spending too much time on a fruitless endeavor) is a waste.
Other things can be done to optimize, but as a friend once reminded me:
"You can polish a turd for only so long before the **it gets all over your hands."
So with that I'll stop messing with caching and get on to something more productive.
Interesting result. Thanks for the update.
The first thing that occurs to me is that if people like us can make something work in 5 cogs, there's got to be a good chance that someone like Chip could make a version that works in only 3 cogs - and at that point it might begin to look worthwhile.
Ross.
It is pretty much as I expected, but well worth the exercise. You never know, someone may come back to this later and discover something missed, or better still, a new algorithm.
Yes. Bill Henning has 3 COG code for VGA bursts that might be usable for a cache. It uses a long string of accesses. The big challenge I faced though was getting enough data fast enough for a small "waterline" that would allow the LMM engine to work in parallel with the cache loader engine. I got the latency down to a 3us miss period so that the waterline could be set to 4 longs with some advantage for performance. Yes, there may be others who can solve the problem better. Even the last fetch/execute loop I posted could be optimized maybe.
BTW: That last loop is the closest I could come to a generic interface for a kernel. I have another cache engine that can be used with it, but it is no faster than straight XMM ... which is ok if that's all you need. If you can make use of it, please do.
I've included a Propalyzer screen shot of a cache miss reload and many hits for your amusement. The tick is set for 12.5ns, but it's really reading at 10ns so the numbers at the top are wrong. The whole screen is about 18us instead of the 22us shown. It shows getting a long every 12 instructions with the potential for double that rate since there is waste in the reader cog ... I won't bother with it too much though.
Re sd cards 'wearing out', only writes wear them out, so you can have code on an sd card that you read many times into hub/cogs and that won't wear out an sd card. And/or you can read it once from an sd card into an external ram chip, then read from the ram chip into hub/cogs and that would be faster.
The cog caching is already working. I don't understand *how* it works on the zicog, but I understand it enough to be able to replicate it. Then write some cache type code, copy and paste a skeleton of code that goes at the beginning and end. Then add a reference to it somewhere (actually there are 3 references to add) and then put in the code. It is a bit complicated, but possibly could be automated via an IDE.
Cache code for the cog is being stored in hub, so eventually one might run out of space in hub. But they are just bytes, so you could always bring them in from external ram rather than from hub ram. And the building blocks for that sort of code already exist.
So I think caching for pasm already exists, and just needs to be automated/simplified.
What intrigues me is the possibility of caching in spin. First thing would be spin code that is 'portable', i.e. it can reside anywhere in memory. For instance in Z80 it is possible to write portable code by never using the JP instruction or the CALL instruction. The only jump instruction you can use is the Jump Relative instruction, which jumps back or forward by n bytes. Can you do such a thing in Spin?
Or, you always compile the code for a fixed location somewhere near the top of hub memory. Each object is loaded to this location and run, it cannot be longer than n bytes and only one object can be there at any one time. But fast transfers between hub and external ram mean you can be loading bits of code in and out, and it could be very handy for the routines you only use infrequently - eg a bunch of Spin string manipulation routines.
Still, many bits of code would fit these criteria.
But my brain starts to hurt thinking of how to get code like this to work and how to keep things simple for the compiler. Maybe you have a routine called
PUB Cache
You pass 10 variables and you return 10 variables. That should be enough for most things.
After PUB Cache is some code (probably a single call to a 'ram load' or 'sd load' that copies bytes to the cache location. That might be just a few bytes above this point, so the code then continues on at that point. You could be precise about that if you knew how long the code was for the bytefill code, or maybe there are some NOPs so it doesn't overlap). You always use the same variable names as those that are passed (variable1, variable2 etc. Or i,j,k for Fortran programmers). Maybe you pass a small array as well, or the location of a small array?
So a question would be whether a compiler could handle the internal counters and flags and pointers for something like a 'repeat' in Spin that was always compiled to a fixed location?
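To make that concrete, a sketch of the shape of it (CACHE_AT, sd_load, and run_cached are hypothetical placeholders, not real library calls; the dispatch into relocated bytecode is exactly the hard part being discussed):

CON
  CACHE_AT = $6000                      ' hypothetical fixed hub address for the cache area

PUB Cache(id, v1, v2, v3) : r1
  '' load routine 'id' from SD or external RAM into the fixed cache area, then run it
  sd_load(CACHE_AT, id)                 ' hypothetical: copy the routine's bytes into place
  r1 := run_cached(CACHE_AT, v1, v2, v3) ' hypothetical dispatch into the loaded code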
I'm afraid Spin would not be very adequate for implementing VM stuff. No compiler changes would be required, but a compiler like BSTC that generates listings would be necessary for creating lookup information. As implementation goes, I've considered building a library and lookup table which, as you summarized, could go into XMM ... having the library loadable from any media without XMM can also be done, but would of course impact performance. A dynamic load library would be good to put in fast virtual memory. There are good models of this already.
I'm not sure how you ever got the idea that reading an SD card would wear it out :) ... not from anything I said, I hope. The caching I've explored and described here is for running 32 bit external LMM code faster than by normal one-at-a-time fetch/execute from XMM, with potential for application on address-latched memory systems ... it has nothing to do with the Z80 emulation, which probably has such a feature implemented on RAM-DISC or whatever ... not that I care to know one way or another. You can keep that to yourself :)
Re wearing out sd cards, it would depend on the cache model. I think if you have a model where you bring in a group of bytes that represent an object, and you put them in hub and then execute it, then return a value and then throw away that code in hub, well that won't wear out the sd card obviously.
But I'm pondering your initial post: "When exactly does it become useful or imperative to start using a swap partition on "fixed" media like SD card?"
If you have a 'swap' partition, it depends on how you implement it. If at the level of a 'block of code' that you bring into hub, that would be fine. But if you are running your memory model at the 'long' size, or at the single-instruction size (which I think you are?), then if a piece of code in spin or pasm says 'modify value x in memory', that would modify the value in sd memory. (If that is the swap partition model you are looking at.)
I think that was the path Windows went down. Early PCs had x amount of ram and then a swap file that sat on the hard drive and emulated more memory. Then you could add a memory chip and get a dramatic improvement. But I think there is a subtle difference here as I think a swap file will work on a physical hard drive but it won't work on a compact flash hard drive or an sd drive.
The zicog dracblade board I'm using has both sd card and 512k of ram, so given a choice, I'd do the swap file in ram rather than on the sd. But you still need the sd to load the initial code (unless you use a bigger eeprom). I found this link that is quite interesting propeller.wikispaces.com/LMM+AiChip+Industries Such a concept would be quite easy to implement on a dracblade board (not using CP/M, just using spin and pasm) as you could move in blocks and execute them one instruction at a time. Interesting that they list certain instructions you can't use and have to replace with calls. (the calls and jumps, as would be expected).
I'm still pondering the merits of models that implement one instruction at a time vs ones that move blocks in, then execute them, then move the next block in.
Yours is a cache model that does things in groups of instructions rather than one at a time, right?
How big do you make a 'group' of instructions?
I suppose one model might be to make the size of a 'block' of memory the same as a cog. Consider a program that had hundreds of little subroutines/objects, that were all cog sized. To run one, you just start a cog and use the ram from hub as per the standard prop system. But what about when you run out of hub ram? How about a model where it boots up, moves 200 cog sized 'cogjects' from sd card into external 512k ram, then you run your program but whenever you start a cog and fill up its memory, you get that code from external 512k ram rather than from hub ram? Maybe this has already been done?
Hmm, if I had to code within those constraints I might write some string routines, but maybe they don't all fit in one cog, so I'd put 10 in one 'cogject' and another 10 in another 'cogject' and compartmentalise the program that way. Then you might need some code that works out which 'cogject' to load. So maybe instead of coding with a long list of PUBs you instead have one PUB but pass it one of a long list of keywords? Certainly a slightly different style of programming. Is that getting close to the idea you had of a library and lookup table?
Would that give you the flexibility to still code spin and pasm together?
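A sketch of what launching a "cogject" might look like (xram_read and COGJECT_BASE are hypothetical, standing in for whatever XMM driver the board provides; 496 longs is a full COG image):

CON
  COGJECT_BASE = 0                      ' hypothetical base of the cogject store in external RAM

VAR
  long  stage[496]                      ' hub staging buffer for one cog image
  long  mailbox                         ' PAR block for the started cog

PUB StartCogject(id) : cog
  '' fetch cogject 'id' from external RAM into hub, then launch it in a fresh COG
  xram_read(@stage, COGJECT_BASE + id * 496 * 4, 496 * 4) ' hypothetical XMM driver call
  cog := cognew(@stage, @mailbox)       ' cognew copies the 496 longs into the cog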
On XLMM, one has 32 bit instructions that run on a machine similar to what you found in the LMM wiki. These instructions are normally stored in HUB RAM and are fetched and run one at a time by the LMM COG ... this is the basis for Catalina, ICC, and Bill's LAS (and Basic). The jump class of instructions has special needs in this model, but all the other PASM instructions can be executed straight up. Some advantages come from that, such as long constants and fast jumps effected by changing the program counter in addition to LMM jumps. This model is very easily moved into one where the code is executed straight from external memory at speeds that are just as fast as Spin (Spin is about a 1 MIPS machine, depending on the instruction).
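For example, the classic LMM fast-jump primitive is just this (names vary between kernels, but the idiom is standard):

' compiled code emits  jmp #FJMP  followed by a long holding the target address
FJMP    rdlong  PC, PC                  ' PC already points past the jmp, at the target long
        jmp     #nexti                  ' resume the fetch/execute loop at the new PC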
BigSpin would work on blocks of code rather than a long at a time. Each "library block" would need to be loaded per method into areas of memory, based on an allocation mechanism with statistics to keep the most-used methods in HUB RAM. The hard part of that model would be accounting for object variable offsets. At first, for proof of concept, methods having only local variables and results would be used. The Spin internal stub I showed in the first BigSpin post of this thread places all object and stack information into shared HUB memory for the method dispatcher to use with the loaded blocks. A library method could be run as a PASM block or, more conveniently, as an LMM block. One could even start another Spin interpreter variant COG for running Spin library methods, but that would have to be hacked a special way.
I haven't given the swap idea much thought and don't have comments on it.
Added: Forgot to answer this question.
The XLMM cache I was researching reads a block of memory (32 longs today), which is useful for hardware that uses latches, etc.... Once the "waterline" is reached, before the entire block has been read by the cache reader subsystem, the LMM (the one-at-a-time execution engine) starts running instructions. So the simple answer is: it pulls in groups of instructions for one-at-a-time execution, and execution starts before the entire group has been fetched.