Instruction caching ideas for P2 in XMMC mode with SDRAM

Following on from some discussions on this topic in another thread, http://forums.parallax.com/showthread.php/147786-Is-Spin2Cpp-version-of-quot-quot-(random)-slow, I have started a new thread to cover it, as it probably belongs in its own place.
I had mentioned a crazy idea a couple of days ago to try to get faster VM kernel performance by doing the cache handling of instructions in the kernel COG when running with SDRAM on the future P2 in XMMC mode with large applications. I know we are going to need very good instruction caching in this configuration, as the latency is very significant for SDRAM, especially if there are video COGs sharing the SDRAM as well and doing large transfers. After playing around with my idea a bit more I may have found a way to get it to run every 2 hub windows (10MIPs max) and it may potentially be compatible with PropGCC....
I am basically trying to avoid the extra request and polling calls to the cache driver for each and every instruction, which would burn more hub cycles. It's not fully complete, but my idea is below.
It uses QUADs of longs in hub memory to hold what I call "L1 cache" entries. This may be a bad name, I know, given the cache row is only one instruction long, but I want to differentiate it from the regular cache entries in hub memory that PropGCC already uses, which I think could still be considered L2 cache entries - and could even coexist with my L1 cache. The L2 could hold larger burst transfers from the slow main memory, and the L1 quads can then refill themselves from the larger L2 row as required during L1 misses, as would happen in a regular L1/L2 cache hierarchy.
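Purely to help picture it, here is a host-side C sketch of what one of these quad entries and its lookup amount to (the struct, names and sizes are mine for illustration only, not the exact PASM layout or timing):

#include <stdint.h>

/* One "L1" quad: two (instruction, address) pairs, i.e. a 2-way set associative entry */
typedef struct {
    uint32_t instr_x;   /* instruction cached from addr_x */
    uint32_t addr_x;
    uint32_t instr_y;   /* instruction cached from addr_y */
    uint32_t addr_y;
} l1_quad_t;

#define L1_QUADS 256    /* e.g. 256 quads = 4kB of hub RAM */
static l1_quad_t l1[L1_QUADS];

/* Returns 1 and the cached instruction on a hit, 0 on a miss
   (the miss path would refill this quad from the L2 row or SDRAM). */
static int l1_lookup(uint32_t pc, uint32_t *instr)
{
    l1_quad_t *q = &l1[(pc >> 2) & (L1_QUADS - 1)];   /* index quad by long-aligned PC */
    if (q->addr_x == pc) { *instr = q->instr_x; return 1; }
    if (q->addr_y == pc) { *instr = q->instr_y; return 1; }
    return 0;
}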
There are other ideas floating around like using P2 stack/CLUT memory as well for the tags which could help even further. But that's separate to this.
Not sure if any of it will fly but it might generate some useful discussion. The VM code loop is shown below and, if I designed it correctly, it currently fits into a 16 instruction loop and so should be hitting hub windows each time for simple single clock PASM instructions. There are also some stats counters present that could be used to track hit/miss performance and for experimenting with different cache sizes to see which works best. I think counters like these are a useful addition to any VM with caching whenever there are spare slots in the main loop, especially for benchmarking.
Take a look and let me know if you like it or think it won't work or can improve it...
Roger.
' Here is one possible idea for getting an XMMC type of VM loop to perform a bit faster on P2
' An "L1 I-cache" is formed from an array of quads in hub memory (eg. 256 quads or 4kB)
' Code can then be read and then executed from these quads instead of always reading
' the main memory or polling a cache driver.
' Each quad contains 2 instructions in two of its longs and 2 addresses in the
' other 2 longs, and this forms a 2 way set associative cache entry. The format is this:
' Long    Data
' ----    ----
'  0      instruction @address_x
'  1      address_x
'  2      instruction @address_y
'  3      address_y
' The quads are indexed by the VM shifting, masking and offsetting the PC to form an address.
' If required, the VM can also read hub memory addresses directly.
' For convenience and speed the address space is split into two halves with the MSB of the PC
' indicating which memory to read from. The order could be reversed if required.
' 0x00000000-0x7fffffff is hub memory (with foldover)
' 0x80000000-0xffffffff is external memory (such as SDRAM)
' One special address, such as 0xffffffff or 0x00000000, would need to be reserved to indicate an empty quad entry.
' Any cache misses on external memory lookups will need to refill the L1 entry from
' either a separate L2 cache row or from SDRAM directly, then re-execute it (still to be coded)
' The main advantage with this approach is that a tight loop (once loaded) can run
' up to 10MIPs and because it is also 2-way set associative the instructions can wrap
' on certain address boundaries and still avoid additional misses. You'd need three
' memory addresses mapping to the same quad index before you get a miss (once both ways are loaded).
' For tight loops this should not occur too often.
' Key disadvantage is that currently there is no prefetching with this L1 implementation,
' so each instruction needs to be loaded one at a time initially to get going.
' However this could perhaps be optimized during misses to load a few at a time but
' will then slow down miss cases a little. That may still be worth it. TBD.
reentry
repd #16 ' create a 16 instruction loop (10 MIPs max @160MHz)
setquaz #quad0 ' point to quad address in COG memory and clear them all
add starts, #1 ' spare spacer instructions can also track some stats
nop ' v--- The 16+ cycle loop starts on the next instruction below
mov addr, pc ' get current PC of VM
muxc zc, #1 ' save VM's C flag before we lose it
rol addr, #2 wc ' get MSB of PC into C and turn long address into a quad address
and addr, mask ' mask off bits for depth of L1 cache lookup
add addr, base ' add to base address of L1 cache to get our quads real address
rdquad addr ' read the quad at this L1 cache address
muxnz zc, #2 ' save VM's Z flag before we lose it
if_nc jmp #hub_memory ' access hub memory directly somewhere outside this loop
add cycles, #1 ' count instructions while we wait (or use for another purpose)
cmp quad3, pc wz ' quad data is now ready so check for address2 match with our PC
if_z mov quad0, quad2 ' if so move instruction2 up to executable position in quad0
if_nz cmp quad1, pc wz ' if the above match failed go check for address1 match with our PC
if_nz jmp #cache_miss ' if no match in either address then we have a miss and we exit loop
resume shr zc, #1 wz, wc ' restore both VM flags
add pc, #4 ' increment VM PC
quad0 nop ' instruction1 placeholder from L1 cache read is executed
' ^--- The loop returns to its beginning (or has already jumped out)
quad1 nop ' address1 placeholder from L1 cache read
quad2 nop ' instruction2 placeholder from L1 cache read
quad3 nop ' address2 placeholder from L1 cache read
hub_memory
jmpd #resume ' prepare delayed jump to resume in 3 more instructions time
mov quad1, restart ' ensure loop is restarted once our hub instruction executes
rdlong quad0, pc ' PASM code to read instructions directly from hub address starts here
nop ' do we need more nop spacers for jmpd (the above sequence may need reordering?)
cache_miss
..todo... ' PASM code to refill L1 cache from SDRAM or L2 cache starts here
jmpd #reentry
add misses, #1 ' count misses (if we have time somewhere)
shr zc, #1 wz, wc ' restore both VM flags
nop ' spacer
restart jmp #reentry ' go restart the VM loop again
pc long 0 ' VM's program counter
zc long 0 ' placeholder to save z,c flags
addr long 0 ' temporary address variable
base long CACHE_BASE ' base of L1 cache
mask long CACHE_MASK ' mask of address bits used for L1 cache
cycles long 0 ' counter for enabling cache hit stats
starts long 0 ' counter for tracking VM loop starts for stats
misses long 0 ' counter for tracking cache misses
Comments
1) Any L1 cache hits only take 2 hub windows (no prefetching), which is a max of 10MIPs when a nice loop is loaded and running. The L1 can be made quite deep too, so it will hold large chunks of code, and the two way associativity will help a lot in code that crosses specific boundaries that map to the same cache entry.
2) The L2 cache is only ever used if the L1 I-cache lookup fails, and a hit from this L2 cache plus repopulating the L1 takes maybe ~6 hub windows per executed PASM instruction (I am still looking at something for this part - plus the L2 cache tags could possibly be held in stack RAM to speed things up). The max rate is then 3.3MIPs at that hub window rate if all the L1 lookups fail but the L2 succeeds. This would typically happen on C code that only ever executes linearly once (eg. application bootup initialization).
3) The SDRAM is used only for the L2 cache misses, and we could prefetch an entire L2 cache row of say 16-32 instructions at a time to help get things moving and ultimately help fill the L1 cache from subsequent L2 entry hits. This size would need to be tuned to take into account other video clients' latency requirements on the SDRAM driver if they share it as well. I'd estimate it to be somewhere in the vicinity of about 20 hub windows for 16 instruction loads from SDRAM if we can get in right away. That is 1 MIP @160MHz. Large video transfers from SDRAM can slow this further however.
The L2 row size is the biggest tradeoff. You don't want to waste time prefetching lots of code that won't be executed, but at the same time if you make it too small you will miss more often and have to hit the SDRAM more frequently, which is slow.
So, using these numbers just as an example, if you get say a 0.7 hit rate on L1 and a 0.9 hit rate on L2 during looping (I may be being overly optimistic here) you might get performance around this figure:
160 MHz / ((0.7 * 2 + 0.3 * (0.9 * 6 + 0.1 * 20)) * 8) = ~5.5 MIPs.
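That arithmetic is easy to play with for different hit rates; here is the same model as a tiny C program (the hit rates and window counts are just the example figures above, not measurements):

#include <stdio.h>

int main(void)
{
    const double clk_mhz   = 160.0;  /* P2 clock */
    const double win       = 8.0;    /* clocks per hub window */
    const double l1_hit    = 0.7;    /* assumed L1 hit rate */
    const double l2_hit    = 0.9;    /* assumed L2 hit rate on L1 misses */
    const double l1_cost   = 2.0;    /* hub windows per L1 hit */
    const double l2_cost   = 6.0;    /* hub windows per L2 hit */
    const double miss_cost = 20.0;   /* hub windows per SDRAM refill */

    /* average hub windows per executed instruction */
    double windows = l1_hit * l1_cost
                   + (1.0 - l1_hit) * (l2_hit * l2_cost + (1.0 - l2_hit) * miss_cost);
    printf("~%.1f MIPs\n", clk_mhz / (windows * win));   /* prints ~5.5 */
    return 0;
}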
In general I'd probably hope for a few MIPs or so with XMMC and SDRAM on P2 as a ballpark figure, but it all depends on the application code and what other burden the SDRAM has from its other clients, if any.
Roger.
The code for that approach is something like this:
restart     repd    #16
            nop
            nop
            nop
            muxc    zc, #1          ' beginning of 16 instruction loop
            muxnz   zc, #2
            mov     pc, pc wc       ' extract msb
            add     cycles, #1
    if_nc   jmp     #hub_mem
            setspa  pc
            popar   addr1
            popar   instr
            popar   addr2
            cmp     pc, addr1 wz
    if_nz   popar   instr
    if_nz   cmp     pc, addr2 wz
    if_nz   jmp     #cache_miss
resume      add     pc, #4
            shr     zc, #1 wz, wc
instr       nop                     ' end of normal loop
            jmp     #restart

addr1       long    0
addr2       long    0
pc          long    0
zc          long    0
cycles      long    0
hub         long    0

hub_mem     jmpd    #resume
            add     hub, #1
            rdlong  instr, pc
            nop

cache_miss
            ' ... code to do cache reload would go here
            jmp     #resume
I will need to take a good look at it. I got bogged down in my own ideas. When I initially browsed it, it seemed to be a lot of instructions (seemingly more than 2 hub windows on P2), but I need to look at it again to double-check.
[EDIT] Just read your latest code again. On the P1 your cache hits cost 22 PASM instructions plus a hub read, so that would be scaled up to the next multiple of 16 clocks for the hub windows: 22*4 + 16 = 104, which rounds up to 112 cycles or 7 hub windows. 80/112 = 0.71 MIPs best case on the P1 @ 80MHz.
If you ran that algorithm basically as is on a P2 you'd still get around 4 hub windows per instruction in the absolute best case if you could have delayed jumps everywhere (my L1 example above should be 2x faster than that). Including any jump penalties might bring it closer to 5 or 6 hub windows, but I'd assume it would be recoded suitably to try to avoid that.
Now I really need to find a way to get the L2 hits to run faster in my variant using last address caching somehow. If I can get 4-5 windows for that and update the L1 at the same time I will be happy. There should be some way to do this...
That being said, a four way L2 cache might possibly be nice. If the L2 tags are loaded into the hub RAM you could potentially get 64 x 4 way set associativity in there. Also, one of the things I found is that computing the addresses can take a while with lots of masking/shifting etc. If the cache row start address in hub memory is somehow mapped from the PC address dynamically rather than statically, and is returned directly from a tag entry data structure that matches the masked PC address, the lookup could potentially be much faster. It might also lend itself to a circular queue of L2 row buffers and/or LRU/LFU type algorithms when flushing out rows that clash with ones being read in. Maybe the stack RAM can be used to hold a free list of L2 cache row pointers. Just thinking...
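To show the shape of that idea, here is a rough C sketch of a 4-way tag entry that carries the hub row address directly, so a tag match immediately yields the row base without extra shifting (the sizes and names are my assumptions, purely illustrative):

#include <stdint.h>

#define SETS     64            /* 64 sets x 4 ways as suggested above */
#define WAYS     4
#define ROW_SIZE 64u           /* bytes per L2 row (example) */

typedef struct {
    uint32_t tag;       /* external address of the row: pc & ~(ROW_SIZE-1) */
    uint32_t hub_base;  /* hub RAM address where that row currently lives */
} l2_tag_t;

static l2_tag_t l2_tags[SETS][WAYS];

/* Hit: return the hub address of the instruction. Miss: return 0, caller refills. */
static uint32_t l2_lookup(uint32_t pc)
{
    uint32_t row = pc & ~(ROW_SIZE - 1);
    l2_tag_t *set = l2_tags[(pc / ROW_SIZE) % SETS];
    for (int w = 0; w < WAYS; w++)
        if (set[w].tag == row)
            return set[w].hub_base + (pc & (ROW_SIZE - 1)); /* row base comes straight from the tag */
    return 0;
}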
So perhaps this could be another use of the P2 stack RAM. It only uses half the stack RAM, interleaved as 2 longs (addr + opcode), then 2 empty longs, etc. A two way L1 cache version would be possible, but it is obviously a bit slower and needs more management to refill it. This way may work better, and maybe the other half of the stack RAM can then still be used for L2 cache tags...?
Roger.
initial_entry
            jmpd    #restart
            mov     pc, start       ' initial code entry point
            sub     pc, #4          ' compensate for initial start position
            shr     zc, #1 wz, wc   ' setup initial flags

l1_miss     '... normal L2 cache lookup code goes here,
            ' the instruction ultimately gets read and copied into instr during this part
            pushar  instr           ' copy code into L1 cache
            subspa  #2              ' move back to address position
            pushar  pc              ' copy address into L1 cache and fall through to loop below

restart     reps    #8, #0          ' repeat lots of times
            add     restarts, #1    ' stats
            ' The main 8 cycle VM loop follows
            add     pc, #4          ' increment PC
instr       nop                     ' execute instruction
            setspa  pc              ' point spa to PC's LSBs
            popar   addr            ' pop an address from stack
            sub     addr, pc        ' subtract our PC from it
            tjnz    #l1_miss, addr  ' compare address match with PC and miss if not
            popar   instr           ' pop associated data instruction
            add     cycles, #1      ' end of loop
            jmpd    #restart        ' when code exits after 16384 iterations, start it again
            add     loops, #1
            nop
            nop

pc          long    0               ' VM program counter
addr        long    0               ' temp address variable
start       long    START_ADDR      ' entry point of VM application
zc          long    1               ' saved flags / initial flags
cycles      long    0               ' (optional) hit/miss stats
misses      long    0               ' (optional) hit/miss stats
loops       long    0
restarts    long    0
With my L1 scheme in place:
20MIPs (8 cycles) for all L1 hits,
or 6.6MIPs (24 cycles) for L1 misses, but hits in the same L2 row as last time,
or X hub window cycles for other L2 hits,
or Y hub window cycles for other L2 misses
Without my L1 scheme:
10MIPs (16 cycles) for hits in same hub cache row as the last instruction used,
or X-1 hub window cycles for other cache hits (I have some sample code I am working on that should get this down to 4 hub windows (5 MIPs) for a 64 x 4 way set associative cache with tags in stack RAM, so X is probably 5 above, i.e. 4 MIPs with the L1)
or Y-1 hub window cycles for other misses
It's always a tradeoff. Small loops that run often can probably get quite a lot of benefit from the extra L1 and can peak at 20MIPs, which is great, but other application code that doesn't loop so much may prefer no L1, so it always gets the 10MIP rate when hitting the L2 cached row instead of 6.6MIPs, plus one hub window less for the other miss cases. It's pretty much impossible to pick something that suits all needs.
What is the consensus now? A single level cache? What cache algorithm?
Alternatively one could try out a combined 64 entry L1 cache and 64 x 2 way L2 tags sharing the stack RAM which could also potentially be useful. But this whole FCACHE thing may make the L1 less useful... will need to think about that.
I discovered another benefit of storing tags in the stack RAM: if you look up a 4 way set-associative cache which already has all 4 of its tag entries consumed with non-matching addresses, and you then need to drop one tag entry to make room for the new row being loaded, it is very easy to push 3 of the 4 tag entries back - after all, you have already popped and address-matched them, so they are sitting in COG memory. This frees up room for the next one to be loaded and also implements a replacement algorithm at the same time. Even better, it can happen in parallel while waiting for the results to come back from the SDRAM, so it doesn't reduce performance. This allows a kind of LRU (least recently used) cache algorithm to be implemented. Well, not strictly LRU, but it drops the cache row that has been around the longest of the 4, which is not necessarily the last one used, so it is an approximation.
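To make that replacement step concrete, here is a rough host-side C sketch of it, reusing the l2_tag_t/WAYS definitions from the earlier 4-way lookup sketch (again just my illustration of the idea, not the PASM):

/* On a miss in a full 4-way set: the ways were already popped and compared during
 * the lookup, so pushing ways 0..2 back one slot drops the oldest-loaded entry
 * (way 3) and frees way 0 for the row about to be fetched - an approximate LRU. */
static void l2_replace(l2_tag_t *set, uint32_t new_tag, uint32_t new_hub_base)
{
    for (int w = WAYS - 1; w > 0; w--)
        set[w] = set[w - 1];          /* age the three surviving entries */
    set[0].tag      = new_tag;        /* newest entry takes the first slot */
    set[0].hub_base = new_hub_base;
}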
But I still need to somehow shave off one more cycle to be able to read from a different cached row in 4 hub windows with a single cache. It's driving me crazy trying to align hub window reads and maximize delayed jump cycles etc! I thought I had it all worked out in my head an hour ago but can't quite resurrect it on paper yet. It might have to be 5 hub windows. Arggh!
This is the expected performance for each address lookup condition based on the number of hub windows I counted in each scheme I coded. It assumes 160MHz P2 and Chip's current SDRAM driver. The PASM code is pretty tightly optimized and uses delayed jumps as much as possible. I feel it is unlikely to be able to go much faster in this implementation at least.
I have allowed for the VM code to access hub memory from the address range 0x00000000-0x7fffffff, and the SDRAM from 0x80000000-0xFFFFFFFF. Some region of memory would need to be reserved to indicate an empty row (eg. 0xFFFFFFC0-0xFFFFFFFF, or 0x00000000-0x0000003f, assuming 64 byte cache row sizes).
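In C terms the address decode under that split would be something like the following (ROW_SIZE and the helper names are my assumptions for illustration):

#include <stdint.h>

#define ROW_SIZE 64u   /* assuming 64 byte cache rows as in the example */

/* Top bit of the VM address selects hub RAM vs SDRAM, and the top row of the
   external range is reserved as the "empty row" marker. */
static int      is_hub_addr(uint32_t pc)  { return (pc & 0x80000000u) == 0; }
static uint32_t row_of(uint32_t pc)       { return pc & ~(ROW_SIZE - 1); }
static int      is_empty_tag(uint32_t t)  { return t == (0xFFFFFFFFu & ~(ROW_SIZE - 1)); } /* 0xFFFFFFC0 */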
Scheme 1 has an L1 + L2 cache both sharing the stack/CLUT RAM (half each), the others use a single cache only with address tags again stored in the stack/CLUT RAM.
I still quite like the idea of scheme 1, but it can suffer slightly slower memory accesses when the L1 lookup fails compared to the other single level cache schemes.
Peak performance @160MHz

Scheme  L1 cache         L2 cache (1)           L1 hits   Hub memory access    L2 same row hits  L2 other row hits  L2 misses (3)
1       64 entry direct  64 x 2 way set assoc   20 MIPs   20/6.67/4 MIPs (2)   6.67 MIPs         3.33 MIPs          <= ~1 MIPs
2       None             128 x 2 way set assoc  N/A       10 MIPs              10 MIPs           5 MIPs             <= ~1 MIPs
3       None             64 x 4 way set assoc   N/A       10 MIPs              10 MIPs           4 MIPs             <= ~1 MIPs
4       None             32 x 8 way set assoc   N/A       10 MIPs              10 MIPs           3.3 MIPs           <= ~1 MIPs

Notes
1 The actual L2 row size holding instructions is configurable depending on how much hub RAM you want to use. Hit timing is unaffected by the row size, but L2 miss timing is affected.
2 In this case any hub memory reads can also make use of the L1 cache, and performance is higher if reading from the same "row" of hub RAM, i.e. 20 MIPs if L1 is hit, 6.67 MIPs if in the same row of hub RAM, and 4 MIPs if neither.
3 L2 misses will read in an entire row from the SDRAM. This example assumes 64 bytes/row and exclusive SDRAM access. Timing will be rather variable when other clients share the SDRAM, so this is a ballpark number in the best case for transferring 4 quads of instructions.
Thanks for all of your hard work on this!
David
I agree, for some future multi-COG SMP system keeping L2 tags in CLUT RAM would require each COG to have its own cache in hub RAM, which would eat up more hub memory. However, one possibility that may help there is to use my earlier idea in post #5 of a separate small 64 x 2 way L1 in the CLUT RAM, private to each COG, combined with a common cache driver accessed by each COG that manages a single L2 cache - but I imagine that L2 cache driver could likely become a major bottleneck once lots of COGs start sharing it.
Doing multi-COG stuff like that in the future with caching may need its own new memory model - perhaps a new "MXMMC" model could be created for multi-COG XMM code for example.
No problem. I just hope it's useful and may speed things up a bit. On the P2 LMM is going to be the faster option over XMMC but once you are dealing with much larger applications like lots of OS or protocol code etc, then the large SDRAM and XMMC with good performance in the 5-20 MIP range will open up a whole new world of possibilities. I would really hope to see something decent for large applications running from SDRAM on P2 one day.
I'll try to tidy up my various sample code VM loops a little and post it here soon so you can take a look to see if any ideas from it make sense to perhaps incorporate into PropGCC for P2 at some point.
Roger.
I agree that trying to share even a second level cache is likely to result in very poor performance. In addition, one of the big claimed advantages of the Propeller is its deterministic execution. That won't be possible running cached code. I would think the best approach would be to have a single COG running a "main program" and then use the other COGs to run either LMM, COG C, or PASM drivers. After all, the Propeller is all about "soft peripherals" and those are likely to require deterministic execution.
Thanks for offering to clean up your sample code so we can try it in the P2 XMM kernel! Also, do you see any issues that would come up if we want to use at least the second level cache for data access as well as code access? PropGCC supports the xmm-single and xmm-split modes that put both code and data in external memory.
I am thinking for the full XMM model, with both data and code stored in the same SDRAM memory space, it could make sense to create two separate caches if you can spare a bit more hub space. Eg. you could keep a nice 4kB L2 I-cache for instructions and have a separate general purpose D-cache of some appropriate size in hub memory. This decouples execution performance from reading in lots of data from SDRAM, as it would never have to empty the I-cache to make room for data rows, and it simplifies a few things, but it could require a few more memory range checks.
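A rough C sketch of how a VM front end might route accesses under that split (the function names here are placeholders of mine, not an existing PropGCC API):

#include <stdint.h>

/* Code fetches go through an I-cache, data loads through a separate D-cache, so
   large data transfers from SDRAM never evict the code rows currently executing.
   These three routines stand in for whatever cache/SDRAM drivers are used. */
extern uint32_t icache_fetch(uint32_t pc);
extern uint32_t dcache_read(uint32_t addr);
extern uint32_t hub_read(uint32_t addr);   /* direct hub RAM access */

static uint32_t vm_fetch(uint32_t pc)      /* instruction path */
{
    return (pc & 0x80000000u) ? icache_fetch(pc) : hub_read(pc);
}

static uint32_t vm_load(uint32_t addr)     /* data path, extra range check */
{
    return (addr & 0x80000000u) ? dcache_read(addr) : hub_read(addr);
}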
Just a thought.
Disclaimer: This code is completely untested, uncompiled etc. as I don't have any P2 tools or a DE0-NANO FPGA etc., so please take it as such - there's probably a reasonable chance of a bug in there somewhere.
I will try to post the other L2 only schemes when I get more time to clean that up further. Unfortunately I really have already spent more time on this than I should have done and neglected some other stuff I need to do so I am not sure when I will get back to it (hopefully soon). :frown:
initial_startup
            reps    #1, #256        ' clear out caches in stack/CLUT RAM to prevent false matches
            setspa  #0              ' start at beginning of CLUT
            pusha   last            ' fill cache with pattern of all 1's
            jmpd    #restart        ' jump into VM after 3 more cycles
            mov     pc, start       ' initial code entry point is stored here
            sub     pc, #4          ' compensate for start position by rewinding 1 instruction
            shr     zc, #1 wz, wc   ' initialize flags

l2_miss
' We have missed the L2 cache lookup so we need to read in from SDRAM.
' We need an algorithm to decide which of the two L2 row entries should be overwritten.
' To do this we can use the upper half of the L2 cache stack RAM to hold the timestamps
' for each entry and then compare them to decide which of the pair is oldest.
'
' The 256 long stack RAM is arranged like this below with interleaving.
' Half the RAM is allocated to be a 64 entry direct mapped L1 cache.
' The other half is allocated to be a 32 x 2 way set associative cache
' with timestamps storing when the row was last loaded into hub RAM.
'
' Index  Contents
'   0    L1 cached address
'   1    L1 associated data (opcode) from address in index 0
'   2    L2 cached row 0 upper address bits
'   3    L2 cached row 1 upper address bits
'   4    L1 cached address
'   5    L1 associated data (opcode) from address in index 4
'   6    L2 cached row 2 upper address bits
'   7    L2 cached row 3 upper address bits
'  ...
' upper half now holds timestamps
' 128    L1 cached address
' 129    L1 associated data (opcode) from address in index 128
' 130    L2 timestamp for tag entry at index 2 (row 0)
' 131    L2 timestamp for tag entry at index 3 (row 1)
' 132    L1 cached address
' 133    L1 associated data (opcode) from address in index 132
' 134    L2 timestamp for tag entry at index 6 (row 2)
' 135    L2 timestamp for tag entry at index 7 (row 3)
'  ... etc
            addspb  #128            ' move up to upper half of stack RAM which holds timestamps
            popb    ts2             ' pop the timestamp for second tag entry of pair
            popb    ts1             ' pop the timestamp for first tag entry of pair
            subcnt  ts1             ' get elapsed time since first tag entry was loaded
            subspb  #126            ' rollback pointer and separate the two subcnts, to avoid getting CNTH
            subcnt  ts2             ' get elapsed time since second tag entry was loaded
            cmp     ts1, ts2 wc     ' determine which entry is oldest and setup spb appropriately
    if_nc   subspb  #1              ' if C clear, then cnt-ts1 >= cnt-ts2 which means we replace the first entry
' We prepare to call the SDRAM driver here and make a request to do a SDRAM burst lookup
' at the SDRAM address available in the "last" variable (PC&mask).
            wrlong  last, sdram_addr ' SDRAM data is to be returned to hub memory cache row address at
                                     ' lastbase = l2base + (spb&1)<<shift2
            getspb  lastbase        ' get the spb value
            andn    lastbase, #1    ' clear lsb
            shl     lastbase, shift2 ' shift up to form a row address
            add     lastbase, l2base ' add to start of L2 cache area in hub memory
            mov     ts1, lastbase   ' form a hub address for SDRAM driver to read into
            shl     ts1, #4         ' move up 4 bits to make room for burst transfer
            add     ts1, #BURST     ' eg. 64 byte transfer
            wrlong  ts1, sdram_hub  ' trigger SDRAM lookup
            pushbr  last            ' save newest tag to stack RAM from this read operation
            addspb  #129            ' move back to timestamps area of stack RAM
            getcnt  ts1             ' get tick counter
            pushbr  ts1             ' update the cache timestamp for this entry
' We need to poll for the SDRAM driver result here
polldriver  rdlong  ts1, sdram_hub wz ' poll for completion
    if_nz   jmp     #polldriver     ' wait until done
' Eventually the L2 cache row data is updated and we can resume
            jmpd    #resume         ' prepare return
            add     misses, #1      ' keep some stats for the number of L2 misses
            shr     zc, #1 wz, wc   ' restore flags
            nop

hubaddr
' This code is entered when an instruction is read directly from the hub RAM range
            jmpd    #return         ' prepare to return back into VM
            mov     lastbase, addr  ' save last hub address
            shr     zc, #1 wz, wc   ' restore saved flags
            rdlong  instr, pc       ' read instruction at direct hub address

l2_lookup   muxc    zc, #1          ' save VM's C flag before we lose it
            muxnz   zc, #2          ' save VM's Z flag before we lose it
            mov     addr, pc wc     ' copy PC and get MSB into C flag
            and     addr, mask      ' retain upper bits
            mov     last, addr      ' save the last accessed address base
    if_nc   jmp     #hubaddr        ' if PC top bit is clear lookup directly from hub RAM
            shr     addr, shift1    ' shift down to index stack RAM range
            andn    addr, #128      ' clear upper bit to only read from lower half of stack RAM range
            setspb  addr            ' assign stack pointer to SPB
            addspb  #2              ' add interleaving offset to allow sharing with the L1 cache in stack RAM
            popbr   addr            ' get base address at this tag entry
            cmp     addr, last wz   ' compare with our PC top bits
    if_nz   popbr   addr            ' if no match read next entry,
    if_nz   cmp     addr, last wz   ' and compare with our PC top bits
    if_nz   jmp     #l2_miss        ' if still no match we have an L2 miss and need an SDRAM lookup
            getspb  lastbase        ' get current stack pointer
            andn    lastbase, #1    ' mask off bit 0 to form an index
            jmpd    #resume         ' prepare to resume
            shl     lastbase, shift2 ' shift up to form hub cache row address offset
            add     lastbase, l2base ' add base address of L2 cache in hub RAM
            shr     zc, #1 wz, wc   ' restore flags

l1_miss     mov     addr, pc        ' get current pc
            and     addr, mask      ' zero out the lower bits
            sub     addr, last      ' compare against last known L2 row address
            tjnz    #l2_lookup, addr ' full L2 lookup is required if a different row is used
resume      mov     addr, pc        ' get current pc
            andn    addr, mask      ' zero out the upper bits to create offset
            add     addr, lastbase  ' add offset to the last known cache row base in hub RAM
            rdlong  instr, addr     ' read instruction to be executed at this address
return      pusha   instr           ' copy this instruction code back into L1 cache
            subspa  #2              ' move back up to address position in the stack
            pushar  pc              ' write PC address into L1 cache and fall through to VM loop below

restart     reps    #8, #0          ' repeat VM loop below lots of times
            add     restarts, #1    ' keep some stats in otherwise empty slot here
' The main 8 instruction cycle VM loop begins here
            add     pc, #4          ' increment PC
instr       nop                     ' execute instruction in this position
            setspa  pc              ' point spa to PC's LSBs
            popar   addr            ' pop an address from stack RAM L1 cache
            sub     addr, pc        ' subtract our PC from it
            tjnz    #l1_miss, addr  ' compare address match with PC and do L1 miss if no match
            popar   instr           ' pop associated data instruction from L1 cache in stack RAM
            add     l1_hits, #1     ' track some stats in otherwise wasted slot here
            jmpd    #restart        ' if the loop ever exits after 16384 iterations, or special kernel calls
            nop                     ' are made which call out of the VM loop, it will be restarted again here
            nop                     ' CALLD opcodes would be appropriate for these kernel calls
            nop

' some VM related state
start       long    START_ADDR      ' entry point of VM application
pc          long    0               ' VM program counter
zc          long    1               ' saved flags / initial flags (NZ, NC)

' cache lookup related information
mask        long    L2_MASK         ' this address mask is controlled by cache row size
shift1      long    L2_SHIFT1       ' this bit shift is controlled by cache row size
shift2      long    L2_SHIFT2       ' this bit shift is controlled by cache row size
l2base      long    L2_BASE_ADDR    ' start address of L2 cache in hub memory
addr        long    0               ' temporary address variable
last        long    $ffffffff       ' choose something that won't match by mistake initially
lastbase    long    $ffffffff       ' holds hub RAM base address of last access

' timestamps for L2 cache row entries
ts1         long    0
ts2         long    0

' sdram driver stuff (this needs to be configured to point to hub RAM in a suitable location)
sdram_hub   long    SDRAM_HUB
sdram_addr  long    SDRAM_ADDR

' optional extra debug counters for tracking hit/miss cycle stats etc in any "free" cycles
l1_hits     long    0
misses      long    0
loops       long    0
restarts    long    0