Instruction caching ideas for P2 in XMMC mode with SDRAM
rogloh
Posts: 5,860
Following on from some discussions on this topic in another post http://forums.parallax.com/showthread.php/147786-Is-Spin2Cpp-version-of-quot-quot-(random)-slow I have started a new thread to cover it more as it probably belongs in its own place.
I had mentioned a crazy idea I had a couple of days ago to try to get a faster VM kernel performance by doing the cache handling of instructions in the kernel COG when running with SDRAM on the future P2 in XMMC mode with large applications. I know we are going to need very good instruction caching in this configuration as the latency is very significant for SDRAM, especially if there are video COGs sharing the SDRAM as well and doing large transfers. After playing around with my idea a bit more I may have found a way to get it to run every 2 hub windows (10MIPs max) and it may potentially be compatible with PropGCC....
I am basically trying to avoid the extra request and polling calls to the cache driver for each and every instruction which will burn more hub cycles. It's not fully complete but here is my idea below.
It uses QUADs of longs in hub memory to hold what I call "L1 cache" entries. This may be a bad name I know given the cache row is only one instruction long but I want to differentiate it from the regular cache entries in hub memory PropGCC already uses which I am thinking could still be considered as L2 cache entries - and could even coexist with my L1 cache. The L2 could hold larger burst transfers from the slow main memory and the L1 quads can then refill themselves from the larger L2 row as required during L1 misses, as would happen in a regular L1/L2 cache hierarchy.
There are other ideas floating around like using P2 stack/CLUT memory as well for the tags which could help even further. But that's separate to this.
Not sure if any of it will fly but might generate some useful discussion. The VM code loop is shown below and if I designed it correctly it seems to currently fit into a 16 instruction loop and so should be hitting hub windows each time for simple single clock PASM instructions. There are also some stats counters present that could also be used to track hit/miss performance and for experimenting with different cache sizes to see which works best. I think if there are spare places for such counters in the main loop this is a useful addition as well to any VM with caching, especially for enabling benchmarking.
Take a look and let me know if you like it or think it won't work or can improve it...
Roger.
I had mentioned a crazy idea I had a couple of days ago to try to get a faster VM kernel performance by doing the cache handling of instructions in the kernel COG when running with SDRAM on the future P2 in XMMC mode with large applications. I know we are going to need very good instruction caching in this configuration as the latency is very significant for SDRAM, especially if there are video COGs sharing the SDRAM as well and doing large transfers. After playing around with my idea a bit more I may have found a way to get it to run every 2 hub windows (10MIPs max) and it may potentially be compatible with PropGCC....
I am basically trying to avoid the extra request and polling calls to the cache driver for each and every instruction which will burn more hub cycles. It's not fully complete but here is my idea below.
It uses QUADs of longs in hub memory to hold what I call "L1 cache" entries. This may be a bad name I know given the cache row is only one instruction long but I want to differentiate it from the regular cache entries in hub memory PropGCC already uses which I am thinking could still be considered as L2 cache entries - and could even coexist with my L1 cache. The L2 could hold larger burst transfers from the slow main memory and the L1 quads can then refill themselves from the larger L2 row as required during L1 misses, as would happen in a regular L1/L2 cache hierarchy.
There are other ideas floating around like using P2 stack/CLUT memory as well for the tags which could help even further. But that's separate to this.
Not sure if any of it will fly but might generate some useful discussion. The VM code loop is shown below and if I designed it correctly it seems to currently fit into a 16 instruction loop and so should be hitting hub windows each time for simple single clock PASM instructions. There are also some stats counters present that could also be used to track hit/miss performance and for experimenting with different cache sizes to see which works best. I think if there are spare places for such counters in the main loop this is a useful addition as well to any VM with caching, especially for enabling benchmarking.
Take a look and let me know if you like it or think it won't work or can improve it...
Roger.
' Here is one possible idea for getting an XMMC type of VM loop to perform a bit faster on P2 ' An "L1 I-cache" is formed from an array of quads in hub memory (eg. 256 quads or 4kB) ' Code can then be read and then executed from these quads instead of always reading ' the main memory or polling a cache driver. ' Each quad contains contain 2 instructions in two of its longs and 2 addresses in the ' other 2 longs and this forms a 2 way set associative cache entry. The format is this: ' Long Data ' ------- ---- ' 0 instruction @address_x ' 1 address_x ' 2 instruction @address_y ' 3 address_y ' The quads are indexed by the VM shifting, masking and offsetting the PC to form an address. ' If required the VM also can read the hub memory addresses directly. ' For convenience and speed the address space is split into two halves with the MSB of the PC ' indicating which memory to read from. The order could be reversed if required. ' 0x00000000-0x7fffffff is hub memory (with foldover) ' 0x80000000-0xffffffff is external memory (such as SDRAM) ' One special address such as 0xffffffff, or 0x00000000 would need to be reserved to indicate an empty quad entry. ' Any cache misses on external memory lookups will need to go refill and reexecute the ' L1 entry from either a separate L2 cache row or SDRAM directly (still to be coded) ' The main advantage with this approach is that a tight loop (once loaded) can run ' up to 10MIPs and because it is also 2-way set associative the instructions can wrap ' on certain address boundaries and still avoid additional misses. You'd need three ' memory addresses to match the same quad index before you get a miss (one loaded). ' For tight loops this should not occur too often. ' Key disadvantage is that currently there is no prefetching with this L1 implementation, ' so each instruction needs to be loaded one at a time initially to get going. ' However this could perhaps be optimized during misses to load a few at a time but ' will then slow down miss cases a little. That may still be worth it. TBD. reentry repd #16 ' create a 16 instruction loop (10 MIPs max @160MHz) setquaz #quad0 ' point to quad address in COG memory and clear them all add starts, #1 ' spare spacer instructions can also track some stats nop ' v--- The 16+ cycle loop starts on the next instruction below mov addr, pc ' get current PC of VM muxc zc, #1 ' save VM's C flag before we lose it rol addr, #2 wc ' get MSB of PC into C and turn long address into a quad address and addr, mask ' mask off bits for depth of L1 cache lookup add addr, base ' add to base address of L1 cache to get our quads real address rdquad addr ' read the quad at this L1 cache address muxnz zc, #2 ' save VM's Z flag before we lose it if_nc jmp #hub_memory ' access hub memory directly somewhere outside this loop add cycles, #1 ' count instructions while we wait (or use for another purpose) cmp quad3, pc wz ' quad data is now ready so check for address2 match with our PC if_z mov quad0, quad2 ' if so move instruction2 up to executable position in quad0 if_nz cmp quad1, pc wz ' if the above match failed go check for address1 match with our PC if_nz jmp #cache_miss ' if no match in either address then we have a miss and we exit loop resume shr zc, #1 wz, wc ' restore both VM flags add pc, #4 ' increment VM PC quad0 nop ' instruction1 placeholder from L1 cache read is executed ' ^--- The loop returns to its beginning (or has already jumped out) quad1 nop ' address1 placeholder from L1 cache read quad2 nop ' instruction2 placeholder from L1 cache read quad3 nop ' address2 placeholder from L1 cache read hub_memory jmpd #resume ' prepare delayed jump to resume in 3 more instructions time mov quad1, restart ' ensure loop is restarted once our hub instruction executes rdlong quad0, pc ' PASM code to read instructions directly from hub address starts here nop ' do we need more nop spacers for jmpd (the above sequence may need reordering?) cache_miss ..todo... ' PASM code to refill L1 cache from SDRAM or L2 cache starts here jmpd #reentry add misses, #1 ' count misses (if we have time somewhere) shr zc, #1 wz, wc ' restore both VM flags nop ' spacer restart jmp #reentry ' go restart the VM loop again pc long 0 ' VM's program counter zc long 0 ' placeholder to save z,c flags addr long 0 ' temporary address variable base long CACHE_BASE ' base of L1 cache mask long CACHE_MASK ' mask of address bits used for L1 cache cycles long 0 ' counter for enabling cache hit stats starts long 0 ' counter for tracking VM loop starts for stats misses long 0 ' counter for tracking cache misses
Comments
1) Any L1 cache hits only take 2 hub windows (no prefetching), this is a max of 10MIPs when a nice loop is loaded and running. The L1 size can be made quite deep too so it will hold large chunks of code, the two way associativity will help a lot too in code that crosses specific boundaries that map to the same cache entry.
2) The L2 cache is only ever used if the L1 I-cache lookup fails, and a hit from this L2 cache and repopulate L1 takes maybe ~6 hub windows per executed PASM instruction (I am still looking at something for this part - plus the L2 cache tags could possibly be held in stack RAM to speed things up). The max rate is 3.3MIPs then at that hub window rate if all the L1 lookups fail, but L2 succeeds. This would typically happen on C code that only ever executes linearly once (eg. application bootup initialization).
3) The SDRAM is used only for the L2 cache misses and we could prefetch an entire L2 cache row of say 16-32 instructions at a time to help get things moving and ultimately help fill the L1 cache from subsequent L2 entry hits. This size would need to be tuned to take into account other video clients latency requirements on the SDRAM driver if they share it as well. I'd estimate it to be somewhere in the vicinitiy of about 20 hub windows for 16 instruction loads from SDRAM if we can get in right away. That is 1 MIP @160MHz. Large video transfers from SDRAM can slow this further however.
The L2 row size is the biggest tradeoff. You don't want to waste time prefetching lots of code that won't be executed, but at the same time if you make too small you will miss more often and have to hit the SDRAM more frequently which is slow.
So if you were to use these numbers just as an example and you get say 0.7 hit rate on L1 and 0.9 hit rate on L2 during looping (I may be being overly optimistic here) you might get some performance around this number:
160 MHz / ((0.7 * 2 + 0.3 * (0.9 * 6 + 0.1 * 20)) * 8 ) = ~ 5.5 MIPs.
In general I'd probably hope for a few MIPs or so with XMMC and SDRAM on P2 as a ballpark figure, but it all depends on the application code and what other burden the SDRAM has from its other clients, if any.
Roger.
The code for that approach is something like this:
I will need to take a good look at it. Got bogged down in my own ideas. When I initially browsed it it seemed to be a lot of instructions (seemingly more than 2 hub windows on P2) but I need to look at it again to doublecheck.
[EDIT] Just read your latest code again. On the P1 your cache hits cost it 22 PASM instructions plus a hub read. So that would be scaled up to the next multiple of 16 for the hub windows. 22*4 + 16 rounds up to 112 cycles or 7 hub windows. 80/112 = 0.71 MIPs best case on the P1 @ 80MHz.
If you ran that algorithm basically as is on a P2 you'd still get around 4 hub windows per instruction in the absolute best case if you could have delayed jumps everywhere (my L1 example above should be 2x faster than that). If you start including any jump penalties might bring it closer to 5 or 6 hub windows, but I'd assume it would be recoded suitably to try to avoid that.
Now I really need to find a way to get the L2 hits to run faster in my variant using last address caching somehow. If I can get 4-5 windows for that and update the L1 at the same time I will be happy. There should be some way to do this...
That being said, a four way L2 cache might possibly be nice. If the L2 tags are loaded into the hub RAM you could potentially get 64 x 4 way set associativity in there perhaps. Also one of the things I found is that computing the addresses can take a while with lots of masking/shifting etc. If the cache row start address in hub memory is somehow mapped from the PC address dynamically rather than statically, and is returned directly from a tag entry data structure that matches the masked PC address it could potentially be much faster to do the lookup. It might also lend itself to a circular queue of L2 row buffers and/or LRU/LFU type algorithms when flushing out rows that clash with ones being read in. Maybe the stack can be used to hold a free list of L2 cache row pointers. Just thinking...
So perhaps this could be another use of the P2 stack RAM. It only uses half the stack RAM but is interleaved as 2 longs (addr + opcode), then 2 empty longs etc. A two way L1 cache version would be possible but obviously a bit slower and needs more management to refill it. This way may work better and maybe the other 1/2 of the stack RAM can then still be used for L2 cache tags...?
Roger.
With my L1 scheme in place:
20MIPs (8 cycles) for all L1 hits,
or 6.6MIPs (24 cycles) for L1 misses, but hits in the same L2 row as last time,
or X hub window cycles for other L2 hits,
or Y hub window cycles for other L2 misses
Without my L1 scheme:
10MIPs (16 cycles) for hits in same hub cache row as the last instruction used,
or X-1 hub window cycles for other cache hits, (I have some sample code I am working on that should get this down to 4 hub cycles (5 MIPs) for 64x4 way set associative cache, tags in stack RAM, so X is probably 5 above and 4MIPs with L1)
or Y-1 hub window cycles for other misses
It's always a tradeoff. Small loops that run often can probably get quite a lot of benefit from the extra L1 and can peak at 20MIPs which is great, but other application code that doesn't loop so much may prefer no L1 to always get the 10MIP rate when hitting the L2 cached row instead of 6.6MIPs and one hub window less for other miss cases. It's pretty much impossible to pick something that suits all needs.
What is the consensus now? A single level cache? What cache algorithm?
Alternatively one could try out a combined 64 entry L1 cache and 64 x 2 way L2 tags sharing the stack RAM which could also potentially be useful. But this whole FCACHE thing may make the L1 less useful... will need to think about that.
I discovered another benefit of storing tags in the stack is if you lookup a 4 way set-associative cache which already has all its 4 tag entries consumed with non matching addresses and then need to reload and drop one tag entry to be able to hold the new row being loaded it is very easy to push back the last 3 oldest tag entries in the group of 4, after all you have already popped and address matched them so have them in COG memory. This frees up the room for the next one to be loaded and also can be used to implement a caching algorithm at the same time. Even better it can occur in parallel while waiting for the results to come back from the SDRAM which then doesn't reduce performance. This allows a kind of LRU (least recently used) cache algorithm to be implemented. Well not strictly LRU but it is dropping off the cache row that has been around the longest of the 4 which is not necessarily the last one used, but is an approximation.
But I still need to somehow shave off one more cycle to be able to read from a different cached row in 4 hub windows with a single cache. Its driving me crazy trying to align hub window reads and maximize delayed jump cycles etc! I thought I had it all worked out in my head an hour ago but can't quite resurrect it on paper yet. It might have to be 5 hub windows. Arggh!
This is the expected performance for each address lookup condition based on the number of hub windows I counted in each scheme I coded. It assumes 160MHz P2 and Chip's current SDRAM driver. The PASM code is pretty tightly optimized and uses delayed jumps as much as possible. I feel it is unlikely to be able to go much faster in this implementation at least.
I have allowed for the VM code to access hub memory from address range 0x-0x7fffffff, and the SDRAM from 0x8000000 to 0xFFFFFFFF. Some region of memory would need to be reserved to indicate an empty row (eg. 0xFFFFFFC0-0xFFFFFFFF, or 0x00000000-0x0000003f assuming using 64 byte cache row sizes).
Scheme 1 has an L1 + L2 cache both sharing the stack/CLUT RAM (half each), the others use a single cache only with address tags again stored in the stack/CLUT RAM.
I still quite like the idea of scheme 1, but it can suffer slightly slower memory accesses when the L1 lookup fails compared to the other single level cache schemes.
Thanks for all of your hard work on this!
David
I agree, for some future multi-COG SMP system keeping L2 tags in CLUT RAM would require each COG to have its own hub RAM which would eat up more hub memory. However one possibility that may help there is to use my earlier idea in post #5 of a separate small 64x2 way L1 in the CLUT RAM which is private for each COG and also share a common cache driver accessed by each COG that manages a single L2 cache, but I imagine that L2 cache driver could likely become a major bottleneck once lots of COGs start sharing it.
Doing multi-COG stuff like that in the future with caching may need its own new memory model - perhaps a new "MXMMC" model could be created for multi-COG XMM code for example.
No problem. I just hope it's useful and may speed things up a bit. On the P2 LMM is going to be the faster option over XMMC but once you are dealing with much larger applications like lots of OS or protocol code etc, then the large SDRAM and XMMC with good performance in the 5-20 MIP range will open up a whole new world of possibilities. I would really hope to see something decent for large applications running from SDRAM on P2 one day.
I'll try to tidy up my various sample code VM loops a little and post it here soon so you can take a look to see if any ideas from it make sense to perhaps incorporate into PropGCC for P2 at some point.
Roger.
I agree that trying to share even a second level cache is likely to result in very poor performance. In addition, one of the big claimed advantages of the Propeller is its deterministic execution. That won't be possible running cached code. I would think the best approach would be to have a single COG running a "main program" and then use the other COGs to run either LMM, COG C, or PASM drivers. After all, the Propeller is all about "soft peripherals" and those are likely to require deterministic execution.
Thanks for offering to clean up your sample code so we can try it in the P2 XMM kernel! Also, do you see any issues that would come up if we want to use at least the second level cache for data access as well as code access? PropGCC supports the xmm-single and xmm-split modes that put both code and data in external memory.
I am thinking for the full XMM model with both data and code stored in the same SDRAM memory space, it could make sense to create two separate caches if you can spare a bit more hub space. Eg. your could keep a nice 4kB L2 I-cache for instructions and have a separate general purpose D-cache of some appropriate size in hub memory. This then decouples execution performance when reading in lots of data from SDRAM as it wouldn't ever have to empty the I-cache to make room for data rows and simplifies a few things but it could require a few more memory range checks.
Just a thought.
Disclaimer: This code is completly untested, uncompiled etc as I don't have any P2 tools or DE0-NANO FGPA etc, so please take it as such and there's probably a reasonable chance of a bug in there somewhere.
I will try to post the other L2 only schemes when I get more time to clean that up further. Unfortunately I really have already spent more time on this than I should have done and neglected some other stuff I need to do so I am not sure when I will get back to it (hopefully soon). :frown: