Instruction caching ideas for P2 in XMMC mode with SDRAM

rogloh · 2013-05-08 23:20

Following on from some discussions on this topic in another post http://forums.parallax.com/showthread.php/147786-Is-Spin2Cpp-version-of-quot-quot-(random)-slow I have started a new thread to cover it more as it probably belongs in its own place.

I had mentioned a crazy idea I had a couple of days ago to try to get a faster VM kernel performance by doing the cache handling of instructions in the kernel COG when running with SDRAM on the future P2 in XMMC mode with large applications. I know we are going to need very good instruction caching in this configuration as the latency is very significant for SDRAM, especially if there are video COGs sharing the SDRAM as well and doing large transfers. After playing around with my idea a bit more I may have found a way to get it to run every 2 hub windows (10MIPs max) and it may potentially be compatible with PropGCC....

I am basically trying to avoid the extra request and polling calls to the cache driver for each and every instruction which will burn more hub cycles. It's not fully complete but here is my idea below.

It uses QUADs of longs in hub memory to hold what I call "L1 cache" entries. This may be a bad name I know given the cache row is only one instruction long but I want to differentiate it from the regular cache entries in hub memory PropGCC already uses which I am thinking could still be considered as L2 cache entries - and could even coexist with my L1 cache. The L2 could hold larger burst transfers from the slow main memory and the L1 quads can then refill themselves from the larger L2 row as required during L1 misses, as would happen in a regular L1/L2 cache hierarchy.

There are other ideas floating around like using P2 stack/CLUT memory as well for the tags which could help even further. But that's separate to this.

Not sure if any of it will fly but might generate some useful discussion. The VM code loop is shown below and if I designed it correctly it seems to currently fit into a 16 instruction loop and so should be hitting hub windows each time for simple single clock PASM instructions. There are also some stats counters present that could also be used to track hit/miss performance and for experimenting with different cache sizes to see which works best. I think if there are spare places for such counters in the main loop this is a useful addition as well to any VM with caching, especially for enabling benchmarking.

Take a look and let me know if you like it or think it won't work or can improve it...

Roger.

' Here is one possible idea for getting an XMMC type of VM loop to perform a bit faster on P2

        ' An "L1 I-cache" is formed from an array of quads in hub memory (eg. 256 quads or 4kB)
        ' Code can then be read and then executed from these quads instead of always reading
        ' the main memory or polling a cache driver.

        ' Each quad contains contain 2 instructions in two of its longs and 2 addresses in the
        ' other 2 longs and this forms a 2 way set associative cache entry. The format is this:

        '      Long    Data
        '      ------- ----
        '        0     instruction @address_x
        '        1     address_x
        '        2     instruction @address_y
        '        3     address_y

        ' The quads are indexed by the VM shifting, masking and offsetting the PC to form an address.

        ' If required the VM also can read the hub memory addresses directly.
        '   For convenience and speed the address space is split into two halves with the MSB of the PC
        '   indicating which memory to read from.  The order could be reversed if required.
        '      0x00000000-0x7fffffff is hub memory (with foldover)
        '      0x80000000-0xffffffff is external memory (such as SDRAM)
        '      One special address such as 0xffffffff, or 0x00000000 would need to be reserved to indicate an empty quad entry.

        ' Any cache misses on external memory lookups will need to go refill and reexecute the
        ' L1 entry from either a separate L2 cache row or SDRAM directly (still to be coded)

        ' The main advantage with this approach is that a tight loop (once loaded) can run
        ' up to 10MIPs and because it is also 2-way set associative the instructions can wrap
        ' on certain address boundaries and still avoid additional misses.  You'd need three
        ' memory addresses to match the same quad index before you get a miss (one loaded).
        ' For tight loops this should not occur too often.

        ' Key disadvantage is that currently there is no prefetching with this L1 implementation,
        ' so each instruction needs to be loaded one at a time initially to get going.
        ' However this could perhaps be optimized during misses to load a few at a time but
        ' will then slow down miss cases a little.  That may still be worth it. TBD.


reentry
        repd    #16             ' create a 16 instruction loop (10 MIPs max @160MHz)
        setquaz #quad0          ' point to quad address in COG memory and clear them all
        add     starts, #1      ' spare spacer instructions can also track some stats
        nop                     ' v--- The 16+ cycle loop starts on the next instruction below

        mov     addr, pc        ' get current PC of VM
        muxc    zc, #1          ' save VM's C flag before we lose it
        rol     addr, #2     wc ' get MSB of PC into C and turn long address into a quad address
        and     addr, mask      ' mask off bits for depth of L1 cache lookup
        add     addr, base      ' add to base address of L1 cache to get our quads real address
        rdquad  addr            ' read the quad at this L1 cache address
        muxnz   zc, #2          ' save VM's Z flag before we lose it
  if_nc jmp     #hub_memory     ' access hub memory directly somewhere outside this loop
        add     cycles, #1      ' count instructions while we wait (or use for another purpose)
        cmp     quad3, pc    wz ' quad data is now ready so check for address2 match with our PC
  if_z  mov     quad0, quad2    ' if so move instruction2 up to executable position in quad0
  if_nz cmp     quad1, pc    wz ' if the above match failed go check for address1 match with our PC
  if_nz jmp     #cache_miss     ' if no match in either address then we have a miss and we exit loop
resume  shr     zc, #1   wz, wc ' restore both VM flags
        add     pc, #4          ' increment VM PC
quad0   nop                     ' instruction1 placeholder from L1 cache read is executed
                                ' ^--- The loop returns to its beginning (or has already jumped out)

quad1   nop                     ' address1 placeholder from L1 cache read
quad2   nop                     ' instruction2 placeholder from L1 cache read
quad3   nop                     ' address2 placeholder from L1 cache read

hub_memory
        jmpd    #resume         ' prepare delayed jump to resume in 3 more instructions time
        mov     quad1, restart  ' ensure loop is restarted once our hub instruction executes
        rdlong  quad0, pc       ' PASM code to read instructions directly from hub address starts here
        nop                     ' do we need more nop spacers for jmpd (the above sequence may need reordering?)

cache_miss
        ..todo...               ' PASM code to refill L1 cache from SDRAM or L2 cache starts here
        jmpd    #reentry
        add     misses, #1      ' count misses (if we have time somewhere)
        shr     zc, #1   wz, wc ' restore both VM flags
        nop                     ' spacer

restart jmp     #reentry        ' go restart the VM loop again

pc      long    0               ' VM's program counter
zc      long    0               ' placeholder to save z,c flags
addr    long    0               ' temporary address variable
base    long    CACHE_BASE      ' base of L1 cache
mask    long    CACHE_MASK      ' mask of address bits used for L1 cache

cycles  long    0               ' counter for enabling cache hit stats
starts  long    0               ' counter for tracking VM loop starts for stats
misses  long    0               ' counter for tracking cache misses

David Betz · 2013-05-09 05:26

This is a clever idea! However, won't you almost be force to also have an L2 cache as well with this approach? Otherwise, whenever you get a cache miss in your L1 cache, you'll have to do an SDRAM read of only a single instruction. Or am I missing something?

rogloh · 2013-05-09 06:17

Exactly. This approach really lends itself to a hierarchy. My thoughts there are :

1) Any L1 cache hits only take 2 hub windows (no prefetching), this is a max of 10MIPs when a nice loop is loaded and running. The L1 size can be made quite deep too so it will hold large chunks of code, the two way associativity will help a lot too in code that crosses specific boundaries that map to the same cache entry.

2) The L2 cache is only ever used if the L1 I-cache lookup fails, and a hit from this L2 cache and repopulate L1 takes maybe ~6 hub windows per executed PASM instruction (I am still looking at something for this part - plus the L2 cache tags could possibly be held in stack RAM to speed things up). The max rate is 3.3MIPs then at that hub window rate if all the L1 lookups fail, but L2 succeeds. This would typically happen on C code that only ever executes linearly once (eg. application bootup initialization).

3) The SDRAM is used only for the L2 cache misses and we could prefetch an entire L2 cache row of say 16-32 instructions at a time to help get things moving and ultimately help fill the L1 cache from subsequent L2 entry hits. This size would need to be tuned to take into account other video clients latency requirements on the SDRAM driver if they share it as well. I'd estimate it to be somewhere in the vicinitiy of about 20 hub windows for 16 instruction loads from SDRAM if we can get in right away. That is 1 MIP @160MHz. Large video transfers from SDRAM can slow this further however.

The L2 row size is the biggest tradeoff. You don't want to waste time prefetching lots of code that won't be executed, but at the same time if you make too small you will miss more often and have to hit the SDRAM more frequently which is slow.

So if you were to use these numbers just as an example and you get say 0.7 hit rate on L1 and 0.9 hit rate on L2 during looping (I may be being overly optimistic here) you might get some performance around this number:

160 MHz / ((0.7 * 2 + 0.3 * (0.9 * 6 + 0.1 * 20)) * 8 ) = ~ 5.5 MIPs.

In general I'd probably hope for a few MIPs or so with XMMC and SDRAM on P2 as a ballpark figure, but it all depends on the application code and what other burden the SDRAM has from its other clients, if any.

Roger.

David Betz · 2013-05-09 06:28

rogloh wrote: »

1) Any L1 cache hits only take 2 hub windows (no prefetching), this is a max of 10MIPs when a nice loop is loaded and running. The L1 size can be made quite deep too so it will hold large chunks of code, the two way associativity will help a lot too in code that crosses specific boundaries that map to the same cache entry.

Doesn't the code I posted the other day have the same effect for just a single level cache? One hub access is required to fetch the appropriate tag. Then, if you get a cache hit, a second hub access is required to fetch the actual data. Since the last cache hit is remembered, you only need a single hub access for subsequent accesses on the same cache line.

rogloh · 2013-05-09 06:47

By the way, I do have another version that uses the stack RAM to hold the L1 cache. It works in a very similar way and should also run in 16 cycles, but the downside is that it limits the L1 size to be only 128 instructions and no more. It also prevents this stack RAM from being used for other purposes such as for L2 cache tags which I think may suit it better.

The code for that approach is something like this:

restart
        repd    #16
        nop
        nop
        nop
        muxc    zc, #1          ' beginning of 16 instruction loop
        muxnz   zc, #2
        mov     pc, pc wc       ' extract msb
        add     cycles, #1
 if_nc  jmp     #hub_mem
        setspa  pc
        popar   addr1
        popar   instr
        popar   addr2
        cmp     pc, addr1 wz
 if_nz  popar   instr
 if_nz  cmp     pc, addr2 wz
 if_nz  jmp     #cache_miss
resume  add     pc, #4
        shr     zc, #1 wz, wc
instr   nop                     ' end of normal loop
        jmp     #restart

addr1   long    0
addr2   long    0
pc      long    0
zc      long    0
cycles  long    0
hub     long    0

hub_mem jmpd    #resume
        add     hub, #1
        rdlong  instr, pc
        nop


cache_miss ' ... code to do cache reload would go here
        jmp     #resume

rogloh · 2013-05-09 06:48

David Betz wrote: »

Doesn't the code I posted the other day have the same effect for just a single level cache? One hub access is required to fetch the appropriate tag. Then, if you get a cache hit, a second hub access is required to fetch the actual data. Since the last cache hit is remembered, you only need a single hub access for subsequent accesses on the same cache line.

I will need to take a good look at it. Got bogged down in my own ideas. When I initially browsed it it seemed to be a lot of instructions (seemingly more than 2 hub windows on P2) but I need to look at it again to doublecheck.

[EDIT] Just read your latest code again. On the P1 your cache hits cost it 22 PASM instructions plus a hub read. So that would be scaled up to the next multiple of 16 for the hub windows. 22*4 + 16 rounds up to 112 cycles or 7 hub windows. 80/112 = 0.71 MIPs best case on the P1 @ 80MHz.

If you ran that algorithm basically as is on a P2 you'd still get around 4 hub windows per instruction in the absolute best case if you could have delayed jumps everywhere (my L1 example above should be 2x faster than that). If you start including any jump penalties might bring it closer to 5 or 6 hub windows, but I'd assume it would be recoded suitably to try to avoid that.

Now I really need to find a way to get the L2 hits to run faster in my variant using last address caching somehow. If I can get 4-5 windows for that and update the L1 at the same time I will be happy. There should be some way to do this...

David Betz · 2013-05-09 06:49

What if you use rdquad to read four tags at a time to implement a 4-way set associative single level cache?

jazzed · 2013-05-09 06:54

This is a very healthy discussion. Thanks. Wish I had more time to contribute.

rogloh · 2013-05-09 07:26

Well the particular L1 approach I use needs to store the instruction data in the quad along with the matching address so only two would fit per quad. A four way variant would take 2 quad reads which adds another hub cycle. Also the bang per buck of going to 4 way associativity is probably quite a bit less than going from direct mapped to 2 way set associative. Just like going from a single core CPU to 2 cores is a lot more noticable/useful to the user than moving from 2 - 4 cores.

That being said, a four way L2 cache might possibly be nice. If the L2 tags are loaded into the hub RAM you could potentially get 64 x 4 way set associativity in there perhaps. Also one of the things I found is that computing the addresses can take a while with lots of masking/shifting etc. If the cache row start address in hub memory is somehow mapped from the PC address dynamically rather than statically, and is returned directly from a tag entry data structure that matches the masked PC address it could potentially be much faster to do the lookup. It might also lend itself to a circular queue of L2 row buffers and/or LRU/LFU type algorithms when flushing out rows that clash with ones being read in. Maybe the stack can be used to hold a free list of L2 cache row pointers. Just thinking...

David Betz · 2013-05-09 08:01

I guess I was suggesting having only one level if access to that level was either 1 or 2 hub windows. You said earlier that your L1 scheme required two hub accesses per fetch. Why is that? Won't a single access do for a cache hit? Your rdquad gets 2 addresses and 2 instructions. If the address match the one being fetched you can immediately use the instruction without a second hub access. Where does the second hub access come in for a L1 cache hit?

rogloh · 2013-05-09 08:09

My L1 scheme consumes 2 hub window cycles (16 clocks total) for everything, cache lookup, fetch and execute (assuming a single clock PASM instruction being executed by the VM). But only one hub read operation is actually made in the 2 hub cycles to read the quad. There is no second hub read. That is how it gets up to 10MIPs (160/16).

David Betz · 2013-05-09 08:17

rogloh wrote: »

My L1 scheme consumes 2 hub window cycles (16 clocks total) for everything, cache lookup, fetch and execute (assuming a single clock PASM instruction being executed by the VM). But only one hub read operation is actually made in the 2 hub cycles to read the quad. There is no second hub read. That is how it gets up to 10MIPs (160/16).

Okay, thanks for the explanation. I thought you meant that you actually did two hub reads.

rogloh · 2013-05-10 23:02

I've had another potential idea which runs even faster. If we use some of the stack RAM as the L1 cache (single way holding up to 64 instructions), we could get the VM running at 20MIPs when there are L1 hits as would happen in tight loops that don't exceed 64 sequential instructions. Here's the code snippet showing the main VM loop which runs in 8 cycles (or without stats 7 cycles and 22.9 MIPs). Also it is not shown here but I have figured out that if the L1 lookup misses and needs to go to L2 but is being read from the same L2 cache row as the last L1 failure it will (just) be able to complete and refill the L1 cache and execute in 24 cycles which is still 6.66 MIPs and I like that too. The rest of L2 cache handling can sort out the external/internal memory addresses which keeps the L1 cache able to hold both hub and external RAM instructions and run from both at the peak rate.

So perhaps this could be another use of the P2 stack RAM. It only uses half the stack RAM but is interleaved as 2 longs (addr + opcode), then 2 empty longs etc. A two way L1 cache version would be possible but obviously a bit slower and needs more management to refill it. This way may work better and maybe the other 1/2 of the stack RAM can then still be used for L2 cache tags...?

Roger.

initial_entry
        jmpd    #restart
        mov     pc, start       ' initial code entry point
        sub     pc, #4          ' compensate for initial start position
        shr     zc, #1   wz, wc ' setup initial flags

l1_miss 

        '... normal L2 cache lookup code goes here,
        ' the instruction ultimately gets read and copied into instr during this part

        pushar  instr           ' copy code into L1 cache
        subspa  #2              ' move back to address position
        pushar  pc              ' copy address into L1 cache and fall through to loop below

restart
        reps    #8, #0          ' repeat lots of times
        add     restarts, #1    ' stats

        ' The main 8 cycle VM loop follows
        add     pc, #4          ' increment PC
instr   nop                     ' execute instruction
        setspa  pc              ' point spa to PC's LSBs
        popar   addr            ' pop an address from stack
        sub     addr, pc        ' subtract our PC from it
        tjnz    #l1_miss, addr  ' compare address match with PC and miss if not
        popar   instr           ' pop associated data instruction
        add     cycles, #1      ' end of loop 

        jmpd    #restart        ' when code exits after 16384 iterations, start it again
        add     loops, #1
        nop
        nop

pc      long    0               ' VM program counter
addr    long    0               ' temp address variable
start   long    START_ADDR      ' entry point of VM application
wc      long    1               ' saved flags / initial flags
cycles  long    0               ' (optional) hit/miss stats
misses  long    0               ' (optional) hit/miss stats
loops   long    0
restarts long   0

rogloh · 2013-05-11 01:09

One more thing that needs to be kept in mind is that a VM loop without any L1 caching that only ever reads instructions from a cache in hub RAM can in fact be constructed to run in two hub windows when reading from the same cache row as was read from last time (so the code executes within 16 P2 clock cycles). This then yields 10MIPs peak performance on a 160MHz P2. So to compare:

With my L1 scheme in place:
20MIPs (8 cycles) for all L1 hits,
or 6.6MIPs (24 cycles) for L1 misses, but hits in the same L2 row as last time,
or X hub window cycles for other L2 hits,
or Y hub window cycles for other L2 misses

Without my L1 scheme:
10MIPs (16 cycles) for hits in same hub cache row as the last instruction used,
or X-1 hub window cycles for other cache hits, (I have some sample code I am working on that should get this down to 4 hub cycles (5 MIPs) for 64x4 way set associative cache, tags in stack RAM, so X is probably 5 above and 4MIPs with L1)
or Y-1 hub window cycles for other misses

It's always a tradeoff. Small loops that run often can probably get quite a lot of benefit from the extra L1 and can peak at 20MIPs which is great, but other application code that doesn't loop so much may prefer no L1 to always get the 10MIP rate when hitting the L2 cached row instead of 6.6MIPs and one hub window less for other miss cases. It's pretty much impossible to pick something that suits all needs.

ersmith · 2013-05-11 04:20

I wouldn't worry too much about optimizing tight loops, since that's already handled by the FCACHE mechanism in the compiler.

ersmith · 2013-05-11 04:20

I wouldn't worry too much about optimizing tight loops, since that's already handled by the FCACHE mechanism in the compiler.

rogloh · 2013-05-11 04:43

Ok, I wasn't aware the compiler can make use of that mechanism automatically. If so, that is pretty cool. Once loaded up in COG memory it could probably then work at up to 160MIPs I guess leaving a single level cache a good way to go. But it would still need to be loaded up to get running (hopefully optimized using some RDQUADs to load 4 instructions at a time).

David Betz · 2013-05-11 06:16

I guess I missed a lot while I was sleeping! :-)

What is the consensus now? A single level cache? What cache algorithm?

rogloh · 2013-05-11 07:07

Well no final consensus yet in my mind but my gut feeling right now is a single level cache having up to 64 x 4 way tags in stack RAM could be a pretty good option. The cache row size would be configurable but I'd expect something in the vicinity of 4 quads per cache line might be a reasonable ballpark number to use which would then allow up to 16kB I-cache sizes in hub RAM (holding 4k instructions) and still limit the latency added to the SDRAM driver if a video driver is using this memory as well. In the worst case I'd estimate you'd then get about 1MIP when running from SDRAM exclusively and missing every single address (which really should never ever happen, to make it happen your code would be seriously jumping all over the place

). If you don't want to share the SDRAM with video the row size could easily be increased beyond 64 bytes and we could also reduce the number of tag entries in stack RAM from 256 down to smaller sizes in powers of two to control the amount of hub memory used by the cache, but if you increase it too much the SDRAM transfer time obviously then becomes significant on any cache misses.

Alternatively one could try out a combined 64 entry L1 cache and 64 x 2 way L2 tags sharing the stack RAM which could also potentially be useful. But this whole FCACHE thing may make the L1 less useful... will need to think about that.

I discovered another benefit of storing tags in the stack is if you lookup a 4 way set-associative cache which already has all its 4 tag entries consumed with non matching addresses and then need to reload and drop one tag entry to be able to hold the new row being loaded it is very easy to push back the last 3 oldest tag entries in the group of 4, after all you have already popped and address matched them so have them in COG memory. This frees up the room for the next one to be loaded and also can be used to implement a caching algorithm at the same time. Even better it can occur in parallel while waiting for the results to come back from the SDRAM which then doesn't reduce performance. This allows a kind of LRU (least recently used) cache algorithm to be implemented. Well not strictly LRU but it is dropping off the cache row that has been around the longest of the 4 which is not necessarily the last one used, but is an approximation.

But I still need to somehow shave off one more cycle to be able to read from a different cached row in 4 hub windows with a single cache. Its driving me crazy trying to align hub window reads and maximize delayed jump cycles etc! I thought I had it all worked out in my head an hour ago but can't quite resurrect it on paper yet. It might have to be 5 hub windows. Arggh!

rogloh · 2013-05-11 07:15

I can save a clock cycle (in fact two) if I go with 3-way set associativity LOL.

David Betz · 2013-05-11 07:41

rogloh wrote: »

Well no final consensus yet in my mind but my gut feeling right now is a single level cache having up to 64 x 4 way tags in stack RAM could be a pretty good option. The cache row size would be configurable but I'd expect something in the vicinity of 4 quads per cache line might be a reasonable ballpark number to use which would then allow up to 16kB I-cache sizes in hub RAM (holding 4k instructions) and still limit the latency added to the SDRAM driver if a video driver is using this memory as well. In the worst case I'd estimate you'd then get about 1MIP when running from SDRAM exclusively and missing every single address (which really should never ever happen, to make it happen your code would be seriously jumping all over the place ). If you don't want to share the SDRAM with video the row size could easily be increased beyond 64 bytes and we could also reduce the number of tag entries in stack RAM from 256 down to smaller sizes in powers of two to control the amount of hub memory used by the cache, but if you increase it too much the SDRAM transfer time obviously then becomes significant on any cache misses.

Alternatively one could try out a combined 64 entry L1 cache and 64 x 2 way L2 tags sharing the stack RAM which could also potentially be useful. But this whole FCACHE thing may make the L1 less useful... will need to think about that.

I discovered another benefit of storing tags in the stack is if you lookup a 4 way set-associative cache which already has all its 4 tag entries consumed with non matching addresses and then need to reload and drop one tag entry to be able to hold the new row being loaded it is very easy to push back the last 3 oldest tag entries in the group of 4, after all you have already popped and address matched them so have them in COG memory. This frees up the room for the next one to be loaded and also can be used to implement a caching algorithm at the same time. Even better it can occur in parallel while waiting for the results to come back from the SDRAM which then doesn't reduce performance. This allows a kind of LRU (least recently used) cache algorithm to be implemented. Well not strictly LRU but it is dropping off the cache row that has been around the longest of the 4 which is not necessarily the last one used, but is an approximation.

But I still need to somehow shave off one more cycle to be able to read from a different cached row in 4 hub windows with a single cache. Its driving me crazy trying to align hub window reads and maximize delayed jump cycles etc! I thought I had it all worked out in my head an hour ago but can't quite resurrect it on paper yet. It might have to be 5 hub windows. Arggh!

Using at least some of the CLUT memory was what I expected to do as well. Sounds like a good idea.

rogloh · 2013-05-12 04:40

Here is the latest summary of the information I have found when trying to code up my various VM loops with the different cache schemes shown below.

This is the expected performance for each address lookup condition based on the number of hub windows I counted in each scheme I coded. It assumes 160MHz P2 and Chip's current SDRAM driver. The PASM code is pretty tightly optimized and uses delayed jumps as much as possible. I feel it is unlikely to be able to go much faster in this implementation at least.

I have allowed for the VM code to access hub memory from address range 0x-0x7fffffff, and the SDRAM from 0x8000000 to 0xFFFFFFFF. Some region of memory would need to be reserved to indicate an empty row (eg. 0xFFFFFFC0-0xFFFFFFFF, or 0x00000000-0x0000003f assuming using 64 byte cache row sizes).

Scheme 1 has an L1 + L2 cache both sharing the stack/CLUT RAM (half each), the others use a single cache only with address tags again stored in the stack/CLUT RAM.

I still quite like the idea of scheme 1, but it can suffer slightly slower memory accesses when the L1 lookup fails compared to the other single level cache schemes.

Peak performance @160MHz				
Scheme		L1 cache	              L2 cache (1)	L1 hits		Hub memory access    L2 same row hits	L2 other row hits	L2 Misses (3)
1		64 entry direct       	64 x 2 way set assoc	20 MIPs		20/6.67/4 MIPs (2)	6.67 MIPs	3.33 MIPs		<= ~1MIPs
2		None			128 x 2 way set assoc	N/A		10 MIPs			10 MIPs		5 MIPs			<= ~1MIPs
3		None			64 x 4 way set assoc	N/A		10 MIPs			10 MIPs		4 MIPs			<= ~1MIPs
4		None			32 x 8 way set assoc	N/A		10 MIPs			10 MIPs		3.3 MIPs		<= ~1MIPs
								
Notes								
1	The actual L2 row size holding instructions is configurable depending on how much hub RAM you want to use.  
	Hit timing is unaffected by the row size, but L2 miss timing is affected.
2	In this case any hub memory reads can also make use of the L1 cache and performance is higher if reading from the same &#8220;row&#8221; of hub RAM. 
	ie. 20 MIPs if L1 is hit, 6.67 MIPs if in same row of hub RAM, and 4 MIPs if neither.
3	L2 misses will read in an entire row from the SDRAM.    This example assumes 64 bytes/row and exclusive SDRAM access.
	Timing will be rather variable when other clients share the SDRAM.  
	So this is a ballpark number in the best case for transferring 4 quads of instructions.

David Betz · 2013-05-12 05:32

rogloh wrote: »
Here is the latest summary of the information I have found when trying to code up my various VM loops with the different cache schemes shown below.

This is the expected performance for each address lookup condition based on the number of hub windows I counted in each scheme I coded. It assumes 160MHz P2 and Chip's current SDRAM driver. The PASM code is pretty tightly optimized and uses delayed jumps as much as possible. I feel it is unlikely to be able to go much faster in this implementation at least.

I have allowed for the VM code to access hub memory from address range 0x-0x7fffffff, and the SDRAM from 0x8000000 to 0xFFFFFFFF. Some region of memory would need to be reserved to indicate an empty row (eg. 0xFFFFFFC0-0xFFFFFFFF, or 0x00000000-0x0000003f assuming using 64 byte cache row sizes).

Scheme 1 has an L1 + L2 cache both sharing the stack/CLUT RAM (half each), the others use a single cache only with address tags again stored in the stack/CLUT RAM.

I still quite like the idea of scheme 1, but it can suffer slightly slower memory accesses when the L1 lookup fails compared to the other single level cache schemes.
Peak performance @160MHz				
Scheme		L1 cache	              L2 cache (1)	L1 hits		Hub memory access    L2 same row hits	L2 other row hits	L2 Misses (3)
1		64 entry direct       	64 x 2 way set assoc	20 MIPs		20/6.67/4 MIPs (2)	6.67 MIPs	3.33 MIPs		<= ~1MIPs
2		None			128 x 2 way set assoc	N/A		10 MIPs			10 MIPs		5 MIPs			<= ~1MIPs
3		None			64 x 4 way set assoc	N/A		10 MIPs			10 MIPs		4 MIPs			<= ~1MIPs
4		None			32 x 8 way set assoc	N/A		10 MIPs			10 MIPs		3.3 MIPs		<= ~1MIPs
								
Notes								
1	The actual L2 row size holding instructions is configurable depending on how much hub RAM you want to use.  
	Hit timing is unaffected by the row size, but L2 miss timing is affected.
2	In this case any hub memory reads can also make use of the L1 cache and performance is higher if reading from the same &#8220;row&#8221; of hub RAM. 
	ie. 20 MIPs if L1 is hit, 6.67 MIPs if in same row of hub RAM, and 4 MIPs if neither.
3	L2 misses will read in an entire row from the SDRAM.    This example assumes 64 bytes/row and exclusive SDRAM access.
	Timing will be rather variable when other clients share the SDRAM.  
	So this is a ballpark number in the best case for transferring 4 quads of instructions.

Thanks for the detailed analysis and your work on optimizing the code! Since you've been nice enough to code up all of these approaches I guess we could try them all with real-world code to see which performs best in practice. As I understand it, none of these approaches requires a change in the GCC code generation so we can simply swap out kernels to try different algorithms. As Eric said, the fcache may make the differences between your two-level cache and the single-level caches we have been using less significant. It will be interesting to try both to see. Can you post the code for each option?

Thanks for all of your hard work on this!
David

David Betz · 2013-05-12 09:39

Another point I'd like to bring up is that this scheme for improving instruction execution performance on P2 relies on the cache logic residing in the same COG as the XMM kernel. That will make it pretty much impossible to share a single cache among multiple COGs that all want to run XMM code. We could, of course, allocate a separate cache for every XMM COG but that will eat more hub memory. There isn't really much we can do about this but I thought I'd mention it. Parallax *has* expressed an interest in running XMM code on more than one COG at a time.

rogloh · 2013-05-12 18:09

David Betz wrote: »

Another point I'd like to bring up is that this scheme for improving instruction execution performance on P2 relies on the cache logic residing in the same COG as the XMM kernel. That will make it pretty much impossible to share a single cache among multiple COGs that all want to run XMM code. We could, of course, allocate a separate cache for every XMM COG but that will eat more hub memory. There isn't really much we can do about this but I thought I'd mention it. Parallax *has* expressed an interest in running XMM code on more than one COG at a time.

I agree, for some future multi-COG SMP system keeping L2 tags in CLUT RAM would require each COG to have its own hub RAM which would eat up more hub memory. However one possibility that may help there is to use my earlier idea in post #5 of a separate small 64x2 way L1 in the CLUT RAM which is private for each COG and also share a common cache driver accessed by each COG that manages a single L2 cache, but I imagine that L2 cache driver could likely become a major bottleneck once lots of COGs start sharing it.

Doing multi-COG stuff like that in the future with caching may need its own new memory model - perhaps a new "MXMMC" model could be created for multi-COG XMM code for example.

David Betz wrote:

Thanks for the detailed analysis and your work on optimizing the code! Since you've been nice enough to code up all of these approaches I guess we could try them all with real-world code to see which performs best in practice. As I understand it, none of these approaches requires a change in the GCC code generation so we can simply swap out kernels to try different algorithms. As Eric said, the fcache may make the differences between your two-level cache and the single-level caches we have been using less significant. It will be interesting to try both to see. Can you post the code for each option?

Thanks for all of your hard work on this!
David

No problem. I just hope it's useful and may speed things up a bit. On the P2 LMM is going to be the faster option over XMMC but once you are dealing with much larger applications like lots of OS or protocol code etc, then the large SDRAM and XMMC with good performance in the 5-20 MIP range will open up a whole new world of possibilities. I would really hope to see something decent for large applications running from SDRAM on P2 one day.

I'll try to tidy up my various sample code VM loops a little and post it here soon so you can take a look to see if any ideas from it make sense to perhaps incorporate into PropGCC for P2 at some point.

Roger.

David Betz · 2013-05-12 18:20

rogloh wrote: »

I agree, for some future multi-COG SMP system keeping L2 tags in CLUT RAM would require each COG to have its own hub RAM which would eat up more hub memory. However one possibility that may help there is to use my earlier idea in post #5 of a separate small 64x2 way L1 in the CLUT RAM which is private for each COG and also share a common cache driver accessed by each COG that manages a single L2 cache, but I imagine that L2 cache driver could likely become a major bottleneck once lots of COGs start sharing it.

Doing multi-COG stuff like that in the future with caching may need its own new memory model - perhaps a new "MXMMC" model could be created for multi-COG XMM code for example.

No problem. I just hope it's useful and may speed things up a bit. On the P2 LMM is going to be the faster option over XMMC but once you are dealing with much larger applications like lots of OS or protocol code etc, then the large SDRAM and XMMC with good performance in the 5-20 MIP range will open up a whole new world of possibilities. I would really hope to see something decent for large applications running from SDRAM on P2 one day.

I'll try to tidy up my various sample code VM loops a little and post it here soon so you can take a look to see if any ideas from it make sense to perhaps incorporate into PropGCC for P2 at some point.

Roger.

I agree that trying to share even a second level cache is likely to result in very poor performance. In addition, one of the big claimed advantages of the Propeller is its deterministic execution. That won't be possible running cached code. I would think the best approach would be to have a single COG running a "main program" and then use the other COGs to run either LMM, COG C, or PASM drivers. After all, the Propeller is all about "soft peripherals" and those are likely to require deterministic execution.

Thanks for offering to clean up your sample code so we can try it in the P2 XMM kernel! Also, do you see any issues that would come up if we want to use at least the second level cache for data access as well as code access? PropGCC supports the xmm-single and xmm-split modes that put both code and data in external memory.

rogloh · 2013-05-12 19:17

A data cache brings up its own set of issues especially when it comes to writes and any video memory buffers in SDRAM needing writebacks and invalidating etc. This would need to be dealt with separately for the full blown XMM variants storing data in SDRAM. But that is a different problem and I wasn't looking at that right now. Most of this stuff I was looking at above was specifically targeting improving XMMC and so it was doing instruction caching only and I was expecting all data was stored in hub RAM and was therefore uncached. I do know that it is just a subset but it would likely still cover quite a lot of applications with up to ~120kB or so of hub RAM available for data. We also still need to bear in mind that whenever the XMMC code space needs to be modified (eg. reading in programs dynamically from an SD-card into SDRAM etc) there would need to be some special flushing commands added to be able to invalidate the L1/L2 cached entries at the affected addresses so any memory writes in XMMC mode also needs to check the address range first.

I am thinking for the full XMM model with both data and code stored in the same SDRAM memory space, it could make sense to create two separate caches if you can spare a bit more hub space. Eg. your could keep a nice 4kB L2 I-cache for instructions and have a separate general purpose D-cache of some appropriate size in hub memory. This then decouples execution performance when reading in lots of data from SDRAM as it wouldn't ever have to empty the I-cache to make room for data rows and simplifies a few things but it could require a few more memory range checks.

Just a thought.

rogloh · 2013-05-13 00:46

Here is the tidied up version of scheme 1 with combined L1 + 2 way L2. I needed to slightly modify the L2 cache size to be half the depth it was before (down to 32 x 2 way set associative from 64 x 2) to account for some new timestamps to track which cache row to invalidate using an LRU algorithm approximation. But luckily the management of the timestamps for this only adds one more hub cycle to the L2 miss case so it won't be too noticeable I hope.

Disclaimer: This code is completly untested, uncompiled etc as I don't have any P2 tools or DE0-NANO FGPA etc, so please take it as such and there's probably a reasonable chance of a bug in there somewhere.

I will try to post the other L2 only schemes when I get more time to clean that up further. Unfortunately I really have already spent more time on this than I should have done and neglected some other stuff I need to do so I am not sure when I will get back to it (hopefully soon). :frown:

initial_startup
	reps	#1, #256	' clear out caches in stack/CLUT RAM to prevent false matches
	setspa 	#0		' start at beginning of CLUT
	pusha   last		' fill cache with pattern of all 1's
	jmpd 	#restart	' jump into VM after 3 more cycles
	mov	pc, start	' initial code entry point is stored here
	sub	pc, #4		' compensate for start position by rewinding 1 instruction
        shr     zc, #1   wz, wc ' initialize flags

l2_miss	
	' We have missed the L2 cache lookup so we need to read in from SDRAM.
	' We need an algorithm to decide which of the two L2 row entries should be overwritten.  
	' To do thiw we can use upper half of the L2 cache stack RAM to hold the timestamps
	' for each entry and then compare them to decide which of the pair is oldest.
	'
	' The 256 long stack RAM is arranged like this below with interleaving.
	' Half the RAM is allocated to be a 64 entry direct mapped L1 cache.
        ' The other half is allocated a be a 32 x 2 way set associative cache
	' with timestamps storing when the row was last loaded into hub RAM.

	' Index		 Contents
	'   0 		 L1 cached address
	'   1		 L1 associated data (opcode) from address in index 0
	'   2		 L2 cached row 0 upper address bits
	'   3		 L2 cached row 1 upper address bits
	'   4 		 L1 cached address
	'   5		 L1 associated data (opcode) from address in index 4
	'   6		 L2 cached row 2 upper address bits
	'   7		 L2 cached row 3 upper address bits
	'  ...		 ' upper half now holds timestamps
	'  128		 L1 cached address 
	'  129		 L1 associated data (opcode) from address in index 128
	'  130		 L2 timestamp for tag entry at index 2 (row 0)
	'  131		 L2 timestamp for tag entry at index 3 (row 1)
	'  132		 L1 cached address
	'  133		 L1 associated data (opcode) from address in index 132
	'  134		 L2 timestamp for tag entry at index 6 (row 2)
	'  135		 L2 timestamp for tag entry at index 7 (row 3)
	'  ... etc

	addspb  #128		' move up to upper half of stack RAM which holds timestamps
	popb	ts2		' pop the timestamp for second tag entry of pair
	popb	ts1		' pop the timestamp for first tag entry of pair
	subcnt 	ts1 		' get elapsed time since first tag entry was loaded
	subspb  #126		' rollback pointer and separate the two subcnts, to avoid getting CNTH
	subcnt  ts2		' get elapsed time since second tag entry was loaded
	cmp	ts1, ts2 wc	' determine which entry is oldest and setup spb appropriately
 if_nc	subsbp  #1		' if C clear, then cnt-ts1 >= cnt-ts2 which means we replace the first entry

	' We prepare to call the SDRAM driver here and make a request to do a SDRAM burst lookup 
	' at SDRAM address available in "last" variable (PC&mask).

	wrlong	last, sdram_addr

	' SDRAM data is to be returned to hub memory cache row address at lastbase = l2base + (spb&1)<<shift2 

	getspb	lastbase	' get the spb value
	andn	lastbase, #1	' clear lsb
	shl	lastbase,shift2 ' shift up to form a row address
	add	lastbase,l2base ' add to start of L2 cache area in hub memory
	mov	ts1, lastbase	' form a hub address for SDRAM driver to read into
	shl	ts1, #4		' move up 4 bits to make room for burst transfer
	add	ts1, #BURST	' eg. 64 byte transfer 

	wrlong	ts1, sdram_hub  ' trigger SDRAM lookup

	pushbr	last		' save newest tag to stack RAM from this read operation
	addspb	#129		' move back to timestamps area of stack RAM
	getcnt	ts1		' get tick counter
	pushbr	ts1		' update the cache timestamp for this entry

	' We need to poll for SDRAM driver result here
polldriver
	rdlong	ts1,sdram_hub wz' poll for completion
 if_nz  jmp	#polldriver	' wait until done

	' Eventually the L2 cache row data is updated and we can resume
	
	jmpd	#resume		' prepare return
	add	misses, #1	' keep some stats for the number of L2 misses
        shr     zc, #1   wz, wc ' restore flags
	nop

hubaddr				
	' This code is entered when an instruction is read directly from the hub RAM range
	jmpd	#return		' prepare to return back into VM
	mov	lastbase, addr	' save last hub address
        shr     zc, #1   wz, wc ' restore saved flags
	rdlong  instr, pc	' read instuction at direct hub address

l2_lookup
	muxc    zc, #1          ' save VM's C flag before we lose it
	muxnz   zc, #2          ' save VM's Z flag before we lose it
	mov	addr, pc  wc	' copy PC and get MSB into C flag
	and	addr, mask	' retain upper bits
	mov	last, addr	' save the last accessed address base
 if_nc	jmp	#hubaddr	' if PC top bit is clear lookup directly from hub RAM
	shr	addr, shift1	' shift down to index stack RAM range
	andn	addr, #128	' clear upper bit to only read from lower half of stack RAM range
	setspb	addr		' assign stack pointer to SPB
	addspb	#2		' add interleaving offset to allow sharing with the L1 cache in stack RAM
	popbr	addr		' get base address at this tag entry
	cmp	addr, last wz   ' compare with our PC top bits
 if_nz  popbr	addr		' if no match read next entry,
 if_nz	cmp	addr, last wz   ' and compare with our PC top bits
 if_z	jmp     l2_miss		' if still no match we have an L2 miss and need an SDRAM lookup
	getspb  lastbase	' get current stack pointer
	andn	lastbase, #1    ' mask off bit 0 to form an index
	jmpd	#resume		' prepare to resume
	shl	lastbase,shift2 ' shift up to form hub cache row address offset
	add	lastbase,l2base ' add base address of L2 cache in hub RAM
        shr     zc, #1   wz, wc ' restore flags

l1_miss	
	mov	addr, pc	' get current pc
	and	addr, mask	' zero out the lower bits
	sub	addr, last	' compare against last known L2 row address
	tjnz	#l2_lookup,addr	' full L2 lookup is required if a different row is used
resume	mov	addr, pc	' get current pc
	andn	addr, mask	' zero out the upper bits to create offset
	add	addr, lastbase	' add offset to the last known cache row base in hub RAM
	rdlong	instr, addr	' read instruction to be executed at this address
return  pusha  	instr		' copy this instruction code back into L1 cache
	subspa	#2		' move back up to address position in the stack
	pushar	pc		' write PC address into L1 cache and fall through to VM loop below

restart reps	#8, #0		' repeat VM loop below lots of times
     	add	restarts, #1	' keep some stats in otherwise empty slot here 
	
				' The main 8 instruction cycle VM loop begins here
	add	pc, #4		' increment PC
instr   nop			' execute instruction in this position
	setspa	pc		' point spa to PC's LSBs
	popar	addr		' pop an address from stack RAM L1 cache
	sub	addr, pc	' subtract our PC from it
	tjnz	#l1_miss, addr	' compare address match with PC and do L1 miss if no match
	popar	instr		' pop associated data instruction from L1 cache in stack RAM
	add	l1_hits, #1	' track some stats in otherwise wasted slot here

	jmpd	#restart 	' if loop ever exits after 16384 iterations or and special kernel calls 
	nop			' are made which call out the VM loop it will be restarted again here
	nop			' CALLD opcodes would be appropriate for these kernel calls.
	nop			' 

	


' some VM related state
start		long	START_ADDR	' entry point of VM application
pc		long	0		' VM program counter
zc		long	1 		' saved flags / initial flags (NZ, NC)

; cache lookup related information
mask		long	L2_MASK		' this address mask is controlled by cache row size
shift1		long	L2_SHIFT1	' this bit shift is controlled by cache row size
shift2		long	L2_SHIFT2	' this bit shift is controlled by cache row size
l2base  	long	L2_BASE_ADDR	' start address of L2 cache in hub memory
addr		long	0		' temporary address variable
last		long	$ffffffff	' choose something that won't match by mistake initially
lastbase 	long	$ffffffff	' holds hub RAM base address of last access

' timestamps for L2 cache row entries
ts1		long	0		
ts2		long	0

' sdram driver stuff (This needs to be configured to point to hub ram in a suitable location)
sdram_hub 	long	SDRAM_HUB
sdram_addr 	long 	SDRAM_ADDR	

' optional extra debug counters for tracking hit/miss cycle stats etc in any "free" cycles
l1_hits 	long	0		
misses	 	long	0		
loops		long	0
restarts 	long	0

David Betz · 2013-05-13 04:35

Thanks Roger! I'm pretty busy this week so I may not be able to get back to this until after the coming weekend so there is no rush for the L2-only schemes.

Instruction caching ideas for P2 in XMMC mode with SDRAM

Comments