I'm seeing unnecessary icache reloading on loops bigger than 32 instructions, indicating that five-bit counters aren't quite adequate. One more bit would get it over the hump, for sure, but I'm extending the counters to 8 bits to make it really flush.
I just realized that extending the counter bits can't solve the continual reloading that will occur on loops greater than 32 instructions. We'd need preemptive cache loading to get around that, which looks like a real pain, so far.
Do you think there would be room for four lines of dcache for the RDxxxxC instructions?
If not, four tasks using RDxxxxC instructions will thrash like crazy.
It's not so much about room, but about critical path. The single cache line reading is already bumping the critical path time. Adding two more layers of muxes could really blow it out. I'll look into it, though.
What might be more valuable is 8 lines of icache with preemptive loading, allowing four tasks to run full speed in separate straight lines.
Presumably, in a regular cache, when you try to read something you don't have, you don't just load the cache with the thing you want but with the whole block of stuff it lives in, whatever size a block might be. So, when it comes to code execution, you don't just load the instruction you want into cache but a bunch of instructions ahead of it, ready for the next opcode fetch.
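To make that concrete, here is a minimal C sketch of the idea (the hub[] array, the fetch() helper, and the 8-long line size are my own illustration, not anything lifted from the actual design): on a miss, the whole enclosing line is copied in, so the next few sequential fetches hit without going back to hub.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_LONGS 8                     /* one cache line = 8 longs (32 bytes) */

static uint32_t hub[65536];              /* hub RAM modelled as an array of longs */
static uint32_t line[LINE_LONGS];        /* a single cached line */
static uint32_t line_base = 0xFFFFFFFF;  /* hub long-address of the cached line */

/* Fetch one long; on a miss, copy the whole enclosing line, not just that long. */
static uint32_t fetch(uint32_t addr, bool *hit)
{
    uint32_t base = addr & ~(uint32_t)(LINE_LONGS - 1);
    *hit = (base == line_base);
    if (!*hit) {
        for (int i = 0; i < LINE_LONGS; i++)   /* the "block" load described above */
            line[i] = hub[base + i];
        line_base = base;
    }
    return line[addr - base];
}

int main(void)
{
    bool hit;
    int hits = 0;
    for (uint32_t pc = 0x100; pc < 0x110; pc++) {   /* 16 sequential fetches */
        fetch(pc, &hit);
        hits += hit;
    }
    printf("hits: %d of 16\n", hits);    /* 14 of 16: only the two line loads miss */
    return 0;
}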
You will always get cache thrashing with the small number of cache lines, so it doesn't make much sense to use LRU anyway. Direct mapped would be much easier.
We have the nice structured high-level language we all know and love as Spin. Spin could no doubt be compiled to run on any machine; I see no reason it could not be compiled to native x86 or ARM instructions.
BUT: That pesky PASM code we put in DAT sections is also Spin. It's defined in the same manual. It's written into the same source files. It's built with the same compiler. Many objects rely on that PASM being there.
PASM is Spin. Spin is PASM.
As such, Spin/PASM is totally non-portable, unless you want to write a Prop emulator to run on your target machine.
That includes P1 to P2 portability. It just isn't portable. No one will want a P1 emulator on a P2 to run those PASM parts. Makes no sense.
Careful now, that kind of talk could result in calling it SPASM. Not a good thing.
Direct mapped would collide too much, causing more thrashing. With four cache lines of eight longs each, there would be 2048 octal-longs of hub mapping onto every cache line. LRU would allow higher hit rates.
Would it reduce the critical path to have an eight-line LRU cache shared for icache and dcache? I think that may require fewer muxes.
If pre-emptive loading is not easy, perhaps allow cache loads to use 'spare' hub slots from other cogs to reduce the latency for fetching the next line.
If you only have one task running code from the hub then I think direct-mapped would probably perform pretty well. I'm not sure anything will perform well if you have four tasks running from hub.
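For what it's worth, here is a rough C comparison of the two policies on the kind of pattern being worried about above: a loop whose body and a helper routine live in hub lines that happen to map to the same direct-mapped slot. The addresses, the dm_fetch()/fa_fetch() helpers, and the simple age-counter LRU are all my own illustration; the point is only that direct mapping reloads on every switch between the two lines, while a 4-line LRU keeps both resident after the first pass.

#include <stdint.h>
#include <stdio.h>

#define NLINES     4
#define LINE_LONGS 8

/* ---- direct mapped: hub line N can only live in slot N % 4 ---- */
static uint32_t dm_tag[NLINES] = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF};

static int dm_fetch(uint32_t addr)
{
    uint32_t lineno = addr / LINE_LONGS;
    uint32_t slot = lineno % NLINES;
    if (dm_tag[slot] == lineno) return 1;   /* hit */
    dm_tag[slot] = lineno;                  /* miss: reload this slot */
    return 0;
}

/* ---- fully associative with LRU: any hub line can live in any slot ---- */
static uint32_t fa_tag[NLINES] = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF};
static int fa_age[NLINES];                  /* bigger = older */

static int fa_fetch(uint32_t addr)
{
    uint32_t lineno = addr / LINE_LONGS;
    int hit = -1, victim = 0;
    for (int i = 0; i < NLINES; i++) {
        if (fa_tag[i] == lineno) hit = i;
        if (fa_age[i] > fa_age[victim]) victim = i;
    }
    for (int i = 0; i < NLINES; i++)
        fa_age[i]++;                        /* everything ages by one access */
    if (hit >= 0) { fa_age[hit] = 0; return 1; }
    fa_tag[victim] = lineno;                /* miss: replace the oldest line */
    fa_age[victim] = 0;
    return 0;
}

int main(void)
{
    int dm_hits = 0, fa_hits = 0, total = 0;
    /* loop body at longs 0x100..0x107 repeatedly calls a helper at 0x500..0x507;
       both hub lines map to direct-mapped slot 0, so they keep evicting each other */
    for (int pass = 0; pass < 100; pass++) {
        for (uint32_t a = 0x100; a < 0x108; a++) { dm_hits += dm_fetch(a); fa_hits += fa_fetch(a); total++; }
        for (uint32_t a = 0x500; a < 0x508; a++) { dm_hits += dm_fetch(a); fa_hits += fa_fetch(a); total++; }
    }
    printf("direct mapped: %d/%d hits, 4-line LRU: %d/%d hits\n", dm_hits, total, fa_hits, total);
    return 0;
}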
Isn't what you're calling a block what Chip is calling a cache line?
With what we've got now, any time a PC advances to the next 8 longs in hub RAM, a cache load must be done. The only time multiple cache lines benefit is when a reverse branch occurs that falls within already-cached lines.
If preemptive loading could be worked out, we could get way better (2x, or full-speed) performance for single-task straight-line hub execution. I'm thinking that this could be achieved by using the hub cycle if no other instruction is using it. This mode could engage automatically when all task slots are 0 - or by the same mechanism which causes instructions like WAITVID to loop, instead of stalling the pipeline, during multitasking.
Perhaps, I'm not used to this cache terminology. I'm thinking page faults in a virtual memory system. When you hit a location that is not in RAM you swap a whole 4K page (or whatever) from disk to RAM and continue. Not just that instruction you wanted. Then, in the normal run of things the code you want next is already in RAM.
I presume with a cache memory the same goes on except perhaps not 4KB at a time.
Yes, that is correct but in cache terminology the block of memory that gets read on a cache miss is called a cache line. If we ever get to the TLB idea then we would be talking about pages as you suggest.
Okay, I think I understand what you mean by "preemptive". You mean that the cache line following where the PC is currently pointing gets fetched in the background before a cache miss happens. Yes, that would tend to speed up straight-line code. I wonder how much code is straight-line code though. It might hurt code that branches a lot.
From what Chip is saying I get the idea he is only loading a single instruction on a cache miss. The cache line is a single LONG. Which is OK if you are in a loop of 32 instructions with 32 "cache lines".
Or am I misunderstanding?
A cache line is 8 longs, and that is what gets loaded when an out-of-cache fetch occurs. So, we pick up a whole 8 instructions. The problem is that after executing those 8 instructions, we need to load 8 more, and that typically takes 8 more clocks, giving ~50% cache hit rate.
I thought a cache line was 8 longs which is the number you can fetch from the hub in a single pass.
I was hoping for a simple preemptive loading for use when only a single task is running in hub mode.
I also like the idea of using the next available slot to load the cache, as this will improve performance significantly.
While it would be nice to increase the number of cache slots from 4 to 8, I presume we are running into silicon space problems.
Therefore, could the following work...
1. The cog knows whether multi-task or single-task mode is set.
2. If single task, then following execution from "another" cache block (meaning an instruction fetch from another block boundary, whether or not it needed to be reloaded from hub), the preemptive state machine looks to see if the next block is in cache, and if not it pre-emptively loads it into the LRU block (a rough sketch of this check follows the list below).
3. Reloading could/would use the next available slot.
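If I follow the proposal, the check would look roughly like this in C. The icache_t structure and the on_line_boundary()/start_line_load() names are mine, just to pin the idea down; the real thing would be Verilog, and step 3 would wait for an actual free hub slot rather than completing instantly as the stand-in here does.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NLINES     4
#define LINE_LONGS 8

typedef struct {
    uint32_t tag[NLINES];   /* hub line number held in each cache slot */
    uint8_t  lru[NLINES];   /* bigger = older */
    bool     single_task;   /* step 1: only prefetch when a single task runs */
} icache_t;

/* stand-in for the hardware: pretend the 256-bit hub read finishes in time
   (step 3 would really use the next hub slot available to this cog)        */
static void start_line_load(icache_t *c, int slot, uint32_t hub_line)
{
    c->tag[slot] = hub_line;
    c->lru[slot] = 0;
    printf("prefetching hub line %u into cache slot %d\n", (unsigned)hub_line, slot);
}

static bool line_cached(const icache_t *c, uint32_t hub_line)
{
    for (int i = 0; i < NLINES; i++)
        if (c->tag[i] == hub_line) return true;
    return false;
}

static int lru_slot(const icache_t *c)
{
    int v = 0;
    for (int i = 1; i < NLINES; i++)
        if (c->lru[i] > c->lru[v]) v = i;
    return v;
}

/* step 2: called whenever the PC crosses into a different cache block,
   whether or not that crossing had to reload from hub                   */
static void on_line_boundary(icache_t *c, uint32_t pc)
{
    if (!c->single_task) return;                     /* step 1 */
    uint32_t next_line = pc / LINE_LONGS + 1;
    if (!line_cached(c, next_line))
        start_line_load(c, lru_slot(c), next_line);  /* load into the LRU block */
}

int main(void)
{
    icache_t c = { .tag = {0x20, 1, 2, 3}, .lru = {0, 3, 2, 1}, .single_task = true };
    on_line_boundary(&c, 0x100);   /* just crossed into hub line 0x20; prefetches 0x21 */
    return 0;
}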
@Chip,
Does this mean that when doing hub execution we basically become synchronized to the hub window, but not in a desirable way: after executing the 8 cached instructions we just miss the window each time, so we have to wait for it to come around again with our next 8 instructions, losing 8 extra clock cycles each time and yielding only ~50% instruction throughput? If a speculative/prefetch octal hub read automatically happened in parallel with the 7th instruction of the group (assuming no other hub read was required), would that return the next set of 8 instructions in time and fix this? Is this basically a pipelining type of problem, or something else?
Where would 2048 octal-longs for each of 4 lines come from? LOL.
As I understand it Chip has 4 lines with 8 longs per line, which is not really a cache. If he could use more of the COG or AUX space, there could be enough cache to make a difference.
Chip, why don't you draw us a picture of your cache design to be clear?
Roger.
Yes, we do become sync'd to the hub by cache line loading. We load 8 instructions, then they take 8 clocks to execute, then we must wait 8 more clocks to get to the next hub cycle to load the next 8 instructions. A preemptive load is needed to overcome this. I think it will be simple to implement, after all. As long as there is no hub instruction in stage 4 of the pipeline, we can sneak the preemptive cache load in from stage 1. This is the next thing I will try to do. It will enable single-task hub execution to go full-speed, until a branch occurs to a location that is not cached.
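To put numbers on that, a quick cycle-count model under simple assumptions (one instruction per clock once a line is cached, hub window every 8 clocks; the figures are mine, for illustration only):

#include <stdio.h>

int main(void)
{
    /* straight-line hub execution, one instruction per clock once a line is cached,
       hub window every 8 clocks (assumed numbers, for illustration)                 */
    int instructions = 800;            /* 100 cache lines of 8 instructions */

    /* today: load 8, execute them in 8 clocks, then wait 8 more clocks for the hub */
    int clocks_now = (instructions / 8) * (8 /* execute */ + 8 /* wait for hub */);

    /* with a preemptive load sneaked in while the current line executes,
       the next line is already there, so only the execute clocks remain  */
    int clocks_prefetch = instructions;

    printf("no prefetch:   %d clocks for %d instructions (%.0f%% of full speed)\n",
           clocks_now, instructions, 100.0 * instructions / clocks_now);
    printf("with prefetch: %d clocks for %d instructions (%.0f%% of full speed)\n",
           clocks_prefetch, instructions, 100.0 * instructions / clocks_prefetch);
    return 0;
}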
We have 4 lines of 8 longs. A whole line is loaded at once by a 256-bit hub read. Then, the 8 longs within a cache line can be fed, one at a time, into the pipeline.
Careful now, that kind of talk could result in calling it SPASM. Not a good thing.
PASM starts with "P" too, but I refuse to stop using it.
Is it not possible to have a 2x8-long cache, so that while one set of 8 longs executes, the other 8 are preloaded?
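If I read the suggestion right, that is essentially a double-buffered (ping-pong) fetch: while the 8 longs in one buffer execute, the other buffer is refilled with the next line. A rough C model of the idea follows; the buf[]/load_line() names and the instant line load are my own simplifications, not anything from the actual design.

#include <stdint.h>
#include <stdio.h>

#define LINE_LONGS 8

static uint32_t hub[65536];              /* hub RAM as longs */
static uint32_t buf[2][LINE_LONGS];      /* the two 8-long buffers */
static uint32_t buf_line[2] = {0, 1};    /* hub line number held in each buffer */

static void load_line(int b, uint32_t hub_line)   /* stands in for one 256-bit hub read */
{
    for (int i = 0; i < LINE_LONGS; i++)
        buf[b][i] = hub[hub_line * LINE_LONGS + i];
    buf_line[b] = hub_line;
}

int main(void)
{
    int run = 0;                              /* buffer currently feeding the pipeline */
    load_line(0, 0);                          /* prime both buffers */
    load_line(1, 1);
    for (uint32_t pc = 0; pc < 64; pc++) {    /* run 64 longs of straight-line code */
        uint32_t line = pc / LINE_LONGS;
        if (buf_line[run] != line)
            run ^= 1;                         /* crossed a boundary: swap to the other buffer */
        uint32_t insn = buf[run][pc % LINE_LONGS];
        (void)insn;                           /* "execute" the instruction */
        if (pc % LINE_LONGS == 0 && pc >= LINE_LONGS)
            load_line(run ^ 1, line + 1);     /* refill the idle buffer with the next line */
    }
    printf("executed 64 longs while the next line loaded in the idle buffer\n");
    return 0;
}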
Regarding preemptive loading:
- significant improvement to straight-line code
- if the 8th instruction (of the eight longs) is NOT a hub access, start the pre-fetch
- if pre-fetching, use the first available hub slot unused by any cog (then we do not have to wait for our own turn)
Regarding four cache lines with LRU:
- much better single task performance, loops up to 32 instructions at cog speed
- much better performance for two to four tasks all running hubexec
Regarding single line RDxxxxC dcache line:
- OK for single task
- with four tasks, degenerates to non-cached reads due to thrashing (four dcache lines with LRU would fix this)
Next fpga image will be a LOT of fun!