Propeller II update - BLOG - Page 153 — Parallax Forums

Propeller II update - BLOG

Comments

  • Roy Eltham Posts: 3,000
    edited 2014-01-11 23:55
    Hey Chip,
    I'm trying to understand how this works at the nitty gritty detail level. So if code is running in hub execution mode, what is actually feeding the execution pipe with the instructions? Cog memory or the cache line buffer memory? I assume it's the cache line memory. It's not reading from hub into the cache line and then copying that to actual cog memory. Right? So, the cache lines aren't really free until you've gotten the 8th instruction in them into the pipeline.

    If you have 4 tasks in hub execution mode, you will not have any free cache lines to preemptively load. You'll need another one at least.

    Also, if I have 32 longs of code in hub and the last instruction is a branch back to the first and I am running only one task in hub execution mode, will this end up in the 4 cache lines and run without ever needing to read hub for cache fills again? How does it know what hub addresses each cache line holds? Now say that my 32 longs contain a loop inside of a loop, so there is still the outer loop, but inside there is an inner, smaller loop that is not 8-long aligned, so there is a branch from the middle of one cache line into the middle of another. Will it still use the cache lines, or will it reload a cache line starting at the new hub address? In other words, are cache lines always at 8-long-aligned hub addresses, and can I jump into the middle of one?

    Sorry if all of this is very basic stuff that should be assumed. I'm assuming it all works in reasonable ways, I just am interested in the details. I'm used to dealing with (and trying to make happy) much more complex cache systems in modern CPUs/GPUs, so it's hard to imagine a cache working without all the complex pieces.
  • Cluso99 Posts: 18,069
    edited 2014-01-12 00:52
    Roy Eltham wrote: »
    Hey Chip,
    I'm trying to understand how this works at the nitty gritty detail level. So if code is running in hub execution mode, what is actually feeding the execution pipe with the instructions? Cog memory or the cache line buffer memory? I assume it's the cache line memory. It's not reading from hub into the cache line and then copying that to actual cog memory. Right? So, the cache lines aren't really free until you've gotten the 8th instruction in them into the pipeline.
    Yes, but all 8 instructions are loaded in one clock because the hub is 8 longs wide.
    The pipeline is being fed directly from the cache rather than reading the instruction from the cog. IIRC Chip is actually gating this directly in during the 2nd clock of the pipeline.
    If you have 4 tasks in hub execution mode, you will not have any free cache lines to preemptively load. You'll need another one at least.
    There are only 4, so there are no spare cache lines for each task. So it would stall, except Chip is looking at preventing the stall.
    Also, if I have 32 longs of code in hub and the last instruction is a branch back to the first and I am running only one task in hub execution mode, will this end up in the 4 cache lines and run without ever needing to read hub for cache fills again?
    Yes, but preemptive loading (if implemented as suggested) would unfortunately replace the first block. So, with preemptive loading, the loop would be limited to 3*8=24 instructions. I would rather this because in the majority of cases, it would be beneficial.
    How does it know what hub addresses each cache line holds?
    There is a register per block that contains the address of the hub block that is currently loaded into this block.
    Now say that my 32 longs contain a loop inside of a loop, so there is still the outer loop, but inside there is an inner, smaller loop that is not 8-long aligned, so there is a branch from the middle of one cache line into the middle of another. Will it still use the cache lines, or will it reload a cache line starting at the new hub address? In other words, are cache lines always at 8-long-aligned hub addresses, and can I jump into the middle of one?
    They are always 8*Long aligned. So there will not be any problems in this (except as above)
    Sorry if all of this is very basic stuff that should be assumed. I'm assuming it all works in reasonable ways, I just am interested in the details. I'm used to dealing with (and trying to make happy) much more complex cache systems in modern CPUs/GPUs, so it's hard to imagine a cache working without all the complex pieces.
    This is a very simple cache implementation. It's more limited because it is only 4 blocks long. But it will make the P2 "sing" in hub mode :)
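Editor's sketch (not from the thread): the answers above describe four cache lines of 8 longs, each with a register holding the aligned hub address it currently contains. A toy Python model of that lookup might look like the following. The class and method names are hypothetical, and the real hardware compares all four tags in parallel rather than looping.

```python
LONGS_PER_LINE = 8   # from the thread: each cache line holds 8 longs
NUM_LINES = 4        # from the thread: 4 cache lines

class InstructionCache:
    """Toy model: each line remembers which aligned hub block it holds."""

    def __init__(self):
        # tags[i] is the hub long address of the block in line i, or None if empty
        self.tags = [None] * NUM_LINES

    def lookup(self, long_addr):
        """Return (line_index, offset) on a hit, or None on a miss."""
        block = long_addr & ~(LONGS_PER_LINE - 1)   # align down to 8 longs
        offset = long_addr & (LONGS_PER_LINE - 1)   # position inside the line
        for i, tag in enumerate(self.tags):
            if tag == block:
                return i, offset
        return None

    def fill(self, line_index, long_addr):
        """Load the aligned block containing long_addr into the given line."""
        self.tags[line_index] = long_addr & ~(LONGS_PER_LINE - 1)

cache = InstructionCache()
cache.fill(0, 0x1130)          # line 0 now holds longs $1130..$1137
hit = cache.lookup(0x1133)     # a branch into the middle of that line
miss = cache.lookup(0x1138)    # the next block has not been loaded
```

Because the tag is the aligned block address, a branch into the middle of a loaded line is still a hit; only the offset changes.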
  • evanh Posts: 16,075
    edited 2014-01-12 02:42
    Cluso99 wrote: »
    Yes, but preemptive loading (if implemented as suggested) would unfortunately replace the first block. So, with preemptive loading, the loop would be limited to 3*8=24 instructions. I would rather this because in the majority of cases, it would be beneficial.

    Prefetching rather than preemption is definitely the more appropriate term here.

    In the case of max loop size, it should be possible to execute in a similar manner to the existing treatment of Cog branching - where the branch taken is the one that's prefetched. This would allow a cache line to remain entirely filled with a crafted loop until the job is done.
  • Bill Henning Posts: 6,445
    edited 2014-01-12 06:49
    LOL! Sorry for the poor explanation.

    256KB hub = 64K longs

    64KL / 4 cache lines = 16KL / cache line

    16KL / 8 longs per cache line = 2048 hub longs map onto each cache line long

    so when direct mapped, 2048 lines of eight longs will map onto each cache line

    Without LRU, the cache will thrash a lot, as direct mapping will cause useful cache lines to be overwritten a lot more
    jazzed wrote: »
    Where would 2048 octal-longs for 4 each lines come from? LOL.

    As I understand it Chip has 4 lines with 8 longs per line, which is not really a cache. If he could use more of the COG or AUX space, there could be enough cache to make a difference.

    Chip, why don't you draw us a picture of your cache design to be clear?
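Editor's sketch (not from the thread): the direct-mapping arithmetic in Bill's reply can be checked in a few lines of Python. The constant names are illustrative; the values (256KB hub, 4 lines, 8 longs per line) come from the posts above.

```python
HUB_BYTES = 256 * 1024   # 256KB hub
BYTES_PER_LONG = 4
NUM_LINES = 4            # cache lines
LONGS_PER_LINE = 8       # longs per cache line

hub_longs = HUB_BYTES // BYTES_PER_LONG            # 64K longs
longs_per_line = hub_longs // NUM_LINES            # 16K longs compete for each line
blocks_per_line = longs_per_line // LONGS_PER_LINE # 8-long blocks mapping onto each line
```

So under a direct-mapped scheme, each of the four lines would be shared by 2048 different 8-long hub blocks.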
  • Bill Henning Posts: 6,445
    edited 2014-01-12 06:50
    The pre-fetch could easily tell if the 8long about to be fetched is already in one of the existing cache lines, so loops could still be 32 instructions long.
    evanh wrote: »
    Prefetching rather than preemption is definitely the more appropriate term here.

    In the case of max loop size, it should be possible to execute in a similar manner to the existing treatment of Cog branching - where the branch taken is the one that's prefetched. This would allow a cache line to remain entirely filled with a crafted loop until the job is done.
  • David Betz Posts: 14,516
    edited 2014-01-12 07:49
    Bill Henning wrote: »
    LOL! Sorry for the poor explanation.

    256KB hub = 64K longs

    64KL / 4 cache lines = 16KL / cache line

    16KL / 8 longs per cache line = 2048 hub longs map onto each cache line long

    so when direct mapped, 2048 lines of eight longs will map onto each cache line

    Without LRU, the cache will thrash a lot, as direct mapping will cause useful cache lines to be overwritten a lot more
    I don't think anyone was arguing against LRU. Direct-mapped was only proposed as a fairly good alternative if there is difficulty in implementing the LRU algorithm in HW.
  • Bill Henning Posts: 6,445
    edited 2014-01-12 07:51
    I was thinking about pre-fetching the next cache line.

    How about something like this:

    Hubexec fetches a cache line, starts executing it

    - it also sets the address of the next cache line to fetch (pre-fetch) into the hub fetch address register, if it is not already in another cache line

    - if any instruction in the current cache line needs hub access, it overrides the prefetch address

    A better version would use any available unused hub slot to fill requests instead of waiting for the next window.

    This way, cache lines consisting of non-hub access instructions, including RDxxxC that are cached, would prefetch as needed.

    A non-cached hub op would cause a slight delay to the next unused slot, or 8 cycle window.

    Note, 8 single cycle ops would naturally line up with the next window.
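Editor's sketch (not from the thread): the prefetch rule proposed in the post above can be written as a small decision function. This is a hypothetical model of the proposal only, not Chip's implementation; the function name and the tuple return values are invented for illustration.

```python
LONGS_PER_LINE = 8

def choose_hub_request(current_block, cached_blocks, explicit_hub_addr=None):
    """Decide what the next hub window is used for (hypothetical policy model).

    current_block     -- aligned long address of the line being executed
    cached_blocks     -- set of aligned long addresses already in the cache
    explicit_hub_addr -- address of a pending hub instruction, if any
    """
    if explicit_hub_addr is not None:
        # A hub-access instruction overrides the prefetch address.
        return ("explicit", explicit_hub_addr)
    next_block = current_block + LONGS_PER_LINE
    if next_block in cached_blocks:
        # The next line is already cached, so there is nothing to fetch.
        return ("idle", None)
    return ("prefetch", next_block)

plain = choose_hub_request(0x100, {0x100})
cached = choose_hub_request(0x100, {0x100, 0x108})
overridden = choose_hub_request(0x100, {0x100}, explicit_hub_addr=0x2000)
```

The "idle" case is what lets a loop that already fits in the cache keep all of its lines instead of having one replaced by an unnecessary prefetch.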
  • jazzed Posts: 11,803
    edited 2014-01-12 07:59
    Bill Henning wrote: »
    Without LRU, the cache will thrash a lot, as direct mapping will cause useful cache lines to be overwritten a lot more
    The cache is so small it doesn't matter what replacement mechanism is used. Chip found that out already.

    The only real improvement will be made by increasing the number of cache lines (I.E. dedicating a block of COG RAM for that purpose). Then LRU would offer advantages over the cell cost and complexity required to implement it.
  • Bill Henning Posts: 6,445
    edited 2014-01-12 08:04
    It still matters, note the following simplified example:

    (for simplicity, all addresses below are 16-bit long addresses; the two low-order zero bits are implied)

    $1130 main code, 8 instructions in currently executing cache line, calls subroutine at $xxx8

    with LRU, original cache line left alone

    without LRU, original calling cache line overwritten

    It makes a difference with as little as two cache lines.

    With one cache line, there is no difference.

    This can be improved slightly without LRU with a four line direct mapped as follows: (which I believe is what you are proposing)

    %aaaaaaaa aaaLLxxx00

    Where the a's are the 11 upper hub address bits

    LL is the cache line number

    xxx is the long within the cache line

    00 are the implied zeros

    Direct mapping would still overlap for every 128 bytes, so it would still often flush perfectly usable code that LRU would keep.

    Now for P3, the more cache lines the better :)
    jazzed wrote: »
    The cache is so small it doesn't matter what replacement mechanism is used. Chip found that out already.

    The only real improvement will be made by increasing the number of cache lines (I.E. dedicating a block of COG RAM for that purpose). Then LRU would offer advantages over the cell cost and complexity required to implement it.
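Editor's sketch (not from the thread): the `%aaaaaaaa aaaLLxxx00` layout in Bill's post can be decoded with shifts and masks. This assumes an 18-bit hub byte address (256KB) and reads the `xxx` field as the long within the line; the function name is hypothetical.

```python
def split_hub_address(byte_addr):
    """Split an 18-bit hub byte address into (tag, line, long_in_line)."""
    long_in_line = (byte_addr >> 2) & 0b111   # xxx: which of the 8 longs
    line = (byte_addr >> 5) & 0b11            # LL: which of the 4 cache lines
    tag = (byte_addr >> 7) & 0x7FF            # a...a: 11 upper address bits
    return tag, line, long_in_line

first = split_hub_address(0x00020)    # byte address 32
second = split_hub_address(0x000A0)   # 128 bytes further on
```

The two addresses are 128 bytes apart and decode to the same line number with different tags, which is the overlap Bill describes: in a direct-mapped cache, loading one would evict the other even if the remaining three lines were idle.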
  • David Betz Posts: 14,516
    edited 2014-01-12 08:10
    Bill Henning wrote: »
    With one cache line, there is no difference.
    Ummm... with only one cache line LRU has no meaning. :-)
  • Bill Henning Posts: 6,445
    edited 2014-01-12 08:12
    Exactly :)

    The fact is that even one cache line helps, and that four without LRU is better than one cache line. But four with LRU is better :)

    Only Chip knows if LRU is too complicated to put in, and reading his messages, it does not seem like he sees it as too difficult.
    David Betz wrote: »
    Ummm... with only one cache line LRU has no meaning. :-)
  • David Betz Posts: 14,516
    edited 2014-01-12 08:20
    Exactly :)

    The fact is that even one cache line helps, and that four without LRU is better than one cache line. But four with LRU is better :)
    I agree completely
    Only Chip knows if LRU is too complicated to put in, and reading his messages, it does not seem like he sees it as too difficult.
    It does seem as though Chip has a pretty good handle on LRU so it may not be "too difficult". Only he can say for sure.
  • jazzed Posts: 11,803
    edited 2014-01-12 08:26
    Did Bill ever get his LRU to work? I never saw it work.
  • Bill Henning Posts: 6,445
    edited 2014-01-12 09:06
    VMCOG had a working LRU almost from the beginning. See "BUSERR" in the source.
    jazzed wrote: »
    Did Bill ever get his LRU to work? I never saw it work.
  • pedward Posts: 1,642
    edited 2014-01-12 09:11
    My suggestion:

    I presume there is an internal register where the address for a Hub memory access is stored. Preload this register with the next location of the hub to load the next 8 instructions when the cache is loaded. You would then preload a load op. At any time the cog needs to access memory, it would simply overwrite the address and op.

    So basically think of memory access as a queue, you preload the 1 entry queue with the proper values, but if an explicit op is run in COG, you'd usurp the preloaded values so they don't steal time.

    I thought about this before, if any instruction in the cache is a branch or memory op, it would still need to wait for the next hub sync to execute, but you can always preload the cache (in this case if it's a branch, it would be a missed load) for free.

    The load happens in the hub slot when the instructions are running, because the instructions always run 1 hub slot later than the load. You won't see any penalty for pre-loading the cache because the instructions would have to wait for the window anyway. In fact, the window would be wasted if the preload wasn't happening.

    If you implement automatic preload, the cog will always be reading from memory every hub window, whether a preload, branch, or RDxxxxx instruction.

    You could make it into an automatic state machine, the destination is either icache or dcache, the address is specified as automatic or specific (branch, data read). The state machine is ALWAYS reading from hub memory, there is no "if". The read would only be interrupted by data write, but then resume on next hub window unless the state is changed.
  • Bill Henning Posts: 6,445
    edited 2014-01-12 10:14
    Chip,

    I understand you are changing to 8 bit counters, with saturation.

    How about dividing the counts in two by pre-setting all the counters to (counter >> 1) when any counter saturates? I used this technique in my LRU implementation for VMCOG.
  • Cluso99 Posts: 18,069
    edited 2014-01-12 17:07
    pedward wrote: »
    My suggestion:

    I presume there is an internal register where the address for a Hub memory access is stored. Preload this register with the next location of the hub to load the next 8 instructions when the cache is loaded. You would then preload a load op. At any time the cog needs to access memory, it would simply overwrite the address and op.

    So basically think of memory access as a queue, you preload the 1 entry queue with the proper values, but if an explicit op is run in COG, you'd usurp the preloaded values so they don't steal time.

    I thought about this before, if any instruction in the cache is a branch or memory op, it would still need to wait for the next hub sync to execute, but you can always preload the cache (in this case if it's a branch, it would be a missed load) for free.

    The load happens in the hub slot when the instructions are running, because the instructions always run 1 hub slot later than the load. You won't see any penalty for pre-loading the cache because the instructions would have to wait for the window anyway. In fact, the window would be wasted if the preload wasn't happening.

    If you implement automatic preload, the cog will always be reading from memory every hub window, whether a preload, branch, or RDxxxxx instruction.

    You could make it into an automatic state machine, the destination is either icache or dcache, the address is specified as automatic or specific (branch, data read). The state machine is ALWAYS reading from hub memory, there is no "if". The read would only be interrupted by data write, but then resume on next hub window unless the state is changed.
    I would not want to always use the hub slot to load. If Chip implements "use next available slot" technique (which I hope happens) then the slots where a read happens but is not required, it wastes a slot that could be used by another cog. Also, the power usage goes up too.

    A while ago I suggested a deliberate cache load instruction. This instruction would be queued (only 1 level queue) to load the next LRU cache line with the 8 longs at this hub address. This would give us a 1 clock penalty to preload the cache with the next cache line.
  • pedward Posts: 1,642
    edited 2014-01-12 17:25
    Cluso99 wrote: »
    I would not want to always use the hub slot to load. If Chip implements "use next available slot" technique (which I hope happens) then the slots where a read happens but is not required, it wastes a slot that could be used by another cog. Also, the power usage goes up too.

    A while ago I suggested a deliberate cache load instruction. This instruction would be queued (only 1 level queue) to load the next LRU cache line with the 8 longs at this hub address. This would give us a 1 clock penalty to preload the cache with the next cache line.

    a) It doesn't "Waste" a slot because every COG has a dedicated slot.
    b) If you are running code from HUB, you'll see 50% efficiency without preload, so it's necessary that you always-be-loading unless another memory op requests access to the hub memory bus

    I really think borrowing hub memory slots is a Bad Idea(tm).
  • Cluso99 Posts: 18,069
    edited 2014-01-12 17:35
    There are two different types of instruction caching required...

    * Single-task
    * Multi-task

    SINGLE-TASK INSTRUCTION CACHE

    In stage 1 of the pipeline, the instruction is fetched. Normally this would be a read from a cog address. However, in hubexec mode, this instruction is fetched from the instruction cache. If the instruction cache does not hold the cache line, then a new fetch is issued and the pipeline stalls until the cache line is loaded.

    We now have alignment with the hub window.

    Presume the instructions are sequential. After executing all 8 instructions fetched from hub into the cache, we again reach the point where in stage 1 of the pipeline we need to fetch a new cache line. We are precisely aligned with the hub window, ready for another hub cache line fetch.

    Now, I previously understood that Chip realised that we did not have to wait for a new hub window because we could fetch the cache line using the current window, and directing the first long (first instruction) directly into stage 2 of the pipeline as well as saving it in the cache line. Therefore, I thought that there would be no lost hub windows for sequential hub instruction sequences.

    However, in the above case, hub reads and writes would always stall to the next hub window, and would ultimately appear to always take 8 clocks.

    In the single task case, there are 4 cache lines (with LRU). This benefits the hub execution model for loops of up to 32 instructions.

    MULTI-TASK INSTRUCTION CACHE

    In the multi-task case, each task has only 1 cache line (8 instructions).

    If each task (presume 4 for this case) is running sequential hub code, then each task will only get cache fetch access once every 4 hub cycles. But, each task only executes once in every 4 clock cycles, so while the tasks remain interleaved relative to the hub window, no delays would be noticed. However, a task that gets out of step with the hub window would create stalls for the other tasks.
  • Cluso99 Posts: 18,069
    edited 2014-01-12 17:45
    pedward wrote: »
    a) It doesn't "Waste" a slot because every COG has a dedicated slot.
    It does waste a slot because "that slot" could be used for "this cog" to perform a RD/WRxxxxC data cache access.
    b) If you are running code from HUB, you'll see 50% efficiency without preload, so it's necessary that you always-be-loading unless another memory op requests access to the hub memory bus
    See my post above. If I understood correctly, a while back Chip realised that he could actually achieve loading at 100%. This would be fantastic if my recollection was correct.

    I really think borrowing hub memory slots is a Bad Idea(tm).
    We are just going to agree to disagree.

    I hope Chip silently implements this and informs me (and a few others) offline. I can see great performance advantages of utilising otherwise wasted slots. I will utilise this feature to great advantage. For such a simple mechanism, it would be a real shame to deprive those of us who want to take advantage of this mechanism to improve the performance of the P2. If it is in the silicon, it can always be used later. Ultimately, it could make the difference of a design win or not.
  • pedward Posts: 1,642
    edited 2014-01-12 19:07
    Cluso99 wrote: »
    It does waste a slot because "that slot" could be used for "this cog" to perform a RD/WRxxxxC data cache access.

    See my post above. If I understood correctly, a while back Chip realised that he could actually achieve loading at 100%. This would be fantastic if my recollection was correct.

    We are just going to agree to disagree.

    I hope Chip silently implements this and informs me (and a few others) offline. I can see great performance advantages of utilising otherwise wasted slots. I will utilise this feature to great advantage. For such a simple mechanism, it would be a real shame to deprive those of us who want to take advantage of this mechanism to improve the performance of the P2. If it is in the silicon, it can always be used later. Ultimately, it could make the difference of a design win or not.

    It doesn't waste a slot because I clearly stated that explicit reads or writes take priority.

    When you load 8 instructions from hub, it takes 8 clocks. You start executing those instructions and the hub window is wasted unless you immediately issue a pre-load. So, load 8, exec 8, load 8, exec 8 is how it is now. With preload it's load 8, exec 8/load 8, exec 8/load 8. The load and exec are done simultaneously to keep the cache full. When a branch instruction is hit, you have to wait for the next hub window anyway, so it's load 8, exec 8/load 8, load 8, exec 8/load 8. When an explicit read or write is done, it's load 8, exec 8/load 8, read/write, exec 8/load 8. You see there, the icache already had the code that needs to execute because it was loaded when the RD/WR instruction was hit. So you could have up to 16 cache hits if the first instruction is a memory op. It really does benefit to keep the hub in read mode unless an explicit op is requested.
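Editor's sketch (not from the thread): pedward's two timelines for straight-line code can be turned into a rough clock count. This models his description only, treating each load as occupying a full 8-clock hub window (a premise the next reply disputes); the function names are hypothetical.

```python
WINDOW = 8  # clocks between a cog's hub windows; also instructions per cache line

def clocks_without_preload(num_lines):
    # "load 8, exec 8, load 8, exec 8": every line pays for a load and an exec
    return num_lines * (WINDOW + WINDOW)

def clocks_with_preload(num_lines):
    # "load 8, exec 8/load 8, ...": only the first load is not overlapped
    return WINDOW + num_lines * WINDOW

def efficiency(num_lines, clocks):
    # fraction of clocks that execute an instruction
    return (num_lines * WINDOW) / clocks

no_preload = efficiency(100, clocks_without_preload(100))
with_preload = efficiency(100, clocks_with_preload(100))
```

Under these assumptions the model reproduces the 50% figure quoted earlier for running without preload, and approaches 100% with preload as the straight-line run gets longer.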
  • Cluso99 Posts: 18,069
    edited 2014-01-12 19:35
    pedward wrote: »
    It doesn't waste a slot because I clearly stated that explicit reads or writes take priority.
    In the context quoted, the slot would be used to load the cache. If priority is given to explicit reads or writes, they take the slot, and then there is a stall to the next slot where the fetch can take place. So, forcing the fetch to take place (when it does not need to take place, as in the context of the original discussion) does indeed waste a slot - it is a slot that could otherwise have been used, meaning that a delay would not have resulted.
    When you load 8 instructions from hub, it takes 8 clocks.
    No. The load from hub is 8 longs wide (256 bits wide) so it takes 1 clock. However, without changing the current slot mechanism, another load cannot take place until another hub window is reached 8 clocks later (1+7).
    You start executing those instructions and the hub window is wasted unless you immediately issue a pre-load. So, load 8, exec 8, load 8, exec 8 is how it is now. With preload it's load 8, exec 8/load 8, exec 8/load 8. The load and exec are done simultaneously to keep the cache full.
    This is different to what you are arguing above. I agree with this. However, it was my belief (and we will need to wait for input from Chip) that Chip said previously that if the hub window was aligned correctly, then the cache fetch could occur in the pipeline stage 1 and be used directly in stage 2, thereby eliminating a stall of a hub window (8 clocks). Presuming I am correct, then preloading/prefetching would not be required, because if required, it would occur precisely in the same time slot as that without prefetching.
    When a branch instruction is hit, you have to wait for the next hub window anyway, so it's load 8, exec 8/load 8, load 8, exec 8/load 8. When an explicit read or write is done, it's load 8, exec 8/load 8, read/write, exec 8/load 8.
    If as I believe, then it would be load 8, exec 1-8 + new load 8, exec 1-8 + load 8.
    If not, then it would be load 8, exec 1-8 + (prefetch cancels) new load 8, exec 1-8 + load 8 because the old preload 8 would be cancelled before it took place due to a branch occurring before the preload executed, and would (or at least the Verilog should) replace the preload with the new load. Remember, unless the branch is the last of the 8 long block, there will be time to terminate the preload, and substitute the correct new load 8. If it's the last (8th) instruction to branch, then it depends on whether Chip can do as I think.
    You see there, the icache already had the code that needs to execute because it was loaded when the RD/WR instruction was hit. So you could have up to 16 cache hits if the first instruction is a memory op. It really does benefit to keep the hub in read mode unless an explicit op is requested.
    I don't understand your point here. Do you mean that somehow we are going to find another slot between the current load 8 to load the next load 8 before the RD/WR instruction is hit??? I am missing something here, as I thought RD/WR would take precedence (over preloading) for the next hub window.
  • koehler Posts: 598
    edited 2014-01-13 04:03
    This is only a little off-topic, however....

    I was wondering if anyone here would consider posting an update on the Prop 1 forum with all of the 'updates' that Chip seems to have completed.
    I've read every post in this thread along the way, however frankly, quite a bit of this is sailing right over my head.

    Since we don't want Chip diverted from more important activities, was hoping someone might be able to explain some of the big or interesting advancements at a slightly lower (higher?) level.
  • cgracey Posts: 14,232
    edited 2014-01-13 08:19
    In case anyone missed it, LRU with four cache lines is already working.

    Today I'm going to try to get the pre-fetch working, so that a single hub execution task can run full speed in a straight line. After that, I want to look into maybe doubling the cache lines so that we could have two per task, allowing each task to run full speed in a straight line. It's too hard to think about all at once, so I'm going to get the single-task done first.
  • cgracey Posts: 14,232
    edited 2014-01-13 08:26
    Bill Henning wrote: »
    Chip,

    I understand you are changing to 8 bit counters, with saturation.

    How about dividing the counts in two by pre-setting all the counters to (counter >> 1) when any counter saturates? I used this technique in my LRU implementation for VMCOG.


    Bill, what's the advantage in doing this? It sounds intriguing.
  • Bill Henning Posts: 6,445
    edited 2014-01-13 08:59
    This stops the counters from ever really saturating, and as each counter is divided by two at the same time, the relative magnitude of the counts is preserved, thus performing better for LRU.

    Without this, once two or more counters saturate, there is no way to tell which is a better candidate for replacement: one could be at 256 counts and another at 1,000,000 counts, but they would look the same to the comparators. By doing the shifting, the counts will always be relatively correct.

    Assuming the counters are like below:

    q0-q7, with q8 being the carry out
    d0-d7 for the preset inputs
    reset signal
    load signal
    clock

    when q8 goes high for any of the counters:

    for all counters, do:
    d7=q8,d6=q7,d5=q6,...,d0=q1
    load

    I used this technique in VMCOG with great success.
    cgracey wrote: »
    Bill, what's the advantage in doing this? It sounds intriguing.
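Editor's sketch (not from the thread): the halving trick Bill describes, where a carry out of any 8-bit counter causes every counter to be reloaded with its value shifted right by one, can be modelled as below. This is an illustrative Python model rather than the VMCOG code or Chip's Verilog; the function name and the one-counter-at-a-time increment are assumptions.

```python
def bump(counters, index):
    """Increment one 8-bit counter; on carry out, halve every counter."""
    counters = list(counters)                   # do not modify the caller's list
    counters[index] += 1
    if counters[index] > 0xFF:                  # q8 (carry out) went high
        counters = [c >> 1 for c in counters]   # d7..d0 = q8..q1 on every counter
    return counters

before = [255, 200, 100, 3]
after = bump(before, 0)   # counter 0 reaches 256, then all four are halved
```

After the halving, the counters still rank in the same order as before, which is the property Bill is pointing to: the relative magnitudes survive even though no counter is ever stuck at its maximum.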
  • ctwardell Posts: 1,716
    edited 2014-01-13 10:00
    Bill Henning wrote: »
    This stops the counters from ever really saturating, and as each counter is divided by two at the same time, the relative magnitude of the counts is preserved, thus performing better for LRU.

    Without this, once two or more counters saturate, there is no way to tell which is a better candidate for replacement: one could be at 256 counts and another at 1,000,000 counts, but they would look the same to the comparators. By doing the shifting, the counts will always be relatively correct.

    Assuming the counters are like below:

    q0-q7, with q8 being the carry out
    d0-d7 for the preset inputs
    reset signal
    load signal
    clock

    when q8 goes high for any of the counters:

    for all counters, do:
    d7=q8,d6=q7,d5=q6,...,d0=q1
    load

    I used this technique in VMCOG with great success.

    Wouldn't this be more like Least Frequently Used than Recently? Chip's current implementation would also be LFU.

    C.W.
  • Bill Henning Posts: 6,445
    edited 2014-01-13 10:16
    ctwardell wrote: »
    Wouldn't this be more like Least Frequently Used than Recently? Chip's current implementation would also be LFU.

    C.W.

    In #4530 Chip wrote:
    cgracey wrote: »
    A cache miss causes a line to be reloaded. When a line is read (cache hit) or loaded and read (cache miss), its counter is reset to 0. The counter increments on every cache miss that wasn't its own, and saturates at $1F.

    After that, he increased the counter size to 8 bits.

    Basically, if I understood Chip correctly, the largest count will be associated with the cache line that has gone unused for the longest run, that is, the one read least recently (when other lines hit, the non-hit lines' counters increment).

    All the trick I suggested does is allow for a more accurate comparison of the miss counters of lines, in cases where there are more than 256 misses, as otherwise there would be no way to tell if a saturated count represented 256 misses, or 1,000,000 misses.

    Since the counters are cleared on a cache hit, or a cache line load, I suspect that Least Frequently Used and Least Recently Used will correlate very closely.
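Editor's sketch (not from the thread): the replacement scheme in Chip's quoted description (a line's counter is reset to 0 when it is read or loaded, and incremented on every miss that was not its own, saturating at the top of its range) can be modelled in a few lines. The class name, the 8-bit counter limit, and the choice to start empty lines at the maximum count are assumptions made for illustration.

```python
NUM_LINES = 4
LONGS_PER_LINE = 8
COUNTER_MAX = 0xFF   # assumption: 8-bit saturating counters, per the later change

class LruCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES
        # Assumption: empty lines start at the maximum so they are filled first.
        self.counters = [COUNTER_MAX] * NUM_LINES
        self.misses = 0

    def access(self, long_addr):
        """Return True on a hit, False on a miss (after reloading a line)."""
        block = long_addr & ~(LONGS_PER_LINE - 1)
        if block in self.tags:
            self.counters[self.tags.index(block)] = 0    # hit: reset own counter
            return True
        self.misses += 1
        victim = self.counters.index(max(self.counters))  # largest count is replaced
        for i in range(NUM_LINES):
            if i != victim:                               # a miss that was not its own
                self.counters[i] = min(self.counters[i] + 1, COUNTER_MAX)
        self.tags[victim] = block
        self.counters[victim] = 0                         # loaded and read: reset
        return False

cache = LruCache()
for _ in range(10):                       # run a 32-long loop ten times
    for addr in range(0x100, 0x120):
        cache.access(addr)
```

In this model the 32-long loop misses only on its first pass, once per line, and then runs entirely from the four cache lines, matching the single-task case discussed earlier in the thread (with prefetch left out of the model).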
  • Cluso99 Posts: 18,069
    edited 2014-01-13 17:10
    cgracey wrote: »
    In case anyone missed it, LRU with four cache lines is already working.

    Today I'm going to try to get the pre-fetch working, so that a single hub execution task can run full speed in a straight line. After that, I want to look into maybe doubling the cache lines so that we could have two per task, allowing each task to run full speed in a straight line. It's too hard to think about all at once, so I'm going to get the single-task done first.
    Fantastic work Chip. Wow, 8 cache lines would be magnificent... for a single task this would permit loops of 7*8=56 instructions presuming that the prefetch clobbers the LRU cache line.

    For up to 4 tasks, will you just allocate 2 cache lines per task, or will you use LRU and dynamically allocate per task??? I am not sure if the latter is more complex, and whether it would be beneficial or not. It's just a thought.

    BTW Am I correct in thinking that when you load a cache line in clock cycle n, that you can also feed these results into the pipeline at stage 2 in parallel?
    So effectively, a single task hub mode could run at full speed sequentially (without prefetch) as follows:
    * clock n-1: pipeline stage 1: fetches instruction from cache line +7 (ie last cache long)
    * clock n+0: pipeline stage 1: fetches next instruction from next cache line +0 but it's not loaded
    ---- so a hub load into a new cache line takes place - presume it is our hub window, so the wide fetch can take place on this clock n+1
    * clock n+1: pipeline stage 2: the hub fetch has taken place and the results will be written to the cache line
    ---- also, the fetched results can be fed directly into stage 2 (ie we did not lose a clock)
    ...
    * clock n+8: pipeline stage 1: fetches next instruction from next cache line +0 but it's not loaded
    ---- so a hub load into a new cache line takes place - presume it is our hub window, so the wide fetch can take place on this clock n+9
    * clock n+9: pipeline stage 2: the hub fetch has taken place and the results will be written to the cache line
    ---- also, the fetched results can be fed directly into stage 2 (ie we did not lose a clock)

    The above would mean that we could run 8 instructions per hub cycle. Presuming no other RD/WRxxxx to hub, then we could run at 100% speed.

    Did I understand correctly?
  • cgracey Posts: 14,232
    edited 2014-01-13 17:32
    Cluso99 wrote: »
    The above would mean that we could run 8 instructions per hub cycle. Presuming no other RD/WRxxxx to hub, then we could run at 100% speed.

    Did I understand correctly?


    That's right.

    I just got it working!

    Whenever the hub cycle comes around and there is no hub instruction currently executing, and we are executing from the hub, and we are not multitasking, that unused hub cycle is used to pre-fetch the next cache line, relative to where we are currently executing. This means that straight-line code that doesn't do hub operations will run at 100% speed. When a branch occurs to a location that is out-of-cache, it takes 4..11 clocks to get the cache reloaded, before execution resumes at the new address. And this works entirely within the cog - it doesn't use other cogs' spare cycles, so code will always run at the same speed.

    There is a minor thing I want to enhance about the way it's working, but it's looking very good and is not complicated, after all. Whew!
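Editor's sketch (not from the thread): Chip's stated condition for using a hub cycle to pre-fetch can be expressed as a single boolean, and the 4..11 clock reload range can be modelled as a fixed part plus a wait for the next hub window. The split into 4 fixed clocks plus 0..7 clocks of waiting is an assumption that happens to fit the quoted range; Chip does not break it down that way in the post. All names are hypothetical.

```python
HUB_PERIOD = 8           # clocks between a cog's hub windows
BASE_RELOAD_CLOCKS = 4   # assumption: fixed part of the quoted 4..11 range

def should_prefetch(hub_window, hub_instruction_active, hub_exec, multitasking):
    """Chip's stated condition for spending a hub cycle on a pre-fetch."""
    return (hub_window and not hub_instruction_active
            and hub_exec and not multitasking)

def branch_miss_clocks(clocks_until_window):
    """Assumed model: fixed reload cost plus the wait for the hub window."""
    return BASE_RELOAD_CLOCKS + (clocks_until_window % HUB_PERIOD)

latencies = [branch_miss_clocks(w) for w in range(HUB_PERIOD)]
```

Note that every input to the prefetch condition is local to the cog, which is consistent with Chip's point that the scheme does not borrow other cogs' hub cycles and so timing stays deterministic.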