Today, I got the pipeline and peripherals to handle stalls of any duration. Stalls are going to be necessary to hold off the pipeline while instruction caches reload for out-of-range branches.
Next, I must implement the instruction caches. This is going to be a bunch of big steps that must all be done in parallel before anything is testable. All PCs are going to become 16 bits. I'm planning on having four 8-long caches, since I could see that two would not be enough and three would be odd. With the right cache-picking algorithm, maybe all 4 tasks could execute from the hub without too much thrashing. Any ideas on how the cache-picking algorithm should work? I'm thinking that maybe the cache that was read longest ago should be the one to get reloaded, no matter the number of tasks executing from the hub. Is there a better way?
I guess LRU is the most common cache replacement strategy. With only two cache lines I'm not sure you can do much better.
Also, how are you going to handle BIG if you allow more than one task to run from hub memory? Won't you need a register for each task to remember the BIG value?
There will be four 8-long cache lines. What does LRU stand for? (least read...)
Actually, I guess there is another option. I'm not sure if this would be best with only 4 cache lines but you could just implement a direct-mapped cache where the cache line to replace is determined by bits 4:3 of the PC.
That seems like a pretty straightforward way to handle it. That would be fine for a single task, but what if there are more? Do you suppose LRU is the best approach in the case of multiple tasks?
Probably not, but what else are you going to do other than assign a cache line to each task? What you probably want is an n-way set-associative cache, but that would require many more cache lines for four tasks. By the way, the direct-mapped scheme means you don't have to have an associative memory to figure out if one of the cache lines already has the data you want. This is because any particular hub address will always be in the same cache line. The same bits that index into the cache lines to check for a hit are the bits that decide which cache line to replace.
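As a concrete illustration of the direct-mapped scheme described above, here is a minimal C model (not the actual Verilog; all names are illustrative). It assumes a 16-bit PC that addresses longs, 8-long lines, bits 2:0 as the offset within a line, bits 4:3 as the line index, and bits 15:5 as the tag; the same index is used for the hit check and for the refill on a miss, so no associative compare across lines is needed:

    #include <stdint.h>
    #include <stdbool.h>

    #define HUB_LONGS 65536                 /* 16-bit PC addresses longs */
    static uint32_t hub[HUB_LONGS];         /* stand-in for hub RAM */

    typedef struct {
        uint16_t tag;                       /* PC bits 15:5 of cached range */
        bool     valid;
        uint32_t longs[8];                  /* one 8-long cache line */
    } line_t;

    static line_t icache[4];

    static uint32_t fetch(uint16_t pc)
    {
        uint16_t index = (pc >> 3) & 0x3;   /* bits 4:3 pick the line */
        uint16_t tag   = pc >> 5;           /* bits 15:5 identify the range */
        line_t *ln = &icache[index];

        if (!ln->valid || ln->tag != tag) { /* miss: refill the whole line */
            uint16_t base = pc & (uint16_t)~0x7;
            for (int i = 0; i < 8; i++)
                ln->longs[i] = hub[base + i];
            ln->tag   = tag;
            ln->valid = true;
        }
        return ln->longs[pc & 0x7];         /* bits 2:0 pick the long */
    }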
I know you like all COGs to be the same, but you might consider the possibility of sizing the cache differently for each COG. You could, for instance, give COG 0 a bigger cache with the idea that it will be used to run the main control logic of the program and that the other COGs either won't have any cache at all or will have smaller caches. Could the cache dimensions be made a parameter to the COG block?
Edit: Maybe this is a P3 feature.
That's neat. That would really be ideal for a single hub task. Do you think people will want more than one hub task?
I was thinking that the right algorithm would spread the cache lines among tasks without awareness of how many tasks are running. As I run the simulation in my head, maybe overlooking some important aspect, the LRU seems good. Also, there should be preemptive loading of the next cache range, particularly if only one task is executing from the hub.
Maybe only one task should be allowed to execute from the hub. What do people want? I might be a little tired to think straight right now.
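For reference, here is a minimal C model of the "read longest ago" policy described above (illustrative only; it uses an idealized free-running counter rather than the compact aging hardware a real design would use, and counter wraparound is ignored). Note that it takes no task id, so the lines spread themselves across tasks purely by access pattern, matching the idea of the algorithm not knowing how many tasks are running:

    #include <stdint.h>
    #include <stdbool.h>

    #define NLINES 4

    static uint16_t line_tag[NLINES];    /* which hub range each line holds */
    static bool     line_valid[NLINES];
    static uint32_t line_stamp[NLINES];  /* time of last read */
    static uint32_t now;                 /* bumped on every access */

    static int pick_line(uint16_t tag)
    {
        int i, victim = 0;

        for (i = 0; i < NLINES; i++)
            if (line_valid[i] && line_tag[i] == tag) {
                line_stamp[i] = ++now;   /* hit: refresh the stamp */
                return i;
            }

        for (i = 1; i < NLINES; i++)     /* miss: oldest stamp loses */
            if (line_stamp[i] < line_stamp[victim])
                victim = i;

        line_tag[victim]   = tag;        /* caller refills this line */
        line_valid[victim] = true;
        line_stamp[victim] = ++now;
        return victim;
    }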
I read that a few times and was reading 'task' where you typed 'cog'. Do you really mean cogs, and not tasks?
Various people here have suggested that it would be okay to limit hub execution to a single task. In fact, some have suggested that a COG running hub code could be restricted to only a single task. I think it would be nice to allow other COG tasks even if you limit a COG to a single hub task. This would allow something like a serial driver to coexist with hub execution in a single COG. That could be useful for providing debug support for the hub task.
Blasphemer! Burn him!
Call the thought police, we have an emergency here.
Yes, I know it goes against the Propeller philosophy, but the reality is that there is only so much area on the chip. Would you rather have 8 tiny caches or one or two big ones? I would think that the big ones would give much better performance, and it seems likely that most of the COGs will run drivers anyway, so it might not be necessary to have big caches on all COGs. Anyway, this is almost certainly a suggestion for P3, if it ever gets done at all.
Heater, is it important that more than one task run in the hub?
Edit: I think it's actually easier if all tasks are the same, so they will all be able to run in the hub. Let's hope the cache works well.
That more than one thread in the whole Propeller can execute from HUB directly? - I would hope at least one per COG can do so.
That more than one thread per COG can execute from HUB directly? I might think it's nice that they can but not essential.
That more than one thread per COG executing from HUB gets a cache? Again, nice but not essential.
Perhaps I'm out of touch with the current politics, previously I might have imagined caches were off the table - "Eew Yuk, caches destroy determinism".
Not that I expect threads executing from HUB to be cycle by cycle deterministic.
I mean any task from any cog can run in the hub. Sorry, I was using 'thread' when I meant to say 'task' - as in hardware multi-tasking.
Edit: Wait a minute... you were saying 'thread', not me. I'm going to have to sleep soon.
OK let's try again. Let's use the word "task" as in "hardware multi-tasking" within a COG.
Are we asking:
That more than one task in the whole Propeller can execute from HUB directly? - I would hope at least one per COG can do so.
That more than one task per COG can execute from HUB directly? I might think it's nice that they can but not essential.
That more than one task per COG executing from HUB gets a cache? Again, nice but not essential.
Perhaps I'm out of touch with the current politics, previously I might have imagined caches were off the table - "Eew Yuk, caches destroy determinism".
Not that I expect tasks executing from HUB to be cycle by cycle deterministic.
I would rather have the COGS be identical. One of the strong attributes of the Propeller is its performance when problems are parallelized. Being multi-processor isn't so "multi" when one of the processors is significantly better than the others.
Doing this means "single core" with 7 other sub-cores, etc... Which is why I'm opposed to the slot tinkering. I don't see those HUB cycles as wasted. They are optimal for code reuse and consistency, both things have costs too.
So one large cache turns the thing into something more like the other chips with one CPU, that you just know people will want to interrupt, because there is only one big one, etc... Let's not do that. A COG should be a COG. Period.
Frankly, if the hub execute is anywhere near the PASM native speed, having 8 of those possible is insane! I'll take it easily, small cache or not.
I'm generally agreed with Heater.
A COG running in COG mode has tasks. If we can get one of those tasks running HUBEXEC mode? Yeah, that would be cool, but not essential. It's enough to have the COG either be running HUBEXEC, or COG mode.
From there, all COGS identical, which answers most of the other questions.
Re: Caches. I'm OK with those in HUBEXEC mode, because it's about a larger program. The COG code is deterministic. HUB code is like LMM code, and that's OK.
Ideally, a COG can enter HUBEXEC, do some stuff, then return to being a COG. Do the real time deterministic stuff in a COG where it has always been done, IMHO. This best fits the strengths of the Propeller.
There are better ways, but they are expensive in terms of transistors.
The replacement algorithm you described is known as the LRU (least recently used) algorithm, used for many decades... take a peek at my VMCOG, I implemented an LRU cache there.
The instruction cache should be separate from the RDxxxxC cache; this is very important.
hubexec should be able to use any unused hub cycle (huge win)
DO NOT TRY THIS AT HOME UNTIL AFTER P2
for best performance:
- each task should have four eight-long instruction cache lines
- each task should have a four-line RDxxxxC cache
(more cache lines help, so a future process shrink can make the caches bigger)
Having said that, you could use the task id bits (2 bits) as the high bits, and two more bits for cache line.
Suggestion:
Direct mapped for first verilog test, to get things working
while everyone plays, you see if you can easily add LRU
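A sketch of that indexing in C (my reading of the suggestion, not confirmed): two task-id bits concatenated with PC bits 4:3 select one of sixteen lines, so each task gets a private direct-mapped set of four lines and still no associative search is needed. The sixteen-line total is an assumption that follows from four lines per task:

    #include <stdint.h>

    /* Hypothetical: 16 lines total, i.e. four private lines per task. */
    static unsigned line_index(unsigned task, uint16_t pc)
    {
        /* {task[1:0], pc[4:3]} -> line 0..15 */
        return ((task & 0x3u) << 2) | (((unsigned)pc >> 3) & 0x3u);
    }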
All cogs are identical. All tasks within cogs are identical.
1) cog mode: 4 tasks, hubexec mode: 1 ... low risk, other 7 cogs can run 4 tasks each <--- do this first
2) for good performance, each task needs a minimum of four lines of code cache, and ideally four lines for RDxxxxC cache (data cache) <--- when there is spare verilog time
With only one line each, and more than one hubexec task in the same cog, most of the time would be spent reloading the single cache line
I agree that running more than one hub task would be very nice, but in order to not kill performance, it requires multiple lines of data cache and code cache per task --> probably too big a change for P2, and significantly more transistors.
B. The answer is B.
Just making sure Chip. Is the answer B? Sorry, that was just a tired funny kind of thing. Get your sleep. Thanks for working hard for us.
BTW: People were asking about quoting and multi-quoting in this forum editor. I have found a single quote is easiest when I hit the quote button, then paste text. It's all setup. Easy. For a multi-quote, hit the quote button, paste all of it, then highlight sub-quotes and hit it again. The editor will do the right thing. FYI.
I would rather have the COGS be identical. One of the strong attributes of the Propeller is its performance when problems are parallelized. Being multi-processor isn't so "multi" when one of the processors is significantly better than the others.
I agree.
Doing this means "single core" with 7 other sub-cores, etc...
Which is why I'm opposed to the slot tinkering. I don't see those HUB cycles as wasted. They are optimal for code reuse and consistency, both things have costs too.
STRONGLY disagree. See my post (3566?) about how much unused hub slots would help hubexec.
So one large cache turns the thing into something more like the other chips with one CPU, that you just know people will want to interrupt, because there is only one big one, etc... Let's not do that. A COG should be a COG. Period.
Small LRU code/data cache per task would work best if tasks must hubexec, otherwise small LRU code/data cache per cog is fine.
Frankly, if the hub execute is anywhere near the PASM native speed, having 8 of those possible is insane! I'll take it easily, small cache or not.
Should be close, with four line code / four line data cache and using unused hub slots I'd estimate north of 90% cog only speed, even without FCACHE/FLIB, close to 99.9% with.
For hand tuned code. More like 70%-80% with gcc.
A COG running in COG mode has tasks. If we can get one of those tasks running HUBEXEC mode? Yeah, that would be cool, but not essential. It's enough to have the COG either be running HUBEXEC, or COG mode.
I'd suggest dual mode: 1 hubexec cog, or 4 cog tasks on a cog basis. Lowest risk, fewest added transistors.
From there, all COGS identical, which answers most of the other questions.
Re: Caches. I'm OK with those in HUBEXEC mode, because it's about a larger program. The COG code is deterministic. HUB code is like LMM code, and that's OK.
Ideally, a COG can enter HUBEXEC, do some stuff, then return to being a COG. Do the real time deterministic stuff in a COG where it has always been done, IMHO. This best fits the strengths of the Propeller.
That works.
If each task had:
- LRU 4 line (each line is 8x32) instruction cache
- LRU 4 line (each line is 8x32) data cache (for RDxxxC / WRxxxC)
it would work quite well. (total of 32 * 8x32 cache lines)
2 line caches would thrash too much.
If a cog had to share the cache lines (8 * 8x32) among 2-4 tasks performance would be very poor.
I'm going to put a 4-line (x8 long) instruction cache into each cog. Running one hub task would work well. Running more than one task would thrash the cache. There will be a 1-line (x8 long) data cache in each cog for RDxxxxC. Along with Z/C/PC for each task, there will be a bit signifying whether hub mode is active. In hub mode, the conditional branches (DJNZ, JP, TJZ) will probably become bit8-extended relative branches.
I'm going to sleep. When I come back, I'll increase the program counters to 16 bits and make sure things still run. Then, I'll add the instruction cache.
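One reading of "bit8-extended", as a C sketch: the 9-bit immediate's bit 8 is treated as a sign bit and the resulting offset is added to the PC. This is an assumption about the encoding, not something confirmed above:

    #include <stdint.h>

    /* Assumed encoding: imm9 holds a 9-bit field; bit 8 is the sign. */
    static uint16_t branch_target(uint16_t pc, uint16_t imm9)
    {
        int off = (imm9 & 0x100) ? (int)imm9 - 0x200   /* negative: -256..-1 */
                                 : (int)imm9;          /* positive: 0..255   */
        return (uint16_t)(pc + off);
    }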