It seems to me that if some high-level hub exec code, such as what GCC would emit, is ever going to run as multiple hardware threads in a single COG and needs a link register for speeding up leaf functions, then it will most likely also need the COG's register remapping feature, so that another hardware task in the same COG can share the same hub instructions while referring to its own set of COG registers for temporary register variables and the like.
Wouldn't it therefore be good to keep the LR in one of these remapped registers? The same argument could apply to the stack pointer as well. Perhaps $1F0 and $1F1 should refer to wherever the LR and SP currently reside in the block of remapped registers, and be the same for each task...? Would this not make sense?
I think it's clear that what gcc needs is a register written with the return address of the last call. This is complicated by the matter of 4 tasks possibly executing from the hub.
One solution: Always write the return address from hub-mode callers to $1F1. If there's only one task executing from the hub, this would be fine. This has low silicon impact.
Another solution: Treat $1F1 as a window to one of four task-related registers that actually exist in flipflops, which always get written with the last return address. This has higher silicon impact.
Another solution: Move INDA/INDB to $1EE/$1EF and use $1F0..$1F3 as range of registers which ALWAYS gets remapped, according to task, such that any access within that range will result in the two address LSBs being substituted with the task number. Return addresses are always stored in these registers. This has low silicon impact, but takes three more register spaces to implement. Might there be some other compelling use case for this feature?
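To illustrate the third option, the address substitution amounts to something like this (a minimal C sketch of the proposed decode, not actual silicon; the function name is invented):

    /* Model of the always-remapped $1F0..$1F3 window: any access in
       that range has its two address LSBs replaced by the task number,
       so each of the four tasks sees its own private LR/SP slot. */
    unsigned remap(unsigned cog_addr, unsigned task)   /* task = 0..3 */
    {
        if (cog_addr >= 0x1F0 && cog_addr <= 0x1F3)
            return 0x1F0 | (task & 3);  /* substitute LSBs with task ID */
        return cog_addr;                /* all other addresses unchanged */
    }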
Thanks for looking at this. I would hate to take up four COG registers for LR, since with the current caching scheme it isn't likely that anyone will use hub execute mode with more than one task at a time anyway. Why can't we just have a single LR register and document that it can only be used by one task at a time? We already have other restrictions like that in tasking mode. For example, we only have one PTRA, PTRB, INDA, and INDB register, so each has to be allocated to a single hardware task. LR would be no different. It would still be possible to write hub execute code without using LR if someone really wanted to, as long as only one hub execute task was running GCC code.
Roger.
It would be very low-impact to just write the return address into $000, since we don't have 9 extra bits to define an arbitrary cog register address. If multi-tasking and remapping were enabled, $000 would spread out to $000..$003 by task, solving the whole problem.
So, how about writing the return address to $000? This writing could be enabled/disabled by a special instruction.
I think that should be fine. I think I suggested using a register that could be remapped a while back. I don't see a problem with it. It will require some rearrangement of the GCC code generator but I think that's minor.
One solution: Always write the return address from hub-mode callers to $1F1.
I want to make sure I understand what you're saying here. First, I assume the same would apply with your more recent proposal to always write the return address to COG address 0. Are you saying that any CALL instruction would also write its return address to this LR register? Or would there still be a special CALL_LR instruction? It might help if you could post a summary of the currently planned call instructions. I think I remember that we will have CALLA/CALLB using PTRA/PTRB as stack pointers in hub memory, and CALLX/CALLY using PTRX/PTRY as stack pointers into AUX memory. We'll also still have JMPRET. Will all of these now be able to address the full hub+COG memory space? Anyway, a summary of what is planned would be helpful.
Thanks!
David
Any call would write the return address to $000 if the LR mode was enabled. It's too expensive, in terms of opcode space, to have _LR call variants.
All branches can access all of hub space. The caveat is the 9-bit immediate-address branches like DJNZ: they'll become relative branches in hub exec mode. JMPSW (was JMPRET) always stores the return address in D, but can only reach all of hub address space via the S register.
I'll put the latest instruction list on this thread in an hour or so.
I think this makes a ton of sense. Kind of a bummer to add another modal behavior, but it's worth it for GCC to see optimal performance. And we've got modals in there now, so this one doesn't add much, really; it will mostly be a "for GCC" use case, ignored by SPIN+PASM programmers unless they think up some clever use for it.
I'm still not clear on how this works but I guess I need to wait for your instruction list to ask any further questions.
You would execute a special instruction one time to tell the cog to store return address in $000. Thereafter, all CALLs would write their return address to $000.
Instead of where they normally write it? So CALLA, CALLB, CALLX, CALLY, and JMPSW would all write to $000 instead of following their normal behavior?
Sorry. I meant, "in addition to their normal behavior".
Okay, I guess that means that GCC should use the AUX stack instructions since I assume that the hub stack call instructions are hub operations and will wait for the hub window. In that case, it might be best if the LR bit just changed the behavior of the AUX stack instructions if that is possible. It also means that we have to setup the AUX stack even though we aren't really using it. That's not a big problem but it does mean we tie up PTRX or PTRY for no good reason.
Actually, I just realized that this won't work at all. If we enter LR mode and then GCC just uses the LR register, then the stack in either hub or AUX memory will eventually fill up with return addresses that are never used. They'll never be cleaned up, since no corresponding RET instruction will ever be executed. I think the LR behavior needs to be overlaid on a CALL instruction that doesn't make use of a stack.
The instruction used to enable this feature could convey a 5-bit field via D/# that could enable the individual cases of CALLA, CALLB, CALLX, CALLY, and CALL. It would be initialized to %00000.
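A guess at how that 5-bit field might be laid out in C terms (the bit assignments here are assumptions, not a spec):

    /* hypothetical enable bits, one per call flavor */
    enum lr_enable {
        LR_CALLA = 1 << 0,   /* hub stack via PTRA */
        LR_CALLB = 1 << 1,   /* hub stack via PTRB */
        LR_CALLX = 1 << 2,   /* AUX stack via PTRX */
        LR_CALLY = 1 << 3,   /* AUX stack via PTRY */
        LR_CALL  = 1 << 4    /* 4-level hardware stack */
    };
    /* e.g. enabling LR writes for plain CALL only: D/# = LR_CALL (%10000) */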
So, you only want LR functionality on CALLs that don't really need to return?
That would be a new kind of call that would just store the return address in a register, instead of a stack. I like that. That would be good for different things. It could be a call, or it could be a pointer pass for some elaborate function or procedure.
You could just use the CALL with return address into register enabled, as outlined above. The CALL stack is just a 4-level hardware stack that can be ignored if you don't want to do a RET. Of course, you could just CALL using that stack and do a POP to get the return address, but that would take one more instruction than having the return address already in a register.
Okay, I guess this solves the problem. What does the CALL instruction do? Where does it normally put its return address? Is this the instruction that uses your new return address FIFO? If that's the case, I guess it doesn't matter so much if that stack overflows. Practically speaking, I think only the bit that overlays the CALL instruction will ever be used by GCC.
o Requires a hub address, which jacks into the critical path, because the data is stored in cog RAM.
o Because multiple hardware threads/tasks are possible, there is a locking/consistency problem with using a single-address LR.
o Because of the above problem, you need register remapping on a per-task basis.
Chip proposes a 4x32 stack for each task, where the "LR" is stored. Peeking at the stack would give the same data as a COG location, but the stack resides in ALU logic, not RAM.
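In C terms, each task's little hardware stack behaves roughly like this toy model (the wrap-on-overflow behavior is my assumption, based on the overflow question later in the thread):

    typedef struct { unsigned slot[4]; unsigned sp; } lifo4;

    static void lifo_push(lifo4 *s, unsigned ret_addr)
    {
        s->slot[s->sp] = ret_addr;   /* CALL writes the return address */
        s->sp = (s->sp + 1) & 3;     /* 4 entries; a 5th push wraps */
    }

    static unsigned lifo_pop(lifo4 *s)
    {
        s->sp = (s->sp - 1) & 3;     /* RET consumes the newest entry */
        return s->slot[s->sp];
    }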
The P2 is composed of HUB RAM, COG RAM, AUX RAM, and the 'LOGIC' block. The caches and LIFO stack are made of logic elements (flip-flops) in the 'LOGIC' block, which is basically storage that is local to each execution unit and can be accessed outside of the normal COG or HUB access windows.
The penalty is that each of these elements takes about 21 transistors and is 3.5 times larger than an SRAM cell.
The more specialized stuff Chip has to add, in caches and such, the more the synthesized logic block balloons.
I discussed hubex with 4 threads and came to the conclusion that with 4 threads it's almost impossible not to thrash the Icache. Given the penalty the Icache presents in logic elements, I recommended a 1 line WIDE Icache. Since multiple hubex tasks will cause a high number of cache invalidations (based on GCC's code generator), I see no point in trying to make multi-task hubex use a cache.
With only one code cache line, and one data cache line, as you note, a single hubexec task would be ok (but would still work significantly better with more cache lines)
Multiple active hubexec tasks with only one code and data cache line each would perform terribly.
The only way to get decent performance with more than one hubexec task (thread) is to have at least one code cache line (and ideally at least one data cache line for RDxxxxC) per task.
FYI, we are pretty much on the same page - unless we can get four lines of cache for code and data each, I would not want to use more than one hubexec task per cog, and even if we got the 4D 4I lines, a single hubexec task would get better performance from it than splitting it among multiple hubexec tasks in the same cog.
To put things in perspective, 1 Icache line takes 255 LEs; with 4 lines that's about 1K LEs, and times 8 cogs that's 8192 LEs. The Nano only has 22K LEs, so that's a huge chunk used just for 4 cache lines.
The thread stacks are 4x32, so that's another 512 LEs per COG, times 8 COGs, is 4096 LEs.
I don't think it makes sense to constrain the ASIC to make sure we get a fit in the Nano.
Note the Nano only fits one cog, so it would take 1K LEs out of 22K, not 8K LEs out of 22K.
The DE2-115 would fit 4-5 of these expanded cogs, and for the Nano, the cache (or a timer) could be left out.
Re/ thread stacks - we get better bang for the buck by using those LEs as more cache, since PASM code can use CALLX/RETX and CALLY/RETY for single-cycle call/ret.
If the cache miss rate for GCC-generated hubex code is over 50% - or, put another way, if the code has a cache hit rate of 50% or less - caching is more or less pointless IMHO.
Looking at compiled code, roughly one instruction in six is a branch or call of some type.
This suggests that 6 out of 8 cache locations would be a hit, or roughly 75% hit rate.
Of course there will be oddball cases where all eight are non-branch, or there are more branch instructions than one in six (case statements come to mind), but a 75% hit rate seems likely to me.
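The arithmetic behind that estimate, written out as a trivial C helper (est_hit_rate is just an invented name for the back-of-envelope math):

    /* with straight-line runs of ~6 instructions between taken branches
       and 8-long cache lines, about 6 of every 8 fetches hit the line */
    double est_hit_rate(double insns_per_branch, double line_len)
    {
        double run = insns_per_branch < line_len ? insns_per_branch
                                                 : line_len;
        return run / line_len;       /* est_hit_rate(6, 8) == 0.75 */
    }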
I came up with this plan, which I think will work well with GCC's code generation, but it is only possible for a single task, because there isn't enough space to apply it to 4 tasks per COG:
o There are 2 cache lines, the COG ping-pongs between them, based on branch instructions that would branch outside of the cache line.
o When a cache line being executed causes a stall, that cache line is reloaded and the other cache line is left alone, to preserve the possibility of hitting the cache on return.
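Take a hypothetical routine like the following; the function and symbol names match the discussion below, but the sample count, types, and buffer are placeholders:

    #define _SAMPLE_CNT 64           /* placeholder value */

    extern int readADC(void);        /* linked-in library function */

    static int samples[_SAMPLE_CNT];

    void loadADCSamples(void)
    {
        for (int i = 0; i < _SAMPLE_CNT; i++)
            samples[i] = readADC();  /* grab samples in succession */
    }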
The code above is something you could very well see in a program that grabs a bunch of samples in succession.
The code might hypothetically compile down to about a cache line's worth of instructions per function, with readADC being a linked-in library function, not inlineable code (please don't vilify me if the details are off; the instruction set has changed so much I can't keep it all straight).
Note, I took some liberties with the GCC calling convention, but I think it would be fairly close; I didn't include stack frame setup, which isn't strictly needed for all functions.
In the example above, when a branch instruction is hit but the branch is not taken, the ping-pong bit is not flipped. If a branch is hit AND taken, which causes a stall, the ping-pong bit is flipped, causing the other cache line to be used.
So, cache line 1 would contain the loadADCSamples function (if it is aligned properly, from _SAMPLE_CNT onward).
Cache line 2 would contain the readADC library function (again, WIDE aligned for maximum cache optimization).
In operation, the COG would ping pong back and forth between the 2 cache lines, because they are both filled with the code being executed. Functionality would be the same if the code was inlined, it would just cache 16 instructions vs 8.
To recap, here are the rules for using 2 cache lines:
o When a branch instruction is encountered and taken, toggle cache lines, so the caller is cached and the callee is loaded into the other cache line.
o When execution stalls because a cache line has been exhausted, reload the line that stalled, do not use the other cache line.
o When a branch instruction is encountered, but not taken, do not toggle cache line index.
This algorithm is designed to help non-optimized code and code that is not inlineable -- code that is linked from a pre-compiled library.
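As a C sketch of the selection logic (assuming 8-long lines; reload_line and read_slot are invented stand-ins for the hardware):

    extern void reload_line(int which, unsigned hub_addr); /* fill a line */
    extern unsigned read_slot(int which, unsigned offset); /* fetch a long */

    static unsigned tag[2] = { ~0u, ~0u };  /* hub address held per line */
    static int cur = 0;                     /* line currently executing */

    unsigned fetch(unsigned pc, int branch_taken)
    {
        if (branch_taken)
            cur ^= 1;                    /* taken branch: toggle lines   */
                                         /* (not taken: cur unchanged)   */
        unsigned line = pc & ~7u;
        if (tag[cur] != line) {          /* stall: reload THIS line only */
            reload_line(cur, line);
            tag[cur] = line;
        }
        return read_slot(cur, pc & 7u);
    }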
That is a pretty good plan for two cache lines, assuming one hubexec task per cog.
Four lines would be better, and adding some more data cache would also help.
The leaf functions already have the overhead of spilling and restoring registers, and with their functional code included they normally run for hundreds or thousands of clock cycles.
My point was, as illustrated in (I think) #4151, that if some form of LR does not make it in (that is entirely up to Chip), the performance hit would not be huge (say 1%-3%) for leaf functions.
Please explain how a 1%-3% performance hit on a whole leaf function results in a 2-4x slowdown at the higher level. At most it should result in a 1%-3% slowdown for the program (depending on the total cycle count of the whole leaf function, including prolog and epilog code, vs. the 4-8 cycles), even if it did nothing but
while (1)
    leaffn();
I am not being sarcastic, even if it seems so - I would really be interested in seeing a realistic case where a 1%-3% slowdown of the leaf function would result in a 2x-4x slowdown.
The delayed branch case was quite different: because hand-crafted code takes advantage of multiple delayed branches, losing them really could result in a 2x-4x slowdown (not 1%-3%), since with delay slots branches take 1 cycle and hub windows are not missed as a result.
Again, I was attempting to demonstrate that it is not the end of the world if Chip does not fit LR in.
If the impact on a leaf function is 1%-3%, I don't see how it can have a greater effect on the more complex code than that.
Keeping the prolog/epilog code in mind, unless the leaf function just does something like "a = b", the percentage hit has to be very small.
If the leaf just does something like "a = b", then due to prolog/epilog overhead the hit would still be very small, probably still a single-digit percentage. Besides, GCC would automatically inline something as trivial as "a = b".
I am ALWAYS open to technical counter arguments, and calculations supporting them.
Bill,
I think you are underestimating how much loss this can introduce for leaf functions. It's not uncommon for a function to call many leaf functions, all of which are very tiny (think accessors), where 4-8 cycles of overhead would be as much as or more than the function's entire cost. Additionally, you could have a set of small leaf functions called in a tight inner loop, as sketched below. I see this 4-8 cycles extra per call to the leaves as potentially equating to a 2-4x slowdown. Yes, there are times when the cost will be fairly small, but there are also times when the cost could be quite dramatic.
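For a concrete (hypothetical) picture, consider accessors like these, whose bodies are an instruction or two each, so 4-8 extra clocks per call rivals or exceeds the useful work:

    typedef struct { int x, y; } point;

    int point_get_x(const point *p) { return p->x; }  /* tiny leaf */
    int point_get_y(const point *p) { return p->y; }  /* tiny leaf */

    /* a tight inner loop calling a small leaf on every iteration */
    long sum_xs(const point *pts, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += point_get_x(&pts[i]);
        return sum;
    }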
I was not meaning to make a personal argument; I just didn't understand how you would argue for one case with fewer cycles of overhead than this, but then argue that this case is not as important. I am sorry if I came across as personally attacking you; I'm truly just baffled. I see this leaf-function overhead issue as a really big deal. Yes, we can get by without it being resolved in hardware, but we would also get by without delay slots or cordic functions or SDRAM or the whole hubexec thing entirely; that's not the point.
I know that you have not been against the LR idea, but you have been arguing against its importance and citing very small impact numbers in support of your arguments. I think those numbers are not based on real-world C/C++ code of sufficient size/complexity to make any of this actually matter. In such a program, leaf functions are much more plentiful, and likely not large cycle-eaters like strcpy (which probably isn't even a leaf, since it likely makes calls itself - it does in the implementations I've seen).
I've been doing a lot of profiling of C/C++ code lately at work, and in our code (which is quite large and complex) something like 80% of the function calls made during execution are to leaf functions, and 95%+ of those leaf functions are small simple functions. And that's even with aggressive inlining of things like accessors.
Anyway, I really hope Chip (with guidance from you and others) can figure out a way to resolve this issue cleanly.
1) Only a single COG task able to run as HUB EXEC is perfectly fine. Direct execution of code in HUB is already more than we ever dared expect, isn't it?
2) I understand the HUB EXEC COG space requirements are now minimal and one or more COG threads could be run along with it. Sounds perfect.
3) Any assistance C compilers can get should be top priority, especially for HUB EXEC.
4) Streamlining the Prop instructions for Forth or other languages is just silly. No, I'm not knocking Forth here, but Parallax is committed to Spin and C, which is what the world wants. That, and an easy life for PASM programmers.
Personally, I'd prefer a single register ($1F1, $000, $xxx - your pick of course); however, I can see where a four-register-per-task block could be useful for XMM-style GCC code running four tasks in one cog, but I am not sure that is worth the complication.
I can see a slight advantage to not using $0, as it would interfere with other tasks' memory mapping of the low addresses; but regardless of what fixed location you settle on, I prefer a fixed location, as then there is no need for a "SETLR"-style instruction.
My personal favorite of your solutions is "One solution: Always write the return address from hub-mode callers to $1F1. If there's only one task executing from the hub, this would be fine. This has low silicon impact."
I agree that just using $1F1 as LR would be fine and that we could restrict GCC to a single task. In fact, if we really want to run 4 GCC tasks then we probably need to remap 32 registers and 32*4=128 which is half of COG memory. It would work but combined with the poor cache performance I'm not sure it would ever be used. Using $000 as LR would also be fine. If we only allow one GCC task per COG then we wouldn't even have to use register remapping as long as the other PASM tasks avoided using location $000.
GCC can use the LIFO (the 4-deep variant), ignoring that stack.
I wonder if we need an LR mode at all? Why not have the 4-deep LIFO stack mode always overwrite $0 (or $1F1, or whatever $xxx you choose to hard-write to)?
I think it would be fine if it overwrites $1F1 but it might not be a good idea for it to always overwrite $0 since, as you've mentioned, tasks may want to take advantage of register remapping and they would then have to waste 3 COG locations for copies of LR that they might not really need. However, as Chip mentioned, just using $1F1 means you either can't have more than one GCC task or you need four copies of $1F1. I'm happy with only a single GCC task but Chip seems to want to leave open the option of having more.
I assume that the CALL instruction will just overwrite existing stack entries on overflow?
This CALL instruction would be an ideal conduit for LR. Are you sure it would be too costly just to have a POP instruction get the return address into a register?
Would it just add an instruction, or would it make things too complicated? And in 10k instructions of compiled C code, how many POPs do you think would be necessary if all you had was CALL/POP to work with? We've probably been over this, but I don't remember this issue, exactly.
Every non-leaf GCC function would require this POP instruction, and perhaps even leaf functions would, if the GCC code generator depends on having the return address in LR; Eric would know that for sure. What if you just wrote the return address to $1F1, in addition to pushing it on the hardware stack, on CALL instructions? I think that could be done all the time, without a special LR mode.
Leon, I hate to burst your bubble and complain about XMOS, but it wasn't as deterministic as you think. If you wrote something on one core and then added a thread, it would halve the speed of the chip, since it didn't have as many cores as the Prop :P, throwing your timing out. So it wasn't as uber as you always made it out to be. It may have been faster than the P1, but I'd always choose a P1 over the XMOS! Let alone the almighty P2.
It's still deterministic, anyway, with any number of threads in use:
"Thread scheduling is a simple round robin process with each active thread being executed in the next system clock cycle. This gives the appearance of up to eight concurrent threads per XCore. All threads are independent and have equal priority meaning that each task always receives a guaranteed minimum number of MIPS; this is central to building deterministic and responsive systems."
Isn't a CALL instruction that doesn't make use of a stack usually called a JUMP?
If the called routine decides to go elsewhere, isn't that why uCs have a POP?