What if we made an 8-level LIFO stack that had special instructions simply called CALL/CALLD and RET/RETD, and those would replace the use of JMPRET/JMPRETD. JMPRET/JMPRETD could then just be used for register-based thread loops that track Z/C/PC. Would we miss anything? We wouldn't need to mess with subroutine_RET labels anymore.
I'm sorry, I've been crazy busy this weekend and haven't had a chance to review everything, so forgive me if this idea has already been shot down for some reason.
Why not just keep JMPRET the way it is now, but write the whole 18 bits of address into the destination register in HUB mode?
This would allow GCC to use its current calling convention.
If a mode specific version is too ugly, could we just always write the full 18 bits and use the 18 bit long form of JUMP as the RET instruction (for COG mode subroutines, of course -- HUB will have to either use the HUB stack form or manage the save of the return register itself)?
It sounds like this is going to be a lot of trouble for you. I guess I thought it was mostly already implemented because JMPRET already knows how to write its return address to a COG register and it would simply be a matter of writing the full 16 bit PC into all 32 bits of the register rather than the 9 bit PC in the S field. If it's going to take major reworking of instruction decoding and the pipeline then it probably isn't worth the risk. Thanks for offering though.
Chip is still thinking about the issue. Personally I'd prefer if $1F1 was the private LR for each task, each task seeing its own LR at that location. This would leave the low cog memory clear.
As I mentioned in an earlier message, using CALLA/RETA for calling all functions imposes a two-hub-access hit on leaf functions. Unfortunately, leaf functions are where you really want good performance since they tend to be called in inner loops. I'm kind of surprised that Bill seems to think that the two hub accesses are no big deal when he is at the same time asking for PEEK for the new return stack. The extra PUSH that would be required in the absence of PEEK would only add a single instruction time, where the two hub accesses for leaf functions could add up to 16 instruction times to a leaf function.
I think you misunderstood:
- I don't like two hub hits in a leaf function, and I think that can be avoided, but until it is:
- I hate using two instructions for a hubexec CALL more (separate PUSH LR, CALL address) - especially now that Chip is providing a single instruction CALLA that does both in just one long, and in one less cycle.
There are a number of ways of addressing that besides an LR register - which Chip may yet provide.
1) Have calls to leaf functions be performed by CALLX, and return using RETX. As they are leaf functions, it would only consume one long in AUX.
2) Have calls to leaf functions be performed by CALL (with small lifo as Chip proposed) and return with RET (using the lifo). Does not even consume a long in AUX.
I suspect you will object on the basis of GCC not knowing what a leaf function is, when emitting the code to call it.
The linker could fix that: by link time it is possible to know that a leaf function is being called, and the opcode calling it could be patched by the linker from the default 'CALLA' to 'CALLX', and the 'RETA' to 'RETX' (or the LIFO versions).
Note this avoids the ugly waste of an extra long on every call - which is a big deal, as hub memory is precious.
3) For the initial port, take the hub cycle hit. Presumably leaf functions perform a useful amount of work, so the extra hub cycle will not be a large fraction of the time spent in the routine; using a single instruction for the call will free more hub memory, and free some cycles as well.
As time permits, add (1) or (2)
What I am extremely opposed to is using two instructions to do a call in all cases. That is a huge waste of hub memory, which is a precious resource.
- I hate using two instructions for a hubexec CALL more (separate PUSH LR, CALL address) - especially now that Chip is providing a single instruction CALLA that does both in just one long, and in one less cycle.
With a CALL_LR instruction the push of LR would happen in the prologue of non-leaf functions. That is where the PUSHA LR would happen. So here is what you'd end up with:
For leaf functions:
CALL_LR #my_function
' more code
...
my_function
' do interesting stuff
JMP LR
For non-leaf functions:
CALL_LR #my_function
' more code
...
my_function
PUSHA LR
' do interesting stuff
RETA
Notice that the calling sequence is the same in either case.
I hope I'm getting the opcodes right. I'm assuming that PUSHA pushes a value onto the hub stack using PTRA and that RETA pops the return address from the PTRA hub stack.
- I hate using two instructions for a hubexec CALL more (separate PUSH LR, CALL address) - especially now that Chip is providing a single instruction CALLA that does both in just one long, and in one less cycle.
Of course the PUSH LR would happen in the function prologue, not at every call!
With a CALL_LR instruction the push of LR would happen in the prologue of non-leaf functions. That is where the PUSHA LR would happen. So here is what you'd end up with:
Thank you for the example code - it will make illustrating what I mean easier. Also, I see the memory waste is not nearly as bad as long as the PUSHA LR happens in the functions.
CALL_LR #my_function
' more code
...
my_function
PUSHA LR
' do interesting stuff
RETA
Notice that the calling sequence is the same in either case.
I hope I'm getting the opcodes right. I'm assuming that PUSHA pushes a value onto the hub stack using PTRA and that RETA pops the return address from the PTRA hub stack.
David, I don't think any of us should worry about exact opcode naming until we see Chip's final list
Hub stack version:
CALLA #my_function
' more code ...
my_function
' removes need for 'PUSHA LR', saves one hub long for every function call
' do interesting stuff
RETA
For all of the above sample snippets, I am assuming single-long call instructions with a 16 bit embedded address (two trailing zeros implied by long boundary) as two-long BIG/CALL sequences have an unacceptably high hub memory overhead.
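As a concrete illustration of that assumption, here is a small C sketch (hypothetical helper names) of how a 16 bit field covers the full 18 bit byte-address space once the two alignment zeros are implied:

#include <stdint.h>
#include <assert.h>

/* Hypothetical helpers: a long-aligned 18 bit hub byte address fits in 16 bits. */
uint16_t encode_call_target(uint32_t byte_addr) {
    assert((byte_addr & 3) == 0);       /* call targets must be long-aligned */
    return (uint16_t)(byte_addr >> 2);  /* drop the two implied zero bits    */
}

uint32_t decode_call_target(uint16_t imm16) {
    return (uint32_t)imm16 << 2;        /* restores the 18 bit byte address  */
}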
Thanks, that does not waste nearly as much memory as a two-long call sequence. I posted early, and I think I was still thinking about David's earlier BIG/CALL sequence.
It still uses precious hub memory when there is no real need, see the examples I posted in my reply to David.
Thank you for the example code - it will make illustrating what I mean easier. Also, I see the memory waste is not nearly as bad as long as the PUSHA LR happens in the functions.
Using Chip's LIFO (leaf function), no hub access:
CALL #my_function
' more code
my_function
' do interesting stuff
RET
Using CALLX/RETX (leaf function), no hub access:
CALLX #my_function
' more code
my_function
' do interesting stuff
RETX
David, I don't think any of us should worry about exact opcode naming until we see Chip's final list
Hub stack version:
CALLA #my_function
' more code ...
my_function
' removes need for 'PUSHA LR', saves one hub long for every function call
' do interesting stuff
RETA
For all of the above sample snippets, I am assuming single-long call instructions with a 16 bit embedded address (two trailing zeros implied by long boundary) as two-long BIG/CALL sequences have an unacceptably high hub memory overhead.
You fail to take into account the fact that GCC will want to generate the same call sequence for both leaf and non-leaf functions because, when calling through a pointer or calling a function in a separately compiled module, it won't know whether a function is leaf or not. Please show me an efficient calling sequence that can call either leaf or non-leaf functions without knowing which one it is calling. This eliminates the possibility of using a different CALL instruction for each case.
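For illustration, a minimal C sketch of the ambiguity (all names hypothetical):

/* The compiler must emit ONE call sequence in apply(), because at the call
   site it cannot know which kind of function fp holds. */
int helper(int x) { return x - 1; }             /* imagine: another module  */
int leaf_fn(int x)    { return x + 1; }         /* leaf: calls nothing      */
int nonleaf_fn(int x) { return helper(x) * 2; } /* non-leaf: calls helper   */

int apply(int (*fp)(int), int x) {
    return fp(x);  /* same indirect-call sequence for either target */
}

int demo(int x) {
    return apply(leaf_fn, x) + apply(nonleaf_fn, x);
}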
You fail to take into account the fact that GCC will want to generate the same call sequence for both leaf and non-leaf functions because, when calling through a pointer or calling a function in a separately compiled module, it won't know whether a function is leaf or not.
- even leaf functions perform many clock cycles worth of work
- worst case CALLA/RETA adds 14 cycles versus using LR
- using CALLA/RETA addresses function pointers, calls from other modules
- average case adds 7 cycles versus using LR
- if unused hub slots are used by hubexec, normal case would only add 2-3 cycles vs. LR
- using LR adds 1 cycle to every non-leaf function
- using LR wastes one long in every function
Without LR, adding a "_leaf" attribute to functions and their prototypes would allow using non-hub stack versions of CALL/RET even for function pointers
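As a sketch of what that might look like at the source level (the "_leaf" marker is hypothetical; no such GCC attribute exists with these semantics):

/* Hypothetical "_leaf" marker, shown as an empty macro. A compiler honoring
   it could emit CALLX/RETX (or the LIFO CALL/RET) instead of CALLA/RETA,
   and a pointer type carrying it could be restricted to leaf targets,
   which is what would make calls through pointers workable. */
#define _leaf /* hypothetical attribute */

_leaf int abs_clamp(int x);   /* the prototype carries the marker */

_leaf int abs_clamp(int x) {
    return x < 0 ? -x : x;    /* calls nothing, so a LIFO return is safe */
}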
Please show me an efficient calling sequence that can call either leaf or non-leaf functions without knowing which one it is calling. This eliminates the possibility of using a different CALL instruction for each case.
- CALLA/RETA is far more efficient than any call in any LMM/CMM mode
- performance penalty would be insignificant as even leaf functions normally take hundreds of cycles
- performance penalty could be eliminated with a "_leaf" attribute for functions and prototypes
- It is surreal to hear you argue FOR performance
Now please show me how CALLA/RETA will add a significant delay.
- even leaf functions perform many clock cycles worth of work
- worst case CALLA/RETA adds 14 cycles versus using LR
- using CALLA/RETA addresses function pointers, calls from other modules
- average case adds 7 cycles versus using LR
- if unused hub slots are used by hubexec, normal case would only add 2-3 cycles vs. LR
- using LR adds 1 cycle to every non-leaf function
- using LR wastes one long in every function
Without LR, adding a "_leaf" attribute to functions and their prototypes would allow using non-hub stack versions of CALL/RET even for function pointers
This is not reasonable. It is totally non-standard and it also doesn't work for function pointers unless you want to say that a function pointer can only point to either a leaf function or a non-leaf function but not both. The caller of a function should not be required to know whether it is a leaf function or a non-leaf function.
- CALLA/RETA is far more efficient than any call in any LMM/CMM mode
- performance penalty would be insignificant as even leaf functions normally take hundreds of cycles
- performance penalty could be eliminated with a "_leaf" attribute for functions and prototypes
- It is surreal to hear you argue FOR performance
Now please show me how CALLA/RETA will add a significant delay.
If you think 14 cycles is insignificant then why did you tell Chip that you needed a PEEK instruction to avoid a single cycle PUSH instruction when inspecting the top of the return stack? You're being inconsistent: if 14 cycles is no big deal then certainly one cycle is nothing. You're pushing for instructions that have no real value and arguing against instructions that can save many cycles. I think the LIFO return stack is pretty much useless for anything other than assembly or a specially constructed language that isn't standard in any way.
This is not reasonable. It is totally non-standard and it also doesn't work for function pointers unless you want to say that a function pointer can only point to either a leaf function or a non-leaf function but not both. The caller of a function should not be required to know whether it is a leaf function or a non-leaf function.
I see. It is exactly as non-standard as __inline.
The coloring is not needed if CALLA/RETA is used.
I was showing that a _leaf attribute could be used to address the "impossibility" of having GCC use a different return/call mechanism for leaf functions; it was by no means a requirement.
If you think 14 cycles is insignificant then why did you tell Chip that you needed a PEEK instruction to avoid a single cycle PUSH instruction when inspecting the top of the return stack? You're being inconsistent: if 14 cycles is no big deal then certainly one cycle is nothing. You're pushing for instructions that have no real value and arguing against instructions that can save many cycles. I think the LIFO return stack is pretty much useless for anything other than assembly or a specially constructed language that isn't standard in any way.
Worst case 14 cycles.
Let's say strcpy() is a leaf function, and 80 bytes are copied. That's 640 hub cycles without caching, or 400 cycles with caching. Plus REPS overhead etc.
14/400 = 3.5% slowdown
7/400 = 1.75% slowdown
Of course if the leaf function uses 4000 cycles, then it's 0.35% worst case, 0.175% average case.
Adding LR would add a cycle to every non-leaf function, and four extra bytes to every function.
I am not inconsistent, I was just pointing out that if implementing an LR-less leaf function is difficult for GCC, it won't hurt performance much.
PEEK is a different case, for tight assembly code for FORTH - and I don't care that much about it, and to be accurate, Chip proposed it and asked if anyone had a use for it.
David, you have to realize there is more to the world than GCC, and the way propgcc is currently constructed.
David, you have to realize there is more to the world than GCC, and the way propgcc is currently constructed.
Almost every instruction and feature in both the P1 and P2 is intended for purposes other than implementing C. I've only asked for one instruction out of the hundreds that are part of P2. What I take away from this discussion is that C/C++ or any compiled language is of little interest to Parallax. I realize that this isn't completely true. I pushed for hub execution and it looks like we will get that, and that will be a big win for compiled languages. I'm very grateful for that. However, beyond hub execution, there isn't a single feature in the P2 that was put there in support of C/C++ so I don't see how you can say there is more to the world than GCC. There could hardly be less to the P2 world than GCC.
Edit: Actually, there are the new pointer addressing modes as well but I don't believe they were put in specifically for GCC.
Bill,
Leaf functions are VERY common, often very tiny, and called a LOT. We don't have the memory to inline everything, so you can't argue that way either. Having leaf functions cost 14 cycles (worst case) or even 7 cycles (average case) more than they need to cost is HUGE. It's rather absurd to dismiss this issue when you threw a giant fit over delayed slots being possibly removed which is only a few cycles per case.
You can't have different types of calling based on if it's calling a leaf or not, because it can't know in all cases even at link time. Function pointers throw the monkey wrench into all your ideas for "fixing it up" because the same pointer can point to either type and can change during runtime. You can't bend the C/C++ spec or introduce wonky conventions just for the Prop into the source without destroying the whole purpose of having C/C++ on the prop in the first place.
David and others have said all of these things, but I figured I'd say them again in case maybe someone else saying them will have you maybe actually listen to it and really get it. Maybe then you will come up with some clever solution that actually will work, but so far it's not there....
Bill,
Leaf functions are VERY common, often very tiny, and called a LOT. We don't have the memory to inline everything, so you can't argue that way either. Having leaf functions cost 14 cycles (worst case) or even 7 cycles (average case) more than they need to cost is HUGE. It's rather absurd to dismiss this issue when you threw a giant fit over delayed slots being possibly removed which is only a few cycles per case.
You can't have different types of calling based on if it's calling a leaf or not, because it can't know in all cases even at link time. Function pointers throw the monkey wrench into all your ideas for "fixing it up" because the same pointer can point to either type and can change during runtime. You can't bend the C/C++ spec or introduce wonky conventions just for the Prop into the source without destroying the whole purpose of having C/C++ on the prop in the first place.
David and others have said all of these things, but I figured I'd say them again in case maybe someone else saying them will have you maybe actually listen to it and really get it. Maybe then you will come up with some clever solution that actually will work, but so far it's not there....
How to do that has been presented here and it needs to happen. Given P2 is a freaking PASM playground anyway, there is no reason the one thing needed to peak out GCC shouldn't get done.
Note that I suggested $1F1 for LR, but Chip said that was a problem due to 4 tasks needing LRs. I was pointing out that lack of an LR, in case it does not make it in, is not a disaster.
FYI, a calculation I will post tomorrow (I am going to bed in a few min) shows that loss of LR cannot even cost 14 cycles; worst case is actually more like 8 cycles, average case 4 cycles, and if unused hub slots are used, more like 2 cycles.
With GCC spilling registers / re-loading registers, and leaf functions having to do some actual work, not having LR is worst case a 1% performance hit.
Almost every instruction and feature in both the P1 and P2 is intended for purposes other than implementing C. I've only asked for one instruction out of the hundreds that are part of P2. What I take away from this discussion is that C/C++ or any compiled language is of little interest to Parallax. I realize that this isn't completely true. I pushed for hub execution and it looks like we will get that, and that will be a big win for compiled languages. I'm very grateful for that. However, beyond hub execution, there isn't a single feature in the P2 that was put there in support of C/C++ so I don't see how you can say there is more to the world than GCC. There could hardly be less to the P2 world than GCC.
Edit: Actually, there are the new pointer addressing modes as well but I don't believe they were put in specifically for GCC.
Bill,
Leaf functions are VERY common, often very tiny, and called a LOT. We don't have the memory to inline everything, so you can't argue that way either. Having leaf functions cost 14 cycles (worst case) or even 7 cycles (average case) more than they need to cost is HUGE.
Actually it's more like 8 cycles worst case, 4 average, 2 if extra hub slots are used. Will post analysis tomorrow, heading to bed after two more quick posts.
It's rather absurd to dismiss this issue when you threw a giant fit over delayed slots being possibly removed which is only a few cycles per case.
1. I did not "throw a giant fit", I pointed out it will cause a huge performance loss (3x roughly, as in 300%) in an application I worked a lot to tune, compared to 1% worst case. Big difference.
2. Make technical, not personal, arguments.
You can't have different types of calling based on if it's calling a leaf or not, because it can't know in all cases even at link time. Function pointers throw the monkey wrench into all your ideas for "fixing it up" because the same pointer can point to either type and can change during runtime. You can't bend the C/C++ spec or introduce wonky conventions just for the Prop into the source without destroying the whole purpose of having C/C++ on the prop in the first place.
David and others have said all of these things, but I figured I'd say them again in case maybe someone else saying them will have you maybe actually listen to it and really get it. Maybe then you will come up with some clever solution that actually will work, but so far it's not there....
1. Read the thread. I proposed $1F1 for LR LONG ago, but Chip said it has issues due to tasking.
2. I have not been against LR, I have been pointing out it is not a disaster if it does not make it in, as using a hub stack CALLA/RETA is worst case a tiny performance hit for leaf functions.
3. I am totally against BIG/CALL combos and BIG/JUMP combos, because those waste ~16% of hub and many cycles compared to single long JMPx / CALLx
Read the thread. I proposed $1F1 as LR a long time ago. Chip said that had technical issues due to tasks. I was pointing out loss of LR is not the end of the world, not a huge performance hit.
How to do that has been presented here and it needs to happen. Given P2 is a freaking PASM playground anyway, there is no reason the one thing needed to peak out GCC shouldn't get done.
o Requires a hub address, which jacks into the critical path because the data is stored in cog ram.
o Because multiple hardware threads/tasks are possible, there is a locking/consistency problem with using a single address LR
o Because of the above problem you need register remapping on a per-task basis
Chip proposes a 4x32 stack for each task, where the "LR" is stored. Peeking at the stack would give the same data as a COG location, but the stack resides in ALU logic, not RAM.
The P2 is composed of HUB, COG, AUX RAM, and the 'LOGIC' block. The caches and LIFO stack are made of logic elements (flip-flops) in the 'LOGIC' block, which is basically storage that is local to each execution unit and can be accessed outside of the normal COG or HUB access windows.
The penalty is that each of these elements takes about 21 transistors and is 3.5 times larger than an SRAM cell.
The more specialized stuff Chip has to add in caches and such causes the synthesized logic block to balloon.
I discussed hubex with 4 threads and came to the conclusion that with 4 threads it's almost impossible not to thrash the Icache. Given the penalty the Icache presents in logic elements, I recommended a 1 line WIDE Icache. Since multiple hubex tasks will cause a high number of cache invalidations (based on GCC's code generator), I see no point in trying to make multi-task hubex use a cache.
To put things in perspective, 1 Icache cache line takes 255 LEs, with 4 lines that's 1K LEs, times 8 cogs that's 8192 LEs. Now, the Nano only has 22K LEs, so that's a huge chunk used just for 4 cache lines.
The thread stacks are 4x32, so that's another 512 LEs per COG, times 8 COGs, is 4096 LEs.
If the cache miss rate for GCC generated hubex code is above 50% (put another way, if the code has a cache hit rate of 50% or less), caching is more or less pointless IMHO.
I came up with this plan, which I think will work well with GCC's code generation, but it only is possible for a single task, because there isn't enough space to apply it to 4 tasks per COG:
o There are 2 cache lines, the COG ping-pongs between them, based on branch instructions that would branch outside of the cache line.
o When a cache line being executed causes a stall, that cache line is reloaded and the other cache line is left alone, to preserve the possibility of hitting the cache on return.
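Roughly the kind of code meant here, sketched in C (readADC and _SAMPLE_CNT are the only names taken from this discussion; everything else is invented):

#define _SAMPLE_CNT 16                  /* name from the discussion below */

extern int readADC(int channel);        /* linked-in library, not inlineable */

static int samples[_SAMPLE_CNT];

void loadADCSamples(int channel) {
    for (int i = 0; i < _SAMPLE_CNT; i++)
        samples[i] = readADC(channel);  /* taken branch out to the library,
                                           return branch back into the loop */
}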
The code above is something you could very well see in a program that grabs a bunch of samples in succession.
The code might hypothetically compile to something like this; readADC would be a linked-in library function, not inlineable code (please don't vilify me because my syntax is off, the instruction set has changed so much I can't keep it all straight):
Note, I took some liberties with the GCC calling, but I think it would be fairly close. I didn't include stack frame setup, which isn't strictly needed for all functions.
In the example above, when a branch instruction is hit, but a branch is not taken, the ping-pong bit is not flipped. If a branch is hit AND taken, which causes a stall, the ping-pong bit is flipped, causing the other cache to be used.
So, cache line 1 would contain the loadADCSamples function (if it is aligned properly, from _SAMPLE_CNT onward).
Cache line 2 would contain the readADC library function (again, WIDE aligned for maximum cache optimization).
In operation, the COG would ping pong back and forth between the 2 cache lines, because they are both filled with the code being executed. Functionality would be the same if the code was inlined, it would just cache 16 instructions vs 8.
To recap, here's the rules for using 2 cache lines:
o When a branch instruction is encountered and taken, toggle cache lines, so the caller is cached and the callee is loaded into the other cache line.
o When execution stalls because a cache line has been exhausted, reload the line that stalled, do not use the other cache line.
o When a branch instruction is encountered, but not taken, do not toggle cache line index.
This algorithm is designed to help non-optimized code and code that is not inlineable -- code that is linked from a pre-compiled library.
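A toy C model of those three rules, assuming a single ping-pong bit selects the line (reload_line is a hypothetical refill routine, not a real primitive):

extern void reload_line(int which);  /* hypothetical: refill one cache line */

static int line = 0;                 /* index of the line currently executing */

void on_branch(int taken) {
    if (taken)
        line ^= 1;     /* taken: toggle, so the callee fills the other line;
                          not taken: keep executing the current line */
}

void on_stall(void) {
    reload_line(line); /* the exhausted line is refilled; the other line is
                          left alone so a later return can still hit it */
}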
I think I can see both sides of the problem here - well, I understand Bill's side and I think I understand GCC's side.
What is proposed works great for Bill, etc (hubexec and cog modes) but doesn't fit with GCC.
This is my understanding of GCC's requirement...
GCC places the return address in a fixed cog location "LR".
* If the routine is a non-leaf routine, the LR is immediately pushed onto a stack by this routine (commonly in hub).
* If the routine is a leaf routine, then the push is not required, LR may be used to pass further info, so it can be operated on using instructions such as ADD LR,#y. At the end of the routine, a RETURN instruction (actually an indirect jump to LR) is performed.
Since Chip is now making hubexec work for all 4 multitasks, 4 x LR locations are required.
Because of the direct usage of LR, they (4 copies) need to be mapped into a single cog position, and the relevant LR used depending on the task#.
Now thinking about this, there is no point in there being 4 sequential LR slots, one for each task#, because you don't want to have to index into the relevant LR slot (too much code overhead). We are all happy to accept that hub execution cannot be multitasking, but Chip is doing it (for now anyway), and it would be silly to have the LR not working properly for multitasking if this is feasible.
Most likely, one of the JUMP instructions that Chip has allocated would work for the RET LR instruction (needs to be a 16bit hub address + flags).
So, what is required is a CALL style instruction "CALL_LR" where the D & S registers make a 16bit hub goto address, and the return address is placed into the "LR" register (one of 4 windowed LRs, one for each task#).
Do I have this correct???
If so, then I would think the CALL_LR would be quite easy.
The biggest problem here is the mapping of 4 x "LR" registers windowed into a single cog register. $1F1 has been suggested as a location (or the last unused one, as a special register).
I am sure Chip can work this out if this is what is required. Seems to me it would also be useful from pasm too.
You've made your points repeatedly Bill. I put my comment here for a few reasons:
1. I do think it's a big deal if it doesn't get in there, and I don't think that because people here who know GCC cold have made it clear how it sings.
2. If Chip is weighing things, it may be worth it to know having a strong GCC matters to people. I have some plans to use GCC, even though I've been mostly SPIN + PASM. Maybe that changes the weighting some, maybe it's just good enough for a reconsideration.
3. The GCC crowd doesn't always get the best messages from the community overall. I want to lend my support because I think their work is important and necessary and that it is appreciated. In light of my past statements related to #2, that may not always be clear and I want it to be.
Your case on the delayed branches isn't a niche case. You made that clear to us citing a virtual machine kernel that employs them for peak performance and clearly you want your kernel to really perform. Who wouldn't? That's not a negative statement, just what you told us.
Overall, the loss of delayed branching isn't that big of a hit otherwise, as mentioned here.
So then, the GCC guys are saying the exact same thing about the calling mechanism! And again, who wouldn't? Not a negative there either.
None of it should be.
Frankly, I am disturbed by the fact that this can't be sorted. Early on, I thought some code samples might help, and I expressed a sincere desire for this to be resolved. The root of that desire is having a strong GCC this time around because I believe it will really matter on the P2. Heck, people are doing great things now on P1, thanks to the GCC team and Ross, both of whom have produced C environments that get stuff done, despite the challenges present on P1.
Repeating that exercise shouldn't be on the table for P2, and as mentioned, the GCC team doesn't need much. Given all we've gotten, and make no mistake a lot of that is indulgent for which I am thankful, making sure GCC has an optimal path to use HUBEXEC seems to make entirely too much sense.
Those things said, making a specific recommendation isn't something I feel good about. Not my area of expertise. However, I can generally observe the dynamic here, and it just isn't productive at all. A balance of interests is required, or some really creative thought to resolve it technically in a way that promotes everybody. I don't care which.
But I do care a lot that we can't seem to get something sorted out for GCC where it's clearly possible and necessary to do. And there is this too: Where a solid GCC exists, there will be more prospects for whatever any of us plan on building. There has got to be a reasonable solution, or whatever any of us are planning on building sees just that more risk.
Rather than belabor the same points, how about working to realize some new ones that may well make everybody happy? (and I see this happening already on the thread, and it is good to see, which is my only intent here because I know this smart group can figure this out)
Personally, I would take a PASM hit to see GCC run at its peak. There are potentially other solutions to be found once we jam on the new instructions and core update, and PASM is going to be fastest anyway, and PASM programmers can abuse things any way they want to as well. And maybe that's worth saying too, so I am.
Just read the other post, I would give up tasking for HUBEX mode for GCC to perform as well. Frankly, I don't see myself using tasks much in combination with HUBEX at all, but that's just me. Lots of cases out there.
(please don't vilify me because my syntax is off, the instruction set has changed so much I can't keep it all straight):
o Requires a hub address, which jacks into the critical path because the data is stored in cog ram.
o Because multiple hardware threads/tasks are possible, there is a locking/consistency problem with using a single address LR
o Because of the above problem you need register remapping on a per-task basis
Chip proposes a 4x32 stack for each task, where the "LR" is stored. Peeking at the stack would give the same data as a COG location, but the stack resides in ALU logic, not RAM.
The P2 is composed of HUB, COG, AUX RAM, and the 'LOGIC' block. The caches and LIFO stack are made of logic elements (flip-flops) in the 'LOGIC' block, which is basically storage that is local to each execution unit and can be accessed outside of the normal COG or HUB access windows.
The penalty is that each of these elements takes about 21 transistors and is 3.5 times larger than an SRAM cell.
The more specialized stuff Chip has to add in caches and such causes the synthesized logic block to balloon.
I discussed hubex with 4 threads and came to the conclusion that with 4 threads it's almost impossible not to thrash the Icache. Given the penalty the Icache presents in logic elements, I recommended a 1 line WIDE Icache. Since multiple hubex tasks will cause a high number of cache invalidations (based on GCC's code generator), I see no point in trying to make multi-task hubex use a cache.
To put things in perspective, 1 Icache cache line takes 255 LEs, with 4 lines that's 1K LEs, times 8 cogs that's 8192 LEs. Now, the Nano only has 22K LEs, so that's a huge chunk used just for 4 cache lines.
The thread stacks are 4x32, so that's another 512 LEs per COG, times 8 COGs, is 4096 LEs.
If the cache miss rate for GCC generated hubex code is above 50% (put another way, if the code has a cache hit rate of 50% or less), caching is more or less pointless IMHO.
I came up with this plan, which I think will work well with GCC's code generation, but it only is possible for a single task, because there isn't enough space to apply it to 4 tasks per COG:
o There are 2 cache lines, the COG ping-pongs between them, based on branch instructions that would branch outside of the cache line.
o When a cache line being executed causes a stall, that cache line is reloaded and the other cache line is left alone, to preserve the possibility of hitting the cache on return.
The code above is something you could very well see in a program that grabs a bunch of samples in succession.
The code might hypothetically compile to something like this; readADC would be a linked-in library function, not inlineable code (please don't vilify me because my syntax is off, the instruction set has changed so much I can't keep it all straight):
Note, I took some liberties with the GCC calling, but I think it would be fairly close. I didn't include stack frame setup, which isn't strictly needed for all functions.
In the example above, when a branch instruction is hit, but a branch is not taken, the ping-pong bit is not flipped. If a branch is hit AND taken, which causes a stall, the ping-pong bit is flipped, causing the other cache to be used.
So, cache line 1 would contain the loadADCSamples function (if it is aligned properly, from _SAMPLE_CNT onward).
Cache line 2 would contain the readADC library function (again, WIDE aligned for maximum cache optimization).
In operation, the COG would ping pong back and forth between the 2 cache lines, because they are both filled with the code being executed. Functionality would be the same if the code was inlined, it would just cache 16 instructions vs 8.
To recap, here's the rules for using 2 cache lines:
o When a branch instruction is encountered and taken, toggle cache lines, so the caller is cached and the callee is loaded into the other cache line.
o When execution stalls because a cache line has been exhausted, reload the line that stalled, do not use the other cache line.
o When a branch instruction is encountered, but not taken, do not toggle cache line index.
This algorithm is designed to help non-optimized code and code that is not inlineable -- code that is linked from a pre-compiled library.
I think we are all agreed that HUBEXEC would be simpler if it was not multitasking.
If multitasking is permitted, then perhaps HUBEXEC could be restricted to only run in Task-0, while Tasks 1,2,3 must be cog only.
This would simplify the model quite a lot. It would certainly help with maximising the instruction cache for the single HUBEXEC mode, and perhaps a little cache(s) for the D & S operands.
And I think it would solve the LR problem, by using $1F1 or a windowed single register into $1F1.
I would hate to see the HUBEXEC mode's performance suffer because it tried to support multitasking.
To put things in perspective, 1 Icache cache line takes 255 LEs, with 4 lines that's 1K LEs, times 8 cogs that's 8192 LEs. Now, the Nano only has 22K LEs, so that's a huge chunk used just for 4 cache lines.
The thread stacks are 4x32, so that's another 512 LEs per COG, times 8 COGs, is 4096 LEs.
Are you sure the cost is that high ?
There are 84 M9K blocks in a small Cyclone IV, and just one of these can do 256 x 36 Dual Port RAM.
Chip seemed to indicate there was a low silicon cost to all this ?
I think we are all agreed that HUBEXEC would be simpler if it was not multitasking.
If multitasking is permitted, then perhaps HUBEXEC could be restricted to only run in Task-0, while Tasks 1,2,3 must be cog only.
I suggested along those lines to Chip earlier, and his reply indicated the (silicon) cost of making all tasks equal was not that great an impact: "there's no significant cost in doing so" (rather contra to earlier indications on sizes).
It would be a mess to start making the four sets of task hardware different in capability. It's easier to give them all the same functionality and there's no significant cost in doing so. The programmer can decide how to use them.
Bill,
I think you are underestimating how much loss can be introduced by this for leaf functions. It's not uncommon for a function to call many leaf functions, all of which are very tiny (think accessors), where a 4-8 cycle overhead would be as much as or more than the function's cost. Additionally, you could have a set of small leaf functions called in a tight inner loop. I see this 4-8 cycles extra per call to the leaves as potentially equating to a 2-4x slowdown. Yes, there are times when the cost will be fairly small, but there are also times when the cost could be quite dramatic.
I was not meaning to make a personal argument, I just didn't understand how you would argue for one case with fewer cycles of overhead than this, but then argue that this case is not as important. I am sorry if I came across as personally attacking you, I'm truly just baffled. I see this leaf function overhead issue as a really big deal. Yes, we can get by without it being resolved in hardware, but we would also get by without delay slots or cordic functions or sdram or the whole hubexec thing entirely, but that's not the point.
I know that you have not been against the LR idea, but you have been arguing against its importance and displaying very small impact numbers in support of your arguments, but I think they are not based on real world C/C++ code of sufficient size/complexity that would make any of this actually matter. In such a program leaf functions are much more plentiful, and likely not large cycle eaters like strcpy (which probably isn't even a leaf, since it likely makes calls itself (it does in the implementations I've seen)).
I've been doing a lot of profiling of C/C++ code lately at work, and in our code (which is quite large and complex) something like 80% of the function calls made during execution are to leaf functions, and 95%+ of those leaf functions are small simple functions. And that's even with aggressive inlining of things like accessors.
Anyway, I really hope Chip (with guidance from you and others) can figure out a way to resolve this issue cleanly.
I'd like to apologize for presenting a rather mixed message about LR. Just when Chip said he was going to look into adding it, I noticed that his description of what was necessary seemed complex and could possibly affect a critical path. That made me remember that we are trying not to load new features onto P2, lest it become the chip that never was, or is so late that its value is diminished. So I posted a message saying that we could do without it if it is too big a risk. I suppose I should let Chip be the one to decide what is too big a risk and what is not. Having not seen the Verilog code (and probably I wouldn't fully comprehend it even if I did see it) I'm only in a position to guess what is and isn't easy to do or low risk. Certainly hub execute mode seems very high risk to me but I'm glad to have it.
I'd also be happy to have CALL_LR if it leaves its result in a COG register. I think the return address LIFO solution is less than ideal even with PEEK. In fact, if we were to use the return address LIFO we wouldn't even need the PEEK instruction since the function prologue would immediately pop the return address into a software LR register.
In any case, I return to my original position of saying that CALL_LR would be beneficial to GCC. Actually, any CALL instruction that puts the return address in a COG register would be helpful. Eric pointed out that a variant on JMPRET that stores the full 16 bit address would be fine as well. I didn't suggest that because I didn't see a way to encode a 9 bit D field and a 16 bit immediate S field in the same instruction. If that could be done it would work fine for GCC.
In summary, any CALL instruction that puts its result in either a fixed COG register like LR or has a D field to put the return address in any COG register would work well. Pushing it on a stack and having to pop it in the function prologue is less than ideal.
I think it's clear that what gcc needs is a register written with the return address of the last call. This is complicated by the matter of 4 tasks possibly executing from the hub.
One solution: Always write the return address from hub-mode callers to $1F1. If there's only one task executing from the hub, this would be fine. This has low silicon impact.
Another solution: Treat $1F1 as a window to one of four task-related registers that actually exist in flipflops, which always get written with the last return address. This has higher silicon impact.
Another solution: Move INDA/INDB to $1EE/$1EF and use $1F0..$1F3 as a range of registers which ALWAYS gets remapped, according to task, such that any access within that range will result in the two address LSBs being substituted with the task number. Return addresses are always stored in these registers. This has low silicon impact, but takes three more register spaces to implement. Might there be some other compelling use case for this feature?
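In C terms, that third option's remapping would behave roughly like this (a sketch of the described substitution, not actual hardware logic):

/* Any access to $1F0..$1F3 has its two address LSBs replaced by the task
   number, so each task sees a private register at the same address. */
unsigned remap_cog_addr(unsigned addr, unsigned task /* 0..3 */) {
    if (addr >= 0x1F0 && addr <= 0x1F3)
        return 0x1F0 | (task & 3);  /* LSBs substituted with task number */
    return addr;                    /* all other cog addresses unchanged */
}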
I think you summarized things very well, David! (And thank you for your advocacy here.) GCC could live with a version of CALL that pushes the return address onto a HUB stack -- this is how many older CISC processors work, so the code is there for it. Having a way to write the return address to a COG register would provide for slightly more efficient leaf functions, so it would be a win. If doing that requires severe contortions like creating a whole new FIFO stack just for return addresses then it's probably better to just go with the HUB stack.
I think it's pretty reasonable to restrict execution to one HUB task at a time. However, I'm also quite intrigued by Chip's suggestion of having a range of registers that always get remapped. Besides return addresses, I could see it potentially being useful for passing parameters to tasks. Again, we should definitely stick with low risk solutions though! In the big picture, having hub execution is going to gain us a lot!
Comments
I'm sorry, I've been crazy busy this weekend and haven't had a chance to review everything, so forgive me if this idea has already been shot down for some reason.
Why not just keep JMPRET the way it is now, but write the whole 18 bits of address into the destination register in HUB mode?
This would allow GCC could use its current calling convention.
If a mode specific version is too ugly, could we just always write the full 18 bits and use the 18 bit long form of JUMP as the RET instruction (for COG mode subroutines, of course -- HUB will have to either use the HUB stack form or manage the save of the return register itself)?
Chip is still thinking about the issue. Personally I'd prefer if $1F1 was the private LR for each task, each task seeing its own LR at that location. This would leave the low cog memory clear.
PropGCC will probably end up
I think you misunderstood:
- I don't like two hub hits in a leaf function, and I think that can be avoided, but until they are
- I hate using two instructions for a hubexec CALL more (separate PUSH LR, CALL address) - especially now that Chip is providing a single instruction CALLA that does both in just one long, and in one less cycle.
There are a number of ways of addressing that besides an LR register - which Chip may yet provide.
1) Have calls to leaf functions be performed by CALLX, and return using RETX. As they are leaf functions, it would only consume one long in AUX.
2) Have calls to leaf functions be performed by CALL (with small lifo as Chip proposed) and return with RET (using the lifo). Does not even consume a long in AUX.
I suspect you will object on the basis of GCC not knowing what a leaf function is, when emitting the code to call it.
The linker could fix that, as by that time, it is possible to know that a leaf function is being called, and the opcode calling it could be patched by the linker from the default 'CALLA' to 'CALLX', and the 'RETA' to 'RETX' (or lifo versions)
Note this avoids the ugly waste of a long on every call - which is a big deal, as we do not want to waste an extra long on every call.
3) for the initial port, take the hub cycle hit. Presumably leaf functions perform a useful amount of work, and the extra hub cycle will not be a large fraction of the time spent in the routine, and using a single instruction for call will free more hub memory, and free some cycles as well.
As time permits, add (1) or (2)
What I am extremely opposed to is using two instructions to do a call in all cases. That is a huge waste of hub memory, which is a precious resource.
Regarding PEEK - I wanted it for FORTH
For leaf functions:
For non-leaf functions: Notice that the calling sequence is the same in either case.
I hope I'm getting the opcodes right. I'm assuming that PUSHA pushes a value onto the hub stack using PTRA and that RETA pops the return address from the PTRA hub stack.
Thank you for the example code - it will make illustrating what I mean easier. Also, I see the memory waste is not nearly as bad as long as the PUSHA LR happens in the functions.
Using Chip's LIFO: (leaf function), no hub access
Using CALLX/RETX
David, I don't think any of us should worry about exact opcode naming until we see Chip's final list
Hub stack version:
For all of the above sample snippets, I am assuming single-long call instructions with 16 bit embedded address (two trailing zero's implied by long boundary) as two-long BIG/CALL sequences have an unacceptably high hub memory overhead.
It still uses precious hub memory when there is no real need, see the examples I posted in my reply to David.
- even leaf functions perform many clock cycles worth of work
- worst case CALLA/RETA adds 14 cycles versus using LR
- using CALLA/RETA addresses function pointers, calls from other modules
- average case adds 7 cycles versus using LR
- if unused hub slots are used by hubexec, normal case would only add 2-3 cycles vs. LR
- using LR adds 1 cycle to every not leaf function
- using LR wastes one long in every function
Without LR, adding a "_leaf" attribute to functions and their prototype would allow using non-hub stack versions of CALL/RET even for function pointers
- CALLA/RETA is far more efficient than any call in any LMM/CMM mode
- performance penalty would be insignificant as even leaf functions normally take hundreds of cycles
- performance penalty could be eliminated with a "_leaf" attribute for functions and prototypes
- It is surreal to hear you argue FOR performance
Now please show me how CALLA/RETA will add a significant delay.
I see. It is exactly as non-standard as __inline.
The coloring is not needed if CALLA/RETA is used.
I was showing that a _leaf attribute could be used to address the "impossibility" of having GCC use a different return/call mechanism for leaf functions; it was by no means a requirement.
Worst case 14 cycles.
Let's say strcpy() is a leaf function, and 80 bytes are copied. That's 640 hub cycles without caching, or 400 cycles with caching. Plus REPS overhead etc.
14/400 = 3.5% slowdown
7/400 = 1.75% slowdown
Of course if the leaf function uses 4000 cycles, then its 0.35% worst case, 0.175% average case.
Adding LR would add a cycle to every non-leaf function, and four extra bytes to every function.
I am not inconsistent, I was just pointing out that if implementing an LR-less leaf function is difficult for GCC, it won't hurt performance much.
PEEK is a different case, for tight assembly code for FORTH - and I don't care that much about it, and to be accurate, Chip proposed it and asked if anyone had a use for it.
David, you have to realize there is more to the world than GCC, and the way propgcc is currently constructed.
Edit: Actually, there are the new pointer addressing modes as well but I don't believe they were put in specifically for GCC.
Bill,
Leaf functions are VERY common, often very tiny, and called a LOT. We don't have the memory to inline everything, so you can't argue that way either. Having leaf functions cost 14 cycles (worst case) or even 7 cycles (average case) more than they need to cost is HUGE. It's rather absurd to dismiss this issue when you threw a giant fit over delayed slots being possibly removed which is only a few cycles per case.
You can't have different types of calling based on if it's calling a leaf or not, because it can't know in all cases even at link time. Function pointers throw the monkey wrench into all your ideas for "fixing it up" because the same pointer can point to either type and can change during runtime. You can't bend the C/C++ spec or introduce wonky conventions just for the Prop into the source without destroying the whole purpose of having C/C++ on the prop in the first place.
David and others have said all of these things, but I figured I'd say them again in case maybe someone else saying them will have you maybe actually listen to it and really get it. Maybe then you will come up with some clever solution that actually will work, but so far it's not there....
How to do that has been presented here and it needs to happen. Given P2 is a freaking PASM playground anyway, there is no reason the one thing needed to peak out GCC shouldn't get done.
Note that I suggested $1F1 for LR, but Chip said that was a problem due to 4 tasks needing LR's. I was pointing out lack of an LR, in case it does not make it in is not a disaster.
FYI, a calculation I will post tomorrow (I am going to bed in a few min) shows that loss of LR cannot even take 14 cycles, worst case is actually more like 8 cycles, average case 4 cycles, and if unused hub slots are used, more like 2 cycles.
With GCC spilling registers / re-loading registers, and leaf functions having to have some function, not having LR is worst case 1% performance hit.
Actually its more like 8 cycles worst case, 4 average, 2 if extra hub slots are used. Will post analysis tomorrow, heading to bed after two more quick posts.
1. I did not "throw a giant fit", I pointed out it will cause a huge performance loss (3x roughly, as in 300%) in an application I worked a lot to tune, compared to 1% worst case. Big difference.
2. Make technical, not personal, arguments.
1. Read the thread. I proposed $1F1 for LR LONG ago, but Chip said it has issues due to tasking.
2, I have not been against LR, I have been pointing out it is not a disaster if it does not make it in, as using a hub stack CALLA/RETA is worst case tiny performance hit for leaf functions.
3. I am totally against BIG/CALL combo's and BIG/JUMP combos, because those waste ~16% of hub and many cycles compared to single long JMPx / CALLx
Read the thread. I proposed $1F1 as LR a long time ago. Chip said that had technical issues due to tasks. I was pointing out loss of LR is not the end of the world, not a huge performance hit.
o Requires a hub address, which jacks into the critical path because the data is stored in cog ram.
o Because multiple hardware threads/tasks are possible, there is a locking/consistency problem with using a single address LR
o Because of the above problem you need register remapping on a per-task basis
Chip proposes a 4x32 stack for each task, where the "LR" is stored. Peeking at the stack would give the same data as a COG location, but the stack resides in ALU logic, not RAM.
The P2 is comprised of HUB, COG, AUX RAM, and the 'LOGIC' block. The caches and LIFO stack are made of logic elements (flip-flops) in the 'LOGIC' block, which is basically storage that is local to each execution unit and can be accessed outside of the normal COG or HUB access windows.
The penalty is that each of these elements takes about 21 transistors and is 3.5 times larger than an SRAM cell.
The more specialized stuff Chip has to add in caches and such, causes the synthesized logic block to balloon.
I discussed hubex with 4 threads and came to the conclusion that with 4 threads it's almost impossible not to thrash the Icache. Given the penalty the Icache presents in logic elements, I recommended a 1 line WIDE Icache. Since multiple hubex tasks will cause a high number of cache invalidations (based on GCC's code generator), I see no point in trying to make multi-task hubex use a cache.
To put things in perspective, 1 Icache cache line takes 255 LEs, with 4 lines that's 1K LEs, times 8 cogs that's 8192 LEs. Now, the Nano only has 22K LEs, so that's a huge chunk used just for 4 cache lines.
The thread stacks are 4x32, so that's another 512 LEs per COG, times 8 COGs, is 4096 LEs.
If the cache miss rate for GCC generated hubex code is better than 50%, or put another way, the code has a cache hit rate of 50% or less, caching is more or less pointless IMHO.
I came up with this plan, which I think will work well with GCC's code generation, but it only is possible for a single task, because there isn't enough space to apply it to 4 tasks per COG:
o There are 2 cache lines, the COG ping-pongs between them, based on branch instructions that would branch outside of the cache line.
o When a cache line being executed causes a stall, that cache line is reloaded and the other cache line is left alone, to preserve the possibility of hitting the cache on return.
Here is how it works:
pseudo code:
The code above is something you could very well see in a program that grabs a bunch of samples in succession.
The code might hypothetically compile to this, readADC would be a linked in library function, not inlineable code (please don't vilify me because my syntax is off, the instruction set has changed so much I can't keep it all straight):
Note, I took some liberties with the GCC calling, but I think it would be fairly close, I didn't include stack frame setup, which isn't strictly needed for all functions.
In the example above, when a branch instruction is hit, but a branch is not taken, the ping-pong bit is not flipped. If a branch is hit AND taken, which causes a stall, the ping-pong bit is flipped, causing the other cache to be used.
So, cache line 1 would contain the loadADCSamples function (if it is aligned properly, from _SAMPLE_CNT onward).
Cache line 2 would contain the readADC library function (again, WIDE aligned for maximum cache optimization).
In operation, the COG would ping pong back and forth between the 2 cache lines, because they are both filled with the code being executed. Functionality would be the same if the code was inlined, it would just cache 16 instructions vs 8.
To recap, here's the rules for using 2 cache lines:
o When a branch instruction is encountered and taken, toggle cache lines, so the caller is cached and the callee is loaded into the other cache line.
o When execution stalls because a cache line has been exhausted, reload the line that stalled, do not use the other cache line.
o When a branch instruction is encountered, but not taken, do not toggle cache line index.
This algorithm is designed to help non-optimized code and code that is not inlineable -- code that is linked from a pre-compiled library.
What is proposed works great for Bill, etc (hubexec and cog modes) but doesn't fit with GCC.
This is my understanding of GCCs requirement...
GCC places the return address in a fixed cog location "LR".
* If the routine then is a non-leaf routine, the LR is immediately pushed onto a stack by this routine (commonly in hub).
* If the routine is a leaf routine, then the push is not required, LR may be used to pass further info, so it can be operated on using instructions such as ADD LR,#y. At the end of the routine, a RETURN instruction (actually an indirect jump to LR) is performed.
Since Chip is now making hubexec work for all 4 multitasks, 4 x LR locations are required.
Because of the direct usage of LR, they (4 copies) need to be mapped into a single cog position, and the relevant LR used depending on the task#.
Now thinking about this, there is no point in there being 4 sequential LR slots, one for each task#, because you don't want to have to index into the relevant LR slot (too much code overhead). We are all happy to accept that hub cannot be multitasking but Chip is doing it (for now anyway) and it would be silly to just have the LR not working properly for multitasking if this is feasible.
Most likely, one of the JUMP instruction that Chip has allocated would work for the RET LR instruction (needs to be a 16bit hub address +flags).
So, what is required is a CALL style instruction "CALL_LR" where the D & S registers make a 16bit hub goto address, and the return address is placed into the "LR" register (on of 4 windowed LR's, one for each task#).
Do I have this correct???
If so, then I would think the CALL_LR would be quite easy.
The biggest problem here is the mapping of the 4 "LR" registers windowed into a single cog register. $1F1 has been suggested as a location (or the last unused special register).
I am sure Chip can work this out if this is what is required. Seems to me it would also be useful from PASM too.
1. I do think it's a big deal if it doesn't get in there, and I don't say that lightly: the people here who know GCC cold have made it clear how it sings.
2. If Chip is weighing things, it may be worth it to know having a strong GCC matters to people. I have some plans to use GCC, even though I've been mostly SPIN + PASM. Maybe that changes the weighting some, maybe it's just good enough for a reconsideration.
3. The GCC crowd doesn't always get the best messages from the community overall. I want to lend my support because I think their work is important and necessary and that it is appreciated. In light of my past statements related to #2, that may not always be clear and I want it to be.
Your case on the delayed branches isn't a niche case. You made that clear to us, citing a virtual-machine kernel that employs them for peak performance, and clearly you want your kernel to really perform. Who wouldn't? That's not a negative statement, just what you told us.
Overall, the loss of delayed branching isn't that big of a hit otherwise, as mentioned here.
So then, the GCC guys are saying the exact same thing about the calling mechanism! And again, who wouldn't? Not a negative there either.
None of it should be.
Frankly, I am disturbed by the fact that this can't be sorted. Early on, I thought some code samples might help, and I expressed a sincere desire for this to be resolved. The root of that desire is having a strong GCC this time around because I believe it will really matter on the P2. Heck, people are doing great things now on P1, thanks to the GCC team and Ross, both of whom have produced C environments that get stuff done, despite the challenges present on P1.
Repeating that exercise shouldn't be on the table for P2, and as mentioned, the GCC team doesn't need much. Given all we've gotten, and make no mistake a lot of that is indulgent for which I am thankful, making sure GCC has an optimal path to use HUBEXEC seems to make entirely too much sense.
Those things said, making a specific recommendation isn't something I feel good about. Not my area of expertise. However, I can generally observe the dynamic here, and it just isn't productive at all. A balance of interests is required, or some really creative thought to resolve it technically in a way that promotes everybody. I don't care which.
But I do care a lot that we can't seem to get something sorted out for GCC where it's clearly possible and necessary to do. And there is this too: where a solid GCC exists, there will be more prospects for whatever any of us plan on building; without a reasonable solution, those same plans all carry that much more risk.
Rather than belabor the same points, how about working to realize some new ones that may well make everybody happy? (and I see this happening already on the thread, and it is good to see, which is my only intent here because I know this smart group can figure this out)
Personally, I would take a PASM hit to see GCC run at its peak. There are potentially other solutions to be found once we jam on the new instructions and core update; PASM is going to be fastest anyway, and PASM programmers can abuse things any way they want as well. And maybe that's worth saying too, so I am.
I just read the other post; I would give up tasking in HUBEX mode for GCC to perform as well. Frankly, I don't see myself using tasks much in combination with HUBEX at all, but that's just me. Lots of cases out there.
Yeah, me too.
I think we are all agreed that HUBEXEC would be simpler if it was not multitasking.
If multitasking is permitted, then perhaps HUBEXEC could be restricted to run only in Task 0, while Tasks 1,2,3 must be cog-only.
This would simplify the model quite a lot. It would certainly help with maximising the instruction cache for the single HUBEXEC mode, and perhaps small caches for the D & S operands.
And I think it would solve the LR problem, by using $1F1 or a windowed single register into $1F1.
I would hate to see the HUBEXEC mode's performance suffer because it tried to support multitasking.
I think that only correct technical descriptions can help Chip to make it correct.
Not discussions about who said what.
Chip is open-minded, and if he sees correct descriptions he can implement them correctly too.
Otherwise no one wins from having it half done!
You mean in an ASIC ?
Are you sure the cost is that high ?
There are 84 M9K blocks in a small Cyclone IV, and just one of these can do 256 x 36 dual-port RAM.
Chip seemed to indicate there was a low silicon cost to all this ?
I suggested along those lines to Chip earlier, and his reply indicated the (silicon) cost of making all tasks equal was not that great an impact: "there's no significant cost in doing so" (rather contra to earlier indications on sizes).
I think you are underestimating how much loss can be introduced by this for leaf functions. It's not uncommon for a function to call many leaf functions, all of which are very tiny (think accessors), where a 4-8 cycle overhead would be as much as or more than the function's cost. Additionally, you could have a set of small leaf functions called in a tight inner loop. I see this 4-8 cycles extra per call to the leaves as potentially equating to a 2-4x slowdown. Yes, there are times when the cost will be fairly small, but there are also times when the cost could be quite dramatic.
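To make that concrete, here's a hypothetical accessor with purely illustrative cycle numbers:

/* Illustrative leaf accessor, assuming it is not inlined. */
typedef struct { int x, y; } point_t;

int point_get_x(const point_t *p)   /* ~2 instructions of real work,  */
{                                   /* call it ~4 cycles              */
    return p->x;
}
/* Add 4-8 cycles of call/return overhead: 8-12 cycles per call total,
   i.e. roughly a 2-3x slowdown when this is hit in a tight inner loop. */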
I was not meaning to make a personal argument; I just didn't understand how you could argue for one case with fewer cycles of overhead than this, but then argue that this case is not as important. I am sorry if I came across as personally attacking you; I'm truly just baffled. I see this leaf-function overhead issue as a really big deal. Yes, we can get by without it being resolved in hardware, but we would also get by without delay slots or cordic functions or sdram or the whole hubexec thing entirely, but that's not the point.
I know that you have not been against the LR idea, but you have been arguing against its importance and citing very small impact numbers in support of your arguments, but I think they are not based on real-world C/C++ code of sufficient size/complexity that would make any of this actually matter. In such a program, leaf functions are much more plentiful, and likely not large cycle-eaters like strcpy (which probably isn't even a leaf, since it likely makes calls itself (it does in the implementations I've seen)).
I've been doing a lot of profiling of C/C++ code lately at work, and in our code (which is quite large and complex) something like 80% of the function calls made during execution are to leaf functions, and 95%+ of those leaf functions are small simple functions. And that's even with aggressive inlining of things like accessors.
Anyway, I really hope Chip (with guidance from you and others) can figure out a way to resolve this issue cleanly.
The counts are directly from Chip: 21 transistors per flip-flop and 255 flip-flops per WIDE cache line, i.e. 255 x 21 = 5,355 transistors per line. A high cost indeed.
He said it was a very significant number in comparison to the number of existing flip-flops in the design.
I'd also be happy to have CALL_LR if it leaves its result in a COG register. I think the return address FIFO solution is less than ideal even with PEEK. In fact, if we were to use the return address FIFO we wouldn't even need the PEEK instruction since the function prologue would immediately pop the return address into a software LR register.
In any case, I return to my original position of saying that CALL_LR would be beneficial to GCC. Actually, any CALL instruction that puts the return address in a COG register would be helpful. Eric pointed out that a variant on JMPRET that stores the full 16-bit address would be fine as well. I didn't suggest that because I didn't see a way to encode a 9-bit D field and a 16-bit immediate S field in the same instruction. If that could be done, it would work fine for GCC.
In summary, any CALL instruction that puts its result either in a fixed COG register like LR or in any COG register named by a D field would work well. Pushing it on a stack and having to pop it in the function prologue is less than ideal.
I think it's clear that what gcc needs is a register written with the return address of the last call. This is complicated by the matter of 4 tasks possibly executing from the hub.
One solution: Always write the return address from hub-mode callers to $1F1. If there's only one task executing from the hub, this would be fine. This has low silicon impact.
Another solution: Treat $1F1 as a window to one of four task-related registers that actually exist in flipflops, which always get written with the last return address. This has higher silicon impact.
Another solution: Move INDA/INDB to $1EE/$1EF and use $1F0..$1F3 as a range of registers which ALWAYS gets remapped according to task, such that any access within that range will result in the two address LSBs being substituted with the task number. Return addresses are always stored in these registers. This has low silicon impact, but takes three more register spaces to implement. Might there be some other compelling use case for this feature?
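In C terms, that last remapping would behave like this (the helper name and mask are just for illustration):

// Hypothetical model of the always-remapped range $1F0..$1F3:
// any access in the range has its two address LSBs replaced by the
// task number, so each task transparently sees its own slot.
unsigned remap(unsigned addr, unsigned task)   // task = 0..3
{
    if (addr >= 0x1F0 && addr <= 0x1F3)        // always-remapped range
        return (addr & 0x1FC) | task;          // substitute the two LSBs
    return addr;                               // everything else untouched
}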
I think it's pretty reasonable to restrict execution to one HUB task at a time. However, I'm also quite intrigued by Chip's suggestion of having a range of registers that always get remapped. Besides return addresses, I could see it potentially being useful for passing parameters to tasks. Again, we should definitely stick with low-risk solutions though! In the big picture, having hub execution is going to gain us a lot!