The compiler will generate the same call instruction for leaf and non-leaf functions. Since functions can be separately compiled there is no way for the caller to know if the callee is leaf or non-leaf. Also, there would be no way for an indirect call to know if the target is leaf or non-leaf. A leaf function will just jump indirect through the LR to return. The non-leaf function will build a stack frame on the hub stack including the value from the LR register.
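The convention described above can be sketched in C as a toy model. This is only an illustration, not P2 code: the `Machine` struct, its fields, and the function names are all hypothetical, and the "stack" is a plain array standing in for the hub stack. The caller always does the same thing; only the callee decides whether the link register needs saving.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine, not the P2: one link register and a small stack. */
typedef struct {
    uint32_t lr;           /* link register written by every call */
    uint32_t stack[16];    /* stands in for the hub stack */
    int sp;                /* number of words currently on the stack */
    int stack_writes;      /* how many times a callee had to touch the stack */
} Machine;

/* Caller side: identical for leaf and non-leaf targets. */
void call_via_lr(Machine *m, uint32_t return_addr) {
    m->lr = return_addr;
}

/* Leaf callee: saves nothing and returns straight through LR. */
uint32_t leaf_return(const Machine *m) {
    return m->lr;
}

/* Non-leaf callee: saves LR before a nested call overwrites it. */
uint32_t non_leaf_return(Machine *m, uint32_t nested_return_addr) {
    assert(m->sp < 16);
    m->stack[m->sp++] = m->lr;           /* build the frame: save LR */
    m->stack_writes++;

    call_via_lr(m, nested_return_addr);  /* nested call clobbers LR */
    (void)leaf_return(m);                /* nested leaf returns via LR */

    m->lr = m->stack[--m->sp];           /* restore our own return address */
    return m->lr;
}
```

In this model a call to `leaf_return` leaves `stack_writes` at zero, while `non_leaf_return` costs one stack write and one stack read, which is the overhead the leaf case avoids.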
Okay. Now I'm starting to get it. Non-leaf functions build stack frames with the return address pushed last, after a bunch of parameters. So, a CALL effectively RETurns with a bunch of parameters that need to be POP'd off. Is that right?
So this means that the called function is the one that incurs the overhead of building its stack frame. If a function meets all the criteria to be a leaf function, it can avoid the cost of building a stack frame, work out of the register set of the compile environment, and return off the LR register, if your hardware supports that feature. This makes for much faster entry and exit for leaf functions because they avoid any stack frame costs.
If you don't have the LR functionality, then every function has to be treated as non-leaf and every function has to incur stack frame construction and deconstruction costs on entry and exit.
You end up with faster function setup for whatever percentage of functions the compiler can create as leaf functions. But we don't really know what that percentage is: 10%, 50%, 90%....
There also appear to be other criteria a function needs to meet before it can be a leaf function in GCC's eyes.
Chip, that sounds like it would work if each leaf function in the code knows it is a leaf function and therefore doesn't call any more functions. The compiler will hopefully already know this; I'm just not sure if GCC has a simple way to add prologue code that depends on it. That's a very good question for David/Eric etc. as to whether it can be done.
I do have another question though. I am getting the feeling (and I really am only guessing here) that for a GCC port PTRA/PTRB won't ever get used for the general stack pointer, even if the stack is in hub RAM, because if we did that we would not be able to safely use the code it generates in cases where we might be running in a COG with multithreading turned on. Instead, some general COG register would be used for the stack pointer. If that is ultimately the case it is a real shame, because you've gone and given us these great PTRA/PTRB pointer registers with some excellent read/write access methods that allow stack offset addressing, autoincrement/decrement etc., which are ideal for stack pointers accessing variables from hub-based stack frames, pushing/popping operations etc., and we won't be using them fully.
Now you don't get if you don't ask, but is it possible each task in the COG could get its own copy of PTRA/PTRB? If so we could potentially still use them safely as stack pointers in C to get the higher performance and code reduction when accessing the stack.
Consider the extra PASM code required for accessing hub stack frame variables if the stack pointer is held in a general COG register.
I know assigning "register" variables in C can help alleviate some reading of data from the hub stack each time, but there are only so many registers available.
Also, to push data (very common), it will always take 2 instructions, and the same for a pop:
wrlong data, tempstack
sub tempstack, #4
But with the stack pointer in, say, PTRA, and PTRB left free for other arbitrary/general memory accesses (or "BP" type base pointers), you could do your pushes like this:
wrlong data, PTRA--
and we can quickly access the stack frame variables for all aligned stacks with fewer than 32 arguments (pretty normal) like this:
rdlong data, PTRA[-3]
The only downside I can see for using PTRA is that if you want to take the address of a hub stack variable you may need to do an extra "getptra" instruction before you start the computation to get the actual stack pointer value into a general COG register first, but that is far less common than pushing/popping data IMO, and so it is probably worth the small overhead in that case.
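The push comparison above can be restated as a small C model. It is a sketch under assumptions: hub RAM is modelled as an array of longs addressed in bytes, `PTRA--` is assumed to step by one long (4 bytes) for a `wrlong`, and the `instructions` counter simply restates the 2-versus-1 figures from the post rather than measuring anything. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HUB_LONGS 64

/* Hypothetical model of a hub stack addressed in bytes. */
typedef struct {
    uint32_t hub[HUB_LONGS];
    uint32_t sp;            /* byte address the next push writes to */
    unsigned instructions;  /* instruction count as stated in the post */
} HubStack;

/* Stack pointer in a general COG register: two instructions per push. */
void push_two_step(HubStack *h, uint32_t data) {
    assert(h->sp % 4 == 0 && h->sp >= 4 && h->sp / 4 < HUB_LONGS);
    h->hub[h->sp / 4] = data;   /* wrlong data, tempstack */
    h->sp -= 4;                 /* sub    tempstack, #4   */
    h->instructions += 2;
}

/* Stack pointer in PTRA: the write and the decrement are one instruction. */
void push_fused(HubStack *h, uint32_t data) {
    assert(h->sp % 4 == 0 && h->sp >= 4 && h->sp / 4 < HUB_LONGS);
    h->hub[h->sp / 4] = data;   /* wrlong data, PTRA-- : write...          */
    h->sp -= 4;                 /* ...then post-decrement (assumed 4 bytes) */
    h->instructions += 1;
}
```

Both routines leave memory and the stack pointer in the same state; the only difference in the model is the instruction count, which is the saving being argued for.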
It would be really great to get the full performance capabilities of PTRA, PTRB in C code with hub based stacks.
Roger.
Yes, I'd like to make PTRA/PTRB/INDA/INDB unique for each task, but the problem is that doing so would immediately create a new critical path and slow the chip down by maybe 10MHz. Is it worth it? It may be.
Okay. Now I'm starting to get it. Non-leaf functions build stack frames with the return address pushed last, after a bunch of parameters. So, a CALL effectively RETurns with a bunch of parameters that need to be POP'd off. Is that right?
Yes, basically that is correct. However GCC passes parameters in registers not generally on the stack. But for non-leaf functions, it might have to push some of those registers to make space for the parameters of a nested function.
I think something may be getting missed in this discussion of the different CALL types vs CALLR and how the C compiler uses them.
The compiler is going to code every function call the same way (always using CALLR). When the function being called is a leaf, it will just return through the register; when it is not a leaf, it will push the register to the stack before it calls another function. One of the key points is that the CALLER doesn't know if it's calling a leaf or not; only the CALLEE knows. The function being called may not even be in the same compile unit; it all gets resolved at link time (when there normally isn't any code generation).
Anyway, I think having the CALLR instruction is important, and it will get used heavily in code for the P2.
edit: Dave beat me to this... oh well.
Thanks for the explanation. David had explained this to me before, but I couldn't remember. I'll try to get CALLR worked in.
Now you don't get if you don't ask, but is it possible each task in the COG could get its own copy of PTRA/PTRB?
This is a good question. When hardware tasks were first introduced there were many restrictions on them so I had assumed we would probably only want to run a single C task in a COG. However, Chip has done such a good job of removing almost all of those restrictions that it might be worth considering multiple C tasks again.
The problem is that stacks are expensive to access so you want to avoid doing that if possible. They are of course invaluable for non-leaf functions.
I'm all for register passing. I think stack frames suck in general; just more baggage. I meant that there is a lack of info on implications ... and of course no implementation examples. It's a code jargon all of its own.
I know. Multiple C tasks is one beneficiary of such a change, the other is if we wanted to run C code in a separate task alongside some PASM driver which may or may not use PTRA, PTRB. GCC won't know. So it would probably have to assume it can't ever use PTRA/PTRB. That is what bugs me. I guess there could be some GCC option added saying to compile with PTRA/PTRB as the stack pointer and the developer can carefully choose to set it, but that adds more complexity rather than automatically just using it no matter what. I much prefer simplicity and performance wherever possible.
David, did you know there are CALL/RET instructions that just use the 4-level FIFO stack that each task has?
If you called with CALL, and then the CALLee would do a POP reg, you'd have the equivalent of CALLR. This wouldn't waste any special resource and would only take 1 instruction in the leaf function.
I'm looking into adding PTRA/PTRB to every task. I think it might work without a timing problem. INDA/INDB would be deadly for timing, though.
No idea... need coffee... I only scanned the thread, mostly reading Chip's posts, saw issue with opcode space... you may have mentioned it first, don't know. Will re-read thread
No need to re-read. What you said is exactly right.
I hope that smile is very tongue-in-cheek! Bill has come up with a solution that did not involve adding a CALLR instruction.
I wasn't objecting to what Bill said. I was just pointing out that I had already suggested the exact same thing in message 173 of this thread. This approach eliminates the need to add any instructions at all.
That's a bit more than a few posts back. The real problem at that stage was the purpose was so confused that any methods were eye-glazing.
I did not catch it, as I was skimming, reading Chip's posts.
Sorry, you definitely suggested it first.
Actually, I think Chip missed my post too. :-) In any case, I think it will work. Thanks for mentioning it.
One question though, what happens when the LIFO overflows? I assume that the oldest value is just lost but that the most recent four values remain. Is that correct? The LIFO doesn't get cleared on overflow does it?
The LIFO overflows out the far end, but that doesn't matter. The last 4 values PUSH'd can always be POP'd. It's only 4 levels deep and belongs to the task, only, so nobody cares if it gets abused a little.
CALL    CALLD    call subroutine using task's 4-level stack
RET     RETD     return from subroutine using task's 4-level stack
CALLA   CALLAD   call subroutine using HUB[PTRA++]
RETA    RETAD    return from subroutine using HUB[--PTRA]
CALLB   CALLBD   call subroutine using HUB[PTRB++]
RETB    RETBD    return from subroutine using HUB[--PTRB]
CALLX   CALLXD   call subroutine using AUX[PTRX++]
RETX    RETXD    return from subroutine using AUX[--PTRX]
CALLY   CALLYD   call subroutine using AUX[!PTRY++]
RETY    RETYD    return from subroutine using AUX[!--PTRY]
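The overflow behaviour of the task's 4-level stack described above (the oldest value falls out the far end, and the last 4 values pushed can always be popped) can be modelled in C as follows. This is a sketch, not a description of the hardware: the `Lifo` type and function names are hypothetical, and what refills the bottom slot after a pop is not stated in the thread, so zero is used here purely as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIFO_DEPTH 4

/* Hypothetical model of a task's 4-level stack; slot[0] is the top. */
typedef struct {
    uint32_t slot[LIFO_DEPTH];
} Lifo;

/* Push: everything shifts one level deeper and the oldest value is lost. */
void lifo_push(Lifo *s, uint32_t value) {
    assert(s != NULL);
    for (int i = LIFO_DEPTH - 1; i > 0; i--) {
        s->slot[i] = s->slot[i - 1];
    }
    s->slot[0] = value;
}

/* Pop: return the top and shift the remaining values up one level. */
uint32_t lifo_pop(Lifo *s) {
    assert(s != NULL);
    uint32_t top = s->slot[0];
    for (int i = 0; i < LIFO_DEPTH - 1; i++) {
        s->slot[i] = s->slot[i + 1];
    }
    s->slot[LIFO_DEPTH - 1] = 0;  /* assumption: refill value is unspecified */
    return top;
}
```

Pushing six values and then popping four returns the four most recent values in reverse order; the first two are simply gone, and nothing is cleared by the overflow.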
There also appear to be other criteria a function needs to meet before it can be a leaf function in GCC's eyes.
GCC Leaf Functions
Damn! Who writes that stuff?! Reading that, one might not think computers have anything called a stack.
Why not make the top of each task's 4-level stack visible at $1F1 for that task?
That would:
- be exactly equivalent to CALLR
- provide the link register
- not need any more flip-flops
and, most important:
NO NEED TO SCROUNGE FOR OPCODE SPACE!
Edit: in #173 David already suggested this (sorry David, I missed it while skimming)
I'm looking into adding PTRA/PTRB to every task. I think it might work without a timing problem. INDA/INDB would be deadly for timing, though.
Okay! That sounds like a plan. I'll look into that after I address this PTRA/PTRB issue.
Too bad about INDA/INDB.. but compiled code will not need that nearly as much as PTRA/PTRB.
Why?
SETPTRB $1F1
Now we can access constants/addresses embedded (-32..31)*scale in the code using PTRB
Dang! Not as simple as Bill's idea at all.
You are welcome