HUB EXEC Update Here

Roy Eltham · 2014-02-06 04:52

I think something may be getting missed in this discussion of the different CALL types vs CALLR and how the C compiler uses them.

The compiler is going to code every function call the same way (always using CALLR). When the function being called is a leaf, it will just return through the register; when it is not a leaf, it will push the register to the stack before it calls another function. One of the key points is that the CALLER doesn't know if it's calling a leaf or not; only the CALLEE knows. The function being called may not even be in the same compile unit; it all gets resolved at link time, when there normally isn't any code generation.
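A minimal sketch of that convention as a toy Python model, not real P2 code. Everything named here (`Machine`, `callr`, `leaf`, `non_leaf`, the addresses) is hypothetical and only illustrates that the caller's side is identical in both cases while only the non-leaf callee touches the hub stack.

```python
# Toy model of a link-register calling convention (illustrative only).
# All names and addresses are made up for this sketch.

class Machine:
    def __init__(self):
        self.lr = None        # link register written by the call instruction
        self.hub_stack = []   # stack in hub RAM, touched only by non-leaf code
        self.hub_writes = 0   # number of hub stack pushes, to compare costs

    def callr(self, return_addr, callee):
        # The caller always does the same thing: deposit the return address
        # in LR and transfer control. It does not know if callee is a leaf.
        self.lr = return_addr
        return callee(self)


def leaf(m):
    # Leaf: calls nothing, so LR is still intact when it is time to return.
    return m.lr               # models "jump indirect through LR"


def non_leaf(m):
    # Non-leaf: must save LR before a nested call overwrites it.
    m.hub_stack.append(m.lr)
    m.hub_writes += 1
    m.callr(0x200, leaf)      # the nested call clobbers LR
    m.lr = m.hub_stack.pop()  # restore our own return address
    return m.lr
```

Calling `Machine().callr(0x100, leaf)` returns to `0x100` without any hub stack traffic, while `callr(0x104, non_leaf)` returns to `0x104` after one push and one pop.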

Anyway, I think having the CALLR instruction is important, and it will get used heavily in code for the P2.

edit: Dave beat me to this... oh well.

cgracey · 2014-02-06 04:58

David Betz wrote: »

The compiler will generate the same call instruction for leaf and non-leaf functions. Since functions can be separately compiled there is no way for the caller to know if the callee is leaf or non-leaf. Also, there would be no way for an indirect call to know if the target is leaf or non-leaf. A leaf function will just jump indirect through the LR to return. The non-leaf function will build a stack frame on the hub stack including the value from the LR register.

Okay. Now I'm starting to get it. Non-leaf functions build stack frames with the return address pushed last, after a bunch of parameters. So, a CALL effectively RETurns with a bunch of parameters that need to be POP'd off. Is that right?

mindrobots · 2014-02-06 04:59

So this means that the called function is the one that incurs the overhead of building its stack frame. If a function meets all the criteria to be a leaf function, it can avoid the cost of building a stack frame, work out of the register set of the compile environment, and return through the LR register, if your hardware supports that feature. This makes for much faster entry and exit for leaf functions because they avoid any stack frame costs.

If you don't have the LR functionality, then every function has to be treated as non-leaf and every function has to incur stack frame construction and deconstruction costs on entry and exit.

You end up with faster function setup for whatever percentage of functions the compiler can create as leaf functions. But we don't really know what that percentage is: 10%, 50%, 90%....

There also appear to be other criteria a function needs to meet before it can be a leaf function in GCC's eyes.

GCC Leaf Functions

cgracey · 2014-02-06 05:02

rogloh wrote: »
Chip, that sounds like it would work if each leaf function in the code knows it is a leaf function and therefore doesn't call any more functions (the compiler will hopefully already know this; I'm just not sure if GCC has a simple way to add prologue code that depends on it, so that's a very good question for David/Eric etc. as to whether it can be done).

I do have another question though. I am getting the feeling (and I really am only guessing here) that a GCC port won't ever use PTRA/PTRB for the general stack pointer, even if the stack is in hub RAM, because the generated code could not safely be used in a COG with multithreading turned on, and so some general COG register would be used for the stack pointer instead. If that is ultimately the case it is a real shame, because you've given us these great PTRA/PTRB pointer registers with some excellent read/write access methods (stack offset addressing, autoincrement/decrement, etc.) which are ideal for stack pointers accessing variables in hub-based stack frames and for push/pop operations, and we won't be using them fully.

Now you don't get if you don't ask, but is it possible each task in the COG could get its own copy of PTRA/PTRB? If so we could potentially still use them safely as stack pointers in C to get the higher performance and code reduction when accessing the stack.

Consider the extra PASM code required for accessing hub stack frame variables if the stack pointer is held in a general COG register:
        mov     tempstack, stack_pointer
        add     tempstack, #12
        rdlong  data, tempstack
I know assigning "register" variables in C can help alleviate some reading of data from the hub stack each time, but there are only so many registers available.

Also to push data (very common), it will always take 2 instructions, same for pop:
        wrlong  data, tempstack
        sub     tempstack, #4
But having stack pointer in say PTRA, and leaving PTRB free for other arbitrary/general memory accesses (or "BP" type base pointers) you could do your pushes like this:
        wrlong  data, PTRA--
and we can quickly access the stack frame variables for all aligned stacks with less than 32 arguments (pretty normal) like this:
        rdlong  data, PTRA[-3]
The only downside I can see for using PTRA is that if you want to take the address of a hub stack variable you may need to do an extra "getptra" instruction before you start the computation to get the actual stack pointer value into a general COG register first, but that is far less common than pushing/popping data IMO, and so it is probably worth the small overhead in that case.

It would be really great to get the full performance capabilities of PTRA, PTRB in C code with hub based stacks.

Roger.

Yes, I'd like to make PTRA/PTRB/INDA/INDB unique for each task, but the problem is that doing so would immediately create a new critical path and slow the chip down by maybe 10MHz. Is it worth it? It may be.

David Betz · 2014-02-06 05:03

cgracey wrote: »

Okay. Now I'm starting to get it. Non-leaf functions build stack frames with the return address pushed last, after a bunch of parameters. So, a CALL effectively RETurns with a bunch of parameters that need to be POP'd off. Is that right?

Yes, basically that is correct. However, GCC generally passes parameters in registers, not on the stack. But for non-leaf functions, it might have to push some of those registers to make space for the parameters of a nested function.

cgracey · 2014-02-06 05:04

Roy Eltham wrote: »

I think something may be getting missed in this discussion of the different CALL types vs CALLR and how the C compiler uses them.

The compiler is going to code every function call the same way (always using CALLR). When the function being called is a leaf, it will just return through the register; when it is not a leaf, it will push the register to the stack before it calls another function. One of the key points is that the CALLER doesn't know if it's calling a leaf or not; only the CALLEE knows. The function being called may not even be in the same compile unit; it all gets resolved at link time, when there normally isn't any code generation.

Anyway, I think having the CALLR instruction is important, and it will get used heavily in code for the P2.

edit: Dave beat me to this... oh well.

Thanks for the explanation. David had explained this to me before, but I couldn't remember. I'll try to get CALLR worked in.

rogloh · 2014-02-06 05:09

It's a tough call. If it just means going from >210MHz to 200MHz then fine, but from 200->190, that's tough.

evanh · 2014-02-06 05:11

mindrobots wrote: »

GCC Leaf Functions

Damn! Who writes that stuff?! Reading that, one might not think computers have anything called a stack.

David Betz · 2014-02-06 05:14

evanh wrote: »

Damn! Who writes that stuff?! Reading that, one might not think computers have anything called a stack.

The problem is that stacks are expensive to access so you want to avoid doing that if possible. They are of course invaluable for non-leaf functions.

David Betz · 2014-02-06 05:21

rogloh wrote: »

Now you don't get if you don't ask, but is it possible each task in the COG could get its own copy of PTRA/PTRB?

This is a good question. When hardware tasks were first introduced there were many restrictions on them so I had assumed we would probably only want to run a single C task in a COG. However, Chip has done such a good job of removing almost all of those restrictions that it might be worth considering multiple C tasks again.

evanh · 2014-02-06 05:25

David Betz wrote: »

The problem is that stacks are expensive to access so you want to avoid doing that if possible. They are of course invaluable for non-leaf functions.

I'm all for register passing. I think stack frames suck in general; just more baggage. I was meaning there is a lack of info on implications ... and of course no implementation examples. It's a code jargon all of its own.

rogloh · 2014-02-06 05:32

David Betz wrote: »

This is a good question. When hardware tasks were first introduced there were many restrictions on them so I had assumed we would probably only want to run a single C task in a COG. However, Chip has done such a good job of removing almost all of those restrictions that it might be worth considering multiple C tasks again.

I know. Multiple C tasks is one beneficiary of such a change, the other is if we wanted to run C code in a separate task alongside some PASM driver which may or may not use PTRA, PTRB. GCC won't know. So it would probably have to assume it can't ever use PTRA/PTRB. That is what bugs me. I guess there could be some GCC option added saying to compile with PTRA/PTRB as the stack pointer and the developer can carefully choose to set it, but that adds more complexity rather than automatically just using it no matter what. I much prefer simplicity and performance wherever possible.

Bill Henning · 2014-02-06 05:48

Chip,

Why not make the top of each task's 4-level stack visible at $1F1 for that task?

That would be:

- exactly equivalent to CALLR
- provide the link register
- not need any more flip flops

and most important

NO NEED TO SCROUNGE FOR OPCODE SPACE!

Edit: in #173 David already suggested this (sorry David, I missed it while skimming)

cgracey wrote: »

David, did you know there are CALL/RET instructions that just use the 4-level LIFO stack that each task has?

If you called with CALL, and then the CALLee would do a POP reg, you'd have the equivalent of CALLR. This wouldn't waste any special resource and would only take 1 instruction in the leaf function.

cgracey · 2014-02-06 05:49

rogloh wrote: »

I know. Multiple C tasks is one beneficiary of such a change, the other is if we wanted to run C code in a separate task alongside some PASM driver which may or may not use PTRA, PTRB. GCC won't know. So it would probably have to assume it can't ever use PTRA/PTRB. That is what bugs me. I guess there could be some GCC option added saying to compile with PTRA/PTRB as the stack pointer and the developer can carefully choose to set it, but that adds more complexity rather than automatically just using it no matter what. I much prefer simplicity and performance wherever possible.

I'm looking into adding PTRA/PTRB to every task. I think it might work without a timing problem. INDA/INDB would be deadly for timing, though.

cgracey · 2014-02-06 05:51

Bill Henning wrote: »

Chip,

Why not make the top of each task's 4-level stack visible at $1F1 for that task?

That would be:

- exactly equivalent to CALLR
- provide the link register
- not need any more flip flops

and most important

NO NEED TO SCROUNGE FOR OPCODE SPACE!

Okay! That sounds like a plan. I'll look into that after I address this PTRA/PTRB issue.

David Betz · 2014-02-06 05:57

cgracey wrote: »

Okay! That sounds like a plan. I'll look into that after I address this PTRA/PTRB issue.

Yes, that would be perfectly satisfactory. Didn't I mention that a few messages ago? :-)

Bill Henning · 2014-02-06 05:58

No idea... need coffee... I only scanned the thread, mostly reading Chip's posts, saw issue with opcode space... you may have mentioned it first, don't know. Will re-read thread

David Betz wrote: »

Yes, that would be perfectly satisfactory. Didn't I mention that a few messages ago? :-)

evanh · 2014-02-06 06:02

I hope that smile is very tongue-in-cheek! Bill has come up with a solution that did not involve adding a CALLR instruction.

Bill Henning · 2014-02-06 06:03

Per-task PTRA/PTRB would be great!

Too bad about INDA/INDB.. but compiled code will not need that nearly as much as PTRA/PTRB.

cgracey wrote: »

I'm looking into adding PTRA/PTRB to every task. I think it might work without a timing problem. INDA/INDB would be deadly for timing, though.

David Betz · 2014-02-06 06:04

Bill Henning wrote: »

No idea... need coffee... I only scanned the thread, mostly reading Chip's posts, saw issue with opcode space... you may have mentioned it first, don't know. Will re-read thread

No need to re-read. What you said is exactly right.

David Betz · 2014-02-06 06:06

evanh wrote: »

I hope that smile is very tongue-in-cheek! Bill has come up with a solution that did not involve adding a CALLR instruction.

I wasn't objecting to what Bill said. I was just pointing out that I had already suggested the exact same thing in message 173 of this thread. This approach eliminates the need to add any instructions at all.

Bill Henning · 2014-02-06 06:09

Went and looked for it... yes, you did in msg#173

I did not catch it, as I was skimming, reading Chip's posts.

Sorry, you definitely suggested it first.

David Betz wrote: »

Yes, that would be perfectly satisfactory. Didn't I mention that a few messages ago? :-)

cgracey · 2014-02-06 06:09

Do you guys think EVERY call should deposit a return address into a register, or just CALL, which uses the 4-level LIFO stack?

evanh · 2014-02-06 06:13

David Betz wrote: »

I wasn't objecting to what Bill said. I was just pointing out that I had already suggested the exact same thing in message 173 of this thread. This approach eliminates the need to add any instructions at all.

That's a bit more than a few posts back. The real problem at that stage was that the purpose was so confused that any methods were eye-glazing.

Bill Henning · 2014-02-06 06:14

I think it would be useful if every call did, especially if it was the absolute address (ie post relative->absolute mapping)

Why?

SETPTRB $1F1

Now we can use PTRB to access constants/addresses embedded in the code at offsets of (-32..31)*scale
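As a rough illustration of that reach, here is a small Python sketch of the offset arithmetic. The helper name `reachable_offsets` and the assumption that "scale" is the access size in bytes (1 for byte, 2 for word, 4 for long) are mine, not something confirmed in this thread.

```python
# Hypothetical helper: byte offsets reachable from a pointer register when
# the instruction carries a signed index in -32..31 that is multiplied by
# the access size. The scale table is an assumption of this sketch.
SCALE = {"byte": 1, "word": 2, "long": 4}


def reachable_offsets(size):
    """Return every byte offset index*scale for index in -32..31."""
    scale = SCALE[size]
    return [index * scale for index in range(-32, 32)]


long_offsets = reachable_offsets("long")
print(min(long_offsets), max(long_offsets), len(long_offsets))
```

Under those assumptions, long accesses cover 64 slots spanning 128 bytes below to 124 bytes above the pointer value.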

cgracey wrote: »

Do you guys think EVERY call should deposit a return address into a register, or just CALL, which uses the 4-level LIFO stack?

David Betz · 2014-02-06 06:15

Bill Henning wrote: »

Went and looked for it... yes, you did in msg#173

I did not catch it, as I was skimming, reading Chip's posts.

Sorry, you definitely suggested it first.

Actually, I think Chip missed my post too. :-) In any case, I think it will work. Thanks for mentioning it.

One question though, what happens when the LIFO overflows? I assume that the oldest value is just lost but that the most recent four values remain. Is that correct? The LIFO doesn't get cleared on overflow does it?

David Betz · 2014-02-06 06:16

Bill Henning wrote: »

I think it would be useful if every call did, especially if it was the absolute address (ie post relative->absolute mapping)

Why?

SETPTRB $1F1

Now we can use PTRB to access constants/addresses embedded in the code at offsets of (-32..31)*scale

That's true but won't the code that uses these values still have to update the return address on the stack to skip over them when it returns?

cgracey · 2014-02-06 06:20

David Betz wrote: »

Actually, I think Chip missed my post too. :-) In any case, I think it will work. Thanks for mentioning it.

One question though, what happens when the LIFO overflows? I assume that the oldest value is just lost but that the most recent four values remain. Is that correct? The LIFO doesn't get cleared on overflow does it?

The LIFO overflows out the far end, but that doesn't matter. The last 4 values PUSH'd can always be POP'd. It's only 4 levels deep and belongs to the task, only, so nobody cares if it gets abused a little.
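A minimal Python sketch of that overflow behaviour, assuming only what is stated here: the stack is 4 levels deep and the oldest entry is the one that falls off. The class name `TaskLifo` is hypothetical, and raising an error on an empty POP is a choice made for this model, since the thread does not say what the hardware does in that case.

```python
# Toy model of a per-task 4-level LIFO stack (illustrative only).
class TaskLifo:
    DEPTH = 4

    def __init__(self):
        self.slots = []          # oldest entry first, newest entry last

    def push(self, value):
        self.slots.append(value)
        if len(self.slots) > self.DEPTH:
            self.slots.pop(0)    # oldest value falls out the far end

    def pop(self):
        # Assumption of this sketch: popping an empty stack is an error.
        if not self.slots:
            raise IndexError("pop from empty LIFO")
        return self.slots.pop()


lifo = TaskLifo()
for return_address in (1, 2, 3, 4, 5, 6):
    lifo.push(return_address)

# Only the last four values pushed are still available, newest first.
print([lifo.pop() for _ in range(4)])   # prints [6, 5, 4, 3]
```

The values 1 and 2 are gone after the fifth and sixth pushes, which is why deep call chains need the hub or AUX stack modes instead.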

evanh · 2014-02-06 06:26

        CALL            CALLD                    call subroutine using task's 4-level stack
        RET             RETD                     return from subroutine using task's 4-level stack

        CALLA           CALLAD                   call subroutine using HUB[PTRA++]
        RETA            RETAD                    return from subroutine using HUB[--PTRA]

        CALLB           CALLBD                   call subroutine using HUB[PTRB++]
        RETB            RETBD                    return from subroutine using HUB[--PTRB]

        CALLX           CALLXD                   call subroutine using AUX[PTRX++]
        RETX            RETXD                    return from subroutine using AUX[--PTRX]

        CALLY           CALLYD                   call subroutine using AUX[!PTRY++]
        RETY            RETYD                    return from subroutine using AUX[!--PTRY]

Dang! Not as simple as Bill's idea at all.

Bill Henning · 2014-02-06 06:26

Yes, the oldest value is lost. The LIFO is really for small, in-cog driver code; the AUX/HUB stack modes are for deep code.

You are welcome

David Betz wrote: »

Actually, I think Chip missed my post too. :-) In any case, I think it will work. Thanks for mentioning it.

One question though, what happens when the LIFO overflows? I assume that the oldest value is just lost but that the most recent four values remain. Is that correct? The LIFO doesn't get cleared on overflow does it?
