HUB EXEC Update Here

cgracey · 2014-02-06 00:54

Cluso99 wrote: »

Chip,

JMPLIST D,S/@/@@

I note that the latest docs (Line 1589+) have conflicting definitions for the index in D or S.

Bill,

The pasm op name is currently limited to 7 characters. I like CALLVEC so perhaps we could rename JMPLIST to JMPVEC too?

Chip & all,

Could all the JMP instructions save a return address in the $1F1 register, without penalty, if the JMP is taken?
Could this solve the GCC request for CALLR by using JMPxx where the return address is saved in $1F1? Note that if further JMPs were used, GCC would need to perform a MOV $1F0,$1F1 to save the return address before it is overwritten.

Wow! That's a fantastic idea!!!

I was just ruminating over how I was going to implement this link register, and I didn't have any idea that I was enthused about. You just solved the problem!

This is great, because JMPs are plentiful and rich, non-delayed and delayed, just like CALLs. The link-register CALL really is just a JMP, anyway.

The only downside to this is that the value must immediately be preserved by the CALLee, before another JMP occurs, but that's not too big of a problem. It will be there for whoever needs it.

David Betz, what do you say about this?

cgracey · 2014-02-06 01:03

Maybe CALLR would be best, because the value would persist through JMPs. The biggest problem is that I need to forge more op-code space.

Baggers · 2014-02-06 01:09

Would putting it in $1f1 be ok for multi threaded programs though?

evanh · 2014-02-06 01:18

cgracey wrote: »

Maybe CALLR would be best, because the value would persist through JMPs. The biggest problem is that I need to forge more op-code space.

True, pretty good chance of leaf functions branching before returning.

David didn't appear to like Cluso's idea but wasn't clear why.

My little issue with all of this is how is this meant to give us a generic RET instruction? Isn't that the point, to have functions that don't distinguish between how they were called?

cgracey · 2014-02-06 01:21

evanh wrote: »

True, pretty good chance of leaf functions branching before returning.

David didn't appear to like Cluso's idea but wasn't clear why.

My little issue with all of this is how is this meant to give us a generic RET instruction? Isn't that the point, to have functions that don't distinguish between how they were called?

For leaf functions using CALLR, with the return address at, say, $1F1, they would just JMP $1F1 to return. I can't picture how you'd know HOW to RETurn, unless you had a special 3-bit-wide stack that would push and pop the call modes in LIFO fashion - it could tie your hands, though.

cgracey · 2014-02-06 01:23

Baggers wrote: »

Would putting it in $1f1 be ok for multi threaded programs though?

$1F1 would have to have a different physical register for each task for that to work in multi-tasking.

evanh · 2014-02-06 01:23

Baggers wrote: »

Would putting it in $1f1 be ok for multi threaded programs though?

I think there is a ways to go before crossing that bridge. Chip might just make it a hidden register like the program counter. Then, any shadowing won't be visible.

evanh · 2014-02-06 01:26

cgracey wrote: »

I can't picture how you'd know HOW to RETurn, unless you had a special 3-bit-wide stack that would push and pop the call modes in LIFO fashion - it could tie your hands, though.

So, the leaf functions were always going to be special cases then? I'm failing to grok the desire to have a link register at all now. Why not just use auxRAM?

cgracey · 2014-02-06 01:27

Cluso99 wrote: »

Does the CALLR require 4 variants like the CALL instruction below?

----  1111110 01 1 CCCC 00 nnnnnnnnnnnnnnnn     CALL    #abs
----  1111110 01 1 CCCC 01 nnnnnnnnnnnnnnnn     CALL    @rel
----  1111110 01 1 CCCC 10 nnnnnnnnnnnnnnnn     CALLD   #abs
----  1111110 01 1 CCCC 11 nnnnnnnnnnnnnnnn     CALLD   @rel

Yes, plus a CALLR D variant.

cgracey · 2014-02-06 01:31

evanh wrote: »

So, the leaf functions were always going to be special cases then? I'm failing to grok the desire to have a link register at all now. Why not just use auxRAM?

Yes, special cases. With the return address in a register, math can be done without any POP'ing.

It's true that you could just CALL to the routine using the task's built-in 4-level LIFO stack and then the CALLee would do a POP into a register (1 clock, 1 instruction), but that would take an extra instruction. I wonder how many leaf functions might be in an application. If it's only 100, CALLR would only save 100 longs.

DAVID, not to keep beating this thing, but how many leaf functions might there be in an application?

cgracey · 2014-02-06 02:31

I just updated the file at the start of this thread.

This update has the standardized instruction spacing for REPS/REPD and delayed branches. It also includes a REPS/REPD circuit for each task. Now CALLs and JMPs that use a D register as an address source can specify WZ/WC to load Z/C from D[31:30]. This will facilitate CALLR returns that restore flags. CALLR is not implemented yet.

For the DE0-Nano, I had to get rid of CTRA pin output to free enough LE's to get a good compile.

Next, I'm going to see about getting REPS down to 0 spacers and REPD down to 2, as well as implement CALLR and Cluso's USB pin instructions.
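
The flag-restoring return described in this update can be modeled in a few lines. This is only a sketch in Python, not PASM: the thread says Z/C load from D[31:30], and the code assumes that means bit 31 is Z, bit 30 is C, and the address sits in the bits below. The names `pack_link` and `jmp_d` are hypothetical.

```python
# Assumed layout (the thread only says "Z/C from D[31:30]"):
# bit 31 = Z, bit 30 = C, address in the bits below.
Z_BIT = 31
C_BIT = 30
ADDR_MASK = (1 << 30) - 1


def pack_link(addr, z, c):
    """Build the long a link-register call might save: flags plus address."""
    return ((z & 1) << Z_BIT) | ((c & 1) << C_BIT) | (addr & ADDR_MASK)


def jmp_d(d, z, c, wz=False, wc=False):
    """Model a JMP through register value d.

    Returns the new (pc, z, c). Flags change only when WZ/WC is given.
    """
    pc = d & ADDR_MASK
    if wz:
        z = (d >> Z_BIT) & 1
    if wc:
        c = (d >> C_BIT) & 1
    return pc, z, c


link = pack_link(0x1234, z=1, c=0)
# Suppose the callee has since changed the flags to z=0, c=1.
print(jmp_d(link, z=0, c=1, wz=True, wc=True))  # (4660, 1, 0)
print(jmp_d(link, z=0, c=1))                    # (4660, 0, 1)
```

With WZ/WC the caller's flags come back along with the address; without them the callee's flags survive the return.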

Baggers · 2014-02-06 02:38

Awesome work Chip, can't wait to have another play tonight

Cluso99 · 2014-02-06 02:48

cgracey wrote: »

Cluso99 wrote: »

Does the CALLR require 4 variants like the CALL instruction below?

----  1111110 01 1 CCCC 00 nnnnnnnnnnnnnnnn     CALL    #abs
----  1111110 01 1 CCCC 01 nnnnnnnnnnnnnnnn     CALL    @rel
----  1111110 01 1 CCCC 10 nnnnnnnnnnnnnnnn     CALLD   #abs
----  1111110 01 1 CCCC 11 nnnnnnnnnnnnnnnn     CALLD   @rel

Yes, plus a CALLR D variant.

I thought this might be so. There are no spare opcodes around here that I can easily see.

Will GCC use all of the CALL types?
Maybe GCC could use the CALLY instructions if they also saved the return address in $1F1 (or another register). Would it be possible to write to both AUXY and a cog register simultaneously in pipeline stage 4?
If not, then perhaps this could be enabled/disabled with a SETCALL x instruction. I don't particularly like swapping what a set of instructions do, but maybe we have no other choice?

Cluso99 · 2014-02-06 02:49

cgracey wrote: »

I just updated the file at the start of this thread.

This update has the standardized instruction spacing for REPS/REPD and delayed branches. It also includes a REPS/REPD circuit for each task. Now CALLs and JMPs that use a D register as an address source can specify WZ/WC to load Z/C from D[31:30]. This will facilitate CALLR returns that restore flags. CALLR is not implemented yet.

For the DE0-Nano, I had to get rid of CTRA pin output to free enough LE's to get a good compile.

Next, I'm going to see about getting REPS down to 0 spacers and REPD down to 2, as well as implement CALLR and Cluso's USB pin instructions.

Thanks Chip. Brilliant work

We can worry about what doesn't fit in DE0 later.

ozpropdev · 2014-02-06 03:06

Nice work Chip!

Re: Latest DOCS, In the PIN TRANSFER section the register SPB is referred to instead of PTRY.

Brian

David Betz · 2014-02-06 03:17

Cluso99 wrote: »

I thought this might be so. There are no spare opcodes around here that I can easily see.

Will GCC use all of the CALL types?
Maybe GCC could use the CALLY instructions if they also saved the return address in $1F1 (or another register). Would it be possible to write to both AUXY and a cog register simultaneously in pipeline stage 4?
If not, then perhaps this could be enabled/disabled with a SETCALL x instruction. I don't particularly like swapping what a set of instructions do, but maybe we have no other choice?

I was hoping to stay away from any of the stack-oriented CALL instructions because they would waste what might otherwise be a valuable register. In the case you mention, the Y register would be unavailable for any other use and at least one location in the AUX memory would have to be reserved just so it could be trashed by the CALLY instruction. In fact, if you never pop the return address off the AUX stack then the entire AUX RAM would eventually be trashed.
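
The last point can be illustrated with a small model. This is a sketch under stated assumptions, not a description of the actual hardware: it assumes a 256-long AUX RAM, a Y pointer that increments on each push, and wrap-around at the end. The class and function names are hypothetical.

```python
AUX_LONGS = 256  # assumed AUX RAM size in longs


class AuxStack:
    """Toy model of an AUX stack that is pushed to but never popped."""

    def __init__(self):
        # None marks a location that still holds the program's own data.
        self.ram = [None] * AUX_LONGS
        self.ptry = 0

    def cally(self, return_addr):
        """Model a stack-oriented call: write the return address, advance Y."""
        self.ram[self.ptry] = return_addr
        self.ptry = (self.ptry + 1) % AUX_LONGS  # assumed wrap-around


def untouched(aux):
    """Count locations that have not been overwritten by a return address."""
    return sum(1 for value in aux.ram if value is None)


aux = AuxStack()
aux.cally(0x100)
print(untouched(aux))  # 255

for n in range(1, AUX_LONGS):
    aux.cally(0x100 + n)
print(untouched(aux))  # 0
```

If the callee returns through a link register instead of popping, every call consumes one more AUX location until nothing is left.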

cgracey · 2014-02-06 03:22

David Betz wrote: »

I was hoping to stay away from any of the stack-oriented CALL instructions because they would waste what might otherwise be a valuable register. In the case you mention, the Y register would be unavailable for any other use and at least one location in the AUX memory would have to be reserved just so it could be trashed by the CALLY instruction. In fact, if you never pop the return address off the AUX stack then the entire AUX RAM would eventually be trashed.

David, did you know there are CALL/RET instructions that just use the 4-level LIFO stack that each task has?

If you called with CALL, and then the CALLee would do a POP reg, you'd have the equivalent of CALLR. This wouldn't waste any special resource and would only take 1 instruction in the leaf function.

cgracey · 2014-02-06 03:23

ozpropdev wrote: »

Nice work Chip!

Re: Latest DOCS, In the PIN TRANSFER section the register SPB is referred to instead of PTRY.

Brian

Got it. Thanks.

cgracey · 2014-02-06 03:31

cgracey wrote: »

Next, I'm going to see about getting REPS down to 0 spacers and REPD down to 2...

I just compiled the logic to do this and it creates a critical path that sticks way out. I remember now that there were timing reasons I made it operate off of the outputs of flipflops, rather than trying to resolve everything in one clock cycle. REPS and REPD spacer rules aren't going to change, after all.

Sapieha · 2014-02-06 03:44

Hi Chip.

I think it is fine as it is now: the spacers are consistent, with no difference between multi-task and single-task operation.
There is no need to put much work into eliminating them.

cgracey wrote: »

I just compiled the logic to do this and it creates a critical path that sticks way out. I remember now that there were timing reasons I made it operate off of the outputs of flipflops, rather than trying to resolve everything in one clock cycle. REPS and REPD spacer rules aren't going to change, after all.

David Betz · 2014-02-06 04:11

cgracey wrote: »

Yes, special cases. With the return address in a register, math can be done without any POP'ing.

It's true that you could just CALL to the routine using the task's built-in 4-level LIFO stack and then the CALLee would do a POP into a register (1 clock, 1 instruction), but that would take an extra instruction. I wonder how many leaf functions might be in an application. If it's only 100, CALLR would only save 100 longs.

DAVID, not to keep beating this thing, but how many leaf functions might there be in an application?

I'm afraid I don't have time to do an analysis of this. Eric once said that leaf functions were very common. It's not only the amount of memory you save but also the slightly faster execution. If you don't want to bother with this then just leave it out but I think this instruction would be used a lot more than 99% of the other new instructions that are in the P2. However, I guess the Propeller is primarily a PASM machine and not optimized as a target for high-level languages so maybe it doesn't matter. You might want to ask how many people will actually code it in PASM compared with the number that will use some high level language whether it be C or Spin or something else. Maybe you could get rid of something like JMPLIST to make room. There will be far more leaf functions than table dispatches I would think. General instructions are usually better than special-purpose instructions that the compiler might have a hard time using.

David Betz · 2014-02-06 04:13

cgracey wrote: »

David, did you know there are CALL/RET instructions that just use the 4-level LIFO stack that each task has?

If you called with CALL, and then the CALLee would do a POP reg, you'd have the equivalent of CALLR. This wouldn't waste any special resource and would only take 1 instruction in the leaf function.

As you say, it adds an extra instruction. I guess you have to decide if you care about high level language performance or not. What if you just made the CALL instruction also write its return address to a fixed register like $1f1, with a separate copy of that register for each hardware task? I guess that would work.

cgracey · 2014-02-06 04:17

David Betz wrote: »

As you say, it adds an extra instruction. I guess you have to decide if you care about high level language performance or not. What if you just made the CALL instruction also write its return address to a fixed register like $1f1, with a separate copy of that register for each hardware task? I guess that would work.

But you want to RETurn to a location that is the linkreg value plus some amount, right?

David Betz · 2014-02-06 04:21

cgracey wrote: »

But you want to RETurn to a location that is the linkreg value plus some amount, right?

I don't see a need to add something to the link register before returning. If I did want to do that couldn't I just add to the value in $1f1?

cgracey · 2014-02-06 04:23

David Betz wrote: »

I don't see a need to add something to the link register before returning. If I did want to do that couldn't I just add to the value in $1f1?

So, is the point of the link register just to avoid a stack push and pop? Or, do you need to see the linkreg value to look something up?

David Betz · 2014-02-06 04:28

cgracey wrote: »

So, is the point of the link register just to avoid a stack push and pop? Or, do you need to see the linkreg value to look something up?

Yes, the point of the LR is to avoid having to push/pop to the hub stack for leaf functions. However, having the LR visible is necessary because in non-leaf functions it must be pushed on the hub stack.

cgracey · 2014-02-06 04:36

David Betz wrote: »

Yes, the point of the LR is to avoid having to push/pop to the hub stack for leaf functions. However, having the LR visible is necessary because in non-leaf functions it must be pushed on the hub stack.

About the first point: CALL/RET that use the 4-level task-tied LIFO are one clock each. How could that be any faster?

About the second point: Wouldn't non-leaf functions have the return address already on the stack? I'm still not understanding what the point of the linkreg is.

Is a linkreg valuable for anything other than a leaf function?

evanh · 2014-02-06 04:41

In the CALLR ideal: Are the leaf functions called with the same instruction and setup as non-leaf functions?

David Betz · 2014-02-06 04:48

cgracey wrote: »

About the first point: CALL/RET that use the 4-level task-tied LIFO are one clock each. How could that be any faster?

About the second point: Wouldn't non-leaf functions have the return address already on the stack? I'm still not understanding what the point of the linkreg is.

Is a linkreg valuable for anything other than a leaf function?

The compiler will generate the same call instruction for leaf and non-leaf functions. Since functions can be separately compiled there is no way for the caller to know if the callee is leaf or non-leaf. Also, there would be no way for an indirect call to know if the target is leaf or non-leaf. A leaf function will just jump indirect through the LR to return. The non-leaf function will build a stack frame on the hub stack including the value from the LR register.
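
A minimal sketch of that convention, modeled in Python rather than PASM. The `Machine` class, the function names, and the integer return addresses are all hypothetical. The point is that the caller always issues the same call, and only the callee decides whether the link register needs to go to the hub stack.

```python
class Machine:
    """Toy model: one link register (LR) plus a hub-RAM stack."""

    def __init__(self):
        self.lr = None
        self.hub_stack = []
        self.hub_accesses = 0
        self.trace = []

    def call(self, func, return_to):
        # The same call is used for every callee, leaf or not.
        self.lr = return_to
        return func(self)


def leaf(m):
    m.trace.append("leaf")
    return m.lr  # return by jumping indirect through LR; no hub access


def non_leaf(m):
    # Prologue: save LR in the stack frame, because the nested call
    # below will overwrite it.
    m.hub_stack.append(m.lr)
    m.hub_accesses += 1
    m.trace.append("non_leaf")
    m.call(leaf, 201)
    # Epilogue: recover the saved return address from the frame.
    saved = m.hub_stack.pop()
    m.hub_accesses += 1
    return saved


m = Machine()
print(m.call(leaf, 101), m.hub_accesses)      # 101 0
print(m.call(non_leaf, 102), m.hub_accesses)  # 102 2
```

The leaf call never touches hub RAM, while the non-leaf call pays one write and one read for the saved LR.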

rogloh · 2014-02-06 04:52

cgracey wrote: »

David, did you know there are CALL/RET instructions that just use the 4-level LIFO stack that each task has?

If you called with CALL, and then the CALLee would do a POP reg, you'd have the equivalent of CALLR. This wouldn't waste any special resource and would only take 1 instruction in the leaf function.

Chip, that sounds like it would work if each leaf function knows it is a leaf function and therefore doesn't call any more functions. The compiler will hopefully already know this; I'm just not sure whether GCC has a simple way to add prologue code that depends on it. That's a very good question for David, Eric, etc.

I do have another question, though. I am getting the feeling (and I really am only guessing here) that a GCC port won't ever use PTRA/PTRB as the general stack pointer, even if the stack is in hub RAM. If it did, we could not safely use the generated code in a COG running with multithreading turned on, so some general COG register would be used for the stack pointer instead. If that is ultimately the case, it is a real shame. You've given us these great PTRA/PTRB pointer registers with excellent read/write access methods that allow stack offset addressing, autoincrement/decrement, etc. They are ideal for stack pointers that access variables in hub-based stack frames and for push/pop operations, and we won't be using them fully.

Now you don't get if you don't ask, but is it possible each task in the COG could get its own copy of PTRA/PTRB? If so we could potentially still use them safely as stack pointers in C to get the higher performance and code reduction when accessing the stack.

Consider the extra PASM code required for accessing hub stack frame variables if the stack pointer is held in a general COG register

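mov     tempstack, stack_pointer    ' copy the stack pointer into a scratch register
add     tempstack, #12              ' add the byte offset of the variable in the frame
rdlong  data, tempstack             ' read the long from hub RAM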

I know assigning "register" variables in C can help alleviate some reading of data from the hub stack each time, but there are only so many registers available.

Also to push data (very common), it will always take 2 instructions, same for pop

wrlong  data, tempstack
sub     tempstack, #4

But having stack pointer in say PTRA, and leaving PTRB free for other arbitrary/general memory accesses (or "BP" type base pointers) you could do your pushes like this

wrlong  data, PTRA--

and we can quickly access the stack frame variables for all aligned stacks with less than 32 arguments (pretty normal) like this

rdlong  data, PTRA[3]

The only downside I can see for using PTRA is that if you want to take the address of a hub stack variable you may need to do an extra "getptra" instruction before you start the computation to get the actual stack pointer value into a general COG register first, but that is far less common than pushing/popping data IMO, and so it is probably worth the small overhead in that case.

It would be really great to get the full performance capabilities of PTRA, PTRB in C code with hub based stacks.

Roger.
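
The instruction counts in the post above can be tallied with a rough model. This is a Python sketch, not PASM: hub RAM is a dictionary, each helper returns its result along with the number of instructions the post attributes to that sequence, and `PTRA--` is assumed to write first and then subtract 4, mirroring the two-instruction version. All function names are hypothetical.

```python
LONG = 4  # bytes per long; PTRA[3] in the post matches byte offset 12


def read_with_cog_pointer(hub, stack_pointer, byte_offset):
    """mov / add / rdlong: three instructions."""
    tempstack = stack_pointer   # mov    tempstack, stack_pointer
    tempstack += byte_offset    # add    tempstack, #byte_offset
    return hub[tempstack], 3    # rdlong data, tempstack


def read_with_ptra(hub, ptra, index):
    """rdlong data, PTRA[index]: one instruction, index scaled by LONG."""
    return hub[ptra + index * LONG], 1


def push_with_cog_pointer(hub, pointer, data):
    """wrlong / sub: two instructions."""
    hub[pointer] = data         # wrlong data, pointer
    return pointer - LONG, 2    # sub    pointer, #4


def push_with_ptra(hub, ptra, data):
    """wrlong data, PTRA--: one instruction."""
    hub[ptra] = data
    return ptra - LONG, 1


hub = {0x1000 + 12: 0xCAFE}
print(read_with_cog_pointer(hub, 0x1000, 12))  # (51966, 3)
print(read_with_ptra(hub, 0x1000, 3))          # (51966, 1)
```

Both reads fetch the same long, and both pushes leave the pointer in the same place; the PTRA forms simply do it in one instruction each.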
