Without an AUGD, JMPSW can only jump to a cog address, and adding the AUGD would make it two longs, which would waste a lot of memory.
ZCWS 1011111 ZC I CCCC DDDDDDDDD SSSSSSSSS JMPSW D,S/@ (jump to S/@, store return address in D, WZ/WC to save/load flags)
ZCR- wr 1111111 ZC x CCCC DDDDDDDDD xxxx01000 JMP D (jump to D[16:0] and write {Z,C,P[16:0]} to $1EF)
It looks to me like JMP is getting its address out of the D register and JMPSW is getting its address out of the S register (ignoring the relative variant). And both write PC+1 to a register. And both use WZ/WC to save (and presumably restore) Z/C.
>_RET labels at the end of subroutines, and any RET will return you.
Saves you from having to jump to the _RET label if you want to have multiple exit points in the subroutine.
Space-wise it would be the same; it speeds things up a little, though.
Thank you for the explanation. It's kind of what I was imagining.
Which leaves me with the question: WTF is it for?
With the good old JMPRET if I am calling a subroutine, which calls a subroutine, which calls a subroutine ...everything works just fine. Every subroutine call's return address gets stored at a unique address for that subroutine.
Having a stack in which to store return addresses only helps if I want to make recursive calls. Either the routine calls itself directly or something it calls calls back to it.
Such recursive calls are a rare thing in micro-controller land.
When do we need this? Perhaps, for example, for the recursive FIBO benchmark. But then a depth of 4 is pretty much useless. Do we need the recursive FIBO benchmark?
How would a C compiler use this? How would it handle the stack overflow? How would it know if that might happen? And where do the parameters to such recursive calls go?
I'm not sure what HUB memory has to do with this. Surely a stack can be built in COG registers if need be by a PASM programmer?
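To make the JMPRET argument above concrete, here is a toy model in Python (not PASM; the class and function names are made up for illustration). It assumes each subroutine owns exactly one return-address slot, which is roughly what happens when JMPRET patches a routine's own RET. Nested calls to different routines come back correctly, while a routine that re-enters itself overwrites its own slot.

```python
class ReturnSlotMachine:
    """Toy model: every subroutine has exactly one return-address slot."""

    def __init__(self):
        self.ret_slot = {}  # subroutine name -> saved return address

    def call(self, name, return_addr):
        # JMPRET-style: overwrite the slot that belongs to `name`.
        self.ret_slot[name] = return_addr

    def ret(self, name):
        # Return to whatever address is currently in the slot.
        return self.ret_slot[name]


def nested_demo():
    # main -> a -> b -> c, each routine has its own slot.
    m = ReturnSlotMachine()
    m.call("a", 100)
    m.call("b", 200)
    m.call("c", 300)
    return [m.ret("c"), m.ret("b"), m.ret("a")]


def recursive_demo():
    # main -> fib -> fib: the second call overwrites the first return address.
    m = ReturnSlotMachine()
    m.call("fib", 100)
    m.call("fib", 200)
    return [m.ret("fib"), m.ret("fib")]


print(nested_demo())     # [300, 200, 100]
print(recursive_demo())  # [200, 200] -- the outer return address (100) is lost
```

So the per-routine slot is enough for any non-recursive call tree, which is the point being made; only re-entry breaks it.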
Saves you from having to jump to the _RET label if you want to have multiple exit points in the subroutine.
You don't have to do that now. jmp sub_ret works just as well as, and one instruction faster than, jmp #sub_ret. In fact, if you use jmpret instead of call you can share return-address depositories with other non-nested routines, saving register space.
Heater,
Yeah, I was thinking the wrong thing when I was trying to explain the 4 long stack. The stack-based call/ret makes having multiple exit points from a function less of a hassle to worry about. I think the reasoning was that it was desirable for call/ret to work without JMPRET's self-modifying-code approach, and it was made 4 deep because Chip determined that was enough depth for cog-sized PASM code.
So WTF is this tiny stack for actually?
Compilers will not use the 4 level stack. PropGCC will use the JMP with the link register at $1EF and push it onto a stack only in non-leaf functions.
Other compilers will use CALLA and RETA, which manage a hub stack with PTRA as the stack pointer.
For simple calls in handcrafted PASM we have the 4 level fast stack. The JMPRET approach is no longer possible if you also want to execute PASM from hub. The RET would then be in hub RAM, where the JMPRET can't easily modify it.
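As a rough illustration of the 4 level fast stack, here is a small Python model (again a sketch, not PASM; `HardwareLifo` is a hypothetical name). The one thing it is meant to show is that call and return only touch the stack, never the code, so it makes no difference where the code lives. What the real hardware does on a fifth nested call is not stated in this thread; raising an error here is purely an assumption of the sketch.

```python
class HardwareLifo:
    """Toy model of a fixed-depth return-address stack (depth 4, per the thread)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.slots = []

    def call(self, return_addr):
        # Assumption: overflow is treated as an error in this sketch.
        # The thread does not say what the hardware would actually do.
        if len(self.slots) == self.depth:
            raise OverflowError("call nesting deeper than %d" % self.depth)
        self.slots.append(return_addr)

    def ret(self):
        # Last in, first out: return to the most recent caller.
        return self.slots.pop()


def lifo_demo():
    s = HardwareLifo()
    for addr in (0x010, 0x020, 0x030, 0x040):
        s.call(addr)
    return [s.ret() for _ in range(4)]


def overflow_demo():
    s = HardwareLifo()
    try:
        for addr in range(5):
            s.call(addr)
    except OverflowError:
        return True
    return False


print(lifo_demo())      # [64, 48, 32, 16]
print(overflow_demo())  # True
```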
Heater
LMM has to handle the CALL and RET with cog-helper routines that do an indirect access with self-modifying code - all very costly.
I don't think we want to do that in hubexec mode.
It's a bit confusing because we have three versions of the JMP instruction: immediate17, relative17 and register indirect.
If I understand you correctly, you think we don't need the register-indirect version, because it can be done with a JMPSW. I think this will work if we use a read-only register in the D field of JMPSW, for example the CNT or RND register.
So we need a four level stack ... but it's not going to be used.
So, tell me again, very slowly, why do we need it?
Okay slowly:
W e _ n e e d _ i t _ f o r _ f a s t _ c a l l s _ i n _ p u r e _ P A S M _ c o d e _ t h a t _ c a n _ r u n _ f r o m _ c o g r a m _ o r _ h u b r a m .
Perhaps you still think that hubexec will not get implemented on the P1+, but I think the current state is that Chip will implement it. Perhaps you don't understand that the P1+ uses exactly the same instructions for cogexec and hubexec, so you can run the same code from both memories. Or perhaps you just got up on the wrong side of the bed today.
It's looking like the hardware stack's real advantage is not that it has more than one level or that it is fast but that it is spare for handcrafted use because the compiler isn't using it.
Makes good sense to me, and Chip has already decided it is worth the effort to include.
Of course, anyone who wants to is free to ignore it.
Compilers do not yet use it, and this sounds like a compelling enough case: "...uses exactly the same instructions for cogexec and hubexec, so you can run the same code from both memories."
The moment a compiler is allowed to use it, its "speed" advantage vanishes, because you have to conform to the compiler's stack management.
Ariba,
You should know me by now. I get up on the wrong side of the bed every day.
OK. That sounds great...but...
When would you ever need to be able to place the exact same code in HUB or COG?
Why is it necessary for HUB resident code to look exactly like COG resident code?
Who is ever going to make use of that idea?
We have already determined that compilers will not.
My FFT in C can be compiled for the P1 with or without FCACHE. Its inner loops can be run from HUB or loaded to COG and run. I have no idea how the code is compiled differently in each case. Neither do I care. All I know is that when compiled with FCACHE it runs nearly as fast as my hand-crafted PASM version!
generic comment (not aimed at heater or anyone in particular)
Where is this "if it is not used by gcc turf it" attitude coming from?
Adding some capabilities to make life easier for some compiler is great, even if assembly language programmers don't need/want it.
Why is adding capabilities that make life easier/better/faster for assembly language programmers seen as evil by some?
I find it more than a bit hypocritical.
The best, tightest, most amazing pieces of code will be in assembly language.
You don't want to use a capability? Don't use it.
You don't need it? Don't use it.
Kindly don't try to tell the rest of us not to use it.
Same goes for helper instructions that reduce memory requirements (i.e. using a single long instead of 2 or 3) for compiler-generated code.
512KB is not infinite, and I'd rather have more ram left for arrays, data, and display buffers than wasting it on two instructions - where a helper could use one.
Ditto for tasks if Chip adds them.
You find them too complicated? Don't use them. Don't try to stop me from using them (if Chip puts them in)
You prefer co-operative multi-tasking? Great! Go for it! But don't try to stop me from using tasks (if Chip puts them in)
Bill/Ariba, you've sold me. It took a while to come to terms with the fact that it's really the separation, an independent stack that doesn't have any software management, that is desired. A second separate LR would be almost as good.
I guess I always envisioned an in-cog, stack-based call/return as something like this:
call == jmpret sp--,#dest
retn == jmp ++sp
That way the stack could start wherever you want and be as deep as you want it to be. sp may, by necessity, have to be a special register, since there are no extra bits to indicate post-decrement or pre-increment. And jmpret maintains its original flavor when the destination register is other than an SFR; similarly for jmp.
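For anyone who wants to poke at the idea, here is a minimal Python sketch of that call/retn pair (not PASM; `CogStack` and its default size are made up for illustration). It reads `sp--` as "store the return address at [sp], then decrement" and `++sp` as "increment, then jump to [sp]", so the stack grows downward from wherever sp starts.

```python
class CogStack:
    """Toy model of a return stack kept in ordinary cog registers."""

    def __init__(self, size=16):
        self.regs = [0] * size  # stand-in for a block of cog registers
        self.sp = size - 1      # start at the top; the stack grows downward

    def call(self, return_addr):
        # call == jmpret sp--,#dest : store at [sp], then post-decrement sp
        self.regs[self.sp] = return_addr
        self.sp -= 1

    def retn(self):
        # retn == jmp ++sp : pre-increment sp, then jump to [sp]
        self.sp += 1
        return self.regs[self.sp]


def cog_stack_demo(depth=6):
    s = CogStack()
    for level in range(depth):
        s.call(1000 + level)
    return [s.retn() for _ in range(depth)]


print(cog_stack_demo())  # [1005, 1004, 1003, 1002, 1001, 1000]
```

The depth is limited only by how many registers are set aside, which is the attraction compared with a fixed four-deep stack.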
A fixed, four-deep stack does seem a bit on the stingy side. OTOH, I can't readily come up with an example from my own PASM code where it would have been inadequate. (And I'm not about to write a Perl script to do that kind of "in-depth" -- pardon the pun -- analysis.)
Disclaimer: I haven't followed this thread closely enough to know whether an idea like this has already been hashed out and discarded. If so, in the words of Emily Litella,
Comments
You can't modify the RET instruction with hubexec. On Prop1 the CALL (jmpret) had to modify the S field of the RET (also a jmpret).
@Roy
Nested calls were possible in PASM1, just no recursion.
Andy
-Phil
Yeah, I was thinking the wrong thing when I was trying to explain the 4 long stack. The stack-based call/ret makes having multiple exit points from a function less of a hassle to worry about. I think the reasoning was that it was desirable for call/ret to work without JMPRET's self-modifying-code approach, and it was made 4 deep because Chip determined that was enough depth for cog-sized PASM code.
Compilers will not use the 4 level stack. PropGCC will use the JMP with the link register at $1EF and push it onto a stack only in non-leaf functions.
Other compilers will use CALLA and RETA, which manage a hub stack with PTRA as the stack pointer.
For simple calls in handcrafted PASM we have the 4 level fast stack. The JMPRET approach is no longer possible if you also want to execute PASM from hub. The RET would then be in hub RAM, where the JMPRET can't easily modify it.
Andy
Which we should know already because there is P1 code around that does exactly that.
So, WTF is this tinsy, useless stack for?
Thanks for that, I obviously have to think this through...
But before I do, how come LMM code in the P1 did not need this silly little stack but hubexec on the PII does?
LMM has to handle the CALL and RET with cog-helper routines that do an indirect access with self-modifying code - all very costly.
I don't think we want to do that in hubexec mode.
Andy
It's a bit confusing because we have three versions of the JMP instruction: immediate17, relative17 and register indirect.
If I understand you correctly, you think we don't need the register-indirect version, because it can be done with a JMPSW. I think this will work if we use a read-only register in the D field of JMPSW, for example the CNT or RND register.
Andy
So, tell me again, very slowly, why do we need it?
I'd use it.
I always intensely disliked the
business.
The 4 level LIFO stack also allows generic subroutines that can be called regardless of living in cog or hub space.
You can use JMPSW as if it was JMPRET if you insist, but that only works for cog-only code.
Using the LR at $1EF for cog-only code is a pain, as you'd have to save it before calling another routine.
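The LR point can be sketched the same way in Python (a toy model only; `LinkRegisterMachine` and the `saved` list are hypothetical names, and where the saved value actually lives is left open). A single link register is fine for a leaf routine, but a routine that calls something else has to save LR first or its own return address is gone.

```python
class LinkRegisterMachine:
    """Toy model of calls through a single link register."""

    def __init__(self):
        self.lr = None   # the one link register (the thread places it at $1EF)
        self.saved = []  # software save area, an assumption of this sketch

    def call(self, return_addr):
        # Every call overwrites the link register.
        self.lr = return_addr

    def ret(self):
        return self.lr


def non_leaf_without_save():
    m = LinkRegisterMachine()
    m.call(100)            # main -> outer
    m.call(200)            # outer -> leaf, clobbers LR
    leaf_ret = m.ret()
    outer_ret = m.ret()    # wrong: still 200
    return leaf_ret, outer_ret


def non_leaf_with_save():
    m = LinkRegisterMachine()
    m.call(100)            # main -> outer
    m.saved.append(m.lr)   # outer saves LR before calling on
    m.call(200)            # outer -> leaf
    leaf_ret = m.ret()
    m.lr = m.saved.pop()   # outer restores LR
    outer_ret = m.ret()
    return leaf_ret, outer_ret


print(non_leaf_without_save())  # (200, 200)
print(non_leaf_with_save())     # (200, 100)
```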
I'm still not convinced this silly little 4-slot stack is worth anything yet.
I agree, Chip should expand it to at least 42 slots, otherwise any robots built with it might be prone to depression.
C.W.
So WhyTF do we need this little stack again?
All small enough to run in a cog based library (flib) or from the hub if cog space is not available.
I love the idea of being able to have small routines callable regardless of where they live.
(as an aside, if we had fast aux or cog stacks, I would not resist getting rid of the LIFO)
retn == jmp ++sp
That way the stack could start wherever you want and be as deep as you want it to be. sp may, by necessity, have to be a special register, since there are no extra bits to indicate post-decrement or pre-increment. And jmpret maintains its original flavor when the destination register is other than an SFR; similarly for jmp.
A fixed, four-deep stack does seem a bit on the stingy side. OTOH, I can't readily come up with an example from my own PASM code where it would have been inadequate. (And I'm not about to write a Perl script to do that kind of "in-depth" -- pardon the pun -- analysis.)
Disclaimer: I haven't followed this thread closely enough to know whether an idea like this has already been hashed out and discarded. If so, in the words of Emily Litella,
-Phil
?? Libraries are one obvious area where all the features you claim have no use will be very useful.
I'm still missing the point of all your posts, Chip already has this in there - if it bothers you so much, just ignore it.