You're right. That didn't occur to me. We would have to decide to cancel same-task instructions in the pipeline, or not. I think cancelling would be the safe bet, as some other cases would be indeterminate with multitasking enabled. So, a write to PC at $1F0 would cancel same-task instructions in the pipeline. Very cool!!! This would reduce table branches to 1 instruction! Works with hub execution, too. Awesome! I think I'll implement this right now, before I go any further. Now we've got one orphaned register in the last 16. Any quick idea for what we could use it for?
How about making that the LR register and adding a CALL instruction that stores its return address in $1F1? This would be an instruction like Bill's HCALL with a 16-bit immediate target address.
Are you suggesting the removal of CALL/RET or just that this is an alternative? I would recommend leaving CALL/RET because it allows one COG mode function to call another without worrying about clobbering the return address of the first one. It doesn't allow recursive calls to the same function but that is probably rare in COG code anyway.
If you accept that I'm not suggesting removing any of the features that you think are important, you've already stated that you agree with the things I've suggested adding. If we can just get both sets of features we'll both be happy! :-)
Of course, reality may get in the way of that but only Chip can say what we will end up with in the end.
The idea is that a CALL_LR instruction (please find a better name!) will store its return address in the LR register. If the function being called is a leaf function then it will just JMP indirect through the LR register to return. If it is a non-leaf function, the LR register will be pushed on the stack along with any other registers that need to be preserved over the call to the nested function. This is exactly the way the current PropGCC code generator works. It saves an instruction over pushing the return address on an AUX stack and then having to pop it into a register in order to push it on a hub stack for non-leaf functions. I guess it doesn't really help over the AUX stack instructions for leaf functions.
Thanks for all the explanations. I understand now. It's a stop-gap measure to avoid always playing with the stack.
I'm not sure what you mean by stop-gap but it does allow you to avoid a pop off the AUX stack for non-leaf functions. It also means you don't even have to touch the AUX memory in case you want to use it for something else.
I got significantly better than that... but it required a ton of tricks, and about a man-month of experiments to make optimal use out of the delay slots for my VM.
FYI, it was that VM that made me ask for what became RDAUX reg,D/# as that will help me shave another cycle ... and even more saving from other uses of it.
Sounds like the classic software moving target, but as this is a software issue, and Spin2 is not in ROM, this is (relatively) easy to address over time. Just allow it to occur.
Spin2 can potentially be built dynamically (mentioned a while back, and just software housekeeping), so it might even fold into that.
Users would tell the Spin2 builder what thread-code to include, and it would then know which Spin2 variant to build.
If we used both $1F0 and $1F1 for PC, but made writes to $1F0 cancel pipelined instructions and writes to $1F1 not cancel, we could get both normal and delayed branches!
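To make the difference between the two proposed PC addresses concrete, here is a rough toy model in Python. It is only a sketch: the number of instructions already in the pipeline behind a branch (`delay_slots=3`) is an assumption made for illustration, and the op names `"PC"` and `"PC_D"` are made up to stand for a write to $1F0 and a write to $1F1 respectively.

```python
def run(program, delay_slots=3):
    """Run a toy program; return (executed_labels, cancelled_count).

    Each instruction is (label, op, target):
      op None   -> ordinary instruction
      op "PC"   -> write to $1F0: branch and cancel pipelined instructions
      op "PC_D" -> write to $1F1: branch, pipelined instructions still run
    delay_slots is an assumed pipeline depth, not a documented figure.
    """
    executed = []
    cancelled = 0
    pc = 0
    while pc < len(program):
        label, op, target = program[pc]
        executed.append(label)
        if op is None:
            pc += 1
            continue
        if op not in ("PC", "PC_D"):
            raise ValueError(f"unknown op: {op}")
        # Instructions already fetched behind the branch.
        in_flight = program[pc + 1 : pc + 1 + delay_slots]
        if op == "PC":
            cancelled += len(in_flight)      # thrown away, cycles wasted
        else:
            # Delayed branch: the in-flight instructions do useful work.
            # (This toy assumes they are ordinary instructions.)
            executed.extend(lbl for lbl, _, _ in in_flight)
        pc = target
    return executed, cancelled


def make_program(branch_op):
    # Hypothetical six-instruction program; index 5 is the branch target.
    return [
        ("a", None, None),
        ("branch", branch_op, 5),
        ("s1", None, None),
        ("s2", None, None),
        ("s3", None, None),
        ("target", None, None),
    ]


print(run(make_program("PC")))    # (['a', 'branch', 'target'], 3)
print(run(make_program("PC_D")))  # (['a', 'branch', 's1', 's2', 's3', 'target'], 0)
```

In this model both variants spend the same number of pipeline slots getting to `target`; the delayed form simply fills them with useful instructions, which is where the single-task speedups mentioned in the thread would come from.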
Comments
Re/ orphan register
I think it would be a great choice for LR for gcc
FYI,
mov pc,#const9
would be equivalent to jump (potentially freeing up a dual opcode)
and if LR was at $1F1, then the cog-style jumps work, and the assembler is simpler - the return value is in a known location, no need to patch the JMPRET
(just noted David proposed that call variant while I was typing this message, I think replacement is better)
and yes,
BIG #highbits
mov pc,#const9 would work for hub jumps, calls, etc., but it would waste a huge amount of hub memory; hence the encoded instructions in post #2 of this thread.
Roughly one in six instructions is a jump or call, so using two longs for each would waste about 16% of the hub (which is why I made the tightly encoded jmp/call variants).
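As a back-of-envelope check of that figure, here is a small Python sketch. It assumes that every jump or call grows from one long to two and that nothing else changes; the function name is made up for illustration.

```python
from fractions import Fraction


def two_long_branch_overhead(branch_fraction):
    """Return (growth, wasted_share) when every branch needs one extra long.

    growth:       extra longs relative to the original code size
    wasted_share: extra longs relative to the enlarged code size
    """
    growth = branch_fraction                 # one extra long per branch
    wasted_share = growth / (1 + growth)
    return growth, wasted_share


growth, wasted = two_long_branch_overhead(Fraction(1, 6))
print(f"code growth:  {float(growth):.1%}")   # code growth:  16.7%
print(f"wasted share: {float(wasted):.1%}")   # wasted share: 14.3%
```

So "16%" matches the growth in code size; measured as a share of the enlarged program, the extra longs come to about 14%. Either way it is a significant amount of hub memory.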
What if we got rid of all the delayed branches?
They are really only good for single-task programs, and in single-task programs, you can use REPS/REPD for zero-overhead looping. Of course, delayed branches give you more MIPS in single-task programs which branch a lot, if you are willing to slow down mentally and apply them properly. They also eat up a lot of instruction space and are a somewhat complex thing to document. Would they be missed?
I'm just thinking that we've got hub and cog execution modes and the delayed branches really stink things up, in a way.
I've seen the LR issue discussed, but I haven't understood what it's about. Could you please explain what it does? David, what do you say?
They seem like a small thing to give up to gain opcode space, my vote is to get rid of them.
C.W.
It's true enough, and I am going to assume you were not trying to say I was trying to mislead.
Now with this part, I agree!
I think that would make code in OBX simpler to adapt to multitasking.
So in my opinion, remove them.
I got some incredible performance in test code on the earlier FPGA code using them... which would now lose about 75% of its performance without delayed branches. It was a virtual machine.
Performance would drop for my app by a factor of 3 or 4.
You should of course do what you think is best; however, for my virtual machines, and the processor emulations I was thinking of, I was able to use every single delay slot, effectively getting 1-cycle branches.
Unless of course
mov pc,#xxxxx
add pc,#xxxx
sub pc,#xxxx
call addr
are single cycle, in which case definitely get rid of the delayed instructions!
Sorry, strongly disagree. It slows down byte codes and processor emulations greatly.
Yep!
I'll try.
Currently, for LMM, Eric uses a link register to hold the return address for subroutine calls, just like ARM chips, and early IBM mainframes.
This is basically the same thing as storing the return address in the JMPRET at the end of a subroutine, but it is in a central, known location (You would need 4 for 4 tasks)
This works extremely well for leaf subroutines (subroutines that do not call any other subroutines) as there is no need to push the return address on a hub stack.
It does not work as well for non-leaf subroutines, because of the extra instructions (WRLONG LR, --SP / RDLONG LR,++SP) they need.
Think of JMPRET that stores the return address in LR, not in the jump at the end of a subroutine.
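The leaf versus non-leaf trade-off described above can be sketched with a toy model in Python. This is not real Propeller 2 behaviour: the class, the method names and the addresses are all hypothetical, and the hub stack is just a Python list with a counter standing in for the WRLONG/RDLONG traffic.

```python
class ToyCog:
    """Toy model of link-register calls (illustrative only)."""

    def __init__(self):
        self.lr = None          # link register (the thread suggests $1F1)
        self.hub_stack = []     # stands in for a stack in hub memory
        self.hub_accesses = 0   # counts WRLONG/RDLONG-style stack traffic

    def call(self, return_addr):
        # CALL: store the return address in LR, no JMPRET patching.
        self.lr = return_addr

    def save_lr(self):
        # Non-leaf prologue, in the spirit of WRLONG LR, --SP.
        self.hub_stack.append(self.lr)
        self.hub_accesses += 1

    def restore_lr(self):
        # Non-leaf epilogue, in the spirit of RDLONG LR, ++SP.
        self.lr = self.hub_stack.pop()
        self.hub_accesses += 1

    def ret(self):
        # Return: jump indirect through LR.
        return self.lr


def leaf(cog):
    return cog.ret()                 # no stack traffic at all


def non_leaf(cog):
    cog.save_lr()                    # the nested call will overwrite LR
    cog.call(return_addr=0x200)      # hypothetical address inside non_leaf
    assert leaf(cog) == 0x200
    cog.restore_lr()
    return cog.ret()


cog = ToyCog()
cog.call(return_addr=0x010)
print(hex(leaf(cog)), cog.hub_accesses)      # 0x10 0
cog.call(return_addr=0x020)
print(hex(non_leaf(cog)), cog.hub_accesses)  # 0x20 2
```

The leaf call never touches the stack, while the non-leaf call pays for one save and one restore of LR, which is the cost the explanation above refers to.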
I was just remembering that I had to get rid of all the delayed branches in the Spin2 interpreter because they wouldn't work if multitasking was enabled. I think I might feel lighter if I nuked them. It would simplify the overall programming picture for people and make code more mixable.
I have a hand-crafted VM I spent > 1 man month on that will slow down by about a factor of 2.5 without the delayed branches and calls.
Btw - I don't need delayed branches/calls in multi-tasking mode at all. Does that help?
It looks like the delayed jump/call/ret instructions consume 23 opcodes.
Is there a small, most-useful subset we could keep?
C.W.
I'd like to have a consistent policy about delayed branches, so I'll probably leave them alone. The thing is, DJNZ will work in hub mode, using an immediate #S as a sign-extended relative offset. That leaves DJNZD in there, too, working for hub mode. It just creates a huge proliferation of delayed-branch instructions as things grow.
I would assume that Bill was using them all, as they are all useful - but only in a single-task cog program. My interpreter is normally single-task, but because people could load code into the spare register space and enable multitasking, my interpreter had to be hardened to accommodate that possibility. Hence, I had to get rid of all delayed branches.
I really only need them in single tasking cog mode. It does not matter to me if delayed is not available when tasking or in hubexec mode.
If it would help, I could make a list of the ones I can do without.
Got it. Hub code that used delayed branches would only be callable by single-tasking cogs, so no universally-callable hub code could even use delayed branches.
That list would be good to see. It might expose something useful.
That's a good variant, but it could be practical in the long term (it's a software management problem, not a silicon one) to support both under a "Spin2 interpreter" umbrella.
Users who want to pack code and threads into the Spin2 COG can do so, and those who know they will not can get a variant that runs Spin2 faster (but lacks the thread-packing).
The speed difference might only be 15%, so it's a lot of headache for a small benefit.
I'll find the latest instruction list, and make a list. It will take a few hours as I am actually visiting the forum while torturing a new pcb design (or is it torturing a new pcb design between visiting the forum?) LOL