If we used both $1F0 and $1F1 for PC, but made writes to $1F0 cancel pipelined instructions and writes to $1F1 not cancel, we could get both normal and delayed branches!
I really like putting the Spin interpreter into the top level object (i.e. the executable), as over time Spin will just keep getting better and faster as everyone learns more and more about how to best exploit P2's capabilities.
Sounds like the classic software moving target, but as this is a software issue, and Spin2 is not in ROM, it is (relatively) easy to address over time. Just allow it to occur.
Spin2 could potentially be built dynamically (this was mentioned a while back and is just software housekeeping), so it might even fold into that?
Users would tell the Spin2 builder what thread-code to include, and it then knows what Spin2 variants to craft with.
Probably it is just a personal taste thing but having two COG registers acting as the PC just seems a little weird to me. I would have much preferred one register being the PC (eg. $1F0) and the other being the proposed LR ($1F1), as it's just tidier. Maybe it has to be done like that to get the instruction count down but it still feels strange.
It is a little weird, but there's no other way to handle both cancelling and non-cancelling branches that modify the PC directly.
As Bill has correctly pointed out, even the LR version of CALL is not absolutely necessary. It can be implemented by starting each C function with a pop of the return address off the AUX stack into an LR register. It would just be a little more space and time efficient to have the CALL_LR instruction, or whatever it would be called.
Actually that depends on the AUX memory not being used for anything else (like video). It seems to me that the LR version is safer for that reason, although obviously having both would be beneficial depending on circumstances.
If:
- $1F0 is the PC, and writing to it means a non-cancelling jump
- $1F1 is the cancelling version of PC
- $1EF was made LR (fixed at that address)
then:
- we can get rid of JMP and JMPD, replaced with
MOV PC,#addr and
MOV PCD,#addr
- CALL D/# would write the next cog address to LR instead of the jmpret at the end of the subroutine, then move D/# into PC
- CALLD D/# would write the next cog address to LR instead of the jmpret at the end of the subroutine, then move D/# into PCD
- RET would be replaced with MOV PC,LR
- RETD would be replaced with MOV PCD,LR
The stack versions of cog call/return need not be modified
- some more of the opcodes from the list above could probably be freed by versions of copying to PC or PCD
- we would automatically get 'D' versions of
ADD PC,#val
ADD PCD,#val
SUB PC,#val
SUB PCD,#val
I suspect it would free up some dual op opcodes.
I think this would make it worthwhile to have a permanent LR at $1EF.
edit: unfortunately this would complicate nested non-aux-stack in-cog subroutines. Argh!
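To make the difference between the two kinds of PC write concrete, here is a minimal Python sketch. It is a toy model, not the P2 pipeline: it assumes three delay slots behind every branch (the figure quoted elsewhere in this thread), and the instruction kinds `mov_pc` (standing in for JMP) and `mov_pcd` (standing in for JMPD) are just labels for this illustration.

```python
DELAY_SLOTS = 3  # assumption: three instructions are already in flight behind a branch


def run(program, max_steps=100):
    """Return the names of the ordinary instructions that actually execute.

    program is a list of (kind, arg) tuples:
      ("op", name)        an ordinary instruction
      ("mov_pc", target)  cancelling branch: following instructions are discarded
      ("mov_pcd", target) delayed branch: the next DELAY_SLOTS instructions still run
    """
    executed = []
    pc = 0
    pending = None  # [slots_left, target] while a delayed branch is in flight
    steps = 0
    while pc < len(program) and steps < max_steps:
        steps += 1
        kind, arg = program[pc]
        pc += 1
        if kind == "mov_pc":
            pc = arg          # jump at once; nothing behind the branch executes
            pending = None
            continue
        if kind == "mov_pcd":
            pending = [DELAY_SLOTS, arg]
            continue
        executed.append(arg)  # ordinary instruction
        if pending is not None:
            pending[0] -= 1
            if pending[0] == 0:   # last delay slot done, now take the branch
                pc = pending[1]
                pending = None
    return executed


body = [("op", "s1"), ("op", "s2"), ("op", "s3"), ("op", "skipped"), ("op", "target")]
delayed = run([("op", "a"), ("mov_pcd", 6)] + body)
cancelled = run([("op", "a"), ("mov_pc", 6)] + body)
print(delayed)
print(cancelled)
```

In the delayed case the three instructions after the branch do useful work before the jump lands; in the cancelling case they are thrown away, which is why filling delay slots pays off.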
True. I tried to make that point as well. It seems a waste to use the AUX stack to hold one value temporarily.
If there isn't a fixed location for LR then we can go back to my original proposal where we add a "SETLR" instruction to set an internal register that remembers which COG address to use as LR. Then the CALL_LR instruction will just use the location indicated by that hidden internal register. I suppose it would be nice to be able to read that register as well although I guess that's not absolutely necessary. It's interesting that two PCs are wanted now to support a delayed branch when just a little while ago there was a proposal to remove all of the delayed jumps.
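A rough Python sketch of the SETLR idea follows. Everything here is hypothetical: the class `CogModel`, the method names, the default LR location, and the optional read-back are assumptions made for illustration, not an actual P2 encoding.

```python
class CogModel:
    """Toy cog with 512 registers and a hidden LR selector (hypothetical)."""

    def __init__(self):
        self.regs = [0] * 512
        self.pc = 0
        self.lr_select = 0x1EF  # assumed default; the thread does not fix one

    def setlr(self, addr):
        """SETLR: remember which cog register acts as LR."""
        self.lr_select = addr & 0x1FF

    def getlr(self):
        """Optional read-back of the hidden selector."""
        return self.lr_select

    def call_lr(self, target):
        """CALL_LR: store the next cog address in the selected LR, then jump."""
        self.regs[self.lr_select] = (self.pc + 1) & 0x1FF
        self.pc = target & 0x1FF

    def ret_lr(self):
        """Return by copying the selected LR back into the PC."""
        self.pc = self.regs[self.lr_select]


cog = CogModel()
cog.setlr(0x1EE)
cog.pc = 0x10
cog.call_lr(0x40)          # outer call: LR now holds 0x11
outer_return = cog.regs[0x1EE]
cog.call_lr(0x80)          # nested call overwrites LR with 0x41
nested_return = cog.regs[0x1EE]
```

The nested call shows the catch: the second CALL_LR overwrites the outer return address, so a non-leaf subroutine has to save LR somewhere first.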
Do you envision a COG running a HUB executable program in parallel with other threaded COG code?
I see a good use case for this feature. One could have a micro scheduler task running in COG mode that would be scheduled to run at the lowest frequency (1 in 16 cycles I think). It could then spend most of its time waiting for some elapsed time or condition and stop/reschedule the main hub exec task as required for it to run atomically. Think preemptively (or even co-operatively) tasked hub exec code using hub based memory as locks. This could be very useful indeed. You could even write a mini RTOS this way that all ran in one COG. It would take some COG resources but could be written to run in its own VM to save space. It could also be used as a debugger for example.
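As a back-of-the-envelope illustration of the "1 in 16" idea, the Python sketch below builds a 16-entry slot table and works out each task's share of the cog. The table layout, the task numbers, and the helper names are assumptions for this sketch only; the thread does not describe how the real slot assignment is encoded.

```python
from collections import Counter

SLOTS = 16  # assumption: the time-slice pattern repeats every 16 slots


def build_slot_table(scheduler_task=1, main_task=0):
    """Give the scheduler task one slot and the main (hub exec) task the rest."""
    table = [main_task] * SLOTS
    table[0] = scheduler_task
    return table


def share(table):
    """Fraction of slots each task receives."""
    counts = Counter(table)
    return {task: n / len(table) for task, n in counts.items()}


table = build_slot_table()
shares = share(table)
print(shares)
```

Under these assumptions the scheduler costs the main task one slot in sixteen, and the rest of the time goes to the hub exec code.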
I suppose it's not optimal for HUB execute, but it could provide that little extra so the COG can be more fully used.
It's probably a PITA to make work. How would you launch such a COG anyway?
It's actually simplest to not make any task special. It's easier for all tasks to have hub and cog capability than it is for just one. Whatever I do must apply to all tasks. That means all tasks could be in hub mode at the same time, or some could be in hub mode and others could be in cog mode.
I really liked the idea of using $1f1 but I understand why it isn't possible. The two PCs are very useful as well.
I spent all day getting the PC mapped into $1F0/$1F1, only to realize that it extended the critical path of the whole chip and slowed it way down. The reason is that the computed ALU result is the last-arriving signal set, and to run it through a few more sets of muxes to accommodate the four task PCs, and then get it out to the cog RAM instruction address input, just takes too long. The only way to circumvent these delays is to add another pipeline stage, which will make cancelling branches take one more clock, and 4-way multitasking branches take two clocks, instead of one. It's not worth it. So, the PCs will have to be addressed by instructions only, in which the PC result does not go through the main ALU. It was worth trying, though, because the benefits would have been great. I think to compensate, I'll make relative jumps, which are easy to implement without drawbacks. This will give us the same performance we would have had with mappable PCs, when it comes to adding to them.
I'd like to make an observation. The four hardware threads in each Cog are not very efficient for anything other than hand coded soft-peripherals. They are preset time-sliced and therefore of no benefit to normally prioritised multitasking.
Adding extra hardware, to support them, and particularly removing useful instructions to make these threads function in hubexec mode would not be a good idea imho.
Best just to leave the hardware thread slicing where it was intended to be for now. Maybe look into extending it into HEM for the P3.
None of that is going away. It's just that now any task will be able to execute from hub.
Comments
They can be called PC and PCD.
I guess we could go back to the SETLR instruction that selects one of the COG registers to be LR.
You mean HPC and HPCD?
Could we not use 'WC' on the branch instruction to specify cancelling pipelined instructions?
WC does not make sense for a jump... that way you can use $1F1 for LR
No. It will be the actual PC for that task; all of the task PCs have been expanded to 16 bits for hub use. When in cog mode, the upper 7 bits are ignored.
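The 16-bit PC with its upper 7 bits ignored in cog mode amounts to a 9-bit mask. A small Python sketch makes that explicit; the function name `effective_address` and the `hub_mode` flag are made up for illustration, and the 9-bit cog width is inferred from 16 minus 7.

```python
PC_BITS = 16        # stated above: task PCs are 16 bits wide
COG_ADDR_BITS = 9   # inferred: 16 bits minus the 7 ignored upper bits


def effective_address(pc, hub_mode):
    """Address actually used for the instruction fetch in this toy model."""
    pc &= (1 << PC_BITS) - 1
    if hub_mode:
        return pc                              # hub mode uses all 16 bits
    return pc & ((1 << COG_ADDR_BITS) - 1)     # cog mode keeps only the low 9 bits


cog_fetch = effective_address(0xFE10, hub_mode=False)
hub_fetch = effective_address(0xFE10, hub_mode=True)
```

So the same PC value lands at $010 in cog mode but at $FE10 in hub mode.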
If you look at the latest instruction encodings, every opcode that didn't need WZ/WC has had those bits repurposed to pack instructions tighter.
This is the list I extracted from the latest zip, I hope I got them all!
C.W. said there are 23 'D'elayed instructions... which ones did I miss?
I am only using about half of them right now, but unfortunately I can see uses for the others in hand crafted optimized code.
So far, whenever I was writing P2 assembly code, I could:
- usually use 3 delay slots (about 60% of the time)
- almost always use 2 delay slots (>90% of the remaining 40% not covered above)
So basically, 96%+ of the time my branches only took 1 cycle.
Mind you, it took a fair bit of re-factoring and re-organization to get that level of single-cycle branching usage, but 80%+ was easy to achieve.
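The 96% figure follows directly from the two percentages given above. A short Python check is shown here; the function name `single_cycle_share` is only for this illustration.

```python
def single_cycle_share(three_slot_share, two_slot_share_of_rest):
    """Share of branches covered by either three or two filled delay slots."""
    rest = 1 - three_slot_share
    return three_slot_share + rest * two_slot_share_of_rest


# 60% of branches fill three slots; 90% of the remaining 40% fill two.
total = single_cycle_share(0.60, 0.90)
print(round(total, 2))
```

That is 0.60 + 0.40 × 0.90 = 0.96, matching the "96%+" quoted above.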
Thanks for posting that list. I, too, think they'd all find use, so better to leave them alone.
What about using writes to INA/INB/INC for the non-cancelling (delayed) version? I can't think of a good use case for needing to write to INA...
And thank you for keeping the delayed instructions around... much appreciated.
This might work, but I'd have to revisit how JMPRET/JMPRETD works, as they use INA..IND as dummy write targets.
It is very possible that it will go in, somehow.