JMP #addr -> JMPRET #%1_1111_0100, #addr NC NZ
CALL #addr -> JMPRET #%x_xxxx_xxxx, #addr NC NZ ;where x_xxxx_xxxx is the register with the matching "_ret" label
RET -> JMPRET #%1_1111_0100, #%0_0000_0000 WC WZ ;where the matching CALL will replace the 11 LSBs (Z,C,PC+1)
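If I'm reading those expansions right, a concrete CALL/RET pair would look something like this (just a sketch; the labels and subroutine body are mine):

```
        CALL    #blink              ' assembles as: JMPRET blink_ret, #blink NC NZ
        ' ...execution resumes here after the RET...

blink   XOR     OUTA, led_mask      ' hypothetical subroutine body
blink_ret
        RET                         ' assembles as: JMPRET #%1_1111_0100, #0 WC WZ
                                    ' the CALL wrote Z, C and PC+1 into its 11 LSBs
```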
Wait, something doesn't seem right here (it's my understanding that's off, I'm sure). I think JMPRET does the following steps:
writes PC+1 to D (except in the case of PINA..PIND)
conditionally updates Z and C flags (from the 2 LSBs of D)
branches to s-field
So wouldn't that mean that the two LSBs of D are interpreted in two different ways (first as an address, then as Z and C)? This makes me think my understanding is wrong.
If the D address is PINA..PIND, the two LSBs of the D field can be loaded into Z/C if WZ/WC are used. If D is not PINA..PIND, then bits 10..9 of the contents of that register are candidates for Z/C via WZ/WC.
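So, if I follow that, the two cases would be used like this (register and label names are invented, and this is only my reading of the rule):

```
        ' Case 1: D is PINA..PIND - WC/WZ load C/Z from the two LSBs of the D *field*
        JMPRET  PINA, #handler WC,WZ
        ' Case 2: D is an ordinary register - WC/WZ load C/Z from bits 10..9
        ' of that register's *contents*
        JMPRET  task_state, #handler WC,WZ
```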
I believe it is improvements to this process (such as simultaneous multi-cog RDLONGs) that would enhance overall performance more than the time-slicing concept
Depends what you mean by performance.
Loading snippets of code from HUB in order to squeeze more functionality into a COG is a fine and useful idea, hence LMM and Cluso's overlay technique. But it will kill the real-time response of threads that are cooperatively waiting for that load to happen.
But if you want two or more threads waiting on events (pin changes, timers, etc.), then for the lowest latency in responding to those events you will want hardware thread slicing. I believe this is true of examples like UART Tx and Rx at higher speeds, especially multiple UARTs. I'm sure there are plenty of other examples like that.
Also, having the hardware slice your threads for you means you can write your code more easily; each thread is just the sequence of code needed to get its job done, without having to worry about interspersing JMPRETs, TASKSWITCH or other forms of suspend throughout your code.
Hardware thread slicing also reduces code size, as all those extra instructions (JMPRET, TASKSWITCH, etc.) are no longer required. Smaller code is faster code. :)
Take the case of UARTs. Most of the time there is nothing for a UART transmitter to do, and often the same goes for its receiver counterpart. So then why not load something else into the cog that DOES want to run? I have found that typically, at least in the case of drivers, it is the Spin program that knows what should be run and when it should run, and cogs just self-suspend or self-terminate when their activity is done. So Spin sets a flag or other message to the cog via PAR, and the cog's OS picks up on that and either resumes the thread or loads whatever thread the message calls for. An efficient way of doing that would be very useful.
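A minimal sketch of that kind of hand-off (the mailbox layout and names here are just illustrative):

```
        ' cog-side supervisor: poll a hub mailbox whose address came in via PAR,
        ' then resume or load whatever thread the command asks for
poll    RDLONG  cmd, PAR WZ        ' fetch the command long from the hub mailbox
  IF_Z  JMP     #poll              ' zero = nothing to do yet, keep polling
        WRLONG  zero, PAR          ' acknowledge by clearing the mailbox
        ' ...dispatch on cmd: resume the suspended thread, or overlay-load a new one...
        JMP     #poll

cmd     LONG    0
zero    LONG    0
```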
Again, I have nothing against hardware-augmented multi-threading (in fact I think it's great, especially now Chip has confirmed no impairment with the new scheme), but I believe more users will gravitate to the co-operative approach rather than to time slicing. The extra few JMPRET instructions sprinkled through the code are not significant. And as for timing, on a P1 loading a co-operative UART from hub takes about 10 usec. Workable, but it would be nice if that could be reduced.
So as I said before, I don't yet understand the details of the flexibility of the slice assignments and the overhead of changing them. If that is trivial, then a bigger gain will be the result. If that is awkward, then I predict a lot of slot time will be spent just waiting, with the allotted time not doing much useful. And there are only 4 slots, so swapping will likely be required in any case.
The performance of the slicing approach will depend on how this all shakes out. I'm anxious to get my hands on a chip to give it a good rundown. A bunch of the new instructions should be of good benefit to my cog kernel.
Cheers,
Peter (pjv)
Posted by Seairth:
JMP #addr -> JMPRET #%1_1111_0100, #addr NC NZ
CALL #addr -> JMPRET #%x_xxxx_xxxx, #addr NC NZ ;where x_xxxx_xxxx is the register with the matching "_ret" label
RET -> JMPRET #%1_1111_0100, #%0_0000_0000 WC WZ ;where the matching CALL will replace the 11 LSBs (Z,C,PC+1)
Really smart Chip!
Who is the documentation expert (potatohead? others?) to document the JMP instructions - to save Chip's time.
Don't worry about this, Guys. I'll amend the Prop2_Docs.txt before I post the next FPGA configuration. There are lots of details to cover in that thing.
Including info on the extended features in Counters would allow those with FPGA boards to field-verify the new modes, before final synthesis.
It's the same as on Prop1, with the addition of TASKSW, which goes to the next task automatically while switching flag sets for you.
So how would I switch to an arbitrary task?
With the TASK hardware, I guess you would do something like:
; assume currently using TASK_0 register
SETTASK #%01 ; switch to TASK_1 register (for all time slots)
JMPTASK #%10, addr ; update the PC of TASK_1 register
NOP ; or some other delayed instruction in the context of TASK_0
NOP ; ditto
Because of the JMPTASK, you would have a one-instruction delay (it will cancel the instruction currently loaded in the first stage of the pipeline). But this would otherwise work, I think...
Continuing from my last post (prematurely submitted), I don't know if a similar approach could be done with JMPRET. Maybe something like:
;assuming we are in "task" 0
SETINDA #task_1x ; store the current PC/Z/C in the register before "task" 1
JMPRETD INDA, ++INDA WC, WZ ; perform a delayed TASKSW
MOV task_0, task_1x ; move the stored PC/Z/C to the "task" 0 register
NOP ; or another delayed operation in the context of "task" 0
task_0x RES 1
task_0 RES 1
task_1x RES 1
task_1 RES 1
or maybe
MOV task_x, task_1
JMPRETD task_x, task_x WC, WZ ; not sure if this is allowed
MOV task_0, task_x
NOP
NOP
task_x RES 1
task_0 RES 1
task_1 RES 1
Thinking further, what would it take to use the TASK registers to do the same round-robin approach as TASKSW? Obviously, you'd use SETTASK instead of TASKSW, but you need some way of incrementing the task number. With the ability for the d-field to be a register, it seems you'd reserve a single register for that purpose. The switch code would then look something like:
SETTASK task
INCMOD task, #3
NOP ; delayed instruction in the prior TASK context
NOP ; ditto
Comparing this to TASKSW:
TASKSW is a single instruction, compared to the 4 instructions above. If the remaining two delayed slots can be efficiently used, then the above approach effectively costs two instructions.
TASKSW causes a three-clock pipeline stall, while the above approach must be treated like a delayed branch instruction.
If one did JMPRETD INDA, ++INDA instead, the number of registers would be the same as above, but you would have one extra delayed instruction slot to use compared to the above.
TASKSW requires one register per task for bookkeeping, as well as a use of INDA, while the above code only requires a single register for bookkeeping.
TASKSW can have more than 4 tasks, while both approaches can manage 4 or fewer tasks.
The above approach can do out-of-order task switching by setting the "task" register.
By using JMPRET directly, you could possibly do out of order task switching, but probably no more efficiently than the above approach.
The above approach would require the use of JMPTASK to set the initial PC values, while TASKSW would require MOV (or some other operation). Though, if code were carefully crafted, the bookkeeping registers could be pre-loaded with addresses. Similarly, if multiple tasks were the same code, JMPTASK could be used more efficiently.
TASKSW can't take advantage of mapped registers (set via SETMAP).
What else have I missed?
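For reference, the JMPTASK/SETTASK setup mentioned above might look something like this (entry labels are invented, and I'm assuming the mask-first operand order and the slot-pattern form of SETTASK used earlier in the thread):

```
        ' point each TASK register at its code before enabling the slicer
        JMPTASK #%0001, #task0_entry   ' set task 0's starting PC
        JMPTASK #%0010, #task1_entry   ' set task 1's starting PC
        JMPTASK #%0100, #task2_entry   ' set task 2's starting PC
        JMPTASK #%1000, #task3_entry   ' set task 3's starting PC
        SETTASK #%%3210                ' give each task every fourth time slot
```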
Chip, while writing all of this up, I had two thoughts about TASKSW:
Could TASKSW use INDB instead? I'd think that most user code will use INDA, so this would make TASKSW less likely to conflict.
Could there be a TASKSWD that aliases JMPRETD?
Yes, INDB might be better. I would have to make register remapping work with INDB, also.
There is a TASKSWD, but the tricky thing is that you will be executing with the next task's Z and C flags. Also, if you are using register remapping, the new map (by INDA) will be in effect, too. If you don't use register remapping and don't affect the flags, you could get something out of TASKSWD. Otherwise, it's of little value.
There is a TASKSWD, but the tricky thing is that you will be executing with the next task's Z and C flags.
Oh, that's right! TASKSW immediately affects flags. In which case, I'd actually avoid providing the delayed version. If someone *really* wants to do it, they could still use JMPRETD directly (at which point, they could also use NC/WC and NZ/WZ as necessary).
Also, if you are using register remapping, the new map (by INDA) will be in effect, too. If you don't use register remapping and don't affect the flags, you could get something out of TASKSWD. Otherwise, it's of little value.
I didn't think that the JMPRET-based tasking had anything to do with register remapping. I thought SETMAP was only applicable when using SETTASK.
Could you add a new instruction called NEXTTASK? This would take no arguments and would work with the TASK registers. All it does is cycle through those registers in a round-robin fashion, so it's only meant to be used in single-tasking mode. With this instruction:
It makes the code impact no worse than using the TASKSW approach.
This would not have the same Z/C and register mapping issues as TASKSW.
A delayed-instruction variant could be safely provided.
This would allow a task to not care who the next task is.
If using less than four tasks, the unused tasks would point to a small default block of code that would just NEXTTASK in a loop (or some other similar code). On the other hand, if using less than four tasks, it might end up being easier to use SETTASK (as an optionally delayed-instruction) to explicitly switch from one task to another. The point is that either option would be available.
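The do-nothing stub for an unused task could then be as small as this (NEXTTASK being hypothetical, of course):

```
stub    NEXTTASK                   ' unused task: immediately hand the slot onward
        JMP     #stub              ' and do it again if we're ever scheduled
```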
Better yet, if the instruction were NEXTTASK #n, where the TASK control register would be set to ((TASK + n) AND %11) (this assumes that the two LSBs of the control register represent the current active task), this would allow the following:
In the general case, you'd use NEXTTASK #1 to go to the next task
When using less than four tasks, the last task could do NEXTTASK #2 (if there are three tasks) or NEXTTASK #3 (if there are two tasks), getting rid of the need for the do-nothing stub code in the remaining tasks.
NEXTTASK #0 could be used to drop out of the interleaved tasking mode.
The last item is somewhat interesting. With NEXTTASK #0, you could mix the interleaved and cooperative approaches. In general, you could use the interleaved approach. Then, when you hit a particularly critical bit of code, use NEXTTASK #0 to switch to single-task mode for the task that issued the instruction. There would still be up to three instructions (depending on the interleaved scheduling at the time) before the pipeline was truly in single-task mode. Then, once the critical bit is finished, the task would issue SETTASK to re-enable the interleaved mode. Now you can have the best of both worlds!
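Sketching that mixed mode (NEXTTASK is still hypothetical, and I'm assuming SETTASK takes a slot pattern as in the earlier examples):

```
        ' normally running interleaved, schedule previously set with SETTASK
        NEXTTASK #0            ' collapse to single-task mode for this task
        ' up to three already-fetched instructions from other tasks still complete
        ' ...timing-critical section: every slot is ours...
        SETTASK #%%3210        ' restore the four-way interleave when done
```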
It seems to be getting a bad rap.
I hope that any of my findings are not scaring people off the idea.
The driving force behind most of my crazy experiments is "to see how far I can go".
The system continues to impress me each time.
I don't think it can be made any simpler. Just set a few JMPTASKs and a SETTASK and away you go; not complex at all.
Sure, there are a few things to look out for, but isn't that normal for any programming environment?
Writing any code for microcontrollers always requires care in relation to timing and flow, nothing new here.
On the contrary, it's your work (and I wouldn't consider it crazy at all!) that has given us a great deal of insight into what is and is not possible with multi-threading. This is exactly the right time to learn its strengths and weaknesses, see where it can be improved further (as Chip has arguably done, in some ways), and make sure that it really is the benefit we are all imagining it will be.
To be clear, I WANT multitasking. I think limited multitasking is A GOOD THING. My main concern (the thing that spawned this thread) is that I feel the interleaved approach is getting more complex in order to accommodate various edge cases (those that can be accommodated, anyhow). With each increase in complexity, you run the risk of introducing more edge cases and/or introducing new behaviors for existing instructions that weren't affected by multitasking beforehand. We can't expect you to discover all of those situations (though I admit you've been doing a pretty fine job of it so far!). And, of course, I am going to continue suggesting the cooperative approach as a viable alternative, as I feel that it is architecturally simpler (in hardware, not necessarily software).
But, at the end of the day, I still think that the Propeller is better for having multitasking (in whatever form) than not at all. So, please don't take my comments as a criticism of multitasking itself.
I don't think it can be made any simpler. Just set a few JMPTASKs and a SETTASK and away you go; not complex at all.
If you can safely ignore all of the restrictions that interleaved multitasking imposes, then I would readily agree with you on this. In my mind, if you could add even one instruction (e.g. NEXTTASK) that would allow multitasking to be used without worrying about any of those restrictions (or worrying about a smaller set of restrictions), then I'd say that multitasking would be even simpler (in some cases, at least). And I am not suggesting (at this point, anyhow) to change the way that JMPTASK and SETTASK work, so all of the existing code would still work exactly as-is and would be no more or less simple than it already is right now.
And, of course, none of this would really be addressing my original worry, which is the overall architectural (hardware) complexity and the risks that result from such complexity. Since I do not expect the tasking hardware to get simpler, I would at least feel that the risks could be mitigated by being able to use it in a way that treats the bulk of the code exactly the same way as it would if it wasn't using tasks at all.
What I minded is that your messages came across to me as "time slicing does not work well in some edge cases, so let's get rid of it"
I understand that. My intent was to highlight the risks (as I see them) that the current approach is creating. As I mentioned earlier, I was not expecting Chip to remove interleaving (especially because such an approach can only be efficiently done at the hardware level). I was (a bit more seriously) suggesting that it might be better to remove the auto-jump feature (which reduces the side effects, and therefore risk, at the cost of functionality), but I'm not going to argue this if there's overwhelming pushback (which I believe there is). But the only thing I really feel strongly about (of the original three suggestions) is using the TASK hardware to run tasks cooperatively, such that I can still take part in the hardware-level multitasking goodness without having to worry about those risks (as much).
But, I do apologize for the miscommunication on my part. I'll wait until the P3 to really rock the boat.
No worries, Seairth.
I hope you didn't take my comments as negative; there are good ideas here.
Discussions like these bring out some important scenarios that may have been overlooked or never imagined.
We all win.
Re: P3
I can't begin to imagine the ideas we all will throw out there for the P3, if it's ever on the cards.
Based on what the P2 is shaping up to be, the future is looking good.
Cheers
Brian
Frankly, I like the auto-jump, and I think it would be best to bring the polling instructions back. There are great use cases for both. I'm reluctant to go beyond that statement because this needs to get done. It won't be perfect, but it already is excellent.
To me the code rule differences aren't too significant. We've got a trace capability to understand what happens. The hardware mode holds the most potential for me personally. It just makes a COG capable of so much. Happy camper here.
The core reuse will be the COG for most cases. I think we will find the snippet to be highly reusable too and that's just cool. (SPIN 2 ASM command)
Co-op code appears much closer to the native non-thread code. It may well be reused too. It's just not my preferred mode. I much prefer the more parallel nature of the hardware option.
Seems to me, we are mostly there, so long as we don't write one case or the other off.
There is a TASKSWD, but the tricky thing is that you will be executing with the next task's Z and C flags. Also, if you are using register remapping, the new map (by INDA) will be in effect, too. If you don't use register remapping and don't affect the flags, you could get something out of TASKSWD. Otherwise, it's of little value.
Ahh. I didn't realize that only INDA worked with remapping.
I never minded you wanting to use cooperative multitasking - I know that is very useful.
What I minded is that your messages came across to me as "time slicing does not work well in some edge cases, so let's get rid of it"
IMHO:
- the hardware multi-tasking is a lot cleaner and easier in the vast majority of cases than cooperative multi-tasking
- I actually like cooperative multi-tasking when you need even more threads :-)