Still worried about multitasking
Seairth
Posts: 2,474
Yes, I know this topic has been discussed (ad nauseam?), but every new discussion only makes me worry about the multitasking support more, not less. Here is my take on the current state of things:
In my mind, multitasking was added primarily to allow the grouping of several, low-priority "tasks" (e.g. keyboard, mouse, simple video, etc) on a single cog in order to free up the other cogs for high-priority, dedicated tasks. To achieve this, hardware-level tasking support follows an interleaved (not preemptive or cooperative) approach using SETTASK and JMPTASK. Based on the initial design, some issues were encountered:
- One of the issues with the interleaved approach was that it was difficult to effectively synchronize a task with HUBOPs. The solution was to change the sequencing of the interleaved tasks based on the number of active tasks. While there have not been any reported side-effects of this solution, it has made the setup and use of tasks more complicated.
- One of the issues with the interleaved approach was that blocking operations (those that stall the pipeline) caused all tasks to block, not just the task that contained the blocking operation. The solution was for the blocking operation to implicitly "jump" to itself to allow other tasks to keep moving forward. This solution has resulted in another issue concerning the way that INDx registers are handled. It could be argued that this is a lesser issue (which can be documented and manually avoided), so the net result is still a better design.
- One of the issues with the interleaved approach is that it becomes much more difficult to write deterministic code. While the addition of "self-jumping" blocking operations may have mitigated this somewhat, it is still not the same as being able to write a block of code as if it were the only task running. There is no solution offered to this issue, other than to add documentation that says "don't use multitasking if you need determinism".
So where does this leave us? In my mind, it leaves us with a mechanism that is more complicated to use (than the original version) and has more caveats (than the original version). Further, for every issue that's been encountered so far, the solution has made the mechanism more complex. I don't feel confident that we fully know all of the issues, but I do feel confident that the current trend is for the solutions to those issues to only make multitasking more complicated.
Yes, I know this could all be avoided by simply not using multitasking. But the thing is, I want to use it! I want to be able to lump lots of small, low-priority tasks into a single cog! I just want to be able to do it with the same ease and simplicity as when I'm writing a single task. I don't want there to be caveats. I don't want there to be special-case contortions that I have to go through. It's exactly that kind of complexity in other MCUs which brought me to the Propeller in the first place.
So, in the spirit of offering solutions, not just problems, here are a couple thoughts/ideas:
- As I have suggested elsewhere, make tasks cooperative (not preemptive or interleaved). With this approach, any individual task has uncontested (and even deterministic, when needed) control of all cog resources. Properly written, the cumulative effect of multiple tasks running over time would be the same as the interleaved approach (e.g. two 10-instruction tasks occurring back-to-back vs interleaved would still be "parallel" at the user timescale). Yes, this would use up registers due to the explicit task switch instructions, though this might be mitigated by having task-switching JMP variants (which seems like a logical point to task switch anyhow). Regardless, a cooperative approach would keep the cogs nearly as simple to code as if there weren't multitask support at all. Remember, the objective of multitasking is to put multiple low-priority tasks together, not to try to cram multiple high-priority, timing critical tasks in fewer cogs. (Otherwise, why don't you just use a PIC or AT, which is already suited for that approach?) Further, I suspect that the gates saved by removing the interleaved functionality could be repurposed for more task registers, allowing maybe 8 or 16 tasks to be tracked.
- If keeping the interleaved approach, remove the re-ordering (self-jumping) functionality. Blocking operations simply block. There are no caveats. There are no unexpected side-effects. Again, if you are concerned about timing critical behavior, interleaved multitasking may not be the right tool to use in the first place. Otherwise, most of those operations have non-blocking variants that can be used in cases like this, maybe not optimally, but predictably.
- At the very least, update SETTASK to optionally flush the pipeline. With this, I think it would be possible to use SETTASK to implement an exclusively cooperative approach (i.e. set all timeslots to the same task). Further, I'd suggest introducing a SWTASK statement that aliases to the SETTASK opcode for clarity. If all of the other tasking functionality is still in place, the pipeline would need to make sure that it is in single-tasking mode when all timeslots are the same task (therefore, ensuring that all of the multitasking caveats and side-effects can be safely ignored). If the change can be made fairly easily, it seems like a low-risk change to make (whether or not it ends up being practical in the long run).
In the end, while I think that the Propeller will be better for having multitasking, I think it will be even better still if the approach is kept simple and straight-forward. And I fear that the current approach is increasingly neither.
Comments
Still, as far as I know, none of the threading features and consequent fixes prevent one from working in traditional single threaded mode or using cooperative threads. I presume JMPRET and TASK SWITCH (whatever they are called now) are still in the instruction set.
So if your app requires the speed or the cycle counting determinism of the old Prop style programming it is still there.
Ozpropdev has shown that the hardware interleaved threads can be used to great effect so I'm inclined to say let's go with what we have, warts and all. Let's see this chip get out the door!
There will always be special case issues when you start time-slicing.
Because you need to allocate resources and launch slots, you will never quite reach "simply able to lump lots of small, low-priority tasks into a single cog".
With the right software support, the issue of least surprise can be improved.
I think this needs a slice-aware assembler, and the Prop 2 is really going to need a good simulator.
Personally, I don't see how we can assure the threads are all deterministic given the shared resources.
I do think bringing back the polling versions of the instructions is an idea worth thinking about. The repeated jumping prevents polling of multiple events in a kernel type thread that could dispatch others depending on the state of the polled entities.
Re: Simulator
SETTRACE should help with this. We probably can get easy environments set up that use many of the COGS to present what a test COG is doing. Not that I wouldn't mind a great simulation.... . But we can ask the P2 to tell us lots of things a P1 could not.
Finally, the idea that we get deterministic hardware threading was never on the table. I think it was Heater who understood the core reuse unit is the COG, not the thread, due to implementation at this stage. Most of us felt it was totally worth it, and ozpropdev has shown it is worth it.
Re: Software
There are some cases:
1. SPIN + PASM
We know these two cold. SPIN got the ability to include PASM snippets too. (awesome) This will be the sweet spot where really using the chip happens most easily. It's also the place where it will be easy to hack on the chip to learn things as we are doing right now. (minus SPIN 2 as it's not really cooked yet)
2. Macro-Assembler
Maybe combined with SPIN at some point. We don't have this, but a really great macro-assembler for P2 could be thread aware and have all the fancy stuff needed to manage the multi-tasking. When I look at what people are doing with older 8 bit computers and great assemblers and environments, somebody --maybe a lot of us, will want to get this kind of thing done. Lots of P2 hackery might happen here, maybe with simulator?
3. C
In the end, this will work a whole lot like SPIN + PASM does, only it will be C + PASM. For those users who really feel good about this kind of development, we will get lots of spiffy library code. At the high level, many cool things can be presented in simple ways. Lots of potential here.
4. Stand alone simulator
Seems this is a lot of work. Will it get done? Early on with P2, we got Gear and I personally found it helpful, but limited. Once the learning got done, I did the rest on chip as did most of us.
5. Specialized on-chip environment.
I have high hopes for this one too. SETTRACE opens the door for a P2 to be used in an academic way, presenting what happened and potentially assisting with other methods.
6. Specialized debugger?
We ended up with some nice tools for P1. Seems to me, we will end up with nice tools for P2 done in similar ways.
What is a "slice-aware assembler"? What does it have to do that a normal one does not?
potatohead,
Not me either. XMOS managed it with their xcore devices but they designed the thing from the ground up to be able to do that and have lots of built-in help from clocked I/O and such.
I thought we all understood the hardware task scheduling might not be 100% free of the jitters without being careful how the threads are written. Main point is that one thread waiting for a pin or timer or such should not stall all of them, which would defeat the whole point of having threads.
I'm quite prepared to accept that some instructions are off limits when tasking or perhaps behave in odd ways. Just document it and it's a feature. Let's get the chip in our hands!:)
One that knows if you are running time-slices, or single-task.
Either by a directive, or it can be smarter, by disassemble.
It can then warn about any constructs that may not be strictly portable from Single-task to Time-sliced threads.
I've seen no show-stoppers and a whole lot of benefit so far. Maybe P3 can expand on this and do it right.
I just think what we have is fantastic! While we may have more caveats than previously, they are really only present as we dig deeper into the uses. Provided we stay with the original intentions there are no real caveats. So isn't this a win-win?
The threads were never intended to do what ozpropdev has done. Doesn't this just go to show what is possible? Who really cares about the caveats as long as we know about them.
Remember, the original intention was to allow simple threading/multitasking so we could have 4+ software UARTs, or multiple PS2, etc rather than wasting multiple cogs. This has been achieved, with far more usability than was ever intended.
What is going into the P2 this round is a huge improvement, and even if that delays the chip by a few months then it will be worth it IMHO. The changes are such a monumental improvement, and fixes had to be done anyway.
Simulation:
With all due respect:
1) You can still ignore the hardware multi-tasking, and use the whole cog
2) You can still use cooperative multitasking
3) no one is trying to take those abilities away from you
If you don't like the time slicing, don't use it.
Ray
I'm liking the idea of enhanced HW simulation of P2 on FPGA.
Way more powerful and accurate than SW simulation.
To think we almost didn't even head down this road, but for Heater making the right mention at the right time too. Truth is, we still have the COG as the ultimate Propeller style reuse / deterministic case. That doesn't really change at all.
But, for those people willing to dig deeper, a COG can do a whole lot more than it ever would have otherwise. Totally worth it.
But this:
I want to be able to lump lots of small, low-priority tasks into a single cog! I just want to be able to do it with the same ease and simplicity as when I'm writing a single task.
Was never on the table from day one. The discussion is there, right after Heater suggested it and Chip implemented it. We all thought about it, and came down to the COG being the core unit of no brainer reuse, not the thread. Everything after that really has been about optimizing the use of threads.
True, this is still available. But why take up stack (or whatever it's called now, I'm not recalling) resources when there is already hardware support in place?
You simply can't ensure determinism with the current model, at least not without adding synchronization mechanisms. But there are two (or more, possibly, but I'm only concerned with two) levels of determinism. There is code that must entirely be deterministic, and there is code where only very small chunks need to be deterministic. I suspect that the first level (full determinism, I'll call it) is extremely difficult to write for any reasonably large code base. However, I think it is the second level (local determinism, I'll call it) that most of us make use of. We write most of our code without determinism in mind, but when we enter a few critical parts (usually timing-sensitive I/O), we need local determinism for just a few instructions.
Interleaved tasks (I do not like using the "time-slicing" term since the processor is not slicing task by time quanta) make this very difficult, while cooperative tasks make this much easier (since no other task can affect the timing or resources of the current task).
Agreed.
This is part of my point, though. We will continue to dig deeper and may still find more issues. We can certainly adjust our intentions to incorporate those issues (which is what I feel we are doing with INDx), but I don't think we can stick with original intentions (unless maybe we don't agree on what those original intentions are) and still state that there are no caveats.
Sure, and you could say the same thing about the SERDES argument too! If you don't like it the way it is, don't use it. There's always "bit-bang a solution". But I do want to use multitasking. I just wish I had more direct control over when things executed.
I can still implement software-based cooperative multitasking, that is. But why should it be that way? All of the basic tasking functionality is already in the hardware, I just can't use it in a cooperative manner. This is why I offered the third solution in my original post, which would allow those who want to use interleaved task to continue to do so and those who want to use cooperative multitasking to do so as well.
Of course, I realize that most people who've read my original post probably focused on the first solution, which I admit is an extreme (and very likely unwarranted) option at this time. I included it to thoroughly reflect my thoughts, not because I believed my post would cause Chip to suddenly go rip that functionality back out. But, sticking with the original intention of the post, I do feel that it's worth considering rolling back at least one of the features (self-jumping) in order to reduce the possibilities of unexpected (and I feel that INDx definitely qualifies) edge cases. With each new tweak or fix, I fear that unexpected side effects will start to show up even when in single-task mode. Otherwise, the current trend seems to indicate (at least, to me) that multitasking is only going to get more complex, and that worries me.
I know many here are excited about the new multitasking capability. In a way, I am too. In the grand scheme of things, my worries are most likely outweighed by the excitement. But I also suspect that there are others who have the same worries (more or less) that I do and would have been remiss not to express them. (After all, how many times have you all heard "why didn't you say something?" after the fact?) In the end, I truly hope my fears are unfounded and that the multitasking will be nothing but a success.
The rules for the multi-tasking mode are going to be more involved than they are for single-task mode. Over time, we will figure out what the optimal cases are, lots of code will center on that, and we won't worry too much about this. It's just not a perfect thing. Won't ever be on this chip either. But, it's also a very serious gain, so we take the good with the bad. Real P2 chips out there trumps perfecting this IMHO.
There will be lots of single task COG code out there too.
This is exactly how I see it (if I'm understanding you correctly). The cog is the unit of execution, everything else just helps to make efficient use of the cog's time. When I suggest lumping multiple tasks together with the same ease as if there were only one task, I am not suggesting that the unit of execution should be the task or that the hardware should be oriented around the task. I am suggesting that the instructions should make it easy to execute these tasks as the cog's current unit of execution (i.e. the cog is dealing with nothing but one block of code at a time like it always has). In a way, it might be a misnomer to call it multitasking. Maybe this is more like coroutines. Or some other general term that simply means "allow me to easily maintain the state of multiple lines of execution and switch between them efficiently". To me, interleaved multitasking starts to put the cart before the horse (or rather, the task before the cog) precisely because the tasks are increasingly putting all sorts of restrictions and caveats on how one can use the cog (e.g. which resources can't be shared, which side effects change, when exactly code will run, etc). How does that make the cog the core unit of execution and not the task?
In other words, if we have a block of COG code, it works on a COG. If we have a block of hardware tasked code, it's not always going to work in a given task slot, nor will it always be suitable for single task mode without modification and or consideration about what the other tasks might need to be doing and that's because the pipeline and special COG functions are shared.
Because of how the hardware tasking was implemented, it's just not going to be as easy to reuse as single task COG code is.
So I guess most of us just aren't worried too much about this artifact given the very serious gain it represents in terms of what a COG can do. Most of us were thinking combined drivers and such. That ozpropdev stuffed a whole game in there is amazing! (and highly educational)
To answer your last question about how I see it at least, once it's all been sorted out for a given set of tasks and code bodies all running on the COG, it's portable from there. Any COG will run it, and it can be an object, coglet, etc... Thus, the COG is still the core unit of reuse as we've come to expect on Propeller. What happens in the COG will depend on how all the threads, needs, resources, rules play out. So, a piece of hardware threaded code really can't be an object in the same sense.
However, it may be a snippet, depending!
SPIN 2 allows us to blast some PASM into the SPIN COG and run it. That can happen in hardware tasking mode too. (I believe this to be true and haven't really considered it yet. Need to be running a more mature SPIN 2)
Snippets in single task mode will be robust, and we can expect to reuse them as we do entire COG code images. This is really cool! So now we get little bits of PASM with the same usability characteristics, for the most part.
However, snippets or anything really, running in hardware tasking mode will need attention and just won't be reused with the same ease.
Looks like you are asking for something the Prop already has. Even the P1.
So far on the PII in terms of running multiple functionalities at the same time on a COG we have:
1) Take care of it in regular code simply by writing some kind of state machine code that checks various external stimuli and jumps to the right code to handle it. Basically polling.
2) Write coroutines using the JMPRET instruction. See FullDuplexSerial as the canonical example of this where it "hops" between RX and TX coroutines using JMPRET sprinkled around the Rx and Tx loops. I'm sure that with some ingenuity this could be used for more than two tasks.
3) Write cooperative threads using the new PII TASKSWITCH instruction. That with the memory remapping going on for tasks is quite a powerful thing.
4) Using hardware interleaved multi-threading on the PII.
What is missing here? (Apart from traditional interrupt driven preemptive scheduling which I am very happy not to see on the Prop:).
Now:
1) Polling: is how you have to do things if you don't have interrupts. It is slow and wasteful of code space. It also makes your code a tangled mess.
2) Coroutines are an improvement on 1). Using JMPRET to hop from thread to thread makes code easier to write and read, makes it smaller (and faster I believe).
3) TASKSWITCH is an extension of 2) offering easy implementation of more than two threads and some task based memory mapping.
4) Hardware interleaved threading allows you to remove all those JMPRET or TASKSWITCH instructions from your thread loops thus offering:
a) Smaller code.
b) Faster execution, as there are fewer instructions to execute.
c) Minimal latency to external events and times etc.
d) Hence increased timing determinism for your threads.
True 4) seems to have some restrictions coming to light but in view of what it offers who would want to give it up?
I like to think about "unit of functionality". For example FullDuplexSerial is a "unit of functionality". Its Tx and Rx threads are not. FDS runs in a COG and is packaged in an object. All very clearly delineated. Some SPI driver or whatever would be another "unit of functionality".
The beauty of the Prop is that such units are easily mix'n'matchable. Just drop them into your project and they work. No worries about timing interactions or hooking up interrupts etc.
"Unit of Functionality" is an object when working in Spin.
I DO NOT expect such units of functionality to become composable as threads within a COG just because we have hardware scheduling.
P1 does not have hardware-level task support. P2 does not have hardware-level cooperative task support.
My understanding is:
It's still strictly software-based, so you can't use the memory remapping in conjunction with it. Further, you have to give up the use of INDA and some of your stack space. In the meantime, there are three hardware task registers that are sitting unused.
P2 adds the possibility to also save the C and Z flags for every task, which the P1 cannot.
Yes INDA is used to read the TASK-list and switch the tasks with JMPRET. The advantage over hardcoded JMPRETs is exactly that you can use the register remapping, and that you can define the order of tasks in a list in cogram. No stackram is needed. (There is even an example in the Prop2_Doc.txt)
I have just tested whether you can also use SETTASK for cooperative task switching - and it seems to work: This executes SETP and CLRP in two cycles and alternates the tasks. It looks like the next task gets active 4 instructions after a SETTASK.
This form of taskswitching is now possible because Chip has added the variable length task scheduler, in this example it's only 1 entry in the scheduler list.
Also all the other changes he made to the task handling simplify the use of tasks dramatically. I have ported quite a few single task PASM codes to multitasking and I had to go through all the instructions and add WC effects and loop jumps here and there. Now you can just enable tasks without modifying the code; in most cases it works (as long as you have not optimized for single task with delayed jumps).
Andy
My apologies! I was mixing up the INDx and SPx registers. You (and everyone else who was pointing it out to me) are correct. No stack ram is required, just the use of SPA and some register space. You still have to give up register resources instead of using the built-in TASK registers, which was my primary point.
This is the kind of thing I had in mind. However, I thought that SETTASK did not flush the pipeline, so in the example above, the JMP following each SETTASK would block the other task. You basically need to treat the SETTASK like a delayed jump instruction (i.e. up to three more instructions in the pipeline will execute before the other task starts executing.) Maybe I have this wrong, and it always flushes (effectively stalling for three clock cycles like a non-delayed JMP). If SETTASK could support either mode (like the branch instructions do), then this would be a moot issue.
I better go back and read that discussion again. It appears that maybe I am misunderstanding how that change makes the task scheduler work.
Or needing to make sure that two instructions occur with an exact timing. Or trying to optimally align a block of code with the hub access. Or need to safely use a shared resource like the CORDIC engine. Or mutating INDx in a blocking instruction. I suspect that converting P1 code over is trivial because it's not using any of these features, but I wonder if that will hold true for P2 code.
Isn't the TASKSWITCH (or whatever it is called now) instruction in the PII hardware support for cooperative task switching? It was that which finally led to the hardware scheduled task switching.
No, I don't think it exists.
I probably haven't had enough coffee to figure out the implications of this yet but the comment about setting D seems like it places restrictions on where the RET for a CALL instruction can live in COG memory since a range of values are reserved to perform what NR used to do to allow JMP and RET.
Edit: I'm not sure why there can't just be separate opcodes for JMP and JMPRET and JMPD and JMPRETD instead of using magic numbers in the D field to select the behavior.
About the magic D values for JMPRET/JMPRETD: they are there to account for the old case of NR (no result write). If the D address is PINA..PIND (the pin inputs, which are read-only), the write will have no effect, but the cog will know that this is like an old no-write JMPRET/JMPRETD. That there are four magic addresses is significant, because those two LSBs of D are needed to accommodate Z and C flags for RET instructions (which are JMPRET/JMPRETD).
We could have made discrete JMP/CALL/RET instructions, but the universal JMPRET instruction can be made to do it all without needing any extra opcode space.
-Phil
There are now discrete TEST, TESTN, TESTB, CMP, CMPS, CMPX, CMPSX, CMPR, etc. instructions. The one we don't have is the REV NR, which I never thought of, but is pretty clever.
That NR bit was only useful in about 5% of instructions, but was cutting the opcode space in half. By getting rid of it, we got space for more D,S instructions which will save a lot of code space in the cog. For example, we now have DIV32 D,S which replaces what used to be SETMULA and SETMULB.
For whatever lingering detriment the lack of NR will create, there is a huge net gain in code density going forward, as many operations can now intake both D and S, and D can always be immediate in instructions where D is not being written.
Very interesting discussions about this multi-threading.
I have spent a lot of time doing that on the P1, running up to 128 (contrived trivial) threads in a single cog. A more practical number for real world applications is probably 8 threads. It turns out that much of the time all threads are idle with the occasional moments of sheer bedlam when many want to run at the same time.
So with the new regime, I'm tickled that there will be hardware support for saving/restoring the states of the C/Z flags as that is something I have not yet implemented in the P1 kernel..... not for any particular reason other than there has not been much need for it, and it will consume code space as well as clock cycles.
I'm not yet familiar with the intricacies of the new instructions, and how the hardware time slicing will actually operate. What does concern me is how the time slices might be dynamically allocated and altered. As already mentioned, many times and possibly for long periods a thread is not in need of any cycles, so if a slot is dedicated to doing nothing, then what was the point? The purpose, as I see it, is to maximize the cycles over the greatest number of threads in need of execution. Not just to have some convenient way to do a WAIT while keeping determinism. So perhaps the efficient allocation of cycles to a list of needy threads is looked after with these new capabilities. If not, then my experience would indicate the whole exercise to be less useful than what seems to be expected here. That said, I'm still thrilled I will be able to do a JMPRET while simultaneously saving/restoring C/Z.
Even though there will now be some hardware support, as one delves further into the whole multi-threading thing, I expect many folks (at least with my current limited understanding of things) will discover that co-operative threading will have a good following. It is efficient and very flexible and now is augmented with some hardware support.
In dealing with all this on the P1, a very significant revelation for me was the need to efficiently load snippets of dynamically RELOCATABLE code and data from hubram into a running cog. Since the cog's ram is so small, effectively it cannot do TSR type of operations. Space is just too precious. So a lot of swapping goes on, especially if the threads are of somewhat larger size. I believe it is improvements to this process (such as simultaneous multi-cog RDLONGs) that would enhance overall performance more than the time-slicing concept versus the co-operative threading concept. But I also understand we are too far down the road to consider such options, and the new instructions were simple to implement and will bring value.
I expect that most of my work will continue along the co-operative track. I just hope it has not been impaired in some way.
Just my thoughts on the subject.
Cheers,
Peter (pjv)
So, if I understand correctly:
Wait, something doesn't seem right here (my understanding, I'm sure). I think JMPRET does the following steps:
So wouldn't that mean that the two LSBs of D are interpreted in two different ways (first as an address, then as Z and C)? This makes me think my understanding is wrong.
It's the same as on Prop1, with the addition of TASKSW, which goes to the next task automatically while switching flag sets for you.