What's needed for preemptive multitasking?
cgracey
Posts: 14,151
Ahle has mentioned he'd like to make a preemptive multitasking system on the Prop2, so I've been thinking about what might be needed in the chip to facilitate it.
I can make an instruction to swap a task's Z/C/PC with a D register (SWTASK D,S/#, where S=task), but there are some limitations to what can be done with that:
REPS/REPD/TLOCK state information will be lost and not restored later, so these instructions cannot be used in code that is going to be interrupted by SWTASK. Is this a showstopper?
Also, I don't see how each task's PTRA/PTRB can be saved and restored efficiently. Now TLOCK could get us around this, but TLOCK cannot survive a SWTASK.
The trouble with TLOCK is that the instruction can only set a hidden flag to lock the process next time it shows up in stage 0 of the pipeline. If a task executes a TLOCK, but the next task in the pipeline does a SWTASK on that task, the prior task's TLOCK is interrupted before it engages. Had it engaged without interruption, it could have used REPS/REPD safely. I could store that TLOCK-pending flag in bit 29 of D (bits 31/30 = Z/C), so that it could be restored later. I think I must do this. This way, INDA/INDB/REPS/REPD may be executed in TLOCK'd sections of code that don't get interrupted, but this increases the time granularity of task switching.
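To make that layout concrete, here is a rough C sketch of how the single 32-bit word swapped by the proposed SWTASK could be packed and unpacked. Only bits 31/30/29 (Z, C, TLOCK-pending) come from the description above; the PC field width is an assumption for illustration.

#include <stdint.h>

#define TASK_Z_BIT      (1u << 31)   /* saved Z flag */
#define TASK_C_BIT      (1u << 30)   /* saved C flag */
#define TASK_TLOCK_BIT  (1u << 29)   /* assumed position of the pending-TLOCK flag */
#define TASK_PC_MASK    0x0001FFFFu  /* assumed PC field width */

static inline uint32_t pack_task(int z, int c, int tlock_pending, uint32_t pc)
{
    return (z ? TASK_Z_BIT : 0) |
           (c ? TASK_C_BIT : 0) |
           (tlock_pending ? TASK_TLOCK_BIT : 0) |
           (pc & TASK_PC_MASK);
}

static inline uint32_t task_pc(uint32_t state) { return state & TASK_PC_MASK; }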
What else is to be considered regarding preemptive multitasking?
Comments
It would be even more complicated if there was a desire to have each hardware task be able to run multiple pre-emptive "soft" threads.
Basically first we have to define a task state.
1) Minimum: PC, C, Z, PTRA/B
2) Full: PC, C, Z, PTRA/B/X/Y, contents of AUX, contents of cog
Then there has to be a process table of tasks that are running, waiting for some event, or ready to run
Then we need a pre-emption mechanism (easiest: counter for cycles)
Too much to do it all in hardware for P2.
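To make the "minimum" state and process table above concrete, here is a rough sketch (mine, not a Parallax design) of how a software scheduler might keep it in hub RAM; field names and sizes are illustrative assumptions.

#include <stdint.h>

typedef enum { TASK_FREE, TASK_READY, TASK_RUNNING, TASK_WAITING } task_state_t;

typedef struct {
    uint32_t     pc;     /* saved program counter       */
    uint8_t      c, z;   /* saved carry and zero flags  */
    uint32_t     ptra;   /* saved pointer registers     */
    uint32_t     ptrb;
    task_state_t state;  /* free / ready / running / waiting */
} tcb_t;

#define MAX_TASKS 8                      /* assumed table size */
static tcb_t process_table[MAX_TASKS];   /* the process table  */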
What I'd do:
- only allow pre-emptive tasking at a whole cog level, otherwise it is a complicated nightmare
- it only makes sense for hubexec code
- add minimal hardware support required to do the rest in software, details follow
New instruction:
MULTITASK cycles_reg, handler_addr_reg
Causes the cog to enter the hub multitasking mode
1) Every (cycles_reg) cycles, the PC, C, Z are saved at handler_addr_reg, and the cog jumps to handler_addr_reg+4
- this jump is delayed if the cog is waiting (WAITxxx) or performing a REPx block, until the wait/repx is completed
- the handler is responsible for maintaining the TCBs (task control blocks)
- the handler is responsible for saving/restoring the task state (PTR*, AUX if needed, cog registers)
- the handler selects next task to run from TCB, restores its task state, and runs it
2) goto 1
This way, it maps to pthreads, and keeps the hardware changes simple
This should be fairly easy to implement, and by not interrupting WAIT's and REP blocks, maintains sanity...
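Here is a minimal sketch of the bookkeeping such a handler might do, continuing the TCB layout sketched earlier. The actual save/restore of PTRA/PTRB, AUX and cog registers, and the final jump back into the task, would have to be hand-written PASM; none of this is a proposed Parallax design.

static int current = 0;   /* index of the task that was just interrupted */

/* Called with the PC/C/Z that the hardware dumped at handler_addr_reg. */
int scheduler_tick(uint32_t saved_pc, int saved_c, int saved_z)
{
    /* 1: record the interrupted task's state in its TCB */
    process_table[current].pc = saved_pc;
    process_table[current].c  = (uint8_t)saved_c;
    process_table[current].z  = (uint8_t)saved_z;
    if (process_table[current].state == TASK_RUNNING)
        process_table[current].state = TASK_READY;

    /* 2: round-robin to the next ready task (assumes an always-ready idle task) */
    do {
        current = (current + 1) % MAX_TASKS;
    } while (process_table[current].state != TASK_READY);

    process_table[current].state = TASK_RUNNING;
    return current;   /* the handler restores this task's state and resumes it */
}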
Simplest mechanism I can think of would be an instruction or cycle counter, ie every N cycles, switch to the next task
Edit:
Four instructions makes it better
SETHANDLER handleraddr
MULTITASK cycles, savepczcreg {WC} {WZ}
GETTASKID taskid ' allows the current task to get its taskid
STOPTASK taskid ' goes to handler to kill a task, taskid is TCB entry number
Adding tasks would be done by the handler adding to the TCB
Removing tasks would also be done by the handler (scheduler)
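Continuing the same sketch, adding and removing tasks is then just TCB bookkeeping in the handler; the function names here are illustrative, not proposed hardware.

/* Add a task: find a free TCB slot; the slot index serves as the task id. */
int tcb_add(uint32_t start_pc)
{
    for (int i = 0; i < MAX_TASKS; i++) {
        if (process_table[i].state == TASK_FREE) {
            process_table[i].pc    = start_pc;
            process_table[i].c     = 0;
            process_table[i].z     = 0;
            process_table[i].state = TASK_READY;
            return i;
        }
    }
    return -1;   /* table full */
}

/* Remove a task (e.g. in response to STOPTASK): just free its TCB entry. */
void tcb_remove(int taskid)
{
    process_table[taskid].state = TASK_FREE;
}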
A source of interrupts (timed or external events)
Adequate registers in an executive or alternate bank to process the interrupt, either to the point where it is handled or to where it can be packaged into a switchable task to be handled later
A way to enable interrupts before returning to the interrupted task
A non-maskable interrupt to recover from an interrupt loop or storm
Probably more once I recover from the shock!
I have a question: What does preemptive multitasking get us, in terms of special functions or performance? Even Mac and Windows use cooperative multitasking. What's so special about preemptive multitasking?
Good question!
Use case examples that would benefit:
- waiting for data from multiple tcp/ip sockets
- allows writing a "real" O/S without jumping through as many loops
I do not think this is needed for P2; SERDES, USB support, using spare hub slots, etc. are of much greater and more immediate benefit.
Linux uses pre-emptive multi-tasking, and so does Mac (BSD) and even Windows under the hood... where Windows uses cooperative multitasking is in picking up mouse etc. events from an event queue.
A cooperative system has one major fault - if one thread never releases control, the whole system hangs. A common occurrence on early Win/Mac boxes.
We have 8 cogs. Each cog can run 4 hardware threads. If I am not wrong, each hardware thread can still use some form of JMPRET, as on the Prop1.
How many more independent tasks do you need in a microcontroller?
@chip. PLEASE. concentrate on SERDES and finish this up.
NO hubstealing (P3?), NO mmu (P3?), NO interrupts (never?), NO whatever.
Finish this. We are already at the end of February.
PLEASE.
Mike
I just want to be sure that we don't leave any low-hanging fruit behind after implementing hub-exec, which was a huge improvement to things. I know this is dragging on...
But in the case of the Propeller I think that this is not needed. The Propeller already handles time-sharing/slices. And there is already enough granularity (8 cogs * 4 threads = 32 tasks).
I would focus on lower power consumption rather than performance.
For example: one application needs 8 tasks. How can it be done on the Propeller 2?
Use 8 cogs with a single thread each. Total tasks = 8. Power consumption = ?
Use 4 cogs with 2 threads each. Total tasks = 8. Power consumption = ?
Use 2 cogs with 4 threads each. Total tasks = 8. Power consumption = ?
If cogs could be shut down, we could reduce power consumption to 1/2 or 1/4.
One side benefit of preemptive tasking is that you can construct a good Debug stub.
Of course that will be universally useful.
Then, it can grab any other task's state, report it, edit it if needed, and restart it, all ideally invisible to the task being inspected (except for lost time).
Debug overhead should be as small as practical, in both total memory terms, HW impact and Time impact.
In P2, I think the lowest time impact is 1/32 ? (unless the other task does a LOCK, then you just have to wait).
Here, either ONE thread needs to be 100% debug, or the Debug needs to patch-in, in a co-operative way, with whatever low-priority code the user wants in a 4th task.
To SW-step through a LOCK block, the debug stepper would have a couple of choices:
* It could simulate the lock: other tasks drop to 0%, and the Dbg'd task appears to be 100% (of course, it is not real time, so any external hardware will likely have issues).
* It could give a warning and set a breakpoint after the endlock: now the Dbg'd task really does get 100%, but you have zero visibility during that lock time.
To get below 1/32, I think you would need some additional hardware - maybe the COG ROM (if that is still alive?) could flip-in just enough code to send/receive any register. 1/1000 or 1/10000 could be ok for Debug, and be virtually invisible.
This would not need to patch-in into a 4th task.
Such HW may be too primitive for seriously fast preemptive task swapping, but it would support swap.
I would think about Debug support, but I think there is probably enough Task handling in a Prop to keep most happy.
It may be that simple Debug hardware, can give hooks for a usable Preemptive tasker.
1) A means to stop the code that is currently running.
2) A means to save the state of that stopped code.
3) A means to restore the state and start it running again.
4) A OS or scheduler to figure out which of the stopped threads it should start at any given time.
5) A separate stack for each thread.
Item 1) is normally an interrupt, often from a timer tick. Item 2) means saving at least the program counter, stack pointers, and status flags. Usually you need to save pretty much all the CPU state; after all, the new task will want to use those same registers. For Item 4) see Linux, VxWorks or many other RTOSes.
It means I can write my code without worrying about yielding: if a high-priority task becomes ready to run, it immediately kicks me out, no matter whether I'm "done" or not. It can then run at full speed until it is done.
My gut tells me that:
a) As we have 8 cores and 4 hardware scheduled threads each preemptive threading is not needed.
b) Don't even think about it. It's going to add a whole bunch of complexity that hardly anyone will use. (Interrupts, anyone?)
c) Preemptive is not needed for waiting on multiple inputs. We can do that already with tasks on cores or hardware-scheduled threads.
It's nuts. Don't even think about it!
Not sure about pthreads but propgcc seems to have no problem spreading pthreads over COGs and I suspect it could do it over hardware scheduled threads just as easily.
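For reference, this is the plain POSIX pthread pattern under discussion (nothing P2-specific; per the post above, propgcc can already spread these across cogs):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    printf("worker %ld running\n", (long)arg);   /* each worker could land on its own cog */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}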
I did not even entertain the thought of hardware thread scheduling - WAY too much logic & work needed.
It's far from low-hanging fruit. To be done properly, it would need to be the central theme the processor is built around, not just added to an existing architecture; done improperly, it would be tacked onto the side with a hope that nothing critical was missed. It's a lot of work and silicon for something 9 people are going to use to play with preemptive multi-tasking.
This is still a microcontroller being designed. It can be in a design state forever if the search for low hanging fruit continues. There will always be what ifs. There will always be community members sulking away saying, "gee, I wish it could do this" or "why doesn't it have one or three or 20 of these?".
Where does it stand now in your vision of a P2? Has it met the Parallax design objectives? Because in all honesty it will never meet the design goals of some community members. At some point, the line needs to be drawn, the design phase stopped, and the testing started on a frozen design. The paint needs to dry before everyone can ride the ride.
I'm sure these comments are out of place, since I'm not an engineer who can discuss Verilog issues, select companion chips, or explain how something was done in different processors, and I don't have any slick code designs to my credit. But I can tell when a potato is done, and with the implementation of the 2 or 3 items left on your plate, I think this potato is done.
Well, I feel better now!
The latest form of ONE/MULTI locks covers that; it can get 100% of the available slots.
The HW slot mapper is already doing hard-time Multi-tasking.
Q: Is there enough HW support already to 'reach into' the other, now-paused tasks and extract and/or change all of their information - including their PCs and Flags?
I have not been following things in detail, but what you are saying sounds like it does what we want stunningly easily.
If I understand you correctly we could have:
1) Two or more hardware scheduled threads running.
2) Say one of them is waiting for an event, a pin change.
3) When that pin change happens the waiting thread changes the hardware scheduling to give itself 100% CPU.
4) It does whatever processing it has to do at full speed.
5) It changes the scheduling back and all the threads resume running, more slowly, together.
6) It waits on the next event.
In that way our critical thread has, in a way, preempted all the others to grab all the CPU capacity for a while.
Is that what you were meaning or am I way off?
If not, can what I describe above be done currently?
P.S. The reason why I have not been following P2 development in detail is that every time I come across YAFI (Yet Another ... Instruction) being discussed for addition to the design, my brain shuts down another notch. The "preempt" business sounds like it would need a bunch more YAFIs.
jmg might have a decent compromise for the Prop2 with the idea of stipulating a shared 1/16 slice for a supervisory task that can preemptively redirect program flow in the partnering task.
I think his idea is to provide unlimited number of preemptive tasks at the kernel level.
That would have been MacOS 6 and Win3 respectively - past tense! Although the old infrastructure is prolly still intact for the old programs to continue to use.
It is interesting, though, that preemptive OSes only run smoothly when running cooperatively at strategic points down a layer or two.
Having hundreds of preemptive threads is a sure way of:
1) Eating memory, with all those stacks you need, mostly sitting there idle and wasting space.
2) Creating a slow system, with all the context switching going on.
Even in the Linux world they have cottoned on to this and now build event-based web servers (nginx and node.js) that are more efficient than the old process-based, Apache-style servers.
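One way to see the memory point: every pthread gets its own stack (often megabytes by default on desktop Linux), so hundreds of threads means hundreds of mostly idle stacks. POSIX lets you shrink them, but only so far - a small sketch using only standard calls:

#include <pthread.h>
#include <limits.h>

/* Prepare an attribute that gives new threads a near-minimum stack;
   pass it as the second argument to pthread_create(). */
static void init_small_stack_attr(pthread_attr_t *a)
{
    pthread_attr_init(a);
    pthread_attr_setstacksize(a, PTHREAD_STACK_MIN + 4096);
}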
Having said that:
If we are running some pthreads, say, from HUB memory using HUBEXEC, how do we swap from one thread to another in the scheduler?
I presume that pthreads used an LMM kernel on the P1 that could check for a reschedule during its "fetch execute" loop. When we are running with HUBEXEC, how is that to be done?
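For anyone not familiar with the LMM pattern being referred to, here is a conceptual C model of it. Real LMM kernels are hand-written PASM, and the helper functions here (fetch, execute, schedule) are hypothetical, purely for illustration. The point is that the kernel sits between instructions and so has a natural place to test a reschedule flag; with HUBEXEC the hardware fetches directly, which is exactly why the question arises.

#include <stdint.h>

extern volatile int reschedule_requested;             /* set by a timer or another cog */
extern uint32_t fetch(uint32_t pc);                   /* hypothetical: read next instruction from hub */
extern uint32_t execute(uint32_t instr, uint32_t pc); /* hypothetical: run it, return next pc */
extern uint32_t schedule(uint32_t pc);                /* hypothetical: save pc, return next thread's pc */

void lmm_loop(uint32_t pc)
{
    for (;;) {
        uint32_t instr = fetch(pc);
        pc = execute(instr, pc);
        if (reschedule_requested)     /* the hook that HUBEXEC does not naturally provide */
            pc = schedule(pc);
    }
}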
I'm fairly sure that is not true. It would be shocking if it were. That would be a major security issue.
I don't think this is true. Do you have an example of what you mean?
There is that I guess. The Prop has a big advantage on that front - A Cog can be restarted from a second Cog at the moment so I guess this is currently providing the redundancy factor for those that are using it that way.
I'm generally just referring to yielding. Yielding occurs in many places under the hood whether the coder is expecting it or not.
EDIT: To clarify, I mean cooperation is inherent in hand-overs of yielding within the many layers of an OS. It's nothing to do with the kernel in this case.
Which is related to my question. When multiple pthreads are running in HUBEXEC mode how do we get that yield or thread reschedule done? Still a question in non-preemptive scheduling.
jmg has given a possible solution to creating preemption. That would be doable, assuming his question is answered with a "yes, we can read and write the other task's PC/flags in a Cog".