What's needed for preemptive multitasking?

cgracey · 2014-02-22 20:50

Ahle has mentioned he'd like to make a preemptive multitasking system on the Prop2, so I've been thinking about what might be needed in the chip to facilitate it.

I can make an instruction to swap a task's Z/C/PC with a D register (SWTASK D,S/#, where S=task), but there are some limitations to what can be done with that:

REPS/REPD/TLOCK state information will be lost and not restored later, so these instructions cannot be used in code that is going to be interrupted by SWTASK. Is this a showstopper?

Also, I don't see how each task's PTRA/PTRB can be saved and restored efficiently. Now TLOCK could get us around this, but TLOCK cannot survive a SWTASK.

The trouble with TLOCK is that the instruction can only set a hidden flag to lock the process next time it shows up in stage 0 of the pipeline. If a task executes a TLOCK, but the next task in the pipeline does a SWTASK on that task, the prior task's TLOCK is interrupted before it engages. Had it engaged without interruption, it could have used REPS/REPD safely. I could store that TLOCK-pending flag in bit 29 of D (bits 31/30 = Z/C) , so that it could be restored later. I think I must do this. This way, INDA/INDB/REPS/REPD may be executed in TLOCK'd sections of code that don't get interrupted, but this increases the time granularity of task switching.

What else is to be considered regarding preemptive multitasking?

David Betz · 2014-02-22 21:04

cgracey wrote: »

What else is to be considered regarding preemptive multitasking?

A timer interrupt? :-)

Bill Henning · 2014-02-22 22:32

This is a complicated issue.

It would be even more complicated if there was a desire to have each hardware task be able to run multiple pre-emptive "soft" threads.

Basically first we have to define a task state.

1) Minimum: PC, C, Z, PTRA/B

2) Full: PC, C, Z, PTRA/B/X/Y, contents of AUX, contents of cog

Then there has to be a process table of tasks that are running, waiting for some event, or ready to run

Then we need a pre-emption mechanism (easiest: counter for cycles)

Too much to do it all in hardware for P2.

What I'd do:

- only allow pre-emptive tasking at a whole cog level, otherwise it is a complicated nightmare

- it only makes sense for hubexec code

- add minimal hardware support required to do the rest in software, details follow

New instruction:

MULTITASK cycles_reg, handler_addr_reg

Causes the cog to enter the hub multitasking mode

1) Every (cycles_reg) cycles, the PC, C, Z are saved at handler_addr_reg, and the cog jumps to handler_addr_reg+4

- this jump is delayed if the cog is waiting (WAITxxx) or performing a REPx block, until the wait/repx is completed

- the handler is responsible to maintain the TCB's (task control blocks)

- the handler is responsible for saving/restoring the task state (PTR*, AUX if needed, cog registers)

- the handler selects next task to run from TCB, restores its task state, and runs it

2) goto 1

This way, it maps to pthread, and keeps the hardware changes simple

This should be fairly easy to implement, and by not interrupting WAIT's and REP blocks, maintains sanity...

Simplest mechanism I can think of would be an instruction or cycle counter, ie every N cycles, switch to the next task

cgracey wrote: »

Ahle has mentioned he'd like to make a preemptive multitasking system on the Prop2, so I've been thinking about what might be needed in the chip to facilitate it.

I can make an instruction to swap a task's Z/C/PC with a D register (SWTASK D,S/#, where S=task), but there are some limitations to what can be done with that:

REPS/REPD/TLOCK state information will be lost and not restored later, so these instructions cannot be used in code that is going to be interrupted by SWTASK. Is this a showstopper?

Also, I don't see how each task's PTRA/PTRB can be saved and restored efficiently. Now TLOCK could get us around this, but TLOCK cannot survive a SWTASK.

The trouble with TLOCK is that the instruction can only set a hidden flag to lock the process next time it shows up in stage 0 of the pipeline. If a task executes a TLOCK, but the next task in the pipeline does a SWTASK on that task, the prior task's TLOCK is interrupted before it engages. Had it engaged without interruption, it could have used REPS/REPD safely. I could store that TLOCK-pending flag in bit 29 of D (bits 31/30 = Z/C) , so that it could be restored later. I think I must do this. This way, INDA/INDB/REPS/REPD may be executed in TLOCK'd sections of code that don't get interrupted, but this increases the time granularity of task switching.

What else is to be considered regarding preemptive multitasking?

Edit:

Four instructions makes it better

SETHANDLER handleraddr

MULTITASK cycles, savepczcreg {WC} {WZ}

GETTASKID taskid ' allows the current task to get its taskid

STOPTASK taskid ' goes to handler to kill a task, taskid is TCB entry number

Adding tasks would be done by the handler adding to the TCB

Removing tasks would also be done by the handler (scheduler)

mindrobots · 2014-02-22 22:41

Wow! Talk about throwing a curveball!

A source of interrupts (timed or external events)
Adequate registers in an executive or alternate bank to process the interrupt to the point it is handled or it can be packaged into a switchable task to be handled later
A way to enable interrupts before returning to the interrupted task
A non maskable interrupt to recover from an interrupt loop or storm

Probably more once I recover from the shock!

cgracey · 2014-02-22 22:59

Thanks for responding, Guys.

I have a question: What does preemptive multitasking get us, in terms of special functions or performance? Even Mac and Windows use cooperative multitasking. What's so special about preemptive multitasking?

ozpropdev · 2014-02-22 23:07

cgracey wrote: »

What's so special about preemptive multitasking?

Good question!

Bill Henning · 2014-02-22 23:08

It gets us better performing user-level threads, makes pthreads easier to implement.

Use case examples that would benefit:

- waiting for data from multiple tcp/ip sockets
- allows writing a "real" O/S without jumping through as many loops

I do not think this is needed for P2, SERDES, USB support, using spare hub slots, etc is of much greater and more immediate benefit.

Linux uses pre-emptive multi-tasking, and so does Mac (BSD) and even Windows under the hood... where Windows uses cooperative is picking up mouse etc events from an event queue.

A cooperative system has one major fault - if one thread never releases control the whole system hangs. A common occurance on early Win/Mac boxes

cgracey wrote: »

Thanks for responding, Guys.

I have a question: What does preemptive multitasking get us, in terms of special functions or performance? Even Mac and Windows use cooperative multitasking. What's so special about preemptive multitasking?

msrobots · 2014-02-22 23:20

I think It Is time to stop all this nonsense.

we have 8 cogs. each cog can run 4 hardware threads. If I am not wrong each hardware thread can still use some form of jmpret as prop1.

how many more independent task you need in a Micro Controller?

@chip. PLEASE. concentrate on SERDES and finish this up.

NO hubstealing (P3?), NO mmu (P3?), NO interrupts (never?) , NO whatever.

finish this. we have end of February already.

PLEASE.

Mike

cgracey · 2014-02-22 23:26

msrobots wrote: »

I think It Is time to stop all this nonsense.

we have 8 cogs. each cog can run 4 hardware threads. If I am not wrong each hardware thread can still use some form of jmpret as prop1.

how many more independent task you need in a Micro Controller?

@chip. PLEASE. concentrate on SERDES and finish this up.

NO hubstealing (P3?), NO mmu (P3?), NO interrupts (never?) , NO whatever.

finish this. we have end of February already.

PLEASE.

Mike

I just want to be sure that we don't leave any low-hanging fruit behind after implementing hub-exec, which was a huge improvement to things. I know this is dragging on...

Ramon · 2014-02-22 23:32

I really don't know any special advantage, some people say that it allows a better granularity to allocate time slices to task.

But in the case of propeller I think that this is not needed. Propeller already handles time-sharing/slices. And there is already enough granularity (8 cogs * 4 threads = 32 task).

I would focus on lower power consumption rather than performance.

For example: one application needs 8 task. How can it be done in propeller 2?

Use 8 cogs with single thread. Total task = 8. Power consumption = ?
Use 4 cogs with 2 threads each. Total task = 8. Power consumption = ?
Use 2 cogs with 4 threads each. Total task = 8. Power consumption = ?

If cogs could be shutdown we can reduce power consumption to 1/2 or 1/4.

jmg · 2014-02-22 23:36

cgracey wrote: »

Thanks for responding, Guys.

I have a question: What does preemptive multitasking get us, in terms of special functions or performance? Even Mac and Windows use cooperative multitasking. What's so special about preemptive multitasking?

One side benefit of a preemptive task, is you can construct a good Debug stub.
Of course that will be universally useful.

Then, it can grab any other tasks state, report it, edit it if needed, and restart, all ideally invisibly to the task inspected.(except for lost time)

Debug overhead should be as small as practical, in both total memory terms, HW impact and Time impact.

In P2, I think the lowest time impact is 1/32 ? (unless the other task does a LOCK, then you just have to wait).
Here, either ONE thread needs to be either 100% debug, or the Debug needs to patch-in, in a co-operative way, with whatever low-priority code the user wants in a 4th task.

In SW step thru a LOCK block, the debug stepper would have a couple of choices
* It could simulate the lock, other tasks drop to 0%, Dbg'd task appears to be 100%
( of course, it is not real time, so any external hardware will likely have issues
* It can give a warning, and set a breakpoint after the endlock , now Dbg'd task really does get 100%, but you have zero
visibility during that lock time.

To get below 1/32, I think you would need some additional hardware - maybe the COG ROM (if that is still alive?) could flip-in just enough code to send/receive any register. 1/1000 or 1/10000 could be ok for Debug, and be virtually invisible.
This would not need to patch-in into a 4th task.
Such HW may be too primitive for seriously fast preemptive task swapping, but it would support swap.

jmg · 2014-02-22 23:38

cgracey wrote: »

I just want to be sure that we don't leave any low-hanging fruit behind after implementing hub-exec, which was a huge improvement to things. I know this is dragging on...

I would think about Debug support, but I think there is probably enough Task handling in a Prop to keep most happy.
It may be that simple Debug hardware, can give hooks for a usable Preemptive tasker.

Heater. · 2014-02-22 23:39

What's needed for preemtive mutitasking?

1) A means to stop the currently running code.
2) A means to save the state of that stopped code.
3) A means to restore the state and start it running again.
4) A OS or scheduler to figure out which of the stopped threads it should start at any given time.
5) A separate stack for each thread.

Item 1) is normally an interrupt. Often from a timer tick. Item 2) means saving at least program counter, stack pointers, status flags. Usually you need to save pretty much all the CPU state, after all the new task will want to use those same registers. For Item 4) See Linux, VxWorks orm nany other RTOS.

What's so special about preemptive multitasking?

It means I can write

repeat until done
  'do something

If a high priority task becomes ready to run it immediately kicks me out, no matter if I'm "done" or not. It can then run at full speed until it is done.

My gut tells me that:

a) As we have 8 cores and 4 hardware scheduled threads each preemptive threading is not needed.
b) Don't even think about it, It's going to add a whole bunch of complexity that hardly anyone will use. (Interrupts anyone?)
c) Preemptive is not need for waiting on multiple inputs. We can do that already with tasks on cores or hardware scheduled threads.

It's nuts. Don't even think about it!

Heater. · 2014-02-22 23:42

Bill,

It gets us better performing user-level threads, makes pthreads easier to implement.

Is that true? All that interrupting and state saving is a time waster. Having hardware scheduled threads simultaneously waiting on events is much more efficient that way and has a lot less latency.
Not sure about pthreads but propgcc seems to have no problem spreading pthreads over COGs and I suspect it could do it over hardware scheduled threads just as easily.

potatohead · 2014-02-22 23:44

I don't think this is low hanging fruit, and I second the comments about leaving this off the table for P2.

ozpropdev · 2014-02-22 23:49

Heater. wrote: »

It's nuts. Don't even think about it!

+1

Bill Henning · 2014-02-23 00:04

Btw, I agree that we don't need this, but Chip did ask what it would take.

I did not even entertain the though of hardware thread scheduling - WAY too much logic & work needed.

Heater. wrote: »

Bill,

Is that true? All that interrupting and state saving is a time waster. Having hardware scheduled threads simultaneously waiting on events is much more efficient that way and has a lot less latency.
Not sure about pthreads but propgcc seems to have no problem spreading pthreads over COGs and I suspect it could do it over hardware scheduled threads just as easily.

mindrobots · 2014-02-23 00:07

potatohead wrote: »

I don't think this is low hanging fruit, and I second the comments about leaving this off the table for P2.

It's far from low hanging fruit. I think to be done properly it would need to be the central theme the processor is built around and not just added to an existing architecture to be done improperly it would be tacked on the side and hope something critical wasn't missed. It's a lot of work and silicon for something 9 people are going to use to play with preemptive multi-tasking.

This is still a microcontroller being designed. It can be in a design state forever if the search for low hanging fruit continues. There will always be what ifs. There will always be community members sulking away saying, "gee, I wish it could do this" or "why doesn't it have one or three or 20 of these?".

Where does it stand now in your vision of a P2? Has it met the Parallax design objectives. Because in all honesty it will never meet the design goals of some community members. At some point, the line needs to be drawn and the design phase stopped and the testing needs to start on a frozen design. The paint needs to dry before everyone can ride the ride.

I'm sure these comments are out of place since I'm not an engineer who can discuss verilog issues or select companion chips or how something was done in different processors or don't have any slick code designs to my credit but I can tell when a potato is done and I think with the implementation of 2 or 3 items left on your plate, this potato is done.

Well, I feel better now!

Cluso99 · 2014-02-23 00:28

Heater. wrote: »

b)...(Interrupts anyone?)

Don't even ask that question! UGH!!!

It's nuts. Don't even think about it!

+1

jmg · 2014-02-23 01:45

Heater. wrote: »

If a high priority task becomes ready to run it immediately kicks me out, no matter if I'm "done" or not. It can then run at full speed until it is done.

The latest form of ONE/MULTI locks covers that, it can get 100% of the available slots.
The HW slot mapper is already doing hard-time Multi-tasking.

Q: Is there enough HW support already, to 'reach-into' the other, now paused tasks, and extract and/or change all of their information - including their PC's and Flags ?

Heater. · 2014-02-23 02:52

jmg,

I have not been following things in detail but what you are say sounds like it does what we want stunningly easily.

If I understand you correctly we could have:

1) Two or more hardware scheduled threads running.
2) Say one of them is waiting for an event, a pin change.
3) When that pin change happens the waiting thread changes the hardware scheduling to give itself 100% CPU.
4) It does whatever processing it has to do at full speed.
5) It changes the scheduling back and all the threads resume running, more slowly, together.
6) It waits on the next event.

In that way our critical thread has, in a way, preempted all the others to grab all the CPU capacity for a while.

Is that what you were meaning or am I way off?

If not, can what I describe above be done currently?

P.S. The reason why I have not been following P2 development in detail is that every time I come across YAFI (Yet Another,,,Instruction) being discussed for addition to the design my brain shuts down another notch. The "preempt" business sounds like it would need a bunch more YAFI's.

evanh · 2014-02-23 03:03

I don't think preemptive is so much an advantage over cooperative as much as it's a desire to be experimented with.

jmg might have a decent compromise for the Prop2 with the idea of stipulating a shared 1/16 slice for a supervisory task that can preemptively redirect program flow in the partnering task.

evanh · 2014-02-23 03:09

Heater. wrote: »

In that way our critical thread has, in a way, preempted all the others to grab all the CPU capacity for a while.

I think his idea is to provide unlimited number of preemptive tasks at the kernel level.

evanh · 2014-02-23 03:25

cgracey wrote: »

Even Mac and Windows use cooperative multitasking.

That would be used - MacOS 6 and Win3 respectively, past tense!

Although the old infrastructure is prolly still intact for the old programs to continue to use.

It is interesting, though, that preemptive OSes only run smoothly when running cooperatively at strategic points down a layer or two.

Heater. · 2014-02-23 03:28

evanh,

I think his idea is to provide unlimited number of preemptive tasks at the kernel level.

Yeah, hands up anyone who want's to do that on a P2, which is a microcontroller with limited CPU and RAM resources? Let's see, one......just one then.

Having hundreds of preemptive threads is a sure way of:
1) Eating memory, with all those stacks you need, mostly sitting their idle and wasting space.
2) Creating a slow system, with all the context switching going on.

Even in the Linux world they have cottoned on to this and now build event based web servers, nginx and node.js, that are more efficient than the old process based, Apache, style servers.

Having said that:

If we are running some pthreads. say, from HUB memory, using HUBEXEC How do we swap from on thread to another in the scheduler?

I presume that pthreads used an LMM kernel in the P1 that could check for a reschedule during it's "fetch execute" loop. When we are running with HUB EXEC how is that to be done?

Heater. · 2014-02-23 03:34

Mac and Windows are certainly are not cooperatively scheduling since a long time ago.

Although the old infrastructure is prolly still intact for the old programs to continue to use.

I fairly sure that is not true. It would be shocking if it were. That would be a major security issue.

It is interesting, though, that preemptive OSes only run smoothly when running cooperatively at strategic points down a layer or two.

I don't think this is true. Do you have an example of what you mean?

evanh · 2014-02-23 03:42

Bill Henning wrote: »

A cooperative system has one major fault - if one thread never releases control the whole system hangs. A common occurance on early Win/Mac boxes

There is that I guess. The Prop has a big advantage on that front - A Cog can be restarted from a second Cog at the moment so I guess this is currently providing the redundancy factor for those that are using it that way.

evanh · 2014-02-23 03:46

Heater. wrote: »

I don't think this is true. Do you have an example of what you mean?

I'm generally just referring to yielding. Yielding occurs in many places under the hood whether the coder is expecting it or not.

EDIT: To clarify, I mean cooperation is inherent in hand-overs of yielding within the many layers of an OS. It's nothing to do with the kernel in this case.

Heater. · 2014-02-23 03:54

Yes, "yeilding" can happen. When reading from a socket or file for example. It is assumed that the thread/process has nothing else to do until the data arrives so it's time to see if anyone else can run. That's not the kind of rescheduling that distinguishes "preemptive" from "non-preemptive". They generally both do that.

Which is related to my question. When multiple pthreads are running in HUBEXEC mode how do we get that yield or thread reschedule done? Still a question in non-preemptive scheduling.

evanh · 2014-02-23 04:11

Hehe, I certainly wasn't saying they're not preemptive, just that the primary feature of cooperative multitasking is still present in preemptive world. It could be said that preemptive multitasking is the super-set.

evanh · 2014-02-23 04:18

Heater. wrote: »

Which is related to my question. When multiple pthreads are running in HUBEXEC mode how do we get that yield or thread reschedule done? Still a question in non-preemptive scheduling.

jmg has given a possible solution to creating preemption. That would be doable assuming his question is answered with a, yes, we can read and write the other PC/flags in a Cog.

What's needed for preemptive multitasking?

Comments