Observations of a multi-tasker - Page 4 — Parallax Forums

Observations of a multi-tasker


Comments

  • David Betz Posts: 14,511
    edited 2013-09-20 07:50
    cgracey wrote: »
    The PASM/Spin pin conflict is not an issue, I think, because it would be very bad practice to modify the same pin(s) from PASM and Spin. It would be as silly as having multiple cogs try to manipulate the same pin(s). I see David Betz's point about C needing separate IN/OUT registers, but that would be a very deep change at this time, and I feel too constrained to attempt it.
    That's okay. We can handle it in C by using macros or inline functions. I wouldn't want to put P2 at risk for this.
  • ozpropdev Posts: 2,791
    edited 2013-09-20 07:58
    cgracey wrote: »
    If we could use a 40nm process, we could do 32 cogs of logic at 1GHz in 3.6 square millimetres.

    Now you're just teasing us, Chip!
  • David Betz Posts: 14,511
    edited 2013-09-20 07:59
    ozpropdev wrote: »
    Now you're just teasing us, Chip!
    I see TSMC now has a 16nm FinFET process. What could we do with that? :-)
  • cgracey Posts: 14,133
    edited 2013-09-20 07:59
    Seairth wrote: »
    Back on the original question about modifying the way that tasks are scheduled, I have a thought about how HUBOP behavior could be changed to reduce stalls. In stage four of the pipeline, if the cog is more than 3 cycles away from its access window, cancel all operations in the pipeline for that task and set the PC to the address of the HUBOP that was canceled. This would allow the other tasks to continue executing, and the HUBOP would still get queued back up in time for its access window. This assumes, of course, that the task in question is scheduled at least every fourth clock cycle and that none of the other tasks have multi-cycle instructions in the pipeline.

    Incidentally, this would also work for the trivial case where only one task is running.

    Interesting idea. I could see a task getting unfortunately stalled for much longer if other-task hub instructions in the pipe kept falling into the time window, kicking the first guy back over and over.

    All these schemes to accomplish hub accesses without pipe stalls suffer the problem of potentially creating worse delays. The idea of notifying the hub of your intent to do a hub operation, and then picking up the result later (upon condition of it being done), can become worse than it is now when 4 tasks try to do hub ops at once. Instead of everyone waiting their turn, they'd have to poll for what would end up taking just as long in the hub.
  • cgracey Posts: 14,133
    edited 2013-09-20 08:03
    David Betz wrote: »
    I see TSMC now has a 16nm FinFET process. What could we do with that? :-)

    I'd imagine we could get near 2GHz. It's only a matter of a few million dollars, which none of us have.
  • David Betz Posts: 14,511
    edited 2013-09-20 08:05
    cgracey wrote: »
    I'd imagine we could get near 2GHz. It's only a matter of a few million dollars, which none of us have.
    You could do a Kickstarter like Parallella did.
  • potatohead Posts: 10,254
    edited 2013-09-20 08:33
    Ha!

    Revisiting the Kickstarter idea once working P2 chips are out there may be worth doing...

    Beau was showing us lots of filler cells inserted to meet density requirements. Seems to me, there is room on that basis. There would just be fewer of those. Critical timing paths seem to be the real consideration, which the synthesis could solve.

    I would wonder what impact more flip flops would have on timing and stability. Would max clock speed change, or would operating range change?
  • David Betz Posts: 14,511
    edited 2013-09-20 08:45
    potatohead wrote: »
    Ha!

    Revisiting the Kickstarter idea once working P2 chips are out there may be worth doing...

    Beau was showing us lots of filler cells inserted to meet density requirements. Seems to me, there is room on that basis. There would just be fewer of those. Critical timing paths seem to be the real consideration, which the synthesis could solve.

    I would wonder what impact more flip flops would have on timing and stability. Would max clock speed change, or would operating range change?
    Get the 180nm version of P2 out and then do a Kickstarter to make a 40nm chip with four P2's on it and the inter-Prop communications links wired up between them!
  • Seairth Posts: 2,474
    edited 2013-09-20 09:38
    It seems to me that the challenge with multitasking is the difficulty of writing deterministic code, primarily due to the interleaving of the tasks. So what if you were to keep the task registers, but use cooperative switching instead? Each task would run until it yields to the next task, which could either be explicit (in the case of a "control" task) or implicit (for simple round-robin scheduling). Note that this is distinctly different from using CALLA/RETA, since a task wouldn't necessarily switch back to the prior task. For instance, you would be able to do the following pattern:

    Task A: based on conditions, explicitly switch to Task B, C, or D
    Task B: implicitly switch to next task (C)
    Task C: implicitly switch to next task (D)
    Task D: implicitly switch to next task (A)

    With the earlier discussion about space invaders, it seems to me that this approach would make it easier to time the other tasks during the video sync periods.

    Pros:
    * Better determinism
    * The pipeline is slightly simplified since it no longer has to deal with interleaved tasks
    * The TASK register could simply be a bitmap indicating which tasks are active, allowing up to 32 tasks (with the addition of Z/C/PC holding registers)

    Cons:
    * You lose one clock cycle due to the explicit YIELD instruction
    * There would be a three-clock stall due to a pipeline flush (except the trivial case of a single task switching to itself), though maybe there could be a "delayed" variant of the instruction
    * Interleaving I/O and HUB instructions, etc. would require more explicit coding (I'm not sure this is really a con)
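Seairth's switching pattern above can be sketched as a toy Python model (purely illustrative, not real P2 hardware; the task names, the round-robin order, and the generator-based scheduler are all this example's inventions):

```python
# Toy cooperative scheduler: each task runs until it yields, and a yield
# either names the next task explicitly (a "control" task) or defaults
# to round-robin order -- Seairth's Task A..D pattern.

def make_task(name, log, next_override=None):
    """A toy task: does one unit of work per turn, then yields control.

    Yields the name of the task to switch to, or None for implicit
    round-robin to the next task in order.
    """
    while True:
        log.append(name)        # the "work" done in this time slice
        yield next_override     # explicit target, or None = next in order

def run(tasks, order, steps):
    """Cooperative scheduling: no preemption, switches only on yield."""
    log = []
    gens = {name: make_task(name, log, nxt) for name, nxt in tasks.items()}
    current = order[0]
    for _ in range(steps):
        target = next(gens[current])
        if target is None:      # implicit: round-robin to the next task
            current = order[(order.index(current) + 1) % len(order)]
        else:                   # explicit: control task picks its successor
            current = target
    return log

# Task A explicitly hands off to C; B, C, D switch implicitly.
schedule = run({"A": "C", "B": None, "C": None, "D": None},
               order=["A", "B", "C", "D"], steps=8)
# schedule is A, C, D, A, C, D, A, C -- note B never runs, because the
# control task A always routes around it.
```

The point of the sketch is that the running order is fully determined by the yields themselves, with no interleaving imposed by hardware.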
  • ctwardell Posts: 1,716
    edited 2013-09-20 10:28
    Seairth,

    Fine-grained switching would use a lot of yields; with limited COG memory, IMO that would not be a good use of space.

    Chris Wardell
  • Heater. Posts: 21,230
    edited 2013-09-20 10:32
    Seairth,
    It seems to me that the challenge with multitasking is the difficulty of writing deterministic code, primarily due to the interleaving of the tasks.
    Yes and no. If every instruction takes the same time and the hardware task switcher swaps to the next task after every instruction, then the timing is 100% deterministic for each task. Everything just runs 2 or 3 or 4 times slower. This is what happens in the XMOS devices for example.

    This can be done on the P2.

    The problem comes when instructions take a different time to execute, like HUB ops that are dependent on "external" HUB timing, and, as we see, when tasks end up having to "queue up" waiting for their HUB slots.

    I might be wrong here (Chip, please correct me), but it seems to me that if only one thread interacts with the HUB, then the others will actually run with deterministic timing at one half, one third or one quarter speed.
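Heater's slowdown claim is easy to check with a toy model (illustrative Python, not cycle-accurate P2: it assumes one-clock instructions and a hardware switch to the next task on every clock):

```python
# With N tasks round-robin'd one clock apart and every instruction taking
# one clock, each task's instructions land exactly N clocks apart: fully
# deterministic, just N times slower.

def instruction_clocks(task, num_tasks, instructions):
    """Clocks at which `task` executes, one-clock-per-task round-robin."""
    return [task + n * num_tasks for n in range(instructions)]

# Task 1 of 4 runs at clocks 1, 5, 9, 13, 17: a fixed 4-clock cadence,
# regardless of what the other three tasks are doing.
cadence = instruction_clocks(1, 4, 5)
gaps = [b - a for a, b in zip(cadence, cadence[1:])]
# gaps == [4, 4, 4, 4]
```

The determinism breaks exactly where the post says it does: the moment one instruction (a HUB op) takes a variable number of clocks, the fixed cadence no longer holds.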
    So what if you were to keep the task registers, but use cooperative switching
    instead?
    This might help in this case. As far as I know, cooperative multi-tasking is possible on the P2. That is, after all, how the tasking was first implemented on the P2: there were the task registers and a task switch instruction (TASKSW?) to yield control to another task. Given that, perhaps the video thread in Space Invaders can keep control until the frame is over and then let the other threads run for a bit. Provided those threads don't mind being hung up for a whole frame time, it might work out.

    As you say the cons of cooperative scheduling are:

    1) The loss of overall execution speed due to having to insert TASKSW instructions throughout your code.
    2) The increase in latency to events due to having to wait for the running thread to yield control.
    3) Uses up valuable code space with those TASKSW instructions.

    That is why automatic task switching was implemented.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-09-20 10:35
    Seairth wrote:
    It seems to me that the challenge with multitasking is the difficulty of writing deterministic code,
    That depends upon which mechanism has priority: the task switcher or the pipeline. If the pipeline is allowed to stall task switching, then determinism suffers. But if the task switcher runs unabated, determinism is preserved. IOW, if a task is waiting for a hub slot, for example, the task switcher moves on, rather than waiting for the hub op to complete.

    This can cause trouble, though, if a task is permanently out of phase with the hub commutator, in that it will never get a turn at the trough. This is where Chip's 15-slot task idea comes into play. Since 15 and the number of clocks between hub slots are relatively prime, every task is guaranteed a shot at the hub.

    The other issue involves the wait instructions. It might be possible for a wait condition to come and go (e.g. transient pin input) between "turns" while other tasks are being executed. There are two choices available here: 1) specify that conditions have to persist long enough to span task accesses, or 2) include a mechanism to trap the desired conditions between task accesses.

    -Phil
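Phil's transient-condition point can be demonstrated with a small simulation (a hypothetical model, not actual P2 behavior; the pulse timing, sampling period, and latch mechanism are invented for illustration):

```python
# A task only samples its wait condition on its own turns (every `period`
# clocks). A pin pulse shorter than that period is missed unless some
# hardware latch traps the edge between turns -- Phil's choice (2).

def saw_event(pulse_clocks, period, latched):
    """Return True if a task slotted every `period` clocks sees the pulse."""
    latch = False
    for clock in range(32):
        pin_high = clock in pulse_clocks
        if pin_high:
            latch = True                     # edge trapped in "hardware"
        if clock % period == 0:              # this task's turn to sample
            if pin_high or (latched and latch):
                return True
            # without a latch, the transient is simply gone by now
    return False

# A one-clock pulse at clock 2; the task samples at clocks 0, 4, 8, ...
missed = saw_event({2}, period=4, latched=False)   # False: pulse lost
caught = saw_event({2}, period=4, latched=True)    # True: latch traps it
```

Phil's choice (1), requiring the condition to persist across turns, corresponds to widening `pulse_clocks` until it spans a sampling point.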
  • Heater. Posts: 21,230
    edited 2013-09-20 10:50
    Phil,

    Bah! I don't care about your relative primes. As soon as a thread has to do something interesting, like wait for a pin event or wait to access HUB, its phase with respect to every other thread is thrown all over the place.
    It might be possible for a wait condition to come and go (e.g. transient pin input) between "turns" while other tasks are being executed.
    Whoa, what! Is that true? Are you saying the wait condition is not latched somewhere and a task may miss it?
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-09-20 11:02
    Heater,

    'Not sure how it's done now. My post dealt with the -- possibly hypothetical -- situation where the scheduler has priority over the pipeline. IOW, these are the things that can happen if the pipeline is not allowed to stall the task switcher. And, yes, numbers that are relatively prime do play a role there.

    -Phil
  • Seairth Posts: 2,474
    edited 2013-09-20 11:15
    ctwardell wrote: »
    Fine-grained switching would use a lot of yields; with limited COG memory, IMO that would not be a good use of space.

    I think that would only be an issue if you were trying to manually interleave tasks (i.e. simulating the current approach), which I think would be an edge case. In many (most?) cases, interleaving isn't necessary. Thinking on this further, it also occurs to me that:

    * The YIELD instruction could optionally set the PC for the current task while switching to the next task. This would allow the same task to be executed over and over again without the need for an additional branch instruction (and subsequent pipeline stall).

    * This might allow the creation of "task-friendly" OBEX objects. For instance, a keyboard handler could execute a yield instruction after processing input, allowing other tasks to run. In the event that the cog is single-tasked, the yield would effectively be a NOP and the remaining code would continue to run.


    Another point I'd like to make is that the current task switching approach ends up being a simplified interrupt mechanism. In other words, a given task doesn't control when it's executing or when another task will cause the pipeline to stall. The cooperative approach is (in my mind, at least) much more in keeping with the original philosophy of the Propeller.
  • KC_Rob Posts: 465
    edited 2013-09-20 11:15
    Seairth wrote: »
    How much smaller would the cogs be if tasking support was removed altogether? Enough to squeeze in additional cogs? Okay, I know that's not likely to happen, but I know which of the two approaches I prefer.
    I suspect we're in agreement. :)
  • KC_Rob Posts: 465
    edited 2013-09-20 11:30
    Seairth wrote: »
    Another point I'd like to make is that the current task switching approach ends up being a simplified interrupt mechanism. In other words, a given task doesn't control when it's executing or when another task will cause the pipeline to stall. The cooperative approach is (in my mind, at least) much more in keeping with the original philosophy of the Propeller.
    I've always felt, intuitively at any rate, that all this intra-cog tasking business runs counter to the Propeller philosophy. On another note, I generally favor cooperative tasking, wherever possible, for my embedded projects, regardless of the uC used. With careful planning, cooperative tasking is generally sufficient for whatever chore is at hand, and nearly always not so unwieldy as other options. Of course it's not the best (or even a viable) solution to every problem, but in my experience more often than not it does the job seamlessly.
  • Heater. Posts: 21,230
    edited 2013-09-20 11:37
    @Phil,
    ...numbers that are relatively prime do play a role there.
    I don't get it. You have to explain that very slowly in big type.
    As far as I can tell, these hardware-scheduled threads run with total disregard for each other's timing. If threads are waiting for pin events, then there is no telling who gets to run when. I don't see how any fancy prime number relations get around that.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-09-20 11:43
    Heater,

    As I've patiently tried to explain, the "relatively prime" thing applies to my hypothetical task switcher where threads do not stall the scheduler for pin events or anything else. If the task switcher encounters a waiting thread, and the wait condition is not yet satisfied, it's hasta luego.

    -Phil
  • Bill Henning Posts: 6,445
    edited 2013-09-20 11:44
    Guys,

    The hardware threads are a huge feature, and extremely useful for having up to four drivers in one cog. Removing it would significantly cripple Prop2.

    Note that the invaders one cog demo likely would never have been done without the hardware task switching.

    As the threading consumes less than 1% of the transistors in a cog (mentioned above), it is a very powerful, easy-to-use feature, perfect for multiple drivers per cog.

    I have not looked at the invaders code, but I suspect setting up a producer thread (to RDQUAD into the CLUT) and having a different thread WAITVID a scan line at a time from the CLUT would make the issues disappear.
  • Heater. Posts: 21,230
    edited 2013-09-20 11:46
    KC_Rob.

    Cooperative multi-tasking is a wonderful thing.

    However if you have no interrupts it means that your cooperative tasks have to be able to poll I/O at a rate fast enough to satisfy the I/O timing requirements.

    That means your cooperative threads have to "yield" or "suspend" more and more frequently to meet tighter and tighter timing.

    Eventually you are wasting all your time executing yield or suspend so as to poll something that may not be there.

    The Prop's hardware-scheduled threads are an elegant solution to that issue: they remove the need for yield instructions, reduce latency to events all around, and save both the code space and the time wasted on lots of yields.

    Looks perfectly in line with the Prop philosophy.
  • Heater. Posts: 21,230
    edited 2013-09-20 11:48
    Phil,

    OK thanks, I think I'm almost with the idea.
  • KC_Rob Posts: 465
    edited 2013-09-20 12:04
    Heater. wrote: »
    KC_Rob.

    Cooperative multi-tasking is a wonderful thing.

    However if you have no interrupts [with most controllers you do] it means that your cooperative tasks have to be able to poll I/O at a rate fast enough to satisfy the I/O timing requirements.

    That means your cooperative threads have to "yield" or "suspend" more and more frequently to meet tighter and tighter timing.
    Yet isn't this what COGS are for... the Propeller way?
    Looks perfectly in line with the Prop philosophy.
    A more ideal fit, to my mind, would simply be to have enough COGS for the job at hand.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-09-20 12:10
    Heater,

    Here's an illustration of the "relatively prime" point I tried to make:
    H.......H.......H.......H.......
    01230123012301230123012301230123
    
    H.......H.......H.......H.......
    01201201201201201201201201201201
    

    In the top version, the task list has 16 slots cycling four tasks. Only task 0 ever gets hub access, because 16 and the eight-clock hub period share a common factor, leaving the other tasks persistently out of phase with the hub window.

    In the lower version, the task list has 15 slots cycling three tasks, each of which gets an equal shot at the hub. This works because 15 and 8 (the hub period) are relatively prime.

    -Phil
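Phil's two diagrams can be replayed as a quick simulation (a sketch using his numbers: an eight-clock hub period and a one-clock-per-task round-robin switcher; the clock count is arbitrary):

```python
# Which tasks ever coincide with the hub access window, given a hub
# window every `hub_period` clocks and tasks advancing one per clock?

def tasks_that_get_hub(num_tasks, hub_period=8, clocks=120):
    """Return the set of tasks that ever land on an open hub window."""
    return {clock % num_tasks               # task running on this clock
            for clock in range(clocks)
            if clock % hub_period == 0}     # hub window open on this clock

# Four tasks vs. an 8-clock hub period: the cycle lengths share a common
# factor, so the window always lands on the same task.
four_task_result = tasks_that_get_hub(4)    # {0}

# Three tasks: 3 and 8 are relatively prime, so the window walks through
# every task in turn.
three_task_result = tasks_that_get_hub(3)   # {0, 1, 2}
```

The same function shows why Chip's 15-slot task list works in general: any task cycle whose length is relatively prime to the hub period sweeps every task past the window.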
  • Seairth Posts: 2,474
    edited 2013-09-20 12:11
    Heater. wrote: »
    However if you have no interrupts it means that your cooperative tasks have to be able to poll I/O at a rate fast enough to satisfy the I/O timing requirements.

    That means your cooperative threads have to "yield" or "suspend" more and more frequently to meet tighter and tighter timing.

    So the counter-argument is that interleaved task switching solves the problem? If your timing is that critical, I wouldn't expect you to risk having one or more of the other tasks stalling/blocking. As a result, you would either end up writing code just to make the tasks play nice (which is a space and performance hit), or you would run the code in a dedicated cog with only one task.
  • Heater. Posts: 21,230
    edited 2013-09-20 12:15
    KC_Rob,

    I do see what you mean. More COGs would be conceptually more regular and "pure" in the Prop's philosophical way.

    Reality bites. Like I said above, double the number of COGs and you have halved HUB bandwidth and doubled HUB latency. No good. Besides, we just don't have the transistors available to do it. It does not fit.

    Hardware-scheduled threads are the way to combine less demanding threads into a single COG. It really appeals to people, like me, who feel that dedicating a 32-bit COG to flashing an LED or some other simple task is a terrible waste.

    The round-robin time slicing of threads mirrors the round-robin hub access. There is a nice symmetry there.

    These threads are a far more elegant solution than other solutions like interrupts!
  • Heater. Posts: 21,230
    edited 2013-09-20 12:27
    Seairth,
    If your timing is that critical, I wouldn't expect you to risk having one or more of the other tasks stalling/blocking.
    I do agree.

    Where the plan falls down is that not all instructions take the same time. In this case HUB ops cause a problem.
    On an XMOS device for example four threads can be run such that they all get exactly one quarter of the execution speed. A thread will run as if the clock were divided by four no matter what the other threads are doing. Totally deterministic.
    I believe that this true for the Prop as well as long as we don't end up with threads queuing for HUB access and such. If you have to wait for a resource you have to wait for a resource, that's it.

    My answer is that if your timing is that critical you are going to dedicate a whole COG to it.

    The threading thing provides a way to mop up less critical functions into a COG without having to worry about task switching, as you do, for example, in the FullDuplexSerial rx and tx threads on the P1.

    The counter-argument so far is that we have Space Invaders, including video and sound and game logic, running on one cog here!
  • Bill Henning Posts: 6,445
    edited 2013-09-20 12:33
    Hmm... I thought there were a few instructions even on XMOS that did not obey the one-clock-per-thread rule... I think it was MUL and DIV (but I could easily be wrong).
  • KC_Rob Posts: 465
    edited 2013-09-20 12:36
    Heater. wrote: »
    KC_Rob,

    I do see what you mean. More COGs would be conceptually more regular and "pure" in the Prop's philosophical way.
    Good! I was pretty sure we agree on at least that much. :)
    Reality bites.
    Indeed, sadly it often does.
    Like I said above, double the number of COGs and you have halved HUB bandwidth and doubled HUB latency. No good. Besides, we just don't have the transistors available to do it. It does not fit.
    The latter is, to my mind, the only true deal-breaker. Otherwise, I'd say go for it. But yeah, if they won't fit, they won't fit.
  • Heater. Posts: 21,230
    edited 2013-09-20 12:42
    Bill,
    I thought there were a few instructions even on XMOS that did not obey the one-clock-per-thread rule...
    I puzzled over that for a long time. Debated it on the XMOS forums and even ended up discussing it with David May himself over a Christmas break!

    I seem to have ended up being 99.9% convinced that even if mul and/or div did take extra time, they had no effect on the timing of other threads. In my experiments I could not get mul or div to upset things.

    Anyway, the get-outs for XMOS are:

    1) You are not supposed to be instruction/cycle counting to get your timing right. They have clocked I/O and timers and such to make sure things happen on time.

    2) They have a timing analysis tool that will tell you if the source you have written meets the timing constraints you have specified.