Observations of a multi-tasker

David Betz · 2013-09-20 12:44

Heater. wrote: »

Bill,

I puzzled over that for a long time. Debated it on the XMOS forums and even ended up discussing it with David May himself over a Christmas break!

I seem to have ended up being 99.9% convinced that even if mul and/or div did take extra time they had no effect on the timing of other threads. In my experiments I could not get mul or div to upset things.

Anyway, the get out's for XMOS are:

1) You are not supposed to be instruction/cycle counting to get your timing right. They have clocked I/O and timers and such to make sure things happen on time.

2) They have a timing analysis tool that will tell you if the source you have written meets the timing constraints you have specified.

Maybe XMOS has a separate pipeline for each thread rather than sharing a single one like on P2. That way one pipeline can stall without affecting the others.

Phil Pilgrim (PhiPi) · 2013-09-20 13:04

David Betz wrote:

Maybe XMOS has a separate pipeline for each thread rather than sharing a single one like on P2. That way one pipeline can stall without affecting the others.

That's exactly what my hypothetical example entails. Each time slice gets one clock tick. If an instruction requires n clocks, it uses n time slices to complete. That way determinism is preserved for each task, regardless of what the other tasks are doing -- or waiting for.

-Phil

David Betz · 2013-09-20 13:08

Phil Pilgrim (PhiPi) wrote: »

That's exactly what my hypothetical example entails. Each time slice gets one clock tick. If an instruction requires n clocks, it uses n time slices to complete. That way determinism is preserved for each task, regardless of what the other tasks are doing -- or waiting for.

-Phil

I'm not sure even that preserves determinism because of instructions that take more than one clock to complete. Those instructions will tie up a resource (the hub for instance) until they complete and hence cause the other pipelines to stall if they need that same resource. The separate pipeline scheme would probably eliminate other stalls though line ones induces by the various WAITxxx instructions.

Seairth · 2013-09-20 13:09

Heater. wrote: »

On an XMOS device for example...

I agree that there are better ways of approaching the task switching. However, the current situation is that Chip has limited time and resources to make changes. I think the current task switching approach has very limited use and a "proper" fix would take more effort than Chip is currently afforded. I also think that the cooperative approach would provide more value (in general) and could (hopefully) be implemented with less risk, time, and resources than the "proper" fix.

Phil Pilgrim (PhiPi) · 2013-09-20 13:14

David Betz wrote:

I'm not sure even that preserves determinism because of instructions that take more than one clock to complete. Those instructions will tie up a resource (the hub for instance) until they complete and hence cause the other pipelines to stall if they need that same resource.

No. Instructions that take more than one clock to complete or are waiting for something will require more than one time-slice to complete. IOW, the task-slice granularity is one clock, not one instruction. That way tasks will not tie up any resources or stall the task switcher. That's why it's important that resources that are available on a periodic basis (e.g. the hub) have a period that's relatively prime to the task period.

(Since I haven't been following this subject closely over time, I just assumed this is the way the P2 worked all along. But apparently it's not.)

-Phil

Heater. · 2013-09-20 13:15

Yep, If you have to wait in line for a resource, like HUB access, then you have to wait in line. There is no way around it.

So, if we really really want deterministic timing for threads or at least have a threads timing independent of other threads actions, what are the rules?

1) No HUB ops.
2) No waits on inputs, you have to poll.

Anything else?

David Betz · 2013-09-20 13:16

Heater. wrote: »

Yep, If you have to wait in line for a resource, like HUB access, then you have to wait in line. There is no way around it.

So, if we really really want deterministic timing for threads or at least have a threads timing independent of other threads actions, what are the rules?

1) No HUB ops.
2) No waits on inputs, you have to poll.

Anything else?

Avoid any other instruction that takes more than a single clock? Doesn't DJNZ take a different number of clocks depending on whether it branches?

pjv · 2013-09-20 13:19

Heater. wrote: »

KC_Rob.

cooperative multi-tasking is a wonderful thing.

However if you have no interrupts it means that your cooperative tasks have to be able to poll I/O at a rate fast enough to satisfy the I/O timing requirements.

That means your cooperative threads have to "yield" or "suspend" more and more frequently to meet tighter and tighter timing.

Eventually you are wasting all your time executing yield or suspend so as to poll something that may not be there.

The Props hardware scheduled threads are an elegant solution to that issue. Remove the need for yield instructions, reduce latency to events all around, reduce the code size and save wasting time on lots of yield instructions.

Looks perfectly in line with the Prop philosophy.

My experience is that co-operative multi-tasking works very well..... at least in the P1. Therefore I would assume that such a software co-operative system would also work in the P2, and possibly be enhanced by the new hardware features, permitting, perhaps, as many as 16 threads per cog. But that is getting to be somewhat rediculous for real life applications.

Most situations do not require really tight observation of determinism, and the co-operative approach works just fine. The P1 task swithes in around 1 uSec, and that includes a hub poll. Most applications (but not all) run happily with a 10 to 50 uSec tick time.

Where the P2's hardware approach will really help is to reduce the amount of code the OS scheduler consumes in the cog. Depending on features desired/included in the compile, it runs about 32 instructions for simple to 128 instructions for a full featured option.

I'm still working hard to get that compacted..... the basic kernel is in the 25 instruction range, but not yet fully fleshed out.

Just my two cents' worth.

Cheers,

Peter (pjv)

Bill Henning · 2013-09-20 13:20

Thanks, that makes sense - if they don't impact other threads, all is golden.

I both love and hate their clocked I/O... love that you can just fire off a long, and have it sent out as eight nibbles... but I don't like the limitations on how pins are grouped.

Heater. wrote: »

Bill,

I puzzled over that for a long time. Debated it on the XMOS forums and even ended up discussing it with David May himself over a Christmas break!

I seem to have ended up being 99.9% convinced that even if mul and/or div did take extra time they had no effect on the timing of other threads. In my experiments I could not get mul or div to upset things.

Anyway, the get out's for XMOS are:

1) You are not supposed to be instruction/cycle counting to get your timing right. They have clocked I/O and timers and such to make sure things happen on time.

2) They have a timing analysis tool that will tell you if the source you have written meets the timing constraints you have specified.

Bill Henning · 2013-09-20 13:26

Seairth,

I totally disagree with "the current task switching approach has very limited use" - note invaders in one cog, note how easy it will be to combine four serial / keyboard / mouse / etc drivers in one cog.

There is absolutely nothing stopping people from ignoring the task switching and using cooperative multitasking on P2, where they believe it is a better fit.

The biggest advantage to the hardware threading is that it makes it much easier to treat a cog as four separate tasks, which are far easier to write and debug than the cooperative versions; one can almost cut & paste four different drivers into one cog.

Very high resource contention cases (ie hub access) require a lot of care in any case.

Mind you, a lot of these issues will go away with working silicon, which will be 2.5x faster than the emulation.

Seairth wrote: »

I agree that there are better ways of approaching the task switching. However, the current situation is that Chip has limited time and resources to make changes. I think the current task switching approach has very limited use and a "proper" fix would take more effort than Chip is currently afforded. I also think that the cooperative approach would provide more value (in general) and could (hopefully) be implemented with less risk, time, and resources than the "proper" fix.

Phil Pilgrim (PhiPi) · 2013-09-20 13:26

heater wrote:

Yep, If you have to wait in line for a resource, like HUB access, then you have to wait in line. There is no way around it.

With a single interleaved pipeline, that's true. Not true with multiple pipelines, though.

-Phil

Heater. · 2013-09-20 13:33

@Bill,

...but I don't like the limitations on how pins are grouped.

That is a real downer in the XMOS devices.
You have to treat pins in groups of two, four, eight, sixteen.
All pins in the same group have to be input or output.
You don't get to choose which pins are in which group.
And just to really annoy you, pins are connected to cores. You cannot access any pin from any core.

@pjv,
Cooperative multi-tasking can work very well. We only have to look at FullDuplexSerial on the PI to see that. An rx and a tx thread (or coroutines) cooperatively scheduling and getting a baud rate of 115200.

However, imagine the rx and tx threads of FDS were hardware time sliced as per the Prop II. That means you can:
1) Get rid of all the task switching instructions from the code. Which saves code space and execution time.
2) Reduce latency to responding to those edges on the rx input and clock timing on output.

I speculate that FDS would reach much higer baud rates with hardware times slicing. Whilst having much less jitter on its serial output and tolerance of jitter on its serial input.

Heater. · 2013-09-20 13:42

Phil,

With a single interleaved pipeline, that's true. Not true with multiple pipelines, though.

Grrr.. You lost me again.

I'm seeing:
1) Thread A wanting a HUB access, waiting for its access slot and then doing the access, then continuing.
2) Meanwhile thread B is wanting HUB access at the same time as A. First it has to wait for A to complete all the above, then it has to wait for it's own HUB access slot (which will be a whole round robin cycle later) then do it's access, then continue.
3) And so on for thread C that has to wait for all of 1) and 2) above and then do it's own waiting for HUB.
4) Ditto...

They all have to stand in line waiting for the resource. I don't see how any kind of pipelinning within the COG can get around that.

Seairth · 2013-09-20 13:43

Bill Henning wrote: »

There is absolutely nothing stopping people from ignoring the task switching and using cooperative multitasking on P2, where they believe it is a better fit.

I agree, if there was hardware support for it. I'm under the impression that there isn't.

Heater. · 2013-09-20 13:47

There is.

Unless I missed the, what would be devastating, news that the original TASKSW instruction has been removed.

There is still the old jmpret trick for creating cooroutines.

Phil Pilgrim (PhiPi) · 2013-09-20 13:49

heater wrote:

1) Thread A wanting a HUB access, waiting for its access slot and then doing the access, then continuing.
2) Meanwhile thread B is wanting HUB access at the same time as A. First it has to wait for A to complete all the above, then it has to wait for it's own HUB access slot (which will be a whole round robin cycle later) then do it's access, then continue.
3) And so on for thread C that has to wait for all of 1) and 2) above and then do it's own waiting for HUB.
4) Ditto...

With multiple pipelines,

1) Thread A's turn comes 'round. It's waiting for HUB access, but hub is not available just now. So thread A skips its turn.
2) One clock later, thread B's turn comes along. It's also waiting for hub access, and hub is now available, so it does the hub op.
3) One more clock later, thread C's turn comes along. It's waiting for an input pattern. Pattern is not there, so it skips its turn.
4) Still one clock later, thread D's turn comes. It completes the second part of the two-clock instruction started on its previous turn.
5) Etc.

Each task's timeslice is exactly one clock long. The only thing that stalls is its own pipeline if a condition it's waiting for is not met.

-Phil

ctwardell · 2013-09-20 13:50

Seairth wrote: »

I agree, if there was hardware support for it. I'm under the impression that there isn't.

I believe you can still use this technique on the P2:

http://www.parallaxsemiconductor.com/sites/default/files/appnotes/AN014-Coroutines-v1.0_0.pdf

Chris Wardell

Heater. · 2013-09-20 13:57

Phil,

OK, I can see where you are with that.

But I already have an issue with step 1)

Thread A is say 1 clock away from a HUB slot so it skips it's turn. It now has to wait one HUB cycle and one clock, 9 clocks. Instead of just one clock.
Thread B comes along and snags that slot. Where as previously it would have to wait 8 clocks.

Haven't we just moved delays from on thread to another? Does this actually gain us anything?

Phil Pilgrim (PhiPi) · 2013-09-20 14:05

heater wrote:

Thread A is say 1 clock away from a HUB slot so it skips it's turn. It now has to wait one HUB cycle and one clock, 9 clocks. Instead of just one clock.

It's much worse than that if there are 16 task slots, because Thread A will never get a chance at the hub!. But with 15 task slots, each task gets an equal chance. That's the "relatively prime" thing. Moreover, no task will have to wait longer (i.e. more task slots) than if it were the only task running.

-Phil

Heater. · 2013-09-20 14:10

OK, makes sense. There is perhaps a penny dropping here. Or a ha'penny anyway.

cgracey · 2013-09-20 14:21

Seairth wrote: »

I agree that there are better ways of approaching the task switching. However, the current situation is that Chip has limited time and resources to make changes. I think the current task switching approach has very limited use and a "proper" fix would take more effort than Chip is currently afforded. I also think that the cooperative approach would provide more value (in general) and could (hopefully) be implemented with less risk, time, and resources than the "proper" fix.

I don't know if you're aware, but there is a TASKSW instruction which goes to the next cooperative task.

TASKSW is short for 'JMPRET INDA,++INDA WZ,WC'. By doing a 'FIXINDA #pclist+n-1,#pclist' you create a round-robin tasker of n tasks. Have the pclist loaded with initial PC's and jump to the first task. Everytime a snippet wants to pass time to another task, he just does a TASKSW instruction and the current PC+flags are saved @INDA and the new PC+flags are loaded @++INDA, leaving INDA inc'd. The FIXINDA sets INDA to loop within a task list.

Oh, and you can do 4 sets of TASKSW-type multitasking using the 4-way per-instruction task switching.

David Betz · 2013-09-20 14:28

Phil Pilgrim (PhiPi) wrote: »

It's much worse than that if there are 16 task slots, because Thread A will never get a chance at the hub!. But with 15 task slots, each task gets an equal chance. That's the "relatively prime" thing. Moreover, no task will have to wait longer (i.e. more task slots) than if it were the only task running.

-Phil

You could fix this by having thread A start "queue" its hub operation and then stall. Then, each time it gets a chance to execute again it checks to see if the hub operation has completed. Well, that doesn't make it deterministic but it does prevent it from being stalled forever.

Ariba · 2013-09-20 14:33

3 different ways to flash two LEDs at different rates in the Propeller 2:

Flash two LEDs with different Prop2 Multitasking approaches:
------------------------------------------------------------

1) in 2 cogs:

       cognew @flash2+$E80,#0

flash1 getcnt time1
       add time1,rate1
loop1  notp Led1
       waitcnt time1,rate1
       jmp #loop1

flash2 getcnt time2
       add time2,rate2
loop2  notp Led2
       waitcnt time2,rate2
       jmp #loop2


2) with automatic Taskswitching:

       jmp #Task0
       jmp #Task1

Task0  settask #%%0101
       getcnt time1
       add time1,rate1
loop1  notp Led1
       passcnt time1
       add time1,rate1
       jmp #loop1

Task1  getcnt time2
       add time2,rate2
loop2  notp Led2
       passcnt time2
       add time2,rate2
       jmp #loop2


3) with cooperative Taskswitching:

       fixinda pcs+1,pcs
       getcnt time1
       add    time1,rate1
       getcnt time2
       add    time2,rate2

task1  notp Led1
wait1  tasksw
       cmpcnt time1  wc
  if_b jmp #wait1
       add time1,rate1
       jmp #task1

task2  notp Led2
wait2  tasksw
       cmpcnt time2  wc
  if_b jmp #wait2
       add time2,rate2
       jmp #task2

pcs    long task1,task2

I left out the definitions for Led1,Led2,time1.. and so on, and it's not tested.

Andy

cgracey · 2013-09-20 14:46

I got rid of the FITACCx setup and now have SARACCx D/#n instructions.

Do you think it would be good to replace SPLITW and MERGEW (unary operations that split and merge words and bits) with ENDIAN8 and ENDIAN4 that perform byte- and nibble-endian reversals?

Phil Pilgrim (PhiPi) · 2013-09-20 15:03

David Betz wrote:

You could fix this by having thread A start "queue" its hub operation and then stall. ...

No need. With 15 (or 13 or 17 ...) timeslots, it works itself out automatically.

-Phil

Ariba · 2013-09-20 15:25

cgracey wrote: »

I got rid of the FITACCx setup and now have SARACCx D/#n instructions.

Cool. So you can shift the ACCx any amount of bits and the result is written back to ACCx.
I hope nobody will miss the FITACCx, I would feel a bit guilty...

Do you think it would be good to replace SPLITW and MERGEW (unary operations that split and merge words and bits) with ENDIAN8 and ENDIAN4 that perform byte- and nibble-endian reversals?

An ENDIAN8 byte reversal would be a very useful instruction for big endian CPU-Emulators and Datafiles.
I think MOVF allows byte reversal already in 4 instructions, but a single instruction is always better, and works also with multitasking. MOVF with multitasking is a bit a problem - how hard would it be to make separate MOVF modes per task?

Andy

Seairth · 2013-09-20 15:31

cgracey wrote: »

I don't know if you're aware, but there is a TASKSW instruction which goes to the next cooperative task.

TASKSW is short for 'JMPRET INDA,++INDA WZ,WC'. By doing a 'FIXINDA #pclist+n-1,#pclist' you create a round-robin tasker of n tasks. Have the pclist loaded with initial PC's and jump to the first task. Everytime a snippet wants to pass time to another task, he just does a TASKSW instruction and the current PC+flags are saved @INDA and the new PC+flags are loaded @++INDA, leaving INDA inc'd. The FIXINDA set INDA to loop in a task list.

Yes, I recalled that there was such an approach (though I had forgotten the details). To me, this approach has the following drawbacks:

* You have to give up registers (and the use of INDA), even though there are TASK registers that hold exactly the same information (at least for four tasks)
* You can only do simple round-robin task switching (or else have to add more code to fix up the stack for anything more complicated)

Alternatively, if the existing TASK registers were re-purposed (or even dual-purposed!) such that TASKSW would do the following:

* Switch to the next TASK (based on bits in the control register) in round-robin fashion
* Switch to an explicit TASK (think "child" tasks)
* Optionally fix up the current TASK register to its resume address (which would allow tasks to "loop" without an extra instruction)
* Have a delayed version of the instruction to avoid flushing the pipeline

then you would have the same capabilities as using the JMPRET approach, and more, while also using fewer registers. If this could be added along side of the preemptive switching, that'd be great. I realize that the cooperative approach would also be limited to four tasks, but that was obviously enough in the case of the invaders code (which was what started this conversation).

Also, as I was typing this, I occurred to me that I could almost do this now with SETTASK, or at least the explicit, delayed switch. Just simply call SETTASK #%%3333 to switch to TASK 3, for instance. Unfortunately, you would still need additional code to do the round-robin switching and looping.

Tubular · 2013-09-20 15:34

The existing multi tasking setup is really very powerful and effective, and think will actually find very widespread use. I see it as a logical architectual extension of the holistic propeller concept - just as you can have a micro that has cogs running in parallel, so too you can divide a cog so it has several tasks running in parallel. Having a task in-cog with full access to the inside of that cog is fantastic for debugging.

Chip, would it be possible to modify the existing SETTASK instruction, so that during setup, it works out whether to rotate 16 or 15 bits according to the two most significant bits? ie if the MSB's are 00 then it goes on a 15 task sequence, and if they are anything else it uses a 16 task sequence? We normally set up with a %%3210 rather than %%0123 anyway.

And if that were possible, how much more mischief would it to extend to arbitrary task sequence lengths by "removing the leading zero bit pairs" ? So you could have a sequence length of say 6 (%%302010), if you wanted to, and %%3210 would just go on a short 4 task rotation without needing to be extended to fill 32 bits

Glad you like the ; idea, Chip. Understand it may not be possible to incorporate at this stage.

cgracey · 2013-09-20 15:35

Ariba wrote: »

Cool. So you can shift the ACCx any amount of bits and the result is written back to ACCx.
I hope nobody will miss the FITACCx, I would feel a bit guilty...

An ENDIAN8 byte reversal would be a very useful instruction for big endian CPU-Emulators and Datafiles.
I think MOVF allows byte reversal already in 4 instructions, but a single instruction is always better, and works also with multitasking. MOVF with multitasking is a bit a problem - how hard would it be to make separate MOVF modes per task?

Andy

With SARACCA/SARACCB/SARACCS, you can arithmetically right-shift 0..63 bit positions. This feels a lot better than the FITACCx stuff. I mean, you should know how far to shift, already, based on your algorithm. FITACCx was too smart for its britches. Determinism is better, in this case. Thanks a lot for bringing this shifting matter up again.

A MOVF in every task could start stretching the ALU result path. I'll see, though.

pjv · 2013-09-20 15:39

Hi All;

So while we're looking for improvements or things to add, I believe there is use for a "WAITAddr" instruction that stalls the cog thread until a specific address (or possibly contiguous range of addresses) in hub is accessed by another cog or thread. It would have some aspects of a semaphore or possibly eliminate/improve polling.

Just another one of those things that magically find new, somtimes unintentional uses of the Prop we have come to love so much.

Cheers,

Peter (pjv)

Observations of a multi-tasker

Comments