Observations of a multi-tasker

ozpropdev · 2013-09-17 23:39

Hi Guys

I thought I would share one of my observations on the multi-tasking feature
of the P2. As most of you know my "Inavders" code rely's heavily on it!

So we have FPGA boards running @ 60 Mhz.
In theory this equates to 15 MIPS per thread.
Numbers looking good so far.

Where this theoretical figure gets "eaten up" is the use odf "HUB" instructions.
From the documentation we have we find that HUB insttrunctions (non cached) take
between 3-10 clocks. In my scenario I have 4 threads all fighting for HUB access.
This means one cycle of the tasking loop could take 40 cycles. This is because
all the threads have to wait a complete HUB cycle to read/write their data.

I believe this is causing massive "PIPELINE stall".

By my calculations I had plenty of time to do some "stuff", but in reality
I seemed to be exceeding that time? Worst case scenario it would work out the
estimated 15MIPs could now be in the region of 1.5MIPS!
This explains why some of my stuff appears to take too long.

Don't get me wrong, I am the "BIGGEST" fan of this feature and plan to use it
in most(if not all) of my future projects.
I realize that the "Invaders" code is pushing things a bit(lot!) but the idea of
that exercise was to learn P2 and see what it can do. Plenty it seems!

I feel as long as you are aware of this potential scenario, big things can still happen.

Anyhow, that's what I experienced.
Hope this is helpful.

Cheers
Brian

Heater. · 2013-09-18 01:00

So what you are saying is that :

1) You have four threads running.
2) They all want to hit HUB at the same time.
3) The first one may have to wait 10 clocks and then completes execution.
4) The second one continues and immediately has to wait10 clocks (or so) having just missed the HUB slot.
5) When he is done the third one continues and again has to wait 10 clocks.
6) And so on.

So..If all the threads prety much do nothing but HUB ops execution drops to 1.5 MIPs!

How big a problem is this in reality?

Looks to me like the is no way around that stall. except perhaps minimizing how often it happens.

Not very helpful am I?

ozpropdev · 2013-09-18 01:36

Heater. wrote: »

How big a problem is this in reality?

Probably not as big a problem in reality.
I do admit that my example would possibly be classified as extreme.
"Who would do that?"
Trust me to do "crazy" things....

This also assumes I am correct in my assessment of the situation?

Heater. · 2013-09-18 01:51

I guess we will have to wait for Chip's response to gauge your assessment of the situation.
It sounds reasonable to me at first sight but I have no idea about any cached HUB interactions yet.

I guess we can class this as extreme, you have video and sound and game logic all going on together and all needing HUB.

On the bright side. Imagine you were trying to do that on a regular single core micro with interrupts!
All that interrupt handling would also kill performance.

Cluso99 · 2013-09-18 02:44

I would think it is possible to minimise the stall by carefully placing the instructions so that hub accesses fall into lockstep with the hub access.

In the P1, once the cog is in lockstep, we can issue a hub instruction with 2 intervening normal instructions. With P2, since we get an access every 8 clocks (P1 in 16), and presuming 3 to setup, then it should be possible to execute 5 intervening (1 clock) instructions. So by careful planning, perhaps its possible to minimise the stalls.

ozpropdev · 2013-09-18 02:53

The figure I quoted of 1.5 MIPS is a bit misleading. (apologies!)
What I should of stated is a tasker cycle time of 640nS (4xHub inst.) instead of 64nS (4 x non hub inst.).
Obviously I never have the scenario of 4xhub instructions all the time, but it is likely to happen quite a bit.
The peak MIPS is radically altered with high HUB activity.

ozpropdev · 2013-09-18 03:16

Cluso99 wrote: »

So by careful planning, perhaps its possible to minimise the stalls.

I believe your right Cluso.

In my example this would require a massive restructure. I'm using all of COG ram and have none left now.
In the future I would certainly go down that path, though still quite difficult.
The "Invaders" project is basically an experiment that got out of hand, so no real plan was in place. It just evolved.
Lots was learnt from its construction though.

Cheers
Brian

Heater. · 2013-09-18 03:34

Cluso,

I would think it is possible to minimise the stall by carefully placing the instructions so that hub accesses fall into lockstep with the hub access.

I have a feeling this is isn't very easy to do, if possible at all.

Consider you have four threads running, each one of them has to be getting on with whatever it has to do in it's own time. A video thread has to deliver data in sync with the video stream, the audio thread has it's own timing, some other thread will be responding to external inputs and so on.

In general these threads will be running around on their own cycle time and pausing for this and that when they have to, running in and out of phase with each other as they go along. Eventually they may all arrive at a point in their code where they all want HUB access. BOOM there is your massive stall.

If all your threads are responding to external stimili, pin's timer, etc, then it's impossible to line up the timing. It's rather like having an interrupt driven system when all your interrupt's arrive at the same time and there isn't enough CPU power to get them all serviced in time, something breaks.

Keeping HUB acceses on lock step basically means those threads have to be permanently in sync. Which kind of defeats the idea of having harware scheuled threads.

Perhaps using that task switch instruction, rather than automatic scheduling, might be a way to get the threads synced. That is "cooperative multi-threading" rather than premptive. But that has penalties in performance as you have to add those task switch instructions to your code and latency of task switching. It's also a lot more complex to program.

Interesting problem...

ozpropdev · 2013-09-18 03:41

Heater. wrote: »

Interesting problem...

Thinking.......thinking.....thinking....
My head hurts!
I think I need to lie down for a while........

Heater. · 2013-09-18 03:42

Same here...

cgracey · 2013-09-18 04:17

As Heater pointed out, the first instruction may need ten clocks, but subsequent back-to-back hub instructions (from the same or other threads) would take 8, not 10. The latest any hub instruction releases is on clock cycle 2, if you count the hub access as cycle 0. This means you are early for the next hub instruction and it must wait for the next cycle 0, releasing as late as cycle 2. So, 8 clocks for back-to-back hub instructions.

ozpropdev · 2013-09-18 04:41

Thanks for clearing that up Chip.

So in my worst case scenario it would take 10+8+8+8 cycles (544nS) per tasker cycle.
This explains my codes erratic behaviour. Good to know.

Cheers
Brian

Tubular · 2013-09-18 05:25

I'm stunned at how much can be fit inside a cog using the task instructions. And we haven't really started messing with the task register to put things into execution queue just-in-time yet.

Being able to add a tiny bit of serial tx for debug purposes, taking ~30 longs and inside the same cog as whatever is being debugged, is a very useful feature.

Anyone know whether Chip's SDRAM driver can be adapted to work when split into "task" mode (albeit slower). Wasn't it only 60 longs or so?

Heater. · 2013-09-18 05:45

ozpropdev,

I presume in your code you have four tasks scheduled on a normal round robin basis so the execute as: task 0, task 1, task 2, task 3, task 0, task 1...instruction by instruction.

Now, I'm not sure but I though I picked up the idea months back that it was possible to schedule things a bit differently. So let's say task 0 is your most speed sensitive thing and needs more time than the rest, one arranges to give it more execution "slots" so that things run as something like: task 0, task 1, task 0, task 2, task 0, task 3, task 0, task 2...

Now that of course does not solve the problem of all tasks hitting HUB at the same time and stalling there. But at least it does give the busy guy more chance to get in.

No idea how one sets this up or if it would help your case but it might be interesting to investigate.

Anyone any idea?

mindrobots · 2013-09-18 06:02

Check out the "unofficial P2 Doc" - there's a section on multi-tasking that talks about the 16 slots in the task register and how to distribute those among the 4 tasks.

ozpropdev · 2013-09-18 06:07

Heater,

I'm already doing this in the form of Task 1,0,1,2....repeating
Where task1 is the video stuff, Task 0 is the game engine and task 2 is the serial monitor, timers and sound.

I had to do this because I was having Video sync issues. So I really have 3 tasks not 4, but video gets 50% of the time slots.
The pipeline stall has the most dramatic effect on the Video task. The other tasks are tolerant to stall.

My current code has been "tuned" around the sync problems, but the problem of stall was increased with a variant to that code I
working on at the moment. It's been an interesting exercise anyway....

Cheers

ozpropdev · 2013-09-18 06:15

The 16 time slot's are the same 4 task patterns repeated 4x times.

Heater. · 2013-09-18 06:46

OK, you are so many steps ahead of us it's hard to keep up:)
You really are being extreme.
So what you need is a 80MHz chip plus a bit of overclocking...

ozpropdev · 2013-09-18 06:57

Heater. wrote: »

So what you need is a 80MHz chip plus a bit of overclocking...

Yes please.

Maybe I need to drop my little Nano board in some Liquid Nitrogen!

Heater. · 2013-09-18 07:39

So let me get this straight.

There are three tasks. One of them gets 50% of the time. The other two are still have time to spare.

All works well except when the high proiority task does a hub op and is stalled by the two lower priority guys who happen to be in line to do HUB ops just ahead of it.

Would things work better if we could be sure that there was only one other task doing a HUB op when the high priority guy wants to? That is to say if the two slower guys were always doing HUB ops apart from each other in time thus ensuring there were never three back to back HUB ops occurring?

I suspect this could be done by having one low priority guy set a flag a few instructions before his HUB op and clearing it again a few instructions later. Meanwhile the other low priority guy checks that flag and promises not to do a HUB op if it is set.

Now those two have their HUB ops always spaced out in time and the big guy only ever has to wait for one of them if it's in the way at the time.

Or am I crazy?

Seairth · 2013-09-18 09:47

Heater. wrote: »

So let me get this straight.

There are three tasks. One of them gets 50% of the time. The other two are still have time to spare.

All works well except when the high proiority task does a hub op and is stalled by the two lower priority guys who happen to be in line to do HUB ops just ahead of it.

Would things work better if we could be sure that there was only one other task doing a HUB op when the high priority guy wants to? That is to say if the two slower guys were always doing HUB ops apart from each other in time thus ensuring there were never three back to back HUB ops occurring?

I suspect this could be done by having one low priority guy set a flag a few instructions before his HUB op and clearing it again a few instructions later. Meanwhile the other low priority guy checks that flag and promises not to do a HUB op if it is set.

Now those two have their HUB ops always spaced out in time and the big guy only ever has to wait for one of them if it's in the way at the time.

Or am I crazy?

I see a potential race condition with that approach (using a flag), where task 0 always sets its flag before task 2 does.

Another approach might be to time everything from the task 1 (video) HUBOP. The approach would be to have task 0 enter a NOP loop prior to performing a HUBOP, which then spins until 5 cycles (assuming Cluso99's calculations are correct) after task 1 HUBOP completes. This would ensure that task 1 doesn't stall. Task 2 would perform a similar operation, but would wait 13 cycles. Note that if task 0 wasn't waiting, but task 2 was, you still wouldn't get a stall on task 1 since task 2 would effectively just skip one hub access window (which it was already doing when task 0 was waiting).

cgracey · 2013-09-18 10:21

Seairth wrote: »

I see a potential race condition with that approach (using a flag), where task 0 always sets its flag before task 2 does.

Another approach might be to time everything from the task 1 (video) HUBOP. The approach would be to have task 0 enter a NOP loop prior to performing a HUBOP, which then spins until 5 cycles (assuming Cluso99's calculations are correct) after task 1 HUBOP completes. This would ensure that task 1 doesn't stall. Task 2 would perform a similar operation, but would wait 13 cycles. Note that if task 0 wasn't waiting, but task 2 was, you still wouldn't get a stall on task 1 since task 2 would effectively just skip one hub access window (which it was already doing when task 0 was waiting).

You could use the CLRB or SETB instruction just like a LOCK, but between threads. It can return into C the prior state of the bit, while post-clearing or -setting it.

Heater. · 2013-09-18 10:30

I didn't phrase it well.

The idea was that only one thread sets and clears the flag, The other only reads the flag and proceeds or not.
Because the flag is set some instructions prior to the HUB op it does not matter if there is a little race, the reader COG may end up doing a HUB op just after the flags is set but that still leave some time free before the other threads HUB op. Because the flag is cleared some instructions after the HUB op there will be some time free before the other threads HUB op.

Another, more elegant, way to do this is that one thread only ever sets the flag and the other only clears it. The former loops on flag clear and sets it when it proceeds. The later loops on flag set and clears it when it proceeds.

cgracey · 2013-09-18 10:43

Here's how to let 4 tasks each get an optimal shot at their hub slot:

kick off 4 tasks:

		settask	#%%0123		'4 tasks in round-robin sequence

Each task is timed like this:

task		rdlong	x,y		'0..2	this task's hub turn - could be wrlong+nop+nop
		nop			'3
		nop			'4
		nop			'5
		nop			'6
		nop			'7
		nop			'0	task+1's hub turn
		nop			'1
		nop			'2
		nop			'3
		nop			'4
		nop			'5
		nop			'6
		nop			'7
		nop			'0	task+2's hub turn
		nop			'1
		nop			'2
		nop			'3
		nop			'4
		nop			'5
		nop			'6
		nop			'7
		nop			'0	task+3's hub turn
		nop			'1
		nop			'2
		nop			'3
		nop			'4
		nop			'5
		nop			'6
		jmp	#task		'7

NOP's are just place-holders for 1-clock instructions.

I think all tasks' hub slots would fall into line after one loop.

Seairth · 2013-09-18 11:52

If I understand that correctly (which is dubious, to be sure), wouldn't you end up with the following:

    clock    01234567012345670123456701234567012....
    task0    rrr                        n   n   ....j   
    task1       -----rrr                 n   n  .... j  
    task2               -----rrr          n   n ....  j 
    task3                       -----rrr   n   n....   j

The NOPs repeat, always one or five cycles off of aligning to the hub access window. With an even number of NOPs, there would still be a one clock stall on task0's RDLONG after the JMP. Otherwise, wouldn't you get the same stall pattern on every loop?

cgracey · 2013-09-18 12:11

Seairth wrote: »
If I understand that correctly (which is dubious, to be sure), wouldn't you end up with the following:
    clock    01234567012345670123456701234567012....
    task0    rrr                        n   n   ....j   
    task1       -----rrr                 n   n  .... j  
    task2               -----rrr          n   n ....  j 
    task3                       -----rrr   n   n....   j
The NOPs repeat, always one or five cycles off of aligning to the hub access window. With an even number of NOPs, there would still be a one clock stall on task0's RDLONG after the JMP. Otherwise, wouldn't you get the same stall pattern on every loop?

The diagram looks good.

I'm not sure I follow the last part, but basically, if you have four tasks running and each one does a hub access every 32 cycles, on the nose, no task will have to wait for its hub cycle.

If you had three tasks running, each could do a hub instruction every 24 clocks, except that's not possible, since you have 16 time slots, and 16's not cleanly divisible by 3.

If you had two tasks running, each could do a hub cycle every 16 clocks.

Seairth · 2013-09-18 12:33

cgracey wrote: »

if you have four tasks running and each one does a hub access every 32 cycles, on the nose, no task will have to wait for its hub cycle.

I think the timing can never be exactly 32 clock cycles. You have RDLONG(t0) + RDLONG(t1) + RDLONG(t2) + RDLONG(t3) + (NOP * 4 * n) + (JMP * 4) cycles, or 3+8+8+8+4n+4 cycles, which simplifies to 31 + 4n cycles. There is no multiple of NOPs that would result in 32 cycle alignment. At best, you'd get a one-cycle stall at the beginning of the next loop for task 0, and all of the other tasks would still stall for 5 clocks each.

Heater. · 2013-09-18 12:51

Chip,

...if you have four tasks running and each one does a hub access every 32 cycles, on the nose, no task will have to wait for its hub cycle.

Might well be so but if your threads are running in lock step like that doesn't pretty much negate the point of having threads?

I always imagine the beauty of the hardware scheduled threads is that they can potentially all be waiting on some external event, a pin change say, and they will be able to respond to it with minimal latency (They don't have to wait for a cooperative task switch from other threads).

Given that we scenario have no way to keep lock step. However that business of waiting on a flag before doing a HUB op may be able to separate HUB ops in time so that the guy who needs to be free of long stalls can be.

cgracey · 2013-09-18 12:53

Seairth wrote: »

I think the timing can never be exactly 32 clock cycles. You have RDLONG(t0) + RDLONG(t1) + RDLONG(t2) + RDLONG(t3) + (NOP * 4 * n) + (JMP * 4) cycles, or 3+8+8+8+4n+4 cycles, which simplifies to 31 + 4n cycles. There is no multiple of NOPs that would result in 32 cycle alignment. At best, you'd get a one-cycle stall at the beginning of the next loop for task 0, and all of the other tasks would still stall for 5 clocks each.

Ah, I see what you are saying. I wasn't remembering that the pipeline gets stalled and it's shared by all the tasks.

You are right.

In drawing out the timing pattern, it looks like you could have two tasks doing RDLONGs with two other tasks not doing hub accesses:

Hub	Task0	Task1	Task2	Task3
-------------------------------------
0	RDLONG	-	-	-
1	STALL	-	-	-
2	STALL	-	-	-
3	-	NOP	-	-
4	-	-	NOP	-
5	-	-	-	NOP
6	NOP	-	-	-
7	-	NOP	-	-
0	-	-	RDLONG	-
1	-	-	STALL	-
2	-	-	STALL	-
3	-	-	-	NOP
4	NOP	-	-	-
5	-	NOP	-	-
6	-	-	NOP	-
7	-	-	-	NOP
<repeat>

A WRLONG would mess things up. Maybe there aren't any nice patterns to exploit.

ozpropdev · 2013-09-18 15:56

I seem to have opened Pandora's box here.

The idea of using a flag to control HUB ops is already in my pile of failed code.
Another factor in this mess of mine is my use of POLVID in the video task.
This in itself introduces timing issues to the use of flag setting.
The race for the use of the HUB seemed to always be won by the video task.
Using NOP's to try to align Hub op's would probably make improvements to the efficiency of things
but is not practical in my case because I have no COG space left.

One idea I had early in this process was to synchronize lower priority Hub op's to only occur during Video sync windows.
This ended up being more difficult than first imagined,

User Name · 2013-09-19 08:50

ozpropdev wrote: »

I seem to have opened Pandora's box here.

I think this discussion is excellent, ozpropdev! A slightly complicated time-critical application is just what a new architecture needs. Coding techniques are developed and potential architectural improvements are discovered.

Observations of a multi-tasker

Comments