Observations of a multi-tasker
ozpropdev
Hi Guys
I thought I would share one of my observations on the multi-tasking feature
of the P2. As most of you know, my "Invaders" code relies heavily on it!
So we have FPGA boards running @ 60 MHz.
In theory this equates to 15 MIPS per thread.
Numbers looking good so far.
Where this theoretical figure gets "eaten up" is the use of "HUB" instructions.
From the documentation we have, we find that HUB instructions (non-cached) take
between 3-10 clocks. In my scenario I have 4 threads all fighting for HUB access.
This means one cycle of the tasking loop could take 40 cycles. This is because
all the threads have to wait a complete HUB cycle to read/write their data.
I believe this is causing massive "PIPELINE stall".
By my calculations I had plenty of time to do some "stuff", but in reality
I seemed to be exceeding that time. Worst case scenario, it works out that the
estimated 15 MIPS could now be in the region of 1.5 MIPS!
This explains why some of my stuff appears to take too long.
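To put rough numbers on that, here's a quick back-of-envelope Python sketch (my own model only, assuming a 60 MHz clock, four round-robin tasks, and the documented worst case of 10 clocks per non-cached hub op):

    clock_hz = 60_000_000
    tasks = 4

    # Ideal case: every instruction is a 1-clock COG instruction,
    # so each task gets a quarter of the clocks.
    ideal_mips = clock_hz / tasks / 1_000_000            # 15.0 MIPS per task

    # Worst case: every instruction is a non-cached hub op costing the
    # full 10 clocks, so one pass of the 4-task loop takes 40 clocks.
    worst_mips = clock_hz / (tasks * 10) / 1_000_000     # 1.5 MIPS per task

    print(ideal_mips, worst_mips)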
Don't get me wrong, I am the "BIGGEST" fan of this feature and plan to use it
in most (if not all) of my future projects.
I realize that the "Invaders" code is pushing things a bit (a lot!) but the idea of
that exercise was to learn P2 and see what it can do. Plenty it seems!
I feel as long as you are aware of this potential scenario, big things can still happen.
Anyhow, that's what I experienced.
Hope this is helpful.
Cheers
Brian
Comments
1) You have four threads running.
2) They all want to hit HUB at the same time.
3) The first one may have to wait 10 clocks and then completes execution.
4) The second one continues and immediately has to wait 10 clocks (or so), having just missed the HUB slot.
5) When he is done, the third one continues and again has to wait 10 clocks.
6) And so on.
So... if all the threads pretty much do nothing but HUB ops, execution drops to 1.5 MIPS!
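Here's a quick sketch of that pile-up, assuming an 8-clock hub window per cog and roughly 2 clocks for the access itself once the window arrives (both numbers are my assumptions):

    HUB_PERIOD = 8    # clocks between this cog's hub windows (assumed)
    HUB_OP = 2        # clocks for the access once its window arrives (assumed)

    t = 0
    next_window = HUB_PERIOD   # worst case: the window at clock 0 was just missed
    for task in range(4):
        wait = next_window - t           # clocks spent stalled at the hub
        t = next_window + HUB_OP         # hub op completes
        next_window += HUB_PERIOD
        print(f"task {task}: {wait + HUB_OP} clocks (stalled {wait})")
    print(f"one pass of the four-task loop: {t} clocks")
    # prints per-task costs of 10, 8, 8, 8 -> 34 clocks instead of 4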
How big a problem is this in reality?
Looks to me like there is no way around that stall, except perhaps minimizing how often it happens.
Not very helpful am I?
Probably not as big a problem in reality.
I do admit that my example would possibly be classified as extreme.
"Who would do that?"
Trust me to do "crazy" things....
This also assumes I am correct in my assessment of the situation?
It sounds reasonable to me at first sight but I have no idea about any cached HUB interactions yet.
I guess we can class this as extreme: you have video and sound and game logic all going on together, and all needing HUB.
On the bright side. Imagine you were trying to do that on a regular single core micro with interrupts!
All that interrupt handling would also kill performance.
In the P1, once the cog is in lockstep, we can issue a hub instruction with 2 intervening normal instructions. With P2, since we get an access every 8 clocks (P1 is every 16), and presuming 3 to set up, it should be possible to execute 5 intervening (1-clock) instructions. So by careful planning, perhaps it's possible to minimise the stalls.
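As a quick sanity check on that, a tiny Python sketch (assuming the hub window really does come every 8 clocks and an aligned hub access costs 3):

    HUB_PERIOD = 8      # clocks between hub windows (assumed)
    ALIGNED_HUB = 3     # clocks for a hub access that hits its window (assumed)
    SINGLE = 5          # 1-clock instructions squeezed in between hub ops

    t = 0               # start exactly on a hub window
    for i in range(4):
        t += ALIGNED_HUB + SINGLE
        print(f"loop {i}: next hub op issues at clock {t}, "
              f"still aligned: {t % HUB_PERIOD == 0}")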
What I should have stated is a tasker cycle time of 640 ns (4 x hub instructions) instead of 64 ns (4 x non-hub instructions).
Obviously I never have the scenario of 4 x hub instructions all the time, but it is likely to happen quite a bit.
The peak MIPS is radically altered with high HUB activity.
I believe you're right, Cluso.
In my example this would require a massive restructure. I'm using all of COG RAM and have none left now.
In the future I would certainly go down that path, though still quite difficult.
The "Invaders" project is basically an experiment that got out of hand, so no real plan was in place. It just evolved.
Lots was learnt from its construction though.
Cheers
Brian
I have a feeling this isn't very easy to do, if possible at all.
Consider you have four threads running; each one of them has to be getting on with whatever it has to do in its own time. A video thread has to deliver data in sync with the video stream, the audio thread has its own timing, some other thread will be responding to external inputs and so on.
In general these threads will be running around on their own cycle time and pausing for this and that when they have to, running in and out of phase with each other as they go along. Eventually they may all arrive at a point in their code where they all want HUB access. BOOM there is your massive stall.
If all your threads are responding to external stimuli, pins, timers, etc., then it's impossible to line up the timing. It's rather like having an interrupt-driven system where all your interrupts arrive at the same time and there isn't enough CPU power to get them all serviced in time, so something breaks.
Keeping HUB accesses in lockstep basically means those threads have to be permanently in sync. Which kind of defeats the idea of having hardware-scheduled threads.
Perhaps using that task switch instruction, rather than automatic scheduling, might be a way to get the threads synced. That is "cooperative multi-threading" rather than preemptive. But that has performance penalties, as you have to add those task switch instructions to your code, plus the latency of task switching. It's also a lot more complex to program.
Interesting problem...
Thinking.......thinking.....thinking....
My head hurts!
I think I need to lie down for a while........
So in my worst case scenario it would take 10+8+8+8 cycles (544 ns) per tasker cycle.
This explains my code's erratic behaviour. Good to know.
Cheers
Brian
Being able to add a tiny bit of serial tx for debug purposes, taking ~30 longs and running inside the same cog as whatever is being debugged, is a very useful feature.
Anyone know whether Chip's SDRAM driver can be adapted to work when split into "task" mode (albeit slower). Wasn't it only 60 longs or so?
I presume in your code you have four tasks scheduled on a normal round-robin basis so they execute as: task 0, task 1, task 2, task 3, task 0, task 1... instruction by instruction.
Now, I'm not sure, but I thought I picked up the idea months back that it was possible to schedule things a bit differently. So let's say task 0 is your most speed-sensitive thing and needs more time than the rest; one arranges to give it more execution "slots" so that things run as something like: task 0, task 1, task 0, task 2, task 0, task 3, task 0, task 2...
Now that of course does not solve the problem of all tasks hitting HUB at the same time and stalling there. But at least it does give the busy guy more chance to get in.
No idea how one sets this up or if it would help your case but it might be interesting to investigate.
Anyone any idea?
I'm already doing this in the form of task 1, 0, 1, 2... repeating
Where task 1 is the video stuff, task 0 is the game engine and task 2 is the serial monitor, timers and sound.
I had to do this because I was having video sync issues. So I really have 3 tasks not 4, but video gets 50% of the time slots.
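For what it's worth, here's a little Python tally of that 1, 0, 1, 2 slot pattern (purely illustrative, not how the slots are actually programmed on the chip):

    from collections import Counter

    pattern = [1, 0, 1, 2]   # task 1 = video, task 0 = game engine, task 2 = serial/sound
    share = Counter(pattern)
    for task in sorted(share):
        print(f"task {task}: {share[task]}/{len(pattern)} slots "
              f"= {100 * share[task] / len(pattern):.0f}%")
    # task 0: 25%, task 1: 50%, task 2: 25%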
The pipeline stall has the most dramatic effect on the Video task. The other tasks are tolerant to stall.
My current code has been "tuned" around the sync problems, but the problem of stall increased with a variant of that code I'm
working on at the moment. It's been an interesting exercise anyway....
Cheers
You really are being extreme.
So what you need is an 80 MHz chip plus a bit of overclocking...
Yes please.
Maybe I need to drop my little Nano board in some Liquid Nitrogen!
There are three tasks. One of them gets 50% of the time. The other two still have time to spare.
All works well except when the high-priority task does a hub op and is stalled by the two lower-priority guys who happen to be in line to do HUB ops just ahead of it.
Would things work better if we could be sure that there was only one other task doing a HUB op when the high-priority guy wants to? That is to say, if the two slower guys were always doing HUB ops apart from each other in time, thus ensuring there were never three back-to-back HUB ops occurring?
I suspect this could be done by having one low priority guy set a flag a few instructions before his HUB op and clearing it again a few instructions later. Meanwhile the other low priority guy checks that flag and promises not to do a HUB op if it is set.
Now those two have their HUB ops always spaced out in time and the big guy only ever has to wait for one of them if it's in the way at the time.
Or am I crazy?
I see a potential race condition with that approach (using a flag), where task 0 always sets its flag before task 2 does.
Another approach might be to time everything from the task 1 (video) HUBOP. The approach would be to have task 0 enter a NOP loop prior to performing a HUBOP, which then spins until 5 cycles (assuming Cluso99's calculations are correct) after task 1 HUBOP completes. This would ensure that task 1 doesn't stall. Task 2 would perform a similar operation, but would wait 13 cycles. Note that if task 0 wasn't waiting, but task 2 was, you still wouldn't get a stall on task 1 since task 2 would effectively just skip one hub access window (which it was already doing when task 0 was waiting).
You could use the CLRB or SETB instruction just like a LOCK, but between threads. It can return into C the prior state of the bit, while post-clearing or -setting it.
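In rough terms that behaves like an atomic test-and-set; here's a Python-flavoured sketch of the idea (illustrative only, not the actual instruction encoding):

    def setb_wc(state, bit):
        """Model of 'set bit, write C': returns (old_bit, new_state)."""
        old = (state >> bit) & 1
        return old, state | (1 << bit)

    flags = 0
    c, flags = setb_wc(flags, 0)   # first task: c == 0, it owns the "lock"
    c, flags = setb_wc(flags, 0)   # second task: c == 1, someone already has it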
The idea was that only one thread sets and clears the flag. The other only reads the flag and proceeds or not.
Because the flag is set some instructions prior to the HUB op, it does not matter if there is a little race; the reading thread may end up doing a HUB op just after the flag is set, but that still leaves some time free before the other thread's HUB op. Because the flag is cleared some instructions after the HUB op, there will be some time free before the other thread's HUB op.
Another, more elegant, way to do this is that one thread only ever sets the flag and the other only clears it. The former loops on flag clear and sets it when it proceeds. The latter loops on flag set and clears it when it proceeds.
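A minimal sketch of that hand-over, using Python threads just to show the alternation (reading it as: the setter proceeds when the flag is clear, the clearer proceeds when it is set):

    import threading, time

    flag = threading.Event()          # clear = setter's turn, set = clearer's turn

    def setter(n):
        for _ in range(n):
            while flag.is_set():      # wait for the flag to be clear
                time.sleep(0)
            print("setter does its HUB op")
            flag.set()                # hand the next hub slot over

    def clearer(n):
        for _ in range(n):
            while not flag.is_set():  # wait for the flag to be set
                time.sleep(0)
            print("clearer does its HUB op")
            flag.clear()              # hand it back

    a = threading.Thread(target=setter, args=(3,))
    b = threading.Thread(target=clearer, args=(3,))
    a.start(); b.start(); a.join(); b.join()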
NOPs are just placeholders for 1-clock instructions.
I think all tasks' hub slots would fall into line after one loop.
The NOPs repeat, always one or five cycles off of aligning to the hub access window. With an even number of NOPs, there would still be a one clock stall on task0's RDLONG after the JMP. Otherwise, wouldn't you get the same stall pattern on every loop?
The diagram looks good.
I'm not sure I follow the last part, but basically, if you have four tasks running and each one does a hub access every 32 cycles, on the nose, no task will have to wait for its hub cycle.
If you had three tasks running, each could do a hub instruction every 24 clocks, except that's not possible, since you have 16 time slots, and 16's not cleanly divisible by 3.
If you had two tasks running, each could do a hub cycle every 16 clocks.
I think the timing can never be exactly 32 clock cycles. You have RDLONG(t0) + RDLONG(t1) + RDLONG(t2) + RDLONG(t3) + (NOP * 4 * n) + (JMP * 4) cycles, or 3+8+8+8+4n+4 cycles, which simplifies to 31 + 4n cycles. There is no number of NOPs that would result in 32-cycle alignment. At best, you'd get a one-cycle stall at the beginning of the next loop for task 0, and all of the other tasks would still stall for 5 clocks each.
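A quick Python check of that, where n is the number of padding NOPs per task:

    for n in range(16):
        total = 3 + 8 + 8 + 8 + 4 * n + 4     # RDLONGs + NOPs + JMPs
        print(f"n={n:2d}: {total} clocks, multiple of 32: {total % 32 == 0}")
    # 31 + 4n is always odd, so it can never be 32-aligned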
I always imagine the beauty of the hardware scheduled threads is that they can potentially all be waiting on some external event, a pin change say, and they will be able to respond to it with minimal latency (They don't have to wait for a cooperative task switch from other threads).
Given that, in this scenario we have no way to keep lockstep. However, that business of waiting on a flag before doing a HUB op may be able to separate HUB ops in time so that the guy who needs to be free of long stalls can be.
Ah, I see what you are saying. I wasn't remembering that the pipeline gets stalled and it's shared by all the tasks.
You are right.
In drawing out the timing pattern, it looks like you could have two tasks doing RDLONGs with two other tasks not doing hub accesses:
A WRLONG would mess things up. Maybe there aren't any nice patterns to exploit.
The idea of using a flag to control HUB ops is already in my pile of failed code.
Another factor in this mess of mine is my use of POLVID in the video task.
This in itself introduces timing issues to the use of flag setting.
The race for the use of the HUB seemed to always be won by the video task.
Using NOPs to try to align hub ops would probably improve the efficiency of things,
but it's not practical in my case because I have no COG space left.
One idea I had early in this process was to synchronize lower-priority hub ops to only occur during video sync windows.
This ended up being more difficult than first imagined.
I think this discussion is excellent, ozpropdev! A slightly complicated time-critical application is just what a new architecture needs. Coding techniques are developed and potential architectural improvements are discovered.