
Observations of a multi-tasker


Comments

  • cgracey Posts: 14,133
    edited 2013-09-20 15:44
    Tubular wrote: »
    The existing multi tasking setup is really very powerful and effective, and think will actually find very widespread use. I see it as a logical architectual extension of the holistic propeller concept - just as you can have a micro that has cogs running in parallel, so too you can divide a cog so it has several tasks running in parallel. Having a task in-cog with full access to the inside of that cog is fantastic for debugging.

    Chip, would it be possible to modify the existing SETTASK instruction so that, during setup, it works out whether to rotate 16 or 15 bits according to the two most significant bits? i.e. if the MSBs are 00 then it goes on a 15-task sequence, and if they are anything else it uses a 16-task sequence. We normally set up with a %%3210 rather than %%0123 anyway.

    And if that were possible, how much more mischief would it be to extend it to arbitrary task sequence lengths by "removing the leading zero bit pairs"? So you could have a sequence length of, say, 6 (%%302010) if you wanted to, and %%3210 would just go on a short 4-task rotation without needing to be extended to fill 32 bits.

    Glad you like the ; idea, Chip. I understand it may not be possible to incorporate at this stage.

    I was pondering this for a while yesterday and your approach is way more novel than mine.

    I drew up all sorts of scenarios for 3 tasks (as 1, 2, and 4 are easily sliced and diced) and I realized that a task loop of 12 slots solves all kinds of symmetry problems. The big goal was to ensure that tasks each kept a certain number of instructions in the pipeline at all times, so that delayed branches could be used to advantage, where a same-task instruction could follow the branch.

    Anyway, your solution is very elegant - determine the number of task loops by throwing away leading %00 bit pairs. I'll redo it your way now!
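    For anyone trying to picture the scheme, here is a toy C model: up to sixteen 2-bit task IDs sit in the 32-bit slot pattern, and the leading %00 pairs are thrown away to set the loop length. This is purely an illustration, not the actual cog logic, and the order in which the hardware walks the slots may differ:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t pattern = 0xC84;   /* %%302010: task IDs 3,0,2,0,1,0 */
        int len = 16;

        /* throw away the leading %00 pairs to find the rotation length */
        while (len > 1 && ((pattern >> (2 * (len - 1))) & 3) == 0)
            len--;

        /* len is now 6; cycle through the surviving slots.  (Which end
           the real hardware starts from is glossed over here.) */
        for (int clock = 0; clock < 12; clock++) {
            int slot = (len - 1) - (clock % len);
            int task = (int)((pattern >> (2 * slot)) & 3);
            printf("clock %2d -> task %d\n", clock, task);
        }
        return 0;
    }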
  • Bill Henning Posts: 6,445
    edited 2013-09-20 15:54
    I like SARACCx as well.

    I can only see one advantage to FITACCx ... if you squint the right way, it may have been usable as 18 bit mantissa normalization for a fast floating point format (18 bits mantissa, 6 bit exponent) even though it would have required some manipulation of the sign/exponent afterwards.
    cgracey wrote: »
    With SARACCA/SARACCB/SARACCS, you can arithmetically right-shift 0..63 bit positions. This feels a lot better than the FITACCx stuff. I mean, you should know how far to shift, already, based on your algorithm. FITACCx was too smart for its britches. Determinism is better, in this case. Thanks a lot for bringing this shifting matter up again.

    A MOVF in every task could start stretching the ALU result path. I'll see, though.
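    For reference, the shift operation being discussed is just a sign-extending right shift. A minimal C model, assuming a 64-bit signed accumulator and a compiler whose >> on signed values is an arithmetic shift (as GCC and Clang do):

    #include <stdint.h>

    /* shift the accumulator right arithmetically by 0..63 places,
       replicating the sign bit into the vacated positions */
    int64_t sar_acc(int64_t acc, unsigned n)
    {
        return acc >> (n & 63);   /* shift count limited to 0..63 */
    }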
  • Yanomani Posts: 1,524
    edited 2013-09-20 17:26
    Ariba wrote: »
    It depends on the application, as always. Do you have a certain application in mind which needs to transfer a lot of 32-bit data between all cogs?

    My feeling is that if you have a lot of data to transfer, the way over the hub RAM with QUADs is the fastest. Perhaps with some handshake bits in PORTD.
    As far as I know, PORTD is more than just a shared 32-bit register. You can also split PORTD into 4 byte channels, for example, and then define which cog sees which channel, or something like that. But I have not done any experiments in that direction; it would not make much sense on a DE0-Nano with one cog ;-)

    Andy

    Ariba

    Not at this moment; it's just that the deeper I dive into the Propeller's architecture, the more useful it seems to me.
    Since I've specialized in state machine design, I'm wondering how fast and generic it can be.
    - As a starting point, determinism is of utmost importance: the Propellers handle this nicely;
    - Speed and parallelism make many things a lot easier: both Props offer them, in their own niches. If you seem to be stuck dealing with a race condition, split it between Cogs and keep going;
    - Complementing the above: if, along your development, you find a nice way to deal with them, perhaps using the threading concept, shrink them back into a single COG;
    - The more (and faster) internal feedback paths one can count on, the easier it is to solve timing constraints and spare I/O pins. One can also have multiple COGs dealing with a single faster stream: determining packet structure, stripping data from header and trailer information, and simultaneously doing handshaking, calculating ECC, and many other functions;
    - Extending the above: security and IP protection preserve creative investment and profitability;
    - Modularity: what was already designed and verified can be easily aggregated into a new project;
    - Programmability, extensibility, ease of debugging and maintenance, and in-circuit modifiability are all extremely important features;
    - Assembly, assembly, assembly; no Verilog, no RTL. Chip and Parallax carried all this for us!
    - Spin and many other languages to choose from in the near future: you can do a faster proof of concept, or even a final product, using them.
    - If I forgot something (many things, in fact, I'm sure), every new day brings a new lesson to be learned.

    I believe I must refrain from more typing; there are a lot of important things happening in this thread, but I'm sure you understood my basic ideas. :smile:

    Yanomani
  • evanh Posts: 15,187
    edited 2013-09-20 17:29
    Oh, I'm a bit slow finding this topic.

    If you want an improvement to the hardware multitasking support I'd vote for a pipeline per thread setup. This would wipe out the stalling problems.
  • ozpropdev Posts: 2,791
    edited 2013-09-20 17:30
    cgracey wrote: »
    Do you think it would be good to replace SPLITW and MERGEW (unary operations that split and merge words and bits) with ENDIAN8 and ENDIAN4 that perform byte- and nibble-endian reversals?

    Chip

    That would be great. :)
    I know it's pushing it, but could that also be merged with a hub op, like so:

    ERDLONG DEST,SRC 'read LONG and ENDIAN reverse
    ERDWORD DEST,SRC 'read WORD and ENDIAN reverse

    Edit:

    Could POLVID be modified to operate like PASSCNT, i.e. jump to itself while the VID hardware is busy?
    This would tighten up video ops in multi-tasking and save COG RAM.


    Just an idea
    Brian
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-09-20 18:54
    evanh wrote:
    If you want an improvement to the hardware multitasking support I'd vote for a pipeline per thread setup. This would wipe out the stalling problems.
    In theory, yes. This evening, after posing this and expounding upon its merits all day, I went out for a long kayak paddle, during which I realized that "one pipeline per thread" also means "one ALU per thread." The reason is that instructions that require more than one clock to complete need to own the ALU in which they're executing. So I really doubt that dev time and silicon real estate will permit such a useful extravagance.

    -Phil
  • David Betz Posts: 14,511
    edited 2013-09-20 19:11
    In theory, yes. This evening, after posing this and expounding upon its merits all day, I went out for a long kayak paddle, during which I realized that "one pipeline per thread" also means "one ALU per thread." The reason is that instructions that require more than one clock to complete need to own the ALU in which they're executing. So I really doubt that dev time and silicon real estate will permit such a useful extravagance.

    -Phil
    Right. Any attempt to access a resource that is already in use performing a multi-cycle operation would have to cause the pipeline of the thread attempting the access to stall.
  • Roy Eltham Posts: 2,996
    edited 2013-09-20 19:11
    cgracey wrote: »
    I got rid of the FITACCx setup and now have SARACCx D/#n instructions.

    Do you think it would be good to replace SPLITW and MERGEW (unary operations that split and merge words and bits) with ENDIAN8 and ENDIAN4 that perform byte- and nibble-endian reversals?

    I would REALLY hate to lose bit-level split and merge. It's super handy to be able to merge two words into a single DWORD, and to split the two words back out from the DWORD. This is critical for doing Morton ordering, which is really good for arranging 2D data (like textures or view buffers).
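    For anyone who hasn't run into it, here is a rough C sketch of the word-level merge/split being described: interleave two 16-bit words into one Morton-ordered long, and pull them back apart. Which word lands on the even vs. odd bits of the real MERGEW/SPLITW may differ; this just shows the idea, and why an interleaved (x, y) pair makes a locality-friendly 2D index:

    #include <stdint.h>

    /* merge_words(x, y) gives a Morton index for the coordinate (x, y) */
    uint32_t merge_words(uint16_t x, uint16_t y)
    {
        uint32_t m = 0;
        for (int i = 0; i < 16; i++) {
            m |= (uint32_t)((x >> i) & 1) << (2 * i);       /* even bits */
            m |= (uint32_t)((y >> i) & 1) << (2 * i + 1);   /* odd bits  */
        }
        return m;
    }

    /* split_words() recovers the two original words from the index */
    void split_words(uint32_t m, uint16_t *x, uint16_t *y)
    {
        *x = 0;
        *y = 0;
        for (int i = 0; i < 16; i++) {
            *x |= (uint16_t)(((m >> (2 * i)) & 1) << i);
            *y |= (uint16_t)(((m >> (2 * i + 1)) & 1) << i);
        }
    }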
  • Yanomani Posts: 1,524
    edited 2013-09-20 20:20
    Heater. wrote: »
    Yanomani,


    I love the way you put that. Many here feel the same but never found such a nice way to say it!

    Heater

    I must say thank you for your comment and also to my mother, for teaching me about flowers and poetry.
    To save some space amid all the good brainstorming that is happening here, I have PM'd you a message extending my answer to this post a bit.

    Yanomani
  • cgracey Posts: 14,133
    edited 2013-09-20 21:04
    In theory, yes. This evening, after posing this and expounding upon its merits all day, I went out for a long kayak paddle, during which I realized that "one pipeline per thread" also means "one ALU per thread." The reason is that instructions that require more than one clock to complete need to own the ALU in which they're executing. So I really doubt that dev time and silicon real estate will permit such a useful extravagance.

    -Phil

    Exactly! With a pipeline goes an ALU, which in this case is maybe 1/2 the cog's logic/area.
  • cgracey Posts: 14,133
    edited 2013-09-20 21:07
    ozpropdev wrote: »
    Chip

    That would be great. :)
    I know it's pushing it, but could that also be merged with a hub op, like so:

    ERDLONG DEST,SRC 'read LONG and ENDIAN reverse
    ERDWORD DEST,SRC 'read WORD and ENDIAN reverse

    Edit:

    Could POLVID be modified to operate like PASSCNT, i.e. jump to itself while the VID hardware is busy?
    This would tighten up video ops in multi-tasking and save COG RAM.


    Just an idea
    Brian

    ERDWORD is asking too much in this case, but the idea of making a PASSCNT-like version of POLVID is possible.
  • KC_Rob Posts: 465
    edited 2013-09-20 21:11
    cgracey wrote: »
    Exactly! With a pipeline goes an ALU, which in this case is maybe 1/2 the cog's logic/area.
    We're almost back to more cogs. Sorry, couldn't resist. :)
  • Heater. Posts: 21,230
    edited 2013-09-21 01:42
    Guys, I'm worried, I just woke up in a cold sweat.

    We keep suggesting new tweaks for the Prop II. Chip seems to be keen on getting some of them in. Ken is worrying about the time scales, costs and risks involved.

    I was having nightmares about the PII. Like it will never be finished, or it gets complex enough that it becomes impossible to get it to work, or Parallax simply runs out of money for it, or time and progress make it irrelevant.

    The terrifying vision of Chip and Ken falling out over this, starting a family feud that goes on for 10 generations. Parallax goes bankrupt and we all die without ever getting a PII in our hands.

    Pinch me someone, tell me this is not going to happen...
  • Heater. Posts: 21,230
    edited 2013-09-21 01:57
    KC_Rob,
    We're almost back to more cogs. Sorry, couldn't resist.
    As I said before, more COGs do not help, due to reduced HUB access bandwidth and increased latency to the HUB.

    However, what you suggest, more cores, is a valid way to go. If you look at the Parallella chip from Adapteva http://www.adapteva.com/ you find this philosophy:

    1) We want as many floating point units in a chip as possible. We want low power consumption and a small chip.

    2) To do that each core will be as simple as possible. No caches, no pipelines, a very small instruction set. Floating point is priority.

    3) To get around the memory access bottleneck, let's have a core-to-core communications network. All the cores sit on a "grid".

    4) This is all hopeless for running big programs like Linux and regular code, so let's have a high-speed interconnect to an ARM chip that will handle all that.

    Of course, apart from having many cores, the Parallella chip is nothing like a Propeller and not suited to what a Propeller does. We will see how well it works out. I'm still waiting for mine to arrive.

    Meanwhile, the Prop's hardware scheduling is a great solution for mopping up the little, less demanding tasks. Those things that make you cringe at the idea of wasting a whole 32-bit CPU on such a simple job, as we do now on the P1.

    I don't think trying to turbocharge it is worth the effort. After all, if you need the speed, use a separate COG.
  • ozpropdev Posts: 2,791
    edited 2013-09-21 04:10
    Hi Guys
    Getting back to where this conversation started.
    "Multi-tasking"
    My initial reason for going down this road was simply a poor man's way of getting 4 Cogs from my 1-Cog Nano board.
    I think the results speak for themselves. In hindsight I probably would have done things differently but that's easy to say now.
    I think in its current format it is very easy to use and simple to "tune up" for maximum performance.
    I have already started converting an old P1 project to P2, and it's looking like I will have Cogs to spare!
    I already have radical ideas on what I will use those for.
    What more can I say?
    Simply Brilliant! :)

    Cheers
    Brian
  • David Betz Posts: 14,511
    edited 2013-09-21 04:33
    cgracey wrote: »
    Exactly! With a pipeline goes an ALU, which in this case is maybe 1/2 the cog's logic/area.
    Why can't the ALU be shared among the threads? After all, only one instruction is actually executing at any given time. The only problem would be if one thread executes an ALU instruction that takes more than one clock. Then all of the threads would stall until that instruction completes, because (presumably) every instruction requires the ALU.
  • Seairth Posts: 2,474
    edited 2013-09-21 05:58
    Seairth wrote: »
    Alternatively, if the existing TASK registers were re-purposed (or even dual-purposed!) such that TASKSW would do the following:

    * Switch to the next TASK (based on bits in the control register) in round-robin fashion
    * Switch to an explicit TASK (think "child" tasks)
    * Optionally fix up the current TASK register to its resume address (which would allow tasks to "loop" without an extra instruction)
    * Have a delayed version of the instruction to avoid flushing the pipeline

    Since TASKSW is currently syntax sugar and can be confused with the actual TASK hardware (see [post=1208442]evanh's comments[/post]), can it be re-purposed for a new instruction?

    TASKSW D/#n [,#addr]

    * D/#n is a number between 0 and 3. Internally, this performs a SETTASK #%%nnnn. In other words, it updates the task scheduler to switch all time slots to the specified task.
    * optionally, #addr is an address that the currently-active task's PC is updated to. Internally, this is just a JMPTASK #addr, mask, where mask references the current task only.

    So, in other words, TASKSW is internally a SETTASK with an optional JMPTASK. With this, all of the above would be accomplishable, with the following caveats:

    * Due to the current behavior of TASKSW, TASKSW would also not flush the pipeline, effectively acting as a delayed instruction. While I would like to be able to optionally flush the pipeline, I'm concerned that this would require the internal behavior of SETTASK to change (which would add risk).
    * Round-robin switching could be supported either by explicit ordering using #n or by using a D register that is incremented by each task.

    Note that this does not, in any way, alter the existing task registers or behavior. It simply makes good use of existing resources for those who want to implement cooperative multitasking.
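    To make the proposal concrete, here is a rough C model of what the proposed TASKSW would do internally. Illustration only: the instruction does not exist yet, and the struct and field names below are made up for the sketch.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t slot_pattern;   /* task scheduler slots, 2 bits each   */
        uint32_t task_pc[4];     /* one program counter per task        */
        int      current_task;   /* task currently executing            */
    } CogModel;

    /* TASKSW n [,addr]: hand every time slot to task n (the SETTASK
       #%%nnnn...n part), and optionally redirect the *current* task's
       PC (the JMPTASK-on-the-current-task-only part). */
    void tasksw(CogModel *cog, int n, const uint32_t *addr)
    {
        uint32_t pattern = 0;
        for (int i = 0; i < 16; i++)
            pattern |= (uint32_t)(n & 3) << (2 * i);
        cog->slot_pattern = pattern;
        if (addr != NULL)
            cog->task_pc[cog->current_task] = *addr;
    }

    /* e.g. tasksw(&cog, 2, NULL) switches all slots to task 2 and
       leaves the current task's resume address alone. */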

    Thoughts?
  • Cluso99 Posts: 18,069
    edited 2013-09-21 06:08
    Wow, what a discussion in the last day!

    While Invaders would have used more cogs if they had been available in the DE0, it just shows what is possible with the elegantly simple task (thread) switcher. There will be lots of driver implementations making use of this.
    I would love more cogs too, but the reality is the space taken up by the cog RAM and CLUT. I am prepared to wait for the P3 in 40nm, with 4x 8-cog cores and an extra shared hub between those 4 sets. The P2 will just have to sell heaps to fund this, unless one of us wins the multi-million lottery :)

    SARACCx seems like a great improvement, and I do like the ENDIAN8 & ENDIAN4 instructions, although I am not so sure about losing the instructions they would replace.

    BUT I am going to have "Heater's nightmares" tonight, with Ken freaking me out! Please, no more risks, Chip - we all want the P2 ;)
  • Bill Henning Posts: 6,445
    edited 2013-09-21 07:54
    In a dream... why was Ken chasing a pack of propeller-beanie wearing people with a chainsaw???
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-09-21 07:58
    David Betz wrote:
    Why can't the ALU be shared among the threads? After all, only one instruction is actually executing at the same time. The only problem would be if one thread executes an ALU instruction that takes more than one clock. Then all of the threads would stall until that instruction completes because (presumably), every instruction requires the ALU.
    The idea was that each thread would get one clock during its timeslice. That preserves determinism. Allowing multi-clock instructions to run to completion in each timeslice negates that advantage. But, with one clock per timeslice, leaving multi-clock instructions half-baked requires that each thread have its own ALU.

    In its current incarnation, as I now understand it, the threading mechanism uses a single pipeline and is basically just a round-robin instruction fetcher that interleaves instructions into that pipeline for execution. It's a simple mechanism that provides a lot of useful functionality for little cost. You just have to beware the caveats regarding stalls.
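    A toy model of that picture, for anyone following along: one shared pipeline fed round-robin, with the comment on the last line marking where a stall would hold everyone up. An illustration only, not the real stage count or timing.

    #include <stdio.h>

    int main(void)
    {
        int slot[4] = {0, 1, 2, 3};   /* four time slots, one task each
                                         (ordering here is illustrative) */
        int pc[4]   = {0, 0, 0, 0};   /* one program counter per task    */

        for (int clock = 0; clock < 8; clock++) {
            int task = slot[clock % 4];            /* round-robin fetch  */
            printf("clock %d: fetch task %d, instruction %d\n",
                   clock, task, pc[task]);
            pc[task]++;   /* that instruction now enters the one shared
                             pipeline; if it stalls there (a hub op, say),
                             the tasks queued behind it wait as well     */
        }
        return 0;
    }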

    -Phil
  • David Betz Posts: 14,511
    edited 2013-09-21 08:27
    The idea was that each thread would get one clock during its timeslice. That preserves determinism. Allowing multi-clock instructions to run to completion in each timeslice negates that advantage. But, with one clock per timeslice, leaving multi-clock instructions half-baked requires that each thread have its own ALU.

    In its current incarnation, as I now understand it, the threading mechanism uses a single pipeline and is basically just a round-robin instruction fetcher that interleaves instructions into that pipeline for execution. It's a simple mechanism that provides a lot of useful functionality for little cost. You just have to beware the caveats regarding stalls.

    -Phil
    The only reason I suggested separate pipelines without separate ALUs is to get around the really long delays that can result from hub operations. You're right that sharing an ALU among several pipelines would still introduce stalls when the ALU is used to perform an operation that takes more than one clock, thereby breaking determinism. I guess there really is no way to get deterministic behavior from code running under the threading mechanism without huge investments in redundant hardware. Is this how hyperthreading works on the Intel chips? Do they essentially duplicate all resources for each thread? If so, how is that different from having a separate core per thread? What is it that they are sharing? Level 1 caches?
  • Heater. Posts: 21,230
    edited 2013-09-21 08:45
    David,

    I have no idea about Intel hyper-threading really, but in a recent video on YouTube a guy was showing slides of Intel chips; there you could see 4 cores, and each core had two ALUs (or was it more?), two multipliers, and so on. As far as I remember, the top-level cache is shared between all those hyper-threads.

    Don't forget a couple of points:

    1) Intel chips can re-order instructions so they flow down the pipe smoothly. No idea if or how they do that.

    2) You would never know about any stalls of individual instructions. They are hidden behind so many layers of cache, which have massive stalls of their own, that it would be impossible to tell when any given instruction was executed.

    In that respect the Prop has a much tougher problem to solve if you want lots of threads and no detectable jitter on I/O or video.
  • David Betz Posts: 14,511
    edited 2013-09-21 09:44
    Heater. wrote: »
    In that respect the Prop has a much tougher problem to solve if you want lots of threads and no detectable jitter on I/O or video.
    Yes, I guess I should have used a more appropriate example. How does XMOS do it then? They are in the same market as the Propeller and require the same degree of determinism. Do you know if they have multiple ALUs, one for each thread?
  • Heater. Posts: 21,230
    edited 2013-09-21 10:07
    David,

    The XMOS thing is a puzzle. As far as I know they don't have an ALU per thread.

    I have seen nice diagrams where they show that an instruction has four stages: fetch, decode, execute, write. Or something like that. Then they show four threads all at different stages of that pipe, moving along nicely.

    That would imply only one ALU is needed as there is only one execute going on at a time.

    The puzzle for me is that the multiply (or was it divide) takes more clocks. And I'm told there is only one multiplier.

    What's with that? They say a multiply does not stall the pipe. David May himself has tried to explain that to me on the XMOS forum and I'm not sure I got it. However, the XMOS tools come with a timing analyzer that will count instructions and time things, taking into account how many threads are running. In my experiments with the timing analyzer, and on real chips, I could never get the use of a multiply to show up as having an effect on timing.

    As David says in the architecture document:

    "The XCore scheduler therefore allows threads to be treated as virtual processors with performance predicted by tools. There is no possibility that the performance can be reduced below these predicted levels when virtual processors are combined."

    Have a look at the XMOS architecture document that talks about all this in some detail:
    https://www.xmos.com/download/public/The-XMOS-XS1-Architecture(X7879A).pdf
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-09-21 10:14
    A possible compromise might be that if an instruction is waiting for hub access or some other condition, don't advance the thread's instruction pointer and move immediately to the next thread. That way the instruction will repeat until the condition is met. The downside is that you can't tell when a stalled thread will get hub access, if ever. And you will not be able to characterize hold times for input conditions.

    -Phil
  • Heater. Posts: 21,230
    edited 2013-09-21 10:43
    David,

    I managed to find clarification of the XMOS situation.

    1) Pretty much all instructions take the same time. So, using only those, thread timing is spot on. No jitter anywhere.

    2) There is one divide unit per core. Problems arise here if more than one thread hits a divide instruction at the same time. Then we get a stall. As an XMOS employee put it "These are the only instructions where there is inter-thread interference."

    3) The XMOS timing tool knows how many threads you have running and about use of divides so it can predict the worst case timing.

    4) One is not supposed to worry about these jitters, as timing-sensitive I/O should use clocked input and output.

    Here is a nice short and clear thread about it where I discussed the issues back in 2010:

    http://www.xcore.com/forum/viewtopic.php?t=472&f=27
  • Heater. Posts: 21,230
    edited 2013-09-21 10:48
    Phil,
    ...if an instruction is waiting for hub access or some other condition, don't advance the thread's instruction pointer and move immediately to the next thread. That way the instruction will repeat until the condition is met.
    That is what XMOS does with their threads, during I/O operations for example.
    ...you will not be able to characterize hold times for input conditions.
    XMOS gets around that by having clocked, latched, whatever I/O.

    All in all the XMOS is a complex beast and I think most here would prefer to code for the Prop. Especially when it comes to assembler. We can accept the Prop II hardware scheduling as a bonus even if it is not perfect in some respects. The Prop makes life easier than the XMOS in many other ways.
  • Heater. Posts: 21,230
    edited 2013-09-21 11:01
    David,

    Here is a rather longer discussion with the XMOS guys where I tried to bust their determinism. I think I had them worried for a while. Still not quite sure who won that debate :)

    At least on the page I link to below there is a nice explanation of XMOS scheduling and a pretty diagram.

    http://www.xcore.com/forum/viewtopic.php?f=5&t=738&start=10
  • David Betz Posts: 14,511
    edited 2013-09-21 11:05
    So is the conclusion of this P2 thread-scheduling discussion that if you want determinism, you run only one thread on a COG, and if you don't care that much about determinism, you can use the threading support?
  • kwinn Posts: 8,697
    edited 2013-09-21 11:22
    Heater. wrote: »
    David,

    The XMOS thing is a puzzle. As far as I know they don't have an ALU per thread.

    I have seen nice diagrams where they show that an instruction has four stages: fetch, decode, execute, write. Or something like that. Then they show four threads all at different stages of that pipe, moving along nicely.

    That would imply only one ALU is needed as there is only one execute going on at a time.

    The puzzle for me is that the multiply (or was it divide) takes more clocks. And I'm told there is only one multiplier.

    What's with that? They say a multiply does not stall the pipe. David May himself has tried to explain that to me on the XMOS forum and I'm not sure I got it. However, the XMOS tools come with a timing analyzer that will count instructions and time things, taking into account how many threads are running. In my experiments with the timing analyzer, and on real chips, I could never get the use of a multiply to show up as having an effect on timing.

    As David says in the architecture document:

    "The XCore scheduler therefore allows threads to be treated as virtual processors with performance predicted by tools. There is no possibility that the performance can be reduced below these predicted levels when virtual processors are combined."

    Have a look at the XMOS architecture document that talks about all this in some detail:
    https://www.xmos.com/download/public/The-XMOS-XS1-Architecture(X7879A).pdf

    I wonder whether the multiply and divide units could have their own pipelines. I recall seeing something reminiscent of that in an old computer: the multiply, divide, and shift instructions were done by placing the data and operand in specific registers and reading the result register a number of cycles later.