Propeller II update - BLOG

jmg · 2014-03-02 23:02

Bill Henning wrote: »

I could easily see disallowing task switching until all current long ops in the pipeline finish - maybe even including the delayed jumps and REPx blocks
(with an exception for a debugging mode).

SETMODE ATOMIC - all MUL/DIV/delayed instructions in the pipeline complete before task is allowed to starve/stop
SETMODE DEBUG - the way things are right now, caveat emptor

This (IMHO) may be easy enough to implement without needing too many transistors or changes.

Agreed, which is what I also said in #5686.
Task-Swap will likely prefer to wait-until-done, on any delayed-resource, as that simplifies restore.
(nothing is left half-baked)

Debug is not going to swap in a whole new block, it is more interested in granularity, so it can take a snapshot every task-clock.

Heater. · 2014-03-02 23:07

Bill,

Too many cycles to waste. Extending logic, all RDxxxx/WRxxx should take 8 cycles, so we don't get hub overlap.

Sorry I don't understand.

I have no idea of the figures but let's say that MUL32 takes 2 or 4 or 8 or whatever times longer than a normal instruction. There are two ways to go:

a) I just write that instruction in your code and it takes as long as it takes.

b) I write the instruction which immediately continues to the next instruction and I have to somehow poll or twiddle my thumbs for the result to pop out.

My preferred way is a) because it's the simplest from a programming point of view.

Option b) could potentially let you do some other work whilst you are waiting.

My argument is that b) is so hard to use, and can only be used in cases where you can find other work to do, that it is of little practical benefit. The complexity far outweighs the gain, especially now that we are talking about multi-threading.

As an example, the Intel i860 had exactly such a mode, which you had to use to get peak flops out of it. It was very hard to use. Compilers did a very bad job of using it. The chip was slow as a result. A waste of time implementing it. The i860 was a flop.

I don't see what RDxxx/WRxxx has got to do with this, other than that it is a long-winded operation: it takes time, you have to wait, and it's atomic. Perfect.

I could easily see disallowing task switching until all current long ops in the pipeline finish

That sounds like a disaster. If you disable task switching whilst waiting for a long operation you have just stalled the entire COG. Even when the other threads don't need that resource. That is a big performance hit for everything.

cgracey · 2014-03-02 23:08

Heater. wrote: »

It seems straightforward to me.

A MUL32 or other long-winded instruction should appear to be an atomic operation. Just like any other instruction.
If the result takes a while to come out, so be it.

Further, if in a threaded mode it turns out some other thread is using the multiplier hardware then you just have to be stalled until it becomes free again. Rather like a HUB access. Then stalled some more waiting for your result. So be it.

Anything else is a programming nightmare.

I don't believe arranging for the programmer to be able to do some other work whilst waiting is going to yield much benefit. And the complexity of it ensures that it will almost never be used.

How hard it is for Chip to make these long winded operations into atomic operations I have no idea. I hope it's not too hard.

P.S If it turns out to be easy for hardware scheduled threading but difficult and expensive for preemptive interrupt driven threads I suggest dropping the latter.

The problem with making these long-winded operations atomic is that there are sometimes multiple results, like X and Y from QROTATE. There are also multiple setups (SETQI/SETQZ before QROTATE). We would have to buffer those setup values and buffer the result values, in order to get atomic operation. This would jack up the amount of data that must be saved and restored for a thread switch. I'm kind of partial to TLOCK/TFREE, because they are useful for all kinds of other things you might need to do that I wouldn't want to make special safeguards for, like using INDA/INDB.

JRetSapDoog · 2014-03-02 23:08

cgracey wrote: »

Take all these various resources and pool them all, so that whoever needs them can use them. Break down the cog barriers.

It's perhaps a different level or kind of interconnectivity/sharing, but now I don't feel so greedy for wishing that a single cog could access more D/A's to, for example, drive more than one display (rather than using multiple cogs). "Tear down this cog wall, Mr. Chip!" he posted. But tasks and threads give one additional options in terms of overall design. Hmm, talk about making out-of-the-blue comments! Sorry, carry on with thrashing out the multi-threading details.

cgracey · 2014-03-02 23:14

jmg wrote: »

I think that is a yes.
To expand:
If a thread has issued MUL32, it is looping-for-result. Suppose another thread starts MUL32: it should Loop-until-free, and then MUL32 is accepted. (If the resource has any special pre-loads, those can be considered triggers to Not-Free, i.e. once you start using a resource, it is yours until done.)

Free here means the task that was looping-for-result has not only finished the pause, it has also read the result(s), and the final read signals Free.

99% of the time, this handshake is likely never needed, but when it is, you get two correct answers, in two threads, it's just that one may have taken a little longer than expected. No other COGS were disturbed.

The Waiting thread would issue MUL32, and if HW=busy, it starts Loop-until-free, then when the other thread is fully done, the 'paused' MUL32 launches the HW, and flips to looping-for-result.
It exits with no direct knowledge it needed the Loop-until-free.
No extra lines of code are needed, it just works.

I understand more now. Neat idea.

So, you are saying that it initially loops until free to issue, say MUL32, then the thread stays locked until the results are picked up?

cgracey · 2014-03-02 23:18

Bill Henning wrote: »

Heater,

Too many cycles to waste. Extending logic, all RDxxxx/WRxxx should take 8 cycles, so we don't get hub overlap.

BUT

I could easily see disallowing task switching until all current long ops in the pipeline finish - maybe even including the delayed jumps and REPx blocks
(with an exception for a debugging mode).

SETMODE ATOMIC - all MUL/DIV/delayed instructions in the pipeline complete before task is allowed to starve/stop
SETMODE DEBUG - the way things are right now, caveat emptor

This (IMHO) may be easy enough to implement without needing too many transistors or changes.

But we can't look at the pipeline and know if it's a good time to stop it, because the instructions within it represent various things in various states, potentially. What we can do is take away a task's time slots and let it run out of the pipeline. At that point, it is frozen and it can be manipulated, but you must capture all of its state data, like what was in the REP circuit, the delay-branch circuit, etc.

jmg · 2014-03-02 23:19

cgracey wrote: »

I understand more now. Neat idea.

So, you are saying that it initially loops until free to issue, say MUL32, then the thread stays locked until the results are picked up?

Yes, the Free is only set after the last result is collected, and it is set as BUSY, when the first resource-using write occurs.
In the time domain, this flag is wider than loop-for-result.
( It is not really the thread that stays locked, more the shared resource being used.)

Heater. · 2014-03-02 23:19

Chip,

The problem with making these long-winded operations atomic is that there are sometimes multiple results, like X and Y

I see what you mean. I might have guessed things were not so straightforward.

In that case I think I agree, go for the TLOCK/TFREE.

If I understand correctly, TLOCK stops all threading dead in its tracks and TFREE lets threads run again. Rather like disabling interrupts in a single core system just to be sure that no one else can upset what you are doing.

That seems to be the simplest way to go. It's coarse-grained and hurts overall performance, but given that we would actually like to see this chip someday it must be the way to go.

Of course if that MUL32 is going to hold up other threads and you don't want that, then just do the mul the old long hand way.

jmg · 2014-03-02 23:22

cgracey wrote: »

There are also multiple setups (SETQI/SETQZ before QROTATE). We would have to buffer those setup values and buffer the result values, in order to get atomic operation.

No, you can avoid multiple buffers and get atomic operation with a boolean, it just needs to set on first resource access, and clear on last resource read. (just like a corded phone hook-switch)

No need for wide buffers. If any opcode that would set busy finds the resource is not free, it loops to self until it becomes free.

Multiple buffers can shorten the time the resource is locked, but they are not essential, and I would guess the saving as a percentage is slight.
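The single-boolean handshake described above can be sketched as a small simulation. This is only an illustrative model, not the P2 hardware: the class `SharedUnit`, the function `simulate`, the 4-cycle latency and the strict round-robin time slots are all assumptions made for the example. The flag is set on the first resource-using write and cleared on the last result read, and a task that finds the unit busy simply retries in its next slot.

```python
class SharedUnit:
    """Hypothetical model of one shared unit (e.g. the multiplier) with one busy flag."""

    def __init__(self, latency, result_count=1):
        self.latency = latency            # assumed cycles until the result is ready
        self.result_count = result_count  # how many reads free the unit
        self.owner = None                 # None means Free; otherwise the task holding it
        self.ready_at = 0
        self.unread = 0
        self.value = None

    def try_start(self, task, now, a, b):
        if self.owner is not None:
            return False                  # busy: the caller "loops to self"
        self.owner = task                 # first resource-using write sets BUSY
        self.ready_at = now + self.latency
        self.unread = self.result_count
        self.value = a * b
        return True

    def try_read(self, task, now):
        if self.owner != task or now < self.ready_at:
            return None                   # still looping-for-result
        self.unread -= 1
        if self.unread == 0:
            self.owner = None             # the last read sets Free
        return self.value


def simulate(jobs, latency=4):
    """Run each task's multiply under round-robin slots; return results and finish cycles."""
    unit = SharedUnit(latency)
    names = list(jobs)
    state = {name: "start" for name in names}
    results, finish = {}, {}
    cycle = 0
    while len(results) < len(names):
        name = names[cycle % len(names)]  # one time slot per task in turn
        if state[name] == "start":
            if unit.try_start(name, cycle, *jobs[name]):
                state[name] = "wait"
        elif state[name] == "wait":
            value = unit.try_read(name, cycle)
            if value is not None:
                results[name] = value
                finish[name] = cycle
                state[name] = "done"
        cycle += 1
    return results, finish


results, finish = simulate({"A": (6, 7), "B": (3, 5)})
print(results, finish)
```

In this model both tasks get correct products with no extra code in either task; the second task just finishes later because it had to wait for the flag to clear.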

cgracey · 2014-03-02 23:27

jmg wrote: »

Yes, the Free is only set after the last result is collected, and it is set as BUSY, when the first resource-using write occurs.
In the time domain, this flag is wider than loop-for-result.

Got it. But when a thread successfully initiates MUL, but then gets switched out for 1ms before grabbing the results, every other task needing the multiplier gets hung up for that 1ms. This gets us back to needing TLOCK, doesn't it?

cgracey · 2014-03-02 23:29

I fully understand now. With simple multi-tasking, this would work great, but for threading, where a task gets switched out before reading the results and freeing the resource, we hang any other threads needing that resource, right?
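A back-of-envelope model of that concern, under stated assumptions: the function `stall_cycles` and its parameters are hypothetical, time is counted in abstract cycles rather than any real P2 clock, and the busy flag is assumed to clear only when the holder performs its final read.

```python
def stall_cycles(latency, switched_out_for, b_requests_at=1):
    """Cycles a second task spends looping-until-free (illustrative model only).

    The holder issues its operation at cycle 0, which sets the busy flag.
    The flag clears only when the holder reads the result, and it cannot
    read while it is switched out.
    """
    result_ready = latency
    holder_resumes = switched_out_for
    freed_at = max(result_ready, holder_resumes)
    return max(0, freed_at - b_requests_at)


# Holder never switched out: the wait is bounded by the unit's latency.
print(stall_cycles(latency=4, switched_out_for=0))
# Holder switched out for 1000 cycles before reading: the wait tracks the swap time.
print(stall_cycles(latency=4, switched_out_for=1000))
```

The point of the sketch is only that the wait stops being bounded by the hardware latency once the holder can be switched out between issue and read.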

jmg · 2014-03-02 23:37

cgracey wrote: »

I fully understand now. With simple multi-tasking, this would work great, but for threading, where a task gets switched out before reading the results and freeing the resource, we hang any other threads needing that resource, right?

Yes, but that is a different issue, with a couple of solutions :

One is to allow SWAP INIT (TLOCK) to 'see' the Wait-until-free flag, and conditionally wait on that.
The benefit of this, is not only do you avoid stalls, but you also know your result is ok.
Even without the stall effect, you probably do not want to swap before read, because restore is not going to read what you hope.

The other is for SW to check if the SWAP should give a few more cycles to the task, before Swap.
That is slower, and more cumbersome.

The more I think about this, the more important that Wait-until-free is, for handling TLOCK timing.
ie you do not want TLOCK to just wait-till-done, then fire, as you have not yet read the results.

You do need to be both done and read.

Otherwise restore/restart fails.

Sapieha · 2014-03-02 23:46

Hi Chip.

Why not simply write in the manual that these resources can be used in only one thread / task?

cgracey wrote: »

I fully understand now. With simple multi-tasking, this would work great, but for threading, where a task gets switched out before reading the results and freeing the resource, we hang any other threads needing that resource, right?

potatohead · 2014-03-03 00:12

I'm kind of partial to TLOCK/TFREE, because they are useful for all kinds of other things you might need to do that I wouldn't want to make special safeguards for, like using INDA/INDB.

Additionally, if somebody wants to play tricks, they can.

Seconded. Useful in many contexts, and if we fix this one, what about the other ones? Shouldn't it just be consistent?

And if it should, we either go all the way and nail this at any logic / time cost, or we punt and use TLOCK/TFREE dead simple and robust.

When we do this again on P3, we build in from the beginning to avoid kludges like we are facing now.

cgracey · 2014-03-03 00:23

jmg wrote: »

Yes, but that is a different issue, with a couple of solutions :

One is to allow SWAP INIT (TLOCK) to 'see' the Wait-until-free flag, and conditionally wait on that.
The benefit of this, is not only do you avoid stalls, but you also know your result is ok.
Even without the stall effect, you probably do not want to swap before read, because restore is not going to read what you hope.

The other is for SW to check if the SWAP should give a few more cycles to the task, before Swap.
That is slower, and more cumbersome.

The more I think about this, the more important that Wait-until-free is, for handling TLOCK timing.
ie you do not want TLOCK to just wait-till-done, then fire, as you have not yet read the results.

You do need to be both done and read.

Otherwise restore/restart fails.

I see. I still think TLOCK/TFREE is better, all things considered.

It would be kind of sad to lose the ability to overlap MUL, DIV, SQRT, and/or CORDIC operations for high efficiency in some cases.

jmg · 2014-03-03 00:46

cgracey wrote: »

I see. I still think TLOCK/TFREE is better, all things considered.

It would be kind of sad to lose the ability to overlap MUL, DIV, SQRT, and/or CORDIC operations for high efficiency in some cases.

I think it is not an OR decision, I think you need both Resource-Free flags, and TLOCK, for SWAP to work properly.

Without the proper flags for Resource Done, you cannot restore, and get the right results.

Even if TLOCK waits for result-ready, (the only flag in there now) it cannot safely swap until after the results have been read, otherwise on restore, you proceed immediately to read result values from some other calculation.
Outcome = corrupted result.

Whilst the Swap handler could manage all the continue-to-complete-read in SW, that is a lot of slow code, and you mandate everyone writing a Swap handler properly doing this in SW. (otherwise they will get rarely corrupted results on Task Swap Restore)

Boolean Atomic just seems more robust, and more likely to just work ?
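The corrupted-restore case can be illustrated with a toy model. Everything here is hypothetical (the `Multiplier` class and the function name are invented for the example); the only assumption taken from the discussion is that the unit holds a single result that is not part of the state saved and restored on a swap.

```python
class Multiplier:
    """Toy shared multiplier holding one result register (assumption: not saved on swap)."""

    def __init__(self):
        self.result = None

    def start(self, a, b):
        self.result = a * b

    def read(self):
        return self.result


def swapped_after_ready_before_read():
    mul = Multiplier()
    mul.start(6, 7)        # thread 1 starts 6 * 7; result becomes ready
    # Thread 1 is swapped out here, before it has read the result.
    mul.start(3, 5)        # thread 2 runs and uses the same multiplier
    mul.read()             # thread 2 collects its own result
    # Thread 1 is restored and proceeds straight to its read.
    return mul.read()      # it gets thread 2's product, not its own


print(swapped_after_ready_before_read())   # 15 rather than the expected 42
```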

potatohead · 2014-03-03 00:49

Wait a minute. If we've asked for TLOCK, then it simply can't be swapped, until TFREE. Done, next.

TLOCK isn't going to wait for any result. It's going to prevent anything else from happening, until TFREE. Otherwise, what is the point?

jmg · 2014-03-03 01:52

potatohead wrote: »

Wait a minute. If we've asked for TLOCK, then it simply can't be swapped, until TFREE. Done, next.

TLOCK isn't going to wait for any result. It's going to prevent anything else from happening, until TFREE. Otherwise, what is the point?

The devil is in the details, of the housekeeping.

TLOCK is used to prepare for a full-swap, along with the bits Chip is storing (PC, Flags et al).
- but at some stage you need to swap back the session you removed.

This is where the details matter, and if you did TLOCK during a Shared process that was not allowed to complete, now you have big problems in Restore.
If you TLOCK just after Result-Ready, now on restore, you will proceed to read what you think are results, and oops, wrong answer!

That's why I believe you need both proper Busy/Done flags and TLOCK, for the results to be predictable.

Proper busy/done flags mean you can also have smart sharing across threads, with no unexpected impact on tested/proven threads that do not use the same shared resource. Libraries are simpler.
Many benefits.

Logic to do this should not be complex, and the state-actions are already there.

Heater. · 2014-03-03 03:13

jmg,

TLOCK is used to prepare for a full-swap, along with the bits Chip is storing...

Now I'm really confused. I thought that the idea of TLOCK/TFREE is that no swapping can happen between the two. Not hardware scheduled or interrupt driven preemptive.
This is so that long-winded, multi-part operations like MUL32 can complete atomically. Or so I thought it was explained to me above.

ozpropdev · 2014-03-03 03:35

Heater this might help.
See here

Heater. · 2014-03-03 04:08

ozpropdev,

Thanks for that. On that post I see this explanation (simplified):

TLOCK 'execute only this task, beginning at the next instruction
...
TFREE 'resume multitasking after two more instructions
...

So I understood correctly. No task switching can happen between TLOCK and TFREE.

Which sounds great if I want to use that pesky hardware that takes a few instructions worth of time to deliver a result or can deliver multiple results. Clear and simple. Fine.
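A minimal scheduler sketch of that reading of TLOCK/TFREE, with several simplifications that are assumptions rather than documented behaviour: the function `schedule` is hypothetical, tasks normally alternate one instruction per slot, the lock takes effect immediately, and the "two more instructions" delay on TFREE is ignored. `READ` is a placeholder for whatever instruction collects the result.

```python
def schedule(programs):
    """Return the order in which (task, instruction) pairs execute."""
    names = list(programs)
    pcs = {name: 0 for name in names}
    locked = None          # task currently holding TLOCK, if any
    trace = []
    slot = 0
    while any(pcs[n] < len(programs[n]) for n in names):
        # A TLOCK holder gets every slot; otherwise slots rotate round-robin.
        name = locked if locked else names[slot % len(names)]
        slot += 1
        if pcs[name] >= len(programs[name]):
            if locked == name:
                locked = None   # model guard only: avoid spinning on a finished task
            continue
        op = programs[name][pcs[name]]
        pcs[name] += 1
        trace.append((name, op))
        if op == "TLOCK":
            locked = name
        elif op == "TFREE":
            locked = None
    return trace


trace = schedule({
    "A": ["NOP", "TLOCK", "MUL32", "READ", "TFREE"],
    "B": ["NOP", "NOP", "NOP"],
})
for entry in trace:
    print(entry)
```

In the printed trace, task B gets no slots between A's TLOCK and TFREE, which is the coarse-grained cost discussed above.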

So what does jmg mean when he says:

TLOCK is used to prepare for a full-swap...

Surely once you have TLOCKed there is no swapping going on. Unless you are doing a cooperative scheduler in which case none of this is an issue.

Further:

This is where the details matter, and if you did TLOCK during a Shared process that was not allowed to complete, now you have big problems in Restore.

How can that happen?

If the process has TLOCKed how can it "not be allowed to complete". It has full control of the processor at that point.

Presumably a TLOCKED thread could jam up the whole cog by sitting in a busy loop forever, well don't do that.

potatohead · 2014-03-03 07:02

IMHO, it appears jmg is still advocating TLOCK / TFREE work differently than they currently do.

Looking for the swap explanation now.

http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1246748&viewfull=1#post1246748

mindrobots · 2014-03-03 07:27

I'm taking a new attitude:

I'll just march forward with whatever we get because whatever we get will be a lot better than what we got!!

When it comes time to test, I test.

ctwardell · 2014-03-03 07:35

I think I see what jmg is getting at in this thread:

http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1247676&viewfull=1#post1247676

It is a case of mixing preemptive threads AND tasks trying to share the shared resources.

This is something that I think should NOT be handled in the hardware.

Within a given COG, if you are using those shared resources within preemptive code you should NOT also try to use them within a regular task.

Going with the suggested preemptive scheme of having just the scheduler task and threading task this should not be an issue as I see very little reason to use the shared resources in the scheduler.

Anyway... if someone did mix using preemptive use of the shared resources AND use of them in a regular task, this could happen:

Let's say task 'A', a regular task, has a lock on the CORDIC.
A preemptive thread 'B' does a TLOCK and then tries to use the CORDIC.
The COG is now locked. 'B' has all the cycles and is spinning waiting for the CORDIC to become available; 'A' can never finish its CORDIC operation and release the lock because it isn't getting any cycles.

I think the answer is that you DO NOT mix preemptive use of the shared resources with normal task use of those same resource unless you have setup some further arbitration scheme IN SOFTWARE.

C.W.
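The lock-up scenario above can be reproduced in a few lines. This is a sketch under assumptions, not a model of the real chip: the function `b_completes`, the two-task alternating schedule, and the figure of three remaining slots for 'A' are all invented for illustration. 'A' holds the proposed auto-lock on the CORDIC; 'B' either takes TLOCK first or just waits its turn.

```python
def b_completes(cycles, b_uses_tlock):
    """Return True if task B ever gets the CORDIC within the given number of slots."""
    cordic_owner = "A"       # A already holds the auto-lock on the CORDIC
    a_slots_needed = 3       # slots A still needs before it reads and frees the unit
    tlock_holder = None
    b_done = False
    for cycle in range(cycles):
        # A TLOCK holder gets every slot; otherwise A and B alternate.
        task = tlock_holder or ("A", "B")[cycle % 2]
        if task == "A" and cordic_owner == "A":
            a_slots_needed -= 1
            if a_slots_needed == 0:
                cordic_owner = None          # A reads its result, CORDIC is free
        elif task == "B" and not b_done:
            if b_uses_tlock and tlock_holder is None:
                tlock_holder = "B"           # B now takes every slot
            elif cordic_owner is None:
                b_done = True                # B gets the CORDIC
                tlock_holder = None
            # otherwise B loops to self, waiting for the CORDIC
    return b_done


print(b_completes(100, b_uses_tlock=False))  # True: A finishes, then B proceeds
print(b_completes(100, b_uses_tlock=True))   # False: A is starved, B spins forever
```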

Bill Henning · 2014-03-03 07:48

jmg:

Sorry did not notice #5686. All the arguments in the thread made my eyes glaze over, and I just skimmed it.

We agree

Heater:

MUL/DIV can overlap other code some 8-16 cycles as I recall, I don't want to lose that potential performance.

Post #5700 allows us to have our cake and eat it too (like #5686)

I do not consider (b) hard to use, we've been doing the same thing with hub access for years.

Chip:

I think TLOCK/TFREE is a great solution, and allows the flexibility of waiting - or not.

Bill Henning · 2014-03-03 07:50

ctwardell wrote: »

Let's say task 'A', a regular task, has a lock on the CORDIC.
A preemptive thread 'B' does a TLOCK and then tries to use the CORDIC.
The COG is now locked. 'B' has all the cycles and is spinning waiting for the CORDIC to become available; 'A' can never finish its CORDIC operation and release the lock because it isn't getting any cycles.

Can't happen.

A does TLOCK. All other tasks pause until A TFREE's

once A TFREE's, B TLOCK's... all other tasks pause. B TFREE's

Basically, all tasks (and threads are run within a task, so for this they behave the same) freeze when any task does a TLOCK/TFREE critical section.

Simple. Easy

ctwardell · 2014-03-03 07:57

Bill Henning wrote: »

Can't happen.

A does TLOCK. All other tasks pause until A TFREE's

once A TFREE's, B TLOCK's... all other tasks pause. B TFREE's

Basically, all tasks (and threads are run within a task, so for this they behave the same) freeze when any task does a TLOCK/TFREE critical section.

Simple. Easy

No Bill, in what jmg is proposing it can happen.

What jmg is proposing is that you DO NOT use TLOCK/TFREE for non-preemptive tasks; that is what the whole 'auto-lock' is about.

The purpose of the 'auto-lock' is to get around the 'piggish' behavior of TLOCK/TFREE for non-preemptive code.
The problem can arise when you use jmg's proposal mixed with preemptive threads.

Always using TLOCK/TFREE is a sure fix, it just comes at the cost of holding up the other tasks whenever a shared resource is used.

C.W.

potatohead · 2014-03-03 08:09

That's how I saw it too. The default case will be just bracket shared resources and take the hit. Where that's unacceptable, one can choose to structure the software, or better distribute the load across COGS.

Bill Henning · 2014-03-03 08:13

Thanks, I misunderstood jmg's post.

I think that TLOCK/TFREE is a perfectly good solution for now; for P3 we can look for a better solution.

The most TLOCK/TFREE can pause other tasks for (when used properly) is the longest delayed result - some 20 cycles? Big deal - NOT (in 99.99% of cases)

For the few cases where that is not acceptable, people should use a whole cog, or avoid those instructions.

ctwardell wrote: »

No Bill, in what jmg is proposing it can happen.

What jmg is proposing is that you DO NOT use TLOCK/TFREE for non-preemptive tasks; that is what the whole 'auto-lock' is about.

The purpose of the 'auto-lock' is to get around the 'piggish' behavior of TLOCK/TFREE for non-preemptive code.
The problem can arise when you use jmg's proposal mixed with preemptive threads.

Always using TLOCK/TFREE is a sure fix, it just comes at the cost of holding up the other tasks whenever a shared resource is used.

C.W.

mindrobots · 2014-03-03 08:22

mindrobots wrote: »

So you have all these threads running in a single cog that need to use CORDIC, SQRT, BIGMUL, BIGDIV and they can't just lock the resource until done? Isn't this going to be the same at some point with any resource? You run out. Like counters in the P1, if you use the two you have in a COG, you go use another COG. With the P2, if you can't survive with a stall across a lock, go use a resource in another task in another COG. Are there really use cases that will need 9 non-stalling CORDIC or whatever threads? Just because you can try and run everything in one cog doesn't mean you should.

Are folks thinking real world with all this added feature complexity or just theoretical potential?

Apparently, nobody reads what I write....I just do it for my health.
