Propeller II update - BLOG - Page 192 — Parallax Forums



Comments

  • Heater.Heater. Posts: 21,230
    edited 2014-03-03 08:23
    I'm glad it's not just me who no longer understands what is going on with PII threads. Seems no one else does either.

    Given the definition of TLOCK/TFREE as posted by Chip, I do not see how a preemptive task B can ever take over from hardware-scheduled task A if A has stopped all threading with a TLOCK.

    No idea where these "auto-locks" came from. I must have missed a suggestion a few dozen pages back.
  • ctwardellctwardell Posts: 1,716
    edited 2014-03-03 08:26
    Heater,

    'A' didn't use TLOCK.

    This all started at 5583 when David Betz asked a very valid question.

    The genesis of the 'Auto-Lock' seems to be 5613 by ozpropdev.

    Further pushed for on post 5641 by jmg.

    I agree that something like the auto-lock would be nice if well thought out and implemented, the question is if it makes sense to do it at this point in the game.

    C.W.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-03 08:36
    Bill
    I think that TLOCK/FREE is a perfectly good solution for now, for P3 we can look for better solution.

    The most TLOCK/FREE can pause other tasks for (when used properly) is the longest delayed result - some 20 cycles? Big deal - NOT (in 99.99% of cases)

    For the few cases where that is not acceptable, people should use a whole cog, or avoid those instructions.
    I agree.

    We have whole cogs out there if your threads are so critical and need the speed and/or minimal latency.

    Or skip the hardware multipliers and stuff and do it in "long hand" as on the P1 thus allowing your other threads to run free.

    Anyone know how much slower a 32 bit multiply is when done manually rather than using MUL32?
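
    For reference, a software multiply looks roughly like this, a hedged sketch in the classic P1 shift-and-add style (register names are placeholders). At 4 clocks per P1 instruction and 32 loop passes, it costs on the order of 500 clocks, against the ~20-cycle latency mentioned above for the hardware unit:
        ' Unsigned 32x32 multiply, low 32 bits of the product.
        ' Destroys multiplier and multiplicand.
        mult    mov     product, #0
                mov     count, #32
        :loop   shr     multiplier, #1  wc      ' test next bit of multiplier
         if_c   add     product, multiplicand   ' add partial product if bit set
                shl     multiplicand, #1        ' advance partial product
                djnz    count, #:loop           ' 32 passes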

    That leads to the conclusion that for ease of use, maximum flexibility etc one would not use the math hardware at all.

    I can imagine a large multi-threaded (preemptively or otherwise) program running from HUB, as built by GCC for example. Performance will already be sucky enough that the hardware math offers almost no advantage. Might as well not bother with it.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-03 08:45
    ctwardell,
    'A' didn't use TLOCK. This all started at 5583 when David Betz asked a very valid question. The genesis of the 'Auto-Lock' seems to be 5613 by ozpropdev. Further pushed for on post 5641 by jmg.

    I agree that something like the auto-lock would be nice if well thought out and implemented, the question is if it makes sense to do it at this point in the game.

    So, auto-lock is what I was imagining when I said use of "long-winded" instructions should be atomic. A thread would wait for a hardware resource to become free, then wait again for its result to be ready. All that waiting being hardware managed.

    It's a natural thought.

    In post #5704 Chip explains why the "auto-lock" making those instructions atomic is probably not going to happen:
    The problem with making these long-winded operations atomic is that there are sometimes multiple results, like X and Y from QROTATE. There are also multiple setups (SETQI/SETQZ before QROTATE). We would have to buffer those setup values and buffer the result values, in order to get atomic operation. This would jack up the amount of data that must be saved and restored for a thread switch. I'm kind of partial to TLOCK/TFREE, because they are useful for all kinds of other things you might need to do that I wouldn't want to make special safeguards for, like using INDA/INDB.
    So TLOCK/TFREE it is then.
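
    In bracket form that would look something like this (only TLOCK/TFREE, SETQI/SETQZ and QROTATE are named in this thread; the operand forms and the result-fetch mnemonics are my guesses, so treat this as a sketch, not gospel):
        ' Make a multi-step CORDIC sequence atomic by stopping
        ' all other tasks in this cog around it.
                tlock                   ' suspend other tasks in this cog
                setqi   angle           ' CORDIC setup (per Chip's post)
                setqz   radius
                qrotate x, y            ' start the long-winded operation
                getx    resultx         ' fetch results - mnemonics assumed
                gety    resulty
                tfree                   ' resume normal task scheduling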

    Man it's hard to keep up around here:)
  • potatoheadpotatohead Posts: 10,260
    edited 2014-03-03 08:47
    Precisely.

    We either have to resolve it all, or have it be even more complicated due to inconsistencies, or punt and use TLOCK / TFREE.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-03 08:51
    TLOCK/TFREE

    Let us move on to the other bits that need resolving.

    We can revisit better solutions once we start on the P3 :)
  • Heater.Heater. Posts: 21,230
    edited 2014-03-03 09:24
    Bill,
    Let us move on to the other bits that need resolving.
    Please.
    We can revisit better solutions once we start on the P3
    I hate to say this but did anyone notice that we have been working on the P3 for over three years now!

    This thread was started 5738 posts ago by Beau Schwabe on 09-10-2010 and the P2 was due out in November 2010. I quote:
    Currently we are scheduled for Early November 2010 tape out for a test chip!! This is a significant milestone and will help us determine from empirical testing if we need to make any changes before the final Chip.
    Now the P4. That's going to be a killer!
  • ctwardellctwardell Posts: 1,716
    edited 2014-03-03 10:29
    Is there any documentation for CLRB and SETB, other than that shown below?
    ZCMS  0001010 ZC I CCCC DDDDDDDDD SSSSSSSSS     CLRB    D,S/#
    ZCMS  0001011 ZC I CCCC DDDDDDDDD SSSSSSSSS     SETB    D,S/#
    

    Specifically the return values of Z and C.

    Thanks,

    C.W.
  • cgraceycgracey Posts: 14,133
    edited 2014-03-03 10:51
    ctwardell wrote: »
    Is there any documentation for CLRB and SETB, other than that shown below?
    ZCMS  0001010 ZC I CCCC DDDDDDDDD SSSSSSSSS     CLRB    D,S/#
    ZCMS  0001011 ZC I CCCC DDDDDDDDD SSSSSSSSS     SETB    D,S/#
    

    Specifically the return values of Z and C.

    Thanks,

    C.W.


    All those xxxB instructions return the original bit's value into C, while Z is the overall long's zero equivalence.
  • ctwardellctwardell Posts: 1,716
    edited 2014-03-03 10:54
    cgracey wrote: »
    All those xxxB instructions return the original bit's value into C, while Z is the overall long's zero equivalence.

    Most excellent! That is what I was hoping for.

    We could use those for the non-hub locks I asked about previously.

    You would define a register and specific bit for a given 'lock'.

    Use SETB lockreg, lockbit with WC to get a 'lock'.
    If C is clear, meaning the lock bit was previously cleared, you have the lock, otherwise you do not.

    CLRB lockreg, lockbit to clear the lock.
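
    As a minimal sketch (register and bit choice arbitrary, and only valid within one cog):
        ' Intra-cog lock using SETB/CLRB. SETB with WC returns the
        ' bit's previous value in C, so C=0 means the lock was free
        ' and this task now holds it.
        :try    setb    lockreg, #0  wc ' test-and-set bit 0 of lockreg
         if_c   jmp     #:try           ' C=1: already held, spin
                ' ... critical section: use the shared resource ...
                clrb    lockreg, #0     ' release the lock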

    Thanks,

    C.W.
  • cgraceycgracey Posts: 14,133
    edited 2014-03-03 11:03
    ctwardell wrote: »
    Most excellent! That is what I was hoping for.

    We could use those for the non-hub locks I asked about previously.

    You would define a register and specific bit for a given 'lock'.

    Use SETB lockreg, lockbit with WC to get a 'lock'.
    If C is clear, meaning the lock bit was previously cleared, you have the lock, otherwise you do not.

    CLRB lockreg, lockbit to clear the lock.

    Thanks,

    C.W.


    Lo and behold! We had those locks you wanted, all along, and neither of us realized it.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-03 11:10
    I am not sure for two reasons:

    - does not work for inter-cog hub resource locking

    - the instruction takes four pipeline stages to complete

    what if all four tasks are trying to acquire the same lock bit at the same time? (extremely unlikely, I know)

    TLOCK/TFREE to the rescue!

    It would however work just fine as locks between threads running in the same task!

    Btw, if it does not take significant resources, I'd love to see 32 hub based locks instead of 8.
    cgracey wrote: »
    Lo and behold! We had those locks you wanted, all along, and neither of us realized it.
  • potatoheadpotatohead Posts: 10,260
    edited 2014-03-03 11:10
    With so many instructions... this is bound to happen. :) Bookmarked for later.
  • ctwardellctwardell Posts: 1,716
    edited 2014-03-03 11:12
    cgracey wrote: »
    Lo and behold! We had those locks you wanted, all along, and neither of us realized it.

    Yes, and if someone so desired they could make use of them as locks around the shared resources when not used along with preemptive threads.

    C.W.
  • jmgjmg Posts: 15,155
    edited 2014-03-03 11:20
    ctwardell wrote: »
    No Bill, in what jmg is proposing it can happen.

    What jmg is proposing is that you DO NOT use TLOCK/TFREE for non-preemptive tasks, that is what the whole 'auto-lock' is about.

    Correct, you do not NEED to use TLOCK. Of course, if you do, it does not matter.
    ctwardell wrote: »
    The purpose of the 'auto-lock' is to get around the 'piggish' behavior of TLOCK/TFREE for non-preemptive code.

    Yes, but this important housekeeping does a little more as well...
    The problem can arise when you use jmg's proposal mixed with preemptive threads.

    Always using TLOCK/TFREE is a sure fix, it just comes at the cost of holding up the other tasks whenever a shared resource is used.

    Nope, always using TLOCK is not enough.

    Taking this example:
    Let's say task 'A', a regular task, has a lock on the CORDIC.
    A preemptive thread 'B' does a TLOCK and then tries to use the CORDIC.
    The COG is now locked. 'B' has all the cycles and is spinning waiting for the CORDIC to become available; 'A' can never finish its CORDIC operation and release the lock because it isn't getting any cycles.

    That's missing the critical issue - so let's rewrite that.

    There are TWO dangers
    a) Direct result corruption, as Chip says, ACCIDENTAL shared use can corrupt data. Truly nasty.

    b) Restore Result corruption. This can still occur, even if two tasks avoid shared use.

    Let's say task 'A', a regular task, has a lock on the CORDIC, but task A has got to Done.
    A preemptive thread 'B' does a TLOCK and can use the CORDIC, or the replacing task can use the CORDIC.
    The COG is now locked. 'A' is swapped out.
    Some time later 'A' is swapped back.

    It now proceeds to read the cordic results. << this is the critical bit

    Problem is, because Task 'A' was removed before it read the results, now when it tries, what it reads is wrong.
  • cgraceycgracey Posts: 14,133
    edited 2014-03-03 11:21
    I am not sure for two reasons:

    - does not work for inter-cog hub resource locking

    - the instruction takes four pipeline stages to complete

    what if all four tasks are trying to acquire the same lock bit at the same time? (extremely unlikely, I know)

    TLOCK/TFREE to the rescue!

    It would however work just fine as locks between threads running in the same task!

    Btw, if it does not take significant resources, I'd love to see 32 hub based locks instead of 8.


    It's true that this wouldn't work for inter-cog resource locking, but it would work fine within a cog, even if all tasks were vying for the same lock (a bit within some register). This all gets resolved in stage 4 of the pipeline, and data-forwarding circuitry makes sure stage 3 gets the correct copy if it needs it. So, this WOULD work within a cog 100% of the time, no matter the tasking situation.
  • ctwardellctwardell Posts: 1,716
    edited 2014-03-03 11:27
    I am not sure for two reasons:

    - does not work for inter-cog hub resource locking

    - the instruction takes four pipeline stages to complete

    what if all four tasks are trying to acquire the same lock bit at the same time? (extremely unlikely, I know)

    TLOCK/TFREE to the rescue!

    It would however work just fine as locks between threads running in the same task!

    Btw, if it does not take significant resources, I'd love to see 32 hub based locks instead of 8.

    Bill, you are correct that these are not for inter-cog use, they are the cog locks I mentioned in post 5682, only useful within a cog.

    I also agree on bumping up the hub based locks.

    C.W.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-03 11:33
    jmg wrote: »
    Nope, Always using TLOCK, is not enough.
    ...

    Let's say task 'A', a regular task, has a lock on the CORDIC, but task A has got to Done.
    A preemptive thread 'B' does a TLOCK and can use the CORDIC, or the replacing task can use the CORDIC.
    The COG is now locked. 'A' is swapped out.
    Some time later 'A' is swapped back.

    It now proceeds to read the cordic results. << this is the critical bit

    Problem is, because Task 'A' was removed before it read the results, now when it tries, what it reads is wrong.

    jmg,

    I am scratching my head.

    Using your example, if a regular task A has TLOCK/TFREE around the CORDIC usage, including retrieving the result, no task can use CORDIC at all until A completes.

    Task B, also using TLOCK/TFREE around the whole CORDIC usage, cannot start using it until A completes, and due to TLOCK/TFREE, cannot be pre-empted/swapped out until it has finished with its use of CORDIC.

    This assumes all tasks contending for the resource TLOCK/TFREE bracket said usage.

    Am I missing something?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-03 11:35
    That is GREAT!

    They will be perfect for inter-task (including inter-thread) locks then!

    Makes implementing a Unix style select() mechanism easier.
    cgracey wrote: »
    It's true that this wouldn't work for inter-cog resource locking, but it would work fine within a cog, even if all tasks were vying for the same lock (a bit within some register). This all gets resolved in stage 4 of the pipeline, and data-forwarding circuitry makes sure stage 3 gets the correct copy if it needs it. So, this WOULD work within a cog 100% of the time, no matter the tasking situation.
  • jmgjmg Posts: 15,155
    edited 2014-03-03 12:58
    jmg,

    I am scratching my head.

    Using your example, if a regular task A has TLOCK/TFREE around the CORDIC usage, including retrieving the result, no task can use CORDIC at all until A completes.

    Task B, also using TLOCK/TFREE around the whole CORDIC usage, cannot start using it until A completes, and due to TLOCK/TFREE, cannot be pre-empted/swapped out until it has finished with its use of CORDIC.

    This assumes all tasks contending for the resource TLOCK/TFREE bracket said usage.

    Am I missing something?

    No, you are not missing anything, but I did not say that all cases were using TLOCKs.
    The point is, in my boolean-queue case, there is no need for the clumsy SW wrappers. It all just works.

    Sure, if you make a paper rule that every single resource usage MUST be WRAPPED in TLOCK, and those MUST be outside all Write/Op/Read, that's gone a long way from 'it just works', and you now have

    * assumed a lot from programmers, and a paper rule is so easily broken, in a way they will never notice until much later...
    * imposed a lot of overhead on code size,
    * slowed everything down,
    * clobbered threads that were never using the shared resource at all,
    * imposed management of multiple libraries, if you want to cover all use cases.

    To me that is a shipload of negative baggage, that seriously limits the very nifty feature of hard time slices.

    Programmers will start to use TLOCK outside of loops, because they can see so many peppering their code, and the hit on other threads grows.

    This pretty much now imposes a new rule, 'Critical threads must reside in COGS with no MathOps in any threads', but with libraries you are never really sure what opcodes are used, and the straitjacket tightens some more....

    For the lack of correct (and simple?) housekeeping flags, you now have larger, slower, less reliable, more fragmented code.

    Subtle data corruption is one of the worst ways you can design hardware to fail.
  • whickerwhicker Posts: 749
    edited 2014-03-03 13:43
    jmg,

    just look at what is going on:

    There is a big multiplier, a big divider, a big square rooter, and a cordic engine.
    In many ways this will speed up code immensely, even if it is just set and wait.
    32 x 32 multiply would take a long time in a software loop.

    None of these take particularly long to execute, but as has been discussed, the setting of the input data, the execution itself, and the reading out of the result(s) has to be atomic.

    Thread switching does not need to be at the nanosecond level, as has been discussed before.

    You're not going to be using interruptable threads in something that is generating a bitstream with nanosecond precision.


    But the really big hammer I could think of to prevent loop abuse of TLOCK / TFREE is to automatically TFREE on any kind of jump, call, or return instruction (anything that potentially affects the program counter).
    That probably would generate screams of agony, however.


    Not to say locking abuse does not happen: on the Windows side of things, there are all sorts of system driver locking issues that poorly written drivers are guilty of. Search for "DPC Latency" for more information. But I don't understand why in the P2 we would ever be locking continuously for even an entire millisecond.
  • potatoheadpotatohead Posts: 10,260
    edited 2014-03-03 13:54
    Macros would help with the automation, as would templates. I plan on the latter as a first step. I find template use handy in PASM anyway. Want CORDIC? Invoke either, see a nice code block appear. Modify, position, move on.

    And we have the option of better distributing tough cases, like say a math COG, able to get the ops done without significant threading pain.
  • David BetzDavid Betz Posts: 14,514
    edited 2014-03-03 14:02
    whicker wrote: »
    jmg,

    just look at what is going on:

    There is a big multiplier, a big divider, a big square rooter, and a cordic engine.
    In many ways this will speed up code immensely, even if it is just set and wait.
    32 x 32 multiply would take a long time in a software loop.

    None of these take particularly long to execute, but as has been discussed, the setting of the input data, the execution itself, and the reading out of the result(s) has to be atomic.

    Thread switching does not need to be at the nanosecond level, as has been discussed before.

    You're not going to be using interruptable threads in something that is generating a bitstream with nanosecond precision.


    But the really big hammer I could think of to prevent loop abuse of TLOCK / TFREE is to automatically TFREE on any kind of jump, call, or return instruction (anything that potentially affects the program counter).
    That probably would generate screams of agony, however.


    Not to say locking abuse does not happen: on the Windows side of things, there are all sorts of system driver locking issues that poorly written drivers are guilty of. Search for "DPC Latency" for more information. But I don't understand why in the P2 we would ever be locking continuously for even an entire millisecond.
    One disappointing thing about TLOCK bugs is that you probably won't be able to find them using a debugger based on the thread support because the scheduler and hence the debugger will be locked out when TLOCK is executed and won't be able to regain control until after TFREE. If TFREE doesn't happen then the debugger hangs. I guess there's not much that can be done about that though.
  • jmgjmg Posts: 15,155
    edited 2014-03-03 14:05
    whicker wrote: »
    None of these take particularly long to execute, but as has been discussed, the setting of the input data, the execution itself, and the reading out of the result(s) has to be atomic.

    Correct.

    The vital question is : Should that Atomic handling be a programmer/system level problem, or be fixed by silicon housekeeping ?
    whicker wrote: »
    Thread switching does not need to be at the nanosecond level, as has been discussed before.

    You're not going to be using interruptable threads in something that is generating a bitstream with nanosecond precision.

    Correct, but fixing the data corruption issue is not limited to big Task Swaps, you must also apply this big hammer to any thread using maths, where another thread might do the same.
    Sprinkle TLOCKs everywhere.

    Using libraries in at least one thread will be common, but that imposes significant (and probably unexpected) missing-time-bites on standalone threads that are not using any MathOps at all.

    Those missing-time-bites now change with code edits elsewhere, and depend on someone's coding style.

    Of course, if you are absolutely sure only one thread will ever, over the life of the product, use Mathops, and you never will Full Swap that thread, then you can link in a riskier, but smaller and more granular library.
    You may need to change that decision COG by COG.
  • cgraceycgracey Posts: 14,133
    edited 2014-03-03 14:13
    Jmg, we may be able to do what you are talking about, but I need to get a few other things implemented before I'm free enough to address this sharing issue.
  • jmgjmg Posts: 15,155
    edited 2014-03-03 14:17
    cgracey wrote: »
    I understand what jmg is pushing, and why, but I think keeping things simplest with TLOCK/TFREE is best for now, as there are other things, like INDA/INDB, that will need some locking mechanism, as well.

    Jmg, we may be able to do what you are talking about, but I need to get a few other things implemented before I'm free enough to address this sharing issue.

    That's cool; exactly when it is done is not so important as 'just having things work' in the final chip.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-03 16:10
    Just a reminder...

    I had thought that the SETB etc could use the WC & WZ bits to do "pin-paired" instructions for driving complementary pins etc.
    At the time I had not realised what WC & WZ actually did.
    It's on the list for Chip to think about when he gets to USB/SERDES.
  • cgraceycgracey Posts: 14,133
    edited 2014-03-03 16:20
    I just added instructions to load INDA and INDB from a register:

    LODINDA D
    LODINDB D

    Because these use register contents, they can't execute until stage 4, so on the 3rd instruction after LODINDA/B, INDA/B is usable.

    My question is:

    Should I make these instructions so that a variable or constant base can be added:

    LODINDA D,S/#
    LODINDB D,S/#

    At first look, it seems like a good idea, but is it really worth taking two 'D,S/#' instruction slots for?

    These LODINDA/B instructions are vital for hub exec code, since it can't self-modify. This is the only way for hub exec code to variably set INDA/B. Do you think a base+index instruction is really valuable, over a simple base-only instruction?
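
    A usage sketch of the base-only form, allowing for the stated latency (the NOPs are placeholders; real code would schedule useful instructions there, and the INDA addressing form is assumed):
        ' Load INDA from a register in hub exec code, where
        ' self-modifying code is not possible.
                lodinda ptr             ' INDA <- contents of ptr
                nop                     ' INDA not valid yet
                nop                     ' INDA not valid yet
                mov     x, inda         ' 3rd instruction after: INDA usable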
  • SapiehaSapieha Posts: 2,964
    edited 2014-03-03 16:26
    Hi Chip.

    I think Index is not needed.
  • roglohrogloh Posts: 5,281
    edited 2014-03-03 16:47
    cgracey wrote: »
    I just added instructions to load INDA and INDB from a register:

    LODINDA D
    LODINDB D

    Because these use register contents, they can't execute until stage 4, so on the 3rd instruction after LODINDA/B, INDA/B is usable.

    My question is:

    Should I make these instructions so that a variable or constant base can be added:

    LODINDA D,S/#
    LODINDB D,S/#

    At first look, it seems like a good idea, but is it really worth taking two 'D,S/#' instruction slots for?

    These LODINDA/B instructions are vital for hub exec code, since it can't self-modify. This is the only way for hub exec code to variably set INDA/B. Do you think a base+index instruction is really valuable, over a simple base-only instruction?

    The only application I can currently dream up where base+index might be useful is if INDA points to a block of registers allocated per task in COG RAM and you want it to choose the right block using the task ID for either the base or the offset. Do INDA and register remapping work together already, or does INDA always dereference to 0-511 as absolute addresses without the remapping? It may already have the capability without needing the base+index approach.