Propeller II update - BLOG

David Betz · 2014-03-02 19:23

Heater. wrote: »

If you are moving between different processor architectures and systems you naturally need different compilers for each. You will have the same issue of needing a dozen different Forth engines.

Unless you want to work in "compile once run anywhere" Java or .NET. Good luck with that.

What about JavaScript? :-)

ctwardell · 2014-03-02 19:27

Heater. wrote: »

ctwardell,

I'm not sure if I'm following this any more. The "issue" you describe above is exactly what locks are supposed to do in the commonly accepted meaning. Isn't it?

Could someone explain: The issue under discussion is sharing of hardware resources between preemptive threads. Is it the case that this is not an issue with the hardware-scheduled "tasks"? If not, why not?

That comment is in the context of locks on the shared multiplier/divider/CORDIC, etc. when used in preemptive multithreading.

It's about the TLOCK/TFREE vs. a more traditional lock.

The TLOCK/TFREE lets the operation finish before the thread can be swapped out, so the resource is locked for a very short time.

A 'normal' lock would not prevent the thread from being swapped out, so now the resource could end up locked for a relatively long time until the original thread became active again and released the lock.
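
A rough way to see the difference is to count how long the resource stays locked in each case. The sketch below is a toy model, not P2 behaviour: the function name, the round-robin scheduler, and every cycle count are assumptions chosen only for illustration.

```python
def lock_hold_time(op_cycles, cycles_left_in_slice, other_threads,
                   slice_cycles, non_preemptible):
    """Clock cycles the shared resource stays locked (toy model).

    All parameters are hypothetical; a simple round-robin scheduler with
    fixed time slices is assumed.
    """
    if slice_cycles <= 0:
        raise ValueError("slice_cycles must be positive")
    if non_preemptible:
        # TLOCK/TFREE style: the operation finishes before any swap.
        return op_cycles
    held = 0
    remaining = op_cycles
    budget = cycles_left_in_slice
    while True:
        run = min(remaining, budget)
        held += run
        remaining -= run
        if remaining == 0:
            return held
        # Swapped out while holding the lock: every other thread runs a slice.
        held += other_threads * slice_cycles
        budget = slice_cycles


tlock_style = lock_hold_time(20, 5, 3, 1000, non_preemptible=True)
normal_lock = lock_hold_time(20, 5, 3, 1000, non_preemptible=False)
print(tlock_style, normal_lock)
```

With these made-up numbers the resource is held for 20 cycles in the TLOCK/TFREE-style case and 3020 cycles when the owner is swapped out partway through.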

The comment was based on Chip's comment here:

http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1247376&viewfull=1#post1247376

C.W.

jmg · 2014-03-02 19:42

Heater. wrote: »

I'm not sure if I'm following this any more. The "issue" you describe above is exactly what locks are supposed to do in the commonly accepted meaning. Isn't it?

Could someone explain: The issue under discussion is sharing of hardware resources between preemptive threads. Is it the case that this is not an issue with the hardware-scheduled "tasks"? If not, why not?

I'm not quite sure what you are asking, but there are multiple balls in the air here.

One is a housekeeping detail overlooked in the Tasking, where the DOCs only talk about
singular resources (i.e. shared across tasks, not one per task) and possible conflicts.

Currently, the owner-side is managed like this (eg)
In multi-task mode, GETSQRT will jump to itself until the result is ready, freeing clocks for other tasks.

That's good, but what's missing is what happens should another task also want GETSQRT?
(and now we have added HubExec, that gets more likely. It may be rare, but it can happen)

The Auto-locks under discussion would add a flag so the other task will jump to itself until the resource is free.
This is really the other side of the owner-task coin. More of a clean-up pass than any new added feature.
It does not add anything fancy; it removes gotchas, and allows the one-of-a-kind resources to be shared cleanly.
Auto-locks also limit the pause effect to only the task wanting the shared resource; other tasks do not pause.

Full Thread LOCK is a little different, and that allows any task to claim (briefly) 100% of slots.
This form of lock is useful for Full Task Swap and Debug, and I think Chip has designed one level of save for all resources, for a Full Lock / Swap.

Heater. · 2014-03-02 20:41

@David,

What about JavaScript? :-)

Trust me, I thought about mentioning JS. Then I thought it was a step too far.

@ctwardell,

It's about the TLOCK/TFREE vs. a more traditional lock.

So I see. The big issue here is that some instructions require many cycles to complete. As such they should not be switched out before they are done.

TLOCK/TFREE seems to be the only way around that. Although it is not clear to me why we need two extra instructions to do that.

I would have thought that in hardware scheduled "tasking" mode that "lock" and "unlock" or "run till complete" for time consuming ops could be done automatically.

If not in tasking mode then the lock/free is just not done.

@jmg,

I'm not quite sure what you are asking,

Nor was I really. That's why I had to ask

In multi-task mode, GETSQRT will jump to itself until the result is ready, freeing clocks for other tasks.

Good. That's what we want isn't it. Effectively the SQRT despite taking many cycles becomes an atomic operation like any other instruction. That has to be so.

That's good, but what's missing is what happens should another task also want GETSQRT?

It has to loop and busy wait till the SQRT hardware is free again. Hence TLOCK I guess but I don't see why this is not automatic.

So, how is this different between hardware scheduled threads and this new preemptive interrupt driven threads? Or is it the same issue anyway?

As for general purpose locks, for normal data sharing between threads like the HUB locks, I have no idea. A normal HUB lock is presumably freed if the COG that currently has it is terminated.

How does a lock claimed by a thread get freed if the thread stops running for some reason?

Do we have to rely on the programmer just not doing bad things like that? It's a hard problem to solve otherwise.

potatohead · 2014-03-02 20:49

The Auto-locks under discussion would add a flag so the other task will jump to itself until the resource is free,

I would rather not have this behavior. It's enough to know whether or not the resource request was successful. Depending on the combination of resources and dependencies, it may well be part of some loop which can be doing something during the wait as opposed to merely waiting. If we want it to jump over and over, that's a software decision, and we code the jump, and it's one instruction. We may want to fall back to an alternative, or track time consumed waiting, or any number of things, none of which will happen if the default behavior is just to sit there waiting, jump, jump, jump, etc...

We had a similar discussion with the waitxxx instructions, and it was determined that we couldn't do effective polling in some pretty useful cases with this behavior being the only option.
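
The two behaviours being contrasted here can be sketched in software terms. This is only an analogy, assuming a Python `threading.Lock` as a stand-in for the shared math unit; the helper names are hypothetical and none of this is P2 code.

```python
import threading

# Stand-in for a shared math unit; a software lock, not the P2 hardware.
shared_unit = threading.Lock()


def use_or_fall_back(lock, work, fallback):
    """Try the resource once; if it is busy, do something else instead."""
    if lock.acquire(blocking=False):
        try:
            return work()
        finally:
            lock.release()
    return fallback()


def use_blocking(lock, work):
    """Auto-lock style: wait in place until the resource is free."""
    with lock:
        return work()
```

With the first form the caller decides what to do on a busy resource (retry, fall back, or track the time spent waiting); with the second form that decision is made for it.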

Heater. · 2014-03-02 21:14

potatohead,

I hear what you are saying. Wasting time in a "busy loop" while waiting on a pin or timer or whatever is, well, wasting processor power that could be used for other things.

This is true of a single thread on a single processor.

But surely if we have multiple threads, hardware scheduled or otherwise, it no longer matters if your code hangs on a WAITPE or whatever? Those other threads are still running doing work.

But really, should the programmer have to arrange to poll for the results of a multiply or divide? So that he can do something else in that short time? That is horribly complex and I can't imagine anyone getting much benefit out of it.

It's worse. With many threads in play now the programmer has to manually arrange to poll the multiplier to see if it is free for use at all!

No. Those ops should just halt you dead until the result is ready. They should be atomic like any other instruction.

They will cause hiccups in multiple threading mode if another thread has the device you want. So what?

I think everyone here should take five minutes out to read "The Story Of Mel". A great tale of a genius programming optimally for a machine that was hard to program optimally. In such a way that no one else ever understood what he did or could achieve similar. http://www.catb.org/jargon/html/story-of-mel.html

Do we want the PII to be only usable by the Mels of this world?

Bill Henning · 2014-03-02 21:17

Guys,

To me, it seems like mountains are being constructed out of mole hills.

Multi-threaded C code can use libraries which wrap the MUL/DIV/SQRT etc instructions in TLOCK/TFREE.

Drivers running in other tasks in the same cog can do the same if they use those instructions.

Of course, this all boils down to documentation, and those that don't read the docs and get into trouble pretty much deserve what they get.

If the documentation clearly states:

- these are the shared multi-cycle resources (MUL, DIV, etc)
- if more than one task in a cog uses them, all must TLOCK/TFREE those critical sections

Then the point is moot.

RTFM!

- this is NO different than an RTOS on an ARM having several threads overwrite the I/O registers they have in common. Caveat emptor.

In the case of C code:

- single threaded, owning cog - no issue
- multi-threaded, one scheduler task, one C threaded task - use TLOCK/TFREE in the C task, scheduler should not need those instructions, if it does, use TLOCK/TFREE

This is just a critical section issue - not the end of the world.

- if someone MUST pack more tasks into a C threaded cog, just use TLOCK/TFREE for the critical sections

Using TLOCK/TFREE is NOT rocket science.

One does not toss out babies because only one fits into the wash basin at one time!

The capabilities this one very minor issue gets us are immense.

For P3, we can look for a more "perfect" solution - this is good enough for P2!

cgracey · 2014-03-02 21:25

Wow! I just got caught up on a lot of postings.

To put timing into perspective, consider this:

You are running a task in hub exec mode and the next instruction needed is not cached. It must now be loaded from hub memory into a cache line so that it's ready to feed into the pipeline. Unfortunately, you just missed the hub cycle, so you have to wait 11 clocks for the instruction to get read, cached, and ready to feed into the pipeline. Now that the fetch is done, you can execute the instruction that's been waiting in the last stage of the pipeline. Darn it! It's a RDBYTE instruction! It takes 5 more clocks for the next hub cycle to come again, and then 3 more clocks to finish the RDBYTE. The pipeline was just stalled for 19 clocks!
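
The 19-clock figure is just the sum of the three waits in the scenario above. As a quick check, using only the clock counts quoted in that paragraph (the variable names are illustrative):

```python
# Clock counts quoted in the paragraph above; variable names are illustrative.
fetch_wait = 11    # missed the hub cycle: instruction read, cached, ready
hub_wait = 5       # RDBYTE waits for the next hub cycle to come around
rdbyte_finish = 3  # RDBYTE completes

stall_clocks = fetch_wait + hub_wait + rdbyte_finish
print(stall_clocks)  # 19
```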

So, TLOCK/TFREE to quickly use singular resources is not outrageous compared to what is going on, anyway.

mindrobots · 2014-03-02 21:31

So you have all these threads running in a single cog that need to use CORDIC, SQRT, BIGMUL, BIGDIV and they can't just lock the resource until done? Isn't this going to be the same at some point with any resource? You run out. Like counters in the P1, if you use the two you have in a COG, you go use another COG. With the P2, if you can't survive with a stall across a lock, go use a resource in another task in another COG. Are there really use cases that will need 9 non-stalling CORDIC or whatever threads? Just because you can try and run everything in one cog doesn't mean you should.

Are folks thinking real world with all this added feature complexity or just theoretical potential?

potatohead · 2014-03-02 21:38

Oh, I did. Liked the story.

[see below]

Sapieha · 2014-03-02 21:43

Hi mindrobots.

All of these discussions come down to many people thinking in terms of single-core CPUs.

They don't consider that we have 8 cores and can distribute work across them to use them optimally.

mindrobots wrote: »

So you have all these threads running in a single cog that need to use CORDIC, SQRT, BIGMUL, BIGDIV and they can't just lock the resource until done? Isn't this going to be the same at some point with any resource? You run out. Like counters in the P1, if you use the two you have in a COG, you go use another COG. With the P2, if you can't survive with a stall across a lock, go use a resource in another task in another COG. Are there really use cases that will need 9 non-stalling CORDIC or whatever threads? Just because you can try and run everything in one cog doesn't mean you should.

Are folks thinking real world with all this added feature complexity or just theoretical potential?

potatohead · 2014-03-02 21:45

Yes!!

In the mess buried above, I mentioned TLOCK / TFREE as the simplest / best solution given where we are right now.

What Bill said:

Multi-threaded C code can use libraries which wrap the MUL/DIV/SQRT etc instructions in TLOCK/TFREE.

Drivers running in other tasks in the same cog can do the same if they use those instructions.

Of course, this all boils down to documentation, and those that don't read the docs and get into trouble pretty much deserve what they get.

If the documentation clearly states:

- these are the shared multi-cycle resources (MUL, DIV, etc)
- if more than one task in a cog uses them, all must TLOCK/TFREE those critical sections

Then the point is moot.

Word. Trying to extend beyond this makes no sense. I'm going to delete the longer post above. It's not needed, and I got sucked in.

cgracey · 2014-03-02 21:47

One way of providing time-efficient resource sharing would be to have each task own a math-circuit output buffer that can hold as much as 3 longs, plus 6 bits, for the worst case of QROTATE (which also needs SETQZ and SETQI). You set up the data via two 'D/#,S/#' instructions, then give the instruction to start some process (QROTATE/QSINCOS/MUL32/etc). The math-circuit output buffer would then wait in line to be able to feed the target math circuit the buffered values. It would then capture the results when they were ready. When the initiating task executes an instruction to read the results, it would loop in place until the result was ready, then receive the data without looping. This wouldn't entail any locks and no task could hang another task. This would require about 170 flipflops per task, or about 5,540 more in the whole chip. I don't know if it would be worth it. In most code, these math instructions make up maybe a few percent of the instructions. This would speed up cases where different tasks wanted different singular math resources, but nothing can help if two tasks both want to perform an operation on the same math circuit.

TLOCK/TFREE can be called a hammer, as it is a might-makes-right solution to sharing, but it is simple and effective.

potatohead · 2014-03-02 21:49

I'm strongly in favor of just sticking with the simple TLOCK / TFREE. People can always choose to not code with them, and perhaps optimize some extreme cases; otherwise, we have a simple, robust solution.

jmg · 2014-03-02 21:55

Heater. wrote: »

Nor was I really. That's why I had to ask

In multi-task mode, GETSQRT will jump to itself until the result is ready, freeing clocks for other tasks.

Good. That's what we want isn't it. Effectively the SQRT despite taking many cycles becomes an atomic operation like any other instruction. That has to be so.

Yes, it is what we want, and this is how this part works now, on this detail.

That's good, but what's missing is what happens should another task also want GETSQRT?

It has to loop and busy wait till the SQRT hardware is free again. Hence TLOCK I guess but I don't see why this is not automatic.

Yes, it should be as Automatic as the present Wait-Till-Done. (hence Auto-lock term)

TLOCK is a rather blunt instrument that clobbers ALL threads, even ones not using any shared resource, and that means TLOCK needs care, or it can break thread-tested code in a different thread.

So, how is this different between hardware scheduled threads and this new preemptive interrupt driven threads? Or is it the same issue anyway?

Different issues, the suggested Other-task-wait-till-resource-free is a housekeeping cleanup, and mirrors the Same-task-wait-for-answer.
The DOCs only say "could cause conflicts", & do not cover exactly how it fails, if a present P2 has 2 GETSQRT firing in 2 threads. Maybe a Corrupt answer ?

The 100% Slice grab, is for Full Swap, and Debug, and may need two operational case variants :
Chip has added HW to support Swap and restart.

For Task-Swap, you might prefer to wait until any shared resource is done.(aka multi-cycle opcodes).
(This would add a slight granular quanta, but make pause.resume much easier to handle.)
For Debug Single Step, you might want to advance 1 clock at a time, but as you are not going to replace the Thread, that is ok.

Those cases could be managed in SW, provided you could SW check for Multicycle-opcode-not-done.

mindrobots · 2014-03-02 21:58

Hey, everybody, I got it figured out now!

If we have some more space, let's add a 9th COG that is just a supervisor COG. It can manage the tasks across the 8 other cogs and relocate threads that are competing for scarce resources. This cog would be good at two things: cog introspection of the 8 sub cogs, and talking to the supervisor cogs on other P2 chips so you could offload your CORDIC request to another P2 with a free CORDIC engine. With just a few more features, this new cog could capture the state of any running cog or task and move it where it could run best. This would allow a small cluster of P2s to run a totally unmanageable number of tasks in an embedded application with optimally no single thread ever incurring a pipeline stall or a wait once the workload was analyzed, profiled and distributed across the cluster...all with just a few more instructions and a couple thousand flops.

jmg · 2014-03-02 22:07

cgracey wrote: »

One way of providing time-efficient resource sharing would be to have each task own a math-circuit output buffer that can hold as much as 3 longs, plus 6 bits, for the worst case of QROTATE (which also needs SETQZ and SETQI). You set up the data via two 'D/#,S/#' instructions, then give the instruction to start some process (QROTATE/QSINCOS/MUL32/etc). The math-circuit output buffer would then wait in line to be able to feed the target math circuit the buffered values. It would then capture the results when they were ready. When the initiating task executes an instruction to read the results, it would loop in place until the result was ready, then receive the data without looping. This wouldn't entail any locks and no task could hang another task. This would require about 170 flipflops per task, or about 5,540 more in the whole chip. I don't know if it would be worth it. In most code, these math instructions make up maybe a few percent of the instructions. This would speed up cases where different tasks wanted different singular math resources, but nothing can help if two tasks both want to perform an operation on the same math circuit.

That sounds complex, and likely still needs flags.

Would it not be simpler to add the suggested Free/Done handshakes ?
You have Done already, Free is the mirror case of that.

cgracey wrote: »

TLOCK/TFREE can be called a hammer, as it is a might-makes-right solution to sharing, but it is simple and effective.

- but it has quite large costs, in multiple libraries, larger code and it clobbers a task that did not need to be clobbered.

For full swap, 'big hammer' TLOCK is needed, but for shared resource management, a middle solution using flags should exist.

"nothing can help if two tasks both want to perform an operation on the same math circuit." - at the same time? Yes, something has to give, it is a shared resource, but the right flags can allow the tasks to cooperate on the resource in an automatic manner, with no conflicts.

Q: How exactly does shared resource fail now, should two threads access it ?

potatohead · 2014-03-02 22:08

Hilarious!! mindrobots, thanks for the laugh!

Q: How exactly does shared resource fail now, should two threads access it ?

Great question!

And if we quit ripping into the guts of things, we might actually have an image long enough to test well and understand what we've already done and how to maximize it.

cgracey · 2014-03-02 22:09

mindrobots wrote: »

Hey, everybody, I got it figured out now!

If we have some more space, let's add a 9th COG that is just a supervisor COG. It can manage the tasks across the 8 other cogs and relocate threads that are competing for scarce resources. This cog would be good at two things: cog introspection of the 8 sub cogs, and talking to the supervisor cogs on other P2 chips so you could offload your CORDIC request to another P2 with a free CORDIC engine. With just a few more features, this new cog could capture the state of any running cog or task and move it where it could run best. This would allow a small cluster of P2s to run a totally unmanageable number of tasks in an embedded application with optimally no single thread ever incurring a pipeline stall or a wait once the workload was analyzed, profiled and distributed across the cluster...all with just a few more instructions and a couple thousand flops.

I was thinking about something similar. Take all these various resources and pool them all, so that whoever needs them can use them. Break down the cog barriers. This would be a great way to build an efficient thread-processing machine, but would still need to allow exclusive ownership where determinism was needed. The impracticality of it all, though, is that you would need to mux all the resources and this would take either extra clock cycles or slow muxes. Back in 2000, when I started using FPGAs to model architectures, I realized early on that it was best to marry peripherals to processors in some static, but balanced relationship, rather than trying to share everything. These things can be revisited in a Prop3 effort.

mindrobots · 2014-03-02 22:13

cgracey wrote: »

I was thinking about something similar. Take all these various resources and pool them all, so that whoever needs them can use them. Break down the cog barriers. This would be a great way to build an efficient thread-processing machine, but would still need to allow exclusive ownership where determinism was needed. The impracticality of it all, though, is that you would need to mux all the resources and this would take either extra clock cycles or slow muxes. Back in 2000, when I started using FPGAs to model architectures, I realized early on that it was best to marry peripherals to processors in some static, but balanced relationship, rather than trying to share everything. These things can be revisited in a Prop3 effort.

Dang it, Chip! Everybody but you was supposed to read that!!

Save it for the P3, PLEASE!!

cgracey · 2014-03-02 22:18

jmg wrote: »

That sounds complex, and likely still needs flags....

Q: How exactly does shared resource fail now, should two threads access it ?

There would be flags, but they would be hidden from the programmer. They would need to be preserved in a thread switch, though.

A: Two threads accessing the same math resource would likely result in the later-comer's computation being performed, and both tasks getting the same answer. It could also be that one task's GETMULL is correct, but his subsequent GETMULH is from another computation. It's first-come/first-serve.
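
Both failure modes described in this answer can be reproduced with a toy model. This is a sketch under loud assumptions: a single shared result register, no latency, and Python methods named after the MUL32, GETMULL and GETMULH instructions mentioned in the thread. It is not a description of the actual P2 logic.

```python
MASK32 = 0xFFFFFFFF


class SharedMultiplier:
    """Toy model: one result register shared by every task, no locking."""

    def __init__(self):
        self._product = 0

    def mul32(self, a, b):
        # A later start simply overwrites whatever was there.
        self._product = (a & MASK32) * (b & MASK32)

    def getmull(self):
        return self._product & MASK32

    def getmulh(self):
        return self._product >> 32


# Case 1: task B starts before task A reads, so both see B's product.
unit = SharedMultiplier()
unit.mul32(3, 5)        # task A
unit.mul32(7, 11)       # task B, the later-comer
a_result = unit.getmull()
b_result = unit.getmull()

# Case 2: task B starts between task A's two reads.
unit = SharedMultiplier()
unit.mul32(0x10000, 0x10000)     # task A: product is 0x1_0000_0000
a_low = unit.getmull()           # correct low long
unit.mul32(MASK32, MASK32)       # task B
a_high = unit.getmulh()          # high long now belongs to B's product
```

In case 1 both tasks read 77 instead of task A getting 15. In case 2 task A reads a correct low long of 0 but a high long of 0xFFFFFFFE where it expected 1.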

Heater. · 2014-03-02 22:19

Chip, (and mindrobots),

Take all these various resources and pool them all, so that whoever needs them can use them.

I do hope I detect some humour in these statements!

It is said that "Those who don't understand Unix are condemned to reinvent it, poorly"

Sounds to me like you want to do exactly that in hardware!

...when I started using FPGAs to model architectures, I realized early on that it was best to marry peripherals to processors in some static, but balanced relationship, rather than trying to share everything.

You are in good company. David May, designer of the XMOS devices had that same revelation at about the same time. That is why xcores are very tightly integrated with their I/O pins and ports and timers and so on.

That's why we like the Prop so much!

jmg · 2014-03-02 22:26

cgracey wrote: »

There would be flags, but they would be hidden from the programmer. They would need to be preserved in a thread switch, though.

So why not use the hidden flags, in a simpler Free manner ?

cgracey wrote: »

A: Two threads accessing the same math resource would likely result in the later-comer's computation being performed, and both tasks getting the same answer. It could also be that one task's GETMULL is correct, but his subsequent GETMULH is from another computation. It's first-come/first-serve.

Ouch, rather as I feared.
That's quite a gotcha lurking, for someone using libraries, who may not know what shared resource each is using.
Or they apply a SW update, and library use or even phase changes, and now they have very subtle, and rare, numeric corruptions...
As you said, Maths use can be a tiny % of code, but ideally P2 code should run without surprises, in 1 task, or 3 tasks.

Users expect slowdowns, especially with shared resource.

cgracey · 2014-03-02 22:31

jmg wrote: »

So why not use the hidden flags, in a simpler Free manner ?

Ouch, rather as I feared.
That's quite a gotcha lurking, for someone using libraries, who may not know what shared resource each is using.
Or they apply a SW update, and library use or even phase changes, and now they have very subtle, and rare, numeric corruptions...
As you said, Maths use can be a tiny % of code, but ideally P2 code should run without surprises, in 1 task, or 3 tasks.

Users expect slowdowns, especially with shared resource.

Are you suggesting that we loop in place on, say, a MUL32 instruction until the big multiplier becomes available, then it's given our command, freeing our task?

I do appreciate the value in always having things just work, by the way.

mindrobots · 2014-03-02 22:38

Heater. wrote: »

Chip, (and mindrobots),

I do hope I detect some humour in these statements!

For my part, it was dead serious satire and sarcasm of the P2 process at this point.

I think it would be fun to play with in FPGA or a P3 variant but now is not the time for that. I was perfectly happy with the multi-tasking P2 with SERDES and USB support. For multi-threading and beyond, I think folks are losing their way as to what the P2 is supposed to be but it could be a visionary feature in its simplest form and a good proving ground for P3 directions so if it can be done without breaking the beauty of the propeller, then, so be it.

Heater. · 2014-03-02 22:40

It seems straightforward to me.

A MUL32 or other long-winded instruction should appear to be an atomic operation. Just like any other instruction.
If the result takes a while to come out, so be it.

Further, if in a threaded mode it turns out some other thread is using the multiplier hardware then you just have to be stalled until it becomes free again. Rather like a HUB access. Then stalled some more waiting for your result. So be it.

Anything else is a programming nightmare.

I don't believe arranging for the programmer to be able to do some other work whilst waiting is going to yield much benefit. And the complexity of it ensures that it will almost never be used.

How hard it is for Chip to make these long winded operations into atomic operations I have no idea. I hope it's not too hard.

P.S. If it turns out to be easy for hardware-scheduled threading but difficult and expensive for preemptive interrupt-driven threads, I suggest dropping the latter.

mindrobots · 2014-03-02 22:47

Back to serious.

If you are cycle counting in your code, are you really going to be multi-threading the cycle critical code? Let it loop, stall or whatever, the operation should be atomic.

Worst case, you will be multi-tasking and probably aware of the shared resource stall situation and able to handle it as a capable programmer (i.e. go find the resource to use on some other cog).
What am I missing?

jmg · 2014-03-02 22:47

cgracey wrote: »

Are you suggesting that we loop in place on, say, a MUL32 instruction until the big multiplier becomes available, then it's given our command, freeing our task?

I do appreciate the value in always having things just work, by the way.

I think that is a yes.
To expand:
If a thread has MUL32, it is looping-for-result. Suppose another thread starts MUL32 - it should Loop-until-free, and then MUL32 is accepted. (If the resource needs any special pre-loads, those can be considered triggers to Not-Free, i.e. once you start using a resource, it is yours until done.)

Free here means the task that was looping-for-result has not only finished the pause, it has also read the result(s), and that final read signals Free.

99% of the time, this handshake is likely never needed, but when it is, you get two correct answers, in two threads, it's just that one may have taken a little longer than expected. No other COGS were disturbed.

The Waiting thread would issue MUL32, and if HW=busy, it starts Loop-until-free, then when the other thread is fully done, the 'paused' MUL32 launches the HW, and flips to looping-for-result.
It exits with no direct knowledge it needed the Loop-until-free.
No extra lines of code are needed, it just works.
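
The handshake described above can be sketched as a small simulation. Everything here is assumed for illustration: the class and function names, the 4-clock latency, and the simple round-robin slot order. It only models the Loop-until-free / looping-for-result idea and is not actual P2 logic.

```python
class AutoLockUnit:
    """Toy shared multiplier with a hidden Free flag (hypothetical model)."""

    def __init__(self, latency):
        self.latency = latency
        self.owner = None      # None means Free
        self.ready_at = 0
        self.result = 0

    def start(self, task, a, b, now):
        """Return True if accepted; False means the caller loops in place."""
        if self.owner is not None:
            return False       # Loop-until-free
        self.owner = task
        self.result = a * b
        self.ready_at = now + self.latency
        return True

    def read(self, task, now):
        """Return the result, or None while the caller is looping-for-result."""
        if self.owner != task or now < self.ready_at:
            return None
        value = self.result
        self.owner = None      # the final read signals Free
        return value


def run_tasks(operands, latency=4):
    """Round-robin the tasks, one slot per clock, until all have results."""
    unit = AutoLockUnit(latency)
    started = set()
    results = {}
    done_clock = {}
    clock = 0
    while len(results) < len(operands):
        for task, (a, b) in operands.items():
            clock += 1
            if task in results:
                continue
            if task not in started:
                if unit.start(task, a, b, clock):
                    started.add(task)
            else:
                value = unit.read(task, clock)
                if value is not None:
                    results[task] = value
                    done_clock[task] = clock
    return results, done_clock


results, done_clock = run_tasks({"A": (3, 5), "B": (7, 11)})
print(results, done_clock)
```

Both tasks get their own correct product; task B simply finishes later than task A, and neither task contains any code that knows a wait happened.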

Bill Henning · 2014-03-02 22:49

Heater,

Too many cycles to waste. Extending logic, all RDxxxx/WRxxx should take 8 cycles, so we don't get hub overlap.

BUT

I could easily see disallowing task switching until all current long ops in the pipeline finish - maybe even including the delayed jumps and REPx blocks
(with an exception for a debugging mode).

SETMODE ATOMIC - all MUL/DIV/delayed instructions in the pipeline complete before task is allowed to starve/stop
SETMODE DEBUG - the way things are right now, caveat emptor

This (IMHO) may be easy enough to implement without needing too many transistors or changes.

jmg · 2014-03-02 22:57

Heater. wrote: »

It seems straightforward to me.

A MUL32 or other long-winded instruction should appear to be an atomic operation. Just like any other instruction.
If the result takes a while to come out, so be it.

Further, if in a threaded mode it turns out some other thread is using the multiplier hardware then you just have to be stalled until it becomes free again. Rather like a HUB access. Then stalled some more waiting for your result. So be it.

Anything else is a programming nightmare.

Agreed, it should appear to be an atomic operation.

The stalls will be very slight, and if someone has code that really cannot tolerate stalls, they need to limit other threads anyway.
There are plenty of places to put threads
