Propeller II update - BLOG

David Betz · 2014-03-01 20:41

Heater. wrote: »

So subtle and smart I'm not sure I know what it means.

I thought preemptive scheduling was a method of achieving multitasking not using it.

"Preemptive", "hardware scheduled", "cooperative", "multi-processor" are all ways one can achieve multitasking.

All but cooperative have the issue of how to safely share resources, data or hardware. That's where locks come in.

Unless you do it the "piggy" way and just disable interrupts, thus switching off preemption, for the period of time a resource is in use. Or in this case use TLOCK if I understand correctly.

It is a bit confusing but it seems that the Propeller uses "task" to mean one of the four hardware tasks supported by each COG and "thread" to mean a software scheduled preemptive thread. At least that's the way I've been thinking about it. Is that correct?

ozpropdev · 2014-03-01 20:43

Chip
Can the task ID bits be used to LOCK the one-off resource blocks.
For example a task attempts to start a MUL32 operation with the C flag returning a successful start.

    MUL32 reg1,reg2 WC
IF_NC JMP #$-1

No other task can start the resource until the result has been collected by its owner task.

CHIPKEN · 2014-03-01 20:58

ozpropdev wrote: »
Chip
Can the task ID bits be used to LOCK the one-off resource blocks.
For example a task attempts to start a MUL32 operation with the C flag returning a successful start.
    MUL32 reg1,reg2 WC
IF_NC JMP #$-1
No other task can start the resource until the result has been collected by its owner task.

Yes, we could, but there's a problem of getting interrupted for some indeterminate amount of time, and inadvertently hanging other threads that are trying to use the same resource. If you TLOCK, take care of business and TFREE, there's never this possibility. The way around this might be an atomic... wait... there's no atomic possibility because it takes sometimes two instructions to start a divide/CORDIC operation. I think the only way would be to buffer all inputs to the desired block (mul/div/etc), have them started automatically when the resource finishes any pending operation, and then buffer the results for pickup, later, by the requesting task. That's more complexity than benefit, I think.

ctwardell · 2014-03-01 21:16

I think the combination of TLOCK/TFREE and general purpose locks would work.

When pre-emptive tasking is being used in a cog you would always use TLOCK/TFREE in all tasks and threads within that cog.

When pre-emptive tasking is not being used you can just a lock because your task cannot be preempted.
The only caveat would be if you used SETTASK to no longer give a task any cycles while it held a lock. To prevent that you could use the same lock before doing a SETTASK.

C.W.

Bill Henning · 2014-03-01 21:19

Correct!

David Betz wrote: »

It is a bit confusing but it seems that the Propeller uses "task" to mean one of the four hardware tasks supported by each COG and "thread" to mean a software scheduled preemptive thread. At least that's the way I've been thinking about it. Is that correct?

CHIPKEN · 2014-03-01 21:20

ctwardell wrote: »

I think the combination of TLOCK/TFREE and general purpose locks would work.

When pre-emptive tasking is being used in a cog you would always use TLOCK/TFREE in all tasks and threads within that cog. Because of the TLOCK the operation using the lock would always finish so we don't get the hung lock issue.

When pre-emptive tasking is not being used you can just a lock because your task cannot be preempted.
The only caveat would be if you used SETTASK to no longer give a task any cycles while it held a lock. To prevent that you could use the same lock before doing a SETTASK.

C.W.

That still seems like a lot complexity to speed up what are going to be, likely, just infrequent little blips.

Roy Eltham · 2014-03-01 22:08

David,
I suspect GCC could use the "smaller" multiply instructions (MUL and MULS) for most things it wants a multiply for... It would just need the big one for when the user does integer math on 32bit operands.

Ahle2 · 2014-03-02 01:21

ozpropdev wrote: »
Chip
Can the task ID bits be used to LOCK the one-off resource blocks.
For example a task attempts to start a MUL32 operation with the C flag returning a successful start.
    MUL32 reg1,reg2 WC
IF_NC JMP #$-1
No other task can start the resource until the result has been collected by its owner task.

The sRIO hardware in the Texas Instruments TMS320C6670 handles resource sharing by a "locking queue mechanism" using core ID in combination with shadow registers for each core. Givin the appearance of, code wise, a dedicated resource per core. I don't remember the details how it works anymore, but it was a breeze writing a sRIO driver for it. No need for disable/enable interrupts or other "tricks" like on it's older siblings.

/Johannes

Ahle2 · 2014-03-02 01:47

@David Betz
Some questions.
Do you intend to implement intrinsics for common functions that can be greatly optimized by using "all those not so GCC compliant P2 instructions"?
By using pragmas, it is possible to give hints to the compiler/linker to better optimize things for a specific HW. I think the P2 would benefit more than most MCU's. Any thoughts? It could, for example,be possible to specify where each functions will be located(hub, cog), or how to align data structures etc. And maybe even how threads and dynamic memory management should be handled.

/Johannes

David Betz · 2014-03-02 03:04

Ahle2 wrote: »

@David Betz
Some questions.
Do you intend to implement intrinsics for common functions that can be greatly optimized by using "all those not so GCC compliant P2 instructions"?
By using pragmas, it is possible to give hints to the compiler/linker to better optimize things for a specific HW. I think the P2 would benefit more than most MCU's. Any thoughts? It could, for example,be possible to specify where each functions will be located(hub, cog), or how to align data structures etc. And maybe even how threads and dynamic memory management should be handled.

/Johannes

All of this is possible but you have to consider the fact that the standard libraries come prebuilt. We already generate multiple variants of each library for things like COG vs. LMM vs. XMM and some of the other compiler switches. If we add multiple additional compiler options we'll have to generate all possible combinations of those options for the prebuilt libraries. This is certainly possible but it means a much longer compiler build time and more disk space for the libraries themselves. I guess neither of those is a big deal but this would also apply to libraries the user might create. SimpleIDE might not be quite so simple anymore. :-)

David Betz · 2014-03-02 03:05

Roy Eltham wrote: »

David,
I suspect GCC could use the "smaller" multiply instructions (MUL and MULS) for most things it wants a multiply for... It would just need the big one for when the user does integer math on 32bit operands.

Yes, that is true. I don't think those variants exist for divide though do they?

jmg · 2014-03-02 03:09

CHIPKEN wrote: »

ozpropdev wrote:

Chip
Can the task ID bits be used to LOCK the one-off resource blocks.
For example a task attempts to start a MUL32 operation with the C flag returning a successful start.
Code:

MUL32 reg1,reg2 WC
IF_NC JMP #$-1

No other task can start the resource until the result has been collected by its owner task

Yes, we could, but there's a problem of getting interrupted for some indeterminate amount of time, and inadvertently hanging other threads that are trying to use the same resource. If you TLOCK, take care of business and TFREE, there's never this possibility. The way around this might be an atomic... wait... there's no atomic possibility because it takes sometimes two instructions to start a divide/CORDIC operation. I think the only way would be to buffer all inputs to the desired block (mul/div/etc), have them started automatically when the resource finishes any pending operation, and then buffer the results for pickup, later, by the requesting task. That's more complexity than benefit, I think.

The idea of locking by resource, rather than stalling all other tasks seems better matched to the problem, and less of a Big Hammer.

If two instructions are sometimes needed, can the first one not claim the resource ?

That would catch the rare case of another task trying to access 'started' resource before it was done with.
Then, it would be placed in the wait queue.

In all other cases, no stall of Tasks would occur.

John Abshier · 2014-03-02 07:02

What happens when you're using multiple HW tasks and/or using the new threading capability with code that uses GETMULL/H, GETDIVQ/R, etc? Do we now have a separate multiplier for each HW task? Even if we do, what happens if the scheduler decides to switch threads just before one of these instructions? Won't the new thread get the result of an instruction initiated by the old thread if they're both using that HW resource?

Does this also apply to CORDIC routines?

John Abshier

SRLM · 2014-03-02 08:14

Just to be clear: hardware scheduled threads is what we've had for a while now, correct? That's the one that Heater proposed way back when. And the software tasks are what's new. A software task is really just a collection of state bits (PC, ...) that can be saved on command in some memory location and restored later.

David Betz wrote: »

All of this is possible but you have to consider the fact that the standard libraries come prebuilt. We already generate multiple variants of each library for things like COG vs. LMM vs. XMM and some of the other compiler switches. If we add multiple additional compiler options we'll have to generate all possible combinations of those options for the prebuilt libraries. This is certainly possible but it means a much longer compiler build time and more disk space for the libraries themselves. I guess neither of those is a big deal but this would also apply to libraries the user might create. SimpleIDE might not be quite so simple anymore. :-)

You could always throw convention out the window and build like I've set up libpropeller: everything compiled every time in a single compilation unit.

David Betz · 2014-03-02 08:37

SRLM wrote: »

Just to be clear: hardware scheduled threads is what we've had for a while now, correct? That's the one that Heater proposed way back when. And the software tasks are what's new. A software task is really just a collection of state bits (PC, ...) that can be saved on command in some memory location and restored later.

You could always throw convention out the window and build like I've set up libpropeller: everything compiled every time in a single compilation unit.

Ummm... You want to compile the entire C and possibly C++ library every time you compile a user program? I'm not a fan of header files that contain code. How do you handle the possibility that more than one source file in a program might include your header files? I suppose you could get around that problem with ifdefs but that means you end up with potentially huge header files. Why use that approach rather than a library?

SRLM · 2014-03-02 09:51

David Betz wrote: »

Ummm... You want to compile the entire C and possibly C++ library every time you compile a user program? I'm not a fan of header files that contain code. How do you handle the possibility that more than one source file in a program might include your header files? I suppose you could get around that problem with ifdefs but that means you end up with potentially huge header files. Why use that approach rather than a library?

To be clear, it wasn't a very serious solution. But yes: I want to compile everything that is used and may be compiled differently for the different modes. This won't necessarily include the whole C/C++ libraries: only the parts that you use. I've written some on this in my inline code in headers justification page. For the Propeller 1, at least, compiling everything every time is very quick. This certainly wouldn't scale to compile Linux, but it works well for the forced constraint size of the Propeller and the myriad of options that you'd have to prepare for.

How do you handle the possibility that more than one source file in a program might include your header files? This is not a problem for "regular" classes that don't have static variables. For these classes the C++ standard says that all definitions must be the same, and that only one copy is included by the compiler. So that's taken care of automatically for "regular" C++ programs. In my preferred system you compile everything in a single translation unit so that you can use simple #IFDEF to include the header, and hence static source, only once.

Huge headers are a symptom of class bloat, and you should refactor. This is no different than any other good code practices: make programs consumable in small chunks whose entirety can fit into your brain at once.

A single translation unit also gets you the benefit of full usage of compiler optimizations: the compiler can only optimize within a single translation unit

The way I view it is to treat the C++ build system more like Java, rather than a product of the 80's.

Heater. · 2014-03-02 10:11

SRLM,

I do admire you campaign to bring C++ techniques up to date.

It makes no sense for large PC apps using libraries like boost or Qt or even just the C++ standard library. It takes too long to compile such programs already, having to recompile all the code you use every time would take forever and not be acceptable.

Might work out for small MCU apps, as we have here, I have yet to try it out.

It won't work for C programs if I understand correctly.

David Betz · 2014-03-02 10:27

SRLM wrote: »

To be clear, it wasn't a very serious solution. But yes: I want to compile everything that is used and may be compiled differently for the different modes. This won't necessarily include the whole C/C++ libraries: only the parts that you use. I've written some on this in my inline code in headers justification page. For the Propeller 1, at least, compiling everything every time is very quick. This certainly wouldn't scale to compile Linux, but it works well for the forced constraint size of the Propeller and the myriad of options that you'd have to prepare for.

How do you handle the possibility that more than one source file in a program might include your header files? This is not a problem for "regular" classes that don't have static variables. For these classes the C++ standard says that all definitions must be the same, and that only one copy is included by the compiler. So that's taken care of automatically for "regular" C++ programs. In my preferred system you compile everything in a single translation unit so that you can use simple #IFDEF to include the header, and hence static source, only once.

Huge headers are a symptom of class bloat, and you should refactor. This is no different than any other good code practices: make programs consumable in small chunks whose entirety can fit into your brain at once.

A single translation unit also gets you the benefit of full usage of compiler optimizations: the compiler can only optimize within a single translation unit

The way I view it is to treat the C++ build system more like Java, rather than a product of the 80's.

I don't agree that large code means code bloat. Try to write a GUI library and include all of the source in a single header file. You'll end up with a 100mb header file no matter how wel crafted the library is. This approach only works for small programs. Also, I like the idea of the interface to a big library being separate from its implementation. I guess this is one advantage of the header files that all Spin programmers seem to hate. Anyway, I did understand that you weren't completely serious about using this for the propgcc libraries. I just forgot to postfix my reply with a smiley face. :-)

SRLM · 2014-03-02 10:44

As I said: I don't propose this as a useful technique for anything but microcontroller system programming.

I've never really tried (or wanted to use) C, so I can't comment on the applicability there.

David Betz wrote: »

I don't agree that large code means code bloat. Try to write a GUI library and include all of the source in a single header file. You'll end up with a 100mb header file no matter how wel crafted the library is. This approach only works for small programs. Also, I like the idea of the interface to a big library being separate from its implementation. I guess this is one advantage of the header files that all Spin programmers seem to hate. Anyway, I did understand that you weren't completely serious about using this for the propgcc libraries. I just forgot to postfix my reply with a smiley face. :-)

Hmmm. I think we may be speaking past each other here. There's no reason you can't have multiple .h header files in a project: just #include them. If a header file gets too big (in LOC) you refactor it into several new .h files. That's what I referring to. Of course, with the inline code in headers technique the translation unit will be much larger than traditional techniques, but that hardly matters to my modern Intel PC.

re: interface separate: I guess this is one area where we have philosophical differences. I like having a single copy of the function signature.

potatohead · 2014-03-02 10:48

Personally, I thought this a great optimization for P1, which really benefits from it.

General applicability and philosophical differences will grow as scale does. Perfectly ordinary, if you ask me. P2 programs may well reach sizes where this approach might not make the same sense it does with the P1, where I think it makes a lot of sense.

David Betz · 2014-03-02 10:56

potatohead wrote: »

Personally, I thought this a great optimization for P1, which really benefits from it.

General applicability and philosophical differences will grow as scale does. Perfectly ordinary, if you ask me. P2 programs may well reach sizes where this approach might not make the same sense it does with the P1, where I think it makes a lot of sense.

Another thing we have to consider is that we want to remain compatible with both the C and C++ standard libraries (ignoring for the moment the "simple" libraries that go with SimpleIDE). If we assume that then including <stdio.h> or <stdlib.h> or even <string.h> will drag in fairly large files if all of the code is included in the header file. Short of that we have to split the header files into <fopen.h>, <fclose,h> <fread.h>, etc. That will be totally incompatible with what C or C++ programmers expect coming from other platforms. Like it or not, we'll have to add at least one more dimension to the matrix of libraries we currently build: single-task and multi-task. The single-task librariy won't include the TLOCK/TFREE around each access to shared resources, the multi-task library will.

David Betz · 2014-03-02 11:00

(dumb post deleted)

potatohead · 2014-03-02 11:29

Well, there is a basic difference between getting some things done and building for the future. Those are not always compatible.

jmg · 2014-03-02 11:31

David Betz wrote: »

Like it or not, we'll have to add at least one more dimension to the matrix of libraries we currently build: single-task and multi-task. The single-task librariy won't include the TLOCK/TFREE around each access to shared resources, the multi-task library will.

ozpropdev's suggestion of (auto) resource-level locks would avoid this, correct ?
If each resource could have a Start-if-availabe, Wait-if-unavailable gateway, it does not even need to know which task actually triggered it. (because two tasks cannot request start on the same clock slice)

David Betz · 2014-03-02 11:34

jmg wrote: »

ozpropdev's suggestion of (auto) resource-level locks would avoid this, correct ?
If each resource could have a Start-if-availabe, Wait-if-unavailable gateway, it does not even need to know which task actually triggered it. (because two tasks cannot request start on the same clock slice)

That would work but it adds more code for the lock handling whether it be locks on individual resources or the "big hammer" TLOCK/TFREE that locks everything. A single-task program doesn't need either. However, you're correct. We could have a single version of the library if we don't care about the extra unnecessary lock code in the single-task case.

ctwardell · 2014-03-02 12:15

jmg wrote: »

ozpropdev's suggestion of (auto) resource-level locks would avoid this, correct ?
If each resource could have a Start-if-availabe, Wait-if-unavailable gateway, it does not even need to know which task actually triggered it. (because two tasks cannot request start on the same clock slice)

This still won't work with preemptive threads because of the condition Chip mentioned where a thread has started the use of a locked resource and then gets swapped out before finishing with the locked operation.

The issue is that these shared resources use multiple instructions to complete, the lock needs to remain in place until all the instructions against that resource are complete, if the thread is swapped out between instructions that resource now stays locked until that thread is swapped back in.

I think we are near the point of diminishing returns. I'd like to see the preemptive tasking wrapped up based on Chip's current plan and then move on to the SERDES, etc.

Perhaps P3 can be an entirely different beast with all these issues considered and included from day one.

C.W.

jmg · 2014-03-02 12:34

ctwardell wrote: »

This still won't work with preemptive threads because of the condition Chip mentioned where a thread has started the use of a locked resource and then gets swapped out before finishing with the locked operation.

The issue is that these shared resources use multiple instructions to complete, the lock needs to remain in place until all the instructions against that resource are complete, if the thread is swapped out between instructions that resource now stays locked until that thread is swapped back in.

If the Task Multicycle resource InQueue flags were OR'd and readable, then a Full Swap handler would Freeze Task, check if InQueue, and if so, effectively single-step until NOT inQueue, and then do the Full SWAP.

Debug use of this, would not need to check, as Debug would not be about to use Queue Resource.

FullSwap has 100% of all slices, so nothing else can start in the meantime, so if the very first opcode of Swaped-in task is a resource-queue-trigger, that is also ok.

That may pause slightly longer than absolutely necessary (it waits on any busy) - but it is simple flags & SW, and libraries can be smaller and fewer.

potatohead · 2014-03-02 12:42

This is precisely why I said this wasn't low hanging fruit early on. We've made some design choices early on which define the sweet spots for the P2. Engineering all of this away will cause bloat and proliferation of kludges.

Chip put a nice, simple compromise out there, which opens the door for software solutions later on. IMHO, that's the best path, not continuing to add exceptions and complexity for very little real return in performance.

I like that we've got the option on one of the hardware threads per COG.

If a preemptive model is needed, we've got one available with limits. Those limits really determine whether or not using it makes sense over the other use cases we've got to apply to the problem.

The core of the design isn't pre-emptive. Until it is, this kind of thing won't make as much sense as it otherwise would. Great P3 discussion, IMHO.

Sapieha · 2014-03-02 13:01

Hi Guys.

I like Your's discussion BUT have any question?

It is discussion on made P2 C++ compatible else C++ P2 compatible?

jmg · 2014-03-02 13:39

ctwardell wrote: »

The docs from the release at the end of January still indicate those as single resources per cog.

Tips for coding multi-tasking programs
--------------------------------------

While all tasks in a multi-tasking program can execute atomic instructions without any inter-task conflict,
remember that there's only one of each of the following cog resources and only one task can use it at a time:

  Singular resource      Some related instructions that [B]could cause conflicts[/B]
  ----------------------------------------------------------------------------------------------------------
  WIDE registers         RDBYTEC/RDWORDC/RDLONGC/RDWIDEC/RDWIDE/WRWIDE/SETWIDE/SETWIDZ
  INDA                   FIXINDA/FIXINDS/SETINDA/SETINDS / INDA modification via INDA usage
  INDB                   FIXINDB/FIXINDS/SETINDB/SETINDS / INDB modification via INDB usage
  PTRA                   SETPTRA/ADDPTRA/SUBPTRA / PTRA modification via RDxxxx/WRxxxx
  PTRB                   SETPTRB/ADDPTRB/SUBPTRB / PTRB modification via RDxxxx/WRxxxx
  PTRX                   SETPTRX/ADDPTRX/SUBPTRX/CALLX/RETX/PUSHX/POPX / PTRX modification via RDAUXx/WRAUXx
  PTRY                   SETPTRY/ADDPTRY/SUBPTRY/CALLY/RETY/PUSHY/POPY / PTRY modification via RDAUXx/WRAUXx
  ACCA                   SETACCA/SETACCS/MACA/SARACCA/SARACCS/CLRACCA/CLRACCS
  ACCB                   SETACCB/SETACCS/MACB/SARACCB/SARACCS/CLRACCB/CLRACCS
  32x32 multiplier       MUL32/MUL32U
  64/32 divider          FRAC/DIV32/DIV32U/DIV64/DIV64U/DIV64D
  64-bit square rooter   SQRT64/SQRT32
  CORDIC computer        QSINCOS/QARCTAN/QROTATE/QLOG/QEXP/SETQI/SETQZ
  SERA                   SETSERA/SERINA/SEROUTA
  SERB                   SETSERB/SERINB/SEROUTB
  XFR                    SETXFR
  VID                    WAITVID/SETVID/SETVIDY/SETVIDI/SETVIDQ/POLVID
  Block repeater         REPS/REPD
  CTRA                   SETCTRA/SETWAVA/SETPHSA/ADDPHSA/SUBPHSA/GETPHZA/POLCTRA/CAPCTRA/SYNCTRA
  CTRB                   SETCTRB/SETWAVB/SETPHSB/ADDPHSB/SUBPHSB/GETPHZB/POLCTRB/CAPCTRB/SYNCTRB
  PIX                    (not usable in multi-tasking, requires single-task timing)

Looks like a good reason to add those extra locks I requested...

You could wrap the usage of those instructions within a lock.

This is an issue even without Full Swap, and words like could cause conflicts underline the need for some flip-flop level conflict handling - a simple logic flip-flop managed wait-till-free seems the natural way to manage such shared resource ?
(it avoids needing additional SW wrappers, and avoids library sprawl)

That way, even lots of interleaved use still works as expected, but can run slower, as the resource is shared.

addit : I think much of the logic is already there. eg DOCs say
In multi-task mode, GETDIVQ/GETDIVR will jump to themselves until the result is ready,
freeing clocks for other tasks.
I think this needs to be expanded a tad, to encompass "other Tasks wanting GETDIVQ/GETDIVR (etc) will jump to themselves until the HW is no longer busy."

In time domain, Busy will be slightly wider than wait-for-done (eg multi-operand starts)

Propeller II update - BLOG

Comments