Propeller II update - BLOG - Page 188 — Parallax Forums

Propeller II update - BLOG


Comments

  • David Betz Posts: 14,516
    edited 2014-03-01 20:41
    Heater. wrote: »
    So subtle and smart I'm not sure I know what it means.

    I thought preemptive scheduling was a method of achieving multitasking not using it.

    "Preemptive", "hardware scheduled", "cooperative", "multi-processor" are all ways one can achieve multitasking.

    All but cooperative have the issue of how to safely share resources, data or hardware. That's where locks come in.

    Unless you do it the "piggy" way and just disable interrupts, thus switching off preemption, for the period of time a resource is in use. Or in this case use TLOCK if I understand correctly.
    It is a bit confusing but it seems that the Propeller uses "task" to mean one of the four hardware tasks supported by each COG and "thread" to mean a software scheduled preemptive thread. At least that's the way I've been thinking about it. Is that correct?
  • ozpropdev Posts: 2,792
    edited 2014-03-01 20:43
    Chip
    Can the task ID bits be used to LOCK the one-off resource blocks?
    For example a task attempts to start a MUL32 operation with the C flag returning a successful start.
        MUL32 reg1,reg2 WC
    IF_NC JMP #$-1
    
    No other task can start the resource until the result has been collected by its owner task.
  • CHIPKEN Posts: 45
    edited 2014-03-01 20:58
    ozpropdev wrote: »
    Chip
    Can the task ID bits be used to LOCK the one-off resource blocks.
    For example a task attempts to start a MUL32 operation with the C flag returning a successful start.
        MUL32 reg1,reg2 WC
    IF_NC JMP #$-1
    
    No other task can start the resource until the result has been collected by its owner task.


    Yes, we could, but there's a problem of getting interrupted for some indeterminate amount of time, and inadvertently hanging other threads that are trying to use the same resource. If you TLOCK, take care of business and TFREE, there's never this possibility. The way around this might be an atomic... wait... there's no atomic possibility because it sometimes takes two instructions to start a divide/CORDIC operation. I think the only way would be to buffer all inputs to the desired block (mul/div/etc), have them started automatically when the resource finishes any pending operation, and then buffer the results for pickup, later, by the requesting task. That's more complexity than benefit, I think.
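
A toy Python model of the hazard described above, assuming a resource that is claimed by its first start instruction and freed only when the owner collects the result. `SharedResource`, `try_start`, `collect`, and `retries_needed` are made-up names for illustration; nothing here is actual P2 hardware behaviour.

```python
class SharedResource:
    """Hypothetical one-off block (multiplier, divider, CORDIC) with an owner."""

    def __init__(self):
        self.owner = None  # id of the task currently holding the block

    def try_start(self, task_id):
        """Claim the block if it is free; True plays the role of the C flag."""
        if self.owner is None:
            self.owner = task_id
        return self.owner == task_id

    def collect(self, task_id):
        """The owner picks up its result, which frees the block."""
        if self.owner == task_id:
            self.owner = None


def retries_needed(preempted_steps):
    """Retries task 1 needs while task 0 holds the block.

    preempted_steps (non-negative) is how long task 0 is swapped out between
    starting its operation and collecting the result.
    """
    resource = SharedResource()
    assert resource.try_start(0)      # task 0 starts, then gets preempted
    retries = 0
    while True:
        if retries == preempted_steps:
            resource.collect(0)       # task 0 finally runs again
        if resource.try_start(1):
            return retries
        retries += 1
```

In this model `retries_needed(250)` returns 250: the waiting task spins for exactly as long as the owner stays swapped out, which is the indeterminate stall. Bracketing the start/collect pair with TLOCK/TFREE corresponds to `preempted_steps` being 0.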
  • ctwardell Posts: 1,716
    edited 2014-03-01 21:16
    I think the combination of TLOCK/TFREE and general purpose locks would work.

    When pre-emptive tasking is being used in a cog you would always use TLOCK/TFREE in all tasks and threads within that cog.

    When pre-emptive tasking is not being used you can just use a lock because your task cannot be preempted.
    The only caveat would be if you used SETTASK to no longer give a task any cycles while it held a lock. To prevent that you could use the same lock before doing a SETTASK.

    C.W.
  • Bill Henning Posts: 6,445
    edited 2014-03-01 21:19
    Correct!
    David Betz wrote: »
    It is a bit confusing but it seems that the Propeller uses "task" to mean one of the four hardware tasks supported by each COG and "thread" to mean a software scheduled preemptive thread. At least that's the way I've been thinking about it. Is that correct?
  • CHIPKEN Posts: 45
    edited 2014-03-01 21:20
    ctwardell wrote: »
    I think the combination of TLOCK/TFREE and general purpose locks would work.

    When pre-emptive tasking is being used in a cog you would always use TLOCK/TFREE in all tasks and threads within that cog. Because of the TLOCK the operation using the lock would always finish so we don't get the hung lock issue.

    When pre-emptive tasking is not being used you can just use a lock because your task cannot be preempted.
    The only caveat would be if you used SETTASK to no longer give a task any cycles while it held a lock. To prevent that you could use the same lock before doing a SETTASK.

    C.W.


    That still seems like a lot of complexity to speed up what are going to be, likely, just infrequent little blips.
  • Roy Eltham Posts: 3,000
    edited 2014-03-01 22:08
    David,
    I suspect GCC could use the "smaller" multiply instructions (MUL and MULS) for most things it wants a multiply for... It would just need the big one for when the user does integer math on 32-bit operands.
  • Ahle2 Posts: 1,179
    edited 2014-03-02 01:21
    ozpropdev wrote: »
    Chip
    Can the task ID bits be used to LOCK the one-off resource blocks.
    For example a task attempts to start a MUL32 operation with the C flag returning a successful start.
        MUL32 reg1,reg2 WC
    IF_NC JMP #$-1
    
    No other task can start the resource until the result has been collected by its owner task.

    The sRIO hardware in the Texas Instruments TMS320C6670 handles resource sharing by a "locking queue mechanism" using core ID in combination with shadow registers for each core, giving the appearance, code-wise, of a dedicated resource per core. I don't remember the details of how it works anymore, but it was a breeze writing an sRIO driver for it. No need for disabling/enabling interrupts or other "tricks" like on its older siblings.

    /Johannes

  • Ahle2 Posts: 1,179
    edited 2014-03-02 01:47
    @David Betz
    Some questions.
    Do you intend to implement intrinsics for common functions that can be greatly optimized by using "all those not so GCC compliant P2 instructions"?
    By using pragmas, it is possible to give hints to the compiler/linker to better optimize things for a specific HW. I think the P2 would benefit more than most MCUs. Any thoughts? It could, for example, be possible to specify where each function will be located (hub, cog), or how to align data structures etc. And maybe even how threads and dynamic memory management should be handled.

    /Johannes
  • David Betz Posts: 14,516
    edited 2014-03-02 03:04
    Ahle2 wrote: »
    @David Betz
    Some questions.
    Do you intend to implement intrinsics for common functions that can be greatly optimized by using "all those not so GCC compliant P2 instructions"?
    By using pragmas, it is possible to give hints to the compiler/linker to better optimize things for a specific HW. I think the P2 would benefit more than most MCUs. Any thoughts? It could, for example, be possible to specify where each function will be located (hub, cog), or how to align data structures etc. And maybe even how threads and dynamic memory management should be handled.

    /Johannes
    All of this is possible but you have to consider the fact that the standard libraries come prebuilt. We already generate multiple variants of each library for things like COG vs. LMM vs. XMM and some of the other compiler switches. If we add multiple additional compiler options we'll have to generate all possible combinations of those options for the prebuilt libraries. This is certainly possible but it means a much longer compiler build time and more disk space for the libraries themselves. I guess neither of those is a big deal but this would also apply to libraries the user might create. SimpleIDE might not be quite so simple anymore. :-)
  • David Betz Posts: 14,516
    edited 2014-03-02 03:05
    Roy Eltham wrote: »
    David,
    I suspect GCC could use the "smaller" multiply instructions (MUL and MULS) for most things it wants a multiply for... It would just need the big one for when the user does integer math on 32bit operands.
    Yes, that is true. I don't think those variants exist for divide though, do they?
  • jmg Posts: 15,173
    edited 2014-03-02 03:09
    CHIPKEN wrote: »
    ozpropdev wrote:
    Chip
    Can the task ID bits be used to LOCK the one-off resource blocks.
    For example a task attempts to start a MUL32 operation with the C flag returning a successful start.
    Code:

    MUL32 reg1,reg2 WC
    IF_NC JMP #$-1

    No other task can start the resource until the result has been collected by its owner task

    Yes, we could, but there's a problem of getting interrupted for some indeterminate amount of time, and inadvertently hanging other threads that are trying to use the same resource. If you TLOCK, take care of business and TFREE, there's never this possibility. The way around this might be an atomic... wait... there's no atomic possibility because it takes sometimes two instructions to start a divide/CORDIC operation. I think the only way would be to buffer all inputs to the desired block (mul/div/etc), have them started automatically when the resource finishes any pending operation, and then buffer the results for pickup, later, by the requesting task. That's more complexity than benefit, I think.

    The idea of locking by resource, rather than stalling all other tasks seems better matched to the problem, and less of a Big Hammer.

    If two instructions are sometimes needed, can the first one not claim the resource?

    That would catch the rare case of another task trying to access 'started' resource before it was done with.
    Then, it would be placed in the wait queue.

    In all other cases, no stall of Tasks would occur.
  • John Abshier Posts: 1,116
    edited 2014-03-02 07:02
    What happens when you're using multiple HW tasks and/or using the new threading capability with code that uses GETMULL/H, GETDIVQ/R, etc? Do we now have a separate multiplier for each HW task? Even if we do, what happens if the scheduler decides to switch threads just before one of these instructions? Won't the new thread get the result of an instruction initiated by the old thread if they're both using that HW resource?

    Does this also apply to CORDIC routines?

    John Abshier
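
The scenario in this question can be sketched in Python under one loud assumption: the cog has a single multiplier with a single result register, as the "singular resource" list in the docs suggests. `Multiplier`, `start`, `get`, and `run` are illustrative names standing in for MUL32 and GETMULL; operation latency is ignored.

```python
class Multiplier:
    """Assumed model: one result register shared by every thread in the cog."""

    def __init__(self):
        self.result = None

    def start(self, a, b):
        """Stands in for MUL32: begin a multiply (completes instantly here)."""
        self.result = a * b

    def get(self):
        """Stands in for GETMULL: read whatever result is currently latched."""
        return self.result


def run(switch_before_get):
    """Thread A multiplies 6 * 7; optionally thread B runs before A collects."""
    mul = Multiplier()
    mul.start(6, 7)          # thread A starts its multiply
    if switch_before_get:
        mul.start(2, 3)      # scheduler switches: thread B starts 2 * 3 ...
        mul.get()            # ... and collects its own result
    return mul.get()         # thread A resumes and collects
```

Under that assumption `run(False)` returns 42 while `run(True)` returns 6, i.e. thread A silently picks up thread B's product. Whether the real hardware behaves this way is exactly what the post is asking.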
  • SRLM Posts: 5,045
    edited 2014-03-02 08:14
    Just to be clear: hardware scheduled threads is what we've had for a while now, correct? That's the one that Heater proposed way back when. And the software tasks are what's new. A software task is really just a collection of state bits (PC, ...) that can be saved on command in some memory location and restored later.
    David Betz wrote: »
    All of this is possible but you have to consider the fact that the standard libraries come prebuilt. We already generate multiple variants of each library for things like COG vs. LMM vs. XMM and some of the other compiler switches. If we add multiple additional compiler options we'll have to generate all possible combinations of those options for the prebuilt libraries. This is certainly possible but it means a much longer compiler build time and more disk space for the libraries themselves. I guess neither of those is a big deal but this would also apply to libraries the user might create. SimpleIDE might not be quite so simple anymore. :-)

    You could always throw convention out the window and build like I've set up libpropeller: everything compiled every time in a single compilation unit. :)
  • David Betz Posts: 14,516
    edited 2014-03-02 08:37
    SRLM wrote: »
    Just to be clear: hardware scheduled threads is what we've had for a while now, correct? That's the one that Heater proposed way back when. And the software tasks are what's new. A software task is really just a collection of state bits (PC, ...) that can be saved on command in some memory location and restored later.



    You could always throw convention out the window and build like I've set up libpropeller: everything compiled every time in a single compilation unit. :)
    Ummm... You want to compile the entire C and possibly C++ library every time you compile a user program? I'm not a fan of header files that contain code. How do you handle the possibility that more than one source file in a program might include your header files? I suppose you could get around that problem with ifdefs but that means you end up with potentially huge header files. Why use that approach rather than a library?
  • SRLM Posts: 5,045
    edited 2014-03-02 09:51
    David Betz wrote: »
    Ummm... You want to compile the entire C and possibly C++ library every time you compile a user program? I'm not a fan of header files that contain code. How do you handle the possibility that more than one source file in a program might include your header files? I suppose you could get around that problem with ifdefs but that means you end up with potentially huge header files. Why use that approach rather than a library?

    To be clear, it wasn't a very serious solution. But yes: I want to compile everything that is used and may be compiled differently for the different modes. This won't necessarily include the whole C/C++ libraries: only the parts that you use. I've written some on this in my inline code in headers justification page. For the Propeller 1, at least, compiling everything every time is very quick. This certainly wouldn't scale to compile Linux, but it works well for the forced constraint size of the Propeller and the myriad of options that you'd have to prepare for.

    How do you handle the possibility that more than one source file in a program might include your header files? This is not a problem for "regular" classes that don't have static variables. For these classes the C++ standard says that all definitions must be the same, and that only one copy is included by the compiler. So that's taken care of automatically for "regular" C++ programs. In my preferred system you compile everything in a single translation unit so that you can use a simple #ifdef to include the header, and hence static source, only once.

    Huge headers are a symptom of class bloat, and you should refactor. This is no different than any other good code practices: make programs consumable in small chunks whose entirety can fit into your brain at once.

    A single translation unit also gets you the benefit of full usage of compiler optimizations: the compiler can only optimize within a single translation unit.

    The way I view it is to treat the C++ build system more like Java, rather than a product of the 80's.
  • Heater. Posts: 21,230
    edited 2014-03-02 10:11
    SRLM,

    I do admire your campaign to bring C++ techniques up to date.

    It makes no sense for large PC apps using libraries like boost or Qt or even just the C++ standard library. It takes too long to compile such programs already, having to recompile all the code you use every time would take forever and not be acceptable.

    Might work out for small MCU apps, as we have here, I have yet to try it out.

    It won't work for C programs if I understand correctly.
  • David Betz Posts: 14,516
    edited 2014-03-02 10:27
    SRLM wrote: »
    To be clear, it wasn't a very serious solution. But yes: I want to compile everything that is used and may be compiled differently for the different modes. This won't necessarily include the whole C/C++ libraries: only the parts that you use. I've written some on this in my inline code in headers justification page. For the Propeller 1, at least, compiling everything every time is very quick. This certainly wouldn't scale to compile Linux, but it works well for the forced constraint size of the Propeller and the myriad of options that you'd have to prepare for.

    How do you handle the possibility that more than one source file in a program might include your header files? This is not a problem for "regular" classes that don't have static variables. For these classes the C++ standard says that all definitions must be the same, and that only one copy is included by the compiler. So that's taken care of automatically for "regular" C++ programs. In my preferred system you compile everything in a single translation unit so that you can use a simple #ifdef to include the header, and hence static source, only once.

    Huge headers are a symptom of class bloat, and you should refactor. This is no different than any other good code practices: make programs consumable in small chunks whose entirety can fit into your brain at once.

    A single translation unit also gets you the benefit of full usage of compiler optimizations: the compiler can only optimize within a single translation unit

    The way I view it is to treat the C++ build system more like Java, rather than a product of the 80's.
    I don't agree that large code means code bloat. Try to write a GUI library and include all of the source in a single header file. You'll end up with a 100 MB header file no matter how well crafted the library is. This approach only works for small programs. Also, I like the idea of the interface to a big library being separate from its implementation. I guess this is one advantage of the header files that all Spin programmers seem to hate. Anyway, I did understand that you weren't completely serious about using this for the propgcc libraries. I just forgot to postfix my reply with a smiley face. :-)
  • SRLM Posts: 5,045
    edited 2014-03-02 10:44
    As I said: I don't propose this as a useful technique for anything but microcontroller system programming.

    I've never really tried (or wanted to use) C, so I can't comment on the applicability there.
    David Betz wrote: »
    I don't agree that large code means code bloat. Try to write a GUI library and include all of the source in a single header file. You'll end up with a 100 MB header file no matter how well crafted the library is. This approach only works for small programs. Also, I like the idea of the interface to a big library being separate from its implementation. I guess this is one advantage of the header files that all Spin programmers seem to hate. Anyway, I did understand that you weren't completely serious about using this for the propgcc libraries. I just forgot to postfix my reply with a smiley face. :-)

    Hmmm. I think we may be speaking past each other here. There's no reason you can't have multiple .h header files in a project: just #include them. If a header file gets too big (in LOC) you refactor it into several new .h files. That's what I was referring to. Of course, with the inline code in headers technique the translation unit will be much larger than with traditional techniques, but that hardly matters to my modern Intel PC.

    re: interface separate: I guess this is one area where we have philosophical differences. I like having a single copy of the function signature.
  • potatohead Posts: 10,261
    edited 2014-03-02 10:48
    Personally, I thought this a great optimization for P1, which really benefits from it.

    General applicability and philosophical differences will grow as scale does. Perfectly ordinary, if you ask me. P2 programs may well reach sizes where this approach might not make the same sense it does with the P1, where I think it makes a lot of sense.
  • David Betz Posts: 14,516
    edited 2014-03-02 10:56
    potatohead wrote: »
    Personally, I thought this a great optimization for P1, which really benefits from it.

    General applicability and philosophical differences will grow as scale does. Perfectly ordinary, if you ask me. P2 programs may well reach sizes where this approach might not make the same sense it does with the P1, where I think it makes a lot of sense.
    Another thing we have to consider is that we want to remain compatible with both the C and C++ standard libraries (ignoring for the moment the "simple" libraries that go with SimpleIDE). If we assume that, then including <stdio.h> or <stdlib.h> or even <string.h> will drag in fairly large files if all of the code is included in the header file. Short of that we have to split the header files into <fopen.h>, <fclose.h>, <fread.h>, etc. That will be totally incompatible with what C or C++ programmers expect coming from other platforms. Like it or not, we'll have to add at least one more dimension to the matrix of libraries we currently build: single-task and multi-task. The single-task library won't include the TLOCK/TFREE around each access to shared resources; the multi-task library will.
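
The effect of adding a dimension to the library matrix can be counted with a short Python sketch. Only the memory models named in this thread (COG, LMM, XMM) and the proposed single-task/multi-task split are included; the real propgcc build has further compiler switches that are not listed here, so the numbers are a lower bound for illustration, and `variants` is a hypothetical helper.

```python
from itertools import product

# Dimensions taken from the discussion; the full propgcc list is not given.
memory_models = ["cog", "lmm", "xmm"]
tasking = ["single-task", "multi-task"]


def variants(*dimensions):
    """Every combination of options, one prebuilt library per combination."""
    return list(product(*dimensions))


before = variants(memory_models)            # one library per memory model
after = variants(memory_models, tasking)    # each model in both tasking modes
```

Here `len(before)` is 3 and `len(after)` is 6: every extra two-way option doubles the number of prebuilt libraries, and the same multiplication applies to any library a user builds.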
  • David Betz Posts: 14,516
    edited 2014-03-02 11:00
    (dumb post deleted)
  • potatohead Posts: 10,261
    edited 2014-03-02 11:29
    Well, there is a basic difference between getting some things done and building for the future. Those are not always compatible.
  • jmg Posts: 15,173
    edited 2014-03-02 11:31
    David Betz wrote: »
    Like it or not, we'll have to add at least one more dimension to the matrix of libraries we currently build: single-task and multi-task. The single-task library won't include the TLOCK/TFREE around each access to shared resources; the multi-task library will.

    ozpropdev's suggestion of (auto) resource-level locks would avoid this, correct?
    If each resource could have a Start-if-available, Wait-if-unavailable gateway, it does not even need to know which task actually triggered it (because two tasks cannot request a start on the same clock slice).
  • David Betz Posts: 14,516
    edited 2014-03-02 11:34
    jmg wrote: »
    ozpropdev's suggestion of (auto) resource-level locks would avoid this, correct?
    If each resource could have a Start-if-available, Wait-if-unavailable gateway, it does not even need to know which task actually triggered it (because two tasks cannot request a start on the same clock slice).
    That would work but it adds more code for the lock handling whether it be locks on individual resources or the "big hammer" TLOCK/TFREE that locks everything. A single-task program doesn't need either. However, you're correct. We could have a single version of the library if we don't care about the extra unnecessary lock code in the single-task case.
  • ctwardell Posts: 1,716
    edited 2014-03-02 12:15
    jmg wrote: »
    ozpropdev's suggestion of (auto) resource-level locks would avoid this, correct?
    If each resource could have a Start-if-available, Wait-if-unavailable gateway, it does not even need to know which task actually triggered it (because two tasks cannot request a start on the same clock slice).

    This still won't work with preemptive threads because of the condition Chip mentioned where a thread has started the use of a locked resource and then gets swapped out before finishing with the locked operation.

    The issue is that these shared resources use multiple instructions to complete, the lock needs to remain in place until all the instructions against that resource are complete, if the thread is swapped out between instructions that resource now stays locked until that thread is swapped back in.

    I think we are near the point of diminishing returns. I'd like to see the preemptive tasking wrapped up based on Chip's current plan and then move on to the SERDES, etc.

    Perhaps P3 can be an entirely different beast with all these issues considered and included from day one.

    C.W.
  • jmg Posts: 15,173
    edited 2014-03-02 12:34
    ctwardell wrote: »
    This still won't work with preemptive threads because of the condition Chip mentioned where a thread has started the use of a locked resource and then gets swapped out before finishing with the locked operation.

    The issue is that these shared resources use multiple instructions to complete, the lock needs to remain in place until all the instructions against that resource are complete, if the thread is swapped out between instructions that resource now stays locked until that thread is swapped back in.

    If the Task Multicycle resource InQueue flags were OR'd and readable, then a Full Swap handler would Freeze Task, check if InQueue, and if so, effectively single-step until NOT inQueue, and then do the Full SWAP.

    Debug use of this, would not need to check, as Debug would not be about to use Queue Resource.

    FullSwap has 100% of all slices, so nothing else can start in the meantime, so if the very first opcode of the swapped-in task is a resource-queue trigger, that is also ok.

    That may pause slightly longer than absolutely necessary (it waits on any busy) - but it is simple flags & SW, and libraries can be smaller and fewer.
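
The proposed swap handler can be sketched as a small Python model. The InQueue flag is the proposal in this post, not an existing P2 feature, and the opcode names `START` and `GET` are placeholders for "first instruction that claims a multi-cycle resource" and "instruction that collects its result". `swap_point` is a hypothetical helper.

```python
def swap_point(program, pc):
    """Return the pc at which a full swap requested at `pc` may take place.

    program: list of opcode strings; "START" sets the in-queue flag and
             "GET" clears it (placeholder names, not real instructions).
    pc:      index of the next instruction when the swap is requested.
    """
    # Work out the flag state at the moment of the request.
    in_queue = False
    for op in program[:pc]:
        if op == "START":
            in_queue = True
        elif op == "GET":
            in_queue = False

    # Freeze, then single-step the task until it has released the resource.
    while in_queue and pc < len(program):
        op = program[pc]
        pc += 1
        if op == "GET":
            in_queue = False
    return pc


program = ["ADD", "START", "NOP", "GET", "SUB"]
```

With this program, a swap requested at pc 1 (before `START` runs) happens immediately, while a request at pc 2 or 3 is deferred to pc 4, just after `GET`. The pause is bounded by the length of the start-to-collect sequence, which matches the "slightly longer than absolutely necessary" remark.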
  • potatohead Posts: 10,261
    edited 2014-03-02 12:42
    This is precisely why I said this wasn't low hanging fruit early on. We've made some design choices early on which define the sweet spots for the P2. Engineering all of this away will cause bloat and proliferation of kludges.

    Chip put a nice, simple compromise out there, which opens the door for software solutions later on. IMHO, that's the best path, not continuing to add exceptions and complexity for very little real return in performance.

    I like that we've got the option on one of the hardware threads per COG.

    If a preemptive model is needed, we've got one available with limits. Those limits really determine whether or not using it makes sense over the other use cases we've got to apply to the problem.

    The core of the design isn't pre-emptive. Until it is, this kind of thing won't make as much sense as it otherwise would. Great P3 discussion, IMHO.
  • Sapieha Posts: 2,964
    edited 2014-03-02 13:01
    Hi Guys.

    I like your discussion, but I have a question.

    Is this discussion about making the P2 C++ compatible, or about making C++ P2 compatible?
  • jmg Posts: 15,173
    edited 2014-03-02 13:39
    ctwardell wrote: »
    The docs from the release at the end of January still indicate those as single resources per cog.
    Tips for coding multi-tasking programs
    --------------------------------------
    
    While all tasks in a multi-tasking program can execute atomic instructions without any inter-task conflict,
    remember that there's only one of each of the following cog resources and only one task can use it at a time:
    
      Singular resource      Some related instructions that could cause conflicts
      ----------------------------------------------------------------------------------------------------------
      WIDE registers         RDBYTEC/RDWORDC/RDLONGC/RDWIDEC/RDWIDE/WRWIDE/SETWIDE/SETWIDZ
      INDA                   FIXINDA/FIXINDS/SETINDA/SETINDS / INDA modification via INDA usage
      INDB                   FIXINDB/FIXINDS/SETINDB/SETINDS / INDB modification via INDB usage
      PTRA                   SETPTRA/ADDPTRA/SUBPTRA / PTRA modification via RDxxxx/WRxxxx
      PTRB                   SETPTRB/ADDPTRB/SUBPTRB / PTRB modification via RDxxxx/WRxxxx
      PTRX                   SETPTRX/ADDPTRX/SUBPTRX/CALLX/RETX/PUSHX/POPX / PTRX modification via RDAUXx/WRAUXx
      PTRY                   SETPTRY/ADDPTRY/SUBPTRY/CALLY/RETY/PUSHY/POPY / PTRY modification via RDAUXx/WRAUXx
      ACCA                   SETACCA/SETACCS/MACA/SARACCA/SARACCS/CLRACCA/CLRACCS
      ACCB                   SETACCB/SETACCS/MACB/SARACCB/SARACCS/CLRACCB/CLRACCS
      32x32 multiplier       MUL32/MUL32U
      64/32 divider          FRAC/DIV32/DIV32U/DIV64/DIV64U/DIV64D
      64-bit square rooter   SQRT64/SQRT32
      CORDIC computer        QSINCOS/QARCTAN/QROTATE/QLOG/QEXP/SETQI/SETQZ
      SERA                   SETSERA/SERINA/SEROUTA
      SERB                   SETSERB/SERINB/SEROUTB
      XFR                    SETXFR
      VID                    WAITVID/SETVID/SETVIDY/SETVIDI/SETVIDQ/POLVID
      Block repeater         REPS/REPD
      CTRA                   SETCTRA/SETWAVA/SETPHSA/ADDPHSA/SUBPHSA/GETPHZA/POLCTRA/CAPCTRA/SYNCTRA
      CTRB                   SETCTRB/SETWAVB/SETPHSB/ADDPHSB/SUBPHSB/GETPHZB/POLCTRB/CAPCTRB/SYNCTRB
      PIX                    (not usable in multi-tasking, requires single-task timing)
    

    Looks like a good reason to add those extra locks I requested...

    You could wrap the usage of those instructions within a lock.

    This is an issue even without Full Swap, and words like "could cause conflicts" underline the need for some flip-flop level conflict handling. A simple, flip-flop managed wait-till-free seems the natural way to manage such a shared resource.
    (it avoids needing additional SW wrappers, and avoids library sprawl)

    That way, even lots of interleaved use still works as expected, but can run slower, as the resource is shared.

    Addendum: I think much of the logic is already there, e.g. the docs say:
    In multi-task mode, GETDIVQ/GETDIVR will jump to themselves until the result is ready,
    freeing clocks for other tasks.

    I think this needs to be expanded a tad, to encompass "other Tasks wanting GETDIVQ/GETDIVR (etc) will jump to themselves until the HW is no longer busy."

    In time domain, Busy will be slightly wider than wait-for-done (eg multi-operand starts)
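
A rough Python model of the expanded wait-till-free behaviour, under stated assumptions: tasks run in strict round-robin slots, each wants one divide, an instruction that cannot proceed simply re-executes in the task's next slot, and the resource stays busy until its result is collected. `simulate` and the latency value are illustrative only.

```python
def simulate(num_tasks, op_latency):
    """Return {task: slot in which it collected its divide result}."""
    owner = None                       # task currently holding the divider
    ready_at = 0                       # slot at which the owner's result is ready
    state = ["start"] * num_tasks      # start -> get -> done
    finished = {}
    slot = 0
    while len(finished) < num_tasks:
        task = slot % num_tasks        # round-robin time slicing
        if state[task] == "start":
            if owner is None:          # start-if-available
                owner = task
                ready_at = slot + op_latency
                state[task] = "get"
            # otherwise the start instruction jumps to itself and retries
        elif state[task] == "get":
            if slot >= ready_at:       # GETDIVQ-style wait for the result
                finished[task] = slot
                state[task] = "done"
                owner = None           # busy ends only at collection
            # otherwise the get instruction jumps to itself and retries
        slot += 1
    return finished
```

For example, `simulate(2, 4)` gives `{0: 4, 1: 9}`: task 1 only burns its own slots while waiting, and the two divides are serialized rather than corrupted. Interleaved use runs slower but still works, with no software wrapper around the instructions.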