HUB EXEC Update Here - Page 8 — Parallax Forums

Comments

  • Bill HenningBill Henning Posts: 6,445
    edited 2014-02-06 06:28
    Not for (-32..-1)*scale in PTRB which is pretty good :)
    David Betz wrote: »
    That's true but won't the code that uses these values still have to update the return address on the stack to skip over them when it returns?
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-02-06 06:39
    Adding PTRA/B for each task doesn't seem like it will be any help vs the complexity. Once hubexec starts to do lots of other hub access, many of the speed advantages disappear.

    The CALLR (fast leaf/non-leaf?) would work by using the CALL and RET (via the task 4 level stack) because only a single level gets used.

    Now for the slow CALLR, where the address needs to be saved to the hub stack: in this case only, it would be necessary to use an extra instruction and 1 clock to POP the return address off the task stack. Then this would be written to the hub stack, presumably using PTRA++.
    If this is so, then could an instruction be added, WRLONG PTRA++,pop, where pop gets the value from the task stack? This would only be a single instruction, and since we know both D and S, a simple no-operand instruction like PUSHRET could be done.

    Alternatively, could we make the task stack visible at, say, $1F1? Or, as I suggested earlier, make CALL also write to $1F1?
  • cgraceycgracey Posts: 14,208
    edited 2014-02-06 06:47
    Cluso99 wrote: »
    Adding PTRA/B for each task doesn't seem like it will be any help vs the complexity. Once hubexec starts to do lots of other hub access, many of the speed advantages disappear.


    It would mainly help in CALLA/CALLB/RETA/RETB operations, pushing and popping the stacks for you.

    As far as writing a register with the return address goes, I'd prefer $000, since that register can be remapped for 4 tasks, simplifying the logic required and facilitating greater register context for multi-tasking.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-02-06 06:50
    CALLA/CALLB use a hub stack, they are my favorites for "large" stack applications as no need for a PUSHA LR (saves one instruction plus one cycle for every non-leaf function)

    CALLX/CALLY use an aux stack, my favorite for most code, as it allows a "medium" stack and avoids hub cycles

    CALL/RET use the 4 element stack, support LR functionality - great for GCC and small cog code

    The three variations cover tiny/medium/large code, full coverage... love it.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-02-06 06:53
    My only concern with using $000 is that if the initial instruction is a CALL, it will be clobbered.

    Not a big deal, docs will just have to say location $000 gets overwritten.

    Regarding per-task PTRA/PTRB... would love it, but only if it does not slow down P2 and does not cause hub to be reduced below 256KB.
    cgracey wrote: »
    It would mainly help in CALLA/CALLB/RETA/RETB operations, pushing and popping the stacks for you.

    As far as writing a register with the return address goes, I'd prefer $000, since that register can be remapped for 4 tasks, simplifying the logic required and facilitating greater register context for multi-tasking.
  • evanhevanh Posts: 16,039
    edited 2014-02-06 06:54
    Before going to town on flexible hubexec data flows you might want to review the level of stalls from hub timing. Bursting whole frames might be more efficient. Or is there already some caching for this issue?
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-02-06 06:56
    Noticed after posting, some of this has been discussed. (this xoom is painful :( )

    I really cannot see the justification in having separate PTR pairs for each task, especially if it's in the time-critical path. I just don't see lots of multithreaded hubexec code using lots of hub accesses being written.

    There are still important things to be added like CRC, some pin pair mods, and let's not forget SERDES, and there was a counter chaining suggestion too. How much die space is remaining?
  • David BetzDavid Betz Posts: 14,516
    edited 2014-02-06 08:13
    cgracey wrote: »
    The LIFO overflows out the far end, but that doesn't matter. The last 4 values PUSH'd can always be POP'd. It's only 4 levels deep and belongs to the task, only, so nobody cares if it gets abused a little.
    Sounds perfect. Thanks!
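To make the overflow behaviour described above concrete, here is a minimal sketch in C of a 4-level LIFO that silently drops its oldest entry. The type and function names (`task_lifo`, `lifo_push`, `lifo_pop`) are hypothetical, and the model only illustrates the "last 4 values PUSH'd can always be POP'd" property; it is not a description of the actual hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIFO_DEPTH 4  /* depth stated in the thread */

/* Hypothetical software model of one task's 4-level LIFO. */
typedef struct {
    uint32_t slot[LIFO_DEPTH];  /* slot[0] is the top of the stack */
} task_lifo;

/* PUSH: every entry moves one place deeper; the oldest falls off the far end. */
static void lifo_push(task_lifo *s, uint32_t value)
{
    assert(s != NULL);
    for (int i = LIFO_DEPTH - 1; i > 0; i--)
        s->slot[i] = s->slot[i - 1];
    s->slot[0] = value;
}

/* POP: return the top entry and move the remaining entries one place up.
 * What the hardware leaves in the deepest slot is not stated in the thread;
 * this model simply keeps the old value there. */
static uint32_t lifo_pop(task_lifo *s)
{
    assert(s != NULL);
    uint32_t top = s->slot[0];
    for (int i = 0; i < LIFO_DEPTH - 1; i++)
        s->slot[i] = s->slot[i + 1];
    return top;
}
```

Pushing six values and then popping four returns the four most recent ones in reverse order; the first two pushes are simply lost, which is harmless for a leaf call that only needs the most recent return address.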
  • David BetzDavid Betz Posts: 14,516
    edited 2014-02-06 08:15
    cgracey wrote: »
    It would mainly help in CALLA/CALLB/RETA/RETB operations, pushing and popping the stacks for you.

    As far as writing a register with the return address goes, I'd prefer $000, since that register can be remapped for 4 tasks, simplifying the logic required and facilitating greater register context for multi-tasking.
    Using location $000 sounds okay to me. In fact, I thought we had agreed on that a long time ago. :-)
  • David BetzDavid Betz Posts: 14,516
    edited 2014-02-06 08:16
    evanh wrote: »
            CALL            CALLD                    call subroutine using task's 4-level stack
            RET             RETD                     return from subroutine using task's 4-level stack
    
            CALLA           CALLAD                   call subroutine using HUB[PTRA++]
            RETA            RETAD                    return from subroutine using HUB[--PTRA]
    
            CALLB           CALLBD                   call subroutine using HUB[PTRB++]
            RETB            RETBD                    return from subroutine using HUB[--PTRB]
    
            CALLX           CALLXD                   call subroutine using AUX[PTRX++]
            RETX            RETXD                    return from subroutine using AUX[--PTRX]
    
            CALLY           CALLYD                   call subroutine using AUX[!PTRY++]
            RETY            RETYD                    return from subroutine using AUX[!--PTRY]
    

    Dang! Not as simple as Bill's idea at all. :(
    I'm not sure I follow you here. What is not as simple as Bill's idea?
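The pointer notation in the quoted list can be sketched in C. This is a toy model under stated assumptions: AUX is taken to be 256 longs (the size is not given in this thread), the `!` in `AUX[!PTRY++]` is read as a bitwise inversion of the index so that the Y stack grows down from the top of AUX, and all names (`cog_model`, `callx`, `retx`, `cally`, `rety`) are hypothetical. The hub variants CALLA/CALLB follow the same post-increment and pre-decrement pattern with PTRA/PTRB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AUX_LONGS 256u  /* assumed AUX size; not stated in the thread */

/* Hypothetical model of the AUX stacks addressed by PTRX and PTRY. */
typedef struct {
    uint32_t aux[AUX_LONGS];
    uint8_t  ptrx;  /* 8-bit pointers wrap within the assumed 256 longs */
    uint8_t  ptry;
} cog_model;

/* CALLX: AUX[PTRX++] = return address */
static void callx(cog_model *c, uint32_t ret)
{
    assert(c != NULL);
    c->aux[c->ptrx] = ret;
    c->ptrx++;
}

/* RETX: return address = AUX[--PTRX] */
static uint32_t retx(cog_model *c)
{
    assert(c != NULL);
    c->ptrx--;
    return c->aux[c->ptrx];
}

/* CALLY: AUX[!PTRY++] = return address ('!' read as index inversion) */
static void cally(cog_model *c, uint32_t ret)
{
    assert(c != NULL);
    uint8_t index = (uint8_t)~c->ptry;
    c->ptry++;
    c->aux[index] = ret;
}

/* RETY: return address = AUX[!--PTRY] */
static uint32_t rety(cog_model *c)
{
    assert(c != NULL);
    c->ptry--;
    return c->aux[(uint8_t)~c->ptry];
}
```

Under this reading the X stack fills AUX from index 0 upward and the Y stack from index 255 downward, so the two can share the same memory until they meet.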
  • David BetzDavid Betz Posts: 14,516
    edited 2014-02-06 08:18
    Cluso99 wrote: »
    Noticed after posting, some of this has been discussed. (this xoom is painful :( )

    I really cannot see the justification in having separate PTR pairs for each task, especially if it's in the time-critical path. I just don't see lots of multithreaded hubexec code using lots of hub accesses being written.

    There are still important things to be added like CRC, some pin pair mods, and let's not forget SERDES, and there was a counter chaining suggestion too. How much die space is remaining?
    If a high-level language like C or Spin uses PTRx as a stack or frame pointer then we would be restricted to only a single C or Spin task per COG. That might not be a problem but we should be aware of it.
  • jmgjmg Posts: 15,175
    edited 2014-02-06 12:18
    Cluso99 wrote: »
    There are still important things to be added like CRC, some pin pair mods, and let's not forget SERDES, and there was a counter chaining suggestion too. How much die space is remaining?

    Yes, and I'm still eager to scan the counter details, on new features like HW capture, and Pin CLKin, CLKout modes.
    Atomic capture on two counters on the same edge is one detail many micros miss - it should be as simple as an alias-mapping, if the control bits are not in one register.
  • SapiehaSapieha Posts: 2,964
    edited 2014-02-06 12:38
    Hi

    To that list, maybe add Manchester encoding (it uses only 74 LEs) -- see the attachment.



    Cluso99 wrote: »
    Noticed after posting, some of this has been discussed. (this xoom is painful :( )

    I really cannot see the justification in having separate PTR pairs for each task, especially if it's in the time-critical path. I just don't see lots of multithreaded hubexec code using lots of hub accesses being written.

    There are still important things to be added like CRC, some pin pair mods, and let's not forget SERDES, and there was a counter chaining suggestion too. How much die space is remaining?
  • evanhevanh Posts: 16,039
    edited 2014-02-06 14:08
    David Betz wrote: »
    I'm not sure I follow you here. What is not as simple as Bill's idea?

    Bill's idea was for only the LIFO to appear at $1f1. This only works for the first CALL instruction in that list. I know that was all you needed, but clearly there is an advantage, which is what Chip is looking at, if all the CALLxx instructions appear there. There wouldn't be any need for non-leafs to keep shifting the return addresses onto the hub stack then.
  • David BetzDavid Betz Posts: 14,516
    edited 2014-02-06 14:18
    evanh wrote: »
    Bill's idea was for only the LIFO to appear at $1f1. This only works for the first CALL instruction in that list. I know that was all you needed, but clearly there is an advantage, which is what Chip is looking at, if all the CALLxx instructions appear there. There wouldn't be any need for non-leafs to keep shifting the return addresses onto the hub stack then.
    Yes there would because you don't want to have to adjust the hub stack for leaf functions. That means you wouldn't want to use a CALLA or CALLB instruction even if it did put its return address at $1f1 because it would also push it on a hub stack and that stack pointer would have to be adjusted to remove the return address before jumping indirect through $1f1 to return from the leaf function. This doesn't have to happen with the LIFO instructions because it doesn't matter if the stack overflows with them.
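The leaf versus non-leaf trade-off argued above can be put into a rough count. The sketch below, in C, tallies only hub stack accesses for return addresses under two schemes: a link-register style CALL, where the return address stays in the task LIFO and only non-leaf functions push and pop it on the hub stack, and a CALLA/CALLB style where every call pushes and every return pops. The names (`call_profile`, `hub_ops_lifo_call`, `hub_ops_hub_call`) are hypothetical, and the model ignores cycle counts and the extra instruction per non-leaf function that Bill mentions for the link-register style.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of a program run: how many calls went to
 * leaf functions and how many to non-leaf functions. */
typedef struct {
    unsigned leaf_calls;
    unsigned nonleaf_calls;
} call_profile;

/* CALL/RET with a link register: leaf calls never touch the hub stack;
 * each non-leaf call saves and later restores its return address there. */
static unsigned hub_ops_lifo_call(const call_profile *p)
{
    assert(p != NULL);
    return 2u * p->nonleaf_calls;
}

/* CALLA/CALLB style: every call pushes to the hub stack and every
 * return pops from it, leaf or not. */
static unsigned hub_ops_hub_call(const call_profile *p)
{
    assert(p != NULL);
    return 2u * (p->leaf_calls + p->nonleaf_calls);
}
```

For a profile of 30 leaf calls and 10 non-leaf calls this gives 20 hub stack accesses for the link-register style and 80 for the hub-stack style, which is the gap behind the preference for keeping leaf calls off the hub stack.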
  • evanhevanh Posts: 16,039
    edited 2014-02-06 14:29
    David Betz wrote: »
    If a high-level language like C or Spin uses PTRx as a stack or frame pointer then we would be restricted to only a single C or Spin task per COG. That might not be a problem but we should be aware of it.

    Ouch! Don't make that a reason not to use the PTR registers. The hardware threads are time-sliced rather than priority driven. They are best used for low-level, fine-grained soft devices. Prior to all of Chip's work on fleshing out each thread's hardware there wasn't really a good reason to try to directly support them as part of hubexec. The threads were never intended for, and still aren't that effective for, general tasking.

    There is still the much more effective SWITCH instruction for general tasking.
  • evanhevanh Posts: 16,039
    edited 2014-02-06 14:34
    David Betz wrote: »
    ... wouldn't want to use a CALLA or CALLB instruction even if it did put its return address at $1f1 because it would also push it on a hub stack and that stack pointer would have to be adjusted ...

    Oh? I think you better make that point very clear to Chip. I'm pretty certain he's off trying to make room in each Cog for all the CALLxx instructions to appear at $1f1 or $000.
  • David BetzDavid Betz Posts: 14,516
    edited 2014-02-06 15:00
    evanh wrote: »
    Ouch! Don't make that a reason not to use the PTR registers. The hardware threads are time-sliced rather than priority driven. They are best used for low-level, fine-grained soft devices. Prior to all of Chip's work on fleshing out each thread's hardware there wasn't really a good reason to try to directly support them as part of hubexec. The threads were never intended for, and still aren't that effective for, general tasking.

    There is still the much more effective SWITCH instruction for general tasking.
    I think Chip is planning to support running Spin in all four hardware tasks on a single COG. Maybe he isn't using PTRx for a stack pointer though.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-02-06 15:12
    I think the new Spin will be great, should be very useful and a lot of fun.

    Personally, I will use CALL and CALL{A|B|X|Y} depending on what is most efficient for the software. When I don't need a big stack, X&Y will be really nice. A&B are great for a big stack. For a tiny 4 level stack (or GCC) CALL is perfect.

    An embarrassment of riches!
    David Betz wrote: »
    I think Chip is planning to support running Spin in all four hardware tasks on a single COG. Maybe he isn't using PTRx for a stack pointer though.
  • evanhevanh Posts: 16,039
    edited 2014-02-06 23:28
    David Betz wrote: »
    I think Chip is planning to support running Spin in all four hardware tasks on a single COG. Maybe he isn't using PTRx for a stack pointer though.

    Too right he is, and good on him. What I'm saying is that PropGCC shouldn't be making performance sacrifices for compatibility with the hardware threads. They aren't that amazing for general multi-tasking. If the threads become viable in the end then that's great.
  • Heater.Heater. Posts: 21,230
    edited 2014-02-07 01:01
    evanh,
    They [the hardware threads ] aren't that amazing for general multi-tasking.
    Can you explain what is not amazing about them?

    I personally think they are. That's why I suggested them :)

    A case in point is that I wrote a FullDuplexSerial driver in C for the Prop 1. All the code fits in COG. It works. It works better than I thought it would. It even works up to 115200 baud.

    That driver will be a little bit easier to create using the task switch instruction on the Prop II. But not much. The task switch will only replace the current JMPRETs.

    It will be a lot easier to write such a driver using the hardware thread scheduling. It will be faster than using a cooperative task switch and it will have a lot less jitter on its output edges. It will have more accurate timing when sampling incoming bits as well. The code will be smaller, possibly allowing some other features into the COG.

    What is not amazing about that?

    Hardware scheduled threads are so amazing that XMOS has been using them on their x-core devices for years.

    The situation is a bit different with execution from HUB, but it is still very convenient to be able to write your code as simple loops that do whatever they do and not have to worry about when and where to introduce cooperative task switch instructions.

    Hardware scheduled threads seem to be still very useful in HUB code. Still amazing.
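The jitter argument above can be illustrated with a toy model in C. It assumes a simple round-robin in which each task executes a burst of instructions and then the next task runs: with hardware time slicing every burst is one instruction, while with cooperative switching the burst is however long a task runs before it yields. The function names are hypothetical, and the model ignores pipeline effects and hub-access stalls.

```c
#include <assert.h>
#include <stddef.h>

/* Instructions that elapse in one full round, given how many
 * instructions each task executes before the next task runs. */
static unsigned round_interval(const unsigned *burst, size_t tasks)
{
    assert(burst != NULL);
    unsigned total = 0;
    for (size_t i = 0; i < tasks; i++)
        total += burst[i];
    return total;
}

/* Spread (max - min) of a series of intervals: a simple jitter measure. */
static unsigned interval_spread(const unsigned *interval, size_t count)
{
    assert(interval != NULL && count > 0);
    unsigned lo = interval[0], hi = interval[0];
    for (size_t i = 1; i < count; i++) {
        if (interval[i] < lo) lo = interval[i];
        if (interval[i] > hi) hi = interval[i];
    }
    return hi - lo;
}
```

With four time-sliced tasks every round is 4 instructions long, so the spread is 0. With cooperative bursts of, say, {1, 3, 2, 1} in one round and {1, 8, 2, 1} in the next, the rounds take 7 and 12 instructions, a spread of 5 that would show up as edge jitter in a bit-banged serial driver.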
  • evanhevanh Posts: 16,039
    edited 2014-02-07 05:03
    That's all inside a Cog. The time-sliced threading is perfect there, important even now the Prop2 is so much faster.

    I believe David is trying to also cover hubexec mode. The hardware threading is really just a bonus there. However, Chip looks like he'll get it all sorted so there won't be any further concern.
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-02-07 06:10
    Ideally, execution within a task should be independent of what's happening in another task. It seems like the goal would be to allow for 1 or more tasks that run completely within cog memory while other tasks could be executing code from hub memory.

    I think the amazing thing about P2 isn't so much the individual features, such as multiple tasks and hub execution, but how it will all work together. The other thing I find amazing is how Parallax has allowed the Prop community to help in designing the P2. I've never seen that with any other commercial processor.
  • Heater.Heater. Posts: 21,230
    edited 2014-02-07 06:34
    Dave,

    I haven't really thought about this much but that idea of having a big program running from HUB whilst some other task on the same COG is running entirely in COG is quite amazing.

    We have a lot of Spin objects that use two COGs. One for high level Spin and one for a low level PASM driver. Now all that kind of thing can be done in one COG. In Spin or C.

    Hardware scheduled threads really are amazing. (If they work as smoothly as the above implies).

    And yes. The way the P2 has come together with all the input from the forum is also amazing. A unique process so far, I believe.

    The most amazing thing is the way that Chip has managed not to go insane under the barrage of suggestions and advice. He sorted the wheat from the chaff, tweaked it to his liking and put the things together in a coherent way, not just a pile of random features piled into an ugly mess like some CPUs that will remain nameless.
  • CircuitsoftCircuitsoft Posts: 1,166
    edited 2014-02-07 15:37
    cgracey wrote: »
    We could do something like that, but it wouldn't be very legible over a mere square inch.
    It doesn't need to be legible to the naked eye...
  • AleAle Posts: 2,363
    edited 2014-02-07 22:29
    cgracey wrote: »
    I spent a lot of time working out the mux tree for ALU results, as it is the biggest structure with heavy delays, and it is on par with other critical paths. The biggest mux (64 inputs) passes only registered signals. Instructions like SETS/SETD/SETI/MOVBYTS just re-orient signals, so they can go through this mux. There are two other smaller mux's for faster signals, and one final mux which inputs all three other mux's, along with some late-arriving signals from instructions like ADD/SUB/RDBYTE/ROL. This mux tree needed a lot of tuning to get optimized. The synthesis guys, early on, said to just make one huge mux and let the logic compiler sort it all out - that didn't work well. I needed to give some more guidance by forming a mux tree before the tools would compile/place/route it efficiently.

    Thanks, Chip! I see that the old rules sort of still apply when the muxes get big :).
  • BaggersBaggers Posts: 3,019
    edited 2014-02-08 12:34
    Chip, the new update has the SDRAM reader in a callable function. Is that so it works on single-cog boards? I know you work miracles, but I can't see how you somehow managed to make it so all cogs can access SDRAM at the same time.
    So if anyone wants to use more than one cog that uses SDRAM, we need to modify the SDRAM to work in a single cog again with a command list?
  • RaymanRayman Posts: 14,762
    edited 2014-02-08 13:14
    SDRAM with nano? I might have to try that...

    Speaking of SDRAM... Could we now execute code from SDRAM using the HUB RAM as some kind of page buffer?
  • roglohrogloh Posts: 5,837
    edited 2014-02-08 16:06
    Rayman wrote: »
    SDRAM with nano? I might have to try that...

    Speaking of SDRAM... Could we now execute code from SDRAM using the HUB RAM as some kind of page buffer?

    Yes I was hoping for this too. Been discussed before over here ... http://forums.parallax.com/showthread.php/152279-SDRAM-and-the-PropII?p=1227584#post1227584
  • cgraceycgracey Posts: 14,208
    edited 2014-02-08 16:34
    Baggers wrote: »
    Chip, the new update has the SDRAM reader in a callable function. Is that so it works on single-cog boards? I know you work miracles, but I can't see how you somehow managed to make it so all cogs can access SDRAM at the same time.
    So if anyone wants to use more than one cog that uses SDRAM, we need to modify the SDRAM to work in a single cog again with a command list?


    That driver was made callable so that it could work within the cog, being called often enough to keep the SDRAM refreshed. I'll need to update the SDRAM driver that takes a cog of its own, so that it works with the new WIDEs.