Adding PTRA/B for each task doesn't seem like it will be any help versus the added complexity. Once hubexec starts to do lots of other hub accesses, many of the speed advantages disappear.
The CALLR (fast leaf/non-leaf?) would work by using CALL and RET (via the task's 4-level stack), because only a single level gets used.
Now for the slow CALLR, where the address needs to be saved to the hub stack: in this case only, it would be necessary to use an extra instruction and one clock to POP the return address off the task stack. Then this would be written to the hub stack, presumably using PTRA++.
If this is so, then could an instruction be added, WRLONG PTRA++,pop, where pop gets the value from the task stack? This would only be a single instruction, and since we know both D and S, a simple no-operand instruction like PUSHRET could be done.
Alternately, could we make the task stack visible at, say, $1F1? Or, as I suggested earlier, make CALL also write to $1F1?
It would mainly help in CALLA/CALLB/RETA/RETB operations, pushing and popping the stacks for you.
As far as writing a register with the return address goes, I'd prefer $000, since that register can be remapped for 4 tasks, simplifying the logic required and facilitating greater register context for multi-tasking.
CALLA/CALLB use a hub stack; they are my favorites for "large" stack applications, as there is no need for a PUSHA LR (saves one instruction plus one cycle for every non-leaf function)
CALLX/CALLY use an aux stack, my favorite for most code, as it allows a "medium" stack and avoids hub cycles
CALL/RET use the 4-element stack and support LR functionality - great for GCC and small cog code
The three variations cover tiny/medium/large code, full coverage... love it.
Before going to town on flexible hubexec data flows you might want to review the level of stalls from hub timing. Bursting whole frames might be more efficient. Or is there already some caching for this issue?
Noticed after posting, some of this has been discussed. (this Xoom is painful)
I really cannot see the justification for having separate PTR pairs for each task, especially if it's in the time-critical path. I just don't see lots of multithreaded hubexec code that makes lots of hub accesses being written.
There are still important things to be added, like CRC, some pin-pair mods, and let's not forget SERDES; there was a counter chaining suggestion too. How much die space is remaining?
The LIFO overflows out the far end, but that doesn't matter. The last 4 values PUSH'd can always be POP'd. It's only 4 levels deep and belongs to the task, only, so nobody cares if it gets abused a little.
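That wrap-around behaviour can be modeled in a few lines (a toy model, purely illustrative; the real per-task LIFO is four hardware registers, not a Python list):

```python
class TaskLifo:
    """Toy model of a per-task 4-level hardware LIFO.

    Pushing a 5th value silently overwrites the oldest entry
    ("overflows out the far end"); the last 4 pushes can always
    be popped.
    """
    DEPTH = 4

    def __init__(self):
        self.slots = [0] * self.DEPTH  # circular storage
        self.top = 0                   # index of next free slot

    def push(self, value):
        self.slots[self.top % self.DEPTH] = value
        self.top += 1

    def pop(self):
        self.top -= 1
        return self.slots[self.top % self.DEPTH]

lifo = TaskLifo()
for addr in (0x10, 0x20, 0x30, 0x40, 0x50):  # 5 pushes: 0x10 is lost
    lifo.push(addr)
print([hex(lifo.pop()) for _ in range(4)])  # -> ['0x50', '0x40', '0x30', '0x20']
```

The point of the model: no overflow trap, no error state; abusing the LIFO just costs you the oldest entry.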
Using location $000 sounds okay to me. In fact, I thought we had agreed on that a long time ago. :-)
CALL CALLD call subroutine using task's 4-level stack
RET RETD return from subroutine using task's 4-level stack
CALLA CALLAD call subroutine using HUB[PTRA++]
RETA RETAD return from subroutine using HUB[--PTRA]
CALLB CALLBD call subroutine using HUB[PTRB++]
RETB RETBD return from subroutine using HUB[--PTRB]
CALLX CALLXD call subroutine using AUX[PTRX++]
RETX RETXD return from subroutine using AUX[--PTRX]
CALLY CALLYD call subroutine using AUX[!PTRY++]
RETY RETYD return from subroutine using AUX[!--PTRY]
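For readers skimming the list above, the PTRA++/--PTRA notation reads as post-increment on call and pre-decrement on return. A minimal Python sketch of that convention (hub memory modeled as a dict and 4-byte longs assumed; the starting address is arbitrary):

```python
HUB = {}       # stand-in for hub memory, keyed by byte address
PTRA = 0x1000  # stack pointer, assumed to start at an arbitrary hub address

def calla_push(return_addr):
    """CALLA-style push: write at PTRA, then post-increment (HUB[PTRA++])."""
    global PTRA
    HUB[PTRA] = return_addr
    PTRA += 4  # one long

def reta_pop():
    """RETA-style pop: pre-decrement (HUB[--PTRA]), then read."""
    global PTRA
    PTRA -= 4
    return HUB[PTRA]

calla_push(0x0200)
calla_push(0x0300)           # nested call
assert reta_pop() == 0x0300  # returns unwind in LIFO order
assert reta_pop() == 0x0200
assert PTRA == 0x1000        # stack balanced again
```

The X/Y variants follow the same pattern against AUX memory instead of hub.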
Dang! Not as simple as Bill's idea at all.
I'm not sure I follow you here. What is not as simple as Bill's idea?
If a high-level language like C or Spin uses PTRx as a stack or frame pointer, then we would be restricted to only a single C or Spin task per COG. That might not be a problem, but we should be aware of it.
There are still important things to be added, like CRC, some pin-pair mods, and let's not forget SERDES; there was a counter chaining suggestion too. How much die space is remaining?
Yes, and I'm still eager to see the counter details on new features like HW capture and pin CLKin/CLKout modes.
Atomic capture of two counters on the same edge is one detail many micros miss - it should be as simple as an alias mapping, if the control bits are not in one register.
I'm not sure I follow you here. What is not as simple as Bill's idea?
Bill's idea was for only the LIFO to appear at $1f1. That only works for the first CALL instruction in that list. I know that was all you needed, but clearly there is an advantage - which is what Chip is looking at - if all the CALLxx instructions appear there. There wouldn't be any need for non-leafs to keep shifting the return addresses onto the hub stack then.
Yes there would, because you don't want to have to adjust the hub stack for leaf functions. That means you wouldn't want to use a CALLA or CALLB instruction even if it did put its return address at $1f1, because it would also push it on a hub stack, and that stack pointer would have to be adjusted to remove the return address before jumping indirect through $1f1 to return from the leaf function. This doesn't have to happen with the LIFO instructions, because it doesn't matter if their stack overflows.
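The leaf-call asymmetry described above can be put in toy form. In this Python sketch the function names and the one-access-per-long accounting are my own illustrative assumptions, not P2 specifics; it just counts hub traffic for a leaf routine called each way:

```python
# Toy accounting of hub traffic for a leaf call, under the assumptions in
# the post: the 4-level LIFO is free to abuse, the hub stack is not.

hub_accesses = 0

def call_via_lifo(lifo, ret):
    lifo.append(ret)   # per-task LIFO: no hub access, overflow is harmless

def ret_via_lifo(lifo):
    return lifo.pop()  # still no hub access

def call_via_hub(stack, ret):
    global hub_accesses
    stack.append(ret)  # WRLONG-style push via PTRA++ : one hub access
    hub_accesses += 1

def ret_via_hub(stack):
    global hub_accesses
    hub_accesses += 1  # RDLONG-style pop via --PTRA : one hub access
    return stack.pop()

lifo, hub_stack = [], []
call_via_lifo(lifo, 0x40)
ret_via_lifo(lifo)          # leaf via LIFO: zero hub windows
call_via_hub(hub_stack, 0x40)
ret_via_hub(hub_stack)      # leaf via hub stack: two hub windows
print(hub_accesses)  # -> 2
```

That is the cost a leaf function avoids by staying on the LIFO, which is why pushing the return address onto the hub stack is worth deferring to non-leaf callers only.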
If a high-level language like C or Spin uses PTRx as a stack or frame pointer, then we would be restricted to only a single C or Spin task per COG. That might not be a problem, but we should be aware of it.
Ouch! Don't make that a reason not to use the PTR registers. The hardware threads are time-sliced rather than priority-driven. They are best used for low-level, fine-grained soft devices. Prior to all of Chip's work on fleshing out the threading hardware, there wasn't really a good reason to try to directly support them as part of hubexec. The threads were never intended for, and still aren't that effective for, general tasking.
There is still the much more effective SWITCH instruction for general tasking.
... wouldn't want to use a CALLA or CALLB instruction even if it did put its return address at $1f1 because it would also push it on a hub stack and that stack pointer would have to be adjusted ...
Oh? I think you better make that point very clear to Chip. I'm pretty certain he's off trying to make room in each Cog for all the CALLxx instructions to appear at $1f1 or $000.
I think Chip is planning to support running Spin in all four hardware tasks on a single COG. Maybe he isn't using PTRx for a stack pointer though.
I think the new Spin will be great; it should be very useful and a lot of fun.
Personally, I will use CALL and CALL{A|B|X|Y} depending on what is most efficient for the software. When I don't need a big stack, X&Y will be really nice. A&B are great for a big stack. For a tiny 4 level stack (or GCC) CALL is perfect.
I think Chip is planning to support running Spin in all four hardware tasks on a single COG. Maybe he isn't using PTRx for a stack pointer though.
Too right he is, and good on him. What I'm saying is that PropGCC shouldn't be making performance sacrifices for compatibility with the hardware threads. They aren't that amazing for general multi-tasking. If the threads become viable in the end then that's great.
They [the hardware threads] aren't that amazing for general multi-tasking.
Can you explain what is not amazing about them?
I personally think they are. That's why I suggested them.
A case in point is that I wrote a FullDuplexSerial driver in C for the Prop 1. All the code fits in COG. It works, and works better than I thought it would. It even works up to 115200 baud.
That driver will be a little bit easier to create using the task switch instruction on the Prop II. But not much. The task switch will only replace the current JMPRETs.
It will be a lot easier to write such a driver using the hardware thread scheduling. It will be faster than using a cooperative task switch, and it will have a lot less jitter on its output edges. It will have more accurate timing of the sampling of incoming bits as well. The code will be smaller, possibly allowing some other features into the COG.
What is not amazing about that?
Hardware scheduled threads are so amazing that XMOS has been using them on their x-core devices for years.
The situation is a bit different with execution from HUB, but it is still very convenient to be able to write your code as simple loops that do whatever they do, and not have to worry about when and where to introduce cooperative task switch instructions.
Hardware scheduled threads seem to be still very useful in HUB code. Still amazing.
That's all inside a Cog. The time-sliced threading is perfect there, important even now that the Prop2 is so much faster.
I believe David is trying to also cover hubexec mode. The hardware threading is really just a bonus there. However, Chip looks like he'll get it all sorted so there won't be any further concern.
Ideally, execution within a task should be independent of what's happening in another task. It seems like the goal would be to allow for 1 or more tasks that run completely within cog memory while other tasks could be executing code from hub memory.
I think the amazing thing about P2 isn't so much the individual features, such as multiple tasks and hub execution, but how it will all work together. The other thing I find amazing is how Parallax has allowed the Prop community to help in designing the P2. I've never seen that with any other commercial processor.
I haven't really thought about this much but that idea of having a big program running from HUB whilst some other task on the same COG is running entirely in COG is quite amazing.
We have a lot of Spin objects that use two COGs. One for high level Spin and one for a low level PASM driver. Now all that kind of thing can be done in one COG. In Spin or C.
Hardware scheduled threads really are amazing. (If they work as smoothly as the above implies).
And yes, the way the P2 has come together with all the input from the forum is also amazing. A unique process so far, I believe.
The most amazing thing is the way that Chip has managed not to go insane under the barrage of suggestions and advice. He sorted the wheat from the chaff, tweaked it to his liking, and put the things together in a coherent way - not just a pile of random features heaped into an ugly mess like some CPUs that will remain nameless.
I spent a lot of time working out the mux tree for ALU results, as it is the biggest structure with heavy delays, and it is on par with other critical paths. The biggest mux (64 inputs) passes only registered signals. Instructions like SETS/SETD/SETI/MOVBYTS just re-orient signals, so they can go through this mux. There are two other smaller muxes for faster signals, and one final mux which takes in all three other muxes, along with some late-arriving signals from instructions like ADD/SUB/RDBYTE/ROL. This mux tree needed a lot of tuning to get optimized. The synthesis guys, early on, said to just make one huge mux and let the logic compiler sort it all out - that didn't work well. I needed to give some more guidance by forming a mux tree before the tools would compile/place/route it efficiently.
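Chip's mux-tree guidance can be sketched behaviourally: sub-muxes grouped by signal timing, with one final mux choosing among their outputs. The grouping and widths below are illustrative only, not the actual P2 netlist:

```python
def mux(inputs, sel):
    """Behavioural n-input mux: select one input."""
    return inputs[sel]

def mux_tree(fast, slow, late, sel_group, sel_within):
    """Two-level selection: one big mux over the registered (slow-path)
    signals, smaller muxes for the faster and late-arriving signals,
    and a final mux combining the three sub-mux outputs.
    Grouping is illustrative, not the real P2 structure."""
    groups = [
        mux(slow, sel_within),                  # e.g. the 64-input mux of registered signals
        mux(fast, sel_within % len(fast)),      # smaller mux for faster signals
        mux(late, sel_within % len(late)),      # late results (ADD/SUB/RDBYTE-style)
    ]
    return mux(groups, sel_group)

# 64 registered inputs, plus a few fast and late signals
slow = list(range(64))
fast = [100, 101]
late = [200, 201, 202]
print(mux_tree(fast, slow, late, sel_group=0, sel_within=42))  # -> 42
```

The pre-formed tree is the "guidance": rather than one flat 70-ish-input mux, the tools see three small selections feeding one final one, which places and routes more predictably.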
Thanks Chip! I see that the old rules still sort of apply when the muxes get big.
Chip, the new update has the SDRAM reader in a callable function. Is that so it works on single-cog boards? I know you work miracles, but I can't see how you could have managed to make it so all cogs can access SDRAM at the same time.
So if anyone wants to use more than one cog that uses SDRAM, we need to modify the SDRAM to work in a single cog again with a command list?
That driver was made callable so that it could work within the cog, being called often enough to keep the SDRAM refreshed. I'll need to update the SDRAM driver that takes a cog of its own, so that it works with the new WIDEs.
Comments
As far as writing a register with the return address goes, I'd prefer $000, since that register can be remapped for 4 tasks, simplifying the logic required and facilitating greater register context for multi-tasking.
Not a big deal, docs will just have to say location $000 gets overwritten.
Regarding per-task PTRA/PTRB... I would love it, but only if it does not slow down the P2 and does not cause hub RAM to be reduced below 256KB.
To that list, maybe Manchester encoding (uses only 74 LEs) -- see the attachment.
Personally, I will use CALL and CALL{A|B|X|Y} depending on what is most efficient for the software. When I don't need a big stack, X&Y will be really nice. A&B are great for a big stack. For a tiny 4 level stack (or GCC) CALL is perfect.
An embarrassment of riches!
Speaking of SDRAM... Could we now execute code from SDRAM using the HUB RAM as some kind of page buffer?
Yes I was hoping for this too. Been discussed before over here ... http://forums.parallax.com/showthread.php/152279-SDRAM-and-the-PropII?p=1227584#post1227584
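A toy sketch of that idea: hub RAM caching a few SDRAM pages, filled on demand. The page size, page count, and LRU eviction below are arbitrary choices for illustration, not a proposal for the actual driver:

```python
from collections import OrderedDict

PAGE_SIZE = 256  # bytes per page, arbitrary for the sketch

def make_sdram(size):
    """Stand-in for external SDRAM: byte i holds i & 0xFF."""
    return bytes(i & 0xFF for i in range(size))

class PageBuffer:
    """Hub-RAM page cache over SDRAM with LRU eviction."""
    def __init__(self, sdram, n_pages):
        self.sdram = sdram
        self.n_pages = n_pages
        self.pages = OrderedDict()  # page number -> page bytes held in hub RAM
        self.misses = 0

    def read_byte(self, addr):
        page = addr // PAGE_SIZE
        if page not in self.pages:
            self.misses += 1  # in hardware this would trigger an SDRAM burst
            start = page * PAGE_SIZE
            self.pages[page] = self.sdram[start:start + PAGE_SIZE]
            if len(self.pages) > self.n_pages:
                self.pages.popitem(last=False)  # evict least recently used
        self.pages.move_to_end(page)  # mark as most recently used
        return self.pages[page][addr % PAGE_SIZE]

buf = PageBuffer(make_sdram(4096), n_pages=2)
buf.read_byte(0)
buf.read_byte(1)    # same page: still one miss
buf.read_byte(300)  # second page: another miss (another burst)
print(buf.misses)   # -> 2
```

An instruction fetcher running from "SDRAM" would do the same thing per fetch, which is why hit rate in the hub page buffer would dominate execution speed.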