Adding PTRA/B for each task doesn't seem like it will be any help versus the added complexity. Once hubexec starts to do lots of other hub accesses, many of the speed advantages disappear.
The CALLR (fast leaf/non-leaf?) would work by using CALL and RET (via the task's 4-level stack), because only a single level gets used.
Now for the slow CALLR, where the address needs to be saved to the hub stack: in this case only, it would be necessary to use an extra instruction and one clock to POP the return address off the task stack. Then this would be written to the hub stack, presumably using PTRA++.
If this is so, then could an instruction be added, WRLONG PTRA++,pop, where pop gets the value from the task stack? This would only be a single instruction, and since we know both D and S, a simple no-operand instruction like PUSHRET could be done.
Alternately, could we make the task stack visible at, say, $1F1? Or, as I suggested earlier, make CALL also write to $1F1?
It would mainly help in CALLA/CALLB/RETA/RETB operations, pushing and popping the stacks for you.
As far as writing a register with the return address goes, I'd prefer $000, since that register can be remapped for 4 tasks, simplifying the logic required and facilitating greater register context for multi-tasking.
CALLA/CALLB use a hub stack; they are my favorites for "large" stack applications, as there is no need for a PUSHA LR (saves one instruction plus one cycle for every non-leaf function)
CALLX/CALLY use an aux stack, my favorite for most code, as it allows a "medium" stack and avoids hub cycles
CALL/RET use the 4-element stack and support LR functionality - great for GCC and small cog code
The three variations cover tiny/medium/large code, full coverage... love it.
Before going to town on flexible hubexec data flows you might want to review the level of stalls from hub timing. Bursting whole frames might be more efficient. Or is there already some caching for this issue?
Noticed after posting, some of this has been discussed. (this Xoom is painful)
I really cannot see the justification for having separate PTR pairs for each task, especially if it's in the time-critical path. I just don't see lots of multithreaded hubexec code that makes lots of hub accesses being written.
There are still important things to be added, like CRC, some pin-pair mods, and let's not forget SERDES; there was a counter chaining suggestion too. How much die space is remaining?
The LIFO overflows out the far end, but that doesn't matter. The last 4 values PUSH'd can always be POP'd. It's only 4 levels deep and belongs to the task, only, so nobody cares if it gets abused a little.
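That wrap-around behaviour can be modeled in a few lines (a toy model, purely illustrative; the real per-task LIFO is four hardware registers, not a Python list):

```python
class TaskLifo:
    """Toy model of a per-task 4-level hardware LIFO.

    Pushing a 5th value silently overwrites the oldest entry
    ("overflows out the far end"); the last 4 pushes can always
    be popped.
    """
    DEPTH = 4

    def __init__(self):
        self.slots = [0] * self.DEPTH  # circular storage
        self.top = 0                   # index of next free slot

    def push(self, value):
        self.slots[self.top % self.DEPTH] = value
        self.top += 1

    def pop(self):
        self.top -= 1
        return self.slots[self.top % self.DEPTH]

lifo = TaskLifo()
for addr in (0x10, 0x20, 0x30, 0x40, 0x50):  # 5 pushes: 0x10 is lost
    lifo.push(addr)
print([hex(lifo.pop()) for _ in range(4)])  # -> ['0x50', '0x40', '0x30', '0x20']
```

The point of the model: no overflow trap, no error state; abusing the LIFO just costs you the oldest entry.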
Using location $000 sounds okay to me. In fact, I thought we had agreed on that a long time ago. :-)
CALL CALLD call subroutine using task's 4-level stack
RET RETD return from subroutine using task's 4-level stack
CALLA CALLAD call subroutine using HUB[PTRA++]
RETA RETAD return from subroutine using HUB[--PTRA]
CALLB CALLBD call subroutine using HUB[PTRB++]
RETB RETBD return from subroutine using HUB[--PTRB]
CALLX CALLXD call subroutine using AUX[PTRX++]
RETX RETXD return from subroutine using AUX[--PTRX]
CALLY CALLYD call subroutine using AUX[!PTRY++]
RETY RETYD return from subroutine using AUX[!--PTRY]
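For readers skimming the list above, the PTRA++/--PTRA notation reads as post-increment on call and pre-decrement on return. A minimal Python sketch of that convention (hub memory modeled as a dict and 4-byte longs assumed; the starting address is arbitrary):

```python
HUB = {}       # stand-in for hub memory, keyed by byte address
PTRA = 0x1000  # stack pointer, assumed to start at an arbitrary hub address

def calla_push(return_addr):
    """CALLA-style push: write at PTRA, then post-increment (HUB[PTRA++])."""
    global PTRA
    HUB[PTRA] = return_addr
    PTRA += 4  # one long

def reta_pop():
    """RETA-style pop: pre-decrement (HUB[--PTRA]), then read."""
    global PTRA
    PTRA -= 4
    return HUB[PTRA]

calla_push(0x0200)
calla_push(0x0300)           # nested call
assert reta_pop() == 0x0300  # returns unwind in LIFO order
assert reta_pop() == 0x0200
assert PTRA == 0x1000        # stack balanced again
```

The X/Y variants follow the same pattern against AUX memory instead of hub.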
Dang! Not as simple as Bill's idea at all.
I'm not sure I follow you here. What is not as simple as Bill's idea?
If a high-level language like C or Spin uses PTRx as a stack or frame pointer, then we would be restricted to only a single C or Spin task per COG. That might not be a problem, but we should be aware of it.
There are still important things to be added, like CRC, some pin-pair mods, and let's not forget SERDES; there was a counter chaining suggestion too. How much die space is remaining?
Yes, and I'm still eager to see the counter details on new features like HW capture and pin CLKin/CLKout modes.
Atomic capture of two counters on the same edge is one detail many micros miss - it should be as simple as an alias mapping, if the control bits are not in one register.
I'm not sure I follow you here. What is not as simple as Bill's idea?
Bill's idea was for only the LIFO to appear at $1f1. That only works for the first CALL instruction in that list. I know that was all you needed, but clearly there is an advantage - which is what Chip is looking at - if all the CALLxx instructions appear there. There wouldn't be any need for non-leafs to keep shifting the return addresses onto the hub stack then.
Yes there would, because you don't want to have to adjust the hub stack for leaf functions. That means you wouldn't want to use a CALLA or CALLB instruction even if it did put its return address at $1f1, because it would also push it on a hub stack, and that stack pointer would have to be adjusted to remove the return address before jumping indirect through $1f1 to return from the leaf function. This doesn't have to happen with the LIFO instructions, because it doesn't matter if their stack overflows.
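The leaf-call asymmetry described above can be put in toy form. In this Python sketch the function names and the one-access-per-long accounting are my own illustrative assumptions, not P2 specifics; it just counts hub traffic for a leaf routine called each way:

```python
# Toy accounting of hub traffic for a leaf call, under the assumptions in
# the post: the 4-level LIFO is free to abuse, the hub stack is not.

hub_accesses = 0

def call_via_lifo(lifo, ret):
    lifo.append(ret)   # per-task LIFO: no hub access, overflow is harmless

def ret_via_lifo(lifo):
    return lifo.pop()  # still no hub access

def call_via_hub(stack, ret):
    global hub_accesses
    stack.append(ret)  # WRLONG-style push via PTRA++ : one hub access
    hub_accesses += 1

def ret_via_hub(stack):
    global hub_accesses
    hub_accesses += 1  # RDLONG-style pop via --PTRA : one hub access
    return stack.pop()

lifo, hub_stack = [], []
call_via_lifo(lifo, 0x40)
ret_via_lifo(lifo)          # leaf via LIFO: zero hub windows
call_via_hub(hub_stack, 0x40)
ret_via_hub(hub_stack)      # leaf via hub stack: two hub windows
print(hub_accesses)  # -> 2
```

That is the cost a leaf function avoids by staying on the LIFO, which is why pushing the return address onto the hub stack is worth deferring to non-leaf callers only.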
If a high-level language like C or Spin uses PTRx as a stack or frame pointer, then we would be restricted to only a single C or Spin task per COG. That might not be a problem, but we should be aware of it.
Ouch! Don't make that a reason not to use the PTR registers. The hardware threads are time-sliced rather than priority-driven. They are best used for low-level, fine-grained soft devices. Prior to all of Chip's work on fleshing out the threading hardware, there wasn't really a good reason to try to directly support them as part of hubexec. The threads were never intended for, and still aren't that effective for, general tasking.
There is still the much more effective SWITCH instruction for general tasking.
... wouldn't want to use a CALLA or CALLB instruction even if it did put its return address at $1f1 because it would also push it on a hub stack and that stack pointer would have to be adjusted ...
Oh? I think you better make that point very clear to Chip. I'm pretty certain he's off trying to make room in each Cog for all the CALLxx instructions to appear at $1f1 or $000.
I think Chip is planning to support running Spin in all four hardware tasks on a single COG. Maybe he isn't using PTRx for a stack pointer though.
I think the new Spin will be great; it should be very useful and a lot of fun.
Personally, I will use CALL and CALL{A|B|X|Y} depending on what is most efficient for the software. When I don't need a big stack, X&Y will be really nice. A&B are great for a big stack. For a tiny 4 level stack (or GCC) CALL is perfect.
I think Chip is planning to support running Spin in all four hardware tasks on a single COG. Maybe he isn't using PTRx for a stack pointer though.
Too right he is, and good on him. What I'm saying is that PropGCC shouldn't be making performance sacrifices for compatibility with the hardware threads. They aren't that amazing for general multi-tasking. If the threads become viable in the end then that's great.
They [the hardware threads] aren't that amazing for general multi-tasking.
Can you explain what is not amazing about them?
I personally think they are. That's why I suggested them.
A case in point is that I wrote a FullDuplexSerial driver in C for the Prop 1. All the code fits in COG. It works, and works better than I thought it would. It even works up to 115200 baud.
That driver will be a little bit easier to create using the task switch instruction on the Prop II. But not much. The task switch will only replace the current JMPRETs.
It will be a lot easier to write such a driver using the hardware thread scheduling. It will be faster than using a cooperative task switch, and it will have a lot less jitter on its output edges. It will have more accurate timing of the sampling of incoming bits as well. The code will be smaller, possibly allowing some other features into the COG.
What is not amazing about that?
Hardware scheduled threads are so amazing that XMOS has been using them on their x-core devices for years.
The situation is a bit different with execution from HUB, but it is still very convenient to be able to write your code as simple loops that do whatever they do, and not have to worry about when and where to introduce cooperative task switch instructions.
Hardware scheduled threads seem to be still very useful in HUB code. Still amazing.
That's all inside a Cog. The time-sliced threading is perfect there, important even now that the Prop2 is so much faster.
I believe David is trying to also cover hubexec mode. The hardware threading is really just a bonus there. However, Chip looks like he'll get it all sorted so there won't be any further concern.
Ideally, execution within a task should be independent of what's happening in another task. It seems like the goal would be to allow for 1 or more tasks that run completely within cog memory while other tasks could be executing code from hub memory.
I think the amazing thing about P2 isn't so much the individual features, such as multiple tasks and hub execution, but how it will all work together. The other thing I find amazing is how Parallax has allowed the Prop community to help in designing the P2. I've never seen that with any other commercial processor.
I haven't really thought about this much but that idea of having a big program running from HUB whilst some other task on the same COG is running entirely in COG is quite amazing.
We have a lot of Spin objects that use two COGs. One for high level Spin and one for a low level PASM driver. Now all that kind of thing can be done in one COG. In Spin or C.
Hardware scheduled threads really are amazing. (If they work as smoothly as the above implies).
And yes, the way the P2 has come together with all the input from the forum is also amazing. A unique process so far, I believe.
The most amazing thing is the way that Chip has managed not to go insane under the barrage of suggestions and advice. He sorted the wheat from the chaff, tweaked it to his liking, and put the things together in a coherent way - not just a pile of random features heaped into an ugly mess like some CPUs that will remain nameless.
I spent a lot of time working out the mux tree for ALU results, as it is the biggest structure with heavy delays, and it is on par with other critical paths. The biggest mux (64 inputs) passes only registered signals. Instructions like SETS/SETD/SETI/MOVBYTS just re-orient signals, so they can go through this mux. There are two other smaller muxes for faster signals, and one final mux which takes in all three other muxes, along with some late-arriving signals from instructions like ADD/SUB/RDBYTE/ROL. This mux tree needed a lot of tuning to get optimized. The synthesis guys, early on, said to just make one huge mux and let the logic compiler sort it all out - that didn't work well. I needed to give some more guidance by forming a mux tree before the tools would compile/place/route it efficiently.
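Chip's mux-tree guidance can be sketched behaviourally: sub-muxes grouped by signal timing, with one final mux choosing among their outputs. The grouping and widths below are illustrative only, not the actual P2 netlist:

```python
def mux(inputs, sel):
    """Behavioural n-input mux: select one input."""
    return inputs[sel]

def mux_tree(fast, slow, late, sel_group, sel_within):
    """Two-level selection: one big mux over the registered (slow-path)
    signals, smaller muxes for the faster and late-arriving signals,
    and a final mux combining the three sub-mux outputs.
    Grouping is illustrative, not the real P2 structure."""
    groups = [
        mux(slow, sel_within),                  # e.g. the 64-input mux of registered signals
        mux(fast, sel_within % len(fast)),      # smaller mux for faster signals
        mux(late, sel_within % len(late)),      # late results (ADD/SUB/RDBYTE-style)
    ]
    return mux(groups, sel_group)

# 64 registered inputs, plus a few fast and late signals
slow = list(range(64))
fast = [100, 101]
late = [200, 201, 202]
print(mux_tree(fast, slow, late, sel_group=0, sel_within=42))  # -> 42
```

The pre-formed tree is the "guidance": rather than one flat 70-ish-input mux, the tools see three small selections feeding one final one, which places and routes more predictably.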
Thanks Chip! I see that the old rules still sort of apply when the muxes get big.
Chip, the new update has the SDRAM reader in a callable function. Is that so it works on single-cog boards? I know you work miracles, but I can't see how you could have managed to make it so all cogs can access SDRAM at the same time.
So if anyone wants to use more than one cog that uses SDRAM, we need to modify the SDRAM to work in a single cog again with a command list?
That driver was made callable so that it could work within the cog, being called often enough to keep the SDRAM refreshed. I'll need to update the SDRAM driver that takes a cog of its own, so that it works with the new WIDEs.
Comments
As far as writing a register with the return address goes, I'd prefer $000, since that register can be remapped for 4 tasks, simplifying the logic required and facilitating greater register context for multi-tasking.
Not a big deal, docs will just have to say location $000 gets overwritten.
Regarding per-task PTRA/PTRB... I would love it, but only if it does not slow down the P2 and does not cause hub RAM to be reduced below 256KB.
To that list, maybe Manchester encoding (uses only 74 LEs) -- see the attachment.
Personally, I will use CALL and CALL{A|B|X|Y} depending on what is most efficient for the software. When I don't need a big stack, X&Y will be really nice. A&B are great for a big stack. For a tiny 4 level stack (or GCC) CALL is perfect.
An embarrassment of riches!
Speaking of SDRAM... Could we now execute code from SDRAM using the HUB RAM as some kind of page buffer?
Yes I was hoping for this too. Been discussed before over here ... http://forums.parallax.com/showthread.php/152279-SDRAM-and-the-PropII?p=1227584#post1227584
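A toy sketch of that idea: hub RAM caching a few SDRAM pages, filled on demand. The page size, page count, and LRU eviction below are arbitrary choices for illustration, not a proposal for the actual driver:

```python
from collections import OrderedDict

PAGE_SIZE = 256  # bytes per page, arbitrary for the sketch

def make_sdram(size):
    """Stand-in for external SDRAM: byte i holds i & 0xFF."""
    return bytes(i & 0xFF for i in range(size))

class PageBuffer:
    """Hub-RAM page cache over SDRAM with LRU eviction."""
    def __init__(self, sdram, n_pages):
        self.sdram = sdram
        self.n_pages = n_pages
        self.pages = OrderedDict()  # page number -> page bytes held in hub RAM
        self.misses = 0

    def read_byte(self, addr):
        page = addr // PAGE_SIZE
        if page not in self.pages:
            self.misses += 1  # in hardware this would trigger an SDRAM burst
            start = page * PAGE_SIZE
            self.pages[page] = self.sdram[start:start + PAGE_SIZE]
            if len(self.pages) > self.n_pages:
                self.pages.popitem(last=False)  # evict least recently used
        self.pages.move_to_end(page)  # mark as most recently used
        return self.pages[page][addr % PAGE_SIZE]

buf = PageBuffer(make_sdram(4096), n_pages=2)
buf.read_byte(0)
buf.read_byte(1)    # same page: still one miss
buf.read_byte(300)  # second page: another miss (another burst)
print(buf.misses)   # -> 2
```

An instruction fetcher running from "SDRAM" would do the same thing per fetch, which is why hit rate in the hub page buffer would dominate execution speed.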