Now he added a new instruction to initialize the internal PC registers, so I suspect he has done away with the vector table.
There is this comment: '(these are like nop's to task 0)'
So after skipping those 4 reserved locations, you could start like this:
' 4 register INITs here, skipped by task 0, currently the 100% task.
jmptask #Share2_3,#%1000
jmptask #Share2_3,#%0100 ' or can use #%1100
jmptask #Unique1, #%0010 ' Task0 is current PC, does not need jmptask
settask TaskTimeSlicePattern
Unique0:
' from here, threads are 3..0 => Share2_3,Share2_3,Unique1,Unique0, sliced
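One possible value for TaskTimeSlicePattern (which is not defined above) - this is only a sketch, assuming SETTASK takes sixteen 2-bit task IDs as described later in the thread, and the slot ordering is a guess:
TaskTimeSlicePattern  long  %%3210_3210_3210_3210  ' sketch: tasks 3..0 each get every 4th slot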
Less clear is how the assembler 'knows' it will be using reg map, to overlay those 4 register sets, but that's a non-silicon detail.
(If this mode is now always-on, does that make porting Prop1 code harder still?)
jmg, of course it's MOD 8. I thought everyone understood that already, but it's probably worth repeating. I suppose you could limit all of the threads to non-hub instructions except for one, but that would prevent you from doing something like 4 high-speed serial ports at the same time where each one needs access to hub RAM.
I don't think multiple pipelines would require a lot of silicon. It's basically 3 or 4 32-bit registers plus the decoding logic and multiplexors on the control lines. I didn't say it would be easy, which is why I suggested it for P3. It has a huge benefit because you could keep the cog busy 100% of the time without any stalls. Of course, there is still the issue with the hub window, where multiple slices would have to queue up to wait their turn for access.
...or something like that, the example would make perfect sense. As it is, I don't see how the program starts...
Chip did add this code comment: '(these are like nop's to task 0)'
So the start auto-skips over 4 Regs, in 4 clocks.
Less clear is whether the JMP table is optional, or gone? I think it can go, as per the post above.
I suppose you could limit all of the threads to non-hub instructions except for one, but that would prevent you from doing something like 4 high-speed serial ports at the same time where each one needs access to hub RAM.
They would have to be VERY fast serial ports. Most speeds would be timer-paced, but if you wanted (say) 10MBd code-paced, that is
~1 byte/us/port, so you need 1 long write and 1 long read every 160 clocks to feed this at full duplex. Seems doable?
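Spelling out that arithmetic (the 160MHz clock is my assumption, inferred from the 160-clocks-per-microsecond figure, and ~10 bits per byte allows for start/stop bits):
' 10MBd / ~10 bits per byte = ~1 Mbyte/s = ~1 byte/us per port, per direction
' at 160MHz, 1us = 160 clocks
' => per port, roughly one hub write (RX byte) and one hub read (TX byte) every 160 clocks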
The very first thing I see coginit doing is loading the COG up @prg, with execution starting at location 0. How is that changed? Normally, on P2, we can specify that, right? Load a cog @program but start @init, kind of thing? I guess I'm also asking why they have to be nops, given that capability. One or the other of the coginit arguments could have just been the first real instruction, instead of 0... or?
I hope not. The slicing feature would mostly be needed only for multi-purpose drivers. Maybe it's time to step back and rethink some of these last-minute changes.
It sounds like it's always on, but the default settings provide for a single thread that uses all the "slots" in the pipeline, so the default would behave as if the feature were disabled.
Given that, which makes sense, how does the program shown start? The coginit arguments are both zeroes. And the mapping and other instructions are higher addresses? Seems like a label or parameter is missing.
That would be my guess, and the NOP comment from Chip suggests the 4 registers read as NOPs if the user does no explicit start.
The first real code is then at address $004, but the ROM location may also be close to this, so the memory map gets complicated quickly.
I do not see any silicon advantage to a low ROM, but I can see many users 'bumping their shins' and losing time on a low ROM, and it will make porting Prop 1 code harder.
The ideal would be a code library that can compile/assemble for either Prop 1 or Prop 2.
The multi-threading is enabled by calling SETTASK with the thread ordering. To disable it, you perform a SETTASK with all zeros, so only task 0 is running.
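A minimal sketch of that (whether SETTASK also accepts an immediate is not stated, so a register operand is used here, and the pattern value is just an example):
' sketch only - enable slicing between task1 and task0, then later drop back to task0 alone
        settask two_way         ' non-zero bit pairs: task1 and task0 alternate
        ' ...
        settask all_zero        ' all-zero pattern: only task0 runs again
two_way   long  %%1010_1010_1010_1010  ' 16 x 2-bit slots: task1,task0,task1,task0,...
all_zero  long  0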
Before a SETTASK is executed which places some non-0 bit pairs into the task switcher, nothing but task0 is running, which makes the cog act like a single-threaded machine.
Task0 starts at $000
Task1 starts at $001
Task2 starts at $002
Task3 starts at $003
Until a task actually runs, it's idle, waiting at its reset point (as listed above).
Before a task is started, or even when it's running, its execution address can be altered by the 'JMPTASK D/#n, #mask4' instruction, which gives an address and a 4-bit mask of which tasks' program counters to overwrite. JMPTASK will cancel any instructions in the pipe belonging to any affected tasks.
So, you can rely on the initial addresses and run them from those, or you can alter them via JMPTASK before you start them.
I can follow all that, but the mapped-register example you gave in post 1277 does not seem to follow this template?
Those mapped registers seem to be right on top of the jump table?
How does the chip know which mode to follow, from reset?
Yes, it's a little maddening because the register remapping occupies the bottom register space - the same place where the tasks want to start and code typically resides. To do register remapping, you need to arrange things such that the bottom of memory is for variables and the executable code is positioned above. This may practically require loading code into upper registers, initializing the bottom registers, setting the task addresses using JMPTASK, then using SETTASK to kick things into motion.
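In outline, that arrangement might look something like this (a sketch only: it assumes an all-zero long executes as a harmless no-op for task0, per Chip's 'like nop's' comment, and task1_code / slice_pat are placeholders):
        org     0
var0    long    0               ' $000..$003 - per-task working registers once remapping is on;
var1    long    0               '  task0 steps over these four longs ('these are like nop's to task 0')
var2    long    0
var3    long    0
init                            ' $004 - task0 lands here
        jmptask #task1_code, #%0010     ' aim task1 above the variable area before starting it
        settask slice_pat               ' then hand out the time slots
        ' task0's own code continues here; task1 runs from task1_code ...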
So you can still do either the jump table or register remapping?
How then does the silicon know which mode it is in?
- in register remapping, if it starts at $000, does it skip over the 4 regs as if they were NOPs, as I think your comment suggests?
- if so, what makes those first 4 locations instead act as opcodes for the jump table?
Just curious. How is this task switcher better than running several software state-machines in a common loop? A 16-slot 4-task scheduler wouldn't even be that hard to add. (my guess is less overhead and better code isolation?) It feels weird talking about adding hardware this late in the game if a well documented software solution would work nearly as well. ( >_< I want the PII analog IO pins asap!)
- it is MUCH easier to write and debug independent I/O threads with the hardware threading
- hardware threading would allow a debug thread to run concurrently in the cog with up to three user threads
- it makes it trivial to add pseudo-interrupts
- Chip says it won't delay P2
SW task switching is not deterministic, so there is a huge difference.
You can still do SW task switching, and can even combine the two, so a hardware slice, running in watchdog mode, can override a conventional co-operative tasker.
It also gives an elegant way to solve the serious DEBUG access issue - and in a way that users can take advantage of.
I do not see this so much as 'adding hardware' as more clever use of what is already there.
{ but yes, there are a few more compilable gates involved }
It very neatly solves some of the Achilles heels of the Prop, which were
* Dead Resource - before, deterministic demands could consume a COG while using under 10% of its code space and leaving it mostly idle.
Now, you can better fill that expensive COG code space, with no less determinism than Prop 1 has now.
In many cases, you will now be (easily?) able to swallow/pack multiple Prop 1 libraries into ONE Prop 2 COG.
* Code ceiling - now, with reg remap, some threads can share code but use a small set of different registers when doing so.
This means you do not have to give all threads separate code space: multiple instances of UARTs or PWM controllers can execute the same code with different parameters (see the sketch after this list).
* Lack of Live access Debug.
This will also have a BIG impact on the Prop 2 TAM.
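A rough sketch of that code-sharing idea from the list above (the register-remap setup itself is per Chip's post 1277 and is not shown, and the register names are placeholders):
' sketch only: one loop shared by two tasks - with remapping on, each task sees
' its own private copy of the low registers, so count/limit below are per-task
shared_loop
        add     count, #1               ' per-task counter (a remapped low register)
        cmp     count, limit    wz      ' per-task limit   (another remapped low register)
  if_z  mov     count, #0
        jmp     #shared_loop
Two UART or PWM instances could follow the same shape, with per-task pins and timing values held in the remapped registers.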
Looks like this will make it easy to have several drivers in one cog. I wonder what this will mean at the higher level, C code... Will this translate into being able to run 4 C threads in one cog?
Yes, you can mix and match, until you run out of CODE/Reg RAM, or time.
Eventually, of course. Full, high-level, transparent C control of what goes where would need changes to the compiler, to tell the flow which one is Thread 0 and which others are COG-destined.
Meanwhile, a moderately smart assembler for Prop 2 would allow merging multiple Lib files into an Asm-pass fill of COG/Thread/Slices. Those Lib files could be PASM from the existing C flows.
Neither, or both, depending on how you look at it.
The existing video system is there, with a few basic changes. That means a software loop driving pixels. At the higher P2 speed, pretty much any basic scheme you want will be possible. Bitmap, tile, sprites, and combinations, with scan line type drivers, or single COG drivers. Can't wait to try the threading on that. A lot will be possible!
Chip worked with Andre' and others to add support for texture mapping type operations and to do color transforms! It will be possible to output ordinary TV graphics, composite and S-video like before. The higher clock and throughput are way better matched to the higher sweep frequencies found in VGA & RGB. P2 also can do the HDTV YCrCb standard as well. I'm personally looking forward to that one. Basically, if it's analog video, we are set.
There is now a CLUT in the COG. (Color LookUp Table) I think it's 128 longs. It can be a stack when not doing video, and or used for lots of things, in the same fashion waitvid has been used outside of making video. Color transform instructions will help map color spaces to the output format being employed. This will avoid lookups and or strings of instructions to accomplish the same thing, bringing pixel rates way up. At one point 1080p YCbCr / RGB was on the table. Not sure if it still is after the synthesis / critical path stuff constrains things, but a fall back to 720p, or 1080i would be quite good, and from there we could do multi-cog drivers to get the rest of the way in a pinch, just like what was done on P1 to do a 1280 pixel VGA screen.
The biggest thing is the analog pins! Getting good color depths will be an easier task, with RAM, as always, the determining factor. 126KB or so is a LOT to work with however, and there is the external RAM hardware assist. Good times coming, video wise. The thing is going to be a playground!
Short answer: If you want a video buffer, you can have one really super easy. At TV and lower VGA pixel rates, buffers can just live right in the HUB RAM, no tiles. The cool thing about that will be the insane draw / fill rates. One could divvy up the screen into zones and just blast objects to it all over the place, running threads and COGS to get it all done from a draw list. Fast. Could easily do it all single buffer too. (See the Nyan Cat thing I did for an example of how that could work really easy, and it's not using a multi-cog draw scheme, but could...)
For someone who still finds the P1 vastly useful and satisfying, I'm amazed at how much this thread has stimulated my interest in P2. A 1000% increase I'd say. Suddenly I'm looking forward to chipmas, too.
TAM is a TLA (three-letter acronym) for Total Available Market, which can mean the total number of potential customers, or the total sales volume - as in 'area under the curve' - which is Total Available Market revenue.
Hmm. How about a direct connection to external SRAM if we want a video buffer, so that we can save the onboard RAM and use it for other stuff? That'll be cool.
It also gives an elegant way to solve the serious DEBUG access issue
Hadn't thought of that. Just run two threads, one with application code that gets 15/16th of the cycles and one that snoops variables and shoves them out to the Hub. Potentially a lot cleaner than sprinkling debug code throughout the application. Still, don't think I'd task switch like this most of the time. So far my assembly programs have too many spots where the instructions after a waitxx HAVE to execute on time or the code doesn't work (well). (a 1/2 ratio would be better in those cases)
Lawson
P.S. Much as I LOVE the Prop II analog pins, I'd love them MORE on the Prop I. The things I could do with a 100uA! :cool: (total pipe dream I know...)
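For what it's worth, a minimal sketch of that two-thread debug idea (the slot ordering, the register names, and the hub address used are my assumptions):
' sketch only - task0 gets 15 of every 16 slots, task1 is a tiny snoop loop
        jmptask #snoop, #%0010          ' point task1 at the snoop loop
        settask dbg_pat                 ' 15 slots of task0, 1 slot of task1
app
        ' ... application code runs here as task0 ...
        jmp     #app
snoop   wrlong  watched_reg, dbg_addr   ' task1: copy a watched register out to hub RAM
        jmp     #snoop
dbg_pat      long %%0000_0000_0000_0001 ' one task1 slot per 16 (ordering assumed)
watched_reg  long 0                     ' the cog register being monitored
dbg_addr     long $7000                 ' hub address for the host/debugger to read (assumption)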