I've got the full-speed RDWIDEA/RDWIDEB/WRWIDEA/WRWIDEB circuitry all done, I think, and it is compiling for the next hour, but I'm going to bed. This addition turned out to be quite small in logic elements, but it was a real brain-bender to implement. There was more text in the comments than there was in the Verilog code. It always seems like when things reach critical mass, you can add on big functionalities with just a little extra code. This change didn't need to create any data conduits. It just steers stuff that's already there. It took a total of about 20 flops to realize.
I don't really understand the multitasking, but reading this I was wondering whether,
when you run multiple tasks, you can hold all but one task (e.g. the master running from HUB) for some time,
like here, to do a RD/WRWIDEA/B, and then continue the other tasks.
Do tasks keep their state when they are stopped/halted/hibernated, and can they be resumed where they left off?
So in this case the master thread would pause the other threads, wait some clocks until the pipeline is clear,
do its xWIDEx stuff (which only takes a few cycles and might be OK in many cases) and then resume the other tasks.
I reread the docs and found this:
from P2 Doc:
The task identified in the bottom two bits of the SETTASK operand will be at the execution stage on
the 5th instruction cycle after SETTASK.
If a task is given no time slot, it doesn't execute and its flags and PC stay at initial values. If a
task is given a time slot, it will execute and its Z/C/PC will be updated at every instruction cycle,
or time slot, alloted to it. If an active task's time slots are all taken away, that task's Z/C/PC
remain in the state where they left off, until it is given another time slot.
So this would mean I can have more than one task running, and then my supervisor task issues a SETTASK 0 (assuming it is task 0), so now we are back to single-task mode. We can do additional work for 5 clocks
and then we can do the xWIDEy operation.
After this, a SETTASK originalValue resumes the frozen tasks where they left off.
All state is kept.
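For anyone who wants to poke at the freeze/resume idea without hardware, here is a toy Python model. It is only a sketch: the Cog class and its set_task/step methods are made up for illustration, it ignores the pipeline delay after SETTASK mentioned in the doc excerpt, and it tracks nothing but a per-task PC.

```python
# Toy model (hypothetical, not P2 hardware): `slots` is the list of task
# IDs the cog cycles through. A task with no slot keeps its PC frozen
# until it is given a slot again.

class Cog:
    def __init__(self, slots):
        self.slots = list(slots)   # e.g. [0, 1, 2, 3] for %%3210
        self.index = 0             # which slot runs next
        self.pc = [0, 0, 0, 0]     # one program counter per task

    def set_task(self, slots):
        """Stand-in for SETTASK: replace the slot list."""
        self.slots = list(slots)
        self.index = 0

    def step(self, cycles=1):
        """Run `cycles` instruction cycles, one instruction per slot."""
        for _ in range(cycles):
            task = self.slots[self.index]
            self.pc[task] += 1
            self.index = (self.index + 1) % len(self.slots)


cog = Cog([0, 1, 2, 3])
cog.step(8)                   # each task runs 2 instructions
frozen = cog.pc[1:]           # PCs of tasks 1..3 before the hog
cog.set_task([0])             # task 0 takes every slot
cog.step(10)                  # exclusive run for task 0
during = cog.pc[1:]           # tasks 1..3 did not move
cog.set_task([0, 1, 2, 3])    # restore the original slot order
cog.step(4)                   # everyone advances again
```

In this model the PCs of tasks 1..3 are the same before and during the exclusive run, and they pick up from there once the original slot list is restored.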
Yeah! Congrats. Always satisfying to knock those ones off. The comments become a pleasant read at a later date.
I was thinking tonight about how we have these new fast RDWIDEA/B / WRWIDEA/B instructions to stream data between hub memory and cogs at full speed (1 long per clock). The caveat with these is that they can only execute in single-task mode, as they require continuous pipeline feed from the same task for the instructions that route the hub data. I was wondering if there was some way to make this work during multi-tasking. I have a simple idea that would allow for this:
REPTASK #n 'repeat this task for the next 1..512 instructions, starting on the 2nd same-task instruction after REPTASK
This would inhibit the task slots from advancing for 1..512 instruction cycles, beginning on the 2nd same-task instruction after REPTASK. This would grant a task an exclusive run of instructions in which it could perform any special timed I/O, as if it were a single-task program. This has more use than just for the new hub memory instructions.
Now, I'm realizing that INDA/INDB are kind of needed in every task.
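The REPTASK timing described above can be sketched as a small Python trace generator. This is a guess at the behaviour from the description only: the function name and parameters are hypothetical, there is no pipeline in it, and it assumes the slot order simply carries on with the next slot once the exclusive run ends, which the post does not spell out.

```python
def run_with_reptask(slots, issue_cycle, n, total_cycles):
    """Return the task ID executed on each cycle (toy model).

    slots        -- repeating slot order, e.g. [0, 1, 2, 3]
    issue_cycle  -- cycle on which the running task issues REPTASK #n
    n            -- number of back-to-back instructions requested
    """
    trace = []
    index = 0            # position in the slot list
    owner = None         # task that issued REPTASK
    seen = 0             # same-task instructions since REPTASK
    remaining = 0        # exclusive instructions still to run
    for cycle in range(total_cycles):
        task = slots[index]
        trace.append(task)
        if cycle == issue_cycle:
            owner, seen = task, 0
        elif owner == task and remaining == 0:
            seen += 1
            if seen == 2:            # 2nd same-task instruction
                remaining = n        # this cycle is the first of n
                owner = None
        if remaining > 0:
            remaining -= 1
            if remaining > 0:
                continue             # hold the slot pointer
        index = (index + 1) % len(slots)
    return trace


# Task 0 issues REPTASK #5 on cycle 0 while running 4-way (%%3210).
trace = run_with_reptask([0, 1, 2, 3], issue_cycle=0, n=5, total_cycles=16)
```

With these inputs task 0 gets its normal slot on cycle 4 (the 1st same-task instruction), then holds cycles 8 through 12 before the rotation resumes.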
So you mean it is like a very big hammer that trumps the task allotment for some number of cycles?
What does this give over simply swapping the task mapping manually?
I guess it could be more granular for high swap rates?
That's a good point. In cases where you have known tasking (%%3210), you could just write your task number and hog the timing for a while, then restore %%3210. What I proposed would be a little simpler in that you wouldn't need to know your task ID and you wouldn't need to restore the original value via SETTASK. Do you think this is worth doing? I'd be happy NOT to busy things up, anymore.
I've been lamenting that we don't have INDA/INDB per task, but that circuit is already bumping the critical path, so it's not going to tolerate an additional 4-way mux, in order to accommodate per-task instances of INDA/INDB.
By having a task hog a brief run of cycles, it can use singular resources like INDA/INDB and be done with them, all in one shot. If all tasks take this approach, there is less need for per-task instances of INDA/INDB, or even MUL/DIV/SQRT/CORDIC. I think that's the way forward. A simple mechanism to grant tasks some number of contiguous instruction cycles would make this pretty easy.
That seems like a logical solution.
This would work nicely in a multi tasked video driver.
Would REPTASK #512 and RDWIDEA/B used in a SDRAM driver allow 2kbytes to be burst to hub in 1 block?
Brian
I don't think it's a lamentable situation. We're almost to the point of having 32 complete COGlets which is a crazy luxury. At some point you need to sit back and consider what will really be done on a per task basis as far as program functionality. Granting task hogs a burst of cycles to get something done makes sense especially in the face of considering the alternative costs of fully redundant resources. If something is truly "taskable" then it can run in another COG as a task and use the unused resources from that COG.
REPTASK solves a lot of issues and is easier to use than context switching via the task register, which ties up a register to save the task contents across the SETTASK save/restore period.
At some point all the resources are going to become scarce in all situations. I think you've chosen your battles well in this arena and now solutions like REPTASK are the best choices.
Since you don't need 512 longs, but maybe 16 less than that for $000..$1EF, the 9-bit constant would be adequate to accommodate the $1F0 longs plus a few setup instructions.
So, yes, you could reload the whole cog in one whack via hubexec from a 1-of-4 task, even:
(4-way multitasking and ICACHEN in effect, so no prefetch to disrupt timing)
ALIGNW 'pad 0's to start of next WIDE block, or icache line (8 instructions follow)
REPTASK #3+$1F0+1 'execute SETINDA, REPS, RDWIDEA, MOV * $1F0, and dummy MOV
LOCPTRA @newcode 'point to new code
SETINDA #0 'reset INDA (contiguous pipeline feed for this task begins)
REPS #$1F0 'ready to repeat MOV
RDWIDEA #8 'start read burst
MOV INDA++,$1F1 'load WIDE longs via $1F1 window into $000..$1EF
NOP 'two trailing instructions need to be in this same icache
NOP '..line to avoid fetching during 2nd-to-last move
This seems to be a good use case for NOT needing individual INDA/INDB for each task. If I'm reloading the cog, aren't I pretty much throwing away any established tasking code in the cog? Wouldn't the new code need to reestablish tasking since you are pulling the instructions out from under the other running but stalled tasks?
That example would reload all the cog RAM, but maybe the other tasks were in hubexec mode. If all tasks only use INDA/INDB within a contiguous block of cycles, there would be no conflict, ever. Makes me kind of wish I didn't make REPS/REPD per task, but that's okay. We really need PTRA/PTRB per task, though, to handle task stacks in hub RAM. Those pointers need to be persistent, per task. I think if we add this REPTASK instruction, a bunch of problems get solved over sharing cog resources which don't need to be task-persistent (unlike PTRA/PTRB).
What you appear to be providing us with here is a kind of timed taskLock() operation within the COG. This will be very useful to protect critical sections where atomic read-modify-writes of registers are needed, if COG register data is being shared between tasks or if other common COG resources get accessed, like INDA/INDB, I/O pins, multiplier blocks, etc. I would also suggest having the ability to cancel out of the locked state early if, say, REPTASK #0 was issued.
Maybe the instruction could be called LOCK, as it is locking out the task switching mechanism for a while?
At least for high-level C language VM tasks using hubexec, I don't envisage the need for INDA/INDB per task so much, so only having one of those registers per COG is hopefully not a major deal. The PTRA/PTRB per task was already a big win for us there. I do suspect PASM-based drivers will be able to take better advantage of the INDA/INDB resource, however. So they will just need to factor its protection into their design if/when multitasking is used.
Roger.
Going from this, what if you instead provided a lock/unlock pair of functions that just controlled the indexing/bit-shifting of the task control register? In other words:
LOCKTASK
// uninterrupted operations
FREETASK
That way, you wouldn't have to count instructions, worry about CALLs, etc.
Regardless of whether this approach or REPTASK is used, what would happen if two tasks used this instruction at the same time?
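One way to think about the two-task question is a toy Python model of the LOCKTASK/FREETASK pair. Big caveat: this is not how the hardware is known to behave. The run function, the program format and the instruction names A1, B1 and so on are all invented for illustration, and there is no pipeline here, so instructions from other tasks that are already in flight are ignored.

```python
def run(programs, slots, cycles):
    """Execute toy programs; return the list of (task, instruction) run.

    programs -- dict: task ID -> list of instruction strings
    slots    -- repeating slot order, e.g. [0, 1]
    """
    pc = {t: 0 for t in programs}
    index = 0
    locked = False
    trace = []
    for _ in range(cycles):
        task = slots[index]
        prog = programs[task]
        instr = prog[pc[task]] if pc[task] < len(prog) else "NOP"
        pc[task] += 1
        trace.append((task, instr))
        if instr == "LOCKTASK":
            locked = True        # stop the slot pointer on this task
        elif instr == "FREETASK":
            locked = False       # let the slot pointer move again
        if not locked:
            index = (index + 1) % len(slots)
    return trace


# Both tasks want a locked section "at the same time".
programs = {
    0: ["LOCKTASK", "A1", "A2", "FREETASK", "A3"],
    1: ["LOCKTASK", "B1", "FREETASK", "B2"],
}
trace = run(programs, [0, 1], 9)
```

In this simplified model the second task never gets a cycle while the first holds the lock, so its LOCKTASK just runs later and the two locked sections end up back to back. Whether real silicon behaves that way with several tasks' instructions in the pipeline is exactly the open question.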
ALIGNW 'pad 0's to start of next WIDE block, or icache line (8 instructions follow)
Could the PASM syntax be changed to distinguish between directives and instructions? I know there are only a few that have been added, but I think it makes the code more readable. I'm partial to prefixing with a period (e.g. .ALIGNW), but am fine with whatever feels most natural to the majority (or Chip).
That would be simpler. I wonder if it would encourage reckless use of task locking, though.
Interesting idea. Maybe we'll do that. I've got a few other things on the plate at the moment, though.
It seems to me that this task locking gets you into the same world as interrupts on other processors. At least it's limited to the COG that is doing the locking, but it means that the other tasks will have unpredictable execution behavior. Sure, the execution behavior will be predictable within a locked section, but what about between locked sections if more than one task is using the locking mechanism? I guess these hardware tasks destroy determinism anyway, so maybe this isn't a problem.
It's a mixed bag, but it does allow quick, exclusive use of singular resources, which is pretty useful.
I'm not quite following; do you mean this is still an opcode like REPTASK?
Or are those ASM macros that change a register?
If they are macros, I would take care to avoid using words like // uninterrupted operations,
as many might expect interrupts, and think they HAVE to do this for deterministic code.
Clearer would be words like 100% slice allocation, which remind users this is a bandwidth tool, not a deterministic requirement.
I am all for avoiding counting instructions, and I would improve the ASM mnemonic more like this:
(based on Analog Devices syntax; I figure they know what they are doing)
If the REP is a high priority (trumps other tasks, briefly), then I would name it closest to what it does, e.g. REPHIGH
' paste from Chip's above, edited to modify mnemonic
(4-way multitasking and ICACHEN in effect, so no prefetch to disrupt timing)
ALIGNW 'pad 0's to start of next WIDE block, or icache line (8 instructions follow)
REPHIGH Count, RepHighStart, RepHighEnd 'execute SETINDA, REPS, RDWIDEA, MOV * $1F0, and dummy MOV
LOCPTRA @newcode 'point to new code
RepHighStart:
SETINDA #0 'reset INDA (contiguous pipeline feed for this task begins)
REPS #$1F0 'ready to repeat MOV
RDWIDEA #8 'start read burst
MOV INDA++,$1F1 'load WIDE longs via $1F1 window into $000..$1EF
NOP 'two trailing instructions need to be in this same icache
NOP '..line to avoid fetching during 2nd-to-last move
RepHighEnd:
I like the easier 2-instruction complement even though it uses an extra instruction. It saves counting instructions and it is easy to use indented code in PASM to show the locked section.
Perhaps TSKLOCK/TSKFREE or TASKLOC/TASKFRE?
Either way, this is a great addition. It can be used to load cog or aux (video) extremely fast. Sustained blocks of 800MB/s transfer to/from hub - wow!
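For reference, the 800MB/s figure is what you get from 1 long per clock if the clock is 200 MHz. The thread does not state the clock rate, so the 200 MHz below is an assumption, and the function name is just for illustration.

```python
BYTES_PER_LONG = 4
LONGS_PER_CLOCK = 1            # "1 long per clock" from the thread


def burst_bandwidth(clock_hz):
    """Bytes per second moved during a sustained WIDE burst."""
    return BYTES_PER_LONG * LONGS_PER_CLOCK * clock_hz


ASSUMED_CLOCK_HZ = 200_000_000     # ASSUMPTION: not stated in the thread
print(burst_bandwidth(ASSUMED_CLOCK_HZ) // 1_000_000)   # 800 (MB/s, 1 MB = 10**6 bytes)
```

At a different clock the figure scales linearly, e.g. half the clock gives half the bandwidth.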
2 instructions is likely OK, but I think lock/locks is still not the right message. Novices will use more locks than they really need to.
What about Rapid? TSKRAPID/TSKSTD - invoke Rapid when you want ALL time slots, and go back to TSKSTD when you are OK for 'normal transmission' (or TSKFULL/TSKSTD or ??)
(or ME_ME_ME and As_You_Were)
I thought about those same exact names. What about TLOCK/TFREE?
Comments
A very powerful addition! Nice work Chip.
That would be simpler. I wonder if it would encourage reckless use of task locking, though.
With what we've seen people do with the P1, "reckless" could become "creative flexibility"
It is way simpler to use that way. I'll implement it as lock/free.
If this capability were ever to be extended to SPIN, would it also be called "lock/free", or would it need different name(s)?
Lock could be confused with locks?
What about focus/free?
The "focus" being on a single task.
LOL! How about URGENT / RELAX
How about KANYE/TAYLOR ?
http://www.youtube.com/watch?v=KchcU4Yy0WQ