... So, to do a single-step, you would do something like SETTASK #%%10 followed by SETTASK #%%1 (assuming task 0 is the target task and task 1 is the scheduler task).
Good, so phase is under precise control.
This would also allow a skim-step option in a debugger, where it could allocate 15 slots to the target and 1 to the kernel, for 15x the coverage per skim-step command.
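As a rough illustration of that slot split (assuming SETTASK's 32-bit operand is read as 16 two-bit slot assignments, i.e. 16 base-4 digits, with task 0 as the target and task 1 as the kernel):
SETTASK #%%1000000000000000 ' 1 slot to the kernel (task 1), 15 slots to the target (task 0) - the skim-step split
SETTASK #%%1111111111111111 ' every slot back to the kernel while it examines the target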
I don't think that the TWAIT grain matters, as it will take many cycles for the scheduler to notice it, and handle it.
I was thinking more of Debug, where flight time may be what you are trying to measure, and the poll loop would be all you are doing.
(There would be some minimum practical time, as after setting up the 'GO' the kernel needs to get ready for the 'DONE' echo.)
Perhaps there is headroom in the WAITPxx decode space to include an intra-cog flag?
I went to start implementing this and realized that you cannot state-selectively stop a task, because there are likely going to be multiple pieces of it in the pipeline on any cycle. You can stop a task by taking its time slots away and letting it exhaust itself through the pipeline, but that's it. You can't control where it's going to land and what states it's going to be in, so it IS necessary to track all the state data, after all.
This is not all bad, as it means we can have finer-grained multitasking and single-stepping and use Bill's idea to have the scheduler task 'map in' the threaded task's PTRA/PTRB/LIFO for easy access. This is all way easier to think about, anyway. We'll need instructions to read and write a task's REPS/REPD states, its TLOCK/AUGS/AUGD/delayed-branch pending states, and its delayed-branch address, as well as AUGS/AUGD values.
I thought about it some more, and there is no need for the 'H' - these instructions affect the task, the thread is purely a software construct built with their capabilities!
TSTOP savereg,#1..3 ' only called by the scheduler task
TRUN savereg,#1..3 ' only called by the scheduler task
TSTEP savereg,#1..3 ' only called by the scheduler task
TWAIT #n ' new instruction! explanation below - NOT TO BE CALLED BY THE SCHEDULER
There are two other usage cases that should be addressed:
1) A task/thread executing a breakpoint
2) A thread voluntarily yielding as it is waiting for some event (time, signal, socket, etc)
As a breakpoint can be considered as the thread waiting for the debugger, I think one instruction can handle all of the above.
In all of these cases, the thread has to get the attention of the scheduler. We can do this without adding any logic!
TWAIT #n ' write N to $1F1, and wait forever (TSTOP will stop the task, and TRUN will resume at the next address, right after the TWAIT)
We have two easy-to-use locations in a cog that are not normally loaded:
$1F1 - TWAIT value
$1F0 - savereg
So basically, the scheduler will in its scheduling loop do the equivalent of:
TJNZ $1F1, #thread_waiting
and code can then decode the reason the thread is waiting, which can be one of:
- breakpoint (say 0..255)
- waiting for a signal/event/timeout (indicated by 256..511)
Note the signal values are totally arbitrary.
TWAIT completes the set - allows for threads to yield, to wait for elapsed time, and also gives us breakpoints!
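A minimal sketch of that check in the scheduling loop, assuming the $1F1 convention and the 0..255 / 256..511 split described above; the reason register and the handler labels are made-up names:
scheduler_loop
        TJNZ    $1F1, #thread_waiting   ' non-zero means a thread has executed TWAIT
        ' ... normal scheduling work ...
        JMP     #scheduler_loop
thread_waiting
        MOV     reason, $1F1            ' fetch the TWAIT value
        TEST    reason, #$100   wz      ' bit 8 set -> 256..511 = signal/event/timeout
 if_nz  JMP     #handle_signal
        JMP     #handle_breakpoint      ' 0..255 = breakpoint number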
It looks like using the TWAIT would be limited to the case where there is just one thread task since having TWAIT coming from more than one task could step on each other in the common $1F1 location.
Measuring precise timing while running with a scheduler and a debugger seems a bit of a stretch to me.
After verifying the logic of the code that needs such precise timing, what I'd do is something like turn off threading, not run under a scheduler, and
getcnt before
<code to be precisely measured>
getcnt after
then
sub after,before
will give a very accurate count.
Or am I missing a usage case you may need?
Yes, you can do that, but it requires an edit and recompile of the code, and later removal; it may be a library you want to check, and you want to avoid raising all sorts of version-control flags... (ie best avoided)
- TWAIT #n ... would flush the pipeline, so breakpoints, yields and signals are fine
- TSTEP would execute one atomic instruction, maybe forcing NOPs into the next three pipeline stages? 'D' instructions would not complete until the "shadow" instructions have completed; they would be one step
- REPs can be one step; if the contents of the REP block need debugging, it can be turned into a DJNZ
- TSTOP for the purposes of changing threads of execution would have to wait for the pipeline to empty
So if it is difficult, and takes too many gates, I don't think that much state needs to be tracked.
Doing it the long way is going to amount to way simpler and easier-to-understand concepts. I'm actually a lot more excited about this now. It's going to be very straightforward. There will be instructions to set up a task's states and complementary instructions to read back a task's states. SETTASK will be used to give tasks cycles, as little as one at a time.
You've almost got everything you need to add interrupts! :-)
If the TWAIT value is being written to $1F1, then there is room to also write a Task ID, for cases where multiple tasks are being managed?
- If there is more than one thread running, that precision cannot happen, as one or more other threads could have been running in the meantime
- the scheduler will take 1/16 of the cycles regardless
- however, if you are only running one thread, it would give an indication of the time taken (delta scheduler time, delta scheduler-caused extra cache reloads)
That's sounding positive. I'm glad you used the words "It's going to be very straightforward."
Yes, I was just thinking about that, and maybe if you wanted to be strict in your testing, you might want to reserve a 1/16 debug slot, even in shipped code. I can think of cases where that 1/16 could be a watchdog-style stub.
I've been looking into what it takes to completely redirect a task, so that preemptive multitasking and single-stepping can be accomplished. It turns out that the following bits need to be saved and restored:
16 bits for PC
1 bit for Z flag
1 bit for C flag
18 bits for PTRA
18 bits for PTRB
1 bit for TLOCK pending
2 bits for delayed branch pending
16 bits for delayed branch address
23 bits for AUGS value
1 bit for AUGS pending
23 bits for AUGD value
1 bit for AUGD pending
46 bits for REPS/REPD
167 bits total = 5 longs, 7 bits
That's a lot of data needed to store a task state!
How about, instead of being able to stop a task at any point in its program, we have a circuit that waits for an opportune situation before stopping the task? If we waited for the following, we would only need to track PC/Z/C and PTRA/PTRB:
TLOCK is not pending (this potentially causes a 1-instruction delay)
a delayed branch is not pending (this potentially causes a 3-instruction delay)
AUGS/AUGD is not pending (this potentially causes a 1..2 instruction delay)
REPS/REPD is not active (this potentially causes an unknown delay)
By avoiding those circumstances, we eliminate 113 bits of state information that needs saving and restoring, bringing the total down to 54 bits, of which JMPTASK can restore 18 (Z/C/PC) and operand-less instructions can copy the target task's PTRA/PTRB to and from the switcher task's PTRA/PTRB. This would take very little hardware. It would completely enable preemptive multitasking, but would increase the granularity of single-stepping in cases where TLOCK, AUGS/AUGD, or a delayed branch is pending, or where REPS/REPD is active. Single-stepping would step over those cases as if they were one instruction.
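Checking the arithmetic against the list above: the state avoided is 1 + 2 + 16 + 23 + 1 + 23 + 1 + 46 = 113 bits, and 167 - 113 = 54 bits, which is exactly PC (16) + Z (1) + C (1) + PTRA (18) + PTRB (18).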
Do you think this is adequate, or should the full 167 bits be handled in order to provide more granular single-stepping, as well as REPS/REPD interruption?
@Chip: I know it seems like a lot of state, but to do it right I think you'd want to save/restore all this task state and allow switching on any boundary. If there was an instruction that could grab all this state data from another task into a WIDE at any time, and also the reverse to load from a WIDE, we could then write a whole WIDE's worth of task state to hub if desired, and we could (potentially) atomically switch threads of a task using two hub cycles.
That still leaves the 4-entry task stack to deal with, however, in the cases where that data also needs to be saved (it may not always be, depending on the task call model employed by the developer). Multiple pops there and another wide write could be used as required. So the scheduler task has to just commandeer the user task being switched out (once its old PC has already been safely saved to the wide) to make it execute code to read out the user task's stack data. It would do the four pop operations and go save those too, before reading and pushing in the new stack data from the next thread.
If we are lucky, all these wide reads and writes might fit within about four hub cycles, which IMO is still rather fast for thread switching, as we are then only talking in the vicinity of 160ns @ 200MHz - and remember this is just for the high-level user thread context switching. We still have hardware task switching for critical real-time drivers.
If the number of user threads is very small, we could also try to keep the user task thread state in COG/stack RAM to avoid the hub access penalty, though I suspect the 256-bit wide transfers to/from hub may turn out to be faster than multiple shufflings of state data within internal 32-bit wide RAM when switching out the thread of a task.
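For what it's worth, the 160ns figure checks out if you assume a hub window every 8 clocks: 4 hub windows x 8 clocks x 5ns per clock at 200MHz = 160ns.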
In making the conduit for all this task-state data (about eight each of SETxxxx and GETxxxx instructions), I'm realizing this eats lots of opcode space and complicates the ALU result mux, which is already critical-path.
How about using the WIDEs as a big, fat parallel storage/retrieval buffer for task-state data? Aside from getting rid of ~16 instructions with operands, it provides a fast conduit via RDWIDE/WRWIDE for storing/retrieving task states in hub memory. We'd just need to do a WRWIDE with the existing data after the breakpoint and a RDWIDE before returning to the interrupted task. We'd also need to get that dcache-valid bit for restoring its state.
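A minimal sketch of that sequence, assuming the interrupted task's state has already been latched into the WIDEs and that WRWIDE/RDWIDE take the hub address from a register (the exact operand form is an assumption); state_ptr and the labels are made-up names:
breakpoint_entry
        WRWIDE  state_ptr               ' park the task's state (8 longs) in hub RAM
        ' ... debugger / scheduler work here ...
        RDWIDE  state_ptr               ' pull the saved state back into the WIDEs
        ' restore the task from the WIDEs (including the dcache-valid bit) and resume it
state_ptr long    $7000                 ' hub address of an 8-long save area (arbitrary example)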
We were thinking the same thoughts. This is definitely the way to do it. No messing around with lots of data elements if you don't want to.
We could make another instruction to get or set a task's entire 4-level LIFO into the WIDEs, too. This would eliminate more monkey motion. Probably cause more unemployment.
If that is practical to do, it certainly is easy to describe and use ( oh, and fast too )
Using the WIDEs to save/restore states is a good idea.
I do wonder if it would not be simpler to do the PTRA/PTRB/LIFO mapping to the scheduler, as discussed before, and when stepping, step over the whole instruction (stuffing NOPs into the pipeline for the three subsequent cycles for the non-delayed instructions, and treating the delayed instructions as an atomic unit of four instructions).
ie:
TSTOP savereg, #taskid ' saves PC, C, Z, stops after the current pipeline for the task being stopped is empty
Switching in the PTRA/PTRB/LIFO for taskid here allows the scheduler to save/load state
Four pops and the LIFO can be saved
TRUN savereg, #taskid ' restores PC, C, Z, resumes running at next instruction
The scheduler would restore PTRA/B/LIFO before running it; this can also be used to start threads, and does not need hardware to restore a large state
Four pushes and the LIFO can be restored
TWAIT #waitfor
Copies #waitfor to $1F1 and loops on itself, waiting for the scheduler to TSTOP it; when TRUN resumes it, execution continues at the next instruction
TSTEP savereg,#taskid
Runs one atomic instruction; treats non-delay-slot instructions as atomic by stuffing three NOPs into the pipeline
Stepping over a JMPD variant steps over the jump instruction and the three ops in its shadow
*** ALMOST MISSED IT ***
The state saved/restored MUST include the two-bit LIFO stack pointer!
Whichever is simpler/easier for you to implement Chip is the way to go
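To make that concrete, here is a rough sketch of a scheduler-side thread switch built from the proposed instructions, taking their semantics as described above; saved, thread_out and thread_in are made-up registers, and the PTRA/PTRB/LIFO copying is only indicated in comments:
switch_thread
        TSTOP   saved, #1               ' freeze task 1; its PC/Z/C land in 'saved' once its pipeline drains
        MOV     thread_out, saved       ' store the outgoing thread's PC/Z/C
        ' ... with task 1's PTRA/PTRB/LIFO mapped in, pop and copy them out as well ...
        ' ... then push/restore the incoming thread's LIFO, PTRA and PTRB ...
        MOV     saved, thread_in        ' fetch the incoming thread's saved PC/Z/C
        TRUN    saved, #1               ' resume task 1 at the restored PC with the restored Z/C
saved       long 0
thread_out  long 0
thread_in   long 0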
Using the WIDEs is the easiest thing, ever. It's going to be the fastest, too.
Boy, this sure is an impetus to make 4 sets of WIDEs, one for each task.
That would be great... as presumably that could also be used as 4 lines of dcache...
as long as it does not reduce the hub size
I've thought more about this and I've realized (again) that WIDE-muxing for RDxxxxC instructions is already critical-path. There is no more time for another 4:1 mux. It would cost another 6,144 flops, or another 10%, too.
Sorry, but I am yet to be convinced any of this is necessary.
Why can't all this run under the normal tasking?
Task 0 is a "super task" where it can set/reset "stall" bits for the other 3 tasks (the pipeline just effectively ignores the instruction and does not advance the PC is the "Stall" bit is active).
Task 0 can switch in the PTRA/B etc of any task so it can r/w those values.
Task 0 could then stall a task, and by examining the PC of the subject task, determine the next instruction to be executed (that is not in the pipe) and replace it with a new instruction to jmp (saving pc,z,c) to some special debugging code.
Perhaps there is something even simpler than this.
Too many flops.
Would it be easier to simply disable pipelining when in debug mode?
I think the "normal" case will be
task 0 - scheduler
task 1 - multi-threaded
as that would give the best multi-threaded performance
I get a headache thinking of one scheduler and two or three multi-threaded tasks
Especially as it would be significantly slower than running one scheduler and one multi-threaded task (due to shared resources such as caches)
Everything must work through the pipeline in stages.
I just see it as good to have the precision, if it can come as low-hanging fruit (eg something simple like an added mapped flip-flop)
Yes, but it is still a real-time number; it may not mean 'cycles in that thread'
So a cog can run in any of these modes:
1) cog mode
2) hub-exec mode
3) 4 tasks, any mix of cog / hub-exec
4) 1 scheduler, any mix of THREE cog / hub-exec / multi-threaded tasks
NANO testers will greatly benefit!
task 0: display driver
task 1: scheduler
task 2: sprites/sound
task 3: N user threads
Calling ozprop....
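For case 4 above, the slot split could again be expressed with SETTASK (same assumption as before, that the operand is 16 two-bit slot assignments); the even four-way split here is only an illustration, not a tuned allocation:
SETTASK #%%0123012301230123 ' tasks 0..3 each get every fourth slot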
Chip, may I be so bold to ask that you suspend this and get a release out so that we can at least get on with some serious testing?
Then get on with USB and SERDES. If there is time, you can always come back to this later.