Propeller II update - BLOG - Page 180 — Parallax Forums

Propeller II update - BLOG

Comments

  • jmg Posts: 15,155
    edited 2014-02-27 14:44
    cgracey wrote: »
    ... So, to do a single-step, you would do something like 'SETTASK #%%10' followed by 'SETTASK #%%1' (assuming task 0 was the target task and task 1 was the scheduler task).

    Good, so phase is under precise control.
    This would also allow a skim-step option in a debugger, where it could allocate 15 slots to the target and 1 to the kernel, for 15x the coverage per skim-step command.
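
As a rough illustration of the 15:1 split, here is a minimal Python sketch. It assumes (this is not spelled out in the posts above) that the SETTASK value is a 32-bit word holding sixteen 2-bit task IDs, with slot 0 in the lowest two bits. `pack_slots` and `share` are hypothetical helper names.

```python
# Sketch only: assumes a 32-bit schedule word made of sixteen 2-bit task IDs,
# slot 0 in the lowest two bits. The real SETTASK encoding may differ.

def pack_slots(slots):
    """Pack a list of 16 task IDs (0..3) into one 32-bit schedule word."""
    if len(slots) != 16 or any(not 0 <= t <= 3 for t in slots):
        raise ValueError("need exactly 16 task IDs in the range 0..3")
    word = 0
    for i, task in enumerate(slots):
        word |= task << (2 * i)
    return word

def share(slots, task):
    """Fraction of time slots given to one task."""
    return slots.count(task) / len(slots)

TARGET, KERNEL = 0, 1            # task numbering taken from the quoted post
skim = [TARGET] * 15 + [KERNEL]  # 15 slots for the target, 1 for the kernel
skim_word = pack_slots(skim)
```

Under these assumptions the kernel keeps 1/16 of the slots and the target gets the other 15/16.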
  • jmg Posts: 15,155
    edited 2014-02-27 14:50
    I don't think that the TWAIT grain matters, as it will take many cycles for the scheduler to notice it, and handle it.

    I was thinking more of Debug, where flight time may be what you are trying to measure, and the poll loop would be all you are doing.
    (There would be some minimum practical time, as after setting up the 'GO' the kernel needs to get ready for the 'DONE' echo.)

    Perhaps there is headroom in WAITPxx decode space to include an intra-cog flag?
  • Bill Henning Posts: 6,445
    edited 2014-02-27 14:59
    I must admit, I am a bit confused.

    Measuring precise timing while running with a scheduler and a debugger seems a bit of a stretch to me.

    After verifying the logic of the code that needs such precise timing, what I'd do is something like turn off threading, not run under a scheduler, and

    getcnt before
    <code to be precisely measured>
    getcnt after

    then

    sub after,before

    will give a very accurate count
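
A small Python sketch of the subtraction step, assuming the counter is a free-running 32-bit value that wraps to zero (the counter width is an assumption here, and `elapsed` is a hypothetical helper name):

```python
# Sketch only: assumes a free-running 32-bit cycle counter that wraps to zero.
COUNTER_BITS = 32
MASK = (1 << COUNTER_BITS) - 1

def elapsed(before, after):
    """Cycles between two counter samples, tolerating one wraparound."""
    return (after - before) & MASK
```

Masking the difference keeps the result correct when the counter wraps once between the two samples.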

    Or am I missing a usage case you may need?
    jmg wrote: »
    I was thinking more of Debug, where flight time may be what you are trying to measure, and the poll loop would be all you are doing.
    (There would be some minimum practical time, as after setting up the 'GO' the kernel needs to get ready for the 'DONE' echo.)

    Perhaps there is headroom in WAITPxx decode space to include an intra-cog flag?
  • cgracey Posts: 14,133
    edited 2014-02-27 15:00
    PROBLEM!!!

    I went to start implementing this and realized that you cannot state-selectively stop a task because there are going to likely be multiple pieces of it in the pipeline on any cycle. You can stop a task by taking its time slots away and letting it exhaust itself through the pipeline, but that's it. You can't control where it's going to land and what states it's going to be in, so it IS necessary to track all the state data, after all.

    This is not all bad, as it means we can have finer-grained multitasking and single-stepping and use Bill's idea to have the scheduler task 'map in' the threaded task's PTRA/PTRB/LIFO for easy access. This is all way easier to think about, anyway. We'll need instructions to read and write a task's REPS/REPD states, its TLOCK/AUGS/AUGD/delayed-branch pending states, and its delayed-branch address, as well as AUGS/AUGD values.
  • ctwardell Posts: 1,716
    edited 2014-02-27 15:03
    I thought about it some more, and there is no need for the 'H' - these instructions affect the task, the thread is purely a software construct built with their capabilities!

    TSTOP savereg,#1..3 ' only called by the scheduler task
    TRUN savereg,#1..3 ' only called by the scheduler task
    TSTEP savereg,#1..3 ' only called by the scheduler task
    TWAIT #n ' new instruction! explanation below - NOT TO BE CALLED BY SCHEDULER

    There are two other usage cases that should be addressed:

    1) A task/thread executing a breakpoint

    2) A thread voluntarily yielding as it is waiting for some event (time, signal, socket, etc)

    As a breakpoint can be considered as the thread waiting for the debugger, I think one instruction can handle all of the above.

    In all of these cases, the thread has to get the attention of the scheduler. We can do this without adding any logic!

    TWAIT #n ' write N to $1F1, and wait forever (TSTOP will stop the task, and TRUN will resume at the next address, right after the TWAIT)

    We have two easy to use locations in a cog - that are not normally loaded.

    $1F1 - TWAIT value
    $1F0 - savereg

    So basically, the scheduler will in its scheduling loop do the equivalent of:

    TJNZ $1F1, #thread_waiting

    and code can then decode the reason the thread is waiting, which can be one of:

    - breakpoint (say 0..255)

    - waiting for a signal/event/timeout (indicated by 256..511)

    Note the signal values are totally arbitrary.

    TWAIT completes the set - allows for threads to yield, to wait for elapsed time, and also gives us breakpoints!

    It looks like using the TWAIT would be limited to the case where there is just one thread task since having TWAIT coming from more than one task could step on each other in the common $1F1 location.

    C.W.
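
The polling and decoding described in this post, and the overwrite concern raised at the end of it, can be sketched in Python. This is only a model: the shared $1F1 location is a plain attribute, `Mailbox` and `decode_wait` are hypothetical names, and treating 0 as "nothing waiting" is an assumption that follows from polling with a test for nonzero.

```python
# Sketch only: models the shared $1F1 location as one Python attribute.
# The 0..255 / 256..511 split follows the post, which calls it arbitrary.

def decode_wait(value):
    """Classify a TWAIT value the way the scheduler loop might."""
    if value == 0:
        return ("idle", None)            # a nonzero test would not fire
    if value <= 255:
        return ("breakpoint", value)
    if value <= 511:
        return ("event", value - 256)
    raise ValueError("TWAIT value outside the 9-bit range")

class Mailbox:
    """Single shared wait location, cleared when the scheduler reads it."""
    def __init__(self):
        self.value = 0

    def twait(self, n):
        self.value = n                   # a second writer overwrites the first

    def poll(self):
        value, self.value = self.value, 0
        return decode_wait(value)

mb = Mailbox()
mb.twait(5)                              # one task hits breakpoint 5
mb.twait(300)                            # another task then waits on event 44
first = mb.poll()
second = mb.poll()
```

In this model the breakpoint written first is lost, which is the collision described above. A breakpoint numbered 0 would also go unnoticed by a nonzero test, so a real scheme would probably start breakpoint numbers at 1.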
  • Bill Henning Posts: 6,445
    edited 2014-02-27 15:07
    Hmm...

    Would it be easier to simply disable pipelining when in debug mode?
    cgracey wrote: »
    PROBLEM!!!

    I went to start implementing this and realized that you cannot state-selectively stop a task because there are going to likely be multiple pieces of it in the pipeline on any cycle. You can stop a task by taking its time slots away and letting it exhaust itself through the pipeline, but that's it. You can't control where it's going to land and what states it's going to be in, so it IS necessary to track all the state data, after all.

    This is not all bad, as it means we can have finer-grained multitasking and single-stepping and use Bill's idea to have the scheduler task 'map in' the threaded task's PTRA/PTRB/LIFO for easy access. This is all way easier to think about, anyway. We'll need instructions to read and write a task's REPS/REPD states, its TLOCK/AUGS/AUGD/delayed-branch pending states, and its delayed-branch address, as well as AUGS/AUGD values.
  • Bill Henning Posts: 6,445
    edited 2014-02-27 15:09
    You are right if more than one task is running multiple threads.

    I think the "normal" case will be

    task 0 - scheduler
    task 1 - multi-threaded

    as that would give the best multi-threaded performance

    I get a headache thinking of one scheduler and two or three multi-threaded tasks :)

    Especially as it would be significantly slower than running one scheduler and one multi-threaded task (due to shared resources such as caches)
    ctwardell wrote: »
    It looks like using the TWAIT would be limited to the case where there is just one thread task since having TWAIT coming from more than one task could step on each other in the common $1F1 location.

    C.W.
  • cgracey Posts: 14,133
    edited 2014-02-27 15:10
    Hmm...

    Would it be easier to simply disable pipelining when in debug mode?


    Everything must work through the pipeline in stages.
  • jmg Posts: 15,155
    edited 2014-02-27 15:12
    I must admit, I am a bit confused.

    Measuring precise timing while running with a scheduler and a debugger seems a bit of a stretch to me.

    I just see it as good to have the precision, if it can come as low hanging fruit (eg something simple like an added mapped flipflop)

    After verifying the logic of the code that needs such precise timing, what I'd do is something like turn off threading, not run under a scheduler, and

    getcnt before
    <code to be precisely measured>
    getcnt after

    then

    sub after,before

    will give a very accurate count

    Yes, you can do that, but that requires an edit and recompile of code, and later removal, and it may be a library you want to check, and you want to avoid raising all sorts of version control flags... (ie best avoided)
  • Bill Henning Posts: 6,445
    edited 2014-02-27 15:14
    I don't think this is an issue, because:

    - TWAIT #n ... would flush the pipeline, so breakpoints, yields and signals are fine

    - TSTEP would execute one atomic instruction, maybe force NOP's into the next three pipeline stages? 'D' instructions would not complete until the "shadow" instructions are completed, they would be one step

    - REPs can be a step; if the content of the REP block needs debugging, it can be turned into a DJNZ loop

    - TSTOP for the purposes of changing threads of execution would have to wait for the pipeline to empty

    So if it is difficult, and takes too many gates, I don't think that much state needs to be tracked.
    cgracey wrote: »
    PROBLEM!!!

    I went to start implementing this and realized that you cannot state-selectively stop a task because there are going to likely be multiple pieces of it in the pipeline on any cycle. You can stop a task by taking its time slots away and letting it exhaust itself through the pipeline, but that's it. You can't control where it's going to land and what states it's going to be in, so it IS necessary to track all the state data, after all.

    This is not all bad, as it means we can have finer-grained multitasking and single-stepping and use Bill's idea to have the scheduler task 'map in' the threaded task's PTRA/PTRB/LIFO for easy access. This is all way easier to think about, anyway. We'll need instructions to read and write a task's REPS/REPD states, its TLOCK/AUGS/AUGD/delayed-branch pending states, and its delayed-branch address, as well as AUGS/AUGD values.
  • cgracey Posts: 14,133
    edited 2014-02-27 15:14
    Doing it the long way is going to amount to way simpler and easier-to-understand concepts. I'm actually a lot more excited about this now. It's going to be very straightforward. There will be instructions to set up a task's states and complementary instructions to read back a task's states. SETTASK will be used to give tasks cycles, as little as one at a time.
  • David Betz Posts: 14,511
    edited 2014-02-27 15:16
    cgracey wrote: »
    Doing it the long way is going to amount to way simpler and easier-to-understand concepts. I'm actually a lot more excited about this now. It's going to be very straightforward. There will be instructions to set up a task's states and complementary instructions to read back a task's states. SETTASK will be used to give tasks cycles, as little as one at a time.
    You've almost got everything you need to add interrupts! :-)
  • jmg Posts: 15,155
    edited 2014-02-27 15:16
    ctwardell wrote: »
    It looks like using the TWAIT would be limited to the case where there is just one thread task since having TWAIT coming from more than one task could step on each other in the common $1F1 location.

    If the value is written, then there is room to also write a Task ID, for cases where multiple tasks are being managed?
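
One way to picture the Task ID suggestion is to place it above the 9-bit wait value. The Python sketch below is an assumption about layout only: the field positions and the names `encode` and `decode` are hypothetical and do not come from the thread.

```python
# Sketch only: hypothetical layout with the task ID stored above a 9-bit reason.
REASON_BITS = 9                          # covers the 0..511 values discussed
REASON_MASK = (1 << REASON_BITS) - 1

def encode(task_id, reason):
    """Combine a task ID (1..3) and a wait reason (0..511) into one word."""
    if not 1 <= task_id <= 3 or not 0 <= reason <= REASON_MASK:
        raise ValueError("task_id must be 1..3 and reason 0..511")
    return (task_id << REASON_BITS) | reason

def decode(word):
    """Split a word back into (task_id, reason)."""
    return word >> REASON_BITS, word & REASON_MASK
```

With managed tasks numbered 1..3, as in the TSTOP/TRUN forms quoted in this thread, every encoded word is nonzero, so a nonzero test still fires even for reason 0. A single shared location can still be overwritten by a second task before the scheduler reads it, so the ID identifies the writer but does not queue requests.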
  • Bill Henning Posts: 6,445
    edited 2014-02-27 15:18
    - If there is more than one thread running, that precision cannot happen - as one or more threads could have been running in the meanwhile

    - the scheduler will take 1/16 of the cycles regardless

    - however, if you are only running one thread, it would give an indication of the time taken (delta scheduler time, delta scheduler caused extra cache reloads)
    jmg wrote: »
    I just see it as good to have the precision, if it can come as low hanging fruit (eg something simple like an added mapped flipflop)
  • jmg Posts: 15,155
    edited 2014-02-27 15:18
    cgracey wrote: »
    Doing it the long way is going to amount to way simpler and easier-to-understand concepts. I'm actually a lot more excited about this now. It's going to be very straightforward. There will be instructions to set up a task's states and complementary instructions to read back a task's states. SETTASK will be used to give tasks cycles, as little as one at a time.

    That's sounding positive. I'm glad you used the words "It's going to be very straightforward." :)
  • Bill Henning Posts: 6,445
    edited 2014-02-27 15:22
    +1
    jmg wrote: »
    That's sounding positive. I'm glad you used the words "It's going to be very straightforward." :)
  • jmg Posts: 15,155
    edited 2014-02-27 15:23
    - If there is more than one thread running, that precision cannot happen - as one or more threads could have been running in the meanwhile

    Yes, but it is still a real-time number, even if it may not mean 'cycles in that thread'.

    - the scheduler will take 1/16 of the cycles regardless

    Yes, I was just thinking about that, & maybe if you wanted to be strict in your testing, you might want to reserve a 1/16 Debug slot, even in shipped code. I can think of cases where that 1/16 could be a watchdog style stub.
  • Bill Henning Posts: 6,445
    edited 2014-02-27 15:25
    Just thinking... talk about flexibility!

    1) cog mode
    2) hub-exec mode
    3) 4 tasks, any mix of cog / hub-exec
    4) 1 scheduler, any mix of THREE cog / hub-exec / multi-threaded tasks

    NANO testers will greatly benefit!

    task 0: display driver
    task 1: scheduler
    task 2: sprites/sound
    task 3: N user threads

    Calling ozprop....
  • rogloh Posts: 5,267
    edited 2014-02-27 15:28
    cgracey wrote: »
    I've been looking into what it takes to completely redirect a task, so that preemptive multitasking and single-stepping can be accomplished. It turns out that the following bits need to be saved and restored:

    16 bits for PC
    1 bit for Z flag
    1 bit for C flag
    18 bits for PTRA
    18 bits for PTRB
    1 bit for TLOCK pending
    2 bits for delayed branch pending
    16 bits for delayed branch address
    23 bits for AUGS value
    1 bit for AUGS pending
    23 bits for AUGD value
    1 bit for AUGD pending
    46 bits for REPS/REPD

    167 bits total = 5 longs, 7 bits

    That's a lot of data needed to store a task state!

    How about instead of being able to stop a task at any point in its program, we have a circuit that waits for an opportune situation before stopping the task. If we waited for the following, we would only need to track PC/Z/C and PTRA/PTRB:

    TLOCK is not pending (this potentially causes a 1-instruction delay)
    a delayed branch is not pending (this potentially causes a 3-instruction delay)
    AUGS/AUGD is not pending (this potentially causes a 1..2 instruction delay)
    REPS/REPD is not active (this potentially causes an unknown delay)

    By avoiding those circumstances, we eliminate 113 bits of state information that needs saving and restoring, bringing the total down to 54 bits, of which JMPTASK can restore 18 (Z/C/PC) and operand-less instructions can copy the target task's PTRA/PTRB to and from the switcher task's PTRA/PTRB. This would take very little hardware. It would completely enable preemptive multitasking, but would increase the granularity of single-stepping in cases where TLOCK, AUGS/AUGD, or a delayed branch is pending, or where REPS/REPD is active. Single-stepping would step over those cases as if they were one instruction.

    Do you think this is adequate, or should the full 167 bits be handled in order to provide more granular single-stepping, as well as REPS/REPD interruption?
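
The bit counts in the quoted post can be checked with a short Python tally. The field names and widths are copied from the list above; only the variable names are made up for the sketch.

```python
# Field widths copied from the quoted list; variable names are illustrative.
FIELDS = [
    ("PC", 16), ("Z flag", 1), ("C flag", 1), ("PTRA", 18), ("PTRB", 18),
    ("TLOCK pending", 1), ("delayed branch pending", 2),
    ("delayed branch address", 16), ("AUGS value", 23), ("AUGS pending", 1),
    ("AUGD value", 23), ("AUGD pending", 1), ("REPS/REPD", 46),
]

total = sum(bits for _, bits in FIELDS)
longs, extra_bits = divmod(total, 32)        # whole 32-bit longs plus remainder

# State that must always be tracked under the "opportune moment" scheme.
ALWAYS_TRACKED = {"PC", "Z flag", "C flag", "PTRA", "PTRB"}
kept = sum(bits for name, bits in FIELDS if name in ALWAYS_TRACKED)
dropped = total - kept

# Portion the post says JMPTASK can restore: Z, C and PC.
jmptask_bits = sum(bits for name, bits in FIELDS
                   if name in {"PC", "Z flag", "C flag"})
```

The tally gives 167 bits (5 longs and 7 bits), of which 54 remain and 113 are dropped under the reduced scheme, matching the figures in the post.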

    @Chip: I know it seems like a lot of state, but to do it right I think you'd want to save/restore all this task state and allow switching on any boundary. If there was an instruction that could grab all this state data from another task into a WIDE at any time, and also the reverse to load from a WIDE, we could then write a whole WIDE's worth of task state to hub if desired and we could (potentially) atomically switch threads of a task using two hub cycles.

    That still leaves the 4-entry task stack to deal with, however, in the cases where that data also needs to be saved (it may not always be, depending on the task call model employed by the developer). Multiple pops there and another wide write could be used as required. So the scheduler task just has to commandeer the user task being switched out (once its old PC has been safely saved to the wide) to make it execute code to read out the user task stack data. It would do the four pop operations and save those too before reading and pushing in the new stack data from the next thread. If we are lucky, all these wide reads and writes might fit within about four hub cycles, which IMO is still rather fast for thread switching, as we are then only talking in the vicinity of 160ns @ 200MHz, and remember this is just for the high-level user thread context switching. We still have hardware task switching for critical real-time drivers. If the number of user threads is very small, we could also try to keep the user task thread state in COG/stack RAM to avoid the hub access penalty, though I suspect the 256-bit wide transfers to/from hub may turn out to be faster than shuffling state data around within internal 32-bit wide RAM when switching out the thread of a task.
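
The timing estimate in this post can be reproduced with a few lines of Python. The 8-clock hub cycle is an assumption inferred from the quoted figures (160 ns over four hub cycles at 5 ns per clock), not a number stated in the thread.

```python
# Sketch only: CLOCKS_PER_HUB_CYCLE is inferred from the post's own numbers.
CLOCK_HZ = 200_000_000
CLOCKS_PER_HUB_CYCLE = 8

def hub_cycles_to_ns(cycles):
    """Duration of a number of hub cycles in nanoseconds (integer math)."""
    return cycles * CLOCKS_PER_HUB_CYCLE * 1_000_000_000 // CLOCK_HZ

full_switch_ns = hub_cycles_to_ns(4)     # state plus stack, as estimated above
state_only_ns = hub_cycles_to_ns(2)      # the two-hub-cycle state swap
```

On these assumptions a four-hub-cycle switch takes 160 ns and a two-hub-cycle state swap takes 80 ns.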
  • cgracey Posts: 14,133
    edited 2014-02-27 15:43
    In making the conduit for all this task-state data (about eight each of SETxxxx and GETxxxx instructions), I'm realizing this eats lots of opcode space and complicates the ALU result mux, which is already critical-path.

    How about using the WIDEs as a big, fat parallel storage/retrieval buffer for task-state data? Aside from getting rid of ~16 instructions with operands, it provides a fast conduit via RDWIDE/WRWIDE for storing/retrieving task states in hub memory. We'd just need to do a WRWIDE with the existing data after the breakpoint and a RDWIDE before returning to the interrupted task. We'd also need to get that dcache-valid bit for restoring its state.
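
To see how the task state might sit in a set of WIDEs, here is a Python sketch that packs the fields listed earlier in the thread into eight 32-bit longs. The field order, the names, and the `pack`/`unpack` helpers are all hypothetical; the thread does not define a layout.

```python
# Sketch only: hypothetical field order; widths come from the list quoted
# earlier in the thread. A set of WIDEs is taken as 8 longs (256 bits).
LAYOUT = [
    ("pc", 16), ("z", 1), ("c", 1), ("ptra", 18), ("ptrb", 18),
    ("tlock_pending", 1), ("dbranch_pending", 2), ("dbranch_addr", 16),
    ("augs_value", 23), ("augs_pending", 1),
    ("augd_value", 23), ("augd_pending", 1), ("rep_state", 46),
]
WIDE_LONGS = 8
SPARE_BITS = WIDE_LONGS * 32 - sum(bits for _, bits in LAYOUT)

def pack(state):
    """Pack a dict of field values into a list of eight 32-bit longs."""
    word, shift = 0, 0
    for name, bits in LAYOUT:
        value = state[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
        word |= value << shift
        shift += bits
    return [(word >> (32 * i)) & 0xFFFFFFFF for i in range(WIDE_LONGS)]

def unpack(longs):
    """Inverse of pack: rebuild the field dict from eight longs."""
    word = sum(value << (32 * i) for i, value in enumerate(longs))
    state, shift = {}, 0
    for name, bits in LAYOUT:
        state[name] = (word >> shift) & ((1 << bits) - 1)
        shift += bits
    return state
```

The 167 bits leave 89 bits spare in a 256-bit buffer, so there would be room for extra items such as the dcache-valid bit mentioned above.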
  • cgracey Posts: 14,133
    edited 2014-02-27 15:47
    rogloh wrote: »
    @Chip: I know it seems like a lot of state, but to do it right I think you'd want to save/restore all this task state and allow switching on any boundary. If there was an instruction that could grab all this state data from another task into a WIDE at any time, and also the reverse to load from a WIDE, we could then write a whole WIDE's worth of task state to hub if desired and we could (potentially) atomically switch threads of a task using two hub cycles.

    That still leaves the 4-entry task stack to deal with, however, in the cases where that data also needs to be saved (it may not always be, depending on the task call model employed by the developer). Multiple pops there and another wide write could be used as required. So the scheduler task just has to commandeer the user task being switched out (once its old PC has been safely saved to the wide) to make it execute code to read out the user task stack data. It would do the four pop operations and save those too before reading and pushing in the new stack data from the next thread. If we are lucky, all these wide reads and writes might fit within about four hub cycles, which IMO is still rather fast for thread switching, as we are then only talking in the vicinity of 160ns @ 200MHz, and remember this is just for the high-level user thread context switching. We still have hardware task switching for critical real-time drivers. If the number of user threads is very small, we could also try to keep the user task thread state in COG/stack RAM to avoid the hub access penalty, though I suspect the 256-bit wide transfers to/from hub may turn out to be faster than shuffling state data around within internal 32-bit wide RAM when switching out the thread of a task.


    We were thinking the same thoughts. This is definitely the way to do it. No messing around with lots of data elements if you don't want to.
  • cgracey Posts: 14,133
    edited 2014-02-27 15:50
    We could make another instruction to get or set a task's entire 4-level LIFO into the WIDEs, too. This would eliminate more monkey motion. Probably cause more unemployment.
  • jmg Posts: 15,155
    edited 2014-02-27 15:52
    cgracey wrote: »
    ... We'd just need to do a WRWIDE with the existing data after the breakpoint and a RDWIDE before returning to the interrupted task. We'd also need to get that dcache-valid bit for restoring its state.

    If that is practical to do, it certainly is easy to describe and use :) ( oh, and fast too )
  • Bill Henning Posts: 6,445
    edited 2014-02-27 15:56
    Using the WIDE's to save/restore states is a good idea.

    I do wonder if it would not be simpler to do the PTRA/PTRB/LIFO mapping to the scheduler, as discussed before, and when stepping, step over the whole instruction (stuffing NOPs into the pipeline for the three subsequent cycles for the non-delayed instructions, and treating the delayed instructions as an atomic unit of four instructions).

    ie:

    TSTOP savereg, #taskid ' saves PC, C, Z, stops after the current pipeline for the task being stopped is empty

    Here by switching in the PTRA/PTRB/LIFO for taskid allows the scheduler to save/load state
    Four pops and the LIFO can be saved

    TRUN savereg, #taskid ' restores PC, C, Z, resumes running at next instruction

    Scheduler would restore PTRA/B/LIFO before running it; this can also be used to start threads, and does not need hardware to restore large state
    Four pushes and the LIFO can be restored

    TWAIT #waitfor

    Copies #waitfor to $1F1, loops on itself, waiting for scheduler to TSTOP it; when TRUN resumes continues at next instruction

    TSTEP savereg,#taskid

    Runs one atomic instruction, treats non-delayslot instructions as atomic by stuffing three NOP's into the pipeline

    stepping over a JMPD variant steps over the jump instruction and three ops in its shadow


    *** ALMOST MISSED IT ***

    The state saved/restored MUST include two bit LIFO stack pointer!


    Whichever is simpler/easier for you to implement Chip is the way to go :)
  • cgracey Posts: 14,133
    edited 2014-02-27 15:58
    Using the WIDE's to save/restore states is a good idea.

    I do wonder if it would not be simpler to do the PTRA/PTRB/LIFO mapping to the scheduler, as discussed before, and when stepping, step over the whole instruction (stuffing NOPs into the pipeline for the three subsequent cycles for the non-delayed instructions, and treating the delayed instructions as an atomic unit of four instructions).

    Whichever is simpler/easier for you to implement Chip is the way to go :)


    Using the WIDEs is the easiest thing, ever. It's going to be the fastest, too.

    Boy, this sure is an impetus to make 4 sets of WIDEs, one for each task.
  • Bill Henning Posts: 6,445
    edited 2014-02-27 16:06
    That would be great... as presumably that could also be used as 4 lines of dcache...

    as long as it does not reduce the hub size
    cgracey wrote: »
    Using the WIDEs is the easiest thing, ever. It's going to be the fastest, too.

    Boy, this sure is an impetus to make 4 sets of WIDEs, one for each task.
  • cgracey Posts: 14,133
    edited 2014-02-27 16:09
    That would be great... as presumably that could also be used as 4 lines of dcache...

    as long as it does not reduce the hub size


    I've thought more about this and I've realized (again) that WIDE-muxing for RDxxxxC instructions is already critical-path. There is no more time for another 4:1 mux. It would cost another 6,144 flops, or another 10%, too.
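
One way to arrive at the 6,144 figure is sketched below in Python. The post does not show its working, so the breakdown (three extra sets of WIDEs per cog, 256 bits per set, 8 cogs) is an assumption that happens to match the number.

```python
# Sketch only: a possible breakdown of the quoted flop count, not the
# author's stated derivation. COGS = 8 is an assumption.
COGS = 8
EXTRA_WIDE_SETS = 3                  # going from 1 set to 4 sets per cog
BITS_PER_WIDE_SET = 8 * 32           # eight 32-bit longs

extra_flops = COGS * EXTRA_WIDE_SETS * BITS_PER_WIDE_SET

# If the extra flops are "another 10%", the implied existing total is 10x.
implied_existing_flops = extra_flops * 10
```

On this reading, the "another 10%" would put the existing design at roughly 61,000 flops.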
  • Cluso99 Posts: 18,069
    edited 2014-02-27 16:15
    Sorry, but I am yet to be convinced any of this is necessary.

    Why can't all this run under the normal tasking?
    Task 0 is a "super task" where it can set/reset "stall" bits for the other 3 tasks (the pipeline just effectively ignores the instruction and does not advance the PC if the "Stall" bit is active).
    Task 0 can switch in the PTRA/B etc of any task so it can r/w those values.

    Task 0 could then stall a task, and by examining the PC of the subject task, determine the next instruction to be executed (that is not in the pipe) and replace it with a new instruction to jmp (saving pc,z,c) to some special debugging code.

    Perhaps there is something even simpler than this.
  • Bill Henning Posts: 6,445
    edited 2014-02-27 16:19
    Yikes!

    Too many flops.
    cgracey wrote: »
    I've thought more about this and I've realized (again) that WIDE-muxing for RDxxxxC instructions is already critical-path. There is no more time for another 4:1mux. It would cost another 6,144 flops, or another 10%, too.
  • Cluso99 Posts: 18,069
    edited 2014-02-27 16:24
    Chip,
    May I be so bold as to ask that you suspend this and get a release out so that we can at least get on with some serious testing?
    Then get on with USB and SERDES. If there is time, you can always come back to this later.