Sorry, but I am yet to be convinced any of this is necessary.
Why can't all this run under the normal tasking?
Task 0 is a "super task" that can set/reset "stall" bits for the other 3 tasks (the pipeline effectively ignores the instruction and does not advance the PC if the "stall" bit is active).
Task 0 can switch in the PTRA/B etc of any task so it can r/w those values.
Task 0 could then stall a task, and by examining the PC of the subject task, determine the next instruction to be executed (that is not in the pipe) and replace it with a new instruction to jmp (saving pc,z,c) to some special debugging code.
Perhaps there is something even simpler than this.
I've read what you wrote a few times, but I'm not understanding what you mean. It does sound like more work (in software) than what is being cooked up right now.
Chip,
May I be so bold to ask that you suspend this and get a release out so that we can at least get on with some serious testing?
Then get on with USB and SERDES. If there is time, you can always come back to this later.
Would you give me another two days? Documentation is a big detour that I'd rather take once than twice.
My take was this is a suggestion for a way to single-step, by fiddling the pipeline.
- ie the 'super task' allows one opcode from another task to dribble thru the pipeline, with as many packing NULLs as needed.
A NULL is like a nop, but it does not change the PC.
A breakpoint may need to trigger a tail version of the same thing, to be able to stop with no added changes..
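To make the stall-bit idea a bit more concrete, here is a toy Python model of it. It is only a sketch under loud assumptions: plain round-robin time slots, one instruction per slot, and no pipeline depth at all. The names `run_slots` and `single_step` are made up for illustration and do not correspond to anything in the chip.

```python
from typing import List, Tuple


def run_slots(pcs: List[int], stalled: List[bool], cycles: int) -> List[int]:
    """Advance task PCs over `cycles` round-robin time slots.

    A stalled task's slot is treated like a NULL: nothing executes
    and its PC does not change.
    """
    pcs = list(pcs)
    for cycle in range(cycles):
        task = cycle % len(pcs)      # assumed: simple round-robin slots
        if not stalled[task]:
            pcs[task] += 1           # one instruction executes
    return pcs


def single_step(pcs: List[int], stalled: List[bool],
                task: int) -> Tuple[List[int], List[bool]]:
    """Let `task` execute exactly one instruction, then stall it again."""
    stalled = list(stalled)
    stalled[task] = False
    pcs = run_slots(pcs, stalled, len(pcs))  # one rotation = one slot per task
    stalled[task] = True
    return pcs, stalled
```

In this model the "super task" (task 0) keeps running normally while the stepped task advances by exactly one instruction per call to `single_step`.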
Okay. I kind of got that part, but I didn't understand how this would facilitate preemptive multitasking.
Cluso99, are you running a DE2-115 or a DE0-Nano? If you've got the DE2-115, I could squeeze one out, but the docs would be a little out of sync, though the instruction set would clue you in.
I think it was mainly about debug. Task-swap 'comes for free' with any full debug support, though it may not be as rapid as a WIDE swap.
Debug cares less about how long a save/restore takes than a high-performance task-swap does.
Unfortunately I have a DE0 so I know you will require extra time to make it fit. I don't use video if that helps (a few of us DE0 users don't require video any more, ozprop included). I will ask on a new thread what we can remove if that will help save you time?
I really don't understand why preemptive multitasking is required at all.
But, my thoughts were that to interrupt a task could be done simply by stalling the task, then by software examine/save its settings, and then release it.
While I understand that saving all of a thread's state using a WIDE would be nice, restoring that state alone does not fully permit the thread to continue later. Surely things like waitcnt/passcnt, reps etc would all have major caveats here.
We have up to 4 tasks. Surely the task scheduler could, by using SETTASK, control which tasks were running, and this alone would permit preemptive tasking by starting a new task (or by doing the work itself). This means that 1 task is effectively set aside to perform the preemptive tasking. So in many ways I don't understand why the current design does not permit this.
Mixed into this, debugging was brought up. IMHO that is actually quite simple, although a stall through the pipeline could aid single-stepping. Of course, this does not permit real-time tracing, so it's not cycle accurate.
The important detail (which comes 'free' with debug support), is the ability to do a 100% Task Swap.
That needs more than just Task launch, it needs to save/restore the PC & flags, and other details.
It seems a shame that we can't do a complete context store in one WIDE.
What about this: we reduce the LIFO from 32 bits wide down to the CALL/RET-requisite 18 bits (Z/C/PC), and we can increase its depth by a few levels, too. This will let us pack everything (including DCACHE-address and DCACHE-valid bits) into one WIDE.
The whole process of switching threads would be:
SETTASK D/#               'starve the other task of time slots, it had 15/16
                          'starting on the next instruction, we'll have all the slots
SETPTRA save_addr         'point to 'save' area
WRWIDE  PTRA[0]           'save current WIDEs
RDTASK  D/#               'read task's states into WIDEs, along with DCACHE-address and DCACHE-valid bits
WRWIDE  PTRA[1]           'save the entire task state
SETPTRA restore_addr      'point to 'restore' area
RDWIDE  PTRA[1]           'read new task state
WRTASK  D/#               'establish new task state, along with DCACHE-address and DCACHE-valid bits
RDWIDEQ PTRA[0]           'set the new WIDEs without affecting DCACHE-address or DCACHE-valid (RDWIDEQ = quiet)
SETTASK D/#               'give task 15/16 time slots to turn it back on in its new state
<wait for some time, update pointers, loop>
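The sequence above can be sketched as a toy Python model, just to show the save/restore ordering. Everything here is an assumption made for illustration: `TaskState`, `Cog` and `switch_thread` are hypothetical names, hub memory is a plain dict, and the DCACHE bits and real WIDE packing are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class TaskState:
    """Hypothetical per-task state that RDTASK/WRTASK would move."""
    pc: int = 0
    z: bool = False
    c: bool = False
    lifo: Tuple[int, ...] = (0, 0, 0, 0)


@dataclass
class Cog:
    """Toy model: one switched task slot plus the WIDE registers."""
    task: TaskState = field(default_factory=TaskState)
    wides: Tuple[int, ...] = (0,) * 8
    running: bool = True


def switch_thread(cog: Cog, hub: Dict[int, object],
                  save_addr: int, restore_addr: int) -> None:
    """Swap the thread in the task slot, in the order of the listing."""
    cog.running = False                  # SETTASK: starve the task of slots
    hub[save_addr] = cog.wides           # WRWIDE PTRA[0]: save current WIDEs
    hub[save_addr + 1] = cog.task        # RDTASK, then WRWIDE PTRA[1]
    cog.task = hub[restore_addr + 1]     # RDWIDE PTRA[1], then WRTASK
    cog.wides = hub[restore_addr]        # RDWIDEQ PTRA[0]: new thread's WIDEs
    cog.running = True                   # SETTASK: give the slots back
```

Calling `switch_thread` twice with the save and restore addresses exchanged puts the original thread back exactly as it was, which is the property a scheduler loop would rely on.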
I'll pull out everything disposable until the core fits.
We already have 8 cogs that can do 4 tasks each for 32 tasks total. Do we really need more tasks than that?
Also (tongue in cheek a bit, but), Chip, this is the slippery slope that eventually leads to the stuff you hate about Windows.... So many things going on pre-empting each other such that it all turns to molasses or worse, tasks get blocked out by other tasks so much that it stops responding to your input well.
Personally, it's starting to get frustrating to watch these threads anymore. You still have to get this whole thing through synthesis and married to the I/O pad frame; what happens when it doesn't fit? Also, we need things to be stable ASAP so we can get the docs and software done in time for shipping. It's going to take months and months to get the compilers/etc. built and solid. It needed to have started last year, but at the moment we don't even have Spin2 at all and PASM2 is a moving target.
Sorry to be negative at all here, but something's gotta give...
Sounds good, if it is practical.
Is this different from the wide approach looked at earlier ?
Tasks I see as less important than having good Debug support.
- but once you get good debug snapshots, task swap tends to come too, so anyone who wants to do that, can.
Debug support? I thought that was what the LED on the board was for. :-)
As I recall, threads came about due to propgcc not being able to pre-empt threads for the pthreads library (as there is no LMM anymore to conveniently allow pre-empting).
I am VERY much looking forward to improved debugging capabilities!
How this all gets used is up to programmers. There is, at least, a lot of debugging value in this addition.
Otherwise, you are right about time. I want this thing to be over, too. We are at the edge of the project and it's hard not to put a few more things in,
As my wife is fond of reminding me, I am not as young as I used to be, and neither is my memory - so I could be wrong about that recollection. Perhaps I dreamt the pthreads issue while snoring loudly (my wife assures me I shake the roof).
I do definitely recall Chip starting a thread about it... let me check... AHA!
Ahle has mentioned he'd like to make a preemptive multitasking system on the Prop2, so I've been thinking about what might be needed in the chip to facilitate it.
I wasn't necessarily saying it was a bad idea. I just meant that it wasn't something proposed by the propgcc team as far as I know. Anyway, it sounds like Chip has it all worked out. What I did propose was a TLB and a trap mechanism to handle TLB misses. However, I am well aware that that is out of the scope of P2 and maybe P3 as well.
I think that since time has already been spent on this (preemptive threading) and Chip has a handle on how to approach it that we should support him in seeing it through.
I know everyone is chomping at the bit to move ahead, but we would regret it later if PropGCC on the P2 has to use an LMM type setup to support threading instead of direct hubexec.
Commercial users would also be pleased with the ability to implement a proper RTOS.
My 2 cents,
C.W.
Unless things have changed since I last looked, I don't think PropGCC ever preempts threads on P1. It only switches threads when one thread yields. Of course you can run multiple threads in parallel if they are on different COGs.
1. Serial clock output on a pin (master mode)
2. Slave mode: external clock input on a pin
3. A single I/O that connects both SERIN and SEROUT to the same pin (like I2C)
I think that, and more, is planned for the serdes enhance pass.
Certainly clocked Sync modes are on the list.
As an aside, he has realised the LIFOs only need to be 16 bits wide (Z+C needed too???). This is another prize.
But it was only Ale2 who originally asked for it, and he subsequently pleaded for it to stop.
So may I ask, who is going to use this preemptive tasking???
Maybe the silicon is a waste and could be better utilised (we might even run out of it)?
Certainly I think for debugging, there are perhaps simpler and better ways to do it. Of course the ultimate is to buy a Parallax FPGA and have Chip add some special debug code in there for us.
That looks neat and tidy (Intuitive) Chip!
Some possibilities:
1) When you take a task's slots away, perhaps delay until the pipeline is empty? If you do that, there is less state to save.
2) If the LIFO is reduced to 16 bits: 4x16 LIFO + 16 PC + Z + C = 82 bits, and a WIDE is 256 bits... yep, you could increase the LIFO size.
3) TYIELD #n fits this scheme too, as the scheduler can take away the task's slots when it notices the task has yielded... I like it.
4) TSTEP: could TSTEP just give the task 4 cycles for one instruction to complete (or one JMPD variant, with 3 shadow instructions)?
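The arithmetic in point 2 can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the budget counts just the LIFO, PC and Z/C flags. The DCACHE-address and DCACHE-valid bits mentioned earlier would also need room, and their sizes are not given in this thread, so they appear only as an `other_bits` parameter defaulting to zero.

```python
WIDE_BITS = 8 * 32  # a WIDE is 256 bits, as stated above


def context_bits(lifo_levels: int, lifo_width: int = 16,
                 pc_bits: int = 16, flag_bits: int = 2) -> int:
    """Bits needed for the LIFO plus PC plus the Z and C flags."""
    return lifo_levels * lifo_width + pc_bits + flag_bits


def max_lifo_levels(budget: int = WIDE_BITS, lifo_width: int = 16,
                    pc_bits: int = 16, flag_bits: int = 2,
                    other_bits: int = 0) -> int:
    """Deepest LIFO that still fits in `budget` bits.

    `other_bits` stands in for any extra state (for example DCACHE
    bits) whose size is an unknown here.
    """
    return (budget - pc_bits - flag_bits - other_bits) // lifo_width


if __name__ == "__main__":
    print(context_bits(4))                  # 4-level, 16-bit LIFO
    print(max_lifo_levels())                # 16-bit entries
    print(max_lifo_levels(lifo_width=18))   # 18-bit (Z/C/PC) entries
```

With the defaults, `context_bits(4)` reproduces the 82-bit figure, so there is a lot of headroom in one WIDE before any other state is counted.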
I am VERY much looking forward to improved debugging capabilities!
For detailed testing, simulation code running right on the P2 can single-step, or do whatever else is needed, using an LMM-style kernel.
LOL David
Unfortunately I need another 100 leds!
Brian
While you are putting new things in:
Can you make some additions to the SERDES I/O?
1. Serial clock output on a pin (master mode)
2. Slave mode: external clock input on a pin
3. A single I/O that connects both SERIN and SEROUT to the same pin (like I2C)
That, together with 1 and 2, would be very useful.
I do definitely recall Chip starting a thread about it... let me check... AHA!
Found it!
http://forums.parallax.com/showthread.php/154167-What-s-needed-for-preemptive-multitasking
So may I ask, who is going to use this preemptive tasking???
Probably I would say all the people who use and write code in high level languages that want multithreading (eg typical RTOS users) that don't necessarily want to go to the trouble of adding YIELDS all over the place in their source to allow co-operative multithreading. The good thing about preemptive multitasking is you basically can use your original source code as is and don't need to worry (too much) about where you are being switched out. The RTOS does all that for you.
From what I see Chip adding, we still get total control of the switching from the scheduler task, so the thread preemption is not completely automatic; software is still involved there. So we can still add more features in software, such as thread locking, thread priorities etc, layered on top of the scheduler to allow critical thread sections and to influence thread switching order. It seems the best of all worlds here. We don't need interrupts but get a lot of what an RTOS normally offers, along with rapid context switches.
PS. One question for Chip: if multitasking is enabled and a task is waiting on the system counter with WAITCNT (such as the scheduler task might be doing), does the time have to match exactly to wake it up or just that the elapsed time is greater than some calculated amount when WAITCNT executes? Because due to multitasking the clock cycle that exactly matches the WAITCNT instruction may not be exactly the one when it executes and checks the timer, the system ticks might be observed jumping up by 4's for example (with 4 tasks). We don't have some issue there do we?
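To illustrate the concern in the PS, here is a small Python sketch comparing the two possible wake-up tests when a task only samples the counter every few clocks. It does not describe how the hardware WAITCNT actually behaves; `exact_match`, `has_elapsed` and `first_wake` are hypothetical names, and a 32-bit wrapping counter is assumed.

```python
from typing import Callable, Optional

MASK32 = 0xFFFFFFFF


def exact_match(cnt: int, target: int) -> bool:
    """Wake only when the counter equals the target exactly."""
    return cnt == target


def has_elapsed(cnt: int, target: int) -> bool:
    """Wake once the target has been reached or passed (wrap-safe)."""
    return ((cnt - target) & MASK32) < 0x80000000


def first_wake(start: int, target: int, step: int,
               test: Callable[[int, int], bool],
               limit: int = 1000) -> Optional[int]:
    """Sample the counter every `step` clocks; return the first value
    at which `test` passes, or None if it never does within `limit`."""
    cnt = start
    for _ in range(limit):
        if test(cnt & MASK32, target):
            return cnt & MASK32
        cnt += step
    return None
```

With a task that sees the counter at 1, 5, 9, ... and a target of 102, `first_wake(1, 102, 4, exact_match)` never fires within the limit, while `first_wake(1, 102, 4, has_elapsed)` wakes at 105. That is the difference the question is asking about.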
Roger.