Bill, this last round of discussion, which was needed to whittle the whole thing down to that core bit of silicon, really turned out rather nicely. Thanks for the clear examples.
Keep in mind that what Bill and I are discussing is not quite what the present FPGA does, and his code will not run as written on the present P2.
Single stepping could be done by adding one bit to the state saved by TPAUSE such that another TPAUSE is automatically triggered after a single instruction is executed.
I think the current idea for single-step is to feed the slave task one clock slot, and read either side of that.
Out of 2**32 possible combinations, ONE (0) is reserved.
One of your flags uses up 2**31 possibilities.
If you need more flags, make a 'taskXflags' register.
Compact mixing is irrelevant; it takes several instructions to pack/unpack.
My way takes less code, and is flexible.
Heck, it can even be used to implement flags.
BUT
The key is that TPAUSE and TJNZ can use 0 as a very quick mechanism.
One positive thing has come out of my trying to explain it to you - if there is logic for it, it would be nice if TPAUSE could exit on its own when the task1req register becomes 0, as this would remove the need for a TRESUME instruction. That could not be done with flags implemented the way you propose.
- be used as breakpoints
- wait for events
- be used as system calls
It is not an interrupt, as the scheduler has to poll for the change in the request register.
TLBs etc. will be a fun discussion for P3!
Classic interrupts would be a much larger change, requiring all four tasks' states to be capable of being saved/restored, plus interrupt vectors, which would lead to interrupt priorities, etc. We avoid that huge headache with TPAUSE, which is a lot more flexible than classical interrupts.
It seems like this TPAUSE/TRESUME feature is very close to what is needed to support traps, which will be required for handling TLB misses if we ever get to trying to execute code from external memory through pages cached in hub memory. A TLB miss could automatically pause the task that causes it and jump to some predefined location. It would also need to store a trap reason in another predefined location. The code at that location would then service the trap, possibly modify the state saved by the hardware-triggered TPAUSE, and then execute a TRESUME on itself to return to the code that was running prior to the trap. I think this could all be done in the context of a single task, rather than requiring a scheduler task running in parallel with the task being scheduled. In fact, if you add a timer as a possible trap source, then you can do a scheduler within a single task.

And Bill's YIELD instruction could essentially be a software trap that is processed as a breakpoint. Single stepping could be done by adding one bit to the state saved by TPAUSE, such that another TPAUSE is automatically triggered after a single instruction is executed. I suppose this is essentially introducing interrupts to the P2, but it seems a lot simpler than two tasks running in tandem to effect essentially the same thing.
I don't see why it would be any more complex than what I already described. The external stimulus (TLB fault, timer, etc) does the following:
1) Store a "reason" to a known location. This would be "TLB miss", "timer", etc.
2) Save the current state of the task using the TPAUSE mechanism.
3) Transfer control to another known location.
Then that code can do whatever it wants to handle this "trap", and when it's done it can just execute a TRESUME to resume the execution stopped by the TPAUSE. If these "known locations" were in the area of registers that get remapped for each task, then every task could operate independently with its own trap reason and handler registers. And YIELD can serve as a breakpoint by being a software-triggered stimulus following the same sequence as above (a rough sketch follows below). This all requires only a single task, so all four HW tasks can do this at the same time without interfering with each other. There is no need to waste a task to run the scheduler.
Edit: Or is there only one WIDE for holding the state saved by TPAUSE shared by all tasks?
Edit2: Also, I'm not suggesting adding TLB to P2, just saying P2 already has about half of what is needed to do it.
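To make that concrete, here is a very rough sketch of the flow I have in mind. None of this exists today; trapreason, traphandler and the RSN_* codes are invented names for the "known locations" and reasons described above:
' hypothetical per-task trap registers, living in that task's remapped register window
' trapreason  - written by the stimulus: RSN_TLBMISS, RSN_TIMER, RSN_YIELD, ... (invented codes)
' traphandler - the COG address the hardware jumps to after the TPAUSE-style state save
traphandler
cmp trapreason,#RSN_TIMER wz 'timer tick? a thread switch could be done right here
if_z jmp #do_schedule
cmp trapreason,#RSN_YIELD wz 'Bill's YIELD, treated as a breakpoint
if_z jmp #do_breakpoint
' ... service any other reason, possibly editing the saved state first ...
TRESUME 'resume, on itself, the code that was running before the trap
' (do_schedule / do_breakpoint are not shown; each would end with the same TRESUME)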
TPAUSE doesn't do any of the work; it just writes a value to a register and loops to itself. You need the scheduler to be watching the register set by TPAUSE, and then the scheduler does the actions to save state, etc.
Currently only TASK 3 can have its full state persisted.
I'll explain. Except I am skipping anything TLB/MMU related as being out of scope for P2.
To have interrupts:
- you need multiple interrupt sources.
- you need to be able to enable/disable them
- you would need an interrupt vector for each interrupt
- if you have more than one source, you usually need a priority mechanism
- you would need these for all four tasks
- you would need all four tasks' internal states to be saved on an interrupt
- or you would need to bind a specific interrupt to a specific task, which needs more state and instructions
- you would need a specific 'return from interrupt' instruction that knows how to restore the state and for which task
All of the above needs far more silicon, instructions etc than the current tasking scheme being discussed.
An interrupt would be the hardware equivalent of TPAUSE, which is not an interrupt.
And you are correct: there is only one WIDE, and it saves the state of task3 only (to save gates & complexity).
@JMG: Yes, but I have seen Chip's final output enough times to know about where he will be on it.
To be frank, you tend to ask for every possible option, and push for more in silicon than most here do. Chip tends to filter out and design away the need for as many of those as he can, and Bill has a good grasp of where the sweet-spot cases may or do lie.
At the end of all that, less is often more, and generally speaking, that is "the Propeller way", where we can use software for a lot of things where more emphasis on hardware would typically be seen.
That dynamic is why I am here. It is a great philosophy, because software improves over time. Where we have carved out the silicon sweet spots, we maximize that potential.
An easy example is the P1 video system. It is just enough to take the really ugly bits out of that task, without being overly limiting otherwise. Had a bit more hardware been applied, we would have seen some tasks become easier and faster; however, we may well not have seen the advanced uses, as well as the non-video uses, we have. Over time, we ended up doing things on the P1 that were not even a consideration at the time it was designed.

The only sweet-spot case missed was PAL, nicely corrected in P2, which retains most of the "it will do far more than we think" qualities P1 had, while at the same time leveraging all we learned on P1.

One of those seriously improved cases is mixed-mode and dynamically drawn displays. Having a graphics-capable window on a text display is one example, possible on P1 but difficult. Few of us did it, due to the overall difficulty. Another is dynamically drawn displays intended to maximize RAM efficiency, again possible on P1, but difficult. Both of these are going to be considerably easier and more effective on P2.
It is going to be the same way with these advanced tasking features. We need enough silicon to open the door to as much of what is possible as we can, while not closing off options possible in software by defining too much in the hardware now.
We will put it all to use, just as we did with P1, and out of that will fall the really sweet-spot cases for P3, based on actual application and innovation. Put simply, we think we know the optimal use cases, etc... but we may well not know what is really effective and/or possible until after software gets written and applied so as to reveal it.
In the balance is something people can use easily, great performance, no OS needed to multi-task and multi-process if desired, etc...
Where the "include it all just in case" approach is taken, complexity is too high and adoption is more difficult and that is easily seen out there on other perfectly capable, but maddening and painful to use devices.
You mention control as very important. Agreed, but moving as much of that to software as is practical means not having to wade through tons of options and initialization just to run some basic concept code, or to get started. Most PASM programmers here picked up on it quickly due to that dynamic being well realized.
Yes, that sometimes means a peak-performance case or two isn't as well realized as some would like, but it also means doing the vast majority of things is lean, fast, easy, and consistent.
Less is very often more. This is why we don't see people attempting, on other devices, the kinds of things we see people attempting on a Propeller; those devices have so many options and controls that one doesn't even know where to start!
So far, we have preserved this for the vast majority of what I see the P2 capable of, and I'm very excited and pleased, because it means we can and will have experiences that are bigger than what P1 can bring us, while the overall feel we all got so much out of was not lost amidst a sea of well-meaning, but sadly obtuse, options.
These differences in ideology have played out well, in that our end results are inclusive without being a burden. Again, that is primary for a whole lot of us.
I'll explain. Except I am skipping anything TLB/MMU related as being out of scope for P2.
To have interrupts:
- you need multiple interrupt sources.
- you need to be able to enable/disable them
- you would need an interrupt vector for each interrupt
- if you have more than one source, you usually need a priority mechanism
- you would need these for all four tasks
- you would need all four tasks' internal states to be saved on an interrupt
- or you would need to bind a specific interrupt to a specific task, which needs more state and instructions
- you would need a specific 'return from interrupt' instruction that knows how to restore the state and for which task
Of course you are correct that to fully implement interrupts you'd have to do most or all of those things. However, to replace the two-task scheduler scheme with one where each task could run its own scheduler wouldn't require any of that.
All of the above needs far more silicon, instructions etc than the current tasking scheme being discussed.
Yes, it would probably require a little more. Has anyone suggested just storing the state on TPAUSE into COG registers rather than into a special unmapped WIDE? If you run four tasks using register remapping, with each task having 32 registers, 8 of them could be used to store the task's state when interrupted and two more could be used for the reason/handler registers. That leaves 22 registers for general use, which seems like a fair number, and also leaves 128 COG registers to be shared among the tasks or used for fast COG code functions.
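If it helps to picture it, the split would look roughly like this; the names and the 8-long guess for the saved state are mine, purely for illustration:
' hypothetical layout of one task's 32 remapped registers
task_state res 8 ' $00..$07 state saved by the hardware-triggered TPAUSE
task_reason res 1 ' $08 trap reason ("TLB miss", "timer", "YIELD", ...)
task_handler res 1 ' $09 COG address of this task's trap handler
task_regs res 22 ' $0A..$1F the 22 registers left for general use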
An interrupt would be the hardware equivalent of TPAUSE, which is not an interrupt.
Not sure I understand this statement. I'm just saying that most of what is needed to do this is already implemented in the TPAUSE instruction.
And you are correct: there is only one WIDE, and it saves the state of task3 only (to save gates & complexity).
Again, is there any reason that the state couldn't be stored in a WIDE in COG memory, within the remapped register region? Doing this actually saves gates, because you don't need the special-purpose WIDE used to store the state of task3.
Compact mixing is irrelevant; it takes several instructions to pack/unpack.
My way takes less code, and is flexible.
Heck, it can even be used to implement flags.
BUT
The key is that TPAUSE and TJNZ can use 0 as a very quick mechanism.
One positive thing has come out of my trying to explain it to you - if there is logic for it, it would be nice if TPAUSE could exit on its own when the task1req register becomes 0, as this would remove the need for a TRESUME instruction. That could not be done with flags implemented the way you propose.
Correct, avoiding TRESUME is a good idea.
Your other claims are not true in the general sense.
There is nothing fundamental about flags that excludes self resume, or has to dictate larger code.
An example on a virtual P2, designed for a packed atomic semaphore and message:
' compact master scheduler loop, uses a packed atomic semaphore and message
' this is a bare-bones service provider that can serve as a skeleton for a debugger or scheduler
scheduler
jb31 task1req, #task1handler 'B31 signals Slave is done, and waiting
jb31 task2req, #task2handler ' Slave sets B31 and waits looping until B31=0
jb31 task3req, #task3handler
jmp #scheduler
task1handler
' decode the request, and handle it
mov task1req,#MessToSlave1 ' also does ClrB31 => releases Slave, and pass (optional) message
jmp #scheduler
task2handler
' decode the request, and handle it
mov task2req,#MessToSlave2 ' ClrB31 = releases Slave, and pass (optional) message
jmp #scheduler
task3handler
' decode the request, and handle it
mov task3req,#MessToSlave3 ' ClrB31 = releases Slave, and pass (optional) message
jmp #scheduler
task1req long 0 ' two way message and semaphore register
task2req long 0 ' two way message and semaphore register
task3req long 0 ' two way message and semaphore register
'Slave task1
TPAUSEb task1req,#MessFromSlave 'Sets B31, ORs 9b #MessFromSlave onto task1req, allows 2^31 messages
' TPAUSEb Loops here until B31 is Zero, then can read other 31 bits in task1req as messages
' Test message from master, or just continue
Notice this is both smaller (in code and registers) and has a higher message ceiling than your code.
(jb31 can use the freed-up TRESUME opcode slot, so it adds no more opcodes.)
Chip wants to use WIDEs, as apparently that is the easiest and requires the least logic, from what I recall.
' TPAUSE as originally proposed was equivalent to:
MOV reg,#code
lp: JMP #lp
' Revised TPAUSE is equivalent to:
MOV reg,#code
lp: TJNZ reg,#lp
The reason it needs an instruction is to save memory, as it will be used very frequently, including as breakpoints.
What I meant is that it is a simple instruction, not an interrupt mechanism.
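As a breakpoint it really is just one line dropped into the code under test. A minimal sketch, assuming the revised TPAUSE that releases when the register goes back to 0 (dbg_req and the BRK_ code are made-up names):
' in the slave task being debugged
TPAUSE dbg_req,#BRK_AFTER_INIT 'stop here, leaving a breakpoint code in dbg_req
' ... execution continues only after the debugger writes 0 to dbg_req ...
' in the debugger/scheduler task
chk
tjz dbg_req,#chk 'wait for any breakpoint to fire
' ... inspect the slave's registers, dump state, etc ...
mov dbg_req,#0 'release the paused task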
T3SAVE and T3LOAD implement the WIDE state saving.
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1248810&viewfull=1#post1248810
What are T3SAVE and T3LOAD? Are they new names for TPAUSE and TRESUME? I guess it is probably impossible to write a WIDE into COG registers in one tick, since the COG memory is only 32 bits wide, so my idea to make all four tasks able to do threading won't work without separate thread-state storage for each task.
These are the instructions that save and load task 3's context to and from the WIDE registers. These are single-cycle instructions that save/load upwards of 256 bits of context data.
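So a scheduler watching task 3 could swap software threads with something roughly like this. Only T3SAVE/T3LOAD and the request/release handshake come from this thread; the thread save areas and the WIDE-to-hub copy are hand-waved:
' sketch: scheduler task switching the software thread running in hardware task 3
switch_req
tjz task3req,#switch_req 'wait for task 3 to ask for a switch (via its TPAUSE)
T3SAVE 'snapshot task 3's context into the WIDE
' ... park the WIDE in the outgoing thread's save area,
'     fetch the incoming thread's saved context into the WIDE ...
T3LOAD 'restore the incoming thread's context
mov task3req,#0 'release task 3; it carries on as the new thread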
1) You are introducing a new two-op instruction... which I don't need
2) Your way only allows passing back a 31 bit result, not a 32 bit one
3) TPAUSE has no way of affecting b31
Sorry my friend, my way is simpler, requires one less instruction, a bit less logic, and allows a fuller return value.
Given my solution uses less code and data memory, I'm not sure where your "simpler" (?) claim comes from.
Of course, I already said it is a virtual P2, so it replaces TRESUME with something useful, and it proves you can have packed atomic semaphores and messages, well above the 512 values you claimed were so important earlier, and does it smaller in the most vital resource: register memory.
1) 'jb31 reg,#addr' new jump instruction
2) more complicated TPAUSE
' your suggested TPAUSEb equivalent in instructions
mov reg,#code ' could be S
or reg,bit31const
lp: and reg,bit31const wz
if_nz jmp #lp
' my revised TPAUSE in instructions
mov reg,#code ' could be S
lp: tjnz reg,#lp
' mine is simpler to implement, less logic required
3) limits return value to 31 bits
4) my solution uses 3 more registers in the whole cog, and can return 32 bit values
Sorry, I believe my solution is far superior
TPAUSE D,S/# 'write S/# to D and loop in place
TRESUME D/# 'increment PC of dormant task D/#
TPAUSE is used by a switchable thread; TRESUME is used by the supervisor to put a switchable thread back on the air.
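In use, that is one line in the thread and a short poll-service-resume loop in the supervisor; a bare sketch (threadreq and REQ_SLEEP are invented names, and it assumes the thread runs in task 3):
' switchable thread (runs in task 3 here)
TPAUSE threadreq,#REQ_SLEEP 'park with a request code visible to the supervisor
' ... execution continues only after the supervisor's TRESUME ...
' supervisor
svc
tjz threadreq,#svc 'poll for a request
' ... service the request code found in threadreq ...
mov threadreq,#0 'clear it for next time
TRESUME #3 'bump the dormant task's PC past its TPAUSE
jmp #svc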
Okay, I guess I'm too late to join this party. I'm too out of touch with the current plans. :-)
Anyway, looking at Bill's summary I guess T3SAVE and T3LOAD are what I should have mentioned in my message about having a task handle its own scheduling.
Correct, it is a virtual P2 - one that is designed to use less code and less memory.
Hehe, and yet earlier you claimed 512 states was important?
That 512 is limited by the immediate operand, and that is exactly the same in both implementations.
Besides, if anyone really wanted 32b fields, they can always use more wasteful extra registers.
Less code and less register memory is a very clear win, as those are what matter to designers. Gate count is essentially invisible to a user.
JB31 is only a slight variant (a subset, actually) of the existing JZ opcode. Both read and test a register; JB31 only tests the upper bit.
(and it is useful not just in this code case)
The TPAUSE is a variant opcode in both our cases.
Worst case, my method uses 3 more longs in the cog.
But it makes the Verilog simpler and needs less logic.
Btw, in most cases, only one request and result long will be needed for task 3.
I stand by my assertion that my way is strongly preferable due to the KISS principle.
.. and don't forget to add the extra lines of your code, to load the return values.
(You did compare my code with yours?)
Code that just works the same going in both directions wins any code-level KISS contest.
As far as Verilog code goes, neither approach is particularly challenging, and the Verilog is written only once. KISS decisions there are more about ease of use of the final device.
Thousands of users will write millions of lines of P2 code, and the P2 has a hard register ceiling.
Anything that lets users pack more into those registers, is worth a serious look.
task3result can be referenced directly, exactly the same as referencing task3req ... and it is more readable.
It takes the same amount of code to say
mov somereg, task3req
as
mov somereg, task3result
The only difference in memory is the extra long per task in the scheduler/debugger for taskXresult.
Try this - I've marked the extra line of code in your case, vs mine - for 3 instances, that is 3 more lines of code.
task2handler
' decode the request, and handle it
mov task2result,result ' optionally pass back result   <=== the extra line
mov task2req,#0 ' release task if PC not incremented past TPAUSE
jmp #scheduler
vs
task2handler
' decode the request, and handle it
mov task2req,#MessToSlave2 ' ClrB31 = releases Slave, and pass (optional) message
jmp #scheduler
Did you maybe miss that, by merging the message and the semaphore, my code updates both in a single line? In your code, it is one line per item.
Ok, I agree - I need 1 extra long to hold the result per task, and one extra instruction to clear taskXrequest, but only in the debugger/scheduler.
No difference in the client tasks.
In TPAUSE's old place is:
TCHECK D,S/# 'Write S/# into D and jump to self. On subsequent iterations, don't write D, but jump to self if D <> 0.
This gets rid of the need for TRESUME. It takes one bit of state storage to track TCHECK now, so that we know if it's on its first or a subsequent iteration. On the first iteration, it writes S/# into D and jumps to itself. On subsequent iterations, it doesn't write D, but jumps to itself if D <> 0.
So, task A does a TCHECK to write a non-zero value into some register. Task B notices the non-0 value and can do whatever it wants about it, but can write 0 to the register to release Task A.
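In use, the handshake would look something like this (a bare sketch; req and the REQ_CODE value are arbitrary names):
' task A - raise a request and sit on it
TCHECK req,#REQ_CODE 'first pass: writes the code into req, then jumps to self
' ... later passes just keep jumping to self while req <> 0 ...
' task B - notice it, act on it, release task A
poll
tjz req,#poll 'wait until req goes non-zero
' ... do whatever the non-zero code asks for ...
mov req,#0 'writing 0 lets task A's TCHECK fall through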
Simple, elegant, Propeller-like!
Enough hardware to let the software play!
It feels like the right solution. These days, tons of good ideas are developing because of the synergy on this forum.
I really appreciate all you guys!