In following the banter here and reading comments about why the co-operative approach is bad, it just seems to me that the approach is poorly understood. From the comments, most seem to have it backwards.
Peter, from your point of view, what's the most common misconception?
Firstly let me say that I mean no malice to anyone on this forum, and perhaps I used an inappropriate word when I said that "most seem to have it backwards", when I should have said that "some have it backwards".
Yet I do believe that those poo-pooing the cooperative approach have not themselves done the exercise of investigating it adequately, let alone making it work. My reason for making this statement is based on the example of the full duplex serial UART code that is often cited as a typical cooperative application. If that is the extent of their understanding, then I would agree. That particular code does in fact ping-pong back and forth between receive and transmit modes, and it is a form of multitasking, albeit a very poor one; one that does not use a scheduler.

However, what those citing that example seem to be missing is the level of elegance and convenience a scheduler adds to the mix. In the former case, each half of the co-routines, when done, causes a switch to the other routine. That can work (as evidenced) but is far from general and not optimal. Instead, the most each thread segment should do (at least in my opinion) is relinquish control back to the scheduler, which will then calculate the next time that thread should run, and from what point in its program. On completing that calculation, the scheduler places it in a "wait" condition, thereby also minimizing the power consumption. It then selects the next task to be run, and dispatches it at the appointed time. In this manner all threads are serviced sequentially and in an orderly fashion, each according to its "scheduled" time, and without (much) regard for each other, except for the jitter and processor load each introduces to the fold.
This method permits one to have as many threads as one has room for in a cog, of course consistent with the total work load to be done, and the maximum jitter tolerated.
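To make the mechanics concrete, here is a minimal sketch of such a relinquish-and-reschedule loop in C. It is only an illustration of the idea, not Propeller code: the task table, the simulated tick clock, and the handler names are all invented for the example, and on real silicon the "wait" step would be a WAITCNT on the computed wake-up time.

```c
/* Minimal cooperative scheduler sketch (illustrative only).
   Each task records when it is next due; its handler does one segment of
   work and returns how many ticks later it wants to run again. */
typedef unsigned (*task_fn)(void);
typedef struct { unsigned next_due; task_fn fn; } task_t;

static unsigned now_tick;                  /* simulated tick clock */
static unsigned blink_runs, uart_runs;     /* counters for the demo */

static unsigned blink_task(void) { blink_runs++; return 10; } /* 10-tick period */
static unsigned uart_task(void)  { uart_runs++;  return 3;  } /*  3-tick period */

static task_t tasks[] = { {0, blink_task}, {0, uart_task} };
enum { NTASKS = 2 };

/* Run the scheduler until the simulated clock reaches stop_tick. */
void run_scheduler(unsigned stop_tick)
{
    while (now_tick < stop_tick) {
        int next = 0;                      /* find the earliest-due task */
        for (int i = 1; i < NTASKS; i++)
            if (tasks[i].next_due < tasks[next].next_due)
                next = i;
        if (tasks[next].next_due > now_tick)
            now_tick = tasks[next].next_due;  /* "wait" = WAITCNT on real HW */
        tasks[next].next_due = now_tick + tasks[next].fn();  /* dispatch + reschedule */
    }
}
```

Each handler answers with its next period, and the scheduler always dispatches the earliest-due thread, so a 3-tick task and a 10-tick task interleave without knowing about each other.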
There are limits to what a cooperative approach is suited for: probably 8..16 low speed (1 kHz) threads, 4..6 medium speed (20 kHz) threads, and 2..4 high speed (100 kHz) threads, all of course dependent on the jitter that can be accommodated. It is quite acceptable to run a combination of the different speeds.
So my conclusions that caused me to say what I did are based on my interpretation of comments from those who appear to not fully understand.
Do I think hardware multitasking is bad? Of course not..... unless it is "costly" in some manner. And if that is the case, then I say ditch it; there are other viable approaches.
I hope I have cleared up any mysteries regarding my comments, expanded on how I see cooperative routines working, and that I have not offended anyone.

Cheers,
Peter (pjv)
"There are limits to what a cooperative approach is suited for: probably 8..16 low speed (1 kHz) threads, 4..6 medium speed (20 kHz) threads, and 2..4 high speed (100 kHz) threads, all of course dependent on the jitter that can be accommodated. It is quite acceptable to run a combination of the different speeds."
Chip's four hardware tasks allow tasks in the MHz range.
Having said that, my earlier post clearly admitted that cooperative multitasking is useful for the kHz range (even up to your 100kHz sample range).
The point of my post was to show that hardware tasks are useful if Chip decides to implement them, and to point out that people can fully understand co-operative threads, have used them (many times) in the past, and prefer hardware tasks for solid reasons.
(All: this is not to push Chip to implement tasks, it is a discussion between myself and pjv regarding the merits of co-operative threads vs. hardware level tasks)
JMP appears to have been combined with LINK.

The way I see it, there are several ways to achieve the same end, whether it be a software or hardware approach. Each has a 'layer' of knowledge required, whether that be setting up a scheduler, or a task register and jump addresses for the hardware approach.
Chip did a really beautiful implementation of the hardware approach, but the disadvantage is that it is limited to 4 tasks, consumes area, and perhaps some power, even when not used.
(disclaimer: like Bill said, this isn't a call for tasks to be included in any particular form, just looking at options for moving existing code)
Bill, I agree with you, but it's unfair to compare what can currently be done with the P1, to what might be possible with a P2. With the higher speed and enhanced instruction set, I expect to push cooperative routines also above 1 MHz.
We'll see how all that turns out..... can hardly wait.
Sorry Peter, I thought you were talking about P1+ / P2
For a few simple co-operative threads I am sure you can get to somewhere above 1MHz quite easily
Frankly, I can easily see running co-operative threads within a hardware task; quite often a task (at 25 MIPS) will be huge overkill, and one such task will be able to run your scheduler and threads at roughly P1 speeds
If tasks make it in - great.
If not - c'est la vie.
Up to Chip.

What I loved about the way Chip implemented tasks was:
- easy to write high performance drivers sharing a cog (up to four)
- the enhanced debugger capabilities
- easy control of how much processing time each task got
- user level threads
With Ken saying Parallax would like to release a new chip every one to two years, I have hope of seeing the "P8X256HP" (aka the P2 design before the P1+) for high performance applications one to two years after the P16X512, with all the goodies in it - and maybe more, as with a process shrink we can get more hub, more cogs, more MHz
I would underline the enhanced debugger capabilities, and expand that to include advanced Watchdog features too.
Emerging software standards are based on verifying correct operation of things like oscillators and software core tasks, and a tiny watchdog stub can do that with ease.
Having tracked this sort of failure/recovery in other Micros, it does raise one question :
Q: Does the reset of a P1+, force a re-read of the fuses ?
(on some competitor parts, not quite everything is kicked by Reset, and a power cycle can be required)
I like the most recent specs that Chip has provided. About hardware multitasking within a COG, I say "PLEASE GET RID OF IT". Let's have a simpler design; besides, hubexec is more important, and hubexec is great even at 50% of the speed. I'll likely use C, and it will compile to something that runs very fast in that mode
The other reason to not worry about hardware multitasking is this: 16 COGS. There are plenty to go around.
@info - You may not see the value of parallel processing, but I recommend you get a decent Propeller 1 board and try it out. Once you get used to doing multiple tasks in parallel, it changes how you program, and so many things get that much easier to do. It's freaky.
Early in this thread, Chip said that the hub access will be every 16 clocks/8 instructions. With 16 cogs, I don't see how this will work, unless the instruction cycle of the odd cogs is staggered by one clock from the even cogs (i.e. COG0, 2, 4, etc. are performing the load/write, while COG1, 3, 5, etc. are performing the read/execute). If this is not the case, then the hub access window will fall in the middle of the two-cycle instruction window for half of the cogs.
Does this make sense? Or do I have something wrong with my thinking?
Most instructions will take 2 clock cycles (due to dual ported cog memory)
That leads to 100MIPS max.
Therefore, each cog will get access to the hub every eight instructions (which is every 16 clock cycles)
I hope all the documentation will talk in terms of real clock cycles, as the instruction clock / system clock difference can be quite confusing, and easy to make mistakes with - heck, it has thrown me off several times too.
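The arithmetic behind these figures is easy to check. A small sketch, assuming for illustration a 200 MHz system clock (the helper names are mine, and that clock figure is an assumption, not a confirmed spec):

```c
/* Hub rotation arithmetic for the proposed P1+: one hub slot per cog per
   rotation, two system clocks per instruction. */
unsigned mips_per_cog(unsigned sysclk_mhz)        { return sysclk_mhz / 2; } /* 2 clocks/instruction */
unsigned clocks_between_hub_slots(unsigned ncogs) { return ncogs; }          /* 16 cogs -> every 16 clocks */
unsigned insts_between_hub_slots(unsigned ncogs)  { return ncogs / 2; }      /* = every 8 instructions */
unsigned hub_ops_per_sec(unsigned sysclk_hz, unsigned ncogs)
{ return sysclk_hz / ncogs; }                      /* each cog's hub-slot rate */
```

At the assumed 200 MHz this gives 100 MIPS per cog and a hub slot for each cog every 16 clocks, i.e. 12.5 million hub opportunities per second per cog.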
You may be right, there could be a half-Op-cycle (1 SysClk) phase offset on half the COGs.
Not sure that matters much? - any WAIT would snap back again, but I guess it could give subtle COG-location effects.

Then again, Chip may have something in the smart pins to deal with this effect.
Perhaps we need an instruction HALFNOP
I think my mistake was in forgetting that the initial hubop for each cog will typically stall in order to synchronize to the hub window. After that point, the instruction cycle should be aligned (assuming none of the next 7 instructions stall). And, of course, that the hub will now be running at the full system clock rate (unlike P1, which ran at half the rate), which you pointed out.

Looks lean and mean so far ... just like a winner.
Any more details on the MSGIN and MSGOUT instructions? Anyone?
I suppose they are like the P2's serial instructions. No, I'm not asking for serdes. Thanks.

They're for configuring sending/receiving data to/from smart pins - see post #151 of this thread: http://forums.parallax.com/showthread.php/155145-Putting-smarts-into-the-I-O-pins
These are how the more complex ports are configured, IIRC via a phantom-link(serial) over a line already routed to each pin.
Chip also plans to interface the Config and Registers of the PinCell Counters using this same pathway, with the more direct Pin-registers available as Enable/trigger in some cases.
I may well have your scheduling proposals backwards but you will have to explain why.
I don't poo-poo cooperative scheduling. It's a fine thing and I have used it on many production projects dating back decades.
Yes I did offer FullDuplexSerial as an example. I believe it's as good an example as any. Having written a FullDuplexSerial in C that runs entirely in a COG and reaches 115200 baud I feel I can say something about it.
"That particular code does in fact ping-pong back and forth between receive and transmit modes, and it is a form of multitasking, albeit a very poor one; one that does not use a scheduler."
That "ping-pong" of Rx and Tx is commonly known as "coroutines". An ancient and long forgotten technique. There is nothing "very poor" about it. It does what it does very efficiently. Given that we only need two "threads" here it is a very good, low overhead, solution.
"However, what those citing that example seem to be missing is the level of elegance and convenience a scheduler adds to the mix."
That would be me. No, I'm not missing the "elegance and convenience" a scheduler can add. However, in the case of FDS a scheduler can only add code that we don't need and consume time that is in short supply. Not a good solution.
"...is far from general and not optimal."
True, coroutines may not be general or optimal in the general case. But in the case of FDS I believe they are. FDS does not need a scheduler or a general solution.
"...the scheduler places it in a 'wait' condition thereby also minimizing the power"
You are not understanding how the P1 or the proposed PII works. The only way to get into a lower power mode is to execute a WAITxxx instruction - say, wait on a pin state. However, the WAITxxx instruction halts the entire COG. All other threads and your scheduler will not be able to run.
"This method permits one to have as many threads as one has room for in a cog, of course consistent with the total work load to be done, and the maximum jitter tolerated."
Ignoring the low power WAIT idea that cannot work, I agree. If one wants to implement many threads, more than two coroutines, some kind of scheduler is required. That scheduler will of course take time to run, so it has an overhead.
"So my conclusions that caused me to say what I did are based on my interpretation of comments from those who appear to not fully understand."
I think we understand very well.
If we are missing a point, I have a challenge for you, an opportunity to prove your case. Can you write a version of FullDuplexSerial for the P1 that uses a scheduler in the form you have proposed that can operate at the same, or higher, baud rate as the maximum of FullDuplexSerial? We would all be very interested to see it. If you can demonstrate power saving that would be a bonus.
"I hope I have cleared up any mysteries regarding my comments, expanded on how I see cooperative routines working,"
You have not. We all know how cooperative schedulers work. What we don't know is how you propose to get the performance, let alone the power saving. Not saying it can't be done but, I at least don't see it.
Do we need hardware scheduling?
I did some advanced mathematics and arrived at a startling result.
The previous PII concept has 8 COGs, each capable of running 4 hardware scheduled threads.
That's 8 * 4 = 32 threads.
The current PII concept has 16 COGs each capable of running 2 software coroutines, using JMPRET, very efficiently.
That's 16 * 2 = 32 threads.
This is a stunning result: both designs support the same number of threads! The latter design, with no hardware scheduling, has some advantages:
a) Pairs of coroutines on multiple cores are a lot simpler for the programmer to reason about.
b) We can actually save power by having a single thread in a COG and using WAITxxx.
c) Better performance.
d) A simpler, more lightweight chip design that gets the PII into silicon faster. (A chip that exists is much faster than one that does not.)
My conclusion is that hardware scheduled threads are perhaps nice to have, but the new design does not need them in order to keep up with the old design.
P.S. Could someone in the know make the calcs and prove c) above?
Ugh! Coroutines are just "goto" statements prettied up and made to sound "structured". Like "goto" they have utility in about 1% of cases, and just serve to confound and confuse in the other 99%.

Ross.
No they are not!
GOTO, as in JMP, has no expectation of returning to the instruction, or statement in a high level language, following the GOTO.
Coroutine calls do.
The coroutines as used in FullDuplexSerial use JMPRET. You know the same instruction we use for subroutine calls. The return address is saved and the next call to JMPRET gets you back to where you were.
Further, a GOTO has a specified target location. A coroutine call does not.
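For readers who have not used JMPRET, the ping-pong can be emulated in C with explicitly saved resume states (protothread style). This is a sketch of the control flow only - the rx/tx segment bodies are invented placeholders that just log their progress, and real FDS swaps program counters directly in one instruction rather than via a switch:

```c
/* Coroutine flavour of JMPRET, emulated with saved resume labels.
   Each call runs one segment, records where to resume next time,
   and returns control to the other side. */
static int rx_state, tx_state;
static int log_buf[16], log_len;

static void rx_step(void)          /* "receive" coroutine */
{
    switch (rx_state) {
    case 0: log_buf[log_len++] = 100; rx_state = 1; return; /* rx segment A */
    case 1: log_buf[log_len++] = 101; rx_state = 0; return; /* rx segment B */
    }
}

static void tx_step(void)          /* "transmit" coroutine */
{
    switch (tx_state) {
    case 0: log_buf[log_len++] = 200; tx_state = 1; return; /* tx segment A */
    case 1: log_buf[log_len++] = 201; tx_state = 0; return; /* tx segment B */
    }
}

/* The "JMPRET loop": control ping-pongs rx -> tx -> rx ... with no scheduler. */
void ping_pong(int rounds)
{
    for (int i = 0; i < rounds; i++) { rx_step(); tx_step(); }
}
```

Each call resumes from its saved point - a coroutine call, not a goto, precisely because the return point is preserved across the switch to the other side.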
"confuse and confound"? I don't think so. The alternative here is to make a call to some scheduler component which then makes a return to the next thread's code, or immediately back to the same thread. This is no simpler to comprehend and adds a bunch of run time overhead that essentially does nothing.
So you get the same challenge. Can you write a FullDuplexSerial that does not use that "confusing" coroutine mechanism that is as efficient and simple as FDS?
The hardware time slicing gives a much finer grain than what is being achieved in FDS. Bill has gone over this already.
Does it provide a benefit? Sure it does. Trying to argue it doesn't is pointless.
Is it required? Of course not, but then neither is a hardware multiplier required.
Is Chip going to add it? He's already said it depends on ease of fit with hubexec.
Yep it does. I have also detailed why and how here a couple of times. I'm not sure why it is being questioned. It could be argued that a cooperative system, with a scheduler or coroutines, has an advantage over a hardware scheduler when one of the threads absolutely needs all the processing power when it needs it and the other(s) are of a lesser priority. However that argument has not been made. And I don't believe that is the case with things like FDS anyway.
Yep, hardware scheduling would be nice but we can live without it pretty well.
"Surely multipliers come into their own when used by the compiler to calculate the address of elements within a structure?"

A multiply is not required to calculate the address of an element within a structure. Only addition. You have the base address of the structure and you add an offset to get to the field you want.
Multiply is required to calculate the address of elements within a two-dimensional array (or more dimensions), as you have to do something like baseAddress + (xSize * y) + x to get element array[x, y].
How many 2D arrays have you seen in Propeller programs?
Of course if you have the memory you can make the array dimensions a power of 2 in size and just use shifts instead of multiply.
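The three cases read as follows in C (the function names are mine; scaling by sizeof(int) is shown explicitly here, though a compiler does it for you):

```c
#include <stddef.h>

/* Struct field: base plus a constant offset - addition only. */
size_t field_addr(size_t base, size_t offset)
{
    return base + offset;
}

/* 2-D array of int: needs a multiply by the row size. */
size_t elem_addr(size_t base, size_t xsize, size_t x, size_t y)
{
    return base + (xsize * y + x) * sizeof(int);
}

/* Row size a power of two: the multiply becomes a shift. */
size_t elem_addr_pow2(size_t base, unsigned log2_xsize, size_t x, size_t y)
{
    return base + (((size_t)y << log2_xsize) + x) * sizeof(int);
}
```

For an 8-wide array, elem_addr_pow2 with log2_xsize = 3 lands on exactly the same address as elem_addr with xsize = 8, with no multiplier needed.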
All features come into their own in their own ways.
Take the FDS driver: by re-writing it as two tasks, one for rx and one for tx, instead of using the coroutine mechanism, one could expect a significant increase in maximum bit rate without requiring extra cogs. Or maybe even go for a dual super-fast FDS in one cog.
There is also the question of how much specialised hardware is put in around the cogs/hub/pins ... hardware threading provides some extra options for reducing the amount of specialised hardware that gets added.
How many 2D arrays have you seen in Propeller programs?
In my case: none. But then I usually have another processor alongside the P1 running 'main()'. It's not unusual for me to have arrays of structures containing arrays. Or 2D arrays.
My designs often have multiple uCs in them. It's not unknown for keypad and display duties to be stuffed into a small 20-pinner which is treated as an intelligent peripheral to offload from the main uC.
What excites me about the P1+ is that, at last, I'll be able to write in C without having to worry about constraints as basic as the amount of memory the chip has. I can't wait.
Ugh! Coroutines are just "goto" statements prettied up and made to sound "structured".
Like "goto" they have utility in about 1% of cases, and just serve to confound and confuse in the other 99%.
Ross.
GOTO, as in JMP, has no expectation of returning to the instruction, or statement in a high level language, following the GOTO.
Coroutine calls do.
The coroutines as used in FullDuplexSerial use JMPRET, you know, the same instruction we use for subroutine calls. The return address is saved and the next call to JMPRET gets you back to where you were.
Further, a GOTO has a specified target location. A coroutine call does not.
"Confuse and confound"? I don't think so. The alternative here is to make a call to some scheduler component, which then makes a return to the next thread's code, or immediately back to the same thread. This is no simpler to comprehend and adds a bunch of run-time overhead that essentially does nothing.
So you get the same challenge. Can you write a FullDuplexSerial that does not use that "confusing" coroutine mechanism that is as efficient and simple as FDS?
Does it provide a benefit? Sure it does. Trying to argue it doesn't is pointless.
Is it required? Of course not, but then neither is a hardware multiplier required.
Is Chip going to add it? He's already said it depends on ease of fit with hubexec.
Surely multipliers come into their own when used by the compiler to calculate the address of elements within a structure?
Yep it does. I have also detailed why and how here a couple of times. I'm not sure why it is being questioned. It could be argued that a cooperative system, with a scheduler or coroutines, has an advantage over a hardware scheduler when one of the threads absolutely needs all the processing power when it needs it and the other(s) are of a lesser priority. However that argument has not been made. And I don't believe that is the case with things like FDS anyway.
Yep, hardware scheduling would be nice but we can live without it pretty well.
A multiply is not required to calculate the address of an element within a structure. Only addition. You have the base address of the structure and you add an offset to get to the field you want.
Multiply is required to calculate the address of elements within a two-dimensional array (or more dimensions), as you have to do something like baseAddress + (xSize * y) + x to get element array[x, y].
How many 2D arrays have you seen in Propeller programs?
Of course if you have the memory you can make the array dimensions a power of 2 in size and just use shifts instead of multiply.
All features come into their own in their own ways.
Take the FDS driver: by re-writing it as two tasks, one for rx and one for tx, instead of using the coroutine mechanism, one could expect a significant increase in maximum bit rate without requiring extra cogs. Or maybe even go for a dual super-fast FDS in one cog.
There is also the question of how much specialised hardware is put in around the cogs/hub/pins ... hardware threading provides some extra options for reducing the amount of specialised hardware that gets added.
I mean, let's say adding hubexec and making it work optimally involves adding some amount X of size, complexity, and power consumption (CSP).
Implementing hardware threading might add some amount Y of CSP.
Adding them both together and making them work optimally might cost not X + Y but something more like X * Y.
I.e. the complexity grows faster than you think as all the interactions between the two have to be taken into account.
Is that why the previous design exploded? Are we just expecting the impossible given the technology constraints?
My designs often have multiple uCs in them. It's not unknown for keypad and display duties to be stuffed into a small 20-pinner which is treated as an intelligent peripheral to offload from the main uC.
What excites me about the P1+ is that, at last, I'll be able to write in C without having to worry about constraints as basic as the amount of memory the chip has. I can't wait.