jmg was suggesting there is potentially a power saving on a per counter basis when compared to having them in the Cogs. So, 32 counters on a pin-pair basis should be lighter than the 32 counters we currently have across the 16 Cogs.
I don't understand the logic in that. The counters should consume the same amount of power independent of where they are located. I think putting them at the pins is just a way to move some of the power dissipation away from the cogs.
The problem with deciding on whether hubex and/or multi-tasking should be in is that there is no data about how much real estate they take, or how much power they dissipate. We know that it's possible to implement both of them, since they exist in the P2 design. If the size and power dissipation figures were known, a solid decision could be made on whether to include them. Without those figures the decision must be made on some ad hoc basis.
Glad to hear you say all that, Ken. Most of it seems spot on.
Many of us have been campaigning for simplicity for a long while. Imagine having to explain those 500 opcodes of the previous P2 incarnation to those European distributors. You would be in Europe for a lot longer than 2 days!
We all want silicon, now!
I do like the idea of frequent iterations and improvements.
Is anyone going to be disappointed if we get rid of hardware multitasking in the cogs?
It's a cool feature, but does introduce jitter in tasks, depending on the instruction mix. It also takes some extra flops and logic to support properly, beyond the Z/C/PC's.
I'm thinking about the ROM_Monitor and realizing that I could code it with a single task by doing cooperative multitasking at a few time-critical points in the program. I wouldn't need hardware multitasking, after all.
No multitasking would keep the cogs very simple to understand, and keep them deterministic.
Any thoughts?
I am in favor of making the core logic simple. I like the idea of Hubex, but I don't like the idea of hubex and multiple tasks.
I specifically recommended against this because I know all the baggage it brings with it. It begets feature creep and makes the elegant design ugly. I really didn't like the look of the P2 after Hubex was added because it lost all of its elegance.
The chip needs to have as few "rules" as possible. Tasking creates more "rules". Hubex creates more "rules". Multiple tasks with hubex creates yet more compound "rules".
The P1 is simple: there are few rules, and they basically fall into "hub instruction" and "not hub instruction". The difference is that one takes 8-22 clocks and the other takes 4 clocks.
The P-X needs to be simple. If Hubex exists, all instructions run from Hub need to take a certain number of clocks, whether 8 or 16 or 4. Multi-tasking was an artifact of the pipeline structure of the P2, the P-X isn't pipelined, so I don't think it should have tasks.
Keep the video simple, but improved over the P1.
Keep the counters working like the P1, with PHSA and PHSB, etc. We like to have the ability to write to them in realtime to synthesize FM output. I wrote an FM transmitter program that could broadcast audio at FM broadcast frequencies. Having actual FM in addition to AM output to the pins is useful and is "free" in the sense that the counter just toggles a pin and doesn't have to communicate a value to the pin I/O circuit.
Above all else, add simplicity, not complexity. I really like the idea of "Hub" peripherals, because they are space efficient and give you a few dedicated building blocks to make stuff. Most importantly, they work in a simpler fashion, rather than multiplying the kitchen sink by n number of Cogs. The P2 rationale was to push everything into the COGs to avoid clock issues, but that just made an overly complex COG.
Hub exec IS going to be in the next chip. I've just been stalled out over the last few days getting the new instruction set nailed down. At times, all the details seem overwhelming and I think about paring it down, just to get it going again. This new memory scheme (dual-port 128x128 bits, instead of quad-port 512x32) changes a lot of things. It's hard for me to get there in one step. Asking you guys how you'd feel about dropping certain features alleviates the pressure on me. In the end, we'll likely have hardware multi-tasking back, too. Sorry for all the alarm.
So that we don't establish a new term, how about naming them "standard" I/O for now, if they're not all "smart" on the FPGA and don't have the same characteristics?
Comments
No, no, no! Don't go saying things like that!
No, I'm not opposed to hardware multitasking being added back in. (Well, I am, but for other reasons.) I'm opposed to you making statements like this! Be ruthless in what you cut out (for now). Get something working. And then add features back in if you want. But don't say you'll likely add something back in until you are ready to actually add it. Otherwise, all you're doing is putting that pressure back on yourself!
I am also in favor of not asking Chip to bother with multi-tasking for the reasons cited above and a few others.
Propeller 2 (that's what I'm calling it for now, unless we decide on Propeller 16 or another name) will be used by lots of inventors and entrepreneurs, and their specialty will often lie in fields other than embedded design. The capabilities and needs of these customers are very different from those of the forum members who contribute to this thread. They might be specialists in renewable energy, medical, robotics, aeronautics, environmental measurements, etc. We will likely have ten thousand customers using 100 to 1000 units each, plus a number of big wins. This particular customer can be overwhelmed by design considerations and possibilities, sometimes so much that they think this chip isn't right for their project because it offers so much. Telling them that "you've got 16 cores to use here - no need to worry about interrupts" will take care of 98% of our users for now. But when you further explain that each core has even more capabilities, like multitasking, then they really start to wonder how the heck they'll architect their program and the discussion will quickly diverge from the joy of understanding a simple design to wondering how it all works. Tracking program flow would be a challenge for them.
When Propeller 1 first came out we did a seminar for our European distributors. These people are technical buyers, representing electronic distribution companies. They can hook it up, load and modify our code examples, and have enough skill to show their customers what a Propeller can do. For two days we went through the Propeller architecture, far too deep for most of them. While multitasking may not be new to a crowd like this, I can imagine it being too much when combined with multicore. They already have trouble envisioning and accounting for the fact any core can write to any pin, how a system clock is accessed, etc. There's no need to complicate something that we expect people to sell.
And all the time adds up, too. I have no idea how much design time it might take, but I'm sure it's a week or more to bring P3 Verilog into P2. And it'll take a week or more for Jeff to document in the data sheet. Drawings, explanations, sample code - it could quickly add up to another month.
I don't know what I don't know, and perhaps that's all I know for certain. This is my take and for now I'm sticking to it. Don't mistake me for a wet blanket, as I only believe there's a lot of sense in optimizing what we've got and refusing as much temptation as possible for now. There's also a very strong business case for making frequent iterations and design improvements, which could be every year or two in our case.
Ken Gracey
Oh, I like the way this is (or at least was) going.
I suspect that most folks, not having used co-operative multitasking, don't realize how fast and simple it is... once you have an effective scheduler up and running.
In following the banter here and reading comments about why the co-operative approach is bad, it just seems to me that the approach is poorly understood. From the comments, most seem to have it backwards.
So, while I would probably make some use of a hardware approach if it existed, it certainly is not a "required" feature.
And with the new instructions, I expect the P1 style scheduler's performance will be significantly faster. I'm really looking forward to making one.
So I vote for dropping the hardware approach as I believe it does not add a lot of value.
Cheers,
Peter (pjv)
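For readers who have not used the co-operative approach, here is a minimal sketch of the idea in Python rather than PASM. It is not pjv's scheduler: the names `blinker` and `run_round_robin` are made up for illustration, and Python generators stand in for tasks that voluntarily give up the processor at chosen points.

```python
from collections import deque

def blinker(name, count, log):
    """Hypothetical task: does one unit of work, then yields to the scheduler."""
    for i in range(count):
        log.append((name, i))   # the "work" (stand-in for toggling a pin)
        yield                   # co-operative hand-off point

def run_round_robin(tasks):
    """Run each task in turn until all of them have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task finished, drop it from the queue
        ready.append(task)      # otherwise put it at the back of the queue

log = []
run_round_robin([blinker("A", 2, log), blinker("B", 3, log)])
print(log)
```

The point of the sketch is that the switch only happens where the task says so, which is why timing stays predictable as long as each task yields often enough.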
Nope, we don't have it backwards re: cooperative vs hardware scheduling. Certainly not if you are also including a software scheduler into the mix (as opposed to the coroutines you find in FullDuplexSerial, for example).
Not that I'm saying hardware scheduling is essential for this chip.
I always thought that hardware multitasking made things unnecessarily complicated. IIRC, some of the coding required to make it work seemed rather absurd (that may have been cleaned up after I lost interest).
If Chip puts in tasks, great. It would be great for packing up to drivers into a cog.
If he does not, that is his choice.
I've used co-operative multitasking many times... over many decades.
It is far inferior to hardware tasks. (ask XMOS <grin>)
Hardware tasking saves memory and makes timing high-speed signals far easier. It does not need a scheduler.
I will readily grant you that for low-speed signals, toggling a ton of LEDs etc., cooperative is all you need.
But.
Let's take practical examples.
P1+ style cog as discussed, 100 MIPS (for simple two-clock-cycle instructions).
Hardware tasks, interleaved every instruction:
- 25 MIPS per task
- LED toggling test: 25M toggles (XOR) for each task
- bit-banged serial, half duplex: >5 Mbps
Co-operative version?
- LED toggling: half the performance (xor, jmpsw), 12.5M toggles
- bit-banged serial: best guess, as the interleaving has to happen while waiting for the start bit edge and at every bit cell thereafter, ~2 Mbps max
For high speed signals,
- tasks give you a roughly 2:1 speed advantage
- much easier to write code for
- uses less cog memory
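The LED-toggle figures above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the task count of 4 is an assumption inferred from 100 MIPS giving 25 MIPS per task, and the function names are invented for illustration.

```python
# Figures taken from the post; HW_TASKS is an assumption (100 MIPS / 25 MIPS per task).
COG_MIPS = 100.0   # P1+ style cog, simple two-clock instructions
HW_TASKS = 4

def hardware_task_toggles(cog_mips, tasks, instr_per_toggle=1):
    """Millions of toggles per second per task when hardware interleaves tasks."""
    return cog_mips / tasks / instr_per_toggle

def cooperative_toggles(cog_mips, tasks, instr_per_toggle=1, switch_instr=1):
    """Same, but each toggle also pays for an explicit task switch (e.g. jmpsw)."""
    return cog_mips / tasks / (instr_per_toggle + switch_instr)

hw = hardware_task_toggles(COG_MIPS, HW_TASKS)
coop = cooperative_toggles(COG_MIPS, HW_TASKS)
print(hw, coop, hw / coop)
```

With one XOR per toggle and one switch instruction per hand-off, the model gives the same 2:1 ratio quoted in the list. The serial figures depend on bit-cell timing and are not modelled here.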
As for "people will not understand it" or "too complex": they don't have to use it if they don't want to. Heck, co-operative threads are still possible.
Bottom line:
It is up to Chip
KISS!
-Phil
I mostly stopped following this forum too, after there was so much discussion about tasks. I only regained interest when hub exec was added.
Great. This 'divide and conquer' sounds an ideal way to proceed.
You can also get MHz number indicators from any Builds you do along the way, to map the impact of the changes.
Power is Cpd * Ft * Vcc^2, where Cpd is the sum of the Register plus routing Loads.
So identical sized routing will have identical powers.
The scope I see for power saving in a PinCell is that the routing tools can focus just on that locally, whilst the COG counter will have to juggle for space with all the other critical paths in the COG during the Autoroute.
Because the PinCell is not large, there may even be some manual layout assist possible, especially if needed to meet MHz targets (and usually a smaller cell results).
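To make the formula concrete, here is a small worked sketch. Every number in it is a made-up placeholder, not a measurement of any Propeller design; the only thing taken from the post is P = Cpd * Ft * Vcc^2, with Cpd as register load plus routing load.

```python
def total_cpd(register_load_f, routing_load_f):
    """Cpd as described in the post: register load plus routing load (farads)."""
    return register_load_f + routing_load_f

def dynamic_power(cpd_f, f_toggle_hz, vcc_v):
    """Dynamic power in watts: P = Cpd * Ft * Vcc^2."""
    return cpd_f * f_toggle_hz * vcc_v ** 2

# Hypothetical values, chosen only to show the effect of shorter routing.
REGISTER_LOAD = 2e-12   # same counter registers in both cases
COG_ROUTING = 3e-12     # assumed longer routes inside a crowded cog
PIN_ROUTING = 1e-12     # assumed shorter local routes in a pin cell
F_TOGGLE = 100e6
VCC = 1.8

cog_counter = dynamic_power(total_cpd(REGISTER_LOAD, COG_ROUTING), F_TOGGLE, VCC)
pin_counter = dynamic_power(total_cpd(REGISTER_LOAD, PIN_ROUTING), F_TOGGLE, VCC)
print(f"cog counter: {cog_counter * 1e3:.3f} mW, pin counter: {pin_counter * 1e3:.3f} mW")
```

If the routing loads are identical, the two results are identical, which is the point made above. Any saving comes only from the routing term.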
There is a backward-compatible case for COG counters, but IIRC the design flow process means the PLL is not there (Chip can confirm?), so a P1+ COG counter will be a subset of things possible on a P1.
Also, having the wide-adder not in the Pin Cell can make the PinCell a little smaller and faster, but at the cost of a larger overall Logic area, from the duplicated counters.
Having an FPGA mix of Standard Logic and smarter CounterCells makes sense, and improves test coverage from a finite FPGA.
"Standard" can mean many things, and even a Basic Logic I/O on a FPGA platform is not going to be the same as the Final Pin designs, (of any pins without Counters).
For my applications... No.
Note that the JMP instructions save a return address into $1EF, so these double as the old LINK instructions.
I see it, never mind.
Looks good to me.
-Phil
I read that as an alias-write design, so RAM has a copy of (last) dira, and that means 'or' will work as expected?
That's right. Prop1 works like this, too. It saves a bunch of D and S muxes.
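A rough Python model of what "alias-write" means here, as I read the exchange above. This is a sketch under assumptions, not the actual Verilog: the `CogModel` class is hypothetical, and the DIRA address is the P1 one, used only for illustration.

```python
DIRA = 0x1F6  # assumption: P1 address of DIRA; the new chip's map may differ

class CogModel:
    """Hypothetical model: a write to DIRA lands in both cog RAM and the real register."""

    def __init__(self):
        self.ram = [0] * 512   # cog RAM, which also holds the shadow copy
        self.dira_reg = 0      # the actual direction register driving the pins

    def write(self, addr, value):
        value &= 0xFFFFFFFF
        self.ram[addr] = value         # every write goes to RAM
        if addr == DIRA:
            self.dira_reg = value      # aliased write also updates the register

    def read(self, addr):
        return self.ram[addr]          # reads always come from RAM, no extra mux

    def or_(self, dest, src_value):
        """Read-modify-write, like `or dira, mask`."""
        self.write(dest, self.read(dest) | src_value)

cog = CogModel()
cog.write(DIRA, 0x0F)
cog.or_(DIRA, 0xF0)
print(hex(cog.dira_reg))
```

Because the RAM copy always matches the last value written, the read half of a read-modify-write sees the right bits even though the register itself is never read back.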
Looks good...hubexec at 50% should be just fine...it will give people incentive to think creatively in using cog memory for faster code:) Plus the option of using hubexec.
I agree. If Hubexec is available it will get used. Ditto for multitasking.
But if they are not, there are simple software alternatives that can achieve most of the benefits. The omission of these hardware features won't seriously impact on most people, despite all the cries of woe and despondency you tend to get in these threads when someone's "favorite" feature appears to be under threat.
I think Chip's best course of action would be to take a week or two away from the forums and sort these things out before making any more announcements about this stuff.
Ross.
Multiple cogs at 50 MIPS, that sounds rather exciting!!
If we have 4 cogs per DE0, should be an easy re-write of invaders, just split the tasks out to a cog each.
Lots more hub activity, perhaps. It'll be interesting to compare the current consumption on the DE0, for the old vs new solution.