I am not quite in agreement with item 2 of your latest post, where you state that 'polling is a waste of CPU cycles'. And then, further down, you suggest that the answer to polling is the 'waitxx' instructions. Now, I ask how wasteful are those? Nothing gets done by a stalled cog while waiting for a timer or an event. My belief is that a properly implemented polling system can give both low-latency response and high throughput.
With such a polling system, each cog in the P1 can already execute multiple threads with single-cycle timing granularity and a typical 5 uSec latency. Sure, this will not be adequate for ALL requirements, but it will cover more than 95% of them, I would think.
So, in the absence of an interrupt facility, I'm of the very firm opinion that polling is a good thing.
And I would further say that while it is true that interrupts are not 'necessary' in a Prop (as in one CAN work without them), my RTOS code would likely have been faster and tighter with interrupts.
I think Heater meant "periodically checking for a condition without blocking" when he said polling is a waste of CPU cycles. At least, that was my interpretation. The waitxx instructions are more precisely blocking instructions, not polling instructions. Internally, interrupts are simply asynchronous (to the "main" CPU) blocking instructions. And in that context, the waitxx in one cog is asynchronous to all of the other cogs. The obvious advantage to the Propeller approach is that the "interrupt handler" effectively becomes the rest of the code in the cog that contains the waitxx, giving us absolute control over the way that one cog actually "interrupts" (used very loosely) another cog.
I also agree that polling (as in "periodically checking for a condition without blocking") can be very useful, particularly now that the P2 supports multitasking.
Re: Polling:
Back in the day, and even now, we would have one processor. That processor, without interrupts, can only execute one "thread" of execution.
Now imagine we have some data coming in. From a serial line, say, at 115200 baud. The UART might handle the bits and assemble the bytes, but every incoming byte needs to be read by the CPU before the next one comes in, else it is lost. At 10 bits per byte on the wire, that's about 11,520 bytes per second, call it 10,000, that have to be read from the UART, or one every 100 microseconds.
That means that the code that is running the single thread of execution has to poll the UART every hundred microseconds to see if something is there. There are two problems with this:
1) One has to actually code that into the thread, so that every few instructions it calls some routine that polls the UART. One really does not want to write a program with polling calls sprinkled throughout it. And who knows how many other devices need polling?
2) All that polling is consuming CPU cycles. Even when there is no data to read it still has to be polled for just in case.
So we see polling is a waste of CPU instructions and a big pain having to write code to do it.
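To make the pain concrete, here is a minimal C sketch of what sprinkled polling looks like (my own illustration, not anyone's actual driver; the UART is simulated so the sketch runs on a PC, and all names are hypothetical):

```c
#include <stdint.h>
#include <stdio.h>

/* Simulated UART so the sketch runs anywhere; on real hardware these
   would be volatile register reads. All names here are hypothetical. */
static int sim_clock;
static int     uart_rx_ready(void) { return (++sim_clock % 7) == 0; }
static uint8_t uart_read(void)     { return (uint8_t)('A' + sim_clock % 26); }
static void    handle_byte(uint8_t b) { putchar(b); }

/* This must be called at least every 100 us or a byte is lost. */
static void poll_uart(void)
{
    if (uart_rx_ready())       /* problem 2: burns cycles even when idle */
        handle_byte(uart_read());
}

int main(void)
{
    for (int i = 0; i < 100; i++) {
        /* ... some application work ... */
        poll_uart();           /* problem 1: polls sprinkled throughout, */
        /* ... more application work ... */
        poll_uart();           /* and every new slow section needs one   */
    }
    putchar('\n');
    return 0;
}
```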
Re: the waitxx instructions:
Now we have a processor running a thread, and another processor ready to read that data in. It's waiting on a waitxx instruction. Sure, that is a waste of a CPU, but that's what you have to do if there are no interrupts.
Admittedly one can have the second processor polling for multiple events whilst the first processor runs the application. Effectively that is a smaller-scale version of the polling problem above. It may work very well in many cases. FullDuplexSerial demonstrates this well.
My reference to polling was in respect to multiple threads operating in a single (P1) cog. Waitxx's inside such threads render multitasking impossible on a P1, and less effective on a P2. So instead, make a scheduler do all the 'waiting' as well as the polling, to effect the distribution of the cog's CPU cycles among the threads. This is what I mean by implementing polling 'properly'. On top of that, the scheduler even relieves you from needing to sprinkle 'polls' throughout your threads.
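A minimal C sketch of that idea (my own illustration, not the scheduler described above; the task table and ready conditions are invented): the scheduler owns the one polling loop, and each task only runs when its condition fires.

```c
#include <stdio.h>

/* Cooperative scheduler sketch: one central loop polls each task's ready
   condition and runs the task only when it fires. Tasks must return
   quickly; all waiting lives in the scheduler. Names are hypothetical. */

typedef struct {
    int  (*ready)(void);   /* poll: is there work for this task?  */
    void (*run)(void);     /* short, run-to-completion task body  */
} task_t;

static int tick;

/* Two toy tasks standing in for a UART service and a timer tick. */
static int  uart_ready(void)  { return (tick % 7) == 0; }
static void uart_task(void)   { puts("uart byte handled"); }
static int  timer_ready(void) { return (tick % 10) == 0; }
static void timer_task(void)  { puts("timer tick handled"); }

static const task_t tasks[] = {
    { uart_ready,  uart_task  },
    { timer_ready, timer_task },
};

int main(void)
{
    for (tick = 1; tick <= 30; tick++)           /* bounded for the demo */
        for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
            if (tasks[i].ready())                /* scheduler does the polling */
                tasks[i].run();                  /* task returns ASAP */
    return 0;
}
```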
Polling is VERY good -in the right context- for probably more than 95% of real-world applications.
I agree, a waitxx can halt everything. Better to keep it out of the threads.
Sounds to me like you are describing an "event loop": A "scheduler" waits for various events to happen. Perhaps changes in pin state with waitxx. Then depending on what event (pin change) happened it sets some code running to deal with that. But that code cannot hang up in an endless loop or take a long time to complete. No, it has to get back to the scheduler ASAP in case some other event has happened that needs handling.
Sounds like cooperative multitasking. Which requires that the tasks complete very quickly.
Sounds like the Lucol language we used to program avionics systems in. In the Lucol language there were no loops or gotos, so it was impossible for tasks to hang up. Everything was driven off of timer ticks. The Lucol compiler is the only one I have ever seen that would compile your code and then print out how much of your allotted time you had used. It could do that because there were no loops and analysis was easy.
I think you have to outline your system, in pseudo code, before we can discuss it properly.
Polling, interrupts, and multiple CPUs all have their uses on a micro. This does not mean that I would trade a Propeller with its 8 cogs for a CPU with interrupts, even if it had two or three times the speed of all 8 Propeller cogs combined. It is by far the best microcontroller for what I do, even though there have been times when having an interrupt would have made a task much simpler or would have saved me dedicating an entire cog to that task.
There are several reasons I am convinced that interrupts also have their place. The first is the use to which one of the earliest S100 Z80 CP/M systems I assembled was put.
It was a basic 2MHz Z80 with 64K of RAM, two 8” floppies, and a serial board with eight UARTs. Eight serial terminals were connected to the serial board. All eight terminals were used simultaneously for key-to-disk data entry.
There was a single interrupt that occurred 20 times per second and called a polling loop that checked each of the 8 UARTs for an input character, and called a single re-entrant routine that processed and stored received characters.
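The shape of that design, sketched in C rather than Z80 assembler (a hedged illustration, not the original code; the serial-board accessors are simulated so it runs on a PC):

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_UARTS 8

/* Simulated serial-board accessors; the real system read the UARTs'
   status and data registers directly. */
static int     sim_tick;
static int     uart_has_char(int port) { return (sim_tick + port) % 3 == 0; }
static uint8_t uart_get_char(int port) { return (uint8_t)('0' + port); }

/* One re-entrant handler shared by all terminals: all mutable state lives
   in the per-port block passed in, not in globals or static locals. */
typedef struct { uint8_t buf[128]; int len; } port_state_t;
static port_state_t state[NUM_UARTS];

static void store_char(port_state_t *ps, uint8_t c)
{
    if (ps->len < (int)sizeof ps->buf)
        ps->buf[ps->len++] = c;
}

/* Body of the single 20 Hz timer interrupt: poll all eight UARTs. */
static void timer_isr(void)
{
    for (int port = 0; port < NUM_UARTS; port++)
        if (uart_has_char(port))
            store_char(&state[port], uart_get_char(port));
}

int main(void)
{
    for (sim_tick = 0; sim_tick < 60; sim_tick++)  /* three simulated seconds */
        timer_isr();
    for (int p = 0; p < NUM_UARTS; p++)
        printf("port %d buffered %d chars\n", p, state[p].len);
    return 0;
}
```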
I am certain that none of this would have been possible for such a low-powered CPU without interrupts.
If you only have one CPU you pretty much have to have interrupts. There are ways to handle a case like the one above without them, but that requires ensuring all your code checks those UARTs 20 times a second. Which is hard to do, and all that polling will kill performance.
If those eight UARTs were smart enough to DMA data into memory then the interrupts would not be required. That's effectively what we make with a COG running FDS or four-port serial.
Often people ask for interrupts in the Propeller. Question is does it make sense to have parallel cores plus the extra circuitry and complexity of interrupts? I think not.
One early computer, the Epson HX-20 (the first laptop), had 2 Hitachi 6301s (6801 CPUs with some mods) running at 614KHz.
One was dedicated to I/O; among its duties were 2 serial ports, one limited to 4800bps, the other to 38400bps.
That machine felt very responsive to use, even with the low clock speed...
Of course, having two CPUs added to the cost and increased the design complexity significantly.
Interrupts aren't so bad. I don't feel that cogs completely replace interrupts, from the angle that a typical micro has peripherals that can be clocked at will, with the added benefit of low current use in the peripherals. The ability to clock-switch per cog would be ideal.
In some ways, DMA can be used like a cog, although it is rather complex to set up.
Clock-switching per cog has been discussed before.
Not a good idea as it really messes up timing for HUB access for the other COGs.
Anyways, interrupts are named as they are because they interrupt the normal workflow of the CPU.
That screws up timing and can make for really weird failure situations that can be almost impossible to track down.
LADDER logic is another "language" that can be deterministic. Most dialects do not allow looping, but the ladder is traversed as a task in the main execution loop.
Currently I'm working on a system that allows users to make design tweaks to an instrument using LADDER logic. In our case it is a background task which, because there are no loops, is guaranteed to always return to the foreground loop. The main CPU does use interrupts for a number of high-priority tasks, mostly capturing data at a guaranteed interval. Some of these may move to DMA-driven routines that can execute nearly autonomously.
These days interrupt code is just C code, as most processors make the interrupt process pretty painless. This is done by pushing a fixed set of registers onto the stack before the interrupt, coordinated with the C compiler, so there are no conflicts (what I mean here is a register used by both interrupt and main code without save/restore). In the Cortex version of ARM the interrupt is quite simple: load the address of the routine into the controller, clear the interrupt source somewhere in that interrupt code, and you are done. There can be fine tuning as to interrupt priority and whether interrupts allow higher-priority interrupts and the like, but in most cases it is pretty easy to manage. Like almost all software tasks, you don't have to get too creative until the process you are trying to manage is pushing the limits of the hardware.
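A hedged sketch of what that looks like on a Cortex-M (the CMSIS calls shown are real; the handler name, IRQ constant, and flag-clearing step are vendor-specific placeholders, not any particular part's API):

```c
#include <stdint.h>

/* Because the NVIC stacks registers according to the ARM procedure call
   standard, the handler below is an ordinary C function -- no assembler
   shim required. TIMER0_IRQHandler is a placeholder name; the real name
   must match the entry in your device's startup vector table. */

volatile uint32_t tick_count;

void TIMER0_IRQHandler(void)
{
    /* acknowledge the source first (write the vendor's timer status
       register) so the IRQ doesn't immediately re-fire */
    tick_count++;              /* then do the (short) real work */
}

/* Wiring it up usually amounts to two CMSIS calls, with the IRQ number
   constant coming from the vendor header:

       NVIC_SetPriority(TIMER0_IRQn, 2);
       NVIC_EnableIRQ(TIMER0_IRQn);
*/
```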
Not that I am suggesting LADDER is the killer app for a Prop, but it might be an interesting one. There are a number of open source LADDER to C translators.
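To give a feel for what LADDER rendered as C can look like, here is a toy sketch (my own illustration, not output from any of those translators): each rung is a loop-free mapping from the input image to the output image, run once per scan, so the scan is guaranteed to terminate.

```c
#include <stdbool.h>
#include <stdio.h>

/* Ladder-style scan cycle: rungs are straight-line functions, evaluated
   once per scan. Names and wiring are made up for the demo. */

static bool in[4];    /* image of the inputs, latched at scan start */
static bool out[4];   /* output image, written out at scan end      */

/* |--[ in0 ]--[ in1 ]--( out0 )--|   series contacts = AND */
static void rung0(void) { out[0] = in[0] && in[1]; }

/* |--[ in2 ]--+--( out1 )--|
   |--[ in3 ]--+   parallel contacts = OR */
static void rung1(void) { out[1] = in[2] || in[3]; }

static void (*const rungs[])(void) = { rung0, rung1 };

static void scan(void)
{
    /* inputs would be latched here from real I/O */
    for (unsigned i = 0; i < sizeof rungs / sizeof rungs[0]; i++)
        rungs[i]();                    /* straight through, no loops */
    /* outputs would be written here to real I/O */
}

int main(void)
{
    in[0] = in[1] = true; in[2] = false; in[3] = true;
    scan();
    printf("out0=%d out1=%d\n", out[0], out[1]);
    return 0;
}
```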
Interrupts: They are not needed, true.
... Aren't interrupt handlers usually written in assembler, even if the main language used is of a higher level like C, BASIC or whatever else? And doesn't the main code processing stop until a return from the interrupt handler?
One reason that the ARM Cortex-M controllers are getting so much attention is that the interrupt controller sets up the processor in a manner consistent with the APCS (Arm Procedure Call Standard) so interrupts can be set by setting the trap address to a pointer to a C function, no assembly required.
Originally Posted by Circuitsoft: One reason that the ARM Cortex-M controllers are getting so much attention is that the interrupt controller sets up the processor in a manner consistent with the APCS (Arm Procedure Call Standard) so interrupts can be set by setting the trap address to a pointer to a C function, no assembly required.
This sounds like a great thing to put on a spec sheet, but I've programmed some ARMs, and I would take PASM over C on an ARM any day. There are other things that make C on an ARM pretty hard and PASM on a Prop pretty easy for real engineering problems.
I do phased PWM with Props, and recently I have been building an MPPT solar controller (and diversion and battery monitor) for 48V nominal into a 12V battery. Because the duty cycle will always be less than 25%, I was able to do 4-phase synchronous PWM (high/low MOSFETs per phase, so 8 total) with software stepping to increase the effective resolution to 16 bits, bit-banged in a single cog. So, I thought, maybe that's not so great; couldn't I bit-bang that with a 168 MHz Cortex-M4 that has pretty much 8X the MIPS of a cog? Uh, no, there's nothing like waitcnt; maybe if it had a repeat register like a dsPIC. Well, what if I used 2 timer/counters at 4x the PWM rate (yikes, 800,000 interrupts per second), each with non-overlapping, non-re-entrant interrupts? Close, but without the software stepping, and the minimum duty cycle would have to be higher to make sure the interrupts wouldn't overlap.
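The "software stepping" trick deserves a sketch. My reading of the technique, in plain C (the actual implementation is PASM and will differ; the bit split is an assumption): dither between two adjacent coarse duty steps so the long-run average converges on the finer 16-bit target.

```c
#include <stdint.h>
#include <stdio.h>

/* Duty dithering: a 16-bit duty request is split into a coarse hardware
   step (upper bits) plus a fraction (lower bits). Each PWM period we add
   the fraction into an accumulator; on overflow we emit coarse+1 instead
   of coarse, so the average duty hits the full 16-bit target. */

#define COARSE_BITS 8                 /* resolution the timing loop has */
#define FRAC_BITS   (16 - COARSE_BITS)
#define FRAC_MASK   ((1u << FRAC_BITS) - 1u)

static uint32_t frac_acc;

static uint32_t next_duty(uint16_t duty16)
{
    uint32_t coarse = duty16 >> FRAC_BITS;
    frac_acc += duty16 & FRAC_MASK;
    if (frac_acc > FRAC_MASK) {       /* fraction overflowed this period */
        frac_acc &= FRAC_MASK;
        return coarse + 1;
    }
    return coarse;
}

int main(void)
{
    /* Request 0x1280: coarse 18 plus fraction 128/256, i.e. 18.5 steps.
       The average over many periods should converge on 18.5. */
    uint32_t sum = 0;
    const int periods = 1000;
    for (int i = 0; i < periods; i++)
        sum += next_duty(0x1280);
    printf("average duty = %.3f coarse steps\n", (double)sum / periods);
    return 0;
}
```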
There *are* one or two ARMs available with hardware phased PWM now, but not the 168 MHz one, and I'm comparing the utility of its basic design; the Prop doesn't have hardware phased PWM, either. I'm not sure, but I don't think Chip had this application in mind when he designed the P1, and yet its design lets it do a better job at many things than a much newer, much more "powerful" chip (and I still had 7 cogs to do other things!).
The question on my mind is this: can the P2 duplicate this performance? What I mean is, the P1 is more powerful and more useful than anyone, probably even Chip, thought in 2006; regardless of its market performance, it's a stunning engineering achievement. The P2 is starting from a much higher base with its new goodies; can it be as much better in 7 years as the P1 is now? Think of the possibilities!
During his Skype talk at the Expo, Chip pretty much said that the only barrier to the 40 nm level and 1GHz was more than one and less than ten million dollars... and a little fussing with the pin definitions. (There is always fussing with the pin definitions.) When the engineers see the Prop2... that kind of money will be waiting in future orders. No more partners, no venture capital... firm orders with a very flexible delivery date:)
I would give favorable odds that the P3 gets delivered a lot faster than the P1 or P2:)
Seconded. It is proving quite capable at 1/3 clock speed. And the I/O pins really have a lot of functionality we cannot explore until we get real chips.
I remember the 40 nm P2+ idea being floated at UPEW '11 and didn't realize it's still being given serious consideration. A 1 GHz P3 would be a very interesting commercial product with GCC in the mix. You could probably run Linux off external RAM and flash/SD using just one of the cogs, a bit slower than a similarly clocked ARM, but then you have the other 7 cogs to do things the ARM can't do. That could be a powerful capability for embedded devices. And it would be as good as having a 32-cog P2, because the 1 GHz P3 threads would be comparable to (even faster than) an entire cog on a 160 MHz P2.
Yep... I'm seriously considering the medical applications:) P2 looking berry berry good to me. P3 would just be a dream come true.
The dirty little secret is that except for very high end products, medical technology tends to lag just far enough behind the bleeding edge that there are very nice markets still open to "the little guys."
Re: clock-switching per cog messing up HUB timing for the other COGs:
The Hub timing issue only came about because the feature was a late addition, implemented as a simple slicing of the singular pipeline. Multiple pipes could solve this, but that requires a significantly greater redesign. I've not looked at the details of pipelining in general, though, so maybe pipeline multiplexing costs silly amounts of extra transistors and wouldn't be reasonable anyway.
Time will tell if such a stringent design is even warranted. The P2 might not be any the worse off for the minor timing glitches that can happen when multi-threading in a single cog.
Ah, still reading back through the posts ... using the WAITxx instructions for low power consumption is a valid objective to pursue. For this alone there is reason to desire a pipe per hardware thread.
Out of curiosity, what would be the lowest-power approach to having one cog wait for another cog? Use the WAITxxx instructions on PORTD?
OK, maybe $3.00 or more? That is pretty substantial at 10,000 units plus.
I'm not sure if I'm following here, but I *think* you are taking issue with the price difference between PICs and the P2? It's a fair concern for mass production, sure. Until Parallax builds their own fabrication facility, I would not expect to see the P2 (etc.) be price-competitive with PICs. So, instead, focus on the other competitive properties: capability, reduced development time, a unified MCU (across multiple products), etc. Show that the P2 is worth the difference in cost. Does it mean that it will show up in mass-produced items? Probably not at first. But there are a lot of niches that the P2 is perfect for (I strongly agree with rjo's comments about medical equipment) that can still result in both high visibility and high production levels. The objective right now is to make people aware of it and get them asking for one of their own. From there, the P2 should pretty much sell itself.
A fab facility is a multi-billion-dollar proposition, so that is not what it takes to get to cheap parts these days. Most "semi-conductor" companies are fabless these days. What they do is get large volumes on multi-use parts. In most cases when a new "family" of parts is introduced, it is actually all the same die, just with different features enabled. If one of those parts gets used in some large-volume consumer product, they will rev the silicon with just those features. This is really nothing new: when I was in college (in prehistoric times) we upgraded an IBM 1130 from 16K to 32K of memory. We watched the IBM tech come out and remove a jumper.
The competition is not $1 PICs but $1 ARMs these days. Can the P2 compete there? But I think it will find some homes in the hobbyist community or in small and simple applications. This is where its simple design and ease of understanding can be applied. As for larger and more complex tasks, the memory limitations are quite severe, as performance drops off by a factor of 8 at each step from COGRAM to HUBRAM to SDRAM. Not to mention the difficulty of managing memory, as anyone who had to deal with x86 or Rabbit processors had to go through (which I did).
Note that Microchip does have three wafer fabrication facilities (Chandler and Tempe, Arizona and Gresham, Oregon), but it's also true that you can get cheaper chips without building your own facility.
$1 parts!? Would the P2 be playing anywhere in the same league (and that goes not only for ARMs but for PIC32s, dsPICs and others)? I suspect not. Price is nearly always a factor (sometimes even for hobbyists), but as volume goes up, it becomes a much larger one. That said, there is a market to be had in low-to-mid-volume applications for a higher-priced part *if* it simplifies the task and/or speeds up development. The reality check is, I think, this: the P2 will be, like its predecessor and perhaps even more so, primarily a niche part.