Been lurking around for a while... Propeller II has grabbed me back, hook, line and sinker. It's looking very good; in other words, it's evolving into a righteous microcontroller. However, it got me curious, and either Chip or you (or anybody) can answer this one: is the COG an 80486-like pipelined processor with multithreaded issue units, or superscalar with a thread pooling unit? I would like to have a superscalar in-order COG, as it would speed up SPIN or even handle doubled threads in the case of 8 microthreads, but I will still welcome it even if it's going to be just pipelined. Looking good, you guys.
Dr. Mario, I don't think you can quite characterize the Prop II cog as either type of processor. It's a simple pipelined processor, but some of the overlapping is done using specialized instructions (delayed jumps). There's a very simple threading unit that's optional to use that simply selects one of a set of state registers (PC and flags) for each instruction cycle in a round-robin fashion. There's no interlocking, that's the programmer's responsibility. There are some independent functional units like the multiplier / divider and CORDIC processor as well as the timer / counters.
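A minimal sketch of the round-robin selection described above, written as a toy Python model rather than anything taken from the actual Prop II design. The `TaskState` class, the four-task setup and the instruction names are all hypothetical; the only point is that each cycle picks the next state register set in a fixed rotation, and only that task's PC advances.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """Per-task state the threading unit switches between (PC and flags only)."""
    pc: int = 0
    z: bool = False
    c: bool = False


def run_round_robin(tasks, programs, cycles):
    """Issue one instruction per cycle, rotating through the task states.

    `programs` maps task index -> list of instruction names (hypothetical).
    Returns the issue trace as (cycle, task, instruction) tuples.
    """
    trace = []
    for cycle in range(cycles):
        t = cycle % len(tasks)             # fixed rotation, no interlocking
        state = tasks[t]
        prog = programs[t]
        instr = prog[state.pc % len(prog)]
        trace.append((cycle, t, instr))
        state.pc += 1                      # only the selected task advances
    return trace


tasks = [TaskState() for _ in range(4)]
programs = {0: ["a0", "a1"], 1: ["b0", "b1"], 2: ["c0", "c1"], 3: ["d0", "d1"]}
trace = run_round_robin(tasks, programs, 8)
for entry in trace:
    print(entry)
```

Because nothing in the model checks for hazards between tasks, any shared register or resource has to be coordinated by the code itself, which is the "programmer's responsibility" part.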
Ah, I see. It's similar to the Transputer, or maybe simpler than that. I would like to have extra integer horsepower, due to the increasing complexity of the projects - we may be halfway done picking the Propeller's silicon-based brain clean, though. It may be possible to run uCLinux (it runs on PIC just fine, given what's in the source package - in the demo file - FreeRTOS ditto). Nevertheless, it's looking good in terms of integer capabilities and improved RISC registers (in turn, we would be able to do more with it).
Dr. Mario and others,
The Prop II still is constrained by a 512 long word memory for the cogs. There are more powerful instructions. There are some new independent functional units (multiply / divide, CORDIC). There's some additional specialized memory for each cog that can be used as a stack. The I/O pins are much more powerful / sophisticated than those of the Prop I. As with the Prop I, the cogs of the Prop II will be best suited for I/O drivers and interpreters rather than directly running complex code. Some of the new features are specifically included to aid in making interpreters more powerful and more efficient than on the Prop I. Some of the new features are included to aid in making more complex and higher throughput I/O drivers, like built-in USB and built-in ethernet as well as higher resolution video. I suspect that any implementation of Linux or a RTOS will be done with a conventional high efficiency interpreter in one cog and low level I/O drivers (including an external memory management function) in the other cogs just like it's done now with CP/M on the Prop I.
Mike, sounds good, and I do think your comments on IO circuitry and software linkage seem right on - it's starting to look like a transputer (like the XMOS xCore, where you program it to dedicate the thread units to IO control and some background service), while its predecessor, the Prop I, has already done the same, only within limited COG RAM space (2 kilobytes - 512 longs). Still, it may not be a true transputer, but it's close enough compared to the T425 in terms of per-COG transistor count and how they both behave, from what I have observed in your, Chip's, and Beau's comments. Also, if it ends up walking and talking like a transputer, it would blow the xCore out of the water.
And go easy on Beau. :P He has done a lot and he's sure to inform us soon, maybe giving us a picture of the Propeller II's silicon die.
P.S. I wish this forum had a thank/like button... :P
The Prop (I or II) is nothing like a Transputer. The Transputer only had one processor. The Transputer's CPU was more traditional in that it executed code from main memory, not just its small internal register space like the Prop. There are few similarities.
Well, OK the Prop II will now support a hardware task scheduler to run multiple threads within a core which I believe is what the transputer did as well.
A better case can be made for comparing the Prop II to the XMOS devices. At least those can be had with multiple processors on a chip. They both have hardware thread scheduling. They both have tight coupling between CPU and I/O pins. Still they are chalk and cheese when you look closer.
Well, Heater - one more thing, the Transputer was also a multi-threaded processor - I have some older datasheets, up to the never-released T9000. I will just zip my mouth for now as it's getting closer to Chipmas... Once it gets in our hands, we will just find out ourselves - think of it as Pandora's Box - oops, sorry - black box.
Yes, I did mention the hardware thread scheduler similarity between Prop II and Transputer in my last post.
Shame the T9000 never materialized. I was at an Inmos presentation of the T9000 in London, could not wait for it to appear, so disappointing.
Not sure if the Prop II black box is so opaque any more as we have seen a lot of details discussed here, but yes the proof is in the pudding as they say.
Leon is very knowledgeable about the Transputer and the XMOS devices and I'm sure could give you a detailed critique and comparison of them to the Propeller I and the Propeller II design as we know it. A PM to him might be warranted.
It was 30 years ago when I was developing transputer-based systems. I've forgotten a lot of the details.
Whew, we dodged a bullet on that one! Mike, congratulations on your 20,000+ posts, but you should really think twice before posting a comment that mentions transputers, XMOS and Leon in the same sentence.
As the closest thing to a logistics manager for the above-mentioned Santa, I'll add you to the "nice" list of kids. You'll be easy to find, too.
We were intensely aiming for a November 7th submittal date to the foundry, but it appears that we need a few more weeks to get this entirely done. There's an important chunk of work to be done to verify that the layers, power and ground line up between our design and that from the synthesis company. Apparently there's not an easy way to do this with the tools we've been using, so Beau is going to get cross-eyed before this is all finished. Chip is nearly done with the ROM code and looks forward to calling it "finished".
I wonder, in hindsight, is there any way that crowd-sourcing could have been used to help with the design layout or mask inspection for a complex project like Prop II?
Unfortunately, crowd-sourcing is not possible for a variety of reasons - security being one of them. Meanwhile, I definitely feel sorry for Beau, as routing is definitely like finding a needle in a haystack...
Leon - yeah, the T9000 had a lot of potential due to its superscalar pipelines being better suited for faster context switches and handling of complicated threads. Too bad it cost Inmos a lot of money testing the fusion of T810 and T840 into one, not to mention that creating the test chips isn't cheap. In turn, Inmos was literally bleeding to death due to the cost of R&D on this chip - had Inmos gotten enough funds to continue, we could have had 28nm 64-bit, multicore, superscalar out-of-order Transputers - who knows, the 130nm Transputer might have been used in the PlayStation 3 instead of the Cell BE?
Meanwhile, the business of processor design is not easy - I am also aware of that. Not to mention the deadline... Ugh... :P
Think outside the box: when was the last time you contributed to a secret and complex text project by answering a simple CAPTCHA entry?
There are clever people that could make crowd-source inspection of a project like this happen.
It's not a good comparison comparing the Props to general purpose architectures where data throughput is the primary design driver. That factor is secondary on the Props. Instead, code timing is number one. Which is why there was so much debate over the inclusion of hardware threading.
Funnily, this environment is exactly where hardware threading gives the greatest bang for the buck.
As for extending the pipelining up to superscalar, I can safely say it's not going to happen in any future Prop design. The stalls created by one instruction per clock are bad enough without compounding them further with multiple instructions per clock. In fact, Chip has demonstrated a speed efficiency gain whereby 4-way interleaving the Prop2 threads, back to the equivalent of the Prop1, eliminates the branch stalling losses.
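To put rough numbers on why interleaving can hide branch stalls, here is a toy Python model. It assumes a taken branch leaves `branch_penalty` bubble cycles in a single-threaded pipeline, and that with `n_tasks` interleaved the other tasks' instructions fill those slots. The 3-cycle penalty and 20% branch rate are placeholders, not Prop II figures.

```python
def visible_bubble(branch_penalty, n_tasks):
    """Bubble cycles still exposed after other tasks fill the gap.

    With n_tasks interleaved, n_tasks - 1 instructions from other tasks
    issue before the branching task's next slot comes around.
    """
    return max(0, branch_penalty - (n_tasks - 1))


def efficiency(branch_penalty, n_tasks, branch_fraction):
    """Useful instructions per clock under this simplified model."""
    lost = branch_fraction * visible_bubble(branch_penalty, n_tasks)
    return 1.0 / (1.0 + lost)


# Assumed values for illustration only.
PENALTY = 3
BRANCHES = 0.2
for n in (1, 2, 4):
    print(n, "task(s):", round(efficiency(PENALTY, n, BRANCHES), 3))
```

In this model the exposed bubble drops to zero once the number of interleaved tasks exceeds the penalty, which is the effect described above for 4-way interleaving. Each individual task still runs at a quarter of the clock rate, so the gain is in total cog efficiency, not single-thread speed.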
I suppose, to be fair, it's not the fundamental architectures that cause the biggest and most unpredictable stalls in the hotheaded CPUs; rather, it's the CPU caches and their related less responsive, and usually slower, external busses and DRAMs.
Any PropOS is going to be doing its own program/data caching to various external memories, with active drivers living in Hub space and attached soft devices in Cog space, thereby partitioning what can be stalled separately from what can't be stalled.
Well, to be blunt: a superscalar pipeline is doable in a Propeller III design, as it has been done with minimal transistor budgets - the smallest superscalar designs I am aware of would be the Cortex R5 and/or A8 (superscalar in-order) - and having more than one thread actually keeps a superscalar pipeline busy - it's the program ordering and issue timing that matter. An out-of-order processor avoids pipeline stalls almost entirely by issuing whatever has its data available, or by picking "the nearest approximate" data as from a "decomposing tree symlink" (this is an example - soft-configured out-of-order VLIW always uses this branch prediction structure).
Just two ALUs are needed in an integer unit to qualify it as a superscalar processor - as long as there are separate pipelines (X, Y): while the X pipeline is too busy, the Y pipeline takes care of something else or just takes the X issue.
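A rough sketch of what two-pipe in-order issue amounts to, in Python. This is a generic dual-issue pairing rule, not a description of any Propeller or ARM design: the `Instr` type and register names are made up, and the rule only checks that the second instruction neither reads nor overwrites the first one's destination.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def can_pair(first, second):
    """True if `second` may issue in the Y pipe alongside `first` in X."""
    if first.dest is None:
        return True
    reads_result = first.dest in second.srcs       # read-after-write hazard
    same_dest = first.dest == second.dest          # write-after-write hazard
    return not (reads_result or same_dest)


def issue_cycles(program):
    """Count issue cycles for in-order dual issue (no stalls modeled)."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2             # both pipes used this cycle
        else:
            i += 1             # Y pipe idles this cycle
        cycles += 1
    return cycles


dependent = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r5", "r6")),
    Instr("r7", ("r1", "r4")),
    Instr("r8", ("r7",)),
]
independent = [Instr(f"r{n}", ("a", "b")) for n in range(4)]
print(issue_cycles(dependent), issue_cycles(independent))
```

The cycle count depends on how the instructions happen to depend on each other, so the same number of instructions can take a different number of clocks from one code sequence to the next.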
The problem with a Superscalar OoO design is that, if a pipeline stalls, even if other instructions can be fed in and completed, it destroys the timing determinism of the instructions.
Well, you're definitely adding stages, not to mention bus width, just to prepare for an extra ALU. You're also doubling the number of register ports. I'm not seeing minimal written on this.
All this is still limited to the 508 useful words of immediate program and data. That's one reason why most were happy with the 4 threads idea I believe. Having more is likely to be hitting the Cog memory limit too hard.
I know I'm repeating myself a little here but I think it needs it:
Focusing on high average execution rate is not how the Prop is structured. Irregularities, even relatively uncommon ones, that can cause rather substantial stalls are completely undesirable.
For example, each Cog gets a fixed time slot access to Hub; other Cogs don't get extra if one doesn't use its allocation. That's also a big difference between time slicing (suits the Prop) and interrupts (doesn't suit the Prop). Interrupts only consume cycles upon an external event. Time slicing is always consuming cycles even if it's only polling for an external event.
In both examples, what is being maintained is the ability for tasks/Cogs to not interfere with each other's timing, e.g. when that event occurs, the task that is polling it can rush off and do its thing in parallel with all the other tasks and, importantly, no changes in timing of the other tasks will occur.
Whereas a generic interrupt would normally divert CPU cycles, upsetting the user space timing. When there is hardware buffering and DMA controllers involved, then the software normally doesn't have to poll or perform any of the hard realtime data transfers. Then the software doesn't care about timing, it just cares about processing speed (throughput).
The Prop does its controllers as soft devices ... bit bashing ... so it needs to handle the, normally hardware, timings in software and also do this in a cooperative manner that allows more than one soft device concurrently.
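The fixed hub slot mentioned above can be sketched in a few lines of Python. The cog count and clocks per slot are parameters with placeholder defaults, not confirmed Prop II numbers, and `hub_wait` is a made-up helper name. What it shows is that the wait depends only on the cog's own position in the rotation, never on what the other cogs are doing.

```python
def hub_wait(cog_id, now, n_cogs=8, clocks_per_slot=1):
    """Clocks a cog waits from clock `now` until its next fixed hub window.

    The rotation is strictly periodic, so unused slots are not handed to
    other cogs and no cog can delay another.
    """
    period = n_cogs * clocks_per_slot
    slot_start = cog_id * clocks_per_slot
    return (slot_start - now) % period


def worst_case_wait(n_cogs=8, clocks_per_slot=1):
    """Upper bound on the wait: one full rotation minus one clock."""
    return n_cogs * clocks_per_slot - 1


# Cog 3 asking for the hub at different points in the rotation.
for now in (0, 3, 4):
    print("clock", now, "-> wait", hub_wait(3, now))
print("worst case:", worst_case_wait())
```

That fixed upper bound is the property an interrupt-driven design gives up: there, how long a piece of code takes depends on when external events happen to arrive.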
Parallax doesn't have the resources to compete head on with major ARM vendors like TI or ST, and if they do what you want, that's where it will place them. I'd hate to be Parallax's sales dude if that happens. Where it's at as a high performance controller seems to be a fine niche, provided they target it right and give good support, especially for 3rd parties.
For sure Parallax would have a hard time in the ARM market. There are multiple vendors there already producing bazillions of chips at approaching zero cost. Designing yet another ARM-like or traditional architecture processor is a waste of time.
Better to do something unique like the Prop and have a niche that those other guys can't or won't fill.
However, for a long time now I have been dreaming of a credit card sized board carrying an ARM and one or more Props. The ARM runs Linux and gets to do all networking, file system and other high level stuff. The Prop provides a ton of I/O and real-time interfacing.
The closest I'm getting to that now is a Raspberry Pi with a Prop on a piggy back board.
Xilinx Zynq + a Propeller 2? Maybe jazzed will make us a board like that! The programmable logic on the Zynq would probably allow for a very fast interface between the ARM and the COGs on the P2.
A Zynq or A8 Sitara coupled to a P2 would be a nice board, but it's not a DIP board project since both are BGA beasts. Designing such a board would take some time, money and engineering resources.
But it's a great way to showcase the P2. Host QNX on the ARM side while running data acquisition programs on the bare metal on the P2 end.
If you don't want to go that route, make a BeagleBone Cape board (it stacks on the BeagleBone board). Since the BeagleBone is already an established item with plenty of mindshare, that would be one way to leverage both the P2 and the ARM. The Raspberry would work too.
Ok, ok... Boys, put down your clubs. Fair enough. However, even if COGs were to have superscalar pipelines, it would remain in the niche market: being a hobbyist's Swiss-army knife. Even so, it has a few things going for it: easy programming (no longer the case with Prop II unless the Spin interpreter kernel is flashed into SPI flash or MRAM), reprogrammable/configurable IO logic, and lower power. I do have a choice between a Freescale PowerPC microcontroller (a true superscalar microcontroller based on the 603e) or this chip. I don't have a negative view of Prop II, I'm just bummed out, BUT it can be used as a vector processor if programmed correctly - in effect, it does have an advantage as a multi-core microcontroller.
EDIT: I have a question about the clock gating: are all the COGs' integer unit clocks tied together (all running at the same frequency), or is the clock gating independent? And will bad things happen if I remove a COG's own 1.8V power? (If yes, it would probably be wise to tie all 1.8V Vcore to a VRM.)
Buying a BeagleBoard? Please. That's silly if you just want a processor IC to build into your board - that would be a waste of my time. Naturally, if I want a superscalar OoO microcontroller, my choice would be the Freescale MPC5200, as there's no Cortex A9 / R7 MCU in stock at Mouser or Digi-Key (those are currently for high-volume buyers only).
Comments
Any crumbs (updates) for us???
Patiently waiting for Chipmas
Eldon - WA0UWH
-
If you want an A8, just buy a BeagleBoard.
Now, having seen what's going on with Adapteva putting an ARM together with their Epiphany processor I start to dream of Parallax creating a board like that for the Prop II.
Adapteva
Epiphany Multicore Accelerator (16 or 64 cores)
Zynq-7010 Dual-core ARM A9 CPU
Just now the thing to do is make it as Raspberry Pi compatible as possible. Also APC compatible. There you have a huge market of curious techies.
If the QNX guys want to follow along they can do so if they like.
It was just a suggestion. So don't flip and get your feelings hurt. Sheesh.
QNX is a nice RTOS and a lean one with design wins. Nothing wrong with it, except it doesn't fit the Linux fanboi criteria.
Next time I'll run it by you for approval before posting.
And, I guess I will just leave you guys alone.