I thought I would look at Intel's FPGA offerings today, since they bought Altera, whose FPGAs we used to develop the Propeller chips.
Their Stratix 10 series is the biggest and fastest they make, built in a 14nm tri-gate process. These chips can run up to $85k each.
There is a funny thing about the way Altera always priced their chips vs. their development boards. It's often less-expensive to buy the development board than the chip, itself.
Digi-Key is selling this Stratix 10 development board for $8k:
This chip is probably several times faster than what we last used (Cyclone V -A9) and it has 9x the logic, along with 28 MBytes of RAM across 11,700 M20K instances. It also has 3.7M registers. We could fully synthesize eight 16-cog/1MB-hub P2's in this thing, probably at 400MHz.
This would be the way to develop P3, while making cut-down versions for use on those Cyclone V -A9 boards that several of you have.
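As a rough sanity check of the RAM figures quoted above, here is a minimal Python sketch. It assumes each M20K block holds 20 Kibit and that every hypothetical 16-cog P2 keeps today's per-cog layout of 512 longs of register RAM plus 512 longs of LUT; both are assumptions, and block-mapping overhead (port widths, partial blocks) is ignored.

```python
# Rough block-RAM budget for the Stratix 10 figures quoted in the post.
M20K_BLOCKS = 11_700            # from the post
M20K_BITS = 20 * 1024           # assumption: 20 Kibit per M20K block

P2_INSTANCES = 8                # "eight 16-cog/1MB-hub P2's"
COGS_PER_P2 = 16
HUB_BYTES = 1 * 1024 * 1024     # 1 MB hub RAM per instance
COG_RAM_BYTES = 2 * 512 * 4     # assumption: 512 longs registers + 512 longs LUT


def block_ram_bytes(blocks=M20K_BLOCKS, bits_per_block=M20K_BITS):
    """Total on-chip block RAM in bytes."""
    return blocks * bits_per_block // 8


def p2_ram_bytes(cogs=COGS_PER_P2):
    """RAM needed by one hypothetical P2 instance (hub plus cog memories)."""
    return HUB_BYTES + cogs * COG_RAM_BYTES


available = block_ram_bytes()
needed = P2_INSTANCES * p2_ram_bytes()

print(f"available: {available / 2**20:.1f} MiB")
print(f"needed:    {needed / 2**20:.1f} MiB ({needed / available:.0%} of block RAM)")
```

Under these assumptions the eight instances use well under half of the block RAM, so memory would not be the limiting resource; logic and routing would have to be checked separately.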
''This chip is probably several times faster than what we last used (Cyclone V -A9) and it has 9x the logic, along with 28 MBytes of RAM across 11,700 M20K instances. It also has 3.7M registers. We could fully synthesize eight 16-cog/1MB-hub P2's in this thing, probably at 400MHz.''
8 Grand, ouch!
LOL, that certainly makes the development board seem like a good deal!
If you could develop such a 16 cog 1MB P2 in the future what sort of process would you need to be targeting, and what sort of clock rates would be realistic to anticipate for that beastie?
@cgracey said:
I thought I would look at Intel's FPGA offerings today, since they bought Altera, whose FPGAs we used to develop the Propeller chips.
Their Stratix 10 series is the biggest and fastest they make, built in a 14nm tri-gate process. These chips can run up to $85k each.
There is a funny thing about the way Altera always priced their chips vs. their development boards. It's often less-expensive to buy the development board than the chip, itself.
Digi-Key is selling this Stratix 10 development board for $8k:
This chip is probably several times faster than what we last used (Cyclone V -A9) and it has 9x the logic, along with 28 MBytes of RAM across 11,700 M20K instances. It also has 3.7M registers. We could fully synthesize eight 16-cog/1MB-hub P2's in this thing, probably at 400MHz.
This would be the way to develop P3, while making cut-down versions for use on those Cyclone V -A9 boards that several of you have.
CGracey, you're right, and I also can't understand the prices for those things. I guess maybe they don't want to let "child brains" develop anything with them. So, if you're not a company, you're out. Anyway, imagine: in my country most people earn about $3,000 a year, or if you have a decent job maybe you could get about $10,000 a year. Either way, Stratix does not exist for us.
And yes, it sure could be a great platform to develop P3. I understand you; maybe you're anxious to get to work on P3 now, but wouldn't it be better to put in a bit more effort to finish all the pending P2 things first? (I mean, a good datasheet with a full Spin2 language explanation.) Just a comment.
Altera's (or now Intel's) pricing policy is "list exorbitant prices, give discounts to those who ask for them". You have to negotiate like at an oriental market. A friend of mine pays $20 for Cyclone V devices that are listed for over $100.
@ManAtWork said:
Altera's (or now Intel's) pricing policy is "list exorbitant prices, give discounts to those who ask for them". You have to negotiate like at an oriental market. A friend of mine pays $20 for Cyclone V devices that are listed for over $100.
Wow! That is interesting. Maybe we could get those Cyclone V -A9 chips for $50.
Chip: I think your time would be well spent on the marvelous things that a P3 might/will be able to do someday. I also understand the desire to practice your "Art" at the highest level.
With that said, I hope that you have a "Chip Jr." somewhere in the wings ready to do some of the more mundane designs that will make Parallax a true top contender in the microcontroller field. The P1 was and still is a marvelous creation. The fact that there was "room" reserved for A and B ports made it "Obvious" that there would be other versions of the P1 following up soon.
I had Intelligent Motion Systems ready to manufacture motion control systems with the P1 several years ago, but I told them a new "Version" of the chip (with code protection) would be out "soon" and I thought we should wait for it. Honestly, from an engineering standpoint, a P1 with analog, a P1 with 64 I/O and perhaps a bit more memory, or even a $2 single-cog P1, all made such sense that Of COURSE Parallax would be bringing one or more of these things out soon.
The first year the P1 came out, I tried to bid it into a company's control system. However, in 100,000 piece quantities, saving $4 per unit using a less glamorous processor meant the company saved or made an additional half million dollars or so. Two cogs in a 10-14 pin package (with onboard eeprom, rom, etc.) would have done the job just fine... but it was never to be. I lost that bid to someone using a Pic. Intelligent Motion Systems also lost interest after a year or so waiting for the P2.
I'm pretty well out of the game nowadays. I offered Ken my functional 6-axis CNC code for the P2 three weeks ago but didn't even hear back from him, so I guess I'm done trying to design Parallax into projects.
I'm writing this not for myself, but in hopes that new, younger engineers excited about the incredible new P2 won't get stuck in similar engineering situations with the expectations that the "Right" P? will be "coming out in the next year or so". If you're going to go onto a P3 design ( which of course you will) either make sure your customers know those "other" versions are never going to happen or take some pity on them and find a hot-shot "Chip" wannabe and let him/her flesh out the P2 family in a way that does make better engineering/financial sense for high volume applications.
I haven't even looked at the Pic family recently, but when you look at the price/capability of the new Arduinos, ESP-32, the Pi Pico, etc. (most fully functional solder onto a board systems for less than the price of a P2 chip alone ), designing a product around a $15 P2 will rarely make any more financial sense than designing a product around that $85,000 Stratix does. If you don't come out with some of those "follow up" designs discussed, I worry that the P1 and P2 will always be relegated to little niche markets of a few hundred units at most.
Yes... just for grins, I'd love to see what you could do with a boatload of money and a 7nm process. But honestly, I never even used all 8 cogs on the P1, much less the P2. 200 more cores operating at 150 terahertz for $15 would still be less useful to me than that $2 P1 chip.
Best of luck to you and Parallax no matter which direction you go. Thanks for the memories. " What a long, strange trip it's been."
What the Smile! Give poor Chip a break there Ken. The Prop2 isn't cancelled. And Chip is still fully working on it. We all like to day dream a little. Chip's no different.
Regarding the Prop1, it was always going to be a one-off design. I don't know when this was first discussed in the forums but the tools used for designing the Prop1 were end of life when Parallax purchased them. That's why it never had iterations or variations.
Prop2, on the other hand, most certainly can have smaller siblings ... And other possibilities. It'll take some time though. The documentation and software tools for development with the Prop2 are a much bigger job than for the Prop1. The money won't be coming in from the Prop2 until those are sorted.
Ken Bash, I understand your frustration. In the P1, we had a very hard design approach, with everything being full-custom. We finished a layout for a 64-pin version of P1, but we could never get our layout tool to pass the layout-versus-schematic check, so we never made that chip.
Now, with P2, our methodology is much simpler and very modern. We have a whole family of P2 parts planned, like you described. We just need to show some volume in current sales to build them. It's a matter of changing some Verilog variables, and then the process is deterministic. I hope we'll get there.
Yes, a long, strange trip, indeed. Some have even passed away, along the course, like Sapieha (an Estonian living in Sweden). Others are back from the dead, some others lost. Newcomers have arrived, too. The strangeness continues.....
I think the first thing that should be done to expand the P2 family is to make a new SKU that's just the same P2 we already have, but tested to work properly @ 350 MHz.
The silicon apparently does it just fine; it's a matter of marketing.
@Wuerfel_21 said:
I think the first thing that should be done to expand the P2 family is to make a new SKU that's just the same P2 we already have, but tested to work properly @ 350 MHz.
The silicon apparently does it just fine; it's a matter of marketing.
The ADCs get very noisy up at that frequency. And there's not much temperature margin. And it takes chips that have particularly fast process characteristics, which are hard to thoroughly screen.
Imagine if we could go to the 7nm tri-FET process. We could run at 4GHz, probably. Can you imagine 10x the speed? That would be nuts.
We have what we have and it is as it is. If we want to have more in the future, it's our turn. If I had no need for more cogs I wouldn't have grasped the concept of parallel processing. Concepts are independent of real "hardware". Divide and conquer is such a concept. If one is not happy with the power of the P2, one will never be happy. There was a time when Chip, at an unofficial Propeller meeting, late in the evening, said: what about the next Propeller, and let's have additional memory to exchange data. Or something similar. Few were present. But those for sure will remember that evening forever. And now we have so much more than we could imagine that evening, and we watch those who still want more and more and more. Let us show an example where the P2 must be paired with a second one to solve a problem, and present a solution everybody wants and is ready to pay for. Then P3 will be available, because those who have to make a profit will invest at that moment. If you need a GRBL CNC controller board, you can buy it for 40 bucks and it will do the job. But this board will not be able to self-adjust, to remember the work done and learn, to listen for spoken commands, or to entertain you with a plasma splash screen. And you will never feel the need to understand the firmware.
So, let us just be happy with the P2 for some years after all the years of anticipation!
We've got ten years of exploration ahead of us with the P2. We all have so much to achieve, including finding the killer app. Let's get productive with what we've got in our hands.
Chip,
I keep running into situations where I want the GETNIB/GETBYTE/GETWORD instructions to set the Z flag upon a zero result. Probably more than any other non-comparing instruction. I go look up the instruction sheet and am repeatedly reminded why you never included this ability. Those are already overloaded encodings. I've been looking hard at it now and decided an improvement would be to have the N encodings use the I bit instead of the Z bit. The S operand as an immediate ain't all that useful here. I know that splits up Z and C but it would help. EDIT: As a future Prop3 refinement of course.
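To make the request concrete, here is a small Python model of what the three field-extract instructions compute, with the wished-for Z result returned next to the value. It is only a behavioural sketch: the function names mirror the mnemonics, but returning a zero flag is exactly the hypothetical part, since the post says the current encodings leave no room for it.

```python
def get_field(value, index, width):
    """Extract field number `index` of `width` bits from a 32-bit value.

    Returns (result, z), where z models the wished-for Z flag: True when
    the extracted field is zero. The flag is hypothetical.
    """
    mask = (1 << width) - 1
    result = ((value & 0xFFFFFFFF) >> (index * width)) & mask
    return result, result == 0


def getnib(value, n):
    """Model of GETNIB: nibble n (0..7) of value."""
    return get_field(value, n & 7, 4)


def getbyte(value, n):
    """Model of GETBYTE: byte n (0..3) of value."""
    return get_field(value, n & 3, 8)


def getword(value, n):
    """Model of GETWORD: word n (0..1) of value."""
    return get_field(value, n & 1, 16)


field, z = getbyte(0x12005678, 2)
print(hex(field), z)
```

Without such a flag, the extracted value has to be tested in a separate step before a conditional can act on it, which is the extra work the post wants to avoid.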
@evanh said:
Chip,
I keep running into situations where I want the GETNIB/GETBYTE/GETWORD instructions to set the Z flag upon a zero result. Probably more than any other non-comparing instruction. I go look up the instruction sheet and am repeatedly reminded why you never included this ability. Those are already overloaded encodings. I've been looking hard at it now and decided an improvement would be to have the N encodings use the I bit instead of the Z bit. The S operand as an immediate ain't all that useful here. I know that splits up Z and C but it would help. EDIT: As a future Prop3 refinement of course.
For what it's worth, in my little niche P2 does pretty much everything I needed an MCU to do. Even the freedom to do layout unhindered, with all I/O pins being truly equal with very few exceptions, is a fresh breeze into what sometimes feels like a stale market with few new ideas done well. Everything can be tweaked till the cows come home - I'm guilty as charged, but let's say I'll gladly tweak stuff on my end and be happy with P2 as-is. For now, P2 is exactly how I imagined a "blue sky" type of an MCU. I was always DSP-deprived, since I like digital control loops over purely analog ones, and not only does P2 deliver in the DSP department - it also has neat, fast analog peripherals baked into every pin, letting the DSP be leveraged in ways otherwise limited to custom silicon. The obscene memory bandwidth is a good match for many DSP applications, and I appreciate that sincerely. P2 is not perfect, but neither am I nor my designs, so all's good in my book, eh?
As for the 7nm process: I'm not sure yet how I'd use it - lots of other things would need to change, most notably the instructions would need to be 40 or 48 bits wide to accommodate the larger address spaces. Given the architectural decisions made thus far, with a simple two-layer memory hierarchy, such a process would surely let us have more memory, and probably more COGs and pins too, on a smaller die, but I'm not so sure the speed benefits would be as great as expected - the interconnect would get in the way of everything. My expectation is that simulations would bear out that, with the current architecture, making things more numerous (COGs, COG RAM, HUB RAM, pins) would be easy, but making things an order of magnitude faster would be nigh impossible. I'm betting that a 44-bit instruction set, with 44-bit HUB and COG memories (and many more words, too), with maybe 16 or 24 COGs, would be comfy around 900MHz, with 1.2GHz being the "hot rod" speed. 4GHz would need a product vastly different from what P2 is, at a fundamental, architectural level. Maybe I'm wrong, and I sure don't want to seem like a pessimist, but P2's architecture does not lend itself to cheap speedups, and that's not a bad thing - it's what makes it so flexible. If anything, perhaps we could get that 1GHz speed with base instruction speed being 1 cycle instead of 2, if there were enough room to double the ports in COG memories. The LUT RAM distinction would probably also be gone. But 4GHz? I'll believe it when I see it - it would be cool marketing but most likely impractical. We would get single-cycle 32x32 multipliers in the 7nm process, and we could probably get basic floating-point arithmetic in every COG as well, and the CORDIC pipeline depth could maybe be halved, with more instructions added - perhaps even complete IEEE-754 support in the CORDIC in addition to the fixed point.
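One way to see where widths like 40, 44 or 48 bits could come from is to widen only the two operand fields of the current 32-bit instruction word. The sketch below assumes the P2 layout of a 4-bit condition, 7-bit opcode, three C/Z/I bits and two 9-bit operand fields (D and S). That the wider formats discussed above would keep everything else unchanged is purely an assumption for illustration, not something stated in the post.

```python
# Field widths of the current 32-bit instruction word, in bits
# (cond 4, opcode 7, C/Z/I 3, D 9, S 9).
P2_FIELDS = {"cond": 4, "opcode": 7, "czi": 3, "d": 9, "s": 9}


def instruction_bits(fields):
    """Total instruction width for a given field layout."""
    return sum(fields.values())


def widened(fields, operand_bits):
    """Hypothetical layout: same fields, but D and S grown to operand_bits."""
    return {**fields, "d": operand_bits, "s": operand_bits}


def addressable_registers(fields):
    """Registers directly addressable by one operand field."""
    return 1 << fields["d"]


for operand_bits in (9, 13, 15, 17):
    layout = widened(P2_FIELDS, operand_bits)
    print(operand_bits, instruction_bits(layout), addressable_registers(layout))
```

Under this assumption, 13-, 15- and 17-bit operands give 40-, 44- and 48-bit instructions, with 8K, 32K and 128K directly addressable registers per cog.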
Kuba, all those things you are talking about adding are what would inhibit getting to 4GHz. If you took what we have at the moment and built it in 7nm, there wouldn't be any extra congestion.
Would you believe that the cogs' 16x16 multipliers are actually the irreducible paths in the P2? They get the most optimized through placement and routing, then everything else gets optimized to be no slower than the multipliers. The multipliers were even placed within dedicated flops, so that no mux'ing delays would be incurred. And they were built from Synopsys DesignWare libraries, which are super optimized. Even so, they are the irreducible paths in the chip. Going to 32x32 multipliers would probably cut the attainable Fmax in half, in any process. So, all things holding as they are would be an easy approach to the highest speed in a smaller process. The design is very balanced, speed-wise, as it exists. Increasing memory sizes may present little time penalty, while doubling the cogs would be a bigger time penalty.
I wish we were in the position to make cog/hub/pin-reduced derivatives, already (staying in 180nm), so that we could upgrade the analog performance. You are probing the boundaries of what can be utilized together, in new ways. I think you are finding out what needs improvement. There are some simple things that could be done to improve the ADCs, that I know of. I think the DACs are pretty good, though maybe we could just get rid of the 990-ohm DAC and keep the 123-ohm DAC in each pin. It's a lot more useful, but the 990-ohm saves power where high-bandwidth is not an issue. The high-bandwidth of the 123-ohm DAC makes new stuff possible. Our internal 8-bit DAC is a little flimsy and gets knocked around by the comparator input. Maybe the 990-ohm DAC could just feed the comparator.
The kind of work you are doing is pushing the leading edge, so it's creating new things to think about. People can think about how to solve problems in new ways, given new possibilities. And it makes life more interesting.
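For a feel of why a 32x32 multiply is so much heavier than a 16x16 one, here is a Python sketch that builds the wide product out of 16x16 partial products, the way software could do it with a 16x16 primitive. It is only an illustration of the arithmetic: it says nothing about how the DesignWare multipliers are structured internally, and the helper names are made up for this example.

```python
MASK16 = 0xFFFF


def mul16(a, b):
    """Stand-in for a 16x16 unsigned hardware multiply (32-bit result)."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32(a, b):
    """32x32 unsigned multiply built from four 16x16 partial products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    low = mul16(a_lo, b_lo)        # weight 2**0
    mid1 = mul16(a_lo, b_hi)       # weight 2**16
    mid2 = mul16(a_hi, b_lo)       # weight 2**16
    high = mul16(a_hi, b_hi)       # weight 2**32

    # Summing the shifted partial products yields the full 64-bit result.
    return (high << 32) + ((mid1 + mid2) << 16) + low


print(hex(mul32(0x12345678, 0x9ABCDEF0)))
```

Four partial products plus the wide additions that combine them are four times the multiplier work of the 16x16 case, which gives some intuition for why widening the multiplier lengthens the path that already sets the clock rate.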
With no commercial project in mind, I'd like to see the next gen P3 to just step to the next cheapest feature size, which is probably around 90nm. This should give us 1MB of hub ram and a good MHz improvement.
I've said before, although shouted down, that not all cogs need or should be equal. I think Apple have just proven this with their M1 - some are performance cores and some are power cores. P2 will never be an M1, but the concept is sound.
That's laptops/phones. It doesn't apply to the desktop processors. It'll be the same with Intel's newest too. It's all about the advertised battery run time. I wouldn't touch a big.LITTLE design with a barge pole.
That said, the M1's four high power cores are certainly impressive. Apple now just needs a CPU with all the cores the same to go in their desktop boxes.
I'm wondering how those single-cycle 4GHz 32x32 multipliers are done by Intel and others. Maybe it's only thanks to long pipelines that the latencies are hidden? Still, a 7nm process should make the individual gates fast enough that I can't believe at least the 16x16 multiplier couldn't be made to work in a single cycle or at least much faster at two cycles. But it's plenty fast as-is anyway - I really don't need it any faster at all in P2.
If I somehow get some real-world "free cycles" in a couple of years, maybe I'll come over for a $1 internship to try and contribute in kind to P3 - maybe on the analog side of things, though. I've only made a multiplier once in my life, even if recently, and it is not a transferable skill. It's about 1,200 reed relays and does a 16x16 unsigned multiply in a single AC line half-cycle (it's for an absurdly silly project of mine). It takes 200us reed relays (and would be what I call robust if I could toss some 100us reeds its way), so it's probably cheating at that point. I'm actually in the middle of making a P2-based board tester for that thing, since I've been exercising it some and one (or maybe more) of the relays seems to have gotten slow and is causing occasional errors. Debugging that without automation is not my way of spending evenings, for sure. It's more of a typical case of being able to do something not being nearly enough to warrant actually doing it, ha.
@kuba said:
I'm wondering how those single-cycle 4GHz 32x32 multipliers are done by Intel and others. Maybe it's only thanks to long pipelines that the latencies are hidden? Still, a 7nm process should make the individual gates fast enough that I can't believe at least the 16x16 multiplier couldn't be made to work in a single cycle or at least much faster at two cycles. But it's plenty fast as-is anyway - I really don't need it any faster at all in P2.
If I somehow get some real-world "free cycles" in a couple of years, maybe I'll come over for a $1 internship to try and contribute in kind to P3 - maybe on the analog side of things, though. I've only made a multiplier once in my life, even if recently, and it is not a transferable skill. It's about 1,200 reed relays and does a 16x16 unsigned multiply in a single AC line half-cycle (it's for an absurdly silly project of mine). It takes 200us reed relays (and would be what I call robust if I could toss some 100us reeds its way), so it's probably cheating at that point. I'm actually in the middle of making a P2-based board tester for that thing, since I've been exercising it some and one (or maybe more) of the relays seems to have gotten slow and is causing occasional errors. Debugging that without automation is not my way of spending evenings, for sure. It's more of a typical case of being able to do something not being nearly enough to warrant actually doing it, ha.
For super-time-critical logic circuits, Intel may lay out custom zipper logic which, instead of being CMOS, is PMOS, then NMOS, then PMOS, then NMOS.... You precharge the whole thing to a clear state, introduce the inputs, and then let it rip, halving your parasitics and propagation time. That might be how they are getting such speeds. It might just all be pipeline, though. If only someone from Intel could tell us, we'd know.
I can confidently up your offer of $1 to $100 to come be an intern. Working at the transistor/resistor/cap level on the chip is a lot of fun. The big question is if we'll have money to make another chip and if ON will be in agreement, as they want to see volume on the current P2.
Your multiplier sounds interesting. Did 8.33ms come up because of the convenience of AC power? If it slows down, you could ship it to Erna in Germany. Do you have a recording of it working? I imagine it sounds interesting to listen to. It probably changes tenor according to its inputs.
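The 8.33ms figure is one half-cycle of 60 Hz mains. A small Python sketch of that arithmetic follows, together with how many 200us or 100us relay switching times fit into one half-cycle. Treating the relay time as a fixed per-stage delay that simply adds up is an assumption made here for intuition only, not a description of how the relay multiplier is actually sequenced.

```python
def half_cycle_ms(line_hz):
    """Duration of one AC half-cycle in milliseconds."""
    return 1000.0 / (2 * line_hz)


def relay_steps(line_hz, relay_us):
    """Whole relay switching times that fit in one half-cycle.

    Assumption: each relay stage costs a fixed relay_us microseconds and
    stages switch strictly one after another.
    """
    return int(half_cycle_ms(line_hz) * 1000 // relay_us)


for hz in (60, 50):
    print(f"{hz} Hz: half-cycle {half_cycle_ms(hz):.2f} ms, "
          f"{relay_steps(hz, 200)} steps at 200us, "
          f"{relay_steps(hz, 100)} steps at 100us")
```

At 60 Hz that leaves room for roughly 41 sequential 200us relay operations per half-cycle, and about twice as many with 100us reeds, which fits the remark that faster reeds would add margin.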
Comments
Digi-Key is selling this Stratix 10 development board for $8k:
https://www.digikey.com/en/products/detail/intel/DK-DEV-1SGX-H-A/8637058
...while the chip, itself, lists for $34k:
https://www.digikey.com/en/products/detail/intel/1SG280LU3F50E3XG/10483935
Btxsistemas, yes, there is still lots of work ahead to support P2.
Rogloh, if we went to 90nm, we could certainly do 16 cogs and 1MB of hub RAM. It would probably run almost twice as fast. Maybe we'll get there.
Publison, $8k is a lot, but it would be a one-time expense by Parallax.
Wow! That is interesting. Maybe we could get those Cyclone V -A9 chips for $50.
Or the $80K for $100
Best of luck to you and Parallax no matter which direction you go. Thanks for the memories. " What a long, strange trip it's been."
Ken Bash
There can be graphs in the datasheet with dotted lines out beyond rated spec, indicating expected or typical behaviour, but not guaranteed.
+1
"How much is enough? Just a little bit more." That's mankinds greatest strength but also its greatest weakness.
Agreed with the thoughts stated above.
We've got ten years of exploration ahead of us with the P2. We all have so much to achieve, including finding the killer app. Let's get productive with what we've got in our hands.
Ken Gracey
Is there anything in particular you are attempting to apply that quote to?
Chip,
I keep running into situations where I want GETNIB/GETBYTE/GETWORD instructions to set the Z flag upon a zero result. Probably more than any other non-comparing instruction. I go look up the instruction sheet and am repeatedly reminded why you never included this ability. Those are already overloaded encodings. I've been looking hard at it now and decided an improvement would be to have the N encodings use the I bit instead of the Z bit. S operand as an immediate ain't all that useful here. I know that splits up Z and C but it would help. EDIT: As a future Prop3 refinement of course.
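For readers less familiar with these instructions, here is a minimal Python sketch of the field extraction they perform, plus the wished-for zero flag. It is only a model: the function names mirror the mnemonics, but `with_z` is a hypothetical helper, since the P2 as it stands does not update Z for these instructions.

```python
def getnib(s, n):
    """Model of GETNIB: return nibble n (0..7) of the 32-bit value s."""
    return (s >> (4 * (n & 7))) & 0xF

def getbyte(s, n):
    """Model of GETBYTE: return byte n (0..3) of the 32-bit value s."""
    return (s >> (8 * (n & 3))) & 0xFF

def getword(s, n):
    """Model of GETWORD: return word n (0..1) of the 32-bit value s."""
    return (s >> (16 * (n & 1))) & 0xFFFF

def with_z(result):
    """Hypothetical: pair a result with the Z flag the post asks for."""
    return result, result == 0

value = 0x12005678
print(with_z(getbyte(value, 2)))   # byte 2 is zero, so Z would be set
print(with_z(getnib(value, 0)))    # nibble 0 is non-zero, so Z would be clear
```

Today the same effect needs a second instruction to test the extracted field, which is the extra step the request is trying to remove.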
Hindsight is wonderful
PM sent.
For what it's worth, in my little niche P2 does pretty much everything I needed an MCU to do. Even the freedom to do layout unhindered, with all I/O pins being truly equal with very few exceptions, is a fresh breeze into what sometimes feels like a stale market with few new ideas done well. Everything can be tweaked till the cows come home - I'm guilty as charged, but let's say I'll gladly tweak stuff on my end and be happy with P2 as-is. For now, P2 is exactly how I imagined a "blue sky" type of an MCU. I was always DSP-deprived, since I like digital control loops over purely analog ones, and not only does P2 deliver in the DSP department - it also has neat, fast analog peripherals baked into every pin, letting the DSP be leveraged in ways otherwise limited to custom silicon. The obscene memory bandwidth is a good match for many DSP applications, and I appreciate that sincerely. P2 is not perfect, but neither am I nor my designs, so all's good in my book, eh?
As for the 7nm process: I'm not sure yet how I'd use it - lots of other things would need to change, most notably the instructions would need to be 40 or 48 bits wide to accommodate the larger address spaces. Given the architectural decisions made thus far, with a simple two-layer memory hierarchy, such a process would surely let us have more memory, and probably more COGs and pins too, on a smaller die, but I'm not so sure the speed benefits would be as great as expected - the interconnect would get in the way of everything. My expectation is that simulations would bear out the fact that with the current architecture, making things more numerous (COGs, COG RAM, HUB RAM, pins) would be easy, but making things an order of magnitude faster would be nigh impossible. I'm betting that a 44-bit instruction set, with 44-bit HUB and COG memories (and many more words, too), with maybe 16 or 24 COGs, would be comfy around 900MHz, with 1.2GHz being the "hot rod" speed. 4GHz would need a product vastly different from what P2 is, at a fundamental, architectural level. Maybe I'm wrong, and I sure don't want to seem like a pessimist, but P2's architecture does not lend itself to cheap speedups, and that's not a bad thing - it's what makes it so flexible. If anything, perhaps we could get that 1GHz speed with the base instruction speed being 1 cycle instead of 2, if there'd be enough room to double the ports in COG memories. The LUT RAM distinction would probably also be gone. But 4GHz? I'll believe it when I see it - it would be cool marketing but most likely impractical. We would get single-cycle 32x32 multipliers in the 7nm process, and we could probably get basic floating-point arithmetic in every COG as well, and the CORDIC pipeline depth could maybe be halved, with more instructions added - perhaps even complete IEEE-754 support in the CORDIC in addition to the fixed point.
Kuba, all those things you are talking about adding are what would inhibit getting to 4GHz. If you take what we have at the moment and build it in 7nm, there wouldn't be any extra congestion.
Would you believe that the cogs' 16x16 multipliers are actually the irreducible paths in the P2? They get the most optimized through placement and routing, then everything else gets optimized to be no slower than the multipliers. The multipliers were even placed within dedicated flops, so that no mux'ing delays would be incurred. And they were built from Synopsys DesignWare libraries, which are super optimized. Even so, they are the irreducible paths in the chip. Going to 32x32 multipliers would probably cut the attainable Fmax in half, in any process. So, all things holding as they are would be an easy approach to the highest speed in a smaller process. The design is very balanced, speed-wise, as it exists. Increasing memory sizes may present little time penalty, while doubling the cogs would be a bigger time penalty.
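To give a feel for why a 32x32 multiply is so much more work than a 16x16 one, here is a small Python sketch that builds the wide product from four 16x16 partial products plus shifted adds. This is only an arithmetic illustration with made-up helper names (`mul16`, `mul32`); it is not a description of how the P2's DesignWare multipliers, or any synthesized 32x32 multiplier, are actually constructed.

```python
MASK16 = 0xFFFF

def mul16(a, b):
    """Stand-in for one 16x16 -> 32-bit unsigned multiplier (assumed primitive)."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b

def mul32(a, b):
    """32x32 -> 64-bit unsigned product from four 16x16 partial products."""
    al, ah = a & MASK16, (a >> 16) & MASK16
    bl, bh = b & MASK16, (b >> 16) & MASK16
    ll = mul16(al, bl)          # weight 2^0
    lh = mul16(al, bh)          # weight 2^16
    hl = mul16(ah, bl)          # weight 2^16
    hh = mul16(ah, bh)          # weight 2^32
    return ll + ((lh + hl) << 16) + (hh << 32)

print(hex(mul32(0x12345678, 0x9ABCDEF0)))
```

Four multiplies and three wide additions replace a single 16x16 operation, so either the combinational path gets longer or the work has to be spread over extra cycles.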
I wish we were in the position to make cog/hub/pin-reduced derivatives, already (staying in 180nm), so that we could upgrade the analog performance. You are probing the boundaries of what can be utilized together, in new ways. I think you are finding out what needs improvement. There are some simple things that could be done to improve the ADCs, that I know of. I think the DACs are pretty good, though maybe we could just get rid of the 990-ohm DAC and keep the 123-ohm DAC in each pin. It's a lot more useful, but the 990-ohm saves power where high bandwidth is not an issue. The high bandwidth of the 123-ohm DAC makes new stuff possible. Our internal 8-bit DAC is a little flimsy and gets knocked around by the comparator input. Maybe the 990-ohm DAC could just feed the comparator.
The kind of work you are doing is pushing the leading edge, so it's creating new things to think about. People can think about how to solve problems in new ways, given new possibilities. And it makes life more interesting.
I'm not finding the 8 cogs any limitation at all.
With no commercial project in mind, I'd like to see the next-gen P3 just step to the next cheapest feature size, which is probably around 90nm. This should give us 1MB of hub RAM and a good MHz improvement.
I've said before, although shouted down, that not all cogs need or should be equal. I think Apple have just proven this with their M1 - some are performance cores and some are power cores. P2 will never be an M1, but the concept is sound.
That's laptops/phones. It doesn't apply to the desktop processors. It'll be the same with Intel's newest too. It's all about the advertised battery run time. I wouldn't touch a big.LITTLE design with a barge pole.
That said, the M1's four high power cores are certainly impressive. Apple now just needs a CPU with all the cores the same to go in their desktop boxes.
I'm wondering how those single-cycle 4GHz 32x32 multipliers are done by Intel and others. Maybe it's only thanks to long pipelines that the latencies are hidden? Still, a 7nm process should make the individual gates fast enough that I can't believe at least the 16x16 multiplier couldn't be made to work in a single cycle or at least much faster at two cycles. But it's plenty fast as-is anyway - I really don't need it any faster at all in P2.
If I somehow get some real-world "free cycles" in a couple of years, maybe I'll come over for a $1 internship to try and contribute in kind to P3 - maybe on the analog side of things, though. I've only made a multiplier once in my life, even if recently, and it is not a transferable skill. It's about 1,200 reed relays and does a 16x16 unsigned multiply in a single AC line half-cycle (it's for an absurdly silly project of mine). Takes 200us reed relays (and would be what I call robust if I could toss some 100us reeds its way), so it's probably cheating at that point. I'm actually in the middle of making a P2-based board tester for that thing, since I've been exercising it some and one (or maybe more) of the relays seems to have gotten slow and is causing occasional errors. Debugging that without automation is not my way of spending evenings, for sure. It's more of a typical case of being able to do something not being nearly enough to warrant actually doing it, ha.
For super-time-critical logic circuits, Intel may lay out custom zipper logic which, instead of being CMOS, is PMOS, then NMOS, then PMOS, then NMOS.... You precharge the whole thing to a clear state, introduce the inputs, and then let it rip, halving your parasitics and propagation time. That might be how they are getting such speeds. It might just all be pipeline, though. If only someone from Intel could tell us, we'd know.
I can confidently up your offer of $1 to $100 to come be an intern. Working at the transistor/resistor/cap level on the chip is a lot of fun. The big question is if we'll have money to make another chip and if ON will be in agreement, as they want to see volume on the current P2.
Your multiplier sounds interesting. Did 8.33ms come up because of the convenience of AC power? If it slows down, you could ship it to Erna in Germany. Do you have a recording of it working? I imagine it sounds interesting to listen to. It probably changes tenor according to its inputs.
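As a quick check on the numbers in this exchange, the sketch below works out the half-cycle time and how many back-to-back relay operate times would fit inside it. It assumes 60 Hz mains (which is what 8.33ms implies) and treats the quoted 200us and 100us figures as worst-case operate times; the function names are made up for illustration, and how the real relay multiplier spends that budget is not stated in the thread.

```python
def half_cycle_ms(mains_hz):
    """Duration of one AC half-cycle in milliseconds."""
    return 1000.0 / (2 * mains_hz)

def relay_steps(mains_hz, relay_us):
    """Whole number of sequential relay operate times that fit in a half-cycle."""
    half_cycle_us = half_cycle_ms(mains_hz) * 1000.0
    return int(half_cycle_us // relay_us)

print(round(half_cycle_ms(60), 2))   # half-cycle length at 60 Hz, in ms
print(relay_steps(60, 200))          # budget with 200us reeds
print(relay_steps(60, 100))          # budget with 100us reeds
```

Under those assumptions, 200us reeds leave room for about 41 relay delays in series per half-cycle, and 100us reeds would roughly double that margin.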