Would 16 cogs and 1 MB of Hub be worth a speed reduction from 160 MHz to 120 MHz?
Plus the questions of higher price and larger package?
Is there a 'killer app' that cannot work at 120MHz, but can at 160MHz?
Is there a 'killer app' that needs 16 COGs and cannot work with 8?
The new GAP8 from Greenwaves is somewhat similar to the P2, but chases lower power.
It has 9 cores and a 250MHz PLL in a compact QFN84 package, but seems light on peripheral support.
https://en.wikichip.org/wiki/greenwaves/gap8
Chip,
Now you finally seem to be talking to the right people at OnSemi!
I presume the 120MHz question is because the routing length is the critical path in the larger die, whereas previously the ALU was the critical path?
Is it worth asking if they would build the wafers with a 50/50 mix of 1MB/16-cog and 256KB/4-cog, or some other ratio? Would there be any/much additional cost for this?
Were you really considering the interposer with SPI Flash?
If you did this, and if the connection was made from the external I/O pins P58-61 to the internal SPI flash, then it could be programmed as a traditional SPI device while holding the P2 in reset. When the P2 boots, it could load some of the SPI flash into the top of hub and start Cog 0 executing it (hub) as hubexec. This could mean that internal P2 ROM would not be required, and there would be no mask ROM to worry about getting wrong. If you wanted a debugger or whatever, you could pre-program this in your factory when testing the chips.
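That boot flow is easy to model. Below is a toy, desktop-runnable C sketch of the idea, not anything Chip has committed to: the 16KB image size, the flash_read helper, and the flash layout are all assumptions, and real hardware would bit-bang the standard SPI-flash READ command (0x03) on P58-P61.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FLASH_SIZE (1 << 20)      /* pretend 1MB SPI flash */
#define HUB_SIZE   (512 * 1024)   /* current P2 hub size */
#define BOOT_BYTES (16 * 1024)    /* boot image size: a pure guess */

static uint8_t flash[FLASH_SIZE]; /* external SPI flash contents */
static uint8_t hub[HUB_SIZE];     /* P2 hub RAM */

/* Stand-in for a real SPI-flash READ (command 0x03 + 24-bit address),
   which would be bit-banged on P58-P61 in hardware. */
static void flash_read(uint32_t addr, uint8_t *dst, uint32_t len)
{
    memcpy(dst, &flash[addr], len);
}

int main(void)
{
    /* flash gets programmed externally while the P2 is held in reset */
    const char image[] = "boot image";
    memcpy(flash, image, sizeof image);

    /* on reset release: copy flash into the top of hub; cog 0 would
       then coginit there in hubexec mode (not modeled) */
    uint32_t top = HUB_SIZE - BOOT_BYTES;
    flash_read(0, &hub[top], BOOT_BYTES);

    printf("hub[0x%05x..] = \"%s\"\n", (unsigned)top, (char *)&hub[top]);
    return 0;
}
```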
It is no doubt too late, but have you asked what OnSemi's next smaller feature size (120nm/90nm?) costs? Just pondering if there is any benefit in going to a smaller feature size at this late stage for P2, and its implications. Maybe we could get 250MHz???
Interestingly, according to Amkor datasheet specs, the second largest available exposed pad is a square of 10.3mm x 10.3mm, associated with both LQFP and TQFP 14x14 body sizes.
If not the same, it is very similar to the one already selected for assembling P2 dice.
The biggest one they offer, 11.0mm x 11.0mm, relates to the 28x28 body size package. Not a worthwhile increase, nor proportional to the increase in package dimensions.
Perhaps they could do a custom version for Parallax, but the extra tooling costs will apply, for sure.
Wow, that is a dilemma.
A naive calculation shows that 8 * 160 = 1280 and 16 * 120 = 1920, so clearly 16 cogs at the slower speed is 50% mo' better.
My humble attempt at an FFT on the Propeller, https://github.com/ZiCog/fftbench, runs faster with more COGs. Not linearly, of course. I forget the figures, but let's say spreading it over 4 COGs made it 3 times faster than one. That's great, but using half your COGs to do an FFT limits what else you can do with the machine. In that case more COGs would be more better, even if they are each a bit slower.
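To put a number on "not linearly": a minimal Amdahl's-law sketch, where the serial fraction s = 1/9 is back-fitted to that "4 COGs = 3 times faster" recollection rather than measured.

```c
#include <stdio.h>

/* Amdahl's law: speedup on n cogs with serial fraction s. */
static double speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    const double s = 1.0 / 9.0;   /* assumed: gives exactly 3x at n = 4 */
    printf("4 cogs: %.2fx\n", speedup(s, 4));                         /* 3.00 */
    printf("8 cogs @ 160MHz: %.0f MHz-equivalents\n", speedup(s, 8) * 160);   /* 720 */
    printf("16 cogs @ 120MHz: %.0f MHz-equivalents\n", speedup(s, 16) * 120); /* 720 */
    return 0;
}
```

Amusingly, with that particular s the two options tie at 720 "MHz-equivalents" apiece; the naive 1920 vs 1280 comparison only holds when s = 0.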
On the other hand, those that really want to tweak I/O at maximum rate would be disappointed with the drop in clock speed.
On the third hand, having more RAM space is always good.
Personally, I'd love to see 16 COGs and 1M of RAM at 120MHz.
It's an audacious thing the world has never seen before.
Perhaps I've become insane in the last months, and typing the following sentences could be considered testimony to some severe disorder trying to destroy my brain, but, anyway...
If and when an application does need more than 8 COGs, it will soon run short on the number of available pins too.
Holy Geometric Grail!
Why not try to preserve the MHz, while also gaining in number of COGs and pins too?
Two instances of the well-designed 8-COG/64-pin version, with 1MB of dual-ported HUB RAM BETWEEN them?
In a 20x20 package?
I know, I know, we can address only 64 pins, but we have a lot of things to be processed; some will depend on special smart pin configurations, some will need only digital data transactions.
Large and fast external memories will need a lot of digital pins to communicate with the COGs or the HUB. So will display controllers and removable digital storage.
176-pin packages would provide a lot of room, both inwards and outwards.
There would be twenty-eight externally-accessible groups of four pins, as per the current segregation of pins between power lanes.
Plus the four shared pins (XI, XO, RESn and TESn).
There would also be four unused pins.
Internally, there would be sixteen pins available; the quick tally below sums it up.
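A quick check of those numbers (the power/ground count is inferred from the 176-pin total, not stated above):

```c
#include <stdio.h>

int main(void)
{
    int total_io = 2 * 64;    /* two P2 dice, 64 smart pins each */
    int external = 28 * 4;    /* 28 externally-accessible groups of 4 = 112 */
    int shared   = 4;         /* XI, XO, RESn, TESn */
    int unused   = 4;
    printf("internal pins: %d\n", total_io - external);              /* 16 */
    printf("power/ground:  %d\n", 176 - external - shared - unused); /* 56 */
    return 0;
}
```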
Use your imagination. Perhaps insanity is not that bad, when shared!
Henrique
+1
Based on my experiments with P2, I personally would prefer 16 cogs and 1MB @ 120 MHz.
I found I used more cogs than expected because of conflicts between some of the cool features of P2.
For example, if you want to use hubexec you cannot use the streamer in that cog.
The same applies to LUT sharing, which uses the streamer channel as the interconnect bus.
In one application I split an intensive task across 12 cogs and synchronized them all with the COGATN feature (worked great!).
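For flavour, a rough desktop analogy of that fan-out-and-strobe pattern using pthreads; on the P2, COGATN/WAITATN is a single instruction pair, not a mutex/condvar, so this only models the semantics:

```c
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 12

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  atn  = PTHREAD_COND_INITIALIZER;
static int go;                        /* the "attention" flag */

static void *worker(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&lock);
    while (!go)                       /* ~ WAITATN: block until strobed */
        pthread_cond_wait(&atn, &lock);
    pthread_mutex_unlock(&lock);
    printf("cog %ld: off and running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    pthread_mutex_lock(&lock);        /* ~ COGATN: strobe all 12 at once */
    go = 1;
    pthread_cond_broadcast(&atn);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```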
Seems to me the 'more cogs, more RAM and more IO' request is solved.
In most of my projects with the P1 I was running out of:
a) Memory
b) Pins
c) COGs
I never ran out of speed.
For me the drop to 8 COGs was quite devastating. Even if the COGs themselves are more powerful, more COGs allow for more independent code running, the main aspect of Propeller programming.
@Heater. already mentioned the increased overall horsepower, and even at 120 MHz a P2 COG is 3 times faster than a P1 one.
So again, yes @Chip and @Ken, please consider the 16 COG, 1 MB, 120 MHz version as the first offering. Smaller/faster ones could follow later.
Enjoy!
Mike
wow. @Potatohead's shortest post, EVER...
Mike
Memory is, I think, the biggest factor in limiting what can and can't be done. Any amount more memory we can get over 512KB is a win in my book.
Sure, there are cases where more cogs would be handy, but the P2 cogs are so much more capable than before that I think 8 of them is plenty for most things. 16 would be nice, but I'd rather have even more memory in that space.
I'll ask On Semi more on Monday.
I know that without timing constraints, the design is amounting to 334k cells. To meet the 160MHz requirement, it's looking like 723k cells. Those extra cells are mainly buffers/inverters, but they increase the logic area from 11 sq mm to 16 sq mm. We need to evaluate what kind of savings can be had at 120MHz, and what kind of area 8 more cogs and 16 dual-port 512x32 RAMs would need.
So, at 120 MHz with 8 cogs and 512KB the chip would be smaller? Could you evaluate 8 cogs and 1MB RAM? 16 cogs would more than double hub RAM access times, as there would be twice the cogs and only three-quarters of the 8-cog clock speed.
How long, if everything goes well, from placing the first order of chips for testing until there are actual ICs we can buy? Will there be breakout boards? Any plans for something similar to the Propeller Professional Development Board?
I really like the sound of 512K for the Propeller! 16 Cogs and 1MB hub sounds even better. I can't begin to imagine what people will concoct with that much room and/or 16 cogs.
So glad to see the Propeller 2 at this stage.
Once the design is done, it takes about 12 weeks to get it through the foundry. We will initially get 40 prototypes (I think). There could be a lot more. If they work, most of them can go to you guys. It will be another 12 weeks to production, after that.
What we are building now is 8 cogs with 512KB hub and 64 I/Os.
So... what's limiting you from starting with the 16 cog/1MB version?
Would the larger die area bump the package size significantly?
Is there an FPGA that can emulate such a design?
Next stops after the 14x14 package look to be 14x20 (128 pins), 20x20 (144 pins), 24x24 (176 pins), or 28x28 (208 pins).
Given the P2 currently has only 64 I/Os, those pin counts are rather unbalanced?
Initial supposition that the die must fit in that Amkor package and 2W power dissipation drove 8 cogs and 512KB.
I don't think that there would be an affordable FPGA that could handle the whole design.
That would be too much work, at this point.
It is a lot of work to make a particular chip of any size. Each one would require a separate effort.
On Semi does have a 110nm process, but it is considerably more expensive.
Going to 1MB would really grow the die - much more so than 8 more cogs.
Right about increased hub latency with 16 cogs at 120MHz. Twice the clock cycles at 75% speed equals 37.5% latency performance and 75% streaming performance, compared to 8 cogs at 160MHz.
Something nice about 16 cogs is that they can be easily assigned unique subroutines in BlocklyProp without needing to conserve them.
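Those percentages check out. A minimal C sketch, assuming the worst-case hub wait is one full rotation (one clock per cog) and that streaming bandwidth simply tracks the clock:

```c
#include <stdio.h>

/* Worst-case hub wait: one slot per cog per rotation, in ns. */
static double slot_ns(int ncogs, double mhz)
{
    return ncogs / mhz * 1000.0;   /* ncogs clocks at mhz MHz */
}

int main(void)
{
    double a = slot_ns(8, 160.0);   /* 50 ns   */
    double b = slot_ns(16, 120.0);  /* ~133 ns */
    printf("8 cogs @ 160MHz:  %.1f ns worst-case hub wait\n", a);
    printf("16 cogs @ 120MHz: %.1f ns worst-case hub wait\n", b);
    printf("latency performance:   %.1f%%\n", 100.0 * a / b);         /* 37.5 */
    printf("streaming performance: %.1f%%\n", 100.0 * 120.0 / 160.0); /* 75.0 */
    return 0;
}
```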
My preferences, in order:
1. 1MB Hub RAM (would prefer even more, but I know there are currently built-in limitations)
2. More Cogs
3. More speed
What if, from the base 8-cog design, another 8 reduced cogs were added? These 8 would be paired with the 8 main cogs by sharing the LUT. They would not have access to hub RAM, nor any of the hub-pipelined instructions (maths instructions) etc., so these would be basically blind cogs.
This would therefore not impact the hub latency, which would remain at 8 clocks. Code would need to be loaded into the LUT by the paired cog and then started with coginit from LUT.
It's possible that some other silicon-heavy features could also be removed.
So what I am suggesting is that the currently paired (odd) cogs get some restrictions.
How fast can a quite small nibble RAM, 16 deep, be made? Is 1~2ns practical?
If you used one of those between the COG selector and COG, you could map any mix of COGs to time slots?
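Something like this toy model, where a 16-entry nibble table decides which cog owns each hub slot; the table contents here (cog 0 getting every other slot) are just an arbitrary example:

```c
#include <stdio.h>

int main(void)
{
    /* One 4-bit entry per time slot; slot i grants hub access to map[i].
       Here cog 0 gets every other slot, cogs 1-8 share the rest. */
    unsigned char map[16] = { 0, 1, 0, 2, 0, 3, 0, 4,
                              0, 5, 0, 6, 0, 7, 0, 8 };
    for (int slot = 0; slot < 16; slot++)
        printf("slot %2d -> cog %d\n", slot, map[slot]);
    return 0;
}
```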
Are these clocks the absolute minimum, meaning we should be able to overclock by 20-25%?
Stated Fmax will be for the worst process outcome, lowest voltage (1.8V - 10%), and highest temperature (150°C).