I will happily run at the ~3W I expect to be a realistic worst case with 8 cogs @ 160MHz, because where I don't need all that performance I can idle unneeded cogs, have them wait, etc., to reduce power draw.
But that means when I need the performance, it is there. If it were like the P1, at 4 cycles per op, the extra performance elbow room would not be there.
Absolutely! That is what you (and Bill) are not taking into account. Nowhere did we ask for a P1B to be single cycle (in fact Chip reminded us of that yesterday), just a slight improvement, and we specifically stated as much.
I like option 3 best, as it will get us a chip the quickest. WITHOUT your crippling suggestions. We simply address thermal concerns with multiple power envelopes at different clock speeds.
Option 2 would be nice, but take too much time. Again WITHOUT your crippling suggestions.
Option 1 would take too long; Chip has mentioned many times that the tool chain is broken and cannot (as I recall) do DRC checks for the layout.
Perhaps the cogs could be reduced to 4 and, at the same time, their memory increased fourfold.
Each task can have its own set of 512 longs. No need for register switching; each task can start executing at register 0.
The tasks can share the IN/OUT/DIR I/O registers, or each can have its own set, with the 16 sets OR'd onto the pins.
Of course tasks won't be able to communicate internally through registers, but with double the bandwidth to the hub they can communicate that way, or through the AUX RAM, or even port D.
This gives 4 full-speed P2 cogs, equivalent to 16 P1 cogs, with P2 features.
One interesting possibility would be a hub bandwidth switch (roughly quantified in the sketch below):
1) with 4 cogs, a hub window can occur every 4 clocks
2) even with 4 cogs, a hub window can occur every 8 clocks, with the 4 freed hub windows used for fast hub-wide data exchange between 2 Prop2s (4 windows) or in a daisy chain (2 windows to each neighbour)
In this way 2 Prop2s can give 8 cogs with background data exchange.
The hub RAM should not be reduced, nor should the new counter features be weakened.
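To put rough numbers on that bandwidth switch, here is a minimal sketch; the 160MHz clock and the "one 32-bit transfer per hub window" figure are illustrative assumptions, not chip specifications:

```python
# Back-of-the-envelope hub-bandwidth sketch for the proposed 4-cog modes.
# The 160MHz clock and "one 32-bit transfer per hub window" are assumptions
# for illustration, not chip specifications.

CLOCK_HZ = 160_000_000
BYTES_PER_WINDOW = 4

def per_cog_bandwidth(window_spacing_clocks):
    """Hub bandwidth available to one cog, in MB/s."""
    windows_per_second = CLOCK_HZ / window_spacing_clocks
    return windows_per_second * BYTES_PER_WINDOW / 1e6

# Mode 1: 4 cogs, one hub window every 4 clocks -> all slots go to cogs.
print("mode 1, per cog:", per_cog_bandwidth(4), "MB/s")               # 160.0

# Mode 2: each cog gets a window every 8 clocks; the other 4 slots in the
# rotation carry chip-to-chip exchange (4 to one partner, or 2+2 daisy-chained).
print("mode 2, per cog:", per_cog_bandwidth(8), "MB/s")               # 80.0
print("mode 2, inter-chip total:", 4 * per_cog_bandwidth(8), "MB/s")  # 320.0
```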
By the above calculations, the P2 will draw 625mW at 1.8V. ~264mW is not 625mW; the extra is in the instruction/ALU logic!
You missed my 'at a 20MHz-compatible Vcc'. Vcc scaling can give far more power saving - see the Atmel example. 20MHz is 1/8 of the process fMAX, so Vcc can be reduced significantly.
Using the Atmel scaling ratio as a rough guide, we come in at 170mW for a P2 at a 20MHz-compatible Vcc (not seeing any 'extra' here now?)
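As a sanity check on that kind of scaling, here is a minimal sketch assuming dynamic power scales as f * Vcc^2 and taking the 625mW / 1.8V figure above as a 20MHz number; the Vcc it solves for is only what this toy model implies, not a characterised operating point:

```python
# Rough CMOS dynamic-power scaling: P ~ f * Vcc^2.
# Inputs are the figures quoted above; the derived Vcc is only what this
# simple model implies, not a characterised operating point.

from math import sqrt

P_REF = 0.625      # W, estimate quoted above at 1.8V (taken here as a 20MHz figure)
V_REF = 1.8        # V
P_TARGET = 0.170   # W, the "Atmel-ratio" estimate at a 20MHz-compatible Vcc

# Same frequency, so the whole reduction must come from the Vcc^2 term:
v_implied = V_REF * sqrt(P_TARGET / P_REF)
print(f"Vcc implied by the model: {v_implied:.2f} V")   # ~0.94 V

def scale_power(p_ref, f_ref, v_ref, f_new, v_new):
    """Scale a reference dynamic-power figure to a new frequency and Vcc."""
    return p_ref * (f_new / f_ref) * (v_new / v_ref) ** 2

print(f"{scale_power(P_REF, 20e6, 1.8, 20e6, 1.2) * 1000:.0f} mW at 20MHz, 1.2V")  # ~278 mW
```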
This is a MUCH better April Fools' joke than what I found with the RASPBERRY PI, on the very day I needed the information.
I
WAS
LIVID.
I don't like it when I spend the day diagnosing a PCB that is missing its schematic, all because some wanker from the EU thought it funny not to include proper links to the latest schematic of the Pi PCB.
They changed their site to be all black and, for the joke, had not included all the proper documentation... they only included the two prior versions, and I was having trouble with the video output of the device.
Looking at the PCB, it was missing parts here and there, but they were on the previous two revisions of the schematic, so I thought my board was missing parts. Today I found the third link to the proper schematic, and found the board IS missing a small capacitor... directly on the RCA video output line... which, had I had the proper schematic, would have saved me a day of mucking with it. I understand jokes, but c'mon, no professional would mess with the schematics and a product's documentation. A minor forum post about a 5-watt chip, or even a blog post about some hardware joke... fine... I'll laugh...
But completely preventing your customers from finding the proper schematic and documentation for your products, for the WHOLE day, IS IDIOTIC. http://forums.adafruit.com/viewtopic.php?f=50&t=52002
Since you obviously don't believe me or jmg, why don't you ask Chip how many transistors the extra instructions take?
And how many transistors there are as a whole?
I strongly suspect the extra instructions' transistors take less than 5% (more like 1%) of the P2's transistor budget.
In which case, removing them would reduce power consumption by only 1%-5%.
But wait... without those instructions, *MORE* instructions would have to be executed to do similar things, taking more cog RAM, hub RAM, and time to execute. Frankly, probably increasing total energy use over time.
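For illustration only, a toy energy-per-task comparison; every number in it (the power saving, the instruction-count penalty, the instruction rate) is an invented assumption, not a measured P2 figure:

```python
# Hypothetical energy-per-task comparison. Every number here (power saving,
# instruction-count penalty, instruction rate) is invented to illustrate the
# argument; none of it is a measured P2 figure.

def task_energy(power_w, instructions, ips):
    """Energy in joules for one task: E = P * t, with t = instructions / IPS."""
    return power_w * (instructions / ips)

IPS = 160e6   # assumed instruction rate at full clock

full_isa = task_energy(power_w=3.00, instructions=1_000_000, ips=IPS)
# Suppose stripping the extra instructions saves 5% power, but the same job
# now needs 30% more instructions (extra shifts, masks, hub round-trips):
lean_isa = task_energy(power_w=2.85, instructions=1_300_000, ips=IPS)

print(f"full ISA: {full_isa * 1e3:.2f} mJ per task")   # 18.75 mJ
print(f"lean ISA: {lean_isa * 1e3:.2f} mJ per task")   # 23.16 mJ - more energy overall
```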
I don't need to Bill, and you know it!
The P1 had far fewer transistors.
The P2's core (instructions/ALU, counters and video) used 14.71mm2 prior to 12 March 2013 (a year ago). We now know it uses more than 14.71 + 3.3 ≈ 18mm2, because recently Chip realised he had to squeeze out more area for it.
So in one year, the instructions/ALU has gone from 14.71mm2 to 18+mm2. These are transistors, guys!!!
And the P1 had significantly less than the original 14.71mm2 of a year ago.
In fact the total die size for the P1 was reported to be 7.28mm2, and IIRC in a 360nm geometry. That is half the core area of the P2, in a geometry double the size.
Maybe the transistor count is just sitting in mid-air??? Come on guys, you really do know better!!!
Of course the transistor count increased in the last year.
It had a lot to do with the additional 128KB of hub RAM... doubling the hub RAM used a lot of area and transistors, even counting the savings from dropping the wider video bus.
The icache/dcache added a bit too, but nowhere near as much as the extra hub. The transistors for the extra instructions? Practically lost in the noise.
I think this 5W thing boils down to how to market the chip.
Probably a nice table will be needed that shows X cogs running at Y MHz needing Z watts, worst case (a toy example of such a table is sketched below).
Then the user can pick their own power budget and performance envelope.
For really low power, run the chip at 20MHz; even with 8 cogs it won't take anywhere near 1W then - maybe 0.5W.
Add liquid nitrogen cooling to run 8 cogs at 300MHz... and use 40W.
I see nothing wrong with rating it like that... tons of ARM chips sell at 48MHz, 72MHz, 104MHz, 120MHz, 168MHz, etc., at different performance/power envelopes.
With the P2, it would be one chip - choose your performance level, accept the corresponding power envelope.
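A toy generator for that kind of table might look like this. The model is deliberately crude (a small static floor plus dynamic power proportional to active cogs times clock), anchored only to the ~3W @ 8 cogs / 160MHz worst-case guess from earlier in the thread; none of it is characterised silicon data:

```python
# Toy generator for an "X cogs at Y MHz -> Z watts worst case" table.
# Crude model: a fixed static floor plus dynamic power proportional to
# (active cogs) * (clock), anchored to the ~3W @ 8 cogs / 160MHz guess above.

STATIC_W = 0.10                                   # assumed leakage + always-on logic
DYN_W_PER_COG_MHZ = (3.0 - STATIC_W) / (8 * 160)  # anchor the model to 3W worst case

def worst_case_watts(cogs, mhz):
    return STATIC_W + DYN_W_PER_COG_MHZ * cogs * mhz

cog_counts = (1, 2, 4, 8)
print(f"{'MHz':>6} | " + " | ".join(f"{c:>2} cogs" for c in cog_counts))
for mhz in (20, 80, 160):
    row = " | ".join(f"{worst_case_watts(c, mhz):7.2f}" for c in cog_counts)
    print(f"{mhz:>6} | {row}")
# At 8 cogs / 20MHz this model lands around 0.46W, consistent with the
# "maybe 0.5W" estimate above; at 8 cogs / 160MHz it returns the 3W anchor.
```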
Sorry, I guess I wasn't clear. I meant to have one COG that could do hub execution and 7 others that can't.
Removing instructions, and recent features cannot save a significant amount of power.
Lowering clock speed, removing whole cogs, dropping vcore are the only way to save significant power.
CMOS logic uses power to switch transistors; when the transistors are not switching, very little power is used. Therefore the number of different instructions, or hub vs cog execution mode, is basically irrelevant to the power envelope.
Chip,
Is it possible to use the same die in an FPGA-style BGA package and a QFP package? I am not sure whether the die in a BGA gets connected to the ball grid by vias.
The QFP144 package could be rated for slower clock speeds. Perhaps a central ground pad on the QFP144 version would help (jmg's suggestion, I think) - only necessary for power over x W.
For the FPGA-style BGA256 (presuming 1.0mm pitch), would we get away with requiring connections from only the outer 3 rows (60 pins outer, then 52 and 44 = 156 pins total)?
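The 60/52/44 ring arithmetic checks out for a 16x16 (256-ball) array; a quick sketch:

```python
# Quick check of the "outer 3 rows" ball counts on a 16x16 (256-ball) BGA.

def ring_count(grid, ring):
    """Balls in ring `ring` (0 = outermost) of a grid x grid array."""
    side = grid - 2 * ring
    return 4 * (side - 1) if side > 1 else max(side * side, 0)

rings = [ring_count(16, r) for r in range(3)]
print(rings, "->", sum(rings))   # [60, 52, 44] -> 156
```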
You know better Ray.
Oh Bill, really!
Of course the hub increased - that is disclosed above. But I am talking about the core (which you do realise, but it doesn't help your quest)! That size increased dramatically from the P1, to a year ago, to February 2014, and then to now. Those extra transistors are used in the core logic - the instruction/ALU/cache/hubexec/LIFO/etc/etc.
With the numerous design meetings and quotes we've developed around 180nm there has been no discussion about 90nm costs, so I have no constructive input in this regard without more information. We would need to know what further NRE, synthesis, layout and integration of the I/O to the core would be required to go to 90nm. Does it fully leverage our existing efforts or are there redesigns between you and Beau that require additional R&D time (and expense)? Time to market is an entirely different factor and probably more important at this stage. Depending on what 90nm really means for your design time and costs, it could be a case for KickStarter or similar funding mechanism.
While we can't go back, maybe we could choose the best path in consideration of (1) Customer requests, which have primarily been A/D, more RAM, faster, code protect, and more I/Os; and (2) Time to market. And keep in mind that your best products became complicated before they got simple again. If some features of P2 seem complicated to forum members (and these are skilled users who can exploit all features), maybe it's time to get back to something more simple.
But to compare (B) and (C) we'd need to understand how each alternative fares in regards to:
- design time (time to market), for your Verilog and manual layout by Beau
- synthesis costs, if any different
- any benefits in regards to testing the synthesized core with the manual layout and
- customer preference, given a choice
Not knowing these important pieces, I'm probably leaning towards (C) for a couple of reasons: maximizes C compiler performance with hubexec; provides plenty of cogs for most applications (production uses of P1 have limited use for video, so if it's a cog-eater in P2 this is something to consider); it fits in the whole FPGA; seems simple; and maybe the extra pre-emptive multi-tasking stuff stretches four cogs to a high potential anyway. Some customers have asked for more cogs than present in P1, but I believe even more customers in the beginning felt they didn't need all the power of P1's cogs.
If we can put the desire for features aside for a minute and understand the actual costs and time required to achieve (B) or (C) the answer would become clear to us, I think.
Here to help, so let me know what you need to help get this accomplished, even if it's a month in Mexico.
Ken Gracey
I guess it would be interesting to know which features draw the most power. There has been a lot of speculation here but not much real information. Does reducing the RAM sizes help? Does eliminating RAM blocks help? Does removing instructions help? I suppose they all help some but there may be more bang-for-the-buck for some than others. Does Parallax have any real data to indicate which features are causing the most power consumption?
That seems a valid discussion if/when the part fails to fit - which it may still do. Remember, the fit is not confirmed yet, just 'indicated'.
I know very little about the physics involved here. I thought that reducing the number of RAM blocks might reduce power consumption. You wouldn't need the cache or tag RAMs or the LRU logic in the 7 "normal" COGs. Maybe that savings isn't enough to make much of a difference though.
How about only one COG that can do hub execution? (running and ducking for cover!)
How about 2 ?
Unfortunately it needs a lot more than this removed. So the other 6 cogs need to be culled differently.
Otherwise, we accept the P2 power for what it is, and market it that way.
Or back to an advanced P1B in the interim until a 40/65/90nm P2 can be funded/built.
For the FPGA256 (presuming 1.0mm pitch) would we get away with only requiring connections from the outer 3 rows (60 pins outer, then 52 and 44 = 156 pins total)?
No.
Looking at the TI and JEDEC docs above, there is actually not much difference thermally between a BGA and a thermal-pad package, and a key item is the multiple thermal vias - their TQFP144 example uses 100, so all those inner BGA pins are needed as thermal paths.
They could be mostly GND pins, but they will need to be there.
A BGA does allow a different die design, but that is quite a big change from the hand-laid-out I/O ring we have now.
Transistors do not consume dynamic power; toggling nodes is what draws power,
via
ChipPower = SumOfAll(Cpd * Ft * Vcc^2)
(you can see instantly why lowering Vcc helps here)
So a resource that is present but not toggling consumes die area, but very few mW (static power only).
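A toy illustration of that sum - every Cpd and toggle-rate value below is invented purely to show the shape of the argument (quiet logic costs area, not dynamic power), not a P2 measurement:

```python
# Toy illustration of ChipPower = SumOfAll(Cpd * Ft * Vcc^2).
# The Cpd and toggle-rate values are invented purely to show the shape of
# the argument: logic that isn't toggling contributes ~nothing to dynamic
# power, whatever its transistor count.

VCC = 1.8  # V

# (block, effective Cpd in farads, toggle frequency Ft in Hz)
blocks = [
    ("cog ALUs (active)",        2.0e-9, 160e6),
    ("hub RAM accesses",         1.5e-9,  20e6),
    ("video + counters",         0.8e-9, 160e6),
    ("idle cogs / unused logic", 3.0e-9,   0.0),   # present, but quiet
]

for name, cpd, ft in blocks:
    print(f"{name:28s} {cpd * ft * VCC**2 * 1e3:8.1f} mW")

total = sum(cpd * ft * VCC**2 for _, cpd, ft in blocks)
print(f"{'total dynamic':28s} {total * 1e3:8.1f} mW")
```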
Unless Chip posts numbers to the contrary, the number of transistors taken by hubexec, the single dcache line, and the four icache lines looks trivial compared to the total transistor count of the P2.
I actually would be quite interested in a transistor count for a March last year cog, and a current cog. It would give us hard numbers to debate.
My point is, for power envelope purposes, that transistor count has to be compared to the whole die's transistor budget.
Based on the last few months of posts by Chip, I suspect the increase is less than 5% of the total transistor count. I actually think it's closer to 2%, but I readily admit that is a guess.
But even if it was 5%, the capabilities of hubexec (and the performance boost due to caching) would be well worth it.
Frankly I think Chip (and Beau for the I/O ring) did a fantastic job - earlier I posted that a 180nm Celeron had a 60W power envelope!
I was told an enhanced P1B from scratch would be cake. No tools dependencies. Having the money for it is another story.
I am certain this is the case. A lot of the P2 Verilog can be leveraged back into P1 with some additional benefits.
For some customers (unfortunately I don't know how many) an advanced P1B would be nice now. As I said above, it's quite doable with a few extras such as an ADC, more pins (I suggested 64, but 48 would probably be fine too), a faster clock and 1:8 hub, more hub RAM, removing the ROM, and adding the monitor and security.
i.e. leverage everything learnt and done for the P2 back into a better P1B at 180nm, then proceed with the P2 at 40/65/90nm.
I think this 5W thing boils down to how to market the chip.
Really? Doesn't 5W mean that just the P2 chip alone needs twice as much power as an entire Raspberry Pi single-board computer?
Assuming I have my maths right (always a big assumption!) I'm not sure how you could ever "finesse" your way around a deal breaker like that, which would appear to make the P2 unusable (or at least horrendously expensive) for embedded applications.
The chip we had made was likely 2 watts at high utilization.
This design at 80MHz isn't far from that, and is a whole lot more capable.
Really, it's not about gutting it. It is all about whether or not we must shrink it for broad enough acceptance.
Lol... I was thinking you could cut the power by 87.5% just by removing less than 1% of the instructions
- CogInit, CogRun etc
Well, apart from anything else, it means everyone's existing applications (many of which rely on having 8 cogs available) will have to be rewritten.
Not a very auspicious start for the P2, given that this kind of thing was a big contributor to the P1 failing to gain traction.
Ross.