In the past we had a design with many identical power consuming blocks on it, and were asked if we could skew the clocks to the various blocks to get the peak instantaneous power down. It wasn't a good fit for us, but something you could consider if this is a concern. It may not solve a real problem, and might result in major hub headaches.
Assuming that COGs can sleep when there's no work to be done, lowering the frequency alone won't necessarily result in longer battery life. Correct? (It may depend on how the batteries react to various current profiles.)
I guess, using Roy's car analogy, we're in a supercar: it has lots of cylinders and a high top speed, but only floor it if you are on a suitable racetrack (suitable power supply). Also, using lower octane fuel (reduced Vdd) can help keep things under control.
I'm not sure we should be as surprised as we are. Chip estimated 1.5 watts, before moving to the hotter OnSemi process. And the analysis is worst case.
For 3d printing applications, we just mount the P2 on the hotbed plate
A COG, despite its many transistors, consumes almost no power if it is not doing any work. Or at least I hope that is the case.
As a UART or PWM driver or whatever it has a lot of "free" time. When it actually has to work that takes energy of course. It should not be sitting in "idle loops" wasting power.
Anyway, regardless of that, Moore's law has come to an end and all the world's chip makers are looking at parallel solutions to make performance increases.
The "Propeller Philosophy" was perhaps ahead of the game. If it does not scale they all have a problem.
Parallel isn't the issue, "Parallel and Identical" is the issue. Making a COG well suited to running the business logic, UI, etc. ends up in many cases making it extreme overkill in the case of using it as a software driven peripheral.
A COG, despite its many transistors, consumes almost no power if it is not doing any work. Or at least I hope that is the case.
As a UART or PWM driver or whatever it has a lot of "free" time. When it actually has to work that takes energy of course. It should not be sitting in "idle loops" wasting power.
Almost, but not quite. The Clock tree consumes significant power, but a Gated Clock Tree can mimic lower CLK speeds.
A WAITxx can idle much of the other logic, but I'm not sure if Chip takes those opcodes to the COG Clock Tree gate level.
The other TASK compatible wait, jumps to self, and so that certainly needs the Clock tree and pipeline working.
To me the ideal is a method that allows full fSys precision on timing and waits (maybe this cannot include Tasked waits), but allows the COG to otherwise clock at a reduced speed to save mA/MHz power. (P1 does this already)
Don't those figures assume that the chip is mounted on a reference JEDEC PCB though?
I did not see any specific PCB artwork, but yes, they are Thermal PAD parts, and you do need to add thermal vias, and copper planes are implicit in that.
4 layer PCB is likely to allow more copper than 2 layer, but how many will use P2 on 2 layer PCBs ?
Thicker copper helps, but the 0.4mm leads will pinch the choices on copper weight.
Parallel isn't the issue, "Parallel and Identical" is the issue.
I see what you mean.
But that leads to an impossible situation. Ahead of time you don't know what is "business logic" and what is "device driver". And who is to say which one needs the most speed when it needs it, and so on.
Or are you saying we should have a Propeller that is 7 COGs as we know them for flexible device interfacing duty and one ARM core for doing the heavy duty "business logic"?
A solution I'm quite partial to. Except entering such a market against the big boys is a lost cause.
But we all know "Impossible" is just an invitation to a challenge around here!
There is a solution somewhere between what we have now and an ARM + Cogs, not sure exactly what it is yet...
So our salesman Mr Jones goes into the prospective P2 customer and pulls out his first P2 demo board:
- It is run by a LiPo battery and runs at 16MHz = 0.5W @ 1.8V = 275mA
Next he pulls out a P2 demo board with:
- It is run by USB and runs at 32MHz = 1W @ 1.8V = 555mA
Now he pulls out the last P2 demo board with:
- It is run by a 5V 1A power pack and runs 160MHz = 5W @ 1.8V = 2.8A. It has a miniature heatsink (+ miniature fan??)
- This board has multiple graphics etc, etc all running, giving 1280 MIPS
- He explains the very same P2 chip runs in all these modes !!!
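As a rough check on the three demo-board figures above, here is a minimal Python sketch that derives core current from I = P / V and peak MIPS from clock times cog count. The power figures and the 1.8V rail are the thread's estimates; the assumption of 8 cogs each retiring one instruction per clock, and the helper names, are only for illustration.

```python
# Supply current and peak MIPS for each demo board.
# Power figures are the thread's estimates, not measurements.
CORE_VOLTS = 1.8  # core rail quoted in the thread
COGS = 8          # assumption: 8 cogs, one instruction per clock each


def core_current_ma(power_w, volts=CORE_VOLTS):
    """Core supply current in mA for a given power draw (I = P / V)."""
    return power_w / volts * 1000.0


def peak_mips(clock_mhz, cogs=COGS):
    """Peak MIPS if every cog retires one instruction per clock."""
    return clock_mhz * cogs


boards = [("LiPo", 16, 0.5), ("USB", 32, 1.0), ("5V pack", 160, 5.0)]
for name, mhz, watts in boards:
    print(f"{name}: {mhz} MHz, {watts} W -> "
          f"{core_current_ma(watts):.0f} mA, {peak_mips(mhz)} MIPS")
```

The computed currents come out close to the rounded values quoted for each board.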
We did agree, although against previous Prop philosophy, that not all cogs need to be equal
Only 1 or 2 cogs are likely to really run large programs using HUBEXEC
None of us are intending to use these cogs for multitasking or multithreading as provided by P2
IIRC Ross said he would implement his own software model for multitasking/multithreading
Only 1 or 2 cogs are likely to use Video
Most cogs will be used to perform intelligent I/O in software
Not all cogs are equal now (Video DAC pins)
This gives scope to make two types of cogs, reducing appropriate features from each set.
Correct, but this saves Die AREA more than it saves Watts. Remember, Chip has ~90% Clock gating already
This may still be needed, as we do not have FIT confirmed
[*]Could we use the multitasking feature as a simple clock reduction:
We can specify up to 16 slots for tasks
If an instruction setup "idle mode" for tasks 1, 2 & 3 then:
The cog would only get time for the slots when SETASK allocated task 0
All other SETASK tasks would "idle" the cog
This would reduce the power by effectively reducing the clock to n:16
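The slot idea above can be sketched in a few lines of Python. This is purely a hypothetical model: the 16-slot schedule comes from the proposal, but "idle" tasks that gate the cog clock are a suggested feature, not something the P2 is confirmed to do, and the function names are made up for illustration. It also assumes dynamic power tracks the fraction of clocked slots, which ignores clock-tree and leakage overhead.

```python
# Hypothetical model of "idle" task slots acting as a clock divider.
SLOTS = 16  # number of task slots in the proposal


def active_fraction(slot_tasks, idle_tasks):
    """Fraction of the 16 slots in which the cog is actually clocked."""
    if len(slot_tasks) != SLOTS:
        raise ValueError("expected exactly 16 slot assignments")
    active = sum(1 for task in slot_tasks if task not in idle_tasks)
    return active / SLOTS


def effective_mhz(clock_mhz, slot_tasks, idle_tasks):
    """Effective instruction rate when idle slots are clock-gated."""
    return clock_mhz * active_fraction(slot_tasks, idle_tasks)


# Task 0 gets every fourth slot; tasks 1, 2 and 3 are marked idle.
schedule = [0, 1, 2, 3] * 4
print(effective_mhz(160, schedule, idle_tasks={1, 2, 3}))  # 4:16 of 160 MHz
```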
That's certainly interesting. It would need another set of 4 flags to say which Tasks were actually [Clock=OFF],
then those slots would run with no clock to the COG opcode engine at all, and so combine with the main Clock-Speed control,
to give more granular control of power.
So our salesman Mr Jones goes into the prospective P2 customer and pulls out his first P2 demo board:
- It is run by a LiPo battery and runs at 16MHz = 0.5W @ 1.8V = 275mA
Next he pulls out a P2 demo board with:
- It is run by USB and runs at 32MHz = 1W @ 1.8V = 555mA
Now he pulls out the last P2 demo board with:
- It is run by a 5V 1A power pack and runs 160MHz = 5W @ 1.8V = 2.8A. It has a miniature heatsink (+ miniature fan??)
- This board has multiple graphics etc, etc all running, giving 1280 MIPS
- He explains the very same P2 chip runs in all these modes !!!
Works for me - but I'd adjust that 16MHz case to ~1.40V core (guess), and 300mW (peak)
Correct, but this saves Die AREA more than it saves Watts. Remember, Chip has ~90% Clock gating already
....
Because Chip already uses ~90% Clock gating, that 160MHz would have very similar power issues as P2.
= Not actually solving the present problem.
Yes, a smaller die and package might then be possible, but with similar power, that raises power density, which is the wrong way to go.
ie a TQFP144 = Larger Thermal PAD package => lower thermal resistance than a TQFP128 Thermal PAD
We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.
This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.
Chip has said that idle current (leakage) for 160nm is ~1mA. So if the instructions (ALU) aren't using the power, then what is??? We know SRAM doesn't consume much power, almost irrespective of size (within reason of what we are talking about here).
The argument re 160MHz vs 200MHz is that a while ago, 200MHz was possible. Since then a lot has been added to many paths, and now we are back to 160MHz. Quite likely a simpler P1B design would possibly ramp this back to 200MHz. Just because P1B could achieve 200MHz doesn't mean it has to be used at this speed. Run it at 96MHz (good for USB) and power is halved.
Works for me - but I'd adjust that 16MHz case to ~1.40V core (guess), and 300mW (peak)
I presumed the same 180nm as the P2 was going to be fabricated with.
I presume the voltage is process specific???
But using the half power of 65nm or 90nm, then halve those powers and run at ~1.2V. But the idle power goes up significantly to ~30% of the total power, just for idling.
I presumed the same 180nm as the P2 was going to be fabricated with.
I presume the voltage is process specific???
It is, for a given fMAX, but at 16MHz speeds, you can also lower the Vcc quite a bit.
Look at any of the Wide-Vcc micro specs (non regulator ones), and they have MHz/Volts profiles of CMOS Logic
one example from Atmel
* < 2 MHz @ >1.7
* < 4 MHz @ >1.8
* < 10 MHz @ >2.7
* < 16 MHz @ >4.5
That's an 8:1 MHz ratio, over a 2.647:1 Vcc ratio, and power is significantly reduced at the lower Vcc.
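To put a number on "significantly reduced", here is a small Python sketch using the usual first-order CMOS approximation that dynamic power scales with V^2 * f. The MHz/Vcc pairs are the ones listed above; the V^2 * f model is a textbook simplification that ignores leakage and any analog blocks, and the function name is just for illustration.

```python
# First-order CMOS dynamic power model: P is proportional to C * V^2 * f.
# The capacitance term cancels when comparing the same chip at two points.
def relative_dynamic_power(volts, mhz, ref_volts, ref_mhz):
    """Dynamic power at (volts, mhz) relative to (ref_volts, ref_mhz)."""
    return (volts / ref_volts) ** 2 * (mhz / ref_mhz)


# (max MHz, min Vcc) pairs from the Atmel example in the post
profile = [(2, 1.7), (4, 1.8), (10, 2.7), (16, 4.5)]

ref_mhz, ref_volts = profile[0]
for mhz, volts in profile:
    ratio = relative_dynamic_power(volts, mhz, ref_volts, ref_mhz)
    print(f"{mhz:>2} MHz @ {volts} V -> {ratio:.1f}x the 2 MHz / 1.7 V power")
```

Under this model the 16MHz / 4.5V corner costs roughly 56 times the power of the 2MHz / 1.7V corner for only 8 times the clock, which is why dropping Vcc along with frequency matters so much.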
We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.
This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.
This is a little confused, Clock edges are what costs the power, but opcodes do not all execute at once.
- only ONE opcode is chosen (effectively from a virtual ROM), and that is what is executed.
Chip already has ~90% clock gating, so unused blocks (like MUL and CORDIC) are not costing any power, until they are needed.
We know SRAM doesn't consume much power, almost irrespective of size
? SRAM Static power is small, but there certainly is a significant mA/MHz for working SRAM, which is why OnSemi want the SRAM included in their models.
Expect the new numbers to go both down from working Clock gating modeling, and up from SRAM inclusion.
No one little fix will address the power issue. We either deal with the power curve as it is, shrink the thing, or go and gut it until it's something else entirely.
DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm
Since then, Hub has doubled to 256KB and more area has been allocated to core (cog ALUs).
hub RAM = 1.76 square mm (x8) = 14.08mm2 (256KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
aux RAM = 0.20 square mm (x8) = 1.60mm2
core = 14.71 +3.29 square mm = 18.00mm2
38.37mm2
plus power/ground and I/O
Recently, more area was saved and allocated to the core. I could not find a reference to this.
Let us say the core is now 22.00mm2, and the hub/cog/aux remain the same.
hub RAM = 1.76 square mm (x8) = 14.08mm2 (256KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
aux RAM = 0.20 square mm (x8) = 1.60mm2
core = 14.71 +3.29 +4.00 sqmm = 22.00mm2
42.37mm2
plus power/ground and I/O
Now, what if (still at 180nm) a P1B was built with...
hub RAM = 1.76 square mm (x16) = 28.16mm2 (512KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
core = P1 + minor extras = 5.00mm2
38.12mm2
spare = 4.25mm2
42.37mm2
plus power/ground and I/O
And note the I/O will be quite a bit smaller since it is much simpler than P2.
With only 64 I/Os and simpler I/O than P2 (ie P1 + ADC and maybe 10K pullups?) would save quite a bit of space.
Perhaps this would permit another 128KB of hub ram.
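The area budgets above are easy to re-add with a short Python sketch. The per-block areas are the figures quoted in the thread; the 5.00mm2 P1B core is the post's own guess, and 16 hub blocks is what the 28.16mm2 / 512KB line implies. Summed this way the P2 items come to about 42.64mm2, slightly more than the 42.37mm2 hand total, so the spare figure should be read as approximate.

```python
# Per-block areas in mm^2, taken from the figures quoted in the thread.
HUB_BLOCK_MM2 = 1.76  # one hub RAM block (x8 blocks = 256KB, so 32KB each)
COG_RAM_MM2 = 0.62    # per cog
AUX_RAM_MM2 = 0.20    # per cog


def area_items(hub_blocks, cogs, core_mm2, with_aux=True):
    """Itemised silicon area in mm^2, excluding power/ground and I/O."""
    items = {
        "hub RAM": hub_blocks * HUB_BLOCK_MM2,
        "cog RAM": cogs * COG_RAM_MM2,
        "core": core_mm2,
    }
    if with_aux:
        items["aux RAM"] = cogs * AUX_RAM_MM2
    return items


def total_mm2(items):
    """Sum of all itemised areas, rounded to 0.01 mm^2."""
    return round(sum(items.values()), 2)


p2 = area_items(hub_blocks=8, cogs=8, core_mm2=22.00)
p1b = area_items(hub_blocks=16, cogs=8, core_mm2=5.00, with_aux=False)

print("P2 itemised total: ", total_mm2(p2), "mm2")
print("P1B itemised total:", total_mm2(p1b), "mm2")
print("spare:             ", round(total_mm2(p2) - total_mm2(p1b), 2), "mm2")
```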
So a P1B with:
64 I/O and ADC and optional 10K pullups
512KB or 640KB hub ram
SPI Flash (instead of I2C Eeprom)
Simple ROM Monitor (like P2 - no spin/tables/font in ROM)
hub access 1:8 instead of 1:16
96/100MHz normal, with 160-200MHz possible with higher power usage
Basic P1 instruction set (from P2 subset opcodes 0xxxxxx) and a few basic additional instructions
Power consumption similar to P1 when active (due to 180nm and 1.8V core) but higher idle current ~1mA
It is, for a given fMAX, but at 16MHz speeds, you can also lower the Vcc quite a bit.
Look at any of the Wide-Vcc micro specs (non regulator ones), and they have MHz/Volts profiles of CMOS Logic
one example from Atmel
* < 2 MHz @ >1.7
* < 4 MHz @ >1.8
* < 10 MHz @ >2.7
* < 16 MHz @ >4.5
That's an 8:1 MHz ratio, over a 2.647:1 Vcc ratio, and power is significantly reduced at the lower Vcc.
So our salesman Mr Jones goes into the prospective P2 customer and pulls out his first P2 demo board:
- It is run by a LiPo battery and runs at 16MHz = 0.5W @ 1.8V = 275mA
Next he pulls out a P2 demo board with:
- It is run by USB and runs at 32MHz = 1W @ 1.8V = 555mA
Now he pulls out the last P2 demo board with:
- It is run by a 5V 1A power pack and runs 160MHz = 5W @ 1.8V = 2.8A. It has a miniature heatsink (+ miniature fan??)
- This board has multiple graphics etc, etc all running, giving 1280 MIPS
- He explains the very same P2 chip runs in all these modes !!!
In my vision, our customer, Mr. Bloak, goes to Mr. Jones(a used car salesman) to inquire about the purchase of a V8 Mini. And of course the customer doesn't understand why a Mini needs a V8.
Mr. Jones invites him to sit in the Mini and 4 large screen TVs fold down to reveal a miniature arcade…
Mr. Jones then opens the hood… points at the V8 and says… "that's what generates the power!!!" Then he retrieves the cigarette lighter… with a c4P2 cleverly hidden inside the heater coil… and says… "that's what supplies the logic."
Of course this will only be possible with a c4P2:)
We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.
This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.
This is a little confused, Clock edges are what costs the power, but opcodes do not all execute at once.
- only ONE opcode is chosen (effectively from a virtual ROM), and that is what is executed.
Chip already has ~90% clock gating, so unused blocks (like MUL and CORDIC) are not costing any power, until they are needed.
There have been lots of multiplexers and the like added. The P1 is nowhere near as complex in its instruction set, so it absolutely must use significantly less power than the P2 !
The P1 IIRC is 360nm. The scale to 180nm for the same P1 would be half the power for same clock with 1.8V.
Currently the P1 typically uses way less than 100mA @ 3.3V = 0.33W. Let's say worst case is 0.5W.
Now at 180nm and 1.8V and 0.25W = 138mA presuming still at 80MHz. At 160MHz we would be at 0.5W.
Just don't try and tell me that the P1 at 180nm is going to use the same power as P2 at 180nm. It simply is NOT TRUE !
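The estimate chain in this post can be written out as a short Python sketch. It only reproduces the post's own assumptions: a 0.5W worst case for the current P1, power halving on the move to 180nm at 1.8V, and power scaling linearly with clock. None of these are measured values, and the helper names are for illustration.

```python
# Back-of-envelope P1 shrink estimate, following the post's assumptions.
def shrink_power(power_w, shrink_factor=0.5):
    """Power after a process shrink; the 0.5 factor is the post's guess."""
    return power_w * shrink_factor


def scale_with_clock(power_w, from_mhz, to_mhz):
    """Assume dynamic power scales linearly with clock frequency."""
    return power_w * to_mhz / from_mhz


def current_ma(power_w, volts):
    """Supply current in mA for a given power and rail voltage."""
    return power_w / volts * 1000.0


p1_worst_w = 0.5                                  # assumed worst case today
p1_180nm_80mhz = shrink_power(p1_worst_w)         # at 180nm, 1.8V, 80MHz
p1_180nm_160mhz = scale_with_clock(p1_180nm_80mhz, 80, 160)

print(f"80 MHz:  {p1_180nm_80mhz} W, {current_ma(p1_180nm_80mhz, 1.8):.0f} mA")
print(f"160 MHz: {p1_180nm_160mhz} W")
```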
? SRAM Static power is small, but there certainly is a significant mA/MHz for working SRAM, which is why OnSemi want the SRAM included in their models.
Expect the new numbers to go both down from working Clock gating modeling, and up from SRAM inclusion.
Just don't try and tell me that the P1 at 180nm is going to use the same power as P2 at 180nm. It simply is NOT TRUE
You seem to be missing that the P1 is a 4-clock core, so yes, a P1 in 180nm will be lower power than P2, but also lower MIPS.
Of course, remember that P2 can run at 20MHz to match MIPS with a present P1,
P1 is ~264mW, and a P2 will not be too different from that, at a 20MHz compatible Vcc.
You seem to be missing that the P1 is a 4-clock core, so yes, a P1 in 180nm will be lower power than P2, but also lower MIPS.
Absolutely! That is what you (and Bill) are not taking into account. Nowhere did we ask for a P1B to be single cycle (in fact Chip reminded us yesterday), just a slight improvement and specifically stated.
Of course, remember that P2 can run at 20MHz to match MIPS with a present P1,
P1 is ~264mW, and a P2 will not be too different from that, at a 20MHz compatible Vcc.
By above calculations, the P2 will draw 625mW at 1.8V. ~264mW is not 625mW. The extra is in the instruction/alu logic!
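The two positions here can be compared with a small Python sketch. It takes the 5W at 160MHz / 1.8V estimate, scales it linearly with clock to get the 625mW figure, and then asks what core voltage would bring that down to the ~264mW quoted for the P1 if dynamic power goes as V^2. Both scaling rules are simplifying assumptions, and whether the P2 logic would actually run at the resulting voltage is not something the thread establishes.

```python
import math


def scale_with_clock(power_w, from_mhz, to_mhz):
    """Assume power scales linearly with clock at a fixed voltage."""
    return power_w * to_mhz / from_mhz


def vcc_for_target_power(power_w, volts, target_w):
    """Voltage giving target_w at the same clock, assuming P goes as V^2."""
    return volts * math.sqrt(target_w / power_w)


p2_20mhz_w = scale_with_clock(5.0, 160, 20)           # 5W estimate at 160MHz
vcc = vcc_for_target_power(p2_20mhz_w, 1.8, 0.264)    # ~264mW P1 figure

print(f"P2 at 20 MHz, 1.8 V: {p2_20mhz_w * 1000:.0f} mW")
print(f"Vcc needed to match 264 mW: {vcc:.2f} V")
```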
Since you obviously don't believe me or jmg, why don't you ask Chip how many transistors the extra instructions take?
And how many transistors there are as a whole?
I strongly suspect the extra instructions transistors take less than 5% (more like 1%) of the P2's transistor budget.
In which case, power consumption would be reduced by 1%-5%
But wait... without those instructions, *MORE* instructions would have to be executed to do similar things, taking more cog ram, hub ram, and time to execute. Frankly, probably increasing power usage over time.
We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.
This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.
Chip has said that idle current (leakage) for 160nm is ~1mA. So if the instructions (ALU) aren't using the power, then what is??? We know SRAM doesn't consume much power, almost irrespective of size (within reason of what we are talking about here).
The argument re 160MHz vs 200MHz is that a while ago, 200MHz was possible. Since then a lot has been added to many paths, and now we are back to 160MHz. Quite likely a simpler P1B design would possibly ramp this back to 200MHz. Just because P1B could achieve 200MHz doesn't mean it has to be used at this speed. Run it at 96MHz (good for USB) and power is halved.
Comments
Assuming that COGs can sleep when there's no work to be done, lowering the frequency alone won't necessarily result in longer battery life. Correct? (It may depend on how the batteries react to various current profiles.)
and the outside temperature is 128.5F(53.611111C),
and my v8 mini is overheating.
First, I turn the air conditioner off. Then I open the windows.
My question is… after I pour cold beer over my head (http://www.bobinoz.com/blog/9636/australias-best-beers-and-lagers-by-bobinoz/),
can I power my 4 cog 160MHz P2 off the cigarette lighter socket?
Yes I CAN! http://en.wikipedia.org/wiki/CAN_bus http://en.wikipedia.org/wiki/Bob's_your_uncle
Don't those figures assume that the chip is mounted on a reference JEDEC PCB though?
I'm not sure we should be as surprised as we are. Chip estimated 1.5 watts, before moving to the hotter OnSemi process. And the analysis is worst case.
For 3d printing applications, we just mount the P2 on the hotbed plate
A few things came out of that:
- What we all originally wanted was a few more I/O pins - 64 would be way more than adequate.
- ADC on some/all I/O pins.
- More HUB RAM.
- Faster clock speed.
And IIRC it was in that order. So not much more than the proposed original Prop 1B. All of us would be happy with this first and keep the P2 for a 65nm followup.
I described my idea for AUX / COG RAM. They liked the idea.
- Make COG RAM WIDE
- RD/WRWIDE would take 1 clock to transfer 8 longs, sync'd to hub
- Ditch AUX RAM
- Permit a block of 128 or 256 Cog longs be used as CLUT
- Requires a simple MUX on the Instruction read port for the block(s)
- A simple instruction to allocate the block(s) to CLUT usage
- Increase the LIFO depth for CALL/RET
- Simpler design uses fewer transistors than AUX as only 18 bits wide.
- Saves 0.20mm2 x8 = 1.6mm2 of silicon
- Maybe that could be used to add 128KB extra HUB RAM (128KB=1.76mm2)
- We all agreed this now makes sense because
- HUBEXEC removes the critical shortage of COG RAM
- Simpler to use and describe
- Saves a tiny amount of power x8
Regarding Speed & Cogs:
- None of us liked reducing cogs to 4
- Not all cogs need to be equal
- We did agree, although against previous Prop philosophy, that not all cogs need to be equal
- Only 1 or 2 cogs are likely to really run large programs using HUBEXEC
- None of us are intending to use these cogs for multitasking or multithreading as provided by P2
- IIRC Ross said he would implement his own software model for multitasking/multithreading
- Only 1 or 2 cogs are likely to use Video
- Most cogs will be used to perform intelligent I/O in software
- Not all cogs are equal now (Video DAC pins)
- This gives scope to make two types of cogs, reducing appropriate features from each set.
- We do not actually require the current speed even tho' we like it
- Means we can trade power for speed
- Power vs die geometry
- 5W at 1.8V = 2.8A (180nm)
- 2.5W at 1.2V = 2.1A (65/90nm???)
- We do question the methodology behind the 5W calculations
- Not all features of the chip will be in use at the same time
- Not all 8 videos will be active concurrently
- Not all maths routines will be active concurrently
- What other logic blocks are not active concurrently?
- How much of the instruction block is not active concurrently?
- What blocks use the power?
An improved P1B
Overnight I have added these thoughts
Parallel isn't the issue, "Parallel and Identical" is the issue. Making a COG well suited to running the business logic, UI, etc. ends up in many cases making it extreme overkill in the case of using it as a software driven peripheral.
C.W.
Almost, but not quite. The Clock tree consumes significant power, but a Gated Clock Tree can mimic lower CLK speeds.
A WAITxx can idle much of the other logic, but I'm not sure if Chip takes those opcodes to the COG Clock Tree gate level.
The other TASK compatible wait, jumps to self, and so that certainly needs the Clock tree and pipeline working.
To me the ideal is a method that allows full fSys precision on timing and waits (maybe this cannot include Tasked waits), but allows the COG to otherwise clock at a reduced speed to save mA/MHz power. (P1 does this already)
You climb in. You turn the ignition key. The car immediately melts and everyone inside dies.
Now, not firing the sparks and supplying the fuel to all cylinders might help a bit.
I love a car analogy.
I did not see any specific PCB artwork, but yes, they are Thermal PAD parts, and you do need to add thermal vias, and copper planes are implicit in that.
4 layer PCB is likely to allow more copper than 2 layer, but how many will use P2 on 2 layer PCBs ?
Thicker copper helps, but the 0.4mm leads will pinch the choices on copper weight.
But that leads to an impossible situation. Ahead of time you don't know what is "business logic" and what is "device driver". And who is to say which one needs the most speed when it needs it, and so on.
Or are you saying we should have a Propeller that is 7 COGs as we know them for flexible device interfacing duty and one ARM core for doing the heavy duty "business logic"?
A solution I'm quite partial to. Except entering such a market against the big boys is a lost cause.
But we all know "Impossible" is just an invitation to a challenge around here!
There is a solution somewhere between what we have now and an ARM + Cogs, not sure exactly what it is yet...
C.W.
- It is run by a LiPo battery and runs at 16MHz = 0.5W @ 1.8V = 275mA
Next he pulls out a P2 demo board with:
- It is run by USB and runs at 32MHz = 1W @ 1.8V = 555mA
Now he pulls out the last P2 demo board with:
- It is run by a 5V 1A power pack and runs 160MHz = 5W @ 1.8V = 2.8A. It has a miniature heatsink (+ miniature fan??)
- This board has multiple graphics etc, etc all running, giving 1280 MIPS
- He explains the very same P2 chip runs in all these modes !!!
Correct, but this saves Die AREA more than it saves Watts. Remember, Chip has ~90% Clock gating already
This may still be needed, as we do not have FIT confirmed
Agreed, and I think it also needs a COG temperature sense, because of this envelope.
That's certainly interesting. It would need another set of 4 flags to say which Tasks were actually [Clock=OFF],
then those slots would run with no clock to the COG opcode engine at all, and so combine with the main Clock-Speed control,
to give more granular control of power.
I think both those are false assumptions. Certainly the time ignores testing.
MHz is also optimistic. I asked Chip about faster Counters, but because they use adders, their critical path is not much different to an opcode.
That makes the MHz ceiling pretty much set by the data-width, and process, so ~160MHz would be it.
Because Chip already uses ~90% Clock gating, that 160MHz would have very similar power issues as P2.
= Not actually solving the present problem.
Yes, a smaller die and package might then be possible, but with similar power, that raises power density, which is the wrong way to go.
ie a TQFP144 = Larger Thermal PAD package => lower thermal resistance than a TQFP128 Thermal PAD
Works for me - but I'd adjust that 16MHz case to ~1.40V core (guess), and 300mW (peak)
I wonder how much power consumption can be attributed to the WIDE access to the HUB.
C.W.
http://www.jedec.org/sites/default/files/docs/jesd51-12.pdf
...seems to be relevant.
This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.
Chip has said that idle current (leakage) for 160nm is ~1mA. So if the instructions (ALU) aren't using the power, then what is??? We know SRAM doesn't consume much power, almost irrespective of size (within reason of what we are talking about here).
The argument re 160MHz vs 200MHz is that a while ago, 200MHz was possible. Since then a lot has been added to many paths, and now we are back to 160MHz. Quite likely a simpler P1B design would possibly ramp this back to 200MHz. Just because P1B could achieve 200MHz doesn't mean it has to be used at this speed. Run it at 96MHz (good for USB) and power is halved.
I presume the voltage is process specific???
But using the half power of 65nm or 90nm, then halve those powers and run at ~1.2V. But the idle power goes up significantly to ~30% of the total power, just for idling.
Yes, I also found this
http://www.ti.com/lit/an/slma002g/slma002g.pdf
The JEDEC examples show 4 layer board is needed, and spreading heat is the key.
TI show 49 vias on TQFP128 and 100 vias on TQFP144, for their heat spreading.
It is, for a given fMAX, but at 16MHz speeds, you can also lower the Vcc quite a bit.
Look at any of the Wide-Vcc micro specs (non regulator ones), and they have MHz/Volts profiles of CMOS Logic
one example from Atmel
* < 2 MHz @ >1.7
* < 4 MHz @ >1.8
* < 10 MHz @ >2.7
* < 16 MHz @ >4.5
That's an 8:1 MHz ratio, over a 2.647:1 Vcc ratio, and power is significantly reduced at the lower Vcc.
This is a little confused, Clock edges are what costs the power, but opcodes do not all execute at once.
- only ONE opcode is chosen (effectively from a virtual ROM), and that is what is executed.
Chip already has ~90% clock gating, so unused blocks (like MUL and CORDIC) are not costing any power, until they are needed.
? SRAM Static power is small, but there certainly is a significant mA/MHz for working SRAM, which is why OnSemi want the SRAM included in their models.
Expect the new numbers to go both down from working Clock gating modeling, and up from SRAM inclusion.
Flat out.
No one little fix will address the power issue. We either deal with the power curve as it is, shrink the thing, or go and gut it until it's something else entirely.
From the thread a year ago (yes 12 March 2013)
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223695&viewfull=1#post1223695
DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm
Since then, Hub has doubled to 256KB and more area has been allocated to core (cog ALUs).
hub RAM = 1.76 square mm (x8) = 14.08mm2 (256KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
aux RAM = 0.20 square mm (x8) = 1.60mm2
core = 14.71 +3.29 square mm = 18.00mm2
38.37mm2
plus power/ground and I/O
Recently, more area was saved and allocated to the core. I could not find a reference to this.
Let us say the core is now 22.00mm2, and the hub/cog/aux remain the same.
hub RAM = 1.76 square mm (x8) = 14.08mm2 (256KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
aux RAM = 0.20 square mm (x8) = 1.60mm2
core = 14.71 +3.29 +4.00 sqmm = 22.00mm2
42.37mm2
plus power/ground and I/O
FWIW http://forums.parallax.com/showthread.php/152109-P2-shrink-(28nm-40nm)-and-how-to-collect-US-3-200-000?p=1224362&viewfull=1#post1224362
The original P1 was P1: die 7.28 sqmm (360nm???)
Now, what if (still at 180nm) a P1B was built with...
hub RAM = 1.76 square mm (x16) = 28.16mm2 (512KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
core = P1 + minor extras = 5.00mm2
38.12mm2
spare = 4.25mm2
42.37mm2
plus power/ground and I/O
And note the I/O will be quite a bit smaller since it is much simpler than P2.
With only 64 I/Os and simpler I/O than P2 (ie P1 + ADC and maybe 10K pullups?) would save quite a bit of space.
Perhaps this would permit another 128KB of hub ram.
So a P1B with:
We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.
This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.
There have been lots of multiplexers and the like added. The P1 is nowhere near as complex in its instruction set, so it absolutely must use significantly less power than the P2 !
The P1 IIRC is 360nm. The scale to 180nm for the same P1 would be half the power for same clock with 1.8V.
Currently the P1 typically uses way less than 100mA @ 3.3V = 0.33W. Let's say worst case is 0.5W.
Now at 180nm and 1.8V and 0.25W = 138mA presuming still at 80MHz. At 160MHz we would be at 0.5W.
Just don't try and tell me that the P1 at 180nm is going to use the same power as P2 at 180nm. It simply is NOT TRUE !
Yes, I am looking forward to that.
You seem to be missing that the P1 is a 4-clock core, so yes, a P1 in 180nm will be lower power than P2, but also lower MIPS.
Of course, remember that P2 can run at 20MHz to match MIPS with a present P1,
P1 is ~264mW, and a P2 will not be too different from that, at a 20MHz compatible Vcc.
I only see 3 viable options:
- Build a P1B (with a few suggested additional features) now, and P2 follows later in 40/65/90nm.
- Build P2 in 40/65/90nm
- Build P2 in 180nm and market variable speed vs power
For the P2 options 2 & 3, I don't see reducing its capability as a viable alternative. However, I see the following as viable options for the P2:
And how many transistors there are as a whole?
I strongly suspect the extra instructions transistors take less than 5% (more like 1%) of the P2's transistor budget.
In which case, power consumption would be reduced by 1%-5%
But wait... without those instructions, *MORE* instructions would have to be executed to do similar things, taking more cog ram, hub ram, and time to execute. Frankly, probably increasing power usage over time.