We're looking at 5 Watts in a BGA!

KeithE · 2014-04-02 13:34

In the past we had a design with many identical power consuming blocks on it, and were asked if we could skew the clocks to the various blocks to get the peak instantaneous power down. It wasn't a good fit for us, but something you could consider if this is a concern. It may not solve a real problem, and might result in major hub headaches.

Assuming that COGs can sleep when there's no work to be done, lowering the frequency alone won't necessarily result in longer battery life. Correct? (It may depend on how the batteries react to various current profiles.)
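
To put rough numbers on that, here is a minimal sketch. It assumes dynamic power scales linearly with clock, a job costs a fixed number of cycles, and a sleeping COG draws nothing. `WATTS_PER_MHZ` (the 5 W headline figure spread over an assumed 160 MHz clock) and `energy_per_job_joules` are illustrative names, not measured values or a Parallax API.

```python
# Toy model only: linear power vs clock, fixed cycles per job, 0 W while asleep.
WATTS_PER_MHZ = 5.0 / 160.0   # assumed: 5 W at 160 MHz


def energy_per_job_joules(mhz, cycles_per_job):
    """Energy for one job when the cog sleeps as soon as the job is done."""
    power_w = WATTS_PER_MHZ * mhz              # power while working
    seconds = cycles_per_job / (mhz * 1e6)     # time spent working
    return power_w * seconds


fast = energy_per_job_joules(160, 1_600_000)   # short burst at high power
slow = energy_per_job_joules(40, 1_600_000)    # long run at low power
print(f"160 MHz: {fast:.3f} J   40 MHz: {slow:.3f} J")
```

Under those assumptions both runs cost the same energy per job, which is the point being made: the lower clock only stretches the same work over a longer time.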

rjo__ · 2014-04-02 13:35

Let's say I'm driving through Brisbane in the summer
and the outside temperature is 128.5F(53.611111C),
and my v8 mini is overheating.

First, I turn the air conditioner off. Then I open the windows.
My question is… after I pour cold beer over my head http://www.bobinoz.com/blog/9636/australias-best-beers-and-lagers-by-bobinoz/,
can I power my 4 cog 160MHz P2 off the cigarette lighter socket?

Yes I CAN! http://en.wikipedia.org/wiki/CAN_bus http://en.wikipedia.org/wiki/Bob's_your_uncle

Brian Fairchild · 2014-04-02 13:42

jmg wrote: »

...Power PAD TQFP128, and they are
0.12 °C/W jc, from die to case, and
17.17 °C/W ja, PCB mounted, with thermal vias

Don't those figures assume that the chip is mounted on a reference JEDEC PCB though?
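
For a feel of what those two numbers mean at 5 W, here is a rough sketch. It uses the Power PAD TQFP128 figures jmg quoted (which are for that package on its test board, not for a P2) and an assumed 25 °C ambient. The function names are made up for illustration.

```python
THETA_JC = 0.12    # °C/W, die to case (quoted figure)
THETA_JA = 17.17   # °C/W, die to ambient, PCB mounted with thermal vias (quoted figure)


def junction_temp_c(power_w, ambient_c, theta_ja=THETA_JA):
    """Steady-state junction temperature: ambient plus power times theta-ja."""
    return ambient_c + power_w * theta_ja


def die_to_case_rise_c(power_w, theta_jc=THETA_JC):
    """Temperature drop between die and case for the same power."""
    return power_w * theta_jc


tj = junction_temp_c(5.0, ambient_c=25.0)   # 25 °C ambient is an assumption
print(f"Junction at 5 W: {tj:.2f} °C, die-to-case drop: {die_to_case_rise_c(5.0):.2f} °C")
```

The ja figure only holds if the board matches the reference layout, which is exactly the question being asked. On a smaller or 2 layer board the real number will be worse.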

Tubular · 2014-04-02 13:51

I guess using Roy's car analogy, we're in a supercar: it has lots of cylinders and a high top speed, but only floor it if you are on a suitable racetrack (suitable power supply). Also, using lower octane fuel (reduced Vdd) can help keep things under control.

I'm not sure we should be as surprised as we are. Chip estimated 1.5 watts, before moving to the hotter OnSemi process. And the analysis is worst case.

For 3d printing applications, we just mount the P2 on the hotbed plate

Cluso99 · 2014-04-02 13:52

Last night Seairth, RossH and I met for dinner in Sydney. Seairth was visiting for work, so we couldn't miss the opportunity to catch up.

A few things came out of that:

What we all originally wanted was a few more I/O pins - 64 would be way more than adequate.
ADC on some/all I/O pins.
More HUB RAM.
Faster clock speed.

And IIRC it was in that order. So not much more than the proposed original Prop 1B.

All of us would be happy with this first and keep the P2 for a 65nm followup.

I described my idea for AUX / COG RAM. They liked the idea.

Make COG RAM WIDE
- RD/WRWIDE would take 1 clock to transfer 8 longs, sync'd to hub
- Ditch AUX RAM
- Permit a block of 128 or 256 Cog longs be used as CLUT
  - Requires a simple MUX on the Instruction read port for the block(s)
  - A simple instruction to allocate the block(s) to CLUT usage
Increase the LIFO depth for CALL/RET
- Simpler design uses fewer transistors than AUX as only 18 bits wide.
Saves 0.20mm2 x8 = 1.6mm2 of silicon
- Maybe that could be used to add 128KB extra HUB RAM (128KB=1.76mm2)
We all agreed this now makes sense because
- HUBEXEC removes the critical shortage of COG RAM
- Simpler to use and describe
- Saves a tiny amount of power x8

Regarding Speed & Cogs:

None of us liked reducing cogs to 4
Not all cogs need to be equal
- We did agree, although against previous Prop philosophy, that not all cogs need to be equal
- Only 1 or 2 cogs are likely to really run large programs using HUBEXEC
  - None of us are intending to use these cogs for multitasking or multithreading as provided by P2
  - IIRC Ross said he would implement his own software model for multitasking/multithreading
- Only 1 or 2 cogs are likely to use Video
- Most cogs will be used to perform intelligent I/O in software
- Not all cogs are equal now (Video DAC pins)
- This gives scope to make two types of cogs, reducing appropriate features from each set.
We do not actually require the current speed even tho' we like it
- Means we can trade power for speed
Power vs die geometry
- 5W at 1.8V = 2.8A (180nm)
- 2.5W at 1.2V = 2.1A (65/90nm???)
We do question the methodology behind the 5W calculations
- Not all features of the chip will be in use at the same time
  - Not all 8 videos will be active concurrently
  - Not all maths routines will be active concurrently
  - What other logic blocks are not active concurrently?
  - How much of the instruction block is not active concurrently?
  - What blocks use the power?

An improved P1B
Overnight I have added these thoughts

I like the P2 being multi-specified:
- 5W @ 160MHz
- 2.5W @ 80MHz
- 1W @ 32MHz
- and a graph W vs MHz
Could we use the multitasking feature as a simple clock reduction:
- We can specify up to 16 slots for tasks
- If an instruction setup "idle mode" for tasks 1, 2 & 3 then:
  - The cog would only get time for the slots when SETASK allocated task 0
  - All other SETASK tasks would "idle" the cog
  - This would reduce the power by effectively reducing the clock to n:16
If Parallax went the P1B route and followed up soon with P2 updated to 65nm or similar
- Most instructions are already in Verilog (opcode 0xxxxxx)
- Increase Hub ram size
  - Whatever fits (with less instructions, could be 512KB or more!!!)
- Reduce hub cycles to 1:8 from 1:16
- Possibly increase to 160MHz or 200MHz
- Possibly remove ROM tables and use RAM like P2
- Possibly use SPI Flash instead of EEPROM
- This could likely be done by Chip in a couple/few weeks

ctwardell · 2014-04-02 13:54

Heater. wrote: »

ctwardell,

I don't believe this is true.

A COG, despite its many transistors, consumes almost no power if it is not doing any work. Or at least I hope that is the case.
As a UART or PWM driver or whatever it has a lot of "free" time. When it actually has to work that takes energy of course. It should not be sitting in "idle loops" wasting power.

Anyway, regardless of that, Moore's law has come to an end and all the world's chip makers are looking at parallel solutions to make performance increases.

The "Propeller Philosophy" was perhaps ahead of the game. If it does not scale they all have a problem.

Parallel isn't the issue, "Parallel and Identical" is the issue. Making a COG well suited to running the business logic, UI, etc. ends up in many cases making it extreme overkill in the case of using it as a software driven peripheral.

C.W.

jmg · 2014-04-02 13:54

Heater. wrote: »

A COG, despite its many transistors, consumes almost no power if it is not doing any work. Or at least I hope that is the case.
As a UART or PWM driver or whatever it has a lot of "free" time. When it actually has to work that takes energy of course. It should not be sitting in "idle loops" wasting power.

Almost, but not quite. The Clock tree consumes significant power, but a Gated Clock Tree can mimic lower CLK speeds.
A WAITxx can idle much of the other logic, but I'm not sure if Chip takes those opcodes to the COG Clock Tree gate level.

The other TASK compatible wait, jumps to self, and so that certainly needs the Clock tree and pipeline working.

To me the ideal is a method that allows full fSys precision on timing and waits (maybe this cannot include Tasked waits), but allows the COG to otherwise clock at a reduced speed to save mA/MHz power. (P1 does this already)

Heater. · 2014-04-02 13:55

No, Roy's car analogy is more like this:

You climb in. You turn the ignition key. The car immediately melts and everyone inside dies.

Now, not firing the sparks and supplying the fuel to all cylinders might help a bit.

I love a car analogy.

jmg · 2014-04-02 14:00

Brian Fairchild wrote: »

Don't those figures assume that the chip is mounted on a reference JEDEC PCB though?

I did not see any specific PCB artwork, but yes, they are Thermal PAD parts, and you do need to add thermal vias, and copper planes are implicit in that.

A 4 layer PCB is likely to allow more copper than 2 layer, but how many will use P2 on 2 layer PCBs?
Thicker copper helps, but the 0.4mm leads will pinch the choices on copper weight.

Heater. · 2014-04-02 14:04

ctwardell,

Parallel isn't the issue, "Parallel and Identical" is the issue.

I see what you mean.

But that leads to an impossible situation. Ahead of time you don't know what is "business logic" and what is "device driver". And who is to say which one needs the most speed when it needs it, and so on.

Or are you saying we should have a Propeller that is 7 COGs as we know them for flexible device interfacing duty and one ARM core for doing the heavy duty "business logic"?

A solution I'm quite partial to. Except entering such a market against the big boys is a lost cause.

ctwardell · 2014-04-02 14:09

Heater. wrote: »
ctwardell,
Parallel isn't the issue, "Parallel and Identical" is the issue.
I see what you mean.

But that leads to an impossible situation. Ahead of time you don't know what is "business logic" and what is "device driver". And who is to say which one needs the most speed when it needs it, and so on.

Or are you saying we should have a Propeller that is 7 COGs as we know them for flexible device interfacing duty and one ARM core for doing the heavy duty "business logic"?

A solution I'm quite partial to. Except entering such a market against the big boys is a lost cause.

But we all know "Impossible" is just an invitation to a challenge around here!

There is a solution somewhere between what we have now and an ARM + Cogs, not sure exactly what it is yet...

C.W.

Cluso99 · 2014-04-02 14:11

So our salesman Mr Jones goes into the prospective P2 customer and pulls out his first P2 demo board:
- It is run by a LiPo battery and runs at 16MHz = 0.5W @ 1.8V = 275mA
Next he pulls out a P2 demo board with:
- It is run by USB and runs at 32MHz = 1W @ 1.8V = 555mA
Now he pulls out the last P2 demo board with:
- It is run by a 5V 1A power pack and runs 160MHz = 5W @ 1.8V = 2.8A. It has a miniature heatsink (+ miniature fan??)
- This board has multiple graphics etc, etc all running, giving 1280 MIPS
- He explains the very same P2 chip runs in all these modes !!!
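
The three demo boards are just I = P / V at the 1.8V core, plus one instruction per clock per cog for the MIPS figure. A quick sketch to check the numbers; the 8 cogs and single-cycle execution are what the 1280 MIPS figure implies, and the helper names are invented for this example.

```python
CORE_VOLTS = 1.8   # core supply used in the post
COGS = 8           # implied by 1280 MIPS at 160 MHz


def core_current_ma(watts, volts=CORE_VOLTS):
    """Core supply current in mA for a given power at the core voltage."""
    return watts / volts * 1000.0


def peak_mips(mhz, cogs=COGS):
    """Peak MIPS assuming every cog retires one instruction per clock."""
    return mhz * cogs


boards = [("LiPo", 16, 0.5), ("USB", 32, 1.0), ("5V 1A pack", 160, 5.0)]
for name, mhz, watts in boards:
    print(f"{name}: {mhz} MHz, {watts} W, "
          f"{core_current_ma(watts):.0f} mA, {peak_mips(mhz)} MIPS")
```

The exact currents come out a few mA above the rounded 275mA and 555mA in the post, which makes no practical difference. Note these are core currents at 1.8V, so regulator losses from the battery or USB rail come on top.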

jmg · 2014-04-02 14:35

Cluso99 wrote: »

Not all cogs need to be equal
We did agree, although against previous Prop philosophy, that not all cogs need to be equal

Only 1 or 2 cogs are likely to really run large programs using HUBEXEC
None of us are intending to use these cogs for multitasking or multithreading as provided by P2

IIRC Ross said he would implement his own software model for multitasking/multithreading

Only 1 or 2 cogs are likely to use Video

Most cogs will be used to perform intelligent I/O in software

Not all cogs are equal now (Video DAC pins)

This gives scope to make two types of cogs, reducing appropriate features from each set.

Correct, but this saves Die AREA, more than it saves Watts. Remember, Chip has ~90% Clock gating already.
This may still be needed, as we do not have FIT confirmed

Cluso99 wrote: »

We do not actually require the current speed even tho' we like it
Means we can trade power for speed

Overnight I have added these thoughts
I like the P2 being multi-specified:
5W @ 160MHz

2.5W @ 80MHz

1W @ 32MHz

and a graph W vs MHz

Agreed, and I think it also needs a COG temperature sense, because of this envelope.

Cluso99 wrote: »

Could we use the multitasking feature as a simple clock reduction:
We can specify up to 16 slots for tasks

If an instruction setup "idle mode" for tasks 1, 2 & 3 then:
The cog would only get time for the slots when SETASK allocated task 0

All other SETASK tasks would "idle" the cog

This would reduce the power by effectively reducing the clock to n:16

That's certainly interesting. It would need another set of 4 flags to say which Tasks were actually [Clock=OFF],
then those slots would run with no clock to the COG opcode engine at all, and so combine with the main Clock-Speed control,
to give more granular control of power.
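
As a rough sketch of how that would work out: 16 time slots, each assigned to one of 4 tasks, plus a hypothetical set of "clock off" flags per task. None of this is an existing P2 feature or instruction; the slot table layout and names are assumptions for illustration.

```python
SLOTS = 16   # the post says up to 16 task slots


def active_fraction(slot_tasks, clock_off):
    """Fraction of slots that still get a clock.

    slot_tasks: list of 16 task ids (0-3), one per time slot.
    clock_off:  set of task ids flagged as [Clock=OFF] (hypothetical flags).
    """
    if len(slot_tasks) != SLOTS:
        raise ValueError("expected 16 slots")
    active = sum(1 for task in slot_tasks if task not in clock_off)
    return active / SLOTS


# Task 0 owns every fourth slot; tasks 1, 2 and 3 are flagged clock-off.
slots = [0, 1, 2, 3] * 4
frac = active_fraction(slots, clock_off={1, 2, 3})
effective_mhz = 160 * frac
print(f"active slots: {frac:.0%}, effective cog clock: {effective_mhz:.0f} MHz")
```

The n:16 granularity falls straight out of the slot count. How much power that really saves depends on what stays clocked outside the opcode engine, which the sketch does not model.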

Cluso99 wrote: »

If Parallax went the P1B route...
Possibly increase to 160MHz or 200MHz

This could likely be done by Chip in a couple/few of weeks

I think both those are false assumptions. Certainly the time ignores testing.

MHz is also optimistic. I asked Chip about faster Counters, but because they use adders, their critical path is not much different to an opcode.

That makes the MHz ceiling pretty much set by the data-width, and process, so ~160MHz would be it.

Because Chip already uses ~90% Clock gating, that 160MHz would have very similar power issues as P2.
= Not actually solving the present problem.

Yes, a smaller die and package might then be possible, but with similar power, that raises power density, which is the wrong way to go.

ie a TQFP144 = Larger Thermal PAD package => lower thermal resistance than a TQFP128 Thermal PAD

jmg · 2014-04-02 14:39

Cluso99 wrote: »

So our salesman Mr Jones goes into the prospective P2 customer and pulls out his first P2 demo board:
- It is run by a LiPo battery and runs at 16MHz = 0.5W @ 1.8V = 275mA
Next he pulls out a P2 demo board with:
- It is run by USB and runs at 32MHz = 1W @ 1.8V = 555mA
Now he pulls out the last P2 demo board with:
- It is run by a 5V 1A power pack and runs 160MHz = 5W @ 1.8V = 2.8A. It has a miniature heatsink (+ miniature fan??)
- This board has multiple graphics etc, etc all running, giving 1280 MIPS
- He explains the very same P2 chip runs in all these modes !!!

Works for me

- but I'd adjust that 16MHz case to ~1.40V core (guess), and 300mW (peak)

ctwardell · 2014-04-02 14:44

jmg wrote: »

That makes the MHz ceiling pretty much set by the data-width, and process, so ~160MHz would be it.

I wonder how much power consumption can be attributed to the WIDE access to the HUB.

C.W.

Brian Fairchild · 2014-04-02 14:46

jmg wrote: »

I did not see any specific PCB artwork...

http://www.jedec.org/sites/default/files/docs/jesd51-12.pdf

...seems to be relevant.

Cluso99 · 2014-04-02 14:49

re my P1B suggestion...

jmg wrote: »

Correct, but this saves Die AREA, more than it saves Watts. Remember, Chip has ~90% Clock gating already.
....
Because Chip already uses ~90% Clock gating, that 160MHz would have very similar power issues as P2.
= Not actually solving the present problem.

Yes, a smaller die and package might then be possible, but with similar power, that raises power density, which is the wrong way to go.

ie a TQFP144 = Larger Thermal PAD package => lower thermal resistance than a TQFP128 Thermal PAD

We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.

This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.

Chip has said that idle current (leakage) for 160nm is ~1mA. So if the instructions (ALU) aren't using the power, then what is??? We know SRAM doesn't consume much power, almost irrespective of size (within reason of what we are talking about here).

The argument re 160MHz vs 200MHz is that a while ago, 200MHz was possible. Since then a lot has been added to many paths, and now we are back to 160MHz. Quite likely a simpler P1B design would possibly ramp this back to 200MHz. Just because P1B could achieve 200MHz doesn't mean it has to be used at this speed. Run it at 96MHz (good for USB) and power is halved.

Cluso99 · 2014-04-02 14:53

re my Salesman post...

jmg wrote: »

Works for me - but I'd adjust that 16MHz case to ~1.40V core (guess), and 300mW (peak)

I presumed the same 180nm as the P2 was going to be fabricated with.
I presume the voltage is process specific???

But using the half power of 65nm or 90nm, then halve those powers and run at ~1.2V. But the idle power goes up significantly to ~30% of the total power, just for idling.

jmg · 2014-04-02 15:26

Brian Fairchild wrote: »

http://www.jedec.org/sites/default/files/docs/jesd51-12.pdf

...seems to be relevant.

Yes, I also found this
http://www.ti.com/lit/an/slma002g/slma002g.pdf

The JEDEC examples show 4 layer board is needed, and spreading heat is the key.

TI show 49 vias on TQFP128 and 100 vias on TQFP144, for their heat spreading.

jmg · 2014-04-02 15:34

Cluso99 wrote: »

re my Salesman post...

I presumed the same 180nm as the P2 was going to be fabricated with.
I presume the voltage is process specific???

It is, for a given fMAX, but at 16MHz speeds, you can also lower the Vcc quite a bit.

Look at any of the Wide-Vcc micro specs (non regulator ones), and they have MHz/Volts profiles of CMOS Logic
one example from Atmel
* < 2 MHz @ >1.7
* < 4 MHz @ >1.8
* < 10 MHz @ >2.7
* < 16 MHz @ >4.5

That's an 8:1 MHz ratio, over a 2.647:1 Vcc ratio, and power is significantly reduced at the lower Vcc.
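
To see how much that buys, here is a sketch using the usual first-order CMOS model where dynamic power goes as C x V^2 x f. The capacitance is dropped, so only the ratios mean anything, and the Vcc values are the minimums from the list above rather than measured operating points.

```python
# (max MHz, minimum Vcc) pairs from the Atmel example above
profile = [(2, 1.7), (4, 1.8), (10, 2.7), (16, 4.5)]


def relative_dynamic_power(mhz, vcc):
    """First-order dynamic power, V^2 * f, with the capacitance term dropped."""
    return vcc ** 2 * mhz


slowest_mhz, lowest_vcc = profile[0]
fastest_mhz, highest_vcc = profile[-1]

mhz_ratio = fastest_mhz / slowest_mhz
vcc_ratio = highest_vcc / lowest_vcc
power_ratio = (relative_dynamic_power(fastest_mhz, highest_vcc)
               / relative_dynamic_power(slowest_mhz, lowest_vcc))

print(f"MHz ratio {mhz_ratio:.0f}:1, Vcc ratio {vcc_ratio:.3f}:1, "
      f"power ratio about {power_ratio:.0f}:1")
```

In this model an 8:1 drop in clock alone would give 8:1 in power. Dropping Vcc along with it multiplies that by the square of the voltage ratio, which is where the big saving comes from.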

jmg · 2014-04-02 15:41

Cluso99 wrote: »

re my P1B suggestion...

We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.

This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.

This is a little confused. Clock edges are what cost the power, but opcodes do not all execute at once.
- only ONE opcode is chosen (effectively from a virtual ROM), and that is what is executed.

Chip already has ~90% clock gating, so unused blocks (like MUL and CORDIC) are not costing any power, until they are needed.

Cluso99 wrote: »

We know SRAM doesn't consume much power, almost irrespective of size

? SRAM Static power is small, but there certainly is a significant mA/MHz for working SRAM, which is why OnSemi want the SRAM included in their models.

Expect the new numbers to go both down from working Clock gating modeling, and up from SRAM inclusion.

potatohead · 2014-04-02 15:59

It's an aggressive design for the process.

Flat out.

No one little fix will address the power issue. We either deal with the power curve as it is, shrink the thing, or go and gut it until it's something else entirely.

Cluso99 · 2014-04-02 16:03

Looking at a P1B option (with P2 to follow later):

From the thread a year ago (yes 12 March 2013)
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223695&viewfull=1#post1223695

DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm

Since then, Hub has doubled to 256KB and more area has been allocated to core (cog ALUs).

hub RAM = 1.76 square mm (x8) = 14.08mm2 (256KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
aux RAM = 0.20 square mm (x8) = 1.60mm2
core = 14.71 +3.29 square mm = 18.00mm2

38.37mm2
plus power/ground and I/O

Recently, more area was saved and allocated to the core. I could not find a reference to this.
Let us say the core is now 22.00mm2, and the hub/cog/aux remain the same.

hub RAM = 1.76 square mm (x8) = 14.08mm2 (256KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
aux RAM = 0.20 square mm (x8) = 1.60mm2
core = 14.71 +3.29 +4.00 sqmm = 22.00mm2

42.37mm2
plus power/ground and I/O

FWIW http://forums.parallax.com/showthread.php/152109-P2-shrink-(28nm-40nm)-and-how-to-collect-US-3-200-000?p=1224362&viewfull=1#post1224362
The original P1 die was 7.28 sqmm (360nm???)

Now, what if (still at 180nm) a P1B was built with...

hub RAM = 1.76 square mm (x16) = 28.16mm2 (512KB)
cog RAM = 0.62 square mm (x8) = 4.96mm2
core = P1 + minor extras = 5.00mm2

38.12mm2
spare = 4.25mm2

42.37mm2
plus power/ground and I/O
And note the I/O will be quite a bit smaller since it is much simpler than P2.
With only 64 I/Os and simpler I/O than P2 (ie P1 + ADC and maybe 10K pullups?) would save quite a bit of space.
Perhaps this would permit another 128KB of hub ram.
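
The area budget above can be re-added with a few lines. This is only a sketch: the per-block areas are the figures from the linked post, the 32KB per hub block is inferred from 8 blocks making 256KB, and the 22.00mm2 and 5.00mm2 core sizes are the guesses made above. Summing the rows gives totals about 0.27mm2 higher than the 38.37 and 42.37 shown, so treat the spare area as approximate.

```python
HUB_BLOCK_MM2 = 1.76   # one hub RAM block (assumed 32KB, since x8 = 256KB)
HUB_BLOCK_KB = 32
COG_RAM_MM2 = 0.62     # per cog
AUX_RAM_MM2 = 0.20     # per cog


def logic_area_mm2(hub_blocks, cogs, core_mm2, with_aux=True):
    """Hub RAM + cog RAM + optional aux RAM + core, excluding power/ground and I/O."""
    area = hub_blocks * HUB_BLOCK_MM2 + cogs * COG_RAM_MM2 + core_mm2
    if with_aux:
        area += cogs * AUX_RAM_MM2
    return area


p2_area = logic_area_mm2(hub_blocks=8, cogs=8, core_mm2=22.0)
p1b_area = logic_area_mm2(hub_blocks=16, cogs=8, core_mm2=5.0, with_aux=False)
spare = p2_area - p1b_area
extra_blocks = int(spare // HUB_BLOCK_MM2)   # whole hub blocks that fit in the spare area

print(f"P2: {p2_area:.2f} mm2, P1B: {p1b_area:.2f} mm2, spare: {spare:.2f} mm2")
print(f"spare fits {extra_blocks} more hub blocks = {extra_blocks * HUB_BLOCK_KB}KB")
```

With these figures the spare core area alone holds two more 32KB blocks. Getting to another 128KB would have to come from the smaller I/O, as suggested above.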

So a P1B with:

64 I/O and ADC and optional 10K pullups
512KB or 640KB hub ram
SPI Flash (instead of I2C Eeprom)
Simple ROM Monitor (like P2 - no spin/tables/font in ROM)
hub access 1:8 instead of 1:16
96/100MHz normal, with 160-200MHz possible with higher power usage
Basic P1 instruction set (from P2 subset opcodes 0xxxxxx) and a few basic additional instructions
Power consumption similar to P1 when active (due to 180nm and 1.8V core) but higher idle current ~1mA

Cluso99 · 2014-04-02 16:06

jmg wrote: »

It is, for a given fMAX, but at 16MHz speeds, you can also lower the Vcc quite a bit.

Look at any of the Wide-Vcc micro specs (non regulator ones), and they have MHz/Volts profiles of CMOS Logic
one example from Atmel
* < 2 MHz @ >1.7
* < 4 MHz @ >1.8
* < 10 MHz @ >2.7
* < 16 MHz @ >4.5

That's an 8:1 MHz ratio, over a 2.647:1 Vcc ratio, and power is significantly reduced at the lower Vcc.

Thanks. This is good!

rjo__ · 2014-04-02 16:14

Cluso99 wrote: »

So our salesman Mr Jones goes into the prospective P2 customer and pulls out his first P2 demo board:
- It is run by a LiPo battery and runs at 16MHz = 0.5W @ 1.8V = 275mA
Next he pulls out a P2 demo board with:
- It is run by USB and runs at 32MHz = 1W @ 1.8V = 555mA
Now he pulls out the last P2 demo board with:
- It is run by a 5V 1A power pack and runs 160MHz = 5W @ 1.8V = 2.8A. It has a miniature heatsink (+ miniature fan??)
- This board has multiple graphics etc, etc all running, giving 1280 MIPS
- He explains the very same P2 chip runs in all these modes !!!

In my vision, our customer, Mr. Bloak, goes to Mr. Jones(a used car salesman) to inquire about the purchase of a V8 Mini. And of course the customer doesn't understand why a Mini needs a V8.

Mr. Jones invites him to sit in the Mini and 4 large screen TVs fold down to reveal a miniature arcade…

Mr. Jones then opens the hood… points at the V8 and says… "that's what generates the power!!!" Then he retrieves the cigarette lighter… with a c4P2 cleverly hidden inside the heater coil… and says… "that's what supplies the logic."

Of course this will only be possible with a c4P2:)

Cluso99 · 2014-04-02 16:16

re my P1B suggestion...

We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.

This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.

jmg wrote: »

This is a little confused. Clock edges are what cost the power, but opcodes do not all execute at once.
- only ONE opcode is chosen (effectively from a virtual ROM), and that is what is executed.

Chip already has ~90% clock gating, so unused blocks (like MUL and CORDIC) are not costing any power, until they are needed.

There have been lots of multiplexers and the like added. The P1 is nowhere near as complex in its instruction set, so it absolutely must use significantly less power than the P2 !

The P1 IIRC is 360nm. The scale to 180nm for the same P1 would be half the power for same clock with 1.8V.
Currently the P1 typically uses way less than 100mA @ 3.3V = 0.33W. Let's say worst case is 0.5W.
Now at 180nm and 1.8V and 0.25W = 138mA presuming still at 80MHz. At 160MHz we would be at 0.5W.
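
Spelled out as a sketch, so the assumptions are visible: the 0.5W worst case is a guess, the halving of power for the 360nm to 180nm shrink at 1.8V is taken as stated above rather than derived, and power is assumed linear in clock. The names are for illustration only.

```python
P1_WORST_CASE_W = 0.5       # guessed worst case for the current P1 at 3.3V, 80MHz
SHRINK_POWER_FACTOR = 0.5   # assumed: half the power at 180nm / 1.8V, same clock
CORE_VOLTS = 1.8


def p1b_power_w(mhz, ref_mhz=80.0):
    """Estimated power of a P1 shrunk to 180nm, scaled linearly with clock."""
    return P1_WORST_CASE_W * SHRINK_POWER_FACTOR * mhz / ref_mhz


def current_ma(watts, volts=CORE_VOLTS):
    """Supply current in mA at the core voltage."""
    return watts / volts * 1000.0


typical_p1_w = 0.100 * 3.3   # 100mA at 3.3V
at_80 = p1b_power_w(80)
at_160 = p1b_power_w(160)

print(f"P1 typical ceiling: {typical_p1_w:.2f} W")
print(f"180nm P1 at 80MHz: {at_80:.2f} W ({current_ma(at_80):.0f} mA)")
print(f"180nm P1 at 160MHz: {at_160:.2f} W ({current_ma(at_160):.0f} mA)")
```

Every step here is a scaling assumption, so the result is only as good as the halving factor and the worst-case guess going in.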

Just don't try and tell me that the P1 at 180nm is going to use the same power as P2 at 180nm. It simply is NOT TRUE !

? SRAM Static power is small, but there certainly is a significant mA/MHz for working SRAM, which is why OnSemi want the SRAM included in their models.

Expect the new numbers to go both down from working Clock gating modeling, and up from SRAM inclusion.

Yes, I am looking forward to that.

jmg · 2014-04-02 16:32

Cluso99 wrote: »

Just don't try and tell me that the P1 at 180nm is going to use the same power as P2 at 180nm. It simply is NOT TRUE

You seem to be missing that the P1 is a 4-clock core, so yes, a P1 in 180nm will be lower power than P2, but also lower MIPS.

Of course, remember that P2 can run at 20MHz to match MIPS with a present P1,
P1 is ~264mW, and a P2 will not be too different from that, at a 20MHz compatible Vcc.
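
The 20MHz figure follows from clocks per instruction. A small sketch, assuming the P1 runs at 80MHz with 4 clocks per instruction and the P2 executes one instruction per clock; the helper names are invented for this example.

```python
def mips_per_cog(mhz, clocks_per_instruction):
    """Instructions per microsecond for one cog."""
    return mhz / clocks_per_instruction


def matching_clock_mhz(target_mips, clocks_per_instruction):
    """Clock needed to reach target_mips per cog."""
    return target_mips * clocks_per_instruction


p1_mips = mips_per_cog(80, clocks_per_instruction=4)            # P1: 4-clock core
p2_clock = matching_clock_mhz(p1_mips, clocks_per_instruction=1)  # P2: single cycle

print(f"P1 at 80MHz: {p1_mips:.0f} MIPS per cog")
print(f"P2 clock for the same MIPS: {p2_clock:.0f} MHz")
```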

Cluso99 · 2014-04-02 16:38

While we wait for more data from Chip & OnSemi...

I only see 3 viable options:

Build a P1B (with a few suggested additional features) now, and P2 follows later in 40/65/90nm.
Build P2 in 40/65/90nm
Build P2 in 180nm and market variable speed vs power

For the P2 options 2 & 3, I don't see reducing its capability as a viable alternative. However, I see the following as viable options for the P2:

COG Ram method replacing AUX ram as a viable option
- It increases performance
- Simplifies the Cog Model
- Saves a little power
- Saves reasonable silicon space
  - Could be used for larger LIFO depth (currently 4)
  - Could be used to improve the Instruction Cache (depth or use cog ram block)
Changing Cogs so they are not all alike:
- 6 Cogs are pared down for use as I/O drivers
- 2 Cogs are pared down to run Hubexec without multitasking or multithreading (can be done in sw if required)
  - Is multi-threading required in the 2 or 6 cogs ???
- This option needs further discussion to see what can be removed for the 2 sets.

Cluso99 · 2014-04-02 16:45

jmg wrote: »

You seem to be missing that the P1 is a 4-clock core, so yes, a P1 in 180nm will be lower power than P2, but also lower MIPS.

Absolutely! That is what you (and Bill) are not taking into account. Nowhere did we ask for a P1B to be single cycle (in fact Chip reminded us yesterday), just a slight improvement and specifically stated.

Of course, remember that P2 can run at 20MHz to match MIPS with a present P1,
P1 is ~264mW, and a P2 will not be too different from that, at a 20MHz compatible Vcc.

By above calculations, the P2 will draw 625mW at 1.8V. ~264mW is not 625mW. The extra is in the instruction/alu logic!
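
Both numbers can be put side by side with a sketch. The 625mW is straight linear scaling of 5W at 160MHz with Vcc held at 1.8V. The second function asks what Vcc a first-order V^2 model would need for the same clock to land on ~264mW. That model is an assumption, and whether the P2 would actually run at 20MHz on that voltage is not known.

```python
import math

REF_WATTS = 5.0    # estimate at full speed
REF_MHZ = 160.0
REF_VOLTS = 1.8


def linear_power_w(mhz):
    """Power at 1.8V if it scales linearly with clock."""
    return REF_WATTS * mhz / REF_MHZ


def vcc_for_target_w(target_w, mhz):
    """Vcc at which a V^2 * f model reaches target_w at the given clock."""
    return REF_VOLTS * math.sqrt(target_w / linear_power_w(mhz))


p2_at_20 = linear_power_w(20)
vcc_needed = vcc_for_target_w(0.264, 20)

print(f"P2 at 20MHz, 1.8V: {p2_at_20 * 1000:.0f} mW")
print(f"Vcc for ~264mW at 20MHz: {vcc_needed:.2f} V")
```

So the two positions differ mainly on whether Vcc is allowed to drop with the clock. At a fixed 1.8V the gap to the P1 figure remains.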

Bill Henning · 2014-04-02 16:56

Since you obviously don't believe me or jmg, why don't you ask Chip how many transistors the extra instructions take?

And how many transistors there are as a whole?

I strongly suspect the extra instructions' transistors take less than 5% (more like 1%) of the P2's transistor budget.

In which case, power consumption would be reduced by 1%-5%

But wait... without those instructions, *MORE* instructions would have to be executed to do similar things, taking more cog ram, hub ram, and time to execute. Frankly, probably increasing power usage over time.

Cluso99 wrote: »

re my P1B suggestion...

We already have the P1 and it does not use anywhere near the power. But the instruction set has grown from ~64 to ~450. This uses a lot of transistors and this is the bulk of the power usage from what I can see. Reducing 450 to 64 gives 64/450*5W = 0.7W if this scale is linear. We also have far simpler video and counters.

This is the same argument Bill is using, and I don't buy it at all. Clocking transistors use the power. Remove those transistors and the power goes down proportionally.

Chip has said that idle current (leakage) for 160nm is ~1mA. So if the instructions (ALU) aren't using the power, then what is??? We know SRAM doesn't consume much power, almost irrespective of size (within reason of what we are talking about here).

The argument re 160MHz vs 200MHz is that a while ago, 200MHz was possible. Since then a lot has been added to many paths, and now we are back to 160MHz. Quite likely a simpler P1B design would possibly ramp this back to 200MHz. Just because P1B could achieve 200MHz doesn't mean it has to be used at this speed. Run it at 96MHz (good for USB) and power is halved.
