We're looking at 5 Watts in a BGA!

potatohead · 2014-04-03 15:29

Whilst I love the "Nike" philosophy actually it is possible to fail. When all the money is blown, all the employees are let go, when you have to sell the farm to survive. Sometimes there is no coming back.

Yep, and the Silicon Valley path on that is to fail quickly, cheaply, so that one just tries again.

Really, I wasn't speaking to this.

I was speaking to the fact that we, in this instance, really can't have fail as an option. Not that a particular philosophy makes more or less sense.

Additionally I was speaking to this:

Do you push forward in a market where this particular flaw will show up like a clown car at a circus or do you hope to find a guardian angel (VC or OnSemi) to bail out your design? These probably have their own risks as well, especially with a VC being involved.

One thing I haven't heard from the crowd here is how to market the P2 given its issues, or what to do when EDN does a review and brings up it being a power pig compared to cheaper alternatives on the market. It's going to take more than slamming a bunch of hobbyists and artist types using Arduino or repeated nostrums of "hard real time", etc. Evaluators will demand proof why the P2 is a better choice than, say, an STM32 M4 or some of the multicore ARMs from TI.

Yes. Precisely the right kinds of questions and discussion we need to have happen. How the device is positioned is going to matter A LOT. And to position it correctly, we are going to need to validate some of the strengths we understand, and we are going to have to make some proofs, identify targets, and do all that "other work" that happens after engineering gets done.

I personally would not release it in a way that would even allow that review, until "that other work" is done to a point where it's hard for them to produce a hit piece.

potatohead · 2014-04-03 15:30

Getting OnSemi as a partner could be good as long as the right deal is struck, like $1 for every P2 sold for the first 1 million units and then drop to 25 cents.

Yep. Fingers crossed something like that is possible.

jmg · 2014-04-03 16:04

Cluso99 wrote: »

Chip has said the same die can be used for different heat packages - QFP144, BGA256, and maybe others such as QFP144 with thermal pad.

Therefore I suggest Parallax continue and build P2 at 180nm and package in QFP144, but preferably in a larger package, i.e. larger pitch than 0.4mm, because it should be a little better with heat transfer (is this so?).

Better cooling is all about area, and I'm looking around for largest-PAD standard packages.

I have found exposed PADs mentioned of up to 9.5mm on TQFP100, and of 9, 11.2 x 9 and 11.5 x 11.5mm on 144 pin packages.

Figures given for an 8x8 mm pad 144 pin package come down to
19 K/W with the device mounted on a 4-layer JEDEC board (JESD 51-7) with thermal vias; exposed pad soldered.

Also given is 38 K/W on 2 layer, no thermal vias, so you can see how important that 'spread and move' of the vias is.

Something like an 11.5mm PAD TQFP144 can expand to a 15mm x 15mm copper area under the device and then drill-down with over 100 vias to inner planes. That should push the 19 K/W down a couple of clicks.
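
To put those thermal resistance figures in context, here is a rough sketch of the usual estimate, junction temperature = ambient + power x theta_JA. The two theta_JA values are the ones quoted above; the 25 C ambient and the power levels are assumptions picked only for illustration, and the function name is hypothetical.

```python
# Rough junction temperature estimate: T_j = T_ambient + P * theta_JA.
# theta_JA values (K/W) are the two figures quoted in the post above.
# The 25 C ambient and the power levels are illustrative assumptions.

THETA_4_LAYER_VIAS = 19.0  # K/W: 4-layer JEDEC board, thermal vias, pad soldered
THETA_2_LAYER = 38.0       # K/W: 2-layer board, no thermal vias


def junction_temp_c(power_w, theta_ja_k_per_w, ambient_c=25.0):
    """Return the estimated junction temperature in degrees C."""
    return ambient_c + power_w * theta_ja_k_per_w


for power_w in (1.0, 3.5, 5.0):
    t4 = junction_temp_c(power_w, THETA_4_LAYER_VIAS)
    t2 = junction_temp_c(power_w, THETA_2_LAYER)
    print(f"{power_w:.1f} W: {t4:.0f} C (4-layer, vias), {t2:.0f} C (2-layer)")
```

Under these assumptions the 2-layer figure doubles the temperature rise for any given power, which is why the via and plane area matter so much here.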

TQFP144 has 16 more pins than 128, which can (all) be GND/Thermal, and once the die is proven, they could investigate tooling for an even larger-PAD and special thermal-lead-frame, where some leads are not separated until the final bend.
PCB designs would not be affected by that change. (or the numbers may show it is not needed )

Cluso99 wrote: »

If the P2 has some temperature sensor on board we can monitor this while trying to profile (and validate) the P2.

I think my idea of utilising task(s) as "idle" clocks to reduce speed by 0-15:16 clocks should be implemented, as this permits each cog to be individually slowed in 1/16 clock increments to lower power.

Agreed, the simple steps of better clock control, and Temperature sense are 'no brainers' here.
It may be possible to combine a fuse into the clock divider options, also low risk if it defaults fast.

David Betz · 2014-04-03 16:14

tonyp12 wrote: »

A 5W P2 will take years before it makes a profit as sales will not be that high, so to use the profit for a 90nm later would mean in year 2020, and then it should be time for a P3 at whatever reasonable nm fab then.

Kickstarter: the good thing is that they are pre-orders and not a share in the IP/Company, but sale prices need to be low as people are expecting a discount for pre-ordering 6 months in advance.
Getting OnSemi as a partner could be good as long as the right deal is struck, like $1 for every P2 sold for the first 1 million units and then drop to 25 cents.

I wonder how long it will take to sell 1 million P2 chips? Anyone know how many P1 chips have been sold to date?

Tubular · 2014-04-03 16:35

I do like the idea of having a single fuse which, when blown, enables the "performance mode". By setting that fuse you're acknowledging the thermals and power supply have been attended to. There can still be thermal measurement and/or protection devices of course.

In simplest form, that fuse could prevent access to the higher PLL multiplier tap points, but how do you know what the actual physical input frequency is? Compare it to the onboard oscillator? Or lowpass filter Xin unless that fuse is blown?

The problem I see is that the OnSemi figure is an estimate erring on the cautious side, with a wide range. So with a single fuse you might target a max theoretical power of, say, 1 Watt, because that 1 watt would comfortably dissipate in a small two layer pcb. But then, in production, you might find out the power consumption figures were on the high side, and that you've limited the default chip (that we'll write all the obex stuff for) unnecessarily to 0.6 watts.

Maybe that scenario doesn't matter as much as just having two very plain options - "safe" and "performance"

msrobots · 2014-04-03 16:40

I'm "Heater" give me the chip now. Hot as it might be. I feel sure that Bill has a point. The worst case of 5W is not where we will be.

I mostly agree with you about that. Except you are "Heater." now. Nobody knows what happened to "Heater".

Enjoy!

Mike

jmg · 2014-04-03 16:43

Tubular wrote: »

I do like the idea of having a single fuse which, when blown, enables the "performance mode".

I would flip that, for insurance, just in case there were fuse problems.
Better to lose the slow mode, than the fast mode !!
Reversing the sense to blow-for-slow, also allows Parallax the option of a price-separation, if they want.

Tubular wrote: »

In simplest form, that fuse could prevent access to the higher PLL multiplier tap points, but how do you know what the actual physical input frequency is? Compare it to the onboard oscillator? Or lowpass filter Xin?

- but users may want to run timers to 160MHz, so I think that means the fuse control needs to be on the Clock-Gate control per COG. ie limits to something under 100% clock enable

2 Fuses would allow a Fast COG and the rest, or maybe 1 fuse just slows down 7 COGs ?

Cluso99 · 2014-04-03 16:45

Tubular wrote: »

I do like the idea of having a single fuse which, when blown, enables the "performance mode". By setting that fuse you're acknowledging the thermals and power supply have been attended to. There can still be thermal measurement and/or protection devices of course.

In simplest form, that fuse could prevent access to the higher PLL multiplier tap points, but how do you know what the actual physical input frequency is? Compare it to the onboard oscillator? Or lowpass filter Xin?

The problem I see is that the OnSemi figure is an estimate erring on the cautious side, with a wide range. So with a single fuse you might target a max theoretical power of, say, 1 Watt, because that 1 watt would comfortably dissipate in a small two layer pcb. But then, in production, you might find out the power consumption figures were on the high side, and that you've limited the default chip (that we'll write all the obex stuff for) to 0.6 watts.

Maybe that scenario doesn't matter as much as just having two very plain options - "safe" and "performance"

I think I would prefer the blown fuse to limit the chip. That way Parallax could provide reduced clock/whatever-restriction by pre-blowing the fuse during testing. Maybe we could dedicate a few fuses for some options.

Perhaps a blown fuse (which provides safety if the feature fails verification) could enable a tiny state machine that monitors temperature and fails gracefully (providing an indication?) if the temp gets too high.

Tubular · 2014-04-03 16:48

Just an update on the DE0 running Invaders 2.04 sans resistive dacs - about 340mA (1.7 watts),

ie the dac resistors (and some internal dissipation from driving them) may be consuming roughly 0.25W of the 1.95W total.

Tubular · 2014-04-03 16:57

jmg wrote: »

- but users may want to run timers to 160MHz, so I think that means the fuse control needs to be on the Clock-Gate control per COG. ie limits to something under 100% clock enable

2 Fuses would allow a Fast COG and the rest, or maybe 1 fuse just slows down 7 COGs ?

Yes you're probably right, let the counters and video generator run as fast as they need to and clock-gate the rest of the cog.

The more I think about it though I think it would be very important to have just the two options (1 fuse), with the vast bulk of obex code written for that default, safe case.

It would be a very brave decision to break the "all cogs created equal" mantra and have 7 slower cogs. I suspect you have to hobble all of them equally or it gets messy with "which object gets the fast cog".

jmg · 2014-04-03 17:11

Tubular wrote: »

It would be a very brave decision to break the "all cogs created equal" mantra and have 7 slower cogs. I suspect you have to hobble all of them equally or it gets messy with "which object gets the fast cog".

You can always consume 8 fuses, if the "all cogs created equal" mantra is really that important.

The figures show there is likely enough Thermal envelope to run ONE COG at full speed, so it is important to allow that.
8 fuses would preserve "all cogs created equal" mantra, and allow option of 2 (or 3) COGS at full speed, if the typical values are ok in production.
How many spare fuses are there in P2 ?

Tubular · 2014-04-03 17:27

You could have a fuse per cog, giving more granularity. But I like the simplicity of a single fuse, a bit like breaking the "warranty void" sticker for the intrepid and/or thermally aware. I acknowledge this is just a preference.

Or, we could embed a coil and inductively sense when a heatsink is on top of the chip

Bill Henning · 2014-04-03 17:30

Guys,

1) 5W is extreme worst case, NOT proven, could be 3.5W easily

2) Blow fuses to limit performance? DUMB. Don't treat customers as idiots. P1 did not have fuse to go over 80MHz!

3) Give table of max W per what MHz, let customer choose

Frankly, I find most of the discussion in this thread to be a "The Sky is Falling!" when we don't have anywhere near enough information.

Let's wait for more info from Chip & OnSemi.

The approach I like best?

Make the dang chip as currently envisioned, in an exposed pad TQFP-144 .5mm spacing - bigger surface area to dissipate heat. Also use the extra leads as grounds, to heat sink to top layer.

Label & Sell for the public at whatever thermal envelope seems to make the most sense, maybe two or three "grades". Let the market decide!

I don't think Parallax can afford to take 2-3 years to do a die shrink.

Much better to get this version to market, *THEN* shrink it as P3.

My best guess?

40MHz < 1W "Low Power Edition"
160MHz < 3.5W "Performance Edition"

That is marketable!

ozpropdev · 2014-04-03 17:38

+1 Bill

Can someone please shut (and lock) the door to the Opium Den!

Tubular · 2014-04-03 17:47

Bill Henning wrote: »

Guys,
2) Blow fuses to limit performance? DUMB. Don't treat customers as idiots. P1 did not have fuse to go over 80MHz!

I'd do it the other way and blow the fuse to enable the performance mode.

The danger is even at 5W (given available info), you will need to distinguish between customers designing in, and end users, who just want to be able to choose a number of obex objects and run with them. You will have a situation where the physically larger hardware platforms could run a full selection of objects, while the more compact ones may not be capable of running the same selection.

P1 is not relevant here because it can't get so hot, and therefore never created such a divide

Forcing a larger package would help, true, but this then disadvantages those who might want to embed it in a tighter space and engineer the thermal performance appropriately. TQFP128 is already quite a big chip, physically.

rjo__ · 2014-04-03 17:59

Ozpropdev:

Can someone please shut (and lock) the door to the Opium Den!

Wouldn't that be like closing your back door once the cows came home?

Bill Henning · 2014-04-03 18:10

Sorry, I guess I was not clear.

Fuses to limit performance are BAD, require additional logic, and totally unnecessary. People can read data sheets, and decide on what thermal envelope they want.

Once we know what they are, and they are characterized.

Obex objects can look at clkfreq, and if I hear one more "we must limit speed / performance / what people can do for sake of Obex" I think I will scream.

The attitude of protecting beginners is appropriate to Basic Stamp, other chips made only for education, but NOT for something for a wide commercial market! Certainly should not be used to limit capabilities.

Re: packages - TQFP-144 .5mm with ground pad is a bit bigger, true, but it is cheaper to make the P2 with one "standard" package, and it should have significantly better thermal performance than TQFP-128 .4mm

Tubular wrote: »

I'd do it the other way and blow the fuse to enable the performance mode.

The danger is even at 5W (given available info), you will need to distinguish between customers designing in, and end users, who just want to be able to choose a number of obex objects and run with them. You will have a situation where the physically larger hardware platforms could run a full selection of objects, while the more compact ones may not be capable of running the same selection.

P1 is not relevant here because it can't get so hot, and therefore never created such a divide

Forcing a larger package would help, true, but this then disadvantages those who might want to embed it in a tighter space and engineer the thermal performance appropriately. TQFP128 is already quite a big chip, physically.

RossH · 2014-04-03 18:24

Bill Henning wrote: »

1) 5W is extreme worst case, NOT proven, could be 3.5W easily

No, it is not proven, and we may indeed be overstating the problem. But on the other hand you appear to understate it. Chip's initial post said "My gut feeling is that we can get the power down to 5W ..." (my emphasis). The extreme worst case could be well over 6W since the "4-6W" preliminary estimate from OnSemi is only for the core, and does not include memory or I/O power.

If anyone can find a solution then I'm sure it will be Chip, but he would not have posted if he did not think the problem was real - "huge" and "outrageous" were his actual words. Proposing that Parallax proceed with the current design and "let the customer choose" seems a little risky when the kind of customers Parallax will need to attract to make the P2 profitable are usually quite risk averse.

Ross.

cgracey · 2014-04-03 18:45

The engineer we are working with at OnSemi sent me this report back today of a single cog's power consumption:

****************************************
Report : Averaged Power
Design : cog
Version: F-2011.06-SP1
Date   : Thu Apr  3 17:00:24 2014
****************************************



  Attributes
  ----------
      i  -  Including register clock pin internal power
      u  -  User defined power group

                        Internal  Switching  Leakage    Total
Power Group             Power     Power      Power      Power   (     %)  Attrs
--------------------------------------------------------------------------------
io_pad                     0.0000    0.0000    0.0000    0.0000 ( 0.00%)  
memory                     0.0000    0.0000    0.0000    0.0000 ( 0.00%)  
black_box                  0.0000    0.0000    0.0000    0.0000 ( 0.00%)  
clock_network              0.0359 3.723e-03 1.461e-08    0.0396 ( 5.62%)  i
register                   0.0749    0.0254 8.592e-07    0.1003 (14.22%)  
combinational              0.2649    0.3005 2.227e-06    0.5654 (80.17%)  
sequential                 0.0000    0.0000    0.0000    0.0000 ( 0.00%)  

  Net Switching Power  =    0.3296   (46.74%)
  Cell Internal Power  =    0.3756   (53.26%)
  Cell Leakage Power   = 3.100e-06   ( 0.00%)
                         ---------
Total Power            =    0.7052  (100.00%)

Note the "combinational" section. This represents, mainly, all the data paths through the ALU, fed by the 32-bit D and S signals. Every possible result is being computed, though only one is being picked by the result mux. The logic cells are taking 0.2649 Watts, while the wire drivers (inverting buffers) to feed those cells are taking 0.3005 Watts. This accounts for 80% of power dissipation.

I'm going to add several sets of 32-bit flops to selectively gate D and S before they go everywhere. This may cut the power down quite a bit, as those signals won't go where they won't be needed, causing lots of unnecessary toggling.
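
As a sanity check on the report, the non-zero rows can be re-added by hand. This is only a sketch: the numbers are copied from the table above, the variable names are made up for illustration, and small differences from the printed totals are expected because the report rounds each column.

```python
# Per-group power in watts, copied from the report above as
# (internal, switching, leakage). Rows that are all zero are omitted.
groups = {
    "clock_network": (0.0359, 3.723e-03, 1.461e-08),
    "register":      (0.0749, 0.0254,    8.592e-07),
    "combinational": (0.2649, 0.3005,    2.227e-06),
}

# Total per group, then the grand total for one cog.
totals = {name: sum(parts) for name, parts in groups.items()}
grand_total = sum(totals.values())

for name, watts in totals.items():
    share = 100.0 * watts / grand_total
    print(f"{name:14s} {watts:.4f} W  {share:5.2f}%")
print(f"{'total':14s} {grand_total:.4f} W")
```

The combinational row dominates at roughly 80% of the single-cog total, which is the part the D and S gating is aimed at.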

Cluso99 · 2014-04-03 18:52

Chips are speed marketed all the time. Often the higher speed chips are just tested at the higher speeds and you pay for it. Often chips are sold with lesser features cheaper - but often they are just chips that failed certain tests and sometimes they include the extra features anyway. Bonding sometimes discriminates between them. So I don't see a problem with using fuse(s) to limit a chip's features/speed. Best to blow the fuses to limit the functions though.

The P1 is different. Sure we overclock, but too high a clock does not kill the chip, it just fails to work/work-reliably.

On the P2, from Chip's info, there isn't any doubt in my mind that the sw could kill the P2 as it currently stands (by thermal runaway). However, as for P1, the P2 is going to be difficult to parameterise. Therefore, the P2 could get nasty results from dying chips. The reverse is true for P1 - it is almost impossible to kill a P1 (provided all power/gnd pins are connected properly).

BTW If Chip didn't see 5W as being a problem, or that it would be a lot less than that, I am sure we would have heard this from him by now.

potatohead · 2014-04-03 18:55

Seconded easily.

ozpropdev wrote: »

+1 Bill

Can someone please shut (and lock) the door to the Opium Den!

Bill Henning · 2014-04-03 18:57

Thanks Chip!

So if I understand your post correctly:

- a cog going full blast, including counters, cordic, serial, video, multipliers, dividers simultaneously consumes 0.7W
- you are adding some flops to D and S to selectively gate them, so that only the actually used subsystems (in a clock) will change state, cutting power drastically

Based on the above:

80% of 0.7W is 0.56W, so the gating can significantly cut down on the 0.56W of the combinatorial logic power consumption.

Assuming that is cut in half:

we get 0.7W - 0.28W = 0.42W / cog, times 8 cogs going full blast, 3.36W - that is VERY close to the 30% off from 5W you predicted in your first post.

And totally in-line with the 1W-2W you suggested for typical power use a long time ago.

I really don't think the sky is falling.
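
For anyone who wants to try other guesses, the estimate above can be sketched as a small function. The 0.7052 W and 0.5654 W figures come from the single-cog report; the fraction of combinational power that gating removes is purely an assumption (50% is the guess used above), and the function name is hypothetical. Hub memory and I/O are not included, since those rows are zero in the report.

```python
# Whole-chip cog power if D/S gating removes some fraction of the
# combinational power. Figures are from the single-cog report; the
# saving fraction is an assumption, not a measured value.

COG_TOTAL_W = 0.7052      # total power of one cog going full blast
COMBINATIONAL_W = 0.5654  # combinational share of that total
NUM_COGS = 8


def chip_cog_power_w(gating_saving):
    """Power of all cogs, with gating_saving in the range 0.0 to 1.0."""
    per_cog = COG_TOTAL_W - gating_saving * COMBINATIONAL_W
    return NUM_COGS * per_cog


for saving in (0.0, 0.25, 0.5, 0.75):
    print(f"saving {saving:4.0%}: {chip_cog_power_w(saving):.2f} W")
```

With the unrounded report values, a 50% saving comes out at about 3.38 W rather than 3.36 W; the difference is only rounding.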

Cluso99 · 2014-04-03 18:57

cgracey wrote: »
The engineer we are working with at OnSemi sent me this report back today of a single cog's power consumption:
****************************************
Report : Averaged Power
Design : cog
Version: F-2011.06-SP1
Date   : Thu Apr  3 17:00:24 2014
****************************************



  Attributes
  ----------
      i  -  Including register clock pin internal power
      u  -  User defined power group

                        Internal  Switching  Leakage    Total
Power Group             Power     Power      Power      Power   (     %)  Attrs
--------------------------------------------------------------------------------
io_pad                     0.0000    0.0000    0.0000    0.0000 ( 0.00%)  
memory                     0.0000    0.0000    0.0000    0.0000 ( 0.00%)  
black_box                  0.0000    0.0000    0.0000    0.0000 ( 0.00%)  
clock_network              0.0359 3.723e-03 1.461e-08    0.0396 ( 5.62%)  i
register                   0.0749    0.0254 8.592e-07    0.1003 (14.22%)  
combinational              0.2649    0.3005 2.227e-06    0.5654 (80.17%)  
sequential                 0.0000    0.0000    0.0000    0.0000 ( 0.00%)  

  Net Switching Power  =    0.3296   (46.74%)
  Cell Internal Power  =    0.3756   (53.26%)
  Cell Leakage Power   = 3.100e-06   ( 0.00%)
                         ---------
Total Power            =    0.7052  (100.00%)

Note the "combinational" section. This represents, mainly, all the data paths through the ALU, fed by the 32-bit D and S signals. Every possible result is being computed, though only one is being picked by the result mux. The logic cells are taking 0.2649 Watts, while the wire drivers (inverting buffers) to feed those cells are taking 0.3005 Watts. This accounts for 80% of power dissipation.

I'm going to add several sets of 32-bit flops to selectively gate D and S before they go everywhere. This may cut the power down quite a bit, as those signals won't go where they won't be needed, causing lots of unnecessary toggling.

Chip,

Thanks for the info.

While the news is disappointing, at least we/you know where to look. All those instructions have added up the power significantly.

Selectively gating some of the signals should no doubt help quite a lot. Possibly this will have a slight impact on speed too. But better we get power down quite a bit.

cgracey · 2014-04-03 19:00

Today, I compiled the original Prop1 design for a Cyclone IV device, like we have on the DE0-Nano and DE2-115 boards.

The total required LE's were 15,926. That would take only 71% of the DE0-Nano FPGA, though that FPGA wouldn't have enough RAM for the 64KB hub memory.

The other day I did a compile on a full Prop2 chip for Cyclone IV.

The total required LE's were ~252,000.

That means Prop2 is 15.8x the complexity of Prop1.

In other words, we could have, perhaps, 128 Prop1 cogs in the same die space we are working on now. How would that be? Now there would be a flexible power scenario!
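
The ratio is easy to reproduce. This sketch assumes that die area scales with LE count and that cogs account for nearly all of the Prop1 logic, neither of which is established here; the ~252,000 figure is approximate as given above.

```python
# Complexity ratio from the two Cyclone IV compiles quoted above.
# Assumption: die area scales roughly with LE count.

PROP1_LES = 15_926
PROP2_LES = 252_000   # approximate figure
PROP1_COGS = 8        # cogs in one Prop1

ratio = PROP2_LES / PROP1_LES
equivalent_cogs = ratio * PROP1_COGS

print(f"Prop2 / Prop1 LE ratio: {ratio:.1f}x")
print(f"Prop1-style cogs in the same logic budget: about {equivalent_cogs:.0f}")
```

That gives a little under 127 cogs, so 128 is a reasonable round number for the thought experiment.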

Bill Henning · 2014-04-03 19:04

There is NOTHING wrong with marking the chips at different speed grades.

I think it is a great idea to have say 40MHz "Low Power", 80MHz "Standard", 160MHz "Performance" published specs with each having separate power envelope.

There is a lot wrong with deliberately crippling chips to sell them at different price points. Is it legal? Sure. Is it ethical? Not in my opinion.

Is it sound business? I don't think so.

- It costs *more* to make the crippled chips.
- adding speed controls costs more, complicates the chips
- Inventory costs are much higher if you have more SKU's.
- Documentation costs are higher. Support costs are higher.

Cluso99 wrote: »

Chips are speed marketed all the time. Often the higher speed chips are just tested at the higher speeds and you pay for it. Often chips are sold with lesser features cheaper - but often they are just chips that failed certain tests and sometimes they include the extra features anyway. Bonding sometimes discriminates between them. So I don't see a problem with using fuse(s) to limit a chip's features/speed. Best to blow the fuses to limit the functions though.

The P1 is different. Sure we overclock, but too high a clock does not kill the chip, it just fails to work/work-reliably.

On the P2, from Chip's info, there isn't any doubt in my mind that the sw could kill the P2 as it currently stands (by thermal runaway). However, as for P1, the P2 is going to be difficult to parameterise. Therefore, the P2 could get nasty results from dying chips. The reverse is true for P1 - it is almost impossible to kill a P1 (provided all power/gnd pins are connected properly).

BTW If Chip didn't see 5W as being a problem, or that it would be a lot less than that, I am sure we would have heard this from him by now.

Cluso99 · 2014-04-03 19:05

I am presuming the register + combinatorial (~94%) is per cog, and the clock-network (5.6%) is the general overhead ?

So 94.4% x 0.7W = 0.66W * 8 cogs = 5.2W + 0.4W (clocking) = 5.6W max @ 160MHz ???

But if a cog is not operating that power is virtually 0.
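
Taking that presumption at face value (register and combinational power per cog, clock network counted once, idle cogs drawing nothing), the numbers can be sketched as below. The watt values are from the report; the split itself is only the presumption stated above, and the function name is hypothetical. Note that 5.6% of 0.7W is about 0.04W, so this reading lands nearer 5.4W, while 5.6W is what eight complete 0.7W cogs would give.

```python
# Chip power versus number of active cogs, under the presumption that
# register + combinational power is per cog and the clock network is a
# single shared overhead. Values in watts, from the single-cog report.

PER_COG_LOGIC_W = 0.1003 + 0.5654  # register + combinational
CLOCK_NETWORK_W = 0.0396           # counted once for the whole chip


def shared_clock_estimate_w(active_cogs):
    """Estimated power with the given number of cogs running flat out."""
    return active_cogs * PER_COG_LOGIC_W + CLOCK_NETWORK_W


for cogs in (0, 1, 4, 8):
    print(f"{cogs} active cogs: {shared_clock_estimate_w(cogs):.2f} W")
```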

potatohead · 2014-04-03 19:06

!!!

That's unreal!

But those interconnections would kind of add up

I agree with Bill, if the gating can help in the rough range indicated, it all might be viable still.

Got any news from OnSemi along the lines of a shared investment? Secondly, any thoughts on build to fund a shrink vs changes?

BTW: Thanks. Being open, honest, real is much appreciated as is being a part of this.

Bill Henning · 2014-04-03 19:06

Hmm... I can see a few problems:

- 32KB hub way too little for 128 P1 cogs
- 1 hub slot every 128 clock cycles. Ouch.
- does not solve the need for larger programs

BUT

Really cool thought experiment!

Allows a P1 cog on every pin!

cgracey wrote: »

Today, I compiled the original Prop1 design for a Cyclone IV device, like we have on the DE0-Nano and DE2-115 boards.

The total required LE's were 15,926. That would take only 71% of the DE0-Nano FPGA, though that FPGA wouldn't have enough RAM for the 64KB hub memory.

The other day I did a compile on a full Prop2 chip for Cyclone IV.

The total required LE's were ~252,000.

That means Prop2 is 15.8x the complexity of Prop1.

In other words, we could have perhaps 128 Prop1 cogs in the same die space we are working on now. How would that be? Now there would be a flexible power scenario.

Bill Henning · 2014-04-03 19:09

Very interesting!

Is the increase mostly CORDIC / MUL / DIV / MULACC etc., or is AUX taking a chunk of that?

At 160MHz, without considering new features, a P2 cog is 8x as fast as a P1 cog, so off-the-cuff the 8x performance increase needed a 16x increase in gates.

Considering pointers, aux, etc., a P2 cog is easily 16x faster than a P1 cog.

cgracey wrote: »

Today, I compiled the original Prop1 design for a Cyclone IV device, like we have on the DE0-Nano and DE2-115 boards.

The total required LE's were 15,926. That would take only 71% of the DE0-Nano FPGA, though that FPGA wouldn't have enough RAM for the 64KB hub memory.

The other day I did a compile on a full Prop2 chip for Cyclone IV.

The total required LE's were ~252,000.

That means Prop2 is 15.8x the complexity of Prop1.

In other words, we could have, perhaps, 128 Prop1 cogs in the same die space we are working on now. How would that be? Now there would be a flexible power scenario!

jmg · 2014-04-03 19:13

cgracey wrote: »

The engineer we are working with at OnSemi sent me this report back today of a single cog's power consumption:

Note the "combinational" section. This represents, mainly, all the data paths through the ALU, fed by the 32-bit D and S signals. Every possible result is being computed, though only one is being picked by the result mux. The logic cells are taking 0.2649 Watts, while the wire drivers (inverting buffers) to feed those cells are taking 0.3005 Watts. This accounts for 80% of power dissipation.

I'm going to add several sets of 32-bit flops to selectively gate D and S before they go everywhere. This may cut the power down quite a bit, as those signals won't go where they won't be needed, causing lots of unnecessary toggling.

Interesting numbers - Those are 160MHz, corner case ?

Missing still is Memory ? Did they solve Clock Gating Simulation ?

What they call 'Internal power' is intra-cell power, and what they call "Switching power", is routing, and route-drivers ?
There does seem to be a lot of combin nodes 'waving in the wind'
