We're looking at 5 Watts in a BGA!

Phil Pilgrim (PhiPi) · 2014-04-01 13:15

I do not believe that adding or paring features piecemeal is any way to embody a cohesive vision for the P2. Something important got left behind when the design process was opened to outside suggestions, and I think it will take a much more holistic retrenching to recover what was lost as a consequence. Hence my suggestion above.

-Phil

SRLM · 2014-04-01 13:26

Just curious: what is it that makes the power consumption get so high? For comparison, the Atmel SAM4S ARM processor consumes 0.1W at 120MHz, and it's a full ARM core. That's fifty ARMs or one 8 core Propeller2?

And I'm assuming that this is not an April Fools joke...

mindrobots · 2014-04-01 13:35

I'm guessing it's the LATTE instruction that was added last round. Those milk steamers take more power than you expect!!

cgracey · 2014-04-01 13:37

I just got off the phone with OnSemi.

They had done some power analyses and determined, already, that at the 65nm node, the core would take only 1.5W. It would actually take a lot less than that, as they are assuming a very high toggle rate, being conservative and lacking any actual numbers. At 65nm, the chip could be packaged very inexpensively, as it wouldn't have big heat issues. It could also run at 250MHz.

So, we discussed redoing this whole thing in 65nm technology. They didn't seem too deterred when I explained that we don't have a very big budget. We agreed that I'd send them a new set of files with the memories declared in Verilog, so that they could run more complete synthesis tests. We'll see where this goes.

As far as the current Prop2 being far-afield of what it might have been: This is true, but I think it's harmonious and balanced, as it is. It has a lot in it, but it's all congruous to me. It would be very easy to make on-chip tools for it with hub-exec. I think it just needs the right process to be viable.

jmg · 2014-04-01 13:38

cgracey wrote: »

OnSemi - My gut feeling is that we can get the power down to 5W and not need cooling by using a 17x17mm BGA package.

BGA is not drop-dead, it just makes modules more of a focus.

I guess that 5W is for all COGS at Full Clock speed ?

5W is mostly an issue if it bumps to demanding complex heatsinks

ALL COGS at Full MIPS is going to be a rare usage, what is the hot-spot on one COG at Full Speed ?

Do you have a mW/MHz value for that ? and one/all COG power levels ?

How does that power vary, with Opcode - ie how much power can the waiting/paused opcodes save ?

Another means to save power could be to time-slice (clock gate) whole COGS.
HUB slots remain but the COG is not operational for varying percentages of time, slashing power.

cgracey wrote: »

Pare down the design, jettisoning all kinds of features, like hub exec, and get back to something much smaller and maybe faster. The current FPGA implementation could be further honed and later, hopefully, made into a chip using a smaller technology.

A useful and relatively quick yardstick here could be to get numbers on a P1-Plus - & here some work is already in place :
eg
http://forums.parallax.com/showthread.php/154829-P1-Verilog-implementation-Alpha-testers-required

That would also allow more informed interpolation of the likely options in-between.

cgracey wrote: »

C) Drop the current design to four cogs, which would also reduce cache sizes and hub memory down to 128KB. This would also allow us to shrink the die considerably, as we could change the I/O pin aspect ratio to allow them to fit together more densely, occupying more of what was needed for the core. This would also mean the whole chip would fit on an FPGA.

Lowering HUB memory is going the wrong way, but fewer COGs is on the table, now you have threading and Hubexec.

Another approach, is asymmetric COGs - tools are in place for P2 and P1, so 4xP2 COGS and 4xP1 COGS (or even 2+6 etc) is a valid engineering solution.
Asymmetric could also be in MHz, eg every 2nd COG could be throttled to 50% or lower, to lower total power Density.
I'm sure some will have a philosophical heart attack at not all COGs the same, but this is engineering, not religion..

That said, the issue is looking more like a Power Density one, and it is the watts-per-cog that matters in that case.
4 COGS in half the area does not lower the Power Density, so the local die hot-spot is reduced by less than you think.
Likewise 2.5W in a package half the area, is the same W/cm2 problem.
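The power-density point can be sketched in a few lines of Python. This is only an illustration: the 17x17mm footprint is borrowed from the BGA package mentioned earlier as a stand-in area (the actual die area is not given in the thread), and the function name is hypothetical.

```python
def power_density_w_per_cm2(power_w: float, area_mm2: float) -> float:
    """Return power density in W/cm^2 for a given power and area."""
    if area_mm2 <= 0:
        raise ValueError("area must be positive")
    return power_w / (area_mm2 / 100.0)  # 100 mm^2 per cm^2


# ASSUMPTION: 17 mm x 17 mm package footprint used as the reference area.
full_area_mm2 = 17.0 * 17.0

full = power_density_w_per_cm2(5.0, full_area_mm2)       # 5 W, full area
half = power_density_w_per_cm2(2.5, full_area_mm2 / 2)   # 2.5 W, half the area

print(f"full: {full:.2f} W/cm^2, half: {half:.2f} W/cm^2")
```

Halving both the power and the area leaves the W/cm2 figure unchanged, which is the reason a smaller part does not automatically run cooler.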

jmg · 2014-04-01 13:41

cgracey wrote: »

I just got off the phone with OnSemi.

They had done some power analyses and determined, already, that at the 65nm node, the core would take only 1.5W. It would actually take a lot less than that, as they are assuming a very high toggle rate, being conservative and lacking any actual numbers. At 65nm, the chip could be packaged very inexpensively, as it wouldn't have big heat issues. It could also run at 250MHz.

So, we discussed redoing this whole thing in 65nm technology. They didn't seem too deterred when I explained that we don't have a very big budget. We agreed that I'd send them a new set of files with the memories declared in Verilog, so that they could run more complete synthesis tests. We'll see where this goes.

As far as the current Prop2 being far-afield of what it might have been: This is true, but I think it's harmonious and balanced, as it is. It has a lot in it, but it's all congruous to me. It would be very easy to make on-chip tools for it with hub-exec. I think it just needs the right process to be viable.

That is certainly a very good avenue to pursue. Valid Numbers always make for better decisions.

cgracey · 2014-04-01 13:42

Phil Pilgrim (PhiPi) wrote: »

I do not believe that adding or paring features piecemeal is any way to embody a cohesive vision for the P2. Something important got left behind when the design process was opened to outside suggestions, and I think it will take a much more holistic retrenching to recover what was lost as a consequence. Hence my suggestion above.

-Phil

Phil, a sped-up, slightly-enhanced Prop1 with analog I/O would be great and I wouldn't mind going that route, at all. It's certainly easier on the brain.

I understand why you're a bit allergic to Prop2, but I really think you could like it.

Tubular · 2014-04-01 13:48

Chip, is it possible to submit a couple of prior releases/versions to onsemi, so we have an idea of "power cost" of recent additions?

For me 5 watts is not a huge problem. P2 should eliminate at least 2W, probably ~3W of power in other chips that will no longer be required. These chips are distributed though, so yes some heat spreading would be required. I understand it's impractical for others, not being able to run from a 2.5W USB port, for instance.

Would it be ~2.5W at 80 MHz?

John Abshier · 2014-04-01 13:49

Last summer, if the shuttle run had been successful, we would all have been as happy as pigs knee deep in slop. Chips would have been available in the fall and everyone would be amazed at what people did with that design by now.

John Abshier

Kerry S · 2014-04-01 13:53

BGA is not an issue with Parallax making plug in modules.

5 Watts is not an issue unless extreme heat sinking is needed (for my applications). That is a fraction of my power supply budget but all of my designs are based on being hooked into 120/230 VAC...

How many P2 applications are really going to be battery powered?

For those applications that need power savings how much is to be gained by slowing the clock? How about a way to on the fly enable/disable power to the cogs when they are not needed?

The 65nm process would be a DREAM COME TRUE! If they can fit it into your budget...

I think the current P2 design is where it needs to be for features, capability, and function to compete with all of the 'traditional' options that are out there and will be coming up over the next 5 years. Cutting it back as suggested I think would do more to kill it than BGA + 5W would.

cgracey · 2014-04-01 13:56

Tubular wrote: »

Chip, is it possible to submit a couple of prior releases/versions to onsemi, so we have an idea of "power cost" of recent additions?

For me 5 watts is not a huge problem. P2 should eliminate at least 2W, probably ~3W of power in other chips that will no longer be required. These chips are distributed though, so yes some heat spreading would be required. I understand it's impractical for others, not being able to run from a 2.5W USB port, for instance.

Would it be ~2.5W at 80 MHz?

I don't want to burden them with extra work, since we're in the negotiating/exploratory phase right now.

I wouldn't mind 5W, myself, either, because I really want to program the chip. It just seems a lot to ask people to deal with. In the days of the 180nm Pentium, this would have been a no-brainer, but now so many chips are being built in so much smaller processes that people's expectations are for chips to not be hot. If the chip is something really compelling, though, heat would be tolerated. I'm afraid the heat issue would stop people early on, though.

cgracey · 2014-04-01 14:02

Jmg,

Those power numbers are coming from their synthesis tools, so that means they are supposing all 8 cogs are going full-blast, along with every on-chip peripheral. It would be an absolute worst-case situation.

mindrobots · 2014-04-01 14:19

In the real world (sorry, I'm not an engineer type), what does 5 watts really mean beyond the obvious current draw and power source concerns?

It gets warm in a case? The case gets warm?

Toss a heatsink on it and convection cooling will dissipate enough heat to prevent thermal damage and the case from having a hot spot.

Enclosed P2s are going to need fans?

Liquid cooling? (Kidding!!)

mindrobots · 2014-04-01 14:24

Sorry, thought of another question.

The current memory is your hand laid memory, correct?

What does switching to Verilog memory do to the power profile? (I'm assuming they think it will reduce it.)

Does it have long term advantages or disadvantages to the design? More synthesized parts is good in the long run?

Heater. · 2014-04-01 14:24

Are we only talking about the logic here? Or does this include some nut head like me driving an LED at maximum possible current on every available pin? What then? Would that add significant heat generation in the package?

cgracey · 2014-04-01 14:26

Heater. wrote: »

Are we only talking about the logic here? Or does this include some nut head like me driving an LED at maximum possible current on every available pin? What then? Would that add significant heat generation in the package?

The 4-6W estimation was for core power, only, not including the 24 RAMs or the I/O pins.

Heater. · 2014-04-01 14:26

mindrobots,

Enclosed P2s are going to need fans?

Eeek, I would hate to have to put an actual propeller on a Propeller:)

cgracey · 2014-04-01 14:29

mindrobots wrote: »

Sorry, thought of another question.

The current memory is your hand laid memory, correct?

What does switching to Verilog memory do to the power profile? (I'm assuming they think it will reduce it.)

Does it have long term advantages or disadvantages to the design? More synthesized parts is good in the long run?

By putting the RAMs into Verilog, they will be able to pull them into their synthesis, using their own generated RAMs. This will help them better estimate power and area.

Their generated RAMs are well-characterized and amenable to auto-routing, unlike the ones we laid out by hand for 180nm.

mindrobots · 2014-04-01 14:33

Heater. wrote: »

mindrobots,

Eeek, I would hate to have to put an actual propeller on a Propeller:)

Never thought we'd get a fully functional beanie in the deal!!

Baggers · 2014-04-01 14:33

What are the chances of it going to 65nm tech?

mindrobots · 2014-04-01 14:33

cgracey wrote: »

By putting the RAMs into Verilog, they will be able to pull them into their synthesis, using their own generated RAMs. This will help them better estimate power and area.

Their generated RAMs are well-characterized and amenable to auto-routing, unlike the ones we laid out by hand for 180nm.

Thanks, Chip!

jmg · 2014-04-01 14:33

cgracey wrote: »

Jmg,

Those power numbers are coming from their synthesis tools, so that means they are supposing all 8 cogs are going full-blast, along with every on-chip peripheral. It would be an absolute worst-case situation.

Are those without Clock gating, and without RAM included ? Does full-blast mean 160MHz ?

How far away are W/MHz numbers, with Clock Gating working, and RAM at least somewhat simulated ?

Can they (easily) do numbers/estimates for say ~1.1V Core, (whatever gives 100MHz as full blast) ?
Being a RAM based process, does mean you can move Vcore around a little.

Addit: The current FIT %, were they with metal for ~5A of Idd ?
ie Do these new numbers mean you have to allow more for power/gnds metal ?

jmg · 2014-04-01 14:38

Heater. wrote: »

Are we only talking about the logic here? Or does this include some nut head like me driving an LED at maximum possible current on every available pin? What then? Would that add significant heat generation in the package?

Such numbers are usually core only, but with some corner case allowance.
IO power has to be added, but 20mA into 50 pins, of 12 ohms each, is 240mW more.
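That 240mW figure follows from I²R per pin. A minimal sketch of the arithmetic, treating the 12 ohms as the on-resistance of each pin driver and assuming all 50 pins source 20mA at once (the function name is hypothetical):

```python
def pin_dissipation_w(current_a: float, resistance_ohm: float, pins: int) -> float:
    """On-chip heat from pin drivers: pins * I^2 * R, in watts."""
    return pins * current_a ** 2 * resistance_ohm


# Numbers from the post: 20 mA per pin, 12 ohms per driver, 50 pins.
extra_w = pin_dissipation_w(0.020, 12.0, 50)
print(f"{extra_w * 1000:.0f} mW")  # 4.8 mW per pin, 240 mW in total
```

Only the drop across the driver counts here; the power burned in the LED and its resistor is dissipated outside the package.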

Brian Fairchild · 2014-04-01 14:42

mindrobots wrote: »

...what does 5 watts really mean...

Heatsink in moving air. So an additional $7 for a heatsink plus some moving air.

To put the 5W in context, the larger X chips seem to run at around the 600mW mark.

Bill Henning · 2014-04-01 15:08

I've been away for ~2 days, so it will take me a while to catch up.

From what I see:

- 5W is the worst case scenario, at 160MHz, 8 cogs going full blast

- the clock multiplier is user-settable, if a user needs lower power usage then

a) don't use all 8 cogs full blast
b) drop the clock frequency

hubexec / 4 code / 1 data cache is a proverbial drop in the bucket power usage wise

If I correctly remember

- power usage goes up linearly with the frequency (and by the square of the core voltage)
- other than leakage current, power usage is based on transistors changing state

so if the above is correct, reducing the hub size, aux, cog memory will not lead to large power savings

reducing the clock frequency, and number of cogs in use will lead to large power savings

Therefore:

- if you don't need to run at 160MHz, drop the multiplier
- if your app only needs 3 cogs to run, have the other 5 waiting, not using power
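A rough way to put numbers on those two knobs, sketched in Python. The assumptions are loud ones: the usual first-order CMOS model (dynamic power proportional to C·V²·f, so linear in clock frequency at a fixed core voltage), no leakage, an equal share of power per cog, and waiting cogs costing nothing. None of this comes from OnSemi's analysis, and the function name is hypothetical.

```python
def scaled_power_w(worst_case_w: float, clock_mhz: float, active_cogs: int,
                   max_clock_mhz: float = 160.0, total_cogs: int = 8) -> float:
    """Scale a worst-case power figure by clock ratio and active-cog ratio.

    ASSUMPTIONS: dynamic power only, linear in frequency at fixed voltage,
    power split evenly across cogs, idle cogs draw nothing.
    """
    if not 0 <= clock_mhz <= max_clock_mhz:
        raise ValueError("clock_mhz must be between 0 and max_clock_mhz")
    if not 0 <= active_cogs <= total_cogs:
        raise ValueError("active_cogs must be between 0 and total_cogs")
    return worst_case_w * (clock_mhz / max_clock_mhz) * (active_cogs / total_cogs)


half_clock = scaled_power_w(5.0, 80.0, 8)    # all 8 cogs at 80 MHz
three_cogs = scaled_power_w(5.0, 160.0, 3)   # 3 cogs at 160 MHz
print(f"{half_clock:.3f} W, {three_cogs:.3f} W")
```

Under these assumptions the 5W worst case drops to 2.5W at 80MHz and to about 1.9W with three cogs running. Real silicon would land higher because of leakage and the shared hub logic.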

Chip already stated that this is likely a worst-case analysis.

We should not be panicking, we should wait for more data.

Having said all that, I don't think 5W is unreasonable for EIGHT 160MHz processors going full blast!

Regarding 65nm.... drool... approx. 8x increase in transistor density, so even if Chip went for a die 1/4 the size of the current design, we'd still get 2x the transistor budget!
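The 8x and 2x figures can be checked with ideal area scaling, where density goes up with the square of the feature-size ratio. This is a sketch under that idealized assumption only; real processes rarely scale this cleanly, and the function name is hypothetical.

```python
def density_gain(old_nm: float, new_nm: float) -> float:
    """Ideal transistor-density gain when shrinking from old_nm to new_nm."""
    if old_nm <= 0 or new_nm <= 0:
        raise ValueError("feature sizes must be positive")
    return (old_nm / new_nm) ** 2


gain = density_gain(180.0, 65.0)    # roughly 7.7x, i.e. "approx. 8x"
quarter_die_budget = gain * 0.25    # die at 1/4 of the current area
print(f"density gain: {gain:.2f}x, quarter-die budget: {quarter_die_budget:.2f}x")
```

So a quarter-size die at 65nm would hold a little under twice the transistors of the current 180nm design, assuming ideal scaling.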

Edit:

I don't see anything wrong with 5W with all 8 cogs going full blast @ 160MHz

Bill Henning · 2014-04-01 15:12

Am curious.

On what process?

Also, how many tiles going full blast?

Brian Fairchild wrote: »

Heatsink in moving air. So an additional $7 for a heatsink plus some moving air.

To put the 5W in context, the larger X chips seem to run at around the 600mW mark.

Heater. · 2014-04-01 15:18

Bill,

Regarding 65nm.... drool... approx. 8x increase in transistor density, so even if Chip went for a die 1/4 the size of the current design, we'd still get 2x the transistor budget!

65nm and cool would be a dream. Especially if it meant more MIPS were possible as well.

The double transistor budget worries me. That might lead to the "we have room for this one little new feature...and what about we add the capability to..."

Best to soak up the transistors in RAM space.

Baggers · 2014-04-01 15:23

If it gets to 250MHz at 65nm as suggested, what happens then to the current SDRAMs? Will it need faster ones, or will we just have to lose that speed gain and run at 160MHz anyway?

Bill Henning · 2014-04-01 15:29

Re/ more ram space... we don't have more bits for hubexec addresses! That means segment registers or an mmu.

Or double port the hub to cut hub access to 4 cycles.

BUT ... let's hold off spending those "extra" transistors until we see what Chip / OnSemi come up with.

Heater. wrote: »

Bill,

65nm and cool would be a dream. Especially if it meant more MIPS were possible as well.

The double transistor budget worries me. That might lead to the "we have room for this one little new feature...and what about we add the capability to..."

Best to soak up the transistors in RAM space.

Bill Henning · 2014-04-01 15:32

I would be very surprised if 65nm did not run at least 320MHz, probably more like 450MHz, if the 180nm can run at 160MHz.

I wonder if OnSemi has flash cells for their 65nm process... that would be a nice use for the "extra" transistors.

Baggers wrote: »

If it gets to 250MHz at 65nm as suggested, what happens then to the current SDRAMs? Will it need faster ones, or will we just have to lose that speed gain and run at 160MHz anyway?
