We're looking at 5 Watts in a BGA!

Cluso99 · 2014-04-02 18:11

jmg wrote: »

Err ?? You might want to correct that, I think you mean BGA, not FPGA ??

Yes. Getting confused with what Xilinx etc call FTG256
FT(G)256 Fine-Pitch Thin BGA Package Specifications (1.00 mm Pitch)

No.

Looking at the TI and JEDEC docs above, there is actually not much difference thermally between BGA and ThermalPAD, and a key item is the multiple thermal vias - their TQFP144 example uses 100, so all those inner BGA pins are needed for thermal paths.
They could be mostly GND pins, but they will need to be there.

BGA does allow different die design, but that is quite a big change from the hand-layout rim we have now.

jmg · 2014-04-02 18:13

RossH wrote: »

Assuming I have my maths right (always a big assumption!) I'm not sure how you could ever "finesse" your way around a deal breaker like that, which would appear to make the P2 unusable (or at least horrendously expensive) for embedded applications.

The P2 does not always draw 5W, that is a corner case, mainly for thermal/metal design.

There are many use cases drawing much less, eg from above, it can match a P1 on ~170mW (est)

DL7PNP · 2014-04-02 18:15

How much more would 65nm production cost?

How much would a 65nm Propeller II cost?

If 65nm reduces power consumption and allows speeds of ~250MHz or more, compare that price to 8x 250MHz ARM MCUs or one 1-2GHz CPU. A 1GHz Cortex A8 costs ~$44.

Bill Henning · 2014-04-02 18:16

A few points

- 5W was a worst case figure from OnSemi, for running all eight cogs at full blast etc.; Chip stated that with clock gating a more realistic max would be about 30% lower (3.5W)
- The Pi SoC runs one processor at 700MHz; that's roughly the same as four cogs, and the Pi power figures do not assume everything running at full blast
- We should get more updated power figures from OnSemi soon

I am not trying to "finesse"

I am trying to point out that at a lower clock rate, or not using all the cogs full blast, the power consumption would be a LOT lower.

Since the multiplier is under software control, it's easy enough to use, say, a 10MHz crystal and start up at 40MHz, with as few cogs as your app needs. Why on earth use more cogs, or a higher clock speed, than your application requires?

And if an application needs more processor power - crank up the multiplier.
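
To put rough numbers on this, here is a minimal sketch that scales the quoted 5W worst case linearly with clock rate and with the number of running cogs. The linear model, the 160MHz reference clock, and the helper names are assumptions for illustration only; it ignores static power, the hub, and I/O, so treat the result as a ballpark rather than a prediction.

```python
# Rough first-order model: dynamic power scales with clock frequency and
# with the number of cogs running. All constants are assumptions based on
# figures quoted in this thread, not measured data.

WORST_CASE_W = 5.0       # quoted worst case: all 8 cogs at full clock
FULL_CLOCK_MHZ = 160.0   # clock assumed to go with the worst-case figure
TOTAL_COGS = 8


def pll_clock_mhz(crystal_mhz: float, multiplier: int) -> float:
    """Clock produced by a software-selected multiplier (hypothetical helper)."""
    return crystal_mhz * multiplier


def estimated_power_w(clock_mhz: float, active_cogs: int) -> float:
    """Scale the worst-case figure linearly by clock and by active cogs."""
    if not 0 <= active_cogs <= TOTAL_COGS:
        raise ValueError("active_cogs must be between 0 and 8")
    return WORST_CASE_W * (clock_mhz / FULL_CLOCK_MHZ) * (active_cogs / TOTAL_COGS)


if __name__ == "__main__":
    startup_mhz = pll_clock_mhz(10.0, 4)   # 10MHz crystal, x4 multiplier
    print(f"{startup_mhz:.0f}MHz, 2 cogs: {estimated_power_w(startup_mhz, 2):.3f} W")
    print(f"160MHz, 8 cogs: {estimated_power_w(160.0, 8):.3f} W")
```

Raising the multiplier or starting more cogs moves the estimate back toward the worst case, which is the trade-off being argued here.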

The FPGA experiments show that even 1080p is achievable at 80MHz, and cogs that are not running take little power.

Frankly, I think this whole power envelope issue is blown way out of proportion.

RossH wrote: »

Really? Doesn't 5W mean that just the P2 chip alone needs twice as much power as an entire Raspberry Pi single-board computer?

Assuming I have my maths right (always a big assumption!) I'm not sure how you could ever "finesse" your way around a deal breaker like that, which would appear to make the P2 unusable (or at least horrendously expensive) for embedded applications.

Ross.

Bill Henning · 2014-04-02 18:19

The existing applications would not be efficient on the P2.

Each P2 cog @ 160MHz is roughly equivalent performance-wise to a whole P1 @ 80MHz, and with hardware tasking, it would be silly to have separate cogs dedicated to simple functions such as fullduplexserial, ps2kb, ps2mouse.

Mind you, if it was used as such, with waitcnt etc., it would use an insignificant amount of power.

With P2, we have a good chance for most Spin & C code not to require huge changes. Simple non-optimized PASM changes won't be too bad.

Using pasm efficiently will require re-writes. C'est la vie.

RossH wrote: »

Well, apart from anything else, it means everyone's existing applications (many of which rely on having 8 cogs available) will have to be rewritten.

Not a very auspicious start for the P2, given that this kind of thing was a big contributor to the P1 failing to gain traction.

Ross.

Cluso99 · 2014-04-02 18:20

Bill Henning wrote: »

Ray,

unless Chip posts numbers to the contrary, the amount of transistors taken by hubexec, the single dcache line, and the four icache lines looks trivial compared to the total transistors in the P2.

I actually would be quite interested in a transistor count for a March last year cog, and a current cog. It would give us hard numbers to debate.

My point is, for power envelope purposes, that transistor count has to be compared to the whole die's transistor budget.

Based on the last few months' posts by Chip, I suspect the increase is less than 5% of the total transistor count. I actually think it's closer to 2%, but readily admit that is a guess.

But even if it was 5%, the capabilities of hubexec (and the performance boost due to caching) would be well worth it.

Bill, You know I am not just referring to those little bits. Stop trying to sidestep the issue.
You know as well as I do the core (the instructions/alu/cache/etc excluding hub ram) has grown from ~14mm2 to well in excess of 18mm2. They are hard transistors which are consuming power!! And the P1 was ~7mm2 total according to unconfirmed posted figures I quoted above with the thread reference.

Frankly I think Chip (and Beau for the I/O ring) did a fantastic job - earlier I posted that a 180nm Celeron had a 60W power envelope!

Yes, but that has really nothing to do with power. jmg showed in a previous post that the 66W Celeron at 1GHz scaled to ~5W at 160MHz.
Unfortunately, it looks as though that work has to be ditched to go to BGA. (That is the point in my question to Chip about whether the same die can be used in both BGA and QFP).

rjo__ · 2014-04-02 18:22

What is wrong with a full-bore, 180MHz 4Cog P2, with cogs operating at 40MHz?

I understand that there is a need for 8 Cogs in some situations, and in some situations there might be a need for more… that simply invites an organized effort to flesh out the architecture to fully support parallel controller systems, which seems like a no-brainer for Chip.

With 2 4cog P2's you get better cog performance, and more pin support. In the average, mid-level use, a user is more likely to run out of pins before he runs out of room for his logic.

A cog isn't just a cog… it has 4 little cogs inside… each of which is far more powerful than a P1 cog.

David Betz · 2014-04-02 18:22

rjo__ wrote: »

What is wrong with an energy-sipping, full-bore 4 cog P2, with Cogs operating at twice the bandwidth?

I keep hearing that the biggest advantage of the Propeller is the huge collection of objects in OBEX that can be mixed and matched to make complex applications easily. This depends on having a large number of independent cores. If you instead have half as many cores with multiple threads, it becomes much more difficult to configure the chip for a complex application. You need to start worrying about how to merge two objects into the same COG each running in a separate thread and whether there will be interactions that will cause one or both to stop working. This was far less likely with P1. Unless you can make the hardware threads behave as if they were separate COGs like the XMOS threads do then I don't think reducing the number is a good idea.

jmg · 2014-04-02 18:23

Cluso99 wrote: »

I am certain this is the case. A lot of the P2 Verilog can be leveraged back into P1 with some additional benefits.

For some customers (unfortunately don't know how many) an advanced P1B would be nice now. As I said above, it's quite doable with a few extras such as ADC, more pins (I suggested 64 but 48 would probably be fine too), faster (clock and hub 1:8), more hub ram, remove the ROM and add the monitor and security.

This forgets the hand crafted PAD Ring and Analog block, which involves a shipload of work.
ie any new package = a lot more money and time.

One option is to use the P2 PAD ring, and fill the cavity with (lotsa RAM + P1B), but I suspect your transistors=power blinkers would exclude that

A P1B at only 40MOPS would be pretty pedestrian and ordinary, and only 2x a P1

David Betz · 2014-04-02 18:23

rjo__ wrote: »

What is wrong with a full-bore, 180MHz 4Cog P2, with cogs operating at 40MHz?

I understand that there is a need for 8 Cogs in some situations, and in some situations there might be a need for more… that simply invites an organized effort to flesh out the architecture to fully support parallel controller systems, which seems like a no-brainer for Chip.

With 2 4cog P2's you get better cog performance, and more pin support. In the average, mid-level use, a user is more likely to run out of pins before he runs out of room for his logic.

A cog isn't just a cog… it has 4 little cogs inside… each of which is far more powerful than a P1 cog.

Those "little COGs" aren't really independent of each other. That's the difference.

Cluso99 · 2014-04-02 18:26

RossH wrote: »

Really? Doesn't 5W mean that just the P2 chip alone needs twice as much power as an entire Raspberry Pi single-board computer?

Assuming I have my maths right (always a big assumption!) I'm not sure how you could ever "finesse" your way around a deal breaker like that, which would appear to make the P2 unusable (or at least horrendously expensive) for embedded applications.

Ross.

FWIW:
IIRC the RPi uses >800mA at 5V which is then switched on the pcb to lower voltages. 800mA * 5V = 4W. IIRC it is about 720MHz and they can overclock to 800MHz.
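
The arithmetic behind that figure, and the comparison being drawn with the P2, can be written out as below. The 800mA draw is the recollection quoted above rather than a measured value, and the function name is purely illustrative.

```python
def power_w(volts: float, amps: float) -> float:
    """Electrical power in watts from supply voltage and current."""
    return volts * amps


RPI_SUPPLY_V = 5.0
RPI_CURRENT_A = 0.8     # recalled figure from the post above (assumption)
P2_WORST_CASE_W = 5.0   # worst-case figure discussed in this thread

rpi_board_w = power_w(RPI_SUPPLY_V, RPI_CURRENT_A)
p2_to_rpi_ratio = P2_WORST_CASE_W / rpi_board_w

if __name__ == "__main__":
    print(f"RPi board: {rpi_board_w:.2f} W")
    print(f"P2 worst case vs RPi board: {p2_to_rpi_ratio:.2f}x")
```

If the 800mA figure held, the P2 worst case would be about 1.25x the whole board rather than 2x, so the conclusion depends entirely on which current figure is right.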

Cluso99 · 2014-04-02 18:27

Bill Henning wrote: »

I actually liked your P1B proposal, but I worry that if it was made, we might never see P2.

Me too! Or else a few years away!

Cluso99 · 2014-04-02 18:33

Bill Henning wrote: »

The existing applications would not be efficient on the P2.

Each P2 cog @ 160MHz is roughly equivalent performance-wise to a whole P1 @ 80MHz, and with hardware tasking, it would be silly to have separate cogs dedicated to simple functions such as fullduplexserial, ps2kb, ps2mouse.

Mind you, if it was used as such, with waitcnt etc., it would use an insignificant amount of power.

With P2, we have a good chance for most Spin & C code not to require huge changes. Simple non-optimized PASM changes won't be too bad.

Using pasm efficiently will require re-writes. C'est la vie.

As I explained to Seairth and Ross last night, I took my Nokia 5110 LCD spin driver (modified from an OBEX contribution). I converted it to P1 PASM (a reasonable job). Then I converted it to run on the P2 PASM (quite simple and fast). Then I converted it to run in Hubexec mode (again quite simple and fast). BTW it wasn't a very complex driver.
But I was quite impressed by the ease at which it came across from P1 PASM to P2 PASM and then HUBEXEC.

David Betz · 2014-04-02 18:34

If we're going to get a P1B then how about supporting a variant on RDWORD that would expand its 16 bits into a full 32 bit long before storing it in the destination register? The algorithm for expanding the bits would be designed to support a compressed instruction set and the resulting 32 bit long would be a P1 hardware instruction. It would also be nice to have the cached version like the original P2 had (RDWORDC). This could expand the amount of code that fits in hub memory and at the same time make CMM run nearly as fast as LMM. This is assuming we lose hub execution mode.
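
As a purely illustrative sketch of what such an expansion could look like in software terms: the 16-bit layout and the template table below are invented for this example and are not a real or proposed Parallax encoding. Only the field order of the expanded long (opcode, ZCRI, condition, dest, src) follows the documented P1 instruction layout.

```python
# Hypothetical compressed word (assumption, for illustration only):
#   bits 15..12  index into a table of 16 instruction templates
#   bits 11..6   destination register number (0..63)
#   bits  5..0   source register number (0..63)
#
# Expanded 32-bit long, in P1 field order:
#   opcode(6) | ZCRI(4) | condition(4) | dest(9) | src(9)


def make_template(opcode: int, zcri: int, cond: int) -> int:
    """Pack the upper 14 bits of an instruction; dest and src are left zero."""
    return ((opcode & 0x3F) << 26) | ((zcri & 0xF) << 22) | ((cond & 0xF) << 18)


# Placeholder templates. A real table would be picked from instruction
# frequency statistics; these values carry no meaning.
TEMPLATES = [make_template(op, 0b0010, 0b1111) for op in range(16)]


def expand(word16: int) -> int:
    """Expand a 16-bit compressed word into a 32-bit instruction long."""
    index = (word16 >> 12) & 0xF
    dest = (word16 >> 6) & 0x3F
    src = word16 & 0x3F
    return TEMPLATES[index] | (dest << 9) | src


if __name__ == "__main__":
    compressed = (3 << 12) | (5 << 6) | 7
    print(f"{compressed:#06x} -> {expand(compressed):#010x}")
```

The cost of a scheme like this is that only 64 registers and 16 instruction forms are reachable from compressed code, which is the kind of trade-off the expansion algorithm would have to be designed around.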

Tubular · 2014-04-02 18:35

Cluso99 wrote: »

FWIW:
IIRC the RPi uses >800mA at 5V which is then switched on the pcb to lower voltages. 800mA * 5V = 4W. IIRC it is about 720MHz and they can overclock to 800MHz.

No, those figures are way off. It's true having a >500mA supply available is prudent because of power consumption by peripherals.

However I think the expectation of P2 being able to operate at roughly the same power level as the Pi, or Beaglebone, is a valid expectation. In typical use (not worst case), P2 may be around that mark already.

rjo__ · 2014-04-02 18:36

Sorry for the multiple posts… I wasn't seeing my posts on my end and thought they didn't make it.
I am sort of an average user, interested in what you guys are doing, but in no way capable of doing it myself.

About re-writing all of the applications… you have to sit back and ask yourself if that is absolutely true. Even if it is true, is that a good way to choose an architecture?

David… the P2 isn't done. If you need the "little Cogs" to be more independent… this would be the time to speak up.

A 4 Cog P2 is almost done… and it would be exactly what the 8 Cog P2 was supposed to be with better Cog performance… just fewer of them.

Cluso99 · 2014-04-02 18:37

David Betz wrote: »

I keep hearing that the biggest advantage of the Propeller is the huge collection of objects in OBEX that can be mixed and matched to make complex applications easily. This depends on having a large number of independent cores. If you instead have half as many cores with multiple threads, it becomes much more difficult to configure the chip for a complex application. You need to start worrying about how to merge two objects into the same COG each running in a separate thread and whether there will be interactions that will cause one or both to stop working. This was far less likely with P1. Unless you can make the hardware threads behave as if they were separate COGs like the XMOS threads do then I don't think reducing the number is a good idea.

Precisely. Much better to halve the clock and use 8 cogs - simpler and same power result. But you still have the opportunity to only run 4 cogs!
Static power isn't a problem at 180nm but becomes more of an issue at 40/65/90nm.
I definitely don't see any reason to halve the cogs.

RossH · 2014-04-02 18:44

Bill,

Bill Henning wrote: »

- 5W was a worst case figure from OnSemi, for running all eight cogs at full blast etc.; Chip stated that with clock gating a more realistic max would be about 30% lower (3.5W)

Not so. The worst case was 4-6W - and this is for the core only - it doesn't include memory and I/O power requirements. Chip was musing that with additional work they might get the final numbers down to 5W!

Bill Henning wrote: »

Frankly, I think this whole power envelope issue is blown way out of proportion.

Chip obviously does not think so - the word he actually used was "outrageous". I agree with him.

And (to answer a few posts that all have variations of the same theme) there is no point in arguments like "well, if you only run it at P1 speeds, you will only consume P1 power levels". Whether you are designing an embedded device or even a hobbyist board, you have to design your power supplies, boards, enclosures, cooling etc for the maximum power consumption, not the minimum. So if you expect only to run at P1 speeds, why on earth would you not just buy a P1 at a fraction of the cost and be done with it?

There is no question that something needs to be done - the P2 is unsaleable at these power levels. The question is what.

My vote still goes to option B.

Ross.

Cluso99 · 2014-04-02 18:45

jmg wrote: »

This forgets the hand crafted PAD Ring and Analog block, which involves a shipload of work.
ie any new package = a lot more money and time.

Won't this need to be done anyway for a BGA package ???

One option is to use the P2 PAD ring, and fill the cavity with (lotsa RAM + P1B), but I suspect your transistors=power blinkers would exclude that

I sort of suggested that earlier today - the P1B with 512KB or 680KB hub ram plus ADC but while I would like the pad ring features, I thought the interfacing would be too much for a 64 IO P1B.

A P1B at only 40MOPS would be pretty pedestrian and ordinary, and only 2x a P1

I did suggest we could get 160-200MHz. At 160MHz / 4 clock per instr * 8 cogs = 320 MIPs by my maths, or 100MHz / 4 * 8 = 200 MIPs. Where does 40 MOPS come from??

Bill Henning · 2014-04-02 18:45

Here we agree.

Cluso99 wrote: »

Precisely. Much better to halve the clock and use 8 cogs - simpler and same power result. But you still have the opportunity to only run 4 cogs!
Static power isn't a problem at 180nm but becomes more of an issue at 40/65/90nm.
I definitely don't see any reason to halve the cogs.

Bill Henning · 2014-04-02 18:48

Ray,

I am not side stepping anything.

I simply have not seen a post by Chip showing the increase from 14mm2 to 18mm2. That does not mean such a post does not exist, but if it does, my eyes must have missed it.

8 * 160MHz = 1.28GHz, so all of those cogs running full blast is roughly (minus FP) comparable to a 1GHz single-core Celeron, so if you want to reduce the Celeron to 160MHz, you can only compare to the power budget of one cog running.

Cluso99 wrote: »

Bill, You know I am not just referring to those little bits. Stop trying to sidestep the issue.
You know as well as I do the core (the instructions/alu/cache/etc excluding hub ram) has grown from ~14mm2 to well in excess of 18mm2. They are hard transistors which are consuming power!! And the P1 was ~7mm2 total according to unconfirmed posted figures I quoted above with the thread reference.

Yes, but that has really nothing to do with power. jmg showed in a previous post that the 66W Celeron at 1GHz scaled to ~5W at 160MHz.
Unfortunately, it looks as though that work has to be ditched to go to BGA. (That is the point in my question to Chip about whether the same die can be used in both BGA and QFP).

jmg · 2014-04-02 18:49

Cluso99 wrote: »

Unfortunately, it looks as though that work has to be ditched to go to BGA. (That is the point in my question to Chip about whether the same die can be used in both BGA and QFP).

The numbers do not really show that much thermal difference, in both cases the key is a lot of thermal vias and 4 layer PCB.

I see there is a 16x16mm 0.5mm PTQP0144 package, between the 20x20mm 0.5mm 144 pin and 14x14 x 0.4 TQFP128,
and there are also 128 pin 0.5mm packages that are larger, and should give lower power densities, and lower thermal resistances, if they have ThermalPAD variants. QFN does not seem to go large enough, or low enough Thermally.
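
For anyone wanting to plug in their own package numbers, the usual steady-state relation is junction temperature = ambient + power x theta-JA. The sketch below assumes that simple model; the theta-JA value in the example is a placeholder, not a figure from the TI or JEDEC documents mentioned above, and the function names are illustrative.

```python
def junction_temp_c(power_w: float, theta_ja_c_per_w: float,
                    ambient_c: float = 25.0) -> float:
    """Steady-state junction temperature for a given dissipation."""
    return ambient_c + power_w * theta_ja_c_per_w


def max_power_w(tj_max_c: float, theta_ja_c_per_w: float,
                ambient_c: float = 25.0) -> float:
    """Largest dissipation that keeps the junction at or below tj_max_c."""
    if theta_ja_c_per_w <= 0:
        raise ValueError("theta_ja_c_per_w must be positive")
    return (tj_max_c - ambient_c) / theta_ja_c_per_w


if __name__ == "__main__":
    PLACEHOLDER_THETA_JA = 20.0   # degC/W, made-up value for illustration
    print(junction_temp_c(5.0, PLACEHOLDER_THETA_JA))
    print(max_power_w(125.0, PLACEHOLDER_THETA_JA))
```

A lower theta-JA (more thermal vias, more copper layers, a larger exposed pad) raises the allowable power directly, which is why the via count matters more here than the choice between BGA and a ThermalPAD package.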

RossH · 2014-04-02 18:49

rjo__ wrote: »

... About re-writing all of the applications… you have to sit back and ask yourself if that is absolutely true. Even if it is true, is that a good way to choose an architecture?

I'm not sure I would argue it is a good way to do it - it is just the way it is done in most instances. The software costs of an embedded application generally outweigh the hardware costs - at least until you get up into volume figures that neither the P1 nor the P2 (nor the P3 for that matter) are likely to achieve.

Ross.

jmg · 2014-04-02 18:52

Cluso99 wrote: »

I did suggest we could get 160-200MHz. At 160MHz / 4 clock per instr * 8 cogs = 320 MIPs by my maths, or 100MHz / 4 * 8 = 200 MIPs. Where does 40 MOPS come from??

Obviously from 160MHz / 4, which you have above. The * 8 most users will find 'rather creative'.
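
Both numbers come out of the same formula; they differ only in whether the eight cogs are summed. A small sketch, assuming 4 clocks per instruction as stated above (the function names are illustrative):

```python
def per_cog_mips(clock_mhz: float, clocks_per_instr: int = 4) -> float:
    """Instruction rate of a single cog, in millions per second."""
    return clock_mhz / clocks_per_instr


def aggregate_mips(clock_mhz: float, cogs: int = 8,
                   clocks_per_instr: int = 4) -> float:
    """Combined instruction rate if every cog runs flat out in parallel."""
    return per_cog_mips(clock_mhz, clocks_per_instr) * cogs


if __name__ == "__main__":
    for mhz in (160.0, 100.0):
        print(f"{mhz:.0f}MHz: {per_cog_mips(mhz):.0f} per cog, "
              f"{aggregate_mips(mhz):.0f} across 8 cogs")
```

The per-cog figure is what a single sequential task sees; the aggregate only applies to work that actually splits across all eight cogs.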

rjo__ · 2014-04-02 18:57

Most of my projects could be handled by a single P1. But for convenience and to make sure that I never run out of resources in the future, I break up the load and use multiple P1's talking to each other over serial lines. Given the better serial support that is already present and the enhancements that are coming, implementing multiple P2 systems will be a breeze.

But I also don't see any reason why some chip level and tool level support for multi-P2 systems couldn't be done so that programming a dual P2 would be almost as easy as programming a single P2. These things should be there anyway… it just makes too much sense.

From what I read here, adding a second 4Cog P2 might be cheaper than adding heat sinks.

Cluso99 · 2014-04-02 19:00

Might I suggest I/WE start a new thread to discuss what could be a viable P1B.

1. The basic P1B being pretty much 48-64 I/O with ADC, more hub ram, 2-4x faster.

2. An improved P1B+ where we take some of the better features of P2 and quickly slip them in.

I think the basic P1B is fairly well understood, but what could we simply make the P1B+ be?
It would give Chip some recommended options if the P2 doesn't pan out to be viable without significant changes.

Maybe we could get P1B or P1B+ much sooner and P2 to follow soonish???

brucee · 2014-04-02 19:03

Sorry but 8x160=1.2 GHz is nowhere near equal to the performance of a 1 GHz Celeron processor. The Celeron has a bigger on chip cache and support for off chip caches, so it will be running much closer to that 1 G instruction for each GHz. The Prop falls off the cliff performance-wise as soon as your program exceeds 512 instructions, so it is a ridiculous comparison.

The big problem I see is that if it consumes as much as a Celeron, people will expect Celeron kind of performance out of it. And it just isn't there, even to compete in the RPi world you would need to be less than 2W and even there the RPi will run rings around the Prop.

RossH · 2014-04-02 19:04

Cluso99 wrote: »

Might I suggest I/WE start a new thread to discuss what could be a viable P1B.

1. The basic P1B being pretty much 48-64 I/O with ADC, more hub ram, 2-4x faster.

2. An improved P1B+ where we take some of the better features of P2 and quickly slip them in.

I think the basic P1B is fairly well understood, but what could we simply make the P1B+ be?
It would give Chip some recommended options if the P2 doesn't pan out to be viable without significant changes.

Maybe we could get P1B or P1B+ much sooner and P2 to follow soonish???

Good idea! The only thing I'd add to your list is some form of inter-cog communications for fast communications/synchronization between cogs that did not need to go via the hub.

Ross.

Bill Henning · 2014-04-02 19:11

More like:

Celeron 66W tdp max

P2 5W tdp max

So it is not even close.

brucee wrote: »

Sorry but 8x160=1.2 GHz is nowhere near equal to the performance of a 1 GHz Celeron processor. The Celeron has a bigger on chip cache and support for off chip caches, so it will be running much closer to that 1 G instruction for each GHz. The Prop falls off the cliff performance-wise as soon as your program exceeds 512 instructions, so it is a ridiculous comparison.

The big problem I see is that if it consumes as much as a Celeron, people will expect Celeron kind of performance out of it. And it just isn't there, even to compete in the RPi world you would need to be less than 2W and even there the RPi will run rings around the Prop.

Bill Henning · 2014-04-02 19:16

Good idea.

You could use your "Post Deleted" thread for this, just an advanced edit away...

Cluso99 wrote: »

Might I suggest I/WE start a new thread to discuss what could be a viable P1B.

1. The basic P1B being pretty much 48-64 I/O with ADC, more hub ram, 2-4x faster.

2. An improved P1B+ where we take some of the better features of P2 and quickly slip them in.

I think the basic P1B is fairly well understood, but what could we simply make the P1B+ be?
It would give Chip some recommended options if the P2 doesn't pan out to be viable without significant changes.

Maybe we could get P1B or P1B+ much sooner and P2 to follow soonish???
