We're looking at 5 Watts in a BGA!

richaj45 · 2014-04-01 19:08

Maybe it is time to seriously consider P1.5 in a 64 pin package with just a very few new features.

Only Parallax should determine those features and not the P1.0 customers that may or may not buy enough to justify the expense.

The Parallax guys did pretty good on 1.0 so I bet you will do even better on 1.5 without making the chip into something outside of Parallax's resources.

Every company has limited resources and the key to good engineering is doing the best with your limitations.

That's my two cents.
Regards,
Rich

Bob Lawrence (VE1RLL) · 2014-04-01 19:12

The 66-core version of the Parallella computer delivers over 90 GFLOPS on a board the size of a credit card while consuming only 5 Watts under typical work loads.

COST:
4,965 backers
$898,921 pledged of $750,000 goal

Cluso99 · 2014-04-01 19:12

160MHz clock...
1 Cog = 160 Mips max
2 Cogs = 320
4 Cogs = 640
8 Cogs = 1280

Now, why do we want less cogs ???

It would make more sense to permit each cog to run slower by 1/2, 1/4 & 1/8 by dividing the clock, referenced to the hub. But I fear there are other issues here due to the WIDE accesses.
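The arithmetic above, together with the per-cog clock division being suggested, can be sketched in a few lines of Python. This is only an illustration: it assumes one instruction per clock at full speed (the 160 MIPS per cog figure above), and `aggregate_mips` and its `dividers` argument are hypothetical names, not anything in the P2 design.

```python
def aggregate_mips(clock_mhz, dividers):
    """Sum peak MIPS over cogs, each running at clock_mhz / divider.

    Assumes one instruction per clock at full speed (the 160 MIPS
    per cog figure above); dividers are the proposed 1, 2, 4 or 8.
    """
    for d in dividers:
        if d not in (1, 2, 4, 8):
            raise ValueError(f"unsupported divider: {d}")
    return sum(clock_mhz / d for d in dividers)


# All eight cogs at full speed.
full_speed = aggregate_mips(160, [1] * 8)

# Hypothetical mix: two fast cogs, six cogs geared down by 8.
geared = aggregate_mips(160, [1, 1] + [8] * 6)

print(full_speed, geared)
```

The point of the geared case is that all eight cogs stay available while most of them clock (and presumably burn power) at a fraction of the full rate.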

Cluso99 · 2014-04-01 19:16

cgracey wrote: »

As easily as we could do 8 of the current Prop2 cogs on a 180nm die, we could implement 32 Prop1 cogs in the same area.

How much hub ram ?
Of course then we would want single clock execution

Cluso99 · 2014-04-01 19:19

cgracey wrote: »

I got the files ready for OnSemi, so that the memories are declared in the Verilog code.

To make sure the thing compiles without error, I had Quartus do its analysis and synthesis processes on it. It said it needs 262k LE's! That sounds right.

Excellent. Cannot wait to hear what they say.

With Verilog RAM I think there are a few opportunities to tweak the cog and aux ram to WIDE mode and this would overall simplify things. At the risk of getting the gallery offside, with WIDE to/from Cog I think Aux could be removed, LIFO stacks deepened and cache improved.

cgracey · 2014-04-01 19:22

Cluso99 wrote: »

How much hub ram ?
Of course then we would want single clock execution

Single-clock execution requires a 4-port RAM, not a single-port RAM. It also requires a lot more logic for data forwarding, etc.

There's some beauty in the simplicity of a 4-clock/instruction cog.

mindrobots · 2014-04-01 19:26

Cluso99 wrote: »

Excellent. Cannot wait to hear what they say.

With Verilog RAM I think there are a few opportunities to tweak the cog and aux ram to WIDE mode and this would overall simplify things. At the risk of getting the gallery offside, with WIDE to/from Cog I think Aux could be removed, LIFO stacks deepened and cache improved.

With Verilog RAM, we no longer have enough LEs in our DE2 FPGA to emulate/test anything and the Nanos are doorstops. (I guess that should have been phrased as a question.)

jmg · 2014-04-01 19:26

Cluso99 wrote: »

160MHz clock...
1 Cog = 160 Mips max
2 Cogs = 320
4 Cogs = 640
8 Cogs = 1280

Now, why do we want less cogs ???

It would make more sense to permit each cog to run slower by 1/2, 1/4 & 1/8 by dividing the clock, referenced to the hub. But I fear there are other issues here due to the WIDE accesses.

I would agree that clock-gearing should be explored before removing COGs, but given the jump in power, the present metal-routing in P2 may not be enough - which could hit physical limits.

BGA packages do buy better supply impedance, and more direct thermal paths, even if they are more of a pain to use, and bump PCB designs to more layers.

Perhaps P2 could be shipped as
100MHz in TQFP128 (or TQFP144?) and as
160MHz in a BGA 17x17 package ?

jmg · 2014-04-01 19:29

mindrobots wrote: »

With Verilog RAM, we no longer have enough LEs in our DE2 FPGA to emulate/test anything and the Nanos are doorstops. (I guess that should have been phrased as a question.)

I think the verilog RAM is just a stub to allow the tools at OnSemi to work & include RAMs (compiled) in the speed sims
- FPGA designs would still get the Quartus tools to use the Block RAMS of the FPGA.

rod1963 · 2014-04-01 19:32

And here I was expecting working silicon this year.

At first I thought this was a joke by Chip, I mean 5 watts sounded so absurd, it was like a throwback to ECL and Intel egg cooking processors. At that level the P2 will be at a serious disadvantage against watt sipping ARMs and PIC32s, not to mention killing any use in mobile devices.

I hope Chip figures out a workaround that still retains most of the functionality and also not breaking the bank in the process.

rod1963 · 2014-04-01 19:42

Bob Lawrence (VE1RLL) wrote: »

COST:
4,965 backers
$898,921 pledged of $750,000 goal

Bob

They also needed a cash infusion ($3.6 million) from Ericsson and an Israeli VC firm to fund their project as the costs skyrocketed beyond their wildest projections.

Makes me wonder if Parallax won't end up asking a VC firm to make the P2 a reality.

Bill Henning · 2014-04-01 19:47

Ray,

Multi-threading uses extremely little logic, so it would not make any noticeable difference on the power envelope.

Number of cogs, and what frequency and voltage they run at, are the primary consumers of the power (I think).

Cluso99 · 2014-04-01 19:49

Here are some interesting possibilities. I have mentioned these before, but with the hand laid ram this was not possible.

Build the COG and AUX with WIDE hub access
- This would give 8 long read/writes in each hub access without the need for a state m/c to break it down into 8 single long read/writes in 8 clocks.
With HUBEXEC mode and deeper caches, WIDE R/W to HUB, I truly believe that AUX and half of the COG RAM could be combined...
- Cog ram would be 2 blocks of 256 longs, 4 port (3 reads, 1 write)
- One half of the Cog ram could be configured (single opcode) to make the cog ram become the CLUT (and nothing else)
  - Since the AUX ram is clut only, it cannot be used for instruction code, thereby freeing the instruction read port to become the Clut/Video out port. A single multiplexor.
- The clut can now be accessed with normal cog instructions (because it is in reality half the cog except for the instruction read port). Therefore it can be updated using WIDE instructions - a big gain - a single hub read instruction RDWIDE, with 7 instructions available (presuming single clock RDWIDE) for other cog code. In fact it can run as a task within the cog.
- Part of the COG/CLUT can be used as stacks using PTRX & PTRY without issues except that the total space is now 512 longs.
- LIFOs are cheaper, being only 18 bits wide, so the CALL/RET could be much deeper now.
I think that now we could use the COG as the Instruction Cache, meaning that each task could have its own block of instruction cache. The same basically applies to Data Cache.

With HUBEXEC mode, I no longer think the COG suffers from lack of instruction or data space. Therefore I don't see the loss of AUX as being a problem any more. It sure reduces the instruction complexity. I don't see the video modes as requiring as much cog space as before, as hubexec mode can also be used efficiently. Remember, there are only likely to be 1 or 2 cogs doing video, and with WIDE loading, this makes it a breeze.

Bill Henning · 2014-04-01 19:50

Ray,

We have gone over this several times, and my friend, I am sorry you are wrong.

Removing AUX removes all the CLUT video capabilities, and stack capabilities.

Cluso99 wrote: »

Excellent. Cannot wait to hear what they say.

With Verilog RAM I think there are a few opportunities to tweak the cog and aux ram to WIDE mode and this would overall simplify things. At the risk of getting the gallery offside, with WIDE to/from Cog I think Aux could be removed, LIFO stacks deepened and cache improved.

Bob Lawrence (VE1RLL) · 2014-04-01 19:55

@rod1963

OK, thanks for that info. I didn't see it listed anywhere but finally found the reference below on the Adapteva Wikipedia page.

http://en.wikipedia.org/wiki/Adapteva

Wikipedia:
"The initial prototypes enabled Adapteva to secure $1.5M in Series-A funding from BittWare, a company from Concord, New Hampshire, in October 2009.^[2]"

Cluso99 · 2014-04-01 19:55

Bill Henning wrote: »

Ray,

Multi-threading uses extremely little logic, so it would not make any noticeable difference on the power envelope.

Number of cogs, and what frequency and voltage they run at, are the primary consumers of the power (I think).

Of course Bill.
But every little bit has added up to 5W. Something has to give here.
Would you rather pull the plug on hubexec? I certainly wouldn't!

I am very interested to see what would be likely in smaller geometry such as 90nm or 65nm. If it gave the tradeoff of 400MHz 5W or 200MHz 2.5W then we could have a winner

Bill Henning · 2014-04-01 19:59

Here Ray & I go again...

*IF* the memory is re-laid out, *AND* the additional muxes needed for doing everything 8-long do not increase critical paths too much, doing everything 8-long would be nice.

AUX/HALF-COG should not be combined as you say, because:

- the recovered read port would cost the ability to execute from half of this memory
- even if the re-configuration you posit is realistic, all of a sudden cog executable memory is reduced to 256 longs
- you would reduce total stack size, as in your proposed scheme, the PTRX/Y stack would be taken from the 512 cog longs, not the 256 long separate AUX
- I don't believe LIFO's are cheaper than ram cells (Chip? jmg?)
- we can't just use the cog memory as instruction cache, it is missing the tag bits, the LRU mechanism, it is not as simple as you hope

To summarize:

Going 8-wide for AUX/COG is a good idea, if reasonably feasible

Removing AUX - BAD idea

Trying to use cog memory as automagic cache not feasible as you wish, needs tag bits and LRU counter for every line

Cannot substitute hubexec for cog mode code for highest performance most deterministic driver use.

Ray, we went through this in a couple of threads before.

Cluso99 wrote: »

Here are some interesting possibilities. I have mentioned these before, but with the hand laid ram this was not possible.
Build the COG and AUX with WIDE hub access
This would give 8 long read/writes in each hub access without the need for a state m/c to break it down into 8 single long read/writes in 8 clocks.

With HUBEXEC mode and deeper caches, WIDE R/W to HUB, I truly believe that AUX and half of the COG RAM could be combined...
Cog ram would be 2 blocks of 256 longs, 4 port (3 reads, 1 write)

One half of the Cog ram could be configured (single opcode) to make the cog ram become the CLUT (and nothing else)
Since the AUX ram is clut only, it cannot be used for instruction code, thereby freeing the instruction read port to become the Clut/Video out port. A single multiplexor.

The clut can now be accessed with normal cog instructions (because it is in reality half the cog except for the instruction read port). Therefore it can be updated using WIDE instructions - a big gain - a single hub read instruction RDWIDE, with 7 instructions available (presuming single clock RDWIDE) for other cog code. In fact it can run as a task within the cog.

Part of the COG/CLUT can be used as stacks using PTRX & PTRY without issues except that the total space is now 512 longs.

LIFOs are cheaper, being only 18 bits wide, so the CALL/RET could be much deeper now.

I think that now we could use the COG as the Instruction Cache, meaning that each task could have its own block of instruction cache. The same basically applies to Data Cache.

With HUBEXEC mode, I no longer think the COG suffers from lack of instruction or data space. Therefore I don't see the loss of AUX as being a problem any more. It sure reduces the instruction complexity. I don't see the video modes as requiring as much cog space as before, as hubexec mode can also be used efficiently. Remember, there are only likely to be 1 or 2 cogs doing video, and with WIDE loading, this makes it a breeze.

David Betz · 2014-04-01 20:03

eldonb46 wrote: »

Or go even more extreme; How about two Cogs per chip (8 threads), with Large Memories, Maybe fewer I/O's?, Lower cost part, Maybe DIP parts could be available? and with very Integrated High Speed Connection to multiple other P2's (at least 4 or maybe +8). All chips would hopefully be equal, as the Cogs are in the P1.

For Marketing; P2 would actually mean two cores. :-)

Or, after this very hectic day (Apr 1), am I just dreaming?

I just want a P2 soon !

Are threads within a COG really going to be as easy to use as separate COGs were in P1? I seem to recall that there are still lots of restrictions if you want deterministic execution with hardware threads. Wouldn't going to four deterministic COGs be a step backwards even given the hardware threading? Maybe the threading can be made more deterministic though.

Bill Henning · 2014-04-01 20:05

Ray, that little bit did not add up to 5W.

The 8 cogs at 160MHz did.

Everything other than that (with possible exception of the impact of doubling the hub to 256KB) is a marginal difference. Seriously. Look at % of gates added. All these other features are totally insignificant in the power envelope.

Don't believe me.

Ask Chip, jmg, OnSemi.

You want to significantly reduce the power envelope?

- cut number of cogs drastically

OR

- drop frequency

OR

- drop core voltage

OR

- go to a smaller geometry

There is no magic simple fix of removing a small feature that uses a few gates. I wish there was.

OR simply accept a 3.5W-5W power envelope when all eight cogs are going full blast

That's what I'd do. Get that chip cooking and out, offer it in a couple of speed/power grades:

P2-80 .. 80MHz, probably max 1.5W, typical 0.5W
P2-160 .. 160MHz, probably max 4W, typical 1.5W

THEN start working on P3, target 65nm or smaller.
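A rough way to compare the options listed above is the first-order CMOS dynamic power relation, where power is proportional to (number of cogs) x (core voltage squared) x (frequency). The sketch below scales from the 5 W, 8-cog, 160 MHz worst case discussed in this thread. It assumes all power is dynamic and spread evenly over the cogs (ignoring hub, I/O and leakage), and both the 1.8 V baseline core voltage and the 1.2 V alternative are illustrative assumptions, not figures from Chip or OnSemi.

```python
BASE_WATTS = 5.0      # worst case quoted in this thread
BASE_COGS = 8
BASE_MHZ = 160.0
BASE_VCORE = 1.8      # ASSUMPTION: typical 180nm core voltage, not a thread figure


def scaled_power(cogs=BASE_COGS, mhz=BASE_MHZ, vcore=BASE_VCORE):
    """First-order dynamic power: P scales with N * V^2 * f."""
    return (BASE_WATTS
            * (cogs / BASE_COGS)
            * (mhz / BASE_MHZ)
            * (vcore / BASE_VCORE) ** 2)


for label, kwargs in [
    ("baseline", {}),
    ("4 cogs", {"cogs": 4}),
    ("80 MHz", {"mhz": 80}),
    ("1.2 V core", {"vcore": 1.2}),   # hypothetical lower core voltage
]:
    print(f"{label:12s} {scaled_power(**kwargs):.2f} W")
```

This only captures the proportional trends; it says nothing about whether the part would actually run at a lower voltage, and real numbers would have to come from OnSemi's simulations.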

Cluso99 wrote: »

Of course Bill.
But every little bit has added up to 5W. Something has to give here.
Would you rather pull the plug on hubexec? I certainly wouldn't!

I am very interested to see what would be likely in smaller geometry such as 90nm or 65nm. If it gave the tradeoff of 400MHz 5W or 200MHz 2.5W then we could have a winner

mindrobots · 2014-04-01 20:08

STOP! You guys are going crazy again.

Chip and OnSemi need to resolve the problem first.

Chip's been really good at building us this great new house. He's been getting great deals on materials so more can be added for the same prices. He's been very tolerant of some people driving a bulldozer through finished sections every so often with a suggestion for something to make it oh so much better.

Now, he's gone to the bank (OnSemi) and they've told him the finance terms have changed. You guys hear "for a longer term and a different rate, you can get more money for the same payments"...so now it's time to add the billiard room and the second gourmet kitchen in the basement and three more slots in the 6 car garage and the mechanics and the outdoor pool and maybe look for a bigger lot...even though nobody plays billiards, you mostly eat out, don't repair cars, can't swim and hate yardwork.......

Stop. Sit back. Try and get a clear vision of what the P2 should be within all the constraints (Parallax budget, time to market, increased costs (Ken's NRE) and everything else). You've added things, removed things, changed things, turned many "Propellerisms" upside down, most agreed upon by the community and accepted by Chip. At some point, it needs to stop, the Propeller2 needs to be frozen and, when ready, work begins on the Propeller3. At the current level of feature creep, I really think there won't be a Propeller2 in silicon. As it is now, no matter what it turns out to be, someone will say, "it could have had this", "if only...", "why doesn't it...?" No matter where the process stops, there will always be one more thing. What if instead of 65nm technology, you had gone with 40nm? What more could have been put in? How much faster, how much lower power? Who wants faster? Who wants lower power? Who wants faster?

Bill Henning · 2014-04-01 20:10

For comparison, Intel 180nm processors:

http://en.wikipedia.org/wiki/Coppermine_%28microprocessor%29#Coppermine

http://en.wikipedia.org/wiki/Celeron#Willamette-128
http://www.cpu-world.com/Cores/Willamette.html

Willamette 128 REALITY CHECK: 63.5-66.1 WATTS (1 core, 1.5-2.0GHz)

And you guys are worried about 5W worst case???? (8 core, aggregate 1.28GHz)

PERSONALLY, 5W is great by comparison!

Others:

http://en.wikipedia.org/wiki/PowerPC_G4#PowerPC_7445_and_7455

Tubular · 2014-04-01 20:26

Bill Henning wrote: »

P2-80 .. 80MHz, probably max 1.5W, typical 0.5W
P2-160 .. 160MHz, probably max 4W, typical 1.5W

P2-125 MHz? Just so we can have a 4 digit MIPS figure (1000 MIPS)

Perhaps CogNew could return a new error code for refusing to add another cog because the core is already too hot

Four cogs, still with 4 hardware tasks and hub execution would still be potent for most applications.

I'm curious how much current can pass through the 4 core supply pins (per core voltage rail), with double bonding etc. Could these wire bonds really stand ~1 Amp, each?

Bill Henning · 2014-04-01 20:27

Ray,

I got upset at your detailed proposal because

(1) it would require a complete re-design
(2) several of your points could not work the way you hope
(3) I don't believe pushing release off that long is viable

Personally, I'd love to see a die shrink.

Even 90nm that Ken mentioned would help the power envelope drastically - best guess, <2W

My problem is that a 90nm P2 two years from now will be too late.

I'd like to see the part currently being discussed, released at 80MHz and 160MHz, two different power envelopes.

Maybe Chip/OnSemi can drop the vCore to 1.2V ... that would help a lot.

What I don't want to see is the baby being tossed out with the bath water.

Frankly, I am not sure Parallax should invest in a die shrink at this point.

Heck, it would be better to release P2-LP40 (40MHz low power), P2-80, P2-160 as different speed grades for different power envelopes.

Those of us in the know could just reduce the multiplier to save power.

Cluso99 wrote: »

Of course Bill.
But every little bit has added up to 5W. Something has to give here.
Would you rather pull the plug on hubexec? I certainly wouldn't!

I am very interested to see what would be likely in smaller geometry such as 90nm or 65nm. If it gave the tradeoff of 400MHz 5W or 200MHz 2.5W then we could have a winner

jmg · 2014-04-01 20:28

Bill Henning wrote: »

Willamette 128 REALITY CHECK: 63.5-66.1 WATTS (1 core, 1.5-2.0GHz)
PERSONALLY, 5W is great by comparison!

Interesting the coarse W/Hz metrics are surprisingly similar here :

66W/2G = 3.3e-8
5W/160M = 3.125e-8
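For anyone who wants to re-run the comparison, here is a minimal Python sketch using only the figures quoted in this thread. The third line uses the 1.28 GHz aggregate figure (8 cogs x 160 MHz) as an alternative denominator; which clock figure is the fair one to divide by is a judgment call, and `watts_per_hz` is just an illustrative helper name.

```python
def watts_per_hz(watts, hz):
    """Coarse power-per-clock metric: watts divided by clock rate in Hz."""
    return watts / hz


willamette = watts_per_hz(66.0, 2.0e9)        # 66 W at 2 GHz, single core
p2_clock = watts_per_hz(5.0, 160e6)           # 5 W at the 160 MHz chip clock
p2_aggregate = watts_per_hz(5.0, 8 * 160e6)   # 5 W over 8 cogs x 160 MHz

print(f"Willamette:      {willamette:.3e} W/Hz")
print(f"P2 (chip clock): {p2_clock:.3e} W/Hz")
print(f"P2 (aggregate):  {p2_aggregate:.3e} W/Hz")
```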

Cluso99 · 2014-04-01 20:30

Bill,
I know you are a sw guy, but you do know better!

1. Remove the AUX - gain in space and ram power. Simplifies the design as all the hub/aux and cog/aux instructions go. More space and less power x8.
2. The cache tags will still be required - they are only a couple of bits for each WIDE block and are a separate part of the cache block anyway.
3. The LIFOs are only 18 bits wide and are simple chained flops, not ram memory. Big savings over using Aux.
4. If the block of cog ram is not used for clut, then it can be used for instructions. It is a simple mux for the instruction read bits to be sent to the ALU or the VIDEO.

A while ago Chip asked about reducing Cog ram to 256 longs. My solution is a mix where we can keep the 512 cog longs without clut, or have 256 cog longs with 256 clut longs. Perhaps it could be 3/4 cog and 1/4 clut. Cog ram is no longer the bottleneck it was due to hubexec and wide, and this improvement of wide configuration would improve upon that.

And this is a by-product of using standard verilog cells. So, provided there is no downside to using OnSemi cells, it could be a big improvement to P2 irrespective of power.

It may be that none of this pans out, but at least I am throwing it into the mix.

Cluso99 · 2014-04-01 20:34

jmg wrote: »

Interesting the coarse W/Hz metrics are surprisingly similar here :

66W/2G = 3.3e-8
5W/160M = 3.125e-8

Wow - scales well. Would be nice to put massive cooling on P2 and run it at 2G

Cluso99 · 2014-04-01 20:39

Chip,
Maybe there is some benefit from powering down some logic blocks. After all, there will be 8 such blocks, one for each cog.

Could the video blocks be powered down if not required, and likely power saving ?
Could the math instructions be powered down (mult, divide, cordic etc) if not required, and likely saving ?
Are there any other largish blocks that could benefit from this ?

Bill Henning · 2014-04-01 20:46

Ray,

Even though most of my education was in software, it was in computer science, with a healthy dose of engineering courses.

And I do know better. Which is why I back up my positions with calculations, and indicate my assumptions. I even count transistors.

In the spirit of technical discussion, please find my rebuttal, including calculations, to your points. I welcome improved calculations, and their support.

1. Remove Aux: Lose CLUT, lose X/Y stack, lose 256 longs of temporary/stack storage

2. Cache tags required per cache line:

256KB hub, addressed as 8-long lines needs 11 bits of address tags, and an 8 bit counter for LRU. And a whole passel of xor gates for comparing the tag.

32 cache lines (in 256 longs) would require 11*32 = 352 bits, plus 32 8 bit counters... roughly 352+256 = 608 flip-flops plus some random logic. FYI, a typical flip flop is 8 transistors. WELL over 2400 transistors (not including comparators)

HARDLY A COUPLE OF BITS.
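The tag estimate above can be reproduced with a short Python sketch. Every constant is taken from the figures stated in this post (11 tag bits, an 8-bit LRU counter, 8-long lines, 8 transistors per flip-flop); comparators and other random logic are left out, so this is a lower bound under those assumptions.

```python
TAG_BITS = 11            # address tag bits per line, as stated in the post
LRU_BITS = 8             # per-line LRU counter width, as stated in the post
LINES = 256 // 8         # 256 longs split into 8-long cache lines
TRANSISTORS_PER_FF = 8   # the post's "typical flip flop" figure

flip_flops = LINES * (TAG_BITS + LRU_BITS)
transistors = flip_flops * TRANSISTORS_PER_FF

print(f"{LINES} lines, {flip_flops} flip-flops, {transistors} transistors")
```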

3. LIFO vs. AUX stack

(based on 2 transistors per AUX bit, as I recall HUB is 1 transistor, AUX is 2, COG is 4)

256*32 bits AUX at two transistors per cell = 16k transistors (64 transistors per level)

18 flip-flops per stack level, 8 transistors per flip flop, = 144 transistors per level.

You could only get 113 LIFO stack levels for the transistors of the 256 longs in the CLUT.

Even if you needed 4 transistors per AUX location, you would only get 227 LIFO for 256 AUX. BAD TRADE
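The trade-off can be checked with a small Python sketch. It uses only the assumptions stated in this post: 256 longs of 32 bits in AUX, 2 or 4 transistors per AUX bit, 18-bit LIFO entries, and 8 transistors per flip-flop. `lifo_levels` is a hypothetical helper name.

```python
AUX_LONGS = 256
BITS_PER_LONG = 32
LIFO_BITS = 18           # width of one LIFO entry
TRANSISTORS_PER_FF = 8   # the post's "typical flip flop" figure


def lifo_levels(transistors_per_aux_bit):
    """How many flip-flop LIFO levels fit in the AUX transistor budget."""
    aux_transistors = AUX_LONGS * BITS_PER_LONG * transistors_per_aux_bit
    per_level = LIFO_BITS * TRANSISTORS_PER_FF
    return aux_transistors // per_level


for t in (2, 4):
    print(f"{t} transistors per AUX bit -> {lifo_levels(t)} LIFO levels")
```

Either case comes out below the 256 stack levels that the AUX RAM itself provides, which is the point being argued.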

4. Using a block of "cog ram" as "clut" requires significant additional multiplexers, and two ports - NOT ONE.

I am not convinced that it is reasonable to section off cog ram like that, I have visions of many multiplexers, and more complex, slower signal paths. But Chip would know better.

Ray,

Your solution, as near as I can tell, has numerous technical flaws, which you did not address even though I pointed them out to you in your previous thread.

What *might* work, is 256 long cog memory, 256 long CLUT. But I for one do not want to make cog memory smaller.

Cluso99 wrote: »

Bill,
I know you are a sw guy, but you do know better!

1. Remove the AUX - gain in space and ram power. Simplifies the design as all the hub/aux and cog/aux instructions go. More space and less power x8.
2. The cache tags will still be required - they are only a couple of bits for each WIDE block and are a separate part of the cache block anyway.
3. The LIFOs are only 18 bits wide and are simple chained flops, not ram memory. Big savings over using Aux.
4. If the block of cog ram is not used for clut, then it can be used for instructions. It is a simple mux for the instruction read bits to be sent to the ALU or the VIDEO.

A while ago Chip asked about reducing Cog ram to 256 longs. My solution is a mix where we can keep the 512 cog longs without clut, or have 256 cog longs with 256 clut longs. Perhaps it could be 3/4 cog and 1/4 clut. Cog ram is no longer the bottleneck it was due to hubexec and wide, and this improvement of wide configuration would improve upon that.

And this is a by-product of using standard verilog cells. So, provided there is no downside to using OnSemi cells, it could be a big improvement to P2 irrespective of power.

It may be that none of this pans out, but at least I am throwing it into the mix.

potatohead · 2014-04-01 20:52

That solution is simply a different CPU. We need to finish this CPU, then we can discuss the next CPU.

jmg · 2014-04-01 20:55

Bill Henning wrote: »

I'd like to see the part currently being discussed, released at 80MHz and 160MHz, two different power envelopes.
Maybe Chip/OnSemi can drop the vCore to 1.2V ... that would help a lot.

This seems the safest pathway from where we are now.

I think it will pay to run some numbers on differing Vcc, and develop a Vcc/MHz/W set of data points.

There may be multiple packages choices for the different thermal points ? - such as heat slug on top like this
http://upload.wikimedia.org/wikipedia/en/thumb/a/ac/Sega_304-pin_QFP.jpg/160px-Sega_304-pin_QFP.jpg
