We're looking at 5 Watts in a BGA!

Cluso99 · 2014-04-01 20:56

Bill Henning wrote: »

Ray,

Even though most of my education was in software, it was in computer science, with a healthy dose of engineering courses.

And I do know better. Which is why I back up my positions with calculations, and indicate my assumptions. I even count transistors.

In the spirit of technical discussion, please find my rebuttal, including calculations, to your points. I welcome improved calculations, and their support.

1. Remove Aux: Lose CLUT, lose X/Y stack, lose 256 longs of temporary/stack storage

2. Cache tags required per cache line:

256KB hub, addressed as 8-long lines needs 11 bits of address tags, and an 8 bit counter for LRU.

32 cache lines (in 256 longs) would require 11*32 = 352 bits, plus 32 8 bit counters... roughly 352+256 = 608 flip-flops plus some random logic. FYI, a typical flip flop is 8 transistors.

HARDLY A COUPLE OF BITS.

3. LIFO vs. AUX stack

(based on 2 transistors per AUX bit, as I recall HUB is 1 transistor, AUX is 2, COG is 4)

256*32 bits AUX at two transistors per cell = 8k transistors (64 transistors per level)

18 flip-flops per stack level, 8 transistors per flip flop, = 152 transistors per level.

You could only get 53 LIFO stack levels for the transistors of the 256 longs in the CLUT.

Even if you needed 4 transistors per AUX location, you would only get 106 LIFO for 256 AUX. BAD TRADE

4. Using a block of "cog ram" as "clut" requires significant additional multiplexers, and two ports - NOT ONE.

I am not convinced that it is reasonable to section off cog ram like that, I have visions of many multiplexers, and more complex, slower signal paths. But Chip would know better.
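
As a rough check on the tag-storage estimate in point 2, the arithmetic can be written out. This is only a sketch: it takes the post's figures (32 lines, 11 tag bits and an 8-bit LRU counter per line, 8 transistors per flip-flop) as given, and the variable names are made up for illustration.

```python
# Tag and LRU storage for a 32-line cache, using the figures quoted in point 2.
CACHE_LINES = 32           # 256 longs / 8 longs per line
TAG_BITS = 11              # address tag bits per line (figure from the post)
LRU_BITS = 8               # LRU counter bits per line (figure from the post)
TRANSISTORS_PER_FLOP = 8   # "typical flip flop" figure from the post

tag_flops = CACHE_LINES * TAG_BITS
lru_flops = CACHE_LINES * LRU_BITS
total_flops = tag_flops + lru_flops
total_transistors = total_flops * TRANSISTORS_PER_FLOP

print(tag_flops, lru_flops, total_flops, total_transistors)
```

The count covers storage flip-flops only; the comparators and "random logic" mentioned in the post are not modelled.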

Ray,

Your solution, as near as I can tell, has numerous technical flaws, which you did not address even though I pointed them out to you in your previous thread.

What *might* work, is 256 long cog memory, 256 long CLUT. But I for one do not want to make cog memory smaller.

Not worth arguing Bill. BTW you should patent your 1bit memory cell - it's worth millions!

jmg · 2014-04-01 21:09

Cluso99 wrote: »

Chip,
Maybe there is some benefit from powering down some logic blocks. After all, there will be 8 such blocks, one for each cog.

Could the video blocks be powered down if not required, and what is the likely power saving?
Could the math instructions be powered down (mult, divide, cordic etc) if not required, and what is the likely saving?
Are there any other largish blocks that could benefit from this?

I think Chip is already doing much of this, as well as Clock Gating.
The capacitive power formula (Cpd * Ft * Vcc^2) is based on the sum of NODES TOGGLING, so logic that is not toggling draws no Cpd power.

The clock tree draws power, but sections of that are gated, and I think there is scope to gate the COG itself, so that not all COGS are forced to run at the same effective MHz.
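
A minimal sketch of that formula, to show why gating matters: only nodes that toggle contribute, so a clock-gated block drops out of the sum. The function name, the capacitance values and the 1.8 V supply are all hypothetical, chosen only for illustration.

```python
def dynamic_power(nodes, vcc):
    """Sum Cpd * Ft * Vcc^2 over nodes; each node is (cpd_farads, toggle_hz)."""
    return sum(cpd * ft * vcc ** 2 for cpd, ft in nodes)

# Hypothetical blocks: (effective capacitance in farads, toggle rate in Hz).
running = [(1e-9, 160e6), (2e-9, 160e6)]
gated = [(1e-9, 160e6), (2e-9, 0.0)]   # second block clock-gated, so Ft = 0

VCC = 1.8  # assumed core voltage, for illustration only

print(dynamic_power(running, VCC))
print(dynamic_power(gated, VCC))
```

Because the frequency term is linear, halving a block's effective clock halves its contribution, while the Vcc term is squared.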

Cluso99 · 2014-04-01 21:09

Here are the sizes from an old thread
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223695&viewfull=1#post1223695

DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm

So, presume cog RAM takes the same area when divided into 2 identical 256-long blocks (or 4 x 128-long blocks). Then add a small amount to make WIDE 8-long access (currently single long). Now add 1 multiplexer per block and a 1- or 2-bit register to select whether the instruction read pathway goes to the CLUT video input or the cog ALU. Add an instruction to set the 1-2 bit register.
Now we have saved a whole AUX block plus its power. And x8.

IIRC SRAMs use 6 transistors per cell, so 6*(256*32)*8 = 393,216 transistors = ~400K transistors saved by removing AUX.
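
The same count written out step by step, as a sketch. It assumes the 6-transistor cell recalled above and counts storage cells only (no decoders, sense amps or port logic); the names are illustrative.

```python
TRANSISTORS_PER_CELL = 6   # typical 6T SRAM cell, as recalled in the post
AUX_LONGS = 256            # longs of AUX RAM per cog
BITS_PER_LONG = 32
COGS = 8

aux_bits_per_cog = AUX_LONGS * BITS_PER_LONG
transistors_per_cog = TRANSISTORS_PER_CELL * aux_bits_per_cog
transistors_total = transistors_per_cog * COGS

print(aux_bits_per_cog, transistors_per_cog, transistors_total)
```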

Now we can save some instructions that access AUX...
WRAUX, WRAUXR
We will probably want to keep PTRX, PTRY, and the associated PUSHx/POPx/CALLx/RETx but they would probably share some instruction logic with similar instructions.

Bill Henning · 2014-04-01 21:11

Memory served me wrong: I mis-remembered 1 transistor for single port, 2 for dual port, 4 for cog. I think I was thinking of DRAM cells (1T 1C).

FYI, here is a two transistor sram cell: http://www.google.com/patents/US6088259

I stated in my message that I could be wrong - as my wife often reminds me, my memory is not great. Looks like I was wrong, and I admit it.

I looked it up - six FETs is typical for sram, but there are four transistor sram cells as well.

My point is, I do calculations, and I admit when I am wrong.

Cluso99 wrote: »

Not worth arguing Bill. BTW you should patent your 1bit memory cell - it's worth millions!

Cluso99 · 2014-04-01 21:24

Here is another thought...

Could we ditch the hubexec instruction cache ?

If we changed the hub slots to take 9 clocks, being 1 for each cog, and 1 for sharing.

Then hubexec mode could use its clock to fetch the next wide from hub. This would use the 9th clock, then be able to execute 8 instructions (provided no jmp/call instructions). The additional 9th clock could then be used in a round robin sharing but on a simple priority basis. This would alleviate to some extent hubexec mode requiring data access to the hub.

Hubexec would now be a bit slower because of wasting 1 clock in every 9 clocks for fetching the next instruction block. Obviously we have a stall every time there is a jump/call outside the current block.
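
To put a number on "a bit slower", here is a small sketch of the best case. It assumes one clock per instruction and straight-line code, so it ignores the stalls on jumps and calls outside the current block; the function name is made up.

```python
def hubexec_efficiency(instructions_per_fetch=8, fetch_clocks=1):
    """Fraction of clocks spent executing rather than fetching the next wide."""
    return instructions_per_fetch / (instructions_per_fetch + fetch_clocks)

# 8 instructions executed for every 9 clocks.
print(f"{hubexec_efficiency():.3f}")
```

Under these assumptions about 89% of clocks do useful work, so a cog would run straight-line hubexec code at roughly 8/9 of its native rate.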

jmg · 2014-04-01 21:40

Cluso99 wrote: »

Here is another thought...

Could we ditch the hubexec instruction cache ?

I'm not quite following the direction here - removing small slices of logic, will have minimal impact on power, but will incrementally save die area.

So these suggestions are more what would be expected if the device were too large physically.

Dynamic power is controlled by things like clock tree design, Clock Gating, and varying Vcc.

KeithE · 2014-04-01 21:42

Chip - you might want to look at driving down to DAC sometime just to be aware of what tools are out there - http://dac.com (Nothing there will be cheap ;-( )

Have you ever looked into what eASIC (http://www.easic.com/) offers? I haven't looked into them since they aren't suitable for what I do, but they used to be down the street from us.

cgracey · 2014-04-01 21:49

Cluso99 wrote: »

Chip,
Maybe there is some benefit from powering down some logic blocks. After all, there will be 8 such blocks, one for each cog.

Could the video blocks be powered down if not required, and what is the likely power saving?
Could the math instructions be powered down (mult, divide, cordic etc) if not required, and what is the likely saving?
Are there any other largish blocks that could benefit from this?

When a cog is off, a CTR is off, a VID is off, etc., they receive no clock and they don't toggle. So, power consumption is much lower than these worst-case numbers of several Watts. There's not much else to gate off, as it is. Right now, ~90% of flipflops are already clock-gated.

cgracey · 2014-04-01 21:54

KeithE wrote: »

Chip - you might want to look at driving down to DAC sometime just to be aware of what tools are out there - http://dac.com (Nothing there will be cheap ;-( )

Have you ever looked into what eASIC offers? I haven't looked into them, but they used to be down the street from us.

I haven't looked at eASIC, but I will. One of the two big EDA vendors said we could get a Verilog-to-GDSII flow for about $300k/year - which was 50% off.

jmg · 2014-04-01 21:55

cgracey wrote: »

When a cog is off, a CTR is off, a VID is off, etc., they receive no clock and they don't toggle. So, power consumption is much lower than these worst-case numbers of several Watts. There's not much else to gate off, as it is. Right now, ~90% of flipflops are already clock-gated.

Since you have ~ 90% clock gating there already, can this be user-dynamic controlled per COG, so as to allow one cog to run at 160MHz, whilst others can choose 80MHz or 40MHz. - That could make typical cases much lower power ?

The gating for each COG would be HUB slot aligned.

cgracey · 2014-04-01 21:56

I don't think stripping down the current Prop2 is a good idea, unless we did something like go to four cogs, since the architecture has grown to accommodate all the features, like hub exec, with special instructions, etc. It's a viable design. It just needs the right process.

cgracey · 2014-04-01 22:02

jmg wrote: »

Since you have ~ 90% clock gating there already, can this be user-dynamic controlled per COG, so as to allow one cog to run at 160MHz, whilst others can choose 80MHz or 40MHz. - That could make typical cases much lower power ?

The gating for each COG would be HUB slot aligned.

That could certainly be done. Each cog could receive an extra clock-gate qualifier signal that it would use throughout.

Cluso99 · 2014-04-01 22:05

Chip,
any ideas of the static power usage (leakage current)?

RossH · 2014-04-01 22:06

John Abshier wrote: »

I would go for option B. Make the design trajectory bend toward the P1 and not bend toward ARM. If you get 160MHz with pipelined instructions, that is almost an order of magnitude faster than P1. I would hate to lose the functionality planned for the IO pins. I do worry about cost. A PII, breakout board, power supply, RAM, and passives looks like it may cost more than a Raspberry Pi. Five watts is perhaps a killer for Activity Bot/Scribbler class robots. At that power level is USB power still an option?

John Abshier

Agree 100% - go with option B. Lose the cog multithreading, the HUBEXEC and a few of the new I/O pin complexities. Or rather don't lose them, but save them for the P3 and free up the overburdened P2 till Parallax can afford the additional production costs associated with the 65nm (or smaller) manufacturing processes.

For most of the practical applications for which the P2 should be the solution of choice, these features can be catered for perfectly adequately in software (yes, you lose the linear execution speed for executing high-level languages, and so we probably won't be able to run Linux - but honestly, anyone who wanted these things should really just go buy a Raspberry Pi!)

Much as I'd like to play with all these fancy features myself, I have thought for a while that the P2 is in danger of becoming an "also-ran" micro-processor rather than a "heavy hitting" micro-controller. Or, to put it another way - it is in danger of becoming a white elephant rather than a ferret up the trouser legs of the Arduweenies, Pi-eaters and X-men.

All we really ever needed for the P2 over the P1 was a faster clock speed, more I/O pins and 256k internal hub RAM. And the real shame is that we could have had all this a year ago.

Cue the flaming Zeppelin!

Ross.

rogloh · 2014-04-01 22:17

cgracey wrote: »

I don't think stripping down the current Prop2 is a good idea, unless we did something like go to four cogs, since the architecture has grown to accommodate all the features, like hub exec, with special instructions, etc. It's a viable design. It just needs the right process.

+1
This is good to hear, Chip. Original manufacturing process now needs careful consideration, P2 design itself is still very nice.

cgracey · 2014-04-01 22:23

Cluso99 wrote: »

Chip,
any ideas of the static power usage (leakage current)?

At 180nm, it would probably be about 1mA. At 65nm it might be 1/3 of the total power.

User Name · 2014-04-01 22:30

RossH wrote: »

All we really ever needed for the P2 over the P1 was a faster clock speed, more I/O pins and 256k internal hub RAM. And the real shame is that we could have had all this a year ago.

I'd add to this list 'internal pins' (practically a freebie) and at least some of the analog features of P2 external pins.

Still, finding a way to fund a 65nm P2 seems like the best way forward. Maybe I'm a Pollyanna, but I'm encouraged that OnSemi seemed to suggest to Chip that 65 nm wasn't wildly expensive. Certainly cutting edge is always more $$$. 65nm might be the sweet spot of the bang-for-buck curve.

RossH · 2014-04-01 22:37

User Name wrote: »

I'd add to this list 'internal pins' (practically a freebie) and at least some of the analog features of P2 external pins.

Yes, absolutely. I don't want to lose all the new pin functionality - just enough to make the P2 feasible!

jmg · 2014-04-01 22:54

User Name wrote: »

Still, finding a way to fund a 65nm P2 seems like the best way forward. Maybe I'm a Pollyanna, but I'm encouraged that OnSemi seemed to suggest to Chip that 65 nm wasn't wildly expensive. Certainly cutting edge is always more $$$. 65nm might be the sweet spot of the bang-for-buck curve.

Sometimes, the passage of time can work to your advantage

Cluso99 · 2014-04-01 23:38

cgracey wrote: »

At 180nm, it would probably be about 1mA. At 65nm it might be 1/3 of the total power.

I want my cake and eat it too!

1mA would scale nicely but 65nm has good benefits too.

Problem is 5W at 1.2V is ~4A ! However I think most of us will never actually run the P2 using that much power in real life.
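
The current figure is just I = P / V. A quick sketch, using the 5 W and 1.2 V values from the post (the function name is illustrative):

```python
def supply_current(power_w, voltage_v):
    """Supply current in amps for a given power draw and rail voltage."""
    return power_w / voltage_v

print(round(supply_current(5.0, 1.2), 2))  # 4.17
```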

Peter Jakacki · 2014-04-01 23:48

Sad to hear. Can't we just get back to what we were all wanting in the first place with P2, that is, more pins, more memory, more speed? I'm really squeezing P1 at present and just need these basic improvements; the rest is icing, but what is icing without a cake? Wouldn't the design you originally submitted back in 2012 work if you fixed any errors? This would give us all some breathing space while we wait for P3.

WBA Consulting · 2014-04-02 02:18

Kerry S wrote: »

BGA is not an issue with Parallax making plug in modules.

Agreed, +1

With a proper SOM design (System on Module), utilizing the P2 will not be difficult at all. Whether using a DIMM layout or a Castellated module, it would be very easy to adapt by end users. The castellated version can also be handsoldered if kept to a decent pitch.

Baggers · 2014-04-02 03:59

Having had time to think about it, 5W isn't a deal breaker, and if people don't want that power, then they can easily lower the clock speed, it is a propeller after all, and easily changeable.

I for one will be more than happy with the way the prop is at the moment! Yes, the 65nm or even 90nm would give us more speed/memory/less power, which would be nice, but it doesn't alter the fact that the current chip (even with us running it at 80MHz) is an awesome chip; clocked at its target 160MHz it would practically quadruple what we have now performance-wise, as it's almost double the cogs and double the speed.

And with that in mind, the current state P2 at 180nm is going to be a winner in my eyes! plus it's easier and quicker to market!

Releasing it stating it clocks at 80Mhz is still gonna be awesome, yet has the power to really overclock it up to 160Mhz but uses more Watts.

When you take into account what we have even at 80Mhz, it's 4* performance of P1, yet we also have ability to have 4 threads, and a far better instruction set, and maths, so still way outperforms the P1!

So what we have at the moment is still releasable in my eyes, but that's just my thoughts on it!

But ultimately, it's really down to just two guys, who, whatever option they choose, I'm sure I speak for everyone in this forum, when I say "Will have the backing of all of us!"

Chip and Ken now both need to take a day or two and clear their minds of the situation, take a family trip out, picnic in a park, go karting, or anything, and totally not think about anything P2 related, come back with fresh clear minds, giving their subconscious minds time to run a few decision simulations also, and then weigh up the pros and cons of each route to take.

Kerry S · 2014-04-02 04:54

cgracey wrote: »

I don't think stripping down the current Prop2 is a good idea, unless we did something like go to four cogs, since the architecture has grown to accommodate all the features, like hub exec, with special instructions, etc. It's a viable design. It just needs the right process.

Agreed on current Prop2 is where it should be. Do not agree on dropping to 4 cogs as some of us NEED those cogs to do what we plan to do.

Maybe just a little Marketing 101 here: Officially derate the chip to 80MHz, which gives a much lower spec sheet power consumption, then brag about how there is 'tons of room to overclock!' and just give a power curve chart so the guys that WANT to overclock can plan for the needed power/heat. You could even extend the chart down to say 20MHz (Prop1 effective speed) to show the low power guys what they can do with further clock reductions and careful cog usage.
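
A first-order version of such a chart can be sketched as below. It assumes the 5 W worst case applies at 160MHz, that power scales linearly with clock, and that static leakage is negligible; none of these is a measured figure, and the names are hypothetical.

```python
REF_POWER_W = 5.0     # worst-case figure discussed in the thread
REF_CLOCK_MHZ = 160   # assumed clock at which that figure applies

def estimated_power(clock_mhz):
    """Linear dynamic-power estimate; ignores leakage and voltage changes."""
    return REF_POWER_W * clock_mhz / REF_CLOCK_MHZ

for mhz in (20, 40, 80, 160):
    print(f"{mhz:>4} MHz  ~{estimated_power(mhz):.2f} W")
```

A real chart would come from measurement, since cog usage and lower core voltage at reduced clocks would pull the numbers down further.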

I don't see the need to panic and toss out all of this hard and amazing design work at this point. After all this is a powerful MicroController not a cell phone SOC we are talking about...

RossH · 2014-04-02 05:32

potatohead wrote: »

We need to finish this CPU, then we can discuss the next CPU.

I disagree (would you expect otherwise?).

The problem is that Parallax need to finish the last CPU - you know, the original P2 (i.e. a P1 with more RAM, more I/O pins and more speed) before they finish the next CPU.

If Parallax released a 200MHz version of the P1, with 64 I/O pins, 256 Kb Hub RAM and just a few additional bells and whistles (ADCs etc), then the P3 could then take its time and come along later to blow us all away. In the meantime we'd all be out creating all those applications that just won't quite fit on the P1 - and shelling out money on P2's to fund it!

Ross.

EDIT: Ah! I see that while I was writing this, Peter Jakacki just posted virtually the same idea! Maybe we should take a vote!

ctwardell · 2014-04-02 06:00

RossH wrote: »

I disagree (would you expect otherwise? ).

The problem is that Parallax need to finish the last CPU - you know, the original P2 (i.e. a P1 with more RAM, more I/O pins and more speed) before they finish the next CPU.

If Parallax released a 200MHz version of the P1, with 64 I/O pins, 256 Kb Hub RAM and just a few additional bells and whistles (ADCs etc), then the P3 could then take its time and come along later to blow us all away. In the meantime we'd all be out creating all those applications that just won't quite fit on the P1 - and shelling out money on P2's to fund it!

Ross.

EDIT: Ah! I see that while I was writing this, Peter Jakacki just posted virtually the same idea! Maybe we should take a vote!

I agree with potatohead, this chip needs to be finished. I do agree that the chip you mentioned should have happened, at least a few years ago, but we are where we are.

Expectations have been set and a lot of time has been spent by Parallax and forum members on the latest design. Right or wrong, forum input has been incorporated by Parallax and they have decided to run with that concept of openness. To toss that design as a knee jerk reaction to a preliminary power assessment would IMHO do considerable damage to a community that already seems to be fragmented.

IMHO it is best to stay on the current course and finish the design from the logic point of view so FPGA testing can continue. If the power consumption at 180nm is acceptable, possibly at lower clock speeds, then that is the fastest path to market.
A 65nm version with lower consumption and higher clock rates can follow up in the future.

C.W.

Bill Henning · 2014-04-02 06:04

Ross,

Removing those items would not change the power envelope noticeably as they take too small a fraction of the total gates to matter.

Lowering clock frequency, removing cogs, lowering vCore - those make the difference.

Of the suggestions so far, the one I really like is setting the multiplier (of xin) for each cog individually; that would definitely help.

RossH wrote: »

Agree 100% - go with option B. Lose the cog multithreading, the HUBEXEC and a few of the new I/O pin complexities.

Brian Fairchild · 2014-04-02 06:10

SRLM wrote: »

...the Atmel SAM4S ARM processor consumes 0.1W at 120MHz, and it's a full ARM core. That's fifty ARMs or one 8 core Propeller2?

Kerry S wrote: »

How many P2 applications are really going to be battery powered?

Probably not many battery P2 applications BUT, like it or not, uA/MHz is a hugely important metric and it'd be a brave company that goes against the trend for lower power/more powerful devices.

RossH wrote: »

...the P2 is in danger of becoming an "also-ran" micro-processor rather than a "heavy hitting" micro-controller...

Agreed, +1.

Whilst we can argue all day about the line between 'controller' and 'processor', once you get into serious thermal management issues it becomes a lot more difficult to design the chip in and you move very much into 'processor' territory.

WBA Consulting wrote: »

With a proper SOM design (System on Module), utilizing the P2 will not be difficult at all. Whether using a DIMM layout or a Castellated module, it would be very easy to adapt by end users.

Except, to get heat away from a package you need PCB copper and neither of those modules have that luxury.

Kerry S wrote: »

...Officially derate the chip to 80MHz, which gives a much lower spec sheet power consumption, then brag about how there is 'tons of room to overclock!'...

So you end up with a crippled chip? Who designs real products based on overclocking?

Ramon · 2014-04-02 06:33

cgracey wrote: »

I haven't looked at eASIC, but I will. One of the two big EDA vendors said we could get a Verilog-to-GDSII flow for about $300k/year - which was 50% off.

Chip, I know a place in Taiwan that offered (last year) a two-month training course for IC Layout at 1/100th that price. They used Synopsys EDA and Faraday-tech cells for UMC process.

I don't know, but maybe in the US there are similar programs.

I think that you would prefer Mexico's beaches and 'Tortillas' to Taiwan, but this is another option. You would have a two-month period to get the design ready for a shuttle.

potatohead · 2014-04-02 06:47

@Ross, of course! I wouldn't have it any other way! (BTW: I think we agree on performance after all. Yes, hand PASM will peak perform more using mooch, meaning you are correct, but I was referring to the majority use case, numbers essentially, in that more people will simply see a boost with compiled code, due to the number of hand PASM coders we are likely to see, but this realization took a while. )

@Baggers: Yeah, exactly my thoughts at present.

@All: I suppose it comes down to whether or not this design being produced would fund ongoing work. A while back, we all came to a realization, and that realization was there could be a business in this FPGA design adventure. That realization was predicated on the P2 actually selling, because if it does sell in sufficient numbers to fund ongoing work, we then could take on a lot of different niche tasks with the Propeller idea or way of doing things, nail 'em and get products out there people can use.

This is what everybody wants, whether or not they realize it. Getting products out there people can, will, want to use.

So will people want to use this current design at its current power consumption or not?

To me, that's the core question, with the qualifier: can, will, want to use in numbers sufficient to fund future designs?

Let's have that discussion briefly, please. It matters. And it matters, because we are in a place where this design could go, and be on schedule, etc... Stepping away from that has to be due to a failure condition, not some preferences, because we need to fund the future. Seriously.

Share your thoughts on who, how this chip may be adopted outside our niche of players.
