We're looking at 5 Watts in a BGA!

Heater. · 2014-04-01 15:38

Who cares?

Even if hubexec is limited to 256K can't we use anything over for data?

Bill Henning · 2014-04-01 15:46

TRUE!

Very good point Heater.

Heater. wrote: »

Who cares?

Even if hubexec is limited to 256K can't we use anything over for data?

jmg · 2014-04-01 15:51

Bill Henning wrote: »

If I correctly remember

- power usage goes up by the square of the frequency
- other than leakage current, power usage is based on transistors changing state

Most micros spec uA/MHz as their metric, which is a linear-frequency law.

You can get more savings with lower MHz, if you also lower the Vcc - now you have a lower fMAX, but uA/MHz is lower, and Vcc is lower so power is less than the same ratio MHz drop at fixed Vcc.

Another chip power model, is power dissipation capacitance.
This is a virtual value, that combines with Fs and Vcc^2 to give power ( Cpd * Fs * Vcc^2)
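As a rough illustration of the power-dissipation-capacitance model above, here is a minimal Python sketch. The Cpd value, the 1.8V supply, and the function name are assumptions picked for illustration only; they are not figures from OnSemi or Parallax.

```python
def dynamic_power_watts(cpd_farads, fs_hz, vcc_volts):
    """Dynamic power from the Cpd model: P = Cpd * Fs * Vcc^2."""
    return cpd_farads * fs_hz * vcc_volts ** 2

# Hypothetical chip: Cpd chosen so that 160MHz at 1.8V lands near 5W.
CPD = 9.6e-9  # farads, illustrative only

full = dynamic_power_watts(CPD, 160e6, 1.8)
half_clock = dynamic_power_watts(CPD, 80e6, 1.8)
print(f"160MHz @ 1.8V: {full:.2f} W")
print(f" 80MHz @ 1.8V: {half_clock:.2f} W")
```

At a fixed Vcc the model is linear in frequency, which matches the uA/MHz metric; leakage is not included.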

jmg · 2014-04-01 15:52

Heater. wrote: »

Who cares?

Even if hubexec is limited to 256K can't we use anything over for data?

Does OnSemi have a DRAM cell - how much of that would fit at 65nm ?

potatohead · 2014-04-01 15:59

If we could pile in a batch of data only RAM, I'm totally for that idea.

Actually, I'm for any reasonable idea that gets us chips that work first and early.

Rayman · 2014-04-01 16:09

Long as it doesn't need a fan,
I don't really care about power...

Bill Henning · 2014-04-01 16:11

Agreed.

Re-reading Chip's original post, the 5W seems more likely to be 3.5W (he mentioned 30% less) with all eight cogs going full blast. That's not very likely.

I would be very happy to get the P2, as currently discussed, in 180nm @ 3.5-5W full blast, as soon as possible.

Frankly, I doubt most applications would draw more than 1W to 2W.

Anyone that needs lower power need only run at a lower clock speed, with fewer cogs active.
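The arithmetic behind the 3.5W figure can be checked with a short Python sketch. The helper name is hypothetical, and the 4-6W range with a one-third saving is taken from Chip's post as quoted elsewhere in this thread.

```python
def after_reduction(watts, fraction_saved):
    """Power remaining once a given fraction has been saved."""
    return watts * (1.0 - fraction_saved)

# Bill's reading: 30% less than 5W.
bill_estimate = after_reduction(5.0, 0.30)

# Chip's quoted range: 4-6W for the core, dropping by about 1/3 with clock gating.
low, high = (after_reduction(w, 1 / 3) for w in (4.0, 6.0))

print(f"5W less 30%: {bill_estimate:.2f} W")
print(f"4-6W less 1/3: {low:.2f} W to {high:.2f} W")
```

Neither figure includes memory or I/O power, which Chip notes could be significant.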

potatohead wrote: »

If we could pile in a batch of data only RAM, I'm totally for that idea.

Actually, I'm for any reasonable idea that gets us chips that work first and early.

Bill Henning · 2014-04-01 16:12

jmg,

Thank you, appreciate the info!

jmg wrote: »

Most micros spec uA/MHz as their metric, which is a linear-frequency law.

You can get more savings with lower MHz, if you also lower the Vcc - now you have a lower fMAX, but uA/MHz is lower, and Vcc is lower so power is less than the same ratio MHz drop at fixed Vcc.

Another chip power model, is power dissipation capacitance.
This is a virtual value, that combines with Fs and Vcc^2 to give power ( Cpd * Fs * Vcc^2)

JRetSapDoog · 2014-04-01 16:25

Chip, why wasn't "Pursue a finer fabrication process" Option A (or at least among the options)? I think you were being humble/conservative. Yes, I realize that the short answer is cost and that you hadn't talked with OnSemi about it yet. But surely you don't want to "throw out" much of the hard work of the last several months. And about some of the chatter about going to four cogs, gee, for me, that's just not a full-fledged Propeller, even with HubExec (but never say never, I suppose). The ongoing design work has taken us (you/Parallax) in the direction of a finer process and it seems that there's no way around it. Maybe embrace it (if it's embraceable without unjustifiably mortgaging your future).

Anyway, if going to a finer process, I'm wondering if the current design would be compatible (given some modifications) with having eight "HUB-like" RAM areas of 128KB (256KB?), wherein access to those RAMs could be bank-switched to the (hopefully) eight cogs, as opposed to dedicated to the cogs on a one-to-one basis (though the latter might be worth considering if the internal bus wiring was much simpler, etc.). It seems like the current design would be compatible given the current addressing mechanisms. And maybe the round-robin HUB access could be revisited or made more user-configurable. But maybe I'm the one now making the Propeller not so "Propellerish." Anyway, thinking out loud.

jmg · 2014-04-01 16:40

Bill Henning wrote: »

Agreed.

Re-reading Chip's original post, the 5W seems more likely to be 3.5W (he mentioned 30% less) with all eight cogs going full blast. That's not very likely.

I would be very happy to get the P2, as currently discussed, in 180nm @ 3.5-5W full blast, as soon as possible.

Frankly, I doubt most applications would draw more than 1W to 2W.

Anyone that needs lower power need only run at a lower clock speed, with fewer cogs active.

Lower Vcc is also a choice, if you do not want 160MHz peaks.

It sounds like P2 already uses clock gating, but there could be room for a little more on a COG basis, that would allow time-polling opcodes to run to full resolution, but a lower overall execution speed. This is what P1 does now.

E.g. a "P1 gearing" based clock-gate approach would resolve time to 160MHz, but optionally run other opcodes at a power-slashing 40MHz.

Ken Gracey · 2014-04-01 17:01

cgracey wrote: »

What we can do at this point:

A) Continue on the current trajectory.

B) Pare down the design, jettisoning all kinds of features, like hub exec, and get back to something much smaller and maybe faster. The current FPGA implementation could be further honed and later, hopefully, made into a chip using a smaller technology.

C) Drop the current design to four cogs, which would also reduce cache sizes and hub memory down to 128KB. This would also allow us to shrink the die considerably, as we could change the I/O pin aspect ratio to allow them to fit together more densely, occupying more of what was needed for the core. This would also mean the whole chip would fit on an FPGA.

D) Retire to an opium den.

E) Other ideas?

Chip,

With the numerous design meetings and quotes we've developed around 180nm there has been no discussion about 90nm costs, so I have no constructive input in this regard without more information. We would need to know what further NRE, synthesis, layout and integration of the I/O to the core would be required to go to 90nm. Does it fully leverage our existing efforts or are there redesigns between you and Beau that require additional R&D time (and expense)? Time to market is an entirely different factor and probably more important at this stage. Depending on what 90nm really means for your design time and costs, it could be a case for KickStarter or similar funding mechanism.

While we can't go back, maybe we could choose the best path in consideration of (1) Customer requests, which have primarily been A/D, more RAM, faster, code protect, and more I/Os; and (2) Time to market. And keep in mind that your best products became complicated before they got simple again. If some features of P2 seem complicated to forum members (and these are skilled users who can exploit all features), maybe it's time to get back to something more simple.

But to compare (B) and (C) we'd need to understand how each alternative fares in regards to:

- design time (time to market), for your Verilog and manual layout by Beau
- synthesis costs, if any different
- any benefits in regards to testing the synthesized core with the manual layout and
- customer preference, given a choice

Not knowing these important pieces, I'm probably leaning towards (C) for a couple of reasons: maximizes C compiler performance with hubexec; provides plenty of cogs for most applications (production uses of P1 have limited use for video, so if it's a cog-eater in P2 this is something to consider); it fits in the whole FPGA; seems simple; and maybe the extra pre-emptive multi-tasking stuff stretches four cogs to a high potential anyway. Some customers have asked for more cogs than present in P1, but I believe even more customers in the beginning felt they didn't need all the power of P1's cogs.

If we can put the desire for features aside for a minute and understand the actual costs and time required to achieve (B) or (C) the answer would become clear to us, I think.

Here to help, so let me know what you need to help get this accomplished, even if it's a month in Mexico.

Ken Gracey

Tubular · 2014-04-01 17:03

I've a bit of experience with fanless 5 to 10 watt designs.

There are several "low power" embedded, fanless single board computers around the 10 watt mark. Most of the time these use a large heat spreader / heat sink, but don't generally require a fan. If you put them in a tightly sealed metal box, no vents, it can pay to have a fan just to move the air around to get even dissipation with fewer hot spots. Those boxes do get warm but it can work fine, from an overall engineered solution point of view. But these are generally 'bigger' boxes, not your credit card sized computer

Where it may get difficult is dealing with that kind of heat on say the little plug in PCIE based card Parallax talked about some time ago. Yes, having to deal with such problems can dissuade users early on, as Chip correctly notes.

rogloh · 2014-04-01 17:09

cgracey wrote: »

OnSemi has been doing some prep work and their initial power estimation for the core came back at 4-6W. This is without clock-gating, so when they get their synthesis tools to properly infer clock-gating, power will drop by probably 1/3. This is still huge, and doesn't account for memory and I/O power, which could be significant. My gut feeling is that we can get the power down to 5W and not need cooling by using a 17x17mm BGA package.

This all seems outrageous, I know. I think we are being very ambitious for a 180nm process. This level of design complexity really needs 90nm, or less, to be practical. 40nm would be great - low power and GHz+ speed, but we don't have the budget for that.

I don't know what level of enthusiasm there might be for a 5W BGA Prop2 that costs $12. I think the 5W is worse than $12.

What we can do at this point:

A) Continue on the current trajectory.

B) Pare down the design, jettisoning all kinds of features, like hub exec, and get back to something much smaller and maybe faster. The current FPGA implementation could be further honed and later, hopefully, made into a chip using a smaller technology.

C) Drop the current design to four cogs, which would also reduce cache sizes and hub memory down to 128KB. This would also allow us to shrink the die considerably, as we could change the I/O pin aspect ratio to allow them to fit together more densely, occupying more of what was needed for the core. This would also mean the whole chip would fit on an FPGA.

D) Retire to an opium den.

E) Other ideas?

Not good news but I guess these things happen. 5W might make it the most power hungry microcontroller I know... but to me it would be a real shame to start hacking it down. It does seem now from what you say that 180nm was probably a very optimistic target for what the P2 has become. The BGA package is also not great and may make boards more expensive/difficult to manufacture although I don't really know how bad BGAs are or the layer requirements it imposes compared to flat packs as I don't design PCBs for a living.

Personally I would probably either:

1) stick to it as is and live with the outcome, which may or may not be quite as good as hoped (TBD), or

2) investigate the migration to a smaller process size and see if it is in any way feasible financially for Parallax. I wonder how the economics compare when you factor in the resulting chip price, final device frequency, power usage, and how this increases the P2 market. You do at least get some additional benefits beyond reduced power usage even though the upfront costs might be higher and take longer if some 180nm design work has been lost...

I'm really hoping the second option might pan out for you somehow as I think it could help make P2 even more competitive....wishing you all good luck there!

Roger.

rjo__ · 2014-04-01 17:29

This has been a heck of a day.

Wouldn't halving the number of cogs also allow doubling the bandwidth of the remaining 4 cogs?

That is a fairly attractive change that should be considered on its own merits… April fool's day or not. Wouldn't you rather have a cog that has double the native bandwidth? I would. I have yet to build a system that runs out of P1 assets… except HubRam of course. With HubExec and multitasking… a 4 cog P2, with 40MHz cogs would be a beast.

This option would also amplify performance under Hub Exec.

And if someone wants a system with 8 cogs, Parallax gets to sell two units!

In fact, if this becomes the direction, I would hope that Parallax offers a multiple P2 board, straight out of the gate.

Rich

cgracey · 2014-04-01 17:48

rjo__ wrote: »

This has been a heck of a day.

Wouldn't halving the number of cogs also allow doubling the bandwidth of the remaining 4 cogs?

That is a fairly attractive change that should be considered on its own merits… April fool's day or not. Wouldn't you rather have a cog that has double the native bandwidth? I would. I have yet to build a system that runs out of P1 assets… except HubRam of course. With HubExec and multitasking… a 4 cog P2, with 40MHz cogs would be a beast.

This option would also amplify performance under Hub Exec.

And if someone wants a system with 8 cogs, Parallax gets to sell two units!

In fact, if this becomes the direction, I would hope that Parallax offers a multiple P2 board, straight out of the gate.

Rich

It's true that going to 4 cogs would double the hub bandwidth per cog. It would also reduce ICACHE and DCACHE lines to 4 longs, instead of 8. We could actually keep 256KB hub RAM.

With 4 tasks per cog, 4 cogs would still be pretty powerful.
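The doubling of hub bandwidth per cog follows from round-robin sharing, which a minimal Python sketch can show. It assumes one hub slot per system clock handed to each cog in turn, and a 160MHz clock; both are simplifying assumptions for illustration rather than confirmed details of the design.

```python
def hub_slots_per_second(clock_hz, cogs):
    """Hub access opportunities per cog per second under strict round-robin.

    Assumes one hub slot per system clock, given to each cog in turn.
    """
    return clock_hz / cogs

CLOCK_HZ = 160e6  # assumed system clock
for cogs in (8, 4):
    rate = hub_slots_per_second(CLOCK_HZ, cogs)
    print(f"{cogs} cogs: {rate / 1e6:.0f} M hub slots/s per cog")
```

The sketch only counts access opportunities; how much data moves per slot depends on the cache line width, which Chip says would also change.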

T Chap · 2014-04-01 17:52

Can two 4 cog devices be connected together via a "cog link" so that they function identical to an 8 core with software only running on one device?

jmg · 2014-04-01 17:54

Ken Gracey wrote: »

Not knowing these important pieces, I'm probably leaning towards (C) for a couple of reasons: maximizes C compiler performance with hubexec; provides plenty of cogs for most applications (production uses of P1 have limited use for video, so if it's a cog-eater in P2 this is something to consider); it fits in the whole FPGA; seems simple; and maybe the extra pre-emptive multi-tasking stuff stretches four cogs to a high potential anyway.

The smaller memory aspect of C) is best avoided, but fewer COGs is worth a look.

Keep in mind, this lowers total Watts, but at the cost of total MIPs, and does not solve hot-spot temperatures per COG.

This does allow a cheaper part, and that may matter more than Watts.

Watts come as uA/MHz and Cpd*Fs*Vcc^2, so lowering Vcc can lower power (and peak MHz), and more Clocking choices, which are not complex to add, can give users more choices on the operating curve.

This higher power may make 2 spec points sensible:
* A 100MHz 'modest power' spec, with reduced Vcc, for most users. Nothing fancy needed on cooling.
* An 'overclocked' spec, that needs cooling, and finer Vcc control, targeting power users.

I would also explore Timers at 160MHz, Cores at 80MHz/40MHz options, which lets users save power with little loss of precision. ( present P1 has a 'Timer at 80MHz, cores at 20MHz' effective operating mode)
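To make the timer/core gearing idea concrete, here is a small Python sketch comparing the operating points mentioned above. The function name and the table of modes are illustrative assumptions; only the MHz figures come from the post.

```python
def gearing(timer_hz, core_hz):
    """Return (timer resolution in ns, timer clocks per core clock)."""
    return 1e9 / timer_hz, timer_hz / core_hz

modes = {
    "P1 today (80MHz timer, 20MHz cores)": (80e6, 20e6),
    "P2 idea (160MHz timer, 80MHz cores)": (160e6, 80e6),
    "P2 idea (160MHz timer, 40MHz cores)": (160e6, 40e6),
}
for name, (timer_hz, core_hz) in modes.items():
    res_ns, ratio = gearing(timer_hz, core_hz)
    print(f"{name}: {res_ns:.2f} ns resolution, timer at {ratio:.0f}x core rate")
```

The timing resolution depends only on the timer clock, so slowing the cores trades execution speed for power without coarsening edge timing.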

jmg · 2014-04-01 17:57

cgracey wrote: »

We could actually keep 256KB hub RAM.

Or even increase it a little ?

What is the current ratio of 1 COG to present 256K memory area ?

potatohead · 2014-04-01 18:04

What does the power look like at 80MHz?

To be honest, 80MHz operation would be excellent for a lot of users. If that makes sense, then there is an opportunity for overclocked devices.

Rayman · 2014-04-01 18:08

I was just looking at the power draw of the Atom on the new minnow board thing...
The Intel Atom is spec'd at 3.3 to 4.5 W depending on model.

Anyway if Intel's little board can get away with that, I'd think we could too...

Cluso99 · 2014-04-01 18:11

Chip,
Really sorry to hear the results. My advice FWIW...

Keep the dialog going with OnSemi. They have way more experience with solutions for lowering the power, etc.
Take a sleeping pill, go to bed early, and get a full night's sleep. You will feel much better in the morning.
Take some time to digest the problems and solutions. Go to Mexico if that helps.

Here are some ideas...

Don't reduce the cogs from 8.
- This is what keeps us away from interrupts.
- We need some cogs to be simple I/O drivers.
- Keep multi-tasking.
- Re-examine the benefits for multi-threading ???
Don't reduce HUB RAM from 256KB
- Preferably an increase would be better
- Since HUBEXEC works nicely with the 16bit address, any increase > 256KB could work as follows
  - HUBEXEC could use an instruction to set which pair of 128KB blocks it uses (ie a pair of banks).
Preferably QFP, as BGA will limit lower volume users
- Too hard to prototype
- Too hard to build low volumes
- Too expensive to use prebuilt modules.
I like the reduction to 90/65/40nm - big advantage if the cost does not blow out.
I like the change to standard RAM cells as this gives us...
- AUX & COG RAM can be changed to WIDE access directly without state m/c to simulate.
- Perhaps this reduces the requirement for all the separate caches ???
- With direct WIDE hub access, I think the COG/AUX could be fleshed out better now.
- With direct WIDE hub access, I think perhaps HUBEXEC could be simplified.
We have 4 tasks per Cog. Re-examine the multi-threading requirement.

Don't forget we still have USB/SERDES to go.
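The bank-pair idea in the list above could be sketched as an address translation, shown here in Python. This is one possible reading of the suggestion: hubexec keeps a 256KB window made of two 128KB halves, and a hypothetical bank-select instruction chooses which physical block backs each half. The names and the byte-addressed layout are assumptions, not part of any actual design.

```python
BLOCK_SIZE = 128 * 1024  # bytes per block, as in the post

def physical_address(addr, bank_pair):
    """Map a hubexec window address through a selected pair of blocks.

    addr      -- byte address inside the 256KB hubexec window
    bank_pair -- (low_block, high_block): hypothetical block numbers set
                 by the imagined bank-select instruction
    """
    if not 0 <= addr < 2 * BLOCK_SIZE:
        raise ValueError("address outside the 256KB hubexec window")
    block = bank_pair[addr // BLOCK_SIZE]
    return block * BLOCK_SIZE + addr % BLOCK_SIZE

# Low half stays on block 0, high half is switched to block 3.
print(hex(physical_address(0x00010, (0, 3))))
print(hex(physical_address(BLOCK_SIZE + 0x10, (0, 3))))
```

Data accesses beyond the window would need their own wider addressing, which this sketch does not cover.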

jmg · 2014-04-01 18:14

potatohead wrote: »

What does the power look like at 80MHz?

To be honest, 80MHz operation would be excellent for a lot of users. If that makes sense, then there is an opportunity for overclocked devices.

Yes, I'm sure there is room for different Power/cooling usages. 80MHz is the same time resolution as P1, 100MHz is a nice round number. Even going down to 80MHz timer, and 20MHz Cores would match P1 nominal speeds.

If Vcc was not changed, power at 80MHz would usually be half of the 160MHz figure; if you also lower Vcc, power can approach one quarter of the 160MHz figure.
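The half and quarter figures follow from the Cpd * Fs * Vcc^2 relation used earlier in the thread. The Python sketch below works in ratios relative to the 160MHz operating point; it assumes dynamic power only (no leakage) and does not say whether the silicon would actually run at the reduced Vcc. The function names are illustrative.

```python
import math

def relative_power(freq_ratio, vcc_ratio=1.0):
    """Dynamic power relative to the reference point (P ~ Fs * Vcc^2)."""
    return freq_ratio * vcc_ratio ** 2

def vcc_ratio_for(target_power_ratio, freq_ratio):
    """Vcc scaling needed to hit a target power ratio at a given clock ratio."""
    return math.sqrt(target_power_ratio / freq_ratio)

half_clock_same_vcc = relative_power(0.5)      # 80MHz at unchanged Vcc
needed_vcc = vcc_ratio_for(0.25, 0.5)          # Vcc ratio for quarter power
print(f"80MHz, same Vcc: {half_clock_same_vcc:.2f} of 160MHz power")
print(f"Vcc ratio for one quarter power: {needed_vcc:.3f}")
```

Under these assumptions, reaching one quarter of the power at half the clock means running Vcc at roughly 71% of its original value.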

rogloh · 2014-04-01 18:22

cgracey wrote: »

It's true that going to 4 cogs would double the hub bandwidth per cog. It would also reduce ICACHE and DCACHE lines to 4 longs, instead of 8. We could actually keep 256KB hub RAM.

With 4 tasks per cog, 4 cogs would still be pretty powerful.

If you are only targeting the types of things a P1 does today, 4 COGs is probably reasonable given the new multitasking support in P2. However, once you start to want applications that require more and higher speed I/O capabilities, like USB, Ethernet PHY, HD video, QuadSPI etc., I'm thinking you will probably run out of COGs fairly quickly. Also, fully deterministic hubexec is not truly/readily guaranteed. Certain real-time critical things are going to need native PASM execution and sometimes in their own dedicated COG.

USB is probably going to take either a whole COG or most of it.
For decent performance, your hubexec "main" application code may want to take a whole COG, along with its scheduler.
Video could possibly be muxed with other lower speed I/O like serial, audio, but it really wants a COG.
An external SDRAM memory manager may take a whole COG.
Quad SPI or MII transfers at 100Mbps could take up most of a COG depending on hardware support for serdes/xfer.

I think with just 4 we can quickly start to run out of COGs.

eldonb46 · 2014-04-01 18:25

T Chap wrote: »

Can two 4 cog devices be connected together via a "cog link" so that they function identical to an 8 core with software only running on one device?

Or go even more extreme; How about two Cogs per chip (8 threads), with Large Memories, Maybe fewer I/O's?, Lower cost part, Maybe DIP parts could be available? and with very Integrated High Speed Connection to multiple other P2's (at least 4 or maybe +8). All chips would hopefully be equal, as the Cogs are in the P1.

For Marketing; P2 would actually mean two cores. :-)

Or, after this very hectic day (Apr 1), am I just dreaming?

I just want a P2 soon !

jmg · 2014-04-01 18:32

rogloh wrote: »

USB is probably going to take either a whole COG or most of it.
...
Quad SPI or MII transfers at 100Mbps could take most of a COG depending on hardware support for serdes/xfer.

QuadSPI, I think, is planned in SerDes, so should have the same code footprint as SPI, which is very low.
USB can also drop the COG overhead, with some small cells to assist the bit level handling, and edge sync.
That moves USB to byte-level software, which will be much smaller, and needs much less COG BW.

The above means with modest silicon peripheral support, the valuable SW resource in each COG is not consumed doing trivial stuff.
Certainly any move to 4 COGs makes (already planned) good peripheral support in SerDes, and small USB cells, more important.

cgracey · 2014-04-01 18:54

I got the files ready for OnSemi, so that the memories are declared in the Verilog code.

To make sure the thing compiles without error, I had Quartus do its analysis and synthesis processes on it. It said it needs 262k LE's! That sounds right.

Bob Lawrence (VE1RLL) · 2014-04-01 18:57

We're looking at 5 Watts in a BGA!

The 66-core version of the Parallella computer delivers over 90 GFLOPS on a board the size of a credit card while consuming only 5 Watts under typical work loads.

http://www.parallella.org/board/

cgracey · 2014-04-01 19:02

Bob Lawrence (VE1RLL) wrote: »

The 66-core version of the Parallella computer delivers over 90 GFLOPS on a board the size of a credit card while consuming only 5 Watts under typical work loads.

http://www.parallella.org/board/

I recall that they used a 28nm process for that chip. And the architecture was optimized for efficient FLOPS.

cgracey · 2014-04-01 19:05

As easily as we could do 8 of the current Prop2 cogs on a 180nm die, we could implement 32 Prop1 cogs in the same area.

Bob Lawrence (VE1RLL) · 2014-04-01 19:07

I recall that they used a 28nm process for that chip. And the architecture was optimized for efficient FLOPS.

Chip, you are correct, and I was just going to post it when I saw your response. LOL
Epiphany-IV 64-core 28nm Microprocessor
