List of Changes in Next P2 Silicon

evanh · 2019-03-15 09:25

jmg wrote: »

Volume users need guaranteed specifications to work to.

160 MHz.

evanh · 2019-03-15 09:32

jmg wrote: »

That was always a risk of the new added logic....

Chip then went on to say how it had impacted the P2ES synthesis as well. Only by doing the compare have they realised that spec'ing for 180 MHz was pushing too hard.

I'm really keen to see how much this reduces heating, which in turn should make for easy overclocking without the need for active cooling. That would be a neat outcome.

jmg · 2019-03-15 21:27

evanh wrote: »

jmg wrote: »

Volume users need guaranteed specifications to work to.

160 MHz.

Yes, that is one point on the TVP curve.
Cluso99 was asking for another point (as other vendors often specify).
In his case, it was:
"When done, is it possible to ask what a lesser spec would give as fmax?
Say 0C to +70C at 1.8V +- 2.5%"

evanh · 2019-03-15 21:57

You might get 5 MHz more that way - Which will be rounded back down to 160 MHz.

jmg · 2019-03-15 22:09

evanh wrote: »

You might get 5 MHz more that way - Which will be rounded back down to 160 MHz.

That's one guess. (ie 3.125%)

Another litmus test is to look at Atmel, who already do this on their parts...
Their Commercial is 0~70 and 5% Vcc and Industrial is 0~85 and 10% Vcc
They state that 10ns 'I' grade is ~ 7ns 'C' grade, so that's a 30% gain for a 15°C drop in TMax and half the Vcc spread.
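The "30% gain" can be read two ways, so here is a minimal sketch of the arithmetic, using only the 10ns and 7ns figures quoted above. The function name is illustrative, and treating the speed grade as a delay that inverts directly into a clock rate is an assumption.

```python
# Sketch: compare two speed grades given as delays (ns).
# Assumption: maximum clock rate is simply the inverse of the quoted delay.

def speed_grade_gain(slow_ns: float, fast_ns: float) -> tuple[float, float]:
    """Return (fractional delay reduction, fractional frequency gain)."""
    delay_reduction = (slow_ns - fast_ns) / slow_ns
    frequency_gain = slow_ns / fast_ns - 1.0
    return delay_reduction, frequency_gain

delay_cut, freq_gain = speed_grade_gain(10.0, 7.0)
print(f"delay reduction: {delay_cut:.1%}")   # 30% shorter delay
print(f"frequency gain:  {freq_gain:.1%}")   # roughly 43% higher clock rate
```

Read as delay, the gain is 30%; read as frequency, it is closer to 43%.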

evanh · 2019-03-16 03:39

Slap whatever you like on.

Cluso99 · 2019-03-16 06:07

evanh wrote: »

You might get 5 MHz more that way - Which will be rounded back down to 160 MHz.

A specification is a specification, not a guess!

evanh · 2019-03-16 08:02

It wasn't said as a guess. It is a statement that you get the same result.

cgracey · 2019-03-16 14:53

Wendy reminded me yesterday that timing closure for the new P2 is set from -55C to +150C junction temperature. Our package Tja is ~20C/W. The anticipated power dissipation was ~2.25W. This would result in a ~45C (20×2.25) rise in junction temperature over ambient temperature, which affords us a -55C to +85C packaged temperature range with ~20C (150-45-85) allowance for local hot spots on the die.
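The thermal budget above can be sketched as a few lines of arithmetic. All figures are the ones quoted in this post; the function and parameter names are illustrative only, and the simple "rise = Tja × power" model is the same one the post uses.

```python
# Sketch of the junction temperature budget, using the figures quoted above.

def junction_rise_c(theta_ja_c_per_w: float, power_w: float) -> float:
    """Junction temperature rise over ambient, in degrees C."""
    return theta_ja_c_per_w * power_w

def hot_spot_margin_c(tj_max_c: float, ambient_max_c: float,
                      theta_ja_c_per_w: float, power_w: float) -> float:
    """Headroom left for local hot spots at the maximum ambient temperature."""
    return tj_max_c - ambient_max_c - junction_rise_c(theta_ja_c_per_w, power_w)

rise = junction_rise_c(20.0, 2.25)                    # 45.0
margin = hot_spot_margin_c(150.0, 85.0, 20.0, 2.25)   # 20.0
print(f"junction rise: {rise} C, hot-spot margin at 85C ambient: {margin} C")
```

Under this model, lowering either the dissipated power or the maximum ambient temperature increases the margin directly.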

After the tapeout is complete, we will be able to generate a graph of temperature vs Fmax.

Since power is going to be lower than originally planned for, we should be able to get a higher stated Fmax than before. Also, if the customer constrains ambient temperature to, say, 70C, that will allow for an even higher stated Fmax.

We will have simulation data soon that will indicate what this curve will look like.

evanh · 2019-03-16 15:56

Sounds great Chip. Looks like that'll answer all the questions on power.

Apologies to Cluso,
There wasn't any good reason for me to jump on you for this question. It wasn't my problem to solve. It's just that the question seemed to be asking for extra attempts at minuscule variations of parameters, and that struck a nerve in me.

Hopefully I've had enough trolling for the moment.

Cluso99 · 2019-03-16 23:55

Excellent news Chip

It will be great to see what the fmax will be with lower spec requirements like say 70C and tighter 1V8.

And then for us to run the silicon and see what we can push it to before it breaks

No worries Evan. It's all good.

cgracey · 2019-03-18 23:42

Wendy just told me that she was able to wrap up timing across all corners at 175MHz, after all.

She is going to send me a final report that has more data in it. I'm curious to know what this did to the instance count.

jmg · 2019-03-18 23:46

cgracey wrote: »

Wendy just told me that she was able to wrap up timing across all corners at 175MHz, after all.

She is going to send me a final report that has more data in it. I'm curious to know what this did to the instance count.

Sounds great!!
It always seems best 'to approach routing problems from below'..
If the instance count improves speed, it does not matter so much, until it starts to not fit into the die.

evanh · 2019-03-19 02:08

But does it really improve speed in any significant way? We're only talking a few MHz in the parameters. I'd rather get the speed without needing active cooling.

jmg · 2019-03-19 03:02

evanh wrote: »

But does it really improve speed in any significant way? We're only talking a few MHz in the parameters. I'd rather get the speed without needing active cooling.

Keep in mind that better drive improves the slews of the driven lines, and that reduces the Icc effect from the transition current peak.
So whilst you have more devices, you also have better slew, which means you would need to dig deep into the spice results to decide which of those effects dominates.
The real test will be to compare the mA/MHz Cpd figures for P2es and P2+
P2+ should be well ahead of P2es, in all cases except where all 8 COGS are fully operational.
Even with 8 COGS running, P2+ may have clock gate savings on the smart pin cells too, depends how far OnSemi went.

evanh · 2019-03-19 03:19

The real test is how many extra gates are used to get so little extra MHz.

msrobots · 2019-03-19 03:21

Sadly I have very little time for my P2-es, but compared to the P1 this thing runs fast. I have no fan installed yet, so I stay at 180 or so, and it does not even get warm running all COGs. But still it is at least 4 times faster than the P1, and those smart pins save a lot of code.

I can't wait to get my hands on more than one...

Mike

jmg · 2019-03-19 03:37

evanh wrote: »

The real test is how many extra gates are used to get so little extra MHz.

Not really, you or I or other users do care about mA/MHz specs, not about how many gates are inside the package.

evanh · 2019-03-19 03:39

Lol, we're saying the same thing JMG.

Mark_T · 2019-03-19 19:45

I did a little test with some cogs driving constant cordic rotate ops, so each cog does a sine and cosine
every 8 clocks, ie a 40Mflops equivalent at 160MHz, counting sine and cosine separately

At 160MHz 1 cog took 0.26A, 7 cogs took 0.38A (at 5V), so 20mA per cog for 40Mflops, or put another
way 2Gflops/amp (Ignoring the constant 0.24A drain).

[ I'm treating flop to mean "fixpoint operations" of course ]

What this is at the 1.8V rail I don't know, I guess about 700Mflops/A, perhaps 400Mflop/W is a better
way to state it.
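For anyone who wants to check the arithmetic, here is a rough sketch using only the measurements above. The variable names are illustrative. The 1.8V figure assumes a lossless conversion from 5V to 1.8V, which is an assumption, not a measurement.

```python
# Sketch of the per-cog CORDIC throughput and current figures from this post.

CLOCK_HZ = 160e6          # system clock
CLOCKS_PER_ROTATE = 8     # one CORDIC rotate per cog every 8 clocks
RESULTS_PER_ROTATE = 2    # sine and cosine counted separately
SUPPLY_V = 5.0            # rail the current was measured on
CORE_V = 1.8              # core rail (conversion assumed lossless)

i_1cog_a = 0.26           # measured with 1 cog running
i_7cogs_a = 0.38          # measured with 7 cogs running

ops_per_cog = CLOCK_HZ / CLOCKS_PER_ROTATE * RESULTS_PER_ROTATE  # 40 Mflops
amps_per_cog = (i_7cogs_a - i_1cog_a) / 6    # about 0.02 A per extra cog
base_amps = i_1cog_a - amps_per_cog          # about 0.24 A constant drain

ops_per_amp_5v = ops_per_cog / amps_per_cog  # about 2 Gflops/A at 5V
ops_per_watt = ops_per_amp_5v / SUPPLY_V     # about 400 Mflops/W
ops_per_amp_core = ops_per_watt * CORE_V     # about 720 Mflops/A at 1.8V

print(f"{ops_per_cog / 1e6:.0f} Mflops per cog, {amps_per_cog * 1e3:.0f} mA per cog")
print(f"{ops_per_amp_5v / 1e9:.1f} Gflops/A at 5V, {ops_per_watt / 1e6:.0f} Mflops/W")
print(f"{ops_per_amp_core / 1e6:.0f} Mflops/A at the 1.8V rail (lossless assumption)")
```

These numbers leave out the constant 0.24A drain, the same as the figures above.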

In practice it's not possible to use both sine/cosine results driven every 8 clocks, but it's an interesting
performance figure as it relates to actual computation rather than just clock frequency. A more
plausible practical figure is 200Mflop/W. [My unrolled loop code for FFT calculation roughly agrees
with this.]

The comparable value for integer ops is presumably about twice this as 4 instructions every 8 clocks is
typical, 400MIPS/W

So assuming the high constant power drain is fixed in the next silicon, it feels pretty competitive
https://en.wikipedia.org/wiki/Performance_per_watt#Examples

(and assuming fixed point is adequate for your application!)

jmg · 2019-03-19 20:15

Mark_T wrote: »

..
At 160MHz 1 cog took 0.26A, 7 cogs took 0.38A (at 5V), so 20mA per cog for 40Mflops, or put another
way 2Gflops/amp (Ignoring the constant 0.24A drain).
..
So assuming the high constant power drain is fixed in the next silicon, it feels pretty competitive

You can't quite 'Ignore the constant 0.24A drain', as that is not a static load, but comes from the clock tree.
What you can expect is that 1 COG is closer to 1/7 that of 7 cogs, and 7 COGs will be lower than 8 COGS, but the peak mA/MHz for everything running, will likely not change much.

Mark_T · 2019-03-20 17:12

Yes, I guess I knew that, so perhaps half the performance I'd like to imagine once this is factored in, but still.

evanh · 2019-03-21 01:23

I read "ignoring" as putting it aside, i.e. leaving it out for the purpose of demonstrating linearity.

cgracey · 2019-03-29 23:00

Wendy is running the simulations and she got me some data on current during download (one cog):

Current silicon = 77mA
Next silicon = 40mA

It's not as low as I'd hoped, but about half the current is not bad. The clock tree is actually taking half of that 40mA.

Wendy should have the high-power simulation results back soon. The chip is definitely going to run cooler than before.

jmg · 2019-03-29 23:10

cgracey wrote: »

Wendy is running the simulations and she got me some data on current during download (one cog):

Current silicon = 77mA
Next silicon = 40mA

It's not as low as I'd hoped, but about half the current is not bad. The clock tree is actually taking half of that 40mA.

Wendy should have the high-power simulation results back soon. The chip is definitely going to run cooler than before.

Sounding good.
Is the clock tree not divided into 8 clean branches, with an enable per COG ?
The previous P2 has a high base mA/MHz and a lower /COG mA/MHz, and drop in the base mA/MHz is good.

cgracey · 2019-03-29 23:30

jmg wrote: »

cgracey wrote: »

Wendy is running the simulations and she got me some data on current during download (one cog):

Current silicon = 77mA
Next silicon = 40mA

It's not as low as I'd hoped, but about half the current is not bad. The clock tree is actually taking half of that 40mA.

Wendy should have the high-power simulation results back soon. The chip is definitely going to run cooler than before.

Sounding good.
Is the clock tree not divided into 8 clean branches, with an enable per COG ?
The previous P2 has a high base mA/MHz and a lower /COG mA/MHz, and drop in the base mA/MHz is good.

The tools automatically arranged the clock gating, without any specific direction from me or Wendy. It made its own inferences from the design.

Cluso99 · 2019-03-30 01:09

cgracey wrote: »

jmg wrote: »

cgracey wrote: »

Wendy is running the simulations and she got me some data on current during download (one cog):

Current silicon = 77mA
Next silicon = 40mA

It's not as low as I'd hoped, but about half the current is not bad. The clock tree is actually taking half of that 40mA.

Wendy should have the high-power simulation results back soon. The chip is definitely going to run cooler than before.

Sounding good.
Is the clock tree not divided into 8 clean branches, with an enable per COG ?
The previous P2 has a high base mA/MHz and a lower /COG mA/MHz, and drop in the base mA/MHz is good.

The tools automatically arranged the clock gating, without any specific direction from me or Wendy. It made its own inferences from the design.

Great news Chip!

Nice to see much lower current when sections are not being used. Download will be exercising the HUB, tho' not as much as, say, hubexec. Hopefully when the hub is idle the current will drop, unlike what appears to be the case on the current silicon.

It is going to be pretty hard to push the P2 to its limits

Tubular · 2019-03-30 01:09

jmg wrote: »

Is the clock tree not divided into 8 clean branches, with an enable per COG ?
The previous P2 has a high base mA/MHz and a lower /COG mA/MHz, and drop in the base mA/MHz is good.

Yes this is certainly heading in the right direction. The high power simulations will be interesting, too.

Yanomani · 2019-03-30 01:31

IMHO, P2 clock tree needs to have a lot more branches than the number of Cogs it actually has.

- The Hub needs to have at least one, perhaps more, due to its complexity; plus another eight, whose destiny is the Ram banks.

- Each Smart pin also needs its own share.

cgracey · 2019-03-30 01:41

Yanomani wrote: »

IMHO, P2 clock tree needs to have a lot more branches than the number of Cogs it actually has.

- The Hub needs to have at least one, perhaps more, due to its complexity; plus another eight, whose destiny is the Ram banks.

- Each Smart pin also needs its own share.

The clock tree synthesis tool inserted 1,830 clock-gate instances. So, the granularity is much finer than entire cogs.

At 20MHz, the clock tree is consuming about 29mW. That means it will probably consume 290mW at 200MHz.
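The extrapolation above is a straight linear scaling with frequency. A minimal sketch follows, assuming the clock tree's dynamic power is proportional to clock frequency at a fixed core voltage and ignoring leakage. The function name is illustrative; the 160MHz and 175MHz points are just the clock rates mentioned earlier in this thread.

```python
# Sketch: scale the simulated clock-tree power linearly with clock frequency.
# Assumption: dynamic power is proportional to f at fixed voltage, no leakage term.

def scale_clock_tree_power_mw(p_ref_mw: float, f_ref_mhz: float,
                              f_target_mhz: float) -> float:
    """Estimate clock-tree power at f_target_mhz from one reference point."""
    return p_ref_mw * f_target_mhz / f_ref_mhz

for f_mhz in (160.0, 175.0, 200.0):
    p_mw = scale_clock_tree_power_mw(29.0, 20.0, f_mhz)
    print(f"{f_mhz:.0f} MHz: about {p_mw:.0f} mW")
```

At 200MHz this gives the 290mW quoted above.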
