List of Changes in Next P2 Silicon - Page 3 — Parallax Forums

List of Changes in Next P2 Silicon

Comments

  • evanhevanh Posts: 15,126
    jmg wrote: »
    Volume users need guaranteed specifications to work to.

    160 MHz.

  • evanhevanh Posts: 15,126
    jmg wrote: »
    That was always a risk of the new added logic.... :(
    Chip then went on to say how it had impacted the P2ES synthesis as well. Only by doing the compare have they realised that spec'ing for 180 MHz was pushing too hard.

    I'm really keen to see how much this reduces heating, which in turn makes for easy over-clocking without the need for active cooling. That would be a neat outcome.
  • jmgjmg Posts: 15,140
    evanh wrote: »
    jmg wrote: »
    Volume users need guaranteed specifications to work to.

    160 MHz.

    Yes, that is one point on the TVP curve.
    Cluso99 was asking for another point (as other vendors often specify).
    In his case, it was:
    "When done, is it possible to ask what a lesser spec would give as fmax?
    Say 0C to +70C at 1.8V +- 2.5%"
  • evanhevanh Posts: 15,126
    You might get 5 MHz more that way - Which will be rounded back down to 160 MHz.
  • jmgjmg Posts: 15,140
    evanh wrote: »
    You might get 5 MHz more that way - Which will be rounded back down to 160 MHz.

    That's one guess. (ie 3.125%)

    Another litmus test is to look at Atmel, who already do this on their parts...
    Their Commercial is 0~70 and 5% Vcc and Industrial is 0~85 and 10% Vcc
    They state that a 10ns 'I' grade part is roughly a 7ns 'C' grade part, so that's a 30% gain for a 15°C drop in TMax and half the Vcc spread.
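
The percentages traded in the last few posts can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above (5 MHz on 160 MHz, and 10 ns versus 7 ns); the function names are made up for illustration. Note that a 30% shorter delay is not the same thing as a 30% higher clock: if the delay really is the limiting path, the equivalent frequency gain is closer to 43%.

```python
# Relative gains implied by the figures quoted in the thread.
# Function names are illustrative, not from any Parallax or Atmel tool.

def relative_gain(base, extra):
    """Fractional gain of `extra` on top of `base`."""
    return extra / base

def delay_reduction(t_slow_ns, t_fast_ns):
    """Fractional reduction in delay going from the slow to the fast grade."""
    return 1.0 - t_fast_ns / t_slow_ns

def frequency_gain(t_slow_ns, t_fast_ns):
    """Fractional frequency gain, assuming fmax is proportional to 1/delay."""
    return t_slow_ns / t_fast_ns - 1.0

guess_gain = relative_gain(160.0, 5.0)     # 5 MHz on top of 160 MHz
atmel_delay = delay_reduction(10.0, 7.0)   # 10 ns 'I' grade vs 7 ns 'C' grade
atmel_freq = frequency_gain(10.0, 7.0)

print(f"5 MHz on 160 MHz : {guess_gain:.3%}")
print(f"10 ns -> 7 ns    : {atmel_delay:.1%} less delay, {atmel_freq:.1%} more frequency")
```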
  • evanhevanh Posts: 15,126
    Slap whatever you like on.
  • Cluso99Cluso99 Posts: 18,066
    evanh wrote: »
    You might get 5 MHz more that way - Which will be rounded back down to 160 MHz.

    A specification is a specification, not a guess!
  • evanhevanh Posts: 15,126
    It wasn't said as a guess. It is a statement that you get the same result.
  • cgraceycgracey Posts: 14,133
    Wendy reminded me yesterday that timing closure for the new P2 is set from -55C to +150C junction temperature. Our package Tja is ~20C/W. The anticipated power dissipation was ~2.25W. This would result in a ~45C (20×2.25) rise in junction temperature over ambient temperature, which affords us a -55C to +85C packaged temperature range with ~20C (150-45-85) allowance for local hot spots on the die.

    After the tapeout is complete, we will be able to generate a graph of temperature vs Fmax.

    Since power is going to be lower than originally planned for, we should be able to get a higher stated Fmax than before. Also, if the customer constrains ambient temperature to, say, 70C, that will allow for an even higher stated Fmax.

    We will have simulation data soon that will indicate what this curve will look like.
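
The thermal budget in the post above is simple enough to reproduce. The following is a minimal sketch using only the figures Chip quotes (about 20C/W, about 2.25W, a 150C junction limit and 85C ambient); the constant and function names are assumptions for illustration, and the 70C case is the constrained-ambient example he mentions.

```python
# Junction-temperature budget from the figures quoted in the post.
# All names here are illustrative; the numbers come from the thread.

THETA_JA_C_PER_W = 20.0   # package junction-to-ambient thermal resistance (~20 C/W)
POWER_W = 2.25            # anticipated power dissipation (~2.25 W)
TJ_MAX_C = 150.0          # upper junction temperature used for timing closure
TA_MAX_C = 85.0           # upper packaged (ambient) temperature

def junction_rise(theta_ja_c_per_w, power_w):
    """Temperature rise of the junction over ambient, in degrees C."""
    return theta_ja_c_per_w * power_w

def hot_spot_margin(tj_max_c, ta_max_c, rise_c):
    """Headroom left for local hot spots on the die, in degrees C."""
    return tj_max_c - ta_max_c - rise_c

rise = junction_rise(THETA_JA_C_PER_W, POWER_W)
margin = hot_spot_margin(TJ_MAX_C, TA_MAX_C, rise)
margin_70c = hot_spot_margin(TJ_MAX_C, 70.0, rise)  # ambient constrained to 70C

print(f"rise over ambient   : {rise:.1f} C")
print(f"margin at 85C       : {margin:.1f} C")
print(f"margin at 70C       : {margin_70c:.1f} C")
```

Lower dissipation shrinks the rise term directly, which is why a cooler chip leaves room for a higher stated Fmax at the same ambient limit.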
  • evanhevanh Posts: 15,126
    Sounds great Chip. Looks like that'll answer all the questions on power.

    Apologies to Cluso,
    There wasn't any good reason for me to jump on you for this question. It wasn't my problem to solve. It's just that the question seemed to be asking for extra attempts at minuscule variations of parameters, and that struck a nerve in me.

    Hopefully I've had enough trolling for the moment. :)
  • Cluso99Cluso99 Posts: 18,066
    Excellent news Chip :smiley:

    It will be great to see what the fmax will be with lower spec requirements like say 70C and tighter 1V8.

    And then for us to run the silicon and see what we can push it to before it breaks :wink:

    No worries Evan. It's all good.
  • cgraceycgracey Posts: 14,133
    Wendy just told me that she was able to wrap up timing across all corners at 175MHz, after all.

    She is going to send me a final report that has more data in it. I'm curious to know what this did to the instance count.
  • jmgjmg Posts: 15,140
    cgracey wrote: »
    Wendy just told me that she was able to wrap up timing across all corners at 175MHz, after all.

    She is going to send me a final report that has more data in it. I'm curious to know what this did to the instance count.

    Sounds great!!
    It always seems best 'to approach routing problems from below'.
    If the instance count improves speed, it does not matter so much, until it starts to not fit into the die.
  • evanhevanh Posts: 15,126
    But does it really improve speed in any significant way? We're only talking a few MHz in the parameters. I'd rather get the speed without needing active cooling.
  • jmgjmg Posts: 15,140
    evanh wrote: »
    But does it really improve speed in any significant way? We're only talking a few MHz in the parameters. I'd rather get the speed without needing active cooling.

    Keep in mind that better drive improves the slews of the driven lines, and that reduces the Icc effect from the transition current peak.
    So whilst you have more devices, you also have better slew, which means you would need to dig deep into the spice results to decide which of those effects dominates.
    The real test will be to compare the mA/MHz Cpd figures for P2es and P2+
    P2+ should be well ahead of P2es, in all cases except where all 8 COGS are fully operational.
    Even with 8 COGS running, P2+ may have clock gate savings on the smart pin cells too, depends how far OnSemi went.
  • evanhevanh Posts: 15,126
    The real test is how many extra gates are used to get so little extra MHz.
  • I sadly have very little time for my P2-es, but compared to the P1 this thing runs fast. I have no fan installed yet, so I stay at 180 or so, and it does not even get warm running all COGs. But still, it is at least 4 times faster than the P1, and those smart pins save a lot of code.

    I can't wait to get my hands on more than one...

    Mike
  • jmgjmg Posts: 15,140
    evanh wrote: »
    The real test is how many extra gates are used to get so little extra MHz.

    Not really, you or I or other users do care about mA/MHz specs, not about how many gates are inside the package.
  • evanhevanh Posts: 15,126
    Lol, we're saying the same thing JMG.
  • I did a little test with some cogs driving constant cordic rotate ops, so each cog does a sine and cosine
    every 8 clocks, ie a 40Mflops equivalent at 160MHz, counting sine and cosine separately :)

    At 160MHz 1 cog took 0.26A, 7 cogs took 0.38A (at 5V), so 20mA per cog for 40Mflops, or put another
    way 2Gflops/amp (Ignoring the constant 0.24A drain).

    [ I'm treating flop to mean "fixpoint operations" of course ]

    What this is at the 1.8V rail I don't know, I guess about 700Mflops/A, perhaps 400Mflop/W is a better
    way to state it.

    In practice it's not possible to use both sine/cosine results driven every 8 clocks, but it's an interesting
    performance figure as it relates to actual computation rather than just clock frequency. A more
    plausible practical figure is 200Mflop/W. [My unrolled loop code for FFT calculation roughly agrees
    with this.]

    The comparable value for integer ops is presumably about twice this as 4 instructions every 8 clocks is
    typical, 400MIPS/W

    So assuming the high constant power drain is fixed in the next silicon, it feels pretty competitive
    https://en.wikipedia.org/wiki/Performance_per_watt#Examples

    (and assuming fixed point is adequate for your application!)
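
The arithmetic in the post above can be reproduced as follows. This is a sketch using only the measurements quoted (0.26A for 1 cog, 0.38A for 7 cogs, 5V supply, 160MHz, one sine plus one cosine per 8 clocks); the variable names are illustrative. The 1.8V figure assumes, as an idealization, that all of the 5V input power reaches the 1.8V rail without regulator loss.

```python
# Per-cog current and "fixed-point ops" efficiency from the quoted measurements.
# Variable names are illustrative; the numbers come from the post.

CLOCK_HZ = 160e6
I_ONE_COG_A = 0.26       # board current with 1 cog running CORDIC ops
I_SEVEN_COGS_A = 0.38    # board current with 7 cogs running CORDIC ops
SUPPLY_V = 5.0           # measured at the 5 V input
CORE_V = 1.8             # core rail voltage

# One sine and one cosine result every 8 clocks, counted separately.
ops_per_cog = 2 * CLOCK_HZ / 8

# Going from 1 to 7 cogs adds 6 cogs, so divide the difference by 6.
per_cog_a = (I_SEVEN_COGS_A - I_ONE_COG_A) / 6
base_a = I_ONE_COG_A - per_cog_a          # the "constant" drain

ops_per_amp = ops_per_cog / per_cog_a     # at the 5 V input
ops_per_watt = ops_per_amp / SUPPLY_V
# Assumption: lossless conversion from 5 V to 1.8 V (idealized).
ops_per_amp_core = ops_per_watt * CORE_V

print(f"ops per cog          : {ops_per_cog / 1e6:.0f} M/s")
print(f"current per cog      : {per_cog_a * 1e3:.0f} mA")
print(f"constant drain       : {base_a:.2f} A")
print(f"efficiency at 5 V    : {ops_per_amp / 1e9:.1f} Gops/A, {ops_per_watt / 1e6:.0f} Mops/W")
print(f"equivalent at 1.8 V  : {ops_per_amp_core / 1e6:.0f} Mops/A")
```

These figures leave the constant drain out of the efficiency numbers, exactly as the post does; including it would lower them.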
  • jmgjmg Posts: 15,140
    Mark_T wrote: »
    ..
    At 160MHz 1 cog took 0.26A, 7 cogs took 0.38A (at 5V), so 20mA per cog for 40Mflops, or put another
    way 2Gflops/amp (Ignoring the constant 0.24A drain).
    ..
    So assuming the high constant power drain is fixed in the next silicon, it feels pretty competitive
    You can't quite 'Ignore the constant 0.24A drain', as that is not a static load, but comes from the clock tree.
    What you can expect is that 1 COG is closer to 1/7 that of 7 cogs, and 7 COGs will be lower than 8 COGS, but the peak mA/MHz for everything running, will likely not change much.

  • Yes, I guess I knew that, so perhaps half the performance I'd like to imagine once this is factored in, but still.
  • evanhevanh Posts: 15,126
    I read ignoring as putting it aside. Ie: Leaving it out for purpose of demonstrating a linearity.
  • cgraceycgracey Posts: 14,133
    Wendy is running the simulations and she got me some data on current during download (one cog):

    Current silicon = 77mA
    Next silicon = 40mA

    It's not as low as I'd hoped, but about half the current is not bad. The clock tree is actually taking half of that 40mA.

    Wendy should have the high-power simulation results back soon. The chip is definitely going to run cooler than before.
  • jmgjmg Posts: 15,140
    cgracey wrote: »
    Wendy is running the simulations and she got me some data on current during download (one cog):

    Current silicon = 77mA
    Next silicon = 40mA

    It's not as low as I'd hoped, but about half the current is not bad. The clock tree is actually taking half of that 40mA.

    Wendy should have the high-power simulation results back soon. The chip is definitely going to run cooler than before.

    Sounding good.
    Is the clock tree not divided into 8 clean branches, with an enable per COG ?
    The previous P2 has a high base mA/MHz and a lower /COG mA/MHz, and drop in the base mA/MHz is good.

  • cgraceycgracey Posts: 14,133
    jmg wrote: »
    cgracey wrote: »
    Wendy is running the simulations and she got me some data on current during download (one cog):

    Current silicon = 77mA
    Next silicon = 40mA

    It's not as low as I'd hoped, but about half the current is not bad. The clock tree is actually taking half of that 40mA.

    Wendy should have the high-power simulation results back soon. The chip is definitely going to run cooler than before.

    Sounding good.
    Is the clock tree not divided into 8 clean branches, with an enable per COG ?
    The previous P2 has a high base mA/MHz and a lower /COG mA/MHz, and drop in the base mA/MHz is good.

    The tools automatically arranged the clock gating, without any specific direction from me or Wendy. It made its own inferences from the design.
  • Cluso99Cluso99 Posts: 18,066
    cgracey wrote: »
    jmg wrote: »
    cgracey wrote: »
    Wendy is running the simulations and she got me some data on current during download (one cog):

    Current silicon = 77mA
    Next silicon = 40mA

    It's not as low as I'd hoped, but about half the current is not bad. The clock tree is actually taking half of that 40mA.

    Wendy should have the high-power simulation results back soon. The chip is definitely going to run cooler than before.

    Sounding good.
    Is the clock tree not divided into 8 clean branches, with an enable per COG ?
    The previous P2 has a high base mA/MHz and a lower /COG mA/MHz, and drop in the base mA/MHz is good.

    The tools automatically arranged the clock gating, without any specific direction from me or Wendy. It made its own inferences from the design.

    Great news Chip!

    Nice to see much lower current when sections are not being used. Download will be exercising the HUB, tho' not as much as, say, hubexec. Hopefully when the hub is idle the current will drop, rather than staying high as appears to be the case on the current silicon.

    It is going to be pretty hard to push the P2 to its limits ;)
  • jmg wrote: »
    Is the clock tree not divided into 8 clean branches, with an enable per COG ?
    The previous P2 has a high base mA/MHz and a lower /COG mA/MHz, and drop in the base mA/MHz is good.

    Yes this is certainly heading in the right direction. The high power simulations will be interesting, too.
  • IMHO, the P2 clock tree needs to have a lot more branches than the number of Cogs it actually has.

    - The Hub needs to have at least one, perhaps more, due to its complexity; plus another eight, destined for the RAM banks.

    - Each Smart pin also needs its own share.

  • cgraceycgracey Posts: 14,133
    Yanomani wrote: »
    IMHO, the P2 clock tree needs to have a lot more branches than the number of Cogs it actually has.

    - The Hub needs to have at least one, perhaps more, due to its complexity; plus another eight, destined for the RAM banks.

    - Each Smart pin also needs its own share.

    The clock tree synthesis tool inserted 1,830 clock-gate instances. So, the granularity is much finer than entire cogs.

    At 20MHz, the clock tree is consuming about 29mW. That means it will probably consume 290mW at 200MHz.
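
The 290mW estimate follows from scaling the 20MHz figure linearly with frequency. Below is a minimal sketch of that extrapolation; it assumes clock-tree power is purely dynamic and proportional to frequency at a fixed supply voltage, which is the same assumption the post makes implicitly. The function name is made up for illustration.

```python
# Linear-in-frequency extrapolation of clock-tree power.
# Assumes purely dynamic power at a fixed voltage (P proportional to f).

def scale_dynamic_power(p_ref_mw, f_ref_mhz, f_target_mhz):
    """Estimate power at f_target from a reference point, scaling linearly."""
    return p_ref_mw * f_target_mhz / f_ref_mhz

P_REF_MW = 29.0    # clock-tree power quoted at 20 MHz
F_REF_MHZ = 20.0

p_200 = scale_dynamic_power(P_REF_MW, F_REF_MHZ, 200.0)
p_175 = scale_dynamic_power(P_REF_MW, F_REF_MHZ, 175.0)
p_160 = scale_dynamic_power(P_REF_MW, F_REF_MHZ, 160.0)

for f_mhz, p_mw in ((160, p_160), (175, p_175), (200, p_200)):
    print(f"{f_mhz} MHz: about {p_mw:.2f} mW in the clock tree")
```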