New P2 Silicon Observations

Cluso99 · 2018-10-09 16:13

Peter,
One word... AWESOME

Is there any power difference if only one cog is active vs 8 cogs active?
Does it make any power difference for hubexec vs cog execution?

jmg · 2018-10-09 16:21

Cluso99 wrote: »

Is there any power difference if only one cog is active vs 8 cogs active?

Yes, but not a large amount.
See my Cpd modeling - approx. 1.56nF for the chip-wide clock tree, and 80pF/COG on top of that.

Unfortunately, the shiny MHz numbers are diverting attention from the big POWER issues P2 revA has.

jmg · 2018-10-09 16:43

cgracey wrote: »

..... to set the center bias to around VIO/2, which is a simple inverter's logic threshold inside the converter (there's no faster high-gain amplifier than a few digital inverters connected in series)...

How many is a few ?
That detail worries me a little, as I’ve seen digital chips suffer transition oscillations, where slow edges spend enough time in the linear region to see those inverter-chains oscillate.
First symptom is a clock disturbance, but also seen are chip-wide crosstalk effects where unrelated logic misbehaves.
Careful Analog testing around crosstalk will be needed.

cgracey · 2018-10-09 16:54

jmg wrote: »

cgracey wrote: »

..... to set the center bias to around VIO/2, which is a simple inverter's logic threshold inside the converter (there's no faster high-gain amplifier than a few digital inverters connected in series)...

How many is a few ?
That detail worries me a little, as I’ve seen digital chips suffer transition oscillations, where slow edges spend enough time in the linear region to see those inverter-chains oscillate.
First symptom is a clock disturbance, but also seen are chip-wide crosstalk effects where unrelated logic misbehaves.
Careful Analog testing around crosstalk will be needed.

I'm not in front of my computer, but I think we use 4 digital inverters in series to make the ADC sense amp. We need enough gain to turn a few millivolts into a digital output.

tonyp12 · 2018-10-09 17:31

"So, the whole thing worked at 300MHz"

You're so close to "unofficially" supporting 250MHz then?
Can smart-pins do TMDS (Transition Minimized Differential Signaling) natively?
Can LUT do 8b/10b encoding with Running disparity?
Then it opens up HDMI and many other modern transfer protocols.

https://www.fpga4fun.com/HDMI.html

evanh · 2018-10-09 17:52

I'd call that more than close, it is unofficial.

jmg · 2018-10-09 18:43

tonyp12 wrote: »

Can smart-pins do TMDS (Transition Minimized Differential Signaling) natively?

No. That would be in the PAD ring customer area & likely outside the process.... maybe at 110nm ?

tonyp12 wrote: »

Can LUT do 8b/10b encoding with Running disparity?

No

cgracey · 2018-10-09 22:43

Tubular wrote: »

That table paste didn't work, here's a pic of the summary table for glitch depth, on a pin by pin basis

We just had a meeting with ON Semi about the respin.

One engineer there realized what the problem was, right away, with the I/O pin glitching. It's not a coupling problem. It's a signal-race problem.

Here's something none of us remembered that explains the whole thing: When DIR goes low, so does OUT, because OUT gets AND'd with DIR, right? So, we are seeing OUT transition before DIR in these glitch cases. Some timing constraints on the DIR and OUT signals will eliminate this problem.
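The race described above can be sketched as a toy timing model. This is only an illustration, not the actual P2 pad logic: the unit delays and the names `dir_delay` and `out_delay` are hypothetical, and the only behaviour taken from the post is that OUT is AND'd with DIR and that the glitch appears when the gated OUT change reaches the pad before the DIR change does.

```python
def pin_states(dir_delay, out_delay, steps=6, t_drop=2):
    """Return the pad state per time step: 'H', 'L', or 'Z' (floating).

    The DIR register drops from 1 to 0 at t_drop while the OUT register
    stays high. dir_delay and out_delay are hypothetical path delays
    (in time steps) from the cog to the pad.
    """
    def dir_reg(t):
        return 1 if t < t_drop else 0

    states = []
    for t in range(steps):
        dir_at_pad = dir_reg(t - dir_delay)
        # OUT is gated by DIR, so it falls when DIR falls (per the post).
        out_at_pad = 1 & dir_reg(t - out_delay)
        if not dir_at_pad:
            states.append("Z")
        else:
            states.append("H" if out_at_pad else "L")
    return "".join(states)


if __name__ == "__main__":
    print(pin_states(dir_delay=2, out_delay=1))  # OUT path faster: glitch
    print(pin_states(dir_delay=1, out_delay=1))  # matched paths: no glitch
```

In this model a brief driven-low step appears only when the OUT path is faster than the DIR path, which is the case that timing constraints on the two signals would rule out.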

ON Semi is going to come up with a time-frame in which they will address this. They would like us to test the current chip as much as possible before the respin.

We also talked about the 2-cog derivative part. I need to send them some info about it so that they can quote it.

Cluso99 · 2018-10-09 22:48

Excellent news Chip.
That glitch really bothered me.

Did you discuss the power issue???

jmg · 2018-10-09 23:05

cgracey wrote: »

Here's something none of us remembered that explains the whole thing: When DIR goes low, so does OUT, because OUT gets AND'd with DIR, right? ...

Why not simply remove the 'AND' - why is that cross coupling even in there ? DIR should control the direction only, nothing else ?

Tubular · 2018-10-09 23:24

Fine, but why don't we see the race condition on pins other than the JTAG pins? Or haven't we looked far enough yet, because we've primarily investigated the phantom pulldowns that TAQOZ lsio reports?

My belief/hypothesis, at this stage, is that the extra JTAG logic might be affecting things. But I need to look at those last 3 JTAG pins, and can look at other pins too, to be sure. If that's the case, whatever timing constraints they add need to include their JTAG test circuit. That's probably no big deal for them.

Yanomani · 2018-10-09 23:25

IIRC, it's because the DIR register output bits are used to block the influence of the corresponding OUT register bits on the state of the wired-OR connection among all Cogs' OUT bits, before they reach the Smart pin cells.

jmg · 2018-10-09 23:33

Yanomani wrote: »

IIRC, it's because the DIR register output bits are used to block the influence of the corresponding OUT register bits on the state of the wired-OR connection among all Cogs' OUT bits, before they reach the Smart pin cells.

.. but if you have just floated the pin, any internal node for OUT is a don't care. The pin is floating, not driven.

rogloh · 2018-10-09 23:33

Glad you are very confidently getting on top of the glitch IO stuff Chip. IO definitely needs to be reliable.

Perhaps this may not have been a key reason for your meeting with OnSemi, but do you think there is any more knowledge now of where all the power is being consumed in the P2 when COGs are idle, and whether or not there may be anything else "simple" and low risk that can be done in the next spin to address it without reducing the target frequency in any significant way?

For example if the inputs to a disabled COG can be somehow gated off or otherwise held in reset so its internal state doesn't change (so much) at the full frequency, does that reduce power? In an ASIC I am expecting a clocked FF in a COG whose inputs are unchanging won't draw as much power as one that needs to change state on each cycle. Similarly for combinational logic if the input is static it shouldn't draw a lot of power. Maybe some arbitrary bus data and other changing inputs are arriving at a disabled COG causing a whole lot of its internal logic and FFs to change even though the COG isn't doing anything useful at that time? Similarly with the egg beater model if the main RAM blocks are doing any speculative reads or outputting any other random data to a disabled COG from their buses, this may be causing lots of surplus transitions inside all the COG logic that depends on this data source. Maybe you could stop clocks for just a few key sources and stop a whole lot of other downstream logic that depends on it from transitioning. That's my basic thought anyway, though I'm sure it would have some downside if it involves critical paths. Perhaps the COG logic gets so complex that even one input bit changing somewhere cascades into a whole lot of transitions further down. Dunno.

Do you have any new feeling on this yet, or whether it might just be too difficult/risky to reduce the idle power of the 8 COG chip?

PS. one interesting thing to test is whether a P2 system whose HUB memory is filled with all zeroes draws much less power in an idle system than the case with random HUB data. If it does it might indicate that all the egg beater reads are somehow triggering lots of transitions in hub or COG logic, even when idle. Or maybe the memory reads stop completely and this shouldn't even be possible. I don't know.

jmg · 2018-10-09 23:46

rogloh wrote: »

Perhaps this may not have been a key reason for your meeting with OnSemi, but do you think there is any more knowledge now of where all the power is being consumed in the P2 when COGs are idle...

The power is consumed because all clock distribution, to all 88,000 FF's in P2, is currently always on.
That's why the Cpd for the device is a large base 1.56nF and increases by a small 80pF/COG

One of the more elegant features of P1, is how WAIT opcode drops Cpd down to a very low ~ 4pF, during that WAIT time.
It's a pity that did not make it into P2, Rev A. Especially as the static Icc has come in quite low, on the silicon.

rogloh · 2018-10-09 23:50

@jmg Are you sure it is just the clock net capacitance or is it because a whole lot of the 88000 FF outputs may be transitioning as well?

Yanomani · 2018-10-09 23:52

jmg wrote: »

.. but if you have just floated the pin, any internal node for OUT is a don't care. The pin is floating, not driven.

You have two ways to float a pin:

- Thru DIR register control, but then, all Cogs must agree, for the right direction and pin state to be set;

- Thru WRPIN's low-level pin control, which, when enabled, overrides DIR/OUT definitions (%HHH, %LLL fields).

jmg · 2018-10-10 00:11

rogloh wrote: »

@jmg Are you sure it is just the clock net capacitance or is it because a whole lot of the 88000 FF outputs may be transitioning as well?

The 80pF/COG portion of Cpd is the output-activity measure, and the 1.56nF is the base cost with the clocks active but all 8 COGs doing nothing.
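For a rough feel of what those Cpd figures mean in watts, here is a minimal sketch using the usual dynamic-power relation P = Cpd x Vdd^2 x f. The 1.56nF base and 80pF/COG values are from the posts above; the 1.8V core voltage and 180MHz clock are assumptions chosen for illustration, and the function name is hypothetical.

```python
def dynamic_power_w(active_cogs, vdd=1.8, f_hz=180e6,
                    base_cpd_f=1.56e-9, per_cog_cpd_f=80e-12):
    """Estimate dynamic power in watts from power-dissipation capacitance.

    base_cpd_f and per_cog_cpd_f are the Cpd figures quoted in the thread.
    vdd and f_hz are assumed operating conditions, not measured values.
    """
    cpd = base_cpd_f + active_cogs * per_cog_cpd_f
    return cpd * vdd ** 2 * f_hz


if __name__ == "__main__":
    for cogs in (0, 1, 8):
        print(f"{cogs} active COGs: {dynamic_power_w(cogs):.3f} W")
```

Under these assumptions the always-on base term accounts for roughly 70% of the total even with all 8 COGs active, which is why the number of running COGs makes comparatively little difference.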

jmg · 2018-10-10 00:14

Yanomani wrote: »

- Thru WRPIN's low-level pin control, which, when enabled, overrides DIR/OUT definitions (%HHH, %LLL fields).

Does that mean using WRPIN's different pathway, avoids this crosstalk effect entirely ?

Tubular · 2018-10-10 00:27

jmg wrote: »

Yanomani wrote: »

- Thru WRPIN's low-level pin control, which, when enabled, overrides DIR/OUT definitions (%HHH, %LLL fields).

Does that mean using WRPIN's different pathway, avoids this crosstalk effect entirely ?

There are a few possibilities. Using the HHH and LLL configuration bits, LLL could be set to float; then you only need to toggle the OUT signal, and DIR can be changed afterwards, once the pin is in LLL float, if necessary.

There are certainly workarounds available, some have a timing penalty (setting/changing smartpin mode), or a passive cost (eg an external capacitor).

Yanomani · 2018-10-10 00:31

jmg wrote: »

Does that mean using WRPIN's different pathway, avoids this crosstalk effect entirely ?

No, because the results induced by those two fields (%HHH, %LLL) are still sensitive to the state of the combined DIR/OUT control that emanates from the Cogs themselves, since DIR/OUT in fact control the logic state of each and every wire used to connect the Cogs' composed (wired-OR) outputs to the Smart pin cells.

Tubular · 2018-10-10 00:42

But there are 8 different drive options for each of the high and low legs of the pin. These include strong drive, current source/sink, resistors, and float. You can set the 'high side' driver to be strong (normal output high, about 19 ohms resistance), and the 'low side' driver to be float.

Once that is setup, you may not need to change the DIR signal, OUT would alternate between "Output High" (when out=H) and "float" (when OUT=L) .
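A minimal sketch of that idea, modelling only the two drive options named above ("strong" and "float"). The string names and the function are hypothetical stand-ins; the real %HHH/%LLL bit encodings are not shown here, and this has not been checked against silicon.

```python
def pin_level(out, high_drive="strong", low_drive="float"):
    """Return 'H', 'L', or 'Z' for a pin whose DIR stays enabled.

    high_drive applies when OUT=1 and low_drive applies when OUT=0.
    Only "strong" and "float" are modelled; the other drive options
    mentioned in the thread are left out of this sketch.
    """
    drive = high_drive if out else low_drive
    if drive == "float":
        return "Z"
    return "H" if out else "L"


if __name__ == "__main__":
    # Strong high side, floating low side: OUT alone selects drive or float.
    print([pin_level(out) for out in (1, 0, 1, 0)])
```

With this setup DIR never has to change, so the DIR/OUT ordering that causes the glitch would not come into play, assuming the drive selection behaves as described.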

Tubular · 2018-10-10 00:49

I hope to try this when I see OzProp on Friday, Yanomani. He's currently got the only silicon P2 in this town.

This all may not matter (ON Semi may respin P2 anyway), but it's always good to understand and test the available options, since pins are so fundamental. And sometimes that testing throws up other issues.

Peter Jakacki · 2018-10-10 08:53

I think I will install a small fan on my dev board under the P2D2 module to blow air onto the chip since the module is flipped. I find that I can run sustained 340MHz with the lid temperature of 37'C which really is quite cool.
EDIT: The loading on the USB serial power caused the +5V rail to drop to around 3V but with a direct 5V supply the fan runs faster and the P2 runs cooler so 350MHz sustained.

TAQOZ# fibos 
fibo(1) = 473 cycles = 1,391ns @340MHz result =1
fibo(6) = 793 cycles = 2,332ns @340MHz result =8
fibo(11) = 1,113 cycles = 3,273ns @340MHz result =89
fibo(16) = 1,433 cycles = 4,214ns @340MHz result =987
fibo(21) = 1,753 cycles = 5,155ns @340MHz result =10946
fibo(26) = 2,073 cycles = 6,097ns @340MHz result =121393
fibo(31) = 2,393 cycles = 7,038ns @340MHz result =1346269
fibo(36) = 2,713 cycles = 7,979ns @340MHz result =14930352
fibo(41) = 3,033 cycles = 8,920ns @340MHz result =165580141
fibo(46) = 3,353 cycles = 9,861ns @340MHz result =1836311903 ok
TAQOZ# lsio 
P:00000000001111111111222222222233333333334444444444555555555566
P:01234567890123456789012345678901234567890123456789012345678901
=:ddd~d~~~~~~ddd~ddd~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~h~h ok
TAQOZ#
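The timing column in the listing above can be cross-checked with a few lines of Python. This is a sketch fitted to the printed numbers only: the truncating cycles-to-nanoseconds conversion and the 409 + 64*n cycle count are inferred from the ten rows shown, not taken from the TAQOZ source.

```python
def cycles_to_ns(cycles, mhz=340):
    """Convert clock cycles to whole nanoseconds (truncated, as the listing appears to be)."""
    return cycles * 1000 // mhz


def fitted_cycles(n):
    """Cycle count fitted to the listing: a fixed 409 plus 64 cycles per iteration."""
    return 409 + 64 * n


def fibo(n):
    """Iterative Fibonacci with fibo(1) == 1, matching the listing's results."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    for n in range(1, 47, 5):
        c = fitted_cycles(n)
        print(f"fibo({n}) = {c} cycles = {cycles_to_ns(c)}ns result = {fibo(n)}")
```

The fit suggests each loop iteration costs 64 clocks, or about 188ns at 340MHz.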

cgracey · 2018-10-10 12:52

That is fantastic, Peter!

The chip is designed to run at 180MHz with 85C ambient temperature. With 2W of self-heating and a package Tja of 20C/W, that would bring junction temperature up to 125C, but for some reason, 150C at 1.62V at worst process corner was used as the condition at which 180MHz was guaranteed. We've got lots of headroom.
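The junction-temperature figure follows from the standard relation Tj = Ta + P x theta_JA. A small sketch of that arithmetic, using only the numbers quoted above (the function name is just for illustration):

```python
def junction_temp_c(ambient_c, power_w, theta_ja_c_per_w):
    """Junction temperature in C from ambient temperature, dissipated power,
    and junction-to-ambient thermal resistance."""
    return ambient_c + power_w * theta_ja_c_per_w


if __name__ == "__main__":
    tj = junction_temp_c(ambient_c=85, power_w=2, theta_ja_c_per_w=20)
    print(f"Tj = {tj}C, margin to the 150C timing condition = {150 - tj}C")
```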

Peter Jakacki · 2018-10-10 13:04

cgracey wrote: »

That is fantastic, Peter!

The chip is designed to run at 180MHz with 85C ambient temperature. With 2W of self-heating and a package Tja of 20C/W, that would bring junction temperature up to 125C, but for some reason, 150C at 1.62V at worst process corner was used as the condition at which 180MHz was guaranteed. We've got lots of headroom.

That's an understatement! I would think if we could go to 250MHz that we would have lots of headroom but I am quite comfortable now with the idea of running this continually at 340MHz and now that the fan has got close to 5V on it the lid sits at 32'C. Now, that is amazing. Well done Chip!

cgracey · 2018-10-10 13:14

It's ON's design methodology and some aggressive initial spec'ing.

I love how the timing path optimization puts most everything up against the wall, so that when failure occurs, it's systemic. It makes it easy to get near the edge, without going over.

cgracey · 2018-10-10 13:18

Peter, big question:

Do you have a flash chip that can be booted from with deterministic timing, so that we can verify that, indeed, the big PRNG comes up with different results each time after power up?

Cluso99 · 2018-10-10 13:22

340M * 8 / 2 = 1.36 Gips

Now if that were the P2HOT 340M * 8 = 2.72Gips
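The same arithmetic as a tiny sketch. The divide-by-2 is taken to mean 2 clocks per instruction, and the P2HOT line to mean 1 clock per instruction; both readings are inferred from the figures above rather than stated in the post.

```python
def peak_ips(clock_hz, cogs, clocks_per_instruction):
    """Peak aggregate instructions per second across all cogs."""
    return clock_hz * cogs // clocks_per_instruction


if __name__ == "__main__":
    print(peak_ips(340_000_000, 8, 2) / 1e9, "GIPS")  # current P2 at 340MHz
    print(peak_ips(340_000_000, 8, 1) / 1e9, "GIPS")  # the P2HOT comparison
```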

cgracey · 2018-10-10 13:26

Cluso99 wrote: »

340M * 8 / 2 = 1.36 Tflops

Now if that were the P2HOT 340M * 8 = 2.72Tflops

Not quite Tflops, but BIPS.
