New P2 Silicon Observations

evanh · 2018-10-23 22:38

Cluso99 wrote: »

As far as I can see, a simple logic change in Verilog as I suggested above, would fix this completely.
Add a one clock delay to DIR if it goes low which enables drive. Add a one clock delay to OUT if DIR goes high which tri-states the pin.

How many times does it have to be said that this already exists?

BTW P1 silicon is much slower so there probably was never a glitch. Don't forget it was hand laid so this logic was probably thought about.

Slower just means longer glitches too.

cgracey · 2018-10-23 22:45

thej wrote: »

cgracey wrote: »

Just got word from Wendy at OnSemi that if I introduce clock gating into my Verilog code, their synthesis tools will generate the appropriate structures and clock trees. Shouldn't cause a hiccup in the respin process.

I think the general consensus on the forum was in favour of highest possible performance.
How might this impact the potential top MHz ?

J

It shouldn't have any effect. Let me ask Wendy, just to be sure, though...

jmg · 2018-10-23 22:51

cgracey wrote: »

Just got word from Wendy at OnSemi that if I introduce clock gating into my Verilog code, their synthesis tools will generate the appropriate structures and clock trees. Shouldn't cause a hiccup in the respin process.

To do clock gating, a clock-qualifier in the form of a flop Q output would be AND'd with the clock signal which would have subsequent high fan-out. Seems like that flop and AND gate could hardly be early enough in the clock tree, considering the subsequent fanout to maybe 4,500 flops per cog.

I see the Cyclone V series mentions 16 Global Clock networks, so it should be possible to have 1+7 enabled clock zones.
Not sure how that migrates to ASIC tools.

There is info here about Cyclone V Clock gates (1 or 2 D-FF and AND gate )
https://www.intel.com/content/www/us/en/programmable/documentation/sam1403481100977/sam1403476916208.html

The Dual-D ff hardens it more for metastable, and could allow clkena to come later from the downstream clock zone, and thus have no SysCLK speed impact ?

Addit: The COG that is always on, would clone this delay path, (with its clkena == 1) to match PLL -> all CLK delays.
With all loads the same, and all PLL-> CLKs matched, there should be low between-cog skews.

Feedback of WAIT status might be possible, but I think that gets trickier.
The circuit that operates WAIT logic needs a non-gated clock, but the loads on that network are much lower, so there is a different load-skew effect to consider.

Q: What are the fan-out loads on the Smart Pin cells ?
because that needs delay matching anyway, maybe those can have a 9th enabled clk-zone, that users can set a single-bit on, should they need any Smart Pin feature ?
(This could default to ON, so users chasing lowest power would drop to 1 COG, then optionally disable Smart Pin cells)

cgracey · 2018-10-23 23:10

jmg wrote: »

cgracey wrote: »

Just got word from Wendy at OnSemi that if I introduce clock gating into my Verilog code, their synthesis tools will generate the appropriate structures and clock trees. Shouldn't cause a hiccup in the respin process.

To do clock gating, a clock-qualifier in the form of a flop Q output would be AND'd with the clock signal which would have subsequent high fan-out. Seems like that flop and AND gate could hardly be early enough in the clock tree, considering the subsequent fanout to maybe 4,500 flops per cog.

I see the Cyclone V series mentions 16 Global Clock networks, so it should be possible to have 1+7 enabled clock zones.
Not sure how that migrates to ASIC tools.

There is info here about Cyclone V Clock gates (1 or 2 D-FF and AND gate )
https://www.intel.com/content/www/us/en/programmable/documentation/sam1403481100977/sam1403476916208.html

The Dual-D ff hardens it more for metastable, and could allow clkena to come later from the downstream clock zone, and thus have no SysCLK speed impact ?

Addit: The COG that is always on, would clone this delay path, (with its clkena == 1) to match PLL -> all CLK delays.
With all loads the same, and all PLL-> CLKs matched, there should be low between-cog skews.

Feedback of WAIT status might be possible, but I think that gets trickier.
The circuit that operates WAIT logic needs a non-gated clock, but the loads on that network are much lower, so there is a different load-skew effect to consider.

Q: What are the fan-out loads on the Smart Pin cells ?
because that needs delay matching anyway, maybe those can have a 9th enabled zone, that users can set a single-bit on, should they need any Smart pin.

I think those dual D-FF flops allow for an extra clock cycle on turn-off, so that the clock-gated circuit can register the lack of its 'enable' signal on a final clock, thereby registering final shutdown states. I was trying to resolve how to handle this, and I think that's how it would be done. Otherwise, you'd need a very fast local reset tree to handle the shutdown. Better to be able to use a synchronously-registerable signal for that purpose. Just takes an extra clock.

jmg · 2018-10-23 23:40

cgracey wrote: »

I think those dual D-FF flops allow for an extra clock cycle on turn-off, so that the clock-gated circuit can register the lack of its 'enable' signal on a final clock, thereby registering final shutdown states. I was trying to resolve how to handle this, and I think that's how it would be done. Otherwise, you'd need a very fast local reset tree to handle the shutdown. Better to be able to use a synchronously-registerable signal for that purpose. Just takes an extra clock.

Yes, it allows you to treat the enable as more async in nature. Because it is used on the somewhat rare Start/Stop of COG, one added clock is not a big problem.

Altera says this about that 2nd D-FF
Cyclone® V devices have an additional metastability register that aids in asynchronous enable and disable of the GCLK and RCLK networks. You can optionally bypass this register in the Intel® Quartus® Prime software.
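
The extra-clock behaviour discussed above can be illustrated with a minimal cycle-level sketch. This is only an abstraction under stated assumptions, not the Cyclone V circuit or anything ON Semi's tools would generate: the function name `gated_clock` and the parameter `sync_stages` are hypothetical, and each list entry stands for one SysCLK cycle. The enable is shifted through one or two flops, and the output of the last flop is what gets AND'd with the clock.

```python
def gated_clock(enable_per_cycle, sync_stages=2):
    """Return 1 for each cycle in which the gated clock pulses.

    Hypothetical cycle-level model: the enable passes through
    `sync_stages` flops, and the last flop's output qualifies the clock.
    """
    sync = [0] * sync_stages          # synchronizer flops, all cleared
    pulses = []
    for en in enable_per_cycle:
        # The AND gate sees the value registered on earlier edges.
        pulses.append(sync[-1])
        # On this clock edge the enable shifts one flop further along.
        sync = [en] + sync[:-1]
    return pulses


enable = [0, 1, 1, 1, 0, 0, 0]
print(gated_clock(enable, sync_stages=2))
print(gated_clock(enable, sync_stages=1))
```

In this model the gated domain keeps receiving `sync_stages` pulses after the enable drops, which is the window in which it could register its final shutdown state synchronously.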

rogloh · 2018-10-24 02:16

cgracey wrote: »

thej wrote: »

cgracey wrote: »

Just got word from Wendy at OnSemi that if I introduce clock gating into my Verilog code, their synthesis tools will generate the appropriate structures and clock trees. Shouldn't cause a hiccup in the respin process.

I think the general consensus on the forum was in favour of highest possible performance.
How might this impact the potential top MHz ?

J

It shouldn't have any effect. Let me ask Wendy, just to be sure, though...

Wow it would be great to have the potential for both consuming less power and also still keeping the possibility of HDMI open, especially if it might end up being a much lower risk than initially thought to achieve it.

I'm only thinking of use cases for future portable/robotics applications where you might need some COGs at full speed (e.g. for driving an HDMI display - in fact that's actually a great example) while at the same time also not necessarily running everything at full tilt and chewing through your maximum power at all times while the system is operational just because you need to clock at 250MHz for the HDMI output to work. Some COGs could be happy to be idle or mostly waiting on something to wake them up or spawn them or just sleeping and polling external devices/sensors less frequently. This is something where the P2 as it stands right now may have some limitations in its power consumption scaling, at least that's my take. However I know others are far less worried about it and these application areas I mention may not be a key target / volume market for the P2, certainly until it gets some initial traction elsewhere and people start to really learn what is possible to do with it.

jmg · 2018-10-24 02:30

rogloh wrote: »

Wow it would be great to have the potential for both consuming less power and also still keeping the possibility of HDMI open, especially if it might end up being a much lower risk than initially thought to achieve it.

I'm only thinking of use cases for future portable/robotics applications where you might need some COGs at full speed (e.g. for driving an HDMI display - in fact that's actually a great example) while at the same time also not necessarily running everything at full tilt and chewing through your maximum power at all times while the system is operational just because you need to clock at 250MHz for the HDMI output to work. Some COGs could be happy to be idle or mostly waiting on something to wake them up or spawn them or just sleeping and polling external devices/sensors less frequently. This is something where the P2 as it stands right now may have some limitations in its power consumption scaling, at least that's my take.

Some variants/combinations of that would be possible, but not all.
A simple Clock-Zone with an enable per COG means the COG is either SysCLK operating, or off.
That means you could drop the power to 1/8th of a present P2 idle level, provided the main COG could manage the INIT/STOP house keeping.

In present P2, there is currently no scope to have differing SysCLK's per COG, tho that has been suggested before, via a ClkEnable & simple divider.
That would require some register to set the division ratio, which feeds a small reload-counter.
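
A rough sketch of the ClkEnable-plus-divider idea, assuming a per-COG ratio register feeding a small reload counter. The names `clock_enables` and `divide_ratio` are made up for illustration, and nothing like this exists in the present P2. The counter reloads whenever it reaches zero, and the COG's clock enable is asserted only on that cycle.

```python
def clock_enables(divide_ratio, n_cycles):
    """Return the per-SysCLK-cycle clock-enable pattern for one COG.

    Hypothetical model: `divide_ratio` is the value held in the ratio
    register; the enable is asserted on one cycle out of every
    `divide_ratio` cycles.
    """
    if divide_ratio < 1:
        raise ValueError("divide_ratio must be at least 1")
    counter = 0
    enables = []
    for _ in range(n_cycles):
        if counter == 0:
            enables.append(1)            # COG gets a clock this cycle
            counter = divide_ratio - 1   # reload from the ratio register
        else:
            enables.append(0)            # COG clock is held off
            counter -= 1
    return enables


print(clock_enables(4, 8))
```

With a ratio of 1 the enable is asserted on every cycle, so a COG configured that way would behave like a COG today.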

Ym2413a · 2018-10-24 03:31

250MHz combined with the multiply instruction has me excited.

KeithE · 2018-10-24 04:50

If you add clock gating:

0 - if this is about tools converting RTL enables "if (enable) begin ... end" to gated clocks then it's not so scary. But you may not save as much power as if you nip the clock tree closer to the root. If you have all of the clock generation already centralized then it's probably not too difficult to do that.

1 - consider adding C.Y.A. bit(s) to effectively disable it. (Even if it works you can use that to see how much power clock gating saves.)

2 - I would suggest adding some simulations to your flow. You can use them as input for dynamic power sims. Also without them how do you verify before tape-out if you only have FPGA emulation to verify the design? Unless you can emulate it in an FPGA. I'm doubtful, but perhaps it can be done for a partial design? I guess if you have C.Y.A. bit(s) you will hopefully not break anything, and perhaps you can use formal tools as an alternative to simulation. At the very least it should be easy enough to formally verify that the clock gated design is equivalent to a non-clock gated design when clock gating is disabled by the C.Y.A. bit.
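
The equivalence argument in point 2 can be sketched with a brute-force check. This is not a formal tool, only an exhaustive comparison over a tiny state space, and it assumes the gated flop keeps its functional enable while the clock gate is opened by either the enable or the C.Y.A. bit. All names (`ref_step`, `gated_step`, `gating_disabled`) are hypothetical.

```python
from itertools import product


def ref_step(q, d, en):
    """Reference flop: 'if (enable) q <= d', clocked every cycle."""
    return d if en else q


def gated_step(q, d, en, gating_disabled):
    """Clock-gated flop with a C.Y.A. bit that forces the gate open."""
    clock_pulse = en or gating_disabled
    if not clock_pulse:
        return q               # no clock edge, so the flop holds
    return d if en else q      # functional enable is assumed retained


def check_equivalence(gating_disabled, values=range(4)):
    """Compare both models for every (q, d, en) combination."""
    return all(
        ref_step(q, d, en) == gated_step(q, d, en, gating_disabled)
        for q, d, en in product(values, values, (False, True))
    )


print(check_equivalence(gating_disabled=True))
print(check_equivalence(gating_disabled=False))
```

Because both models hold a single register, matching next-state values for every input is enough to show they stay in step over any sequence of cycles.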

cgracey · 2018-10-24 05:05

KeithE wrote: »

If you add clock gating:

0 - if this is about tools converting RTL enables "if (enable) begin ... end" to gated clocks then it's not so scary. But you may not save as much power as if you nip the clock tree closer to the root. If you have all of the clock generation already centralized then it's probably not too difficult to do that.

1 - consider adding C.Y.A. bit(s) to effectively disable it. (Even if it works you can use that to see how much power clock gating saves.)

2 - I would suggest adding some simulations to your flow. You can use them as input for dynamic power sims. Also without them how do you verify before tape-out if you only have FPGA emulation to verify the design? Unless you can emulate it in an FPGA. I'm doubtful, but perhaps it can be done for a partial design? I guess if you have C.Y.A. bit(s) you will hopefully not break anything, and perhaps you can use formal tools as an alternative to simulation. At the very least it should be easy enough to formally verify that the clock gated design is equivalent to a non-clock gated design when clock gating is disabled by the C.Y.A. bit.

Good ideas, KeithE.

Yes, controlling the clock at the highest levels is key to real dynamic power saving. The cogs are obvious, but I'm thinking that I'll start with the stages in the CORDIC pipeline. That's got as many flops as two cogs and its nature is very simple.

We do run gate-level simulations from the synthesized netlist before tapeout.

I like the idea of a bit to enable clock gating.

I'm going to see if I can at least get the CORDIC clock-gated, even though there won't be any power savings in the FPGA. I just need to see that it works.

jmg · 2018-10-24 05:15

cgracey wrote: »

.... The cogs are obvious, but I'm thinking that I'll start with the stages in the CORDIC pipeline. That's got as many flops as two cogs and its nature is very simple..

How many flops are in the smart pin cells, relative to 1 COG ?

cgracey wrote: »

I'm going to see if I can at least get the CORDIC clock-gated, even though there won't be any power savings in the FPGA. I just need to see that it works.

Do you mean a clock-gate that is on-demand based (enables clock when Cordic data is being crunched), or clock-gate via a peripheral boolean that the user sets, as a simpler global on/off ?
The first approach has more power saving, but will more likely affect timing, and needs a live 'cordic needed/busy' information extract.

cgracey · 2018-10-24 05:48

jmg wrote: »

cgracey wrote: »

.... The cogs are obvious, but I'm thinking that I'll start with the stages in the CORDIC pipeline. That's got as many flops as two cogs and its nature is very simple..

How many flops are in the smart pin cells, relative to 1 COG ?

cgracey wrote: »

I'm going to see if I can at least get the CORDIC clock-gated, even though there won't be any power savings in the FPGA. I just need to see that it works.

Do you mean a clock-gate that is on-demand based (enables clock when Cordic data is being crunched), or clock-gate via a peripheral boolean that the user sets, as a simpler global on/off ?
The first approach has more power saving, but will more likely affect timing, and needs a live 'cordic needed/busy' information extract.

Each smart pin has 290 flops and needs 490 ALMs.

A cog has 4300 flops and needs 4800 ALMs.

The CORDIC has 7400 flops and needs 9100 ALMs.

These three circuits are prime candidates for clock gating.
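
For a sense of scale, the per-block figures above can be totalled. The instance counts are assumptions not stated in this post: 8 cogs and 64 smart pins, with a single CORDIC. The dictionary and function names are only illustrative.

```python
# (flops, ALMs) per instance, taken from the figures quoted above.
PER_INSTANCE = {
    "smart pin": (290, 490),
    "cog": (4300, 4800),
    "CORDIC": (7400, 9100),
}

# Assumed instance counts: 64 smart pins, 8 cogs, 1 CORDIC.
INSTANCES = {"smart pin": 64, "cog": 8, "CORDIC": 1}


def totals():
    """Return {block: (total flops, total ALMs)} under the assumed counts."""
    return {
        name: (flops * INSTANCES[name], alms * INSTANCES[name])
        for name, (flops, alms) in PER_INSTANCE.items()
    }


all_flops = sum(flops for flops, _ in totals().values())
for name, (flops, alms) in totals().items():
    print(f"{name}: {flops} flops, {alms} ALMs, {flops / all_flops:.1%} of flops")
print("total flops:", all_flops)
```

Under those assumed counts the three blocks add up to 60,360 flops, with the cogs as the largest share.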

I will first try on-demand clock gating in the CORDIC, so that if that stage is not being used, it receives no clock pulse.
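
A minimal sketch of what on-demand gating could look like for a pipeline, assuming each stage's data flops are clocked only when valid data arrives from the stage before it. The stage count and the name `count_stage_clocks` are hypothetical, and the valid flags themselves are assumed to sit on the ungated clock.

```python
def count_stage_clocks(input_valid, n_stages):
    """Return (gated, ungated) stage-clock pulse counts.

    `input_valid` marks the cycles on which a new operation enters the
    pipeline. Extra idle cycles are appended so the pipeline can drain.
    """
    valid = [False] * n_stages                 # valid flag per stage
    stream = list(input_valid) + [False] * n_stages
    gated = 0
    for incoming in stream:
        # What each stage would capture on this edge.
        upstream = [incoming] + valid[:-1]
        # A stage only needs a clock pulse when data is arriving.
        gated += sum(upstream)
        valid = upstream
    ungated = n_stages * len(stream)           # every stage, every cycle
    return gated, ungated


# One operation followed by nine idle cycles, four-stage pipeline.
print(count_stage_clocks([True] + [False] * 9, n_stages=4))
```

In this model an idle pipeline receives no stage clocks at all, while a fully busy one receives nearly as many as the ungated design.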

cgracey · 2018-10-24 06:46

KeithE, do you know if clock-gating introduces much of a ripple into scan testing? Or, anyone? Jmg?

KeithE · 2018-10-25 03:52

cgracey wrote: »

KeithE, do you know if clock-gating introduces much of a ripple into scan testing? Or, anyone? Jmg?

I haven't heard of any issues, but I wasn't the one inserting the scan logic. Just fixing issues found when it was inserted. The simplest approach would probably be to effectively disable it in test mode. And the "enable" itself should be observable. I would ask OnSemi about their experiences. Make sure that they sound confident and that all of this is old hat for them.

cgracey · 2018-10-25 04:00

For ON Semi, I think they can do anything. They've done it so many times that it immediately translates to a dollars and sense issue.

Jonathan · 2018-10-25 04:32

David Betz wrote: »

cgracey wrote: »

David Betz wrote: »

cgracey wrote: »

Tubular wrote: »

Are the multipliers in the colorspace converter also close to the critical path, like the cog MUL instructions?

That's what is hard to get through people's heads. Maybe 1/4 of all paths are "critical", in the sense that they were optimized up to the goal line and no further. So, the whole chip fails at the same frequency, but that frequency is maybe 280MHz for this batch of chips, running at room temperature.

In the case of this IQ modulator, it is a logic problem.

Is there a workaround or does this require a spin?

A respin.

The chip we have seems to do everything else, though.

I still need to dig into why the P59 resistor is required.

If that's the only problem it's not that bad. I suppose people will want NTSC video though. However, I'm perfectly happy to have a chip that can't do it!

We need NTSC! Otherwise I won't be able to use the P2 with my 1958 Philco Predicta television! :-)

cgracey · 2018-10-25 04:44

Isn't it kind of funny how color TV images can be sent over 1 wire, but all this newer stuff takes more and more wires to work?

When I was maybe eight years old, my parents took us to San Francisco for the day and there was some burly guy out in the street in a wizard robe yelling, "Burn the covered wagon! Burn the covered wagon!", like he was mad that the covered wagons were all gone. We all thought he was crazy.

potatohead · 2018-10-25 04:49

Yeah it is.

Still in wide use too.

My second favorite is the component signal. It's only three wires, and works at a variety of horizontal sweep frequencies, including the nice, slow TV ones. And if one wants monochrome, it's only one wire, good from TV through 1080p HDTV.

It isn't quite like covered wagons, thankfully.

More wires or not, I am interested in exploring the new HDMI capability. Wondering what tricks can be played.

Roy Eltham · 2018-10-25 04:51

For 1080p60 it's ~27 times more data per frame. That's why...
And in 4K 60 it's ~108 times more data.

One wire isn't enough really, especially over many feet.
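
The ratios above work out exactly if the baseline is taken as a 320x240 frame. That baseline is an assumption, since the post does not say what it is comparing against; it is used here only because it reproduces the quoted figures.

```python
def pixels(width, height):
    """Pixel count of one frame."""
    return width * height


# Assumed baseline: a 320x240 frame (not stated in the post).
baseline = pixels(320, 240)

ratio_1080p = pixels(1920, 1080) / baseline
ratio_4k = pixels(3840, 2160) / baseline

print(ratio_1080p, ratio_4k)
```

This compares pixels per frame only; color depth and frame rate would scale the data rate further.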

potatohead · 2018-10-25 04:53

Totally, but I will say component, three wire signaling, can deliver comparable quality up through 1080p. (TVs are not supposed to support that signal as progressive, but most do.)

Roy Eltham · 2018-10-25 05:07

That would be 1080p 30 or 24, most likely, so half the data or less.

cgracey · 2018-10-25 05:10

Roy Eltham wrote: »

That would be 1080p 30 or 24, most likely, so half the data or less.

I ran my HDTV at 1080p 60Hz via 3-wire Y-Pb-Pr when working on P2-Hot.

potatohead · 2018-10-25 05:14

Edit: Chip beat me. And one wire delivers an awesome monochrome signal at that resolution. It's spiffy.

Actually, component tops out at 1080p @ 60 Hz. Component signals go that high. I've got a player that does it. It is very hard to tell the difference between HDMI and that signal. The player outputs both, and I've a Samsung plasma TV that takes them both.

The most common spec out there is 1080i, and the movie studios pushed for that, and pushed for HDMI with the encryption as the peak 1080p signal. Most manufacturers quietly ignored that, doing 1080p over component. Early on, some displays would not take it. Today, I would be surprised to find one that doesn't.

Roy Eltham · 2018-10-25 05:25

Interesting, back when I used component I could not get a good signal really above 720p, 1080i worked but wasn't really any better than 720p.
I suspect you may not be getting nearly the same quality at 1080p 60 (or even 30) as you would over an HDMI cable.

Perhaps you all need better TVs that actually show the difference between 720 and 1080 :P

potatohead · 2018-10-25 05:49

The plasma I have is a pretty great tv...

It's also dang fast! 3D capable. I think it tops out at 120Hz or something.

A whole lot depends on the processor in the TV, IMHO.

Cluso99 · 2018-10-25 06:43

Roy Eltham wrote: »

Interesting, back when I used component I could not get a good signal really above 720p, 1080i worked but wasn't really any better than 720p.
I suspect you may not be getting nearly the same quality at 1080p 60 (or even 30) as you would over an HDMI cable.

Perhaps you all need better TVs that actually show the difference between 720 and 1080 :P

Or glasses

Cluso99 · 2018-11-06 04:02

I just realised that unconnected pins read as "1" whereas IIRC the P1 unconnected pins read as "0".
Am I missing something or is this meant to be ???

cgracey · 2018-11-06 04:26

Then there must be some slight leakage to VIO, probably in the nano amp range.

jmg · 2018-11-06 05:11

cgracey wrote: »

Then there must be some slight leakage to VIO, probably in the nano amp range.

Does the boot loader exit with a light pullup as default on all pins, or do all pins float ?

cgracey · 2018-11-06 14:33

jmg wrote: »

cgracey wrote: »

Then there must be some slight leakage to VIO, probably in the nano amp range.

Does the boot loader exit with a light pullup as default on all pins, or do all pins float ?

The pins are floating.
