New P2 Silicon Observations

cgracey · 2018-09-28 20:39

I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.

Here is the MUL test code:

dat
		orgh	0
'
' Launch all cogs with test program.
'
		org

		hubset	##%1_000000_0000001101_1111_10_00	'enable crystal+PLL, stay in 20MHz+ mode
		waitx	##20_000_000/100			'wait ~10ms for crystal+PLL to stabilize
		hubset	##%1_000000_0000001101_1111_10_11	'now switch to PLL running at 280MHz (20MHz / 1 * 14 / 1)


.loop		coginit	cognum,#@pgm	'last iteration relaunches cog 0
		djnf	cognum,#.loop

cognum		long	7
'
' Do multiply test and indicate success on pin(cogid)
'
		org

pgm		cogid	x		'which cog am I, 0..7?

.loop		altr	yp
		mul	a0,a1
		cmp	ap,y	wz
		drvz	x

		altr	yp
		mul	b0,b1
		cmp	bp,y	wz
		drvz	x

		jmp	#.loop


a0		long	$B9A3
a1		long	$5EBD
ap		long	$B9A3 * $5EBD

b0		long	$CB9A
b1		long	$49BF
bp		long	$CB9A * $49BF

yp		long	y

x		res	1
y		res	1

jmg · 2018-09-28 21:08

cgracey wrote: »

I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.

Always nice when the bench lines up with the sims

cgracey wrote: »

Here is the MUL test code:

Can you do a MUL/DIV suite, with rolling values, and toggle one pin on PASS and another pin on FAIL ?

cgracey · 2018-09-28 21:19

jmg wrote: »

cgracey wrote: »

I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.

Always nice when the bench lines up with the sims

cgracey wrote: »

Here is the MUL test code:

Can you do a MUL/DIV suite, with rolling values, and toggle one pin on PASS and another pin on FAIL ?

What does that look like?

Yanomani · 2018-09-28 21:27

cgracey wrote: »

I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.

On the test setup you have in hand, were all the ceramic decoupling caps installed, especially the ones (on both Vio and Vdd) that feed the heavily filtered internal voltage regulator for the VCO/PLL?

Have you devised any means of verifying the VCO's actual jitter figures?

cgracey · 2018-09-28 21:30

thej wrote: »

cgracey wrote: »

Sure, but 280MHz will be very iffy. Just getting a part with slow process corner characteristics would likely dash that expectation. To be safe, stick with the 180MHz. Maybe 200MHz is reasonable for indoor environments.

Regardless of the MHz, could RCFAST be used as a dynamic low power mode?

j

Yes, but there's a 20KHz slow RC mode, too, that really drops the power.

cgracey · 2018-09-28 21:32

Wendy, if you are reading this, join the forum!

jmg · 2018-09-28 21:33

cgracey wrote: »

Yes, but there's a 20KHz slow RC mode, too, that really drops the power.

What is Icc at 20kHz, and how does that frequency change as the part cools down?

jmg · 2018-09-28 21:43

cgracey wrote: »

What does that look like?

I'd run an NCO-type adder that adds a fixed step on each loop (e.g. 2^32/111 for a 111-loop test period, i.e. it scans the number range in 111 loops, or 2^16/111 for 16-bit math ops).
Then use the CORDIC mul/div and remainder to check the core MUL; you hope they fail differently, but give the same answer when they work.

K2 · 2018-09-28 21:46

Thanks Chip for keeping us informed. All of this is wonderful information!

ersmith · 2018-09-28 21:56

jmg wrote: »

cgracey wrote: »

What does that look like?

I'd run an NCO-type adder that adds a fixed step on each loop (e.g. 2^32/111 for a 111-loop test period, i.e. it scans the number range in 111 loops, or 2^16/111 for 16-bit math ops).
Then use the CORDIC mul/div and remainder to check the core MUL; you hope they fail differently, but give the same answer when they work.

Can you write that in FreeBASIC, and we can translate it to PASM2?

jmg · 2018-09-28 22:17

ersmith wrote: »

Can you write that in FreeBASIC, and we can translate it to PASM2?

In CCalc, these test sweeps:
Av=round(2^16/111) = 590 example adder value
Mv16=0
Mv16=Mv16+Av;Mr32=Mv16*Av;Res = Mr32/Mv16

For a 16*16 Mul and a 32b div, this Result/Value = 590 (Av) in all cases, except when Mv+Av wraps.
Or, maybe for better coverage:
Mv16=0
Mv16=(Mv16+Av)%2^16;Mr32=(Mv16*(Mv16-Av))%2^32;Res = Mr32/Mv16/Av
This gives Res = 0..110 as it sweeps through the number range. It looks to need a reset at 110, as it does not wrap nicely, at least in my CCalc test.
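
As a rough illustration of the first sweep above, here is a minimal Python sketch (not PASM2, and not something posted in the thread). The names `AV` and `sweep` are made up for this example, and ordinary integer arithmetic stands in for both the core MUL and the CORDIC divide, so it only shows the bookkeeping of the rolling operand and the pass/fail check, not a real silicon test.

```python
AV = round(2**16 / 111)  # 590, the example adder value from the post


def sweep(steps=111, av=AV):
    """Return the operand values where multiply-then-divide fails to give av back.

    Hypothetical model only: on the chip, the multiply would be the core MUL
    under test and the divide would come from the CORDIC, so the two could
    disagree. Here both are exact, so the list should stay empty.
    """
    failures = []
    mv16 = 0
    for _ in range(steps):
        mv16 = (mv16 + av) & 0xFFFF          # 16-bit rolling operand (NCO-style adder)
        mr32 = (mv16 * av) & 0xFFFFFFFF      # stand-in for the 16x16 multiply
        if mv16 == 0 or mr32 // mv16 != av:  # stand-in for the 32-bit divide check
            failures.append(mv16)
    return failures


if __name__ == "__main__":
    bad = sweep()
    print("PASS" if not bad else f"FAIL at {bad}")
```

In a real test the PASS and FAIL branches would each toggle their own pin, as requested earlier in the thread.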

cgracey · 2018-09-28 22:38

I made a bunch of current measurements of 0..7 cogs running from 20MHz..220MHz:

https://docs.google.com/spreadsheets/d/1maeX7_tFKshFVPXdt9scWupgoN61MaEniNhF2_901l4/edit?usp=sharing

Maybe one of you superdogs who knows how to work the Google can turn this into a graph?

Rayman · 2018-09-28 22:46

If I were at work I could do it nicely in Origin...
But, here's a quick Excel plot.

This can obviously be done better...

cgracey · 2018-09-28 22:53

Thanks, Rayman.

You can see that the hub is what's burning the power. It must be that eggbeater arrangement. The cogs only add a little to the power consumption. I can make them run more complex code, but it doesn't change much. It almost seems like the big hub RAMs are all firing on every clock, doing reads when neither reads nor writes are called for. I've asked Wendy about this.

Cluso99 · 2018-09-28 23:32

Fantastic results Chip. 192MHz is the USB sweet spot.

BTW 200MHz was my hope and you seem to have exceeded that easily.

I always expected the egg-beater to be a power hog. With all 8 cores running hubexec and in a loop causing continuous Hub reads, you are reading the whole 8 * 64KB blocks on every clock. I'll bet power will be significantly reduced if the cores run a core loop. As soon as I get time I'll write a test for this.

If you get a chance, I'd like to see the results without smart pins running. Will let us see the cores power usage. Similarly without Cordic/multiply.

Yanomani · 2018-09-28 23:40

cgracey wrote: »

You can see that the hub is what's burning the power. It must be that eggbeater arrangement.

One way I could devise to impose an almost "silent mode" on the Hub circuits is to ensure that EACH and EVERY one of the eight Cogs does a single group of Hub-related operations (e.g., RDLONG) targeting the same address (0000, perhaps?), or even a WRLONG followed by a RDLONG, using two options for the value of the written data (0000, then FFFF) and two options for the target address (0000, then the last accessible long in the memory map), before settling into a one-instruction, infinite loop of Cog code, without any further Hub-related access or reference.

That way, the contents of all the egg-beater address/data input latches will reflect the last address/data from each Cog-Hub transaction (and will remain steady from that point on), as will the address/data muxes at the Hub RAM. They would always see the same values while rotating, giving each Cog the opportunity for a new access, which will never come until Reset is applied.

You could also dedicate eight pins, all cleared by the Cog that launches the others, then set to 1 individually at the end of each Cog's routine, just before it enters its own infinite loop and goes quiet.

That way you could verify that each and every Cog has accomplished its task and is standing still, from a Hub-access standpoint, until Reset is applied.

Hope it helps

Henrique

cgracey · 2018-09-28 23:42

Cluso99 wrote: »

Fantastic results Chip. 192MHz is the USB sweet spot.

BTW 200MHz was my hope and you seem to have exceeded that easily.

I always expected the egg-beater to be a power hog. With all 8 cores running hubexec and in a loop causing continuous Hub reads, you are reading the whole 8 * 64KB blocks on every clock. I'll bet power will be significantly reduced if the cores run a core loop. As soon as I get time I'll write a test for this.

If you get a chance, I'd like to see the results without smart pins running. Will let us see the cores power usage. Similarly without Cordic/multiply.

CORDIC hardly adds anything. Same with hubexec. It seems to me that those hub RAMs are firing on every clock for some reason, and not just when we need them to. You can see that at boot up, running 25MHz, the chip takes 77mA. Only 4mA of that is cog 0. The rest is just the clock tree and maybe those big RAMs, though I think Wendy said she saw their ENA inputs toggling, as expected. It's just weird that no matter how strenuously I make the logic perform, it only adds a little current to the base draw at that frequency. Maybe tons of power is just being spent on the clock tree and idle flops. That would make sense. Nothing I do, though, affects power much. It's 75% a function of frequency.

Wendy is going to get us a hierarchical breakdown of power consumption on Monday, I assume. That might indicate something.

Bean · 2018-09-29 00:21

Chip, I was hoping for 200MHz, but 250MHz (room temp) would be great.
I'd like to know what the RCSLOW current is (if RCSLOW is even an option).

Bean

jmg · 2018-09-29 01:20

cgracey wrote: »

CORDIC hardly adds anything. Same with hubexec. It seems to me that those hub RAMs are firing on every clock for some reason, and not just when we need them to. You can see that at boot up, running 25MHz, the chip takes 77mA. Only 4mA of that is cog 0. The rest is just the clock tree and maybe those big RAMs, though I think Wendy said she saw their ENA inputs toggling, as expected. It's just weird that no matter how strenuously I make the logic perform, it only adds a little current to the base draw at that frequency. Maybe tons of power is just being spent on the clock tree and idle flops. That would make sense. Nothing I do, though, affects power much. It's 75% a function of frequency.

Wendy is going to get us a hierarchical breakdown of power consumption on Monday, I assume. That might indicate something.

Interesting data: very MHz dependent, and largely COG independent.

With tight timing, and this much logic, I would expect the clock trees to be hogs.

A useful test would be to idle 8 cogs in a mostly-WAIT loop (e.g. toggle pins at 100, 110, 120..180ms); that leaves just the clock trees.

It looks like COGINIT or STOP does not enable/disable the clock tree for that COG?

Cluso99 · 2018-09-29 01:39

I am presuming the hub access is every 2 clocks, so the ENA should toggle every 2nd clock if it's NOT being deselected when no cog/core access is required. Wonder if the ENA is not being deselected when no access is required. That could account for little current variation whether hub access or not.

kwinn · 2018-09-29 05:19

Here is another view of the data. Certainly shows that frequency has a much greater effect on current draw than the number of running cogs.

cgracey · 2018-09-29 05:38

Cluso99 wrote: »

I am presuming the hub access is every 2 clocks, so the ENA should toggle every 2nd clock if it's NOT being deselected when no cog/core access is required. Wonder if the ENA is not being deselected when no access is required. That could account for little current variation whether hub access or not.

Hub access is on every clock. There are no "multi-cycle" paths in this design. Everything resolves in one clock.

jmg · 2018-09-29 05:52

And here are the equivalent Cpd values, from the CMOS logic formulae:

 power dissipation capacitance
 Pd = Cpd x Vcc^2 x Fi
 Cpd=(Icc*Vcc)/(Vcc*Vcc*Fi)

 Fi=20M;Icc=64m;   Cpd = 1.777nF
 Fi=60M;Icc=182m;  Cpd = 1.685nF
 Fi=100M;Icc=301m; Cpd = 1.672nF
 Fi=140M;Icc=426m; Cpd = 1.690nF
 Fi=180M;Icc=534m; Cpd = 1.648nF
 Fi=220M;Icc=638m; Cpd = 1.611nF
then Enable 7 more COGs 
 Fi=220M;Icc=869m; Cpd = 2.194nF

 (Cpd-1.611e-9)/7 = adds 83.333pF/COG

This shows the clock tree is quite significant (that's the Cg of a moderate power MOSFET), while the incremental addition per extra started COG is more modest, at 83.33pF/COG.

The values above have a downward slope in Cpd because some static Icc was not allowed for, so we can iterate to find an equivalent offset value Iq that gives a nominally flat Cpd:

 Iq=5m  Cpd=((Icc-Iq)*Vcc)/(Vcc*Vcc*Fi)

 Fi=20M;Icc=64m;   Cpd = 1.638nF
 Fi=60M;Icc=182m;  Cpd = 1.638nF
 Fi=100M;Icc=301m; Cpd = 1.644nF
 Fi=220M;Icc=638m; Cpd = 1.598nF

ie Iq ~ 5mA and Cpd ~ 1.6nF, for P2
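
For anyone who wants to rerun the arithmetic above, here is a minimal Python sketch. The supply value `VCC = 1.8` is an assumption: it is not stated in the post, but it is the value that reproduces the posted Cpd figures. The names `cpd`, `BASELINE` and `ALL_COGS_220M` are made up for this example.

```python
VCC = 1.8  # volts; ASSUMED supply, chosen because it reproduces the posted Cpd figures

# (clock frequency in Hz, measured Icc in A), before the 7 extra cogs are enabled
BASELINE = [(20e6, 0.064), (60e6, 0.182), (100e6, 0.301),
            (140e6, 0.426), (180e6, 0.534), (220e6, 0.638)]
ALL_COGS_220M = 0.869  # Icc in A at 220MHz with 7 more cogs enabled


def cpd(icc, f, iq=0.0, vcc=VCC):
    """Equivalent power dissipation capacitance in farads.

    From Pd = Cpd * Vcc^2 * Fi and Pd = (Icc - Iq) * Vcc,
    Cpd = (Icc - Iq) / (Vcc * Fi).
    """
    return (icc - iq) / (vcc * f)


if __name__ == "__main__":
    for f, icc in BASELINE:
        print(f"{f / 1e6:5.0f} MHz  Cpd = {cpd(icc, f) * 1e9:.3f} nF  "
              f"with Iq = 5mA: {cpd(icc, f, iq=0.005) * 1e9:.3f} nF")
    per_cog = (cpd(ALL_COGS_220M, 220e6) - cpd(0.638, 220e6)) / 7
    print(f"added per extra cog: {per_cog * 1e12:.1f} pF")
```

Trying a few `iq` values in the loop is a quick way to repeat the "iterate until Cpd is flat" step by hand.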

Beau Schwabe · 2018-09-29 06:08

"Hub access is on every clock. There are no "multi-cycle" paths in this design. Everything resolves in one clock." - So what is the clock latency from one side of the die to the other? Depending on how the memories and everything else is spread out over the entire die, the clock latency could be an issue and it might make sense with what you are observing so far. I ran the Avanti Place and Route tool for the DP83865 and to get timing just right the tool literally ran for days through several iterations. If the timing was off by as much as a couple of ns clock sensitive areas, specifically memories became very unstable.

Congrats getting this far on P2 BTW

cgracey · 2018-09-29 07:35

Jmg, there are 88k flops on the die. No clock-gating was done at the Verilog level. The synthesis tools, though, might have implemented something. Also, I think I remember seeing something about a flop being clocked, but without its data input changing, resulting in lowest power, which makes sense, but implies no clock gating.

cgracey · 2018-09-29 07:39

Beau Schwabe wrote: »

"Hub access is on every clock. There are no "multi-cycle" paths in this design. Everything resolves in one clock." - So what is the clock latency from one side of the die to the other? Depending on how the memories and everything else is spread out over the entire die, the clock latency could be an issue and it might make sense with what you are observing so far. I ran the Avanti Place and Route tool for the DP83865 and to get timing just right the tool literally ran for days through several iterations. If the timing was off by as much as a couple of ns clock sensitive areas, specifically memories became very unstable.

Congrats getting this far on P2 BTW

The clock tree was synthesized with a jitter budget that ultimately afforded a clock uncertainty of 300ps.

Thanks, Beau.

cgracey · 2018-09-29 08:59

The DACs seem to be working well. Here is a 438KB 5:6:5 RGB image in RAM being displayed via VGA.

P0 = HSYNC, 123-ohm 3.3V DAC, pixel-locked to color signals on P1/P2/P3
P1 = Blue, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
P2 = Green, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
P3 = Red, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
P4 = VSYNC, digital output

Here is the code. Note that the PLL is using the biggest XI divide possible (64) and a big VCO divide of 512, with a post-VCO divide of 2. This results in 20MHz / 64 * 512 / 2 = 80MHz, but with the slowest/weakest PLL feedback possible, and the image shows no jitter, at all:

'*************************************
'*  VGA 640 x 480 x 16bpp 5:6:5 RGB  *
'*************************************

CON

  intensity	= 80	'0..128

  fclk		= 80_000_000.0
  fpix		= 25_000_000.0
  fset		= (fpix / fclk * 2.0) * float($4000_0000)

  vsync		=	4	'vsync pin

DAT		org
'
'
' Setup
'
		hubset	##%1_111111_0111111111_0000_10_00	'enable crystal+PLL, stay in 20MHz+ mode
		waitx	##20_000_000/100			'wait ~10ms for crystal+PLL to stabilize
		hubset	##%1_111111_0111111111_0000_10_11	'now switch to PLL running at 80MHz

		rdfast	##640*350*2/64,##$1000	'set rdfast to wrap on bitmap

		setxfrq ##round(fset)		'set transfer frequency to 25MHz

		'the next 4 lines may be commented out to bypass signal level scaling

		setcy	##intensity << 24	'r	set colorspace for rgb
		setci	##intensity << 16	'g
		setcq	##intensity << 08	'b
		setcmod	#%01_0_000_0		'enable colorspace conversion (may be commented out)

		wrpin	dacmode_s,#0		'enable dac modes in pins 0..3
		wrpin	dacmode_c,#1
		wrpin	dacmode_c,#2
		wrpin	dacmode_c,#3
		setnib	dira,#%1111,#0
'
'
' Field loop
'
field		mov	x,#90			'top blanks
		call	#blank

		mov	x,#350			'set visible lines
line		call	#hsync			'do horizontal sync
		xcont	m_rf,#1			'visible line
		djnz	x,#line			'another line?

		mov	x,#83			'bottom blanks
		call	#blank

		drvnot	#vsync			'sync on

		mov	x,#2			'sync blanks
		call	#blank

		drvnot	#vsync			'sync off

		jmp	#field			'loop
'
'
' Subroutines
'
blank		call	#hsync			'blank lines
		xcont	m_vi,#0
	_ret_	djnz	x,#blank

hsync		xcont	m_bs,#0			'horizontal sync
		xzero	m_sn,#1
	_ret_	xcont	m_bv,#0
'
'
' Initialized data
'
dacmode_s		long	%0000_0000_000_1011000000000_01_00000_0		'hsync is 123-ohm, 3.3V
dacmode_c		long	%0000_0000_000_1011100000000_01_00000_0		'R/G/B are 75-ohm, 2.0V

m_bs		long	$CF000000+16		'before sync
m_sn		long	$CF000000+96		'sync
m_bv		long	$CF000000+48		'before visible
m_vi		long	$CF000000+640		'visible

m_rf		long	$2F000000+640		'visible rfword rgb16 (5:6:5)

x		res	1
y		res	1
'
'
' Bitmap
'
		orgh	$1000 - 70		'justify pixels at $1000
		file	"birds_16bpp.bmp"	'rayman's picture (640 x 350)
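
The clock and timing numbers in this listing can be cross-checked with a minimal Python sketch. The bit positions used in `pll_hz` are an assumption read off the binary literal in the listing (PLL enable, 6-bit XI divide, 10-bit VCO multiply, 4-bit post-divide, then the two 2-bit mode fields), and treating a post-divide field of %1111 as divide-by-1 is likewise an assumption. All Python names here are made up for the example.

```python
XTAL_HZ = 20_000_000  # crystal frequency given in the post


def pll_hz(hubset_value, xtal_hz=XTAL_HZ):
    """Decode a HUBSET clock value into the resulting PLL output frequency.

    ASSUMED layout, most significant bit first: E_DDDDDD_MMMMMMMMMM_PPPP_CC_SS.
    """
    xi_div = ((hubset_value >> 18) & 0x3F) + 1      # DDDDDD + 1, range 1..64
    vco_mul = ((hubset_value >> 8) & 0x3FF) + 1     # MMMMMMMMMM + 1, range 1..1024
    pppp = (hubset_value >> 4) & 0xF
    post_div = 1 if pppp == 15 else (pppp + 1) * 2  # assumed: %1111 means divide by 1
    return xtal_hz / xi_div * vco_mul / post_div


FCLK = pll_hz(0b1_111111_0111111111_0000_10_11)    # 20MHz / 64 * 512 / 2
FPIX = 25_000_000.0
FSET = round(FPIX / FCLK * 2.0 * 0x4000_0000)      # same formula as fset in the CON block

LINE_PIXELS = 16 + 96 + 48 + 640                    # m_bs + m_sn + m_bv + visible
FRAME_LINES = 90 + 350 + 83 + 2                     # top blanks + visible + bottom + sync

if __name__ == "__main__":
    print(f"fclk = {FCLK / 1e6:.1f} MHz, setxfrq value = ${FSET:08X}")
    print(f"line rate  = {FPIX / LINE_PIXELS / 1e3:.2f} kHz")
    print(f"frame rate = {FPIX / LINE_PIXELS / FRAME_LINES:.2f} Hz")
```

The line and frame totals come out at 800 pixels by 525 lines, which is the usual 640 x 480 VGA frame with only 350 of the lines carrying picture data.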

Publison · 2018-09-29 09:26

That's sweet! Congrats, Chip!

Roy Eltham · 2018-09-29 09:31

Excellent, DAC goodness!
Can't wait to see all the graphics modes & features everyone comes up with on the P2.

Tubular · 2018-09-29 09:43

Sweet. Nice and solid.
