New P2 Silicon Observations


Comments

  • I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.

    Here is the MUL test code:
    dat
    		orgh	0
    '
    ' Launch all cogs with test program.
    '
    		org
    
    		hubset	##%1_000000_0000001101_1111_10_00	'enable crystal+PLL, stay in 20MHz+ mode
    		waitx	##20_000_000/100			'wait ~10ms for crystal+PLL to stabilize
    		hubset	##%1_000000_0000001101_1111_10_11	'now switch to PLL (20MHz * 14 = 280MHz with this setting)
    
    
    .loop		coginit	cognum,#@pgm	'last iteration relaunches cog 0
    		djnf	cognum,#.loop
    
    cognum		long	7
    '
    ' Do multiply test and indicate success on pin(cogid)
    '
    		org
    
    pgm		cogid	x		'which cog am I, 0..7?
    
    .loop		altr	yp
    		mul	a0,a1
    		cmp	ap,y	wz
    		drvz	x
    
    		altr	yp
    		mul	b0,b1
    		cmp	bp,y	wz
    		drvz	x
    
    		jmp	#.loop
    
    
    a0		long	$B9A3
    a1		long	$5EBD
    ap		long	$B9A3 * $5EBD
    
    b0		long	$CB9A
    b1		long	$49BF
    bp		long	$CB9A * $49BF
    
    yp		long	y
    
    x		res	1
    y		res	1
    
  • cgracey wrote: »
    I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.
    Always nice when the bench lines up with the sims :)
    cgracey wrote: »
    Here is the MUL test code:
    Can you do a MUL/DIV suite, with rolling values, and toggle one pin on PASS and another pin on FAIL?

  • jmg wrote: »
    cgracey wrote: »
    I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.
    Always nice when the bench lines up with the sims :)
    cgracey wrote: »
    Here is the MUL test code:
    Can you do a MUL/DIV suite, with rolling values, and toggle one pin on PASS and another pin on FAIL?

    What does that look like?
  • cgracey wrote: »
    I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.

    As for the test setup you have in hand, were all the ceramic decoupling caps installed, especially the ones (on both Vio and Vdd) that feed the heavily filtered internal voltage regulator for the VCO/PLL?

    Have you devised any means of verifying the VCO's actual jitter figures?
  • thej wrote: »
    cgracey wrote: »
    Sure, but 280MHz will be very iffy. Just getting a part with slow process corner characteristics would likely dash that expectation. To be safe, stick with the 180MHz. Maybe 200MHz is reasonable for indoor environments.

    Regardless of the MHz, could RCFAST be used as a dynamic low power mode?

    j

    Yes, but there's a 20KHz slow RC mode, too, that really drops the power.
  • Wendy, if you are reading this, join the forum!
  • cgracey wrote: »
    Yes, but there's a 20KHz slow RC mode, too, that really drops the power.

    What is Icc at 20kHz, and how does the frequency change as the part cools down?

  • cgracey wrote: »
    What does that look like?
    I'd run an NCO-type adder, adding e.g. 2^32/111 for a 111-step test period (i.e. it scans the number range in 111 loops), or 2^16/111 for 16b math ops.
    Then use Cordic Mul/Div & remainder to check core mul, & you hope they fail differently, but give the same answer when they work.
  • Thanks Chip for keeping us informed. All of this is wonderful information!
  • jmg wrote: »
    cgracey wrote: »
    What does that look like?
    I'd run an NCO-type adder, adding e.g. 2^32/111 for a 111-step test period (i.e. it scans the number range in 111 loops), or 2^16/111 for 16b math ops.
    Then use Cordic Mul/Div & remainder to check core mul, & you hope they fail differently, but give the same answer when they work.
    Can you write that in FreeBASIC, and we can translate it to PASM2?

  • jmg (edited September 28)
    ersmith wrote: »
    Can you write that in FreeBASIC, and we can translate it to PASM2?
    In CCalc, these test sweeps:
    Av=round(2^16/111) = 590 (example adder value)
    Mv16=0
    Mv16=Mv16+Av;Mr32=Mv16*Av;Res = Mr32/Mv16

    For 16*16 mul and 32b div, this gives Res = 590 (Av) in all cases, except when Mv16+Av wraps.
    Or maybe, for better coverage:
    Mv16=0
    Mv16=(Mv16+Av)%2^16;Mr32=(Mv16*(Mv16-Av))%2^32;Res = Mr32/Mv16/Av
    gives Res = 0..110 as it sweeps through the number range. It looks to need a reset at 110, as it does not wrap nicely, at least in my CCalc test.
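The second sweep can be modelled off-chip to see what the hardware should return. The following is only a Python sketch of the arithmetic in the post, not PASM2 test code; the function name `sweep` and its parameters are made up for illustration, and plain integer division stands in for the divide under test.

```python
# Sketch of the "better coverage" sweep; variable names mirror the CCalc ones.
AV = round(2**16 / 111)          # example adder value, 590


def sweep(steps=111, av=AV):
    """Return the Res value produced at each step of the sweep."""
    results = []
    mv16 = 0
    for _ in range(steps):
        mv16 = (mv16 + av) % 2**16               # 16-bit NCO accumulator
        mr32 = (mv16 * (mv16 - av)) % 2**32      # product kept to 32 bits
        # Exact integer divisions stand in for the hardware divide under test.
        results.append(mr32 // mv16 // av)
    return results


if __name__ == "__main__":
    res = sweep()
    print(res[0], res[-1], len(res))
```

Running one step past 111 makes the accumulator wrap, so `Mv16-Av` goes negative and the result no longer continues the 0..110 sequence, which matches the remark about needing a reset.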
  • cgracey (edited September 28)
    I made a bunch of current measurements of 0..7 cogs running from 20MHz..220MHz:

    https://docs.google.com/spreadsheets/d/1maeX7_tFKshFVPXdt9scWupgoN61MaEniNhF2_901l4/edit?usp=sharing

    Maybe one of you superdogs who knows how to work the Google can turn this into a graph?
  • Rayman (edited September 28)
    If I were at work I could do it nicely in Origin...
    But, here's a quick Excel plot.

    This can obviously be done better...
    [Attached image: 488 x 296, 10K]
    Prop Info and Apps: http://www.rayslogic.com/
  • Thanks, Rayman.

    You can see that the hub is what's burning the power. It must be that eggbeater arrangement. The cogs only add a little to the power consumption. I can make them run more complex code, but it doesn't change much. It almost seems like the big hub RAMs are all firing on every clock, doing reads when neither reads nor writes are called for. I've asked Wendy about this.
  • Fantastic results Chip. 192MHz is the USB sweet spot.

    BTW 200MHz was my hope and you seem to have exceeded that easily.

    I always expected the egg-beater to be a power hog. With all 8 cores running hubexec and in a loop causing continuous Hub reads, you are reading the whole 8 * 64KB blocks on every clock. I'll bet power will be significantly reduced if the cores run a core loop. As soon as I get time I'll write a test for this.

    If you get a chance, I'd like to see the results without smart pins running. That will let us see the cores' power usage. Similarly without Cordic/multiply.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • cgracey wrote: »
    You can see that the hub is what's burning the power. It must be that eggbeater arrangement.

    One way I could devise to impose an almost "silent mode" on the Hub circuits is to ensure that each and every one of the eight Cogs does a single group of Hub-related operations (e.g., RDLONG) targeting the same address (0000, perhaps?), or even a WRLONG followed by a RDLONG, using two options for the value of the written data (0000, then FFFF) and two options for the target address (0000, then the last accessible long in the memory map), before settling into a one-instruction infinite loop of Cog code, without any further Hub-related access or referencing.

    That way, the contents of all the Egg-beater address/data input latches will reflect the last address/data from each Cog-Hub transaction (and will remain steady from that point on), as will the address/data muxes at the Hub RAM. They would always see the same values while rotating, giving each Cog the opportunity for a new access, which will never come until Reset is applied.

    You could also dedicate eight pins, all cleared by the Cog that launches the others, then set to 1 individually at the end of each Cog's routine, before each Cog enters its infinite loop and goes quiet.

    That way you could verify that each and every one has accomplished its task and is standing still, from a Hub access standpoint, until Reset is applied.

    Hope it helps

    Henrique
  • cgracey (edited September 28)
    Cluso99 wrote: »
    Fantastic results Chip. 192MHz is the USB sweet spot.

    BTW 200MHz was my hope and you seem to have exceeded that easily.

    I always expected the egg-beater to be a power hog. With all 8 cores running hubexec and in a loop causing continuous Hub reads, you are reading the whole 8 * 64KB blocks on every clock. I'll bet power will be significantly reduced if the cores run a core loop. As soon as I get time I'll write a test for this.

    If you get a chance, I'd like to see the results without smart pins running. That will let us see the cores' power usage. Similarly without Cordic/multiply.

    CORDIC hardly adds anything. Same with hubexec. It seems to me that those hub RAMs are firing on every clock for some reason, and not just when we need them to. You can see that at boot up, running 25MHz, the chip takes 77mA. Only 4mA of that is cog 0. The rest is just the clock tree and maybe those big RAMs, though I think Wendy said she saw their ENA inputs toggling, as expected. It's just weird that no matter how strenuously I make the logic perform, it only adds a little current to the base draw at that frequency. Maybe tons of power is just being spent on the clock tree and idle flops. That would make sense. Nothing I do, though, affects power much. It's 75% a function of frequency.

    Wendy is going to get us a hierarchical breakdown of power consumption on Monday, I assume. That might indicate something.
  • Chip, I was hoping for 200MHz, but 250MHz (room temp) would be great.
    I'd like to know what the RCSLOW current is (if RCSLOW is even an option).

    Bean
  • cgracey wrote: »
    CORDIC hardly adds anything. Same with hubexec. It seems to me that those hub RAMs are firing on every clock for some reason, and not just when we need them to. You can see that at boot up, running 25MHz, the chip takes 77mA. Only 4mA of that is cog 0. The rest is just the clock tree and maybe those big RAMs, though I think Wendy said she saw their ENA inputs toggling, as expected. It's just weird that no matter how strenuously I make the logic perform, it only adds a little current to the base draw at that frequency. Maybe tons of power is just being spent on the clock tree and idle flops. That would make sense. Nothing I do, though, affects power much. It's 75% a function of frequency.

    Wendy is going to get us a hierarchical breakdown of power consumption on Monday, I assume. That might indicate something.

    Interesting data: very MHz dependent, and largely COG independent.

    With tight timing, and this much logic, I would expect Clock Trees to be hogs.

    A useful test would be to idle 8 cogs in a mostly WAIT loop (eg toggle pins at 100,110,120..180ms) - that leaves just clock trees.

    It looks like COGINIT or STOP does not enable/disable the clock tree for that COG?
  • I am presuming the hub access is every 2 clocks, so the ENA should toggle every 2nd clock if it's NOT being deselected when no cog/core access is required. Wonder if the ENA is not being deselected when no access is required. That could account for little current variation whether hub access or not.
  • Here is another view of the data. Certainly shows that frequency has a much greater effect on current draw than the number of running cogs.
    [Attached image: 1361 x 772, 65K]
    In science there is no authority. There is only experiment.
    Life is unpredictable. Eat dessert first.
  • Cluso99 wrote: »
    I am presuming the hub access is every 2 clocks, so the ENA should toggle every 2nd clock if it's NOT being deselected when no cog/core access is required. Wonder if the ENA is not being deselected when no access is required. That could account for little current variation whether hub access or not.

    Hub access is on every clock. There are no "multi-cycle" paths in this design. Everything resolves in one clock.
  • jmg (edited September 29)
    and here are the Cpd equivalent values, from CMOS logic formulae
     power dissipation capacitance
     Pd = Cpd x Vcc^2 x Fi
     Cpd=(Icc*Vcc)/(Vcc*Vcc*Fi)
    
     Fi=20M;Icc=64m;   Cpd = 1.777nF
     Fi=60M;Icc=182m;  Cpd = 1.685nF
     Fi=100M;Icc=301m; Cpd = 1.672nF
     Fi=140M;Icc=426m; Cpd = 1.690nF
     Fi=180M;Icc=534m; Cpd = 1.648nF
     Fi=220M;Icc=638m; Cpd = 1.611nF
    then Enable 7 more COGs 
     Fi=220M;Icc=869m; Cpd = 2.194nF
    
     (Cpd-1.611e-9)/7 = adds 83.333pF/COG
    

    This shows the clock tree is quite significant (1.6nF is the Cg of a moderate power MOSFET), and the incremental addition per extra started COG is more modest, at 83.33pF/COG.

    The values above have a downward slope on Cpd because some static Icc was not allowed for, so we can iterate to find an equivalent offset value of Iq that gives a nominally flat Cpd:
     Iq=5m  Cpd=((Icc-Iq)*Vcc)/(Vcc*Vcc*Fi)
    
     Fi=20M;Icc=64m;   Cpd = 1.638nF
     Fi=60M;Icc=182m;  Cpd = 1.638nF
     Fi=100M;Icc=301m; Cpd = 1.644nF
     Fi=220M;Icc=638m; Cpd = 1.598nF
    
    ie Iq ~ 5mA and Cpd ~ 1.6nF, for P2
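The figures above can be reproduced with a few lines of Python. This is a sketch only: Vcc = 1.8V is an assumption (the post does not state it, but it is the value that reproduces the posted Cpd numbers), and the names `cpd`, `ONE_COG` and `EIGHT_COGS_220MHZ` are illustrative.

```python
VCC = 1.8  # volts; assumed, inferred from the posted Cpd figures

# (Fi in Hz, Icc in A) pairs copied from the post, before the extra cogs start
ONE_COG = [(20e6, 64e-3), (60e6, 182e-3), (100e6, 301e-3),
           (140e6, 426e-3), (180e6, 534e-3), (220e6, 638e-3)]
EIGHT_COGS_220MHZ = 869e-3  # Icc in A at 220MHz with 7 more cogs enabled


def cpd(icc, fi, iq=0.0, vcc=VCC):
    """Equivalent power dissipation capacitance in farads."""
    return ((icc - iq) * vcc) / (vcc * vcc * fi)


if __name__ == "__main__":
    for fi, icc in ONE_COG:
        print(f"Fi={fi / 1e6:.0f}M  Cpd={cpd(icc, fi) * 1e9:.3f}nF  "
              f"with Iq=5mA: {cpd(icc, fi, iq=5e-3) * 1e9:.3f}nF")
    per_cog = (cpd(EIGHT_COGS_220MHZ, 220e6) - cpd(638e-3, 220e6)) / 7
    print(f"per extra cog: {per_cog * 1e12:.1f}pF")
```

Passing a different `iq` makes it easy to repeat the iteration for a flat Cpd against other measurement sets.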
  • "Hub access is on every clock. There are no "multi-cycle" paths in this design. Everything resolves in one clock." - So what is the clock latency from one side of the die to the other? Depending on how the memories and everything else are spread out over the entire die, the clock latency could be an issue, and it might make sense with what you are observing so far. I ran the Avanti Place and Route tool for the DP83865, and to get timing just right the tool literally ran for days through several iterations. If the timing was off by as much as a couple of ns, clock-sensitive areas, specifically memories, became very unstable.

    Congrats getting this far on P2 BTW


    Beau Schwabe -- Submicron Forensic Engineer
    www.Kit-Start.com - bschwabe@Kit-Start.com ෴෴ www.BScircuitDesigns.com - icbeau@bscircuitdesigns.com ෴෴

  • Jmg, there are 88k flops on the die. No clock-gating was done at the Verilog level. The synthesis tools, though, might have implemented something. Also, I think I remember seeing something about a flop being clocked, but without its data input changing, resulting in lowest power, which makes sense, but implies no clock gating.
  • "Hub access is on every clock. There are no "multi-cycle" paths in this design. Everything resolves in one clock." - So what is the clock latency from one side of the die to the other? Depending on how the memories and everything else are spread out over the entire die, the clock latency could be an issue, and it might make sense with what you are observing so far. I ran the Avanti Place and Route tool for the DP83865, and to get timing just right the tool literally ran for days through several iterations. If the timing was off by as much as a couple of ns, clock-sensitive areas, specifically memories, became very unstable.

    Congrats getting this far on P2 BTW

    The clock tree was synthesized with a jitter budget that ultimately afforded a clock uncertainty of 300ps.

    Thanks, Beau.
  • cgracey (edited September 29)
    The DACs seem to be working well. Here is a 438KB 5:6:5 RGB image in RAM being displayed via VGA.

    P0 = HSYNC, 123-ohm 3.3V DAC, pixel-locked to color signals on P1/P2/P3
    P1 = Blue, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
    P2 = Green, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
    P3 = Red, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
    P4 = VSYNC, digital output

    Here is the code. Note that the PLL is using the biggest XI divide possible (64) and a big VCO divide of 512, with a post-VCO divide of 2. This results in 20MHz / 64 * 512 / 2 = 80MHz, but with the slowest/weakest PLL feedback possible, and the image shows no jitter, at all:
    '*************************************
    '*  VGA 640 x 480 x 16bpp 5:6:5 RGB  *
    '*************************************
    
    CON
    
      intensity	= 80	'0..128
    
      fclk		= 80_000_000.0
      fpix		= 25_000_000.0
      fset		= (fpix / fclk * 2.0) * float($4000_0000)
    
      vsync		=	4	'vsync pin
    
    DAT		org
    '
    '
    ' Setup
    '
    		hubset	##%1_111111_0111111111_0000_10_00	'enable crystal+PLL, stay in 20MHz+ mode
    		waitx	##20_000_000/100			'wait ~10ms for crystal+PLL to stabilize
    		hubset	##%1_111111_0111111111_0000_10_11	'now switch to PLL running at 80MHz
    
    		rdfast	##640*350*2/64,##$1000	'set rdfast to wrap on bitmap
    
    		setxfrq ##round(fset)		'set transfer frequency to 25MHz
    
    		'the next 4 lines may be commented out to bypass signal level scaling
    
    		setcy	##intensity << 24	'r	set colorspace for rgb
    		setci	##intensity << 16	'g
    		setcq	##intensity << 08	'b
    		setcmod	#%01_0_000_0		'enable colorspace conversion (may be commented out)
    
    		wrpin	dacmode_s,#0		'enable dac modes in pins 0..3
    		wrpin	dacmode_c,#1
    		wrpin	dacmode_c,#2
    		wrpin	dacmode_c,#3
    		setnib	dira,#%1111,#0
    '
    '
    ' Field loop
    '
    field		mov	x,#90			'top blanks
    		call	#blank
    
    		mov     x,#350			'set visible lines
    line		call	#hsync			'do horizontal sync
    		xcont	m_rf,#1			'visible line
    		djnz    x,#line           	'another line?
    
    		mov	x,#83			'bottom blanks
    		call	#blank
    
    		drvnot	#vsync			'sync on
    
    		mov	x,#2			'sync blanks
    		call	#blank
    
    		drvnot	#vsync			'sync off
    
                    jmp     #field                  'loop
    '
    '
    ' Subroutines
    '
    blank		call	#hsync			'blank lines
    		xcont	m_vi,#0
    	_ret_	djnz	x,#blank
    
    hsync		xcont	m_bs,#0			'horizontal sync
    		xzero	m_sn,#1
    	_ret_	xcont	m_bv,#0
    '
    '
    ' Initialized data
    '
    dacmode_s		long	%0000_0000_000_1011000000000_01_00000_0		'hsync is 123-ohm, 3.3V
    dacmode_c		long	%0000_0000_000_1011100000000_01_00000_0		'R/G/B are 75-ohm, 2.0V
    
    m_bs		long	$CF000000+16		'before sync
    m_sn		long	$CF000000+96		'sync
    m_bv		long	$CF000000+48		'before visible
    m_vi		long	$CF000000+640		'visible
    
    m_rf		long	$2F000000+640		'visible rfword rgb16 (5:6:5)
    
    x		res	1
    y		res	1
    '
    '
    ' Bitmap
    '
    		orgh	$1000 - 70		'justify pixels at $1000
    		file	"birds_16bpp.bmp"	'rayman's picture (640 x 350)
    
    [Attached image: 2048 x 1152, 792K]
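The clock and timing arithmetic in this post can be checked with a short Python sketch. The HUBSET field layout used here (1 enable bit, 6-bit XI divider, 10-bit VCO multiplier, 4-bit post-divider, then two 2-bit fields) is read off the binary grouping in the listing, and treating %1111 in the post-divider as "divide by 1" is an assumption; `pll_hz` and `vga_timing` are illustrative names, not part of any tool.

```python
XTAL_HZ = 20_000_000  # crystal frequency used in the post


def pll_hz(hubset_value, xtal_hz=XTAL_HZ):
    """Decode a HUBSET PLL value, assuming the layout E_DDDDDD_MMMMMMMMMM_PPPP_CC_SS."""
    d = (hubset_value >> 18) & 0x3F    # XI divider field, divides by d + 1
    m = (hubset_value >> 8) & 0x3FF    # VCO multiplier field, multiplies by m + 1
    p = (hubset_value >> 4) & 0xF      # post-VCO divider field
    vco = xtal_hz / (d + 1) * (m + 1)
    # Assumption: %1111 passes the VCO straight through, else divide by 2*(p+1).
    return vco if p == 0xF else vco / (2 * (p + 1))


def vga_timing(fpix=25_000_000.0):
    """Line and frame rates from the streamer counts and line counts in the listing."""
    pixels_per_line = 16 + 96 + 48 + 640     # m_bs + m_sn + m_bv + visible
    lines_per_frame = 90 + 350 + 83 + 2      # top + visible + bottom + sync
    line_hz = fpix / pixels_per_line
    return pixels_per_line, lines_per_frame, line_hz, line_hz / lines_per_frame


if __name__ == "__main__":
    fclk = pll_hz(0b1_111111_0111111111_0000_10_11)
    # Same expression as the fset constant in the CON block.
    fset = round((25_000_000.0 / fclk * 2.0) * 0x4000_0000)
    print(fclk, hex(fset), vga_timing())
```

With the listing's values this gives 800 pixel clocks per line and 525 lines per field, so a 25MHz pixel clock corresponds to a 31.25kHz line rate.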
  • Publison (edited September 29)
    That's sweet! Congrats, Chip!
    Infernal Machine
  • Excellent, DAC goodness!
    Can't wait to see all the graphics modes & features everyone comes up with on the P2.
  • Sweet. Nice and solid.