New P2 Silicon Observations

cgracey Posts: 9,528
edited September 28 in Propeller 2
I've got the P2 running on my bench at home now and I'm going through it. Haven't figured out, yet, what's up with the ROM code and pull-up sensing, but I'm trying other stuff out.

Using a 20MHz crystal and the PLL, I'm running a power torture test at 180MHz:

a) All 8 cogs are running in hubexec.
b) Each cog executes a 2-instruction loop which causes the FIFO to reload continuously.
c) Each cog issues a CORDIC command on each loop.
d) All 64 smart pins are doing PWM with a 16-clock frame, so the pins toggle every 8 clocks, on average.
e) VDD = 1.80V, VIO = 3.30V, F=180MHz

Results:

VDD current = 920mA
VIO current = 38mA

VDD power = 1.80V * 920mA = 1.66W
VIO power = 3.30V * 38mA = 0.125W

Total power = 1.79W
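
The rail arithmetic above can be reproduced with a minimal Python sketch. The function name `rail_power_w` is made up for illustration; the voltages and currents are the ones listed in the results. Without intermediate rounding the total comes out at about 1.78W; the 1.79W figure appears to come from adding the already-rounded rail values.

```python
def rail_power_w(volts, milliamps):
    """Power in watts drawn from one supply rail."""
    return volts * milliamps / 1000.0

vdd_w = rail_power_w(1.80, 920)   # core rail at 180MHz
vio_w = rail_power_w(3.30, 38)    # I/O rail at 180MHz
total_w = vdd_w + vio_w

print(f"VDD {vdd_w:.3f} W, VIO {vio_w:.3f} W, total {total_w:.2f} W")
```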

The chip is a little too hot to keep your finger on, but I think with a solid ground plane in a 4-layer PCB, the temperature could be brought way down. As it is, the back of the PCB is only slightly less hot than the top of the chip, so the Amkor package and the via plane on Peter's 2-layer PCB are doing their job well.

We will design a really simple 4-layer board with internal ground/heat plane and high-current power regulators. This will be needed to discover the chip's real speed limits. My current setup uses a dual-output bench supply and some wires with grabbers. At this VDD current level, I'm dropping ~200mV just over the wires going to the board, so I need to output 2.00V to get 1.80V onto the board. Peter's board has regulators, potentially, but I'm just using it as a carrier board for the chip. This way, I can measure currents easily.
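
As a rough sketch of the wire-drop reasoning, assuming the leads behave as a plain resistor (an assumption, not something measured here), the ~200mV drop at 920mA implies roughly 0.22 ohm of lead resistance. The helper names below are hypothetical. The last line estimates the bench supply setting needed at the 1200mA reported later in the thread for the 240MHz run.

```python
def lead_resistance_ohms(drop_v, current_a):
    """Ohm's law: resistance implied by a voltage drop at a given current."""
    return drop_v / current_a

def supply_setpoint_v(target_v, current_a, resistance_ohms):
    """Bench supply voltage needed so the board sees target_v after the lead drop."""
    return target_v + current_a * resistance_ohms

r_leads = lead_resistance_ohms(0.200, 0.920)            # about 0.217 ohm
setpoint_180 = supply_setpoint_v(1.80, 0.920, r_leads)  # 2.00V, as in the post
setpoint_240 = supply_setpoint_v(1.80, 1.200, r_leads)  # estimate only

print(f"leads: {r_leads:.3f} ohm, set {setpoint_180:.2f} V at 920mA, "
      f"{setpoint_240:.2f} V at 1200mA")
```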

Here is the power torture test code:
dat
		orgh	0
'
' Launch all cogs with test program.
'
		org

		hubset	##%1_000000_0000001000_1111_10_00	'enable crystal+PLL, stay in 20MHz+ mode
		waitx	##20_000_000/100			'wait ~10ms for crystal+PLL to stabilize
		hubset	##%1_000000_0000001000_1111_10_11	'now switch to PLL running at 180MHz


.loop		coginit	cognum,#@pgm	'last iteration relaunches cog 0
		djnf	cognum,#.loop

cognum		long	7
'
' Set 8 pins (per cog) to PWM mode, jump to hubexec loop
'
		org

pgm		cogid	x		'which cog am I, 0..7?
		shl	x,#3

		rep	@.p,#8		'start pwm pins
		wrpin	#%01_01001_0,x
		wxpin	pat,x
		wypin	#1,x
		dirh	x
		add	x,#1
.p
		jmp	#hubloop


pat		long	$0010_0001

x		res	1
'
' Hubexec loop
'
		orgh	$400

hubloop		qrotate	x,x
		jmp	#hubloop
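
For readers decoding the HUBSET constants, here is a hedged Python sketch (not P2 code). The field layout E, D (6 bits), M (10 bits), P (4 bits), CC, SS and the divider formulas are my assumption based on the published P2 documentation, not something stated in this post; the only cross-check is that they reproduce the 180MHz quoted above from the 20MHz crystal. The function name `decode_pll` is hypothetical.

```python
def decode_pll(hubset_value, xtal_hz):
    """Split a HUBSET clock word into fields (ASSUMED layout: E_D6_M10_P4_CC_SS)."""
    ss = hubset_value & 0b11              # clock source select
    cc = (hubset_value >> 2) & 0b11       # crystal/pin mode
    pppp = (hubset_value >> 4) & 0b1111   # VCO post-divider code
    m = (hubset_value >> 8) & 0x3FF       # VCO multiplier minus 1
    d = (hubset_value >> 18) & 0x3F       # crystal divider minus 1
    e = (hubset_value >> 24) & 1          # PLL enable
    vco_hz = xtal_hz / (d + 1) * (m + 1)
    post_div = 1 if pppp == 15 else (pppp + 1) * 2   # assumed: %1111 means divide by 1
    return {"e": e, "d": d, "m": m, "pppp": pppp, "cc": cc, "ss": ss,
            "pll_hz": vco_hz / post_div}

# Second HUBSET constant from the listing, with the 20MHz crystal
info = decode_pll(0b1_000000_0000001000_1111_10_11, 20_000_000)
print(info)
```

Under these assumptions the crystal is divided by 1 and multiplied by 9, which is consistent with the 180MHz figure.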

Comments

  • Tubular Posts: 3,117
    edited September 28
    Great news, Chip. Fun times ahead.

    On live space missions, I think this is what they refer to as "nominal", meaning as expected.

    Are you thinking linear regs? There were linear regs on those old smartpack FPGA boards you used to make.
  • cgracey Posts: 9,528
    edited September 28
    The torture test also runs at 240MHz:

    VDD current = 1200mA
    VIO current = 48mA

    VDD power = 1.80V * 1200mA = 2.16W
    VIO power = 3.30V * 48mA = 0.158W

    Total power = 2.32W

    This is quite proportional to the 180MHz power. We are running 33% faster and taking 30% more power.

    I don't know the integrity of execution, though. The smart pins are the only thing giving indication of function, but all paths were optimized up to the goal point, so the cogs should fail at the same frequency that the smart pins fail at.
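
The proportionality claim can be checked with a small Python sketch using only the figures from the two runs. It compares the frequency ratio with the total power ratio and also shows VDD current per MHz; the variable names are made up for illustration.

```python
# Measured figures from the 180MHz and 240MHz torture runs
runs = {
    180: {"vdd_ma": 920, "vio_ma": 38},
    240: {"vdd_ma": 1200, "vio_ma": 48},
}

def total_power_w(run, vdd_v=1.80, vio_v=3.30):
    """Sum of core and I/O rail power in watts."""
    return (vdd_v * run["vdd_ma"] + vio_v * run["vio_ma"]) / 1000.0

freq_ratio = 240 / 180
power_ratio = total_power_w(runs[240]) / total_power_w(runs[180])
ma_per_mhz = {mhz: run["vdd_ma"] / mhz for mhz, run in runs.items()}

print(f"frequency x{freq_ratio:.3f}, power x{power_ratio:.3f}")
print({mhz: round(v, 2) for mhz, v in ma_per_mhz.items()})
```

The VDD current works out to roughly 5mA per MHz in both runs, which is what you would expect if dynamic switching power dominates.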
  • Things seem to run at 280MHz, but not at 300MHz.
  • This is great news! I was afraid of being stuck at 80 MHz...
    Prop Info and Apps: http://www.rayslogic.com/
  • thej Posts: 138
    edited September 28
    WOW !!!
    I see anyone with dreams of writing an emulator for the P2, pretty much falling off their chair at this point !
    :-O


    ...and what kind of USB performance could be achieved at 280MHz :-)

    J
  • As I recall, worst case simulations indicated power consumption at 2.4 Watts @ 180 MHz. Right on the money! Yeah!
    Feel the need for speed between your PC's com port and Prop?
    Try the FTDI 245 and the FullDuplexParallel Object.

    Check out my spin driver for the Parallax "96 x 64 Color OLED Display Module" Product ID: 28087
  • Chip,
    This all sounds very promising for having 200MHz+ as an option, similar to how we often use the P1 at 96MHz+ vs the spec'd 80MHz.

    I'm guessing more typical use cases will pull a fraction of that power and run cool.
  • cgracey wrote: »
    Things seem to run at 280MHz, but not at 300MHz.

    That's well above what was modeled, correct?
  • Is it possible to change the MHz dynamically during execution?
    I'm just wondering if it could run at 280MHz (with an appropriate crystal) then be dynamically switched to RCFAST as an "alive but in Sleep mode" ?

    j
  • It's probably safe to design commercial products to run at 200MHz, where they will be in human environments, but the chip is designed to run at 180MHz with +85C ambient temperature, 1.8V-10% VDD, 3.3V-10% VIO, with worst process corner.
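
To put numbers on the design corner above, a minimal Python sketch: the -10% rail tolerances give the worst-case supply voltages, and the overclock percentages are relative to the 180MHz design target. Function names are hypothetical.

```python
def worst_case_rail_v(nominal_v, tolerance=0.10):
    """Lowest rail voltage allowed by a fractional tolerance."""
    return nominal_v * (1.0 - tolerance)

def overclock_pct(target_mhz, rated_mhz=180):
    """How far above the design frequency a target clock sits, in percent."""
    return (target_mhz / rated_mhz - 1.0) * 100.0

vdd_min = worst_case_rail_v(1.8)   # 1.62V
vio_min = worst_case_rail_v(3.3)   # 2.97V

print(f"VDD min {vdd_min:.2f} V, VIO min {vio_min:.2f} V")
for mhz in (200, 280):
    print(f"{mhz} MHz is {overclock_pct(mhz):.1f}% above 180 MHz")
```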
  • thej wrote: »
    Is it possible to change the MHz dynamically during execution?
    I'm just wondering if it could run at 280MHz (with an appropriate crystal) then be dynamically switched to RCFAST as an "alive but in Sleep mode" ?

    j

    Sure, but 280MHz will be very iffy. Just getting a part with slow process corner characteristics would likely dash that expectation. To be safe, stick with the 180MHz. Maybe 200MHz is reasonable for indoor environments.
  • Combined with the huge Spin2 gains, being able to overclock into the 200~280 MHz region is simply amazing. What a beast.
  • If this does well and we have the capital, we could use a 28nm process and get well over 1GHz. Power dissipation would be only 5%-10% of the current design.
  • cgracey wrote: »
    If this does well and we have the capital, we could use a 28nm process and get well over 1GHz. Power dissipation would be only 5%-10% of the current design.
    And 64 COGs? :smile:

  • thej Posts: 138
    edited September 28
    cgracey wrote: »
    Sure, but 280MHz will be very iffy. Just getting a part with slow process corner characteristics would likely dash that expectation. To be safe, stick with the 180MHz. Maybe 200MHz is reasonable for indoor environments.

    Regardless of the MHz, could RCFAST be used as a dynamic low power mode?

    j
  • Watched last night till I fell asleep and you had the ah-ha moment. Saw that this morning.

    I saw Total power = 1.79W and thought "P2-Hot, part deux", but a 4-layer board should take care of that. Even the 240MHz wattage is half the P2-Hot. This is going in the right direction! Searching for good flights to Rocklin in April.
    Infernal Machine
  • I remember that P2-Hot was projected to be 8 Watts!
  • cgracey wrote: »
    If this does well and we have the capital, we could use a 28nm process and get well over 1GHz. Power dissipation would be only 5%-10% of the current design.

    You may not even need to go to 28nm for a substantial speed improvement!
    Check this out! 90nm challenging the performance of 7nm!

    https://spectrum.ieee.org/nanoclast/semiconductors/processors/the-foundry-at-the-heart-of-darpas-plan-to-let-old-fabs-beat-new-ones

    j
  • Peter Jakacki Posts: 7,709
    edited September 28
    DELETE - reposted back into TAQOZ thread

  • cgracey wrote: »
    Things seem to run at 280MHz, but not at 300MHz.

    Margins are sounding ok, in this simplest of tests.

    You mean smart pins run to 280MHz, when doing a /16 PWM ?

    The rest of the code is not really doing any integrity verification. Before everyone mentally locks in 250+ MHz, you should run some memory write/read verifies, some MUL/DIV verifies, and some CORDIC result verifies.
  • thej wrote: »
    Is it possible to change the MHz dynamically during execution?
    I'm just wondering if it could run at 280MHz (with an appropriate crystal) then be dynamically switched to RCFAST as an "alive but in Sleep mode" ?

    Yes, the PLL has reasonable M/N control, so you can scale SysCLK to whatever the PLL manages.
    It might be there are some opcodes that are 'Canary opcodes', that can be used to check as 'too fast' indicators, in which case you could scale SysCLK and check you were ok.
    Such a design would probably also need an external watchdog part, that stored the last MHz and reset the P2, should it ever freeze stone dead.
    Interesting bench testing stuff, but I'm not sure you would deploy it in machines running in the field !!


  • thej wrote: »
    WOW !!!
    ...and what kind of USB performance could be achieved at 280MHz :-)

    Still 12 Mbit/s? That's the USB full-speed bit rate.
    The next step, 480 Mbit/s high speed, is well outside the P2's ability, but maybe someone will connect a HS-USB PHY to a P2 one day ?

    It would be quite interesting to load the USB code, and ramp the SysCLK, and see how clk tolerant the USB code is. (FPGA tested at 80MHz)
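
One way to picture what a faster SysCLK buys for bit-banged full-speed USB is the number of system clocks available per bit time. This is a hedged Python sketch: 12 Mbit/s is the standard full-speed rate, the function name is hypothetical, and nothing here says how the actual USB code behaves at those clocks.

```python
USB_FULL_SPEED_BPS = 12_000_000  # full-speed USB bit rate

def clocks_per_bit(sysclk_hz, bit_rate_bps=USB_FULL_SPEED_BPS):
    """System clocks that elapse during one USB bit time."""
    return sysclk_hz / bit_rate_bps

for mhz in (80, 180, 280):
    print(f"{mhz} MHz: {clocks_per_bit(mhz * 1_000_000):.2f} clocks per bit")
```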
  • Jmg, I know that the MUL instruction is a critical path. I will use it to check speed.
  • cgracey wrote: »
    Results:

    VDD current = 920mA
    VIO current = 38mA

    VDD power = 1.80V * 920mA = 1.66W
    VIO power = 3.30V * 38mA = 0.125W

    Total power = 1.79W

    We will design a really simple 4-layer board with internal ground/heat plane and high-current power regulators. This will be needed to discover the chip's real speed limits. My current setup uses a dual-output bench supply and some wires with grabbers. At this VDD current level, I'm dropping ~200mV just over the wires going to the board, so I need to output 2.00V to get 1.80V onto the board. Peter's board has regulators, potentially, but I'm just using it as a carrier board for the chip. This way, I can measure currents easily.

    There is an XCL220 switching regulator footprint on the P2D2, but that is a 1A part, with a >1.3A current limit, so is looking marginal for push-testing.
    You might want to select a high efficiency, 3~6A region switcher (with PowerGOOD) + conservative inductor, so the power loss it adds is minimal.
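
A quick Python sketch of why a 1A part looks marginal, using only the 1A rating quoted here and the VDD currents measured in the two torture runs. The names are made up for illustration, and a negative result means the load exceeds the rating.

```python
RATED_MA = 1000  # 1A regulator rating quoted above

# VDD current measured in the torture test, keyed by clock in MHz
measured_vdd_ma = {180: 920, 240: 1200}

def headroom_ma(load_ma, rated_ma=RATED_MA):
    """Remaining current margin in mA; negative means over the rating."""
    return rated_ma - load_ma

for mhz, load in measured_vdd_ma.items():
    print(f"{mhz} MHz: load {load} mA, headroom {headroom_ma(load)} mA")
```

At 180MHz only 80mA of margin remains, and the 240MHz load is 200mA over the rating, though still under the quoted current limit.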

  • With the FPGA compiles, different cogs have different Fmax values.

    Now that synthesized layout is locked in, there's likely to be a fastest and slowest cog

    So perhaps you could have a 'canary cog' as well as a canary instruction
  • Maybe you can run 1 cog at 480 MHz?
  • Tubular wrote: »
    Now that synthesized layout is locked in, there's likely to be a fastest and slowest cog

    So perhaps you could have a 'canary cog' as well as a canary instruction
    Good point, yes, there could be variations across cogs, hopefully not too opcode variable, so that a slow opcode is slow on all COGS/CORES and slowest on the slowest core.

    Rayman wrote: »
    Maybe you can run 1 cog at 480 MHz?
    Maybe not 480MHz :) - but the thermal loading will be less with 1 COG, so you could expect to gain a few extra %, up the temperature curve.

    I don't think there is any means to check die temperature in P2 ?
    Inference from RCSLOW/RCFAST tempcos was talked about, but IIRC there is no capture path on those ?
  • P1 can run 1 cog at 160 MHz, as I recall... Maybe P2 can do that trick too?
  • cgracey wrote: »
    Jmg, I know that the MUL instruction is a critical path. I will use it to check speed.

    MUL and DIV could also be some of the 'warmest' opcodes, as they likely have the most active silicon ?
  • Tubular wrote: »
    With the FPGA compiles, different cogs have different Fmax values.

    Now that synthesized layout is locked in, there's likely to be a fastest and slowest cog

    So perhaps you could have a 'canary cog' as well as a canary instruction

    All paths are optimized only to the goal point. This means a huge chunk of paths are splat against the timing wall, and those paths are certainly spread among all cogs. So, there won't be a cog that's faster than another. Prop1 was different, since it was laid out by hand and there were effective timing groupings.