Due to the presence of the thermal pad, board and enclosure heatsinking will matter a lot: the better it is, the smaller the relative temperature rise. Heatsink design will be a factor in the Prop2's usable clock-rate spec, where it wasn't for the Prop1.
Okay. Wendy sent me the current-silicon power test simulation that shows 1.2W:
                  Internal   Switching  Leakage    Total
Power Group       Power      Power      Power      Power      (   %)   Attrs
--------------------------------------------------------------------------------
clock_network     0.8340     0.2442     1.136e-06  1.0783     (87.15%)  i
register          0.0147     1.255e-03  2.577e-05  0.0160     ( 1.29%)
combinational     0.0162     0.0430     9.582e-05  0.0594     ( 4.80%)
sequential        4.971e-04  0.0000     1.453e-06  4.985e-04  ( 0.04%)
memory            0.0152     3.047e-03  1.950e-04  0.0185     ( 1.49%)
io_pad            0.0000     0.0646     8.822e-05  0.0647     ( 5.23%)
black_box         0.0000     0.0000     3.884e-14  3.884e-14  ( 0.00%)

Net Switching Power  =  0.3561     ( 28.78%)
Cell Internal Power  =  0.8807     ( 71.18%)
Cell Leakage Power   =  4.074e-04  (  0.03%)
                        ---------
Total Power          =  1.2372     (100.00%)
And for comparison, here is the future-silicon test that shows 790mW:
                  Internal   Switching  Leakage    Total
Power Group       Power      Power      Power      Power      (   %)   Attrs
--------------------------------------------------------------------------------
clock_network     0.3084     0.1488     3.312e-06  0.4572     (57.93%)  i
register          8.300e-03  3.107e-03  2.573e-05  0.0114     ( 1.45%)
combinational     0.0361     0.0769     7.545e-05  0.1131     (14.33%)
sequential        4.566e-04  2.212e-05  1.270e-06  4.800e-04  ( 0.06%)
memory            0.1056     1.972e-03  1.950e-04  0.1078     (13.65%)
io_pad            0.0000     0.0992     8.938e-05  0.0993     (12.58%)
black_box         0.0000     0.0000     9.151e-14  9.151e-14  ( 0.00%)

Net Switching Power  =  0.3301     ( 41.82%)
Cell Internal Power  =  0.4589     ( 58.13%)
Cell Leakage Power   =  3.902e-04  (  0.05%)
                        ---------
Total Power          =  0.7894     (100.00%)
I just remembered that the current-silicon test executes from cog RAM, while the future-silicon test executes from hub, activating the hub RAMs. So the total power difference is ~90 mW greater than shown. That means the current-silicon power should be about 1.33 W, which gets reduced by ~40% in the future silicon.
Good, that 1.2 W simulation result is close to the reported typical for the P2es.
The reports still show some changes, though?
                     Current silicon             Future silicon
                     Total                       Total
Power Group          Power      (   %)   Attrs   Power      (   %)   Attrs
---------------------------------------------------------------------------
clock_network        1.0783     (87.15%)  i      0.4572     (57.93%)  i
register             0.0160     ( 1.29%)         0.0114     ( 1.45%)
combinational        0.0594     ( 4.80%)         0.1131     (14.33%)
sequential           4.985e-04  ( 0.04%)         4.800e-04  ( 0.06%)
memory               0.0185     ( 1.49%)         0.1078     (13.65%)
io_pad               0.0647     ( 5.23%)         0.0993     (12.58%)
black_box            3.884e-14  ( 0.00%)         9.151e-14  ( 0.00%)

Net Switching Power  =  0.3561     (28.78%)      0.3301     (41.82%)
Cell Internal Power  =  0.8807     (71.18%)      0.4589     (58.13%)
Cell Leakage Power   =  4.074e-04  ( 0.03%)      3.902e-04  ( 0.05%)
                        ---------                ---------
Total Power          =  1.2372    (100.00%)      0.7894    (100.00%)
Memory power has increased quite a bit; why would that be?
Clock_network power has fallen (-0.6211 W!), partly offset by an increase (+0.0537 W) in combinational as the clock gating moves to a different reporting column.
Overall, that is now 64% of the previous power value, which is quite a gain (1.57 x MHz for the same power).
Seems almost too good to be true?
I modified my post above. It's actually even better, because the future-silicon test executes from hub, while the current-silicon test executed from cog RAM. We need to add 89mW to the current-silicon power.
Future power is 59.5% of current power, yielding 1.68 x MHz for the same power.
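For anyone who wants to follow the arithmetic, here is a quick sketch of it in Python, using the two simulation totals and the ~89 mW hub-RAM adder quoted above (the adder is Chip's estimate; nothing else is assumed):

# Sanity check of the power ratios quoted above.
current_total = 1.2372    # W, current-silicon test (executes from cog RAM)
future_total  = 0.7894    # W, future-silicon test (executes from hub RAM)
hub_ram_adder = 0.089     # W, hub-RAM activity missing from the cog-RAM test

raw_ratio      = future_total / current_total                    # ~0.638 -> "64%"
adjusted_ratio = future_total / (current_total + hub_ram_adder)  # ~0.595 -> "59.5%"
print(f"{raw_ratio:.3f} -> {1/raw_ratio:.2f} x MHz for the same power")
print(f"{adjusted_ratio:.3f} -> {1/adjusted_ratio:.2f} x MHz for the same power")
# prints 0.638 / 1.57 x and 0.595 / 1.68 x, matching the figures in this thread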
Here is the current-silicon power test that runs from cog RAM:
dat
                orgh    0
'
' Launch all cogs with test program.
'
                org
.loop           coginit cognum,#@pgm    'last iteration relaunches cog 0
                djnf    cognum,#.loop
cognum          long    7
'
' Toggle 8 pins, get long from hub, start cordic command
'
                org
pgm             cogid   x               'which cog am I, 0..7?
                shl     x,#3
                rep     @.p,#8          'start pwm pins
                wrpin   #%01_01001_0,x
                wxpin   pat,x
                wypin   #1,x
                dirh    x
                add     x,#1
.p
.loop           rflong  y
                qrotate y,y
                jmp     #.loop
pat             long    $0010_0001
x               res     1
y               res     1
Here is the future-silicon power test that runs from hub RAM (more power):
dat
                orgh    0
'
' Launch all cogs with test program.
'
                org
.loop           coginit cognum,#@pgm    'last iteration relaunches cog 0
                djnf    cognum,#.loop
cognum          long    7
'
' Toggle 8 pins, get long from hub, start cordic command
'
                org
pgm             cogid   x               'which cog am I, 0..7?
                shl     x,#3
                rep     @.p,#8          'start pwm pins
                wrpin   #%01_01001_0,x
                wxpin   pat,x
                wypin   #1,x
                dirh    x
                add     x,#1
.p
                wrlong  .loop+0,##$400  'copy the two-instruction loop into hub RAM
                wrlong  .loop+1,##$404
                jmp     #$400           'jump into hub RAM, so the loop runs in hub exec
.loop           qrotate y,y
                jmp     #.loop
pat             long    $0010_0001
x               res     1
y               res     1
Actually, the current-silicon test was doing an RFLONG, which was exercising the FIFO, but not nearly as much as hub exec does in the future-silicon version. So I imagine the future-silicon power is maybe only 58% of the current-silicon power.
And remember that doesn't include the 3.3V I/O power.
Using a quick Cpd comparison, I get that an 8-COG, all-pins P2+ needs to run at ~61 MHz SysCLK to match the power drain of an 80 MHz P1.
So if your numbers are right, jmg, that would basically mean we can run ~50% more PASM instructions on a P2 than on a P1 for the same core input power. And many of these P2 instructions are more flexible/powerful than the existing P1 instructions. Still seems like a net gain.
It will be interesting to see what current draw we could end up with on the P2 at low-frequency RC clock rates, for sleep modes etc., when hopefully many clocks can be gated off. Can that be estimated yet from these numbers?
You can use Cpd to calculate the 1-COG, 20 kHz number of around 36 uA + static Icc (which should be similar to the P2es static/leakage?), or measure a P2es with the clock stopped, then at 20 kHz, and scale the dynamic Icc by ~0.58.
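As a rough illustration of that kind of Cpd-style estimate, here is a minimal Python sketch. Only the linear scaling of dynamic current with frequency and the ~0.58 factor come from this thread; the 20 MHz reference current below is a made-up placeholder, not a measurement:

# Dynamic supply current scales roughly linearly with clock frequency
# (I_dyn ~ Cpd * Vdd * f), so a measured dynamic Icc can be scaled to
# another frequency, then multiplied by ~0.58 for the revised silicon.

FUTURE_VS_CURRENT = 0.58              # assumed dynamic-power ratio from this thread

def scale_dynamic_icc(icc_dyn_ref, f_ref, f_target, silicon_scale=1.0):
    """Linearly scale a measured dynamic Icc from f_ref to f_target."""
    return icc_dyn_ref * (f_target / f_ref) * silicon_scale

icc_dyn_20MHz = 10e-3                 # A at 20 MHz: running Icc minus stopped-clock Icc (placeholder)
sleep_icc = scale_dynamic_icc(icc_dyn_20MHz, 20e6, 20e3, FUTURE_VS_CURRENT)
print(f"estimated dynamic Icc at 20 kHz: {sleep_icc * 1e6:.1f} uA")   # ~5.8 uA here
# The measured static/leakage current is then added on top of this.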
So, just extending jmg's basic numbers:
A P2 running at ~61 MHz can execute >30M basic instructions per second, equivalent to >1.5x a P1's capability using the same set of basic instructions.
And a P2 has twice the pin count for interacting with surrounding logic.
Calculating... 1.5 x 2 = 3, so a P2 can outrun three P1s without resorting to a single Smart Cell, LUT sharing, or any Streamer-related features.
Even a lot more than that, because all the P2's features can be coordinated internally, something that costs a lot of time and resources when attempted with a 3 x P1 setup.
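A rough check of those factors, assuming 2-clock P2 instructions and 4-clock P1 instructions (hub-access timing ignored), in Python:

# Where the "1.5 x" and "3 x" factors above come from.
p2_mips = 61e6 / 2                  # ~30.5 MIPS per cog at ~61 MHz (2 clocks/instruction)
p1_mips = 80e6 / 4                  # 20 MIPS per cog at 80 MHz (4 clocks/instruction)
speed_factor = p2_mips / p1_mips    # ~1.53
pin_factor   = 64 / 32              # P2 has 64 I/O pins vs. 32 on the P1
print(speed_factor, speed_factor * pin_factor)   # ~1.53 and ~3.05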
I just went over the final approval docs and timing constraints with Wendy, and I signed off on everything. So the reticles are going to be made now and it will go into fab.
Fifteen weeks from now will be July 20, when the prototypes should be arriving. A month after that we should have 1,000 chips and new P2 Eval boards.
I've been working on the new Spin compiler, backing into the PUB/PRI section after updating all the ancillary functions. Hopefully, it will show signs of life soon. I'm happy to be working on tools, finally, like ersmith and RossH have been.
When you are back working on Spin2, please look at the enhancements @ersmith has already made in his Spin2 version.
Some of the things he did are very useful; the biggest ones are optional parameters when calling a function, multiple return values, @ @ @, include paths for objects, and #define/#ifdef.
I have not used the last ones, but they make a lot of sense for writing code that can run on both the P1 and the P2.
But fastspin does compile to native code, and even with 512K a bytecode interpreter will surely be needed. HDMI will use a lot of RAM.
Anyway, welcome back from the world of HDLs to the world of us normal(?) programmers. You may not even have realized how much fun it is to program the P2, having been stuck in its development for so long.
This brainchild of yours is just incredible. I am currently (re)writing a full-duplex driver, and every time I start over, the thing gets smaller and faster, even if I still fight to understand/use most of the changed/new instructions.
Nice power reduction! 0.8 W is quite manageable, and that is more or less a maximum, too. Most real-life use will be significantly less.
Hi rogloh,
You did beat me. I'm such a slow writer.
Fingers crossed this revision will be solid. Thank you, Chip.
And 1k eval boards? And/or chips? PARTY IN ROCKLIN
Thanks for bringing fun back into programming,
Mike