I just did a speed test using the known critical path of the MUL instruction executing and verifying in all 8 cogs. No surprise that it fails at the same frequency the smart pins fail at - 300MHz.
Here is the MUL test code:
dat
        orgh    0
'
' Launch all cogs with test program.
'
        org
        hubset  ##%1_000000_0000001101_1111_10_00   'enable crystal+PLL, stay in 20MHz+ mode
        waitx   ##20_000_000/100                    'wait ~10ms for crystal+PLL to stabilize
        hubset  ##%1_000000_0000001101_1111_10_11   'now switch to PLL (this setting is 20MHz * 14 = 280MHz)
.loop   coginit cognum,#@pgm                        'last iteration relaunches cog 0
        djnf    cognum,#.loop
cognum  long    7
'
' Do multiply test and indicate success on pin(cogid)
'
        org
pgm     cogid   x                                   'which cog am I, 0..7?
.loop   altr    yp                                  'redirect next result into y
        mul     a0,a1
        cmp     ap,y            wz                  'Z set if product matches
        drvz    x                                   'drive pin(cogid) with Z
        altr    yp
        mul     b0,b1
        cmp     bp,y            wz
        drvz    x
        jmp     #.loop
a0      long    $B9A3
a1      long    $5EBD
ap      long    $B9A3 * $5EBD
b0      long    $CB9A
b1      long    $49BF
bp      long    $CB9A * $49BF
yp      long    y
x       res     1
y       res     1
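For anyone following along without a board, here is a rough host-side model of what each cog is checking. It is only a sketch in Python, not Chip's code: it assumes MUL is an unsigned 16 x 16 multiply giving a 32-bit result and that DRVZ drives the pin to the Z flag, and `mul16`, `pin_states` and `flaky_mul` are names made up for illustration.

```python
# Hypothetical host-side model of the per-cog check in the listing above.
# Assumption: MUL is unsigned 16 x 16 -> 32 bits; DRVZ drives the pin to Z.
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul16(d: int, s: int) -> int:
    """Multiply the low 16 bits of each operand, keep a 32-bit result."""
    return ((d & MASK16) * (s & MASK16)) & MASK32

# Operand pairs and assemble-time products copied from a0/a1/ap and b0/b1/bp.
VECTORS = [
    (0xB9A3, 0x5EBD, 0xB9A3 * 0x5EBD),
    (0xCB9A, 0x49BF, 0xCB9A * 0x49BF),
]

def pin_states(multiply=mul16):
    """Pin level after each compare: True means the product matched (Z set)."""
    return [multiply(a, b) == expected for a, b, expected in VECTORS]

def flaky_mul(d: int, s: int) -> int:
    """Stand-in for a timing failure: one bit of the product is flipped."""
    return mul16(d, s) ^ 0x0001_0000

print(pin_states())            # healthy multiplier
print(pin_states(flaky_mul))   # multiplier with a corrupted bit
```

With only two fixed operand pairs the pin shows pass/fail for those two products only, which is why rolling values come up later in the thread.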
As for the test setup you have on hand, were all the ceramic decoupling caps installed, especially the ones (on both Vio and Vdd) that feed the heavily filtered internal voltage regulator for the VCO/PLL?
Have you devised any means of verifying the VCO's actual jitter figures?
Sure, but 280MHz will be very iffy. Just getting a part with slow process corner characteristics would likely dash that expectation. To be safe, stick with the 180MHz. Maybe 200MHz is reasonable for indoor environments.
Regardless of the MHz, could RCFAST be used as a dynamic low power mode?
Yes, but there's a 20kHz slow RC mode, too, that really drops the power.
I'd run an NCO-type adder, adding e.g. 2^32/111 for a 111-step test period (i.e. it scans the number range in 111 loops), or 2^16/111 for 16b math ops.
Then use the CORDIC Mul/Div & remainder to check the core MUL. You hope they fail differently, but give the same answer when they work.
Can you write that in FreeBASIC, and we can translate it to PASM2?
In CCalc, these are the test sweeps:
Av=round(2^16/111) = 590 (example adder value)
Mv16=0
Mv16=Mv16+Av;Mr32=Mv16*Av;Res = Mr32/Mv16
For a 16*16 Mul and 32b div, this gives Res = 590 (Av) in all cases, except when Mv16+Av wraps.
Or maybe, for better coverage:
Mv16=0
Mv16=(Mv16+Av)%2^16;Mr32=(Mv16*(Mv16-Av))%2^32;Res = Mr32/Mv16/Av
This gives Res = 0..110 as it sweeps through the number range. It looks to need a reset at 110, as it does not wrap nicely, at least in my CCalc test.
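The two sweeps above can be tried on a PC before anyone translates them to PASM2. This is a minimal Python sketch of the same arithmetic, using integer division where CCalc divides; `sweep_simple` and `sweep_coverage` are made-up names, and nothing here touches the CORDIC or the real MUL.

```python
# Sketch of the two CCalc sweeps, with plain Python integers.
AV = round(2**16 / 111)   # example adder value from the post (590)

def sweep_simple(steps=111):
    """Mv16 += Av; Mr32 = Mv16 * Av; Res = Mr32 / Mv16 (expected: Av)."""
    mv16 = 0
    results = []
    for _ in range(steps):
        mv16 = mv16 + AV
        mr32 = mv16 * AV
        results.append(mr32 // mv16)
    return results

def sweep_coverage(steps=111):
    """Mv16 wraps at 2^16; Res = Mr32 / Mv16 / Av counts up by one per step."""
    mv16 = 0
    results = []
    for _ in range(steps):
        mv16 = (mv16 + AV) % 2**16
        mr32 = (mv16 * (mv16 - AV)) % 2**32
        results.append(mr32 // mv16 // AV)
    return results

print(set(sweep_simple()))        # a single value if every step agrees
print(sweep_coverage()[:5], "...", sweep_coverage()[-1])
print(sweep_coverage(112)[-1])    # first step after Mv16 wraps
```

Running one step past 111 shows the "does not wrap nicely" behaviour: Mv16 wraps to a small value, Mv16-Av goes negative, and Res jumps out of the 0..110 range, so a hardware version would want to reset Mv16 at that point.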
You can see that the hub is what's burning the power. It must be that eggbeater arrangement. The cogs only add a little to the power consumption. I can make them run more complex code, but it doesn't change much. It almost seems like the big hub RAMs are all firing on every clock, doing reads when neither reads nor writes are called for. I've asked Wendy about this.
Fantastic results Chip. 192MHz is the USB sweet spot.
BTW 200MHz was my hope and you seem to have exceeded that easily.
I always expected the egg-beater to be a power hog. With all 8 cores running hubexec and in a loop causing continuous Hub reads, you are reading the whole 8 * 64KB blocks on every clock. I'll bet power will be significantly reduced if the cores run a core loop. As soon as I get time I'll write a test for this.
If you get a chance, I'd like to see the results without smart pins running. That would let us see the cores' power usage. Similarly without CORDIC/multiply.
CORDIC hardly adds anything. Same with hubexec. It seems to me that those hub RAMs are firing on every clock for some reason, and not just when we need them to. You can see that at boot up, running 25MHz, the chip takes 77mA. Only 4mA of that is cog 0. The rest is just the clock tree and maybe those big RAMs, though I think Wendy said she saw their ENA inputs toggling, as expected. It's just weird that no matter how strenuously I make the logic perform, it only adds a little current to the base draw at that frequency. Maybe tons of power is just being spent on the clock tree and idle flops. That would make sense. Nothing I do, though, affects power much. It's 75% a function of frequency.
Wendy is going to get us a hierarchical breakdown of power consumption on Monday, I assume. That might indicate something.
Interesting data: very MHz dependent, and largely COG independent.
With tight timing, and this much logic, I would expect the clock trees to be hogs.
A useful test would be to idle 8 cogs in a mostly-WAIT loop (e.g. toggle pins at 100, 110, 120 .. 180ms); that leaves just the clock trees.
It looks like COGINIT or STOP does not enable/disable the clock tree for that COG?
I am presuming hub access is every 2 clocks, so the ENA should toggle every 2nd clock if it's NOT being deselected when no cog/core access is required. I wonder whether the ENA is in fact left selected when no access is required. That could account for the current varying so little whether or not there is hub access.
Hub access is on every clock. There are no "multi-cycle" paths in this design. Everything resolves in one clock.
And here are the equivalent Cpd (power dissipation capacitance) values, from the CMOS logic formulae:
Pd = Cpd x Vcc^2 x Fi
Cpd=(Icc*Vcc)/(Vcc*Vcc*Fi)
Fi=20M;Icc=64m; Cpd = 1.777nF
Fi=60M;Icc=182m; Cpd = 1.685nF
Fi=100M;Icc=301m; Cpd = 1.672nF
Fi=140M;Icc=426m; Cpd = 1.690nF
Fi=180M;Icc=534m; Cpd = 1.648nF
Fi=220M;Icc=638m; Cpd = 1.611nF
Then enable 7 more COGs:
Fi=220M;Icc=869m; Cpd = 2.194nF
(Cpd-1.611e-9)/7 = 83.333pF added per COG
This shows the clock tree is quite significant (that's the Cg of a moderate power MOSFET), and the incremental addition per extra started COG is more modest, at 83.33pF/COG.
The figures above show a downward slope in Cpd because some static Icc was not allowed for, so we can iterate to find an equivalent offset value of Iq that gives a nominally flat Cpd.
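The Cpd figures are easy to reproduce. Below is a small Python sketch of the same formula with an optional static-current offset, so the iteration on Iq can be tried by hand. Vcc = 1.8V is an assumption (it is the value that reproduces the posted numbers, it is not stated in the post), and `cpd` and `cpd_spread` are illustrative names.

```python
# Sketch: Cpd = (Icc - Iq) / (Vcc * Fi), from Pd = Cpd * Vcc^2 * Fi.
VCC = 1.8  # volts; ASSUMED, chosen because it reproduces the posted Cpd values

# (Fi in Hz, Icc in A) with one cog running, copied from the post.
MEASURED = [
    (20e6, 0.064), (60e6, 0.182), (100e6, 0.301),
    (140e6, 0.426), (180e6, 0.534), (220e6, 0.638),
]
ICC_8_COGS_220MHZ = 0.869  # A, after enabling 7 more cogs

def cpd(icc, fi, iq=0.0, vcc=VCC):
    """Equivalent power dissipation capacitance in farads."""
    return (icc - iq) / (vcc * fi)

def cpd_spread(iq):
    """Max minus min Cpd across the sweep; smaller means a flatter Cpd."""
    values = [cpd(icc, fi, iq) for fi, icc in MEASURED]
    return max(values) - min(values)

for fi, icc in MEASURED:
    print(f"{fi / 1e6:5.0f} MHz  Cpd = {cpd(icc, fi) * 1e9:.3f} nF")

per_cog = (cpd(ICC_8_COGS_220MHZ, 220e6) - cpd(0.638, 220e6)) / 7
print(f"per extra cog: {per_cog * 1e12:.2f} pF")

# Try a few candidate static currents and see which flattens Cpd the most.
for iq_ma in (0, 5, 10):
    print(f"Iq = {iq_ma:2d} mA  spread = {cpd_spread(iq_ma / 1000) * 1e12:.1f} pF")
```

The spread printout is just a crude flatness measure. A proper line fit of Icc against Fi would be another way to estimate Iq, and it need not land on exactly the same offset as iterating by eye.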
"Hub access is on every clock. There are no "multi-cycle" paths in this design. Everything resolves in one clock." - So what is the clock latency from one side of the die to the other? Depending on how the memories and everything else is spread out over the entire die, the clock latency could be an issue and it might make sense with what you are observing so far. I ran the Avanti Place and Route tool for the DP83865 and to get timing just right the tool literally ran for days through several iterations. If the timing was off by as much as a couple of ns clock sensitive areas, specifically memories became very unstable.
Jmg, there are 88k flops on the die. No clock-gating was done at the Verilog level. The synthesis tools, though, might have implemented something. Also, I think I remember seeing something about a flop being clocked, but without its data input changing, resulting in lowest power, which makes sense, but implies no clock gating.
Congrats getting this far on P2 BTW
The clock tree was synthesized with a jitter budget that ultimately afforded a clock uncertainty of 300ps.
The DACs seem to be working well. Here is a 438KB 5:6:5 RGB image in RAM being displayed via VGA.
P0 = HSYNC, 123-ohm 3.3V DAC, pixel-locked to color signals on P1/P2/P3
P1 = Blue, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
P2 = Green, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
P3 = Red, 75-ohm 2.0V DAC, VGA monitor adds 75 ohms to GND to make 1.0V DAC
P4 = VSYNC, digital output
Here is the code. Note that the PLL is using the biggest XI divide possible (64) and a big VCO divide of 512, with a post-VCO divide of 2. This results in 20MHz / 64 * 512 / 2 = 80MHz, but with the slowest/weakest PLL feedback possible, and the image shows no jitter, at all:
'*************************************
'* VGA 640 x 480 x 16bpp 5:6:5 RGB *
'*************************************
CON
        intensity = 80                  '0..128
        fclk      = 80_000_000.0
        fpix      = 25_000_000.0
        fset      = (fpix / fclk * 2.0) * float($4000_0000)
        vsync     = 4                   'vsync pin
DAT     org
'
'
' Setup
'
        hubset  ##%1_111111_0111111111_0000_10_00  'enable crystal+PLL, stay in 20MHz+ mode
        waitx   ##20_000_000/100        'wait ~10ms for crystal+PLL to stabilize
        hubset  ##%1_111111_0111111111_0000_10_11  'now switch to PLL running at 80MHz
        rdfast  ##640*350*2/64,##$1000  'set rdfast to wrap on bitmap
        setxfrq ##round(fset)           'set transfer frequency to 25MHz
        'the next 4 lines may be commented out to bypass signal level scaling
        setcy   ##intensity << 24       'r set colorspace for rgb
        setci   ##intensity << 16       'g
        setcq   ##intensity << 08       'b
        setcmod #%01_0_000_0            'enable colorspace conversion (may be commented out)
        wrpin   dacmode_s,#0            'enable dac modes in pins 0..3
        wrpin   dacmode_c,#1
        wrpin   dacmode_c,#2
        wrpin   dacmode_c,#3
        setnib  dira,#%1111,#0
'
'
' Field loop
'
field   mov     x,#90                   'top blanks
        call    #blank
        mov     x,#350                  'set visible lines
line    call    #hsync                  'do horizontal sync
        xcont   m_rf,#1                 'visible line
        djnz    x,#line                 'another line?
        mov     x,#83                   'bottom blanks
        call    #blank
        drvnot  #vsync                  'sync on
        mov     x,#2                    'sync blanks
        call    #blank
        drvnot  #vsync                  'sync off
        jmp     #field                  'loop
'
'
' Subroutines
'
blank   call    #hsync                  'blank lines
        xcont   m_vi,#0
_ret_   djnz    x,#blank
hsync   xcont   m_bs,#0                 'horizontal sync
        xzero   m_sn,#1
_ret_   xcont   m_bv,#0
'
'
' Initialized data
'
dacmode_s long  %0000_0000_000_1011000000000_01_00000_0  'hsync is 123-ohm, 3.3V
dacmode_c long  %0000_0000_000_1011100000000_01_00000_0  'R/G/B are 75-ohm, 2.0V
m_bs    long    $CF000000+16            'before sync
m_sn    long    $CF000000+96            'sync
m_bv    long    $CF000000+48            'before visible
m_vi    long    $CF000000+640           'visible
m_rf    long    $2F000000+640           'visible rfword rgb16 (5:6:5)
x       res     1
y       res     1
'
'
' Bitmap
'
        orgh    $1000 - 70              'justify pixels at $1000
        file    "birds_16bpp.bmp"       'rayman's picture (640 x 350)
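The clock arithmetic in that listing can be checked on a PC. Here is a Python sketch that decodes the HUBSET value, recomputes the SETXFRQ constant and totals up the frame timing. The field layout (%E_DDDDDD_MMMMMMMMMM_PPPP_CC_SS, with XI divide = D+1 and VCO multiply = M+1) is read off the comments and the 20MHz / 64 * 512 / 2 figure in the post. The rule that PPPP=%1111 means divide-by-1 and anything else means (PPPP+1)*2 is my assumption, and the listing only exercises PPPP=%0000. Function names are made up.

```python
# Sketch: sanity-check the PLL, pixel NCO and frame timing from the VGA listing.
XTAL_HZ = 20_000_000  # crystal frequency quoted in the post

def pll_hz(setting, xtal_hz=XTAL_HZ):
    """Decode %E_DDDDDD_MMMMMMMMMM_PPPP_CC_SS into a PLL output frequency."""
    d = (setting >> 18) & 0x3F     # XI divide field, divide by d + 1
    m = (setting >> 8) & 0x3FF     # VCO multiply field, multiply by m + 1
    p = (setting >> 4) & 0xF       # post-VCO divide field
    post_div = 1 if p == 15 else (p + 1) * 2   # ASSUMED mapping
    return xtal_hz / (d + 1) * (m + 1) / post_div

def xfrq(fpix, fclk):
    """Same expression as fset in the CON block."""
    return round((fpix / fclk * 2.0) * 0x4000_0000)

def refresh_hz(fpix, clocks_per_line, lines_per_field):
    return fpix / clocks_per_line / lines_per_field

fclk = pll_hz(0b1_111111_0111111111_0000_10_11)
fset = xfrq(25_000_000.0, fclk)
clocks_per_line = 16 + 96 + 48 + 640    # m_bs + m_sn + m_bv + visible
lines_per_field = 90 + 350 + 83 + 2     # top + visible + bottom + sync
rate = refresh_hz(25_000_000.0, clocks_per_line, lines_per_field)

print(f"fclk = {fclk / 1e6:.1f} MHz")
print(f"fset = ${fset:08X}")
print(f"{clocks_per_line} pixel clocks/line, {lines_per_field} lines/field, {rate:.2f} Hz")
```

So the 350-line picture sits inside ordinary 800 x 525 VGA timing, with the extra blank lines padding out the field.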
Comments
Can you do a MUL/DIV suite, with rolling values, and toggle one pin on PASS and another pin on FAIL?
What does that look like?
What is Icc in the 20kHz slow RC mode, and how does the kHz change as the part cools down?
https://docs.google.com/spreadsheets/d/1maeX7_tFKshFVPXdt9scWupgoN61MaEniNhF2_901l4/edit?usp=sharing
Maybe one of you superdogs who knows how to work the Google can turn this into a graph?
But, here's a quick Excel plot.
This can obviously be done better...
One way I can think of to impose an almost "silent mode" on the Hub circuits is to ensure that each and every one of the eight Cogs does a single group of Hub-related operations (e.g., RDLONG) targeting the same address (0000, perhaps?), or even a WRLONG followed by a RDLONG, using two options for the value of the written data (0000, then FFFF) and two options for the target address (0000, then the last accessible long in the memory map), before settling into a one-instruction, infinite loop of Cog code, without any further Hub-related access or referencing.
That way, the contents of all the egg-beater address/data input latches will reflect the last address/data from each Cog-Hub transaction (and will remain steady from that point on), as will the address/data muxes at the Hub RAM. They would always see the same values while rotating, giving each Cog the opportunity for a new access, which will never come until Reset is applied.
You could also dedicate eight pins, cleared simultaneously by the Cog that launches the others, then set to 1 individually at the end of each Cog's routine, before each one enters its own infinite loop and goes quiet.
That way you could verify that each and every one has completed its task and is standing still, from a Hub-access standpoint, until Reset is applied.
Hope it helps
Henrique
I'd like to know what the RCSLOW current is (if RCSLOW is even an option).
Bean
The Cpd figures slope downward because some static Icc was not allowed for. Iterating to find an equivalent offset Iq that gives a nominally flat Cpd yields Iq ~ 5mA and Cpd ~ 1.6nF for P2.
Thanks, Beau.
Can't wait to see all the graphics modes & features everyone comes up with on the P2.