Here's a graph of Prop2 done to a similar scale as in the Prop1 datasheet, which I've included for comparison.
At 20MHz an idle P2 is consuming 64mA, which is ~90x the current of an idle P1 (~0.7mA, hub-only trace). However, the P1 is running at 3.3V and the P2 at 1.8V, so the actual power consumed is only ~49x as much (ignoring the IO voltage current).
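As a quick sanity check on those ratios, here's a back-of-envelope sketch using only the figures quoted above (IO-rail current ignored, as noted):

```python
# Idle current/power comparison from the figures in the post above.
p1_current_a = 0.7e-3   # idle P1, hub-only trace
p1_volts = 3.3
p2_current_a = 64e-3    # idle P2 at 20 MHz core clock
p2_volts = 1.8

# Current ratio ignores the differing core voltages...
current_ratio = p2_current_a / p1_current_a
# ...while the power ratio accounts for them.
power_ratio = (p2_current_a * p2_volts) / (p1_current_a * p1_volts)

print(f"current: {current_ratio:.0f}x, power: {power_ratio:.0f}x")
```

which lands on ~91x the current but only ~50x the power, matching the post.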
Been thinking some more about this. Perhaps I am missing something, but the explanation of DIR and OUT lines coupling over a long distance to the pin doesn't make sense to me.
The pin final CMOS drivers should be located extremely close to the pin. That was what the ring frame with custom pin drivers and analog was all about. These circuits are right at the pin, aren't they? The DIR and OUT lines driving these CMOS drivers will never float, as they are always driven, connected to the gates (think base of a transistor) of the final MOS drivers, so why are they picking up noise???
Surely this is going to affect the correct functioning of the pins in analog mode too?
Just my midnight thoughts while trying to sleep.
Postedit: Could it be skew between the DIR and OUT ???
If the OUT and DIR changed at the same time and the lines were skewed the OUT could take effect slightly before or after the DIR tristates the output.
Cluso,
The noise is all internal to the synthesised Verilog; it's in the long OUT signal routes going from the hub staging to the pin-ring. The same long routes that we had a timing problem with on the FPGAs. BTW: that's something to test now too.
Chip will be pursuing making those internal signals cleaner, as the first option.
Another solution would be to change the outputs to always use the currently selectable final registered stage that's in the custom pin-ring. It would add one more clock cycle to the propagation of output actions. To be honest, such a change probably wouldn't be a huge issue, since there are already a lot of registered stages now anyway.
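To illustrate the latency cost being discussed, here's a toy model of routing OUT through one extra register stage. The signal names and power-up state are made up for illustration; this is not the P2 netlist, just the general behaviour of a registered output path:

```python
def drive(out_stream):
    """Unregistered path: the pad sees OUT in the same cycle."""
    return list(out_stream)

def drive_registered(out_stream):
    """Registered path: one flop in the pin-ring adds a cycle of latency."""
    q = 0                  # assumed flop power-up state
    pad = []
    for out in out_stream:
        pad.append(q)      # pad shows last cycle's captured value
        q = out            # flop captures OUT on the clock edge
    return pad

stream = [0, 1, 1, 0, 1]
print(drive(stream))            # [0, 1, 1, 0, 1]
print(drive_registered(stream)) # [0, 0, 1, 1, 0]
```

Same waveform, one clock later — and, crucially for the glitch problem, the flop only passes values sampled at clock edges, so a narrow coupled spike between edges never reaches the pad.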
Evan,
I need to draw up what I believe the circuit looks like as the long lines are not the pins, but a stage prior.
If DIR tristates the pin without OUT changing, then there should not be the glitch we are seeing. However, if OUT changes too, then it's possible that timing skew is causing the glitch.
... If DIR tristates the pin without OUT changing, then there should not be the glitch we are seeing.
The thinking is that just the proximity of DIR to OUT, running as a pair, is causing OUT to droop when DIR goes low, enough to glitch its pin driver before it is disabled. Each pair produces different, but consistent, glitch sizes, depending on the geometries of the pairs in those long routes.
The DIR and OUT lines driving these cmos drivers will never float as they are always driven to connect to the gates (ie think base of transistor) of final mos drivers, so why are they picking up noise???
I mean... the lines have non-zero impedance. They don't need to float for one to pick up a transient from the other, and if they're running close enough over a long enough distance, and in silicon they could be quite close indeed...
I was just thinking about that analog test chip...
Couldn't one just add some capacitance between the DIR and OUT signals and see if the glitch shows up?
Evan,
I need to draw up what I believe the circuit looks like as the long lines are not the pins, but a stage prior.
If DIR tristates the pin without OUT changing, then there should not be the glitch we are seeing. However, if OUT changes too, then it's possible that timing skew is causing the glitch.
Not quite, this is actually not all digital, but analog.
When DIR is driven low, there is no logical change to OUT, but enough parasitic coupling will spike OUT.
If the spike on OUT is wide enough to cross the threshold of the next buffer, it will propagate to the pin.
If you look at the plots, you can see this spike+threshold effect, where there is either no disturbance or a significant one.
With the physical layout info, it should be possible to confirm the crosstalk and verify that the observed-worst pins match the simulated-worst ones.
That is one way to confirm this model of the failure.
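The all-or-nothing behaviour described above can be sketched in a few lines. All numbers here are illustrative assumptions, not measured P2 values:

```python
def pin_glitches(coupling_ratio, threshold_v=0.9, rail_v=1.8):
    """True if the spike coupled onto OUT crosses the next buffer's threshold.

    coupling_ratio: fraction of the DIR edge (a full rail_v swing) that
    couples onto OUT for one particular routed pair. Different pairs have
    different geometries, hence different ratios.
    """
    spike_v = coupling_ratio * rail_v
    # Threshold effect: either the buffer ignores the spike entirely,
    # or it regenerates it into a full-size glitch at the pin.
    return spike_v > threshold_v

for ratio in (0.2, 0.45, 0.6):
    print(f"coupling {ratio}: glitch = {pin_glitches(ratio)}")
```

This matches the plots as described: nothing at all below the threshold, a significant disturbance above it, with each pin's outcome consistent because its pair geometry is fixed.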
I was just thinking about that analog test chip...
Couldn't one just add some capacitance between the DIR and OUT signals and see if the glitch shows up?
Yes, if those DIR and OUT lines connect directly, you could arrange deliberate cross-talk drive signals that glitched OUT on DIR change.
Those lines might not include the 1v8 <-> 3v3 level translators.
On clock gating - in a perfect world we'd have 2 dies available, one optimized for f_max and one with all of the clock gating available for lower power. But probably one option needs to be selected now.
To start with, I'd want to see whichever one covers more of Parallax's expected customer base, while balancing that against the risk of potential problems being introduced by the clock gating. That way we have the best opportunity to get a 2nd 'version', or a smaller process, or to support better development tools, other products from Parallax, etc. in the future.
I'm sure I'll use the P2 in some hobby projects regardless of which direction is selected - but I'll never buy enough chips to make a difference in the volume numbers that say which way to go, fast vs lower power.
If I need low power, I usually need ultra-low power. For that I use MCUs designed especially for ultra-low-power applications. In the P2 I prefer more MIPS.
By lowering the Fmax, the tool would be able to insert real clock gates on most flops.
IIRC you mentioned 88,000 FFs clocked - that maps to roughly 0.5pF Cpd per FF.
Can you tabulate the FFs used by each of COG / SmartPin cell / CORDIC?
Does it do that gate insert, right at the FF, for every FF ?
That's easiest for automated tools, but it means you still have a very large clock tree; all that changes is that a single-gate load appears where an FF may have presented more load.
If the tool is instead told to break out 8 clock trees for the COGs, those whole clock trees are silent if unused.
Getting the lowest-power wait means the WAIT opcode needs a local clock stub that holds off the COG tree until the WAIT expires.
I'm not clear if their synthesis tools can (easily?) do that, but I guess it must be possible, as that's how power is driven lowest these days.
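The per-FF versus per-tree distinction above can be put in rough numbers. This is a back-of-envelope sketch; the 0.05pF per-FF clock-pin load, the 2nF shared-tree capacitance, and the one-busy-COG scenario are illustrative assumptions, not measured P2 figures:

```python
def clock_power_mw(active_ffs, tree_fraction_active,
                   c_ff_pf=0.05, c_tree_pf=2000.0,
                   vdd=1.8, f_hz=20e6):
    """Dynamic clock power P = C_switched * Vdd^2 * f, in mW.

    active_ffs: FFs whose clock pins still toggle.
    tree_fraction_active: fraction of the shared clock tree still toggling.
    """
    c_switched_f = (active_ffs * c_ff_pf
                    + tree_fraction_active * c_tree_pf) * 1e-12
    return c_switched_f * vdd * vdd * f_hz * 1e3

total_ffs = 88_000
busy_ffs = total_ffs // 8   # say 1 of 8 COGs is busy

# Per-FF gate insertion: idle FFs stop, but the whole tree still toggles.
per_ff = clock_power_mw(active_ffs=busy_ffs, tree_fraction_active=1.0)
# Per-COG tree gating: 7/8 of the tree goes silent as well.
per_tree = clock_power_mw(active_ffs=busy_ffs, tree_fraction_active=1/8)

print(f"per-FF gating: {per_ff:.1f} mW, per-tree gating: {per_tree:.1f} mW")
```

Whatever the real capacitances are, the shape of the result is the point being argued: gating whole COG trees removes the tree's own switching power, which per-FF gate insertion leaves untouched.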
What is the shortest code that will detect the current chip compared to the fully-working version?
The shortest code would test for the ALTx sign-extension of S[17:09] being missing as it's added into D. How about this:
' Test for current or new chip
altr dreg,sreg 'ALTR shouldn't affect TJZ
tjz dreg,#new_chip
current_chip
dreg long 1 'becomes 0 via ALTR on new chip
sreg long $1FF << 9 '-1 on new chip, +$3FE00 on current chip
Thanks, Chip.
If we change XORO32 (and I think we should) that would also tell us. We'd need only one reg for this test, to hold the state, but I'm not sure how the two methods compare overall.
I think changes should be avoided so as not to add additional mishaps. Else we end up in the Columbo situation again: just one more change and we are finally done...
I know @"Peter Jakacki" wants to change the TAQOZ ROM already, @Chip thinks about lowering Fmax, @Jmg talks about redoing the clock tree to switch off cores, the CORDIC or the streamer(?), what else have we got - new instructions, anybody?
Yes, since a respin needs to be done and paid for, let's change as much of the Verilog as possible, so we can do another test run on FPGAs for a couple of months. @Ken might need the time to sell more WX-Badges to earn the needed money.
Seeing some comments above in favour of the speed over power consumption, it would be nice to see more opinions in this regard. I also favor speed of the P2 over its current.
Seeing some comments above in favour of the speed over power consumption, it would be nice to see more opinions in this regard. I also favor speed of the P2 over its current.
If I think about the P2 as something akin to an FPGA, where performance is everything, then power is less of a concern.
Maybe we could even push for a 200MHz rating on the next silicon.
Chip, I agree. The P2 is never going to be an ultra-low power device. I say go for the fastest speed which is where the P2 shines. You guys are so close, I can taste it. Don't screw it up now. Just saying now is not the time to be adding features. You need to leave something for the P3....
Bean
Maybe we could even push for a 200MHz rating on the next silicon.
I think the best way to target a 200MHz spec is to adjust VCore min & Tmax to give an fmax envelope, just like other vendors do.
For clock gating with minimal speed impact, I was thinking about approx 12 very early AND gates.
These go after the VCO divider: 8 are driven from cogstart, 1 enables the CORDIC, and 1 (maybe 2 or 3) enables smart pin cells (but not the pin IO sync FFs). Any non-gated clocks feed via a same-delay path, so only the PLL phase moves.
2 more config enable bits could feed RCSLOW/RCFAST into a chosen smart pin cell (TX pin?). This last bit permits the P2 to read & calibrate the RC oscillators, and also to read die temp.
Most hot parts these days can read their own temperature, and even with no clock gating, P2 needs to be able to check die temp.
Jmg, the current oscillators are designed to be temperature-invariant. Plus, they are only enabled when in use. It might be a lot more practical to use a smart pin in Schmitt feedback mode to form an oscillator whose temperature-dependent frequency can be measured. Or, it might be good to take a calibration reading on an ADC.
True, I'm confident in there being additional future models of the Prop2.
That will be the case for sure. The propeller gained momentum and if there is a project that booms, all the changes will be made to bring this project through the ceiling. Nothing is more successful than success.
Here's that old thread with pinout of test chip: https://forums.parallax.com/discussion/165303/prop2-analog-test-chip-arrived/p1
The ROM has a version number, which a boot-host can query, but I'm not sure what run-time code access to version ID info there is?
Mike
Currently we are both testing P2 code.
Welcome, Uwe_Filter.