P2 - Minimising power usage in a COG (cpu core)

Chip,
How can we minimise the power used by a cog?
Suppose we have a cog running a driver to some external interface (eg an SD Card Driver). What steps should we take to minimise power when it's idle?
In P1 we had waitcnt, waitpxx, etc.
Typically we will be waiting on a command via hub.
Some ideas...
* Use a mix of rdbyte and waitx - introduces some latency but may not be an issue
* Have the calling cog use the COGATN instruction, and wait for it in a POLLATN or WAITATN loop
* Use a LOCK and wait for the lock to become "locked"
* Anything else? Or does it not matter?
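The first two ideas above can be sketched in PASM2 roughly as follows. This is only a minimal illustration, not a tested driver: the one-byte mailbox, the labels (`poll_drv`, `atn_drv`, `cmd`, `cmd2`) and the convention that the launching cog passes the mailbox address in PTRA (via SETQ before COGINIT) are all assumptions made for the example.

```spin2
DAT
                org     0
' Variant A: timed polling of a one-byte hub mailbox (address assumed in PTRA)
poll_drv        waitx   #500                    ' stall roughly 500 clocks between polls
                rdbyte  cmd, ptra       wz      ' Z=1 while the mailbox holds 0 (no command)
        if_z    jmp     #poll_drv               ' nothing posted yet, go back to sleep
                ' ... service the command held in cmd (hypothetical) ...
                wrbyte  #0, ptra                ' clear the mailbox so the caller sees completion
                jmp     #poll_drv
cmd             res     1

                org     0
' Variant B: stall until another cog strobes this cog's attention flag
atn_drv         waitatn                         ' no hub traffic at all while idle
                rdbyte  cmd2, ptra      wz      ' fetch whatever the caller posted
        if_z    jmp     #atn_drv                ' woken with an empty mailbox, wait again
                ' ... service the command held in cmd2 (hypothetical) ...
                wrbyte  #0, ptra
                jmp     #atn_drv
cmd2            res     1
```

With variant B, the calling cog would write the command byte first and then issue COGATN with the driver cog's bit set in the mask. Variant A trades wake-up latency against how long each WAITX stalls.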
Comments
COGINIT #%1_x_cccc,S/#
On restart (warm boot), remember to restore your IO pin dirs/states too.
So how low can we go waiting at RCFAST with one COG?
curious,
Mike
Evanh, would you mind comparing VDD current between these two different cog programs?
DAT
        org
pgm0    jmp     #pgm0

        org
pgm1    waitx   #500
        jmp     #pgm1
        |      Prop2 revA            Prop2 revB
        |   RCFAST    RCSLOW      RCFAST    RCSLOW
--------|-----------------------------------------
PGM0    |  68.4 mA    101 uA     32.4 mA     82 uA
PGM1    |  65.4 mA     99 uA     28.6 mA     77 uA
EDIT: It's because of an automatic implementation feature, called "clock gating", that On Semi enabled upon Chip's request.
There was a link Chip found to an article about a subject called "dark silicon". It was about reorienting and taking advantage of the problem of integrated circuits getting so densely packed with compute power that the thermal management had become a problem.
The premise was that RAM blocks, in particular, do not consume power the way logic does. Mainly because only one location at a time is accessed, the rest are sitting idle waiting their turn and therefore not drawing much current. This naturally low power state is termed as dark.
In effect, clock gating brings an auto-darkness mode to the processor logic, or more accurately the clocking circuits of the intervening flip-flops.
Of course, when you're designing compute chips, and the power consumption is slashed in half for you, you then immediately add more compute to use all available power again.
I wish there was a way to measure the RCSlow frequency. I guess there may be some way to do it by having an external RC circuit on some pin and measuring its response / time constant while operating on crystal, then repeating while on RCSlow. I guess we could use the 1 kohm DAC mode; then all you need is an external reference capacitance.
dat
        org
        hubset  #1
        dirh    #56
        rep     #2, #0
        waitx   #5-4
        outnot  #56
A one-off calibration should be possible using a USART.
E.g. at the lowest 367 baud of the UB3, you can calibrate to 0.2% per 0x00 char, limited by the auto-cal jitter of the xtal-less bridges.
If you need low-ppm cal/track, then use a Xtal-UART and a small burst of chars.
Thanks Evan for doing that.
Would you mind adding figures for say 180MHz and 360MHz please?
This would give us some nice comparisons.
AFAIK the clock gating is helping the logic use less power. The RAM blocks that ON provide most likely already include the clock gating although IIRC the hub blocks were still being enabled if not required.
Adding this test running in hubexec would be a nice comparison too. Just the 180 and 360 would be fine.
I remember Chip posted a couple of numbers that indicated roughly 45% of those for the revB silicon.
EDIT: Here's some new measurements of those single cog loops above (note the slightly higher RCFAST, I've changed to higher capacity meter for these and retested RCFAST with it)
        |      Prop2 revA            Prop2 revB
        |    PGM0      PGM1        PGM0      PGM1
--------|-----------------------------------------
RCSLOW  |   101 uA     99 uA      82 uA     77 uA
RCFAST  |  73.4 mA   70.1 mA    33.3 mA   29.4 mA
 50 MHz |   157 mA    150 mA    66.2 mA   58.3 mA
100 MHz |   309 mA    295 mA     132 mA    116 mA
180 MHz |   541 mA    517 mA     234 mA    206 mA
250 MHz |   734 mA    703 mA     323 mA    284 mA
300 MHz |   867 mA    830 mA     384 mA    339 mA
360 MHz |  1023 mA    978 mA     460 mA    405 mA
Thanks, Evanh. I don't know why, but I remember the difference being really big. It seems to make little difference, though.
        |      Prop2 revA  (VDD = 1.83 V)             Prop2 revB  (VDD = 1.855 V)
        |    PGM0     PGM1     PGM2     PGM3       PGM0     PGM1     PGM2     PGM3
--------|---------------------------------------------------------------------------
RCSLOW  |  107 uA   104 uA   110 uA   116 uA      83 uA    79 uA    87 uA    96 uA
RCFAST  | 73.6 mA  70.3 mA  78.6 mA  84.7 mA    33.3 mA  29.4 mA  39.2 mA  47.4 mA
 50 MHz |  157 mA   150 mA   168 mA   181 mA    66.2 mA  58.3 mA  78.4 mA  94.1 mA
100 MHz |  309 mA   295 mA   330 mA   355 mA     132 mA   116 mA   156 mA   187 mA
180 MHz |  542 mA   518 mA   578 mA   621 mA     234 mA   206 mA   277 mA   331 mA
250 MHz |  736 mA   704 mA   783 mA   843 mA     323 mA   284 mA   380 mA   455 mA
300 MHz |  869 mA   832 mA   922 mA   994 mA     384 mA   339 mA   453 mA   540 mA
360 MHz | 982 mA 459 mA 405 mA 535 mA 641 mA
        | 1.755 V 1.76 V 1.755 V 1.817 V 1.82 V 1.81 V 1.802 V
' program 2
'
        rep     #4, #0                  'loop forever burning the cordic
        getrnd  pa
        qmul    pb, pa
        add     pb, pa
        mul     pb, pa
' program 3
        wrfast  #0, #0
        rep     #2, ##(64*1024/4)       '64 kB random hubRAM data
        getrnd  pa
        wflong  pa

        rdfast  ##(64*1024/64), #0      '64 kB block rollover
'       xinit   ##(%0011_0000_1000_0001<<16) | $ffff, #0        'revA silicon, full sysclock/1 continuous FIFO reading
        xinit   ##(%1011_0000_1000_0001<<16) | $ffff, #0        'revB silicon, full sysclock/1 continuous FIFO reading

        rep     #4, #0                  'loop forever burning the cordic
        getrnd  pa
        qmul    pb, pa
        add     pb, pa
        mul     pb, pa
Is it possible to add a test with PGM1 running in all 8 COGs, to get an all-COG baseline number?
I don't think the clock gating is smart enough to shut down all logic of not-started-yet COGs, so hopefully the current per added COG is not large.
PGM3 in all 8 COGS would bump the current quite a bit, given the added activity.
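An all-cog version of the PGM1 test might be set up along these lines. This is a rough sketch rather than the code actually used for any measurement in this thread: the launcher label `launch`, the counter `n`, and the assumption that the launcher is the only cog running at boot are all illustrative.

```spin2
DAT
                org     0
' Launcher (assumed to be the only running cog): start PGM1 in the other seven cogs
launch          mov     n, #7                   ' seven more cogs to start
.next           coginit #%0_1_0000, ##@pgm1     ' %0_1_0000 = load from hub, next free cog
                djnz    n, #.next
.idle           waitx   #500                    ' the launcher then idles the same way as PGM1
                jmp     #.idle
n               res     1

                org     0
pgm1            waitx   #500                    ' same idle loop as PGM1 above
                jmp     #pgm1
```

Swapping the `pgm1` image for the PGM3 code would give the heavier all-cog case mentioned above.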
Any idea why V6063 is needed?
Has anybody tried this?
Ok, I see. I'm wondering if P2 can go to even lower power by removing all VIO. Perhaps powering them back up at intervals to check for the need to wake...
If you are not using any of the analog functions, I thought the VIO static overhead per small VIO group was single-digit uA?
(i.e. not significant?)
EDIT: Yep, I'm pretty certain I've got that one verified. With program #1, one cog running on the revB chip at 360 MHz uses 405 mA, while with eight cogs it uses only 333 mA. Pretty much the same as one cog at 300 MHz.
EDIT2: And on the revA chip they are practically the same. Both almost 1 amp. Eight cog program is a few mA higher. Tried at 180 MHz and same story, both around 520 mA.
I guess the revA is not a surprise, but the revB chip sure is. It appears an idle cog uses less power than a stopped cog on the revB silicon.