P2 - Minimising power usage in a COG (cpu core)

Cluso99 · 2019-09-19 19:32

Chip,
How can we minimise the power used by a cog?

Suppose we have a cog running a driver to some external interface (eg an SD Card Driver). What steps should we take to minimise power when it's idle?
In P1 we had waitcnt, waitpxx, etc.

Typically we will be waiting on a command via hub.

Some ideas...
* Use a mix of rdbyte and waitx - introduces some latency but may not be an issue
* Have the calling cog use the COGATN instruction, and wait for it in a POLLATN or WAITATN loop
* Use a LOCK and wait for the lock to become "locked"
* Anything else? Or does it not matter?

ozpropdev · 2019-09-19 22:47

You could use a COGSTOP and when needed do a cog restart.

COGINIT #%1_x_cccc,S/#

On restart (warm boot), remember to restore your I/O pin directions and states too.

cgracey · 2019-09-20 07:16

I noticed on the new silicon, which has clock gating, that any kind of stall slashes power. Even WAITX reduces cog power to a fraction.

msrobots · 2019-09-20 08:10

How much? Can you quantify or measure it?

So how low can we go, waiting at RCFAST with one cog?

curious,

Mike

evanh · 2019-09-20 09:16

I'm seeing 29 mA on VDD and 1.5 mA on VIO at RCFAST and all idling. NOTE: VIO current will mostly be the LDOs I presume. I'm measuring this on the 5 Volt supply to them.

evanh · 2019-09-20 09:23

RCSLOW is impressive, I'm seeing idling there down to 90 uA on VDD.

Cluso99 · 2019-09-20 11:32

Excellent news

cgracey · 2019-09-20 21:44

evanh wrote: »

RCSLOW is impressive, I'm seeing idling there down to 90 uA on VDD.

Evanh, would you mind comparing VDD current between these two different cog programs?

DAT
	org

pgm0	jmp	#pgm0


	org

pgm1	waitx	#500
	jmp	#pgm1

evanh · 2019-09-20 23:03

        |     Prop2 revA           Prop2 revB
        |  RCFAST    RCSLOW     RCFAST    RCSLOW
--------|-----------------------------------------
PGM0    | 68.4 mA    101 uA    32.4 mA     82 uA
PGM1    | 65.4 mA     99 uA    28.6 mA     77 uA
        |

Publison · 2019-09-20 23:13

That's a significant reduction in RCFAST mode!

evanh · 2019-09-20 23:19

Yeah, a good 50% off all the way up to 400 MHz. Lots of people have mentioned how much cooler running the revB is.

EDIT: It's because of an automatic implementation feature, called "clock gating", that On Semi enabled upon Chip's request.

There was a link Chip found to an article about a subject called "dark silicon". It was about reorienting and taking advantage of the problem of integrated circuits getting so densely packed with compute power that the thermal management had become a problem.

The premise was that RAM blocks, in particular, do not consume power the way logic does, mainly because only one location is accessed at a time; the rest sit idle waiting their turn and therefore draw little current. This naturally low-power state is termed dark.

In effect, clock gating brings an auto-darkness mode to the processor logic, or more accurately the clocking circuits of the intervening flip-flops.

Of course, when you're designing compute chips, and the power consumption is slashed in half for you, you then immediately add more compute to use all available power again.

Tubular · 2019-09-20 23:38

Great set of numbers, thanks Evanh

I wish there was a way to measure the RCSLOW frequency. I guess there may be some way to do it by having an external RC circuit on some pin, and measuring its response / time constant while operating on crystal, then repeating while on RCSLOW. I guess we could use the 1 kohm DAC mode; then all you need is an external reference capacitor.

evanh · 2019-09-21 00:02

Same as usual. At room temperature I get 20.8 kHz on the revA and 21.2 kHz on the revB. Here's the 1/10th frequency program outputting on pin 56 that I used.

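dat	org

	hubset	#1		'switch sysclock to RCSLOW
	dirh	#56		'make pin 56 an output

	rep	#2, #0		'repeat the next two instructions forever
	waitx	#5-4		'WAITX takes 2+D clocks, so 3 clocks here
	outnot	#56		'2 clocks: pin toggles every 5 clocks, period = 10 sysclocks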

jmg · 2019-09-21 00:02

Tubular wrote: »

Great set of numbers, thanks Evanh

I wish there was a way to measure the RCSLOW frequency. I guess there may be some way to do it by having an external RC circuit on some pin, and measuring its response / time constant while operating on crystal, then repeating while on RCSLOW. I guess we could use the 1 kohm DAC mode; then all you need is an external reference capacitor.

Chip has shown a single-C oscillator that uses uA current drive in FB mode, but I can’t recall if that had sysclk sampling caveats? (A non-sampled Schmitt fb osc would free-run)

A one-off calibration should be possible using a USART.
E.g. at the lowest 367 baud of the UB3, you can calibrate to 0.2% per 0x00 char, limited by the auto-cal jitter of the xtal-less bridges.
If you need low-ppm calibration/tracking, then use a xtal-based UART and a small burst of chars.

Cluso99 · 2019-09-21 00:05

Great figures indeed

Thanks Evan for doing that.

Would you mind adding figures for say 180MHz and 360MHz please?
This would give us some nice comparisons.

AFAIK the clock gating is helping the logic use less power. The RAM blocks that ON Semi provides most likely already include clock gating, although IIRC the hub blocks were still being enabled even when not required.

Adding this test running in hubexec would be a nice comparison too. Just the 180 and 360 would be fine.

Rayman · 2019-09-21 00:19

Can one disable the VIO while keeping the core voltage applied?

evanh · 2019-09-21 00:24

Rayman wrote: »

Can one disable the VIO while keeping the core voltage applied?

Not completely, V2831 is needed to power the oscillators and V6063 is needed too.

evanh · 2019-09-21 01:22

Cluso99 wrote: »

Would you mind adding figures for say 180MHz and 360MHz please?
This would give us some nice comparisons.

Here's an earlier go with all cogs running on revA silicon - https://forums.parallax.com/discussion/comment/1466238/#Comment_1466238

I remember Chip posted a couple of numbers that indicated roughly 45% of those for the revB silicon.

EDIT: Here are some new measurements of those single-cog loops above (note the slightly higher RCFAST; I've changed to a higher-capacity meter for these and retested RCFAST with it)

        |      Prop2 revA           Prop2 revB
        |    PGM0      PGM1       PGM0      PGM1
--------|-----------------------------------------
RCSLOW  |   101 uA     99 uA      82 uA     77 uA
RCFAST  |  73.4 mA   70.1 mA    33.3 mA   29.4 mA
 50 MHz |   157 mA    150 mA    66.2 mA   58.3 mA
100 MHz |   309 mA    295 mA     132 mA    116 mA
180 MHz |   541 mA    517 mA     234 mA    206 mA
250 MHz |   734 mA    703 mA     323 mA    284 mA
300 MHz |   867 mA    830 mA     384 mA    339 mA
360 MHz |  1023 mA    978 mA     460 mA    405 mA
        |
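
As a rough check on the "roughly 45%" figure, the PGM0 columns of the table above can be compared directly. This is only a sketch in Python using the posted numbers; the names `REV_A_PGM0`, `REV_B_PGM0`, `ratio_b_over_a` and `slope_ma_per_mhz` are made up for illustration, and the RCSLOW/RCFAST rows are left out because the table does not give their frequencies.

```python
# Single-cog VDD current in mA, copied from the table above (PGM0 = tight JMP loop).
# Keys are sysclock frequencies in MHz.
REV_A_PGM0 = {50: 157.0, 100: 309.0, 180: 541.0, 250: 734.0, 300: 867.0, 360: 1023.0}
REV_B_PGM0 = {50: 66.2, 100: 132.0, 180: 234.0, 250: 323.0, 300: 384.0, 360: 460.0}


def ratio_b_over_a(freq_mhz):
    """revB current as a fraction of revA current at one frequency."""
    return REV_B_PGM0[freq_mhz] / REV_A_PGM0[freq_mhz]


def slope_ma_per_mhz(table, f_lo=50, f_hi=360):
    """Average current increase per MHz between two measured frequencies."""
    return (table[f_hi] - table[f_lo]) / (f_hi - f_lo)


if __name__ == "__main__":
    for f in sorted(REV_A_PGM0):
        print(f"{f:3d} MHz: revB/revA = {ratio_b_over_a(f):.1%}")
    print(f"revA slope: {slope_ma_per_mhz(REV_A_PGM0):.2f} mA/MHz")
    print(f"revB slope: {slope_ma_per_mhz(REV_B_PGM0):.2f} mA/MHz")
```

With these figures the revB/revA ratio stays between about 42% and 45% across 50 to 360 MHz, and the average slope is about 2.8 mA/MHz on revA against about 1.3 mA/MHz on revB.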

cgracey · 2019-09-21 02:43

evanh wrote: »

        |     Prop2 revA           Prop2 revB
        |  RCFAST    RCSLOW     RCFAST    RCSLOW
--------|-----------------------------------------
PGM0    | 68.4 mA    101 uA    32.4 mA     82 uA
PGM1    | 65.4 mA     99 uA    28.6 mA     77 uA
        |

Thanks, Evanh. I don't know why, but I remember the difference being really big. It seems to make little difference, though.
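
To put a number on "little difference", the relative saving of the WAITX loop (PGM1) over the tight JMP loop (PGM0) can be worked out from the quoted table. A minimal sketch in Python; the names `MEASURED` and `waitx_saving` are illustrative only, and the data is exactly the eight values quoted above.

```python
# VDD current from the quoted table as (PGM0, PGM1) pairs.
# RCFAST values are in mA and RCSLOW values in uA; the units cancel in the ratio.
MEASURED = {
    ("revA", "RCFAST"): (68.4, 65.4),
    ("revA", "RCSLOW"): (101.0, 99.0),
    ("revB", "RCFAST"): (32.4, 28.6),
    ("revB", "RCSLOW"): (82.0, 77.0),
}


def waitx_saving(pgm0, pgm1):
    """Fraction of PGM0 current saved by stalling in WAITX (PGM1)."""
    return (pgm0 - pgm1) / pgm0


if __name__ == "__main__":
    for (silicon, clock), (pgm0, pgm1) in MEASURED.items():
        print(f"{silicon} {clock}: {waitx_saving(pgm0, pgm1):.1%} saved")
```

On these numbers the saving is about 4% (RCFAST) and 2% (RCSLOW) on revA, and about 12% (RCFAST) and 6% (RCSLOW) on revB, so the stall helps more on the clock-gated silicon but the single-cog total is dominated by everything else.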

evanh · 2019-09-21 04:59

What we are seeing there with RCSLOW on the revB silicon, is a more pronounced corner transitioning to static leakage as the dominant component.

Cluso99 · 2019-09-21 06:33

Thanks evan. I'm still interested to see what difference it makes if the loop is in hubexec tho.

evanh · 2019-09-25 00:10

Cool, the cordic seems to have variable power draw depending on bit pattern of input data. It can add about 10 mA per 50 MHz per cog using every pipeline slot.

evanh · 2019-09-25 12:26

Updated with a couple more test programs. PS: These are all single cog tests.

        |      Prop2 revA (VDD = 1.83 V)            Prop2 revB (VDD = 1.855 V)
        |    PGM0     PGM1     PGM2     PGM3       PGM0     PGM1     PGM2     PGM3  
--------|---------------------------------------------------------------------------
RCSLOW  |   107 uA   104 uA   110 uA   116 uA      83 uA    79 uA    87 uA    96 uA 
RCFAST  |  73.6 mA  70.3 mA  78.6 mA  84.7 mA    33.3 mA  29.4 mA  39.2 mA  47.4 mA 
 50 MHz |   157 mA   150 mA   168 mA   181 mA    66.2 mA  58.3 mA  78.4 mA  94.1 mA 
100 MHz |   309 mA   295 mA   330 mA   355 mA     132 mA   116 mA   156 mA   187 mA 
180 MHz |   542 mA   518 mA   578 mA   621 mA     234 mA   206 mA   277 mA   331 mA 
250 MHz |   736 mA   704 mA   783 mA   843 mA     323 mA   284 mA   380 mA   455 mA 
300 MHz |   869 mA   832 mA   922 mA   994 mA     384 mA   339 mA   453 mA   540 mA 
360 MHz |            982 mA                       459 mA   405 mA   535 mA   641 mA 
        |           1.755 V   1.76 V  1.755 V    1.817 V   1.82 V   1.81 V  1.802 V
        |

' program 2
'
		rep	#4, #0				'loop forever burning the cordic
		getrnd	pa
		qmul	pb, pa
		add	pb, pa
		mul	pb, pa

' program 3
		wrfast	#0, #0
		rep	#2, ##(64*1024/4)		'64 kB random hubRAM data
		getrnd	pa
		wflong	pa

		rdfast	##(64*1024/64), #0		'64 kB block rollover
'		xinit	##(%0011_0000_1000_0001<<16) | $ffff, #0	'revA silicon, full sysclock/1 continuous FIFO reading
		xinit	##(%1011_0000_1000_0001<<16) | $ffff, #0	'revB silicon, full sysclock/1 continuous FIFO reading

		rep	#4, #0				'loop forever burning the cordic
		getrnd	pa
		qmul	pb, pa
		add	pb, pa
		mul	pb, pa
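
The extra current of program 2 over the plain JMP loop can be normalised to "mA per 50 MHz" from the revB columns of the table above. This is a sketch in Python with illustrative names (`REV_B`, `extra_per_50mhz`); taking PGM0 as the baseline is an assumption, and since program 2 also runs GETRND, ADD and MUL, the difference is not purely CORDIC.

```python
# revB single-cog VDD current in mA from the table above: MHz -> (PGM0, PGM2).
REV_B = {
    50: (66.2, 78.4),
    100: (132.0, 156.0),
    180: (234.0, 277.0),
    250: (323.0, 380.0),
    300: (384.0, 453.0),
    360: (459.0, 535.0),
}


def extra_per_50mhz(freq_mhz):
    """Extra current of PGM2 over PGM0, scaled to a 50 MHz step."""
    pgm0, pgm2 = REV_B[freq_mhz]
    return (pgm2 - pgm0) * 50.0 / freq_mhz


if __name__ == "__main__":
    for f in sorted(REV_B):
        print(f"{f:3d} MHz: +{extra_per_50mhz(f):.1f} mA per 50 MHz")
```

The result lands between roughly 10.5 and 12.2 mA per 50 MHz at every measured frequency, which is in line with the "about 10 mA per 50 MHz per cog" estimate given earlier in the thread.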

jmg · 2019-09-25 20:39

evanh wrote: »

Updated with a couple more test programs. PS: These are all single cog tests.

Looks good.
Is it possible to add a test with PGM1 running in all 8 cogs, to get an all-cog baseline number?
I don't think the clock gating is smart enough to shut down all logic of not-started-yet COGs, so hopefully the current per added COG is not large.

PGM3 in all 8 COGS would bump the current quite a bit, given the added activity.

Rayman · 2019-09-25 22:21

evanh wrote: »

Not completely, V2831 is needed to power the oscillators and V6063 is needed too.

Any idea why V6063 is needed?
Has anybody tried this?

jmg · 2019-09-25 22:42

Rayman wrote: »

Any idea why V6063 is needed?
Has anybody tried this?

I think that means 'needed' for the UART and Loaders, ie the P2 could 'run' with that removed, but not talk to much ...

Cluso99 · 2019-09-25 22:53

I would still like to see the difference between a cog loop and a hubexec loop. The difference will show how much the hub ram is taking.

Rayman · 2019-09-25 23:37

jmg wrote: »

I think that means 'needed' for the UART and Loaders, ie the P2 could 'run' with that removed, but not talk to much ...

Ok, I see. I'm wondering if P2 can go to even lower power by removing all VIO. Perhaps powering them back up at intervals to check for the need to wake...

jmg · 2019-09-25 23:45

Rayman wrote: »

jmg wrote: »

I think that means 'needed' for the UART and Loaders, ie the P2 could 'run' with that removed, but not talk to much ...

Ok, I see. I'm wondering if P2 can go to even lower power by removing all VIO. Perhaps powering them back up at intervals to check for the need to wake...

If you are not using any of the analog functions, I thought the VIO Static overhead per small VIO group, was single-digit uA ?
(ie not significant ?)

evanh · 2019-09-25 23:59

Cluso99 wrote: »

I would still like to see the difference between a cog loop and a hubexec loop. The difference will show how much the hub ram is taking.

Program 3 is equivalent to worst case hubexec.

evanh · 2019-09-26 00:02

jmg wrote: »

Is it possible to add a test with PGM1 running in all 8 cogs, to get an all-cog baseline number?

I started doing that last night but kept making mistakes and having to start over. Program 1 in particular is using less power with all cogs running! Working on it ...

EDIT: Yep, I'm pretty certain I've got that one verified. With program #1, one cog running on the revB chip at 360 MHz uses 405 mA, while eight cogs use only 333 mA. Pretty much the same as one cog at 300 MHz.

EDIT2: And on the revA chip they are practically the same. Both almost 1 amp. Eight cog program is a few mA higher. Tried at 180 MHz and same story, both around 520 mA.

I guess the revA is not a surprise, but the revB chip sure is. It appears an idle cog uses less power than a stopped cog on the revB silicon.
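
For scale, the revB result at 360 MHz works out as follows. A tiny Python sketch using only the two figures above; the constant and function names are illustrative.

```python
# revB, program 1 (WAITX loop) at 360 MHz, VDD current in mA as reported above.
ONE_COG_MA = 405.0
EIGHT_COGS_MA = 333.0


def reduction(before_ma, after_ma):
    """Absolute (mA) and relative drop when going from before_ma to after_ma."""
    delta = before_ma - after_ma
    return delta, delta / before_ma


if __name__ == "__main__":
    delta, fraction = reduction(ONE_COG_MA, EIGHT_COGS_MA)
    print(f"Eight idle cogs draw {delta:.0f} mA less ({fraction:.1%}) than one idle cog")
```

That is 72 mA, or about 18%, less with all eight cogs sitting in WAITX than with a single cog doing so.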
