Is LUT sharing between adjacent cogs very important?

jmg · 2016-05-19 19:57

cgracey wrote: »

Cogs have to enable LUT writes from their companion cog. So, if one cog stops, it doesn't hurt anything.

See the other thread about power-saving, and the nice ability to restart a frozen COG, without reloading.
i.e., given the lower power of a clock-off COG, there is a place for an equivalent COGFRZ, so that:

COGSTOP - removes clocks, stops, and frees COG into pool
COGFRZ - removes clocks, but does not free into pool

COGFRZ is used when you know it will restart very soon.
COGSTOP is used when you are done, and something else can have the COG.
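
The distinction between the two can be sketched as a toy allocation model. This is only an illustration of the proposal, not of any implemented hardware: COGFRZ is the hypothetical instruction suggested above, and the class and method names, the `restart` call, and the "lowest-numbered free cog first" policy for COGNEW are all assumptions made for the example.

```python
from enum import Enum


class CogState(Enum):
    FREE = "free"        # clocks off, available to COGNEW
    RUNNING = "running"  # clocked and executing
    FROZEN = "frozen"    # clocks off, still owned (hypothetical COGFRZ)


class CogPool:
    """Toy model of cog allocation with a proposed 'frozen' state."""

    def __init__(self, count=16):
        self.state = [CogState.FREE] * count

    def cognew(self):
        """Claim a free cog (assumed: lowest-numbered first); frozen cogs are skipped."""
        for cog_id, state in enumerate(self.state):
            if state is CogState.FREE:
                self.state[cog_id] = CogState.RUNNING
                return cog_id
        return None  # no free cog available

    def cogstop(self, cog_id):
        """Stop the cog and return it to the pool."""
        self.state[cog_id] = CogState.FREE

    def cogfrz(self, cog_id):
        """Hypothetical: stop clocking the cog but keep it out of the pool."""
        if self.state[cog_id] is not CogState.RUNNING:
            raise ValueError("only a running cog can be frozen")
        self.state[cog_id] = CogState.FROZEN

    def restart(self, cog_id):
        """Hypothetical: resume a frozen cog without reloading its code."""
        if self.state[cog_id] is not CogState.FROZEN:
            raise ValueError("only a frozen cog can be restarted")
        self.state[cog_id] = CogState.RUNNING
```

In this model the only difference between the two stops is whether a later COGNEW can claim the cog, which matches the claim that the clock handling itself would be identical.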

Heater. · 2016-05-19 20:04

What is this about "low power".

I have always understood that even with all COGs idle the P2 was never going to be a low power device.

Why do we need a COGFRZ when a WAITwhatever does the same thing ?

jmg · 2016-05-19 20:20

Heater. wrote: »

What is this about "low power".

Where have you been ?

Lower power always matters.

Heater. wrote: »

I have always understood that even with all COGs idle the P2 was never going to be a low power device.

Low power is relative. With all the rush to WiFi and Bluetooth, average currents of a few mA are becoming tolerated.

Anything that pushes P2 down in Current, helps it get design wins.
It does not need to hit sub uA, just needs to avoid 'sticker shock'

Heater. wrote: »

Why do we need a COGFRZ when a WAITwhatever does the same thing ?

See this thread
http://forums.parallax.com/discussion/164300/restarting-stopped-cogs-power-save-mode

WAITwhatever is NOT the same as a Clock-Gated COG,
Chip says this
"COGSTOP stops all toggling within a cog."
whilst WAITwhatever, has the Clock tree active.

WAITwhatever is the lowest power of an Active COG, but it is not the lowest power possible way to pause a COG.

Heater. · 2016-05-19 20:51

jmg,

Where have you been ?

Why, I have been here of course. Some years ago Chip pointed out that the P2 was never going to achieve the low power consumption of the P1. Which leads me to conclude that low power is not a design priority, unless carried to the extremes of the old "P2 Hot".

Lower power always matters.

No, it does not. If the thing I am controlling consumes a thousand or million or whatever times more power than my controller then who cares?

Low power is relative. With all the rush to WiFi and Bluetooth, average currents of a few mA are becoming tolerated.

True. And sure, WiFi and Bluetooth are power hungry. But now we are looking at LoRa wireless devices to run from batteries for years. Not P2 territory.

WAITwhatever is the lowest power of an Active COG, but it is not the lowest power possible way to pause a COG.

Perhaps. Is it significantly different?

I guess we have to wait till we have a real chip to find out.

cgracey · 2016-05-19 21:02

At 180nm and with all the transistors the P2 will have, I expect the raw leakage current to hit 2mA.

We're quite late in the design to be introducing clock gating.

jmg · 2016-05-19 21:12

Heater. wrote: »

Perhaps. Is it significantly different?

I guess we have to wait till we have a real chip to find out.

Yes, it will be significantly different.
In one case the Clock Tree is active, in the other it is not.

Heater. wrote: »

Some years ago Chip pointed out that the P2 was never going to achieve the low power consumption of the P1.

The P1 can go sub 1uA, so I agree P2 is not going to be there.
However, even 180nm is not a badly leaky process, and it is foolish to dismiss power savings, just because you 'conclude that low power is not a design priority.'

Another reference point is TCXOs. Over the last few years, the price and current of those has fallen, and now for around $1 you can get a 1.8mA sub 2ppm source.

One P2 constraint, is that the SysCLK cannot be scaled across COGS, and that if your SysCLK is dictated by timing resolution, that also sets your starting point power.

That means all methods to help reduce that, become important.

kwinn · 2016-05-19 21:37

Heater. wrote: »

thej,

Well, my one and only kid has grown to be a man whilst we were waiting for the P 2.

Perhaps I can bequeath the waiting to him should I shuffle off this mortal coil in the mean time.

You're such a pessimist...

Oh wait, maybe a realist.

Rayman · 2016-05-19 21:52

I think power is only a big issue for portable or otherwise battery powered things.

Still, if you could turn off RAM and hibernate or go into low power mode somehow, that might be interesting....

Heater. · 2016-05-19 21:52

I'm rewriting my last will and testament as we speak.

jmg · 2016-05-19 22:04

Rayman wrote: »

...
Still, if you could turn off RAM and hibernate or go into low power mode somehow, that might be interesting....

That is what stopping the Clock tree does.

SRAM and Logic that is not clocked, drops to Static Icc levels.

Lowest P2 power would be one COG in WAIT, and the rest loaded, ready to go, but in COGFRZ.
Depending on the PLL lock times, it may even be possible to drop SysCLK on that one, then increase again ?
A Post-VCO-divider would allow very rapid clock scaling.
I forget the bit-counts & PLL structure Chip settled on ...

jmg · 2016-05-19 22:07

cgracey wrote: »

At 180nm and with all the transistors the P2 will have, I expect the raw leakage current to hit 2mA.

Where is that 2mA in PVT ?

cgracey wrote: »

We're quite late in the design to be introducing clock gating.

? There is no more Clock Gating, than you have in there now.

The only change for COGFRZ, is a small one around the COGNEW & allocate flag handling.
ie you make it clear to the system that the COG is not fully released, just frozen.
Clock management is identical.

COGNEW simply skips one of these COGS.

See the thread on COGSTOP and rapid restart.
This works now, it just may interact badly with the cog-pair allocations.

cgracey · 2016-05-20 00:41

The EDA tools perform limited clock gating, but to really cut off a clock, you must incorporate special IP blocks into the Verilog. That's rather adventurous, at this point.

cgracey · 2016-05-20 00:43

At last synthesis for P2-Hot, the EDA tools were able to apply clock gating to 89% of flops.

Seairth · 2016-05-20 00:51

Seairth wrote: »

cgracey wrote: »

Any recommendations on LUT events? We have two free events now. We already have a read-LUT-end event.

Before answering this, can someone summarize what was finally implemented?

Okay, since no one else is going to summarize, I guess I'll do it. I'm sure you all will correct me where I'm wrong. :P

* Every pair of cogs (0 and 1, 2 and 3, etc) has a LUT-sharing capability.
* Cog 0 can use SETLUT to control whether Cog 1 may write to its LUT.
* Cog 1 can use SETLUT to control whether Cog 0 may write to its LUT.
* If Cog 0 calls "SETLUT #0", then any time Cog 1 calls WRLUT, only Cog 1's LUT will be updated.
* If Cog 1 calls "SETLUT #0", then any time Cog 0 calls WRLUT, only Cog 0's LUT will be updated.
* If Cog 0 calls "SETLUT #1", then any time Cog 1 calls WRLUT, both LUTs will be updated.
* If Cog 1 calls "SETLUT #1", then any time Cog 0 calls WRLUT, both LUTs will be updated.
* When COGINIT is used to start two cogs, "SETLUT #1" is implied for both cogs.
* When COGINIT is used to start one cog, "SETLUT #0" is implied for the cog.
* If Cog 0 already has LUT sharing when Cog 1 is started (via single-cog COGINIT), Cog 1 will still modify Cog 0's LUT when it writes to its own LUT.

Does that look right?

And what events are there? Just a general WRLUT (from the other cog) event? Or only at a specific address?

jmg · 2016-05-20 00:52

cgracey wrote: »

The EDA tools perform limited clock gating, but to really cut off a clock, you must incorporate special IP blocks into the Verilog. That's rather adventurous, at this point.

in the other thread, you said "COGSTOP stops all toggling within a cog"
- which sounds to me, that the COG Clock tree is not driven.
Was that not what you meant ?

cgracey · 2016-05-20 00:59

jmg wrote: »

cgracey wrote: »

The EDA tools perform limited clock gating, but to really cut off a clock, you must incorporate special IP blocks into the Verilog. That's rather adventurous, at this point.

in the other thread, you said "COGSTOP stops all toggling within a cog"
- which sounds to me, that the COG Clock tree is not driven.
Was that not what you meant ?

The EDA tool should be able to gate off clocking to everything in a stopped cog.

Beau Schwabe · 2016-05-20 01:12

Without proper clock latency setup and hold times, your signals will be all over the place. A proper clock tree is only part of the puzzle. Other setup and hold times need to be implemented on data buses, or else you have the same problem. This is especially important since the bus is so wide and traverses into every cog and I/O. ...

This needs to go through several hundred Verilog iterations using a tool set such as !Avant! place and route to meet all of the required timing. !Avant! was part of the tool set that I used on a daily basis when working at National Semiconductor, and it is an essential tool for proper timing within a digital core block(s).

Without setup and hold times applied properly, you're going to have a very slow chip, if it runs at all, simply because timing lead or lag to the remaining logic doesn't sync properly. If timing to some lines is too fast or too slow with respect to adjacent lines, an unpredictable condition could lock up the processor.

Latency timing from the clock tree AND signal lines should not be taken lightly where the success of this chip is concerned. IOW, this is not a corner that should be cut to save space or, ironically, to save time.

jmg · 2016-05-20 01:15

cgracey wrote: »

jmg wrote: »

in the other thread, you said "COGSTOP stops all toggling within a cog"
- which sounds to me, that the COG Clock tree is not driven.
Was that not what you meant ?

The EDA tool should be able to gate off clocking to everything in a stopped cog.

ok, that's good.
This means using a form of Stop, to keep the code in the COG ready to quickly re-start, should offer worthwhile energy savings.

This Stop/ReStart can be done right now, but doing a classic Full Stop, means another COG might claim this paused one, with a COGNEW ? Right ?

cgracey · 2016-05-20 01:16

Seairth wrote: »

Seairth wrote: »

cgracey wrote: »

Any recommendations on LUT events? We have two free events now. We already have a read-LUT-end event.

Before answering this, can someone summarize what was finally implemented?

Okay, since no one else is going to summarize, I guess I'll do it. I'm sure you all will correct me where I'm wrong. :P

* Every pair of cogs (0 and 1, 2 and 3, etc) has a LUT-sharing capability.
* Cog 0 can use SETLUT to control whether Cog 1 may write to its LUT.
* Cog 1 can use SETLUT to control whether Cog 0 may write to its LUT.
* If Cog 0 calls "SETLUT #0", then any time Cog 1 calls WRLUT, only Cog 1's LUT will be updated.
* If Cog 1 calls "SETLUT #0", then any time Cog 0 calls WRLUT, only Cog 0's LUT will be updated.
* If Cog 0 calls "SETLUT #1", then any time Cog 1 calls WRLUT, both LUTs will be updated.
* If Cog 1 calls "SETLUT #1", then any time Cog 0 calls WRLUT, both LUTs will be updated.
* When COGINIT is used to start two cogs, "SETLUT #1" is implied for both cogs.
* When COGINIT is used to start one cog, "SETLUT #0" is implied for the cog.
* If Cog 0 already has LUT sharing when Cog 1 is started (via single-cog COGINIT), Cog 1 will still modify Cog 0's LUT when it writes to its own LUT.

Does that look right?

And what events are there? Just a general WRLUT (from the other cog) event? Or only at a specific address?

Only one inaccuracy in there: When COGINIT starts two cogs, it does not automatically do a 'SETLUT #1' on them. That is always under software control.

There are a total of three LUT-related events:

'SETRDL D/#' is used to setup the read-LUT event:

00 = this cog read $000
01 = this cog read $1FF
10 = other cog read $000
11 = other cog read $1FF

'SETWRL D/#' is used to setup the write-LUT event:

00 = this cog wrote $000
01 = this cog wrote $1FF
10 = other cog wrote $000
11 = other cog wrote $1FF

The third event triggers when this cog's streamer reads $1FF. Streamer reads do not affect the other events.
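
Read as a 2-bit selector, the two tables above appear to use the high bit for the source (this cog or the other cog) and the low bit for the watched address ($000 or $1FF). A minimal sketch of that reading follows; the function names are made up for illustration, and the bit interpretation is inferred from the tables rather than confirmed.

```python
LUT_FIRST = 0x000
LUT_LAST = 0x1FF


def decode_selector(sel):
    """Split a 2-bit SETRDL/SETWRL selector into (watch_other_cog, watched_address).

    Assumed reading of the tables: bit 1 picks the source cog,
    bit 0 picks the first or last LUT location.
    """
    if not 0 <= sel <= 3:
        raise ValueError("selector must be a 2-bit value")
    watch_other_cog = bool(sel & 0b10)
    watched_address = LUT_LAST if sel & 0b01 else LUT_FIRST
    return watch_other_cog, watched_address


def event_fires(sel, from_other_cog, addr):
    """True if a LUT access by the given source at addr would trigger the event."""
    watch_other_cog, watched_address = decode_selector(sel)
    return from_other_cog == watch_other_cog and addr == watched_address
```

The same decode applies to both the read-LUT and write-LUT events, since the two tables share one layout. The streamer event is separate and is not modelled here.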

jmg · 2016-05-20 01:19

Seairth wrote: »

And what events are there? Just a general WRLUT (from the other cog) event? Or only at a specific address?

I did not see address mentioned, but I did see Read and Write mentioned in separate posts.
I think enough events to allow both co-operating cogs to handshake are what is needed, at a minimum.
I have wondered about a tag field to events, to allow more than just the source COG info to be passed, on a Sticky-OR basis, but that does need more logic.

Seairth · 2016-05-20 01:20

cgracey wrote: »

Only one inaccuracy in there: When COGINIT starts two cogs, it does not automatically do a 'SETLUT #1' on them. That is always under software control.

So it does do an implicit "SETLUT #0" for single COGINIT? If not, I foresee a cog that had LUT sharing enabled getting re-used, with the new code unaware that the other cog can still write to its LUT.

jmg · 2016-05-20 01:31

cgracey wrote: »

There are a total of three LUT-related events:

'SETRDL D/#' is used to setup the read-LUT event:

00 = this cog read $000
01 = this cog read $1FF
10 = other cog read $000
11 = other cog read $1FF

'SETWRL D/#' is used to setup the write-LUT event:

00 = this cog wrote $000
01 = this cog wrote $1FF
10 = other cog wrote $000
11 = other cog wrote $1FF

The third event triggers when this cog's streamer reads $1FF. Streamer reads do not affect the other events.

These only apply to Paired COGs ?
'other COG read $000' means Read COGo.LUT[00]
'other COG write $000' means write of which of COGo.LUT[00], COGt.LUT[00] ?

I can see that triggers for New Data Arrived and Sent Data Read are needed, but I'm less sure where reading your own LUT needs a trigger. Is that done for an alternative means of loop/limit control ?
An alternative could be :
00 = other cog wrote $00x, or maybe $001 ?
01 = other cog wrote $1Fx, or maybe $01FE ?

Seairth · 2016-05-20 01:41

jmg wrote: »

cgracey wrote: »

There are a total of three LUT-related events:

'SETRDL D/#' is used to setup the read-LUT event:

00 = this cog read $000
01 = this cog read $1FF
10 = other cog read $000
11 = other cog read $1FF

'SETWRL D/#' is used to setup the write-LUT event:

00 = this cog wrote $000
01 = this cog wrote $1FF
10 = other cog wrote $000
11 = other cog wrote $1FF

The third event triggers when this cog's streamer reads $1FF. Streamer reads do not affect the other events.

These only apply to Paired COGs ?
'other COG read $000' means Read COGo.LUT[00]
'other COG write $000' means write of which of COGo.LUT[00], COGt.LUT[00] ?

I can see that triggers for New Data Arrived and Sent Data Read are needed, but I'm less sure where reading your own LUT needs a trigger. Is that done for an alternative means of loop/limit control ?
An alternative could be :
00 = other cog wrote $00x, or maybe $001 ?
01 = other cog wrote $1Fx, or maybe $01FE ?

Or, why not just, for Cog A:

SETRDL D/#, where D is the address to watch for Cog B to read from Cog B's LUT
SETWRL D/#, where D is the address to watch for Cog B to write in both LUTs

Electrodude · 2016-05-20 01:58

Seairth wrote: »

jmg wrote: »

cgracey wrote: »

There are a total of three LUT-related events:

'SETRDL D/#' is used to setup the read-LUT event:

00 = this cog read $000
01 = this cog read $1FF
10 = other cog read $000
11 = other cog read $1FF

'SETWRL D/#' is used to setup the write-LUT event:

00 = this cog wrote $000
01 = this cog wrote $1FF
10 = other cog wrote $000
11 = other cog wrote $1FF

The third event triggers when this cog's streamer reads $1FF. Streamer reads do not affect the other events.

These only apply to Paired COGs ?
'other COG read $000' means Read COGo.LUT[00]
'other COG write $000' means write of which of COGo.LUT[00], COGt.LUT[00] ?

I can see that triggers for New Data Arrived and Sent Data Read are needed, but I'm less sure where reading your own LUT needs a trigger. Is that done for an alternative means of loop/limit control ?
An alternative could be :
00 = other cog wrote $00x, or maybe $001 ?
01 = other cog wrote $1Fx, or maybe $01FE ?

Or, why not just, for Cog A:

SETRDL D/#, where D is the address to watch for Cog B to read from Cog B's LUT
SETWRL D/#, where D is the address to watch for Cog B to write in both LUTs

What if they were 18 bit values: a 9 bit mask and a 9 bit target. The event would fire for Cog A when Cog B writes to Cog A's LUT at an address for which (addr & mask) == target, similarly to how the P1's waitpeq instruction works. The programmer could make it fire for only specific addresses with mask = $1FF and for any address with mask = 0, or anything in between.

A similar mask technique could be used by Cog A to decide which writes that Cog B makes to its own LUT also affect Cog A's LUT. That way, a cog could make both the write mask and target $100 and keep the bottom half of its LUT to itself.

They could all be range compares instead of masks, but masks are cheaper.
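
The mask-and-target test is easy to check in a few lines. The sketch below assumes the 18-bit value holds the mask in the upper 9 bits and the target in the lower 9 bits; that packing order and the function names are chosen here for illustration only.

```python
ADDR_BITS = 9
ADDR_MASK = (1 << ADDR_BITS) - 1  # $1FF, the 512-long LUT address range


def pack(mask, target):
    """Pack a 9-bit mask and 9-bit target into one 18-bit value (assumed layout)."""
    return ((mask & ADDR_MASK) << ADDR_BITS) | (target & ADDR_MASK)


def matches(packed, addr):
    """True if (addr & mask) == target, in the style of the P1's waitpeq."""
    mask = (packed >> ADDR_BITS) & ADDR_MASK
    target = packed & ADDR_MASK
    return (addr & mask) == target


single_address = pack(0x1FF, 0x1FF)  # fires only for address $1FF
any_address = pack(0x000, 0x000)     # fires for every address
upper_half = pack(0x100, 0x100)      # fires only for $100..$1FF
```

With mask and target both $100, only addresses with bit 8 set match, so the bottom half of the LUT ($000..$0FF) stays private, as described above.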

jmg · 2016-05-20 02:20

Seairth wrote: »

Or, why not just, for Cog A:

SETRDL D/#, where D is the address to watch for Cog B to read from Cog B's LUT
SETWRL D/#, where D is the address to watch for Cog B to write in both LUTs

I presume you mean the mirror case for B-A too ?

Yes, that is more flexible, (& more flexible again, is adding a Mask), but each of those adds more Logic.

I do think one top or bottom address, is rather constrained, and could prove a bottleneck. Most flexible is the Addr & Mask.

Keeping COG code small here matters too, as it is likely such tight COGS cannot run HUB EXEC.

Seairth · 2016-05-20 02:32

jmg wrote: »

Seairth wrote: »

Or, why not just, for Cog A:

SETRDL D/#, where D is the address to watch for Cog B to read from Cog B's LUT
SETWRL D/#, where D is the address to watch for Cog B to write in both LUTs

I presume you mean the mirror case for B-A too ?

Indeed.

I figured that Chip's current design was already triggering by address, so making it *any* address wouldn't add all that much logic.

I too like the mask idea, but I agree that it would add more logic. Also, the immediate form of the instructions would be useless without AUGS.

jmg · 2016-05-20 02:42

Seairth wrote: »

I figured that Chip's current design was already triggering by address, so making it *any* address wouldn't add all that much logic.

I too like the mask idea, but I agree that it would add more logic. Also, the immediate form of the instructions would be useless without AUGS.

Or maybe a mid-ground variant that passes just a mask, and applies that to Addr 000 ? - simpler params and setup,
All-bits active is the same as Chip's original, but mask 0x00F would allow any of 0..15 to trigger.
The compromise is that the trigger becomes zero-based, but it is now range-selectable.

cgracey · 2016-05-20 03:52

Here's the thing, though: there's only one possible cog of 16 that could write your LUT externally. I can see masking/comparing for use with several external write sources, but when there's only one, what does it buy you?

I figured first or last locations have the most value, so put the sensitivities there.

The reason there are LUT r/w events for same-cog writes is because it might be handy to trigger an interrupt, in order to reload the LUT.

cgracey · 2016-05-20 03:55

SETLUT #0 is the default, in all cases, when a cog starts.
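
Putting the summary and the two corrections together, the sharing rules can be sketched as a small model. This is a reading of the thread, not a hardware description: the class and method names are invented, pairing is modelled as `cog_id ^ 1`, and whether starting a cog also clears its LUT is not stated, so the model leaves the contents alone.

```python
LUT_SIZE = 512


class Cog:
    def __init__(self, cog_id):
        self.cog_id = cog_id
        self.lut = [0] * LUT_SIZE
        self.allow_partner_writes = False  # SETLUT #0 is the default


class CogPairs:
    """Toy model: cogs 0/1, 2/3, ... form isolated LUT-sharing pairs."""

    def __init__(self, count=16):
        self.cogs = [Cog(i) for i in range(count)]

    def partner(self, cog_id):
        # Assumed pairing: flip the lowest bit of the cog number.
        return self.cogs[cog_id ^ 1]

    def coginit(self, cog_id):
        # Starting a cog always returns it to SETLUT #0.
        self.cogs[cog_id].allow_partner_writes = False

    def setlut(self, cog_id, enable):
        # A cog decides whether its partner may write into its LUT.
        self.cogs[cog_id].allow_partner_writes = bool(enable)

    def wrlut(self, cog_id, addr, value):
        # A write always lands in the writer's own LUT...
        addr &= LUT_SIZE - 1
        self.cogs[cog_id].lut[addr] = value
        # ...and also in the partner's LUT if the partner has enabled it.
        other = self.partner(cog_id)
        if other.allow_partner_writes:
            other.lut[addr] = value
```

Because the default is reset whenever a cog starts, a re-used cog does not inherit sharing that earlier code had enabled.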

evanh · 2016-05-20 04:08

cgracey wrote: »

Here's the thing, though: there's only one possible cog of 16 that could write your LUT externally. I can see masking/comparing for use with several external write sources, but when there's only one, what does it buy you?

Yes, this does need to be made clear. It's important to note from the perspective of the first incarnation of LUT sharing where it was possible to chain-link all Cogs together through their LUTs. That is no longer possible.

It's now eight isolated pairs instead. This provided an easy way forward for COGNEW to function again.
