Is LUT sharing between adjacent cogs very important?

jmg · 2016-05-20 04:10

cgracey wrote: »

Here's the thing, though: there's only one possible cog of 16 that could write your LUT externally. I can see masking/comparing for use with several external write sources, but when there's only one, what does it buy you?

It means you are not locked into a single 1-2 address, as you may need to pass half a dozen packets of data, and avoiding 1 or 2 single locations would help there.

cgracey wrote: »

I figured first or last locations have the most value, so put the sensitivities there.

The reason there are LUT r/w events for same-cog writes is because it might be handy to trigger an interrupt, in order to reload the LUT.

That makes sense, but there, would a trigger point that was user-adjustable be more use ?
Trigger at the end, is likely to be too late to maintain unbroken streaming ?

Depending on the clock rates, you may want to trigger even at 50%.
If the HUB can fill at the Streamer CLK, or slightly faster, the fill starts at 00; when the streamer wraps, 00 is valid and the HUB fill is half way, and the Streamer is back at 50% when the HUB is doing its last write.
With that, you could sustain continual refreshed streaming.
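
To make the timing argument concrete, here is a rough Python model of it. Everything in it is an assumption for illustration, not P2 behaviour: the streamer reads one LUT location per clock, the refill writes one location per clock starting at 00, a write lands before the read in the same clock, and `latency` stands in for whatever delay (interrupt response, hub access) separates the event from the first write. The name `stale_reads` is made up.

```python
LUT_SIZE = 512  # locations in one cog's LUT


def stale_reads(trigger_addr, latency, passes=4):
    """Count streamer reads that see data belonging to the wrong pass."""
    lut = [0] * LUT_SIZE      # lut[a] = number of the pass whose data sits at a
    writes = {}               # clock -> (address, pass number) of a refill write
    stale = 0
    for clk in range(passes * LUT_SIZE):
        if clk in writes:     # assumed: the refill write lands before the read
            addr, tag = writes[clk]
            lut[addr] = tag
        this_pass, addr = divmod(clk, LUT_SIZE)
        if lut[addr] != this_pass:
            stale += 1
        if addr == trigger_addr:
            # The event fires here; the refill for the next pass starts
            # `latency` clocks later and writes upward from address 0.
            start = clk + 1 + latency
            for k in range(LUT_SIZE):
                writes[start + k] = (k, this_pass + 1)
    return stale


print(stale_reads(0x100, latency=100))  # trigger at 50%
print(stale_reads(0x1FF, latency=100))  # trigger at the last location
```

In this model a trigger at $100 stays clean for any latency up to 255 clocks, while a trigger at $1FF only works if the refill can start on the very next clock.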

Cluso99 · 2016-05-20 06:29

cgracey wrote: »

At 180nm and with all the transistors the P2 will have, I expect the raw leakage current to hit 2mA.

We're quite late in the design to be introducing clock gating.

I can certainly live with 2mA leakage current. I was expecting a lot more.

Tubular · 2016-05-20 06:34

I thought it would be more, too.

During Tracy Allen's curve tracing on the P1 the leakage current seemed to halve going from 3v3 to 2v5. There's hope...

cgracey · 2016-05-20 07:03

jmg wrote: »

cgracey wrote: »

Here's the thing, though: there's only one possible cog of 16 that could write your LUT externally. I can see masking/comparing for use with several external write sources, but when there's only one, what does it buy you?

It means you are not locked into a single 1-2 address, as you may need to pass half a dozen packets of data, and avoiding 1 or 2 single locations would help there.

cgracey wrote: »

I figured first or last locations have the most value, so put the sensitivities there.

The reason there are LUT r/w events for same-cog writes is because it might be handy to trigger an interrupt, in order to reload the LUT.

That makes sense, but there, would a trigger point that was user-adjustable be more use ?
Trigger at the end, is likely to be too late to maintain unbroken streaming ?

Depending on the clock rates, you may want to trigger even at 50%.
If the HUB can fill at the Streamer CLK, or slightly faster, the fill starts at 00; when the streamer wraps, 00 is valid and the HUB fill is half way, and the Streamer is back at 50% when the HUB is doing its last write.
With that, you could sustain continual refreshed streaming.

Good points you made here.

So, I souped things up a bit:

SETLUT D/# - sets the number of LUT locations (0..512), starting from LUT $000 that the other cog will affect in your LUT when he writes his LUT. $000 is the default on cog startup and completely inhibits the other cog from affecting your LUT. If you set $001, only location $000 will be written in your LUT when he writes $000 in his LUT. Setting $200 allows the entire LUT to receive writes from the other cog.

For the LUT read and write events, you now set the 9-bit LUT address and pick your cog or the other cog to trigger the event.

The streamer LUT read still triggers on LUT address $1FF.
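
A small Python model of the SETLUT window as described in this post may help when reading the examples ($000 blocks everything, $001 passes only location $000, $200 passes the whole LUT). It is only a sketch of the described behaviour, not of the logic: the class `CogLut`, the `receiver` link and the `window` field are invented names, and `wrlut` here just stands for "this cog writes its own LUT".

```python
LUT_SIZE = 512


class CogLut:
    def __init__(self):
        self.mem = [0] * LUT_SIZE
        self.window = 0       # SETLUT value; $000 on startup blocks the other cog
        self.receiver = None  # hypothetical link: the cog that sees our LUT writes

    def setlut(self, count):
        """Accept the other cog's writes to locations $000..count-1."""
        if not 0 <= count <= LUT_SIZE:
            raise ValueError("SETLUT count must be 0..512")
        self.window = count

    def wrlut(self, addr, value):
        """Write our own LUT; the write is mirrored if the receiver allows it."""
        self.mem[addr] = value
        rx = self.receiver
        if rx is not None and addr < rx.window:
            rx.mem[addr] = value


writer, listener = CogLut(), CogLut()
writer.receiver = listener

writer.wrlut(0x000, 11)    # listener still at the startup default: ignored
listener.setlut(0x001)
writer.wrlut(0x000, 22)    # inside the window: mirrored
writer.wrlut(0x001, 33)    # outside the window: ignored
listener.setlut(0x200)
writer.wrlut(0x1FF, 44)    # whole LUT open: mirrored
print(listener.mem[0x000], listener.mem[0x001], listener.mem[0x1FF])
```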

jmg · 2016-05-20 10:24

cgracey wrote: »

Good points you made here.

So, I souped things up a bit:

SETLUT D/# - sets the number of LUT locations (0..512), starting from LUT $000 that the other cog will affect in your LUT when he writes his LUT. $000 is the default on cog startup and completely inhibits the other cog from affecting your LUT. If you set $001, only location $000 will be written in your LUT when he writes $000 in his LUT. Setting $200 allows the entire LUT to receive writes from the other cog.

For the LUT read and write events, you now set the 9-bit LUT address and pick your cog or the other cog to trigger the event.

The streamer LUT read still triggers on LUT address $1FF.

Sounding great, plenty of flexibility, with hopefully not too much logic cost.
Needs some real world test code exercising now...

Cluso99 · 2016-05-20 12:20

cgracey wrote: »

jmg wrote: »

cgracey wrote: »

Here's the thing, though: there's only one possible cog of 16 that could write your LUT externally. I can see masking/comparing for use with several external write sources, but when there's only one, what does it buy you?

It means you are not locked into a single 1-2 address, as you may need to pass half a dozen packets of data, and avoiding 1 or 2 single locations would help there.

cgracey wrote: »

I figured first or last locations have the most value, so put the sensitivities there.

The reason there are LUT r/w events for same-cog writes is because it might be handy to trigger an interrupt, in order to reload the LUT.

That makes sense, but there, would a trigger point that was user-adjustable be more use ?
Trigger at the end, is likely to be too late to maintain unbroken streaming ?

Depending on the clock rates, you may want to trigger even at 50%.
If the HUB can fill at the Streamer CLK, or slightly faster, the fill starts at 00; when the streamer wraps, 00 is valid and the HUB fill is half way, and the Streamer is back at 50% when the HUB is doing its last write.
With that, you could sustain continual refreshed streaming.

Good points you made here.

So, I souped things up a bit:

SETLUT D/# - sets the number of LUT locations (0..512), starting from LUT $000 that the other cog will affect in your LUT when he writes his LUT. $000 is the default on cog startup and completely inhibits the other cog from affecting your LUT. If you set $001, only location $000 will be written in your LUT when he writes $000 in his LUT. Setting $200 allows the entire LUT to receive writes from the other cog.

For the LUT read and write events, you now set the 9-bit LUT address and pick your cog or the other cog to trigger the event.

The streamer LUT read still triggers on LUT address $1FF.

Excellent. No need to waste the whole LUT if we only need a long or two.

potatohead · 2016-05-20 16:21

Agreed. Worth it. Nice work Chip and jmg

Rayman · 2016-05-20 20:04

Is it possible to compile now with LUT sharing out so we can get all smart pins back in?

That's probably too much trouble I suppose.

jmg · 2016-05-20 20:38

Rayman wrote: »

Is it possible to compile now with LUT sharing out so we can get all smart pins back in?

That's probably too much trouble I suppose.

Not quite sure why ?
The new LUT mapping needs testing asap, and the Smart pins do not need all pin cells in place at this stage of testing.

Because the Pin cells repeat identically, you do not need all 64 to test the pin cell code (but, of course, you do need to check you have 64 in the final code build!).

cgracey · 2016-05-20 23:17

Rayman wrote: »

Is it possible to compile now with LUT sharing out so we can get all smart pins back in?

That's probably too much trouble I suppose.

What pins do you need, again, for USB?

I found the post. You need P32 and P33, for sure. No problem. I think we will have all but 8, or so. The holes will be in P48..P55.

cgracey · 2016-05-27 05:04

I realized while in Rocklin that there is no time in the logic to qualify LUT write addresses, as they are late arriving. LUT writing sharing can simply be on or off and span the whole LUT. In order to qualify the address, like we were intending, we would have to register the address, the enable signal, and the 32-data signals, and process them all on the next clock. That would be overly expensive to do.

Even to do the read-LUT and write-LUT events in a timely manner, those enables and addresses might need to be registered and evaluated on the next clock, as they are too late-arriving to do much with in one clock.

jmg · 2016-05-27 05:28

cgracey wrote: »

I realized while in Rocklin that there is no time in the logic to qualify LUT write addresses, as they are late arriving. LUT writing sharing can simply be on or off and span the whole LUT. In order to qualify the address, like we were intending, we would have to register the address, the enable signal, and the 32-data signals, and process them all on the next clock. That would be overly expensive to do.

Even to do the read-LUT and write-LUT events in a timely manner, those enables and addresses might need to be registered and evaluated on the next clock, as they are too late-arriving to do much with in one clock.

The Events could tolerate a delay, I think ?

Sharing that is "can simply be on or off and span the whole LUT" sounds tolerable, if the alternative is too expensive in Logic.
That leaves the user to manage the shared areas, which they are likely already doing, in such tightly co-operating COGs ?

cgracey · 2016-05-27 05:38

jmg wrote: »

cgracey wrote: »

I realized while in Rocklin that there is no time in the logic to qualify LUT write addresses, as they are late arriving. LUT writing sharing can simply be on or off and span the whole LUT. In order to qualify the address, like we were intending, we would have to register the address, the enable signal, and the 32-data signals, and process them all on the next clock. That would be overly expensive to do.

Even to do the read-LUT and write-LUT events in a timely manner, those enables and addresses might need to be registered and evaluated on the next clock, as they are too late-arriving to do much with in one clock.

The Events could tolerate a delay, I think ?

Sharing that is "can simply be on or off and span the whole LUT" sounds tolerable, if the alternative is too expensive in Logic.
That leaves the user to manage the shared areas, which they are likely already doing, in such tightly co-operating COGs ?

Yes, the events could tolerate a delay, no problem. They would still need 10 flops, each.

jmg · 2016-05-27 05:42

cgracey wrote: »

I realized while in Rocklin that there is no time in the logic to qualify LUT write addresses, as they are late arriving. LUT writing sharing can simply be on or off and span the whole LUT. In order to qualify the address, like we were intending, we would have to register the address, the enable signal, and the 32-data signals, and process them all on the next clock. That would be overly expensive to do.

Thinking some more about this, what if you shrink the Address qualify right down, to speed it up, to meet timing ?
Even in the order of 1-2-3 bits is useful, and will have a much faster logic path adder than a full address compare?

cgracey · 2016-05-27 06:09

jmg wrote: »

cgracey wrote: »

I realized while in Rocklin that there is no time in the logic to qualify LUT write addresses, as they are late arriving. LUT writing sharing can simply be on or off and span the whole LUT. In order to qualify the address, like we were intending, we would have to register the address, the enable signal, and the 32-data signals, and process them all on the next clock. That would be overly expensive to do.

Thinking some more about this, what if you shrink the Address qualify right down, to speed it up, to meet timing ?
Even in the order of 1-2-3 bits is useful, and will have a much faster logic path adder than a full address compare?

It's the same problem, mainly, because you still need to MUX the address after that logic. That's the killer.

Cluso99 · 2016-05-27 06:38

Ouch!!! I need to think about this.

LUT sharing was not my initial requirement. What I was after was a simple way to pass a byte plus flag bit each way between adjacent cogs. But your dual port LUT seemed to fit nicely, provide some very powerful features between adjacent cogs, and solved what I was after too.

I had thought that the LUT address could be 10 bits instead of 9 bits, where the extra address bit would select either the local or the adjacent cog's LUT. I presume this would have the same timing problem?
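
As a sketch of that 10-bit idea in Python: the top bit picks the LUT and the low 9 bits are the location. Which bit value means "adjacent" is an assumption here, and `decode_lut_addr` is a made-up name.

```python
def decode_lut_addr(addr10):
    """Split a 10-bit LUT address into (use_adjacent_cog, 9-bit location)."""
    if not 0 <= addr10 <= 0x3FF:
        raise ValueError("address must fit in 10 bits")
    use_adjacent = bool((addr10 >> 9) & 1)  # assumed: bit 9 set = adjacent cog
    location = addr10 & 0x1FF
    return use_adjacent, location


print(decode_lut_addr(0x005))  # local LUT, location 5
print(decode_lut_addr(0x205))  # adjacent cog's LUT, location 5
```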

jmg · 2016-05-27 06:43

cgracey wrote: »

It's the same problem, mainly, because you still need to MUX the address after that logic. That's the killer.

Well, yes, the MUX takes time, but that is of low-logic depth.
A simpler compare will always be quicker than a wide compare, and that time saving may be all it needs to meet budget.

How far outside budget is it now, and what are the MUX and Compare delays now ?

Electrodude · 2016-05-27 07:02

Instead of a full compare, what if you just went with a simple "(addr & mask) == target" test, which is faster since each bit is independent of all the others?
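
For anyone unfamiliar with the mask/target form, here it is spelled out in Python. The mask and target values are only examples, and `qualifies` is a made-up name; this shows the arithmetic, not the gate-level timing.

```python
def qualifies(addr, mask, target):
    """True when the address bits selected by mask equal target."""
    return (addr & mask) == target


# Example: ignore the low 3 bits and require the upper 6 bits to be zero,
# which qualifies an 8-location block at the bottom of the 512-entry LUT.
MASK, TARGET = 0x1F8, 0x000
block = [a for a in range(512) if qualifies(a, MASK, TARGET)]
print(block)
```

A mask of $1FF matches exactly one location, and a mask of $000 with target $000 matches every location, so the same test covers both the single-address and the whole-LUT cases.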

cgracey · 2016-05-27 07:49

The problem is that the address lines are very late arriving. To analyze them in any way and THEN mux them is what takes so much time. There is some fan-out on the qualifier signal, derived from the address lines, that needs to be buffered up and then used to mux those same address lines. Mux'ing the address lines based on the address lines just takes too long.

evanh · 2016-05-27 07:57

HubRAM write buffering anyone?

Ignore me, just being cheeky. I'm sure Chip will find an efficient solution.

Cluso99 · 2016-05-27 08:46

Chip,
Did the 50/50 LUT between the lower and upper adjacent cogs have the same problem?

cgracey · 2016-05-27 09:00

Cluso99 wrote: »

Chip,
Did the 50/50 LUT between the lower and upper adjacent cogs have the same problem?

When I implemented it, there was no address checking, but it would have had the same problem. Also, that original way involved reading, not just writing, which took double the logic.

jmg · 2016-05-27 10:14

cgracey wrote: »

The problem is that the address lines are very late arriving. To analyze them in any way and THEN mux them is what takes so much time. There is some fan-out on the qualifier signal, derived from the address lines, that needs to be buffered up and then used to mux those same address lines. Mux'ing the address lines based on the address lines just takes too long.

If there is no simpler middle ground, then it sounds like a build with the faster, full-overlay is needed, to see if any issues shake out in use.

Seairth · 2016-05-27 10:51

I say drop shared LUT and the new coginit. We have smart pin mailboxes, which can pick up some of the slack. If anything, see if you can reduce the stall times of the pin instructions, either by speeding up transfers or by making the pin-set instructions asynchronous. Otherwise, leave the rest like it is and call it a day.

Rayman · 2016-05-27 11:24

Sounds like time to hit the undo button on LUT sharing...

jmg · 2016-05-27 19:59

?
There is no need to remove LUT sharing, as Chip said this (it was a little buried, maybe some missed it ?)

Chip: "LUT writing sharing can simply be on or off and span the whole LUT. "

ie LUT writing is fine, without the planned Address qualifier.

Rayman · 2016-05-27 21:03

I think you're right. Missed that part...
Guess we'll see if that works.

cgracey · 2016-05-28 06:24

jmg wrote: »

?
There is no need to remove LUT sharing, as Chip said this (it was a little buried, maybe some missed it ?)

Chip: "LUT writing sharing can simply be on or off and span the whole LUT. "

ie LUT writing is fine, without the planned Address qualifier.

That compiles fine, without stretching Fmax.

Cluso99 · 2016-05-28 07:31

Seairth · 2016-05-28 12:48

cgracey wrote: »

jmg wrote: »

?
There is no need to remove LUT sharing, as Chip said this (it was a little buried, maybe some missed it ?)

Chip: "LUT writing sharing can simply be on or off and span the whole LUT. "

ie LUT writing is fine, without the planned Address qualifier.

That compiles fine, without stretching Fmax.

So, what does this mean, exactly? That LUT write permissions are an all-or-nothing affair?
