Is LUT sharing between adjacent cogs very important?

jmg · 2016-05-15 04:48

Tubular wrote: »

So is it accurate to say that everything (address, data) just gets OR'd? I.e., the LUT address that gets written is the OR'd result of what's being commanded by multiple other cogs (in case of a clash)?

If so that makes it essentially the same as output pins on the P1. Simple enough to follow, and has established precedent

Nope. Quite different.
Pin OR is onto the same bit, you do not corrupt OTHER bits in case of a collision.
Multiple Address-OR can go anywhere (usually UP somewhere else...)
- this needs something better.
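
A minimal Python sketch of the contrast drawn above, assuming (as stated elsewhere in the thread) that colliding writes are simply OR'd together. The pin numbers and addresses are arbitrary illustrative values, not taken from any real program.

```python
# Pin case: each cog contributes a one-bit mask, so the OR only sets
# the pins that some cog actually asked for.
pins = (1 << 3) | (1 << 5)
print(bin(pins))     # 0b101000

# Address case: each cog contributes a whole address, so the OR of two
# valid addresses can name a third address that neither cog intended.
address = 0x011 | 0x024
print(hex(address))  # 0x35
```

In the pin case no unrequested bit changes. In the address case the write lands at $035, and neither $011 nor $024 is written at all.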

cgracey · 2016-05-15 04:50

jmg wrote: »

cgracey wrote: »

That's the kind of thing that just needs to be designed around, I think. I don't see any point in putting hardware in that shouldn't ever be needed if programming was correct.

And, yes, everything gets OR'd together.

Wow, really ? That means 2 COGS writing can affect a THIRD wholly unexpected address ? Yikes.

How do you 'design around' such flukes in timing ?

( I expected a Priority Encoder, so at least then one (known) COG wins, and goes to a valid address )

I'm quite amazed those who were worried about the quite simple management of COGIDs (or a smarter COGNEW), are now fine with this hand-grenade in the hands of novices ?

It's only error-prone if more than one cog can write a particular LUT. That capability didn't exist in the first incarnation. This is not for wimps.

evanh · 2016-05-15 04:50

jmg wrote: »

evanh wrote: »

Just don't do it. Same as how the pins and HubRAM are shared now.

Err, not quite - these details matter.
3 adjacent HUB addresses are quite fine to be written by 3 COGS.
Nothing is lost, and all 3 know they achieved write.

My HubRAM example was more about allocation. One can write over any old memory address or I/O pin and mess things up.

If you want to use the crazy approach of writing from multiple Cogs to a single LUT then you are the one responsible for coordination. A priority encoder isn't going to fix a thing.

jmg · 2016-05-15 04:52

evanh wrote: »

My HubRAM example was more about allocation. One can write over any old memory address or I/O pin and mess things up.

If you want to use the crazy approach of writing from multiple Cogs to a single LUT then you are the one responsible for coordination. A priority encoder isn't going to fix a thing.

Wow, so you do not consider corruption of wholly unexpected addresses something worth fixing ?? Really ?

jmg · 2016-05-15 04:55

cgracey wrote: »

It's only error-prone if more than one cog can write a particular LUT. That capability didn't exist in the first incarnation. This is not for wimps.

You can certainly say that again!!
It is far more dangerous than COGID management, which was quite simple housekeeping.

I'm no wimp, and I could easily manage COGID mappings, however I'm less sure I'd want to unleash any code (or hardware) that could corrupt unexpected addresses.

cgracey · 2016-05-15 04:56

If everyone that wanted to write the same LUT were to wait for a system counter value whose bottom nibble matched their COGID, and then did the WRLUTX instruction, there would be no conflict.

JRetSapDoog · 2016-05-15 04:56

@evanh: Ah, I see. Thanks! I missed some posts above while posting. Things are happening fast-and-furious. But even after reading them just now, I still didn't understand the "problem" until I saw your response. Yeah, I see that potential for "garble" as more of a feature than a problem. *Update* Gee, many more posts just flew by (so I added your name).

Electrodude · 2016-05-15 04:58

jmg wrote: »

cgracey wrote: »

That's the kind of thing that just needs to be designed around, I think. I don't see any point in putting hardware in that shouldn't ever be needed if programming was correct.

And, yes, everything gets OR'd together.

Wow, really ? That means 2 COGS writing can affect a THIRD wholly unexpected address ? Yikes.

How do you 'design around' such flukes in timing ?

( I expected a Priority Encoder, so at least then one (known) COG wins, and goes to a valid address )

I'm quite amazed those who were worried about the quite simple management of COGIDs (or a smarter COGNEW), are now fine with this hand-grenade in the hands of novices ?

The simplest solution is to only have one given cog ever write to another cog's LUT. Do it in pairs or as one master with many slaves, but use the hub or smartpins if you need more than that. Don't create a situation where there would be a problem, and you won't have a problem.

Also, note that, since everything's OR'd together, two cogs can write bitfields to the same address of one cog's LUT, and they'll get OR'd together. You could just only ever use one address, and it would be just like the attention strobes, except there would be 32 per cog.

Someone might come up with some clever reason to have one cog set the low bits and the other set the high bits of the address going to another cog's LUT. I can't see any use for this right now, but if it makes it to silicon, somebody will find a use.
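
Both ideas above can be sketched in a few lines of Python, assuming a plain OR of address and data across all cogs writing on the same clock. `FLAG_ADDR` and `merged_write` are hypothetical names for illustration only.

```python
FLAG_ADDR = 0x1FF  # hypothetical: one LUT long reserved for flag bits

def merged_write(writes):
    """OR together the (address, data) pairs presented on one clock."""
    address = 0
    data = 0
    for a, d in writes:
        address |= a
        data |= d
    return address, data

# Bitfields: cog 2 raises bit 2 and cog 5 raises bit 5 on the same clock.
# Both use the same address, so the address survives and the bits merge.
addr, data = merged_write([(FLAG_ADDR, 1 << 2), (FLAG_ADDR, 1 << 5)])
print(hex(addr), bin(data))  # 0x1ff 0b100100

# Split address: one cog supplies the low bits, another the high bits.
split_addr, _ = merged_write([(0x00F, 0), (0x100, 0)])
print(hex(split_addr))       # 0x10f
```

The bitfield case is safe only because every writer uses the identical address; as soon as the addresses differ, the OR produces a different location.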

Heater. · 2016-05-15 05:02

jmg,

Wow, really ? That means 2 COGS writing can affect a THIRD wholly unexpected address ? Yikes.

That's not the way I'm reading it, Chip said:

"Each cog's LUT's 2nd port would use an AND-OR mux to bring in writes/address/data from other cogs."

So that implies multiple COGS can write to a single COG, and if there is a conflict (they write at the same time) then wrong data will end up in the wrong place in the receiving COG's LUT.

Well, tough: you have a bug in the timing of your communications.

Where does it imply that there is a conflict in actual LUT selection that would cause an unrelated COG's LUT to be selected? As far as I can tell the damage is limited to the receiving COG in this group.

Chip's example code looked like this:

	SETLUTX	##$000F			' ready to write 4  LUTs
	WRLUTX	data,address		' write to the LUTs

I'm assuming the data and address in the second instruction can conflict with similar code running on another COG. But the actual damage is limited to the LUTs in your group selected by the bits in the first instruction. No random LUTs get written to.
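
A rough Python model of that reading, as a sketch under stated assumptions: bit n of the SETLUTX mask selects cog n's LUT, there are 16 cogs, each LUT holds 512 longs, and colliding address and data values are OR'd. None of this is confirmed hardware behavior; `clock_writes` is a hypothetical helper.

```python
NUM_COGS = 16   # assumed cog count
LUT_SIZE = 512  # assumed longs per LUT

def clock_writes(luts, writers):
    """Apply every write issued on one clock.

    writers: list of (mask, address, data), one entry per writing cog.
    Bit n of mask is assumed to select cog n's LUT as a target.
    """
    for target in range(NUM_COGS):
        address = 0
        data = 0
        selected = False
        for mask, a, d in writers:
            if (mask >> target) & 1:
                selected = True
                address |= a
                data |= d
        if selected:
            luts[target][address % LUT_SIZE] = data

luts = [[0] * LUT_SIZE for _ in range(NUM_COGS)]
# Two cogs with the same $000F mask collide on the same clock.
clock_writes(luts, [(0x000F, 0x011, 0xAA), (0x000F, 0x024, 0x5500)])
touched = [n for n, lut in enumerate(luts) if any(lut)]
print(touched)  # [0, 1, 2, 3]
```

Under these assumptions the collision still writes the wrong address ($035) inside each selected LUT, but LUTs outside the mask are never touched.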

Wow, so you do not consider corruption of wholly unexpected addresses something worth fixing ?? Really ?

Assuming things work as Chip described, I don't see how this is any worse than multiple COGS writing to shared HUB RAM space. It's the same problem. It's a fundamental problem of such data sharing. It's fixed by using LOCKs or otherwise getting your timing right.

cgracey · 2016-05-15 05:07

Electrodude wrote: »

jmg wrote: »

cgracey wrote: »

That's the kind of thing that just needs to be designed around, I think. I don't see any point in putting hardware in that shouldn't ever be needed if programming was correct.

And, yes, everything gets OR'd together.

Wow, really ? That means 2 COGS writing can affect a THIRD wholly unexpected address ? Yikes.

How do you 'design around' such flukes in timing ?

( I expected a Priority Encoder, so at least then one (known) COG wins, and goes to a valid address )

I'm quite amazed those who were worried about the quite simple management of COGIDs (or a smarter COGNEW), are now fine with this hand-grenade in the hands of novices ?

The simplest solution is to only have one given cog ever write to another cog's LUT. Do it in pairs or as one master with many slaves, but use the hub or smartpins if you need more than that. Don't create a situation where there would be a problem, and you won't have a problem.

Also, note that, since everything's OR'd together, two cogs can write bitfields to the same address of one cog's LUT, and they'll get OR'd together. You could just only ever use one address, and it would be just like the attention strobes, except there would be 32 per cog.

Someone might come up with some clever reason to have one cog set the low bits and the other set the high bits of the address going to another cog's LUT. I can't see any use for this right now, but if it makes it to silicon, somebody will find a use.

Wow! That bit field writing idea is really interesting.

jmg · 2016-05-15 05:12

Heater. wrote: »

Where does it imply that there is a conflict in actual LUT selection that would cause an unrelated COG's LUT to be selected?

That's not what I said.

Heater. wrote: »

So that implies multiple COGS can write to a single COG, and if there is a conflict (they write at the same time) then wrong data will end up in the wrong place in the receiving COG's LUT.

I put in bold what you said, which is exactly what I said; we agree on the effect. "In the wrong place" is the key here.

Good luck debugging that one !!

Heater. wrote: »

Assuming things work as Chip described, I don't see how this is any worse than multiple COGS writing to shared HUB RAM space. It's a fundamental problem of such data sharing,

Nope, read your own "in the wrong place" comment again.
The failure mode is quite different. HUB RAM never corrupts an unexpected address and, in fact, never fails at different address values.
The only issue in HUB access is with the SAME address. Chalk and cheese.

jmg · 2016-05-15 05:18

Electrodude wrote: »

The simplest solution is to only have one given cog ever write to another cog's LUT. Do it in pairs or as one master with many slaves, but use the hub or smartpins if you need more than that. Don't create a situation where there would be a problem, and you won't have a problem.

That works for me, but has a familiar ring to it ? LUT sharing anyone ?

JRetSapDoog · 2016-05-15 05:22

Oh my! Using a pin analogy: two cogs write the same pin and a different pin gets modified. That's scary, even if it were a nearby pin! I didn't know I signed up for that. I should have read the fine print. Yeah, definitely not for the faint of heart.

evanh · 2016-05-15 05:23

jmg wrote: »

evanh wrote: »

If you want to use the crazy approach of writing from multiple Cogs to a single LUT then you are the one responsible for coordination. A priority encoder isn't going to fix a thing.

Wow, so you do not consider corruption of wholly unexpected addresses something worth fixing ?? Really ?

It's a feature.

The real reason is it's not intended to be used that way. Another whole layer of logic is needed to stall the writes if we wanted to support multiple sources simultaneously. This would probably add base instruction latency on top of the stall time as it has to feed back to the writing Cog.

evanh · 2016-05-15 05:25

jmg wrote: »

Electrodude wrote: »

The simplest solution is to only have one given cog ever write to another cog's LUT. Do it in pairs or as one master with many slaves, but use the hub or smartpins if you need more than that. Don't create a situation where there would be a problem, and you won't have a problem.

That works for me, but has a familiar ring to it ? LUT sharing anyone ?

I don't think Electrodude is saying Chip change the hardware. Rather, just a convention of "don't do it" for the software writers.

Heater. · 2016-05-15 05:35

jmg,

...we agree on the effect. in the wrong place is the key here.

Good luck debugging that one !!

The only issue in HUB access, is with the SAME address. Chalk and Cheese

Looks like the same problem as multiple writers writing to a shared space in HUB. Say a FIFO or such. Often when you do that there is some shared pointer or index in that area that tells where data should be written next. Such pointers need to be updated to maintain the data structure. If there is a conflict the pointer value gets screwed and somebody ends up using it to write to the wrong place.
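
The hub-side failure described above can be sketched as a fixed interleaving in Python. The FIFO layout, the names and the two writers "A" and "B" are hypothetical; the point is only that an unprotected shared index sends a write to a slot the writer did not mean to use.

```python
fifo = [None] * 8
head = 0  # shared write index, imagined as living in hub RAM

# Both writers read the index before either one updates it.
a_slot = head
b_slot = head

fifo[a_slot] = "from A"
head = a_slot + 1

fifo[b_slot] = "from B"  # lands on A's slot and overwrites it
head = b_slot + 1

print(fifo[:2], head)    # ['from B', None] 1
```

A lock around the read-modify-write of `head`, or a strict timing arrangement, removes the race, which is the same remedy suggested for the LUT case.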

In fact, thinking about it, such problems are far worse when sharing data areas in HUB. An invalid pointer or index can cause someone to write to any HUB location, corrupting any COG anywhere. BUT when a LUT gets corrupted the damage is limited to the group of COGs.

All in all no worse off than we were without this mechanism. Perhaps even safer.

Heater. · 2016-05-15 05:38

jmg,

...has a familiar ring to it ? LUT sharing anyone ?

Sounds totally different. LUT sharing impacts everybody's code. The suggested mechanism confines the damage to your group of COGS.

Electrodude · 2016-05-15 05:46

jmg wrote: »

Electrodude wrote: »

The simplest solution is to only have one given cog ever write to another cog's LUT. Do it in pairs or as one master with many slaves, but use the hub or smartpins if you need more than that. Don't create a situation where there would be a problem, and you won't have a problem.

That works for me, but has a familiar ring to it ? LUT sharing anyone ?

Pairwise LUT sharing can certainly be done with this system. The whole point of this is that it can do pairwise LUT sharing without requiring the cogs to be next to each other. If you do more than a pair and aren't very careful about timing, you're either asking for trouble or doing something really clever.

Heater. · 2016-05-15 05:54

Can't think of a good example now, but my hunch is that the ability to broadcast with this new scheme is potentially a big win over the old LUT sharing.

cgracey · 2016-05-15 06:29

cgracey wrote: »

If everyone that wanted to write the same LUT were to wait for a system counter value whose bottom nibble matched their COGID, and then did the WRLUTX instruction, there would be no conflict.

I just realized there could be two WRLUTX instructions: WRLUTX to write immediately (2 clocks) and WRLUTXS to write when CT[3:0] = COGID. This would give each cog attempting a shared write to some LUT(s) a unique time slot, ensuring that no write conflict occurs.

It just so happens that we have exactly two 'D/#,S/#' instruction slots open in the opcode space.
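
A small Python sketch of the time-slot rule, assuming the write is released on the first clock at or after issue whose low nibble equals the COGID. `slot_clock` is a hypothetical helper, and real instruction latency is ignored.

```python
def slot_clock(issue_clock, cog_id):
    """First clock >= issue_clock whose low nibble equals cog_id (0..15)."""
    wait = (cog_id - issue_clock) & 0xF
    return issue_clock + wait

# Three cogs all issue their slotted write on clock 1000.
issue = 1000
landing = {cog: slot_clock(issue, cog) for cog in (1, 2, 7)}
print(landing)  # {1: 1009, 2: 1010, 7: 1015}
```

Because every COGID maps to a different residue modulo 16, two slotted writes can never land on the same clock. The cost is a wait of up to 15 clocks per write.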

jmg · 2016-05-15 06:34

Electrodude wrote: »

If you do more than a pair and aren't very careful about timing, you're either asking for trouble or doing something really clever.

A common usage of this that occurred to me inside 30 seconds, and I am sure will occur to countless other programmers, is to use one COG as a Floating Point unit.
Oops, that means many COGS wanting to write to one.
The user carefully allocates separate param space for each COG, thinking that is enough.
They test it, and it works fine with a pair of COGS, so they move on...

They pass a set of variables, plus the owner COG, in LUT and the FPU gives the result there, using the write-only pathways.

Only the OR effect (quite rarely) corrupts the owner COG value, and the FPU writes the results to an unrelated cog. *BOOM*
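
The scenario can be modelled in Python under the assumption that colliding address and data values are OR'd. The per-cog parameter slots in `PARAM_BASE` and the helper `fpu_lut_after` are hypothetical, chosen only to illustrate the failure.

```python
LUT_SIZE = 512  # assumed longs per LUT
PARAM_BASE = {1: 0x010, 2: 0x020, 3: 0x030}  # hypothetical per-cog slots

def fpu_lut_after(writes):
    """Return the FPU cog's LUT after all writes issued on one clock."""
    lut = [0] * LUT_SIZE
    address = 0
    data = 0
    for a, d in writes:
        address |= a
        data |= d
    lut[address % LUT_SIZE] = data
    return lut

# Cogs 1 and 2 each write their own ID into their own slot, same clock.
lut = fpu_lut_after([(PARAM_BASE[1], 1), (PARAM_BASE[2], 2)])
requests = {addr: owner for addr, owner in enumerate(lut) if owner}
print(requests)  # {48: 3}
```

The FPU cog sees a single request in slot $030 (decimal 48) claiming owner 3, so it would answer a cog that never asked, while the two real requests are lost.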

Heater. wrote: »

...The suggested mechanism confines the damage to your group of COGS.

Optimistic much ? See above example. Errors can easily propagate across the whole LUT area.

evanh · 2016-05-15 07:12

cgracey wrote: »

It just so happens that we have exactly two 'D/#,S/#' instruction slots open in the opcode space.

Nice. Glad to see a cheap addition that doesn't affect regular usage.

evanh · 2016-05-15 07:18

jmg wrote: »

... is to use one COG as a Floating Point unit.

Abuse for sure. Cluso!

jmg · 2016-05-15 07:21

cgracey wrote: »

I just realized there could be two WRLUTX instructions: WRLUTX to write immediately (2 clocks) and WRLUTXS to write when CT[3:0] = COGID. This would give each cog attempting a shared write to some LUT(s) a unique time slot, ensuring that no write conflict occurs.

It just so happens that we have exactly two 'D/#,S/#' instruction slots open in the opcode space.

That sounds like a better solution !

Things like FPU are not going to be 1-cycle paranoid, but they could be general use, and ad-hoc in nature.
Other cases, that are 1-cycle paranoid are likely to be more controlled, with pairing the most common.
It does not eliminate issues, but it does reduce them from the most general cases.

Heater. · 2016-05-15 08:07

jmg,

Optimistic much ? See above example. Errors can easily propagate across the whole LUT area.

Exactly. The errors are confined to the LUT in the target COG. This is arguably safer than similar errors that can occur when having multiple writers to HUB, where the error can corrupt any HUB location.

As soon as you share memory between multiple writers and readers an error in one can propagate to all the others.

However I now see where our difference of opinion lies. I assumed that when sharing the LUT memory the thing being shared is the entire LUT memory. Whereas you desire to share different areas of a LUT with different writers. Certainly a timing conflict brings down that scheme.

Is it worth fixing?

Your FPU, for example, has many writers, presumably all writing commands and operands. That implies the FPU COG has to run around and service them all, which means timing is critical here. So you might as well do it via shared HUB mailboxes.

The suggested WRLUTXS seems to imply introducing delays into the instruction so again timing has gone wonky.

Cluso99 · 2016-05-15 09:00

Wow! Overnight we have a new way.

Nice but for the potential of being a disaster waiting to happen.

If you wait for a slot, nice - much slower but you can still be clobbered by an object that doesn't wait! And to boot, it might not even be at the address you were expecting!
Yet the shared LUT detractors are happy with this??? But they are not happy with an object that behaves itself yet cannot find a pair of cogs to run on unless you manage the starting of the cogs, which is a fairly simple exercise if you want to use it.

Cluso99 · 2016-05-15 09:05

Chip,
How much die space is available?
Many of us would love more hub ram. Some would use it now, while I see it more later in the life cycle. Would there be enough for another 256KB ?

Heater. · 2016-05-15 09:18

Cluso,

Yet the shared LUT detractors are happy with this???

Seems like the lesser of two evils.

Cluso99 · 2016-05-15 09:21

Chip,
Re LUT sharing

I had thought each cog shared its LUT with the next cog, and the cog saw the previous cog's shared LUT. This would seem to be a simpler implementation logic-wise (minor).

Could the shared LUT (from the previous cog's LUT) be mapped and used as $200-$3FF?
Could this then be usable as normal LUT? That is, could it be used as LUT-exec, streamer, etc.? What I am wondering is whether there is a benefit (if simple to implement) such that a cog could have twice the LUT for normal use at the expense of the adjacent cog's LUT. This way we don't even require any extra rd/wrLUTX instructions to access the shared LUT.

The shared LUT concept seems to have so much going for it. Its only drawback is that the programmer will likely need to allocate these cogs first. There aren't the corruption problems of your latest proposal: if cogs aren't available then it will gracefully fail to launch!

evanh · 2016-05-15 10:20

I declare the Prop2 feature set as frozen!
