Is LUT sharing between adjacent cogs very important?

evanh · 2016-05-15 21:37

I'll repost this again at the top of this page:

cgracey wrote: »

Any cog can write up to 16 cogs' LUTs at once via 16 x 42-bit AND-OR muxes (1write+9addr+32data) which feed the 2nd port of each cog's LUT.

SETLUTX D/# 'set 16-bit write mask
WRLUTX D/#,S/# '2 clocks
WRLUTS D/#,S/# '2..16 clocks, no conflict possible for simultaneous writes when multiple cogs use this instruction to write a particular LUT.

An event/interrupt fires when a cog's LUT receives a write on its 2nd port.

You get two write instructions to choose from:
- If you think you may be sharing with other writers then use WRLUTS.
- If you're sure it's one-on-one then you can use the fast WRLUTX.
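A minimal Python sketch of how the AND-OR mux described above could behave, for anyone who wants to play with it. The bit layout (write enable in bit 41, address in bits 40..32, data in bits 31..0) and the names `pack`, `unpack` and `lut_ports` are assumptions made for illustration; the thread only gives the 1+9+32 split, and this is a behavioural model, not the actual logic.

```python
# Behavioural model of the proposed AND-OR mux (illustrative only).
# Each cog drives a 42-bit word: 1 write-enable bit, 9 address bits, 32 data bits.
NUM_COGS = 16

def pack(addr, data):
    """Pack write-enable, 9-bit address and 32-bit data into one 42-bit word."""
    return (1 << 41) | ((addr & 0x1FF) << 32) | (data & 0xFFFFFFFF)

def unpack(word):
    """Split a 42-bit word back into (write, addr, data)."""
    return (word >> 41) & 1, (word >> 32) & 0x1FF, word & 0xFFFFFFFF

def lut_ports(writes):
    """writes: {cog: (mask, addr, data)} for one clock.
    Returns the 42-bit word seen at each LUT's second port."""
    ports = [0] * NUM_COGS
    for cog, (mask, addr, data) in writes.items():
        word = pack(addr, data)
        for lut in range(NUM_COGS):
            if (mask >> lut) & 1:      # AND stage: gate by the writer's mask bit
                ports[lut] |= word     # OR stage: merge all gated writers
    return ports

# Cog 0 writes $DEADBEEF to address $010 of LUTs 1 and 3 in one go.
ports = lut_ports({0: (0b1010, 0x010, 0xDEADBEEF)})
print(unpack(ports[1]))  # (1, 16, 3735928559)
print(unpack(ports[2]))  # (0, 0, 0)
```

The mask plays the role of the SETLUTX value: one bit per target LUT, so a single write can fan out to several LUTs at the same address.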

JRetSapDoog · 2016-05-15 22:02

cgracey wrote: »

SETLUTX D/# 'set 16-bit write mask
WRLUTX D/#,S/# '2 clocks
WRLUTS D/#,S/# '2..16 clocks, no conflict possible for simultaneous writes when multiple cogs use this instruction to write a particular LUT.

What's the 'X' stand for? A variable meaning any LUT address? There's probably some convention that I'm not familiar with. Hmm, maybe 'X' denotes the mask (though it's present in both writes).

Anyway, so far, I've got the following for 'X' and 'S':

X = eXtreme ??? Yeah, I'm partly kidding.
S = Shared/Safe(r)/Slow(er); I think Chip intends "shared."

Note: I didn't list "Simultaneous writes" for 'S' because they are not simultaneous from the underlying hardware standpoint, and, besides, both the 'S' and 'X' versions in the code of one cog can simultaneously write multiple cogs' LUT's (at the same LUT address for each of the multiple cogs).

I guess 'X' marks the spot (address), but it could be a moving spot.

Anyhow, I'm falling in love with this approach all over again. It just ties the cogs together better.

For ref: Here's a list of ex-words: http://www.morewords.com/starts-with/ex/. Go nuts!

MJB · 2016-05-15 22:02

evanh wrote: »

I'll repost this again at the top of this page:

cgracey wrote: »

Any cog can write up to 16 cogs' LUTs at once via 16 x 42-bit AND-OR muxes (1write+9addr+32data) which feed the 2nd port of each cog's LUT.

SETLUTX D/# 'set 16-bit write mask
WRLUTX D/#,S/# '2 clocks
WRLUTS D/#,S/# '2..16 clocks, no conflict possible for simultaneous writes when multiple cogs use this instruction to write a particular LUT.

An event/interrupt fires when a cog's LUT receives a write on its 2nd port.

You get two write instructions to choose from:
- If you think you may be sharing with other writers then use WRLUTS.
- If you're sure it's one-on-one then you can use the fast WRLUTX.

Does this really mean ALL COGs can write simultaneously, on the same clock, to some arbitrary LUT as long as there is no conflict (or even if there is a conflict, as a special feature)?
COG0 -> LUT1
COG1 -> LUT3
COG2 -> LUT12
COG3 -> LUT15
...
then it includes the neighbouring LUT sharing model and I buy it ;-)

addit: now we just need to add a simple mask for each COG's write lines (like the DIR)
and all wild mixes are possible.
Like 4 COGs writing to the same LUT same LONG-Address - each one 'his' byte ;-)
I don't have a use in mind, but someone will find one ...

kwinn · 2016-05-15 22:19

Roy Eltham wrote: »

You guys are making it sound like interference from other cogs and the possibility of data/address corruption things are going to be common and have to be carefully worked around to make "safe" code. However, in real practical usage cases it's not something you even need to worry about much if at all.

The fact that it may be a rare occurrence does not necessarily make it less of a problem. In many cases that makes it a worse problem. Finding a glitch that happens once a week or less is a much more difficult proposition than tracking down one that happens every few minutes, and that very intermittent type of fault is one that is much more likely to be shipped to customers. Not a good situation, and I say that from experience.

Also, I'm not sure I agree with all your advantages for choice 1. For example #5 is BS, because it's true for Choice 2 also, and one could argue it's easier to accomplish in Choice 2.

No. Cogs in Choice 1 can read and write data into the other cog's LUT without collisions, while those in Choice 2 would both end up with corrupted data if they were to write to each other's LUTs at the same time (same SysClk). Remember, the address and data lines are shared between all cogs and are OR'ed together.

Two of your disadvantages for choice 2 are about a feature that isn't even possible in choice 1. Choice 1 does not have multiple senders to one LUT other than just the two sharing .....

True, you cannot send from one cog to multiple receiver cogs, which is one reason I think the two choices complement each other to an extent. OTOH data can be passed up through several sequential cogs, which may not be as fast, but can be used to advantage in some cases.

....AND those two can conflict with each other and need to do something to avoid it in those cases. Choice 1 is not completely free of interference/collisions either. If both cogs write to the same LUT address at the same time, at best only 1 wins.

Definitely a possible problem, but at least the hunt is limited to two cogs rather than sixteen, and easily avoided by having reserved areas of the LUT for each cog.

Besides, choice 2 wins hands down because of the number one reason, this is a Propeller chip.

This reason is definitely a case of taking a generally good philosophy to an extreme.

Finally, you forgot the second main disadvantage of choice 1, COGNEW needs to be reworked or eliminated. I don't care if you can work out techniques to get 2 adjacent cogs without changing it, that doesn't solve it. You keep ignoring the issues with this that many have spelled out (including Chip).

No. COGNEW and COGINIT can stay as they are. All that is required is that code loaded into cog0 at bootup reserves whatever sequential cogs are needed for LUT sharing. That is not a hardship. All that is required is a repeat loop that starts cogs. Very simple in spin, pasm, or C. Other objects can then use the remaining cogs.
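The boot-time reservation idea can be sketched with a toy allocator in Python. This assumes COGNEW hands out the lowest-numbered free cog, which is an assumption about the allocator rather than something stated in this thread; `Chip`, `cognew` and `reserve_block` are hypothetical names for the model.

```python
# Toy model of cog allocation (assumption: COGNEW starts the lowest-numbered free cog).
class Chip:
    def __init__(self, num_cogs=16):
        self.running = [False] * num_cogs
        self.running[0] = True          # boot code runs in cog 0

    def cognew(self):
        """Start the lowest-numbered free cog; return its ID, or -1 if none is free."""
        for cog_id, busy in enumerate(self.running):
            if not busy:
                self.running[cog_id] = True
                return cog_id
        return -1

def reserve_block(chip, count):
    """Boot-time repeat loop: grab `count` cogs before any other object starts one."""
    return [chip.cognew() for _ in range(count)]

chip = Chip()
group = reserve_block(chip, 3)
print(group)            # [1, 2, 3]
print(chip.cognew())    # 4
```

Under that assumption the reserved cogs come out sequential only because nothing else has started a cog yet, which is why the loop has to run first thing at boot.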

jmg · 2016-05-15 22:23

MJB wrote: »

addit: now we just need to add a simple mask for each COG's write lines (like the DIR)
and all wild mixes are possible.
Like 4 COGs writing to the same LUT same LONG-Address - each one 'his' byte ;-)
I don't have a use in mind, but someone will find one ...

That would be possible with Chip's proposed AND-OR path 16*(1+9+32) = 672 wide
It would need extreme care to make sure they all did that write on the same SysCLK, to the same address, to activate the OR.
If you are off-by-one, you would get only one 'byte', with others cleared to 00.
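A small Python sketch of that byte-lane case, assuming the simplest possible model: data words that reach one LUT address on the same clock are OR'd, and a write on a later clock replaces the whole long. The function `run` and the lane values are made up for illustration.

```python
def run(writes_by_clock, addr=0x100):
    """writes_by_clock: list of per-clock lists of 32-bit data words, all aimed at `addr`.
    Words arriving on the same clock are OR'd; a later clock overwrites the long."""
    lut = {addr: 0}
    for words in writes_by_clock:
        if words:
            merged = 0
            for w in words:
                merged |= w
            lut[addr] = merged
    return lut[addr]

lanes = [0x11 << (8 * i) for i in range(4)]   # each cog owns one byte lane
aligned = run([lanes])                        # all four cogs hit the same clock
late = run([lanes[:3], [lanes[3]]])           # the fourth cog is one clock late
print(hex(aligned))  # 0x11111111
print(hex(late))     # 0x11000000
```

In the late case only the straggler's byte survives, with the other three lanes cleared to 00.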

MJB wrote: »

then it includes the neighbouring LUT sharing model and I buy it ;-)

Yes, it is a super set, with fish-hooks, but keep in mind it does add significant logic
16*(1+9+32) = 672 wide Bus, (+ Sea of gates) which may yet mean a lower MHz, or less RAM.

Roy Eltham · 2016-05-15 22:27

dMajo wrote: »

Roy Eltham wrote: »

Also, I'm not sure I agree with all your advantages for choice 1. For example #5 is BS, because it's true for Choice 2 also, and one could argue it's easier to accomplish in Choice 2. Two of your disadvantages for choice 2 are about a feature that isn't even possible in choice 1. Choice 1 does not have multiple senders to one LUT other than just the two sharing AND those two can conflict with each other and need to do something to avoid it in those cases.

Wrong:
because when cogN-1 and cogN+1 both write to cogN's LUT, each one does so in the half of the LUT visible to it, i.e. 2 different areas.

I'm talking about cogN writing to the same location as either cogN+1 or cogN-1. That is the case that can conflict. Yes they are on different ports, but it's still the same ram location, so only one can win at best.

Roy Eltham · 2016-05-15 22:28

cgracey wrote: »

Here's what I think should be done:

Any cog can write up to 16 cogs' LUTs at once via 16 x 42-bit AND-OR muxes (1write+9addr+32data) which feed the 2nd port of each cog's LUT.

SETLUTX D/# 'set 16-bit write mask
WRLUTX D/#,S/# '2 clocks
WRLUTS D/#,S/# '2..16 clocks, no conflict possible for simultaneous writes when multiple cogs use this instruction to write a particular LUT.

An event/interrupt fires when a cog's LUT receives a write on its 2nd port.

Chip this is excellent!

evanh · 2016-05-15 22:35

MJB wrote: »

Does this really mean ALL COGs can write simultaneously, on the same clock, to some arbitrary LUT as long as there is no conflict (or even if there is a conflict, as a special feature)?
COG0 -> LUT1
COG1 -> LUT3
COG2 -> LUT12
COG3 -> LUT15
...
then it includes the neighbouring LUT sharing model and I buy it ;-)

Yes. Every Cog has a dedicated 42-wire (32+9+1) bus going to every LUT's second port (even its own). There the buses are mux'd together. Part of the cost will also be in the length of those buses travelling between all the Cogs, 16 x 42 = 672 wires doing the loop.

Like 4 COGs writing to the same LUT same LONG-Address - each one 'his' byte ;-)
I don't have a use in mind, but someone will find one ...

AKA abuse.

jmg · 2016-05-15 22:38

Roy Eltham wrote: »

I'm talking about cogN writing to the same location as either cogN+1 or cogN-1. That is the case that can conflict. Yes they are on different ports, but it's still the same ram location, so only one can win at best.

Not quite apples with apples there.
In the Shared LUT case, those COGS are already closely co-operating, and are completely immune from other COGs.
The normal memory map for Dual Port, is to have one region for write and another for read, so address-map naturally makes this safe.
Same SysCLK writes are quite ok, as it is true dual port.

Quite the opposite, for an OR-scheme, where using separate addresses (same LUT) corrupts to a third wholly unexpected address.

jmg · 2016-05-15 22:42

dMajo wrote: »

But as I said above, if this or nothing, then better this.

Well yes, if this has no MHz or RAM impacts, then I could use it.

dMajo wrote: »

But I pose the same question as jmg, how can, for the purists of this forum, this be better than adjacent LUT sharing. I can't understand.

Yes, a wry smile is needed there...

jmg · 2016-05-15 22:45

evanh wrote: »

Like 4 COGs writing to the same LUT same LONG-Address - each one 'his' byte ;-)
I don't have a use in mind, but someone will find one ...

AKA abuse.

One man's 'abuse' is another man's craft...
As I mentioned above, this could work, with the matching extreme timing care.

I'm not sure if Chip still has the inter-cog signaling ? Hopefully, yes ?

Roy Eltham · 2016-05-15 22:49

kwinn, and others, I think you have this and/or thing wrong for Chip's newly proposed solution. It's not the entire bus for all cogs conflicting with each other all the time like you seem to think.
The and/or stuff happens at each cog's LUT as it goes into the second port. The conflict only happens when multiple cogs try to write to the same LUT on the same sysclk.

You can have cog1 write to cog2's LUT, and cog2 write to cog1's LUT at the same time with no conflict, and any other combination of cogs writing to other cogs' LUTs all at the same time, as long as there isn't any one LUT being targeted on the same sysclk by more than one other cog.
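The rule in the paragraph above can be written as a quick check in Python: given the write masks that each cog issues on one clock, list the LUTs targeted by more than one writer. `conflicting_luts` is a hypothetical helper for reasoning about a design, not anything the hardware provides.

```python
from collections import defaultdict

def conflicting_luts(masks):
    """masks: {writer_cog: 16-bit target mask} for writes issued on one clock.
    Returns {lut: [writers]} for every LUT targeted by more than one cog."""
    targets = defaultdict(list)
    for cog, mask in sorted(masks.items()):
        for lut in range(16):
            if (mask >> lut) & 1:
                targets[lut].append(cog)
    return {lut: cogs for lut, cogs in targets.items() if len(cogs) > 1}

# cog1 -> LUT2 and cog2 -> LUT1 on the same clock: no conflict.
print(conflicting_luts({1: 1 << 2, 2: 1 << 1}))             # {}
# cog1 and cog2 both target LUT5: conflict; cog3 -> LUT9 is unaffected.
print(conflicting_luts({1: 1 << 5, 2: 1 << 5, 3: 1 << 9}))  # {5: [1, 2]}
```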

Think of it like the pins in P1, each of the 32 pins has a line from each of the 8 cogs, and at the pin the 8 lines going to it from the cogs are or'd together.

The reason it has the significant logic cost is that each of the 42 bits has to OR the input lines of 15 other cogs for the bit. So there are 42 15-input OR gates per LUT, and 42x15x16 connections.

If this were not the case, then it would be so severely limited (only one cog can do a write at any given sysclk without conflict) that it's silly, and clearly Chip wouldn't have suggested it.
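For reference, the counts quoted in this thread worked through in Python. The only inputs are the figures already given (16 cogs, a 1+9+32-bit bus, 15 other cogs per OR gate); the variable names are just labels.

```python
COGS = 16
BUS_BITS = 1 + 9 + 32            # write enable + address + data = 42
INPUTS_PER_GATE = COGS - 1       # counting only the 15 *other* cogs

bus_wires = COGS * BUS_BITS                      # one 42-bit bus per cog
or_gates_per_lut = BUS_BITS                      # one OR gate per bus bit
connections = BUS_BITS * INPUTS_PER_GATE * COGS  # gate inputs across all 16 LUTs
print(bus_wires, or_gates_per_lut, connections)  # 672 42 10080
```

If a cog's bus also reaches its own LUT, as suggested elsewhere in the thread, each gate has 16 inputs and the last figure becomes 42 x 16 x 16 = 10752.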

Seairth · 2016-05-15 22:55

Wait a minute. I'm not understanding something here... I get that the data lines are ORed, but what about the address lines? What if two cogs try to write to two different addresses in the same LUT at the same time?

jmg · 2016-05-15 23:06

Seairth wrote: »

Wait a minute. I'm not understanding something here... I get that the data lines are ORed, but what about the address lines? What if two cogs try to write to two different addresses in the same LUT at the same time?

Yes Addr too... See my notes above.
Same LUT, multiple writes, can end up in a third (unexpected) address on the target LUT and/or also corrupt the data.
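A two-line illustration of that in Python, assuming the address and data lines from every writer are simply OR'd at the target port. `merged_write` and the example values are hypothetical.

```python
def merged_write(writes):
    """OR together (addr, data) pairs that hit one LUT port on the same clock."""
    addr = data = 0
    for a, d in writes:
        addr |= a & 0x1FF
        data |= d & 0xFFFFFFFF
    return addr, data

# One cog writes $F0 to address $005, another writes $0F to address $00A, same clock.
addr, data = merged_write([(0x005, 0x000000F0), (0x00A, 0x0000000F)])
print(hex(addr), hex(data))  # 0xf 0xff
```

Neither $005 nor $00A is written; the merged value lands at $00F, an address neither cog asked for.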

dMajo · 2016-05-15 23:13

Roy Eltham wrote: »

You can have cog1 write to cog2's LUT, and cog2 write to cog1's LUT at the same time with no conflict, and any other combination of cogs writing to other cogs' LUTs all at the same time, as long as there isn't any one LUT being targeted on the same sysclk by more than one other cog.

So you are saying that two obex objects, using COGNEW for 2/3 cogs and then SETLUTX based upon the received CogID, can use WRLUTX safely without being aware of each other? Of course, ensuring that there are no conflicts inside an object is the programmer's responsibility.

Rayman · 2016-05-15 23:15

I think this is a great bonus feature.

I imagine the LUT will be underutilized and this will help make it more useful for general applications.

jmg · 2016-05-15 23:22

Rayman wrote: »

I think this is a great bonus feature.

I imagine the LUT will be underutilized and this will help make it more useful for general applications.

Yes, I could see compiler options to pass parameters in LUT, as that then opens the choice to flip some functions over to another COG.
LUT opcodes may need a slight revisit...

Seairth · 2016-05-15 23:28

jmg wrote: »

Seairth wrote: »

Wait a minute. I'm not understanding something here... I get that the data lines are ORed, but what about the address lines? What if two cogs try to write to two different addresses in the same LUT at the same time?

Yes Addr too... See my notes above.
This can end up in a third (unexpected) address on the target LUT.

I missed that. So, basically, only WRLUTS can be used when two or more cogs need to perform uncoordinated writes to the same LUT. Technically, you could use WRLUTX, but the cogs would have to cooperatively avoid calling it on the same clock cycle. Tricky, but not impossible. (It may be that all of this has been covered in prior posts, but I'm only just now grokking the implications.)
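The thread does not say how WRLUTS avoids conflicts. One scheme that would be consistent with a variable latency of up to 16 clocks is a 16-slot rotation, similar in spirit to hub access, where each cog may drive a shared port only in its own slot. The Python sketch below models that guess only; `next_slot` and `schedule` are hypothetical, and the real mechanism may differ.

```python
SLOTS = 16

def next_slot(cog, clock):
    """First clock >= `clock` on which `cog` owns the write slot (hypothetical scheme)."""
    return clock + (cog - clock) % SLOTS

def schedule(requests):
    """requests: list of (cog, issue_clock). Returns {cog: clock of the actual write}."""
    return {cog: next_slot(cog, clock) for cog, clock in requests}

# Cogs 3 and 7 both issue a write on clock 100; they land on different clocks.
print(schedule([(3, 100), (7, 100)]))  # {3: 115, 7: 103}
```

Because no two cogs share a slot, uncoordinated writers can never hit the same LUT on the same clock, at the price of waiting up to 15 clocks for the slot to come around.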

@cgracey, were you able to change the attention events to support the 16 inputs, as I suggested earlier? This could be used along with WRLUT* to ACK. If you keep the attention event a single ORed input, then the writer cannot determine who is ACKing.

jmg · 2016-05-15 23:38

Seairth wrote: »

I missed that. So, basically, only WRLUTS can be used when two or more cogs need to perform uncoordinated writes to the same LUT. Technically, you could use WRLUTX, but the cogs would have to cooperatively avoid calling it on the same clock cycle. Tricky, but not impossible.

Pretty much. WRLUTX is best used in an enclosure, and WRLUTS is slower, but safer in the wild...
There is even the case suggested above, of using 4 COGS to send byte info, on the same SysCLK, which needs careful cooperative calling to hit the same clock cycle!

Given the caveats here, I wonder if the tools should have an option switch to flip between WRLUTS & WRLUTX opcode choices?

Seairth · 2016-05-15 23:54

jmg wrote: »

There is even the case suggested above, of using 4 COGS to send byte info, on the same SysCLK, which needs careful cooperative calling to hit the same clock cycle!

I can see how that could be very useful! One cog might use the attention event to tell four other cogs that it is ready for the next 4 bytes of data. As long as all 4 cogs are sitting in a WAITATN, they would all simultaneously wake up and WRLUTX on the same clock cycle. Neat.

Roy Eltham · 2016-05-16 00:00

dMajo wrote: »

Roy Eltham wrote: »

You can have cog1 write to cog2's LUT, and cog2 write to cog1's LUT at the same time with no conflict, and any other combination of cogs writing to other cogs' LUTs all at the same time, as long as there isn't any one LUT being targeted on the same sysclk by more than one other cog.

So you are saying that two obex objects, using COGNEW for 2/3 cogs and then SETLUTX based upon the received CogID, can use WRLUTX safely without being aware of each other? Of course, ensuring that there are no conflicts inside an object is the programmer's responsibility.

No, I'm just saying that you have to take care in programming it because you can have a conflict, so it's not right to say it has no possibility of conflict. I didn't read the advantage/disadvantage thing as strictly just regarding obex stuff, or separate objects working at the same time, but just the overall advantage/disadvantage.

jmg · 2016-05-16 00:26

Seairth wrote: »

jmg wrote: »

There is even the case suggested above, of using 4 COGS to send byte info, on the same SysCLK, which needs careful cooperative calling to hit the same clock cycle!

I can see how that could be very useful! One cog might use the attention event to tell four other cogs that it is ready for the next 4 bytes of data. As long as all 4 cogs are sitting in a WAITATN, they would all simultaneously wake up and WRLUTX on the same clock cycle. Neat.

Yes, or if you were running Sync'd COGs, this could be a nice way to confirm they were keeping sync'd.
A simple incrementing byte on each trigger, could be a good way to give trace info, at quite low overhead.
If they avoid 00H, any 00H (usually 3 x 00H) is loss of sync, and any change in byte offsets is a sync-creep.
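A rough Python sketch of the reader side of that check, assuming four writers with one byte lane each and counters that skip 00. `lanes` and `check_sync` are hypothetical names, and the offsets are measured relative to lane 0 as a simplification.

```python
def lanes(word):
    """Split a 32-bit long into its four byte lanes, lane 0 first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def check_sync(word, expected_offsets=(0, 0, 0)):
    """Classify one merged long. Offsets are lane[i] - lane[0] for i = 1..3."""
    b = lanes(word)
    if 0 in b:
        return "loss of sync"      # a writer missed the clock, so its lane is 00
    offsets = tuple((x - b[0]) & 0xFF for x in b[1:])
    return "in sync" if offsets == expected_offsets else "sync creep"

print(check_sync(0x05050505))  # in sync
print(check_sync(0x05000000))  # loss of sync
print(check_sync(0x05050605))  # sync creep
```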

kwinn · 2016-05-16 02:23

Roy Eltham wrote: »

dMajo wrote: »

Roy Eltham wrote: »

Also, I'm not sure I agree with all your advantages for choice 1. For example #5 is BS, because it's true for Choice 2 also, and one could argue it's easier to accomplish in Choice 2. Two of your disadvantages for choice 2 are about a feature that isn't even possible in choice 1. Choice 1 does not have multiple senders to one LUT other than just the two sharing AND those two can conflict with each other and need to do something to avoid it in those cases.

Wrong:
because when cogN-1 and cogN+1 both write to cogN's LUT, each one does so in the half of the LUT visible to it, i.e. 2 different areas.

I'm talking about cogN writing to the same location as either cogN+1 or cogN-1. That is the case that can conflict. Yes they are on different ports, but it's still the same ram location, so only one can win at best.

Yes, I am familiar with a variety of implementations of multi-port memories and their inherent problems, and as I posted previously, the way to avoid that problem is very simple. The software writer assigns each cog in a share group the areas of the LUTs it can write to, so if cog(n-1) is assigned area A, cog(n) and cog(n+1) are assigned to other areas. That is, after all, part of the programmer's job.
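The reserved-area convention could look like this in a Python sketch, assuming a 512-long LUT (9 address bits) split evenly between the cogs that write to it. `make_regions` and `may_write` are hypothetical helpers that only document the convention; nothing in the hardware enforces it.

```python
LUT_LONGS = 512   # 9 address bits

def make_regions(writers):
    """Split the LUT evenly among `writers`; returns {cog: range of addresses}."""
    size = LUT_LONGS // len(writers)
    return {cog: range(i * size, (i + 1) * size) for i, cog in enumerate(writers)}

def may_write(regions, cog, addr):
    """True if `addr` lies inside the area reserved for `cog`."""
    return cog in regions and addr in regions[cog]

regions = make_regions([4, 5])       # owner cog 5 and its neighbour cog 4
print(may_write(regions, 4, 0x0FF))  # True
print(may_write(regions, 4, 0x100))  # False
```

This protects a true dual-port LUT, where same-clock writes to different addresses are fine. It does not help with an OR'd address bus, where two same-clock writes to different addresses still merge into a third address.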

evanh · 2016-05-16 02:33

kwinn wrote: »

... The software writer assigns each cog in a share group the areas of the LUTs it can write to, so if cog(n-1) is assigned area A, cog(n) and cog(n+1) are assigned to other areas. That is, after all, part of the programmer's job.

That is the abuse case, i.e. it is not the intent of this feature.

cgracey · 2016-05-16 02:45

Seairth wrote: »

jmg wrote: »

Seairth wrote: »

Wait a minute. I'm not understanding something here... I get that the data lines are ORed, but what about the address lines? What if two cogs try to write to two different addresses in the same LUT at the same time?

Yes Addr too... See my notes above.
This can end up in a third (unexpected) address on the target LUT.

I missed that. So, basically, only WRLUTS can be used when two or more cogs need to perform uncoordinated writes to the same LUT. Technically, you could use WRLUTX, but the cogs would have to cooperatively avoid calling it on the same clock cycle. Tricky, but not impossible. (It may be that all of this has been covered in prior posts, but I'm only just now grokking the implications.)

@cgracey, were you able to change the attention events to support the 16 inputs, as I suggested earlier? This could be used along with WRLUT* to ACK. If you keep the attention event a single ORed input, then the writer cannot determine who is ACKing.

Your understanding is correct.

About knowing who caused the attention (or LUT write): how do we parse the who information? I could see it being rapidly overwritten and lost. I think it's unrealistic to know who did something, but realistic to know that someone did something.

Rayman · 2016-05-16 02:47

For one, I really don't care about this LUT stuff, I just want it done...

Although, I could see it being useful in certain situations...

Electrodude · 2016-05-16 02:55

It might be nice if there were some sort of mode or instruction where you could only set, but not clear, bits in another cog's LUT. This would be good if your application only ever uses one address and wants to send different sorts of flags, possibly at different times but also possibly at the same time.

jmg · 2016-05-16 03:02

cgracey wrote: »

Seairth wrote: »

@cgracey, were you able to change the attention events to support the 16 inputs, as I suggested earlier? This could be used along with WRLUT* to ACK. If you keep the attention event a single ORed input, then the writer cannot determine who is ACKing.

Your understanding is correct.

About knowing who caused the attention (or LUT write): how do we parse the who information? I could see it being rapidly overwritten and lost. I think it's unrealistic to know who did something, but realistic to know that someone did something.

What is the status of separate attention / wait flags between COGs.
Are they still there ? (or did they swallow into this new Any-LUT feature ?)

jmg · 2016-05-16 03:05

Electrodude wrote: »

It might be nice if there were some sort of mode or instruction where you could only set, but not clear, bits in another cog's LUT. This would be good if your application only ever uses one address and wants to send different sorts of flags, possibly at different times but also possibly at the same time.

That would need a Read-Modify-Write,
or
P2 could steal an idea from 32b MCUs with poor Boolean support, and they have one address to Set-bit and another address to Clear-bit.
The 'memory' cell behaves as a S-R FF and allows any/all bits to be set, without needing a RMW action.
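A minimal Python model of the set/clear alias idea, assuming one write path that only sets bits and another that only clears them. `FlagWord` and its method names are hypothetical; this shows the behaviour, not a proposed P2 instruction.

```python
class FlagWord:
    """32-bit flag long with separate set and clear aliases (no read-modify-write)."""
    def __init__(self):
        self.value = 0

    def write_set(self, bits):
        self.value |= bits & 0xFFFFFFFF       # 1 bits set flags, 0 bits leave them alone

    def write_clear(self, bits):
        self.value &= ~bits & 0xFFFFFFFF      # 1 bits clear flags, 0 bits leave them alone

flags = FlagWord()
flags.write_set(0b0001)     # writer A raises flag 0
flags.write_set(0b0100)     # writer B raises flag 2, flag 0 survives
flags.write_clear(0b0001)   # reader acknowledges flag 0
print(bin(flags.value))     # 0b100
```

Because a set write never clears anything, two writers raising different flags cannot wipe out each other's bits, whether they write at different times or together.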

kwinn · 2016-05-16 03:05

Roy Eltham wrote: »

kwinn, and others, I think you have this and/or thing wrong for Chip's newly proposed solution. It's not the entire bus for all cogs conflicting with each other all the time like you seem to think.
The and/or stuff happens at each cog's LUT as it goes into the second port. The conflict only happens when multiple cogs try to write to the same LUT on the same sysclk.

You can have cog1 write to cog2's LUT, and cog2 write to cog1's LUT at the same time with no conflict, and any other combination of cogs writing to other cogs' LUTs all at the same time, as long as there isn't any one LUT being targeted on the same sysclk by more than one other cog.

Think of it like the pins in P1, each of the 32 pins has a line from each of the 8 cogs, and at the pin the 8 lines going to it from the cogs are or'd together.

The reason it has the significant logic cost is that each of the 42 bits has to OR the input lines of 15 other cogs for the bit. So there are 42 15-input OR gates per LUT, and 42x15x16 connections.

If this were not the case, then it would be so severely limited (only one cog can do a write at any given sysclk without conflict) that it's silly, and clearly Chip wouldn't have suggested it.

Perhaps I am wrong, but that's certainly not how I interpret what chip posted.

cgracey wrote: »
Here's what I think should be done:

Any cog can write up to 16 cogs' LUTs at once via 16 x 42-bit AND-OR muxes (1write+9addr+32data) which feed the 2nd port of each cog's LUT.

SETLUTX D/# 'set 16-bit write mask
WRLUTX D/#,S/# '2 clocks
WRLUTS D/#,S/# '2..16 clocks, no conflict possible for simultaneous writes when multiple cogs use this instruction to write a particular LUT.

An event/interrupt fires when a cog's LUT receives a write on its 2nd port.

Sounds more like a single cog can write to one or more (up to 16) LUT's at once, which to me implies a bus consisting of 16 bits of LUT selects, 9 bits of LUT address, and 32 bits of data. IOW one cog sends and 1..16 LUT's receive. Multiple cogs sending on the same SysClk results in garbage in. Also means writes to all the selected LUT's are at the same address location.
