Is LUT sharing between adjacent cogs very important?

MJB · 2016-05-15 10:50

jmg wrote: »

cgracey wrote: »

That's the kind of thing that just needs to be designed around, I think. I don't see any point in putting hardware in that shouldn't ever be needed if programming was correct.

And, yes, everything gets OR'd together.

Wow, really ? That means 2 COGS writing can affect a THIRD wholly unexpected address ? Yikes.

How do you 'design around' such flukes in timing ?

( I expected a Priority Encoder, so at least then one (known) COG wins, and goes to a valid address.
Other COGS could simply stall until their priority is reached. No garbage or corruptions )

I'm quite amazed those who were worried about the quite simple management of COGIDs (or a smarter COGNEW), are now fine with this hand-grenade in the hands of novices ?

I am wondering that with this new mechanism in comparison with the neighboring LUT sharing,
nobody talks about, what looks to me an important difference.
The new version introduced only ONE conduit, that has to be managed very carefully.
Whereas the neighbour version introduces 16 conduits, that deterministically work really in PARALLEL.
No traffic jam, no timing issues, deterministic, fast ..., very easy to handle, once the COGs have been allocated.

And the new mechanism will require standards, every OBJECT needs to follow (if they want to use this mechanism)
to not create a huge nightmare ...
Not much different to what's needed for the neighbour-LUTs - to me it looks more complicated right now.

edit: while I write half a dozen new messages - wow

evanh · 2016-05-15 11:01

MJB wrote: »

... nobody talks about, what looks to me an important difference.
The new version introduced only ONE conduit, that has to be managed very carefully. ...

Come on, that was covered heavily in the questions. Read it again. Chip is investing a lot of logic to make this work better than ever and, rather amazingly, satisfies both concerns and then some.

kwinn · 2016-05-15 12:29

evanh wrote: »

cgracey wrote: »

There is no hub timing involved. It happens right away. If a lot of cogs want to write one LUT, they need to do it on different SysClks.

Ah, right, yep. The Cogs would have to be organised not to write on the same cycle.

Wouldn't that imply some variable latency just like the eggbeater or hub access on the P1?

Heater. · 2016-05-15 13:06

MJB,

...the new mechanism will require standards, every OBJECT needs to follow...

I don't think so. There are a few cases we can consider:

1) I write an object that uses multiple COGS and they communicate among themselves via the new LUT mechanism.

No problem. I just have to get my cooperating COGS to work properly. Then anyone can use that object with the ease of a typical P1 object today. Any problems are confined to my code and are for me to fix. The user need not know or care how I get the work done.

2) I write an object that may use multiple objects but this time I use the new LUT mechanism as an API for others to interact with it from their programs.

That could be problematic. Especially in the case of having multiple client processes in the user code accessing the API LUT. As in jmg's FPU example.

However I don't see that as any worse than doing the same by sharing HUB RAM with user code as an interface. As we can already do on the P1.

3) Can't think of a case 3). What am I missing ?

The old shared LUT idea creates dependencies that infect the entire Propeller code base.

Rayman · 2016-05-15 13:14

I think I like this new LUT scheme.
Makes good use of 2nd LUT port and maintains cog symmetry.

Not sure I like removing smartpins to test it though...

T Chap · 2016-05-15 13:28

Is there a bigger FPGA so at least Chip or someone else could test the full set?

mindrobots · 2016-05-15 13:43

@rayman, I don'the think it is a case of removing ALL smartpins, just some. I think P2 feature testing/verification could be done with 32 smart pins and 32 dumb pins.

potatohead · 2016-05-15 15:12

Seconded. Only some smart pins get removed.

potatohead · 2016-05-15 15:25

Case 3 seems to be adding one or more COGS to an existing set using the LUT broadcast capability.

This is like adding a graphics COG to a set delivering video. Or it might be to offer some other capability, or extend capability of the object in question.

User will need to very fully understand code and test.

As Chip said, not for wimps. (Hilarious!)

Honestly, a broadcast to all cogs with mask is a powerful thing. I'm in favor of this. I was agnostic to the near cog LUT share.

kwinn · 2016-05-15 16:50

I started catching up on posts this morning until I came up on Chips post where he writes “That's the kind of thing that just needs to be designed around, I think. I don't see any point in putting hardware in that shouldn't ever be needed if programming was correct.” That made me wonder why that statement shouldn’t apply to LUT sharing as well? Sharing LUT’s between adjacent cogs provides such a large potential performance boost at such a very small logic cost that it seems to be too good to pass up in spite of the pitfalls. So I decided to start a bit further back in the posts and list the advantages and disadvantages of each approach. Here’s what I gleaned from the postings.

Choice 1 – Adjacent cogs share LUT 

Each cog(n) can read and write to it’s own LUT and half of the LUT in cog(n-1) and cog(n+1) 

Advantages: 
1. - Requires minimal additional logic.
2. - No latency
3. - High speed
4. - No interference with/from other cogs
5. - Multiple groups of adjacent cogs possible without affecting each other

Disadvantages:
1. - Not in line with Propeller philosophy?
2. - Needs adjacent cogs


Choice 2 – Cog writes to multiple cogs

Advantages:
1. - Consistent with Propeller philosophy
2. - No latency for single cog senders
3. - High speed for single cog senders
4. - Adjacent cogs not required
5. - One cog to many/any cog sending

Disavantages:
1. - More logic and 32+16+9+?  bit or’ed bus required (or is that 16*(32+16+9+?)
2. - Latency and additional instruction(s) required for multiple senders
3. - Lower speed and additional instruction(s) required for multiple senders
4. - Possibility of data/address corruption with multiple senders

If I had to chose one or the other it would be choice 1 for it’s higher peak speed and the ability to have multiple shared groups that do not interfere with each other or other cogs. The other concern that makes me prefer it over choice 2 is the potential debugging nightmare of having intermittent corruption of data at any address of any cog’s LUT. That does not mean I am opposed to choice 2. I think the two of them complement each other and having both would be a bonus.

Cluso99 · 2016-05-15 18:15

Thanks Kwinn. I think that's a good summary.

Roy Eltham · 2016-05-15 19:38

You guys are making it sound like interference from other cogs and the possibility of data/address corruption things are going to be common and have to be carefully worked around to make "safe" code. However, in real practical usage cases it's not something you even need to worry about much if at all.

Also, I'm not sure I agree with all your advantages for choice 1. For example #5 is BS, because it's true for Choice 2 also, and one could argue it's easier to accomplish in Choice 2. Two of your disadvantages for choice 2 are about a feature that isn't even possible in choice 1. Choice 1 does not have multiple senders to one LUT other than just the two sharing AND those two can conflict with each other and need to do something to avoid it to in those cases.

Choice 1 is not completely free of interference/collisions either. If both cogs write to the same LUT address as the same time, at best only 1 wins.

Besides, choice 2 wins hands down because of the number one reason, this is a Propeller chip.

Finally, you forgot the second main disadvantage of choice 1, COGNEW needs to be reworked or eliminated. I don't care if you can work out techniques to get 2 adjacent cogs without changing it, that doesn't solve it. You keep ignoring the issues with this that many have spelled out (including Chip).

Roy Eltham · 2016-05-15 19:40

Like seriously, Chip came up with a solution that gives you the low latency solution that fits with the Propeller philosophy that Chip wants, and all you can do is whine that it's not your solution.
Personally, I think this new solution is much better, and not just because it fits the Propeller philosophy.

potatohead · 2016-05-15 19:49

Seconded! We get a lot of new options for using cogs together.

Most importantly, it does address that test, respond case in a parallel way.

Go Chip!

dMajo · 2016-05-15 20:56

jmg wrote: »

I'm quite amazed those who were worried about the quite simple management of COGIDs (or a smarter COGNEW), are now fine with this hand-grenade in the hands of novices ?

jmg, I am OK with it, because forced. I prefer much more the previous LUT sharing, but seems Chip can't be convinced.. So it ends out that this is better than nothing.

Tubular wrote: »

So is it accurate to say that everything (address,data) just gets OR'd? Ie the LUT address that gets written is the OR'd result of what's being commanded by multiple other cogs (in case of a clash)?

If so that makes it essentially the same as output pins on the P1. Simple enough to follow, and has established precedent

Wrong:
- on P1 if cog1 sets pin1 and cog2 sets pin2, pin3 is not afected. If both cogs sets same pin the OR wins but nonother pin status is changed
- on P2, being the address OR'd means that if cog1 writes on LUT's address1 and cog2 writes to LUT's address2, at the same time, really address3(address1_OR_address2) is written with data1_OR_data2. So a completely un foreseen address' value is changed.

But as I said above, if this or nothing, then better this. But I pose the same question as jmg, how can for the purists of this forum this be better than adjacent LUT sharing. I can't understand.

David Betz · 2016-05-15 21:00

Can someone summarize what Chip has proposed? In particular, is this safe as long as two COGs don't try to write to the same COG's LUT at the same time? If COG1 writes to COG2's LUT and COG3 writes to COG4's lut at the same time, there is no conflict, correct?

evanh · 2016-05-15 21:08

Correct. And Chip is putting in a second instruction that is time-slotted as well, for those that do want to have multiple Cogs writing the same LUT.

David Betz · 2016-05-15 21:09

evanh wrote: »

Correct. And Chip is putting in a second instruction that is time-slotted as well, for those that do want to have multiple Cogs writing the same LUT.

But the time-slotted version won't be any faster than the hub will it?

evanh · 2016-05-15 21:15

The question over data corruption was always only about multiple Cogs writing to the same LUT on the same clock. It was not an intended use case but Chip has found an easy solution for even that.

evanh · 2016-05-15 21:16

David Betz wrote: »

But the time-slotted version won't be any faster than the hub will it?

It will have it's minor advantages but basically correct. It's only a bonus outcome.

Tubular · 2016-05-15 21:17

Yes dmajo this pin effect is true, but LUT is more harmless than pins. Perhaps it can even be taken advantage of in rare situations.

It's still early and this mechanism isn't yet final - lets give Chip a bit of time and space and see what he comes up with. He is right to worry about impact on technical support staff

Personally I'm fine with any of these schemes.

cgracey · 2016-05-15 21:17

Here's what I think should be done:

Any cog can write up to 16 cogs' LUTs at once via 16 x 42-bit AND-OR muxes (1write+9data+32data) which feed the 2nd port of each cog's LUT.

SETLUTX D/# 'set 16-bit write mask
WRLUTX D/#,S/# '2 clocks
WRLUTS D/#,S/# '2..16 clocks, no conflict possible for simultaneous writes when multiple cogs use this instruction to write a particular LUT.

An event/interrupt fires when a cog's LUT receives a write on its 2nd port.

dMajo · 2016-05-15 21:18

Roy Eltham wrote: »

Also, I'm not sure I agree with all your advantages for choice 1. For example #5 is BS, because it's true for Choice 2 also, and one could argue it's easier to accomplish in Choice 2. Two of your disadvantages for choice 2 are about a feature that isn't even possible in choice 1. Choice 1 does not have multiple senders to one LUT other than just the two sharing AND those two can conflict with each other and need to do something to avoid it to in those cases.

Wrong:
because cogN-1 and cogN+1 when both writing to cogN's LUT each one do it on its half visible LUT, 2 different areas.

evanh · 2016-05-15 21:22

An event for every write, that's cool.

jmg · 2016-05-15 21:23

David Betz wrote: »

evanh wrote: »

Correct. And Chip is putting in a second instruction that is time-slotted as well, for those that do want to have multiple Cogs writing the same LUT.

But the time-slotted version won't be any faster than the hub will it?

I think that is true, from a time-to-write viewpoint.
However, End to end it could be faster, as the target COG has immediate read, whilst any COG wanting to read new HUB info, has also to Slot-align.

I still want to see real numbers in a table, for end-to-end data delays for

** The new DAC-Data flows (there may be incremental gains possible there, with some overlay of SET and GET? )
** HUB paths
** Shared LUT
** Any-LUT (this latest one)

dMajo · 2016-05-15 21:24

David Betz wrote: »

Can someone summarize what Chip has proposed? In particular, is this safe as long as two COGs don't try to write to the same COG's LUT at the same time? If COG1 writes to COG2's LUT and COG3 writes to COG4's lut at the same time, there is no conflict, correct?

Don't know.
There is a mask that each cog can set to decide which other cog's lut to write (single and/or multiple luts).
But as I have understood this mask acts as a chip-select but there is one address/data bus.

It could mean that if cog1 writes lut2 at the same time cog8 writes lut10 both, lut 2 and 10, get written with OR'd address with OR'd data value

evanh · 2016-05-15 21:26

dMajo,
This was resolved something like 12 hours ago.

BTW: The forum software should display the timestamp for all posts.

dMajo · 2016-05-15 21:29

evanh wrote: »

dMajo,
This was resolved something like 12 hours ago.

BTW: The forum software should display the timestamp for all posts.

I am sorry, I've went through 2 days of posts now ... perhaps I didn't notice this.

evanh · 2016-05-15 21:32

Here's Chip's new summary:

cgracey wrote: »

Any cog can write up to 16 cogs' LUTs at once via 16 x 42-bit AND-OR muxes (1write+9data+32data) which feed the 2nd port of each cog's LUT.

SETLUTX D/# 'set 16-bit write mask
WRLUTX D/#,S/# '2 clocks
WRLUTS D/#,S/# '2..16 clocks, no conflict possible for simultaneous writes when multiple cogs use this instruction to write a particular LUT.

An event/interrupt fires when a cog's LUT receives a write on its 2nd port.

You get two write instructions to choose from. If you think you may be sharing with other writers then use WRLUTS. If you're sure it's one-on-one then you can use the fast WRLUTX.

David Betz · 2016-05-15 21:32

jmg wrote: »

David Betz wrote: »

evanh wrote: »

Correct. And Chip is putting in a second instruction that is time-slotted as well, for those that do want to have multiple Cogs writing the same LUT.

But the time-slotted version won't be any faster than the hub will it?

I think that is true, from a time-to-write viewpoint.
However, End to end it could be faster, as the target COG has immediate read, whilst any COG wanting to read new HUB info, has also to Slot-align.

I still want to see real numbers in a table, for end-to-end data delays for

** The new DAC-Data flows (there may be incremental gains possible there, with some overlay of SET and GET? )
** HUB paths
** Shared LUT
** Any-LUT (this latest one)

Ah, good point. The reads would certainly be faster.

Is LUT sharing between adjacent cogs very important?

Comments