Is LUT sharing between adjacent cogs very important?

T Chap · 2016-05-16 03:16

If it had one extra bit for status when writing, you could write a high on the status bit when writing, all others could check for the high bit and wait before writing. The bit would automatically or manually reset after the write. This may cause a slight delay but would avoid errors.

Spin Example

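if LUTLock == 0      ' status bit clear: nobody else is writing
  LUTLock~~          ' set the status bit before writing
  WriteLUT           ' placeholder for the actual LUT write
  LUTLock~           ' clear the status bit after the write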

jmg · 2016-05-16 03:18

kwinn wrote: »

Sounds more like a single cog can write to one or more (up to 16) LUT's at once, which to me implies a bus consisting of 16 bits of LUT selects, 9 bits of LUT address, and 32 bits of data. IOW one cog sends and 1..16 LUT's receive. Multiple cogs sending results in garbage in. Also means writes to all the selected LUT's are at the same address location.

Not quite, it is rather larger than that - Chip says
AND-OR, as 16 x (1w+9Adr+32Data) = 672 total bus lines (this thing is large!)
- the 1w enables any one of those BUSes into the OR* gates that merge onto that LUT.
The conflict occurs only when multiple sources target a single LUT on the same SysCLK. Then the simplicity of the OR-merge bites, and you get a result you might not have expected, placed where you did not ask.
(unless you are the special case above, of careful and deliberate same-clk merge to one address in one LUT )

* The OR gates can be 15w or 16w, 16w is not essential, but may help with device testing, as it allows this path to also write-to-self.
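
To make the AND-OR arrangement concrete, here is a minimal Python sketch of one LUT's merge port. The field widths (1 write-enable, 9 address, 32 data) and the 16-cog count come from the thread; `merge_port` and the other names are made up for illustration and say nothing about how the Verilog is actually written.

```python
# Sketch of the AND-OR merge feeding one LUT's second port.
# Widths are from the thread; all names here are hypothetical.
ADDR_BITS, DATA_BITS, NUM_COGS = 9, 32, 16

BUS_WIDTH = 1 + ADDR_BITS + DATA_BITS      # lines in one cog-to-LUT bus
TOTAL_LINES = NUM_COGS * BUS_WIDTH         # the "16 x (1w+9Adr+32Data)" figure


def merge_port(buses):
    """AND each bus with its own write-enable, then OR the survivors.

    `buses` is a list of (write_enable, address, data) tuples, one per cog.
    Returns (any_write, address, data) as the LUT would see them this clock.
    """
    any_write, addr, data = 0, 0, 0
    for we, a, d in buses:
        if we:                                   # AND stage: gated by write-enable
            any_write = 1
            addr |= a & ((1 << ADDR_BITS) - 1)   # OR stage, address lines
            data |= d & ((1 << DATA_BITS) - 1)   # OR stage, data lines
    return any_write, addr, data


idle = (0, 0x1FF, 0xFFFFFFFF)   # enable low, so these bits never reach the LUT

# One writer: the LUT sees exactly what that cog sent.
single = merge_port([(1, 0x040, 0x12345678)] + [idle] * 15)

# Deliberate same-clock merge: two cogs, same address, disjoint data fields.
merged = merge_port([(1, 0x040, 0x0000FF00), (1, 0x040, 0x000000FF)] + [idle] * 14)

print(BUS_WIDTH, TOTAL_LINES)
print(single, merged)
```

The last case is the "careful and deliberate" special case: because both cogs use the same address, the OR only combines the data fields.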

kwinn · 2016-05-16 03:22

T Chap wrote: »

If it had one extra bit for status when writing, you could write a high on the status bit when writing, all others could check for the high bit and wait before writing. The bit would automatically or manually reset after the write. This may cause a slight delay but would avoid errors.

Chip's suggestion of using the lower bits of the system counter would work as well although it may slow writes slightly.

cgracey · 2016-05-16 03:25

The 42-bit mux is on the 2nd port of each LUT. Each mux sees all WRLUTX/S activity from all cogs. That's why there is no conflict when different cogs write different LUTs simultaneously. The only conflict is when multiple cogs attempt to write a particular LUT on the same clock cycle. The WRLUTS instruction gets around that by singulating each WRLUTS in time according to the cog number (cog id == cnt[3:0]).
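
The time-slot idea behind WRLUTS can be sketched in a few lines of Python. This assumes only what is stated above (a cog's write is released when cnt[3:0] equals its cog id); the function name and the idea of counting the wait in whole clocks are assumptions, and real instruction timing is not modelled.

```python
# Sketch of WRLUTS-style time slots, assuming a write is released
# on the clock where cnt[3:0] == cog id. Names are hypothetical.
NUM_COGS = 16


def wrluts_wait(cnt, cog_id):
    """Clocks to wait from system counter value `cnt` until cnt[3:0] == cog_id."""
    return (cog_id - cnt) & 0xF


cnt = 0x1234   # all 16 cogs issue a slotted write at this counter value

# Clock on which each cog's write is actually released.
release = {cog: cnt + wrluts_wait(cnt, cog) for cog in range(NUM_COGS)}

# No two cogs share a release clock, so writes to one LUT cannot collide.
distinct = len(set(release.values())) == NUM_COGS

# In this model the longest wait is 15 clocks.
worst = max(wrluts_wait(cnt, cog) for cog in range(NUM_COGS))

print(release[4] - cnt, worst, distinct)
```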

jmg · 2016-05-16 03:28

T Chap wrote: »

If it had one extra bit for status when writing, you could write a high on the status bit when writing, all others could check for the high bit and wait before writing. The bit would automatically or manually reset after the write. This may cause a slight delay but would avoid errors.

There are various hardware schemes that could avoid same-LUT, same-cycle errors.

Chip has done a simple one that matches the counter LSBs against the cog number, for an added delay of up to 16 sysclks.

Or, you could use a priority encoder and MUX, on address only, or on address + data which would have one winner.
Others could then queue, to the next SysCLK. This swaps a 15w OR gate for a MUX, and Encoder, and some queue logic, so it gets even larger and slower...
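
For comparison, here is a rough Python sketch of the priority-encoder alternative: one winner per clock, everyone else queues to the next SysCLK. The "lowest cog number wins" rule is an assumption made for illustration (the post does not say which cog gets priority), and all names are hypothetical.

```python
# Sketch of a priority-encoder arbiter on one LUT's write port.
# Assumption: the lowest-numbered requesting cog wins each clock.
def priority_encode(request_mask):
    """Return the lowest-numbered cog with its request bit set, or None."""
    for cog in range(16):
        if (request_mask >> cog) & 1:
            return cog
    return None


def drain(requests):
    """Serve same-clock requests one per clock.

    `requests` maps cog number -> (address, data).
    Returns a list of (clock, cog, address, data) in service order.
    """
    pending = dict(requests)
    schedule = []
    clock = 0
    while pending:
        mask = sum(1 << cog for cog in pending)   # who is still asking
        winner = priority_encode(mask)
        addr, data = pending.pop(winner)          # winner goes through the MUX
        schedule.append((clock, winner, addr, data))
        clock += 1                                # losers queue to the next clock
    return schedule


schedule = drain({5: (0x010, 0xAAAA), 2: (0x020, 0x5555), 9: (0x030, 0x1234)})
print(schedule)
```

Each write arrives intact, at the cost of the queue logic and a variable delay, which is the size and speed trade-off described above.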

cgracey · 2016-05-16 03:31

jmg wrote: »

cgracey wrote: »

Seairth wrote: »

@cgracey, were you able to change the attention events to support the 16 inputs, as I suggested earlier? This could be used along with WRLUT* to ACK. If you keep the attention event a single ORed input, then the writer cannot determine who is ACKing.

Your understanding is correct.

About knowing who caused the attention (or LUT write): how do we parse the who information? I could see it being rapidly overwritten and lost. I think it's unrealistic to know who did something, but realistic to know that someone did something.

What is the status of the separate attention / wait flags between COGs?
Are they still there ? (or did they swallow into this new Any-LUT feature ?)

The attention is still there.

I will repurpose the RLE (read LUT end) event to track 2nd-port LUT writes.

kwinn · 2016-05-16 03:45

jmg wrote: »

kwinn wrote: »

Sounds more like a single cog can write to one or more (up to 16) LUT's at once, which to me implies a bus consisting of 16 bits of LUT selects, 9 bits of LUT address, and 32 bits of data. IOW one cog sends and 1..16 LUT's receive. Multiple cogs sending results in garbage in. Also means writes to all the selected LUT's are at the same address location.

Not quite, it is rather larger than that - Chip says
AND-OR, as 16 x (1w+9Adr+32Data) = 672 total bus lines (this thing is large!)
- the 1w enables any one of those BUSes into the OR* gates that merge onto that LUT.
The conflict occurs only when multiple sources target a single LUT on the same SysCLK. Then the simplicity of the OR-merge bites, and you get a result you might not have expected, placed where you did not ask.
(unless you are the special case above, of careful and deliberate same-clk merge to one address in one LUT )

* The OR gates can be 15w or 16w, 16w is not essential, but may help with device testing, as it allows this path to also write-to-self.

Thanks for the clarification jmg, it clears up a lot of questions. That is going to be a great boost for the P2.

Cluso99 · 2016-05-16 04:30

Whoa!!!
I must be going insane.

It's too difficult to allocate cog pairs for shared LUT at the start of your program, but there can be no corruption ..
But it's ok to have a corruption to an unknown LUT address if two cogs write simultaneously to the same cog's LUT. And for this we use a huge piece of die space.

Chip,
Without the latest idea, is there enough die space for another 128KB or 256KB of hub ram?

MJB · 2016-05-16 05:29

Just waking up I have this thought ...
and before reading all of the night's new messages ... :-)

wouldn't all the problems with overwriting and corrupting be gone if the whole new bus system was READ instead of WRITE?

If any COG could read any other LUT ???

Any number of helper COGs can fetch their data from the data acquisition COGs LUT ...
CRC-COG reading the USB-COGs LUT buffer ...

edit: and messaging done via the signaling mechanisms.

edit2: for data flow computing the writing version might be the more appropriate one.
but both need the signalling as well or polling (local or remote memory)

Brian Fairchild · 2016-05-16 05:43

There's lots of talk that the tools could do this or the tools could do that. Who's going to write those tools? The ideas proposed don't strike me as typical GCC back-end customisations.

cgracey · 2016-05-16 06:55

Cluso99 wrote: »

Whoa!!!
I must be going insane.

It's too difficult to allocate cog pairs for shared LUT at the start of your program, but there can be no corruption ..
But it's ok to have a corruption to an unknown LUT address if two cogs write simultaneously to the same cog's LUT. And for this we use a huge piece of die space.

Chip,
Without the latest idea, is there enough die space for another 128KB or 256KB of hub ram?

That OR'd address/data problem can be avoided by using the WRLUTS instruction, in cases where multiple cogs are writing the same LUT at unknown (possibly identical) times. You, as the programmer, will know if you need to use WRLUTS or if you can get by with the faster 2-clock WRLUTX. The die space is not huge for this. And we may even be able to fit another 128KB or 256KB of RAM. We'll see.

So, the alarmism over bad data going to bad LUT locations is not necessary. That problem has been solved. Also, there is exactly enough 'D/#,S/#' instruction space left to implement WRLUTX/WRLUTS. We even have an event slot available to capture these LUT writes. What more do we need?

jmg · 2016-05-16 07:15

MJB wrote: »

wouldn't all the problems with overwriting and corrupting be gone if the whole new bus system was READ instead of WRITE?

If any COG could read any other LUT ???

It does not eliminate the problem entirely, but it does change the personality.

The Address & Data can still OR to a bad location, and/or Value, but now that Bad-Addr location is not written to, merely read.
The running program gets bad data, but now some unexpected memory location is not clobbered.

Wide-read could be quite useful for Debug and Trace.

Logic wise, I think there is no difference.

The niche Write-or-merge feature someone mentioned above, would be removed.

I think Chip said Read was slower than write ? - but I guess there is a write+read pair in most param passing.

Heater. · 2016-05-16 10:33

cgracey,

...we may even be able to fit another 128KB or 256KB of RAM.

Oh what. Javascript on the Prop here we come!

Or Lua. or .....

evanh · 2016-05-16 10:34

MJB wrote: »

Just waking up I have this thought ...
and before reading all of the night's new messages ... :-)

wouldn't all the problems with overwriting and corrupting be gone if the whole new bus system was READ instead of WRITE?

The supposed problem was resolved long ago - http://forums.parallax.com/discussion/comment/1375972/#Comment_1375972 - Chip added a second instruction, WRLUTS, just for those silly enough to be trying to write a single LUT from multiple Cogs.

evanh · 2016-05-16 10:37

A full two pages back in fact. Along with repeated reposts and people being corrected.

Rayman · 2016-05-16 10:40

Since LUT is executable, this would let one cog directly change the code of another...

Instead of just passing arguments, one could directly change the variables in a running code, right?

David Betz · 2016-05-16 10:56

I see mention of how many gates this feature will cost but what about routing? Seems like there are wide busses that have to be routed to every COG. Is that likely to be a problem?

Rayman · 2016-05-16 11:07

I think the only problem is that it won't all fit at once in A9 fpga...

JRetSapDoog · 2016-05-16 11:09

If I've got it correct, for the (1W, 9A, 32D = 42 lines) x 16 instances configuration, the 1W part really ties back to the SETLUTX mask(s). Also, for any particular cog, it obviously "sends" the same 9 address lines to all 16 LUT's and the same 32 data lines to all 16 LUT's, but it sends 1 distinct write line to each of the 16 LUT's. Then there are 16 instances of the whole thing, so each cog individually connects 9A and 32D (unique) lines to each LUT. So, yeah, each LUT connects with 16 sets of 9 address lines (one set from each of the 16 cogs) for a total of 144 address lines (per LUT), but they get bit-wise OR'ed together (if the address lines first make it through the AND gates "controlled by" the 16 write lines entering each LUT) to form the final 9 address lines. And the same is true of the data lines for each LUT: 512 incoming lines get bitwise OR'ed (after AND'ing) down to 32 data lines for any one cog's LUT.

Although I'm tempted to think of the write lines as kind of an address extension for the collective LUT space (all 16 of them thought of as being together), it doesn't work that way...because each cog can write any combination of the 16 LUT's at once. So, each cog has 16 distinct write lines (one to each LUT (including its own LUT)) configured by SETLUTX, not a 4-bit LUT-selection bus that would only allow writing one LUT at a time. Yep, there are 65,536 possible (2^16) combinations of LUT's any one cog could be writing at one time (but that number is kind of meaningless, as the coder knows what set combinations to focus on).

@Chip, I'm wondering if there could be a register for each cog's LUT that controls which cogs are allowed to write to it. These registers could be AND'ed with the write lines (the gatekeepers, so to speak). Some instruction could set this 16-bit register up in advance. That is, set it and forget it. For any one cog, the default value could be to allow all cogs to write to its LUT (i.e., an allow mask of $FFFF), that way one wouldn't even need to set it unless protection was desired. With such a control mechanism, for example, four cogs working together for video (or whatever) could configure themselves to only be writable by those four cogs, thus protecting (or sandboxing) themselves from inadvertent writes from other cogs. [If there's not room in the instruction space for an ALLOWLUT mask configuration instruction, perhaps the first call to SETLUTX could be used, though that would force one to set it up and it couldn't be changed without a cog reset.]
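
A small Python sketch of the proposed allow mask, just to show the gating. ALLOWLUT is only a suggestion in this post, not an existing instruction, and the function and constant names below are hypothetical.

```python
# Sketch of the proposed per-LUT allow mask (not an existing feature).
DEFAULT_ALLOW = 0xFFFF          # proposed default: every cog may write


def gated_write_lines(write_lines, allow_mask=DEFAULT_ALLOW):
    """AND the 16 incoming write lines with this LUT's allow mask.

    Bit n of `write_lines` is set when cog n is writing this LUT on this
    clock. Only the bits that survive the AND reach the address/data gates.
    """
    return write_lines & allow_mask & 0xFFFF


# Four cooperating video cogs (4..7) accept writes only from each other.
video_allow = 0b0000_0000_1111_0000

# Cog 5 (inside the group) and cog 2 (a stray) write on the same clock.
incoming = (1 << 5) | (1 << 2)

print(bin(gated_write_lines(incoming, video_allow)))   # only cog 5 survives
print(bin(gated_write_lines(incoming)))                # default: both pass
```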

Changing gears, Electrodude mentioned that, "Someone might come up with some clever reason to have one cog set the low bits and the other set the high bits of the address going to another cog's LUT." Chip referred to it as "bit field writing." In a similar way, cooperating cogs might intentionally perform similar manipulation of the address bits. Whoops! I thought he said "data," not "address." Anyway, one cog could attempt to write one address in a LUT (or set of LUT's) while another cog could try to write a different address in the same LUT(s). This would either resolve to one of those two addresses or a higher address, depending on the specific addresses being OR'ed. Two (or more) cogs could "constructively interfere" with each other by design (in some admittedly difficult to comment code). For example, in pseudocode, write to address x only if cog n isn't attempting to write my LUT somewhere else, otherwise write at the OR'ed address. Don't ask me for a use, but I also wonder if some kind of filtering would be possible (without having any specific idea about what I mean by "filtering"). Anyway, something this radical (dangerous) must have the occasional use that saves the day timing-wise.
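
The claim that two OR'ed addresses resolve to one of the two or to a higher one is easy to check by brute force. A quick Python sketch over all 9-bit address pairs (names are hypothetical):

```python
# Brute-force check: OR of two 9-bit LUT addresses is never lower than either.
ADDR_MASK = 0x1FF


def merged_address(a, b):
    """Address the LUT sees when two cogs drive addresses a and b together."""
    return (a | b) & ADDR_MASK


never_lower = all(
    merged_address(a, b) >= max(a, b)
    for a in range(512)
    for b in range(512)
)

# Disjoint bits: the result is a third, higher address.
third = merged_address(0x010, 0x003)

# One address's bits contain the other's: the result equals the larger one.
same = merged_address(0x013, 0x003)

print(never_lower, hex(third), hex(same))
```

So the "low bits from one cog, high bits from another" trick works whenever the two cogs drive disjoint bit fields of the address.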

evanh · 2016-05-16 11:20

Rayman wrote: »

I think the only problem is that it won't all fit at once in A9 fpga...

Yeah, Chip has been feeling a bit lavish since the increased die size. The bus length must take a good chunk though.

I was pondering the current kitchen sink arrangement of the Prop2 at work today. Mentally comparing it to the Prop1 with its two (Hub and Pins) physical databus links per Cog. Noting the now five physical databuses per Prop2 Cog - Hub, Pins, SmartPins, DACs, and now LUTs.

There's a lot of idle time coming for all those buses.

David Betz · 2016-05-16 11:24

Rayman wrote: »

I think the only problem is that it won't all fit at once in A9 fpga...

That doesn't seem like a big problem. As long as enough of the design will fit to allow testing then that shouldn't be an issue.

David Betz · 2016-05-16 11:25

evanh wrote: »

Rayman wrote: »

I think the only problem is that it won't all fit at once in A9 fpga...

Yeah, Chip has been feeling a bit lavish since the increased die size. The bus length must take a good chunk though.

I was pondering the current kitchen sink arrangement of the Prop2 at work today. Mentally comparing it to the Prop1 with its two (Hub and Pins) physical databus links per Cog. Noting the now five physical databuses per Prop2 Cog - Hub, Pins, SmartPins, DACs, and now LUTs.

There's a lot of idle time coming for all those buses.

Yeah, P2 has lost some of the elegance and simplicity of P1. I suppose that was inevitable.

evanh · 2016-05-16 11:25

As was noted in that article Chip found, though, dark silicon isn't necessarily a bad thing.

EDIT: Found that article - http://yosefk.com/blog/the-bright-side-of-dark-silicon.html

JRetSapDoog · 2016-05-16 11:38

David Betz wrote: »

I see mention of how many gates this feature will cost but what about routing? Seems like there are wide busses that have to be routed to every COG. Is that likely to be a problem?

Good point. However, it seems that Chip isn't worried. Perhaps some of that routing is filling in wasted space or is on a less-utilized chip layer. Anyway, if this were just allowing writing (and reading) a long for each cog as I once proposed, it would be a lot of cost for the gain (particularly due to the routing requirements of my idea), as folks mentioned. But with the current scheme, as we know, it allows writing any LUT address of another cog (not just a single long per cog outside the cog space as I asked about), and not just for one cog's LUT but any combination of LUT's. So there's a lot more "bang for the buck" than what I was trying to propose. Still, you may be right that this scheme with its higher logic and routing demands could somehow run us into (unexpected) problems, so it's a reasonable question to consider.

kwinn · 2016-05-16 12:04

cgracey wrote: »

The 42-bit mux is on the 2nd port of each LUT. Each mux sees all WRLUTX/S activity from all cogs. That's why there is no conflict when different cogs write different LUTs simultaneously. The only conflict is when multiple cogs attempt to write a particular LUT on the same clock cycle. The WRLUTS instruction gets around that by singulating each WRLUTS in time according to the cog number (cog id == cnt[3:0]).

So are the following guesses correct?

1 - Each cog has a 57 bit output bus (16 LUT selects, 9 LUT address bits, and 32 data bits)

2 - The LUT address and data bits along with one of the 16 LUT select bits from each cog connects to one of the 16 inputs on the LUT mux

Rayman · 2016-05-16 13:36

David Betz wrote: »

Rayman wrote: »

I think the only problem is that it won't all fit at once in A9 fpga...

That doesn't seem like a big problem. As long as enough of the design will fit to allow testing then that shouldn't be an issue.

If it were me, I'd want to be able to test the full design. Otherwise, you won't know 100% that it is going to work. And, what if it doesn't work? Then, you are really in trouble...

David Betz · 2016-05-16 13:46

Rayman wrote: »

David Betz wrote: »

Rayman wrote: »

I think the only problem is that it won't all fit at once in A9 fpga...

That doesn't seem like a big problem. As long as enough of the design will fit to allow testing then that shouldn't be an issue.

If it were me, I'd want to be able to test the full design. Otherwise, you won't know 100% that it is going to work. And, what if it doesn't work? Then, you are really in trouble...

Well, there are lots of replicated parts in the design. Is it really necessary to test every instance of each of them? On the other hand, there could be corner cases that only come up with the full design. It would be better to test everything but maybe not at the expense of having to leave out features. Is there an FPGA that is bigger than the A9 that Parallax could use to validate the complete design even if we forum members can not?

User Name · 2016-05-16 14:14

I am coming to this discussion late and I can't possibly read 600 rambling posts right now, but it sounds like LUT sharing is out. On the P1 I made extensive use of the i/o ports to pass data from one cog to another when speed and precise timing were critical. What is the equivalent to this on the P2? IIRC, smart pins on the P2 don't work the same. So what shared register is there among cogs, which won't require the associated delays and indeterminacy of hub?

Electrodude · 2016-05-16 14:29

User Name wrote: »

I am coming to this discussion late and I can't possibly read 600 rambling posts right now, but it sounds like LUT sharing is out. On the P1 I made extensive use of the i/o ports to pass data from one cog to another when speed and precise timing were critical. What is the equivalent to this on the P2? IIRC, smart pins on the P2 don't work the same. So what shared register is there among cogs, which won't require the associated delays and indeterminacy of hub?

LUT sharing was replaced by a mechanism by which any cog can write, but not read, any combination of other cogs' LUTs in a normal two clock instruction. It's writing and not reading, since writing doesn't need to wait to get the data back and so is faster. A SETLUTX instruction sets the mask of which other cogs get their LUTs written, and a WRLUTX instruction writes the same value to the same address of all selected LUTs. Multiple cogs can do this at once - cog 0 could write some location of the LUTs belonging to cogs 1, 3, and 9 at the same time that cog 3 writes some location of the LUTs belonging to cogs 2, 7, 12, and 14 without any conflicts. Cogs get an event whenever their LUT is written by another cog. This is better than the previous LUT sharing because the relative cog IDs don't matter.

If two or more cogs try writing the same cog's LUT at the same time, the 9 address lines and the 32 data lines get OR'd together. The resulting value gets written to the resulting address. If one cog writes %10 to address %10 of cog N's LUT, and another writes %01 to address %01 of cog N's LUT, the end result will be that %11 gets written to address %11 of cog N's LUT. To help prevent this from happening, Chip added a separate WRLUTS instruction that waits for the bottom 4 bits of the system clock to match the writing cog's ID.
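
Both cases can be played through in a toy Python model of one system clock. This is only a sketch of the behaviour described here: 16 LUTs of 512 longs, a SETLUTX-style target mask per writer, and OR'd address/data when writers collide. `clock` and `mask_of` are made-up names, and instruction timing is ignored.

```python
# Toy model of one clock of WRLUTX-style writes across 16 LUTs.
NUM_COGS, LUT_SIZE = 16, 512


def mask_of(*cogs):
    """Build a SETLUTX-style target mask from a list of cog numbers."""
    return sum(1 << c for c in cogs)


def clock(luts, writes):
    """Apply all writes issued on the same clock.

    `writes` is a list of (target_mask, address, data), one per writing cog.
    For each target LUT, every writer selecting it is OR'd together.
    """
    for target in range(NUM_COGS):
        hit, addr, data = False, 0, 0
        for mask, a, d in writes:
            if (mask >> target) & 1:
                hit = True
                addr |= a
                data |= d
        if hit:
            luts[target][addr & (LUT_SIZE - 1)] = data & 0xFFFFFFFF


luts = [[0] * LUT_SIZE for _ in range(NUM_COGS)]

# Case 1: cog 0 -> LUTs 1, 3, 9 and cog 3 -> LUTs 2, 7, 12, 14. No overlap.
clock(luts, [(mask_of(1, 3, 9), 0x010, 0xAAAA),
             (mask_of(2, 7, 12, 14), 0x020, 0x5555)])

# Case 2: two cogs hit LUT 5 on the same clock with %10@%10 and %01@%01.
clock(luts, [(mask_of(5), 0b10, 0b10),
             (mask_of(5), 0b01, 0b01)])

print(hex(luts[1][0x010]), hex(luts[14][0x020]))   # both writes land intact
print(luts[5][0b11], luts[5][0b10], luts[5][0b01])  # %11 lands at %11 only
```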

It will fit in silicon, but not in the FPGA, so Chip will probably have to take out some smartpins for the FPGA.

dMajo · 2016-05-16 14:40

Roy Eltham wrote: »

dMajo wrote: »

Roy Eltham wrote: »

You can have cog1 write to cog2's LUT, and cog2 write to cog1's LUT at the same time with no conflict, and any other combination of cog's writing to other cog's LUTs all at the same time as long as there isn't any one LUT being targeted on the same sysclk by more than one other cog.

So you are saying that two obex objects, using COGNEW for 2/3 cogs and then SETLUTX based upon the received CogID, can use WRLUTX safely without being aware of each other? Of course, making sure there are no conflicts inside an object is the programmer's responsibility.

No, I'm just saying that you have to take care in programming it because you can have a conflict, so it's not right to say it has no possibility of conflict. I didn't read the advantage/disadvantage thing as strictly regarding obex stuff, or separate objects working at the same time, but just the overall advantage/disadvantage.

Roy,
Let's say object A starts two helper cogs that use LUT writes between them, and object B does the same with its own set of cogs and LUT sharing.
Now each object's programmer is aware of his own code but not of how the object will be used.
A third entity, the user, can pick the two objects and use them together. Can the two objects corrupt each other's LUT write operations (if each object manages LUT writes only within its own cogs)?

I asked because many people argued against shared LUT on the grounds of COGNEW atomicity, object isolation and bla bla bla. IMHO writing to and corrupting unplanned addresses is much worse, especially because it can potentially mean changing another cog's code, which in the extreme can even result in death (depending on how and where the P2 is employed).
But Chip has clarified this once and for all:

cgracey wrote: »

The 42-bit mux is on the 2nd port of each LUT. Each mux sees all WRLUTX/S activity from all cogs. That's why there is no conflict when different cogs write different LUTs simultaneously. The only conflict is when multiple cogs attempt to write a particular LUT on the same clock cycle. The WRLUTS instruction gets around that by singulating each WRLUTS in time according to the cog number (cog id == cnt[3:0]).

@cgracey: what happens when WRLUTS is used? Is the cog held for the necessary sysclock cycles, or is only the write delayed while the cog continues with the next instruction in the meantime?

Rayman wrote: »

Since LUT is executable, this would let one cog directly change the code of another...

Instead of just passing arguments, one could directly change the variables in a running code, right?

As far as I remember you can't change variables/data values, you can only change opcodes. The LUT executes opcodes but all the data is taken from the cog's RAM/registers; otherwise the LUT would also need to be 4-port.
