Is LUT sharing between adjacent cogs very important?

cgracey · 2016-05-15 03:34

Tubular wrote: »

sounds interesting, Chip. Being write-only simplifies things in a good way, I think

It'd be good for getting data split out to multiple cogs to act on 'in a jiffy'

It feels simpler to me, too. It simplifies related event(s), as well.

jmg · 2016-05-15 03:39

cgracey wrote: »

How expensive? It would be on the order of the hub egg-beater. That takes about 5,500 ALMs.

There is another option for half the price: Make LUT sharing write-only, but with 16-bit LUT mask selection. That way, you could write any or all of the other cogs' LUT's at once, from any cog. Each cog's LUT's 2nd port would use an AND-OR mux to bring in writes/address/data from other cogs. For you to receive feedback from any other cog, the other cog would write your LUT in a complementary configuration. This would allow cog-number agnosticism AND writes to multiple cogs' LUT's simultaneously. This would take only 2 clocks, because there is no read feedback. Also, it would obviate the cog attention mechanism, so we could get rid of it. Actually, this would be the same as the attention mechanism, but with 9 bits of address and 32 bits of data added.

So, like my suggestion of adding bits to Signaling, taken to an extreme...

It could still be nice to have simple-to-understand opcodes for inter-cog signaling flags ?

This sounds highly flexible, but also rather costly in Logic, and maybe speed.
Can this be done and not affect the ClkMAX ?
What is the RAM-Equivalent of the ALMs + routing this requires ?

cgracey wrote: »
You'd do something like this:
	SETLUTX	##$FFFF			'ready to write all LUTs
	WRLUTX	data,address		'write to all LUTs

How do (eg) 3 COGS write, in the same SysCLK time-slot ?
This needs a 16x32 BUS for data + 16 x 9 for addr ?

evanh · 2016-05-15 03:44

Oops, I just realised I assumed the wrong thing again.

Chip,
How much logic was it for the earlier Cog-to-Cog LUT sharing? You don't have anywhere close to the A9 resources left to do this do you?

evanh · 2016-05-15 03:48

jmg wrote: »

How do (eg) 3 COGS write, in the same SysCLK time-slot ?

No conflict occurs but if the software does that then it'll be garbage data in the affected LUTs. Same as multiple Cogs twiddling the pins. That's one of the scary parts.

jmg · 2016-05-15 03:55

evanh wrote: »

jmg wrote: »

How do (eg) 3 COGS write, in the same SysCLK time-slot ?

No conflict occurs but if the software does that then it'll be garbage data in the affected LUTs. Same as multiple Cogs twiddling the pins. That's one of the scary parts.

Sounds like a conflict to me...
I was meaning 3 COGS writing to COG0.LUT, Adr 1,2,3 - Separate address values, but same time slot

evanh · 2016-05-15 04:07

jmg wrote: »

I was meaning 3 COGS writing to COG0.LUT, Adr 1,2,3 - Separate address values, but same time slot

Ah, yeah, that'll probably garbage up too. Not an issue at all for the expected one-on-one setup. It's only a problem when one wants to have multiple writers to the same LUT. Such software will have to be crafted to suit.

It fixes the Cog allocation issue while providing even more than anyone hoped for.

JRetSapDoog · 2016-05-15 04:08

cgracey wrote: »

There is another option for half the price: Make LUT sharing write-only, but with 16-bit LUT mask selection. That way, you could write any or all of the other cogs' LUT's at once, from any cog. Each cog's LUT's 2nd port would use an AND-OR mux to bring in writes/address/data from other cogs. For you to receive feedback from any other cog, the other cog would write your LUT in a complementary configuration. This would allow cog-number agnosticism AND writes to multiple cogs' LUT's simultaneously.

I've got to say that I'm lovin' that option/configuration! It really integrates the cogs and does so in an easy-to-comprehend orthogonal way. I like sentences with "any," "any or all," "at once" and "multiple...simultaneously."

Sure, it forces other cogs to listen, but in the special LUT space, not regular cog code space. And I suppose some kind of arbitration mechanism could be coded to establish flags to a mask (or masks) telling other cogs whether you wanted to be forced to listen or not, and such a mechanism could be sent at slow speed. Guess that would be high-level coding, unless there's something that could be done in hardware to facilitate that. Is there?

For something as powerful as this, it seems worth it to give up being able to compile the full P2 into an A9.

cgracey · 2016-05-15 04:11

evanh wrote: »

Oops, I just realised I assumed the wrong thing again.

Chip,
How much logic was it for the earlier Cog-to-Cog LUT sharing? You don't have anywhere close to the A9 resources left to do this do you?

It was very little.

We can reduce the number of smart pins on the A9 to accommodate cog-number-agnostic LUT sharing, though. Each smart pin takes ~400 ALMs.

evanh · 2016-05-15 04:12

Okay, that's about eight of 'em. Smart pins do add up, don't they?

cgracey · 2016-05-15 04:13

jmg wrote: »

evanh wrote: »

jmg wrote: »

How do (eg) 3 COGS write, in the same SysCLK time-slot ?

No conflict occurs but if the software does that then it'll be garbage data in the affected LUTs. Same as multiple Cogs twiddling the pins. That's one of the scary parts.

Sounds like a conflict to me...
I was meaning 3 COGS writing to COG0.LUT, Adr 1,2,3 - Separate address values, but same time slot

Multiple cogs could only write a particular LUT per cycle.

cgracey · 2016-05-15 04:13

evanh wrote: »

jmg wrote: »

How do (eg) 3 COGS write, in the same SysCLK time-slot ?

No conflict occurs but if the software does that then it'll be garbage data in the affected LUTs. Same as multiple Cogs twiddling the pins. That's one of the scary parts.

That's right. Garbled data.

evanh · 2016-05-15 04:21

cgracey wrote: »

jmg wrote: »

I was meaning 3 COGS writing to COG0.LUT, Adr 1,2,3 - Separate address values, but same time slot

Multiple cogs could only write a particular LUT per cycle.

So that wouldn't garble?

Electrodude · 2016-05-15 04:21

cgracey wrote: »

evanh wrote: »

jmg wrote: »

How do (eg) 3 COGS write, in the same SysCLK time-slot ?

No conflict occurs but if the software does that then it'll be garbage data in the affected LUTs. Same as multiple Cogs twiddling the pins. That's one of the scary parts.

That's right. Garbled data.

I'm guessing all the data and all the address lines get OR'd together when this happens, or what? Could situations like this somehow be detected and a flag be set?

Heater. · 2016-05-15 04:23

Chip,

Make LUT sharing write-only, but with 16-bit LUT mask selection. That way, you could write any or all of the other cogs' LUT's at once, from any cog.

Wow, on the face of it this sounds sweet. I'm not up to speed on LUT's, or any P2 feature details nowadays, so can I summarise this for myself as:

1) Each LUT can be used as a unidirectional "channel" between a COG and any other COG.
2) A bidirectional channel between any two COGS can be created by using both their LUTS as above.
3) Broadcasting can be done from any COG to any other set of COGs in a cooperating group.
4) No restrictions on which COGs are interacting this way. COGNEW still works.

Is this really practical?
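To make the four points above concrete, here is a minimal Python sketch of the proposed mask-selected, write-only LUT sharing. It is only a toy model, not Chip's hardware: the class name `LutFabric` and its `write` method are made up for illustration, and the 512-long LUT size is assumed from the "9 bits of address" mentioned in the proposal.

```python
NUM_COGS = 16
LUT_SIZE = 512  # assumed from the 9 address bits in the proposal


class LutFabric:
    """Toy model: one LUT per cog, written remotely through a 16-bit cog mask."""

    def __init__(self):
        self.luts = [[0] * LUT_SIZE for _ in range(NUM_COGS)]

    def write(self, mask, address, data):
        """Store data at address in every LUT whose bit is set in mask."""
        for cog in range(NUM_COGS):
            if mask & (1 << cog):
                self.luts[cog][address & (LUT_SIZE - 1)] = data & 0xFFFFFFFF


fabric = LutFabric()
fabric.write(0xFFFF, 0x010, 0xDEADBEEF)  # broadcast to all 16 LUTs
fabric.write(1 << 5, 0x000, 0x1234)      # e.g. cog 2 sends to cog 5 (one-way channel)
fabric.write(1 << 2, 0x000, 0x5678)      # cog 5 answers by writing cog 2's LUT
```

The writer's cog number never appears in the model, which is the cog-number-agnostic part: any cog can issue any of these writes, and a two-way link is just two one-way writes with complementary masks.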

evanh · 2016-05-15 04:25

Electrodude wrote: »

Could situations like this somehow be detected and a flag be set?

Just don't do it. Same as how the pins and HubRAM are shared now.

cgracey · 2016-05-15 04:25

Electrodude wrote: »

cgracey wrote: »

evanh wrote: »

jmg wrote: »

How do (eg) 3 COGS write, in the same SysCLK time-slot ?

No conflict occurs but if the software does that then it'll be garbage data in the affected LUTs. Same as multiple Cogs twiddling the pins. That's one of the scary parts.

That's right. Garbled data.

I'm guessing all the data and all the address lines get OR'd together when this happens, or what? Could situations like this somehow be detected and a flag be set?

That's the kind of thing that just needs to be designed around, I think. I don't see any point in putting hardware in that shouldn't ever be needed if programming was correct.

And, yes, everything gets OR'd together.
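A short Python sketch of what "everything gets OR'd together" would mean for one LUT receiving several writes in the same clock. The function name `merged_write` is hypothetical, and the model assumes a plain bitwise OR of all address and data lines, as described above.

```python
from functools import reduce
from operator import or_


def merged_write(requests):
    """requests: list of (address, data) pairs aimed at one LUT in the same clock.

    Returns the single (address, data) the LUT port would see if all
    address lines and all data lines are simply OR'd together.
    """
    address = reduce(or_, (a for a, _ in requests), 0)
    data = reduce(or_, (d for _, d in requests), 0)
    return address, data


# Two cogs writing addresses 1 and 2 in the same clock end up at address 3.
print(merged_write([(1, 0x0F), (2, 0xF0)]))  # (3, 255)
```

In this model neither writer gets what it asked for: the write lands at the OR of the addresses with the OR of the data, which is the garbled result discussed in the thread.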

jmg · 2016-05-15 04:26

cgracey wrote: »

jmg wrote: »

Sounds like a conflict to me...
I was meaning 3 COGS writing to COG0.LUT, Adr 1,2,3 - Separate address values, but same time slot

Multiple cogs could only write a particular LUT per cycle.

Per whose cycle ? Do you mean like HUB Slot ? I thought this was any COG/ any Time ?

3 COGs could all want to write to the SAME LUT, but at different addresses, what is the actual outcome of that ?
Do 2 wait, then one waits, then they are all done in 3 SysCLKs, or ??

cgracey · 2016-05-15 04:27

evanh wrote: »

cgracey wrote: »

jmg wrote: »

I was meaning 3 COGS writing to COG0.LUT, Adr 1,2,3 - Separate address values, but same time slot

Multiple cogs could only write a particular LUT per cycle.

So that wouldn't garble?

That would not garble.

jmg · 2016-05-15 04:28

evanh wrote: »

jmg wrote: »

I was meaning 3 COGS writing to COG0.LUT, Adr 1,2,3 - Separate address values, but same time slot

Ah, yeah, that'll probably garbage up too. Not an issue at all for the expected one-on-one setup. It's only a problem when one wants to have multiple writers to the same LUT. Such software will have to be crafted to suit.

It fixes the Cog allocation issue while providing even more than anyone hoped for.

I'm still trying to nail down the fish-hooks... and there seem to be plenty..

Comments like "Such software will have to be crafted to suit" and "probably garbage up too" need more specific detail.

JRetSapDoog · 2016-05-15 04:29

BTW, I know that Chip is anxious to be done with the design (and there's pressure to do so, a lot of it from himself), but I think it makes sense to go slow at the end here to consider this LUT-sharing/Cognew matter. And I think it would be a mistake to dismiss something simply because of some added complexity. That is, if such added complexity brings useful features, it could well be worth it. I'm glad to see that Chip is still mulling things over, whatever he decides. The main thing is to not rush the design at the end when important considerations are being made, considerations that we didn't know would be needed until we got to this point.

cgracey · 2016-05-15 04:29

jmg wrote: »

cgracey wrote: »

jmg wrote: »

Sounds like a conflict to me...
I was meaning 3 COGS writing to COG0.LUT, Adr 1,2,3 - Separate address values, but same time slot

Multiple cogs could only write a particular LUT per cycle.

Per whose cycle ? Do you mean like HUB Slot ? I thought this was any COG/ any Time ?

3 COGs could all want to write to the SAME LUT, but at different addresses, what is the actual outcome of that ?
Do 2 wait, then one waits, then they are all done in 3 SysCLKs, or ??

There is no hub timing involved. It happens right away. If a lot of cogs want to write one LUT, they need to do it on different SysClks.
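As a rough illustration of the "different SysClks" rule, the Python sketch below replays a list of writes aimed at a single LUT and compares a staggered schedule with one where all three cogs fire in the same clock. The function `run_schedule` and the tuple layout are invented for this example, and same-clock writes are assumed to be OR'd together.

```python
from collections import defaultdict


def run_schedule(writes, lut_size=512):
    """writes: iterable of (sysclk, address, data) aimed at one LUT.

    Writes sharing a sysclk are OR'd together (assumed clash behaviour).
    Returns (lut, clashing_clocks).
    """
    by_clock = defaultdict(list)
    for clk, addr, data in writes:
        by_clock[clk].append((addr, data))

    lut = [0] * lut_size
    clashes = []
    for clk in sorted(by_clock):
        addr = data = 0
        for a, d in by_clock[clk]:
            addr |= a
            data |= d
        lut[addr] = data
        if len(by_clock[clk]) > 1:
            clashes.append(clk)
    return lut, clashes


# Three cogs writing addresses 1, 2, 3 on consecutive clocks: all writes land.
staggered, clashes_a = run_schedule([(0, 1, 0xA), (1, 2, 0xB), (2, 3, 0xC)])
# The same three writes in one clock: a single merged write at address 3.
same_clock, clashes_b = run_schedule([(0, 1, 0xA), (0, 2, 0xB), (0, 3, 0xC)])
```

In the staggered case each address holds its own value. In the same-clock case only address 3 is written, and none of the writers would be told anything went wrong.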

jmg · 2016-05-15 04:33

cgracey wrote: »

There is no hub timing involved. It happens right away. If a lot of cogs want to write one LUT, they need to do it on different SysClks.

what is the outcome if they do not observe that rule ?
Do some wait, or does one succeed, with the other 2 having no idea they failed ?

evanh · 2016-05-15 04:34

cgracey wrote: »

There is no hub timing involved. It happens right away. If a lot of cogs want to write one LUT, they need to do it on different SysClks.

Ah, right, yep. The Cogs would have to be organised not to write on the same cycle.

jmg · 2016-05-15 04:36

evanh wrote: »

Just don't do it. Same as how the pins and HubRAM are shared now.

Err, not quite - these details matter.
3 adjacent HUB addresses are quite fine to be written by 3 COGS.
Nothing is lost, and all 3 know they achieved write.

Tubular · 2016-05-15 04:36

So is it accurate to say that everything (address, data) just gets OR'd? I.e., the LUT address that gets written is the OR'd result of what's being commanded by multiple other cogs (in case of a clash)?

If so that makes it essentially the same as output pins on the P1. Simple enough to follow, and has established precedent

evanh · 2016-05-15 04:37

Heater. wrote: »

1) Each LUT can be used as a unidirectional "channel" between a COG and any other COG.
2) A bidirectional channel between any two COGS can be created by using both their LUTS as above.
3) Broadcasting can be done from any COG to any other set of COGs in a cooperating group.
4) No restrictions on which COGs are interacting this way. COGNEW still works.

That's certainly exactly what I'm seeing.

Is this really practical?

Well, it does cost quite a bit in logic, half a 16x16x(32+9+?) crosspoint switch it seems.
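For scale, the size estimate above can be worked out in a few lines of Python. This only counts signal crossings, not ALMs, and the "?" term is left as a parameter that defaults to zero because the thread does not say what the extra control bits would be.

```python
COGS = 16
DATA_BITS = 32
ADDR_BITS = 9


def crosspoint_bits(extra_bits=0):
    """Signal crossings in a full COGS x COGS switch carrying data + address.

    extra_bits stands in for the unknown '?' term (e.g. control lines).
    """
    return COGS * COGS * (DATA_BITS + ADDR_BITS + extra_bits)


full = crosspoint_bits()
print(full, full // 2)  # 10496 5248
```

So "half a crosspoint switch" is on the order of five thousand bit-crossings before any extra control lines are added.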

JRetSapDoog · 2016-05-15 04:40

evanh wrote: »

cgracey wrote: »

There is no hub timing involved. It happens right away. If a lot of cogs want to write one LUT, they need to do it on different SysClks.

Ah, right, yep. The Cogs would have to be organised not to write on the same cycle.

Yikes! I didn't realize that when I enthusiastically posted up above. Why can't they modify the LUT's on the same cycle in the same way that all cogs can assert the pins as outputs at the same time through a big OR mechanism?

evanh · 2016-05-15 04:40

evanh wrote: »

Well, it does cost quite a bit in logic, half a 16x16x(32+9+?) crosspoint switch it seems.

It's similar to adding another 16 I/O ports I guess. But you get block access on top of that.

jmg · 2016-05-15 04:42

cgracey wrote: »

That's the kind of thing that just needs to be designed around, I think. I don't see any point in putting hardware in that shouldn't ever be needed if programming was correct.

And, yes, everything gets OR'd together.

Wow, really ? That means 2 COGS writing can affect a THIRD wholly unexpected address ? Yikes.

How do you 'design around' such flukes in timing ?

( I expected a Priority Encoder, so at least then one (known) COG wins, and goes to a valid address.
Other COGS could simply stall until their priority is reached. No garbage or corruptions )
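A minimal Python sketch of the priority-encoder idea, for comparison with the OR behaviour. The fixed "lowest cog number wins" rule and the names `arbitrate` and `pending` are assumptions made for this example; the thread does not specify a priority order.

```python
def arbitrate(requests):
    """requests: dict {cog_id: (address, data)} all targeting one LUT this clock.

    Lowest cog number wins (assumed priority). Returns (winning_write, stalled_cogs).
    """
    if not requests:
        return None, []
    winner = min(requests)
    stalled = sorted(c for c in requests if c != winner)
    return requests[winner], stalled


# Three cogs request a write to the same LUT in the same clock.
pending = {3: (1, 0xAA), 7: (2, 0xBB), 9: (3, 0xCC)}
order = []
while pending:
    write, stalled = arbitrate(pending)
    order.append(write)  # one clean write per clock
    pending = {c: pending[c] for c in stalled}
```

Here the three writes complete over three clocks with each address and data value intact, at the cost of stall logic and less predictable timing for the losing cogs.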

I'm quite amazed those who were worried about the quite simple management of COGIDs (or a smarter COGNEW), are now fine with this hand-grenade in the hands of novices ?

evanh · 2016-05-15 04:42

JRetSapDoog wrote: »

Yikes! I didn't realize that when I enthusiastically posted up above. Why can't they modify the LUT's on the same cycle in the same way that all cogs can assert the pins as outputs at the same time through a big OR mechanism?

This is only the special case of multiple Cogs all writing to the same LUT.

And yes, it can be done just like the pins. You'll get the same sort of result as the pins - which is unlikely to be useful.