Is LUT sharing between adjacent cogs very important?

MJB · 2016-05-06 09:20

what about just having the good old P1 style (missing) internal port
that can be used in many different ways -
plus a pin-change event trigger mask on this 32 bit port.

Every (I hope) P1 user knows the concept already

cgracey · 2016-05-06 10:23

MJB wrote: »

what about just having the good old P1 style (missing) internal port
that can be used in many different ways -
plus a pin-change event trigger mask on this 32 bit port.

Every (I hope) P1 user knows the concept already

It takes a lot of resources to implement that, since it becomes part of the pin-handling infrastructure. What's cleanest is each cog outputting 16 possible strobes and cog N being sensitive to his own incoming strobe, which is the other cogs' bit Ns OR'd together.
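A rough software model of the strobe wiring described above may help picture it. This is only a sketch under stated assumptions: the function name `incoming_strobes` and the per-clock list of masks are invented for illustration, and excluding a cog's own bit is one reading of "the other cogs' bit Ns".

```python
NUM_COGS = 16

def incoming_strobes(outputs):
    """outputs[i] is the 16-bit strobe mask cog i drives on this clock.

    Returns a list where entry n is True if any OTHER cog set bit n,
    i.e. cog n's incoming strobe is the OR of the other cogs' bit n.
    """
    assert len(outputs) == NUM_COGS
    result = []
    for n in range(NUM_COGS):
        seen = 0
        for src, mask in enumerate(outputs):
            if src != n:  # assumption: a cog does not strobe itself
                seen |= (mask >> n) & 1
        result.append(bool(seen))
    return result

outputs = [0] * NUM_COGS
outputs[3] = (1 << 5) | (1 << 7)  # cog 3 strobes cogs 5 and 7
outputs[9] = 1 << 5               # cog 9 also strobes cog 5
outputs[4] = 1 << 4               # cog 4 sets its own bit (ignored here)
hit = incoming_strobes(outputs)
```

With these inputs only cogs 5 and 7 see a strobe, and cog 5 cannot tell whether cog 3, cog 9, or both sent it.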

Rayman · 2016-05-06 10:35

This cog-cog signaling (if it fits) might help minimize time to transfer data from one cog to another.

Assuming it's all deterministic, when you get this event you could wait some predetermined number of clocks and then read your message from the hub. Right?

Maybe the hub address used for the message could be optimized based on the cog #s involved?

cgracey · 2016-05-06 10:53

Rayman wrote: »

This cog-cog signaling (if it fits) might help minimize time to transfer data from one cog to another.

Assuming it's all deterministic, when you get this event you could wait some predetermined number of clocks and then read your message from the hub. Right?

Maybe the hub address used for the message could be optimized based on the cog #s involved?

If I get rid of the LUT sharing, it will fit, no problem, along with all 64 smart pins on the -A9.

I'm leaning towards getting rid of the LUT sharing. It feels like something that will encourage bad thinking. We've got this super fast egg-beater system that can move longs on every clock to/from the hub. I think having fast attention signalling will get us what we need.

Are there any of you that REALLY like LUT sharing? I'm kind of frustrated that I moved some instructions around to cleanly accommodate it, and I don't know what I'll do with those nice slots, if I pull it out. I've got the RDLUT and RDLUTN instructions right next to each other, same as the WRLUT and WRLUTN instructions.

ErNa · 2016-05-06 10:59

Leave the slots, as it shows: there is always a chance to find an argument that gives you reason to continue not to stop!

Seairth · 2016-05-06 11:11

So...

GETSTB [WZ]	' Gets own strobe state (WZ: Z = !strobe), sets to zero unless being strobed again at the same time
WAITSTB		' Waits for strobe signal, sets to zero unless being strobed again at the same time
STROBE D/#n	' D: lower 16 bits mask cogs to strobe
		' #n: index (0-15) of other single cog to strobe

* Assuming the strobe event can be handled by an interrupt
* What would it mean for D to be zero? Just a NOP?

Out of curiosity, how much circuitry would it take for each cog to receive 16 strobe lines? In other words, the strobe output from each cog fans out to their respective cogs. So each cog would see which other cog strobed them. In which case, the instructions would look like:

GETSTB D [WZ]	' Gets incoming strobes (WZ: 1 if no active strobes), sets to zero unless being strobed again at the same time
WAITSTB		' Waits for strobe signal (does not reset strobes, must use GETSTB)
STROBE D/#n	' D: lower 16 bits mask cogs to strobe
		' #n: index (0-15) of other single cog to strobe
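The second variant (each cog sees which cog strobed it) can be sketched as a small Python model. Everything here is hypothetical: `StrobeFabric`, `strobe_mask`, `strobe_index`, and `getstb` are illustrative names standing in for the proposed STROBE D, STROBE #n, and GETSTB D WZ, and the "strobed again at the same time" corner case is not modelled.

```python
NUM_COGS = 16

class StrobeFabric:
    """Per-source strobe latches: incoming[t] has bit s set when cog s strobed cog t."""

    def __init__(self):
        self.incoming = [0] * NUM_COGS

    def strobe_mask(self, src, d):
        """STROBE D: the lower 16 bits of d select the cogs to strobe."""
        mask = d & 0xFFFF
        for target in range(NUM_COGS):
            if (mask >> target) & 1:
                self.incoming[target] |= 1 << src

    def strobe_index(self, src, n):
        """STROBE #n: strobe the single cog with index n (0-15)."""
        self.strobe_mask(src, 1 << (n & 0xF))

    def getstb(self, cog):
        """GETSTB D WZ: return (D, Z) and clear the latches; Z is True if none were set."""
        d = self.incoming[cog]
        self.incoming[cog] = 0
        return d, d == 0

fabric = StrobeFabric()
fabric.strobe_index(3, 5)       # cog 3 strobes cog 5
fabric.strobe_mask(9, 1 << 5)   # cog 9 strobes cog 5 as well
fabric.strobe_mask(2, 0)        # D = 0: nothing is latched in this model
first = fabric.getstb(5)
second = fabric.getstb(5)
```

In this model a zero D behaves as a NOP, and the receiver can either test Z alone or inspect D to see that bits 3 and 9 are set.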

cgracey · 2016-05-06 11:19

Seairth wrote: »
So...
GETSTB [WZ]	' Gets own strobe state (WZ: Z = !strobe), sets to zero unless being strobed again at the same time
WAITSTB		' Waits for strobe signal, sets to zero unless being strobed again at the same time
STROBE D/#n	' D: lower 16 bits mask cogs to strobe
		' #n: index (0-15) of other single cog to strobe
* Assuming the strobe event can be handled by an interrupt
* What would it mean for D to be zero? Just a NOP?

Out of curiosity, how much circuitry would it take for each cog to receive 16 strobe lines? In other words, the strobe output from each cog fans out to their respective cogs. So each cog would see which other cog strobed them. In which case, the instructions would look like:
GETSTB D [WZ]	' Gets incoming strobes (WZ: 1 if no active strobes), sets to zero unless being strobed again at the same time
WAITSTB		' Waits for strobe signal (does not reset strobes, must use GETSTB)
STROBE D/#n	' D: lower 16 bits mask cogs to strobe
		' #n: index (0-15) of other single cog to strobe

It would take a little more hardware, but then I could see it taking a lot more software to deal with WHO strobed you.

Seairth · 2016-05-06 11:19

cgracey wrote: »

Rayman wrote: »

This cog-cog signaling (if it fits) might help minimize time to transfer data from one cog to another.

Assuming it's all deterministic, when you get this event you could wait some predetermined number of clocks and then read your message from the hub. Right?

Maybe the hub address used for the message could be optimized based on the cog #s involved?

If I get rid of the LUT sharing, it will fit, no problem, along with all 64 smart pins on the -A9.

I'm leaning towards getting rid of the LUT sharing. It feels like something that will encourage bad thinking. We've got this super fast egg-beater system that can move longs on every clock to/from the hub. I think having fast attention signalling will get us what we need.

Are there any of you that REALLY like LUT sharing? I'm kind of frustrated that I moved some instructions around to cleanly accommodate it, and I don't know what I'll do with those nice slots, if I pull it out. I've got the RDLUT and RDLUTN instructions right next to each other, same as the WRLUT and WRLUTN instructions.

I'm fine with letting the shared LUT go. I thought it was overkill in the first place. I still think it's important to have an efficient cog-to-cog conduit. With the addition of strobes, I think this will work nicely in conjunction with the hub. The other advantage of the hub is that it's shared by all cogs, not just adjacent cogs. As Heater pointed out, this lets your code run on any cog, while the shared LUT required you to run on specific cogs.

cgracey · 2016-05-06 11:22

I think maybe nobody expressed in this thread that LUT sharing is very important to them. The idea seems great, at first, but then there's a sideways feeling that creeps up.

Seairth · 2016-05-06 11:26

cgracey wrote: »

It would take a little more hardware, but then I could see it taking a lot more software to deal with WHO strobed you.

Right. With this approach, if you don't care WHO strobed you, just look at Z. But if you do care, no need to look any further than the value in D.

Also, I suppose you could change WAITSTB, in the second case, to "WAITSTB D [WZ]". In other words a blocking GET. (My guess is that this would really just be the same instruction with the C flag set, or something.) Or, if we still use the WC thing to turn a blocking WAIT into a non-blocking WAIT, you could get rid of GETSTB altogether.

evanh · 2016-05-06 11:53

Don't the LOCK bits cover sync'ing?

Seairth · 2016-05-06 12:04

evanh wrote: »

Don't the LOCK bits cover sync'ing?

It can, but there are a few issues:

* It is a shared hub resource, and therefore subject to the round-robin access.
* You can set an edge interrupt or use WAITHLK, but neither of those will actually clear the lock bit.

LOCKs are good for exactly what they were intended: mutexes for shared resources. In fact, I'd even argue that there should be 64 of them, but that's another discussion. But, even if there were more, they would not be the right thing to use for strobing. Putting it another way, when using locks as events, they are "manual reset events", while strobes are "auto-reset events".
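The manual-reset versus auto-reset distinction can be illustrated in Python. This is an analogy only, not P2 behaviour: `threading.Event` plays the role of a lock bit used as an event, and `AutoResetEvent` is a hypothetical helper written here to behave like a strobe.

```python
import threading

class AutoResetEvent:
    """Event that clears itself as soon as one waiter consumes it (strobe-like)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._flag = False

    def set(self):
        with self._cond:
            self._flag = True
            self._cond.notify()

    def wait(self, timeout=None):
        with self._cond:
            self._cond.wait_for(lambda: self._flag, timeout)
            fired = self._flag
            self._flag = False  # auto-reset: the signal is consumed here
            return fired

# Manual reset: stays set until somebody explicitly calls clear().
manual = threading.Event()
manual.set()
manual_first = manual.wait(0)
manual_second = manual.wait(0)

# Auto reset: the first successful wait consumes the signal.
auto = AutoResetEvent()
auto.set()
auto_first = auto.wait(0)
auto_second = auto.wait(0)
```

The manual event reports "set" on both waits, while the auto-reset event reports it only once, which is the property that makes strobes a better fit for signalling.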

Seairth · 2016-05-06 12:24

Following up on the 64 locks suggestion, my argument is simple: we now have 64 shared smartpins. An obvious use case is having multiple cogs wanting to send data on the same smartpin (e.g. two cogs sending their different data on the same ASYNC TX). In this case, using a lock will make a lot of sense. Alternatively, you would have to dedicate a single cog to transmit and have all other cogs pass data via the hub.
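As a loose analogy for the shared-smartpin case, here is a Python sketch in which two threads stand in for two cogs and a list stands in for the ASYNC TX pin. The names `tx_lock`, `tx_log`, and `send` are invented for illustration; on the real chip the lock would be a hub LOCK, not a Python object.

```python
import threading

tx_lock = threading.Lock()  # stands in for one hardware lock guarding the pin
tx_log = []                 # stands in for the bytes leaving the shared TX pin

def send(cog_id, payload):
    """Hold the lock for the whole message so two senders never interleave."""
    with tx_lock:
        for byte in payload:
            tx_log.append((cog_id, byte))

threads = [threading.Thread(target=send, args=(cog, b"hello")) for cog in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

senders_in_order = [cog for cog, _ in tx_log]
```

Whichever thread wins the lock first, its five bytes appear contiguously before the other sender's bytes.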

Could you still get away with only 16 locks? Possibly. But with so many more shared resources (smart pins, more powerful hub memory usage, etc.) than we had in the P1, I think locks will become more useful than they have in the past.

There should be no technical reason to limit them, as the instructions don't really change. The issue is one of resources, on which I would think this would have little impact.

jmg · 2016-05-06 20:40

cgracey wrote: »

Are there any of you that REALLY like LUT sharing? I'm kind of frustrated that I moved some instructions around to cleanly accommodate it, and I don't know what I'll do with those nice slots, if I pull it out. I've got the RDLUT and RDLUTN instructions right next to each other, same as the WRLUT and WRLUTN instructions.

If it is done, don't remove it just yet, but tag it as the first removal candidate.

Just how many LUTs does this add, and where do they get used ?

cgracey wrote: »

I'm leaning towards getting rid of the LUT sharing. It feels like something that will encourage bad thinking. We've got this super fast egg-beater system that can move longs on every clock to/from the hub. I think having fast attention signalling will get us what we need.

I would agree that fast signaling matters more.
It is so fundamental, I was assuming that was already in there.

Many COG-COG co-operation cases need only booleans, and those that need data, should be ok with data blocks.

Someone may come up with a use case where Blocks are not enough ?

Does the LUT Mux also allow larger CLUT use ?

jmg · 2016-05-06 20:46

cgracey wrote: »

What's cleanest is each cog outputting 16 possible strobes and cog N being sensitive to his own incoming strobe, which is the other cogs' bit Ns OR'd together.

That sounds good, but maybe 32 ? (see below)

Seairth wrote: »

{re locks} ... In fact, I'd even argue that there should be 64 of them, but that's another discussion. But, even if there were more, they would not be the right thing to use for strobing. Putting it another way, when using locks as events, they are "manual reset events", while strobes are "auto-reset events".
... Could you still get away with only 16 locks? ...

Since these are best Atomic, isn't the natural count for both Locks and Strobes 32 ?
It is a 32b Core, so you can read/modify 32 at a time ?

dMajo · 2016-05-06 21:35

@cgracey
is the LUT sharing only an FPGA problem or will it possibly also be an issue for the final silicon area needed to accomplish that?
If it's only FPGA gates missing can't you drop 32 smart pins? Aren't 32 smart pins enough for test/debug the design? Even if paired/teamed you have 16 couples. Do you really need all 64 pins?
Will this give you the room for the shared LUT implementation?

jmg · 2016-05-06 21:46

dMajo wrote: »

@cgracey
is the LUT sharing only an FPGA problem or will it possibly also be an issue for the final silicon area needed to accomplish that?
If it's only FPGA gates missing can't you drop 32 smart pins? Aren't 32 smart pins enough for test/debug the design? Even if paired/teamed you have 16 couples. Do you really need all 64 pins?
Will this give you the room for the shared LUT implementation?

Yes, Chip already has a build setting with reduced Smart pins.

I think he is looking for compelling use cases, where the LUT-share opens up some application ?

Maybe a list of what opcodes can make use of the new LUT area would help ?

Seairth · 2016-05-06 22:07

jmg wrote: »

Seairth wrote: »

{re locks} ... In fact, I'd even argue that there should be 64 of them, but that's another discussion. But, even if there were more, they would not be the right thing to use for strobing. Putting it another way, when using locks as events, they are "manual reset events", while strobes are "auto-reset events".
... Could you still get away with only 16 locks? ...

Since these are best Atomic, isn't the natural count for both Locks and Strobes 32 ?
It is a 32b Core, so you can read/modify 32 at a time ?

By keeping the strobes at 16, the index of the bit matches the cog number. You could go to 32 bits, but it would also complicate the STROBE #n format that I suggested earlier. Further, as soon as you need to convey any more than three values (%01, %10, %11), you would have to use out-of-band data-passing (e.g. hub) anyhow. So why not just keep the strobes as simple as can be, and use other means to add complex messaging.

And the fact that this is a 32b core doesn't really matter for locks, since you operate on them by index, not bit mask.

evanh · 2016-05-07 00:50

cgracey wrote: »

I think maybe nobody expressed in this thread that LUT sharing is very important to them. The idea seems great, at first, but then there's a sideways feeling that creeps up.

Chip,
The earlier raised HubRAM write buffering idea would be more universally useful. And would fit nicely with Seairth's strobing idea.

EDIT: Hmm, I know this is a good idea but have not been able to see a clear description of how to merge it with the myriad of write channels already existing. From ordinary WRLONG to SETQ'd to WRFAST to Streamer ...

Okay, WRFAST and Streamer are autonomous and are covered by the FIFO. So they stay independent. Although, WRFAST becomes 100% redundant if the buffering exists.

I'm guessing a SETQ'd write has the same timing variability of a plain WRLONG, so, ideally, we'd want to cover both WRLONG and SETQ'd WRLONG.

Cluso99 · 2016-05-07 21:46

Sorry, I have been MIA in the UK without Internet access

I brought up the idea of a shared register between adjacent cogs. The only alternate way is to sacrifice pins.

The cases are real. If you need to perform fast I/O processing, the only way is to throw more cogs at the task. This is what scanlime did in the USB FS implementation. The basic problem is that you just cannot pass data between cogs via hub fast enough. Typically, what is happening here is that you have a fast data stream coming in. The first cog reads the stream, converting it into something like bytes. Each byte is passed quickly to the next cog for processing the next higher level of data. There could even be 2 cogs doing this, one to calculate a CRC value while the other presumes a correct CRC and prepares the reply, possibly even using another cog for transmitting.

USB FS is a typical example that without some form of hardware assist, would be an extremely difficult task.

We have 16 cogs but how are we going to use them? I can think of many ways that shared LUT between adjacent cogs would be extremely beneficial.

The argument that all cogs must be equal IMHO severely restricts the capability of the P2. Most would agree that in nearly all designs, there is one main cog performing the main tasks, with helper cogs performing specialised sub tasks. Some have even asked for a combo P2+ARM. Remember, you are not being forced to use these features. But please don't deprive some of us of the chance to do great things by using specialised features.

Roy Eltham · 2016-05-08 01:10

Cluso99,
With the egg-beater setup to hub, reading and writing to/from hub would have the same maximum throughput as directly accessing neighbor LUTs (one clock per long), it would just have a longer latency for startup. So as long as you can live with a small latency at startup of the data streaming between cogs, then through HUB should be fine.
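The throughput-versus-latency point can be put into rough numbers. This is a back-of-envelope sketch: the one-clock-per-long rate comes from the post above, but the two latency constants are placeholders chosen only for illustration, not measured P2 figures.

```python
def transfer_clocks(n_longs, startup_latency):
    """Total clocks for a streamed block: one-off latency plus one clock per long."""
    return startup_latency + n_longs

def efficiency(n_longs, startup_latency):
    """Fraction of the transfer time spent actually moving data."""
    return n_longs / transfer_clocks(n_longs, startup_latency)

HUB_LATENCY = 16  # placeholder value, an assumption for this sketch
LUT_LATENCY = 2   # placeholder value, an assumption for this sketch

big_block_hub = efficiency(1024, HUB_LATENCY)
big_block_lut = efficiency(1024, LUT_LATENCY)
single_long_hub = efficiency(1, HUB_LATENCY)
single_long_lut = efficiency(1, LUT_LATENCY)
```

Under these assumed figures the two paths are nearly indistinguishable for a 1024-long block, while for a single long the startup latency dominates, which is the case where a low-latency path would matter.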

Actually, I'm not even sure accessing a neighbor cog's LUT would have the same maximum throughput; it depends on whether it could have the same streaming setup as there is with the PINS<->HUB and LUT<->HUB. It would have better latency overall, though.

Also, if you are doing any form of processing on the data at all, then you are not going to get anywhere near maximum throughput. What would be more interesting to me would be a DMA-like setup where the cog can set up a stream of data in/out to HUB, and then at the same time continue to execute instructions so you could overlap the streaming with the processing, instead of taking turns. However, I'd rather Chip just finish up this thing so we can all have it.

Seairth · 2016-05-08 02:17

I think the difference between fast LUT sharing and fast HUB sharing is akin to the difference between fast random access and fast sequential access.

However, I'm fine with dropping the shared LUT. There are just too many concerns about getting it right, working correctly with streaming and LUT-exec, etc. If there were a clear/obvious approach to implementing it, that would be different. But that's not the case, so add it to the P3 wish list and let's move on.

Heater. · 2016-05-08 03:14

I have not followed the details of this shared LUT thing. Sounds like a complication I would never want to have to get my head around.

The gist of it seems to be having two COGs that share a common resource. So COG 0 shares with COG 1, COG 1 shares with COG 2...COG N shares with COG N+1. Presumably COG 15 shares with COG 0.

My question is, how do I get those two consecutive COGs started?

I can't just use COGNEW to grab two COGS. I cannot be sure that a second call to COGNEW will get me a COG adjacent to the first; something else may have started that COG in between the two calls.

I can't use COGINIT because, well, we don't use COGINIT. COGINIT assumes I am managing all the allocation of COGs in my program, which is not generally the case when mixing and matching objects from here and there.

Sounds like we would need a COGNEW2 that would acquire and start two consecutive COGS, if it can find them. It would have to be an atomic operation to ensure that it gets two adjacent COGS. It can fail even if there are many free COGS, in the event that every other COG is already in use (Yeah, I know, an unlikely edge case, but still).

Cluso is hinting that perhaps I want/need a system of 3 or more COGs to get the job done? Now I need to atomically claim M consecutive COGs!
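The allocation problem described above can be sketched in Python. This is a hypothetical model: `cognew_adjacent` is an invented name, the `in_use` list stands in for the hub's cog-allocation state, a `threading.Lock` stands in for whatever would make the hardware operation atomic, and wrapping from COG 15 to COG 0 follows the "presumably" guess made earlier.

```python
import threading

NUM_COGS = 16
_alloc_lock = threading.Lock()  # models the atomicity of the claim

def cognew_adjacent(in_use, count=2):
    """Claim `count` consecutive free cogs, wrapping from 15 back to 0.

    Returns the first cog number of the run, or -1 if no such run exists.
    """
    with _alloc_lock:
        for first in range(NUM_COGS):
            run = [(first + i) % NUM_COGS for i in range(count)]
            if not any(in_use[c] for c in run):
                for c in run:
                    in_use[c] = True
                return first
        return -1

# Edge case from the post: every other cog busy, eight cogs free, no adjacent pair.
checkerboard = [c % 2 == 0 for c in range(NUM_COGS)]
pair_in_checkerboard = cognew_adjacent(checkerboard)

# Only cogs 15 and 0 free: the pair is found through the wrap-around.
wrap_case = [True] * NUM_COGS
wrap_case[15] = False
wrap_case[0] = False
pair_with_wrap = cognew_adjacent(wrap_case)

# Asking for M = 3 consecutive cogs on an idle chip.
idle = [False] * NUM_COGS
triple = cognew_adjacent(idle, count=3)
```

The checkerboard case fails even though half the cogs are free, which is exactly the fragmentation concern raised above.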

All in all this sounds very messy.

Or did I miss a point about this somewhere?

evanh · 2016-05-08 03:53

Heater,
Yep, that sums it up I'd say. I might claim some credit even but it was probably just a coincidence.

Cluso99 · 2016-05-08 05:07

Roy,
It is precisely the latency that is the problem I am trying to resolve. USB FS gives us an exact case in question. The responses must take place extremely fast, in a matter of a small number of clocks. Hub can never deliver this minimum latency. The egg beater does nothing to help in this instance. In fact it makes calculating the latency even more difficult because you need to know your cog and the other cog, plus the hub address of the byte/word/long.

Heater,
For doing any high speed protocols, you will need to know cog numbers for calculating latency. In fact almost certainly you will still need to start specific cogs to do the job. We are not even talking very fast I/O in today's terms... 12MHz is not fast. There are many high speed protocols that should be doable with P2. That's what most cogs are for.
We are missing a great opportunity by not using the dual port ability of the LUT. There is silicon available, and Chip has done it. It should be a simple matter of addressing to permit extended LUT exec but I can live without it.
I gave an example where more cogs can be used to get better results.

We will push the P2 to whatever limits we can. We also need to manage not only today's interfaces, but those for the next 10 years.

In fact, I would rather shared LUT between adjacent cogs than the egg beater. I still have reservations about the egg beater.

evanh · 2016-05-08 05:45

It's going a bit overkill on the Cogs if a second one is needed just for ACK/NAK response times. Can't the single Cog deal with that on its own if the Hub timing became both deterministic and fast? The second Cog should be dealing with a higher protocol level.

Phil Pilgrim (PhiPi) · 2016-05-08 06:35

My vote is with anything that preserves cog-number-agnosticism. "Adjacent cogs" sounds like a recipe for disaster.

-Phil

cgracey · 2016-05-08 07:29

After a lot of think,

Phil Pilgrim (PhiPi) wrote: »

My vote is with anything that preserves cog-number-agnosticism. "Adjacent cogs" sounds like a recipe for disaster.

-Phil

Providentially, the ascetic returns at just the right moment!

I was thinking about this cog-number-agnosticism matter, myself, this evening, lamenting that the cogs' four fast DAC channels are hard-tied to pins %CCCCxx, where %CCCC is the cog number. This killed cog-number-agnosticism a while back and the LUT sharing further muddied the water.

Here's what I think we need to do:

Use a variation of the PINSETM instruction to enable which cogs' 4-channel DAC outputs are OR'd together to feed each four-pin-group's DAC channel inputs. And forget about the LUT sharing. This way, more than one cog could drive a set-of-four-pins' DAC channel inputs, enabling cogs to tag-team for continuous high-bandwidth analog output. On the same clock that one cog outputs four $00's, another steps in with data to be OR'd with the $00's. This is not as mux-hungry as the original plan was, way back, because we are dealing with groups of 4 pins, not individual pins. This amounts to 16 sets (one for each 4-pin block) of 16x32 AND-OR circuits (16 cogs, each outputting 4*8 bits). Now, we have full cog-number-agnosticism back and we won't have endless cog allocation headaches like we were surely going to encounter when it comes time to make complete tools.
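The AND-OR arrangement for one four-pin group can be modelled in a few lines of Python. This is a sketch under assumptions: `group_dac_input` and the 16-bit `enable_mask` are illustrative names for whatever the PINSETM variation would configure, and each cog's four 8-bit DAC channels are packed into one 32-bit word.

```python
NUM_COGS = 16

def group_dac_input(enable_mask, cog_dac_words):
    """Combine the DAC outputs of the enabled cogs for one four-pin group.

    enable_mask: bit c set means cog c is allowed to drive this group.
    cog_dac_words[c]: cog c's 32-bit word (4 channels x 8 bits).
    """
    out = 0
    for cog, word in enumerate(cog_dac_words):
        if (enable_mask >> cog) & 1:   # AND stage: gate by the enable bit
            out |= word & 0xFFFFFFFF  # OR stage: merge into the group input
    return out

words = [0] * NUM_COGS
words[2] = 0x00000000   # cog 2 has just finished and outputs four $00's
words[5] = 0x11223344   # cog 5 steps in with data on the same clock
words[7] = 0xFFFFFFFF   # cog 7 is driving something, but is not enabled here
enable = (1 << 2) | (1 << 5)
result = group_dac_input(enable, words)
```

Because the idle cog contributes zeros, the OR passes cog 5's data through unchanged, which is what allows the tag-team hand-off without glitches in this model.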

Heater. · 2016-05-08 07:45

cgracey,

This killed cog-number-agnosticism a while back and the LUT sharing further muddied the water.

Epic disaster.

Use a variation of the PINSETM instruction to enable which cogs' 4-channel DAC outputs are OR'd together to feed each four-pin-group's DAC channel inputs. And forget about the LUT sharing

Phew! Disaster averted.

jmg · 2016-05-08 08:24

cgracey wrote: »

Use a variation of the PINSETM instruction to enable which cogs' 4-channel DAC outputs are OR'd together to feed each four-pin-group's DAC channel inputs. And forget about the LUT sharing. ..

but there is still fast-signaling being done, between any/all COGs - ie so one COG can start/resync the other up to 15 ? (maybe to 32 bits?)

cgracey wrote: »

Now, we have full cog-number-agnosticism back and we won't have endless cog allocation headaches like we were surely going to encounter when it comes time to make complete tools.

I don't see 'cog-number-agnosticism' as a holy grail, plenty of other systems manage 'largely equal' fabric, but have special use cases where higher performance comes from simple floor planning.
Higher performance usually wins, hands down.

A good example is CPLD's :
Here Macrocells are all of equal resource, but details like fast carry, require adjacent macrocells.
This is managed with a combination of Tool and user placement.
Macrocells also have clusters, or Blocks. Within those blocks, routing is more flexible than across blocks.

This is not the software headache imagined by some.
This sort of allocate and check, compilers and linkers and fitters do every day.

There is more merit in Pin-agnostic design, as breaking that can cause a PCB respin on a code change.
A common lament in MCU-Land, is designers cannot actually use ALL peripheral choices at the same time.
