Is LUT sharing between adjacent cogs very important?

Rayman · 2016-05-18 19:58

Wonder if easy to do 2-bit control where 1 bit allows writes to your LUT and other bit allows writes to other LUT? Then, you could send a message while keeping your LUT intact?

Not important, just a thought...

Heater. · 2016-05-18 20:05

I wish this whole LUT thing would go away. It's starting to sound like a twist in complexity too far.

Get me the chip already

jmg · 2016-05-18 20:09

Rayman wrote: »

Wonder if easy to do 2-bit control where 1 bit allows writes to your LUT and other bit allows writes to other LUT? Then, you could send a message while keeping your LUT intact?

Not important, just a thought...

If there is a boolean to enable/disable this, is the most natural thing not to use it as an address bit ?

ie one LUT appears above the other, in each COG, and a write to the high address, goes to other LUT ?

I can see that the most compact routing comes from a placement of two LUTs in the same place, with no across-cog links.

Cluso99 · 2016-05-18 20:11

cgracey wrote: »

By making writes common between two LUTs, there is no need for a special result mux, or a SETQ3 to facilitate block r/w, or another mux for LUT exec. This means, though, that what was two separate 512-long LUTs become, effectively, one LUT. It's a LUT mind meld. This can only work between a pair of cogs, though, not a rolling overlap, as proposed before. Therefore, an even-odd pair is probably ideal, given the constraints. COGNEW can allocate double cogs from the top down, while single cogs are allocated from the bottom up. Someone proposed something like this a few hours ago, in order to minimize cog fragmentation. I feel okay about all this. You've got to lump the cog-pair issue, though.

Chip,
Now I see where you are headed with this.

I am not advocating any block transfers although I am unsure of anyone else. Hence a SETQ3 isn't required. To transfer blocks of LUT from one cog to the next could be done with a normal programmed loop. It would still be fast.

What I don't particularly like is the melding of the 2 adjacent cog LUTs as we effectively lose 512 longs between the pair. Is there any other way for this?
If not, then personally I'd rather only a smaller block of LUT be combined, say 1/4 (64 longs).

Also I had hoped that there would be sharing both ways. This would mean that we didn't have to have even + next pairs, so it would be easier to find adjacent pairs. Also, more than two cogs could then work together too. As I have said, I am happy to allocate them with the standard COGNEW early in the initialisation.
However, I will be happy with any direct LUT sharing between adjacent cogs.

Is it too difficult to put back in what you did originally a couple of weeks ago???

Rayman · 2016-05-18 20:11

That sounds good... Guess it's question of which way minimizes logic required...

jmg · 2016-05-18 20:13

cgracey wrote: »

Seeing that no other low-latency path is going to work well between cogs besides adjacent LUT sharing, I will look at making COGINIT allocate 1..2 adjacent cogs. Bit 6 of D can be:

0 = 1 cog
1 = 2 cogs

Then, each cog can have instructions:

RDLUTN - read next cog's LUT
WRLUTN - write next cog's LUT

Are more than two cogs necessary?

That depends on the LUT overlay design.
If it is strictly pairs, then clearly 2 is all you need

If it is half to COG-1 and Half to COG+1, then 3 and 4 are quite valid user requests.
Three gives one master and two minion COGs, one on either side, and four gives two linked masters, each of which has a minion.

The COGNEW N, that I suggested, would allow any number from 1 to 16
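
A minimal sketch of what a "COGNEW N" search might look like, modelled on the host in Python rather than in hardware. The function name `alloc_adjacent` and the lowest-run-first policy are assumptions for illustration only; the thread does not specify how the search would be ordered.

```python
NUM_COGS = 16  # cog count assumed from the "1 to 16" range above

def alloc_adjacent(occupied, n):
    """Find the lowest run of n free adjacent cogs (hypothetical policy).

    'occupied' is a bit mask with bit k set when cog k is in use.
    Returns (first_cog, new_mask), or None when no such run exists.
    """
    if not 1 <= n <= NUM_COGS:
        return None
    run = (1 << n) - 1                      # n consecutive one bits
    for first in range(NUM_COGS - n + 1):   # no wrap-around in this sketch
        mask = run << first
        if (occupied & mask) == 0:
            return first, occupied | mask
    return None
```

With cogs 0 and 2 busy, a request for two adjacent cogs skips the isolated free cog 1 and lands on cogs 3 and 4.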

jmg · 2016-05-18 20:15

Phil Pilgrim (PhiPi) wrote: »

The bottom-up/top-down schema will work great from a reboot. But how hard is it to implement once available cogs become fragmented?

Surely someone juggling COGs to that extent, has a COG manager.
How else do they know they always avoid asking for a COG 17 ?

It takes a quite contrived case to create fragmentation, but even the worst scenario can be recovered with a COGs reload.

potatohead · 2016-05-18 20:15

@Phil: some people may need a deCOGulator.

I'm fine with it either way Chip.

Cluso99 · 2016-05-18 20:18

Chip,
Another thought. Is there a spare bit in the WRLUT (perhaps WC) such that we could choose whether to write to our own LUT or both LUTs? Or else just one extra WRLUTX instruction to write to both?

jmg · 2016-05-18 20:28

Cluso99 wrote: »

Chip,
Another thought. Is there a spare bit in the WRLUT (perhaps WC) such that we could choose whether to write to our own LUT or both LUTs? Or else just one extra WRLUTX instruction to write to both?

If an address-bit cannot fit, then yes, any other 'live & atomic' means to select own/both would make sense.

jmg · 2016-05-18 20:32

Cluso99 wrote: »

...
However, I will be happy with any direct LUT sharing between adjacent cogs.

Is it too difficult to put back in what you did originally a couple of weeks ago???

The earlier design allowed 3 or 4 or more COGs to closely co-operate, which had real appeal.

However, it does add routing to run the LUT Adr.Data to both sides of a COG, so I can see this is smallest die impact, if also less flexible.

Roy Eltham · 2016-05-18 20:34

Ugh.

evanh · 2016-05-18 20:37

Phil Pilgrim (PhiPi) wrote: »

That would be okay only as long as there is a special cognew2 that allocates adjacent cogs simultaneously. It could just return the number of the even-numbered cog in that case. There would have to be well-defined procedure for cogstopping them, so that cogs that remain "entangled" don't subsequently get allocated individually. (This is starting to sound like quantum mechanics.)

Sounds like Chip has figured COGINIT will have to do for dual allocation - he's mentioned adding a special dual allocation that ensures two consecutive Cogs are used.

The good news is COGNEW can now coexist this way. Albeit in a semi-static mapping arrangement so probably just achieves the equivalent of what Cluso was planning on doing by hand.

jmg · 2016-05-18 20:39

evanh wrote: »

The good news is, when COGINIT static allocates, this protects COGNEW for subsequent ordinary use quite nicely. No need to throw it away now.

? - Who was throwing away COGNEW ?

kwinn · 2016-05-18 20:43

Cluso99 posted “If so, would it be difficult to make the LUT addresses 10 bits, where immediate addresses are 9 bits (with 10th bit assumed 0), but register addresses use the full 10 bits where 10th bit is 1 for adjacent LUT. ” in an earlier post, and I think it is a good idea if the LUT address could be extended to 10 bits. Using address bits for this seems like a natural choice.
Even better with 11 address bits so any combination of n, n(-1), and n(+1) could be written to. That would also preserve LUT space.
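
The 10-bit idea can be sketched as a simple address decode. This is only a Python model under stated assumptions: "adjacent" is taken to mean the next higher cog, the cog number wraps from 15 to 0, and `decode_lut_address` is a made-up name. The 11-bit variant is left out because the thread does not say how the extra bits would be encoded.

```python
LUT_SIZE = 512  # longs per LUT, as stated in the thread

def decode_lut_address(cog, addr, num_cogs=16):
    """Split a 10-bit LUT address into (target_cog, offset).

    Bit 9 clear: the access goes to the cog's own LUT.
    Bit 9 set:   the access goes to the adjacent LUT (assumed: cog + 1, wrapping).
    """
    if not 0 <= addr < 2 * LUT_SIZE:
        raise ValueError("address must fit in 10 bits")
    offset = addr & (LUT_SIZE - 1)          # low 9 bits select the long
    if addr & LUT_SIZE:                     # bit 9 selects the neighbour
        return (cog + 1) % num_cogs, offset
    return cog, offset
```

A 9-bit immediate always has bit 9 clear, so immediate addresses would keep hitting the cog's own LUT, which matches the quoted proposal.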

dMajo · 2016-05-18 20:48

cgracey wrote: »

Thinking about this more...

How about rather than having special RDLUTN/WRLUTN instructions, we arrange the LUT sharing so that even/odd cog pairs have all LUT writes in common, meaning that the 2nd port is used to replicate all write activity from the partner cog? This way, a write by either cog to its own LUT affects both LUTs. This would take only a write mux on the second port of each LUT. All the SETQ2+RDLONG/WRLONG block operations would work in this mode, as well.

Is there any practical use for 16 LUTs (except for the equity of the cogs)?

What if instead of considering a cog as unity you consider a cog pair instead?
If you make 1 LUT for a cog pair and this LUT is 4 port it uses the same ram size, you have one port for each cog and streamer, no muxes, no double instructions.
And such cog pair probably will be physically placed adjacent with the LUT in the middle, but from hub window perspective it will be nice that the cog pair will be 8 slots apart. I mean one pair 1-9 second pair 2-10 and so on.
And in this configuration perhaps the cogs, members of the pair, can lend their hub slots to each other, and in this way halve the latency through the hub between cog pairs.

Again equity, but between cog pairs instead of single cogs. A sort of P2 with 8 twins.
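
The latency arithmetic behind the slot-lending idea can be sketched with a deliberately simplified model: a plain 16-slot round-robin in which cog n owns slot n. That model, and the name `wait_for_slot`, are assumptions for illustration; the real hub timing is not described in this thread.

```python
NUM_SLOTS = 16  # assumed: one hub slot per cog, served round-robin

def wait_for_slot(now, cog, borrow_partner=False):
    """Clocks until 'cog' can next use the hub, starting at slot 'now'.

    With borrow_partner set, the cog may also use the slot of the partner
    that sits 8 slots away, as in the pairing suggested above.
    """
    slots = {cog % NUM_SLOTS}
    if borrow_partner:
        slots.add((cog + NUM_SLOTS // 2) % NUM_SLOTS)
    return min((slot - now) % NUM_SLOTS for slot in slots)

# Worst case over every possible starting slot, for cog 1
worst_alone = max(wait_for_slot(t, 1) for t in range(NUM_SLOTS))
worst_paired = max(wait_for_slot(t, 1, True) for t in range(NUM_SLOTS))
```

In this model the worst-case wait drops from 15 clocks to 7, because the two usable slots are exactly half a rotation apart.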

potatohead · 2016-05-18 20:49

No. We need individual COGS.

jmg · 2016-05-18 20:53

dMajo wrote: »

Is there any practical use for 16 LUTs (except for the equity of the cogs)?

What if instead of considering a cog as unity you consider a cog pair instead?
If you make 1 LUT for a cog pair and this LUT is 4 port it uses the same ram size, you have one port for each cog and streamer, no muxes, no double instructions.
And such cog pair probably will be physically placed adjacent with the LUT in the middle, but from hub window perspective it will be nice that the cog pair will be 8 slots apart. I mean one pair 1-9 second pair 2-10 and so on.

There is also a close variant of this suggested, where the HUB drops to 8 - ie only one of the paired COGS has HUB access, now with half the latency.

The other COGS can still use the (quite new) Pin-Cell pathways, for any-any links
That's probably too large a change, at this stage.

Phil Pilgrim (PhiPi) · 2016-05-18 20:54

evanh wrote:

Sounds like Chip has figured COGINIT will have to do for dual allocation ...

I didn't get that impression when he mentioned the bottom-up/top-down allocation scheme.

As it turns out, that scheme offers no advantages over allocating both single and double cogs from the bottom. I suspected that to be the case and wrote a Spin program to simulate 1000 allocations, randomly mixing single and double cognew requests until a request can't be filled. A tally of the average number of successful requests was kept and displayed at the end of the simulation. It was the same for both allocation schemes.

Here's the program:

CON

  _clkmode      = xtal1 + pll16x
  _xinfreq      = 5_000_000

  N_TRIALS      = 1000
  UP_DOWN       = 0
  BIAS          = 8

OBJ

  sio   :       "Parallax Serial Terminal"
  rnd   :       "RealRandom"

VAR

  long sum, seed
  word occupied
  
PUB  start | cog, i, n

  sio.start(9600)
  cog := rnd.start - 1
  seed := rnd.random
  cogstop(cog)
  repeat i from 1 to N_TRIALS
    n~
    occupied~
    repeat
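      if (?seed & 15 => BIAS)
        if (UP_DOWN)                    ' nonzero: allocate double cogs from the top down
          if (get_double_down)
            n++
          else
            quit
        else                            ' zero: allocate double cogs from the bottom up
          if (get_double_up)
            n++
          else
            quit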
      else
        if (get_single)
          n++
        else
          quit
    sum += n
  sio.str(string("Average successes * 100: "))
  sio.dec(sum * 100 / N_TRIALS)
      
PUB get_double_up | mask

  mask := 3
  repeat 8
    ifnot (occupied & mask)
      occupied |= mask
      return true
    mask <<= 2
  return false
  
PUB get_double_down | mask

  mask := $c000
  repeat 8
    ifnot (occupied & mask)
      occupied |= mask
      return true
    mask >>= 2
  return false
  
PUB get_single | mask

  mask := 1
  repeat 16
    ifnot (occupied & mask)
      occupied |= mask
      return true
    mask <<= 1
  return false
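
For readers without a Propeller to hand, here is a rough Python port of the same experiment. It is a sketch, not Phil's code: the function names, the fixed seed, and the use of Python's `random` module in place of RealRandom are all assumptions. Each trial draws its request sequence up front so that both allocation schemes see exactly the same requests.

```python
import random

NUM_COGS = 16

def get_double(occupied, top_down):
    """Claim a free even/odd pair, searching from the top or from the bottom."""
    masks = [0b11 << (2 * pair) for pair in range(NUM_COGS // 2)]
    if top_down:
        masks.reverse()
    for mask in masks:
        if (occupied & mask) == 0:
            return occupied | mask
    return None

def get_single(occupied):
    """Claim the lowest free cog."""
    for bit in range(NUM_COGS):
        if (occupied & (1 << bit)) == 0:
            return occupied | (1 << bit)
    return None

def run_trial(requests, top_down):
    """Count successful requests before the first one that cannot be filled."""
    occupied = 0
    count = 0
    for want_double in requests:
        new = get_double(occupied, top_down) if want_double else get_single(occupied)
        if new is None:
            break
        occupied = new
        count += 1
    return count

def average_successes(n_trials=1000, bias=8, top_down=False, seed=1):
    rng = random.Random(seed)  # fixed seed so both schemes can be compared
    total = 0
    for _ in range(n_trials):
        # 17 requests are always enough: 16 cogs cannot fill more than 16
        requests = [rng.randrange(16) >= bias for _ in range(NUM_COGS + 1)]
        total += run_trial(requests, top_down)
    return total / n_trials
```

Note that nothing is ever released in this model, just as in the Spin program, so it says nothing about how the two schemes behave once cogs are stopped and the free set becomes fragmented.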

Anyway, requiring coginit simply will not do. There has to be a way to do it with some sort of hybrid cognew.

-Phil

kwinn · 2016-05-18 20:57

jmg wrote: »

Cluso99 wrote: »

...
However, I will be happy with any direct LUT sharing between adjacent cogs.

Is it too difficult to put back in what you did originally a couple of weeks ago???

The earlier design allowed 3 or 4 or more COGs to closely co-operate, which had real appeal.

However, it does add routing to run the LUT Adr.Data to both sides of a COG, so I can see this is smallest die impact, if also less flexible.

No reason more than 2 cogs could not be made to communicate "upstream" as long as every cog can write to the next higher cog's LUT. Cog(n) can process some and send some data to cog(n+1), which can process some and pass some on to cog(n+2), etc. etc.. Daisy chain as many cogs as needed.
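
The daisy chain can be sketched as a toy Python model in which each cog may only write into the LUT of the next higher cog. The `Cog` class, `run_chain`, and the use of a single LUT slot as a mailbox are assumptions for illustration; signalling between cogs is left out entirely.

```python
LUT_SIZE = 512

class Cog:
    """Toy model: a cog owns one LUT and may write only to its successor's LUT."""

    def __init__(self, work):
        self.lut = [0] * LUT_SIZE
        self.work = work          # the processing this stage applies
        self.next = None          # the next higher cog, if any

    def step(self, slot):
        value = self.work(self.lut[slot])
        if self.next is not None:
            self.next.lut[slot] = value   # stand-in for "write next cog's LUT"
        return value

def run_chain(stages, sample, slot=0):
    """Push one sample through cog(n), cog(n+1), ... and return the final value."""
    cogs = [Cog(stage) for stage in stages]
    for upstream, downstream in zip(cogs, cogs[1:]):
        upstream.next = downstream
    cogs[0].lut[slot] = sample
    result = None
    for cog in cogs:              # real cogs would run concurrently
        result = cog.step(slot)
    return result
```

For example, `run_chain([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3], 5)` passes 5 through three stages in order.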

evanh · 2016-05-18 20:58

jmg wrote: »

? - Who was throwing away COGNEW ?

That was the required outcome. Just adding the atomic allocation helps bring COGNEW back again. Ie: They're now at least compatible again.

Dual allocations is still effectively static mapping without a robust API but that's not a major for the expected usage.

cgracey · 2016-05-18 21:05

Rayman wrote: »

This is a bit complex at first read...
So, if cog A wants to send message to cog B and get reply:

Cog A writes to both LUTs at same time
Cog B is notified somehow and reads his LUT to get message
Then
Cog B writes response to both LUTs at same time
Cog A is notified and reads response in his LUT.

Ok, I think I have it...

It's very simple. The two cogs' LUTs become more the same, with each new write from either cog.
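
A small Python model may make the "more the same" behaviour concrete. It assumes only what is described above: every write from either cog of a pair lands in both LUTs, and each cog reads its own copy. The class name `PairedLuts` and the mailbox addresses are invented for the example, and the notification step in Rayman's sequence is not modelled.

```python
LUT_SIZE = 512

class PairedLuts:
    """Two LUTs whose writes are mirrored; reads stay local to each cog."""

    def __init__(self, even_contents=None, odd_contents=None):
        self.luts = [
            list(even_contents) if even_contents is not None else [0] * LUT_SIZE,
            list(odd_contents) if odd_contents is not None else [0] * LUT_SIZE,
        ]

    def write(self, cog, addr, value):
        # 'cog' only identifies the writer; the data goes to both LUTs
        for lut in self.luts:
            lut[addr] = value

    def read(self, cog, addr):
        return self.luts[cog & 1][addr]

# The two LUTs start out different and converge only where writes occur
pair = PairedLuts(even_contents=[1] * LUT_SIZE, odd_contents=[2] * LUT_SIZE)
pair.write(0, 10, 0xCAFE)      # cog A posts a message
request = pair.read(1, 10)     # cog B reads it from its own LUT
pair.write(1, 11, 0xBEEF)      # cog B posts the response
reply = pair.read(0, 11)       # cog A reads it from its own LUT
```

Locations that neither cog has written since pairing still differ, which is the cost Cluso99 points out: the pair effectively shares one 512-long LUT.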

evanh · 2016-05-18 21:06

Phil Pilgrim (PhiPi) wrote: »

Anyway, requiring coginit simply will not do.

I'd say Chip has resigned himself to allowing simplistic static dual allocations while still providing the "any order" feature so as to prevent potential init breakages of typical OBEX code. True dynamic allocation is not the objective.

EDIT: Yes, it is getting messy requiring COGINITs and top down static mapping just to make some form of compatibility.

Phil Pilgrim (PhiPi) · 2016-05-18 21:10

On second thought, I may have missed the point of the bottom-up/top-down scheme. My impression was that cognew was used for both single and dual allocation. But now it appears that coginit might have been implied for the top-down part, just so it didn't collide with the single allocations coming up from the bottom. I don't like that. For one, if the bottom up allocations become full, coginit could easily clobber a running cog without knowing it.

Better would be an atomic cognew2 that allocates even/odd pairs simultaneously and loads the same code into both. The odd-numbered cog would then, via cogid, identify and coginit itself with the correct code. The even-numbered cog, after verifying such with cogid, would just continue running the code that was loaded.
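
As a rough Python illustration of that idea (names and return values are hypothetical, and real hardware would do the claim in a single atomic step rather than a loop):

```python
NUM_COGS = 16

def cognew2(occupied):
    """Claim the lowest free even/odd pair in one step.

    Returns (even_cog, new_mask), or None when no aligned pair is free.
    """
    for even_cog in range(0, NUM_COGS, 2):
        pair = 0b11 << even_cog
        if (occupied & pair) == 0:
            return even_cog, occupied | pair
    return None

def boot_role(cog_id):
    """Both cogs start with the same image; the parity of cogid picks the role."""
    if cog_id % 2 == 0:
        return "keep running the loaded code"
    return "coginit self with the partner's code"
```

Because both halves of the pair are claimed together, a later single-cog request can never land on a cog whose partner is already entangled.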

-Phil

Phil Pilgrim (PhiPi) · 2016-05-18 21:17

evanh wrote:

... that's not a major for the expected usage.

"Expected usage" hardly ever matches ultimate usage. Consider the P1's counters. Who could have anticipated ahead of time all the different things they've been used for?

Of course, this whole thread is just a theoretical distraction and is getting in the way of actually getting the P2 finished.

-Phil

evanh · 2016-05-18 21:26

True, expected usage then becomes limited usage.

evanh · 2016-05-18 21:29

Phil Pilgrim (PhiPi) wrote: »

Of course, this whole thread is just a theoretical distraction and is getting in the way of actually getting the P2 finished.

The unexpected fallout of increased die space.

User Name · 2016-05-18 21:42

I'm okay with almost any LUT sharing scheme, but I still prefer the first and the simplest approach. There is no more bang for the buck in the whole P2.

It's a specific tool for specific situations. It's not the de facto means of inter-cog communications. But its existence adds a layer of possibilities that makes the chip much more desirable to a subset of potential users that have certain odd requirements and are happy to work within the constraints that LUT-sharing imposes.

Meanwhile, go ahead and forbid the posting of LUT-sharing code to the OBEX. How is the OBEX any worse off if we forbid such code vs not have the feature in the P2 at all?

Phil Pilgrim (PhiPi) · 2016-05-18 21:51

User Name wrote:

But its existence adds a layer of possibilities that makes the chip much more desirable to a subset of potential users that have certain odd requirements and are happy to work within the constraints that LUT-sharing imposes.

Parallax might have to host exams and issue licenses for LUT-sharing privileges, just to make sure the feature is not abused.

-Phil

evanh · 2016-05-18 21:51

User Name wrote: »

But its existence adds a layer of possibilities that makes the chip much more desirable to a subset of potential users that have certain odd requirements and are happy to work within the constraints that LUT-sharing imposes.

Perceived requirements I suspect. Cluso has proven this. He keeps saying the direct link is important when really all he wanted was to avoid instruction stalls. Instruction counting is often important.
