Is LUT sharing between adjacent cogs very important?

jmg · 2016-05-17 20:14

Rayman wrote: »

Still thinking of other things to do, in case it can't compile...
Maybe can tie second LUT port to the fast smartpin transfer?

I would like to see the DAC_Data path tuned for speed - that seemed to be low hanging fruit, that was ignored in the rush to LUT-Any.

Seairth also makes a good case for a better COG-COG ATN signaling here
http://forums.parallax.com/discussion/164286/making-the-case-for-expanded-atn

Rayman · 2016-05-17 20:15

I think we can live without this new LUT interface...

If the broadcast cog-cog event signaling is added and the smartpin cog-cog transfer is there, I think that makes up for any latency in small data packet transmissions using HUB.

For large data packets, the egg-beater is perfect.

cgracey · 2016-05-17 20:20

jmg wrote: »

cgracey wrote: »

The problem is that we can't route the inter-cog LUT sharing in a way that allows cogs to be independent of one another.

I'm not seeing that as a problem that cannot be managed ?

Seems one thing this Any-LUT exercise has shown, is everyone is actually fine with some form of management by users, no one expects using the P2 to be thinking-free.

Yes, but there is a problem when cogs must be allocated in some order. It breaks a simple paradigm and forces application writers to become mindful of which cogs are being used. It adds a layer of complexity that we will never get away from. I don't want to do that.

jmg · 2016-05-17 20:22

cgracey wrote: »

There's no cheap way to do LUT sharing outside of the adjacent-cog topology, which is too problematic on the software side, with acknowledgment to all those who say it isn't.

Compared to the software-side caveats that LUT_Any brought along, I'd say simple LUT sharing is a doddle!!

Maybe DAC_Data can be tuned to improve the speed ?
It was already faster in moving Bytes than LUT_Any was, in safe mode.

Also, a simply smarter COGNEW N has to be far smaller than LUT_Any
- COGNEW is something that hands out COGID tokens, and is used by all COGS, so this logic occurs once only & is local in nature. ( ie similar to the shared Cordic)
Size of this is thus far less critical.

MJB · 2016-05-17 20:26

Heater. wrote: »

MJB,

As it stands a WRITE can write to many other COGs LUTs at the same time. Specified by the LUT mask.

Doing a READ would end up with all those LUTs being returned in some mangled fashion.

sure - usually I would only read from one single other LUT.
but there might be a use for 'multiple read' ;-)
like there might be a use for the multiple overlapping write.
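
A minimal sketch of the masked write and the 'multiple read' being discussed, in Python purely for illustration. The names `write_luts` and `read_luts_wired_or` are hypothetical, and modelling the "mangled" multi-read as a bitwise OR of the selected LUTs is an assumption; the thread does not say how the hardware would actually combine them.

```python
NUM_COGS = 16
LUT_LONGS = 512  # 9 address bits per LUT

def write_luts(luts, mask, addr, value):
    """Write one 32-bit value into every LUT whose bit is set in mask."""
    for cog, lut in enumerate(luts):
        if (mask >> cog) & 1:
            lut[addr] = value & 0xFFFFFFFF

def read_luts_wired_or(luts, mask, addr):
    """Hypothetical multi-read: OR together the selected LUT entries.

    With a single bit set in mask this is an ordinary read. With several
    bits set, the result mixes all selected LUTs (assumed OR behaviour).
    """
    result = 0
    for cog, lut in enumerate(luts):
        if (mask >> cog) & 1:
            result |= lut[addr]
    return result

luts = [[0] * LUT_LONGS for _ in range(NUM_COGS)]
write_luts(luts, 0b0110, 5, 0xAB)  # one write lands in cogs 1 and 2
```

Under that assumption, a read with exactly one mask bit set is well defined, while a read with several bits set only makes sense if the software arranges for the selected LUTs to hold non-overlapping bit fields.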

Cluso99 · 2016-05-17 20:30

cgracey wrote: »

jmg wrote: »

cgracey wrote: »

The problem is that we can't route the inter-cog LUT sharing in a way that allows cogs to be independent of one another.

I'm not seeing that as a problem that cannot be managed ?

Seems one thing this Any-LUT exercise has shown, is everyone is actually fine with some form of management by users, no one expects using the P2 to be thinking-free.

Yes, but there is a problem when cogs must be allocated in some order. It breaks a simple paradigm and forces application writers to become mindful of which cogs are being used. It adds a layer of complexity that we will never get away from. I don't want to do that.

It only forces those cooperating cogs to know which cogs are which. I disagree with you that there is any problem with allocating them, either at startup, or later. And if the user cannot live with that, he certainly doesn't have to use it. It is not like we are forcing anyone to use it!

What about leaving the default disabled, and permit those who require the performance boost to enable the option? Without the boost there will be some things we just cannot do with the P2. For simple silicon, seems a shame to leave it out just because some fear cog allocation.

Please let me do a demo utilising it ???
(Leave it out for this release so at least we can start testing)

jmg · 2016-05-17 20:47

Cluso99 wrote: »

It only forces those cooperating cogs to know which cogs are which. I disagree with you that there is any problem with allocating them, either at startup, or later. And if the user cannot live with that, he certainly doesn't have to use it. It is not like we are forcing anyone to use it!

The only real, technical issue around this I recall, was around simple atomic assign, and that can be solved with a smarter COGNEW N.
Anything else, is easily managed.

jmg · 2016-05-17 20:51

cgracey wrote: »

Yes, but there is a problem when cogs must be allocated in some order. It breaks a simple paradigm and forces application writers to become mindful of which cogs are being used. It adds a layer of complexity that we will never get away from. I don't want to do that.

I'm not quite seeing this ?
COG allocate currently is single units, you simply allow user choice of how many COGS that COGNEW allocates.
I'm not seeing a lot of Logic there, in a single block.
That is then no more complex to manage than COGNEW was.
Users still do have to be able to count, to make sure they do not COGNEW 17 COGS

cgracey · 2016-05-17 21:03

Cluso99 wrote: »

cgracey wrote: »

jmg wrote: »

cgracey wrote: »

The problem is that we can't route the inter-cog LUT sharing in a way that allows cogs to be independent of one another.

I'm not seeing that as a problem that cannot be managed ?

Seems one thing this Any-LUT exercise has shown, is everyone is actually fine with some form of management by users, no one expects using the P2 to be thinking-free.

Yes, but there is a problem when cogs must be allocated in some order. It breaks a simple paradigm and forces application writers to become mindful of which cogs are being used. It adds a layer of complexity that we will never get away from. I don't want to do that.

It only forces those cooperating cogs to know which cogs are which. I disagree with you that there is any problem with allocating them, either at startup, or later. And if the user cannot live with that, he certainly doesn't have to use it. It is not like we are forcing anyone to use it!

What about leaving the default disabled, and permit those who require the performance boost to enable the option? Without the boost there will be some things we just cannot do with the P2. For simple silicon, seems a shame to leave it out just because some fear cog allocation.

Please let me do a demo utilising it ???
(Leave it out for this release so at least we can start testing)

I understand what you are saying, but the problem is that everyone is going to be incorporating objects written by others and some of those objects WILL use this feature, and that's what will break things. The only sure way around this is application-level cog planning, which is a horrible thing to force on a newbie. That throws fun and spontaneity right out the window.

jmg · 2016-05-17 21:05

Rayman wrote: »

If the broadcast cog-cog event signaling is added and the smartpin cog-cog transfer is there, I think that makes up for any latency in small data packet transmissions using HUB.

Yes, those two are quite new, and certainly improve cog-cog.

* The event signaling just needs a tweak (see other thread) to make who signaled visible.

* The smart pin links need end-to-end speed numbers, and it may be possible to overlap some of the TX/RX to gain some speed ?

cgracey · 2016-05-17 21:07

Meanwhile, this new compile seems to be progressing. We are at the 2 hour mark and it's moved to 80% on the fitter. Let's see what happens...

Seairth · 2016-05-17 21:16

cgracey wrote: »

If this goes badly, this whole LUT-write sharing thing is coming out and we are moving forward without any kind of inter-cog LUT access.

Hah! You sure know how to stir the pot! You should have just quietly made the change and not told us anything until you either succeeded or ripped it back out! :P

Seairth · 2016-05-17 21:21

jmg wrote: »

Rayman wrote: »

If the broadcast cog-cog event signaling is added and the smartpin cog-cog transfer is there, I think that makes up for any latency in small data packet transmissions using HUB.

Yes, those two are quite new, and certainly improve cog-cog.

* The event signaling just needs a tweak (see other thread) to make who signaled visible.

* The smart pin links need end-to-end speed numbers, and it may be possible to overlap some of the TX/RX to gain some speed ?

If we dropped the shared LUT thing, I wonder if there would be enough room to double the bus width to the smart pins. That would speed up the special link mode AND all other smart pin interaction.

@cgracey, what is the current bus width to the smart pins? I have another thought (which I'll have to wait and share in a few hours).

jmg · 2016-05-17 21:27

Seairth wrote: »

If we dropped the shared LUT thing, I wonder if there would be enough room to double the bus width to the smart pins. That would speed up the special link mode AND all other smart pin interaction.

I did wonder that too...

Seairth wrote: »

... what is the current bus width to the smart pins?...

I think it is nibble based, with a length header, so you get the 4,6,10 base clocks for 8,16,32b (+ extras for read IIRC?)

Maybe this could use DDR, to keep the nibble width, and double the speed ?
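
A rough sketch of the arithmetic behind those figures, in Python for illustration. The fixed 2-clock overhead is inferred only from the quoted 4, 6, 10 clock counts and is not a confirmed detail of the smart pin bus; the DDR case (two nibbles per clock, overhead unchanged) is likewise a hypothetical model, and `link_clocks` is an illustrative name.

```python
def link_clocks(bits, overhead=2, nibbles_per_clock=1):
    """Estimate base clocks to move `bits` over a nibble-wide link.

    overhead: assumed fixed header cost in clocks (inferred, not confirmed).
    nibbles_per_clock: 1 for the current bus, 2 for the hypothetical DDR case.
    """
    nibbles = bits // 4
    payload_clocks = -(-nibbles // nibbles_per_clock)  # ceiling division
    return overhead + payload_clocks

for width in (8, 16, 32):
    print(width, link_clocks(width), link_clocks(width, nibbles_per_clock=2))
```

With these assumptions the single-rate column reproduces the 4, 6, 10 clocks mentioned above. Because the header cost is assumed not to shrink, DDR gives less than a full 2x gain on short transfers.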

cgracey · 2016-05-17 21:46

The compile finished at 2.5 hours and I see that the inter-cog LUT writing has a lot of nets near the critical path. Even the ATN mechanism is in there. The ATN can be sped up, though, by a few flops. The LUT thing is a wet blanket, though. I'm going to make a git backup and rip it out. That it takes an extra hour to compile the LUT stuff with only 89% ALMs used means this thing is a dog.

potatohead · 2016-05-17 22:01

Good call. I'm thinking the synthesis is already a challenge. No need to make that worse.

Tubular · 2016-05-17 23:01

A byte version of this would still be useful. You'd go from 9 address / 32 data to 11 address / 8 data, 320 rather than 672 interconnects.

But presumably it'd still lower Fmax in an unacceptable way
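
The interconnect totals can be reproduced with a short calculation, sketched here in Python. The one extra control line per cog (for example a write strobe) is an assumption; it is included because it makes 9 address + 32 data come to 42 lines per cog, which matches the 672 and 320 totals for 16 cogs. `interconnects` is an illustrative name.

```python
def interconnects(addr_bits, data_bits, cogs=16, control_bits=1):
    """Total wires for a per-cog LUT write bus.

    control_bits=1 is an assumed write strobe per cog, not a confirmed detail.
    """
    return cogs * (addr_bits + data_bits + control_bits)

long_wide = interconnects(addr_bits=9, data_bits=32)   # 512 longs per LUT
byte_wide = interconnects(addr_bits=11, data_bits=8)   # 2048 bytes per LUT
print(long_wide, byte_wide)  # 672 320
```

The byte-wide version needs two more address bits because the same 512-long LUT is addressed as 2048 bytes.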

jmg · 2016-05-17 23:19

cgracey wrote: »

.... but the problem is that everyone is going to be incorporating objects written by others and some of those objects WILL use this feature, and that's what will break things. The only sure way around this is application-level cog planning, which is a horrible thing to force on a newbie. That throws fun and spontaneity right out the window.

Not quite - The newbie appeal is using COGNEW, right ?

Looking at a smarter COGNEW, I think it can be done in just 16+16+4 registers and some state logic & zero tests.
That returns any number of adjacent COGS as a bitset mask, given a request for N. 0 if failed to find N adjacent.
N=1 for common use, but 2..16 are legal.
COGS can be removed, and a new COGNEW 1, finds first unused.

That seems like a trivial amount of logic, just one copy needed ?
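
A minimal software model of the COGNEW N idea, in Python for illustration only. It is a sketch of the allocation rule as described (first run of N adjacent free cogs, returned as a bit mask, 0 on failure), not a description of any actual hardware. `cognew_n` and `busy_mask` are hypothetical names, and treating cog 15 and cog 0 as non-adjacent (no wrap-around) is an assumption.

```python
NUM_COGS = 16

def cognew_n(busy_mask, n=1):
    """Return a bit mask of n adjacent free cogs, or 0 if none exist.

    busy_mask has bit k set when cog k is already running.
    The lowest-numbered suitable run is chosen, so n=1 finds the
    first unused cog, as with a plain COGNEW.
    """
    if not 1 <= n <= NUM_COGS:
        return 0
    run = (1 << n) - 1
    for base in range(NUM_COGS - n + 1):
        candidate = run << base
        if (busy_mask & candidate) == 0:
            return candidate
    return 0

busy = 0b0101                # cogs 0 and 2 are running
pair = cognew_n(busy, 2)     # first free adjacent pair
busy |= pair                 # caller marks the returned cogs as in use
```

Stopping a cog clears its bit in `busy`, after which a later `cognew_n(busy, 1)` picks up the lowest free cog again.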

Rayman · 2016-05-17 23:39

Hey, I was Googling and found this book describing core-core coms between adjacent cores:

https://books.google.com/books?id=kdJQgUNLNb0C&pg=PA21&lpg=PA21&dq=intercore+connections+multicore&source=bl&ots=XJaKawwJyh&sig=kv6wWUxlsNTXVUcZDqcpWflQFz0&hl=en&sa=X&ved=0ahUKEwip_5y6pOLMAhXkxYMKHaR-AM0Q6AEIVzAG#v=onepage&q=intercore connections multicore&f=false

I guess we're not the first to think about it...

cgracey · 2016-05-17 23:56

The compile without inter-cog LUT writing completed in only 1.5 hours, instead of 2.5. It wound up taking 86% of ALMs, instead of 89%, which isn't that much of a difference. Fmax came in at 74MHz, instead of 70MHz, and ATN is way out of the critical path.

cgracey · 2016-05-18 00:11

jmg wrote: »

cgracey wrote: »

.... but the problem is that everyone is going to be incorporating objects written by others and some of those objects WILL use this feature, and that's what will break things. The only sure way around this is application-level cog planning, which is a horrible thing to force on a newbie. That throws fun and spontaneity right out the window.

Not quite - The newbie appeal is using COGNEW, right ?

Looking at a smarter COGNEW, I think it can be done in just 16+16+4 registers and some state logic & zero tests.
That returns any number of adjacent COGS as a bitset mask, given a request for N. 0 if failed to find N adjacent.
N=1 for common use, but 2..16 are legal.
COGS can be removed, and a new COGNEW 1, finds first unused.

That seems like a trivial amount of logic, just one copy needed ?

It could return the base cog id, only, and that would work.

My gut feeling is still that this just isn't a good idea - this adjacent LUT sharing. It would introduce a wrinkle for maybe only marginal gain. I wish I felt better about it. My life experience has been that my gut always knows, even if my head can't figure it out.

rjo__ · 2016-05-18 01:23

Chip,

I'm glad you dumped cog sharing, it was giving me a headache.

I have a good gut; it is my brain that sometimes doesn't cooperate.

A Cog can only rdfast or wrfast, it can't do both simultaneously. And in each case, the rd/wrfast can have only 1 base address.
I think it was Cluso that mentioned dropping a simple glob between cogs to allow each cog to transmit to the one next to it. This would allow one cog to rdfast while the next does wrfast and that second cog can give data to the next cog when it belongs at a different address.

It greatly simplifies parallel processing and it is highly deterministic.

We have coginit... the whole idea of coginit is to allow the user to pick which cog he/she is using. We do it all the time for other reasons... why not for this?

There is really nothing special about selecting adjacent cogs... it won't confuse anyone. I agree that objects written for others will have to be well written if they use this mechanism and might have to be fairly clever in making sure that this issue is handled well... but that is precisely why standardization of the OBEX was instituted.

This isn't a minor improvement... it can multiply the bandwidth:)

OK... I feel better now:)

jmg · 2016-05-18 01:53

rjo__ wrote: »

...
We have coginit... the whole idea of coginit is to allow the user to pick which cog he/she is using. We do it all the time for other reasons... why not for this?

There is really nothing special about selecting adjacent cogs... it won't confuse anyone. I agree that objects written for others will have to be well written if they use this mechanism and might have to be fairly clever in making sure that this issue is handled well... but that is precisely why standardization of the OBEX was instituted.

This isn't a minor improvement... it can multiply the bandwidth:)

I would be happy with absolute COG control (and absolute COG control must always be possible for testing, and it needs to be there, in case one COG fails in the fab process..)
However, I can understand the appeal of COGNEW for Chip, and it does mean simplest systems have less housekeeping.

Hence my suggestion of an expansion to be COGNEW N, which keeps that less-housekeeping philosophy.

Phil Pilgrim (PhiPi) · 2016-05-18 02:38

I have no issues with coginit, as long as the cog number comes from the cognew or cogid of a cog that's already running. But starting a cog from scratch with coginit? Nein! Verboten! Never! There's nothing to think about here, folks. Just don't do it -- ever -- or embrace any architecture that requires it.

-Phil

GreenWorld · 2016-05-18 03:22

I am what you would call a hobbyist and have worked a little with the P1. I am excited about the way the P2 is shaping up because it gives hobbyists like me the opportunity to program some really powerful applications without having to learn to program other much more complicated MCUs.

I actually like the inter-Cog LUT write scheme very much and am a little disappointed it did not compile on the A9. Maybe it was a little too ambitious with a 16 x 42 bus (672 connections) allowing all Cogs to be writing to LUTs at the same time. The HUB eggbeater should be the main communication route for most everyday mundane communications between cogs.

Direct inter-COG communication would only be used in the few cases where low latency is required. There is no need for all 16 Cogs to be writing to 16 LUTs at the same time. Two, three, or four Cog-Cog connections would be enough, I think, to give the P2 that extra edge. So, I'm wondering if the inter-Cog LUT write bus can be made to fit the A9 by reducing it to, say, a 4 x 42 bit bus (168 connections instead of 672), which would allow any 4 Cogs to write to any 4 different LUTs at the same time.

I suppose that some lock mechanism would have to be used to manage this limited resource, but that should not be a problem and it doesn't violate the propeller philosophy.

Would this work? and more importantly, would it compile on the A9?
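
A small sketch of how such a limited pool of write channels might be managed with a lock-style claim and release, in Python for illustration. The class `ChannelPool` and its methods are hypothetical; the thread does not define any such mechanism, and this models only the bookkeeping, not the routing or timing.

```python
class ChannelPool:
    """Hypothetical pool of shared cog-to-LUT write channels."""

    def __init__(self, channels=4, bits_per_channel=42):
        self.owner = [None] * channels      # cog id holding each channel
        self.bits_per_channel = bits_per_channel

    def wires(self):
        """Total connections, e.g. 4 x 42 = 168."""
        return len(self.owner) * self.bits_per_channel

    def acquire(self, cog_id):
        """Claim the first free channel; return its index or None."""
        for ch, current in enumerate(self.owner):
            if current is None:
                self.owner[ch] = cog_id
                return ch
        return None

    def release(self, ch, cog_id):
        """Free a channel, but only if this cog actually holds it."""
        if self.owner[ch] != cog_id:
            return False
        self.owner[ch] = None
        return True

pool = ChannelPool()
claimed = [pool.acquire(cog) for cog in range(5)]  # fifth request fails
```

A cog that gets `None` back would have to fall back to the hub or wait, which is the kind of variable behaviour a shared resource introduces.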

cgracey · 2016-05-18 03:29

GreenWorld wrote: »

I am what you would call a hobbyist and have worked a little with the P1. I am excited about the way the P2 is shaping up because it gives hobbyists like me the opportunity to program some really powerful applications without having to learn to program other much more complicated MCUs.

I actually like the inter-Cog LUT write scheme very much and am a little disappointed it did not compile on the A9. Maybe it was a little too ambitious with a 16 x 42 bus (672 connections) allowing all Cogs to be writing to LUTs at the same time. The HUB eggbeater should be the main communication route for most everyday mundane communications between cogs.

Direct inter-COG communication would only be used in the few cases where low latency is required. There is no need for all 16 Cogs to be writing to 16 LUTs at the same time. Two, three, or four Cog-Cog connections would be enough, I think, to give the P2 that extra edge. So, I'm wondering if the inter-Cog LUT write bus can be made to fit the A9 by reducing it to, say, a 4 x 42 bit bus (168 connections instead of 672), which would allow any 4 Cogs to write to any 4 different LUTs at the same time.

I suppose that some lock mechanism would have to be used to manage this limited resource, but that should not be a problem and it doesn't violate the propeller philosophy.

Would this work? and more importantly, would it compile on the A9?

It could work, but wouldn't it make cogs unequal? Or, would it work by being a discrete resource which any cog(s) could use? That would require more mux'ing, and basically get us most of the way back where we just were, routing-wise. To do any of this without variable latencies, it either has to be a huge do-anything/everything mux, or be hardwired between certain cogs. Neither is nearly perfect.

ozpropdev · 2016-05-18 03:43

Chip
If a cog is already using its streamer and another cog does a WRLUTX, which port has priority?

cgracey · 2016-05-18 03:56

ozpropdev wrote: »

Chip
If a cog is already using its streamer and another cog does a WRLUTX, which port has priority?

The WRLUTX had priority.

evanh · 2016-05-18 04:42

I'd be happy to go back to single ported LUT then add as much HubRAM as possible.

cgracey · 2016-05-18 04:46

evanh wrote: »

I'd be happy to go back to single ported LUT then add as much HubRAM as possible.

Keeping the dual-port LUT is good because it enables the streamer to play LUT samples while the cog simultaneously updates the LUT.
