Is LUT sharing between adjacent cogs very important?

dMajo · 2016-05-16 14:46

Rayman wrote: »

David Betz wrote: »

Rayman wrote: »

I think the only problem is that it won't all fit at once in A9 fpga...

That doesn't seem like a big problem. As long as enough of the design will fit to allow testing then that shouldn't be an issue.

If it were me, I'd want to be able to test the full design. Otherwise, you won't know 100% that it is going to work. And, what if it doesn't work? Then, you are really in trouble...

I think an A9 image with 16 or 32 smart pins is enough for testing. All are multiple instances of the same logic design. At most there can be two paired smartcells, if rx/tx comms or USB functions are split over 2 adjacent smart pins. So 16/32 smart pins give you 8/16 pairs to test with.

dMajo · 2016-05-16 15:10

Now, with some of Chip's clarifications (mainly the isolation of parallel writes by several cogs to several LUTs, unless they all write to the same one), I am starting to like this new LUT function.
I wonder if (perhaps for the next chip, not P2) this can be extended. Why not, for example, have 20 LUTs, of which 4 are missing their cogs? Why can't these, instead of being LUTs, be global peripherals accessed in this way?
E.g. SETLUT17 could set which cog's LUT a hypothetical high-speed USART transfers its received data to, and each cog writing to ID 17 could in this way send data or set up peripheral config registers. The same applies to LUT (peripheral) 18..20. Four IP cores/functions that are otherwise difficult to develop internally could be bought and added to the design using this communication conduit, respecting the Propeller philosophy.

cgracey · 2016-05-16 15:44

It's true that each cog outputs 16W, 9A, and 32D lines. Each LUT's 2nd port takes in one of the 16W for AND'ing and all the rest of the signals.

The WRLUTS waits for the time match and then releases, causing 2..16 cycles to elapse in the cog.

There is no testing problem in having less than all 64 smart pins.
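
To make the 16W / 9A / 32D description concrete, here is a minimal Python sketch of one way the second-port wiring could behave. It assumes each LUT AND-gates every cog's address and data with that cog's W line for this LUT and then ORs the results together; the names `lut_port2`, `clock` and `new_luts` are made up for illustration and are not taken from the P2 design.

```python
# Hypothetical model of the second-port wiring; the gating details are assumed.
NUM_COGS = 16            # one LUT per cog, so 16 W (write-select) lines per cog
ADDR_MASK = 0x1FF        # 9 address lines -> 512 longs per LUT
DATA_MASK = 0xFFFFFFFF   # 32 data lines


def lut_port2(target, drivers):
    """Combine all cogs' outputs at the second port of LUT `target`.

    drivers: list of (w_mask, addr, data) tuples, one per cog writing this clock.
    Returns (write_enable, addr, data) after gating and OR-ing.
    """
    we, addr, data = 0, 0, 0
    for w_mask, a, d in drivers:
        if (w_mask >> target) & 1:   # this LUT's one-of-16 W line from that cog
            we = 1
            addr |= a & ADDR_MASK
            data |= d & DATA_MASK
    return we, addr, data


def clock(luts, drivers):
    """Apply one system clock's worth of cross-cog writes to every LUT."""
    for target in range(NUM_COGS):
        we, addr, data = lut_port2(target, drivers)
        if we:
            luts[target][addr] = data


def new_luts():
    """Return 16 zero-filled LUTs of 512 longs each."""
    return [[0] * (ADDR_MASK + 1) for _ in range(NUM_COGS)]


if __name__ == "__main__":
    # Different targets in the same clock: each write lands where intended.
    luts = new_luts()
    clock(luts, [(1 << 3, 0x005, 0x11), (1 << 7, 0x00A, 0x22)])
    print(hex(luts[3][0x005]), hex(luts[7][0x00A]))

    # Same target in the same clock: address and data are OR'd together.
    luts = new_luts()
    clock(luts, [(1 << 3, 0x005, 0x11), (1 << 3, 0x00A, 0x22)])
    print(hex(luts[3][0x00F]), luts[3][0x005], luts[3][0x00A])
```

Under these assumptions, writes to different LUTs stay isolated, while two writes to the same LUT in the same clock merge into a single write at the OR of both addresses.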

Phil Pilgrim (PhiPi) · 2016-05-16 15:48

Chip,

Please remember that you're Parallax, a large percentage of whose customers are students and neophytes. Whatever you do, you will have to explain it to them. If any of the P2's features -- or their consequences -- are too complicated for even the "experts" to grasp fully, your main customer base will be completely lost. Elegance and simplicity will always trump whatever slight performance edge an added feature might bestow.

Thanks,
-Phil (the ascetic)

cgracey · 2016-05-16 15:58

Phil Pilgrim (PhiPi) wrote: »

Chip,

Please remember that you're Parallax, a large percentage of whose customers are students and neophytes. Whatever you do, you will have to explain it to them. If any of the P2's features -- or their consequences -- are too complicated for even the "experts" to grasp fully, your main customer base will be completely lost. Elegance and simplicity will always trump whatever slight performance edge an added feature might bestow.

Thanks,
-Phil (the ascetic)

This LUT writing is simple, though it was made to seem scary by some who preferred the original recipe. They'll soon get used to this new extra crispy GMO formulation and be happy as clams. It will change their minds.

potatohead · 2016-05-16 16:11

LMAO. That will do it, Chip. We just put it all in the food. Everyone comes to see it our way, no worries!

@Phil. So far, it's not so far from P1 in terms of getting going and doing stuff. I'm often away from this for long enough periods of time to experience "the easy" and the biggest difference is depth.

Think of it like the P1 counters and WAITVID. Many users never did much with those early on, instead using objects others created. Over time, as common knowledge improved, more people went digging to hit the extents.

This one is the same way, only big PASM programs are possible now. Ideally, we see Eric Smith's inline PASM in P2 SPIN.

With that, we can share snippets, and grokking this stuff will be easier than we think. SmartPins will shine bright here.

The really crazy things we have in the thing can be ignored for a good long time. Having so many COGS will give people the choice of optimizing the Smile out of something to utilize one COG, or in many scenarios, they can brute force it with several COGS.

I've already done a little of that, thinking maybe I get stuff done first, then improve it. Works out pretty well, IMHO.

potatohead · 2016-05-16 16:17

One thing I like so far is the idea of public and private mailboxes. An object can have its own internal comms using high LUT addresses, and reserve HUB based ones for only the few bits that matter.

Rayman · 2016-05-16 16:25

Well, I was hoping that running out of logic on A9 would put a stop to adding new features...
Guess not.

Hopefully, this is the last thing to be added.

Roy Eltham · 2016-05-16 16:29

Phil,
As potatohead says, there are aspects of the P1 that took time and detailed application notes for people to understand them and use them. However, the time to blinking LED is tiny, and the same will be true of the P2.

Chip,
"extra crispy GMO formulation" MmmmMmmm good!

mindrobots · 2016-05-16 16:35

After riding through the process of Chip announcing "the design is finished", I now understand why you should never watch sausage being made and it's best to stay out of the kitchen at your favorite restaurant!

I'm glad the design is done again!

User Name · 2016-05-16 17:07

Electrodude wrote: »

LUT sharing was replaced by a mechanism by which any cog can write...

Thank you! Looks good to me. I'm putting a copy of your excellent explanation in my notes.

potatohead · 2016-05-16 17:17

It's super close. Bet you this is the last feature.

Tweaks and bug fixes from here.

Cluso99 · 2016-05-16 19:05

With all these buses for hub and LUT, and dual port LUT, I wonder if the whole hub and LUT could be one block with some of it being dual port (LUT). ie I wonder if there might be some simplification possible.

cgracey · 2016-05-16 19:22

Cluso99 wrote: »

With all these buses for hub and LUT, and dual port LUT, I wonder if the whole hub and LUT could be one block with some of it being dual port (LUT). ie I wonder if there might be some simplification possible.

I don't know. That would require some careful thinking. For there to be a fast way around the eggbeater, I think the additional buses would have to be there. It seems that what we have now will be good: Big hub for streaming and cross-cog LUT access for fast messaging and data between cogs. I think the event(s) around LUT access might need a little more thought.

potatohead · 2016-05-16 19:57

Personally, I feel the idle buses are a good thing.

In the hot chip, maxing it all out was a focus, and we hit 5 watts!

Done this way, people have great options, but the overall efficiency remains good. And there is a lot of potential to be exploited over time too.

Since we do have a few paths, we also avoid gaps. Right now, if there is a high demand use case, we probably have it covered. To me, that is more important than the thought of percent idle silicon is.

Besides, given how COGS work, there is always idle silicon in a majority of uses. It would be nice to trim that up a bit, but then again, with P1 we saw all of it more and more utilized as knowledge got shared, and more powerful code did too.

I'm not clear on the LUT write events.

jmg · 2016-05-16 20:42

dMajo wrote: »

... IMHO writing and corrupting unplanned addresses is much worse, especially because it can potentially mean changing another cog's code, which can result, as an extreme, in death (depending on how and where the P2 will be employed).

That unplanned address corruption aspect of failure, is perhaps the hardest thing to try to debug.
It also gives a great attack mechanism, that will scare big potential customers.

evanh wrote: »

The supposed problem was resolved long ago...

Well, nice spin, but the problem is only resolved if it is eliminated.

The new, much slower, write is only truly safe if everyone uses it.
No COG can be allowed to use the Fast write, or you may have a corruption path.
That rather self-defeats the new feature.

You now need to very carefully check any code with Fast write, to verify WHICH COGS it can use.
This may mean compile time COG allocation, in order to mask-off any risk.

This is where the change to a READ Any-LUT model, suggested above, may be worth exploring.

Uses the same MUX design, same opcodes flipped to read, but now, any collision does not corrupt not-addressed memory.

You also gain a very useful, non intrusive, Debug & diagnostic Read-Any-LUT pathway.

jmg · 2016-05-16 20:51

Rayman wrote: »

Since LUT is executable, this would let one cog directly change the code of another...

Instead of just passing arguments, one could directly change the variables in a running code, right?

Sounds dangerous, but in theory possible, with caveats.
You can only change LUT, so LUT code-changes can update constants, but any variables not located in LUT, are not changed.
The changer, of course, has to be very sure the changee is not running, or about to run, that code area.
Sounds like some very complex tools would be needed to manage this safely.

There are paths now, I think, to write to LUT from HUB, would that be a better system to manage code-swap ?
HUB burst operations are the fastest of all paths, for block updates.

I do like the idea of functions in other COGS, and LUT as parameters.
That can be supported with either of read or write models.

jmg · 2016-05-16 20:58

Rayman wrote: »

If it were me, I'd want to be able to test the full design. Otherwise, you won't know 100% that it is going to work. And, what if it doesn't work? Then, you are really in trouble...

Yes, but elements like Pin Cells simply repeat, in comparative isolation.
However, I would be careful to avoid simple binary slicing, but I think Chip is already doing that.
He plans to keep Top-end PinCells and Bottom-end too, and remove some from the middle. (non binary)
That keeps the decode-sizes to the 6 bits, and ensures tool optimise does not change what you are testing.
Large, easily counted blocks are ok to manually check-off.
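
A small Python sketch of why non-binary slicing keeps the decode at 6 bits. It only checks which address bits take both values across the pins that are kept; the pin ranges are hypothetical examples, not Chip's actual selection, and `exercised_bits` is an illustrative name.

```python
def exercised_bits(pins, width=6):
    """Return the address bits that take both 0 and 1 across the kept pins."""
    bits = set()
    for b in range(width):
        values = {(p >> b) & 1 for p in pins}
        if values == {0, 1}:
            bits.add(b)
    return bits


if __name__ == "__main__":
    # Binary slicing: keep only the bottom half of 64 pins.
    bottom_half = list(range(0, 32))
    # Non-binary slicing: keep the bottom and top ends, drop the middle.
    both_ends = list(range(0, 16)) + list(range(48, 64))

    print(sorted(exercised_bits(bottom_half)))
    print(sorted(exercised_bits(both_ends)))
```

With the bottom half only, bit 5 is never 1, so a synthesis tool would be free to drop it from the decode. Keeping both ends exercises all 6 bits with the same number of pins.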

jmg · 2016-05-16 21:03

Cluso99 wrote: »

With all these buses for hub and LUT, and dual port LUT, I wonder if the whole hub and LUT could be one block with some of it being dual port (LUT). ie I wonder if there might be some simplification possible.

Now there is a new Any-LUT access path, it does open up the possibility of change to the HUB index.
ie a MOD 8 Hub, would be twice as fast, at random access.
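
A rough Python sketch of the arithmetic behind "twice as fast". It assumes the hub rotates through N slices one per clock, that a random access targets a uniformly random slice, and that fixed instruction overhead is ignored; reading "MOD 8" as N = 8 is an interpretation, and `rotation_wait` is an illustrative name.

```python
from fractions import Fraction


def rotation_wait(num_slices):
    """Return (mean, worst) clocks spent waiting for a random slice to come around."""
    waits = range(num_slices)          # 0 .. N-1 clocks, all equally likely
    mean = Fraction(sum(waits), num_slices)
    return mean, max(waits)


if __name__ == "__main__":
    for n in (16, 8):
        mean, worst = rotation_wait(n)
        print(f"{n} slices: mean wait {float(mean)} clocks, worst {worst}")
```

Under these assumptions the mean wait drops from 7.5 clocks to 3.5, a little better than a factor of two, before any fixed overhead is added back in.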

Roy Eltham · 2016-05-16 21:13

jmg,
Flipping the model from write to read doesn't change the addressing aspect (which is still going in the same direction and OR'd together from multiple sources). So instead of writing the wrong location, you read the wrong location when multiple COGs try to read the same LUT in the same sysclk. There's probably other implications with flipping too...

jmg · 2016-05-16 21:21

Roy Eltham wrote: »

jmg,
Flipping the model from write to read doesn't change the addressing aspect (which is still going in the same direction and OR'd together from multiple sources). So instead of writing the wrong location, you read the wrong location when multiple COGs try to read the same LUT in the same sysclk.

I thought I said exactly that ? ( & Read data also OR's ) What does change is the unexpected write-corruption.
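
A minimal Python sketch of the read-variant collision being discussed, assuming the addresses from colliding cogs are OR'd at the LUT port exactly as in the write case. `read_collision` is a made-up helper, and the LUT is pre-filled so that each location holds its own address.

```python
def read_collision(lut, addrs):
    """Model several cogs reading one LUT's second port in the same clock.

    Returns (port_address, data). Every colliding reader sees the same data.
    """
    port_addr = 0
    for a in addrs:
        port_addr |= a & 0x1FF   # 9-bit addresses OR together on the shared port
    return port_addr, lut[port_addr]


if __name__ == "__main__":
    lut = list(range(512))       # location n holds the value n
    before = list(lut)

    addr, data = read_collision(lut, [0x005, 0x00A])
    print(hex(addr), data)
    print("LUT unchanged:", lut == before)
```

Both readers get the contents of a location neither of them asked for, but nothing in the LUT is modified, which is the difference jmg is pointing at.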

Heater. · 2016-05-16 21:52

jmg,

That unplanned address corruption aspect of failure, is perhaps the hardest thing to try to debug.
It also gives a great attack mechanism, that will scare big potential customers.

How so?

As it stands, a lot of code will be run from HUB, or in-COG code will be started from HUB. Any code in any COG can "attack" any other COG, deliberately or by simple bug.

Where is this "attack" going to come from?

If you want an attack proof system you need virtual memory spaces, rings of privilege, sand boxes. You need an Intel or ARM processor not a micro-controller.

evanh · 2016-05-16 21:52

jmg wrote: »

Well, nice spin, but the problem is only resolved if it is eliminated.

Would you like an MMU with that?

Heater. · 2016-05-16 22:05

jmg,

I thought I said exactly that ? ( & Read data also OR's ) What does change is the unexpected write-corruption.

There is no unexpected write corruption.

You write code that uses multiple COGs communicating via LUT. You just have to get it right. Like any other code.

Yes writing parallel shared memory code is hard.

jmg · 2016-05-16 22:32

Heater. wrote: »

There is no unexpected write corruption.

Simply Priceless.

The unexpected aspect is that the address that is corrupted is not one used by either COG.
"You just have to get it right" takes on a whole new meaning, when memory you did not address changed.
That is no longer 'like any other code'.

Heater. · 2016-05-16 22:59

jmg,

As far as I can tell it's exactly the same problem as the usual case of having many writers to a shared memory area in any parallel processing system.

Somehow you have to get your timing right and make transactions atomic and consistent else all hell breaks loose.

Perhaps there is a little difference here in that the shared memory is the entire LUT space, not just some portion of it.

Well so be it.

potatohead · 2016-05-16 23:06

The address concerns are being way overblown.

Time to move on.

jmg · 2016-05-16 23:12

Heater. wrote: »

As far as I can tell it's exactly the same problem as the usual case of having many writers to a shared memory area in any parallel processing system.

Nope, the BIG DIFFERENCE HERE, is that memory you did not address changes.

In your example, you can only ever corrupt the shared memory you target, not somewhere else.

Even worse, the conventional way to manage your "exactly the same problem as the usual case of having many writers to a shared memory area" is to just separate the memory map. Oops, doing that here, will fail.

Clearly, it is a long way from "exactly the same problem", as the handling is quite different, as are the consequences.
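
The point about separating the memory map can be checked with a short Python sketch. It assumes colliding writes OR their 9-bit addresses together, and it gives two hypothetical cogs disjoint halves of one target LUT; the function names and the region split are illustrative only.

```python
def colliding_write_address(addrs):
    """OR together the 9-bit addresses driven onto one LUT port in one clock."""
    combined = 0
    for a in addrs:
        combined |= a & 0x1FF
    return combined


def count_stray_writes(region_a, region_b):
    """Count address pairs whose collision lands where neither cog addressed."""
    stray = 0
    for a in region_a:
        for b in region_b:
            if colliding_write_address([a, b]) not in (a, b):
                stray += 1
    return stray


if __name__ == "__main__":
    lower = range(0x000, 0x100)   # hypothetical region owned by cog A
    upper = range(0x100, 0x200)   # hypothetical region owned by cog B
    total = len(lower) * len(upper)
    stray = count_stray_writes(lower, upper)
    print(f"{stray} of {total} colliding pairs hit an unaddressed location")
```

With this split, roughly 90% of same-clock collisions land on a location that neither cog addressed, so partitioning the address space does not contain the damage on its own.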

Rayman · 2016-05-16 23:18

Maybe we should just think of this as a bonus feature.
Doesn't have to be ideal. Just makes something out of spare capacity that would otherwise feel wasted...

jmg · 2016-05-16 23:21

Rayman wrote: »

Maybe we should just think of this as a bonus feature.
Doesn't have to be ideal. Just makes something out of spare capacity that would otherwise feel wasted...

Strange that no one seems to be considering or analyzing the READ-path variant of this?
