Is LUT sharing between adjacent cogs very important?

evanh · 2016-05-18 04:48

A second FIFO for each Cog could be helpful. I guess it would put an upper speed limit of every second rotation, or something, for alternating them.

JRetSapDoog · 2016-05-18 04:49

Cluso99 wrote: »

Without the latest idea, is there enough die space for another 128KB or 256KB of hub RAM?

cgracey wrote: »

... [W]e may even be able to fit another 128KB or 256KB of RAM. We'll see.

LUT sharing, sit tight; don't call us; we'll call you.

Meanwhile, simple solutions can go a long way. I take it that additional RAM has almost none of these routing and F-max concerns.

I know that some of us have already pushed for the full 1MB, but even a bump from 512KB to 768KB makes many things possible without off-chip memory. Everyone already knows the following, but I'm mentioning it just as a reminder now that we have more die space:

640x480 @ 16 bpp (65536 colors)
640x480 @  8 bpp (256 colors) with an additional frame buffer
640x480 @  4 bpp (16 colors) with 4 separate "frames"

All of these consume 614400 bytes and leave a sizable 172032 bytes for code space.
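As a quick check of the numbers above, here is a small sketch in Python. The constant and function names are only illustrative; the 768KB and 640KB totals are the hypothetical hub sizes being discussed, not confirmed chip specifications.

```python
# Frame-buffer arithmetic for the three 640x480 modes listed above.
KB = 1024
WIDTH, HEIGHT = 640, 480

def frame_bytes(bits_per_pixel):
    """Bytes needed for one 640x480 frame at the given colour depth."""
    return WIDTH * HEIGHT * bits_per_pixel // 8

# (bits per pixel, number of frames) for each mode in the list
modes = [(16, 1), (8, 2), (4, 4)]
totals = [frame_bytes(bpp) * frames for bpp, frames in modes]

# Hypothetical hub sizes: 512KB plus a 256KB or a 128KB bump
left_768 = 768 * KB - totals[0]
left_640 = 640 * KB - totals[0]

for (bpp, frames), total in zip(modes, totals):
    print(f"{frames} frame(s) @ {bpp:2d} bpp: {total} bytes")
print(f"free with 768KB hub: {left_768} bytes")
print(f"free with 640KB hub: {left_640} bytes")
```

All three modes come to the same 614400 bytes, which is why they can be treated as interchangeable uses of one block of hub RAM.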

I feel that the P2 should focus on smaller screens (not huge panels). So, that's why I've used standard VGA resolution. These days, 800x480 WVGA screens are quite affordable. And the driver boards for them nicely stretch VGA to WVGA (saving a little memory).

For me, I prefer not adding any additional logic for LUT tricks and so on if it removes the small (but still present) possibility of going to 768KB. Sure, 512KB is a nice number sitting at the half-way point to 1MB and it's a clean power-of-two. But a bump of 256KB just puts the graphics functionality over the top. For example, an industrial control panel could instantly flip between four 16-color VGA screens, and that might just be enough to handle the user interface and nudge a designer into picking the P2 over an ARM, especially with all its pin smarts (the pinheads) and separate cores (cogs).

I wonder what the yield reduction for such an increase would actually be. I'd venture a non-educated guess of < 2%, otherwise we'd be way more concerned with adding all this logic and so on. Of course, even 2% is a concern, but it might be worth it for the increased capability the memory could bring (and reduced system complexity that could result from not needing to go off chip for RAM). And memory is useful for a lot more than "just" graphics (though that's certainly one of the most "visible" usages).

PS/Full Disclosure: I'm not a designer of industrial products, and I am perhaps selfishly thinking of my own project needs.

PS2: What I said above also applies to a memory bump of 128KB, except that it leaves (a nothing-to-sneeze-at) 40960 bytes free for code.

evanh · 2016-05-18 04:50

Or adding a FIFO facility to flip between read and write without instruction stalling.

evanh · 2016-05-18 07:21

Rayman wrote: »

Hey, I was Googling and found this book describing core-core coms between adjacent cores:

https://books.google.com/books?id=kdJQgUNLNb0C&pg=PA21&lpg=PA21&dq=intercore+connections+multicore&source=bl&ots=XJaKawwJyh&sig=kv6wWUxlsNTXVUcZDqcpWflQFz0&hl=en&sa=X&ved=0ahUKEwip_5y6pOLMAhXkxYMKHaR-AM0Q6AEIVzAG#v=onepage&q=intercore connections multicore&f=false

I guess we're not the first to think about it...

Don't worry, that one is super basic. It don't even have Cluso's intermediate latch. The sender stalls until the receiver picks it up! It's intended for maximising number of cores in a grid at the expense of core utilisation. Not at all what we are after.

On the other hand, the receiver probably also stalls until the sender feeds it something. This will be the way it's intended to be used.

Heater. · 2016-05-18 09:38

I'm always going to vote for more RAM, if possible, over any other extra features.

RAM is just so useful.

Cluso99 · 2016-05-18 09:58

An idea...

Would this give us 1MB HUB RAM ???

Reduce the P2 to 8 Cogs, no LUT sharing (any version). About 4KB of hub is lost to cog and LUT addressing. Make another 4KB hub dual port (with/without egg beater -whatever is easiest).

Make another 8 Cogs that use the 4KB dual port hub as its hub ram. Make the LUT shared between adjacent cogs.
Remove the streamer and DAC.
Remove the egg beater.
Perhaps remove the cordic and complex instructions.
This then becomes a faster I/O processor.
Perhaps the cog ram could be quad port for single clock instructions like the P2hot (without multitasking).

cgracey · 2016-05-18 10:00

Cluso99 wrote: »

An idea...

Would this give us 1MB HUB RAM ???

Reduce the P2 to 8 Cogs, no LUT sharing (any version). About 4KB of hub is lost to cog and LUT addressing. Make another 4KB hub dual port (with/without egg beater -whatever is easiest).

Make another 8 Cogs that use the 4KB dual port hub as its hub ram. Make the LUT shared between adjacent cogs.
Remove the streamer and DAC.
Remove the egg beater.
Perhaps remove the cordic and complex instructions.
This then becomes a faster I/O processor.
Perhaps the cog ram could be quad port for single clock instructions like the P2hot (without multitasking).

That's several months of work. It may be possible, though, to drop 8 cogs and go all the way to 1MB hub RAM. That could be another chip, easily made after this 16-cog version.

Heater. · 2016-05-18 10:45

Great. Let's redesign the entire Prop architecture again. Just as we were getting so close...

dMajo · 2016-05-18 11:05

JRetSapDoog wrote: »

Cluso99 wrote: »

Without the latest idea, is there enough die space for another 128KB or 256KB of hub RAM?

cgracey wrote: »

... [W]e may even be able to fit another 128KB or 256KB of RAM. We'll see.

....
For me, I prefer not adding any additional logic for LUT tricks and so on if it removes the small (but still present) possibility of going to 768KB. Sure, 512KB is a nice number sitting at the half-way point to 1MB and it's a clean power-of-two. But a bump of 256KB just puts the graphics functionality over the top. For example, an industrial control panel could instantly flip between four 16-color VGA screens, and that might just be enough to handle the user interface and nudge a designer into picking the P2 over an ARM, especially with all its pin smarts (the pinheads) and separate cores (cogs).

If we speak of industrial panels, I think that, besides outputting some screens, such a panel needs to handle DeviceNet, ProfiBus/Net, Modbus, Ethernet, ... connectivity. It should have fully working SD/MMC/USB-storage connections.
Panels often need to be remotable, or mounted on rotating arms. That means you need as few wires as possible to connect them to the machine controller (PLC). With these considerations, all the smart pins of the P2 are also a waste, because nobody wants I/Os connected to the HMI, apart from low-cost, small, all-in-one HMI+PLC systems.

I think you will never see a P2-based HMI connected to a Siemens/Simatic, Omron, Schneider Electric, Allen-Bradley, GE Fanuc, ... PLC.
But it can operate a DIY panel, or perhaps a domotics/home automation panel, even if here too the demand is for hi-res graphics on at least 7/10" screens. Thus I would call it more hobbyist than industrial.

Roy Eltham wrote: »

dMajo wrote: »

Roy Eltham wrote: »

Also, I'm not sure I agree with all your advantages for choice 1. For example #5 is BS, because it's true for Choice 2 also, and one could argue it's easier to accomplish in Choice 2. Two of your disadvantages for choice 2 are about a feature that isn't even possible in choice 1. Choice 1 does not have multiple senders to one LUT other than just the two sharing AND those two can conflict with each other and need to do something to avoid it in those cases.

Wrong:
because when cogN-1 and cogN+1 both write to cogN's LUT, each one does so in the half of the LUT visible to it: 2 different areas.

I'm talking about cogN writing to the same location as either cogN+1 or cogN-1. That is the case that can conflict. Yes, they are on different ports, but it's still the same RAM location, so only one can win at best.

I don't think there is any conflict here. While CogN-1 and CogN+1 each write to their own half through port B, CogN writes to the full LUT through port A. If I remember correctly, Chip said that port A has precedence over port B (as in the Cog<>streamer conflict).

Phil Pilgrim (PhiPi) wrote: »

I have no issues with coginit, as long as the cog number comes from the cognew or cogid of a cog that's already running. But starting a cog from scratch with coginit? Nein! Verboten! Never! There's nothing to think about here, folks. Just don't do it -- ever -- or embrace any architecture that requires it.

-Phil

I thought that this is actually already doable.

@Chip and others:
If the any-shared LUT is impracticable, then the adjacent-shared one has to be reconsidered. Most of the logic for it is already there. Making it happen is practically free.
Is it not elegant? It doesn't matter; neither was the possible corruption, nor the DAC allocation to certain cogs. Possibly some other things weren't either, but they were accepted. There is always a compromise, and this communication speed/latency, at least between a few cogs (the adjacent ones) if not possible for all, should not be missed.

I think that the obex is contributed to for the most part by hobbyists, not by commercial customers, who do not disclose their code. If anyone is scared of it, I think it would be easy to check that this feature is not used in obex objects. I think a regex script could easily check for coginit and adjacent LUT writes and, if needed, prevent an object from being uploaded because it does not conform to obex policy.

But do not prevent the high-speed, low-latency (random) communication opportunity. People will find their own way to leverage and handle this feature if they want to use it.
If there is an official support concern, it could also be an undocumented feature; people will help each other on the forums, unofficially.

Dave Hein · 2016-05-18 12:16

STOP ASKING FOR FEATURES! PLEASE FINISH THE DAMN CHIP!
Could people please stop asking for changes to the P2. Chip, could you please stop adding new features to the chip. I'm sure that Ken and the rest of Parallax would appreciate it, not to mention all of the future P2 customers.

ctwardell · 2016-05-18 12:35

It is my opinion that not implementing the adjacent LUT sharing due to dogmatic adherence to a philosophy would be a mistake.

I agree the any-shared LUT was preferable, but since the logic and interconnect cost is too high it isn't an option.

Reworking the smart pin communications to attempt to make them suitable to assist in LUT sharing seems like a bad path. It is a lot of rework to something that was settled just to gain mediocre LUT sharing performance.

T Chap · 2016-05-18 13:01

What is the issue with putting back adjacent LUT, banning its use from OBX, and offering zero support for it? If someone calls for support say it is not supported. Don't even put it in the manual. I'd hate to lose functionality that is already completed.

Rayman · 2016-05-18 13:25

I always seemed to run out of cogs with P1, so personally would prefer having all 16.

BTW: I think a mode I might use a lot is 1080p in text only mode. Wouldn't use that much ram at all...

Cluso99 · 2016-05-18 13:27

cgracey wrote: »

Cluso99 wrote: »

An idea...

Would this give us 1MB HUB RAM ???

Reduce the P2 to 8 Cogs, no LUT sharing (any version). About 4KB of hub is lost to cog and LUT addressing. Make another 4KB hub dual port (with/without egg beater -whatever is easiest).

Make another 8 Cogs that use the 4KB dual port hub as its hub ram. Make the LUT shared between adjacent cogs.
Remove the streamer and DAC.
Remove the egg beater.
Perhaps remove the cordic and complex instructions.
This then becomes a faster I/O processor.
Perhaps the cog ram could be quad port for single clock instructions like the P2hot (without multitasking).

That's several months of work. It may be possible, though, to drop 8 cogs and go all the way to 1MB hub RAM. That could be another chip, easily made after this 16-cog version.

Chip,
None of us wants delays. Just threw it out there anyway.

cgracey · 2016-05-18 14:19

Seeing that no other low-latency path is going to work well between cogs besides adjacent LUT sharing, I will look at making COGINIT allocate 1..2 adjacent cogs. Bit 6 of D can be:

0 = 1 cog
1 = 2 cogs

Then, each cog can have instructions:

RDLUTN - read next cog's LUT
WRLUTN - write next cog's LUT

Are more than two cogs necessary?
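To make the proposal above concrete, here is a rough behavioural model in Python. It is only a sketch: the helper names are invented for illustration, the 512-long LUT size comes from elsewhere in this thread, and the wrap-around from the last cog to cog 0 is an assumption the post does not state.

```python
NUM_COGS = 16
LUT_LONGS = 512  # each LUT holds 512 longs

# One LUT per cog, all zeroed
luts = [[0] * LUT_LONGS for _ in range(NUM_COGS)]

def cogs_requested(d):
    """Decode bit 6 of the COGINIT D operand: 0 = 1 cog, 1 = 2 cogs."""
    return 2 if (d >> 6) & 1 else 1

def next_cog(cog):
    # Assumption: the cog after the last one wraps to cog 0.
    return (cog + 1) % NUM_COGS

def wrlutn(cog, addr, value):
    """Model of WRLUTN: write a long into the next cog's LUT."""
    luts[next_cog(cog)][addr % LUT_LONGS] = value & 0xFFFFFFFF

def rdlutn(cog, addr):
    """Model of RDLUTN: read a long from the next cog's LUT."""
    return luts[next_cog(cog)][addr % LUT_LONGS]

# Cog 3 passes a value to cog 4 through cog 4's LUT
wrlutn(3, 10, 0x12345678)
print(hex(rdlutn(3, 10)))
```

In this model the path is one-directional per pair: cog N reaches cog N+1's LUT, but cog N+1 needs no special instruction because it reads its own LUT normally.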

twm47099 · 2016-05-18 14:35

Chip,
In the P1 coginit, the "d" value contains the cog id, the HUB address of the code to be loaded, and the address of the par variable. How would coginit define those when opening 2 cogs? Or is that already accounted for in the current P2 coginit instruction?

Thanks
Tom

Cluso99 · 2016-05-18 14:56

cgracey wrote: »

Seeing that no other low-latency path is going to work well between cogs besides adjacent LUT sharing, I will look at making COGINIT allocate 1..2 adjacent cogs. Bit 6 of D can be:

0 = 1 cog
1 = 2 cogs

Then, each cog can have instructions:

RDLUTN - read next cog's LUT
WRLUTN - write next cog's LUT

Are more than two cogs necessary?

Fantastic News! Thanks Chip

While I personally think modifying COGINIT is unnecessary, it will satisfy others.
While I might be able to use more than two adjacent cogs, I think it would be rare. So two should be more than enough.

Are you just going to share the whole LUT with the next cog, while the previous cog shares its LUT with this one? (Rather than the split mode you did before?)

Electrodude · 2016-05-18 15:05

Since you're now implementing a double cog allocator, could you change the cog allocator to help reduce fragmentation?

First, make a circuit that outputs the cog id of the first idle cog N for which both cog N-1 and N+1 are running, or that outputs a flag if there are no such cogs. Then, make another circuit that outputs the cog id of the first idle cog N with cog N-1 running and cog N+1 idle, or that outputs a flag if there are no such cogs.

If you want to start a single cog, use the output from the first circuit if it found a cog, and otherwise use the output from the second (via a mux controlled by the first circuit's flag). If you want to start two cogs, use the output from the second circuit as the first of the two allocated cogs.

If the output of the mux for single cog allocation has its no-such-cog flag set, then either there are no free cogs or no running cogs. If the second circuit's no-such-cog flag is set, there are no free adjacent pairs of cogs.

If you make it possible to start 3 adjacent cogs, there should be another circuit that outputs the first idle cog N for which cog N-1 is running and cogs N+1 and N+2 are idle, and another circuit that outputs the first idle cog N for which cogs N-1 and N+2 are running and cog N+1 is idle. Stick these in the priority chain in the appropriate places. Each level adds two more of these circuits.
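The one- and two-cog cases described above can be sketched in software as follows. This is a Python model, not hardware; the function names are made up, and treating cog numbers as wrapping around (cog 0's predecessor is cog 15) is an assumption. A return value of None stands in for the "no such cog" flag.

```python
NUM_COGS = 16

def first_hole(running):
    """Circuit 1: first idle cog N with cogs N-1 and N+1 both running."""
    for n in range(NUM_COGS):
        prev_running = running[(n - 1) % NUM_COGS]
        next_running = running[(n + 1) % NUM_COGS]
        if not running[n] and prev_running and next_running:
            return n
    return None

def first_run_start(running):
    """Circuit 2: first idle cog N with cog N-1 running and cog N+1 idle."""
    for n in range(NUM_COGS):
        prev_running = running[(n - 1) % NUM_COGS]
        next_running = running[(n + 1) % NUM_COGS]
        if not running[n] and prev_running and not next_running:
            return n
    return None

def pick_single(running):
    """Mux: prefer filling a one-cog hole, else start at the edge of a free run."""
    n = first_hole(running)
    return n if n is not None else first_run_start(running)

def pick_pair(running):
    """Two adjacent cogs: circuit 2's output is the first of the pair."""
    n = first_run_start(running)
    return None if n is None else (n, (n + 1) % NUM_COGS)

# Cogs 0, 2 and 3 running; everything else idle
running = [n in (0, 2, 3) for n in range(NUM_COGS)]
print(pick_single(running), pick_pair(running))
```

A single-cog request fills the isolated hole at cog 1 first, which leaves the long free run starting at cog 4 intact for pair requests.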

(By the way, what are these circuits that return the index of their first asserted input called?)

Cluso99 · 2016-05-18 15:16

I expect that if you are using a pair of cooperating cogs they would be allocated at init time. It is highly unlikely that they would be dynamically started.

kwinn · 2016-05-18 15:40

cgracey wrote: »

Seeing that no other low-latency path is going to work well between cogs besides adjacent LUT sharing, I will look at making COGINIT allocate 1..2 adjacent cogs. Bit 6 of D can be:

0 = 1 cog
1 = 2 cogs

Then, each cog can have instructions:

RDLUTN - read next cog's LUT
WRLUTN - write next cog's LUT

Are more than two cogs necessary?

Great news. Two is good, three would be even better, but as I posted earlier if I wanted to share LUT's I would reserve them in the main cog before any others were loaded. I am of course assuming cog(n) can access cog(n+1), and cog(n+1) can access cog(n+2), etc.

Rayman · 2016-05-18 16:09

I wonder if having two adjacent cogs would also let you game the egg beater to pass random data quickly between them...

ErNa · 2016-05-18 16:20

p = cm² <=> cp = mc * mc <=> computing power = mighty code * mighty processor.

A solution:

0 = (infinity-1) * 0

Do we want more?

cgracey · 2016-05-18 17:47

Thinking about this more...

How about rather than having special RDLUTN/WRLUTN instructions, we arrange the LUT sharing so that even/odd cog pairs have all LUT writes in common, meaning that the 2nd port is used to replicate all write activity from the partner cog? This way, a write by either cog to its own LUT affects both LUTs. This would take only a write mux on the second port of each LUT. All the SETQ2+RDLONG/WRLONG block operations would work in this mode, as well.

Rayman · 2016-05-18 17:52

can that be turned off?

Also, the reading allowed for bi-directional comms...

cgracey · 2016-05-18 17:57

Rayman wrote: »

can that be turned off?

Also, the reading allows for bi-directional comms...

There would be an instruction for enabling/disabling this mode in each cog. When enabled, your LUT receives every write that your even/odd partner cog generates.

Reading would allow for bidirectional comms, because all LUT writes could be common, if both cogs enabled this feature.
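A small Python model may help picture the mirrored-write mode described above. It is a sketch under assumptions: the `share` flag stands in for whatever enable/disable instruction is eventually chosen, and pairing by flipping the lowest bit of the cog number is one reading of "even/odd partner".

```python
LUT_LONGS = 512

class Cog:
    def __init__(self, cog_id):
        self.cog_id = cog_id
        self.lut = [0] * LUT_LONGS
        self.share = False  # hypothetical per-cog enable for receiving partner writes

def partner_id(cog_id):
    """Even/odd pairing: 0<->1, 2<->3, ... (flip the lowest bit)."""
    return cog_id ^ 1

def wrlut(cogs, cog_id, addr, value):
    """A cog writes its own LUT; the partner's LUT also receives the
    write, but only if the partner has enabled sharing."""
    cogs[cog_id].lut[addr] = value
    partner = cogs[partner_id(cog_id)]
    if partner.share:
        partner.lut[addr] = value

cogs = [Cog(i) for i in range(16)]
cogs[5].share = True               # only cog 5 listens to its partner

wrlut(cogs, 4, 0x10, 0xDEADBEEF)   # lands in cog 4's and cog 5's LUTs
wrlut(cogs, 5, 0x11, 0xCAFE)       # lands in cog 5's LUT only
```

With only one side enabled, the link is one-way; when both cogs enable the mode, every write from either side appears in both LUTs, which is what makes bidirectional use possible.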

Cluso99 · 2016-05-18 18:48

cgracey wrote: »

Thinking about this more...

How about rather than having special RDLUTN/WRLUTN instructions, we arrange the LUT sharing so that even/odd cog pairs have all LUT writes in common, meaning that the 2nd port is used to replicate all write activity from the partner cog? This way, a write by either cog to its own LUT affects both LUTs. This would take only a write mux on the second port of each LUT. All the SETQ2+RDLONG/WRLONG block operations would work in this mode, as well.

I am not quite sure I understand.

If enabled, Cog n WRLUT writes to Cog n and Cog n+1 simultaneously?

Does this mean you are only pairing even/odd Cog pairs? I hope not

I presume you are wanting to save extra instructions??? If so, would it be difficult to make the LUT addresses 10 bits, where immediate addresses are 9 bits (with 10th bit assumed 0), but register addresses use the full 10 bits where 10th bit is 1 for adjacent LUT.
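Roughly, the 10-bit addressing idea could decode like this. This is a Python sketch of the suggestion only; the function name and the returned labels are invented for illustration and do not describe any existing P2 behaviour.

```python
def decode_lut_address(addr, is_immediate):
    """Split a LUT address into (which LUT, offset within that LUT).

    Immediate addresses carry 9 bits, so the 10th bit is taken as 0.
    Register addresses carry 10 bits; a set 10th bit selects the
    adjacent cog's LUT.
    """
    addr &= 0x1FF if is_immediate else 0x3FF
    target = "adjacent" if addr & 0x200 else "own"
    return target, addr & 0x1FF

print(decode_lut_address(0x205, is_immediate=False))
print(decode_lut_address(0x005, is_immediate=True))
```

Under this scheme an immediate address can never reach the adjacent LUT, so sharing would only be possible through register-based addresses.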

With the SETQ2, do you mean both cog LUTs get written? Again I hope not.

Phil Pilgrim (PhiPi) · 2016-05-18 18:54

That would be okay only as long as there is a special cognew2 that allocates adjacent cogs simultaneously. It could just return the number of the even-numbered cog in that case. There would have to be a well-defined procedure for cogstopping them, so that cogs that remain "entangled" don't subsequently get allocated individually. (This is starting to sound like quantum mechanics.)

-Phil

cgracey · 2016-05-18 19:38

By making writes common between two LUTs, there is no need for a special result mux, or a SETQ3 to facilitate block r/w, or another mux for LUT exec. This means, though, that what was two separate 512-long LUTs become, effectively, one LUT. It's a LUT mind meld. This can only work between a pair of cogs, though, not a rolling overlap, as proposed before. Therefore, an even-odd pair is probably ideal, given the constraints. COGNEW can allocate double cogs from the top down, while single cogs are allocated from the bottom up. Someone proposed something like this a few hours ago, in order to minimize cog fragmentation. I feel okay about all this. You've got to lump the cog-pair issue, though.
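The bottom-up/top-down allocation could look something like the following Python sketch. The function names are hypothetical, and returning the even-numbered cog of a pair follows the suggestion made earlier in the thread rather than anything settled.

```python
NUM_COGS = 16

def alloc_single(running):
    """Single cogs: take the lowest-numbered idle cog (bottom up)."""
    for n in range(NUM_COGS):
        if not running[n]:
            running[n] = True
            return n
    return None  # no idle cog

def alloc_pair(running):
    """Double cogs: take the highest idle even/odd pair (top down).
    Returns the even-numbered cog of the pair."""
    for even in range(NUM_COGS - 2, -1, -2):
        if not running[even] and not running[even + 1]:
            running[even] = running[even + 1] = True
            return even
    return None  # no idle even/odd pair

running = [False] * NUM_COGS
first_single = alloc_single(running)   # cog 0
first_pair = alloc_pair(running)       # cogs 14 and 15
second_single = alloc_single(running)  # cog 1
second_pair = alloc_pair(running)      # cogs 12 and 13
```

From a clean start the two kinds of request grow toward each other from opposite ends, so pairs are only lost once the singles reach them or once stopped cogs leave gaps.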

Rayman · 2016-05-18 19:52

This is a bit complex at first read...
So, if cog A wants to send message to cog B and get reply:

Cog A writes to both LUTs at same time
Cog B is notified somehow and reads his LUT to get message
Then
Cog B writes response to both LUTs at same time
Cog A is notified and reads response in his LUT.

Ok, I think I have it...
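The exchange above can be sketched in Python as a simple mailbox. Everything about the layout is an assumption made for illustration: the flag and data addresses are invented, and polling a flag long stands in for the "notified somehow" step, which the thread leaves open.

```python
LUT_LONGS = 512
FLAG_TO_B, FLAG_TO_A, DATA = 0, 1, 2   # hypothetical mailbox layout

lut_a = [0] * LUT_LONGS   # cog A's LUT
lut_b = [0] * LUT_LONGS   # cog B's LUT

def shared_write(addr, value):
    """With the mode enabled in both cogs, a write by either one
    lands in both LUTs."""
    lut_a[addr] = value
    lut_b[addr] = value

def cog_a_send(msg):
    shared_write(DATA, msg)
    shared_write(FLAG_TO_B, 1)          # tell B a message is waiting

def cog_b_poll_and_reply():
    if not lut_b[FLAG_TO_B]:            # B reads only its own LUT
        return False
    msg = lut_b[DATA]
    shared_write(FLAG_TO_B, 0)
    shared_write(DATA, msg + 1)         # the "response": message plus one
    shared_write(FLAG_TO_A, 1)          # tell A a reply is waiting
    return True

def cog_a_poll_reply():
    if not lut_a[FLAG_TO_A]:            # A reads only its own LUT
        return None
    reply = lut_a[DATA]
    shared_write(FLAG_TO_A, 0)
    return reply

cog_a_send(41)
cog_b_poll_and_reply()
reply = cog_a_poll_reply()
print(reply)   # prints 42
```

Each cog only ever reads its own LUT; the mirroring of writes is what carries data across.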

Phil Pilgrim (PhiPi) · 2016-05-18 19:54

The bottom-up/top-down schema will work great from a reboot. But how hard is it to implement once available cogs become fragmented?

-Phil
