Is LUT sharing between adjacent cogs very important?

Heater. · 2016-05-09 13:01

evanh,

COGNEW, being a language feature,...

No it is not. It's built into the silicon. The COGINIT instruction performs both as a COGINIT and COGNEW.

This way we get atomic starting of the next free COG when used as COGNEW. No fighting and race conditions between processes as to which COG they get next.

...could maybe grow itself a new extension where it can be asked to start multiple tasks at once/consecutively and guarantee those will be in matching consecutive ordered Cogs.

I often used to speculate about a system with thousands of processors. When you wanted some you could call a cpualloc() function that found a bunch of them an allocated them to you. Rather like memory allocation, malloc(), works.

Of course, this then becomes a complex function ....

Yep, too much for an MCU.

Heater. · 2016-05-09 13:09

dMajo,

- if you are the programmer of the whole thing you know what you are doing, this apply also on cogs allocation

True. But quite often not the case. Users should be able to simply grab the objects they need for drivers and such and throw them into their code without having to worry about annoying details like COG ids or COG starting order.

This is a magic Propeller feature, along with the absence of interrupts, than makes mixing and matching code from OBEX and other places so painless.

Not to be abandoned lightly.

- if you develop a driver COGINIT COGID+1 make the thing dynamic

It also makes the system prone to race conditions. A really bad idea.

evanh · 2016-05-09 13:38

Heater. wrote: »

evanh,

COGNEW, being a language feature,...

No it is not. It's built into the silicon. The COGINIT instruction performs both as a COGINIT and COGNEW.

COGNEW doesn't exist in PASM. Well, not in the docs at least.

I now figure (Hence my previous edit.) that Spin's COGINIT and COGNEW are both wrapping the instruction in a resource tracker. There will be a reserved system variable somewhere in HubRAM that maps the allocated Cogs.

Heater. · 2016-05-09 13:49

evanh,

COGNEW doesn't exist in PASM. Well, not in the docs at least.

Yes it does. Except not by name. COGINIT does the job. In my Propeller Manual Version 1.2

COGINIT – Assembly Language Reference

Explanation
The COGINIT instruction behaves similar to two Spin commands, COGNEW and COGINIT, put
together. Propeller Assembly’s COGINIT instruction can be used to start a new cog or restart
an active cog.
....
The third field, bit 3, should be set (1) if a new cog should be started, or cleared (0) if a
specific cog should be started or restarted.
If the third field bit is set (1), the Hub will start the next available (lowest-numbered inactive)
cog and return that cog’s ID in Destination (if the WR effect is specified).
If the third field bit is clear (0), the Hub will start or restart the cog identified by Destination’s
fourth field, bits 2:0.

I now figure (Hence my previous edit.) that Spin's COGINIT and COGNEW are both wrapping the instruction in a resource tracker. There will be a reserved system variable somewhere in HubRAM that maps the allocated Cogs.

No it does not. The software need not track any resources. The silicon does it. See above. That's how we get atomic allocation of COGs. A software COG resource tracker would have to use LOCKS to achieve the same effect as the COGINIT instruction.

Spin's COGINIT and COGNEW both wrap the COGINIT assembler instruction.

evanh · 2016-05-09 14:09

Oh, COGINIT instruction is way more complicated than I thought. My lack of hands on coding showing again.

As for the topic of launching consecutive Cogs, I'm pretty agnostic, one can manage those cases where it becomes important. It may require a tiny bit of editing on occasion.

cgracey · 2016-05-09 15:03

COGNEW selects a free cog on an atomic basis. Without use of this hardware mechanism, there would need to be some hub-based variable, protected by a LOCK semaphore. That would turn what is now instantaneous into a software routine that must be called in order to allocate a new cog. That would not be good.

Today, I'm going to implement a cog DAC channel mux, so that we get the fast cog DAC channels working with all pins, not just the set of four pins related to the cog ID. This will free up the biggest hardware gotcha regarding cog allocation.

The other matter at hand is LUT sharing. I wish there was some cheap way to generalize this, short of huge AND-OR mux's. The thing about LUT sharing is that it really doesn't take much hardware, at all, for what it provides. That it only works with adjacent cogs, throws a wrench into things. I'll deal with that after the cog DAC channels are freed.

evanh · 2016-05-09 15:36

COGNEW could have a PASM assembly keyword, a bit like JMP, RET and CALL.

I definitely tripped up there!

Electrodude · 2016-05-09 15:54

evanh wrote: »

COGNEW could have a PASM assembly keyword, a bit like JMP, RET and CALL.

I definitely tripped up there!

It couldn't really, since the difference is in the value of bit 3 of the destination register, not the actual instruction encoding. In other words, the COGNEW and COGINIT assembly keywords would be equivalent and interchangeable, which would only lead to confusion. However, since the Spin versions of COGNEW and COGINIT worry about setting the destination register for you, they can be different in Spin.

EDIT: Is this true on the P2? This is still true on the P2, according to Seairth's post below.

evanh · 2016-05-09 16:17

Damn!

Thinking about the Prop2 and looking at the docs right now ... There isn't much describing COGINIT, Cluso's colour coded docs say bits 4-0 bits 3-0 of D register hold the ID of the Cog to start and the S register points to the target program to execute. No mention of a separate PAR nor the NEW bit.

I suspect I've not got all the docs available.

Heater. · 2016-05-09 16:23

evanh,

Do you have Propeller Manual Version 1.2 ?

Or perhaps a later version. It's all there.

Chip thought of everything.

Edit: Oh sorry you meant the P2 docs. Well, it's early days still.

potatohead · 2016-05-09 17:18

There is no PAR on P2. PTRA & B are used. I'm away from my docs at the moment, but that's the gist of it.

A few things to catch up:

1. Until Chip announced an improvement, COGS and DAC pins were mapped. We were going to be using specific COGS a lot. That we can improve this is GREAT!

2. I'm not against, nor in favor of the LUT sharing. I suspect this will get used for a few things and ignored for a whole lot of things. If it makes it, we can live with specifying a COG explicitly for those instances. A similar thing may happen with COG events (interrupts) too.

There are a LOT of features on this chip. I suspect there are sweet use cases not thought of yet. At the ~160 to 200Mhz target clock, performance is going to be more than good enough for a lot more things. Many objects just won't need the specifics. That's IMHO, of course.

We also decided some time ago to keep P1 style programming an option, and add some performance features were needed. Events were the first of these, and they may be needed to get the "edge case" performance needs met. These have been added in ways that one can completely ignore. It's possible to program the P2, treating it largely like a P1 that has larger RAM and HUBEXEC, and doing that isn't significantly more difficult than P1 currently is.

Seairth · 2016-05-09 17:20

evanh wrote: »

Damn!

Thinking about the Prop2 and looking at the docs right now ... There isn't much describing COGINIT, Cluso's colour coded docs say bits 4-0 bits 3-0 of D register hold the ID of the Cog to start and the S register points to the target program to execute. No mention of a separate PAR nor the NEW bit.

I suspect I've not got all the docs available.

I've added the following text to the P2 Good Doc:

D[8:6]		        Reserved

D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000
D[5] = %1	        JMP to PTRB

D[4:0] = %1----	        Target cog is lowest-numbered inactive cog
D[4:0] = %0nnnn	        Target cog is indicated by %nnnn

S				Address of first instruction to execute, copied to PTRB of the target cog
Q				From a preceding SETQ, copied to PTRA of the target cog

Edit: Added the "Q" bit from Chip's comment below.

MJB · 2016-05-09 18:21

Seairth wrote: »

I've added the following text to the P2 Good Doc:

D[8:6]		        Reserved

D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000
D[5] = %1	        JMP to PTRB

D[4:0] = %1----	        Target cog is lowest-numbered inactive cog
D[4:0] = %0nnnn	        Target cog is indicated by %nnnn

S[8:0]			Address of first instruction to execute, copied to PTRB of the target cog

just wondering about the JMP to $000 in

D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000

because this excludes loading a COG and staring at any address.
And it seems completely unnecessary.
Just setting s[8:0] to 0 and JMP to PTRB would do the same job.

So looks to me
D[5] could be used just to indicate load / noload
and start is always done via PTRB
would give the additional freedom to load a COG AND run from any address.

or do I miss something?

cgracey · 2016-05-09 18:31

MJB wrote: »
Seairth wrote: »
I've added the following text to the P2 Good Doc:
D[8:6]		        Reserved

D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000
D[5] = %1	        JMP to PTRB

D[4:0] = %1----	        Target cog is lowest-numbered inactive cog
D[4:0] = %0nnnn	        Target cog is indicated by %nnnn

S[8:0]			Address of first instruction to execute, copied to PTRB of the target cog
just wondering about the JMP to $000 in
D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000
because this excludes loading a COG and staring at any address.
And it seems completely unnecessary.
Just setting s[8:0] to 0 and JMP to PTRB would do the same job.

So looks to me
D[5] could be used just to indicate load / noload
and start is always done via PTRB
would give the additional freedom to load a COG AND run from any address.

or do I miss something?

When you do a COGINIT, S[19:0] is written to PTRB in the target cog, while Q[19:0] is written to PTRA in the target cog. If D[5] is 1, the target cog will start from its PTRB (which is S[19:0] in the COGINIT). If that address is below $200, the target cog will execute whatever is in its cog RAM, starting at that address. Usually, you would want to load the cog RAM to have something known in there, but there may be a case where you already know what's in there, and you just want to have the cog start there.

Seairth · 2016-05-09 18:39

MJB wrote: »
Seairth wrote: »
I've added the following text to the P2 Good Doc:
D[8:6]		        Reserved

D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000
D[5] = %1	        JMP to PTRB

D[4:0] = %1----	        Target cog is lowest-numbered inactive cog
D[4:0] = %0nnnn	        Target cog is indicated by %nnnn

S[8:0]			Address of first instruction to execute, copied to PTRB of the target cog
just wondering about the JMP to $000 in
D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000
because this excludes loading a COG and staring at any address.
And it seems completely unnecessary.
Just setting s[8:0] to 0 and JMP to PTRB would do the same job.

So looks to me
D[5] could be used just to indicate load / noload
and start is always done via PTRB
would give the additional freedom to load a COG AND run from any address.

or do I miss something?

In the case where D[5]=0, PTRB is pointing to $1F8 longs that will be copied into the cog before jumping to $000. This takes something like 510-525 clock cycles.

In the case where D[5] is set, you are simply jumping to the given address in PTRB. Think of it as a forced JMP that's triggered by a different cog. This takes between something like 3-18 clock cycles. The most likely place to use this is to start a cog in HUBEXEC mode, but you could certainly use it to just "restart" a cog at a different entrypoint in already-loaded code for either COGEXEC or LUTEXEC.

Cluso99 · 2016-05-09 18:40

There are cases where we required specific cogs to run on P1. The same will apply to P2.

So starting specific cogs is already a requirement. Don't use this as an excuse to scuttle a fantastic feature.

Chip, I really don't think we need to be able to pass data directly from any cog to any cog without going via the hub. Your method of sharing LUT with the next highest cog makes perfect sense. In effect, this means a cog can communicate fast with the adjacent lower and adjacent upper cog. We have not lost any cog symmetry. It just means there is another (faster) way to pass data to adjacent cogs. It is easy to explain and understand.

It would be an added bonus to be able to run LUT exec from the adjacent cogs LUT (effectively doubles LUT ram), but it ranks well below the fast cog-cog transfer.

There would also be some nice video tricks that could be done with shared LUT.

jmg · 2016-05-09 22:07

cgracey wrote: »

The other matter at hand is LUT sharing. I wish there was some cheap way to generalize this, short of huge AND-OR mux's. The thing about LUT sharing is that it really doesn't take much hardware, at all, for what it provides. That it only works with adjacent cogs, throws a wrench into things. I'll deal with that after the cog DAC channels are freed.

I don't think adjacent is such an issue, especially now COG DAC constraints are gone.

I worry about routing delays and physical placement aspects of this, so that makes a huge AND-OR impractical.

The least impact topology in terms of routing lengths and delays, seems to be to pair the LUT Memory cells, so they are 'to the right of' one COG, and 'to the left of' the next one.
COGs route then roughly as mirror images, and you can do a full dual-port share in minimal space and routing.
Anything more, to me seems to bump up the routing lengths quite a lot.

This placement approach gives full, 2 way linkage between adjacent COGS in Even/Odd sense, but there is no linkage between 2nd and 3rd COGS.

The next step would be to split the LUT in half, and place half-left and half-right, and dual port those.
The overlap per cog is halved, but you have a continual connection, (ie gain a linkage between 2nd and 3rd COGs) and now any COG can tightly co-operate with two others, to allow a triple-set.
In this topology, the LUT Bus has to span across the COG.

Questions I guess are:
* is that larger LUT BUS a real issue ? , and
* is the triple-set gain, worth halving the overlap per COG ?
One COG could still gain 100% more LUT, by borrow of two lots of 50%, one from each neighbour.

MJB · 2016-05-09 22:39

cgracey wrote: »
MJB wrote: »
Seairth wrote: »
I've added the following text to the P2 Good Doc:
D[8:6]		        Reserved

D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000
D[5] = %1	        JMP to PTRB

D[4:0] = %1----	        Target cog is lowest-numbered inactive cog
D[4:0] = %0nnnn	        Target cog is indicated by %nnnn

S[8:0]			Address of first instruction to execute, copied to PTRB of the target cog
just wondering about the JMP to $000 in
D[5] = %0	        Copy $1F8 longs from Hub @PTRB into cog, then JMP to $000
because this excludes loading a COG and staring at any address.
And it seems completely unnecessary.
Just setting s[8:0] to 0 and JMP to PTRB would do the same job.

So looks to me
D[5] could be used just to indicate load / noload
and start is always done via PTRB
would give the additional freedom to load a COG AND run from any address.

or do I miss something?
When you do a COGINIT, S[19:0] is written to PTRB in the target cog, while Q[19:0] is written to PTRA in the target cog. If D[5] is 1, the target cog will start from its PTRB (which is S[19:0] in the COGINIT). If that address is below $200, the target cog will execute whatever is in its cog RAM, starting at that address. Usually, you would want to load the cog RAM to have something known in there, but there may be a case where you already know what's in there, and you just want to have the cog start there.

think I found my misconception.
The COGINIT runs in the original COG, but only to copy the values of PTRA and PTRB to the target COG.
Then the loading of the target COG RAM is performed by the target COG itself.
I was in the thinking the @PTRB of D[5]=%0 was refering to the original COG.

Cluso99 · 2016-05-10 00:28

jmg,
I thought Chip had implemented the dual port LUT such that cog n shared its LUT with cog n+1, and cog n+1 shared its LUT with cog n+2. Therefore cog n+1 can indeed communicate directly with both cog n and cog n+2.

jmg · 2016-05-10 01:52

Cluso99 wrote: »

jmg,
I thought Chip had implemented the dual port LUT such that cog n shared its LUT with cog n+1, and cog n+1 shared its LUT with cog n+2. Therefore cog n+1 can indeed communicate directly with both cog n and cog n+2.

Well, yes N, N+1 has been mentioned, but for Dual Port memory, only one other port is allowed.
In your case, cog n, and cog N+2 clearly cannot both access memory in cog n on the same SysCLK.
To me, that is 3 or 4 port operation.

This lack of clarity in the labeling and operation details, is why I go back to first principles of Ports and Placement.

rjo__ · 2016-05-10 01:53

Cluso,

As I understand it, Chip is juggling constraints with a few still up in the air. At a minimum, I think adjacent Cogs will "twitter" each other... which to my mind is almost as good as full LUT sharing. (In my simple view of the world... in order to have a LUT to share, you first have to write to it... So, if you want that data in the next cog, use cog twittering:)

If LUT sharing is dropped, what we won't be able to do is push data at a Cog from multiple directions simultaneously... I don't have an application for this and Chip has asked.

???

jmg · 2016-05-10 01:57

rjo__ wrote: »

If LUT sharing is dropped, what we won't be able to do is push data at a Cog from multiple directions simultaneously... I don't have an application for this and Chip has asked.

Not quite.
LUT sharing allows very low latency data transfer, two ways, without HUB+Eggbeater variables getting into the mix.

rjo__ · 2016-05-10 03:01

If I don't say what I think... it doesn't get corrected.

Thanks

cgracey · 2016-05-10 03:52

I got the cog DAC channel mux'ing implemented. I've also got a simple scheme for a cog-to-cog(s) 'attention' event/interrupt worked out. And.... I removed LUT sharing. This combination makes all cogs equal for allocation purposes.

It's true that you might space cogs by some fixed amount for optimized egg-beater relationships in a custom application sans LUT-sharing, but objects cannot be practically written and shared which incorporate LUT-sharing - at least, not without some dual-cog COGNEW equivalent. I don't want to add a wrinkle of complexity to something as fundamental as cog instantiation, when it's otherwise very simple.

I know some of you really like the idea of LUT sharing. If it could be selectable, instead of sequential, it would be great to me, but I don't think it's worth all the complexity it would bring into object sharing.

What we have now is universally flexible.

Electrodude · 2016-05-10 03:57

Since you removed LUT sharing, can you add the third timer back?

In fact, if there are still any unused events when everything else is done, can you fill in all the unused events with more timers? (I'm guessing timers are pretty cheap, just a check for if the system counter equals some target value, a register to hold the target value, and a way to set the target register?)

jmg · 2016-05-10 04:13

cgracey wrote: »

I got the cog DAC channel mux'ing implemented. I've also got a simple scheme for a cog-to-cog(s) 'attention' event/interrupt worked out.

Sounding good.
How many active 'attention' flags are there, across the 16 COGs ?

cgracey wrote: »

I know some of you really like the idea of LUT sharing. If it could be selectable, instead of sequential, it would be great to me, but I don't think it's worth all the complexity it would bring into object sharing.

What we have now is universally flexible.

.... but also crippled, from what it could have been.

I'm not following the object sharing problems, with what is an optional feature ?

Those who do not want LUT to cloud their Object Sharing are wholly free to not use it.

I cannot imagine two objects, that could get 'confused' by this.
Either both need it, and they must then be coded to very closely co-operate, and thus will launch as pairs, or they simply do not use it.

cgracey · 2016-05-10 04:14

Electrodude wrote: »

Since you removed LUT sharing, can you add the third timer back?

In fact, if there are still any unused events when everything else is done, can you fill in all the unused events with more timers? (I'm guessing timers are pretty cheap, just a check for if the system counter equals some target value, a register to hold the target value, and a way to set the target register?)

CT3 is back in. The one empty event is going to the cog attention circuit.

cgracey · 2016-05-10 04:20

jmg wrote: »

cgracey wrote: »

I got the cog DAC channel mux'ing implemented. I've also got a simple scheme for a cog-to-cog(s) 'attention' event/interrupt worked out.

Sounding good.
How many active 'attention' flags are there, across the 16 COGs ?

cgracey wrote: »

I know some of you really like the idea of LUT sharing. If it could be selectable, instead of sequential, it would be great to me, but I don't think it's worth all the complexity it would bring into object sharing.

What we have now is universally flexible.

.... but also crippled, from what it could have been.

I'm not following the object sharing problems, with what is an optional feature ?

Those who do not want LUT to cloud their Object Sharing are wholly free to not use it.

I cannot imagine two objects, that could get 'confused' by this.
Either both need it, and they must then be coded to very closely co-operate, and thus will launch as pairs, or they simply do not use it.

There is no universal guarantee that they will launch sequentially, though. That's the whole problem. To make such a guarantee, we would either have to have an extended dual-cog COGNEW or some application framework in which startup code is called for all objects before regular runtime code. Everything will get wrapped around that axle because of that single LUT-sharing feature which many apps may not care about. I just don't think it's worth it.

It would turn what is now an atomic hardware function into an application-level paradigm.

Cluso99 · 2016-05-10 05:05

Chip,
I am extremely disappointed that the scaremongers have succeeded in derailing a fantastically simple mechanism that would provide a fast efficient mechanism for two cooperating cogs

Code to find two sequential cogs is quite simple and could be handled simply by the start routine in the object. The chances of not finding two adjacent cogs when we have 16 cogs is highly unlikely. It is way more likely that we cannot locate two cogs to minimise latency in the egg beater hub mechanism, and the latency will always be higher!

The additional mechanism of signalling a cog is great, but does not resolve fast simple cog-cog Comms.

I suspect that cooperating cogs using the egg beater hub will need to be spaced 3-4 cogs apart. This is a much more difficult setup than with adjacent cogs. With LUT sharing, the second cog could be in a tight loop waiting for the long to become non-zero, at which time we have the byte/word/long! Almost the same mechanism we use as mailboxes in P1, but without the delays due to hub latency, so extremely efficient.

Please reconsider your stance, because it will be way more complex and less efficient to have cooperating cog objects under the egg beater mechanism than with shared LUT.

jmg · 2016-05-10 05:20

cgracey wrote: »

There is no universal guarantee that they will launch sequentially, though. That's the whole problem.

I'm not following - in order to test the P2, you surely must be able to launch specific COGs on command ?

That will also be pretty much required for Debug too.

Is LUT sharing between adjacent cogs very important?

Comments