Is LUT sharing between adjacent cogs very important? - Page 7 — Parallax Forums

Is LUT sharing between adjacent cogs very important?


Comments

  • RaymanRayman Posts: 11,828
    LUT sharing is interesting, as I've heard, because the LUT memory is dual ported, making it possible.

    If I remember right, the problem with LUT sharing is:
    1. We don't have enough logic left in A9 to fully test it (without removing a bunch of stuff that we may want to keep).
    2. It would force one to pick sequential cogs to use it, which goes against a Propeller precept.
  • jmgjmg Posts: 14,595
    edited 2016-05-11 23:51
    rjo__ wrote: »
    ... The idea that I found very compelling... and is worth bending the rules for... was cog signaling. Suddenly the conversation switched over to LUT sharing, which I think is a different kettle of fish:). The original idea was to be able to send a certain number of bits from one cog to the next one, without going through the hub. I don't think this has been ruled out.
    There were two (maybe 3?) items that were somewhat overlooked in the Eggbeater design
    a) Fast Signaling between COGS, without consuming Pins.
    b) Fast Data Transfer between (some) COGS, without Hub delays and jitters.
    c) maybe ? Jitter free HUB operation ?

    I think missing feature a) is now in there, as Chip said:
    I just implemented the cog-to-cog(s) 'attention' mechanism, which can be used to alert multiple cogs on the same clock that they need to do something.
    It is unclear whether that is 32 available flags or 16, and as yet there are no examples of use. Certainly a good feature.

    b) has had some partial solutions suggested, including run-time optimal slot choosing (from a candidate list of 16)
    This costs code, has caveats, and needs a learning phase, and is certainly not easy to explain.
    One plus of this run-time-tune angle, is it can work between any pair of COGs, but is unlikely to have the raw speed of dual-port LUT sharing.
    I see it more as a phasing workaround, than a real bandwidth solution.
    Q: How fast can this actually run ?

    The actual Logic cost of LUT sharing has not been defined, that I have seen, but the physical layout angle suggests a couple of practical choices for LUT sharing.
  • Heater.Heater. Posts: 21,233
    Rayman,
    2. It would force one to pick sequential cogs to use it, which goes against a Propeller precept.
    And importantly it's impossible to do in a race free manner without tweaking COGNEW such that it grabs two cogs atomically.

  • rjo__rjo__ Posts: 2,115
    edited 2016-05-12 01:22
    Rayman and Heater,

    I have been working with parallelism in my code. It is as much fun as I am allowed to have:)

    I'm constantly using cogs that have parametric relationships to each other.

    I can't figure out a way to write perfectly parallel code without exactly specifying which cogs I'm using, except to set up a messaging area (which is a perfect pia) and then adding logic to the parallel code...cog says to itself: "look at the message area, count the parallel cogs started... figure out where I am in the pecking order, from this fact assign appropriate parameters." OR... I can just use coginit as God intended.

    About the OBEX... looking for two adjacent cogs can be done in software. Once that is done, a coginit can be done with the returned cog numbers. No cognew problems here!!!

  • cgraceycgracey Posts: 13,378
    rjo__ wrote: »
    About the OBEX... looking for two adjacent cogs can be done in software. Once that is done, a coginit can be done with the returned cog numbers. No cognew problems here!!!

    This is the problem. You may discover that two adjacent cogs are free, but when you go to start them, they may not be free, anymore. What was free at cycle N is sometimes not free at cycle N+1.
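
The check-then-start race described above can be sketched in a few lines of Python. This is only a toy model, not P2 code: `find_adjacent_free`, `coginit`, and the `busy` table are hypothetical names, the interfering start is injected by hand between the check and the act, and the model assumes cog 15 is not treated as adjacent to cog 0.

```python
NUM_COGS = 16  # the thread discusses a 16-cog chip

def find_adjacent_free(busy):
    """Return the lowest n such that cogs n and n+1 are both free, else None."""
    for n in range(NUM_COGS - 1):  # assumption: no wrap-around from 15 to 0
        if not busy[n] and not busy[n + 1]:
            return n
    return None

def coginit(busy, cog):
    """Hypothetical model of starting a cog by number.

    Returns True if the cog was free beforehand, False if it was already
    running something else (which the caller would now have clobbered).
    """
    was_free = not busy[cog]
    busy[cog] = True
    return was_free

busy = [False] * NUM_COGS
busy[0] = True                        # cog 0 runs the main program

pair = find_adjacent_free(busy)       # check: two neighbouring cogs look free
busy[pair + 1] = True                 # meanwhile, some other object starts a cog there
first_ok = coginit(busy, pair)        # act: the lower cog was still free
second_ok = coginit(busy, pair + 1)   # act: the upper cog was taken in between
```

In this model the search itself is correct; the failure comes purely from the gap between looking and starting, which is the point being made about cycle N versus cycle N+1.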
  • jmgjmg Posts: 14,595
    Heater. wrote: »
    ... without tweaking COGNEW such that it grabs two cogs atomically.

    OK, then what is the Logic impact of tweaking COGNEW, such that it can grab two cogs atomically ?
  • cgraceycgracey Posts: 13,378
    Cluso99 wrote: »
    Shared LUT

    Last night I wrote a reply but did not post it. I was so disheartened by the replies against it, it kept me awake most of the night. Today I put it out of my mind, deciding to ignore it.

    It seems like we have a fabulous new vehicle design. It has 4 doors with 5 seats, 2 in the front and 3 in the rear. But we want to give every occupant equal access to their seat. There is a lazy Susan (a spinning disc like you often find in Chinese restaurants) that rotates all seats via one door. So this way, every occupant has equal access via the same single door. Of course the occupants must wait in line for their turn to access their seat. This is the hub.
    Another way is to access the seats via the four doors, so four occupants have equal access to their seats, but there is no way the occupant of the middle back seat has equal access. The result... Remove the 5th seat. After all, who wants a 5th seat anyway.

    No one wants to upset the 4 main occupants' access to permit the vehicle to carry 5 occupants! The idea to allow the 5th occupant to enter the vehicle by cooperating with either of the two other rear occupants was considered, but was rejected!

    If we can find a way to make it work in the wild, with other people's objects, we could still do it.

    One kind of kludgey way to do it would be to do COGNEWs until two in a row were started, then COGSTOP the others.

    Another way is to have a dual-cog COGINIT. That's a sounder approach, but needs some thinking through.

    I don't like either of these approaches, though, because in cases where cog usage has been swiss-cheesed, they will both fail.

    The only right way to do this is to make it so each cog can select which other cog's LUT it has access to, through an AND-OR array. We should have plenty of logic area to do that in, but it's expensive in the FPGA. It's expensive in silicon, too. It's almost as expensive as the eggbeater, minus the FIFOs.
  • rjo__rjo__ Posts: 2,115
    Chip

    I hate it when I waste your time... I don't need to understand this. If I got it wrong, please ignore me:)
    cgracey wrote: »
    This is the problem. You may discover that two adjacent cogs are free, but when you go to start them, they may not be free, anymore. What was free at cycle N is sometimes not free at cycle N+1.

    ... but isn't that still a software problem?

    Remember, I am only talking about cog twittering... passing a few nibs to the next cog.

    Let's say I need 4 twittering cogs to run my code. I implement them as I want. No problems.

    And then I want to use an OBEX object that requires cog-twittering (which is clearly indicated in the description of the object) ... I check to see if I have enough cogs to use it... then I use it next:) Twittering gets priority... which priority is established in my main code. After that comes the non-twittering objects that can come and go as they like as long as the entire code base requires no more than 16 Cogs.

    Sorry:)
  • cgraceycgracey Posts: 13,378
    edited 2016-05-12 02:22
    rjo__ wrote: »
    Chip

    I hate it when I waste your time... I don't need to understand this. If I got it wrong, please ignore me:)
    This is the problem. You may discover that two adjacent cogs are free, but when you go to start them, they may not be free, anymore. What was free at cycle N is sometimes not free at cycle N+1.

    ... but isn't that still a software problem?

    Remember, I am only talking about cog twittering... passing a few nibs to the next cog.

    Let's say I need 4 twittering cogs to run my code. I implement them as I want. No problems.

    And then I want to use an OBEX object that requires cog-twittering (which is clearly indicated in the description of the object) ... I check to see if I have enough cogs to use it... then I use it next:) Twittering gets priority... which priority is established in my main code. After that comes the non-twittering objects that can come and go as they like as long as the entire code base requires no more than 16 Cogs.

    Sorry:)

    You can increase your chances of finding two adjacent cogs, but you can never guarantee it. That's the root problem. Maybe two cogs are free, but they are 1 and 5.

    For any kind of cog-to-cog link, there just needs to be some way to select the cog, or cogs. Then, you have something certain to work.
  • RaymanRayman Posts: 11,828
    edited 2016-05-12 02:30
    I have to think most users will not want to deal with what actual cog# they are using...

    I'm pretty sure multithreaded PC apps don't depend on the actual core they are running on. Wouldn't that be a nightmare?
  • Heater.Heater. Posts: 21,233
    rjo__,

    I'm not following you. Perhaps you are doing some weird thing that requires specific COG allocation.

    Everybody runs parallel code on the Prop all the time. You know, a serial driver, a video driver, this, that and the other. Never needing to know which COG is which or use COGINIT.

    If you are talking parallel code as in a parallelized algorithm, clearly that can be done as well. My Fast Fourier Transform can spread itself over 1, 2 or 4 COGs as required, with no change to the code. There are no explicit COG IDs being used there.

    Looking for two adjacent COGs can be done in software. In the general case it will suffer from race conditions and fail catastrophically.
    ... but isn't that still a software problem?   
    
    It's the same problem as sharing data areas between multiple COGs that read and write. It can't be done reliably in software unless you have hardware support for locks to grant COGs atomic access to the shared data.

    In this case we are trying to share COGs between COGs and support for atomic COG acquisition would be required.

    Yes, we could make all COG allocation static and done at start-up time. That moves the problem to the user of objects and messes up the whole design philosophy.
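
The "atomic COG acquisition" idea can be illustrated with a software analogy. The sketch below is an assumption-laden model, not a proposal for the hardware: `CogPool` and `alloc_pair` are made-up names, a Python `threading.Lock` stands in for whatever hardware arbitration would be needed, and cog 15 is not considered adjacent to cog 0.

```python
import threading

NUM_COGS = 16

class CogPool:
    """Toy model: one lock guards the whole free/busy table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._busy = [False] * NUM_COGS

    def alloc_pair(self):
        """Atomically claim two adjacent cogs; return the lower number or None."""
        with self._lock:  # search and claim happen as one indivisible step
            for n in range(NUM_COGS - 1):
                if not self._busy[n] and not self._busy[n + 1]:
                    self._busy[n] = self._busy[n + 1] = True
                    return n
            return None

pool = CogPool()
results = []
results_lock = threading.Lock()

def worker():
    n = pool.alloc_pair()
    with results_lock:
        results.append(n)

# Ten competing requesters, but only eight disjoint pairs exist in 16 cogs.
threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = sorted(n for n in results if n is not None)
```

Because the search and the claim sit inside the same critical section, no two requesters can ever be handed overlapping pairs, whatever order the threads run in.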

  • jmgjmg Posts: 14,595
    cgracey wrote: »
    If we can find a way to make it work in the wild, with other people's objects, we could still do it.

    One kind of kludgey way to do it would be to do COGNEWs until two in a row were started, then COGSTOP the others.

    Another way is to have a dual-cog COGINIT. That's a sounder approach, but needs some thinking through.

    Both of those sound worth exploring. A dual-cog COGINIT seems the most deterministic.
    cgracey wrote: »
    I don't like either of these approaches, though, because in cases where cog usage has been swiss-cheesed, they will both fail.
    That has to be very rare: that someone knows they have COG pairing, and then mismanages STOP and ALLOC in such a way that they shoot themselves in the foot.

    RJO's example of someone who knows what they are doing, is far more common, and surely actually the customer you want to target ?
    cgracey wrote: »
    The only right way to do this is to make it so each cog can select which other cog's LUT it has access to, through an AND-OR array. We should have plenty of logic area to do that in, but it's expensive in the FPGA. It's expensive in silicon, too. It's almost as expensive as the eggbeater, minus the FIFOs.
    Maybe, but that sounds rather too costly in areas and delays...

    How many SysCLK is the smart-pin link down to now, & what can it stream at ?
    That is partly serial, so lowers the routing cost, if you added 16 more 'pins' to that structure, that would give
    a COG-COG link - but is that better than HUB delays & jitters ?
  • rjo__rjo__ Posts: 2,115
    Let's say I am using rdfast for byte-sized data... and I don't have enough time to do a bulk transfer via wrfast back to the HUB and I can't do wrbyte because it is too slow. I can pack the data into a long, store it in the LUT and then index the LUT location. Then it is a single call to transfer the packed LUT back to the HUB. The issue is the clocks it takes to pack the data and index the position and store to LUT. When I actually implement it... it takes too long: I have 6 clocks and it takes 8+.

    On the other hand, if I twitter the data to the next cog...that cog can be waiting for it and then use wrfast from there... fixed timing. No mess, no fuss.

    I have found a different way to do it... and I'm a happy camper, but if the bandwidth requirement was a tad higher, I'd be stuck.
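
For readers unfamiliar with the pack-then-store step described above, here is a small Python model of it. It says nothing about clock counts, which are the real issue in the post. The names `pack4`, `lut`, and `index` are illustrative, the byte order (first byte in the low 8 bits) is an assumption, and the LUT is modelled as 512 longs.

```python
LUT_LONGS = 512  # assumption: a 512-long LUT (9 address bits)

def pack4(b0, b1, b2, b3):
    """Pack four bytes into one 32-bit long, first byte in the low bits (assumed order)."""
    return (b0 & 0xFF) | ((b1 & 0xFF) << 8) | ((b2 & 0xFF) << 16) | ((b3 & 0xFF) << 24)

lut = [0] * LUT_LONGS
index = 0

samples = [0x11, 0x22, 0x33, 0x44, 0xAA, 0xBB, 0xCC, 0xDD]
for i in range(0, len(samples), 4):
    lut[index] = pack4(*samples[i:i + 4])  # store one packed long per four bytes
    index += 1                             # advance to the next LUT location
```

Each incoming byte costs a shift, a merge, and every fourth byte a store plus an index update, which is the per-byte overhead the post says does not fit in the available clocks.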


  • T ChapT Chap Posts: 4,051
    edited 2016-05-12 04:16

    This is the problem. You may discover that two adjacent cogs are free, but when you go to start them, they may not be free, anymore. What was free at cycle N is sometimes not free at cycle N+1.

    How difficult would it be to reserve two adjacent cogs for future use, by means of an access key, so that they are sure to be available?

    ReserveCog[CogId, accesskey]

    Run the Reserve at the top of the main program to set aside a cog(s) for later, then require a key to get it to run. The key is defined in the Reserve function. No other cog can accidentally use it.
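
A rough software model of the reserve-with-key idea might look like the following. Everything here is hypothetical: no such reservation mechanism exists in the chip as discussed in this thread, and `CogReservations`, `reserve`, and `may_start` are names invented for the sketch.

```python
class CogReservations:
    """Hypothetical reservation table, kept by the main program."""

    def __init__(self, num_cogs=16):
        self.num_cogs = num_cogs
        self._keys = {}  # cog number -> access key that reserved it

    def reserve(self, cog, key):
        """Set a cog aside for later; fails if out of range or already reserved."""
        if not 0 <= cog < self.num_cogs or cog in self._keys:
            return False
        self._keys[cog] = key
        return True

    def may_start(self, cog, key=None):
        """A reserved cog may only be started by a caller presenting its key."""
        owner = self._keys.get(cog)
        return owner is None or owner == key

table = CogReservations()
KEY = 0x5A5A          # arbitrary example key chosen by the main program
table.reserve(4, KEY)
table.reserve(5, KEY)
```

With this table, `table.may_start(4)` is refused while `table.may_start(4, KEY)` is allowed, so an unrelated object cannot accidentally take one half of the reserved pair. The sketch only works if every cog start goes through the table, which is the part that would need hardware or strict convention to enforce.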
  • jmgjmg Posts: 14,595
    rjo__ wrote: »
    ... The issue is the clocks it takes to pack the data and index the position and store to LUT. When I actually implement it... it takes too long: I have 6 clocks and it takes 8+.

    On the other hand, if I twitter the data to the next cog...that cog can be waiting for it and then use wrfast from there... fixed timing. No mess, no fuss.

    I have found a different way to do it... and I'm a happy camper, but if the bandwidth requirement was a tad higher, I'd be stuck.
    Can you expand on what code you used, and what bandwidth that gives ?
    Hard data from real code, is always useful.

  • Cluso99Cluso99 Posts: 17,436
    edited 2016-05-12 09:02
    Heater. wrote: »
    Rayman,
    .
    2. It would force one to pick sequential cogs to use it, which goes against a Propeller precept.
    And importantly it's impossible to do in a race free manner without tweaking COGNEW such that it grabs two cogs atomically.
    heater,
    THIS IS TOTAL RUBBISH !!!

    I can make an object that will automatically search for two adjacent cogs without any problems whatsoever. No additional instructions required!!! I can do it in P1 too!!!

    I just cannot be bothered posting code as I am so disappointed with you and others derailing such a significant feature with such a minor cost. AFAIK Chip has already made the LUT dual port for the streamer. There is no other way to get fast single bytes passed between cooperating cogs.

    While I only asked for a port between cogs, Chip made LUT dual port, which made for an extremely simple mechanism.
  • Cluso99Cluso99 Posts: 17,436
    cgracey wrote: »
    rjo__ wrote: »
    About the OBEX... looking for two adjacent cogs can be done in software. Once that is done, a coginit can be done with the returned cog numbers. No cognew problems here!!!

    This is the problem. You may discover that two adjacent cogs are free, but when you go to start them, they may not be free, anymore. What was free at cycle N is sometimes not free at cycle N+1.

    Chip,
    No! You just start cogs with a stub program until you find two adjacent cogs, then stop the other cogs you started, and redirect the two adjacent cogs to the code required. It is quite simple. The chances of not finding two adjacent cogs out of 16 is small.
    To minimise latency via hub, it will be far more complex to find cooperating cogs because they have to be specifically spaced, and lower!
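
The stub-program approach can be modelled as follows. This is a single-threaded sketch under stated assumptions: `cognew` here is a toy function that claims the lowest-numbered free cog, `grab_adjacent_pair` is a made-up name, releasing a stub stands in for COGSTOP, and cog 15 is not treated as adjacent to cog 0.

```python
def cognew(busy):
    """Toy COGNEW: claim the lowest-numbered free cog, return its number or -1."""
    for n, in_use in enumerate(busy):
        if not in_use:
            busy[n] = True
            return n
    return -1

def grab_adjacent_pair(busy):
    """Start stub cogs until two of them are neighbours, then release the rest."""
    started = []
    pair = None
    while pair is None:
        cog = cognew(busy)
        if cog < 0:
            break                      # ran out of cogs without finding a pair
        if cog - 1 in started:
            pair = (cog - 1, cog)
        elif cog + 1 in started:
            pair = (cog, cog + 1)
        started.append(cog)
    for cog in started:
        if pair is None or cog not in pair:
            busy[cog] = False          # stands in for COGSTOP on the extra stubs
    return pair

# Cogs 0, 2 and 3 are already in use, so the first stub lands on cog 1.
busy = [False] * 16
for n in (0, 2, 3):
    busy[n] = True
pair = grab_adjacent_pair(busy)

# Fully "swiss-cheesed" usage: every even cog busy, so no neighbours exist.
cheese = [n % 2 == 0 for n in range(16)]
no_pair = grab_adjacent_pair(cheese)
```

Because each stub cog is owned from the moment it starts, nothing else can take it while the search continues, which sidesteps the look-then-start race. The second case shows the limitation raised elsewhere in the thread: when the free cogs are scattered, the search comes back empty, and while it runs it temporarily holds every free cog.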
  • Cluso99,
    Chip is the one who decided he didn't want this feature as is because of the reasons HE stated. So you should not be mad or disappointed at anyone for its "derailing" other than him. Claiming that it was all because of other forum users is silly. You are ignoring what Chip has said he doesn't like about it, and pushing forward a software solution that you think is easy, but most users would not. If Chip can come up with a way to do it that doesn't violate the fundamental thing about the Propeller cogs that HE feels is important enough to have, then he will. Trying to force him to do it in a way he's not happy with isn't going to work.

    I agree with you that a fast and low latency port/channel for cogs to talk to each other would enable some really great things, but let's try to solve it in a way that fits the Propeller, not bend the Propeller to fit it.
  • Cluso99Cluso99 Posts: 17,436
    Roy, I respect your comments, but you need to re-read the thread to see how shared LUT was derailed. Chip responded to those comments.
  • Cluso99Cluso99 Posts: 17,436
    I don't believe it is a requirement to...
    1. Make shared LUT be available to any cog. Currently, by specifically setting up your cog usage, share LUT between adjacent cogs would be plenty sufficient. As I pointed out earlier, a group of cogs can cooperate.
    If any LUT could be attached to any cog, we would require 16 busses of 32 data bits plus 9 address bits plus a read and write strobe and perhaps a ram enable bit too. This is way too much silicon to waste.

    I don't consider a shared LUT between adjacent cogs to be against the prop design philosophy. All cogs are still equal, they just also have a direct path between cogs via the LUT.

    Note: as each cog shares its LUT with the next higher, this means each cog can talk directly to the lower and the upper cog. What we have here is an extremely powerful method for cog cooperation.
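
The wiring estimate for the any-LUT-to-any-cog case can be tallied as a quick calculation. The counts per bus come from the post; treating "a read and write strobe" as two separate signals, counting the optional RAM enable, and assuming a single shared set of 32 data lines per bus are assumptions of this sketch.

```python
NUM_COGS = 16
DATA_BITS = 32
ADDRESS_BITS = 9       # 2**9 = 512 longs per LUT
STROBES = 2            # assumption: one read strobe and one write strobe
RAM_ENABLE = 1         # the "perhaps" RAM enable bit from the post

signals_per_bus = DATA_BITS + ADDRESS_BITS + STROBES + RAM_ENABLE
total_signals = NUM_COGS * signals_per_bus

print(signals_per_bus, total_signals)  # 44 signals per bus, 704 in total
```

On these assumptions a full crossbar needs on the order of 700 routed signals plus the selection logic, whereas the adjacent-only scheme needs each bus to reach just one neighbour.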
  • Sorry to chime in - just a lurker and with really low skills compared to all people here!
    What about a Prop2 with 8 'master' and 8 'slave' cogs? I mean, almost the same chip, just a different config somewhere in the hardware that let cognew allocate only 8 cog (for example the even ones). Every 'master' cog has the right to start the cog right after (the next odd one) with a (new) different instruction. I see lot of resources squandered (only 8 COGS instead of 16 just for having perhaps one or two paired), this way we would have 8 'coupled' super-fast communicating cogs instead of 16 'atomic' cogs, and those 8 cogs would be all the same. Perhaps this could be done with the same chip at boot time...?
  • Cluso99Cluso99 Posts: 17,436
    edited 2016-05-12 13:11
    Comparison of via hub and via LUT
    Via HUB...
    wrlong(2) + wait for slot(0-15) + rdlong(2) + wait for slot(0-15) + jump back if z(2/4?) = 6/8-36/38 clocks.
    If cog n passes to cog n+1, then the second wait will be 15 clocks, but if cog n+1 passes to cog n then the second could be 1 clock. There is a dependency here of the two instruction loop of rdlong wz and jnz $-1.
    Via LUT...
    wrlut(2) + rdlut(2) + jnz(2/4) = 6/8 clocks

    In both cases, I have presumed the rdlong/rdlut was executed at the precise clock to minimise the latency.


    Now consider the worst case (via hub) of 36/38 clocks, and my USB example. There is a two-way passing of data and the reply, meaning a single passing of data plus the reply could take 2 * 36/38 clocks = 72/76 clocks. This exceeds the 50 clock maximum without any processing.
    The worst case (via LUT) is 12/16 clocks. This gives 30+ clocks for processing.

    Hope this shows the considerable advantages of sharing LUT.
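
The clock arithmetic in the comparison above can be reproduced with a short calculation. The instruction costs, the 0-15 clock slot waits, and the 50-clock budget are taken from the post; the function names are made up for the sketch, and each result is a pair for the 2-clock and 4-clock jump cases.

```python
def hub_clocks(wait_wr, wait_rd, jump):
    """wrlong(2) + slot wait + rdlong(2) + slot wait + jump, per the post."""
    return 2 + wait_wr + 2 + wait_rd + jump

def lut_clocks(jump):
    """wrlut(2) + rdlut(2) + jump, per the post."""
    return 2 + 2 + jump

JUMPS = (2, 4)   # the post's "2/4" clocks for the jump
BUDGET = 50      # clock budget quoted for the USB example

hub_best = tuple(hub_clocks(0, 0, j) for j in JUMPS)      # no slot waits
hub_worst = tuple(hub_clocks(15, 15, j) for j in JUMPS)   # both waits at 15
lut_one_way = tuple(lut_clocks(j) for j in JUMPS)

hub_round_trip = tuple(2 * c for c in hub_worst)          # data plus reply
lut_round_trip = tuple(2 * c for c in lut_one_way)
lut_slack = tuple(BUDGET - c for c in lut_round_trip)     # clocks left to process

print(hub_best, hub_worst, hub_round_trip)   # (6, 8) (36, 38) (72, 76)
print(lut_one_way, lut_round_trip, lut_slack)  # (6, 8) (12, 16) (38, 34)
```

The worst-case hub round trip overshoots the 50-clock budget on its own, while the LUT round trip leaves 34 to 38 clocks, consistent with the "30+ clocks for processing" figure.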
  • kwinnkwinn Posts: 8,684
    Cluso99 wrote: »
    I don't believe it is a requirement to...
    1. Make shared LUT be available to any cog. Currently, by specifically setting up your cog usage, share LUT between adjacent cogs would be plenty sufficient. As I pointed out earlier, a group of cogs can cooperate.
    If any LUT could be attached to any cog, we would require 16 busses of 32 data bits plus 9 address bits plus a read and write strobe and perhaps a ram enable bit too. This is way too much silicon to waste.

    I don't consider a shared LUT between adjacent cogs to be against the prop design philosophy. All cogs are still equal, they just also have a direct path between cogs via the LUT.

    Note: as each cog shares its LUT with the next higher, this means each cog can talk directly to the lower and the upper cog. What we have here is an extremely powerful method for cog cooperation.

    Have to agree with Cluso on this one. The cogs are still equal if every cog can share a LUT with its neighboring cogs. Equality between cogs is good, but like any philosophy it can be detrimental if taken to an extreme.

    At the same time I am wondering if there is any way to enhance the eggbeater and/or the streamer to accomplish the LUT sharing between any cogs. After all the cogs that are sharing LUT data cannot be using the eggbeater or streamer at the same time. Perhaps something like the 16 long transfer between hub and cog, but from one LUT to another?

    I have neither the expertise nor have I been following this thread closely enough to know if it is a practical idea so I am throwing it out here and ducking for cover.
  • MJBMJB Posts: 1,200
    Cluso99 wrote: »
    I don't believe it is a requirement to...
    1. Make shared LUT be available to any cog. Currently, by specifically setting up your cog usage, share LUT between adjacent cogs would be plenty sufficient. As I pointed out earlier, a group of cogs can cooperate.
    If any LUT could be attached to any cog, we would require 16 busses of 32 data bits plus 9 address bits plus a read and write strobe and perhaps a ram enable bit too. This is way too much silicon to waste.

    I don't consider a shared LUT between adjacent cogs to be against the prop design philosophy. All cogs are still equal, they just also have a direct path between cogs via the LUT.

    Note: as each cog shares its LUT with the next higher, this means each cog can talk directly to the lower and the upper cog. What we have here is an extremely powerful method for cog cooperation.

    And while a shared conduit between all COGs would create another bottleneck that needs to be managed (even if AND-OR resolves HW conflicts), the 16 conduits between neighbours run truly in parallel, giving huge communication bandwidth if so desired. Otherwise they can simply be ignored.
  • At this point, I'm not seeing any new information being added to the discussion. There are very clearly two opposing points of view, which I'm sure Chip is very aware of. Unless someone has a new take on the issue, I say we give Chip some breathing room to mull it over and see what he comes up with (if anything). If he figures something out, great! If not, still great! Why? Because, either way, the design will be done and we will be one giant step closer to having the real thing!
  • I'll never be smart enough to use this to any advantage but I think it is a feature that has large potential usefulness but won't be widely used by the average programmer (unless in some high-speed infrastructure objects).

    I don't think it disrupts COG symmetry - each (and every) COG can quickly share data with neighbors. Use it to your advantage if you can find a use case, if not, don't use it...it doesn't impact your application if you don't use it.

    I don't think any application using the feature will blindly start firing off COGs as it starts and then decide to start up the sharing COGs. When any app starts, it is running only in one COG and until that COG decides to start others, it is in control. Being in control, it can start up a pair next to each other and make sure the condition is met. Whether it parks them until it needs them later, registers them as a reserved pair (for error recovery purposes) or lets them run off and do their co-dependent thing, COG0 is in charge until it starts cutting things loose with multi-processing. The true multi-processing system I am familiar with, Unisys 1100 series mainframes, always had a lead processor to orchestrate things and corral other processors if something came up. We had processor affinity that was software managed so I imagine the P2 could survive with co-processor affinity managed through software.

    This feature won't be for the faint of heart but it could be a big differentiator between the P2 and others and also could be something that someone figures out how to exploit in ways not even thought of yet.

    If it is a low-risk addition to the timeline, I think it is worth it.
  • RaymanRayman Posts: 11,828
    edited 2016-05-12 17:33
    I wonder if there's another way to use the 2nd port on the LUT ram...

    So, the streamer can move data back and forth between LUT and I/O pins or DAC.

    Maybe you could use the second port as a second way to move data between I/O pins and LUT?
    What about a command that causes transfer to/from LUT and port A or B?

    I don't know if that makes any sense or not... Just trying to think of alternatives...
  • If USB were not in the picture, would this even be an issue?
  • dMajodMajo Posts: 831
    edited 2016-05-12 18:01
    cgracey wrote: »
    rjo__ wrote: »
    About the OBEX... looking for two adjacent cogs can be done in software. Once that is done, a coginit can be done with the returned cog numbers. No cognew problems here!!!

    This is the problem. You may discover that two adjacent cogs are free, but when you go to start them, they may not be free, anymore. What was free at cycle N is sometimes not free at cycle N+1.

    Sorry @cgracey, may I have an explanation from the chip builder (in understandable wording) of where the issues are? I really can't understand.

    How are the cogs started? Usually in the main app from the programmer
    How are the obex objects started? Again usually in the main app from the programmer.

    Can the obex object dynamically start a cog? Yes, of course. And how many? Up to all cogs available. Perhaps it can require even more cogs than are available (because other objects have already allocated them), thus not meeting its requirements to operate. It depends on how many cogs are already in use. It depends on the order in which the programmer starts the various objects.
    I think that such behaviour should be documented in the object docs for others to use correctly.

    Now, suppose the start method of an object does a "cognew" for one cog and then a "coginit cogid+1" for the second, returning a status "done" when finished. Isn't this also a matter of documentation? Isn't it enough for such an object to state that no other objects should be launched until the procedure has completed? Or that this object should be launched first?
    If other objects (perhaps launched before) dynamically start other cogs, that can create issues, but such objects can potentially be incompatible or create issues even today, even on the P1, can't they?

    How are these two behaviours different?
  • Cluso99Cluso99 Posts: 17,436
    edited 2016-05-12 18:07
    If USB were not in the picture, would this even be an issue?
    Absolutely! It's a bugbear of the P1.
    In fact it has nothing to do with USB, but rather any code that could make better use of cogs by cooperating better. In particular, any protocol as fast as or faster than USB FS (12 Mbit/s) is a candidate for this type of use. USB FS is just being used as a real world example, not a contrived example.

    On the P1 I have used 4 tightly coupled cogs, executing at a staggered 1 clock apart. However, I just couldn't get data out of the cogs to other cogs for processing quick enough. This may have solved my problem.

    I just cannot agree that all cogs are still not equal.

    BTW I am also of the opinion that not all cogs need be equal, as long as there is a minimum of equality. In other words, I wouldn't consider one or more cogs with additional LUT RAM as a problem provided the cog equality is shown as the lowest common denominator, with some cogs having extra special features. Caveat: To access the special features may require specific code to utilise these additional features.