
Assembly Example of 2 Cogs Sharing an Array in Main Memory

  • Electronegativity Posts: 311
edited 2010-09-21 19:03 in Propeller 1
Hello everyone!

I am an SX retread that is getting the Propeller up to speed in my mind.
Attached is the Spin code of the example I would have liked to see when I first started out.

I tried to paste the text of the code into this post for easier viewing, but the result was so hideous that H. P. Lovecraft would stand gaping in awe at the sight of it.

I hope that someone will benefit from it, and would appreciate feedback from the Gurus if I have stated something that is incorrect.

Comments

  • bill190 Posts: 769
    edited 2010-09-18 08:20
    Thank you and very good timing posting this, as I am about to run out of assembly memory for my project. And I need to move some stuff to Spin or separate cogs.

    The above also shows how you can have two separate assembly programs in the same file. Note there are two org 0's.

    And I would assume that each separate program would have the full per cog assembly program memory space available?

    One question I have...

    If you have a separate assembly program running on a separate cog like the above...

    org 0
    ReadAndLight

    Is that program loaded into the cog assembly memory only at runtime?

    Then would it be overwritten if the cog was stopped and another program loaded into that cog?

    Or worded another way, could I have several separate assembly programs which together total up to more assembly memory space usage than is available, but I would only need to call one program at a time, and each separate program would fit ok?

    (For example I have 20 buttons. And each button would run a different program. And I would only need to press one button at a time. And each program would run for less than 2 seconds, then quit.)
  • Heater. Posts: 21,230
    edited 2010-09-18 08:33
    Bill190:
    And I would assume that each separate program would have the full per cog assembly program memory space available?

    Yes.
    Is that program loaded into the cog assembly memory only at runtime?

    Nothing is ever loaded into a COG unless your program commands it with COGNEW in Spin, or COGNEW or COGINIT in PASM.
    Except for the initial start-up, when the Spin interpreter is automatically loaded into COG 0.
    Then would it be overwritten if the cog was stopped and another program loaded into that cog?

    A Cog will only stop if it is running Spin code and runs off the end of its method, or if you call COGSTOP with that cog's ID.

    Or worded another way, could I have several separate assembly programs which together total up to more assembly memory space usage than is available, but I would only need to call one program at a time, and each separate program would fit ok?

    Indeed you could.

    But bear in mind it takes 100 µs or so to load and start a COG, so this is slow if you need high-speed PASM.
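
    For the 20-button scenario, reusing one cog slot for different PASM routines might look something like this in Spin (a rough sketch: ReadAndLight is the label from the attached example, while DoButton2 and SharedArray are just placeholders for a second DAT routine and the hub buffer passed in PAR):

    VAR
      long cog
      long SharedArray[16]                        ' whatever hub buffer the PASM routines work on

    PUB Demo
      cog := cognew(@ReadAndLight, @SharedArray)  ' loading and starting takes roughly 100us
      waitcnt(clkfreq * 2 + cnt)                  ' let it run for a couple of seconds
      cogstop(cog)                                ' stop it, freeing that cog
      coginit(cog, @DoButton2, @SharedArray)      ' load a different DAT program into the same cog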
  • TinkersALot Posts: 535
    edited 2010-09-18 08:40
    Just a question (having recently been through a similar exercise):

    If the 2 bytes in your shared array are to be treated as a single atom, then should the access to the array be guarded by locks?

    For example, if cogx gets clocked out after accessing one byte of a two byte array operation, then cogy comes around and changes one or the other of the array bytes, is the shared memory "corrupted?"

    This is a great approach, but I'm just asking in case there is the possibility of a race condition in your final application.
  • Heater. Posts: 21,230
    edited 2010-09-18 09:24
    If the two bytes are accessed one byte at a time but their values are actually related, like maybe an x, y, coordinate pair, then there is the possibility of a race condition.

    In that case one can simply read and write the two bytes as a word which is an atomic operation on HUB memory.

    If you have a more complicated data structure being shared then you may need locks to prevent races and corruption.

    I have yet to use locks because in many cases they are not necessary.

    E.g. if only one COG writes data to a structure and only one COG reads that data, then a simple flag can be used to determine who can read and write at any given time.

    Or there are examples like FullDuplexSerial which share cyclic buffers between the reader COG and the writer COG. Access to the buffer pointers is done in such a way that no corruption can occur.
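
    For the x,y pair case, a minimal sketch in Spin (names are just for illustration):

    VAR
      word coord                     ' x in the low byte, y in the high byte

    PUB SetCoord(x, y)               ' writer cog: one hub word write updates the whole pair
      coord := (y << 8) | x

    PUB GetCoord : xy                ' reader cog: one hub word read fetches the whole pair
      xy := coord                    ' x is xy & $FF, y is xy >> 8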
  • Ron Czapala Posts: 2,418
    edited 2010-09-18 19:05
    What is a "race condition"?
  • Harley Posts: 997
    edited 2010-09-18 19:12
    I like your avatar. I just got a 4.3" color LCD working with a Prop Protoboard and realized things I'd not noticed a couple of years ago on a TV display.

    I think a 'race condition' is about reading two values that aren't synchronized properly, so if they are x,y pairs, the two wouldn't belong to the same 'pixel' or point. I think that is what was meant, no?
  • TinkersALot Posts: 535
    edited 2010-09-18 20:41
    Let me see if I can explain a "race condition". A race condition can occur if there are two or more processes/threads-of-execution that attempt to operate on the same data. Consider two processes A and B that deal with the same set of data. Further, assume that process-A will always write the data, but that the data write is not guaranteed to be atomic. If process-A (which writes data) can be interrupted at some point during a write operation (before it can complete the entire write operation), then the data is not completely written by process-A and is not guaranteed to be coherent.

    Further, consider that while process-A is suspended, process-B attempts to read the data (further consider that process-B always considers this data to be complete and consistent) and operates against it. Given the conditions that process-A encountered when writing the data, the results that process-B reads are not guaranteed to be coherent and therefore can be unreliable for operation.

    Now, under these circumstances, the validity of the data read by process-B is entirely dependent on when process-A completes its data write, as compared to when process-B attempts to read the data.

    This set of conditions and side effects are typically termed "race conditions" because process-A is trying to "race" process-B and complete its data write prior to process-B attempting to read it.

    Now, consider that process-A and process-B are programs running in separate cogs, and that these two programs share data through a common buffer as described in this thread. If the data write is non-atomic (cannot be guaranteed to be completed as a single coherent item of work), then race conditions can occur between the two cogs and their operations against the data can become unreliable.

    However, as Heater pointed out, there are types of writes that are atomic, and using them may head off this condition entirely. But, as he also points out, if more complex write operations are involved (say against some form of "structured data buffers") and those writes are not atomic, then a race condition could be created.

    Hope that my attempt at explaining what a race condition is, and what can contribute to one, is helpful.
  • Electronegativity Posts: 311
    edited 2010-09-18 21:08
    If you read the comments in the CON section of the code I posted you will see a statement to the effect that the code loops endlessly to avoid problems where one Cog reads before the other writes.

    In practical code this could be an issue.
    There are "lock" operations provided for this purpose, but as Heater mentioned earlier, it is easy to prevent this problem with a single flag.

    Take one bit of main memory to be the flag.
    If the bit is zero then it is ok to read, if it is one then it is ok to write.
    When a Cog is done reading it sets the flag to one, and if a Cog is done writing it sets the flag to zero.

    The locks would come in handy if you had say 6 Cogs all trying to read and write the same location of RAM at the same time. In that case a simple flag would not suffice, but with well written code I suspect that the problem could still be addressed with a register containing several flags.

    This will slow things down a little because you are taking an extra Hub operation to check the status of the flag, but it's hard to imagine a case where this would cause a perceptible performance issue.
  • TinkersALot Posts: 535
    edited 2010-09-18 21:18
    Don't get me wrong. I like the solution as a means to exchange data between cogs. It is likely fine and will work well. Further I know that several others have raised questions about how to share data between cogs.

    I do have one question though about the scenario you describe (I'm hoping I'll learn something here).

    Is it possible for a cog to read the data and then get clocked out before the flag is modified to reflect that state, and if so, what happens?

    A related question can be asked about the write side: what if the cog doing the writing is clocked out before it can modify the flag?

    The reason I ask this is that I am currently using locks (though I am using a slightly more complex "data structure" as my shared data between cogs), and I'm wondering if I need to rethink the need for locks.
  • Electronegativity Posts: 311
    edited 2010-09-18 21:33
    Hi TinkersALot.

    I'm not really sure what you mean by a Cog getting "clocked out before the flag is modified".

    The Cog modifies the flag after it is done reading or writing.

    Say you have an array of 64 bytes in main RAM.
    It requires 16 separate Hub operations for a Cog to completely transfer the array to its local RAM (1 long = 4 bytes per rdlong instruction).

    The 17th Hub operation sets the flag.

    Let's say another Cog is ready to write to the shared main memory.
    It reads the bit. If the flag is zero it reads the bit again, but if it is one then it writes to the array. After it is done writing it sets the bit to zero.

    The reading cog reads the bit and if it is one it reads it again. If it is zero then it reads the array.
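
    In Spin terms, the reader side of that looks something like this (a sketch: in the attached PASM the sixteen hub reads are rdlongs into cog registers, so longmove here just stands in for them):

    VAR
      byte flag                       ' 0 = ok to read, 1 = ok to write
      long shared[16]                 ' the 64-byte array shared in main RAM
      long local[16]                  ' the reader's private copy

    PUB Reader
      repeat
        repeat until flag == 0        ' wait for the writer to finish
        longmove(@local, @shared, 16) ' 16 hub operations to copy the array
        flag := 1                     ' the 17th hub operation hands it back to the writer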
  • TinkersALot Posts: 535
    edited 2010-09-19 09:32
    Hello Electronegativity,

    Apologies for the inaccurate-sounding term; "clocked out" was meant to describe the time-slicing mechanism that determines when each cog gets access to shared hub memory.
  • Electronegativity Posts: 311
    edited 2010-09-19 10:02
    I'm still a bit confused about your question.
    What do you mean by "time slicing mechanism"?

    Each Cog in turn is given a chance to access the shared Hub memory.
    The Hub polls each Cog one at a time:

    "Cog1, do you want to perform a Hub operation?"
    "Cog2 do you want to perform a Hub operation?"
    Etc.

    The window of opportunity moves to Cog2 regardless of whether Cog1 performed an operation or not.

    Below is the flag method in a little more detail.
    The flag is initialized to 1; a value of 0 means it is ok to read and a value of 1 means it is ok to write.
    Cog1 is the reader and Cog2 is the writer in the following example using a 2 element array:

    Cog1 reads the flag and sees that it is a 1 (no read).
    Cog2 reads the flag and sees that it is a 1 (ok to write).

    Cog1 reads the flag and sees that it is a 1 (no read).
    Cog2 writes the first array element to main memory.

    Cog1 reads the flag and sees that it is a 1 (no read).
    Cog2 writes the second array element to main memory.

    Cog1 reads the flag and sees that it is a 1 (no read).
    Cog2 sets the flag value to 0.

    Cog1 reads the flag and sees that it is a 0 (ok to read).
    Cog2 reads the flag and sees that it is a 0 (no write).

    Cog1 reads the first array element.
    Cog2 reads the flag and sees that it is a 0 (no write).

    Cog1 reads the second array element.
    Cog2 reads the flag and sees that it is a 0 (no write).

    Cog1 sets the flag back to 1.
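
    Put into Spin, those two loops would look something like this (a sketch: assume Writer and Reader are each launched into their own cog with cognew, and the incrementing pair is just dummy data):

    DAT
      flag    byte 1                 ' 1 = ok to write, 0 = ok to read (starts at 1)
      shared  long 0[2]              ' the 2-element shared array in main memory

    PUB Writer | i                   ' Cog2
      i := 0
      repeat
        repeat until flag == 1       ' no write until the reader is done
        shared[0] := i               ' first array element
        shared[1] := i + 1           ' second array element
        i += 2
        flag := 0                    ' hand the array to the reader

    PUB Reader | a, b                ' Cog1
      repeat
        repeat until flag == 0       ' no read until the writer is done
        a := shared[0]               ' fetch both elements into local copies
        b := shared[1]
        flag := 1                    ' hand the array back to the writer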
  • TinkersALot Posts: 535
    edited 2010-09-19 10:57
    Hi Electronegativity,

    I understand your analysis, and it does raise a few more questions in my mind:

    1. The flag seems to be performing the equivalent function of a semaphore (lock), in that the flag setting is allowing/blocking each cog from doing work against this shared buffer.

    2. This program, though it is running in multiple cogs, seems to behave the same as a single-threaded (single cog) program with regard to the shared buffer, because the control flag strictly enforces single-cog access to the data in the shared buffer. This is fine if the cogs can do other work over and above what is in the buffer. But if their primary function is to write to or read from this data buffer and operate against it, it seems they would spend the majority of their time just waiting for access.

    3. I am wondering if this alternative (using locks) is accurate:

    cog 1 and cog2 both vie for lock
    only one of them gains it
    the "winning cog" operates on all bytes
    the "winning cog" releases the lock allowing the other cog to gain access
    the "other cog" operates on all bytes
    the "other cog" releases the lock

    loop forever as described above

    Is this lock example close to the same effect that the access flag has?

    I'm not trying to pick nits here; as I said, I'm just trying to get as firm an understanding as I can around the topic of buffers shared between cogs. Plus, I think it is an interesting topic and I'm glad you introduced it along with a working sample.
  • Electronegativity Posts: 311
    edited 2010-09-19 14:17
    I think the execution would be the same if you used locks instead of flags.

    Allowing the cogs to both vie for the lock would be identical to randomly initializing the flag in the example above.

    A flag value of 1 is the same as a lock placed by the writing cog, and a flag value of zero is the same as a lock placed by the reading cog.

    One solution to the problem you mention in point #2, of having the cogs spend most of their time waiting for access, is to use a back buffer like the one employed by DirectX.

    Suppose you have a chunk of main RAM that represents pixel colors to be written to a screen. One cog will read the array and output the values to the screen, and the second will fill the memory with the next picture.

    The way to keep both cogs going at once is to have 2 copies of the array in memory.

    Cog1 writes the values in Array1 to the screen while Cog2 is filling Array2 with the next picture. Then Cog1 writes the values in Array2 to the screen while Cog2 fills Array1 with the next picture.
    With properly written code there would be no need to have either Cog spin its wheels.
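
    A sketch of that back-buffer idea in Spin (the buffer size, OutputFrame, and DrawNextFrame are all made up for illustration):

    CON
      PIXELS = 256                      ' size of one frame buffer, made up for the sketch

    DAT
      show    byte 0                    ' which buffer the display cog should read: 0 or 1
      buf0    long 0[PIXELS]
      buf1    long 0[PIXELS]

    PUB Display                         ' Cog1: always reads the finished buffer
      repeat
        if show == 0
          OutputFrame(@buf0)
        else
          OutputFrame(@buf1)

    PUB Render                          ' Cog2: always draws into the buffer Cog1 is NOT reading
      repeat
        if show == 0
          DrawNextFrame(@buf1)
        else
          DrawNextFrame(@buf0)
        show ^= 1                       ' flip once the new frame is complete; a real driver would
                                        ' also wait for the display cog to finish its current frame

    PRI OutputFrame(p)
      waitcnt(clkfreq / 60 + cnt)       ' placeholder for shifting the frame at p out to the screen

    PRI DrawNextFrame(p)
      longfill(p, $FF_FF_FF_FF, PIXELS) ' placeholder for drawing the next picture into p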
  • mpark Posts: 1,305
    edited 2010-09-19 15:50
    I think the execution would be the same if you used locks instead of flags.

    Electronegativity's flag example requires that each multi-element write (by cog2) be followed by a multi-element read (by cog1), and that each read be followed by a write.

    TinkersALot's lock alternative allows the writing cog to perform multiple multi-element writes without waiting for the reading cog to read (and vice versa).

    The two approaches behave differently. Each may be correct for a particular situation.
  • Electronegativity Posts: 311
    edited 2010-09-19 18:06
    Hi mpark.

    The flag method also allows multiple writes without waiting for a read.

    Here is a paste from the example above with Cog2's actions bolded:

    Cog1 reads the flag and sees that it is a 1 (no read).
    Cog2 reads the flag and sees that it is a 1 (ok to write).

    Cog1 reads the flag and sees that it is a 1 (no read).
    Cog2 writes the first array element to main memory.

    Cog1 reads the flag and sees that it is a 1 (no read).
    Cog2 writes the second array element to main memory.

    Cog1 reads the flag and sees that it is a 1 (no read).
    Cog2 sets the flag value to 0.

    Once Cog2 gets the go ahead flag it executes consecutive writes until it is done writing, then it sets the flag to zero and starts checking it again.
  • mpark Posts: 1,305
    edited 2010-09-19 19:21
    Cog2 sets the flag value to 0.

    Once Cog2 gets the go ahead flag it executes consecutive writes until it is done writing, then it sets the flag to zero and starts checking it again.

    If I understand your protocol correctly, once cog2 sets the flag to zero, it cannot perform any more writes until the reading cog (cog1) sets the flag to one.

    In other words, at some point cog2's progress depends on cog1.

    In TinkersALot's system, the writing cog and the reading cog are independent. The writing cog can continue to write to memory no matter what the reading cog is doing or not doing.
  • Electronegativity Posts: 311
    edited 2010-09-19 20:32
    mpark wrote:
    In TinkersALot's system, the writing cog and the reading cog are independent. The writing cog can continue to write to memory no matter what the reading cog is doing or not doing.

    If this is what you want to happen then there is no need for any lock or flag.
    The attachment to my original post contains code that does exactly what you are describing.

    The problem occurs when a Cog is in the process of reading a long array while another Cog is writing to the same array. It is possible that a value could be written and then overwritten with a different value before the reading cog has a chance to fetch the initial value.

    This is prevented by a lock or flag that prevents reading until the entire array is written, or prevents writing until the entire array is read.

    The lock has a small advantage over the flag because it uses a hardware register to hold the bit instead of using a memory location. So you have one more bit of main memory available if you use the lock.

    The lock is also more intuitive for someone that is reading the code.

    Other than those two caveats, the final result is the same whether you let the hardware do it (lock), or code it yourself (flag).
  • mpark Posts: 1,305
    edited 2010-09-19 21:24
    If this is what you want to happen then there is no need for any lock or flag.
    The attachment to my original post contains code that does exactly what you are describing.

    The problem occurs when a Cog is in the process of reading a long array while another Cog is writing to the same array. It is possible that a value could be written and then overwritten with a different value before the reading cog has a chance to fetch the initial value.

    This is prevented by a lock or flag that prevents reading until the entire array is written, or prevents writing until the entire array is read.

    The lock has a small advantage over the flag because it uses a hardware register to hold the bit instead of using a memory location. So you have one more bit of main memory available if you use the lock.

    The lock is also more intuitive for someone that is reading the code.

    Other than those two caveats, the final result is the same whether you let the hardware do it (lock), or code it yourself (flag).

    I was merely trying to point out that the two described approaches behave differently, contrary to your assertion that they would execute the same. Both systems prevent reader and writer cogs from concurrent access to shared memory; however, yours also imposes an ordering on how the cogs take turns reading and writing. That's all.

    That being said, I wonder if you aren't missing the point of locks when you bring up their small benefits but neglect to mention the very big advantage they have over flags, namely an atomic test-and-set operation.
  • Electronegativity Posts: 311
    edited 2010-09-19 21:41
    Greetings mpark!

    The ordering of how the Cogs take turns reading and writing was certainly deliberate.
    I want the writing to finish before the reading begins.

    Do you have code examples where it is advantageous to read a value before it is written?

    I must admit that I am a bit of a flag man, and may have missed out on some of the advantages of locking.

    Perhaps you will be kind enough to provide us with a practical example of the advantages conferred by atomic test-and-set operations.
  • Heater. Posts: 21,230
    edited 2010-09-19 22:19
    Electronegativity, the point about locks is that if you have multiple writers to a data structure and multiple readers, you will find that it is not possible to keep reads and writes separated and the data coherent without the atomic read/write provided by locks in hardware.
  • TinkersALot Posts: 535
    edited 2010-09-20 10:16
    We should not leave this well-proven mantra out of consideration:

    decouple, decouple, decouple
  • mpark Posts: 1,305
    edited 2010-09-20 11:00
    Electronegativity, consider two cogs that increment a common memory location at random times. You need some sort of mechanism to prevent the cogs from interfering with each other's update operation (read-increment-write). That's what a lock is for, and a flag just won't do.
  • Electronegativity Posts: 311
    edited 2010-09-20 18:19
    I didn't think about locks at all until this thread so it isn't surprising that I missed something.

    So if Cog1 reads and increments while it is waiting for the Hub to come back around, then Cog2 reads and increments while it is waiting for the Hub to come back around, then Cog1 writes, then Cog2 writes, the value is one less than it should be.

    Cog1 can't simply set the flag to tell Cog2 it is done, because if Cog1 needed to increment the value twice in a row it would be prevented by the flag from doing so until Cog2 set it back.

    If that is actually the problem you are describing then how does using the lock instructions solve it?
  • Heater. Posts: 21,230
    edited 2010-09-20 21:15
    Yes, you have described the problem well. Looks like you understand it well enough to easily grasp the description of lock operation in the Prop manual.
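
    For the shared-counter case, the lock-guarded increment comes out something like this in Spin (a sketch: the lock id is assumed to come from LOCKNEW and the counter address from the launching object):

    PUB BumpCounter(lockid, counterAddr)
      repeat until not lockset(lockid)   ' spin until the lock was previously clear, i.e. we now own it
      long[counterAddr]++                ' the read-increment-write is now exclusive to this cog
      lockclr(lockid)                    ' release the lock so the other cog can take its turn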
  • LukeH Posts: 22
    edited 2010-09-20 23:18
    mpark wrote:
    That being said, I wonder if you aren't missing the point of locks when you bring up their small benefits but neglect to mention the very big advantage they have over flags, namely an atomic test-and-set operation.

    You beat me to it! I was just going to say this. In order to truly deconflict hub memory access, you must use locks, even if only two cogs are sharing data.

    The magic of the lock operation is that LOCKSET simultaneously sets the lock and returns its previous state. You'll notice there is no LOCKREAD instruction; this is intentional! For lock operations to work you have to use LOCKSET.

    "But why am I telling a cog to SET the lock if it doesn't yet own the shared variable?" Because the LOCKSET command will not "unlock" the lock; only the LOCKCLR can do that, and in properly written code, only the cog currently operating on the shared variable should execute LOCKCLR. It's the returned value that makes LOCKSET powerful by telling the cog which just attempted to set the lock that the lock was already set, and it should thus try again.

    If you use a flag, remember a cog can only do one hub operation per access window. This includes reading the flag! So a cog may read a "0" in your flag bit, but now it has to wait for the next cycle to set the flag bit to "1", then another cycle to start working on the shared memory. Meanwhile the hub cycle steps through all the other cogs, which all read the flag bit and all see zero... so they too prepare to write the flag bit on the next pass, then start working on the data. Result: no deconfliction between cog read/writes on hub memory, and the data gets clobbered. This can even happen with only two cogs sharing memory, particularly if one cog accesses the hub variable much more frequently than the other, or if one is running ASM and the other Spin. I think this is what you figured out yourself, Electro.

    I too scoffed at locks until I figured this out. There really are some tricky "traps" like this in Propeller programming, which are not self-evident until you either peruse the front of the Propeller manual, or learn them the hard way. It does appear that on the Propeller forums, "scofflock" is a popular sentiment. But I know better now!

    Locks are so easy and elegant... the simple structures below deconflict everything perfectly, and scale up to any number of cogs needing access to the same resource:

    ' in Spin:
    
    repeat until not lockset(lockid)
    <work on shared memory>
    lockclr(lockid)
    
    ' or in ASM:
    
    :lock   lockset  lockid  wc
      if_c  jmp      #:lock
    <work on shared memory>
            lockclr  lockid
    

    The "owning" object should initially check out the lock with LOCKNEW, then pass its ID to all other objects it spawns that will need to work on the same data. From there, the first cog to do a LOCKSET wins, and it's all nicely sorted out thereafter. The "owning" object should also be the only one to call LOCKRET when all the other objects/cogs are done or stopped. This is the other half of using locks effectively: that of writing disciplined code. Though any object/cog can arbitrarily execute LOCKCLR, the only one that should ever do so should be the one which "won" the latest LOCKSET contest. To ensure this, never put a LOCKCLR before a LOCKSET, and always make sure a LOCKSET is successful (returns 0 or false) before proceeding.
  • TinkersALot Posts: 535
    edited 2010-09-20 23:38
    LukeH, that is a very lucid description of what I was scratching at with "clocked out" or "time sliced", and your explanation is extremely helpful in adding to my understanding of what the Propeller is doing internally. This explanation really enhanced what I had previously understood (but admittedly at a more conceptual level).
  • Heater. Posts: 21,230
    edited 2010-09-20 23:54
    This problem of "atomic operations" is not unique to the propeller and its cogs. It is a general problem for all machines where two or more processors can access a resource. Or where a single processor supports "preemptive threads" ie a thread can be interrupted at any point and another thread takes over execution.

    Of course, nowadays all this messy detail is hidden in modern programming languages to a large extent, and programmers can fail to realize the fundamental need for hardware support in the CPU to achieve exclusion.

    Don't forget though that locks are not needed in many common situations. As in when there is only one writer to a data structure and only one reader.
  • Electronegativity Posts: 311
    edited 2010-09-21 19:03
    Thanks LukeH, I get it now.