Assembly Example of 2 Cogs Sharing an Array in Main Memory
Electronegativity
Hello everyone!
I am an SX retread that is getting the Propeller up to speed in my mind.
Attached is the spin code of the example I would have liked to see when I first started out.
I tried to paste the text of the code into this post for easier viewing, but the result was so hideous that H.P. Lovecraft would stand gaping in awe at the sight of it.
I hope that someone will benefit from it, and would appreciate feedback from the Gurus if I have stated something that is incorrect.
Comments
The above also shows how you can have two separate assembly programs in the same file. Note there are two org 0's.
And I would assume that each separate program would have the full per cog assembly program memory space available?
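For reference, a file holding two separate PASM programs might be laid out roughly like this (just a minimal sketch; ReadAndLight is the label from the attached code, the other names are made up):

VAR
  long sharedArray[16]

PUB Main
  cognew(@ReadAndLight, @sharedArray)     ' first assembly program gets its own cog
  cognew(@WriteAndBlink, @sharedArray)    ' second assembly program gets another cog

DAT
              org     0                   ' first program assembles from cog address 0
ReadAndLight
              ' ... PASM for the first cog ...
              jmp     #ReadAndLight

              org     0                   ' second program also assembles from cog address 0
WriteAndBlink
              ' ... PASM for the second cog ...
              jmp     #WriteAndBlink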
One question I have...
If you have a separate assembly program running on a separate cog like the above...
org 0
ReadAndLight
Is that program loaded into the cog assembly memory only at runtime?
Then would it be overwritten if the cog was stopped and another program loaded into that cog?
Or worded another way, could I have several separate assembly programs which together total up to more assembly memory space usage than is available, but I would only need to call one program at a time, and each separate program would fit ok?
(For example I have 20 buttons. And each button would run a different program. And I would only need to press one button at a time. And each program would run for less than 2 seconds, then quit.)
Yes.
Nothing is ever loaded into a cog unless your program commands it with COGNEW (or COGINIT) in Spin, or COGINIT in PASM.
Except for the initial start up when the Spin interpreter is automatically loaded into COG 0.
A Cog will only stop if it is running Spin code and runs off the end of its method, or if you call COGSTOP with that cog's ID.
Indeed you could.
But bear in mind that it takes 100 µs or so to load and start a cog, so this is slow if you need high-speed PASM.
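For the button scenario, something along these lines would work (a rough Spin sketch with made-up names; error handling omitted):

VAR
  long cogID
  long sharedArray[16]

PUB RunButtonProgram
  ' copy the PASM image from hub RAM into a free cog and start it
  cogID := cognew(@ReadAndLight, @sharedArray) + 1
  waitcnt(clkfreq * 2 + cnt)              ' let it run for up to 2 seconds
  if cogID
    cogstop(cogID - 1)                    ' free the cog so the next button can load a different image
    cogID := 0

DAT
              org     0
ReadAndLight
              ' ... one of the 20 button programs ...
              jmp     #ReadAndLight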
If the 2 bytes in your shared array are to be treated as a single atom, then should the access to the array be guarded by locks?
For example, if cogx gets clocked out after accessing one byte of a two byte array operation, then cogy comes around and changes one or the other of the array bytes, is the shared memory "corrupted?"
This is a great approach, but I'm just asking in case there is the possibility of a race condition in your final application.
In that case one can simply read and write the two bytes as a word, which is an atomic operation on HUB memory.
If you have a more complicated data structure being shared then you may need locks to prevent races and corruption.
I have yet to use locks because in many cases they are not necessary.
E.g. if only one COG writes data to a structure and only one COG reads that data, then a simple flag can be used to determine who can read and write at any given time.
Or there are examples like FullDuplexSerial which share cyclic buffers between the reader COG and the writer COG. Access to the buffer pointers is done in such a way that no corruption can occur.
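For the two-byte case mentioned above, here is a sketch of reading and writing the pair as one word in Spin (sharedPair is a hypothetical name):

VAR
  word sharedPair

PUB WritePair(x, y)
  sharedPair := x | (y << 8)      ' both bytes land in hub RAM in one atomic word write

PUB ReadPair : pair
  pair := sharedPair              ' both bytes come back in one atomic word read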
I think that 'race condition' refers to reading two values that are not synchronized properly; say, with x,y pairs, the two values would not belong to the same 'pixel' or point. I think that is what was meant, no?
Further, consider that while process-A is suspended, process-B attempts to read the data (further consider that process-B always assumes this data to be complete and consistent) and operates against it. Given the conditions that process-A encountered when writing the data, the data that process-B reads is not guaranteed to be coherent and can therefore be unreliable to operate on.
Now, under these circumstances, the validity of the data read by process-B is entirely dependent on when process-A completes its data write, as compared to when process-B attempts to read the data.
This set of conditions and side effects is typically termed a "race condition" because process-A is trying to "race" process-B and complete its data write before process-B attempts to read it.
Now, consider that process-A and process-B are programs running in separate cogs, and that these two programs share data through a common buffer as described in this thread. If the data write is non-atomic (cannot be guaranteed to be completed as a single coherent item of work), then race conditions can occur between the two cogs and their operations against the data can become unreliable.
However, as heater pointed out, there are types of writes that are atomic, and they may be considered to head off this condition entirely. But, as he also points out, if you perform more complex write operations (say, to some form of "structured data buffer") and those writes are not atomic, then a race condition could be created.
Hope that my attempt at explaining what a race condition is, and what can contribute to them, is helpful.
In practical code this could be an issue.
There are "lock" operations provided for this purpose, but as Heater mentioned earlier, it is easy to prevent this problem with a single flag.
Take one bit of main memory to be the flag.
If the bit is zero then it is ok to read, if it is one then it is ok to write.
When a Cog is done reading it sets the flag to one, and if a Cog is done writing it sets the flag to zero.
The locks would come in handy if you had say 6 Cogs all trying to read and write the same location of RAM at the same time. In that case a simple flag would not suffice, but with well written code I suspect that the problem could still be addressed with a register containing several flags.
This will slow things down a little because you are taking an extra Hub operation to check the status of the flag, but it's hard to imagine a case where this would cause a perceptible performance issue.
I do have one question though about the scenario you describe (I'm hoping I'll learn something here).
Is it possible for a cog to read the data and then get clocked out before the flag is modified to reflect that state? And if so, what happens?
A related question: what happens when the data is written, but the cog doing the writing is clocked out before it can modify the flag?
The reason I ask this is that I am currently using locks (though I am using a slightly more complex "data structure" as my shared data between cogs), and I'm wondering if I need to rethink the need for locks.
I'm not really sure what you mean by a Cog getting "clocked out before the flag is modified".
The Cog modifies the flag after it is done reading or writing.
Say you have an array of 64 bytes in main RAM.
It requires 16 separate Hub operations for a Cog to completely transfer the array to its local RAM (1 long = 4 bytes per rdlong instruction).
The 17th Hub operation sets the flag.
Let's say another Cog is ready to write to the shared main memory.
It reads the bit. If the flag is zero it reads the bit again, but if it is one then it writes to the array. After it is done writing it sets the bit to zero.
The reading cog reads the bit and if it is one it reads it again. If it is zero then it reads the array.
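In PASM the reading cog might look something like this (a minimal sketch; flagPtr and arrayPtr would be patched with the hub addresses before the cog is launched):

DAT
              org     0
Reader        rdlong  temp, flagPtr          ' poll the flag
              tjnz    temp, #Reader          ' nonzero = the writer still owns the buffer
              movd    :store, #localBuf      ' point the destination at the cog-RAM buffer
              mov     count, #16             ' 64 bytes = 16 longs
              mov     hubPtr, arrayPtr
:copy         rdlong  temp, hubPtr           ' one long (4 bytes) per rdlong / hub window
:store        mov     0-0, temp              ' store it locally (destination field patched above)
              add     :store, dstInc         ' advance the destination by one register
              add     hubPtr, #4
              djnz    count, #:copy
              mov     temp, #1
              wrlong  temp, flagPtr          ' 17th hub operation: hand the buffer back to the writer
              jmp     #Reader

dstInc        long    1 << 9                 ' adds 1 to the destination field of :store
flagPtr       long    0                      ' hub address of the flag (patched before launch)
arrayPtr      long    0                      ' hub address of the 64-byte array (patched before launch)
temp          long    0
count         long    0
hubPtr        long    0
localBuf      res     16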
Apologies for the inaccurate sounding term; "clocked out" was meant to describe the time slicing mechanism that occurs between cogs and their access to shared hub memory.
What do you mean by "time slicing mechanism"?
Each Cog in turn is given a chance to access to the shared Hub memory.
The Hub polls each Cog one at a time:
"Cog1, do you want to perform a Hub operation?"
"Cog2 do you want to perform a Hub operation?"
Etc.
The window of opportunity moves to Cog2 regardless of whether Cog1 performed an operation or not.
Below is the flag method in a little more detail.
The flag is initialized to 1 and is set to 0 if it is ok to read and 1 if it's ok to write.
Cog1 is the reader and Cog2 is the writer in the following example using a 2 element array:
Cog1 reads the flag and sees that it is a 1 (no read).
Cog2 reads the flag and sees that it is a 1 (ok to write).
Cog1 reads the flag and sees that it is a 1 (no read).
Cog2 writes the first array element to main memory.
Cog1 reads the flag and sees that it is a 1 (no read).
Cog2 writes the second array element to main memory.
Cog1 reads the flag and sees that it is a 1 (no read).
Cog2 sets the flag value to 0.
Cog1 reads the flag and sees that it is a 0 (ok to read).
Cog2 reads the flag and sees that it is a 0 (no write).
Cog1 reads the first array element.
Cog2 reads the flag and sees that it is a 0 (no write).
Cog1 reads the second array element.
Cog2 reads the flag and sees that it is a 0 (no write).
Cog1 sets the flag back to 1.
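In Spin, that sequence boils down to something like this (a minimal sketch; the method and variable names are made up):

VAR
  long flag
  long shared[2]
  long stk1[16], stk2[16]

PUB Main
  flag := 1                        ' 1 = OK to write, 0 = OK to read
  cognew(Writer, @stk1)
  cognew(Reader, @stk2)

PRI Writer | i
  repeat
    repeat until flag == 1         ' wait until the reader has handed the buffer back
    shared[0] := i++               ' write the first array element
    shared[1] := i++               ' write the second array element
    flag := 0                      ' hand the buffer to the reader

PRI Reader | a, b
  repeat
    repeat until flag == 0         ' wait until the writer is done
    a := shared[0]                 ' read the first array element
    b := shared[1]                 ' read the second array element
    flag := 1                      ' hand the buffer back to the writer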
I understand your analysis, and it does raise a few more questions in my mind:
1. The flag seems to be performing the equivalent function of a semaphore (lock), in that the flag setting is allowing/blocking each cog from doing work against this shared buffer.
2. This program, though it is running in multiple cogs, seems to behave the same as a single threaded (single cog) program with regard to the shared buffer, because of the strictly enforced (by the control flag) singular process (cog) access to data in the shared buffer. This is fine if the cogs can do other work over and above what is in the buffer. But if their primary function is to write to or read from this data buffer and operate against it, it seems they would spend the majority of their time just waiting for access.
3. I am wondering if this alternative (using locks) is accurate:
cog1 and cog2 both vie for the lock
only one of them gains it
the "winning cog" operates on all bytes
the "winning cog" releases the lock allowing the other cog to gain access
the "other cog" operates on all bytes
the "other cog" releases the lock
loop forever as described above
is this lock example close to the same effect that the access flag has?
I'm not trying to pick nits here, as I said, just trying to get as firm of an understanding I can around the topic of buffers shared between cogs. Plus, I think it is an interesting topic and I'm glad you introduced it along with a working sample.
Allowing the cogs to both vie for the lock would be identical to randomly initializing the flag in the example above.
A flag value of 1 is the same as a lock placed by the writing cog, and a flag value of zero is the same as a lock placed by the reading cog.
One solution to the problem of having the cogs spend most of their time waiting for access that you mention in point #2, is to use a back buffer like the one employed by DirectX.
Suppose you have a chunk of main RAM that represents pixel colors to be written to a screen. One cog will read the array and output the values to the screen, and the second will fill the memory with the next picture.
The way to keep both cogs going at once is to have 2 copies of the array in memory.
Cog1 writes the values in Array1 to the screen while Cog2 is filling Array2 with the next picture. Then Cog1 reads Array2 to the screen while Cog2 fills Array1 with the next picture.
With properly written code there would be no need to have either Cog spin its wheels.
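A rough Spin sketch of that double-buffer idea (the sizes and names are hypothetical, and the hand-off of the 'front' index is kept deliberately simple):

CON
  PIXELS = 32                              ' longs per frame (hypothetical size)

VAR
  long buffer[2 * PIXELS]                  ' Array1 and Array2, back to back
  long front                               ' index (0 or 1) of the buffer being displayed
  long stk1[16], stk2[16]

PUB Main
  cognew(Display, @stk1)
  cognew(Render, @stk2)

PRI Display | i, f
  repeat
    f := front                             ' latch which buffer to show for this whole frame
    repeat i from 0 to PIXELS - 1
      OutputPixel(buffer[f * PIXELS + i])  ' Cog1: read the front buffer out to the screen

PRI Render | i, back
  repeat
    back := 1 - front                      ' Cog2: fill whichever buffer is not on screen
    repeat i from 0 to PIXELS - 1
      buffer[back * PIXELS + i] := NextPixel(i)
    front := back                          ' flip the buffers with a single atomic long write

PRI OutputPixel(c)
  ' placeholder for the code that drives the screen

PRI NextPixel(i) : c
  c := i                                   ' placeholder for the code that computes the next picture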
Electronegativity's flag example requires each multi-element write (by cog2) be followed by a multi-element read (by cog1), and each read be followed by a write.
TinkersALot's lock alternative allows the writing cog to perform multiple multi-element writes without waiting for the reading cog to read (and vice versa).
The two approaches behave differently. Each may be correct for a particular situation.
The flag method also allows multiple writes without waiting for a read.
Here is a paste from the example above; follow Cog2's actions:
Cog1 reads the flag and sees that it is a 1 (no read).
Cog2 reads the flag and sees that it is a 1 (ok to write).
Cog1 reads the flag and sees that it is a 1 (no read).
Cog2 writes the first array element to main memory.
Cog1 reads the flag and sees that it is a 1 (no read).
Cog2 writes the second array element to main memory.
Cog1 reads the flag and sees that it is a 1 (no read).
Cog2 sets the flag value to 0.
Once Cog2 gets the go ahead flag it executes consecutive writes until it is done writing, then it sets the flag to zero and starts checking it again.
If I understand your protocol correctly, once cog2 sets the flag to zero, it cannot perform any more writes until the reading cog (cog1) sets the flag to one.
In other words, at some point cog2's progress depends on cog1.
In TinkersALot's system, the writing cog and the reading cog are independent. The writing cog can continue to write to memory no matter what the reading cog is doing or not doing.
If this is what you want to happen then there is no need for any lock or flag.
The attachment to my original post contains code that does exactly what you are describing.
The problem occurs when a Cog is in the process of reading a long array while another Cog is writing to the same array. It is possible that a value could be written and then overwritten with a different value before the reading cog has a chance to fetch the initial value.
This is prevented by a lock or flag that prevents reading until the entire array is written, or prevents writing until the entire array is read.
The lock has a small advantage over the flag because it uses a hardware register to hold the bit instead of using a memory location. So you have one more bit of main memory available if you use the lock.
The lock is also more intuitive for someone that is reading the code.
Other than those two caveats, the final result is the same whether you let the hardware do it (lock), or code it yourself (flag).
I was merely trying to point out that the two described approaches behave differently, contrary to your assertion that they would execute the same. Both systems prevent reader and writer cogs from concurrent access to shared memory; however, yours also imposes an ordering on how the cogs take turns reading and writing. That's all.
That being said, I wonder if you aren't missing the point of locks when you bring up their small benefits but neglect to mention the very big advantage they have over flags, namely an atomic test-and-set operation.
The ordering on how the Cogs take turn reading and writing was certainly deliberate.
I want the writing to finish before the reading begins.
Do you have code examples where it is advantageous to read a value before it is written?
I must admit that I am a bit of a flag man, and may have missed out on some of the advantages of locking.
Perhaps you will be kind enough to provide us with a practical example of the advantages conferred by atomic test-and-set operations.
decouple, decouple, decouple
So if Cog1 reads and increments while it is waiting for the Hub to come back around, then Cog 2 reads and increments while it is waiting for the Hub to come back around, then Cog1 writes, then Cog2 writes, the value is one less than it should be.
Cog1 can't simply set the flag to tell Cog2 it is done, because if Cog1 needed to increment the value twice in a row it would be prevented by the flag from doing so until Cog2 set it back.
If that is actually the problem you are describing then how does using the lock instructions solve it?
You beat me to it! I was just going to say this. In order to truly deconflict hub memory access, you must use locks, even if only two cogs are sharing data.
The magic of the lock operation is that LOCKSET simultaneously sets the lock and returns its previous state. You'll notice there is no LOCKREAD instruction; this is intentional! For lock operations to work you have to use LOCKSET.
"But why am I telling a cog to SET the lock if it doesn't yet own the shared variable?" Because the LOCKSET command will not "unlock" the lock; only the LOCKCLR can do that, and in properly written code, only the cog currently operating on the shared variable should execute LOCKCLR. It's the returned value that makes LOCKSET powerful by telling the cog which just attempted to set the lock that the lock was already set, and it should thus try again.
If you use a flag, remember a cog can only do one hub operation per access window. This includes reading the flag! So a cog may read a "0" in your flag bit, but now it has to wait for the next cycle to set the flag bit to "1", then another cycle to start working on the shared memory. Meanwhile the hub cycle steps through all the other cogs, which all read the flag bit and all see zero... so they too prepare to write the flag bit on the next pass, then start working on the data. Result: no deconfliction between cog read/writes on hub memory, and data get clobbered. This can even happen with only two cogs sharing memory, particularly if one cog accesses the hub variable much more frequently than the other, or if one is running ASM and the other Spin. I think this is what you figured out yourself, Electro.
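In PASM the idiom looks something like this (a sketch; semID would be patched with the ID the launching object got back from LOCKNEW):

DAT
              org     0
acquire       lockset semID wc             ' atomically set the lock AND get its previous state into C
        if_c  jmp     #acquire             ' C = 1 means it was already set, so keep trying
              ' ... read-modify-write the shared hub variable here ...
              lockclr semID                ' only the cog that won the LOCKSET clears it
              jmp     #acquire

semID         long    0                    ' patched with the LOCKNEW result before launch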
I too scoffed at locks until I figured this out. There really are some tricky "traps" like this in Propeller programming, which are not self-evident until you either peruse the front of the Propeller manual, or learn them the hard way. It does appear that on the Propeller forums, "scofflock" is a popular sentiment. But I know better now!
Locks are so easy and elegant... the simple structures below deconflict everything perfectly, and scale up to any number of cogs needing access to the same resource:
The "owning" object should initially check out the lock with LOCKNEW, then pass its ID to all other objects it spawns that will need to work on the same data. From there, the first cog to do a LOCKSET wins, and it's all nicely sorted out thereafter. The "owning" object should also be the only one to call LOCKRET when all the other objects/cogs are done or stopped. This is the other half of using locks effectively: that of writing disciplined code. Though any object/cog can arbitrarily execute LOCKCLR, the only one that should ever do so should be the one which "won" the latest LOCKSET contest. To ensure this, never put a LOCKCLR before a LOCKSET, and always make sure a LOCKSET is successful (returns 0 or false) before proceeding.
Of course, nowadays all this messy detail is hidden in modern programming languages to a large extent, and programmers can fail to realize the fundamental need for hardware support in the CPU to achieve exclusion.
Don't forget though that locks are not needed in many common situations. As in when there is only one writer to a data structure and only one reader.