Trouble with concurrency/timing of multiple cogs running PASM code
I am stuck - but it could be user error - on passing values between cogs at high speed. I would provide actual code but it is very large and I've not yet been able to create a tiny test case. However, I am now wondering if I have a fundamental misunderstanding of Prop programming using shared resources, how the Prop functions, or the proper techniques to be used.
I'm hoping that there is some seemingly insignificant but huge issue that the experts know how to avoid or deal with, but I don't. Or, as I said, it could be user error. (Example: be sure there is at least one instruction between a mov[s|d|i] instruction and execution of the modified instruction, due to pipelining/prefetch.)
So... I do have a somewhat simplified test case where a "master cog" running SPIN is driving a "slave cog" running PASM as fast as it can. The slave is passed the hub address for a "command word" and it loops like this:
    :getCmd       rdword  myReg, par      wz
            if_z  jmp     #:getCmd
                  ...process command
until a non-zero command word value is found. It then reads the 16 words following the "command" word, sets the command word value to zero, processes the 16 words then returns to looping until another non-zero command word is found.
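For anyone following along, here is a minimal sketch of the whole slave cycle described above. It is not my actual driver. The names slave, cmdAddr, ptr, count, buffer and dstInc are made up for illustration, and it assumes the command word sits at a long-aligned hub address so that par points at it exactly.

```spin
DAT
              org     0
slave         mov     cmdAddr, par            ' working copy of the mailbox address (assumed long-aligned)
:getCmd       rdword  cmd, cmdAddr    wz      ' poll the command word
        if_z  jmp     #:getCmd                ' zero means nothing to do yet

              mov     ptr, cmdAddr
              add     ptr, #2                 ' first data word follows the command word
              movd    :copy, #buffer          ' point the copy loop at buffer[0]
              mov     count, #16
:copy         rdword  buffer, ptr             ' destination field is patched on each pass
              add     :copy, dstInc           ' advance destination to the next cog register
              add     ptr, #2                 ' advance hub pointer by one word
              djnz    count, #:copy

              wrword  zero, cmdAddr           ' release the mailbox only after all 16 words are copied
              ' ...process buffer[0..15] here (hypothetical)...
              jmp     #:getCmd

zero          long    0
dstInc        long    1 << 9                  ' +1 in the destination field (bits 17..9)
cmdAddr       res     1
cmd           res     1
ptr           res     1
count         res     1
buffer        res     16
```

Note that there is at least one instruction between each modification of :copy and its next execution, which is the pipelining rule mentioned earlier.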
This works really well with slow SPIN code providing the command and 16 data words (words, not longs or bytes). When assembled into a larger system, things broke, so I went back to debugging and found that if I drive the slave/PASM cog quickly, like this:
    repeat
      repeat while cmdWord <> 0
      cmdWord := newValue
      newValue++
then things go very wrong. I realize that there's a lot of the implementation that is left out, and I'm working on further reduction of the test case. I'm wondering if there is some nuance that I'm unaware of, such as:
- operations that appear atomic in SPIN are not; writing a word value (not a byte) could cause one byte to be detected by the PASM loop before the second byte of the word has a valid value written
- some kind of strange pipelining of read/write access to hub memory that would allow interleaving of operations and conflict; this seems unlikely, in that the manual seems to be pretty clear that there is no contention on atomic memory operations
- I really need to use locks here, my design is flawed
Is there something tricky I should be aware of that I can check? The fact that the drivers function perfectly when they have plenty of time to finish before the next command is sent has me very suspicious of this aspect of the design. I saw this very same behavior in the precursor that led me to this project, and I never did understand what the problem was. Now I have no choice but to understand.
Any suggestions, ideas, design patterns, code examples, etc are very, very much appreciated. Thanks.
Comments
-Phil
This is more like what I'm doing: my thinking is that I've already read the value I need (or actually, I read the next 16 word values, then do this) so there is no need to keep the "watchdog" command at a non-zero value. This allows the "master cog" in my design time to first load the 16 words, then set the command value <> 0 while I'm doing other processing. If I've already read the values I need, I'm looking to interleave the operations.
I'm not sure I understand what you are asking/saying.
The slave cog running PASM uses rdword at the address provided by par, and if it is not zero I escape the loop waiting for a command.
In the master cog running SPIN, I simply use an assignment operator: cmdWord := someValueNotZero.
I'm not sure if that is atomic in SPIN, that is an area I am wondering is the problem.
I thought that the Propeller Tool arranged VARs so that they are always properly aligned by ordering long, then word, then byte sized symbols together. The command word is actually one element of a word-sized array variable.
If there is value in going through the exercise of changing my code to break up the command word into a command byte, and just querying one byte using rdbyte in my PASM loop, I can do that. I thought that was a long shot or I would have done it. (of course, I'll have to be sure to write the byte that I'm querying in PASM last in SPIN)
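To make the write-ordering point concrete, here is a rough sketch of the master side in Spin. The names mailbox and send are hypothetical, and the sketch assumes mailbox is the first word variable in the object so that it starts on a long boundary. It relies on a single Spin assignment to a word variable being one hub write, which is my understanding but is exactly the thing in question here.

```spin
VAR
  word  mailbox[17]                        ' mailbox[0] = command, mailbox[1..16] = data

PUB send(cmd, dataPtr)
  repeat while mailbox[0] <> 0             ' wait until the slave has cleared the command
  wordmove(@mailbox[1], dataPtr, 16)       ' write the 16 data words first
  mailbox[0] := cmd                        ' publish the non-zero command last
```

The idea is that whatever the slave polls (word or byte) is the very last thing written, so it can never see the flag before the payload is in place.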
Thanks for the response.
Yes, that's what I would expect. On the one hand, I'm glad it works as expected. On the other hand, I am concerned that I have some fundamental flaw in how I'm approaching this problem. It works at slow speed (no chance of the master cog trying to update the slave cog before it has processed the previous data set) but not at full speed (the master is always ready to slam the next dataset into the slave cog as soon as it is ready to accept it).
Of course, that's how it seems, with my understanding of what I intend the code to do. That understanding may well be what is keeping me from really identifying what is actually happening, or what I've coded it to do...
Yes, but consider this: In this case cmdWord{0} is located at 4n+2 which breaks when used as par (you'll get the address of dummy instead).
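A minimal illustration of that failure case, assuming an object with no long variables so that dummy is the first thing in VAR space (the names start, slave and cmd are made up for this sketch):

```spin
VAR
  word  dummy                              ' lands at 4n
  word  cmdWord[17]                        ' therefore starts at 4n+2

PUB start
  cognew(@slave, @cmdWord)                 ' PAR = @cmdWord & $FFFC, i.e. the address of dummy

DAT
              org     0
slave         rdword  cmd, par        wz   ' polls dummy, not cmdWord[0]
        if_z  jmp     #slave
              jmp     #slave               ' (command handling omitted)
cmd           res     1
```

Only bits 15..2 of the address survive the trip through cognew, so the two low bits of PAR are always zero.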
Yes, I can see how that would fail: a sequence of separate assignments is not atomic as a whole. In my case, though, I'm writing the entire value of the command word at once, in one SPIN assignment operation.
Yes, but consider this: In this case cmdWord{0} is located at 4n+2 which breaks when used as par (you'll get the address of dummy instead).
Oh oh, this may be the root of my fundamental misunderstanding and the problem. I assure you I've read the pages on WR*/RD* in the Propeller Manual v1.1.pdf and that's not how I understood it (and I'm not claiming that my understanding is correct - far from it).
I see that with a 9-bit literal value, only long-aligned addresses are possible. However, isn't par a long? I thought it would hold a full 32 bits of address (double what is needed to address the Prop's hub RAM).
I think I need much more understanding here, can you point me at a good resource, or help with further details?
Thanks! This may be on the path to a solution to a problem that has been vexing me for a while...
You can avoid the alignment issue by storing the value in a long variable and do a rdlong addr, par instead of using par directly. This way you'll get the whole 32bit value.
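Roughly like this, as a sketch only (mailboxPtr, start, slave, cmdAddr and cmd are illustrative names, not taken from anyone's real code):

```spin
VAR
  long  mailboxPtr                         ' longs are always long-aligned, so PAR can address this exactly
  word  dummy
  word  cmdWord[17]                        ' may sit at 4n+2; that no longer matters

PUB start
  mailboxPtr := @cmdWord                   ' full 16-bit hub address, low bits intact
  cognew(@slave, @mailboxPtr)

DAT
              org     0
slave         rdlong  cmdAddr, par            ' fetch the real address of cmdWord[0]
:getCmd       rdword  cmd, cmdAddr    wz      ' poll the command word through that address
        if_z  jmp     #:getCmd
              jmp     #:getCmd                ' (command handling omitted)
cmdAddr       res     1
cmd           res     1
```

The cost is one extra rdlong at startup; after that the cog works from cmdAddr and never needs par again.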
Oh, good grief! I bet that's it. I see some discussion about that in the Manual on the PAR register page.
So PAR contains the address of the long value that is guaranteed to contain the actual value I want, but it may not be the low byte or word, so with RDWORD/RDBYTE I may get garbage? Yikes! What techniques are used to avoid that (other than only using long values?)
While I'm on the topic (roughly) - when using WRBYTE/WRWORD, the Manual says the long-register values in cog RAM are truncated to word and byte sizes, and zero-extended. However, does that mean that the entire long-size register value is altered? I thought about it and decided the answer must be "no". Even with this new knowledge about PAR, once in PASM-land, I can still manipulate all 16 bits of the source register, and increment it by 1 (one byte addressing - I've done this and it sure *looks* like it works) so I presume that I can precisely address and surgically alter single-byte values of hub RAM from PASM code. Is this correct?
Thanks!!
par itself is read-only. While you can do an add par, #1 this will only affect its shadow register. Adjusting addresses based on par has to be done by taking a r/w copy and working with this instead, e.g. Normal registers can be manipulated in any way you want. And - to finally answer the question - every single byte in hub RAM can be modified.
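A small sketch of that pattern, with hypothetical register names ptr and value and an arbitrary offset of 3 bytes chosen just to show byte-granular addressing:

```spin
DAT
              org     0
slave         mov     ptr, par                ' read/write working copy of the hub address
              add     ptr, #3                 ' byte granularity: now 3 bytes past the base
              rdbyte  value, ptr              ' value = that one byte, zero-extended to 32 bits
              add     value, #1
              wrbyte  value, ptr              ' writes only the low 8 bits of value to that one byte
:done         jmp     #:done                  ' park the cog
ptr           res     1
value         res     1
```

The wrbyte leaves the cog register value untouched and changes only the single addressed byte in hub RAM; the neighbouring bytes of that long are not affected.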
I thought I had a lead on the problem. Alas, while it has been a good education for me, this was not the problem. As it turns out, through some combination of luck and idiot savant, I always pass parameters to a PASM cog through PAR based on a long variable. That is, similar to the previous example, I establish a long variable, pass the address of that through PAR, and then read the value of that address, or the sequentially following longs. In other words, this is what I do:
I would be suspecting a memory problem now, but the fact that the symptoms occur when processing is sped up makes me wonder.
Ugh.