Trouble with concurrency/timing of multiple cogs running PASM code

ags · 2011-09-06 18:41

I am stuck - but it could be user error - on passing values between cogs at high speed. I would provide actual code but it is very large and I've not yet been able to create a tiny test case. However, I am now wondering if I have a fundamental misunderstanding of Prop programming using shared resources, how the Prop functions, or the proper techniques to be used.

I'm hoping that there is some seemingly insignificant but huge issue that the experts know to avoid or how to deal with, but I don't. Or, as I said, it could be user error. (Example: be sure there is at least one instruction between a mov[s|d|i] instruction and execution of the modified instruction, due to pipelining/prefetch.)

So... I do have a somewhat simplified test case where a "master cog" running SPIN is driving a "slave cog" running PASM as fast as it can. The slave is passed the hub address for a "command word" and it loops like this:

:getCmd  rdword myReg, par wz
       if_z  jmp #:getCmd
...process command

until a non-zero command word value is found. It then reads the 16 words following the "command" word, sets the command word value to zero, processes the 16 words then returns to looping until another non-zero command word is found.
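A minimal PASM sketch of the slave loop just described, assuming the command word sits at a 4n-aligned hub address with the 16 data words directly behind it. The names cmd, ptr, count, buffer, dstInc and zero are hypothetical and not taken from the real driver; this shows one way to structure it, not the actual code.

```spin
DAT             org     0

slave
:getCmd         rdword  cmd, par wz             ' poll the command word
        if_z    jmp     #:getCmd                ' zero means nothing to do yet

                mov     ptr, par                ' writable copy of the mailbox address
                movd    :store, #buffer         ' point the patched rdword at buffer[0]
                mov     count, #16
:fetch          add     ptr, #2                 ' step to the next hub word
:store          rdword  0-0, ptr                ' destination field is patched at run time
                add     :store, dstInc          ' advance the destination by one cog register
                djnz    count, #:fetch

                wrword  zero, par               ' all 16 words copied: release the mailbox
                ' ... process cmd and buffer[0..15] here (hypothetical) ...
                jmp     #:getCmd

dstInc          long    1 << 9                  ' +1 in the destination field
zero            long    0

cmd             res     1
ptr             res     1
count           res     1
buffer          res     16

                fit
```

The command word is cleared only after the last data word has been copied into the cog, so the master cannot overwrite data that is still being read. There are always at least two instructions between each change to :store and its next execution.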

This works really well with slow SPIN code providing the command and 16 data words (words, not longs or bytes). When assembled into a larger system, things broke, so I went back to debugging and found that if I drive the slave/PASM cog quickly, like this:

repeat
  repeat while cmdWord<>0
  cmdWord:=newValue
  newValue++

then things go very wrong. I realize that there's a lot of the implementation that is left out, and I'm working on further reduction of the test case. I'm wondering if there is some nuance that I'm unaware of, such as:

- operations that appear atomic in SPIN are not; writing a word value (not a byte) could cause one byte to be detected by the PASM loop before the second byte of the word has a valid value written
- some kind of strange pipelining of read/write access to hub memory that would allow interleaving of operations and conflict; this seems unlikely, in that the manual seems to be pretty clear that there is no contention on atomic memory operations
- I really need to use locks here, my design is flawed

Is there something tricky I should be aware of that I can check? The fact that the drivers function perfectly when they have plenty of time to finish before the next command is sent has me very suspicious of this aspect of the design. I saw this very same behavior in the precursor that has led me to this project, and I never did understand what the problem was. Now I have no choice but to understand.

Any suggestions, ideas, design patterns, code examples, etc are very, very much appreciated. Thanks.

Phil Pilgrim (PhiPi) · 2011-09-06 18:51

You realize, of course, that newValue++ eventually reaches zero, which will freeze processing -- right?

-Phil

kuroneko · 2011-09-06 18:55

The following test case uses your suggested SPIN loop (drive quickly). It also adapts the PASM command fetch loop. Instead of reading the following 16 words it simply displays the command value and delays for a second. Then it indicates that the command area is ready to be refilled. And I get what I asked for, the LED bank on the demoboard shows an incrementing binary pattern (the command) with 1Hz update frequency. So the bug must hide somewhere else.

VAR
  long  newValue
  word  cmdWord[17]
  
PUB null

  cognew(@entry, @cmdWord)

  repeat
    repeat while cmdWord
    cmdWord:=newValue
    newValue++
    
DAT             org     0

entry           mov     dira, mask

:getCmd         rdword  myReg, par wz
        if_z    jmp     #:getCmd

' read all the other data (delay instead)

                mov     outa, myReg
                rev     outa, #{32 -}8          ' display command
                
                rdlong  cnt, #0
                add     cnt, cnt
                waitcnt cnt, #0

                wrword  par, par                ' indicate ready
                jmp     #:getCmd

mask            long    $00FF0000

myReg           res     1

                fit
                
DAT

kuroneko · 2011-09-06 18:57

Phil Pilgrim (PhiPi) wrote: »

You realize, of course, that newValue++ eventually reaches zero, which will freeze processing -- right?

Why would you say that? Writing 0 as a command will simply skip the inner repeat loop.

ags · 2011-09-06 18:57

Yes, that was a bad example. I was trying to simplify the problem. I am always writing out a non-zero, word-length command. In operation, I don't see freezing, I see garbage being output by the slave cog.

kuroneko · 2011-09-06 19:05

@ags: You do update the command word atomically? Not - as we had some time ago - first setting a base command then adding/or'ing flags ... Also, you use words, is the command word 4n aligned?

ags · 2011-09-06 19:09

Does this work?

VAR
  long  newValue
  word  cmdWord[17]

PUB null

  cognew(@entry, @cmdWord)

  repeat
    repeat while cmdWord
    cmdWord:=newValue
    newValue++

DAT             org     0

entry           mov     dira, mask

:getCmd         rdword  myReg, par wz
        if_z    jmp     #:getCmd

' **** clear the command word here, since I have what I need from hub RAM (in myReg)

                wrword  par, par                ' indicate ready

' read all the other data (delay instead)

                mov     outa, myReg
                rev     outa, #{32 -}8          ' display command

                rdlong  cnt, #0
                add     cnt, cnt
                waitcnt cnt, #0
                jmp     #:getCmd

mask            long    $00FF0000

myReg           res     1

                fit

DAT

This is more like what I'm doing: my thinking is that I've already read the value I need (or actually, I read the next 16 word values, then do this) so there is no need to keep the "watchdog" command at a non-zero value. This allows the "master cog" in my design time to first load the 16 words, then set the command value <> 0 while I'm doing other processing. If I've already read the values I need, I'm looking to interleave the operations.
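A minimal Spin sketch of the master side of that handshake, assuming the layout from the example above (cmdWord[0] is the command, cmdWord[1..16] the data) and a slave that clears the command only after it has copied all 16 data words. The method name send and the parameter dataPtr are hypothetical.

```spin
VAR
  word  cmdWord[17]                       ' [0] = command, [1..16] = data

PUB send(cmd, dataPtr)
  repeat while cmdWord[0]                 ' wait until the slave has cleared the previous command
  wordmove(@cmdWord[1], dataPtr, 16)      ' fill the 16 data words first
  cmdWord[0] := cmd                       ' publish the non-zero command last
```

Writing the command last means the slave never sees a non-zero command while the data words are still being filled in.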

kuroneko · 2011-09-06 19:17

Sure, the delay in my example was used to indicate parameter fetching. As long as you read everything important from hub before clearing the command word it's fine, i.e. moving the wrword immediately before outa handling doesn't change behaviour.

ags · 2011-09-06 19:19

kuroneko wrote: »

@ags: You do update the command word atomically? Not - as we had some time ago - first setting a base command then adding/or'ing flags ... Also, you use words, is the command word 4n aligned?

I'm not sure I understand what you are asking/saying.

The slave cog running PASM uses rdword at the address provided by par, and if it is not zero I escape the loop waiting for a command.

In the master cog running SPIN, I simply use an assignment operator: cmdWord := someValueNotZero.
I'm not sure if that is atomic in SPIN; that is one area where I wonder if the problem lies.

I thought that the Propeller Tool arranged VARs so that they are always properly aligned by ordering long, then word, then byte sized symbols together. The command word is actually one element of a word-sized array variable.

If there is value in going through the exercise of changing my code to break up the command word into a command byte, and just querying one byte using rdbyte in my PASM loop, I can do that. I thought that was a long shot or I would have done it. (of course, I'll have to be sure to write the byte that I'm querying in PASM last in SPIN)

Thanks for the response.

ags · 2011-09-06 19:23

kuroneko wrote: »

Sure, the delay in my example was used to indicate parameter fetching. As long as you read everything important from hub before clearing the command word it's fine, i.e. moving the wrword immediately before outa handling doesn't change behaviour.

Yes, that's what I would expect. On the one hand, I'm glad it works as expected. On the other hand, I am concerned that I have some fundamental flaw in how I'm approaching this problem. It works at slow speed (no chance of the master cog trying to update the slave cog before it has processed the previous data set) but not at full speed (the master is always ready to slam the next dataset into the slave cog as soon as it is ready to accept it).

Of course, that's how it seems, with my understanding of what I intend the code to do. That understanding may well be what is keeping me from really identifying what is actually happening, or what I've coded it to do...

kuroneko · 2011-09-06 19:27

ags wrote: »

I'm not sure I understand what you are asking/saying.

cmdWord := base
cmdWord |= flags

would be wrong as PASM may get in between both assignments and read base instead of base | flags. A single assignment is atomic (wrword). Just checking, we've seen this before and it was hard to track down (courtesy of Phil, IIRC) as we didn't have that part of the source.
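A minimal Spin sketch of the safe variant, assuming the same cmdWord mailbox; the method name post and the local tmp are hypothetical. The full value is composed in a local first, so the PASM side can only ever see the old value or the complete new one.

```spin
VAR
  word  cmdWord[17]

PUB post(base, flags) | tmp
  tmp := base | flags                     ' compose the full command in a local variable
  cmdWord[0] := tmp                       ' one assignment, so one hub write
```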

ags wrote: »

I thought that the Propeller Tool arranged VARs so that they are always properly aligned by ordering long, then word, then byte sized symbols together. The command word is actually one element of a word-sized array variable.

Yes, but consider this:

VAR
  long  first
  word  dummy
  word  cmdWord[17]

In this case cmdWord{0} is located at 4n+2 which breaks when used as par (you'll get the address of dummy instead).
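A minimal Spin sketch of a guard against this, using the same VAR layout; the method name start and the -1 failure value are assumptions for illustration. It refuses to launch the cog when the mailbox address is not 4n aligned.

```spin
VAR
  long  first
  word  dummy
  word  cmdWord[17]                       ' lands at 4n+2 with this layout

PUB start : cog
  if @cmdWord & %11                       ' the low two address bits would be dropped
    return -1                             ' refuse to start with a misaligned mailbox
  cog := cognew(@entry, @cmdWord)

DAT             org     0
entry           jmp     #entry            ' placeholder slave, loops forever
```

With the layout shown the check fails and start returns -1; removing dummy, or declaring it after cmdWord, makes the address 4n aligned again.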

ags · 2011-09-06 19:42

cmdWord := base
cmdWord |= flags

Yes, I can see how that would fail. In between different assignments is not atomic. I'm writing the entire value of the command word at once, in one SPIN assignment operation.

Yes, but consider this:

VAR
  long  first
  word  dummy
  word  cmdWord[17]

In this case cmdWord{0} is located at 4n+2 which breaks when used as par (you'll get the address of dummy instead).

Oh oh, this may be the root of my fundamental misunderstanding and the problem. I assure you I've read the pages on WR*/RD* in the Propeller Manual v1.1.pdf and that's not how I understood it (and I'm not claiming that my understanding is correct - far from it).

I see that with a 9-bit literal value, only long-aligned addresses are possible. However, isn't par a long-word, so I thought it would contain 32 address bits (double what is needed for addressing the Prop RAM)?

I think I need much more understanding here, can you point me at a good resource, or help with further details?

Thanks! This may be on the path to a solution to a problem that has been vexing me for a while...

kuroneko · 2011-09-06 19:50

The problem(?) with par is that it has to fit into 14bit (PASM coginit parameter: 14bit par, 14bit address, 4bit cog ID handling). Which means the lower 2bits of the parameter value passed into coginit/cognew get lost (bits 31..16 as well). Which still lets you cover the whole 64K address range (RAM/ROM) but only in 4n steps.

You can avoid the alignment issue by storing the value in a long variable and do a rdlong addr, par instead of using par directly. This way you'll get the whole 32bit value.

ags · 2011-09-06 20:03

kuroneko wrote: »

The problem(?) with par is that it has to fit into 14bit (PASM coginit parameter: 14bit par, 14bit address, 4bit cog ID handling). Which means the lower 2bits of the parameter value passed into coginit/cognew get lost (%-00). Which still lets you cover the whole 64K address range (RAM/ROM) but only in 4n steps.

Oh, good grief! I bet that's it. I see some discussion about that in the Manual on the PAR register page.

So PAR contains the address of the long value that is guaranteed to contain the actual value I want, but it may not be the low byte or word, so with RDWORD/RDBYTE I may get garbage? Yikes! What techniques are used to avoid that (other than only using long values?)

While I'm on the topic (roughly) - when using WRBYTE/WRWORD, the Manual says the long-register values in cog RAM are truncated to word and byte sizes, and zero-extended. However, does that mean that the entire long-size register value is altered? I thought about it and decided the answer must be "no". Even with this new knowledge about PAR, once in PASM-land, I can still manipulate all 16 bits of the source register, and increment it by 1 (one byte addressing - I've done this and it sure *looks* like it works) so I presume that I can precisely address and surgically alter single-byte values of hub RAM from PASM code. Is this correct?

Thanks!!

kuroneko · 2011-09-06 20:22

ags wrote: »

So PAR contains the address of the long value that is guaranteed to contain the actual value I want, but it may not be the low byte or word, so with RDWORD/RDBYTE I may get garbage?

Let's take a step back. The 14bit limitation is a fact. Period. You can get any value into the cog you just have to know how to do it.

cognew(@entry, user)

In the SPIN example above you pass an opaque user value (doesn't have to be an address) to the PASM cog. par will contain user[15..2] in par[15..2], everything else is zero. Which means ordinary 4n aligned addresses can be passed as is. And you simply do a rdlong temp, par. If your address is not 4n aligned (or it's an arbitrary value, e.g. $12345678) then you can either align it or pass it indirectly.

VAR
  long  storage

PUB null

  storage := arbitrary_value
  cognew(@entry, @storage)

DAT     org     0

entry   rdlong  addr, par          ' get arbitrary_value into addr
        rdbyte  temp, addr         ' read e.g. from a byte aligned address

ags wrote: »

While I'm on the topic (roughly) - when using WRBYTE/WRWORD, the Manual says the long-register values in cog RAM are truncated to word and byte sizes, and zero-extended. However, does that mean that the entire long-size register value is altered?

For wrbyte/wrword only that particular portion of the hub long is altered. Using rdbyte/rdword is where the zero-extension comes in (cog locations are longs). Which means you can construct a hub long by using 4 wrbytes but this doesn't work the other way around, a cog long can't be assembled directly using 4 rdbytes. Also, a wrbyte doesn't affect the long being written from.
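A minimal PASM sketch of putting a cog long together from four hub bytes with explicit shifts, since each rdbyte overwrites the whole destination register. All names (getlong, addr, ptr, value, temp, count) are hypothetical. This is only needed when the hub address is not long aligned; otherwise a single rdlong does the job.

```spin
DAT
                ' addr holds the hub address of the lowest byte (any alignment)
getlong         mov     ptr, addr
                add     ptr, #3                 ' hub longs are little-endian: start at the top byte
                mov     value, #0
                mov     count, #4
:next           rdbyte  temp, ptr               ' temp := byte, upper 24 bits zeroed
                shl     value, #8               ' make room in the low byte
                or      value, temp             ' merge the byte just read
                sub     ptr, #1                 ' step down towards the lowest address
                djnz    count, #:next
getlong_ret     ret

addr            res     1
ptr             res     1
value           res     1
temp            res     1
count           res     1
```

It would be used as call #getlong after loading addr; the assembled long ends up in value.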

ags wrote: »

Even with this new knowledge about PAR, once in PASM-land, I can still manipulate all 16 bits of the source register, and increment it by 1 (one byte addressing - I've done this and it sure *looks* like it works) so I presume that I can precisely address and surgically alter single-byte values of hub RAM from PASM code. Is this correct?

par itself is read-only. While you can do an add par, #1 this will only affect its shadow register. Adjusting addresses based on par has to be done by taking a r/w copy and working with this instead, e.g.

mov     addr, par
rdlong  v1, addr
add     addr, #4
rdlong  v2, addr

Normal registers can be manipulated in any way you want. And - to finally answer the question - every single byte in hub RAM can be modified.

ags · 2011-09-06 21:59

The thrill of victory... and the agony of defeat.

I thought I had a lead on the problem. Alas, while it has been a good education for me, this was not the problem. As it turns out, through some combination of luck and idiot savant, I always pass parameters to a PASM cog through PAR based on a long variable. That is, similar to the previous example, I establish a long variable, pass the address of that through PAR, and then read the value of that address, or the sequentially following longs. In other words, this is what I do:

VAR
  long  g_paramRoot
  long  g_nextParam0
  long  g_nextParam1
...
PUB null
  cognew(@entryPoint, @g_paramRoot)
DAT
entryPoint

                mov     tmpReg, par

:loop           rdlong  paramReg, tmpReg wz
        if_z    jmp     #:loop

'do something with g_paramRoot value now in paramReg

                add     tmpReg, #4              'increment to the next param value address (longs)
                rdlong  paramReg, tmpReg

'do something with g_nextParam0 value now in paramReg

                add     tmpReg, #4
                rdlong  paramReg, tmpReg

'do something with g_nextParam1 value now in paramReg
...
tmpReg          res     1
paramReg        res     1

I would be suspecting a memory problem now, but the fact that the symptoms occur when processing is sped up makes me wonder.

Ugh.
