If only two cogs are contending for a lock, then only a short time is needed between a LOCKCLR and the next LOCKSET. If the other cog wants the lock, it will be sitting in a tight loop doing a LOCKSET, so the cog that is clearing the lock just needs to wait slightly longer than that loop time. Of course, the loop time in Spin will be substantially longer than a loop in PASM.
It gets more complicated when three or more cogs are actively contending for the lock. In that case a random wait time should fix the problem. Or lock requests could be queued up in a FIFO. The write to the FIFO would need to be protected with a lock. However, it seems like some wait time would still be needed before a cog could add a new request to the FIFO.
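For concreteness, the random-wait part of that might look something like this in Spin. This is only a sketch -- the method name, seed handling, and delay range are mine, and later posts in this thread argue that a tight retry actually works better:

VAR
  long seed

PUB lockWithBackoff(id)
  seed := cnt                                   'seed the ? pseudo-random operator
  repeat while lockset(id)                      'TRUE means another cog still holds the lock
    waitcnt(2_000 + (?seed & $1FFF) + cnt)      'back off roughly 2_000..10_000 clocks, then retry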
Surely if only two COGs are in the game, then when one does a LOCKCLR it has to wait a whole HUB cycle before it can do a LOCKSET. Isn't that enough time for the other COG to always be able to get in?
The shortest PASM loop to wait for a lock is one Hub-op and one conditional jmp.
So a PASM COG can get hold of the lock at every HUB cycle it can access. If more than one PASM COG is waiting, no deadlock is possible: after a lock is released, every other waiting PASM COG gets at least one chance at it before the releasing COG comes around again.
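As a sketch, that shortest wait loop is just the following (lockid is an assumed register holding the lock number):

acquire       lockset   lockid wc       'hub op: C = 1 if the lock was already set
        if_c  jmp       #acquire        'still held -- retry on this cog's next hub slot
              '...critical section...
              lockclr   lockid          'hand the lock back when done

lockid        long      0               'using lock 0 for the sketch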
Not so with SPIN COGs. A 'repeat until lockset' in SPIN will not be able to catch every HUB-cycle.
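In Spin the wait is the usual one-line repeat (a sketch; protectedAccess and semID are assumed names, semID being whatever locknew returned):

PUB protectedAccess(semID)
  repeat while lockset(semID)           'spin until the lock was found clear
  '...touch the shared hub data here...
  lockclr(semID)                        'give the lock back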
It's certainly possible to create a PASM program that would prevent a Spin cog from getting the lock. The PASM cog would just need a set/clr cycle that matched the number of cycles in the Spin 'repeat until lockset' loop. Would this happen normally? Probably not, but it is possible. I think in general multiple cogs will not get locked out. Some cogs may require more attempts to acquire the lock, but it is unlikely that they would never get it.
The issue I had was with a cog running Spin that would never obtain the lock. There were three other cogs, all running PASM, accessing the same lock. It was very, very sporadic, maybe happening in 1 out of 10 executions of 10-minute runs.
But happen it did. And it was always the cog running Spin that would hang. I do believe that there was an inadvertent timing synchronization between the Spin routine and at least one of the PASM routines.
Adding the differing wait times to each cog has fixed the problem (it has never recurred).
Finally got a cog to hog a lock, effectively preventing three others from obtaining it. Here's code that does not hog the lock:
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

PUB start
  cognew(@locktest, 4)                  'Scope channel 1.
  cognew(@locktest, 8)                  'Scope channel 2.
  cognew(@locktest, 16)                 'Scope channel 3.
  cognew(@locktest, 32)                 'Scope channel 4.

DAT
              org       0

locktest      mov       dira,par        'PAR carries this cog's pin mask.
lockget       lockset   zero wc         'Try lock 0: C = 1 if it was already set.
              nop                       'Padding that sets the loop's hub phase.
              nop
              nop
'             nop                       'Uncomment this one to produce the lockout.
        if_c  jmp       #lockget        'Missed -- try again.
              or        outa,par        'Got it: pulse the pin for the scope...
              andn      outa,par        '...and drop it again.
              lockclr   zero            'Release the lock.
              jmp       #lockget        'Contend for it again.

zero          long      0
Here's the scope output:
If I uncomment the nop, this is what I get:
What seems to be happening is that in channels 2 through 4, the jmp is synchronized to the hub cycle, so they miss every opportunity to obtain a lock when their turn comes around. The code is admittedly pathological, but it does demonstrate the possibility of a cog not being able to obtain a lock.
I haven't yet been able to duplicate this behavior without instructions between the lockset and the jmp. Also, it takes four cogs with the same code for any one of them to get locked out. My mistake: it locks out even with two.
It occurs to me that if all cogs that use a lock immediately follow the lockset with the if_c jmp back to it, all cogs will get an equal shot at obtaining the lock. The reason is that each cog will be able to access the hub again on its very next turn, as Heater rightly pointed out. In my examples above, I put a delay between the lockset and the jmp, and this is what caused the problem. What this tells me is that any kind of random hold-off will create the problem rather than fix it.
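So relative to the listing above, the fix is just to keep the attempt and the conditional jump back-to-back and do any per-pass work only after the lock is won (a sketch, reusing the zero lock register from the listing above):

lockget       lockset   zero wc         'try the lock
        if_c  jmp       #lockget        'miss -- jump straight back, nothing in between
              nop                       'any extra work goes here, after acquisition...
              nop
              nop
              lockclr   zero            '...and the lock is released afterwards
              jmp       #lockget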
In Spin, the situation is different because, as was pointed out, you don't get to do a lockset on every hub cycle, so a Spin cog might still get locked out. Given more instruction space in the Spin interpreter, a lockwait function interpreted as the tight PASM loop would have taken care of the problem.
Thanks, Phil, for your time and effort researching this issue.
One thing that your test code doesn't do is keep the lock for any significant time. The issue I had was with locking messages that required decoding. So each cog has to obtain the lock, retrieve the message, decode it, determine if it is the intended recipient, and potentially process it - all before releasing the lock.
I believe that the issue in my case was that one of the PASM cogs spent just enough time doing the above - on a message that wasn't intended for it - to keep the other cogs from obtaining the lock. This, of course, meant that the intended recipient was never able to retrieve, process, and reset its message.
Especially since the intended recipient was the Spin cog - the cog that always had the issue.
It would have been nice to use a separate lock for each recipient. Lack of available cog space precludes this.
I think that my "random waits" code fixes the issue because each cog waits a different amount of time before retrying.
Of course, it's almost impossible to prove that it will always work.
Based upon my experiments, you should always retry as soon as possible, IOW in a tight loop. Waiting only increases the chance that you will not obtain the lock at all.
Here's an example:
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

PUB start
  cognew(@locktest, 4)
  cognew(@locktest, 8)
  cognew(@locktest, 16)
  cognew(@locktest, 32)

DAT
              org       0

locktest      mov       dira,par        'PAR carries this cog's pin mask.
lockget       lockset   zero wc         'Try lock 0: C = 1 if it was already set.
        if_nc jmp       #:doit          'Got it -- go hold it for a while.
              nop                       'Missed: hold off for ten nops...
              nop
              nop
              nop
              nop
              nop
              nop
              nop
              nop
              nop
              jmp       #lockget        '...then try again.
:doit         or        outa,par        'Raise the pin while the lock is held.
              mov       t0,cnt          'Note the starting count...
:waitlp       mov       t1,cnt
              sub       t1,t0           '...and spin until "keep" clocks have elapsed.
              sub       t1,keep wc      'C is set while elapsed < keep.
        if_c  jmp       #:waitlp
              andn      outa,par        'Drop the pin,
              lockclr   zero            'release the lock,
              jmp       #lockget        'and contend for it again.

zero          long      0
keep          long      100000
t0            res       1
t1            res       1
In this example, each cog keeps the lock for 100,000 clocks. The first cog gets the lock once, and the fourth one hogs it thereafter. If I remove the nops, every cog gets its turn.
Interesting insights Phil. I wonder if lockwait was ever under consideration.
Have I got this right in the current logic? The combination of lockset wc and if_c jmp #$-1 will always hit every hub rotation, and even adding one nop or one other 4-cycle instruction will be okay. But two or more instructions between the lockset and the jmp #$-1 will mean it misses one or more hub rotations. You can design, or inadvertently cause, another cog to sync in hub phase so that it will always release its lock and also pick it up again during that missed rotation - a hog-cog. Definitely a reason to keep that loop tight.
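For what it's worth, the arithmetic behind that works out as follows, assuming the usual P1 numbers (a hub slot for each cog every 16 clocks, a synchronized hub op at 8 clocks, 4 clocks per ordinary instruction):

  lockset + jmp                    8 + 4          = 12 clocks   'catches every hub rotation
  lockset + nop + jmp              8 + 4 + 4      = 16 clocks   'still catches every rotation
  lockset + nop + nop + jmp        8 + 4 + 4 + 4  = 20 clocks   'overshoots -- waits out an extra rotation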
One could use a djnz value,#$-1 instead of a straight jmp in order to make a failsafe for a locked-up situation, whatever the cause, but I haven't ever before thought that something like that might be necessary. It seems an unlikely bump in the night. Code that has random hub accesses or decision trees will tend to escape (?!) quickly, eventually.
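A sketch of that djnz variant, with assumed names (retries, budget, giveup) and an arbitrary recovery action:

lockget       lockset   zero wc          'try the lock
        if_nc jmp       #gotit           'C clear: the lock is ours
              djnz      retries,#lockget 'missed: count down and retry...
              jmp       #giveup          '...until the budget runs out

gotit         nop                        'critical section would go here
              lockclr   zero             'release the lock
              mov       retries,budget   're-arm the retry budget
              jmp       #lockget

giveup        cogid     temp             'example recovery: just stop this cog
              cogstop   temp

zero          long      0
retries       long      1_000
budget        long      1_000
temp          res       1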
Comments
Trying to reproduce the lockout, but so far unsuccessful -- even with four cogs vying normally, plus one "lock hog."
-Phil
Mind you, it's the same challenge, seeing as Spin is PASM under the hood.
Amazing that after all these years there are things to be discovered about the Prop.
It's the "lock bomb" !
Lock, read, unlock, process..., then lock, write, unlock, and repeat.
-Phil
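That pattern might look like this in Spin (a sketch; handleMessage, semID, and msgAddr are assumed names for the handler, the lock ID, and the hub address of the shared message):

PUB handleMessage(semID, msgAddr) | msg
  repeat while lockset(semID)           'take the lock just long enough to copy the message out
  msg := long[msgAddr]
  lockclr(semID)

  '...decode and act on the local copy with the lock released...

  repeat while lockset(semID)           'take it again only to write the result back
  long[msgAddr] := 0                    'e.g. mark the slot as consumed
  lockclr(semID)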