The lockset() and lockclr() mechanisms on the P2


ersmith mentioned that using lockset() and lockclr() can hog the locks and cause a race condition on the P2. I would like to know more about that mechanism. In fact, if I remove lockset() and lockclr() from the infamous prime number program, it runs faster, which is very different from what I saw on the P1. The post that caught my attention can be seen here:

However, I want to dedicate a new thread just to understanding the nature of the problem. Is it an issue? Or are the lockset()/lockclr() mechanisms now implicit (in hardware)?

Kind regards, Samuel Lourenço


  • It was always a problem, on P1 as well as on P2 (i.e. the only thing different about P2 that exposes the bug is that the timing is slightly different, but it could still appear on P1 as well in a different program).

    The issue is that if you have several COGs all doing:
       while (lockset(lock) != 0)
           ;   // spin until the lock is acquired
       do some stuff
       lockclr(lock)
    You can get into a situation where, say, COG 1 and COG 2 keep trading ownership of the lock, and so COG 3 never gets a chance to run. This is a problem if the "do some stuff" code at some point relies on a result that COG 3 should calculate; then you end up in a deadlock and the program never finishes.

    In general you should never require a lock to be acquired before reading any variable; locks should only be used to protect writes. That still doesn't guarantee that a deadlock can't arise, but it helps reduce the incidence.

    Multi-core programming is hard :).
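
    The acquire/release pattern discussed above can be sketched on a desktop machine roughly as follows. This is only an illustration under assumptions: it assumes lockset() reports whether the lock was already held (which is how the loop in the post uses it), and sim_lockset(), sim_lockclr(), hub_lock and shared_counter are hypothetical stand-ins built on C11 atomics, not the Propeller API.

    ```c
    #include <stdatomic.h>

    /* Hypothetical host-side stand-in for one hub lock (not the Propeller API). */
    static atomic_flag hub_lock = ATOMIC_FLAG_INIT;

    /* Shared state that the lock is meant to protect. */
    int shared_counter = 0;

    /* Like lockset() as used in the post: sets the lock and returns
       nonzero if it was ALREADY set, zero if the caller just acquired it. */
    int sim_lockset(void)
    {
        return atomic_flag_test_and_set(&hub_lock) ? 1 : 0;
    }

    /* Like lockclr(): releases the lock so another caller can acquire it. */
    void sim_lockclr(void)
    {
        atomic_flag_clear(&hub_lock);
    }

    /* Spin until the lock is ours, do the protected work, then release. */
    void locked_increment(void)
    {
        while (sim_lockset() != 0)
            ;                 /* someone else holds the lock: keep polling */
        shared_counter++;     /* the "do some stuff" part */
        sim_lockclr();
    }
    ```

    Nothing in the polling loop decides who gets the lock next, so two callers that happen to poll at the right moments can keep handing it to each other while a third one never sees it free.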
  • Thanks! I had the notion that, on the P1, the program would be slower with the lock operations removed. But I'll look up the original code again and do some experiments on the P1.

    I'll keep you posted!

    Kind regards, Samuel Lourenço

  • samuell (edited 2019-05-26 00:03:18)

    I finally did some testing with the P1, and concluded that the code without lockset() and lockclr() is consistently faster. To test it, I used two of my dev boards, Prop and Prop II, with a different program on each board. Note that the original Prop board runs at 64 MHz, while the Prop II runs at 96 MHz. I ran the tests calculating all the prime numbers from 1 to 1000000.

    Here are the results:
    - Prop with older algorithm: 7:03.5 with locks, 6:52.4 without locks
    - Prop II with newer algorithm: 4:40.6 with locks, 4:33.0 without locks

    As you can see, the improvement follows a similar ratio in both cases. This is similar to what was observed during the tests with the P2. Thus, the results with the P2 make sense.
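
    To put a number on that ratio, here is a small sketch that turns the timings above into a percentage. It assumes the readings are in minutes:seconds; to_seconds() and percent_saved() are names made up for this example.

    ```c
    /* Convert a "minutes:seconds" reading such as 7:03.5 into seconds. */
    double to_seconds(int minutes, double seconds)
    {
        return minutes * 60.0 + seconds;
    }

    /* Percentage of the run time saved by removing the lock operations. */
    double percent_saved(double with_locks_s, double without_locks_s)
    {
        return 100.0 * (with_locks_s - without_locks_s) / with_locks_s;
    }
    ```

    With the figures above, percent_saved(to_seconds(7, 3.5), to_seconds(6, 52.4)) compares 423.5 s against 412.4 s and gives roughly 2.6% for the Prop, while percent_saved(to_seconds(4, 40.6), to_seconds(4, 33.0)) compares 280.6 s against 273.0 s and gives roughly 2.7% for the Prop II.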

    I wanted to mention that my previous conclusion was wrong, since I noticed that I had been running the program without locks using the CMM memory model, which is slower. That led me to the wrong assumption that a program without locks would be slower on the P1.

    Thanks, ersmith. I think the code is valid. I'll do more testing to see if results are consistent for big numbers.

    P.S.: Attached is one of the programs that were used. The code has all lockset() and lockclr() operations removed, even for writing, since cogs 1 to 7 write to different areas of memory. Thus, lockset() and lockclr() are not needed at all (even for flags[i], a variable that is also written by cog 0, but that in any case is always checked first and therefore never written simultaneously by different cogs). The code can be further optimized.
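
    For anyone who does not want to open the attachment, the idea of writers that never share a location can be sketched as below. This is not the attached program: the contiguous slices, the trial-division test, and the names LIMIT, N_WORKERS, slice_lo(), slice_hi(), worker_run() and sieve_partitioned() are all assumptions made for illustration, and the workers are simply run one after another on a desktop machine.

    ```c
    #include <stddef.h>

    #define LIMIT     100   /* illustrative upper bound, not the 1000000 of the tests */
    #define N_WORKERS 7     /* stands in for cogs 1 to 7; cog 0 is not modelled */

    /* One flag per number; each worker writes only inside its own slice. */
    static unsigned char flags[LIMIT + 1];

    /* Hypothetical partition: worker w (0-based) owns indices [lo, hi). */
    size_t slice_lo(unsigned w) { return (size_t)(LIMIT + 1) * w / N_WORKERS; }
    size_t slice_hi(unsigned w) { return (size_t)(LIMIT + 1) * (w + 1) / N_WORKERS; }

    /* Plain trial division, enough for a small illustration. */
    int is_prime(unsigned long v)
    {
        if (v < 2)
            return 0;
        for (unsigned long d = 2; d * d <= v; d++)
            if (v % d == 0)
                return 0;
        return 1;
    }

    /* Work of a single worker: no lock, because no other worker
       ever writes to an index in [slice_lo(w), slice_hi(w)). */
    void worker_run(unsigned w)
    {
        for (size_t i = slice_lo(w); i < slice_hi(w); i++)
            flags[i] = (unsigned char)is_prime((unsigned long)i);
    }

    /* Run every worker in turn and count the primes found up to LIMIT. */
    size_t sieve_partitioned(void)
    {
        size_t count = 0;
        for (unsigned w = 0; w < N_WORKERS; w++)
            worker_run(w);
        for (size_t i = 0; i <= LIMIT; i++)
            count += flags[i];
        return count;
    }
    ```

    Because slice_hi(w) always equals slice_lo(w + 1), the slices cover the whole array without overlapping, which is the property that makes the locks unnecessary for these writes.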

    Kind regards, Samuel Lourenço