Problem with mutex locks
Dave Hein
Posts: 6,347
I have a problem using pthreads and mutex locks in PropGCC. The attached test program starts a few pthreads, and then uses a lock to control the use of the "active_thread" flag. The flag will have a value of -1 when none of the threads are using it, and it will contain the thread instance number when in use. I use a lock to prevent more than one thread from setting active_thread at a time. The program works fine when USE_PROP_LOCK is defined, and a Propeller lock is used. The program seems to hang when USE_MUTEX_LOCK is defined, and a mutex lock is used.
It's also possible to see what happens when neither compile flag is defined, and a lock is not used. In that case the program will detect multiple threads using the active_thread flag as expected.
Does anybody know how to use mutex locks, and is there something wrong with my logic? BTW, I have used mutex locks in code running on a PC without any problems, so I suspect there may be a bug in the PropGCC mutex locks.
Comments
OK, stack was too small (now 10000 as an example) and it appears to require the end of the stack for the stack set call, e.g. pthread_attr_setstackaddr(&attr, stacks[i+1]).
pthread_attr_setstack(&attr, malloc(4096), 4096);
Works fine here.
So according to the function specification, I should specify stacks[i+1] instead of stacks for PropGCC since the stack grows downward. I looked at the PropGCC library code, and it always adds the stack size to the stack address independent of whether it is specified or malloc'ed by the library function. So in the case of PropGCC, stacks is correct. BTW, if the stack address and stack size are not specified, pthread_create will malloc a stack space of 512 bytes.
I guess it's time to start poking around in the pthread_mutex code. I'll copy over pthread_mutex.c from the library and add a few printfs to it.
Eric
It might take me a while to check this.
The problem is that if the lock has already been set to 1, a second thread will set it to 2. When the lockholder is done, he will decrement the lock, which will set it to 1. Therefore, it never gets back to zero. With more active threads the lock will just get stuck at a higher number.
I think the pthread_mutex_lock routine should use pthread_mutex_trylock at the beginning of the function instead of __addlock. This way the lock will only have a value of either 0 or 1. The thread to grab the lock will set it to 1, and then clear it to 0 when it's done.
I also had a problem with pthread_sleep and pthread_wake. Maybe this works when the pthreads are in the same cog, but it doesn't seem to work when they are in different cogs. I replaced pthread_sleep with usleep(100) and disabled pthread_wake, and along with the mutex_trylock change the test program now works. My changes are enabled with the compile flag MUTEX_KLUDGE in pthread_mutex.c.
My working test program is in the attached zip file. This works when pthreads are in separate cogs, but probably won't work for pthreads in the same cog.
My impression is that it is impossible to create locks between threads running on different processors without using actual hardware atomic instructions.
Intel has "lock xchg" for this purpose.
Propeller has the hardware lock mechanism.
Are we using hardware atomic locking here?
So what you are saying is that all synchronization between any COG and any other COG all depends on one global lock to do the atomic operations.
Sounds like the old Linux kernel's global lock
It's a two level system -- the C code sees locks as variables in HUB memory, but they are implemented by grabbing a shared hardware lock with lockset, doing the read-modify-write of HUB RAM, and then releasing the hardware lock (lockclr). The COGs only have to hold the hardware lock long enough to do the update of the "software" lock (HUB location that they're doing an atomic read-modify-write on). The only alternative I could think of was to limit ourselves to 8 mutexes, but that would be pretty restrictive. The code is in the kernel, so there are no LMM cache misses during it -- thus it only holds the hardware lock for 2 hub access windows (or do lockset/lockclr take hub access too? if so it would be 4 hub windows).
Thanks for finding this!
Heater, maybe your FFT with OMP will work with the new fixes.
Great! Thanks for finding the problem.
The fix probably won't help Heater's FFT, though, because the OMP library doesn't use pthreads. But maybe there's a similar bug lurking in there.
Eric
I wonder if there's a way to output the intermediate source file produced after the OMP compilation is done. If so, that would help us figure out what's going wrong with it.
I don't think there is an intermediate source file; OMP is integrated with the rest of the compiler. We can dump the assembler output with -S though.
The pthreads library has some magic in it to force FullDuplexSerial to be used if pthread_create is called, but since OMP doesn't use pthreads you have to do the above "by hand".
Sometimes I wonder if SimpleSerial is worth the hassle. It is nice to not have to dedicate a cog to the serial port, but there are a lot of cases where it's necessary, and things would be simpler if we always had a serial cog.
Mind you, I never managed to convince myself that the FFT failure on the Prop is due to a failure in OMP or a failure in my program. It's just that the same code has worked on every other multi-core machine I tried it on.
I have often wondered about the stack size issue with OMP. I was kind of guessing that it analyses your code enough to know exactly what stack size it needs. Unlike pthreads which just arranges to run whatever function you throw at it in a thread.
You don't really need OMP to parallelize the FFT. You could use pthreads to do that. That's what I did in the threaded chess program.
I do think that OMP could be really useful, and it will become even more interesting on the P2 with 16 cogs and 512K memory.
Certainly the algorithm can be parallelized without OMP. Use pthreads. Or without C. Use assembler.
The magic of OMP is how it makes it easier to have the same source code run on a single core machine, or two cores or four or whatever. It just spreads the work in those for loops around for you if it can.
Here's a fixed version, which also lets you change the stack size by setting __OMP_STACKSIZE to something other than the default 1K. I'll check it in to the default branch later, but for now you can test by just including tinyomp.c in your OMP-using projects.
Sadly I don't have any Propellers to hand at the moment to test this fix with.