Problem with mutex locks
Dave Hein
Posts: 6,347
I have a problem using pthreads and mutex locks in PropGCC. The attached test program starts a few pthreads, and then uses a lock to control the use of the "active_thread" flag. The flag will have a value of -1 when none of the threads are using it, and it will contain the thread instance number when in use. I use a lock to prevent more than one thread from setting active_thread at a time. The program works fine when USE_PROP_LOCK is defined, and a Propeller lock is used. The program seems to hang when USE_MUTEX_LOCK is defined, and a mutex lock is used.
It's also possible to see what happens when neither compile flag is defined, and a lock is not used. In that case the program will detect multiple threads using the active_thread flag as expected.
Does anybody know how to use mutex locks, and is there something wrong with my logic? BTW, I have used mutex locks in code running on a PC without any problems, so I suspect there may be a bug in the PropGCC mutex locks.
Comments
OK, stack was too small (now 10000 as an example) and it appears to require the end of the stack for the stack set call, e.g. pthread_attr_setstackaddr(&attr, stacks[i+1]).
pthread_attr_setstack(&attr, malloc(4096), 4096);
Works fine here.
So according to the function specification, I should specify stacks[i+1] instead of stacks for PropGCC since the stack grows downward. I looked at the PropGCC library code, and it always adds the stack size to the stack address independent of whether it is specified or malloc'ed by the library function. So in the case of PropGCC, stacks is correct. BTW, if the stack address and stack size are not specified, pthread_create will malloc a stack space of 512 bytes.
I guess it's time to start poking around in the pthread_mutex code. I'll copy over pthread_mutex.c from the library and add a few printfs to it.
Eric
It might take me a while to check this.
The problem is that if the lock has already been set to 1, a second thread will set it to 2. When the lockholder is done, he will decrement the lock, which will set it to 1. Therefore, it never gets back to zero. With more active threads the lock will just get stuck at a higher number.
I think the pthread_mutex_lock routine should use pthread_mutex_trylock at the beginning of the function instead of __addlock. This way the lock will only have a value of either 0 or 1. The thread to grab the lock will set it to 1, and then clear it to 0 when it's done.
I also had a problem with pthread_sleep and pthread_wake. Maybe this works when the pthreads are in the same cog, but it doesn't seem to work when they are in different cogs. I replaced pthread_sleep with usleep(100) and disabled pthread_wake, and along with the mutex_trylock change the test program now works. My changes are enabled with the compile flag MUTEX_KLUDGE in pthread_mutex.c.
My working test program is in the attached zip file. This works when pthreads are in separate cogs, but probably won't work for pthreads in the same cog.
My impression is that it is impossible to create locks between threads running on different processors without using actual hardware atomic instructions.
Intel has "lock xchg" for this purpose.
Propeller has the hardware lock mechanism.
Are we using hardware atomic locking here?
So what you are saying is that all synchronization between any COG and any other COG all depends on one global lock to do the atomic operations.
Sounds like the old Linux kernel's global lock
It's a two level system -- the C code sees locks as variables in HUB memory, but they are implemented by grabbing a shared hardware lock with lockset, doing the read-modify-write of HUB RAM, and then releasing the hardware lock (lockclr). The COGs only have to hold the hardware lock long enough to do the update of the "software" lock (HUB location that they're doing an atomic read-modify-write on). The only alternative I could think of was to limit ourselves to 8 mutexes, but that would be pretty restrictive. The code is in the kernel, so there are no LMM cache misses during it -- thus it only holds the hardware lock for 2 hub access windows (or do lockset/lockclr take hub access too? if so it would be 4 hub windows).
Thanks for finding this!
Heater, maybe your FFT with OMP will work with the new fixes.
Great! Thanks for finding the problem.
The fix probably won't help Heater's FFT, though, because the OMP library doesn't use pthreads. But maybe there's a similar bug lurking in there.
Eric
I wonder if there's a way to output the intermediate source file produced after the OMP compilation is done. If so, that would help us figure out what's going wrong with it.
I don't think there is an intermediate source file; OMP is integrated with the rest of the compiler. We can dump the assembler output with -S though.
The pthreads library has some magic in it to force FullDuplexSerial to be used if pthread_create is called, but since OMP doesn't use pthreads you have to do the above "by hand".
Sometimes I wonder if SimpleSerial is worth the hassle. It is nice to not have to dedicate a cog to the serial port, but there are a lot of cases where it's necessary, and things would be simpler if we always had a serial cog.
Mind you, I never managed to convince myself that the FFT failure on the Prop is due to a failure in OMP or a failure in my program. It's just that the same code has worked on every other multi-core machine I tried it on.
I have often wondered about the stack size issue with OMP. I was kind of guessing that it analyses your code enough to know exactly what stack size it needs. Unlike pthreads which just arranges to run whatever function you throw at it in a thread.
You don't really need OMP to parallelize the FFT. You could use pthreads to do that. That's what I did in the threaded chess program.
I do think that OMP could be really useful, and it will become even more interesting on the P2 with 16 cogs and 512K memory.
Certainly the algorithm can be parallelized without OMP. Use pthreads. Or without C. Use assembler.
The magic of OMP is how it makes it easier to have the same source code run on a single core machine, or two cores or four or whatever. It just spreads the work in those for loops around for you if it can.
Here's a fixed version, which also lets you change the stack size by setting __OMP_STACKSIZE to something other than the default 1K. I'll check it in to the default branch later, but for now you can test by just including tinyomp.c in your OMP-using projects.
Sadly I don't have any Propellers to hand at the moment to test this fix with.