Locks on the P2

RossH · 2019-05-25 12:05

Hello all

I'm having some trouble using locks on the P2. They seem to operate quite differently to the locks on the P1.

I was aware that the result of the P2 "locktry" instruction (i.e. the carry flag) is apparently the reverse of the P1 "lockset" instruction, but the differences seem to go deeper than that.

On the P1, only one "lockset" instruction - executed on any cog - would return the result that the lock had been acquired. But on the P2 the "locktry" operation seems to return that result every time it is executed on the same cog once the lock is acquired - i.e. the lock seems to belong to the whole cog, rather than to any particular program executing on that cog.
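
To make the difference concrete, here is a minimal C model of the two behaviours as described in this post. It is a sketch only: `model_lock`, `p1_style_try` and `p2_style_try` are hypothetical names, the functions return a plain success flag rather than modelling the carry-flag polarity of either chip, and the P2 behaviour is taken from the observation above rather than from the hardware documentation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one hardware lock (not real P1/P2 code). */
typedef struct {
    bool held;   /* true while some cog owns the lock */
    int  owner;  /* owning cog, only meaningful while held */
} model_lock;

/* P1-style: succeeds only if the lock was free, whoever asks. */
bool p1_style_try(model_lock *l, int cog) {
    assert(cog >= 0 && cog < 8);
    if (l->held) {
        return false;
    }
    l->held = true;
    l->owner = cog;
    return true;
}

/* P2-style as observed above: the owning cog keeps seeing success. */
bool p2_style_try(model_lock *l, int cog) {
    assert(cog >= 0 && cog < 8);
    if (l->held) {
        return l->owner == cog;
    }
    l->held = true;
    l->owner = cog;
    return true;
}
```

In this model a second `p1_style_try` from cog 0 fails, while a second `p2_style_try` from cog 0 succeeds again, which is why two threads on the same cog can both believe they hold the lock.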

Can anyone confirm that this is correct?

Thanks!

cgracey · 2019-05-25 13:29

On page 54 of the Google doc, LOCKs are explained:

https://docs.google.com/document/d/1UnelI6fpVPHFISQ9vpLzOVa8oUghxpI6UpkXVsYgBEQ/edit?usp=sharing

I changed the way they worked to make them more robust for managing debugging. I can't remember the details of the "whys" at the moment.

RossH · 2019-05-25 23:31

cgracey wrote: »

On page 54 of the Google doc, LOCKs are explained:

https://docs.google.com/document/d/1UnelI6fpVPHFISQ9vpLzOVa8oUghxpI6UpkXVsYgBEQ/edit?usp=sharing

I changed the way they worked to make them more robust for managing debugging. I can't remember the details of the "whys" at the moment.

Yes, I read that - it seemed to confirm what I am seeing in practice - i.e. that locks now belong to the entire cog, and can no longer be used as semaphores to protect critical code segments within a cog. I will try to implement my own.

cgracey · 2019-05-26 00:20

RossH, sorry if I made it worse in some way. I think, though, within a single cog, you don't have the need for atomicity like you do between cogs.

David Betz · 2019-05-26 01:42

RossH wrote: »

cgracey wrote: »

On page 54 of the Google doc, LOCKs are explained:

https://docs.google.com/document/d/1UnelI6fpVPHFISQ9vpLzOVa8oUghxpI6UpkXVsYgBEQ/edit?usp=sharing

I changed the way they worked to make them more robust for managing debugging. I can't remember the details of the "whys" at the moment.

Yes, I read that - it seemed to confirm what I am seeing in practice - i.e. that locks now belong to the entire cog, and can no longer be used as semaphores to protect critical code segments within a cog. I will try to implement my own.

Sounds like you're developing a preemptive multi-tasking system on a single COG!

rogloh · 2019-05-26 02:28

If you used interrupts within a COG and some type of task scheduler I could see you might like to have a semaphore lock in the same COG that the ISR could release, and the task could wait on, though I imagine there could be other simpler ways to do it with other forms of protection like disabling interrupts in critical sections etc.

I guess the more interesting case is when you need protection across COGs at the same time as within a COG. Some multi-core RTOS or something weird like that.

RossH · 2019-05-26 02:49

Catalina has always had multi-threading support built in.

I have tested running 1500 threads on multiple cogs (I used to only be able to run 80 per cog on the P1!) and it works fine - except for the locks.

RossH · 2019-05-26 03:10

Here is a Catalina multi-threaded program, compiled for the P2 EVAL (serial interface, 230400 baud). This one only starts 300 threads and only on a single cog - but you can use as many cogs as you like:

/***************************************************************************\
 *                                                                           *
 *                          Multiple Thread Demo                             *
 *                                                                           *
 *    Demonstrates many threads executing concurrently on a single cog       *
 *                                                                           *
 \***************************************************************************/

/*
 * include Catalina multi-threading:
 */
#include <catalina_threads.h>

/*
 * include some useful multi-threading utility functions:
 */
#include <thread_utilities.h>

/*
 * define how many threads we want:
 */
#ifdef __CATALINA_P2
#define THREAD_COUNT 300 // can go higher, but things start to slow down!
#else
#define THREAD_COUNT 80 // (just barely on the Propeller 1!)
#endif
/*
 * define the stack size each thread needs (since this number depends on the
 * function executed by the thread, the smallest possible stack size has to be 
 * established by trial and error):
 */
#define STACK_SIZE (MIN_THREAD_STACK_SIZE + 45)

/*
 * define the number of thread locks we need:
 */
#define NUM_LOCKS 1

/*
 * define some global variables that all threads will share:
 */
static int ping;

/*
 * a pool of thread locks - note that the pool must be 5 bytes larger than
 * the actual number of locks required (MIN_THREAD_POOL_SIZE = 5) 
 */
static char pool[MIN_THREAD_POOL_SIZE + NUM_LOCKS]; 

static int lock;

/*
 * function : this function can be executed as a thread.
 */
int function(int me, char *not_used[]) {

   while (1) {
      if (ping == me) {
         // print our id
         _thread_printf(pool, lock, "%d ", (unsigned)me);
         ping = 0;
      }
      else {
         // nothing to do, so yield
         _thread_yield();
      }
   }
   return 0;
}

/*
 * main : start up to THREAD_COUNT threads, then ping each one in turn
 */
int main(void) {

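   int i = 0;
   // note: no local "lock" is declared here, so that main and the threads
   // all use the same file-scope "lock" variable declared above
   void *thread_id;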

   unsigned long stacks[STACK_SIZE * THREAD_COUNT];

   // assign a lock to avoid context switch contention 
   _thread_set_lock(_locknew());

   // initialize a pool of thread locks
   _thread_init_lock_pool (pool, NUM_LOCKS, _locknew());

   // assign a thread lock to avoid plugin contention
   lock = _thread_locknew(pool);

   _thread_printf(pool, lock, "Press a key to start\n");
   k_wait();

   // start instances of function until we have started THREAD_COUNT of them
   for (i = 1; i <= THREAD_COUNT; i++) {
      thread_id = _thread_start(&function, &stacks[STACK_SIZE*i], i, NULL);
      _thread_printf(pool, lock, "thread %d ", i);
      if (thread_id == (void *)0) {
         _thread_printf(pool, lock, " failed to start\n");
         while (1) { };
      }
      else {
         _thread_printf(pool, lock, " started, id = %d\n", (unsigned)thread_id);
      }
   }

   // now loop forever, pinging each thread in turn
   while (1) {
      _thread_printf(pool, lock, "\n\nPress a key to ping all threads\n");
      k_wait();
      for (i = 1; i <= THREAD_COUNT; i++) {
         _thread_printf(pool, lock, "%d:", i);
         // ping the thread
         ping = i;
         // wait till thread responds
         while (ping) {
            // nothing to do, so yield
            _thread_yield();
         };
      }
   }

   return 0;
}

rogloh · 2019-05-26 03:30

Interesting @RossH, is your Catalina environment purely co-operative or is it also pre-emptively scheduled? That is, do tasks have to call thread_yield() for your scheduler to work? I'm guessing without interrupts on the P1 it would need to be co-operative, while P2 might potentially support pre-emptive now with its timer interrupts. Any concept of thread priorities in the task scheduling? I'll have to take a look at it sometime.

evanh · 2019-05-26 03:44

There are a couple of instructions that return a pre-modified compare status. Namely CMPSUB and the FGE/FLE group. I think FGE could be the most helpful for acquiring a lock.

None of the bit setting instructions have an equivalent though, so you have to use a whole register for each lock.

EDIT: INCMOD/DECMOD can do this too.

evanh · 2019-05-26 03:56

Oh, wow, MOV can even do it! Because C = S[31], when S and D are the same location then C can tell the prior state of the lock. Bit 31 isn't a convenient value to work with though.

RossH · 2019-05-26 04:06

rogloh wrote: »

Interesting @RossH, is your Catalina environment purely co-operative or is it also pre-emptively scheduled? That is, do tasks have to call thread_yield() for your scheduler to work? I'm guessing without interrupts on the P1 it would need to be co-operative, while P2 might potentially support pre-emptive now with its timer interrupts. Any concept of thread priorities in the task scheduling? I'll have to take a look at it sometime.

No, it is not based on co-routines (if that's what you mean by co-operative). You may have thought so because of the "yield" operations shown in the example. However, these are not necessary, and the program works with them removed - they are included so that a thread that finds it has nothing useful to do can tell the kernel that it can context switch to another thread if there are any waiting (otherwise it does nothing).

But it is also not pre-emptive. There is just a simple round-robin scheduler built into each multi-threading kernel. And yes, on the P1 it works without interrupts. I may modify it to use interrupts on the P2 - in fact, I will need to for the new "NATIVE" mode, when there is no actual kernel that can do the task scheduling.

Ross.
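
As an illustration of what "simple round-robin" selection can look like, here is a minimal C sketch. It is not Catalina's kernel code: `next_thread` and the `ready` array are hypothetical, and the sketch only shows the selection step, not how or when the kernel actually switches context.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Pick the next ready thread after 'current', wrapping around the table.
 * Returns 'current' itself if no other thread is ready.
 */
int next_thread(const bool ready[], int count, int current) {
    assert(count > 0 && current >= 0 && current < count);
    for (int step = 1; step <= count; step++) {
        int candidate = (current + step) % count;
        if (ready[candidate]) {
            return candidate;
        }
    }
    return current;
}
```

With `ready = { true, false, true, false }`, a switch away from thread 0 selects thread 2, and a switch away from thread 2 wraps back to thread 0. A yield in this model simply asks for that selection to happen immediately.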

RossH · 2019-05-26 04:13

evanh wrote: »

There are a couple of instructions that return a pre-modified compare status. Namely CMPSUB and the FGE/FLE group. I think FGE could be the most helpful for acquiring a lock.

None of the bit setting instructions have an equivalent though, so you have to use a whole register for each lock.

EDIT: INCMOD/DECMOD can do this too.

Thanks. I will investigate. However, I have to be able to implement locks without using up cog resources for each one. If you are running thousands of threads and each one needs a lock (for some reason) then you would soon run out of cog resources!

With the P1-style semaphores, I can implement as many thread locks as I need using just one "true" lock and some hub RAM. But this fails on the P2, because the locks are not true semaphores.
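
The general idea of building many thread locks from one "true" lock plus hub RAM can be sketched in C as follows. This is an illustrative model under stated assumptions, not Catalina's implementation: `master_try`, `master_release`, `thread_lock_try`, `thread_lock_release` and `POOL_LOCKS` are hypothetical names, and the master lock is assumed to be a genuine test-and-set semaphore.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_LOCKS 4                          /* hypothetical pool size */

static bool master_held;                      /* stand-in for the one hardware lock */
static unsigned char pool_state[POOL_LOCKS];  /* one byte of "hub RAM" per thread lock */

/* Stand-ins for the hardware lock, assumed to be a true test-and-set. */
static bool master_try(void) {
    if (master_held) {
        return false;
    }
    master_held = true;
    return true;
}

static void master_release(void) {
    master_held = false;
}

/* Try to take thread lock n. Returns true if we now hold it. */
bool thread_lock_try(int n) {
    assert(n >= 0 && n < POOL_LOCKS);
    if (!master_try()) {
        return false;               /* someone else is updating the pool */
    }
    bool got = (pool_state[n] == 0);
    if (got) {
        pool_state[n] = 1;
    }
    master_release();
    return got;
}

/* Release thread lock n. */
void thread_lock_release(int n) {
    assert(n >= 0 && n < POOL_LOCKS);
    while (!master_try()) {
        /* spin; a real kernel would yield here */
    }
    pool_state[n] = 0;
    master_release();
}
```

The scheme depends on `master_try` failing for everyone except one caller. If the underlying lock reports success to every thread on the owning cog, two threads on that cog can both pass `master_try` and update `pool_state` at the same time.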

There will be a solution - I just don't know what it is yet!

RossH · 2019-05-26 04:23

evanh wrote: »

Oh, wow, MOV can even do it! Because C = S[31], when S and D are the same location then C can tell the prior state of the lock. Bit 31 isn't a convenient value to work with though.

Yes, this might work. I would have to use one hub lock to resolve inter-cog conflicts, plus one register per cog to prevent intra-cog conflicts.

Thanks.

evanh · 2019-05-26 04:44

Doh! MOV doesn't work because it won't modify the lock when both S and D are the same location.

evanh · 2019-05-26 04:53

Oops, maybe I've spoken wrong about the BITx instruction too. Time to do some testing ...

EDIT: Okay, yes, these are the best for the job. BITH both sets the target bit and returns its prior state. Dunno why I thought otherwise now.

RossH · 2019-05-26 05:25

Ok - here is what I have come up with for the intra-cog lock. I have written them as if they were actual subroutine calls, but in fact they would be inlined (i.e. 2 instructions to set the lock, one instruction to clear it). I would prefer to do without the need for a "max" long, but without it I could only have a maximum of 511 threads:

DAT


' set_lock : return with carry flag set if we successfully set the lock

set_lock
          decmod  lock,max wc   ' if we set the lock then C will be set 
 if_nc    incmod  lock,max      ' we did not set the lock, so restore it
          ret


' clr_lock : we must clear the lock to allow others to set it

clr_lock
    _ret_ incmod  lock,max      ' release the lock 


' lock variables :

lock      long    0             ' lock must initially be zero
max       long    10000         ' must be larger than max number of threads

Can anyone see any problems, or improve on this?

Thanks!
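
For readers who want to trace the register values, the same logic can be modelled in C. This is a sketch: the `decmod` and `incmod` helpers are hypothetical stand-ins that assume DECMOD wraps from 0 to S with C set, and INCMOD wraps from S to 0 with C set, and the lock and limit are passed as parameters rather than living in cog registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed DECMOD behaviour: if *d == 0 then *d = s and C = 1, else *d -= 1 and C = 0. */
static bool decmod(uint32_t *d, uint32_t s) {
    if (*d == 0) {
        *d = s;
        return true;
    }
    *d -= 1;
    return false;
}

/* Assumed INCMOD behaviour: if *d == s then *d = 0 and C = 1, else *d += 1 and C = 0. */
static bool incmod(uint32_t *d, uint32_t s) {
    if (*d == s) {
        *d = 0;
        return true;
    }
    *d += 1;
    return false;
}

/* Returns true (carry set) if we acquired the lock. */
bool set_lock(uint32_t *lock, uint32_t max) {
    assert(max > 0);
    bool carry = decmod(lock, max);
    if (!carry) {
        incmod(lock, max);   /* we did not get the lock, so restore it */
    }
    return carry;
}

/* Release the lock: wraps the value from max back to zero. */
void clr_lock(uint32_t *lock, uint32_t max) {
    assert(max > 0);
    incmod(lock, max);
}
```

Starting from zero, the first `set_lock` leaves the value at `max` and succeeds. Later attempts step it down and back up again and fail, until `clr_lock` wraps it back to zero.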

evanh · 2019-05-26 05:26

Just need an ALTBH prefix instruction now and it could handle thousands of locks with a single index.

evanh · 2019-05-26 05:40

RossH wrote: »

Can anyone see any problems, or improve on this?

Lol, an analogue readout of that would be so noisy! It looks to work but BITx instructions are the obvious best solution now. Sorry for not seeing that earlier.

EDIT: Here's an example using BITH and BITL (limited to 32 locks):

set_lock
		bith	lock, locki	wcz	'request lock, C and Z set if already taken
		ret


clr_lock
		bitl	lock, locki	wcz	'release lock, C and Z set if normal release
		ret
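
The flag behaviour in those comments can be modelled in C as a test-and-set on one bit of a 32-bit word. This is only a sketch: the `bith` and `bitl` functions below are hypothetical C stand-ins that return the prior bit state, which is what the WCZ effect reports in C and Z.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of BITH with WCZ: set the bit, return its prior state. */
bool bith(uint32_t *d, unsigned bit) {
    assert(bit < 32);
    uint32_t mask = UINT32_C(1) << bit;
    bool prior = (*d & mask) != 0;
    *d |= mask;
    return prior;
}

/* Model of BITL with WCZ: clear the bit, return its prior state. */
bool bitl(uint32_t *d, unsigned bit) {
    assert(bit < 32);
    uint32_t mask = UINT32_C(1) << bit;
    bool prior = (*d & mask) != 0;
    *d &= ~mask;
    return prior;
}
```

A result of `false` from `bith` means the lock was free and is now ours, and `true` means it was already taken. A result of `true` from `bitl` is the normal release case.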

RossH · 2019-05-26 05:54

evanh wrote: »

RossH wrote: »

Can anyone see any problems, or improve on this?

Lol, an analogue readout of that would be so noisy! It looks to work but BITx instructions are the obvious best solution now. Sorry for not seeing that earlier.

Yes, your bith/bitl solution looks better than mine!

cgracey · 2019-05-26 06:32

But, wait! There's more...

set_lock
	_ret_	bith	lock, locki	wcz	'request lock, C and Z set if already taken


clr_lock
	_ret_	bitl	lock, locki	wcz	'release lock, C and Z set if normal release

No need to CALL it, even. Just put the instruction wherever it's needed.

RossH · 2019-05-26 06:47

Here is a solution for P1-style locks - 3 instructions to lock, 2 to unlock.

And again, if anyone can see something wrong or has an improvement, all suggestions welcome!

' Simulating P1-style locks on the P2 ...

' set_lock : return with C=1 and Z=0 (i.e. C_AND_NZ) if we get the lock. 
'            note we must get both inter-cog and intra-cog locks.

set_lock 

              bith    lock,#31 wcz ' can we get intra-cog lock?
 if_nz        locktry lock wc      ' Z=0 means yes - can we get inter-cog lock?
 if_nz_and_nc bitl    lock,#31     ' C=0 means no - release intra-cog lock
              ret

' clr_lock : release both locks.

clr_lock 
              lockrel lock         ' release inter-cog lock
              bitl    lock,#31     ' release intra-cog lock
              ret

' lock : bits 3:0 hold the number of the inter-cog lock, 
'        while bit 31 is the actual intra-cog lock

lock          long 0
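
The two-level scheme can also be modelled in C to check the logic across cogs. This is a sketch under assumptions: `hw_lock`, `locktry` and `lockrel` are hypothetical stand-ins for the shared hardware lock, modelled with the per-cog ownership behaviour described earlier in this thread, and each cog's `lock` register is passed in as `reg`. The functions return a plain success flag instead of the C and Z combination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INTRA_BIT (UINT32_C(1) << 31)   /* bit 31 is the intra-cog lock */

/* Shared hardware lock, modelled as owned by a whole cog. */
typedef struct {
    bool held;
    int  owner;
} hw_lock;

static bool locktry(hw_lock *h, int cog) {
    if (h->held) {
        return h->owner == cog;   /* the owning cog sees success again */
    }
    h->held = true;
    h->owner = cog;
    return true;
}

static void lockrel(hw_lock *h, int cog) {
    if (h->held && h->owner == cog) {
        h->held = false;
    }
}

/* Returns true only if both the intra-cog and inter-cog locks were obtained. */
bool set_lock(uint32_t *reg, hw_lock *h, int cog) {
    assert(cog >= 0 && cog < 8);
    if (*reg & INTRA_BIT) {
        return false;             /* another thread on this cog holds it */
    }
    *reg |= INTRA_BIT;            /* take the intra-cog lock */
    if (locktry(h, cog)) {
        return true;              /* got the inter-cog lock too */
    }
    *reg &= ~INTRA_BIT;           /* back out of the intra-cog lock */
    return false;
}

/* Release both locks. */
void clr_lock(uint32_t *reg, hw_lock *h, int cog) {
    assert(cog >= 0 && cog < 8);
    lockrel(h, cog);
    *reg &= ~INTRA_BIT;
}
```

In this model a second attempt from the same cog fails on the intra-cog bit, and an attempt from another cog fails on the hardware lock and leaves that cog's register unchanged.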

EDIT: Oops! Must use wcz with bith. Why?

evanh · 2019-05-26 07:03

RossH wrote: »

EDIT: Oops! Must use wcz with bith. Why?

The BITxx group of instructions share opcode encoding with TESTBx group. BITxx can have WCZ or none. TESTBx must be either WC or WZ.

cgracey · 2019-05-26 07:23

evanh wrote: »

RossH wrote: »

EDIT: Oops! Must use wcz with bith. Why?

The BITxx group of instructions share opcode encoding with TESTBx group. BITxx can have WCZ or none. TESTBx must be either WC or WZ.

Plus, there are logical flag operators for TESTB/TESTBN:

TESTB   D,{#}S         WC/WZ
TESTBN  D,{#}S         WC/WZ
TESTB   D,{#}S     ANDC/ANDZ
TESTBN  D,{#}S     ANDC/ANDZ
TESTB   D,{#}S       ORC/ORZ
TESTBN  D,{#}S       ORC/ORZ
TESTB   D,{#}S     XORC/XORZ
TESTBN  D,{#}S     XORC/XORZ
BITL    D,{#}S         {WCZ}
BITH    D,{#}S         {WCZ}
BITC    D,{#}S         {WCZ}
BITNC   D,{#}S         {WCZ}
BITZ    D,{#}S         {WCZ}
BITNZ   D,{#}S         {WCZ}
BITRND  D,{#}S         {WCZ}
BITNOT  D,{#}S         {WCZ}

evanh · 2019-05-26 07:35

Ross,
I think it should be "if_nz" ... because BITH returns the prior state, not the change of state. C/Z comes back low for a successful try.

RossH · 2019-05-26 07:40

evanh wrote: »

Ross,
I think it should be "if_nz" ... because BITH returns the prior state, not the change of state. C/Z comes back low for a successful try.

Yes, you are correct. Amended.

RossH · 2019-05-27 10:10

Just to finish off this thread - the P1-style lock simulation works as expected, and I now have Catalina's multi-threading support working properly on the P2.

Here is a more sophisticated multi-threading demo - this program runs 5 multi-threaded kernel cogs (4 started dynamically) and then 50 threads. The threads wander around between the kernel cogs, moving themselves from cog to cog randomly. As usual, this program is compiled for the P2 EVAL board, serial interface, 230400 baud.

/***************************************************************************\
 *                                                                           *
 *                          Thread Affinity Demo                             *
 *                                                                           *
 *            Demonstrates changing the affinity of a thread                 *
 *                                                                           *
 *      (i.e. moving threads between kernels running on different cogs)      *
 *                                                                           *
 \***************************************************************************/

/*
 * include Catalina multi-threading functions:
 */
#include <catalina_threads.h>

/*
 * include some useful multi-threading utility functions:
 */
#include <thread_utilities.h>

/*
 * define how many additional kernel cogs we want (note: there must 
 * be this many free cogs available!):
 */
#define NUM_KERNELS 4

/*
 * define how many threads we want per kernel:
 */
#define NUM_THREADS 10

/*
 * define how many thread locks we want (we only really need 1):
 */
#define NUM_LOCKS 1

/*
 * define the stack size for each kernel cog and each thread:
 */
#define STACK_SIZE (MIN_THREAD_STACK_SIZE + 100)


/*
 * global variables that all multi-threaded cogs will share ...
 */

/*
 * flag to tell all kernels to start their threads:
 */
static int start_threads;

/*
 * flag to tell all threads to start switching between kernels:
 */
static int start_switching;

/*
 * a lock to use to avoid kernel contention (all kernels must use
 * the same lock for this purpose)
 */
static int kernel_lock;

/*
 * a pool of thread locks - note that the pool must be 5 bytes larger than
 * the actual number of locks required (MIN_THREAD_POOL_SIZE = 5):
 */
static char pool[MIN_THREAD_POOL_SIZE + NUM_LOCKS]; 

/*
 * The particular thread lock (out of the pool above) that we will use to 
 * protect our HMI functions:
 */
static int hmi_lock;

/*
 * cogs running multithreading kernels notify the threads that they are
 * available by putting a 1 in this array:
 */
static int kernel[8] = { 0 };


/*
 * thread_function : this function can be started as a thread. It runs on the
 *                   cog it is started for a while, then moves itself to the 
 *                   next available cog (cogs running multi-threading kernels
 *                   are indicated by the value 1 in the kernel array).
 */
int thread_function(int argc, char *argv[]) {

   void *me = _thread_id();
   int old_cog;
   int new_cog;

   // get our initial cog 
   old_cog = _cogid();

   // print where we were started
   _thread_printf(pool, hmi_lock, "Thread %d (%s) started on cog %d\n",
                  argc, argv[0], old_cog);


   // wait until we are told to start switching
   while (!start_switching) {
      _thread_yield();
   }

   while (1) {

      // wait a random time (to mix things up a little, but 
      // not go so fast that we can't read the messages!)
      _thread_wait(200*random(5));

      // get our current cog
      old_cog = _cogid();

      // find the next available multi-threading kernel
      new_cog = old_cog;
      do {
         new_cog = (new_cog + 1) % 8;
      } while (kernel[new_cog] == 0);

      // 50% of the time, move ourselves to the new kernel
      if (random(100) > 50) {
         _thread_affinity_change (me, new_cog);
      }
      
      // get our new cog
      new_cog = _cogid();

      // print a message if we moved
      if ((new_cog != old_cog)) {
         _thread_printf(pool, hmi_lock, 
                        "Thread %d (%s) moved from cog %d to cog %d\n",
                        argc, argv[0], old_cog, new_cog);
      }
   }
   return 0;
}

/*
 * cog_function : this function will be run as the first thread of a new 
 *                multi-threading kernel on a new cog. This function will
 *                then start NUM_THREADS threads, which will wander between
 *                all the available multi-threading kernels.
 */
int cog_function(int argc, char *argv[]) {

   int cog = _cogid();
   void *me = _thread_id();
   void *thread;
   char *message[1] = {"g'day!"};
   int i;

   // stack space for threads
   unsigned long thread_stack[STACK_SIZE * NUM_THREADS];

   // set the lock of this kernel (all kernels must use the same lock, and
   // this must be set up before any other thread functions are called)
   _thread_set_lock(kernel_lock);

   // announce ourselves 
   _thread_printf(pool, hmi_lock, 
                 "Multi-threading kernel (%s) started on cog %d\n",
                 argv[0], cog);

   // indicate we are available to run threads
   kernel[cog] = 1;

   // wait until we are told to start the threads
   while (!start_threads) {
      _thread_yield();
   }

   // start some threads that will wander between the kernels
   for (i = 0; i < NUM_THREADS; i++) {
      thread = _thread_start(&thread_function, 
                             &thread_stack[STACK_SIZE * (i + 1)], 
                             (cog+1)*NUM_THREADS + i, 
                             message);
      if (thread == 0) {
         _thread_printf(pool, hmi_lock, "Failed to start thread\n");
      }
   }

   // now wait forever - this thread does not actually do anything
   // except give the multi-threading kernel something to execute
   // when it is not executing any other threads. It could perform
   // other tasks if required.
   while (1) {
      _thread_yield();
   }

   return 0;
}

/*
 * main : Start NUM_KERNELS additional kernels, and then start NUM_THREADS 
 *        threads that will switch between them. Each kernel will also start 
 *        NUM_THREADS threads of their own.
 */
int main(int argc, char *argv[]) {
   int i;
   int cog;
   void *thread;
   char *message[1] = {"hello!"};

   // stack space for kernels and threads   
   unsigned long kernel_stack[NUM_KERNELS * (STACK_SIZE * NUM_THREADS + 100)];
   unsigned long thread_stack[STACK_SIZE * NUM_THREADS];

   // assign a lock to be used to avoid kernel contention
   kernel_lock = _locknew();

   // set the lock of this kernel (all kernels must use the same lock, and
   // this must be set up before any other thread functions are called)
   _thread_set_lock(kernel_lock);
   
   // initialize a pool of thread locks
   _thread_init_lock_pool (pool, NUM_LOCKS, _locknew());

   // assign a thread lock to avoid plugin contention
   hmi_lock = _thread_locknew(pool);

   // a delay here is used to introduce some randomness
   _thread_printf(pool, hmi_lock, "\nPress a key to start kernels\n");
   k_wait();
   randomize();

   // start additional multi-threading kernels
   for (i = 0; i < NUM_KERNELS; i++) {
      cog = _thread_cog(&cog_function, 
                        &kernel_stack[(STACK_SIZE*NUM_THREADS + 100)*(i + 1)], 
                        i, message);
      if (cog < 0) {
         _thread_printf(pool, hmi_lock, "Failed to start kernel\n");
      }
   }

   // announce ourselves
   cog = _cogid();
   _thread_printf(pool, hmi_lock, 
                  "Multi-threading kernel also running on cog %d\n", 
                  cog);

   // declare ourselves available to run threads
   kernel[cog] = 1;

   _thread_wait(500);

   // now start the threads on all the kernels
   _thread_printf(pool, hmi_lock, "\nPress a key to start all threads\n");
   k_wait();

   start_threads = 1;

   // start some threads of our own that will wander between the kernels
   for (i = 0; i < NUM_THREADS; i++) {
      thread = _thread_start(&thread_function, 
                             &thread_stack[STACK_SIZE * (i + 1)], 
                             (cog+1)*NUM_THREADS + i, 
                             message);
      if (thread == 0) {
         _thread_printf(pool, hmi_lock, "Failed to start thread\n");
      }
   }

   _thread_wait(500);

   // now allow all the threads to switch between kernels
   _thread_printf(pool, hmi_lock, "\nPress a key to start thread switching\n");
   k_wait();

   start_switching = 1;

   // now wait forever - this thread does not actually do anything
   // except give the multi-threading kernel something to execute
   // when it is not executing any other threads. It could perform
   // other tasks if required.
   while (1) {
      _thread_yield();
   }

   return 0;
}

The multi-threading support will be part of the next release of Catalina.

evanh · 2019-05-27 11:13

Whaa! I can't see any pasm.

cgracey · 2019-05-27 17:54

RossH, does this abstract thread execution to the point where the kernel under which a thread runs becomes trivial? Could threads be genericized to the point where they could be automatically distributed among kernels?

RossH · 2019-05-28 00:19

cgracey wrote: »

RossH, does this abstract thread execution to the point where the kernel under which a thread runs becomes trivial? Could threads be genericized to the point where they could be automatically distributed among kernels?

Possibly. I'll be able to answer that question better once I have completed the thread support for the new "native" mode ... because there is no kernel in this mode!

RossH · 2019-05-28 00:21

evanh wrote: »

Whaa! I can't see any pasm.

The demo program was compiled in "compact" mode, so no pasm. Wait till I finish the other modes (compact mode is always the first one I work on, because it is the easiest).
