Shop OBEX P1 Docs P2 Docs Learn Events
Catalina - The Next Generation — Parallax Forums

Catalina - The Next Generation

RossHRossH Posts: 5,519
edited 2010-11-06 16:36 in Propeller 1
You know how it is ... you just finish developing a wonderful concurrent algorithm, and then discover that you have don't have enough cogs to execute it!

Well, your worries are over! Catalina now incorporates a multi-threading kernel to save you ever having to worry about running out of cogs again!

With the multi-threading kernel, Catalina users now have the option of starting a C function on another cog using the normal kernel (effectively limiting you to 8 concurrent functions) ... OR ... starting a multi-threaded kernel on one or more cogs and running as many concurrent C functions as you like! You can even run multi-threading kernels on all 8 cogs simultaneously if you want.

Below is an example of the Dining Philosopher's program re-coded to run under Catalina's new multi-threaded kernel. With the previous version of this program the only option was to dedicate a whole cog to each diner. When you added in the necessary display drivers etc you could typically only support 5 diners. The following program runs 16 diners concurrently. It could just as easily run 100, except that I couldn't think of that many philosophers!
/***************************************************************************\
 *                                                                           *
 *                            Dining Philosophers                            *
 *                                                                           *
 * This progran simulates the "Dining Philosophers" dilemma on the Propeller *
 * using a thread to represent each philosopher, and a thread lock to        *
 * represent each fork.                                                      *
 *                                                                           *
 * For more details on this classic problem in concurrency, see:             *
 *                                                                           *
 *   http://en.wikipedia.org/wiki/Dining_philosophers_problem                *
 *                                                                           *
 \***************************************************************************/

/*
 * include Catalina multi-threading:
 */
#include "catalina_threads.h"

/*
 * include some useful multi-thread utility functions:
 */
#include "thread_utilities.h"

/*
 * define stack size for threads:
 */
#define STACK_SIZE 100

/*
 * define how many Philosophers we have:
 */
#define NUM_PHILOSOPHERS 16

/*
 * define how many courses they will eat:
 */
#define NUM_COURSES 20


struct Customer {
   char *name;  // customer's name
   int   left;  // customer's left fork
   int   right; // customer's right fork
};

static struct Customer Philosopher[NUM_PHILOSOPHERS] = {
   {"Aristotle",     0,  1},
   {"Kant",          1,  2},
   {"Spinoza",       2,  3},
   {"Marx",          3,  4},
   {"Russell",       4,  5},
   {"Aquinas",       5,  6},
   {"Bacon",         6,  7},
   {"Hume",          7,  8},
   {"Descarte",      8,  9},
   {"Plato",         9, 10},
   {"Hegel",        10, 11},
   {"de Beauvoir",  11, 12},
   {"Sartre",       12, 13},
   {"Wittgenstein", 13, 14},
   {"Schopenhauer", 14, 15},
   {"Rousseau",     15,  0},
};


static void * Seat[NUM_PHILOSOPHERS];   // maps seats to threads

static int Fork[NUM_PHILOSOPHERS];      // maps forks to thread locks

static int HMI_Lock = 0;                // cog lock to prevent HMI contention

static int Dinner_Bell = 0;             // set to 1 to start dinner

static char pool[NUM_PHILOSOPHERS + 5]; // a pool of thread locks


/*
 * Pick_Up - pick up a fork
 */
void Pick_Up(int fork) {
   do { } while (_thread_lockset(pool, Fork[fork]));
}


/*
 * Put_Down - put down a fork
 */
void Put_Down(int fork) {
   _thread_lockclr(pool, Fork[fork]);
}


/*
 * Diner - this function runs as a thread to simulate a diner
 */
int Diner(int me, char *not_used[]) {
   void * my_thread;
   int course;

   // wait till dinner is served
   do { } while (!Dinner_Bell);

   // get our thread identity and our seat number
   my_thread = _thread_id();

   // be seated
   thread_printf(HMI_Lock, "%s has been seated\n", Philosopher[me].name);

   // eat dinner
   for (course = 0; course < NUM_COURSES; course++) {
      thread_printf(HMI_Lock, "%s is thinking\n", Philosopher[me].name);
      thread_wait(random(2000));
      thread_printf(HMI_Lock, "%s is hungry\n", Philosopher[me].name);
      Pick_Up(Philosopher[me].left);
      Pick_Up(Philosopher[me].right);
      thread_printf(HMI_Lock, "%s is eating\n", Philosopher[me].name);
      thread_wait(random(2000));
      Put_Down(Philosopher[me].right);
      Put_Down(Philosopher[me].left);
   }

   // leave
   thread_printf(HMI_Lock, "%s is leaving\n", Philosopher[me].name);
   thread_wait(100);
   return 0;
}
   
/*
 * main - set up the diners and then start dinner
 */
void main() {

   int i = 0;
   long stacks[STACK_SIZE * NUM_PHILOSOPHERS];

   // assign a lock to avoid context switch contention 
   _thread_set_lock(_locknew());

   // assign a lock to avoid HMI contention
   HMI_Lock = _locknew();

   // allocate a pool of thread locks to use as forks
   _thread_init_lock_pool(pool, NUM_PHILOSOPHERS, _locknew());

   thread_printf(HMI_Lock, "The Dining Philosophers\n\n");
   thread_printf(HMI_Lock, "Press a key to begin ...\n\n");

   k_wait();
   randomize();

   // set up seats and forks 
   for (i = 0; i < NUM_PHILOSOPHERS; i++) {
      Seat[i] = _thread_start(&Diner, &stacks[STACK_SIZE * (i + 1)], i, NULL);
      Fork[i] = _thread_locknew(pool);
   }
   thread_printf(HMI_Lock, "Table set for %d diners\n\n", NUM_PHILOSOPHERS);

   // now start dinner
   Dinner_Bell = 1;

   // now loop forever 
   while (1) { }
}
The thread functions are intentionally very similar to the functions used in the previous "multicog" Catalina demos. There are new functions to manage threads (e.g. _thread_start(), _thread_stop and _thread_id()) as well as new functions to manage "thread locks" (e.g _thread_locknew(), _thread_lockset() and _thread_lockclr()). You can have as many thread locks as you like - you are not limited to the 8 built-in locks (but these are still available).

A binary for the Hydra showing this program in action is attached. It uses the Hydra TV display and keyboard. It was compiled using the command:
catalina -lci -lthreads -D THREADED -D HYDRA -O3 dining_philosophers.c
Defining the THREADED symbol selects the multi-threaded kernel, and the "threads" library includes all the necessary multithreading support functions. The combination of these two is similar in effect to the POSIX "pthreads" library.

I will release the new multi-threading version of Catalina shortly. It will first be made available as a patch release to those users who have purchased the Catalina Optimizer. The new version fully supports both the Optimizer and the BlackCat and BlackBox debuggers.

Ross.

Comments

  • Bill HenningBill Henning Posts: 6,445
    edited 2010-10-09 07:42
    Great work!
  • RossHRossH Posts: 5,519
    edited 2010-10-09 19:49
    Thanks Bill!

    Everything seems to be working quite well - I'm just doing some final integration work, and then there's a bit of documentation left to do.

    What I'm working on now is integrating the new multi-threading support with the existing multi-cog support. The idea of this is that when you spawn a new concurrent C function, you can let Catalina decide automatically whether to use a cog or a thread. By default it will first use any unused cogs. Once there are no unused cogs left, it will instead spawn the function as a new thread on the current cog. This will be completely transparent to the program. I expect this will be the method most people will use to spawn new processes. In most cases it will give you the best (fastest) results.

    Of course, there will always be circumstances where you definitely do NOT want to use multi-threading - such as when you are controlling external hardware and need the determinism or the speed provided by the non-threaded kernel. So you will still be able to dedicate a C function to execute on its own unthreaded cog if you need to do so.

    I hope to finish it all today - if I don't manage it then it will probably be another week or so before I get a chance to get back to it.

    Ross.
  • RossHRossH Posts: 5,519
    edited 2010-10-09 23:06
    Ok - it turns out I was exaggerating slightly. The dining philosophers program cannot possibly be extended to 100 philosophers because it would simply run out of Propeller Hub RAM to use as stack space.

    But here is a slightly simpler program that executes 80 concurrent C threads on a single cog:
      /***************************************************************************\
     *                                                                           *
     *                          Multiple Thread Demo                             *
     *                                                                           *
     *    Demonstrates many threads executing concurrently on a single cog       *
     *                                                                           *
     \***************************************************************************/
    
    /*
     * include Catalina multi-threading:
     */
    #include "catalina_threads.h"
    
    /*
     * include the definitions of some useful multi-thread utility functions:
     */
    #include "thread_utilities.h"
    
    /*
     * define how many threads we want:
     */
    #define THREAD_COUNT 80
    
    /*
     * define the stack size each thread needs (established by trial and error!):
     */
    #define STACK_SIZE 75
    
    /*
     * define some global variables that all threads will share:
     */
    static int ping;
    static int lock;
    
    /*
     * function : this function can be executed as a thread.
     */
    int function(int me, char *not_used[]) {
    
       while (1) {
          if (ping == me) {
             // print our id
             thread_printf(lock, "%d ", (unsigned)me);
             ping = 0;
          }
          else {
             // nothing to do, so yield
             _thread_yield();
          }
       }
       return 0;
    }
    
    /*
     * main : start up to THREAD_COUNT threads, then ping each one in turn
     */
    void main() {
    
       int i = 0;
       void *thread_id;
    
       unsigned long stacks[STACK_SIZE * THREAD_COUNT];
    
       // assign a lock to avoid context switch contention 
       _thread_set_lock(_locknew());
    
       // assign a lock to avoid plugin contention
       lock = _locknew();
    
       thread_printf(lock, "Press a key to start\n");
       k_wait();
    
       // start instances of function until we have started THREAD_COUNT of them
       for (i = 1; i <= THREAD_COUNT; i++) {
          thread_id = _thread_start(&function, &stacks[STACK_SIZE*i], i, NULL);
          thread_printf(lock, "thread %d ", i);
          if (thread_id == (void *)0) {
             thread_printf(lock, " failed to start\n");
             while (1) { };
          }
          else {
             thread_printf(lock, " started, id = %d\n", (unsigned)thread_id);
          }
       }
    
       // now loop forever, pinging each thread in turn
       while (1) {
          thread_printf(lock, "\n\nPress a key to ping all threads\n");
          k_wait();
          for (i = 1; i <= THREAD_COUNT; i++) {
             thread_printf(lock, "%d:", i);
             // ping the thread
             ping = i;
             // wait till thread responds
             while (ping) {
                // nothing to do, so yield
                _thread_yield();
             };
          }
       }
    }
    
    When run, this program displays the following:
    Press a key to start
    thread 1  started, id = 7236
    thread 2  started, id = 7536
    thread 3  started, id = 7836
    
       ... 
    
    thread 79  started, id = 30636
    thread 80  started, id = 30936
    
    
    Press a key to ping all threads
    1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9 10:10 11:11 12:12 13:13 14:14 15:15 16:16 17:17
    18:18 19:19 20:20 21:21 22:22 23:23 24:24 25:25 26:26 27:27 28:28 29:29 30:30 31:31
    32:32 33:33 34:34 35:35 36:36 37:37 38:38 39:39 40:40 41:41 42:42 43:43 44:44 45:45
    46:46 47:47 48:48 49:49 50:50 51:51 52:52 53:53 54:54 55:55 56:56 57:57 58:58 59:59
    60:60 61:61 62:62 63:63 64:64 65:65 66:66 67:67 68:68 69:69 70:70 71:71 72:72 73:73
    74:74 75:75 76:76 77:77 78:78 79:79 80:80
    
    Press a key to ping all threads
    1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9 10:10 11:11 12:12 13:13 14:14 15:15 16:16 17:17
    18:18 19:19 20:20 21:21 22:22 23:23 24:24 25:25 26:26 27:27 28:28 29:29 30:30 31:31
    32:32 33:33 34:34 35:35 36:36 37:37 38:38 39:39 40:40 41:41 42:42 43:43 44:44 45:45
    46:46 47:47 48:48 49:49 50:50 51:51 52:52 53:53 54:54 55:55 56:56 57:57 58:58 59:59
    60:60 61:61 62:62 63:63 64:64 65:65 66:66 67:67 68:68 69:69 70:70 71:71 72:72 73:73
    74:74 75:75 76:76 77:77 78:78 79:79 80:80
    
    Press a key to ping all threads
    
     ... etc ...
    
    A binary compiled for the Hydra is attached. More threads could be executed if each one did not require so much stack space. In this program each thread prints a value on the screen whenever it is "pinged", and this - plus the thread overhead of 35 longs - means each thread needs 75 longs of stack. That's 24k of the Hub RAM required just for stack space!

    Another point to note is the use of the _thread_yield() function - if a thread discovers it has nothing to do at the moment it can simply yield control of the cog to another thread.

    Ross.
  • Heater.Heater. Posts: 21,230
    edited 2010-10-10 01:43
    Phenomenal. We will get Unix on the Propeller.

    I thought all philosophers were called "Bruce".

    http://www.ibras.dk/montypython/episode22.htm
  • RsadeikaRsadeika Posts: 3,837
    edited 2010-10-10 03:39
    This is starting to get very interesting, I just might start looking into what Catalina can do. I know that you have built in support for some of the hardware platforms like dracblade, and others, will you be considering the C3 platform? I am also waiting for the jazzed memory module to appear, but at this point I am favoring the C3, only because that it has ADC support on the same board.

    Ray
  • RossHRossH Posts: 5,519
    edited 2010-10-10 04:56
    Hi Ray,

    With a small amount of configuration, Catalina will work perfectly well on the C3 in LMM mode - including the latest multi-cog and multi-thread suppoprt. However, Catalina will not be able to use the SRAM on the C3 as XMM RAM (i.e. for executing programs). This is because the SRAM is addressed serially rather than in parallel, which means the necessary access code will not currently fit in the space available in the XMM kernel cog.

    I had hoped to add C3 support to Catalina, but I had to give up this idea because I simply don't have the time to do it at present. Adding multi-threading support required only about 40 additional instructions, which I could easily fit into the "alternate" kernel. I can occasionally manage a day or two here and there, but any significant changes to the XMM kernel (which would be required to support serial SRAM) will have to wait until I get some larger slabs of time.

    However, it seems that the C3 will have enough spare pins to support parallel SRAM access, so I'm sure someone will soon design a parallel SRAM add-on board - then it will be relatively simple to add full Catalina support.

    Ross.

    P.S. @Heater - the initial version of my "dining philosophers" program did indeed have all the philosophers named "Bruce" - but the University of Woolloomooloo threatened to revoke my honourary doctorate (in beer making) if I published it that way.
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-10-10 07:50
    Ross,

    VMCOG has a driver for C3 (thanks Dave!), so if you add VMCOG support to the Catalina LMM, it will run on the SPI ram's just fine :)

    Bill
    RossH wrote: »
    Hi Ray,

    With a small amount of configuration, Catalina will work perfectly well on the C3 in LMM mode - including the latest multi-cog and multi-thread suppoprt. However, Catalina will not be able to use the SRAM on the C3 as XMM RAM (i.e. for executing programs). This is because the SRAM is addressed serially rather than in parallel, which means the necessary access code will not currently fit in the space available in the XMM kernel cog.

    I had hoped to add C3 support to Catalina, but I had to give up this idea because I simply don't have the time to do it at present. Adding multi-threading support required only about 40 additional instructions, which I could easily fit into the "alternate" kernel. I can occasionally manage a day or two here and there, but any significant changes to the XMM kernel (which would be required to support serial SRAM) will have to wait until I get some larger slabs of time.

    However, it seems that the C3 will have enough spare pins to support parallel SRAM access, so I'm sure someone will soon design a parallel SRAM add-on board - then it will be relatively simple to add full Catalina support.

    Ross.

    P.S. @Heater - the initial version of my "dining philosophers" program did indeed have all the philosophers named "Bruce" - but the University of Woolloomooloo threatened to revoke my honourary doctorate (in beer making) if I published it that way.
  • jazzedjazzed Posts: 11,803
    edited 2010-10-10 09:54
    Rsadeika wrote: »
    I am also waiting for the jazzed memory module to appear, but at this point I am favoring the C3, only because that it has ADC support on the same board.
    @Rsadeika,

    The 32MB SDRAM boards have an I2C attached device that has a 10bit ADC. It can have up to 3 single ended channels (2 is much easier) with 5V range or 1 differential channel with 1.1V or 2.56V ranges. It is also possible to have a PS2 port (mouse or keyboard) and 1 single ended ADC channel.

    The ADC function does require code which I have not written yet. I'm still working on the keyboard code, but the mouse is done. A 2nd Propeller would have been far more flexible of course :) In any event, functionality is just a matter of software and what the hardware will support.

    Of course the C3 is almost ready to ship and I'm sure it will make many people happy. I have 7 functional SDRAM boards on my desk (in 3 flavors), but I'm not sure when boards will be generally available for sale.

    @RossH,

    It's cool that you're providing a threaded library. A subset of the common pthreads interface could be quickly implemented http://en.wikipedia.org/wiki/POSIX_Threads The problem is in supporting pthreads entirely :)

    Adding some kind of a virtual memory interface to Catalina would be useful - and it could potentially simplify the memory back-end. I've looked at adding SDRAM cache to Catalina (to take advantage of performance and size) and have received feedback on implementation and an overview of the XMM standard - I was looking for help getting it working :).

    Adding cache to Catalina it is not my highest priority right now, but at some point I'll get back to it. I'm sure I'll need help getting cache working ... VMCOG will be equally difficult for anyone but RossH to add IMHO.
  • Toby SeckshundToby Seckshund Posts: 2,027
    edited 2010-10-10 13:25
    I have been reading the Catalina thread, and its documentation, and It got me thinking. I have various versions of DracBlades here but I didn't continue to follow the TriBlade side of the Z80 emulations. I notice that Catalina states that both the Blade2 and Blade1 ccts are suported. Is the Blade1 with the KBD, VID or VGA included, or just as a memory driver and serial ouput?

    I only ask as it is a slighty more elegant (ie lower chip count) than the Dracs (which probably have all those extras for a good reason)
  • RossHRossH Posts: 5,519
    edited 2010-10-10 15:34
    @Bill,

    Yes, I agree VMCog support might be the right way to get Catalina on a lot of new boards - including Morpheus II and the C3 - but it's a matter of finding a slab of time large enough to do it. I still hope to get to it sometime, but I am slowly learning not to make commitments to things that I can't complete using just the odd few hours here or there.

    @jazzed,

    I'm not really after POSIX compatibility - a complete pthreads compatible library could easily take up most of the Hub RAM and leave no space for doing anything useful. My threads are intended to be "lean and mean" - and POSIX threads are anything but that! While I hope others will also find them useful, they are primarily intended for my own use in my own projects. However, I think they also provide another of the missing pieces that seem to be preventing the Propeller being used to its full potential.

    Also, the above comments to Bill apply to your board as well. I'm happy to help out, but at the moment I am having trouble finding blocks of contiguous time big enough to do complex work - so I'm tending to stick to things that I know I can complete within a few hours. I'd actually planned to do the threads stuff ages ago, but kept putting it off because I didn't really need it yet. It got done this week mainly because it was something I knew I could finish quickly (having laid most of the groundwork much earlier).

    @Toby

    Yes, Catalina fully supports the TriBlade blade #1 in all its variants - XMM RAM, VGA output, TV output, keyboard, mouse, serial I/O etc. Look in the file README.TriBladeProp in the Catalina\target directory for more details.

    The TriBlade is one of my favorite "all round" boards - but the DracBlade has an advantage in that it supports VGA at the same time as the XMM RAM. From memory, I think on the TriBlade you can only use the TV output when also using XMM RAM.

    Ross.
  • Toby SeckshundToby Seckshund Posts: 2,027
    edited 2010-10-10 15:49
    Thanks for that clarification. It was what I suspected.

    I love the way that the DracBlade separates out the pin clashes but of course at the expense of speed.

    I have dug up a Blade2 board that I made with a PropCMD on it, sort of Blade1 + 2. I will have to pull out the digit(s) and have a race between the two
  • jazzedjazzed Posts: 11,803
    edited 2010-10-10 16:06
    RossH wrote: »
    I'm not really after POSIX compatibility - a complete pthreads compatible library could easily take up most of the Hub RAM and leave no space for doing anything useful.

    I did not say that you should aim for complete pthread compatibility. A subset of it would be nice though for people who already understand it.
  • RossHRossH Posts: 5,519
    edited 2010-10-10 17:59
    Hi Jazzed,

    I could build some simple wrappers that would make the API look more 'pthread-like' - I'll see if I can find some time for this.

    Ross.
  • RossHRossH Posts: 5,519
    edited 2010-10-11 04:30
    @All,

    Jazzed asked if I could make the Catalina threads library more compatible with the POSIX pthreads library. This seemed like a reasonable request, but it would have been quite hard to do using just the basic threading features I had already implemented - so I have added a few more. The new features will make it substantially easier to emulate POSIX threads, as well as implementing inter-process communications mechanisms such as mailboxes or remote procedure calls.

    I already had the ability to start and stop threads, and I also already had thread locks (much like POSIX mutexes or Parallax locks). Now I have added the ability to pass parameters to and from threads, and also added a new _thread_join primitive which allows you to wait for a thread to complete and then fetch its return value.

    Catalina's mechanism for passing parameters to threads is different to the POSIX mechanism - POSIX threads accept and return a pointer value, whereas Catalina threads mimic the C command line processing mechanism - i.e. using argc and argv, and returning an int. However, this can be made to simulare POSIX if required since an int can also be used to pass and return an arbitrary pointer value (since they are the same size).

    Here is a program that illustrates all of the new features:
    /***************************************************************************\
     *                                                                           *
     *                          Thread Argument Demo                             *
     *                                                                           *
     *    Demonstrates passing and retrieving arguments to/from a thread         *
     *                                                                           *
     \***************************************************************************/
    
    /*
     * include Catalina multi-threading:
     */
    #include "catalina_threads.h"
    
    /*
     * include the definitions of some useful multi-thread utility functions:
     */
    #include "thread_utilities.h"
    
    /*
     * define how many threads we want:
     */
    #define THREAD_COUNT 5
    
    /*
     * define the stack size for each thread:
     */
    #define STACK_SIZE 100
    
    /*
     * define some global variables that all threads will share:
     */
    static void * ping;
    static int    lock;
    
    /*
     * function : this function can be executed as a thread.
     */
    int function(int argc, char *argv[]) {
       int i;
       void *me = _thread_id();
    
       // wait till we are pinged
       while (ping != me) {
          // nothing to do, so yield
          _thread_yield();
       }
       // print our id and all our arguments
       thread_printf(lock, "thread %d args = ", (unsigned)me);
       thread_printf(lock, "%d, ", argc);
       for (i = 0; i < argc; i++) {
          thread_printf(lock, "%s ", argv[i]);
       }
       thread_printf(lock, "\n");
       ping = (void *)0;
    
       // exit and return a status
       return -argc;
    }
    
    /*
     * main : start THREAD_COUNT threads, ping each one, then fetch each ones result
     */
    void main() {
    
       int i = 0;
       char * s[THREAD_COUNT] = {"hello", "there", "this", "is", "catalina"};
       unsigned long stacks[STACK_SIZE * THREAD_COUNT];
       void * thread[THREAD_COUNT];
       int result;  
    
       // assign a lock to be used to avoid context switch contention
       _thread_set_lock(_locknew());
    
       // assign a lock to be used to avoid plugin contention
       lock = _locknew();
    
       thread_printf(lock, "Press a key to start\n");
       k_wait();
    
       // start instances of function until we have started THREAD_COUNT of them
       for (i = 0; i < THREAD_COUNT; i++) {
          thread[i] = _thread_start( &function, &stacks[STACK_SIZE*(i+1)], i+1, s);
          thread_printf(lock, "thread %d ", i);
          if (thread[i] == (void *)0) {
             thread_printf(lock, " failed to start\n");
             while (1) { };
          }
          else {
             thread_printf(lock, " started, id = %d\n", (unsigned)thread[i]);
          }
       }
    
       // now ping each thread in turn
       thread_printf(lock, "\n\nPress a key to ping all threads\n");
       k_wait();
       for (i = 0; i < THREAD_COUNT; i++) {
          // ping the thread
          ping = thread[i];
          // wait till thread responds
          while (ping) {
             // nothing to do, so yield
             _thread_yield();
          };
       }
    
       // now join each thread to fetch the result
       thread_printf(lock, "\n\nPress a key to join all threads\n");
       k_wait();
       for (i = 0; i < THREAD_COUNT; i++) {
          if (_thread_join(thread[i], &result) != thread[i]) {
             thread_printf(lock, "failed to join thread %d\n", thread[i]);
          };
          thread_printf(lock, "thread %d result = %d\n", i, result);
       }
    
       // now wait forever
       while (1) { }
    }
    
    When run, this program displays the following:
    Press a key to start
    thread 0  started, id = 29292
    thread 1  started, id = 29692
    thread 2  started, id = 30092
    thread 3  started, id = 30492
    thread 4  started, id = 30892
    
    
    Press a key to ping all threads
    thread 29292 args = 1, hello
    thread 29692 args = 2, hello there
    thread 30092 args = 3, hello there this
    thread 30492 args = 4, hello there this is
    thread 30892 args = 5, hello there this is catalina
    
    
    Press a key to join all threads
    thread 0 result = -1
    thread 1 result = -2
    thread 2 result = -3
    thread 3 result = -4
    thread 4 result = -5
    
    As usual, a binary for the Hydra is attached.

    Ross.
  • jazzedjazzed Posts: 11,803
    edited 2010-10-11 09:33
    Good progress Ross. Very nice example too. I've made an equivalence table below. Some things are clear, others are either not or seem to be missing.

    Do you need something equivalent to pthread_spin_lock etc... for SMP muteces ? IIRC, a standard mutex considers just one core where a spin lock can be used with multiple cores.

    Thanks.
    PThread Function                      Catalina Equivalent
    
    pthread_cancel
    pthread_create                        task_start
    pthread_detach
    pthread_exit                          return value;
    pthread_join                          thread_join
    pthread_self                          thread_id
    pthread_mutex_destroy
    pthread_mutex_init                    _locknew
    pthread_mutex_lock                    _thread_set_lock
    pthread_mutex_trylock
    pthread_mutex_unlock
    
  • RossHRossH Posts: 5,519
    edited 2010-10-11 14:31
    Hi Jazzed,

    Thanks. I'm not really trying to emulate pthreads, I'm just trying to make it easier for someone (possibly me!) to do so later.

    Some more details ...

    There is a _thread_stop which would be equivalent to pthread_cancel.

    There is a _thread_check which can be used to check the state of a thread (i.e. executing or not) - I don't know the POSIX equivalent to this.

    There is no equivalent to pthread_detach since all threads share all resources (including the address space) so it is not really necessary.

    The thread scheduling is simple round-robin time slicing, but there is a _thread_set_ticks that can be used to adjust the relative priority of each thread.

    The equivalent to pthread_mutex_init is actually _thread_init_lock_pool, (not _locknew). A pool is an arbitrary sized set of locks that can then be used individually - but you can have as many pools as you like, of any size.

    I modelled the individual thread lock operations (_thread_locknew, _thread_lockret, _thread_lockset and _thread_lockclr) on the parallax hub lock operations since I thought they would be more familiar to Propeller users - but with these primitives (and _thread_yield and _thread_wait) you can simulate any of the POSIX mutex or condition operations.

    I'm still working on integrating the new multi-thread capability with the existing multi-cog capability - but this is mostly cosmetic work intended to simplify things for the user. I can already launch multi-threading kernels on multiple cogs, and the purpose of the _thread_set_lock and _thread_get_lock primitives is to synchronize different kernels operating on different cogs, so that the thread operations work correctly between cogs.

    This will probably be all for the present. These primitives already take around 200 longs to implement, so I'd be very hesitant to add any more.

    I'll publish some documentation when I get a chance.

    Ross.
  • RossHRossH Posts: 5,519
    edited 2010-10-13 02:19
    After a bit more messing about, I have decided that the overhead of being able to manage multiple threads on multiple cogs using the same operations is too high to be suitable for normal use. I have therefore decided that the thread operations will be split into two types - those that will work only for threads being executed on the current cog, and those that will work on threads being executed on any cog.

    The operations already specified (e.g. _thread_stop, _thread_join) will operate only on threads being executed by the current cog - this is quite efficient, and is sufficient for most purposes.

    A new set of "affinity" operations will be added that can reach across cogs to operate on threads being executed on any cog (in addition to those being executed on the current cog) - these operations are much less efficient, and should only be used in cases where you know the thread you want is not executing "locally". This way, if you don't need them, you won't waste the extra RAM on the affinity capabilities.

    The basic affinity operations (for "affinity" you can mostly read "cog") will be as follows:
    • int _thread_affinity(void *thread_id)
    Return the affinity of the specified thread. This will work even if the specified thread has a different affinity than the calling thread. It can be used to determine both the current affinity, and also the current state of any outstanting affinity requests on the specified thread.
    • int _thread_affinity_stop(void *thread_id)
    Stop the specified thread, which may have a different affinity from the calling thread. Returns an error if an affinity command is already outstanding for the specified thread.
    • int _thread_affinity_join(void *thread_id, int *result)
    Join the specified thread, which may have a different affinity from the calling thread. Return the result. Returns an error if the thread has already been terminated, or an affinity command is already outstanding for the specified thread.
    • void _thread_affinity_change(void *thread_id, int new_affinity)
    Request a change of affinity for the specified thread. Check the affinity of the thread later to see if the change has taken effect.
    The last function is really cool - you can "move" a thread from one cog to another (e.g. to balance the load if one cog has too many threads executing) and the thread being "moved" never even knows about it!

    Ross.
  • RossHRossH Posts: 5,519
    edited 2010-10-18 06:32
    Just a progress update ...

    Catalina now has its multi-cog capabilities fully integrated with its multi-thread capabilities.

    The following demo program illustrates starting 4 additional multi-threading kernels (each one on its own cog) and then starting 10 additional C threads designed to wander around between all the kernel cogs at random, being seemlessly executed for a random amount of time by each of the different kernel cogs in turn.

    I'm not exactly sure what the use of this capability is yet - but it's fairly simple to do once you have the basic multi-threading, so why not?
    /***************************************************************************\
     *                                                                           *
     *                          Thread Affinity Demo                             *
     *                                                                           *
     *            Demonstrates changing the affinity of a thread                 *
     *                                                                           *
     *        (i.e. passing threads between multi-threaded kernels)              *
     *                                                                           *
     \***************************************************************************/
    
    /*
     * include Catalina multi-threading:
     */
    #include "catalina_threads.h"
    
    /*
     * include some useful multi-thread utility functions:
     */
    #include "thread_utilities.h"
    
    /*
     * define how many additional kernel cogs we want:
     */
    #define NUM_KERNELS 4
    
    /*
     * define how many additional threads we want:
     */
    #define NUM_THREADS 10
    
    /*
     * define the stack size for each kernel cog and thread:
     */
    #define STACK_SIZE (MIN_THREAD_STACK_SIZE + 100)
    
    /*
     * define some global variables that all multi-threaded cogs will share:
     */
    static int started;
    static int hmi_lock;
    static int cog_lock;
    static int kernel[8];
    
    /*
     * cog_function : this function will be run as the first thread
     *                of a new multi-threading kernel on a new cog
     */
    void cog_function() {
       int cog;
       void * me;
    
       cog = _cogid();
       me = _thread_id();
    
       // set the lock of this kernel (all kernels must use the same lock)
       _thread_set_lock(cog_lock);
    
       // indicate this cog is running a multi-threading kernel
       // and is available to execute other threads if required
       thread_printf(hmi_lock, "Kernel ready on cog %d\n", cog);
       kernel[cog] = 1;
    
       // now wait forever - this thread does not actually do anything
       // except give the multi-threading kernel something to execute
       // when it is not executing any other threads. It could perform
       // other tasks if required.
       while (1) { 
          _thread_yield();
       }
    }
    
    /*
     * thread_function : this function will run as a new thread. It runs on
     *                   the cog it is started for a while, then moves itself
     *                   to the next available cog (cogs running multi-threading
     *                   kernels are indicated by the kernel array).
     */
    int thread_function(int argc, char *argv[]) {
       int i;
       void * me;
    
       me = _thread_id();
    
       i = 0;
       while (1) {
          // print where we are currently running
          thread_printf(hmi_lock, "Thread %d now on cog = %d\n", me, _cogid());
    
          while (!started) { };
    
          // wait a random time
          thread_wait(250*random(5));
    
          // find the next available multi-threading kernel
          do {
             i = (i + 1) % 8;
          } while (kernel[i] == 0);
    
          // move ourselves to that kernel
          _thread_affinity_change (me, i);
       }
       return 0;
    }
    
    /*
     * main : start NUM_KERNELS additional kernels, and then start
     *        NUM_THREADS threads that will switch between them
     */
    void main() {
    
       int i;
       int cog;
       unsigned long cog_stack[STACK_SIZE * NUM_KERNELS];
       unsigned long thread_stack[STACK_SIZE * NUM_THREADS];
       void * thread;
       
       // assign a lock to be used to avoid cog kernel contention
       cog_lock = _locknew();
    
       // set the lock of this kernel (all kernels must use the same lock)
       _thread_set_lock(cog_lock);
    
       // assign a lock to be used to avoid hmi plugin contention
       hmi_lock = _locknew();
    
       thread_printf(hmi_lock, "Press a key to start kernels\n");
       k_wait();
       randomize();
    
       // start additional cogs (each one running a multi-threading kernel)
       for (i = 0; i < NUM_KERNELS; i++) {
          cog = thread_coginit(&cog_function, 
                &cog_stack[STACK_SIZE * (i + 1)]);
          if (cog < 0) {
             // coginit failed 
             thread_printf(hmi_lock, "Failed to start kernel\n");
             while (1) { };
          }
       }
    
       // we can declare ourselves available to execute threads as well
       cog = _cogid();
       kernel[cog] = 1;
       thread_printf(hmi_lock, "Kernel ready on cog %d\n", cog);
    
       thread_wait(100);
       thread_printf(hmi_lock, "\nPress a key to start threads\n");
       k_wait();
    
       // start additional threads that will wander between the kernels
       for (i = 0; i < NUM_THREADS; i++) {
          thread = _thread_start(&thread_function, 
             &thread_stack[STACK_SIZE * (i + 1)], 0, NULL);
          if (thread == 0) {
             thread_printf(hmi_lock, "Failed to start thread\n");
             while (1) { };
          }
       }
    
       thread_wait(100);
       thread_printf(hmi_lock, "\nPress a key to start kernel switching\n");
       k_wait();
       started = 1;
    
       // now wait forever - we could do other tasks if required
       while (1) {
         _thread_yield();
       }
    }
    
    Attached is the usual demo, compiled for the HYDRA. It was compiled using the command:
    catalina -lthreads -lci -D THREADED thread_affinity.c -D NO_MOUSE -D NO_FP -O3
    
    When run, the program produces the following output:
    Press a key to start kernels
    Kernel ready on cog 4
    Kernel ready on cog 5
    Kernel ready on cog 6
    Kernel ready on cog 7
    Kernel ready on cog 0
    
    Press a key to start threads
    Thread 26560 now on cog = 0
    Thread 26012 now on cog = 0
    Thread 25464 now on cog = 0
    Thread 24916 now on cog = 0
    Thread 24368 now on cog = 0
    Thread 23820 now on cog = 0
    Thread 28752 now on cog = 0
    Thread 28204 now on cog = 0
    Thread 27656 now on cog = 0
    Thread 27108 now on cog = 0
    
    Press a key to start kernel switching
    Thread 26560 now on cog = 4
    Thread 26560 now on cog = 5
    Thread 27656 now on cog = 4
    Thread 27656 now on cog = 5
    Thread 28204 now on cog = 4
    Thread 26012 now on cog = 0
    Thread 24368 now on cog = 4
    Thread 27108 now on cog = 4
    Thread 23820 now on cog = 0
    Thread 25464 now on cog = 4
    Thread 27656 now on cog = 6
    Thread 24916 now on cog = 4
    Thread 28752 now on cog = 4
    Thread 26012 now on cog = 5
    Thread 25464 now on cog = 5
    Thread 26560 now on cog = 6
    Thread 27656 now on cog = 7
    Thread 26012 now on cog = 6
    Thread 28752 now on cog = 5
    Thread 28204 now on cog = 5
    Thread 23820 now on cog = 5
    Thread 24368 now on cog = 5
    
    << etc >>
    
    I hope to get some time to build a new Catalina release soon.

    Ross.
  • RossHRossH Posts: 5,519
    edited 2010-11-05 22:33
    Another update...

    Below is a program that shows off most of the multicog/multithread capabilities that will be available in the next release of Catalina.

    This interactive program starts 3 multithreaded kernels dynamically, running a program in each which starts 4 additional threads. The main program then lets you interact with any of the 12 concurrent threads - you can ask each thread to exit, or just respond (by printing something), or join the thread (i.e. wait for it to complete and then print its return value), or stop the thread (i.e. terminate it with extreme prejudice), or ask the thread to move itself to another kernel.

    Any failures in the operations (such as trying to "join" a thread that you have previously stopped) are reported.

    For those interested in such things, when using the new 2-phase loader, the program itself occupies only 9.5kb of Hub RAM. The actual code size is about 6.5kb - including all the multi-threading support library code required. The remainder of the 9.5kb is mostly the 2kb required to hold a copy of the kernel itself (required because this program loads additional kernels at run time).

    Who says the Prop I is unsuitable for C? I'd surely like to see an equivalent program coded in SPIN or PASM!

    As usual, a binary for the Hydra is attached (this is an ordinary binary - i.e. it does not use the 2-phase loader).
    /***************************************************************************\
     *                                                                           *
     *                         Test Thread Operations                            *
     *                                                                           *
     *          Demonstrates thread stop, join and affinity change (move)        *
     *                                                                           *
     *       (i.e. stopping, joining and moving threads on different cogs)       *
     *                                                                           *
     \***************************************************************************/
    
    /*
     * include Catalina multi-threading:
     */
    #include <catalina_threads.h>
    
    /*
     * include some useful multi-threading utility functions:
     */
    #include <thread_utilities.h>
    
    /*
     * define how many additional kernel cogs we want:
     */
    #define NUM_KERNELS 3
    
    /*
     * define how many additional threads we want each kernel to run:
     */
    #define NUM_THREADS 4
    
    /*
     * define the stack size for each cog and thread function:
     */
    #define STACK_SIZE (MIN_THREAD_STACK_SIZE + 100)
    
    /*
     * define the number of times to retry various things:
     */
    #define MAX_RETRIES 100000
    
    /*
     * define an array to hold the id of each additional thread:
     */
    static void *thread[NUM_KERNELS * NUM_THREADS];
    
    /*
     * define some global variables that all threads can share:
     */
    static int request;
    static int hmi_lock;
    static int kernel_lock;
    static int kernel[8];
    
    /*
     * thread_function : this function will run as a new thread. 
     *                   It responds to ping, exit, join or move
     *                   requests.
     */
    int thread_function(int me, char *not_used[]) {
    
       int cog;
    
       while (1) {
          if (request == me) {
             // ping request - just print our id
             _thread_printf(hmi_lock, " %d pinged", me);
             request = 0;
          }
          else if (request == -me) {
             // exit request - exit immediately
             _thread_printf(hmi_lock, " %d exiting", me);
             request = 0;
             return -me;
          }
          else if (request == 1000 * me) {
             // join request - wait 5 seconds, then exit
             // (so that another thread can join this one)
             _thread_printf(hmi_lock, " %d waiting", me);
             _thread_wait(2000);
             request = 0;
             return 1000 * me;
          }
          else if (request == -1000 * me) {
             // move request - move to the next cog
             // executing a multi-threaded kernel
             cog = (_cogid() + 1) % 8;
             while (!kernel[cog]) {
                cog = (cog + 1) % 8;
             }
             _thread_affinity_change(_thread_id(), cog);
             _thread_printf(hmi_lock, "%d now on cog %d", me, _cogid());
             request = 0;
          }
          else {
             // nothing to do, so yield
             _thread_yield();
          }
       }
       return 0;
    }
    
    /*
     * cog_function : this function will be run as the first thread
     *                executed by a multi-threading kernel on a new 
     *                cog. It starts additional threads running in 
     *                the same kernel, but otherwise does nothing. 
     */
    int cog_function(int first, char *not_used[]) {
    
       int cog;
       int i;
    
       // stack space for the additional threads we will start
       unsigned long thread_stack[STACK_SIZE * NUM_THREADS];
    
       // get our cog id 
       cog = _cogid();
    
       // set the lock of this kernel (all kernels must use the same lock)
       _thread_set_lock(kernel_lock);
    
       // indicate this cog is running a multi-threading kernel
       // and the numbers of the threads this kernel is executing
       kernel[cog] = first;
       _thread_printf(hmi_lock, "Executing threads %d to %d on cog %d\n", 
             first, first + NUM_THREADS - 1, cog);
    
       // now start additional threads in this kernel
       for (i = 0; i < NUM_THREADS; i++) {
          thread[first + i] = _thread_start(&thread_function, 
                &thread_stack[STACK_SIZE * (i + 1)], first + i, NULL);
          if (thread[first + i] == 0) {
             _thread_printf(hmi_lock, "Failed to start thread\n");
             while (1) { };
          }
       }
    
       // now wait forever - this thread does not actually do anything
       // except start the threads it has been asked to run. It could 
       // perform other tasks if required. However, it cannot exit or
       // the thread stack space we have allocated will be lost.
       while (1) { 
          _thread_yield();
       }
    
       return 0;
    }
    
    /*
     * main : start NUM_KERNELS additional kernels, then allow the
     *        user to perform various commands on the threads running 
     *        in those kernels.
     */
    int main(int argc, char *argv[]) {
    
       int i;
       int cog;
       char cmd;
       int result;
       int retries;
    
       // stack space for the additional kernels we will start 
       // (note that this must also include stack space for their threads)
       unsigned long cog_stack[STACK_SIZE * (NUM_THREADS + 1) * NUM_KERNELS];
    
       
       // assign a lock to be used to avoid kernel contention 
       kernel_lock = _locknew();
    
       // set the lock of this kernel (all kernels must use the same lock)
       _thread_set_lock(kernel_lock);
    
       // assign a lock to be used to avoid hmi plugin contention
       hmi_lock = _locknew();
    
       _thread_printf(hmi_lock, "Press a key to start the kernels\n\n");
       k_wait();
    
       // start the additional multi-threaded kernels
       for (i = 0; i < NUM_KERNELS; i++) {
          cog = _thread_cog(&cog_function, 
                &cog_stack[STACK_SIZE * (NUM_THREADS + 1) * (i + 1)], 
                i * NUM_THREADS + 1, NULL);
          if (cog < 0) {
             // coginit failed 
             _thread_printf(hmi_lock, "Failed to start kernel\n");
             while (1) { };
          }
       }
    
       // now prompt to exit, ping, stop, join or move each thread in turn
       while (1) {
    
          for (i = 1; i < NUM_KERNELS * NUM_THREADS + 1; i++) {
    
             // print the number and affinity of the thread
             _thread_printf(hmi_lock, "\nThread %d (af=%x) ", i, 
                   _thread_affinity(thread[i]));
             if ((_thread_affinity(thread[i]) & 0x0000C000) == 0) {
                _thread_printf(hmi_lock, "executing\n", i);
             }
             else {
                _thread_printf(hmi_lock, "stopped\n", i);
             }
    
             // get a command to execute on the the thread
             _thread_printf(hmi_lock, "Type E=exit P=ping S=stop J=join M=move\n");
             cmd = k_wait();
             _thread_printf(hmi_lock, "Cmd = %c\n\n", cmd);
    
             // now process the command 
             switch (toupper(cmd)) {
    
                case 'P' :
                   // ping the thread
                   _thread_printf(hmi_lock, " pinging %d:", i);
                   request = i;
                   // wait till thread responds or we have tried MAX_RETRIES times
                   retries = 0;
                   while (request && (retries++ < MAX_RETRIES)) {
                      // nothing to do, so yield
                      _thread_yield();
                   };
                   if (retries >= MAX_RETRIES) {
                      _thread_printf(hmi_lock, " %d failed to respond", i);
                   }
                   request = 0;
                   break;
    
                case 'E' :
                   // ask the thread to exit (gracefully)
                   _thread_printf(hmi_lock, " exiting %d:", i);
                   request = -i;
                   // wait till thread terminated
                   retries = 0;
                   while ((_thread_affinity(thread[i]) & 0x0000C000) == 0 
                   &&     (retries++ < MAX_RETRIES)) {
                      // nothing to do, so yield
                      _thread_yield();
                   };
                   if (retries >= MAX_RETRIES) {
                      _thread_printf(hmi_lock, " %d failed to respond", i);
                   }
                   else {
                      _thread_printf(hmi_lock, ": %d exited", i);
                   }
                   request = 0;
                   break;
    
                case 'S' :
                   // stop the thread 
                   _thread_printf(hmi_lock, " stopping %d:", i);
                   if (_thread_affinity_stop(thread[i]) == 0) {
                      _thread_printf(hmi_lock, " %d cannot be stopped", i);
                   }
                   else {
                      _thread_printf(hmi_lock, " %d stopping", i);
                      // wait till thread terminated 
                      retries = 0;
                      while ((_thread_affinity(thread[i]) & 0x0000C000) == 0 
                      &&     (retries++ < MAX_RETRIES)) {
                         // nothing to do, so yield
                         _thread_yield();
                      };
                      if (retries >= MAX_RETRIES) {
                         _thread_printf(hmi_lock, " %d failed to respond", i);
                      }
                      else {
                         _thread_printf(hmi_lock, ": %d stopped", i);
                      }
                   }
                   break;
    
                case 'J' :
                   // join the thread 
                   _thread_printf(hmi_lock, " joining %d:", i);
                   request = 1000 * i;
                   if (_thread_join(thread[i], &result) == 0) {
                      _thread_printf(hmi_lock, " %d cannot be joined", i);
                   }
                   else {
                      _thread_printf(hmi_lock, ": result %d", result);
                   }
                   break;
    
                case 'M' :
                   // ask the thread to move
                   _thread_printf(hmi_lock, " moving %d:", i);
                   request = - 1000 * i;
                   // wait till thread responds
                   retries = 0;
                   while (request && (retries++ < MAX_RETRIES)) {
                      // nothing to do, so yield
                      _thread_yield();
                   };
                   if (retries >= MAX_RETRIES) {
                      _thread_printf(hmi_lock, " %d failed to respond", i);
                   }
                   request = 0;
                   break;
    
                default :
                   _thread_printf(hmi_lock, " no command");
                   break;
    
             }   
             _thread_printf(hmi_lock, "\n", i);
          }
       }
       return 0;
    }
    
    Here is some typical output when the program is run:
    Press a key to start the kernels
    
    Executing threads 1 to 4 on cog 4
    Executing threads 5 to 8 on cog 5
    Executing threads 9 to 12 on cog 6
    
    Thread 1 (af=00640402) executing
    Type E=exit P=ping S=stop J=join M=move
    Cmd = p
    
     pinging 1: 1 pinged
    
    Thread 2 (af=00640402) executing
    Type E=exit P=ping S=stop J=join M=move
    Cmd = e
    
     exiting 2: 2 exiting: 2 exited
    
    Thread 3 (af=00640402) executing
    Type E=exit P=ping S=stop J=join M=move
    Cmd = j
    
     joining 3: 3 waiting: result 3000
    
    Thread 4 (af=00640402) executing
    Type E=exit P=ping S=stop J=join M=move
    Cmd = s
    
     stopping 4: 4 stopping: 4 stopped
    
    Thread 5 (af=00640502) executing
    Type E=exit P=ping S=stop J=join M=move
    Cmd = m
    
     moving 5:5 now on cog 6
    
    Thread 6 (af=00640502) executing
    Type E=exit P=ping S=stop J=join M=move
    
    
    << etc >>
    
  • jazzedjazzed Posts: 11,803
    edited 2010-11-05 22:58
    Nice example Ross.
  • David BetzDavid Betz Posts: 14,516
    edited 2010-11-06 05:49
    RossH wrote: »
    As usual, a binary for the Hydra is attached (this is an ordinary binary - i.e. it does not use the 2-phase loader).
    Does this require anything other than the Hydra itself? No external RAM or SD card?
  • RossHRossH Posts: 5,519
    edited 2010-11-06 16:36
    Hi David,

    Should work on any Hydra - no external RAM or SD card required.

    It uses the Hydra TV output and keyboard.

    Ross.
Sign In or Register to comment.