Parallel Processing with C

RossH · 2020-08-29 01:53

All,

Here is something I have been working on, and which I now have in a fit state for initial release - i.e. a way to make parallel processing on the Propeller almost completely trivial, using C.

Obviously, this method currently only works with Catalina (use Catalina 4.4) but if any of the other C compiler builders here are interested, we could have a common method (and tool) for implementing multiprocessing across all C compilers. And, potentially, other languages as well.

This method works equally well with Catalina on the Propeller 1 and the Propeller 2.

See the document in the attached zip file for full details and a tutorial.

Feedback welcome!

Ross.

EDIT: Attached file updated to version 1.6 - new functionality has been added, which is described below.

   - Requires Catalina 4.5, which has just been released to fix some issues
     found when using locks in a multi-threaded environment.

   - Source code is now included. The parallelizer is an AWKA script, and must
     be compiled with awka 0.7.5. It could be modified for other awk scripting
     implementations if required. The script is MIT licensed.

   - More error checking and more warning messages.

   - The factory related pragmas are now all optional. Factories will be 
     defined and started automatically if there are no explicit factory or 
     start pragmas specified. Now, only one worker pragma, and one begin and 
     end pragma, may be required to parallelize a serial program.
 
   - New conditional options for worker and wait pragmas. In worker pragmas,
     conditionals can be used to control the output of variables, which will
     only take place if the condition is true when the worker completes. In 
     wait pragmas they allow the wait to be terminated early if a specified 
     condition is true. In both cases the condition can be any arbitrary 
     boolean expression, but is normally something that is signaled by a 
     worker thread to indicate that the main thread should continue, even if 
     there are still worker threads yet to complete.

   - New foreman option on the factory pragma, which allows you to use your 
     own foreman function. This may be useful if your program is already 
     multi-threaded and you already have a thread function that can be used 
     as a foreman function.

   - New memory management option, which can be specified on any pragma. 
     Memory management for factories and workers can now be specified as 
     either static or dynamic. No matter where it is specified, the option
     currently applies to ALL workers and factories. Static is faster, and 
     has smaller code sizes, but all memory is pre-allocated at compile time,
     which may be wasteful (e.g. if you start and stop different factories at
     different times, all memory for all of them will still be pre-allocated
     when static memory management is used. In such cases, dynamic memory 
     management may be preferable - but it may be slower and also require
     more code space, because the malloc() and free() library functions must
     be included).

   - New and modified test programs to demonstrate the new options.

RossH · 2020-08-29 02:32

All,

I just fixed a stupid typo in the document and updated the zip file - on page 14 the original said:

Now execute it as a parallel program, using a command like:
catalina_serial P2_EVAL sieve

This should have read (and now does):

Now execute it as a parallel program, using a command like:
catalina_parallel P2_EVAL sieve

It doesn't matter how many times you proofread ..

RossH · 2020-08-30 03:57

All,

I discovered a potential problem when "parallelizing" a program consisting of multiple C source files, which I have fixed by defining some new pragmas. The zip file at the top of this thread has been updated to version 1.1.

One benefit of this is that there should now be no cases where you have to modify the original C source code just to parallelize an existing C program. Everything can now be done just by adding a few pragmas.

Ross.

cgracey · 2020-08-30 04:38

When you say parallelize, you mean separate programs running, right, and not some kind of parallel inference from what is a single-thread program?

RossH · 2020-08-30 05:04

cgracey wrote: »

When you say parallelize, you mean separate programs running, right, and not some kind of parallel inference from what is a single-thread program?

No, I mean truly parallel. Here is a trivial example. Consider the following program:

void main() {
    int i;
    printf("Hello, world!\n\n");
    for (i = 1; i <= 10; i++) {
       printf("a simple test ");
    }
}

Here is what you would normally expect to see when you run it:

Hello, world!

a simple test a simple test a simple test a simple test a simple test a simple test a simple test a simple test a simple test a simple test

We can "parallelize" that so that all the iterations of the "for" loop execute in parallel on multiple cogs by adding a few pragmas, as follows:


#pragma propeller worker(void)

void main() {
    int i;
    printf("Hello, world!\n\n");

    #pragma propeller start
    for (i = 1; i <= 10; i++) {

       #pragma propeller begin
       printf("a simple test ");
       #pragma propeller end

    }
}

Now, here is what you see when you run it:

Hello, world! 

aaa aa  s  ssissiimiimmpmmpplppllellee ee  t  ttetteeseesstsstt tt  a  aa aa  s  ssissiimiimmpmmpplppllellee ee  t  ttetteeseesstsstt tt

When you run the the "parallelized" version, the 10 "printf" statements are actually executed on 5 different cogs - 2 run on each cog.

This is actually the first demo in the zip file. Note that by using pragmas, the second version of program can still be compiled and run using any C compiler - it just won't run in parallel, so you will see the first output, not the second.

cgracey · 2020-08-30 05:41

That's interesting. I guess this means you just couldn't have any inter-dependencies within the code being iterated, since there may be multiple cogs running the same code concurrently, but at different time offsets.

RossH · 2020-08-30 05:48

cgracey wrote: »

That's interesting. I guess this means you just couldn't have any inter-dependencies within the code being iterated.

In general, that's true - but certain types of inter-dependency are ok - you can add "exclusive" and "shared" pragmas to the parallel code segments, which identify critical code segments.

cgracey · 2020-08-30 05:58

I see. This is kind of like Windows threads that I'm trying to get a handle on tonight.

RossH · 2020-08-30 06:22

cgracey wrote: »

I see. This is kind of like Windows threads that I'm trying to get a handle on tonight.

Yes. Or Posix threads.

Every instance of parallel execution is assigned to a thread. You can specify how many threads can execute at a time (available memory is the only limiting factor, not the number of available cogs - if you try to start more than the maximum number of threads, the program simply stalls until a free thread becomes available).

And you can specify how many cogs will be used to execute the threads. The threads will be distributed amongst all those cogs. If you have multiple parallel segments defined, then the same set of cogs can be used to execute all the threads, or you can dedicate specific cogs to run specific types of thread.

RossH · 2020-08-30 07:19

Just to clarify the "threads" thing, in case anyone thinks that using threads is somehow not "true" parallelism (i.e. that it is just another name for "coroutines"), or that it only applies to "for" loops - there is an example in the zip file (test_8.c) of a program that just uses the pragmas to execute one program segment on one cog, while another program segment executes on a second cog, while the main function continues to execute on the original cog.

This is as "parallel" as you can get.

RossH · 2020-08-31 04:44

All,

I just updated the zip file at the top of this thread to version 1.2.

A few minor changes (mainly the addition of a ticks option to the worker pragma, which allows you to adjust the time slice allocated to a worker type). Plus I have now added Linux support.

This completes the parallelizer functionality, since you can now do pretty much everything that you used to do using the thread library, just by using a few pragmas instead. Of course, you can still use the thread library directly, even if you also plan to use the parallelizer.

The Linux executable was compiled on Ubuntu 18.04, 64 bit. If the executable doesn't work on your flavor of Linux, then you will have to wait for the next Catalina release, which will include the parallelizer source code.

Ross.

Reinhard · 2020-08-31 07:28

This is really very interesting!
Is it correct that the parallelizer can in principle work with any C compiler?
I will install Catalina soon.
First I take a look under the hook.

reinhard@reinhard-TUXEDO:~/Downloads/catalina_pragmas_1.2$ ./parallelize test_1.c


Catalina Parallelizer 4.4
/*
 * This program defines one worker, which prints a simple test message. 
 * We then execute 10 of those in a "for" loop. When this program is 
 * "parallelized", we will execute all 10 instances of the worker 
 * in parallel. 
 *
 * Note what this does to the output when we execute this program as a
 * parallel program, compared to when we execute it as a serial program!
 */

#ifndef ANY_COG
#ifdef __CATALINA_P2
#define ANY_COG 16
#else
#define ANY_COG 8
#endif
#endif

#ifndef __PARALLELIZED
#define __PARALLELIZED
#endif

#include <catalina_threads.h>

// structure to hold worker data
struct _worker_struct {
   long    *stack;
   void    *worker;
   struct _worker_struct *next;
};

typedef struct _worker_struct WORKER;

// structure to hold factory data
struct _factory_struct {
   int     cogs[ANY_COG];
   int     cog_count;
   int     last_used;
   long   *stack[ANY_COG];
   WORKER *workers;
};

typedef struct _factory_struct FACTORY;

static int foreman(int argc, char *not_used[]);

static FACTORY *_create_factory(int num_cogs, int stack_size, int lock) {
   FACTORY *factory = NULL;
   long *s = NULL;
   int i;

   factory = (void *)malloc(sizeof(FACTORY));
   if (factory != NULL) {
      for (i = 0; i < ANY_COG; i++) {
         factory->cogs[i] = -1;
         factory->stack[i] = NULL;
      }
      factory->last_used = 0;
      factory->cog_count = 0;
      factory->workers   = NULL;
      // start the multi-threading kernels
      for (i = 0; i < num_cogs; i++) {
         s = (void *)malloc(stack_size*4);
         if (s != NULL) {
            factory->stack[i] = s;
            factory->cogs[i] = _thread_cog(&foreman, s + stack_size, 0, NULL);
            if (factory->cogs[i] >= 0) {
               factory->cog_count++;
               factory->last_used = i;
            }
         }
      }
   }
   return factory;
}

static void *_create_worker(FACTORY *factory, _thread worker, int stack_size, int ticks, int argc, char *argv[]) {
   void   *w = NULL;
   long   *s = NULL;
   WORKER *ws = NULL;

   if (factory != NULL) {
      if (factory->cog_count > 0) {
         ws = (void *)malloc(sizeof(WORKER));
         s = (void *)malloc(stack_size*4);
         if ((ws != NULL) && (s != NULL)) {
            ws->stack = s;
            ws->next = factory->workers;
            factory->workers = ws;
            while (factory->cogs[factory->last_used] < 0) {
               factory->last_used = (factory->last_used + 1)%ANY_COG;
            }
            w = _thread_start(worker, s + stack_size, argc, argv);
            ws->worker = w;
            if (w != NULL) {
               if (ticks > 0) {
                  _thread_ticks(w, ticks);
               }
               _thread_affinity_change(w, factory->cogs[factory->last_used]);
            }
            factory->last_used = (factory->last_used + 1)%ANY_COG;
            // give the affinity change an opportunity to be processed ...
            _thread_yield();
         }
      }
   }
   return w;
}

static int _exclusive_lock = -1;

static void _exclusive(void) {
   if (_exclusive_lock < 0) {
      _exclusive_lock = _thread_get_lock();
   }
   if (_exclusive_lock >= 0) {
      do { } while (!_lockset(_exclusive_lock));
   }
}

static void _shared(void) {
   if (_exclusive_lock >= 0) {
      _lockclr(_exclusive_lock);
   }
}

static FACTORY *__factory_factory;

#ifndef _CUSTOM_LOCK_DEFINED
#ifndef _KERNEL_LOCK_DEFINED
#define _KERNEL_LOCK_DEFINED
static int _kernel_lock = -1;
#endif
#endif

#ifndef _FOREMAN_DEFINED
#define _FOREMAN_DEFINED
// the foreman runs when no other threads are running
static int foreman(int argc, char *not_used[]) {
   _thread_set_lock(_kernel_lock);
   while (1) {
      _thread_yield();
   }
   return 0;
}
#endif

void _start__worker();

void _start__factory_factory() {
   // assign a lock to avoid context switch contention 
#ifndef _KERNEL_LOCKED
#define _KERNEL_LOCKED
   _kernel_lock = _locknew();
   _thread_set_lock(_kernel_lock);
#endif

   __factory_factory = _create_factory(ANY_COG, 200, _kernel_lock);
   if (__factory_factory == NULL) {
       exit(1);
   }

   _start__worker();
}
void _stop__factory_factory() {
   int i;
   int cog;
   int lock;
   long *s;
   void *w;
   WORKER *ws;

   lock = _thread_get_lock();
   do { } while (!_lockset(lock));

   // stop the multi-threading kernels
   for (i = 0; i < ANY_COG; i++) {
      if ((cog = __factory_factory->cogs[i]) >= 0) {
          _cogstop(cog);
      }
      s = __factory_factory->stack[i];
      if (s != NULL) {
         free(s);
      }
   }
   // free the factory workers structure and stacks
   while (__factory_factory->workers != NULL) {
      ws = __factory_factory->workers;
      s = ws->stack;
      __factory_factory->workers = ws->next;
      free(ws);
      free(s);
   }
    free(__factory_factory);
    __factory_factory = NULL;

   _lockclr(lock);

}
// structure to hold input and output variables for _worker
struct __worker_parameters {
   int _status;
};

// parameters for _worker
static struct __worker_parameters __worker_param[10];

// code for _worker thread
int __worker(int me, char *unused[]) {
   while (1) {
      while ((__worker_param[me]._status) == 0) {
         _thread_yield();
      }
      // begin thread code segment
       printf("a simple test ");
      // end thread code segment
      __worker_param[me]._status = 0;
   }
   return 0;
}

// _next__worker - move to the next worker
#define _next__worker(j) (j = (j + 1)%(10))

// _free__worker - find a free worker, waiting if necessary
int _free__worker(void) {
   int i = 0;
   while (__worker_param[i]._status != 0) {
      _next__worker(i);
      if (i == 0) {
         _thread_yield();
      }
   }
   return i;
}

// _await_all__worker - wait for all worker threads to complete
void _await_all__worker(void) {
   int i = 0;
   while (1) {
      if (__worker_param[i]._status != 0) {
         _thread_yield();
      }
      else {
         _next__worker(i);
         if (i == 0) {
            break;
         }
      }
   }
}

void _start__worker() {
    int i;
    _thread *w;

   if (__factory_factory == NULL) {
       exit(1);
   }

   // create worker threads in the factory
   for (i = 0; i < (10); i++) {
      if (_create_worker(__factory_factory, &__worker, 200, 100, i, NULL) == NULL) {
          exit(1);
      }
   }
}

void main() {
    int i;

   // start the factory and workers
   _start__factory_factory();

    printf("Hello, world!\n\n");

    for (i = 1; i <= 10; i++) {

   {
      int __worker_id = 0;
      // find a free worker, waiting as necessary
      while (__worker_param[__worker_id]._status != 0) {
         _next__worker(__worker_id);
         if (__worker_id == 0) {
            _thread_yield();
         }
      }
      // found a free thread, so give it some work
      __worker_param[__worker_id]._status = 1;
      _next__worker(__worker_id);
   }


    }

    printf("\n\nGoodbye, world!\n");

    while(1);  // never terminate
}

RossH · 2020-08-31 08:03

Reinhard wrote: »

Is it correct that the parallelizer can in principle work with any C compiler?

In theory, yes. As you can see, the output of the parallelizer is just plain old C. Of course, the current parallelizer only knows how to generate code that uses the Catalina threads library - but adapting it to generate the appropriate function calls for a different thread library should be a fairly straightforward process. There are several Posix thread implementations that could probably be used. We don't even need the whole library to be implemented (it's quite large!) - just the fundamental ability to start a thread.

Ross.

RossH · 2020-09-02 00:36

All,

The file at the top of this thread has been updated to version 1.4.

No functionality changes, just a few improvements to the parallelizer's "robustness", and also to stop the compiler generating warning messages when compiling some programs - e.g. test_8.c.

Also, that test program has been simplified slightly to make it more obvious what's actually going on.

Ross.

Publison · 2020-09-02 00:37

Thanks @RossH !

Reinhard · 2020-09-02 07:44

I have now successfully installed Catalina 4.4 on my Ubuntu computer. At the moment I use the command line version, later maybe Geany as editor / IDE.
test_1.bin also delivers the desired result. As soon as time allows, I want to get the philosopher problem, a prime example of parallel processing, working on the P2.
On Linux it is compiled like this: gcc phil2.c -o phil2 -lpthread

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
 
typedef struct philData {
    pthread_mutex_t *fork_lft, *fork_rgt;
    const char *name;
    pthread_t thread;
    int   fail;
} Philosopher;
 
int running = 1;
 
void *PhilPhunction(void *p) {
    Philosopher *phil = (Philosopher*)p;
    int failed;
    int tries_left;
    pthread_mutex_t *fork_lft, *fork_rgt, *fork_tmp;
 
    while (running) {
        printf("%s is sleeping --er thinking\n", phil->name);
        sleep( 1+ rand()%8);
 
        fork_lft = phil->fork_lft;
        fork_rgt = phil->fork_rgt;
        printf("%s is hungry\n", phil->name);
        tries_left = 2;   /* try twice before being forceful */
        do {
            failed = pthread_mutex_lock( fork_lft);
            failed = (tries_left>0)? pthread_mutex_trylock( fork_rgt )
                                   : pthread_mutex_lock(fork_rgt);
            if (failed) {
                pthread_mutex_unlock( fork_lft);
                fork_tmp = fork_lft;
                fork_lft = fork_rgt;
                fork_rgt = fork_tmp;
                tries_left -= 1;
            }
        } while(failed && running);
 
        if (!failed) {
            printf("%s is eating\n", phil->name);
            sleep( 1+ rand() % 8);
            pthread_mutex_unlock( fork_rgt);
            pthread_mutex_unlock( fork_lft);
        }
    }
    return NULL;
}
 
void Ponder()
{
    const char *nameList[] = { "Kant", "Guatma", "Russel", "Aristotle", "Bart" };
    pthread_mutex_t forks[5];
    Philosopher philosophers[5];
    Philosopher *phil;
    int i;
    int failed;
 
    for (i=0;i<5; i++) {
        failed = pthread_mutex_init(&forks[i], NULL);
        if (failed) {
            printf("Failed to initialize mutexes.");
            exit(1);
        }
    }
 
    for (i=0;i<5; i++) {
        phil = &philosophers[i];
        phil->name = nameList[i];
        phil->fork_lft = &forks[i];
        phil->fork_rgt = &forks[(i+1)%5];
        phil->fail = pthread_create( &phil->thread, NULL, PhilPhunction, phil);
    }
 
    sleep(40);
    running = 0;
    printf("cleanup time\n");
 
    for(i=0; i<5; i++) {
        phil = &philosophers[i];
        if ( !phil->fail && pthread_join( phil->thread, NULL) ) {
            printf("error joining thread for %s", phil->name);
            exit(1);
        }
    }
}
 
int main()
{
    Ponder();
    return 0;
}

Reinhard · 2020-09-02 07:53

One more question about the Catalina installation. build_all aborts because I did not install the Boost C ++ Library, therefore libtool could not be built. Is that important ?

RossH · 2020-09-02 08:06

Reinhard wrote: »

One more question about the Catalina installation. build_all aborts because I did not install the Boost C ++ Library, therefore libtool could not be built. Is that important ?

I believe libtool is required to build the srecord utility, which is the last thing to be built. Probably not required if you don't use Catalina's -F command-line option.

EDIT: I think it may also be required to build Geany.

RossH · 2020-09-02 08:11

Reinhard wrote: »

I have now successfully installed Catalina 4.4 on my Ubuntu computer. At the moment I use the command line version, later maybe Geany as editor / IDE.
test_1.bin also delivers the desired result. As soon as time allows, I want to get the philosopher problem, a prime example of parallel processing, working on the P2.
On Linux it is compiled like this: gcc phil2.c -o phil2 -lpthread

Catalina comes with a couple of versions of the Dining Philosopher problem - e.g. see demos/multithread/dining_philosophers.c

However, I have not yet rewritten it to use the new pragmas. Let me know how you go with your version!

Reinhard · 2020-09-02 08:21

.
.
.
Done!

Now I have installed all the needed stuff and start to learn Catalina.
Thank you !

RossH · 2020-09-02 11:46

Reinhard wrote: »
.
.
.
Done!
Now I have installed all the needed stuff and start to learn Catalina.
Thank you !

I recently had to reinstall Linux, so I have an updated list of what is required to build Catalina and Geany:

For building Catalina:

sudo apt-get install libreadline-dev
sudo apt-get install gcc-multilib
sudo apt-get install libboost-dev
sudo apt-get install libncurses5-dev
sudo apt-get install dos2unix
sudo apt-get install bison
sudo apt-get install flex

For building Geany (see https://wiki.geany.org/howtos/):

- get source from https://github.com/geany/geany
- apply the Catalina changes (in catalina_geany/catalina_geany_source.zip)

sudo apt-get install libgtk2.0-dev
sudo apt-get install intltool
sudo apt-get install libtool-bin
sudo apt-get install docutils-common

I will include this updated information in the next release.

Reinhard · 2020-09-02 12:29

Thank you very much for your kindly support. With the next version of Catalina, I'm learning to get Geany as an IDE.
But this is not so important at the moment (for me). IDE is nice, but not elementary.
I will get along with the language because I have been programming skills in C for several years.
I've already read some of your documents and that gives me a deeper look into the chip than the origin documents about SPIN and pasm.
At the moment I still have the problem that I can only execute payload as sudo. But even I will find a solution.

RossH · 2020-09-02 23:46

Reinhard wrote: »

At the moment I still have the problem that I can only execute payload as sudo. But even I will find a solution.

Try adding yourself to the "dialout" group, as described in README.Linux:

sudo usermod -a -G dialout <username>

RossH · 2020-09-03 01:38

Hi @Reinhard

Since you are interested in exploring the Dining Philosophers, I decided to create a version where everything is configurable on the command-line (but note that it still doesn't use the new pragmas!):

/***************************************************************************\
 *                                                                           *
 *                            Dining Philosophers                            *
 *                                                                           *
 * This progran simulates the "Dining Philosophers" dilemma on the Propeller *
 * using a thread to represent each philosopher, and a thread lock to        *
 * represent each fork.                                                      *
 *                                                                           *
 * For more details on this classic problem in concurrency, see:             *
 *                                                                           *
 *   http://en.wikipedia.org/wiki/Dining_philosophers_problem                *
 *                                                                           *
 \***************************************************************************/

/*
 * include Catalina multi-threading:
 */
#include <catalina_threads.h>

/*
 * include some useful multi-threading utility functions:
 */
#include <thread_utilities.h>

/*
 * define stack size for threads:
 */
#define STACK_SIZE 100

/*
 * define the maximum nuber of Philosophers we can have:
 */
#define MAX_PHILOSOPHERS 16
/*
 * define how many Philosophers we actually have:
 */
#ifndef NUM_PHILOSOPHERS
#define NUM_PHILOSOPHERS 16
#endif

/*
 * define amount of time to wait between pickup up left and right fork
 * (increasing this increases the chance of deadlocks!)
 */
#ifndef INTERFORK_WAIT
#define INTERFORK_WAIT 100
#endif

/*
 * define how many courses they will eat:
 */
#ifndef NUM_COURSES
#define NUM_COURSES 20
#endif


struct Customer {
   char *name;  // customer's name
   int   left;  // customer's left fork
   int   right; // customer's right fork
};

static struct Customer Philosopher[MAX_PHILOSOPHERS] = {
   {"Aristotle",     0,  1},
   {"Kant",          1,  2},
   {"Spinoza",       2,  3},
   {"Marx",          3,  4},
   {"Russell",       4,  5},
   {"Aquinas",       5,  6},
   {"Bacon",         6,  7},
   {"Hume",          7,  8},
   {"Descarte",      8,  9},
   {"Plato",         9, 10},
   {"Hegel",        10, 11},
   {"de Beauvoir",  11, 12},
   {"Sartre",       12, 13},
   {"Wittgenstein", 13, 14},
   {"Schopenhauer", 14, 15},
   {"Rousseau",     15,  0},
};


/*
 * a pool of thread locks - note that the pool must be 5 bytes larger than
 * the actual number of locks required (MIN_THREAD_POOL_SIZE = 5) 
 */
static char Pool[MIN_THREAD_POOL_SIZE + NUM_PHILOSOPHERS + 1]; 

static void * Seat[NUM_PHILOSOPHERS];       // maps seats to threads

static int Fork[NUM_PHILOSOPHERS];          // maps forks to thread locks

static int HMI_Lock = 0;                    // thread lock to protect HMI calls

static int Dinner_Bell = 0;                 // set this to 1 to start dinner


/*
 * Pick_Up - pick up a fork (a fork is modelled by a thread lock)
 */
void Pick_Up(int fork) {
   do { } while (!_thread_lockset(Pool, Fork[fork]));
}


/*
 * Put_Down - put down a fork (a fork is modelled by a thread lock)
 */
void Put_Down(int fork) {
   _thread_lockclr(Pool, Fork[fork]);
}


/*
 * Diner - (each diner is modelled by a thread)
 */
int Diner(int me, char *not_used[]) {
   int course;
   char *my_name = Philosopher[me].name;

   // wait till dinner is served
   do { 
      _thread_yield();
   } while (!Dinner_Bell);

   // be seated
   _thread_printf(Pool, HMI_Lock, "%s has been seated\n", my_name);

   // eat dinner
   for (course = 0; course < NUM_COURSES; course++) {
      _thread_printf(Pool, HMI_Lock, "%s is thinking\n", my_name);
      _thread_wait(random(2000));
      _thread_printf(Pool, HMI_Lock, "%s is hungry\n", my_name);
      Pick_Up(Philosopher[me].left);
      _thread_wait(random(INTERFORK_WAIT)); // <-- increase for more deadlocks!
      Pick_Up(Philosopher[me].right);
      _thread_printf(Pool, HMI_Lock, "%s is eating\n", my_name);
      _thread_wait(random(2000));
      Put_Down(Philosopher[me].right);
      Put_Down(Philosopher[me].left);
   }

   // leave
   _thread_printf(Pool, HMI_Lock, "%s is leaving\n", my_name);
   _thread_wait(100);
   return 0;
}
   
/*
 * main - set up the diners and then start dinner
 */
int main(void) {

   int i = 0;
   long stacks[STACK_SIZE * NUM_PHILOSOPHERS];

   // assign a lock to avoid context switch contention 
   _thread_set_lock(_locknew());

   // allocate a pool of thread locks to use as forks (note we
   // need one extra lock to use as our HMI Lock)
   _thread_init_lock_pool(Pool, NUM_PHILOSOPHERS + 1, _locknew());

   // assign a lock to avoid HMI contention
   HMI_Lock = _thread_locknew(Pool);

   _thread_printf(Pool, HMI_Lock, "The Dining Philosophers\n\n");
   _thread_printf(Pool, HMI_Lock, "Press a key to begin ...\n\n");

   k_wait();
   randomize();

   // fix forks to suit actual number of philosophers
   if ((NUM_PHILOSOPHERS < 2) || (NUM_PHILOSOPHERS > MAX_PHILOSOPHERS)) {
      _thread_printf(Pool, HMI_Lock, 
         "NUM_PHILOSOPHERS must be between 2 and %d\n", MAX_PHILOSOPHERS);
      while(1);
   }
   Philosopher[NUM_PHILOSOPHERS - 1].right = 0;

   // set up seats and forks 
   for (i = 0; i < NUM_PHILOSOPHERS; i++) {
      Seat[i] = _thread_start(&Diner, &stacks[STACK_SIZE * (i + 1)], i, NULL);
      Fork[i] = _thread_locknew(Pool);
   }
   _thread_printf(Pool, HMI_Lock, "Table set for %d diners\n\n", NUM_PHILOSOPHERS);

   // now start dinner
   Dinner_Bell = 1;

   // now loop forever 
   while (1) { 
      _thread_yield();
   };

   return 0;
}

You can now specify the number of philosophers, courses and a delay that improves or reduces the likelihood of deadlock.

To compile this version, use a command like:

catalina -p2 -C P2_EVAL -D NUM_PHILOSOPHERS=5 -D NUM_COURSES=10 -D INTERFORK_WAIT=1000  dining_philosophers.c -lci -lthreads

To raise the chances of deadlock, raise the interfork wait, or reduce the number of philosophers. To reduce the chance, reduce the interfork wait, or increase the number of philosophers.

Also, I noticed that the original version had a bug on line 144, which was:

_thread_init_lock_pool(Pool, NUM_PHILOSOPHERS, _locknew());

It should have been:

_thread_init_lock_pool(Pool, NUM_PHILOSOPHERS + 1, _locknew());

I've fixed this in the above version.

Ross.

Reinhard · 2020-09-03 07:21

This is a very cool demo about parallel processing.
Thank you.

I add me to the dialgroup, as suggested in README Linux.
But:

payload -i -p /dev/ttyUSB0 DiningPhil.bin
unable to open comport : Permission denied
Error: Unable to initialize Propeller - check connections

this works

sudo ../catalina/bin/payload -i -p /dev/ttyUSB0 DiningPhil.bin

RossH · 2020-09-03 07:44

Reinhard wrote: »
This is a very cool demo about parallel processing.
Thank you.

I add me to the dialgroup, as suggested in README Linux.
But:
payload -i -p /dev/ttyUSB0 DiningPhil.bin
unable to open comport : Permission denied

Well, it's definitely a permission problem. Try executing the "groups" command, and see what groups you belong to. The group you need to belong to is "dialout". Also, you need to log out and log back in again for any group changes to take effect.

As a last resort, you can change the permissions on the /dev/ttyUSB0 device to allow all users to read and write to it, or change the owner of the device to be you - but these are drastic steps, and shouldn't really be necessary.

Ross.

Reinhard · 2020-09-03 09:05

Yes, I'll find a solution. But this is not the topic of this thread and I don't want hitchhiking it.

Reinhard

Reinhard · 2020-09-03 12:09

Unfortunately, I still haven't parallelized the Dinining Philosophers. Right now I have too many ideas in my head.
I programmed a Sudoku solver with Catalina

catalina propsudoku.c -p2 -lc -C P2_EVAL -C NATIVE -C TTY -C CR_ON_LF

and then loaded it onto the P2 EvalBoard with p2load (please don't ask me why, at the moment I have reasons for it).

The result is correct, the speed - well, moderate.

My question is, is this suitable for parallelization?

The solver function is called recursively, is that suitable for #pragmas?

/*
 * 
 * Sudoku solver
 * In the original, the puzzle is read from a file.
 * Here from an array.
 * 
 * The puzzle is solved serial.
 * The Trick is the recursive Call of solve()
 * 
 * Expected are 18 Solutions for this puzzle
 * 
*/

#include <stdio.h>

/* size of sudoku */
#define G 9 

/* unsolved puzzle, 0 means an empty space */
static int feld[G][G] ={{0,0,0,0,7,0,0,0,0},
	                    {0,0,0,1,9,5,3,0,0},
	                    {0,9,8,0,0,0,0,6,0},
	                    {0,0,0,0,6,0,0,0,3},
	                    {0,0,0,8,0,3,0,0,1},
	                    {7,0,0,0,2,0,0,0,6},
	                    {0,6,0,0,0,0,2,8,0},
	                    {0,0,0,4,1,9,0,0,5},
	                    {0,0,0,0,8,0,0,7,9}};

/* count of solution(s) */
static int loesungen = 0;

/* Forward Declarations */
void printSudoku();
int solve(int x, int y);
int checkBox(int x, int y, int wert);
int checkVertical(int x, int wert);
int checkHorizontal(int y, int wert);
int check(int x, int y, int wert);

int main() {

  printf("### unsolved Sudoku: ###\n");
  printSudoku();

  printf("\n###################\n");
  solve(0, 0);
  printf("Solutions: %d\n", loesungen);
	
  return 0;
}


/*
 * Checks whether number already exists (calls a sub-function for each condition
 * Return: 0 for not found
 *         1 for found
*/
int check(int x, int y, int wert) {
  if(checkHorizontal(y, wert))
    return 1;
  if(checkVertical(x, wert))
    return 1;
  if(checkBox(x, y, wert))
    return 1;
  return 0;   
}


/*
 * Checks whether number is already present in the horizontal line
 * Return: 0 for not found
 *         1 for found
*/
int checkHorizontal(int y, int wert) {
  int i;
  for(i = 0; i < G; i++)
    if(feld[y][i] == wert)
      return 1;
  return 0;
}

/*
 * Checks whether number is already present in the vertical line
 * Return: 0 for not found
 *         1 for found
*/
int checkVertical(int x, int wert) {
  int i;
  for(i = 0; i < G; i++)
    if(feld[i][x] == wert)
      return 1;
  return 0;
}


/*
 * Checks whether the number is already in the box
 * Return: 0 for not found
 *         1 for found
*/
int checkBox(int x, int y, int wert) {
  int x_box, y_box, i, j;
  /* Find out the right corner of the box */
  x_box = (int)(x / 3) * 3;
  y_box = (int)(y / 3) * 3;
  for(i = y_box; i < y_box + 3; i++)
    for(j = x_box; j < x_box + 3; j++)
      if(feld[i][j] == wert)
        return 1;
  return 0;
}


/*
 * Provides all solutions for a Sudoku
 * x, y start value (0.0)
 * Return: 0 attempt did not work
 *         1 attempt worked
*/
int solve(int x, int y) {
  int i;
  if(x == G) {                 /* End of line reached */
    y++;
    x = 0;
    if(y == G)                 /* End reached */
      return 1;
  }   
  
  if(feld[y][x] > 0)           /* Field already set */
    return solve(x+1, y);      /* Next Field */
    
  for(i = 1; i <= G; i++) {    /* No number available */
    if(!check(x, y, i)) {      /* go through all the numbers */
      feld[y][x] = i;          /* If number matches, put */
	    if(solve(x+1, y)) {    /* check next Field */
          loesungen++;           /* found solution */
          printf("Solution %d:\n", loesungen);
		  printSudoku();
		  printf("\n");
          /*return 1;*/          /* print only one solution */
	    }
    }
  }
  
  feld[y][x] = 0;              /* No number matched, put 0 again */
  return 0;
}

/*
 * print sudoku on screen
*/
void printSudoku() {
  int i, j;
  for(i = 0; i < G; i++) {
    for(j = 0; j < G; j++) {
      printf("%d", feld[i][j]);
    }
    printf("\n");
  }
}

the goodie is, this piece of code can compiled with Catalina and gcc w/o changes.

RossH · 2020-09-04 00:47

Reinhard wrote: »

My question is, is this suitable for parallelization?

Nice algorithm! However, on a quick examination, I think not parallelizable - or at least not much.

The problem is that this program modifies entries in the "feld" array in a manner which means the algorithm won't work unless the recursive calls (in this case, to "solve") are done in a specific order. If you parallelize it, the changes being made to "feld" by one parallel task will interfere with the operation of all the other tasks.

Parallelizing recursive algorithms is certainly possible - but mainly in cases where the recursion is used to subdivide the task into smaller independent tasks, not if the execution of these tasks has to be done in a particular order. Also, the number of threads you may need would tend to make it impractical - you need a thread for each potential recursive call, and each thread needs a stack, so you can very quickly run out of memory.

However, your algorithm has given me an idea - if I can get it working without complicating the current pragmas too much, I could get you some improvement in this algorithm - perhaps as much as 50% improvement.

I'll let you now how I go.

Ross.

ersmith · 2020-09-04 12:54

Ross, your parallelize program looks interesting. But I really strongly think it would be more useful if the pragmas were standard OpenMP pragmas. You're already most of the way there; it looks like what you're doing is very similar to OpenMP.

I know, OpenMP looks complicated. But honestly your #pragma propeller also looks complicated when one first looks at it (looking at the whole documentation). The use of "#pragma propeller" for simple cases is not hard, of course. But neither is "#pragma omp" hard for simple cases.

It's easier for you to *implement* your own pragmas, but it's much easier for users to *use* standard pragmas (even if they're only a subset) because, well, they're standard. OpenMP is implemented in a lot of compiilers (Visual C, gcc, clang, Intel C, etc.) and examples and documentation are widely available on the web.

The parallelize tool already seems to do much of what an OpenMP implementation would have to do. It looks like you would have to add the ability to automatically generate #pragma propeller worker statements from #pragma omp parallel statements, but the syntax of the two is pretty similar (you would have to make parallelize understand where C blocks of code start and end, but it probably has to parse the code already to find the #pragmas, so I don't think that would be an insurmountable barrier).

With that change your parallelize tool would be interesting not just in the propeller world but for arduino and other embedded systems too! I urge you to consider this.

Is the source code for parallelize available anywhere?

RossH · 2020-09-05 00:02

ersmith wrote: »

Ross, your parallelize program looks interesting. But I really strongly think it would be more useful if the pragmas were standard OpenMP pragmas. You're already most of the way there; it looks like what you're doing is very similar to OpenMP.

I know, OpenMP looks complicated. But honestly your #pragma propeller also looks complicated when one first looks at it (looking at the whole documentation). The use of "#pragma propeller" for simple cases is not hard, of course. But neither is "#pragma omp" hard for simple cases.

It's easier for you to *implement* your own pragmas, but it's much easier for users to *use* standard pragmas (even if they're only a subset) because, well, they're standard. OpenMP is implemented in a lot of compiilers (Visual C, gcc, clang, Intel C, etc.) and examples and documentation are widely available on the web.

The parallelize tool already seems to do much of what an OpenMP implementation would have to do. It looks like you would have to add the ability to automatically generate #pragma propeller worker statements from #pragma omp parallel statements, but the syntax of the two is pretty similar (you would have to make parallelize understand where C blocks of code start and end, but it probably has to parse the code already to find the #pragmas, so I don't think that would be an insurmountable barrier).

With that change your parallelize tool would be interesting not just in the propeller world but for arduino and other embedded systems too! I urge you to consider this.

Is the source code for parallelize available anywhere?

I just read the OpenMP spec in more detail, and yes, there is more similarity than I at first realized. I suppose that was inevitable, seeing as we are both using similar techniques to solve a similar problem. However, while I can see that it might be possible to implement the propeller pragmas using OpenMP pragmas, the reverse is probably not true - OpenMP is extremely complex, and requires implementation within the compiler, whereas the propeller pragmas are relatively simple and do not.

Yet the propeller pragmas seem to be quite adequate.

I think the reason for this is that OpenMP is designed to make it possible add parallelism to a multitude of different architectures, whereas the propeller pragmas are only trying to support just one architecture - one which (thanks to Parallax!) already has parallelism built in, with all the necessary supporting infrastructure (basic parallelism, locks, shared memory, etc) available at a much higher conceptual level than is true of most architectures.

All we need to do is expose that capability, not invent it, and not simulate it where it doesn't naturally exist.

I am also very strongly against implementing "subsets" of standards - if you implement something, you should implement it fully, or you are not being honest with your users - and I doubt we would ever be able to claim full OpenMP conformance - certainly not for the Propeller 1 - there is simply not enough room to support all those features.

So I will continue to work with the propeller pragmas - they are quite trivial to implement, as you can see from the code the parallelizer generates - and see how much additional complexity is required to make them more generally useful. My current belief is that not much more is needed, but I may be underestimating.

I will release the source code of the parallelizer once I think it is stable functionality-wise (it is just a single massive "awk" script - but my awk skills are rudimentary at best, so it is not the prettiest program you will ever see!).

Have you plans to add threading to your compiler (or, come to think of it, do you already have it - I haven't checked!). If so, let me know which library you use (I assume it would be posix threads) and I will see if I can add support for it to the parallelizer so you can try it out.

Ross.

Parallel Processing with C

Comments