Catalina - The Next Generation
RossH
Posts: 5,519
You know how it is ... you just finish developing a wonderful concurrent algorithm, and then discover that you don't have enough cogs to execute it!
Well, your worries are over! Catalina now incorporates a multi-threading kernel to save you ever having to worry about running out of cogs again!
With the multi-threading kernel, Catalina users now have the option of starting a C function on another cog using the normal kernel (effectively limiting you to 8 concurrent functions) ... OR ... starting a multi-threaded kernel on one or more cogs and running as many concurrent C functions as you like! You can even run multi-threading kernels on all 8 cogs simultaneously if you want.
Below is an example of the Dining Philosopher's program re-coded to run under Catalina's new multi-threaded kernel. With the previous version of this program the only option was to dedicate a whole cog to each diner. When you added in the necessary display drivers etc you could typically only support 5 diners. The following program runs 16 diners concurrently. It could just as easily run 100, except that I couldn't think of that many philosophers!
/***************************************************************************\
*                                                                           *
*                            Dining Philosophers                            *
*                                                                           *
*  This program simulates the "Dining Philosophers" dilemma on the          *
*  Propeller using a thread to represent each philosopher, and a thread    *
*  lock to represent each fork.                                             *
*                                                                           *
*  For more details on this classic problem in concurrency, see:            *
*                                                                           *
*     http://en.wikipedia.org/wiki/Dining_philosophers_problem              *
*                                                                           *
\***************************************************************************/

/*
 * include Catalina multi-threading:
 */
#include "catalina_threads.h"

/*
 * include some useful multi-thread utility functions:
 */
#include "thread_utilities.h"

/*
 * define stack size for threads:
 */
#define STACK_SIZE 100

/*
 * define how many Philosophers we have:
 */
#define NUM_PHILOSOPHERS 16

/*
 * define how many courses they will eat:
 */
#define NUM_COURSES 20

struct Customer {
   char *name;    // customer's name
   int  left;     // customer's left fork
   int  right;    // customer's right fork
};

static struct Customer Philosopher[NUM_PHILOSOPHERS] = {
   {"Aristotle", 0, 1},
   {"Kant", 1, 2},
   {"Spinoza", 2, 3},
   {"Marx", 3, 4},
   {"Russell", 4, 5},
   {"Aquinas", 5, 6},
   {"Bacon", 6, 7},
   {"Hume", 7, 8},
   {"Descarte", 8, 9},
   {"Plato", 9, 10},
   {"Hegel", 10, 11},
   {"de Beauvoir", 11, 12},
   {"Sartre", 12, 13},
   {"Wittgenstein", 13, 14},
   {"Schopenhauer", 14, 15},
   {"Rousseau", 15, 0},
};

static void *Seat[NUM_PHILOSOPHERS];      // maps seats to threads
static int  Fork[NUM_PHILOSOPHERS];       // maps forks to thread locks
static int  HMI_Lock = 0;                 // cog lock to prevent HMI contention
static int  Dinner_Bell = 0;              // set to 1 to start dinner
static char pool[NUM_PHILOSOPHERS + 5];   // a pool of thread locks

/*
 * Pick_Up - pick up a fork
 */
void Pick_Up(int fork) {
   do { } while (_thread_lockset(pool, Fork[fork]));
}

/*
 * Put_Down - put down a fork
 */
void Put_Down(int fork) {
   _thread_lockclr(pool, Fork[fork]);
}

/*
 * Diner - this function runs as a thread to simulate a diner
 */
int Diner(int me, char *not_used[]) {
   void *my_thread;
   int course;

   // wait till dinner is served
   do { } while (!Dinner_Bell);
   // get our thread identity and our seat number
   my_thread = _thread_id();
   // be seated
   thread_printf(HMI_Lock, "%s has been seated\n", Philosopher[me].name);
   // eat dinner
   for (course = 0; course < NUM_COURSES; course++) {
      thread_printf(HMI_Lock, "%s is thinking\n", Philosopher[me].name);
      thread_wait(random(2000));
      thread_printf(HMI_Lock, "%s is hungry\n", Philosopher[me].name);
      Pick_Up(Philosopher[me].left);
      Pick_Up(Philosopher[me].right);
      thread_printf(HMI_Lock, "%s is eating\n", Philosopher[me].name);
      thread_wait(random(2000));
      Put_Down(Philosopher[me].right);
      Put_Down(Philosopher[me].left);
   }
   // leave
   thread_printf(HMI_Lock, "%s is leaving\n", Philosopher[me].name);
   thread_wait(100);
   return 0;
}

/*
 * main - set up the diners and then start dinner
 */
void main() {
   int i = 0;
   long stacks[STACK_SIZE * NUM_PHILOSOPHERS];

   // assign a lock to avoid context switch contention
   _thread_set_lock(_locknew());
   // assign a lock to avoid HMI contention
   HMI_Lock = _locknew();
   // allocate a pool of thread locks to use as forks
   _thread_init_lock_pool(pool, NUM_PHILOSOPHERS, _locknew());

   thread_printf(HMI_Lock, "The Dining Philosophers\n\n");
   thread_printf(HMI_Lock, "Press a key to begin ...\n\n");
   k_wait();
   randomize();

   // set up seats and forks
   for (i = 0; i < NUM_PHILOSOPHERS; i++) {
      Seat[i] = _thread_start(&Diner, &stacks[STACK_SIZE * (i + 1)], i, NULL);
      Fork[i] = _thread_locknew(pool);
   }
   thread_printf(HMI_Lock, "Table set for %d diners\n\n", NUM_PHILOSOPHERS);

   // now start dinner
   Dinner_Bell = 1;

   // now loop forever
   while (1) { }
}

The thread functions are intentionally very similar to the functions used in the previous "multicog" Catalina demos. There are new functions to manage threads (e.g. _thread_start(), _thread_stop() and _thread_id()) as well as new functions to manage "thread locks" (e.g. _thread_locknew(), _thread_lockset() and _thread_lockclr()).
You can have as many thread locks as you like - you are not limited to the 8 built-in locks (but these are still available).
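To make the lock-pool idea concrete, here is a small host-side sketch of how a pool of thread locks might be built out of a plain char array guarded by a single hub lock. This is an assumed model, not Catalina's actual implementation: the pool layout, function names and return conventions are all invented for illustration, and the hub-lock guard that would make the test-and-set atomic across cogs is omitted because this sketch runs single-threaded.

```c
#include <string.h>

/* Assumed pool layout (illustrative only): pool[0] holds the capacity,
 * and each of pool[1..capacity] holds one lock's state:
 *   0 = slot unallocated, 1 = allocated but clear, 2 = allocated and set.
 * In the real kernel a single Parallax hub lock would guard the whole
 * array so that pool_lockset() is atomic between cogs. */

void pool_init(char *pool, int size)
{
    pool[0] = (char)size;
    memset(pool + 1, 0, (size_t)size);
}

/* allocate a lock from the pool; returns its id, or -1 if none left */
int pool_locknew(char *pool)
{
    for (int i = 1; i <= pool[0]; i++) {
        if (pool[i] == 0) {
            pool[i] = 1;
            return i;
        }
    }
    return -1;
}

/* try to take the lock; returns non-zero if it was already set
 * (mirroring the sense of _thread_lockset in the demo's Pick_Up) */
int pool_lockset(char *pool, int id)
{
    int was_set = (pool[id] == 2);
    pool[id] = 2;
    return was_set;
}

/* release the lock (cf. _thread_lockclr in Put_Down) */
void pool_lockclr(char *pool, int id)
{
    pool[id] = 1;
}
```

In this model, the demo's Pick_Up() spin - do { } while (_thread_lockset(pool, Fork[fork])); - corresponds to looping until pool_lockset() returns 0.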
A binary for the Hydra showing this program in action is attached. It uses the Hydra TV display and keyboard. It was compiled using the command:
catalina -lci -lthreads -D THREADED -D HYDRA -O3 dining_philosophers.c

Defining the THREADED symbol selects the multi-threaded kernel, and the "threads" library includes all the necessary multi-threading support functions. The combination of these two is similar in effect to the POSIX "pthreads" library.
I will release the new multi-threading version of Catalina shortly. It will first be made available as a patch release to those users who have purchased the Catalina Optimizer. The new version fully supports both the Optimizer and the BlackCat and BlackBox debuggers.
Ross.
Comments
Everything seems to be working quite well - I'm just doing some final integration work, and then there's a bit of documentation left to do.
What I'm working on now is integrating the new multi-threading support with the existing multi-cog support. The idea is that when you spawn a new concurrent C function, you can let Catalina decide automatically whether to use a cog or a thread. By default it will first use any unused cogs. Once there are no unused cogs left, it will instead spawn the function as a new thread on the current cog. This will be completely transparent to the program. I expect this will be the method most people will use to spawn new processes. In most cases it will give you the best (fastest) results.
Of course, there will always be circumstances where you definitely do NOT want to use multi-threading - such as when you are controlling external hardware and need the determinism or the speed provided by the non-threaded kernel. So you will still be able to dedicate a C function to execute on its own unthreaded cog if you need to do so.
I hope to finish it all today - if I don't manage it then it will probably be another week or so before I get a chance to get back to it.
Ross.
Here is a slightly simpler program that executes 80 concurrent C threads on a single cog. When run, this program displays the following:
A binary compiled for the Hydra is attached. More threads could be executed if each one did not require so much stack space. In this program each thread prints a value on the screen whenever it is "pinged", and this - plus the thread overhead of 35 longs - means each thread needs 75 longs of stack. That's 24k of Hub RAM required just for stack space!
Another point to note is the use of the _thread_yield() function - if a thread discovers it has nothing to do at the moment it can simply yield control of the cog to another thread.
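The polling pattern described can be sketched like this (a host-side model; _thread_yield() is the only name from the real library, and here it is stubbed out to count calls, since the sketch runs single-threaded rather than under the Catalina kernel):

```c
/* single-threaded model of a polling thread that yields when idle */

static int pinged;        /* would be set by another thread */
static int yields;        /* how many times we gave the cog away */

static void yield_stub(void)
{
    yields++;             /* stand-in for _thread_yield(), which hands
                             the rest of this timeslice to the next
                             ready thread on the same cog */
}

/* one pass of a well-behaved polling thread: if there is no work
 * pending, yield the cog instead of burning the whole timeslice */
int check_for_ping(void)
{
    if (!pinged) {
        yield_stub();
        return 0;         /* nothing to print this time around */
    }
    pinged = 0;
    return 1;             /* here the real thread would print its value */
}
```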
Ross.
I thought all philosophers were called "Bruce".
http://www.ibras.dk/montypython/episode22.htm
Ray
With a small amount of configuration, Catalina will work perfectly well on the C3 in LMM mode - including the latest multi-cog and multi-thread support. However, Catalina will not be able to use the SRAM on the C3 as XMM RAM (i.e. for executing programs). This is because the SRAM is addressed serially rather than in parallel, which means the necessary access code will not currently fit in the space available in the XMM kernel cog.
I had hoped to add C3 support to Catalina, but I had to give up this idea because I simply don't have the time to do it at present. Adding multi-threading support required only about 40 additional instructions, which I could easily fit into the "alternate" kernel. I can occasionally manage a day or two here and there, but any significant changes to the XMM kernel (which would be required to support serial SRAM) will have to wait until I get some larger slabs of time.
However, it seems that the C3 will have enough spare pins to support parallel SRAM access, so I'm sure someone will soon design a parallel SRAM add-on board - then it will be relatively simple to add full Catalina support.
Ross.
P.S. @Heater - the initial version of my "dining philosophers" program did indeed have all the philosophers named "Bruce" - but the University of Woolloomooloo threatened to revoke my honorary doctorate (in beer making) if I published it that way.
VMCOG has a driver for the C3 (thanks Dave!), so if you add VMCOG support to the Catalina LMM, it will run on the SPI RAMs just fine.
Bill
The 32MB SDRAM boards have an I2C attached device that has a 10bit ADC. It can have up to 3 single ended channels (2 is much easier) with 5V range or 1 differential channel with 1.1V or 2.56V ranges. It is also possible to have a PS2 port (mouse or keyboard) and 1 single ended ADC channel.
The ADC function does require code which I have not written yet. I'm still working on the keyboard code, but the mouse is done. A 2nd Propeller would have been far more flexible, of course. In any event, functionality is just a matter of software and what the hardware will support.
Of course the C3 is almost ready to ship and I'm sure it will make many people happy. I have 7 functional SDRAM boards on my desk (in 3 flavors), but I'm not sure when boards will be generally available for sale.
@RossH,
It's cool that you're providing a threaded library. A subset of the common pthreads interface could be quickly implemented (http://en.wikipedia.org/wiki/POSIX_Threads). The problem is in supporting pthreads entirely.
Adding some kind of a virtual memory interface to Catalina would be useful - and it could potentially simplify the memory back-end. I've looked at adding an SDRAM cache to Catalina (to take advantage of its performance and size) and have received feedback on implementation and an overview of the XMM standard - I was looking for help getting it working.
Adding cache to Catalina is not my highest priority right now, but at some point I'll get back to it. I'm sure I'll need help getting cache working ... VMCOG will be equally difficult for anyone but RossH to add, IMHO.
I only ask as it is slightly more elegant (i.e. lower chip count) than the Dracs (which probably have all those extras for a good reason).
Yes, I agree VMCog support might be the right way to get Catalina on a lot of new boards - including Morpheus II and the C3 - but it's a matter of finding a slab of time large enough to do it. I still hope to get to it sometime, but I am slowly learning not to make commitments to things that I can't complete using just the odd few hours here or there.
@jazzed,
I'm not really after POSIX compatibility - a complete pthreads compatible library could easily take up most of the Hub RAM and leave no space for doing anything useful. My threads are intended to be "lean and mean" - and POSIX threads are anything but that! While I hope others will also find them useful, they are primarily intended for my own use in my own projects. However, I think they also provide another of the missing pieces that seem to be preventing the Propeller from being used to its full potential.
Also, the above comments to Bill apply to your board as well. I'm happy to help out, but at the moment I am having trouble finding blocks of contiguous time big enough to do complex work - so I'm tending to stick to things that I know I can complete within a few hours. I'd actually planned to do the threads stuff ages ago, but kept putting it off because I didn't really need it yet. It got done this week mainly because it was something I knew I could finish quickly (having laid most of the groundwork much earlier).
@Toby
Yes, Catalina fully supports the TriBlade blade #1 in all its variants - XMM RAM, VGA output, TV output, keyboard, mouse, serial I/O etc. Look in the file README.TriBladeProp in the Catalina\target directory for more details.
The TriBlade is one of my favorite "all round" boards - but the DracBlade has an advantage in that it supports VGA at the same time as the XMM RAM. From memory, I think on the TriBlade you can only use the TV output when also using XMM RAM.
Ross.
I love the way that the DracBlade separates out the pin clashes but of course at the expense of speed.
I have dug up a Blade2 board that I made with a PropCMD on it - sort of a Blade1 + 2. I will have to pull out the digit(s) and have a race between the two.
I did not say that you should aim for complete pthread compatibility. A subset of it would be nice though for people who already understand it.
I could build some simple wrappers that would make the API look more 'pthread-like' - I'll see if I can find some time for this.
Ross.
Jazzed asked if I could make the Catalina threads library more compatible with the POSIX pthreads library. This seemed like a reasonable request, but it would have been quite hard to do using just the basic threading features I had already implemented - so I have added a few more. The new features will make it substantially easier to emulate POSIX threads, as well as implementing inter-process communications mechanisms such as mailboxes or remote procedure calls.
I already had the ability to start and stop threads, and I also already had thread locks (much like POSIX mutexes or Parallax locks). Now I have added the ability to pass parameters to and from threads, and also added a new _thread_join primitive which allows you to wait for a thread to complete and then fetch its return value.
Catalina's mechanism for passing parameters to threads is different to the POSIX mechanism - POSIX threads accept and return a pointer value, whereas Catalina threads mimic the C command line processing mechanism - i.e. using argc and argv, and returning an int. However, this can be made to simulate POSIX if required, since an int can also be used to pass and return an arbitrary pointer value (they are the same size).
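The int-vs-pointer equivalence can be sketched in portable C. Here posix_body has the POSIX thread signature (pointer in, pointer out) and catalina_body wraps it in a Catalina-style (argc, argv) -> int signature, with argv[0] smuggling the pointer argument in. All names are illustrative, not part of Catalina's API; intptr_t keeps the sketch well-defined on hosts where pointers are wider than int (on the Propeller both are 32 bits, so nothing is lost).

```c
#include <stdint.h>

/* a POSIX-style thread body: takes and returns a pointer */
static void *posix_body(void *arg)
{
    int *n = arg;      /* treat the argument as a counter ... */
    *n += 1;           /* ... and bump it */
    return arg;
}

/* the same work wrapped in a Catalina-style signature: the pointer
 * argument rides in as argv[0], and the pointer result rides back
 * in the int return value */
static int catalina_body(int argc, char *argv[])
{
    (void)argc;                               /* only argv[0] is used */
    void *result = posix_body((void *)argv[0]);
    return (int)(intptr_t)result;             /* pointer back as an int */
}
```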
Here is a program that illustrates all of the new features: When run, this program displays the following:
As usual, a binary for the Hydra is attached.
Ross.
Do you need something equivalent to pthread_spin_lock etc. for SMP mutexes? IIRC, a standard mutex considers just one core, whereas a spin lock can be used with multiple cores.
Thanks.
Thanks. I'm not really trying to emulate pthreads, I'm just trying to make it easier for someone (possibly me!) to do so later.
Some more details ...
There is a _thread_stop, which would be equivalent to pthread_cancel.
There is a _thread_check which can be used to check the state of a thread (i.e. executing or not) - I don't know the POSIX equivalent to this.
There is no equivalent to pthread_detach since all threads share all resources (including the address space) so it is not really necessary.
The thread scheduling is simple round-robin time slicing, but there is a _thread_set_ticks that can be used to adjust the relative priority of each thread.
The equivalent to pthread_mutex_init is actually _thread_init_lock_pool (not _locknew). A pool is an arbitrarily sized set of locks that can then be used individually - and you can have as many pools as you like, of any size.
I modelled the individual thread lock operations (_thread_locknew, _thread_lockret, _thread_lockset and _thread_lockclr) on the Parallax hub lock operations since I thought they would be more familiar to Propeller users - but with these primitives (and _thread_yield and _thread_wait) you can simulate any of the POSIX mutex or condition operations.
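A sketch of one such simulation: a POSIX-style mutex built from a thread lock plus yield. In real Catalina code the lock loop would presumably read something like while (_thread_lockset(pool, m)) _thread_yield(); (assumed usage, based on the primitives named in this thread). The model below is single-threaded, so lock_state stands in for one thread lock and yield_to_owner() pretends that yielding lets the lock's owner run and release it.

```c
/* single-threaded model of a mutex built from lockset/lockclr/yield */

static int lock_state;    /* 0 = clear, 1 = set */
static int spins;         /* yields taken while waiting for the lock */

static int lockset_stub(void)      /* returns previous state, then sets */
{
    int was = lock_state;
    lock_state = 1;
    return was;
}

static void lockclr_stub(void)
{
    lock_state = 0;
}

static void yield_to_owner(void)   /* model: the owner runs and unlocks */
{
    spins++;
    lock_state = 0;
}

void mutex_lock(void)
{
    while (lockset_stub())         /* lock already set? someone holds it */
        yield_to_owner();          /* give the cog away and retry */
}

void mutex_unlock(void)
{
    lockclr_stub();
}
```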
I'm still working on integrating the new multi-thread capability with the existing multi-cog capability - but this is mostly cosmetic work intended to simplify things for the user. I can already launch multi-threading kernels on multiple cogs, and the purpose of the _thread_set_lock and _thread_get_lock primitives is to synchronize different kernels operating on different cogs, so that the thread operations work correctly between cogs.
This will probably be all for the present. These primitives already take around 200 longs to implement, so I'd be very hesitant to add any more.
I'll publish some documentation when I get a chance.
Ross.
The operations already specified (e.g. _thread_stop, _thread_join) will operate only on threads being executed by the current cog - this is quite efficient, and is sufficient for most purposes.
A new set of "affinity" operations will be added that can reach across cogs to operate on threads being executed on any cog (in addition to those being executed on the current cog) - these operations are much less efficient, and should only be used in cases where you know the thread you want is not executing "locally". This way, if you don't need them, you won't waste the extra RAM on the affinity capabilities.
The basic affinity operations (for "affinity" you can mostly read "cog") will be as follows:
Ross.
Catalina now has its multi-cog capabilities fully integrated with its multi-thread capabilities.
The following demo program illustrates starting 4 additional multi-threading kernels (each one on its own cog) and then starting 10 additional C threads designed to wander around between all the kernel cogs at random, being seamlessly executed for a random amount of time by each of the different kernel cogs in turn.
I'm not exactly sure what the use of this capability is yet - but it's fairly simple to do once you have the basic multi-threading, so why not?
Attached is the usual demo, compiled for the HYDRA. It was compiled using the command: When run, the program produces the following output:
I hope to get some time to build a new Catalina release soon.
Ross.
Below is a program that shows off most of the multicog/multithread capabilities that will be available in the next release of Catalina.
This interactive program starts 3 multithreaded kernels dynamically, running a program in each which starts 4 additional threads. The main program then lets you interact with any of the 12 concurrent threads - you can ask each thread to exit, or just respond (by printing something), or join the thread (i.e. wait for it to complete and then print its return value), or stop the thread (i.e. terminate it with extreme prejudice), or ask the thread to move itself to another kernel.
Any failures in the operations (such as trying to "join" a thread that you have previously stopped) are reported.
For those interested in such things, when using the new 2-phase loader, the program itself occupies only 9.5kb of Hub RAM. The actual code size is about 6.5kb - including all the multi-threading support library code required. The remainder of the 9.5kb is mostly the 2kb required to hold a copy of the kernel itself (required because this program loads additional kernels at run time).
Who says the Prop I is unsuitable for C? I'd surely like to see an equivalent program coded in SPIN or PASM!
As usual, a binary for the Hydra is attached (this is an ordinary binary - i.e. it does not use the 2-phase loader).
Here is some typical output when the program is run:
Should work on any Hydra - no external RAM or SD card required.
It uses the Hydra TV output and keyboard.
Ross.