ParaSail: Parallel multicore programming language
jmg
["ParaSail: Less is more with multicore
Parallel multicore programming should be easier than it is. What makes it such a struggle? Most programming languages were designed without parallel programming in mind. "]
http://www.eetimes.com/design/embedded/4375616/ParaSail--Less-is-more-with-multicore
http://parasail-programming-language.blogspot.com/
Could be interesting to see if this scales to a Prop 1 / Prop 2?
Comments
http://www.cs.cmu.edu/~scandal/nesl.html
The main emphasis in the design of NESL was to make parallel programming easy and portable. Algorithms are typically significantly more concise in NESL than in most other parallel programming languages. Furthermore, the code closely resembles high-level pseudocode.
The Propeller chip already has a number of parallel algorithms that can serve as a foundation for constructing a parallel language with the features mentioned in the articles. As time progresses, the majority of chips will have increasing numbers of parallel cores, and parallel languages will be very useful in extracting the power that resides in them.
That's true of what I'd call a large system implementation.
A prop version does not need to do that, and can be similar to the GCC choice, where small code pieces can compile to PASM.
The Prop 2 will allow faster reloads, should the compiler/user decide that is a good idea.
( Quad SPI memory now comes at DDR rates )
You would of course try to avoid crossing a reload boundary in any inner code loop, but it is not that big a hit in the outer flows.
A language that helps the compiler make these choices seems smarter than trying to fudge C into the task?
My thesis was based on the unspecified evaluation order of commas and semicolons. Some compilers build code for function arguments from left to right, others from right to left. I suggested that one could use commas specifically for parallelism and semicolons for sequential operations.
The brackets would outline the scope of the processes.
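As a purely illustrative sketch of that idea, here is roughly how a compiler might lower the hypothetical "comma means parallel" rule, written in C with pthreads. The names a, b, and f are invented; in the imagined language, f(a(), b()) would evaluate its comma-separated arguments concurrently, with the closing bracket as the join point.

    /* Illustrative only: manual lowering of "commas are parallel".
     * All function names here are invented for the example. */
    #include <pthread.h>
    #include <stdio.h>

    static int a(void) { return 6; }             /* first argument expression  */
    static int b(void) { return 7; }             /* second argument expression */
    static int f(int x, int y) { return x * y; }

    /* thread trampoline: run an int-returning function, keep the result */
    struct eval { int (*fn)(void); int result; };

    static void *run(void *arg)
    {
        struct eval *e = arg;
        e->result = e->fn();
        return NULL;
    }

    int main(void)
    {
        struct eval ea = { a, 0 }, eb = { b, 0 };
        pthread_t ta, tb;

        /* comma-separated arguments: evaluate in parallel */
        pthread_create(&ta, NULL, run, &ea);
        pthread_create(&tb, NULL, run, &eb);

        /* the closing bracket of the call is the join point */
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);

        printf("%d\n", f(ea.result, eb.result)); /* prints 42 */
        return 0;
    }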
Perry
The ParaSail run-time system uses the work stealing model (similar to what is used by Cilk and Intel's Threaded Building Blocks -- TBB), where the computation is split up into extremely light-weight threads ("picothreads", "sparks", etc.), and then there are a much smaller number of "worker" processes, roughly one per physical core, which each have their own double-ended queue of picothreads. When a picothread is created, it is added to the end of the queue of the worker process that created it. Each worker process serves its own queue in a LIFO manner, picking up the picothread it most recently created. When a worker process runs out of picothreads of its own to serve, it steals a picothread from some other worker. But when it steals a picothread, it takes the oldest picothread on the queue, i.e. it is using a FIFO discipline (hence the double-ended nature of each worker's queue).
The net effect of the work-stealing approach is that parts of a computation are only moved to a new core when there is an idle core, and in that case, there is an attempt to pick a picothread that has been languishing on its queue, and hence probably much of its important data are already out of the cache. So in many cases, the only overhead associated with breaking a computation up into picothreads is, on average, the overhead of queuing and dequeuing. Picothreads don't have their own stack. Furthermore, the compiler does not even create a picothread for every potentially parallel sub-computation. The compiler uses a heuristic to determine whether the sub-computation is sufficiently large to justify the queuing overhead. The current ParaSail compiler uses a very simple heuristic, which will evolve as we learn more. The current heuristic is that a picothread shouldn't be spawned unless it involves at least one out-of-line function call, or one loop, as opposed to simply a sequence of built-in operator evaluations.
A web search for "work stealing" will produce more detail on this topic.
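For the curious, a minimal sketch of that discipline in C, with invented names: one double-ended queue per worker, push and pop at the tail (LIFO for the owner), steal from the head (FIFO for thieves). A single mutex per deque keeps the sketch short; real runtimes such as Cilk and TBB use lock-free deques, and this toy version does no wraparound or overflow checking.

    #include <pthread.h>

    #define QUEUE_CAP   256
    #define NUM_WORKERS 8

    typedef void (*picothread_fn)(void *);

    struct deque {
        pthread_mutex_t lock;
        picothread_fn fn[QUEUE_CAP];
        void *arg[QUEUE_CAP];
        int head, tail;              /* steal at head, push/pop at tail */
    };

    static struct deque workers[NUM_WORKERS];

    void init_workers(void)
    {
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_mutex_init(&workers[i].lock, NULL);
    }

    /* a new picothread goes on the tail of the creating worker's deque */
    void spawn(struct deque *own, picothread_fn fn, void *arg)
    {
        pthread_mutex_lock(&own->lock);
        own->fn[own->tail]  = fn;
        own->arg[own->tail] = arg;
        own->tail++;
        pthread_mutex_unlock(&own->lock);
    }

    /* LIFO: a worker serves its own deque newest-first (best cache reuse) */
    int pop_own(struct deque *own, picothread_fn *fn, void **arg)
    {
        int ok = 0;
        pthread_mutex_lock(&own->lock);
        if (own->tail > own->head) {
            own->tail--;
            *fn  = own->fn[own->tail];
            *arg = own->arg[own->tail];
            ok = 1;
        }
        pthread_mutex_unlock(&own->lock);
        return ok;
    }

    /* FIFO: an idle worker steals the OLDEST picothread from a victim,
     * whose data has likely left the victim's cache anyway */
    int steal(struct deque *victim, picothread_fn *fn, void **arg)
    {
        int ok = 0;
        pthread_mutex_lock(&victim->lock);
        if (victim->tail > victim->head) {
            *fn  = victim->fn[victim->head];
            *arg = victim->arg[victim->head];
            victim->head++;
            ok = 1;
        }
        pthread_mutex_unlock(&victim->lock);
        return ok;
    }

A worker's main loop then just alternates: pop_own until its own deque is empty, then try steal on each other worker in turn.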
With the little I know of Ada (and ParaSail looks a lot like an Ada extension), I get the impression that it's for much larger systems, and that programs written in it have enough overhead to not easily fit on something like the Prop. Since there's 32K of global RAM, and each cog has its own RAM (sort of cache-ish), it would seem that you wouldn't get much out of trying to optimize for cache hits/misses, as you imply in your second paragraph.
No, I haven't played with the propeller chip. It looks interesting. I am tempted to order a development kit.
ParaSail is much simpler than Ada, and in fact, it is primarily interesting (in my view) because of how much it leaves out relative to most mainstream languages, as the referenced article explains. In particular, ParaSail has no pointers, no global variables, no global garbage-collected heap, no run-time exception handling, no parameter aliasing, etc. So it should be amenable to small systems, because it doesn't need much in the way of run-time support.
Sorry if I over-simplified ParaSail. I'm still not certain that the Propeller is a good target. The basic structure of the Propeller is eight 32-bit processors (cogs), each with 496 32-bit words of local RAM (code and data), plus 32K bytes of shared RAM (accessed non-blocking, round-robin). The main limitation for parallel processing is that all code has to be copied from shared RAM to local RAM for execution - a process which requires ~8K clock cycles (~2K instruction cycles) **. So the usual style is to initialize each cog with a driver which then reacts to parameters passed via shared RAM.
** There are ways to execute code from shared memory by having the cog execute an interpreter which executes either a byte code (SPIN, FORTH) or native instructions (Large Memory Model). While allowing for larger programs, both are at least 4 times slower than native code. It's also possible to execute code from external memory, but this is slower still and much more complex as the Propeller does not have a native memory bus.
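To make the "driver reacting to parameters in shared RAM" style concrete, here is a minimal sketch in generic C. The hub_mailbox layout and the request codes are invented for illustration; a real driver would typically be PASM polling a hub address.

    #include <stdint.h>

    struct hub_mailbox {              /* lives in the 32K shared hub RAM  */
        volatile uint32_t cmd;        /* 0 = idle, nonzero = request code */
        volatile uint32_t param;      /* input parameter for the request  */
        volatile uint32_t result;     /* written back by the cog          */
    };

    /* runs forever inside one cog after the single expensive cog load */
    void driver_loop(struct hub_mailbox *box)
    {
        for (;;) {
            while (box->cmd == 0)     /* spin until the host posts work   */
                ;
            switch (box->cmd) {
            case 1:                   /* example request: square a number */
                box->result = box->param * box->param;
                break;
            /* ... other request codes ... */
            }
            box->cmd = 0;             /* mark mailbox idle: result ready  */
        }
    }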
Interesting model. It is clearly designed for small programs with code and data being moved between cogs as rarely as possible. Some kind of "work stealing" model as used by ParaSail might still be a nice way to get automatic load balancing, but clearly the heavy overhead of "stealing" means that you want to "steal" a sub-computation as a very last resort, and whatever you steal should be a relatively hefty sub-computation in its own right. In some ways the propeller seems to have "software" caching, with very expensive cache misses (though these days, cache misses are pretty expensive with hardware caches as well).
The original idea was that certain functions (like the VGA driver) consume most of the resources on the prop, so just add more props to get more pins, cogs, and hub RAM. It totally works great, so I think it might be possible to distribute algorithms over the multiple cogs. It is very inexpensive and simple to get a rig with multiple props; one protoboard and six bare prop chips would yield (cypherin'.. please stand by...) 56 cogs, with 35 usable by the application(s). One could easily control a boatload of servos, but I want to see if we can make an integer-only "baby supercomputer" just to play with parallel programming techniques. Nobody would ever use it for anything of consequence, but it would be fun.
Can you show us how to get started on setting up the parallel algorithms?
As far as I can see, as explained by sttaft above, ParaSail and "work stealing" are based on a fairly simple idea.
1) You have a CPU with many cores sharing RAM.
2) You have a programming language that makes it easier for the compiler to identify operations that can be done in parallel. For example, a loop calculating the square of a million numbers might be split by the compiler into 8 small jobs, each a loop of 125,000 square operations (see the sketch after this list).
3) Those small jobs can be "stolen" by other CPUs, when they are free, so as to speed up the calculation.
Simple idea, I'm sure not so easy to implement.
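As promised above, a literal rendering of step 2 in C with pthreads: the million-squares loop split by hand into 8 chunks of 125,000, one per worker. All names are invented, and the splitting that a compiler would do is done manually here.

    #include <pthread.h>
    #include <stdint.h>

    #define N       1000000
    #define WORKERS 8
    #define CHUNK   (N / WORKERS)

    static uint64_t in[N], out[N];

    struct job { int first; };        /* start index of this worker's chunk */

    static void *square_chunk(void *arg)
    {
        int first = ((struct job *)arg)->first;
        for (int i = first; i < first + CHUNK; i++)
            out[i] = in[i] * in[i];
        return NULL;
    }

    int main(void)
    {
        pthread_t t[WORKERS];
        struct job jobs[WORKERS];

        for (int i = 0; i < N; i++)   /* some input data */
            in[i] = i;

        for (int w = 0; w < WORKERS; w++) {   /* fork: one job per core */
            jobs[w].first = w * CHUNK;
            pthread_create(&t[w], NULL, square_chunk, &jobs[w]);
        }
        for (int w = 0; w < WORKERS; w++)     /* join: wait for all */
            pthread_join(t[w], NULL);
        return 0;
    }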
Now, this model does not seem so suitable for the case of many processors each with their own RAM (no shared RAM). There, CSP seems to be our best bet.
This sounds very interesting. You can read more about "work stealing" on the web to understand the basic idea. Intel's Cilk language, and their Threaded Building Blocks (TBBs) both use it. You do need to break your program up into small pieces (picothreads), each of which can pretty much run to completion. The ParaSail compiler does this for you automatically. The Cilk compiler does this at points where you write "spawn." With TBB, you do it yourself. Then you have one process per core/cog which services a queue of these picothreads, using LIFO when serving its own queue, and using FIFO when stealing a picothread from some other queue. You would need to keep track of which bits of code are already in the local RAM of the given cog, and when stealing, you would presumably make an effort to steal a picothread whose code is already loaded. If the code is not loaded, you would have to transfer it first. But since this core/cog was otherwise going to be idle, presumably this is not wasted effort.
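A rough sketch of that steal policy in C: scan the victim's queue oldest-first, preferring a picothread whose code is already resident in the stealing cog, and pay the load cost only when nothing resident is queued. The list structure and the two extern helpers are invented placeholders.

    #include <stddef.h>

    struct pico {
        struct pico *next;
        int code_id;                  /* which code blob this picothread runs */
    };

    /* assumed helpers: residency check and the expensive hub->cog copy */
    extern int  code_is_resident(int cog, int code_id);
    extern void load_code(int cog, int code_id);

    /* scan a victim's FIFO queue oldest-first, preferring resident code */
    struct pico *steal_for(int cog, struct pico **victim_head)
    {
        struct pico **pp, *p;

        for (pp = victim_head; (p = *pp) != NULL; pp = &p->next)
            if (code_is_resident(cog, p->code_id)) {
                *pp = p->next;        /* unlink: steal this one */
                return p;
            }

        if ((p = *victim_head) != NULL) {
            /* nothing resident: take the oldest and eat the load cost,
             * since this cog was going to be idle anyway */
            *victim_head = p->next;
            load_code(cog, p->code_id);
        }
        return p;                     /* NULL if the victim's queue is empty */
    }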
In any of these models - picothreads, Go routines, co-routines, multi-tasking, etc. - you always strive to avoid the context-switching mechanics or setup/teardown time of a new execution entity. Want to be able to handle real-time interrupts? Put in a second set of registers to handle the interrupts - boom, no context switch to go from user to executive state, just flip a bit. Want to avoid OS threads? Build picothreads or Go routines as lightweight threads on top of OS threads - boom, no thread startup delays. Want to avoid COG reloads? Find a way to put primitive routines into COGs and queue requests up to the proper COG - boom, soft peripherals (or software engines) with I/O queues.
I think that's one of the reasons implementations like PropForth and the LMM kernels do so well on the Prop. Your COG running the kernel (interpreter) is self-contained and is able to execute code, pass messages, and context switch rather quickly, without the COG reload time.
"Work stealing" on the Propeller - sure, you just need to hand craft the code to have COGSfull of useful functions and a way to dispatch work. The difficulties will be breaking up the problem space into work packages that best use the resources. Does the overhead in parsing the problem into packages and then the distribution of the work packages best utilize the resources?? That's where things get interesting. The classic trade off of general purpose against tailored/dedicated use.
The Propeller does very well with the soft-peripheral and interpreter models. The stumbling block is the shared program space that most of the parallel models rely upon. Run-time wrappers, large flat memory models, garbage collection, etc. are all things these languages rely on for their success. It's hard to do a lot of that in a 32K microcontroller.
CSP and Forth are getting exciting because of the kernel structure and the way Forth works. With a common dictionary image, you can pass a work unit (a single Forth word, or an entire Forth application to be compiled and run remotely) to a new COG in the same Propeller or in a remote Propeller. As long as the underlying dictionary supports the vocabulary and words used in the application, it can run and return results if needed. That's pretty cool, and very little additional overhead to get started.
How about a PC based Forth that sees Propellers as little collections of remote execution units it can pass/spawn/co-routine words to and have them either run as peripherals or perform a task and complete a result? Soft Propeller Peripherals attached to your PC through a channel you don't care to know the implementation of?
The hard part is going to come back to how you break your problem up into work units that can be run in parallel (or appear to run in parallel at some abstraction layer). It's like asking how you break those batch jobs up into scheduling units to make the best use of your tape drives. How do you minimize the idle cycles on a multi-million-dollar mainframe? How can you share that mainframe so it looks like each user has their own computer? How can you best utilize that mini-computer so the CPU isn't idle while some task is waiting on I/O? How can you load that cloud up with virtual machines so everyone thinks they have their own server?
Just to check my understanding of this.
To use an analogy, this is how 'just in time' works for factories. All the parts of the product are made in parallel and arrive 'just in time' to be fitted to the end product, whatever that is. These systems have a manager that coordinates things, because what you don't want is 'just too late', with some process kept waiting.
I worked for a while for a chocolate manufacturer who had their boxes and labels made by third-party companies and delivered on a JIT system. Two things became apparent to me: one, the chocolate manufacturer had effectively moved their stock onto a third party in the form of buffer stock; two, when the labels changed they had a management problem in ensuring that the old labels all got used before the cut-in date of the new labels, which could also mean making sure they had sold all the old stock too.
If we are to break problems down into smaller chunks and pass them out to other processes, then there has to be some sort of scheduler that gets them back in time to be used, or the parent process could be kept waiting. I can see how, in Forth, multiple processes could carry on using a common dictionary and split stacks, but then you have the dependencies to resolve. And how do you merge the stacks back together?
Most of what I am saying here is about the hidden problems of parallel processing. One has to be very sure about what gets broken down, how it is broken down, and how it is brought back together; timing is everything.
Michael
While a miss would be expensive if a COG (core) had to be reloaded entirely, a COG running a monitor can also load and run small chunks of code (overlays). Everything is relative, of course; having to use an overlay loader at all will be much slower than using a multi-core, multi-threaded CPU tuned for the task (a totally different animal).
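A rough sketch of that overlay idea in generic C: a small resident monitor copies a chunk of code from hub RAM into a local slot and jumps to it. On a real cog this would be PASM, and treating copied data as code, as done here, assumes position-independent code and no separate instruction cache.

    #include <string.h>
    #include <stddef.h>
    #include <stdint.h>

    #define OVERLAY_WORDS 128                    /* slot within the 496-word cog RAM */

    static uint32_t overlay_slot[OVERLAY_WORDS]; /* local (cog) RAM */

    typedef void (*overlay_fn)(void);

    void run_overlay(const uint32_t *hub_code, size_t words)
    {
        /* the expensive part: copy the chunk from shared hub RAM */
        memcpy(overlay_slot, hub_code, words * sizeof(uint32_t));

        /* enter the freshly loaded chunk (a cast standard C frowns on,
         * but it conveys the overlay mechanism) */
        ((overlay_fn)(void *)overlay_slot)();
    }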