Communicating Sequential Processes by C.A.R. Hoare
prof_braino
Is anybody interested in discussing
Communicating Sequential Processes by C.A.R. Hoare
http://www.usingcsp.com/
http://www.usingcsp.com/cspbook.pdf
Sal Sanci (the PropForth guy) has advised that this would be a good resource for learning the groundwork for parallel distributed processing on the prop. It's from the olden times when the transputer was new; something about how it's not necessarily the bible on parallel programming, but it's very close.
The PropForth team might have the inter-prop communication channels implemented in the next couple of weeks, and I would like to make an actual application that uses multiple cogs on multiple props working in concert to perform a task in a distributed arrangement.
Although my work will mostly focus on PropForth, the techniques in the discussion should be applicable to assembler, Spin, and anything else.
The material is a little beyond me, so any input would be appreciated.
Comments
http://async.org.uk/async2008/async-nocs-slides/Wednesday/Keynote/DavidMay-Keynote1-NOCS.pdf
That should give you some idea of the difficulties.
Implementing channels efficiently on the Propeller will be the biggest problem. They really need to be implemented in hardware, as was done on the aforementioned devices.
Thanks for the link, Leon.
I don't have time to read all the threads; sorry that I did not run across yours. I am not familiar with many other microcontrollers, so I am not familiar with the chip you are talking about.
You are right, the hardware is probably unsuitable; we only got 5.5 thousand characters per second throughput on each of the eight application serial channels. But this is only for fun.
I think the key point from the presentation you linked might be on the last page: "Emphasis on process structure will replace emphasis on data structures". That sounds really fun! This presentation is a couple of years old; is that still the prediction being made?
Follow up David's affiliation on the first page of that presentation (it begins with X) for details of the chip I was referring to. Parallax doesn't like it to be mentioned.
I don't think many Prop users are clamoring for massively parallel systems, but if they were (and it's probably worth thinking about, since the technology is going to allow putting more and more on a chip with time), I'd probably stick with shared memory and follow the dominant direction of supercomputers: hierarchical memory, i.e. local memory, cluster memory, global memory. I think the main programming paradigm for supercomputers today is distributed shared memory, and that maps very well to hierarchical memory systems.
If you get to the point where there are so many CPUs in the system that you don't want to deal with them individually, I think you need to go to something like Linda, where you just define the job to be done and the system decides which processor is going to work on which part. Something like that also allows you to abstract away from the underlying architecture, so you don't have to write different programs for systems based on shared memory or communications channels. Of course you burn CPU cycles to make things easier for the programmer, but how far are we from that point? Like, "I don't want to explicitly program 20 different processors." Not far, really. I would think the people with multiProp systems are already there.
-phar
I'm not a big fan of CSP, but for those who are - or for anyone else who wants to experiment with parallel programming on the Propeller at a higher level of abstraction than is currently possible with either PASM or SPIN - I am in the process of adding multithreading capabilities to Catalina.
This is currently discussed (with examples) in http://forums.parallax.com/showthread.php?t=126253. I'll add a full description - and a new release of Catalina - shortly. For dedicated CSP-ers, such software emulation via multi-threading may not be ideal - but unless you want to buy up a stock of antique transputers, it is a very cost-effective way to get started with concurrent programming.
My intention with Catalina multi-threading is to take advantage of the Prop's existing concurrency capabilities as far as possible (i.e. each thread executing on its own cog) but also add simulated concurrency (i.e. multiple threads executing on the same cog) as seamlessly as possible for programs where 8 concurrent threads is simply not enough.
Catalina can currently create about 170 concurrent C threads in the 32 KB of a normal Propeller, executing across multiple cogs. I could increase this to around 200 with some internal kernel changes, but I am reluctant to do so, as it would break compatibility with existing code (such as the BlackCat debugger) and it doesn't seem worthwhile for just a few extra threads.
While Catalina's thread support is intentionally "lean and mean", it could also be used to re-create either a full POSIX pthreads library (described here: http://www.kernel.org/doc/man-pages/online/pages/man7/pthreads.7.html) or a more CSP-like libthread library (described here: http://plan9.bell-labs.com/magic/man2html/2/thread). The problem with such "heavy" threading libraries is that the Propeller is so limited in memory that they would leave you room for only a few dozen concurrent C threads - barely more than there are cogs!
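Just to make the comparison concrete, this is roughly what the POSIX style from that man page looks like in standard C. Note this is plain pthreads, not Catalina's own (much leaner) API - treat it as a sketch of the style rather than anything Propeller-specific:

    #include <pthread.h>
    #include <stdio.h>

    /* A trivial worker; pthreads passes one void* argument. */
    static void *worker(void *arg) {
        int id = *(int *)arg;
        printf("thread %d running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        int id[4];
        for (int i = 0; i < 4; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);   /* wait for every worker to finish */
        return 0;
    }

Every thread descriptor, stack and join adds overhead, which is exactly why a full pthreads port would eat so much of the Prop's 32 KB.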
As well as supporting the Prop I, I am targeting Catalina multithreading at the Prop II, where it should be possible to execute nearly 800 concurrent C threads - that should be enough for quite a few interesting projects in concurrency.
Ross.
I found the chip you mentioned. I know I found the correct one; your name appears on most of the discussion pages for it that I saw.
Do you know if they ever got tforth (transputer forth) ported to it? If it doesn't have a version of Forth already running on it, I can't use it for my experiments. I had thought to use tforth, but I could never find anything beyond discussion.
My thought is to take PropForth in the direction of parallel Forth: http://www.ultratechnology.com/4thpar.html
Up till now I have not had a pool of processors with a method of coordination; this should be available in a couple of weeks.
Have you looked at Jeff Fox's F*? Its primary advantage would seem to be that it presents a very simple model to the programmer. It has its limitations, but I believe the com ports on the chip which must not be named have support for direct memory access of other CPUs. I'm sure Leon will tell us if this is the case or not. If so, it would appear possible to turn F*'s shared memory inside out and directly access memory on other CPUs designated as global.
-phar
Actually, the discussion is still about CSP - in my case, with regard to PropForth.
We've just established that the other version of Forth most similar to the topic under discussion is not available on the other microcontroller, and so is not a suitable reference for this work.
With that out of the way, the discussion should return to the main topic of CSP on the prop.
Right guys? (@Leon)
@pharseid
Actually, I'm looking to go the route of parallel Forth:
http://www.ultratechnology.com/4thpar.html
This seems to be the furthest-along work of Fox et al.
Now that I have available to me the proper multi-core processor, and the inter-processor communications channels needed, I can finally get started on this.
Has anyone actually read the CSP material and want to discuss it?
-phar
Right away it starts getting into examples in LISP. LISP strikes me as needing more memory, not less; the list of languages for the prop talks about running out of memory after a few lines of LISP. I probably won't switch to LISP for this project.
I DO like the stuff about processes. Understanding and controlling processes is pretty much the key to everything, it seems.
I wonder what that means to us now? Was it too conservative, bogged down in an unsuitable paradigm; or was it thorough, methodical analysis, simply too far ahead of its time? Lately I've run across discussions of parallel processing, but this is about concurrent processing, which might be more useful. Some of the material looks like it could have been written with the Prop in mind. But then again, everything looks like a nail to a guy with a hammer; I tend to think of projects in terms of what's in my tool kit.
Life is a compromise, and so all ideal ideas are limited when it comes to a real implementation. But we can try to make the best of it. If we are not allowed to eat an apple, why not invent apple juice?
With the Propeller I found a tool that gave me back a part of my life, and I am still very enthusiastic, even without the P][. I have sometimes tried to propose a mechanism which allows tasks to be transparently distributed in a network of processors, even if these are located in different chips. I'm also willing to read Hoare's book, gain a better understanding of the mathematical and logical background, and find relations between the theory and a real, physical implementation. Further, I am designing a board, "coreprop", intended to put propellers together in a unified way, but I have to change the connectors, because they are not reliable and too expensive. So: very basic issues. It is always the horseshoe nail. And the time available. And the tools used.
So I am very interested in starting up such a project. And I don't even hesitate to use the X-word, because we should always do the best with what we have!
The goal here is not to parallelize per se; that is, I don't want to have a bunch of cores doing the exact same thing in lock step. Rather, the goal is to take a large problem, such as calculating a 1024x768 fractal, break it into subtasks, distribute the work over N cogs, and have them all work concurrently. The one that finishes first gets a new chunk to work on, and so on. Two cogs should be faster than one, ten should be faster than two, and twenty should be faster than ten, until we reach a point where adding more cogs no longer decreases wall-clock execution time. If this point is not reached on 8 cogs, maybe we would get Humanoido to try it on his tower or something.
This is what I have in mind.
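A rough sketch in C of what I mean, with pthreads standing in for cogs and the fractal constants picked arbitrarily (the names here are mine, not anything from PropForth):

    #include <pthread.h>
    #include <stdio.h>

    #define WIDTH    1024
    #define HEIGHT   768
    #define NWORKERS 8      /* stand-in for cogs */
    #define MAXITER  64

    static int next_row = 0;    /* the work pool: next unclaimed row */
    static pthread_mutex_t pool = PTHREAD_MUTEX_INITIALIZER;
    static unsigned char image[HEIGHT][WIDTH];

    /* Whoever finishes a row first grabs the next one - no lock step. */
    static void *worker(void *arg) {
        for (;;) {
            pthread_mutex_lock(&pool);
            int y = next_row++;
            pthread_mutex_unlock(&pool);
            if (y >= HEIGHT) return NULL;           /* pool exhausted */
            for (int x = 0; x < WIDTH; x++) {
                double cr = -2.0 + 3.0 * x / WIDTH; /* map pixel to plane */
                double ci = -1.2 + 2.4 * y / HEIGHT;
                double zr = 0, zi = 0;
                int i = 0;
                while (i < MAXITER && zr * zr + zi * zi < 4.0) {
                    double t = zr * zr - zi * zi + cr;
                    zi = 2 * zr * zi + ci;
                    zr = t;
                    i++;
                }
                image[y][x] = (unsigned char)i;
            }
        }
    }

    int main(void) {
        pthread_t t[NWORKERS];
        for (int i = 0; i < NWORKERS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++) pthread_join(t[i], NULL);
        printf("computed %d rows\n", HEIGHT);
        return 0;
    }

Vary NWORKERS and time it: the wall clock should drop as workers are added, until the pool bookkeeping (or memory bandwidth) becomes the bottleneck.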
I'm up to chapter two so far; the material does not read quite like L4D2's "The Sacrifice".
That is why I mentioned it to Chip ---> one side-step of my ideas:
very fast serialization possibilities in Prop II, and internal ports to communicate between COGs.
ErNa
-phar
Ok, we might have transparent channels. Right now it looks like accessing a cog on the next prop is the same as accessing a cog on this prop:
the multiprop version of executing a command on a neighbor cog looks just like the one-prop version, except the target cog is addressed over a serial channel.
There are eight channels, so there can be full duplex communication with 4 props, or eight half duplex channels to eight props, max (or I guess there could be eight parallel channels to one prop; I hadn't thought about that till just now). I'm sure we'll try experiments with a mix of both full and half duplex channels.
I'm curious about the special hardware statement. If we already have the channels, and don't claim a need to keep all the props synchronized beyond what the channels require (which is achieved by the standard crystal), do we still need special hardware, or can it be implemented in software?
Right now it looks like there might be a nice way to have eeprom-less slaves, wherein the master wakes up, notices it is master, assigns the slaves IDs, and uploads their executable images to them.
I need to get my head around the special considerations for data parallelism vs specialist parallelism vs agenda parallelism.
So data parallelism is the fractal: whatever the data, the same steps are applied. I think this is the easy one, at least to understand the concept.
Specialist parallelism is pipelining; so this is, for example, a particular unit in the bunch being assigned to handle a specific piece of hardware. This does sound applicable, as the I/O pins not used as communications channels are intended to be used as general I/O to devices, as on a single prop. So a bunch of eight props with all communications channels assigned would have 8 * 24 = 192 pins for devices. That should be enough for a little while. I'm thinking something like a GPS, compass, accelerometer, temp, humidity, photo-diode, SD card, extended RAM, VGA, keyboard, mouse, mic, audio, servos, an ESC, and whatever else I have in this box here, all hooked up at once on dedicated pins and running full speed. I don't know WHY I would do this, but it seems fun. Having specialist parallelism doesn't seem to necessarily contradict having data parallelism; there could be a pool of cogs processing data for a variety of dedicated hardware. Cool enough so far!
Now this agenda parallelism. I don't see the prop in this banking example, mostly because I don't see the application software I would write for it. Is agenda parallelism different from the other two in that the problem domain is not common to all tasks (did I say that right)? If many bank customers choose different transactions from the same menu, is this still agenda parallelism? Could you make another example for this one? I've confused myself. Or is this the type where we start thinking about heuristics and AI, etc.?
This is good discussion. Thanks for contributing this; it is what I was hoping for. I can feel my brain getting exercise.
If you have data and instruction as two independent items, and single and multiple as two properties, you can easily create this set:
single instruction, single data
single instruction, multiple data
multiple instruction, single data
multiple instruction, multiple data,
or, in a shorter representation: 00, 01, 10, 11.
As it seems the simplest task, science first looked at single instruction, multiple data. That is mostly pure number crunching - inverting matrices, vector operations, ... It just needs specialized hardware.
The propeller is a tool to solve 11-problems. As we are taught to serialize a parallel problem, because historically we only had serial computers, and as we are proud to be able to do this, we do not see that the more natural solution is processing in parallel. As in everyday life, we can divide our workload into small chunks and spend time organizing the work flow. That is exactly the way a scientist does research ;-))
OK, he does it as an example and searches for the best way, so necessarily there is a lot of nonsense done in gaining a result. No criticism.
I use the propeller to control a motor and run different tasks in parallel; one manages the display. So I have to pass parameters from task to task, and I do this in a non-ideal way, but always following one scheme: I do not call different processes with different numbers of parameters, but I start a process and pass it a list of addresses, where the number of elements is in the list. That works very well.
Now I have found that having access to the global memory via the physical address allows me to access single values, arrays, pointers, and pointers to arrays. That makes the software really flexible and powerful. But if you own such an instrument, it still takes exercise to keep the mechanism present and to have an overview.
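In C terms (the names are invented for the example; on the Propeller the task would be started with cognew and the addresses would be hub RAM locations), my scheme looks something like this:

    #include <stdio.h>

    /* Parameter block: the element count comes first, then the addresses.
       The started process only ever receives one pointer - to this block. */
    typedef struct {
        int   n;          /* number of entries in addr[] */
        void *addr[8];    /* pointers into "global" (hub) memory */
    } ParamList;

    static int speed;           /* variables owned by the main program */
    static int position[4];

    /* The task reads and writes only through the pointers it was given. */
    static void motor_task(ParamList *p) {
        int *spd = (int *)p->addr[0];
        int *pos = (int *)p->addr[1];
        *spd   = 100;
        pos[0] = 42;
        printf("task received %d parameters\n", p->n);
    }

    int main(void) {
        ParamList p = { 2, { &speed, position } };
        motor_task(&p);          /* on the Prop: cognew(...) with &p */
        return 0;
    }

So every process is started the same way, whatever the number of parameters; only the list changes.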
I have the strong feeling that, knowing the physical address of a spin variable in global memory, it should even be possible to exchange data from spin to spin without using a global read/write. That may be confusing: local access to global memory?
How is that? Normally we still have a main spin program, and this program starts cogs and passes pointers to variables to them. So the started process knows the address of the variable in the main program, the "global" memory. If a process wants to hold a "local" copy of that variable, and more of them are running, then they each have such local copies and cannot directly access the others' copies, except by using the global data share.
Now, eight processes is a very limited number for a real-life problem; e.g. for home automation we need a distributed system. But I do not want to change the software when the configuration changes. It needs some networking without a multi-level protocol to connect processes. That is surely not simple, but worth thinking about.
By the way: since I see your avatar continuously, I have trouble ordering a pizza frutti di mare. Don't you have a better alternative?
ErNa
There seems to be some confusion going on here between hardware and software/algorithmic aspects of parallelism.
Down at the hardware level we have such ideas as SIMD and MIMD, as outlined by ErNa. These are really aspects of processor design.
Do we operate on one pair of integers/floats at a time with one instruction, as in ADD X, Y, as on a normal CPU?
Or do we operate on two, four, eight... pairs of operands simultaneously, as in Intel's SSE instructions (http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions)?
Or do we operate on large arrays of operands in one go, as in vector processors (http://en.wikipedia.org/wiki/Vector_processor)?
At a lower level there is even pipelining: parallel execution of the different instruction phases - fetch, decode, operate, write result, and so on.
Then there can be hardware support for shared memory systems or support for channel communications between processes.
Meanwhile, up at the more abstract software/algorithm level, there are different ways of breaking up problems for parallelism.
Do you take the CSP approach - no shared memory between processes, and channel communications between them? Or do you take the shared memory approach?
CSP itself is somewhat abstract: a way of looking at algorithms. By removing the possibility of shared memory, it makes it possible to reason about algorithms more easily. It becomes possible to transform algorithms like you can transform mathematical formulae.
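For a flavour of that, the book gives algebraic laws for its operators. From memory they go something like this, in the book's notation (\to is event prefixing, \parallel is parallel composition of processes sharing an alphabet):

    P \parallel Q = Q \parallel P
    (P \parallel Q) \parallel R = P \parallel (Q \parallel R)
    (c \to P) \parallel (c \to Q) = c \to (P \parallel Q)

That last one is the interesting one: two processes that both want to engage in event c synchronize on it, and a whole concurrent design can be simplified on paper before any code is written.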
The terms "specialist parallelism", "pipelining, "agenda parallelism" used at the higher abstract level of thinking about and implementing a parallel algorithms to problems. These basic paradigms for parallelism are defined nicely here[FONT=Times New Roman, Times, serif][/FONT][FONT=Times New Roman, Times, serif][/FONT]http://www.lindaspaces.com/book/chap2.htm.
(Although they use "result parallelism", "agenda parallelism" and "specialist parallelism")
These "basic paradigms" are applicable to all software problems whether they are actually running on hardware that supports parallelism or not. And perhaps without regard to whether the hardware supports shared memory or processes and channels.
I see no reason why result, agenda or specialist approaches are not all applicable to the Propeller (or multiple Props.)
OK, the propeller brings a different experience to all the people, and so there is hope!
But let's go on. I MUST have something to present at the next UPEW.
Nothing hard about "parallel". In the context of this discussion it's just "doing more than one thing at any given time". Of course that is a very general statement and applies at all levels of detail, from the flip-flops that make up the storage cells in your CPU registers to the Google search engine farm. Pitching it at the right level for any given discussion might be a bit more tricky.
I wonder what they meant and what they thought "true parallel" means. Clearly a single Transputer is only operating sequentially at the program level, even if your program contains many threads (say, using PAR in Occam). But that's not the point: the Transputer, for the first time, made it trivially easy to hook up many processors that would then operate in parallel. The Occam language made it easy to write a single program that would be executed on many processors. When you have a hundred CPUs all munching away on the Mandelbrot set, or whatever, from a single program, how could they say it is not "true parallel"?
No. One can "wait and do something else in between" without CSP. In a single processor shared memory system with a multitasking operating system it happens all the time.
CSP is a bit more rigorous. If it were not there would be no point in having it. For example: It introduces the rule that a processes is not allowed to access the data space used by any other process. No matter if those processes are executing on a sequential machine with a scheduler or on multiple CPUs with shared memory or on totally separate CPU/memory systems.
Why make such a rule?
Consider the evolution of programming languages...
Originally we had assembler and primitive languages like BASIC. They made programming easier and had few restrictions. All variables/data were probably global, meaning that any part of the program could read/write any part of the data. There was JMP/GOTO, such that all points in your program were essentially global too, meaning that at any point in the code it was possible to transfer control to any other point in the code.
This enables one to get programs written more easily, but without care it leads to chaos and makes it very hard to reason about a program or test whether it works or not.
For example, say I write a subroutine "A" which gets input, provides output, and uses its own data on the way. I can read it and think it probably works. I can test it and find that it does work for a lot of test cases. But now I write a function "B" that may use the same data as "A". Now there is a problem: if the combined system sometimes fails because the data is messed up, is it "A" or "B" going wrong?
When we get to routines "C", "D", "E", etc., this becomes impossible to work with. As a result, languages introduced the idea of local variables, the idea that functions should not have side effects, the idea that global data is not good, and so on.
Similarly, GOTO is frowned upon. It's very useful, but it turns out it's better to partition your code into isolated chunks that you can't just jump into or out of at any place. This makes code more easily relied upon, reasoned about, tested, and reused.
CSP carries on in that tradition of organizing software design by introducing some more simple restrictions for the situation where things can be done in parallel and on multiple processors.
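To make that concrete, here is a toy rendezvous channel in C, with pthreads standing in for processes (a sketch of the idea, not how the Transputer or any CSP library actually implements it). The only thing the two threads touch in common is the channel itself:

    #include <pthread.h>
    #include <stdio.h>

    /* One-slot synchronous channel: send blocks until the value is taken. */
    typedef struct {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        int value;
        int full;
    } Chan;

    static void chan_send(Chan *c, int v) {
        pthread_mutex_lock(&c->m);
        while (c->full) pthread_cond_wait(&c->cv, &c->m);
        c->value = v;
        c->full  = 1;
        pthread_cond_broadcast(&c->cv);
        /* wait until the receiver has taken it - the rendezvous */
        while (c->full) pthread_cond_wait(&c->cv, &c->m);
        pthread_mutex_unlock(&c->m);
    }

    static int chan_recv(Chan *c) {
        pthread_mutex_lock(&c->m);
        while (!c->full) pthread_cond_wait(&c->cv, &c->m);
        int v = c->value;
        c->full = 0;
        pthread_cond_broadcast(&c->cv);
        pthread_mutex_unlock(&c->m);
        return v;
    }

    static Chan ch = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

    static void *producer(void *arg) {
        for (int i = 0; i < 5; i++)
            chan_send(&ch, i);     /* no shared data, just the channel */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        for (int i = 0; i < 5; i++) printf("got %d\n", chan_recv(&ch));
        pthread_join(t, NULL);
        return 0;
    }

Because neither thread reads or writes the other's data, you can reason about each one in isolation - which is the whole point of the rule.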
Is that really true? I have no idea what they teach in CS departments nowadays. I did not do CS, but back in the 70's my friends who did were well into getting stuff to happen in parallel; that's what operating systems have been about for a long time. The problem has always been one of hardware availability: computers were big and expensive. It's only relatively recently that we have the luxury of multi-core PCs, multi-core microcontrollers (thank you, Parallax) and the means to build parallel systems for the masses of software engineers to tinker with.
Hi Sapieha
If this is something you are working on, would you be willing to give feedback on the implementation when it is available? The earliest would be a couple of weeks.
Of course, it's only on the Prop 1 and is only several single serial channels between separate props. But I would be interested to know how close it comes to achieving your target and how it differs from your concept.
The Prop I has very fast serialization output possibilities - that is the video hardware. Sadly it does not have any corresponding fast input serialization.
So the idea is that the Prop II will have more flexible output serializer hardware and the matching input serialization. We are talking about shifting bits in and out at the chip's clock frequency.
As this is on a COG-by-COG basis, it opens the possibility of fast serial channels between COGs on a Propeller, or from COGs on one Propeller to COGs on another.
So then, run out of COGs? No problem, just slap on another Prop chip and hook up the serial links. All of a sudden it does not matter if that FullDuplexSerial or Mandelbrot set process is on this Prop or some other one.
Over the years it has struck me that that is an excellent way of looking at parallelism, because we have proven examples of hardware/software specializing in each type. The most familiar example of data-parallel hardware is probably the MMX instruction set. SIMD parallelism on a grander scale was employed in the massively parallel SIMD machines of the late 80's, like the Massively Parallel Processor and the Connection Machine. The languages on those machines would seem to meet ErNa's wish to be able to perform the same operation on many data items with one instruction, but they do not handle the other types of parallelism well. There was an article in The Journal of Forth Application and Research on MPP Forth; I would imagine Professor Braino would find that interesting. I met a guy who programmed in MPP Forth on NASA's MPP years ago; he got me interested in SIMD machines and I eventually built a simple SIMD machine with PLDs and memory chips. One of my bucket-list items is to learn to program FPGAs and build a bigger SIMD machine someday.
The Propeller/Spin is a good example of an architecture for implementing specialist parallelism. You create specific tasks and assign them to specific cogs. You don't have the simplicity of dealing with a processor array as a single unit like a SIMD machine, but it's much more flexible.
Finally, as an example of an agenda machine, you have Linda and the various architectures it runs on. It eases the burden on the programmer by removing the requirement of delegating tasks to specific processors: you just dump data and tasks into tuple space and let the system sort out what does what. As I mentioned earlier, if we get massively parallel MIMD machines, I think you need something like this. But how do you handle situations where you want to use hardware available on a particular CPU, like a specific I/O port?
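A toy version of that idea in C might look like this - a shared bag of heterogeneous tasks, with the bank-transaction kinds made up for the example (real Linda matches tuples with in/rd/out, which this doesn't attempt):

    #include <pthread.h>
    #include <stdio.h>

    /* The "bag": any worker may take any task, whatever its kind. */
    typedef struct { int kind; int amount; } Task;

    static Task bag[6] = {
        {0, 10}, {1, 20}, {0, 30}, {1, 40}, {0, 50}, {1, 60}
    };
    static int next_task = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *id) {
        for (;;) {
            pthread_mutex_lock(&m);
            int i = next_task++;
            pthread_mutex_unlock(&m);
            if (i >= 6) return NULL;
            /* dispatch on task kind - the agenda, not the processor, decides */
            if (bag[i].kind == 0)
                printf("worker %ld: deposit  %d\n", (long)id, bag[i].amount);
            else
                printf("worker %ld: withdraw %d\n", (long)id, bag[i].amount);
        }
    }

    int main(void) {
        pthread_t t[3];
        for (long i = 0; i < 3; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
        return 0;
    }

Notice no task is assigned to a particular worker; the mapping falls out at run time, which is what would let the same program run on 3 workers or 300.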
Of course, a language running on any MIMD machine should be able to utilize all of the software constructs of all these examples, but would that make it the ultimate parallel programming language, or a bloated monstrosity that encourages programmers to write in styles that run poorly on their particular machines? I've thought about this, but trying to implement it isn't even on my bucket list (maybe in my next incarnation?)
-phar