My idea of serialization was not inspired by the Transputer, as I never worked with it.
My ideas come from a system I designed in the old days to synchronize two I8085 computers I used to control some things.
The I8085 had a serial I/O port built in. It was not very fast, but at the time I used it, it was good enough for me, and the system needed some syncing to work on common information.
PS. You can think of that syncing as parallel processing, with one system controlling the functioning of the other.
I think both Sapieha and I have suggested this from time to time. Both inspired by the Transputer I guess.
The Prop I has very fast serialization output possibilities. That is the video hardware. Sadly it does not have any corresponding fast input serialization.
So the idea is that the Prop II should have more flexible output serializer hardware and the matching input serialization. We are talking about shifting bits in and out at the chip's clock frequency.
As this is on a COG-by-COG basis, it opens the possibility of fast serial channels between COGs on one Propeller, or from COGs on one Propeller to COGs on another.
So then, run out of COGs? No problem, just slap on another Prop chip and hook up the serial links. All of a sudden it does not matter whether that FullDuplexSerial or Mandelbrot-set process is on this Prop or some other one.
There are so many activities under way and so many projects started that it takes a parallelized brain to follow ;-) My vision is a distributed system of Propellers, where each one can have different features and abilities, and an enumeration to identify them. Then there has to be some kind of broadcasting to discover what the network looks like, who can do what, and who is willing to do it. And therefore we first need a simple network protocol that is not bloated, but just delivers what is needed. I am short of time, and it's the chicken-and-egg problem: build the right tools first, or use them in advance.
My comment about accessing I/O on a particular cpu was referring to Linda and not MIMD systems in general. My perception is that MIMD systems which are optimized for supporting specialist parallelism are actually good at this sort of thing. Linda, which is the only example which I would consider optimized for agenda parallelism, doesn't appear to have any facilities for doing this at all. But I have no hands-on experience with it.
I never used Occam to program my own little transputer network, so I may have missed out on a lot. I used a version of C with CSP-type extensions, which allowed me to just cut and paste huge chunks of C with little or no editing. I may have mentioned this before, but it surprised me how well that went.
Fitting this into the classification scheme I mentioned, I would view it as optimized for specialist parallelism, statically mapping tasks to processors. To compare it with an efficient system for agenda parallelism, where you have to be able to run any task on any processor dynamically, you either have to have a lot more program memory or an instruction cache on a shared memory system or virtual memory on a distributed memory system. So I think a processor optimal for an agenda system requires more silicon real estate than one optimal for specialist parallelism, which requires more real estate than a processing element on a SIMD system.
-phar
The Prop I has very fast serialization output possibilities. That is the video hardware. Sadly it does not have any corresponding fast input serialization.
So the idea is that the Prop II should have more flexible output serializer hardware and the matching input serialization. We are talking about shifting bits in and out at the chip's clock frequency.
As this is on a COG-by-COG basis, it opens the possibility of fast serial channels between COGs on one Propeller, or from COGs on one Propeller to COGs on another.
So, is this different from Beau Schwabe's technique of using the counters to send data between cogs on different Props (or even the same Prop, I guess) at clock speed? Isn't this already present on the Prop 1? Or is there some method to go faster using the video capabilities?
My experiments concern using the counters as inter-prop channels.
To really understand the classification system I mentioned, you should go to the original paper I got this from, "How to write parallel programs: a guide to the perplexed" by David Gelernter.
http://www.matematicas.unam.mx/jloa/Articulos/carriero.pdf
-phar
Thank you, green enough! Now it is more of a pleasure to read your contribution. The communication mechanism is quick-and-dirty by design and surely not as well reasoned as CSP is. I now have 8 cogs in use; one creates the user interface (I use Ariba's PropTerminal). Now I have run out of cogs and memory and wish to move the GUI to a separate chip. That means I have to transfer content from A to B, but I don't want to change the software, just replace the screen process with a transport-to-screen process. And the process should not be specific, but universal.
I just read the Prop 2 preliminary feature list thread. If the Prop 2 supports SDRAM, my thought is you could create an interface on an FPGA to make it look like an SDRAM and do high-speed communication that way. For ambitious projects like Humanoido's, I would give each Prop an FPGA and do communication between FPGAs with one of the high-speed I/O protocols. That would certainly seem to address a lot of issues, speed and topology for instance. If you stuck some intelligence in there (maybe one of Heater's ZPUs), you would add all sorts of possibilities: organizing the multiplier/accumulators as vector processors, or using some hardware/software combination as a memory management unit to support schemes like Distributed Shared Memory and Linda. That would keep the Prop Supercomputer people busy for a while.
As I understand it, CSP lives on parallel processes. And a lot of them! The Propeller has only a limited number of cogs. So the question is: can we have a system that can run processes in Spin (global memory of one chip), in ASM (local memory in a COG), or even on different chips, and that has a timesharing ability to run several processes on one processor? (OK, I do not insist on running Spin processes and ASM processes in one cog.)
The question "what can a Propeller do?" is, in my opinion, not the right one. There is a higher level, where we decide what to do, and a lower one, where we decide how best to do it.
I might be getting off the stated topic of the original post; I could start another topic if you folks feel the following is a separate discussion.
I just got the beta of PropForth 3.6, which includes 8-channel high-speed (relatively speaking) communication between props.
By design, one master prop contains the boot image in ROM. The master notices this status and sets itself up accordingly. The master then uploads the same image to the slaves. The slaves notice the master and configure themselves accordingly. Each slave may have seven slaves of its own. There no longer needs to be a limit on the number of cores in an array.
While up to eight I/O pins can be consumed for communication channels, the remaining pins on a prop are still available for interfacing to devices. The logical association of each device's function may affect the choice of placement of devices on physical props.
The communication speed is about 5.5 thousand application characters per second (roughly 44 kbit/s of payload). The raw channel speed is about 4 Mbit/s; the protocol consumes most of the time.
When a full-duplex channel is set up (using two of the eight possible channels), the channel begins continuously sending 96-bit packets. When a cog has something to transmit, it plops it into the queue (is there already a technical term for plopping data into a continuously running 96-bit transmission channel?) and the channel handles the transmission. An error counter automagically checks a CRC and monitors for errors. There are tests that inject errors to verify that the error checking works; so far no errors have been detected in actual use.
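Just to make the "plopping" concrete, here is a minimal C sketch of injecting into a continuously running packet stream. The 12-byte (96-bit) layout of 10 payload bytes plus sequence and CRC-8 bytes, the queue depth, and the CRC polynomial are my own assumptions for illustration; PropForth's actual packet format will differ.

#include <stdint.h>
#include <string.h>

/* Hypothetical 96-bit packet: 10 payload bytes + sequence byte + CRC-8. */
typedef struct {
    uint8_t payload[10];
    uint8_t seq;
    uint8_t crc;
} packet96_t;

#define QUEUE_SIZE 16
static packet96_t queue[QUEUE_SIZE];
static volatile uint8_t q_head, q_tail;    /* q_tail is advanced by the channel cog as it drains */

/* CRC-8, polynomial 0x07, over the first 11 bytes (payload + seq). */
static uint8_t crc8(const uint8_t *p, int n) {
    uint8_t crc = 0;
    while (n--) {
        crc ^= *p++;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07) : (uint8_t)(crc << 1);
    }
    return crc;
}

/* A cog "injects" a message into the continuously running stream. */
int inject(const uint8_t *msg, int len) {
    static uint8_t seq = 0;
    uint8_t next = (uint8_t)((q_head + 1) % QUEUE_SIZE);
    if (len > 10 || next == q_tail)            /* message too big or queue full */
        return 0;
    packet96_t *pkt = &queue[q_head];
    memset(pkt->payload, 0, sizeof pkt->payload);
    memcpy(pkt->payload, msg, (size_t)len);
    pkt->seq = seq++;
    pkt->crc = crc8((const uint8_t *)pkt, 11);
    q_head = next;                             /* publish; the channel cog sends it next pass */
    return 1;
}

The channel cog keeps transmitting idle packets whenever the queue is empty, so the stream never stops; inject() only publishes a filled slot for the channel cog's next pass.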
The intent of the original post was to figure out how to use the above for parallel processing. I think the infrastructure is there to do at least the communications part of it. I am looking for help on how to structure algorithms to take advantage of a pool of cogs and a pool of devices spread across N prop chips and separated by the stated communications channel.
Is this too specific for this thread? The topic has already expanded to include other great references, which I have not been able to read yet but still plan to. I'd really like your input on how this low-level infrastructure can be used. Should this be a separate thread?
I think a concrete problem is entirely appropriate. Also, I think I would use the term "inject" rather than "plop".
With 4 full duplex channels, you could completely connect 5 Props, but in general, I would implement a 2D array. As a first step, I wouldn't try to automate the distribution of tasks, I would explicitly assign them. And create a mailbox system, which would handle the process of sending a message to any cog in the system. Then build more advanced stuff on top of that.
Gelernter, who wrote the paper I mentioned, also wrote one on implementing Linda. Linda handles a lot of things, automatic task distribution, communication, load balancing, at a cost in efficiency. At least back in the day, it only added 5 constructs to a non-parallel language to accomplish this. If you can find that paper, it's worth reading anyway.
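A sketch of what such a mailbox layer might look like, in C for readability. The (prop, cog) addressing, the one-slot-per-cog inbox, and link_send_toward() are my assumptions, not PropForth's API; the idea is simply that messages for a local cog go straight into that cog's slot, and everything else is forwarded down the link that leads toward the destination prop.

#include <stdint.h>

#define MAX_COGS 8

typedef struct {
    uint8_t dst_prop, dst_cog;   /* global address: (prop, cog) */
    uint8_t src_prop, src_cog;
    uint8_t data[8];
} msg_t;

/* One inbox slot per local cog; the full/empty flag acts as the handshake. */
static msg_t inbox[MAX_COGS];
static volatile uint8_t inbox_full[MAX_COGS];

static uint8_t my_prop_id;       /* assigned at startup, e.g. by the master */

/* Assumed transport supplied by the serial-channel layer: pushes the
   message out on the link that leads toward the given prop. */
extern void link_send_toward(uint8_t dst_prop, const msg_t *m);

/* Deliver locally if addressed to this prop, otherwise forward. */
int mailbox_send(const msg_t *m) {
    if (m->dst_prop == my_prop_id) {
        uint8_t c = m->dst_cog;
        if (c >= MAX_COGS || inbox_full[c])
            return 0;                          /* invalid cog or slot still busy */
        inbox[c] = *m;
        inbox_full[c] = 1;                     /* raise the flag after the copy */
        return 1;
    }
    link_send_toward(m->dst_prop, m);
    return 1;
}

/* A cog polls its own slot; returns 1 and copies the message out if present. */
int mailbox_recv(uint8_t my_cog, msg_t *out) {
    if (!inbox_full[my_cog])
        return 0;
    *out = inbox[my_cog];
    inbox_full[my_cog] = 0;
    return 1;
}

With this in place, "more advanced stuff" such as task distribution only ever has to call mailbox_send(); whether the destination cog is on the same prop or three hops away is the layer's problem.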
I think a concrete problem is entirely appropriate. Also, I think I would use the term "inject" rather than "plop".
With 4 full duplex channels, you could completely connect 5 Props ...
Gelernter ... wrote one on implementing Linda.
-phar
Thanks! I appreciate your help!
The arrangement of Master to Slave can potentially be extended with each slave acting as a sub-Master to up to three slaves, and so on until sufficient cogs are in the pool and sufficient I/O pins are present.
But you have a good idea in keeping it simple at first, so I will document the experiments on a single chip, so folks can examine the results more easily, and build from there.
I found the Gelernter paper and will read it. I have heard of Linda, and also of Occam, although I am not very familiar with either. This method may have similarities to Occam.
An aside on concrete projects as opposed to purely theoretical discussions: I used to be on the MIT Press mailing list, and I found that all the best books were descriptions of actual projects. A great thing about MIT Press in those days was that you could wait for their warehouse clearout sales, buy their books for a song, and still be up on developments (like VLIW processors) before they hit the market.
I did a little programming on a transputer network years ago, but I never used Occam. It seemed expedient at the time, but in the long run I may have lost out.
I think the dominant paradigm for parallel systems today is distributed shared memory. I haven't seen a detailed description of an implementation, but the nebulous ones I've read make it sound a lot more complex than implementing Linda. The prize is that code doesn't have to be ported to the parallel system; it just manages threads in a parallel fashion.
Of course you will find a high-level paradigm you like and end up going back to rewrite all the lower-level stuff to accommodate it, but there's no way around that in an experimental system. Plus, as you say, it may serve as inspiration for others and they may take it in a different direction (I think we've already mentioned about 10 person-years worth of directions in this thread).
I think Leon just hit the bullseye. If you can find the transterpreter source, you can modify the wrapper for the Prop, compile, and be on your merry way. I found the KRoC source, but that only gets you halfway there. The situation is described on the publications page, in the paper on implementing the transterpreter. The second-to-last paper, I think.
http://transterpreter.org/
Of course, that means you're programming in Occam. If you want to stick with Forth, somebody created a transputer Forth back when the transputer was still in production (Leon's post jogged my memory about that). If you could find that, you could implement CSP-type words for PropForth and forget about emulating the transputer.
The thing I'm looking for is insight into typical examples of algorithms that can be parallelized, and how one goes about parallelizing them. Personally, I'll be using Forth, but the concepts should be applicable to any language. In fact, I'm interested in seeing what things are easy in a given language, and how the issues are overcome in another. The Forth examples will hopefully be clear enough to translate into Spin, etc.
Just to be clear, there are some items that are similar to Occam, but I'm not going to program in Occam. There are some things that are similar to Linda, but again I'm not going to program in Linda, I'm going to take the useful ideas and use them with my existing tool kit.
I started with the Sieve of Eratosthenes as suggested, and if I ever get the single-processor version finished I will parallelize that, and post both versions for comparison. I am interested in seeing how the actual programs match the descriptions in the literature.
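For comparison while you work on it, the sequential sieve is only a few lines, and one common way to parallelize it is a data-parallel split: a master sieves the small primes up to sqrt(N) sequentially, then each cog strikes out multiples inside its own slab of the array. A minimal C sketch; the slab split is my assumption, not a finished cog kernel.

#include <stdint.h>

#define N 10000

static uint8_t composite[N + 1];    /* 1 = struck out */

/* Plain sequential Sieve of Eratosthenes up to 'limit'. */
void sieve(int limit) {
    for (int p = 2; p * p <= limit; p++)
        if (!composite[p])
            for (int m = p * p; m <= limit; m += p)
                composite[m] = 1;
}

/* Data-parallel step: one worker strikes out multiples of the known small
   primes (those <= sqrt(N), found by a sequential pass) inside its own slab
   [lo, hi).  Run one of these per cog; the slabs don't overlap, so no
   locking is needed. */
void sieve_slab(int lo, int hi, const int *small_primes, int n_primes) {
    for (int k = 0; k < n_primes; k++) {
        int p = small_primes[k];
        int start = ((lo + p - 1) / p) * p;    /* first multiple of p that is >= lo */
        if (start < p * p)
            start = p * p;                     /* smaller multiples were handled by smaller primes */
        for (int m = start; m < hi; m += p)
            composite[m] = 1;
    }
}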
In another thread, I think, I offered a parallel quicksort. It was mentioned that when you factor in the overhead of loading and initializing all the cogs, it probably wouldn't be that fast, but in a case where the code remains in place, like doing 3D using the painter's algorithm, that isn't a factor. 3D graphics was my favorite application during my "parallel period"; I had a running 3D pipeline in C and it mapped quite nicely onto a few transputers. The 3D pipeline was a good example of specialist parallelism: transforming, projecting, clipping, shading, and rendering. But I realized then that every stage of the pipeline was also data parallel; if you looked at a cross section of the pipeline, it could potentially be an array of thousands of processors. So it's easy to see how we've gotten where we are today in 3D graphics; hundreds of processors are only scratching the surface of the exploitable parallelism.
It's way too big as an example program, but I noticed one of the game programmers has already ported Wolfenstein 3D to the Propeller, so maybe a 3D pipeline would have a very receptive audience. I'm thinking fixed-point fractional math for rotations, the painter's algorithm, and Gouraud shading with a tiny bit of texture-mapped objects.
a parallel quicksort.
3D using the painter's algorithm ... 3D pipeline in C
The 3D pipeline was a good example of specialist parallelism: transforming, projecting, clipping, shading, and rendering. ... every stage of the pipeline was also data parallel
fixed-point fractional math for rotations, the painter's algorithm, Gouraud shading with a tiny bit of texture-mapped objects.
-phar
I'll check out your quicksort; I'd also like to see your painter's algorithm.
I also notice a large amount of overlap between the types of parallelism defined in the two references I'm reading (cspbook and carriero). This is probably what's confusing me. While the literature makes clear distinctions for the purposes of classification, in actual applications they are not so clear. Maybe the definitions presented are not a close fit for the Propeller? Or maybe the classification depends on the perspective from which the problem is considered, and all of them can apply to varying degrees? I'm trying to summarize my notes; if I can clear them up, I will post them someplace and ask for comments.
My direction is to use the low-level definitions for the communications and to define software tools to perform each class of parallelism that is needed. The plan is to re-code things in Forth or assembler.
The loading of propellers can be done in parallel if you accept some caveats.
You must use the Master-slave-subslave "bigboy" method.
http://forums.parallax.com/showthread.php?t=124343
I have tested over 28 prop chips, all tied to the "rail".
Because I use sequential daisy-chain wiring, the clock/data/reset lines all arrive in sync; when I used random, closest-point wiring, blocks of props would drop off the chain.
The master-slave-subslave arrangement works out to at least 28 props (my tests) when using a daisy chain, as the schematic shows. Because the props will be spread out over a distance, the best way to wire it all up is a daisy chain.
I see no reason why more couldn't work. My props are all from different orders over the past year; some are older. None of that seems to matter; the only thing that does is the wiring or layout you use when building anything with that many processors.
I have three props with line 31 tied together, and line 30 tied together. These are connected to the MAX232 same as if it were a single prop. I can program all three at once just using prop tool. Is this the same as the big-boy method?
What I'm doing is simultaneous, but not really what I had in mind for parallel. I was thinking that each prop would load a DIFFERENT program for my parallel needs.
I noticed that when I load a program to flash the led on each of the three props, all three flash at pretty much the exact same time, even though they have different crystals. I never did see them drift apart, but I wasn't testing that at the time. Is this only because these are all from the same batch, and different batches may exhibit varying characteristics?
I have three props with line 31 tied together, and line 30 tied together. These are connected to the MAX232 same as if it were a single prop. I can program all three at once just using prop tool. Is this the same as the big-boy method?
It is. Except in the big-boy method, I use a main master prop to do the slave programming and clock output.
What I'm doing is simultaneous, but not really what I had in mind for parallel. I was thinking that each prop would load a DIFFERENT program for my parallel needs.
You can have each prop run a different program, if you make each prop generate a unique ID (like my big-boy demo program does).
A better way would be to uniquely ID each prop with a specific pin tied high. A limitation of this is that the maximum size of the program for all 3 props is limited by the prop's memory, the EEPROM, and the Prop Tool.
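The ID-pin scheme might look like this in C-style pseudocode. read_pin(), the pin numbers, and the role procedures are hypothetical names for illustration; every prop runs the same image and branches on the ID it reads at startup.

/* Hypothetical: read_pin(n) returns the level of I/O pin n. */
extern int read_pin(int n);

extern void run_master(void);
extern void run_gui(void);
extern void run_io(void);

void main(void) {
    /* Board wiring gives each prop a unique 2-bit ID on pins 24..25. */
    int id = (read_pin(25) << 1) | read_pin(24);

    switch (id) {
    case 0: run_master(); break;   /* no pin tied high: the master */
    case 1: run_gui();    break;   /* pin 24 tied high: GUI prop */
    case 2: run_io();     break;   /* pin 25 tied high: device-interface prop */
    default:              break;   /* spare */
    }
}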
I noticed that when I load a program to flash the led on each of the three props, all three flash at pretty much the exact same time, even though they have different crystals. I never did see them drift apart, but I wasn't testing that at the time. Is this only because these are all from the same batch, and different batches may exhibit varying characteristics?
All the props shown in my picture are from various batches.
The variation in prop program synchronicity comes from prop-to-prop differences in the internal RC-oscillator boot sequence. The time it takes one prop versus another to latch onto the 5 MHz clock will vary between some minimum (theoretically 0) and some maximum.
The fact that there is a maximum time before a prop must boot means we can program all props to wait for a signal on a pin, if present. Each prop checks for the signal, and if it's not there, raises it, then waits the maximum boot time to let all the props boot, see the signal, and wait for its fall.
The prop that booted first lowers the signal after the boot time has passed. At that point all props are synchronized, or can be synchronized further using assembly.
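A sketch of that handshake in C-style pseudocode. The pin primitives, SYNC_PIN, and MAX_BOOT_TICKS are assumed names and values; every prop runs the same routine, and the falling edge of the shared line is the common start signal.

#include <stdint.h>

extern int  read_pin(int n);           /* hypothetical pin primitives */
extern void drive_pin_high(int n);
extern void release_pin(int n);        /* float the pin; a pull-down takes the line low */
extern void wait_ticks(uint32_t n);

#define SYNC_PIN       20              /* assumed shared line with a pull-down */
#define MAX_BOOT_TICKS 80000000        /* assumed: longer than the worst-case boot-time spread */

void boot_sync(void) {
    if (!read_pin(SYNC_PIN)) {
        /* First prop up: raise the line, wait out the maximum boot spread so
           every prop has booted and seen it, then drop it. */
        drive_pin_high(SYNC_PIN);
        wait_ticks(MAX_BOOT_TICKS);
        release_pin(SYNC_PIN);
    } else {
        /* Booted late: the line is already up, just wait for its fall. */
        while (read_pin(SYNC_PIN))
            ;
    }
    /* All props leave this point within a few clocks of each other; tighter
       synchronization can be done from here in assembly. */
}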
The props will not drift; they are all clock-locked. Because you had separate crystals, drift could occur due to manufacturing and circuit-voltage differences. The key to synchronicity is symmetrical design, and that entails every aspect: thermal, circuit, layout, and EMI.
I have problems modifying my breadboard of props due to the large amount of noise in the system from so much hardware: jumpers, junctions, etc. The only fix I have found is to take every single prop and jumper wire and resolder every single pin (this "tins" the tips of the prop pins; I found that when I do this, large breadboard projects work).
I never implemented a parallel quicksort; it was only an idea. Also, I'm not a Propeller programmer, so you'll have to suffer through my ad hoc pseudocode of what I think it should look like. The painter's algorithm, or depth sort, simply consists of sorting a polygon array (which has already been transformed, projected, and clipped) by some estimate of distance (maximum, minimum, or average z-coordinate) and then doing a 2D draw of each polygon from the most distant to the least. There are pathological configurations in which this simple version fails (what I call the diving-board problem). They can be handled by subdividing polygons whose x-y projections overlap, but I would start with the simpler version.
struct sort_item {
    polygon *poly       /* pointer into the polygon array, since each item that has to be
                           sorted will be moved numerous times */
    z_estimate          /* the sort key: an average of the z-coordinates or the maximum z */
}

sort_item sort_array[num_items]
id_bitmask = 4

/* I'm assuming here that we have an array of procedures that are identical except for an
   id number, which is the same as the index of that procedure in the array. For 3 levels
   of partitions, after we do the partition we hand the high side of the subarray to the
   process determined by OR'ing this process's id with the mask, along with a new
   right-shifted mask. So procedure 000 launches procedure 100 (binary), then 010, then
   001, then quicksorts the rest of its own section itself. */
procedure depth_sorts[8]

depth_sort0(array, left, right, mask)
    my_id = 0
    i = left
    j = right
    p = (array[left].z_estimate + array[right].z_estimate) / 2  /* some guess at the average */
    while ( i <= j ) {
        while ( array[i].z_estimate < p )
            i = i + 1
        while ( array[j].z_estimate > p )
            j = j - 1
        if ( NOT (i > j) ) {
            swap(array[i], array[j])
            i = i + 1
            j = j - 1
        }
    }
    if ( mask != 0 ) {
        parallel {      /* launch both recursive calls, the high side on another processor */
            if (left < j)  depth_sorts[my_id](array, left, j, mask >> 1)
            if (right > i) depth_sorts[my_id OR mask](array, i, right, mask >> 1)
        }
    } else {            /* out of processors: this one handles both sides itself */
        if (left < j)  depth_sorts[my_id](array, left, j, 0)
        if (right > i) depth_sorts[my_id](array, i, right, 0)
    }

main
    depth_sort0(sort_array, 0, num_items - 1, id_bitmask)
    for ( i = num_items - 1 downto 0 )    /* draw back to front */
        draw2D(sort_array[i].polygon)
If the individual processes act like a fork...join structure, this works okay, but if you have constantly running processes signalling with flags or something, there are cases in which not all the processes get called, so that has to be dealt with.
By the time I had my little transputer network to work with, my 3D pipeline was using the z-buffer method, so I never implemented this. I don't think the memory-hungry z-buffer is a good fit for the Prop I though. It wasn't long after that that I heard OpenGL was coming to the PC and that was the end of my graphics programming.
Interesting that you should mention the Transputer. I used to be a strong proponent of this device and served as chairman of a conference on transputers with another British professor and friend, Yakup Paker. The transputer, particularly the T9000, appeared very promising, but never left the paper stage and, thanks to the British Government and SGS-Thomson, it died a slow death.
Have you heard of the Pyramid architecture that an old colleague, David Schaefer, invented?
Andrew
The transputer actually went into production, and was quite successful at the time. I used to design and market systems using it.
XMOS devices, designed by David May, who designed the transputer, share some features with it, and have been available for a couple of years. XMOS is doing very well.
Having done the minimum of research on this topic http://en.wikipedia.org/wiki/Communicating_sequential_processes you will see on the fourth line a reference to "Limbo" as being influenced by "Communicating Sequential Processes".
This is the exokernel language of "Inferno", which is a real-world OS with a complete functional environment http://en.wikipedia.org/wiki/List_of_Inferno_applications that many users with unix/linux backgrounds will find comfortable.
Perry