super parallel computer topologies

mctrivia · 2010-01-07 06:14

I have been trying to think of what the best infinitely expandable topology would be that can be made cheaply on a PCB.

The best I can think of so far is an 11-prop full mesh topology on a PCB, with each of the props also having a full duplex serial link to the board on each of the 6 sides of a cube. This would use all 32 pins (10 on-board links plus 6 external links, at 2 pins per link). I could make this using QFP parts relatively cheaply.

Pros:
*full mesh on pcb means each prop is 1 hop away on pcb
*external links on all 6 sides means any number of boards can be connected together in 3d

Cons:
*for communication between x boards, the longest path is equal to
(cuberoot(x)-1)*3+1

This is not too bad, since 1000 boards (11,000 props) have a max length of 28 hops.

A hypercube of 4096 props has a max length of 12; my topology has a max length of 17.
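For anyone who wants to play with the numbers, here is a minimal sketch of the longest-path formula above. Rounding the cube root up for board counts that are not perfect cubes is an assumption, and the function names are made up for illustration. Evaluated as written, the formula gives 28 hops for 1000 boards.

```python
def cube_side(boards: int) -> int:
    """Smallest integer side length n such that n**3 >= boards (assumed rounding)."""
    n = 1
    while n ** 3 < boards:
        n += 1
    return n


def max_hops(boards: int) -> int:
    """Longest path per the formula in the post: (cuberoot(x) - 1) * 3 + 1."""
    return (cube_side(boards) - 1) * 3 + 1


for boards in (1, 8, 27, 1000):
    print(boards, "boards ->", max_hops(boards), "hops")
```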

Any suggestions for better topologies? I figure I can probably sell the PCB at $20 apiece. It would probably cost about $130 to fully populate each board (that price is for you to populate it; if I have to buy the parts it gets more expensive).

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
24 bit LCD Breakout Board now in. $24.99 has backlight driver and touch sensitive decoder.

Post Edited (mctrivia) : 1/7/2010 11:39:41 PM GMT

Ale · 2010-01-07 08:45

mctrivia:

There are several HW topologies, so to say, but they are bound to a certain SW implementation and its communication requirements.

Example: a hypercube like the one implemented in the XMP-64 (XMOS) is quite nice and uses only 16 chips. Beyond an academic work the question is... which problem are you going to solve?

One possibility is to have several units, cells, small robots that work combined in small groups or separated to perform a task. Maybe they can map together an area using IR or ultrasound, that way they can cover much more area.

A small board that can communicate with others (XBee dongles by Parallax?) and has a couple of sensors, motors, servos and so on can be programmed to do the task I described. There are interesting examples that biology provides and that can be mimicked with interesting results. It does not have to be utterly complicated... just enough so sensors and so on can be directly connected. Bare-bones, you already have a couple.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Visit some of my articles at Propeller Wiki:
MATH on the propeller propeller.wikispaces.com/MATH
pPropQL: propeller.wikispaces.com/pPropQL
pPropQL020: propeller.wikispaces.com/pPropQL020
OMU for the pPropQL/020 propeller.wikispaces.com/OMU

Humanoido · 2010-01-07 08:51

The best naturally expandable topology has to take one great aspect into consideration: software. I think the overall hardware question is how to draw these chips together with software. As Ale says, "Beyond an academic work the question is... which problem are you going to solve?"

humanoido

Gadgetman · 2010-01-07 09:18

The problem with a HyperCube topology is the sheer amount of connections...

2 CPUs = 1 connect
4 CPUs = 4 connects
8 CPUs = 12 connects
16 CPUs = 32 connects
32 CPUs = 80 connects
64 CPUs = 192 connects
128 CPUs = 448 connects
256 CPUs = 1024 connects
512 CPUs = 2304 connects
1024 CPUs = 5120 connects
2048 CPUs = 11264 connects
4096 CPUs = 24576 connects.
(That's a lot of wiring... )

That's with a classic Hypercube with one CPU in each node, of course.
Also, the number of connections a node must control is the same as max length in the topology. That means a node in a 4096 node network needs to control 12 connections, possibly sending and receiving on several of them at any given time.
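The table above follows a simple rule: a hypercube with 2^d nodes has d links per node, and every link is shared by two nodes. A small sketch (the function name is just for illustration) that reproduces the numbers:

```python
def hypercube_links(dimensions: int) -> int:
    """Total links in a hypercube with 2**dimensions nodes.

    Each node has `dimensions` links and every link is shared by two nodes.
    """
    nodes = 2 ** dimensions
    return nodes * dimensions // 2


for d in range(1, 13):
    print(f"{2 ** d} CPUs = {hypercube_links(d)} connects")
```

The same `d` is also the number of links each node has to drive and the longest path in hops, which is why both grow together.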
(I once wanted to build a Z80-based Hypercube... Never could find a supplier of cheap enough CPUs, RAM and EPROMs, though)

To me, HyperCube designs look to fit 'medium-sized' tasks where there's no time-critical inter-process communication.
If fast communication is needed, it will require an OS that's capable of allocating close-proximity neighbours to a task, possibly by using some sort of 'requirements script' file.
(Imagine a sort of high-speed sieve program where you need to run 4 concurrent processes, each with their own data streams. The output of these must then be fed to 2 new processes, which again feed a single process. For best throughput the first-stage processes must be only one interconnect away from the second-stage processes and so on. This is not necessarily guaranteed if the OS allocates CPUs sequentially.)

Some of the OS issues also crop up to a larger or smaller extent in other designs, too.
(Is a new task to be started on a board which already has a task running, if there are enough free CPUs, or should it be started on a separate board, to allow for easier communications if new sub-processes must be spawned?)

High-speed data streams.
Hypercubes, or even matrices in general, don't lend themselves to processing high-speed data streams unless the CPUs at the 'edge' are used to process the incoming streams, then feed the resultant streams 'inwards'.
This again means that the 'outer' CPUs may need to be reserved somehow for tasks that need that kind of speed.
(Assuming that the data streams aren't connected to the 'core' CPUs in the middle of a matrix, of course. Which must also be considered if using a matrix, as it gives the shortest average connection length to reach any CPU in the matrix)

The alternative is to manually decide on which CPUs a given program should run.

Maybe a design would be to have several 'high-speed busses' passing all the boards, but which must be allocated to a process on an 'as needed' basis?
Then two physically separate boards can 'hook onto' a bus and communicate directly, without the need to go through all the boards in-between.
(If so, this also adds complexity to the OS, unless everything is run manually, as it needs to decide whether a process can be started if no bus is available and the process is flagged as needing one. Should a process lock a bus from the moment it's started, or only when it needs it? Race and lockout situations aren't fun... This goes for all 'special' resources that are exclusive access.)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Don't visit my new website...

heater · 2010-01-07 09:37

As has been discussed many times here the prospect of building a "super computer" with Propeller chips is not really reasonable. The best one might achieve is a big, slow, power hungry and expensive network of Props that in no way competes in that arena. The performance is not there and the communications bottlenecks are insurmountable.

However there are other more realistic reasons for wanting a network of independent Props, or other MCU's for that matter.

So I would like to propose one answer to the question "which problem are you going to solve?".

I would propose tackling the problem of fault tolerance. That is the creation of a network of Props that can together continue to correctly perform some task in the face of failure of one of the Props (hardware or software), or failure of one of the communication links.

Traditionally this has been tackled with 3 communicating nodes participating in some kind of voting scheme such that if one node fails there is a majority vote in favour of the correct result and the system as a whole continues to function correctly.

In the early 80's it was shown that for a fault tolerant network to tolerate one failure in a node or link, having 3 nodes is not sufficient; 4 nodes are required. In general, to tolerate n faults in the system, 3n + 1 nodes are required.

You can read this rather unintuitive conclusion in the classic paper on the "Byzantine Generals Problem" by Leslie Lamport and others which I attach here.

N.B. If you fly the Boeing 777 you might like to consider that it does not satisfy the Byzantine Generals criteria in that it only has 3 Primary Flight Computers in use at once.
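To make the 3n + 1 rule concrete, here is a tiny sketch that turns it around both ways. The function names are made up for illustration; only the 3n + 1 relation itself comes from the post.

```python
def nodes_required(faults: int) -> int:
    """Minimum node count to tolerate `faults` faulty nodes (3n + 1 rule)."""
    return 3 * faults + 1


def faults_tolerated(nodes: int) -> int:
    """Largest n such that 3n + 1 <= nodes."""
    return max(nodes - 1, 0) // 3


for nodes in (3, 4, 7, 11):
    print(nodes, "nodes tolerate", faults_tolerated(nodes), "fault(s)")
```

By this rule a 3-node voting system tolerates no such faults at all, while an 11-prop board could tolerate 3.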

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Gadgetman · 2010-01-07 09:57

We wouldn't expect a Propeller-based 'super-computer' to compete with number-crunchers designed for meteorological or other heavy simulations.

What such a computer could be used for is as a learning tool.
The biggest problem in this kind of system is writing the SW.

Do you know of any schools that teach massively-parallel programming algorithms?
And have a suitable computer to 'experiment' on?

Many universities and research institutes have super computers, but they don't like to let students 'mess about' on them. Crashes - race and lockout situations - are 'not encouraged'.
Tinkering with the OS is strongly discouraged...


Ale · 2010-01-07 10:01

I think that at Bristol Uni there is some research going on in parallel computing...


VIRAND · 2010-01-07 10:43

I have been thinking about trying to program an artificial neural network in a Propeller,
but I have not even begun to think about how to add more brains to it.

Two apps:
1.Voice recognition.
2.How much can it learn, and what will it invent if it is overtaught?
(I've seen an ANN start generating new information as soon as it started forgetting what it learned.)

he1957 · 2010-01-07 11:16

A parallel bus structure could be another approach. Ignoring specifics of signal propagation etc. and the starting of each chip with its initial "monitor code": once each chip is initialised, the initially synchronised monitor code could allow each chip/COG to perform "some function".

In this design, each chip is interconnected via its 32 pins, in parallel giving 1 hop between each chip. A purpose built multi-cycle protocol could be designed to communicate between chips using Address/Control/Data cycles. A single, distributed clock source could be employed to assure synchronisation between chips/COGS thus simplifying chip/COG behaviour and access to common resources. Because the Propeller is cycle predictable it would be feasible to include some form of fault detection in the adopted bus protocol by using a status/progress mechanism and mailbox in its design.

The bus protocol could address chips, COGS, specify a read and/or write operation or some function to perform for same or perform a status request or some other control function or data operation. Some form of common chip memory and I/O device access would be required - some of the I/O could be performed by an arbitrary chip controlled by the protocol. Available chip/COG Memory/code limits may prevent more sophisticated operations however.

An interesting academic project because of the chip's design, but for raw data crunching there may be better designs.

As mentioned, it really depends on what goals the intended design targets. ie: What would it be used for?

OK, Beat me up Scotty

Gadgetman · 2010-01-07 11:22

Voice recognition requires high bandwidth and lots of memory.

Neural networks, though...


TonyWaite · 2010-01-07 12:36

Where is Leon when you need him?

XC or CUDA are available now, with hardware from XMOS/NVIDIA, for all you supercomputing
freethinkers.

And this time there's little place for a Prop!

T o n y

Ale · 2010-01-07 12:52

TonyWaite said...
Where is Leon when you need him?

He got scared before posting about XMOS (again) ?

I mentioned xmos in another thread just as an example... it was in this very thread !!!

I think the swarm idea represents a low-bandwidth, independent-enough approach. While neural networks sound great, I haven't got the slightest idea what/how/why they are.


Leon · 2010-01-07 14:15

I'm around, I got up very late this morning!

Plessey Roke Manor bought one of my 16-module transputer systems 24 years ago (it cost them about £13,000!), and designed a fault-tolerant system for it. They'd get a customer to pull out two or three modules at random and it would carry on working, albeit with a reduction in performance. My motherboard had two Inmos C004 transputer link chips on it which allowed the topology to be changed in software, it could even be done on the fly with some clever programming. Systolic arrays were popular with transputers for some applications - data just flows across the network from one side to the other. I actually designed a hypercube system using them, with transputers on little vertical towers of tiny PCBs using ChipRack and chip-on-board technology, but never had the funds to build it.

I once did some work with neural nets, they are very interesting. The Propeller won't be much good at them, it just hasn't got the memory or performance. When I worked at BAe MAD, Brough, I supervised a student project using a Dataglove with a neural net for sign language recognition. The NN was coded in C on a PC.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

Humanoido · 2010-01-07 14:40

Ok, that's your answer! Leon can write the software!

mctrivia · 2010-01-07 16:08

I do not have a planned project and do not think it has any practical use until the Prop 2. I was just trying to think what could be done inexpensively.

Hyper cube is no good as it requires doubling each time you expand. I want to be able to add $200 at a time.

I think adding 3 access buses would be good. That allows 5 hops to be the max no matter the size of the cube, assuming the buses have a dedicated router prop on each board.

I would use an fpga as a transputer to allow the topology to be changed on the fly and to make easy routing.


Leon · 2010-01-07 16:19

The Inmos transputer is/was a floating-point MPU. Because of the high-speed on-chip link engines, Meiko actually used them as comms elements in their supercomputer, for connecting very fast floating-point devices together. Each FPU was connected to a transputer as a co-processor.

An FPGA could be used for signal routing, but special chips are made for that purpose, like the old Inmos C004.

Leon


Gadgetman · 2010-01-07 18:23

mctrivia said...

Hyper cube is no good as it requires doubling each time you expand. I want to be able to add $200 at a time.

Actually, that's not completely true...

Sure, it's nice to have all the nodes there, for easier routing, but as long as the OS is capable of keeping track of which nodes/CPUs are online, there's no reason it can't be built bit by bit.

One feature of a successful HyperCube design is flexible routing, which can work around gaps in the pattern. After all, it wouldn't be good if a 1024 CPU design stopped working altogether just because one CPU is on the blink...


mctrivia · 2010-01-07 18:29

Good point. However it does not easily fit into the large array of identical boards I am proposing.


Gadgetman · 2010-01-07 18:58

Why not?

Most HyperCube designs (the ones I've seen, at any rate) are a ring-shaped design of identical cards.
(Lots of vertical cards around a central core, and the interconnects on the outer rim. )
And that can easily be opened up to a straight stack.


mctrivia · 2010-01-07 21:04

Physically cards would be identical. But the soft router at the core of each card would be different in a hyper cube. At least in the stacking method I am thinking of.

If I drop down to 8 props per board, hypercubes with any power-of-2 number of cards should be doable.


mctrivia · 2010-01-07 23:39

I changed the title because this is not going to be a super computer like sony's 1PFlop computer but a massive expandable network of props.


pharseid · 2010-01-08 01:04

I would think using an FPGA in conjunction with the Prop would make this more practical. How complex do you envision the FPGA design? The FPGA could buffer data packets and could read the message destination to route the message without requiring service from the Prop. So in a big system, messages which had to be routed through intermediate nodes would not drain MIPS from the associated Props.

6 com channels would probably satisfy 99+% of conceivable users. How many people are going to build a system with more than 64 Props?

mctrivia · 2010-01-08 01:37

The FPGA design for a topology engine would be extremely simple. All it would have to do is deal with loading the code to the props through P30/P31 at startup, provide clock signals to the props, and define which pins on each prop are connected to what.

To have it act as a router as you mention would not be too hard either, though there is limited buffering one could do.

The problem is that BGA would make the boards too expensive, and I want people to be able to hand solder, so the EP2C5Q208C8N is the best affordable FPGA that could be used. That gives only 142 pins, so hardwiring as many pins as possible is needed.


mctrivia · 2010-01-08 02:52

I don't think an FPGA would be a good way to go. There are not enough pins in affordable packages. Air wires could be used to make the connections, but would that be any good? And would anyone really want to connect multiple $100 modules together in 3D?

a 4D hyper cube can be routed on a single pcb easily.
attachment.php?attachmentid=66538

This would give 128 cogs for under $200.
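As a rough sketch of what a 4D hypercube means for wiring and routing: number the 16 props 0 to 15, and connect two props whenever their numbers differ in exactly one bit. The numbering scheme and names below are assumptions for illustration, not taken from the board layout.

```python
DIMENSIONS = 4
COGS_PER_PROP = 8  # each Propeller has 8 cogs


def neighbours(node: int, dimensions: int = DIMENSIONS) -> list[int]:
    """Nodes reachable in one hop: flip each address bit in turn."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def links(dimensions: int = DIMENSIONS) -> set[tuple[int, int]]:
    """All undirected links, each stored once as (low, high)."""
    return {
        (min(node, other), max(node, other))
        for node in range(2 ** dimensions)
        for other in neighbours(node, dimensions)
    }


props = 2 ** DIMENSIONS
print(props, "props,", props * COGS_PER_PROP, "cogs,", len(links()), "links")
print("prop 0 talks directly to", neighbours(0))
```

So each prop needs 4 links, and the whole board needs 32 of them.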


Post Edited (mctrivia) : 1/8/2010 3:05:54 AM GMT

Forest Godfrey · 2010-01-08 08:00

One thing to consider when picking a topology is routing, and not the circuit board kind, the software kind. You need a communication protocol, and most of them are going to be store-and-forward. Unless you want to add a retransmission protocol, you can't drop packets. Each node has a limited amount of resources to store data. Let's take a 2D mesh with 4 nodes (i.e., a square), which is trivial to route on a circuit board:

     A <-> B
     ^     ^
     |     |
     v     v
     C <-> D

If A is talking to D, B to C, and C to B, and everyone goes around clockwise, you deadlock. This can be solved in a variety of ways, all of them at least a little complex to understand. The mesh topology can be easily expanded into a torus, but that won't make the deadlock issue go away.
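One way to check for this kind of deadlock is a channel dependency graph: every directed link is a vertex, and a message that holds one link while waiting for the next one adds an edge. A cycle in that graph means the buffers can fill up in a loop. The sketch below is only an illustration with made-up function names. Note that the fourth flow (D to A) is an assumption added here: with only the three flows listed above, this simple model finds no cycle, and the extra flow is what closes the loop.

```python
# Clockwise order around the square drawn above: A -> B -> D -> C -> A.
CLOCKWISE = {"A": "B", "B": "D", "D": "C", "C": "A"}


def route(src, dst):
    """Directed links used when travelling clockwise from src to dst."""
    path, node = [], src
    while node != dst:
        path.append((node, CLOCKWISE[node]))
        node = CLOCKWISE[node]
    return path


def dependency_graph(flows):
    """Map each link to the links a message may wait for while holding it."""
    deps = {}
    for src, dst in flows:
        path = route(src, dst)
        for held, wanted in zip(path, path[1:]):
            deps.setdefault(held, set()).add(wanted)
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the dependency graph."""
    state = {}  # link -> "open" while on the DFS stack, "done" afterwards

    def visit(link):
        state[link] = "open"
        for nxt in deps.get(link, ()):
            if state.get(nxt) == "open":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[link] = "done"
        return False

    return any(link not in state and visit(link) for link in list(deps))


listed = [("A", "D"), ("B", "C"), ("C", "B")]
print("three flows as listed:", has_cycle(dependency_graph(listed)))
print("plus D -> A (assumed):", has_cycle(dependency_graph(listed + [("D", "A")])))
```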

I'd recommend a hypercube. With the same number of links out of each node as the torus or mesh (6), you get a 64 propeller hypercube. The routing is nice and simple: Order your dimensions, route in each dimension in order. There is at most 1 hop per dimension.
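The dimension-ordered routing described above fits in a few lines. This is a sketch under the assumption that nodes are numbered 0 to 63 so that neighbours differ in exactly one address bit; the function name is made up.

```python
def dimension_order_route(src: int, dst: int, dimensions: int = 6) -> list[int]:
    """Visit the dimensions in a fixed order, hopping only where src and dst differ."""
    path = [src]
    node = src
    for bit in range(dimensions):
        mask = 1 << bit
        if (node ^ dst) & mask:  # addresses differ in this dimension
            node ^= mask         # take the single hop along it
            path.append(node)
    return path


print(dimension_order_route(0, 63))  # worst case: one hop in each of 6 dimensions
print(dimension_order_route(5, 4))   # neighbours: a single hop
```

Because every message crosses the dimensions in the same order, the number of hops is simply the number of bits in which the two addresses differ.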

While it's nice to design something that is theoretically infinitely scalable, nobody is actually going to build one...

However, if you're really up for a challenge, of the common supercomputer topologies, the one with the best bisection bandwidth is a folded clos sometimes called a fat tree. Unfortunately, for that one, you need a high-radix router. Fortunately, you could probably design your boards so that a Prop could be either an end node with I/O and, say, 4 links upward into the tree, or a router, with, say, 16 links (8 down, 8 up).

mctrivia · 2010-01-08 18:07

What you drew is a ring topology. For a mesh you need an X in the middle. Mesh is great because every device is directly connected to every other, but it gets to be too many links quickly.
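To put numbers on "too many links": a full mesh of n nodes needs n(n-1)/2 links, and each node needs n-1 ports. The sketch below also estimates pins per prop, assuming 2 pins per full duplex link (an assumption, as are the function names).

```python
def full_mesh_links(nodes: int) -> int:
    """Links needed so every node connects directly to every other node."""
    return nodes * (nodes - 1) // 2


def pins_per_node(nodes: int, external_links: int = 0, pins_per_link: int = 2) -> int:
    """Pins used on one node: (nodes - 1) mesh links plus any external links."""
    return (nodes - 1 + external_links) * pins_per_link


for n in (4, 8, 11, 16):
    print(n, "nodes:", full_mesh_links(n), "links,", pins_per_node(n), "pins per node")

print("11 props plus 6 external links:", pins_per_node(11, external_links=6), "pins")
```

The 4-node case gives 6 links, which is the square plus the X in the middle.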


Forest Godfrey · 2010-01-08 21:08

Heh, whoops, we apparently abuse the term mesh here (we use it to refer to what would more properly be called a 3D grid. A standard Cray is a torus and we sometimes disconnect the loops in X, Y, or Z to make that dimension a grid, but for some reason, we usually say mesh, not grid)... What I was drawing was a grid topology, though for 4 nodes, there's no difference between a 2D grid and a ring.

In any case, you're right, a fully connected graph gets ugly fast in terms of numbers of links. It's great for bisection bandwidth, though :)

fullspeceng · 2010-01-08 22:23

FPGAs have built in dual port ram. That makes life a lot easier for sharing between two chips.

How does an FPGA not have enough pins? A Spartan 3E PQ208 has 208 i/o pins. If that isn't enough, go with a Virtex 5 with 1200 i/o pins. The top end Virtex 5 has around 30,000 logic blocks. A MicroBlaze processor takes 500-3000 depending on options. Say 1000 on average. So you could have about 30 CPUs in a chip pretty easily. It would probably take you a month of computing time to synthesize ;)

An off the shelf NVIDIA GTX280 has 240 cores for < $300.

Propeller performance is pretty bad if you want to do floating point.

However it's just easier to have a faster chip. A 400mips chip like the XMOS can allocate 50mips per virtual core x 8 and get the same thing done.

I think if you need more than 1 propeller for a big application you should probably look for another chip.

mctrivia · 2010-01-08 22:34

fullspeceng, you are right that there is no practical use for this. I specifically said it has no use until the Prop 2, and even then there are likely better ways.

As for how an FPGA can not have enough pins: if you want to stick with non-BGA chips under $20, pin count is limited. If you want to be able to connect every prop pin together, then lots are needed.

I have decided to play around with a simple 4D hypercube topology. Not because I need lots of power, just because I want to play with programming such a system. I will make boards available when I finish.


Leon · 2010-01-08 22:41

The software is the biggest problem. You could always use XMOS's XC compiler which was designed for systems like that, it's in the public domain. It's a complete waste of time, of course, given that a single XMOS chip costing a few $ will outperform any number of Propeller chips.

Leon


Post Edited (Leon) : 1/8/2010 11:44:50 PM GMT

fullspeceng · 2010-01-08 22:41

The Spartan 3E PQ208 250k or 500k have 208 i/o pins non bga around $20.

Programming CUDA is a lot more fun than programming a ton of props: you get the same experience of trying to manage a thousand threads, and you forgo soldering chips onto 8+ layer boards.
