super parallel computer topologies
I have been trying to think of the best infinitely expandable topology that can be made cheaply on a PCB.
The best I can think of so far is an 11-prop full mesh on each PCB, with each of the props also having a full-duplex serial link to the neighbouring board on each of the 6 sides of a cube. This would use all 32 pins. I could make this using QFP parts relatively cheaply.
Pros:
*full mesh on pcb means each prop is 1 hop away on pcb
*external links on all 6 sides means any number of boards can be connected together in 3d
Cons:
*for communication between x boards the longest path is equal to
(cuberoot(x)-1)*3+1
This is not too bad, since 1000 boards (11000 props) have a max length of 28 hops.
A hypercube of 4096 props has a max length of 12; my topology has a max length of 17.
Any suggestions for better topologies? I figure I can probably sell the PCB at $20 apiece. It would probably cost about $130 to fully populate each board (that price is for you to populate it; if I have to buy the parts it gets more expensive).
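For anyone who wants to check the numbers, here is a minimal Python sketch of the two path-length figures above. It assumes the boards fill a perfect n x n x n cube, and the function names are made up for illustration. Evaluated exactly as written, the formula gives 28 for a 10 x 10 x 10 stack (1000 boards).

```python
PROPS_PER_BOARD = 11  # from the post above


def cube_max_hops(side: int) -> int:
    """Longest path for side**3 boards, per (cuberoot(x) - 1) * 3 + 1."""
    return (side - 1) * 3 + 1


def hypercube_max_hops(nodes: int) -> int:
    """Longest path in a hypercube: one hop per dimension, i.e. log2(nodes)."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("a hypercube needs a power-of-two node count")
    return nodes.bit_length() - 1


if __name__ == "__main__":
    for side in (2, 4, 10):
        boards = side ** 3
        print(boards, "boards,", boards * PROPS_PER_BOARD, "props:",
              cube_max_hops(side), "hops")
    print("4096-node hypercube:", hypercube_max_hops(4096), "hops")
```

Passing the side length rather than the board count avoids floating-point cube roots (1000 ** (1/3) is not exactly 10 in Python).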
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
24 bit LCD Breakout Board now in. $24.99 has backlight driver and touch sensitive decoder.
Post Edited (mctrivia) : 1/7/2010 11:39:41 PM GMT
Comments
There are several hardware topologies, so to speak, but each is tied to a particular software implementation and set of communication requirements.
For example, a hypercube like the one implemented in the XMP-64 (XMOS) is quite nice and uses only 16 chips. Beyond academic work, the question is: which problem are you going to solve?
One possibility is to have several units (cells, small robots) that work in small groups or separately to perform a task. Maybe they could map an area together using IR or ultrasound; that way they can cover much more ground.
A small board that can communicate with others (XBee dongles by Parallax?) and has a couple of sensors, motors, servos and so on could be programmed to do the task I described. Biology provides interesting examples that can be mimicked with interesting results. It does not have to be utterly complicated, just enough that sensors and so on can be connected directly. As for bare-bones boards, you already have a couple.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Visit some of my articles at Propeller Wiki:
MATH on the propeller propeller.wikispaces.com/MATH
pPropQL: propeller.wikispaces.com/pPropQL
pPropQL020: propeller.wikispaces.com/pPropQL020
OMU for the pPropQL/020 propeller.wikispaces.com/OMU
humanoido
2 CPUs = 1 connect
4 CPUs = 4 connects
8 CPUs = 12 connects
16 CPUs = 32 connects
32 CPUs = 80 connects
64 CPUs = 192 connects
128 CPUs = 448 connects
256 CPUs = 1024 connects
512 CPUs = 2304 connects
1024 CPUs = 5120 connects
2048 CPUs = 11264 connects
4096 CPUs = 24576 connects.
(That's a lot of wiring... )
That's with a classic Hypercube with one CPU in each node, of course.
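The list above follows from each of the n nodes having log2(n) links, with every link shared by two nodes. A small Python sketch that reproduces it (the function name is just for illustration):

```python
def hypercube_links(nodes: int) -> int:
    """Number of point-to-point links in a classic hypercube of `nodes` CPUs."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("node count must be a power of two")
    dims = nodes.bit_length() - 1   # log2(nodes) = links per node
    return nodes * dims // 2        # each link is shared by two nodes


if __name__ == "__main__":
    for d in range(1, 13):
        n = 1 << d
        print(f"{n} CPUs = {hypercube_links(n)} connects")
```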
Also, the number of connections a node must control is the same as max length in the topology. That means a node in a 4096 node network needs to control 12 connections, possibly sending and receiving on several of them at any given time.
(I once wanted to build a Z80-based Hypercube... Never could find a supplier of cheap enough CPUs, RAM and EPROMs, though)
To me, HyperCube designs look to fit 'medium-sized' tasks where there's no time-critical inter-process communication.
If fast communication is needed, it will require an OS that's capable of allocating close-proximity neighbours to a task, possibly by using some sort of 'requirements script' file.
(Imagine a sort of high-speed sieve program where you need to run 4 concurrent processes, each with its own data stream. The output of these must then be fed to 2 new processes, which again feed a single process. For best throughput the first-stage processes must be only one interconnect away from the second-stage processes, and so on. This is not necessarily guaranteed if the OS allocates CPUs sequentially.)
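To make the allocation point concrete, here is a rough Python sketch. It assumes hypercube nodes are numbered so that neighbours differ in exactly one address bit (so the hop count is the Hamming distance), and both placements below are hypothetical examples, not the output of any real OS.

```python
def hops(a: int, b: int) -> int:
    """Hops between two hypercube nodes = number of differing address bits."""
    return bin(a ^ b).count("1")


def worst_link(feeds: dict) -> int:
    """Longest producer-to-consumer distance in a placement {producer: consumer}."""
    return max(hops(src, dst) for src, dst in feeds.items())


# 4 first-stage tasks feed 2 second-stage tasks, which feed 1 final task.
# Sequential allocation: tasks simply land on nodes 0..6 in order.
sequential = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}

# Hand-picked placement in a 4D cube where every producer is one hop
# from its consumer (final task on node 0).
neighbour_aware = {3: 1, 5: 1, 6: 2, 10: 2, 1: 0, 2: 0}

if __name__ == "__main__":
    print("sequential worst link:", worst_link(sequential), "hops")
    print("neighbour-aware worst link:", worst_link(neighbour_aware), "hop")
```

With the sequential placement the worst producer-to-consumer link is 3 hops (node 2 to node 5), which is the kind of thing a 'requirements script' would have to steer around.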
Some of the OS issues also crop up to a larger or smaller extent in other designs, too.
(Is a new task to be started on a board which already has a task running, if there are enough free CPUs, or should it be started on a separate board, to allow for easier communications if new sub-processes must be spawned?)
High-speed data streams.
Hypercubes, or even matrices in general, don't lend themselves to processing high-speed data streams unless the CPUs at the 'edge' are used to process the incoming streams and then feed the resulting streams 'inwards'.
This again means that the 'outer' CPUs may need to be reserved somehow for tasks that need that kind of speed.
(Assuming that the data streams aren't connected to the 'core' CPUs in the middle of a matrix, of course. Which must also be considered if using a matrix, as it gives the shortest average connection length to reach any CPU in the matrix)
The alternative is to manually decide on which CPUs a given program should run.
Maybe one design would be to have several 'high-speed buses' passing all the boards, each of which must be allocated to a process on an 'as needed' basis?
Then two physically separate boards can 'hook onto' a bus and communicate directly, without the need to go through all the boards in between.
(If so, this also adds complexity to the OS, unless everything is run manually, as it needs to decide whether a process can be started if a bus is not available and the process is flagged as needing one. Should a process lock a bus from the moment it's started, or only when it needs it? Race and lockout situations aren't fun... This goes for all 'special' resources that are exclusive access.)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Don't visit my new website...
However, there are other, more realistic reasons for wanting a network of independent Props, or other MCUs for that matter.
So I would like to propose one answer to the question "which problem are you going to solve?".
I would propose tackling the problem of fault tolerance. That is the creation of a network of Props that can together continue to correctly perform some task in the face of failure of one of the Props (hardware or software), or failure of one of the communication links.
Traditionally this has been tackled with 3 communicating nodes participating in some kind of voting scheme such that if one node fails there is a majority vote in favour of the correct result and the system as a whole continues to function correctly.
In the early 80's it was shown that for a fault-tolerant network to tolerate one failure in a node or link, 3 nodes are not sufficient; 4 nodes are required. In general, tolerating n faults in the system requires 3n + 1 nodes.
You can read about this rather unintuitive conclusion in the classic paper on the "Byzantine Generals Problem" by Leslie Lamport and others, which I attach here.
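As a quick illustration of the numbers, here is a small Python sketch of the 3n + 1 bound alongside the traditional majority vote. The function names are made up, and the vote shown is only the simple scheme described above: it copes with a node that gives everyone the same wrong answer, not with one that tells different peers different things, which is the case the paper addresses.

```python
from collections import Counter


def min_nodes(faults: int) -> int:
    """Nodes needed to tolerate `faults` arbitrary (Byzantine) failures."""
    return 3 * faults + 1


def max_faults(nodes: int) -> int:
    """Arbitrary failures a network of `nodes` can tolerate under that bound."""
    return (nodes - 1) // 3


def majority(values):
    """Simple vote: return the value held by more than half, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None


if __name__ == "__main__":
    print("nodes for 1 fault:", min_nodes(1))
    print("faults tolerated by 3 nodes:", max_faults(3))
    print("vote of [42, 42, 7]:", majority([42, 42, 7]))
```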
N.B. If you fly the Boeing 777 you might like to consider that it does not satisfy the Byzantine Generals criteria in that it only has 3 Primary Flight Computers in use at once.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
What such a computer could be used for is as a learning tool.
The biggest problem with this kind of system is writing the software.
Do you know of any schools that teach massively parallel programming algorithms?
And have a suitable computer to 'experiment' on?
Many universities and research institutes have supercomputers, but they don't like to let students 'mess about' on them. Crashes, races and lockout situations are 'not encouraged'.
Tinkering with the OS is strongly discouraged...
but I have not even begun to think about how to add more brains to it.
Two apps:
1.Voice recognition.
2.How much can it learn, and what will it invent if it is overtaught?
(I've seen an ANN start generating new information as soon as it started forgetting what it learned.)
In this design, each chip is interconnected via its 32 pins in parallel, giving 1 hop between each chip. A purpose-built multi-cycle protocol could be designed to communicate between chips using Address/Control/Data cycles. A single, distributed clock source could be employed to assure synchronisation between chips/cogs, thus simplifying chip/cog behaviour and access to common resources. Because the Propeller is cycle-predictable, it would be feasible to include some form of fault detection in the adopted bus protocol by using a status/progress mechanism and mailbox in its design.
The bus protocol could address chips and cogs, specify a read and/or write operation or some function to perform on them, or carry a status request or some other control function or data operation. Some form of common chip memory and I/O device access would be required; some of the I/O could be performed by an arbitrary chip controlled by the protocol. Available chip/cog memory and code limits may prevent more sophisticated operations, however.
An interesting academic project because of the chip's design, but for raw data crunching there may be better designs.
As mentioned, it really depends on what goals the design targets, i.e. what would it be used for?
OK, Beat me up Scotty
Neural networks, though...
XC or CUDA are available now, with hardware from XMOS/NVIDIA, for all you supercomputing freethinkers.
And this time there's little place for a Prop!
T o n y
He got scared before posting about XMOS (again) ?
I think the swarm idea represents a low-bandwidth, independent-enough approach. While neural networks sound great, I haven't got the slightest idea what they are, how they work, or why you would use them.
Plessey Roke Manor bought one of my 16-module transputer systems 24 years ago (it cost them about £13,000!), and designed a fault-tolerant system for it. They'd get a customer to pull out two or three modules at random and it would carry on working, albeit with a reduction in performance. My motherboard had two Inmos C004 transputer link chips on it which allowed the topology to be changed in software, it could even be done on the fly with some clever programming. Systolic arrays were popular with transputers for some applications - data just flows across the network from one side to the other. I actually designed a hypercube system using them, with transputers on little vertical towers of tiny PCBs using ChipRack and chip-on-board technology, but never had the funds to build it.
I once did some work with neural nets, they are very interesting. The Propeller won't be much good at them, it just hasn't got the memory or performance. When I worked at BAe MAD, Brough, I supervised a student project using a Dataglove with a neural net for sign language recognition. The NN was coded in C on a PC.
Leon
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
A hypercube is no good as it requires doubling each time you expand. I want to be able to add $200 at a time.
I think adding 3 access buses would be good. That allows 5 hops to be the max no matter the size of the cube, assuming the buses have a dedicated router prop on each board.
I would use an FPGA as a transputer to allow the topology to be changed on the fly and to make routing easy.
An FPGA could be used for signal routing, but special chips are made for that purpose, like the old Inmos C004.
Leon
Sure, it's nice to have all the nodes there, for easier routing, but as long as the OS is capable of keeping track of which nodes/CPUs are online, there's no reason it can't be built bit by bit.
One feature of a successful HyperCube design is flexible routing, which can work around gaps in the pattern. After all, it wouldn't be good if a 1024 CPU design stopped working altogether just because one CPU is on the blink...
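One way to picture routing around gaps is a breadth-first search over the cube that simply skips nodes known to be offline. The Python sketch below is only an illustration under that assumption (a central view of which nodes are down, and nodes numbered so neighbours differ in one address bit); a real system would more likely do this in a distributed way.

```python
from collections import deque


def route_around(src: int, dst: int, dims: int, offline=frozenset()):
    """Shortest path from src to dst in a dims-dimensional hypercube,
    avoiding nodes in `offline`. Returns a list of nodes, or None."""
    if src in offline or dst in offline:
        return None
    previous = {src: None}          # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for d in range(dims):
            nxt = node ^ (1 << d)    # neighbour across dimension d
            if nxt not in offline and nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print(route_around(0, 3, dims=2))
    print(route_around(0, 3, dims=2, offline={1}))
```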
Most HyperCube designs (the ones I've seen, at any rate) are a ring-shaped arrangement of identical cards.
(Lots of vertical cards around a central core, and the interconnects on the outer rim.)
And that can easily be opened up to a straight stack.
If I drop down to 8 props per board, hypercubes with any power-of-2 number of cards should be doable.
6 com channels would probably satisfy 99+% of conceivable users. How many people are going to build a system with more than 64 Props?
To have it act as a router as you mention would not be too hard either, though there is limited buffering one could do.
The problem is that BGA would make the boards too expensive, and I want people to be able to hand solder, so the EP2C5Q208C8N is the best affordable FPGA that could be used. That gives only 142 pins, so hardwiring as many pins as possible is needed.
A 4D hypercube can be routed on a single PCB easily.
This would give 128 cogs for under $200.
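For reference, a short Python sketch of what a 4D hypercube board would need to wire up. It assumes props are numbered 0 to 15 with neighbours differing in one address bit; the numbering and function names are hypothetical, not a layout from this thread.

```python
DIMS = 4            # 4D hypercube: 2**4 = 16 props
COGS_PER_PROP = 8   # each Propeller has 8 cogs


def neighbours(node: int, dims: int = DIMS) -> list:
    """Props directly linked to `node`: flip one address bit per dimension."""
    return [node ^ (1 << d) for d in range(dims)]


def links(dims: int = DIMS) -> list:
    """Every prop-to-prop link on the board, each listed once as (low, high)."""
    return sorted((n, m) for n in range(1 << dims)
                  for m in neighbours(n, dims) if n < m)


if __name__ == "__main__":
    print("props:", 1 << DIMS, "cogs:", (1 << DIMS) * COGS_PER_PROP)
    print("links to route:", len(links()))
    print("prop 0 connects to:", neighbours(0))
```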
Post Edited (mctrivia) : 1/8/2010 3:05:54 AM GMT
If A is talking to D, B to C, and C to B, and everyone goes around clockwise, you deadlock. This can be solved in a variety of ways, all of them at least a little complex to understand. The mesh topology can be easily expanded into a torus, but that won't make the deadlock issue go away.
I'd recommend a hypercube. With the same number of links out of each node as the torus or mesh (6), you get a 64 propeller hypercube. The routing is nice and simple: Order your dimensions, route in each dimension in order. There is at most 1 hop per dimension.
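The routing rule described here can be sketched in a few lines of Python. This assumes nodes are numbered so that bit d of the address is the coordinate in dimension d, and the function name is made up for illustration.

```python
def dimension_order_route(src: int, dst: int, dims: int) -> list:
    """Route from src to dst by fixing one dimension at a time, lowest first.
    Returns the list of nodes visited, including src and dst."""
    path = [src]
    node = src
    for d in range(dims):
        bit = 1 << d
        if (node ^ dst) & bit:   # addresses differ in this dimension
            node ^= bit          # take the single hop across it
            path.append(node)
    return path


if __name__ == "__main__":
    # Opposite corners of a 64-prop (6D) hypercube: one hop per dimension.
    print(dimension_order_route(0b000000, 0b111111, dims=6))
```

Because every message crosses the dimensions in the same fixed order, no cycle of waiting links can form, which is the usual reason this scheme is suggested as a way around the deadlock problem.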
While it's nice to design something that is theoretically infinitely scalable, nobody is actually going to build one...
However, if you're really up for a challenge, of the common supercomputer topologies, the one with the best bisection bandwidth is a folded Clos, sometimes called a fat tree. Unfortunately, for that one, you need a high-radix router. Fortunately, you could probably design your boards so that a Prop could be either an end node with I/O and, say, 4 links upward into the tree, or a router with, say, 16 links (8 down, 8 up).
In any case, you're right, a fully connected graph gets ugly fast in terms of numbers of links. It's great for bisection bandwidth, though :)
How does an FPGA not have enough pins? A Spartan-3E in PQ208 has 208 pins. If that isn't enough, go with a Virtex-5 with 1200 I/O pins. The top-end Virtex-5 has around 30,000 logic blocks. A MicroBlaze processor takes 500-3000 depending on options. Say 1000 on average. So you could have about 30 CPUs in a chip pretty easily. It would probably take you a month of computing time to synthesize ;)
An off the shelf NVIDIA GTX280 has 240 cores for < $300.
Propeller performance is pretty bad if you want to do floating point.
However it's just easier to have a faster chip. A 400mips chip like the XMOS can allocate 50mips per virtual core x 8 and get the same thing done.
I think if you need more than 1 propeller for a big application you should probably look for another chip.
As for how an FPGA can not have enough pins: if you want to stick with non-BGA chips under $20, pin count is limited. If you want to be able to connect every prop pin together, then lots are needed.
I have decided to play around with a simple 4D hypercube topology, not because I need lots of power, just because I want to play with programming such a system. I will make boards available when I finish.
Leon
Post Edited (Leon) : 1/8/2010 11:44:50 PM GMT
Programming CUDA is a lot more fun than programming a ton of props, and you get the same experience of trying to manage a thousand threads while forgoing soldering chips onto 8+ layer boards.