super parallel computer topologies

fullspeceng · 2010-01-08 22:47

I've been playing with the XMOS devices for the last month or so. Not bad and easy to do stuff like switch a few pins but no one would run XC to do anything useful due to no floating point.

If you move beyond hobby/low end commercial then FPGA is actually much easier to use instead of digging through tons of semi functioning blocks. I bought a Nanoboard 3000 from Altium and it's pretty cool. You just drag a core onto the sheet, add the MMU if you need it or drop an SPI bus down, etc. Wishbone interconnects join all the IP cores. So then all peripherals appear as a memory address you can just read/write to.

CUDA isn't bad for limited expansion but MPI (message passing interface) is standard for all multiple cpu processing in the scientific world.

I wrote part of my thesis doing MPI on cluster computers for Quantum Simulations (google VASP).

Leon · 2010-01-08 22:51

Floating-point is available with the XMOS C compiler - C and XC can be used together on the same application. Most XMOS (and Propeller) applications don't need it, of course.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

Gadgetman · 2010-01-08 22:57

The main reason to build a super parallel computer with Propellers is because it's possible...
(But difficult... We all like a challenge, right? )

I mean, anyone can get hold of a 'number cruncher' just by stacking wads of cash on a counter.
What the Propeller allows is a 'relatively cheap' and clean design for a parallel computer.

No use for one?
The same can be said about my collection of Zelda games. They have no practical use, except to waste time.

What about the Hydra?
It can't compete with any Nintendo after the SNES, any of the PlayStations, or even the Xbox, but no one doubts that it has a purpose...

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Don't visit my new website...

stevenmess2004 · 2010-01-08 23:09

This is probably horribly inefficient, but why not just use the current concept of the Propeller? Make each Prop chip a node, use an 8-bit bus connecting each chip, add another Prop to act as the controller, and hang a large SRAM off the controller Propeller. Each node Prop gets enough access time to write or read some stuff to the SRAM, and the SRAM is basically an oversized hub. It wouldn't be fast but it should be relatively easy to do. Depending on what you were doing it may not even be that slow.

fullspeceng · 2010-01-09 02:38

"Super" computer means different things to different people.

It's definitely not cheaper.

How many Propeller chips would you need to call it a "super" computer? 64 x 8 cores = 512 cores? Then you still only have 32k x 64 = 2 megs of RAM and can't do floating point. $8 x 64 = $512 @ 20 MIPS/core * 8 cores/chip * 64 chips ~ 10 billion ips for the chips alone. How much is your PCB going to cost you to design/make? An Intel Core 2 Quad claims around 3 billion ips and costs $100 for the chip alone ($500 for a computer). A $300 Sony PS3 claims 6 gigaflops: http://en.wikipedia.org/wiki/PlayStation_2

Computation really should be left to other devices. If you need a fast bus, PCI Express v3 can transfer 16GB/sec (compare to a 400 Mbit = 50 MByte/sec XMOS bus or 20MHz * 8 bits = 20MB/sec for an 8-bit bus on the Prop). Even PCI is 33MHz * 32 bits.

My $100 iPod can do a lot more, like play HD video.

A Propeller has 32 I/O. A $10-15 Spartan 3E PQ208 has 208 I/O and can be programmed with a few MicroBlaze processors.

I have both the Propeller and XMOS devices and will use them in the college robotics class I'm teaching starting next week. They're great for rapid prototyping of embedded applications, but a supercomputer it is not.

But there is no denying a Propeller is fun :)

mctrivia · 2010-01-09 02:58

Yes, this is why I changed the title to "super parallel computer". It is not about speed but a means of learning.

As for board cost, the attached 2"x6" board would cost me $12 each in quantities of 12.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
24 bit LCD Breakout Board now in. $24.99 has backlight driver and touch sensitive decoder.

fullspeceng · 2010-01-09 03:11

Looks great. Don't let me discourage you from learning. Do you have a use for it figured out yet?

If all you need is a whole bunch of pins, maybe communication to the FPGA via a propeller is a better way:
http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=122-1483-ND

I'd like to see someone make an easy to use Wishbone bus and propeller code to control more pins on an FPGA or an FPGA coprocessor to the Prop.

Phil Pilgrim (PhiPi) · 2010-01-09 03:35

Leon said...
It's a complete waste of time, of course, given that a single XMOS chip costing a few $ will outperform any number of Propeller chips.

Martha, give the Victrola another kick, will ya. The needle's stuck again!

-Phil

fullspeceng · 2010-01-09 04:00

I have to agree with Leon that XMOS is really nice.

My main attraction is that it has a simulator and a free C like programming language.

The market for Parallax and XMOS is very similar but competition is always good.

Hopefully Parallax can hire someone and make comparable quality of tools or market share will start to drift. I think all the tools are open source and XMOS just took things like GCC and open source IDEs.

I can understand them sticking with Spin, but no simulator = no fun.

Gear definitely isn't the same thing.

heater · 2010-01-09 06:28

Prefix "Super-", Etymology: Latin, over, above, in addition.

Yep, this is a super-computer all right.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Leon · 2010-01-09 11:18

Phil Pilgrim (PhiPi) said...

Leon said...
It's a complete waste of time, of course, given that a single XMOS chip costing a few $ will outperform any number of Propeller chips.

Martha, give the Victrola another kick, will ya. The needle's stuck again!

-Phil

Propellers are ideal for many applications, but cobbling lots of them together won't result in a high-performance system, or one that is easy to use. Something scalable is needed, with decent tools that support parallel systems.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

Post Edited (Leon) : 1/9/2010 3:33:48 PM GMT

pharseid · 2010-01-09 17:46

All I know about FPGA's comes from a casual reading of the data sheets, but 142 pins of an FPGA seems like a lot to me. Because you're always sending packets, I would think you could multiplex handshaking signals with data lines. So 8 data bits and a strobe. It appears that the high-speed I/O modes use differential drivers, so 6 pins in each direction (2 data and 4 handshaking) for a total of 12 for each FPGA to FPGA serial link. Or lower speed and 4 data lines each way. Every input feeds into a RAM block configured as a FIFO and every output is derived from a multiplexor which selects one of the FIFO's. So it would seem possible to fit 8 Prop ports and 5 links, enough to support an 8-D hypercube.

-phar

Leon · 2010-01-09 17:56

An FPGA with 144 pins is relatively tiny, most users want a lot more than that. A lot of them will be ground and supply pins, of course. The smallest Xilinx Spartan-3 XC3S50 device in a VQ100 package only has 63 I/Os.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

Post Edited (Leon) : 1/9/2010 6:06:16 PM GMT

pharseid · 2010-01-09 18:32

Well, yes, the chip Mctrivia mentioned was a 208 pin package, that should have been 142 I/O's.

mctrivia · 2010-01-09 18:50

Yes, an 8D hypercube could be done with the FPGA I mentioned, but the infinitely expandable architecture I originally proposed could not.

For cost reasons, playing with 3D and 4D hypercubes makes more sense.

The PCB I have partially laid out will handle 1D to 5D hypercubes (5D requires 2 PCBs).

I think this is the best architecture to play with because it matches binary: each node's identification number describes where it is located in the cube. That should make the transport layer easier.
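The point about node IDs matching binary can be sketched in a few lines of Python. This is only an illustration, not anything from the board design: the function names `neighbors` and `hops` are made up here, and it assumes nodes are numbered 0 to 2^D - 1 with bit d of the ID giving the node's position along dimension d.

```python
def neighbors(node: int, dims: int) -> list[int]:
    """IDs of the nodes directly wired to `node` in a dims-dimensional hypercube.

    Two nodes share a link exactly when their IDs differ in one bit,
    so flipping each bit in turn lists every neighbor.
    """
    return [node ^ (1 << d) for d in range(dims)]


def hops(src: int, dst: int) -> int:
    """Minimum number of links between two nodes: the count of differing ID bits."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    dims = 4  # a 4D cube has 2**4 = 16 nodes
    for node in range(2 ** dims):
        linked = ", ".join(f"{n:0{dims}b}" for n in neighbors(node, dims))
        print(f"{node:0{dims}b} links to {linked}")
```

With this numbering a node needs no routing table to know who its neighbors are, which is presumably what makes the transport layer simpler.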

If I use an M25P80-VMW6TG for the code storage and an AT24C512BN-SH25-T for the boot loader, the entire board would be relatively cheap to populate.

Cost Estimates*:
1D - $46
2D - $62
3D - $94
4D - $158
5D - $316

*Prices based on you buying PCB off me and sourcing parts yourself. Cost to me is likely higher because I will not likely sell an entire panel of boards.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
24 bit LCD Breakout Board now in. $24.99 has backlight driver and touch sensitive decoder.

hinv · 2010-01-10 05:43

Hi Forest,

I was thinking of doing just that, but expanding beyond that into what I called a propgrid, which would leave 8 I/O pins on every side node and 16 I/O on each corner. I hadn't really thought of the deadlock problem, however. Can you explain how this happens?

Thanks,
Doug

Leon · 2010-01-10 06:32

Deadlock is when you have two processes, each waiting for the other to finish, so neither of them terminates. It's a common problem in parallel processing.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

hinv · 2010-01-10 15:23

I know what a deadlock is, but why it would be caused by the square topology described by Forest is the question.

Gadgetman · 2010-01-10 18:06

It doesn't.

Deadlocks can exist on any multi-process system.

Even on old-fashioned 8-bit computers.
All it takes is to have 'exclusive use' resources and 'not perfectly written' programs.

Typically, Process A reserves input 'a' then starts processing.
Process B which also needs 'a', reserves output 'b', then tries to reserve 'a', too, but fails as it is taken.
B goes into a wait mode, waiting for A to finish with 'a', but without releasing 'b'...
A finishes with 'a', but doesn't release it, then tries to reserve 'b' which is busy.

The only way to be REALLY certain that this never happens is to disallow exclusive locking of resources.
Unfortunately, in some cases that can't be done (files, serial ports among them).

Another way to help avoid the problem is to make programs reserve ALL required resources at the beginning, but that can be 'rather' resource hungry, as programmers tend to 'err on the side of caution' and reserve more than they really need.

That leaves an OS subprocess to trawl the process and resource tables and look for deadlocks, which can also be a resource hog. (Particularly on multi-tasking systems with a low number of CPUs)
And how that subprocess is supposed to decide which process to kill to resolve deadlocks is another chapter entirely...
(There's also the question of 'cleanup' if a process is terminated. )

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Don't visit my new website...

heater · 2010-01-10 18:35

Ah, the "Deadly Embrace".

Say we both need a fork and a knife before we can eat. There is only one fork and one knife.
I pick up the fork.
You pick up the knife.
I now want to pick up the knife but cannot, you have it.
You want to pick up the fork but cannot, I have it.
We both go hungry.

I'm not sure that making the rule that all processes must reserve what they want at start up helps, as in that example above. In that case it would help if either:
1) We both claim resources in the same order, say knife first.
2) We both release all resources when we get "stuck" and start again. Although it's not clear we don't repeat the same problem sequence again depending how the timing works out.
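Option 1 can be tried out with a small Python sketch. The names `dine`, `knife` and `fork` are purely illustrative, and ordinary threads stand in for the two diners; the only thing that matters is that both threads claim the locks in the same order, knife first.

```python
import threading


def dine(rounds: int = 100) -> dict:
    """Two diners share one knife and one fork, always claiming the knife first."""
    knife = threading.Lock()
    fork = threading.Lock()
    meals = {"me": 0, "you": 0}

    def diner(name: str) -> None:
        for _ in range(rounds):
            # Same order for everyone: whoever holds the knife is the only one
            # who can be waiting on the fork, so no circular wait can form.
            with knife:
                with fork:
                    meals[name] += 1

    threads = [threading.Thread(target=diner, args=(name,)) for name in meals]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return meals


if __name__ == "__main__":
    print(dine(1000))
```

If one diner took the fork first instead, the fork/knife standoff described above would become possible again, although it might only show up occasionally depending on timing.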

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Gadgetman · 2010-01-10 21:26

The 'release resources if not all are available' is a good idea, but it only works in a perfect world.
- There's always one idiot who writes a badly behaved program.
- Sometimes, releasing a resource before the job is complete may leave it in an 'undesirable state'

To reduce the danger we usually try to build around the problem:
- Adding queueing drivers; the print spooler in Windows is one example, as it can queue jobs from several programs at the same time that are all destined for the same printer.
- Fine-grained locks; the Registry on a Windows PC is one example. Programs often need to update small portions without blocking access for programs needing to read or update other parts of it.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Don't visit my new website...

Forest Godfrey · 2010-01-10 22:44

First, you're going to need at least two virtual channels (one for requests and one for responses) or you have a request to response dependence which will cause deadlock with only two nodes.

Second, the kinds of deadlocks Leon and Gadgetman are describing are due to bugs. This deadlock is due to packet routing and will occur even with otherwise perfectly written programs. It most closely resembles a lock ordering deadlock.

The problem is one of cyclic dependency. Consider the square with everyone talking to everyone else.

     A <-> B
     ^     ^
     |     |
     v     v
     C <-> D

With the simplest protocol, you aren't allowed to drop packets, ever. Let's now assume that B decides to stop accepting packets for a little bit because it's busy and the next packet it needs to accept is heading for D. Now, A can no longer talk to D because B is busy. C can no longer talk to B because A is now backed up. D can't talk to A because C is busy and B can't talk to C because D is busy. And now we're done. B is no longer busy, but can't accept any packets because they need to go to D, which is busy.
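The cycle in that scenario can be checked mechanically. The sketch below is an illustration only: it assumes each packet holds the link it is on while waiting for the next link on its path, and the helper names `channel_dependencies` and `has_cycle` are invented for the example. The four two-hop routes are the ones implied by the description (A to D via B, C to B via A, D to A via C, B to C via D).

```python
def channel_dependencies(routes):
    """Map each directed link to the links it may have to wait for.

    A route is a list of node names; a packet sitting on link i cannot move
    until link i+1 is free, so link i depends on link i+1.
    """
    deps = {}
    for path in routes:
        links = list(zip(path, path[1:]))
        for link in links:
            deps.setdefault(link, set())
        for held, wanted in zip(links, links[1:]):
            deps[held].add(wanted)
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the link dependency graph."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {link: UNSEEN for link in deps}

    def visit(link):
        state[link] = IN_PROGRESS
        for nxt in deps[link]:
            if state[nxt] == IN_PROGRESS:
                return True  # walked back onto our own path: a cycle
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[link] = DONE
        return False

    return any(state[link] == UNSEEN and visit(link) for link in deps)


# Routes from the scenario: every packet travels the same way round the square.
around_the_square = [list("ABD"), list("CAB"), list("DCA"), list("BDC")]
# Alternative: always take the horizontal hop first, then the vertical one.
horizontal_first = [list("ABD"), list("CDB"), list("DCA"), list("BAC")]

if __name__ == "__main__":
    print(has_cycle(channel_dependencies(around_the_square)))  # True
    print(has_cycle(channel_dependencies(horizontal_first)))   # False
```

A cycle in this graph means some set of full buffers can all be waiting on each other, which is the condition for a routing deadlock even when every program is correct.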

There are a variety of ways around this. You can use multiple virtual channel pairs with a "dateline" (every packet crossing from, say, B to D changes virtual channel pairs).
You can use a software credit scheme to guarantee that you never have head-of-line blocking.
You can allow packets to be thrown away (this is what Ethernet does), but then you need a retransmission protocol.
You can do a mathematical proof that you have sufficient buffering such that you're never going to deadlock for a given packet transmission pattern.

The nice thing about the hypercube or tree topology is that you can use one virtual channel pair and very simple routing rules:
Hypercube: Order the dimensions, always route the dimensions in order.
Tree (fat or otherwise): If your destination is below you, go down. Otherwise, go up.
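Both routing rules are short enough to sketch in Python. Everything named here is an assumption made for illustration: `hypercube_route` orders the dimensions from bit 0 upward, and `tree_route` assumes a binary tree numbered heap-style (root 0, children of node n at 2n+1 and 2n+2), which is just one convenient labelling.

```python
def hypercube_route(src: int, dst: int, dims: int) -> list[int]:
    """Dimension-ordered routing: correct the differing ID bits lowest dimension first."""
    path = [src]
    node = src
    for d in range(dims):
        bit = 1 << d
        if (node ^ dst) & bit:
            node ^= bit  # take the link along dimension d
            path.append(node)
    return path


def tree_route(src: int, dst: int) -> list[int]:
    """Tree routing: go down if the destination is below you, otherwise go up."""
    # Chain from the destination up to the root; a node on this chain has the
    # destination somewhere below it.
    chain = [dst]
    while chain[-1] > 0:
        chain.append((chain[-1] - 1) // 2)

    path = [src]
    node = src
    while node != dst:
        if node in chain:
            node = chain[chain.index(node) - 1]  # step down towards dst
        else:
            node = (node - 1) // 2               # step up to the parent
        path.append(node)
    return path


if __name__ == "__main__":
    print([f"{n:04b}" for n in hypercube_route(0b0000, 0b1011, 4)])
    print(tree_route(3, 6))
```

Because every packet in the hypercube crosses dimensions in the same order, a link in a higher dimension never waits on a link in a lower one, so the dependency cycle from the square cannot form.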

mctrivia · 2010-01-10 23:00

Forest Godfrey said...
First, you're going to need at least two virtual channels (one for requests and one for responses) or you have a request to response dependence which will cause deadlock with only two nodes.

Two virtual channels, OK, but for physical communication a full duplex serial connection between each node is enough, right?

A 4D hypercube requires 8 pins for full duplex serial. Making 2 physical channels would require 16 pins, which means either a bigger PCB or tighter trace width/clearance tolerances.
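The pin budget follows directly from the dimension count. A rough Python check, assuming 2 wires per full duplex serial link (one each way) and using the made-up helper names `pins_per_node` and `total_links`:

```python
def pins_per_node(dims: int, wires_per_link: int = 2, physical_channels: int = 1) -> int:
    """I/O pins one node spends on cube links: one link per dimension."""
    return dims * wires_per_link * physical_channels


def total_links(dims: int) -> int:
    """Point-to-point links in the whole cube: dims * 2**dims link ends, two ends per link."""
    return dims * 2 ** (dims - 1)


if __name__ == "__main__":
    for dims in range(1, 6):
        print(dims, pins_per_node(dims), pins_per_node(dims, physical_channels=2), total_links(dims))
```

For a 4D cube this gives 8 pins per node with one physical channel and 16 with two, matching the figures above, out of the 32 I/O pins on a Propeller.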

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
24 bit LCD Breakout Board now in. $24.99 has backlight driver and touch sensitive decoder.

Forest Godfrey · 2010-01-10 23:11

For physical channels, you only need one and it can be either full or half duplex, depending on your speed requirements.

With the way the prop works, the virtual channels can either be implemented as physical channels or software buffering. You have two queues in front of the transmitter and behind the receiver. Round robin between the queues on the transmit side and the receive side. A packet on virtual channel A can never end up on virtual channel B.
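The software-buffering variant might look roughly like the following Python model. It is a sketch under assumptions, not Propeller code: the class name `Link`, the channel names and the method names are all invented, and the physical wire is reduced to moving one packet per call.

```python
from collections import deque


class Link:
    """One physical link carrying two virtual channels via separate queues."""

    def __init__(self, channels=("request", "response")):
        self.order = list(channels)
        self.tx = {vc: deque() for vc in channels}  # queues in front of the transmitter
        self.rx = {vc: deque() for vc in channels}  # queues behind the receiver
        self._next = 0                              # round-robin pointer

    def send(self, vc, packet):
        """Queue a packet on one virtual channel."""
        self.tx[vc].append(packet)

    def transmit_one(self):
        """Move one packet across the wire, taking turns between non-empty queues.

        Returns the virtual channel served, or None if nothing was waiting.
        """
        count = len(self.order)
        for i in range(count):
            vc = self.order[(self._next + i) % count]
            if self.tx[vc]:
                # The packet lands in the receive queue of the SAME channel,
                # so it can never end up on the other virtual channel.
                self.rx[vc].append(self.tx[vc].popleft())
                self._next = (self._next + i + 1) % count
                return vc
        return None


if __name__ == "__main__":
    link = Link()
    link.send("request", "r1")
    link.send("request", "r2")
    link.send("response", "p1")
    while (served := link.transmit_one()) is not None:
        print("sent one packet on", served)
```

Because responses have their own queue, a backlog of requests cannot stop a response from being delivered, which is the request to response dependency the two channels are there to break.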

Ari · 2011-03-03 19:24

pico ITX embedded cpu....fast ethernet....dragonfly BSD.....less money, easy to build, uses industry standard protocols and can be obtained virtually anywhere....is this for academic purposes or computational power? if it's the latter then distributed clustering across propellers is the largest waste of money and time I can imagine....

you could buy intel atom (pineview) embedded boards (with gig-e on board) for $50-$70 wholesale.....ethernet cable is cheap....so are gig switches, and dragonfly BSD is free

you could build a self healing cluster (10 boards with PSU and cabling/switches) for under $2k....you could deploy it in a few hours and it would run circles around a propeller cube....both in terms of throughput and power consumption....so much of the used power in microcontrollers is wasted in poor voltage regulation.....

I wish that would be addressed by vendors....dc-dc switching regulators on board....with the ability to draw reasonable current (3A), rather than the linear regulators currently being employed....

from my understanding that is how amd and intel are controlling power consumption....software controlled switching regulators.....please correct me if I am wrong
