How to Go From Tera to Peta?
Humanoido
In your opinion, what is the best way (with regard to relative simplicity, economical cost, and short time investment) to take the Big Propeller Brain (a collection of Propeller chips and other processors running in parallel) from TeraFLOPS speed to the next level, PetaFLOPS?
Basically I'm interested in developing a small supercomputer with a Propeller front end that does controlling, with a couple hundred Propeller chips, and perhaps some streaming processor cards that handle the high processing speed (now that prices have dramatically dropped). Future apps could include telescope control and extremely high speed universe simulation.
Another option is if anyone knows of NASA resources for supercomputer time or a cloud computing arrangement.
The project is currently combining Propellers for control and AMD streaming processors for speed. I'm open to suggestions for more cards and more Propeller chip enhancements. Thanks sincerely for your reply.
http://humanoidolabs.blogspot.com/
Comments
courtesy of a frustrated baker: http://rinapedia.com/2011/04/18/whole-wheat-pita-bread/
-Phil
My guess is that it's the parallelism that is being experimented with, and the Prop can achieve that quite cheaply.
It's just like the SETI project - lots of PCs crunching little pieces rather than a huge monolith costing a fortune.
I think it might be attributed to the Dunning-Kruger effect...
Hey, I resemble that. I mean, I resent that.
-N
But making a network of Propellers is not as stupid an idea as it seems at first glance.
Weather simulation needs some simple but fast computation for every point in the atmosphere. To compute the next state of a point, you take its current state and the states of some neighbours as input. You have to communicate only with neighbours.
A network of Propellers can do this, one Propeller for 8 points (8 cogs!).
These models operate on grids some kilometres in size. If the grid spacing is 50 km (about 0.5 degrees), you need about 60x60x20 (height) = 72,000 points = 9,000 Propellers in a network to build a machine that can simulate the weather over Europe.
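To make that concrete, here is a rough C sketch of the kind of per-point update such a grid model performs. The grid size matches the example above; the update rule itself is a toy stand-in (relaxing toward the mean of the six face neighbours), not real atmosphere physics.

    #include <stdio.h>

    #define NX 60   /* grid points east-west   */
    #define NY 60   /* grid points north-south */
    #define NZ 20   /* height levels           */

    static float cur[NX][NY][NZ];   /* current state of every point */
    static float nxt[NX][NY][NZ];   /* next state being computed    */

    /* One time step: each interior point needs only itself and its six
       face neighbours, which is why the work maps onto one cog (or one
       small core) per group of points. */
    static void step(void)
    {
        for (int x = 1; x < NX - 1; x++)
            for (int y = 1; y < NY - 1; y++)
                for (int z = 1; z < NZ - 1; z++)
                    nxt[x][y][z] = (cur[x-1][y][z] + cur[x+1][y][z]
                                  + cur[x][y-1][z] + cur[x][y+1][z]
                                  + cur[x][y][z-1] + cur[x][y][z+1]) / 6.0f;
    }

    int main(void)
    {
        step();
        printf("one step of %d points done\n", NX * NY * NZ);
        return 0;
    }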
Whichever way you look at it, the computing you can do with 9,000 Propellers (as in your example) could be done more easily and cheaply with a handful of PCs, especially if you can press the GPUs into use, and more especially if you want to use floating point arithmetic.
It might make sense if your computing task also requires the use of the nearly 300,000 I/O pins such a system would have and communication between nodes is not paramount. Or if you can make use of all the video generators or counters.
I wrote a software synthesizer. It eats 20% of the computing power of my 2x2500 MHz Athlon 64, a roughly 15,000 MIPS processor (and it is not an asm program).
After seeing SIDcog, I think it is possible to fit this in one 20 MIPS cog. Maybe two of them.
These high-level programming languages and OSes waste a lot of a PC's processing power.
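For a sense of why one 20 MIPS cog could plausibly be enough, here is a hedged C sketch of a phase-accumulator voice, the classic integer trick such a synth would use; the names and constants are illustrative. Each sample costs an add, a mask, and a select, so there is no floating point anywhere in the per-sample path.

    #include <stdint.h>
    #include <stdio.h>

    /* One square-wave voice: a 32-bit phase accumulator. 'step' sets the
       pitch (step = freq * 2^32 / sample_rate); the top bit of the phase
       selects the output level. */
    static int16_t next_sample(uint32_t *phase, uint32_t step)
    {
        *phase += step;
        return (*phase & 0x80000000u) ? 20000 : -20000;
    }

    int main(void)
    {
        uint32_t phase = 0;
        uint32_t step  = (uint32_t)(440.0 * 4294967296.0 / 44100.0); /* 440 Hz */
        for (int i = 0; i < 8; i++)
            printf("%d\n", next_sample(&phase, step));
        return 0;
    }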
Just because you have a million little tasks or jobs to do in your model does not mean you need a million
operating system threads or processes to do them. A single thread could do a little of each of your tasks
on every iteration with no context switching overhead.
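A minimal C sketch of that idea, with a trivial stand-in for the real per-task work:

    #include <stddef.h>
    #include <stdio.h>

    #define N_TASKS 1000000          /* a million little jobs      */

    static int state[N_TASKS];       /* one word of state per task */

    static void update_one(size_t i) /* stand-in for the real work */
    {
        state[i]++;
    }

    int main(void)
    {
        /* One plain loop services every task each iteration: no threads,
           no stacks, no scheduler, no context-switch overhead. */
        for (int iter = 0; iter < 10; iter++)
            for (size_t i = 0; i < N_TASKS; i++)
                update_one(i);
        printf("task 0 state: %d\n", state[0]);
        return 0;
    }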
If running that model is important you will have few other processes on that machine getting in the way.
Not sure what you mean by wastage through memory transfers; all computers do that. Having your tasks
communicate through memory will always be faster than going through some communications channel
to another processor. Having "real parallel processing" is not an end in itself. It's a means to get more work
done in a given time than a single processor can on its own. From a purely logical point
of view a parallel machine cannot do anything more than a single processor machine can. Let's run the numbers:
Unless your task is so small that it fits in a COG you are not going to get 20 MIPS.
The best you can do is use LMM PASM and get, say, 5 MIPS.
So you need 2000 / 5 = 400 cogs to get the equivalent x86 MIPS.
We are probably going to use a couple of COGs on each Propeller for communications
and housekeeping, so to get 400 cogs for actual work we need 400 / 6, or about 67, Propeller chips.
Round here that will cost me about 10 Euro per Prop, or 670 Euros. Let's double that cost
for all the supporting circuitry, power supplies and hardware required to build a working
parallel machine from Props, which I think is generous. Say about 1300 Euros.
For that price I can get myself a couple of laptops giving me twice the compute power.
If I were to make use of their GPUs for doing calculations I'd get 10 or 100 or more
times the compute power than the Prop array. Not only that but programming the thing will
be a lot easier.
Now all the above assumes we are working with integer arithmetic. If I want to work in floating
point I'm going to have to have 100 times more Cogs to get the same FLOPS. So we are up to
130,000 Euros!!
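Pulling that estimate together in one place as code (all constants are the post's own assumptions, not measurements; the post rounds 1340 down to about 1300):

    #include <stdio.h>

    int main(void)
    {
        const int pc_mips       = 2000; /* target x86-class MIPS                 */
        const int mips_per_cog  = 5;    /* LMM PASM estimate                     */
        const int usable_cogs   = 6;    /* 8 cogs minus 2 for comms/housekeeping */
        const int euro_per_chip = 10;

        int cogs  = pc_mips / mips_per_cog;                 /* 400          */
        int chips = (cogs + usable_cogs - 1) / usable_cogs; /* 67           */
        int euros = chips * euro_per_chip * 2;              /* ~1340        */

        printf("%d cogs -> %d chips -> ~%d Euro integer, ~%d Euro float\n",
               cogs, chips, euros, euros * 100);
        return 0;
    }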
Then we have to think about how fast we can get the data into and out of the Props. This is
going to be slow. I'll leave evaluating the cost/overhead of that as an exercise. If the Props have to
talk between each other that is probably going to eat more cogs on each one jacking the costs
up further.
I would not say a lot. If you program in C you are going to be getting code that runs
almost as fast as writing in assembler. C++ can add overheads if used unwisely but otherwise
is as good as C. Then there is compiled code from Fortran, Ada, etc etc. Of course if you
use Perl, Python, Java etc you are sacrificing performance to interpreters and virtual machines.
Run your code on Linux with few other processes going on and it will get most of the processor time.
Conclusion: building a parallel processor machine out of Propellers in order to get compute performance
is not in any way practical or economical given the availability of PC-class machines.
Still, I'd be very happy if someone would come up with a working parallel Prop machine
that does something I cannot do with a PC or two. I imagine it would have to be
making use of the Props' I/O or video or counters. But then we are basically looking at using
the Props as I/O devices, which is what they are built for.
Yes, I thought of a bunch of the same simple tasks which fit in one cog, don't need LMM, and communicate with neighbours via I/O pins. When using LMM, external RAM, etc., power is lost to all of these overheads and it is really better to get an x86 machine.
Some years ago I was writing a program (sound synthesis) in Delphi (Object Pascal). I had about 10 ns per loop iteration on a 1 GHz processor to do what I wanted. Pascal's loop took more than 50 ns, so I had to write it in manually optimized assembly code, which did the job in about 9 ns.
A little secret: modern CPUs are super general-purpose.
Unlike older CPUs (think the 8/16-bit era), which had special purposes, modern CPUs, like you said, lose a lot of time to the OS and to organizing the multitasking functions.
If your task can have work offloaded into the way you design your system, you could in fact see a massive boost in what it can do in fewer cycles.
Good luck on your goals.
This is just my stance on MIPS, GIPS, TIPS and PIPS, and goals:
Think of the simplest way to do each step and then divide the steps among whatever will do each step best. That way you can get the most out of each part's performance.
Remember MIPS and such are variable; if the chip has, for instance, no hardware multiply, you will lose many cycles to doing it in software (see the sketch below).
M illions (of)
I nstructions ("instruction" is the key word: what is a CPU instruction but a single command, a single thing the CPU can do in that moment?)
P er
S econd.
The things you need to ask are "what instructions can this CPU/MCU do?" and "which ones do I need?"
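As a sketch of the hardware-multiply point above: on a chip with no MUL instruction, a 32-bit multiply becomes a shift-and-add loop like this, up to 32 iterations where hardware does it in one instruction.

    #include <stdint.h>
    #include <stdio.h>

    /* Software 32-bit multiply (result modulo 2^32): the classic
       shift-and-add method a compiler or library falls back on when
       the CPU has no hardware multiplier. */
    static uint32_t soft_mul(uint32_t a, uint32_t b)
    {
        uint32_t result = 0;
        while (b != 0) {
            if (b & 1u)        /* low bit of b set: add shifted a */
                result += a;
            a <<= 1;           /* move to the next bit position   */
            b >>= 1;
        }
        return result;
    }

    int main(void)
    {
        printf("%u\n", soft_mul(1234, 5678));  /* prints 7006652 */
        return 0;
    }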
Also, sometimes something as small as the layout can cost you. If you have 2 CPUs sharing RAM, like some early shared-RAM systems, but they have to wait for each other too much, it will slow everything down. But if you can make a system where the RAM is shared via a hardware device that allows both CPUs to work oblivious to each other, it will allow a major boost in speed. Design matters a lot.
I can't quite put my finger on what you want aside from more instructions per second,
but I would ask "what do I want to do with PIPS?" and then see if I can do it with what I can get.
If you want some inspiration, watch what some people do with C64s and Atari 800s; it's crazy what 1 MHz and 8 bits can do.
When I show computer guys at colleges what some people have done with 1 MHz, their jaws drop. No one seems to believe it, but here it is, 20 to 30 years later, and people are still pushing more out of these old computers.
Instructions per second are not the end-all measure of a system. They are just a measure of how many of that CPU's instructions can be done in one second.
And the OS restricts access to I/O, and specifications are kept secret (GPUs) so you have to use proprietary tools or reverse engineering, and you don't mind that your "hello, world" program is 18 MB long after it compiles; even stripped of debug information, the exe still exceeds 2 MB.
I still use an Apple IIe for work, lol.
Let's accept the "job fits in a cog" limitation. It's a severe limitation, but we can do an FFT in a COG, so there must be lots of other useful things that fit.
That buys you 4 times the performance, but I'd argue that the cost and complexity of it all still does not stack up against using modern general-purpose CPUs. If the laptop is too expensive, use a few ARM boards at 100 dollars a piece. The Raspberry Pi board is a quarter of that price, when we can get them.
Having the compute cogs communicate via pins is not such a good idea, when they are wiggling the pins to communicate they are not computing the job at hand. Better to have dedicated communication cogs so that comms and computation can go on in parallel.
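A rough C sketch of that split, with illustrative names; on a Propeller the two workers would be cogs and the mailbox would live in hub RAM.

    #include <stdint.h>

    /* One-slot mailbox shared between a compute worker and a comms worker,
       so pin-wiggling and number-crunching can proceed in parallel. */
    typedef struct {
        volatile int32_t data;
        volatile int     full;    /* 0 = slot empty, 1 = ready to send */
    } mailbox_t;

    static mailbox_t box;

    /* Compute side: publish a result, then get straight back to work. */
    static void post_result(int32_t result)
    {
        while (box.full)
            ;                     /* rare wait: comms side is draining */
        box.data = result;        /* write the value first...          */
        box.full = 1;             /* ...then raise the flag            */
    }

    /* Comms side: poll the flag and push values out over the pins. */
    static void comms_poll(void)
    {
        if (box.full) {
            int32_t v = box.data;
            box.full = 0;         /* free the slot for the next result */
            (void)v;              /* send_over_pins(v) would go here   */
        }
    }

    int main(void)
    {
        post_result(42);          /* single-threaded demo of the handshake */
        comms_poll();
        return 0;
    }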
I would not argue against hand crafting time critical code in assembler when really required. Been there many times. However it would have to be a necessity as I like my code to run on x86 or ARM or whatever. It's nice that I can now test C code on the PC and then run it on the Prop with propgcc.
Only wider and faster and with a lot more memory than back in the day.
Perhaps your operating system gets in the way. Mine does not. If I only run the one important process
that does all this high speed computation there will only be minimal multitasking overhead to worry about.
pik33,
Perhaps your operating system restricts access to I/O. Mine does not. If I want direct access to some
hardware I can put my code into a device driver. Job done.
Anyway, this talk of operating systems is not relevant to the debate. Nothing stops you writing your
program in such a way that it is booted straight from the BIOS or boot loader and is the only thing that
runs on the machine. Of course, then you will want console, serial, network, etc. I/O, and you'll wish you had your
OS kernel back again :)
Hello world on my PC compiles to 5516 bytes. Yes, it uses shared libs, but they are in memory
waiting to be used anyway, because a bunch of other programs run here that also use them.
I agree. I hate that. All devices should come with interface and operating specs.
And it's not the CPU's fault; I am only saying the upkeep of separating processes out can take up more machine cycles.
If a CPU is made with a specific purpose in mind, it can power through things a general-purpose CPU may take slowly.
Examples include how a game system can at times outperform a PC in the same CPU class in games, but cannot seem to do as much.
Sega Genesis games needed a 66 MHz 486 to run even when rewritten into machine code for the x86; examples include Sega's 6pack PC edition and the Sonic and Knuckles Collection.
Current systems like the Wii, Xbox 360 and PS3 are outperforming PCs in some ways due to less multitasking and fewer background tasks.
Plus, for example, the PowerPC architecture in the Wii and the Cell CPU in the PS3 are set up for very specific tasks.
But in some ways this limits what the systems can be, whereas PC OSes and architecture are more open and free and can do more than one thing at a time; yet no matter what, you lose some CPU cycles to the upkeep of stacks and to interruptions when time-sensitive areas of programs need attention.
this truth is always present.
In this way I just feel that what you want to do with this power is important, since you may be able to offload some work into the design...
Again, not saying anyone is wrong; I am just stating my view on the importance of the design over the pure instruction cycle rate.
FreeDOS: excellent example. DOS plus the Watcom 32-bit C compiler was my favorite way to code back in the DOS days. I should look into Open Watcom. For sure FreeDOS keeps out of your way.
I do agree with your comments re: operating systems getting out of control. For this reason I have been using Linux exclusively since about 1997. Now I find Linux distributions are throwing things in that I don't want or need and suck processor time and disk space. (The file indexing service in KDE for example). Still it's not so hard to track those things down and disable them or there are always lean and mean distributions to choose from.
And yes, careful design can get you more benefit than simply a faster CPU. In fact, good design can help you overcome a crappy compiler.
I recently did an experiment rewriting some C++ server functionality in JavaScript (yes, slow, interpreted JS running on a server). It turns out the JS version is faster! So far I can only put this down to the fact that the JS version does not use threads and we switched to JSON format data instead of XML.
Don't forget that adding dedicated hardware for specific functions can make a mediocre microprocessor look fast in terms of performance. That's what the Amiga and Atari ST designers did. Sadly, both companies shot themselves in the head with incredibly stupid management decisions.
And five Tesla cards for even 1 PetaFLOPS? No, it should be more like 334 cards (at roughly 3 TeraFLOPS per card, as on recent GTX 680 video cards; Chinese scientists have done it. I can't remember where, but I have read about it).
And yet, don't forget about current consumption: 180 nm transistor leakage and internal resistance; wiring voltage drops (a function of the resistance of the wiring; I use ElectroDroid, an Android program, to figure all of those messes out); power factor in the SMPS array; thermal losses (again, a function of transistor leakage, quite negligible at lower clocks, but they start to matter more when overclocked). I think you guys may already know all of that, though.
In all, it's not practical. Impossible? Nope, we do it anyway, in the spirit of exploiting the Propeller's inner workings.
BTW, I agree, even the 68K is still a decent CPU, even in microcontroller space.
EDIT: Food for thought: I went for 1.6 PIPS. Dividing by the Propeller II's 1.6 GIPS horsepower, to approach that you would have to link a million Propeller IIs, not to mention the power consumption! At a 600 mA consumption threshold per Propeller II, that would draw about 600 kiloamps... Or higher! O___O
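The two bits of arithmetic in this post, worked through as a rough C check with the figures quoted in this thread (roughly 3 TFLOPS per GPU card; 1.6 GIPS and 600 mA per Propeller II, all assumed figures):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double cards = ceil(1.0e15 / 3.0e12);  /* 1 PFLOPS / 3 TFLOPS = 334 */
        double chips = 1.6e15 / 1.6e9;         /* 1.6 PIPS / 1.6 GIPS = 1e6 */
        double kamps = chips * 0.6 / 1000.0;   /* at 600 mA per chip        */
        printf("%.0f GPU cards, %.0f Propeller IIs, %.0f kA supply current\n",
               cards, chips, kamps);
        return 0;
    }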