G_FLOPS ... Possible with FPU units?
TJHJ
Posts: 243
Hi,
So I am cooking up a heavy math-crunching application, and was wondering if anyone could suggest how many FLOPS (floating-point operations per second) can be obtained from an FPU, or from a few of them. I am trying to get around 2 GFLOPS. More would be better, but I am not sure I can move the data in and out.
I am looking to have the prop act as the center point passing the data around and making it into something useful.
Any suggestions on what FPU chips could be interfaced with the prop or get anywhere close to that kind of speed?
Thanks for everyone's help.
Comments
-Phil
http://www.nvidia.com/object/cuda_home_new.html
No.
Even if you had an infinitely fast FPU, whether as software in a cog or as an external device, you would still only have a 20 MIPS processor in the Prop, with even slower access to RAM or the outside world.
No way to move data around fast enough to get 2GFLOPS.
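To put a rough number on that, here is a back-of-the-envelope sketch. Only the 20 MIPS figure comes from the post above. The one-word-per-instruction and three-words-per-operation figures are illustrative assumptions (and optimistic ones), not measured Propeller numbers.

```python
# Rough ceiling on FLOPS when a 20 MIPS processor has to shuttle every operand.
MIPS = 20e6                 # instruction rate quoted in the post above
WORDS_PER_INSTRUCTION = 1   # assumption: at best one 32-bit word moved per instruction
WORDS_PER_FLOP = 3          # assumption: two operands in, one result out

def transfer_limited_flops(instr_per_sec, words_per_instr, words_per_flop):
    """Upper bound on FLOPS if data movement is the only cost."""
    return instr_per_sec * words_per_instr / words_per_flop

target = 2e9                # the 2 GFLOPS goal from the original question
ceiling = transfer_limited_flops(MIPS, WORDS_PER_INSTRUCTION, WORDS_PER_FLOP)

print(f"transfer-limited ceiling: {ceiling / 1e6:.1f} MFLOPS")
print(f"shortfall against 2 GFLOPS: about {target / ceiling:.0f}x")
```

Under these assumptions the Prop falls short by a factor of a few hundred before the FPU does any arithmetic at all.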
A single GPU processor can hit nearly 4 TFLOPS. The internal bandwidth of a high-end GPU card is well over 200 GB/s. Send the GPU a few megabytes of data along with an OpenCL kernel, then sit and wait for an answer.
The answer comes down to what type of workload you have. The BCM2835 used in the Raspberry Pi is supposed to do 10+ GFLOPS, uses a couple of watts, and costs under $20.
2 GFLOPS would be absurd to attempt with an array of Propeller chips. Each cog can do about 20 kFLOPS (timed using fmult in Float32Full). So 2 GFLOPS would require 100,000 cogs, or more than fourteen thousand Propeller chips (each running Float32Full in seven cogs, plus one cog for communication).
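The arithmetic behind those counts can be checked with a short sketch. The 20 kFLOPS per cog and the seven math cogs per chip are the figures from this post; the constant and function names are made up for illustration.

```python
import math

TARGET_FLOPS = 2e9        # 2 GFLOPS goal
FLOPS_PER_COG = 20e3      # about 20 kFLOPS per cog (fmult in Float32Full, per the post)
MATH_COGS_PER_CHIP = 7    # seven cogs doing math, the eighth handles communication

def cogs_needed(target_flops, flops_per_cog):
    """Number of cogs required, rounded up to a whole cog."""
    return math.ceil(target_flops / flops_per_cog)

def chips_needed(cogs, math_cogs_per_chip):
    """Number of chips required, rounded up to a whole chip."""
    return math.ceil(cogs / math_cogs_per_chip)

cogs = cogs_needed(TARGET_FLOPS, FLOPS_PER_COG)
chips = chips_needed(cogs, MATH_COGS_PER_CHIP)
print(f"{cogs:,} cogs, {chips:,} chips")
```

This ignores communication overhead entirely, so the real count would only be higher.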
-Phil
Please don't be misled. No reasonable array of Propeller chips can achieve performance in the GFLOPS range, let alone TFLOPS. Even if you had enough Propeller chips to hit such a rate theoretically, the communication overhead involved in uploading parameters and downloading results would reduce the whole system to a crawl. When Humanoido refers to the "Big Brain", it encompasses more than just an array of Propeller chips. If he's getting TFLOPS performance, it's coming from a GPU, not the Propellers.
-Phil
Hmm..Wonder how much that would weigh? What volume it would occupy, using demo boards say? How much power it would consume?
All in all I think one would be better off linking your Propeller, via FullDuplexSerial, to one of these:
which ticks over at a steady 1750 TeraFlops.
Wouldn't Parallax LOVE to get an order for fourteen thousand Propellers to try on this project?
Jim
-Phil