interface to a PCI graphics card?

prof_braino Posts: 4,313
edited 2012-04-22 11:54 in General Discussion
I hear folks are using bunches of graphics cards as a "supercomputer": all the cores, stream processing units, memory.

What about using a Prop, i.e. a Demo Board configuration, as the user I/O node, and a PCI graphics card as the processing node?

Anybody play with this yet?

Comments

  • Heater. Posts: 21,230
    edited 2012-04-18 20:14
    Probably very hard to do. Any driver for a modern graphics card is amazingly complex and mostly the drivers are closed source and the interface to the hardware undisclosed. Although the vendors have been softening up on that a bit recently.
    I would imagine that with the slow data rate in and out of the Prop and the minimal memory space within the Prop there is not much point in doing it anyway.

    My conclusion: Impossible to do and mostly pointless.

    There, with that said I can be sure someone has done it already
  • rod1963 Posts: 752
    edited 2012-04-18 23:06
    Professor

    The NVIDIA cards that do all that neat parallel processing stuff use PCI-Express, which is a lot different from your standard PCI. You'd need an FPGA to interface to it. In short, it's a major project even for people who know the PCI-Express bus and VHDL coding. Plus forget about getting NVIDIA driver info unless you're a tier one PC vendor.
  • pik33 Posts: 2,402
    edited 2012-04-19 05:39
    rod1963 wrote: »

    Plus forget about getting NVIDIA driver info unless you're a tier one PC vendor.

    There are the Nouveau open-source NVIDIA drivers for Linux. That team did a lot of reverse engineering on these cards.
  • prof_braino Posts: 4,313
    edited 2012-04-19 07:22
    Heater. wrote: »
    Probably very hard to do. Any driver for a modern graphics card is amazingly complex and mostly the drivers are closed source and the interface to the hardware undisclosed. Although the vendors have been softening up on that a bit recently.
    I would imagine that with the slow data rate in and out of the Prop and the minimal memory space within the Prop there is not much point in doing it anyway.

    My conclusion: Impossible to do and mostly pointless.

    There, with that said I can be sure someone has done it already

    As you say, it's already been done.

    http://www.engadget.com/2009/12/14/university-of-antwerp-stuffs-13-gpus-into-fastra-ii-supercompute/

    To clarify, the goal is NOT to use a video card as a video card; it is to use the processor as a massively parallel computing node. A traditional driver is NOT what is needed here.

    The role of the Prop is to handle the sensors and actuators, as it lends itself well to this application. Again, the role of the GPU is to crunch big data, as expected.

    The "slow data rate of the prop and minimal memory space" are completely appropriate for the prop target application. In fact if needed, the GPU can be slowed as needed, it will still crunch a lot better than a prop, and nicely fullfil this role.

    Remember, the goal is not to go toe to toe with Cray or Fujitsu; the goal is to get something remotely similar in concept for under $200, so more undergrads can play more often.
  • prof_braino Posts: 4,313
    edited 2012-04-19 07:27
    rod1963 wrote: »
    Professor

    The NVIDIA cards that do all that neat parallel processing stuff use PCI-Express, which is a lot different from your standard PCI. You'd need an FPGA to interface to it. In short, it's a major project even for people who know the PCI-Express bus and VHDL coding. Plus forget about getting NVIDIA driver info unless you're a tier one PC vendor.

    Ok, the first step forward is made: PCI-E is what we want, not old PCI. I heard somebody mention FPGAs in the past, and interfaces to these cards exist, so the request is for somebody who knows about this stuff. Probably hard to find on these forums; maybe I'll have to look elsewhere. It's ok, I can wait.
  • Heater. Posts: 21,230
    edited 2012-04-19 08:15
    Well, there is the thing: once you have thrown an FPGA into the mix, there is probably no need for a Prop any more.
  • rod1963 Posts: 752
    edited 2012-04-19 10:00
    Professor

    The fly in the ointment is that the Prop cannot host the NVIDIA development environment. It's predicated on Win, Linux, or Mac hosts. And the tools are real resource hogs.
  • prof_braino Posts: 4,313
    edited 2012-04-19 13:48
    I wasn't thinking about hosting NVIDIA development; I was thinking somebody would write a crunching app on the GPU, and the Prop would just pump a meager data stream to it for crunching. The NVIDIA dev can happen on any favorite environment.

    Maybe the PCI bit is the mistake. It doesn't have to be PCI; it just needs to be an app on the GPU that we can send a serial stream to for processing, and receive some result when it's finished.
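
    To make that concrete, here is a minimal sketch of the kind of crunching app I mean, assuming a CUDA card in a host that can also see the Prop's serial stream (the /dev/ttyUSB0 port name, the batch size, and the "crunch" kernel are all placeholders I made up, not anything that exists yet):

        // Host-side sketch: buffer one batch of floats from the Prop's serial
        // stream, run a placeholder CUDA kernel over it, copy the result back.
        #include <cstdio>
        #include <fcntl.h>
        #include <unistd.h>
        #include <cuda_runtime.h>

        #define BATCH 4096                           // floats buffered per launch

        // Placeholder crunching: square every sample in place.
        __global__ void crunch(float *d, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] = d[i] * d[i];
        }

        int main()
        {
            float h[BATCH], *d;
            cudaMalloc(&d, sizeof(h));

            int fd = open("/dev/ttyUSB0", O_RDONLY); // serial link from the Prop
            if (fd < 0) { perror("open"); return 1; }

            size_t got = 0;                          // fill one batch of raw floats
            while (got < sizeof(h)) {
                ssize_t r = read(fd, (char *)h + got, sizeof(h) - got);
                if (r <= 0) { perror("read"); return 1; }
                got += (size_t)r;
            }

            cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
            crunch<<<(BATCH + 255) / 256, 256>>>(d, BATCH);
            cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);

            printf("first result: %f\n", h[0]);      // ship results back as needed
            cudaFree(d);
            close(fd);
            return 0;
        }

    In this sketch the Prop's only job is to fill the buffer; whatever hosts the card owns the PCI-E plumbing.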
  • rod1963 Posts: 752
    edited 2012-04-19 20:27
    Professor

    You gain nothing by this approach; it's easier to buy a used PC with the appropriate bus, then go to geeks.com and buy a $40 CUDA-enabled GPU card. This all can be done for about $300.00, well within the range of a hobbyist or student.

    And you need the PCI-E bus to talk to the GPU; there is no getting around this, and that means you need a bridge chip to handle the interface, which equals a big-dog FPGA like the Spartan-6 series, which comes in BGA only. Maybe it's not the only way, but no one AFAIK has hacked an NVIDIA GPU the way you want and succeeded.
  • pik33 Posts: 2,402
    edited 2012-04-19 23:13
    NVIDIA GPUs have been reverse engineered by the Nouveau open-source Linux driver team. If there is an open-source driver for them, anything can be possible.

    A Propeller with a new NVIDIA GPU looks like the motor from the smallest possible lawn mower added to a heavy truck. If we add some suitable custom-made gears, yes, it can make this truck move, but only very slowly.

    We can still add this motor to a smaller car. I have some old PCI and ISA 2D-accelerator cards with 512KB/1MB RAM and documentation. They could be suitable to attach a Propeller to, particularly the ISA one.
  • prof_braino Posts: 4,313
    edited 2012-04-20 08:03
    rod1963 wrote: »
    You gain nothing by this approach; it's easier to buy a used PC with the appropriate bus, then go to geeks.com and buy a $40 CUDA-enabled GPU card. This all can be done for about $300.00, well within the range of a hobbyist or student.

    And you need the PCI-E bus to talk to the GPU; there is no getting around this, and that means you need a bridge chip to handle the interface, which equals a big-dog FPGA like the Spartan-6 series, which comes in BGA only. Maybe it's not the only way, but no one AFAIK has hacked an NVIDIA GPU the way you want and succeeded.

    Sorry, this is my shallow understanding of the concept. I thought the Antwerp guys linked above did exactly this.
    While using a stock graphics card does require the PCI bus, we don't want to use the GPU as a graphics card, so we don't need it.
    The GPU chip, AFAIK, does NOT require PCI-E to function. The GPU is to be used as a pool of processing power with a pool of memory. All we need is a serial or parallel channel to talk to it, and the channel's speed can be whatever; all it needs to send is the result. The GPU does the manipulation of large amounts of data.
    At least this is my thought.
  • prof_braino Posts: 4,313
    edited 2012-04-20 08:10
    pik33 wrote: »
    A Propeller with a new NVIDIA GPU looks like the motor from the smallest possible lawn mower added to a heavy truck.

    Maybe I got it backwards. I thought using a heavy-duty GPU to process the relatively small data stream from one or a series of Props would result in something like these:

    http://csudigitalhumanities.org/exhibits/archive/files/2ea878ba7cdc2803db6d1f218a914a18.jpg
    http://2.bp.blogspot.com/_7gYNb3GSh4M/TH0buf-dVXI/AAAAAAAAVnw/t314AE09bwg/s640/4079865886586287.JPG

    Maybe it doesn't work that way, but this is NOT intended to be used as a graphics card; it's a number cruncher.
  • mindrobots Posts: 6,506
    edited 2012-04-20 08:32
    I remember the old Cray machines being vector processors - cycle by cycle they weren't that fast, but when you fed data into the pipeline at a fast, steady rate, they spit out numbers like crazy. I've never looked at a GPU architecture, but I imagine it is similar - they can process computations on large amounts of data fed to them at tremendous rates, but you need to keep them fed. Sending them two big numbers via serial ports and then having your number ready when you come back to read the port is probably not the way to use these. Spraying a gigantic array into GPU memory and then telling it to "INVERT" or whatever probably is the way to use these.

    I don't know, just guessing.
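
    If my guess is right, using one would look roughly like this CUDA sketch - one big copy in, one kernel launch across the whole array, one copy out (the 16M-element size and the elementwise "invert" operation are just stand-ins I picked for illustration):

        // Batch-style GPU use: move lots of data once, then let thousands of
        // threads chew on it in parallel.
        #include <cstdio>
        #include <cuda_runtime.h>

        // Elementwise 1/x - a stand-in for "INVERT", not a matrix inverse.
        __global__ void invert(float *d, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] = 1.0f / d[i];
        }

        int main()
        {
            const int n = 1 << 24;                   // ~16M floats in one batch
            float *h = new float[n], *d;
            for (int i = 0; i < n; i++) h[i] = i + 1.0f;

            cudaMalloc(&d, n * sizeof(float));
            cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
            invert<<<(n + 255) / 256, 256>>>(d, n);  // one launch covers it all
            cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

            printf("h[1] = %f\n", h[1]);             // prints 0.500000
            cudaFree(d);
            delete[] h;
            return 0;
        }

    Feeding it two numbers at a time over a serial port would leave it idle almost all of the time.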
  • prof_braino Posts: 4,313
    edited 2012-04-20 11:28
    Since you said the secret word "Cray", you win the prize:

    The wiki description of the Cray 1 series was the initial driver for this inquiry. http://en.wikipedia.org/wiki/Cray_1

    Considering that Cray cost 9 million dollars, and only a Cray will be a Cray, what can we make that has some similarities to a Cray, and still remain under $50?

    NOTE to lurkers: This is NOT to build a replacement for your university's supercomputer. This is to get something that shows something similar to the techniques used in big computers, that we can have on (my) desktop.

    So far, nothing fits the bill really well, but I think if we raise the cost barrier, we can get a bunch of Cray-ish function. The question is how high we have to raise the cost to get something interesting to play with. My hypothesis is that the cost is closer to $200 than $9 million.

    So here's what I'm thinking:
    Compare: Cray :: Braino's box of parts (mostly Prop chips)
    Clock: 80 MHz :: 80 MHz
    Word size: 64 bit :: 32 bit, so use double words and/or 2 or more cogs per operation
    Processors: 144 :: 20 Prop chips = 160 cogs; but 2 cogs per chip are used for communication channel overhead, so use 25(?) Prop chips

    Propforth can easily implement the use of registers for processing, and we think we can figure out something to address the concept of vector processing.

    We (arbitrarily) have 32 channels to access which sets of processors we talk to at any given time. One could have more slower channels, or fewer faster channels, or use more cogs per chip for channels and just add more chips.

    Ignoring the ROM-based tables and successive approximation techniques, the Prop does not excel at floating point. So the major function we are lacking is the ability to crunch large amounts of floating-point numbers quickly. My thought is that a GPU or DSP is the hardware for this job; folks tell me the GPU as used by Antwerp is the way to go.

    Of course, even if we get something that works, folks will still have to write software ON TOP of the infrastructure before we can actually do something useful. Those folks are already watching from the wings. But the model would be some "firmware" on the Prop for infrastructure (more or less in PASM, even if I have the source in PropForth), and then something on top that serves as an application.
  • codeviper Posts: 208
    edited 2012-04-20 17:58
    prof_braino, may I offer a different perspective?
    One Prop with a program and all its pins connected to an FPGA; FPGAs each with a Prop connected to them, using the hyper transfer examples to send data to all the Props at the same time; a central RAM bank; and each FPGA with simple but fast 8-bit CPUs.

    I sort of did this with an old computer, where other CPUs with machine code were called on and given some variables to work on.
    Just picture an Apple II with 2 65C02s plus the initial 65C02.
    I had an interest in augmenting computer power, so I wrote BASIC programs and some machine code on the main 65C02 and copied those machine-code segments to the 6502s that were on a breadboard with a RAM chip and some support circuits. Then, when it was needed, I'd have the main CPU send some bytes telling one of the secondaries to use some variables and pass over them with the machine code.
  • User Name Posts: 1,451
    edited 2012-04-20 18:14
    I hold Prof_Braino personally responsible for keeping me up hours past my bedtime last night studying the expectation-maximization algorithm, fueled by delusions of making my own CT system. In fact I'm still not entirely over this flight of fancy.
  • prof_braino Posts: 4,313
    edited 2012-04-20 18:21
    codeviper wrote: »
    Just picture an Apple II with 2 65C02s plus the initial 65C02.

    I think this is what we have with the regular prop-to-prop multi-channel serial, except we use two pins instead of many pins.

    The thing is, the Props are not good at the crunching we want to do, hence the quest for the GPU's stream processors, memory, etc.
  • codeviper Posts: 208
    edited 2012-04-20 18:33
    I understand; that's why I think an FPGA that does some work as well will boost that area.
    So the entire Prop talks to the FPGA, and the FPGA has some support systems like a math coprocessor.
  • kwinn Posts: 8,697
    edited 2012-04-20 18:38
    IIRC the Cray 1 had 1 meg of 64 bit wide memory (not counting parity bits), an 80MHz clock, and a pipelined architecture that could complete 2 instructions per cycle, including floating point instructions. That would be a theoretical performance of 160 MIPS and 160 MFLOPS.

    I suppose you could use Propeller chips along with an external memory to emulate the Cray. Hub RAM could be used for the registers, and one or more cogs for each of the pipeline stages. It would require multiple Propeller chips, and the floating-point performance would be dismal without an FP coprocessor, but the MIPS rating would probably be respectable. Anyone have any Cray software and manuals lying around?
  • prof_braino Posts: 4,313
    edited 2012-04-20 18:48
    User Name wrote: »
    I hold Prof_Braino personally responsible for keeping me up hours past my bedtime last night studying the expectation-maximization algorithm, fueled by delusions of making my own CT system. In fact I'm still not entirely over this flight of fancy.

    I knew I'd find somebody smart! Keep working on it, we should have .... Never mind, just keep working on it.
  • codeviper Posts: 208
    edited 2012-04-20 18:50
    Sort of what I am thinking: the Prop + FPGA as a sort of node, with the FPGAs using the floating-point and HyperTransport example Verilog cores to share data and do math, and a cluster of nodes to do a big job.
  • codeviper Posts: 208
    edited 2012-04-20 18:51
    Also, someone already made a Cray for FPGAs. Here it is:
    http://chrisfenton.com/homebrew-cray-1a/
    It is a working and cool-looking example: both a working model, and he made it a 1/10-scale model as well.
  • rod1963 Posts: 752
    edited 2012-04-20 19:54
    The entry point is quite low too, $225.00, and he's done most of the skull work, though the software needs serious fixing. The FPGA itself is quite reasonable at $60+ a piece, though it is a BGA monster. Overall it's a much friendlier and less costly chip than the more modern FPGA vector processors, which are implemented on $4k FPGAs, and most of those don't provide VHDL code either.

    Certainly a doable project. Prototype it on the Digilent board, get the software to work and that's most of the effort.
  • prof_braino Posts: 4,313
    edited 2012-04-20 22:18
    kwinn wrote: »
    IIRC the Cray 1 had 1 meg of 64 bit wide memory (not counting parity bits), an 80MHz clock, and a pipelined architecture that could complete 2 instructions per cycle, including floating point instructions. That would be a theoretical performance of 160 MIPS and 160 MFLOPS.
    Yes, this is what I think the GPU would do.
    I suppose you could use propeller chips along with an external memory to emulate the Cray.
    Not what I had in mind, that approach is too far beyond me. My thought was the props would just be there to send data into the GPU, and collect results coming out.
  • prof_braino Posts: 4,313
    edited 2012-04-20 22:27
    Ok, looks like I didn't state this clearly: I'm NOT looking for Cray-compatible anything; if it requires an FPGA, it's probably not what I'm looking for. Sorry I was not clear. That might be fun (it sounds like fun), but not for me.

    The thing the Cray does that I am interested in is what I want to do on a GPU. Like in the link, but not using a graphics card, and not for use as a graphics card.
  • codeviper Posts: 208
    edited 2012-04-20 22:35
    Sorry, I did not mean to divert this thread; I just mentioned that as an option showing FPGAs can do pretty powerful multiprocessing and be easier to interface with.
  • rod1963 Posts: 752
    edited 2012-04-20 23:02
    Emulating the Cray-1 with an FPGA is probably the easiest way to go. The performance won't be anywhere near the original's, as Mr. Fenton's article points out, but it's doable. And it's an interesting way to learn about VHDL and FPGAs. Get good and you can roll your own Forth CPU in VHDL - in fact, there are several out there already to work from. It's win-win, I think.

    Good luck
  • kwinn Posts: 8,697
    edited 2012-04-20 23:04
    prof_braino wrote: »
    Ok, looks like I didn't state this clearly: I'm NOT looking for Cray-compatible anything; if it requires an FPGA, it's probably not what I'm looking for. Sorry I was not clear. That might be fun (it sounds like fun), but not for me.

    The thing the Cray does that I am interested in is what I want to do on a GPU. Like in the link, but not using a graphics card, and not for use as a graphics card.

    You may want to take a look at some of the DSP chips as well. It may be simpler to interface them to a Prop compared to the GPU chips. I am not sure the DSPs (or the GPUs, for that matter) will do 64-bit fixed- or floating-point math; that is something that would need to be determined from the data sheets or manuals. It has been a while since I was involved in this area, so I have not kept up with the latest and greatest in the supercomputing realm. Most of my work has been in industrial automation, building automation, and security systems lately.
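
    For the GPU side, at least, you could check programmatically rather than digging through data sheets. A minimal CUDA sketch (NVIDIA cards support double precision from compute capability 1.3 up, which is what the test below checks):

        // Query device 0 and report whether it can do 64-bit floating point.
        #include <cstdio>
        #include <cuda_runtime.h>

        int main()
        {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, 0);
            bool fp64 = p.major > 1 || (p.major == 1 && p.minor >= 3);
            printf("%s: compute %d.%d -> double precision: %s\n",
                   p.name, p.major, p.minor, fp64 ? "yes" : "no");
            return 0;
        }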
  • Leon Posts: 7,620
    edited 2012-04-21 02:15
    DSPs tend to be 16-bit fixed-point and 32-bit floating point.
  • prof_braino Posts: 4,313
    edited 2012-04-21 07:27
    codeviper wrote: »
    as an option showing FPGAs can do pretty powerful multiprocessing and be easier to interface with

    Thanks, I agree, but I don't have any FPGA resources and can't get any for the foreseeable future. Not having one in the shop makes it difficult to test, so I would have to have somebody else do the experiments while I did the requirements. But I don't have an FPGA guy yet. I do think FPGA is a good way to go, but it remains out of my scope.