interface to a PCI graphics card?

prof_braino Posts: 4,313
edited 2012-04-22 11:54 in General Discussion
I hear folks are using bunches of graphics cards as a "supercomputer": all the cores, stream processing units, memory.

What about using a Prop, i.e. a Demo Board configuration, as the user I/O node, and a PCI graphics card as the processing node?

Anybody play with this yet?

Comments

  • Heater. Posts: 21,230
    edited 2012-04-18 20:14
    Probably very hard to do. Any driver for a modern graphics card is amazingly complex and mostly the drivers are closed source and the interface to the hardware undisclosed. Although the vendors have been softening up on that a bit recently.
    I would imagine that with the slow data rate in and out of the Prop and the minimal memory space within the Prop there is not much point in doing it anyway.

    My conclusion: Impossible to do and mostly pointless.

    There, with that said I can be sure someone has done it already
  • rod1963 Posts: 752
    edited 2012-04-18 23:06
    Professor

    The NVIDIA cards that do all that neat parallel processing stuff use PCI-Express, which is a lot different from your standard PCI. You'd need an FPGA to interface to it. In short, it's a major project even for people who know the PCI-Express bus and VHDL coding. Plus forget about getting NVIDIA driver info unless you're a tier one PC vendor.
  • pik33 Posts: 2,402
    edited 2012-04-19 05:39
    rod1963 wrote: »

    Plus forget about getting NVIDIA driver info unless you're a tier one PC vendor.

    There are the Nouveau open-source NVIDIA drivers for Linux. That team did a lot of reverse engineering on these cards.
  • prof_braino Posts: 4,313
    edited 2012-04-19 07:22
    Heater. wrote: »
    Probably very hard to do. Any driver for a modern graphics card is amazingly complex and mostly the drivers are closed source and the interface to the hardware undisclosed. Although the vendors have been softening up on that a bit recently.
    I would imagine that with the slow data rate in and out of the Prop and the minimal memory space within the Prop there is not much point in doing it anyway.

    My conclusion: Impossible to do and mostly pointless.

    There, with that said I can be sure someone has done it already

    As you say, it's already been done.

    http://www.engadget.com/2009/12/14/university-of-antwerp-stuffs-13-gpus-into-fastra-ii-supercompute/

    To clarify, the goal is NOT to use a video card as a video card; it is to use the processor as a massively parallel computing node. A traditional driver is NOT what is needed here.

    The role of the Prop is to handle the sensors and actuators, as it lends itself well to this application. Again, the role of the GPU is to crunch big data, as expected.

    The "slow data rate of the prop and minimal memory space" are completely appropriate for the prop target application. In fact if needed, the GPU can be slowed as needed, it will still crunch a lot better than a prop, and nicely fullfil this role.

    Remember, the goal is not to go toe to toe with Cray or Fujitsu; the goal is to get something remotely similar in concept for under $200, so more undergrads can play more often.
  • prof_braino Posts: 4,313
    edited 2012-04-19 07:27
    rod1963 wrote: »
    Professor

    The NVIDIA cards that do all that neat parallel processing stuff use PCI-Express, which is a lot different from your standard PCI. You'd need an FPGA to interface to it. In short, it's a major project even for people who know the PCI-Express bus and VHDL coding. Plus forget about getting NVIDIA driver info unless you're a tier one PC vendor.

    Ok, the first step forward is made: PCI-E is what we want, not old PCI. I heard somebody mention FPGAs in the past, and interfaces to these cards exist, so the request is for somebody who knows about this stuff. Probably hard to find on these forums; maybe I'll have to look elsewhere. It's ok, I can wait.
  • Heater. Posts: 21,230
    edited 2012-04-19 08:15
    Well, there is the thing: once you have thrown an FPGA into the mix, there is probably no need for a Prop any more.
  • rod1963 Posts: 752
    edited 2012-04-19 10:00
    Professor

    The fly in the ointment is that the Prop cannot host the NVIDIA development environment. It's predicated on Win, Linux, or Mac hosts. And the tools are real resource hogs.
  • prof_braino Posts: 4,313
    edited 2012-04-19 13:48
    I wasn't thinking about hosting NVIDIA development; I was thinking somebody would write a crunching app on the GPU, and the Prop would just pump a meager data stream to it for crunching. The NVIDIA dev can happen on any favorite environment.

    Maybe the PCI bit is the mistake. It doesn't have to be PCI; it just needs to be an app on the GPU that we can send a serial stream to for processing, and receive some result when it's finished.
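
    To make that concrete, here is a minimal sketch of the kind of crunching app I mean, assuming a CUDA card in a host that can also see the Prop's serial stream (the /dev/ttyUSB0 port name, the batch size, and the "crunch" kernel are all placeholders I made up, not anything that exists yet):

        // Host-side sketch: buffer one batch of floats from the Prop's serial
        // stream, run a placeholder CUDA kernel over it, copy the result back.
        #include <cstdio>
        #include <fcntl.h>
        #include <unistd.h>
        #include <cuda_runtime.h>

        #define BATCH 4096                           // floats buffered per launch

        // Placeholder crunching: square every sample in place.
        __global__ void crunch(float *d, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] = d[i] * d[i];
        }

        int main()
        {
            float h[BATCH], *d;
            cudaMalloc(&d, sizeof(h));

            int fd = open("/dev/ttyUSB0", O_RDONLY); // serial link from the Prop
            if (fd < 0) { perror("open"); return 1; }

            size_t got = 0;                          // fill one batch of raw floats
            while (got < sizeof(h)) {
                ssize_t r = read(fd, (char *)h + got, sizeof(h) - got);
                if (r <= 0) { perror("read"); return 1; }
                got += (size_t)r;
            }

            cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
            crunch<<<(BATCH + 255) / 256, 256>>>(d, BATCH);
            cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);

            printf("first result: %f\n", h[0]);      // ship results back as needed
            cudaFree(d);
            close(fd);
            return 0;
        }

    In this sketch the Prop's only job is to fill the buffer; whatever hosts the card owns the PCI-E plumbing.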
  • rod1963 Posts: 752
    edited 2012-04-19 20:27
    Professor

    You gain nothing by this approach; it's easier to buy a used PC with the appropriate bus, then go to geeks.com and buy a $40 CUDA-enabled GPU card. This all can be done for about $300.00, well within the range of a hobbyist or student.

    And you need the PCI-E bus to talk to the GPU; there is no getting around this, and that means you need a bridge chip to handle the interface, which equals a big-dog FPGA like the Spartan-6 series, which comes in BGA only. Maybe it's not the only way, but no one AFAIK has hacked an NVIDIA GPU the way you want and succeeded.
  • pik33 Posts: 2,402
    edited 2012-04-19 23:13
    NVIDIA GPUs have been reverse engineered by the Nouveau open-source Linux driver team. If there is an open-source driver for them, anything can be possible.

    A Propeller with a new NVIDIA GPU looks like the motor from the smallest possible lawn mower added to a heavy truck. If we add some suitable custom-made gears, yes, it can make this truck move, but only very slowly.

    We can still add this motor to a smaller car. I have some old PCI and ISA 2D-accelerator cards with 512KB/1MB RAM and documentation. They could be suitable to attach a Propeller to, particularly the ISA one.
  • prof_braino Posts: 4,313
    edited 2012-04-20 08:03
    rod1963 wrote: »
    You gain nothing by this approach; it's easier to buy a used PC with the appropriate bus, then go to geeks.com and buy a $40 CUDA-enabled GPU card. This all can be done for about $300.00, well within the range of a hobbyist or student.

    And you need the PCI-E bus to talk to the GPU; there is no getting around this, and that means you need a bridge chip to handle the interface, which equals a big-dog FPGA like the Spartan-6 series, which comes in BGA only. Maybe it's not the only way, but no one AFAIK has hacked an NVIDIA GPU the way you want and succeeded.

    Sorry, this is my shallow understanding of the concept. I thought the Antwerp guys linked above did exactly this.
    While using a stock graphics card does require the PCI bus, we don't want to use the GPU as a graphics card, so we don't need it.
    The GPU chip, AFAIK, does NOT require PCI-E to function. The GPU is to be used as a pool of processing power with a pool of memory. All we need is a serial or parallel channel to talk to it, and the channel's speed can be whatever; all it needs to send is the result. The GPU does the manipulation of large amounts of data.
    At least this is my thought.
  • prof_braino Posts: 4,313
    edited 2012-04-20 08:10
    pik33 wrote: »
    A Propeller with a new NVIDIA GPU looks like the motor from the smallest possible lawn mower added to a heavy truck.

    Maybe I got it backwards. I thought using a heavy-duty GPU to process the relatively small data stream from one or a series of Props would result in something like these:

    http://csudigitalhumanities.org/exhibits/archive/files/2ea878ba7cdc2803db6d1f218a914a18.jpg
    http://2.bp.blogspot.com/_7gYNb3GSh4M/TH0buf-dVXI/AAAAAAAAVnw/t314AE09bwg/s640/4079865886586287.JPG

    Maybe it doesn't work that way, but this is NOT intended to be used as a graphics card; it's a number cruncher.
  • mindrobots Posts: 6,506
    edited 2012-04-20 08:32
    I remember the old Cray machines being vector processors - cycle by cycle they weren't that fast, but when you fed data into the pipeline at a fast, steady rate, they spit out numbers like crazy. I've never looked at a GPU architecture, but I imagine it is similar - they can process computations on large amounts of data fed to them at tremendous rates, but you need to keep them fed. Sending them two big numbers via serial ports and then having your number ready when you come back to read the port is probably not the way to use these. Spraying a gigantic array into GPU memory and then telling it to "INVERT" or whatever probably is the way to use these.

    I don't know, just guessing.
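
    If my guess is right, using one would look roughly like this CUDA sketch - one big copy in, one kernel launch across the whole array, one copy out (the 16M-element size and the elementwise "invert" operation are just stand-ins I picked for illustration):

        // Batch-style GPU use: move lots of data once, then let thousands of
        // threads chew on it in parallel.
        #include <cstdio>
        #include <cuda_runtime.h>

        // Elementwise 1/x - a stand-in for "INVERT", not a matrix inverse.
        __global__ void invert(float *d, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] = 1.0f / d[i];
        }

        int main()
        {
            const int n = 1 << 24;                   // ~16M floats in one batch
            float *h = new float[n], *d;
            for (int i = 0; i < n; i++) h[i] = i + 1.0f;

            cudaMalloc(&d, n * sizeof(float));
            cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
            invert<<<(n + 255) / 256, 256>>>(d, n);  // one launch covers it all
            cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

            printf("h[1] = %f\n", h[1]);             // prints 0.500000
            cudaFree(d);
            delete[] h;
            return 0;
        }

    Feeding it two numbers at a time over a serial port would leave it idle almost all of the time.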
  • prof_braino Posts: 4,313
    edited 2012-04-20 11:28
    Since you said the secret word "Cray", you win the prize:

    The wiki description of the Cray 1 series was the initial driver for this inquiry. http://en.wikipedia.org/wiki/Cray_1

    Considering that Cray cost 9 million dollars, and only a Cray will be a Cray, what can we make that has some similarities to a Cray, and still remain under $50?

    NOTE to lurkers: This is NOT to build a replacement for your university's supercomputer. This is to get something that shows something similar to the techniques used in big computers, that we can have on (my) desktop.

    So far, nothing fits the bill really well, but I think if we raise the cost barrier, we can get a bunch of Cray-ish function. The question is how high we have to raise the cost to get something interesting to play with. My hypothesis is that the cost is closer to $200 than $9 million.

    So here's what I'm thinking:
    Compare: Cray :: Braino's box of parts (mostly Prop chips)
    Clock: 80 MHz :: 80 MHz
    Word size: 64 bit :: 32 bit, so use double words and/or 2 or more cogs per operation
    Processors: 144 :: 20 Prop chips = 160 cogs; but 2 cogs per chip are used for communication channel overhead, so use 25(?) Prop chips

    Propforth can easily implement the use of registers for processing, and we think we can figure out something to address the concept of vector processing.

    We (arbitrarily) have 32 channels to access which sets of processors we talk to at any given time. One could have more slower channels, or fewer faster channels, or use more cogs per chip for channels and just add more chips.

    Ignoring the ROM-based tables and successive approximation techniques, the Prop does not excel at floating point. So the major function we are lacking is the ability to crunch large amounts of floating-point numbers quickly. My thought is that a GPU or DSP is the hardware for this job; folks tell me the GPU as used by Antwerp is the way to go.

    Of course, even if we get something that works, folks will still have to write software ON TOP of the infrastructure before we can actually do something useful. Those folks are already watching from the wings. But the model would be some "firmware" on the Prop for infrastructure (more or less in PASM, even if I have the source in PropForth), and then something on top that serves as an application.
  • codeviper Posts: 208
    edited 2012-04-20 17:58
    prof_braino, may I offer a different perspective?
    One Prop with a program and all its pins connected to an FPGA; FPGAs each with a Prop connected to them, using the hyper transfer examples to send data to all the Props at the same time; a central RAM bank; and each FPGA with simple but fast 8-bit CPUs.

    I sort of did this with an old computer, where other CPUs with machine code were called on and given some variables to work on.
    Just picture an Apple II with 2 65C02s plus the initial 65C02.
    I had an interest in augmenting computer power, so I wrote BASIC programs and some machine code on the main 65C02 and copied those machine-code segments to the 6502s that were on a breadboard with a RAM chip and some support circuits. Then, when it was needed, I'd have the main CPU send some bytes telling one of the secondaries to use some variables and pass over them with the machine code.
  • User Name Posts: 1,451
    edited 2012-04-20 18:14
    I hold Prof_Braino personally responsible for keeping me up hours past my bedtime last night studying the expectation-maximization algorithm, fueled by delusions of making my own CT system. In fact I'm still not entirely over this flight of fancy.
  • prof_braino Posts: 4,313
    edited 2012-04-20 18:21
    codeviper wrote: »
    Just picture an Apple II with 2 65C02s plus the initial 65C02.

    I think this is what we have with the regular prop-to-prop multi-channel serial, except we use two pins instead of many pins.

    The thing is, the Props are not good at the crunching we want to do, hence the quest for the GPU's stream processors, memory, etc.
  • codeviper Posts: 208
    edited 2012-04-20 18:33
    I understand; that's why I think an FPGA that does some work as well will boost that area.
    So the entire Prop talks to the FPGA, and the FPGA has some support systems like a math coprocessor.
  • kwinn Posts: 8,697
    edited 2012-04-20 18:38
    IIRC the Cray 1 had 1 meg of 64 bit wide memory (not counting parity bits), an 80MHz clock, and a pipelined architecture that could complete 2 instructions per cycle, including floating point instructions. That would be a theoretical performance of 160 MIPS and 160 MFLOPS.

    I suppose you could use Propeller chips along with an external memory to emulate the Cray. Hub RAM could be used for the registers, and one or more cogs for each of the pipeline stages. It would require multiple Propeller chips, and the floating-point performance would be dismal without an FP coprocessor, but the MIPS rating would probably be respectable. Anyone have any Cray software and manuals lying around?
  • prof_braino Posts: 4,313
    edited 2012-04-20 18:48
    User Name wrote: »
    I hold Prof_Braino personally responsible for keeping me up hours past my bedtime last night studying the expectation-maximization algorithm, fueled by delusions of making my own CT system. In fact I'm still not entirely over this flight of fancy.

    I knew I'd find somebody smart! Keep working on it, we should have .... Never mind, just keep working on it.
  • codeviper Posts: 208
    edited 2012-04-20 18:50
    Sort of what I am thinking: the Prop + FPGA as a sort of node, with the FPGAs using the floating-point and HyperTransport example Verilog cores to share data and do math, and a cluster of nodes to do a big job.
  • codeviper Posts: 208
    edited 2012-04-20 18:51
    Also, someone already made a Cray for FPGAs. Here it is:
    http://chrisfenton.com/homebrew-cray-1a/
    It is a working and cool-looking example: both a working model, and he made it a 1/10-scale model as well.
  • rod1963 Posts: 752
    edited 2012-04-20 19:54
    The entry point is quite low too, $225.00, and he's done most of the skull work, though the software needs serious fixing. The FPGA itself is quite reasonable at $60+ a piece, though it is a BGA monster. Overall it's a much friendlier and less costly chip than the more modern FPGA vector processors, which are implemented on $4k FPGAs, and most of those don't provide VHDL code either.

    Certainly a doable project. Prototype it on the Digilent board, get the software to work and that's most of the effort.
  • prof_braino Posts: 4,313
    edited 2012-04-20 22:18
    kwinn wrote: »
    IIRC the Cray 1 had 1 meg of 64 bit wide memory (not counting parity bits), an 80MHz clock, and a pipelined architecture that could complete 2 instructions per cycle, including floating point instructions. That would be a theoretical performance of 160 MIPS and 160 MFLOPS.
    Yes, this is what I think the GPU would do.
    I suppose you could use propeller chips along with an external memory to emulate the Cray.
    Not what I had in mind, that approach is too far beyond me. My thought was the props would just be there to send data into the GPU, and collect results coming out.
  • prof_braino Posts: 4,313
    edited 2012-04-20 22:27
    Ok, looks like I didn't state this clearly: I'm NOT looking for Cray-compatible anything; if it requires an FPGA, it's probably not what I'm looking for. Sorry I was not clear. That might be fun (it sounds like fun), but not for me.

    The thing the Cray does that I am interested in is what I want to do on a GPU. Like in the link, but not using a graphics card, and not for use as a graphics card.
  • codeviper Posts: 208
    edited 2012-04-20 22:35
    Sorry, I did not mean to divert this thread; I just mentioned that as an option showing FPGAs can do pretty powerful multiprocessing and be easier to interface with.
  • rod1963 Posts: 752
    edited 2012-04-20 23:02
    Emulating the Cray-1 with an FPGA is probably the easiest way to go. The performance won't be anywhere near the original's, as Mr. Fenton's article points out, but it's doable. And it's an interesting way to learn about VHDL and FPGAs. Get good and you can roll your own Forth CPU in VHDL - in fact, there are several out there already to work from. It's win-win, I think.

    Good luck
  • kwinn Posts: 8,697
    edited 2012-04-20 23:04
    prof_braino wrote: »
    Ok, looks like I didn't state this clearly: I'm NOT looking for Cray-compatible anything; if it requires an FPGA, it's probably not what I'm looking for. Sorry I was not clear. That might be fun (it sounds like fun), but not for me.

    The thing the Cray does that I am interested in is what I want to do on a GPU. Like in the link, but not using a graphics card, and not for use as a graphics card.

    You may want to take a look at some of the DSP chips as well. It may be simpler to interface them to a Prop compared to the GPU chips. I am not sure the DSPs (or the GPUs, for that matter) will do 64-bit fixed- or floating-point math; that is something that would need to be determined from the data sheets or manuals. It has been a while since I was involved in this area, so I have not kept up with the latest and greatest in the supercomputing realm. Most of my work has been in industrial automation, building automation, and security systems lately.
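
    For the GPU side, at least, you could check programmatically rather than digging through data sheets. A minimal CUDA sketch (NVIDIA cards support double precision from compute capability 1.3 up, which is what the test below checks):

        // Query device 0 and report whether it can do 64-bit floating point.
        #include <cstdio>
        #include <cuda_runtime.h>

        int main()
        {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, 0);
            bool fp64 = p.major > 1 || (p.major == 1 && p.minor >= 3);
            printf("%s: compute %d.%d -> double precision: %s\n",
                   p.name, p.major, p.minor, fp64 ? "yes" : "no");
            return 0;
        }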
  • Leon Posts: 7,620
    edited 2012-04-21 02:15
    DSPs tend to be 16-bit fixed-point and 32-bit floating point.
  • prof_braino Posts: 4,313
    edited 2012-04-21 07:27
    codeviper wrote: »
    as an option showing FPGAs can do pretty powerful multiprocessing and be easier to interface with

    Thanks, I agree, but I don't have any FPGA resources and can't get any for the foreseeable future. Not having one in the shop makes it difficult to test, so I would have to have somebody else do the experiments while I did the requirements. But I don't have an FPGA guy yet. I do think FPGA is a good way to go, but it remains out of my scope.