Parallella
steddyman
Posts: 91
I know it has been mentioned on these forums previously, but I just purchased one of the new 16-core Parallella boards.
Pretty amazing value for money. Fully open source, with 16 x 1 GHz Epiphany cores in addition to a Xilinx Zynq 7010/7020. I went for the 7020-based embedded platform, which has 85k LEs on the FPGA. For around £160 that is cheaper than any of the Zynq dev boards out there, with a lot more power.
http://www.parallella.org/board/
It should be possible to port the P1V over to it. Wonder if I can get it working alongside the ARM A9 cores and the Epiphany cores all in one design!
Comments
Shamefully, my Parallella has not even been booted up yet; there are just too many things to do in the time available.
Getting a P1 in there sounds like a great project. I have no idea what use such a beast might be in the end but it sounds like a challenge that someone has to tackle.
I'm confident there is no GPU and no specific video logic at all. I'm gonna guess the HDMI signals connect to the FPGA fabric, as does the Epiphany chip.
Here's the datasheet for the CPU/FPGA - http://www.xilinx.com/publications/prod_mktg/zynq7000/Zynq-7000-combined-product-table.pdf
Damn post man didn't deliver my board today. Hope it comes tomorrow now.
85k LEs, that's a lot of heft! Combined with the Epiphany stuff, those boards look pretty mighty!
Those binary blobs in GPUs and such are annoying for sure.
Note however that the binary blob in, for example, the Raspberry Pi is not a driver. Or at least not a Linux kernel driver module. Rather it is firmware loaded into the GPU.
The Raspi gets a lot of stick for this, but, oddly, if that same firmware blob were permanently blown into ROM on the GPU, even the Free Software Foundation have said they would be happy with it.
And we have the same problem with all the firmware blobs living in USB sticks, SD cards, hard disks and SSDs, WiFi and 3G dongles, etc. etc. Heck, even the very CPUs we run can be updated with new microcode binary blobs.
The Parallella stuff looks more promising in that respect.
R
And then...
My P1V retromachine is preloaded with a blob too. It starts the machine, then looks for a "boot.sys" file on the SD card and runs it, so you don't have to program the machine with a Prop Plug after every restart. boot.sys can be anything that runs on the P1V, but it is intended to be an operating system or operating system loader, or even a "micro DOS" (the name comes from the 8-bit Atari) which lists all the executable files on the disk (SD in this case) and lets you select one.
The Pi works in a similar way.
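The flow of the blob is roughly this (just a C-style sketch of the idea, not the real code; sd_mount(), fat_open(), fat_read(), jump_to() and the load address are made-up placeholders for whatever SD/FAT driver is actually in there):

/* Rough sketch of the boot blob flow only, not real code.
   sd_mount(), fat_open(), fat_read(), jump_to() and LOAD_ADDR are
   hypothetical placeholders for the actual SD/FAT driver and launcher. */
extern int  sd_mount(void);
extern int  fat_open(const char *name);
extern int  fat_read(int handle, unsigned char *dst, int max_len);
extern void jump_to(unsigned char *entry);

#define LOAD_ADDR ((unsigned char *)0x0000)   /* illustrative hub address */
#define MAX_IMAGE (32 * 1024)                 /* hub RAM size */

void boot(void)
{
    if (sd_mount() != 0)
        return;                        /* no card: stay in the blob */

    int f = fat_open("boot.sys");
    if (f < 0)
        return;                        /* no boot.sys: nothing to chain to */

    if (fat_read(f, LOAD_ADDR, MAX_IMAGE) > 0)
        jump_to(LOAD_ADDR);            /* hand control to the loaded OS/loader */
}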
This Parallella board looks interesting... 85k LEs? I have to read more about it...
The link in post #1 leads to a 404 error page.
http://www.parallella.org/parallella-board/
Maybe by the time I can buy one, this 7020 version will be available too. They say 18 cores; well, when we add some P1Vs we will have 26, or maybe 34... 85k LEs? ... maybe 42 cores...
There is only one problem: no GPIO available in a standard way. How do you connect to these slots? I cannot find any extension boards for this.
There is an extension board available called the Porcupine which brings all the I/O out to standard pin headers:
http://www.parallella.org/meet-porcupine-the-parallella-breakout-board/
It has only just been released and I can't see it in the UK yet, but Digikey will ship internationally and it is showing as in stock.
Here is all the info you need on the Zynq ARM/FPGA combo:
http://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf
Also notice that somebody is porting a full OpenGL / Direct 3D GPU to it (in Verilog):
http://forums.parallella.org/viewtopic.php?f=51&t=1581
Here is a picture of it next to an SD card and an Apple TV remote for comparison:
The Porcupine suddenly makes tinkering with the Parallella a whole lot more attractive.
@pik33,
I would not get so carried away with those core counts. Only some of them are real general-purpose cores, namely the ARMs. The 16 or whatever Epiphany cores are little more than FPU accelerators. If you throw a P1 in there then that is 8 more processors again, of a very different kind.
Throw it all together and we have a fast Linux machine, with super-fast maths capability and funky real-world bit-banging interfacing via the P1. A real Frankenmachine that could be, shall we say, "interesting" to develop applications on.
A lot of different instruction sets to develop for. I suppose with the right compiler tools it wouldn't be so bad. But it seems like there is a lot of overlap in functionality between the Epiphany itself and the Prop?
Now you have another problem. How to program the mess you have created in an integrated "holistic" way.
With the Parallella, for example, one could imagine the ARM application, the Epiphany maths and the Propeller I/O are all programmed in C in a single project. Your IDE or other tooling would make sure all the right code ends up in the right place. The compilers take care of the instruction set differences nicely.
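To give a flavour of the ARM side of such a project, here is a minimal host sketch using the Epiphany eSDK's e-hal, as best I remember it (the device binary name e_task.srec and the read offset are just placeholders, and the exact calls may differ between SDK versions):

#include <e-hal.h>

int main(void)
{
    e_platform_t platform;
    e_epiphany_t dev;

    e_init(NULL);                    /* use the default platform description (HDF) */
    e_reset_system();
    e_get_platform_info(&platform);

    e_open(&dev, 0, 0, 1, 1);        /* a 1x1 workgroup at core (0,0) */
    e_load("e_task.srec", &dev, 0, 0, E_TRUE);   /* load device code and start it */

    /* ...later, pull results out of core-local memory, e.g.:
       e_read(&dev, 0, 0, 0x2000, buffer, sizeof(buffer)); */

    e_close(&dev);
    e_finalize();
    return 0;
}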
There is almost no overlap between what you would want to do on the Epiphany and the Propeller. The Epiphany is basically a multi-core floating point accelerator. Its processors have little (no) interaction with the outside world (pins). The Prop's COGs are very much slower, integer-only processors with very limited memory, but they do have that tight connection to the I/O pins.
Certainly would be interesting.
Sounds like the Epiphany also has quite limited memory: 32k per core. That is at least more than a COG's RAM, but still.
Interesting how each Epiphany core seems to be memory mapped. You could do something similar from the ARM to the Prop cores: write instructions directly into hub RAM, trigger a cog start from a register address or something, and then have the cog's RAM memory mapped into the ARM for reading.
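Something like this from the Linux side, if the P1V port ever exposed hub RAM through an AXI slave window (the HUB_BASE address and the window itself are pure invention here; nothing in the current P1V sources provides them):

/* Sketch only: assumes a hypothetical AXI window onto hub RAM at HUB_BASE.
   No such window exists in the current P1V sources. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUB_BASE 0x43C00000u    /* made-up AXI GP0 address for the hub window */
#define HUB_SIZE 0x8000u        /* 32 KB of hub RAM */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;

    volatile uint8_t *hub = (volatile uint8_t *)mmap(NULL, HUB_SIZE,
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, HUB_BASE);
    if (hub == MAP_FAILED)
        return 1;

    hub[0] = 0x55;              /* poke a byte (or an instruction image) into hub RAM */
    uint8_t back = hub[0];      /* ...and read it straight back from the ARM side */

    munmap((void *)hub, HUB_SIZE);
    close(fd);
    return back == 0x55 ? 0 : 1;
}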
But the deal is that we have some code that needs to be compiled and run on the ARM, no doubt with its own main() function etc. Then we have some code that needs to run on the Epiphany, no doubt compiled with a different compiler and perhaps with its own main(). Then we have more code for the Prop, again a program with its own main() and a different compiler.
Somehow it would be cool to be able to write all of that as a single project, no matter what compiler is used in each case. Heck, I might not even want to use C everywhere; perhaps I want the code running on the ARM to be JavaScript run under Node.js, after all the fast stuff is being done elsewhere in this system.
http://parallellagram.org/cat/Parallella%20FPGA%20Tutorials
However, I'm finding the USB implementation incredibly flaky and, looking at the forums, I am not alone. I only get a working keyboard about 20% of the times I boot up, and a working mouse even less often.
The USB and HDMI are implemented as hard IP blocks from Xilinx, so there is no access to the source code. It may not be a panacea after all. Probably much better suited as a headless compute engine.
Here is the Software Development book from Xilinx:
http://www.xilinx.com/support/documentation/user_guides/ug821-zynq-7000-swdev.pdf
Maybe we can try to make this kind of structure with modified p1v cogs.
A while back I took up the challenge of parallelizing my home-grown FFT for the Propeller using prop-gcc and OMP. Sure enough, that can spread the work over 2 or 4 (or even all 8) COGs for a performance boost. No, it does not scale linearly.
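Very roughly, the shape of the thing (not my actual FFT, just one radix-2 stage with the butterfly loop handed to OMP, integer maths to suit the Prop):

#include <omp.h>
#include <stdint.h>

/* Sketch of one radix-2 stage, butterflies split across COGs by OpenMP.
   Not real code: 12-bit fixed-point twiddles, contiguous halves, no bit reversal. */
void fft_stage(int32_t *re, int32_t *im,
               const int32_t *wr, const int32_t *wi, int half)
{
    #pragma omp parallel for num_threads(4)     /* 2, 4 or all 8 COGs */
    for (int i = 0; i < half; i++) {
        int j = i + half;
        int32_t tr = (re[j] * wr[i] - im[j] * wi[i]) >> 12;
        int32_t ti = (re[j] * wi[i] + im[j] * wr[i]) >> 12;
        re[j] = re[i] - tr;  im[j] = im[i] - ti;
        re[i] += tr;         im[i] += ti;
    }
}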
Thing is it's not easy. Not something I want to be doing everyday. No doubt there are already far better parallel FFTs for the Parallella than anything I could come up with.
It's just hard. That is why my Parallella board has never even been booted up.
I guess if I had an idea in mind that needed such performance and there was some ready made code to do the work then my Parallella would get unboxed.
The algorithm I tested here was meant to run on a GPU. Tested on a "normal" multicore CPU, it scales nearly linearly up to 32 threads and maybe more. Theoretically it should run on 2000 compute units at about 1000x the speed; the single-thread overhead is very small.
Now I want to buy this thing. Maybe in summer.
Interesting stuff you have going on there. You may also like to take a look at the NVIDIA Jetson TK1 board. Dual core ARM and 192 GPU cores ready to run your CUDA code. Only 200 dollars or so.
https://www.youtube.com/watch?v=XmnM7ikhY1s
If you are waiting till the summer, then hang on for the 64-core Parallella that is due out shortly. Combined with the FPGA, that means you should be able to custom-craft a hybrid hardware and software solution.
After much pain and learning I can now use the Xilinx ISE tools to compile the original sources with a minor code change to prove it works. I have compiled and tested the 7020 HDMI sources.
On the 7020 board the HDMI configuration takes around 25% of the LEs and around 10% of the BRAM, though there is still tons of space for extra custom logic.
Bought a QuickStart board for US$6 (cheap) at a Radio Shack. Recently started playing with it, and found the Propeller 1 Verilog forum. Since I design in Verilog and have used FPGAs, I was excited to check it out.
Alas, it was targeted for an Altera part, and I have only used Xilinx parts.
@steddyman, have you ported the P1 to Xilinx yet?
@pik33, you're right about implementing your DSP in hardware (FPGA). I implemented a 40Gbps EFEC 7 years ago in a couple of Xilinx parts (100% hardware) and was able to sustain the full rate, while other vendors used mixed hardware/firmware and had to burst the data. Needless to say, when you're attempting to achieve 10e-24, even at-speed sustained testing takes forever to verify.
I haven't ported the P1V yet. I've had to take a break to update a couple of my iOS apps.