Embedded FPGA options for P1?
T Chap
Posts: 4,223
I have spent some time trying to find out what FPGA options people are using for the P1. My goal is to extend the life of the P1 on some projects where the needs are pretty simple: get more pins with PortB, more cores, more memory per core, or at least more hub.
Can the size of a core be increased above 2k? One of the reasons for this is to extend memory for LCD graphics that are constrained by core size.
What are people using for SMT (preferably non BGA) FPGA? 20-30 bucks max is not an issue. I prefer pins outside the chip for easier fixes if needed, but QFN is OK.
How many logic elements are needed for the bare minimum request of PortB + 16 cores? I see MAX10 comes in a number of sizes.
Has anyone here actually built their own boards with an FPGA?
I use BSTC for optimizing large programs, and assume that the FPGA does not care where the binary comes from. I use a Silabs CP2110 for programming the binary, though, so does loading work exactly the same way, so that the hardware and software I use for USB loading can remain unchanged?
From what I read the only limitation is no ROM fonts. Should the r/c DACs behave identically using the FPGA pins?
Comments
It depends on how many P1V cores you want.
The smaller parts, that will fit at least 1 P1V COG, do tend to constrain the memory.
You could look at the SDRAM work rogloh was doing, as once you have fpga pins to burn, the SDRAM impact somewhat shifts.
Or, you could look to connect a HyperRAM and focus on pushing that speed.
That may allow smaller package FPGAs.
At the smallest end of the scale, I like the look of Lattice iCE5LP2K in QFN48 7x7mm 39io
Lattice also have a new MachXO3LF-9400, can come in a 9x9 BGA with 208 io, and 54K of RAM
Things like ROM Fonts, you would swap into SPI Flash, and load as needed.
rogloh seems to be using BEmicro that has MAX10,
https://www.arrow.com/en/products/bemicromax10/arrow-development-tools
which shows the MAX10M08 which is 8000 LE for 31 bucks. 484 bga 1.25V 250 io.
Edit: I see he pointed this out, so he appears to get only 3 cogs out of his 8000 LE device. Based on his LE consumption, it seems like it is going to be impossible to get 16 cores out of something under 30 bucks.
http://www.digikey.com/product-detail/en/altera/10M25SCE144C8G/10M25SCE144C8G-ND/5429556
$31 in qty 60
You might try your luck getting better pricing from Altera
If you don't need counters/video in all cog, you can save something like 250 or 300 LE's per cog ( cogs are about 1500 LEs by themselves)
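As a rough sanity check on those figures, here is a minimal sketch of the arithmetic. It takes the ~1500 LE per-cog number quoted above at face value and counts cog logic only; the hub, bus arbitration and I/O logic are deliberately ignored, so the results are lower bounds on LEs and upper bounds on cog count. The function names are just illustrative.

```python
# Rough LE budget using the ~1500 LE per-cog figure quoted in this thread.
# ASSUMPTION: only cog logic is counted; hub, arbitration and I/O logic
# are ignored, so real designs will need more LEs than this.

LE_PER_COG = 1500  # approximate figure from the post above


def cog_logic_les(n_cogs, le_per_cog=LE_PER_COG):
    """Lower bound on LEs consumed by the cogs alone."""
    return n_cogs * le_per_cog


def max_cogs(device_les, le_per_cog=LE_PER_COG):
    """Upper bound on cogs that fit if nothing else used any LEs."""
    return device_les // le_per_cog


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, "cogs need at least", cog_logic_les(n), "LEs")
    print("8000 LE device: at most", max_cogs(8000), "cogs")
```

The 8000 LE bound comes out at 5 cogs, while the build mentioned earlier in the thread reportedly fits only 3, which suggests the non-cog logic and fitter overhead are far from negligible.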
Any suggestions for a suitable FPGA for 4 cores and more memory and of course I/O?
Modifying a P1V to have more than 8 cogs would require substantial modifications to the architecture.
Larger HUB ram and more ports is relatively easy to do but ultimately comes down to FPGA size/cost.
For example, quite a while back I built a 40 cog (5 * P1V) build on a DE2-115.
It worked but managing the programming of the five P1V's was messy.
It also then required P1V-to-P1V comms to coordinate all the cogs.
Throw Quartus compile times into the mix and be prepared for long development times.
a P1 and a FPGA.
There, you can do as Peter suggests, with 4 FPGA COGS and 8 Standard COGS.
The P2 boards have quite a nice instance of P1 used as FPGA loader, so that P1 displaces another 1-2 chips.
( remove one COG for FPGA development use ? )
Chip reports that their P1 loader is faster than Altera's standard FTDI FIFO + CPLD, so it seems they have some significant slip-ups in that approach. A CPLD, properly done, should be faster than a P1 UART....
Same for you Pete, could you get by on 3 cogs, if one of them had super powers?
If that could talk to HyperRAM, then things get interesting....
If you can tolerate BGA, then the LCMXO3LF-6900E is available now in the same 9x9 mm BGA the MachXO3LF-9400 will come in.
Just so I understand, jmg: the same P1 on board can be used as the P1V loader from USB? After loading the image onto the P1V, I assume it boots normally off its own EEPROM.
The only resource I've ever REALLY wanted was portB. Add your choice of external memory and off you go. As far as lcd graphics memory on-die (for fast redraws n such) it's a total waste. When you are playing with the onboard video dacs I could see it maybe. I've actually experimented with the SSD1963 and several different types of memory. (all applicable to SSD1289, ILIxxx and other 8-16b tft controllers AFAIK) I'm still very happy with the SRAM -151s although QPI FLASH or SRAM work just as well (from a slow human perspective) For the price of a decent FPGA, seems to make more sense to include a memory solution, FLASH, SRAM and the like are DIRT CHEAP.
That's given you can do with only 8 cogs... Although I've never REALLY bumped this limit. Depends on what you are doing with your LCD device though. Pushing a whole screen refresh from cog memory seems silly to me though. the basic math on a smaller display - 320 * 240 =76,800 pixels. @4bpp that's 9,600 longs... What always made the most sense to me was have a "display cache memory" to load icons, backgrounds or whatever, connected directly to the display. With a little bit of planning you could have a free core for about 77k clocks while doing a full display refresh just using a counter.
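The basic math in that post can be checked with a few lines. This is only a sketch: the 512-long cog size follows from the 2k core size mentioned at the top of the thread, and the 32 KB hub is the standard P1 figure; the function name is made up for illustration.

```python
# Frame buffer size for a packed-pixel display, in 32-bit longs.
# ASSUMPTIONS: pixels are tightly packed; the cog and hub sizes are the
# stock P1 values (2 KB cog RAM, 32 KB hub RAM).

COG_RAM_LONGS = 2 * 1024 // 4    # 512 longs
HUB_RAM_LONGS = 32 * 1024 // 4   # 8192 longs


def framebuffer_longs(width, height, bits_per_pixel):
    """Number of 32-bit longs needed to hold one full frame."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 31) // 32  # round up to a whole long


if __name__ == "__main__":
    longs = framebuffer_longs(320, 240, 4)
    print("320x240 @ 4bpp:", longs, "longs")
    print("fits in one cog:", longs <= COG_RAM_LONGS)
    print("fits in stock hub:", longs <= HUB_RAM_LONGS)
```

A 320x240 frame at 4bpp comes to 9,600 longs, which is larger than both a single cog and the whole stock hub, so some form of external or enlarged memory is needed either way.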
That's just my 2 cents though...
*edit*
Sorry, I took too long writing that post and missed your post about using all 8 cogs. DARN.
I could get behind a p1v with all 8 cores and some sort of memory - display bus.
As with everyone, I'd love to have them all but, bare minimum I think 3 or 4 would do the trick. Just something for a REALLY flashy synth gui. Use the actual propellers for the heavy lifting... Especially if there was some kinda baked-in p2p??
I'm not sure how portable that P1 loader is to something like Lattice, but Lattice usually deploy the FT2232H as their CPLD loader.
Might work on a MAX 10 ?
FYI The P1 loader on the P123 board takes 25 seconds to load the P2 image from power up.
I think a CPLD + FLASH would be a lot quicker.
The MAX10 is the quickest of the Altera range with its on board flash.
Yes, any part with in-package flash, should load and start faster.
http://forums.parallax.com/discussion/comment/1343908/#Comment_1343908
Chip: "The loader on it is way faster than the Altera arrangement that all other boards seem to use."
but search does not find explicit times, which I saw somewhere...
BTW I do have a modified version of the loader that boots in 9 seconds instead of 25!
I never got around to finishing the faster write though.
You need to restore the original loader if you want to update the flash.
I can dig up the source and post it if anyone is interested.
A couple more package choices appear on Lattice website.
( where they show long before that package is actually available )
* XO2-1200 shows a new 32-pin QFN (5 x 5 mm) with 21io
(that could augment an existing P1 that needs more peripherals)
I think this will become the most CPLD in a QFN32, when it becomes real.
* XO2-4000 shows a 84-pin QFN (7 x 7 mm) with 68io
84 pins I think is 48+36 as two rings, 14 & 12/side.
Simpler PCBs could use one ring.
Is there a table of the numbers anywhere, for how many pins are used, and how the speeds compare with more common FPGA-PC & FPGA-Boot pathways ?
http://www.digikey.com/product-detail/en/altera/10M40SAE144C8G/544-3120-ND/5284844
No idea if they are workable in reality, just doing parametric searches. Is an FPGA an FPGA as long as you have the io, LE and voltage?
I think that is only 32k gates, rather less LE's.
If you want to sort by package, picking TQFP100 looks to have the Xilinx XC3S500E as the largest, at around 10476 LE and 368640 bits of RAM (40+ KB).
http://www.mouser.com/ProductDetail/Altera/10M25SCE144C8G/?qs=bKenfurwlskgplYqDYvyBw==&gclid=Cj0KEQjwjoC6BRDXuvnw4Ym2y8MBEiQACA-jWfMTSahniSyPkLvjkB1Rl8W1g4xe3dtXG5I7oWfUF4QaAsS68P8HAQ
Add the HyperRam, Port B, EVE2 and get this platform up and running to last for another year until P2.
With HyperRAM, the EVE2 is less required, but it looks an easy way to get something going...
I just saw a new Altera board that has HyperFLASH & HyperRAM included:
HyperMAX
Device: 10M25DAF256C7G
Memory: HyperFLASH™, HyperRAM™
Interfaces: NFC, Ethernet, USB to UART, CAN, authentication microprocessor, Arduino connector, Digilent Pmod™ Compatible connectors, expansion header
The block diagram also shows an I2C clock generator and a 70-pin header for expansion. It claims to be 'low cost'. No price found, but the 10M25DAF256C7G lists for $70 at qty 1+.
I haven’t run the full timing analysis on it yet as I need to adjust the SDC file again but based on that preliminary result at least, the 10M25 part is probably the sweet spot for a full P1V experience on MAX10 and it still leaves a bit more breathing space for other things.
Update: fitter completed and reports 72% usage. Regular cogs use 1889 LEs, the special SuperCog uses 2405. Will try the timing later.
Update2: TimeQuest reports Fmax of about 60MHz in this test. I didn't set all the constraints so am not sure if this is entirely accurate, but it is a number nonetheless. I did find that adding the DSP multiplier really dropped the Fmax down, so I might need to rework that to run over a number of clocks. The MUL instruction then won't execute in 4 clocks like the other instructions but might take 5 or 6. Probably not that big a deal since it is still so much faster than the software based multiply operations.
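To put the multi-clock MUL in perspective, here is a small sketch of the timing arithmetic, assuming the preliminary 60 MHz Fmax holds and that ordinary instructions take 4 clocks as stated above. The helper name is illustrative only.

```python
# Instruction timing at the preliminary Fmax reported above.
# ASSUMPTION: the ~60 MHz figure holds once all constraints are set.

FMAX_HZ = 60_000_000


def instr_time_ns(clocks, fmax_hz=FMAX_HZ):
    """Execution time in nanoseconds for an instruction of `clocks` cycles."""
    return clocks * 1e9 / fmax_hz


if __name__ == "__main__":
    for name, clocks in (("regular", 4), ("MUL (5 clk)", 5), ("MUL (6 clk)", 6)):
        print(f"{name}: {instr_time_ns(clocks):.1f} ns")
```

Even at 6 clocks a hardware MUL would cost only about 1.5 regular instruction times, which supports the point that the extra clocks are probably not a big deal.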