Embedded FPGA options for P1?

T Chap · 2016-05-20 19:31

I have spent some time trying to find out what FPGA options people are using for P1. My goal is to extend the life of P1 on some projects where the needs are pretty simple: get more pins with PortB, more cores, and more memory per core or at least more hub.

Can the size of a core be increased above 2k? One of the reasons for this is to extend memory for LCD graphics that are constrained by core size.

What are people using for SMT (preferably non BGA) FPGA? 20-30 bucks max is not an issue. I prefer pins outside the chip for easier fixes if needed, but QFN is OK.

How many logic elements are needed for the bare minimum request of PortB + 16 cores? I see MAX10 comes in a number of sizes.

Has anyone here actually built their own boards with an FPGA?

I use BSTC for optimizing large programs, and assume that the FPGA does not care where the binary comes from. I use a Silabs CP2110 for programming the binary, so does the loading work exactly the same, so that the hardware and software I use for USB loading can remain unchanged?

From what I read the only limitation is no ROM fonts. Should the r/c DACs behave identically using the FPGA pins?

jmg · 2016-05-20 20:53

An open ended question....
It depends..
.. on how many P1V Cores do you want ?
The smaller parts, that will fit at least 1 P1V COG, do tend to constrain the memory.

You could look at the SDRAM work rogloh was doing, as once you have fpga pins to burn, the SDRAM impact somewhat shifts.

Or, you could look to connect a HyperRAM and focus on pushing that speed.
That may allow smaller package FPGAs.

At the smallest end of the scale, I like the look of Lattice iCE5LP2K in QFN48 7x7mm 39io

Lattice also have a new MachXO3LF-9400, can come in a 9x9 BGA with 208 io, and 54K of RAM

Things like ROM Fonts, you would swap into SPI Flash, and load as needed.

T Chap · 2016-05-20 22:58

I want to get a minimum of 12 cores, ideally 16. 64 i/o is the ideal.

rogloh seems to be using BEmicro that has MAX10,

https://www.arrow.com/en/products/bemicromax10/arrow-development-tools

which shows the MAX10M08 which is 8000 LE for 31 bucks. 484 bga 1.25V 250 io.

Edit: I see he pointed out

A larger MAX10M16 device would probably have been ideal and yielded those 5 extra COGs

so he appears to only get 3 cogs out of his 8000LE device. Seems like it is going to be impossible to get 16 cores out of something under 30 bucks based on his LE consumption.

Tubular · 2016-05-21 00:29

I think something like this is what you'd be looking at, it should do about 12 cogs
http://www.digikey.com/product-detail/en/altera/10M25SCE144C8G/10M25SCE144C8G-ND/5429556
$31 in qty 60

You might try your luck getting better pricing from Altera

If you don't need counters/video in all cogs, you can save something like 250 or 300 LEs per cog (cogs are about 1500 LEs by themselves)
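A rough way to sanity-check cog counts against an LE budget, using the per-cog figures above. This is only a sketch: the post gives no figure for the fixed overhead (hub, bus, I/O logic), so `OVERHEAD_LE` is a made-up placeholder, `DEVICE_LE` is a round number for a "25k LE" part, and it assumes the ~1500 LE figure is for a cog that still has its counters/video.

```python
def cogs_that_fit(total_le, le_per_cog, overhead_le=0):
    """Return how many whole cogs fit once overhead_le is reserved."""
    usable = total_le - overhead_le
    if usable <= 0:
        return 0
    return usable // le_per_cog

DEVICE_LE = 25_000     # assumption: round figure for a "25k LE" MAX10
FULL_COG_LE = 1_500    # "about 1500 LEs" per cog, from the post
SAVING_LE = 250        # low end of the quoted 250-300 LE saving
OVERHEAD_LE = 5_000    # hypothetical: hub, bus and I/O logic

full = cogs_that_fit(DEVICE_LE, FULL_COG_LE, OVERHEAD_LE)
lean = cogs_that_fit(DEVICE_LE, FULL_COG_LE - SAVING_LE, OVERHEAD_LE)
print("full cogs:", full)
print("cogs without counters/video:", lean)
```

Real fitter results will differ, since routing and timing do not scale linearly with LE count.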

Peter Jakacki · 2016-05-21 01:38

This is one option I keep considering although if I can do it with one or two P1s then I choose the latter. I can't understand why you would need 16 cores as some of these will no doubt be used as serial ports or SPI etc and to add that to the P1V is very easy and very light on silicon. My Tachyon system with SD card and Ethernet is getting by mostly on one core with another simply as a background timer and another as the console serial. That leaves 5 cores free. It seems to me that if I had 4 cores and more hub memory and I/O then that would be enough assuming I add SPI and UARTs. There is no point in trying to add more than 2k memory to each cog though as the instruction set only has a 9-bit cog address field and there are no indirect or indexed addressing modes either.
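To illustrate the 9-bit limit mentioned above: a P1 instruction is one 32-bit long with 9-bit destination and source fields, so an instruction can only name 512 cog locations, which is where the 2 KB ceiling comes from. The small decoder below is just a sketch for illustration (the function and constant names are my own), not anything from the P1V sources.

```python
# P1 instruction layout (32 bits), MSB to LSB:
#   opcode[31:26]  zcri[25:22]  cond[21:18]  dest[17:9]  src[8:0]
ADDR_BITS = 9
BYTES_PER_LONG = 4

def decode(instr):
    """Split a 32-bit P1 instruction word into its fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,
        "zcri": (instr >> 22) & 0xF,
        "cond": (instr >> 18) & 0xF,
        "dest": (instr >> 9) & 0x1FF,
        "src": instr & 0x1FF,
    }

cog_longs = 1 << ADDR_BITS              # locations a 9-bit field can name
cog_bytes = cog_longs * BYTES_PER_LONG  # total addressable cog RAM in bytes
print(cog_longs, "longs =", cog_bytes, "bytes")
```

Making cog RAM bigger would mean widening those fields, and there are no spare bits in the 32-bit word.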

Any suggestions for a suitable FPGA for 4 cores and more memory and of course I/O?

ozpropdev · 2016-05-21 02:25

The only way to get more cogs in a FPGA P1V design is to have multiple P1V's.
Modifying a P1V to have more than 8 cogs would require substantial modifications to the architecture.
Larger HUB ram and more ports is relatively easy to do but ultimately comes down to FPGA size/cost.
For example, quite a while back I built a 40 cog (5 * P1V) build on a DE2-115.
It worked but managing the programming of the five P1V's was messy.
It also then required P1V-to-P1V comms to coordinate all the cogs.
Throw Quartus compile times into the mix and be prepared for long development times.

jmg · 2016-05-21 02:32

Because you cannot buy 8 COGS in a FPGA for the same price as a P1, to me it makes most sense to deploy
a P1 and a FPGA.
There, you can do as Peter suggests, with 4 FPGA COGS and 8 Standard COGS.
The P2 boards have quite a nice instance of P1 used as FPGA loader, so that P1 displaces another 1-2 chips.
( remove one COG for FPGA development use ? )

Chip reports their P1 loader is faster than Altera's Standard FTDI FIFO + CPLD, so it seems they have some significant slip-ups in that approach. A CPLD, properly done, should be faster than a P1 UART....

T Chap · 2016-05-21 02:34

I had no idea you would have to program each block of 8 separately, I guess I misunderstood somewhere along the line that extra cores was part of the concept (but not as separate entities). Also, I did not consider that you could not access more than 9 bits in a core! So that leaves extra pins and extra hub as the gain. One driving consideration is graphics in a cog. I sure wish there was a trick to drive an LCD with 2 cores on the same pins.

Tubular · 2016-05-21 02:37

If the hi res/hi color video and ram and refresh is all taken care of, how many cogs would you need T Chap?

Same for you Pete, could you get by on 3 cogs, if one of them had super powers?

jmg · 2016-05-21 02:41

Peter Jakacki wrote: »

Any suggestions for a suitable FPGA for 4 cores and more memory and of course I/O?

The iCE5LP4K has an appealing QFN48 (39io) package, and might pack 2 COGS.
If that could talk to HyperRAM, then things get interesting....

If you can tolerate BGA, then the LCMXO3LF-6900E is available now in the same 9x9 mm BGA the MachXO3LF-9400 will come in.

T Chap · 2016-05-21 02:52

One project is maxed out at 8 cores and can't drop anything out. I use many counters for DAC etc, no vid. Even with superpowers, there is too much going on to drop to fewer cores. 8 is the minimum. I am studying the EVE2 so that will solve the graphics problem in the very near future. But several projects are already maxed out for pins, and that includes having already added 2 x 16 i/o I2C expanders. I will explore multiple P1's or P1 + P1V 4-core.

Just so I understand jmg, the same P1 on board can be used as the P1V loader from USB? After loading the image on P1V, I assume it boots normally off its own EEPROM.

average joe · 2016-05-21 03:18

Please tell me someone has a sticky or a post somewhere documenting the number of LEs per propeller resource!

The only resource I've ever REALLY wanted was portB. Add your choice of external memory and off you go. As far as LCD graphics memory on-die goes (for fast redraws and such), it's a total waste. When you are playing with the onboard video DACs I could see it maybe. I've actually experimented with the SSD1963 and several different types of memory (all applicable to SSD1289, ILIxxx and other 8-16b TFT controllers AFAIK). I'm still very happy with the SRAM -151s although QPI FLASH or SRAM work just as well (from a slow human perspective). For the price of a decent FPGA, it seems to make more sense to include a memory solution; FLASH, SRAM and the like are DIRT CHEAP.

That's given you can do with only 8 cogs... Although I've never REALLY bumped this limit. Depends on what you are doing with your LCD device though. Pushing a whole screen refresh from cog memory seems silly to me though. the basic math on a smaller display - 320 * 240 =76,800 pixels. @4bpp that's 9,600 longs... What always made the most sense to me was have a "display cache memory" to load icons, backgrounds or whatever, connected directly to the display. With a little bit of planning you could have a free core for about 77k clocks while doing a full display refresh just using a counter.
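The display math above, spelled out as a quick check. The only figures added here are the standard P1 sizes (512 longs per cog, 32 KB of hub RAM); everything else comes from the post.

```python
WIDTH, HEIGHT = 320, 240
BITS_PER_PIXEL = 4
BITS_PER_LONG = 32
BYTES_PER_LONG = 4

COG_LONGS = 512          # P1 cog RAM, including special registers
HUB_BYTES = 32 * 1024    # P1 hub RAM

pixels = WIDTH * HEIGHT
frame_longs = pixels * BITS_PER_PIXEL // BITS_PER_LONG
frame_bytes = frame_longs * BYTES_PER_LONG

print("pixels:", pixels)
print("frame size in longs:", frame_longs)
print("frame size in bytes:", frame_bytes)
print("cog RAMs needed to hold one frame:", frame_longs / COG_LONGS)
print("fits in hub RAM:", frame_bytes <= HUB_BYTES)
```

So even at 4bpp a full 320x240 frame is well beyond one cog, and larger than the whole hub.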

That's just my 2 cents though...

*edit*

Sorry, I took too long writing that post and missed your post about using all 8 cogs. DARN.

I could get behind a p1v with all 8 cores and some sort of memory - display bus.

Tubular wrote: »

If the hi res/hi color video and ram and refresh is all taken care of, how many cogs would you need...

As with everyone, I'd love to have them all but, bare minimum I think 3 or 4 would do the trick. Just something for a REALLY flashy synth gui. Use the actual propellers for the heavy lifting... Especially if there was some kinda baked-in p2p??

jmg · 2016-05-21 03:19

T Chap wrote: »

..
Just so I understand jmg, the same P1 on board can be used as the P1V loader from USB? After loading the image on P1V, I assume it boots normally off its own EEPROM.

That's for the Altera Cyclone parts on the P123, but it shows you can have Serial link, and dual-purpose it for FPGA loading with no drop in performance over what Altera offers on their development boards.
I'm not sure how portable that P1 loader is to something like Lattice, but Lattice usually deploy the FT2232H as their CPLD loader.
Might work on a MAX 10 ?

ozpropdev · 2016-05-21 03:34

@ jmg
FYI The P1 loader on the P123 board takes 25 seconds to load the P2 image from power up.
I think a CPLD + FLASH would be a lot quicker.
The MAX10 is the quickest of the Altera range with its on board flash.

jmg · 2016-05-21 03:41

ozpropdev wrote: »

@ jmg
FYI The P1 loader on the P123 board takes 25 seconds to load the P2 image from power up.

I was going by Chip's comments, but I think he was talking about Download speed, more than Power Up loading.

ozpropdev wrote: »

I think a CPLD + FLASH would be a lot quicker.
The MAX10 is the quickest of the Altera range with its on board flash.

Yes, any part with in-package flash, should load and start faster.

T Chap · 2016-05-21 03:48

25 seconds normal boot up or downloading binaries? That is a long time for typical boot up, but ok for downloading binaries.

jmg · 2016-05-21 04:33

Search finds this
http://forums.parallax.com/discussion/comment/1343908/#Comment_1343908

Chip: "The loader on it is way faster than the Altera arrangement that all other boards seem to use."

but search does not find explicit times, which I saw somewhere...

ozpropdev · 2016-05-21 11:47

jmg wrote: »

Search finds this
http://forums.parallax.com/discussion/comment/1343908/#Comment_1343908

Chip: "The loader on it is way faster than the Altera arrangement that all other boards seem to use."

but search does not find explicit times, which I saw somewhere...

Yes I believe he is talking about the download speed.
BTW I do have a modified version of the loader that boots in 9 seconds instead of 25!
I never got around to finishing the faster write though.
You need to restore the original loader if you want to update the flash.
I can dig up the source and post it if anyone is interested.

jmg · 2016-05-21 21:38

jmg wrote: »

Peter Jakacki wrote: »

Any suggestions for a suitable FPGA for 4 cores and more memory and of course I/O?

The iCE5LP4K has an appealing QFN48 (39io) package, and might pack 2 COGS.
If that could talk to HyperRAM, then things get interesting....

If you can tolerate BGA, then the LCMXO3LF-6900E is available now in the same 9x9 mm BGA the MachXO3LF-9400 will come in.

A couple more package choices appear on Lattice website.
( where they show long before that package is actually available )

* XO2-1200 shows a new 32-pin QFN (5 x 5 mm) with 21io
(that could augment an existing P1 that needs more peripherals)
I think this will become the most CPLD in a QFN32, when it becomes real.

* XO2-4000 shows a 84-pin QFN (7 x 7 mm) with 68io
84 pins I think is 48+36 as two rings, 14 & 12/side.
Simpler PCBs could use one ring.

jmg · 2016-05-21 21:40

ozpropdev wrote: »

BTW I do have a modified version of the loader that boots in 9 seconds instead of 25!

How does that compare with other boot choices ?

Is there a table of the numbers anywhere, for how many pins are used, and how the speeds compare with more common FPGA-PC & FPGA-Boot pathways ?

T Chap · 2016-05-21 22:41

These look like nice parts for an 8 core. MAX10 40k LE. 3v3, 101io. QFP $60

http://www.digikey.com/product-detail/en/altera/10M40SAE144C8G/544-3120-ND/5284844

No idea if they are workable in reality, just doing parametric searches. Is an FPGA an FPGA as long as you have the io, LE and voltage?

jmg · 2016-05-22 01:48

T Chap wrote: »

EPF10 66io QFP 31k LE $37

I think that is only 32k gates, rather less LE's.
If you want to sort by package, picking TQFP100 looks to have the Xilinx XC3S500E- as the largest, at around 10476 LE and 368640b RAM (40+ k bytes)

T Chap · 2016-05-22 01:57

My mistake, I am settling on the MAX10 25k LE device Tubular linked earlier, in stock at Mouser for $31.00

http://www.mouser.com/ProductDetail/Altera/10M25SCE144C8G/?qs=bKenfurwlskgplYqDYvyBw==&gclid=Cj0KEQjwjoC6BRDXuvnw4Ym2y8MBEiQACA-jWfMTSahniSyPkLvjkB1Rl8W1g4xe3dtXG5I7oWfUF4QaAsS68P8HAQ

Add the HyperRam, Port B, EVE2 and get this platform up and running to last for another year until P2.

jmg · 2016-05-22 02:23

T Chap wrote: »

My mistake, I am settling on the MAX10 25k LE device Tubular linked earlier, in stock at Mouser for $31.00

Looks a good choice. MAX10 scales well.

T Chap wrote: »

Add the HyperRam, Port B, EVE2 and get this platform up and running to last for another year until P2.

With HyperRAM, the EVE2 is less required, but it looks an easy way to get something going...

T Chap · 2016-05-22 02:30

I have been playing around with the EVE2 screen editors, very slick stuff, instant graphics/buttons/sliders/progress etc of all types and it is much easier than creating graphics, converting to bitmaps/DAT etc. EVE2 is the way to go for me. I will explore HyperRam for simpler LCD projects that can live with minimal graphics/text and eventually streaming vid from cameras to LCD. Create your screens and see instantly what it will look like, export binaries and upload to EVE. Also assign touch tags to objects so you do not have to define areas for touch when you place objects, it does it automatically.

T Chap · 2016-05-22 02:36

EVE2 editor screen shot

jmg · 2016-05-23 01:42

T Chap wrote: »

My mistake, I am settling on the MAX10 25k LE device Tubular linked earlier, in stock at Mouser for $31.00
..
Add the HyperRam, Port B, EVE2 and get this platform up and running to last for another year until P2.

I just saw a new Altera board that has HyperFLASH & HyperRAM included

HyperMAX
Device: 10M25DAF256C7G
Memory: HyperFLASH™, HyperRAM™
Interfaces: NFC, Ethernet, USB to UART, CAN, authentication microprocessor, Arduino connector, Digilent Pmod™ Compatible connectors, expansion header

The block diagram also shows an I2C clock generator, and a 70 pin header for expansion. Claims to be 'low cost' - no price found, but the 10M25DAF256C7G lists for $70 1+

T Chap · 2016-05-23 02:11

Wow pretty nice. It has the 10M25 Dual Flash Image. That may be a good route for some of us looking to get 8 cores without having to build anything just yet. I can't find one for sale. If you run across a schematic for this please let me know. I am finding the documents for typical application schematics hard to come by online for MAX10.

rogloh · 2016-05-23 07:15

By the way, I just ran an experiment with the 10M25. I have 8 COGs and all my other extras - SDRAM/SRAM/Video/UFM FLASH/MUL/AutoIncOpcodes/LinkReg/PortB now fitting in 19760 of the 24960 LE’s (79%). Each COG takes about 1730 LEs in its own right including its two counters and ALU but not its video generators which I removed in my build.
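A quick breakdown of those preliminary numbers, in case it helps anyone budgeting a similar build. It assumes the "about 1730 LEs" figure applies equally to all 8 cogs, which is a simplification.

```python
TOTAL_LE = 24_960     # LEs available on the 10M25
USED_LE = 19_760      # LEs used in this preliminary build
COGS = 8
LE_PER_COG = 1_730    # approximate, counters and ALU included, no video

cog_le = COGS * LE_PER_COG          # LEs spent on cogs
other_le = USED_LE - cog_le         # hub, SDRAM/SRAM, video, PortB, etc.
free_le = TOTAL_LE - USED_LE        # headroom left on the device
utilisation_pct = 100 * USED_LE / TOTAL_LE

print("cogs:", cog_le, "LEs")
print("everything else:", other_le, "LEs")
print("free:", free_le, "LEs")
print("utilisation: %.1f%%" % utilisation_pct)
```

On these figures roughly 70% of the used logic is cogs, with the rest going to the hub and the extras.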

I haven’t run the full timing analysis on it yet as I need to adjust the SDC file again but based on that preliminary result at least, the 10M25 part is probably the sweet spot for a full P1V experience on MAX10 and it still leaves a bit more breathing space for other things.

Update: fitter completed and reports 72% usage. Regular cogs use 1889 LEs, the special SuperCog uses 2405. Will try the timing later.

Update2: TimeQuest reports Fmax of about 60MHz in this test. I didn't set all the constraints so am not sure if this is entirely accurate, but it is a number nonetheless. I did find that adding the DSP multiplier really dropped the Fmax down, so I might need to rework that to run over a number of clocks. The MUL instruction then won't execute in 4 clocks like the other instructions but might take 5 or 6. Probably not that big a deal since it is still so much faster than the software based multiply operations.
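For a feel of what those clock numbers mean per instruction, here is a small sketch. The 80 MHz comparison point is the usual P1 clock and is my addition; the 60 MHz figure and the 4, 5 and 6 clock counts are from the update above.

```python
def instr_time_ns(clocks, f_mhz):
    """Execution time in nanoseconds for an instruction taking `clocks` cycles."""
    return clocks * 1000.0 / f_mhz

P1_MHZ = 80.0     # assumption: typical real P1 clock, for comparison
FPGA_MHZ = 60.0   # preliminary TimeQuest Fmax from this build

print("4-clock instruction at 80 MHz: %.1f ns" % instr_time_ns(4, P1_MHZ))
print("4-clock instruction at 60 MHz: %.1f ns" % instr_time_ns(4, FPGA_MHZ))
for mul_clocks in (5, 6):
    t = instr_time_ns(mul_clocks, FPGA_MHZ)
    print("%d-clock MUL at 60 MHz: %.1f ns" % (mul_clocks, t))
```

Even a 6-clock MUL at 60 MHz is only the cost of two ordinary instructions at 80 MHz.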

Cluso99 · 2016-05-23 08:55

There used to be a P1V section. I posted details about LE usage there. I had more than 2KB working in a cog, but you did need a special instruction to jump to it. Instructions would then have needed to be changed to relative addressing, but that was not done. Alternatively, it was quite easy to have two blocks of code that remained within their own block when using jumps and calls. I did it as a proof of concept.

T Chap · 2016-05-23 11:38

Thanks for the info on your tests. Does the Fmax of 60 MHz imply that code adapted for 80 MHz on P1 would have to be reworked to run at 60?
