P1V with 2MB of hub visible RAM and now 32MB of SDRAM

rogloh · 2016-05-22 13:11

T Chap wrote: »

I am curious if you have estimated what device would be required for you to get 16 cores? When you compile for the FPGA image, is there some info that states how many logic elements are required to host the image?

I think ozpropdev may have some ideas on the requirements for such larger systems. I have only played with the DE0-Nano and BeMicro MAX10 boards, which can't reach 16 COGs. In a recent experiment I retargeted the design to the 10M16 FPGA instead of the 10M08 to see if I could double my COGs by doubling the LEs. It doesn't seem to scale perfectly linearly, and the fitter failed to complete. I guess routing resources also become a problem when you double your COGs, not just the LEs.

Yes the FPGA compile tool does state how many LE's are required once you compile for it.

T Chap · 2016-05-22 14:10

Thanks. Is your program binary living on an external eeprom just like the normal Props use? So the program is still exposed to interception? I don't see an external eeprom on the BE Micro.

rogloh · 2016-05-22 14:45

No, it is now in internal flash, and as such I believe it can be secured from any prying eyes. I just got the internal flash boot working, see this post
http://forums.parallax.com/discussion/comment/1377145/#Comment_1377145

T Chap · 2016-05-22 15:13

Did you guys explore replacing the MAX10 with a larger LE version? I have not looked if there is a pin for pin replacement. That would be pretty nice to bump up what you have on that board.

rogloh · 2016-05-23 00:58

Yeah, now I've kind of hit the limit for LE's, I've been thinking about that idea too. If the BeMicro MAX10 board had a 484 pin 10M25 populated instead of the 10M08 all sorts of things could become possible with the 3x higher LE count. For a start 8 COGs, and I'd like to try to use the ADC on the board as well as that needs more LE's.

Maybe a board or two could be sacrificed / reworked to try a ball grid array part replacement. They are very cheap development boards to experiment with. The 484 pin package chosen is nice as it goes all the way up to the 10M50. Whether or not the power on the board is sufficient for this part I don't know. See here for device options.
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/max-10-product-table.pdf

If it worked, this bigger device would be sweet. You also get 75kB of RAM on the 10M25 instead of 42kB so that's a nice feature as well if you need to keep more of the ROM tables and it even might be possible to add in a couple of extra COGs, though I'd probably just use it for extra hub SRAM.

Alternatively I could move back to the DE0 nano Cyclone IV part but I really like the internal flash feature of the MAX10 family for rapid startup. And some of those extra IO pins can be put to good use on the BeMicro MAX10 development board with my external 2MB SRAM hub expansion board too.

Tubular · 2016-05-23 01:12

I think Max10 is really the way forward, especially given what Rogloh and OzPropDev have made it do

If you want to upgrade boards, the $50 Altera one with arduino headers is the way to go - its EQFP rather than balls. Would be easy to upgrade the FPGA on it to a 10M25 or beyond, if need be
http://www.digikey.com/product-detail/en/altera/EK-10M08E144ES-P/544-3042-ND/4976140

T Chap since you're interested in displays there's also the Max10 NEEK from Terasic that might get you developing quickly, while other options appear for final implementation.
http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=218&No=956

Terasic are probably more reliable from a stock point of view, compared with Arrows BeMicro series, which appear and disappear a bit

jmg · 2016-05-23 01:37

rogloh wrote: »

Maybe a board or two could be sacrificed / reworked to try a ball grid array part replacement. They are very cheap development boards to experiment with. The 484 pin package chosen is nice as it goes all the way up to the 10M50.

For the brave maybe ?

I see Altera have new boards

https://www.altera.com/products/fpga/max-series/max-10/design-tools.html

A new 10M50 option there, shows at $125, quite good for development ?
(first 10M50 boards were $199, and the 10M50DF484C6GES used lists for $150, which makes that $125 look good value)

More interesting is a new one HyperMAX
Device: 10M25DAF256C7G
Memory: HyperFLASH™, HyperRAM™
Interfaces: NFC, Ethernet, USB to UART, CAN, authentication microprocessor, Arduino connector, Digilent Pmod™ Compatible connectors, expansion header

As well as the HyperFLASH & HyperRAM, the block diagram shows a i2c Clock generator, and a 70 pin header for expansion. Claims to be 'low cost' - no price found, but the 10M25DAF256C7G lists for $70 1+

How much of a P2 can fit into the 10M25DAF256C7G?

T Chap · 2016-05-23 01:56

I have been a few hours today looking at options.

https://www.arrow.com/en/products/bemicromax10/arrow-development-tools

The part on the BEMicro linked above shows Flash Dual Image/boot, 256k user Flash. The MAX10 family tree shows it has options for 32-172k, not sure how they arrive at 256k on the eval board, sort of a contradiction.

I have been looking at the MAX10M25 25k LE, http://www.mouser.com/ProductDetail/Altera/10M25SCE144C8G/?qs=bKenfurwlskgplYqDYvyBw==&gclid=Cj0KEQjwjoC6BRDXuvnw4Ym2y8MBEiQACA-jWfMTSahniSyPkLvjkB1Rl8W1g4xe3dtXG5I7oWfUF4QaAsS68P8HAQ

This is the Compact, Single Voltage, Single boot version. It lists 32-400k user flash, but I can't tell from the part numbers at Mouser how to determine what flash it has. I know that 25=25k LE, S=Single power 3v3, C=Compact Single Image, 144=pins, but C8G? I can't find out what C8G designates. Certainly there must be a code for user flash. Any ideas?

MAX10 overview here:
http://www.mouser.com/ds/2/591/br-max10-brochure-740801.pdf

I will build a board in the next month with this part, a HyperRam, EVE2 and was likely going to have options to go direct from sets of RGB pins off the FPGA to the 40FPC plug, but also be able to use the EVE2 8/8/8 RGB pins as well, use some 0603 33ohm resistors to decide what gets to the FPC for the LCD. I can place all 0603 parts on the Neoden machine so most of this is a breeze, but hand place the larger parts. I think the FPGA is easy in QFP. When I get this done I would share a few boards if anyone is interested. The MAX10m50 is too much money per chip for real world use, I am wanting something practical for real world use as far as cost and 80+ dollars is out for me, I want 8 cores, 64 io.

rogloh · 2016-05-23 02:11

Not sure about the P2 but likely the entire P1 would fit in the 10M25 part. For P2, Chip only has 1 COG and some smart pins in the Cyclone IV DE-0 nano and the LE count is comparable.

That new EBV dev board looks interesting and different, including its form factor. I see that the Intel PSG branding is starting to rear its head now in the picture. Pity there is no SDRAM on board, though there is a header for IO which might let it be added. By the looks of it the latency for HyperRAM seems to be a bit slow for the P1V to be able to randomly access per hub cycle to extend the hub RAM, though if it could it would be great.

rogloh · 2016-05-23 02:28

@TChap,
I believe the 10M25 has between 32kB and 400kB of flash depending on how you partition the configuration file(s) over the CFM blocks. At worst you end up with 32kB if you use memory initialisation of RAM blocks. At best you get 400kB if you have a single image without memory initialisation.

Also C8G is the speed rating. 8 is the slowest. From what I recall, C=consumer, I=industrial and I think G may possibly indicate a general release part? ES=engineering sample.
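To make the part-number discussion concrete, here is a minimal Python sketch that splits a MAX10 ordering code into the fields mentioned in this thread (LE count, supply, feature letter, package letter, pins, temperature letter, speed grade, ES suffix). The field names, the regular expression and the `decode_max10` helper are illustrative assumptions based on the posts above, not a transcription of the Altera ordering-code table, so check the datasheet before relying on any of it.

```python
import re

# Field layout inferred from part numbers quoted in this thread, e.g.
# 10M25SCE144C8G and 10M08DAF484C8GES. This is an assumption, not the
# official Altera ordering-code definition.
PART_RE = re.compile(
    r"^10M(?P<density>\d{2})"   # LE count in thousands (25 -> ~25k LEs)
    r"(?P<supply>[SD])"         # S = single supply, D = dual supply
    r"(?P<feature>[A-Z])"       # feature letter, e.g. C = compact, A = analog
    r"(?P<package>[A-Z])"       # package type letter
    r"(?P<pins>\d+)"            # pin count
    r"(?P<temp>[A-Z])"          # temperature letter (C or I in the thread)
    r"(?P<speed>\d)"            # speed grade, 8 being the slowest
    r"(?P<suffix>G?(?:ES)?)$"   # optional G and/or ES (engineering sample)
)


def decode_max10(part):
    """Split a MAX10 part number into the fields discussed in the thread."""
    m = PART_RE.match(part.strip().upper())
    if m is None:
        raise ValueError(f"unrecognised MAX10 part number: {part!r}")
    return {
        "k_logic_elements": int(m["density"]),
        "supply": {"S": "single", "D": "dual"}[m["supply"]],
        "feature": m["feature"],
        "package": m["package"],
        "pins": int(m["pins"]),
        "temperature": m["temp"],
        "speed_grade": int(m["speed"]),
        "engineering_sample": m["suffix"].endswith("ES"),
    }


if __name__ == "__main__":
    for name in ("10M25SCE144C8G", "10M08DAF484C8GES", "10M25DAF256C7G"):
        print(name, decode_max10(name))
```

Note that nothing in this layout encodes the user flash size, which fits the explanation above that the usable flash depends on how the configuration image is partitioned rather than on the part number.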

T Chap · 2016-05-23 02:41

In that case, your device now has speed=8 10M08DAF484C8GES . Is it fast enough for your needs?

jmg · 2016-05-23 02:43

T Chap wrote: »

I will build a board in the next month with this part, a HyperRam, EVE2 and was likely going to have options to go direct from sets of RGB pins off the FPGA to the 40FPC plug, but also be able to use the EVE2 8/8/8 RGB pins as well, use some 0603 33ohm resistors to decide what gets to the FPC for the LCD. I can place all 0603 parts on the Neoden machine so most of this is a breeze, but hand place the larger parts. I think the FPGA is easy in QFP. When I get this done I would share a few boards if anyone is interested.

If there is room, you could add an option for a VGA connector too ?
(and some USB ? - as I think a portion of a P2 would fit in the 10M25)

T Chap · 2016-05-23 02:51

rogloh wrote: »

@TChap,
At worst you end up with 32kB if you use memory initialisation of RAM blocks. At best you get 400kB if you have a single image without memory initialisation.

What are you doing on yours? I would also like to use the Flash for the program like you are doing. My goal is to use Silabs Cp2110 USB interface vs FTDI/Prop plug and use my homemade proploader tools. As I have understood it, you can program the FPGA via USB no different than a real P1. Although I am curious if you are doing some other method. As you stated at one point you are "booting off Flash".

jmg · 2016-05-23 03:13

rogloh wrote: »

By the looks of it the latency for HyperRAM seems to be a bit slow for the P1V to be able to randomly access per hub cycle to extend the hub RAM, though if it could it would be great.

That depends on the Clock ratios (and if they chose the 100MHz or 166MHz parts?)

How many SysCLKs is there per HUB on a P1V, and what MHz does that run at ?

Looks to me like 21 edges is the latency with defaults ( means no refresh jitter), so that would allow 32 edges to comfortably give 32 bits of read or write. It could probably do 64b, if you could work out how to manage that ?
With refresh hidden, and simpler IO, the HyperRAM looks easier to me, to get working than SDRAM.

rogloh · 2016-05-23 03:27

T Chap wrote: »

What are you doing on yours? I would also like to use the Flash for the program like you are doing. My goal is to use Silabs Cp2110 USB interface vs FTDI/Prop plug and use my homemade proploader tools. As I have understood it, you can program the FPGA via USB no different than a real P1. Although I am curious if you are doing some other method. As you stated at one point you are "booting off Flash".

I have the UFM flash blocks mapped into hub address window. I get 32kB free for regular propeller use.
However I now have an idea of how to boot such that the full MAX10 flash space (eg. 172kB on 10M08, likely 400kB on 10M25) can be realized and made available to at least one COG. This will benefit the bootup and allow a very fully featured application (eg. XMM from flash).

rogloh · 2016-05-23 03:37

jmg wrote: »

That depends on the Clock ratios (and if they chose the 100MHz or 166MHz parts?)

How many SysCLKs is there per HUB on a P1V, and what MHz does that run at ?

Looks to me like 21 edges is the latency with defaults ( means no refresh jitter), so that would allow 32 edges to comfortably give 32 bits of read or write. It could probably do 64b, if you could work out how to manage that ?
With refresh hidden, and simpler IO, the HyperRAM looks easier to me, to get working than SDRAM.

I expect that to align with the hub of the P1V, the HyperRAM would want to be run synchronously at 160MHz for an 80MHz P1V. That appears to give a nice 320MB/s DDR transfer rate when the HyperRAM is clocked at 160MHz, but to work in a single hub cycle and give the COG its result, the total data latency from when the full address is latched in the P1V needs to be 11 or fewer 160MHz clock edges for 32 bit reads. You could possibly slow it down to require two hub cycles, but that is not quite as nice. My auto incrementing pointer for reads does help alleviate that, so you could still copy one long to hub RAM in 2 hub windows in a tight loop for block transfers.

loop    RDLONG  data, hyperRAM WC   ' takes 2 hub cycles (per loop), WC=autoincrement ptr
        WRLONG  data, hub
        ADD     hub, #4
        DJNZ    count, #loop

UPDATE: actually so long as the read returns data within the same hub cycle (32 x 160MHz clocks), you could get a read fitting in a single hub window, it will just have the data returned in more than 8 P1V clocks. So the loop above reduces to 2 hub windows per long, which is very fast and equivalent to copying from hub to hub today anyway. The thing you miss out on is achieving hubExec at 5 MIPS when reading from the external HyperRAM.
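The numbers in this post can be checked with a few lines of arithmetic. The Python sketch below assumes a hub window every 16 system clocks (the standard P1 figure, assumed here to carry over to the P1V) and an 8-bit DDR HyperRAM bus moving 2 bytes per clock. The `hub_timing` function and its parameter names are made up for illustration.

```python
def hub_timing(p1v_clk_hz=80_000_000,
               hyperram_clk_hz=160_000_000,
               hub_window_clocks=16,      # assumption: standard P1 hub rotation
               bytes_per_hr_clock=2,      # assumption: 8-bit bus, DDR
               hub_windows_per_long=2):   # from the copy loop above
    """Return the timing figures discussed in the post as a dict."""
    hub_windows_per_s = p1v_clk_hz // hub_window_clocks
    return {
        # Length of one hub window in nanoseconds.
        "hub_window_ns": hub_window_clocks * 1_000_000_000 // p1v_clk_hz,
        # HyperRAM clocks that elapse during one hub window.
        "hr_clocks_per_window": hyperram_clk_hz * hub_window_clocks // p1v_clk_hz,
        # Peak streaming rate of the HyperRAM bus in bytes per second.
        "peak_bytes_per_s": hyperram_clk_hz * bytes_per_hr_clock,
        # One hub access per window, which bounds hubExec-style fetches.
        "hub_windows_per_s": hub_windows_per_s,
        # Block copy rate when each long costs hub_windows_per_long windows.
        "copy_bytes_per_s": hub_windows_per_s * 4 // hub_windows_per_long,
    }


if __name__ == "__main__":
    for key, value in hub_timing().items():
        print(f"{key}: {value}")
```

With the defaults this gives 32 HyperRAM clocks per hub window, a 320MB/s peak bus rate, 5 million hub windows per second (the 5 MIPS hubExec figure) and 10MB/s for the two-window copy loop.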

Tubular · 2016-05-23 04:00

Regarding the HyperRam, the max clock is tied to the voltage. So 1v8 Hyperram works up to 160 MHz, and the 3v3 Hyperram works up to 100 MHz.

The only stock appears to be the 1v8 from digikey, but 3v3 due in a few days. We'll see.

rogloh · 2016-05-23 04:03

I thought I saw the ISSI part could be clocked at 166MHz. Didn't realise that it was for low voltage only. Pity, though I guess you could set one of the banks to 1.8V in the FPGA.

jmg · 2016-05-23 04:12

rogloh wrote: »

loop    RDLONG  data, hyperRAM WC   ' takes 2 hub cycles (per loop), WC=autoincrement ptr
        WRLONG  data, hub
        ADD     hub, #4
        DJNZ    count, #loop

UPDATE: actually so long as the read returns data within the same hub cycle (32 x 160MHz clocks), you could get a read fitting in a single hub window, it will just have the data returned in more than 8 P1V clocks. So the loop above reduces to 2 hub windows per long, which is very fast and equivalent to copying from hub to hub today anyway. The thing you miss out on is achieving hubExec at 5 MIPS when reading from the external HyperRAM.

If you have tweaked opcodes anyway, what about using WZ to enable a CEN extension for consecutive reads?

RDLONG WZ would:
  if CS=1: lower CS and send address and refresh latency, aka StartHR
  if CS=0: use 4 edges to Rd or Wr one Long, aka ContinueHR
RDLONG with no WZ uses 4 edges and releases CEN

that would allow SW to get 1,2,3 etc Long per HyperRAM cycle, and could suit HubEXEC ?
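A toy Python model may help show why keeping CS low across consecutive reads pays off. It is only a sketch of the scheme proposed above, not real P1V logic. The class name, the `setup_edges` parameter and its value of 20 are hypothetical, and the model assumes the starting access also returns its first long.

```python
class HyperRamBurstModel:
    """Counts clock edges for the proposed RDLONG / RDLONG WZ scheme."""

    LONG_EDGES = 4  # one 32-bit long over an 8-bit DDR bus, one byte per edge

    def __init__(self, setup_edges):
        # Edges spent on address plus initial latency (hypothetical figure).
        self.setup_edges = setup_edges
        self.cs_high = True   # chip select idle
        self.edges = 0        # running total of clock edges used

    def rdlong(self, wz):
        """Model one RDLONG; wz=True keeps CS low for a following read."""
        if self.cs_high:
            # StartHR: lower CS, send address, wait out the latency.
            self.cs_high = False
            self.edges += self.setup_edges
        # Transfer the long itself (ContinueHR when CS was already low).
        self.edges += self.LONG_EDGES
        if not wz:
            # A plain RDLONG releases CS after its transfer.
            self.cs_high = True
        return self.edges


if __name__ == "__main__":
    burst = HyperRamBurstModel(setup_edges=20)
    burst.rdlong(wz=True)
    burst.rdlong(wz=True)
    print("3-long burst edges:", burst.rdlong(wz=False))

    single = HyperRamBurstModel(setup_edges=20)
    for _ in range(3):
        single.rdlong(wz=False)
    print("3 separate reads edges:", single.edges)
```

Under these assumed numbers a three-long burst costs 32 edges against 72 for three separate reads, since the setup cost is paid once per burst instead of once per long.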

Tubular · 2016-05-23 04:14

rogloh wrote: »

I thought I saw the ISSI part could be clocked at 166MHz. Didn't realise that it was for low voltage only. Pity, though I guess you could set one of the banks to 1.8V in the FPGA.

Yep, needing 12 or 13 pins total fits within 1 bank nicely

I have to admit the low pin count is very attractive, and might even find applications with standard P1s (at lower clock rates)

T Chap · 2016-05-23 04:15

You have dual voltage on the Be Micro, so it could maybe be used as is for testing HyperRam? Else, on a homebrewed board what is the issue with a translation from a 3v3 FPGA to a 1.8V hyperram. Easy enough to add a 1.8 LDO.

jmg · 2016-05-23 04:35

T Chap wrote: »

You have dual voltage on the Be Micro, so it could maybe be used as is for testing HyperRam? Else, on a homebrewed board what is the issue with a translation from a 3v3 FPGA to a 1.8V hyperram. Easy enough to add a 1.8 LDO.

Just buy the 3V part, if you need 3V

The 1.8V part uses Differential Clocking, and if you were pushing the clock right up, then you may need to start using the return clock signal which compensates for Driver, pin and PCB delays.
It looks like you can fix the refresh window, to always on, which makes jitter less and verilog simpler.

jmg · 2016-05-23 04:38

Tubular wrote: »

I have to admit the low pin count is very attractive, and might even find applications with standard P1s (at lower clock rates)

There is no Min CLK speed, and clock can run from a timer, so a P1 at 20MHz RW-rate, or 10MHz RW-rate with rotate opcode included, is still well ahead of any SPI memory and similar to what parallel memory can burst at.

rogloh · 2016-05-24 04:10

HyperRam has pretty fast transfer rates and a low pin count. The total latency seems to be its main issue which lowers the overall cycle time possible for random accesses, yet it remains faster than some other options. It's probably better used as flash/SD card replacement than for real RAM if you want to be executing code out of it, unless you can provide a cache to try to hide the latency (though then you lose some determinism). Streaming video data may be a possible use for it; of course enabling data writes at the same time will be the problem as usual like most single port memories. You could double buffer perhaps with two devices but that requires 2x the pins. In my case I plan to switch SDRAM banks for my video implementation, giving the prop full access to the non-active video bank until it gets switched for the next drawn frame. This is ideal for write performance but it does mean you have to manage two frame buffers, not so bad when you have 8-32MB of SDRAM to play with... :cool:
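The bank switching idea can be sketched as a simple ping-pong buffer. The Python below is only an illustration of the scheme described in this post: the class and method names are hypothetical, byte arrays stand in for the two SDRAM banks, and `flip()` stands in for the bank switch at the start of the next drawn frame.

```python
class BankSwitchedFrameBuffers:
    """Two frame buffers: one scanned out for video, one free for writes."""

    def __init__(self, width, height):
        size = width * height
        # Stand-ins for the two SDRAM banks (one byte per pixel assumed).
        self.banks = [bytearray(size), bytearray(size)]
        self.display_bank = 0  # bank currently owned by the video logic

    @property
    def draw_bank(self):
        """Index of the bank the prop has full access to."""
        return 1 - self.display_bank

    def write_pixel(self, offset, value):
        """Writes only ever touch the non-active video bank."""
        self.banks[self.draw_bank][offset] = value

    def scanout(self):
        """Return a snapshot of the frame the video logic is displaying."""
        return bytes(self.banks[self.display_bank])

    def flip(self):
        """Swap banks, as would happen between drawn frames."""
        self.display_bank = self.draw_bank


if __name__ == "__main__":
    fb = BankSwitchedFrameBuffers(4, 2)
    fb.write_pixel(0, 7)
    print("before flip:", fb.scanout()[0])
    fb.flip()
    print("after flip:", fb.scanout()[0])
```

The cost noted in the post shows up here too: after a flip the new draw bank still holds the older frame, so the software has to manage both buffers.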

jmg · 2016-05-24 05:31

rogloh wrote: »

.... Streaming video data may be a possible use for it; of course enabling data writes at the same time will be the problem as usual like most single port memories.

I've just been looking at this, and the data hints you can go > tCSM, which makes a simple single-frame buffer with clocked playback straightforward to use, even for P1?
That still leaves reasonable time for Writes.

rogloh wrote: »

You could double buffer perhaps with two devices but that requires 2x the pins.

True, but even with 2 devices, you still have far fewer pins than SDRAM, and now you can focus one part on Display handling, and the other can be XIP tuned.

rogloh wrote: »

In my case I plan to switch SDRAM banks for my video implementation, giving the prop full access to the non-active video bank until it gets switched for the next drawn frame. This is ideal for write performance but it does mean you have to manage two frame buffers, not so bad when you have 8-32MB of SDRAM to play with... :cool:

To get best Whole RAM / Mixed Use performance from HyperRAM, will need some buffers helping, and higher Clocks.
Of course, a P1V is an ideal place to add those helpers.

It will also be interesting to see how P2 Streamers 'play along' with the HyperRAM.
Some small tuning of the interface may be needed.

rogloh · 2016-05-24 06:51

Byte banging from COG software (or maybe with smartpin assistance) in the P2 while streaming to HUB will be interesting for HyperRAM. I do wonder what overall access transaction rate (not streaming rate) will be possible for basic 32 bit read transfers for a COG from start to end. It could still be a lot of total P2 clocks required for a single read especially if the request comes from another independent COG. The whole LUT thing may help a little.

T Chap · 2016-05-24 14:17

In Verilog are you guys able to route any of the P0-P63 pins for ports a/b to any io? Or is it done in groups of pins? When you assign are you actually picking a real pin number on the FPGA? If so could you share your pin numbers for port A and B? (EDIT: eh, that won't work probably, the Be Micro is using a 484 BGA due to the dual power/analog chip.) I am curious to learn what pins would be used on a custom board vs the eval board. They have a lot of stuff broken out that is likely not to be used by the P1V. On a certain eval board for this part they only show the USB blaster for programming. Is that what you guys are using? How is the programming done for the typical Prop loader off BST/PropTool etc? Still on the USB Blaster or other method? I am trying to figure out how this would connect to a typical proploader and which pins, as well as if the same prop timing sequence works as it did before on P1. Seems you have two processes to manage: download the image via USB to flash the device with the P1V image, then after that use the device just as if it were a P1 via USB.
