First attempt at building a P1V image for the BeMicro Max10
ozpropdev
Hi All
Update: 11th Feb 2015 : See working BeMicro-Max10 files here.
I started on a P1V image today for the BeMicro MAX10 board and I've hit a wall.
A rough estimate suggested 4 cogs could fit into the 10M08DAF484C8GES Max10 device. (8000 LE's).
This included the removal of the lower rom stuff (tables,fonts) like the DE0-Nano build.
Quartus II 14.1 reported problems fitting the image with the following details.
+---------------------------------------------------------------------------------+
; Flow Summary                                                                    ;
+------------------------------------+--------------------------------------------+
; Flow Status                        ; Flow Failed - Sat Feb 07 20:00:50 2015     ;
; Quartus II 64-Bit Version          ; 14.1.0 Build 186 12/03/2014 SJ Web Edition ;
; Revision Name                      ; top                                        ;
; Top-level Entity Name              ; top                                        ;
; Family                             ; MAX 10                                     ;
; Device                             ; 10M08DAF484C8GES                           ;
; Timing Models                      ; Preliminary                                ;
; Total logic elements               ; 17,306 / 8,064 ( 215 % )                   ;
; Total combinational functions      ; 15,830 / 8,064 ( 196 % )                   ;
; Dedicated logic registers          ; 2,930 / 8,064 ( 36 % )                     ;
; Total registers                    ; 2930                                       ;
; Total pins                         ; 42 / 250 ( 17 % )                          ;
; Total virtual pins                 ; 0                                          ;
; Total memory bits                  ; 327,680 / 387,072 ( 85 % )                 ;
; Embedded Multiplier 9-bit elements ; 0 / 48 ( 0 % )                             ;
; Total PLLs                         ; 1 / 2 ( 50 % )                             ;
; UFM blocks                         ; 0 / 1 ( 0 % )                              ;
; ADC blocks                         ; 0 / 1 ( 0 % )                              ;
+------------------------------------+--------------------------------------------+
Removing the RAM loading attribute in the hub_mem.v file has a dramatic effect on LE usage.
(* ram_init_file = "hub_rom_high.hex" *) reg [31:0] rom_high [4095:0];
changed to
reg [31:0] rom_high [4095:0];
Resulted in a successful build
+---------------------------------------------------------------------------------+
; Flow Summary                                                                    ;
+------------------------------------+--------------------------------------------+
; Flow Status                        ; Successful - Sat Feb 07 20:29:15 2015      ;
; Quartus II 64-Bit Version          ; 14.1.0 Build 186 12/03/2014 SJ Web Edition ;
; Revision Name                      ; top                                        ;
; Top-level Entity Name              ; top                                        ;
; Family                             ; MAX 10                                     ;
; Device                             ; 10M08DAF484C8GES                           ;
; Timing Models                      ; Preliminary                                ;
; Total logic elements               ; 7,415 / 8,064 ( 92 % )                     ;
; Total combinational functions      ; 6,788 / 8,064 ( 84 % )                     ;
; Dedicated logic registers          ; 2,898 / 8,064 ( 36 % )                     ;
; Total registers                    ; 2898                                       ;
; Total pins                         ; 42 / 250 ( 17 % )                          ;
; Total virtual pins                 ; 0                                          ;
; Total memory bits                  ; 327,680 / 387,072 ( 85 % )                 ;
; Embedded Multiplier 9-bit elements ; 0 / 48 ( 0 % )                             ;
; Total PLLs                         ; 1 / 2 ( 50 % )                             ;
; UFM blocks                         ; 0 / 1 ( 0 % )                              ;
; ADC blocks                         ; 0 / 1 ( 0 % )                              ;
+------------------------------------+--------------------------------------------+
Is there another way to initialize the RAM with an image, or does this result indicate that the MAX10 logic cells are being consumed by ROM emulation?
The obvious answer is probably a defective "ME", but any other suggestions would be a great help.
Cheers
Brian
Comments
Edit: I didn't notice any difference in LE usage when I commented out (* ram_init_file = "hub_rom_high.hex" *)
Something specific to MAX10?
I would be mighty impressed if we could get 3 cogs running, let alone 4
Want me to post your Max10?
These may help, some suggest it is a little tricky
http://jimselectronicsblog.blogspot.co.nz/2014/12/storing-nios-ii-application-code-in-non.html
http://www.altera.com/support/kdb/solutions/rd10302014_959.html
Still, the rest looks promising, if tight. 92% full, with 649 Spare LE's
It looks like I might have to take a step back to 14.02 to fix the problem. Yikes! WTG Altera
Cheers
Brian
"You can turn on the Enable ERAM Preload option in the More Analysis & Synthesis Settings dialog box."
but an Altera answer Jan 2015 says
http://www.altera.com/support/kdb/solutions/rd01072015_668.html
[" Title : Where can I find the "Enable ERAM Preload" option for MAX 10 devices in the Quartus II software version 14.1 and later?
Description
The "Enable ERAM Preload" option for MAX® 10 devices can be located in the Device options in the Quartus® II software version 14.1 and later. "]
I think that Altera answer is newer than the Blog, so it may help.
Just tried one of Altera's suggested settings....failed!
Trying their second suggestion now......
The bad news is I now blow out on the memory bits budget.
I had to reduce to 3 cogs now for some reason, but building for 1 cog still exceeds memory bit limit.
Here's the result for a 3 cog build on Quartus II 14.1. It looks like a larger Max10 is needed to accommodate a P1V.
The LE per cog changes from 7,415/4 = 1853.75 to 6,374/3 = 2124.67 - a 14.6% rise ?
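A quick way to check the per-cog figures, as a minimal Python sketch. The function name is made up for illustration; the LE totals are the ones quoted for the two builds. Note this is a whole-chip average (hub logic is included in the total), and the rise works out to roughly 14.6%.

```python
def le_per_cog(total_les: int, cogs: int) -> float:
    """Average logic elements per cog for a whole-chip build (hub included)."""
    return total_les / cogs

four_cog = le_per_cog(7415, 4)    # 4-cog build without RAM preload
three_cog = le_per_cog(6374, 3)   # 3-cog build with RAM preload enabled
rise_pct = (three_cog / four_cog - 1) * 100

print(f"{four_cog:.2f} -> {three_cog:.2f} LE per cog, {rise_pct:.1f}% rise")
```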
RAM usage has also elevated by 14336 Bytes, but that should be mapped to HUB + N 8 COG memory.
I've not tried Quartus 14.1, but sometimes tools give more lucid info, on a Build that worked, than one that failed.
With 9 bit memory maths,
(387,072-512*36*4)/9 = 34816 Bytes of avail memory, or 2048 Bytes above 32768 for ROM. ( a Full 32K ROM will overflow the part )
Some trade off of ROM features and RAM will be needed, initially.
2048 Bytes should be enough for a Loader ?
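The 9-bit memory maths can be sketched in Python like this. It assumes, as the formula does, that every byte costs 9 memory bits and every cog costs 512 x 36 bits; the names are mine, not from the P1V sources.

```python
MEM_BITS_10M08 = 387_072   # "Total memory bits" limit reported by Quartus
COG_RAM_BITS = 512 * 36    # one cog RAM: 512 longs stored 36 bits wide

def bytes_available(total_bits: int, cogs: int, rom_bits: int = 0) -> int:
    """Bytes left for hub memory, counting 9 memory bits per byte."""
    return (total_bits - cogs * COG_RAM_BITS - rom_bits) // 9

avail = bytes_available(MEM_BITS_10M08, cogs=4)
print(avail, "bytes available,", avail - 32768, "bytes above a 32K hub")
```

Passing `rom_bits=4096 * 36` gives the budget when a 4096x32 ROM is kept in block RAM as well.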
The Max10 has 176128 Bytes of FLASH, with parallel interface options, so rather than Initialised-RAM, Flash could be mapped.
The Flash speed is slower than RAM, and burst orientated, so that will need some work.
A reduced RAM with minimal ROM Loader should give a testing base ?
The Altera site only seems to have a 14.1 version available. The upgrade from 14.0 to 14.02 does not include this file.
The device installer only recognizes 14.0 files.
Has anyone successfully added Max10 devices into 14.02 ?
Edit: Never mind.....It is there....It's been a long day......:)
eg if ROM is as above (4096x32) , (387,072-512*36*4-4096*36)/9 = 18432 Bytes of HUB RAM
That's a large ROM, what bumps the size ?
Each test takes approx. 15~20 minutes. Time is dragging now....more coffee needed....:coffee:
This is what I got first run of 14.02
Do you think my answer should be "How the hell do I know!"
I gave up on Quartus 14.02, nothing but heartache there.
Back to 14.1 now and just built a 4 Cog 16K HUB P1V...phew. I'll bump HUB RAM up until it no longer fits next. Now I can get some sleep...
Cheers
Brian
ie Does it run, and how fast ?
There may be a way to drop the ROM size to just the [interpreter and loader] value of 0xffff-0xf002 = 4093, or even further, to Loader alone.
Well done Brian. That's quite a tight fit on the logic elements, it will be interesting to see what speed it runs at
If you get a moment I'd be curious whether 8 cogs + 32k hub ram fits in a 10M16
On paper, that depends on the ROM handling chosen .
If you wanted to simply drop-in ROM the same as P1, then it comes up short (even on 10M25)
10M16 :
549*1024-(32768*9 + 32768*9 +512*36*8) = -175104b (19456 Bytes short)
Even 10M25
675*1024-(32768*9 + 32768*9 +512*36*8) = -46080 (5120 Bytes Short)
Or solving for ROM size (or Bonus RAM, if ROM is handled differently, see below )
10M16 : (549*1024-(32768*9 +512*36*8))/9 = 13312 ROM (RAM)
10M25 : (675*1024-(32768*9 +512*36*8))/9 = 27648 ROM (RAM)
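The same sums for 8 cogs and a 32K hub, as a small Python sketch. The memory sizes in Kbits are the ones used in the formulas above; the function and dictionary names are only illustrative.

```python
DEVICE_MEM_KBITS = {"10M16": 549, "10M25": 675}  # memory sizes used above

def rom_budget_bytes(mem_kbits: int, cogs: int = 8, hub_bytes: int = 32768) -> int:
    """Bytes left for ROM (or bonus RAM) after hub RAM and cog RAMs."""
    total_bits = mem_kbits * 1024
    used_bits = hub_bytes * 9 + cogs * 512 * 36
    return (total_bits - used_bits) // 9

for device, kbits in DEVICE_MEM_KBITS.items():
    budget = rom_budget_bytes(kbits)
    print(device, budget, "bytes for ROM,", 32768 - budget, "short of a full 32K ROM")
```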
However, there is a LOT of User Flash with a Parallel burst read ability, that may be able to be used for most ROM tasks.
10M08 : 1376*1024/8 = 176128 Bytes Flash
10M16 : 2,368*1024/8 = 303104 Bytes Flash
10M25 : 3,200*1024/8 = 409600 Bytes Flash
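To put those flash sizes in P1 terms, here is a rough Python sketch that converts Kbits to bytes and to whole cog images. It assumes one cog image is 512 longs (2048 bytes) and that all of the quoted flash is actually usable, which is not yet clear from the docs.

```python
FLASH_KBITS = {"10M08": 1376, "10M16": 2368, "10M25": 3200}  # figures quoted above
COG_IMAGE_BYTES = 512 * 4   # assumption: one full cog image is 512 longs

def kbits_to_bytes(kbits: int) -> int:
    return kbits * 1024 // 8

for device, kbits in FLASH_KBITS.items():
    size = kbits_to_bytes(kbits)
    print(device, size, "bytes,", size // COG_IMAGE_BYTES, "cog images")
```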
Certainly, it can have many COG images and the Logic or even an Opcode to Load Page from Flash would open this.
The timing diagrams show an Address and Burst Count of up to 128 x 32 reads, but I cannot see if there are any boundary caveats on that address?
If the Burst read takes any starting address and feeds up to 128 words, then that is easy to use.
Even if those reads have to be on 128 word boundaries, it is still useful.
With a Burst Count shown that could be provided by the ReadFlash Opcode
aka
Ra has 32b Flash Start address, and Rb contains Burst Count of 7(?9) bits, 1 bit for COG or HUB destination and
lower 9 bits as Dest Address if COG, or lower 15 ?? BITs if HUB destination.
The same opcode could work nicely with External QuadSPI memory.. Same params, but Count can be larger (9b? is easy to set)
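One way the Rb operand could be laid out, purely as a Python sketch. The bit positions are hypothetical (nothing here is defined anywhere in P1V), the count is stored as count-1 so that 128 fits in 7 bits, and the field widths follow the 7/1/9/15 bit guesses above.

```python
COUNT_BITS = 7       # burst count field width (could also be 9)
DEST_HUB_BIT = 15    # hypothetical position of the COG/HUB select bit
COUNT_SHIFT = 16     # hypothetical position of the burst count field

def pack_rb(count: int, to_hub: bool, dest_addr: int) -> int:
    """Pack burst count, destination select and destination address into Rb."""
    addr_bits = 15 if to_hub else 9
    if not 1 <= count <= (1 << COUNT_BITS):
        raise ValueError("burst count out of range")
    if not 0 <= dest_addr < (1 << addr_bits):
        raise ValueError("destination address out of range")
    return ((count - 1) << COUNT_SHIFT) | (int(to_hub) << DEST_HUB_BIT) | dest_addr

def unpack_rb(rb: int) -> tuple:
    """Recover (count, to_hub, dest_addr) from a packed Rb value."""
    count = ((rb >> COUNT_SHIFT) & ((1 << COUNT_BITS) - 1)) + 1
    to_hub = bool((rb >> DEST_HUB_BIT) & 1)
    addr_bits = 15 if to_hub else 9
    return count, to_hub, rb & ((1 << addr_bits) - 1)

print(hex(pack_rb(128, False, 0x000)))  # full 128-long burst into COG address 0
```

Ra would simply carry the 32b Flash start address unchanged.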
Data says Burstcount range is 1, 2, 4, 7, ... 128
Looks like 1 can be used for single word reads, but 3,5,6 seem invalid ?, and the timing suggests pair-gets ?
An example gives 6 Count, so maybe that's a typo ?
The exact handling of Flash Reads varies with Chip part code.
I think a 10M08 can read 128 x 32 bit words in (5+128*2)/116M = 2.25 us, which is a pretty nimble task switch, or Function call.
See MAX 10 User Flash Memory User Guide for details.
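The read time estimate as a small Python sketch. The 5 setup cycles, 2 cycles per word and 116 MHz are the figures from the formula above for a 10M08; they are taken as given here, not checked against the User Guide.

```python
def burst_read_us(words: int, clock_hz: float = 116e6,
                  setup_cycles: int = 5, cycles_per_word: int = 2) -> float:
    """Estimated time in microseconds for one flash burst read."""
    return (setup_cycles + words * cycles_per_word) / clock_hz * 1e6

print(f"128 longs: {burst_read_us(128):.2f} us")
# A full 512-long cog image as four back-to-back 128-long bursts (assumption)
print(f"512 longs: {4 * burst_read_us(128):.2f} us")
```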
And if this is correct, this would be a rather neat way to implement a P1V. You would program the MAX10 with the core, keeping the entire ROM in UFM. Then the "ROM" could potentially be modified without having to touch the verilog.
Anyway yes a reduced ROM or access via flash would be fine
The Flash has a readdatavalid signal, so it should need a fairly simple state machine.
Correct, but Flash PGM from User code is a little more work.
There is also a speed penalty for Flash reads over RAM reads, but for many (most?) apps that would not matter.
Using the BurstCount should reduce the impact of the Flash latency.
The '9 bit effect' means the tools will still report some Memory Free, even when all available bytes are taken.
FYI. I was able to build a 6 Cog 32K HUB image for a 10M16. Cheers
Brian
Haven't seen pricing for those 10m16s yet. Altera seem to release new models at the start of each quarter - the 10M50's appeared in the last release. Hopefully the 10M16's aren't far away
The BeMicroMAX10 uses the 10M08DAF484C8GES. According to Mouser and Arrow's BeMicroMAX10 details, that particular version has 256Kb. I believe the 1376Kb is the maximum available in a 10M08 package. Based on the MAX10 overview document, the 1376Kb also includes the configuration flash (though the User Flash Memory User Guide states differently).
Of course, the "ES" at the end of the part number also means "Engineering Sample", so it might very well be that there's a bit less UFM than will be available in full production chips.
Edit: to make it even more complicated, if you select the device in Quartus, it states that the maximum UFM is 2555904 bits (2496Kb, 312KB). Interestingly, though, the User Flash Memory User Guide states that the Configuration Flash Memory is 2240Kb. If you add 256Kb to that, you get 2496Kb. Here's my guess: the overview document is wrong, as is the UFM column in the UFM User Guide. The actual total amount of flash is 312KB, split into 32KB of UFM and 280KB of CFM.
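A short Python check of that guess, using only the numbers quoted in this post (the Quartus bit count, the 256Kb UFM figure and the 2240Kb CFM figure). On these numbers the split comes out at 32KB of UFM plus 280KB of CFM.

```python
QUARTUS_UFM_BITS = 2_555_904   # "maximum UFM" shown by Quartus for the device
UFM_KBITS = 256                # Mouser/Arrow figure for the BeMicro part
CFM_KBITS = 2240               # CFM size from the UFM User Guide

total_kbits = QUARTUS_UFM_BITS // 1024
ufm_kbytes = UFM_KBITS // 8
cfm_kbytes = CFM_KBITS // 8

print(total_kbits, "Kb total =", total_kbits // 8, "KB")
print("UFM", ufm_kbytes, "KB + CFM", cfm_kbytes, "KB =", ufm_kbytes + cfm_kbytes, "KB")
```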
You're right about them being new to market, and also it might be possible to repurpose memory to some extent that stops from achieving maximums, but the fits from OzPropDev seem consistent with 378Kbits
256kb is 32K Bytes ?
Altera DOCs do look very conflicted and confused. You wonder if anyone bothers to read them before release ?
I also find this
["Table 2: UFM and CFM Array Size for MAX 10 Devices
This table lists the dimensions of the UFM and CFM arrays for MAX 10 devices. The Altera On-Chip Flash IP core also gives you access to configuration flash memory (CFM) when you turn on the dual image configuration mode option."] somewhat hidden over in this pdf
http://www.altera.com/literature/an/an631.pdf
I think they mean you get access to flash when it is NOT used as the Dual Image
That makes sense of (8+8+41+29)*(16) = 1376kb, achieved by using (CFM2) & (CFM1), with CFM0 as Single Config, so I think their docs mean to say :
Available FLASH, in Single Config image mode :
(UFM0) = 16k Bytes (or 8 COG images)
(UFM1) = 16k Bytes (or 8 COG images)
(CFM2) = 82k Bytes ( or 41 COG images)
(CFM1) = 58k Bytes ( or 29 COG images)
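The sector sizes follow directly from the page counts in (8+8+41+29)*(16), as this Python sketch shows. It assumes 16Kb pages and a 2048 byte cog image, so each page holds exactly one cog image; the sector names follow the "(CFM2) & (CFM1)" reading and should be treated as my interpretation of the table.

```python
PAGE_KBITS = 16
COG_IMAGE_BYTES = 512 * 4   # assumption: 512 longs per cog image
SECTOR_PAGES = {"UFM0": 8, "UFM1": 8, "CFM2": 41, "CFM1": 29}  # page counts from the sum above

def sector_bytes(pages: int) -> int:
    return pages * PAGE_KBITS * 1024 // 8

for name, pages in SECTOR_PAGES.items():
    size = sector_bytes(pages)
    print(f"{name}: {size // 1024}k Bytes, {size // COG_IMAGE_BYTES} COG images")

total_kbits = sum(SECTOR_PAGES.values()) * PAGE_KBITS
print("total:", total_kbits, "Kb")
```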
Those builds map to
11,102/6 = 1850.33 LUT per COG for 6
7,481/4 = 1870.25 LUT per COG for 4
On those rates, a 10M16 is a ceiling of 8.560 COGs worth
The 10M50 has a price of 5 + $76.95 in EQFP144, and it maps to roughly 27 COGS
The 10M04 is $9.69 (BGA) @ 119 for ~ 2 COGs
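Those COG ceilings can be reproduced with a few lines of Python. The LE counts per device are my assumption (as I recall them from Altera's MAX 10 overview), and the per-COG rate is the whole-chip average from the 6 COG build, so hub logic is smeared across the COGs.

```python
LE_PER_COG = 11_102 / 6   # average from the 6 cog 10M16 build above

# Assumed logic element counts per device (not stated in this thread)
DEVICE_LES = {"10M04": 4_032, "10M16": 15_840, "10M50": 49_760}

def cogs_that_fit(device_les: int, le_per_cog: float = LE_PER_COG) -> float:
    """Rough ceiling on COG count from logic elements alone (ignores memory)."""
    return device_les / le_per_cog

for device, les in DEVICE_LES.items():
    print(f"{device}: ~{cogs_that_fit(les):.2f} COGs worth of LEs")
```

Memory bits will usually cap the COG count before the LEs do, so treat these as upper bounds only.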
There are some interesting things you can do with a P1 with less cogs but more pins - eg a touchscreen with the ability to rapidly dump a megabyte of data out to the screen for fast text refresh in any font. The P1 never seems to have quite enough pins for that. And maybe 8 cogs are not needed - I suspect a VHDL/Verilog UART may take less elements than a cog running UART code.
There seems to be a thread every few weeks about what the P2 should look like - much more fun to actually be building things and testing them out. I am watching this thread with interest (and some degree of guilt, as I have one of these BeMicro boards sitting in the Man Cave on the "to-do" projects list).
This modification is required to allow the loader to work. When I have a verified running Max10 I'll post the archived project