DDR3 - The quest to 128 MBytes
Ale
Hi all,
I want DDR3, want it, want it, want it. Did I say that I want it? No? Well... I do.
Here I'll post my progress.
Testbench:
BeMicroCV. Cyclone V, DDR3 chip: Micron MT41J64M16JT-15E here: http://www.micron.com/parts/dram/ddr3-sdram/mt41j64m16jt-15e?source=ps
Code: my own cog, Ale's Cog, just because I know it and can modify it to suit this test. It runs normal cog assembler, with one difference: a hardware UART.
I'll post the code to my GitHub account as soon as I have found my GitHub password again ;-)
The idea is to get it to work. I have compiled the core before but didn't have a suitable environment to test it. Now that the environment works well, I can go on with the DDR3 config itself.
Note on speed and bandwidth:
The board's docs don't really say much more than that it "works", so let's take their word for it at 300 MHz. The chip is a CL 9 chip, meaning we have 9 clocks from command to answer, plus all the overhead and so on, so something like 20 clocks per transfer... I'm sure it can be integrated as a HUB peripheral. The hub waits up to 16 clocks at 80 MHz anyway, so maybe it fits somehow.
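A rough sanity check of those numbers, as a minimal sketch. It only uses the figures quoted above (about 20 memory clocks per transfer at 300 MHz, up to 16 hub clocks at 80 MHz); the 20-clock figure is a guess from the post, not a measured value, and the names are made up for illustration.

```python
# Rough latency check using only the figures quoted in the post.
MEM_CLOCK_HZ = 300e6       # DDR3 memory clock
HUB_CLOCK_HZ = 80e6        # hub/cog clock
CLOCKS_PER_TRANSFER = 20   # guess from the post: CL 9 plus overhead
HUB_WAIT_CLOCKS = 16       # worst-case hub wait, in 80 MHz clocks


def clocks_to_ns(clocks, freq_hz):
    """Convert a number of clock cycles at freq_hz into nanoseconds."""
    return clocks / freq_hz * 1e9


transfer_ns = clocks_to_ns(CLOCKS_PER_TRANSFER, MEM_CLOCK_HZ)
hub_wait_ns = clocks_to_ns(HUB_WAIT_CLOCKS, HUB_CLOCK_HZ)

print(f"one transfer: about {transfer_ns:.1f} ns")
print(f"hub wait window: about {hub_wait_ns:.1f} ns")
print("fits in the hub window" if transfer_ns < hub_wait_ns else "does not fit")
```

This ignores everything the controller adds on top (arbitration, refresh, clock-domain crossing), so it is only an upper bound on how comfortable the fit could be.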
Controller setup:
- Hard external memory controller: select the chip from the list on the right, 50 MHz as base clock, 300 MHz memory clock, and Avalon at full speed.
- I'll use a 32-bit port to read and write.
- Compiling the thing takes forever, welcome to the slow lane.
Edit:
I got it to compile. It costs about 800 extra ALMs, but it did something strange with the DQS pins during assignment...
Comments
Here is the cog's code. VSCL is used as the UART interface.
So instead of DDR3 I am starting with the DE0-Nano with regular SDRAM. I was looking at it last night, and even there the memory timing is tight. To fit in with standard hub read timing once locked to the hub, you need to be able to latch the result back into a COG register within 5 COG clocks of latching the first read S parameter of, say, RDLONG D,S, which holds the address. In the case of SDRAM that becomes 10 memory clocks, as I plan to run the SDRAM at 2x the Prop clock (e.g. 144 MHz and 72 MHz). In theory it should just fit, because the SDRAM on that board will return the last of the 32 bits on the 7th clock after the ACTIVE command is issued. There might be one or two registers in the pipeline that delay it further, but I have 3 memory clocks left.
In your case with DDR3, if you run your memory at, say, 3x the Prop clock (100 MHz COGs, 300 MHz memory) you'll have 15 DDR3 clocks to get the results back in time. I think that is beyond what the DDR3 on the BeMicroCV can do: according to the Micron device datasheet its CAS latency is 9 and its tRCD is 9, so your first 16-bit result won't be back until 18 clocks after you issue the read. I think you might want to look at running the device at 4x the Prop clock (75 MHz COGs, 300 MHz memory) to give yourself 20 memory clocks for returning the result in time. It's either that or add more M4 wait states to the external memory reads, which is possible but not as nice for performance/timing compatibility with existing code.
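The budget arithmetic above can be written down as a small sketch. The numbers are the ones from this post (5 COG clocks available, SDRAM data complete on the 7th clock, DDR3 tRCD 9 plus CL 9); the function names are hypothetical, and pipeline registers or controller overhead are deliberately left out.

```python
# Memory-clock budget vs. read latency, using the figures from the post.
COG_CLOCKS_AVAILABLE = 5   # COG clocks between latching S and latching the result

SDRAM_LATENCY = 7          # last of the 32 bits arrives on the 7th clock after ACTIVE
DDR3_T_RCD = 9             # from the Micron datasheet, as quoted in the post
DDR3_CAS_LATENCY = 9
DDR3_LATENCY = DDR3_T_RCD + DDR3_CAS_LATENCY   # 18 clocks to the first 16-bit word


def memory_clock_budget(ratio):
    """Memory clocks available when memory runs at `ratio` x the Prop clock."""
    return COG_CLOCKS_AVAILABLE * ratio


def slack(ratio, latency):
    """Spare memory clocks; a negative value means the data arrives too late."""
    return memory_clock_budget(ratio) - latency


cases = [
    ("SDRAM at 2x", 2, SDRAM_LATENCY),
    ("DDR3 at 3x", 3, DDR3_LATENCY),
    ("DDR3 at 4x", 4, DDR3_LATENCY),
]
for name, ratio, latency in cases:
    print(f"{name}: budget {memory_clock_budget(ratio)}, "
          f"latency {latency}, slack {slack(ratio, latency)}")
```

A negative slack is the case where extra M4 wait states would be needed instead of a higher clock ratio.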
Looking forward to hearing how it goes.
Success! I was able to re-enter the pins in the pin planner and it compiled!
Now I have to write a simple mem test and see what happens. Wish me luck!
I think it is time I read the hard memory controller manual. Guesswork brought me this far ;-) hahaha
Yes, and that is what I am using... or trying to, anyway... I think the problem lies in the PLL/reset part... I am reading some docs but haven't found what I need yet. This is not for beginners like me... there is only experiment...
As I said above, it uses my version of a cog, not Chip's.
log file
Is this one working?
It is a NIOS-II with DDR3 example.
This one, the one I posted before, is quite nice because it is not only a working DDR3 example, it also has SignalTap. It is here: http://www.alterawiki.com/uploads/1/1c/QuartusII_projects.zip
I hope I can make some progress...
in c:\altera\15.0\nios2eds\nios2_command_shell
and from there the nios2-terminal.
- Compile the example
- Load it to the BeMicroCV A2
- Run nios2-terminal
- Enjoy blinking LEDs
This is a QSys example. It would be great if
- we could build our own QSys components. A P1V would be great, and the HUB would be DDR3 DRAM. I have no idea how much time you get from request to data, but if we have (at least) 8 cycles at 80 MHz, that means 100 ns, and I think we should be getting data in this time.
It would mean an Avalon-enabled P1V, whatever that exactly means (I have only a rough idea, which is improving while I read the document posted below).
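To put that 100 ns window into memory clocks, here is a minimal sketch. It assumes the 300 MHz memory clock from the controller setup earlier in the thread; the variable names are just for illustration.

```python
# How many 300 MHz memory clocks fit into 8 hub cycles at 80 MHz?
HUB_CLOCK_HZ = 80_000_000
MEM_CLOCK_HZ = 300_000_000   # assumed: memory clock from the controller setup
HUB_CYCLES = 8               # "at least" 8 cycles, per the post

window_ns = HUB_CYCLES * 1e9 / HUB_CLOCK_HZ
mem_clocks_in_window = HUB_CYCLES * MEM_CLOCK_HZ // HUB_CLOCK_HZ

print(f"window: {window_ns:.0f} ns, or {mem_clocks_in_window} memory clocks")
```

Whether the Avalon interface and the hard controller leave enough of those clocks for the actual read is something only a measurement (for example with SignalTap) can tell.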
Happy coding !
Link to a couple of useful pdfs: JTAG-UART
Using Terminals
I found it! Making QSys components: it wasn't that difficult to find...
Your efforts will pay off nicely if you can get it going.
I've been working on a similar concept but with regular SDRAM, not DDR3, in order to keep things simpler. On paper at least, I found you can just about manage to get the data back in a single hub cycle for one or two COGs when running memory at 144 MHz (e.g. for SDRAM on the DE0-Nano or BeMicro MAX10) and the P1V at 72 MHz. The problem is you basically don't know the memory address to read/write until near the end of the second P1V CPU cycle, really just leaving you with 4 more CPU clocks before your data needs to be ready to latch if you use the regular ALU path (or possibly 5 if you bypass it). So latency will likely become the major issue, depending on propagation delay, DRAM/FPGA setup time, etc. I also think there is a possibility of inserting an extra M4 cycle to give you one extra clock (i.e. 8 clocks instead of 7 in the hub) and still fit in the existing overall hub cycle timing window, which would be ideal for LMM and other deterministic applications.
So far I've started planning some Verilog state machine stuff to try to do this but still have to figure out a few more things like SDRAM initialization and FPGA clock generation etc which will no doubt take me a while to nail it all.
I'll have to keep an eye out if you get your Nios approach working with the P1V. It's likely there's a Nios driver for regular SDRAM lying about somewhere so it might still be an option for me on the DE0-Nano if it can access external memory in a single hub cycle otherwise I'm still gonna have to do my own thing. I'm reasonably confident what we want is doable in principle, it's just a matter of time and effort and also whether the board layout and FPGA timing allows the maximum memory speed to be attained.
Cheers,
rogloh
I am trying to understand the trace given by SignalTap... and I really seem to be missing something. The problem is I don't really know how this whole "bus traffic generator" works.
I see that writes and reads are issued... but I do not see the reads getting the written data back...
It would be great if someone could help a bit here. Anyone?
Here is the Avalon handbook.
I wish I could offer help, but I just came to the FPGA party. (Very recently, I bought a BeMicroCV, a BeMicroCVA9, and a LatticeXP Bravia2). I have loaded the Propeller 1V into the BeMicroCV and am interested in using it with the DDR3 Ram. I see that you have been working on this for quite some time.
I do suspect that the DDR3 RAM will demand quite a bit more power. Has that been taken care of?
For now, I should just follow along and learn. At least I do understand most of what you are trying to do. I am not very clear on Ale's Cog image, which includes a hardware UART. So I will poke around and try to catch up.
It does seem that your approach of clocking the DDR3 at 300 MHz is conservative. I have read that some of the Cyclone V E devices might clock at 400 MHz and all do at least 333 MHz. Maybe I missed something.
I'll be quiet and just lurk.
The 2 examples that I found also use 300 MHz, and the BeMicro handbook says so somewhere. Maybe it can be pushed to 400, no idea, but for testing I'd go even lower if necessary. My first step is to replace the traffic generator with something I made myself (my cog, or a P1V, or something) and to measure turnaround times. I think those times will tell us what we can expect from this memory pool.
If it is too slow, which I expect, it means that we may have to make a dummy read, after which we get, say, 2 or 4 longs of data, and lock it to a specific cog. One can configure up to 6 bus masters, so the arbitration is not in your hands, but that will need a bit of work.