SDRAM Driver
cgracey
Here is the SDRAM driver I've been working on.
It took a long time to get this sorted out, since scheduling, functionality, and efficiency all had to be optimized together. I had to come up with a new way to document what was happening concurrently with the instructions. The driver wound up being only 52 longs.
This runs on the DE2-115 platform and uses one of the on-board 32M x 16 SDRAM chips. It treats it as a 16M x 16 device, so that it will be faithful to what the actual Prop2 chip will use. The 16M x 16 part is only $2 in volume, whereas the 32M x 16 is $12.
This driver reads and writes 2^n QUADs at a time. When doing maximal 64 QUAD (1024 byte) transfers, the timing efficiency is ~89%, or 107MB/s out of a theoretical 120MB/s.
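As a quick sanity check of those numbers, here is a small Python sketch. It is not part of the driver. It assumes a 16-bit data bus that moves one word per 60MHz clock, which is where the 120MB/s ceiling comes from. The overhead figure is only inferred from the quoted ~89% efficiency; it is not counted from the code.

```python
# Sanity check of the throughput figures in the post.
# Assumption: one 16-bit SDRAM word moves per 60MHz clock; MB = 10**6 bytes.

CLOCK_HZ = 60_000_000   # clock rate quoted in the post
BYTES_PER_CLOCK = 2     # 16-bit data bus

def theoretical_mb_per_s(clock_hz: int = CLOCK_HZ) -> float:
    """Peak rate if every clock carried a data word."""
    return clock_hz * BYTES_PER_CLOCK / 1e6

def effective_mb_per_s(efficiency: float, clock_hz: int = CLOCK_HZ) -> float:
    """Sustained rate after command, refresh and list-scanning overhead."""
    return theoretical_mb_per_s(clock_hz) * efficiency

def data_clocks(nbytes: int) -> int:
    """Clocks spent purely on data for one burst of nbytes."""
    return nbytes // BYTES_PER_CLOCK

def implied_overhead_clocks(nbytes: int, efficiency: float) -> float:
    """Non-data clocks per burst implied by a given efficiency (inferred, not measured)."""
    data = data_clocks(nbytes)
    return data / efficiency - data

if __name__ == "__main__":
    print(theoretical_mb_per_s())                          # 120.0
    print(round(effective_mb_per_s(0.89)))                 # 107
    print(data_clocks(1024))                               # 512
    print(round(implied_overhead_clocks(1024, 0.89)))      # 63
```

By this arithmetic, the ~11% loss corresponds to roughly 60 non-data clocks per 1024-byte burst. Smaller transfers pay a similar fixed setup cost over fewer data clocks, so they should come out less efficient.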
It requires a new DE2-115 configuration file, which is included here:
Prop2_SDRAM_Driver.zip
Here is the actual driver source (also in the .zip file):
'********************************
'*                              *
'*  Propeller II SDRAM Driver   *
'*    for 16M x 16 devices      *
'*        (FPGA version)        *
'*                              *
'*         Version 0.1          *
'*         9 April 2013         *
'*        by Chip Gracey        *
'*                              *
'********************************
{
SDRAM connections:

  P85 = cke (held high)         Port C
  P84 = cs
  P83 = ras
  P82 = cas
  P81 = we
  P80 = udqm (not used)
  P79 = ldqm (not used)
  P78..P77 = ba[1..0]
  P76..P64 = a[12..0]
  P63..P48 = dq[15..0]          Port B

Note:
  All pin directions have a 2-clock delay
  All pin outputs going to SDRAM have a 3-clock delay, since they are registered at the pin
  All pin inputs coming from SDRAM have a 2-clock delay, since they are registered at the pin

Commands (pairs of longs):

  name          quads   bytes   hub_address (+0, set 2nd)                       sdram_address (+1, set 1st)
  ----------------------------------------------------------------------------------------------------------------------
  rw_1024       64      1024    %xxxx_xxxx_xxxx_xxxA_AAAA_AAAA_AAAA_W111        %xxxx_xxxA_AAAA_AAAA_AAAA_AA00_0000_0000
  rw_512        32      512     %xxxx_xxxx_xxxx_xxxA_AAAA_AAAA_AAAA_W110        %xxxx_xxxA_AAAA_AAAA_AAAA_AAA0_0000_0000
  rw_256        16      256     %xxxx_xxxx_xxxx_xxxA_AAAA_AAAA_AAAA_W101        %xxxx_xxxA_AAAA_AAAA_AAAA_AAAA_0000_0000
  rw_128        8       128     %xxxx_xxxx_xxxx_xxxA_AAAA_AAAA_AAAA_W100        %xxxx_xxxA_AAAA_AAAA_AAAA_AAAA_A000_0000
  rw_64         4       64      %xxxx_xxxx_xxxx_xxxA_AAAA_AAAA_AAAA_W011        %xxxx_xxxA_AAAA_AAAA_AAAA_AAAA_AA00_0000
  rw_32         2       32      %xxxx_xxxx_xxxx_xxxA_AAAA_AAAA_AAAA_W010        %xxxx_xxxA_AAAA_AAAA_AAAA_AAAA_AAA0_0000
  rw_16         1       16      %xxxx_xxxx_xxxx_xxxA_AAAA_AAAA_AAAA_W001        %xxxx_xxxA_AAAA_AAAA_AAAA_AAAA_AAAA_0000
  skip_done     0       0       %0000_0000_0000_0000_0000_0000_0000_0000        %xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
  end_of_list   0       0       %0000_0000_0000_0000_0000_0000_0000_1000        %xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx

The driver scans a list for commands. Commands take two longs and are structured as hub_address + sdram_address.
To have the driver perform a read or write operation on the SDRAM, first set the sdram_address (2nd long) and then
the hub_address (1st long). The driver will set each hub_address long to skip_done (0) when its associated operation
has completed.

Before starting the driver, build the command list structure with pairs of 0's, then terminate it with an 8. When
launching the driver into a cog, point the parameter (S) to the start of the command list.

Command list (longs):

  hub_address (skip_done/rw_xxxx)
  sdram_address
  hub_address (skip_done/rw_xxxx)
  sdram_address
  hub_address (skip_done/rw_xxxx)
  sdram_address
  ...
  end_of_list
}

DAT                     org
'
'
' Initialize SDRAM                                       clocks hub
' ----------------
sdram_driver            getptra list                    '1      -       save command list address
                        reps    #6000,#1                '1      -       repeat instruction for 100us @60MHz
                        mov     pinc,h003FFFFF          '1      -       cke and cs high
                        mov     dirc,h003FFFFF  wz      '6000   -       set SDRAM signals to outputs (100us), Z = 0
'
'
' Scan list for commands
'
:finish         if_z    wrlong  cmd,ptra[-2]            '1      0       if command finished, set it to skip_done (cmd = 0)
                        mov     pinc,h00240400          '1      1       issue 'precharge all' command
                        setb    pinc,#20        wz      '1      2       deselect, Z = 0
:reset          if_z    setptra list                    '1      3       if end_of_list, reset list pointer; satisfy Trp
                        mov     pinc,h00200037          '1      4       issue 'set mode' command
                        setb    pinc,#20                '1      5       deselect, satisfy Tmrd
                        mov     pinc,h00226000          '1      6       issue 'auto refresh' command
                        setb    pinc,#20                '1      7       deselect, satisfy Trfc in 8 more clocks

:next                   rdlong  cmd,ptra++[2]   wz      '3      0..2    check command, point to next
                        setptrb cmd                     '1      3       in case command, set hub address
                if_z    jmp     #:next                  '1/4    4/4..7  if skip_done, next command
                        decod3  cmd             wc      '1      5       decode size bits, C = write
                        and     cmd,#%11111110  wz      '1      6       isolate bits 7..1, Z = 1 if end_of_list
                if_z    jmp     #:reset                 '1/4    7/7..2  if end_of_list, reset list pointer
'
'
' Execute read/write command                             clocks hub     XFR write SDRAM         XFR read SDRAM
' ------------------------------------------------------
                        rdlong  adr,ptra[-1]            '3      0..2    -                       -                       get SDRAM address
                        shl     adr,#7                  '1      3       -                       -                       make 'active' command with bank and row address
                        or      adr,#%1_0011_00         '1      4       -                       -                       (ba[1..0], a[12..0]) = adr[24..10]
                        rol     adr,#15                 '1      5       -                       -
                        mov     pinc,adr                '1      6       -                       -                       issue 'active' command
                        setb    pinc,#20                '1      7       -                       -                       deselect, satisfy Trcd in 1 more clock

                if_c    rdquad  ptrb++                  '1      0       -                       -                       if write, read initial QUADs from hub
                        setbc   :quad,#0                '1      1       -                       -                       set wrquad/rdquad according to read/write
                if_c    setxfr  #%010_011               '1      2       <QUADs_to_16_pins>      -                       if write, configure XFR at hub cycle 2
                        shr     adr,#22+1               '1      3       -                       -                       make blank command with bank and column address
                        and     pinc,h00306000          '1      4       -                       -                       (ba[1..0], a[12..0]) = (adr[24..23], %0000, adr[9..1])
                        or      pinc,adr                '1      5       -                       -
                if_c    mov     dirb,hFFFF0000          '1      6       -                       -                       if write, enable data outputs
                if_nc   setxfr  #%100_011               '1      7                               <16_pins_to_QUADs>      if read, configure XFR at hub cycle 7

                if_nc   xor     pinc,h001A0000          '1      0       -                       -                       if read, issue 'read' command
                if_nc   setb    pinc,#20                '1      1       -                       -                       if read, deselect
                if_c    xor     pinc,h00180000          '1      2       -                       -                       if write, issue 'write' command (aligns with XFR on next clock)
                if_c    setb    pinc,#20                '1      3       output QUAD0 w0         -                       if write, deselect; read: SDRAM sees 'read' command
                        shr     cmd,#1                  '1      4       output QUAD0 w1         -                       get loop count
                if_nc   nop     #7                      '1      5       output QUAD1 w0         -                       if read, pad time
                        '(nop)                          '1      6                               -                       read: SDRAM starts outputting data stream
                        '(nop)                          '1      7                               -
                        '(nop)                          '1      0                               input w0                read: SDRAM data stream begins arriving in XFR
                        '(nop)                          '1      1                               input w1 -> QUAD0
                        '(nop)                          '1      2                               input w0
                        '(nop)                          '1      3                               input w1 -> QUAD1
                        '(nop)                          '1      4                               input w0
:quad                   '(wrquad/rdquad)                '1      5       output QUAD1 w0         input w1 -> QUAD2
                        '(wrquad/rdquad)                '1      6       output QUAD1 w1         input w0
                        '(wrquad/rdquad)                '1      7       output QUAD2 w0         input w1 -> QUAD3
                        wrquad  ptrb++                  '1      0       output QUAD2 w1         input w0                write: read next QUADs; read: write current QUADs
                        djnz    cmd,#:quad              '1      1       output QUAD3 w0         input w1 -> QUAD0       loop for each set of QUADs
                        '(djnz looping)                 '1      2       output QUAD3 w1         input w0
                        '(djnz looping)                 '1      3       output QUAD0 w0 (new)   input w1 -> QUAD1       write: RDQUAD data valid
                        '(djnz looping)                 '1      4       output QUAD0 w1         input w0

                        mov     pinc,h002C0000          '1      2       output QUAD3 w1         -                       issue 'burst terminate' command (aligns with XFR on next clock)
                        mov     dirb,#0         wz      '1      3       -                       -                       cancel data outputs (aligns with 'burst terminate'), Z = 1
                        jmp     #:finish                '4      4..7    -                       -                       finish up, scan for next command
'
'
' Constants
'
h00180000               long    $00180000               'write' command toggle
h001A0000               long    $001A0000               'read' command toggle
h00200037               long    $00200037               'set mode' command (full-page r/w bursts, cas latency = 3)
h00226000               long    $00226000               'auto refresh' command
h00240400               long    $00240400               'precharge all' command
h002C0000               long    $002C0000               'burst terminate' command
h00306000               long    $00306000               'blank' command mask
h003FFFFF               long    $003FFFFF               'control pins mask
hFFFF0000               long    $FFFF0000               'data pins mask
'
'
' Variables
'
list                    res     1                       'command list pointer
cmd                     res     1                       'command
adr                     res     1                       'address
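The command format in the driver's comment block can be exercised off-target. Below is a minimal Python sketch of a host-side model. It assumes, from the "C = write" comment, that W = 1 selects a write. The helper names (`encode_hub_long`, `post_command`, `split_sdram_address` and so on) are hypothetical and are not part of the driver. The sketch does four things:

- packs a hub_address long according to the table;
- enforces size alignment on sdram_address;
- fills each slot in the required order (sdram_address first, then hub_address);
- mirrors how the driver derives bank, row and column.

```python
# Hypothetical host-side model of the command list described in the driver
# comments. Bit layouts follow the comment table; helper names are
# illustrative only. Assumption: W = 1 selects a write.

SKIP_DONE = 0
END_OF_LIST = 0b1000
QUAD_BYTES = 16

def encode_hub_long(hub_addr: int, quads: int, write: bool) -> int:
    """hub_address long: address in bits 16..4, W in bit 3, size code in bits 2..0."""
    if quads not in (1, 2, 4, 8, 16, 32, 64):
        raise ValueError("quads must be a power of two from 1 to 64")
    if hub_addr % QUAD_BYTES or not 0 <= hub_addr < 1 << 17:
        raise ValueError("hub address must be QUAD-aligned and below 128KB")
    size_code = quads.bit_length()        # 1 -> %001 ... 64 -> %111
    return hub_addr | (int(write) << 3) | size_code

def quads_from_hub_long(cmd: int) -> int:
    """Mirror of decod3 followed by shr #1: 2**(code - 1) QUADs, 0 for end_of_list."""
    return (1 << (cmd & 0b111)) >> 1

def check_sdram_address(sdram_addr: int, quads: int) -> None:
    """SDRAM byte address must be aligned to the transfer size and below 32MB."""
    if sdram_addr % (quads * QUAD_BYTES) or not 0 <= sdram_addr < 1 << 25:
        raise ValueError("SDRAM address must be size-aligned and below 32MB")

def split_sdram_address(sdram_addr: int) -> tuple:
    """(bank, row, column) as the driver forms them: ba = adr[24..23],
    row = adr[22..10], column = adr[9..1] (column counts 16-bit words)."""
    return ((sdram_addr >> 23) & 0b11,
            (sdram_addr >> 10) & 0x1FFF,
            (sdram_addr >> 1) & 0x1FF)

def build_command_list(slots: int) -> list:
    """Pairs of zeros (skip_done) terminated by end_of_list (8)."""
    return [SKIP_DONE, 0] * slots + [END_OF_LIST]

def post_command(cmds: list, slot: int, hub_addr: int, sdram_addr: int,
                 quads: int, write: bool) -> None:
    """Set sdram_address (+1) first, then hub_address (+0), as the driver requires."""
    check_sdram_address(sdram_addr, quads)
    cmds[2 * slot + 1] = sdram_addr
    cmds[2 * slot] = encode_hub_long(hub_addr, quads, write)

def is_done(cmds: list, slot: int) -> bool:
    """The driver writes skip_done (0) back to hub_address when finished."""
    return cmds[2 * slot] == SKIP_DONE

if __name__ == "__main__":
    cmds = build_command_list(2)
    post_command(cmds, 0, hub_addr=0x1000, sdram_addr=0x400, quads=64, write=False)
    print([hex(x) for x in cmds])         # ['0x1007', '0x400', '0x0', '0x0', '0x8']
    print(split_sdram_address(0x400))     # (0, 1, 0)
```

With this mapping, a 1024-byte transfer covers exactly one 512-word SDRAM row. That appears to be why rw_1024 requires the low 10 address bits to be zero.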
Tomorrow, I hope to make a video example using this driver. It's good at 60MHz for displays which consume up to 107MB/s of data.
Comments
Thanks!
What are the differences from the old one?
Have you added PS2 and some other board resources?
I just changed the clock generation for the SDRAM.
Could you tell me what you would like to see for the PS2 connections, etc.?
I'm working on a self-contained BASIC interpreter for the P2.
In other words, you could use it like the old home computers.
I hope that tells you all what I need it for!
Sounds neat. I want to make something like that, too.
You will!
I'm halfway done with the (COG) runtime interpreter module.
This is going to be fun...
Yes, the runtime is written in PASM. It will work in any COG, and the idea is to allow multiple instances.
To answer your question from another thread:
what I'm working on is NOT confidential any more.
I just read the driver - very nicely done ... I like it!
For now, we can read a quad, modify it, and write it back - it will work fine.
However, if you are taking requests :-)
For graphics use, it would be handy (and faster for things like line drawing, drawing circles etc) to add:
- read/write long
- read/write word
- potentially even read/write byte (for 8 bit per pixel modes)
so that individual pixels can be read/written from/to a frame buffer.
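This hypothetical Python model makes the cost of the read-modify-write approach concrete. SDRAM is simulated as a bytearray. `read_quad` and `write_quad` stand in for two rw_16 commands; they are not driver APIs. The model assumes little-endian pixels of 1, 2 or 4 bytes, naturally aligned so that a pixel never straddles a QUAD.

```python
# Hypothetical model of drawing one pixel with only whole-QUAD transfers.
# SDRAM is simulated by a bytearray; read_quad/write_quad stand in for a
# pair of rw_16 commands and are not driver APIs. Little-endian pixels assumed.

QUAD_BYTES = 16

def read_quad(sdram: bytearray, addr: int) -> bytearray:
    """Copy the 16-byte QUAD containing addr into a 'hub' buffer."""
    base = addr - addr % QUAD_BYTES
    return bytearray(sdram[base:base + QUAD_BYTES])

def write_quad(sdram: bytearray, addr: int, quad: bytearray) -> None:
    """Write a 16-byte hub buffer back over the QUAD containing addr."""
    base = addr - addr % QUAD_BYTES
    sdram[base:base + QUAD_BYTES] = quad

def set_pixel(sdram: bytearray, fb_base: int, width: int, x: int, y: int,
              color: int, bytes_per_pixel: int = 1) -> int:
    """Read-modify-write one pixel; returns bytes moved over the SDRAM bus."""
    if bytes_per_pixel not in (1, 2, 4) or fb_base % bytes_per_pixel:
        raise ValueError("pixels must be 1, 2 or 4 bytes and naturally aligned")
    addr = fb_base + (y * width + x) * bytes_per_pixel
    offset = addr % QUAD_BYTES
    quad = read_quad(sdram, addr)                      # SDRAM -> hub
    quad[offset:offset + bytes_per_pixel] = color.to_bytes(bytes_per_pixel, "little")
    write_quad(sdram, addr, quad)                      # hub -> SDRAM
    return 2 * QUAD_BYTES
```

In this model, every pixel moves 32 bytes across the SDRAM bus and makes two trips through the command list, all to change at most 4 bytes. Native byte/word/long commands would remove that per-pixel overhead.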
If you don't have time to add it, obviously you can leave it as an exercise for the readers.
Now I am going back to analyzing your code... and trying it.
Edit:
I really like the way you documented every cycle, even ones that did not have associated instructions. I think I will start to use the same (or similar) style in my pasm2 code.
Agreed! Now, if we could only get the editor to show that (clocks and hub window) automatically!
It will be really fun to see how things develop further in this area and how people start architecting their code to allow combined use of SDRAM for both hires graphics frame buffers and LMM code for example and what caching schemes are used etc. That is what I've always been personally interested in, something that is capable of providing a nice big flat memory space for an LMM VM to access for both its application and some graphics memory buffers. It seems it could soon be a reality and I look forward to seeing just how high the graphics resolutions and bit depths will realistically get with all this potential SDRAM bandwidth once the VM starts to compete for access to the same memory. We'll have to see where things go from here and which partitioning approaches best cut it.
Very interesting times ahead...
Oh yes, yes, yes, yes, yes,
goody, goody, goody
(happy days are here again.......)
sigh
Dave
It should run fine on the DE0-Nano. The only problem, of course, is that it would have to be altered to do more than just serve SDRAM. You'll need to keep a single-threaded model, unless you can figure out what timeslots can be consistently forfeited. Whooo, there's a challenge!
...
Now that I'm thinking about it, there are no time slots available, except for 7 cycles during a read, which can't be relied upon. The way to do this would be to open up some multiple of 8 cycles for other threads inside the command-fetch loop.
I know, we need byte/word/long access, as well, especially for drawing graphics. I'm looking into it...
The most sensible solution would be to have a tile cache in main memory, then do a BLIT to SDRAM. I assume that a 64 Quad transfer is the most efficient, so that would be the natural size. The problem comes from doing operations across the memory boundaries, but perhaps putting boundary detection into the primitives would give you a good hook to save/load blocks on demand.
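The block cache with boundary detection might look like the following Python sketch. It is an illustration only. `BlockCache`, `load` and `store` are hypothetical stand-ins for rw_1024 reads and writes. The block size is assumed to be 1024 bytes because that is the largest single transfer. There is no eviction policy; real code, constrained by hub RAM, would need one.

```python
# Hypothetical hub-side block cache with boundary detection. Drawing works on
# 1024-byte blocks held in hub RAM; modified blocks are written back as whole
# rw_1024-sized transfers. load/store are stand-ins; no eviction, for brevity.

BLOCK = 1024  # bytes per rw_1024 transfer (64 QUADs, one SDRAM row)

def blocks_touched(start: int, length: int) -> list:
    """1024-byte-aligned block addresses covering [start, start + length)."""
    if length <= 0:
        return []
    first, last = start // BLOCK, (start + length - 1) // BLOCK
    return [b * BLOCK for b in range(first, last + 1)]

class BlockCache:
    def __init__(self, load, store):
        self.load, self.store = load, store   # load(base) -> bytes; store(base, data)
        self.blocks = {}                      # block base address -> bytearray
        self.dirty = set()

    def _block(self, addr: int) -> tuple:
        base = addr - addr % BLOCK            # boundary detection: which block?
        if base not in self.blocks:           # miss: fetch the whole block once
            self.blocks[base] = bytearray(self.load(base))
        return base, self.blocks[base]

    def write_byte(self, addr: int, value: int) -> None:
        base, buf = self._block(addr)
        buf[addr - base] = value
        self.dirty.add(base)

    def flush(self) -> None:
        """Write every modified block back with one full-block transfer each."""
        for base in sorted(self.dirty):
            self.store(base, bytes(self.blocks[base]))
        self.dirty.clear()
```

Routing all drawing through whole-block transfers keeps the SDRAM side at its most efficient transfer size. The cost is hub RAM for the cached blocks and a fetch on every first touch of a block.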
I seem to recall you stating that at high color depths and resolutions (e.g. 1920x1080) there wasn't any memory bandwidth left to draw into the buffer.
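The bandwidth question is easy to estimate. This Python sketch makes three assumptions:

- the ~107MB/s sustained figure from the first post is the available budget;
- the frame rate is 60Hz (no frame rate is stated in the thread);
- demand is averaged over the whole frame with blanking ignored, so real demand during active video peaks higher.

```python
# Rough display-bandwidth budget. Assumptions: ~107MB/s sustained SDRAM
# throughput (the figure quoted for 1024-byte transfers at 60MHz), a 60Hz
# frame rate, and averaging over the whole frame (blanking ignored, so the
# instantaneous demand during active video is higher than this).

BUDGET_MB_S = 107.0

def display_mb_per_s(width: int, height: int, bits_per_pixel: int,
                     refresh_hz: float) -> float:
    """Average MB/s (10**6 bytes) needed just to scan the frame buffer out."""
    return width * height * bits_per_pixel / 8 * refresh_hz / 1e6

def drawing_headroom_mb_per_s(width, height, bits_per_pixel, refresh_hz,
                              budget=BUDGET_MB_S):
    """What is left for drawing; negative means the mode cannot be sustained."""
    return budget - display_mb_per_s(width, height, bits_per_pixel, refresh_hz)

if __name__ == "__main__":
    for mode in [(640, 480, 8), (1024, 768, 16), (1920, 1080, 8)]:
        print(mode, round(drawing_headroom_mb_per_s(*mode, 60), 1))
```

For example, 1920x1080 at 8 bits per pixel and 60Hz already needs about 124MB/s just for scan-out. That exceeds the whole budget before any drawing happens.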
It seems to me that to save cycles, the tile buffer needs to be in COG memory, to avoid hub access.
Well, the SDRAM can read/write 16 bits per clock, which is the hub transfer rate (or 128 bits every 8 clocks for RDQUAD/WRQUAD). So, SDRAM data can move between pins, a server cog, the hub, and other cogs, all at the same rate.
I think it's cleanest to just make a generic SDRAM server, thereby decoupling any video dependency from it, and vice-versa. There is no speed advantage to be had, though saving a cog could be important.
Saving a COG on the DE0-Nano might be particularly important! :-)
So true !
Would we also need a new FPGA configuration for the DE0-Nano to make the SDRAM driver work?
If so, it would be a waste of time for me to try to get it working on a DE0.
Andy
I'm compiling a new file for the DE0-Nano that will make it work like the DE2-115. I'll post it in 20 minutes, or so, once it's done.
DE0_Nano_Prop2.zip