SDRAM caching - what's best? — Parallax Forums

cgracey Posts: 14,206
edited 2013-03-30 16:33 in Propeller 2
What are your guys' thoughts on how the SDRAM-controller cog should handle other cogs' memory requests?

Do you think the protocol should be for a fixed cache line read/write of, say a quad, four quads, or maybe 16 quads?

What about single-byte writing? Is it important?

There may be value in fixing the transaction size. This would get rid of the UDQM/LDQM pins that mask bytes for writing. Also, the SDRAM really performs fast when doing reads and writes of many contiguous words. Every time you change addresses or switch directions, there is a several-clock penalty. It's most efficient doing string operations, not atomic ones. Atomic operations are sometimes required by the software, though.

Comments

  • David Betz Posts: 14,516
    edited 2013-03-21 15:24
    cgracey wrote: »
    What are your guys' thoughts on how the SDRAM-controller cog should handle other cogs' memory requests?

    Do you think the protocol should be for a fixed cache line read/write of, say a quad, four quads, or maybe 16 quads?

    What about single-byte writing? Is it important?

    There may be value in fixing the transaction size. This would get rid of the UDQM/LDQM pins that mask bytes for writing. Also, the SDRAM really performs fast when doing reads and writes of many contiguous words. Every time you change addresses or switch directions, there is a several-clock penalty. It's most efficient doing string operations, not atomic ones. Atomic operations are sometimes required by the software, though.
    The current cache drivers transfer 64 bytes at a time but that can be changed if necessary.
  • David Betz Posts: 14,516
    edited 2013-03-21 15:30
    I guess I had thought we would go with a pseudo-DMA approach, where an SDRAM client would pass a message to the SDRAM COG, probably through a mailbox, containing something like the following:
    function code: READ or WRITE
    SDRAM address
    hub address
    byte count (or maybe long count?)
    
    To be a little fancier and to allow a cache line fill with a single request it might be nice to be able to pass an SDRAM address and two hub addresses, one to write and the other to read. That would allow a dirty cache line to be replaced with new data with a single request.

    Even more ambitious would be a scatter-gather scheme where it would be possible to pass a list of requests each of which was in the form of the one I mentioned first.
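
A minimal sketch of the request format described above, modelled in Python rather than PASM. The field names, the `READ`/`WRITE` codes, and the `service` function are illustrative assumptions, not an existing driver interface; SDRAM and hub RAM are stood in for by bytearrays.

```python
from dataclasses import dataclass

READ, WRITE = 0, 1  # hypothetical function codes


@dataclass
class Request:
    func: int        # READ (SDRAM -> hub) or WRITE (hub -> SDRAM)
    sdram_addr: int  # byte address in external SDRAM
    hub_addr: int    # byte address in hub RAM
    count: int       # number of bytes to move


def service(requests, sdram, hub):
    """Process a list of requests in order (the scatter-gather case)."""
    for r in requests:
        # Reject transfers that would run off the end of either memory.
        if r.sdram_addr + r.count > len(sdram) or r.hub_addr + r.count > len(hub):
            raise ValueError("request out of range")
        s = slice(r.sdram_addr, r.sdram_addr + r.count)
        h = slice(r.hub_addr, r.hub_addr + r.count)
        if r.func == READ:
            hub[h] = sdram[s]
        elif r.func == WRITE:
            sdram[s] = hub[h]
        else:
            raise ValueError(f"unknown function code {r.func}")
```

In this model a single request is just a one-entry list, and replacing a dirty cache line can be expressed as a two-entry list: a `WRITE` of the old line followed by a `READ` of the new one.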
  • Bill Henning Posts: 6,445
    edited 2013-03-21 15:31
    I think that we need to keep UDQM/LDQM for byte writes; otherwise all byte writes will turn into read-modify-writes, making things even slower.

    Burst accesses are definitely preferred; especially long bursts when possible.
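
The byte-write trade-off above can be illustrated with a small model of a 16-bit-wide memory. This is only a sketch: the function names are made up for illustration, and it assumes little-endian byte lanes (odd byte address = upper byte) and the usual SDRAM convention that a DQM line held high masks its byte during a write.

```python
def write_word(mem, index, value, udqm=0, ldqm=0):
    """Write a 16-bit word; a DQM line held high masks (skips) its byte."""
    keep = (0xFF00 if udqm else 0) | (0x00FF if ldqm else 0)
    mem[index] = (mem[index] & keep) | (value & ~keep & 0xFFFF)


def byte_write_masked(mem, byte_addr, byte):
    """With UDQM/LDQM: one write, masking the lane that must not change."""
    upper = byte_addr & 1
    write_word(mem, byte_addr >> 1, byte << (8 if upper else 0),
               udqm=0 if upper else 1, ldqm=1 if upper else 0)
    return 1  # transactions used


def byte_write_rmw(mem, byte_addr, byte):
    """Without DQM control: read the word, merge the byte, write it back."""
    index, upper = byte_addr >> 1, byte_addr & 1
    word = mem[index]                      # transaction 1: read
    if upper:
        word = (word & 0x00FF) | (byte << 8)
    else:
        word = (word & 0xFF00) | byte
    write_word(mem, index, word)           # transaction 2: write
    return 2
```

Both paths leave the same data in memory; the difference is one SDRAM transaction versus two, each of which pays the address-setup penalty.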

    Can you post the documentation for the SETXFR instruction? And anything else relevant to it (i.e., configuring RAS, CAS latency, etc.)?

    I.e., where it transfers to/from in the cog, whether it can span pages in the SDRAM, etc.

    Edit:

    Ideally with sample code for:

    - reading a byte/word
    - writing a byte/word
    - reading N words as a burst
    - writing N words as a burst

    With the above samples, it will be possible to start experimenting with large bitmap drivers, xmm drivers...

    cgracey wrote: »
    What are your guys' thoughts on how the SDRAM-controller cog should handle other cogs' memory requests?

    Do you think the protocol should be for a fixed cache line read/write of, say a quad, four quads, or maybe 16 quads?

    What about single-byte writing? Is it important?

    There may be value in fixing the transaction size. This would get rid of the UDQM/LDQM pins that mask bytes for writing. Also, the SDRAM really performs fast when doing reads and writes of many contiguous words. Every time you change addresses or switch directions, there is a several-clock penalty. It's most efficient doing string operations, not atomic ones. Atomic operations are sometimes required by the software, though.
  • jazzed Posts: 11,803
    edited 2013-03-21 15:31
    Hi Chip.

    I've used 32 byte bursts for the SDRAM cache driver in P1 Propeller-GCC here:
    https://code.google.com/p/propgcc/source/browse/loader/spin/sdram_cache.spin

    That driver is for a byte-wide device that uses an address latch - address setup runs like a turtle, but data runs like a rabbit. The caching mechanism uses dirty bits and write-back methodology. It's possible to use bigger buffers with a better COG/HUB interface like in P2.
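
For readers unfamiliar with the dirty-bit, write-back scheme mentioned above, here is a minimal direct-mapped model in Python. It is a sketch of the general technique only, not a transcription of the propgcc driver; the class name, the direct-mapped layout, and the `bursts` counter are assumptions made for illustration.

```python
LINE = 32  # bytes per line, matching the burst size mentioned above


class WriteBackCache:
    def __init__(self, backing, num_lines=4):
        self.backing = backing              # bytearray standing in for SDRAM
        self.num_lines = num_lines
        self.tags = [None] * num_lines      # line address held by each slot
        self.dirty = [False] * num_lines
        self.data = [bytearray(LINE) for _ in range(num_lines)]
        self.bursts = 0                     # line-sized SDRAM transfers so far

    def _slot(self, addr):
        line_addr = addr // LINE
        slot = line_addr % self.num_lines
        if self.tags[slot] != line_addr:    # miss
            if self.dirty[slot]:            # write the evicted line back first
                old = self.tags[slot] * LINE
                self.backing[old:old + LINE] = self.data[slot]
                self.bursts += 1
            base = line_addr * LINE         # then fill from the backing store
            self.data[slot][:] = self.backing[base:base + LINE]
            self.tags[slot] = line_addr
            self.dirty[slot] = False
            self.bursts += 1
        return slot

    def read(self, addr):
        return self.data[self._slot(addr)][addr % LINE]

    def write(self, addr, value):
        slot = self._slot(addr)
        self.data[slot][addr % LINE] = value & 0xFF
        self.dirty[slot] = True             # SDRAM is updated only on eviction
```

Byte writes land in the cached line and cost no SDRAM traffic until the line is evicted, which is why single-byte access matters less once a cache sits in front of the SDRAM.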

    If we don't have to manipulate the QMs, that's great. The fewer transactions, the better.

    Single byte access would be important without a cache. I don't care for all that setup just for doing one byte though.
  • Roy Eltham Posts: 3,000
    edited 2013-03-21 15:42
    I think it would be best to have the option for byte level access, and just document the performance considerations.
  • cgracey Posts: 14,206
    edited 2013-03-21 15:45
    jazzed wrote: »
    Single byte access would be important without a cache. I don't care for all that setup just for doing one byte though.

    It takes, if I recall right, something like 10 clocks to just write a single byte or word. Extra words take 1 clock each, though.
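
As a rough worked example of what those figures imply, the sketch below takes the recalled "about 10 clocks for the first word, 1 clock per extra word" at face value. The constant and function names are assumptions for illustration, and the real setup cost depends on the SDRAM timing parameters.

```python
SETUP_CLOCKS = 10  # approximate figure quoted above for the first word


def burst_clocks(words):
    """Clocks for one transaction of `words` contiguous 16-bit words."""
    if words < 1:
        raise ValueError("need at least one word")
    return SETUP_CLOCKS + (words - 1)


def clocks_per_word(words):
    """Average cost per word once the setup is amortized over the burst."""
    return burst_clocks(words) / words
```

Under these assumptions a single word costs 10 clocks, while a 32-word burst costs 41 clocks, or roughly 1.28 clocks per word.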

    BTW, I'm working on the XFR doc's right now, as we need them badly.

    In the interim, here's the SDRAM data sheet for anyone who's interested:

    http://www.winbond.com/NR/rdonlyres/16898431-2772-4CEA-8474-7E4AA855555F/0/W9825G6JH.pdf

    It took me a few hours to get a handle on this the first time. I need to study it again, because I don't remember the specifics. It's no walk in the park.
  • jazzed Posts: 11,803
    edited 2013-03-21 16:33
    I'm hoping the SDRAM clock is separate from the other counter outputs. If the other counters are sufficiently complex we could use them for CAS/RAS refresh instead of a refresh loop.

    It would be really nice to access SDRAM and have an interpreter running in a single COG. If that makes sense, then you need all kinds of accesses. It's software controlled anyway, so any access is possible although with penalty of being generic of course. As Bill said the QM bits are necessary for byte wide access. When pulled down, they won't need to be diddled with burst access - fortunately, the P2 has pull up/down built-in.

    Some of us have discussed different methods of sharing cache space among COGs for quick multiple COG access. We don't have it all figured out yet, but having multiple COG access to memory without a lot of per-transaction blocking seems important enough for more serious thought.
  • Cluso99 Posts: 18,069
    edited 2013-03-21 16:35
    For emulations and interpreters, byte reads and writes will be required. I will be doing a PCB with SRAM for these, but if your PCB handles this it would be a good base. 10 clocks at 200 MHz is not that bad, btw (50 ns).

    I would recommend connecting all pins for now so we can experiment. A microSD socket is missing, though.

    (sorry about missing spaces - on my Xoom)
  • Rayman Posts: 14,755
    edited 2013-03-21 18:17
    Was just thinking that a couple things I've been waiting to do on P2 don't need fancy graphics...
    In that case, I'd want the SDRAM driver to just do LMM memory calls and I'd use another cog and HUB for graphics memory...
  • David Betz Posts: 14,516
    edited 2013-03-21 18:25
    Rayman wrote: »
    Was just thinking that a couple things I've been waiting to do on P2 don't need fancy graphics...
    In that case, I'd want the SDRAM driver to just do LMM memory calls and I'd use another cog and HUB for graphics memory...
    We can certainly provide more than one cache driver, one of which controls the SDRAM directly for faster C performance when the SDRAM isn't needed for graphics.
  • cgracey Posts: 14,206
    edited 2013-03-21 18:40
    Here are the latest doc's which now cover XFR:

    Prop2_Docs.txt

    Here's the XFR part:
    PIN TRANSFER
    ------------
    
    Each cog has a pin transfer circuit (XFR) which can automatically move data between pins
    and QUADs or from pins to stack RAM, in the background, while instructions execute normally.
    
    XFR is configured with the SETXFR instruction:
    
        SETXFR  D/#n    - Set XFR configuration to %MMM_PPP
    
              %MMM = mode
    
                     %00x = off (initial state after cog start)
                     %010 = QUADs_to_16_pins
                     %011 = QUADs_to_32_pins
                     %100 = 16_pins_to_QUADs
                     %101 = 32_pins_to_QUADs
                     %110 = 16_pins_to_stack
                     %111 = 32_pins_to_stack
    
              %PPP = pin group
    
                    %000 = pins 15..0  for 16-pin modes, pins 31..0  for 32-pin modes
                    %001 = pins 31..16 for 16-pin modes, pins 31..0  for 32-pin modes
                    %010 = pins 47..32 for 16-pin modes, pins 63..32 for 32-pin modes
                    %011 = pins 63..48 for 16-pin modes, pins 63..32 for 32-pin modes
                    %100 = pins 79..64 for 16-pin modes, pins 95..64 for 32-pin modes
                    %101 = pins 95..80 for 16-pin modes, pins 95..64 for 32-pin modes
                    %11x = no pins
    
    
    For QUADs_to_16_pins mode (%010), on the cycle after SETXFR is executed, the following
    8-clock pattern begins and then repeats indefinitely:
    
        1st clock: QUAD0 low word is output to pins
        2nd clock: QUAD0 high word is output to pins
        3rd clock: QUAD1 low word is output to pins
        4th clock: QUAD1 high word is output to pins
        5th clock: QUAD2 low word is output to pins
        6th clock: QUAD2 high word is output to pins
        7th clock: QUAD3 low word is output to pins
        8th clock: QUAD3 high word is output to pins
    
    This mode is useful for coordinating with a 'RDQUAD PTRx++' instruction so that a
    continuous stream of words from hub memory can be output to an SDRAM's DQ pins. This
    enables SDRAM writing at the cog's hub bandwidth limit.
    
    
    For QUADs_to_32_pins mode (%011), on the cycle after SETXFR is executed, the following
    4-clock pattern begins and then repeats indefinitely:
    
        1st clock: QUAD0 is output to pins
        2nd clock: QUAD1 is output to pins
        3rd clock: QUAD2 is output to pins
        4th clock: QUAD3 is output to pins
    
    
    For 16_pins_to_QUADs mode (%100), on the cycle after SETXFR is executed, the following
    8-clock pattern begins and then repeats indefinitely:
    
        1st clock: pins are sampled as low word
        2nd clock: pins are sampled as high word, long is written to QUAD0
        3rd clock: pins are sampled as low word
        4th clock: pins are sampled as high word, long is written to QUAD1
        5th clock: pins are sampled as low word
        6th clock: pins are sampled as high word, long is written to QUAD2
        7th clock: pins are sampled as low word
        8th clock: pins are sampled as high word, long is written to QUAD3
    
    This mode is useful for coordinating with a 'WRQUAD PTRx++' instruction so that a
    continuous stream of words input from an SDRAM's DQ pins can be written to hub memory.
    This enables SDRAM reading at the cog's hub bandwidth limit.
    
    
    For 32_pins_to_QUADs mode (%101), on the cycle after SETXFR is executed, the following
    4-clock pattern begins and then repeats indefinitely:
    
        1st clock: pins are sampled and written to QUAD0
        2nd clock: pins are sampled and written to QUAD1
        3rd clock: pins are sampled and written to QUAD2
        4th clock: pins are sampled and written to QUAD3
    
    
    For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
    2-clock pattern begins and then repeats indefinitely:
    
        1st clock: pins are sampled as low word
        2nd clock: pins are sampled as high word, long is written to stack at SPA++
    
    
    For 32_pins_to_stack mode (%111), on the cycle after SETXFR is executed, the following
    1-clock pattern begins and then repeats indefinitely:
    
        1st clock: pins are sampled and written to stack at SPA++
    
    
    The pins_to_stack modes are useful for streaming SDRAM data into stack RAM for video
    displays. While a pins_to_stack mode is active, you should not read or write stack RAM
    or modify SPA, as such attempts will likely cause unexpected results. You will need to
    do a 'SETSPA D/#n' instruction before starting a pins_to_stack mode.
    
    To stop XFR, execute 'SETXFR #0'.
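
To make the %MMM_PPP layout in the documentation above concrete, here is a small Python sketch that builds the 6-bit configuration value and decodes which pins a given value selects. The helper names are hypothetical; only the bit fields and the mode and pin-group tables are taken from the text above.

```python
MODES = {
    "off": 0b000,
    "QUADs_to_16_pins": 0b010,
    "QUADs_to_32_pins": 0b011,
    "16_pins_to_QUADs": 0b100,
    "32_pins_to_QUADs": 0b101,
    "16_pins_to_stack": 0b110,
    "32_pins_to_stack": 0b111,
}


def setxfr_value(mode, pin_group):
    """Build the %MMM_PPP value: mode in bits 5..3, pin group in bits 2..0."""
    if not 0 <= pin_group <= 0b111:
        raise ValueError("pin group must fit in 3 bits")
    return (MODES[mode] << 3) | pin_group


def pins_for(value):
    """Return (high_pin, low_pin) for a configuration, or None if no pins."""
    mode, group = (value >> 3) & 0b111, value & 0b111
    if mode < 0b010 or group >= 0b110:
        return None                  # XFR off, or a 'no pins' group
    if mode & 1:                     # odd mode numbers are the 32-pin variants
        low = (group >> 1) * 32
        return (low + 31, low)
    low = group * 16
    return (low + 15, low)
```

For example, `setxfr_value("16_pins_to_QUADs", 0b010)` gives `0b100_010`, which `pins_for` decodes as pins 47..32.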
    
  • David Betz Posts: 14,516
    edited 2013-03-21 19:00
    cgracey wrote: »
    Here are the latest doc's which now cover XFR:

    Prop2_Docs.txt

    Thanks Chip! That looks like a very powerful feature!
  • cgracey Posts: 14,206
    edited 2013-03-21 19:22
    David Betz wrote: »
    Thanks Chip! That looks like a very powerful feature!

    I don't know why it's so hard for me to write documentation. This feature was pretty simple to write about after I tried a few different approaches. It had seemed complicated for months to me.
  • Bill Henning Posts: 6,445
    edited 2013-03-21 19:36
    Thanks Chip, nice way of doing burst transfers!
    cgracey wrote: »
    Here are the latest doc's which now cover XFR:

    Prop2_Docs.txt

    Here's the XFR part:
  • pedward Posts: 1,642
    edited 2013-03-21 19:48
    Am I correct in assuming that it will recycle between 4 contiguous quads when in pins-to-quads mode?

    That seems like a powerful way of running code directly from external memory!
  • Sapieha Posts: 2,964
    edited 2013-03-21 20:06
    Hi Chip.

    You write
    "For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
    2-clock pattern begins and then repeats indefinitely:"


    But how do I know when the stack is filled?
  • cgracey Posts: 14,206
    edited 2013-03-21 20:38
    Sapieha wrote: »
    Hi Chip.

    You write
    "For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
    2-clock pattern begins and then repeats indefinitely:"


    But how do I know when the stack is filled?

    Your code needs to turn XFR on and then turn it off. The idea is that you coordinate your code with XFR, so that it does something to facilitate some mutual goal. For example, in an SDRAM application, you would output a command sequence to the SDRAM, then turn on XFR at a certain cycle to do its thing, while you issue time-coordinated SDRAM commands and RDQUAD/WRQUAD instructions.
  • cgracey Posts: 14,206
    edited 2013-03-21 20:41
    pedward wrote: »
    Am I correct in assuming that it will recycle between 4 contiguous quads when in pins-to-quads mode?

    That seems like a powerful way of running code directly from external memory!

    I guess as long as every 4th instruction read was a JMP #$-3, it would keep executing from the QUADs. Interesting idea.
  • pedward Posts: 1,642
    edited 2013-03-21 22:10
    cgracey wrote: »
    I guess as long as every 4th instruction read was a JMP #$-3, it would keep executing from the QUADs. Interesting idea.

    I'm sort of thinking you might want to interleave code and data, something like this:

    code
    code
    jump $-2
    data

    or

    data
    code
    code
    jump $-2

    You could also do:

    data
    code
    code
    code
    jump $-3

    Then every 4th long is the data, and the jump happens while the data is loaded.

    I also wonder how this might play into the pipeline and multi-tasking. Several of the instructions don't take effect for several clock cycles.
  • Tubular Posts: 4,705
    edited 2013-03-21 22:24
    This is really neat stuff.

    It would be interesting to work out how to adapt this for driving 18 & 24 bit digital LCDs, or other digital displays.

    Potentially all the P2 would need would be the ability to read back a 'dot clock' pin to control the PTRx++ increments at the dotclock rate. The dot clock itself could be generated using a counter. Alternatively PTRx could be advanced manually (low-med resolution displays), or on an integer division of the system clock for faster displays.

    It doesn't really matter that the QUAD<>PIN transfer executes every single clock. Just control the pointer increments to achieve the correct overall data rate.

    You would just encode the sync pulses and DE signal inside the data stream, since there aren't very big porches driving LCDs anyway, so it wouldn't waste much ram.
  • Cluso99 Posts: 18,069
    edited 2013-03-22 05:23
    The mind boggles with what we can do with this!
  • Seairth Posts: 2,474
    edited 2013-03-22 06:52
    cgracey wrote: »
    Your code needs to turn XFR on and then turn it off. The idea is that you coordinate your code with XFR, so that it does something to facilitate some mutual goal. For example, in an SDRAM application, you would output a command sequence to the SDRAM, then turn on XFR at a certain cycle to do its thing, while you issue time-coordinated SDRAM commands and RDQUAD/WRQUAD instructions.

    From a timing point of view, would you execute the "SETXFR #0" on the same clock cycle as the "final" XFR transfer, or one cycle later? Put another way, does the execution of "SETXFR #0" affect the XFR operation on the same clock cycle?
  • cgracey Posts: 14,206
    edited 2013-03-22 06:58
    Seairth wrote: »
    From a timing point of view, would you execute the "SETXFR #0" on the same clock cycle as the "final" XFR transfer, or one cycle later? Put another way, does the execution of "SETXFR #0" affect the XFR operation on the same clock cycle?

    It would take effect on the next cycle, so you would execute it on the final cycle of your intended operation.
  • Sapieha Posts: 2,964
    edited 2013-03-22 07:04
    Hi Chip.

    Sorry, but I am still missing one thing in these instructions: a counter for the
    bytes/words/longs transferred.

    cgracey wrote: »
    It would take effect on the next cycle, so you would execute it on the final cycle of your intended operation.
  • cgracey Posts: 14,206
    edited 2013-03-22 07:15
    Sapieha wrote: »
    Hi Chip.

    Sorry, but I am still missing one thing in these instructions: a counter for the
    bytes/words/longs transferred.

    It's a function of time. If you set up for a word mode and let it run for 16 clocks, you would transfer 8 longs. For long modes, it's one long per clock. There are no byte modes.
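
The "function of time" point can be sketched in Python as follows. The function names are hypothetical; the arithmetic follows the statement above (two clocks per long in 16-pin modes, one clock per long in 32-pin modes) and the low-word-first ordering given in the XFR documentation.

```python
def longs_transferred(clocks, pins):
    """Longs moved by XFR in `clocks` cycles for a 16- or 32-pin mode."""
    if pins not in (16, 32):
        raise ValueError("XFR has only 16-pin and 32-pin modes")
    return clocks // (32 // pins)


def pack_words(samples):
    """Model a 16-pin input mode: low word sampled first, then high word."""
    longs = []
    for i in range(0, len(samples) - 1, 2):
        low, high = samples[i] & 0xFFFF, samples[i + 1] & 0xFFFF
        longs.append((high << 16) | low)
    return longs
```

So the transfer length is set by how many clocks pass between turning XFR on and executing 'SETXFR #0', not by a count register: 16 clocks in a word mode move 8 longs, and 16 clocks in a long mode move 16.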
  • Sapieha Posts: 2,964
    edited 2013-03-22 07:24
    Hi Chip.

    Thanks.
    I see I need to wait for your demo programs to fully understand it.



    cgracey wrote: »
    It's a function of time. If you set up for a word mode and let it run for 16 clocks, you would transfer 8 longs. For long modes, it's one long per clock. There are no byte modes.
  • cgracey Posts: 14,206
    edited 2013-03-22 11:51
    Here's a much better SDRAM data sheet from Micron:

    http://www.micron.com/~/media/Documents/Products/Data%20Sheet/DRAM/256Mb_sdr.pdf

    32MB SDRAMs from Winbond are only $2 in quantity, but their data sheet is difficult to read.
  • rogloh Posts: 5,837
    edited 2013-03-22 18:01
    Hey there Chip,

    Great work as usual. XFR will be very nice to use. One question I had related to the documentation you updated...
    The pins_to_stack modes are useful for streaming SDRAM data into stack RAM for video
    displays. While a pins_to_stack mode is active, you should not read or write stack RAM
    or modify SPA, as such attempts will likely cause unexpected results. You will need to
    do a 'SETSPA D/#n' instruction before starting a pins_to_stack mode.


    I don't know how many ports your stack RAM has for all its connected client blocks so does this quote above also now mean that the stack RAM (aka CLUT) should not be read by the video generator engine at the same time it is being populated by the SDRAM operation that writes to the stack RAM (at different addresses)? Or does this quote just indicate that the PASM software running on the local COG should not try to access the stack RAM while the SDRAM streaming is currently happening?

    The reason I ask is that it would be great to be able to simultaneously stream video out while reading more SDRAM graphics data into the stack RAM for achieving the high resolution graphics modes with a single COG. However if that is not possible I guess some alternating read scheme during hsync video line blanking or multiple COGs like we had on P1 would be required for higher bandwidth video applications using the SDRAM buffer. Or we could try to read SDRAM into the COG RAM QUADs while streaming video and then copy this data to the CLUT on the fly. Lots of extra copying overhead there unfortunately so I am hoping we can just write to CLUT using the SDRAM reads and then read this memory too with the video engine at the same time (at different addresses).

    Thanks,
    Roger.
  • cgracey Posts: 14,206
    edited 2013-03-22 21:50
    rogloh wrote: »
    I don't know how many ports your stack RAM has for all its connected client blocks so does this quote above also now mean that the stack RAM (aka CLUT) should not be read by the video generator engine at the same time it is being populated by the SDRAM operation that writes to the stack RAM (at different addresses)? Or does this quote just indicate that the PASM software running on the local COG should not try to access the stack RAM while the SDRAM streaming is currently happening?

    Sorry. I should have elaborated a little. It just means that PASM code shouldn't read or write the stack RAM or modify SPA. The video generator reads data from a separate asynchronous port and it can stream data out at the same time as XFR is streaming it in.
  • rogloh Posts: 5,837
    edited 2013-03-23 03:23
    cgracey wrote: »
    Sorry. I should have elaborated a little. It just means that PASM code shouldn't read or write the stack RAM or modify SPA. The video generator reads data from a separate asynchronous port and it can stream data out at the same time as XFR is streaming it in.

    I was really hoping that would be the case. That is excellent news.
    Roger.