SDRAM caching - what's best?

cgracey · 2013-03-21 15:13

What are your guys' thoughts on how the SDRAM-controller cog should handle other cogs' memory requests?

Do you think the protocol should be for a fixed cache line read/write of, say a quad, four quads, or maybe 16 quads?

What about single-byte writing? Is it important?

There may be value in fixing the transaction size. This would get rid of the UDQM/LDQM pins that mask bytes for writing. Also, the SDRAM really performs fast when doing reads and writes of many contiguous words. Every time you change addresses or switch directions, there is a several-clock penalty. It's most efficient doing string operations, not atomic. Atomic operations are sometimes required by the software, though.

David Betz · 2013-03-21 15:24

cgracey wrote: »

What are your guys' thoughts on how the SDRAM-controller cog should handle other cogs' memory requests?

Do you think the protocol should be for a fixed cache line read/write of, say a quad, four quads, or maybe 16 quads?

What about single-byte writing? Is it important?

There may be value in fixing the transaction size. This would get rid of the UDQM/LDQM pins that mask bytes for writing. Also, the SDRAM really performs fast when doing reads and writes of many contiguous words. Every time you change addresses or switch directions, there is a several-clock penalty. It's most efficient doing string operations, not atomic. Atomic operations are sometimes required by the software, though.

The current cache drivers transfer 64 bytes at a time but that can be changed if necessary.

David Betz · 2013-03-21 15:30

I guess I had thought we would go with a pseudo-DMA approach, where an SDRAM client would send a message to the SDRAM COG, probably through a mailbox, containing something like the following:

function code: READ or WRITE
SDRAM address
hub address
byte count (or maybe long count?)

To be a little fancier and to allow a cache line fill with a single request it might be nice to be able to pass an SDRAM address and two hub addresses, one to write and the other to read. That would allow a dirty cache line to be replaced with new data with a single request.

Even more ambitious would be a scatter-gather scheme where it would be possible to pass a list of requests each of which was in the form of the one I mentioned first.
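As a rough illustration of the request format proposed above, here is a minimal Python model (not PASM, and not an actual driver). The names `Function`, `Request`, and `service` are hypothetical, and the SDRAM and hub are modelled as plain byte arrays. A scatter-gather list is simply a list of requests serviced in order, which also covers the dirty-line replacement case as a WRITE followed by a READ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Function(Enum):
    READ = 0   # copy SDRAM -> hub
    WRITE = 1  # copy hub -> SDRAM


@dataclass
class Request:
    # Field names are illustrative; they mirror the four items listed in the post.
    function: Function
    sdram_addr: int
    hub_addr: int
    byte_count: int


def service(requests: List[Request], sdram: bytearray, hub: bytearray) -> None:
    """Service a scatter-gather list of requests in order."""
    for r in requests:
        s = slice(r.sdram_addr, r.sdram_addr + r.byte_count)
        h = slice(r.hub_addr, r.hub_addr + r.byte_count)
        if r.function is Function.READ:
            hub[h] = sdram[s]
        else:
            sdram[s] = hub[h]


# Replace a dirty cache line: write the old contents back, then fill with new data.
sdram = bytearray(1024)
hub = bytearray(256)
hub[0:4] = b"DIRT"        # dirty line currently held in hub RAM
sdram[128:132] = b"NEWD"  # data that should replace it
service(
    [Request(Function.WRITE, 64, 0, 4), Request(Function.READ, 128, 0, 4)],
    sdram,
    hub,
)
```

After the call, the old line sits at SDRAM address 64 and the hub buffer holds the new data, all from one submitted list.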

Bill Henning · 2013-03-21 15:31

I think that we need to keep UDQM/LDQM for byte writes, otherwise all byte writes will turn into read/modify/write cycles, making things even slower.
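To make the difference concrete, here is a small Python sketch comparing the two ways of writing one byte into a 16-bit-wide memory. It is only a model: the function names are made up, the memory is a list of 16-bit words, and the mapping of even byte addresses to the low lane (LDQM) and odd addresses to the high lane (UDQM) is an assumption.

```python
def write_byte_masked(mem, byte_addr, value):
    """Single write: the DQM pin masks off the byte lane that must not change."""
    word_index, lane = divmod(byte_addr, 2)   # lane 0 = low byte, lane 1 = high byte (assumed)
    shift = 8 * lane
    keep_mask = 0xFFFF ^ (0xFF << shift)      # bits the SDRAM leaves untouched
    mem[word_index] = (mem[word_index] & keep_mask) | ((value & 0xFF) << shift)
    return 1                                  # transactions used


def write_byte_rmw(mem, byte_addr, value):
    """No DQM pins: read the whole word, merge the byte, write the word back."""
    word_index, lane = divmod(byte_addr, 2)
    shift = 8 * lane
    word = mem[word_index]                    # transaction 1: read
    word = (word & (0xFFFF ^ (0xFF << shift))) | ((value & 0xFF) << shift)
    mem[word_index] = word                    # transaction 2: write
    return 2                                  # transactions used


masked_mem = [0x1234]
rmw_mem = [0x1234]
masked_cost = write_byte_masked(masked_mem, 1, 0xAB)
rmw_cost = write_byte_rmw(rmw_mem, 0, 0xCD)
```

Both routes produce the same memory contents; the read/modify/write path just needs two SDRAM transactions (including a direction switch) instead of one.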

Burst accesses are definitely preferred; especially long bursts when possible.

Can you post the documentation for the SETXFR instruction? And anything else relevant to it (i.e. configuring RAS, CAS latency, etc.)?

I.e. where it transfers to/from in the cog, whether it can span pages in the SDRAM, etc.

Edit:

Ideally with sample code for:

- reading a byte/word
- writing a byte/word
- reading N words as a burst
- writing N words as a burst

With the above samples, it will be possible to start experimenting with large bitmap drivers, xmm drivers...

cgracey wrote: »

What are your guys' thoughts on how the SDRAM-controller cog should handle other cogs' memory requests?

Do you think the protocol should be for a fixed cache line read/write of, say a quad, four quads, or maybe 16 quads?

What about single-byte writing? Is it important?

There may be value in fixing the transaction size. This would get rid of the UDQM/LDQM pins that mask bytes for writing. Also, the SDRAM really performs fast when doing reads and writes of many contiguous words. Every time you change addresses or switch directions, there is a several-clock penalty. It's most efficient doing string operations, not atomic. Atomic operations are sometimes required by the software, though.

jazzed · 2013-03-21 15:31

Hi Chip.

I've used 32 byte bursts for the SDRAM cache driver in P1 Propeller-GCC here:
https://code.google.com/p/propgcc/source/browse/loader/spin/sdram_cache.spin

That driver is for a byte-wide device that uses an address latch - address setup runs like a turtle, but data runs like a rabbit. The caching mechanism uses dirty bits and write-back methodology. It's possible to use bigger buffers with a better COG/HUB interface like in P2.
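For readers unfamiliar with the dirty-bit, write-back scheme mentioned above, the following Python sketch shows the general idea. It is not a translation of the linked Spin driver: the class name, the direct-mapped layout, and the four-line size are assumptions chosen to keep the example short. Only the 32-byte line size comes from the post.

```python
LINE_SIZE = 32  # bytes per cache line, matching the 32-byte bursts mentioned above


class WriteBackCache:
    """Direct-mapped write-back cache in front of a slow backing store (a sketch)."""

    def __init__(self, backing: bytearray, num_lines: int = 4):
        self.backing = backing            # stands in for the SDRAM
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # which backing line each slot holds
        self.dirty = [False] * num_lines  # True if the slot was modified since it was filled
        self.lines = [bytearray(LINE_SIZE) for _ in range(num_lines)]
        self.bursts = 0                   # number of whole-line transfers performed

    def _line_for(self, addr):
        tag, offset = divmod(addr, LINE_SIZE)
        index = tag % self.num_lines
        if self.tags[index] != tag:       # miss
            if self.dirty[index]:         # write the old line back only if it changed
                old = self.tags[index] * LINE_SIZE
                self.backing[old:old + LINE_SIZE] = self.lines[index]
                self.bursts += 1
            base = tag * LINE_SIZE
            self.lines[index][:] = self.backing[base:base + LINE_SIZE]
            self.bursts += 1
            self.tags[index] = tag
            self.dirty[index] = False
        return index, offset

    def read(self, addr):
        index, offset = self._line_for(addr)
        return self.lines[index][offset]

    def write(self, addr, value):
        index, offset = self._line_for(addr)
        self.lines[index][offset] = value & 0xFF
        self.dirty[index] = True          # defer the SDRAM write until eviction
```

A byte write only touches the cached line; the backing store is updated in a single burst when a conflicting address later evicts that line.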

If we don't have to manipulate the QMs that's great. The fewer transactions, the better.

Single byte access would be important without a cache. I don't care for all that setup just for doing one byte though.

Roy Eltham · 2013-03-21 15:42

I think it would be best to have the option for byte level access, and just document the performance considerations.

cgracey · 2013-03-21 15:45

jazzed wrote: »

Single byte access would be important without a cache. I don't care for all that setup just for doing one byte though.

It takes, if I recall right, something like 10 clocks to just write a single byte or word. Extra words take 1 clock each, though.
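Taking those recalled figures at face value (roughly 10 clocks for the first byte or word, 1 clock for each extra word), a quick Python calculation shows how much longer bursts amortize the setup cost. The constants are the approximate numbers from the post, not datasheet values, and the function names are made up for this sketch.

```python
SETUP_CLOCKS = 10       # approximate cost of the first byte/word, as recalled above
EXTRA_WORD_CLOCKS = 1   # each additional contiguous word


def burst_clocks(words: int) -> int:
    """Estimated clocks for one burst of `words` contiguous 16-bit words."""
    if words < 1:
        raise ValueError("a burst needs at least one word")
    return SETUP_CLOCKS + (words - 1) * EXTRA_WORD_CLOCKS


def clocks_per_word(words: int) -> float:
    """Average cost per word once the setup is spread over the burst."""
    return burst_clocks(words) / words


for n in (1, 16, 32):
    print(n, burst_clocks(n), clocks_per_word(n))
```

Under this model a single word costs 10 clocks, while a 32-word (64-byte) burst costs 41 clocks, or about 1.3 clocks per word.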

BTW, I'm working on the XFR doc's right now, as we need them badly.

In the interim, here's the SDRAM data sheet for anyone who's interested:

http://www.winbond.com/NR/rdonlyres/16898431-2772-4CEA-8474-7E4AA855555F/0/W9825G6JH.pdf

It took me a few hours to get a handle on this the first time. I need to study it again, because I don't remember the specifics. It's no walk in the park.

jazzed · 2013-03-21 16:33

I'm hoping the SDRAM clock is separate from the other counter outputs. If the other counters are sufficiently complex we could use them for CAS/RAS refresh instead of a refresh loop.

It would be really nice to access SDRAM and have an interpreter running in a single COG. If that makes sense, then you need all kinds of accesses. It's software controlled anyway, so any access is possible although with penalty of being generic of course. As Bill said the QM bits are necessary for byte wide access. When pulled down, they won't need to be diddled with burst access - fortunately, the P2 has pull up/down built-in.

Some of us have discussed different methods of sharing cache space among COGs for quick multiple COG access. We don't have it all figured out yet, but having multiple COG access to memory without a lot of per-transaction blocking seems important enough for more serious thought.

Cluso99 · 2013-03-21 16:35

For emulations and interpreters, byte reads and writes will be required. I will be doing a PCB with SRAM for these, but if your PCB handles this it would be a good base. 10 clocks at 200MHz is not that bad btw (50ns).

I would recommend connecting all pins for now so we can experiment. A microSD socket is missing though.

(sorry about missing spaces - on my Xoom)

Rayman · 2013-03-21 18:17

Was just thinking that a couple things I've been waiting to do on P2 don't need fancy graphics...
In that case, I'd want the SDRAM driver to just do LMM memory calls and I'd use another cog and HUB for graphics memory...

David Betz · 2013-03-21 18:25

Rayman wrote: »

Was just thinking that a couple things I've been waiting to do on P2 don't need fancy graphics...
In that case, I'd want the SDRAM driver to just do LMM memory calls and I'd use another cog and HUB for graphics memory...

We can certainly provide more than one cache driver, one of which controls the SDRAM directly, for faster C performance when the SDRAM isn't needed for graphics.

cgracey · 2013-03-21 18:40

Here are the latest doc's which now cover XFR:

Prop2_Docs.txt

Here's the XFR part:

PIN TRANSFER
------------

Each cog has a pin transfer circuit (XFR) which can automatically move data between pins
and QUADs or from pins to stack RAM, in the background, while instructions execute normally.

XFR is configured with the SETXFR instruction:

    SETXFR  D/#n    - Set XFR configuration to %MMM_PPP

          %MMM = mode

                 %00x = off (initial state after cog start)
                 %010 = QUADs_to_16_pins
                 %011 = QUADs_to_32_pins
                 %100 = 16_pins_to_QUADs
                 %101 = 32_pins_to_QUADs
                 %110 = 16_pins_to_stack
                 %111 = 32_pins_to_stack

          %PPP = pin group

                %000 = pins 15..0  for 16-pin modes, pins 31..0  for 32-pin modes
                %001 = pins 31..16 for 16-pin modes, pins 31..0  for 32-pin modes
                %010 = pins 47..32 for 16-pin modes, pins 63..32 for 32-pin modes
                %011 = pins 63..48 for 16-pin modes, pins 63..32 for 32-pin modes
                %100 = pins 79..64 for 16-pin modes, pins 95..64 for 32-pin modes
                %101 = pins 95..80 for 16-pin modes, pins 95..64 for 32-pin modes
                %11x = no pins
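As a reading aid for the two tables above, this Python sketch packs a mode and pin group into the %MMM_PPP value and reports which pins a given setting selects. The helper names are hypothetical, and the sketch assumes MMM occupies the upper three bits of a 6-bit value, as the %MMM_PPP notation suggests.

```python
MODES = {
    "off": 0b000,
    "QUADs_to_16_pins": 0b010,
    "QUADs_to_32_pins": 0b011,
    "16_pins_to_QUADs": 0b100,
    "32_pins_to_QUADs": 0b101,
    "16_pins_to_stack": 0b110,
    "32_pins_to_stack": 0b111,
}


def setxfr_value(mode: str, pin_group: int) -> int:
    """Pack mode and pin group as %MMM_PPP (MMM assumed to be the high three bits)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if not 0 <= pin_group <= 0b111:
        raise ValueError("pin group must fit in three bits")
    return (MODES[mode] << 3) | pin_group


def pins_for(mode: str, pin_group: int):
    """Return (high_pin, low_pin) for the setting, or None when no pins are used."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "off" or pin_group >= 0b110:
        return None
    if "16" in mode:                      # 16-pin modes: each group is its own 16-pin block
        low = 16 * pin_group
        return (low + 15, low)
    low = 32 * (pin_group >> 1)           # 32-pin modes: groups pair up on 32-pin blocks
    return (low + 31, low)
```

For example, `setxfr_value("16_pins_to_QUADs", 0b010)` gives `0b100010`, and `pins_for("16_pins_to_QUADs", 0b010)` gives `(47, 32)`.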


For QUADs_to_16_pins mode (%010), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 low word is output to pins
    2nd clock: QUAD0 high word is output to pins
    3rd clock: QUAD1 low word is output to pins
    4th clock: QUAD1 high word is output to pins
    5th clock: QUAD2 low word is output to pins
    6th clock: QUAD2 high word is output to pins
    7th clock: QUAD3 low word is output to pins
    8th clock: QUAD3 high word is output to pins

This mode is useful for coordinating with a 'RDQUAD PTRx++' instruction so that a
continuous stream of words from hub memory can be output to an SDRAM's DQ pins. This
enables SDRAM writing at the cog's hub bandwidth limit.
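The 8-clock output order described above can be modelled in a few lines of Python. This is only a sketch of the word ordering (low word first, then high word, for QUAD0 through QUAD3); the function name is made up and no timing is simulated.

```python
def quads_to_16_pin_words(quads):
    """Return the eight 16-bit words output over one 8-clock pattern."""
    if len(quads) != 4:
        raise ValueError("expected QUAD0..QUAD3")
    words = []
    for long_value in quads:
        words.append(long_value & 0xFFFF)          # odd clocks: low word
        words.append((long_value >> 16) & 0xFFFF)  # even clocks: high word
    return words


stream = quads_to_16_pin_words([0x11112222, 0x33334444, 0x55556666, 0x77778888])
```

With those inputs, `stream` starts with `0x2222, 0x1111, 0x4444, 0x3333`, i.e. the words appear on the pins in little-endian order within each long.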


For QUADs_to_32_pins mode (%011), on the cycle after SETXFR is executed, the following
4-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 is output to pins
    2nd clock: QUAD1 is output to pins
    3rd clock: QUAD2 is output to pins
    4th clock: QUAD3 is output to pins


For 16_pins_to_QUADs mode (%100), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled as low word
    2nd clock: pins are sampled as high word, long is written to QUAD0
    3rd clock: pins are sampled as low word
    4th clock: pins are sampled as high word, long is written to QUAD1
    5th clock: pins are sampled as low word
    6th clock: pins are sampled as high word, long is written to QUAD2
    7th clock: pins are sampled as low word
    8th clock: pins are sampled as high word, long is written to QUAD3

This mode is useful for coordinating with a 'WRQUAD PTRx++' instruction so that a
continuous stream of words input from an SDRAM's DQ pins can be written to hub memory.
This enables SDRAM reading at the cog's hub bandwidth limit.
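The input direction is the mirror image. The following Python sketch assembles eight 16-bit pin samples into the four QUAD longs in the order the pattern above describes; again the function name is hypothetical and only the data packing is modelled, not the clocking.

```python
def pins_16_to_quads(samples):
    """Assemble eight 16-bit pin samples into QUAD0..QUAD3."""
    if len(samples) != 8:
        raise ValueError("expected eight 16-bit samples")
    quads = []
    for i in range(0, 8, 2):
        low = samples[i] & 0xFFFF        # 1st, 3rd, 5th, 7th clock: low word
        high = samples[i + 1] & 0xFFFF   # 2nd, 4th, 6th, 8th clock: high word
        quads.append((high << 16) | low)
    return quads


quads = pins_16_to_quads([0x2222, 0x1111, 0x4444, 0x3333, 0x6666, 0x5555, 0x8888, 0x7777])
```

Here `quads` comes out as `[0x11112222, 0x33334444, 0x55556666, 0x77778888]`, ready to be written to hub memory as one quad.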


For 32_pins_to_QUADs mode (%101), on the cycle after SETXFR is executed, the following
4-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to QUAD0
    2nd clock: pins are sampled and written to QUAD1
    3rd clock: pins are sampled and written to QUAD2
    4th clock: pins are sampled and written to QUAD3


For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled as low word
    2nd clock: pins are sampled as high word, long is written to stack at SPA++


For 32_pins_to_stack mode (%111), on the cycle after SETXFR is executed, the following
1-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to stack at SPA++


The pins_to_stack modes are useful for streaming SDRAM data into stack RAM for video
displays. While a pins_to_stack mode is active, you should not read or write stack RAM
or modify SPA, as such attempts will likely cause unexpected results. You will need to
do a 'SETSPA D/#n' instruction before starting a pins_to_stack mode.

To stop XFR, execute 'SETXFR #0'.

David Betz · 2013-03-21 19:00

cgracey wrote: »

Here are the latest doc's which now cover XFR:

Prop2_Docs.txt

Thanks Chip! That looks like a very powerful feature!

cgracey · 2013-03-21 19:22

David Betz wrote: »

Thanks Chip! That looks like a very powerful feature!

I don't know why it's so hard for me to write documentation. This feature was pretty simple to write about after I tried a few different approaches. It had seemed complicated for months to me.

Bill Henning · 2013-03-21 19:36

Thanks Chip, nice way of doing burst transfers!

cgracey wrote: »

Here are the latest doc's which now cover XFR:

Prop2_Docs.txt

Here's the XFR part:

pedward · 2013-03-21 19:48

Am I correct in assuming that it will recycle between 4 contiguous quads when in pins-to-quads mode?

That seems like a powerful way of running code directly from external memory!

Sapieha · 2013-03-21 20:06

Hi Chip.

You write
"For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:"

But how do I know when the stack is filled?

cgracey · 2013-03-21 20:38

Sapieha wrote: »

Hi Chip.

You write
"For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:"

But how do I know when the stack is filled?

Your code needs to turn XFR on and then turn it off. The idea is that you coordinate your code with XFR, so that it does something to facilitate some mutual goal. For example, in an SDRAM application, you would output a command sequence to the SDRAM, then turn on XFR at a certain cycle to do its thing, while you issue time-coordinated SDRAM commands and RDQUAD/WRQUAD instructions.

cgracey · 2013-03-21 20:41

pedward wrote: »

Am I correct in assuming that it will recycle between 4 contiguous quads when in pins-to-quads mode?

That seems like a powerful way of running code directly from external memory!

I guess as long as every 4th instruction read was a JMP #$-3, it would keep executing from the QUADs. Interesting idea.

pedward · 2013-03-21 22:10

cgracey wrote: »

I guess as long as every 4th instruction read was a JMP #$-3, it would keep executing from the QUADs. Interesting idea.

I'm sort of thinking you might want to interleave code and data, something like this:

code
code
jump $-2
data

or

data
code
code
jump $-2

You could also do:

data
code
code
code
jump $-3

Then every 4th long is the data, and the jump happens while the data is loaded.

I also wonder how this might play into the pipeline and multi-tasking. Several of the instructions don't have effects for several clock cycles.

Tubular · 2013-03-21 22:24

This is really neat stuff.

It would be interesting to work out how to adapt this for driving 18 & 24 bit digital LCDs, or other digital displays.

Potentially all the P2 would need would be the ability to read back a 'dot clock' pin to control the PTRx++ increments at the dotclock rate. The dot clock itself could be generated using a counter. Alternatively PTRx could be advanced manually (low-med resolution displays), or on an integer division of the system clock for faster displays.

It doesn't really matter that the QUAD<>PIN transfer executes every single clock. Just control the pointer increments to effect the correct overall data rate.

You would just encode the sync pulses and DE signal inside the data stream, since there aren't very big porches driving LCDs anyway, so it wouldn't waste much ram.

Cluso99 · 2013-03-22 05:23

The mind boggles with what we can do with this!

Seairth · 2013-03-22 06:52

cgracey wrote: »

Your code needs to turn XFR on and then turn it off. The idea is that you coordinate your code with XFR, so that it does something to facilitate some mutual goal. For example, in an SDRAM application, you would output a command sequence to the SDRAM, then turn on XFR at a certain cycle to do its thing, while you issue time-coordinated SDRAM commands and RDQUAD/WRQUAD instructions.

From a timing point of view, would you execute the "SETXFR #0" on the same clock cycle as the "final" XFR transfer, or one cycle later? Put another way, does the execution of "SETXFR #0" affect the XFR operation on the same clock cycle?

cgracey · 2013-03-22 06:58

Seairth wrote: »

From a timing point of view, would you execute the "SETXFR #0" on the same clock cycle as the "final" XFR transfer, or one cycle later? Put another way, does the execution of "SETXFR #0" affect the XFR operation on the same clock cycle?

It would take effect on the next cycle, so you would execute it on the final cycle of your intended operation.

Sapieha · 2013-03-22 07:04

Hi Chip.

Sorry, but I am still missing one thing in these instructions: a count of the

Bytes/Words/Longs transferred.

cgracey wrote: »

It would take effect on the next cycle, so you would execute it on the final cycle of your intended operation.

cgracey · 2013-03-22 07:15

Sapieha wrote: »

Hi Chip.

Sorry, but I am still missing one thing in these instructions: a count of the
Bytes/Words/Longs transferred.

It's a function of time. If you set up for a word mode and let it run for 16 clocks, you would transfer 8 longs. For long modes, it's one long per clock. There are no byte modes.
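Since the amount transferred depends only on how long XFR runs, the count can be worked out from the clock count alone. A minimal Python sketch of that arithmetic, with a made-up function name:

```python
CLOCKS_PER_LONG = {16: 2, 32: 1}  # word modes take two clocks per long, long modes one


def longs_transferred(pin_width: int, clocks: int) -> int:
    """Whole longs moved by XFR after running for `clocks` clocks."""
    if pin_width not in CLOCKS_PER_LONG:
        raise ValueError("XFR only has 16-pin and 32-pin modes; there are no byte modes")
    return clocks // CLOCKS_PER_LONG[pin_width]
```

So `longs_transferred(16, 16)` is 8, matching the example in the reply, and `longs_transferred(32, 16)` is 16.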

Sapieha · 2013-03-22 07:24

Hi Chip.

Thanks.
I see I need to wait for your demo programs to fully understand it.

cgracey wrote: »

It's a function of time. If you set up for a word mode and let it run for 16 clocks, you would transfer 8 longs. For long modes, it's one long per clock. There are no byte modes.

cgracey · 2013-03-22 11:51

Here's a much better SDRAM data sheet from Micron:

http://www.micron.com/~/media/Documents/Products/Data%20Sheet/DRAM/256Mb_sdr.pdf

32MB SDRAMs from Winbond are only $2 in quantity, but their data sheet is difficult to read.

rogloh · 2013-03-22 18:01

Hey there Chip,

Great work as usual. XFR will be very nice to use. One question I had related to the documentation you updated...

The pins_to_stack modes are useful for streaming SDRAM data into stack RAM for video
displays. While a pins_to_stack mode is active, you should not read or write stack RAM
or modify SPA, as such attempts will likely cause unexpected results. You will need to
do a 'SETSPA D/#n' instruction before starting a pins_to_stack mode.

I don't know how many ports your stack RAM has for all its connected client blocks so does this quote above also now mean that the stack RAM (aka CLUT) should not be read by the video generator engine at the same time it is being populated by the SDRAM operation that writes to the stack RAM (at different addresses)? Or does this quote just indicate that the PASM software running on the local COG should not try to access the stack RAM while the SDRAM streaming is currently happening?

The reason I ask is that it would be great to be able to simultaneously stream video out while reading more SDRAM graphics data into the stack RAM for achieving the high resolution graphics modes with a single COG. However if that is not possible I guess some alternating read scheme during hsync video line blanking or multiple COGs like we had on P1 would be required for higher bandwidth video applications using the SDRAM buffer. Or we could try to read SDRAM into the COG RAM QUADs while streaming video and then copy this data to the CLUT on the fly. Lots of extra copying overhead there unfortunately so I am hoping we can just write to CLUT using the SDRAM reads and then read this memory too with the video engine at the same time (at different addresses).

Thanks,
Roger.

cgracey · 2013-03-22 21:50

rogloh wrote: »

I don't know how many ports your stack RAM has for all its connected client blocks so does this quote above also now mean that the stack RAM (aka CLUT) should not be read by the video generator engine at the same time it is being populated by the SDRAM operation that writes to the stack RAM (at different addresses)? Or does this quote just indicate that the PASM software running on the local COG should not try to access the stack RAM while the SDRAM streaming is currently happening?

Sorry. I should have elaborated a little. It just means that PASM code shouldn't read or write the stack RAM or modify SPA. The video generator reads data from a separate asynchronous port and it can stream data out at the same time as XFR is streaming it in.

rogloh · 2013-03-23 03:23

cgracey wrote: »

Sorry. I should have elaborated a little. It just means that PASM code shouldn't read or write the stack RAM or modify SPA. The video generator reads data from a separate asynchronous port and it can stream data out at the same time as XFR is streaming it in.

I was really hoping that would be the case. That is excellent news.
Roger.
