SDRAM caching - what's best?

What are your guys' thoughts on how the SDRAM-controller cog should handle other cogs' memory requests?
Do you think the protocol should be for a fixed cache line read/write of, say a quad, four quads, or maybe 16 quads?
What about single-byte writing? Is it important?
There may be value in fixing the transaction size. This would get rid of the UDQM/LDQM pins that mask bytes for writing. Also, the SDRAM performs really fast when doing reads and writes of many contiguous words. Every time you change addresses or switch directions, there is a several-clock penalty. It's most efficient doing string operations, not atomic ones. Atomic operations are sometimes required by the software, though.
Comments
function code: READ or WRITE
SDRAM address
hub address
byte count (or maybe long count?)
To be a little fancier and to allow a cache line fill with a single request, it might be nice to be able to pass an SDRAM address and two hub addresses, one to write and the other to read. That would allow a dirty cache line to be replaced with new data with a single request. Even more ambitious would be a scatter-gather scheme where it would be possible to pass a list of requests, each of which was in the form of the one I mentioned first.
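A rough sketch of the request format and the scatter-gather idea, in Python purely for illustration. The names (`Request`, `service`, `service_list`) and the use of bytearrays to stand in for SDRAM and hub RAM are assumptions made for this example, not an actual driver interface.

```python
from dataclasses import dataclass

READ, WRITE = 0, 1  # hypothetical function codes


@dataclass
class Request:
    func: int        # READ (SDRAM -> hub) or WRITE (hub -> SDRAM)
    sdram_addr: int  # byte address in SDRAM
    hub_addr: int    # byte address in hub RAM
    count: int       # byte count (could just as well be a long count)


def service(req, sdram, hub):
    """Copy req.count bytes between the two memories in the requested direction."""
    s, h, n = req.sdram_addr, req.hub_addr, req.count
    if req.func == READ:
        hub[h:h + n] = sdram[s:s + n]
    else:
        sdram[s:s + n] = hub[h:h + n]


def service_list(requests, sdram, hub):
    """Scatter-gather: service a list of simple requests in order."""
    for req in requests:
        service(req, sdram, hub)
```

With a list like this, replacing a dirty cache line becomes a two-entry list: a WRITE of the old line back to its SDRAM address, followed by a READ of the new line into the same hub buffer.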
Burst accesses are definitely preferred; especially long bursts when possible.
Can you post the documentation for the SETXFR instruction? And anything else relevant to it (i.e., configuring RAS, CAS latency, etc.)?
That is, where it transfers to/from in the cog, whether it can span pages in the SDRAM, etc.
Edit:
Ideally with sample code for:
- reading a byte/word
- writing a byte/word
- reading N words as a burst
- writing N words as a burst
With the above samples, it will be possible to start experimenting with large bitmap drivers, xmm drivers...
I've used 32 byte bursts for the SDRAM cache driver in P1 Propeller-GCC here:
https://code.google.com/p/propgcc/source/browse/loader/spin/sdram_cache.spin
That driver is for a byte-wide device that uses an address latch - address setup runs like a turtle, but data runs like a rabbit. The caching mechanism uses dirty bits and write-back methodology. It's possible to use bigger buffers with a better COG/HUB interface like in P2.
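To make the dirty-bit, write-back idea concrete, here is a minimal Python sketch. It assumes a direct-mapped layout with 32-byte lines; the class name, the `bursts` counter, and the bytearray standing in for SDRAM are all made up for this example and are not taken from the linked driver.

```python
LINE_SIZE = 32  # bytes per burst, matching the 32-byte bursts mentioned above


class WriteBackCache:
    """Direct-mapped cache with one dirty bit per line (illustrative only)."""

    def __init__(self, backing, num_lines=4):
        self.backing = backing            # bytearray standing in for SDRAM
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # which SDRAM line each slot holds
        self.dirty = [False] * num_lines
        self.lines = [bytearray(LINE_SIZE) for _ in range(num_lines)]
        self.bursts = 0                   # line-sized transfers to/from backing

    def _write_back(self, idx):
        base = self.tags[idx] * LINE_SIZE
        self.backing[base:base + LINE_SIZE] = self.lines[idx]
        self.dirty[idx] = False
        self.bursts += 1

    def _lookup(self, addr):
        line_no = addr // LINE_SIZE
        idx = line_no % self.num_lines
        if self.tags[idx] != line_no:     # miss: evict, then fill
            if self.dirty[idx]:
                self._write_back(idx)
            base = line_no * LINE_SIZE
            self.lines[idx][:] = self.backing[base:base + LINE_SIZE]
            self.tags[idx] = line_no
            self.bursts += 1
        return idx, addr % LINE_SIZE

    def read_byte(self, addr):
        idx, off = self._lookup(addr)
        return self.lines[idx][off]

    def write_byte(self, addr, value):
        idx, off = self._lookup(addr)
        self.lines[idx][off] = value & 0xFF
        self.dirty[idx] = True            # SDRAM is updated only on eviction

    def flush(self):
        for idx in range(self.num_lines):
            if self.dirty[idx]:
                self._write_back(idx)
```

The point of the sketch is that single-byte writes never touch the SDRAM directly; the only SDRAM traffic is whole-line bursts on a miss or an eviction.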
If we don't have to manipulate the QMs, that's great. The fewer transactions, the better.
Single byte access would be important without a cache. I don't care for all that setup just for doing one byte though.
It takes, if I recall right, something like 10 clocks to just write a single byte or word. Extra words take 1 clock each, though.
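Taking those recalled figures at face value (about 10 clocks for the first word, 1 clock for each extra word), the cost per word for the line sizes under discussion works out as below. This is back-of-envelope arithmetic only; the 10-clock figure is a recollection, and the assumption that a quad is 8 words (four longs on a 16-bit data bus) is mine.

```python
WORDS_PER_QUAD = 8  # assumption: 4 longs per quad, 16-bit SDRAM data bus


def burst_clocks(words, first_word_clocks=10):
    """Clocks for one burst: fixed cost for the first word, 1 clock per extra word."""
    if words < 1:
        raise ValueError("a burst needs at least one word")
    return first_word_clocks + (words - 1)


def clocks_per_word(words, first_word_clocks=10):
    """Average clocks spent per word in a burst of the given length."""
    return burst_clocks(words, first_word_clocks) / words


if __name__ == "__main__":
    for quads in (1, 4, 16):
        words = quads * WORDS_PER_QUAD
        print(quads, "quad(s):", burst_clocks(words), "clocks,",
              round(clocks_per_word(words), 2), "clocks/word")
```

Under these assumptions a single word costs 10 clocks, one quad costs 17 clocks, and 16 quads cost 137 clocks, so the fixed overhead nearly disappears for long bursts.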
BTW, I'm working on the XFR docs right now, as we need them badly.
In the interim, here's the SDRAM data sheet for anyone who's interested:
http://www.winbond.com/NR/rdonlyres/16898431-2772-4CEA-8474-7E4AA855555F/0/W9825G6JH.pdf
It took me a few hours to get a handle on this the first time. I need to study it again, because I don't remember the specifics. It's no walk in the park.
It would be really nice to access SDRAM and have an interpreter running in a single COG. If that makes sense, then you need all kinds of accesses. It's software controlled anyway, so any access is possible although with penalty of being generic of course. As Bill said the QM bits are necessary for byte wide access. When pulled down, they won't need to be diddled with burst access - fortunately, the P2 has pull up/down built-in.
Some of us have discussed different methods of sharing cache space among COGs for quick multiple COG access. We don't have it all figured out yet, but having multiple COG access to memory without a lot of per-transaction blocking seems important enough for more serious thought.
I would recommend connecting all pins for now so we can experiment. A microSD socket is missing, though.
(sorry about missing spaces - on my xoom)
In that case, I'd want the SDRAM driver to just do LMM memory calls and I'd use another cog and HUB for graphics memory...
Prop2_Docs.txt
Here's the XFR part:
PIN TRANSFER
------------

Each cog has a pin transfer circuit (XFR) which can automatically move data between pins and QUADs or from pins to stack RAM, in the background, while instructions execute normally.

XFR is configured with the SETXFR instruction:

SETXFR D/#n - Set XFR configuration to %MMM_PPP

    %MMM = mode
        %00x = off (initial state after cog start)
        %010 = QUADs_to_16_pins
        %011 = QUADs_to_32_pins
        %100 = 16_pins_to_QUADs
        %101 = 32_pins_to_QUADs
        %110 = 16_pins_to_stack
        %111 = 32_pins_to_stack

    %PPP = pin group
        %000 = pins 15..0 for 16-pin modes, pins 31..0 for 32-pin modes
        %001 = pins 31..16 for 16-pin modes, pins 31..0 for 32-pin modes
        %010 = pins 47..32 for 16-pin modes, pins 63..32 for 32-pin modes
        %011 = pins 63..48 for 16-pin modes, pins 63..32 for 32-pin modes
        %100 = pins 79..64 for 16-pin modes, pins 95..64 for 32-pin modes
        %101 = pins 95..80 for 16-pin modes, pins 95..64 for 32-pin modes
        %11x = no pins

For QUADs_to_16_pins mode (%010), on the cycle after SETXFR is executed, the following 8-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 low word is output to pins
    2nd clock: QUAD0 high word is output to pins
    3rd clock: QUAD1 low word is output to pins
    4th clock: QUAD1 high word is output to pins
    5th clock: QUAD2 low word is output to pins
    6th clock: QUAD2 high word is output to pins
    7th clock: QUAD3 low word is output to pins
    8th clock: QUAD3 high word is output to pins

This mode is useful for coordinating with a 'RDQUAD PTRx++' instruction so that a continuous stream of words from hub memory can be output to an SDRAM's DQ pins. This enables SDRAM writing at the cog's hub bandwidth limit.

For QUADs_to_32_pins mode (%011), on the cycle after SETXFR is executed, the following 4-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 is output to pins
    2nd clock: QUAD1 is output to pins
    3rd clock: QUAD2 is output to pins
    4th clock: QUAD3 is output to pins

For 16_pins_to_QUADs mode (%100), on the cycle after SETXFR is executed, the following 8-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled as low word
    2nd clock: pins are sampled as high word, long is written to QUAD0
    3rd clock: pins are sampled as low word
    4th clock: pins are sampled as high word, long is written to QUAD1
    5th clock: pins are sampled as low word
    6th clock: pins are sampled as high word, long is written to QUAD2
    7th clock: pins are sampled as low word
    8th clock: pins are sampled as high word, long is written to QUAD3

This mode is useful for coordinating with a 'WRQUAD PTRx++' instruction so that a continuous stream of words input from an SDRAM's DQ pins can be written to hub memory. This enables SDRAM reading at the cog's hub bandwidth limit.

For 32_pins_to_QUADs mode (%101), on the cycle after SETXFR is executed, the following 4-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to QUAD0
    2nd clock: pins are sampled and written to QUAD1
    3rd clock: pins are sampled and written to QUAD2
    4th clock: pins are sampled and written to QUAD3

For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following 2-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled as low word
    2nd clock: pins are sampled as high word, long is written to stack at SPA++

For 32_pins_to_stack mode (%111), on the cycle after SETXFR is executed, the following 1-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to stack at SPA++

The pins_to_stack modes are useful for streaming SDRAM data into stack RAM for video displays.

While a pins_to_stack mode is active, you should not read or write stack RAM or modify SPA, as such attempts will likely cause unexpected results. You will need to do a 'SETSPA D/#n' instruction before starting a pins_to_stack mode.

To stop XFR, execute 'SETXFR #0'.
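As a reading aid for the %MMM_PPP field, here is a small Python decoder built only from the tables in the doc excerpt. The function name and the returned tuple shape are my own choices for illustration; nothing here is part of the P2 tools.

```python
MODES = {
    0b010: "QUADs_to_16_pins",
    0b011: "QUADs_to_32_pins",
    0b100: "16_pins_to_QUADs",
    0b101: "32_pins_to_QUADs",
    0b110: "16_pins_to_stack",
    0b111: "32_pins_to_stack",
}


def decode_setxfr(value):
    """Decode a 6-bit %MMM_PPP value into (mode_name, (high_pin, low_pin))."""
    mode = (value >> 3) & 0b111
    group = value & 0b111
    if mode < 0b010:                 # %00x = off
        return ("off", None)
    name = MODES[mode]
    if group >= 0b110:               # %11x = no pins
        return (name, None)
    if mode & 1:                     # odd modes are the 32-pin modes
        low = (group >> 1) * 32
        return (name, (low + 31, low))
    low = group * 16                 # even modes are the 16-pin modes
    return (name, (low + 15, low))


if __name__ == "__main__":
    print(decode_setxfr(0b100_000))  # 16_pins_to_QUADs on pins 15..0
    print(decode_setxfr(0b011_011))  # QUADs_to_32_pins on pins 63..32
```

Note from the table that in the 32-pin modes the lowest bit of %PPP is ignored, which is why the decoder shifts it away before computing the pin range.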
Thanks Chip! That looks like a very powerful feature!
I don't know why it's so hard for me to write documentation. This feature was pretty simple to write about after I tried a few different approaches. It had seemed complicated for months to me.
That seems like a powerful way of running code directly from external memory!
You write
"For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:"
But how do I know when the stack is filled?
Your code needs to turn XFR on and then turn it off. The idea is that you coordinate your code with XFR, so that it does something to facilitate some mutual goal. For example, in an SDRAM application, you would output a command sequence to the SDRAM, then turn on XFR at a certain cycle to do its thing, while you issue time-coordinated SDRAM commands and RDQUAD/WRQUAD instructions.
I guess as long as every 4th instruction read was a JMP #$-3, it would keep executing from the QUADs. Interesting idea.
I'm sort of thinking you might want to interleave code and data, something like this:
code
code
jump $-2
data
or
data
code
code
jump $-2
You could also do:
data
code
code
code
jump $-3
Then every 4th long is the data, and the jump happens while the data is loaded.
I also wonder how this might play into the pipeline and multi-tasking. Several of the instructions don't have effects for several clock cycles.
It would be interesting to work out how to adapt this for driving 18 & 24 bit digital LCDs, or other digital displays.
Potentially all the P2 would need would be the ability to read back a 'dot clock' pin to control the PTRx++ increments at the dotclock rate. The dot clock itself could be generated using a counter. Alternatively PTRx could be advanced manually (low-med resolution displays), or on an integer division of the system clock for faster displays.
It doesn't really matter that the QUAD<>PIN transfer executes every single clock. Just control the pointer increments to achieve the correct overall data rate.
You would just encode the sync pulses and DE signal inside the data stream, since there aren't very big porches driving LCDs anyway, so it wouldn't waste much ram.
From a timing point of view, would you execute the "SETXFR #0" on the same clock cycle as the "final" XFR transfer, or one cycle later? Put another way, does the execution of "SETXFR #0" affect the XFR operation on the same clock cycle?
It would take effect on the next cycle, so you would execute it on the final cycle of your intended operation.
Sorry, but I am still missing one thing in these instructions: a count of the bytes/words/longs transferred.
It's a function of time. If you set up for a word mode and let it run for 16 clocks, you would transfer 8 longs. For long modes, it's one long per clock. There are no byte modes.
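Put as arithmetic, using the clock patterns from the doc excerpt (a long every 2 clocks in the 16-pin modes, a long every clock in the 32-pin modes). The helper names below are made up for this sketch.

```python
def longs_transferred(clocks, pins):
    """Number of complete longs moved after XFR has run for `clocks` clocks."""
    if pins == 16:
        return clocks // 2   # low word, then high word, per long
    if pins == 32:
        return clocks        # one long per clock
    raise ValueError("XFR has only 16-pin and 32-pin modes; there are no byte modes")


def assemble_longs(words):
    """Pair 16-bit samples into longs the way the 16-pin input modes do:
    the first sample is the low word, the second is the high word."""
    longs = []
    for i in range(0, len(words) - 1, 2):
        low, high = words[i] & 0xFFFF, words[i + 1] & 0xFFFF
        longs.append((high << 16) | low)
    return longs


if __name__ == "__main__":
    print(longs_transferred(16, 16))                # word mode, 16 clocks
    print([hex(v) for v in assemble_longs([0x5678, 0x1234])])
```

So the transfer count is never reported by the hardware; the code knows it because it knows how many clocks elapsed between turning XFR on and turning it off.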
Thanks.
I see I need to wait for your demo programs to fully understand it.
http://www.micron.com/~/media/Documents/Products/Data%20Sheet/DRAM/256Mb_sdr.pdf
32MB SDRAMs from Winbond are only $2 in quantity, but their data sheet is difficult to read.
Great work as usual. XFR will be very nice to use. One question I had related to the documentation you updated...
I don't know how many ports your stack RAM has for all its connected client blocks. Does the quote above also mean that the stack RAM (aka CLUT) should not be read by the video generator engine at the same time it is being populated by the SDRAM operation that writes to the stack RAM (at different addresses)? Or does it just indicate that the PASM software running on the local COG should not try to access the stack RAM while the SDRAM streaming is happening?
The reason I ask is that it would be great to be able to simultaneously stream video out while reading more SDRAM graphics data into the stack RAM for achieving the high resolution graphics modes with a single COG. However if that is not possible I guess some alternating read scheme during hsync video line blanking or multiple COGs like we had on P1 would be required for higher bandwidth video applications using the SDRAM buffer. Or we could try to read SDRAM into the COG RAM QUADs while streaming video and then copy this data to the CLUT on the fly. Lots of extra copying overhead there unfortunately so I am hoping we can just write to CLUT using the SDRAM reads and then read this memory too with the video engine at the same time (at different addresses).
Thanks,
Roger.
Sorry. I should have elaborated a little. It just means that PASM code shouldn't read or write the stack RAM or modify SPA. The video generator reads data from a separate asynchronous port and it can stream data out at the same time as XFR is streaming it in.
I was really hoping that would be the case. That is excellent news.
Roger.