SDRAM caching - what's best?
cgracey
Posts: 14,151
What are your guys' thoughts on how the SDRAM-controller cog should handle other cogs' memory requests?
Do you think the protocol should be for a fixed cache line read/write of, say a quad, four quads, or maybe 16 quads?
What about single-byte writing? Is it important?
There may be value in fixing the transaction size. This would get rid of the UDQM/LDQM pins that mask bytes for writing. Also, the SDRAM really performs fast when doing reads and writes of many contiguous words. Every time you change addresses or switch directions, there is a several-clock penalty. It's most efficient doing string operations, not atomic ones. Atomic operations are sometimes necessary for the software, though.
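Just to make the options concrete, a client cog's view of a fixed four-quad (64-byte) line could look something like this rough C sketch - the mailbox layout and names are made up for illustration, not a real interface:

#include <stdint.h>

/* Hypothetical mailbox in hub RAM, one per client cog. The SDRAM cog
   polls 'cmd' and clears it when the burst is finished. */
#define LINE_BYTES 64                 /* four quads = 64 bytes per line */

typedef struct {
    volatile uint32_t cmd;            /* 0 = idle, 1 = read line, 2 = write line */
    volatile uint32_t sdram_addr;     /* line-aligned external address */
    uint8_t data[LINE_BYTES];         /* staging buffer in hub RAM */
} sdram_mailbox_t;

/* Client side: request one whole line and spin until the SDRAM cog is done. */
static void read_line(sdram_mailbox_t *mb, uint32_t addr)
{
    mb->sdram_addr = addr & ~(uint32_t)(LINE_BYTES - 1);
    mb->cmd = 1;
    while (mb->cmd != 0)
        ;                             /* cleared by the SDRAM cog */
}

A fixed line size keeps every SDRAM access a full-speed burst; anything smaller than a line would be handled in the staging buffer instead of with the byte-mask pins.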
Comments
Even more ambitious would be a scatter-gather scheme where you could pass a list of requests, each of which is in the form of the one I mentioned first.
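Something like this rough C sketch, say, with a zero count terminating the list (all names and field widths are hypothetical):

#include <stdint.h>

/* Hypothetical scatter-gather descriptor: the SDRAM cog walks the list,
   performing each burst in turn, until it hits a zero-count entry. */
typedef struct {
    uint32_t sdram_addr;   /* external address, line-aligned */
    uint32_t hub_addr;     /* hub RAM buffer to fill or drain */
    uint16_t count;        /* words in this burst; 0 terminates the list */
    uint16_t write;        /* 0 = read burst, 1 = write burst */
} sdram_req_t;

/* Example: gather three scattered 64-byte lines into one contiguous hub buffer. */
static const sdram_req_t reqs[] = {
    { 0x000100, 0x2000, 32, 0 },
    { 0x020100, 0x2040, 32, 0 },
    { 0x1F0000, 0x2080, 32, 0 },
    { 0,        0,       0, 0 },   /* terminator */
};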
Burst accesses are definitely preferred; especially long bursts when possible.
Can you post the documentation for the SETXFR instruction? And anything else relevant to it (i.e. configuring RAS, CAS latency, etc.)?
i.e. where it transfers to/from in the cog, whether it can span pages in the SDRAM, etc.
Edit:
Ideally with sample code for:
- reading a byte/word
- writing a byte/word
- reading N words as a burst
- writing N words as a burst
With the above samples, it will be possible to start experimenting with large bitmap drivers, xmm drivers...
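Until real samples exist, here's a rough C-level sketch of how those four operations might sit on top of a fixed-line driver like the one sketched above; in particular, with the UDQM/LDQM byte masks tied off, a single-byte write has to become a read-modify-write of a whole line. The function names are made up:

#include <stdint.h>

#define LINE_BYTES 64

/* Assumed burst primitives provided by the SDRAM cog (hypothetical names). */
void sdram_read_line(uint32_t addr, uint8_t *buf);
void sdram_write_line(uint32_t addr, const uint8_t *buf);

uint8_t sdram_read_byte(uint32_t addr)
{
    uint8_t line[LINE_BYTES];
    sdram_read_line(addr & ~(uint32_t)(LINE_BYTES - 1), line);
    return line[addr & (LINE_BYTES - 1)];
}

/* Without the byte masks, a byte write is a read-modify-write of its line. */
void sdram_write_byte(uint32_t addr, uint8_t value)
{
    uint8_t  line[LINE_BYTES];
    uint32_t base = addr & ~(uint32_t)(LINE_BYTES - 1);
    sdram_read_line(base, line);
    line[addr & (LINE_BYTES - 1)] = value;
    sdram_write_line(base, line);
}

/* Bursts just loop over whole lines (addr and bytes assumed line-aligned). */
void sdram_read_burst(uint32_t addr, uint8_t *dst, uint32_t bytes)
{
    for (uint32_t i = 0; i < bytes; i += LINE_BYTES)
        sdram_read_line(addr + i, dst + i);
}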
I've used 32 byte bursts for the SDRAM cache driver in P1 Propeller-GCC here:
https://code.google.com/p/propgcc/source/browse/loader/spin/sdram_cache.spin
That driver is for a byte-wide device that uses an address latch - address setup runs like a turtle, but data runs like a rabbit. The caching mechanism uses dirty bits and write-back methodology. It's possible to use bigger buffers with a better COG/HUB interface like in P2.
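For anyone who hasn't dug through that file, the core of a write-back scheme with dirty bits boils down to something like this simplified C sketch (not the actual code in sdram_cache.spin; the sizes, mapping, and names are illustrative only):

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 32              /* the P1 driver uses 32-byte bursts */
#define NUM_LINES  64              /* illustrative cache size: 64 x 32 = 2 KB */

typedef struct {
    uint32_t tag;                  /* external address of the cached line */
    bool     valid;
    bool     dirty;                /* set on write; flushed on eviction */
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Assumed burst primitives (hypothetical). */
void sdram_read_line(uint32_t addr, uint8_t *buf);
void sdram_write_line(uint32_t addr, const uint8_t *buf);

/* Direct-mapped lookup: write the old line back only if it is dirty. */
static cache_line_t *get_line(uint32_t addr)
{
    uint32_t base = addr & ~(uint32_t)(LINE_BYTES - 1);
    cache_line_t *line = &cache[(base / LINE_BYTES) % NUM_LINES];

    if (!line->valid || line->tag != base) {
        if (line->valid && line->dirty)
            sdram_write_line(line->tag, line->data);   /* write-back */
        sdram_read_line(base, line->data);
        line->tag   = base;
        line->valid = true;
        line->dirty = false;
    }
    return line;
}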
If we don't have to manipulate the QMs, that's great. The fewer transactions, the better.
Single byte access would be important without a cache. I don't care for all that setup just for doing one byte though.
It takes, if I recall right, something like 10 clocks to just write a single byte or word. Extra words take 1 clock each, though.
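Going by those recalled figures, the per-word cost drops off quickly with burst length - a quick back-of-the-envelope helper (the 10-clock setup number comes from memory above, not from the datasheet):

/* Approximate clocks for an N-word write: ~10 clocks for the first word,
   then 1 clock per additional word (figures recalled above). */
static unsigned burst_clocks(unsigned words)
{
    return words ? 10 + (words - 1) : 0;
}
/* 1 word -> 10 clocks, 16 words -> 25 clocks (~1.6/word), 64 words -> 73 (~1.1/word) */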
BTW, I'm working on the XFR docs right now, as we need them badly.
In the interim, here's the SDRAM data sheet for anyone who's interested:
http://www.winbond.com/NR/rdonlyres/16898431-2772-4CEA-8474-7E4AA855555F/0/W9825G6JH.pdf
It took me a few hours to get a handle on this the first time. I need to study it again, because I don't remember the specifics. It's no walk in the park.
It would be really nice to access SDRAM and have an interpreter running in a single COG. If that makes sense, then you need all kinds of accesses. It's software controlled anyway, so any access is possible, though with the penalty of being generic, of course. As Bill said, the QM bits are necessary for byte-wide access. When pulled down, they won't need to be diddled with for burst access - fortunately, the P2 has pull-up/down built in.
Some of us have discussed different methods of sharing cache space among COGs for quick multiple COG access. We don't have it all figured out yet, but having multiple COG access to memory without a lot of per-transaction blocking seems important enough for more serious thought.
I would recommend connecting all pins for now so we can experiment. A microSD socket is missing though.
In that case, I'd want the SDRAM driver to just do LMM memory calls and I'd use another cog and HUB for graphics memory...
Prop2_Docs.txt
Here's the XFR part:
Thanks Chip! That looks like a very powerful feature!
I don't know why it's so hard for me to write documentation. This feature was pretty simple to write about after I tried a few different approaches. It had seemed complicated for months to me.
That seems like a powerful way of running code directly from external memory!
You write:
"For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:"
But how do I know when the stack is filled?
Your code needs to turn XFR on and then turn it off. The idea is that you coordinate your code with XFR, so that it does something to facilitate some mutual goal. For example, in an SDRAM application, you would output a command sequence to the SDRAM, then turn on XFR at a certain cycle to do its thing, while you issue time-coordinated SDRAM commands and RDQUAD/WRQUAD instructions.
I guess as long as every 4th instruction read was a JMP #$-3, it would keep executing from the QUADs. Interesting idea.
I'm sort of thinking you might want to interleave code and data, something like this:
code
code
jump $-2
data
or
data
code
code
jump $-2
You could also do:
data
code
code
code
jump $-3
then every 4th long is the data and the jump happens while the data is loaded.
I also wonder how this might play into the pipeline and multi-tasking. Several of the instructions don't take effect for several clock cycles.
It would be interesting to work out how to adapt this for driving 18 & 24 bit digital LCDs, or other digital displays.
Potentially, all the P2 would need is the ability to read back a 'dot clock' pin to control the PTRx++ increments at the dot-clock rate. The dot clock itself could be generated using a counter. Alternatively, PTRx could be advanced manually (low/medium-resolution displays), or on an integer division of the system clock for faster displays.
It doesn't really matter that the QUAD<>PIN transfer executes every single clock. Just control the pointer increments to effect the correct overall data rate.
You would just encode the sync pulses and DE signal inside the data stream; the porches aren't very big when driving LCDs anyway, so it wouldn't waste much RAM.
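As a quick sanity check on that RAM cost (the panel timing here is just representative of a small TFT - real numbers come from the panel datasheet):

#include <stdio.h>

/* 800x480 active area with, say, ~40 blanking clocks per line and
   ~10 blanking lines per frame (assumed values for illustration). */
int main(void)
{
    unsigned active = 800 * 480;               /* pixels stored anyway      */
    unsigned total  = (800 + 40) * (480 + 10); /* entries incl. sync/porch  */
    printf("extra RAM: %.1f%%\n", 100.0 * (total - active) / active);  /* ~7.2% */
    return 0;
}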
From a timing point of view, would you execute the "SETXFR #0" on the same clock cycle as the "final" XFR transfer, or one cycle later? Put another way, does the execution of "SETXFR #0" affect the XFR operation on the same clock cycle?
It would take effect on the next cycle, so you would execute it on the final cycle of your intended operation.
Sorry -- but I'm still missing one parameter in these instructions:
the number of bytes/words/longs transferred.
It's a function of time. If you set up for a word mode and let it run for 16 clocks, you would transfer 8 longs. For long modes, it's one long per clock. There are no byte modes.
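So the count falls straight out of the mode and how long XFR is left running. As a quick sanity check (rates taken from the reply above: one word per clock in word modes, one long per clock in long modes):

/* Longs transferred after running XFR for 'clocks' cycles. */
static unsigned xfr_longs(unsigned clocks, int word_mode)
{
    return word_mode ? clocks / 2 : clocks;   /* 1 word/clock -> half a long/clock */
}
/* xfr_longs(16, 1) == 8 -- matches the 16-clock word-mode example above */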
Thanks.
I see I need to wait for your demo programs to fully understand it.
http://www.micron.com/~/media/Documents/Products/Data%20Sheet/DRAM/256Mb_sdr.pdf
32MB SDRAMs from Winbond are only $2 in quantity, but their data sheet is difficult to read.
Great work as usual. XFR will be very nice to use. One question I had related to the documentation you updated...
I don't know how many ports your stack RAM has for all its connected client blocks, so does the quote above also mean that the stack RAM (aka CLUT) should not be read by the video generator engine at the same time it is being populated by an SDRAM operation writing to the stack RAM (at different addresses)? Or does it just mean that the PASM software running on the local COG should not try to access the stack RAM while the SDRAM streaming is happening?
The reason I ask is that it would be great to be able to simultaneously stream video out while reading more SDRAM graphics data into the stack RAM for achieving the high resolution graphics modes with a single COG. However if that is not possible I guess some alternating read scheme during hsync video line blanking or multiple COGs like we had on P1 would be required for higher bandwidth video applications using the SDRAM buffer. Or we could try to read SDRAM into the COG RAM QUADs while streaming video and then copy this data to the CLUT on the fly. Lots of extra copying overhead there unfortunately so I am hoping we can just write to CLUT using the SDRAM reads and then read this memory too with the video engine at the same time (at different addresses).
Thanks,
Roger.
Sorry. I should have elaborated a little. It just means that PASM code shouldn't read or write the stack RAM or modify SPA. The video generator reads data from a separate asynchronous port and it can stream data out at the same time as XFR is streaming it in.
I was really hoping that would be the case. That is excellent news.
Roger.