Improving SD Card performance
It's been over a year since I looked at this and @evanh came up with a test program that I thought I would try out. It turns out it didn't work for me at first. I quickly discovered I was running out of memory and needed to make some changes to the code.
It also uncovered a bug in my SD card file support functions. I rewrote the flexspin file system functions to make some improvements and introduced a bug in the conversion process. It's one of those "if the function returns 0, is that true, and if it's not true, is it really true?" bugs.
It turns out that one of the nice things about the way flexspin implements SD card support is you can completely rewrite that whole thing and not mess with any of the existing code that is there. Kind of like a plug and play deal.
I also tried to add multi volume or drive support as well. This turns out to add a little bit of overhead with passing around the volume number, but I know somebody is going to want to attach more than one SD card to the P2 and want that function.
I also moved this code over to P2LLVM so that it would have SD card support. The low-level code is about the same but the upper level not so much.
In addition, it looks like FatFs came out with a small update that removed a lot of custom memory code and went back to using the standard library functions. For P2LLVM it was a cut and paste, and Flexspin needed a few more tweaks.
From the diagram, the filesystem program is the high-level C library functions, which are hooked by virtual function pointers except for the mount and directory functions. This system then uses FatFs to convert these library calls into calls to the diskio code, which uses the sd_mmc low-level code to actually read and write data on the SD card. So one could replace the sd_mmc code with, say, flash or USB drive code.
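To make the layering a bit more concrete, here is a rough host-side sketch of how a diskio-style layer can sit on a swappable low-level driver. None of these names come from flexspin, P2LLVM or FatFs: block_driver_t, blk_read, blk_write and the RAM-backed "driver" are all made up for illustration, and the real sd_mmc code talks to pins instead of an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define RAM_SECTORS 8u

/* Hypothetical low-level driver interface: the only layer that touches the medium. */
typedef struct {
    int (*read_sector)(uint32_t lba, uint8_t *buf);
    int (*write_sector)(uint32_t lba, const uint8_t *buf);
} block_driver_t;

/* Stand-in driver backed by RAM so the sketch runs on any host. */
static uint8_t ram_disk[RAM_SECTORS][SECTOR_SIZE];

static int ram_read(uint32_t lba, uint8_t *buf)
{
    if (lba >= RAM_SECTORS) return -1;
    memcpy(buf, ram_disk[lba], SECTOR_SIZE);
    return 0;
}

static int ram_write(uint32_t lba, const uint8_t *buf)
{
    if (lba >= RAM_SECTORS) return -1;
    memcpy(ram_disk[lba], buf, SECTOR_SIZE);
    return 0;
}

static const block_driver_t ram_driver = { ram_read, ram_write };

/* Swapping media (SD, flash, USB) would only mean pointing this at another table. */
static const block_driver_t *active_driver = &ram_driver;

/* diskio-style layer: knows sectors and counts, nothing about the hardware. */
int blk_read(uint8_t *buf, uint32_t lba, uint32_t count)
{
    assert(buf != NULL);
    for (uint32_t i = 0; i < count; i++)
        if (active_driver->read_sector(lba + i, buf + (size_t)i * SECTOR_SIZE) != 0)
            return -1;
    return 0;
}

int blk_write(const uint8_t *buf, uint32_t lba, uint32_t count)
{
    assert(buf != NULL);
    for (uint32_t i = 0; i < count; i++)
        if (active_driver->write_sector(lba + i, buf + (size_t)i * SECTOR_SIZE) != 0)
            return -1;
    return 0;
}
```

The point of the table is that everything above it stays untouched when the medium changes, which matches the "plug and play" feel described earlier.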
So all the performance functions lie in the low level driver sd_mmc.
Now back to running some performance tests to see how we did.
Here is a copy of the performance program that I used:
    #include <stdio.h>
    #include <propeller.h>
    #include <sys/vfs.h>

    uint32_t randfill(uint32_t *, size_t);
    int compare(uint32_t *, uint32_t *, size_t);

    #define PIN_SS 23
    #define PIN_MISO 20
    #define PIN_CLK 21
    #define PIN_MOSI 22

    uint32_t data1[25000];
    uint32_t data2[25000];

    int main(int argc, char** argv)
    {
        FILE *fh;
        uint32_t ticks;

        printf( " clkfreq = %d clkmode = 0x%x\n", _clockfreq(), _clockmode() );
        printf( " Randfill ticks = %d\n", randfill( data1, sizeof(data1) ) );
        printf( " Mounting: " );
        mount( "/sd", _vfs_open_sdcardx(PIN_CLK, PIN_SS, PIN_MOSI, PIN_MISO) );

        if( fh = fopen( "/sd/speed2.bin", "w" ) ) {
            ticks = _getms();
            fwrite( data1, 1, sizeof(data1), fh );
            fclose( fh );
            ticks = _getms() - ticks;
            printf( " Writing %u bytes at %u kB/s\n", sizeof(data1), (sizeof(data1) * 1000 / ticks + 512) >> 10 );
        } else
            printf( " SD card write error!\n" );

        if( fh = fopen( "/sd/speed2.bin", "r" ) ) {
            ticks = _getms();
            fread( data2, 1, sizeof(data2), fh );
            fclose( fh );
            ticks = _getms() - ticks;
            printf( " Reading %u bytes at %u kB/s\n", sizeof(data2), (sizeof(data2) * 1000 / ticks + 512) >> 10 );
            if( compare( data1, data2, sizeof(data2) ) )
                printf( " Matches! :)\n" );
            else
                printf( " Mis-matches! :(\n" );
        } else
            printf( " SD card read error!\n" );

        while (1) {
            _waitms(500);
        }
    }

    uint32_t randfill( uint32_t *addr, size_t size )
    {
        uint32_t ticks;

        size >>= 2;
        ticks = _cnt();
        do {
            *(addr++) = _rnd();
        } while( --size );
        return( _cnt() - ticks );
    }

    int compare( uint32_t *addr1, uint32_t *addr2, size_t size )
    {
        uint32_t pass = 1;

        size >>= 2;
        do {
            if( *(addr1++) != *(addr2++) )
                pass = 0;
        } while( --size );
        return( pass );
    }
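A side note on the rate expression in the test program: it works in bytes and milliseconds and then shifts right by 10, so the number printed as kB/s is really rounded KiB/s. Below is a small host-side sketch of the same arithmetic. The function name rate_kib_per_s is made up, and the 64-bit intermediate and the zero check are my additions, not part of the test program.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as (bytes * 1000 / ms + 512) >> 10 in the test program:
   bytes per second, rounded to the nearest multiple of 1024.
   The 64-bit intermediate is an addition so large files cannot overflow. */
uint32_t rate_kib_per_s(uint32_t bytes, uint32_t elapsed_ms)
{
    if (elapsed_ms == 0)
        return 0;                       /* nothing measurable, avoid divide by zero */

    uint64_t bytes_per_s = (uint64_t)bytes * 1000u / elapsed_ms;
    return (uint32_t)((bytes_per_s + 512u) >> 10);
}
```

For the 100000-byte buffer used here, an elapsed time of 136 ms gives 718. With a 32-bit size_t the original expression is fine for this buffer, but the multiply by 1000 would wrap once the transfer gets past roughly 4.29 MB.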
For these tests I used a standard Parallax SD card Adapter and some loose wires.
First test with the standard flexspin driver:
    Entering terminal mode. Press Ctrl-] to exit.
     clkfreq = 200000000 clkmode = 0x10009fb
     Randfill ticks = 225062
     Mounting: Writing 100000 bytes at 136 kB/s
     Reading 100000 bytes at 314 kB/s
     Matches! :)
Second test using my SD card functions changing one line of code:
    mount( "/sd", _vfs_open_sd(PIN_SS, PIN_CLK, PIN_MOSI, PIN_MISO) );

    Entering terminal mode. Press Ctrl-] to exit.
     clkfreq = 200000000 clkmode = 0x10009fb
     Randfill ticks = 225070
     Mounting: Writing 100000 bytes at 164 kB/s
     Reading 100000 bytes at 718 kB/s
     Matches! :)
Third test using P2LLVM. This required a few more changes:
    uint32_t randfill(uint32_t *, size_t);
    int compare(uint32_t *, uint32_t *, size_t);

    #define PIN_SS 23
    #define PIN_MISO 20
    #define PIN_CLK 21
    #define PIN_MOSI 22

    uint32_t data1[25000];
    uint32_t data2[25000];

    int main(int argc, char** argv)
    {
        FILE *fh;
        uint32_t ticks;

        printf( " clkfreq = %d clkmode = 0x%x\n", _clkfreq, _clkmode);
        printf( " Randfill ticks = %d\n", randfill( data1, sizeof(data1) ) );
        printf( " Mounting: " );
        sd_mount(0, PIN_SS, PIN_CLK, PIN_MOSI, PIN_MISO);

        if( (fh = fopen( "SD0:/speed2.bin", "w" )) > 0 ) {
            ticks = getms();
            fwrite( data1, 1, sizeof(data1), fh );
            fclose( fh );
            ticks = getms() - ticks;
            printf( " Writing %u bytes at %u kB/s\n", sizeof(data1), (sizeof(data1) * 1000 / ticks + 512) >> 10 );
        } else
            printf( " SD card write error!\n" );

        if( (fh = fopen( "SD0:/speed2.bin", "r" )) > 0 ) {
            ticks = getms();
            fread( data2, 1, sizeof(data2), fh );
            fclose( fh );
            ticks = getms() - ticks;
            printf( " Reading %u bytes at %u kB/s\n", sizeof(data2), (sizeof(data2) * 1000 / ticks + 512) >> 10 );
            if( compare( data1, data2, sizeof(data2) ) )
                printf( " Matches! :)\n" );
            else
                printf( " Mis-matches! :(\n" );
        } else
            printf( " SD card read error!\n" );

        while (1) {
            wait(500);
        }
    }

    uint32_t randfill( uint32_t *addr, size_t size )
    {
        uint32_t ticks;

        size >>= 2;
        ticks = _cnt();
        do {
            *(addr++) = rand();
        } while( --size );
        return( _cnt() - ticks );
    }

    int compare( uint32_t *addr1, uint32_t *addr2, size_t size )
    {
        uint32_t pass = 1;

        size >>= 2;
        do {
            if( *(addr1++) != *(addr2++) )
                pass = 0;
        } while( --size );
        return( pass );
    }

P2LLVM was set up to do multi-volume support, so the 0 is the volume number used. This also slowed performance some:

    Entering terminal mode. Press Ctrl-] to exit.
     clkfreq = 200000000 clkmode = 0x14cc8fb
     Randfill ticks = 6574977
     Mounting: Writing 100000 bytes at 149 kB/s
     Reading 100000 bytes at 794 kB/s
     Matches! :)
Here is the assembly code used to read the SD card and the internal assembly code:
    void ReceiveSD(char *buff, unsigned int bc)
    {
        char *b = buff;
        unsigned int c = bc;
        int m = Pins.MOSI;
        int n = Pins.MISO;
        int x = Pins.CLK;
        int i;
        int j;
        int v;

        __asm {
                    drvh    m
                    waitx   #8
                    mov     i, c
            loopi   mov     v, #0
                    mov     j, #8
            loopj   testp   n wc
                    drvh    x
                    rcl     v, #1
                    drvl    x
                    waitx   #6
                    djnz    j, #loopj
                    wrbyte  v, b
                    add     b, #1
                    djnz    i, #loopi
        }
    }

    /***********************/

    _ReceiveSD
            add     ptr__dat__, ##201926
            rdbyte  _var01, ptr__dat__
            add     ptr__dat__, #1
            rdbyte  _var02, ptr__dat__
            sub     ptr__dat__, #2
            rdbyte  _var03, ptr__dat__
            sub     ptr__dat__, ##201925
            zerox   _var03, #7
            drvh    _var01
            waitx   #8
    LR__0687
            mov     _var04, #0
            loc     pa, #(@LR__0690-@LR__0688)
            call    #FCACHE_LOAD_
    LR__0688
            rep     @LR__0691, #8
    LR__0689
            testp   _var02 wc
            drvh    _var03
            rcl     _var04, #1
            drvl    _var03
            waitx   #6
    LR__0690
    LR__0691
            wrbyte  _var04, arg01
            add     arg01, #1
            djnz    arg02, #LR__0687
    _ReceiveSD_ret
            ret
And this is P2LLVM code for the same:
    void ReceiveSD(int drive, BYTE *buff, unsigned int bc)
    {
        BYTE *b = buff;
        unsigned int c = bc;
        int m = (Pins[drive] >> 16) & 0xff;
        int n = (Pins[drive] >> 24) & 0xff;
        int x = (Pins[drive] >> 8) & 0xff;
        int i = 0;
        int j = 0;
        int v = 0;

        asm volatile ("drvh %[m]\n"
                      "waitx #8\n"
                      "mov %[i], %[c]\n"
                      ".li: mov %[v], #0\n"
                      "mov %[j], #8\n"
                      ".lj: testp %[n] wc\n"
                      "drvh %[x]\n"
                      "rcl %[v], #1\n"
                      "drvl %[x]\n"
                      "djnz %[j], #.lj\n"
                      "wrbyte %[v], %[b]\n"
                      "add %[b], #1\n"
                      "djnz %[i], #.li\n"
                      :[i]"+r"(i), [j]"+r"(j), [v]"+r"(v), [b]"+r"(b)
                      :[x]"r"(x), [n]"r"(n), [m]"r"(m), [c]"r"(c));
    }

    /***********************/

    00007468 <ReceiveSD>:
        7468: 28 02 64 fd    setq #1
        746c: 61 a1 67 fc    wrlong r0, ptra++
        7470: 28 08 64 fd    setq #4
        7474: 61 a7 67 fc    wrlong r3, ptra++
        7478: 02 a0 67 f0    shl r0, #2
        747c: 4f 02 00 ff    augs #591
        7480: 34 a7 07 f6    mov r3, #308
        7484: d0 a7 03 f1    add r3, r0
        7488: d3 a1 03 fb    rdlong r0, r3
        748c: d0 a7 03 f6    mov r3, r0
        7490: 08 a6 47 f0    shr r3, #8
        7494: ff a6 07 f5    and r3, #255
        7498: d0 a9 03 f6    mov r4, r0
        749c: 18 a8 47 f0    shr r4, #24
        74a0: 10 a0 47 f0    shr r0, #16
        74a4: ff a0 07 f5    and r0, #255
        74a8: 00 aa 07 f6    mov r5, #0
        74ac: d5 ad 03 f6    mov r6, r5
        74b0: d5 af 03 f6    mov r7, r5
        74b4: 59 a0 63 fd    drvh r0
        74b8: 1f 10 64 fd    waitx #8
        74bc: d2 ab 03 f6    mov r5, r2

    000074c0 <.li>:
        74c0: 00 ae 07 f6    mov r7, #0
        74c4: 08 ac 07 f6    mov r6, #8

    000074c8 <.lj>:
        74c8: 40 a8 73 fd    testp r4 wc
        74cc: 59 a6 63 fd    drvh r3
        74d0: 01 ae a7 f0    rcl r7, #1
        74d4: 58 a6 63 fd    drvl r3
        74d8: fb ad 6f fb    djnz r6, #-5
        74dc: d1 af 43 fc    wrbyte r7, r1
        74e0: 01 a2 07 f1    add r1, #1
        74e4: f6 ab 6f fb    djnz r5, #-10
        74e8: 28 08 64 fd    setq #4
        74ec: 5f a7 07 fb    rdlong r3, --ptra
        74f0: 28 02 64 fd    setq #1
        74f4: 5f a1 07 fb    rdlong r0, --ptra
        74f8: 2e 00 64 fd    reta

It looks like from the assembled code that flexspin is using FCACHE whereas P2LLVM is not, yet the speed number is higher for P2LLVM, which raises the question of whether there is any performance gain from using it. Also, my DJNZ loop was converted to a repeat (REP) instruction in flexspin.
Hopefully I didn't make too many mistakes in putting this together.
Mike
Comments
Ha, didn't notice you post this last night. Cool looking Edge Card carrier, BTW. I want to make something similar.
That WAITX #6 is basically negating the cogexec advantage. Have you tried my code again?
Your speeds are a little slow for the original FlexC code. I'm getting around 340 kB/s reading at 200 MHz sysclock.
Try this sdmm.cc with its debug code turned on. It might tell us a little more why.
@evanh ,
Those tests were done using an old 16 MB SD card I had lying around. Kind of a worst-case test.
Here is with a new 32 GB SanDisk card rated at 130 MB per second:
And here is with the new code:
And here is with P2LLVM:
Mike
PS: most of the slowdowns in FlexC are because of too long a delay before checking the ready state.
Where is that in the source? Found it: wait_ready() in sdmm.cc. select() hits it a lot!
EDIT: Testing that ... while there are some select() delays for block writes, for the block reading, select() is returning on first pass. 100% no pausing for wait_ready() through the whole file. And only one select() per file cluster too. So I don't see that as really impacting read performance.
Oh, now I see it. There's a duplicate of wait_ready()'s code inside of rcvr_datablock() ... which is used for every block read, one by one ... yep, that repeats ... at least one delay of 100 us preceding each block. More at cluster start.
Wow, you're totally right. Knocked the set 100 us delay down to 10 us which bumped me from 1560 kB/s up to 2250 kB/s, at 200 MHz sysclock.
EDIT: And 1 us delays at 360 MHz:
Read averaging 10.8 sysclock ticks per bit. 45 MHz SPI clock.
And the 200 MHz report using 1 us delays:
That delay is totally pointless, you can just read dummy bytes until you get a block start, no delay needed at all. Also, the only reason for a timeout IME is to handle sudden disconnect of the SD card or bung commands from higher up. A valid command will never time out.
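Here is a rough sketch of that idea: keep clocking in bytes until the single-block start token (0xFE) shows up, and only use the timeout as a safety net. This is not the sdmm.cc code. The spi_rx_fn and now_ms_fn hooks and the fake card are made up so it runs on a host, and treating any byte other than 0xFF or 0xFE as a failure is my assumption about what a caller would want.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DATA_START_TOKEN 0xFEu   /* start token for a single-block read in SD SPI mode */

/* Hypothetical hooks, passed in so the sketch stays hardware independent. */
typedef uint8_t  (*spi_rx_fn)(void *ctx);   /* clock in one byte */
typedef uint32_t (*now_ms_fn)(void *ctx);   /* free-running millisecond counter */

/* Returns 1 when the start token arrives, 0 on timeout or an unexpected byte.
   There is no fixed delay between polls, so a ready card is picked up at once. */
int wait_data_token(spi_rx_fn rx, now_ms_fn now, void *ctx, uint32_t timeout_ms)
{
    assert(rx != NULL && now != NULL);

    uint32_t start = now(ctx);
    do {
        uint8_t b = rx(ctx);
        if (b == DATA_START_TOKEN)
            return 1;
        if (b != 0xFFu)
            return 0;                        /* not idle and not a start token */
    } while ((uint32_t)(now(ctx) - start) < timeout_ms);

    return 0;                                /* card never answered */
}

/* Fake card for host testing: idles for idle_bytes reads, then sends the token. */
typedef struct {
    uint32_t idle_bytes;
    uint32_t reads;
    uint32_t clock_ms;
} fake_card_t;

uint8_t fake_rx(void *ctx)
{
    fake_card_t *c = (fake_card_t *)ctx;
    c->reads++;
    return (c->reads > c->idle_bytes) ? (uint8_t)DATA_START_TOKEN : (uint8_t)0xFFu;
}

uint32_t fake_now(void *ctx)
{
    fake_card_t *c = (fake_card_t *)ctx;
    return c->clock_ms++;                    /* advances 1 ms per call */
}
```

The unsigned subtraction keeps the timeout check correct even if the millisecond counter wraps.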
Hmm, I hadn't really thought about why it was there. I see Eric comments 500 ms and 100 ms timeouts. Which are now wildly missed with the much smaller 1 us delays.
Okay, reverted the delays back to 100 us for the old code ... but, in new code, replaced the ready loop with an actual timeout and no delays. eg:
@evanh ,
That's amazing. By using RDFAST and WRFAST you can double the input/output speed to the SD card. How can this be?
It doesn't work with my old 16Meg memory card though:
This can also mean that byte reads and writes are slow to hub memory.
So if I switch to writing 32 bits at a time I will see a speed increase?
Mike
Not just using the FIFO. It's tricky to figure out the I/O pin latencies and work with them so as to remove that WAITX from the inner loop.
There also needs to be enough lead time to allow for external latencies too. I make use of the fact that there is 8 sysclock ticks per bit to give me slack for this without needing to calculate a compensation based on sysclock frequency.
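For a feel of what a fixed number of sysclock ticks per bit means for the SPI clock, here is a tiny back-of-envelope sketch. The function names are made up, and the result is only an upper bound on raw transfer rate; it ignores command overhead, card latency and the filesystem.

```c
#include <assert.h>
#include <stdint.h>

/* SPI bit clock when every bit takes a fixed number of sysclock ticks. */
uint32_t spi_clock_hz(uint32_t sysclock_hz, uint32_t ticks_per_bit)
{
    if (ticks_per_bit == 0)
        return 0;                            /* guard against divide by zero */
    return sysclock_hz / ticks_per_bit;
}

/* Upper bound on raw bytes per second: eight bits per byte on a 1-bit bus. */
uint32_t peak_bytes_per_s(uint32_t sysclock_hz, uint32_t ticks_per_bit)
{
    return spi_clock_hz(sysclock_hz, ticks_per_bit) / 8u;
}
```

At 200 MHz and 8 ticks per bit that is a 25 MHz SPI clock and at most 3125000 bytes per second on the wire; 360 MHz at 8 ticks per bit gives 45 MHz.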
That's your slow card still working isn't it. That's normal. Slow SD cards are slow. EDIT: BTW, run it a second time and the write speed might be faster the second time.
I tried that at one stage but it failed badly. Didn't try to sort out why though.
You have to take care of the bit order - just shifting bits into a long does not suffice, you'll end up with the first byte received in the last byte of the long (a single MOVBYTS can fix this)
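To show the byte order problem in plain C (a host-side sketch, not the P2 assembly): shifting four received bytes MSB-first into a 32-bit value and then storing it little-endian puts the first received byte at the highest address. Reversing the bytes before the store fixes it, which is the job a byte-reversing MOVBYTS would do on the P2. All function names here are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shift four received bytes into one word, first byte ending up on top,
   the way a bit-serial receive loop accumulates them. */
uint32_t shift_in_word(const uint8_t rx[4])
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v = (v << 8) | rx[i];
    return v;
}

/* Reverse the four bytes of a word. */
uint32_t reverse_bytes(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

/* Store a word the way a little-endian machine writes a long to memory. */
void store_le32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v);
    dst[1] = (uint8_t)(v >> 8);
    dst[2] = (uint8_t)(v >> 16);
    dst[3] = (uint8_t)(v >> 24);
}
```

Storing shift_in_word() directly gives the four bytes back to front in memory; storing reverse_bytes(shift_in_word()) gives them in the order they arrived.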
Cool. Should try it again I guess ... yep, that worked. Gained another 10% extra read speed at 200 MHz. Maybe 7% gain for write speed.
Ada,
My quick hack there for 32-bit inner loop is split into two separate blobs of assembly with a simple logic decision as to which one gets used. I've now got a more robust logic solution that can use both paths but it also merges the two blobs into one. So I have my doubts as to its benefit because of the larger all-in-one blob size.
Do you know if Fcache always reloads into cogRAM/lutRAM or is there some smarts to reuse an already loaded blob?
EDIT: So far, testing isn't revealing any advantage either way. Larger blob doesn't seem to be any hindrance ...
EDIT2: Here's what I've got now:
As is, it always reloads, yes. But each additional instruction beyond the initial overhead just costs one cycle and you're saving some by doing the branch inside the FCACHE, so yeah, difference is minor.
Also, to make it not explode with FCACHE disabled, I think an FCACHE_SIZE preprocessor symbol could be introduced...
Thanks, hehe, I'm on to smartpins now. If this works out then it won't even need Fcache at all. Maybe all done in C too.
Oh, wow, SD's SPI protocol is sensitive to, even unclocked, post-command trailing level of DI. Not very SPI compliant at all. Makes it tricky to keep the tx smartpin doing the right thing.
The good news is it's now good enough to read and write blocks. Bad news is it corrupts the filesystem ...
EDIT: Huh, I may have found a bug in the disk_write() function. The optimiser may be reordering the multiple functions within an if() condition. Namely this:
I split them up to work out which one was giving an error and it started working ... err, well, still not getting 100% valid data read back yet but it is writing a complete file now without destroying the filesystem ... Update: One of my three uSD cards is passing the speed test with matching data ...
Well, I have created a PR to define __HAVE_FCACHE__ when FCACHE is available. That should solve the issue of making stuff work at -O0.
Thanks, that was a little hairy.
I think I've solved the smartpins implementation too. I changed it from SPI clock mode 0 to mode 3. The clock pin now idles high. All three cards are passing the speed test - only tested at 10 MHz sysclock ...
Aaaand it's on master. That was quick.
Got a problem detecting SPI mode on one card at high clock rate. It switches/detects and works first run, but rebooting the Prop2 with that SD card still powered in SPI mode then fails to detect. Amusingly it's the one that was happy in mode 0.
EDIT: Oh, I guess it should be doing those steps at a low clock rate ...
EDIT2: hmm, might be a latencies thing ... hmm, yeah, I guess I'm pushing the limits now ...
EDIT3: Grrr, there's a missing pull-up on the Eval Board. DO is meant to have one according to SD specs. Important for idle detection. Good news is I can substitute with the rx pin's drive controls ... doesn't seem to make it any faster though.
EDIT4: Oh, I wasn't expecting this:
EDIT5: Ah, here we go:
PS: Those results are 200 MHz and sysclock/4, ie: 50 MHz SPI clock. I was surprised it worked at all at sysclock/4. The SPI clock config to prevent glitches on the DI pin means I can't setup the rx smartpin to change phase like I would with a regular SPI memory.
Soldered up some wires for powering and programming an Edge Card. Plugged it all together and slotted a uSD card to see if the tests were any better there ... Nope.
First thing I managed to do was supply 24 Volts to the Edge Card. Didn't seem to hurt it luckily.
That's surprising. The Eval Board is faster than the Edge Card. Something is messing with data out from the SD card. Maybe reflections. Dunno, way too fast for the scope.
Ouch! And it's bad enough to mess with the optimised bit-bashed code I posted earlier too. Bugger, 200 MHz sysclock is all it can do on the Edge.
I'll be damned! The Sandisk Extreme is fine all the way up to max test frequency of 360 MHz! Drive strength clearly makes a difference.
I better try reading the SD standard to see what queries can be made for speed limits ...
PS: On the current state of the smartpins development: I've mostly solved the card detection issue on reboot. Just needed a small delay added after the first CMD0 to wait for switching out of transfer mode. It was one of those it-never-happened-with-debug-enabled situations, because the time required to report status was enough to make the SD card happy.
There is still an occasional issue after I've had massive errors due to bad timings and then need to power down to recover.
Doing some more clean up today ... and after discovering the above bit-bashing issue I figured it'd also be worth bringing that method up to speed with the other fixes I'd done while developing the smartpins method. And voila, it works fully on the Edge Card. I had to adjust rx phase to suit clock idle high instead of low, but not much else other than remove the smartpin init code.
So both methods are getting pretty robust. Oops, spoke too soon. Timings too tight in the new smartpins method ...
Update: Here's the latest code. It's set to bit-bashed. I'd like to get others testing the revised bit-bashed method.
Eliminated the need for Fcache - without clock stretching within a block transfer. And in the process found out I'd wrongly assumed SD was sensitive to trailing DI level when it's not.
So now there is a possibility of adjusting rx phase at faster clock rates and solidly hitting 50 MHz SPI clock. Or maybe not. It's way overkill for the size of RAM.
EDIT: Scratching my head lots on this tight timings issue. It looks like the main factor is actually the smartB clock input for both tx and rx smartpins. It seems that registering just the clock pin irons out a lot of inconsistencies. It also ups the lag effect out to +5 sysclock ticks on the tx smartpin. I'd previously avoided even trying it because of the extra lag.
Okay, this time. Smartpin solution is enabled again.
EDIT: Oops, left debug enabled. Fixed now.
Updated sdmm.cc: Some minor tweaks and comments added.
Updated speedtest.c: Removed the spin2 library code.
@evanh ,
Looking at the code I think it would be better to have two sets of code. There is a block read and write section that could have their own block read/write.
Leave the simple 1, 2 byte send/receive functions simple as not much speed can be gained there and build block read/write functions into the block code.
In the above code use block function to read N number of blocks.
Mike
@ersmith ,
I have made a lot of changes to your vfs_sdcard functions and moved 99% of the code you put into FatFs out. They have come out with a newer version, and updating to it only required copying in the new files and changing two or three lines of code in the FatFs system.
Along with Evan's changes we could have very efficient SD card support.
The only thing is the vfs_sdcard parameters seem to be a sticking point. On the P1 we always had to provide them, and I don't see a problem with passing them in. I know the P2 has the SD card mounted and would use the same pins, but passing them in seems simple enough.
I also added long file name support and date/time, file size, and entry type to the directory functions which aids in finding files on the SD card.
Let me know what you think.
Mike
I've not tried to redesign how Eric's done the SD protocol versus block level functionality. In fact, you're more than welcome to work on that side of the code.
BTW, Eric supports SDv1/MMC variable length blocks.
EDIT: On that note, I'm procrastinating on a project I started with the Prop1. Time to get back to it. I'm much more comfortable with the Prop2 but the Prop1 is more than capable for the job and the DIP40 package is surprisingly convenient for the layout.
I'd certainly be interested to see what you've done. The SD card driver can definitely use some improvements (and I hope to incorporate @evanh 's changes when they're ready, as well).
There's a vfs_sdcardx where the pin parameters are provided. Are there other parameters that we need? If so, we may also be able to provide them via mount() (I've recently changed the vfs code so the mount() parameter gets passed to an initialization routine in the VFS layer).
I think we'll want to leave the long file name support as optional (I think it can be enabled by a compile time define). Extending the directory functions is probably a good idea in principle, but we'll need to make sure the other file systems can support it.
Thanks,
Eric
Oh, that'd be useful indeed. readdir is a bit silly as-is because AFAICT you have to separately stat each file to figure out what the entry actually is.
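For reference, this is roughly what the "stat every entry" pattern looks like. It is a minimal sketch written against plain POSIX on a host and not checked against flexspin's library; count_subdirs is a made-up name.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Count the subdirectories of path. readdir() only hands back a name here,
   so every entry needs its own stat() call to learn what it is.
   Returns -1 if the directory cannot be opened. */
int count_subdirs(const char *path)
{
    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;                        /* skip self and parent links */

        char full[512];
        int n = snprintf(full, sizeof full, "%s/%s", path, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof full)
            continue;                        /* path too long for the buffer, skip it */

        struct stat st;
        if (stat(full, &st) == 0 && S_ISDIR(st.st_mode))
            count++;                         /* one extra lookup per entry */
    }
    closedir(d);
    return count;
}
```

If the directory functions returned the entry type and size along with the name, the stat() call and the path building would drop out of the loop entirely.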
Also, there's a weird issue I had to work around to get scanning subdirectories to work in megayume: The SD root directory can only be read if a trailing slash is given, but subdirectories only if it is not given.
Relatedly, I think the true root directory / can not be enumerated at all?
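A workaround for the trailing slash quirk could look something like the sketch below. It is only an illustration, not the megayume code: fix_dir_path is a made-up helper, and it assumes the caller knows the mount point (for example "/sd") and passes it in without a trailing slash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build the form of a directory path that the quirk described above accepts:
   the mount root gets a trailing slash, every other directory gets none.
   Returns 0 on success, -1 if out is too small. */
int fix_dir_path(const char *mount_root, const char *path, char *out, size_t out_size)
{
    size_t len = strlen(path);
    while (len > 1 && path[len - 1] == '/')
        len--;                               /* drop any trailing slashes */

    int is_root = (len == strlen(mount_root)) &&
                  (strncmp(path, mount_root, len) == 0);

    size_t need = len + (is_root ? 1u : 0u) + 1u;   /* optional '/' plus NUL */
    if (need > out_size)
        return -1;

    memcpy(out, path, len);
    if (is_root)
        out[len++] = '/';
    out[len] = '\0';
    return 0;
}
```

With "/sd" as the mount root, both "/sd" and "/sd/" come out as "/sd/", while "/sd/roms/" comes out as "/sd/roms".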