@evanh said:
Rayman,
Turn on SD_DEBUG_PERFORMANCE so we can see if any blocks are successful. It looks like nothing is working because it can't even read the MBR.
While you're at it, uncomment the three ACMD13 lines below in the driver:
// Support for caching is not yet implemented, but sounds a promising approach
send_acmd(13, 0, resp);
rx_datablocks(buff, 1, timeout, resp); // data length is 64 bytes, CRC will fail
// __builtin_printf(" ACMD13 - ");
// for( tmr = 0; tmr <= 63; tmr++ )
// __builtin_printf(" %02x", buff[tmr]);
__builtin_printf("Cache (A2 extension) supported = ");
That'll give us a little peek at read data content.
Or you could write a program to dump the content of some known data blocks like the MBR.
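A rough sketch of what such a dump program could look like, in C. It assumes you already have the 512-byte block in a buffer; how it gets read (whatever block-read call the driver exposes) isn't shown, and the function names here are made up for illustration. The offsets are the standard MBR layout: signature 0x55 0xAA at bytes 510 and 511, partition table at byte 446 with 16-byte entries and the type byte at +4.

```c
#include <stdint.h>
#include <stdio.h>

/* Print one 512-byte block as 32 rows of 16 hex bytes, each row
   prefixed by its byte offset. */
void dump_block(const uint8_t *blk)
{
    for (int row = 0; row < 512; row += 16) {
        printf("%03x:", row);
        for (int col = 0; col < 16; col++)
            printf(" %02x", blk[row + col]);
        printf("\n");
    }
}

/* An MBR ends with the signature bytes 0x55, 0xAA at offsets 510, 511.
   Returns 1 if the signature is present, 0 otherwise. */
int has_boot_signature(const uint8_t *blk)
{
    return blk[510] == 0x55 && blk[511] == 0xAA;
}

/* Type byte of primary partition entry n (0..3).
   0x0B and 0x0C mean FAT32; 0x07 is used for exFAT (and NTFS). */
uint8_t partition_type(const uint8_t *blk, int n)
{
    return blk[446 + 16 * n + 4];
}
```

If has_boot_signature() fails, the read itself is suspect. If it passes, partition_type(blk, 0) gives a quick hint about how the card was formatted.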
If we're suspicious of filesystem mismanagement then it's easy enough to swap to the old SPI driver and associated filesystem while still using the same card socket.
In the tester program, edit the mountsd() function. Comment out the three lines pertaining to _vfs_open_sdsdcard() and uncomment the two lines pertaining to _vfs_open_sdcardx().
You probably also need to edit the pin enums at the top for PIN_DI, PIN_DO and PIN_CS.
CS is same pin as DAT3
DO is same pin as DAT0
DI is same pin as CMD
It works for me now. My cards all work with both drivers. Reformatting isn't going to make it not work.
Give the SPI driver a run. It should work for you. EDIT: You could probably comment out all the speed tests except one. The SPI driver will be a lot slower. Wow, no, at 300 MHz sysclock, it's close at writes until the buffer size gets large.
In the past there has been a difference between the compiler on Windoze and the compiler on Linux. The bug was found by comparing the user compiled binaries and .lst files. ie: doing a binary compare of sdfat-speedtest.binary of both mine and yours.
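Any binary diff tool will do for that comparison. As a minimal sketch of the idea in plain C, something like the following reports the offset of the first byte where two files differ. The function name and return codes are my own invention, not part of any existing tool.

```c
#include <stdio.h>

/* Compare two files byte by byte.
   Returns the offset of the first differing byte (a length mismatch
   counts as a difference at the end of the shorter file),
   -1 if the files are identical, or -2 if either file can't be opened. */
long first_difference(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    long offset = 0;
    long result = -1;

    if (!fa || !fb) {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return -2;
    }
    for (;;) {
        int ca = fgetc(fa);
        int cb = fgetc(fb);
        if (ca != cb) {          /* mismatch, or one file ended early */
            result = offset;
            break;
        }
        if (ca == EOF)           /* both ended together: identical */
            break;
        offset++;
    }
    fclose(fa);
    fclose(fb);
    return result;
}
```

The offset can then be looked up in the .lst files to see which code the two compilers emitted differently.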
If the SPI driver doesn't help, okay, set up Ubuntu then. I'm on Kubuntu 24.04, btw.
A few months back I moved from Kubuntu 20.04. Interestingly, I installed as a minimal desktop, so lots of stuff doesn't get pre-installed then. A little bit surprisingly that includes no compilers. It wasn't any big deal though. Just had to add GCC and Make/Bison for building Flexspin. Don't remember having to add Git though. Maybe I added that earlier and forgot.
Hmm, well, no luck with CMD48. The A2 cards just aren't responding to it at all.
CMD48 issued while in Transfer State. Card isn't busy. Moving on to another command afterward is no problem.
EDIT: Err, it does mess up the subsequent command ...
EDIT2: Oh, now that's weird, I am getting something as long as I ignore the timeout on the command-response packet. I have no idea what it is yet:
EDIT3: That's the Sandisk. The second A2 card, the Samsung, is responding but isn't giving anything meaningful. Other cards timeout on the data block as well as the command-response.
Doh! There was a valid R1 response all along. I'd just bugged the check logic and didn't bother to verify it. I got the scope out this morning and did exactly that and only then realised my mistake. Was too tired again I guess.
Okay, I think I'm getting it slowly. The entirety of Extension Function #0 looks to be just a description of what is contained in the subsequent functions.
Assuming Functions 1 and 2 are always going to be the same predetermined Power Management Function (PMF) and Performance Enhancement Function (PEF) structures respectively, Function 0 can probably be ignored. Which could explain why the Samsung card hasn't filled out its Function 0.
All the extra pages per function look to be just that, spare storage for that Extension Function should it desire it.
Got it going I think. Am able to set it without error now. And I get a performance change from the Samsung card. Sadly, that change is for the worse. The Sandisk is unaffected performance-wise.
Samsung EVO 128 GB without the cache extension:
clkfreq = 360000000 clkmode = 0x10011fb
Filesystem = fatfs, Driver = sdsdcard
mount sd: OK
Buffer = 2 kB, Written 2048 kB at 693 kB/s, Verified, Read 2048 kB at 6370 kB/s
Buffer = 2 kB, Written 2048 kB at 808 kB/s, Verified, Read 2048 kB at 6848 kB/s
Buffer = 2 kB, Written 2048 kB at 733 kB/s, Verified, Read 2048 kB at 6655 kB/s
Buffer = 2 kB, Written 2048 kB at 718 kB/s, Verified, Read 2048 kB at 6429 kB/s
Buffer = 4 kB, Written 2048 kB at 1476 kB/s, Verified, Read 2048 kB at 11091 kB/s
Buffer = 4 kB, Written 2048 kB at 1481 kB/s, Verified, Read 2048 kB at 10820 kB/s
Buffer = 4 kB, Written 2048 kB at 1485 kB/s, Verified, Read 2048 kB at 10606 kB/s
Buffer = 4 kB, Written 2048 kB at 1483 kB/s, Verified, Read 2048 kB at 10256 kB/s
Buffer = 8 kB, Written 4096 kB at 2825 kB/s, Verified, Read 4096 kB at 16941 kB/s
Buffer = 8 kB, Written 4096 kB at 2523 kB/s, Verified, Read 4096 kB at 14640 kB/s
Buffer = 8 kB, Written 4096 kB at 2896 kB/s, Verified, Read 4096 kB at 15173 kB/s
Buffer = 8 kB, Written 4096 kB at 2591 kB/s, Verified, Read 4096 kB at 12146 kB/s
Buffer = 16 kB, Written 4096 kB at 5116 kB/s, Verified, Read 4096 kB at 25527 kB/s
Buffer = 16 kB, Written 4096 kB at 4422 kB/s, Verified, Read 4096 kB at 20703 kB/s
Buffer = 16 kB, Written 4096 kB at 4490 kB/s, Verified, Read 4096 kB at 20652 kB/s
Buffer = 16 kB, Written 4096 kB at 4425 kB/s, Verified, Read 4096 kB at 18552 kB/s
Samsung EVO 128 GB with the cache extension enabled:
clkfreq = 360000000 clkmode = 0x10011fb
Filesystem = fatfs, Driver = sdsdcard
mount sd: OK
Buffer = 2 kB, Written 2048 kB at 649 kB/s, Verified, Read 2048 kB at 6508 kB/s
Buffer = 2 kB, Written 2048 kB at 666 kB/s, Verified, Read 2048 kB at 6606 kB/s
Buffer = 2 kB, Written 2048 kB at 666 kB/s, Verified, Read 2048 kB at 6439 kB/s
Buffer = 2 kB, Written 2048 kB at 660 kB/s, Verified, Read 2048 kB at 6099 kB/s
Buffer = 4 kB, Written 2048 kB at 1268 kB/s, Verified, Read 2048 kB at 10637 kB/s
Buffer = 4 kB, Written 2048 kB at 1264 kB/s, Verified, Read 2048 kB at 10302 kB/s
Buffer = 4 kB, Written 2048 kB at 1263 kB/s, Verified, Read 2048 kB at 11865 kB/s
Buffer = 4 kB, Written 2048 kB at 1269 kB/s, Verified, Read 2048 kB at 11616 kB/s
Buffer = 8 kB, Written 4096 kB at 2466 kB/s, Verified, Read 4096 kB at 16237 kB/s
Buffer = 8 kB, Written 4096 kB at 2243 kB/s, Verified, Read 4096 kB at 13647 kB/s
Buffer = 8 kB, Written 4096 kB at 2455 kB/s, Verified, Read 4096 kB at 14132 kB/s
Buffer = 8 kB, Written 4096 kB at 2232 kB/s, Verified, Read 4096 kB at 17138 kB/s
Buffer = 16 kB, Written 4096 kB at 3926 kB/s, Verified, Read 4096 kB at 24097 kB/s
Buffer = 16 kB, Written 4096 kB at 3498 kB/s, Verified, Read 4096 kB at 19588 kB/s
Buffer = 16 kB, Written 4096 kB at 3529 kB/s, Verified, Read 4096 kB at 19268 kB/s
Buffer = 16 kB, Written 4096 kB at 3478 kB/s, Verified, Read 4096 kB at 17290 kB/s
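To put a number on the difference between the two runs above, here's a small sketch that averages the four 16 kB write figures from each log. The array and function names are just for illustration; the values are copied straight from the output.

```c
#include <stddef.h>

/* Arithmetic mean of n throughput samples (kB/s). Returns 0 for n == 0. */
double mean_kbps(const int *samples, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return n ? (double)sum / (double)n : 0.0;
}

/* 16 kB buffer write speeds, Samsung EVO 128 GB, from the two logs. */
static const int write16k_no_cache[4]   = { 5116, 4422, 4490, 4425 };
static const int write16k_with_cache[4] = { 3926, 3498, 3529, 3478 };
```

That works out to a mean of 4613.25 kB/s without the cache extension and 3607.75 kB/s with it, so about 22% slower on writes at this buffer size in these two runs.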
Huh, just found a bug in the DAT0 Busy waiting routine. Fixing this has restored the performance difference in the Samsung card. I'm not sure why it wasn't more of a problem generally to be honest.
The bug came from me recently removing the CMD7 SELECT that used to be embedded in that routine but I didn't add a replacement of continuous clocks during the waiting. It was in effect relying on whatever trailing clocks came off prior activity.
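A toy model of that kind of bug, in C. This is not the driver code. The fake card below is a deliberate simplification that assumes the card only updates DAT0 when it receives clock pulses, so polling without clocking never sees the busy state end. The real cards in this thread were slowed rather than stalled, presumably because some trailing clocks were still arriving. All names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a card: it holds DAT0 low (busy) until it
   has received a given number of clock pulses. */
struct fake_card {
    int clocks_until_ready;
};

static void clock_pulse(struct fake_card *c)
{
    if (c->clocks_until_ready > 0)
        c->clocks_until_ready--;
}

static bool dat0_high(const struct fake_card *c)
{
    return c->clocks_until_ready == 0;
}

/* Buggy variant: samples DAT0 repeatedly but never clocks the card. */
bool wait_ready_no_clocks(struct fake_card *c, int max_polls)
{
    for (int i = 0; i < max_polls; i++)
        if (dat0_high(c))
            return true;
    return false;
}

/* Fixed variant: issues a clock pulse between samples. */
bool wait_ready_clocked(struct fake_card *c, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (dat0_high(c))
            return true;
        clock_pulse(c);
    }
    return dat0_high(c);
}
```

In this model the unclocked wait only succeeds if the card was already ready on entry, which matches the description of relying on whatever clocks were left over from earlier activity.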
So both cards are unaffected in the end. They are both responding to the CMD48/CMD49 packets though. They appear to be engaging the cache feature. It just doesn't help with the way I'm using them.
@Wuerfel_21 said:
Is this with any manual cache flushing? That might really make it worse for small sizes.
I've verified, with the modified fwrite()/fread(), that fflush() is only called once at the fclose(). And the ioctl(SYNC)'ing is the only place where I have the card's cache flushed.
@Wuerfel_21 said:
Well that's a wash. Some performance enhancement that is. Though you found a bug, so that's good. But that implies that something did change. Maybe the first few sectors are accelerated and then it gets slower towards the end?
There's an extra step in the ioctl(SYNC) routine where it has to wait on the Busy both before the CMD49 and again after to ensure the flush is complete. Only one wait is needed without the CMD49.
With the bug there, the waiting was somehow slower but not stalled. Whereas without the CMD49 it wasn't slowed at all.
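The ordering described above can be sketched like this. It only counts the steps rather than talking to a card, and the struct and function names are placeholders, not the driver's actual routines.

```c
#include <stdbool.h>

/* Records what a sync did, so the sequence can be inspected. */
struct sync_trace {
    int busy_waits;   /* number of waits on DAT0 Busy */
    int flush_cmds;   /* number of cache flush requests sent */
};

/* Placeholder: in a real driver this would poll DAT0 until not busy. */
static void wait_not_busy(struct sync_trace *t)
{
    t->busy_waits++;
}

/* Placeholder for the CMD49 write that requests a cache flush. */
static void send_cache_flush(struct sync_trace *t)
{
    t->flush_cmds++;
}

/* Sync sequence: one Busy wait without the cache, two with it. */
void sync_card(struct sync_trace *t, bool cache_enabled)
{
    wait_not_busy(t);            /* let the preceding write finish */
    if (cache_enabled) {
        send_cache_flush(t);     /* ask the card to flush its cache */
        wait_not_busy(t);        /* wait until the flush is complete */
    }
}
```

With the cache enabled every sync pays for a second Busy wait, so any slowness in the wait routine is counted twice per sync.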
So @evanh have you managed to determine the source of all these various inter-cluster sector overheads when you timestamped them and which might be candidates for removal/optimization?
That FATFS stuff we found earlier related to avoiding cluster allocation during writes still has no effect? Was that because these APIs can't easily be accessed by your test application or some other reason? Unfortunately I'm only partially following this thread right now so don't have a lot of time to consider it all.
@evanh said:
Heh, no, I stopped looking at that when Ada gave me hope for ignoring it.
I'm guessing you may have to revisit this eventually if we want to get rid of those single sector accesses which seem to be killing streaming performance.
@evanh said:
The idea had been that the caching would make it all faster by eliminating the long Busy states. The small singles would be so fast they wouldn't matter much. Alas, that didn't pan out.
Plus those sorts of extra features are probably somewhat card dependent anyway. Caching may not help streaming writes much once the buffer fills up and you are still writing.
Comments
While you're at it, uncomment the three ACMD13 lines below in the driver:
That'll give us a little peek at read data content.
Here's an example of that output:
With the Mac-formatted drive, it seems it can read from a file named speed1.bin that I created on the disk:
So, appears it can read a file, just not write...
Or you could write a program to dump the content of some known data blocks like the MBR.
Ok, here's with the mac formatted disk:
Here's with Windows formatted disk:
What are your drives formatted with? Some kind of Linux?
Maybe this formatter will help?
https://www.sdcard.org/downloads/formatter/
That didn't work, but does give a different error message now:
If we're suspicious of filesystem mismanagement then it's easy enough to swap to the old SPI driver and associated filesystem while still using the same card socket.
In the tester program, edit the mountsd() function. Comment out the three lines pertaining to _vfs_open_sdsdcard() and uncomment the two lines pertaining to _vfs_open_sdcardx().
You probably also need to edit the pin enums at the top for PIN_DI, PIN_DO and PIN_CS.
CS is same pin as DAT3
DO is same pin as DAT0
DI is same pin as CMD
@evanh any way you could format a card with the sdcard.org program and see if it works?
It works for me now. My cards all work with both drivers. Reformatting isn't going to make it not work.
Give the SPI driver a run. It should work for you. EDIT: You could probably comment out all the speed tests except one. The SPI driver will be a lot slower. Wow, no, at 300 MHz sysclock, it's close at writes until the buffer size gets large.
Replace the enums with the following:
In the past there has been a difference between the compiler on Windoze and the compiler on Linux. The bug was found by comparing the user compiled binaries and .lst files. ie: doing a binary compare of sdfat-speedtest.binary of both mine and yours.
We'd need to align our source files first though.
I'll just set up a Linux box to reproduce your result. Think the flavor matters? Are you on Ubuntu?
No, test the SPI driver. That's easy to do.
If the SPI driver doesn't help, okay, set up Ubuntu then. I'm on Kubuntu 24.04, btw.
A few months back I moved from Kubuntu 20.04. Interestingly, I installed as a minimal desktop, so lots of stuff doesn't get pre-installed then. A little bit surprisingly that includes no compilers. It wasn't any big deal though. Just had to add GCC and Make/Bison for building Flexspin. Don't remember having to add Git though. Maybe I added that earlier and forgot.
Hmm, well, no luck with CMD48. The A2 cards just aren't responding to it at all.
CMD48 issued while in Transfer State. Card isn't busy. Moving on to another command afterward is no problem.
EDIT: Err, it does mess up the subsequent command ...
EDIT2: Oh, now that's weird, I am getting something as long as I ignore the timeout on the command-response packet. I have no idea what it is yet:
EDIT3: That's the Sandisk. The second A2 card, the Samsung, is responding but isn't giving anything meaningful. Other cards timeout on the data block as well as the command-response.
Doh! There was a valid R1 response all along. I'd just bugged the check logic and didn't bother to verify it. I got the scope out this morning and did exactly that and only then realised my mistake. Was too tired again I guess.
Rayman,
This program should be preconfigured to use the SPI driver with your card slot config.
Okay, I think I'm getting it slowly. The entirety of Extension Function #0 looks to be just a description of what is contained in the subsequent functions.
Assuming Functions 1 and 2 are always going to be the same predetermined Power Management Function (PMF) and Performance Enhancement Function (PEF) structures respectively, Function 0 can probably be ignored. Which could explain why the Samsung card hasn't filled out its Function 0.
All the extra pages per function look to be just that, spare storage for that Extension Function should it desire it.
Got it going I think. Am able to set it without error now. And I get a performance change from the Samsung card. Sadly, that change is for the worse. The Sandisk is unaffected performance-wise.
Samsung EVO 128 GB without the cache extension:
Samsung EVO 128 GB with the cache extension enabled:
Is this with any manual cache flushing? That might really make it worse for small sizes.
Huh, just found a bug in the DAT0 Busy waiting routine. Fixing this has restored the performance difference in the Samsung card. I'm not sure why it wasn't more of a problem generally to be honest.
The bug came from me recently removing the CMD7 SELECT that used to be embedded in that routine but I didn't add a replacement of continuous clocks during the waiting. It was in effect relying on whatever trailing clocks came off prior activity.
So both cards are unaffected in the end. They are both responding to the CMD48/CMD49 packets though. They appear to be engaging the cache feature. It just doesn't help with the way I'm using them.
I've verified, with the modified fwrite()/fread(), that fflush() is only called once at the fclose(). And the ioctl(SYNC)'ing is the only place where I have the card's cache flushed.
Well that's a wash. Some performance enhancement that is. Though you found a bug, so that's good. But that implies that something did change. Maybe the first few sectors are accelerated and then it gets slower towards the end?
There's an extra step in the ioctl(SYNC) routine where it has to wait on the Busy both before the CMD49 and again after to ensure the flush is complete. Only one wait is needed without the CMD49.
With the bug there, the waiting was somehow slower but not stalled. Whereas without the CMD49 it wasn't slowed at all.
@evanh Tried with regular uSD driver and it doesn't work. Investigating as to why...
It has to be FAT32. That formatter may default to ExFAT.
So @evanh have you managed to determine the source of all these various inter-cluster sector overheads when you timestamped them and which might be candidates for removal/optimization?
That FATFS stuff we found earlier related to avoiding cluster allocation during writes still has no effect? Was that because these APIs can't easily be accessed by your test application or some other reason? Unfortunately I'm only partially following this thread right now so don't have a lot of time to consider it all.
Heh, no, I stopped looking at that when Ada gave me hope for ignoring it.
I'm guessing you may have to revisit this eventually if we want to get rid of those single sector accesses which seem to be killing streaming performance.
The idea had been that the caching would make it all faster by eliminating the long Busy states. The small singles would be so fast they wouldn't matter much. Alas, that didn't pan out.
Plus those sorts of extra features are probably somewhat card dependent anyway. Caching may not help streaming writes much once the buffer fills up and you are still writing.