@evanh said:
EDIT: Taking a peek, it's quite the beast I see. Supports long file names, TRIMming, and ExFAT too.
The Unicode stuff seems overkill. That's going to just be for filenames.
The unicode stuff is only needed if LFN is enabled. You need it because LFN/ExFAT names are in UTF-16, so unusable for normal narrow-character code.
Any file/folder/github links to these FAT32 related files...? I wish I knew where you guys were looking.
UPDATE: Okay, I found filesys/fatfs/ff.c at least.
If there are optimizations that can be done that reduce all the single sector accesses between clusters that'd be nice. They may involve extra memory use if some sectors get cached. Whether that makes it into the full build of flexspin long term by default, not sure, but perhaps some special optimization switches/#defines could be created for enabling high(er) speed SD performance.
Looking at the code there is this FF_USE_FASTSEEK define that looks interesting. One path goes and reads the FAT while the other reads from a local CLMT table, which is probably a lot faster. Have you tried building with that @evanh?
#if FF_USE_FASTSEEK
    if (fp->cltbl) {
        clst = clmt_clust(fp, fp->fptr);        /* Get cluster# from the CLMT */
    } else
#endif
    {
        clst = get_fat(&fp->obj, fp->clust);    /* Follow cluster chain on the FAT */
    }
@Wuerfel_21 said:
The unicode stuff is only needed if LFN is enabled. You need it because LFN/ExFAT names are in UTF-16, so unusable for normal narrow-character code.
Damn, I never imagined LFN actually required Unicode, especially UTF-16. That's just perverse bloat for something like FAT.
@rogloh said:
Looking at the code there is this FF_USE_FASTSEEK define that looks interesting. One goes and reads FAT while one reads from a local CLMT table which is probably a lot faster. Have you tried building with that @evanh?
#if FF_USE_FASTSEEK
    if (fp->cltbl) {
        clst = clmt_clust(fp, fp->fptr);        /* Get cluster# from the CLMT */
    } else
#endif
    {
        clst = get_fat(&fp->obj, fp->clust);    /* Follow cluster chain on the FAT */
    }
Not making any difference to the sequence of blocks. But I do get three compiler warnings:
warning: Preprocessor warnings:
/home/evanh/hoard/coding/include/filesys/sdfatfs/ffconf.h:39: warning: The macro is redefined
#define FF_USE_FASTSEEK 0
from /home/evanh/hoard/coding/include/filesys/sdfatfs/ff.h: 29: #include "ffconf.h"
from /home/evanh/hoard/coding/include/filesys/sdfatfs/fatfs_vfs.c: 6: #include "ff.h"
previously macro "FF_USE_FASTSEEK" defined as: #define FF_USE_FASTSEEK 1 /* (predefined):0 */
warning: Preprocessor warnings:
/home/evanh/hoard/coding/include/filesys/sdfatfs/ffconf.h:39: warning: The macro is redefined
#define FF_USE_FASTSEEK 0
from /home/evanh/hoard/coding/include/filesys/sdfatfs/ff.h: 29: #include "ffconf.h"
from /home/evanh/hoard/coding/include/filesys/sdfatfs/ffunicode.c: 26: #include "ff.h"
from /home/evanh/hoard/coding/include/filesys/sdfatfs/fatfs.cc: 13: #include "ffunicode.c"
previously macro "FF_USE_FASTSEEK" defined as: #define FF_USE_FASTSEEK 1 /* (predefined):0 */
warning: Preprocessor warnings:
/home/evanh/hoard/coding/include/filesys/sdfatfs/ffconf.h:39: warning: The macro is redefined
#define FF_USE_FASTSEEK 0
from /home/evanh/hoard/coding/include/filesys/sdfatfs/ff.h: 29: #include "ffconf.h"
from /home/evanh/hoard/coding/include/filesys/sdfatfs/diskio.h: 5: #include "ff.h"
from /home/evanh/hoard/coding/include/filesys/sdfatfs/sdmm.cc: 25: #include "diskio.h"
previously macro "FF_USE_FASTSEEK" defined as: #define FF_USE_FASTSEEK 1 /* (predefined):0 */
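Reading those warnings, the build appears to predefine FF_USE_FASTSEEK as 1 on the command line while ffconf.h line 39 then redefines it to 0. One way to silence that (an untested guess, assuming ffconf.h is yours to edit) is to make the ffconf.h setting a default instead of an unconditional override:

```c
/* ffconf.h: only set a default if the build didn't already choose */
#ifndef FF_USE_FASTSEEK
#define FF_USE_FASTSEEK 0
#endif
```

That way a -DFF_USE_FASTSEEK=1 on the compiler command line wins cleanly, with no redefinition warning.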
Mentions it can't be used to expand the file size in fast-seek mode; however, this snippet looks interesting and may speed up new file writes, perhaps, if the final size is known in advance.
/* Cluster pre-allocation (to prevent buffer overrun on streaming write) */
    res = f_open(fp, recfile, FA_CREATE_NEW | FA_WRITE);  /* Create a file */
    res = f_lseek(fp, PRE_SIZE);     /* Expand file size (cluster pre-allocation) */
    if (res || f_tell(fp) != PRE_SIZE) ...  /* Check if the file has been expanded successfully */
    res = f_lseek(fp, OFS_DATA);     /* Record data stream free from cluster allocation delay */
    ...                              /* Write operation should be aligned to sector boundary to optimize the write throughput */
    res = f_truncate(fp);            /* Truncate unused area */
    res = f_lseek(fp, OFS_HEADER);   /* Set file header */
    ...
    res = f_close(fp);
web page:
"It can also be used to expand the file size (cluster pre-allocation)."
"can" being the operative word there. There's no indication that FlexC's FAT filesystem uses lseek() for preallocating a file. It just appends on the fly.
A search of the include files gets one hit other than the function itself - #define f_rewind(fp) f_lseek((fp), 0)
I think we're in for writing any speed-ups ourselves.
It's going to be major I think - not normal procedures. We want to be able to make multiple calls to fwrite() that produce only a single CMD25, assuming the file itself is made of consecutive clusters of course. Make the filesystem leave the SD card hanging until the user program says otherwise.
Putting it like that, it doesn't sound reasonable. Maybe there are other ways to get the SD card to have less busy time with short write bursts. A large buffer isn't very friendly to hubRAM. Besides, even a 256 kB buffer wasn't a fabulous result.
I guess one solution is to interface the filesystem to external RAM expansions, so it can be given a very large amount of memory to work with. Ditch the buffer idea and just tell it where all the data resides in one hit.
This would then allow the filesystem to be optimised around concatenating multiple consecutive cluster writes into one CMD25.
web page:
"It can also be used to expand the file size (cluster pre-allocation)."
"can" being the operative word there. There's no indication that FlexC's FAT filesystem uses lseek() for preallocating a file. It just appends on the fly.
A search of the include files gets one hit other than the function itself - #define f_rewind(fp) f_lseek((fp), 0)
I think we're in for writing any speed-ups ourselves.
Are these f_xxxx type API functions exposed to SPIN2 or is there another layer that hides this from us? The code in the sample seems to be what we'd code ourselves from SPIN2/FlexC applications.
Bah, need to stay focused on the inter-fwrite() single blocks!
Revised version of earlier sequence now shows presence of SYNC calls: ... WR2d940+40 WR84f WR7fcf RDf740 WRf740 WR801 SYNC RD84f WR2d980+40 ...
So it's reading back block $84f just after the SYNC, which means it's the start of the next fwrite().
Right, first thing is get those time stamps sorted. See how much time is going to the singles ...
@rogloh said:
Are these f_xxxx type API functions exposed to SPIN2 or is there another layer that hides this from us? The code in the sample seems to be what we'd code ourselves from SPIN2/FlexC applications.
Those will be a layer under I guess. But basically directly mapped to the standard C API.
These comments gave me hope it would speed up writes...
/* Record data stream free from cluster allocation delay */
/* Write operation should be aligned to sector boundary to optimize the write throughput */
There's a way to grab a reference to the underlying FF object from a VFS mount point. I did once figure out how to use this to convert between long and short file names (you'd need this if, e.g. you had a file browser program with LFN support but wanted to pass an ARGv to a program without (MegaYume etc)).
@evanh said:
Right, first thing is get those time stamps sorted. See how much time is going to the singles ...
I've added FlexC's microsecond counter onto each read/write/sync operation. These prints are at the start of each op. So the time stamp of the subsequent op tells you how long it takes.
Interesting. Wonder what it is doing between gaps in the multi-burst writes (not the FAT cluster accesses but just between clusters). ~9ms per 64 sectors is only 3.6MB/s, yet your raw write rate should be up to 10x faster.
Is something being copied here?
So, 64 block clusters are written in the range of 3 to 9.5 milliseconds. That's quite wide variability already. This is the Samsung EVO card. It always had erratic results, even in raw blocks.
Single reads are mostly 0.9 ms on their own but can be under half that on adjoining incremental single block reads. Which suggests the card is predicting it. Why the filesystem is even doing that is another question.
Single writes are 2 to 6 ms! So we definitely want to kill off as many of these single writes as we can.
@rogloh said:
Interesting. Wonder what it is doing between gaps in the multi-burst writes (not the FAT cluster accesses but just between clusters). ~9ms per 64 sectors is only 3.6MB/s, yet your raw write rate should be up to 10x faster.
Is something being copied here?
It's the SD card raising BUSY on the DAT0 pin. Every time we complete a CMD24 or CMD25 we're telling the card it can go away and do its housekeeping. So it does.
Some cards are quicker than others but none of them are great. Some UHS feature will make this less painful, I suspect. Maybe there are other solutions for notifying the cards of further write intent, I dunno.
PS: Each cluster is generating a CMD25. Modifying that to concatenate consecutive cluster writes, at the driver level, is what I hacked up yesterday.
@evanh said:
Single writes are 2 to 6 ms! So we definitely want to kill off as many of these single writes as we can.
The card has to guarantee that the write is actually committed when the busy signal stops (i.e. could rip it out of the socket immediately after and not lose data). There's a cache feature that can be enabled that allows the card to buffer writes, but it needs a special command to force flush the buffer. See section 4.17 in the SD spec. Should be available on all newer cards with A2 performance rating.
@Wuerfel_21 said:
@evanh said:
Single writes are 2 to 6 ms! So we definitely want to kill off as many of these single writes as we can.
The card has to guarantee that the write is actually committed when the busy signal stops (i.e. could rip it out of the socket immediately after). There's a cache feature that can be enabled that allows the card to buffer writes, but it needs a special command to force flush the buffer. See section 4.17 in the SD spec. Should be available on all newer cards with A2 performance rating.
Problem is those sort of features tend to need UHS engaged first - Which requires the Prop2 to perform 1.8 Volt signalling. Not that I've explicitly tried everything, so I could be surprised still.
It doesn't seem to say anywhere that UHS is required. It certainly doesn't work in SPI mode (I remember messing with it at some point...), which is documented.
A theoretical P2 die revision should include 1.8V I/O and hardware TERC4 encoding, headaches would be solved all around
These high-speed cards that are rated U3 might be worth trying if you've not already got one of those. They say they have a minimum sequential write speed of 30 MB/s; some are even higher V60/V90 video-rated cards. Whether you get this only with the lower-voltage UHS modes, though, I'm not sure - hopefully not. With any luck they wouldn't need to slow down as much between multi-sector bursts.
The video rating only applies to video recording mode, which is a special sauce feature. The thing you're looking for is the A class, which pertains to random read/write
Seeing this note: // output to red LED, used as CMD response shifter
not exactly sure what that last part means...
Ah, I see that's been left in the enums of the tester program. That comment is out of date. It applies to the development code only. It was written back when I only had the smartpins for card init. That smartpin doubled up as the CMD pin rx shifter via input redirect.
Whereas the driver code uses the streamer, start to end.
@Wuerfel_21 said:
It doesn't seem to say anywhere that UHS is required. It certainly doesn't work in SPI mode (I remember messing with it at some point...), which is documented.
I have 2 of 7 cards indicating support without UHS engaged. The Samsung EVO 128 GB, and the newer 64 GB Sandisk Extreme - which I've rarely posted about here since I've got files on it I didn't want to corrupt.
Both report "Extension" and "Queuing" Command Classes as supported in the CSD register. And both have the "Cache" bit (bit 330) set in the SD Status register. And both support the max queuing depth of 32.
Oh, ha, ffconf.h is where I'm meant to set the compile switches ... Still makes no difference.
It seems that to make use of this you need to do more in the code. See this page in particular:
http://elm-chan.org/fsw/ff/doc/lseek.html
Here's the performance with debug prints turned off. Surprisingly the same:
is pin_red required?
Can't seem to make it work... yet...
One problem is that my power pin works the opposite way: it has to be high to turn on power. I think I hacked that to be on, but it still doesn't work.
One note: I get "Mount OK", even if the uSD is not connected. Is that right?
Does that at all correlate to having an A2 logo on the card?
Bang on, yes: two A2, three A1, and two non-A.