New SD mode P2 accessory board

Wuerfel_21 · 2024-10-12 02:05

@evanh said:
EDIT: Taking a peek, it's quite the beast I see. Supports long file names, TRIMming, and ExFAT too.
The Unicode stuff seems overkill. That's going to just be for filenames.

The unicode stuff is only needed if LFN is enabled. You need it because LFN/ExFAT names are in UTF-16, so unusable for normal narrow-character code.

rogloh · 2024-10-12 02:26

Any file/folder/github links to these FAT32 related files...? I wish I knew where you guys were looking.
UPDATE : okay I found filesys/fatfs/ff.c at least

If there are optimizations that can be done that reduce all the single sector accesses between clusters that'd be nice. They may involve extra memory use if some sectors get cached. Whether that makes it into the full build of flexspin long term by default, not sure, but perhaps some special optimization switches/#defines could be created for enabling high(er) speed SD performance.

rogloh · 2024-10-12 02:44

Looking at the code there is this FF_USE_FASTSEEK define that looks interesting. One goes and reads FAT while one reads from a local CLMT table which is probably a lot faster. Have you tried building with that @evanh?

#if FF_USE_FASTSEEK
                        if (fp->cltbl) {
                            clst = clmt_clust(fp, fp->fptr);    /* Get cluster# from the CLMT */
                        } else
#endif      
                        {
                           clst = get_fat(&fp->obj, fp->clust);    /* Follow cluster chain on the      FAT */
                        }

evanh · 2024-10-12 04:27

@Wuerfel_21 said:
The unicode stuff is only needed if LFN is enabled. You need it because LFN/ExFAT names are in UTF-16, so unusable for normal narrow-character code.

Damn, I never imagined LFN actually required Unicode, especially UTF-16. That's just perverse bloat for something like FAT.

evanh · 2024-10-12 05:06

@rogloh said:
Looking at the code there is this FF_USE_FASTSEEK define that looks interesting. One goes and reads FAT while one reads from a local CLMT table which is probably a lot faster. Have you tried building with that @evanh?
#if FF_USE_FASTSEEK
                        if (fp->cltbl) {
                            clst = clmt_clust(fp, fp->fptr);    /* Get cluster# from the CLMT */
                        } else
#endif      
                        {
                           clst = get_fat(&fp->obj, fp->clust);    /* Follow cluster chain on the      FAT */
                        }

Not making any difference to the sequence of blocks. But I do get three compiler warnings:

warning: Preprocessor warnings:
/home/evanh/hoard/coding/include/filesys/sdfatfs/ffconf.h:39: warning: The macro is redefined
    #define FF_USE_FASTSEEK 0
    from /home/evanh/hoard/coding/include/filesys/sdfatfs/ff.h: 29:    #include "ffconf.h"
    from /home/evanh/hoard/coding/include/filesys/sdfatfs/fatfs_vfs.c: 6:    #include "ff.h"
    previously macro "FF_USE_FASTSEEK" defined as: #define FF_USE_FASTSEEK 1    /* (predefined):0       */

warning: Preprocessor warnings:
/home/evanh/hoard/coding/include/filesys/sdfatfs/ffconf.h:39: warning: The macro is redefined
    #define FF_USE_FASTSEEK 0
    from /home/evanh/hoard/coding/include/filesys/sdfatfs/ff.h: 29:    #include "ffconf.h"
    from /home/evanh/hoard/coding/include/filesys/sdfatfs/ffunicode.c: 26:    #include "ff.h"
    from /home/evanh/hoard/coding/include/filesys/sdfatfs/fatfs.cc: 13:    #include "ffunicode.c"
    previously macro "FF_USE_FASTSEEK" defined as: #define FF_USE_FASTSEEK 1    /* (predefined):0       */

warning: Preprocessor warnings:
/home/evanh/hoard/coding/include/filesys/sdfatfs/ffconf.h:39: warning: The macro is redefined
    #define FF_USE_FASTSEEK 0
    from /home/evanh/hoard/coding/include/filesys/sdfatfs/ff.h: 29:    #include "ffconf.h"
    from /home/evanh/hoard/coding/include/filesys/sdfatfs/diskio.h: 5:    #include "ff.h"
    from /home/evanh/hoard/coding/include/filesys/sdfatfs/sdmm.cc: 25:    #include "diskio.h"
    previously macro "FF_USE_FASTSEEK" defined as: #define FF_USE_FASTSEEK 1    /* (predefined):0       */

evanh · 2024-10-12 05:09

Oh, ha, ffconf.h is where I'm meant to set the compile switches ... Still makes no difference.

rogloh · 2024-10-12 07:26

@evanh said:
Oh, ha, ffconf.h is where I'm meant to set the compile switches ... Still makes no difference.

It seems for you to make use of this you need to do more in the code. See this page in particular:

http://elm-chan.org/fsw/ff/doc/lseek.html

Mentions it can't be used to expand the file size however this snippet looks interesting and may speed up new file writes perhaps if the final size is known in advance.

/* Cluster pre-allocation (to prevent buffer overrun on streaming write) */

    res = f_open(fp, recfile, FA_CREATE_NEW | FA_WRITE);   /* Create a file */

    res = f_lseek(fp, PRE_SIZE);             /* Expand file size (cluster pre-allocation) */
    if (res || f_tell(fp) != PRE_SIZE) ...   /* Check if the file has been expanded successfly */

    res = f_lseek(fp, OFS_DATA);             /* Record data stream with free from cluster allocation delay */
    ...                                      /* Write operation should be aligned to sector boundary to optimize the write throughput */

    res = f_truncate(fp);                    /* Truncate unused area */
    res = f_lseek(fp, OFS_HEADER);           /* Set file header */
    ...

    res = f_close(fp);

evanh · 2024-10-12 10:39

web page:
"It can also be used to expand the file size (cluster pre-allocation)."

"can" being the operative word there. There's no indication that FlexC's FAT filesystem uses lseek() for preallocating a file. It just appends on the fly.
A search of the include files gets one hit other than the function itself - #define f_rewind(fp) f_lseek((fp), 0)

I think we're in for writing any speed-ups ourselves.

evanh · 2024-10-12 11:16

It's going to be major I think - Not normal procedures. We want to be able to make multiple calls to fwrite() that produces only a single CMD25, assuming the file itself is made of consecutive clusters of course. Make the filesystem leave the SD card hanging until the user program says otherwise.

Putting it like that doesn't sound reasonable. Maybe there is other ways to get the SD card to have less busy time with short write bursts. A large buffer isn't very friendly to hubRAM. Besides, even a 256 kB buffer wasn't a fabulous result.

evanh · 2024-10-12 11:26

I guess one solution is interface the filesystem to external RAM expansions. So it can be given a very large amount of memory to work with. Ditch the buffer idea and just tell it where all the data resides in one hit.

This would then allow the filesystem to be optimised around concatenating multiple consecutive cluster writes into one CMD25.

rogloh · 2024-10-12 11:40

@evanh said:

web page:
"It can also be used to expand the file size (cluster pre-allocation)."

"can" being the operative word there. There's no indication that FlexC's FAT filesystem uses lseek() for preallocating a file. It just appends on the fly.
A search of the include files gets one hit other than the function itself - #define f_rewind(fp) f_lseek((fp), 0)

I think we're in for writing any speed-ups ourselves.

Are these f_xxxx type API functions exposed to SPIN2 or is there another layer that hides this from us? The code in the sample seems to be what we'd code ourselves from SPIN2/FlexC applications.

evanh · 2024-10-12 11:41

Bah, need to say focused on the inter-fwrite() single blocks!
Revised version of earlier sequence now shows presence of SYNC calls: ... WR2d940+40 WR84f WR7fcf RDf740 WRf740 WR801 SYNC RD84f WR2d980+40 ...

So it's reading back block $84f just after the SYNC, which means it's the start of the next fwrite().

Right, first thing is get those time stamps sorted. See how much time is going to the singles ...

evanh · 2024-10-12 11:43

@rogloh said:
Are these f_xxxx type API functions exposed to SPIN2 or is there another layer that hides this from us? The code in the sample seems to be what we'd code ourselves from SPIN2/FlexC applications.

Those will be a layer under I guess. But basically directly mapped to the standard C API.

rogloh · 2024-10-12 11:45

These comments gave me hope it would speed up writes...

/* Record data stream with free from cluster allocation delay /
/ Write operation should be aligned to sector boundary to optimize the write throughput */

Wuerfel_21 · 2024-10-12 11:53

There's a way to grab a reference to the underlying FF object from a VFS mount point. I did once figure out how to use this to convert between long and short file names (you'd need this if, e.g. you had a file browser program with LFN support but wanted to pass an ARGv to a program without (MegaYume etc)).

evanh · 2024-10-12 12:36

@evanh said:
Right, first thing is get those time stamps sorted. See how much time is going to the singles ...

I've added FlexC's microsecond counter onto each read/write/sync operation. These prints are at the start of each op. So the time stamp of the subsequent op tells you how long it takes.

 RD0 279714  RD800 280805  RD801 281071  RDf740 281371  WRf740 281662  RD84f 286027  WR84f 286921  WR7fcf 292358  RD85a 297372  WR85a 298216  WR7fda 303814  RDf740 308903  Buffer = 256 kB,  WRf740 309829  WR801 315372  SYNC 315463  RD84f 320527  WR2d780+40 321353  WR2d7c0+40 327681  WR2d800+40 336865  WR2d840+40 341514  WR2d880+40 347313  WR2d8c0+40 356460  WR2d900+40 361111  WR2d940+40 366908  WR84f 376062  WR7fcf 379953  RDf740 383913  WRf740 384772  WR801 390219  SYNC 390310  RD84f 395397  WR2d980+40 396229  WR2d9c0+40 402472  WR2da00+40 411888  WR2da40+40 416791  WR2da80+40 422504  WR2dac0+40 431979  WR2db00+40 436884  WR2db40+40 442600  WR84f 451968  WR7fcf 455757  RDf740 459560  WRf740 460415  WR801 464768  SYNC 464859  RD84f 468785  RD850 469771  RD851 470324  RD852 470878  RD853 471432  RD854 471987  RD855 472540  RD856 473094  RD857 473648  RD858 474202  RD859 474747  RD85a 475293  WR85a 475651  WR7fda 480203  RD84f 484236  WR43b80+40 485060  WR84f 489987  WR7fcf 495547  RD85a 501215  WR43bc0+40 502049  WR43c00+40 510966  WR43c40+40 513773  WR43c80+40 517390  WR43cc0+40 524707  WR43d00+40 527511  WR43d40+40 531127  WR85a 538456  WR7fda 540271  RDf740 542099  WRf740 542942  WR801 545186  SYNC 545277  RD85a 547079  WR43d80+40 547908  WR43dc0+40 550986  WR43e00+40 554687  WR43e40+40 561768  WR43e80+40 564316  WR43ec0+40 568014  WR43f00+40 575101  WR43f40+40 577651  WR85a 581322  WR7fda 585159  RDf740 591695  WRf740 592538  WR801 594627  SYNC 594718  Written 1024 kB at 3572 kB/s, Starttime=309758, Stoptime=596387

Here's the performance with debug prints turned off. Surprisingly the same:

 Buffer = 256 kB,  Written 1024 kB at 3581 kB/s ...

rogloh · 2024-10-12 12:54

Interesting. Wonder what it is doing between gaps in the multi-burst writes (not the FAT cluster accesses but just between clusters). ~9ms per 64 sectors is only 3.6MB/s, yet your raw write rate should be up to 10x faster.
Is something being copied here?

evanh · 2024-10-12 13:10

So, 64 block clusters are written in the range of 3 to 9.5 milliseconds. That's quite wide variability already. This is the Samsung EVO card. It always had erratic results, even in raw blocks.

Single reads are mostly 0.9 ms on their own but can be under half that on adjoining incremental single block reads. Which suggests the card is predicting it. Why the filesystem is even doing that is another question.

Single writes are 2 to 6 ms! So we definitely want to kill off as many of these single writes as we can.

evanh · 2024-10-12 13:12

@rogloh said:
Interesting. Wonder what it is doing between gaps in the multi-burst writes (not the FAT cluster accesses but just between clusters). ~9ms per 64 sectors is only 3.6MB/s, yet your raw write rate should be up to 10x faster.
Is something being copied here?

It's the SD card raising BUSY on DAT0 pin. Every time we complete a CMD24 or CMD25 is telling the card it can go away and do its housekeeping. So it does.

Some cards are quicker than others but none of them are great. Some UHS feature will make this less painful I suspect. Maybe there is other solutions to notifying the cards of further write intent, I dunno.

PS: Each cluster is generating a CMD25. Modifying that to concatenate consecutive cluster writes, at the driver level, is what I hacked up yesterday,

Wuerfel_21 · 2024-10-12 13:20

@evanh said:
Single writes are 2 to 6 ms! So we definitely want to kill off as many of these single writes as we can.

The card has to guarantee that the write is actually committed when the busy signal stops (i.e. could rip it out of the socket immediately after and not loose data). There's a cache feature that can be enabled that allows the card to buffer writes, but it needs a special command to force flush the buffer. See section 4.17 in the SD spec. Should be available on all newer cards with A2 performance rating.

evanh · 2024-10-12 13:22

@Wuerfel_21 said:

@evanh said:
Single writes are 2 to 6 ms! So we definitely want to kill off as many of these single writes as we can.

The card has to guarantee that the write is actually committed when the busy signal stops (i.e. could rip it out of the socket immediately after). There's a cache feature that can be enabled that allows the card to buffer writes, but it needs a special command to force flush the buffer. See section 4.17 in the SD spec. Should be available on all newer cards with A2 performance rating.

Problem is those sort of features tend to need UHS engaged first - Which requires the Prop2 to perform 1.8 Volt signalling. Not that I've explicitly tried everything, so I could be surprised still.

Wuerfel_21 · 2024-10-12 13:37

It doesn't seem to say anywhere that UHS is required. It certainly doesn't work in SPI mode (I remember messing with it at some point...), which is documented.

A theoretical P2 die revision should include 1.8V I/O and hardware TERC4 encoding, headaches would be solved all around

rogloh · 2024-10-12 13:48

These high speed cards that are rated U3 might be worth trying if you've not already got one of those. They say they have a minimum sequential write speed of 30MB/s, some are even higher V60/V90 video rated cards. Whether you get this only with the lower voltage UHS modes though not sure, hopefully not. With any luck they wouldn't need to slow as much between multi-sector bursts.

Wuerfel_21 · 2024-10-12 13:52

The video rating only applies to video recording mode, which is a special sauce feature. The thing you're looking for is the A class, which pertains to random read/write

Rayman · 2024-10-12 14:16

is pin_red required?

Seeing this note:
// output to red LED, used as CMD response shifter

not exactly sure what that last part means...

evanh · 2024-10-12 15:23

@Rayman said:
is pin_red required?

Seeing this note:
// output to red LED, used as CMD response shifter

not exactly sure what that last part means...

Ah, I see that's been left in the enums of the tester program. That comment is out of date. It applies to the development code only. It was written back when I only had the smartpins for card init. That smartpin doubled up as the CMD pin rx shifter via input redirect.

Whereas the driver code uses streamer, start to end.

Rayman · 2024-10-12 16:27

Can't seem to make it work... yet...

One problem is that my power pin works the opposite way, has to be high to turn on power. Think hacked that to be on, but still doesn't work.

One note: I get "Mount OK", even if the uSD is not connected. Is that right?

evanh · 2024-10-12 16:28

@Wuerfel_21 said:
It doesn't seem to say anywhere that UHS is required. It certainly doesn't work in SPI mode (I remember messing with it at some point...), which is documented.

I have 2 of 7 cards indicating support without UHS engaged. The Samsung EVO 128 GB, and the newer 64 GB Sandisk Extreme - which I've rarely posted about here since I've got files on it I didn't want to corrupt.

Both report "Extension" and "Queuing" Command Classes as supported in the CSD register. And both have the "Cache" bit330 set in SD Status register. And both support the max queuing depth of 32.

Wuerfel_21 · 2024-10-12 16:30

Does that at all correlate to having an A2 logo on the card?

evanh · 2024-10-12 16:39

@Wuerfel_21 said:
Does that at all correlate to having an A2 logo on the card?

Bang on, yes. two A2, three A1, and two non-A.

New SD mode P2 accessory board

Comments