Discussing SD Drivers (deprecated fsrw)

rogloh · 2021-10-30 01:46

@evanh said:
A small tweak to FSRW code and I've got pread() going almost as fast as FastBlocksRead(). pread() was doing a seemingly unneeded block copy for each and every read block.

Both methods are now heavily limited by a block request delay, imposed by the SD card, that I suspect always occurs when initiating a read. The seemingly obvious way to improve it is to change from using the single-block command, CMD17, to using the multi-block command, CMD18. I think that'll be next on my list of to-do's.

Multi-block will likely make a nice boost to sustained performance. It's definitely worth trying.

evanh · 2021-10-30 01:55

Certainly did, that's the 5 MB/s, quadrupled what I was getting with CMD17 at the same clock rate. But it was also a straight line test, effectively all blocks in one cluster.

Turning that into some sort of tight chain that can step through sparse clusters without slowing down might be quite tricky.

Here's where I got to: (oops) ... Smile, I've messed up ... lol, just a filename change and I wasn't reading the compiler error message and my script was downloading the old binary still.

[zip file with buggy pread() removed]

evanh · 2021-10-30 03:51

And a tidy up of current incarnation:

pub  readblock( bnum, haddr ) : result
'
'   Read a single block.  The "bnum" passed in is the
'   block number (blocks are 512 bytes); the "haddr" passed
'   in is the hubRAM address of the 512-byte block to fill with the data
'
    starttime := getct()
    cmd(17, doSDHC(bnum))

    rblocks( 1, haddr )

    return endcmd()


pub  readblocks( bnum, blocks, haddr ) : result
'
'   Read multiple consecutive blocks
'   "bnum" is the starting block number (blocks are 512 bytes)
'   "blocks" is total number of blocks to get from the SD card
'   "haddr" is the starting hubRAM address to fill with the data blocks
'
    starttime := getct()
    cmd(18, doSDHC(bnum))

    rblocks( blocks, haddr )

    readresp()          ' option 2 for Stop Transmission, CMD12
    cmd(12, 0)
    readresp()
    return endcmd()


pri  rblocks( blocks, haddr ) | ckpin, miso, inval, words
    ckpin := clk
    miso := do
    repeat
        inval := readresp() ' returns first byte of block packet, the block token, value $FE
        org
        wrfast  ##$8000_0000, haddr ' setup FIFO to write hubRAM
        drvl    ckpin           ' leading clock, to prime data pin
        mov words, #512/4       ' number of longwords in a block
        outh    ckpin
.loop
        rep @.rend32, #32       ' grab a longword chunk at a time
        outl    ckpin
        rcl inval, #1       ' msbit first - also first byte to the left, effectively loading as big-endian
        outh    ckpin
        testp   miso        wc  ' pick up data bit
.rend32
        rcl inval, #1
        movbyts inval, #%%0123      ' little-endian shuffle to retain SD byte order in hubRAM
        wflong  inval           ' write to hubRAM via FIFO
        djnz    words, #.loop   ' bit is pending on data pin, loop to clock subsequent bit, then pickup the pending bit

        rep @.rend15, #15       ' make up the dummy CRC clocks
        outl    ckpin
        nop
        outh    ckpin
        nop
.rend15
        end                 ' implicit RDFAST occurs here
        haddr += 512
    while --blocks

rogloh · 2021-10-30 05:07

@evanh said:
Certainly did, that's the 5 MB/s, quadrupled what I was getting with CMD17 at the same clock rate.

Good boost.

evanh · 2021-10-30 05:49

Yes, that's 43.75 MHz SPI clock. When at 10 MHz, the data rate is 1.15 MByte/s. Which looks proportionally nominal. But using CMD17, instead of CMD18, those numbers aren't so proportional, namely 1.1 MB/s and 0.8 MB/s respectively.

There is a transfer start delay when issuing a new command. With CMD18, you only issue it once and then begin receiving consecutive blocks with no further commands. So only the one delay.

Just have to identify each start of block in the bit stream - which has similarity to async serial, between blocks is continuous $FF until an $FE signifies next block start.

PS: While the number of bytes can be variable between blocks, the same is not true for bits in a byte. There is no byte framing, in SPI mode at least. CS stays low until the next command. It is vital that clock pulses are always tracked to match each byte for the entire multi-block transfer. Not even one extra clock pulse can be dangling.
Actually, no, that's only true when trying to treat everything as bytes, which is what the low-level code is currently written to do. But it could instead look for the bit-level transition from stop to start and just mark that as the first byte. Good idea Evan! Time to rewrite the readresp() routine.

evanh · 2021-10-30 08:00

Nice, got an improvement on CMD17 with that change. Here's the new block read code which no longer uses the readresp() routine:

pri  rblocks( blocks, haddr ) | ckpin, miso, inval, words
    ckpin := clk
    miso := do

    org
        drvl    ckpin           ' leading clock, to prime first stop bit
        wrfast  ##$8000_0000, haddr ' setup FIFO to write hubRAM
.blockloop
        mov words, #512/4       ' number of longwords in a block
.startloop
        outh    ckpin
        nop
        outl    ckpin
        testp   miso        wc  ' test for the start bit
    if_c    jmp #.startloop     ' loop if still stop bit
        outh    ckpin

.wordloop
        rep @.rend32, #32       ' grab a longword chunk at a time
        outl    ckpin
        rcl inval, #1       ' msbit first - also first byte to the left, effectively loading as big-endian
        outh    ckpin
        testp   miso        wc  ' pick up data bit
.rend32
        rcl inval, #1
        movbyts inval, #%%0123      ' little-endian shuffle to retain SD byte order in hubRAM
        wflong  inval           ' write to hubRAM via FIFO
        djnz    words, #.wordloop   ' bit is pending on data pin, loop to clock subsequent bit, then pickup the pending bit

        rep @.rend15, #16       ' make up the dummy CRC clocks + 1 for priming first stop bit into next block loop
        outl    ckpin
        nop
        outh    ckpin
        nop
.rend15
        djnz    blocks, #.blockloop
    end
        ' implicit RDFAST occurs here

It can have extra clocks now. But not less. It'll trip up on any stray low bit at the start, treating it as the start of the block.

EDIT: Revised code with better crafted integration of the start bit detection
EDIT2: A couple of new comments added
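
For readers following along without a P2 to hand, the start-bit hunt in the revised rblocks() can be modelled on a host machine. This is only a sketch in Python, not the driver itself: to_bits(), wire_stream() and read_blocks() are made-up names, the idle line is assumed to be all 1s, and the block token $FE is treated as seven more idle 1s followed by a single 0 start bit.

```python
def to_bits(data):
    """MSB-first bit list, the order in which SPI shifts bytes out."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]


def wire_stream(payloads, idle_gaps):
    """Hypothetical card output: idle 1s, $FE token, data, 16 CRC bits per block."""
    bits = []
    for payload, gap in zip(payloads, idle_gaps):
        bits += [1] * gap          # idle line; need not be a multiple of 8 clocks
        bits += to_bits(b"\xFE")   # block token: seven 1s then the 0 start bit
        bits += to_bits(payload)
        bits += [0] * 16           # dummy CRC, all zeros to show it is skipped by count
    return bits


def read_blocks(bits, nblocks, block_bytes=512, crc_bits=16):
    """Hunt for a 0 start bit, then take a fixed count of data and CRC bits."""
    it = iter(bits)
    blocks = []
    for _ in range(nblocks):
        for bit in it:             # skip stop bits (1s) until the start bit (0)
            if bit == 0:
                break
        else:
            raise ValueError("ran out of bits before a start bit")
        data = bytearray()
        for _ in range(block_bytes):
            byte = 0
            for _ in range(8):
                byte = (byte << 1) | next(it)
            data.append(byte)
        for _ in range(crc_bits):  # CRC is clocked out and discarded
            next(it)
        blocks.append(bytes(data))
    return blocks


payloads = [bytes(range(256)) * 2, bytes([0xA5]) * 512]
stream = wire_stream(payloads, idle_gaps=[13, 3])
print([len(block) for block in read_blocks(stream, 2)])
```

Extra idle clocks before a block only lengthen the hunt, which matches the remark that extra clocks are now tolerated. A stray 0 before the real start bit would be taken as the start of the block and shift everything after it.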

evanh · 2021-10-30 12:05

Next problem is, I just got round to changing the test program to reload the file multiple times, and this fails to reopen on first repeat. popen() returns error code -1

It fails the same way with all editions of low-level driver. I'm guessing FSRW is broken and likely needs some deep knowledge of FAT32 to repair. I wasn't really wanting to go in that direction.

Here's the test loop:

    repeat
        'Load bmp file
        send( 13,10,"Opening bitmap2.bmp file.. " )
        r:=sd.popen(string("bitmap2.bmp"),"r")
        if r<0
            send( "Cannot open file, error ",dec(r),13,10 )
            repeat
        send( "File Opened, " )
        s := sd.fsize() 'size of file
        send( "File Size = ",dec(s),13,10 )
        p := @BitmapFile'image'BitmapFile
        r := getct()

        sd.pread(p,s) 'read in file the regular, safe, way
'        sd.FastBlocksRead( (s+511)/512, p )  'read in assuming blocks are sequential

        dt := muldiv64( getct() - r, 1_000_000, clkfreq )
        send( "Read complete:   Load time [us] = ",dec(dt) )
        r := muldiv64( s, 953_674, dt )
        send( "   Speed [MB/s] = ",dec(r/1_000_000),".",dec(r//1_000_000) )
        send( "   Re-starting VGA",13,10 )

        coginit(1,@VgaOrigin, p)'@image)   'restart VGA driver to update the palette

        sd.pclose()
        waitms(100)
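
A note on the speed arithmetic in that loop, as a host-side sketch rather than anything from the driver. The constant 953_674 looks like 10^12 / 2^20 rounded down, so r ends up in millionths of a MiB/s (bytes per microsecond, scaled to 2^20-byte megabytes). The Python below mirrors the two muldiv64() steps with integer maths; speed_fields(), as_printed() and zero_padded() are made-up names, and it assumes dec() prints without leading zeros.

```python
def speed_fields(size_bytes, dt_us):
    """Same integer maths as muldiv64(s, 953_674, dt) followed by / and //."""
    r = size_bytes * 953_674 // dt_us      # millionths of a MiB/s
    return divmod(r, 1_000_000)            # (whole part, fractional part)


def as_printed(whole, frac):
    """What the loop would show if dec() does no zero padding (assumption)."""
    return f"{whole}.{frac}"


def zero_padded(whole, frac):
    """The same value with the fraction padded to six digits."""
    return f"{whole}.{frac:06d}"


whole, frac = speed_fields(1_048_576, 950_000)   # 1 MiB read in 0.95 s
print(as_printed(whole, frac), "vs", zero_padded(whole, frac))
```

If dec() really is unpadded, any speed whose fractional part starts with a zero will read higher than it is, so small differences between runs are worth checking against the raw microsecond figure.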

evanh · 2021-10-30 12:33

Oooooh, dang, it's a buffer overrun because the total read blocks is larger than the file - of course it is, I did that by removing the internal FSRW buffer, I did wonder what it achieved. Obvious now.

Hmm, not sure I want to deal with rewriting that to work without an internal buffer. Might just have to revert my change there.

evanh · 2021-10-30 13:21

Ugh, the existing code post-checks data validity after the block is read. I guess that figures, when the buffer can be dumped, partially dumped usually, it's no big deal. It's a lot of rewriting to do though. I'll revert ...

Back to using CMD17 one block at a time ... nets me 1.7 MB/s at 43.75 MHz SPI clock. Still an improvement on the earlier 1.1 MB/s.

Okay, so that 43.75 MHz SPI clock is from 350 MHz sysclock/8. Other examples: 250 MHz sysclock gives 31.25 MHz SPI clock and 1.47 MB/s, and 100 MHz sysclock gives 12.5 MHz SPI and 870 kB/s.
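
Those three data points can be turned into a rough per-block overhead for CMD17. The sketch below is a back-of-envelope estimate only: it assumes MB means 2^20 bytes (as the test loop's constant suggests), that 870 kB/s is 0.87 of that unit, and that a block costs exactly 4096 SPI clocks on the wire. per_block_overhead_us() is a made-up helper name.

```python
MIB = 2**20          # assumption: the quoted MB/s figures are 2^20-byte megabytes
BLOCK_BYTES = 512

# (sysclock in Hz, reported CMD17 throughput in MB/s) as quoted in the post
runs = [(350e6, 1.70), (250e6, 1.47), (100e6, 0.87)]


def per_block_overhead_us(sysclock_hz, rate_mb_s):
    """Time per block not accounted for by clocking 512 bytes over SPI."""
    spi_hz = sysclock_hz / 8                          # SPI clock is sysclock/8
    wire_us = BLOCK_BYTES * 8 / spi_hz * 1e6          # pure data transfer time
    total_us = BLOCK_BYTES / (rate_mb_s * MIB) * 1e6  # measured time per block
    return total_us - wire_us


for sysclock_hz, rate in runs:
    overhead = per_block_overhead_us(sysclock_hz, rate)
    print(f"{sysclock_hz / 1e6:.0f} MHz sysclock: about {overhead:.0f} us overhead per block")
```

Under those assumptions the overhead comes out at roughly 200 us per block at all three clocks, which fits the idea of a fixed request delay on every CMD17 rather than anything proportional to clock speed.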

[zip file with buggy fastblocksread() removed]

Rayman · 2021-10-30 14:10

The fsrw I worked on has a couple versions of multi block read.

One assumes the blocks are in order and is very fast…

evanh · 2021-10-30 14:23

Yep, I should have said that's still there. It works as long as you leave an extra 512 bytes spare in hubRAM beyond the file length. And it now uses CMD18 for much faster loading too.

The reverting I did was only for pread(). And to be fair pread() wasn't that much faster before I reverted. I missed saying that too. EDIT: 250 MHz sysclock, 31.25 MHz SPI was 1.53 MB/s, compared to now at 1.47 MB/s.

So choices are use pread() or your fastblocksread(). pread() doesn't need any extra reserve hubRAM space, fastblocksread() does need extra.

EDIT2: Grr, not quite yet, fastblocksread() needs some work to update the FAT info it seems. It can't reload a file even with decent extra space ...

EDIT3: Got it going with scatter gun approach - found by adding the line fclust := nextcluster() to fastblocksread(), it started working. That'll do me. 250 MHz sysclock, 31.25 MHz SPI using fastblocksread() gets 3.5 MB/s.

evanh · 2021-10-31 04:09

I'm pretty certain I'm mishandling CMD12 (Stop Transmission). The PDF warns of it being difficult. I'm seeing rapid fire problems when using CMD18+CMD12. The issue goes away when using a leading CMD23 (Set Block Count) in place of the trailing CMD12. I can also work around it by calling busy() after the CMD12. Adds about 400 us extra time to each call. I suspect what this is doing is reading out a whole block of zeros. Works because the block after the file is blank.

Problem with CMD23 is it didn't exist until SDHC SDXC (V3.0) spec release. Part of UHS extensions.

EDIT: Hmm, not sure. Testing isn't conclusive. It only happens if CMD18 (Read Multiple Block) is abused into being used for single block (using CMD12 to terminate each block) and then multiple consecutive blocks are read with it that way. Issue isn't apparent with rapid reloading of one file when loading whole file with a single CMD18+CMD12 pair.

I'll leave it as is. The apparent issue only came out of experimental stuff anyway.

EDIT2: Or not. The true test will be a fragmented file. I'd need to work out how clusters are managed in FSRW ...

EDIT3: Huh, I think it's fixed with a few extra clock pulses ... grr, same delay as using busy(). Some other step must be soaking up the extra block I guess.
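
For reference, the two ways of ending a multi-block read mentioned in this post differ only in which command frames go out and when. The Python sketch below builds the standard 6-byte SPI command frame (start bits plus index, 32-bit argument, CRC7 plus stop bit). sd_command() and crc7() are illustrative names, not FSRW routines, and the block number is passed straight through as an SDHC-style block address.

```python
def crc7(data):
    """CRC-7 over the bytes, polynomial x^7 + x^3 + 1, MSB first."""
    crc = 0
    for byte in data:
        for i in range(7, -1, -1):
            in_bit = (byte >> i) & 1
            top = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if in_bit ^ top:
                crc ^= 0x09
    return crc


def sd_command(index, arg):
    """6-byte frame: 01 + 6-bit index, 32-bit argument, CRC7 + stop bit."""
    body = bytes([0x40 | (index & 0x3F)]) + arg.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


first_block, count = 0x1234, 8   # made-up values for illustration

# Open-ended read: CMD18, clock in the blocks, then CMD12 to stop.
open_ended = [sd_command(18, first_block), sd_command(12, 0)]

# Pre-defined count: CMD23 first, then CMD18; no CMD12 needed afterwards.
predefined = [sd_command(23, count), sd_command(18, first_block)]

for frame in open_ended + predefined:
    print(frame.hex(" "))
```

In SPI mode the CRC byte is normally only checked on CMD0 and CMD8 unless CRC checking has been switched on, so a driver may well send a fixed dummy value instead.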

rogloh · 2021-11-01 03:34

It's amazing how involved SD card stuff can get... it's a pity it has to be so complex. Layers upon layers of added features over time etc.

I don't know the full history but when I read through the SD specs it always appeared to me that it was a proprietary solution that accidentally evolved to become a pseudo standard, and then SanDisk etc had to write some type of spec after the fact.

evanh · 2021-11-01 04:06

Yes, kind of looks like it was a little too open ended, and undefined, originally with many aspects locked down and some redefined with SDHC. The transition to ditching byte addressable storage seems sensible and in hindsight one wonders why it was supported originally. The whole variable length block thing must have been a nightmare for all parties. Especially given that anyone wanting it probably wanted to set the block length longer than 512 bytes max.

The resulting legacy of modes creates untold software possibilities to support everything. Designing host controllers must be painful work to meet spec.

evanh · 2021-11-01 07:33

Lol, there's bits reserved in the CSD card register for specifying what filesystem or partitioning is used. That's just dumb because it can't be relied on for compliance. It's like the do-not-track flag for web browsers. Only those that feel like it will pay any attention. And really, it's not the job of a driver to operate at that level. Too many layers is so true.

EDIT: And funnily, along with many other fields, in v2.0 (SDHC) that field is just a single bit with definition of "always zero".

EDIT2: I don't get it. In the SCR register there is a 4-bit field named SD_SPEC. With v1.0 cards it had value of 0, with v1.1 cards it had value of 1, with v2.0 cards it had value of 2 ... but with v3.0 and higher cards it stopped ticking up. I've got a v6.0 PDF here and all versions from 2.0 up are stuck at a value of 2.

EDIT3: Must be a selection of optional features in newer classes. So you have to check for availability of specific features.

evanh · 2021-11-01 08:37

Huh, CMD23 isn't listed as supported in the PDF for SPI mode. It certainly works for me though. EDIT: On that note, I'm planning on working on getting steady recording speeds too. For that, CMD20 is likely important. It also is not supported for SPI mode ... but maybe it'll work anyway.

They're both part of UHS class 1. The older equivalent SPI supported commands are CMD12 and ACMD23. To deal to CMD12's timings better I'll probably change to using the streamer. That way the cog can be more supervisory and able to intervene without impacting bulk speeds.

EDIT2: Right, yep, I better try getting CMD12 ironed out. The Adata card fails on CMD23 and it's UHS-I rated. I'd missed that before.

rogloh · 2021-11-01 09:25

It will be interesting to see if using the streamer helps or hinders the SD transfer stuff. I guess once a given sector starts streaming it should be nice and contiguous and the streamer will work well, but some further housekeeping will probably be needed for each new sector for delineation etc.

evanh · 2021-11-01 10:07

True, start bit detection is needed on each block. If that isn't handled inline, then software will have to post-process a buffer to stitch it up. Not going to be easy either way.

evanh · 2021-11-01 10:30

What the ...! The v4.1 (post-UHS-II) spec has mandatory extensions for SPI multi-block commands, namely the newly added CMD58 and CMD59. That seems a way-late addition. Ah, it's for something else, something called Extension Register Space:

The Extension Register space is independent to Memory Space accessed by CMD17/18 and CMD24/25.

Wuerfel_21 · 2021-11-01 22:03

Extension registers are where you enable funny features like cache (presumably write-behind?) and command queue.

Wuerfel_21 · 2021-11-01 22:07

Unrelatedly, has anyone noticed that there seem to exist "SD card on a chip" type devices? (not eMMC, that's a different beast) Well, I can only find one, XTSD01GLGEAG and its different capacity variants. Not sure how the cost compares to the 16MB SPI flash on Parallax's boards, but the price per bit is certainly lower. These might work really well as P2 boot memory.

evanh · 2021-11-02 01:45

Can't say I'd looked. If it were of value though it'd be in write speeds. SD cards have definitely been getting good at making flash memory faster to write to.

evanh · 2021-11-02 01:53

@Wuerfel_21 said:
Extension registers are where you enable funny features like cache (presumably write-behind?) and command queue.

Oh, that could be SLC or DRAM or even a smaller SRAM ... and so then doing something like using block move commands to direct the card where you want the written blocks relocated to in the bulk flash memory.

evanh · 2021-11-02 03:16

Anyone familiar with FSRW's FAT32 internals? Rayman's FastBlocksRead() doesn't update anything. And it only does a single linear burst, can't handle any fragmentation. I've got it happy enough to be reused now at least just by calling nextcluster(). But it really needs to also crawl the cluster chain to handle file fragmentation.

I've renamed it in my testing too:

pub  preadblocks( haddr, bcount ) : result   'RJA:  Assume file is sequential blocks
    sdspi.readblocks( datablock(), bcount, haddr )
    fclust := nextcluster()
    return bcount
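
On crawling the cluster chain: one way to keep CMD18 bursts while coping with fragmentation is to walk the FAT once and merge consecutive clusters into runs, then issue one multi-block read per run. The Python below is a host-side sketch of that idea only. The dict-based fat table, cluster_runs() and block_runs() are made up for illustration and are not FSRW internals; there is no protection against a corrupt, looping chain.

```python
END_OF_CHAIN = 0x0FFFFFF8   # FAT32 entries at or above this value end a chain


def cluster_runs(fat, first_cluster):
    """Walk the chain and merge consecutive clusters into (start, count) runs."""
    runs = []
    start = prev = first_cluster
    while True:
        nxt = fat[prev] & 0x0FFFFFFF       # top 4 bits of a FAT32 entry are reserved
        if nxt == prev + 1:                # still contiguous, extend the current run
            prev = nxt
            continue
        runs.append((start, prev - start + 1))
        if nxt >= END_OF_CHAIN:
            return runs
        start = prev = nxt                 # jump to the next fragment


def block_runs(fat, first_cluster, file_size, data_start, sectors_per_cluster):
    """Convert cluster runs to (first block, block count), trimmed to file size."""
    remaining = (file_size + 511) // 512
    out = []
    for cluster, count in cluster_runs(fat, first_cluster):
        if remaining <= 0:
            break
        n = min(count * sectors_per_cluster, remaining)
        # cluster numbering starts at 2 in the data area
        out.append((data_start + (cluster - 2) * sectors_per_cluster, n))
        remaining -= n
    return out


# Hypothetical file in two fragments: clusters 5..7, then 20..21
fat = {5: 6, 6: 7, 7: 20, 20: 21, 21: 0x0FFFFFFF}
for first_block, nblocks in block_runs(fat, 5, 18_000, data_start=1000, sectors_per_cluster=8):
    print(f"CMD18 from block {first_block} for {nblocks} blocks")
```

Each (first block, count) pair would map onto one sdspi.readblocks() call, so an unfragmented file still loads in a single burst and a fragmented one pays the command delay once per fragment.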

evanh · 2021-11-03 14:29

@evanh said:
True, start bit detection is needed on each block. If that isn't handled inline, then software will have to post-process a buffer to stitch it up. Not going to be easy either way.

One idea, I'm gonna write up here now because my attention is maybe redirected tomorrow, is to learn the gap length between CMD18 and the read data block, maybe even have it auto-learn during mount process, and then able to rapid fire off the right number of clock pulses after a CMD18 to acquire the block start token, verify it with a smartpin I guess, then engage the streamer for 512 byte block transfer.

Wuerfel_21 · 2021-11-03 16:00

@evanh said:

@evanh said:
True, start bit detection is needed on each block. If that isn't handled inline, then software will have to post-process a buffer to stitch it up. Not going to be easy either way.

One idea, I'm gonna write up here now because my attention is maybe redirected tomorrow, is to learn the gap length between CMD18 and the read data block, maybe even have it auto-learn during mount process, and then able to rapid fire off the right number of clock pulses after a CMD18 to acquire the block start token, verify it with a smartpin I guess, then engage the streamer for 512 byte block transfer.

Won't work, delay between command and data block is somewhat inscrutable (though in SPI mode it will be rounded such that the block is byte-aligned). It may be a particular value most of the time on a particular card, but it will spike when the card has to do some internal work (SanDisk cards in particular seem to go to sleep if they are deselected for more than ~5 ms (not sure) and the next command will take very long after that)

evanh · 2021-11-03 23:02

Damn, thanks for the heads-up. I suppose then the number of SPI clock pulses isn't relevant, it's more just the prep time the card requires. Given that, leaving the block setup as bit-bashing will be the best solution. After all, the main objective here is to get CMD12 timing into spec, which is that it has to be issued during a block transfer.

BTW: I've actually had success with CMD12 between blocks by raising CS immediately after CMD12 issued. The subsequent last block is still read in just fine and no extended delays induced. I've not investigated why though. It could be card dependent, although both my brands are handling it. Maybe it is acting the same way as an early CMD12 where the card is responding with a post-block error code as well. I wouldn't know because those are ignored with current coding.

Wuerfel_21 · 2021-11-03 23:09

Have you tried starting CMD12 while the card is sending the block CRC? That's kinda half during / half after, but seems like the ideal time if it works.

evanh · 2021-11-03 23:25

Didn't help with the combinations I tried. Didn't go to the effort of looking with the scope though. I've just started using a scope now but so far that has been confusing me more. I'm seeing oddball multiple data toggling from the SD card within one clock state. Not very SPI compliant would be fair to say.

rogloh · 2021-11-04 02:00

@evanh said:
I'm seeing oddball multiple data toggling from the SD card within one clock state.

That sounds very strange and I would not know why it would need to transition multiple times like that in a single bit period. Maybe it was tri-stating then, or very briefly going busy or handling some internal interrupt? Weirdness.
