In hindsight, there was an indicator of a stuck pin. But at the time I was thinking there were mistimed block reads, ie: that I'd just bumped into a hidden bug that previously wasn't apparent.
When I enable diagnostic printing I get something like this appearing at the end of the card init sequence. It is the only place during init where a data block read operation occurs:
But this is what was reported for the badly seated card - Bit1 is high in every nibble of the data block:
Needless to say, that old routine doesn't do the CRC check. Maybe that's a good thing for this testing.
I've had another go at graphing the lag map from https://forums.parallax.com/discussion/comment/1560644/#Comment_1560644
I wasn't happy with what a spreadsheet's graphs could achieve. I needed a multi-segmented line with gaps in it, so I've gone back to ASCII art form again:
As implied, this is mostly only of concern to transfers operating at sysclock/1, so it doesn't really belong in this SD topic. It's just something I've been wanting to do a better job of since the earliest days of the Prop2, when we first discovered the shifting I/O latencies as the clock ramps up. And SD cards have become my most studied subject.
The bands all move up and down with temperature change but they will track together relatively evenly through the temperatures. That's easily handled simply by recalibrating upon any error. What might be harder to solve (since there is no explicit hardware delay-line setting in the Prop2 pins) is selecting a universal track that applies to all memory chips.
As can be seen in my posted graphs, there are four pin mode settings that, in combination, provide an impressive range of adjustability to effect a settable delay. Roger has had success with HyperRAMs in the past simply by toggling pin registering of the data pins. And that indeed produces the biggest impact in my graphs too, which confirms he made a good choice.
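To be concrete about what "recalibrating upon any error" could look like, here's a minimal sketch - set_rx_timing() and test_block_read() are hypothetical stand-ins for the real pin mode writes and a known-pattern block read:

// Sweep the available timing combinations until a read passes.
// set_rx_timing() and test_block_read() are hypothetical stand-ins.
static int calibrate_rx(void)
{
    for (int regd = 0; regd < 2; regd++) {      // data pins registered or not
        for (int dly = 0; dly < 8; dly++) {     // added sysclock delays
            set_rx_timing(regd, dly);
            if (test_block_read() == 0)         // known-pattern read passes?
                return 0;                       // lock in this setting
        }
    }
    return -1;                                  // no working band found
}

A production version would want to pick the middle of the widest passing band rather than the first pass, but the shape is the same.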
From Roger's docs:
' Delay:
' This nibble is comprised of two fields which create the delay needed for Hyper memory reads.
' bits
' 15-13 = P2 clock delay cycles when waiting for valid data bus input to be returned
' 12 = set to 0 if bank uses registered data bus inputs for reads, or 1 for unregistered
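In C terms those two fields unpack as follows (matching the bit positions quoted above; the variable names are mine):

// Per the quoted doc: bits 15-13 and bit 12 of the bank config word
unsigned delay_cycles = (config >> 13) & 7;  // P2 clock delays before sampling
unsigned unregistered = (config >> 12) & 1;  // 1 = unregistered data bus inputs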
Trimmed down, tidied up, and selected for regular clock polarity - Idle low. It doesn't make much sense to be using both together. The best spread can also leave the clock pin unregistered.
And the inverted clock (Idle high) equivalent:
I've started the integration into Flexspin's sdmm.cc filesystem driver. In the process I've been reading the higher level functions of the existing SPI mode routines and I came across the use of ACMD23 for notifying the SD card of the number of blocks to be pre-erased before writing. I wasn't expecting it to be used ... so I set up a test in my existing tester program to compare using ACMD23 for lots of writes vs not using it at all.
Short story is it makes no difference to the SD card's performance. And I'm not surprised really, the cards will all be pre-erasing what they can anyway. They don't have to be told when to do it.
Using ACMD23 does of course add a little overhead, especially for the smaller sized writes. I won't be carrying it over to the SD mode version.
PS: I tested six different cards.
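For reference, the ACMD23 sequence under test is just a CMD55/CMD23 pair ahead of each multi-block write - roughly this shape, using generic command helpers rather than the actual tester code:

// Pre-erase hint before a multi-block write (illustrative sequence):
// ACMD23 = CMD55 (APP_CMD) followed by CMD23 (SET_WR_BLK_ERASE_COUNT)
if (send_cmd(CMD55, 0) <= 1 && send_cmd(CMD23, blockcount) == 0) {
    // card noted the pre-erase count
}
if (send_cmd(CMD25, sector) == 0) {
    // ... send blocks as usual (WRITE_MULTIPLE_BLOCK)
}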
Seems like good progress. It'll be great to have this in sdmm.cc. I'm edging closer towards needing to save at high speed onto uSD in 4-bit mode. Working on HDMI capture HW right now, which will certainly put it to the test once it's all proven and I get to move beyond single frame capture for bitstream TMDS analysis.
While making sense of the old select/deselect routines I realised I didn't need to test for Card Busy for every command. Only those that are attempting a data read/write need it. A whole timed polling loop has now been removed from in front of issuing each command. However, the check is/was pretty quick, so it makes no measurable performance difference.
EDIT: Ah, it does make a 1% speed-up of non-data commands.
EDIT2: Oh, it's better than that. I might need to fine tune the changes after some sleep.
PS: It has tidied up that leading code nicely. It was quite a spaghetti knot with conditional execution gymnastics to get fast fall through on both CMD pin and DAT0 checks. Now it checks only CMD pin.
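In outline, the change amounts to this (a sketch - the helper names are illustrative, not the actual routines):

// Only data-transfer commands wait on Card Busy (DAT0) now;
// plain commands go straight out on the CMD pin.
static int issue_cmd(int cmd, uint32_t arg, uint32_t timeout)
{
    if (is_data_cmd(cmd)                // CMD17/18/24/25 and friends
        && wait_while_busy(timeout))    // poll DAT0 until released
        return -1;                      // card stayed busy too long
    return send_cmd(cmd, arg);
}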
Nope, no real difference. As you've said before, Roger, the commands and their responses don't change with optimising. Same story for timeout handling really. Part of it is the command issuing code is already pure hubexec. It delivers streamer ops as "immediate" mode with no loops. It's already unrolled and doesn't even have Fcache overheads.
Optimising data throughput is where gains are made.
Yep. Just optimize where the code spends the majority of its time. Changing the rest will buy very little. See Amdahl's law.
I'm feeling happier with the look of the integration now anyways. Nothing's tested but here's an example:
Old SPI based block read interface in sdmm.cc:
New SD block read interface in sdmm.cc:
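Roughly, the shape change is this (an illustrative sketch, not the posted listings):

// Old SPI shape (illustrative): one command, then a per-block receive loop
if (send_cmd(CMD18, sector) == 0) {
    do {
        if (!rcvr_datablock(buff, 512))
            break;                      // token/CRC failure ends the run
        buff += 512;
    } while (--count);
    send_cmd(CMD12, 0);                 // STOP_TRANSMISSION
}

// New SD-mode shape (illustrative): the whole buffer clocked in one call
if (send_cmd(CMD18, sector) == 0) {
    res = rx_datablocks(buff, count, timeout);  // streamer runs all blocks
    send_cmd(CMD12, 0);
}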
rx_datablocks() does all the heavy lifting in one call. The SPI code, as well as a complicated send_cmd() sequence, has another layer that sequences each individual block. And when it was bit-bashed it also had another layer again that also managed sequencing of bytes.
Where I've used send_cmd() now, eg: for DESELECT, it is a less complicated form of the old send_cmd(). It consists primarily of a tx_command() + rx_response() pair. It's useful for general housekeeping but mainly to keep the long card init sequence clutter free.
PS: There are two other variants: send_acmd(), for ACMDs, and send_cmd_r2(), for commands that expect the longer R2 response. These reduce the complexity of send_cmd() and obviously have to be used selectively for those circumstances.
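Their relationship, in outline (a sketch; argument lists are illustrative):

// send_cmd(): the general-purpose form - command out, R1-class response back
static int send_cmd(int cmd, uint32_t arg)
{
    tx_command(cmd, arg);           // 48-bit command on the CMD line
    return rx_response();           // R1-class response, -1 on no response
}

// send_acmd(): CMD55 prefix selects the application command space
static int send_acmd(int cmd, uint32_t arg)
{
    if (send_cmd(CMD55, rca) < 0)
        return -1;
    return send_cmd(cmd, arg);
}
// send_cmd_r2() differs only in clocking in the long 136-bit R2
// response (CID/CSD) instead of a 48-bit R1.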
The content of R1 responses can be ignored generally. As long as there is a response then all is well. If the sequence gets messed up then usually the card will simply fail to respond to an ill-sequenced command. Then it would be worth inspecting the last response or issuing a generic CMD13 to get a refresh.
So while I've found that performing CRC on data blocks is important, it's not important to do the same for command responses unless the content of the response is actually being utilised. This is how I've been operating all along, but with the polishing for production I figured I'd better give it some thought.
It has a bearing on the return codes too. Ignoring the CRC means there is only the no-response failure condition. It makes the logic less complicated. Not unlike the removal of Card Busy check before issuing every command.
Damn, I'd forgotten I still didn't have a solution for handling the CRC of the final block in the buffer. I'd been doing all the testing with an oversized buffer so that the CRC nibbles would fit without any exception handling. Today I've been going around in circles trying to insert the exception, making a mess and having to revert. The problem is rx_datablocks() is cumbersome and I've already made heavy use of conditional execution ... I'm going to be de-optimising soon I suspect.
"Premature optimization is the root of all evil."
I got it going without throwing the toys out. Overheads went from 7.9% to 10.1% for a 16 kB buffer and sysclock/4. This is primarily because the final CRC check can only be done serially. Pushing the buffer size to 64 kB helps it back down to 8.5%.
PS: Sysclock/4 tells me about the non-crc-loop overheads. At sysclock/3 there is a little bleeding of parallel'd crc-loop-time.
PPS: There are three annoying instructions I had to add inside the block loop, which added a touch more overhead.
// no room in the buffer for final CRC-16 nibbles, to be handled later
cmp blocks, #2 wc // last block?
if_c sub clocks, #16+2 // stop the clocks before the CRC-16 nibbles
if_c sub r_mdat, #16 // truncate DMA to end of buffer
Here's how I handle having the CRC nibbles for each block located in the main buffer.
modz _clr wz // clear Z flag to create a one-shot for first block
nextblk
if_z setq #512/4+16/8-1 // one SD block + CRC
if_z rdlong 126, buf // fast copy to cogRAM before engaging the FIFO, cogRAM buffer address = 126
if_z add buf, nextblkoffset
wrfast lnco, buf // setup FIFO for streamer use
So it cleanly copies the latest block+CRC into cogRAM just before reinitialising the FIFO to overwrite those CRC nibbles. Once the streamer is kicked off, it then runs the CRC code over the data stashed in cogRAM.
And the final check is done afterward, so it can be copied and called immediately:
if_nc_and_z setq #512/4-1 // 512 bytes, one SD block, 4 bytes per longword
if_nc_and_z rdlong 126, buf // fast copy to cogRAM, cogRAM buffer address = 126
if_nc_and_z setq #16/8-1 // 16 CRC-16 nibbles, 8 nibbles per longword
if_nc_and_z rdlong 126+512/4, pb
// do the check
if_nc_and_z call #crc_check
PS: The cogRAM address of 126 is arbitrary. I chose it to give my Fcache'd code enough room to grow in front without running into the data, but still fits within first half of cogRAM space.
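For reference, the CRC in question is the standard CRC-16-CCITT (polynomial x^16 + x^12 + x^5 + 1, zero initial value) that SD computes per DAT line - in 4-bit mode each line gets its own CRC16 over the bits it carried. A plain serial version (mine, not from the driver) is handy for cross-checking the optimised PASM:

// Reference CRC-16-CCITT as used on each SD DAT line (poly 0x1021, init 0)
#include <stdint.h>
static uint16_t crc16_sd(const uint8_t *data, int len)
{
    uint16_t crc = 0;
    while (len--) {
        crc ^= (uint16_t)*data++ << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (crc << 1) ^ 0x1021 : (crc << 1);
    }
    return crc;
}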
So your driver code is going to be inline PASM2 and Fcache'd in on demand when sectors are read or written. No additional driver cogs needed and file transfer ops done via existing C lib function access from Flexspin, right? Is this the long term plan? What would you do for functions that can't be exposed via standard stuff? Like enabling/disabling read CRCs? Is this something you'll have a low level API that can be called from the spin2 layer to turn on/off?
Some other stuff may be handy to include like card reporting, and hotswap insert/removal, LED control etc.
At the moment, there is a bunch of compile switches. They can be toggled in the compile command with -D options. Or, in my case, I'm editing the start of the driver source file to toggle them.
@rogloh said:
So your driver code is going to be inline PASM2 and Fcache'd in on demand when sectors are read or written. No additional driver cogs needed and file transfer ops done via existing C lib function access from Flexspin, right?
Right.
I suppose, once it's running smoothly in FlexC, we can work on porting to Spin2 and making it for PNut. Might want to also see if Chip wants a standardised file handling API of some sort.
Sounds promising. Having an easy to use Spin2 wrapper object, and/or having it built into the underlying file system stuff that flexspin already contains, would be great. Once I do this new interface board for the HDMI video grab stuff I'll be hoping to start looking at doing 4-bit SD transfers and see what performance the P2 can achieve with fast SD cards.
There is the "ioctl" API. I have no idea how to find/predict its features but things like card presence should be doable via it. I note the existing SPI driver is reporting the card capacity this way. It queries the CSD register (CMD9), picks out the rated capacity, and converts that to a block count.
case GET_SECTOR_COUNT:      /* Get number of sectors on the disk (DWORD) */
    if ((send_cmd(CMD9, 0) == 0) && rcvr_datablock(csd, 16)) {
        if ((csd[0] >> 6) == 1) {   /* SDC ver 2.00 */
            cs = csd[9] + ((WORD)csd[8] << 8) + ((DWORD)(csd[7] & 63) << 16) + 1;
            *(LBA_t*)buff = cs << 10;
        } else {                    /* SDC ver 1.XX or MMC */
            n = (csd[5] & 15) + ((csd[10] & 128) >> 7) + ((csd[9] & 3) << 1) + 2;
            cs = (csd[8] >> 6) + ((WORD)csd[7] << 2) + ((WORD)(csd[6] & 3) << 10) + 1;
            *(LBA_t*)buff = cs << (n - 9);
        }
        res = RES_OK;
    }
    break;
Yes I'd expect some sort of ioctl type of API would be helpful here. It'd be nice to know when cards are inserted/removed - though simple pin monitoring is also possible in parallel to the usual R/W API at a high level. I imagine ioctl could be good for things like turning off CRC check on read for example, and maybe to configure and then automatically control an optional LED pin during transfers.
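Something like driver-specific ioctl codes would do it - a sketch only, and the CTRL_* values below are hypothetical, not existing FatFs codes:

// Hypothetical driver-specific ioctl extensions (illustrative values)
#define CTRL_READ_CRC  200      /* buff: 0 = CRC checking off, 1 = on */
#define CTRL_LED_PIN   201      /* buff: activity LED pin, -1 = none  */

// inside disk_ioctl()'s switch, alongside GET_SECTOR_COUNT above:
case CTRL_READ_CRC:
    crc_enabled = *(int*)buff;
    res = RES_OK;
    break;
case CTRL_LED_PIN:
    led_pin = *(int*)buff;
    res = RES_OK;
    break;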
Using the compile switches, here's a review of performance vs size. Two switches were used:
BLOCK_MULTI selects whether rx_datablocks() or rx_datablock() is compiled in. This also swaps the testing loop, since single needs to loop for each block whereas multi works on a per-buffer basis.
BLOCK_CRC adds the parallel CRC processing code to either the single or multi version. The single with CRC is still cheating by using the oversized buffer, but at least it is running the processing on the final CRC, which multi hadn't been doing until now.
Variant        File Size Increase   Overhead
============================================
Single               base             8.1 %
Single+CRC            280            12.3 %
Multi                 156             5.6 %
Multi+CRC             380            10.1 %
Overhead is measured using SD clock of sysclock/4, buffer size of 16 kB and contiguous read length of 16 MB.
Adjusting the buffer size has an impact on multi performance only. Compiled binary size and single performance are unchanged.
Reducing to 8 kB buffer raises the Multi+CRC overhead up to 12.3 %, the same as Single+CRC.
Increasing the buffer is nice though, brings the Multi+CRC overhead down to 8.5 % with 64 kB buffer.
Multi without CRC is only slightly impacted. Up 0.1% at 8 kB and down 0.1% at 64 kB.
EDIT: I suppose I should make Single+CRC correctly handle a disjointed CRC on the final block of the buffer. It's not a fair comparison without that. Pity the FIFO's FBLOCK feature was a bust.
EDIT2: Oh, I kind of did. In that I did a version that is completely serial. Read a block then process its CRC before reading the next block. It results in a nearly 50 % overhead at sysclock/4. Exceeds 50% at sysclock/3.
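The switch plumbing itself is ordinary conditional compilation - a sketch of the arrangement, with names matching the switches above and argument lists illustrative:

// Sketch of the BLOCK_MULTI plumbing (argument lists illustrative)
static int read_buffer(uint8_t *buf, int blocks, uint32_t timeout)
{
#ifdef BLOCK_MULTI
    return rx_datablocks(buf, blocks, timeout);   // whole buffer, one call
#else
    for (int i = 0; i < blocks; i++)              // loop per block
        if (rx_datablock(buf + i*512, timeout))
            return -1;
    return 0;
#endif
}
// BLOCK_CRC separately compiles the parallel CRC processing into
// whichever variant is selected.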
Huh, not as bad as I was expecting ... Just implemented a single block version that reads every block using "jointed" CRC - Where it stops the clock and streamer just before the CRC nibbles, then adjusts the FIFO and restarts the clock and streamer to stash the CRC elsewhere. It's still processing the CRC in parallel, except for the last block.
I'll call it Single+JointedCRC. It produces 13.3 % overhead as per earlier sysclock/4 conditions.
EDIT: Oh, oops, Singles+CRC are all bugged. The higher level loop never got the CRC updates the Multi routine got. None of them have been checking the end of buffer CRC. Which is why they weren't affected by buffer size. Doh!
EDIT2: That's more like it. Jointed version is now 15.0 % overheads with 16 kB buffer size and sysclock/4. 16.8 % with 8 kB buffer. 13.7 % with 64 kB buffer.
The new singles loop is:
do {
    ptr = buf;
    if( rxblock(ptr, timeout, NULL) )       // read first block into buffer
        goto readerr;
    if( --count ) {
        if( count >= BUFCOUNT-1 )           // cap the inner run at one buffer
            i = BUFCOUNT-1;
        else
            i = count;
        do {
            crcptr = ptr;
            ptr += 512;
            if( rxblock(ptr, timeout, crcptr) )  // read a block and verify prior CRC
                goto readerr;
            --count;
        } while( --i );
    }
    if( crc16join(ptr) ) {                  // verify final CRC of buffer
        count++;
        goto readerr;
    }
} while( count );
Unless it's absolutely critical I wouldn't be too concerned with eking out the last 1.3% of performance if it means you need to burn an extra 48kB of precious hub RAM for buffering. Is this buffer going to be something we control ourselves, or would it need to be built into the driver itself statically?
I'm mostly just trying to confirm that the low-level multi-block function is superior enough to retain. It does like the larger buffer sizes, and it wasn't comparing particularly favourably while the singles didn't accommodate the final CRC. Now that the 8 kB buffer is showing Single-Jointed at 16.8 % vs Multi's 12.3 %, it's looking better for Multi.
There is one more option to test out - merging into the single block read function how the multi-block variant is doing the transition from CRC-in-buffer to CRC-jointed for final block of buffer ...
The alternative would be to require the buffer to be 8 bytes longer than desired for read data.
PS: The idea is there won't be any intermediate buffer for hubRAM target loads. The buffer would only exist for expansion memory where one is wanting to load far more than hubRAM can hold. The SD driver doesn't make any distinction whether it's a buffer or final target. It's just a data load address to the driver. It doesn't move the blocks after read from SD.
So if an 8-byte extension requirement were imposed then it would apply to all uses.
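In caller terms that would just mean something like this (a sketch; the names are illustrative):

// Hypothetical caller-side allocation under an 8-byte-extend rule:
// the driver may write the final block's 16 CRC nibbles (8 bytes)
// past the end of the requested data.
#define BUFBLOCKS 32                            // e.g. 16 kB of data
uint8_t buf[BUFBLOCKS*512 + 8];
read_blocks(buf, first_sector, BUFBLOCKS);      // read_blocks() is illustrative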
Right, done. The revised Single-block-Jointed version now matches the Multi-block method in that, for the bulk of the blocks, CRC clocking is contiguous direct to the main buffer, and both are copied as one into cogRAM for processing. Only the final block of each full buffer is treated as jointed CRC, where it is split off by stopping the clock and restarting once the FIFO is redirected.
New stats: the 8 kB buffer has 16.1 % overhead, 16 kB has 14.2 %, and 64 kB has 12.7 %.
Comparing the previous all-blocks-jointed version: 8 kB was 16.8 %, 16 kB was 15.0 %, 64 kB was 13.7 %.
Comparing Multi-block equivalent: 8 kB is 12.3 %, 16 kB is 10.1 %, and 64 kB is 8.5 %.
Multi-block is proving itself finally. I'm happy now.