In hindsight, there was an indicator of a stuck pin. But at the time I was thinking there were mistimed block reads, ie: that I'd just bumped into a hidden bug that previously wasn't apparent.
When I enable diagnostic printing I get something like this appearing at the end of the card init sequence. It is the only place during init where a data block read operation occurs:
But this is what was reported for the badly seated card - Bit1 is high in every nibble of the data block:
Needless to say, that old routine doesn't do the CRC check. Maybe that's a good thing for this testing.
I've had another go at graphing the lag map from https://forums.parallax.com/discussion/comment/1560644/#Comment_1560644
I wasn't happy with what a spreadsheet's graphs could achieve. I needed a multi-segmented line with gaps in it, so I've gone back to ASCII art form again:
As implied, this is mostly only of concern to transfers operating at sysclock/1, so it doesn't really belong in this SD topic. It's just something I've been wanting to do a better job of since the earliest days of the Prop2, when we first discovered the shifting I/O latencies as the clock ramps up. And SD cards have become my most studied subject.
The bands all move up and down with temperature change but they will track together relatively evenly through the temperatures. That's easily handled simply by recalibrating upon any error. What might be harder to solve (since there is no explicit hardware delay-line setting in the Prop2 pins) is selecting a universal track that applies to all memory chips.
As can be seen in my posted graphs, there are four pin mode settings that, in combination, provide an impressive range of adjustability to effect a settable delay. Roger has had success with HyperRAMs in the past simply by toggling pin registering of the data pins. And that indeed produces the biggest impact in my graphs too, which confirms he made a good choice.
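To be concrete about what "recalibrating upon any error" could look like, here's a minimal sketch - set_rx_timing() and test_block_read() are hypothetical stand-ins for the real pin mode writes and a known-pattern block read:

// Sweep the available timing combinations until a read passes.
// set_rx_timing() and test_block_read() are hypothetical stand-ins.
static int calibrate_rx(void)
{
    for (int regd = 0; regd < 2; regd++) {      // data pins registered or not
        for (int dly = 0; dly < 8; dly++) {     // added sysclock delays
            set_rx_timing(regd, dly);
            if (test_block_read() == 0)         // known-pattern read passes?
                return 0;                       // lock in this setting
        }
    }
    return -1;                                  // no working band found
}

A production version would want to pick the middle of the widest passing band rather than the first pass, but the shape is the same.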
From Roger's docs:
' Delay:
' This nibble is comprised of two fields which create the delay needed for Hyper memory reads.
' bits
' 15-13 = P2 clock delay cycles when waiting for valid data bus input to be returned
' 12 = set to 0 if bank uses registered data bus inputs for reads, or 1 for unregistered
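In C terms those two fields unpack as follows (matching the bit positions quoted above; the variable names are mine):

// Per the quoted doc: bits 15-13 and bit 12 of the bank config word
unsigned delay_cycles = (config >> 13) & 7;  // P2 clock delays before sampling
unsigned unregistered = (config >> 12) & 1;  // 1 = unregistered data bus inputs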
Trimmed down, tidied up, and selected for regular clock polarity - Idle low. It doesn't make much sense to be using both together. The best spread can also leave the clock pin unregistered.
And the inverted clock (Idle high) equivalent:
I've started the integration into Flexspin's sdmm.cc filesystem driver. In the process I've been reading the higher level functions of the existing SPI mode routines and I came across the use of ACMD23 for notifying the SD card of the number of blocks to be pre-erased before writing. I wasn't expecting it to be used ... so I set up a test in my existing tester program to compare using ACMD23 for lots of writes vs not using it at all.
Short story is it makes no difference to the SD card's performance. And I'm not surprised really, the cards will all be pre-erasing what they can anyway. They don't have to be told when to do it.
Using ACMD23 does of course add a little overhead, especially for the smaller sized writes. I won't be carrying it over to the SD mode version.
PS: I tested six different cards.
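For reference, the ACMD23 sequence under test is just a CMD55/CMD23 pair ahead of each multi-block write - roughly this shape, using generic command helpers rather than the actual tester code:

// Pre-erase hint before a multi-block write (illustrative sequence):
// ACMD23 = CMD55 (APP_CMD) followed by CMD23 (SET_WR_BLK_ERASE_COUNT)
if (send_cmd(CMD55, 0) <= 1 && send_cmd(CMD23, blockcount) == 0) {
    // card noted the pre-erase count
}
if (send_cmd(CMD25, sector) == 0) {
    // ... send blocks as usual (WRITE_MULTIPLE_BLOCK)
}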
Seems like good progress. It'll be great to have this in sdmm.cc. I'm edging closer towards needing to save at high speed onto uSD in 4-bit mode. Working on HDMI capture HW right now, which will certainly put it to the test once it's all proven and I get to move beyond single frame capture for bitstream TMDS analysis.
While making sense of the old select/deselect routines I realised I didn't need to test for Card Busy for every command. Only those that are attempting a data read/write need it. A whole timed polling loop has now been removed from in front of issuing each command. However, the check is/was pretty quick, so it makes no measurable performance difference.
EDIT: Ah, it does make a 1% speed-up of non-data commands.
EDIT2: Oh, it's better than that. I might need to fine tune the changes after some sleep.
PS: It has tidied up that leading code nicely. It was quite a spaghetti knot with conditional execution gymnastics to get fast fall through on both CMD pin and DAT0 checks. Now it checks only CMD pin.
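In outline, the change amounts to this (a sketch - the helper names are illustrative, not the actual routines):

// Only data-transfer commands wait on Card Busy (DAT0) now;
// plain commands go straight out on the CMD pin.
static int issue_cmd(int cmd, uint32_t arg, uint32_t timeout)
{
    if (is_data_cmd(cmd)                // CMD17/18/24/25 and friends
        && wait_while_busy(timeout))    // poll DAT0 until released
        return -1;                      // card stayed busy too long
    return send_cmd(cmd, arg);
}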
Nope, no real difference. As you've said before, Roger, the commands and their responses don't change with optimising. Same story for timeout handling really. Part of it is the command issuing code is already pure hubexec. It delivers streamer ops as "immediate" mode with no loops. It's already unrolled and doesn't even have Fcache overheads.
Optimising data throughput is where gains are made.
Yep. Just optimize where the code spends the majority of its time. Changing the rest will buy very little. See Amdahl's law.
I'm feeling happier with the look of the integration now anyways. Nothing's tested but here's an example:
Old SPI based block read interface in sdmm.cc:
New SD block read interface in sdmm.cc:
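Roughly, the shape change is this (an illustrative sketch, not the posted listings):

// Old SPI shape (illustrative): one command, then a per-block receive loop
if (send_cmd(CMD18, sector) == 0) {
    do {
        if (!rcvr_datablock(buff, 512))
            break;                      // token/CRC failure ends the run
        buff += 512;
    } while (--count);
    send_cmd(CMD12, 0);                 // STOP_TRANSMISSION
}

// New SD-mode shape (illustrative): the whole buffer clocked in one call
if (send_cmd(CMD18, sector) == 0) {
    res = rx_datablocks(buff, count, timeout);  // streamer runs all blocks
    send_cmd(CMD12, 0);
}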
rx_datablocks() does all the heavy lifting in one call. The SPI code, as well as a complicated send_cmd() sequence, has another layer that sequences each individual block. And when it was bit-bashed it also had another layer again that also managed sequencing of bytes.
Where I've used send_cmd() now, eg: for DESELECT, it is a less complicated form of the old send_cmd(). It consists primarily of a tx_command() + rx_response() pair. It's useful for general housekeeping but mainly to keep the long card init sequence clutter free.
PS: There are two other variants: send_acmd(), for ACMDs, and send_cmd_r2(), for commands that expect the longer R2 response. These reduce the complexity of send_cmd() and obviously have to be used selectively for those circumstances.
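Their relationship, in outline (a sketch; argument lists are illustrative):

// send_cmd(): the general-purpose form - command out, R1-class response back
static int send_cmd(int cmd, uint32_t arg)
{
    tx_command(cmd, arg);           // 48-bit command on the CMD line
    return rx_response();           // R1-class response, -1 on no response
}

// send_acmd(): CMD55 prefix selects the application command space
static int send_acmd(int cmd, uint32_t arg)
{
    if (send_cmd(CMD55, rca) < 0)
        return -1;
    return send_cmd(cmd, arg);
}
// send_cmd_r2() differs only in clocking in the long 136-bit R2
// response (CID/CSD) instead of a 48-bit R1.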
The content of R1 responses can be ignored generally. As long as there is a response then all is well. If the sequence gets messed up then usually the card will simply fail to respond to an ill-sequenced command. Then it would be worth inspecting the last response or issuing a generic CMD13 to get a refresh.
So while I've found that performing CRC on data blocks is important, it's not important to do the same for command responses unless the content of the response is actually being utilised. This is how I've been operating all along, but with the polishing for production I figured I'd better give it some thought.
It has a bearing on the return codes too. Ignoring the CRC means there is only the no-response failure condition. It makes the logic less complicated. Not unlike the removal of Card Busy check before issuing every command.
Damn, I'd forgotten I still didn't have a solution for handling the CRC of the final block in the buffer. I'd been doing all the testing with an oversized buffer so that the CRC nibbles would fit without any exception handling. Today I've been going around in circles trying to insert the exception, making a mess and having to revert. The problem is rx_datablocks() is cumbersome and I've already made heavy use of conditional execution ... I'm going to be de-optimising soon I suspect.
"Premature optimization is the root of all evil."
I got it going without throwing the toys out. Overheads went from 7.9% to 10.1% for a 16 kB buffer and sysclock/4. This is primarily because the final CRC check can only be done serially. Pushing the buffer size to 64 kB helps it back down to 8.5%.
PS: Sysclock/4 tells me about the non-crc-loop overheads. At sysclock/3 there is a little bleeding of parallel'd crc-loop-time.
PPS: There are three annoying instructions I had to add inside the block loop, which added a touch more overhead.
// no room in the buffer for final CRC-16 nibbles, to be handled later
cmp blocks, #2 wc // last block?
if_c sub clocks, #16+2 // stop the clocks before the CRC-16 nibbles
if_c sub r_mdat, #16 // truncate DMA to end of buffer
Here's how I handle having the CRC nibbles for each block located in the main buffer.
modz _clr wz // clear Z flag to create a one-shot for first block
nextblk
if_z setq #512/4+16/8-1 // one SD block + CRC
if_z rdlong 126, buf // fast copy to cogRAM before engaging the FIFO, cogRAM buffer address = 126
if_z add buf, nextblkoffset
wrfast lnco, buf // setup FIFO for streamer use
So it cleanly copies the latest block+CRC into cogRAM just before reinitialising the FIFO to overwrite those CRC nibbles. Once the streamer is kicked off, it then runs the CRC code over the data stashed in cogRAM.
And the final check is done afterward, so it can be copied and called immediately:
if_nc_and_z setq #512/4-1 // 512 bytes, one SD block, 4 bytes per longword
if_nc_and_z rdlong 126, buf // fast copy to cogRAM, cogRAM buffer address = 126
if_nc_and_z setq #16/8-1 // 16 CRC-16 nibbles, 8 nibbles per longword
if_nc_and_z rdlong 126+512/4, pb
// do the check
if_nc_and_z call #crc_check
PS: The cogRAM address of 126 is arbitrary. I chose it to give my Fcache'd code enough room to grow in front without running into the data, but still fits within first half of cogRAM space.
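For reference, the CRC in question is the standard CRC-16-CCITT (polynomial x^16 + x^12 + x^5 + 1, zero initial value) that SD computes per DAT line - in 4-bit mode each line gets its own CRC16 over the bits it carried. A plain serial version (mine, not from the driver) is handy for cross-checking the optimised PASM:

// Reference CRC-16-CCITT as used on each SD DAT line (poly 0x1021, init 0)
#include <stdint.h>
static uint16_t crc16_sd(const uint8_t *data, int len)
{
    uint16_t crc = 0;
    while (len--) {
        crc ^= (uint16_t)*data++ << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (crc << 1) ^ 0x1021 : (crc << 1);
    }
    return crc;
}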
So your driver code is going to be inline PASM2 and Fcache'd in on demand when sectors are read or written. No additional driver cogs needed and file transfer ops done via existing C lib function access from Flexspin, right? Is this the long term plan? What would you do for functions that can't be exposed via standard stuff? Like enabling/disabling read CRCs? Is this something you'll have a low level API that can be called from the spin2 layer to turn on/off?
Some other stuff may be handy to include like card reporting, and hotswap insert/removal, LED control etc.
At the moment, there is a bunch of compile switches. They can be toggled in the compile command with -D options. Or, in my case, I'm editing the start of the driver source file to toggle them.
@rogloh said:
So your driver code is going to be inline PASM2 and Fcache'd in on demand when sectors are read or written. No additional driver cogs needed and file transfer ops done via existing C lib function access from Flexspin, right?
Right.
I suppose, once it's running smoothly in FlexC, we can work on porting to Spin2 and making it for PNut. Might want to also see if Chip wants a standardised file handling API of some sort.
Sounds promising. Having an easy to use Spin2 wrapper object, and/or having it built into the underlying file system stuff that flexspin already contains, would be great. Once I do this new interface board for the HDMI video grab stuff I'll be hoping to start looking at doing 4-bit SD transfers and see what performance the P2 can achieve with fast SD cards.
There is the "ioctl" API. I have no idea how to find/predict its features but things like card presence should be doable via it. I note the existing SPI driver is reporting the card capacity this way. It queries the CSD register (CMD9), picks out the rated capacity, and converts that to a block count.
case GET_SECTOR_COUNT:      /* Get number of sectors on the disk (DWORD) */
    if ((send_cmd(CMD9, 0) == 0) && rcvr_datablock(csd, 16)) {
        if ((csd[0] >> 6) == 1) {   /* SDC ver 2.00 */
            cs = csd[9] + ((WORD)csd[8] << 8) + ((DWORD)(csd[7] & 63) << 16) + 1;
            *(LBA_t*)buff = cs << 10;
        } else {                    /* SDC ver 1.XX or MMC */
            n = (csd[5] & 15) + ((csd[10] & 128) >> 7) + ((csd[9] & 3) << 1) + 2;
            cs = (csd[8] >> 6) + ((WORD)csd[7] << 2) + ((WORD)(csd[6] & 3) << 10) + 1;
            *(LBA_t*)buff = cs << (n - 9);
        }
        res = RES_OK;
    }
    break;
Yes I'd expect some sort of ioctl type of API would be helpful here. It'd be nice to know when cards are inserted/removed - though simple pin monitoring is also possible in parallel to the usual R/W API at a high level. I imagine ioctl could be good for things like turning off CRC check on read for example, and maybe to configure and then automatically control an optional LED pin during transfers.
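Something like driver-specific ioctl codes would do it - a sketch only, and the CTRL_* values below are hypothetical, not existing FatFs codes:

// Hypothetical driver-specific ioctl extensions (illustrative values)
#define CTRL_READ_CRC  200      /* buff: 0 = CRC checking off, 1 = on */
#define CTRL_LED_PIN   201      /* buff: activity LED pin, -1 = none  */

// inside disk_ioctl()'s switch, alongside GET_SECTOR_COUNT above:
case CTRL_READ_CRC:
    crc_enabled = *(int*)buff;
    res = RES_OK;
    break;
case CTRL_LED_PIN:
    led_pin = *(int*)buff;
    res = RES_OK;
    break;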
Using the compile switches, here's a review of performance vs size. Two switches were used:
BLOCK_MULTI selects whether rx_datablocks() or rx_datablock() is compiled in. This also swaps the testing loop, since single needs to loop for each block whereas multi works on a per-buffer basis.
BLOCK_CRC adds the parallel CRC processing code to either the single or multi version. The single with CRC is still cheating by using the oversized buffer, but at least it is running the processing on the final CRC, which multi hadn't been doing until now.
Variant        File Size Increase   Overhead
============================================
Single               base             8.1 %
Single+CRC            280            12.3 %
Multi                 156             5.6 %
Multi+CRC             380            10.1 %
Overhead is measured using SD clock of sysclock/4, buffer size of 16 kB and contiguous read length of 16 MB.
Adjusting the buffer size has an impact on multi performance only. Compiled binary size and single performance are unchanged.
Reducing to 8 kB buffer raises the Multi+CRC overhead up to 12.3 %, the same as Single+CRC.
Increasing the buffer is nice though, brings the Multi+CRC overhead down to 8.5 % with 64 kB buffer.
Multi without CRC is only slightly impacted. Up 0.1% at 8 kB and down 0.1% at 64 kB.
EDIT: I suppose I should make Single+CRC correctly handle a disjointed CRC on the final block of the buffer. It's not a fair comparison without that. Pity the FIFO's FBLOCK feature was a bust.
EDIT2: Oh, I kind of did. In that I did a version that is completely serial. Read a block then process its CRC before reading the next block. It results in a nearly 50 % overhead at sysclock/4. Exceeds 50% at sysclock/3.
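The switch plumbing itself is ordinary conditional compilation - a sketch of the arrangement, with names matching the switches above and argument lists illustrative:

// Sketch of the BLOCK_MULTI plumbing (argument lists illustrative)
static int read_buffer(uint8_t *buf, int blocks, uint32_t timeout)
{
#ifdef BLOCK_MULTI
    return rx_datablocks(buf, blocks, timeout);   // whole buffer, one call
#else
    for (int i = 0; i < blocks; i++)              // loop per block
        if (rx_datablock(buf + i*512, timeout))
            return -1;
    return 0;
#endif
}
// BLOCK_CRC separately compiles the parallel CRC processing into
// whichever variant is selected.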
Huh, not as bad as I was expecting ... Just implemented a single block version that reads every block using "jointed" CRC - Where it stops the clock and streamer just before the CRC nibbles, then adjusts the FIFO and restarts the clock and streamer to stash the CRC elsewhere. It's still processing the CRC in parallel, except for the last block.
I'll call it Single+JointedCRC. It produces 13.3 % overhead as per earlier sysclock/4 conditions.
EDIT: Oh, oops, Singles+CRC are all bugged. The higher level loop never got the CRC updates the Multi routine got. None of them have been checking the end of buffer CRC. Which is why they weren't affected by buffer size. Doh!
EDIT2: That's more like it. Jointed version is now 15.0 % overheads with 16 kB buffer size and sysclock/4. 16.8 % with 8 kB buffer. 13.7 % with 64 kB buffer.
The new singles loop is:
do {
    ptr = buf;
    if( rxblock(ptr, timeout, NULL) )       // read first block into buffer
        goto readerr;
    if( --count ) {
        if( count >= BUFCOUNT-1 )           // cap the inner run at one buffer
            i = BUFCOUNT-1;
        else
            i = count;
        do {
            crcptr = ptr;
            ptr += 512;
            if( rxblock(ptr, timeout, crcptr) )  // read a block and verify prior CRC
                goto readerr;
            --count;
        } while( --i );
    }
    if( crc16join(ptr) ) {                  // verify final CRC of buffer
        count++;
        goto readerr;
    }
} while( count );
Unless it's absolutely critical I wouldn't be too concerned with eking out the last 1.3% of performance if it means you need to burn an extra 48kB of precious hub RAM for buffering. Is this buffer going to be something we control ourselves, or would it need to be built into the driver itself statically?
I'm mostly just trying to confirm that the low-level multi-block function is superior enough to retain. It does like the larger buffer sizes, and it wasn't comparing particularly favourably while the singles didn't accommodate the final CRC. Now that the 8 kB buffer is showing Single-Jointed at 16.8 % vs Multi's 12.3 %, it's looking better for Multi.
There is one more option to test out - merging into the single block read function how the multi-block variant is doing the transition from CRC-in-buffer to CRC-jointed for final block of buffer ...
The alternative would be to require the buffer to be 8 bytes longer than desired for read data.
PS: The idea is there won't be any intermediate buffer for hubRAM target loads. The buffer would only exist for expansion memory where one is wanting to load far more than hubRAM can hold. The SD driver doesn't make any distinction whether it's a buffer or final target. It's just a data load address to the driver. It doesn't move the blocks after read from SD.
So if an 8-byte extension requirement were imposed then it would apply to all uses.
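In caller terms that would just mean something like this (a sketch; the names are illustrative):

// Hypothetical caller-side allocation under an 8-byte-extend rule:
// the driver may write the final block's 16 CRC nibbles (8 bytes)
// past the end of the requested data.
#define BUFBLOCKS 32                            // e.g. 16 kB of data
uint8_t buf[BUFBLOCKS*512 + 8];
read_blocks(buf, first_sector, BUFBLOCKS);      // read_blocks() is illustrative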
Right, done. The revised Single-block-Jointed version now matches the Multi-block method in that, for the bulk of the blocks, CRC clocking is contiguous direct to the main buffer, and both are copied as one into cogRAM for processing. Only the final block of each full buffer is treated as jointed CRC, where it is split off by stopping the clock and restarting once the FIFO is redirected.
New stats: the 8 kB buffer has 16.1 % overhead, 16 kB has 14.2 %, and 64 kB has 12.7 %.
Comparing the previous all-blocks-jointed version: 8 kB was 16.8 %, 16 kB was 15.0 %, 64 kB was 13.7 %.
Comparing Multi-block equivalent: 8 kB is 12.3 %, 16 kB is 10.1 %, and 64 kB is 8.5 %.
Multi-block is proving itself finally. I'm happy now.