Actually, the single-block loop is even bigger now. Looking at it, I'm a little surprised the multi-block path is working as well as it does. Almost all of the logic is done in the assembly; the only part still in C is the decision, and the loop, over whether the block count fits the buffer.
So, with your overheads and the gaps you've seen before with fast cards, what sort of sustained transfer rates do you expect will be achievable on reads (CRC checking disabled) and writes (with CRC)? Can we get 28 MiB/s non-stop running on a 270 MHz P2? That would allow 30 fps video at 640x480x24 bits in pure RGB (no 4:2:2 subsampling) with audio. 30 MiB/s would allow 24 fps widescreen 858x480p or thereabouts.
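For what it's worth, the arithmetic behind those bandwidth targets checks out. A quick throwaway calculation (plain C, nothing driver-specific):

```c
// Sanity check of the video bandwidth targets above: bytes per second for
// uncompressed RGB video, converted to MiB/s.
static double video_rate_mib(int width, int height, int bpp, int fps)
{
    double bytes_per_sec = (double)width * height * (bpp / 8.0) * fps;
    return bytes_per_sec / (1024.0 * 1024.0);
}
// 640x480x24 @ 30 fps -> ~26.4 MiB/s, so 28 MiB/s sustained leaves headroom
// for audio. 858x480x24 @ 24 fps -> ~28.3 MiB/s, inside a 30 MiB/s budget.
```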
At a 270 MHz sysclock, with a clock divider of 3 and CRC processing enabled, it can read data at 36 MiB/s. EDIT: Ah, well, still need to add filesystem overheads to that. Time to get back to integrating into the driver ...
EDIT2: Oh, the 36 MiB/s was with an 8 kB buffer size, btw. A 64 kB buffer moves that up to 38 MiB/s, and 16 kB gives a solid 37 MiB/s.
EDIT3: Disabling the CRC processing and using a sysclock/2 divider takes that to 55 MiB/s. Or even a little more, 58 MiB/s, with the SanDisk cards.
EDIT4: 270 MHz with sysclock/2 (135 MHz SD clock) is of course massively overclocking the SD bus for the 3.3 Volt High Speed interface. It does, however, fit within the upper limit of the 1.8 Volt UHS-I interface (which can operate in spec up to an insane 208 MHz SD clock). So I guess that's why newer cards just accept it and keep up.
The Samsung EVO card has the strangest behaviour. I've hesitated to mention it before, but it does seem to be persisting. On the first run of testing it performs exceptionally poorly, at about half the expected speeds, even on repeats of reading the same blocks over and over. One whole test run rereads the same sequential block list 15 times, with each loop halving the total number to read.
After the first run it's fine. It's like the card needs a few seconds to warm up.
Maybe it fits in some sort of internal cache? 16 MB is not beyond the realm of fitting into a cache. If you try a much larger transfer-range test that has no chance of fitting, then maybe you won't see such a difference between runs 1 and 2.
It would be sorted after the first loop then. The second line of the first test run should show a dramatic uptick in performance, but it doesn't.
And power cycling doesn't revert the performance either. It's still fine after swapping cards for a while and coming back to the Samsung. The problem only seems to show up after hours or days of no power.
I'm guessing if I made a test that ran for say 30 seconds and graphed progress every 0.1 second I'd see it rise suddenly a few seconds into the run. But only when the card has been cold.
Wow, I'm impressed. Even the older cards are performing at a 135 MHz SD clock. Here's my oldest card, the Adata Silver (2013), at sysclock/2 without CRC processing; note the repeatable latency spike in the first line of its results below:
@rogloh said:
What if you start with a warm card to begin with? Sit it next to a heat source for a bit then test it.
Yeah, no, that was a euphemistic use of cold. But, taking the hint, I've now tested it as an actual thermally cold card and it's still behaving perfectly fine on the first try. So cold, in this case, only seems to mean unpowered for days.
I won't throw it out, but clearly it's not looking like a happy SD card any longer. I'm gonna file it under it-was-already-faulty and I just sped it to its grave.
EDIT: Huh, that latest pattern, where the performance was consistently ok-poor-ok through the test sizes: I do now remember one of the cards doing that before. Yeah, I'm concluding the Samsung card has always been sick.
Yeah, the cell charge, a cell-level calibration thing. QLC flash will be the worst for this. Back in the 840 EVO days it was still TLC.
Oddly, it has always seemed to be a Samsung exclusive issue though.
Roger,
Regarding the High-Speed access mode switching. I've sort of poo-poo'd it a little in the past because it appeared inconsistent as to how each card responded. In particular, some phase-shifted the clock while others didn't ... Well, I've come to the realisation that there is a high likelihood those cards that didn't adjust their phase timing probably also didn't change modes. Back then, I never wrote any code to confirm the mode change had occurred. I just requested it and assumed it happened.
And the reason why some cards might not make the change could easily be because it was the SPI interface, and I doubt there is any requirement for such features to be supported in that interface type.
Certainly, in SD interface type, High-Speed access mode has been entirely consistent with all my cards. Not that it seems to offer any measurable advantage though.
Doing a little write up for posterity:
Working through the steps for High-Speed has resulted in a mostly convenient symbiosis between the clock phase and clock polarity:
When Default Speed access mode is active, the SD card outputs CMD/DAT on the falling clock edge. Setting the clock polarity to negative (inverted) then means the starting (falling) edge of each clock pulse produces new data for the streamer to sample.
When High Speed access mode is active, the SD card outputs CMD/DAT on the rising clock edge. Setting the clock polarity to positive preserves the starting (rising) edge of each clock pulse producing new data for the streamer to sample.
Using that helps with the more complex rx side of the equation.
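The rule above can be summed up in a couple of lines. This is only an illustration of that mapping; the enum and function names are made up, not the driver's actual API:

```c
// Which clock polarity keeps the leading edge of each pulse as the
// data-producing edge for the streamer to sample. Illustrative names only.
typedef enum { SPEED_DEFAULT, SPEED_HIGH } sd_speed_mode_t;
typedef enum { CLK_POL_NEGATIVE, CLK_POL_POSITIVE } sd_clk_pol_t;

static sd_clk_pol_t clk_polarity_for(sd_speed_mode_t mode)
{
    // Default Speed: the card drives CMD/DAT on the falling edge, so invert
    // the clock. High Speed: it drives on the rising edge, so stay positive.
    return (mode == SPEED_DEFAULT) ? CLK_POL_NEGATIVE : CLK_POL_POSITIVE;
}
```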
But, as per usual, tx timing is different from rx timing. Using the streamer means predicting all these relationships; there is no hardware synchronisation to help, not even at the bus clock level. Everything, rx and tx, comes down to pin sampling. On the Prop2, as the master in this setup, the SD data and clock outputs on the pins are refreshed in unison each sysclock. Since it's important not to output fresh data along with the rising clock edge, it's up to the software to ensure the two are separated by at least one sysclock tick. At sysclock/2 this means ensuring they always occur on alternating ticks, which means the clock falls when the data pins update. A different story from the slave device. Well, at least until hardware delay lines get added.
On the bright side, as the bus master, tx is easier than rx because the master controls when each clock pulse occurs and can therefore pre-align data with that clock. Which is just as well because, when changing clock polarity, a timing shift in that pre-alignment is needed to maintain the tx clock-data phase relationship.
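As a toy model of that sysclock/2 tx rule (nothing like the actual streamer code; just a sketch of the invariant, with made-up names):

```c
#include <string.h>

// Toy model of sysclock/2 tx: data and clock outputs refresh together each
// sysclock tick, so fresh data must be placed on the tick where the clock
// falls, never alongside a rising edge.
#define NBITS 8
#define TICKS (NBITS * 2)

static int clk_hist[TICKS], dat_hist[TICKS];

// Each bit occupies two ticks; the new data value goes out with the low tick.
static void drive_bits(const int *bits, int nbits)
{
    int t = 0;
    for (int i = 0; i < nbits; i++) {
        clk_hist[t] = 0; dat_hist[t] = bits[i]; t++;  // clock falls, data updates
        clk_hist[t] = 1; dat_hist[t] = bits[i]; t++;  // clock rises, data held
    }
}

// Verify the rule: data never changes on the same tick as a rising clock edge.
static int tx_rule_holds(void)
{
    static const int bits[NBITS] = { 1, 0, 1, 1, 0, 0, 1, 0 };
    memset(clk_hist, 0, sizeof clk_hist);
    drive_bits(bits, NBITS);
    for (int t = 1; t < TICKS; t++) {
        int data_changed = dat_hist[t] != dat_hist[t - 1];
        int rising_edge  = clk_hist[t] == 1 && clk_hist[t - 1] == 0;
        if (data_changed && rising_edge)
            return 0;
    }
    return 1;
}
```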
@evanh said:
Roger,
Regarding the High-Speed access mode switching. I've sort of poo-poo'd it a little in the past because it appeared inconsistent as to how each card responded. In particular, some phase-shifted the clock while others didn't ... Well, I've come to the realisation that there is a high likelihood those cards that didn't adjust their phase timing probably also didn't change modes. Back then, I never wrote any code to confirm the mode change had occurred. I just requested it and assumed it happened.
And the reason why some cards might not make the change could easily be because it was the SPI interface, and I doubt there is any requirement for such features to be supported in that interface type.
Certainly, in SD interface type, High-Speed access mode has been entirely consistent with all my cards. Not that it seems to offer any measurable advantage though.
Yeah, one would like to hope there was some sort of consistency in timing phase amongst boards running in SD mode. Good catch about previously only testing in SPI mode.
On the bright side, as the bus master, tx is easier than rx because the master controls when each clock pulse occurs and can therefore pre-align data with that clock. Which is just as well because, when changing clock polarity, a timing shift in that pre-alignment is needed to maintain the tx clock-data phase relationship.
I found the same effect for memory with writes vs reads. With writes you have full control of the clock phase, and as the output pin states are almost perfectly synchronised, it's much easier to get consistent write results over the frequency range for sysclk/2 or lower.
Reads are harder because they involve latencies in the chip and delays on the board and in the target device. At high speed this needs some sort of calibration to get error free data.
It's almost going! So close. But there is still some newly introduced bug I'm not quite seeing. The old file read/write speed tester works, the one I used when developing the smartpin SPI driver ... in selected ways. Just not in every way like the SD mode development code does. The non-inverted clock is causing grief: the calibration routine is reporting success with failing values. But it works correctly when the clock is inverted. This should be a tell-tale of the cause, but so far I can't see it. I haven't dug out the oscilloscope just yet, so I guess that's probably up next.
There is quite a long list of changes between the two solutions. Accommodating the filesystem interfacing produced plenty of variation that I wasn't considering during development. Much has been resolved, from basic typos to inverted return-code logic and trimming down the logic. Stuff has been renamed; stack allocation and pointer passing added where there were static buffers; compile options removed; and old routines that hadn't been touched in a while cleaned up.
Found it! A perfect example of "it works, but don't ask why and don't fiddle." It was a minor tweak I'd made in passing during an optimising clean-up quite early on in the conversion. The clean-up was around removal of all the latency-measuring diagnostic code. Luckily I'd done a backup just before starting, so I was able to pin down when the bug was introduced.
During the clean-up I'd noticed I had an excess of clocks leading into issuing the command. This created a longer delay before each command was issued, because it waited for the completion of those pulses. It was a remnant of the wait-for-busy check that had briefly existed at the head of command issuing before it got split off on its own. I thought about it for a moment and decided a single clock pulse was enough to trigger an event, so why have more. However, it turns out that, at some point in the sequencing, the SD cards need two extra pulses there. One isn't always enough.
I don't really know why, but it fixes this issue. ... Hmm, or maybe I need to look harder: one of the cards just now needed it set to 3 pulses!
PS: It seems to matter only during block write sequencing.
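The fix amounts to something like the following. This is an illustrative bit-banged sketch, not the actual smartpin/streamer assembly; set_pin() and PIN_CLK are hypothetical stand-ins for the driver's pin helpers:

```c
#define PIN_CLK 0
#define PRE_CMD_CLOCKS 3   // empirically: at least 2; one card wanted 3

static int clk_transitions;  // demo stub: counts pin writes

static void set_pin(int pin, int level)
{
    (void)pin; (void)level;
    clk_transitions++;       // stand-in for an actual pin write
}

// Issue a few idle clock pulses before each command. A single pulse proved
// insufficient during block-write sequencing.
static void sd_idle_clocks(int n)
{
    for (int i = 0; i < n; i++) {
        set_pin(PIN_CLK, 0);  // clock low
        set_pin(PIN_CLK, 1);  // clock high: the card counts this edge
    }
}
```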
I should have said eight clocks rather than eight bits.
Anyway, it looks like the other problem was merely that I'd removed the exact binary compare when performing calibration. I'd started relying on the CRC alone, but that proved not rugged enough for the calibration process. I would dearly love to have access to the official UHS method. Sadly, the SD cards don't respond to CMD19 without engaging UHS mode first. EDIT: Correction, one of the six cards tested actually seems to support CMD19 without switching into UHS first. Not much help for me.
CMD19 is compulsory for UHS-I compliance, so I figure it is supported by all my SD cards once UHS is engaged.
EDIT2: One detail from the CMD19 procedure is that it specifies up to 40 repeats for certainty. That's something I'll adopt myself.
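Combining the two points, the calibration idea reduces to: a candidate timing value only passes if a known block reads back bit-exact on every one of several repeats. A minimal sketch, where read_block_fn, demo_read, and the 0xA5 reference pattern are stand-ins for the driver's real routines:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CAL_REPEATS 40  // per the CMD19 tuning guidance above

typedef bool (*read_block_fn)(unsigned char *buf, size_t len);

// A timing value passes only if every repeat transfers and matches exactly.
static bool calibrate_ok(read_block_fn read_block,
                         const unsigned char *reference, size_t len)
{
    unsigned char buf[512];
    if (len > sizeof buf)
        return false;
    for (int i = 0; i < CAL_REPEATS; i++) {
        if (!read_block(buf, len))
            return false;                       // transfer failed outright
        if (memcmp(buf, reference, len) != 0)
            return false;                       // exact binary compare
    }
    return true;
}

// Demo stand-in for the block-read routine: fills the buffer with a fixed
// pattern, optionally corrupting one byte to model a marginal read.
static bool demo_flaky;
static bool demo_read(unsigned char *buf, size_t len)
{
    memset(buf, 0xA5, len);
    if (demo_flaky)
        buf[len / 2] ^= 0x01;  // single-bit error
    return true;
}
```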
I've found that the only places I needed to ensure the extra trailing clocks were after expected non-responses, namely CMD7 for deselect and also CMD0. A normal response, which already has a couple of extra clocks anyway, doesn't seem to need all eight spacing clocks. I've fixed the exceptions and left the pre-command clocks set at one. It could still come back to bite me, I suppose.
What's left for you to do now @evanh? Is this mainly bug fixing, or optimization work now, or are there still pieces remaining to be coded for the burst reads/writes?
Down to beta testing and looking for stray bugs I guess.
Attached is the patched vfs.h header file that replaces the existing one in include/sys
and the new driver directory that gets added to include/filesys
and lastly my current tester program to get you going.
EDIT: Updated the tester program to show the clock divider in use. This is a newly exposed feature; the driver had been using a constant until today.
@evanh said:
It's usable right now. I presume you're interested in giving it a go?
Depending on my capture board bring up and how it works out, I probably would like to try something soon. It will be interesting to see if video can be captured in real time to SD card.
The Samsung EVO's poor results on the first run:
Read blocks speed test:
32768 blocks = 16384 kiB rate = 23.4 MiB/s duration = 681427 us zero-overhead = 248551 us overheads = 63.5 %
16384 blocks = 8192 kiB rate = 21.2 MiB/s duration = 375874 us zero-overhead = 124276 us overheads = 66.9 %
8192 blocks = 4096 kiB rate = 19.8 MiB/s duration = 201299 us zero-overhead = 62138 us overheads = 69.1 %
4096 blocks = 2048 kiB rate = 17.3 MiB/s duration = 115100 us zero-overhead = 31069 us overheads = 73.0 %
2048 blocks = 1024 kiB rate = 11.2 MiB/s duration = 88680 us zero-overhead = 15534 us overheads = 82.4 %
1024 blocks = 512 kiB rate = 10.7 MiB/s duration = 46359 us zero-overhead = 7767 us overheads = 83.2 %
512 blocks = 256 kiB rate = 5.9 MiB/s duration = 42065 us zero-overhead = 3884 us overheads = 90.7 %
256 blocks = 128 kiB rate = 3.4 MiB/s duration = 36679 us zero-overhead = 1942 us overheads = 94.7 %
128 blocks = 64 kiB rate = 1.6 MiB/s duration = 37188 us zero-overhead = 971 us overheads = 97.3 %
64 blocks = 32 kiB rate = 0.9 MiB/s duration = 34154 us zero-overhead = 485 us overheads = 98.5 %
32 blocks = 16 kiB rate = 0.4 MiB/s duration = 35968 us zero-overhead = 243 us overheads = 99.3 %
16 blocks = 8 kiB rate = 0.2 MiB/s duration = 33200 us zero-overhead = 121 us overheads = 99.6 %
8 blocks = 4 kiB rate = 0.1 MiB/s duration = 36551 us zero-overhead = 61 us overheads = 99.8 %
4 blocks = 2 kiB rate = 0.0 MiB/s duration = 33527 us zero-overhead = 30 us overheads = 99.9 %
2 blocks = 1 kiB rate = 0.0 MiB/s duration = 33432 us zero-overhead = 15 us overheads = 99.9 %
Then the very next run is this:
Read blocks speed test:
32768 blocks = 16384 kiB rate = 38.3 MiB/s duration = 417277 us zero-overhead = 248551 us overheads = 40.4 %
16384 blocks = 8192 kiB rate = 38.2 MiB/s duration = 208896 us zero-overhead = 124276 us overheads = 40.5 %
8192 blocks = 4096 kiB rate = 38.2 MiB/s duration = 104645 us zero-overhead = 62138 us overheads = 40.6 %
4096 blocks = 2048 kiB rate = 38.0 MiB/s duration = 52518 us zero-overhead = 31069 us overheads = 40.8 %
2048 blocks = 1024 kiB rate = 37.8 MiB/s duration = 26455 us zero-overhead = 15534 us overheads = 41.2 %
1024 blocks = 512 kiB rate = 37.5 MiB/s duration = 13305 us zero-overhead = 7767 us overheads = 41.6 %
512 blocks = 256 kiB rate = 36.8 MiB/s duration = 6790 us zero-overhead = 3884 us overheads = 42.7 %
256 blocks = 128 kiB rate = 35.4 MiB/s duration = 3531 us zero-overhead = 1942 us overheads = 45.0 %
128 blocks = 64 kiB rate = 32.7 MiB/s duration = 1906 us zero-overhead = 971 us overheads = 49.0 %
64 blocks = 32 kiB rate = 29.0 MiB/s duration = 1074 us zero-overhead = 485 us overheads = 54.8 %
32 blocks = 16 kiB rate = 23.3 MiB/s duration = 670 us zero-overhead = 243 us overheads = 63.7 %
16 blocks = 8 kiB rate = 16.7 MiB/s duration = 467 us zero-overhead = 121 us overheads = 74.0 %
8 blocks = 4 kiB rate = 12.6 MiB/s duration = 310 us zero-overhead = 61 us overheads = 80.3 %
4 blocks = 2 kiB rate = 7.5 MiB/s duration = 260 us zero-overhead = 30 us overheads = 88.4 %
2 blocks = 1 kiB rate = 4.1 MiB/s duration = 235 us zero-overhead = 15 us overheads = 93.6 %
The Adata Silver (2013) at sysclock/2 without CRC processing (note the repeatable latency spike in the first line):
Read blocks speed test:
32768 blocks = 16384 kiB rate = 51.5 MiB/s duration = 310387 us zero-overhead = 248551 us overheads = 19.9 %
16384 blocks = 8192 kiB rate = 59.1 MiB/s duration = 135142 us zero-overhead = 124276 us overheads = 8.0 %
8192 blocks = 4096 kiB rate = 59.0 MiB/s duration = 67693 us zero-overhead = 62138 us overheads = 8.2 %
4096 blocks = 2048 kiB rate = 58.5 MiB/s duration = 34181 us zero-overhead = 31069 us overheads = 9.1 %
2048 blocks = 1024 kiB rate = 58.0 MiB/s duration = 17237 us zero-overhead = 15534 us overheads = 9.8 %
1024 blocks = 512 kiB rate = 56.4 MiB/s duration = 8859 us zero-overhead = 7767 us overheads = 12.3 %
512 blocks = 256 kiB rate = 53.5 MiB/s duration = 4669 us zero-overhead = 3884 us overheads = 16.8 %
256 blocks = 128 kiB rate = 48.5 MiB/s duration = 2575 us zero-overhead = 1942 us overheads = 24.5 %
128 blocks = 64 kiB rate = 40.8 MiB/s duration = 1529 us zero-overhead = 971 us overheads = 36.4 %
64 blocks = 32 kiB rate = 31.0 MiB/s duration = 1005 us zero-overhead = 485 us overheads = 51.7 %
32 blocks = 16 kiB rate = 21.0 MiB/s duration = 743 us zero-overhead = 243 us overheads = 67.2 %
16 blocks = 8 kiB rate = 12.7 MiB/s duration = 612 us zero-overhead = 121 us overheads = 80.2 %
8 blocks = 4 kiB rate = 7.1 MiB/s duration = 547 us zero-overhead = 61 us overheads = 88.8 %
4 blocks = 2 kiB rate = 3.7 MiB/s duration = 514 us zero-overhead = 30 us overheads = 94.1 %
2 blocks = 1 kiB rate = 1.9 MiB/s duration = 498 us zero-overhead = 15 us overheads = 96.9 %
The Apacer (2018) is clean though:
Read blocks speed test:
32768 blocks = 16384 kiB rate = 59.5 MiB/s duration = 268675 us zero-overhead = 248551 us overheads = 7.4 %
16384 blocks = 8192 kiB rate = 59.4 MiB/s duration = 134555 us zero-overhead = 124276 us overheads = 7.6 %
8192 blocks = 4096 kiB rate = 59.3 MiB/s duration = 67358 us zero-overhead = 62138 us overheads = 7.7 %
4096 blocks = 2048 kiB rate = 59.1 MiB/s duration = 33831 us zero-overhead = 31069 us overheads = 8.1 %
2048 blocks = 1024 kiB rate = 58.5 MiB/s duration = 17066 us zero-overhead = 15534 us overheads = 8.9 %
1024 blocks = 512 kiB rate = 57.5 MiB/s duration = 8685 us zero-overhead = 7767 us overheads = 10.5 %
512 blocks = 256 kiB rate = 55.6 MiB/s duration = 4496 us zero-overhead = 3884 us overheads = 13.6 %
256 blocks = 128 kiB rate = 52.0 MiB/s duration = 2403 us zero-overhead = 1942 us overheads = 19.1 %
128 blocks = 64 kiB rate = 46.1 MiB/s duration = 1354 us zero-overhead = 971 us overheads = 28.2 %
64 blocks = 32 kiB rate = 37.6 MiB/s duration = 830 us zero-overhead = 485 us overheads = 41.5 %
32 blocks = 16 kiB rate = 27.5 MiB/s duration = 568 us zero-overhead = 243 us overheads = 57.2 %
16 blocks = 8 kiB rate = 17.9 MiB/s duration = 436 us zero-overhead = 121 us overheads = 72.2 %
8 blocks = 4 kiB rate = 10.5 MiB/s duration = 371 us zero-overhead = 61 us overheads = 83.5 %
4 blocks = 2 kiB rate = 5.7 MiB/s duration = 338 us zero-overhead = 30 us overheads = 91.1 %
2 blocks = 1 kiB rate = 3.0 MiB/s duration = 322 us zero-overhead = 15 us overheads = 95.3 %
A 10 hour gap isn't enough. The Samsung EVO worked on the first try. Although the first line does indicate a minor latency extension there:
Read blocks speed test:
32768 blocks = 16384 kiB rate = 37.7 MiB/s duration = 423925 us zero-overhead = 372827 us overheads = 12.0 %
16384 blocks = 8192 kiB rate = 38.2 MiB/s duration = 209397 us zero-overhead = 186414 us overheads = 10.9 %
8192 blocks = 4096 kiB rate = 38.1 MiB/s duration = 104894 us zero-overhead = 93207 us overheads = 11.1 %
4096 blocks = 2048 kiB rate = 37.9 MiB/s duration = 52647 us zero-overhead = 46603 us overheads = 11.4 %
2048 blocks = 1024 kiB rate = 37.7 MiB/s duration = 26523 us zero-overhead = 23302 us overheads = 12.1 %
1024 blocks = 512 kiB rate = 37.4 MiB/s duration = 13336 us zero-overhead = 11651 us overheads = 12.6 %
512 blocks = 256 kiB rate = 36.7 MiB/s duration = 6807 us zero-overhead = 5825 us overheads = 14.4 %
256 blocks = 128 kiB rate = 35.3 MiB/s duration = 3540 us zero-overhead = 2913 us overheads = 17.7 %
128 blocks = 64 kiB rate = 32.7 MiB/s duration = 1906 us zero-overhead = 1456 us overheads = 23.6 %
64 blocks = 32 kiB rate = 29.0 MiB/s duration = 1077 us zero-overhead = 728 us overheads = 32.4 %
32 blocks = 16 kiB rate = 23.3 MiB/s duration = 670 us zero-overhead = 364 us overheads = 45.6 %
16 blocks = 8 kiB rate = 16.6 MiB/s duration = 468 us zero-overhead = 182 us overheads = 61.1 %
8 blocks = 4 kiB rate = 12.4 MiB/s duration = 315 us zero-overhead = 91 us overheads = 71.1 %
4 blocks = 2 kiB rate = 7.3 MiB/s duration = 264 us zero-overhead = 46 us overheads = 82.5 %
2 blocks = 1 kiB rate = 4.0 MiB/s duration = 239 us zero-overhead = 23 us overheads = 90.3 %
Which vanishes again on subsequent runs:
Read blocks speed test:
32768 blocks = 16384 kiB rate = 38.2 MiB/s duration = 418271 us zero-overhead = 372827 us overheads = 10.8 %
16384 blocks = 8192 kiB rate = 38.2 MiB/s duration = 209394 us zero-overhead = 186414 us overheads = 10.9 %
8192 blocks = 4096 kiB rate = 38.1 MiB/s duration = 104892 us zero-overhead = 93207 us overheads = 11.1 %
4096 blocks = 2048 kiB rate = 37.9 MiB/s duration = 52643 us zero-overhead = 46603 us overheads = 11.4 %
2048 blocks = 1024 kiB rate = 37.7 MiB/s duration = 26520 us zero-overhead = 23302 us overheads = 12.1 %
1024 blocks = 512 kiB rate = 37.4 MiB/s duration = 13336 us zero-overhead = 11651 us overheads = 12.6 %
512 blocks = 256 kiB rate = 36.7 MiB/s duration = 6806 us zero-overhead = 5825 us overheads = 14.4 %
256 blocks = 128 kiB rate = 35.3 MiB/s duration = 3540 us zero-overhead = 2913 us overheads = 17.7 %
128 blocks = 64 kiB rate = 32.7 MiB/s duration = 1907 us zero-overhead = 1456 us overheads = 23.6 %
64 blocks = 32 kiB rate = 29.0 MiB/s duration = 1077 us zero-overhead = 728 us overheads = 32.4 %
32 blocks = 16 kiB rate = 23.3 MiB/s duration = 670 us zero-overhead = 364 us overheads = 45.6 %
16 blocks = 8 kiB rate = 16.6 MiB/s duration = 468 us zero-overhead = 182 us overheads = 61.1 %
8 blocks = 4 kiB rate = 12.4 MiB/s duration = 315 us zero-overhead = 91 us overheads = 71.1 %
4 blocks = 2 kiB rate = 7.3 MiB/s duration = 265 us zero-overhead = 46 us overheads = 82.6 %
2 blocks = 1 kiB rate = 4.0 MiB/s duration = 239 us zero-overhead = 23 us overheads = 90.3 %
Seems weird. Charge leakage?
Maybe. And I may have done damage now. I put it in an oven, possibly over 100 degC, for 5 hours.
First run after:
Read blocks speed test:
32768 blocks = 16384 kiB rate = 22.1 MiB/s duration = 721074 us zero-overhead = 372827 us overheads = 48.2 %
16384 blocks = 8192 kiB rate = 16.9 MiB/s duration = 470980 us zero-overhead = 186414 us overheads = 60.4 %
8192 blocks = 4096 kiB rate = 11.2 MiB/s duration = 356398 us zero-overhead = 93207 us overheads = 73.8 %
4096 blocks = 2048 kiB rate = 6.6 MiB/s duration = 302006 us zero-overhead = 46603 us overheads = 84.5 %
2048 blocks = 1024 kiB rate = 3.6 MiB/s duration = 271768 us zero-overhead = 23302 us overheads = 91.4 %
1024 blocks = 512 kiB rate = 3.3 MiB/s duration = 148580 us zero-overhead = 11651 us overheads = 92.1 %
512 blocks = 256 kiB rate = 3.4 MiB/s duration = 72867 us zero-overhead = 5825 us overheads = 92.0 %
256 blocks = 128 kiB rate = 3.6 MiB/s duration = 33802 us zero-overhead = 2913 us overheads = 91.3 %
128 blocks = 64 kiB rate = 4.6 MiB/s duration = 13378 us zero-overhead = 1456 us overheads = 89.1 %
64 blocks = 32 kiB rate = 5.6 MiB/s duration = 5543 us zero-overhead = 728 us overheads = 86.8 %
32 blocks = 16 kiB rate = 4.4 MiB/s duration = 3487 us zero-overhead = 364 us overheads = 89.5 %
16 blocks = 8 kiB rate = 5.2 MiB/s duration = 1485 us zero-overhead = 182 us overheads = 87.7 %
8 blocks = 4 kiB rate = 5.5 MiB/s duration = 701 us zero-overhead = 91 us overheads = 87.0 %
4 blocks = 2 kiB rate = 2.8 MiB/s duration = 684 us zero-overhead = 46 us overheads = 93.2 %
2 blocks = 1 kiB rate = 3.9 MiB/s duration = 246 us zero-overhead = 23 us overheads = 90.6 %
Seventh run after (5 minutes later):
Read blocks speed test:
32768 blocks = 16384 kiB rate = 24.4 MiB/s duration = 654095 us zero-overhead = 372827 us overheads = 43.0 %
16384 blocks = 8192 kiB rate = 18.5 MiB/s duration = 432374 us zero-overhead = 186414 us overheads = 56.8 %
8192 blocks = 4096 kiB rate = 12.5 MiB/s duration = 319853 us zero-overhead = 93207 us overheads = 70.8 %
4096 blocks = 2048 kiB rate = 7.4 MiB/s duration = 266948 us zero-overhead = 46603 us overheads = 82.5 %
2048 blocks = 1024 kiB rate = 4.1 MiB/s duration = 240554 us zero-overhead = 23302 us overheads = 90.3 %
1024 blocks = 512 kiB rate = 3.7 MiB/s duration = 133101 us zero-overhead = 11651 us overheads = 91.2 %
512 blocks = 256 kiB rate = 3.8 MiB/s duration = 64458 us zero-overhead = 5825 us overheads = 90.9 %
256 blocks = 128 kiB rate = 4.2 MiB/s duration = 29343 us zero-overhead = 2913 us overheads = 90.0 %
128 blocks = 64 kiB rate = 6.5 MiB/s duration = 9491 us zero-overhead = 1456 us overheads = 84.6 %
64 blocks = 32 kiB rate = 16.2 MiB/s duration = 1923 us zero-overhead = 728 us overheads = 62.1 %
32 blocks = 16 kiB rate = 15.1 MiB/s duration = 1034 us zero-overhead = 364 us overheads = 64.7 %
16 blocks = 8 kiB rate = 16.2 MiB/s duration = 480 us zero-overhead = 182 us overheads = 62.0 %
8 blocks = 4 kiB rate = 12.2 MiB/s duration = 318 us zero-overhead = 91 us overheads = 71.3 %
4 blocks = 2 kiB rate = 7.2 MiB/s duration = 268 us zero-overhead = 46 us overheads = 82.8 %
2 blocks = 1 kiB rate = 4.0 MiB/s duration = 243 us zero-overhead = 23 us overheads = 90.5 %
Eleventh run after (10 minutes):
Read blocks speed test:
32768 blocks = 16384 kiB rate = 24.6 MiB/s duration = 647795 us zero-overhead = 372827 us overheads = 42.4 %
16384 blocks = 8192 kiB rate = 18.8 MiB/s duration = 425308 us zero-overhead = 186414 us overheads = 56.1 %
8192 blocks = 4096 kiB rate = 12.6 MiB/s duration = 315773 us zero-overhead = 93207 us overheads = 70.4 %
4096 blocks = 2048 kiB rate = 7.6 MiB/s duration = 261738 us zero-overhead = 46603 us overheads = 82.1 %
2048 blocks = 1024 kiB rate = 4.2 MiB/s duration = 236732 us zero-overhead = 23302 us overheads = 90.1 %
1024 blocks = 512 kiB rate = 3.8 MiB/s duration = 130418 us zero-overhead = 11651 us overheads = 91.0 %
512 blocks = 256 kiB rate = 3.9 MiB/s duration = 63736 us zero-overhead = 5825 us overheads = 90.8 %
256 blocks = 128 kiB rate = 4.3 MiB/s duration = 28628 us zero-overhead = 2913 us overheads = 89.8 %
128 blocks = 64 kiB rate = 6.9 MiB/s duration = 8942 us zero-overhead = 1456 us overheads = 83.7 %
64 blocks = 32 kiB rate = 26.5 MiB/s duration = 1175 us zero-overhead = 728 us overheads = 38.0 %
32 blocks = 16 kiB rate = 22.1 MiB/s duration = 705 us zero-overhead = 364 us overheads = 48.3 %
16 blocks = 8 kiB rate = 16.3 MiB/s duration = 479 us zero-overhead = 182 us overheads = 62.0 %
8 blocks = 4 kiB rate = 12.3 MiB/s duration = 317 us zero-overhead = 91 us overheads = 71.2 %
4 blocks = 2 kiB rate = 7.3 MiB/s duration = 264 us zero-overhead = 46 us overheads = 82.5 %
2 blocks = 1 kiB rate = 4.0 MiB/s duration = 239 us zero-overhead = 23 us overheads = 90.3 %
30 minutes in the freezer:
Run 1:
Read blocks speed test:
32768 blocks = 16384 kiB rate = 24.9 MiB/s duration = 640527 us zero-overhead = 372827 us overheads = 41.7 %
16384 blocks = 8192 kiB rate = 19.1 MiB/s duration = 417223 us zero-overhead = 186414 us overheads = 55.3 %
8192 blocks = 4096 kiB rate = 12.8 MiB/s duration = 311159 us zero-overhead = 93207 us overheads = 70.0 %
4096 blocks = 2048 kiB rate = 7.8 MiB/s duration = 256325 us zero-overhead = 46603 us overheads = 81.8 %
2048 blocks = 1024 kiB rate = 4.3 MiB/s duration = 227744 us zero-overhead = 23302 us overheads = 89.7 %
1024 blocks = 512 kiB rate = 3.9 MiB/s duration = 127050 us zero-overhead = 11651 us overheads = 90.8 %
512 blocks = 256 kiB rate = 4.0 MiB/s duration = 61168 us zero-overhead = 5825 us overheads = 90.4 %
256 blocks = 128 kiB rate = 4.5 MiB/s duration = 27760 us zero-overhead = 2913 us overheads = 89.5 %
128 blocks = 64 kiB rate = 6.9 MiB/s duration = 8937 us zero-overhead = 1456 us overheads = 83.7 %
64 blocks = 32 kiB rate = 26.6 MiB/s duration = 1174 us zero-overhead = 728 us overheads = 37.9 %
32 blocks = 16 kiB rate = 15.5 MiB/s duration = 1003 us zero-overhead = 364 us overheads = 63.7 %
16 blocks = 8 kiB rate = 16.1 MiB/s duration = 485 us zero-overhead = 182 us overheads = 62.4 %
8 blocks = 4 kiB rate = 12.4 MiB/s duration = 314 us zero-overhead = 91 us overheads = 71.0 %
4 blocks = 2 kiB rate = 7.3 MiB/s duration = 267 us zero-overhead = 46 us overheads = 82.7 %
2 blocks = 1 kiB rate = 4.0 MiB/s duration = 240 us zero-overhead = 23 us overheads = 90.4 %
Run 12:
Read blocks speed test:
32768 blocks = 16384 kiB rate = 24.7 MiB/s duration = 645458 us zero-overhead = 372827 us overheads = 42.2 %
16384 blocks = 8192 kiB rate = 18.8 MiB/s duration = 423661 us zero-overhead = 186414 us overheads = 55.9 %
8192 blocks = 4096 kiB rate = 12.8 MiB/s duration = 311061 us zero-overhead = 93207 us overheads = 70.0 %
4096 blocks = 2048 kiB rate = 7.7 MiB/s duration = 258025 us zero-overhead = 46603 us overheads = 81.9 %
2048 blocks = 1024 kiB rate = 4.3 MiB/s duration = 230127 us zero-overhead = 23302 us overheads = 89.8 %
1024 blocks = 512 kiB rate = 3.8 MiB/s duration = 128549 us zero-overhead = 11651 us overheads = 90.9 %
512 blocks = 256 kiB rate = 3.9 MiB/s duration = 63045 us zero-overhead = 5825 us overheads = 90.7 %
256 blocks = 128 kiB rate = 4.4 MiB/s duration = 28306 us zero-overhead = 2913 us overheads = 89.7 %
128 blocks = 64 kiB rate = 7.0 MiB/s duration = 8928 us zero-overhead = 1456 us overheads = 83.6 %
64 blocks = 32 kiB rate = 27.5 MiB/s duration = 1133 us zero-overhead = 728 us overheads = 35.7 %
32 blocks = 16 kiB rate = 21.6 MiB/s duration = 722 us zero-overhead = 364 us overheads = 49.5 %
16 blocks = 8 kiB rate = 16.1 MiB/s duration = 483 us zero-overhead = 182 us overheads = 62.3 %
8 blocks = 4 kiB rate = 12.1 MiB/s duration = 321 us zero-overhead = 91 us overheads = 71.6 %
4 blocks = 2 kiB rate = 7.2 MiB/s duration = 268 us zero-overhead = 46 us overheads = 82.8 %
2 blocks = 1 kiB rate = 4.0 MiB/s duration = 240 us zero-overhead = 23 us overheads = 90.4 %
I won't throw it out, but clearly it's not looking like a happy SD card any longer. I'm gonna file it under it-was-already-faulty and say I just sped it to the grave.
EDIT: Huh, that latest pattern above, where the performance runs consistently ok-poor-ok through the test sizes - I now remember one of the cards doing that before. Yeah, I'm concluding the Samsung card has always been sick.
EDIT2: Reminds me of the days of the full spec'd Samsung 840 EVO SSD needing a firmware update for excessively slow read speeds with age of data. And even then it wasn't a perfect fix. https://www.anandtech.com/show/8617/samsung-releases-firmware-update-to-fix-the-ssd-840-evo-read-performance-bug
EDIT3: Ha! Yep, writing fresh data fixes it.
Read blocks speed test:
32768 blocks = 16384 kiB  rate = 37.6 MiB/s  duration = 425349 us  zero-overhead = 372827 us  overheads = 12.3 %
16384 blocks = 8192 kiB  rate = 37.5 MiB/s  duration = 212789 us  zero-overhead = 186414 us  overheads = 12.3 %
8192 blocks = 4096 kiB  rate = 37.5 MiB/s  duration = 106510 us  zero-overhead = 93207 us  overheads = 12.4 %
4096 blocks = 2048 kiB  rate = 37.4 MiB/s  duration = 53370 us  zero-overhead = 46603 us  overheads = 12.6 %
2048 blocks = 1024 kiB  rate = 37.3 MiB/s  duration = 26800 us  zero-overhead = 23302 us  overheads = 13.0 %
1024 blocks = 512 kiB  rate = 36.9 MiB/s  duration = 13515 us  zero-overhead = 11651 us  overheads = 13.7 %
512 blocks = 256 kiB  rate = 36.3 MiB/s  duration = 6872 us  zero-overhead = 5825 us  overheads = 15.2 %
256 blocks = 128 kiB  rate = 35.1 MiB/s  duration = 3552 us  zero-overhead = 2913 us  overheads = 17.9 %
128 blocks = 64 kiB  rate = 33.0 MiB/s  duration = 1891 us  zero-overhead = 1456 us  overheads = 23.0 %
64 blocks = 32 kiB  rate = 29.4 MiB/s  duration = 1061 us  zero-overhead = 728 us  overheads = 31.3 %
32 blocks = 16 kiB  rate = 24.1 MiB/s  duration = 646 us  zero-overhead = 364 us  overheads = 43.6 %
16 blocks = 8 kiB  rate = 17.8 MiB/s  duration = 438 us  zero-overhead = 182 us  overheads = 58.4 %
8 blocks = 4 kiB  rate = 14.1 MiB/s  duration = 277 us  zero-overhead = 91 us  overheads = 67.1 %
4 blocks = 2 kiB  rate = 8.5 MiB/s  duration = 228 us  zero-overhead = 46 us  overheads = 79.8 %
2 blocks = 1 kiB  rate = 4.8 MiB/s  duration = 203 us  zero-overhead = 23 us  overheads = 88.6 %
Yeah, the cell charge drifting - a cell-level calibration thingy. QLC flash will be the worst for this. Back in 840 EVO days it was still TLC.
Oddly, it has always seemed to be a Samsung exclusive issue though.
Roger,
Regarding the High-Speed access mode switching: I've sort of poo-poo'd it a little in the past because it appeared inconsistent as to how each card responded - in particular, that some phase-shifted the clock while others didn't. Well, I've come to the realisation there is a high likelihood those cards that didn't adjust their phase timing probably also didn't change modes. Back then, I never wrote any code to confirm the mode change had occurred. I just requested it and assumed it happened.
And the reason why some cards might not make the change could easily be because it was the SPI interface, and I doubt there is any requirement for such features to be supported in that interface type.
Certainly, in SD interface type, High-Speed access mode has been entirely consistent with all my cards. Not that it seems to offer any measurable advantage though.
Doing a little write-up for posterity:
Working through the steps for High-Speed has resulted in a mostly convenient symbiosis between the clock phase and clock polarity:
Using that helps with the more complex rx side of the equation.
But, as per usual, tx timing is different from rx timing. Using the streamer means predicting all these relationships - there is no hardware synchronising to help, not even at the bus clock level. Everything, rx and tx, comes down to pin sampling. On the Prop2, as the master in this setup, the SD data and clock outputs on the pins are refreshed in unison each sysclock. Since it's important not to output fresh data along with the rising clock edge, it's up to the software to ensure they are separated by at least one sysclock tick. At sysclock/2, this means ensuring they always occur on alternating ticks - i.e. the clock falls on the tick where the data pins are updated. Different story for the slave device. Well, at least until hardware delay lines get added.
On the bright side, as the bus master, tx is easier than rx because the master controls when each clock pulse occurs and therefore can pre-align data with that clock. Which is good because, to maintain the tx clock-data phase relationship when changing clock polarity, a matching timing shift in that pre-alignment is then needed.
Yeah, one would like to hope there was some sort of consistency with timing phase amongst boards running in SD mode. Good catch about only testing previously in SPI mode.
I found the same effect for memory with writes vs reads. With writes you have full control of the clock phase, and as the output pin states are almost perfectly synchronized it's much easier to get consistent write results over the frequency range for sysclk/2 or lower.
Reads are harder because they involve latencies in the chip, delays on the board, and delays in the target device. At high speed this needs some sort of calibration to get error-free data.
It's almost going! So close. But there is still some newly introduced bug I'm not quite seeing. The old file read/write speed tester - the one I used when developing the smartpin SPI driver - works ... in selected ways. Just not in every way like the SD mode development code does. Non-inverted clock is causing grief: the calibration routine is declaring success on values that actually fail. But it works correctly when the clock is inverted. This should be a tell-tale of the cause but so far I can't see it. I haven't dug out the oscilloscope just yet, so I guess that's probably up next.
There is quite a long list of changes between the two solutions. Accommodating the filesystem interfacing produced plenty of variation that I wasn't considering during development. Much has been resolved, from basic typos to inverted return-code logic and trimmed-down logic. Stuff has been renamed, stack allocation and pointer passing added where there were static buffers, compile options removed, and old routines that hadn't been touched in a while cleaned up.
Ah. The constraints of reality strike back. I know how that feels.
Found it! Perfect example of "it works but don't ask why and don't fiddle." It was a minor tweak I'd done in passing during an optimising clean-up quite early on in the conversion. The clean-up was around removal of all the latency measuring diagnostic code. Lucky I'd done a backup just before starting so was able to realise when the bug was introduced.
During the clean-up I'd noticed I had an excess of clocks leading into issuing the command. This created a longer delay before each command was issued because it waited for the completion of these pulses. This was a remnant of the wait-for-busy check that had briefly existed at the head of command issuing before it got split off on its own. I thought about it for a moment and decided a single clock pulse was enough to trigger an event - why have more? However, it turns out that, at some point in the sequencing, the SD cards need two extra pulses there. One isn't always enough.
I don't really know why, but it fixes this issue. ... Hmmm, or maybe I need to look harder - just now one of the cards needed it set to 3 pulses!
PS: It seems to matter only during block write sequencing.
Aren't there some CSD structures that indicate how many clocks you'd need between commands for timeouts etc, or perhaps I'm thinking of Flash stuff.
It's a fixed number of eight bits. And yeah, that'll be the problem, I don't explicitly wait for that to happen after a command sequence.
I've now realised that there's something else playing up too. I need more sleep to get a good run at it.
I should have said eight clocks rather than eight bits.
Anyway, looks like the other problem was merely that I'd removed the exact binary compare when performing calibration. I'd started relying on the CRC alone, but that proved not to be rugged enough for the calibration process. I would dearly love to have access to the official UHS tuning method, but sadly the SD cards don't respond to CMD19 without engaging UHS mode first. EDIT: Correction, one of the six cards tested actually seems to support CMD19 without switching into UHS first. Not much help for me though.
CMD19 is compulsory for UHS-I compliance, so I figure it is supported by all my SD cards once UHS is engaged.
EDIT2: One detail from the CMD19 procedure is that it says up to 40 repeats should be used for certainty. That's something I'll adopt myself.
I've found that the only places I needed to ensure the extra trailing clocks were after expected non-responses - namely CMD7 for deselect, and also CMD0. A normal response, which already has a couple of extra clocks anyway, doesn't seem to need all eight spacing clocks. I've fixed the exceptions and left the pre-command clocks set at one. It could still come back to bite me, I suppose.
What's left for you to do now @evanh? Is this mainly bug fixing, or optimization work now, or are there still pieces remaining to be coded for the burst reads/writes?
It's usable right now. I presume you're interested in giving it a go?
Down to beta testing and looking for stray bugs I guess.
Attached is the patched vfs.h header file that replaces the existing one in
include/sys
and the new driver directory that gets added to
include/filesys
and lastly my current tester program to get you going.
EDIT: Updated the tester program to show the clock divider in use. This is a newly exposed feature - the driver had been using a constant until today.
Depending on my capture board bring up and how it works out, I probably would like to try something soon. It will be interesting to see if video can be captured in real time to SD card.