Actually, the Single-block loop is even bigger now. Looking at it, I'm a little surprised the Multi-block path is working as well as it does. Almost all of the logic is done in the assembly. The only part still in C is the decision, and loop, over whether the block count fits the buffer.
So with your overheads and what gaps you've seen before with fast cards, what sort of sustained transfer rates do you expect will be achievable on reads (no-CRC check enabled) and writes (with CRC)? Can we get 28 MiB/s non-stop running on a 270 MHz P2? That would allow 30 fps video at 640x480x24 bits in pure RGB (no 4:2:2 subsampling) with audio. 30 MiB/s would allow 24 fps widescreen 858x480p or thereabouts.
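For what it's worth, the raw (audio-excluded) payload rates behind those video targets can be checked with a few lines of C - a back-of-envelope sketch, not driver code:

```c
#include <stdint.h>

/* Sustained byte rate needed for uncompressed RGB video:
   width * height * 3 bytes per pixel * frames per second. */
uint64_t rgb_bytes_per_sec(uint32_t w, uint32_t h, uint32_t fps)
{
    return (uint64_t)w * h * 3u * fps;
}

/* Same figure expressed in whole MiB/s, rounded down. */
uint32_t rgb_mib_per_sec(uint32_t w, uint32_t h, uint32_t fps)
{
    return (uint32_t)(rgb_bytes_per_sec(w, h, fps) / (1024u * 1024u));
}
```

640x480x24 at 30 fps comes to 27,648,000 B/s (about 26.4 MiB/s), so a sustained 28 MiB/s leaves some headroom for audio and filesystem overhead; 858x480 at 24 fps is 29,652,480 B/s, just over 28 MiB/s.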
270 MHz sysclock, clock divider of 3, with CRC processing enabled, it can read data at 36 MB/s. EDIT: Ah, well, still need to add filesystem overheads to that. Time to get back to integrating into the driver ...
EDIT2: Oh, the 36 MB/s was with an 8 kB buffer size, btw. A 64 kB buffer moves that up to 38 MB/s. And 16 kB gives a solid 37 MB/s.
EDIT3: Disabling the CRC processing and using a sysclock/2 divider takes that to 55 MB/s. Or even a little more, 58 MB/s, with the Sandisk cards.
EDIT4: 270 MHz with sysclock/2 (135 MHz SD clock) is of course massively overclocking the SD bus for the 3.3 Volt High Speed interface. It does, however, fit within the upper limit of the 1.8 Volt UHS-I interface (which can operate in spec up to an insane 208 MHz SD clock). So I guess that's why newer cards just accept it and keep up.
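For reference, the per-DAT-line CRC that the "CRC processing" above spends its cycles on is the SD spec's 16-bit CCITT CRC (polynomial 0x1021, initial value 0 - the same algorithm as CRC-16/XMODEM). A minimal bitwise sketch, nothing like the optimised assembly, just to pin down the algorithm:

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-16 as used on SD DAT lines: poly x^16 + x^12 + x^5 + 1 (0x1021),
   init 0, no bit reflection. In 4-bit bus mode each DAT line carries
   its own CRC16 over just the bits that line transported. */
uint16_t sd_crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```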
The Samsung EVO card has the strangest behaviour. I've hesitated to mention it before but it does seem to be persisting. Upon first run of the testing it performs exceptionally poorly, at about half the expected speeds - even on repeated reads of the same blocks over and over. One whole test run rereads the same sequential block list 15 times, with each loop halving the total number to read.
After the first run it's fine. It's like the card needs a few seconds to warm up.
The poor results of first time run:
Then the very next run is this:
Maybe it fits in some sort of internal cache? 16MB is not beyond the realm of fitting into a cache. If you try a much larger transfer range test that has no chance of fitting then maybe you won't see such a difference between runs 1 and 2.
It would be sorted after the first loop then. The second line of the first test run should show a dramatic up-tick in performance but it doesn't.
And power cycling doesn't revert the performance either. It's still fine after swapping cards for a while and coming back to the Samsung. The problem only seems to show up after hours or days of no power.
I'm guessing if I made a test that ran for say 30 seconds and graphed progress every 0.1 second I'd see it rise suddenly a few seconds into the run. But only when the card has been cold.
What if you start with a warm card to begin with? Sit it next to a heat source for a bit then test it.
Wow, I'm impressed. Even the older cards are performing at 135 MHz SD clock. Here's my oldest card, the Adata Silver (2013), at sysclock/2 without CRC processing. Note the first line has a repeatable latency spike:
The Apacer (2018) is clean though:
Yeah, no, that was a euphemistic use of cold. But, taking the hint, I've now tested it as an actual thermally cold card and it's still behaving perfectly fine first try. So cold in this case only seems to be when unpowered for days.
A 10 hour gap isn't enough. Samsung EVO worked first try. Although, the first line does indicate a minor latency extend there:
Which vanishes again on subsequent runs:
Seems weird. Charge leakage?
Maybe. And I may have done damage now. I put it in an oven, possibly over 100 degC, for 5 hours.
First run after:
Seventh run after (5 minutes later):
Eleventh run after (10 minutes):
30 minutes in the freezer:
Run 1:
Run 12:
I won't throw it out, but clearly it's not looking like a happy SD card any longer. I'm gonna file it under it-was-already-faulty and I just sped it to the grave.
EDIT: Huh, that latest pattern above, where the performance was consistently ok-poor-ok through the test sizes - I do now remember one of the cards doing that before. Yeah, I'm concluding the Samsung card has always been sick.
EDIT2: Reminds me of the days of the full spec'd Samsung 840 EVO SSD needing a firmware update for excessively slow read speeds with age of data. And even then it wasn't a perfect fix. https://www.anandtech.com/show/8617/samsung-releases-firmware-update-to-fix-the-ssd-840-evo-read-performance-bug
EDIT3: Ha! Yep, writing fresh data fixes it.
Yeah, the cell charge, a cell-level calibration thingy. QLC flash will be the worst for this. Back in the 840 EVO days it was still TLC.
Oddly, it has always seemed to be a Samsung exclusive issue though.
Roger,
Regarding the High-Speed access mode switching. I've sort of poo-poo'd it a little in the past because it appeared inconsistent as to how each card responded. In particular, that some phase-shifted the clock while others didn't ... Well, I've come to the realisation there is a high likelihood that those cards that didn't adjust their phase timing probably also didn't change modes. Back then, I never wrote any code to confirm the mode change had occurred. I just requested it and assumed it happened.
And the reason why some cards might not make the change could easily be because it was the SPI interface, and I doubt there is any requirement for such features to be supported in that interface type.
Certainly, in SD interface type, High-Speed access mode has been entirely consistent with all my cards. Not that it seems to offer any measurable advantage though.
Doing a little write up for posterity:
Working through the steps for High-Speed has resulted in a mostly convenient symbiosis between the clock phase and clock polarity:
When Default Speed access mode is active, the SD card outputs CMD/DAT on the falling clock edge. And by setting clock polarity to negative (inverted), this then means the starting, falling, edge of each clock pulse produces new data for the streamer to sample.
When High Speed access mode is active, the SD card outputs CMD/DAT on the rising clock edge. And by setting clock polarity to positive, this preserves the starting, rising, edge of each clock pulse producing new data for the streamer to sample.
Using that helps with the more complex rx side of the equation.
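That pairing can be captured in a tiny lookup - purely illustrative, these names are not the driver's actual API:

```c
#include <stdbool.h>

/* Which clock edge the card launches CMD/DAT on, per access mode, and
   the clock polarity chosen so that the *leading* edge of each pulse
   is the one presenting fresh data to the streamer. */
typedef enum { DEFAULT_SPEED, HIGH_SPEED } sd_access_mode_t;

typedef struct {
    bool card_outputs_on_rising; /* card launches data on rising edge? */
    bool clock_inverted;         /* chosen polarity: true = negative   */
} sd_clock_cfg_t;

sd_clock_cfg_t sd_clock_cfg(sd_access_mode_t mode)
{
    if (mode == DEFAULT_SPEED)
        /* Card launches on the falling edge -> invert the clock so
           each pulse *starts* with that falling edge. */
        return (sd_clock_cfg_t){ .card_outputs_on_rising = false,
                                 .clock_inverted = true };
    /* High Speed: card launches on the rising edge -> normal polarity
       keeps the starting edge as the data-launching edge. */
    return (sd_clock_cfg_t){ .card_outputs_on_rising = true,
                             .clock_inverted = false };
}
```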
But, as per usual, tx timing is different from rx timing. Using the streamer means predicting all these relationships. There is no hardware synchronising to help, not even at the bus clock level. Everything, rx and tx, is about pin sampling. On the Prop2, as the master in this setup, the outputting of SD data and clock on the pins is refreshed in unison each sysclock. As it's important not to output fresh data along with the rising clock edge, it's up to the software to ensure they are separated by at least one sysclock tick. At sysclock/2, this means ensuring they always occur on alternating ticks - which means the clock falls when updating data pins. A different story from the slave device. Well, at least until hardware delay lines get added.
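A toy model of that sysclock/2 tx phasing (illustrative only, not streamer code): pins refresh once per sysclock tick, the clock pin alternates each tick, and data-pin updates are scheduled only on the ticks where the clock goes low - so a fresh data value never coincides with a rising edge.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simulate nticks of sysclock/2 operation. Returns true if no tick
   ever updated the data pins in the same tick as a rising clock edge. */
bool tx_phasing_ok(size_t nticks)
{
    bool clk = true;                  /* idle-high before tick 0       */
    for (size_t t = 0; t < nticks; t++) {
        bool next_clk = (t & 1) != 0; /* even tick: low, odd tick: high */
        bool rising = !clk && next_clk;
        bool data_update = !next_clk; /* update data as the clock falls */
        if (rising && data_update)
            return false;             /* would violate the separation   */
        clk = next_clk;
    }
    return true;
}
```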
On the bright side, as the bus master, tx is easier than rx because the master controls when each clock pulse occurs and therefore can pre-align data with that clock. Which is good because, to maintain the tx clock-data phase relationship, when changing clock polarity, a timing shift in the pre-alignment is then needed.
Yeah, one would like to hope there was some sort of consistency with timing phase amongst boards running in SD mode. Good catch about only testing previously in SPI mode.
I found the same effect for memory with writes vs reads. With writes you have full control of the clock phase, and as the output pin states are almost perfectly synchronized it's much easier to get consistent write results over the frequency range for sysclk/2 or lower.
Reads are harder because they involve latencies in the chip and delays on the board and in the target device. At high speed this needs some sort of calibration to get error free data.
It's almost going! So close. But there is still some newly introduced bug I'm not quite seeing. The old file read/write speed tester works - the one I used when developing the smartpin SPI driver - ... in selected ways. Just not in every way like the SD mode development code does. A non-inverted clock is causing grief: the calibration routine is finding success with failure values. But it works correctly when the clock is inverted. This should be a tell-tale of the cause but so far I can't see it. I haven't dug out the oscilloscope just yet so I guess that's probably up next.
There is quite a long list of changes between the two solutions. Accommodating the filesystem interfacing produced plenty of variation that I wasn't considering during development. Much has been resolved, from basic typos to inverted return-code logic and trimming down the logic. Stuff has been renamed; stack allocation and pointer passing added where there were static buffers. Removal of compile options. Clean-ups of old routines that hadn't been touched in a while.
Ah. The constraints of reality strike back. I know how that feels.
Found it! Perfect example of "it works but don't ask why and don't fiddle." It was a minor tweak I'd done in passing during an optimising clean-up quite early on in the conversion. The clean-up was around removal of all the latency-measuring diagnostic code. Luckily I'd done a backup just before starting, so I was able to pin down when the bug was introduced.
During the clean-up I'd noticed I had an excess of clocks leading into issuing the command. This created a longer delay before each command was issued because it waited for the completion of these pulses. This was a remnant of the wait-for-busy check that had briefly existed at the head of command issuing before it got split off on its own. I thought about it for a moment and decided a single clock pulse was enough to trigger an event - why have more? However, it turns out that, at some point in sequencing, the SD cards need two extra pulses there. One isn't always enough.
I don't really know why, but it fixes this issue. ... hmmm, or maybe I need to look harder - just now I had one of the cards need it set to 3 pulses!
PS: It seems to matter only during block write sequencing.
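For context on what "issuing the command" entails: the SD spec frames every command as 48 bits - start bit 0, host bit 1, 6-bit command index, 32-bit argument, CRC7, end bit 1. A minimal frame builder (a generic sketch from the spec, not the driver's assembly):

```c
#include <stdint.h>
#include <stddef.h>

/* CRC7 over the first five frame bytes: poly x^7 + x^3 + 1 (0x09). */
uint8_t sd_crc7(const uint8_t *p, size_t n)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < n; i++)
        for (int b = 7; b >= 0; b--) {
            int fb = ((p[i] >> b) & 1) ^ (crc >> 6);
            crc = (uint8_t)((crc << 1) & 0x7F);
            if (fb)
                crc ^= 0x09;
        }
    return crc;
}

/* Build the 6-byte command frame: index byte, big-endian argument,
   then (CRC7 << 1) | end bit. */
void sd_build_cmd(uint8_t frame[6], uint8_t index, uint32_t arg)
{
    frame[0] = (uint8_t)(0x40 | (index & 0x3F)); /* start=0, host=1 */
    frame[1] = (uint8_t)(arg >> 24);
    frame[2] = (uint8_t)(arg >> 16);
    frame[3] = (uint8_t)(arg >> 8);
    frame[4] = (uint8_t)arg;
    frame[5] = (uint8_t)((sd_crc7(frame, 5) << 1) | 1);
}
```

The classic check values: CMD0 with a zero argument frames as 40 00 00 00 00 95, and CMD8 with argument 0x1AA ends in 0x87.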
Aren't there some CSD structures that indicate how many clocks you'd need between commands for timeouts etc, or perhaps I'm thinking of Flash stuff.
It's a fixed number of eight bits. And yeah, that'll be the problem, I don't explicitly wait for that to happen after a command sequence.
I've now realised that there's something else playing up too. I need more sleep to get a good run at it.
I should have said eight clocks rather than eight bits.
Anyway, looks like the other problem was merely that I'd removed the exact binary compare when performing calibration. I'd started relying on the CRC alone, but that proved not to be rugged enough for the calibration process. I would dearly love to have access to the official UHS method. Sadly, the SD cards don't respond to CMD19 without engaging UHS mode first. EDIT: Correction, one of six cards tested actually seems to support CMD19 without switching into UHS first. Not much help for me.
CMD19 is compulsory for UHS-I compliance so I figure it is supported by all my SD cards when UHS is engaged.
EDIT2: One detail from the CMD19 procedure is it says up to 40 repeats to be used for certainty. That's something I'll adopt myself.
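Putting those pieces together, the repaired calibration flow can be sketched like this (hypothetical helper names; the read callback and phase range stand in for the driver's real rx-delay settings). Echoing the CMD19 advice, each candidate setting is retried many times and must pass an exact binary compare every time, not just the CRC:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define CAL_REPEATS 40  /* per the CMD19 guidance: up to 40 repeats */

/* Hypothetical read callback: reads a reference block using rx-delay
   setting 'phase'; returns 1 if the CRC checked out, 0 otherwise. */
typedef int (*cal_read_fn)(int phase, uint8_t *buf, size_t len);

/* Try each candidate rx phase in turn; accept the first one whose
   reads pass the CRC *and* match the reference block exactly on every
   repeat. Returns the winning phase, or -1 if none calibrates.
   (len is assumed <= 512.) */
int calibrate_rx_phase(cal_read_fn rd, const uint8_t *ref, size_t len,
                       int phases)
{
    uint8_t buf[512];
    for (int p = 0; p < phases; p++) {
        int ok = 1;
        for (int r = 0; r < CAL_REPEATS && ok; r++)
            if (!rd(p, buf, len) || memcmp(buf, ref, len) != 0)
                ok = 0; /* CRC alone proved not rugged enough */
        if (ok)
            return p;
    }
    return -1;
}

/* --- Tiny mock for illustration: phase 2 is the only clean setting;
       phase 1 returns corrupted data yet still reports a CRC pass,
       i.e. "success with failure values". --- */
const uint8_t ref_block[8] = { 0xDE, 0xAD, 0xBE, 0xEF, 1, 2, 3, 4 };

int mock_read(int phase, uint8_t *buf, size_t len)
{
    memcpy(buf, ref_block, len);
    if (phase == 2)
        return 1;      /* clean read, genuine CRC pass  */
    buf[0] ^= 0x80;    /* corrupted sample ...          */
    return phase == 1; /* ... and phase 1 lies about it */
}
```

With the exact compare in place the mock calibrates to phase 2; a CRC-only check would have accepted the lying phase 1 first.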
I've found that the only places I needed to ensure the extra trailing clocks were after expected non-responses. Namely CMD7 for deselect, and also CMD0. A normal response, which already has a couple of extra clocks anyway, doesn't seem to need all eight spacing clocks. I've fixed the exceptions and left the pre-command clocks set at one. It could still come back to bite me, I suppose.
What's left for you to do now @evanh? Is this mainly bug fixing, or optimization work now, or are there still pieces remaining to be coded for the burst reads/writes?
It's usable right now. I presume you're interested in giving it a go?
Down to beta testing and looking for stray bugs I guess.
Attached is the patched vfs.h header file that replaces the existing one in include/sys,
and the new driver directory that gets added to include/filesys,
and lastly my current tester program to get you going.
EDIT: Updated the tester program to show the used clock divider. This is a newly exposed feature. The driver had been using a constant until today.
Depending on my capture board bring up and how it works out, I probably would like to try something soon. It will be interesting to see if video can be captured in real time to SD card.