Actually, the Single-block loop is even bigger now. Looking at it, I'm a little surprised the Multi-block path is working as well as it does. Almost all of the logic is done in the assembly. The only part still in C is the decision, and loop, over whether the block count fits the buffer.
So with your overheads and what gaps you've seen before with fast cards, what sort of sustained transfer rates do you expect will be achievable on reads (no-CRC check enabled) and writes (with CRC)? Can we get 28 MiB/s non-stop running on a 270 MHz P2? That would allow 30 fps video at 640x480x24 bits in pure RGB (no 4:2:2 subsampling) with audio. 30 MiB/s would allow 24 fps widescreen 858x480p or thereabouts.
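For what it's worth, the raw (audio-excluded) payload rates behind those video targets can be checked with a few lines of C - a back-of-envelope sketch, not driver code:

```c
#include <stdint.h>

/* Sustained byte rate needed for uncompressed RGB video:
   width * height * 3 bytes per pixel * frames per second. */
uint64_t rgb_bytes_per_sec(uint32_t w, uint32_t h, uint32_t fps)
{
    return (uint64_t)w * h * 3u * fps;
}

/* Same figure expressed in whole MiB/s, rounded down. */
uint32_t rgb_mib_per_sec(uint32_t w, uint32_t h, uint32_t fps)
{
    return (uint32_t)(rgb_bytes_per_sec(w, h, fps) / (1024u * 1024u));
}
```

640x480x24 at 30 fps comes to 27,648,000 B/s (about 26.4 MiB/s), so a sustained 28 MiB/s leaves some headroom for audio and filesystem overhead; 858x480 at 24 fps is 29,652,480 B/s, just over 28 MiB/s.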
270 MHz sysclock, clock divider of 3, with CRC processing enabled, it can read data at 36 MB/s. EDIT: Ah, well, still need to add filesystem overheads to that. Time to get back to integrating into the driver ...
EDIT2: Oh, the 36 MB/s was with an 8 kB buffer size, btw. A 64 kB buffer moves that up to 38 MB/s. And 16 kB gives a solid 37 MB/s.
EDIT3: Disabling the CRC processing and using a sysclock/2 divider takes that to 55 MB/s. Or even a little more, 58 MB/s, with the Sandisk cards.
EDIT4: 270 MHz with sysclock/2 (135 MHz SD clock) is of course massively overclocking the SD bus for the 3.3 Volt High Speed interface. It does, however, fit within the upper limit of the 1.8 Volt UHS-I interface (which can operate in spec up to an insane 208 MHz SD clock). So I guess that's why newer cards just accept it and keep up.
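For reference, the per-DAT-line CRC that the "CRC processing" above spends its cycles on is the SD spec's 16-bit CCITT CRC (polynomial 0x1021, initial value 0 - the same algorithm as CRC-16/XMODEM). A minimal bitwise sketch, nothing like the optimised assembly, just to pin down the algorithm:

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-16 as used on SD DAT lines: poly x^16 + x^12 + x^5 + 1 (0x1021),
   init 0, no bit reflection. In 4-bit bus mode each DAT line carries
   its own CRC16 over just the bits that line transported. */
uint16_t sd_crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```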
The Samsung EVO card has the strangest behaviour. I've hesitated to mention it before but it does seem to be persisting. Upon first run of the testing it performs exceptionally poorly, at about half the expected speeds - even on repeated reads of the same blocks over and over. One whole test run rereads the same sequential block list 15 times, with each loop halving the total number to read.
After the first run it's fine. It's like the card needs a few seconds to warm up.
The poor results of first time run:
Then the very next run is this:
Maybe it fits in some sort of internal cache? 16MB is not beyond the realm of fitting into a cache. If you try a much larger transfer range test that has no chance of fitting then maybe you won't see such a difference between runs 1 and 2.
It would be sorted after the first loop then. The second line of the first test run should show a dramatic up-tick in performance but it doesn't.
And power cycling doesn't revert the performance either. It's still fine after swapping cards for a while and coming back to the Samsung. The problem only seems to show up after hours or days of no power.
I'm guessing if I made a test that ran for say 30 seconds and graphed progress every 0.1 second I'd see it rise suddenly a few seconds into the run. But only when the card has been cold.
What if you start with a warm card to begin with? Sit it next to a heat source for a bit then test it.
Wow, I'm impressed. Even the older cards are performing at 135 MHz SD clock. Here's my oldest card, the Adata Silver (2013), at sysclock/2 without CRC processing. Note the first line has a repeatable latency spike:
The Apacer (2018) is clean though:
Yeah, no, that was a euphemistic use of cold. But, taking the hint, I've now tested it as an actual thermally cold card and it's still behaving perfectly fine first try. So cold in this case only seems to be when unpowered for days.
A 10 hour gap isn't enough. Samsung EVO worked first try. Although, the first line does indicate a minor latency extend there:
Which vanishes again on subsequent runs:
Seems weird. Charge leakage?
Maybe. And I may have done damage now. I put it in an oven, possibly over 100 degC, for 5 hours.
First run after:
Seventh run after (5 minutes later):
Eleventh run after (10 minutes):
30 minutes in the freezer:
Run 1:
Run 12:
I won't throw it out, but clearly it's not looking like a happy SD card any longer. I'm gonna file it under it-was-already-faulty and I just sped it to the grave.
EDIT: Huh, that latest pattern above, where the performance was consistently ok-poor-ok through the test sizes - I do now remember one of the cards doing that before. Yeah, I'm concluding the Samsung card has always been sick.
EDIT2: Reminds me of the days of the full spec'd Samsung 840 EVO SSD needing a firmware update for excessively slow read speeds with age of data. And even then it wasn't a perfect fix. https://www.anandtech.com/show/8617/samsung-releases-firmware-update-to-fix-the-ssd-840-evo-read-performance-bug
EDIT3: Ha! Yep, writing fresh data fixes it.
Yeah, the cell charge, a cell-level calibration thingy. QLC flash will be the worst for this. Back in the 840 EVO days it was still TLC.
Oddly, it has always seemed to be a Samsung exclusive issue though.
Roger,
Regarding the High-Speed access mode switching. I've sort of poo-poo'd it a little in the past because it appeared inconsistent as to how each card responded. In particular, that some phase-shifted the clock while others didn't ... Well, I've come to the realisation there is a high likelihood that those cards that didn't adjust their phase timing probably also didn't change modes. Back then, I never wrote any code to confirm the mode change had occurred. I just requested it and assumed it happened.
And the reason why some cards might not make the change could easily be because it was the SPI interface, and I doubt there is any requirement for such features to be supported in that interface type.
Certainly, in SD interface type, High-Speed access mode has been entirely consistent with all my cards. Not that it seems to offer any measurable advantage though.
Doing a little write up for posterity:
Working through the steps for High-Speed has resulted in a mostly convenient symbiosis between the clock phase and clock polarity:
When Default Speed access mode is active, the SD card outputs CMD/DAT on the falling clock edge. And by setting clock polarity to negative (inverted), this then means the starting, falling, edge of each clock pulse produces new data for the streamer to sample.
When High Speed access mode is active, the SD card outputs CMD/DAT on the rising clock edge. And by setting clock polarity to positive, this preserves the starting, rising, edge of each clock pulse producing new data for the streamer to sample.
Using that helps with the more complex rx side of the equation.
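That pairing can be captured in a tiny lookup - purely illustrative, these names are not the driver's actual API:

```c
#include <stdbool.h>

/* Which clock edge the card launches CMD/DAT on, per access mode, and
   the clock polarity chosen so that the *leading* edge of each pulse
   is the one presenting fresh data to the streamer. */
typedef enum { DEFAULT_SPEED, HIGH_SPEED } sd_access_mode_t;

typedef struct {
    bool card_outputs_on_rising; /* card launches data on rising edge? */
    bool clock_inverted;         /* chosen polarity: true = negative   */
} sd_clock_cfg_t;

sd_clock_cfg_t sd_clock_cfg(sd_access_mode_t mode)
{
    if (mode == DEFAULT_SPEED)
        /* Card launches on the falling edge -> invert the clock so
           each pulse *starts* with that falling edge. */
        return (sd_clock_cfg_t){ .card_outputs_on_rising = false,
                                 .clock_inverted = true };
    /* High Speed: card launches on the rising edge -> normal polarity
       keeps the starting edge as the data-launching edge. */
    return (sd_clock_cfg_t){ .card_outputs_on_rising = true,
                             .clock_inverted = false };
}
```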
But, as per usual, tx timing is different from rx timing. Using the streamer means predicting all these relationships. There is no hardware synchronising to help, not even at the bus clock level. Everything, rx and tx, is about pin sampling. On the Prop2, as the master in this setup, the outputting of SD data and clock on the pins is refreshed in unison each sysclock. As it's important not to output fresh data along with the rising clock edge, it's up to the software to ensure they are separated by at least one sysclock tick. At sysclock/2, this means ensuring they always occur on alternating ticks - which means the clock falls when updating data pins. A different story from the slave device. Well, at least until hardware delay lines get added.
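A toy model of that sysclock/2 tx phasing (illustrative only, not streamer code): pins refresh once per sysclock tick, the clock pin alternates each tick, and data-pin updates are scheduled only on the ticks where the clock goes low - so a fresh data value never coincides with a rising edge.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simulate nticks of sysclock/2 operation. Returns true if no tick
   ever updated the data pins in the same tick as a rising clock edge. */
bool tx_phasing_ok(size_t nticks)
{
    bool clk = true;                  /* idle-high before tick 0       */
    for (size_t t = 0; t < nticks; t++) {
        bool next_clk = (t & 1) != 0; /* even tick: low, odd tick: high */
        bool rising = !clk && next_clk;
        bool data_update = !next_clk; /* update data as the clock falls */
        if (rising && data_update)
            return false;             /* would violate the separation   */
        clk = next_clk;
    }
    return true;
}
```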
On the bright side, as the bus master, tx is easier than rx because the master controls when each clock pulse occurs and therefore can pre-align data with that clock. Which is good because, to maintain the tx clock-data phase relationship, when changing clock polarity, a timing shift in the pre-alignment is then needed.
Yeah, one would like to hope there was some sort of consistency with timing phase amongst boards running in SD mode. Good catch about only testing previously in SPI mode.
I found the same effect for memory with writes vs reads. With writes you have full control of the clock phase, and as the output pin states are almost perfectly synchronized it's much easier to get consistent write results over the frequency range for sysclk/2 or lower.
Reads are harder because they involve latencies in the chip and delays on the board and in the target device. At high speed this needs some sort of calibration to get error free data.
It's almost going! So close. But there is still some newly introduced bug I'm not quite seeing. The old file read/write speed tester works - the one I used when developing the smartpin SPI driver - ... in selected ways. Just not in every way like the SD mode development code does. A non-inverted clock is causing grief: the calibration routine is finding success with failure values. But it works correctly when the clock is inverted. This should be a tell-tale of the cause but so far I can't see it. I haven't dug out the oscilloscope just yet so I guess that's probably up next.
There is quite a long list of changes between the two solutions. Accommodating the filesystem interfacing produced plenty of variation that I wasn't considering during development. Much has been resolved, from basic typos to inverted return-code logic and trimming down the logic. Stuff has been renamed; stack allocation and pointer passing added where there were static buffers. Removal of compile options. Clean-ups of old routines that hadn't been touched in a while.
Ah. The constraints of reality strike back. I know how that feels.
Found it! Perfect example of "it works but don't ask why and don't fiddle." It was a minor tweak I'd done in passing during an optimising clean-up quite early on in the conversion. The clean-up was around removal of all the latency-measuring diagnostic code. Luckily I'd done a backup just before starting, so I was able to pin down when the bug was introduced.
During the clean-up I'd noticed I had an excess of clocks leading into issuing the command. This created a longer delay before each command was issued because it waited for the completion of these pulses. This was a remnant of the wait-for-busy check that had briefly existed at the head of command issuing before it got split off on its own. I thought about it for a moment and decided a single clock pulse was enough to trigger an event - why have more? However, it turns out that, at some point in sequencing, the SD cards need two extra pulses there. One isn't always enough.
I don't really know why, but it fixes this issue. ... hmmm, or maybe I need to look harder - just now I had one of the cards need it set to 3 pulses!
PS: It seems to matter only during block write sequencing.
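For context on what "issuing the command" entails: the SD spec frames every command as 48 bits - start bit 0, host bit 1, 6-bit command index, 32-bit argument, CRC7, end bit 1. A minimal frame builder (a generic sketch from the spec, not the driver's assembly):

```c
#include <stdint.h>
#include <stddef.h>

/* CRC7 over the first five frame bytes: poly x^7 + x^3 + 1 (0x09). */
uint8_t sd_crc7(const uint8_t *p, size_t n)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < n; i++)
        for (int b = 7; b >= 0; b--) {
            int fb = ((p[i] >> b) & 1) ^ (crc >> 6);
            crc = (uint8_t)((crc << 1) & 0x7F);
            if (fb)
                crc ^= 0x09;
        }
    return crc;
}

/* Build the 6-byte command frame: index byte, big-endian argument,
   then (CRC7 << 1) | end bit. */
void sd_build_cmd(uint8_t frame[6], uint8_t index, uint32_t arg)
{
    frame[0] = (uint8_t)(0x40 | (index & 0x3F)); /* start=0, host=1 */
    frame[1] = (uint8_t)(arg >> 24);
    frame[2] = (uint8_t)(arg >> 16);
    frame[3] = (uint8_t)(arg >> 8);
    frame[4] = (uint8_t)arg;
    frame[5] = (uint8_t)((sd_crc7(frame, 5) << 1) | 1);
}
```

The classic check values: CMD0 with a zero argument frames as 40 00 00 00 00 95, and CMD8 with argument 0x1AA ends in 0x87.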
Aren't there some CSD structures that indicate how many clocks you'd need between commands for timeouts etc, or perhaps I'm thinking of Flash stuff.
It's a fixed number of eight bits. And yeah, that'll be the problem, I don't explicitly wait for that to happen after a command sequence.
I've now realised that there's something else playing up too. I need more sleep to get a good run at it.
I should have said eight clocks rather than eight bits.
Anyway, looks like the other problem was merely that I'd removed the exact binary compare when performing calibration. I'd started relying on the CRC alone, but that proved not to be rugged enough for the calibration process. I would dearly love to have access to the official UHS method. Sadly, the SD cards don't respond to CMD19 without engaging UHS mode first. EDIT: Correction, one of six cards tested actually seems to support CMD19 without switching into UHS first. Not much help for me.
CMD19 is compulsory for UHS-I compliance so I figure it is supported by all my SD cards when UHS is engaged.
EDIT2: One detail from the CMD19 procedure is it says up to 40 repeats to be used for certainty. That's something I'll adopt myself.
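Putting those pieces together, the repaired calibration flow can be sketched like this (hypothetical helper names; the read callback and phase range stand in for the driver's real rx-delay settings). Echoing the CMD19 advice, each candidate setting is retried many times and must pass an exact binary compare every time, not just the CRC:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define CAL_REPEATS 40  /* per the CMD19 guidance: up to 40 repeats */

/* Hypothetical read callback: reads a reference block using rx-delay
   setting 'phase'; returns 1 if the CRC checked out, 0 otherwise. */
typedef int (*cal_read_fn)(int phase, uint8_t *buf, size_t len);

/* Try each candidate rx phase in turn; accept the first one whose
   reads pass the CRC *and* match the reference block exactly on every
   repeat. Returns the winning phase, or -1 if none calibrates.
   (len is assumed <= 512.) */
int calibrate_rx_phase(cal_read_fn rd, const uint8_t *ref, size_t len,
                       int phases)
{
    uint8_t buf[512];
    for (int p = 0; p < phases; p++) {
        int ok = 1;
        for (int r = 0; r < CAL_REPEATS && ok; r++)
            if (!rd(p, buf, len) || memcmp(buf, ref, len) != 0)
                ok = 0; /* CRC alone proved not rugged enough */
        if (ok)
            return p;
    }
    return -1;
}

/* --- Tiny mock for illustration: phase 2 is the only clean setting;
       phase 1 returns corrupted data yet still reports a CRC pass,
       i.e. "success with failure values". --- */
const uint8_t ref_block[8] = { 0xDE, 0xAD, 0xBE, 0xEF, 1, 2, 3, 4 };

int mock_read(int phase, uint8_t *buf, size_t len)
{
    memcpy(buf, ref_block, len);
    if (phase == 2)
        return 1;      /* clean read, genuine CRC pass  */
    buf[0] ^= 0x80;    /* corrupted sample ...          */
    return phase == 1; /* ... and phase 1 lies about it */
}
```

With the exact compare in place the mock calibrates to phase 2; a CRC-only check would have accepted the lying phase 1 first.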
I've found that the only places I needed to ensure the extra trailing clocks were after expected non-responses. Namely CMD7 for deselect, and also CMD0. A normal response, which already has a couple of extra clocks anyway, doesn't seem to need all eight spacing clocks. I've fixed the exceptions and left the pre-command clocks set at one. It could still come back to bite me, I suppose.
What's left for you to do now @evanh? Is this mainly bug fixing, or optimization work now, or are there still pieces remaining to be coded for the burst reads/writes?
It's usable right now. I presume you're interested in giving it a go?
Down to beta testing and looking for stray bugs I guess.
Attached is the patched vfs.h header file that replaces the existing one in include/sys,
and the new driver directory that gets added to include/filesys,
and lastly my current tester program to get you going.
EDIT: Updated the tester program to show the used clock divider. This is a newly exposed feature. The driver had been using a constant until today.
Depending on my capture board bring up and how it works out, I probably would like to try something soon. It will be interesting to see if video can be captured in real time to SD card.