This sounds really promising @evanh. I think stopping the clock for the last CRC redirection using the streamer would be fine. If you are checking CRCs you'll want to check all blocks, not just the group prior to the last block, unless you are cancelling the transaction and the last block is to be ignored.
So does your reading of multiple blocks now keep up with fast SD cards streaming out their data blocks with CMD18, continually detecting the start of each data block in the process? If so that's pretty neat. I'd expect you should be able to use SET_BLOCK_COUNT to indicate the number of blocks to stream. Why would they not support this from the beginning, forcing multiple-block reads to be aborted with stop-transmission commands, unless the capability was overlooked initially and added later with UHS. I wouldn't put that past Sandisk given how poorly their spec is written and how it seems to have evolved; it's a real mess.
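For reference, the command flow being suggested would look roughly like this. A sketch only - sd_command() and stream_blocks() are hypothetical stand-ins, not the driver's actual routines:

#include <stdint.h>

// Hypothetical helpers, for illustration only.
void sd_command(unsigned cmd, uint32_t arg);
void stream_blocks(void *buf, uint32_t count);

// Pre-declare the length with CMD23 so the card terminates the
// transfer itself, instead of it being aborted with CMD12.
void sd_read_multi(uint32_t lba, uint32_t count, void *buf)
{
    sd_command(23, count);       // SET_BLOCK_COUNT
    sd_command(18, lba);         // READ_MULTIPLE_BLOCK
    stream_blocks(buf, count);   // clock in 'count' data blocks;
                                 // no CMD12 (STOP_TRANSMISSION) needed
}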
Oh, the clock is stopped between blocks. The code isn't that amazing. Finding the start bit is just messy. I would have loved to use the trick they did for the Ethernet driver but that one had a predictable early sync signal that gave time to adjust on the fly before the data stream arrived. No such opportunity here.
In fact I just realised I'm still using the PWM_SMPS mode in places. I need to eliminate that completely; it depends on restricted pin combinations which I'm hell-bent on eliminating even if it costs performance. And it will. That mode would stop autonomously at the start bit, which meant I could fire it off, then initiate the Fcache loading, and not have to be concerned with missing the start bit.
The last big effort was parallelising the CRC processing with the streamer transfer of the next block, and then adapting the level above to handle the dual function of the low-level routine. Yeah, the fancy feature then needed utilising; it didn't just work by itself.
The unmaintained routine I mentioned doesn't use the SMPS smartpin mode. Its approach is to minimise the block-to-block gap by looping within the Fcache code, keeping going until all blocks are read in before exiting back to C. It hadn't really had much of a speed advantage over the single-block routine, for the very reason that the single-block routine uses the SMPS mode for start-bit detection.
So I expect the single-block routine to lose performance very soon, which will encourage me to resurrect the multi-block routine again.
Kingston Select Plus 64GB at 300 MHz sysclock/3, just before removal of SMPS smartpin start-bit clocking:
Same SD card and clock speed but using the more open pin management without SMPS smartpin clocking of start-bit search:
EDIT: Added CRC processing to emitted compile options.
New start-bit search code sets the clock going after Fcache has been loaded, rather than while Fcache does its loading. That's the extra overhead. The clock rate is the same sysclock/10. It makes use of a WAITSEx to rapidly cancel the clock pulses when the start bit arrives. It actually overshoots the start bit, stopping the SD clock at the first data bit instead. Then the streamer just has to start its sampling one SD clock early, just ahead of the SD clock restarting.
        wxpin   r_clks, r_clk       // clock divider for start-bit search
        wypin   lnco, r_clk         // start the search
        getct   pb                  // timeout begins
        add     timeout, pb         // SDHC 100 ms timeout of start-bit search, Nac (SD spec 4.12.4 note 1)
        setq    timeout             // apply the 100 ms timeout to WAITSEn
        waitse2 wc                  // wait for start-bit, C is set if timed-out
        dirl    r_clk               // halt clock gen ASAP
        wxpin   r_clkdiv, r_clk     // clock divider for rx block data
You can see there that everything is adjustable: the pin numbers, the dividers. The timeout duration in sysclock ticks is calculated by the calling function. I decided not to track time at the low level ... but it could still be relegated to the setset() functions that prefill many of the registers, like for the pin numbers. SETSE2 is issued before Fcache is loaded, but it too uses an adjustable value from a preloaded register.
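The caller's side of that split could be as simple as the following. A minimal sketch, assuming flexspin's _clockfreq(); the function name is made up:

#include <propeller2.h>
#include <stdint.h>

// 100 ms worst-case read latency, Nac (SD spec 4.12.4 note 1),
// converted to sysclock ticks for the SETQ+WAITSE2 timeout.
static uint32_t nac_timeout_ticks(void)
{
    return _clockfreq() / 10;
}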
I'll be adding this to the multi-block routine too. It's currently doing a more traditional, and slower, method where it pulses one clock, takes a sample before the clock takes effect, then waits for the pulse to complete before repeating. So it ends up stopping at the first data bit too. It's actually not too bad speed-wise, but the newest solution definitely has a faster clock rate while searching.
Multi-block resurrected but still without CRC processing as yet:
For comparison, the single-block routine with CRC processing disabled:
When it was using the SMPS smartpin mode for start-bit search I think it was getting lower than 8%, maybe 7.5%, overheads ... yep, 8% according to a three-week-old backup of the source code:
These sorts of SD read rates, if sustained, would be one way a P2 could stream digitised, uncompressed VGA-resolution video at 24/25/30 Hz with 24 bpp, or 50/60 Hz at 16 bpp colour depth, off an SD card. This requires ~40 MiB/s or less, depending on whether you expand 24 to 32 bits in storage and whether you fully frame-buffer or not. A full frame buffer at these depths would need external memory as well, but it could be done if the card itself could sustain 40 MiB/s. The read CRC could potentially be ignored in such an application.
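A quick sanity check of that figure (my arithmetic, assuming 640x480): 640 × 480 × 4 B × 30 Hz ≈ 36.9 MB/s ≈ 35.2 MiB/s, and 640 × 480 × 2 B × 60 Hz comes to the same 36.9 MB/s, so both variants sit just under the ~40 MiB/s mark.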
EDIT: you wouldn't get a lot of video stored on a card this way at this read rate though (2.4GiB/minute!). A 256GB card could barely store a movie. That's about AUD$30 worth of media to store it.
Yeah, media files can drop CRC for sure. I'll try to allow a compile switch for that.
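Something like the following, perhaps - a sketch only, with SD_IGNORE_CRC as a made-up name rather than the driver's actual option:

// Hypothetical compile switch for media-streaming builds.
#ifdef SD_IGNORE_CRC
#define sd_check_crc(buf, len)  (0)                 // skip the check, always "pass"
#else
int sd_check_crc(const void *buf, unsigned len);    // real CRC16 verification
#endif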
Many of the cards seem to max out at around 48-49 MB/s raw data rate irrespective of what I do to reduce overheads. Card-driven latencies get extended to limit the average data rate. The exception again is the Sandisk cards; even the older Sandisk will keep delivering data faster if clocked higher.
PS: I had a thought to try engaging 1.8 V mode in the card, to make UHS-I modes available, but leave the supply at 3.3 V anyway. I'm not sure of the wisdom of even trying though. I have no idea what the 1.8 V switchover does inside a card.
It might only adjust the pin drive strength. Although that is probably a second set of transistors, they must be 3.3 V capable or they'd be fried even when not in use.
I've now made the multi-block read more realistic by having a definable buffer that fits within hubRAM. The previous runs were all set up to repeatedly overwrite the same 512 bytes, so, for the speed test, it could go on indefinitely unbroken.
Making it a 32 kB buffer, so that the multi-block routine loops internally 64 times then exits back to C, I get a small increase of 0.1% in overheads. A larger buffer is barely detectable.
I wonder what sort of impact a copy into external RAM would add to your single hub buffer scheme when doing SD multiple-sector reads? It'd be nice to sustain something close to the high data rates you are currently seeing while also getting the SD data into PSRAM, for example. If some status long was updated with the current number of sectors read, another COG could monitor this and instruct a PSRAM driver to copy those sectors from hub RAM into PSRAM at suitable times, in parallel with the continued streaming-in of additional sectors. You'd just need to time things so older data is not overwritten before it's sent to PSRAM (or use a ping-pong hub buffer scheme, or start/stop the incoming sector data where required). You can certainly move data into a PSRAM faster than the data comes off the card, but there may be latencies to deal with if another COG application, such as an emulator or a video driver, is sharing the external memory at the same time.
Should be easy enough. Any interprocess signalling can be used to indicate a buffer is filled. Standard C, from the earliest days, file handling allows any amount of a file to be read/written per call of the block read/write functions. A unique buffer per call is fine also.
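A minimal sketch of that status-long handshake, under assumptions: sectors_read is the counter the SD cog bumps, psram_copy() stands in for a real PSRAM driver call, and the ring size is arbitrary. None of these names come from the actual drivers:

#include <stdint.h>

#define NBUF 64                            // 32 kB hub ring = 64 sectors
extern volatile uint32_t sectors_read;     // bumped by the SD cog per sector
extern uint8_t hubring[NBUF][512];
void psram_copy(uint32_t addr, const void *src, uint32_t len);

void psram_drain(void)
{
    uint32_t done = 0;                     // sectors already moved to PSRAM
    for (;;) {
        uint32_t avail = sectors_read - done;
        if (avail > NBUF)                  // producer lapped the consumer:
            avail = NBUF;                  // real code would flag data loss
        while (avail--) {
            psram_copy(done * 512u, hubring[done % NBUF], 512);
            done++;
        }
    }
}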
@evanh said:
Making it a 32 kB buffer, so that the multi-block routine loops internally 64 times then exits back to C, I get a small increase of 0.1% in overheads. A larger buffer is barely detectable.
Ah, not so pretty with the CRC processing in place: 6.6% goes up to 9.8% ignoring the final CRC, then 12.4% adding the end-of-buffer block CRC processing. And that's by using an oversized buffer and leaving the CRC bits attached to the end. It'll be worse once I decide how to split off the final CRC bits.
Err, oops, that was using a 16 kB buffer.
32 kB buffer is 6.6%, 9.7% and 11.1% respectively.
64 kB buffer is 6.5%, 9.7% and 10.4% respectively.
8 kB buffer is 6.8%, 9.9% and 14.9% respectively.
PS: These numbers only apply to the Kingston Select Plus SD card. Every other card gives different (worse) results at such a high clock rate.
If the 12.4% overhead is mainly incurred for the last sector I wouldn't be too worried. CRC is just going to cost CPU at fast transfer speeds. Multi-sector reads will reduce the overhead anyway by averaging the initial latency over all the sectors.
I do wonder, in the real-world situation with some degree of SD card sector fragmentation, what the idle-time gaps will look like and whether something like 40 MiB/s transfers could be sustained for video long term. In a video-kiosk style application with external memory attached, we could possibly assume the availability of a ~30 MB (or ~0.75 second at that rate) buffer with the PSRAM boards to help ride out the gaps a little. I wonder if this would be enough to support reliable video streaming from a somewhat fragmented filesystem on the SD, which may not always stream out sector data continuously back-to-back in multi-sector reads for the entire duration of a video. Video streaming with multi-sector reads would also require knowledge of where the various sectors are located on the card, using filesystem data such as FAT32 cluster entries; that may require additional occasional sector lookups during streaming (temporarily interrupting the data transfers) if these are not read in ahead of time.
I already had video streaming going that was bottlenecked on decompression and PSRAM RMW ops with just the normal SPI driver. The primary problem there was that FFMPEG's cinepak encoder kinda stinks (terrible rgb to pseudoYUV conversion, everything too green... had some minor success at writing a better encoder but got bored or smth). Streaming uncompressed 24bpp would have better quality than cinepak, but the bandwidth is insane. Maybe raw 4:2:0 (=12bpp) if you can use the CSC or AVIInfoFrame on HDMI to natively output YCbCr.
@rogloh said:
If the 12.4% overhead is mainly incurred for the last sector I wouldn't be too worried.
The 12% is the amortised figure over the whole 16 MB stream. The final CRC will be something like 60% overhead for that block because it has to be processed serially.
@Wuerfel_21 said:
I already had video streaming going that was bottlenecked on decompression and PSRAM RMW ops with just the normal SPI driver. The primary problem there was that FFMPEG's cinepak encoder kinda stinks (terrible rgb to pseudoYUV conversion, everything too green... had some minor success at writing a better encoder but got bored or smth). Streaming uncompressed 24bpp would have better quality than cinepak, but the bandwidth is insane. Maybe raw 4:2:0 (=12bpp) if you can use the CSC or AVIInfoFrame on HDMI to natively output YCbCr.
Yeah, the uncompressed approach would alleviate all that decompression workload and give great SD-quality output at the expense of IO bandwidth demands on the SD interface. You'd still want to transfer into PSRAM for doing a true-colour output, but that's fast if you do large block transfers of video data already formatted for frame buffers (i.e. 24 bpp expanded to 32 bits). Audio would need to be mixed into the stream at the end of each frame and sent to some DAC or digital output COG, but you could have other COGs assist if needed. If this was achievable you could probably also do some pretty cool FF/REW playback tricks smoothly, with almost instant seeking, particularly if each video frame begins on a sector boundary. An OSD would be nice too, if it could be rendered on top in time after each frame gets loaded (or have a custom video driver that does it on the fly, like sprite overlays). A video pipeline for rendering the OSD on top is probably possible with another COG involved. This would be a fun project. I guess we wait to see how evanh's driver progresses and the final bandwidths it can achieve.
@rogloh said:
If the 12.4% overhead is mainly incurred for the last sector I wouldn't be too worried.
The 12% is the amortised figure over the whole 16 MB stream. The final CRC will be something like 60% overhead for that block because it has to be processed serially.
Ok, then it might be necessary to disable CRC for such a video application, depending on the final numbers...
Oh, huh, I'd forcibly disabled the parallel CRC processing in all those tests and forgotten, because it was done the night before. Impressively, re-enabling that code (after a bug fix) only adds a 0.1% difference! The 16 kB buffer is 6.6%, 9.9% and 12.5%. It shows how well that tight loop is keeping up with the sysclock/3 data rate.
EDIT: Another point of interest: the 9.9%, up from 9.8%, is measuring the parallel performance alone. Given that the 9.8% measurement was made with the tight parallel loop entirely missing, what we're seeing in the 6.6% increasing to 9.8% is the added overhead of the parallel support code that gets compiled in. Probably the main increase comes from the fast copy of the block+CRC into cogRAM which happens in between streamer ops.
EDIT2: Yep, recompiling for parallel CRC with both the tight loop and the fast copy removed yields a result of 6.7% overheads for the 16 MB stream with 16 kB buffer.
Shaved another 0.1% off the parallel code, now 9.8%, by optimising the placement of WAITSE1 to after the fast copy instead of before. It'll help more at higher clock dividers ... sysclock/4: 7.7% before the optimisation, 6.8% after.
EDIT: Bugger, had to revert this. It formed a bug around larger dividers where the fast copy would occur before the streamer had completed its prior block, corrupting the copy in cogRAM.
What sort of extra overhead/idle-time delay does a seek to a new SD sector range add, @evanh, with your fast SD card? I'm wondering what penalty in terms of bandwidth would be incurred each time you hit a fragmentation point and the file needs to be read from a different sector range. I know it depends on how often this occurs in a given file, but how much time is wasted if a multi-sector read restarts at a new random file position? E.g. burst-read 20 sectors, then burst-read another 20 sectors from elsewhere on the disk, and compare that to burst-reading 40 sectors. What sort of penalty is that?
The latency on the first block of a CMD18 is huge. As a guess, it'll add a good 3% overhead. That's what I was figuring could be improved by using CMD23 instead of CMD12.
PS: It'll be the biggest difference between the various cards too. The Kingston is the best in that respect. Others might be adding as much as 10% overhead on that one latency. I should do some timed measurements and get some hard numbers ...
PPS: That latency is the reason why a low block count gets such a poor result on the overheads, e.g.:
So we can probably expect it to be about 150 µs.
Initial block latencies:
136 µs for the Kingston Select Plus 64GB (2021)
160 µs for the Samsung EVO 128GB (2023)
472 µs for the Adata Silver 16GB (2013)
292 µs for the Adata Orange 64GB (2024)
296 µs for the Apacer 16GB (2018)
160 µs for the Sandisk Extreme 32GB (2017)
207 µs for the Sandisk Extreme 64GB (2021)
Actually, those measurements might be low. I think I'm creating an early trigger in the cards, by adding extra clocks at the end of the command response with the intent of prep'ing the data block read - well before I'm measuring.
That said, they are the relevant figures for the additional delay incurred on top of the execution overheads.
Ok, so say you incur a 150 µs penalty for a fast card each time another "seek" or burst-restart operation is done. To sustain 40 MiB/s with 12.5% overhead at sysclk/3 on a 300 MHz P2, you'd need to ensure on average that you don't restart a new burst more often than once every "x" sectors. This assumes you know the full cluster map of the file up front and don't go off reading FAT32 cluster table info etc. on the fly.
To work out "x" I think it would be approximately something like this...
Sustained transfer possible with 12.5% overhead = 300M/3/2 × 0.875 = 43.75 MiB/s
We want to budget for streaming at ~40 MiB/s, so the additional 3.75 MiB/s can be allocated to account for the 150 µs penalties each time a new burst is started.
So for 3.75/43.75 of the time we can be restarting a new burst, which is proportionally 85.7 milliseconds per second. This equates to 85.7/0.150 = 571 burst restarts per second.
For the data being transferred we have 40 MiB/s ≈ 40M/512 sectors per second, or ~78125 sectors/second. "x" is then 78125/571 ≈ 137.
So as long as the average burst transferred is ~137 sectors or more each time, we would be able to sustain 40 MiB/s. That is just under 1% fragmentation, which isn't a lot.
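The budget above is easy to re-run; a throwaway check in plain C (decimal units throughout, figures from the posts, nothing newly measured):

#include <stdio.h>

int main(void)
{
    double sustained = 300e6 / 3 / 2 * 0.875;          // 43.75 MB/s after 12.5% overhead
    double target    = 40e6;                           // streaming budget, bytes/s
    double spare     = sustained - target;             // 3.75 MB/s left for restarts
    double restarts  = (spare / sustained) / 150e-6;   // ~571 restarts/s at 150 us each
    double sectors   = target / 512;                   // ~78125 sectors/s
    printf("min average burst = %.0f sectors\n", sectors / restarts);   // prints ~137
    return 0;
}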
Some cards can go a higher data rate with a higher clock. Here's the Kingston SD card using 330 MHz sysclock:
And a Sandisk at 360 MHz sysclock:
Am thinking for streaming uncompressed SDTV @30 Hz truecolour, formatted as 32-bit pixels with a stereo 48 kHz audio track, you might be best to try transferring at sysclk/2 with no CRC. Sysclk/3 will be very demanding for achieving 40 MiB/s at 300 MHz. Another idea is to store pixels on the SD card as 24 bits and expand into 32-bit data on the fly, but this can be intensive and burns COGs. It would however drop the bandwidth needed significantly.
Yes, 330 MHz operation could also help account for the CRC+overheads, although it's pushing the P2 fairly hard.
Actually, with CRC processing disabled, sysclock/2 can perform. Here's the Sandisk SD card at 280 MHz sysclock:
PS: The reason why there are two distinct "lat" values is that it reports the initial latency of the last buffer. The buffer size is 16 kB, therefore stream lengths above 16 kB are reporting a mid-stream block latency rather than the initial block.
PPS: And they're in sysclock ticks rather than microseconds. A quick hack.
Yeah that's quite a bit faster. Nice.
Here's a snippet that could expand 4 packed 24-bit pixels, read in as 3 longs, and output them as 4 longs back to hub RAM. Looks like a single COG could do the translation. Perhaps this could be merged with the COG that copies into PSRAM and the work overlapped.
expand
        ' todo: init fifo ptr and ptrb/ptra prior to this
        ' expands at a rate of ~4 pixels in 32 p2 clocks
        mov     loopcount, ##600
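        ' note: 600 outer loops x 512 pixels/loop = 307,200 pixels, exactly one 640x480 frame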
loop
        rep     #14, #512/4
        rflong  a
        rflong  b
        rflong  c
        mov     d, a
        setbyte d, #0, #3
        wrlut   d, ptrb++
        getword d, b, #1
        rolbyte b, a, #3
        setbyte b, #0, #3
        wrlut   b, ptrb++
        setbyte d, c, #2
        wrlut   d, ptrb++
        shr     c, #8
        wrlut   c, ptrb++
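        ' burst-copy the 512 expanded longs from LUT RAM to hub (SETQ2 block write)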
        setq2   #(512-1)
        wrlong  0, ptra
        add     ptra, incr
        djnz    loopcount, #loop
        ret

a           long    0
b           long    0
c           long    0
d           long    0
incr        long    512*4
loopcount   long    0
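For what it's worth, the REP body there is 14 two-cycle instructions, so ~28 clocks per 4 pixels before hub effects, which squares with the ~32-clock estimate in the comment once the block write and FIFO stalls are amortised.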
@rogloh said:
Perhaps this could be merged with the COG that copies into PSRAM and the work overlapped.
Would need to not use the FIFO then (the expander's RFLONGs need it, and a cog moving data to PSRAM via the streamer already has the FIFO tied up for the streamer's hub access).
Also, your loop there puts zero in byte #3. P2 RGB32 mode is $RRGGBBxx, where xx is ignored, so there's a double oddity there. Also note that you are able to use PTRA++ mode with block writes; the bug only happens when you try to use a custom offset.
Maybe: (untested)
expand
        ' todo: init fifo ptr and ptrb/ptra prior to this
        ' expands at a rate of ~4 pixels in ??? p2 clocks
        mov     loopcount, ##600
loop
        rep     #13, #512/4     ' 13-instruction body, 128 passes fill 512 LUT longs
        rflong  a
        rflong  b
        rflong  c
        mov     d, a
        shl     d, #8
        wrlut   d, ptrb++
        getword d, b, #1
        rolword b, a, #1
        wrlut   b, ptrb++
        setbyte d, c, #2
        shl     d, #8
        wrlut   d, ptrb++
        wrlut   c, ptrb++
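        ' note: with a SETQ2 block write, plain ptra++ advances PTRA by the whole transfer
        ' (it's only the custom-offset form that hits the bug mentioned above)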
        setq2   #(512-1)
        wrlong  0, ptra++
        djnz    loopcount, #loop
        ret

a           long    0
b           long    0
c           long    0
d           long    0
loopcount   long    0