This sounds really promising @evanh. I think stopping the clock for the last CRC redirection using the streamer would be fine. If you are checking CRCs you'll want to check all blocks, not just the group prior to the last block, unless you are cancelling the transaction and the last block is to be ignored.
So does your reading of multiple blocks now keep up with fast SD cards streaming out their data blocks with CMD18, continually detecting the start of each data block in the process? If so that's pretty neat. I'd expect you should be able to use SET_BLOCK_COUNT to indicate the number of blocks to stream. Why would they not support this from the beginning, forcing multiple-block reads to be aborted with stop-transmission commands, unless the capability was overlooked initially and added later with UHS. I wouldn't put that past Sandisk given how poorly their spec is written and how it seems to have evolved; it's a real mess.
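For reference, the command flow being suggested would look roughly like this. A sketch only - sd_command() and stream_blocks() are hypothetical stand-ins, not the driver's actual routines:

#include <stdint.h>

// Hypothetical helpers, for illustration only.
void sd_command(unsigned cmd, uint32_t arg);
void stream_blocks(void *buf, uint32_t count);

// Pre-declare the length with CMD23 so the card terminates the
// transfer itself, instead of it being aborted with CMD12.
void sd_read_multi(uint32_t lba, uint32_t count, void *buf)
{
    sd_command(23, count);       // SET_BLOCK_COUNT
    sd_command(18, lba);         // READ_MULTIPLE_BLOCK
    stream_blocks(buf, count);   // clock in 'count' data blocks;
                                 // no CMD12 (STOP_TRANSMISSION) needed
}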
Oh, the clock is stopped between blocks. The code isn't that amazing. Finding the start bit is just messy. I would have loved to use the trick they did for the Ethernet driver but that one had a predictable early sync signal that gave time to adjust on the fly before the data stream arrived. No such opportunity here.
In fact I just realised I'm still using the PWM_SMPS mode in places. I need to eliminate that completely; it depends on restricted pin combinations which I'm hell-bent on eliminating even if it costs performance. And it will. That mode would stop autonomously at the start bit, which meant I could fire it off, then initiate the Fcache loading, and not have to be concerned with missing the start bit.
The last big effort was parallelising the CRC processing with the streamer transfer of the next block, and then adapting the level above to handle the dual function of the low-level routine. Yeah, the fancy feature then needed utilising; it didn't just work by itself.
The unmaintained routine I mentioned doesn't use the SMPS smartpin mode. Its approach is to minimise the block-to-block gap by looping within the Fcache code, keeping going until all blocks are read in before exiting back to C. It hadn't really had much of a speed advantage over the single-block routine, for the very reason that the single-block routine uses the SMPS mode for start-bit detection.
So I expect the single-block routine to lose performance very soon, which will encourage me to resurrect the multi-block routine again.
Kingston Select Plus 64GB at 300 MHz sysclock/3, just before removal of SMPS smartpin start-bit clocking:
Same SD card and clock speed but using the more open pin management without SMPS smartpin clocking of start-bit search:
EDIT: Added CRC processing to emitted compile options.
New start-bit search code sets the clock going after Fcache has been loaded, rather than while Fcache does its loading. That's the extra overhead. The clock rate is the same sysclock/10. It makes use of a WAITSEx to rapidly cancel the clock pulses when the start bit arrives. It actually overshoots the start bit, stopping the SD clock at the first data bit instead. Then the streamer just has to start its sampling one SD clock early, just ahead of the SD clock restarting.
        wxpin   r_clks, r_clk       // clock divider for start-bit search
        wypin   lnco, r_clk         // start the search
        getct   pb                  // timeout begins
        add     timeout, pb         // SDHC 100 ms timeout of start-bit search, Nac (SD spec 4.12.4 note 1)
        setq    timeout             // apply the 100 ms timeout to WAITSEn
        waitse2 wc                  // wait for start-bit, C is set if timed-out
        dirl    r_clk               // halt clock gen ASAP
        wxpin   r_clkdiv, r_clk     // clock divider for rx block data
You can see there that everything is adjustable: the pin numbers, the dividers. The timeout duration in sysclock ticks is calculated by the calling function. I decided not to track time at the low level ... but it could still be relegated to the setset() functions that prefill many of the registers, like for the pin numbers. SETSE2 is issued before Fcache is loaded, but it too uses an adjustable value from a preloaded register.
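The caller's side of that split could be as simple as the following. A minimal sketch, assuming flexspin's _clockfreq(); the function name is made up:

#include <propeller2.h>
#include <stdint.h>

// 100 ms worst-case read latency, Nac (SD spec 4.12.4 note 1),
// converted to sysclock ticks for the SETQ+WAITSE2 timeout.
static uint32_t nac_timeout_ticks(void)
{
    return _clockfreq() / 10;
}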
I'll be adding this to the multi-block routine too. It's currently doing a more traditional, and slower, method where it pulses one clock, takes a sample before the clock takes effect, then waits for the pulse to complete before repeating. So it ends up stopping at the first data bit too. It's actually not too bad speed-wise, but the newest solution definitely has a faster clock rate while searching.
Multi-block resurrected but still without CRC processing as yet:
For comparison, the single-block routine with CRC processing disabled:
When it was using the SMPS smartpin mode for start-bit search I think it was getting lower than 8%, maybe 7.5%, overheads ... yep, 8% according to a three-week-old backup of the source code:
These sorts of SD read rates, if sustained, would be one way a P2 could stream digitised, uncompressed VGA-resolution video at 24/25/30 Hz with 24 bpp, or 50/60 Hz at 16 bpp colour depth, off an SD card. This requires ~40 MiB/s or less, depending on whether you expand 24 to 32 bits in storage and whether you fully frame-buffer or not. A full frame buffer at these depths would need external memory as well, but it could be done if the card itself could sustain 40 MiB/s. The read CRC could potentially be ignored in such an application.
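A quick sanity check of that figure (my arithmetic, assuming 640x480): 640 × 480 × 4 B × 30 Hz ≈ 36.9 MB/s ≈ 35.2 MiB/s, and 640 × 480 × 2 B × 60 Hz comes to the same 36.9 MB/s, so both variants sit just under the ~40 MiB/s mark.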
EDIT: you wouldn't get a lot of video stored on a card this way at this read rate though (2.4GiB/minute!). A 256GB card could barely store a movie. That's about AUD$30 worth of media to store it.
Yeah, media files can drop CRC for sure. I'll try to allow a compile switch for that.
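Something like the following, perhaps - a sketch only, with SD_IGNORE_CRC as a made-up name rather than the driver's actual option:

// Hypothetical compile switch for media-streaming builds.
#ifdef SD_IGNORE_CRC
#define sd_check_crc(buf, len)  (0)                 // skip the check, always "pass"
#else
int sd_check_crc(const void *buf, unsigned len);    // real CRC16 verification
#endif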
Many of the cards seem to max out at around 48-49 MB/s raw data rate irrespective of what I do to reduce overheads. Card-driven latencies get extended to limit the average data rate. The exception again is the Sandisk cards; even the older Sandisk will keep delivering data faster if clocked higher.
PS: I had a thought to try engaging 1.8 V mode in the card, to make UHS-I modes available, but leave the supply at 3.3 V anyway. I'm not sure of the wisdom of even trying though. I have no idea what the 1.8 V switchover does inside a card.
It might only adjust the pin drive strength. Although that is probably a second set of transistors, they must be 3.3 V capable or they'd be fried even when not in use.
I've now made the multi-block read more realistic by having a definable buffer that fits within hubRAM. The previous runs were all set up to repeatedly overwrite the same 512 bytes, so, for the speed test, it could go on indefinitely unbroken.
Making it a 32 kB buffer, so that the multi-block routine loops internally 64 times then exits back to C, I get a small increase of 0.1% in overheads. A larger buffer is barely detectable.
I wonder what sort of impact a copy into external RAM would add to your single hub buffer scheme when doing SD multiple-sector reads? It'd be nice to sustain something close to the high data rates you are currently seeing while also getting the SD data into PSRAM, for example. If some status long was updated with the current number of sectors read, another COG could monitor this and instruct a PSRAM driver to copy those sectors from hub RAM into PSRAM at suitable times, in parallel with the continued streaming-in of additional sectors. You'd just need to time things so older data is not overwritten before it's sent to PSRAM (or use a ping-pong hub buffer scheme, or start/stop the incoming sector data where required). You can certainly move data into a PSRAM faster than the data comes off the card, but there may be latencies to deal with if another COG application, such as an emulator or a video driver, is sharing the external memory at the same time.
Should be easy enough. Any interprocess signalling can be used to indicate a buffer is filled. Standard C, from the earliest days, file handling allows any amount of a file to be read/written per call of the block read/write functions. A unique buffer per call is fine also.
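A minimal sketch of that status-long handshake, under assumptions: sectors_read is the counter the SD cog bumps, psram_copy() stands in for a real PSRAM driver call, and the ring size is arbitrary. None of these names come from the actual drivers:

#include <stdint.h>

#define NBUF 64                            // 32 kB hub ring = 64 sectors
extern volatile uint32_t sectors_read;     // bumped by the SD cog per sector
extern uint8_t hubring[NBUF][512];
void psram_copy(uint32_t addr, const void *src, uint32_t len);

void psram_drain(void)
{
    uint32_t done = 0;                     // sectors already moved to PSRAM
    for (;;) {
        uint32_t avail = sectors_read - done;
        if (avail > NBUF)                  // producer lapped the consumer:
            avail = NBUF;                  // real code would flag data loss
        while (avail--) {
            psram_copy(done * 512u, hubring[done % NBUF], 512);
            done++;
        }
    }
}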
@evanh said:
Making it a 32 kB buffer, so that the multi-block routine loops internally 64 times then exits back to C, I get a small increase of 0.1% in overheads. A larger buffer is barely detectable.
Ah, not so pretty with the CRC processing in place: 6.6% goes up to 9.8% ignoring the final CRC, then 12.4% adding the end-of-buffer block CRC processing. And that's by using an oversized buffer and leaving the CRC bits attached to the end. It'll be worse once I decide how to split off the final CRC bits.
Err, oops, that was using a 16 kB buffer.
32 kB buffer is 6.6%, 9.7% and 11.1% respectively.
64 kB buffer is 6.5%, 9.7% and 10.4% respectively.
8 kB buffer is 6.8%, 9.9% and 14.9% respectively.
PS: These numbers only apply to the Kingston Select Plus SD card. Every other card gives different (worse) results at such a high clock rate.
If the 12.4% overhead is mainly incurred for the last sector I wouldn't be too worried. CRC is just going to cost CPU at fast transfer speeds. Multi-sector reads will reduce the overhead anyway by averaging the initial latency over all the sectors.
I do wonder, in the real-world situation with some degree of SD card sector fragmentation, what the idle-time gaps will look like and whether something like 40 MiB/s transfers could be sustained for video long term. In a video-kiosk style application with external memory attached, we could possibly assume the availability of a ~30 MB (or ~0.75 second at that rate) buffer with the PSRAM boards to help ride out the gaps a little. I wonder if this would be enough to support reliable video streaming from a somewhat fragmented filesystem on the SD, which may not always stream out sector data continuously back-to-back in multi-sector reads for the entire duration of a video. Video streaming with multi-sector reads would also require knowledge of where the various sectors are located on the card, using filesystem data such as FAT32 cluster entries; that may require additional occasional sector lookups during streaming (temporarily interrupting the data transfers) if these are not read in ahead of time.
I already had video streaming going that was bottlenecked on decompression and PSRAM RMW ops with just the normal SPI driver. The primary problem there was that FFMPEG's cinepak encoder kinda stinks (terrible rgb to pseudoYUV conversion, everything too green... had some minor success at writing a better encoder but got bored or smth). Streaming uncompressed 24bpp would have better quality than cinepak, but the bandwidth is insane. Maybe raw 4:2:0 (=12bpp) if you can use the CSC or AVIInfoFrame on HDMI to natively output YCbCr.
@rogloh said:
If the 12.4% overhead is mainly incurred for the last sector I wouldn't be too worried.
The 12% is the amortised figure over the whole 16 MB stream. The final CRC will be something like 60% overhead for that block because it has to be processed serially.
@Wuerfel_21 said:
I already had video streaming going that was bottlenecked on decompression and PSRAM RMW ops with just the normal SPI driver. The primary problem there was that FFMPEG's cinepak encoder kinda stinks (terrible rgb to pseudoYUV conversion, everything too green... had some minor success at writing a better encoder but got bored or smth). Streaming uncompressed 24bpp would have better quality than cinepak, but the bandwidth is insane. Maybe raw 4:2:0 (=12bpp) if you can use the CSC or AVIInfoFrame on HDMI to natively output YCbCr.
Yeah, the uncompressed approach would alleviate all that decompression workload and give great SD-quality output at the expense of IO bandwidth demands on the SD interface. You'd still want to transfer into PSRAM for doing a true-colour output, but that's fast if you do large block transfers of video data already formatted for frame buffers (i.e. 24 bpp expanded to 32 bits). Audio would need to be mixed into the stream at the end of each frame and sent to some DAC or digital output COG, but you could have other COGs assist if needed. If this was achievable you could probably also do some pretty cool FF/REW playback tricks smoothly, with almost instant seeking, particularly if each video frame begins on a sector boundary. An OSD would be nice too, if it could be rendered on top in time after each frame gets loaded (or have a custom video driver that does it on the fly, like sprite overlays). A video pipeline for rendering the OSD on top is probably possible with another COG involved. This would be a fun project. I guess we wait to see how evanh's driver progresses and the final bandwidths it can achieve.
@rogloh said:
If the 12.4% overhead is mainly incurred for the last sector I wouldn't be too worried.
The 12% is the amortised figure over the whole 16 MB stream. The final CRC will be something like 60% overhead for that block because it has to be processed serially.
Ok, then it might be necessary to disable CRC for such a video application, depending on the final numbers...
Oh, huh, I'd forcibly disabled the parallel CRC processing in all those tests and forgotten, because it was done the night before. Impressively, re-enabling that code (after a bug fix) only adds a 0.1% difference! The 16 kB buffer is 6.6%, 9.9% and 12.5%. It shows how well that tight loop is keeping up with the sysclock/3 data rate.
EDIT: Another point of interest: the 9.9%, up from 9.8%, is measuring the parallel performance alone. Given that the 9.8% measurement was made with the tight parallel loop entirely missing, what we're seeing in the 6.6% increasing to 9.8% is the added overhead of the parallel support code that gets compiled in. Probably the main increase comes from the fast copy of the block+CRC into cogRAM which happens in between streamer ops.
EDIT2: Yep, recompiling for parallel CRC with both the tight loop and the fast copy removed yields a result of 6.7% overheads for the 16 MB stream with 16 kB buffer.
Shaved another 0.1% off the parallel code, now 9.8%, by optimising the placement of WAITSE1 to after the fast copy instead of before. It'll help more at higher clock dividers ... sysclock/4: 7.7% before the optimisation, 6.8% after.
EDIT: Bugger, had to revert this. It formed a bug around larger dividers where the fast copy would occur before the streamer had completed its prior block, corrupting the copy in cogRAM.
What sort of extra overhead/idle-time delay does a seek to a new SD sector range add, @evanh, with your fast SD card? I'm wondering what penalty in terms of bandwidth would be incurred each time you hit a fragmentation point and the file needs to be read from a different sector range. I know it depends on how often this occurs in a given file, but how much time is wasted if a multi-sector read restarts at a new random file position? E.g. burst-read 20 sectors, then burst-read another 20 sectors from elsewhere on the disk, and compare that to burst-reading 40 sectors. What sort of penalty is that?
The latency on the first block of a CMD18 is huge. As a guess, it'll add a good 3% overhead. That's what I was figuring could be improved by using CMD23 instead of CMD12.
PS: It'll be the biggest difference between the various cards too. The Kingston is the best in that respect. Others might be adding as much as 10% overhead on that one latency. I should do some timed measurements and get some hard numbers ...
PPS: That latency is the reason why a low block count gets such a poor result on the overheads, e.g.:
So we can probably expect it to be about 150 µs.
Initial block latencies:
136 µs for the Kingston Select Plus 64GB (2021)
160 µs for the Samsung EVO 128GB (2023)
472 µs for the Adata Silver 16GB (2013)
292 µs for the Adata Orange 64GB (2024)
296 µs for the Apacer 16GB (2018)
160 µs for the Sandisk Extreme 32GB (2017)
207 µs for the Sandisk Extreme 64GB (2021)
Actually, those measurements might be low. I think I'm creating an early trigger in the cards, by adding extra clocks at the end of the command response with the intent of prep'ing the data block read - well before I'm measuring.
That said, they are the relevant figures for the additional delay incurred on top of the execution overheads.
Ok, so say you incur a 150 µs penalty for a fast card each time another "seek" or burst-restart operation is done. To sustain 40 MiB/s with 12.5% overhead at sysclk/3 on a 300 MHz P2, you'd need to ensure on average that you don't restart a new burst more often than once every "x" sectors. This assumes you know the full cluster map of the file up front and don't go off reading FAT32 cluster table info etc. on the fly.
To work out "x" I think it would be approximately something like this...
Sustained transfer possible with 12.5% overhead = 300M/3/2 × 0.875 = 43.75 MiB/s
We want to budget for streaming at ~40 MiB/s, so the additional 3.75 MiB/s can be allocated to account for the 150 µs penalties each time a new burst is started.
So for 3.75/43.75 of the time we can be restarting a new burst, which is proportionally 85.7 milliseconds per second. This equates to 85.7/0.150 = 571 burst restarts per second.
For the data being transferred we have 40 MiB/s ≈ 40M/512 sectors per second, or ~78125 sectors/second. "x" is then 78125/571 ≈ 137.
So as long as the average burst transferred is ~137 sectors or more each time, we would be able to sustain 40 MiB/s. That is just under 1% fragmentation, which isn't a lot.
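The budget above is easy to re-run; a throwaway check in plain C (decimal units throughout, figures from the posts, nothing newly measured):

#include <stdio.h>

int main(void)
{
    double sustained = 300e6 / 3 / 2 * 0.875;          // 43.75 MB/s after 12.5% overhead
    double target    = 40e6;                           // streaming budget, bytes/s
    double spare     = sustained - target;             // 3.75 MB/s left for restarts
    double restarts  = (spare / sustained) / 150e-6;   // ~571 restarts/s at 150 us each
    double sectors   = target / 512;                   // ~78125 sectors/s
    printf("min average burst = %.0f sectors\n", sectors / restarts);   // prints ~137
    return 0;
}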
Some cards can go a higher data rate with a higher clock. Here's the Kingston SD card using 330 MHz sysclock:
And a Sandisk at 360 MHz sysclock:
Am thinking for streaming uncompressed SDTV @30 Hz truecolour, formatted as 32-bit pixels with a stereo 48 kHz audio track, you might be best to try transferring at sysclk/2 with no CRC. Sysclk/3 will be very demanding for achieving 40 MiB/s at 300 MHz. Another idea is to store pixels on the SD card as 24 bits and expand into 32-bit data on the fly, but this can be intensive and burns COGs. It would however drop the bandwidth needed significantly.
Yes, 330 MHz operation could also help account for the CRC+overheads, although it's pushing the P2 fairly hard.
Actually, with CRC processing disabled, sysclock/2 can perform. Here's the Sandisk SD card at 280 MHz sysclock:
PS: The reason why there are two distinct "lat" values is that it reports the initial latency of the last buffer. The buffer size is 16 kB, therefore stream lengths above 16 kB are reporting a mid-stream block latency rather than the initial block.
PPS: And they're in sysclock ticks rather than microseconds. A quick hack.
Yeah that's quite a bit faster. Nice.
Here's a snippet that could expand 4 packed 24-bit pixels, read in as 3 longs, and output them as 4 longs back to hub RAM. Looks like a single COG could do the translation. Perhaps this could be merged with the COG that copies into PSRAM and the work overlapped.
expand
        ' todo: init fifo ptr and ptrb/ptra prior to this
        ' expands at a rate of ~4 pixels in 32 p2 clocks
        mov     loopcount, ##600
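        ' note: 600 outer loops x 512 pixels/loop = 307,200 pixels, exactly one 640x480 frame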
loop
        rep     #14, #512/4
        rflong  a
        rflong  b
        rflong  c
        mov     d, a
        setbyte d, #0, #3
        wrlut   d, ptrb++
        getword d, b, #1
        rolbyte b, a, #3
        setbyte b, #0, #3
        wrlut   b, ptrb++
        setbyte d, c, #2
        wrlut   d, ptrb++
        shr     c, #8
        wrlut   c, ptrb++
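        ' burst-copy the 512 expanded longs from LUT RAM to hub (SETQ2 block write)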
        setq2   #(512-1)
        wrlong  0, ptra
        add     ptra, incr
        djnz    loopcount, #loop
        ret

a           long    0
b           long    0
c           long    0
d           long    0
incr        long    512*4
loopcount   long    0
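For what it's worth, the REP body there is 14 two-cycle instructions, so ~28 clocks per 4 pixels before hub effects, which squares with the ~32-clock estimate in the comment once the block write and FIFO stalls are amortised.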
@rogloh said:
Perhaps this could be merged with the COG that copies into PSRAM and the work overlapped.
Would need to not use the FIFO then (the expander's RFLONGs need it, and a cog moving data to PSRAM via the streamer already has the FIFO tied up for the streamer's hub access).
Also, your loop there puts zero in byte #3. P2 RGB32 mode is $RRGGBBxx, where xx is ignored, so there's a double oddity there. Also note that you are able to use PTRA++ mode with block writes; the bug only happens when you try to use a custom offset.
Maybe: (untested)
expand
        ' todo: init fifo ptr and ptrb/ptra prior to this
        ' expands at a rate of ~4 pixels in ??? p2 clocks
        mov     loopcount, ##600
loop
        rep     #13, #512/4     ' 13-instruction body, 128 passes fill 512 LUT longs
        rflong  a
        rflong  b
        rflong  c
        mov     d, a
        shl     d, #8
        wrlut   d, ptrb++
        getword d, b, #1
        rolword b, a, #1
        wrlut   b, ptrb++
        setbyte d, c, #2
        shl     d, #8
        wrlut   d, ptrb++
        wrlut   c, ptrb++
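        ' note: with a SETQ2 block write, plain ptra++ advances PTRA by the whole transfer
        ' (it's only the custom-offset form that hits the bug mentioned above)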
        setq2   #(512-1)
        wrlong  0, ptra++
        djnz    loopcount, #loop
        ret

a           long    0
b           long    0
c           long    0
d           long    0
loopcount   long    0