New SD mode P2 accessory board

rogloh · 2024-08-29 13:46

Oh, yeah I forgot the exact RGB format - been a while since I've done much P2 stuff lately - it shows. LOL.

I think you'd have something like this in order to stream uncompressed high quality SDTV stuff from a P2:

1 COG doing the SD reading of sectors into hub RAM
1 COG doing the reading of hub RAM RGB24 data and writing back into hub as RGB32 - this may also make sense to do an OSD overlay on top as part of it. It would also trigger the PSRAM write. It would need to co-ordinate closely with the first COG's buffering mechanism
1 PSRAM driver COG writing data into PSRAM and serving video read requests
1 VIDEO driver COG
1 audio COG
1 control COG maybe

Still leaves a couple of other COGs free for network or another interface maybe. [EDIT: wrong] Interestingly this bandwidth fits nicely on a 100Mbps link. Could possibly make a sling box of some kind which streams video over a network as well...
It'd be nice to find some device that could take an SDTV stream and digitize it to feed into a P2 like from a tuner/STB. Then you might make a simple DVR out of a P2. Although it probably could only read or write at a time, not both.

Wuerfel_21 · 2024-08-29 13:47

Also, same concept but for expanding YUYV (aka $VV_Y1_UU_Y0 when loaded into a little-endian register) to $VVYYUUxx (P2's native-ish 32bpp YUV format as it were)
(of course without any attempt at smoothing the chroma)

expand
' todo: init fifo ptr and ptrb/ptra prior to this
' expands at a rate of ~2 pixels in ??? p2 clocks 
    mov loopcount, ##600
loop
    rep @.inner, ##512/2
    rflong a
    wrlut a,ptrb++
    setbyte a,a,#2
    wrlut a,ptrb++
.inner
    setq2 #(512-1)
    wrlong 0, ptra++
    djnz loopcount, #loop
    ret

a   long 0
loopcount long 0

Surprisingly YUYV is already most of the way there.

rogloh · 2024-08-29 13:58

@Wuerfel_21 said:

@rogloh said:
Perhaps this could be merged with the COG that copies into PSRAM and the work overlapped.

Would need to not use the FIFO then.

No the FIFO just can't be used on the PSRAM driver COG. That is the COG that's using its streamer to copy into PSRAM. The RGB24-32 translation COG making the requests to write to PSRAM is still free to use its FIFO.

Wuerfel_21 · 2024-08-29 14:03

I know, your wording just sounded like doing it in the driver cog.

TonyB_ · 2024-08-29 17:05

@Wuerfel_21 said:

@rogloh said:
Perhaps this could be merged with the COG that copies into PSRAM and the work overlapped.

Would need to not use the FIFO then.

Also, your loop there puts zero in the byte #3. P2 RGB32 mode is $RRGGBBxx, where xx is ignored, so there's a double oddity there. Also note that you are able to use PTRA++ mode with block writes, the bug only happens when you try to use a custom offset.

Maybe: (untested)
expand
' todo: init fifo ptr and ptrb/ptra prior to this
' expands at a rate of ~4 pixels in ??? p2 clocks 
    mov loopcount, ##600
loop
    rep #14, ##512
    rflong a
    rflong b
    rflong c
    mov d, a
    shl d,#8
    wrlut d, ptrb++
    getword d, b, #1
    rolword b, a, #1
    wrlut b, ptrb++
    setbyte d, c, #2
    shl d,#8
    wrlut d, ptrb++
    wrlut c, ptrb++
    setq2 #(512-1)
    wrlong 0, ptra++
    djnz loopcount, #loop
    ret

a   long 0
b   long 0
c   long 0
d   long 0
loopcount long 0 

I think one instruction can be saved in the loop and the d variable. Also good case here for using @.end:

expand
' todo: init fifo ptr and ptrb/ptra prior to this
' expands at a rate of ~4 pixels in ??? p2 clocks 
    mov loopcount, ##600
loop
    rep @.end, ##512

    rflong a            '= B1_R0_G0_B0
    rflong b            '= G2_B2_R1_G1
    rflong c            '= R3_G3_B3_R2

    rol a, #8           '= R0_G0_B0_B1
    wrlut a, ptrb++     'write R0_G0_B0_XX

    rol a, #8           '= G0_B0_B1_R0
    setword a, b, #1    '= R1_G1_B1_R0
    wrlut a, ptrb++     'write R1_G1_B1_XX

    ror b, #8           '= G1_G2_B2_R1
    setbyte b, c, #3    '= R2_G2_B2_R1
    wrlut b, ptrb++     'write R2_G2_B2_XX

    wrlut c, ptrb++     'write R3_G3_B3_XX
.end
    setq2 #(512-1)
    wrlong 0, ptra++

    djnz loopcount, #loop
    ret

a   long 0
b   long 0
c   long 0
loopcount long 0

EDIT:
Deleted d variable
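
To check the byte shuffling off-chip, here is a minimal Python model of one pass of the loop body (three longs in, four longs out). It follows the register comments in the listing above; the helper names `rol32`, `ror32` and `expand4` are made up for illustration, and the low byte of each output long is the don't-care XX byte.

```python
MASK32 = 0xFFFFFFFF

def rol32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    return rol32(x, 32 - (n % 32))

def expand4(a, b, c):
    """Model of the loop body: a=$B1R0G0B0, b=$G2B2R1G1, c=$R3G3B3R2."""
    a = rol32(a, 8)                                # R0_G0_B0_B1
    p0 = a
    a = rol32(a, 8)                                # G0_B0_B1_R0
    a = (a & 0x0000FFFF) | ((b & 0xFFFF) << 16)    # setword a,b,#1 -> R1_G1_B1_R0
    p1 = a
    b = ror32(b, 8)                                # G1_G2_B2_R1
    b = (b & 0x00FFFFFF) | ((c & 0xFF) << 24)      # setbyte b,c,#3 -> R2_G2_B2_R1
    p2 = b
    p3 = c                                         # R3_G3_B3_R2
    return [p0, p1, p2, p3]

# Four test pixels stored as B,G,R bytes, read back as three little-endian longs
pixels = expand4(0x22101112, 0x31322021, 0x40414230)
print([hex(p >> 8) for p in pixels])               # drop the XX byte to show RRGGBB
```

Shifting each result right by 8 drops the XX byte, which makes it easy to compare against the source pixels.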

rogloh · 2024-08-30 00:43

Yeah good catches for optimizing further @TonyB_ since RRGGBBXX has XX as a don't care and you can rotate into that XX byte at will and still preserve the original source data. I also found a bug in the code, the 512 REP counter should be 512/4 as there are 4 longs written to the LUT in the inner loop. With 12 instructions now in the critical inner loop you get ~ 24+4 P2 clocks used for every 4 pixels, including the block transfer clock overhead writing the RGB32 back to hub RAM. So this is only 7 clocks per RGB32 pixel generated. For video using 640x480 resolution @30Hz, you need to convert 9.216Mpixels/s and this work consumes 64.5M P2 clocks per second. You'd only need a 129 MHz P2 to do this processing so at 300MHz or so there should be plenty of time for other work like triggering the PSRAM driver COG to write and for a simple lowres OSD overlay to be added here, maybe with some transparency effect too. Perhaps coloured text characters at 1/4 of the main resolution, a bit like an old VCR/DVD player.
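
The per-pixel cost can be reproduced with a few lines of Python. This is only a back-of-envelope sketch using the figures quoted in the post (12 two-clock instructions plus roughly 4 clocks of block-write overhead per 4 pixels); the variable names are illustrative.

```python
INSTRUCTIONS = 12        # instructions in the inner loop
CLOCKS_PER_INSTR = 2     # each of these instructions is taken as 2 clocks
OVERHEAD_CLOCKS = 4      # block-write share per 4 pixels, as estimated in the post
PIXELS_PER_ITER = 4

clocks_per_pixel = (INSTRUCTIONS * CLOCKS_PER_INSTR + OVERHEAD_CLOCKS) / PIXELS_PER_ITER
pixels_per_second = 640 * 480 * 30
clocks_per_second = clocks_per_pixel * pixels_per_second

print(clocks_per_pixel, pixels_per_second, clocks_per_second / 1e6)
```

This gives 7 clocks per pixel, 9.216 Mpixels/s and about 64.5M clocks per second, matching the numbers in the post.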

During the processing of the SD card's sectors carrying an AV stream you'd need to demux the audio data and feed it to another COG for further processing/output. You could either put all the audio samples for a frame at the end of the video data - to keep video data contiguous and easy to write into a frame buffer, or you could interleave it and have the RGB24 expansion COG aware of audio boundaries and it could redirect those samples elsewhere in hub RAM. If you encode using 48kHz audio you have an integer number of samples at 24/25/30/50/60 Hz frame rates so audio can always be synced to frame boundaries which is nice. Also if you align each frame's audio+video data to the SD sector boundaries with padding at the end of each frame to fill its last sector, you should be able to rapidly seek forwards or backwards into the stream.
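
A rough sketch of the frame layout arithmetic, assuming 512-byte sectors, fixed-size frames, and 16-bit stereo audio (4 bytes per sample). The function names and the example frame size are hypothetical, not part of any existing file format.

```python
SECTOR = 512
AUDIO_RATE = 48_000

def samples_per_frame(fps):
    """Audio samples per video frame, or None if it is not a whole number."""
    q, r = divmod(AUDIO_RATE, fps)
    return q if r == 0 else None

def padded_frame_bytes(video_bytes, audio_bytes):
    """Round one frame's audio+video payload up to a whole number of sectors."""
    total = video_bytes + audio_bytes
    return -(-total // SECTOR) * SECTOR    # ceiling division, then back to bytes

def frame_start_sector(first_sector, frame_index, frame_bytes):
    """With fixed-size padded frames, seeking to frame N is a single multiply."""
    return first_sector + frame_index * (frame_bytes // SECTOR)

for fps in (24, 25, 30, 50, 60):
    print(fps, samples_per_frame(fps))

video = 640 * 480 * 3                      # one RGB24 frame
audio = samples_per_frame(30) * 4          # 16-bit stereo samples for one frame
print(padded_frame_bytes(video, audio))
```

Because every frame occupies the same whole number of sectors, seeking forwards or backwards needs no index table.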

Ideally when playing back the video file from an SD card, its full chain of clusters could be read in and processed initially so that a list of starting clusters/sectors of each multi-sector burst and each burst length could be created to be followed by the SD reader COG. This would also let you figure out the amount of fragmentation up front and whether you'd still be able to play the file without impairment. If the file is not fragmented at all it would not need a lot of space to store the list, just a few 32 bit values could then suffice. Doing this has a downside though as there is some initial work to process the cluster data at the start which may slightly delay the beginning of playback and you'd need more hub space to hold the list of bursts if the filesystem is heavily fragmented. It avoids any additional reading of the FAT32 table during playback though and can therefore guarantee playback reliability if there is very constrained SD card transfer bandwidth (assuming the SD card itself has bounded latencies).
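
The burst list idea could look something like the following Python sketch. It assumes the cluster chain has already been walked out of the FAT into a plain list, and uses the usual FAT32 mapping where cluster 2 is the first data cluster; `chain_to_bursts` and its parameters are illustrative names only.

```python
def chain_to_bursts(clusters, sectors_per_cluster, first_data_sector):
    """Collapse a FAT32 cluster chain into (start_sector, sector_count) runs."""
    bursts = []
    for c in clusters:
        start = first_data_sector + (c - 2) * sectors_per_cluster
        if bursts and bursts[-1][0] + bursts[-1][1] == start:
            bursts[-1][1] += sectors_per_cluster          # contiguous: extend the run
        else:
            bursts.append([start, sectors_per_cluster])   # fragment boundary: new run
    return [tuple(b) for b in bursts]

# An unfragmented file needs a single entry; each fragment adds one more.
print(chain_to_bursts([2, 3, 4, 5], 8, 1000))
print(chain_to_bursts([2, 3, 4, 10, 11, 7], 8, 1000))
```

The length of the returned list is also a direct measure of fragmentation, so it can be used to decide up front whether playback is likely to be impaired.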

evanh · 2024-08-31 06:46

Oops, the last reports, after doing the latency measurements, with lat=xxx displayed, were including the latency printing time in the duration value. Which then impacted the calculated values for data rate and overheads. That said, it didn't have any meaningful impact on the values produced for the 16 MB streams.

Fixed now.

rogloh · 2024-08-31 07:31

So @evanh I am still thinking about SDTV resolution video acquisition and write/playback from SD cards on the P2. I might start a new thread on this topic soon. Do you expect that the faster UHS-I card's write speeds when operating at 3.3V in 4 bit mode will be capable of their 1.8V rated performance? E.g., I go buy a V60 rated UHS-I card and then have it sustain ~40MiB/s writes (which remains within 60MB/s) but operate it at 3.3V and clock it around 100M nibbles/sec (sysclk/3) or possibly sysclk/2? If a card can't sustain this type of write speed at 3.3V we won't be able to use a P2 to capture the pixel data and write the file on the fly in real-time as hoped...it'd need to be written offline by a PC for example.

evanh · 2024-08-31 07:45

I suspect so. But I've not really tried it out even though the block writes have been a stable routine for a while. Block writes were actually quite a lot easier than the block reads to get working without hassle ...

evanh · 2024-08-31 09:46

Seems to be true. Here's one of the Sandisk's. You can see some fluctuations. Presumably these are the SD card making room as it remaps the logical to physical block mapping and initiating automatic erasures.

  Full Speed clock divider = 3 (100.0 MHz)
  rxlag=6 selected  Lowest=5 Highest=7
 CID decode:  ManID=03   OEMID=SD   Name=SE32G
   Ver=8.0   Serial=6018369B   Date=2017-12
Init successful
 Other compile options:
  - Response handler #2
  - Multi-block-read loop
  - Multi-block buffer size is 16 kiBytes
  - Read data CRC ignored
  - P_PULSE clock-gen
  RX_SCHMITT=0  DAT_REG=0  CLK_REG=0  CLK_POL=0  CLK_DIV=3

Write blocks speed test:
32768 blocks = 16384 kiB   busy=74   rate = 37.2 MiB/s   duration = 429193 us   zero-overhead = 335544 us   overheads = 21.8 %
16384 blocks = 8192 kiB   busy=74   rate = 38.9 MiB/s   duration = 205402 us   zero-overhead = 167772 us   overheads = 18.3 %
8192 blocks = 4096 kiB   busy=74   rate = 39.1 MiB/s   duration = 102280 us   zero-overhead = 83886 us   overheads = 17.9 %
4096 blocks = 2048 kiB   busy=74   rate = 38.9 MiB/s   duration = 51375 us   zero-overhead = 41943 us   overheads = 18.3 %
2048 blocks = 1024 kiB   busy=74   rate = 38.5 MiB/s   duration = 25926 us   zero-overhead = 20972 us   overheads = 19.1 %
1024 blocks = 512 kiB   busy=74   rate = 23.4 MiB/s   duration = 21346 us   zero-overhead = 10486 us   overheads = 50.8 %
512 blocks = 256 kiB   busy=74   rate = 29.2 MiB/s   duration = 8544 us   zero-overhead = 5243 us   overheads = 38.6 %
256 blocks = 128 kiB   busy=74   rate = 34.1 MiB/s   duration = 3657 us   zero-overhead = 2621 us   overheads = 28.3 %
128 blocks = 64 kiB   busy=74   rate = 36.7 MiB/s   duration = 1702 us   zero-overhead = 1311 us   overheads = 22.9 %
64 blocks = 32 kiB   busy=74   rate = 34.4 MiB/s   duration = 907 us   zero-overhead = 655 us   overheads = 27.7 %
32 blocks = 16 kiB   busy=74   rate = 30.6 MiB/s   duration = 509 us   zero-overhead = 328 us   overheads = 35.5 %
16 blocks = 8 kiB   busy=74   rate = 25.2 MiB/s   duration = 310 us   zero-overhead = 164 us   overheads = 47.0 %
8 blocks = 4 kiB   busy=74   rate = 18.5 MiB/s   duration = 211 us   zero-overhead = 82 us   overheads = 61.1 %
4 blocks = 2 kiB   busy=74   rate = 12.1 MiB/s   duration = 161 us   zero-overhead = 41 us   overheads = 74.5 %
2 blocks = 1 kiB   busy=74   rate = 7.1 MiB/s   duration = 137 us   zero-overhead = 20 us   overheads = 85.4 %
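
For reading these tables, the columns appear to be related as in the Python sketch below. The formulas are inferred from the printed numbers (512-byte blocks, 4-bit bus at the 100 MHz SD clock, so two clocks per byte), not taken from the test program's source, and `write_stats` is a made-up name.

```python
def write_stats(blocks, duration_us, sd_clock_hz=100_000_000):
    """Recompute rate, zero-overhead time and overhead share for one log line."""
    nbytes = blocks * 512
    rate_mib_s = nbytes / (1024 * 1024) / (duration_us / 1e6)
    zero_overhead_us = nbytes * 2 / sd_clock_hz * 1e6    # 4-bit bus: two clocks per byte
    overheads_pct = (1 - zero_overhead_us / duration_us) * 100
    return rate_mib_s, zero_overhead_us, overheads_pct

# First line of the Sandisk table: 32768 blocks in 429193 us
rate, zero_us, overhead = write_stats(32768, 429193)
print(rate, zero_us, overhead)
```

For that first line the sketch lands on the logged 335544 us and 21.8 %, and on 37.2 MiB/s if the last digit is truncated rather than rounded, which is what the log seems to do.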

evanh · 2024-08-31 09:49

And that Kingston card is all over the place:

Write blocks speed test:
32768 blocks = 16384 kiB   busy=74   rate = 19.1 MiB/s   duration = 837516 us   zero-overhead = 335544 us   overheads = 59.9 %
16384 blocks = 8192 kiB   busy=74   rate = 12.6 MiB/s   duration = 633725 us   zero-overhead = 167772 us   overheads = 73.5 %
8192 blocks = 4096 kiB   busy=74   rate = 14.3 MiB/s   duration = 279545 us   zero-overhead = 83886 us   overheads = 69.9 %
4096 blocks = 2048 kiB   busy=74   rate = 12.3 MiB/s   duration = 161802 us   zero-overhead = 41943 us   overheads = 74.0 %
2048 blocks = 1024 kiB   busy=74   rate = 12.2 MiB/s   duration = 81521 us   zero-overhead = 20972 us   overheads = 74.2 %
1024 blocks = 512 kiB   busy=74   rate = 7.2 MiB/s   duration = 68632 us   zero-overhead = 10486 us   overheads = 84.7 %
512 blocks = 256 kiB   busy=74   rate = 31.8 MiB/s   duration = 7853 us   zero-overhead = 5243 us   overheads = 33.2 %
256 blocks = 128 kiB   busy=74   rate = 26.7 MiB/s   duration = 4672 us   zero-overhead = 2621 us   overheads = 43.8 %
128 blocks = 64 kiB   busy=74   rate = 20.2 MiB/s   duration = 3082 us   zero-overhead = 1311 us   overheads = 57.4 %
64 blocks = 32 kiB   busy=74   rate = 13.6 MiB/s   duration = 2286 us   zero-overhead = 655 us   overheads = 71.3 %
32 blocks = 16 kiB   busy=74   rate = 8.2 MiB/s   duration = 1889 us   zero-overhead = 328 us   overheads = 82.6 %

32768 blocks = 16384 kiB   busy=74   rate = 17.1 MiB/s   duration = 932153 us   zero-overhead = 335544 us   overheads = 64.0 %
16384 blocks = 8192 kiB   busy=74   rate = 37.8 MiB/s   duration = 211198 us   zero-overhead = 167772 us   overheads = 20.5 %
8192 blocks = 4096 kiB   busy=74   rate = 33.2 MiB/s   duration = 120327 us   zero-overhead = 83886 us   overheads = 30.2 %
4096 blocks = 2048 kiB   busy=74   rate = 38.2 MiB/s   duration = 52254 us   zero-overhead = 41943 us   overheads = 19.7 %
2048 blocks = 1024 kiB   busy=74   rate = 37.3 MiB/s   duration = 26805 us   zero-overhead = 20972 us   overheads = 21.7 %
1024 blocks = 512 kiB   busy=74   rate = 35.5 MiB/s   duration = 14080 us   zero-overhead = 10486 us   overheads = 25.5 %
512 blocks = 256 kiB   busy=74   rate = 32.3 MiB/s   duration = 7719 us   zero-overhead = 5243 us   overheads = 32.0 %
256 blocks = 128 kiB   busy=74   rate = 27.5 MiB/s   duration = 4538 us   zero-overhead = 2621 us   overheads = 42.2 %
128 blocks = 64 kiB   busy=74   rate = 21.2 MiB/s   duration = 2947 us   zero-overhead = 1311 us   overheads = 55.5 %
64 blocks = 32 kiB   busy=74   rate = 14.5 MiB/s   duration = 2152 us   zero-overhead = 655 us   overheads = 69.5 %
32 blocks = 16 kiB   busy=74   rate = 8.8 MiB/s   duration = 1757 us   zero-overhead = 328 us   overheads = 81.3 %

evanh · 2024-08-31 09:57

Hmm, weird, even the least likely one to be capable, the Apacer, is still sometimes peaking over 30 MB/s. And weirder is it's entirely consistent too. Every run produces the same pattern of fast and slow rates.

  Full Speed clock divider = 3 (100.0 MHz)
  rxlag=6 selected  Lowest=5 Highest=7
 CID decode:  ManID=9C   OEMID=SO   Name=USD00
   Ver=0.2   Serial=39A8FBFB   Date=2018-1
Init successful
 Other compile options:
  - Response handler #2
  - Multi-block-read loop
  - Multi-block buffer size is 16 kiBytes
  - Read data CRC ignored
  - P_PULSE clock-gen
  RX_SCHMITT=0  DAT_REG=0  CLK_REG=0  CLK_POL=0  CLK_DIV=3

Write blocks speed test:
32768 blocks = 16384 kiB   busy=74   rate = 10.5 MiB/s   duration = 1515588 us   zero-overhead = 335544 us   overheads = 77.8 %
16384 blocks = 8192 kiB   busy=74   rate = 11.4 MiB/s   duration = 698975 us   zero-overhead = 167772 us   overheads = 75.9 %
8192 blocks = 4096 kiB   busy=74   rate = 11.4 MiB/s   duration = 350679 us   zero-overhead = 83886 us   overheads = 76.0 %
4096 blocks = 2048 kiB   busy=74   rate = 10.0 MiB/s   duration = 198217 us   zero-overhead = 41943 us   overheads = 78.8 %
2048 blocks = 1024 kiB   busy=74   rate = 8.0 MiB/s   duration = 123559 us   zero-overhead = 20972 us   overheads = 83.0 %
1024 blocks = 512 kiB   busy=74   rate = 33.2 MiB/s   duration = 15024 us   zero-overhead = 10486 us   overheads = 30.2 %
512 blocks = 256 kiB   busy=74   rate = 4.4 MiB/s   duration = 56029 us   zero-overhead = 5243 us   overheads = 90.6 %
256 blocks = 128 kiB   busy=74   rate = 25.4 MiB/s   duration = 4916 us   zero-overhead = 2621 us   overheads = 46.6 %
128 blocks = 64 kiB   busy=74   rate = 19.6 MiB/s   duration = 3175 us   zero-overhead = 1311 us   overheads = 58.7 %
64 blocks = 32 kiB   busy=74   rate = 13.3 MiB/s   duration = 2347 us   zero-overhead = 655 us   overheads = 72.0 %
32 blocks = 16 kiB   busy=74   rate = 8.0 MiB/s   duration = 1949 us   zero-overhead = 328 us   overheads = 83.1 %
16 blocks = 8 kiB   busy=74   rate = 4.4 MiB/s   duration = 1751 us   zero-overhead = 164 us   overheads = 90.6 %
8 blocks = 4 kiB   busy=74   rate = 2.3 MiB/s   duration = 1652 us   zero-overhead = 82 us   overheads = 95.0 %
4 blocks = 2 kiB   busy=74   rate = 1.2 MiB/s   duration = 1601 us   zero-overhead = 41 us   overheads = 97.4 %
2 blocks = 1 kiB   busy=74   rate = 0.6 MiB/s   duration = 1576 us   zero-overhead = 20 us   overheads = 98.7 %

evanh · 2024-08-31 10:04

I don't think I'm doing anything wrong, I basically copied the prior write loop of one buffer length that was used for the lag mapping routine and expanded it to repeat for a specified stream length. As such these are all verified writes. It checks for the data response of CRC-is-valid pattern on DAT0. And anyway, if the SD card rejected any one block then the sequence would go to Smile, timing out somewhere.

EDIT: There is a caveat of sorts - The buffer is only filled with a pattern once. Meaning the stream is a repeating pattern of one buffer length long - so repeats every 16 kBytes.

evanh · 2024-08-31 10:14

Adata Orange using 128 kB buffer. It's consistent and fast, and also modern:

  Full Speed clock divider = 3 (100.0 MHz)
  rxlag=6 selected  Lowest=5 Highest=7
 CID decode:  ManID=1D   OEMID=AD   Name=USD
   Ver=2.0   Serial=000003A0   Date=2024-1
Init successful
 Other compile options:
  - Response handler #2
  - Multi-block-read loop
  - Multi-block buffer size is 128 kiBytes
  - Read data CRC ignored
  - P_PULSE clock-gen
  RX_SCHMITT=0  DAT_REG=0  CLK_REG=0  CLK_POL=0  CLK_DIV=3

Write blocks speed test:
32768 blocks = 16384 kiB   busy=74   rate = 35.5 MiB/s   duration = 450576 us   zero-overhead = 335544 us   overheads = 25.5 %
16384 blocks = 8192 kiB   busy=74   rate = 36.7 MiB/s   duration = 217458 us   zero-overhead = 167772 us   overheads = 22.8 %
8192 blocks = 4096 kiB   busy=74   rate = 36.3 MiB/s   duration = 110030 us   zero-overhead = 83886 us   overheads = 23.7 %
4096 blocks = 2048 kiB   busy=74   rate = 38.8 MiB/s   duration = 51517 us   zero-overhead = 41943 us   overheads = 18.5 %
2048 blocks = 1024 kiB   busy=74   rate = 38.7 MiB/s   duration = 25808 us   zero-overhead = 20972 us   overheads = 18.7 %
1024 blocks = 512 kiB   busy=74   rate = 38.5 MiB/s   duration = 12955 us   zero-overhead = 10486 us   overheads = 19.0 %
512 blocks = 256 kiB   busy=74   rate = 38.2 MiB/s   duration = 6529 us   zero-overhead = 5243 us   overheads = 19.6 %
256 blocks = 128 kiB   busy=74   rate = 37.7 MiB/s   duration = 3315 us   zero-overhead = 2621 us   overheads = 20.9 %
128 blocks = 64 kiB   busy=74   rate = 36.5 MiB/s   duration = 1708 us   zero-overhead = 1311 us   overheads = 23.2 %
64 blocks = 32 kiB   busy=74   rate = 34.5 MiB/s   duration = 905 us   zero-overhead = 655 us   overheads = 27.6 %
32 blocks = 16 kiB   busy=74   rate = 31.0 MiB/s   duration = 504 us   zero-overhead = 328 us   overheads = 34.9 %
16 blocks = 8 kiB   busy=74   rate = 25.6 MiB/s   duration = 305 us   zero-overhead = 164 us   overheads = 46.2 %
8 blocks = 4 kiB   busy=74   rate = 19.0 MiB/s   duration = 205 us   zero-overhead = 82 us   overheads = 60.0 %
4 blocks = 2 kiB   busy=74   rate = 12.5 MiB/s   duration = 156 us   zero-overhead = 41 us   overheads = 73.7 %
2 blocks = 1 kiB   busy=74   rate = 7.4 MiB/s   duration = 131 us   zero-overhead = 20 us   overheads = 84.7 %

evanh · 2024-08-31 11:32

LOL, the "busy" value of 74 is actually the data block CRC-is-valid status from the SD card. I had vaguely intended to report the duration of busy signal between blocks but it's the CRC status pattern that is passed back so that's what's printed. The fact they are all 74 (%10_010_10) is a good thing.

EDIT: It's made up of a high-to-low start-bit, then %010 for valid status, then a high end bit1. The final low bit0 probably shouldn't be there. It's marked as hi-Z in the spec ... right, fixed ... new valid pattern is 37 (%10_010_1):

Write blocks speed test:
32768 blocks = 16384 kiB   busy=37   rate = 37.6 MiB/s   duration = 425208 us   zero-overhead = 335544 us   overheads = 21.0 %
16384 blocks = 8192 kiB   busy=37   rate = 35.0 MiB/s   duration = 228432 us   zero-overhead = 167772 us   overheads = 26.5 %
8192 blocks = 4096 kiB   busy=37   rate = 36.4 MiB/s   duration = 109593 us   zero-overhead = 83886 us   overheads = 23.4 %
4096 blocks = 2048 kiB   busy=37   rate = 38.9 MiB/s   duration = 51308 us   zero-overhead = 41943 us   overheads = 18.2 %
2048 blocks = 1024 kiB   busy=37   rate = 38.9 MiB/s   duration = 25702 us   zero-overhead = 20972 us   overheads = 18.4 %
1024 blocks = 512 kiB   busy=37   rate = 38.7 MiB/s   duration = 12903 us   zero-overhead = 10486 us   overheads = 18.7 %
512 blocks = 256 kiB   busy=37   rate = 38.4 MiB/s   duration = 6502 us   zero-overhead = 5243 us   overheads = 19.3 %
256 blocks = 128 kiB   busy=37   rate = 37.8 MiB/s   duration = 3302 us   zero-overhead = 2621 us   overheads = 20.6 %
128 blocks = 64 kiB   busy=37   rate = 36.7 MiB/s   duration = 1702 us   zero-overhead = 1311 us   overheads = 22.9 %
64 blocks = 32 kiB   busy=37   rate = 34.6 MiB/s   duration = 902 us   zero-overhead = 655 us   overheads = 27.3 %
32 blocks = 16 kiB   busy=37   rate = 31.0 MiB/s   duration = 503 us   zero-overhead = 328 us   overheads = 34.7 %
16 blocks = 8 kiB   busy=37   rate = 25.6 MiB/s   duration = 305 us   zero-overhead = 164 us   overheads = 46.2 %
8 blocks = 4 kiB   busy=37   rate = 18.9 MiB/s   duration = 206 us   zero-overhead = 82 us   overheads = 60.1 %
4 blocks = 2 kiB   busy=37   rate = 12.5 MiB/s   duration = 156 us   zero-overhead = 41 us   overheads = 73.7 %
2 blocks = 1 kiB   busy=37   rate = 7.4 MiB/s   duration = 131 us   zero-overhead = 20 us   overheads = 84.7 %
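
A small Python sketch of how the 6-bit pattern breaks down, following the description above (idle-high bit, low start bit, three status bits, high end bit). The `%010` accepted code comes from the post; `%101` for a CRC error is included on the understanding that this is what the SD spec defines, so treat that entry as an assumption. The function name is hypothetical.

```python
CRC_STATUS = {0b010: "data accepted", 0b101: "CRC error"}

def decode_crc_status(value):
    """Split the 6-bit pattern into idle, start, status and end bits."""
    idle = (value >> 5) & 1
    start = (value >> 4) & 1
    status = (value >> 1) & 0b111
    end = value & 1
    framed_ok = (idle, start, end) == (1, 0, 1)
    return framed_ok, CRC_STATUS.get(status, "unknown")

print(decode_crc_status(37))         # the new 6-bit pattern
print(decode_crc_status(74 >> 1))    # the old 7-bit value with its extra low bit dropped
```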

rogloh · 2024-08-31 13:32

I wonder what sort of sustained write speeds one can achieve with sysclk/2 operation...? Maybe if it's easy for you to test, give that option a try. Or is it limited by the CRC computation?

If we wanted to capture and store interlaced 720x480 res video (16:9) at 60Hz with 24bpp packed pixel data, i.e. approximately equivalent to 480p30, we'd need to sustain a write speed of around 31.1MB/s (about 29.7MiB/s) for video (480 * 30 * 720 * 3 & no framebuffer) and say an extra 192kBps for some stereo audio sampled at 48kHz, 16bits. Your write speed numbers are already close and above this even at sysclk/3 in many cases.
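
Working the numbers through in Python, using only the figures from the paragraph above (the variable names are illustrative):

```python
video_bytes_per_s = 720 * 480 * 3 * 30     # 24bpp, 30 full frames per second
audio_bytes_per_s = 48_000 * 2 * 2         # stereo, 16-bit samples
total_bytes_per_s = video_bytes_per_s + audio_bytes_per_s

print(video_bytes_per_s / 1e6)             # decimal megabytes per second
print(video_bytes_per_s / 2**20)           # MiB per second, the unit used in the write logs
print(total_bytes_per_s / 2**20)           # video plus audio, MiB per second
```

The video product is 31,104,000 bytes per second, which is about 29.7 MiB/s in the units the write speed tests report, and just under 29.9 MiB/s once the audio is added.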

If I can extract the video data (and I am working on a technique for this), I wonder if it could be written on the fly to SD without a large PSRAM buffer in the middle and only using the hub RAM for any temporary buffering. We would only need to buffer whatever maximum case latency is ever seen during the multi-sector writes to absorb the timing variation. With 512kB of hub RAM I am sort of hoping that might be enough because if I don't have to use external memory for the capture then I can get by without using a PSRAM based P2 Edge and save lots of IO pins which I need. A P2-EVAL is another option then too.
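
As a rough feel for how much write latency a hub RAM buffer could absorb, here is a one-line estimate in Python. The 256 KiB buffer size is purely a hypothetical split of the 512kB hub RAM, and the stream rate is the 480p30 video plus audio figure worked out from the earlier paragraph.

```python
def absorbable_latency_ms(buffer_bytes, stream_bytes_per_s):
    """Longest stall the buffer can ride out before incoming data overruns it."""
    return buffer_bytes / stream_bytes_per_s * 1000

STREAM = 720 * 480 * 3 * 30 + 48_000 * 2 * 2    # bytes per second, video + audio
print(absorbable_latency_ms(256 * 1024, STREAM))
```

With those assumptions the buffer covers a stall of roughly 8.4 ms, so whether it is enough depends on the worst-case pause a given card shows during multi-sector writes.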

evanh · 2024-08-31 13:36

It won't achieve anything in a generic sense. The CRC generation is only good for sysclock/3. If the CRC's were all pre-calculated then it could do sysclock/2 easy.

evanh · 2024-08-31 13:40

Wow, some SD cards can in fact go faster. That Adata Orange is hitting 46 MB/s by using 120 MHz (sysclock = 360 MHz).

EDIT: The block write code path is: do the busy check, then single-block-write, then compute CRC, then wait for the clock to stop before writing the CRC at the end, then CRC status retrieval, then return to C. The C code manages the buffer and block sequence, similar to the single-block-read method.

// tx complete, now rx "CRC status"
        dirl    r_pdat    // release DAT pins ready for CRC status from SD card
        and     r_pdat, #0x3f    // strip away ADDPINS, leaving DAT0
        rep     @.rend2, #7
#ifdef CLK_STEP
        wypin   #2, r_clk    // one clock pulse
#else
        wypin   #1, r_clk    // one clock pulse
#endif
        testp   r_pdat   wc    // sample DAT0 before the clock pin transitions
        rcl     rc, #1    // collect status bits
        waitse1      // wait for each clock done
.rend2

rogloh · 2024-08-31 13:44

@evanh said:
Wow, some SD cards can in fact go faster. That Adata Orange is hitting 46 MB/s by using 120 MHz (sysclock = 360 MHz).

Yeah I remember seeing some bizarrely high speeds back when I ran my own hacked up test code. 83.5 MiB/s @350MHz P2.

https://forums.parallax.com/discussion/comment/1545820/#Comment_1545820

rogloh · 2024-08-31 13:47

That was reading though, not writes and only for single sector IIRC. Essentially the actual transfer rate once clocking so not a real-world number.

evanh · 2024-08-31 13:57

@rogloh said:
That was reading though, not writes and only for single sector IIRC. Essentially the actual transfer rate once clocking so not a real-world number.

Right, yeah, there's a lot of toing and froing in the details that slows it all down. I guess ignoring the CRC status is an option but it's a very risky one.

avsa242 · 2024-08-31 20:15

That made me curious - in your experience with all of the testing, do you have any stats or a rough idea of how often transactions fail CRC?

evanh · 2024-08-31 23:45

Because of signal integrity? Never. Well, there was plenty during development of the routines. This is an advantage of using sysclock/3 for rx timing. Sysclock/2 data-clock phase relationship runs much closer to an edge, but it's not easy to know how close. Would have to ramp the sysclock frequency to figure it out.

When the die temperature changes enough then the timing edges will move around. So, at high clock rates, due to this drifting, even for sysclock/3, there will be errors that will need "rxlag" recalibrated.

There is one intentional timeout every run. I have "rxlag" misconfigured to trigger an automated calibration of the command response. This process involves repeatedly retrieving CID and CSD registers using CMD10 and CMD9 respectively. This occurs after the completion of card init.

Currently all the init steps are done with slow smartpin based low-level tx/rx routines. These have been a stable reference during much instability of the fast streamer based routines. Not requiring setting precise clock phase timing was a big advantage for getting started.

At this stage, the fast routines are now stable and robust enough to use them for card init as well. They are supposed to be configurable for a large divider so maybe I could give that a try too ...

rogloh · 2024-09-01 01:27

@evanh said:

There is one intentional timeout every run. I have "rxlag" misconfigured to trigger an automated calibration of the command response. This process involves repeatedly retrieving CID and CSD registers using CMD10 and CMD9 respectively. This occurs after the completion of card init.

Nice. That already should help things remain reasonably stable. I wonder, at the clock rates you are using, how quickly it would drift beyond the clock cycle range based on temperature changes alone after initial calibration. Would it drift by more than 10ns @ 100MHz in typical room temp (15C-25C) scenarios, or maybe it's just 3ns of drift you'd need to mess it up? Wonder what temp change you'd need for that sort of variation? We could put a hairdryer near the P2 and see if it messes with the read CRC check?

evanh · 2024-09-01 01:37

It's more a power consumption problem. Stuff like Ada's emulators tend to be power hungry and that raises the die temperature a lot as it transitions from loading to running. So it'll be fine unless you want to, say, save something back to the SD card.

As for how much temperature, that depends on the clock-data phase edge proximity, which is dependent on not just temperature but the card-dependent lag response, and also clock frequency because there is more than one potential edge to cross.

At sysclock/3, I feel safe that a simple recalibration upon error will suffice.

rogloh · 2024-09-01 02:03

The only thing I see is that with 3 P2 clocks per nibble (sysclk/3) for sampling, you'll end up with one bad delay value and two good ones. How would you choose between the two good delay values?

evanh · 2024-09-01 02:30

The logic at the moment is:

Sysclock                'rxlag' compensating value
 MHz |    1    2    3    4    5    6    7    8    9   10   11   12
=====+===============================================================
...
 200 |    0    0    0  100  100  100    0    0    0    0    0    0 
 201 |    0    0    0  100  100  100    0    0    0    0    0    0 
 202 |    0    0    0  100  100  100    0    0    0    0    0    0 
 203 |    0    0    0  100  100  100    0    0    0    0    0    0 
 204 |    0    0    0  100  100  100    0    0    0    0    0    0 
 205 |    0    0    0  100  100  100    0    0    0    0    0    0 
 206 |    0    0    0  100  100  100    0    0    0    0    0    0 
 207 |    0    0    0  100  100  100    0    0    0    0    0    0 
 208 |    0    0    0  100  100  100    0    0    0    0    0    0 
 209 |    0    0    0   99  100  100    0    0    0    0    0    0 
 210 |    0    0    0   12  100  100    0    0    0    0    0    0 
 211 |    0    0    0    6  100  100    0    0    0    0    0    0 
 212 |    0    0    0    5  100  100    0    0    0    0    0    0 
 213 |    0    0    0    5  100  100    0    0    0    0    0    0 
 214 |    0    0    0    4  100  100    0    0    0    0    0    0 
 215 |    0    0    0    4  100  100    0    0    0    0    0    0 
 216 |    0    0    0    2  100  100    0    0    0    0    0    0 
 217 |    0    0    0    1  100  100    0    0    0    0    0    0 
 218 |    0    0    0    1  100  100    0    0    0    0    0    0 
 219 |    0    0    0    0  100  100    0    0    0    0    0    0 
 220 |    0    0    0    0  100  100    0    0    0    0    0    0 
...
=====+===============================================================
 MHz |    1    2    3    4    5    6    7    8    9   10   11   12

In this example, if sysclock was at 200 MHz then it will choose a rxlag=5. But if sysclock was at 220 MHz then it'll choose rxlag=6.
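
One selection rule that reproduces both examples is to take the lowest and highest rxlag values that scored 100 and pick the midpoint, rounding up. The Python sketch below is only a guess at that logic: it happens to agree with the two cases above and with the `rxlag=6 selected  Lowest=5 Highest=7` report, but the real routine may differ. `pick_rxlag` is a made-up name, and the sketch assumes the passing values form one contiguous run.

```python
def pick_rxlag(success_pct):
    """success_pct maps rxlag value -> percentage of good transfers.

    Returns (selected, lowest, highest), or None if nothing scored 100.
    """
    good = sorted(lag for lag, pct in success_pct.items() if pct == 100)
    if not good:
        return None
    lowest, highest = good[0], good[-1]
    return (lowest + highest + 1) // 2, lowest, highest   # midpoint, rounded up

row_200 = {3: 0, 4: 100, 5: 100, 6: 100, 7: 0}
row_220 = {3: 0, 4: 0, 5: 100, 6: 100, 7: 0}
print(pick_rxlag(row_200))
print(pick_rxlag(row_220))
```

Rounding up when only two values pass picks the later sampling point, which is the choice described for the 220 MHz row.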

It happens at this line in the card init reports: rxlag=7 selected Lowest=6 Highest=8

   clkfreq = 360000000   clkmode = 0x10011fb
   &buff = 0x6c84   &buff1 = 0xae84
Card detected ... power cycle of SD card
 400 kHz SD clock divider = 900
  CMD8 R7 0000015a - Valid v2.0+ SD Card
  ACMD41 OCR c0ff8000 - Valid SDHC/SDXC Card
  CMD2/CMD3 - Data Transfer Mode entered - Published RCA 59b40000
  ACMD6 - 4-bit data interface engaged
  CMD6 - High-Speed access mode engaged
  CMD9 - CSD backed-up
  CMD10 - CID backed-up
  Full Speed clock divider = 3 (120.0 MHz)
  rxlag=7 selected  Lowest=6 Highest=8
 CID decode:  ManID=1D   OEMID=AD   Name=SD   
   Ver=0.2   Serial=B16204C2   Date=2013-3
Init successful
 Other compile options:
  - Response handler #2
  - Multi-block-read loop
  - Multi-block buffer size is 16 kiBytes
  - Read data CRC processed in parallel
  - P_PULSE clock-gen
  RX_SCHMITT=0  DAT_REG=0  CLK_REG=0  CLK_POL=0  CLK_DIV=3

evanh · 2024-09-01 05:37

Bah! I just spent an hour or so trying to work out why one SD card was getting CRC read errors. It wasn't until I did a second round of trying all the cards again that the problem went away on its own - The old bad contact problem again. Only this time it must have been one of the other DAT pins so it only showed up in the CRC. So it isn't happening to just the DAT0 pin, and it isn't just for one card either.

So I've now had a huge batch of 100% CRC error rate that wasn't my doing.

rogloh · 2024-09-01 06:10

Is the card contact bad or the reader, maybe over time with lots of use the reader contacts degrade or lose their spring? You've probably had a lot of cards in and out by now.
Here's the mechanical datasheet for the part I used (IIRC), but it doesn't list endurance of the contacts. Maybe that's in another specification from OUPIIN. Contacts are copper alloy.
https://download.altronics.com.au/files/datasheets_P5717.pdf

evanh · 2024-09-01 07:11

It's just the way the socket works I think. It might be tiny specks of dirt holding the spring contacts away from the card.

This has happened on occasion all along. I can't really say how much since it's only in recent days, since the CRC code paths have become fully functioning, that I can even detect it reliably. I'd previously noticed it happen a few times with the DAT0 pin after I got the busy detect functioning. I don't think I've noticed the CMD or CLK pins having the same issue though.

Unrelated segue: I've been writing some high level code in Basic to load settings from a SD card in a motion controller at work. It has a weird behaviour where it'll fault the program on file read, after a successful file open. I have to trap the error or it will abort the program. After anything from zero to maybe eight retries of closing, re-opening, and reattempting the file read, it'll start reading the file just fine. Then continues fine for subsequent file operations as well. Re-socketing the SD card makes no real difference. The firmware just seems quirky.
