Oh, yeah I forgot the exact RGB format - been a while since I've done much P2 stuff lately - it shows. LOL.
I think you'd have something like this in order to stream uncompressed high quality SDTV stuff from a P2:
1 COG doing the SD reading of sectors into hub RAM
1 COG doing the reading of hub RAM RGB24 data and writing it back into hub RAM as RGB32 - it may also make sense to do an OSD overlay on top as part of this step. It would also trigger the PSRAM write and would need to co-ordinate closely with the first COG's buffering mechanism
1 PSRAM driver COG writing data into PSRAM and serving video read requests
1 VIDEO driver COG
1 audio COG
1 control COG maybe
Still leaves a couple of other COGs free for network or another interface maybe. [EDIT: wrong] Interestingly this bandwidth fits nicely on a 100Mbps link. Could possibly make a sling box of some kind which streams video over a network as well...
It'd be nice to find some device that could take an SDTV stream and digitize it to feed into a P2, like from a tuner/STB. Then you might make a simple DVR out of a P2. Although it could probably only read or write at any one time, not both.
Also, same concept but for expanding YUYV (aka $VV_Y1_UU_Y0 when loaded into a little-endian register) to $VVYYUUxx (P2's native-ish 32bpp YUV format as it were)
(of course without any attempt at smoothing the chroma)
' todo: init fifo ptr and ptrb/ptra prior to this
' expands at a rate of ~2 pixels in ??? p2 clocks
        mov     loopcount, ##600
loop
        rep     @.inner, ##512/2
        rflong  a               '= $VV_Y1_UU_Y0
        mov     b, a
        setbyte b, b, #2        '= $VV_Y0_UU_Y0, so pixel 0 is written first
        wrlut   b, ptrb++       'write $VV_Y0_UU_xx (pixel 0)
        wrlut   a, ptrb++       'write $VV_Y1_UU_xx (pixel 1)
.inner
        setq2   #(512-1)
        wrlong  0, ptra++
        djnz    loopcount, #loop
        ret
a       long    0
b       long    0
loopcount long  0
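As a cross-check, the byte shuffle this loop performs can be modelled in Python - a sketch of the intended transform (chroma simply duplicated across the pixel pair, per the "no smoothing" note below), not the PASM itself:

```python
import struct

def expand_yuyv(src: bytes) -> bytes:
    """Expand packed YUYV (memory order Y0,U,Y1,V, i.e. $VV_Y1_UU_Y0 as a
    little-endian long) into two $VV_YY_UU_xx longs, one per pixel."""
    out = bytearray()
    for i in range(0, len(src), 4):
        y0, u, y1, v = src[i:i + 4]
        out += struct.pack("<I", (v << 24) | (y0 << 16) | (u << 8))  # pixel 0
        out += struct.pack("<I", (v << 24) | (y1 << 16) | (u << 8))  # pixel 1
    return bytes(out)
```

Each input long becomes two output longs, with the don't-care low byte left zero.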
Surprisingly YUYV is already most of the way there.
@rogloh said:
Perhaps this could be merged with the COG that copies into PSRAM and the work overlapped.
Would need to not use the FIFO then.
No, the FIFO just can't be used on the PSRAM driver COG - that is the COG that's using its streamer to copy into PSRAM. The RGB24-to-32 translation COG making the requests to write to PSRAM is still free to use its FIFO.
Also, your loop there puts zero in byte #3. P2 RGB32 mode is $RRGGBBxx, where xx is ignored, so there's a double oddity there. Also note that you are able to use PTRA++ mode with block writes; the bug only happens when you try to use a custom offset.
Maybe: (untested)
' todo: init fifo ptr and ptrb/ptra prior to this
' expands at a rate of ~4 pixels in ??? p2 clocks
        mov     loopcount, ##600
loop
        rep     #13, ##512
        rflong  a               '= $B1_R0_G0_B0
        rflong  b               '= $G2_B2_R1_G1
        rflong  c               '= $R3_G3_B3_R2
        mov     d, a
        shl     d, #8           '= $R0_G0_B0_00
        wrlut   d, ptrb++       'write pixel 0
        getword d, b, #1        '= $00_00_G2_B2
        rolword b, a, #1        '= $R1_G1_B1_R0
        wrlut   b, ptrb++       'write pixel 1
        setbyte d, c, #2        '= $00_R2_G2_B2
        shl     d, #8           '= $R2_G2_B2_00
        wrlut   d, ptrb++       'write pixel 2
        wrlut   c, ptrb++       'write pixel 3 ($R3_G3_B3_xx)
        setq2   #(512-1)
        wrlong  0, ptra++
        djnz    loopcount, #loop
        ret
a       long    0
b       long    0
c       long    0
d       long    0
loopcount long 0
I think one instruction and the d variable can be saved in the loop. Also a good case here for using @.end:
' todo: init fifo ptr and ptrb/ptra prior to this
' expands at a rate of ~4 pixels in ??? p2 clocks
        mov     loopcount, ##600
loop
        rep     @.end, ##512
        rflong  a               '= B1_R0_G0_B0
        rflong  b               '= G2_B2_R1_G1
        rflong  c               '= R3_G3_B3_R2
        rol     a, #8           '= R0_G0_B0_B1
        wrlut   a, ptrb++       'write R0_G0_B0_XX
        rol     a, #8           '= G0_B0_B1_R0
        setword a, b, #1        '= R1_G1_B1_R0
        wrlut   a, ptrb++       'write R1_G1_B1_XX
        ror     b, #8           '= G1_G2_B2_R1
        setbyte b, c, #3        '= R2_G2_B2_R1
        wrlut   b, ptrb++       'write R2_G2_B2_XX
        wrlut   c, ptrb++       'write R3_G3_B3_XX
.end
        setq2   #(512-1)
        wrlong  0, ptra++
        djnz    loopcount, #loop
        ret
a       long    0
b       long    0
c       long    0
loopcount long  0
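A quick Python model of the same unpack, for sanity-checking the output longs (assuming the memory byte order implied by the B1_R0_G0_B0 comments, i.e. B,G,R per pixel):

```python
import struct

def expand_rgb24(src: bytes) -> bytes:
    """Expand packed 24bpp pixels (B,G,R byte order in memory, matching the
    $B1_R0_G0_B0 register comments) into P2-style $RR_GG_BB_xx longs."""
    out = bytearray()
    for i in range(0, len(src), 3):
        b, g, r = src[i:i + 3]
        out += struct.pack("<I", (r << 24) | (g << 16) | (b << 8))
    return bytes(out)
```

Three input longs (12 bytes, 4 pixels) become four output longs, matching the 3-rflong/4-wrlut structure of the loop.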
Yeah, good catches for optimizing further @TonyB_ - since RRGGBBXX has XX as a don't-care, you can rotate into that XX byte at will and still preserve the original source data. I also found a bug in the code: the 512 REP counter should be 512/4, as there are 4 longs written to the LUT per iteration of the inner loop.

With 12 instructions now in the critical inner loop you get ~24+4 P2 clocks used for every 4 pixels, including the block transfer clock overhead writing the RGB32 back to hub RAM. So this is only 7 clocks per RGB32 pixel generated. For video at 640x480 resolution @ 30Hz you need to convert 9.216 Mpixels/s, and this work consumes 64.5M P2 clocks per second. You'd only need a 129 MHz P2 to do this processing, so at 300MHz or so there should be plenty of time for other work, like triggering the PSRAM driver COG to write and adding a simple lowres OSD overlay here, maybe with some transparency effect too. Perhaps coloured text characters at 1/4 of the main resolution, a bit like an old VCR/DVD player.
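The clock budget above tallies up; a quick check of the arithmetic:

```python
# 12 two-clock instructions per 4 pixels, plus the per-iteration share of the
# 512-long setq2 block write (512 clocks / 128 iterations = 4)
clocks_per_4px = 12 * 2 + 4
clocks_per_px = clocks_per_4px / 4            # 7 clocks per RGB32 pixel
pixels_per_sec = 640 * 480 * 30               # 9,216,000 pixels/s
mclocks_per_sec = pixels_per_sec * clocks_per_px / 1e6
print(clocks_per_px, mclocks_per_sec)         # 7.0 64.512
```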
During the processing of the SD card's sectors carrying an AV stream you'd need to demux the audio data and feed it to another COG for further processing/output. You could either put all the audio samples for a frame at the end of the video data - to keep video data contiguous and easy to write into a frame buffer, or you could interleave it and have the RGB24 expansion COG aware of audio boundaries and it could redirect those samples elsewhere in hub RAM. If you encode using 48kHz audio you have an integer number of samples at 24/25/30/50/60 Hz frame rates so audio can always be synced to frame boundaries which is nice. Also if you align each frame's audio+video data to the SD sector boundaries with padding at the end of each frame to fill its last sector, you should be able to rapidly seek forwards or backwards into the stream.
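The 48 kHz divisibility claim is easy to verify - every common frame rate gets a whole number of samples per frame:

```python
# samples of 48 kHz audio per video frame at common frame rates
for fps in (24, 25, 30, 50, 60):
    samples, rem = divmod(48000, fps)
    print(f"{fps} Hz: {samples} samples/frame, remainder {rem}")
# 24->2000, 25->1920, 30->1600, 50->960, 60->800, all with remainder 0
```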
Ideally when playing back the video file from an SD card, its full chain of clusters could be read in and processed initially so that a list of starting clusters/sectors of each multi-sector burst, and each burst length, could be created for the SD reader COG to follow. This would also let you figure out the amount of fragmentation up front and whether you'd still be able to play the file without impairment. If the file is not fragmented at all it would not need a lot of space to store the list; just a few 32 bit values could then suffice. Doing this has a downside though, as there is some initial work to process the cluster data at the start which may slightly delay the beginning of playback, and you'd need more hub space to hold the list of bursts if the filesystem is heavily fragmented. It avoids any additional reading of the FAT32 table during playback though, and can therefore guarantee playback reliability if there is very constrained SD card transfer bandwidth (assuming the SD card itself has bounded latencies).
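The burst list described above is essentially a run-length encoding of the cluster chain. A sketch of the idea in Python (function name and list format are illustrative; real code would walk the FAT32 table to produce the chain):

```python
def chain_to_bursts(chain):
    """Collapse a cluster chain into (start_cluster, run_length) bursts of
    physically contiguous clusters for the SD reader COG to follow."""
    bursts = []
    for c in chain:
        if bursts and c == bursts[-1][0] + bursts[-1][1]:
            start, length = bursts[-1]
            bursts[-1] = (start, length + 1)   # extend the current burst
        else:
            bursts.append((c, 1))              # fragmentation: new burst
    return bursts
```

An unfragmented file collapses to a single (start, length) pair, which is why only a few 32-bit values suffice in the best case, while a heavily fragmented file grows the list one entry per discontinuity.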
Oops, the last reports, after doing the latency measurements, with lat=xxx displayed, were including the latency printing time in the duration value. Which then impacted the calculated values for data rate and overheads. That said, it didn't have any meaningful impact on the values produced for the 16 MB streams. Fixed now.
So @evanh I am still thinking about SDTV resolution video acquisition and write/playback from SD cards on the P2. I might start a new thread on this topic soon. Do you expect that the faster UHS-I cards' write speeds when operating at 3.3V in 4-bit mode will be capable of their 1.8V rated performance? E.g., I go buy a V60 rated UHS-I card and then have it sustain ~40MiB/s writes (which remains within 60MB/s) but operate it at 3.3V and clock it around 100M nibbles/sec (sysclk/3) or possibly sysclk/2? If a card can't sustain this type of write speed at 3.3V we won't be able to use a P2 to capture the pixel data and write the file on the fly in real-time as hoped... it'd need to be written offline by a PC for example.
I suspect so. But I've not really tried it out, even though block writes have been a stable routine for a while. Block writes were actually quite a lot easier than the block reads to get working without hassle ...
Seems to be true. Here's one of the SanDisks. You can see some fluctuations. Presumably these are the SD card making room as it remaps the logical-to-physical block mapping and initiates automatic erasures.
Hmm, weird, even the least likely one to be capable, the Apacer, is still sometimes peaking over 30 MB/s. And weirder is it's entirely consistent too. Every run produces the same pattern of fast and slow rates.
I don't think I'm doing anything wrong. I basically copied the prior write loop of one buffer length that was used for the lag-mapping routine and expanded it to repeat for a specified stream length. As such these are all verified writes. It checks for the data response of CRC-is-valid pattern on DAT0. And anyway, if the SD card rejected any one block then the sequence would stall, timing out somewhere.
EDIT: There is a caveat of sorts - The buffer is only filled with a pattern once. Meaning the stream is a repeating pattern of one buffer length long - so repeats every 16 kBytes.
LOL, the "busy" value of 74 is actually the data block CRC-is-valid status from the SD card. I had vaguely intended to report the duration of busy signal between blocks but it's the CRC status pattern that is passed back, so that's what's printed. The fact they are all 74 (%10_010_10) is a good thing.
EDIT: It's made up of a high-to-low start bit, then %010 for valid status, then a high end bit (bit1). The final low bit0 probably shouldn't be there - it's marked as hi-Z in the spec ... right, fixed ... new valid pattern is 37 (%10_010_1):
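The fixed token can be sanity-checked bit by bit - a small helper (illustrative, not from the driver) that validates the %1_0_sss_1 shape and the %010 accepted status:

```python
def crc_status_ok(token: int) -> bool:
    """Validate a 6-bit write CRC status token (%1_0_sss_1): idle-high bit,
    low start bit, 3 status bits, high end bit. %010 status = CRC accepted."""
    idle_high = (token >> 5) & 1 == 1
    start_low = (token >> 4) & 1 == 0
    end_high = token & 1 == 1
    status = (token >> 1) & 0b111
    return idle_high and start_low and end_high and status == 0b010
```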
I wonder what sort of sustained write speeds one can achieve with sysclk/2 operation...? Maybe if it's easy for you to test, give that option a try. Or is it limited by the CRC computation?
If we wanted to capture and store interlaced 720x480 res video (16:9) at 60Hz with 24bpp packed pixel data, i.e. approximately equivalent to 480p30, we'd need to sustain a write speed of around 31.1MB/s for video (480 * 30 * 720 * 3, no framebuffer) and say an extra 192kB/s for some stereo audio sampled at 48kHz, 16 bits. Your write speed numbers are already close to and above this, even at sysclk/3 in many cases.
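Those rates tally up as follows:

```python
video = 720 * 480 * 30 * 3   # bytes/s: 24bpp, 30 full frames/s equivalent
audio = 48000 * 2 * 2        # bytes/s: 48 kHz, stereo, 16-bit
print(video / 1e6)           # 31.104 -> ~31.1 MB/s of video
print(audio / 1e3)           # 192.0  -> 192 kB/s of audio
print((video + audio) / 2**20)  # ~29.8 MiB/s combined
```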
If I can extract the video data (and I am working on a technique for this), I wonder if it could be written on the fly to SD without a large PSRAM buffer in the middle, using only hub RAM for any temporary buffering. We would only need to buffer whatever worst-case latency is ever seen during the multi-sector writes, to absorb the timing variation. With 512kB of hub RAM I am sort of hoping that might be enough, because if I don't have to use external memory for the capture then I can get by without a PSRAM-based P2 Edge, saving lots of IO pins which I need. A P2-EVAL is another option then too.
It won't achieve anything in a generic sense. The CRC generation is only good for sysclock/3. If the CRCs were all pre-calculated then it could do sysclock/2 easily.
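For reference, the CRC in question is the standard CRC-16-CCITT (polynomial 0x1021, initial value 0) that SD data blocks carry, one per DAT line in 4-bit mode. A bitwise Python reference, just to pin down the algorithm being computed:

```python
def sd_crc16(data: bytes) -> int:
    """CRC-16-CCITT (poly 0x1021, init 0x0000) as appended to SD data blocks.
    In 4-bit mode the card keeps one such CRC per DAT line."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc
```

Pre-calculating these per block (or table-driving the update) is what would unlock sysclock/2, per the comment above.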
Wow, some SD cards can in fact go faster. That Adata Orange is hitting 46 MB/s by using 120 MHz (sysclock = 360 MHz).
EDIT: The block write code path is: busy check, then single-block write, then compute CRC, then wait for the clock to stop before writing the CRC at the end, then CRC status retrieval, then return back to C. The C code manages the buffer and block sequence, similar to the single-block-read method.
        // tx complete, now rx "CRC status"
        dirl    r_pdat          // release DAT pins ready for CRC status from SD card
        and     r_pdat, #0x3f   // strip away ADDPINS, leaving DAT0
        rep     @.rend2, #7
#ifdef CLK_STEP
        wypin   #2, r_clk       // one clock pulse
#else
        wypin   #1, r_clk       // one clock pulse
#endif
        testp   r_pdat wc       // sample DAT0 before the clock pin transitions
        rcl     rc, #1          // collect status bits
        waitse1                 // wait for each clock done
.rend2
@rogloh said:
That was reading though, not writes and only for single sector IIRC. Essentially the actual transfer rate once clocking so not a real-world number.
Right, yeah, there's a lot of toing and froing in the details that slows it all down. I guess ignoring the CRC status is an option but it's a very risky one.
Because of signal integrity? Never. Well, there was plenty during development of the routines. This is an advantage of using sysclock/3 for rx timing. Sysclock/2 data-clock phase relationship runs much closer to an edge, but it's not easy to know how close. Would have to ramp the sysclock frequency to figure it out.
When the die temperature changes enough, the timing edges will move around. So, at high clock rates, due to this drifting, there will be errors - even for sysclock/3 - that will need "rxlag" recalibrated.
There is one intentional timeout every run. I have "rxlag" misconfigured to trigger an automated calibration of the command response. This process involves repeatedly retrieving CID and CSD registers using CMD10 and CMD9 respectively. This occurs after the completion of card init.
Currently all the init steps are done with slow smartpin based low-level tx/rx routines. These have been a stable reference during much instability of the fast streamer based routines. Not requiring precise clock phase timing was a big advantage for getting started.
At this stage, the fast routines are now stable and robust enough to use them for card init as well. They are supposed to be configurable for a large divider so maybe I could give that a try too ...
Nice. That already should help things remain reasonably stable. I wonder, at the clock rates you are using, how quickly it would drift beyond the clock cycle range based on temperature changes alone after initial calibration. Would it drift by more than 10ns @ 100MHz in typical room temp (15C-25C) scenarios, or maybe it's just 3ns of drift you'd need to mess it up? Wonder what temp change you'd need for that sort of variation? We could put a hairdryer near the P2 and see if it messes with the read CRC check?
It's more a power consumption problem. Stuff like Ada's emulators tend to be power hungry and that raises the die temperature a lot as it transitions from loading to running. So it'll be fine unless you want to, say, save something back to the SD card.
As for how much temperature: that depends on the clock-data phase edge proximity - which is dependent on not just temperature but the card-dependent lag response, and also clock frequency, because there is more than one potential edge to cross.
At sysclock/3, I feel safe that a simple recalibration upon error will suffice.
The only thing I see is that with 3 P2 clocks per nibble (sysclk/3) for sampling, you'll end up with one bad delay value and two good ones. How would you choose between the two good delay values?
Bah! I just spent an hour or so trying to work out why one SD card was getting CRC read errors. It wasn't until I did a second round of trying all the cards again that the problem went away on its own - the old bad contact problem again. Only this time it must have been one of the other DAT pins, so it only showed up in the CRC. So it isn't happening to just the DAT0 pin, and it isn't just for one card either.
So I've now had a huge batch of 100% CRC error rate that wasn't my doing.
Is the bad contact the card or the reader? Maybe over time, with lots of use, the reader contacts degrade or lose their spring. You've probably had a lot of cards in and out by now.
Here's the mechanical datasheet for the part I used (IIRC), but it doesn't list endurance of the contacts. Maybe that's in another specification from OUPIIN. Contacts are copper alloy. https://download.altronics.com.au/files/datasheets_P5717.pdf
It's just the way the socket works, I think. It might be tiny specks of dirt holding the spring contacts away from the card.
This has happened on occasion all along. I can't really say how often, since it's only in recent days - since the CRC code paths became fully functional - that I can even detect it reliably. I'd previously noticed it happen a few times with the DAT0 pin after I got the busy detect functioning. I don't think I've noticed the CMD or CLK pins having the same issue though.
Unrelated segue: I've been writing some high level code in Basic to load settings from an SD card in a motion controller at work. It has a weird behaviour where it'll fault the program on file read, after a successful file open. I have to trap the error or it will abort the program. After anything from zero to maybe eight retries of closing, re-opening, and reattempting the file read, it'll start reading the file just fine. Then it continues fine for subsequent file operations as well. Re-socketing the SD card makes no real difference. The firmware just seems quirky.
Comments
I know, your wording just sounded like doing it in the driver cog.
And that Kingston card is all over the place:
Adata Orange using 128 kB buffer. It's consistent and fast, and also modern:
Yeah I remember seeing some bizarrely high speeds back when I ran my own hacked up test code. 83.5 MiB/s @350MHz P2.
https://forums.parallax.com/discussion/comment/1545820/#Comment_1545820
That was reading though, not writes and only for single sector IIRC. Essentially the actual transfer rate once clocking so not a real-world number.
That made me curious - in your experience with all of the testing, do you have any stats or a rough idea of how often transactions fail CRC?
The logic at the moment is:
In this example, if sysclock was at 200 MHz then it will choose a rxlag=5. But if sysclock was at 220 MHz then it'll choose rxlag=6.
It happens at this line in the card init reports:
rxlag=7 selected Lowest=6 Highest=8
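The quoted selection logic itself didn't make it into this capture, but a plausible reading of the rule from the printed report is to take the centre of the passing window (purely a sketch; the real code may bias the rounding by frequency, as the 200 vs 220 MHz example hints):

```python
def pick_rxlag(lowest: int, highest: int) -> int:
    """Choose a compensation value centred in the window of lag values that
    passed calibration. (Guessed from the printed report, not the driver.)"""
    return (lowest + highest) // 2
```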