New SD mode P2 accessory board - Page 6 — Parallax Forums

New SD mode P2 accessory board


Comments

  • evanh Posts: 15,046

    Yep, got that one going tonight ... But then went around in circles trying to work out why sysclock/4 gets stuck at the response. I've added a 400 kHz card init frequency - which works well but doesn't help the moment I switch to sysclock/4 after init. Tried adjusting phase of the command tx shifter, didn't help. Calling the same response routine as all the other commands. Adjusting sysclock frequency doesn't help. Not even a newly added general purpose timeout is tripping. It has me puzzled.

    Will hook up the scope again tomorrow. I'd pulled it off a couple days back in preparation for boosting clock speed.

    I can see myself moving to using the streamer for all CMD and DAT pins in the end. Whatever is going wrong with sysclock/4 will likely go away then. Not to mention that sysclock/2 is impossible with smartpins.

  • rogloh Posts: 5,111
    edited 2023-07-13 11:19

    I reckon the smartpins are pretty good for the CMD/RESPONSE stuff as long as you run this phase a little slower before switching over to a faster data phase with the streamer. But the potential for overlap does complicate things. Maybe it just needs a special check for when the DAT start bit arrives before the response has fully come in on the smartpin: clock manually in that case until the response is complete, then shorten the streamed data length to match. Remember you can't stream from both the CMD and DAT sources at the same time from the same COG, so one will need to be polled or handled by a smartpin.

  • evanh Posts: 15,046
    edited 2023-07-13 07:25

    Bah, too tired last night. I'd missed updating the source for a dynamic pin mode. It was still using the older compile time mode setting at SD command issuing. Therefore the phase shift setting wasn't being applied when switching to sysclock/4 config.

    If I'd got the scope going earlier I'd have sorted it quicker. Another sign of tiredness, I guess.

    On a positive note, I wanted to add those features at some stage anyway.

  • evanh Posts: 15,046
    edited 2023-07-17 13:33

    sysclock/2 is pretty crazy performance for sure. Streamer timing is very sensitive though. The kicker here is different SD cards have different latencies so not really a goer without a dynamically calibrated compensation delay.

    Adding a calibrate sequence to a failed response CRC should be doable without creating additional overhead ... well at least not on block read/write anyway. Question is, is that really enough checking when forever teetering on the edge like this? Checking the data block CRC as well would provide a lot more security.
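
    For what it's worth, the calibrate step could look something like the sketch below: sweep the candidate compensation delays, note which ones come back with a good response CRC, and settle on the middle of the widest passing run so there is margin either side. The names are all hypothetical. `probe` just stands in for "issue a command at this delay and report whether the response CRC was good", and `fake_probe` simulates a card rather than touching hardware.

```python
from typing import Callable, Optional


def calibrate_delay(probe: Callable[[int], bool], max_delay: int) -> Optional[int]:
    """Return the centre of the widest run of passing delays, or None if none pass."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for delay in range(max_delay + 1):
        if probe(delay):
            if run_len == 0:
                run_start = delay  # a new passing run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # run broken by a failing delay
    if best_len == 0:
        return None
    return best_start + (best_len - 1) // 2


# Simulated card (assumption): delays 5..9 give a valid response CRC.
def fake_probe(delay: int) -> bool:
    return 5 <= delay <= 9


print(calibrate_delay(fake_probe, 15))  # prints 7
```

    Picking the centre rather than the first passing value is a design choice: it leaves headroom for drift, which matters if the link is "forever teetering on the edge".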

  • Yeah checking data CRC on reads could be a good thing for ensuring integrity, and this would be okay with just single sector reads. But with multiple sector reads streaming data in at higher speeds it may be hard to keep up, unless you could find a way to use the COG to compute the CRC while streaming somehow.

    We would need a data CRC for writes so some tight CRC code will need to be figured out regardless.
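
    As a reference point for the data CRC, here is a plain bitwise CRC-16 using the SD data polynomial (x^16 + x^12 + x^5 + 1, i.e. 0x1021, initial value 0). It is only a sketch for checking results against, not the tight code a driver would need. Note the assumption that the bits arrive as one serial stream: in 4-bit SD mode each DAT line carries its own CRC16 over the bits on that line, so the input would have to be split per line first.

```python
def crc16_ccitt(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-16, polynomial 0x1021, initial value 0, MSB first."""
    for byte in data:
        crc ^= byte << 8  # bring the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Standard check string for this CRC variant.
print(hex(crc16_ccitt(b"123456789")))  # prints 0x31c3
```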

  • Think your comment below from the flexspin compiler thread belongs here...

    @evanh said:
    I think I've worked out I did indeed have a timing issue. If the response code missed any clocks it got stuck. The earlier code must have been teetering on the edge of missing the first pulse - Which would be tripped up on worst case hubRAM access timings.

    I've worked out I can reliably create a similar situation by making the command/response clocks faster than sysclock/4. Then the rx smartpin itself becomes unreliable at seeing every clock pulse - Which has the same outcome of appearing to lock up for the same reason. I still assume the rx smartpin is counting all clock pulses. Time to fix that ...

    I can't recall the clock speed I ran for the CMD/RESP smart pin but I know I didn't see any timing issues when running my data phase up to sysclk/2 using the streamer, although the P2 frequency I was running at might have nicely been in a sweet spot range for the input delay window. I know you can't clock the sync rx smartpin mode too fast or it won't detect input pin clock transitions correctly (something slower than sysclk/2 is required). I was certainly running the clock fast only for the DAT pin transfers and somewhat slower for the CMD/RESP transfer, and I know I was doing simple pin polling for the start bit vs what you do with that special SMPS smartpin mode that is also looking for a change on the pin while running. If that works I do like your solution.

    In general though I don't think you have to optimize the CMD/RESP stuff down to saving every last clock to get the best bang per buck with SD cards. It's the data transfer phase sending 512 bytes that needs to be fast, although speeding up the start bit checking is a good thing, which is where my own code would have been somewhat slower than what you are doing. Ultimately supporting multiple SD sector reads will be where it will really fly and stream back data super fast off a decent SD card. By that point the limiting factor is probably just some CRC validation should you wish to enable it; it's probably best to make that optional on reads.

  • evanh Posts: 15,046

    That was me clarifying that my earlier post about a possible Flexspin bug was likely just my own buggy code.

    In fact I should add to it that the historical issue I referenced in there has a decent probability of being the same problem - Sync serial smartpins not seeing every clock pulse can lead to puzzling behaviour.

  • evanh Posts: 15,046
    edited 2023-07-22 07:19

    As for minimum ratio, the smartpin needs sampling time to identify individually both the low phase and high phase of each clock pulse:

    • Sysclock/4 is the limit when the Prop2 is the clock master. In theory this could be smaller but then the phase timing has to be calibrated. I don't know this for sure. See below.

    • Sysclock/5 is the smallest ratio when the Prop2 is the slave. I learnt this with the async smartpins.
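
    Those limits can be folded into a small divider calculation, roughly as below. This is only arithmetic, not driver code: the function name and the 200 MHz sysclock are made-up example values, and `min_ratio` would be 4 or 5 depending on which case from the list applies.

```python
def sd_clock_divider(sysclock_hz: int, target_hz: int, min_ratio: int = 4) -> int:
    """Smallest whole divider keeping the clock at or below target_hz, floored at min_ratio."""
    divider = -(-sysclock_hz // target_hz)  # ceiling division
    return max(divider, min_ratio)


SYSCLOCK_HZ = 200_000_000  # assumed sysclock, for illustration only

print(sd_clock_divider(SYSCLOCK_HZ, 400_000))      # card init rate: prints 500
print(sd_clock_divider(SYSCLOCK_HZ, 100_000_000))  # asks for /2, clamped: prints 4
```
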
  • evanh Posts: 15,046

    As for why I end up in these situations. I like to interleave functionality when optimising this stuff, so can blow up what was a perfectly okay piece of code.

  • evanh Posts: 15,046
    edited 2023-07-22 06:19

    Bah! My rx smartpin code might be too slow for sysclock/3 ... Yeah, I'd discovered that in the past but forgot about it. I'm not stopping the clock at the response start-bit detection - which adds bit-shift processing complexity - which I'm real-time dealing with.

    Like you say, no need to be pushing the command and response clock rate anyway.

  • evanh Posts: 15,046
    edited 2023-07-26 12:17

    @Wuerfel_21 said:
    But does the latency period really care about the clock speed while waiting for the start bit? I'd think this is limited by the internal speed of the SD controller, but I never verified that assumption.

    It seems to be a little of both. Every card model seems to need a delay after response followed by a set amount of preamble clocks (different for different models) before it sends the first data block. The minimum delay is as yet unknown - Empirically, 400 μs for first block, but is longer on some cards. So, nothing like what the spec shows.

    This arrangement has a nice side effect - it means a small number of trailing clock pulses post-response aren't counted toward the data block preamble. Therefore they can be blindly added after the response as is required for any command completion.

  • evanh Posts: 15,046

    Here's a processing performance comparison between two different methods. The response is R1 from a CMD13 command. SD clock rate is sysclock/4.

    • Upper method is using a sync rx smartpin. The code, when transitioning from command to response, pauses the clock briefly to change over the pin mode and also drive the CMD pin high in preparation for detecting the response's start-bit. It then relies on real-time start-bit detection and bit-shifting to pick out the correct data as it's clocking in. It takes some extra time and clocks after the last bit is received before the processing is completed. This is marked by the termination of the clock gen a short time before the green trace is lowered. The final burst of clock pulses is a guard group added after completion.

    • Lower method is using an Fcached streamer block and the earlier mentioned P_PWM_SMPS clocking for start-bit search. It begins with the same command timing but instead of driving the CMD pin high it only needs to stop driving it because the smartpin search for the start-bit is happy to wait for the pin to rise on its pull-up. As can be seen the start-bit search takes extra time to complete and also transition back to sysclock/4 with the streamer all aligned. But this is made up for at the end by its faster completion. And finally, the same guard group of clock pulses is added after completion.

    End result is pretty much six of one and half a dozen of the other.
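
    To illustrate the start-bit detection and bit-shifting part of the upper method, here is a host-side sketch working on a list of already-sampled CMD line levels. It is not the smartpin code, just the same idea in slow motion: skip the idle-high bits, treat the first 0 as the start bit, and shift in 48 bits from there. The function name and the sample data are invented for the example, and the frame's CRC field is a dummy value.

```python
from typing import List, Optional

RESPONSE_BITS = 48  # R1: start, transmission, 6-bit index, 32-bit status, CRC7, end


def extract_response(samples: List[int]) -> Optional[int]:
    """Find the first 0 (start bit) on the sampled CMD line and return 48 bits from it."""
    try:
        start = samples.index(0)
    except ValueError:
        return None  # line never went low: no response seen
    frame = samples[start:start + RESPONSE_BITS]
    if len(frame) < RESPONSE_BITS:
        return None  # capture ended before the response did
    value = 0
    for bit in frame:
        value = (value << 1) | bit  # MSB first, as clocked in
    return value


# Made-up capture: 5 idle-high clocks, then a CMD13-style frame, then idle again.
frame_value = 0x0D00000900FF  # index 13, status 0x00000900, dummy CRC/end byte
samples = [1] * 5 + [(frame_value >> (47 - i)) & 1 for i in range(48)] + [1] * 3

response = extract_response(samples)
print(hex(response))               # prints 0xd00000900ff
print((response >> 40) & 0x3F)     # command index: prints 13
```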

  • @evanh said:
    End result is pretty much six of one and half a dozen of the other.

    Yeah once you do the data transfer and find it swamps any minor speed difference here, you probably won't worry much about it in the end.

    I do quite like this smartpin mode finding the start bit though, especially if it can leverage/share other pins for LEDs/card detect purposes etc. Seems cleaner to have the P2 HW do things for you and reduce the complexity in SW. Just have to make sure this CMD+RESP scheme can work in cleanly with the subsequent data transfers as well, particularly in cases with potential overlap (which I've not seen IRL). Remember that one COG can only stream from one given set of pins at a time.

  • evanh Posts: 15,046
    edited 2023-07-27 13:12

    @rogloh said:
    ... Just have to make sure this CMD+RESP scheme can work in cleanly with the subsequent data transfers as well, particularly in cases with potential overlap (which I've not seen IRL).

    That's the minimum of 400 μs gap I've observed after the response. It seems safe to assume they never overlap even though the spec allows it.

  • evanh Posts: 15,046
    edited 2023-07-27 13:29

    @rogloh said:
    I do quite like this smartpin mode finding the start bit though, especially if it can leverage/share other pins for LEDs/card detect purposes etc. Seems cleaner to have the P2 HW do things for you and reduce the complexity in SW.

    There is one reason to use the smartpin-start-bit-search and streamer method over the bit-shifted smartpin method. It allows the response to be checked for tuning the streamer alignment since the streamer here is aligned in exactly the same way as the data block streamer code is.

    Such an option allows checking every response CRC on the fly to figure if all is well or not. Therefore no housekeeping unless a response CRC error has been detected. The alternative would be periodic checking or added overhead in front of a block read command to check its tuning first.

    The third option would be to perform read data block CRCs but that is a heavy burden that I can't see being worth it simply because why are we pushing the performance envelope in the first place?
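
    For reference, the response CRC check itself is cheap. Below is a bitwise CRC-7 (polynomial x^7 + x^3 + 1, initial value 0) and a check of a 48-bit frame, written as a host-side sketch rather than the actual driver code. `response_crc_ok` is a hypothetical helper name, and it only applies to 48-bit frames that carry a CRC, such as commands and R1-style responses.

```python
def crc7(data: bytes) -> int:
    """Bitwise CRC-7, polynomial x^7 + x^3 + 1 (0x09), initial value 0, MSB first."""
    crc = 0
    for byte in data:
        for bit in range(7, -1, -1):
            in_bit = (byte >> bit) & 1
            top = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if in_bit ^ top:
                crc ^= 0x09
    return crc


def response_crc_ok(frame: bytes) -> bool:
    """Check a 6-byte frame: the last byte holds CRC7 in bits 7..1 and the end bit in bit 0."""
    return len(frame) == 6 and frame[5] == ((crc7(frame[:5]) << 1) | 1)


# CMD0 with a zero argument is the familiar test case: its final byte is 0x95.
print(hex(crc7(bytes([0x40, 0, 0, 0, 0]))))                 # prints 0x4a
print(response_crc_ok(bytes([0x40, 0, 0, 0, 0, 0x95])))     # prints True
```

    A failed check here is what would kick off a recalibration of the streamer alignment, with no extra housekeeping on the passing path.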

  • @evanh said:

    @rogloh said:
    ... Just have to make sure this CMD+RESP scheme can work in cleanly with the subsequent data transfers as well, particularly in cases with potential overlap (which I've not seen IRL).

    That's the minimum of 400 μs gap I've observed after the response. It seems safe to assume they never overlap even though the spec allows it.

    Based on what I saw too I tend to agree. Probably ok to just work with that until there is definitive proof of actual overlap with some real card.

    @evanh said:

    Such an option allows checking every response CRC on the fly to figure if all is well or not. Therefore no housekeeping unless a response CRC error has been detected. The alternative would be periodic checking or added overhead in front of a block read command to check its tuning first.

    Yeah that is handy if you want to keep input delay adaptive.

    The third option would be to perform read data block CRCs but that is a heavy burden that I can't see being worth it simply because why are we pushing the performance envelope in the first place?

    Yeah read CRCs should be optional due to their extra overhead, which could slow things down too much if multiple sectors are to be read (à la streaming). And if you can determine input delay based on CMD+RESP feedback only, that seems like a reasonable plan.

  • evanh Posts: 15,046

    Hmm, I suspect, but haven't looked for real, that the existing SD SPI driver code in Flexspin is not sequencing a CMD12 correctly during a CMD18 multi-block read. It just does its command default of ten bytes timeout after the CMD12 and then gives up, deselecting the card anyway.

    Since the CMD12 is being issued at the beginning of a block read, ten bytes would not even be close to enough to collect the actual response - which would be delivered after the block in progress. Given it obviously is working out in practice, I surmise this is likely because of card deselection cancelling the remaining sequence.

  • IIRC CMD12 cuts off any blocks in progress (at least in SPI mode)

  • In fact,

  • evanh Posts: 15,046
    edited 2023-07-30 13:02

    Oh, I misunderstood the diagram. I'd read it as CMD12 just needs to be earlier than the end of the block. Makes a lot more sense that the block rather terminates early because of CMD12.

    Thanks for finding the info. I'm about to test doing the same in SD mode. CMD23 looks to be an alternative to CMD12 there though.
