New SD mode P2 accessory board - Page 37 — Parallax Forums

New SD mode P2 accessory board


Comments

  • roglohrogloh Posts: 6,411

    @evanh said:
    PPS: Regarding those block read/write routines, they partly depend on registers being loaded along with the code. If Fcache is shifted to LUTRAM you'd have to move such data into locals instead.

    Yeah I know all about it, that's the tricky part. But we have 32 registers to play with and there are only up to 9 parameters I think. We can get them into C register variables using high level assignments from the source structure.

    Also I've been making this multi-instance so your globals are no longer present. Even though I am only exposing the SD mode driver host interface as a single slot initially, I see no reason to fix it to just one interface.

  • evanhevanh Posts: 17,297

    I don't think there are any true globals in that Flex driver. Things like the pin numbers are contained to the instance by FlexC's C++ struct __using() wrapper. Only a DAT section would be global.

    Also, in case you missed my previous edit, those register constants could, instead of creating locals, be placed in with the presets that are fast copied into the RES section.

  • evanhevanh Posts: 17,297

    I've updated the Object Exchange with v1.14. The zip file now includes four examples and improved documentation - https://obex.parallax.com/obex/sdsd-cc/

  • evanhevanh Posts: 17,297

    I've got a working driver with a small optimisation for size, making for a saving of 148 bytes. It is just dealing with command response start bit search. Nothing else. It clocks and tests one bit at a time instead of relying on a reconfigured clock - Borrowed from a similar solution I used for CRC response collection in the block write routine. The code is tidier due to no longer dealing with variable lag effects.

    It doesn't affect block read/write performance at all but when comparing command-only performance the existing v1.14 solution is 13% faster than this one.

    Anyone got objections to locking in this one?

  • evanhevanh Posts: 17,297

    I've also removed the redundant CMD6 SWITCH_FUNC switchover to High-Speed interface. Binary file size reduced by another 128 bytes.

    I had left it in early on in the hope of it helping with enabling other features, which it never did, and had kind of forgotten about it.

  • evanhevanh Posts: 17,297
    edited 2026-05-20 22:02

    @evanh said:
    ... It is just dealing with command response start bit search. Nothing else. ...

    It looks like I can go further with this approach. It seems I can also remove all rx lead-ins and their associated rate settings, replacing them with a simpler WAITX + XINIT located after the coupled DIRH + WYPIN for starting the SD clock.

    I had thought I'd covered this ground thoroughly very early on but I must have overlooked something there. Best guess is having found I had no choice when it came to the tx side of things I'd assumed rx wasn't going to be any easier and just duplicated the tx approach.

    PS: To be able to transmit at any divider rate, on demand, from sysclock/2 and up, it needs exact sysclock tick adjustability for timing alignment between streamer and smartpin. And there is a bad overlap with the SD clock start in the small dividers without using a spacing lead-in to precisely push out the streamer timing to match the smartpin. Here's an example of the required tx sequence:

            setxfrq lnco    // sysclock/1 for lead-in timing
            dirl    p_clk    // reset clock-gen smartpin to align with streamer
            ...
            xinit   m_calign, #0    // lead-in delay at sysclock/1
            setq    v_nco    // data transfer rate, takes effect on XZERO below (buffered command)
            xzero   m_cmd, pb    // place first 32 bits in tx buffer, aligned to clock via lead-in delay
            dirh    p_clk    // clock timing starts here, first clock pulse occurs in smartpin's second period
            wypin   #6*8+2, p_clk    // SD clock pulses, 6 bytes + 2 pulses to kick off a response
            ...
            xzero   m_cmd, pb     // place remaining 16 bits in buffer for when first 32 bits completes
    

    And I have the rx routine, after the start-bit is found, done basically the same way but with a measured delay, for the frequency-dependent rx latencies, added to that lead-in.

    PPS: Now it looks like, for rx routine, I can actually get away with the following:

            dirl    p_clk    // reset clock-gen smartpin to align with streamer
            ...
            dirh    p_clk    // clock timing starts here
            wypin   clocks, p_clk    // first clock pulse outputs during second clock period
            waitx   rxlag
            xinit   m_resp, #0     // rx response, aligned to SD clock
            ...
    

    No need to set up the smartpin rate nor even the streamer rate as they are both unchanged from the command issuing.

  • evanhevanh Posts: 17,297
    edited 2026-06-04 14:52

    I've found, on the Eval board at least, that I can get higher reliability by using a registered clock pin. I accidentally found the extra reliability by going insane with the SD card clock frequency - Hitting 180 MHz (sysclock/2 clock divider)!

    The accident happened because I'd been banging my head against a new bug, that was only appearing at very high clock rates, and the bug wasn't where I was expecting. Which, funnily, turned out to be in the recently rewritten command response start-bit search. I'd used an edge event trigger in a place where I'd previously proven that only the TESTP instruction works. The latency difference mattered.

    I've tested both with the new and old methods of start-bit search and, with the bug squashed, both behave the same. So now I wonder if maybe I should leave clock registration switched on permanently ... I should test more boards first ....

  • roglohrogloh Posts: 6,411

    That's pretty fast. What speed grade was your card? I recall I also once hit speeds like that earlier on in this same thread when I was testing things out https://forums.parallax.com/discussion/comment/1545820/#Comment_1545820
    At these speeds you can fill or empty the P2 memory pretty rapidly. But sustaining it for long is rather unlikely because of the filesystem overheads. At this transfer rate, I'm interested to know what your driver can sustain for video streaming applications?

  • evanhevanh Posts: 17,297
    edited 2026-06-05 09:25

    Ah, I don't intend anyone to actually use that clock rate. Even though it worked I doubt it was something that would stay reliable over extended environments, eg: Different wiring of the SD slot. I highlighted it only because it showed up a timing improvement that would presumably make the driver more robust.

    I need to do some testing ... I've been tidying up and reverting some of my experiments. I had started working on replacing all start-bit search code blocks with generally the same approach as above but realised it creates extra work in dealing with timeouts. The command-response timeout is based on a limited number of SD clock cycles rather than an actual duration.

    EDIT: I guess I could use branch-on-event to keep the inner loop small. Not sure it'll result in reduced code size though. The whole aim is to shrink the driver size.

  • evanhevanh Posts: 17,297
    edited 2026-06-07 12:58

    That's unexpected. I'll guess it's the smartpin IN behaviour, rather than the event hardware. I've just been using sysclock/2 to see how tight the new response start-bit search loop can go ... and it's a tad disappointing. It seems I can throw in up to 4 instructions between the WYPIN #1,pin and the associated WAITSE1 without any change in loop time.

    The pulse period is set to just 2 sysclock ticks, yet it seems to take 10 ticks to trigger its IN rise event. :(

    Here's the relevant code:

    // Z is initially clear
                rep     @.rend, #64    // max of 64 clocks for response to start, Ncr (SD spec 4.12.4)
        if_nz   wypin   #1, p_clk    // one clock pulse
        if_nz   testpn  p_cmd   wz    // Z set if start(S)-bit found, TESTP is required for low latency!
        if_nz   waitse1      // wait for each clock done
    .rend
    // Z clear if timed-out - S-bit not found
    

    So I may as well add an early exit branch in there. That way it won't loop all 64 times.

    EDIT: On the other hand, that does explain why it has been holding together so well at extreme clock settings. The pulse interval never gets shorter than sysclock/12. And I had been counting my blessings without going to the trouble of scoping it up till now.

    Maybe it's all for the better. It means I don't need exceptions for short timings. And this was all about reduced code size after all.
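
    As a rough worked example of what that sysclock/12 floor means, the C sketch below computes the effective SD clock during the start-bit search and the worst-case search time when all 64 clocks are used. The function names are hypothetical and not part of the driver; the 12-tick interval and 64-clock limit are the figures quoted in this thread, and 360 MHz is the clkfreq from the log posted further down.

```c
#include <stdint.h>

/* Ticks between successive single-clock pulses in the search loop.
   The thread reports this interval never drops below 12 sysclock ticks. */
#define LOOP_TICKS  12u
/* Maximum clocks to wait for the response start bit (the REP count). */
#define NCR_CLOCKS  64u

/* Effective SD clock rate while searching for the start bit, in Hz. */
static uint32_t search_clock_hz(uint32_t sysclock_hz)
{
    return sysclock_hz / LOOP_TICKS;
}

/* Worst-case search duration in nanoseconds, rounded down
   (all 64 clocks issued without seeing a start bit). */
static uint64_t search_timeout_ns(uint32_t sysclock_hz)
{
    uint64_t ticks = (uint64_t)NCR_CLOCKS * LOOP_TICKS;
    return ticks * 1000000000ull / sysclock_hz;
}
```

    At a 360 MHz sysclock this gives a 30 MHz search clock and a time-out of a little over 2 microseconds, which is why the time-out is naturally counted in SD clocks rather than as a duration.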

  • evanhevanh Posts: 17,297
    edited 2026-06-07 13:06

    Here's the updated loop. It loops just as fast with the extra JMP instruction.

                rep     @.rend, #64    // max of 64 clocks for response to start, Ncr (SD spec 4.12.4)
                wypin   #1, p_clk    // one clock pulse
                testp   p_cmd   wc    // C clear if start(S)-bit found, TESTP is required for low latency!
        if_nc   jmp     #.rend
                waitse1      // wait for each clock done
    .rend
    // C set if timed-out - S-bit not found
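
    For readers who don't speak PASM2, the control flow of that loop can be modelled in plain C. This is only a sketch of the logic, not driver code: sample_cmd() and the cmd_line bit pattern are hypothetical stand-ins for clocking the card and sampling the CMD pin with TESTP.

```c
#include <stdbool.h>
#include <stdint.h>

#define NCR_MAX_CLOCKS 64  /* same limit as the REP count in the loop */

/* Hypothetical stand-in for the CMD pin: bit i of cmd_line is the level
   sampled on clock i (1 = idle high, 0 = start bit). */
static bool sample_cmd(uint64_t cmd_line, int clock)
{
    return (cmd_line >> clock) & 1u;
}

/* Returns the clock index at which the start bit was seen,
   or -1 if 64 clocks passed without one (the time-out case). */
static int find_start_bit(uint64_t cmd_line)
{
    for (int clock = 0; clock < NCR_MAX_CLOCKS; clock++) {
        /* one clock pulse has been issued, now sample CMD */
        if (!sample_cmd(cmd_line, clock))
            return clock;   /* early exit, like the JMP out of the REP block */
    }
    return -1;              /* S-bit not found */
}
```

    The early return plays the role of the added JMP: once the start bit is seen, no further clocks are spent on the search.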
    
  • evanhevanh Posts: 17,297
    edited 2026-06-08 05:26

    Huh, I had been lucky with the above loop too. The basic loop has been in use for gathering the CRC status response from the SD card after each block written.

            rep @.rend2, #6
            wypin   #1, p_clk    // one clock pulse
            testp   p_dat   wc    // sample DAT0 before the clock pin transitions
            rcl timeout, #1    // collect status bits
            waitse1      // wait for each clock done
    .rend2
    

    Turns out it only works 100% when the clock pin is unregistered and also SD clock is non-inverted (Which is used when I engage SD High-Speed interface). Both of these conditions have been true in the sdsd.cc driver up until now.
    I don't know exactly why the polarity should matter, since the SD card also changes which clock edge it clocks out on when changing speed mode.

    At any rate, the fix is to simply (Now that I know it doesn't reduce performance) add a padding instruction between WYPIN and TESTP:

            rep @.rend2, #6
            wypin   #1, p_clk    // one clock pulse
            nop        // timing padding, for high sysclock frequencies
            testp   p_dat   wc    // sample DAT0 before the clock pin transitions
            rcl timeout, #1    // collect status bits
            waitse1      // wait for each clock done, IN goes high 8 ticks + pulse period after WYPIN
    .rend2
    

    And same fix for the command-response's start-bit search loop.

  • evanhevanh Posts: 17,297
    edited 2026-06-08 06:22

    Bugger, I broke something at sdsd v1.12 too. I just discovered that all block read data gets corrupted when disabling the CRC check. That goes back to last year! :(
    I guess that's why unit tests are a thing. Oh well ...

    EDIT: Found the bug. Got a tad gung-ho by changing to looping on streamer complete rather than smartpin complete. It saved an instruction as well as terminating earlier.

    Buggy behaviour was: When the driver was configured for no CRC it had the streamer finishing at the end of the 512 bytes of data, ignoring the CRC following. But those CRC bits were still arriving, with the clock still cycling, as the driver had the streamer reading what it thought was the next block. Chaos ensued.
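
    To put numbers on that, here is a small C sketch of the clock budget for one block on the 4-bit bus, counted after the start bit. The names are hypothetical. It assumes the standard SD data-block framing as I understand it: 512 data bytes across four DAT lines, followed by a CRC16 sent on each line in parallel, then one end bit.

```c
#include <stdint.h>

#define BLOCK_BYTES   512u
#define BUS_WIDTH     4u    /* DAT0..DAT3 */
#define CRC16_CLOCKS  16u   /* one CRC16 per DAT line, sent in parallel */
#define END_CLOCKS    1u    /* end bit */

/* Clocks that carry the 512 bytes of payload. */
static uint32_t data_clocks(void)
{
    return BLOCK_BYTES * 8u / BUS_WIDTH;
}

/* Clocks the card still needs after the payload, CRC check or not. */
static uint32_t trailing_clocks(void)
{
    return CRC16_CLOCKS + END_CLOCKS;
}

/* Total clocks from first data nibble to end bit. */
static uint32_t block_clocks(void)
{
    return data_clocks() + trailing_clocks();
}
```

    So a receiver that stops after 1024 clocks of payload still has 17 clocks of CRC and end bit in flight, and anything that starts capturing again during those clocks picks up CRC bits as if they were data.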

  • evanhevanh Posts: 17,297
    edited 2026-06-08 11:17

    Man, I've not been fully following the framing spec on CRC status after block write either. I've, at some point, assumed a fixed number of clock cycles for turnaround between end of CRC and the CRC status coming back. All cards seem to have the same gap. However, the spec says it can be from 2 to 8 clocks. One more improvement to get right ...

    EDIT: Dealt to:

                mov     timeout, #0    // reused to hold the CRC status bits, and as return code
    
                rep     @.rend2, #13    // collect 6 bits for status, and Ncrc = 2..8 clocks (SD spec 4.12.5.5)
                wypin   #1, p_clk    // one clock pulse
                testb   timeout, #5  wz    // start-bit latch
                testbn  timeout, #4  andz    // start-bit latch
                testp   p_dat   wc    // sample DAT0 before the clock pin transitions
        if_nz   rcl     timeout, #1    // collect status bits
        if_nz   waitse1      // wait for each clock done, IN goes high 8 ticks + pulse period after WYPIN
    .rend2
        _ret_   wypin   #500, p_clk    // trailing clocks
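
    The latch logic in that loop can be modelled in C as below. This is a sketch of the bit handling only, with hypothetical names, and it assumes DAT0 is seen idle-high for at least one sample before the start bit. The status values in the usage note (010 for data accepted, 101 for CRC error) are from my reading of the SD spec rather than from the driver source.

```c
#include <stdint.h>

#define STATUS_LOOPS 13  /* same count as the REP in the loop */

/* Feed DAT0 samples (one per SD clock, values 0 or 1) through the latch.
   Shifting stops once bit 5 holds an idle 1 and bit 4 holds the start bit,
   which leaves the 3 status bits in bits 3..1 and the end bit in bit 0. */
static uint32_t collect_crc_status(const uint8_t *dat0, int nsamples)
{
    uint32_t status = 0;
    for (int i = 0; i < STATUS_LOOPS && i < nsamples; i++) {
        /* models TESTB #5 WZ followed by TESTBN #4 ANDZ */
        int latched = ((status >> 5) & 1u) && !((status >> 4) & 1u);
        if (!latched)
            status = (status << 1) | (dat0[i] & 1u);  /* models RCL */
    }
    return status;
}

/* Extract the 3-bit status field (an assumption about how a caller
   would decode the collected bits). */
static uint32_t crc_status_field(uint32_t status)
{
    return (status >> 1) & 7u;
}
```

    With a 2-clock gap the latch engages after seven shifts and the remaining iterations leave the value alone; with the full 8-clock gap all 13 shifts are used. In both cases an accepted block decodes to a status field of 0b010.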
    
  • evanhevanh Posts: 17,297

    @rogloh said:
    That's pretty fast. What speed grade was your card? I recall I also once hit speeds like that earlier on in this same thread when I was testing things out https://forums.parallax.com/discussion/comment/1545820/#Comment_1545820
    At these speeds you can fill or empty the P2 memory pretty rapidly. But sustaining it for long is rather unlikely because of the filesystem overheads. At this transfer rate, I'm interested to know what your driver can sustain for video streaming applications?

    To answer the question, with read CRC disabled, it hits a linear read speed over 70 MB/s. The SD card is Sandisk Extreme 64 GB (Year 2021).

       clkfreq = 360000000   clkmode = 0x10011fb
      Compiled with FlexC v7.6.5-beta-v7.6.4-2-gceea0901
    SD card driver (4-bit SD mode) v1.16
     sdsd_open: using pins: 21 20 16 23 22
     Card detected ... power cycle of SD card
      power-down threshold = 37   pin state = 1
      power-down slope = 25709 us   pin state = 0
     SD clock-divider set to sysclock/900 (400 kHz)
     Card idle OK
     OCR register c0ff8000  - SDHC/SDXC Card
     Data Transfer Mode entered - Published RCA aaaa0000
     CID decode:  ManID=03   OEMID=SD  Name=SN64G
      Ver=8.0   Serial=8ab989e1   Date=2021-2
     4-bit data interface engaged
      Speed Class = C10  UHS Grade = U3
      Video Class = V30  App Class = A2
      Cache = 1  Queue = 31
      TRIM = 1  FULE = 1
     High-Speed access mode engaged
      Card User Capacity = 60906 MB (59.48 GB)
     SD clock-divider set to sysclock/4 (90.0 MHz)
    --------------||||||||--
      **PASS**  rxlag=6 selected,  high=9  low=2
    SD Card Init Successful
     BLOCK_READ_CRC 2 0 0  CLKDIV  SD clock-divider set to sysclock/2 (180.0 MHz)
    --------------|||-------
      **PASS**  rxlag=8 selected,  high=9  low=7
     cluster size = 32768 bytes
    
     Buffer = 2 kB,  Written 2048 kB at 32126 kB/s,  Verified,  Read 2048 kB at 56143 kB/s
     Buffer = 2 kB,  Written 2048 kB at 33422 kB/s,  Verified,  Read 2048 kB at 56160 kB/s
     Buffer = 2 kB,  Written 2048 kB at 32931 kB/s,  Verified,  Read 2048 kB at 56129 kB/s
    
     Buffer = 4 kB,  Written 2048 kB at 25446 kB/s,  Verified,  Read 2048 kB at 61974 kB/s
     Buffer = 4 kB,  Written 2048 kB at 29289 kB/s,  Verified,  Read 2048 kB at 62220 kB/s
     Buffer = 4 kB,  Written 2048 kB at 29378 kB/s,  Verified,  Read 2048 kB at 62207 kB/s
    
     Buffer = 8 kB,  Written 4096 kB at 28379 kB/s,  Verified,  Read 4096 kB at 68382 kB/s
     Buffer = 8 kB,  Written 4096 kB at 30253 kB/s,  Verified,  Read 4096 kB at 68458 kB/s
     Buffer = 8 kB,  Written 4096 kB at 28555 kB/s,  Verified,  Read 4096 kB at 68340 kB/s
    
     Buffer = 16 kB,  Written 4096 kB at 33741 kB/s,  Verified,  Read 4096 kB at 72898 kB/s
     Buffer = 16 kB,  Written 4096 kB at 36349 kB/s,  Verified,  Read 4096 kB at 74315 kB/s
     Buffer = 16 kB,  Written 4096 kB at 36036 kB/s,  Verified,  Read 4096 kB at 74322 kB/s
    
     Buffer = 32 kB,  Written 8192 kB at 35987 kB/s,  Verified,  Read 8192 kB at 76107 kB/s
     Buffer = 32 kB,  Written 8192 kB at 36516 kB/s,  Verified,  Read 8192 kB at 76141 kB/s
     Buffer = 32 kB,  Written 8192 kB at 33464 kB/s,  Verified,  Read 8192 kB at 76024 kB/s
    
     Buffer = 64 kB,  Written 16384 kB at 40020 kB/s,  Verified,  Read 16384 kB at 77458 kB/s
     Buffer = 64 kB,  Written 16384 kB at 38843 kB/s,  Verified,  Read 16384 kB at 77455 kB/s
     Buffer = 64 kB,  Written 16384 kB at 37844 kB/s,  Verified,  Read 16384 kB at 77386 kB/s
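
    To relate those figures to the bus limit, the C sketch below derives the SD clock from the sysclock divider and compares a measured read rate with the raw 4-bit bus rate. The function names are hypothetical, and it assumes the log's kB means 1000 bytes, which the log itself does not state.

```c
#include <stdint.h>

/* SD clock frequency for a given sysclock divider. */
static uint32_t sd_clock_hz(uint32_t sysclock_hz, uint32_t divider)
{
    return sysclock_hz / divider;
}

/* Raw 4-bit bus rate in bytes per second: 4 bits move on each SD clock. */
static uint32_t raw_bytes_per_sec(uint32_t sd_hz)
{
    return sd_hz / 2u;
}

/* Measured rate as a percentage of the raw bus rate, rounded down.
   Assumes 1 kB = 1000 bytes. */
static uint32_t bus_utilisation_pct(uint32_t measured_kB_s, uint32_t sd_hz)
{
    uint64_t measured = (uint64_t)measured_kB_s * 1000u;
    return (uint32_t)(measured * 100u / raw_bytes_per_sec(sd_hz));
}
```

    At 360 MHz the dividers in the log come out as 400 kHz, 90 MHz and 180 MHz. The raw rate at sysclock/2 is 90 MB/s, so the best 64 kB buffer read of 77458 kB/s is roughly 86% of what the bus can carry under that kB assumption.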
    