New SD mode P2 accessory board - Page 37 — Parallax Forums

New SD mode P2 accessory board


Comments

  • rogloh Posts: 6,411

    @evanh said:
    PPS: Regarding those block read/write routines, they partly depend on registers being loaded along with the code. If Fcache is shifted to LUTRAM you'd have to move such data into locals instead.

    Yeah I know all about it, that's the tricky part. But we have 32 registers to play with and there are only up to 9 parameters I think. We can get them into C register variables using high level assignments from the source structure.

    Also I've been making this multi-instance so your globals are no longer present. Even though I am only exposing the SD mode driver host interface as a single slot initially, I see no reason to fix it to just one interface.

  • evanh Posts: 17,292

    I don't think there are any true globals in that Flex driver. Things like the pin numbers are contained to the instance by FlexC's C++ struct __using() wrapper. Only a DAT section would be global.

    Also, in case you missed my previous edit, those register constants could, instead of creating locals, be placed in with the presets that are fast copied into the RES section.

  • evanh Posts: 17,292

    I've updated the Object Exchange with v1.14. The zip file now includes four examples and improved documentation - https://obex.parallax.com/obex/sdsd-cc/

  • evanh Posts: 17,292

    I've got a working driver with a small optimisation for size, making for a saving of 148 bytes. It only touches the command-response start-bit search. Nothing else. It clocks and tests one bit at a time instead of relying on a reconfigured clock, an approach borrowed from a similar solution I used for CRC response collection in the block write routine. The code is tidier due to no longer dealing with variable lag effects.

    It doesn't affect block read/write performance at all but when comparing command-only performance the existing v1.14 solution is 13% faster than this one.

    Anyone got objections to locking in this one?

  • evanh Posts: 17,292

    I've also removed the redundant CMD6 SWITCH_FUNC switchover to High-Speed interface. Binary file size reduced by another 128 bytes.

    I had left it in early on in the hope of it helping with enabling other features, which it never did, and then kind of forgot about it.

  • evanh Posts: 17,292
    edited 2026-05-20 22:02

    @evanh said:
    ... It is just dealing with command response start bit search. Nothing else. ...

    It looks like I can go further with this approach. It seems I'm also able to remove all rx lead-ins and their associated rate settings, replacing them with a simpler WAITX + XINIT located after the coupled DIRH + WYPIN that starts the SD clock.

    I had thought I'd covered this ground thoroughly very early on but I must have overlooked something there. Best guess is that, having found I had no choice when it came to the tx side of things, I'd assumed rx wasn't going to be any easier and just duplicated the tx approach.

    PS: To be able to transmit at any divider rate, on demand, from sysclock/2 and up, it needs exact sysclock tick adjustability for timing alignment between streamer and smartpin. And there is a bad overlap with the SD clock start in the small dividers without using a spacing lead-in to precisely push out the streamer timing to match the smartpin. Here's an example of the required tx sequence:

            setxfrq lnco    // sysclock/1 for lead-in timing
            dirl    p_clk    // reset clock-gen smartpin to align with streamer
            ...
            xinit   m_calign, #0    // lead-in delay at sysclock/1
            setq    v_nco    // data transfer rate, takes effect on XZERO below (buffered command)
            xzero   m_cmd, pb    // place first 32 bits in tx buffer, aligned to clock via lead-in delay
            dirh    p_clk    // clock timing starts here, first clock pulse occurs in smartpin's second period
            wypin   #6*8+2, p_clk    // SD clock pulses, 6 bytes + 2 pulses to kick off a response
            ...
            xzero   m_cmd, pb     // place remaining 16 bits in buffer for when first 32 bits completes
    

    And the rx routine, once the start bit is found, does basically the same thing, but with a measured delay for the frequency-dependent rx latencies added to that lead-in.
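
    As a quick check of the `#6*8+2` pulse count and the 32 + 16 bit split in the tx sequence above, here is a small C sketch of the arithmetic. It is only an illustration, not code from the driver: both function names are hypothetical.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* SD clock pulses to request for a command: 8 per byte plus 2 extra
       pulses to kick off the response, matching `wypin #6*8+2` above. */
    static uint32_t sd_cmd_clock_pulses(uint32_t cmd_bytes)
    {
        assert(cmd_bytes > 0);
        return cmd_bytes * 8u + 2u;
    }

    /* Hypothetical helper: bits left over for the second XZERO once the
       first one has taken a full 32-bit word. */
    static uint32_t sd_cmd_remaining_bits(uint32_t cmd_bytes)
    {
        uint32_t total_bits = cmd_bytes * 8u;
        return total_bits > 32u ? total_bits - 32u : 0u;
    }
    ```

    For a 6-byte command that gives 50 clock pulses, with 16 bits left for the second XZERO.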

    PPS: Now it looks like, for rx routine, I can actually get away with the following:

            dirl    p_clk    // reset clock-gen smartpin to align with streamer
            ...
            dirh    p_clk    // clock timing starts here
            wypin   clocks, p_clk    // first clock pulse outputs during second clock period
            waitx   rxlag
            xinit   m_resp, #0     // rx response, aligned to SD clock
            ...
    

    No need to set up the smartpin rate nor even the streamer rate, as they are both unchanged from the command issuing.

  • evanh Posts: 17,292
    edited 2026-06-04 14:52

    I've found, on the Eval board at least, that I can get higher reliability by using a registered clock pin. I found the extra reliability by accident while going insane with the SD card clock frequency, hitting 180 MHz (sysclock/2 clock divider)!

    The accident happened because I'd been banging my head against a new bug that was only appearing at very high clock rates, and the bug wasn't where I was expecting. Funnily, it turned out to be in the recently rewritten command response start-bit search. I'd used an edge event trigger in a place where I'd previously proven that only a TESTP instruction would do. The latency difference mattered.

    I've tested both with the new and old methods of start-bit search and, with the bug squashed, both behave the same. So now I wonder if maybe I should leave clock registration switched on permanently ... I should test more boards first ....

  • rogloh Posts: 6,411

    That's pretty fast. What speed grade was your card? I recall I also once hit speeds like that earlier on in this same thread when I was testing things out https://forums.parallax.com/discussion/comment/1545820/#Comment_1545820
    At these speeds you can fill or empty the P2 memory pretty rapidly. But sustaining it for long is rather unlikely because of the filesystem overheads. At this transfer rate, I'm interested to know what your driver can sustain for video streaming applications.
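
    As a back-of-envelope check on how quickly memory fills at that clock rate, here is a small C sketch. It assumes a 4-bit SD data bus, a 360 MHz sysclock (implied by 180 MHz at sysclock/2) and the P2's 512 KB of hub RAM. The function names are made up for illustration, and command and filesystem overheads are ignored entirely.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Raw bus rate in bytes per second, ignoring all protocol overheads. */
    static uint64_t sd_raw_bytes_per_sec(uint64_t sysclock_hz, uint32_t divider,
                                         uint32_t bus_bits)
    {
        assert(divider >= 2);                    /* sysclock/2 is the fastest divider */
        assert(bus_bits == 1 || bus_bits == 4);  /* ASSUMPTION: 1-bit or 4-bit SD bus */
        uint64_t sd_clock_hz = sysclock_hz / divider;
        return sd_clock_hz * bus_bits / 8u;
    }

    /* Time in microseconds to move `bytes` at that raw rate (rounded down). */
    static uint64_t sd_raw_transfer_us(uint64_t bytes, uint64_t bytes_per_sec)
    {
        assert(bytes_per_sec > 0);
        return bytes * 1000000u / bytes_per_sec;
    }
    ```

    Under those assumptions the raw rate is 90 MB/s, so 512 KB would move in roughly 5.8 ms. Real sustained figures will be lower.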

  • evanh Posts: 17,292
    edited 2026-06-05 09:25

    Ah, I don't intend anyone to actually use that clock rate. Even though it worked, I doubt it would stay reliable across different setups, e.g. different wiring of the SD slot. I highlighted it only because it showed up a timing improvement that would presumably make the driver more robust.

    I need to do some testing ... I've been tidying up and reverting some of my experiments. I had started working on replacing all start-bit search code blocks with generally the same approach as above but realised it creates extra work in dealing with timeouts. The command-response timeout is based on a limited number of SD clock cycles rather than an actual duration.

    EDIT: I guess I could use branch-on-event to keep the inner loop small. Not sure it'll result in reduced code size though. The whole aim is to shrink the driver size.
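
    Because the command-response timeout is counted in SD clock cycles, its wall-clock duration shrinks as the clock rate rises. A minimal C sketch of that conversion follows. `sd_timeout_ns` is a hypothetical helper, not part of the driver, and the 64-clock figure used in the example is the Ncr limit quoted in the loop comments later in this thread.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Duration in nanoseconds (rounded down) of a timeout that is
       expressed as a number of SD clock cycles. */
    static uint64_t sd_timeout_ns(uint32_t timeout_clocks, uint64_t sd_clock_hz)
    {
        assert(sd_clock_hz > 0);
        return (uint64_t)timeout_clocks * 1000000000u / sd_clock_hz;
    }
    ```

    For example, 64 clocks last 2560 ns at a 25 MHz SD clock but only about 355 ns at 180 MHz.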

  • evanh Posts: 17,292
    edited 2026-06-07 12:58

    That's unexpected. I'll guess it's the smartpin IN behaviour, rather than the event hardware. I've just been using sysclock/2 to see how tight the new response start-bit search loop can go ... and it's a tad disappointing. It seems I can throw in up to 4 instructions between the WYPIN #1,pin and the associated WAITSE1 without any change in loop time.

    The pulse period is set to just 2 sysclock ticks, yet it seems to take 10 ticks to trigger its IN rise event. :(

    Here's the relevant code:

    // Z is initially clear
                rep     @.rend, #64    // max of 64 clocks for response to start, Ncr (SD spec 4.12.4)
        if_nz   wypin   #1, p_clk    // one clock pulse
        if_nz   testpn  p_cmd   wz    // Z set if start(S)-bit found, TESTP is required for low latency!
        if_nz   waitse1      // wait for each clock done
    .rend
    // Z clear if timed-out - S-bit not found
    

    So I may as well add an early exit branch in there. That way it won't loop all 64 times.

    EDIT: On the other hand, that does explain why it has been holding together so well at extreme clock settings. The pulse interval never gets shorter than sysclock/12. And I had been counting my blessings without going to the trouble of scoping it up till now.

    Maybe it's all for the better. It means I don't need exceptions for short timings. And this was all about reduced code size after all.
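
    To put numbers on that floor, here is a rough C sketch of the per-iteration cost of the search loop. The 12-tick minimum comes from the observation above (a 2-tick pulse plus about 10 ticks before the IN rise event). How the loop behaves at larger dividers was not measured here, so the "tracks the clock period" branch is purely an assumption, and the names are hypothetical.

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define MIN_LOOP_TICKS 12u  /* observed floor at sysclock/2 */

    /* Sysclock ticks per start-bit search iteration.
       ASSUMPTION: above the floor, the loop simply tracks the clock period. */
    static uint32_t search_loop_ticks(uint32_t divider)
    {
        assert(divider >= 2);
        return divider > MIN_LOOP_TICKS ? divider : MIN_LOOP_TICKS;
    }

    /* Worst-case sysclock ticks spent before the search gives up. */
    static uint32_t search_timeout_ticks(uint32_t divider, uint32_t max_clocks)
    {
        return search_loop_ticks(divider) * max_clocks;
    }
    ```

    With the 64-clock limit, a timed-out search at sysclock/2 would cost about 768 sysclock ticks under this model.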

  • evanh Posts: 17,292
    edited 2026-06-07 13:06

    Here's the updated loop. It loops just as fast with the extra JMP instruction.

                rep     @.rend, #64    // max of 64 clocks for response to start, Ncr (SD spec 4.12.4)
                wypin   #1, p_clk    // one clock pulse
                testp   p_cmd   wc    // C clear if start(S)-bit found, TESTP is required for low latency!
        if_nc   jmp     #.rend
                waitse1      // wait for each clock done
    .rend
    // C set if timed-out - S-bit not found
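
    For readers less used to REP loops, the control flow above can be modelled in C roughly as follows. This is a simplified sketch, not the driver code: `cmd_levels` is a hypothetical array standing in for the CMD pin level seen by TESTP on each clock, and all timing is ignored.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Returns true on timeout (mirrors "C set if timed-out"), false once a
       low level, the start bit, is seen on the sampled CMD line. */
    static bool start_bit_search(const uint8_t *cmd_levels, size_t n_levels,
                                 uint32_t max_clocks, uint32_t *clocks_used)
    {
        assert(cmd_levels != NULL && clocks_used != NULL);
        *clocks_used = 0;
        for (uint32_t i = 0; i < max_clocks && i < n_levels; i++) {
            *clocks_used = i + 1;        /* one clock pulse issued (WYPIN #1) */
            if (cmd_levels[i] == 0)      /* pin reads low: start bit found */
                return false;            /* early exit, like the IF_NC JMP */
        }
        return true;                     /* never saw a start bit */
    }
    ```

    A start bit on the fourth sample ends the search after four clocks instead of running all 64.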
    
  • evanh Posts: 17,292

    Huh, I had been lucky with the above loop too. The basic loop has been in use for gathering the CRC status response from the SD card after each block written.

            rep     @.rend2, #6
            wypin   #1, p_clk    // one clock pulse
            testp   p_dat   wc    // sample DAT0 before the clock pin transitions
            rcl     timeout, #1    // collect status bits
            waitse1      // wait for each clock done
    .rend2
    

    Turns out it only works 100% when the clock pin is unregistered and the SD clock is also non-inverted (which is used when I engage the SD High-Speed interface). Both of these conditions have been true in the sdsd.cc driver up until now.
    I don't know exactly why the polarity should matter, since the SD card also changes which clock edge it clocks out on when changing speed mode.

    At any rate, the fix is simply (Now that I know it doesn't reduce performance) to add a padding instruction between WYPIN and TESTP:

            rep     @.rend2, #6
            wypin   #1, p_clk    // one clock pulse
            nop    // timing padding, for high sysclock frequencies
            testp   p_dat   wc    // sample DAT0 before the clock pin transitions
            rcl     timeout, #1    // collect status bits
            waitse1      // wait for each clock done
    .rend2
    

    And same with the command-response's start-bit search loop.
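
    The bit collection in those loops can be pictured with a short C sketch. Each pass shifts the register left by one and drops the sampled pin level into bit 0, which is what RCL with a 1-bit shift does with the C flag. The sketch starts from zero and takes the samples from a hypothetical array; how the driver initialises `timeout` and interprets the six bits is not shown in this thread, so no decoding is attempted.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Shift samples in, first sample ending up in the highest bit:
       value = (value << 1) | bit on every clock. */
    static uint32_t collect_bits(const uint8_t *samples, size_t count)
    {
        assert(samples != NULL && count <= 32);
        uint32_t value = 0;              /* ASSUMPTION: register starts cleared */
        for (size_t i = 0; i < count; i++)
            value = (value << 1) | (samples[i] & 1u);
        return value;
    }
    ```

    Six samples of 0, 0, 1, 0, 1, 1 would leave the value 0b001011 in the register.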
