New SD mode P2 accessory board

evanh · 2024-06-17 01:26

Nothing major I'm afraid. Haven't been too focused. Dealing with winter colds.

Over the last few days I sat down, found I could concentrate again, and have been doing some clean-up and bug fixes. Most fixes are due to various earlier experiments with differing approaches that messed with state change tracking. The later example above is a good, tricky case.

PS: It's still just read-only at the moment. I'm afraid I've been experimenting far more than trying to complete anything.

evanh · 2024-06-17 23:14

Main additions of late are timeout checks. There are a lot of them in SD mode, and having the full spec is a requirement too. Most of their values are listed in section 4.12 of the full SD spec.

Ouch, the DAT0 busy programming timeout is rather huge at up to one second (host side allowance). Not very suited to retries. I'll have to work out if retries even make sense I guess.

rogloh · 2024-06-18 00:58

Yeah, with a timeout that large you shouldn't really expect it to happen often unless communication with the card has been lost in the middle of it. It probably makes sense to just give up and report a failed card write if it ever does.
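
A minimal sketch of that "give up and fail" approach, in Python for illustration only. The names `wait_while_busy`, `write_block`, `send_block` and `is_busy` are hypothetical and not taken from sdmm.cc; `is_busy` stands in for whatever samples DAT0. The one-second default is the host-side allowance mentioned above.

```python
import time

class CardTimeout(Exception):
    """Raised when the card stays busy past the allowed time."""

def wait_while_busy(is_busy, timeout_s=1.0, now=time.monotonic):
    """Poll is_busy() until it returns False or timeout_s elapses.

    is_busy is a hypothetical callable that samples DAT0 (True = still busy).
    now is injectable so the logic can be exercised with a fake clock.
    """
    deadline = now() + timeout_s
    while is_busy():
        if now() >= deadline:
            raise CardTimeout("DAT0 still busy after %.3f s" % timeout_s)

def write_block(send_block, is_busy, timeout_s=1.0):
    """Send one block, then wait out the programming busy period.

    No retry is attempted: a timeout is reported as a failed write.
    """
    send_block()
    try:
        wait_while_busy(is_busy, timeout_s)
    except CardTimeout:
        return False
    return True
```

Because a timeout this long most likely means the card has stopped responding, the sketch returns failure to the caller rather than looping on a retry.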

Rayman · 2024-06-18 18:12

A board like this would be great for Eval type boards when using the new code in flash RISCVP2 system. Especially Micropython...
Question is how hard would it be to adapt this to micropython once it's working....
Also, how to acquire one?

evanh · 2024-06-18 23:53

My intent is to repeat the exercise of improving the sdmm.cc driver file in Flexspin. Which should provide an instant upgrade for anything compiled with it.

The effort may be more involved than previously at the driver API level. The pin set is different, obviously, and I've already come to blows with the layer above not liking me attempting to add pins.

Also, there is a device number mechanism at the driver API level that appears to be unsupported or barely implemented. I'm guessing Eric probably ripped a big chunk of the code from elsewhere and this part never fully made it. Having the device numbering working would be good. Multiple SD cards at once could be handy.

Wuerfel_21 · 2024-06-18 23:58

@evanh said:
Also, there is a device number mechanism at the driver API level that appears to be unsupported or barely implemented. I'm guessing Eric probably ripped a big chunk of the code from elsewhere and this part never fully made it. Having the device numbering working would be good. Multiple SD cards at once could be handy.

As is, each VFS mount point gets its own filesystem instance, so device numbers are pointless. Opening multiple cards should(tm) actually work.

evanh · 2024-06-19 01:05

@Wuerfel_21 said:
As is, each VFS mount point gets its own filesystem instance, so device numbers are pointless. Opening multiple cards should(tm) actually work.

The pin number storage is just one set of globals at the top of sdmm.cc. I'm guessing those are outside of the VFS mechanism. Not that I know anything about VFS.

Wuerfel_21 · 2024-06-19 01:10

The FatFS is loaded as a spin-style object, so top level variables are actually instance variables.

evanh · 2024-06-19 01:20

Oh, okay, so drv number is effectively deprecated. So all those checks and built-in messages are unneeded bloat then.

rogloh · 2024-06-19 02:16

@Rayman said:
A board like this would be great for Eval type boards when using the new code in flash RISCVP2 system. Especially Micropython...
Question is how hard would it be to adapt this to micropython once it's working....
Also, how to acquire one?

@Rayman, until SD mode is working, Parallax do also sell an SD card breakout (SPI mode only). If you are really keen to play around with SD mode with Micropython by eventually porting Evan's code, you might be able to beg @Tubular for our last board of the 3 originally made (I have one and so does evanh). I still have the board design here so can always make more down the track for our local use in Melbourne, although right now I don't intend to distribute more widely. Otherwise the design is presented on the first page and is easily replicated by PCB designers such as yourself. My layout and part selection is known to work at very high speeds on my P2 Eval, though admittedly that's probably more of a fluke.

The card reader required was specific to my board layout and so might be harder to come by, but is still available to us locally at Altronics if needed. I believe it was one of these OUPIIN branded ones. It's a decent part as it includes the HW card detection switch that works nicely with my circuit, and it has a push-in push-out mechanical action. https://www.altronics.com.au/p/p5717-oupiin-surface-mount-micro-sd-slot-type-memory-card-socket/

LEDs were specially reverse mounted using these types of LED https://forums.parallax.com/discussion/comment/1545088/#Comment_1545088

Other than the UM6J1N dual PFET I just used some jellybean passives and resnets. This dual PFET was something I just had handy in a parts bin. It is probably a little underspec'd Rds-wise, but it seems to work and can still sustain at least 400mA to the card. You could choose something else; some comments elsewhere in this thread covered just that and suggested an NTJD4152P.

evanh · 2024-06-20 01:53

Grr, typical, the correction to my earlier assumption was still making an assumption. The correction was basically flat wrong.

It seems that there is a minimum deployment latency when issuing a WYPIN, for the P_TRANSITION smartpin mode at least. The discrepancy only impacts the timing when using the fastest WXPIN #1 rate. But of course that's also the most sensitive rate to off-by-one timing errors.

        xinit   mleadin, #0    // lead-in delay from here at sysclock/1
        setq    mnco        // streamer transfer rate (takes effect with buffered command below)
        xzero   mdat, #0     // rx buffered-op, aligned with clock via lead-in
        dirh    #PIN_CLK    // clock timing starts here
        wypin   clocks, #PIN_CLK    // first pulse outputs during second clock period

I had been calculating the lead-in delay with a straightforward quantity of divider + offset. It relied on the clock smartpin being predictable as to when the first clock pulse is produced. It worked perfectly for pulse mode and a divider/2 version worked well for transition mode too ... except when the transition period is 1. Then it needs a +1 added to the offset.

Well, there are certainly a lot of cases to handle when making the clock divider adjustable. I guess the reason this particular one hadn't raised its head until now is because I've previously favoured using P_PULSE mode instead.

rogloh · 2024-06-20 02:04

@evanh said:
Grr, typical, the correction to my earlier assumption was still making an assumption. The correction was basically flat wrong.

It seems that there is a minimum deployment latency when issuing a WYPIN, for the P_TRANSITION smartpin mode at least. The discrepancy only impacts the timing when using the fastest WXPIN #1 rate. But of course that's also the most sensitive rate to off-by-one timing errors.

Ugh. This brings back the memory horrors of dealing with this sort of stuff in my HyperRAM driver with the various clock option combinations and dividers using transition mode. There were lots of things to test and mess with like this until you could get it right for all cases. I think I resorted to patching code paths/instructions for different cases, although I was very constrained in Cog+LUT RAM use.

evanh · 2024-06-20 02:45

Roger,
I now think I can see how you've got away without using any explicit lead-in mechanism in your existing drivers. Using P_TRANSITION, as long as only sysclock/2 and sysclock/4 are desired, this then allows a single hardcoded offset which can then be embedded within instruction order.

PS: In the case of HyperRAM's DDR operation it would apply to sysclock/1 and sysclock/2.
PPS: Ah, but WXPIN #2 still needs the reset of the clock smartpin to ensure consistent phase alignment.

rogloh · 2024-06-20 03:29

@evanh said:
Roger,
I now think I can see how you've got away without using any explicit lead-in mechanism in your existing drivers. Using P_TRANSITION, as long as only sysclock/2 and sysclock/4 are desired, this then allows a single hardcoded offset which can then be embedded within instruction order.

PS: In the case of HyperRAM's DDR operation it would apply to sysclock/1 and sysclock/2.
PPS: Ah, but WXPIN #2 still needs the reset of the clock smartpin to ensure consistent phase alignment.

Lots of hoops were jumped through, very messy stuff.

evanh · 2024-06-20 03:39

@rogloh said:
Lots of hoops were jumped through, very messy stuff.

Yes. It all comes down to those smartpin modes internally cycling the moment that DIR rises. They don't wait for their Y register to be set.

evanh · 2024-06-20 04:05

Oh, I forgot about accounting for the adding of rxlag. That's still needed at higher frequencies. In my code I incorporate it into the lead-in offset. In your code it is a separate WAITX. End result is the difference in performance and bloat is near zero.

evanh · 2024-06-21 03:26

@evanh said:
It seems that there is a minimum deployment latency when issuing a WYPIN, for the P_TRANSITION smartpin mode at least. The discrepancy only impacts the timing when using the fastest WXPIN #1 rate.
      xinit   mleadin, #0    // lead-in delay from here at sysclock/1
      setq    mnco        // streamer transfer rate (takes effect with buffered command below)
      xzero   mdat, #0     // rx buffered-op, aligned with clock via lead-in
      dirh    #PIN_CLK    // clock timing starts here
      wypin   clocks, #PIN_CLK    // first pulse outputs during second clock period

It is making sense to me now. The very reason I place DIRH and WYPIN together as a pair is to minimise the impact of instruction execution. It has just sunk in that what is special with the P_TRANSITION mode is that I've been halving the divider-derived delay for the lead-in delay calculation. This means that at sysclock/2 the calculated divider delay value is only 1 tick. The problem with it being 1 is that it is less than the instruction execution time of the WYPIN. It then needs an extra tick to fit.

So there is a latency exception. It's just not in the WYPIN internals like I first guessed. It's rather because WYPIN is 2 ticks after DIRH.
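
To make the arithmetic concrete, here is a rough model of that lead-in calculation in Python. It is only a sketch of the reasoning in this thread, not the driver's code: `lead_in_ticks` and its `offset` parameter are hypothetical names, and the real offset (which in my code also absorbs rxlag) depends on instruction order.

```python
# From the thread: WYPIN executes 2 sysclock ticks after DIRH.
WYPIN_AFTER_DIRH_TICKS = 2

def lead_in_ticks(divider, transition_mode, offset):
    """Return the streamer lead-in delay in sysclock ticks.

    divider         -- sysclock ticks per full SD clock period
    transition_mode -- True for P_TRANSITION, False for P_PULSE
    offset          -- hypothetical fixed offset (instruction order, rxlag, ...)
    """
    if divider < 2:
        raise ValueError("divider must be at least 2")
    if transition_mode:
        if divider % 2:
            raise ValueError("transition mode needs an even divider")
        # Transition mode is programmed with half the clock period.
        delay = divider // 2
        # A 1-tick delay cannot fit inside the DIRH-to-WYPIN gap,
        # so it is stretched by the extra tick described above.
        delay = max(delay, WYPIN_AFTER_DIRH_TICKS)
    else:
        delay = divider
    return delay + offset
```

In this model sysclock/2 and sysclock/4 in transition mode end up with the same lead-in, since `lead_in_ticks(2, True, 0)` and `lead_in_ticks(4, True, 0)` both give 2.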

rogloh · 2024-06-21 04:14

The main problem with all these very specific timing details that go undocumented is that they are elusive until experimentally verified and also so easily forgotten afterwards unless you work with them on an ongoing basis.

evanh · 2024-06-22 11:59

That's interesting. I've got one SD card (Kingston SDCS2/64GB) that isn't happy with me cancelling a multi-block read on the first block. I don't think doing so breaks any rule. It's causing a data block timeout on the subsequent repeat attempt. It's odd in that all the responses look fine on the scope, and further commands after the timeout are just fine. If I cancel it on the second block, or enable diagnostic reporting (i.e. add delays), then no timeout occurs.

I guess I need to check over the timings of the command/data overlap on CMD12 use. It's not something I've paid close attention to yet.

rogloh · 2024-06-23 00:13

Given how relatively poorly the SD spec was written, and how all this proprietary stuff from SanDisk somehow evolved into the de facto standard, it's probably not all that surprising that different vendors still do different things. Although it's even worse when different cards from the same vendor (especially if it's SanDisk) do different things.

evanh · 2024-06-23 00:59

On that note, it felt like SPI mode had just one controller model covering all brands other than SanDisk. SD mode is not giving me that feeling. They're all different now. It shows up notably in latencies.

evanh · 2024-06-23 03:36

Speaking of which, it's 840 us from end of a CMD18 (READ_MULTIPLE_BLOCK) response to start-bit of the Kingston card's first data block. That's the window where a CMD12 can trigger the later timeout on next attempt.

But even weirder is this only occurs at high clock rates. It doesn't happen at 48 MHz SD clock, but does happen at 50 MHz and above. Maybe I should try engaging HS mode ...

evanh · 2024-06-25 10:27

Ya, that does indeed make it happy.
Hmm, it could be explained through the power limit. I've been reading the details of the CMD6 data response and note that High Speed Access Mode is allowed 200 mA compared with 100 mA for Default Speed. If the card has detected it's drawing too much then maybe it backs off in the form of cancelling the operation it was doing. Or maybe that's just how that model of card reacts.

So I guess there is good reason to enable High Speed mode after all.

rogloh · 2024-06-25 16:13

Or potentially that dual pFET I chose might be limiting the output voltage due to Rds losses and this is somehow triggering a cancellation? Perhaps this is more likely to happen on writes though, which probably draw more power than reads. You could possibly probe the card's voltage to see if it coincides with the cancellation?

I've not seen problems in that area myself to date but that's not to say there won't be some if the voltage falls out of spec and you are now doing more testing than I did.

evanh · 2024-06-26 06:44

No indication of any brown-out condition - which would nominally be a full card reset I'd say. The card is maintaining state except for that one CMD18.
The power limit thing will be a self-limit. The internal software decides to back off. On that note, there is indication I'm seeing shorter data latencies now in HS mode.

evanh · 2024-06-26 12:12

@evanh said:
... On that note, there is indication I'm seeing shorter data latencies now in HS mode.

Well, the Kingston card is the only one of the five SD cards I'm playing with that is clearly exhibiting a latency improvement in HS mode. I did get something weird from one of the two SanDisks as well, but I'll need to develop the testing further before making any definitive statements I think. It's only 8 consecutive blocks at this stage.

evanh · 2024-06-28 03:19

Ha, that's interesting. When I try to push above about 40 MB/s the SD cards look like they're introducing extra latencies. The overheads rise from about 7 % of total time to more like 47 % when doubling the clock rate. Not that I've tried to study it in detail.

evanh · 2024-06-29 03:31

Preview time

   clkfreq = 380000000   clkmode = 0x10012fb
Card detected ... power cycle of SD card
 400 kHz SD clock divider = 950
  CMD8 R7 0000015a - Valid v2.0+ SD Card
  ACMD41 OCR c0ff8000 - Valid SDHC/SDXC Card
  CMD2/CMD3 - Data Transfer Mode entered - Published RCA 59b40000
  ACMD6 - 4-bit data interface engaged
    Default-Speed access mode
  CMD9 - CSD backed-up
  CMD10 - CID backed-up
   ManID: 1B   OEMID: SM   Name: ED2S5
   Ver: 3.0   Serial: 49C16906   Date: 2023-2
  Full Speed clock divider = 4 (95.0 MHz)
  rxlag=8 selected  Lowest=6 Highest=9
Card init successful
 Double check the calibration:
   rate = 0.4 MiB/s   duration = 1161 us
   rate = 2.0 MiB/s   duration = 241 us
   rate = 2.0 MiB/s   duration = 243 us
   rate = 3.8 MiB/s   duration = 253 us
32768 blocks = 16384 kiB   rate = 42.2 MiB/s   duration = 379115 us   zero-overhead = 353204 us   overheads = 6.8 %
16384 blocks = 8192 kiB   rate = 42.1 MiB/s   duration = 189849 us   zero-overhead = 176602 us   overheads = 6.9 %
8192 blocks = 4096 kiB   rate = 42.0 MiB/s   duration = 95214 us   zero-overhead = 88301 us   overheads = 7.2 %
4096 blocks = 2048 kiB   rate = 41.7 MiB/s   duration = 47898 us   zero-overhead = 44150 us   overheads = 7.8 %
2048 blocks = 1024 kiB   rate = 41.2 MiB/s   duration = 24238 us   zero-overhead = 22075 us   overheads = 8.9 %
1024 blocks = 512 kiB   rate = 40.6 MiB/s   duration = 12291 us   zero-overhead = 11037 us   overheads = 10.2 %
512 blocks = 256 kiB   rate = 39.2 MiB/s   duration = 6376 us   zero-overhead = 5518 us   overheads = 13.4 %
256 blocks = 128 kiB   rate = 36.5 MiB/s   duration = 3419 us   zero-overhead = 2759 us   overheads = 19.3 %
128 blocks = 64 kiB   rate = 32.1 MiB/s   duration = 1941 us   zero-overhead = 1379 us   overheads = 28.9 %
64 blocks = 32 kiB   rate = 26.0 MiB/s   duration = 1201 us   zero-overhead = 689 us   overheads = 42.6 %
32 blocks = 16 kiB   rate = 18.8 MiB/s   duration = 831 us   zero-overhead = 344 us   overheads = 58.6 %
16 blocks = 8 kiB   rate = 15.9 MiB/s   duration = 489 us   zero-overhead = 172 us   overheads = 64.8 %
8 blocks = 4 kiB   rate = 12.1 MiB/s   duration = 322 us   zero-overhead = 86 us   overheads = 73.2 %
4 blocks = 2 kiB   rate = 7.0 MiB/s   duration = 276 us   zero-overhead = 43 us   overheads = 84.4 %
2 blocks = 1 kiB   rate = 3.8 MiB/s   duration = 253 us   zero-overhead = 21 us   overheads = 91.6 %
 400 kHz SD clock divider = 950
All finished  :-)
Card detected ... power cycle of SD card

PS: These are device level consecutive block reads. There is no filesystem in testing so far.
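
For anyone wanting to cross-check the log, the columns appear consistent with the arithmetic below. This is inferred from the printed numbers rather than taken from the test program: the zero-overhead figure matches raw payload only (512 bytes per block over a 4-bit bus, no start/stop bits or CRC) at the 95 MHz SD clock. The function names are hypothetical.

```python
BLOCK_BYTES = 512  # SD block size

def clock_divider(sysclock_hz, sd_clock_hz):
    """Integer sysclock divider for a target SD clock."""
    return sysclock_hz // sd_clock_hz

def read_stats(blocks, duration_us, sd_clock_hz, bus_bits=4):
    """Return (zero_overhead_us, rate_mib_s, overhead_pct) for one read.

    zero_overhead_us counts only payload bits on the data bus; this is an
    assumption that happens to match the log, not a documented definition.
    """
    total_bytes = blocks * BLOCK_BYTES
    clocks = total_bytes * 8 / bus_bits          # SD clocks for the payload
    zero_overhead_us = clocks / sd_clock_hz * 1e6
    rate_mib_s = (total_bytes / 2**20) / (duration_us / 1e6)
    overhead_pct = 100.0 * (duration_us - zero_overhead_us) / duration_us
    return zero_overhead_us, rate_mib_s, overhead_pct
```

Fed the first line of the table, `read_stats(32768, 379115, 95_000_000)` gives roughly 353204 us, 42.2 MiB/s and 6.8 %, and `clock_divider(380_000_000, 400_000)` gives the 950 shown for the 400 kHz clock. The smallest transfers may differ in the last digit, presumably from rounding in the original program.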

rogloh · 2024-06-29 07:01

When you say device level consecutive, do you mean individual sector reads of consecutive sectors, or streaming block reads where the card continues to send out its sector data consecutively and automatically?

evanh · 2024-06-29 09:18

One CMD18 per line.
