P2 SD Drivers - COG PASM Version v.223

Cluso99 Posts: 16,910
edited 2020-05-28 - 09:10:44 in Propeller 2
Cluso's P2 SD Drivers

Update v.223
Increased performance and now works from 10MHz-370MHz (my P2EVAL fails serial at 380MHz). Code attached.

I expect to release a number of versions, all with different advantages and disadvantages. They will be released on this thread, with the latest releases in this post.
They are all based on my SD Driver present in the P2 ROM for booting from an SD Card.

While I have released various test versions in the past, all new releases will be available from here.

Cluso's P2 SD Driver for COG v.222

This is the first (test) release of my COG version of my SD Drivers. There are two files in the zip file...

* SD_COG.spin2 is just the driver. For now it needs your own code to drive it, and it can be compiled together with your code. There is very little space left in the cog RAM, though you can relocate the driver to LUT, except for the 16 registers it requires, which can be located anywhere in COG - just change the _regs equate.
I intend to release a mailbox version later so that this driver could reside in its own COG and communicate via a 4 long hub mailbox.
This driver has been made for speed, with reads and writes being unrolled, hence its big footprint. Each send/receive bit takes 4 instructions for 8 clocks per bit, which averages out to about 9 clocks per bit over the 512+2 bytes. My tests have been done at 200MHz - I haven't checked other clock speeds yet. Works from 40MHz-280MHz (290MHz fails, didn't try below 40MHz).
Timing tests have been commented out. You can re-enable them or use SD_COG_test.spin2 to do any timing yourself.
If you look at the readFast sections you will see TESTB instructions commented out. This is an alternative method I tried and should shift the sample window one clock earlier (because TESTB has an extra clock delay between the sample it sees and the instruction starting versus TESTP). If you want to try it, comment out TESTP and uncomment TESTB for each pair.
I intend to add block mode later.

Note that this code will only work on SDHC cards and above which use block mode for sector addresses rather than the older original byte mode. Cards 4GB and above are likely to be OK.

* SD_COG_test.spin2 is the above driver with a test harness built using the ROM Serial/Monitor calls. It has timing placed for each function call so you can tell how fast your SD Card is.

Note: There is a bug in my _readFastLong routine so I am performing 4 byte reads for this instead. It's only when reading the command response so it doesn't really matter. I wanted to get this out hence the workaround.

Please report any bugs or problems here. Enjoy :sunglasses:

Notes: There is a significant lag in SD Cards while preparing to read or write a sector. I have observed lags on my SanDisk Ultra 16GB microSD SDHC I C10 of 4ms, 2.7ms and 1.7ms before the read or write can proceed (ie busy). I am achieving (at 200MHz) 512+2 byte reads of 37071 clocks = 185us or 2.77MB/s and 512+2 byte writes of 40145 clocks = 200us or 2.56MB/s
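
As a rough check of the figures above, the clock counts can be converted to time, throughput and clocks per bit. This is only a sketch: the function name `sector_stats` is made up for illustration, and it assumes a 200MHz sysclock and 512+2 = 514 bytes per sector transfer, as quoted in the post.

```python
def sector_stats(clocks, sysclk_hz=200_000_000, nbytes=512 + 2):
    """Return (microseconds, MB/s, clocks per bit) for one sector transfer."""
    seconds = clocks / sysclk_hz
    micros = seconds * 1e6
    mb_per_s = nbytes / seconds / 1e6        # MB = 1,000,000 bytes
    clocks_per_bit = clocks / (nbytes * 8)
    return micros, mb_per_s, clocks_per_bit

# Clock counts taken from the post (v.222 at 200MHz).
for name, clocks in (("read", 37071), ("write", 40145)):
    us, mbs, cpb = sector_stats(clocks)
    print(f"{name}: {us:.1f} us, {mbs:.2f} MB/s, {cpb:.2f} clocks/bit")
```

The read works out to roughly 9 clocks per bit and the write to a little under 10, i.e. slightly more than the 8 clocks per bit of the unrolled inner code.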

Comments

  • Cluso99 Posts: 16,910
    edited 2020-05-28 - 09:08:17
    FAT32 Drivers

    Attached is the first working version of Kye's FAT32 Driver that I converted to run under P2 SPIN2.
    I haven't included all the re-direction objects that make this run, but I thought I would post it here in case anyone was interested.
    Note there is a problem that the code uses the P1 ROM space to write to when skipping bytes. This will need to be fixed.
    I have used this in my P2 OS here
    forums.parallax.com/discussion/171600/propeller-2-os-its-alive
  • Reserved
  • Cluso99 Posts: 16,910
    edited 2020-05-28 - 02:12:54
    SD_COG.spin2
    Using the streamer has shaved
    read sector (512+2bytes) 1151 clocks =5us @200MHz
    write sector(512+2bytes) 6255 clocks = 31us @200MHz

    Hope I've used the streamer correctly - the data looks fine.
    Here's what I did for read sector and write to hub
                    WRFAST  #0,               PTRB           ' use streamer /fifo
                    rep     @.reprd,          sdx_bytecnt    ' read 512 bytes (or 16 bytes if CSD/CID)
    .....
                    wfbyte  sdx_dataout                      ' save byte to streamer/fifo
    
    and here's what I did for read from hub and write sector
                    RDFAST  #0,               PTRB           ' use streamer /fifo
                    rep     @.repwr,          ##512          ' write 512 bytes
                    rfbyte  sdx_dataout                      ' get byte from streamer/fifo
    .....
    

    Just love some of these P2 features :sunglasses:
  • Cluso99 wrote: »

    Just love some of these P2 features :sunglasses:

    Yes once you can get the clock timing and streamer aligned it is a great combination.
  • Cluso99 Posts: 16,910
    edited 2020-05-28 - 04:35:41
    SD_COG.spin2

    Just tweaked the code to re-insert a couple of timing NOPs at strategic points where I had over-optimised the code. While I've lost a few clocks, I've extended the range to 10MHz-370MHz. My board fails serial at 380MHz so there's no point in trying to get any higher.
  • evanh Posts: 10,086
    edited 2020-05-28 - 04:46:12
    Ah, that's the FIFO, not the streamer, you're using. Streamer is commanded with XINIT/XCONT instructions.

    With the streamer the bit-bashing code is eliminated. And can go to sysclock/2 for single-data-rate (SDR) clocking and sysclock/1 for double-data-rate (DDR) clocking.

  • Cluso99 Posts: 16,910
    edited 2020-05-28 - 08:14:44
    evanh wrote: »
    Ah, that's the FIFO, not the streamer, you're using. Streamer is commanded with XINIT/XCONT instructions.

    With the streamer the bit-bashing code is eliminated. And can go to sysclock/2 for single-data-rate (SDR) clocking and sysclock/1 for double-data-rate (DDR) clocking.
    Thanks Evan. The docs aren't all that clear about whether RFxxxx uses the streamer and/or the FIFO.

    Where's the best place to look for info to use sysclock /4 with SPI (ie clock + data) - data out first ?
    That's an enormous thread on Hyperram and I've skimmed it a number of times.
  • Peter Jakacki Posts: 9,811
    edited 2020-05-28 - 08:35:47
    Cluso99 wrote: »
    SD_COG.spin2
    Using the streamer has shaved
    read sector (512+2bytes) 1151 clocks =5us @200MHz
    write sector(512+2bytes) 6255 clocks = 31us @200MHz

    Are you sure about those figures? 5us for 514 bytes is 822MHz SPI data rate - well beyond the 50MHz SPI limit on SD cards. The absolute best you could do without any delays would be 83us or so for 512+2 bytes.
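
The sanity check in this post can be reproduced with two one-line helpers. This is a sketch only: the function names are hypothetical, the sector is taken as 512+2 = 514 bytes on a 1-bit SPI bus, and the 50MHz limit is the one quoted in the post.

```python
def implied_spi_mhz(nbytes, micros):
    """SPI clock (MHz) needed to move nbytes in the given time, 1 bit per clock."""
    return nbytes * 8 / micros          # bits per microsecond == Mbit/s

def min_transfer_us(nbytes, spi_mhz):
    """Shortest possible transfer time (us) at a given SPI clock, no gaps."""
    return nbytes * 8 / spi_mhz

print(implied_spi_mhz(514, 5))      # about 822 (MHz)
print(min_transfer_us(514, 50))     # about 82 (us)
```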
  • Cluso99 Posts: 16,910
    edited 2020-05-28 - 08:55:39
    Peter,
    I wish :) No, I shaved that time off the previous time taken.
    At 200MHz a read takes 179us and a write takes 169us (that's for 512+2 bytes, as I'm measuring the complete sector plus the CRC bytes).
    They are both using 4 instructions per bit in unrolled loops.

    I've been playing with using skipf this afternoon. But perhaps I should instead be looking at using the smartpins and streamer.
    Did you notice I found my P2EVAL fails at 380MHz using serial?

    If anyone wants to add their SD Drivers here, please do so, plus links to any threads and I'll update the first post with details and links :sunglasses:
  • Ok, because you didn't say what the actual figure was. Didn't think it was for real :)
    This is the heart of my sector read, just a general purpose SPI block read I call with a count of 512.
    ' SPI>BUF ( dst cnt -- sum ) 46,128 cycles= 153,760ns @300MHz ok
    SPIRX           wrfast  #0,b
                    mov     b,#0
    .L0             rep     #5,#8                   ' 8 bits
                     outnot  sck                    ' clock (low high or low high)
                     rcl     r1,#1                  ' shift in msb first (first dummy zero)
                     outnot  sck
                     nop
                     testp   miso wc                ' read data from card

                    rcl     r1,#1                   ' last bit
                    wfbyte  r1
                    djnz    a,#.L0
                    jmp     #DROP
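
The quoted 46,128 cycles can be cross-checked by counting instructions in the loop above. This is a rough sketch with assumed timings that the post does not state: 2 clocks per ordinary cog instruction (including REP itself), 4 clocks for a taken DJNZ, and no FIFO stalls. The names are made up for illustration.

```python
# Assumed cog-exec timings (not stated in the post): 2 clocks per ordinary
# instruction, 4 clocks for a taken DJNZ. FIFO stalls are ignored.
INSTR = 2
TAKEN_BRANCH = 4

def clocks_per_byte(bits=8, instr_in_rep=5):
    rep_setup = INSTR                          # the REP instruction itself
    bit_loop = bits * instr_in_rep * INSTR     # 5 instructions repeated 8 times
    tail = INSTR + INSTR + TAKEN_BRANCH        # final RCL, WFBYTE, taken DJNZ
    return rep_setup + bit_loop + tail

per_byte = clocks_per_byte()
loop_total = 512 * per_byte
print(per_byte, loop_total, 46128 - loop_total)   # 90 46080 48
```

Under these assumptions the loop accounts for all but about 48 of the quoted cycles, which would be left for setup and call/return overhead.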
    
  • Cluso99 wrote: »
    Where's the best place to look for info to use sysclock /4 with SPI (ie clock + data) - data out first ?
    That's an enormous thread on Hyperram and I've skimmed it a number of times.
    The HR stuff is all done with streamer ops because it's an 8-bit parallel bus and speed was always the objective from the outset.

    I did do an all smartpins 2-bit solution for the EEPROM a long time back. In fact that's what ended up in the loadp2 EEPROM boot loader. Wasn't very tidy though. Always meant to convert it to streamer. Source, "P2ES_flashloader.spin2" is there with loadp2. The bootloader part, using smartpins, is down the bottom of the file.

    Chip did a compact 1-bit streamer version for PNut's EEPROM boot loader too. It ticks over lazily on RCFAST so doesn't have any complexity around high frequency compensations. Source, "flash_loader.spin2" is in Pnut zip.

  • Peter,
    This is the heart of my writeFastSector. It's unrolled 8 bits with 4 instructions per bit.
                    RDFAST  #0,               PTRB           ' use streamer /fifo
                    rep     @.repwr,          ##512          ' write 512 bytes
                    rfbyte  sdx_dataout                      ' get byte from streamer/fifo
                    rol     sdx_dataout,      #25        wc  ' test output bit 7     (DI=0/1) (msbit first)
                    bitc    sdx_bitout,       #sdx_di-32     ' prepare output bit 7 ready for output with clk=0
    .o7             mov     outb,             sdx_bitout     ' OUT: /CS=0 + CLK=0 + DI=0/1 (output data with CLK falling edge)
                    rol     sdx_dataout,      #1         wc  ' test output bit 6  (DI=0/1)
                    outh    #sdx_ck                          ' CLK=1
                    bitc    sdx_bitout,       #sdx_di-32     ' prepare output bit 6 ready for output with clk=0
    .o6             mov     outb,             sdx_bitout     ' OUT: /CS=0 + CLK=0 + DI=0/1 (output data with CLK falling edge)
                    rol     sdx_dataout,      #1         wc  ' test output bit 5  (DI=0/1)
                    outh    #sdx_ck                          ' CLK=1
                    bitc    sdx_bitout,       #sdx_di-32     ' prepare output bit 5 ready for output with clk=0
    
    and readFastSector again unrolled. It's unrolled 8 bits with 4 instructions per bit. Note the TESTB alternative is commented out.
                    WRFAST  #0,               PTRB           ' use streamer /fifo
                    rep     @.reprd,          sdx_bytecnt    ' read 512 bytes (or 16 bytes if CSD/CID)
    '               outl    #sdx_ck                          ' CLK=0  (already 0 first time)
    .s7             outh    #sdx_ck                          ' CLK=1
                    mov     sdx_reply,        #0             ' clear reply
                    outl    #sdx_ck                          ' CLK=0
                    nop
    .s6             outh    #sdx_ck                          ' CLK=1
                    testp   #sdx_do                      wc  '\ read input bit7:  sample on/after prior CLK rising edge
    '               testb   inb,             #sdx_do-32  wc  '/ read input bit7:  sample on/after prior CLK rising edge
                    outl    #sdx_ck                          ' CLK=0
                    rcl     sdx_reply,        #1             ' accum DO input bit 7
    
  • Cluso99 Posts: 16,910
    edited 2020-05-28 - 10:38:11
    evanh wrote: »
    Cluso99 wrote: »
    Where's the best place to look for info to use sysclock /4 with SPI (ie clock + data) - data out first ?
    That's an enormous thread on Hyperram and I've skimmed it a number of times.
    The HR stuff is all done with streamer ops because it's an 8-bit parallel bus and speed was always the objective from the outset.

    I did do an all smartpins 2-bit solution for the EEPROM a long time back. In fact that's what ended up in the loadp2 EEPROM boot loader. Wasn't very tidy though. Always meant to convert it to streamer. Source, "P2ES_flashloader.spin2" is there with loadp2. The bootloader part, using smartpins, is down the bottom of the file.

    Chip did a compact 1-bit streamer version for PNut's EEPROM boot loader too. It ticks over lazily on RCFAST so doesn't have any complexity around high frequency compensations. Source, "flash_loader.spin2" is in Pnut zip.

    Thanks Evan. I'll take a look.

    This afternoon I was working on the writesector using an 8 bit unrolled 2 instruction loop (actually 3 instructions but only 2 execute and the other is skipped) using the skipf instruction. I used two skipf sections to limit the table to 16 longs, one per nibble. Alas I broke my code so I've given up for today at least.
                    ror     sdx_dataout,      #4             ' msb nibble in lower bits
                    altd    sdx_dataout,      #skiptable     ' [index+table] into next instr
                    skipf   0-0                              ' pickup the skip for msb nibble
    .o7             mov     outb,             sdx_bitout0    ' OUT: /CS=0 + CLK=0 + DI=0 (output data with CLK falling edge)
                    mov     outb,             sdx_bitout1    ' OUT: /CS=0 + CLK=0 + DI=1 (output data with CLK falling edge)
                    outh    #sdx_ck                          ' CLK=1
    .o6             mov     outb,             sdx_bitout0    ' OUT: /CS=0 + CLK=0 + DI=0 (output data with CLK falling edge)
                    mov     outb,             sdx_bitout1    ' OUT: /CS=0 + CLK=0 + DI=1 (output data with CLK falling edge)
                    outh    #sdx_ck                          ' CLK=1
    
  • evanh Posts: 10,086
    edited 2020-05-28 - 10:47:44
    Brian, I think it was, did a single instruction partially bit-bashed loop for hyperRAM. Worked at sysclock/2 amazingly. He set a smartpin running for the HR clock then REP'd a single WFBYTE/RFBYTE going flat out for the burst duration. It only worked with base pins 0 or 32 because it used INA/B or OUTA/B.

  • Cluso99 wrote: »
    This afternoon I was working on the writesector using an 8 bit unrolled 2 instruction loop (actually 3 instructions but only 2 execute and the other is skipped) using the skipf instruction. ...
    I've not tried to understand skip'ing. Isn't there also additional time for table lookup overheads as well as the executed instructions?

  • Cluso99 Posts: 16,910
    edited 2020-05-28 - 12:55:42
    evanh wrote: »
    Cluso99 wrote: »
    This afternoon I was working on the writesector using an 8 bit unrolled 2 instruction loop (actually 3 instructions but only 2 execute and the other is skipped) using the skipf instruction. ...
    I've not tried to understand skip'ing. Isn't there also additional time for table lookup overheads as well as the executed instructions?
    Not sure what you’re asking.
    SKIPF is great. In its basic form, you have a bitmap as the D operand (there is no S operand). This gets loaded into an internal register and gets shifted right once for each instruction. If the bit being shifted out is a “1” the instruction is skipped, until all remaining bits are 0.
    Now the fun part. If the first instruction is skipped, then it’s a nop (2 clocks). After that, any run of up to 7/8? skipped instructions will be skipped with 0 clocks! Yep, no clocks. After 7/8 there is one 2 clock penalty, followed by another free 7/8. This is only valid for skipf in cog/LUT but not hub. Hub is like the skip instruction where all skipped instructions take 2 clocks.
    Next the interesting bit. If you have a call that is not skipped, skip is suspended until the call returns.

    There’s a further twist with execf where the lowest 10 bits are a cog/LUT address plus 22 bits of skip. When you perform an execf it jumps to the address and executes internally the skipf function on the remaining 22 bits. This takes 2 clocks.

    And the last twist is the ??? instruction, which uses a bytecode value (operand) to load an item from a table + bytecode-index which is used as an internal execf (jump and skipf). This takes IIRC 6 clocks. This operation is the guts of the spin2 interpreter and I’ve used it extensively in my Z80 emulation, which has remained finished and untested since last August because I needed a P2 OS. The OS basics now work. Not how I want it, but it works.

    Oh, there’s a trick with skip that I used in my ROM SD code. I needed to skip an instruction at the end of a routine but I couldn’t use the c or z flags as they were already being used. I set a register to 0 or 1 earlier in the routine, and near the end I just did skip register and it optionally skipped the next instruction, leaving my c and z flags intact to be returned to the caller.
  • Cluso99 wrote: »
    There’s a further twist with execf where the lowest 10 bits are a cog/LUT address plus 22 bits of skip. When you perform an execf it jumps to the address and executes internally the skipf function on the remaining 22 bits. This takes 2 clocks.
    Actually, IIRC EXECF and I think all the other indirect jumps(?) are 4 clocks.
  • Thanks. Of course jumps taken are 4 so execf would be 4.
    And xbyte is 9.
  • Wuerfel_21 Posts: 975
    edited 2020-05-28 - 15:09:06
    Cluso99 wrote: »
    And xbyte is 9.

    I always heard it was 6 (so the shortest possible bytecode (single instruction with _ret_) would take 8 cycles)
  • Cluso99 Posts: 16,910
    edited 2020-05-28 - 17:31:02
    According to the docs it’s 9 although I thought it was 6.

    With all this concern about timing reminds me of the mini I cut my teeth on. 3.3 us cycle time, instructions took 100+us. Took 5 mins to read a 10MB disc drive (yes discs were originally spelt with a c). We’ve certainly progressed. And 300 or 1200 baud was the norm.
  • Wuerfel_21 wrote: »
    Cluso99 wrote: »
    And xbyte is 9.

    I always heard it was 6 (so the shortest possible bytecode (single instruction with _ret_) would take 8 cycles)

    6 and 8 are correct. _ret_ adds 0 cycles at end of XBYTE routine, +2 elsewhere.
  • Cluso,
    You have an example there. I was looking at that. If all of that is inside the bit-bashing loop then it seems to me there is a substantial overhead preceding the actual bit-bashing instructions.

  • Cluso99 Posts: 16,910
    edited 2020-05-29 - 04:55:08
    @evanh
    Skipf example attached, code extract here...
    Either the modc _clr or the modc _set will be executed depending on the bit being 0 or 1, then the rcl always executes.
    The program just accumulates the bits for the byte presented and outputs the hex to confirm it worked correctly, for all 256 cases. It does the byte in two halves (nibbles) which reduces the table to 16 longs instead of 256.
                  mov       char, #0
    loop          mov       byteout, char
                  mov       lmm_x, #0
    
                  ror       byteout, #4                     ' msb nibble in lower bits
                  altd      byteout, #skiptable             ' [index+table] into next instr
                  skipf     0-0                             ' pickup the skip for msb nibble
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
    
                  shr       byteout, #28                    ' lsb nibble in lower bits
                  altd      byteout, #skiptable             ' [index+table] into next instr
                  skipf     0-0                             ' pickup the skip for lsb nibble
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
    
                  call      #_HubHex8
                  call      #_hubTxCR
    
                  add       char, #1
                  cmp       char, #$100                wz
        if_nz     jmp       #loop
    
    .loop         call      #_hubMonitor
                  jmp       #.loop
    
    
    '                         b0  b1  b2  b3 '<---
    '                        h10_h10_h10_h10      
    skiptable     long      %010_010_010_010 '0000
                  long      %001_010_010_010 '0001
                  long      %010_001_010_010 '0010
                  long      %001_001_010_010 '0011
                  long      %010_010_001_010 '0100
                  long      %001_010_001_010 '0101
                  long      %010_001_001_010 '0110
                  long      %001_001_001_010 '0111
                  long      %010_010_010_001 '1000
                  long      %001_010_010_001 '1001
                  long      %010_001_010_001 '1010
                  long      %001_001_010_001 '1011
                  long      %010_010_001_001 '1100
                  long      %001_010_001_001 '1101
                  long      %010_001_001_001 '1110
                  long      %001_001_001_001 '1111
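
The skip table can be checked off-chip with a small model. The sketch below is not P2 code and makes one simplifying assumption: bit n of the skip pattern, counted from the least significant bit, controls the n-th instruction after SKIPF, with a 1 meaning "skip". The table values are copied from the extract; the function names are invented for illustration.

```python
# Skip patterns copied from the post's table; index = nibble value.
SKIPTABLE = [
    0b010_010_010_010, 0b001_010_010_010, 0b010_001_010_010, 0b001_001_010_010,
    0b010_010_001_010, 0b001_010_001_010, 0b010_001_001_010, 0b001_001_001_010,
    0b010_010_010_001, 0b001_010_010_001, 0b010_001_010_001, 0b001_001_010_001,
    0b010_010_001_001, 0b001_010_001_001, 0b010_001_001_001, 0b001_001_001_001,
]

# The 12 instructions that follow each SKIPF: MODC _clr, MODC _set, RCL, four times.
SEQUENCE = ["clr", "set", "rcl"] * 4

def run_nibble(pattern, acc):
    """Model SKIPF: bit n of the pattern set means instruction n is skipped."""
    carry = 0
    for n, op in enumerate(SEQUENCE):
        if (pattern >> n) & 1:
            continue                      # skipped instruction
        if op == "clr":
            carry = 0
        elif op == "set":
            carry = 1
        else:                             # rcl: shift carry into bit 0
            acc = (acc << 1) | carry
    return acc

def rebuild(byte):
    acc = run_nibble(SKIPTABLE[byte >> 4], 0)       # msb nibble first
    return run_nibble(SKIPTABLE[byte & 0xF], acc)   # then lsb nibble

# Same check as the PASM test program: all 256 byte values must round-trip.
assert all(rebuild(b) == b for b in range(256))
```

Under this model every byte is rebuilt correctly, and it also shows that the second eight table entries correspond to nibbles 1000 to 1111.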
    
  • Is there execution overhead?
  • No, because skipf only loads a long into memory and uses it immediately. There is no jump involved, so it's a 2 clock instruction.
    Only if the first instruction is skipped is there a 2 clock penalty, because it's already in the pipeline, and every 7-8 skipped instructions in a row causes a 2 clock penalty. So in my example, only if the first instruction following the skipf is skipped does it take 2 clocks to skip over it. Then, because I'm always not skipping every third instruction, the 7-8 successive skips never happen, so every other skipped instruction is 0 clocks. Neat isn't it :)
  • Cluso99 Posts: 16,910
    edited 2020-05-29 - 06:52:46
    Just put the timer on to check :)
    $32=50 clocks for this (which is what I calculated) :)
                  getct     time1
                  ror       byteout, #4                     ' msb nibble in lower bits
                  altd      byteout, #skiptable             ' [index+table] into next instr
                  skipf     0-0                             ' pickup the skip for msb nibble
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
    
                  shr       byteout, #28                    ' lsb nibble in lower bits
                  altd      byteout, #skiptable             ' [index+table] into next instr
                  skipf     0-0                             ' pickup the skip for lsb nibble
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  modc      _clr                       wc   
                  modc      _set                       wc
                  rcl       lmm_x, #1
                  getct     time2
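
The 50-clock reading can be compared against a simple count based on the rules described earlier in the thread. This is a sketch under stated assumptions, not a cycle-accurate model: every executed instruction costs 2 clocks, a skipped first instruction after SKIPF costs 2 clocks, all other skipped instructions cost 0, and the measured interval includes one GETCT. The function names are hypothetical.

```python
def pattern_for(nibble):
    """Build the skip pattern for a nibble: per bit, skip either MODC _clr or MODC _set."""
    p = 0
    for i in range(4):                        # i = 0 is the nibble's msb
        bit = (nibble >> (3 - i)) & 1
        p |= (0b001 if bit else 0b010) << (3 * i)
    return p

def skipf_clocks(pattern, n_instr=12):
    """Clocks for SKIPF plus the n_instr instructions it governs (cog exec, assumed)."""
    clocks = 2                                # SKIPF itself
    for n in range(n_instr):
        skipped = (pattern >> n) & 1
        if not skipped:
            clocks += 2                       # executed instruction
        elif n == 0:
            clocks += 2                       # first instruction is already in the pipeline
        # any other skipped instruction costs nothing in this model
    return clocks

def byte_clocks(byte):
    total = 2                                 # assumption: one GETCT falls inside the interval
    for nibble in (byte >> 4, byte & 0xF):
        total += 2 + 2                        # ROR or SHR, then ALTD
        total += skipf_clocks(pattern_for(nibble))
    return total

counts = [byte_clocks(b) for b in range(256)]
print(min(counts), max(counts))               # 46 50
```

In this model the result is 46 clocks when neither nibble has its top bit set and 50 when both do, so a reading of 50 is consistent with the worst case. The post does not say which byte value was timed.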
    
  • evanh Posts: 10,086
    edited 2020-05-29 - 10:50:29
    I don't see how it loops without overheads.
  • Cluso99 Posts: 16,910
    edited 2020-05-29 - 11:09:20
    evanh wrote: »
    I don't see how it loops without overheads.
    What do you mean about looping?
    This is just an unrolled loop to send 8 bits. It’s repeated 512 times in an outer rep and rfbyte loop if that’s what you’re asking.

    If you’re referring to the execf and xbyte, that’s more complicated and I’m not using it here, at least for now anyway. I doubt there would be any saving, probably it would be worse.
  • Eight bits isn't worth much is it.
  • IIRC...
    It reduces the time to send a byte (it's send only, i.e. writing a sector) from 8 bits * 4 instructions * 2 clocks = 64 clocks down to 8*2*2=32 clocks, but adds 6-7 instructions * 2 = 12-14 clocks of overhead, so we go from 64 clocks to 44-46 clocks per byte. That's a saving of 28%-31%.
    BUT this saving is for 512 bytes in the sector, plus the commands (6 bytes). So it adds up quickly. (I already cheat for sending the CRC16 bytes as all $FF so get this as 2 instructions per bit)

    At 200MHz that's 512+6=518 bytes * 18-20 clocks saved = 47us-52us or, to put it in real terms, reducing 165us down to about 114us-119us.
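
The per-sector effect follows directly from the per-byte figures of 64 clocks versus 44-46 clocks. The sketch below just redoes that arithmetic, assuming 518 bytes (512 data + 6 command) at 200MHz; the function name `savings` is made up for illustration.

```python
def savings(nbytes=512 + 6, sysclk_mhz=200, old=64, new=(44, 46)):
    """Return rows of (clocks per byte, % saved, us saved, us total) per sector."""
    rows = []
    for clocks in new:
        pct = 100 * (old - clocks) / old
        saved_us = nbytes * (old - clocks) / sysclk_mhz
        total_us = nbytes * clocks / sysclk_mhz
        rows.append((clocks, pct, saved_us, total_us))
    return rows

baseline_us = 518 * 64 / 200      # the existing 4-instructions-per-bit version
print(f"baseline {baseline_us:.2f} us")
for clocks, pct, saved_us, total_us in savings():
    print(f"{clocks} clocks/byte: saves {pct:.2f}% = {saved_us:.2f} us, total {total_us:.2f} us")
```

With these inputs the baseline is about 165us and the skipf version lands at roughly 114us to 119us, i.e. the 28%-31% saving applied to the whole transfer.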

    Worth it??? Don't know, and I have a bug to fix because it's not working at present :(

    Unfortunately I cannot do the same trick for reads and of course they are the most prevalent too.