P2 SD Driver V2.60 (pasm cog from spin2)

Cluso99 · 2020-12-05 00:01

P2 SD Driver V2.60

This is an SD Driver which resides in its own cog and can be accessed via a 4-long mailbox in hub RAM from either spin2 or pasm.
The current demo shows access from spin2 only. I have used Eric's FlexSpin to compile - I have not tried pnut or PropTool.

The hub mailbox is:

'' + sd_mailbox                    '\ [16] mailbox for SD Driver               +
'' + mbox_command    byte    0     '| [1] command                              +
'' + mbox_status     byte    0     '| [1] status                               +
'' + mbox_spare      word    0     '| [2] -spare-                              +
'' + mbox_bufad      long    0     '| [4] address of disk buffer in hub        +
'' + mbox_sector     long    0     '| [4] sector                               +
'' + mbox_aux        long    0     '/ [4] aux eg filesize                      +
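As a quick sanity check on the layout above (this is not part of the driver), here is a minimal Python sketch that models the 16-byte mailbox image. The field order and sizes come from the listing; the little-endian packing reflects hub RAM byte order, and the helper name `pack_mailbox` plus the buffer address and sector number are hypothetical values chosen only for illustration.

```python
import struct

# Layout taken from the mailbox listing: byte, byte, word, long, long, long.
# "<" = little-endian with no padding, matching P2 hub RAM byte order.
MAILBOX = struct.Struct("<BBHIII")

def pack_mailbox(command, bufad, sector, aux=0, status=0):
    """Build the 16-byte mailbox image for a one-character command."""
    return MAILBOX.pack(ord(command), status, 0, bufad, sector, aux)

# Hypothetical values: buffer at hub $1000, read sector 8192.
image = pack_mailbox("R", bufad=0x1000, sector=8192)
print(MAILBOX.size, image.hex())
```

The command byte sits at offset 0, the buffer address at offset 4, the sector at offset 8 and the aux long at offset 12, which is what the `[1] [1] [2] [4] [4] [4]` sizes in the listing add up to.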

The currently supported commands are:

'' +    "I" = Initialise SD Driver                                             +
'' +    "R" = readFastSector                                                   +
'' +    "W" = writeFastSector                                                  +
'' +    "D" = find Directory                                                   +
'' +    "F" = find File                                                        +
'' +    "X" = Load/Execute (usually after "F")    (see Warning)                 +

WARNING: The "X" command is likely to fail because the file it loads can overwrite hub code that is still in use. Special care must be taken when using this command.

Please report any bugs here. Enjoy

Peter Jakacki · 2020-12-05 00:25

Looks good. What r/w speeds have you measured? Are you using multi-block read?

btw, where's the file?

Cluso99 · 2020-12-05 01:06

Peter Jakacki wrote: »

Looks good. What r/w speeds have you measured? Are you using multi-block read?

btw, where's the file?

Oops

file added.

No multiblock read/write.

However, the SPI read and write routines have been extremely optimised by unrolling the bits sent within each byte, so it is as fast as is possible (4 instructions per bit), and then the 512 bytes are done with a rep instruction, all without smartpins. The CRC bytes are optimised even further, to 2 instructions per bit, since they are not used.
Should be about 3MB/s (30Mb/s). Of course, the SD card will have overheads.
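For anyone wanting to check the arithmetic, here is a rough Python sketch of the raw bit rate. It assumes 2 clocks per instruction (the usual P2 timing for these instructions) and a 200 MHz system clock; the clock value is an assumption, since the post does not say what clock was used.

```python
CLOCKS_PER_INSTRUCTION = 2   # typical P2 instruction timing
INSTRUCTIONS_PER_BIT = 4     # figure quoted in the post

def spi_rate(sysclk_hz, instr_per_bit=INSTRUCTIONS_PER_BIT):
    """Return (bits per second, bytes per second) for the unrolled loop."""
    bits_per_s = sysclk_hz / (instr_per_bit * CLOCKS_PER_INSTRUCTION)
    return bits_per_s, bits_per_s / 8

# ASSUMPTION: 200 MHz system clock; the post does not state the clock used.
bits, bytes_ = spi_rate(200_000_000)
print(f"{bits / 1e6:.1f} Mbit/s, {bytes_ / 1e6:.3f} MB/s")
```

Under those assumptions the raw rate comes out a little over 3 MB/s, before any card overheads.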

I am going to defy anyone to improve upon this with bit-bashing.

Here is an extract for 2 bits read:

.s5             outh    #sdx_ck                          ' CLK=1
                testp   #sdx_do                      wc  '\ read input bit6:  sample on/after prior CLK rising edge
                outl    #sdx_ck                          ' CLK=0
                rcl     sdx_reply,        #1             ' accum DO input bit 6
.s4             outh    #sdx_ck                          ' CLK=1
                testp   #sdx_do                      wc  '\ read input bit5:  sample on/after prior CLK rising edge
                outl    #sdx_ck                          ' CLK=0
                rcl     sdx_reply,        #1             ' accum DO input bit 5

And 2 bits write:

.o5             mov     outb,             sdx_bitout     ' OUT: /CS=0 + CLK=0 + DI=0/1 (output data with CLK falling edge)
                rol     sdx_dataout,      #1         wc  ' test output bit 4  (DI=0/1)
                outh    #sdx_ck                          ' CLK=1
                bitc    sdx_bitout,       #sdx_di-32     ' prepare output bit 4 ready for output with clk=0
.o4             mov     outb,             sdx_bitout     ' OUT: /CS=0 + CLK=0 + DI=0/1 (output data with CLK falling edge)
                rol     sdx_dataout,      #1         wc  ' test output bit 3  (DI=0/1)
                outh    #sdx_ck                          ' CLK=1
                bitc    sdx_bitout,       #sdx_di-32     ' prepare output bit 3 ready for output with clk=0

Wuerfel_21 · 2020-12-05 01:15

Cluso99 wrote: »

as fast as is possible (4 instructions)

I think you can use RCZL to get down to 3.5 instructions per bit.

.s5             outh    #sdx_ck                          ' CLK=1
                testp   #sdx_do                      wc  '\ read input bit6:  sample on/after prior CLK rising edge
                outl    #sdx_ck                          ' CLK=0
.s4             outh    #sdx_ck                          ' CLK=1
                testp   #sdx_do                      wz  '\ read input bit5:  sample on/after prior CLK rising edge
                outl    #sdx_ck                          ' CLK=0
                rczl    sdx_reply                        ' accum DO input bit 6/5

.o5             mov     outb,             sdx_bitout     ' OUT: /CS=0 + CLK=0 + DI=0/1 (output data with CLK falling edge)
                rczl    sdx_dataout                  wcz ' test output bits 4 and 3 (C=bit 4, Z=bit 3); RCZL takes no source operand
                outh    #sdx_ck                          ' CLK=1
                bitc    sdx_bitout,       #sdx_di-32     ' prepare output bit 4 ready for output with clk=0
.o4             mov     outb,             sdx_bitout     ' OUT: /CS=0 + CLK=0 + DI=0/1 (output data with CLK falling edge)
                outh    #sdx_ck                          ' CLK=1
                bitz    sdx_bitout,       #sdx_di-32     ' prepare output bit 3 ready for output with clk=0
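To illustrate why the paired version collects the same byte, here is a small Python model of the two accumulate styles. It is only a sketch: it assumes RCL D,#1 shifts D left by one with C entering bit 0, and RCZL D shifts D left by two with C entering bit 1 and Z entering bit 0. The function names are made up for the model.

```python
MASK32 = 0xFFFFFFFF

def rcl1(d, c):
    """Model of RCL D,#1: shift D left one bit, filling bit 0 with C."""
    return ((d << 1) | c) & MASK32

def rczl(d, c, z):
    """Model of RCZL D: shift D left two bits, filling bits 1:0 with C then Z."""
    return ((d << 2) | (c << 1) | z) & MASK32

def read_byte_rcl(bits):
    """Accumulate MSB-first bits one at a time (4 instructions per bit)."""
    reply = 0
    for b in bits:
        reply = rcl1(reply, b)
    return reply

def read_byte_rczl(bits):
    """Accumulate MSB-first bits in pairs (7 instructions per 2 bits)."""
    reply = 0
    for i in range(0, len(bits), 2):
        reply = rczl(reply, bits[i], bits[i + 1])
    return reply

sample = [1, 0, 1, 0, 0, 1, 0, 1]   # $A5, most significant bit first
print(hex(read_byte_rcl(sample)), hex(read_byte_rczl(sample)))
```

Both functions produce the same value for every 8-bit pattern, so the only difference in the model is one fewer accumulate step per pair of bits.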

Peter Jakacki · 2020-12-05 02:51

The speed will always be limited by the latency to read a block which is why you really need to implement multi-block read. It's not hard to do at all and while you may have a fast SPI routine, this change makes a huge difference to actual throughput because when I read in a bmp or video file, this is what actually counts and is needed.

I can read and flip and render a 640x480x8 bmp in under 100ms. There's a test there for you.

Wuerfel_21 · 2020-12-05 03:05

Peter Jakacki wrote: »

The speed will always be limited by the latency to read a block which is why you really need to implement multi-block read. It's not hard to do at all and while you may have a fast SPI routine, this change makes a huge difference to actual throughput because when I read in a bmp or video file, this is what actually counts and is needed.

Yeah, this. Use multiblock commands where applicable (is there a downside to using multiblock to read a single block?) and/or program your application such that it doesn't wait on transfers where it's not necessary.

IIRC on P1 with the usual 1-instruction-per-bit read routine, the single block read time is roughly 50:50 between waiting and actual transfer. I don't remember if that measurement was on an A1 card or some ol 2GB crust flake though.
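A back-of-envelope Python model of that trade-off, for what it's worth. Every number here is hypothetical: the per-sector transfer time assumes a 3.125 MB/s raw rate, the latency is simply set equal to it to reproduce a 50:50 split, and the model makes the simplifying assumption that a multi-block read pays the access latency only once. Real cards will behave differently.

```python
SECTOR_BYTES = 512

def effective_rate(blocks, transfer_s, latency_s, multiblock):
    """Bytes per second for `blocks` sectors under a simplified timing model."""
    # Simplification: a multi-block read pays the access latency once,
    # a sequence of single-block reads pays it for every sector.
    waits = 1 if multiblock else blocks
    total_s = waits * latency_s + blocks * transfer_s
    return blocks * SECTOR_BYTES / total_s

# HYPOTHETICAL numbers, chosen only to reproduce a 50:50 wait/transfer split.
transfer_s = SECTOR_BYTES / 3_125_000   # one sector at an assumed 3.125 MB/s
latency_s = transfer_s                  # assumed equal to the transfer time
blocks = 600                            # about 640*480 bytes of pixel data

single = effective_rate(blocks, transfer_s, latency_s, multiblock=False)
multi = effective_rate(blocks, transfer_s, latency_s, multiblock=True)
print(f"single-block: {single / 1e6:.2f} MB/s, multi-block: {multi / 1e6:.2f} MB/s")
```

With a 50:50 split the single-block case lands at half the raw rate however fast the SPI loop is, while the multi-block case stays close to the raw rate in this model.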

Cluso99 · 2020-12-05 04:10

Wuerfel_21 wrote: »

Cluso99 wrote: »

as fast as is possible (4 instructions)

I think you can use RCZL to get down to 3.5 instructions per bit.

.s5             outh    #sdx_ck                          ' CLK=1
                testp   #sdx_do                      wc  '\ read input bit6:  sample on/after prior CLK rising edge
                outl    #sdx_ck                          ' CLK=0
.s4             outh    #sdx_ck                          ' CLK=1
                testp   #sdx_do                      wz  '\ read input bit5:  sample on/after prior CLK rising edge
                outl    #sdx_ck                          ' CLK=0
                rczl    sdx_reply                        ' accum DO input bit 6/5

.o5             mov     outb,             sdx_bitout     ' OUT: /CS=0 + CLK=0 + DI=0/1 (output data with CLK falling edge)
                rczl    sdx_dataout                  wcz ' test output bits 4 and 3 (C=bit 4, Z=bit 3); RCZL takes no source operand
                outh    #sdx_ck                          ' CLK=1
                bitc    sdx_bitout,       #sdx_di-32     ' prepare output bit 4 ready for output with clk=0
.o4             mov     outb,             sdx_bitout     ' OUT: /CS=0 + CLK=0 + DI=0/1 (output data with CLK falling edge)
                outh    #sdx_ck                          ' CLK=1
                bitz    sdx_bitout,       #sdx_di-32     ' prepare output bit 3 ready for output with clk=0

Ooooh nice!

When did we get that instruction?

Cluso99 · 2020-12-05 04:13

Peter Jakacki wrote: »

The speed will always be limited by the latency to read a block which is why you really need to implement multi-block read. It's not hard to do at all and while you may have a fast SPI routine, this change makes a huge difference to actual throughput because when I read in a bmp or video file, this is what actually counts and is needed.

I can read and flip and render a 640x480x8 bmp in under 100ms. There's a test there for you.

Well, you do have to have code that will utilise multi-blocking.
Almost none of my P1 or P2 code could even utilise it. I don't read/write video files.
