P2 SD Drivers - COG PASM Version V2.70
Cluso99
Cluso's P2 SD Drivers
Update V2.70 low level SD Driver 30-Jan-2021
https://forums.parallax.com/discussion/comment/1515707/#Comment_1515707
Provides low-level SD card access (initialise, locate FAT directory, find file, sector read, sector write, and load file), all from a Spin demo program.
You may also be interested in the FAT32 driver and my OS here
Cluso's Propeller OS (v2.43) - includes SD Driver and FAT32 Object plus Serial Driver
forums.parallax.com/discussion/173395/clusos-propeller-os-v2-38-includes-sd-driver-and-fat32-object-plus-serial-driver
Update v.223
Increased performance and now works from 10MHz-370MHz (my P2EVAL fails serial at 380MHz). Code attached.
I expect to release a number of versions, all with different advantages and disadvantages. They will be released on this thread, with the latest releases in this post.
They are all based on my SD Driver present in the P2 ROM for booting from an SD Card.
While I have released various test versions in the past, all new releases will be available from here.
Cluso's P2 SD Driver for COG v.222
This is the first (test) release of my COG version of my SD Drivers. There are two files in the zip file...
* SD_COG.spin2 is just the driver. For now it needs your own code to call it, and it can be compiled together with your code. There is very little space left in COG RAM, though you can relocate the driver to LUT, except for the 16 registers it requires, which can be located anywhere in COG: just change the _regs equate.
I intend to release a mailbox version later so that this driver could reside in its own COG and communicate via a 4-long hub mailbox.
This driver has been made for speed, with the reads and writes unrolled, hence its big footprint. Each send/receive bit takes 4 instructions, i.e. 8 clocks per bit, which averages out to about 9 clocks per bit over the 512+2 bytes. My tests have been done at 200MHz - I haven't checked other clock speeds yet. Works from 40MHz-280MHz (290MHz fails, didn't try below 40MHz).
Timing tests have been commented out. You can re-enable them or use SD_COG_test.spin2 to do any timing yourself.
If you look at the readFast sections you will see TESTB instructions commented out. This is an alternative method I tried and should shift the sample window one clock earlier (because TESTB has an extra clock delay between the sample it sees and the instruction starting versus TESTP). If you want to try it, comment out TESTP and uncomment TESTB for each pair.
I intend to add block mode later.
Note that this code will only work on SDHC cards and above which use block mode for sector addresses rather than the older original byte mode. Cards 4GB and above are likely to be OK.
* SD_COG_test.spin2 is the above driver with a test harness built using the ROM Serial/Monitor calls. It has timing placed for each function call so you can tell how fast your SD Card is.
Note: There is a bug in my _readFastLong routine so I am performing 4 byte reads for this instead. It's only when reading the command response so it doesn't really matter. I wanted to get this out hence the workaround.
Please report any bugs or problems here. Enjoy
Notes: There is a significant lag in SD cards while they prepare to read or write a sector. On my SanDisk Ultra 16GB microSD SDHC I C10 I have observed lags of 4ms, 2.7ms and 1.7ms before the read or write can proceed (i.e. while the card is busy). At 200MHz I am achieving 512+2 byte reads in 37071 clocks = 185us, or 2.77MB/s, and 512+2 byte writes in 40145 clocks = 200us, or 2.56MB/s.
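For anyone wanting to check these figures against their own card, here is a minimal Python sketch of the arithmetic. It is not part of the driver; the function name `sector_stats` is made up for illustration, and it assumes MB means 10^6 bytes and that the 2 CRC bytes are counted as payload, which is what reproduces the numbers quoted above.

```python
# Figures quoted in the post; 514 = 512 data bytes + 2 CRC bytes.
SYSCLK_HZ = 200_000_000
SECTOR_BYTES = 512 + 2

def sector_stats(clocks, sysclk_hz=SYSCLK_HZ, nbytes=SECTOR_BYTES):
    """Return (microseconds, MB/s, clocks per bit) for one sector transfer."""
    seconds = clocks / sysclk_hz
    micros = seconds * 1e6
    mb_per_s = nbytes / seconds / 1e6       # assumption: MB = 10**6 bytes
    clocks_per_bit = clocks / (nbytes * 8)  # average, including any overhead
    return micros, mb_per_s, clocks_per_bit

read_us, read_mbs, read_cpb = sector_stats(37071)
write_us, write_mbs, write_cpb = sector_stats(40145)
print(f"read : {read_us:.1f} us, {read_mbs:.2f} MB/s, {read_cpb:.2f} clocks/bit")
print(f"write: {write_us:.1f} us, {write_mbs:.2f} MB/s, {write_cpb:.2f} clocks/bit")
```

The clocks-per-bit value comes out a little above the 8 clocks of the unrolled loop because the measured clock counts include the per-byte and per-sector overhead.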
Comments
Attached is the first working version of Kye's FAT32 Driver that I converted to run under P2 SPIN2.
I haven't included all the re-direction objects that make this run, but I thought I would post it here in case anyone was interested.
Note there is a problem: the code writes to the P1 ROM address space when skipping bytes. This will need to be fixed.
I have used this in my P2 OS here
forums.parallax.com/discussion/171600/propeller-2-os-its-alive
Here are some diagrams I drew when writing my SD Drivers
Using the streamer has shaved the following off the previous times:
read sector (512+2bytes) 1151 clocks =5us @200MHz
write sector(512+2bytes) 6255 clocks = 31us @200MHz
Hope I've used the streamer correctly - the data looks fine.
Here's what I did for read sector and write to hub and here's what I did for read from hub and write sector
Just love some of these P2 features
Yes once you can get the clock timing and streamer aligned it is a great combination.
Just tweaked the code to re-insert a couple of timing NOPs at strategic points where I had over-optimised the code. While I've lost a few clocks, I've extended the range to 10MHz-370MHz. My board fails serial at 380MHz so no point in trying to get any higher.
With the streamer the bit-bashing code is eliminated. And can go to sysclock/2 for single-data-rate (SDR) clocking and sysclock/1 for double-data-rate (DDR) clocking.
Where's the best place to look for info to use sysclock /4 with SPI (ie clock + data) - data out first ?
That's an enormous thread on Hyperram and I've skimmed it a number of times.
Are you sure about those figures? 5us for 514 bytes is an 822MHz SPI data rate - well beyond the 50MHz SPI limit on SD cards. The absolute best you could do without any delays would be 83us or so for 512+2 bytes.
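The sanity check in the post above can be reproduced in a couple of lines. This is only a sketch of the arithmetic, with hypothetical helper names; it assumes 512+2 = 514 bytes per sector, one bit per SPI clock, and no gaps between bytes.

```python
def required_spi_mhz(nbytes, micros):
    """SPI clock (MHz) needed to move nbytes in the given time, 1 bit per clock."""
    return nbytes * 8 / micros          # bits per microsecond == MHz

def min_transfer_us(nbytes, spi_mhz):
    """Shortest possible transfer time (us) at a given SPI clock, no gaps."""
    return nbytes * 8 / spi_mhz

SECTOR_BYTES = 512 + 2                  # data plus CRC16

print(f"{required_spi_mhz(SECTOR_BYTES, 5):.1f} MHz needed for 5 us")
print(f"{min_transfer_us(SECTOR_BYTES, 50):.2f} us minimum at 50 MHz")
```

So a 5us sector would need roughly 16 times the 50MHz limit mentioned in the post, and just over 82us is the floor for one 512+2 byte transfer at that limit.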
I wish! No, I shaved that time off the previous time taken.
At 200MHz (measuring the complete sector plus the CRC bytes, i.e. 512+2 bytes) a read takes 179us and a write takes 169us.
They are both using unrolled loops with 4 instructions per bit.
I've been playing with using skipf this afternoon. But perhaps I should instead be looking at using the smartpins and streamer.
Did you notice I found my P2EVAL fails at 380MHz using serial?
If anyone wants to add their SD Drivers here, please do so, plus links to any threads and I'll update the first post with details and links
This is the heart of my sector read, just a general purpose SPI block read I call with a count of 512.
I did do an all smartpins 2-bit solution for the EEPROM a long time back. In fact that's what ended up in the loadp2 EEPROM boot loader. Wasn't very tidy though. Always meant to convert it to streamer. Source, "P2ES_flashloader.spin2" is there with loadp2. The bootloader part, using smartpins, is down the bottom of the file.
Chip did a compact 1-bit streamer version for PNut's EEPROM boot loader too. It ticks over lazily on RCFAST so doesn't have any complexity around high frequency compensations. Source, "flash_loader.spin2" is in Pnut zip.
This is the heart of my writeFastSector. It's unrolled 8 bits with 4 instructions per bit. And this is readFastSector, again unrolled 8 bits with 4 instructions per bit. Note the TESTB alternative is commented out.
Thanks Evan. I'll take a look.
This afternoon I was working on the writesector using an 8 bit unrolled 2 instruction loop (actually 3 instructions but only 2 execute and the other is skipped) using the skipf instruction. I used two skipf sections to limit the table to 16 longs, one per nibble. Alas I broke my code so I've given up for today at least.
SKIPF is great. In its basic form, you have a bitmap as the D operand (there is no S operand). This gets loaded into an internal register and gets shifted right once for each instruction. If the bit being shifted out is a “1” the instruction is skipped, until all remaining bits are 0.
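To make the bitmap behaviour concrete, here is a small Python model of it. This is only a sketch of the description above, not of the hardware: `run_with_skip` and the instruction names are invented for illustration, and it ignores clock counts, calls and branches entirely.

```python
def run_with_skip(instructions, mask):
    """Return the instructions that execute under a SKIPF-style bitmap.

    Bit 0 of `mask` applies to the first instruction after the skip is set,
    bit 1 to the next, and so on; a 1 bit means "skip this instruction".
    Once the remaining bits are all 0, everything executes normally.
    """
    executed = []
    for name in instructions:
        if mask & 1 == 0:
            executed.append(name)
        mask >>= 1                      # one shift per instruction slot
    return executed

program = ["A", "B", "C", "D", "E"]
print(run_with_skip(program, 0b01010))  # ['A', 'C', 'E']
```

Reading the mask from the right, bits 1 and 3 are set, so the second and fourth instructions are the ones that disappear.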
Now the fun part. If the first instruction is skipped, then it’s a nop (2 clocks). Now any run of up to 7/8(?) skipped instructions will be skipped with 0 clocks! Yep, no clocks. After 7/8 there is one 2 clock penalty, followed by another free 7/8. This is only valid for skipf in cog/LUT but not hub. Hub is like the skip instruction, where all skipped instructions take 2 clocks.
Next the interesting bit. If you have a call that is not skipped, skip is suspended until the call returns.
There’s a further twist with execf where the lowest 10 bits are a cog/LUT address plus 22 bits of skip. When you perform an execf it jumps to the address and executes internally the skipf function on the remaining 22 bits. This takes 2 clocks.
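The 10 + 22 bit layout described above can be sketched as a pack/unpack pair. The helper names are hypothetical and this only models how the 32-bit value is split, assuming the address sits in bits 9..0 and the skip pattern in bits 31..10 as the post describes.

```python
ADDR_BITS = 10   # cog/LUT address, bits 9..0
SKIP_BITS = 22   # skip pattern, bits 31..10

def pack_execf(addr, skip):
    """Build the 32-bit EXECF-style operand from an address and a skip pattern."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address must fit in 10 bits")
    if not 0 <= skip < (1 << SKIP_BITS):
        raise ValueError("skip pattern must fit in 22 bits")
    return (skip << ADDR_BITS) | addr

def unpack_execf(word):
    """Split a 32-bit operand back into (address, skip pattern)."""
    return word & ((1 << ADDR_BITS) - 1), (word >> ADDR_BITS) & ((1 << SKIP_BITS) - 1)

word = pack_execf(0x123, 0b101)
print(hex(word), unpack_execf(word))
```

One consequence of the split is that a single table entry can carry both the routine's entry point and which of its first 22 instructions to drop.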
And the last twist is the ??? instruction which uses a bytecode value (operand) to load an item from a table + bytecode-index which is used as an internal execf (jump and skipf). This takes IIRC 6 clocks. This operation is the guts of the Spin2 interpreter and I’ve used it extensively in my Z80 emulation, which has remained finished and untested since last August because I needed a P2 OS. The OS basics now work. Not how I want it, but it works.
Oh, there’s a trick with skip that I used in my ROM SD code. I needed to skip an instruction at the end of a routine but I couldn’t use the C or Z flags as they were already being used. I set a register to 0 or 1 earlier in the routine, and near the end I just did skip register and it optionally skipped the next instruction, leaving my C and Z flags intact to be returned to the caller.
And xbyte is 9.
I always heard it was 6 (so the shortest possible bytecode (single instruction with _ret_) would take 8 cycles)
All this concern about timing reminds me of the mini I cut my teeth on. 3.3us cycle time, instructions took 100+us. Took 5 mins to read a 10MB disc drive (yes, discs were originally spelt with a c). We’ve certainly progressed. And 300 or 1200 baud was the norm.
6 and 8 are correct. _ret_ adds 0 cycles at end of XBYTE routine, +2 elsewhere.
You have an example there. I was looking at that. If all of that is inside the bit-bashing loop then it seems to me there is a substantial overhead preceding the actual bit-bashing instructions.
Skipf example attached, code extract here...
Either the modc _clr or the modc _set will be executed, depending on whether the bit is 0 or 1; then the rcl always executes.
The program just accumulates the bits for the byte presented and outputs the hex to confirm it worked correctly, for all 256 cases. It does the byte in two halves (nibbles) which reduces the table to 16 longs instead of 256.
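Since the code extract did not survive here, the idea can be sketched in Python. This is a model under assumptions, not the attached program: it assumes three instruction slots per bit in the order modc _clr, modc _set, rcl, that bits are processed MSB first (rcl shifts left), and that a 1 in the skip bitmap drops that slot. The names `nibble_mask`, `run_nibble` and `rebuild` are made up.

```python
SLOTS_PER_BIT = 3   # assumed layout per bit: modc _clr, modc _set, rcl

def nibble_mask(nibble):
    """Skip bitmap for one nibble, most significant bit first."""
    mask = 0
    for k in range(4):
        bit = (nibble >> (3 - k)) & 1
        # bit = 1: skip the "_clr" slot; bit = 0: skip the "_set" slot
        skipped_slot = 0 if bit else 1
        mask |= 1 << (k * SLOTS_PER_BIT + skipped_slot)
    return mask

TABLE = [nibble_mask(n) for n in range(16)]   # 16 longs, one per nibble

def run_nibble(acc, mask):
    """Simulate the 12 unrolled instruction slots under the skip bitmap."""
    carry = 0
    for k in range(4):
        base = k * SLOTS_PER_BIT
        if not (mask >> base) & 1:         # modc _clr executes
            carry = 0
        if not (mask >> (base + 1)) & 1:   # modc _set executes
            carry = 1
        acc = ((acc << 1) | carry) & 0xFF  # rcl always executes
    return acc

def rebuild(byte):
    acc = run_nibble(0, TABLE[byte >> 4])      # high nibble first
    return run_nibble(acc, TABLE[byte & 0xF])  # then low nibble

assert all(rebuild(b) == b for b in range(256))
print("all 256 byte values reassemble correctly")
```

Splitting the byte into two nibbles is what keeps the table at 16 entries; a single lookup per byte would need 256 entries and a 24-slot bitmap.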
There is a 2 clock penalty only if the first instruction is skipped (because it's already in the pipeline), and every run of 7-8 skipped instructions in a row causes another 2 clock penalty. So in my example, it takes 2 clocks only when the first instruction following the skipf is skipped. Because every third instruction is never skipped, a run of 7-8 successive skips never happens, so every other skipped instruction costs 0 clocks. Neat, isn't it?
$32=50 clocks for this (which is what I calculated)
This is just an unrolled loop to send 8 bits. It’s repeated 512 times in an outer rep and rfbyte loop, if that’s what you're asking.
If you're referring to the execf and xbyte, that’s more complicated and I’m not using it here, at least for now anyway. I doubt there would be any saving; probably it would be worse.
It reduces the number of clocks to send a byte (it's send only, i.e. writing a sector) from 8*4*2=64 down to 8*2*2=32, but adds (6-7)*2=12-14 clocks of overhead, so we go from 64 clocks to 44-46 clocks per byte. That's a saving of 28%-31%.
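The per-byte figures can be checked with a short Python sketch. It assumes 2 clocks per executed instruction and that the 6-7 extra instructions are paid once per byte, as described above; `clocks_per_byte` is a made-up helper name.

```python
CLOCKS_PER_INSTR = 2
BITS_PER_BYTE = 8

def clocks_per_byte(instr_per_bit, overhead_instr=0):
    """Clocks to send one byte: unrolled bit code plus per-byte overhead."""
    return (BITS_PER_BYTE * instr_per_bit + overhead_instr) * CLOCKS_PER_INSTR

before = clocks_per_byte(4)                      # 4 instructions per bit
for overhead in (6, 7):                          # skipf setup instructions
    after = clocks_per_byte(2, overhead)         # 2 executed instructions per bit
    saving = 100 * (before - after) / before
    print(f"overhead {overhead}: {before} -> {after} clocks, saving {saving:.1f}%")
```

The saving is sensitive to the overhead: each extra setup instruction per byte costs about 3 percentage points.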
BUT this saving is for 512 bytes in the sector, plus the commands (6 bytes). So it adds up quickly. (I already cheat for sending the CRC16 bytes as all $FF so get this as 2 instructions per bit)
At 200MHz that's 512+6=518 bytes at 12-14 clocks each = 31us-36us, or to put it in real terms, reducing 165us down to 129us-134us.
Worth it??? Don't know, and I have a bug to fix because it's not working at present
Unfortunately I cannot do the same trick for reads and of course they are the most prevalent too.