CRCs can be a PITA, but thankfully the P2 can do a good job of them once you figure it all out, set up the poly, and feed the bits or nibbles in the right order.
I'm rather pleased with my efficient command tx code. It easily computes the CRC in the middle of shifting out the command bits, so effectively zero overhead.
_wrpin( PIN_CMD, P_SYNC_TX | P_OE | (PIN_CLK - PIN_CMD & 7)<<24 );    // setup CMD output smartpin mode
_wxpin( PIN_CMD, 31 );    // 32-bit shifter, continuous mode
_wypin( PIN_CMD, _rev( (0x40 | cmd)<<24 | arg>>8 ) );    // first 32 bits into tx shifter, continuous mode
_wypin( PIN_CLK, 6 * 8 );    // begin SD clocks, tx smartpin won't see clocks for about 8 sysclock ticks
_pinl( PIN_CMD );    // start tx shifter, in continuous mode
__asm {    // SD spec 4.5
        mov     pa, cmd
        or      pa, #0x40
        shl     pa, #24
        setq    pa
        mov     crc, #0
        crcnib  crc, #0x48    // CCITT polynomial is 1 + x^3 + x^7 (0x09 reversed for CRCNIB)
        crcnib  crc, #0x48
        setq    arg
        crcnib  crc, #0x48
        crcnib  crc, #0x48
        crcnib  crc, #0x48
        crcnib  crc, #0x48
        crcnib  crc, #0x48
        crcnib  crc, #0x48
        crcnib  crc, #0x48
        crcnib  crc, #0x48
        rev     crc
        shr     crc, #24
        or      crc, #1    // stop-bit at end of command packet
}
_wypin( PIN_CMD, _rev( arg<<24 | crc<<16 ) );    // final 16 bits into tx buffer
while( !_pinr( PIN_CLK ) );    // wait for tx completion
_pinf( PIN_CMD );
_wrpin( PIN_CMD, 0 );    // release CMD pin, ready for response
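For checking the hardware result off-chip, the same CRC7 can be computed in plain C one bit at a time. This is only a reference sketch and not the P2 code above: `sd_crc7` and `sd_cmd_crc_byte` are names made up for illustration, and the bit-serial loop stands in for CRCNIB.

```c
#include <stdint.h>
#include <stddef.h>

// Bit-serial CRC7: polynomial x^7 + x^3 + 1 (0x09), MSB first, initial value 0.
static uint8_t sd_crc7(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t n = 0; n < len; n++) {
        uint8_t d = data[n];
        for (int b = 0; b < 8; b++) {
            crc = (uint8_t)(crc << 1);
            if ((d ^ crc) & 0x80)       // incoming bit differs from bit shifted out
                crc ^= 0x09;
            d = (uint8_t)(d << 1);
        }
    }
    return crc & 0x7f;
}

// Last byte of a 48-bit SD command packet: CRC7 in bits 7..1, stop bit in bit 0.
// (Hypothetical helper, mirrors the "rev / shr / or #1" step of the inline asm.)
static uint8_t sd_cmd_crc_byte(uint8_t cmd, uint32_t arg)
{
    uint8_t pkt[5] = {
        (uint8_t)(0x40 | cmd),
        (uint8_t)(arg >> 24), (uint8_t)(arg >> 16),
        (uint8_t)(arg >> 8),  (uint8_t)arg
    };
    return (uint8_t)((sd_crc7(pkt, 5) << 1) | 1);
}
```

With a zero argument this gives 0x95 for CMD0, 0x55 for CMD17 and 0x65 for CMD55, which matches the command bytes printed in the logs in this thread.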
Eventually, I want to make the command-response transition gapless clocking as well. But that will depend on whether it's possible to align the response start-bit at speed.
At the moment, I test one bit at a time. Re-arming the rx shifter for each inter-command-response clock. It might actually be faster to bit-bash the response entirely.
i = 64;    // max of 64 clocks for response starting timeout, SD spec 4.12.4
do {    // look for start-bit, one clock at a time
    _pinf( PIN_CMD );    // reset rx shifter
    _wypin( PIN_CLK, 1 );    // one SD clock
    _pinl( PIN_CMD );
    _waitx( CLK_DIV * 2 + 8 );    // clock smartpin cycle allowance + I/O latency compensation
} while( !_rdpin( PIN_LED ) && --i );
I was about to suggest using async RX mode, but the response is too long.
The real problem I think is that CMD and DAT activity can overlap. So you need to do this sort of waiting for the start bit on both CMD and DAT0 simultaneously. I think, anyways.
Interesting discovery I just made while looking at the SD spec in Okular (trying to find the TOC before remembering that it doesn't have one...): the watermark is on a separate layer that you can just turn off. Amazing. Does this work on your full version, too?
I was concerned about this overlap too although I never saw any overlap in my testing, but in theory based on that diagram I guess it could happen. If you have to cancel a read multiple operation where it might overlap at the final block, I expected that you can always ignore the final data block on the DAT lines while you are doing the CMD & RSP stuff (so just cancel after you have read enough blocks and ignore any extra block which might be arriving at that time). For writes you are in control of the DAT timing yourself.
Huh, no, the option is not there with the pillaged PDF of full spec. The "layers" pane and widget itself disappears.
Maybe that's a side effect of running it through some online PDF unlock tool (which isn't needed in Okular because it can bypass PDF DRM itself - quality software)
Looking at how Simon and Nicolas did gapless clocking with the Ethernet sync'ing I see they're adjusting the rx shifter width on the fly in the middle of valid data reception! I'm suitably impressed that even works without data loss. Kudos to Simon for thinking to try it - https://forums.parallax.com/discussion/comment/1537291/#Comment_1537291
The block start delay on a single block read with my SanDisk is a surprise. It's something like 460 clocks after the command (CMD17), which is around 400 after the response ended.
CMD17 51 00000000 55 > 1999 ff fc 44 00 00 24 01 9f ff ff ff ff ff ff ff ff offset = 14 > 11 00 00 09 00 67 ff ff ff ff ff ff ff ff ff ff CRC = 67 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff f0 33 c0 fa 8e d8 8e d0 bc 00 7c 89 e6 06 57 8e c0 fb fc bf 00 06 b9 00 01 f3 a5 ea 1f 06 00 00 52 52 b4 41 bb aa 55 31 c9 30 f6 f9 cd 13 72 13 81 fb 55 aa 75 0d d1 e9 73 09 66 c7 06 8d 06 b4 42 eb 15 5a b4 08 cd 13 83 e1 3f 51 0f b6 c6 40 f7 e1 52 50 66 31 c0 66 99 e8 66 00 e8 35 01 4d 69 73 73 69 6e 67 20 6f 70 65 72 61 74 69 6e 67 20 73 79 73 74 65 6d 2e 0d 0a 66 60 66 31 d2 bb 00 7c 66 52 66 50 06 53 6a 01 6a 10 89 e6 66 f7 36 f4 7b c0 e4 06 88 e1 88 c5 92 f6 36 f8 7b 88 c6 08 e1 41 b8 01 02 8a 16 fa 7b cd 13 8d 64 10 66 61 c3 e8 c4 ff be be 7d bf be 07 b9 20 00 f3 a5 c3 66 60 89 e5 bb be 07 b9 04 00 31 c0 53 51 f6 07 80 74 03 40 89 de 83 c3 10 e2 f3 48 74 5b 79 39 59 5b 8a 47 04 3c 0f 74 06 24 7f 3c 05 75 22 66 8b 47 08 66 8b 56 14 66 01 d0 66 21 d2 75 03 66 89 c2 e8 ac ff 72 03 e8 b6 ff 66 8b 46 1c e8 a0 ff 83 c3 10 e2 cc 66 61 c3 e8 76 00 4d 75 6c 74 69 70 6c 65 20 61 63 74 69 76 65 20 70 61 72 74 69 74 69 6f 6e 73 2e 0d 0a 66 8b 44 08 66 03 46 1c 66 89 44 08 e8 30 ff 72 27 66 81 3e 00 7c 58 46 53 42 75 09 66 83 c0 04 e8 1c ff 72 13 81 3e fe 7d 55 aa 0f 85 f2 fe bc fa 7b 5a 5f 07 fa ff e4 e8 1e 00 4f 70 65 72 61 74 69 6e 67 20 73 79 73 74 65 6d 
20 6c 6f 61 64 20 65 72 72 6f 72 2e 0d 0a 5e ac b4 0e 8a 3e 62 04 b3 07 cd 10 3c 0a 75 f1 cd 18 f4 eb fd 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d1 ee 0f 6a 00 00 80 04 01 04 0b fe c2 ff 00 08 00 00 00 18 b7 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa 7d 49 f8 37 e1 3e 1e 4e ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Yeah in reality the cards would seem to need some time to process their requests. Not sure if there is a lower bound that you can always rely on though. Be good if there was. Maybe over time this delay would shrink as the cards become faster to respond, so they didn't want to nominate some minimum value for it.
Perhaps if you monitored for a start bit in the DAT lines while waiting for the CMD response you could in theory temporarily slow the clock down in the case there was some overlap, to be able to do things to keep up with the incoming data before the response fully arrived? I found you can slow/pause the clock in the middle of these transfers and it seems to work.
Also once you see the start bit you could set up the streamer for the expected DAT transfer size for reads and enable a fast clock output. For writes you are in full control and can wait for the CMD response to complete before you begin sending any data. No overlap needed there.
Hmm, been mulling those ideas too. The above dump is from both a smartpin for command response and streamer for DAT pins sampling. So they are both collecting data concurrently with clock going at steady speed. Hence the post processing to find the response bits and realign them for running the CRC.
I haven't done the same post processing of the data block as yet and am not really contemplating it either. The code for post processing the response bits is a proper horror!
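// find the start bit: first 0 bit, scanning MSB-first through the first 8 bytes
for( i = 0; i < 8*8; i++ ) {
    if( !(resp[i>>3] & (1<<(7-(i&7)))) )
        break;
}
offset = i;
// move 64 bits down by 'offset' bit positions, one bit at a time
for( i = 0; i < 8*8; i++ ) {
    bit = resp[(i+offset)>>3]>>(7-((i+offset)&7)) & 1;    // mask down to a single bit
    resp[i>>3] = resp[i>>3] & ~(1<<(7-(i&7))) | bit<<(7-(i&7));    // clear only the target bit, then set it
}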
Nah, post processing is not ideal. Best to have the data already aligned when read in. That is why the clock should be stopped after the start bit to set things up.
Now that I've verified good data block retrieval and have a feel for nominal behaviour, I'll have a shot at dynamically adjusting the shifter width on the fly similar to the Ethernet driver code. If that is successful then I'll go ahead with full gapless clocking.
EDIT: Grr, already having second thoughts. It'll probably still need realigning afterwards. The Ethernet driver had a fixed length preamble to work with that isn't present here.
EDIT2: Yeah, without any preamble to throw away, the first bits will always need aligned anyway. The code to do that may as well be applied to the whole response in real time - Which I've now done.
Loop time is 26 sysclocks (plus extra for the hub write, so 27..34 ticks) to process 16 response bits. Actually, it can do 32 bits in the same time but that drags out more waiting for the serial shifter to fill up. Oops, it's now hubexec, so add another 12 or so ticks for the FIFO refill. It's good for sysclock/4.
Ouch! Forgot about the over-reached clock smartpin inputs. That's not an issue when using the streamer. Almost enough incentive to do post-processing block copying right there.
The remapping of smartpins needs careful documenting to keep it straight. Not to mention the extra instructions ... and CMD has to be repurposed in mid flight ... and it doesn't work for tx either, so streamer only for DAT tx pins. All very messy.
EDIT: And block copying approach can happen concurrently with next streamer action too. So not necessarily slowing things down.
Hmm, the real-time align and word merging code for the rx block data is too big to execute within sysclock/4. It's four times everything, so that's basically doubling the time required over what 32-bit shifter word size provides. Sysclock/8 should be doable with smartpins. Here's just the word merger and writing to hubRAM:
        rolbyte pb, pr3, #3
        rolbyte pb, pr2, #3
        rolbyte pb, pr1, #3
        rolbyte pb, pr0, #3
        mergeb  pb
        movbyts pb, #0b00_01_10_11
        wrlong  pb, ptrb++

        rolbyte pb, pr3, #2
        rolbyte pb, pr2, #2
        rolbyte pb, pr1, #2
        rolbyte pb, pr0, #2
        mergeb  pb
        movbyts pb, #0b00_01_10_11
        wrlong  pb, ptrb++

        rolbyte pb, pr3, #1
        rolbyte pb, pr2, #1
        rolbyte pb, pr1, #1
        rolbyte pb, pr0, #1
        mergeb  pb
        movbyts pb, #0b00_01_10_11
        wrlong  pb, ptrb++

        rolbyte pb, pr3, #0
        rolbyte pb, pr2, #0
        rolbyte pb, pr1, #0
        rolbyte pb, pr0, #0
        mergeb  pb
        movbyts pb, #0b00_01_10_11
        wrlong  pb, ptrb++
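For anyone following along without a P2 to hand, here is a plain-C model of what one of those four merge groups is meant to achieve, as I read it: take eight samples from each of the four DAT lanes and interleave them back into four data bytes. `merge_lanes` is a made-up name, the loop is a slow stand-in for the ROLBYTE/MERGEB/MOVBYTS sequence rather than a description of those instructions, and it assumes DAT3 carries the top bit of each nibble with the high nibble sent first.

```c
#include <stdint.h>

// d0..d3: eight consecutive samples from DAT0..DAT3, first-received sample in bit 7.
// out[0..3]: the four data bytes in the order they arrived on the bus.
static void merge_lanes(uint8_t d0, uint8_t d1, uint8_t d2, uint8_t d3, uint8_t out[4])
{
    for (int n = 0; n < 8; n++) {               // n = nibble index, 0 is first received
        int bit = 7 - n;
        uint8_t nib = (uint8_t)((((d3 >> bit) & 1) << 3) |
                                (((d2 >> bit) & 1) << 2) |
                                (((d1 >> bit) & 1) << 1) |
                                 ((d0 >> bit) & 1));
        if (n & 1)
            out[n >> 1] |= nib;                 // odd nibble: low half of the byte
        else
            out[n >> 1] = (uint8_t)(nib << 4);  // even nibble: high half, sent first
    }
}
```

For example, the byte sequence 12 34 56 78 appears on the lanes as DAT0 = 0xAA, DAT1 = 0x66, DAT2 = 0x1E, DAT3 = 0x01, and the function turns those four lane bytes back into the original sequence.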
haha, sexy!
CMD17 51 00000000 55 > 11 00 00 09 00 67 CRC = 67 datablock: shl=29 shr=3 33 c0 fa 8e d8 8e d0 bc 00 7c 89 e6 06 57 8e c0 fb fc bf 00 06 b9 00 01 f3 a5 ea 1f 06 00 00 52 52 b4 41 bb aa 55 31 c9 30 f6 f9 cd 13 72 13 81 fb 55 aa 75 0d d1 e9 73 09 66 c7 06 8d 06 b4 42 eb 15 5a b4 08 cd 13 83 e1 3f 51 0f b6 c6 40 f7 e1 52 50 66 31 c0 66 99 e8 66 00 e8 35 01 4d 69 73 73 69 6e 67 20 6f 70 65 72 61 74 69 6e 67 20 73 79 73 74 65 6d 2e 0d 0a 66 60 66 31 d2 bb 00 7c 66 52 66 50 06 53 6a 01 6a 10 89 e6 66 f7 36 f4 7b c0 e4 06 88 e1 88 c5 92 f6 36 f8 7b 88 c6 08 e1 41 b8 01 02 8a 16 fa 7b cd 13 8d 64 10 66 61 c3 e8 c4 ff be be 7d bf be 07 b9 20 00 f3 a5 c3 66 60 89 e5 bb be 07 b9 04 00 31 c0 53 51 f6 07 80 74 03 40 89 de 83 c3 10 e2 f3 48 74 5b 79 39 59 5b 8a 47 04 3c 0f 74 06 24 7f 3c 05 75 22 66 8b 47 08 66 8b 56 14 66 01 d0 66 21 d2 75 03 66 89 c2 e8 ac ff 72 03 e8 b6 ff 66 8b 46 1c e8 a0 ff 83 c3 10 e2 cc 66 61 c3 e8 76 00 4d 75 6c 74 69 70 6c 65 20 61 63 74 69 76 65 20 70 61 72 74 69 74 69 6f 6e 73 2e 0d 0a 66 8b 44 08 66 03 46 1c 66 89 44 08 e8 30 ff 72 27 66 81 3e 00 7c 58 46 53 42 75 09 66 83 c0 04 e8 1c ff 72 13 81 3e fe 7d 55 aa 0f 85 f2 fe bc fa 7b 5a 5f 07 fa ff e4 e8 1e 00 4f 70 65 72 61 74 69 6e 67 20 73 79 73 74 65 6d 20 6c 6f 61 64 20 65 72 72 6f 72 2e 0d 0a 5e ac b4 0e 8a 3e 62 04 b3 07 cd 10 3c 0a 75 f1 cd 18 f4 eb fd 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d1 ee 0f 6a 00 00 80 04 01 04 0b fe c2 ff 00 08 00 00 00 18 b7 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa
Here's the post-response assembly:
// DATA BLOCK RECEIVE
        dirl    #PIN_CLK
        dirl    #PIN_CMD
        wrpin   ##P_SYNC_RX | (PIN_DAT2 - PIN_CMD & 7)<<28 | (PIN_CLK - PIN_CMD & 7)<<24, #PIN_CMD    // DAT2 pin
        wxpin   #31, #PIN_CMD    // DAT2 pin
        dirh    #PIN_DAT2 | 4<<6    // enable shifters for DAT0..DAT3 pins, and CLK pin
        wypin   ##2000, #PIN_CLK    // kick it in the guts
.loop3
        waitse1    // trigger on PIN_CMD (DAT2 pin)
        rdpin   pr0, #PIN_DAT2    // DAT0 pin
        rdpin   pr1, #PIN_DAT3    // DAT1 pin
        rdpin   pr2, #PIN_CMD    // DAT2 pin
        rdpin   pr3, #PIN_LED    // DAT3 pin
        rev     pr2
        not     shiftr, pr2    wz
if_z    jmp     #.loop3
        encod   shiftr
        mov     shiftl, shiftr
        subr    shiftl, #32
        rev     pr0
        rev     pr1
        rev     pr3
        mov     i, #32
        mov     ptrb, buf
.loop4
        waitse1
        rdpin   pr4, #PIN_DAT2    // DAT0 pin
        rdpin   pr5, #PIN_DAT3    // DAT1 pin
        rdpin   pr6, #PIN_CMD    // DAT2 pin
        rdpin   pr7, #PIN_LED    // DAT3 pin
        rev     pr4
        rev     pr5
        rev     pr6
        rev     pr7

        mov     pb, pr4
        shr     pb, shiftr
        shl     pr0, shiftl
        or      pr0, pb
        mov     pb, pr5
        shr     pb, shiftr
        shl     pr1, shiftl
        or      pr1, pb
        mov     pb, pr6
        shr     pb, shiftr
        shl     pr2, shiftl
        or      pr2, pb
        mov     pb, pr7
        shr     pb, shiftr
        shl     pr3, shiftl
        or      pr3, pb

        rolbyte pb, pr3, #3
        rolbyte pb, pr2, #3
        rolbyte pb, pr1, #3
        rolbyte pb, pr0, #3
        mergeb  pb
        movbyts pb, #0b00_01_10_11
        wrlong  pb, ptrb++

        rolbyte pb, pr3, #2
        rolbyte pb, pr2, #2
        rolbyte pb, pr1, #2
        rolbyte pb, pr0, #2
        mergeb  pb
        movbyts pb, #0b00_01_10_11
        wrlong  pb, ptrb++

        rolbyte pb, pr3, #1
        rolbyte pb, pr2, #1
        rolbyte pb, pr1, #1
        rolbyte pb, pr0, #1
        mergeb  pb
        movbyts pb, #0b00_01_10_11
        wrlong  pb, ptrb++

        rolbyte pb, pr3, #0
        rolbyte pb, pr2, #0
        rolbyte pb, pr1, #0
        rolbyte pb, pr0, #0
        mergeb  pb
        movbyts pb, #0b00_01_10_11
        wrlong  pb, ptrb++

        mov     pr0, pr4
        mov     pr1, pr5
        mov     pr2, pr6
        mov     pr3, pr7
        djnz    i, #.loop4
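The shift part of that loop can be modelled in a few lines of portable C, which may make the shiftl/shiftr bookkeeping easier to follow. This is a sketch under assumptions, not a translation of the assembly: `find_start_bit` and `realign` are hypothetical names, words are taken to be MSB-first (already bit-reversed), and the explicit zero case exists only because a 32-bit shift by 32 is undefined in C.

```c
#include <stdint.h>

// Bit index (31..0) of the start bit, i.e. the highest 0 bit in an
// MSB-first word, or -1 if the word is all ones (no start bit yet).
static int find_start_bit(uint32_t word)
{
    for (int p = 31; p >= 0; p--)
        if (!((word >> p) & 1))
            return p;
    return -1;
}

// Join the p data bits that follow the start bit in 'prev' with the
// leading bits of 'cur' to form one aligned 32-bit word.
static uint32_t realign(uint32_t prev, uint32_t cur, int p)
{
    if (p == 0)
        return cur;     // start bit was the last bit of prev; avoid a shift by 32
    return (prev << (32 - p)) | (cur >> p);
}
```

A start bit at index 3 gives a left shift of 29 and a right shift of 3, the same pair as the "shl=29 shr=3" printed in the datablock dump above.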
Hmm, darn, not looking hopeful even for sysclock/8 without changing to cogexec.
Why do all this shifting stuff? Just set up the transfer after the start bit and it will be aligned. Sysclk/8 is only 40M nibbles per second or 20MB/s at 320MHz. I was able to get it clocking much faster than that (sysclk/2). You just give up the gapless CMD idea and it'll be fine. For performance you mainly want to optimize DAT reads, not so much the CMD+RESP timing stuff.
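The arithmetic behind those figures, written out as a tiny C helper. It is a back-of-envelope sketch only: `sd_4bit_bytes_per_sec` is a made-up name, and it counts raw bus rate on a 4-bit bus, ignoring start bits, CRC and any gaps between blocks.

```c
#include <stdint.h>

// Raw bytes per second on a 4-bit SD bus: one nibble per SD clock,
// with SD clock = sysclock / divider, and two nibbles per byte.
static uint32_t sd_4bit_bytes_per_sec(uint32_t sysclock_hz, uint32_t divider)
{
    return sysclock_hz / divider / 2;
}
```

At a 320MHz sysclock this gives 20MB/s for sysclk/8 and 80MB/s for sysclk/2.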
Because the clock is going at speed. I've long ditched the bit-bashing start-bit search.
There's definitely gains to be made by not doing a long pulse by pulse search to find the start of the block. There's 400 odd clocks to churn through checking on each block.
BTW: I was hoping for better than sysclock/4 originally. Therefore, agreed, the above approach is rather dead in the water now. The word merge kills it before even getting to the shifts.
Yeah I reckon the 512-byte sector transfer time savings at sysclk/2 will swamp the gains from not having to bit bang the clock to search for the start bit and then transferring at sysclk/8. This is even more true once multiple-block reads are being done. Although admittedly if you want to strictly meet the 50MB/s max speed for normal SD cards, then the gain is lessened slightly.
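To put rough numbers on that trade-off, here is a small cost model in C. Everything in it is an assumption for illustration: `read_cost_sysclocks` is a hypothetical helper, the 400-clock wait is the figure measured earlier in this thread for one card, 1024 is the nibble-clock count for a 512-byte block without CRC or end bit, and the per-clock cost of a bit-banged poll is a guess you would have to measure.

```c
#include <stdint.h>

// Total sysclock ticks for one block read:
//   wait_clocks SD clocks spent looking for the start bit, each costing
//   wait_cost sysclocks, then data_clocks SD clocks at sysclock / data_div.
static uint32_t read_cost_sysclocks(uint32_t wait_clocks, uint32_t wait_cost,
                                    uint32_t data_clocks, uint32_t data_div)
{
    return wait_clocks * wait_cost + data_clocks * data_div;
}
```

Gapless clocking at sysclk/8 throughout comes to (400 + 1024) * 8 = 11392 ticks. A bit-banged search at an assumed 20 ticks per polled clock followed by data at sysclk/2 comes to 400 * 20 + 1024 * 2 = 10048 ticks. Under these assumptions the break-even is a bit over 23 ticks per polled clock, so the comparison depends heavily on how fast the polling loop really is.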
For polling the start bit with a bit bang clock before you start the streamer and speed up the clock I'd expect you should be able to do it pretty fast in PASM (guessing a couple of microseconds or so per request).
Next approach is streamer + block copying.
The streamer hardware takes care of ordering so only the shifts need sorted ... and since the streamer sampling has to be tuned then it can also account for nibble alignment in the one parameter. Meaning it will be possible to tune the block copying so only an edge check has to be done to verify which byte to start at with no actual shifts required.
All this coding is partly the science, the learning, and partly an "I've been there" experience. To see the required code to get it working each way. More than some guessing.
And on that note, time to post what I've got:
Okay, change of tack, this is too good to leave alone ... My main aversion to the bit-bashed search of the start bit was the slow rate of pulse then examine - Waiting for all the latencies to clear, one cycle at a time. Well, I remembered the discussion about whether the PWM_SMPS smartpin mode could be used for a fast response trigger of some sort .... I thought, maybe, I could turn it into a clock gen that can abruptly shut off when an input goes low (The start bit arriving) .... ta-da!
_pinstart( PIN_CLK, P_PWM_SMPS | P_OE | P_INVERT_OUTPUT |    // SD clock gen smartpin
        (PIN_CMD - PIN_CLK & 7)<<28 | P_INVERT_A |    // smartA input select
        (PIN_CMD - PIN_CLK & 7)<<24 | P_INVERT_B,    // smartB input select
        1 | 8<<16, 4 );
while( _pinr( PIN_CMD ) );    // wait for CMD low - found the start-bit
_pinstart( PIN_CLK, P_PULSE | P_OE | P_INVERT_OUTPUT,    // reconfig back to regular clock gen
        CLK_DIV | (CLK_DIV / 2)<<16, 0 );
Well that's a funny way to do it
But does the latency period really care about the clock speed while waiting for the start bit? I'd think this is limited by the internal speed of the SD controller, but I never verified that assumption.
Here it is working with the command response. I'll apply it to the block read code tomorrow.
This looks handy. If it works I can see it could potentially speed things up a fair bit and should let you clock faster while waiting for a start bit. Should then be no excuse to not work with fully aligned data after that!
EDIT: looks like you have it working in latest post. Cool. I'll have to give it a try here at some point soon.
Just ran your test code with my board @evanh. This was the output I received from a 16GB SanDisk card:
Looks okay so far but the first data sector is still offset. I guess that is next for you to sort out.
loadp2 -b 230400 -t sdmode_framing.binary ( Entering terminal mode. Press Ctrl-] or Ctrl-Z to exit. ) clkfreq = 4000000 clkmode = 0x10005eb Card detected ... 40 ms power cycle of SD card ... DONE! CMD0 40 00000000 95 > TIMED-OUT! CMD8 48 0000015a 9b > 08 00 00 01 5a 0f CRC = 0f R7 0000015a - v2.0+ SD Card CMD55 77 00000000 65 > 37 00 00 01 20 83 CRC = 83 CMD41 69 40100000 cd > 3f 00 ff 80 00 ff CRC = c7 CMD55 77 00000000 65 > 37 00 00 01 20 83 CRC = 83 CMD41 69 40100000 cd > 3f 00 ff 80 00 ff CRC = c7 CMD55 77 00000000 65 > 37 00 00 01 20 83 CRC = 83 CMD41 69 40100000 cd > 3f 00 ff 80 00 ff CRC = c7 CMD55 77 00000000 65 > 37 00 00 01 20 83 CRC = 83 CMD41 69 40100000 cd > 3f 00 ff 80 00 ff CRC = c7 CMD55 77 00000000 65 > 37 00 00 01 20 83 CRC = 83 CMD41 69 40100000 cd > 3f 00 ff 80 00 ff CRC = c7 CMD55 77 00000000 65 > 37 00 00 01 20 83 CRC = 83 CMD41 69 40100000 cd > 3f c0 ff 80 00 ff CRC = 63 OCR c0ff8000 CMD2 42 00000000 4d > 3f 03 53 44 53 4c 31 36 47 00 20 b9 34 bb 01 16 f7 CRC = f7 CMD3 43 00000000 21 > 03 aa aa 05 20 d1 CRC = d1 RCA aaaa0000 CMD7 47 aaaa0000 cd > 07 00 00 07 00 75 CRC = 75 CMD55 77 aaaa0000 2b > 37 00 00 09 20 33 CRC = 33 CMD6 46 00000002 cb > 06 00 00 09 20 b9 CRC = b9 Card init complete Test RDPIN 9d049000 CMD17 51 00000000 55 > 11 00 00 09 00 67 CRC = 67 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff f0 fa b8 00 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0 fb be 00 7c bf 00 06 b9 00 02 f3 a4 ea 21 06 00 00 be be 07 38 04 75 0b 83 c6 10 81 fe fe 07 75 f3 eb 16 b4 02 b0 01 bb 00 7c b2 80 8a 74 01 8b 4c 02 cd 13 ea 00 7c 00 00 eb fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 CRC = 01 All finished :)