CRCs can be a PITA, but thankfully the P2 can do a good job of them once you figure it all out, set up the poly, and feed the bits or nibbles in the right order.
I'm rather pleased with my efficient command tx code. It easily computes the CRC in the middle of shifting out the command bits, so effectively zero overhead.
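For reference, here is a minimal host-side sketch of the CRC7 carried by every SD command (generator x^7 + x^3 + 1). It is plain bit-at-a-time C, not the P2 code described above, and the names `sd_crc7` and `sd_cmd_crc_byte` are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC7 over the first five bytes of an SD command frame,
 * MSB first, initial value 0, polynomial x^7 + x^3 + 1 (0x09). */
static uint8_t sd_crc7(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t byte = data[i];
        for (int b = 0; b < 8; b++) {
            crc <<= 1;                 /* old CRC bit 6 moves up to bit 7 */
            if ((byte ^ crc) & 0x80)   /* feedback = data bit XOR CRC msb */
                crc ^= 0x09;
            byte <<= 1;
        }
    }
    return crc & 0x7F;
}

/* Last byte of the command frame: the 7 CRC bits followed by the end bit (1). */
static uint8_t sd_cmd_crc_byte(const uint8_t *data, size_t len)
{
    return (uint8_t)((sd_crc7(data, len) << 1) | 1);
}
```

As a sanity check, CMD0 with a zero argument (bytes 40 00 00 00 00) gives a CRC7 of 0x4A, so the final frame byte is 0x95.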
Eventually, I want to make the command-response transition gapless clocking as well. But that will depend on whether it's possible to align the response start-bit at speed.
At the moment, I test one bit at a time, re-arming the rx shifter for each clock between command and response. It might actually be faster to bit-bash the response entirely.
i = 64;    // max of 64 clocks for response starting timeout, SD spec 4.12.4
do {    // look for start-bit, one clock at a time
    _pinf( PIN_CMD );    // reset rx shifter
    _wypin( PIN_CLK, 1 );    // one SD clock
    _pinl( PIN_CMD );
    _waitx( CLK_DIV * 2 + 8 );    // clock smartpin cycle allowance + I/O latency compensation
} while( !_rdpin( PIN_LED ) && --i );
I was about to suggest using async RX mode, but the response is too long.
The real problem, I think, is that CMD and DAT activity can overlap. So you need to do this sort of waiting for the start bit on both CMD and DAT0 simultaneously. I think, anyways.
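To make the "watch both lines" idea concrete, here is a small host-side model in C. It is only a sketch: the `pulse_and_sample_fn` hook, the `LINE_*` masks and the `fake_card` stand-in are all hypothetical, and on real hardware the hook would issue one SD clock and read the pins.

```c
#include <stdint.h>

#define LINE_CMD  0x1u   /* bit 0: sampled level of CMD  */
#define LINE_DAT0 0x2u   /* bit 1: sampled level of DAT0 */

/* Hypothetical hook: issue one SD clock, then return the sampled line levels. */
typedef unsigned (*pulse_and_sample_fn)(void *ctx);

/* Clock until CMD and/or DAT0 goes low (a start bit) or max_clocks expires.
 * Returns a mask of the lines seen low, or 0 on timeout. */
static unsigned wait_start_bits(pulse_and_sample_fn pulse, void *ctx,
                                unsigned max_clocks, unsigned *clocks_used)
{
    for (unsigned n = 1; n <= max_clocks; n++) {
        unsigned low = ~pulse(ctx) & (LINE_CMD | LINE_DAT0);
        if (low) {
            *clocks_used = n;
            return low;
        }
    }
    *clocks_used = max_clocks;
    return 0;
}

/* Stand-in for a card, for host testing only: each line idles high and
 * drops low from a chosen clock number onwards. */
struct fake_card {
    unsigned clock;
    unsigned cmd_low_at;
    unsigned dat0_low_at;
};

static unsigned fake_pulse(void *ctx)
{
    struct fake_card *c = ctx;
    unsigned levels = LINE_CMD | LINE_DAT0;
    c->clock++;
    if (c->clock >= c->cmd_low_at)  levels &= ~LINE_CMD;
    if (c->clock >= c->dat0_low_at) levels &= ~LINE_DAT0;
    return levels;
}
```

The caller can then branch on the returned mask: CMD only, DAT0 only, or both on the same clock.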
Interesting discovery I just made while looking at the SD spec in Okular (trying to find the TOC before remembering that it doesn't have one...): the watermark is on a separate layer that you can just turn off. Amazing. Does this work on your full version, too?
I was concerned about this overlap too although I never saw any overlap in my testing, but in theory based on that diagram I guess it could happen. If you have to cancel a read multiple operation where it might overlap at the final block, I expected that you can always ignore the final data block on the DAT lines while you are doing the CMD & RSP stuff (so just cancel after you have read enough blocks and ignore any extra block which might be arriving at that time). For writes you are in control of the DAT timing yourself.
Maybe that's a side effect of running it through some online PDF unlock tool (which isn't needed in Okular because it can bypass PDF DRM itself - quality software)
Looking at how Simon and Nicolas did gapless clocking with the Ethernet sync'ing I see they're adjusting the rx shifter width on the fly in the middle of valid data reception! I'm suitably impressed that even works without data loss. Kudos to Simon for thinking to try it - https://forums.parallax.com/discussion/comment/1537291/#Comment_1537291
@rogloh said:
I was concerned about this overlap too although I never saw any overlap in my testing, ...
The block start delay on a single block read with my SanDisk is a surprise. It's something like 460 clocks after the command (CMD17), which is around 400 after the response ended.
Yeah in reality the cards would seem to need some time to process their requests. Not sure if there is a lower bound that you can always rely on though. Be good if there was. Maybe over time this delay would shrink as the cards become faster to respond, so they didn't want to nominate some minimum value for it.
Perhaps if you monitored for a start bit in the DAT lines while waiting for the CMD response you could in theory temporarily slow the clock down in the case there was some overlap, to be able to do things to keep up with the incoming data before the response fully arrived? I found you can slow/pause the clock in the middle of these transfers and it seems to work.
Also, once you see the start bit you could set up the streamer for the expected DAT transfer size for reads and enable a fast clock output. For writes you are in full control and can wait for the CMD response to complete before you begin sending any data. No overlap needed there.
Hmm, been mulling those ideas too. The above dump is from both a smartpin for command response and streamer for DAT pins sampling. So they are both collecting data concurrently with clock going at steady speed. Hence the post processing to find the response bits and realign them for running the CRC.
I haven't done the same post processing of the data block as yet and am not really contemplating it either. The code for post processing the response bits is a proper horror!
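// find the start bit: first 0 bit, scanning MSB-first
for( i = 0; i < 8*8; i++ ) {
    if( !(resp[i>>3] & (1<<(7-(i&7)))) )
        break;
}
offset = i;
// shift the response left by 'offset' bits, one bit at a time
// (reads up to bit 63+offset, so resp[] must extend past the first 8 bytes)
for( i = 0; i < 8*8; i++ ) {
    bit = (resp[(i+offset)>>3] >> (7-((i+offset)&7))) & 1;    // isolate the source bit
    resp[i>>3] = (resp[i>>3] & ~(1<<(7-(i&7)))) | (bit<<(7-(i&7)));    // replace only the destination bit
}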
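For comparison, the same realignment can be done a byte at a time instead of a bit at a time. This is a host-side sketch only, not the code from the post above; `find_start_bit` and `realign_left` are made-up names, and filling the vacated tail with 1s assumes the line idles high.

```c
#include <stdint.h>
#include <stddef.h>

/* Index of the first 0 bit (bit 0 = MSB of buf[0]), or nbits if none found. */
static size_t find_start_bit(const uint8_t *buf, size_t nbits)
{
    for (size_t i = 0; i < nbits; i++)
        if (!(buf[i >> 3] & (0x80u >> (i & 7))))
            return i;
    return nbits;
}

/* Shift the whole buffer left by 'offset' bits, MSB first, in place.
 * Bytes past the end of the buffer are treated as 0xFF (assumed idle level). */
static void realign_left(uint8_t *buf, size_t len, size_t offset)
{
    size_t   bytes = offset >> 3;
    unsigned bits  = (unsigned)(offset & 7);

    for (size_t i = 0; i < len; i++) {
        size_t  s  = i + bytes;
        uint8_t hi = (s < len)     ? buf[s]     : 0xFF;
        uint8_t lo = (s + 1 < len) ? buf[s + 1] : 0xFF;
        /* when bits == 0 the 'lo' term shifts out entirely */
        buf[i] = (uint8_t)((hi << bits) | (lo >> (8 - bits)));
    }
}
```

Working in place is safe here because each output byte only reads from its own position or later.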
Nah, post processing is not ideal. Best to have the data already aligned when read in. That is why the clock should be stopped after the start bit to set things up.
Now that I've verified good data block retrieval and have a feel for nominal behaviour, I'll have a shot at dynamically adjusting the shifter width on the fly similar to the Ethernet driver code. If that is successful then I'll go ahead with full gapless clocking.
EDIT: Grr, already having second thoughts. It'll probably still need realigning afterwards. The Ethernet driver had a fixed length preamble to work with that isn't present here.
EDIT2: Yeah, without any preamble to throw away, the first bits will always need aligned anyway. The code to do that may as well be applied to the whole response in real time - Which I've now done.
Loop time is 26 sysclocks (plus extra for the hub write, so 27..34 ticks) to process 16 response bits. Actually, it can do 32 bits in the same time but that drags out more waiting for the serial shifter to fill up. Oops, it's now hubexec, so add another 12 or so ticks for the FIFO refill. It's good for sysclock/4.
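The budget arithmetic behind "good for sysclock/4" can be spelled out in a few lines of C. This is just a sketch using the figures quoted above (worst-case 34 ticks plus roughly 12 for the FIFO refill, 16 response bits per pass); the function names are made up, and the 12 is the approximate figure from the post, not a measured value.

```c
/* Sysclock ticks available to process one word on a 1-bit line:
 * one bit arrives per SD clock, and one SD clock is sysclk_div ticks. */
static unsigned budget_ticks(unsigned bits_per_word, unsigned sysclk_div)
{
    return bits_per_word * sysclk_div;
}

/* Non-zero when the worst-case loop time fits inside the budget. */
static int keeps_up(unsigned worst_loop_ticks, unsigned bits_per_word,
                    unsigned sysclk_div)
{
    return worst_loop_ticks <= budget_ticks(bits_per_word, sysclk_div);
}
```

With 34 + 12 = 46 ticks worst case, 16 bits at sysclock/4 gives a 64-tick budget, which fits, while sysclock/2 gives only 32 ticks, which does not.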
Ouch! Forgot about the over-reached clock smartpin inputs. That's not an issue when using the streamer. Almost enough incentive to do post-processing block copying right there.
The remapping of smartpins needs careful documenting to keep it straight. Not to mention the extra instructions ... and CMD has to be repurposed in mid flight ... and it doesn't work for tx either, so streamer only for DAT tx pins. All very messy.
EDIT: And block copying approach can happen concurrently with next streamer action too. So not necessarily slowing things down.
Hmm, the real-time align and word merging code for the rx block data is too big to execute within sysclock/4. It's four times everything, so that's basically doubling the time required over what 32-bit shifter word size provides. Sysclock/8 should be doable with smartpins. Here's just the word merger and writing to hubRAM:
Why do all this shifting stuff? Just set up the transfer after the start bit and it will be aligned. Sysclk/8 is only 40M nibbles per second or 20MB/s at 320MHz. I was able to get it clocking much faster than that (sysclk/2). You just give up the gapless CMD idea and it'll be fine. For performance you mainly want to optimize DAT reads, not so much the CMD+RESP timing stuff.
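The throughput figures above are easy to reproduce. Here is a small C sketch of the arithmetic; the function names are invented for illustration, and the block time counts only the 512 payload bytes, ignoring the start bit, CRC16 and end bit.

```c
#include <stdint.h>

/* Raw DAT throughput in bytes per second for a given sysclock,
 * clock divider and bus width in bits. */
static uint64_t dat_bytes_per_sec(uint64_t sysclk_hz, unsigned div,
                                  unsigned bus_bits)
{
    return sysclk_hz / div * bus_bits / 8;
}

/* Time in nanoseconds to clock in the payload of one block. */
static uint64_t block_payload_ns(uint64_t sysclk_hz, unsigned div,
                                 unsigned bus_bits, unsigned block_bytes)
{
    uint64_t sd_clocks = (uint64_t)block_bytes * 8 / bus_bits;
    return sd_clocks * div * 1000000000ull / sysclk_hz;
}
```

At 320MHz with a 4-bit bus this gives 20MB/s and 25.6us per 512-byte block at sysclk/8, against 80MB/s raw and 6.4us at sysclk/2.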
Because the clock is going at speed. I've long ditched the bit-bashing start-bit search.
There's definitely gains to be made by not doing a long pulse by pulse search to find the start of the block. There's 400 odd clocks to churn through checking on each block.
BTW: I was hoping for better than sysclock/4 originally. Therefore, agreed, the above approach is rather dead in the water now. The word merge kills it before even getting to the shifts.
Yeah, I reckon the time saved transferring a 512-byte sector at sysclk/2 will swamp the gain from avoiding the bit-banged start-bit search and then transferring at sysclk/8. This is even more true once multiple blocks are being read. Although admittedly if you want to strictly meet the 50MB/s max speed for normal SD cards, then the gain is lessened slightly.
For polling the start bit with a bit bang clock before you start the streamer and speed up the clock I'd expect you should be able to do it pretty fast in PASM (guessing a couple of microseconds or so per request).
The streamer hardware takes care of ordering so only the shifts need sorted ... and since the streamer sampling has to be tuned then it can also account for nibble alignment in the one parameter. Meaning it will be possible to tune the block copying so only an edge check has to be done to verify which byte to start at with no actual shifts required.
All this coding is partly the science, the learning, and partly an "I've been there" experience. To see the required code to get it working each way. More than some guessing.
Okay, change of tack, this is too good to leave alone ... My main aversion to the bit-bashed search of the start bit was the slow rate of pulse then examine - Waiting for all the latencies to clear, one cycle at a time. Well, I remembered the discussion about whether the PWM_SMPS smartpin mode could be used for a fast response trigger of some sort .... I thought, maybe, I could turn it into a clock gen that can abruptly shut off when an input goes low (The start bit arriving) .... ta-da!
But does the latency period really care about the clock speed while waiting for the start bit? I'd think this is limited by the internal speed of the SD controller, but I never verified that assumption.
@evanh said:
Okay, change of tack, this is too good to leave alone ... My main aversion to the bit-bashed search of the start bit was the slow rate of pulse then examine - Waiting for all the latencies to clear, one cycle at a time. Well, I remembered the discussion about whether the PWM_SMPS smartpin mode could be used for a fast response trigger of some sort .... I thought, maybe, I could turn it into a clock gen that can abruptly shut off when an input goes low (The start bit arriving) .... ta-da!
This looks handy. If it works I can see it could potentially speed things up a fair bit and should let you clock faster while waiting for a start bit. Should then be no excuse to not work with fully aligned data after that!
EDIT: looks like you have it working in latest post. Cool. I'll have to give it a try here at some point soon.
Just ran your test code with my board @evanh. This was the output I received from a 16GB SanDisk card:
Looks okay so far but the first data sector is still offset. I guess that is next for you to sort out.
Huh, no, the option is not there with the pillaged PDF of full spec. The "layers" pane and widget itself disappears.
haha, sexy!
Here's the post-response assembly:
Hmm, darn, not even looking hopeful for even sysclock/8 without changing to cogexec.
Next approach is streamer + block copying.
And on that note, time to post what I've got:
Well that's a funny way to do it
Here it is working with the command response. I'll apply it to the block read code tomorrow.