@evanh said:
PPS: Regarding those block read/write routines, they partly depend on registers being loaded along with the code. If Fcache is shifted to LUTRAM you'd have to move such data into locals instead.
Yeah I know all about it, that's the tricky part. But we have 32 registers to play with and there are only up to 9 parameters I think. We can get them into C register variables using high level assignments from the source structure.
Also I've been making this multi-instance so your globals are no longer present. Even though I am only exposing the SD mode driver host interface as a single slot initially, I see no reason to fix it to just one interface.
I don't think there are any true globals in that Flex driver. Things like the pin numbers are contained to the instance by FlexC's C++ struct __using() wrapper. Only a DAT section would be global.
Also, in case you missed my previous edit, those register constants could, instead of creating locals, be placed in with the presets that are fast copied into the RES section.
I've got a working driver with a small optimisation for size, making for a saving of 148 bytes. It is just dealing with command response start bit search. Nothing else. It clocks and tests one bit at a time instead of relying on a reconfigured clock, an approach borrowed from a similar solution I used for CRC response collection in the block write routine. The code is tidier due to no longer dealing with variable lag effects.
It doesn't affect block read/write performance at all but when comparing command-only performance the existing v1.14 solution is 13% faster than this one.
@evanh said:
... It is just dealing with command response start bit search. Nothing else. ...
It looks like I can go further with this approach. It also seems possible to remove all rx lead-ins and their associated rate settings, replacing them with a simpler WAITX + XINIT located after the coupled DIRH + WYPIN for starting the SD clock.
I had thought I'd covered this ground thoroughly very early on but I must have overlooked something there. Best guess is that, having found I had no choice when it came to the tx side of things, I'd assumed rx wasn't going to be any easier and just duplicated the tx approach.
PS: To be able to transmit at any divider rate, on demand, from sysclock/2 and up, it needs exact sysclock tick adjustability for timing alignment between streamer and smartpin. And there is a bad overlap with the SD clock start in the small dividers without using a spacing lead-in to precisely push out the streamer timing to match the smartpin. Here's an example of the required tx sequence:
setxfrq lnco // sysclock/1 for lead-in timing
dirl p_clk // reset clock-gen smartpin to align with streamer
...
xinit m_calign, #0 // lead-in delay at sysclock/1
setq v_nco // data transfer rate, takes effect on XZERO below (buffered command)
xzero m_cmd, pb // place first 32 bits in tx buffer, aligned to clock via lead-in delay
dirh p_clk // clock timing starts here, first clock pulse occurs in smartpin's second period
wypin #6*8+2, p_clk // SD clock pulses, 6 bytes + 2 pulses to kick off a response
...
xzero m_cmd, pb // place remaining 16 bits in buffer for when first 32 bits completes
And I have the rx routine, after the start-bit is found, done basically the same way but with a measured delay, for the frequency-dependent rx latencies, added to that lead-in.
PPS: Now it looks like, for rx routine, I can actually get away with the following:
dirl p_clk // reset clock-gen smartpin to align with streamer
...
dirh p_clk // clock timing starts here
wypin clocks, p_clk // first clock pulse outputs during second clock period
waitx rxlag
xinit m_resp, #0 // rx response, aligned to SD clock
...
There is no need to set up either the smartpin rate or the streamer rate, as both are unchanged from the command issuing.
I've found, on the Eval board at least, that I can get higher reliability by using a registered clock pin. I accidentally found the extra reliability by going insane with the SD card clock frequency, hitting 180 MHz (sysclock/2 clock divider)!
The accident happened because I'd been banging my head against a new bug that was only appearing at very high clock rates, and the bug wasn't where I was expecting. Funnily, it turned out to be in the recently rewritten command response start-bit search. I'd used an edge event trigger in a place where I'd previously proven only the TESTP instruction. The latency difference mattered.
I've tested both with the new and old methods of start-bit search and, with the bug squashed, both behave the same. So now I wonder if maybe I should leave clock registration switched on permanently ... I should test more boards first ....
That's pretty fast. What speed grade was your card? I recall I also once hit speeds like that earlier on in this same thread when I was testing things out https://forums.parallax.com/discussion/comment/1545820/#Comment_1545820
At these speeds you can fill or empty the P2 memory pretty rapidly. But sustaining it for long is rather unlikely because of the filesystem overheads. At this transfer rate, I'm interested to know what your driver can sustain for video streaming applications.
Ah, I don't intend anyone to actually use that clock rate. Even though it worked, I doubt it would stay reliable across different environments, e.g. different wiring of the SD slot. I highlighted it only because it showed up a timing improvement that would presumably make the driver more robust.
I need to do some testing ... I've been tidying up and reverting some of my experiments. I had started working on replacing all start-bit search code blocks with generally the same approach as above but realised it creates extra work in dealing with timeouts. The command-response timeout is based on a limited number of SD clock cycles rather than an actual duration.
EDIT: I guess I could use branch-on-event to keep the inner loop small. Not sure it'll result in reduced code size though. The whole aim is to shrink the driver size.
That's unexpected. I'll guess it's the smartpin IN behaviour, rather than the event hardware. I've just been using sysclock/2 to see how tight the new response start-bit search loop can go ... and it's a tad disappointing. It seems I can throw in up to 4 instructions between the WYPIN #1,pin and the associated WAITSE1 without any change in loop time.
The pulse period is set to just 2 sysclock ticks, yet it seems to take 10 ticks to trigger its IN rise event.
Here's the relevant code:
// Z is initially clear
rep @.rend, #64 // max of 64 clocks for response to start, Ncr (SD spec 4.12.4)
if_nz wypin #1, p_clk // one clock pulse
if_nz testpn p_cmd wz // Z set if start(S)-bit found, TESTP is required for low latency!
if_nz waitse1 // wait for each clock done
.rend
// Z clear if timed-out - S-bit not found
So I may as well add an early exit branch in there. That way it won't loop all 64 times.
EDIT: On the other hand, that does explain why it has been holding together so well at extreme clock settings. The pulse interval never gets shorter than sysclock/12. And I had been counting my blessings without going to the trouble of scoping it up till now.
Maybe it's all for the better. It means I don't need exceptions for short timings. And this was all about reduced code size after all.
Here's the updated loop. It loops just as fast with the extra JMP instruction.
rep @.rend, #64 // max of 64 clocks for response to start, Ncr (SD spec 4.12.4)
wypin #1, p_clk // one clock pulse
testp p_cmd wc // C clear if start(S)-bit found, TESTP is required for low latency!
if_nc jmp #.rend
waitse1 // wait for each clock done
.rend
// C set if timed-out - S-bit not found
Huh, I had been lucky with the above loop too. The basic loop has been in use for gathering the CRC status response from the SD card after each block written.
rep @.rend2, #6
wypin #1, p_clk // one clock pulse
testp p_dat wc // sample DAT0 before the clock pin transitions
rcl timeout, #1 // collect status bits
waitse1 // wait for each clock done
.rend2
Turns out it only works 100% when the clock pin is unregistered and the SD clock is also non-inverted (which is used when I engage the SD High-Speed interface). Both of these conditions have been true in the sdsd.cc driver up until now.
I don't know exactly why the polarity should matter, since the SD card also changes which clock edge it clocks out on when changing speed mode.
At any rate, now that I know it doesn't reduce performance, the fix is simply to add a padding instruction between WYPIN and TESTP:
rep @.rend2, #6
wypin #1, p_clk // one clock pulse
nop // timing padding, for high sysclock frequencies
testp p_dat wc // sample DAT0 before the clock pin transitions
rcl timeout, #1 // collect status bits
waitse1 // wait for each clock done, IN goes high 8 ticks + pulse period after WYPIN
.rend2
And same fix for the command-response's start-bit search loop.
Bugger, I broke something at sdsd v1.12 too. I just discovered that all block read data gets corrupted when disabling the CRC check. That goes back to last year!
I guess that's why unit tests are a thing. Oh well ...
EDIT: Found the bug. Got a tad gung-ho by changing to looping on streamer complete rather than smartpin complete. It saved an instruction as well as terminating earlier.
The buggy behaviour was: when the driver is configured for no CRC, it has the streamer finish at the end of the 512 bytes of data, ignoring the CRC that follows. But those CRC bits are still arriving, with the clock still cycling, while the driver has the streamer reading what it thinks is the next block. Chaos ensued.
Man, I've not been fully following the framing spec on CRC status after block write either. I've, at some point, assumed a fixed number of clock cycles for turnaround between end of CRC and the CRC status coming back. All cards seem to have the same gap. However, the spec says it can be from 2 to 8 clocks. One more improvement to get right ...
EDIT: Dealt to:
mov timeout, #0 // reused to hold the CRC status bits, and as return code
rep @.rend2, #13 // collect 6 bits for status, and Ncrc = 2..8 clocks (SD spec 4.12.5.5)
wypin #1, p_clk // one clock pulse
testb timeout, #5 wz // start-bit latch
testbn timeout, #4 andz // start-bit latch
testp p_dat wc // sample DAT0 before the clock pin transitions
if_nz rcl timeout, #1 // collect status bits
if_nz waitse1 // wait for each clock done, IN goes high 8 ticks + pulse period after WYPIN
.rend2
_ret_ wypin #500, p_clk // trailing clocks
I've updated the Object Exchange with v1.14. The zip file now includes four examples and improved documentation - https://obex.parallax.com/obex/sdsd-cc/
Anyone got objections to locking in the size-optimised start-bit search?
I've also removed the redundant CMD6 SWITCH_FUNC switchover to High-Speed interface. Binary file size reduced by another 128 bytes.
I had left it in early on in the hope that it would help with enabling other features, which it never did, and had kind of forgotten about it.