But not yet confident. I've found a second SD card, of eight now, that's not entirely happy. This one works when I enable the block logging though.
I used to have a bunch of full-sized SD cards too, but haven't found even one of those. I suspect I threw them all out a long time ago because they probably all got smashed. I remember sitting on some and cracking the thin PCB material.
@evanh said:
Whoa! I found a bug (Every second join became disjointed) that was slowing it down still. And I've got writes working with lazy CMD12 too! Don't need no big buffers any longer. Fast recording here we come.
Wow that's what I want to see! These sorts of speeds will let me get back to the HDMI capture stuff I started a while back. I'd sort of assumed we'd hit a fundamental limit somewhere but didn't know why.
@evanh said:
Certainly is. I hadn't imagined such a big jump.
But not yet confident. I've found a second SD card, of eight now, that's not entirely happy. This one works when I enable the block logging though.
I used to have a bunch of full-sized SD cards too, but haven't found even one of those. I suspect I threw them all out a long time ago because they probably all got smashed. I remember sitting on some and cracking the thin PCB material.
Keep at it. You'll probably figure out what the real limitation is there eventually and what feature the card needs to be able to do what we want.
The scope is at work at the moment. I'm unsure what I'll be able to see but I need some inspiration at least, or just plain insight into what's actually occurring. The fact that the newer cards seem happy might be telling.
Uh-oh again, I've messed something up along the way. None of the cards work using portA now. There must be a typo or something similar. I don't know how far back this occurred, since the stable v1.2 release, because I've been using pin base of 40 all this time.
PS: Hehe, I just double checked v1.2 to be sure it's okay. No problem there.
The good news is that just by verbatim copying the two disk_read()/disk_write() functions from v1.2 everything returns to normal. So all my problems appear to be contained inside those two - Which is of course where 99% of the edits have occurred.
PS: The new code structure makes the two of them act as one unit. They share state tracking of where CMD12 next needs to be issued. They look identical at first glance. Just a couple of differences like CMD18 for reads vs CMD25 for writes. CMD17 and CMD24 have been removed. In fact, if I get this to work cleanly, I think I'll be making them one function with two hooks for FS handler.
Err, I guess it would need two wrapper functions to specify read vs write. Direct hooks won't provided that. I'll test it out before deciding.
Huh, found a weakness in tx_datablock() low level routine where it was failing to report unknown errors with debugging enabled. Something odd is happening with the card's CRC response. Definitely want to get the scope on to it now.
Okay, that one's solved. It was a brand new bug in the newest code attempting to narrow down why some cards weren't happy with the lazy CMD12. I'd missed out doing a bit-shift on the LED pin number variable. This resulted in a spurious low pin number with a huge ADDPINS value. Which would've messed with OUTA/DIRA bits when the activity LED was toggled.
Amusingly, the botched workaround was on the right track. If I hadn't got confused I wouldn't have stopped the other day.
Issuing CMD24 for doing single block writes has proven to be important.
I've found I can still use CMD25+CMD12 for doing singles but a 0.5 ms separation is then required. Dunno why, but it works for both of the affected cards. Obviously adding extra delays on top of the CMD25+CMD12 already being a double command is simply a bad approach when using CMD24 by itself works best.
So reads and writes have a notable structural difference now.
@Wuerfel_21 said:
So writing from 512 byte buffers -> hit CMD24 sad case?
For those two old cards, yes. One is the 2007 camera card (Adata), it is also SDSC 2GB capacity. The other is not so old, 2018, but is obscure and cheap Apacer brand.
PS: I have a cheap 2013 Adata that is fine using CMD25, without adding delays, for all writes. So age isn't everything.
I mean driver-side it would issue the single block write instead of keeping a multi-block open for more.
I guess there's a general problem with the FS interface not really telling you in advance if there's going to be another sequential read/write command after the current.
Oh, yes, exactly. No way to know what might come later. Especially when it's just one block.
It might be a blessing in disguise anyway. I hadn't got round to even thinking about the consequences of removing a card with a CMD25 still in progress.
EDIT: Err, logging the block numbers and SYNC calls I can see the final SYNC is called after the final FAT meta data is written. So using CMD25 for everything should be just as reliable.
LOL, the optimiser beat me to it. I just noticed I could optimise the code size by shuffling the logic to eliminate duplication inside the disk_write() function. But it made no difference to the compiled size. Oh well.
EDIT: Oops, it did make a tiny difference, 76 bytes. I just hadn't checked the exact bytes.
EDIT2: Shaved another 44 bytes off, yahoo!
EDIT3: Shaved off another 32 bytes, not off the recent bloat though. I finally replaced the CRC7 routine with the smaller Fcached version, as per Ada's request from a while back.
Huh, I just started looking at the plug-in wrapper code and note it actually already merges the read and write hooks into one function, and then splits them up again to call my disk_read()/disk_write(). So there's two extra layers there, inside the driver, I hadn't even noticed before ...
@evanh said:
Huh, I just started looking at the plug-in wrapper code and note it actually already merges the read and write hooks into one function, and then splits them up again to call my disk_read()/disk_write(). So there's two extra layers there, inside the driver, I hadn't even noticed before ...
Do you expect that it would slow things down all that much?
Dunno, it is at least four extra hubexec branches per r/w operation, but it's gotta be extra binary size too. And I have noted the plug-in version of sdmm.cc is bigger than the built-in version.
EDIT: Okay, I figure the plug-in wrapper code is arranged like that for compatibility with the legacy built-in code path. I think I could rehash it since my driver doesn't have any use for built-in path.
Binary size is back down to 40 bytes less than v1.2, which feels like a small milestone at least. But, comparatively, I could do some of the same to v1.2 and make it even smaller.
PS: I've moved a lot of code directly into the new plug-in interfacing functions. Some of the boiler plate code is replaced. The old interface is gone from the driver. It'd be good for others to test this out for new bugs.
PPS: One functional change I did, for performance only, was to ignore the outcome of all CMD12s. The code path returns without waiting on the card's response. Instead it relies on the subsequent card-busy check that follows such an action. I've not actually tried to measure what difference it has. The speed gain may only show up at low sysclock frequencies where the Prop2 overheads take longer than a command response + card busy completing ... which would most likely occur for block reads. EDIT: Actually, it's almost a non-change in effect. With the new CMD12 state tracking the only time a card-busy check doesn't immediately follow a CMD12 is when there has been a data error and the driver has aborted. I'll change that now ... done - Data errors now collect the CMD12 response before aborting.
Hmm, I might have a lingering bug there. I'm not certain how write data errors are meant to be handled. Having the CMD12 might be entirely superfluous for writes. EDIT: Okay, all good. I've checked the case of a failed CRC during CMD25 writes. It does still require a following CMD12 to close it off.
Comments
Uh-oh, one of my seven SD cards doesn't like it ... getting a busy timeout following the first write ...
It's happy with the older v1.2 driver, so can't blame the card.
IIRC while the write is ongoing, the result isn't guaranteed to be flushed, so you might need to plumb through the flushing from the FS layer.
Yup, my updated CTRL_SYNC ioctl() issues a CMD12 and then waits on card busy before returning. An fclose() on an open write generates this sync.
@evanh That's pretty awesome speeds there!
Certainly is. I hadn't imagined such a big jump.
But not yet confident. I've found a second SD card, of eight now, that's not entirely happy. This one works when I enable the block logging though.
I used to have a bunch of full-sized SD cards too, but haven't found even one of those. I suspect I threw them all out a long time ago because they probably all got smashed. I remember sitting on some and cracking the thin PCB material.
Wow that's what I want to see! These sorts of speeds will let me get back to the HDMI capture stuff I started a while back. I'd sort of assumed we'd hit a fundamental limit somewhere but didn't know why.
Keep at it. You'll probably figure out what the real limitation is there eventually and what feature the card needs to be able to do what we want.
Appreciated, thanks.
The scope is at work at the moment. I'm unsure what I'll be able to see but I need some inspiration at least, or just plain insight into what's actually occurring. The fact that the newer cards seem happy might be telling.
Uh-oh again, I've messed something up along the way. None of the cards work using portA now. There must be a typo or something similar. I don't know how far back this occurred, since the stable v1.2 release, because I've been using pin base of 40 all this time.
PS: Hehe, I just double checked v1.2 to be sure it's okay. No problem there.
One step forward two steps back
The good news is that just by verbatim copying the two disk_read()/disk_write() functions from v1.2 everything returns to normal. So all my problems appear to be contained inside those two - Which is of course where 99% of the edits have occurred.
PS: The new code structure makes the two of them act as one unit. They share state tracking of where CMD12 next needs to be issued. They look identical at first glance. Just a couple of differences like CMD18 for reads vs CMD25 for writes. CMD17 and CMD24 have been removed. In fact, if I get this to work cleanly, I think I'll be making them one function with two hooks for FS handler.
Err, I guess it would need two wrapper functions to specify read vs write. Direct hooks won't provided that. I'll test it out before deciding.
Huh, found a weakness in tx_datablock() low level routine where it was failing to report unknown errors with debugging enabled. Something odd is happening with the card's CRC response. Definitely want to get the scope on to it now.
Okay, that one's solved. It was a brand new bug in the newest code attempting to narrow down why some cards weren't happy with the lazy CMD12. I'd missed out doing a bit-shift on the LED pin number variable. This resulted in a spurious low pin number with a huge ADDPINS value. Which would've messed with OUTA/DIRA bits when the activity LED was toggled.
Didn't need the scope, just some sleep.
Amusingly, the botched workaround was on the right track. If I hadn't got confused I wouldn't have stopped the other day.
Issuing CMD24 for doing single block writes has proven to be important.
I've found I can still use CMD25+CMD12 for doing singles but a 0.5 ms separation is then required. Dunno why, but it works for both of the affected cards. Obviously adding extra delays on top of the CMD25+CMD12 already being a double command is simply a bad approach when using CMD24 by itself works best.
So reads and writes have a notable structural difference now.
So writing from 512 byte buffers -> hit CMD24 sad case?
For those two old cards, yes. One is the 2007 camera card (Adata), it is also SDSC 2GB capacity. The other is not so old, 2018, but is obscure and cheap Apacer brand.
PS: I have a cheap 2013 Adata that is fine using CMD25, without adding delays, for all writes. So age isn't everything.
I mean driver-side it would issue the single block write instead of keeping a multi-block open for more.
I guess there's a general problem with the FS interface not really telling you in advance if there's going to be another sequential read/write command after the current.
Oh, yes, exactly. No way to know what might come later. Especially when it's just one block.
It might be a blessing in disguise anyway. I hadn't got round to even thinking about the consequences of removing a card with a CMD25 still in progress.
EDIT: Err, logging the block numbers and SYNC calls I can see the final SYNC is called after the final FAT meta data is written. So using CMD25 for everything should be just as reliable.
Hmm, I'm happy with the stability again but I'm not particularly happy with the 0.4 kB increase in binary size.
I'll post it here for testing before adding it to the Obex.
LOL, the optimiser beat me to it. I just noticed I could optimise the code size by shuffling the logic to eliminate duplication inside the disk_write() function. But it made no difference to the compiled size. Oh well.
EDIT: Oops, it did make a tiny difference, 76 bytes. I just hadn't checked the exact bytes.
EDIT2: Shaved another 44 bytes off, yahoo!
EDIT3: Shaved off another 32 bytes, not off the recent bloat though. I finally replaced the CRC7 routine with the smaller Fcached version, as per Ada's request from a while back.
Huh, I just started looking at the plug-in wrapper code and note it actually already merges the read and write hooks into one function, and then splits them up again to call my disk_read()/disk_write(). So there's two extra layers there, inside the driver, I hadn't even noticed before ...
Do you expect that it would slow things down all that much?
Dunno, it is at least four extra hubexec branches per r/w operation, but it's gotta be extra binary size too. And I have noted the plug-in version of sdmm.cc is bigger than the built-in version.
EDIT: Okay, I figure the plug-in wrapper code is arranged like that for compatibility with the legacy built-in code path. I think I could rehash it since my driver doesn't have any use for built-in path.
Binary size is back down to 40 bytes less than v1.2, which feels like a small milestone at least. But, comparatively, I could do some of the same to v1.2 and make it even smaller.
PS: I've moved a lot of code directly into the new plug-in interfacing functions. Some of the boiler plate code is replaced. The old interface is gone from the driver. It'd be good for others to test this out for new bugs.
PPS: One functional change I did, for performance only, was to ignore the outcome of all CMD12s. The code path returns without waiting on the card's response. Instead it relies on the subsequent card-busy check that follows such an action. I've not actually tried to measure what difference it has. The speed gain may only show up at low sysclock frequencies where the Prop2 overheads take longer than a command response + card busy completing ... which would most likely occur for block reads. EDIT: Actually, it's almost a non-change in effect. With the new CMD12 state tracking the only time a card-busy check doesn't immediately follow a CMD12 is when there has been a data error and the driver has aborted. I'll change that now ... done - Data errors now collect the CMD12 response before aborting.
Hmm, I might have a lingering bug there. I'm not certain how write data errors are meant to be handled. Having the CMD12 might be entirely superfluous for writes. EDIT: Okay, all good. I've checked the case of a failed CRC during CMD25 writes. It does still require a following CMD12 to close it off.