@rogloh said:
... Caching may not help streaming writes much once the buffer fills up and you are still writing.
Oi! Clear your head!
A: We know there is plenty of waiting time happening. There is lots of opportunity for improvement here.
B: The data written is not a large amount at all.
C: Even if it filled part way through, the cache would still make a difference if it was helping at all.
A. We probably need to see a breakdown of the timing gaps to know what is being held up. Is the gap just while the card is busy writing, or a delay while the COG is executing FATFS logic, or a gap while the cluster map table is being read/updated/processed, etc.?
B. The data written could be huge if streaming. I'm talking about things like video capture where it keeps on writing for hundreds of MB to several GB.
C. Yes. It could help a bit if some work can be done during the gaps between clusters that are already occurring due to FATFS code slowdowns, and if you'll be stuck waiting then anyway. Some internal cache write-backs could perhaps continue during this time.
These delays seem massive though compared to your underlying performance with 4-bit IO. What are we seeing, like a 5x slowdown or so?
I've done the testing already. The most recent data lengths were 2MB, 4MB, 8MB and 16MB. Repeated. EDIT: Err, just the 2 and 4, buffer sizes were 2kB, 4kB, 8kB and 16kB. So not large write lengths at all. https://forums.parallax.com/discussion/comment/1562497/#Comment_1562497
Behaviour stayed consistent with earlier measurements of the same. In those earlier measurements I logged each timing interval from fwrite() to fwrite() plus every filesystem break in between. So we have the breakdown of the earlier runs already. And there's little evidence anything has changed with caching enabled.
Yes, they are massive, imho. The waiting gaps should vanish entirely if the caching was doing its job. For some reason it isn't happening.
Maybe UHS is still needed, which seems unlikely given it allowed me to engage the cache setting.
Maybe I haven't done all the needed steps and it's not actually engaging at all. I tend to only read enough of the docs to make it happen and stop after that. There could be steps I've missed.
Maybe other features, like command queuing, are required in combination. I've never tried to implement such before and have no comprehension of what it would take. My gut feel is queuing will take a lot of RAM on the host for starters. And add notable overhead to manage it.
Ok, well hopefully we'll be able to figure it out in time as we discover more. Would be neat to be able to take advantage of all the work you've done.
Yep, I'll keep plugging at it.
I'll do some more reading and if that turns up nothing I'll get stuck into streamlining the fatfs filesystem code, or at least attempt to.
Two runs with only the Cache mode toggled. 512 kB file length, 64 kB buffer size. Second one is with caching enabled.
PS: There is natural variation between runs in both settings. I picked a common result around 4000 kB/s. Min/Max are something like 3700/4700 kB/s.
And again but using 4 kB buffer size, file length reduced to 128 kB for reading sanity:
Compare without the timing log. Same order as above. Shows that write performance is barely impacted by the logging.
EDIT: Repeated runs of the first case below look like the average might be 100 kB/s higher, ie: Min/Max is sitting between 3800/4800 kB/s and I'm getting more above 4000. EDIT2: I've actually had 3600 kB/s now, so the more testing done the wider the min/max extremes seem to get. Not seeing significant change between settings though.
CID decode: ManID=1B OEMID=SM Name=ED2S5
Ver=3.0 Serial=49C16906 Date=2023-2
SD Card Init Successful
mount sd: OK
Buffer = 64 kB, Written 512 kB at 4012 kB/s, Verified, Read 512 kB at 28915 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5
Ver=3.0 Serial=49C16906 Date=2023-2
Cache Extension Enabled
SD Card Init Successful
mount sd: OK
Buffer = 64 kB, Written 512 kB at 4022 kB/s, Verified, Read 512 kB at 30048 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5
Ver=3.0 Serial=49C16906 Date=2023-2
SD Card Init Successful
mount sd: OK
Buffer = 4 kB, Written 128 kB at 771 kB/s, Verified, Read 128 kB at 12779 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5
Ver=3.0 Serial=49C16906 Date=2023-2
Cache Extension Enabled
SD Card Init Successful
mount sd: OK
Buffer = 4 kB, Written 128 kB at 775 kB/s, Verified, Read 128 kB at 11811 kB/s
Be good to chart this data somehow. Maybe total # of real payload sectors written on the y axis, against time (P2 clocks) on the x axis; you'll then see the gaps, and the resulting slope gives the overall rate.
We're just looking for the Busy gaps really. Do they change with cache enabled? The rest is just confirming the methods aren't spoiling the measurements.
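For illustration, a minimal sketch of the kind of instrumentation that could produce that chart: record (P2 clock, cumulative payload sectors) pairs at each block write and dump them as CSV afterwards. The hook names here (log_sector_written(), sd_clock_ticks()) are hypothetical, not part of the existing driver.
```
/* Sketch only: collect (P2 clock, cumulative payload sectors) samples so the
 * write stream can be charted - sectors on Y, clocks on X.
 * sd_clock_ticks() is a hypothetical hook, e.g. a GETCT wrapper. */
#include <stdint.h>
#include <stdio.h>

#define LOG_MAX 2048

typedef struct {
    uint32_t ticks;    /* P2 system-counter value at this point */
    uint32_t sectors;  /* cumulative 512-byte payload sectors written */
} log_point_t;

static log_point_t log_buf[LOG_MAX];
static uint32_t log_count;
static uint32_t sectors_total;

extern uint32_t sd_clock_ticks(void);   /* hypothetical: read the P2 CT counter */

/* Call once per completed data-block write inside the driver's write path. */
void log_sector_written(uint32_t blocks)
{
    sectors_total += blocks;
    if (log_count < LOG_MAX) {
        log_buf[log_count].ticks   = sd_clock_ticks();
        log_buf[log_count].sectors = sectors_total;
        log_count++;
    }
}

/* Dump CSV after the benchmark: the slope of sectors-vs-ticks is the
 * sustained rate, and flat spots are the Busy gaps. */
void log_dump_csv(void)
{
    printf("ticks,sectors\n");
    for (uint32_t i = 0; i < log_count; i++)
        printf("%u,%u\n", (unsigned)log_buf[i].ticks, (unsigned)log_buf[i].sectors);
}
```
Plotting the CSV on a PC then makes the Busy gaps show up as flat steps in the curve.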
The Busy gaps are looking damn near identical. Even the pattern of high-low bumps is the same. I'm declaring the cache as inactive even though the enable bit is set - and verified set.
SDv6 spec Section 4.16.2.4 Application Performance Measurement Conditions has a list of steps for how to measure performance to spec. It includes enabling the cache and then using the command queue, together as a pair.
I get the feeling that the caching system is only accessible via the command queue. Now I wonder if a limited use of CQ, ie: 1 deep, can be effective.
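For what it's worth, a rough sketch of what a 1-deep CQ write might look like, going by the SD 6.x spec's command set (CMD44/CMD45 register a task, CMD47 executes a write task), assuming CQ has already been switched on in the same Performance Enhancement register the cache-enable bit lives in. The sd_cmd()/sd_write_data() primitives are placeholders for whatever the driver already has, and the argument bit layouts are indicative only - they'd need checking against section 4.16 before use.
```
/* Hedged sketch only: a 1-deep command-queue write.
 * sd_cmd() and sd_write_data() stand in for the driver's existing primitives. */
#include <stdint.h>
#include <stdbool.h>

extern bool sd_cmd(uint8_t cmd, uint32_t arg, uint32_t *resp);    /* placeholder */
extern bool sd_write_data(const uint8_t *buf, uint32_t blocks);   /* placeholder */

bool cq_write_1deep(uint32_t block_addr, const uint8_t *buf, uint32_t blocks)
{
    uint32_t resp;
    const uint32_t task_id = 0;   /* only ever one task in flight */

    /* CMD44 (Q_TASK_INFO_A): task ID, block count, plus a direction flag
     * (write) somewhere in this argument - check the spec for exact bits. */
    if (!sd_cmd(44, (task_id << 16) | (blocks & 0xFFFF), &resp))
        return false;

    /* CMD45 (Q_TASK_INFO_B): start block address for the task. */
    if (!sd_cmd(45, block_addr, &resp))
        return false;

    /* Poll task status (the spec defines a task-status form of CMD13) until
     * the card marks the task ready to execute. */
    do {
        if (!sd_cmd(13, 1u << 15, &resp))   /* assumed "send queue status" form */
            return false;
    } while (!(resp & (1u << task_id)));

    /* CMD47 (Q_WR_TASK): execute the queued write, then stream the data. */
    if (!sd_cmd(47, task_id << 16, &resp))
        return false;
    return sd_write_data(buf, blocks);
}
```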
@evanh said:
The idea had been that the caching would make it all faster by eliminating the long Busy states. The small singles would be so fast they wouldn't matter much. Alas, that didn't pan out.
Plus those sorts of extra features are probably somewhat card dependent anyway. Caching may not help streaming writes much once the buffer fills up and you are still writing.
Oh, you were talking about when a large write is done in the wild. Now I follow.
If that happens then it won't be because of the small singles. It'll just be the limit of the card, and would also do the same with raw linear writing. So, not the issue we're addressing here.
PS: I don't want to mess too much with the filesystem software. The potential for introducing subtle and devilish bugs at that level I think is quite high. So if we can get more sustained speed by activating newer SD features then I'm willing to give it a go.
@evanh said:
Oh, you were talking about when a large write is done in the wild. Now I follow.
Yep.
Worth a shot to see what happens with the different SD features, if not just to learn more about it. It's just those FatFS APIs that preallocate the file's clusters during opening that look potentially really good for long sustained writes, if they avoid the extra singles. But if it can't speed it up on its own then it's not as handy. I guess to help more I'd need a breakdown as to the real bulk of the slowness, as all I've seen is raw data and it's not always easy to draw conclusions directly.
The writes in those timing dumps are sequentially ordered clusters all the way. The allocating method seems fine.
The extra singles intermingled in the dump will be for housekeeping in between the fwrites(). The SYNCs were a good example of that. The FS layers are written in a robust manner, ie: they act like each fwrite() consists of an fopen()/fwrite()/fclose() combo to ensure everything is always flushed. It likely also minimises RAM usage, which is very desirable.
I think using the SD based caching will hide a lot of the extra delays that this approach inflicts. And if this works out, all you need to do when buying an SD card is look for the A2 logo to know it has the required feature set.
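As a reference point for the preallocation idea mentioned a couple of posts up, here's roughly what it looks like with ChaN's FatFs API directly (f_expand() needs FF_USE_EXPAND enabled in ffconf.h); whether the flexprop C library exposes this call is a separate question:
```
/* Sketch of a preallocated streaming write using ChaN's FatFs API. */
#include "ff.h"

FRESULT stream_capture(const TCHAR *path, const BYTE *buf, UINT bufsize, UINT chunks)
{
    FIL fil;
    UINT bw;
    FRESULT fr = f_open(&fil, path, FA_WRITE | FA_CREATE_ALWAYS);
    if (fr != FR_OK) return fr;

    /* Reserve a contiguous region up front so clusters aren't allocated
     * one by one in the middle of the streaming writes. */
    fr = f_expand(&fil, (FSIZE_t)bufsize * chunks, 1);
    if (fr != FR_OK) { f_close(&fil); return fr; }

    for (UINT i = 0; i < chunks && fr == FR_OK; i++) {
        fr = f_write(&fil, buf, bufsize, &bw);   /* data only, clusters pre-allocated */
        if (fr == FR_OK && bw != bufsize) fr = FR_DISK_ERR;
    }

    /* One sync at the end instead of an implicit flush per fwrite(). */
    if (fr == FR_OK) fr = f_sync(&fil);
    f_close(&fil);
    return fr;
}
```
The single f_sync() at the end is the trade-off: less housekeeping traffic during the stream, but less robustness if power is lost mid-capture.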
Not exactly sure what was going on, but reinstalled flexprop and it seems to work in 1-bit mode with your code now:
Four bit SD mode still doesn't work though...
Also, that program did format the drive as exFAT. Didn't notice that...
I've had a shot at using that formatter myself now too. It didn't function. Under Wine, it doesn't find any drives to format.
What does the output look like with both DEBUGs defined?
Still stuck after 4-bit mode:
Hmm, yeah, that's getting stuck at the rx_datablocks(). Try this driver. It's nothing special, just reports the pin states at that point. You should see something like this:
CLK=0 CMD=1 POW=0 D3=1 d2=1 D1=1 D0=1
Did you post the gerbers for this board? Maybe I can just have some made?
It was just the photo and schematic that was posted.
Grr, I must have a bug still. The driver shouldn't just get stuck like that, irrespective of cause.
EDIT: Although, a compiler bug is definitely a possibility now too.
Rayman,
Post your latest binary here.
Even better, a listing (use -l switch)! It's easy to use a diff tool on the listing to see where the differences are.
Yes, that too please.
Ok, here it is.
Still thinking it's something to do with MBR, but who knows.
I'll have an Ubuntu laptop in a few days...
It's getting stuck before any data blocks are read at all. The first use of rx_readblocks() is to fetch the SD_STATUS register to print the cache extension availability. Only happens when SD_DEBUG is set.
The second use is for switching to High-Speed interface mode, which also reports back with the SD_STATUS register. Only happens when CLK_POL = 0.
If those two are skipped, then the MBR will be attempted.
DUH!!!! I should have seen that a lot earlier! Rayman, add --fcache=256 to your compile line.
EDIT: I started talking with Eric about what to do with this. This larger Fcache encroaches on the space needed for the nu-code VM. Eric decided to do nothing, and that people wanting to use the VM could live with just using the SPI driver.
But, even accepting that, it's still a niggle that compiling needs this parameter.
EDIT2: Yeah, it makes complete sense that both tx_datablock() and rx_datablocks() gave you grief. Those routines both internally copy the data block into cogRAM so they can process the CRC in parallel with the streamer transferring to/from the SD card. That amount of data in cogRAM splatters over FlexC working registers if Fcache isn't expanded to allow for it.
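For reference, the per-DAT-line CRC being computed there is the standard SD CRC16 (CCITT polynomial 0x1021, initial value 0). This is a plain bitwise version, nothing like the cogRAM-optimised code, just to show what is being calculated; in 4-bit mode the card expects one such CRC per DAT line, over the bits carried on that line.
```
/* Reference SD data CRC16 (CCITT polynomial 0x1021, initial value 0). */
#include <stdint.h>
#include <stddef.h>

uint16_t sd_crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```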
Awesome. One of my cards didn't work, but another gives this:
```
clkfreq = 200000000 clkmode = 0x10009fb
Clock divider for SD card is 4 (50 MHz)
mount: OK
Buffer = 2 kB, Written 2048 kB at 957 kB/s, Verified, Read 2048 kB at 7802 kB/s
Buffer = 2 kB, Written 2048 kB at 976 kB/s, Verified, Read 2048 kB at 7786 kB/s
Buffer = 2 kB, Written 2048 kB at 962 kB/s, Verified, Read 2048 kB at 7763 kB/s
Buffer = 2 kB, Written 2048 kB at 976 kB/s, Verified, Read 2048 kB at 7809 kB/s
Buffer = 4 kB, Written 2048 kB at 1702 kB/s, Verified, Read 2048 kB at 11242 kB/s
Buffer = 4 kB, Written 2048 kB at 1741 kB/s, Verified, Read 2048 kB at 11692 kB/s
Buffer = 4 kB, Written 2048 kB at 1751 kB/s, Verified, Read 2048 kB at 11762 kB/s
Buffer = 4 kB, Written 2048 kB at 1759 kB/s, Verified, Read 2048 kB at 11640 kB/s
Buffer = 8 kB, Written 4096 kB at 2869 kB/s, Verified, Read 4096 kB at 15414 kB/s
Buffer = 8 kB, Written 4096 kB at 2926 kB/s, Verified, Read 4096 kB at 15543 kB/s
Buffer = 8 kB, Written 4096 kB at 2920 kB/s, Verified, Read 4096 kB at 15555 kB/s
Buffer = 8 kB, Written 4096 kB at 2872 kB/s, Verified, Read 4096 kB at 15193 kB/s
Buffer = 16 kB, Written 4096 kB at 3972 kB/s, Verified, Read 4096 kB at 18391 kB/s
Buffer = 16 kB, Written 4096 kB at 3958 kB/s, Verified, Read 4096 kB at 18436 kB/s
Buffer = 16 kB, Written 4096 kB at 4020 kB/s, Verified, Read 4096 kB at 18290 kB/s
Buffer = 16 kB, Written 4096 kB at 3885 kB/s, Verified, Read 4096 kB at 18143 kB/s
Buffer = 32 kB, Written 8192 kB at 4724 kB/s, Verified, Read 8192 kB at 20495 kB/s
Buffer = 32 kB, Written 8192 kB at 4609 kB/s, Verified, Read 8192 kB at 20090 kB/s
Buffer = 32 kB, Written 8192 kB at 4710 kB/s, Verified, Read 8192 kB at 20422 kB/s
Buffer = 32 kB, Written 8192 kB at 4682 kB/s, Verified, Read 8192 kB at 20491 kB/s
Buffer = 64 kB, Written 8192 kB at 6802 kB/s, Verified, Read 8192 kB at 20579 kB/s
Buffer = 64 kB, Written 8192 kB at 6683 kB/s, Verified, Read 8192 kB at 20576 kB/s
Buffer = 64 kB, Written 8192 kB at 7018 kB/s, Verified, Read 8192 kB at 20576 kB/s
Buffer = 64 kB, Written 8192 kB at 6530 kB/s, Verified, Read 8192 kB at 20584 kB/s
Buffer = 128 kB, Written 16384 kB at 9003 kB/s, Verified, Read 16384 kB at 20668 kB/s
Buffer = 128 kB, Written 16384 kB at 8934 kB/s, Verified, Read 16384 kB at 20607 kB/s
Buffer = 128 kB, Written 16384 kB at 8848 kB/s, Verified, Read 16384 kB at 20079 kB/s
Buffer = 128 kB, Written 16384 kB at 9054 kB/s, Verified, Read 16384 kB at 20666 kB/s
Buffer = 256 kB, Written 16384 kB at 10582 kB/s, Verified, Read 16384 kB at 20196 kB/s
Buffer = 256 kB, Written 16384 kB at 10535 kB/s, Verified, Read 16384 kB at 20688 kB/s
Buffer = 256 kB, Written 16384 kB at 10289 kB/s, Verified, Read 16384 kB at 20243 kB/s
Buffer = 256 kB, Written 16384 kB at 10844 kB/s, Verified, Read 16384 kB at 20678 kB/s
exit
```
Nice one Rayman.
With the P2 clocking the SD card at 50MHz, 20MB/s reads certainly look like 4-bit SD is finally functioning now for you.
Yes, pretty awesome