@rogloh said:
... Caching may not help streaming writes much one the buffer fills up and you are still writing.
Oi! Clear your head!
A: We know there is plenty of waiting time happening. There is lots of opportunity for improvement here.
B: The data written is not a large amount at all.
C: Even if filled part way through the cache would still make a difference if it was helping at all.
A. we probably need to see a breakdown of the timing gaps, to know what is being held up. Is the gap just while the card is busy writing, or a delay while the COG is executing FATFS logic, or a gap while cluster map table is being read/updated/processed etc. etc
B. The data written could be huge if streaming. I'm talking during things like video capture where it keeps on writing for hundreds of MB to several GB.
C. Yes. It could help a bit if some work can be done during the gaps between clusters is already occurring due to FATFS code slowdowns and if you'll be stuck waiting then anyway. Some internal cache write backs could perhaps continue during this time.
These delays seem massive though compared to your underlying performance with 4 bit IO. What are we seeing like a 5x slowdown or so?
Behaviour stayed consistent with earlier measurements of the same. In those earlier measurements I logged each timing interval from fwrite() to fwrite() plus every filesystem break in between. So we have the breakdown of the earlier runs already. And there's little evidence anything has changed with caching enabled.
Yes, they are massive, imho. The waiting gaps should vanish entirely if the caching was doing its job. For some reason it isn't happening.
Maybe UHS is still needed, which seems unlikely given it allowed me to engage the cache setting.
Maybe I haven't done all the needed steps and it's not actually engaging at all. I tend to only read enough of the docs to make it happen and stop after that. There could be steps I've missed.
Maybe other features, like command queuing, is required in combination. I've never tried to implement such before and have no comprehension of what it would take. My gut feel is queuing will take a lot RAM on the host for a starters. And add notable overhead to manage it.
Two runs with only the Cache mode toggled. 512kB file length, 64 kB buffer size. Second one is with caching enabled.
PS:There is natural variation between runs in both settings. I picked a common result around 4000 kB/s. Min/Max are something like 3700/4700 kB/s.
Compare without the timing log. Same order as above. Shows that write performance is barely impacted by the logging.
EDIT: Repeated runs of the first case below looks like the average might be 100 kB/s higher, ie: Min/Max is sitting between 3800/4800 kB/s and I'm getting more above 4000. EDIT: I've actually had 3600 kB/s now so the more testing done the wider the min/max extremes seem to get. Not seeing significant change between settings though.
CID decode: ManID=1B OEMID=SM Name=ED2S5Ver=3.0 Serial=49C16906 Date=2023-2SD Card Init Successful
mount sd: OK
Buffer = 64 kB, Written 512 kB at 4012 kB/s, Verified, Read 512 kB at 28915 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5Ver=3.0 Serial=49C16906 Date=2023-2Cache Extension Enabled
SD Card Init Successful
mount sd: OK
Buffer = 64 kB, Written 512 kB at 4022 kB/s, Verified, Read 512 kB at 30048 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5Ver=3.0 Serial=49C16906 Date=2023-2SD Card Init Successful
mount sd: OK
Buffer = 4 kB, Written 128 kB at 771 kB/s, Verified, Read 128 kB at 12779 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5Ver=3.0 Serial=49C16906 Date=2023-2Cache Extension Enabled
SD Card Init Successful
mount sd: OK
Buffer = 4 kB, Written 128 kB at 775 kB/s, Verified, Read 128 kB at 11811 kB/s
Be good to chart this data somehow. Maybe total # of real payload sectors written on y axis, by time (p2 clocks) on x axis and you'll then see gaps and the resulting slope gives the overall rate.
We're just looking for the Busy gaps really. Do they change with cache enabled? The rest is just confirming the methods aren't spoiling the measurements.
The Busy's are looking damn near identical. Even the pattern of high-low bumps is the same. I'm declaring the cache as inactive even though the enable bit is set - and verified set.
SDv6 spec Section 4.16.2.4 Application Performance Measurement Conditions has a list of steps for how to measure performance to spec. It includes enabling the cache then using command queue together as a pair.
I get the feeling that the caching system is only accessible via the command queue. Now I wonder if a limited use, ie: 1 deep, of CQ can be effective.
@evanh said:
The idea had been that the caching would make it all faster by eliminating the long Busy states. The small singles would be so fast they wouldn't matter much. Alas, that didn't pan out.
Plus those sorts of extra features are probably somewhat card dependent anyway. Caching may not help streaming writes much one the buffer fills up and you are still writing.
Oh, you were talking about when a large write is done in the wild. Now I follow.
If that happens then it won't be because of the small singles. It'll just be the limit of the card, and would also do the same with raw linear writing. So, not the issue we're addressing here.
PS: I don't want to mess too much with the filesystem software. The potential for introducing subtle and devilish bugs at that level I think is quite high. So if we can get more sustained speed by activating newer SD features then I'm willing to give it a go.
@evanh said:
The idea had been that the caching would make it all faster by eliminating the long Busy states. The small singles would be so fast they wouldn't matter much. Alas, that didn't pan out.
Plus those sorts of extra features are probably somewhat card dependent anyway. Caching may not help streaming writes much one the buffer fills up and you are still writing.
Oh, you were talking about when a large write is done in the wild. Now I follow.
Yep.
If that happens then it won't be because of the small singles. It'll just be the limit of the card, and would also do the same with raw linear writing. So, not the issue we're addressing here.
PS: I don't want to mess too much with the filesystem software. The potential for introducing subtle and devilish bugs at that level I think is quite high. So if we can get more sustained speed by activating more of the advanced SD features I'm willing to give it a go.
Worth a shot to see what happens with the different SD features, if not just to learn more about it. It's just those FatFS APIs that preallocated the file's clusters during opening that look potentially really good for long sustained writes if they avoid the extra singles. But if it can't speed it up on it's own then it's not as handy. I guess to help more I'd need a breakdown as the to real bulk of the slowness as all I've seen is raw data and it's not always easy to draw conclusions directly.
The writes in those timing dumps are sequentially ordered clusters all the way. The allocating method seems fine.
The extra singles intermingled in the dump will be for house keeping in between the fwrites(). The SYNCs were a good example of that. The FS layers are written in a robust manner, ie: They act like each fwrite() consists of an fopen()/fwrite()/fclose() combo to ensure everything is always flushed. It likely also minimises RAM usage, which is very desirable.
I think using the SD based caching will hide a lot of the extra delays that this approach inflicts. And if this works out, all you need to do when buying an SD card is look for the A2 logo to know it has the required feature set.
clkfreq = 300000000 clkmode = 0x1000efb
Filesystem = fatfs, Driver = sdsdcard
Clock divider for SD card is 4 (75 MHz)
Set pins: CLK_PIN=44 CMD_PIN=45 DAT_PIN=40 POW_PIN=39 LED_PIN=46Card detected ... power cycle of SD card
power-down threshold = 37 pin state = 1power-down slope = 50060 us pin state = 0power-up threshold = 209 pin state = 0power-up slope = 1129 us pin state = 1SD clock divider set to sysclock/750. 'rxlag' compensation is 0Card idle OK
OCR register c0ff8000 - Valid SDHC/SDXC Card
Data Transfer Mode entered - Published RCA 59b400004-bit data interface engaged
It was just the photo and schematic that was posted.
Grr, I must have a bug still. The driver shouldn't just get stuck like that, irrespective of cause.
EDIT: Although, a compiler bug is definitely a possibility now too.
It's getting stuck before any data blocks are read at all. The first use of rx_readblocks() is to fetch the SD_STATUS register to print the cache extension availability. Only happens when SD_DEBUG is set.
The second use is for switching to High-Speed interface mode, which also reports back with the SD_STATUS register. Only happens when CLK_POL = 0.
If those two are skipped, then the MBR will be attempted.
DUH!!!! I should have seen that a lot earlier! Rayman, add --fcache=256 to your compile line.
EDIT: I started talking with Eric about what to do with this. This larger Fcache encroaches on the space needed for the nu-code VM. Eric decided nothing to do and that people wanting to use the VM could live with just using the SPI driver.
But, even accepting that, it's still a niggle that compiling needs this parameter.
EDIT2: Yeah, it makes complete sense that both tx_datablock() and rx_datablocks() gave you grief. Those routines both internally copy the data block into cogRAM so they can process the CRC in parallel with the streamer transferring to/from the SD card. That amount of data in cogRAM splatters over FlexC working registers if Fcache isn't expanded to allow for it.
Awsome. One of my cards didn't work, but another gives this:
```
clkfreq = 200000000 clkmode = 0x10009fb
Clock divider for SD card is 4 (50 MHz)
mount: OK
Buffer = 2 kB, Written 2048 kB at 957 kB/s, Verified, Read 2048 kB at 7802 k
B/s
Buffer = 2 kB, Written 2048 kB at 976 kB/s, Verified, Read 2048 kB at 7786 k
B/s
Buffer = 2 kB, Written 2048 kB at 962 kB/s, Verified, Read 2048 kB at 7763 k
B/s
Buffer = 2 kB, Written 2048 kB at 976 kB/s, Verified, Read 2048 kB at 7809 k
B/s
Buffer = 4 kB, Written 2048 kB at 1702 kB/s, Verified, Read 2048 kB at 11242
kB/s
Buffer = 4 kB, Written 2048 kB at 1741 kB/s, Verified, Read 2048 kB at 11692
kB/s
Buffer = 4 kB, Written 2048 kB at 1751 kB/s, Verified, Read 2048 kB at 11762
kB/s
Buffer = 4 kB, Written 2048 kB at 1759 kB/s, Verified, Read 2048 kB at 11640
kB/s
Buffer = 8 kB, Written 4096 kB at 2869 kB/s, Verified, Read 4096 kB at 15414
kB/s
Buffer = 8 kB, Written 4096 kB at 2926 kB/s, Verified, Read 4096 kB at 15543
kB/s
Buffer = 8 kB, Written 4096 kB at 2920 kB/s, Verified, Read 4096 kB at 15555
kB/s
Buffer = 8 kB, Written 4096 kB at 2872 kB/s, Verified, Read 4096 kB at 15193
kB/s
Buffer = 16 kB, Written 4096 kB at 3972 kB/s, Verified, Read 4096 kB at 1839
1 kB/s
Buffer = 16 kB, Written 4096 kB at 3958 kB/s, Verified, Read 4096 kB at 1843
6 kB/s
Buffer = 16 kB, Written 4096 kB at 4020 kB/s, Verified, Read 4096 kB at 1829
0 kB/s
Buffer = 16 kB, Written 4096 kB at 3885 kB/s, Verified, Read 4096 kB at 1814
3 kB/s
Buffer = 32 kB, Written 8192 kB at 4724 kB/s, Verified, Read 8192 kB at 2049
5 kB/s
Buffer = 32 kB, Written 8192 kB at 4609 kB/s, Verified, Read 8192 kB at 2009
0 kB/s
Buffer = 32 kB, Written 8192 kB at 4710 kB/s, Verified, Read 8192 kB at 2042
2 kB/s
Buffer = 32 kB, Written 8192 kB at 4682 kB/s, Verified, Read 8192 kB at 2049
1 kB/s
Buffer = 64 kB, Written 8192 kB at 6802 kB/s, Verified, Read 8192 kB at 2057
9 kB/s
Buffer = 64 kB, Written 8192 kB at 6683 kB/s, Verified, Read 8192 kB at 2057
6 kB/s
Buffer = 64 kB, Written 8192 kB at 7018 kB/s, Verified, Read 8192 kB at 2057
6 kB/s
Buffer = 64 kB, Written 8192 kB at 6530 kB/s, Verified, Read 8192 kB at 2058
4 kB/s
Buffer = 128 kB, Written 16384 kB at 9003 kB/s, Verified, Read 16384 kB at 2
0668 kB/s
Buffer = 128 kB, Written 16384 kB at 8934 kB/s, Verified, Read 16384 kB at 2
0607 kB/s
Buffer = 128 kB, Written 16384 kB at 8848 kB/s, Verified, Read 16384 kB at 2
0079 kB/s
Buffer = 128 kB, Written 16384 kB at 9054 kB/s, Verified, Read 16384 kB at 2
0666 kB/s
Buffer = 256 kB, Written 16384 kB at 10582 kB/s, Verified, Read 16384 kB at
20196 kB/s
Buffer = 256 kB, Written 16384 kB at 10535 kB/s, Verified, Read 16384 kB at
20688 kB/s
Buffer = 256 kB, Written 16384 kB at 10289 kB/s, Verified, Read 16384 kB at
20243 kB/s
Buffer = 256 kB, Written 16384 kB at 10844 kB/s, Verified, Read 16384 kB at
20678 kB/s
Comments
Oi! Clear your head!
A: We know there is plenty of waiting time happening. There is lots of opportunity for improvement here.
B: The data written is not a large amount at all.
C: Even if filled part way through the cache would still make a difference if it was helping at all.
A. we probably need to see a breakdown of the timing gaps, to know what is being held up. Is the gap just while the card is busy writing, or a delay while the COG is executing FATFS logic, or a gap while cluster map table is being read/updated/processed etc. etc
B. The data written could be huge if streaming. I'm talking during things like video capture where it keeps on writing for hundreds of MB to several GB.
C. Yes. It could help a bit if some work can be done during the gaps between clusters is already occurring due to FATFS code slowdowns and if you'll be stuck waiting then anyway. Some internal cache write backs could perhaps continue during this time.
These delays seem massive though compared to your underlying performance with 4 bit IO. What are we seeing like a 5x slowdown or so?
I've done the testing already. The most recent data lengths were 2MB, 4MB, 8MB and 16MB. Repeated. EDIT: Err, just the 2 and 4, buffer sizes were 2kB, 4kB, 8kB and 16kB. So not large write lengths at all. https://forums.parallax.com/discussion/comment/1562497/#Comment_1562497
Behaviour stayed consistent with earlier measurements of the same. In those earlier measurements I logged each timing interval from fwrite() to fwrite() plus every filesystem break in between. So we have the breakdown of the earlier runs already. And there's little evidence anything has changed with caching enabled.
Yes, they are massive, imho. The waiting gaps should vanish entirely if the caching was doing its job. For some reason it isn't happening.
Maybe UHS is still needed, which seems unlikely given it allowed me to engage the cache setting.
Maybe I haven't done all the needed steps and it's not actually engaging at all. I tend to only read enough of the docs to make it happen and stop after that. There could be steps I've missed.
Maybe other features, like command queuing, is required in combination. I've never tried to implement such before and have no comprehension of what it would take. My gut feel is queuing will take a lot RAM on the host for a starters. And add notable overhead to manage it.
Ok, well hopefully we'll be able to figure it out in time as we discover more. Would be neat to be able to take advantage of all the work you've done.
Yep, I'll keep plugging at it.
I'll do some more reading and if that turns up nothing I'll get stuck into, attempting to at least, streamlining the fatfs filesystem code.
Two runs with only the Cache mode toggled. 512kB file length, 64 kB buffer size. Second one is with caching enabled.
PS:There is natural variation between runs in both settings. I picked a common result around 4000 kB/s. Min/Max are something like 3700/4700 kB/s.
CID decode: ManID=1B OEMID=SM Name=ED2S5 Ver=3.0 Serial=49C16906 Date=2023-2 SD Card Init Successful RD0 330376 RD800 331491 RD801 331771 mount sd: OK RDf740 332489 WRf740 332819 RD85a 339094 WR85a 339952 WR7fda 345652 RDf740 350838 Buffer = 64 kB, RD85a 351750 WR43f80+40 352061 WR43fc0+40 358425 WR44000+40 367821 WR44040+40 372460 WR44080+40 378297 WR440c0+40 387651 WR44100+40 392286 WR44140+40 398122 WR44180+40 407499 WR441c0+40 412137 WR44200+40 417975 WR44240+40 427364 WR44280+40 431999 WR442c0+40 437836 WR44300+40 447232 WR44340+40 451870 WR85a 457678 WR7fda 466687 RDf740 470555 WRf740 471438 WR801 475789 SYNC 475880 Written 512 kB at 3998 kB/s, RDf740 480039 RD43f80+40 480913 RD85a 482087 RD43fc0+40 482453 RD44000+40 484025 RD44040+40 485242 RD44080+40 486815 RD440c0+40 488018 RD44100+40 489605 RD44140+40 490809 RD44180+40 492381 RD441c0+40 493600 RD44200+40 495173 RD44240+40 496375 RD44280+40 497962 RD442c0+40 499167 RD44300+40 500738 RD44340+40 501956 Verified, RD43f80+40 503784 RD43fc0+40 504990 RD44000+40 506198 RD44040+40 507414 RD44080+40 508624 RD440c0+40 509827 RD44100+40 511050 RD44140+40 512254 RD44180+40 513462 RD441c0+40 514680 RD44200+40 515891 RD44240+40 517096 RD44280+40 518319 RD442c0+40 519523 RD44300+40 520730 RD44340+40 521949 Read 512 kB at 26417 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5 Ver=3.0 Serial=49C16906 Date=2023-2 Cache Extension Enabled SD Card Init Successful RD0 291810 RD800 293090 RD801 293369 mount sd: OK RDf740 294085 WRf740 294415 RD85a 300656 WR85a 301514 WR7fda 307192 RDf740 312355 Buffer = 64 kB, RD85a 313267 WR43f80+40 313578 WR43fc0+40 319924 WR44000+40 329037 WR44040+40 333656 WR44080+40 339472 WR440c0+40 348663 WR44100+40 353282 WR44140+40 359101 WR44180+40 368319 WR441c0+40 372938 WR44200+40 378754 WR44240+40 387961 WR44280+40 392580 WR442c0+40 398399 WR44300+40 407634 WR44340+40 412252 WR85a 418038 WR7fda 426870 RDf740 430700 WRf740 431575 WR801 435895 SYNC 435986 Written 512 kB at 4009 kB/s, RDf740 441192 RD43f80+40 442065 RD85a 443192 RD43fc0+40 443549 RD44000+40 445073 RD44040+40 446244 RD44080+40 447770 RD440c0+40 448926 RD44100+40 450466 RD44140+40 451624 RD44180+40 453148 RD441c0+40 454319 RD44200+40 455845 RD44240+40 457001 RD44280+40 458541 RD442c0+40 459699 RD44300+40 461223 RD44340+40 462394 Verified, RD43f80+40 464176 RD43fc0+40 465334 RD44000+40 466494 RD44040+40 467664 RD44080+40 468827 RD440c0+40 469982 RD44100+40 471158 RD44140+40 472315 RD44180+40 473475 RD441c0+40 474646 RD44200+40 475808 RD44240+40 476964 RD44280+40 478139 RD442c0+40 479297 RD44300+40 480457 RD44340+40 481628 Read 512 kB at 27495 kB/s
And again but using 4 kB buffer size, file length reduced to 128 kB for reading sanity:
CID decode: ManID=1B OEMID=SM Name=ED2S5 Ver=3.0 Serial=49C16906 Date=2023-2 SD Card Init Successful RD0 382233 RD800 383349 RD801 383628 mount sd: OK RDf740 384090 WRf740 384420 RD85a 388721 WR85a 389613 WR7fda 394127 RDf740 398128 Buffer = 4 kB, RD85a 399031 WR43f80+8 399333 WR43f88+8 403617 WR43f90+8 407387 WR43f98+8 411203 WR43fa0+8 415944 WR43fa8+8 420793 WR43fb0+8 425656 WR43fb8+8 430529 WR43fc0+8 435416 WR43fc8+8 440347 WR43fd0+8 445290 WR43fd8+8 450246 WR43fe0+8 455726 WR43fe8+8 461316 WR43ff0+8 466919 WR43ff8+8 472537 WR44000+8 478167 WR44008+8 483844 WR44010+8 489533 WR44018+8 498146 WR44020+8 501705 WR44028+8 505375 WR44030+8 509057 WR44038+8 512754 WR44040+8 516463 WR44048+8 520219 WR44050+8 523988 WR44058+8 527804 WR44060+8 532543 WR44068+8 537391 WR44070+8 542252 WR44078+8 547125 WR85a 551986 WR7fda 557273 RDf740 561515 WRf740 562383 WR801 565954 SYNC 566045 Written 128 kB at 750 kB/s, RDf740 569862 RD43f80+8 570733 RD43f88+8 571190 RD43f90+8 571648 RD43f98+8 572105 RD43fa0+8 572575 RD43fa8+8 573045 RD43fb0+8 573515 RD43fb8+8 573985 RD85a 574435 RD43fc0+8 574738 RD43fc8+8 575211 RD43fd0+8 575685 RD43fd8+8 576158 RD43fe0+8 576614 RD43fe8+8 577070 RD43ff0+8 577526 RD43ff8+8 577982 RD44000+8 578444 RD44008+8 578903 RD44010+8 579363 RD44018+8 579822 RD44020+8 580219 RD44028+8 580615 RD44030+8 581012 RD44038+8 581408 RD44040+8 581809 RD44048+8 582209 RD44050+8 582608 RD44058+8 583006 RD44060+8 583401 RD44068+8 583796 RD44070+8 584191 RD44078+8 584586 Verified, RD43f80+8 585051 RD43f88+8 585485 RD43f90+8 585920 RD43f98+8 586355 RD43fa0+8 586801 RD43fa8+8 587248 RD43fb0+8 587695 RD43fb8+8 588142 RD43fc0+8 588594 RD43fc8+8 589044 RD43fd0+8 589495 RD43fd8+8 589945 RD43fe0+8 590378 RD43fe8+8 590811 RD43ff0+8 591244 RD43ff8+8 591677 RD44000+8 592115 RD44008+8 592552 RD44010+8 592988 RD44018+8 593424 RD44020+8 593798 RD44028+8 594171 RD44030+8 594545 RD44038+8 594918 RD44040+8 595296 RD44048+8 595672 RD44050+8 596048 RD44058+8 596424 RD44060+8 596796 RD44068+8 597168 RD44070+8 597539 RD44078+8 597911 Read 128 kB at 9664 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5 Ver=3.0 Serial=49C16906 Date=2023-2 Cache Extension Enabled SD Card Init Successful RD0 394150 RD800 395444 RD801 395724 mount sd: OK RDf740 396186 WRf740 396519 RD85a 402743 WR85a 403562 WR7fda 409889 RDf740 415701 Buffer = 4 kB, RD85a 416605 WR43f80+8 416909 WR43f88+8 423067 WR43f90+8 428744 WR43f98+8 434434 WR43fa0+8 443017 WR43fa8+8 446576 WR43fb0+8 450246 WR43fb8+8 453930 WR43fc0+8 457628 WR43fc8+8 461336 WR43fd0+8 465092 WR43fd8+8 468861 WR43fe0+8 472676 WR43fe8+8 477416 WR43ff0+8 482266 WR43ff8+8 487129 WR44000+8 492004 WR44008+8 496891 WR44010+8 501822 WR44018+8 506766 WR44020+8 511722 WR44028+8 517202 WR44030+8 522793 WR44038+8 528399 WR44040+8 534017 WR44048+8 539646 WR44050+8 545323 WR44058+8 551012 WR44060+8 559589 WR44068+8 563148 WR44070+8 566818 WR44078+8 570503 WR85a 574175 WR7fda 578177 RDf740 581163 WRf740 582033 WR801 584352 SYNC 584443 Written 128 kB at 749 kB/s, RDf740 587637 RD43f80+8 588507 RD43f88+8 588966 RD43f90+8 589425 RD43f98+8 589884 RD43fa0+8 590342 RD43fa8+8 590796 RD43fb0+8 591249 RD43fb8+8 591703 RD85a 592133 RD43fc0+8 592436 RD43fc8+8 592893 RD43fd0+8 593350 RD43fd8+8 593807 RD43fe0+8 594264 RD43fe8+8 594733 RD43ff0+8 595202 RD43ff8+8 595672 RD44000+8 596143 RD44008+8 596616 RD44010+8 597089 RD44018+8 597562 RD44020+8 598035 RD44028+8 598490 RD44030+8 598947 RD44038+8 599402 RD44040+8 599860 RD44048+8 600319 RD44050+8 600778 RD44058+8 601237 RD44060+8 601696 RD44068+8 602090 RD44070+8 602485 RD44078+8 602879 Verified, RD43f80+8 603341 RD43f88+8 603777 RD43f90+8 604213 RD43f98+8 604648 RD43fa0+8 605084 RD43fa8+8 605514 RD43fb0+8 605945 RD43fb8+8 606376 RD43fc0+8 606808 RD43fc8+8 607242 RD43fd0+8 607675 RD43fd8+8 608109 RD43fe0+8 608543 RD43fe8+8 608989 RD43ff0+8 609436 RD43ff8+8 609882 RD44000+8 610330 RD44008+8 610780 RD44010+8 611230 RD44018+8 611680 RD44020+8 612129 RD44028+8 612562 RD44030+8 612994 RD44038+8 613427 RD44040+8 613861 RD44048+8 614297 RD44050+8 614733 RD44058+8 615168 RD44060+8 615604 RD44068+8 615976 RD44070+8 616347 RD44078+8 616718 Read 128 kB at 9303 kB/s
Compare without the timing log. Same order as above. Shows that write performance is barely impacted by the logging.
EDIT: Repeated runs of the first case below looks like the average might be 100 kB/s higher, ie: Min/Max is sitting between 3800/4800 kB/s and I'm getting more above 4000. EDIT: I've actually had 3600 kB/s now so the more testing done the wider the min/max extremes seem to get. Not seeing significant change between settings though.
CID decode: ManID=1B OEMID=SM Name=ED2S5 Ver=3.0 Serial=49C16906 Date=2023-2 SD Card Init Successful mount sd: OK Buffer = 64 kB, Written 512 kB at 4012 kB/s, Verified, Read 512 kB at 28915 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5 Ver=3.0 Serial=49C16906 Date=2023-2 Cache Extension Enabled SD Card Init Successful mount sd: OK Buffer = 64 kB, Written 512 kB at 4022 kB/s, Verified, Read 512 kB at 30048 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5 Ver=3.0 Serial=49C16906 Date=2023-2 SD Card Init Successful mount sd: OK Buffer = 4 kB, Written 128 kB at 771 kB/s, Verified, Read 128 kB at 12779 kB/s
CID decode: ManID=1B OEMID=SM Name=ED2S5 Ver=3.0 Serial=49C16906 Date=2023-2 Cache Extension Enabled SD Card Init Successful mount sd: OK Buffer = 4 kB, Written 128 kB at 775 kB/s, Verified, Read 128 kB at 11811 kB/s
Be good to chart this data somehow. Maybe total # of real payload sectors written on y axis, by time (p2 clocks) on x axis and you'll then see gaps and the resulting slope gives the overall rate.
We're just looking for the Busy gaps really. Do they change with cache enabled? The rest is just confirming the methods aren't spoiling the measurements.
The Busy's are looking damn near identical. Even the pattern of high-low bumps is the same. I'm declaring the cache as inactive even though the enable bit is set - and verified set.
SDv6 spec Section 4.16.2.4 Application Performance Measurement Conditions has a list of steps for how to measure performance to spec. It includes enabling the cache then using command queue together as a pair.
I get the feeling that the caching system is only accessible via the command queue. Now I wonder if a limited use, ie: 1 deep, of CQ can be effective.
Oh, you were talking about when a large write is done in the wild. Now I follow.
If that happens then it won't be because of the small singles. It'll just be the limit of the card, and would also do the same with raw linear writing. So, not the issue we're addressing here.
PS: I don't want to mess too much with the filesystem software. The potential for introducing subtle and devilish bugs at that level I think is quite high. So if we can get more sustained speed by activating newer SD features then I'm willing to give it a go.
Yep.
Worth a shot to see what happens with the different SD features, if not just to learn more about it. It's just those FatFS APIs that preallocated the file's clusters during opening that look potentially really good for long sustained writes if they avoid the extra singles. But if it can't speed it up on it's own then it's not as handy. I guess to help more I'd need a breakdown as the to real bulk of the slowness as all I've seen is raw data and it's not always easy to draw conclusions directly.
The writes in those timing dumps are sequentially ordered clusters all the way. The allocating method seems fine.
The extra singles intermingled in the dump will be for house keeping in between the fwrites(). The SYNCs were a good example of that. The FS layers are written in a robust manner, ie: They act like each fwrite() consists of an fopen()/fwrite()/fclose() combo to ensure everything is always flushed. It likely also minimises RAM usage, which is very desirable.
I think using the SD based caching will hide a lot of the extra delays that this approach inflicts. And if this works out, all you need to do when buying an SD card is look for the A2 logo to know it has the required feature set.
Not exactly sure what was going on, but reinstalled flexprop and seems to work in 1-bit mode with your code now:
clkfreq = 100000000 clkmode = 0x100090b Filesystem = fatfs, Driver = sdcardx mount sd: OK Buffer = 16 kB, Written 4096 kB at 1594 kB/s, Verified, Read 4096 kB at 2741 kB/s exit
Four bit SD mode still doesn't work though...
Also, that program did format the drive as exFAT. Didn't notice that...
I've had a shot at using that formatter myself now too. It didn't function. Under Wine, it don't find any drives to format.
What does the output look like with both DEBUG's defined?
Still stuck after 4-bit mode:
clkfreq = 300000000 clkmode = 0x1000efb Filesystem = fatfs, Driver = sdsdcard Clock divider for SD card is 4 (75 MHz) Set pins: CLK_PIN=44 CMD_PIN=45 DAT_PIN=40 POW_PIN=39 LED_PIN=46 Card detected ... power cycle of SD card power-down threshold = 37 pin state = 1 power-down slope = 50060 us pin state = 0 power-up threshold = 209 pin state = 0 power-up slope = 1129 us pin state = 1 SD clock divider set to sysclock/750. 'rxlag' compensation is 0 Card idle OK OCR register c0ff8000 - Valid SDHC/SDXC Card Data Transfer Mode entered - Published RCA 59b40000 4-bit data interface engaged
Hmm, yeah, that's getting stuck at the
rx_datablocks()
.Try this driver. It's nothing special, just reports the pin states at that point. You should see something like this:
... Card idle OK OCR register c0ff8000 - SDHC/SDXC Card Data Transfer Mode entered - Published RCA 59b40000 4-bit data interface engaged CLK=0 CMD=1 POW=0 D3=1 d2=1 D1=1 D0=1
CLK=0 CMD=1 POW=0 D3=1 d2=1 D1=1 D0=1
Did you post the gerbers for this board? Maybe I can just have some made?
It was just the photo and schematic that was posted.
Grr, I must have a bug still. The driver shouldn't just get stuck like that, irrespective of cause.
EDIT: Although, a compiler bug is definitely a possibility now too.
Rayman,
Post your latest binary here.
Even better, a listing (use -l switch)! It's easy to use a diff tool on the listing to see where the differences are
yes, that too please.
Ok, here it is
Still thinking it's something to do with MBR, but who knows.
I'll have an Ubuntu laptop in a few days...
It's getting stuck before any data blocks are read at all. The first use of rx_readblocks() is to fetch the SD_STATUS register to print the cache extension availability. Only happens when SD_DEBUG is set.
The second use is for switching to High-Speed interface mode, which also reports back with the SD_STATUS register. Only happens when CLK_POL = 0.
If those two are skipped, then the MBR will be attempted.
DUH!!!! I should have seen that a lot earlier! Rayman, add
--fcache=256
to your compile line.EDIT: I started talking with Eric about what to do with this. This larger Fcache encroaches on the space needed for the nu-code VM. Eric decided nothing to do and that people wanting to use the VM could live with just using the SPI driver.
But, even accepting that, it's still a niggle that compiling needs this parameter.
EDIT2: Yeah, it makes complete sense that both tx_datablock() and rx_datablocks() gave you grief. Those routines both internally copy the data block into cogRAM so they can process the CRC in parallel with the streamer transferring to/from the SD card. That amount of data in cogRAM splatters over FlexC working registers if Fcache isn't expanded to allow for it.
Awsome. One of my cards didn't work, but another gives this:
```
clkfreq = 200000000 clkmode = 0x10009fb
Clock divider for SD card is 4 (50 MHz)
mount: OK
Buffer = 2 kB, Written 2048 kB at 957 kB/s, Verified, Read 2048 kB at 7802 k
B/s
Buffer = 2 kB, Written 2048 kB at 976 kB/s, Verified, Read 2048 kB at 7786 k
B/s
Buffer = 2 kB, Written 2048 kB at 962 kB/s, Verified, Read 2048 kB at 7763 k
B/s
Buffer = 2 kB, Written 2048 kB at 976 kB/s, Verified, Read 2048 kB at 7809 k
B/s
Buffer = 4 kB, Written 2048 kB at 1702 kB/s, Verified, Read 2048 kB at 11242
kB/s
Buffer = 4 kB, Written 2048 kB at 1741 kB/s, Verified, Read 2048 kB at 11692
kB/s
Buffer = 4 kB, Written 2048 kB at 1751 kB/s, Verified, Read 2048 kB at 11762
kB/s
Buffer = 4 kB, Written 2048 kB at 1759 kB/s, Verified, Read 2048 kB at 11640
kB/s
Buffer = 8 kB, Written 4096 kB at 2869 kB/s, Verified, Read 4096 kB at 15414
kB/s
Buffer = 8 kB, Written 4096 kB at 2926 kB/s, Verified, Read 4096 kB at 15543
kB/s
Buffer = 8 kB, Written 4096 kB at 2920 kB/s, Verified, Read 4096 kB at 15555
kB/s
Buffer = 8 kB, Written 4096 kB at 2872 kB/s, Verified, Read 4096 kB at 15193
kB/s
Buffer = 16 kB, Written 4096 kB at 3972 kB/s, Verified, Read 4096 kB at 1839
1 kB/s
Buffer = 16 kB, Written 4096 kB at 3958 kB/s, Verified, Read 4096 kB at 1843
6 kB/s
Buffer = 16 kB, Written 4096 kB at 4020 kB/s, Verified, Read 4096 kB at 1829
0 kB/s
Buffer = 16 kB, Written 4096 kB at 3885 kB/s, Verified, Read 4096 kB at 1814
3 kB/s
Buffer = 32 kB, Written 8192 kB at 4724 kB/s, Verified, Read 8192 kB at 2049
5 kB/s
Buffer = 32 kB, Written 8192 kB at 4609 kB/s, Verified, Read 8192 kB at 2009
0 kB/s
Buffer = 32 kB, Written 8192 kB at 4710 kB/s, Verified, Read 8192 kB at 2042
2 kB/s
Buffer = 32 kB, Written 8192 kB at 4682 kB/s, Verified, Read 8192 kB at 2049
1 kB/s
Buffer = 64 kB, Written 8192 kB at 6802 kB/s, Verified, Read 8192 kB at 2057
9 kB/s
Buffer = 64 kB, Written 8192 kB at 6683 kB/s, Verified, Read 8192 kB at 2057
6 kB/s
Buffer = 64 kB, Written 8192 kB at 7018 kB/s, Verified, Read 8192 kB at 2057
6 kB/s
Buffer = 64 kB, Written 8192 kB at 6530 kB/s, Verified, Read 8192 kB at 2058
4 kB/s
Buffer = 128 kB, Written 16384 kB at 9003 kB/s, Verified, Read 16384 kB at 2
0668 kB/s
Buffer = 128 kB, Written 16384 kB at 8934 kB/s, Verified, Read 16384 kB at 2
0607 kB/s
Buffer = 128 kB, Written 16384 kB at 8848 kB/s, Verified, Read 16384 kB at 2
0079 kB/s
Buffer = 128 kB, Written 16384 kB at 9054 kB/s, Verified, Read 16384 kB at 2
0666 kB/s
Buffer = 256 kB, Written 16384 kB at 10582 kB/s, Verified, Read 16384 kB at
20196 kB/s
Buffer = 256 kB, Written 16384 kB at 10535 kB/s, Verified, Read 16384 kB at
20688 kB/s
Buffer = 256 kB, Written 16384 kB at 10289 kB/s, Verified, Read 16384 kB at
20243 kB/s
Buffer = 256 kB, Written 16384 kB at 10844 kB/s, Verified, Read 16384 kB at
20678 kB/s
exit
```
Nice one Rayman.
With a P2 clocking at 50MHz, 20MB/s reads certainly looks like 4bit SD is finally functioning now for you.
Yes, pretty awesome