@evanh said:
It was a shock at that time that reads and writes were so different.
Basically, the rule is: FIFO writes of a byte size, or more, are written at the first opportunity. Which can mean each and every rotation of the hub. Each time this clashes with a cog access to hubRAM, the cog then has to wait another full rotation to get in. By which time the FIFO may well want to write that very same slice again.
Perhaps you can control the access spacing in your COG loop reads for CRC computation to try to fit in with the FIFO... EDIT: guess it won't help if the FIFO uses the HUB each and every rotation to write a byte...
At sysclock/4, and faster, it's probably forcing the cog's fast copy to track the FIFO's progress in lockstep around the hubRAM slices. The measurements certainly are indicating this.
@rogloh said:
Perhaps you can control the access spacing in your COG loop reads for CRC computation to try to fit in with the FIFO... EDIT: guess it won't help if the FIFO uses the HUB each and every rotation to write a byte...
A counter-rotating access pattern would do that. But since the fast copy is incrementing addresses, the same as the FIFO, we get what we've got.
In another spin of the die, puns intended, having an "immediate" streamer mode for data input as well as output would make a world of difference here.
I feel I've already done a decent SD write block routine where the data is read from the FIFO, the CRC computed on the fly, and 32 bits at a time written to the streamer using an "immediate" streamer mode. The final computed CRC is then written last without stopping the clock. If the streamer supported an immediate mode for input, doing the same for reading from the SD card could work just as well.
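For flavour, the inner loop of such a tx routine would look roughly like this. A minimal sketch only, not the real routine: it assumes a RDFAST has already pointed the FIFO at the source data, count is preset to 128 longs, tx_imm is one of the streamer's immediate 8x4 output modes, and crc3..crc0/poly follow the rx snippet posted further below.
.txloop rflong  pa                      // next data long from the FIFO
        xcont   tx_imm, pa              // immediate mode: D supplies the data, clock keeps running
        movbyts pa, #0b00_01_10_11      // byte swap within longword
        splitb  pa                      // regroup nibble bits per DAT line
        setq    pa
        crcnib  crc3, poly              // two nibbles per line, DAT3..DAT0
        crcnib  crc3, poly
        crcnib  crc2, poly
        crcnib  crc2, poly
        crcnib  crc1, poly
        crcnib  crc1, poly
        crcnib  crc0, poly
        crcnib  crc0, poly
        djnz    count, #.txloop         // 128 longs = one 512-byte block
The four CRC words would then be pushed by further immediate XCONT ops at the end, as described.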
You might be able to CRC check the previous block while the next one is being streamed into HUB: stop the clock, block-read the just-received sector into LUT/COGRAM for checking, then start the next sector read by simply enabling the clock again. You'd still be overlapping the work in the COG while the streamer is busy with the FIFO writes for the next sector. Then at the end of the overall burst read you check the final sector as well and report the result.
Good point, do the fast copy to cogRAM before starting the streamer but still process the CRC during streamer activity. I'll give that a try ...
Yay, very good! Thanks. Overheads are only up from 8.2% (without read data CRC processing) to 17.1% instead of 53% with the congested fast copy.
EDIT: Ah, that's for sysclock/3. Sysclock/2 isn't so pleasant at 44%.
EDIT2: Sysclock/4 goes from 6.6% to 8.7%. Had previously been around 50% with the congested fast copy.
I now need to sort out how to temporarily use lutRAM for Fcache. Currently I've got a whole (not small) function fixed in lutRAM with an __attribute__((lut)).
@evanh said:
It was a shock at that time that reads and writes were so different.
Basically, the rule is: FIFO writes of a byte size, or more, are written at the first opportunity. Which can mean each and every rotation of the hub. Each time this clashes with a cog access to hubRAM, the cog then has to wait another full rotation to get in. By which time the FIFO may well want to write that very same slice again. Stalling the cog for consecutive rotations!
So that explains why there is a large drop in congestion between sysclock/4 and sysclock/5. The streamer is reading nibbles from the SD card and writes a byte to the FIFO every second cycle. Which in turn means that at sysclock/4 a byte completes every 8 clocks (exactly one hub rotation), so the FIFO is writing on every hub rotation, while at sysclock/5 a byte completes every 10 clocks and the FIFO starts writing less often than every rotation.
I think performance could sometimes be improved if the fast block read starts before the streamer writes start. The idea is that the initial random read then has time to complete before the FIFO hogs the hub RAM bus and blocks a full-length read, which takes 9-16 cycles.
It's a pity there's no option to tell the FIFO not to write bytes and words at the first opportunity and instead wait until a long has been filled. With that option the FIFO would write only full longs if the start address is long-aligned, or partial longs at the start and end only if not, both of which would reduce FIFO hogging.
I've studied how the FIFO probably writes bytes to hub RAM at sysclk/8, e.g. when streaming nibbles from the SD card at sysclk/4. There is a repeating pattern that takes 256 cycles to write 32 bytes, or eight longs. For seven of those longs, the time taken to write four bytes to hub RAM is 9+8+8+8 = 33 cycles; the first byte costs an extra cycle because it starts at the next slice compared to the previous write. For every eighth long, the FIFO writes a word then two bytes, taking 9+8+8 = 25 cycles. The extra cycle at the start of each of the previous seven longs creates an increasing lag, such that two bytes are ready to be written at the start of the eighth long. The accounting checks out: 7x33 + 25 = 256 cycles for 32 bytes, i.e. one byte per 8 cycles, matching the sysclk/8 arrival rate.
@evanh said:
Good point, do the fast copy to cogRAM before starting the streamer but still process the CRC during streamer activity. I'll give that a try ...
Yay, very good! Thanks. Overheads are only up from 8.2% (without read data CRC processing) to 17.1% instead of 53% with the congested fast copy.
EDIT: Ah, that's for sysclock/3. Sysclock/2 isn't so pleasant at 44%.
EDIT2: Sysclock/4 goes from 6.6% to 8.7%. Had previously been around 50% with the congested fast copy.
Cool. Glad it was of some benefit.
Am wondering, if you get a CRC error when you are doing a block read of multiple sectors, whether you should just stop right away and report it as soon as you hit one. Why even continue reading at that point? If you do stop immediately, there should be a way to identify which sector had the error too, perhaps by reporting the number of bytes/sectors read up to that point. That way some high-level program (e.g. one intended for low-level data examination or doing filesystem recovery) could at least continue beyond the bad sector if it ever needed to. Also, for applications like video streaming, read sector CRC errors could probably just be ignored, so the CRC check could be avoided entirely on reads (which would slightly increase performance too). Ideally CRC checking would be settable per read operation, via some parameter.
I'll integrate with the standard C file handling layer as needed. Eric will need to give me pointers on that and I wouldn't be surprised if some fixes in Flexspin happen along the way. How that ends up presenting to Spin programs is up to Eric.
Ha! I removed one instruction from the CRC processing loop, eliminating the plain MOV, and that drops sysclock/2 from 44% to 40% overheads.
EDIT: Huh, it's done even better for sysclock/3 for some reason. That's now below 11%, so knocked off a good 6% there.
EDIT2: Added comments to MOVBYTS and SPLITB.
// check CRC of prior read SD block
if_a mov pb, ##0x1f6<<19 // register PA for ALTI result substitution
if_a rep @.rend, #(512+8)/4 // one SD block + CRC
if_a alti pb, #0b100_111_000 // next D-field substitution, then increment PB, result goes to PA
if_a movbyts pa, #0b00_01_10_11 // byte swap within longword
if_a splitb pa // 8-nibble order becomes bit order of 4 bytes
if_a setq pa // Q = lane-separated long for CRCNIB
if_a crcnib crc3, poly // each CRCNIB consumes Q[31:28], then Q <<= 4
if_a crcnib crc3, poly // two nibbles per data line, crc3..crc0 track DAT3..DAT0
if_a crcnib crc2, poly
if_a crcnib crc2, poly
if_a crcnib crc1, poly
if_a crcnib crc1, poly
if_a crcnib crc0, poly
if_a crcnib crc0, poly // one CRC16 accumulator per DAT line
.rend
I've adapted the block transmit routine to work at any divider. It needed eight extra instructions inside the cogexec code block. The upside is I can probably finally use a common set of tx and rx routines for everything from the fastest rates, sysclock/2, all the way down to the 400 kHz init rate. The downside is there are now two decent-sized functions locked in lutRAM, and probably not much lutRAM space left. I want to move both their critical assembly sections to Fcache if I can; that would release them entirely from lutRAM. To do this I need to solve stacking the 520-byte CRC buffer on the end of used Fcache in cogRAM.
Sounds versatile @evanh. If you do need more room it might not be that bad, performance-wise, to split your CRC sector buffer into smaller lots for the CRC work, i.e. process 2x256 bytes or 4x128 bytes at a time instead of 1x512 bytes, etc. It should only add a few more hub windows' worth of time for each sector read, and a couple of extra instructions for an extra outer REP loop or the like.
Another possibility, if you are not using the streamer while you do the CRC computation, is to just use the FIFO as the CRC data source as you generate the CRC, with RFLONG or RFBYTE etc. That would save the need to load all the data into COG or LUT RAM and may not introduce much extra time, depending on how you've coded your CRC calculation loop. What is your CRC processing loop right now? Is there a code snippet posted for that? EDIT: ignore, I see your code just above.
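A sketch of that idea, assuming the streamer is idle so the FIFO is free, and reusing the register names from the snippet above (bufaddr is a hypothetical hub address register); only the fetch changes from the posted loop:
        rdfast  #0, bufaddr             // start the FIFO reading at the block's hub address
        rep     @.rend2, #(512+8)/4     // one SD block + CRC, a long per pass
        rflong  pa                      // fetch via FIFO, no cogRAM copy needed
        movbyts pa, #0b00_01_10_11      // byte swap within longword
        splitb  pa                      // 8-nibble order becomes bit order of 4 bytes
        setq    pa
        crcnib  crc3, poly
        crcnib  crc3, poly
        crcnib  crc2, poly
        crcnib  crc2, poly
        crcnib  crc1, poly
        crcnib  crc1, poly
        crcnib  crc0, poly
        crcnib  crc0, poly
.rend2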
Grr, Eric has filled cogRAM too much. The general.md doc says Fcache defaults to 1024 bytes, but alas I discover it's really only 256 bytes. I then tried setting it with --fcache=1024 on the compile line, but this creates an error about not fitting in cogRAM. Then trying lower values I find that the largest achievable is now a measly 336 bytes! I require at least double that to fit the critical code plus the 520-byte buffer for CRC processing.
@evanh said:
Grr, Eric has filled cogRAM too much. The general.md doc says Fcache defaults to 1024 bytes, but alas I discover it's really only 256 bytes. I then tried setting it with --fcache=1024 on the compile line, but this creates an error about not fitting in cogRAM. Then trying lower values I find that the largest achievable is now a measly 336 bytes! I require at least double that to fit the critical code plus the 520-byte buffer for CRC processing.
See above for ideas to reduce COG RAM usage.
@rogloh said:
What is your CRC processing loop right now? Is there a code snippet posted for that? EDIT: ignore, I see your code just above.
Yep, that runs while the streamer is loading the subsequent data block. It's effective for sysclock/3, with some added overhead to handle restarting the streamer.
Yeah, even more overheads.
Ugh, yeah, a problem.
Swap out LUTRAM contents before the CRC processing to make room for the buffer, then read them back in at the end? Adds 128 clocks plus hub window overhead, and needs a 512-byte scratch space.
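That swap is just a pair of SETQ2 block transfers. A minimal sketch, assuming ptra points at the hypothetical 512-byte hub scratch buffer:
        setq2   #128-1                  // 128 longs = 512 bytes of lutRAM
        wrlong  0, ptra                 // save LUT $000..$07F to the hub scratch space
        // ... reuse that half of lutRAM as the CRC workspace ...
        setq2   #128-1
        rdlong  0, ptra                 // restore the lutRAM contents afterwards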
Hmm, the tx data block critical code is 236 bytes. It's a lot bigger than the rx data block code because of unrolled duplication. That's only five instructions from hitting the default 256 bytes of Fcache. I hadn't been thinking I was running out of space.
Huh, you know what, even my existing lutRAM code must be overwriting some of Flexspin's internal cogRAM space. The FIT errors are set at 480 bytes while I'm merrily splashing the whole 520 bytes in there. Not to mention Fcache isn't allowed the whole 480 anyway.
On the partial block small buffer side of things, I'd very much like to avoid doing that. Clashing with the FIFO is really not desirable.
Oh man, you gotta fix your sizes.
Actually, has Eric got a bug in the compiler? FIT should be longword addressing, not bytes. 480 would sure suit a lot better as longword addressing.
Oh, maybe I've got it flipped. --fcache may be specified in longwords now. I've been thinking there is no space when there is four times that. So Fcache can go up to 336 x 4 = 1344 bytes ...
If it's 1344 bytes, surely that's going to be enough?
Yep, seems stable now. I've set --fcache=256, which I suspect it wasn't defaulting to even though I thought I'd seen 256 somewhere. I've hardcoded the CRC buffer to address 256 - 130 = 126, which gives me plenty of spare room for code still. The tx code is only 59 longwords so far.
Ah, that max of 336 is actually a tad variable. It seems to depend on the number of locals I have declared. Now that I've moved lots of locals into the Fcache'd pasm code block as dat variables, I can increase it beyond 336 if I wanted.
Uh-oh, I went from one lonely warning to 220 of them with one change. The extra 219 warnings are all like this:
sdmode_rdwrblks.c:1121: warning: Second operand to wxpin is a constant used without #; is this correct? If so, you may suppress this warning by putting -0 after the operand
This happened because I adjusted one enum: 0x1e8 (PR8 address) to 118 (arbitrary cogRAM address).
It seems Eric has exceptions for certain known register addresses, and the rest of cogRAM will generate this warning without the special -0 suffix added to each occurrence in the source.
EDIT: Corrected 119 typo to 219.
Well, I've got data block CRC checking fully functioning now. It took some effort and a little more bloat to get it all across the line. Read data CRC processing is started after copying the relevant block+CRC into cogRAM just prior to kicking off the streamer reading the next block over top of the CRC bits in hubRAM. So I kind of had to move the fast copy instructions earlier anyway. Otherwise I'd be trying to find a temporary buffer for the next block to avoid clobbering the CRC bits, or back to trying to use the FBLOCK instruction to split off the CRC bits into another buffer.
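In outline, each block of a burst now goes through something like the sequence below. A sketch only: crcbuf, blkaddr, rx_mode and crc_check are stand-in names, not the actual code.
        setq    #130-1                  // one block + CRC = 520 bytes = 130 longs
        rdlong  crcbuf, blkaddr         // fast copy prior block+CRC into cogRAM first
        add     blkaddr, #512           // next block lands over top of the old CRC bits
        wrfast  #0, blkaddr             // point the FIFO at the new write address
        xinit   rx_mode, #0             // kick off the streamer for the next block
        call    #crc_check              // CRCNIB loop overlaps the streamer activity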
The only limitation is that clocking the SD bus at sysclock/2 is now pointless: the CRC compute routine brings it down to par with sysclock/3. Sysclock/2 wasn't a lot faster anyway; the overheads were always relatively larger. The small speed sacrifice is worth the integrity, imho. The speed tests are coming out pretty good. Still getting in excess of 40 MB/s at 300 MHz sysclock, and even more above that. So still better than 10x what the shared SPI SD boot wiring allows.
Repeat caveat: This is direct raw SD block reads. Filesystem layers still to be integrated.
Also, I haven't implemented a solution for checking the CRC bits of the final block of each consecutive sequence; the parent hubRAM buffer won't hold them. I'll try to add an exception, in the rx data routine, that redirects those last CRC bits, probably by stopping the clock then restarting the FIFO+streamer with a new buffer address. This'll apply to CMD17 single block reads also.
Still more testing work to do here. There is an alternative rx data solution that hasn't been kept maintained, and is more complex, but has a small speed advantage that I want to resurrect.
On a separate note:
The first pass of reading sequential blocks from both the SanDisk cards is always slower than a repeat of the same blocks. The SanDisk cards have always had this behaviour. I'm guessing there's a way to inform an SD card of what is planned. Might just be a case of swapping out the use of CMD12 (STOP_TRANSMISSION) at the end for CMD23 (SET_BLOCK_COUNT) at the beginning. Whether this will improve the responsiveness of the SanDisk cards, I don't know.
I have yet to try CMD23. It might be a bust because CMD23 looks to be part of UHS and therefore possibly ignored until a UHS mode is engaged.