Discussing SD Drivers (depreciated fsrw)

Rayman · 2020-05-08 23:29

Uh Oh... Upon closer inspection seeing some occasional errors in the last bits of bytes...

Cluso's version doesn't have this... I may have optimized too much...

Rayman · 2020-05-08 23:34

Copied the timing in from Cluso's example and seems all better now:

        drvl    c
        waitx   #2    '' Copied from Cluso's
        drvh    c
        waitx   #3
        testp   o               wc

It's weird with P2 that you have to test the input several clocks after you'd think you should...

Cluso99 · 2020-05-09 01:00

@Rayman,
Remember that there is a significant delay in the card preparing the read/write sector routine which may outweigh any gains.
If there is a significant speed gain then I'll shift the send/recv routine to cog at runtime which should take less than 16 longs.

Currently I have two waitx instructions, a #2 and a #3. I'll need to find the thread where I discussed the clocking delays with Chip for read and write to pins. IIR there is a 2 clock delay after the OUT instruction completes before the data arrives at the pin (output) and a 3 clock delay before the data appears at the start of the IN instruction (ie it reads 3 clocks before the IN instructions

Last night I posted SDDriver_code_209.spin2 which has the fsrw calls so that sdspi_bashed.spin2 (or a replacement) isn't required.

BTW I force $0 status on the start/initialise because of the fastspin bug. pnut only works with P59 pullup enabled.

jurop · 2020-05-09 06:31

Cluso99 wrote: »

@Rayman,
IIR there is a 2 clock delay after the OUT instruction completes before the data arrives at the pin (output) and a 3 clock delay before the data appears at the start of the IN instruction (ie it reads 3 clocks before the IN instructions

Yep - it's in the docs now. I report it here since this could be one of P2 traps:

When a DIRx/OUTx bit is changed by any instruction, it takes THREE additional clocks
after the instruction before the pin starts transitioning to the new state. 
Here this delay is demonstrated using DRVH:

                 ____0     ____1     ____2     ____3     ____4     ____5     
Clock:          /    \____/    \____/    \____/    \____/    \____/    \____/
DIRA:           |         |  DIRA-->|   REG-->|   REG-->|   REG-->| P0 DRIV |
OUTA:           |         |  OUTA-->|   REG-->|   REG-->|   REG-->| P0 HIGH |
                |                   |
Instruction:    | DRVH #0   

When an INx register is read by an instruction, it will reflect the state of the pins
registered THREE clocks before the start of the instruction. 
Here this delay is demonstrated using TESTB:

                 ____0     ____1     ____2     ____3     ____4     ____5     
Clock:          /    \____/    \____/    \____/    \____/    \____/    \____/
INA:            | P0 IN-->|   REG-->|   REG-->|   REG-->|   ALU-->|   C/Z-->|
                                                        |                   |
Instruction:                                            | TESTB INA,#0      |

When a TESTP/TESTPN instruction is used to read a pin, the value read will reflect the state
of the pin registered TWO clocks before the start of the instruction. 
So, TESTP/TESTPN get fresher INx data than is available via the INx registers:

                 ____0     ____1     ____2     ____3     ____4     
Clock:          /    \____/    \____/    \____/    \____/    \____/
INA:            | P0 IN-->|   REG-->|   REG-->|   REG-->|   C/Z-->|
                                              |                   |
Instruction:                                  | TESTP #0          |

evanh · 2020-05-09 11:19

Rayman,
I just had a quick nosy at the low-level byte read/write loop of sdspi_bashed.spin2 and afaics it is cycling at 26 sysclocks per bit! And I think that's actually been trimmed down from sysclock/33 too. No wonder Peter's code whips this one hands down. His is ticking over at /10 last I read.

It's possible to achieve /8 still completely bit-bashed ... but if the streamer is an option then going dualSPI and sysclock/2 is up for grabs - Maybe up to 30 MB/s.

EDIT: Correction, dualSPI at sysclock/2 can be done with just polling of smartpins when it comes to block transfers. The streamer can be left alone if desired.

EDIT2: Err, reading a little further, I see the block reads in sdspi_bashed.spin2 aren't as bad as the byte read/writes. Blocks have already been reduced to 15 sysclocks per bit.

Cluso99 · 2020-05-09 11:27

Yes, and IIRC the Rev B silicon added an extra clock to make it 3 clocks following the completion of the instruction from 2 in the Rev A silicon.

Rayman · 2020-05-09 17:08

I switched the SPI block reader code to @"Peter Jakacki" way, but still stuck at 920 kB/s.
Seems I'll need to bite the bullet and use an extra cog to get that 3000 kB/s...

Can get to 1104 kB/s by changing clock from 250 to 300 MHz...

Rayman · 2020-05-09 19:14

I'm having a bad day...
Had code working at 300 MHz clock and was all happy.
But, I had two copies of sdspi_bashed.spin2 open and saved the wrong one on top of the working one...
Now, I can't get to work past 250 MHz. Kills me when I had it in my hand and then messed up... Wasting a lot of time...

Think it's back! I have to remember to never open two files with the same name again...

Rayman · 2020-05-09 21:59

Here's a revised test program that can read faster.
It reloads a bitmap from SD card over and over again now.
This way, it's obvious if there are any bad bits.

Using ideas from @Cluso99 and @"Peter Jakacki" to get this read loop:

        rep     #.end_read, #8
        drvl    c
        rcl     x,#1  ' shift in msb first (first dummy zero)    
        waitx   #2          
        drvh    c 
        nop
        testp   o  wc  
.end_read

Now gets to 1108 kB/s @ 300 MHz.

From what @"Peter Jakacki" has said, looks like we can get to 3000 kB/s if we use a dedicated cog.
But, I guess this isn't so bad for not using a cog.

Ariba · 2020-05-09 22:18

Do you use single block read or multiblock read?
I think the maximum read speed depends mainly on the SD card. I have a 4GB Sandisc, that goes not over 200 kB/s for single blocks, and about 1.4 MB/s for multiblocks.
Making the SPI speed faster does only help to read the data, but the time between sending the command and getting the data is decided by the card.

Andy

Wuerfel_21 · 2020-05-09 22:32

Ariba wrote: »

Do you use single block read or multiblock read?
I think the maximum read speed depends mainly on the SD card. I have a 4GB Sandisc, that goes not over 200 kB/s for single blocks, and about 1.4 MB/s for multiblocks.
Making the SPI speed faster does only help to read the data, but the time between sending the command and getting the data is decided by the card.

Andy

Cards bearing the A1 (or A2) symbol (for "Application Performance Class") are guaranteed to be pretty fast at responding to commands. A1 guarantees 1500 random reads per second. A2 implies A1 and A1 also implies Class10.

evanh · 2020-05-09 23:42

Huh, that's still up at sysclock/14. That's not the same code I remember from Peter.

Peter Jakacki · 2020-05-09 23:50

While my SPIRX and other SPI code runs from cog, they do not have a dedicated cog. The trick is to also do multi-block read with the CMD 18 and I always treat my files as non-fragmented, which they always are. You will lose too much speed trying to follow a cluster chain which doesn't need following in my experience. Besides if you are really worried, check the file for fragmentation and mark it as such if it is, which it won't be.

Here is my SDRDS code, you should be able get an idea of the flow, and SDRDBLK calls SPIRX. All the source code and assembler files are in my TAQOZ Dropbox.

--- read multiple sectors in continuous multiblock mode -- update @sdrd pointer
pub SDRDS ( sector dst bytes --  )
---	convert bytes to sectors
	B>S
---	multiblock read
	-ROT SWAP 18 CMD
	IF --- command not accepted
	  2DROP FALSE
	ELSE ( sectors dst )--- process read token and read block if available '
	  DAT?
	  IF --- data available, read in a block
	    SWAP FOR DUP SDRDBLK DROP 512 + SDWAIT ?NEXT
	  ELSE --- no more data available, terminate
	    2DROP SDSTAT DROP SPICE FALSE
	  THEN
	THEN
---	update the read index
	RELEASE  @sdrd !
---	cancel multiblock read on error
	  2000 BEGIN 1-   0 12 CMD 0=   OVER 0=   OR UNTIL DROP
	RELEASE
	;

rogloh · 2020-05-09 23:56

The trick is to also do multi-block read with the CMD 18 and I always treat my files as non-fragmented, which they always are. You will lose too much speed trying to follow a cluster chain which doesn't need following in my experience. Besides if you are really worried, check the file for fragmentation and mark it as such if it is, which it won't be.

Is that assumption because in general people won't fill up a large SD card and trigger fragmentation?

Peter Jakacki · 2020-05-10 00:03

rogloh wrote: »

The trick is to also do multi-block read with the CMD 18 and I always treat my files as non-fragmented, which they always are. You will lose too much speed trying to follow a cluster chain which doesn't need following in my experience. Besides if you are really worried, check the file for fragmentation and mark it as such if it is, which it won't be.

Is that assumption because in general people won't fill up a large SD card and trigger fragmentation?

If I was using any old SD card with any old Windows files that anybody could stick in their PC I probably might not assume that. However, all the SD cards I've tested have come back with zero fragmentation, and certainly the ones I use. We don't just stick any old chip in place of the P2, we know what we want and need. The same too with the SD card, it's totally in your control. If you have a lot of system files on there that Windows is messing with, then they "might" be fragmented, but when data files are written, they never seem to be so.

note: If I am logging to a file I preallocate or at the very least use a file at the end of the used area if no other files are going to be created, which in an embedded environment, I would surely know about. If the preallocated file fills up and I need more, I then create another sequentially numbered file and preallocate. Even 32MB is a tiny file in a 32GB card, and it sure takes a long time to fill up normally. Besides, I treat files as virtual memory, so I can't and won't have them fragmented.

evanh · 2020-05-10 01:17

Just to nit-pick, FDDs, HDDs and SSDs are physical memory. Either all files are virtual by definition, or none are.

Peter Jakacki · 2020-05-10 01:23

evanh wrote: »

Just to nit-pick, FDDs, HDDs and Flash are physical memory. Either all files virtual by definition, or none are.

nit-pick away. Valid points are valid points.
But I mean virtual "memory" that I can address and access in the same way as hub memory. In TAQOZ, when I say $4000 C@ I read a byte from $4000 in hub memory, but when I say $4000.0000 SDC@ I read from the currently open file up to 4GB. If I type $4000.0000 256 SD DUMP it will dump 256 bytes from that file the same as it would if it were hub.

evanh · 2020-05-10 01:35

So the file is the physical store used by a virtual system. Like how a file uses blocks.

Rayman · 2020-05-10 10:32

I am doing multi block reads for the fast read mode and also just reading sequential files.

I do admire the way FSRW 2.6 did it... they pre read sequential and then checked that this is what was requested.

Btw I wonder if we should be doing crc checks now that we have hardware help...

Wuerfel_21 · 2020-05-10 11:02

Peter Jakacki wrote: »

rogloh wrote: »

The trick is to also do multi-block read with the CMD 18 and I always treat my files as non-fragmented, which they always are. You will lose too much speed trying to follow a cluster chain which doesn't need following in my experience. Besides if you are really worried, check the file for fragmentation and mark it as such if it is, which it won't be.

Is that assumption because in general people won't fill up a large SD card and trigger fragmentation?

If I was using any old SD card with any old Windows files that anybody could stick in their PC I probably might not assume that. However, all the SD cards I've tested have come back with zero fragmentation, and certainly the ones I use. We don't just stick any old chip in place of the P2, we know what we want and need. The same too with the SD card, it's totally in your control. If you have a lot of system files on there that Windows is messing with, then they "might" be fragmented, but when data files are written, they never seem to be so.

note: If I am logging to a file I preallocate or at the very least use a file at the end of the used area if no other files are going to be created, which in an embedded environment, I would surely know about. If the preallocated file fills up and I need more, I then create another sequentially numbered file and preallocate. Even 32MB is a tiny file in a 32GB card, and it sure takes a long time to fill up normally. Besides, I treat files as virtual memory, so I can't and won't have them fragmented.

I pretty much always get my SD cards fragmented when doing something like this:
1. Copy some File A to the card
2. Copy some File B to the card
3. Copy new larger version of File A to card
4. File A is now fragmented and my code complains

Just have to remember to defragment in such cases. (For Windows, you can even get a neat little command line program called CONTIG that can defragment individual files of your choosing)

Rayman · 2020-05-13 01:09

Got the multiblock read speed up to 2400 kB/s (with 300 MHz P2 clock) by using a separate cog for the SPI. Definitely the way to go for any attempt at video...

rogloh · 2020-05-13 01:20

That seems a lot better now, Rayman.

What sort of interface is it for the transfer commands between the requesting COG and the driver COG? If it is a type of mailbox, perhaps there is an opportunity over time to add multiple clients too. Though I expect locks may then be needed for some file systems, particularly if there are multiple simultaneous writers depending on how granular your file system requests are.

Rayman · 2020-05-13 16:24

There is some kind of mailbox system in FSRW... The last P1 version added the ability to have several instances of FSRW open from one main cog. Then, you can read from several files. I seem to remember that you can only have one file open for writing though...

Rayman · 2020-05-13 16:26

For some reason, reading blocks directly (like I'm doing for video) messed up the ability to open a new file.
FSRW is kind of complicated and I'm not able to figure out why...
So, I made a new function "remount" that is just the bottom part of "mount" without the actual mounting part.
Just reads in first sector and does the cluster math...

Starts out like this:

pub remount() : r | start, sectorspercluster, reserved, rootentries, sectors
{{
'   Re-Mount a volume in order to more safely open a new file.  Sometimes necessary after doing things like reading blocks directly.
}}

   lastread := -1
   dirty := 0
   sdspi.readblock(0, @buf)

Seems to work...

Electrodude · 2020-05-13 16:48

Since the P2 is so much different from the P1, with its smartpins, streamer, FIFO, block transfers, etc., and since FSRW seems to be rather complicated in order to be efficient on the P1, have you considered starting from scratch or at least starting with a gutted skeleton of FSRW instead of trying to port it? It sounds like it would be a lot less work.

Rayman · 2020-05-13 16:59

There are two parts to FSRW... There's FSRW.spin2 itself, then there's the block reader.
The block reader is platform dependent and has been rewritten more or less from scratch by @cheezus (with tweaks by myself).

But, FSRW is platform independent. C version even works on PC...

Rayman · 2020-05-13 18:48

Here's the test program that gives 2000 kB/s with 250 MHz clock and 2400 kB/s with 300 MHz clock.

Just put the bitmap2.bmp file onto SD card and run test program
It repeatedly reloads the bmp file and says how fast over serial port.

Compiled with FastSpin/FlexGui 4.1.8

Peter Jakacki · 2020-05-13 22:45

Rayman, once I open a file I get the file size and the first sector. After that I don't worry about the file system and use the multi-block command to read in all the sequential sectors into memory in one hit. Never fails. If you try it you will probably kick yourself for not trying it sooner.

Actually, I read in the first sector and check the header and work out the offset into memory after which I block read into that offset address so it all lines up, and then do a quick vertical flip for unadjusted bmp files.

Rayman · 2020-05-13 23:23

Peter, that's what that test program above does when the "sd.FasterBlocksRead((s+511)/512,p)" option is selected...
Thanks for pointing out how much faster this could be!

What I'm doing now is packaging BMP files into sector sized packets and mixing in audio in sector size packets.
Found free tools to create video and audio.
I'm using Handbrake to resize video. Then, use VirtualDub to extract frames and wav audio. Then, use Irfanview to create indexed bmp files and also flip the images.

Peter Jakacki · 2020-05-13 23:47

Great!

At present I have a separate audio track from my packed bmp video file, so BENDER.BMV has a corresponding BENDER.WAV. No audio track, no problem. Since we are only working with sector numbers once the files are open, then it is easy to handle the two streams with larger wav buffers than normal to maintain the audio while a bmp frame is being read. After a frame is read I check to see if I need more audio, and if so I then do a block read of the wav file.

But I will be interested to see what you do with the mixed video and audio.

Yes, there are plenty of free tools and there are a number of ways. One of the ways is to let VLC extract the sequential frames as png files, and then use Xnconvert to batch convert these to bmp frames and then cat them together into one big file. But I'm in the process of redoing some videos so I will see if I can automate it a bit better this time.

Discussing SD Drivers (depreciated fsrw)

Comments