FSRW for eMMC (with 8-bit bus) Now at 28 MB/s! (example code posted)
Rayman
It works!
Just got it working. But read speed is only 2 MB/s
Going to fix that up shortly.
Should be 25 MB/s soon and maybe 300 MB/s before long...
Update: I've got it up to 23.7 MB/s. Maybe that's good enough?
Update2: Moving basepin over to P32 gets me up to 28 MB/s (faster than VGA pixel clock now) !!!
Update3: Attaching code that demonstrates the 28 MB/s read speed. Compiles with FastSpin 4.1.8 (and maybe PNut, haven't checked)
Attachment: bmp (301K)
Comments
The thing is that it works, and now that it works, you have something to work with
You are working on the five Fs - Function First, Fancy Features Follow
Even if the most you got was 25 MB/s, that would be a very good incentive to integrate eMMC into a board. I will see what I need to do to add that to my boards; I just have to find a good source for the chips first.
On every uSD card I've looked at, sector #0 on the card is just an MBR and pretty much useless except for a pointer to the actual sector# of the volume boot record.
FSRW seems to allow for both cases though.
Here, sector#0 seems to be the actual MBR&VBR with all the info you need. No idea why that is... Except maybe Windows somehow knows it can't be bootable or something?
EDIT: Even if you reformatted the SD card, you probably just formatted the partition, leaving the MBR mostly alone, right? Whereas the eMMC was completely blank, right?
So, I use the Spin driver to start commands and now use the assembly just for reading in the blocks.
Read speed for a test .bmp file is now at 23.7 MB/s.
Update: meant MB/s, not kB/s
Very good, how does the write speed compare?
BTW: Moving my contraption over to have the data bus start at P32 lets me shave some assembly instructions off.
Now at 28 MB/s !!!!
If 1.8V input works, might get to 150 MB/s...
BTW: I think I'm seeing that a 16:9 widescreen movie (like Sintel) can fit a 480p 16bpp frame within hub ram.
This would be really nice because then HDTV could upscale it and it might look pretty good...
Think I might have just enough bandwidth to pull it off.
There are two main CRC16 algorithms, an IBM one and just to be different CCITT chose a different one. Then Microcom implemented a flawed CCITT16 version as used in XMODEM. You'll need to find out which one is used.
Once you have that, you'll need to work out which value is preset, and whether you need to (IIRC invert?) the bits at the end. These are just different ways the same algorithm was implemented.
Google is your friend here. And there are online CRC calculators, so you can verify you are getting the right results before you try to implement it in your routines.
Here are a couple of links
https://en.wikipedia.org/wiki/Cyclic_redundancy_check
srecord.sourceforge.net/crc16-ccitt.html
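To make the "which preset, which bit order" question concrete, here is a minimal Python sketch (Python is used only for illustration; the function names are made up for this example and are not from any driver in this thread). It computes a CRC16 MSB-first with the CCITT polynomial 0x1021 and LSB-first with the reflected IBM polynomial 0xA001, with the preset and final XOR as parameters. As far as I know the SD/MMC data lines use the CCITT polynomial with a zero preset, but that should be checked against the spec or an online calculator rather than taken from this sketch.

```python
def crc16_msb_first(data, poly=0x1021, init=0x0000, xor_out=0x0000):
    """Bitwise CRC16, most significant bit first (CCITT-style)."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc ^ xor_out


def crc16_lsb_first(data, poly_reflected=0xA001, init=0x0000, xor_out=0x0000):
    """Bitwise CRC16, least significant bit first (IBM/ARC-style)."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ poly_reflected
            else:
                crc >>= 1
    return crc ^ xor_out


# Standard check string used by most online CRC calculators.
CHECK = b"123456789"
xmodem = crc16_msb_first(CHECK)                    # CCITT poly, preset 0x0000 -> 0x31C3
ccitt_ffff = crc16_msb_first(CHECK, init=0xFFFF)   # CCITT poly, preset 0xFFFF -> 0x29B1
ibm_arc = crc16_lsb_first(CHECK)                   # IBM poly, reflected       -> 0xBB3D
```

Comparing these three values against what the card sends for a known block should quickly show which variant is in play.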
16:9 widescreen for 640 pixel wide source is 450kB at 16bpp and may just fit. How well does video look at 16bpp, compared to an optimized 8 bit palette per frame I wonder? It might have to quantize the colours a little but I'm sure it is faster to convert.
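The 450 kB figure can be checked with a few lines of arithmetic. This is only a sketch: the 512 KB hub RAM size and the 24 frames per second rate are assumptions added here for the comparison, and the helper name is made up.

```python
def frame_bytes(width, aspect_w, aspect_h, bits_per_pixel):
    """Bytes needed for one frame of the given width and aspect ratio."""
    height = width * aspect_h // aspect_w
    return width * height * bits_per_pixel // 8


HUB_RAM_BYTES = 512 * 1024        # assumption: 512 KB of hub RAM
FPS = 24                          # assumption: film frame rate

size = frame_bytes(640, 16, 9, 16)    # 640 x 360 at 16bpp
size_kib = size / 1024                # 450.0
fits_in_hub = size < HUB_RAM_BYTES    # True, with about 62 KB to spare
stream_rate = size * FPS              # bytes per second needed from the eMMC
```

At 24 frames per second that works out to roughly 11 MB/s, which is under the 28 MB/s read speed reported above, though it leaves very little hub RAM for anything else.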
Is there even such a thing as an optimal 8 bit colour palette for a given frame? How would it be measured? Minimum least square error sum over all pixels vs ideal value? I imagine this is probably some field of study that PhD students could have worked on.
I guess that would be a good way to measure how well a palette is suited to an image. However, there are actually multiple ways to measure the difference between two pixels. RGB difference is ok, but CIELAB difference is better.
There are a couple of different algorithms for generating an optimized palette for an image and then a couple different ones for quantizing an image onto a given palette (how do you find the closest color, do you do dithering, etc). Quite interesting subject.
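As a rough illustration of the "least square error" idea, the sketch below maps each pixel to its nearest palette entry by plain RGB distance and sums the squared error over the frame. All names are hypothetical, and it uses RGB distance only; a CIELAB version would convert the colours first and is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_colour(palette, pixel):
    """Palette entry closest to the pixel in RGB space."""
    return min(palette, key=lambda colour: squared_distance(colour, pixel))


def palette_error(pixels, palette):
    """Sum of squared errors when every pixel is snapped to the palette."""
    return sum(squared_distance(p, nearest_colour(palette, p)) for p in pixels)


# Tiny example: a black/white palette applied to three pixels.
pixels = [(0, 0, 0), (255, 255, 255), (250, 250, 250)]
palette = [(0, 0, 0), (255, 255, 255)]
error = palette_error(pixels, palette)   # only the near-white pixel contributes
```

A lower total means the palette suits the frame better under this particular measure; two palettes can rank differently if the distance function is swapped.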
Of course, with video frames there is the additional consideration that the frames are shown in sequence...
Yeah the colours could fuzz a bit from frame to frame even if each frame is optimal. How well video looks with only 5 or 6 bits of each primary colour I'm not sure. But give it a go.
With a HyperRAM frame buffer you'll be able to have the video frame shown in 24 bit colour though it can still waste 8 of the 32 bits per pixel in your storage, depending on whether you might allocate other COG(s) to decompress 24 -> 32 bits. It's probably fairly intensive for even P2 COGs to do that on every pixel in real time however, e.g. 5 instructions per pixel to re-arrange and write to LUT buffer before writing back to HUB (so ~10 clocks plus 1 clock per pixel to write back, if the read fifo keeps up). Perhaps one COG could do it at 250MHz for VGA width and line frequency. It will boost your video storage by 33%.
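For reference, the 24 -> 32 bit expansion being discussed looks like this in a minimal Python sketch. It is not P2 code and says nothing about instruction counts; the byte order and the zero padding byte are assumptions made for the example.

```python
def unpack_rgb24_to_rgb32(packed):
    """Expand packed 3-byte pixels to 4-byte pixels by appending a zero pad byte."""
    if len(packed) % 3:
        raise ValueError("packed data must be a multiple of 3 bytes")
    out = bytearray()
    for i in range(0, len(packed), 3):
        out += packed[i:i + 3]
        out.append(0)              # unused 8 bits of each 32-bit pixel
    return bytes(out)


two_pixels = bytes([1, 2, 3, 4, 5, 6])
expanded = unpack_rgb24_to_rgb32(two_pixels)   # 6 bytes in, 8 bytes out
```

Storing 3 bytes per pixel instead of 4 means the same storage holds 4/3 as many pixels, which is where the 33% figure comes from.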
Wasn't really that way with uSD…
Found a page with CrystalDiskMark results that show the same thing:
https://www.windowscentral.com/emmc-vs-ssd
My own tests with the uSD adapter are much slower, but show the same trend:
If you request sector x, the driver loads sector x into COG RAM, pushes it into the HUB, and then reads the next sector into COG RAM, just in case you might want it. Then it is already there to push to the HUB.
Using the LUT as a sector buffer (reading from SD/eMMC in block mode), one could read, say, 3 or even 4 sectors at once, and if the next request is for the next sector, voila: a setq and rdlut and boom, it's in the HUB.
A) block mode is faster, B) often the next needed sector IS the next sector, C) the driver COG does it in parallel by design.
But anyway, @Rayman, it is amazing what you are achieving and I might order some of those cards to play with.
Thanks,
Mike
For uSD the difference was there, but for eMMC it appears to be a much bigger difference from single-block read....
@Lonesocks The trick is that the SD COG delivers the data and finishes the current command, so the calling COG can do what it has to do while the driver silently loads the next sector in parallel, just in case it is needed.
And if indeed the next sector is needed it has it already ready, sort of.
Same would work with both of your block drivers.
Read from card a block into LUT. Starting with sector asked for. Say two Sectors.
Deliver result from LUT to HUB, Mailbox free. Calling Cog can proceed.
If next asked sector is in LUT just deliver it, else load new block into LUT.
Load the next 2 sectors after delivering the last one in parallel.
So if the next sector is needed it is already there. That is why FSRW is way faster than Kye's driver.
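The buffering steps above can be sketched in Python as follows. This models only the bookkeeping (is the requested sector already buffered or not); it runs sequentially and does not show the parallel prefetch that a driver COG would do. The class name, the `read_blocks` callable and the `ahead` parameter are all made up for the illustration.

```python
class ReadAheadReader:
    """Serve single-sector requests from a small multi-sector buffer."""

    def __init__(self, read_blocks, ahead=2):
        # read_blocks(start, count) stands in for a multi-block card read
        # and must return a list of sector-sized bytes objects.
        self.read_blocks = read_blocks
        self.ahead = ahead
        self.first = None          # sector number of buffer[0]
        self.buffer = []           # stands in for the LUT sector buffer
        self.card_reads = 0        # how many times the card was actually hit

    def read_sector(self, n):
        hit = self.first is not None and self.first <= n < self.first + len(self.buffer)
        if not hit:
            # Miss: fetch a new block starting at the requested sector.
            self.buffer = self.read_blocks(n, self.ahead)
            self.first = n
            self.card_reads += 1
        return self.buffer[n - self.first]


# Fake 16-sector card for demonstration.
card = [bytes([i]) * 512 for i in range(16)]
reader = ReadAheadReader(lambda start, count: card[start:start + count], ahead=4)
sectors = [reader.read_sector(n) for n in range(8)]   # 8 sequential requests
```

With `ahead=4`, eight sequential sector requests touch the card only twice; a random access pattern would miss every time and gain nothing.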
Same goes for writing, copy from HUB to LUT and clear the mailbox. Done. Write from COG to SD/EMMC while calling COG is doing something else.
This lets the calling COG carry on without waiting for completion while the called COG does the work.
The multi core idea only works if the guys ARE running in parallel. Calling another COG and waiting in a loop for it to finish does not help a bit. You are just running one core at a time, nothing parallel.
anyways,
Mike
Reading the regular way would be faster. Can put that on the todo list for somebody.
There's a lot still to do (like implementing writing to eMMC).
The bmp file below should first be loaded onto the eMMC (can use a uSD adapter to do that).
With some help from the forum, got CRC7 in a much better place. Can be moved to assembly now.
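For anyone checking a CRC7 routine before porting it to assembly, here is a minimal bitwise reference in Python. It is not the code from the attachment. It uses the command-frame CRC7 as I understand it for SD/MMC (polynomial x^7 + x^3 + 1, zero preset); the function names are made up for the example.

```python
def crc7(data):
    """Bitwise CRC7, MSB first, polynomial x^7 + x^3 + 1 (0x09), preset 0."""
    crc = 0
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            top = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if top ^ bit:
                crc ^= 0x09
    return crc


def command_frame(index, argument):
    """Build a 6-byte command frame: start bits + index, 32-bit argument, CRC7 + stop bit."""
    body = bytes([0x40 | (index & 0x3F)]) + argument.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


cmd0 = command_frame(0, 0)   # the familiar 40 00 00 00 00 95 reset frame
```

The CMD0 frame ending in 0x95 is a handy known-good value to compare an assembly version against.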
Otherwise, you might have to power cycle to get it into a known state. But, that might be OK, guess it depends...