FSRW for eMMC (with 8-bit bus) Now at 28 MB/s! (example code posted) — Parallax Forums

RaymanRayman Posts: 14,867
edited 2020-06-04 17:30 in Propeller 2
It works!
Just got it working. But read speed is only 2 MB/s :(

Going to fix that up shortly.
Should be 25 MB/s soon and maybe 300 MB/s before long...

Update: I've got it up to 23.7 MB/s. Maybe that's good enough?
Update2: Moving basepin over to P32 gets me up to 28 MB/s (faster than VGA pixel clock now) !!!
Update3: Attaching code that demonstrates the 28 MB/s read speed. Compiles with FastSpin 4.1.8 (and maybe PNut, haven't checked)

Comments

  • Rayman wrote: »
    It works!
    Just got it working. But read speed is only 2 MB/s :(

    Going to fix that up shortly.
    Should be 25 MB/s soon and maybe 300 MB/s before long...

    The thing is that it works, and now that it works, you have something to work with :)
    You are working on the five Fs - Function First, Fancy Features Follow

    Even if the most you got was 25MB/s, that would be a very good incentive to integrate eMMC into a board. I will see what I need to do to add that to my boards; I just have to find a good source for the chips first.
  • Yes 25MB/s+ could make a pretty decent P2 filesystem and even streaming raw 24 bpp video at SD resolutions starts to look doable if rates above that are sustainable.
  • RaymanRayman Posts: 14,867
    I did notice one strange/interesting thing that I don't really understand...

    On every uSD card I've looked at, sector #0 on the card is just an MBR and pretty much useless except for a pointer to the actual sector# of the volume boot record.
    FSRW seems to allow for both cases though.

    Here, sector#0 seems to be the actual MBR&VBR with all the info you need. No idea why that is... Except maybe Windows somehow knows it can't be bootable or something?
  • Wuerfel_21Wuerfel_21 Posts: 5,141
    edited 2020-05-30 18:12
    You formatted both on Windows through an SD reader and get different formatting results? odd.

    EDIT: Even if you reformatted the SD card, you probably just formatted the partition, leaving the MBR mostly alone, right? Whereas the eMMC was completely blank, right?
  • RaymanRayman Posts: 14,867
    Maybe windows didn't do an MBR because the chip was completely blank...
  • When I format an SD card to FAT32 on the P2 in TAQOZ, the MBR really only has the partition tables and signature. I suppose if I say the first sector for the partition is 0, I could write the VBR to sector 0 because it has the same signature, and really only takes up about the first 100 bytes. I will grab another card later and force it to have "0" hidden bytes by setting the start of partition 0 to sector 0. No reason it shouldn't work.
    TAQOZ# .MBR --- 
                                                                                                                   
                       *** MBR ***  
        PARTITION....................... 0 00 INACTIVE 
        FILE SYSTEM..................... FAT32 LBA 
        CHS START....................... 1023,254,63 
        CHS END......................... 0,0,0 
        FIRST SECTOR.................... $0000_8000  
        TOTAL SECTORS................... 124,702,720 = 63,847MB
    
    TAQOZ# $8000 OPEN-SECTOR ---  ok 
    TAQOZ# 0 $200 SD DUMP ---   
    00000: EB 5B 90 54  41 51 4F 5A  20 50 32 00  02 40 20 00     '.[.TAQOZ P2..@ .'                               
    00010: 02 00 00 00  00 F8 00 00  3F 00 FF 00  00 80 00 00     '........?.......'                               
    00020: 00 D0 6E 07  76 3B 00 00  00 00 00 00  02 00 00 00     '..n.v;..........'                               
    00030: 01 00 06 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'                               
    00040: 80 01 29 01  02 69 62 50  32 20 43 41  52 44 20 20     '..)..ibP2 CARD  '                               
    00050: 20 20 46 41  54 33 32 20  20 20 00 00  00 00 00 00     '  FAT32   ......'                               
    00060: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................' 
    <snip>
    001E0: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'                               
    001F0: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 55 AA     '..............U.' ok
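
    The dump above can be decoded by hand, but a small host-side sketch makes the MBR-versus-VBR question concrete. This is only an illustration in Python: the function names are made up, the check is a simple heuristic (jump opcode, sane bytes-per-sector, 55 AA signature) and not what FSRW actually does, and the snipped region of the dump is assumed to be all zeros.

```python
import struct

def looks_like_fat_vbr(sector):
    """Heuristic: does this 512-byte sector look like a FAT volume boot record?"""
    if len(sector) < 512 or sector[510:512] != b"\x55\xAA":
        return False
    # A VBR begins with an x86 jump (EB xx 90 or E9 xx xx); a bare MBR need not.
    has_jump = (sector[0] == 0xEB and sector[2] == 0x90) or sector[0] == 0xE9
    bytes_per_sector = struct.unpack_from("<H", sector, 11)[0]
    return has_jump and bytes_per_sector in (512, 1024, 2048, 4096)

def parse_fat32_bpb(sector):
    """Pull the main FAT32 BIOS Parameter Block fields out of a VBR."""
    bps, spc, reserved, nfats = struct.unpack_from("<HBHB", sector, 11)
    hidden, total, fat_size = struct.unpack_from("<III", sector, 28)
    root_cluster = struct.unpack_from("<I", sector, 44)[0]
    return {
        "oem_name": sector[3:11].decode("ascii"),
        "bytes_per_sector": bps,
        "sectors_per_cluster": spc,
        "reserved_sectors": reserved,
        "num_fats": nfats,
        "hidden_sectors": hidden,
        "total_sectors": total,
        "sectors_per_fat": fat_size,
        "root_cluster": root_cluster,
        "fs_type": sector[82:90].decode("ascii"),
        # First data sector, relative to the start of the volume.
        "data_start": reserved + nfats * fat_size,
    }

# First 0x60 bytes copied from the TAQOZ dump; the rest is assumed to be zero.
head = bytes.fromhex(
    "EB 5B 90 54 41 51 4F 5A 20 50 32 00 02 40 20 00"
    "02 00 00 00 00 F8 00 00 3F 00 FF 00 00 80 00 00"
    "00 D0 6E 07 76 3B 00 00 00 00 00 00 02 00 00 00"
    "01 00 06 00 00 00 00 00 00 00 00 00 00 00 00 00"
    "80 01 29 01 02 69 62 50 32 20 43 41 52 44 20 20"
    "20 20 46 41 54 33 32 20 20 20 00 00 00 00 00 00"
)
sector = head + bytes(510 - len(head)) + b"\x55\xAA"

if looks_like_fat_vbr(sector):
    for name, value in parse_fat32_bpb(sector).items():
        print(name, value)
```

    The total-sector field decodes to the same 124,702,720 that the `.MBR` listing reports, which is a handy cross-check that the offsets are right. A sector 0 that passes this check can be treated as the volume itself; otherwise the partition table would be consulted for the real start sector.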
    
  • Cluso99Cluso99 Posts: 18,069
    Yes, the later SD cards I've seen have nothing in the MBR other than the partition table and the signature. They used to have MS-DOS boot code, which often just gave an error reply along the lines of "not a bootable device".
  • RaymanRayman Posts: 14,867
    edited 2020-05-31 21:15
    I had some trouble with a pure assembly version of the driver...

    So, I use the Spin driver to start commands and now use the assembly just for reading in the blocks.

    Read speed for a test .bmp file is now at 23.7 MB/s.

    Update: meant MB/s, not kB/s
  • RaymanRayman Posts: 14,867
    Thinking I can do 480p video at 30fps and 16-bit color now...
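
    A quick back-of-envelope check of that claim, as a Python sketch. It assumes "480p" means 640 x 480 (a 720- or 854-pixel-wide frame would need proportionally more) and ignores audio and any file system overhead.

```python
def video_bandwidth_mb_s(width, height, bits_per_pixel, fps):
    """Raw (uncompressed) video data rate in MB/s, with 1 MB = 1,000,000 bytes."""
    bytes_per_frame = width * height * bits_per_pixel / 8
    return bytes_per_frame * fps / 1_000_000

needed = video_bandwidth_mb_s(640, 480, 16, 30)
for measured in (23.7, 28.0):
    print(f"need {needed:.3f} MB/s, have {measured} MB/s, headroom {measured - needed:.3f} MB/s")
```

    Under those assumptions the stream needs a little over 18 MB/s, so both read speeds reported in this thread leave some margin.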
  • Rayman wrote: »
    I had some trouble with a pure assembly version of the driver...

    So, I use the Spin driver to start commands and now use the assembly just for reading in the blocks.

    Read speed for a test .bmp file is now at 23.7 MB/s.

    Update: meant MB/s, not kB/s

    Very good, how does the write speed compare?
  • RaymanRayman Posts: 14,867
    edited 2020-06-01 01:18
    I'm sure it will be good, but I haven't even tried that yet...

    BTW: Moving my contraption over to have the data bus start at P32 lets me shave some assembly instructions off.
    Now at 28 MB/s !!!!

    If 1.8V input works, might get to 150 MB/s...
  • RaymanRayman Posts: 14,867
    Also, you cannot turn off CRC with eMMC. So, I'd have to get CRC16 working in order to write...
  • Yeah, if you byte bang you get a gain, and P0-7 and P32-39 are special. Though you'll find that a streamer approach can put the data bus on any 8-pin boundary of the P2 pins without a problem.
  • RaymanRayman Posts: 14,867
    edited 2020-06-01 01:24
    That would be really great!

    BTW: I think I'm seeing that a 16:9 widescreen movie (like Sintel) can fit a 480p 16bpp frame within hub ram.
    This would be really nice because then HDTV could upscale it and it might look pretty good...
    Think I might have just enough bandwidth to pull it off.
  • Cluso99Cluso99 Posts: 18,069
    There was a great old thread where the crc instructions were added to the P2.

    There are two main CRC16 algorithms, an IBM one and just to be different CCITT chose a different one. Then Microcom implemented a flawed CCITT16 version as used in XMODEM. You'll need to find out which one is used.
    Once you have that, you'll need to work out which value is preset, and whether you need to (IIRC invert?) the bits at the end. These are just different ways the same algorithm was implemented.

    Google is your friend here. And there are online CRC calculators so you can verify you are getting the right results before you try to implement it in your routines.

    Here are a couple of links
    https://en.wikipedia.org/wiki/Cyclic_redundancy_check
    srecord.sourceforge.net/crc16-ccitt.html
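
    For reference, a host-side Python sketch of the data CRC. As far as I understand the SD and eMMC specs, the data CRC16 uses the CCITT polynomial x^16 + x^12 + x^5 + 1 (0x1021) with a zero preset and no final inversion, which is the same parameter set online calculators list as "CRC-16/XMODEM". In multi-bit bus modes each data line carries its own CRC16 over the bits sent on that line; the `crc16_per_line` helper and its bit-to-line mapping (bit n of each byte on DATn) are my assumptions for illustration and should be checked against the datasheet.

```python
POLY = 0x1021  # x^16 + x^12 + x^5 + 1

def crc16_update_bit(crc, bit):
    """Shift one data bit into a 16-bit CRC register (MSB-first form)."""
    feedback = ((crc >> 15) & 1) ^ (bit & 1)
    crc = (crc << 1) & 0xFFFF
    return crc ^ POLY if feedback else crc

def crc16(data, crc=0):
    """CRC16 over a byte string, most significant bit first, preset 0."""
    for byte in data:
        for shift in range(7, -1, -1):
            crc = crc16_update_bit(crc, (byte >> shift) & 1)
    return crc

def crc16_per_line(bus_bytes, lines=8):
    """One CRC16 per data line, assuming bit n of every bus byte travels on DATn."""
    crcs = [0] * lines
    for byte in bus_bytes:
        for line in range(lines):
            crcs[line] = crc16_update_bit(crcs[line], (byte >> line) & 1)
    return crcs

# Standard check string used by CRC catalogues and online calculators.
print(hex(crc16(b"123456789")))
```

    Comparing the printed value against an online calculator set to the same parameters is a cheap way to confirm the routine before porting it to Spin or PASM.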
  • roglohrogloh Posts: 5,865
    edited 2020-06-01 05:26
    Rayman wrote: »
    That would be really great!

    BTW: I think I'm seeing that a 16:9 widescreen movie (like Sintel) can fit a 480p 16bpp frame within hub ram.
    This would be really nice because then HDTV could upscale it and it might look pretty good...
    Think I might have just enough bandwidth to pull it off.

    16:9 widescreen for 640 pixel wide source is 450kB at 16bpp and may just fit. How well does video look at 16bpp, compared to an optimized 8 bit palette per frame I wonder? It might have to quantize the colours a little but I'm sure it is faster to convert.

    Is there even such a thing as an optimal 8 bit colour palette for a given frame? How would it be measured? Minimum least square error sum over all pixels vs ideal value? I imagine this is probably some field of study that PhD students could have worked on.
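
    The 450kB figure is easy to reproduce. A small Python sketch, taking the P2's 512 KB of hub RAM as the budget and assuming nothing else needs to share the buffer space:

```python
HUB_RAM_BYTES = 512 * 1024  # P2 hub RAM

def frame_bytes(width, aspect_w, aspect_h, bits_per_pixel):
    """Size in bytes of one frame of the given width and aspect ratio."""
    height = width * aspect_h // aspect_w
    return width * height * bits_per_pixel // 8

size = frame_bytes(640, 16, 9, 16)   # 640 x 360 at 16bpp
print("frame bytes:", size)
print("frame KiB:", size / 1024)
print("hub RAM left over:", HUB_RAM_BYTES - size)
```

    Whatever is left over has to hold code, stacks, and any read buffers, which is why "may just fit" is the right level of optimism.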
  • rogloh wrote: »
    Is there even such a thing as an optimal 8 bit colour palette for a given frame? How would it be measured? Minimum least square error sum over all pixels vs ideal value? I imagine this is probably some field of study that PhD students could have worked on.

    I guess that would be a good way to measure how well a palette is suited to an image. However, there are actually multiple ways to measure the difference between two pixels. RGB difference is ok, but CIELAB difference is better.

    There are a couple of different algorithms for generating an optimized palette for an image and then a couple different ones for quantizing an image onto a given palette (how do you find the closest color, do you do dithering, etc). Quite interesting subject.

    Of course, with video frames there is the additional consideration that the frames are shown in sequence...
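
    The least-squares measure mentioned above can be sketched in a few lines of Python. This version uses plain RGB distance for simplicity; as noted, CIELAB distance would track perceived difference better, and none of this says how to build the palette in the first place. All names here are made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two (r, g, b) tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_colour(palette, pixel):
    """Palette entry closest to the pixel under the RGB distance above."""
    return min(palette, key=lambda entry: squared_distance(entry, pixel))

def palette_error(pixels, palette):
    """Sum of squared errors when every pixel is mapped to its nearest palette entry."""
    return sum(squared_distance(p, nearest_colour(palette, p)) for p in pixels)

frame = [(0, 0, 0), (255, 255, 255), (10, 0, 0)]
palette = [(0, 0, 0), (255, 255, 255)]
print(palette_error(frame, palette))
```

    A lower total means the palette suits the frame better; comparing totals between candidate palettes is one way to rank them.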
  • roglohrogloh Posts: 5,865
    edited 2020-06-01 08:09
    Of course, with video frames there is the additional consideration that the frames are shown in sequence...

    Yeah the colours could fuzz a bit from frame to frame even if each frame is optimal. How well video looks with only 5 or 6 bits of each primary colour I'm not sure. But give it a go.

    With a HyperRAM frame buffer you'll be able to have the video frame shown in 24 bit colour though it can still waste 8 of the 32 bits per pixel in your storage, depending on whether you might allocate other COG(s) to decompress 24 -> 32 bits. It's probably fairly intensive for even P2 COGs to do that on every pixel in real time however, e.g. 5 instructions per pixel to re-arrange and write to LUT buffer before writing back to HUB (so ~10 clocks plus 1 clock per pixel to write back, if the read fifo keeps up). Perhaps one COG could do it at 250MHz for VGA width and line frequency. It will boost your video storage by 33%.
    REP #5, count
    RFWORD rgb
    RFBYTE blue
    ROLBYTE rgb, blue, #0
    MOVBYTS rgb, #%%2103  ' or whatever order you need
    WRLUT rgb, PTRA++
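
    The timing estimate and the unpacking step can be checked on a PC with a Python sketch. The byte order (R, G, B in storage, unpacked to 0x00RRGGBB) is an assumption for illustration, as is the 11 clocks per pixel, which is just the estimate given above (5 two-clock instructions plus 1 clock for the write back). The VGA line frequency used is the standard 31.469 kHz for 640x480 at 60 Hz.

```python
def unpack_rgb24(packed):
    """Expand packed 3-byte pixels into 32-bit 0x00RRGGBB values."""
    if len(packed) % 3:
        raise ValueError("packed data must be a multiple of 3 bytes")
    return [
        (packed[i] << 16) | (packed[i + 1] << 8) | packed[i + 2]
        for i in range(0, len(packed), 3)
    ]

def line_time_us(pixels, clocks_per_pixel, sysclk_hz):
    """Time in microseconds to convert one scan line at the given clock budget."""
    return pixels * clocks_per_pixel / sysclk_hz * 1e6

VGA_LINE_US = 1e6 / 31469      # one VGA scan line period
needed = line_time_us(640, 11, 250e6)
print("convert:", needed, "us; line period:", VGA_LINE_US, "us")
print("storage gain from 24bpp instead of 32bpp:", 32 / 24 - 1)
```

    With those numbers the conversion of a 640-pixel line finishes inside one line period, which agrees with the guess that a single COG at 250MHz could keep up.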
    
  • RaymanRayman Posts: 14,867
    I'm seeing multiblock reads as significantly faster than single block reads with eMMC.
    Wasn't really that way with uSD…

    Found a page with CrystalDiskMark results that show the same thing:
    https://www.windowscentral.com/emmc-vs-ssd

    My own tests with the uSD adapter are much slower, but show the same trend:
    [Attached image: benchmark results, 482 x 347]
  • Now that you can read this so much faster in 8-bit mode, then the latency is so much greater in comparison. So yes, multiblock will only have the initial latency to deal with mostly.
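
    A simple model shows why the fixed latency matters more as the bus gets faster. This Python sketch is illustrative only: the 200 microsecond per-command latency is a made-up placeholder, not a measurement from this thread, and the 28 MB/s bus rate is the read speed reported above.

```python
BLOCK_BYTES = 512

def effective_rate_mb_s(blocks_per_command, command_latency_s, bus_rate_mb_s):
    """Throughput when each read command pays a fixed latency once."""
    payload = blocks_per_command * BLOCK_BYTES
    transfer_s = payload / (bus_rate_mb_s * 1_000_000)
    return payload / (command_latency_s + transfer_s) / 1_000_000

# Hypothetical latency; substitute a measured value for a real card.
for blocks in (1, 8, 64):
    print(blocks, "blocks:", effective_rate_mb_s(blocks, 200e-6, 28.0), "MB/s")
```

    With single-block reads the latency is paid for every 512 bytes; with a long multi-block read it is paid once and the rate approaches the raw bus speed.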
  • RaymanRayman Posts: 14,867
    Yeah, must be. I was thinking that my Spin CRC7 calc was slowing things down. It was, but just a hair...
  • The trick of @lonesock's driver was the read ahead.

    If you request sector x, the driver loads sector x into COG RAM, pushes it into HUB, and then reads the next sector into COG RAM, just in case you might want it. Then it is already there to push to the HUB.

    Using the LUT as sector buffer (reading from SD/eMMC in block mode), one could read say 3 or even 4 sectors at once, and if the next request is the next sector, voila, a setq and rdlut and boom, it's in the HUB.

    A) block mode is faster, B) often the next needed sector IS the next sector, C) the driver COG does it in parallel by design.

    But anyways @Rayman, what you are achieving is amazing, and I might order some of those cards to play with.

    Thanks,

    Mike
  • RaymanRayman Posts: 14,867
    I guess it depends on what you are trying to do... If you want to read a bit and process it and then read some more, then the read ahead helps a lot. But, for video direct from drive, it doesn't really help because you already want the next sector right away.
  • Multi block read is best for "read-ahead" because you can take your time between sectors to process and then continue to read the next sector, perhaps even back in the same buffer. No penalty at all as long as you don't try to do anything else with the sd until you are done.
  • RaymanRayman Posts: 14,867
    Yeah, I was just going to say that both the original FSRW's read ahead and what I am doing for video need the multi-block read mode.
    For uSD the difference was there, but for eMMC it appears to be a much bigger difference from single-block read....
  • Exactly, and reading 4(2?) upfront in block mode is faster than reading 4 separately.

    @lonesock's trick is that the SD COG delivers the data and finishes the current command, so the calling COG can do what it has to do while the driver is silently loading the next sector in parallel, just in case it is needed.

    And if indeed the next sector is needed it has it already ready, sort of.

    Same would work with both of your block drivers.

    Read from card a block into LUT. Starting with sector asked for. Say two Sectors.
    Deliver result from LUT to HUB, Mailbox free. Calling Cog can proceed.
    If next asked sector is in LUT just deliver it, else load new block into LUT.
    Load the next 2 sectors after delivering the last one in parallel.

    So if the next sector is needed, it is already there. That is why FSRW is way faster than Kye's driver.

    Same goes for writing, copy from HUB to LUT and clear the mailbox. Done. Write from COG to SD/EMMC while calling COG is doing something else.

    This means the calling COG does not have to wait for completion while the called COG is doing something.

    The multi-core idea only works if the guys ARE running in parallel. Calling another COG and waiting in a loop for it to finish does not help a bit. You are just running one core at a time, nothing parallel.

    anyways,

    Mike
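
    The read-ahead scheme described above can be modelled on a PC to see how many card commands it saves. This Python sketch is only a sequential model with made-up names: on the P2 the prefetch would run in the driver COG while the caller carries on, and the buffer would live in LUT RAM. Here `read_blocks(start, count)` stands in for a multi-block read command.

```python
class ReadAheadCache:
    """Keeps a small window of sectors and refills it after the last one is used."""

    def __init__(self, read_blocks, window=2):
        self.read_blocks = read_blocks   # callable(start, count) -> list of sectors
        self.window = window
        self.first = None                # sector number of buffer[0]
        self.buffer = []
        self.card_reads = 0              # number of multi-block commands issued

    def _fill(self, start):
        self.buffer = self.read_blocks(start, self.window)
        self.first = start
        self.card_reads += 1

    def read_sector(self, n):
        hit = self.first is not None and self.first <= n < self.first + len(self.buffer)
        if not hit:
            self._fill(n)                # miss: fetch a window starting at n
        data = self.buffer[n - self.first]
        if n == self.first + len(self.buffer) - 1:
            # Last buffered sector delivered: prefetch the next window.
            # (A real driver would do this after releasing the mailbox.)
            self._fill(n + 1)
        return data

def fake_read_blocks(start, count):
    """Stand-in card: every sector is filled with its own number (mod 256)."""
    return [bytes([(start + i) & 0xFF]) * 512 for i in range(count)]

cache = ReadAheadCache(fake_read_blocks, window=2)
sectors = [cache.read_sector(n) for n in range(10, 14)]
print("multi-block commands issued:", cache.card_reads)
```

    For sequential access every request after the first is served from the buffer, so the caller only ever waits on the initial miss. For random access the speculative reads are wasted, which is the trade-off discussed earlier in the thread.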
  • RaymanRayman Posts: 14,867
    It would be better if the low level code always did multiblock reads, like one of the original FSRW drivers did.
    Reading the regular way would be faster. Can put that on the todo list for somebody.
  • RaymanRayman Posts: 14,867
    edited 2020-06-04 13:55
    Here's the code that demonstrates 28 MB/s read speed from eMMC.
    There's a lot still to do (like implementing writing to eMMC).
    The bmp file below should first be loaded onto the eMMC (can use a uSD adapter to do that).
    With some help from the forum, got CRC7 in a much better place. Can be moved to assembly now.
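
    For anyone checking a CRC7 routine before moving it to assembly, here is a bit-serial reference in Python. It is not the code attached to this thread, just a sketch of the standard SD/MMC command CRC (polynomial x^7 + x^3 + 1, preset 0), with the final command byte formed as the CRC shifted left by one with the end bit set. The function names are made up.

```python
def crc7(data):
    """CRC7 over the command bytes, most significant bit first, preset 0."""
    crc = 0
    for byte in data:
        for shift in range(7, -1, -1):
            feedback = ((crc >> 6) & 1) ^ ((byte >> shift) & 1)
            crc = (crc << 1) & 0x7F
            if feedback:
                crc ^= 0x09          # x^3 + 1 (the x^7 term is implicit)
    return crc

def frame_command(index, argument):
    """Build the 6-byte command frame: start/index byte, 32-bit argument, CRC7 + end bit."""
    body = bytes([0x40 | (index & 0x3F)]) + argument.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])

print(frame_command(0, 0).hex())     # CMD0 (GO_IDLE_STATE) with a zero argument
```

    CMD0 with a zero argument is a convenient test vector because its final byte (0x95) shows up in almost every SD/MMC initialisation example.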
  • RaymanRayman Posts: 14,867
    Note that the strobe pin is not currently being used. Also, maybe you don’t need the reset pin. Could be tied high, like the uSD adapter does.
  • RaymanRayman Posts: 14,867
    edited 2020-06-19 18:39
    Actually, I think the reset pin might be a good idea for cases where the chip is not removable from the system...

    Otherwise, you might have to power cycle to get it into a known state. But, that might be OK, guess it depends...