QuadSPI speeds, burst reading Flash into RAM ? — Parallax Forums

jmg Posts: 15,183
edited 2012-05-03 13:26 in Propeller 1
Does anyone have a handle on the CLK rates possible to read from QuadSPI flash, into smallish blocks of RAM ?
(up to 1000 bytes) in either COG or main memory ?
The Winbond W25Q64BV seems a good price, with some extra help to lower clocks in what they call Octal Word read (16 byte blocks)
Data is here
http://www.winbond.com/NR/rdonlyres/591A37FF-007C-4E99-956C-F7EE4A6D9A8F/0/W25Q64BV.pdf

Comments

  • jazzed Posts: 11,803
    edited 2012-01-25 23:25
    jmg wrote: »
    Does anyone have a handle on the CLK rates possible to read from QuadSPI flash, into smallish blocks of RAM ?
    (up to 1000 bytes) in either COG or main memory ?
    The Winbond W25Q64BV seems a good price, with some extra help to lower clocks in what they call Octal Word read (16 byte blocks)
    Data is here
    http://www.winbond.com/NR/rdonlyres/591A37FF-007C-4E99-956C-F7EE4A6D9A8F/0/W25Q64BV.pdf

    Propeller GCC has a Winbond driver that reads 2.5MB/s at 80MHz from byte-wide flash on P0..7 pins. There is a provision for a more aggressive 5MB/s transfer rate.
  • jmg Posts: 15,183
    edited 2012-01-25 23:40
    jazzed wrote: »
    Propeller GCC has a Winbond driver that reads 2.5MB/s at 80MHz from byte-wide flash on P0..7 pins. There is a provision for a more aggressive 5MB/s transfer rate.

    Thanks, - those are for byte-wide, which needs TWO chips.
    Do they simply scale for Nibble-wide fetches ? - ie average rates of 2.5MHz (or 5MHz? ) clocks


    Are those numbers the average read rates, or the clock speeds, which then need adjusting for the Opcode/Address overheads ?

    What Winbond modes does the library support ? - I think we can accept the 16 byte block in this application, (in return for higher average speeds), and that needs the 0AxH modifier, and Octal Word opcode.

    How is progress on GCC.Prop ?
  • Rayman Posts: 14,833
    edited 2012-01-26 04:40
    I've got some examples too for the Flashpoint modules here:
    http://www.rayslogic.com/Propeller/Products/FlashPoint/FlashPoint.htm

    These are working with Catalina right now.
    Also, I have a couple examples using it as a video buffer...
  • jazzed Posts: 11,803
    edited 2012-01-26 08:42
    jmg wrote: »
    Thanks, - those are for byte-wide, which needs TWO chips.
    Do they simply scale for Nibble-wide fetches ? - ie average rates of 2.5MHz (or 5MHz? ) clocks
    The main complication is having to assemble the bytes before writing to HUB memory.
    I worked with Rayman on this problem a little and IIRC the best you could get with a nibble
    is 1.25MHz which is 1/4th the best possible 2 chip read rate.

    The great thing about using 2 chips besides better performance is getting double density.
    It's possible to have a fast 32MB solution today which could be easily sub-divided.
    jmg wrote: »
    Are those numbers the average read rates, or the clock speeds, which then need adjusting for the Opcode/Address overheads ?
    The highest byte-wide Propeller COG-HUB transfer rate at 80MHz is 5MB/s.
    To achieve this rate, the loop can hold only 3 instructions per flash clock (16 system clocks), i.e.:
    :loop
            mov     data,ina        ' read byte
            wrbyte  data,phsb       ' write to hub pointer
            djnz    len,#:loop      ' do until len bytes are stored
    

    In the above case, CTRA is used to clock the data, and CTRB PHSB is used as the HUB pointer.
    Data must be on P0..7. There is not enough instruction space to assemble nibbles and maintain the 5MHz window.
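
The 5MB/s figure follows from cycle counting. A rough sketch in Python, assuming the usual Propeller 1 timing (4 clocks per ordinary instruction, one hub access window per cog every 16 clocks, a synchronized wrbyte occupying 8 clocks, i.e. two instruction slots). The helper names are hypothetical and not part of any driver:

```python
import math

SYSCLK_HZ = 80_000_000    # system clock used in this thread
CLOCKS_PER_INSTR = 4      # ordinary PASM instruction (assumed)
HUB_WINDOW_CLOCKS = 16    # each cog gets hub access once per 16 clocks (assumed)
HUB_OP_SLOTS = 2          # a synchronized wrbyte takes 8 clocks = 2 slots (assumed)


def loop_clocks(regular_instrs):
    """Clocks per pass for a loop of `regular_instrs` ordinary instructions
    plus one wrbyte, rounded up to the next hub window."""
    raw = (regular_instrs + HUB_OP_SLOTS) * CLOCKS_PER_INSTR
    return math.ceil(raw / HUB_WINDOW_CLOCKS) * HUB_WINDOW_CLOCKS


def bytes_per_second(regular_instrs, bytes_per_pass=1):
    """Hub write throughput for such a loop."""
    return SYSCLK_HZ * bytes_per_pass / loop_clocks(regular_instrs)


# mov + djnz around one wrbyte: 16 clocks per byte
print(bytes_per_second(2))   # 5000000.0
# one extra instruction misses the hub window and costs a full 16 more clocks
print(bytes_per_second(3))   # 2500000.0
```

Under these assumptions the rate drops in steps (5MB/s, 2.5MB/s, ...) rather than gradually, which is why there is no room to assemble nibbles in the 16-clock loop.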
    jmg wrote: »
    What Winbond modes does the library support ? - I think we can accept the 16 byte block in this application, (in return for higher average speeds), and that needs the 0AxH modifier, and Octal Word opcode.

    I'm using Fast Read Quad I/O (0xEB) and bursting 128 bytes at a time.
    Performance is about the same or better than SPIN in most cases.
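
On the question of average rate versus clock rate: each burst pays a fixed command/address cost before data flows, so the effective rate depends on burst length. A rough sketch, where the overhead clock counts are assumptions to be checked against the W25Q64BV datasheet (20 clocks for a full 0xEB preamble: 8 opcode + 6 address + 2 mode + 4 dummy; 8 clocks for an Octal Word read already in continuous-read mode), and where the preamble is assumed to be clocked at the same rate as the data. The function name is hypothetical:

```python
def effective_rate(burst_bytes, overhead_clocks, clocks_per_byte, clock_hz):
    """Average bytes/s over one burst, including the command/address preamble."""
    total_clocks = overhead_clocks + burst_bytes * clocks_per_byte
    return burst_bytes * clock_hz / total_clocks


# Two chips side by side (one byte per clock), 128-byte bursts, 5 MHz flash clock.
two_chip = effective_rate(128, overhead_clocks=20, clocks_per_byte=1,
                          clock_hz=5_000_000)

# One chip (two clocks per byte), 16-byte Octal Word bursts, 5 MHz flash clock.
one_chip = effective_rate(16, overhead_clocks=8, clocks_per_byte=2,
                          clock_hz=5_000_000)

print(f"{two_chip / 1e6:.2f} MB/s, {one_chip / 1e6:.2f} MB/s")
```

Longer bursts amortize the preamble better, which is one reason to prefer 128-byte cache lines over 16-byte blocks when the application allows it.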
    jmg wrote: »
    How is progress on GCC.Prop ?

    The compiler is stable now and has been for a couple of months.
    It has many Propeller features that the vast army of GCC developers can appreciate.

    Jeff is working on Eclipse GUI plugins and has made excellent progress there.
    We will have more GCC updates soon. Most likely we will start the P2 phase before summer.

    The Propeller GCC Quad-SPI Flash solution has been working nicely for a while at 2.5MB/s.
    I still have to optimize the driver timing to use the 5MB/s solution - I've been very busy.

    I've attached a Propeller GCC 2x Quad-SPI XMMC reference design example SpinSocket-Flash.
    Note the 2 SO8 devices to the left of Propeller on the DIP32 module. All passives are on the bottom side.
  • jmg Posts: 15,183
    edited 2012-01-29 19:43
    Thanks, I've just got back to this again..
    jazzed wrote: »
    The main complication is having to assemble the bytes before writing to HUB memory.
    I worked with Rayman on this problem a little and IIRC the best you could get with a nibble
    is 1.25MHz which is 1/4th the best possible 2 chip read rate.
    Do you have more details on this 1/4?
    I can see the 5MHz (16 cycle loop) for byte wide (2 chips), but with SHL and OR, my pencil example gives one spare opcode in a 32 cycle loop, (ie very slightly unrolled) that does 2 nibble reads, so that seems to me to be one byte @ 2.5MHz ? ( 2.5MB/s)
    On this QuadSPI device, that allows me 312us to read 16 bytes in an Octal Word 0AxH mode.

    Which is fine for my needs - I can see 2x4 is more compact, and faster, so better suits execute-in-place, but I'd prefer avoiding the extra complexity and cost of 2 chips.
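
The pencil estimate can be checked by counting instruction slots. A minimal sketch, assuming 4 clocks per ordinary instruction and a synchronized wrbyte costing two slots; the instruction list is only one plausible reading of the loop described, not code from this thread:

```python
SYSCLK_HZ = 80_000_000
CLOCKS_PER_SLOT = 4          # one ordinary PASM instruction (assumed)
LOOP_CLOCKS = 32             # two hub windows per assembled byte

# Hypothetical single-chip loop body: (instruction, slots used).
loop_body = [
    ("mov    hi, ina",     1),   # first nibble
    ("shl    hi, #4",      1),
    ("mov    lo, ina",     1),   # second nibble
    ("or     hi, lo",      1),
    ("wrbyte hi, ptr",     2),   # hub write counts as two slots
    ("djnz   len, #loop",  1),
]

slots_available = LOOP_CLOCKS // CLOCKS_PER_SLOT
slots_used = sum(slots for _, slots in loop_body)
spare_slots = slots_available - slots_used
rate = SYSCLK_HZ / LOOP_CLOCKS

print(f"{slots_used}/{slots_available} slots used, {spare_slots} spare, "
      f"{rate / 1e6:.1f} MB/s")
```

The one spare slot would be the natural place for a mask (for example `and lo, #15`) if the upper pins cannot be guaranteed low.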
    jazzed wrote: »
    The Propeller GCC Quad-SPI Flash solution has been working nicely for a while at 2.5MB/s. I still have to optimize the driver timing to use the 5MB/s solution - I've been very busy.

    Is there a link to the GCC Quad-SPI Flash solution ? - I searched the GCC project for QuadSPI and Quad-SPI and got no hits.
  • jazzed Posts: 11,803
    edited 2012-01-29 20:20
    jmg wrote: »
    Is there a link to the GCC Quad-SPI Flash solution ? - I search the GCC project for QuadSPI and Quad-SPI and get no finds ?

    Here you go:
    http://code.google.com/p/propgcc/source/browse/loader/spin/ssf_cache.spin

    I'm too tired to answer the other part of your post right now.
  • Rayman Posts: 14,833
    edited 2012-01-30 06:35
    jmg, BTW I think I've found a way to use multiple cogs to increase throughput for single chip solutions (like Flashpoint).
    Anyway, I guess you have to decide if you want to give up cogs or pins for faster access...
  • jazzed Posts: 11,803
    edited 2012-01-30 11:24
    Rayman,

    Where is the working snippet for loading Flash data 4 bits at a time through a COG into HUB RAM?
    I found something in FlashPointSuperQuad2a.spin but that's like 14 instructions.
    We discussed a better performing version in a few posts, but I can't find it just now.

    Thanks,
    --Steve
  • jmg Posts: 15,183
    edited 2012-01-30 12:51
    Rayman wrote: »
    jmg, BTW I think I've found a way to use multiple cogs to increase throughput for single chip solutions (like Flashpoint).
    Anyway, I guess you have to decide if you want to give up cogs or pins for faster access...

    If you have a solution, it could help to post it here, ( even if I do not need more speed right now, others might, or I might later..)
    Corner cases are always good.

    The link in #7 includes the important timer-setup preamble for the byte-wide version, Counter clocked. Nifty idea.
  • Rayman Posts: 14,833
    edited 2012-01-30 14:12
    Ok, it's taken me a few minutes to recollect where I was with this...

    First, let me say that Ross has me convinced that the speed of the memory access doesn't make a huge impact for running C code
    when you use caching.

    But, I was looking into video drivers, and I needed speed...
    I know I have a 1-cog driver that uses counters to reduce the number of instruction cycles from 16 to 12, but I can't seem to find it right now...
    Anyway, here's the cog#1 part of the 2-cog version:
    'Original loop has 14 regular and 1 wrbyte instructions = 16 instruction slots (wrbyte counts as 2)
    'Want to use counter to speed things up...
    'Easy 2-cog counter loop has 10 regular and 1 wrbyte = 12 instruction slots
    ReadBmpLineLoopStart
    'sync with other cog
                  mov       t1,cnt
                  add       t1,syncadd
                  wrlong    t1,arg5
                  waitcnt   t1,#0
                  wrlong    zero,arg5
    
                  'Setup counters
                  sub       t5,#2 'need to subtract because adding before wrbyte
                  mov       frqa,aFreq           ' 
                  mov       phsa,aPhsa
                  movs      ctra,#SckPin                   
                  rdbyte    t1, #0               ' sync up with hub 
                  movi      ctra,#4 << 3            ' single end NCO clock mode 
    ReadBmpLineLoop
                  mov       phsa,aPhsa          '1
                  mov       t1,ina              '2  'First Nibble already on output
                  ror       t1,#Sio0Pin         '3
                  shl       t1,#4               '4  'need a clock every 3 instructions=12 cycles
                  mov       t2,ina              '5
                  ror       t2,#Sio0Pin         '6
                  and       t2,#15              '7
                  add       t2,t1               '8
                  add       t5,#2               '9
                  wrbyte    t2,t5               '10
                  'wrbyte counts for 2          '11
                  djnz      t4,#ReadBmpLineLoop '12
                  mov       ctra,#0 'kill clock
                  
                  mov       t4,arg2
                  shr       t4,#1  'Only half as many with 2 cogs 
                  mov       t5,arg4
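
What the inner loop does to each pair of INA samples can be modelled in a few lines. This is a behavioural sketch in Python, not a replacement for the PASM; the pin number used in the example is hypothetical:

```python
MASK32 = 0xFFFFFFFF


def ror32(value, bits):
    """32-bit rotate right, modelling PASM `ror`."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def assemble_byte(ina_first, ina_second, sio0_pin):
    """Mirror loop steps 2-8 and 10: two INA samples in, one byte out."""
    t1 = ror32(ina_first, sio0_pin)        # SIO0..3 moved down to bits 0..3
    t1 = (t1 << 4) & MASK32                # first nibble now in bits 4..7
    t2 = ror32(ina_second, sio0_pin) & 15  # second nibble, masked
    t2 = (t2 + t1) & MASK32                # low 4 bits of t1 are zero, so add acts as or
    return t2 & 0xFF                       # wrbyte stores only the low byte


# Flash nibbles 0xA then 0x5 on hypothetical pins P8..P11, all other pins high.
print(hex(assemble_byte(0xFFFFFAFF, 0xFFFFF5FF, 8)))   # 0xa5
```

The first nibble needs no mask because wrbyte discards everything above bit 7. If the 12-slot count holds, each cog spends 48 clocks per byte (about 1.67 MB/s at 80MHz), and two interleaved cogs would give roughly 3.3 MB/s; that combined figure is an inference from the `add t5,#2` stride, not a measured number.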
    
    
  • jazzed Posts: 11,803
    edited 2012-01-30 14:57
    Rayman wrote: »
    First, let me say that Ross has me convinced that the speed of the memory access doesn't make a huge impact for running C code
    when you use caching.

    The data we have collected with GCC is not necessarily consistent with this (not a criticism of Catalina or some GCC promotional statement).

    Purely from a technical "change only one variable at a time" viewpoint, the speed of memory will not impact code that can reside mainly in cache. However, for code that will not easily stay in cache there is a dramatic difference.

    Here is data from one experiment:
                  Spin  LMM  FLASH XMMC  EEPROM XMMC  SSF XMMC  SSF2 XMMC  SDRAM XMMC  SDCARD XMMC
                ------  ---  ----------  -----------  --------  ---------  ----------  -----------
    Transpose   64,400  887         902          904       902        902         902          903
    sprintf      1,351  531      27,386       10,046     3,515      3,415       3,886       34,262
    CacheLines     N/A  N/A          32          128        64        128          64           16
    Line Size      N/A  N/A         128           64       128         64          32          512
    Cache Size     N/A  N/A        4096         8192      8192       8192        2048         8192
    


    In the example "transpose" is a fairly small program, and sprintf is a fairly big program. The differences are quite varied.
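
One way to read the table is to normalise each XMMC column to the LMM result. The thread does not state the units, so the sketch below only forms ratios and checks that Cache Size equals CacheLines times Line Size; all numbers are copied from the table and the variable names are hypothetical:

```python
columns     = ["FLASH", "EEPROM", "SSF", "SSF2", "SDRAM", "SDCARD"]
transpose   = [902, 904, 902, 902, 902, 903]
sprintf     = [27386, 10046, 3515, 3415, 3886, 34262]
cache_lines = [32, 128, 64, 128, 64, 16]
line_size   = [128, 64, 128, 64, 32, 512]
cache_size  = [4096, 8192, 8192, 8192, 2048, 8192]

LMM_TRANSPOSE = 887
LMM_SPRINTF = 531

# Ratio of each XMMC result to the LMM result for the same benchmark.
transpose_ratio = [t / LMM_TRANSPOSE for t in transpose]
sprintf_ratio = [s / LMM_SPRINTF for s in sprintf]

# Consistency check on the cache geometry rows.
geometry_ok = all(n * size == total
                  for n, size, total in zip(cache_lines, line_size, cache_size))

for name, tr, sr in zip(columns, transpose_ratio, sprintf_ratio):
    print(f"{name:7s} transpose x{tr:.2f}  sprintf x{sr:.1f}")
print("cache geometry consistent:", geometry_ok)
```

The transpose ratios stay within about 2% of LMM for every memory type, while the sprintf ratios range from roughly 6x to over 60x, which is the spread the post is pointing at.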

    Of course I don't have a FlashPoint driver, so that can't be compared. I would certainly like to have one though. I have parts.
  • Rayman Posts: 14,833
    edited 2012-01-30 16:34
    This is getting off topic, but I think it's hard to tell the real world impact from those two examples.
    You have one where speed matters and one where speed doesn't...
    Perhaps the real world is somewhere in between...
  • jmg Posts: 15,183
    edited 2012-05-03 13:26
    As an update to QuadSPI, I see Spansion are claiming higher DDR rates, (now 66MBytes/s DDR) and faster Erase/Pgm times - which could mean smaller buffers in loggers.
    (Pgm claims 1.5MB/s )

    http://www.spansion.com/Products/Pages/serial-flash-spi-memory.aspx

    This also has a 1K OTP block. ( and I think a 128 bit ID ?)