QuadSPI speeds, burst reading Flash into RAM ?

jmg · 2012-01-25 23:21

Does anyone have a handle on the CLK rates possible to read from QuadSPI flash, into smallish blocks of RAM ?
(up to 1000 bytes) in either COG or main memory ?
The Winbond W25Q64BV seems a good price, with some extra help to lower clocks in what they call Octal Word read (16 byte blocks)
Data is here
http://www.winbond.com/NR/rdonlyres/591A37FF-007C-4E99-956C-F7EE4A6D9A8F/0/W25Q64BV.pdf

jazzed · 2012-01-25 23:25

jmg wrote: »

Does anyone have a handle on the CLK rates possible to read from QuadSPI flash, into smallish blocks of RAM ?
(up to 1000 bytes) in either COG or main memory ?
The Winbond W25Q64BV seems a good price, with some extra help to lower clocks in what they call Octal Word read (16 byte blocks)
Data is here
http://www.winbond.com/NR/rdonlyres/591A37FF-007C-4E99-956C-F7EE4A6D9A8F/0/W25Q64BV.pdf

Propeller GCC has a Winbond driver that reads 2.5MB/s at 80MHz from byte-wide flash on P0..7 pins. There is a provision for a more aggressive 5MB/s transfer rate.

jmg · 2012-01-25 23:40

jazzed wrote: »

Propeller GCC has a Winbond driver that reads 2.5MB/s at 80MHz from byte-wide flash on P0..7 pins. There is a provision for a more aggressive 5MB/s transfer rate.

Thanks, - those are for byte-wide, which needs TWO chips.
Do they simply scale for Nibble-wide fetches ? - ie average rates of 2.5MHz (or 5MHz? ) clocks

Are those numbers the average read rates, or the clock speeds, which then need adjusting for the Opcode/Address overheads ?

What Winbond modes does the library support ? - I think we can accept the 16 byte block in this application, (in return for higher average speeds), and that needs the 0AxH modifier, and Octal Word opcode.

How is progress on GCC.Prop ?

Rayman · 2012-01-26 04:40

I've got some examples too for the Flashpoint modules here:
http://www.rayslogic.com/Propeller/Products/FlashPoint/FlashPoint.htm

These are working with Catalina right now.
Also, I have a couple examples using it as a video buffer...

jazzed · 2012-01-26 08:42

jmg wrote: »

Thanks, - those are for byte-wide, which needs TWO chips.
Do they simply scale for Nibble-wide fetches ? - ie average rates of 2.5MHz (or 5MHz? ) clocks

The main complication is having to assemble the bytes before writing to HUB memory.
I worked with Rayman on this problem a little and IIRC the best you could get with a nibble
is 1.25MHz which is 1/4th the best possible 2 chip read rate.

The great thing about using 2 chips besides better performance is getting double density.
It's possible to have a fast 32MB solution today which could be easily sub-divided.

jmg wrote: »

Are those numbers the average read rates, or the clock speeds, which then need adjusting for the Opcode/Address overheads ?

The highest byte-wide Propeller COG-HUB transfer rate at 80MHz is 5MB/s.
To achieve this rate, the loop can be no more than 3 instructions per flash clock (one byte per pass), i.e.:

:loop
        mov     data,ina        ' read byte
        wrbyte  data,phsb       ' write to hub pointer
        djnz    len,#:loop      ' do until _len bytes are stored

In the above case, CTRA is used to clock the data, and CTRB PHSB is used as the HUB pointer.
Data must be on P0..7. There is not enough instruction space to assemble nibbles and maintain the 5MHz window.
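
As a rough cross-check of these figures, here is a minimal Python sketch of the cycle arithmetic. It assumes the usual Propeller 1 timings (4 system clocks per ordinary instruction, 8 for a hub-synchronised wrbyte, one hub window per cog every 16 clocks). The function name and constants are illustrative only and are not taken from the driver.

```python
# Back-of-envelope throughput for a COG read loop on a Propeller 1.
# Assumed timings (not from the driver source): ordinary instructions
# take 4 system clocks, a hub-synchronised wrbyte takes 8.
SYSCLK_HZ = 80_000_000
CLOCKS_PER_INSTR = 4
CLOCKS_PER_WRBYTE = 8
HUB_WINDOW = 16  # a cog's hub slot comes round every 16 clocks


def loop_bytes_per_second(regular_instrs, bytes_per_pass=1):
    """Bytes/s for a loop with one wrbyte plus `regular_instrs` others."""
    clocks = regular_instrs * CLOCKS_PER_INSTR + CLOCKS_PER_WRBYTE
    # The wrbyte has to land on a hub window, so round up to a multiple of 16.
    clocks = -(-clocks // HUB_WINDOW) * HUB_WINDOW
    return SYSCLK_HZ * bytes_per_pass / clocks


byte_wide = loop_bytes_per_second(regular_instrs=2)   # mov + djnz + wrbyte
one_extra = loop_bytes_per_second(regular_instrs=3)   # one more instruction
print(f"byte-wide loop   : {byte_wide / 1e6:.2f} MB/s")
print(f"with 1 extra op  : {one_extra / 1e6:.2f} MB/s")
```

Under these assumptions a single extra instruction pushes the loop out to the next hub window (32 clocks), which is why there is no room to assemble nibbles at the 5MB/s rate.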

jmg wrote: »

What Winbond modes does the library support ? - I think we can accept the 16 byte block in this application, (in return for higher average speeds), and that needs the 0AxH modifier, and Octal Word opcode.

I'm using Fast Read Quad I/O (0xEB) and bursting 128 bytes at a time.
Performance is about the same or better than SPIN in most cases.

jmg wrote: »

How is progress on GCC.Prop ?

The compiler is stable now and has been for a couple of months.
It has many Propeller features that the vast army of GCC developers can appreciate.

Jeff is working on Eclipse GUI plugins and has made excellent progress there.
We will have more GCC updates soon. Most likely we will start the P2 phase before summer.

The Propeller GCC Quad-SPI Flash solution has been working nicely for a while at 2.5MB/s.
I still have to optimize the driver timing to use the 5MB/s solution - I've been very busy.

I've attached a Propeller GCC 2x Quad-SPI XMMC reference design example SpinSocket-Flash.
Note the 2 SO8 devices to the left of Propeller on the DIP32 module. All passives are on the bottom side.

jmg · 2012-01-29 19:43

Thanks, I've just got back to this again..

jazzed wrote: »

The main complication is having to assemble the bytes before writing to HUB memory.
I worked with Rayman on this problem a little and IIRC the best you could get with a nibble
is 1.25MHz which is 1/4th the best possible 2 chip read rate.

Do you have more details on this 1/4?
I can see the 5MHz (16 cycle loop) for byte-wide (2 chips), but with SHL and OR, my pencil example gives one spare opcode in a 32 cycle loop (i.e. very slightly unrolled) that does 2 nibble reads, so that seems to me to be one byte @ 2.5MHz (2.5MB/s)?
On this QuadSPI device, that allows me 312us to read 16 bytes in an Octal Word 0AxH mode.

Which is fine for my needs - I can see 2x4 is more compact, and faster, so better suits execute-in-place, but I'd prefer avoiding the extra complexity and cost of 2 chips.
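
The SHL/OR step can be modelled in a few lines. This Python sketch is only an illustration of the idea, not driver code: it assumes the flash presents the high nibble of each byte first, and the function name is hypothetical.

```python
def assemble_bytes(nibbles):
    """Pair up 4-bit samples into bytes, high nibble first (assumed order)."""
    if len(nibbles) % 2:
        raise ValueError("need an even number of nibbles")
    out = bytearray()
    for i in range(0, len(nibbles), 2):
        hi = nibbles[i] & 0xF
        lo = nibbles[i + 1] & 0xF
        out.append((hi << 4) | lo)  # the SHL and OR steps
    return bytes(out)


# A 16 byte block arrives as 32 nibble samples.
block = bytes(range(0xA0, 0xB0))
samples = [n for b in block for n in (b >> 4, b & 0xF)]
assert assemble_bytes(samples) == block
```

Each output byte costs two pin reads plus a shift and a merge before the hub write, which is the extra work the nibble-wide loop has to fit in.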

jazzed wrote: »

The Propeller GCC Quad-SPI Flash solution has been working nicely for a while at 2.5MB/s. I still have to optimize the driver timing to use the 5MB/s solution - I've been very busy.

Is there a link to the GCC Quad-SPI Flash solution ? - I search the GCC project for QuadSPI and Quad-SPI and get no finds ?

jazzed · 2012-01-29 20:20

jmg wrote: »

Is there a link to the GCC Quad-SPI Flash solution ? - I search the GCC project for QuadSPI and Quad-SPI and get no finds ?

Here you go:
http://code.google.com/p/propgcc/source/browse/loader/spin/ssf_cache.spin

I'm too tired to answer the other part of your post right now.

Rayman · 2012-01-30 06:35

jmg, BTW I think I've found a way to use multiple cogs to increase throughput for single chip solutions (like Flashpoint).
Anyway, I guess you have to decide if you want to give up cogs or pins for faster access...

jazzed · 2012-01-30 11:24

Rayman,

Where is the working snippet for loading from 4 bits of Flash from a COG to HUB ram?
I found something in FlashPointSuperQuad2a.spin but that's like 14 instructions.
We discussed a better performing version in a few posts, but I can't find it just now.

Thanks,
--Steve

jmg · 2012-01-30 12:51

Rayman wrote: »

jmg, BTW I think I've found a way to use multiple cogs to increase throughput for single chip solutions (like Flashpoint).
Anyway, I guess you have to decide if you want to give up cogs or pins for faster access...

If you have a solution, it could help to post it here, ( even if I do not need more speed right now, others might, or I might later..)
Corner cases are always good.

The link in #7 includes the important timer-setup preamble for the byte-wide version, Counter clocked. Nifty idea.

Rayman · 2012-01-30 14:12

Ok, it's taken me a few minutes to recollect where I was with this...

First, let me say that Ross has me convinced that the speed of the memory access doesn't make a huge impact for running C code
when you use caching.

But, I was looking into video drivers, and I needed speed...
I know I have a 1-cog driver that uses counters to reduce the number of instruction cycles from 16 to 12, but I can't seem to find it right now...
Anyway, here's the cog#1 part of the 2-cog version:

'Original loop has 14 regular and 1 wrbyte instructions = 16 instructions
'Want to use counter to speed things up...
'Easy 2-cog counter loop has 10 regular and 1 wrbyte = 12 instructions
ReadBmpLineLoopStart
'sync with other cog
              mov       t1,cnt
              add       t1,syncadd
              wrlong    t1,arg5
              waitcnt   t1,#0
              wrlong    zero,arg5

              'Setup counters
              sub       t5,#2 'need to subtract because adding before wrbyte
              mov       frqa,aFreq           ' 
              mov       phsa,aPhsa
              movs      ctra,#SckPin                   
              rdbyte    t1, #0               ' sync up with hub 
              movi      ctra,#4 << 3            ' single end NCO clock mode 
ReadBmpLineLoop
              mov       phsa,aPhsa          '1
              mov       t1,ina              '2  'First Nibble already on output
              ror       t1,#Sio0Pin         '3
              shl       t1,#4               '4  'need a clock every 3 instructions=12 cycles
              mov       t2,ina              '5
              ror       t2,#Sio0Pin         '6
              and       t2,#15              '7
              add       t2,t1               '8
              add       t5,#2               '9
              wrbyte    t2,t5               '10
              'wrbyte counts for 2          '11
              djnz      t4,#ReadBmpLineLoop '12
              mov       ctra,#0 'kill clock
              
              mov       t4,arg2
              shr       t4,#1  'Only half as many with 2 cogs 
              mov       t5,arg4
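
For what it's worth, the loop comments above imply the following timing. This is a minimal Python sketch of one reading of the listing, not a measurement: it assumes 4 system clocks per instruction slot at 80MHz, that the second cog runs the same loop offset by half a pass, and that each cog therefore writes every other byte (hence the add of 2 to the hub pointer).

```python
# Rough timing model for the 2-cog loop (an interpretation, not measured).
SYSCLK_HZ = 80_000_000
CLOCKS_PER_SLOT = 4          # assumed: one instruction slot = 4 system clocks
SLOTS_PER_PASS = 12          # 10 regular instructions + wrbyte counted as 2
SLOTS_PER_FLASH_CLOCK = 3    # "need a clock every 3 instructions"
COGS = 2                     # each cog stores every other byte

clocks_per_pass = SLOTS_PER_PASS * CLOCKS_PER_SLOT               # 48
flash_clk_hz = SYSCLK_HZ / (SLOTS_PER_FLASH_CLOCK * CLOCKS_PER_SLOT)
per_cog_bytes_per_s = SYSCLK_HZ / clocks_per_pass                # 1 byte per pass
total_bytes_per_s = COGS * per_cog_bytes_per_s

# Consistency check: two nibbles per byte, one nibble per flash clock.
assert abs(total_bytes_per_s * 2 - flash_clk_hz) < 1e-6

print(f"flash clock : {flash_clk_hz / 1e6:.2f} MHz")
print(f"per cog     : {per_cog_bytes_per_s / 1e6:.2f} MB/s")
print(f"combined    : {total_bytes_per_s / 1e6:.2f} MB/s")
```

If that reading is right, the pair of cogs lands between the 2.5MB/s and 5MB/s byte-wide figures quoted earlier in the thread, at the cost of a second cog.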

jazzed · 2012-01-30 14:57

Rayman wrote: »

First, let me say that Ross has me convinced that the speed of the memory access doesn't make a huge impact for running C code
when you use caching.

The data we have collected with GCC is not necessarily consistent with this (not a criticism of Catalina or some GCC promotional statement).

Purely from a technical "change only one variable at a time viewpoint" the speed of memory will not impact code that can reside mainly in cache. However for code that will not easily stay in cache there is a dramatic difference.

Here is data from one experiment:

              Spin  LMM  FLASH XMMC  EEPROM XMMC  SSF XMMC  SSF2 XMMC  SDRAM XMMC  SDCARD XMMC
            ------  ---  ----------  -----------  --------  ---------  ----------  -----------
Transpose   64,400  887         902          904       902        902         902          903
sprintf      1,351  531      27,386       10,046     3,515      3,415       3,886       34,262
CacheLines     N/A  N/A          32          128        64        128          64           16
Line Size      N/A  N/A         128           64       128         64          32          512
Cache Size     N/A  N/A        4096         8192      8192       8192        2048         8192

In the example "transpose" is a fairly small program, and sprintf is a fairly big program. The differences are quite varied.
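
To make the spread easier to see, here is a small Python sketch that re-derives two things from the table: that each cache size is lines times line size, and how each XMMC sprintf figure compares with the LMM one. The numbers are copied from the table as posted (units as reported there); the variable names are just for illustration.

```python
# Figures copied from the table above; nothing here is newly measured.
LMM_SPRINTF = 531

# name: (cache lines, line size, cache size, sprintf figure)
configs = {
    "FLASH XMMC":  (32, 128, 4096, 27386),
    "EEPROM XMMC": (128, 64, 8192, 10046),
    "SSF XMMC":    (64, 128, 8192, 3515),
    "SSF2 XMMC":   (128, 64, 8192, 3415),
    "SDRAM XMMC":  (64, 32, 2048, 3886),
    "SDCARD XMMC": (16, 512, 8192, 34262),
}

ratios = {}
for name, (lines, line_size, cache_size, sprintf) in configs.items():
    # Cache geometry in the table is self-consistent.
    assert lines * line_size == cache_size, name
    ratios[name] = sprintf / LMM_SPRINTF
    print(f"{name:12s} sprintf = {ratios[name]:5.1f}x the LMM figure")
```

The transpose row barely moves across back ends, while the sprintf row varies by roughly an order of magnitude between them.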

Of course I don't have a FlashPoint driver, so that can't be compared. I would certainly like to have one though. I have parts.

Rayman · 2012-01-30 16:34

This is getting off topic, but I think it's hard to tell the real world impact from those two examples.
You have one where speed matters and one where speed doesn't...
Perhaps real world is somewhere in between...

jmg · 2012-05-03 13:26

As an update to QuadSPI, I see Spansion are claiming higher DDR rates, (now 66MBytes/s DDR) and faster Erase/Pgm times - which could mean smaller buffers in loggers.
(Pgm claims 1.5MB/s )

http://www.spansion.com/Products/Pages/serial-flash-spi-memory.aspx

This also has a 1K OTP block. ( and I think a 128 bit ID ?)
