QuadSPI speeds, burst reading Flash into RAM ?
jmg
Posts: 15,183
Does anyone have a handle on the CLK rates possible when reading from QuadSPI flash into smallish blocks of RAM (up to 1000 bytes), in either COG or main memory ?
The Winbond W25Q64BV seems a good price, and it has some extra help to lower the clock count in what they call Octal Word Read (16 byte blocks)
Data is here
http://www.winbond.com/NR/rdonlyres/591A37FF-007C-4E99-956C-F7EE4A6D9A8F/0/W25Q64BV.pdf
Comments
Propeller GCC has a Winbond driver that reads 2.5MB/s at 80MHz from byte-wide flash on P0..7 pins. There is a provision for a more aggressive 5MB/s transfer rate.
Thanks, - those are for byte-wide, which needs TWO chips.
Do they simply scale for Nibble-wide fetches ? - ie average rates of 2.5MHz (or 5MHz? ) clocks
Are those numbers the average read rates, or the clock speeds, which then need adjusting for the Opcode/Address overheads ?
What Winbond modes does the library support ? - I think we can accept the 16 byte block in this application, (in return for higher average speeds), and that needs the 0AxH modifier, and Octal Word opcode.
How is progress on GCC.Prop ?
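On the opcode/address overhead question, a rough way to turn a clock rate into an average read rate is to count the clocks in each phase of a transaction. The sketch below is only that: the phase counts are my reading of the W25Q64BV datasheet (8 clocks for the opcode, 6 for a 24-bit address on four lines, 2 for the mode byte, 4 dummy clocks for Fast Read Quad I/O and none for Octal Word Read, 2 clocks per data byte), the function and key names are made up for illustration, and `continuous=True` assumes the Axh mode bits let the next read skip the opcode. Check the numbers against the datasheet before relying on them.

```python
# Clock counts per transaction phase. These are assumptions taken from
# my reading of the W25Q64BV datasheet, not measured values.
CMD_CLOCKS = 8    # 8-bit opcode, sent on a single line
ADDR_CLOCKS = 6   # 24-bit address, sent on four lines
MODE_CLOCKS = 2   # M7-0 mode byte, sent on four lines
DUMMY_CLOCKS = {
    "fast_read_quad_io_EBh": 4,  # dummy clocks before data
    "octal_word_read_E3h": 0,    # none, but A3..A0 must be zero
}


def clocks_per_read(mode, nbytes, continuous=False):
    """Total flash clocks for one read transaction of nbytes."""
    cmd = 0 if continuous else CMD_CLOCKS  # continuous mode skips the opcode
    return cmd + ADDR_CLOCKS + MODE_CLOCKS + DUMMY_CLOCKS[mode] + 2 * nbytes


def average_bytes_per_second(mode, nbytes, sclk_hz, continuous=False):
    """Average read rate once command overhead is included."""
    return nbytes * sclk_hz / clocks_per_read(mode, nbytes, continuous)


if __name__ == "__main__":
    sclk = 5_000_000  # hypothetical bit-banged flash clock
    print(clocks_per_read("octal_word_read_E3h", 16, continuous=True))
    print(average_bytes_per_second("octal_word_read_E3h", 16, sclk, continuous=True))
    print(clocks_per_read("fast_read_quad_io_EBh", 128))
```

Under these assumptions a 16 byte Octal Word read in continuous mode costs 40 flash clocks, of which 32 carry data, so the average rate is 80% of the peak nibble rate. Longer bursts dilute the overhead further.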
http://www.rayslogic.com/Propeller/Products/FlashPoint/FlashPoint.htm
These are working with Catalina right now.
Also, I have a couple examples using it as a video buffer...
I worked with Rayman on this problem a little and IIRC the best you could get with a nibble is 1.25MHz which is 1/4th the best possible 2 chip read rate.
The great thing about using 2 chips besides better performance is getting double density.
It's possible to have a fast 32MB solution today which could be easily sub-divided.
The highest byte-wide Propeller COG-HUB transfer rate at 80MHz is 5MB/s.
To achieve this rate, one can use only 3 instructions per flash clock i.e.:
In the above case, CTRA is used to clock the data, and CTRB PHSB is used as the HUB pointer.
Data must be on P0..7. There is not enough instruction space to assemble nibbles and maintain the 5MHz window.
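The arithmetic behind the 5MB/s figure can be sketched as below. It assumes the usual Propeller 1 timing: a cog gets a hub slot every 16 system clocks, a hub write that lands on its slot costs 8 clocks, and ordinary instructions cost 4 clocks each. The function names are only for illustration.

```python
SYSCLK_HZ = 80_000_000
HUB_WINDOW_CLOCKS = 16  # each cog gets one hub slot per 16 system clocks
HUB_OP_CLOCKS = 8       # hub write that hits its slot without waiting (best case)
PLAIN_OP_CLOCKS = 4     # ordinary cog instruction


def bytes_per_second(windows_per_byte, sysclk_hz=SYSCLK_HZ):
    """Hub write rate when one byte is stored every N hub windows."""
    return sysclk_hz // (HUB_WINDOW_CLOCKS * windows_per_byte)


def spare_instructions(windows_per_byte):
    """Ordinary instructions that fit beside one hub write per byte."""
    loop_clocks = HUB_WINDOW_CLOCKS * windows_per_byte
    return (loop_clocks - HUB_OP_CLOCKS) // PLAIN_OP_CLOCKS


if __name__ == "__main__":
    print(bytes_per_second(1), spare_instructions(1))  # 16 clock loop
    print(bytes_per_second(2), spare_instructions(2))  # 32 clock loop
```

With one byte per hub window that is 5,000,000 bytes/s and room for only 2 ordinary instructions next to the hub write, which is where the 3 instruction limit comes from. Letting the loop span two hub windows halves the rate to 2,500,000 bytes/s but frees 6 ordinary instructions.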
I'm using Fast Read Quad I/O (0xEB) and bursting 128 bytes at a time.
Performance is about the same or better than SPIN in most cases.
The compiler is stable now and has been for a couple of months.
It has many Propeller features that the vast army of GCC developers can appreciate.
Jeff is working on Eclipse GUI plugins and has made excellent progress there.
We will have more GCC updates soon. Most likely we will start the P2 phase before summer.
The Propeller GCC Quad-SPI Flash solution has been working nicely for a while at 2.5MB/s.
I still have to optimize the driver timing to use the 5MB/s solution - I've been very busy.
I've attached a Propeller GCC 2x Quad-SPI XMMC reference design example SpinSocket-Flash.
Note the 2 SO8 devices to the left of Propeller on the DIP32 module. All passives are on the bottom side.
Do you have more details on this 1/4 ?
I can see the 5MHz (16 cycle loop) for byte wide (2 chips), but with SHL and OR, my pencil example gives one spare opcode in a 32 cycle loop, (ie very slightly unrolled) that does 2 nibble reads, so that seems to me to be one byte @ 2.5MHz ? ( 2.5MB/s)
On this QuadSPI device, that allows me 312us to read 16 bytes in an Octal Word 0AxH mode.
Which is fine for my needs - I can see 2x4 is more compact, and faster, so better suits execute-in-place, but I'd prefer avoiding the extra complexity and cost of 2 chips.
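For reference, the nibble-wide idea in the pencil example (two 4-bit reads joined with a shift and an OR, one byte stored per 32 clock loop) can be modelled like this. It is a Python model of the data path only, not the PASM loop itself. It assumes the flash presents the high nibble of each byte first, and the helper names are hypothetical.

```python
SYSCLK_HZ = 80_000_000
LOOP_CLOCKS = 32  # one assembled byte written to hub per 32 system clocks


def pack_nibbles(nibbles):
    """Join 4-bit samples into bytes, high nibble first (the SHL + OR step)."""
    if len(nibbles) % 2:
        raise ValueError("need an even number of nibbles")
    out = bytearray()
    for i in range(0, len(nibbles), 2):
        hi = nibbles[i] & 0xF      # first sample carries bits 7..4
        lo = nibbles[i + 1] & 0xF  # second sample carries bits 3..0
        out.append((hi << 4) | lo)
    return bytes(out)


def nibble_loop_rate(sysclk_hz=SYSCLK_HZ, loop_clocks=LOOP_CLOCKS):
    """Bytes per second if the loop really fits in loop_clocks."""
    return sysclk_hz // loop_clocks


if __name__ == "__main__":
    print(pack_nibbles([0xA, 0x5, 0xF, 0x0]).hex())
    print(nibble_loop_rate())
```

If the loop does fit in 32 clocks, the model gives 2,500,000 bytes/s, i.e. half the byte-wide rate rather than a quarter. Whether a real loop fits depends on how the hub pointer and loop counter are handled, which this sketch does not settle.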
Is there a link to the GCC Quad-SPI Flash solution ? - I search the GCC project for QuadSPI and Quad-SPI and get no finds ?
Here you go:
http://code.google.com/p/propgcc/source/browse/loader/spin/ssf_cache.spin
I'm too tired to answer the other part of your post right now.
Anyway, I guess you have to decide if you want to give up cogs or pins for faster access...
Where is the working snippet for loading from 4 bits of Flash from a COG to HUB ram?
I found something in FlashPointSuperQuad2a.spin but that's like 14 instructions.
We discussed a better performing version in a few posts, but I can't find it just now.
Thanks,
--Steve
If you have a solution, it could help to post it here, ( even if I do not need more speed right now, others might, or I might later..)
Corner cases are always good.
The link in #7 includes the important timer-setup preamble for the byte-wide version, Counter clocked. Nifty idea.
First, let me say that Ross has me convinced that the speed of the memory access doesn't make a huge impact for running C code when you use caching.
But, I was looking into video drivers, and I needed speed...
I know I have a 1-cog driver that uses counters to reduce the number of instruction cycles from 16 to 12, but I can't seem to find it right now...
Anyway, here's the cog#1 part of the 2-cog version:
The data we have collected with GCC is not necessarily consistent with this (not a criticism of Catalina or some GCC promotional statement).
Purely from a technical "change only one variable at a time" viewpoint, the speed of memory will not impact code that can reside mainly in cache. However, for code that will not easily stay in cache there is a dramatic difference.
Here is data from one experiment:
In the example "transpose" is a fairly small program, and sprintf is a fairly big program. The differences are quite varied.
Of course I don't have a FlashPoint driver, so that can't be compared. I would certainly like to have one though. I have parts.
You have one where speed matters and one where speed doesn't...
Perhaps the real world is somewhere in between...
(Pgm claims 1.5MB/s )
http://www.spansion.com/Products/Pages/serial-flash-spi-memory.aspx
This also has a 1K OTP block. ( and I think a 128 bit ID ?)