W25Q128JV SPI Flash on P2 - dual SPI operation?

rogloh · 2021-04-03 06:16

I was just looking at the SPI flash used on the P2-EVAL and read that it offers a dual SPI mode, where data is input/output on two pins, and that it can be clocked at up to 133MHz.

At first I got a little excited by this, but then unfortunately found it has been wired in the reverse data order to what the streamer natively uses, with IO-1 on pin 58 and IO-0 on pin 59. This means any data streamed in from it will have its bits out of order relative to the natural bit order within each byte. Even the streamer's re-ordering capability doesn't solve this.

Eg, if the data output by the Flash to both pins is in this sequence:

I/O-0  I/O-1  
D6      D7
D4      D5
D2      D3
D0      D1

Then it will be assembled into one of the following bytes and written into HUB memory depending on the "a" bit setting in the streamer.

b7 ..... b0
D0 D1 D2 D3 D4 D5 D6 D7 (needs reversing per byte)
D6 D7 D4 D5 D2 D3 D0 D1 (totally messed up, needs bit pair swaps)
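
The two orderings above can be reproduced with a small Python model. This is only a sketch under some assumptions drawn from the post: the flash sends the odd data bits on IO1 and the even data bits on IO0, most significant pair first; the board puts IO1 on P58 and IO0 on P59; and the 2-bit sample is packed into the byte either from the bottom up or from the top down. The function and parameter names are invented for illustration and do not correspond to any P2 tool or register.

```python
def dual_spi_capture(byte, swapped_wiring=True, first_sample_low=True):
    """Model one byte read in dual SPI mode as a 2-pin group would pack it."""
    result = 0
    for i in range(4):                        # four clocks, MSB pair first
        io1 = (byte >> (7 - 2 * i)) & 1       # odd data bit  (D7, D5, D3, D1)
        io0 = (byte >> (6 - 2 * i)) & 1       # even data bit (D6, D4, D2, D0)
        # The lower pin of the group is bit 0 of the sample, the upper pin bit 1.
        if swapped_wiring:                    # P2-EVAL: IO1 on P58, IO0 on P59
            sample = (io0 << 1) | io1
        else:                                 # hypothetical: IO0 on P58, IO1 on P59
            sample = (io1 << 1) | io0
        if first_sample_low:
            result |= sample << (2 * i)       # first sample lands in bits 1:0
        else:
            result |= sample << (6 - 2 * i)   # first sample lands in bits 7:6
    return result


if __name__ == "__main__":
    b = 0b10110001
    print(format(dual_spi_capture(b, True, True), "08b"))    # whole byte reversed
    print(format(dual_spi_capture(b, True, False), "08b"))   # each bit pair swapped
    print(format(dual_spi_capture(b, False, False), "08b"))  # unchanged
```

With the crossed wiring, neither packing direction returns the original byte; only the hypothetical uncrossed wiring with top-down packing does.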

If the two pins had been wired the other way, we would have been able to stream the data in/out of the P2 HUB at high speed (~33MB/s using 133MHz) with little COG intervention once the dual SPI transfer was initiated. After all the original discussion on this ages ago, it almost feels like it was arbitrary to wire it this way in the end, which is a pity.

Maybe some bit banging as Peter Jakacki has done for the SD can get the rate up higher...?

Also, even if P56, P57 get wired on custom boards to complete the Quad SPI data group, the nibble would still be in reversed bit order. Can't win.

The only way to make it work is to lay out the memory in the reversed bit order when writing to flash, but keep it correct for the boot image portion. That bit order is now locked into the boot ROM.

Cluso99 · 2021-04-03 06:51

It was discussed at length. We tried to get QSPI D0-D3 on pins P56-59 and CLK & CS on P60 & 61 but that wasted too many pins when not used

rogloh · 2021-04-03 07:10

Yeah, QSPI was too pin hungry, but in the end we could have fallen back to dual SPI and failed to even get that. I can understand coding with single bit SPI for boot, but the wiring order for any streamer based dual SPI is problematic now.

rogloh · 2021-04-03 07:35

Looking at Peter's code for SD transfers as an example:

.l0             wypin   #32,sck                 ' trigger 32 clocks
                waitx   sddly                   ' delay (set to 8 for 200MHz)
                rep     x,#32                   '  x = 2 for fast or 3 for slower
                 testp  miso wc                 ' 4 cycle/bit read
                 rcl    r1,#1
                 waitx  clkdly                  ' extra wait instruction for modified SPIRX
                movbyts r1,#%00011011           ' rearrange bytes
                wflong  r1                      ' save four bytes
                djnz    a,#.l0                  ' for long count

We can probably use a similar approach with ROLNIB and SHR and achieve 1 bit per instruction, or 133Mbps with a 266MHz P2. This is no faster than just using the streamer for a single SPI implementation, though the bit frequency is halved, which might be good for signal integrity on the board. Sending 2 bits at 66MHz is probably better than doing 1 bit at 133MHz. We still need to flip the bit order though, whereas the streamer can do this for us for single bit SPI.

                wypin   #16, sck                ' trigger 16 clocks
                waitx   delay                   ' tune for I/O and P2 clock
                rep     #2, #15                 ' repeat next instruction pair 15 times
                 rolnib accumulator, inb, #6    ' accumulate bit pairs
                 shr    accumulator, #2
                getnib  tmp, inb, #6            ' last pair is special
                testb   tmp, #2 wz
                testb   tmp, #3 wc
                rczl    accumulator
                movbyts accumulator, #%%0123
                'TODO: find some merge/split operations to rearrange bit pairs
                wflong  accumulator
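
The rearrangement left as a TODO above amounts to swapping the two bits of every pair in the long, because each sampled pair arrives with IO0 above IO1. A minimal Python model of that fix-up is sketched here, assuming a 32-bit accumulator whose pairs sit on even bit boundaries; `swap_bit_pairs` is a made-up name, and how best to express this in PASM2 is not settled in the thread.

```python
def swap_bit_pairs(value):
    """Swap the two bits of every aligned bit pair in a 32-bit value."""
    odd_bits = value & 0xAAAAAAAA    # bits 31, 29, ..., 1
    even_bits = value & 0x55555555   # bits 30, 28, ..., 0
    return (odd_bits >> 1) | (even_bits << 1)


if __name__ == "__main__":
    print(hex(swap_bit_pairs(0x00000001)))
    print(hex(swap_bit_pairs(0x80000000)))
```

The swap only moves bits within a byte, so it can be applied before or after the byte reordering without changing the result.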

Update: with overheads the rate drops to one 32 bit long every 42 instructions or so in a long transfer loop with no delay (will be worse than this in reality). That means it will achieve less than 12.67 MB/s vs 16.6MB/s peak for a streamer approach with a 266MHz P2.
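
The figures quoted here follow from simple arithmetic, sketched below in Python. The helper names are invented, and the 2 clocks per instruction figure is an assumption that matches the "1 bit per instruction or 133Mbps @ 266MHz" estimate earlier in the post.

```python
def spi_rate_mb_s(clock_mhz, bits_per_clock):
    """Peak transfer rate in MB/s for a given SPI clock and bus width."""
    return clock_mhz * bits_per_clock / 8


def loop_rate_mb_s(sysclk_mhz, instructions_per_long, clocks_per_instruction=2):
    """Rate in MB/s for a software loop that moves one 32-bit long per pass."""
    clocks_per_long = instructions_per_long * clocks_per_instruction
    return sysclk_mhz / clocks_per_long * 4


if __name__ == "__main__":
    print(spi_rate_mb_s(133, 1))    # single SPI streamed at 133MHz
    print(spi_rate_mb_s(133, 2))    # dual SPI at 133MHz
    print(loop_rate_mb_s(266, 42))  # 42-instruction bit-bang loop at 266MHz
```

These are peak numbers only; command, address and dummy clocks are ignored.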

cgracey · 2021-04-03 07:50

To read these pins in the unreversed order, set PinA of each pin to point to the other. This is done via bits 30..28 in WRPIN.
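
As a rough illustration of the suggestion above, the Python sketch below computes the top nibble of a WRPIN mode word for a pin that should read a neighbour's state. The encoding table is an assumption based on the P2 smart pin documentation as I read it (bit 31 inverts, bits 30..28 select this pin or a relative neighbour) and should be checked against the official docs; `wrpin_a_field` is a hypothetical helper.

```python
# Assumed encoding of WRPIN bits 30..28 (the 'A' input selector).
A_SELECT = {0: 0b000, +1: 0b001, +2: 0b010, +3: 0b011,
            -3: 0b101, -2: 0b110, -1: 0b111}


def wrpin_a_field(this_pin, source_pin, invert=False):
    """Return the WRPIN bits that make this_pin's A input follow source_pin."""
    offset = source_pin - this_pin
    if offset not in A_SELECT:
        raise ValueError("source pin must be within 3 pins of this pin")
    return ((int(invert) << 3) | A_SELECT[offset]) << 28


if __name__ == "__main__":
    print(hex(wrpin_a_field(58, 59)))  # P58 reads P59 (relative +1)
    print(hex(wrpin_a_field(59, 58)))  # P59 reads P58 (relative -1)
```

Under that assumption, each of the two flash data pins gets a mode word pointing at the other, so the pair reads back in unreversed order.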

rogloh · 2021-04-03 07:52

@cgracey said:
To read these pins in the unreversed order, set PinA of each pin to point to the other. This is done via bits 30..28 in WRPIN.

Excellent! Does this also work with the streamer or just reading pins manually? And similarly can the writes be swapped?

cgracey · 2021-04-03 07:59

@rogloh said:

@cgracey said:
To read these pins in the unreversed order, set PinA of each pin to point to the other. This is done via bits 30..28 in WRPIN.

Excellent! Does this also work with the streamer or just reading pins manually? And similarly can the writes be swapped?

It will work with the streamer for reading, only.

For writes, 20x more time will be spent waiting for erases and writes than on SPI activity, so it wouldn't even matter.

rogloh · 2021-04-03 08:01

Yeah read only bit swaps is fine if it works with the streamer. This is great news Chip! Cheers

cgracey · 2021-04-03 08:03

I need to document it better.

MagIO2 · 2021-04-03 08:28

Sorry, but I am not sure if I got the point.

If you write, you have the same swap in order as when you read, don't you? Otherwise the Prop 2 would have some more general problem.
How the bits are stored in flash is irrelevant if the data is read the same way, because swapping twice makes it right again. The remaining problem is that bit-correct code can't read/write locations which are meant to be written/read by the bit-incorrect code.

I am not sure how likely it is, that you need to update the flash-part holding the flashed program which is read during boot. This would be the only use-case that runs into trouble.
I'm not sure that hiding the bit re-ordering in some smartpin configuration helps with readability and later understanding of the code.

rogloh · 2021-04-03 08:30

The point is that we can write, either using regular single SPI mode (with no swap required), or in dual SPI mode by swapping during our two-bit bang in software, before going out to the pins (which will swap the data back due to the board wiring).
For reads we will read using 2 bits in dual SPI mode which will normally swap, however if we re-map the two input pins then the data will be unswapped for us. In all cases the bits will be stored in memory in the normal bit order as used by boot etc.
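
The write and read paths described above can be checked with a small Python model. It is a sketch that assumes the crossed wiring acts as a swap of the two bits in every bit pair of a byte, and it treats the software swap and the input pin re-map as the same operation; all names are illustrative.

```python
def swap_pairs(byte):
    """Swap the two bits of every bit pair in a byte."""
    return ((byte & 0xAA) >> 1) | ((byte & 0x55) << 1)


board_wiring = swap_pairs    # IO0/IO1 crossed between the P2 and the flash
software_swap = swap_pairs   # done by the COG in a 2-bit bit-bang write
pin_remap = swap_pairs       # input re-map applied to the two data pins


def dual_write(byte):
    """Return the byte as it ends up stored in the flash."""
    return board_wiring(software_swap(byte))


def dual_read(stored, remap=True):
    """Return the byte as the P2 sees it on a dual SPI read."""
    seen = board_wiring(stored)
    return pin_remap(seen) if remap else seen


if __name__ == "__main__":
    for b in (0x00, 0x5A, 0xB1, 0xFF):
        print(hex(b), hex(dual_write(b)), hex(dual_read(b)), hex(dual_read(b, remap=False)))
```

In this model the stored image is identical to what a single SPI write would produce, which is why the boot image stays readable.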

It's as clear as mud

VonSzarvas · 2021-04-03 09:49

Does the order swapping also allow for non-sequential patterns? A possible SD socket on the new edge module could allow 4bit access in that case.

Fitting in with the boot pins and using P56/57 for the extra 2 data lines, from sketchy memory it goes something like d0/3/2/1 at the moment. That pattern represents the idea anyhow.

Wuerfel_21 · 2021-04-03 10:06

@VonSzarvas said:
Does the order swapping also allow for non-sequential patterns? A possible SD socket on the new edge module could allow 4bit access in that case.

Fitting in with the boot pins and using P56/57 for the extra 2 data lines, from sketchy memory it goes something like d0/3/2/1 at the moment. That pattern represents the idea anyhow.

Should work. But only for reading, which makes it less useful. (though bit swizzling the data is just 3 instructions (SPLITB/MOVBYTS/MERGEB) per long)
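
The three-instruction swizzle mentioned above can be modelled in Python to see why it works. The instruction semantics below are my reading of the P2 instruction descriptions (SPLITB gathers every 4th bit into a byte, MOVBYTS reorders bytes by a 2-bit-per-byte selector, MERGEB is the inverse of SPLITB) and should be treated as assumptions; the Python function names simply mirror the mnemonics.

```python
def splitb(x):
    """Byte k of the result collects bits k, k+4, k+8, ... of x."""
    out = 0
    for k in range(4):
        for j in range(8):
            out |= ((x >> (4 * j + k)) & 1) << (8 * k + j)
    return out


def mergeb(x):
    """Inverse of splitb: interleave the four bytes back into nibbles."""
    out = 0
    for k in range(4):
        for j in range(8):
            out |= ((x >> (8 * k + j)) & 1) << (4 * j + k)
    return out


def movbyts(x, sel):
    """Result byte i is source byte number sel[2i+1:2i]."""
    src = [(x >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= src[(sel >> (2 * i)) & 3] << (8 * i)
    return out


def permute_nibble_bits(x, sel):
    """Apply the same bit permutation to every nibble of a 32-bit long."""
    return mergeb(movbyts(splitb(x), sel))


if __name__ == "__main__":
    # Selector %10_11_00_01 swaps bits 0<->1 and 2<->3 of each nibble.
    print(hex(permute_nibble_bits(0x00000001, 0b10110001)))
```

Any fixed reordering of four data lines is one choice of the 8-bit selector, so a non-sequential pattern costs the same three steps as a simple swap.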

rogloh · 2021-04-03 10:19

@VonSzarvas Is this proposed SD socket you are talking about also for the PSRAM based Edge as well?

It might be possible to remap but I do wonder how it will interwork with the flash chip select if the flash is also enabled.

EDIT: It's presumably okay because the Flash CS still rises every SD clock cycle and this doesn't give it enough clocks to activate any data transfers. We would also remove the SD pin remap (or adjust for dual SPI use) when accessing the SPI flash.

I'm sort of tempted to try to make a combo SPI flash / SD sector read/write driver and build it into my existing memory driver suite. Like HyperRAM/HyperFlash and PSRAM it would probably need its own COG to manage its bus, but doing it this way could enable both devices to be shared properly/cleanly amongst all COGs if there is a known sequence to achieve this, instead of having different COGs fighting each other for access as we would have now.

MagIO2 · 2021-04-03 14:29

I think the purpose of this design was to use either SD card OR Flash to boot from. It was not meant for using both.

I can see how useful it is to use both, but then the design should be changed. Saving pins comes from sharing data and clock lines between the different devices, not from sharing the *CS lines in this way. Who wants to test hundreds of SD cards to see if there is one which has problems seeing a clocked CS line? It does not look professional.
To be flexible, it would be better to have the *CS lines connected via tiny jumpers (or dip-switches like on the edge board?). Either the chip select is jumpered to a propeller pin, or it is jumpered to high level and the propeller pin can be used for other things. So each user can decide for flash, for SD, for both or for none, which would then free all the unused pins for other usages.

In case you really need both, you are already happy that adding SD only costs 1-3 pins (one for SD with 1 data line, 3 for SD with 4 data lines) instead of 4-6 pins.

evanh · 2021-04-03 22:07

I put a dual SPI feature into Ozpropdev's flashloader.spin2 that gets used by loadp2 for booting from that chip. My method used a pair of smartpins rather than a streamer. The plus side of this is I didn't have to worry about aligning the timing of the clock. The down side was the data did need massaging to merge the data from the two smartpins. But that was quite doable even at sysclock/2 if I remember correctly.

EDIT: Just had a quick squizz at the source code and now I remember I had to code it to collect data 32 bits at a time to have the processing time to massage the data. It used WFLONG to place in hubRAM. Smartpins configured as 16-bit shifter/buffer.
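
The merge step described above can be pictured as interleaving two 16-bit words into one long. The Python sketch below is not the flashloader code, which is not shown in the thread; it assumes each smartpin delivers its 16 bits in natural order in the low word, with the IO1 pin holding the odd data bits and the IO0 pin the even ones. Names are illustrative.

```python
def interleave_words(io1_word, io0_word):
    """Merge two 16-bit captures into one 32-bit long, IO1 bits on odd positions."""
    out = 0
    for i in range(16):
        out |= ((io0_word >> i) & 1) << (2 * i)
        out |= ((io1_word >> i) & 1) << (2 * i + 1)
    return out


def deinterleave_long(value):
    """Split a 32-bit long back into its (io1_word, io0_word) captures."""
    io1 = io0 = 0
    for i in range(16):
        io0 |= ((value >> (2 * i)) & 1) << i
        io1 |= ((value >> (2 * i + 1)) & 1) << i
    return io1, io0


if __name__ == "__main__":
    io1, io0 = deinterleave_long(0xDEADBEEF)
    print(hex(io1), hex(io0), hex(interleave_words(io1, io0)))
```

Because each pin is captured separately, the crossed wiring only decides which smartpin is treated as IO1, so no extra swap is needed.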

rogloh · 2021-04-04 00:06

@evanh said:
I put a dual SPI feature into Ozpropdev's flashloader.spin2 that gets used by loadp2 for booting from that chip. My method used a pair of smartpins rather than a streamer. The plus side of this is I didn't have to worry about aligning the timing of the clock. The down side was the data did need massaging to merge the data from the two smartpins. But that was quite doable even at sysclock/2 if I remember correctly.

EDIT: Just had a quick squizz at the source code and now I remember I had to code it to collect data 32 bits at a time to have the processing time to massage the data. It used WFLONG to place in hubRAM. Smartpins configured as 16-bit shifter/buffer.

Sounds like a neat idea. Those SPLIT / MERGE instruction variants come in handy yet again.

evanh · 2021-04-04 01:10

Yes! And even something as bland as ROLWORD helped too.

28 clocks total, including a DJNZ and CALL/RET. 32 clocks being the limit imposed by bit rate at sysclock/2.
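
The 32 clock limit follows from the bus width and clock ratio, as this short Python check shows. It assumes one 32-bit long per loop pass, two data pins, and an SPI clock of sysclock/2; the helper name is made up.

```python
def clocks_per_long(sysclocks_per_spi_clock, data_pins):
    """System clocks available while one 32-bit long arrives on the bus."""
    spi_clocks = 32 // data_pins
    return spi_clocks * sysclocks_per_spi_clock


if __name__ == "__main__":
    budget = clocks_per_long(2, 2)
    print(budget, budget - 28)  # budget and headroom left by a 28-clock loop
```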

rogloh · 2021-04-04 01:55

@MagIO2 said:
I think the purpose of this design was to use either SD card OR Flash to boot from. It was not meant for using both.

I can see how useful it is to use both, but then the design should be changed. Saving pins comes from sharing data and clock lines between the different devices, not from sharing the *CS lines in this way. Who wants to test hundreds of SD cards to see if there is one which has problems seeing a clocked CS line? It does not look professional.
To be flexible, it would be better to have the *CS lines connected via tiny jumpers (or dip-switches like on the edge board?). Either the chip select is jumpered to a propeller pin, or it is jumpered to high level and the propeller pin can be used for other things. So each user can decide for flash, for SD, for both or for none, which would then free all the unused pins for other usages.

In case you really need both, you are already happy that adding SD only costs 1-3 pins (one for SD with 1 data line, 3 for SD with 4 data lines) instead of 4-6 pins.

Yeah it can certainly become problematic sharing. We've already seen this with the DO release thing needing extra clocks etc.

One issue in the ideas from my table above is that if you were to do a flash CS transaction, an SD card (in SD mode) would get a clock pulse and it would clock in its CMD pin state. Its CMD pin is electrically wired to the Flash IO0 pin. That pin is driven by the flash in dual-SPI mode during reads and tri-states only after its CS goes high. This means that there will be random CMD activity during each SD clock, and this is not good as it could eventually decode to a valid SD command. In practice the CRC will probably fail most of the time, but it's still not good and could put the SD card into some weird state waiting for the command token to end or something, or worse, have the card start to respond on the CMD pin while the flash was active (CMD is bi-directional in SD mode). If you could ensure that CMD was always high before the flash CS went high, this could work, but you can't in the Flash's dual I/O mode because the Flash is driving this signal (you could do it in its single I/O mode though, because the COG always drives it then). I think this is getting too tricky to figure out...

rogloh · 2021-04-27 06:09

All these SPI NOR FLASH devices are a bit different.

I've been browsing a few manufacturers' data sheets recently for SPI NOR flash chips from Cypress, Winbond, ISSI, Micron etc. They all seem to have different register formats for status, and there are differences in the commands to enter/leave Quad-SPI mode and QPI mode, to enable 3 or 4 byte addressing when the devices exceed 16MB in size, etc. The JEDEC SFDP table helps a little to figure out what each can do, but even so, coding some generic PASM driver for different SPI flash parts is troublesome. The high level code will need to set up different registers depending on what is present on the board and be aware of any status register differences during flashing etc. I'm still hopeful of doing a driver for this that fits into my memory driver framework and can flash the part too (like we can for HyperFlash). To start with I could try to support the Winbond flash part used on the P2-EVAL, and I'm hoping to get Dual-SPI supported for faster reads. Quad-SPI and maybe QPI would be great to aim for down the track for even higher performance devices and other board setups, but this is where a lot of variation happens between chips unfortunately. Also, many devices have different speed ranges too.

rogloh · 2021-04-29 10:18

Moderators, all 5 of pukatchu posts made in a 5 minute period in 5 threads have been advertising SPAM.

rogloh · 2021-04-29 12:01

I've made a start at another low level driver to add to my suite of memory types supported...which is currently HyperRAM/HyperFlash/PSRAM/SRAM/HUBRAM.

Expected features:

support QPI, Quad-SPI, Dual-SPI, Single SPI RAMs, and NOR flash devices such as the part fitted on the P2-EVAL.
programmable 4:4:4, 1:4:4, 1:2:2, and 1:1:1 transfer operation for reads/writes
selectable input delay for various P2 clock speeds
multiple memory devices (up to 16) sharing a common bus but with independent CS pins (and probably also CLKs)
Support to reverse D0/DI pins in S/W for P2 boot flash wiring to support Dual-SPI transfers (for reads only, this is as yet unproven but Chip says it would work with the streamer)
Different device manufacturers read/write SPI commands supported - at least two types of devices from the same driver (eg, RAM and Flash but I'm unlikely to be able to support too many more variants simultaneously due to LUTRAM limits)
General purpose register read and write functionality
Settable dummy clocks/latency per device for reads
Flash erase and programming (full feature set is still somewhat TBD, as there are many differences there)
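
To make the transfer-mode and latency items above concrete, here is a Python sketch that counts the clocks in one read transaction. It is a simplified model, not the driver: it assumes an 8-bit command, treats any mode bits as part of the dummy clocks, and picks 3 or 4 address bytes purely from capacity. The helper names and the dummy clock values used in the example are hypothetical; real values come from each part's data sheet.

```python
def addr_bytes_for(capacity_bytes):
    """3-byte addressing reaches 16MB; larger parts need 4 address bytes."""
    return 3 if capacity_bytes <= 16 * 1024 * 1024 else 4


def read_clocks(mode, addr_bytes, dummy_clocks, data_bytes):
    """Clock count for one read, where mode is 'cmd:addr:data' bus widths."""
    cmd_w, addr_w, data_w = (int(n) for n in mode.split(":"))
    cmd = 8 // cmd_w
    addr = (addr_bytes * 8) // addr_w
    data = (data_bytes * 8) // data_w
    return cmd + addr + dummy_clocks + data


if __name__ == "__main__":
    a = addr_bytes_for(16 * 1024 * 1024)   # a 128Mbit part
    for mode, dummy in (("1:1:1", 8), ("1:2:2", 4), ("4:4:4", 6)):
        print(mode, read_clocks(mode, a, dummy, 256))
```

For long bursts the data phase dominates, so the fixed command and address overhead matters mostly for short random reads.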

Based on what I've coded so far I think it should fit in the COG, but I'll need to code more to be 100% sure.

Even though faster RAM technologies are possible on the P2, if a QPI SRAM is used with 108/133MHz clocks, then 54/66MB/s burst transfer rates can be attained, and this should be sufficient for simple VGA video resolutions at lower bit depths.
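
As a rough sanity check of the video claim, the burst rates can be compared with what a basic VGA mode consumes. This Python sketch uses the standard 25.175MHz pixel clock of 640x480 at 60Hz as its example; the choice of that mode and the helper names are my own assumptions, and command, address and refill overheads are ignored.

```python
def burst_mb_s(clock_mhz, data_pins):
    """Peak memory burst rate in MB/s."""
    return clock_mhz * data_pins / 8


def video_mb_s(pixel_clock_mhz, bits_per_pixel):
    """Pixel data rate in MB/s during the active part of a scan line."""
    return pixel_clock_mhz * bits_per_pixel / 8


if __name__ == "__main__":
    for clk in (108, 133):
        print(clk, burst_mb_s(clk, 4))
    for bpp in (8, 16):
        print(bpp, video_mb_s(25.175, bpp))
```

On these numbers an 8 bit per pixel 640x480 mode fits with room to spare, while 16 bits per pixel leaves little margin at 108MHz.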

I think the P2D2 P2PAL board is going to have an option for a QPI SRAM/FLASH SOIC device to be fitted (in addition to the faster HyperRAM option).

While operating, this driver will consume a COG, which may appear wasteful, but it can always be shut down when not used or after the data from a boot flash device gets copied from flash into some other faster RAM, e.g. a R/W disk partition in HyperRAM or PSRAM, for access by other COGs at even higher speed.
