LY68L6400 8MB 8-pin RAM and SPI — Parallax Forums

Peter Jakacki Posts: 10,193
edited 2019-10-15 00:13 in Propeller 2
I decided to order some of these LY68L6400 64Mbit RAM chips on Friday and had them turn up on my doorstep on Monday! :) Even though I don't have a special PCB for them, I reasoned that I could test them in standard single SPI mode by replacing the Flash chip on my P2D2, which I did. Using the standard SF commands that I have for serial Flash I could read the ID, write, read, and dump from it just like Flash, although I will write some specific commands for this device. While the memory will be useful for all kinds of stuff, I really want to be able to run the SPI bus up to the 84MHz continuous sequential read speed that is possible, whereas right now the SPI is bit-bashed and runs at about 1/10 sysclk, so 25MHz for a 250MHz P2 clock.

I'd like to experiment with an 8-color VGA mode first with this basic arrangement, since I should be able to read 3 bits for every pixel. If I could run it in QSPI mode then full 640x480x8 is possible, and although it would involve an extra cog working full-time to buffer a scan-line, it would mean that most of the fast hub RAM would be available for other things.
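
The bits-per-pixel estimate can be checked with a little arithmetic. The sketch below is only a rough budget, assuming standard 640x480@60Hz VGA timing (25.175MHz pixel clock, 800 pixel clocks per scan line) and ignoring command and address overhead; the function name `bits_per_pixel` is made up for illustration.

```python
# Rough bit budget per VGA pixel for a serial memory feeding the display.
# Assumption: standard 640x480@60Hz timing (25.175 MHz pixel clock,
# 800 pixel clocks per scan line including blanking).
PIXEL_CLOCK_HZ = 25_175_000
ACTIVE_PIXELS = 640
TOTAL_PIXELS_PER_LINE = 800


def bits_per_pixel(spi_clock_hz, data_lines, use_blanking=False):
    """Bits deliverable per displayed pixel, ignoring command/address overhead."""
    bits_per_pixel_clock = spi_clock_hz * data_lines / PIXEL_CLOCK_HZ
    if use_blanking:
        # With a scan-line buffer the read can continue through horizontal blanking.
        return bits_per_pixel_clock * TOTAL_PIXELS_PER_LINE / ACTIVE_PIXELS
    return bits_per_pixel_clock


if __name__ == "__main__":
    cases = [
        ("bit-bashed SPI @ 25MHz", 25e6, 1),
        ("single SPI @ 84MHz", 84e6, 1),
        ("QSPI @ 84MHz", 84e6, 4),
    ]
    for label, clock, lines in cases:
        direct = bits_per_pixel(clock, lines)
        buffered = bits_per_pixel(clock, lines, use_blanking=True)
        print(f"{label}: {direct:.2f} bits/pixel ({buffered:.2f} with line buffering)")
```

Under these assumptions single SPI at 84MHz gives a little over 3 bits per pixel and QSPI at the same clock gives more than 8, which is consistent with the 8-color and 640x480x8 targets mentioned above.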

Here's the thing, the Smartpin SPI modes I've seen seem to be awkward but I may be mistaken. Is there a good example of using the smartpins in SPI mode that might be useful?

Here's a terminal session interacting with the chip using TAQOZ SF commands.
TAQOZ# SFJID .L --- $FFFF_FC0D ok
TAQOZ# $8.0000 $40 SF DUMP --- 
0008_0000: AA CA AA AB  AA CA EA BA  AA AB AF CB  D2 A8 BE AE     '................'
0008_0010: AA AA 29 CF  AE AA BA FA  4E AE AB FF  AA EA EF BD     '..).....N.......'
0008_0020: 7A EF AE 2A  B2 3C AB AA  3B AB AA AE  AE B0 A2 FA     'z..*.<..;.......'
0008_0030: AA 89 BE DE  AA FA F8 AA  A8 AB CD AF  EA B0 AA BC     '................'
TAQOZ# $8.0000 $40 DUMP --- 
0008_0000: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'
0008_0010: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'
0008_0020: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'
0008_0030: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'

and copying the TAQOZ dictionary into it as well (easy to see the ASCII)
TAQOZ# WE ---  ok
TAQOZ# @WORDS $8.0000 $4000 SFWRS --- \
TAQOZ# $8.0000 $40 DUMP --- 
0008_0000: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'
0008_0010: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'
0008_0020: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'
0008_0030: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................'
TAQOZ# $8.0000 $40 SF DUMP --- 
0008_0000: 06 52 47 42  53 51 5A 1E  53 04 4D 55  58 51 14 53     '.RGBSQZ.S.MUXQ.S'
0008_0010: 04 53 45 54  51 0A 53 04  4D 55 58 51  00 53 07 4D     '.SETQ.S.MUXQ.S.M'
0008_0020: 4F 56 42 59  54 53 F6 52  06 53 45 55  53 53 52 EC     'OVBYTS.R.SEUSSR.'
0008_0030: 52 06 53 45  55 53 53 46  E2 52 06 4D  45 52 47 45     'R.SEUSSF.R.MERGE'

Checking upper 4MB and 8MB for mirroring or random.
TAQOZ# $1.8000 $40 SF DUMP --- 
0001_8000: AA 8A AA AB  8A CA AA B2  AA AB AF 8B  42 A8 AA AE     '............B...'
0001_8010: AA 2A 29 CF  AE AA BA BA  4E AE AB EA  AA AA EE BD     '.*).....N.......'
0001_8020: 7A 5F AE 2E  B2 1C AA 28  3A 2B AA AE  AA B0 A2 FA     'z_.....(:+......'
0001_8030: AA 89 BA DE  AA FA F8 8A  A8 AA CC 8F  EA 30 A2 A0     '.............0..' ok
TAQOZ# 4 MB .L --- $0040_0000 ok
TAQOZ# $40.8000 $40 SF DUMP --- 
0040_8000: AA AA AA AA  AA A2 DA AA  AA AA 8A AA  BA AA A8 A2     '................'
0040_8010: AA AA 2E AA  AA AA 1A AA  A8 AA EA AA  AA A8 A9 B2     '................'
0040_8020: AA AA AA 2E  E2 AA A6 9A  AB 2A 2A AA  AA AA AA AA     '.........**.....'
0040_8030: AA AA AB AE  AB 28 EA BA  AA AA 2A CA  AA AA 8A AA     '.....(....*.....' ok
TAQOZ# $7F.8000 $40 SF DUMP --- 
007F_8000: 57 57 DD 55  55 5D 55 41  55 55 55 55  55 55 55 55     'WW.UU]UAUUUUUUUU'
007F_8010: 55 05 1D 57  D5 55 55 51  55 55 55 95  DF 55 55 54     'U..W.UUQUUU..UUT'
007F_8020: 5F 57 55 55  5D D5 15 55  55 55 55 55  55 55 D5 D5     '_WUU]..UUUUUUU..'
007F_8030: 55 55 D7 45  55 55 5D 55  75 57 55 55  55 55 55 55     'UU.EUU]UuWUUUUUU' ok
TAQOZ#

Comments

  • evanh Posts: 15,126
    Peter Jakacki wrote: »
    Here's the thing, the Smartpin SPI modes I've seen seem to be awkward but I may be mistaken. Is there a good example of using the smartpins in SPI mode that might be useful?
    There's no real advantage. Bit-bashing is fast already. They both have overhead gaps. The single word buffering is not enough to make a difference. Funnily that buffer is more helpful at slower data rates because then it can pace itself while the program is away.

    A streamer can do better because it DMAs the hub data direct. There is a lot more setup for the streamers but, for block transfers, it would be worth the effort in the end.

  • evanh Posts: 15,126
    edited 2019-10-15 00:51
    evanh wrote: »
    ... Funnily that buffer is more helpful at slower data rates because then it can pace itself while the program is away.
    That comes into its own if you decide to use interrupts. But just be wary that if you push for the highest data rate then the IRQ is going to gobble up the cog's time.
  • evanh Posts: 15,126
    Mike has done an object that dedicates a whole cog to buffering for the comport. Going down that path is an option. Then using the cog's time for optimising speed is up for grabs. You've also got a whole streamer to yourself then too.
  • jmg Posts: 15,140
    edited 2019-10-15 01:36
    Peter Jakacki wrote: »
    Here's the thing, the Smartpin SPI modes I've seen seem to be awkward but I may be mistaken. Is there a good example of using the smartpins in SPI mode that might be useful?

    All of the smart pin modes could do with some examples.
    The most useful thing I can see for SPI HW is in buying time inside the short CSL window the PSRAMs have, i.e. transferring more bits per address+data block.

    Peter Jakacki wrote: »
    .. If I could run it in QSPI mode then full 640x480x8 is possible and although it would involve an extra cog working full-time to buffer a scan-line, it would mean that most of fast hub RAM would be available for other things

    I think P2 QSPI in SPI HW did not make the cut, but the streamer can do nibbles, so that may be a means to QSPI?

    Peter Jakacki wrote: »
    Here's a terminal session interacting with the chip using TAQOZ SF commands.
    ...
    and copying the TAQOZ dictionary into it as well (easy to see the ASCII)
    ...
    Checking upper 4MB and 8MB for mirroring or random.

    One piece of info that's annoyingly hard to glean from these PSRAMs is their tolerance on CS duty cycles.

    e.g. Some spec refresh times in the 16~64ms region, and CSL of 4us, but are less clear on just when the refresh counter advances.
    e.g. If it advances on a CS edge (no internal clocks) and ignores the user address, then you need 8192 pulses inside (say) 64ms to keep refresh.
    If a separate clock runs inside (when CS=H), it just needs a certain % of HI time on CS, and the CSL time can be stretched.

    Maybe TAQOZ can do some retention checks, and vary CS to see when the memory fades, and what duty CS needs ?
    I'd expect Write, then CS=L for 6 seconds, then read, to fail, but write,CS=H.32ms, then CS=L.32ms,read, may be ok ?
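
The "8192 pulses inside 64ms" case above works out to a maximum spacing between CS pulses. A minimal sketch of that arithmetic follows, assuming the hypothetical model in which each CS pulse refreshes exactly one row; the 8192-row count and the 16~64ms retention range are the figures from this post, not confirmed datasheet values.

```python
def max_cs_period_us(rows, retention_ms):
    """Longest allowed CS cycle period if every CS pulse refreshes exactly one row."""
    return retention_ms * 1000.0 / rows


if __name__ == "__main__":
    ROWS = 8192  # hypothetical row count taken from the post above
    for retention_ms in (16, 32, 64):
        period = max_cs_period_us(ROWS, retention_ms)
        print(f"{retention_ms}ms retention -> one CS pulse at least every {period:.2f}us")
```

For 64ms this gives one pulse every 7.8125us or faster, so under this model a retention test that holds CS low for much longer than that would be expected to lose data.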

  • Hi Peter,

    Please can you post transfer speed (MB/s) numbers for LY68L6400 SPI and raw SD card sectors (without FAT32) for comparison?

    Is there any other (high capacity) storage available for P2 with high transfer rates?

    Thank you!
  • jmg Posts: 15,140
    Just looking at this :
    http://www.avalanche-technology.com/products/discrete-mram/p-sram-gen-2/

    1Mbit – 16Mbit SPI MRAM

    Memory for evanh ;)

    Not sure where to buy it yet .... and endurance may need to be watched.
  • jmg wrote: »
    Not sure where to buy it yet .... and endurance may need to be watched.

    The site has a link that, as far as I understand, claims infinite write endurance (as any MRAM-based device should have?)
  • Peter Jakacki Posts: 10,193
    edited 2019-10-20 00:58
    It seems to me that 8MB of SPI RAM makes a great file cache since I can do a random read in microseconds or buffer a cached sector to hub memory in 200us without the SD setup latency. Since I am redoing my file system I can see the advantage of building in an option that can manage a RAM cache. There is not any advantage when reading in a large file, but there is a huge advantage in random access when reading a large file up to 8MB.

    As an exercise I might even output 1bpp VGA from a cog and see how that goes. If it works well then that means a dedicated chip in QPI mode can do 4bpp VGA, or with dual chips 8bpp VGA. I'm not sure how well modern monitors handle flicker, but if I were to use this 1-bit SPI arrangement in place of the Flash and output 2 bits for every pixel by using 1-bit alternate frames, I could achieve 4-color VGA.

    @Ramon
    The SPI bus is bit-bashed and runs at the same speed as SD etc., which is about 1/10 of the sysclk. With a 240MHz sysclk that translates to a 24MHz clock, and we need to use smartpin SPI effectively to really push the limits.

    Timing a SPI RAM multi-block WRITE reveals we are transferring a byte every 377ns or 2.65MB/s at 240MHz sysclk.
    TAQOZ# 0 0 1 MB LAP SRWRS LAP .LAP --- 90,505,505 cycles= 377,106,270ns @240MHz ok
    

    Read timing is very similar too with a byte read every 383ns (reading 100kB)
    TAQOZ# 0 $1.0000 100000 LAP SRRDS LAP .LAP --- 9,200,081 cycles= 38,333,670ns @240MHz ok
    

    These figures aren't much different from SD reads which suffer from setup latency but power along with multi-block reads. The advantage of the RAM is that it needs only about 3us to set up the address.
    TAQOZ# 0 LAP SRRD LAP SPICE .LAP --- 713 cycles= 2,970ns @240MHz ok
    

    and so in TAQOZ reading a single random byte anywhere in the 8MB takes about 4us.
    TAQOZ# $CE7E LAP SRC@ LAP .LAP --- 1,001 cycles= 4,170ns @240MHz ok
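
The LAP cycle counts above convert to per-byte times and throughput as sketched below. This only re-derives figures from the numbers printed in the sessions; the helper names are made up. One caveat: the quoted 377ns write figure matches when `1 MB` is counted as 1,000,000 bytes. If TAQOZ's `MB` means 1,048,576 bytes, as the `4 MB .L` output of $0040_0000 earlier in the thread suggests, the per-byte time comes out nearer 360ns.

```python
SYSCLK_HZ = 240_000_000  # sysclk reported by .LAP in the sessions above


def ns_per_byte(cycles, nbytes, sysclk_hz=SYSCLK_HZ):
    """Average time per byte, in nanoseconds, for a timed transfer."""
    return cycles / sysclk_hz * 1e9 / nbytes


def mbytes_per_second(cycles, nbytes, sysclk_hz=SYSCLK_HZ):
    """Throughput in decimal megabytes per second."""
    return nbytes * sysclk_hz / cycles / 1e6


if __name__ == "__main__":
    write_cycles = 90_505_505  # SRWRS of "1 MB"
    read_cycles = 9_200_081    # SRRDS of 100000 bytes
    for label, nbytes in (("1 MB = 1,000,000", 1_000_000), ("1 MB = 1,048,576", 1 << 20)):
        print(f"write ({label}): {ns_per_byte(write_cycles, nbytes):.1f}ns/byte, "
              f"{mbytes_per_second(write_cycles, nbytes):.2f}MB/s")
    print(f"read: {ns_per_byte(read_cycles, 100_000):.1f}ns/byte, "
          f"{mbytes_per_second(read_cycles, 100_000):.2f}MB/s")
```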
    
  • Ramon Posts: 484
    edited 2019-10-20 05:56
    Thank you so much for all those numbers. Yes, you are right that latency is even more important. There are some transfer speed and latency tests done on the Teensy forums about that, and I wanted to get some data from the P2 to compare.

    On the Teensy Audio interface board they had two options on layout: SD Card and 8-SOIC for FLASH or RAM (W25Q128/23LC1024).
    They did that with the purpose of having a low latency storage for wavetable/LUT or recording. Someone reported exact numbers at different block sizes (512 bytes, 50 bytes, 32K) and I remember that SD card can have several tens of us of latency while Flash and RAM can decrease latency to <5 us. I don't have the links at hand but they can be searched on their forums (not sure if they were posted on Teensy or Arduino forums).
  • Wuerfel_21 Posts: 4,374
    edited 2019-10-20 10:11
    SD card latency varies a lot from card to card. I don't remember my measurement, but a new A1 rated (= min. 1500 IOPS) card has significantly less latency than whatever old clunker of a card one may pull from the drawer for propeller funtimes.

    (There is research suggesting A2 rated cards are actually slower, since they only need to reach their 4000 IOPS goal when cheating using advanced protocol features)
  • Peter Jakacki Posts: 10,193
    edited 2019-10-20 14:58
    Wuerfel_21 wrote: »
    SD card latency varies a lot from card to card. I don't remember my measurement, but a new A1 rated (= min. 1500 IOPS) card has significantly less latency than whatever old clunker of a card one may pull from the drawer for propeller funtimes.

    (There is research suggesting A2 rated cards are actually slower, since they only need to reach their 4000 IOPS goal when cheating using advanced protocol features)

    Yes, I can never understand why some insist on using their old worn out card when a brand new Sandisk Ultra 16GB A1 card can be had for ten bucks or less, and a 32GB for only a couple of dollars more.

    Here's the speed and latency test for an "old" 8GB Ultra from 2016. The sector speeds are spread over the drive and include the latency.
    *** SPEEDS *** 
        LATENCY......................... 230us,376us,307us,306us,304us,305us,339us,310us,
        SECTOR.......................... 415us,518us,504us,501us,500us,501us,537us,506us,
        BLOCKS.......................... 2,519kB/s @240MHz
    

    This is the full disk report:
    TAQOZ# .DISK ---  CARD: SANDISK   SD SL08G REV$80 #1561170528 DATE:2016/2
    
                       *** OCR *** 
        VALUE........................... $C0FF_8000
        RANGE........................... 2.7V to 3.6V
    
                       *** CSD *** 
        CARD TYPE....................... SDHC
        LATENCY......................... 1ms+1400 clocks 
        SPEED........................... 50MHz 
        CLASSES......................... 010110110101
        BLKLEN.......................... 512
        SIZE............................ 7,761MB
        Iread Vmin...................... 100ma
        Iread Vmax...................... 25ma
        Iwrite Vmin..................... 1ma
        Iwrite Vmax..................... 45ma
    
                     *** SPEEDS *** 
        LATENCY......................... 230us,376us,307us,306us,304us,305us,339us,310us,
        SECTOR.......................... 415us,518us,504us,501us,500us,501us,537us,506us,
        BLOCKS.......................... 2,519kB/s @240MHz
    
                       *** MBR *** 
        PARTITION....................... 0 00 INACTIVE
        FILE SYSTEM..................... FAT32 LBA
        CHS START....................... 1023,254,63
        CHS END......................... 0,0,0
        FIRST SECTOR.................... $0000_2000
        TOTAL SECTORS................... 15,515,648 = 7,944MB
    
    00170: 0000_0000 0000_0001 0000_F000 506F_7250     '............ProP'
    
                      *** FAT32 *** 
        OEM............................. TAQOZ P2
        Byte/Sect....................... 512
        Sect/Clust...................... 64 = 32kB
        FATs............................ 2
        Media........................... F8
        Sect/Track...................... $003F
        Heads........................... $00FF
        Hidden Sectors.................. 8,192 = 4MB
        Sect/Part....................... 15,515,648 = 7,944MB
        Sect/FAT........................ 1,894 = 969kB
        Flags........................... 0
        Ver............................. 00 00 
        ROOT Cluster.................... $0000_0002 SECTOR: $0000_2EEC
        INFO Sector..................... $0001 = $0000_2001
        Backup Sector................... $0006 = $0000_2006
        res............................. 00 00 00 00 00 00 00 00 00 00 00 00 
        Drive#.......................... 128
        Ext sig......................... $29 OK!
        Part Serial#.................... $50AD_0021 #1353515041
        Volume Name..................... P2 CARD    FAT32    ok
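
To put the SPEEDS samples next to the SPI RAM timings, the averages can be worked out as below. This is only a sketch over the eight samples printed in the report; the 4.17us random-byte figure is the SRC@ timing from earlier in the thread, and the constant and function names are invented for illustration.

```python
from statistics import mean

# Samples copied from the SPEEDS section of the .DISK report (microseconds).
LATENCY_US = [230, 376, 307, 306, 304, 305, 339, 310]
SECTOR_US = [415, 518, 504, 501, 500, 501, 537, 506]
SECTOR_BYTES = 512
PSRAM_RANDOM_BYTE_US = 4.17  # 1,001 cycles @240MHz, from the SRC@ timing


def single_sector_kb_per_s(sector_times_us):
    """Effective rate when every 512-byte sector pays the full setup latency."""
    return SECTOR_BYTES / mean(sector_times_us) * 1000.0  # bytes/us -> kB/s


if __name__ == "__main__":
    print(f"mean SD latency: {mean(LATENCY_US):.1f}us")
    print(f"mean single-sector time: {mean(SECTOR_US):.1f}us "
          f"({single_sector_kb_per_s(SECTOR_US):.0f}kB/s)")
    print(f"SD latency / SPI RAM random byte: {mean(LATENCY_US) / PSRAM_RANDOM_BYTE_US:.0f}x")
```

On these samples, isolated sector reads run at well under half the 2,519kB/s multi-block rate, which is where a RAM cache for random access would be expected to help.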
    

  • rogloh Posts: 5,122
    edited 2019-10-22 01:10
    @"Peter Jakacki", if you can get the first data returning within about 20us of a request, and then a sustained rate of over 11MB/s thereafter from your SPI RAM in QSPI mode, you could certainly get 4bpp VGA resolution graphics out of my DVI driver.

    All you'd actually need is a memory driver that matches my proposed memory driver spec in this thread linked below and it would just work, assuming you give the video COG some priority and limit or fragment the requests from the non video COGs accordingly to meet latency. If the entire line is back within 25us or so, the mouse sprite can be drawn over it too.

    http://forums.parallax.com/discussion/170645/proposed-external-hyperram-memory-interface-suitable-for-video-drivers-with-other-cogs#latest
  • jmg Posts: 15,140
    rogloh wrote: »
    if you can get the first data returning within about 20us of a request and then at a sustained rate of over 11MBps thereafter from your SPI RAM in QSPI mode you could certainly get 4bpp VGA resolution graphics out of my DVI driver.
    The LY68L6400 can probably average that, but the limited CS=Low impost means multiple re-address will be needed on every scan line.
    If simpler address-once bursts are needed, there is
    ISSI's IS62WVS5128GBLL, which is 45MHz QSPI 4MBit SRAM, (stocked at Mouser)
    and the new
    AS3016204 is 54MHz(DDR)/108MHz QSPI 16Mbit MRAM (also comes in 1Mb, 4Mb, 8Mb)
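
The "multiple re-address on every scan line" point can be estimated as follows. This is a sketch under stated assumptions: the 4us CS-low limit is the figure mentioned earlier in this thread and should be checked against the datasheet, and the 14-clock per-burst overhead (command, address and wait cycles in quad mode) is a hypothetical placeholder, as is the function name.

```python
import math


def bursts_needed(line_bytes, spi_clock_hz, cs_low_max_us, overhead_clocks, data_lines=4):
    """Return (bursts per scan line, data bytes per burst) for a CS-low time limit."""
    clocks_in_window = spi_clock_hz * cs_low_max_us // 1_000_000
    data_clocks = clocks_in_window - overhead_clocks
    if data_clocks <= 0:
        raise ValueError("CS-low window is too short for the command overhead")
    bytes_per_burst = data_clocks * data_lines // 8
    return math.ceil(line_bytes / bytes_per_burst), bytes_per_burst


if __name__ == "__main__":
    OVERHEAD = 14  # hypothetical: command + address + wait clocks per burst
    for bpp in (4, 8):
        line_bytes = 640 * bpp // 8
        bursts, per_burst = bursts_needed(line_bytes, 84_000_000, 4, OVERHEAD)
        print(f"{bpp}bpp: {line_bytes} bytes/line, {per_burst} bytes/burst, {bursts} bursts")
```

With these placeholder numbers a 640-pixel line at 4bpp needs two bursts and at 8bpp four, so the address has to be re-sent at least once per line either way.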

  • evanh Posts: 15,126
    edited 2019-10-29 15:44
    Here's the heart of using a streamer+smartpin for SPI data and clock respectively. This can transmit up to sysclock/2 burst rates, for any length, direct from hubram. I still have to look at handling receiving data as a burst like this but I don't see any particular obstacle to doing it.
    spi_tx_burst
    'setup streamer as SPI data transmit
    		rdfast	nonwait, dmaaddr		'prime the FIFO, don't wait on hubram slot
    		setword	dmamode, cycles, #0		'set streamer burst length
    		dirl	#CKPIN				'reset/realign the "base period" of SPI clock smartpin
    		dirh	#CKPIN				'  taking advantage of the smartpin's base period timing to allow
    							'  starting two independent pieces of hardware in sync and in phase
    		waitx	#CKCOMP				'phase compensation, also give FIFO filling time - min #2
    'start transmission
    		xinit	dmamode, #0			'tx the bits, 1-bit RFBYTE, big-endian
    	_ret_	wypin	cycles, #CKPIN			'start emulated SPI clock
    
    
    cycles		long	DMALEN
    dmamode		long	DM_01bRFbe | D_PGRP0_31 | (TXPIN<<17)
    dmaaddr		long	@id_byte
    nonwait		long	$8000_0000
    
    

    Constants in use
    CON
    	RXPIN		= 1
    	TXPIN		= RXPIN+1
    	CKPIN		= RXPIN+2
    
    	CKCOMP		= 2			' 2 5 3 7  phase compensation for aligning SPI clock and data pins
    	DMADIV		= 2			' 2 4 6 8  sysclocks per streamer cycle (sysclocks per SPI clock cycle)
    	DMALEN		= 8192			' streamer cycles
    
    	DM_01bRFbe	= (%1000 << 28)|(%1 << 16)	' 1-bit RFBYTE, big-endian
    	D_PGRP0_31	= (%1000 << 20)
    
    	P_REGD		= (%1 << 16)		' turn on clocked digital I/O (registered pins)
    	SP_OUT		= (%1 << 6)		' force on pin output when DIR operates smartpin
    	SPM_PULSES	= %00100_0 |SP_OUT		' pulse/cycle output
    
    

    And config code
    'setup SPI pins
    		wrpin	##SPM_PULSES | P_REGD, #CKPIN		'SPI clock out pin, pulse out, registered pin, Y = 0
    		wxpin	##((DMADIV/2)<<16) | DMADIV, #CKPIN	'pulse width (space->mark) and period respectively
    
    		wrpin	##P_REGD, #TXPIN		'streamer supplied SPI tx pin, registered pin
    		dirh	#TXPIN
    
    		setxfrq	##($4000_0000 / DMADIV)<<1	'set nominal streamer data rate
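
For reference, the constants fed to `wxpin` and `setxfrq` above can be evaluated for each `DMADIV` choice. The sketch below simply mirrors the two expressions from the config code; it assumes a streamer NCO value of $8000_0000 corresponds to one streamer cycle per sysclock and uses a 240MHz sysclk as an example, and the helper names are made up.

```python
def streamer_frq(dmadiv):
    """Mirror of the setxfrq operand: ($4000_0000 / DMADIV) << 1."""
    return (0x4000_0000 // dmadiv) << 1


def clock_pin_x(dmadiv):
    """Mirror of the wxpin operand: pulse width in the high word, period in the low word."""
    return ((dmadiv // 2) << 16) | dmadiv


def burst_time_us(cycles, dmadiv, sysclk_hz):
    """Duration of a burst of `cycles` streamer cycles at sysclk/DMADIV."""
    return cycles * dmadiv / sysclk_hz * 1e6


if __name__ == "__main__":
    SYSCLK_HZ = 240_000_000  # example value, not taken from the code above
    for dmadiv in (2, 4, 6, 8):
        print(f"DMADIV={dmadiv}: setxfrq=${streamer_frq(dmadiv):08X} "
              f"wxpin=${clock_pin_x(dmadiv):08X} "
              f"SPI clock={SYSCLK_HZ / dmadiv / 1e6:.1f}MHz "
              f"8192-cycle burst={burst_time_us(8192, dmadiv, SYSCLK_HZ):.1f}us")
```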
    
    
  • evanh Posts: 15,126
    edited 2019-10-29 13:23
    Huh, actually, after poo-poo'ing the idea in my first post and after doing all that above ... When transferring a block of data, there's nothing really stopping you from using 32-bit word size with synchronous serial smartpin mode. It would make sync serial mode notably faster with the cog dedicated to feeding just the tx smartpin in a tight loop.

    EDIT: Oh! Now I remember, the SPI clock is still the messy part. To get to sysclock/2 needs the streamer to generate the SPI clock. So the above code is a notably better solution.

    PS: Synchronous serial receive smartpin mode is, funnily, a lot easier to handle because it aligns nicely with whatever the external SPI clock is - like regular SPI hardware. With the prop2 as master, it can happily go to sysclock/2. And with it configured for 32-bit word size it'll be quite manageable for bursting a block without needing the streamer for anything.
  • evanh Posts: 15,126
    edited 2019-10-29 15:41
    Hmm, still need to rewrite those streamer constants, the pin groupings are a mess.

    Okay, here's QSPI transmit anyway. The heart is the same, just tweaked cycles and streamer width.
    spi_tx_burst
    'setup streamer as SPI data transmit
    		rdfast	nonwait, dmaaddr		'prime the FIFO, don't wait on hubram slot
    		setword	dmamode, cycles, #0		'set streamer burst length
    		dirl	#CKPIN				'reset/realign the "base period" of SPI clock smartpin
    		dirh	#CKPIN				'  taking advantage of the smartpin's base period timing to allow
    							'  starting two independent pieces of hardware in sync and in phase
    		waitx	#CKCOMP				'phase compensation, also give FIFO filling time - min #2
    'start transmission
    		xinit	dmamode, #0			'tx the bits, 4-bit RFBYTE, big-endian
    	_ret_	wypin	cycles, #CKPIN			'start emulated SPI clock
    
    
    cycles		long	(ddata - sdata) * 2		'nibbles
    dmamode		long	DM_04bRFbe | D_PGRP0_31 | (TXPIN<<17)
    dmaaddr		long	@id_byte
    nonwait		long	$8000_0000
    
    
    CON
    	CKPIN		= 3
    	TXPIN		= 4			'has to be a multiple of 4
    
    	DM_04bRFbe	= (%1010 << 28)|(%101 << 16)	' 4-bit RFBYTE, big-endian
    
    

    Config code
    'setup SPI pins
    		wrpin	##SPM_PULSES | P_REGD, #CKPIN		'SPI clock out pin, pulse out, registered pin, Y = 0
    		wxpin	##((DMADIV/2)<<16) | DMADIV, #CKPIN	'pulse width (space->mark) and period respectively
    
    		wrpin	##P_REGD, #TXPIN | (3<<6)		'streamer supplied 4-bit SPI tx pins, registered pins
    		dirh	#TXPIN | (3<<6)
    
    		setxfrq	##($4000_0000 / DMADIV)<<1	'set nominal streamer data rate
    
    
  • evanh Posts: 15,126
    edited 2019-11-02 03:00
    Peter,
    Using smartpins for rx works pretty well up to Dual SPI at least. The smartpin rx mode has a big advantage over the tx mode - clock and data are actually related, the SPI device has responded to the propeller produced clock, and both flow as external inputs in unison to the rx smartpins. See https://forums.parallax.com/discussion/comment/1480866/#Comment_1480866 However, there are a couple of limits if trying to go faster. Processing of Quad SPI or QPI will struggle to compete with that. Dual is close to maxing out the cog. And the clock pin will need to be physically in the middle of the data pins too for the B-input to function as SPI clock in the smartpins.

    Using a streamer eliminates both the processing overhead and the clock pin selection limit. But this still needs to be sorted out. It won't be as easy as the tx streamer methods.

    Tx at those speeds always needs the streamer involved. Either as demo'd above or as the SPI clock generator instead. An advantage of going with the clock generator method is it frees up the cog's hubram FIFO.

    EDIT: PS: There is a third way to use the streamer for tx. It would involve encoding the SPI clock as one of the parallel data bits within the data block to be DMA'd from hubram. This eliminates the phase compensation guesswork but obviously incurs the encoding overhead.
  • evanh Posts: 15,126
    edited 2019-11-02 02:21
    Quad SPI using rx smartpin would be doable as gapless if happy with sysclock/4 for the SPI clock. One decent advantage of this is the sysclock could then be wound to max overclocking without breaking the poor SPI memory device.

    I guess sysclock/3 is also an option. Not sure.
  • @evanh - Sounds like a plan, I will have to try it out for sure and let you know how it goes.