Roger, I'm looking for the number of latency clocks that are used to read the hyperRAM, including the registers. Where's that located in the sources?
It is stored in the per-bank information in the LUT. It is read from this line:
getbyte latency, pinconfig, #3 ' a b c d e | g get latency clock edges for this bank
You can set it differently with the setRamLatency() API if you want to experiment with it.
' method to set up HyperRAM device latency in the driver and in the memory device
' addr - identifies memory region containing the HyperRAM device to configure
' latency - latency clock value from 3-7
' returns 0 for success or negative error code
PUB setRamLatency(addr, latency) : r | newvalue, origvalue
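For example, something like this (sketch only - the $0 is just a placeholder for the external address of whichever bank you want to retune, and 4 is an experimental latency value):
  if setRamLatency($0, 4) < 0                   ' request a 4-clock latency for that bank
    debug("setRamLatency failed")               ' negative return value is an error code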
The default is 6, right? I'm struggling to make sense of what I empirically measured a long time ago. For some reason I have a latency of 11 in all my code. And, looking on the scope, that is what is happening too: 3 clocks for the CA phase, then 11 clocks for the latency phase, then the data appears. And this is true for the CR0 register as well, so it's not a case of zeros in the first RAM addresses.
It is 6, yes. You will actually see 14, which is 2+(2*6). The first 4 bytes of the address phase are not included in the latency count according to the data sheet timing diagrams. The last two bytes are, and because the RAM is using fixed latency, the latency itself gets doubled. Spelling that out: the first four CA bytes occupy 2 clocks that aren't counted, then the doubled fixed latency of 2*6 = 12 clocks runs from the clock carrying the last two CA bytes, so the data turns up 2 + 12 = 14 clocks after CS goes low - the same 14 you measured as 3 CA clocks plus 11 latency clocks.
That 4 clock latency shown is just a timing diagram example.
ISSI Data sheet has this:
The default value is 6 clocks, allowing for operation up to a maximum frequency of 166MHz prior to the host system setting a lower initial latency value that may be more optimal for the system.
I've been rebuilding my routines to use the streamer for all phases instead of leading with bit-bashing. Took quite a while to figure out why I couldn't make any headway. What I'd missed was the OUT bits weren't being cleared right at start so the streamer and OUT were mixing and corrupting the CA phase.
Just relying on the scope without a logic analyser meant I wasn't looking at every pin.
If you find a reliable way to trim down the streamer setup/clock control code, we can try to improve it and perhaps save an instruction or two in the path, though I know this is very tough to get 100% right for all possible combinations and I've spent a lot of time experimenting with the scope. What I have in there now is what I found works for odd or even starting or ending addresses, running properly at sysclk/1 and sysclk/2 and with registered/unregistered clocks and data (or at least it is meant to, unless I've somehow introduced a regression).
I just used this instruction at the start when I enable the data pins for the address phase, so any OUT bits being OR'd are driven to zero.
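i.e. something along the lines of the drvl over the whole data pin group in the send_ca_dma listing further down:
drvl    #hr_base | 7<<6         'drive all eight data pins low so stray OUT bits can't corrupt the CA phase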
Well, my first ambition seems partly doomed at least. I've tried to use a series of chained XCONTs to seamlessly join the phases together while staying correctly aligned with the independently generated clock from the smartpin. This worked beautifully for writing to hyperRAM, and came together quickly too, but for reading from hyperRAM I couldn't get it to gel. I am probably abusing the intent of the streamer hardware by trying to chain a streamer data output (hyperbus CA phase) with a streamer data input (hyperbus data read phase). There seems to be something in the streamer hardware that causes a timing granularity that I wasn't able to overcome.
I've opted now to use the simple XCONT chaining for burst writes. But for reads, I have a gap after the CA phase using a WAITX and a subsequent XINIT to start the burst read data phase. Notably, if this WAITX gap is preceded by a WAITXFI it has the exact same granularity issue, albeit correctable with a custom WAITX gap for each divider.
Yep, after attempting that back-to-back clock thing for reads some time ago too, I don't think it is easy for reads and you will have a gap to turn the bus around. The writes at sysclk/2 can certainly be done with continuous clocks though. Not sure about sysclk/1 writes, probably not if you choose to control RWDS the way I do, though there are likely other ways to do RWDS, with smartpin shift register output etc.
My best guess for the granularity problem is it's caused by a difference in the number of buffer stages between streaming in and streaming out. Err, no, can't be that. It's definitely a function of end-of-transfer detection.
I still bit-bash RWDS as I'm not yet using it for write masking.
Actually, RWDS is definitely a candidate for leaving its smartpin mode enabled all the time. If I'm not mistaken you are only reverting to bit-bashing to check the RWDS level during the CA phase. This test could be done via an input redirection to the CS pin, say. CS is never used as an input, so it can be initially configured and left for the duration of the driver's activities.
Yeah, redirection of the RWDS input pin to the CS input is a good idea and may help save some instructions during writes. It just means that RWDS and CS need to be located close together (which they are on the Parallax EVAL breakout board). Or maybe the CLK pin could also be used for that, as it is never read either (only its WXPIN and WYPIN settings are manipulated for timing control). That may suit the pinout of the upcoming EDGE board.
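Something along these lines should work - sketch only, assuming purely for illustration that RWDS sits one pin below CS (the P_MINUS1_A selector would need to match the real pinout):
wrpin   ##P_MINUS1_A, #ram_cs   'one-off init: route the pin below CS (RWDS here) into CS's IN bit
' ...later, during the CA phase of a write...
testp   #ram_cs         wc      'C = RWDS level, sampled via CS without disturbing the RWDS smartpin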
Have you tried using XCONT commands that don't really do anything (read pins, with w=0, so no WFBYTE occurs), but take the needed number of clocks to space things out?
Yep. For that attempt the CA and latency phases were being done using "imm 4 x 8" mode - %0110 dddd eppp 1110 - starting with an XINIT: four bytes arranged in the PA register for the command and address, followed by an immediate #0 XCONT to pace out the latency, followed by 8-bit Pins->WFBYTE mode - %1110 dddd wppp 1110 - as an XCONT for the burst read from the DRAM.
Here's a snippet of that final source for that attempt. I had added in an extra step to shimmy the streamer timing using SETXFRQ instruction. Way too much of a hack and wasn't saving instructions so I abandoned it at that point.
'------------------------------------------------------------------------------
read_block_dma
'read data from hyperRAM
setword lacfg, #34, #0 'doh! can't be used for compensation - Granularity is "dmadiv"
wrfast fastmask, ptra 'non-blocking
callpa #readram, #send_ca_dma 'block read command, includes padding clocks
setword rxcfg, hrbytes, #0 'max 64 kB per burst
waitx #dmadiv*4-4 'CA completion before tristating
dirl #hr_base | 7<<6 'tristate the HR databus
setxfrq fastmask 'sysclock/1, adds a small window for compensation
waitx comp
setxfrq xfrq 'set streamer back to read/write rate
xcont rxcfg, #0 'queue data phase
waitxfi 'wait for completion of DMA
outh #ram_cs
_ret_ rdfast #0, #0 'flush the FIFO
'------------------------------------------------------------------------------
send_ca_dma
'PA has 3-bit command
drvh #ram_cs 'ensure hyperRAM is deselected
wrpin hrckmode, #ram_ck
dirl #ram_ck 'mode is set first for steady pin drive
wxpin #dmadiv, #ram_ck 'HR clock step interval
drvl #ram_ck
fltl #ram_rwds
wrpin hrdatmode, #hr_base | 7<<6 'eight data pins registered (in and out)
drvl #hr_base | 7<<6 'set all data pins low
setxfrq xfrq 'set streamer transfer rate for read/write
andn hraddr, #%111 'address alignment of 16 byte increments
or pa, hraddr 'merge address with the three bits of command
ror pa, #3 'put command at top bits and truncate the bottom address bits
movbyts pa, #%%0123 'endian swap because streamer can only do sub-byte endian swapping
mov pb, hrbytes
add pb, #6+22 'clock steps for fixed latency added to data length
outl #ram_cs 'begin "Command-Address" phase
xinit cacfg, pa 'kick the streamer off for CA (command and address) phase
wypin pb, #ram_ck 'initial clock steps for CA phase
_ret_ xcont lacfg, #0 'remaining two bytes of CA phase, currently nulls, plus "latency" spacers
txcfg long DM_8bRF | DM_DIGI_IO | (hr_base << 17) | bytes ' DMA cycles (RFBYTE), pins "hr_base"
rxcfg long DM_8bWF | DM_DIGI_IO | (hr_base << 17) | bytes ' DMA cycles (WFBYTE), pins "hr_base"
cacfg long DM_8bIMM | DM_DIGI_IO | (hr_base << 17) | 4
lacfg long DM_8bIMM | DM_DIGI_IO | (hr_base << 17) | (2+22)
fastmask long $8000_0000 'Makes RDFAST and WRFAST non-blocking instructions
xfrq long ($4000_0000 / dmadiv)<<1 'SETXFRQ parameter, bit-shift is compensation to keep it unsigned
#ifdef CK_REGD
hrckmode long P_REGD | SPM_STEPS 'mode config for HR clock pin
#else
hrckmode long SPM_STEPS 'mode config for HR clock pin
#endif
#ifdef D_REGD
hrdatmode long P_REGD 'mode config for data pin registers (in and out)
#else
hrdatmode long 0
#endif
Looking at the current critical part of the read code in my driver, I wonder: are there any instruction candidates for removal? I'm already doing work in between streamer instructions during the address phase (at sysclk/2), where the instructions would otherwise be unused. How much time can be shaved without losing the capability of operating at sysclk/2 or sysclk/1 and with registered clock/data pin settings? Can any of the code after the waitxfi instruction be set up in advance, before the waitxfi, while we might still be idle waiting for the latency period to complete before the data phase begins? Can a new setxfrq happen while the streamer is still active, and likewise wxpin/wypin while it is still clocking, or do you need to wait first?
setxfrq xfreq2 'setup streamer frequency (sysclk/2)
waitx clkdelay 'odd delay shifts clock phase from data
xinit ximm4, addrhi 'send 4 bytes of addrhi data
wypin clks, clkpin 'start memory clock output
testb c, #0 wz 'test special odd transfer case
mov clks, c 'reset clock count to byte count
xcont ximm, addrlo 'send 2 bytes of addrlo
if_c_ne_z add clks, #1 'extra clock to go back to low state
waitx #2 'delay long enough for DATA bus transfer to complete
fltl datapins 'tri-state DATA bus
waitxfi 'wait for address phase+latency to complete
p1 wxpin #2, clkpin 'adjust transition delay to # clocks
p2 setxfrq xfreq2 'setup streamer frequency
wypin clks, clkpin 'setup number of transfer clocks
wrpin regdatabus, datapins 'setup data bus inputs as registered or not
waitx delay 'tuning delay for input data reading
xinit xrecv, #0 'start data transfer and then jump to setup code
Certainly don't need both a WAITX and a WAITXFI. I'd be inclined to remove the WAITXFI. That instruction created an undesirable granularity for me. You're probably relying on it at this stage so timing will be different without it.
Also, xfreq2 seems to be used twice in a row. You can probably remove the second one. And that'll apply to the WXPIN #2 on the clock pin as well.
Well, I patch that setxfrq with either the sysclk/1 or sysclk/2 setting for the data phase, which can differ from the address phase (always sysclk/2), so I sort of need it. That's why I have the p2 label there. In most of my code, if you see a label name of the form pNN it is dynamically patchable at code startup time. p1 patches the wxpin value for the same sysclk/1 vs sysclk/2 reason.
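For instance, the startup patching can be as simple as rewriting the instruction fields in cog RAM before the driver runs - a sketch only, with xfreq1 as a hypothetical register holding the sysclk/1 streamer setting:
setd    p2, #xfreq1     'repoint "setxfrq xfreq2" at the sysclk/1 value instead
setd    p1, #1          'patch the "wxpin #2, clkpin" immediate to suit the sysclk/1 case (value illustrative)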
The problem if I try to somehow combine the waitx #2 and the waitxfi is that the actual total wait length is a function of the address-phase latency, which can vary per bank and therefore needs to be dynamically computed, which is more overhead. If I could do some useful work during the waitx #2 time (two instructions possible there) I would. I also do not want to keep the data pins driven any longer than they should be during the latency phase.
I wouldn't be concerned with a late tristating of the hyperbus. There is a lot of spare time before the data phase starts. You could ditch the WAITX and move the WAITXFI in its place.
Yeah, agreed there is some more time there; the point of that waitx #2 was more about not shutting it off early. I still wouldn't want to wait until the latency portion completes though: the specified tDQLZ time (the time after a clock edge before the device may start driving the data bus) is 0 ns from the final rising edge within the latency period.
If I get a chance I might take another look at more optimizations to shave a few clocks. It's one of those things that's quite tedious to test in my setup. You find you might improve one case, yet break another, etc.
Since that failed attempt at seamlessly streaming the CA to read data, I've reverted to not pacing the latency with the streamer. It's just the clock smartpin only for the latency phase now. Which is how it was when I was bit-bashing.
As you say, it requires a timing calculation though. In my case it generates some compile time constants. You could do similar in that you only have the two speeds anyway.
So, use a WAITX or two and drop the WAITXFI.
Here's what I'm using right now:
- "comp" is the compensation columns in my reports
- "dmadiv" is the sysclock divider constant
read_block_dma
'read data from hyperRAM
callpa #readram, #send_ca_dma 'block read command, includes padding clocks
wrfast fastmask, ptra 'non-blocking
setword rxcfg, hrbytes, #0 'set the streamer burst length, max 64 kB
waitx #dmadiv*4-4 'pause for CA phase to complete
dirl #hr_base | 7<<6 'tristate the HR databus
mov pa, comp
add pa, #dmadiv*25-12
waitx pa
xinit rxcfg, #0 'queue data phase
...
And note that in send_ca_dma there are only four non-setup instructions - right at the end:
send_ca_dma
'PA has 3-bit command
drvh #ram_cs 'ensure hyperRAM is deselected
wrpin hrckmode, #ram_ck
dirl #ram_ck 'mode is set first for steady pin drive
wxpin #dmadiv, #ram_ck 'HR clock step interval
drvl #ram_ck
fltl #ram_rwds
wrpin hrdatmode, #hr_base | 7<<6 'eight data pins registered (in and out)
drvl #hr_base | 7<<6 'set all data pins low
setxfrq xfrq 'set streamer transfer rate for read/write
andn hraddr, #%111 'address alignment of 16 byte increments
or pa, hraddr 'merge address with the three bits of command
ror pa, #3 'put command at top bits
movbyts pa, #%%0123 'endian swap because streamer can only do sub-byte endian swapping
mov pb, hrbytes 'clock steps for data phase
add pb, #6+22 'clock steps for CA and fixed latency added to data length
outl #ram_cs 'begin "Command-Address" phase
xinit cacfg, pa 'kick the streamer off for CA (command and address) phase
wypin pb, #ram_ck 'clock go!
_ret_ xcont cacfg, #0 '4 nil bytes, for remaining CA phase, needed to manage RWDS/databus transition
Yeah evanh, it's actually kind of tricky to compare our two sequences (mine and yours) as they have different capabilities. So I was looking at my code and discounting the instructions that relate to the extra features I have (per-bank setup and latency, RWDS sampling, odd/even byte handling) to see how the parts we have in common compare cycle-wise and what gains might still be possible.
Your code looks like it should be several fewer instructions, but I can't totally figure out exactly by how much yet, and some of it is waitx stuff too, which will vary the actual execution timing.
I do seem to burn extra cycles setting up clock phase alignment for the clock smartpin output etc., though maybe that can't be helped if we need the flexibility of supporting different sysclk rates.
Steps needed for read that we both have in common (not necessarily in perfect order):
drive CS low
read command + address setup
address byte reversal
setxfrq for address phase
driving data pins
clock pin setup with wxpin, wypin
address phase sending 6 bytes (split over two streamer commands)
latency phase delay timing
float the data pins
read timing delay for operating frequency
setxfrq for data phase
fifo setup
data phase streaming
wait for end of transfer
CS high
Extra things I currently do:
I compute latency dynamically based on RWDS pin state (because HyperFlash & HyperRAM differ)
I resync the clock Smartpin phase on each transfer phase to mix sysclk/2 address phase with either rate data phase
I handle odd or even length transfer sizes with any address alignment.
I handle registered/unregistered data bus and clock pin settings.
If I just count up the instructions from my CS low to the equivalent xinit "queue data phase" instruction in my code, I get 31 instructions while yours is 28, or effectively 30 if you account for the callpa and ret overhead. But it's not a perfect comparison, and there are still a few extra things I do outside the CS-low time as well that are unaccounted for.
My next aim was to dump the alternate 100% bit-bashed routines, in this case used for writing procedurally generated data, so that all activity on the data pins is via the streamer and all clocking is via the smartpin. Then I can move a decent chunk of the init code out of the two send_ca routines because it'll only need to be set once. At least that's the theory. I certainly want to prove it can work.
Right now I don't. I pass #0 for D on both RDFAST and WRFAST, and the FIFO is only used for the data phase portion in my case. In my code it's actually set up quite a bit in advance of the streamer stuff, probably at least 40 instructions prior, and it would be full by the time the streamer command needs it, so a non-blocking RDFAST could perhaps save some cycles in the write burst setup. Good idea for another optimization, thanks Chip. I'll just need to get a long for the #$80000000 constant.
Maybe you could just use BITH reg,#31 to set the MSB.
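For example (sketch only - pa is just whichever scratch register is about to be handed to RDFAST/WRFAST):
mov     pa, #0          'block count of zero = no wrapping
bith    pa, #31         'set D[31] so RDFAST/WRFAST won't block
wrfast  pa, ptra        'non-blocking FIFO setup, no $8000_0000 long needed in cog RAM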