Roger, I'm looking for the number of latency clocks that are used to read the hyperRAM, including the registers. Where's that located in the sources?
It is stored in the per-bank information in the LUT. It is read from this line:
getbyte latency, pinconfig, #3 ' a b c d e | g get latency clock edges for this bank
You can set it differently with the setRamLatency() API if you want to experiment with it.
' method to set up HyperRAM device latency in the driver and in the memory device
' addr - identifies memory region containing the HyperRAM device to configure
' latency - latency clock value from 3-7
' returns 0 for success or negative error code
PUB setRamLatency(addr, latency) : r | newvalue, origvalue
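For example, something like this (sketch only - the $0 is just a placeholder for the external address of whichever bank you want to retune, and 4 is an experimental latency value):
  if setRamLatency($0, 4) < 0                   ' request a 4-clock latency for that bank
    debug("setRamLatency failed")               ' negative return value is an error code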
The default is 6, right? I'm struggling to make sense of what I empirically measured a long time ago. For some reason I have a latency of 11 in all my code. And, looking on the scope, that is what is happening too: 3 clocks for the CA phase, then 11 clocks for the latency phase, then the data appears. And this is true for the CR0 register as well, so it's not a case of zeros in the first RAM addresses.
It is 6, yes. You will actually see 14, which is 2+(2*6). The first 4 bytes of the address phase are not included in the latency count according to the data sheet timing diagrams. The last two bytes are, and because the RAM is using fixed latency, the latency itself gets doubled. Spelling that out: the first four CA bytes occupy 2 clocks that aren't counted, then the doubled fixed latency of 2*6 = 12 clocks runs from the clock carrying the last two CA bytes, so the data turns up 2 + 12 = 14 clocks after CS goes low - the same 14 you measured as 3 CA clocks plus 11 latency clocks.
That 4 clock latency shown is just a timing diagram example.
ISSI Data sheet has this:
The default value is 6 clocks, allowing for operation up to a maximum frequency of 166MHz prior to the host system setting a lower initial latency value that may be more optimal for the system.
I've been rebuilding my routines to use the streamer for all phases instead of leading with bit-bashing. Took quite a while to figure out why I couldn't make any headway. What I'd missed was the OUT bits weren't being cleared right at start so the streamer and OUT were mixing and corrupting the CA phase.
Just relying on the scope without a logic analyser meant I wasn't looking at every pin.
If you find a reliable way to trim down the streamer setup/clock control code, we can try to improve it and perhaps save an instruction or two in the path, though I know this is very tough to get 100% right for all possible combinations and I've spent a lot of time experimenting with the scope. What I have in there now is what I found works for odd or even starting or ending addresses, running properly at sysclk/1 and sysclk/2 and with registered/unregistered clocks and data (or at least it is meant to, unless I've somehow introduced a regression).
I just used this instruction at the start when I enable the data pins for the address phase, so any OUT bits being OR'd are driven to zero.
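i.e. something along the lines of the drvl over the whole data pin group in the send_ca_dma listing further down:
drvl    #hr_base | 7<<6         'drive all eight data pins low so stray OUT bits can't corrupt the CA phase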
Well, my first ambition seems partly doomed at least. I've tried to use a series of chained XCONTs to seamlessly join the phases together while staying correctly aligned with the independently generated clock from the smartpin. This worked beautifully for writing to hyperRAM, and came together quickly too, but for reading from hyperRAM I couldn't get it to gel. I am probably abusing the intent of the streamer hardware by trying to chain a streamer data output (hyperbus CA phase) with a streamer data input (hyperbus data read phase). There seems to be something in the streamer hardware that causes a timing granularity that I wasn't able to overcome.
I've opted now to use the simple XCONT chaining for burst writes. But for reads, I have a gap after the CA phase using a WAITX and a subsequent XINIT to start the burst read data phase. Notably, if this WAITX gap is preceded by a WAITXFI it has the exact same granularity issue, albeit correctable with a custom WAITX gap for each divider.
Yep, after attempting that back-to-back clock thing for reads some time ago too, I don't think it is easy for reads and you will have a gap to turn the bus around. The writes at sysclk/2 can certainly be done with continuous clocks though. Not sure about sysclk/1 writes, probably not if you choose to control RWDS the way I do, though there are likely other ways to do RWDS, with smartpin shift register output etc.
My best guess for the granularity problem is it's caused by a difference in the number of buffer stages between streaming in and streaming out. Err, no, can't be that. It's definitely a function of end-of-transfer detection.
I still bit-bash RWDS as I'm not yet using it for write masking.
Actually, RWDS is definitely a candidate for leaving its smartpin mode enabled all the time. If I'm not mistaken you are only reverting to bit-bashing to check the RWDS level during the CA phase. This test could be done via an input redirection to the CS pin, say. CS is never used as an input, so it can be initially configured and left for the duration of the driver's activities.
Yeah, redirection of the RWDS input pin to the CS input is a good idea and may help save some instructions during writes. It just means that RWDS and CS need to be located close together (which they are on the Parallax EVAL breakout board). Or maybe the CLK pin could also be used for that, as it is never read either (only its WXPIN and WYPIN settings are manipulated for timing control). That may suit the pinout of the upcoming EDGE board.
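Something along these lines should work - sketch only, assuming purely for illustration that RWDS sits one pin below CS (the P_MINUS1_A selector would need to match the real pinout):
wrpin   ##P_MINUS1_A, #ram_cs   'one-off init: route the pin below CS (RWDS here) into CS's IN bit
' ...later, during the CA phase of a write...
testp   #ram_cs         wc      'C = RWDS level, sampled via CS without disturbing the RWDS smartpin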
Have you tried using XCONT commands that don't really do anything (read pins, with w=0, so no WFBYTE occurs), but take the needed number of clocks to space things out?
Yep. For that attempt the CA and latency phases were being done using "imm 4 x 8" mode - %0110 dddd eppp 1110 - starting with an XINIT: four bytes arranged in the PA register for the command and address, followed by an immediate #0 XCONT to pace out the latency, followed by 8-bit Pins->WFBYTE mode - %1110 dddd wppp 1110 - as an XCONT for the burst read from the DRAM.
Here's a snippet of that final source for that attempt. I had added in an extra step to shimmy the streamer timing using SETXFRQ instruction. Way too much of a hack and wasn't saving instructions so I abandoned it at that point.
'------------------------------------------------------------------------------
read_block_dma
'read data from hyperRAM
setword lacfg, #34, #0 'doh! can't be used for compensation - Granularity is "dmadiv"
wrfast fastmask, ptra 'non-blocking
callpa #readram, #send_ca_dma 'block read command, includes padding clocks
setword rxcfg, hrbytes, #0 'max 64 kB per burst
waitx #dmadiv*4-4 'CA completion before tristating
dirl #hr_base | 7<<6 'tristate the HR databus
setxfrq fastmask 'sysclock/1, adds a small window for compensation
waitx comp
setxfrq xfrq 'set streamer back to read/write rate
xcont rxcfg, #0 'queue data phase
waitxfi 'wait for completion of DMA
outh #ram_cs
_ret_ rdfast #0, #0 'flush the FIFO
'------------------------------------------------------------------------------
send_ca_dma
'PA has 3-bit command
drvh #ram_cs 'ensure hyperRAM is deselected
wrpin hrckmode, #ram_ck
dirl #ram_ck 'mode is set first for steady pin drive
wxpin #dmadiv, #ram_ck 'HR clock step interval
drvl #ram_ck
fltl #ram_rwds
wrpin hrdatmode, #hr_base | 7<<6 'eight data pins registered (in and out)
drvl #hr_base | 7<<6 'set all data pins low
setxfrq xfrq 'set streamer transfer rate for read/write
andn hraddr, #%111 'address alignment of 16 byte increments
or pa, hraddr 'merge address with the three bits of command
ror pa, #3 'put command at top bits and truncate the bottom address bits
movbyts pa, #%%0123 'endian swap because streamer can only do sub-byte endian swapping
mov pb, hrbytes
add pb, #6+22 'clock steps for fixed latency added to data length
outl #ram_cs 'begin "Command-Address" phase
xinit cacfg, pa 'kick the streamer off for CA (command and address) phase
wypin pb, #ram_ck 'initial clock steps for CA phase
_ret_ xcont lacfg, #0 'remaining two bytes of CA phase, currently nulls, plus "latency" spacers
txcfg long DM_8bRF | DM_DIGI_IO | (hr_base << 17) | bytes ' DMA cycles (RFBYTE), pins "hr_base"
rxcfg long DM_8bWF | DM_DIGI_IO | (hr_base << 17) | bytes ' DMA cycles (WFBYTE), pins "hr_base"
cacfg long DM_8bIMM | DM_DIGI_IO | (hr_base << 17) | 4
lacfg long DM_8bIMM | DM_DIGI_IO | (hr_base << 17) | (2+22)
fastmask long $8000_0000 'Makes RDFAST and WRFAST non-blocking instructions
xfrq long ($4000_0000 / dmadiv)<<1 'SETXFRQ parameter, bit-shift is compensation to keep it unsigned
#ifdef CK_REGD
hrckmode long P_REGD | SPM_STEPS 'mode config for HR clock pin
#else
hrckmode long SPM_STEPS 'mode config for HR clock pin
#endif
#ifdef D_REGD
hrdatmode long P_REGD 'mode config for data pin registers (in and out)
#else
hrdatmode long 0
#endif
Looking at the current critical part of the read code in my driver, I wonder: are there any instruction candidates for removal? I'm already doing work in between streamer instructions during the address phase (at sysclk/2), where the instructions would otherwise be unused. How much time can be shaved without losing the capability of operating at sysclk/2 or sysclk/1 and with registered clock/data pin settings? Can any of the code after the waitxfi instruction be set up in advance, before the waitxfi, while we might still be idle waiting for the latency period to complete before the data phase begins? Can a new setxfrq happen while the streamer is still active, and likewise wxpin/wypin while it is still clocking, or do you need to wait first?
setxfrq xfreq2 'setup streamer frequency (sysclk/2)
waitx clkdelay 'odd delay shifts clock phase from data
xinit ximm4, addrhi 'send 4 bytes of addrhi data
wypin clks, clkpin 'start memory clock output
testb c, #0 wz 'test special odd transfer case
mov clks, c 'reset clock count to byte count
xcont ximm, addrlo 'send 2 bytes of addrlo
if_c_ne_z add clks, #1 'extra clock to go back to low state
waitx #2 'delay long enough for DATA bus transfer to complete
fltl datapins 'tri-state DATA bus
waitxfi 'wait for address phase+latency to complete
p1 wxpin #2, clkpin 'adjust transition delay to # clocks
p2 setxfrq xfreq2 'setup streamer frequency
wypin clks, clkpin 'setup number of transfer clocks
wrpin regdatabus, datapins 'setup data bus inputs as registered or not
waitx delay 'tuning delay for input data reading
xinit xrecv, #0 'start data transfer and then jump to setup code
Certainly don't need both a WAITX and a WAITXFI. I'd be inclined to remove the WAITXFI. That instruction created an undesirable granularity for me. You're probably relying on it at this stage so timing will be different without it.
Also, xfreq2 seems to be used twice in a row. You can probably remove the second one. And that'll apply to the WXPIN #2 on the clock pin as well.
Well, I patch that setxfrq with either the sysclk/1 or sysclk/2 setting for the data phase, which can differ from the address phase (always sysclk/2), so I sort of need it. That's why I have the p2 label there. In most of my code, if you see a label name of the form pNN it is dynamically patchable at code startup time. p1 patches the wxpin value for the same sysclk/1 vs sysclk/2 reason.
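For instance, the startup patching can be as simple as rewriting the instruction fields in cog RAM before the driver runs - a sketch only, with xfreq1 as a hypothetical register holding the sysclk/1 streamer setting:
setd    p2, #xfreq1     'repoint "setxfrq xfreq2" at the sysclk/1 value instead
setd    p1, #1          'patch the "wxpin #2, clkpin" immediate to suit the sysclk/1 case (value illustrative)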
The problem if I try to somehow combine the waitx #2 and the waitxfi is that the actual total wait length is a function of the address-phase latency, which can vary per bank and therefore needs to be dynamically computed, which is more overhead. If I could do some useful work during the waitx #2 time (two instructions possible there) I would. I also do not want to keep the data pins driven any longer than they should be during the latency phase.
I wouldn't be concerned with a late tristating of the hyperbus. There is a lot of spare time before the data phase starts. You could ditch the WAITX and move the WAITXFI in its place.
Yeah, agreed there is some more time there; the point of that waitx #2 was more about not shutting it off early. I still wouldn't want to wait until the latency portion completes though: the specified tDQLZ time (the time after a clock edge before the device may start driving the data bus) is 0 ns from the final rising edge within the latency period.
If I get a chance I might take another look at more optimizations to shave a few clocks. It's one of those things that's quite tedious to test in my setup. You find you might improve one case, yet break another, etc.
Since that failed attempt at seamlessly streaming the CA to read data, I've reverted to not pacing the latency with the streamer. It's just the clock smartpin only for the latency phase now. Which is how it was when I was bit-bashing.
As you say, it requires a timing calculation though. In my case it generates some compile time constants. You could do similar in that you only have the two speeds anyway.
So, use a WAITX or two and drop the WAITXFI.
Here's what I'm using right now:
- "comp" is the compensation columns in my reports
- "dmadiv" is the sysclock divider constant
read_block_dma
'read data from hyperRAM
callpa #readram, #send_ca_dma 'block read command, includes padding clocks
wrfast fastmask, ptra 'non-blocking
setword rxcfg, hrbytes, #0 'set the streamer burst length, max 64 kB
waitx #dmadiv*4-4 'pause for CA phase to complete
dirl #hr_base | 7<<6 'tristate the HR databus
mov pa, comp
add pa, #dmadiv*25-12
waitx pa
xinit rxcfg, #0 'queue data phase
...
And note that in send_ca_dma there are only four non-setup instructions - right at the end:
send_ca_dma
'PA has 3-bit command
drvh #ram_cs 'ensure hyperRAM is deselected
wrpin hrckmode, #ram_ck
dirl #ram_ck 'mode is set first for steady pin drive
wxpin #dmadiv, #ram_ck 'HR clock step interval
drvl #ram_ck
fltl #ram_rwds
wrpin hrdatmode, #hr_base | 7<<6 'eight data pins registered (in and out)
drvl #hr_base | 7<<6 'set all data pins low
setxfrq xfrq 'set streamer transfer rate for read/write
andn hraddr, #%111 'address alignment of 16 byte increments
or pa, hraddr 'merge address with the three bits of command
ror pa, #3 'put command at top bits
movbyts pa, #%%0123 'endian swap because streamer can only do sub-byte endian swapping
mov pb, hrbytes 'clock steps for data phase
add pb, #6+22 'clock steps for CA and fixed latency added to data length
outl #ram_cs 'begin "Command-Address" phase
xinit cacfg, pa 'kick the streamer off for CA (command and address) phase
wypin pb, #ram_ck 'clock go!
_ret_ xcont cacfg, #0 '4 nil bytes, for remaining CA phase, needed to manage RWDS/databus transition
Yeah evanh, it's actually kind of tricky to compare our two sequences (mine and yours) as they have different capabilities. So I was looking at my code and discounting the instructions that relate to the extra features I have (per-bank setup and latency, RWDS sampling, odd/even byte handling) to see how the parts we have in common compare cycle-wise and what gains might still be possible.
Your code looks like it should be several fewer instructions, but I can't totally figure out exactly by how much yet, and some of it is waitx stuff too, which will vary the actual execution timing.
I do seem to burn extra cycles setting up clock phase alignment for the clock smartpin output etc., though maybe that can't be helped if we need the flexibility of supporting different sysclk rates.
Steps needed for read that we both have in common (not necessarily in perfect order):
drive CS low
read command + address setup
address byte reversal
setxfrq for address phase
driving data pins
clock pin setup with wxpin, wypin
address phase sending 6 bytes (split over two streamer commands)
latency phase delay timing
float the data pins
read timing delay for operating frequency
setxfrq for data phase
fifo setup
data phase streaming
wait for end of transfer
CS high
Extra things I currently do:
I compute latency dynamically based on RWDS pin state (because HyperFlash & HyperRAM differ)
I resync the clock Smartpin phase on each transfer phase to mix sysclk/2 address phase with either rate data phase
I handle odd or even length transfer sizes with any address alignment.
I handle registered/unregistered data bus and clock pin settings.
If I just count up the instructions from my CS low to the equivalent xinit "queue data phase" instruction in my code, I get 31 instructions while yours is 28, or effectively 30 if you account for the callpa and ret overhead. But it's not a perfect comparison, and there are still a few extra things I do outside the CS-low time as well that are unaccounted for.
My next aim was to dump the alternate 100% bit-bashed routines, in this case used for writing procedurally generated data, so that all activity on the data pins is via the streamer and all clocking is via the smartpin. Then I can move a decent chunk of the init code out of the two send_ca routines because it'll only need to be set once. At least that's the theory. I certainly want to prove it can work.
Right now I don't. I pass #0 for D on both RDFAST and WRFAST, and the FIFO is only used for the data phase portion in my case. In my code it's actually set up quite a bit in advance of the streamer stuff, probably at least 40 instructions prior, and it would be full by the time the streamer command needs it, so a non-blocking RDFAST could perhaps save some cycles in the write burst setup. Good idea for another optimization, thanks Chip. I'll just need to get a long for the #$80000000 constant.
Maybe you could just use BITH reg,#31 to set the MSB.
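For example (sketch only - pa is just whichever scratch register is about to be handed to RDFAST/WRFAST):
mov     pa, #0          'block count of zero = no wrapping
bith    pa, #31         'set D[31] so RDFAST/WRFAST won't block
wrfast  pa, ptra        'non-blocking FIFO setup, no $8000_0000 long needed in cog RAM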