HyperRAM driver for P2

Comments

  • Oh man, seeing your last edited post @evanh brings back memories of me fighting with bus timing and all the different cases. That's one of the reasons why, once I finally got it working, I chose to keep it as it is and put my efforts into all the other features, even though there might still be ways to shave a few cycles somehow. At least we have a working baseline at this point. If we find things that work in all cases and speed it up a bit, we can try to optimise there, but getting it working fully was definitely my first priority.
  • :) Yeah, once I saw how big you'd gone, I stopped looking and went back to refining what I already had. Prove the ideas here first, so to speak.

    Got another evening shift at work ahead of me, better get ready for that now.

  • evanh Posts: 9,983
    edited 2020-10-14 - 22:23:06
    Doh! I'd jumped to conclusions. The clock registering transition during latency phase isn't causing any problem at all. I hadn't paid close attention to what you'd done with the lsb of the compensation delay value. I just woke up to the fact I'm meant to be switching registering of the data pins, not the clock pin.

    I'd assumed the data pins had to stay registered for signal integrity, therefore it must be the clock pin being switched. Rerunning your "delaytest" program confirmed my dawning suspicion that I was chasing the wrong problem.

    Fixing ... Done already! :)

  • Now that you have included the registered/unregistered bit in your compensation as part of the delay like I used, your new results table shows the overlap intervals fading in and out over the transition bands quite nicely. Is it really still operating at 346 and 375MHz? Was this for reads or writes?

    We can see that pathway through the frequency range with zero bit errors in your chart. Imagine if we could also visualize how this path shifts and closes off with temperature at some frequencies with some type of animation, that would be quite cool.
  • It's aimed at reads but the writes are also using the streamer now too. Pretty much mimicked your arrangement of using sysclock/2 for the writes then using the HR_DIV (dmadiv) for reads. Yep, those top bands seem to be real enough. Used the unmodified hyper accessory board for that run.

    Here's the same again but with 10 pF capacitor on the HR clock pin. The bands shift down a little but surprisingly haven't shrunk in size.
  • And one of the main aims was to easily integrate HR writes at sysclock/1. Here's the board with the 10 pF capacitor again but with both reads and writes (including CA phase) operating at sysclock/1.

    The notable change, unsurprisingly, is those two upper bands have sporadic single bit errors.
  • Any chance you might modify your test so it focusses on a single frequency in the middle like 252MHz and then run it for much longer duration, logging results while you ramp the temp up/down from ~ 20C--> 45C etc? If we start to see bit errors in the middle of normal bands that might potentially help explain what I observed with video (or that may still be something else).
  • Hmm, can't blame temperature I don't think. Here's a 70 °C run using the unmodified hyper board, writes at sysclock/2 (all registered pins), and reads at sysclock/1. It's still got some room in the band.

    PS: I tried to up my starting XMUL but found something has a bug when starting above roughly 90 ...
  • evanh Posts: 9,983
    edited 2020-10-15 - 00:53:31
    Oops, I'm using pin drive setting of 7 (19 ohms) in CR0. That might shift the bands ...

    This run is with board temperature around 26 °C and pin drive setting of 0 (34 ohms). Writes at sysclock/2. Ha, it's wiped out those two top bands.

    EDIT: Second run (report6.txt) is same again but above 70 °C ambient temperature. 252 MHz is borderline this time.
  • rogloh Posts: 2,777
    edited 2020-10-15 - 01:08:38
    Yeah I always thought you'd be pushing your luck attempting sysclk/1 and 375MHz ! LOL.

    That's 375MB/s instead of the rated 200MB/s.
  • rogloh Posts: 2,777
    edited 2020-10-15 - 01:09:42
    If you are seeing bit errors I wonder if the internal self-heating could get it up to around 70C in room temp ambient?

    The thing is that data errors should be rather random, but I seem to see it at the same offset on the video line - perhaps the particular colour transition on the data bus pushes the IO signal integrity just that little bit further to trigger it?

    I take it that sysclk/2 is okay. We can at least still fall back to that when required...
  • evanh Posts: 9,983
    edited 2020-10-15 - 01:14:22
    rogloh wrote: »
    If you are seeing bit errors I wonder if the internal self-heating could get it up to around 70C in room temp ambient?
    Doubt it. You're not doing enough to beat the good heat sinking of the Eval Board.

    Try adjusting the drive strength and see if the video errors are suppressed.

  • Yeah I might do that. I am just using its default but it can be reduced for tweaking.

    000 - 34 ohms (default)
    001 - 115 ohms
    010 - 67 ohms
    011 - 46 ohms
    100 - 34 ohms
    101 - 27 ohms
    110 - 22 ohms
    111 - 19 ohms
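
As a rough illustration of the table above, here is a small host-side Python sketch that decodes the drive strength from a CR0 value. It assumes the drive strength field occupies CR0 bits 14..12, as in the Cypress/Infineon HyperRAM datasheets; check this against your own part. The names `DRIVE_OHMS` and `cr0_drive_ohms` are made up for the example and are not part of either driver.

```python
# Assumption: drive strength is CR0[14:12]; verify against your datasheet.
# Values are taken from the table in the post above.
DRIVE_OHMS = {
    0b000: 34,   # default
    0b001: 115,
    0b010: 67,
    0b011: 46,
    0b100: 34,
    0b101: 27,
    0b110: 22,
    0b111: 19,
}

def cr0_drive_ohms(cr0: int) -> int:
    """Return the output impedance (ohms) selected by a 16-bit CR0 value."""
    return DRIVE_OHMS[(cr0 >> 12) & 0b111]

if __name__ == "__main__":
    for value in (0x8F1F, 0x9F1F, 0xFF1F):
        print(f"CR0=${value:04X} -> {cr0_drive_ohms(value)} ohms")
```

Under that assumption, a driver setting of 7 maps to 19 ohms and a setting of 1 maps to the problematic 115 ohms.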
  • Tubular Posts: 4,100
    edited 2020-10-15 - 02:51:54
    Hmm, those drive values might be from the FPGA. Never mind, we're talking HyperRAM drive strength, not the P2's.
  • rogloh Posts: 2,777
    edited 2020-10-15 - 04:48:39
    Wow, I think I've found the graphics issue, thanks for mentioning this @evanh! :smile:

    Somehow after my initialization I am reading back $FF9F from CR0 on die0 and die1 and I see the noise issue at 252MHz with sysclk/1. However if I write $8F1F to this register or any other value that is not 115 ohms, the problem goes away. I don't know why this is but it seems to be off by a byte (maybe the first byte read/written after startup gets corrupted), maybe the latency setting is off by one, or it's a chicken/egg thing with not using the correct latency before I modify the latency. But I know once this register is set correctly it works. Still digging into the code path that causes it. It's probably just an initialisation sequence bug.
  • rogloh Posts: 2,777
    edited 2020-10-15 - 05:23:58
    Ok, so I think I might have figured out this weird register stuff, and it isn't entirely what I expected but more a system level thing.

    In my video demo I found I am changing the PLL to a higher speed while the HyperRAM driver is still initialising. This means that its internal wait after device reset could be shorter than it should be, causing the first register access after reset to read nonsense data because the chip was still initializing internally. Not changing the clock frequency during startup fixes the problem. We'll just need to avoid changing the clock frequency while the HyperRAM driver is still starting up, otherwise its reset delay may be insufficient and it won't set up the registers correctly.

    This issue basically caused my driver code to go set the impedance to 115 ohms and that broke the data integrity at certain data bus transitions at certain frequencies. Nasty.

    Update: Oops, this startup timing "fix" was running at sysclk/2. Need to also double check sysclk/1
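
To make the suspected failure mode above concrete, here is a minimal Python sketch. It assumes the driver converts a required reset delay into a sysclock count once, at startup, and then counts that many ticks. The 150 µs delay and both clock frequencies are hypothetical illustration values, not figures taken from the driver.

```python
def wait_cycles(delay_us: int, sysclk_hz: int) -> int:
    """Sysclock ticks needed to cover delay_us at sysclk_hz (rounded up)."""
    return (delay_us * sysclk_hz + 999_999) // 1_000_000

def actual_delay_us(cycles: int, sysclk_hz: int) -> float:
    """Real time spent counting `cycles` ticks at sysclk_hz."""
    return cycles * 1_000_000 / sysclk_hz

if __name__ == "__main__":
    required_us = 150            # hypothetical post-reset delay
    startup_hz = 160_000_000     # hypothetical clock when the count was computed
    video_hz = 252_000_000       # hypothetical clock after the PLL change

    cycles = wait_cycles(required_us, startup_hz)
    print("ticks computed at startup:", cycles)
    print("delay if clock unchanged :", actual_delay_us(cycles, startup_hz), "us")
    print("delay after PLL speed-up :", round(actual_delay_us(cycles, video_hz), 1), "us")
```

The tick count stays the same, so raising the clock mid-wait shortens the real delay in proportion to the frequency ratio.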
  • evanh Posts: 9,983
    edited 2020-10-15 - 12:32:14
    rogloh wrote: »
    This issue basically caused my driver code to go set the impedance to 115 ohms and that broke the data integrity at certain data bus transitions at certain frequencies. Nasty.
    I'll say. I'm somewhat amazed it did as well as it did, given the limitation. Ah-ha - running that as a test config now ... oh, ouch, the error rate at 252 MHz is terrible!
  • Yeah that 115 ohm setting was definitely the cause of the corruption at 252MHz (thankfully not something else because we want 252MHz to be reliable). I'm still tracking down the root cause of why this register gets corrupted. Yesterday I thought it was to do with reset delays at startup when I changed that code and the behaviour improved, but I've since proven that there is something more to it. If I begin video with just the HyperRAM register defaults things work out, but if I try to set up the control register first it often breaks. I may have introduced a regression in my register read/write code with recent changes. Also, whether the HyperRAM driver starts before or after video affects the behaviour too. That's when the frequency changes, during video init. I'll figure it out.
  • evanh Posts: 9,983
    edited 2020-10-16 - 01:22:56
    Oh, ha, I think I may have hit the same problem but with different symptoms. You know how I said I had a bug with starting tests above 90 MHz? Well, upon further investigation, I see it's also related to badly configured CR0. There might be some timing constraint we're unaware of and tripping up on.

    EDIT: Doh! Of course, in my case, I'm not calculating the compensation when reading CR0. That's a bit of an oversight. Doing a read-modify-write is garbaging the register. :( :)

  • Yeah my own read-modify-write is also garbaging it because I read bad data first before updating it. I've seen the first byte of the register read as FF and at other times both bytes of the word are reading the same value.
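
One way to sidestep a bad read is to build the whole CR0 word from known field values instead of modifying whatever was read back. The Python sketch below is only an illustration of that idea, not something either driver in this thread is confirmed to do. The bit layout is an assumption based on the Cypress/Infineon HyperRAM datasheets (bit 15 normal operation, bits 14..12 drive strength, bits 11..8 reserved as ones, bits 7..4 latency code, bit 3 fixed latency, bit 2 hybrid burst, bits 1..0 burst length), and `build_cr0` is a hypothetical helper.

```python
def build_cr0(drive: int = 0b000, latency: int = 0b0001,
              fixed_latency: int = 1, hybrid_burst: int = 1,
              burst_len: int = 0b11) -> int:
    """Assemble a 16-bit CR0 value from fields (assumed datasheet layout)."""
    return ((1 << 15)                      # normal operation, not deep power down
            | ((drive & 0b111) << 12)      # drive strength code
            | (0xF << 8)                   # reserved bits, written as ones
            | ((latency & 0xF) << 4)       # initial latency code
            | ((fixed_latency & 1) << 3)
            | ((hybrid_burst & 1) << 2)
            | (burst_len & 0b11))

if __name__ == "__main__":
    print(f"${build_cr0():04X}")           # defaults reproduce the $8F1F value mentioned earlier
```

Because nothing is read first, a corrupted first read after reset cannot leak into the value that gets written.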
  • evanh Posts: 9,983
    edited 2020-10-17 - 14:23:06
    Here's my hard coded pre-calculation for setting the HR register reading compensation:
    '--- precise timing of CR0 register read --- ( Data rate of sysclock/2 )
    		mov	pa, clk_freq
    		shr	pa, #20
    		mov	cr0comp, #86			' < 160 MHz
    		cmp	pa, #160_000_000>>20	wcz
    	if_ae	mov	cr0comp, #88			' < 240 MHz
    	if_ae	cmp	pa, #240_000_000>>20	wcz
    	if_ae	mov	cr0comp, #90			' < 300 MHz
    	if_ae	cmp	pa, #300_000_000>>20	wcz
    	if_ae	mov	cr0comp, #91			' >= 300 MHz
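
For readers who don't speak PASM2, the conditional chain above can be modelled on a host in Python. This is a sketch that mirrors the listing's thresholds and values; the function name `cr0comp_sysclk2` is made up for the example.

```python
def cr0comp_sysclk2(clk_freq: int) -> int:
    """Model of the sysclock/2 CR0 read compensation lookup."""
    pa = clk_freq >> 20                      # same scaling as SHR pa, #20
    comp = 86                                # below 160 MHz
    if pa >= 160_000_000 >> 20:
        comp = 88                            # below 240 MHz
        if pa >= 240_000_000 >> 20:
            comp = 90                        # below 300 MHz
            if pa >= 300_000_000 >> 20:
                comp = 91                    # 300 MHz and above
    return comp

if __name__ == "__main__":
    for mhz in (100, 200, 252, 320):
        print(mhz, "MHz ->", cr0comp_sysclk2(mhz * 1_000_000))
```

Note that shifting both sides right by 20 bits gives the comparisons a granularity of about 1 MHz (2^20 Hz), so a clock slightly under a nominal threshold can already select the next step.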
    

    EDIT: Updated for lsb used as registered/unregistered databus config. And added the routine that uses this compensation:
    read_cr0
    'read data from hyperRAM CR0 register to cog PA register
    '
    		wrpin	rgckmode, #ram_ck		'registered HR clock
    		wxpin	#2, #ram_ck			'HR clock step interval (dmadiv)
    		setxfrq	xfrq2				'set streamer transfer rate for read/write
    		wrpin	rgdatmode, #hr_bpin | 7<<6	'registered HR databus (in and out)
    		dirh	#hr_bpin | 7<<6			'drive HR databus
    
    		outl	#ram_cs				'begin "Command-Address" phase
    		xinit	cacfg, cr0rd			'kick the streamer off for CA (command and address) phase
    		wypin	#4+2*12+2, #ram_ck		'HR clock go! (Conveniently, streamer data leads by one sysclock)
    
    		mov	pa, cr0comp			'precalculated compensation based on sysclock frequency
    		shr	pa, #1			wc	'remove lsb, used for reg/unreg databus select
    		waitx	#8				'pause for CA phase to complete
    		dirl	#hr_bpin | 7<<6			'tristate HR databus
    	if_c	wrpin	#0, #hr_bpin | 7<<6		'unregistered HR databus (in and out)
    		waitx	pa
    
    		wxpin	#1, #ram_ck			'realign clock to instruction timing, needs space to next WXPIN, important!
    		outh	#ram_cs
    		getbyte	pa, ina+pinx, #bytx		'collect first data byte, upper 24 bits of PA is cleared
    	_ret_	rolbyte	pa, ina+pinx, #bytx		'collect second data byte
    
  • evanh Posts: 9,983
    edited 2020-10-17 - 15:49:05
    And the equivalent for a pure sysclock/1 setup. It's limited to about 300 MHz sysclock.
    '--- precise timing of CR0 register read --- ( Data rate of sysclock/1 )
    		mov	pa, clk_freq
    		shr	pa, #20
    		mov	cr0comp, #46			'registered, 23 sysclocks
    		cmp	pa, #90_000_000>>20	wcz
    	if_ae	mov	cr0comp, #47			'unregistered, 23 sysclocks
    	if_ae	cmp	pa, #130_000_000>>20	wcz
    	if_ae	mov	cr0comp, #48			'registered, 24 sysclocks
    	if_ae	cmp	pa, #180_000_000>>20	wcz
    	if_ae	mov	cr0comp, #49			'unregistered, 24 sysclocks
    	if_ae	cmp	pa, #220_000_000>>20	wcz
    	if_ae	mov	cr0comp, #50			'registered, 25 sysclocks
    	if_ae	cmp	pa, #260_000_000>>20	wcz
    	if_ae	mov	cr0comp, #51			'unregistered, 25 sysclocks
    

    When at sysclock/1 for everything, it's notable that much of the config thrashing disappears. The smartpin and streamer dividers are a set-once affair and, because the smartpin divider is now fixed at #1, the smartpin period timer no longer needs special treatment for realigning to the instruction times.

    Example of how small the code becomes:
    send_cr0
    'write cog PB register to hyperRAM CR0 register, upper 16 bits is tail of CA phase so should stay cleared
    '
    		movbyts	pb, #%%0123
    		dirh	#hr_bpin | 7<<6			'drive HR databus
    		outl	#ram_cs				'begin "Command-Address" phase
    
    		xinit	cacfg, cr0wr			'kick the streamer off for CA (command and address) phase
    		wypin	#6+2, #ram_ck			'HR clock go! (Conveniently, streamer data leads by one sysclock)
    		xcont	cacfg, pb
    
    		waitxfi					'wait for streamer done
    		dirl	#hr_bpin | 7<<6			'tristate HR databus
    	_ret_	outh	#ram_cs
    
    
    read_cr0
    'read data from hyperRAM CR0 register to cog PA register
    '
    		dirh	#hr_bpin | 7<<6			'drive HR databus
    		outl	#ram_cs				'begin "Command-Address" phase
    		xinit	cacfg, cr0rd			'kick the streamer off for CA (command and address) phase
    		wypin	#4+2*12+2, #ram_ck		'HR clock go! (Conveniently, streamer data leads by one sysclock)
    
    		mov	pa, cr0comp			'precalculated compensation based on sysclock frequency
    		shr	pa, #1			wc	'remove lsb, used for reg/unreg databus select
    	if_c	wrpin	#0, #hr_bpin | 7<<6		'unregistered HR databus (in and out)
    		dirl	#hr_bpin | 7<<6			'tristate HR databus after CA phase
    		waitx	pa
    
    		outh	#ram_cs
    		wrpin	rgdatmode, #hr_bpin | 7<<6	'registered HR databus (in and out)
    		getbyte	pa, ina+pinx, #bytx		'collect first data byte, upper 24 bits of PA is cleared
    	_ret_	rolbyte	pa, ina+pinx, #bytx		'collect second data byte
    


    Registering/unregistering also takes a back seat because all activity uses an unregistered clock pin and data writes all use registered data pins. Only data reads potentially use unregistered data pins.

    The catch is, the hyper accessory board is not really up to this. There are a bunch of caveats for reliable operation. For one, you need a 10 pF capacitor on the HR clock pin. And even then writes above 200 MT/s are hairy; the band around 250 MHz sysclock is hard to make 100% reliable. I'm hoping the Edge Board w/HR will smooth this over nicely.

    Attached is a report at purely sysclock/1, including CR0 handling, using the 10 pF capacitor.

    EDIT: Removed an unneeded WAITX, updated listing.
  • rogloh Posts: 2,777
    edited 2020-10-17 - 14:10:19
    Yeah needing sysclk/2 as well as sysclk/1 operation in the same code blows it out a bit and definitely complicates things. It would be great to standardize on the higher speed, but I doubt that it will be possible in all cases. Hence the reason I needed to support both speeds. I look forward to the Edge board that keeps the timing tight, perhaps some HyperRAM v2 testing will eventually be possible too on this board if/when parts are available?

    I've not had a chance to get back onto this video issue fix yet, been tied up doing some MicroPython work.
  • evanh Posts: 9,983
    edited 2020-10-17 - 14:50:24
    Huh, I forgot about dealing to the read data speed of reading CR0. I'm still bit-bashing that part, which can only read every second sysclock. It seems to be working still, so I presume it's due to the clock stopping on the second response byte from the HR chip and therefore the second byte stays steady on the HR databus until CS pin goes high.

    EDIT: I can't do it the way I'd like anyway, there is no streamer equivalent to "immediate" output mode for inputting pin data. The only way to stream in data at sysclock/1 is to DMA it to hubRAM. So I'll leave my code as is for now.
  • I haven't read the whole thread, but one thing I don't understand is why there is a high error rate at the lower frequencies with the higher compensations.

    Has anyone tried your drivers on a P2 connected to HyperRam with short traces and no connectors?
  • evanh Posts: 9,983
    edited 2020-10-19 - 00:07:17
    hinv wrote: »
    I haven't read the whole thread, but one thing I don't understand is why there is a high error rate at the lower frequencies with the higher compensations.
    I'm not clear on the question being asked. The narrowness of the error-free compensations, in sysclock granularity, is a function of the data rate as a fraction of sysclock, i.e. when the data rate is sysclock/1 there is only a single WAITX timing compensation that aligns valid incoming read data. If the data rate is sysclock/2 then there are two usable WAITX timing compensations that can be used to read the valid data. This has no bearing on outgoing writes as there is no clock response involved then.

    On top of that there are compensation columns for registered data pins (even compensation values are used for specifying this) and columns for unregistered data pins (odd values). This is a configuration trick that provides an estimated extra 0.5-1.0 ns delay in the prop2's data latching, so can be employed to shift timings a little further.
    		mov	pa, cr0comp			'precalculated compensation based on sysclock frequency
    		shr	pa, #1			wc	'remove lsb, used for reg/unreg databus select
    	if_c	wrpin	#0, #hr_bpin | 7<<6		'unregistered HR databus (in and out)
    		dirl	#hr_bpin | 7<<6			'tristate HR databus after CA phase
    		waitx	pa
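
The even/odd encoding used by the snippet above can be sketched in Python as follows. This is a host-side model only: `split_comp` is a made-up name, and it mirrors what `SHR pa, #1 WC` does, where the shifted value becomes the WAITX count and the carry (the old lsb) selects unregistered data pins.

```python
def split_comp(comp: int) -> tuple[int, bool]:
    """Split a compensation value into (WAITX count, unregistered-pins flag)."""
    waitx = comp >> 1                 # remaining bits: delay in sysclocks
    unregistered = bool(comp & 1)     # lsb: odd means unregistered data pins
    return waitx, unregistered

if __name__ == "__main__":
    for comp in (46, 47, 50, 51):
        waitx, unreg = split_comp(comp)
        mode = "unregistered" if unreg else "registered"
        print(f"comp {comp}: {mode}, {waitx} sysclocks")
```

Stepping the compensation value by one therefore alternates between the registered and unregistered columns before moving on to the next whole sysclock of delay.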
    

    Has anyone tried your drivers on a P2 connected to HyperRam with short traces and no connectors?
    Not yet. I think Rayman has a layout done. And the Edge w/HR is underway now too.

  • evanh wrote: »
    ... This has no bearing on outgoing writes as there is no clock response involved then.
    I'll attempt to clarify this a little: The writes do have to be carefully timed also, but they don't have a shifting compensation with different clock frequencies. A single alignment works at all frequencies, and even all ratios if you've got it spot on.

  • Yeah thankfully this is the case for writes. Once the code has been carefully constructed (particularly at sysclk/2 to center the clock transition in the data bit) writes can remain in sync over all operating frequencies.
  • Roger,
    Just looking at your "WRITES with latency", I note you are adding "c" to "clks" and then issuing a WYPIN with both. Looks to me like the first WYPIN is not completing before the second one is issued. That might be why you've ended up with the XINIT/XCONT consecutive instructions.

  • evanh Posts: 9,983
    edited 2020-10-25 - 04:25:58
    Hmm, I don't understand how the REP/XCONT loop fits in there either. It looks like you are sending "hubdata" at least twice.

    EDIT: Oh, it's all about those damn SKIPF patterns. I barely even noticed you had one there. I thought it was only earlier in the source code. Man, they do add another mental layer!

    EDIT2: That particular SKIPF is taking up more code space than a JMP would.

    EDIT3: Is "fastwrite" routine even used at all?