HyperRAM driver for P2


Comments

  • Yeah you might be able to see something on your high quality scope @evanh if you look at data/clock timing differences and how much jitter/skew there is present there at the different frequencies.
  • Okay, that does look like data corruption rather than video timing behaviours. And I just thought to try swapping the hyper accessory boards too. The unmodified board gives me a clean picture on the 1920x1080 mode, but when I do the same with the other board, that currently has a 6.8 pF capacitor on the HR clock pin, I get a fine but clear pixel noise on the left side of the display. And it's the same effect on both a modern LCD and an old analogue CRT.

    The effect is most visible near the left edge but I can still see it out to about 1/3 across the display.

  • And putting the hair drier on it boosts the effect. It makes the image tend to blackness, spreading from the middle left of the display.
  • rogloh Posts: 2,777
    edited 2020-10-10 - 10:23:26
    Interesting result. Yeah, why does the left side of the screen suffer more? It's like something early on in the transfer burst is more prone to errors? The 640x480 @ 252MHz seemed to suffer quite badly in my setup, while the 1920x1080 was almost perfect except for very fine noise on the left. However, that is being severely overclocked at 297MHz, so I can accept that. My earlier tests had used 200MHz for VGA and SVGA, 195MHz for XGA, and 216MHz for SXGA, and they all looked great. There was very little overclock there.

    I wonder if v2 HyperRAM should help solve this, given there will be no overclock there at those frequencies/resolutions even with a sysclk/1 transfer rate.

    To prove very high speed data corruption definitively I probably need to write something that continually reads from the frame buffer in external memory and logs whether the last scan line read is different from the prior data at the same scan line. I wonder if the data signal is overshooting somehow and there is ringing with different colour transitions on the data bus, or any other ground bounce or crosstalk/signal switching effects etc.
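A minimal host-side sketch of the logging idea above, in Python rather than PASM2. It re-reads one scan line repeatedly, compares each read with the previous read of the same line, and counts which part of the line the differences fall in. `read_line` is a hypothetical callable standing in for whatever actually fetches a scan line from external memory; it is not part of any driver shown in this thread.

```python
from collections import Counter
from typing import Callable, List


def diff_offsets(previous: bytes, current: bytes) -> List[int]:
    """Return the byte offsets where two reads of the same scan line differ."""
    if len(previous) != len(current):
        raise ValueError("scan line reads must be the same length")
    return [i for i, (a, b) in enumerate(zip(previous, current)) if a != b]


def log_scanline_errors(read_line: Callable[[int], bytes],
                        line: int,
                        passes: int,
                        buckets: int = 8) -> Counter:
    """Re-read one scan line `passes` times and histogram mismatches.

    Each read is compared with the prior read of the same line.
    The line is split into `buckets` equal regions, so a left-side bias
    would show up as larger counts in the low bucket numbers.
    """
    previous = read_line(line)          # hypothetical frame buffer read
    histogram: Counter = Counter()
    for _ in range(passes):
        current = read_line(line)
        for offset in diff_offsets(previous, current):
            histogram[offset * buckets // len(current)] += 1
        previous = current
    return histogram
```

Because each read is compared with the one before it, a single bad read is counted twice (once going bad, once recovering). Comparing against a known test pattern instead would avoid that, at the cost of needing a fixed frame buffer.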
  • The 252 MHz will be in a band crossing region. What tested as borderline within, really wasn't.

    I have no idea why the beginning of the burst suffers more. And never would've thought to test for it either.

  • Variable cap on the clock output pin for fine tuning...?
  • rogloh wrote: »
    I wonder if v2 HyperRAM should help solve this, given there will be no overclock there at those frequencies/resolutions even with a sysclk/1 transfer rate.
    Yep, and better board layout too. A v2 HR on a Prop2 Edge board will be nippy as.
    To prove very high speed data corruption definitively I probably need to write something that continually reads from the frame buffer in external memory and logs whether the last scan line read is different from the prior data at the same scan line. I wonder if the data signal is overshooting somehow and there is ringing with different colour transitions on the data bus, or any other ground bounce or crosstalk/signal switching effects etc.
    It's all just the stacked latencies that create those usable bands. You'll be able to get the effect down at the top of the first band also; somewhere around 90 MHz should do it.
  • evanh Posts: 9,983
    edited 2020-10-10 - 10:35:13
    rogloh wrote: »
    Variable cap on the clock output pin for fine tuning...?
    I'm now pretty certain that's mostly a counter to the extra load on the HR data pins that exists with the hyper accessory board. Shouldn't be needed with the dedicated HR Edge board. Removing the hyperFlash from the accessory board might make all the diff too. Not sure I'm quite up for doing that though ...

  • rogloh Posts: 2,777
    edited 2020-10-10 - 10:50:22
    When I get a chance tomorrow I might try to put a serial port channel into this type of vertical line demo, to manually tweak the delay and possibly allow HR register write to adjust the data bus impedance to see if that makes any difference to the result.

    I know there is a band crossing around 225-235 and another 270-280 MHz or so, but perhaps it is still marginal at 252 MHz which isn't good as that is in the middle where it should be clean. Perhaps it takes a certain type of colour transition on the data bus to skew the output waveform timing enough to not be well sampled? Why this effect can only happen in the first part of the scanline only for 1080p is still confusing me though.

    Another idea is to play with the PLL divider settings to see if that has any effect here. Maybe 252MHz has some additional jitter with that setting. 250MHz might be different. A little extra P2 jitter in certain places or times along the scan line and the fixed latency back from the HR to the P2 input pins might possibly interact together in a negative way.
  • evanh Posts: 9,983
    edited 2020-10-10 - 21:54:58
    Oh, the added capacitor I've been experimenting with is solely to make burst writes work at sysclock/1. The burst reads have always suffered degraded band widths as a result. So, no, a variable capacitor wouldn't be helpful here.

    Removing the inherent capacitance/impedance of long parallel tracks and unneeded connections/components is what would help.

    EDIT: Capacitor-less burst writes at sysclock/1 should work fine when using a well matched board layout for the HR. Then just using an unregistered output to drive the HR clock, against registered HR data pins, provides the small data setup time that writes need.

    The new Prop2 Edge w/HR will give us this. :)

    EDIT2: Actually, an accessory board sans the hyperFlash for the Eval revC could do this too but it won't help much with read performance, like the closely coupled layout of the Edge will, so doesn't seem all that worth while to me.

  • So it seems we need something tunable rather than a fixed passive or serpentine clock delay

    There's space on my P2DIP40/P2DIL80 board, so yesterday I put a footprint down there to make sure a HyperRAM fits, which it does. This is unlikely to be a loaded component in the bigger production runs, but for these initial proto runs it can be there, and I have a short Digikey reel of these v1 HyperRAMs that will otherwise go to waste.

    Due to its location it really needs to share its i/o with P32..43 of the first Mikrobus socket, which has 12 gpio.

    My question is, can any use be made of the 12th (spare) GPIO pin, in order to effectively tune the clock phase? I am thinking this signal could be driven from the inverse clock signal, which would permit it to be used with 1.8V v2 HyperRAMs should we find a way to make those work. So let's assume that signal is being driven by CLK#. What would help to make the phase tunable (so it can be switched back and forth depending on whether reading or writing)? A series inductor on the CLK# line? (Combined with the ability to reduce clock amplitude using bit DAC mode, or change the DAC driving impedance on just that clk pin.)

    Roger, in your tight driver do you have any room to change a pin config when turning from writing to reading anyway? I guess right at this stage we just need to prove that things can be achieved reliably
  • evanh Posts: 9,983
    edited 2020-10-10 - 23:34:24
    At the moment Roger has lots of WRPINs for reconfiging the pins and smartpins. So tweaking those for a drive variation won't cost space. However, it's a big step from the 20 ohms of the logic drive to the DAC's 124 ohms. I doubt it'll help, it'll be smothered in attenuation.

    The best help is registered vs unregistered. Chip specified a limit of 1.0 ns in synthesised I/O propagation delays. So, given the large amount of symmetries in the prop2, unregistered will likely lag registered by a little under 1 ns all round.

  • Ok, perhaps there is hope in squeezing this into existing pin configs of Roger's driver, then

    The 75 ohm dac could be used given suitable terminating passives at the junction

    Is a serpentine clock delay any use here? You would move the taps at the top of one end of the delay line, but of course it affects both read and writes. But perhaps getting that delay right helps balance things up, so you can then just use registered/unregistered depending on whether reading or writing


  • Tubular Posts: 4,100
    edited 2020-10-11 - 00:01:32
    My other thought is consider a 'driven shield' approach to the existing clock line. I think this gives three basic modes
    1. In phase with Clock - Theoretically this moves most capacitive loading across to the driven shield, so its effect is like a phase shift.
    2. DC. If you just hold or drive the driven screen at DC (eg GND) you effectively have an increased capacitance on the clock line, relative to the original design.
    3. Antiphase - should be like significantly increasing the capacitive loading on the clock line.

    That's a fair bit of flexibility from a single adjacent smartpin.


  • Tubular wrote: »
    Roger, in your tight driver do you have any room to change a pin config when turning from writing to reading anyway? I guess right at this stage we just need to prove that things can be achieved reliably

    Well, there are always some ways to free up space if we need it; it just might take a few extra instructions to do so in the worst case. My fast execf vector table that burns a lot of memory can be squished down at some point if we ever get really desperate. That can free up to 80 longs from COG memory. There is also common per-COG handling code that could be merged, at the cost of another 4 clock jump instruction, which could save 13 longs in COG RAM. Certainly for testing out ideas things are doable.

    However those V2 HyperRAMs mean we would not have to overclock to reach sysclk/1 speeds beyond 200MB/s so there probably won't be benefit of the 1.8V operation unless it offers a further way to try to tweak its clocking behavior like changing the comparator level. But I think Chip mentioned it was a slow comparator so it would potentially limit read speed anyway. Might still be worth an experiment regardless. We just need a test board with these ideas to try out, so if you wanted to make one up anyway it can be tested.
  • evanh Posts: 9,983
    edited 2020-10-11 - 01:51:24
    Tubular wrote: »
    My other thought is consider a 'driven shield' approach to the existing clock line. I think this gives three basic modes
    1. In phase with Clock - Theoretically this moves most capacitive loading across to the driven shield, so its effect is like a phase shift.
    2. DC. If you just hold or drive the driven screen at DC (eg GND) you effectively have an increased capacitance on the clock line, relative to the original design.
    3. Antiphase - should be like significantly increasing the capacitive loading on the clock line.

    Those things would be in order if we didn't have a convenient option at our fingertips already. I don't feel anything fancy is needed. Clean and short is best to widen HR burst read frequency bands. And that helps the very top end of writes too.

    The main thing writes need is a consistent phase shift on the clock to provide a tiny data setup time. This appears to be fine for the v1 hyperbus as low as 0.5 ns, albeit out of spec. V2 hyperbus actually specifies it as 0.5 ns.

    The trick is guaranteeing that setup time is always there evenly for the whole data bus. Again, keeping the path clean and short seems to me to be the best approach. Clock and data board layout all evenly matched. Once that's sorted then just using an unregistered pin for HR clock should be the bees knees to give a consistent setup time.
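To put rough numbers on the setup-time argument above, here is a small Python sketch. It assumes a deliberately simple model that is not from the thread: data changes on a registered edge, the unregistered clock edge follows `clock_lag_ns` later, and each byte lasts one sysclock period at sysclock/1. The 0.9 ns lag used in the usage note is only an illustrative stand-in for "a little under 1 ns"; board skew, ringing and rise times are ignored.

```python
def sysclock_period_ns(sysclk_mhz: float) -> float:
    """Length of one sysclock period in nanoseconds."""
    if sysclk_mhz <= 0:
        raise ValueError("sysclk_mhz must be positive")
    return 1000.0 / sysclk_mhz


def write_margins_ns(sysclk_mhz: float,
                     clock_lag_ns: float,
                     setup_ns: float = 0.5):
    """Return (setup_margin, remaining_window) for a sysclock/1 write.

    setup_margin     : how far the clock lag exceeds the required setup time
                       (negative means the setup requirement is not met).
    remaining_window : time from the clock edge until the data changes again,
                       under the one-byte-per-sysclock assumption.
    """
    period = sysclock_period_ns(sysclk_mhz)
    setup_margin = clock_lag_ns - setup_ns
    remaining_window = period - clock_lag_ns
    return setup_margin, remaining_window
```

For example, `write_margins_ns(297.0, 0.9)` gives a setup margin of about 0.4 ns and leaves roughly 2.5 ns of the 3.37 ns period after the clock edge, which is why a consistent sub-nanosecond lag across the whole data bus matters more than its exact size.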

  • Gonna be good to try this out when available. I'm still blown away that the 297MHz sysclk/1 works far better than at 252MHz and don't know why the middle of the 230-280 band is behaving as it is. Obviously the RAM itself is capable of delivering good data when clocked faster at 297MHz, so this is a timing or IO type of problem with ringing etc. I wouldn't have thought it would be bad clock timing if it is in the middle of the band though, so the data pattern itself seems to be a part of it. A cleaner routing of the signals could help that perhaps.
  • It took me a while to realise that I wouldn't need the capacitor I'm using on this accessory board if the data pins didn't have the extra loading on them from the hyperFlash chip. It dawned on me properly only a month or so back.

  • evanh Posts: 9,983
    edited 2020-10-11 - 02:02:45
    rogloh wrote: »
    ... Obviously the RAM itself is capable of delivering good data when clocked faster at 297MHz so this is a timing or IO type of problem with ringing etc...
    Yeah, I suppose ringing could be it. We might still have to deal with that at a higher frequency with the Edge board layout too. My prior experience in this field is pretty much zip.

  • I would expect good quality signal operation in the 100-200MHz frequency range needs careful attention on PCBs. None of my own simple boards have needed to really worry about that, as I've always done simple/slower boards < 25MHz, nothing high speed at all really.
  • Ah you just reminded me we talked about putting a small hi speed unity gain buffer onboard so we can probe without affecting signals

    This board needs to go to pcbzone who cut off at 8am tomorrow morning. Let me know if you want anything else useful on it
  • evanh Posts: 9,983
    edited 2020-10-13 - 00:44:27
    evanh wrote: »
    My next aim was to dump the alternate 100% bit-bashed routines, in this case used for writing procedurally generated data, so that all activity on the data pins is via the streamer and all clocking is via the smartpin. Then I can move a decent chunk of the init code out of the two send_ca routines because it'll only need set once. At least that's the theory. I certainly want to prove it can work.
    Okay, did all that. Got every low level routine using the smartpin for clock, and streamer in all but one place.

    And I've finally got the no-DIR method tested and working as I wanted. Fell into all the usual mistakes, like copy and pasting then deleting too much, along the way. I managed to keep enough backups to rescue myself at least three times. This type of coding collapses into confusion real easy!

    Right, to use the no-DIR trick, it requires strategic placement of WXPIN #1,ckpin to replace each DIRL/DIRH pair. So it needs to be well commented, as a reminder of the reason for an otherwise seemingly extraneous instruction.

    Here's the one routine that doesn't use the streamer 100%
    read_cr0
    'read data from hyperRAM CR0 register to cog PA register
    '
    		wrpin	rgckmode, #ram_ck		'registered HR clock
    		wxpin	#2, #ram_ck			'HR clock step interval (dmadiv)
    		setxfrq	xfrq2				'set streamer transfer rate for read/write
    
    		dirh	#hr_bpin | 7<<6			'drive HR databus
    		outl	#ram_cs				'begin "Command-Address" phase
    		xinit	cacfg, cr0rd			'kick the streamer off for CA (command and address) phase
    		wypin	#4+2*12+2, #ram_ck		'HR clock go! (Conveniently, streamer data leads by one sysclock)
    
    		waitx	#12				'pause for CA phase to complete
    		dirl	#hr_bpin | 7<<6			'tristate HR databus
    
    		waitx	#44				'precise timing of data phase
    		wxpin	#1, #ram_ck			'realign clock to instruction timing, needs space to next WXPIN, important!
    		outh	#ram_cs
    		getbyte	pa, ina+pinx, #bytx		'collect first data byte, upper 24 bits of PA is cleared
    	_ret_	rolbyte	pa, ina+pinx, #bytx		'collect second data byte
    
  • evanh Posts: 9,983
    edited 2020-10-13 - 00:50:40
    You can see there I've got the data rate locked at sysclock/2. I felt that was wisest for the register accesses.

    Another possible optimisation I thought about, but chose not to, was to leave the HR data bus driven by default and only tristate it when needed for reading.

  • evanh Posts: 9,983
    edited 2020-10-13 - 00:59:52
    The WXPIN #1,#ram_ck is placed at the end of the routine. It could go at the beginning instead, it would be easier to comprehend at the beginning. The reason I haven't is so that I can set large sysclock dividers for my testing.

    If only switching between sysclock/1 and sysclock/2 then having the WXPIN at the beginning would be fine.

    PS: All the low level routines would then need it arranged that way. Here's the alternate with WXPIN at the beginning:
    read_cr0
    'read data from hyperRAM CR0 register to cog PA register
    '
    		wxpin	#1, #ram_ck			'realign clock to instruction timing, needs space to next WXPIN, important!
    		wrpin	rgckmode, #ram_ck		'registered HR clock
    		setxfrq	xfrq2				'set streamer transfer rate for read/write
    		wxpin	#2, #ram_ck			'HR clock step interval (dmadiv) - note three instructions from prior WXPIN
    
    		dirh	#hr_bpin | 7<<6			'drive HR databus
    		outl	#ram_cs				'begin "Command-Address" phase
    		xinit	cacfg, cr0rd			'kick the streamer off for CA (command and address) phase
    		wypin	#4+2*12+2, #ram_ck		'HR clock go! (Conveniently, streamer data leads by one sysclock)
    
    		waitx	#12				'pause for CA phase to complete
    		dirl	#hr_bpin | 7<<6			'tristate HR databus
    
    		waitx	#46				'precise timing of data phase
    		outh	#ram_cs
    		getbyte	pa, ina+pinx, #bytx		'collect first data byte, upper 24 bits of PA is cleared
    	_ret_	rolbyte	pa, ina+pinx, #bytx		'collect second data byte
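The two read_cr0 variants differ in one delay (WAITX #44 versus WAITX #46), and the following Python sketch counts sysclocks to show why. It assumes the usual P2 timing of 2 sysclocks for an ordinary instruction and 2 + D for WAITX #D, and it only models the instructions between the WYPIN and the OUTH as copied from the two listings above. The helper names are made up for this illustration.

```python
def instr_clocks(mnemonic: str, operand: int = 0) -> int:
    """Sysclocks taken by one instruction (assumed: 2, or 2 + D for WAITX)."""
    return 2 + operand if mnemonic == "waitx" else 2


def clocks_until(sequence, stop: str) -> int:
    """Sum the sysclocks of every instruction before the first `stop`."""
    total = 0
    for mnemonic, operand in sequence:
        if mnemonic == stop:
            return total
        total += instr_clocks(mnemonic, operand)
    raise ValueError(f"{stop!r} not found in sequence")


# Instructions following WYPIN, WXPIN #1 placed at the end of the routine.
WXPIN_AT_END = [("waitx", 12), ("dirl", 0), ("waitx", 44),
                ("wxpin", 0), ("outh", 0)]

# Same stretch of code with WXPIN #1 moved to the start of the routine.
WXPIN_AT_START = [("waitx", 12), ("dirl", 0), ("waitx", 46), ("outh", 0)]

if __name__ == "__main__":
    print(clocks_until(WXPIN_AT_END, "outh"))
    print(clocks_until(WXPIN_AT_START, "outh"))
```

Under those assumptions both variants reach OUTH the same number of sysclocks after the WYPIN: the two clocks that the relocated WXPIN no longer occupies are absorbed by the longer WAITX.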
    
  • Here's the single run init code that has been shifted out of the low level data routines
    ' --- initial config of pin modes for hyper bus ---
    		drvh	#ram_cs				'ensure hyperRAM is deselected
    		fltl	#ram_ck				'mode is set first for steady pin drive
    		wrpin	rgckmode, #ram_ck		'registered HR clock
    		wxpin	#1, #ram_ck			'align clock to instruction timing, #1 is important!
    		dirh	#ram_ck
    		fltl	#ram_rwds
    		fltl	#hr_bpin | 7<<6			'set all data pins low and tristate
    		wrpin	##P_REGD, #hr_bpin | 7<<6	'eight data pins registered (in and out)
    		xstop					'ensure streamer is stopped
    		setxfrq	xfrq2				'set streamer transfer rate for read/write
    
  • evanh Posts: 9,983
    edited 2020-10-13 - 22:46:37
    All the low level routines as they are right now
    '------------------------------------------------------------------------------
    send_block_dma
    'write data to hyperRAM, read it from hubRAM
    'PTRA has hubRAM start address
    '
    		callpa	#writeram, #send_ca_dma		'block write command, includes padding clocks
    
    		xcont	lacfg, #0			'20 nil bytes, covers remaining latency phase, seamless therefore no compensation
    		rdfast	fastmask, ptra			'non-blocking
    		setword	txcfg, hrbytes, #0		'set the streamer burst length, max 64 kB
    
    		xcont	txcfg, #0			'queue data phase, note this is buffered while prior XCONT is pacing
    		dirh	#ram_rwds			'no masking, occurs maybe 8 sysclocks into latency phase
    
    		waitxfi					'wait for completion of DMA
    		outh	#ram_cs
    		wxpin	#1, #ram_ck			'provides clock realignment, needs space to next WXPIN, important!
    		dirl	#hr_bpin | 7<<6			'tristate HR databus
    	_ret_	dirl	#ram_rwds
    
    
    '------------------------------------------------------------------------------
    read_block_dma
    'read data from hyperRAM, write it to hubRAM
    'PTRA has hubRAM start address
    '
    		callpa	#readram, #send_ca_dma		'block read command, includes padding clocks
    
    		wrfast	fastmask, ptra			'non-blocking
    		setword	rxcfg, hrbytes, #0		'set the streamer burst length, max 64 kB
    		waitx	#dmadiv*4-4			'pause for CA phase to complete
    		dirl	#hr_bpin | 7<<6			'tristate HR databus
    
    		mov	pa, comp
    		add	pa, #dmadiv*25-12
    		waitx	pa				'precise timing of data phase
    		xinit	rxcfg, #0			'fire off data phase immediately
    
    		waitxfi					'wait for completion of DMA
    		outh	#ram_cs
    		wxpin	#1, #ram_ck			'provides clock realignment, needs space to next WXPIN, important!
    	_ret_	rdfast	#0, #0				'flush the FIFO
    
    
    '------------------------------------------------------------------------------
    send_ca_dma
    'PA has 3-bit command
    '
    		wrpin	hrckmode, #ram_ck		'registered/unregistered HR clock
    		wxpin	#dmadiv, #ram_ck		'HR clock step interval
    		setxfrq	xfrq				'set streamer transfer rate for read/write
    
    		andn	hraddr, #%111			'address granularity of 16 bytes (half page)
    		or	pa, hraddr			'merge address with the three bits of command
    		ror	pa, #3				'put command at top bits
    		movbyts	pa, #%%0123			'endian swap because streamer can only do sub-byte endian swapping
    
    		mov	pb, hrbytes			'clock steps for data phase
    		add	pb, #6+22			'clock steps for CA and fixed latency added to data length
    
    		dirh	#hr_bpin | 7<<6			'drive HR databus
    		outl	#ram_cs				'begin "Command-Address" phase
    		xinit	cacfg, pa			'kick the streamer off for CA (command and address) phase
    		wypin	pb, #ram_ck			'clock go!
    	_ret_	xcont	cacfg, #0			'4 nil bytes, for remaining CA phase, needed to manage RWDS/databus transition
    
    
    '------------------------------------------------------------------------------
    send_cr0
    'write cog PB register to hyperRAM CR0 register
    '
    		wrpin	rgckmode, #ram_ck		'registered HR clock
    		wxpin	#2, #ram_ck			'HR clock step interval (dmadiv)
    		setxfrq	xfrq2				'set streamer transfer rate for read/write
    
    		movbyts	pb, #%%0123
    
    		dirh	#hr_bpin | 7<<6			'drive HR databus
    		outl	#ram_cs				'begin "Command-Address" phase
    		xinit	cacfg, cr0wr			'kick the streamer off for CA (command and address) phase
    		wypin	#8, #ram_ck			'HR clock go! (Conveniently, streamer data leads by one sysclock)
    		xcont	cacfg, pb
    
    		waitxfi					'wait for completion of DMA
    		wxpin	#1, #ram_ck			'provides clock realignment, needs space to next WXPIN, important!
    		dirl	#hr_bpin | 7<<6			'tristate HR databus
    	_ret_	outh	#ram_cs
    
    
    '------------------------------------------------------------------------------
    read_cr0
    'read data from hyperRAM CR0 register to cog PA register
    '
    		wrpin	rgckmode, #ram_ck		'registered HR clock
    		wxpin	#2, #ram_ck			'HR clock step interval (dmadiv)
    		setxfrq	xfrq2				'set streamer transfer rate for read/write
    
    		dirh	#hr_bpin | 7<<6			'drive HR databus
    		outl	#ram_cs				'begin "Command-Address" phase
    		xinit	cacfg, cr0rd			'kick the streamer off for CA (command and address) phase
    		wypin	#4+2*12+2, #ram_ck		'HR clock go! (Conveniently, streamer data leads by one sysclock)
    
    		waitx	#12				'pause for CA phase to complete
    		dirl	#hr_bpin | 7<<6			'tristate HR databus
    
    		waitx	#44				'precise timing of data phase
    		wxpin	#1, #ram_ck			'realign clock to instruction timing, needs space to next WXPIN, important!
    		outh	#ram_cs
    		getbyte	pa, ina+pinx, #bytx		'collect first data byte, upper 24 bits of PA is cleared
    	_ret_	rolbyte	pa, ina+pinx, #bytx		'collect second data byte
    
    

    EDIT: Fixed a comment typo
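The WYPIN counts in these routines can be cross-checked with a few lines of Python. The sketch takes the constants straight from the send_ca_dma comment ("clock steps for CA and fixed latency added to data length", i.e. 6 + 22) and assumes one clock step per data byte, which is how the listing adds hrbytes to the total. Treating the register write as having zero latency steps is an assumption based on the later description of it as "no latency write code".

```python
CA_STEPS = 6              # clock steps for the command/address phase
FIXED_LATENCY_STEPS = 22  # clock steps of fixed latency, per send_ca_dma


def hr_clock_steps(data_bytes: int,
                   latency_steps: int = FIXED_LATENCY_STEPS) -> int:
    """Total clock steps to hand to WYPIN for one HyperRAM transaction."""
    if data_bytes < 0 or latency_steps < 0:
        raise ValueError("byte and latency counts must be non-negative")
    return CA_STEPS + latency_steps + data_bytes


if __name__ == "__main__":
    # read_cr0 fetches two bytes; its literal is written as 4+2*12+2.
    print(hr_clock_steps(2), 4 + 2 * 12 + 2)
    # send_cr0 writes two bytes with no latency phase; its literal is 8.
    print(hr_clock_steps(2, latency_steps=0))
```

The same formula reproduces both hard-coded literals (30 for read_cr0, 8 for send_cr0), which suggests the register routines are special cases of the block-transfer arithmetic at a fixed two-byte length.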

  • Next I'll do the same you've done with the bottom bit of the compensation delay used to control registering when generating the reports.

  • rogloh Posts: 2,777
    edited 2020-10-13 - 02:50:39
    It will be interesting to see how much you've been able to shrink it down by the end @evanh. If it is reliable and saves instructions and can support registered/unregistered and both sysclk/1 and sysclk/2 reads I can try to patch it into the driver code. I'm less concerned about optimizing the no latency write code. That is for register access only and is not typically used after setup. Even for burst writes to HyperFlash it won't save much. The flash write transactions have lots of other overheads that will dwarf it.

    One thing you may not be doing is looking at RWDS pin after CS goes low for doubling the latency. HyperFlash and HyperRAM behave differently there so hard coding to automatically double at all times is probably not doable.

    Update: from what I can tell right now, you don't seem to be supporting RWDS for byte masking which is a problem for general purpose writes of arbitrary sized blocks to arbitrary addresses. It's fine if you only ever access words or longs on word boundaries.
  • Heh, at the moment I only support 16-byte granularity. The bottom three address bits are ignored.
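The 16-byte granularity comes from the address handling in send_ca_dma (ANDN, OR, ROR, MOVBYTS). Below is a Python model of just those four instructions, as a sketch rather than a description of the full command-address phase. It assumes MOVBYTS with %%0123 reverses the four bytes of the long, as the "endian swap" comment indicates. The command values used in the usage note are arbitrary 3-bit examples, not the real `readram`/`writeram` constants, which are not shown in the thread.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & MASK32


def byte_reverse32(value: int) -> int:
    """Reverse the byte order of a 32-bit value (models MOVBYTS #%%0123)."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def pack_ca_long(command: int, hraddr: int) -> int:
    """Model of the four packing instructions in send_ca_dma.

    andn hraddr, #%111   -> drop the bottom three address bits
    or   pa, hraddr      -> merge with the 3-bit command
    ror  pa, #3          -> move the command to the top three bits
    movbyts pa, #%%0123  -> byte-reverse for the streamer
    """
    if not 0 <= command <= 0b111:
        raise ValueError("command must fit in three bits")
    pa = command | (hraddr & MASK32 & ~0b111)
    pa = ror32(pa, 3)
    return byte_reverse32(pa)
```

For instance, `pack_ca_long(0b101, 0x1000)` and `pack_ca_long(0b101, 0x1007)` return the same long, because the three bits that would select a position inside the 16-byte unit are reused to carry the command.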

  • evanh Posts: 9,983
    edited 2020-10-14 - 00:10:29
    Hmm, grr, struck a barrier. Can't integrate two features I had planned. The nicely working single WYPIN for the smartpin clock can't be combined with switching the clock pin between registered and unregistered in the latency phase. And the CA write phase can't do both at sysclock/1.

    All choices at sysclock/1 involve pausing the HR clocking to reconfigure.
    Or, maybe ...