Oh man, seeing your last edited post @evanh brings back memories of me fighting with bus timing and all the different cases. That's one of the reasons why, once I finally got it working, I decided to keep it as it is and put my efforts into all the other features, even though there might still be ways to shave a few cycles somehow. At least we have a working baseline at this point. If things are found that work in all cases and speed it up a bit, we can try to optimise there, but getting it working fully was definitely my first priority.
Doh! I'd jumped to conclusions. The clock registering transition during latency phase isn't causing any problem at all. I hadn't paid close attention to what you'd done with the lsb of the compensation delay value. I just woke up to the fact I'm meant to be switching registering of the data pins, not the clock pin.
I'd assumed the data pins had to stay registered for signal integrity, therefore it must be the clock pin being switched. Rerunning your "delaytest" program confirmed my dawning suspicion that I was chasing the wrong problem.
Now that you have included the registered/unregistered bit in your compensation as part of the delay like I used, your new results table shows the overlap intervals fading in and out over the transition bands quite nicely. Is it really still operating at 346 and 375MHz? Was this for reads or writes?
We can see that pathway through the frequency range with zero bit errors in your chart. Imagine if we could also visualize how this path shifts and closes off with temperature at some frequencies with some type of animation, that would be quite cool.
It's aimed at reads but the writes are also using the streamer now too. Pretty much mimicked your arrangement of using sysclock/2 for the writes then using the HR_DIV (dmadiv) for reads. Yep, those top bands seem to be real enough. Used the unmodified hyper accessory board for that run.
Here's the same again but with 10 pF capacitor on the HR clock pin. The bands shift down a little but surprisingly haven't shrunk in size.
And one of the main aims was to easily integrate HR writes at sysclock/1. Here's the board with the 10 pF capacitor again but with both reads and writes (including CA phase) operating at sysclock/1.
The notable change, unsurprisingly, is those two upper bands have sporadic single bit errors.
Any chance you might modify your test so it focusses on a single frequency in the middle like 252MHz and then run it for much longer duration, logging results while you ramp the temp up/down from ~ 20C--> 45C etc? If we start to see bit errors in the middle of normal bands that might potentially help explain what I observed with video (or that may still be something else).
Hmm, can't blame temperature I don't think. Here's a 70 °C run using the unmodified hyper board, writes at sysclock/2 (all registered pins), and reads at sysclock/1. It's still got some room in the band.
PS: I tried to up my starting XMUL but found something has a bug when starting above roughly 90 ...
If you are seeing bit errors I wonder if the internal self-heating could get it up to around 70C in room temp ambient?
The thing is that data errors should be rather random, but I seem to see it at the same offset on the video line - perhaps the particular colour transition on the data bus pushes the IO signal integrity just that little bit further to trigger it?
I take it that sysclk/2 is okay. We can at least still fallback to that when required...
Wow, I think I've found the graphics issue, thanks for mentioning this @evanh!
Somehow after my initialization I am reading back $FF9F from CR0 on die0 and die1 and I see the noise issue at 252MHz with sysclk/1. However if I write $8F1F to this register or any other value that is not 115 ohms, the problem goes away. I don't know why this is but it seems to be off by a byte (maybe the first byte read/written after startup gets corrupted), maybe the latency setting is off by one, or it's a chicken/egg thing with not using the correct latency before I modify the latency. But I know once this register is set correctly it works. Still digging into the code path that causes it. It's probably just an initialisation sequence bug.
Ok, so I think I might have figured out this weird register stuff, and it isn't entirely what I expected but more a system level thing.
In my video demo I found I am changing the PLL to a higher speed while the HyperRAM driver is still initialising. This means that its internal wait after device reset could be shorter than it should be, causing the first register access after reset to read nonsense data because the chip was still initializing internally. Once I fixed that startup timing issue by not changing the clock frequency, the problem went away. We'll just need to not change the clock frequency while the HyperRAM driver is still starting up, otherwise its reset delay may be insufficient and it won't set up the registers in the correct way.
This issue basically caused my driver code to go set the impedance to 115 ohms and that broke the data integrity at certain data bus transitions at certain frequencies. Nasty.
Update: Oops, this startup timing "fix" was running at sysclk/2. Need to also double check sysclk/1
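To put rough numbers on why the PLL change matters: if the post-reset wait is counted in sysclock ticks worked out for the startup frequency, then the real duration of that wait shrinks by the frequency ratio once the PLL is sped up. The figures here are illustrative assumptions only (a 150 µs intended wait, 20 MHz at startup, 252 MHz afterwards, and the switch happening right at the start of the wait), not values taken from the driver.

```latex
\begin{align*}
N &= t_{\text{wait}} \cdot f_{\text{start}} = 150\,\mu\text{s} \times 20\,\text{MHz} = 3000 \text{ ticks} \\
t_{\text{actual}} &= \frac{N}{f_{\text{new}}} = \frac{3000}{252\,\text{MHz}} \approx 11.9\,\mu\text{s}
\end{align*}
```

So under those assumed numbers the chip would get less than a tenth of the intended settling time before the first register access.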
This issue basically caused my driver code to go set the impedance to 115 ohms and that broke the data integrity at certain data bus transitions at certain frequencies. Nasty.
I'll say. I'm somewhat amazed it did as well as it did, given the limitation. Ah-ha - running that as a test config now ... oh, ouch, the error rate at 252 MHz is terrible!
Yeah that 115 ohm setting was definitely the cause of the corruption at 252MHz (thankfully not something else because we want 252MHz to be reliable). I'm still tracking down the root cause of why this register gets corrupted. Yesterday I thought it was to do with reset delays at startup, since the behaviour improved when I changed that code, but I've since proven that there is something more to it. If I begin video with just the HyperRAM register defaults things work out, but if I try to set up the control register first it often breaks. I may have introduced a regression in my register read/write code with recent changes. Also, whether the HyperRAM driver is started before or after video affects the behaviour too. That's when the frequency changes, during video init. I'll figure it out.
Oh, ha, I think I may have hit the same problem but with different symptoms. You know how I said I had a bug with starting tests above 90 MHz? Well, upon further investigation, I see it's also related to badly configured CR0. There might be some timing constraint we're unaware of and tripping up on.
EDIT: Doh! Of course, in my case, I'm not calculating the compensation when reading CR0. That's a bit of an oversight. Doing a read-modify-write is garbaging the register.
Yeah my own read-modify-write is also garbaging it because I read bad data first before updating it. I've seen the first byte of the register read as FF and at other times both bytes of the word are reading the same value.
EDIT: Updated for lsb used as registered/unregistered databus config. And added the routine that uses this compensation:
read_cr0
'read data from hyperRAM CR0 register to cog PA register
'
                wrpin   rgckmode, #ram_ck               'registered HR clock
                wxpin   #2, #ram_ck                     'HR clock step interval (dmadiv)
                setxfrq xfrq2                           'set streamer transfer rate for read/write
                wrpin   rgdatmode, #hr_bpin | 7<<6      'registered HR databus (in and out)
                dirh    #hr_bpin | 7<<6                 'drive HR databus
                outl    #ram_cs                         'begin "Command-Address" phase
                xinit   cacfg, cr0rd                    'kick the streamer off for CA (command and address) phase
                wypin   #4+2*12+2, #ram_ck              'HR clock go! (Conveniently, streamer data leads by one sysclock)
                mov     pa, cr0comp                     'precalculated compensation based on sysclock frequency
                shr     pa, #1 wc                       'remove lsb, used for reg/unreg databus select
                waitx   #8                              'pause for CA phase to complete
                dirl    #hr_bpin | 7<<6                 'tristate HR databus
        if_c    wrpin   #0, #hr_bpin | 7<<6             'unregistered HR databus (in and out)
                waitx   pa
                wxpin   #1, #ram_ck                     'realign clock to instruction timing, needs space to next WXPIN, important!
                outh    #ram_cs
                getbyte pa, ina+pinx, #bytx             'collect first data byte, upper 24 bits of PA is cleared
        _ret_   rolbyte pa, ina+pinx, #bytx             'collect second data byte
When at sysclock/1 for everything, it's notable that much of the config thrashing disappears. The smartpin and streamer dividers are a set once affair and, because the smartpin divider is now fixed at #1, the smartpin period timer no longer needs special treatment for realigning to the instruction times.
Example of how small the code becomes:
send_cr0
'write cog PB register to hyperRAM CR0 register, upper 16 bits is tail of CA phase so should stay cleared
'
                movbyts pb, #%%0123
                dirh    #hr_bpin | 7<<6                 'drive HR databus
                outl    #ram_cs                         'begin "Command-Address" phase
                xinit   cacfg, cr0wr                    'kick the streamer off for CA (command and address) phase
                wypin   #6+2, #ram_ck                   'HR clock go! (Conveniently, streamer data leads by one sysclock)
                xcont   cacfg, pb
                waitxfi                                 'wait for streamer done
                dirl    #hr_bpin | 7<<6                 'tristate HR databus
        _ret_   outh    #ram_cs

read_cr0
'read data from hyperRAM CR0 register to cog PA register
'
                dirh    #hr_bpin | 7<<6                 'drive HR databus
                outl    #ram_cs                         'begin "Command-Address" phase
                xinit   cacfg, cr0rd                    'kick the streamer off for CA (command and address) phase
                wypin   #4+2*12+2, #ram_ck              'HR clock go! (Conveniently, streamer data leads by one sysclock)
                mov     pa, cr0comp                     'precalculated compensation based on sysclock frequency
                shr     pa, #1 wc                       'remove lsb, used for reg/unreg databus select
        if_c    wrpin   #0, #hr_bpin | 7<<6             'unregistered HR databus (in and out)
                dirl    #hr_bpin | 7<<6                 'tristate HR databus after CA phase
                waitx   pa
                outh    #ram_cs
                wrpin   rgdatmode, #hr_bpin | 7<<6      'registered HR databus (in and out)
                getbyte pa, ina+pinx, #bytx             'collect first data byte, upper 24 bits of PA is cleared
        _ret_   rolbyte pa, ina+pinx, #bytx             'collect second data byte
Registering/unregistering also takes a back seat because all activity uses an unregistered clock pin and data writes all use registered data pins. Only data reads potentially use unregistered data pins.
The catch is, the hyper accessory board is not really up to this. There is a bunch of caveats for reliable operation. For one, you need a 10 pF capacitor on the HR clock pin. And even then writes above 200 MT/s are hairy, the band around 250 MHz sysclock is hard to make 100% reliable. I'm hoping the Edge Board w/HR will smooth this over nicely.
Attached is a report at purely sysclock/1, including CR0 handling, using the 10 pF capacitor.
Yeah needing sysclk/2 as well as sysclk/1 operation in the same code blows it out a bit and definitely complicates things. It would be great to standardize on the higher speed, but I doubt that it will be possible in all cases. Hence the reason I needed to support both speeds. I look forward to the Edge board that keeps the timing tight, perhaps some HyperRAM v2 testing will eventually be possible too on this board if/when parts are available?
I've not had a chance to get back onto this video issue fix yet, been tied up doing some MicroPython work.
Huh, I forgot about dealing to the read data speed of reading CR0. I'm still bit-bashing that part, which can only read every second sysclock. It seems to be working still, so I presume it's due to the clock stopping on the second response byte from the HR chip and therefore the second byte stays steady on the HR databus until CS pin goes high.
EDIT: I can't do it the way I'd like anyway, there is no streamer equivalent to "immediate" output mode for inputting pin data. The only way to stream in data at sysclock/1 is to DMA it to hubRAM. So I'll leave my code as is for now.
I haven't read the whole thread, but one thing I don't understand is why there is a high error rate at the lower frequencies with the higher compensations.
Has anyone tried your drivers on a P2 connected to HyperRam with short traces and no connectors?
I haven't read the whole thread, but one thing I don't understand is why there is a high error rate at the lower frequencies with the higher compensations.
I'm not clear on the question being asked. The narrowness of the error-free compensations, in sysclock granularity, is a function of the data rate as a fraction of sysclock, ie: When the data rate is at sysclock/1 then there is only a single WAITX timing compensation that aligns valid incoming read data. If the data rate is sysclock/2 then there are two usable WAITX timing compensations that can be used to read the valid data. This has no bearing on outgoing writes as there is no clock response involved then.
On top of that there are compensation columns for registered data pins (even compensation values are used for specifying this) and columns for unregistered data pins (odd values). This is a configuration trick that provides an estimated extra 0.5-1.0 ns delay in the prop2's data latching, so can be employed to shift timings a little further.
                mov     pa, cr0comp                     'precalculated compensation based on sysclock frequency
                shr     pa, #1 wc                       'remove lsb, used for reg/unreg databus select
        if_c    wrpin   #0, #hr_bpin | 7<<6             'unregistered HR databus (in and out)
                dirl    #hr_bpin | 7<<6                 'tristate HR databus after CA phase
                waitx   pa
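Going the other way, the even/odd layout described above can be sketched as a pack/unpack pair. The labels pack_comp and unpack_comp are made up for illustration and are not from either driver. The only thing taken from the listing is the layout itself: lsb selects unregistered data pins, the remaining bits are the WAITX count.

```
'compensation value layout: bits 31..1 = WAITX count, bit 0 = 1 for unregistered data pins
'  eg: comp = 6 -> WAITX 3, registered     comp = 7 -> WAITX 3, unregistered
'
pack_comp
'PA = WAITX count, C flag = 1 to request unregistered data pins
        _ret_   rcl     pa, #1                  'PA = (count << 1) | C

unpack_comp
'PA = compensation value, returns WAITX count in PA and the lsb in C
        _ret_   shr     pa, #1 wc               'C = old bit 0, PA = count
```

Keeping both settings in the one number means a test sweep can just step the compensation value by one and alternate between the registered and unregistered columns at each WAITX count.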
Has anyone tried your drivers on a P2 connected to HyperRam with short traces and no connectors?
Not yet. I think Rayman has a layout done. And the Edge w/HR is underway now too.
... This has no bearing on outgoing writes as there is no clock response involved then.
I'll attempt to clarify this a little: The writes do have to be carefully timed also, but they don't have a shifting compensation with different clock frequencies. A single alignment works at all frequencies, and even all ratios if you've got it spot on.
Yeah thankfully this is the case for writes. Once the code has been carefully constructed (particularly at sysclk/2 to center the clock transition in the data bit) writes can remain in sync over all operating frequencies.
Roger,
Just looking at your "WRITES with latency", I note you are adding "c" to "clks" and then issuing a WYPIN with both. Looks to me like the first WYPIN is not completing before the second one is issued. That might be why you've ended up with the XINIT/XCONT consecutive instructions.
Hmm, I don't understand how the REP/XCONT loop fits in there either. It looks like you are sending "hubdata" at least twice.
EDIT: Oh, it's all about those damn SKIPF patterns. I barely even noticed you had one there. I thought it was only earlier in the source code. Man, they do add another mental layer!
EDIT2: That particular SKIPF is taking up more code space than a JMP would.
Got another evening shift at work ahead of me, better get ready for that now.
Fixing ... Done already!
This run is with board temperature around 26 °C and pin drive setting of 0 (34 ohms). Writes at sysclock/2. Ha, it's wiped out those two top bands.
EDIT: Second run (report6.txt) is same again but above 70 °C ambient temperature. 252 MHz is borderline this time.
That's 375MB/s instead of the rated 200MB/s.
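For anyone checking the arithmetic behind those figures: HyperRAM has an 8-bit DDR bus, so it moves one byte per HR clock edge, and at sysclock/1 that works out to one byte per sysclock. A quick worked version, assuming the 375 MHz sysclock band from the results and taking the 200MB/s rating at face value:

```latex
\begin{align*}
\text{rate} &= 1\,\text{byte/sysclock} \times 375\,\text{MHz} = 375\,\text{MB/s}
  &&\Rightarrow\; f_{CK} = \tfrac{375}{2} = 187.5\,\text{MHz} \\
\text{rated} &= 200\,\text{MB/s}
  &&\Rightarrow\; f_{CK} = \tfrac{200}{2} = 100\,\text{MHz}
\end{align*}
```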
Try adjusting the drive strength and see if the video errors are suppressed.
000 - 34 ohms (default)
001 - 115 ohms
010 - 67 ohms
011 - 46 ohms
100 - 34 ohms
101 - 27 ohms
110 - 22 ohms
111 - 19 ohms
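Going by the HyperRAM datasheet register layout, that 3-bit code lives in CR0 bits 14..12. A minimal sketch for swapping it without disturbing the other fields could look like the following. The set_drive and drvmask labels are hypothetical, not part of either driver, and it assumes PB starts out holding a known-good CR0 value (eg: the $8F1F default) rather than one freshly read back from the chip.

```
set_drive
'swap the drive strength field of a CR0 value held in PB for the 3-bit code in PA
'(hypothetical helper, not part of either driver)
'
                and     pa, #%111               'keep only the 3-bit code from the table
                shl     pa, #12                 'line it up with CR0 bits 14..12
                andn    pb, drvmask             'clear the old drive strength field
        _ret_   or      pb, pa                  'merge, eg: $8F1F with code %001 gives $9F1F

drvmask         long    %111 << 12              'CR0 drive strength field, bits 14..12
```

The result is left in PB so it can be handed to a CR0 write routine afterwards.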
EDIT: Removed an unneeded WAITX, updated listing.
EDIT3: Is "fastwrite" routine even used at all?