I think we're down to selecting components and/or board layout to get the desired result. I've tried a 1k5R output drive on the HR clock pin to give it some delay, but that had no signal at all. Next was the 124R DAC drive, and that did give the 100% I'm after, but only up to 80 MHz. So still too weak/slow. Next I replaced the 10R resistor to the CK pin of the hyperRAM chip with two series-connected 10R resistors (20 ohms total), and this also worked up to 110 MHz with an unregistered HR clock pin. So this one is still too fast ...
If you clock it slower but still at sysclock/1, does the error rate disappear @evanh?
No. Not without delaying the clock phase more. The trick is in consistently controlling that lag.
EDIT: The way I'm seeing it is: By default, there is no setup time in the data-to-clock coming from the prop2. A small amount can be added by having the data pins registered and the clock pin unregistered, but not enough. Or at least not enough with the way the accessory board is routed.
EDIT2: And swapping those around to have small hold time instead of setup time is a similar but worse result. Probably due to data timing variations of unregistered data pins.
@evanh For high-speed experiments, you might try using the bottom-edge I/O headers on the RevB Eval board, as those have the closest matched trace lengths. The HyperRAM accessory pcb is trace-length matched to within specification, but the Eval board adds some unmatched length into the mix.
With the RevB Eval, the best choice of I/O header to use for high speed stuff would be in this order: bottom, left, right, top.
Wow! Thanks Von, moving to that side made a difference. I'm getting some 100% matches now with unmodified boards, unregistered HR clock and registered HR data on the prop2.
And with the 20R CK resistor on the accessory board it's almost perfect all the way up to 380 MHz. I'm getting 255 out of 256 matches in two areas: 110-120 MHz and 210-270 MHz. The rest of the sysclock frequencies are 100% matches.
PS: Here's an example dump (latest) of what I'm seeing for my diagnostics: (Forum wouldn't paste it all )
PPS: The cycles value is still based on the 32 bit word size I had when doing the streamer to streamer work. I've left it that way so the compare routine is unmodified. I just internally multiply it by four to do the hyperRAM byte sized copying. So, block size is 1 kByte.
I'm doing writes only with the streamer; my reads are all bit-bashed at sysclock/9 for the moment. Now that you've mentioned it though, maybe I should try swapping around to see where that's at for me before doing any more soldering work ...
Ha! 4 microseconds is the max allowable chip-select time. That's never going to be complied with. I'm using 34 clocks after CS goes low in the CA routine, and in the bit-bashing read routine another 54 clocks for the setup and spacer clocks. At 40 MHz sysclock, that's already 2.1 μs before the data burst has started.
As for the data: 1 kByte of data and 9 sysclocks per data byte makes it a 230 μs burst!
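(For anyone checking the arithmetic, here's a quick back-of-envelope version of those two figures; the 1024-byte block size and clock counts are taken from the posts above, not from any actual driver code.)

```python
# Quick check of the CS-low numbers quoted above: 34 CA clocks + 54
# setup/spacer clocks, then 1024 bytes at 9 sysclocks per byte, all at 40 MHz.
sysclk_hz     = 40_000_000
setup_clks    = 34 + 54            # CA phase + bit-bashed read setup/spacer
block_bytes   = 1024
clks_per_byte = 9

setup_us = setup_clks * 1e6 / sysclk_hz                    # ~2.2 us (quoted above as ~2.1 us,
                                                           #  depending on how the CA clocks are tallied)
burst_us = block_bytes * clks_per_byte * 1e6 / sysclk_hz   # ~230 us
print(f"setup ~{setup_us:.1f} us, data burst ~{burst_us:.0f} us, vs the 4 us CS-low spec")
```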
I was thinking for video applications with large bursts, it might be possible to have an arbiter COG take over the refresh function by interleaving some refresh accesses in between real COG accesses when idle, then the CS could be quite long. It would be nice for sequential accesses to refresh multiple rows though and I'm not sure it does things that way.
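A rough sketch of that arbiter idea (illustrative only; `burst_access` and `dummy_read` are hypothetical placeholders, not real driver calls):

```python
# Illustrative arbiter loop only - 'hyperram.burst_access' and
# 'hyperram.dummy_read' are hypothetical placeholders.
# Real COG requests always win; idle time is spent walking dummy reads
# through the rows so each one gets refreshed inside the retention window.
from collections import deque

ROWS = 8192                      # rows to cover within the ~64 ms window

def arbiter_loop(requests: deque, hyperram):
    next_row = 0
    while True:
        if requests:
            hyperram.burst_access(requests.popleft())   # real access first
        else:
            hyperram.dummy_read(row=next_row)           # idle: refresh a row
            next_row = (next_row + 1) % ROWS
```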
Here's what I'm getting by alternately switching clocked I/O and tweaking waitx.
Yep, I'm seeing the same behaviour for reads ...
Registering doesn't seem to have anywhere near the advantage I was hoping for on reads.
First round of thermal testing indicates a 5% scaling up of everything for a 30 °C drop in board temperature. Presumably this effect is linear, so for a full spec range of 120 °C there is a rough 20% scaling adjustment.
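(Spelling out that extrapolation, on the linear assumption only:)

```python
# Linear extrapolation of the observed ~5% timing scaling per 30 degC drop
# across a 120 degC spec range - an assumption, not a measured figure.
per_30C   = 0.05
spec_span = 120
print(f"~{per_30C * spec_span / 30:.0%} over the full range")   # ~20%
```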
I don't know if that is possible due to the current schematic of both pcbs involved (I'm not on my computer, where the docs are saved), but have you tried lowering VIO to 3.0V or even a bit less, at the pin banks that are connected to the HyperRAM chip?
IIRC, keeping VccQ (the HR I/O circuit power supply) as low as possible (but within the spec'd 2.7V <= VccQ <= 3.3V), and also lowering the switching levels that come from the HR controller device (P2), would reduce the total noise seen by the HR device and contribute to better DQ[7:0] bus waveforms, despite any adverse effects lower voltages can impose on the HR main memory array access time when both Vcc and VccQ are connected to the same power rail.
I've got a revA board with a revA chip on it. This can be hooked up to an external supply for lower VIO. The read performance may be a little extendable in the right conditions, dunno.
The multiple bad spots are not very enticing though. Those certainly won't go away with a reduced supply voltage. Ignoring the bad spots, I've already got it touching the spec'd 1.8 Volt speed of 333 MB/s but at 3.3 Volts.
The writing performance will definitely be negatively impacted with a reduced supply voltage.
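(For reference, the 333 MB/s figure is just the 1.8 V part's 166 MHz HyperBus rating times two bytes per clock; the sysclock mapping below assumes one byte transferred per P2 sysclock.)

```python
# Where 333 MB/s comes from: HyperBus moves one byte per clock edge on its
# 8-bit DDR bus, so the 1.8 V grade's 166 MHz rating is ~333 MB/s. At one
# byte per P2 sysclock that corresponds to a ~333 MHz sysclock
# (HR clock = sysclock/2, i.e. ~166 MHz).
hr_clk_mhz = 166
print(f"{hr_clk_mhz * 2} MB/s")    # ~333 MB/s
```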
Yes, they are frustratingly vague around the refresh rules on these DRAMs with auto-refresh.
With auto-refresh, you would hope that more CS=H time could buy longer CS=L (even if clocking when CS=H),
but they sort of hint at needing a CS falling edge to advance refresh (64 ms / 8192 = 7.8125 µs),
but if that were really true, holding CS=H would fail to refresh.
There is info suggesting 64 ms (max) is an expected refresh repeat, so maybe you can look for that (the room-temperature limit may be well above that 64 ms),
i.e. start with a setup that deliberately fails refresh, then see what is needed to pass?
For video work, if the RAM is not used for anything else, it may be fine to simply use the Any-Read-Is-Refresh rule, as frame times will always be < 64ms.
That gets to be a problem if you want frame buffers to switch quickly (unless you split the row, making the left half screen1 and the right half screen2?).
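(The distributed-refresh interval being worked from above:)

```python
# Distributed-refresh interval implied by a 64 ms retention window spread
# over 8192 rows (the figures quoted above).
retention_ms = 64
rows = 8192
print(f"one row every {retention_ms * 1000 / rows} us")   # 7.8125 us
```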
I can't make it fail. I've now got a solid 10 seconds of chip select low time, with 2 ms intervals between clock pulses doing a linear read operation. Before that I do the 1 kB of random data write and afterwards I read it back and compare. Not a single bit failure over many repeats.
It must be still doing refreshes even with CS low.
EDIT: Oops, and I was also stressing the data pins at the same time by leaving the prop2 HR_databus driven for the long duration.
I've been bursting 400k blocks to hub with no data loss.
The original data was written once, then read over the course of the day with various tests.
I gave up trying to make sense of the datasheet and just tried things to see what happened.
Maybe try going longer than 10 s; that will be quite logarithmic - try 2 minutes (and warm the chip).
I've finally got something! I've changed up to an 80 kB block size, and with 5 seconds of CS low (clocking doesn't make any difference) I'm getting errors. The temperature does need to be raised too, although I guess an hour of CS low would do it without the heat.
The error count is small, and even at 60 °C it climbs to only a few percent. So I guess I'm still only catching the poorer cells.
EDIT: That's a lot of safety margin they've got there! At least 100-fold.
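(A rough way to see that margin, taking the earlier 10 s no-failure run against the 64 ms spec window; room-temperature figures only, so this is only a lower bound.)

```python
# Rough margin estimate: 10 s of CS-low with no errors versus the 64 ms
# spec'd retention window.
no_error_s = 10.0
spec_ms    = 64
print(f"~{no_error_s * 1000 / spec_ms:.0f}x margin")   # ~156x, i.e. 'at least 100-fold'
```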
Okay, upped to 4 MB block size (Used XORO32 for procedural data instead of storing anything in hubRAM) and removed the intentional long chip select. Again, I was able to generate errors once I got the hyperRAM chip temperature above 60 °C. Possibly 70 °C. (Note: All these temperatures are somewhat of a guess since I am using the thermocouple I have soldered to the prop2 Eval Board as the guide.)
The 4 MB block is read from hyperRAM as a single CS-low duration at 8 sysclocks per byte, and separately written to hyperRAM at 7 sysclocks per byte.
At 200 MHz, it was hard to get more than a few bit errors even when I was confident I had gone over 70 °C. 160 milliseconds for each block read.
At 360 MHz, there were a couple of hundred. Possibly the prop2 wasn't handling the temperature at that speed.
At 40 MHz, there were a hundred or so bit errors, but I also didn't push the temperature so high then, so the longer CS time showed up more with this. 800 milliseconds for each block read.
EDIT: Actually, I think I need more accurate temperature measuring. I've discovered the accessory board has notable thermal inertia.
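(A sketch of that procedural-data idea plus the burst-time arithmetic; a generic PRNG stands in for XORO32 here, and `hyperram_write`/`hyperram_read` are hypothetical helpers, not real calls.)

```python
# Procedural test data: re-seed the same PRNG for the write and verify passes
# so no 4 MB reference copy needs to live in hub RAM. random.Random stands in
# for XORO32; hyperram_write/hyperram_read are hypothetical helpers.
import random

BLOCK = 4 * 1024 * 1024
SEED  = 0xDEADBEEF              # illustrative seed

def pattern(seed, n):
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

# write pass:  hyperram_write(addr=0, data=pattern(SEED, BLOCK))
# verify pass: errors = sum(a != b for a, b in
#                           zip(hyperram_read(addr=0, count=BLOCK), pattern(SEED, BLOCK)))

# Burst-time check: 4 MB read at 8 sysclocks per byte
for sysclk in (200e6, 40e6):
    print(f"~{BLOCK * 8 / sysclk * 1000:.0f} ms per block read at {sysclk/1e6:.0f} MHz")
# -> ~168 ms at 200 MHz and ~839 ms at 40 MHz, roughly the 160/800 ms quoted above
```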
Good to see failures come and go
Can you add a read of another area, say 1 in every 20 of those scans? The idea here is to check non-read areas for (hidden) refresh sustain.
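(Something like this scan schedule, sketched loosely; `scan` is a made-up placeholder.)

```python
# Loose sketch of the suggested schedule: every 20th pass also reads a
# control region that is otherwise never touched, to see whether hidden
# refresh is keeping non-accessed rows alive. 'scan' is a placeholder.
def run_passes(scan, total=1000):
    for n in range(total):
        scan("active_region")              # the normal test area
        if n % 20 == 0:
            scan("untouched_control")      # check refresh of a non-read area
```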
There's no changing the prop2 voltage; it still has 3v3 VIO.
You're using the fast settling time of the P2 DACs to do the physical I/O driving at 1v8 or 1v1 or 300 mV or whatever level you need. You still use the normal digital signal as if it's a 3v3 logic pin. It's a mode I suggested to Chip so we can interface with non-3v3 logic just like this.
The hyperRAMs have spec points at 100 and 166 MHz, but also at 133 MHz in between. Of course this operation may venture off the straight and narrow of the spec, just like standard overclocking.
Whether the input comparator (in pin A>D mode) can keep up is one question I have, but even if the input can't keep up, being able to burst write fast is still useful for capture (just not so useful for video).
Today OzPropDev and I had a quick look at this "1v8 compatibility" idea, based on Chip's comments from last week's Zoom conference. The test is to have an output pin, P16, in BitDAC mode driving a ~1v8 square wave into adjacent pin P17 via a jumper. P17 is configured in A>D mode, with the threshold typically set around a value of 85 or so (85/255 × 3v3 ≈ 1.1 V). We set up some adjacent pins, P18/19/20, to monitor what was happening.
Where we got to, we were able to 'see' a 49 MHz square wave passing back through P17's comparator, using a monitoring adjacent smartpin.
The possible significance of this would be interfacing to 1v8 devices, such as 1v8 hyperRAMs, without interposing level-shifting logic. Chip previously mentioned he thought the A>D comparators may have a limit around 30 MHz, but be dependent on the amplitude of the incoming signal swing. This backs that up.
With some additional work we may be able to push beyond 49 MHz, as it was somewhat limited by our methodology, but at least we got that far. Also, the hyperRAM drive strength can be configured to be stronger than the 75 ohm DAC used in BitDAC mode.
The comparators would have about 0.8-0.9V overdrive, so they should be as fast as they could be while using them in the usual way. Their slowness implies latency, so they need to be sampled as late as possible, and that may require an external phase-shift network on the clock - essentially you may end up having to sample the comparator state 3/4 of the clock period after the clock has changed state. Since we can't directly adjust the comparator sampling phase offset (unless I missed it), it's the clock that has to be tweaked. The clock can be as fast as you wish as long as the comparator delay remains under 85% of the clock period or thereabouts. Over full temperature range, 75% of clock period is more realistic. Say, at 100MHz the comparator delay would need to be 7.5-8ns. I'm not sure if they can be forced to be that fast.
There's another trick - use two comparators and set their thresholds away from the midpoint, I'd say as far away as you can get away with. That's how I overcame comparator latency in the SMPS application on P2. The "high" comparator needs its threshold as close to 0V as possible to still have it work. The "low" comparator needs its threshold as close to 1.8V as possible. These thresholds have to be adapted at runtime, I'd presume - at least I had to do that for the SMPS application. Then some bit-twiddling magic would be needed to combine their results. In the SMPS case it was easier because I only cared about the latency for the comparator to switch one way, and for that I'd adjust only the threshold of that comparator. It'd be plenty fast on the desired slope, but much slower on the opposite slope where it didn't matter that it got slow as long as it did reset eventually. I didn't use the A/D mode for the SMPS application though, just the usual M=11xx_xxxxxxxxx comparator mode, with DIR=0.
Assuming a square wave input, as the comparator thresholds are driven away from the midpoint, their output's duty cycle decreases - that's the ultimate limitation, since the short pulses they produce need to have enough setup and hold time as needed by whatever samples them. The best way to adjust the comparator threshold would be to have the comparator output (pin Input) driven onto a separate output pin, low-pass filtered, ADC-sampled (it can be slow) and used to tweak the threshold for a desired duty cycle. Since a DAC can only drive a pin through 1.5kOhm, that establishes the low-pass R value. It doesn't take much C to average it enough for an ADC readback.
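(Putting numbers on the two constraints above - the latency budget versus clock period, and the averaging low-pass fed through the 1.5 kΩ drive; the 100 nF below is only an example value.)

```python
import math

# 1) Latency budget: the comparator delay must stay under ~75-85% of the
#    clock period, per the reasoning above.
clk_hz = 100e6
period_ns = 1e9 / clk_hz
for budget in (0.75, 0.85):
    print(f"{budget:.0%} of a {clk_hz/1e6:.0f} MHz period -> {budget * period_ns:.1f} ns max delay")

# 2) Averaging low-pass for the duty-cycle readback: R is set by the
#    ~1.5 kOhm DAC drive; C = 100 nF is an illustrative choice only.
R, C = 1.5e3, 100e-9
print(f"cutoff ~{1 / (2 * math.pi * R * C):.0f} Hz")   # ~1 kHz, slow enough for ADC averaging
```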
You are my data sheet, Kuba.
The internal DAC that is used by the comparator is kind of high-impedance and bounces around a lot via coupling from the comparator's other input. I added some internal capacitance to it to stabilize it, but maybe that just stretched out its recovery time.