Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

rogloh · 2020-10-25 06:43

Yes, "fastwrite" is used with sysclk/1 write selection. The code is patched at startup time accordingly into label "p5". Locations "p6" and "p7" are skipped dynamically with if_z enabled in some cases as well. I needed this to avoid a different loop for sysclk/1 writes. And burst fills with immediate data have to default to sysclk/2 IIRC because the instruction in the REP loop takes 2 clocks.

I make sneaky use of SKIPF in various places, you gotta watch out for that. Especially with dynamically patched code. The things you do when you are pushed to fit it in. LOL

evanh · 2020-10-25 06:57

A hidden patch layer too! No wonder it is difficult to fathom.

rogloh · 2020-10-25 06:59

It's basically become obfuscated code.

evanh · 2020-10-25 07:53

Here's my working code as it stands today. Someone was wanting a copy. There are a few constants and #defines that can be edited for customising each test run. I did that rather than having interactive prompts. So, it requires Eric's fastspin/flex-suite to assemble it. I use the -t option of loadp2 to gather the resulting report.

PS: If you want to change the default baud from 230400 then it's editable down a ways at the label "asyn_baud".

evanh · 2020-10-26 11:21

Tell you what, the accessory board arrangement has been great for attaching scope probes to. Examining the behaviour of the hyper bus is pretty easy at the accessory header pins. No special bed of nails or similar needed.

rogloh · 2020-10-26 13:51

That's what I did too. You can run it without memory board attached, just to see the transaction timing being generated by the P2 in many cases.

evanh · 2020-10-26 14:47

I mentioned it now because it just dawned on me that I'd been taking it for granted all this time. When the integrated boards, that can handle sysclock/1, arrive, they'll owe that to the testing done on the Eval Boards. The new layouts won't be anywhere near as easy to examine electrically.

Tubular · 2020-10-26 21:02

What I've done is add a 74ALVC125 quad buffer after the two hypers. Each hyper has its clock and a data line buffered, so CRO measurements won't affect the phase comparison. I think this will help with observability.

evanh · 2020-10-26 22:44

Ah, the buffer chip adds its own loading. Doing that to only some of the pins is then unbalancing the group. I believe that's the very reason I've had to add a capacitor to the hyper accessory board when testing writes at sysclock/1. The extra loading on the data pins, from the hyperFlash chip, is making them slower than the clock pin. So I have to slow the clock pin down with a capacitor.

Tubular · 2020-10-26 22:56

The input capacitance of the buffer is listed as 3.5 pF typical, so you can reduce the clock loading capacitor as required

We are only using 1 bit of the data byte, so can tap off one of the faster bits now that we have that data

We are pairing the clock signals out to Hyper A and Hyper B, so each is driven from the P2 separately from the other. I don't know what the propagation delay is in doing that (inside the P2), but that's one thing we can check out.

evanh · 2020-10-26 23:11

Who is we? It seems extraneous to add that to the more integrated boards when the testing is easy to do already on the Eval Boards. And the existing accessories will work on many future designs too. The 2x 20 pin header is a great test bed.

PS: Definitely want the chosen clock pin to have a longer output propagation time.

evanh · 2020-10-26 23:20

Actually, those propagation numbers should be for final stage unregistered propagations rather than whole chain times. The numbers provided aren't really what we need. What we're looking for is the relative differences between pins rather than OUT to IN response times.

PS: Given the absolute symmetry of the pad ring structures, I'll assume differences in registered propagations to be insignificant.

evanh · 2020-10-26 23:32

I guess deriving the final stage propagation is doable - If we knew the simulated clock frequency that is.

Tubular · 2020-10-27 00:49

Hopefully with the buffers we can look at this another way and perform some measurements

"We" is really rogloh and Ozprop; we have been bouncing ideas back and forth. One other idea we had is to use a 100k NTC thermistor for the CS pullup of the hyperrams, and butt that component up against the hyperram, so the P2 can measure its temperature. We need to look at temperature influence, for embedded setups that are much smaller than P2eval.

Tubular · 2020-10-27 00:55

evanh wrote: »

Actually, those propagation numbers should be for final stage unregistered propagations rather than whole chain times. The numbers provided aren't really what we need. What we're looking for is the relative differences between pins rather than OUT to IN response times.

PS: Given the absolute symmetry of the pad ring structures, I'll assume differences in registered propagations to be insignificant.

Why should they? I think Chip described it on zoom as being to/from a 'marshalling register' deep inside the P2 core, out to each pad

The pin structures may well be mostly symmetrical but at some level the differences will matter, and this might be such a level. You'd remember yourself plotting those pin trails showing ADC GIO and VIO calibration points and how they follow a path with temperature, so there's some such difference right there. Usually this stuff doesn't matter, but if you can characterise it early then it's less of a surprise when you're pushing things under pressure.

evanh · 2020-10-27 01:12

I said that because the existing numbers will all be with unregistered I/O. And just highlighting there's no significant reason to get another simulation for registered.

Main thing we need now is the clock frequency of the existing simulation.

Tubular · 2020-10-27 01:33

I'm not sure the frequency would affect those propagation time values, which are probably for a nominal case. The die temperature, voltage, process variations are likely to have more impact than frequency (though frequency is related to temperature due to self heating)

evanh · 2020-10-27 01:36

The reason it matters to get the simulated frequency that was used is so we can then derive the final stage propagation from those simulated results.

evanh · 2020-10-27 02:38

Tubular wrote: »

One other idea we had is to use a 100k ntc thermistor for the CS pullup of the hyperrams, and butt that component up against the hyperram, so the P2 can measure its temperature.

Having a board temperature reading is a good addition. I've been thinking it might be roughly doable by tracking changes in the prop2 internal ADC alone. Enough to keep the read data timing compensations in order at least.

evanh · 2020-10-27 03:07

evanh wrote: »

I said that because the existing numbers will all be with unregistered I/O. And just highlighting there's no significant reason to get another simulation for registered.

Ah, maybe there is a reason to know the registered propagation. Again to compare against the existing numbers so that would give us the difference between registered and unregistered. At the moment I'm generally guessing the diff is about one clock period minus one nanosecond.

EDIT: Or, other way around, unregistered is about one nanosecond later than registered. That'll be worst case, thinking about it. More realistically, the range will be 0.5 to 1.0 ns. Oh, your sheet has min-max of 0.8. That's wider than I was hoping for.

EDIT2: Doh! I was forgetting my own plan. The idea is to acquire the simulated frequency so that the derived final stage unregistered propagation can be assumed to also be the lag between registered and unregistered at the physical pin. This would solve for the hyperbus writes at sysclock/1.

EDIT3: There may be too many assumptions in that idea though. Just thought it would be easy to ask and then see what comes out.

evanh · 2020-10-27 09:52

Roger,
I worked out I can do leading as well as trailing RWDS masking using bit-bashing at sysclock/1.

Gets rid of the smartpin config parameters and even saves an instruction at the trailing end. Although, there is still the matter of setting lead timing for each clock ratio. I'm probably trading one set of parameters for another.

It's currently at three instructions either side of write data burst, plus another six instructions to handle odd/even detection and clock/streamer numerical corrections. C flag set means a leading odd address, Z flag set means a trailing odd address.
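The odd/even detection can be illustrated with a little address arithmetic. This is only a sketch under assumptions, not the driver's actual code: it assumes a byte-addressed write mapped onto 16-bit HyperRAM words, that "leading odd" means the start address is odd, and that "trailing odd" means the write ends on an odd byte boundary. The function name `rwds_mask_plan` and its return layout are hypothetical.

```python
def rwds_mask_plan(start_addr, length):
    """Hypothetical helper: plan RWDS masking for a byte write of `length`
    bytes starting at byte address `start_addr`, on a 16-bit word bus."""
    if length <= 0:
        raise ValueError("length must be positive")

    end_addr = start_addr + length          # first byte address past the write

    # Assumption: corresponds to the C flag in the post (leading odd address),
    # so the first byte slot of the first word must be masked.
    lead_odd = bool(start_addr & 1)

    # Assumption: corresponds to the Z flag in the post (trailing odd address),
    # so the last byte slot of the last word must be masked.
    trail_odd = bool(end_addr & 1)

    # Number of whole 16-bit words the burst has to clock out.
    first_word = start_addr >> 1
    last_word = (end_addr - 1) >> 1
    words = last_word - first_word + 1

    return lead_odd, trail_odd, words


if __name__ == "__main__":
    for start, length in [(0, 4), (1, 4), (1, 1), (0, 1)]:
        print(start, length, rwds_mask_plan(start, length))
```

The word count is what the clock and streamer lengths would need to be corrected to, which is presumably where the extra "numerical corrections" instructions go.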

rogloh · 2020-10-27 22:25

Interesting, if you find an approach that works in all cases and saves space and fits in with what I need in terms of flags use we can try to incorporate at some point if the benefits are compelling. That stuff is tricky to get right for all cases. I know I'd spent many hours on it with the analyser, and making sure that the data bus is enabled at the right time too.

evanh · 2020-10-28 03:44

I'm not attempting to handle adjusting the latency length. What I have done is remove the clock stall between latency and data phases. So, at sysclock/1 in particular, there is not a great deal of time to prep and align the streamer for HR reads during the full 12 latency. Currently has timing left for five more instructions.

Now, HR writes have the leading RWDS timing to handle during the latency phase. Writes used to be pretty idle there. What happens now is I drive RWDS high early (straight after CA phase), instead of low, and then decide on holding it for masking on an odd start address or lowering it before data phase starts so the mask doesn't happen. The fact that it was high during latency doesn't mean anything to the HR.

EDIT: Oh, and I've recently made the HR clock config simple. I now have just two options, it's either everything at sysclock/2 or everything at sysclock/1, no transitioning back and forth. Also, I've found that leaving the clock pin as unregistered all the time universally works. So that's handy. EDIT2: PS: True for the above posted source code too.

EDIT3: I basically copied your RWDS code initially, and the trailing RWDS masking is still much the same in principle. Changed back to using OUT and DIR instead of tricks with WRPIN for bit-bashing.

rogloh · 2020-10-29 23:30

@evanh
Just had an idea you might be interested in, using P2 HW capabilities for delaying the clock for sysclk/1 writes. In fact two ideas...

1) what if we drove just the clock output pin in a bit DAC mode? Maybe the different output driver path might be sufficient for delaying this pin relative to the data bus. But I don't know what impedance the BITDAC has, maybe 123.75ohms, if so that could be too slow?

2) This one may require a spare nearby pin. What if we fed in an intermediate clock output signal using a schmitt input mode (as live I/O, not clocked), with the output pin of the real HyperRAM clock pin following the input read from the intermediate clock pin instead of the OUT signal. This will delay the clock signal relative to data, possibly by some useful amount, and it may even be trimmed with an external RC circuit. There's also the comparator modes but they seem to only feedback the input at 1.5k which would be too high.

If you have a good scope you might be able to see how much delay a clock pin setup like this could achieve relative to normal GPIO data pins.

evanh · 2020-10-30 04:33

rogloh wrote: »

1) what if we drove just the clock output pin in a bit DAC mode? Maybe the different output driver path might be sufficient for delaying this pin relative to the data bus. But I don't know what impedance the BITDAC has, maybe 123.75ohms, if so that could be too slow?

That was one of my first objectives way back. It was far too weak. Attenuation killed it.

2) This one may require a spare nearby pin. What if we fed in an intermediate clock output signal using a schmitt input mode (as live I/O, not clocked), with the output pin of the real HyperRAM clock pin following the input read from the intermediate clock pin instead of the OUT signal. This will delay the clock signal relative to data, possibly by some useful amount, and it may even be trimmed with an external RC circuit.

I'll give it a whirl. The intermediate will have to be one of the other hyper pins on the accessory board though. That might ruin the attempt ...

rogloh · 2020-10-30 04:50

evanh wrote: »

I'll give it a whirl. The intermediate will have to be one of the other hyper pins on the accessory board though. That might ruin the attempt ...

For a quick test, I was wondering if we could simply measure the in-to-out propagation delay through a pin in schmitt trigger mode, fed from another pin, instead of running the full-blown HyperRAM code. That would let us observe the various delay effects possible and see what ballpark they are in.

Perhaps it might not suit the current HyperRAM accessory board unless we could drive out a clock via the INT pin if it is not enabled or the other device's clock pin. EDIT: it would have to be adjacent clock pin, INT is too far away to read. But adjacent clock is ideal as it is already an output and is harmless to drive.

If you do try this soon, I may not be online until a little bit later tonight.

evanh · 2020-10-30 06:13

Huh, discovered I'd incorrectly stated the feedback was always inverted in my pin config docs. Not sure how I'd got that wrong. I may have got it from the older blue sheet that I did all my early work from. Dunno. I deleted it just a few months back, once I realised it was out of date.

The answer is (In pinB and out pinA):
- As logic level input, 3.0 ns propagation.
- As schmitt trigger input, 3.4 ns propagation.

EDIT: Well, at least that's what the Eval Board at 25 °C can do. A bare prop2 will do a smidge better I'd guess. EDIT2: Or not.

That requires the singular paired pinB to be used. So, with the hyper accessory board, pinB would be the hyperFlash clock pin. That's functionally doable at least. I'm not much interested in trying though. It'll be some work to incorporate such relatively large lags into useful frequency band compensation mappings.

rogloh · 2020-10-30 14:19

This is a good result and thanks for doing the test evanh. Now we know that this might be another technique we can potentially use on a P2 to lag an output clock without external circuitry in order to provide extra setup time. It could be useful for other streamer stuff running at sysclk/1 rates that requires a clock output, beyond just the HyperRAM application.

evanh · 2020-10-31 03:15

For those interested, here's the code:

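		drvl	#flash_ck		'Drive pinB (hyperFlash clock) low
'		wrpin	##%010<<17, #ram_ck	'PinA echoes pinB
		wrpin	##%101<<17, #ram_ck	'PinA echoes schmitt pinB
		drvl	#ram_ck			'Enable pinA (hyperRAM clock) output
		rep	#1, #0			'Repeat the next instruction forever
		outnot	#flash_ck		'Toggle pinB on every pass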

And a screen shot ... ha, the 3.4 ns has climbed to 3.5 now. It's a warm day today.

rogloh · 2020-10-31 04:29

At 300MHz P2, the HyperRAM clock is 150MHz which has a 6.67ns period. Ideally we'd delay by half a bit to centre the clock in the bit period, though the setup time is more than the hold time so we can be closer to the data transition.

3.5ns will take us more than one half period away. This means it would need to be inverted. I wonder how much more delay the inverter step adds? Maybe you can invert the input of the pin. Right now it only provides 0.17ns of effective delay after the next clock edge (DDR). If the inverter added another say 1ns or so, that would be good.

You could try inverting the output too and see if that introduces a small delay, maybe we can make use of that if it does too.
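The numbers above can be checked with a few lines of arithmetic. This is a rough sketch only: it assumes sysclk/1 DDR operation, so one data slot lasts one sysclk period and the HyperRAM clock period is two sysclk periods. The function name `clock_lag_summary` is hypothetical, and the extra 1 ns for an inverter is just the guess from the post, not a measurement.

```python
def clock_lag_summary(sysclk_hz, lag_ns):
    """Hypothetical helper: relate a measured clock lag to DDR bit timing."""
    bit_ns = 1e9 / sysclk_hz        # one DDR data slot at sysclk/1
    period_ns = 2 * bit_ns          # full HyperRAM clock period (sysclk/2)
    effective_ns = lag_ns % bit_ns  # lag past the nearest earlier data edge
    centre_ns = bit_ns / 2          # ideal lag to centre the clock in the bit
    return period_ns, bit_ns, effective_ns, centre_ns


if __name__ == "__main__":
    # 3.5 ns is the schmitt-input lag measured earlier in the thread.
    # 4.5 ns adds the guessed 1 ns inverter delay (an assumption, not measured).
    for lag in (3.5, 4.5):
        period, bit, eff, centre = clock_lag_summary(300e6, lag)
        print(f"lag {lag:.1f} ns: clock period {period:.2f} ns, "
              f"bit {bit:.2f} ns, effective {eff:.2f} ns, "
              f"ideal centre {centre:.2f} ns")
```

With the guessed extra 1 ns, the effective lag lands a bit short of the bit centre, which fits the remark that setup time needs more margin than hold time.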
