Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

cgracey · 2020-09-24 00:28

evanh wrote: »

cgracey wrote: »

Rogloh, we have separate DQ busses, CS, and RWDS pins for each HyperRAM. Do we really need RWDS if we don't intend to copy data between HyperRAMs? Or if we only want to do block transfers? Thanks. -Chip

RWDS is very handy in one particular activity - byte sized blit type ops, like window dragging. I get the feeling that eight bits per pixel is very suited to the Prop2/HyperRAM combo.

If you have any solution for efficiently doing say four bits per pixel blit ops then that would eliminate the need for RWDS at eight bits.

Thanks, Evanh. We will have to have RWDS. Von Sarvas and I have been going over the datasheet and this thread for the last 45 minutes. We can't get around RWDS.

evanh · 2020-09-24 00:35

Err, I should have said "byte granularity". You got the gist, though.

evanh · 2020-09-24 00:44

cgracey wrote: »

We are working on a P2 Edge with HyperRAM and are wondering if we can just connect RESET# to RESn on the P2 chip.

I'm vague on how Roger's driver uses it, but I think he really only uses RESET for the HyperFlash part. To me, tying the HyperRAM RESET to the Prop2's RESn is ideal. You get a known power-up state and from there the software can do the rest.

Yanomani · 2020-09-24 01:01

A while ago, I crafted a schematic for a circuit to be used with HyperThings. It was originally able to manage up to four control signals, but can be expanded to any number to suit a particular application.

It uses a single master control signal, originally HyperCS#, which can also be thought of as HyperRESET#, because one can have as many DQ_x lines as needed actively pulled LOW at the same time.

So, as designed, it serves two HyperThings simultaneously; one CS#/RESET# pair per device.

The adaptations needed in the driving software are truly minimal, and will not affect the interface timing more than is reasonable.

The transparent latches and the OR gates (single or multiple-packed devices) are of the little/tiny logic kind: super fast 3.3V CMOS, and so small that one can lose many of them within a speck of dust.

Hope it helps

Henrique

P.S. latches: (Nexperia: 74AUP1G373); (OnSemi: NC7SZ373P6X-L22347); (TI: SN74LVC1G373)
latch input capacitances: in the order of 3/4 pF@ 3.3V (important to avoid heavy capacitive loadings at bus lines)

- OR-gates: (Diodes: 74LVC1G32); (TI: SN74AUP1G32)

rogloh · 2020-09-24 01:16

cgracey wrote: »

Rogloh, have you found it necessary to control the RESET# pin on the HyperRAM chips?

We are working on a P2 Edge with HyperRAM and are wondering if we can just connect RESET# to RESn on the P2 chip.

I've designed the driver at least to make its control of the reset pin optional. The key thing is that registers can still be accessed and setup after power up. The main thing there is setting the latency back to some known default as this will affect the timing of data being read and written. For HyperRAM, new register writes can always be done to get the chip back into some working state. I actually have some init code that begins to deal with situations like this but v2 HyperRAM will complicate this as it uses a different default latency value so some extra detection may be required when that eventuality arrives. If people have changed other registers like the refresh timing, they might need to be set back to defaults as well.

    'Loop through bus devices and setup a default device latency in case it had been changed 
    'prior to this driver restarting, and if its reset pin was not enabled.  An obscure case.

    repeat i from 0 to NUMBANKS-1
        device := devices[bus * 2 * NUMBANKS + i]
        if device
            if device & F_FLASHFLAG
                setFlashLatency(i<<24, DEFAULT_HYPERFLASH_LATENCY)
            'else 
                ' assume V1 HyperRAM for now
                'setRamLatency(i<<24, DEFAULT_HYPERRAM1_LATENCY)

For HyperFlash however there is a slight risk that if you were to stop/restart the code in the middle of a chip erase operation without an actual HW reset you may lose the ability to get it back into its usual working state. In that case the user would have to power cycle or HW reset. This is where the reset pin is useful. In your case with just HyperRAM on the P2 Edge this should not be a problem.

From the HyperFlash data sheet:

"Once the Chip Erase operation has begun, only a Status Read, Hardware Reset, or Power cycle are valid. All other commands are ignored."

cgracey wrote: »

Rogloh, we have separate DQ busses, CS, and RWDS pins for each HyperRAM. Do we really need RWDS if we don't intend to copy data between HyperRAMs? Or if we only want to do block transfers? Thanks. -Chip

We need RWDS to be able to mask the bytes written inside the 16 bit values. Without it there is no way to update at the byte level. I've designed my driver to always need it.
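As a rough illustration of the byte masking described above, here is a minimal Python model of a 16-bit write in which either byte can be left untouched. The function name, the list-based memory, and the choice of which byte counts as "high" are all assumptions made for illustration; the two mask flags stand in for the state of RWDS during each byte of the write data phase, and none of this is taken from the actual driver.

```python
def masked_write16(memory, word_addr, data16, mask_hi, mask_lo):
    """Write a 16-bit word into `memory` (a list of 16-bit ints).

    A byte whose mask flag is True keeps its old value, which is the
    effect a mask signal such as RWDS has on that byte of the write.
    """
    old = memory[word_addr]
    hi = (old >> 8) & 0xFF if mask_hi else (data16 >> 8) & 0xFF
    lo = old & 0xFF if mask_lo else data16 & 0xFF
    memory[word_addr] = (hi << 8) | lo


ram = [0x1122, 0x3344]
masked_write16(ram, 0, 0xAABB, mask_hi=True, mask_lo=False)  # update low byte only
masked_write16(ram, 1, 0xCCDD, mask_hi=False, mask_lo=True)  # update high byte only
print([hex(word) for word in ram])
```

Without the mask, updating a single byte would need a read-modify-write of the whole word.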

evanh · 2020-09-24 01:17

[deleted]

cgracey · 2020-09-24 02:07

rogloh wrote: »
cgracey wrote: »

Rogloh, have you found it necessary to control the RESET# pin on the HyperRAM chips?

We are working on a P2 Edge with HyperRAM and are wondering if we can just connect RESET# to RESn on the P2 chip.

I've designed the driver at least to make its control of the reset pin optional. The key thing is that registers can still be accessed and setup after power up. The main thing there is setting the latency back to some known default as this will affect the timing of data being read and written. For HyperRAM, new register writes can always be done to get the chip back into some working state. I actually have some init code that begins to deal with situations like this but v2 HyperRAM will complicate this as it uses a different default latency value so some extra detection may be required when that eventuality arrives. If people have changed other registers like the refresh timing, they might need to be set back to defaults as well.

    'Loop through bus devices and setup a default device latency in case it had been changed 
    'prior to this driver restarting, and if its reset pin was not enabled.  An obscure case.

    repeat i from 0 to NUMBANKS-1
        device := devices[bus * 2 * NUMBANKS + i]
        if device
            if device & F_FLASHFLAG
                setFlashLatency(i<<24, DEFAULT_HYPERFLASH_LATENCY)
            'else 
                ' assume V1 HyperRAM for now
                'setRamLatency(i<<24, DEFAULT_HYPERRAM1_LATENCY)

For HyperFlash however there is a slight risk that if you were to stop/restart the code in the middle of a chip erase operation without an actual HW reset you may lose the ability to get it back into its usual working state. In that case the user would have to power cycle or HW reset. This is where the reset pin is useful. In your case with just HyperRAM on the P2 Edge this should not be a problem.

From the HyperFlash data sheet:

"Once the Chip Erase operation has begun, only a Status Read, Hardware Reset, or Power cycle are valid. All other commands are ignored."

cgracey wrote: »

Rogloh, we have separate DQ busses, CS, and RWDS pins for each HyperRAM. Do we really need RWDS if we don't intend to copy data between HyperRAMs? Or if we only want to do block transfers? Thanks. -Chip

We need RWDS to be able to mask the bytes written inside the 16 bit values. Without it there is no way to update at the byte level. I've designed my driver to always need it.

Okay. Thanks, Rogloh.

Since we are just using HyperRAM, it seems we are safe just connecting RESET# to RESn.

The 8M x 8 chips are about $3.40, which is reasonable. We're planning to use two for graphics bandwidth, but maybe one could suffice. If we run the P2 at 200MHz and the HyperRAM at 100MHz, how much data could we read or write per second? How about at 300MHz?

cgracey · 2020-09-24 02:09

Is the effective r/w rate better than 90% of bus speed?

evanh · 2020-09-24 02:30

My thinking is that the current design of the auto-refresh circuit operates at double the needed row rate, so it will cycle through the whole refresh twice as often as spec'd when not blocked by data accesses. This leads me to believe that a pushy design can block up to 50% of refreshes and still conform to spec. EDIT: This assumes blocking puts a hold on the refreshing row address so that when the auto-refresh circuit continues it'll always resume at consecutive rows.

So, longer burst lengths, than 128 bytes 4 us, in the driver software could be employed to improve the ratio of setup time to peak bandwidth.

rogloh · 2020-09-24 02:57

cgracey wrote: »

Okay. Thanks, Rogloh.

Since we are just using HyperRAM, it seems we are safe just connecting RESET# to RESn.

The 8M x 8 chips are about $3.40, which is reasonable. We're planning to use two for graphics bandwidth, but maybe one could suffice. If we run the P2 at 200MHz and the HyperRAM at 100MHz, how much data could we read or write per second? How about at 300MHz?

I've not timed it lately, but depending on how you configure your burst sizes, with the P2 running at 200MHz you'll probably get 80-85% utilization in my driver (other custom drivers with fewer features could do better). The best condition is when you use large bursts consuming as much of the 4us as possible. For sysclk/2 operation this is then 80-85MB/s if only one COG is using it, while enabling the tighter sysclk/1 rate doubles this, and playing around with refresh timing can boost it too, though personally I stick to 4us. I use sysclk/1 read rates all the time with video, and for full HD @ 297MHz I can get over 200MB/s, which is enough for 1080p at 8bpp. Some may not want to do this, as it could make the system more susceptible to noise or jitter because the timing margins are reduced. Performance scales directly with P2 clock speed. I do the writes only at sysclk/2 speeds. Writing at sysclk/1 is potentially very risky and doesn't work with the P2-EVAL directly without HW mods, but if you design your P2-Edge board in a controlled way perhaps it may become achievable, and @evanh has done some good things there...
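The figures above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the 8-bit bus moves one byte per transfer at sysclk divided by the chosen divider, and that the 80-85% utilization quoted for 200MHz carries over unchanged to 297MHz, which has not been measured here. The function and variable names are made up for the example.

```python
def effective_mb_per_s(sysclk_mhz, divider, utilisation):
    """Effective throughput in MB/s on an 8-bit bus moving one byte per transfer."""
    raw_mb_per_s = sysclk_mhz / divider
    return raw_mb_per_s * utilisation


estimates = {}
for sysclk_mhz in (200, 297):
    for divider in (2, 1):
        # 0.80 and 0.85 are the utilisation bounds quoted in the post
        low = effective_mb_per_s(sysclk_mhz, divider, 0.80)
        high = effective_mb_per_s(sysclk_mhz, divider, 0.85)
        estimates[(sysclk_mhz, divider)] = (low, high)
        print(f"{sysclk_mhz} MHz, sysclk/{divider}: {low:.1f} to {high:.1f} MB/s")
```

Under these assumptions the 297MHz sysclk/1 case lands above 200MB/s, in line with the figure quoted for full HD.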

Having two memory chips is handy if you can spare the pins. It could really help with write bandwidth in video applications if you split these memories per frame allowing full write bandwidth to one device while video is streaming its data from the second device, and would also allow for applications using two completely independent accesses, e.g. video/audio on one and some future XMM on another....

rogloh · 2020-09-24 03:10

cgracey wrote: »

Is the effective r/w rate better than 90% of bus speed?

Not in my driver. For other dedicated drivers with fewer features it could be improved.

Forgetting software overheads etc, theoretically there is a 4us CS limit, and a 10ns minimum CS high time after each transfer, with about 14 clocks lost due to the address phase and latency. At 100MHz clock (200MB/s), in 4us this allows a burst of 400 clock cycles of which 386 can be data (772 bytes with DDR), and then a 10ns gap. So the best possible case would be 772 bytes in 4.01us. This is 192MB/s or ~96% bus utilisation. But in practice it is still not possible to achieve this on the P2 due to things like streamer setup overheads in software and other instructions needed for pin control etc. Once you factor all that in, and add other features, mailbox polling time, latency/pin parameter selection etc, etc, etc, its effective rate drops lower down to the 80-85% range (at least with my driver), and only with the largest transfers that fill the 4us.
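The best-case calculation above, worked as a short sketch. The parameter names are illustrative; the values (4us CS low limit, 14 overhead clocks, 10ns CS high time, two bytes per clock with DDR) are the ones quoted in the paragraph, and software overheads are ignored just as they are there.

```python
def best_case_burst(clock_mhz, cs_low_limit_us=4.0, overhead_clocks=14, cs_high_ns=10.0):
    """Return (data_bytes, mb_per_s, utilisation) for one maximum-length burst."""
    total_clocks = round(clock_mhz * cs_low_limit_us)  # clocks that fit while CS is low
    data_clocks = total_clocks - overhead_clocks       # minus address phase and latency
    data_bytes = data_clocks * 2                       # DDR: two bytes per clock
    cycle_us = cs_low_limit_us + cs_high_ns / 1000.0   # burst plus minimum CS high gap
    mb_per_s = data_bytes / cycle_us                   # bytes per microsecond equals MB/s
    peak_mb_per_s = clock_mhz * 2
    return data_bytes, mb_per_s, mb_per_s / peak_mb_per_s


data_bytes, mb_per_s, utilisation = best_case_burst(100)
print(f"{data_bytes} bytes per burst, {mb_per_s:.1f} MB/s, {utilisation:.1%} of peak")
```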

If I get a chance later today I'll have a quick look on the scope to see what rate I can actually hit nowadays.

cgracey · 2020-09-24 03:19

evanh wrote: »

My thinking is that the current design of the auto-refresh circuit operates at double the needed row rate, so it will cycle through the whole refresh twice as often as spec'd when not blocked by data accesses. This leads me to believe that a pushy design can block up to 50% of refreshes and still conform to spec. EDIT: This assumes blocking puts a hold on the refreshing row address so that when the auto-refresh circuit continues it'll always resume at consecutive rows.

So, longer burst lengths, than 128 bytes 4 us, in the driver software could be employed to improve the ratio of setup time to peak bandwidth.

Thanks, Evanh. Wouldn't normal reads sufficiently refresh the memory? Is the hidden auto refresh the only refresh? And, do we rely on it to operate in the background when CS is high?

cgracey · 2020-09-24 03:24

rogloh wrote: »

cgracey wrote: »

Okay. Thanks, Rogloh.

Since we are just using HyperRAM, it seems we are safe just connecting RESET# to RESn.

The 8M x 8 chips are about $3.40, which is reasonable. We're planning to use two for graphics bandwidth, but maybe one could suffice. If we run the P2 at 200MHz and the HyperRAM at 100MHz, how much data could we read or write per second? How about at 300MHz?

I've not timed it lately, but depending on how you configure your burst sizes, with the P2 running at 200MHz you'll probably get 80-85% utilization in my driver (other custom drivers with fewer features could do better). The best condition is when you use large bursts consuming as much of the 4us as possible. For sysclk/2 operation this is then 80-85MB/s if only one COG is using it, while enabling the tighter sysclk/1 rate doubles this, and playing around with refresh timing can boost it too, though personally I stick to 4us. I use sysclk/1 read rates all the time with video, and for full HD @ 297MHz I can get over 200MB/s, which is enough for 1080p at 8bpp. Some may not want to do this, as it could make the system more susceptible to noise or jitter because the timing margins are reduced. Performance scales directly with P2 clock speed. I do the writes only at sysclk/2 speeds. Writing at sysclk/1 is potentially very risky and doesn't work with the P2-EVAL directly without HW mods, but if you design your P2-Edge board in a controlled way perhaps it may become achievable, and @evanh has done some good things there...

Having two memory chips is handy if you can spare the pins. It could really help with write bandwidth in video applications if you split these memories per frame allowing full write bandwidth to one device while video is streaming its data from the second device, and would also allow for applications using two completely independent accesses, e.g. video/audio on one and some future XMM on another....

Thanks, Rogloh.

That seems really risky reading at sysclock/1. I think I've seen posts here where the CS line was being delayed slightly by a small capacitor? That seems to be the only control you might have over data registration. Really tempting, though. I should have made a streamer mode which would input data from pins and run the values through the lookup table to produce RGB

Yanomani · 2020-09-24 03:28

Since, in the present case, the HyperRAMs will be mounted on a version of the P2 Edge, would the connections between the P2 and the Hypers (CKs, DQs, RWDSs and CS#s) be limited to those two chips, or would they also extend right to the edge connector and beyond?

rogloh · 2020-09-24 03:31

That seems really risky reading at sysclock/1. I think I've seen posts here where the CS line was being delayed slightly by a small capacitor? That seems to be the only control you might have over data registration. Really tempting, though. I should have made a streamer mode which would input data from pins and run the values through the lookup table to produce RGB

Personally I've not found too many issues with sysclk/1 reads, but I'm someone who is overclocking everything, i.e. both the P2 and the HyperRAM, and am only working at room temps. If you want to meet the (current version 1, 3 Volt) HyperRAM spec, then sysclk/2 is the only way to do it unless the P2 always remains limited to 200MHz and below. That being said, sysclk/1 does actually work if you give it a go. How reliable it is over voltage/temp/board layout/chip etc is the unknown.

You only needed the capacitor for the sysclk/1 writes, the sysclk/1 reads didn't need it. My driver can enable the higher speeds independently for both reads and writes. For reads we switch between registered/unregistered data pins to achieve the timing skew needed for "reliable" data reading. This is done over different frequencies and there is a table (it's probably temp dependent too) to map between frequency and delay + registered/unregistered data pin control for reads. Some (small) frequency ranges are best to avoid if running at sysclk/1.

Yanomani · 2020-09-24 03:31

cgracey wrote: »

That seems really risky reading at sysclock/1. I think I've seen posts here where the CS line was being delayed slightly by a small capacitor? That seems to be the only control you might have over data registration. Really tempting, though. I should have made a streamer mode which would input data from pins and run the values through the lookup table to produce RGB

IIRC, it was the HyperCK line that was being delayed, not CS#.

rogloh · 2020-09-24 03:38

Yanomani wrote: »

IIRC, it was the HyperCK line that was being delayed, not CS#.

Yes, evanh uses the capacitor only on the clock line.

cgracey · 2020-09-24 03:45

Yanomani wrote: »

cgracey wrote: »

That seems really risky reading at sysclock/1. I think I've seen posts here where the CS line was being delayed slightly by a small capacitor? That seems to be the only control you might have over data registration. Really tempting, though. I should have made a streamer mode which would input data from pins and run the values through the lookup table to produce RGB

IIRC, it was the HyperCK line that was being delayed, not CS#.

Ah, yes. That makes more sense.

cgracey · 2020-09-24 03:47

Yanomani wrote: »

Since, in the present case, the HyperRAMs will be mounted on a version of the P2 Edge, would the connections between the P2 and the Hypers (CKs, DQs, RWDSs and CS#s) be limited to those two chips, or would they also extend right to the edge connector and beyond?

If we brought them out, they could pick up way too much capacitance. We were not going to connect those signals to the edge fingers.

cgracey · 2020-09-24 03:50

rogloh wrote: »

That seems really risky reading at sysclock/1. I think I've seen posts here where the CS line was being delayed slightly by a small capacitor? That seems to be the only control you might have over data registration. Really tempting, though. I should have made a streamer mode which would input data from pins and run the values through the lookup table to produce RGB

Personally I've not found too many issues with sysclk/1 reads, but I'm someone who is overclocking everything, i.e. both the P2 and the HyperRAM, and am only working at room temps. If you want to meet the (current version 1, 3 Volt) HyperRAM spec, then sysclk/2 is the only way to do it unless the P2 always remains limited to 200MHz and below. That being said, sysclk/1 does actually work if you give it a go. How reliable it is over voltage/temp/board layout/chip etc is the unknown.

You only needed the capacitor for the sysclk/1 writes, the sysclk/1 reads didn't need it. My driver can enable the higher speeds independently for both reads and writes. For reads we switch between registered/unregistered data pins to achieve the timing skew needed for "reliable" data reading. This is done over different frequencies and there is a table (it's probably temp dependent too) to map between frequency and delay + registered/unregistered data pin control for reads. Some (small) frequency ranges are best to avoid if running at sysclk/1.

I see. Sysclk/2 is rock-solid, though, right?

cgracey · 2020-09-24 03:56

If we run at 297MHz using sysclk/2, we could get a steady stream of 8-bit pixels at 148.5 MHz out of two HyperRAMs, each supplying half, right? That would use up maybe 60% of each RAM's bandwidth. It would take two cogs to manage.
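A quick sanity check of that estimate, as a sketch only. It assumes each RAM moves one byte per transfer at sysclk/2, and it borrows the 80-85% driver utilization mentioned earlier in the thread to turn raw bandwidth into usable bandwidth; combining the two figures this way is an assumption, not something measured.

```python
sysclk_mhz = 297.0
raw_per_ram = sysclk_mhz / 2           # MB/s per RAM at sysclk/2
pixel_rate = 148.5                     # MB/s needed for 8-bit pixels at 148.5 MHz
per_ram_demand = pixel_rate / 2        # each RAM supplies half of the pixels

share_of_raw = per_ram_demand / raw_per_ram
print(f"share of raw bandwidth per RAM: {share_of_raw:.0%}")

shares_of_usable = {}
for utilisation in (0.80, 0.85):       # assumed driver efficiency range
    usable = raw_per_ram * utilisation
    shares_of_usable[utilisation] = per_ram_demand / usable
    print(f"at {utilisation:.0%} utilisation: {shares_of_usable[utilisation]:.1%} of usable bandwidth")
```

Half of the raw bandwidth works out to roughly 59-63% of the usable bandwidth under these assumptions, which is in the same range as the "maybe 60%" estimate.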

cgracey · 2020-09-24 03:59

What if we kept the two RWDS and two CS lines common, but with data separate, so that we get a 16-bit path? That would only take one cog to manage and would simplify things.

rogloh · 2020-09-24 04:25

cgracey wrote: »

I see. Sysclk/2 is rock-solid, though, right?

Well I've not developed benchmarks and run in ovens etc so I can't say. Proper qualifications will require proper testing setups. Sysclk/2 operation just allows for more choices over the clock delay you select to read in the data. All we can control from the P2 on the software side is the input clock delay in cycles when we choose to read the data with the streamer and whether the pins are registered or unregistered. The rest comes down to layout and signal integrity stuff, and how much timing varies with PVT.

cgracey wrote: »

If we run at 297MHz using sysclk/2, we could get a steady stream of 8-bit pixels at 148.5 MHz out of two HyperRAMs, each supplying half, right? That would use up maybe 60% of each RAM's bandwidth. It would take two cogs to manage.

Well you could try to parallelize that way, but I think it is easier to run them independently at sysclk/1 and give all bandwidth from one RAM to the video and all the bandwidth from the other to the writer COGs. Splitting that way with sysclk/2 means the video driver would then probably have to alternate RAMs between scan lines and it always requires that extra COG (making at least 3 COGs). I've certainly not written mine to work that way and it can run 1080p 8bpp with just 2 COGs (1 video + 1 HyperRAM).

cgracey wrote: »

What if we kept the two RWDS and two CS lines common, but with data separate, so that we get a 16-bit path? That would only take one cog to manage and would simplify things.

It might be possible, it would need a different driver or at least some more modifications to support it. The masking of bytes in the combined sixteen bit word(s) would be more complicated. I think you'd still need independent RWDS pins. It probably wouldn't be worth that design restriction just to save the two HW pins. I do wonder though, to keep the possibility alive, can't a physical pin output state be replicated using Smartpins to output a duplicate of one of its neighbouring pins? If you did that then you might have the option of a 16 bit path for future driver use with common CS + CLOCK using replication, as well as maintaining the better option of independent 8 bit paths where you have all control pins fully independent on the board (assuming the control pins remain within the +/- 4 pin distance of their neighbour that they need to replicate). It would consume 22 pins.
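For reference, the pin counts behind these options can be tallied as below. This is a sketch with made-up names; the per-chip signal widths (8 DQ, 1 RWDS, 1 CS, 1 CK) come from the thread, RESET# is left out because it is tied to RESn, and the "shared" case assumes CS and CK are physically common while RWDS stays independent.

```python
PER_CHIP_SIGNALS = {"DQ": 8, "RWDS": 1, "CS": 1, "CK": 1}


def pins_needed(chips, shared=()):
    """Count P2 pins for `chips` devices; signals named in `shared` use one set of pins."""
    total = 0
    for signal, width in PER_CHIP_SIGNALS.items():
        copies = 1 if signal in shared else chips
        total += width * copies
    return total


independent = pins_needed(2)                        # two fully independent 8-bit buses
shared_cs_ck = pins_needed(2, shared=("CS", "CK"))  # 16-bit path with common CS and CK
print(independent, shared_cs_ck)
```

Sharing CS and CK physically saves only two pins, which is the trade-off weighed above against keeping both buses fully independent.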

cgracey · 2020-09-24 05:09

Rogloh, good idea about making both CS pins and both RWDS pins drivable via one OUT bit each. I need to review how that works, but I think it's every even/odd pin pair that can do that.

cgracey · 2020-09-24 05:11

It looks to me like the CK can operate in bursts. Would it be important to maintain separate CK pins, too? Does refresh happen without CK pulses?

rogloh · 2020-09-24 05:15

Yes definitely the clock pin should be independent if you want the option of independent data bus activity on your two chips. All three should be independent in my view.

I believe refresh should happen without clock activity. I only drive out the clock during transfers.

whicker · 2020-09-24 05:16

Yes there is an internal oscillator for self refresh. CK is only for the bus interface, and can either stay free running or completely stop.

And as an aside, refresh is completely paused when CS is low. That's where the 4 us max burst length comes from.

cgracey · 2020-09-24 05:22

whicker wrote: »

Yes there is an internal oscillator for self refresh. CK is only for the bus interface, and can either stay free running or completely stop.

And as an aside, refresh is completely paused when CS is low. That's where the 4 us max burst length comes from.

Cool. Thanks, guys.

I think controlling the clock also simplifies writes because you don't have to raise CS at just the right time. You just let a smart pin generate the number of clocks you need, start the streamer, do a WAITXFI, then raise CS.

Yanomani · 2020-09-24 06:22

Please, for each HyperRAM device, always keep DQ[7:0] and RWDS at exactly the same switching voltage levels, preferably fed by the same 3.3V linear voltage regulator, as if the Hypers had a 9-bit-wide bus.

That is the way they are constructed inside the Hypers, as a 9-bit-wide entity; any mismatch between those swing levels and/or slew rates will translate into more peak-to-peak noise feeding into the data receiving circuits (of the Hypers).

CK is also connected to the same structure, but being an input-only entity, its design differs from the other ones. Its swing levels and slew rates only affect timing.

If you want to use (16 data bits + 2 RWDS + 2 CK), it's advisable to feed those 20 signals from the same regulator.

CS#s and RESET#s won't matter; they can be seen as "static" control lines, from that standpoint.

cgracey · 2020-09-24 11:58

Yanomani wrote: »

Please, for each HyperRAM device, always keep DQ[7:0] and RWDS at exactly the same switching voltage levels, preferably fed by the same 3.3V linear voltage regulator, as if the Hypers had a 9-bit-wide bus.

That is the way they are constructed inside the Hypers, as a 9-bit-wide entity; any mismatch between those swing levels and/or slew rates will translate into more peak-to-peak noise feeding into the data receiving circuits (of the Hypers).

CK is also connected to the same structure, but being an input-only entity, its design differs from the other ones. Its swing levels and slew rates only affect timing.

If you want to use (16 data bits + 2 RWDS + 2 CK), it's advisable to feed those 20 signals from the same regulator.

CS#s and RESET#s won't matter; they can be seen as "static" control lines, from that standpoint.

We've got the same regulator pattern on the P2 Edge that we have on the P2 Eval, where each set of 8 pins gets its own regulator.

The DQ pin sets each have their own regulator and the control pins wind up on two other regulators.

Pinout looks like this now:

P57 = CS_B
P56 = CS_A
P48..P55 = DQ_B
P40..P47 = DQ_A
P39 = RWDS_B
P38 = RWDS_A
P37 = CK_B
P36 = CK_A

We could run all of P36..P55 from the same regulator, and leave the non-critical CS pins on the regulator that drives P56..P63.
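To make the regulator grouping easier to see, here is a small sketch that maps the pinout above onto 8-pin regulator groups. It assumes the groups are aligned on multiples of eight (P32..P39, P40..P47, P48..P55, P56..P63), which matches the "P56..P63" group mentioned above; the dictionary and function names are invented for the example.

```python
PINOUT = {
    "CS_B": [57],
    "CS_A": [56],
    "DQ_B": list(range(48, 56)),   # P48..P55
    "DQ_A": list(range(40, 48)),   # P40..P47
    "RWDS_B": [39],
    "RWDS_A": [38],
    "CK_B": [37],
    "CK_A": [36],
}


def regulator_group(pin):
    """Index of the 8-pin group a pin belongs to (assumed aligned to multiples of 8)."""
    return pin // 8


groups = {name: sorted({regulator_group(p) for p in pins}) for name, pins in PINOUT.items()}
total_pins = sum(len(pins) for pins in PINOUT.values())

for name, group in groups.items():
    print(f"{name}: regulator group {group}")
print("total pins:", total_pins)
```

With this layout each DQ bus sits entirely within one group, RWDS and CK share the P32..P39 group, and the CS pins fall in the P56..P63 group.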
