
64 MB PSRAM module using 16 pins? --> 96 MB w/16 pins or 24 MB w/8 pins


Comments

  • @rogloh said:

    @Yanomani said:
    Are all main memory array data accesses totally restricted to happen only within the initially-specified (addressed) row, or do some of the existing drivers make use of any row-boundary-crossing scheme?

    Bursts in my driver get fragmented to not go over the page boundary, while Wuerfel_21's reads are small blocks that are also 2^n aligned and won't cross the page.

    Thanks again, @rogloh!

    Then, there is something that may ease the decision/testing a bit, between /-----__/-----__ (66/33) versus /--____/--____ (33/66), and /------_____/-----_____ (50/50).

    The PSRAMs should "feel" a bit more "comfortable" being fed with the 50/50 clock option throughout the Command/Address phase and any forcefully-following wait-cycle-counting period.

    Please note that the above waveforms must ALWAYS end at "Low", since that's a must-have for the PSRAMs in order to work properly.

    At the end of the initial wait-cycle-counting period (during the "Low" part of its last clock) one can "extend" its trailing "Low" for a bit, during the time needed to reprogram the smart pin in order to produce the next clock waveform.

    The same is valid for the last clock pulse at the end of any main memory array access, either read or write, BUT CE# must rise to "High" as soon as possible (i.e., exactly on point, as if it were a "last clock pulse", not just a regular "chip enable"), because it is responsible for committing any last "write operation", and also because it signals a very definite "Command Termination" to the PSRAM's internal state machine.

  • rogloh Posts: 5,284
    edited 2022-07-05 09:38

    @rogloh said:
    Ok @Wuerfel_21 , here's a very early 8 bit PSRAM driver I've just got running. I've only tested the basic read and write block transfers using my memory test, so there could still be bugs I need to fix in other areas not yet tested, but given the memory test is passing I think hopefully at least the burst transfers work and you'll be able to use it to put some game ROM data into PSRAM.

    @Wuerfel_21 Today I found some issues with writes to some unaligned long addresses with the latest 8 bit PSRAM driver I posted. This is unlikely to affect you much because for burst writes from ROMs to aligned long addresses the code will work (as you are finding) but if you try to write data to address offsets 2,3, or 6,7 etc it will corrupt the transfer. So far the reads seem okay as they are simpler, and I've also fixed the bitwise RMW operation for single byte/words/longs which got broken as well in the port. Am still narrowing down on the exact issue causing the problem and working on fixing it.

    Probably my shortcut idea for the writes was too optimistic and I'll have to bite the bullet and redesign my write paths fully for this bus width architecture which is a PITA because all the logic for it gets complicated when you include single bytes, words, longs, bursts and fills to all address offsets with their different transfer lengths, and fragmentation plays a part here as well. Was hoping not to need to do this...maybe once I figure out what is going on, the fix won't be too hard, we'll see.

  • @rogloh said:

    @Wuerfel_21 Today I found some issues with writes to some unaligned long addresses with the latest 8 bit PSRAM driver I posted. This is unlikely to affect you much because for burst writes from ROMs to aligned long addresses the code will work (as you are finding) but if you try to write data to address offsets 2,3, or 6,7 etc it will corrupt the transfer. So far the reads seem okay as they are simpler, and I've also fixed the bitwise RMW operation for single byte/words/longs which got broken as well in the port. Am still narrowing down on the exact issue causing the problem and working on fixing it.

    Probably my shortcut idea for the writes was too optimistic and I'll have to bite the bullet and redesign my write paths fully for this bus width architecture which is a PITA because all the logic for it gets complicated when you include single bytes, words, longs, bursts and fills to all address offsets with their different transfer lengths, and fragmentation plays a part here as well. Was hoping not to need to do this...maybe once I figure out what is going on, the fix won't be too hard, we'll see.

    A question: when doing RMW, are you still keeping with linear bursts, even if only single bytes (or just a few, provided that they fit within a 32-byte boundary) are to be modified?

    I'm asking this because there is also the Wrap Boundary Toggle command ('hC0), and I believe its usefulness could show up in such cases of short-range overwrites, though I'm not totally sure yet.

    The PSRAM starts in Linear Burst mode (after power-on Reset, at least; but perhaps it also seems a good idea to do the "soft" reset version anyway, before starting to use the chips, just in case...).

    The Wrap Boundary Toggle is kind of a T-flip-flop; one needs to keep track of its previous state in order to control it as needed, but since the software driver is "in charge" all the time, it seems easy to be sure of the current state at any time, or to reset it to the default state (Linear Burst after Reset, whether due to power-on or the soft-controlled way).

    One thing is certain: given its importance in the way it affects the execution/interpretation of any Read or Write after it's activated, better make sure that command is exercised under strict 50/50% duty-cycle CLK operation, so as to avoid the whole chip's behavior going "off track" from the software's control standpoint.
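
    Just to illustrate the bookkeeping, a hypothetical sketch (psram_send_cmd and wrap_mode are made-up names, not from any actual driver): clock out the $C0 command and flip a cog flag at the same time, so the driver always knows which burst mode the chip is currently in.

                    callpa  #$C0, #psram_send_cmd               ' clock out the 8-bit Wrap Boundary Toggle command
                    xor     wrap_mode, #1                       ' track the T-flip-flop: 0 = linear burst, 1 = 32-byte wrap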

  • rogloh Posts: 5,284
    edited 2022-07-05 12:35

    Yes I do linear bursts at all times; in fact I've just changed my code to fix a potential bug where it might have accidentally crossed a boundary if I read a long at the last word boundary prior to the page boundary. I don't ever change the state of the device after I reset it. Not enough time to be doing that per request.
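
    The boundary handling itself is only a couple of instructions. As a rough sketch (register names and the 1kB page size are assumptions here, not the actual driver code), the burst count just gets clamped to whatever is left in the current page:

                    mov     remain, addr                        ' PSRAM byte address of this transfer
                    and     remain, ##$3FF                      ' offset within the (assumed) 1kB page
                    subr    remain, ##$400                      ' remain = bytes left before the page boundary
                    fle     count, remain                       ' clamp this fragment so it never crosses the page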

    I thankfully found I can fix the write alignment issues I was having with a simple andn instruction to set up the word address, aligned on long boundaries. This means I now expect I can keep my working/well-tested write code intact and don't have to rewrite a large complex portion. :smiley: So far it's working nicely, still testing...
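
    In other words, something along these lines (a one-line sketch with a made-up register name, not the exact driver code), forcing the starting word address down onto a long boundary before the transfer is set up:

                    andn    word_addr, #1                       ' clear bit 0 of the word address -> long-aligned start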

    Here's the current and hopefully fixed driver code for 8 bit PSRAM.

  • Rayman Posts: 14,162

    Soldered on the 8x PSRAM chips. Board in the middle might need some touch up though...

  • Excited to see how these perform. Needs some code first ig.

  • evanh Posts: 15,423
    edited 2022-07-11 04:12

    Here's the delay test report of the newly arrived 96 MB add-on that Rayman sent me. It's getting shaky even to select a common delay above 250 MHz. :(

  • evanh Posts: 15,423

    Was there an option for choosing sysclock/4?

  • rogloh Posts: 5,284
    edited 2022-07-11 07:21

    @evanh said:
    Was there an option for choosing sysclock/4?

    Not in this test program. I need to work on a variant of the driver that can take a flag to underclock to sysclk/4 (a bit like I did for HyperRAM to overclock from sysclk/2 to sysclk/1). Getting the timing right for that shouldn't be too hard, I hope. With underclocking, this board's timing should still work with P2s at high frequencies.
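
    The clock side of the underclocking itself is cheap. A minimal sketch, assuming the PSRAM clock pin runs in smart pin transition-output mode (P_TRANSITION and P_OE as in the Spin2 built-in smart pin constants), where WXPIN sets the number of sysclocks per output transition:

                    wrpin   ##(P_TRANSITION | P_OE), #PSRAM_CLK_PIN
                    wxpin   #1, #PSRAM_CLK_PIN                  ' toggle every sysclock -> PSRAM clock = sysclk/2
                  ' wxpin   #2, #PSRAM_CLK_PIN                  ' toggle every 2 sysclocks -> PSRAM clock = sysclk/4
                    dirh    #PSRAM_CLK_PIN                      ' enable the smart pin

    The real work is in re-aligning the streamer NCO and read delays to the slower clock, which is where the extra setup instructions come in.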

    In the results you posted @evanh, it looks like your first two chips are not performing as well as the rest (1st one is worst). The 3rd and 5th tested pair of chips seem to behave the best. But running in 8 bit mode means that both chips have to behave well to pass the test, reducing the chances of success.

    I think these results for a memory board of this size are reasonably okay, and it still makes a useful expansion board for large memory. It's just that NeoYume needs to run things too fast :smile:

    My boards also arrived today so I'll try to test them out sometime later tonight.

  • @evanh said:
    It's getting shaky even to select a common delay above 250 MHz. :(

    That's true if you have a common delay for all, but in my driver I can set up different delays per bank, which would be okay for this setup. In Wuerfel_21's code, however, I think there is probably a common delay, which would make it difficult, I agree. In that case sysclk/4 operation is the way to go.
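
    Conceptually it's just an indexed lookup per request. A tiny sketch (made-up names, not the actual driver code), assuming the per-bank delay values sit in a run of consecutive cog registers starting at bank_delay0:

                    alts    bank, #bank_delay0                  ' point the next instruction's source at this bank's entry
                    mov     delay, 0-0                          ' ALTS patches the source, so this reads bank_delay0+bank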

  • @evanh said:
    Here's the delay test report of the newly arrived 96 MB add-on that Rayman sent me. It's getting shaky even to select a common delay above 250 MHz. :(

    If you live where I believe you do, perhaps the local weather (relative humidity ~92%), total time in transit, and storage conditions during shipping could be playing some havoc with your particular PCB.

    The kind of flux used during soldering (and the consequent post-assembly cleaning process), plus any specific characteristics of the PCB stackup/raw materials, could also lead to a tendency toward excessive moisture absorption (hygroscopicity).

    Electronic equipment has always suffered from leakage effects; the consequences depend on a lot of factors, but the higher the frequencies of operation, the worse the outcomes.

    Please take a look at the "Static PCB Effects"-section (12.15) of the following document, from Analog Devices:

    https://analog.com/media/en/training-seminars/design-handbooks/Basic-Linear-Design/Chapter12.pdf

    After proper cleaning, I also suggest doing a little "baking" of it, though it (normally) doesn't need to reach 100°C, as described in IPC 1601; something between 40 and 60°C for ~6 to 8 hours should be enough to show results.

  • evanh Posts: 15,423

    That is titled "Linear Design". They're mostly concerned with common mode offsets in the "nanoAmp"s. Maybe a tiny amount can be eked out with slightly less attenuation. I doubt we'll get much.

    Reminds me of once getting assistance from an RF guy over the phone where he instructed me to simply clean the outer glass surface of a giant valve. The damn thing was the size of my body! Its interconnecting wires were thin sheets of flexible copper. The floor standing double-cabinet was a transformer + rectifier on one side and oscillator on the other. I vaguely remember a couple of needle meters for power and volts. Can't remember how the power was adjusted.

    It was an RF plastic welder. The weirdest part was it had no shielding around the welding form. Gave me the willies to be in the room with it.

  • In fact, I was looking for any documented advice about corrective/preventive maintenance on some Sega GPU-based arcade systems I used to work with between 1991 and 2002, but since I found just a few links, I chose to look for equivalent information, and the best source I was able to locate was the one I linked to in the previous post.

    The following one, and its ($$$) successors, were the best, and most intricate, at the time, and so were the problems they could experience during their "careers". Also lots to learn, though... :smile:

    https://segaretro.org/Sega_System_32

  • Anyways, slightly delayed, here is the reward for the working 96MB board: The promised JPEG of a cookie.

    Since the board only works at half speed, I couldn't stop my piggie from eating half of the cookie, but oh well.

  • Actually, here's an alt version with slightly better framing, so I guess you get your money's worth after all

  • Rayman Posts: 14,162

    Thanks! I'm hungry now...

  • Well, got one thing in common with a piggy... Well I assume you aren't round or have green fur.

  • rogloh Posts: 5,284
    edited 2022-07-12 05:34

    Got the results from testing the 96MB board that I received from @Rayman yesterday (attached). I ran with all six banks at base data pin 32 of the P2-EVAL.

    Some banks definitely perform better than others in terms of their input timing sensitivity/band overlap. One reached 305MHz before seeing errors on all delays, others could get to around 330MHz, best was about 335MHz. So not enough for NeoYume at sysclk/2 transfer speeds (as we basically know from @Wuerfel_21 's testing).

    The problem here is that if you combine a fast chip with a slower one in an 8 bit group you will potentially reduce the band overlap and the overall frequency you can reach. You'd sort of need to sort the memory devices into groups by binning, which is not really practical unless you only make a small number of boards. Ideally the two devices in a pair should perform the same. We can set different read delay timing per bank (in my driver), but we can't really break it down to per nibble-wide device within the bank.

    I'll take a look at the other 24MB (nibble oriented) board. I also want to look into adding sysclk/4 operation in my driver as a feature. This can let you at least use the board at high P2 frequencies, at the expense of half the bandwidth/double the latency.

  • rogloh Posts: 5,284
    edited 2022-07-12 04:54

    Getting much better results on the 24MB board. :smile:

    Clean runs all the way up to 350MHz at sysclk/2 operation for all 3 devices. Tested with P2-EVAL with pin 48 as the base pin.

    I think this is a rather handy board and will be useful for the 8086 emulator project and some video stuff too. For small transfers nibble mode is not so bad.

    Update:
    As an example you can see the following read code to read a single byte in nibble mode.

    To get a byte into memory the fastest single read with the code below is probably going to require this:

    • CALLPA addr, #psram_read8 ' 4 clocks
    • the psram_read8 function ' takes 15 instructions (32 clocks with ret) plus the address+wait+data transfer time of 40 clock transitions (minus overlap of 6 clocks perhaps) so ~ 66 clocks
    • rdbyte data, hub_scratch ' 9-16 clocks

    The total count from this is ~79-86 P2 clocks to read a byte into a register, given some address in PA that can be lost afterwards and a known area in HUB to read into. In 8 bit mode, at best it will only be 4 clocks less than this, which is not a significant boost at all; plus you also need to handle reading right at the crossing of a page boundary, and a check for that case will cancel this gain out anyway. 8 bit wide PSRAM only helps to reduce the read time if larger blocks are transferred; however, 8 bit mode slows down all byte writes, and unaligned word and long writes, due to requiring extra read-modify-write cycles, which becomes very significant for small transfers.

    psram_read8
                    wrfast  bit31, hub_scratch  ' pa = PSRAM read address, hub_scratch is where in hub to read PSRAM data into
                    setbyte pa, #$EB, #3                        ' put the fast-read command ($EB) in the top byte above the 24-bit address
                    movbyts pa, #%%0123                         ' reverse byte order so the command byte streams out first
    
                    drvl    #PSRAM_CE_PIN                       ' assert chip enable (active low)
                    drvl    #PSRAM_DATA_PINS                    ' P2 drives the bus for the command/address phase
                    xinit   ximm8, pa
                    wypin   #(8+PSRAM_WAIT+2)*2, #PSRAM_CLK_PIN ' enough clocks for address phase, delay and 1 byte transfer
                    setq    nco_fast
                    xcont   #PSRAM_WAIT*2+PSRAM_DELAY,#0        ' send address
                    waitxmt
                    fltl    #PSRAM_DATA_PINS                    ' float the bus so the PSRAM can drive the read data
                    setq    nco_slow
                    xcont   xread2, #0                          ' read data
                    waitxfi                                     ' wait until streamer is done
            _ret_   drvh    #PSRAM_CE_PIN                       ' raise CE to terminate the command and return
    
    
  • evanh Posts: 15,423

    @rogloh said:
    Got the results from testing the 96MB board that I received from @Rayman yesterday (attached). I ran with all six banks at base data pin 32 of the P2-EVAL.

    Matches my run very well. Same banks have same behaviours.

    The problem here is that if you combine a fast chip with a slower one in an 8 bit group ...

    It'll be the board layout rather than the ICs.

  • rogloh Posts: 5,284
    edited 2022-07-12 09:22

    @evanh said:
    It'll be the board layout rather than the ICs.

    Yes I'd say you are probably right there... I don't see so much variation on my own board or the P2-Edge.

    Also, in that sample code above that I posted, which was derived from Wuerfel_21's low latency code, I'm getting confused about the number of clocks being generated when reading data from PSRAM. I think it is generating too many clocks, and clocks could still be occurring outside of the CE low time, which mightn't be good, especially if it runs over a page boundary in the meantime before CE rises... that might mess up refresh or something.

    My own driver's code to read a long generates a total of 32 clock transitions for 16 bit PSRAM. This code seems to generate even more clocks to read a long depending on the PSRAM_WAIT delay value. PSRAM_WAIT was set to 10, PSRAM_DELAY was set to 4 in Ada's code, so for a long transfer of 4 bytes with 16 bit PSRAM that makes it (8+PSRAM_WAIT) * 2 + 4 = 40 clock transitions vs just 32 in my code. What's going on? Why so many clocks?

    UPDATE: I just tested macca's x86 emulator with my PSRAM code and reduced the clock transitions right down to 30 to read a single byte, and it still worked and let me latch in 2 nibbles in 4 bit mode and execute the x86 startup correctly. Dropping it further to 28 clock transitions was too much for it though, and it would not start. So perhaps both Ada and myself are outputting too many clock transitions in our code, me by 2 and @Wuerfel_21 by 14. From this test I have a feeling it should be set to 26 + 2 * the number of bus transfers needed for the data requested (i.e. dependent on bus width and size requested). Will test further...

  • I just landed on PSRAM_WAIT=10 by trial and error. So you're saying it should be 5?

  • rogloh Posts: 5,284
    edited 2022-07-12 13:32

    @Wuerfel_21 said:
    I just landed on PSRAM_WAIT=10 by trial and error. So you're saying it should be 5?

    Yeah, you are sending too many clocks I think. Back it off down to 26 transitions plus 2 * the number of streamer transfers for data.

    So if you were reading say 4 aligned longs in a burst using a 16 bit PSRAM data bus (8 streamer transfers), you would send 26+8*2 = 42 clock transitions for example. It would be different for 8 and 4 bits but use the same approach.

    Give it a go, and I think if you then back it off by another two clock transitions further than this step (i.e. by one full clock) then it will fail, and then you'll know you reached the minimum total clocks required for the PSRAM. It's just the clock transitions that need changing, I think you probably have the correct streamer count with PSRAM_WAIT*2 + PSRAM_DELAY as the P2 clocks (with NCO fast).
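
    Expressed as code, the rule of thumb is just this (a sketch only, with hypothetical register names; the 26 is the 16 command+address transitions plus the 10 wait transitions discussed above):

                    mov     clocks, xfers                       ' number of streamer data transfers in the burst
                    shl     clocks, #1                          ' 2 clock transitions per data transfer
                    add     clocks, #26                         ' plus 16 for cmd+address and 10 for the wait cycles
                    wypin   clocks, #PSRAM_CLK_PIN              ' e.g. 8 transfers -> 26 + 16 = 42 transitions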

    Update: Also, I found out why I hadn't earlier used that shorter scheme, the one without the split/rev/mergeb stuff, which I just re-discovered. With the LUT immediate streaming needed in 8 and 16 bit modes, you can't reverse the nibbles in the bytes without using that approach, as the streamer command does not include the "a" bit and can't do it for you. This explains why this optimization only works in 4 bit mode, but it's still handy to reduce the latency further, and it saves COG/LUTRAM space too.

  • PSRAM_WAIT is supposed to be the count of additional clocks between address and data phases. Address always takes 8 nibbles / 16 transitions, so PSRAM_WAIT=5 would end us up with 26 pre-data transitions. But that'd mean adding 10 delay cycles, which seems a bit silly.

    Though indeed, WAIT=5; DELAY=18 works for sysclk/4.

  • rogloh Posts: 5,284
    edited 2022-07-12 14:07

    The first 8 clocks send the 8 cmd+address nibbles.
    The next 5 clocks are the latency / bus turnaround.
    The next clock starts the first data transfer and continues until the clock stops / CE is raised.

    10 delay cycles if measured in P2 clocks is not silly. This accounts for the time to get the clk and data out of the P2 pipeline, to the memory, and then sent from the RAM on the board to get back to the P2, plus the time to get the data into the P2 pipeline. These delays (in clock cycles) are significant at high frequencies. But remember 10 P2 clocks is only 30ns at 333MHz, it's pretty fast really.

  • Wuerfel_21 Posts: 4,687
    edited 2022-07-12 14:03

    Yes, that's what i was getting at. DELAY=18 just seems a bit silly.

  • rogloh Posts: 5,284
    edited 2022-07-16 07:49

    Hey @Wuerfel_21, I'm interested to know: did you ever try out the sysclk/3 read rate idea in your code?

    Right now I'm looking at adding support for sysclk/4 reads and writes in my driver. I think it's probably doable for the 4 bit driver and hopefully might also still fit in the 8 bit PSRAM driver, but the 16 bit driver is a worry as there is very little space left for the extra instructions (just 2 LUTRAM longs, which I think I wanted for pik33's locked video read list feature IIRC). So that driver might end up having different capabilities, which isn't ideal.

    With that 96MB board that Rayman designed, a sysclk/4 driver could be useful in 8 bit mode to help alleviate the extra load by running the transfers at half speed. It does add a couple of extra overhead instructions to setup and phase align the timing correctly.

  • evanh Posts: 15,423

    Roger,
    How much Cog code is used for decoding requests? Sometimes, some bulkier parameters for simpler code helps. Eliminating shifts and adds and using getnib/getbyte/getword in place of smaller parameters.
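
    For example, a single GETBYTE can replace a shift-and-mask pair when a field is kept byte-aligned in the request long (the field layout below is made up, just to show the idea):

                    ' instead of:  mov bank, request  /  shr bank, #16  /  and bank, #$FF
                    getbyte bank, request, #2                   ' bank = request[23:16] in one instruction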

  • rogloh Posts: 5,284
    edited 2022-07-16 08:58

    @evanh said:
    Roger,
    How much Cog code is used for decoding requests? Sometimes, some bulkier parameters for simpler code helps. Eliminating shifts and adds and using getnib/getbyte/getword in place of smaller parameters.

    Yeah, I do already keep my data structure parameters aligned on byte/nibble boundaries and leverage those optimizations wherever I find they are possible. There is a big jump table for execf, which could in time be changed to try to free space; however, it would increase latency by a bunch of clocks, which is undesirable, and nothing comes for free. Plus this is in the common code for all drivers, and I don't want to change everything right now or have a different way to process/maintain request handling across different drivers.

    There might still be other driver specific optimizations I can look for before going down that path. The RMW code and logic to decide how to handle writes for all the different sized bursts and wrap cases etc, takes up quite a bit of space in the 16 and 8 bit drivers. Maybe there is some scope there.

  • pik33 Posts: 2,358
    edited 2022-07-16 10:18

    A crazy idea. These PSRAMs also have a 1-bit mode. 12 chips, 2 banks (8+4) = 2 CS, CLK, 96 MB. Or P2-EC: 128 MB, 16 chips, 16 bits.


    Edit: that was already stupid, as 1-bit mode needs 2 pins...
