64 MB PSRAM module using 16 pins? --> 96 MB w/16 pins or 24 MB w/8 pins

Wuerfel_21 · 2022-06-07 21:19

If it's 2x 16MB, isn't that just 32MB?

And it's only able to run NeoYume if you send me and roger a freebie :P (or bang your head against the code for many hours yourself)

Rayman · 2022-06-07 21:59

Oops, yes 32 MB, just fixed that.

This way costs a lot less than a bunch of the QPI chips too...
(Actually, this price comparison is suspect too).

I'd probably be willing to send you both samples...

Wuerfel_21 · 2022-06-07 22:20

Though 64MB would come in handy. Metal Slug 2 theoretically runs on NeoYume, but there's not enough space to load the graphics, since they alone take up 32MB.

Rayman · 2022-06-08 00:34

Are you saying you could run more games if there were more chips?

How much do you need?

Could add up to four more chips if ditched the uSD.
Maybe more if used a demultiplexer...

Rayman · 2022-06-08 00:45

Here's with Parallax style bottom loading headers.
I don't know how Parallax was able to use the holes for routing. In Eagle, the holes don't look to be plated, and don't know how to change that...
Guess I wait and see how they show up. If plated, guess can just run traces over top of the holes...

rogloh · 2022-06-08 01:35

Yeah this one is a hybrid between PSRAM and HyperRAM, though the pinout makes it look a lot more like HyperRAM. It might be possible to adapt the current HyperRAM driver to use it by altering the latency and the addressing/command format as applicable. Not promising anything for you though... getting the clocking working right on these RAMs can be a real PITA.

evanh · 2022-06-08 09:50

Having a single CLK to all chips will help. That is a problem on the Eval Hyper add-on board, which has a separate clock for each chip. That made the clocks faster than the data pins, which hurt setup times.

NOTE: Make sure CLK signal path is longer than all data paths. Ideally make data paths similar lengths to each other.

BTW: Rayman, I don't envy you soldering those MBGA chips.

Rayman · 2022-06-08 12:21

@evanh Maybe should add an optional RC filter on the clock to slow it down?

BTW: Assembly of this kind of thing is super easy with a stencil and manual pick and place. Rework with an air gun also usually works, when needed.

evanh · 2022-06-08 12:55

@Rayman said:
Maybe should add an optional RC filter on the clock to slow it down?

Maybe an unfitted cap. That'd be heaps.

My plan is to rely on registering the data pins to provide a reliable early setup time (for command/data output) ahead of unregistered clock. Should give upward of 1.0 nanosecond. It seemed a good idea with the Eval Hyper add-on except it was foiled by the aforementioned negative effects of that particular layout ... so I've not yet proven it.

evanh · 2022-06-08 14:53

Kicad 6 symbol and footprint attached

Rayman · 2022-06-08 17:58

Here's with 4 chips for 64 MB total. Only two extra pins, so change from uSD to USB with them...

evanh · 2022-06-08 23:44

Now I'm interested in testing three board sizes with 8 MB (single die), 16 MB (dual die), 32 MB and 64 MB to compare timings.

Yanomani · 2022-06-09 01:34

It's a pity all these DQS/DMs and CLKs are connected in parallel (sure, within their respective groups).

This precludes copying between devices without involving the P2 internal bus, though writing from the P2 to multiple devices, at the same address range and with the same data, is allowed, and can also be stopped on a per-device basis.

evanh · 2022-06-09 10:51

DQS, CLK and data pins all should be equally loaded. I do not recommend separating them between chips.

Rayman · 2022-06-09 11:30

I suppose the 32 MB version with two chips could be rewired with separate clocks to allow direct copying between chips. Use case seems limited though….

Guess good for copying a background image to the screen buffer.

Rayman · 2022-06-09 17:03

Here's the 32 MB version with separate clocks, CE, and DM:

Wuerfel_21 · 2022-06-11 14:13

Before you send off another batch of gerbers: Wanna make a 4bit SD breakout?

Because actually loading the larger RAM modules with data through SPI (and the handicapped SPI on the parallax boards at that) will take... a while.

Rayman · 2022-06-11 20:39

Ok, here's a little board for P2 Eval that adds audio, USB, and uSD.
This should complete the required hardware to run game emulation on LCD: Memory board, LCD, and this.

Wuerfel_21 · 2022-06-11 21:26

That's not 4bit mode compliant though. But it doesn't have the silly resistor, so it can probably still go quite a bit faster than the system SD slot

Rayman · 2022-06-12 01:13

Do we have code for 4-bit uSD mode?

evanh · 2022-06-12 02:14

Nope. That's a streamer job - Needs quite some testing with real hardware. It combines the timing twists of bit-bashed latencies with smartpin wrangling. Same as PSRAM/HyperRAM (No, worse because needs to support slower than sysclock/2) ... but also has all the SD protocol baggage, including needing to efficiently add CRC codes and a lot more mode handling that currently isn't used for SPI mode.

evanh · 2022-06-12 02:53

I'm a little worried the CRC needs will kill performance. It probably should be tested in 1-bit SD mode first. If the streamer, in 1-bit mode, can outrun the cog then not much point doing the 4-bit mode.

Wuerfel_21 · 2022-06-12 08:47

The CRC is only needed for writing/commands, you can still just ignore the CRC on incoming data. CRCNIB is certainly fast enough to do it on the fly.

evanh · 2022-06-12 10:46

@Wuerfel_21 said:
The CRC is only needed for writing/commands, you can still just ignore the CRC on incoming data. CRCNIB is certainly fast enough to do it on the fly.

Yeah, you might be right ... the REP loop below will achieve a 512 byte block in 10us at 300 MHz sysclock, which translates to as much as 50 MB/s.

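        rdfast  #0, buf                 // start FIFO reads from the block buffer
        mov     crc, #0                 // CRC starts at zero
        mov     ptrb, #0                // lutRAM write index
        mov     polynomial, ##$8408     // from the left: x0 + x5 + x12

        rep     @.rend, len             // len = number of longs in the block
        rflong  val                     // next 32 bits of data from the FIFO
        wrlut   val, ptrb++             // copy the long into lutRAM
        setq    val                     // Q supplies the nibbles for CRCNIB
        crcnib  crc, polynomial         // 8 nibbles = one long
        crcnib  crc, polynomial
        crcnib  crc, polynomial
        crcnib  crc, polynomial
        crcnib  crc, polynomial
        crcnib  crc, polynomial
        crcnib  crc, polynomial
        crcnib  crc, polynomial
.rend
        rev     crc                     // bit-reverse the 32-bit result
        shr     crc, #16                // bring the CRC down into the low 16 bits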

PS: Only just learnt what the frig the polynomial equations mean in practice - https://forums.parallax.com/discussion/comment/1498124/#Comment_1498124
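
For anyone wanting to cross-check a PASM2 CRC routine on a PC, here is a minimal host-side sketch in C of the CRC16 that SD uses on its data blocks: polynomial x^16 + x^12 + x^5 + 1 ($1021, or $8408 bit-reversed), initial value zero, MSB first. The name `crc16_ccitt` is made up for illustration, and it treats the block as a single bit stream, i.e. the 1-bit bus case. Whether the REP loop above produces the same value for the same buffer has not been verified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16: polynomial 0x1021, initial value 0, MSB first,
   no final XOR. Hypothetical helper name, for host-side checking only. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        /* Bring the next byte into the top of the CRC register. */
        crc ^= (uint16_t)((uint16_t)data[i] << 8);

        /* Shift out one bit at a time, applying the polynomial
           whenever a 1 falls off the top. */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

Feeding it the nine ASCII characters "123456789" should give $31C3, the usual check value for this CRC variant.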

EDIT: Err, damn, I assumed it was possible for the streamer to block read lutRAM without using the FIFO. Turns out that requires the cog to feed the lutRAM addresses. So the FIFO is needed for streamer block ops ... will have to tack the CRC on afterwards then ...

Wuerfel_21 · 2022-06-12 14:10

Not sure if using the streamer is terribly useful over just bitbanging (remember that you can ROLNIB from INx) or using multiple smartpins and MERGEB-ing the results together.

evanh · 2022-06-12 22:09

Concurrently was the idea. Run CRC processing while block transmits - at up to sysclock/2 (which takes 2048 sysclock ticks per 512 byte block). The REP loop above takes a little under 3000 ticks.
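
As a rough check on those numbers, the tick counts can be worked out in a few lines of C. This sketch assumes every instruction in the loop takes 2 sysclock ticks and ignores setup and any FIFO stalls; the function names are illustrative only.

```c
#include <stdint.h>

enum {
    BLOCK_BYTES       = 512,
    TICKS_PER_INSTR   = 2,   /* assumption: 2 sysclock ticks per instruction */
    INSTR_PER_LONG    = 11   /* rflong + wrlut + setq + 8 x crcnib */
};

/* Sysclock ticks for the CRC REP loop over one 512 byte block. */
static uint32_t crc_loop_ticks(void)
{
    uint32_t ticks_per_long = INSTR_PER_LONG * TICKS_PER_INSTR;  /* 22 */
    return ticks_per_long * (BLOCK_BYTES / 4);                   /* 128 longs */
}

/* Sysclock ticks to clock one block over the bus:
   data_lines = 1 or 4, clk_div = sysclock ticks per SD clock. */
static uint32_t transfer_ticks(uint32_t data_lines, uint32_t clk_div)
{
    uint32_t sd_clocks = (BLOCK_BYTES * 8u) / data_lines;
    return sd_clocks * clk_div;
}

/* Convert a tick count to microseconds at a given sysclock. */
static double ticks_to_us(uint32_t ticks, double sysclk_hz)
{
    return (double)ticks * 1.0e6 / sysclk_hz;
}
```

That gives 22 ticks per long and 2816 ticks per 512 byte block (about 9.4 us at 300 MHz), against 2048 ticks for a 4-bit transfer at sysclock/2.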

Wuerfel_21 · 2022-06-12 22:12

Why sysclk/2? The max bus speed is 50 MHz. IDK why you'd need the max possible SD speed down at 100 MHz.
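
To put numbers on that, here is a small C sketch that picks the smallest clock divider keeping the SD clock at or below a given limit. The 50 MHz limit is the figure quoted above; `sd_clock_divider` and its parameters are hypothetical, not part of any existing driver.

```c
#include <stdint.h>

/* Smallest divider >= min_div such that sysclk_hz / divider <= limit_hz.
   Hypothetical helper; min_div lets the caller rule out ratios the
   hardware cannot do reliably. limit_hz must be nonzero. */
static uint32_t sd_clock_divider(uint32_t sysclk_hz, uint32_t limit_hz,
                                 uint32_t min_div)
{
    /* Ceiling division: round up so the result never exceeds the limit. */
    uint32_t div = (sysclk_hz + limit_hz - 1u) / limit_hz;

    if (div < min_div)
        div = min_div;
    return div;
}
```

With a 50 MHz limit this returns 2 at 100 MHz sysclock and 6 at 300 MHz, so sysclock/2 only lands on the limit when the P2 is running at 100 MHz.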

evanh · 2022-06-12 22:20

Because it can. That's already what I'm doing in Flexspin's driver. Well, not /2, smartpins have a problem at /2, I used /4 to /8 there.

EDIT: So do make sure the sysclock frequency is set up before initialising the SD driver. It is sensitive to sysclock staying put for the duration of use.

PS: I use mount() and umount() pairs in the SD speed testing code I wrote to handle adjusting sysclock. eg:

            umount( "/sd" );
            _waitms( 200 );
            _clkset( clkmod, clkfrq );
            _setbaud( 230400 );
            printf( "\n\n   clkfreq = %d   clkmode = 0x%x\n", _clockfreq(), _clockmode() );
            mount( "/sd", _vfs_open_sdcard() );

evanh · 2022-06-13 07:45

A better solution would be patching of _clkset() to have it adjust the smartpins. Wouldn't need the umount/mount duo at all then. There isn't much to update: The clock ratio of the SPI clock smartpin, and the tx smartpin's clock phase config to compensate for the lag effect from I/O staging.

Rayman · 2022-06-17 23:43

Got one of the 24 MB modules soldered up.
Guess I'll see if MegaYume can work with it soon.

It just clears the ground posts. Should have paid more attention to that. Got lucky...
