64 MB PSRAM module using 16 pins? --> 96 MB w/16 pins or 24 MB w/8 pins - Page 2 — Parallax Forums



Comments

  • If it's 2x 16MB, isn't that just 32MB?

    And it's only able to run NeoYume if you send me and roger a freebie :P (or bang your head against the code for many hours yourself)

  • Rayman Posts: 14,762
    edited 2022-06-08 00:32

    Oops, yes 32 MB, just fixed that.

    This way costs a lot less than a bunch of the QPI chips too...
    (Actually, this price comparison is suspect too).

    I'd probably be willing to send you both samples...

  • Though 64MB would come in handy. Metal Slug 2 theoretically runs on NeoYume, but there's not enough space to load the graphics, since they alone take up 32MB.

  • Rayman Posts: 14,762

    Are you saying you could run more games if there were more chips?

    How much do you need?

    Could add up to four more chips if ditched the uSD.
    Maybe more if used a demultiplexer...

  • Rayman Posts: 14,762

    Here's with Parallax style bottom loading headers.
    I don't know how Parallax was able to use the holes for routing. In Eagle, the holes don't look to be plated, and don't know how to change that...
    Guess I wait and see how they show up. If plated, guess can just run traces over top of the holes...

  • Yeah, this one is a hybrid between PSRAM and HyperRAM, though the pinout makes it look a lot more like HyperRAM. It might be possible to adapt the current HyperRAM driver to use it by altering the latency and the addressing/command format as applicable. Not promising anything for you though... getting the clocking working right on these RAMs can be a real PITA.

  • evanh Posts: 16,039
    edited 2022-06-08 10:19

    Having a single CLK to all chips will help. That is a problem on the Eval Hyper add-on board where it has separate clocks for each chip - Which made the clocks faster than the data pins, which hurt setup times.

    NOTE: Make sure CLK signal path is longer than all data paths. Ideally make data paths similar lengths to each other.

    BTW: Rayman, I don't envy you soldering those MBGA chips.

  • Rayman Posts: 14,762

    @evanh Maybe should add option RC filter on clock to slow it down?

    BTW: Assembly of this kind of thing is super easy with a stencil and manual pick and place. Rework with an air gun also usually works, when needed.

  • evanh Posts: 16,039
    edited 2022-06-08 12:55

    @Rayman said:
    Maybe should add option RC filter on clock to slow it down?

    Maybe an unfitted cap. That'd be heaps.

    My plan is to rely on registering the data pins to provide a reliable early setup time (for command/data output) ahead of unregistered clock. Should give upward of 1.0 nanosecond. It seemed a good idea with the Eval Hyper add-on except it was foiled by the aforementioned negative effects of that particular layout ... so I've not yet proven it.

  • evanh Posts: 16,039

    Kicad 6 symbol and footprint attached

  • Rayman Posts: 14,762

    Here's with 4 chips for 64 MB total. Only two extra pins, so change from uSD to USB with them...

  • evanh Posts: 16,039
    edited 2022-06-08 23:44

    Now I'm interested in testing three board sizes with 8MB (single die), 16MB (dual die), 32 MB and 64 MB to compare timings.

  • It's a pity all these DQS/DMs and CLKs are connected in parallel (sure, within their respective groups).

    This precludes copying between devices without involving the P2 internal bus, though writing from the P2 to multiple devices, at the same address range and with the same data, is allowed, and can also be stopped on a per-device basis.

  • evanh Posts: 16,039

    DQS, CLK and data pins all should be equally loaded. I do not recommend separating them between chips.

  • Rayman Posts: 14,762
    edited 2022-06-09 11:31

    I suppose the 32 MB version with two chips could be rewired with separate clocks to allow direct copying between chips. Use case seems limited though….

    Guess it's good for copying a background image to the screen buffer.

  • Rayman Posts: 14,762

    Here's the 32 MB version with separate clocks, CE, and DM:

  • Before you send off another batch of gerbers: Wanna make a 4bit SD breakout?

    Because actually loading the larger RAM modules with data through SPI (and the handicapped SPI on the parallax boards at that) will take... a while.

  • Rayman Posts: 14,762

    Ok, here's a little board for P2 Eval that adds audio, USB, and uSD.
    This should complete the required hardware to run game emulation on LCD: Memory board, LCD, and this.

  • That's not 4bit mode compliant though. But it doesn't have the silly resistor, so it can probably still go quite a bit faster than the system SD slot

  • Rayman Posts: 14,762

    Do we have code for 4-bit uSD mode?

  • evanh Posts: 16,039
    edited 2022-06-12 02:26

    Nope. That's a streamer job - Needs quite some testing with real hardware. It combines the timing twists of bit-bashed latencies with smartpin wrangling. Same as PSRAM/HyperRAM (No, worse because needs to support slower than sysclock/2) ... but also has all the SD protocol baggage, including needing to efficiently add CRC codes and a lot more mode handling that currently isn't used for SPI mode.

  • evanh Posts: 16,039

    I'm a little worried the CRC needs will kill performance. It probably should be tested in 1-bit SD mode first. If the streamer, in 1-bit mode, can outrun the cog then not much point doing the 4-bit mode.

  • The CRC is only needed for writing/commands, you can still just ignore the CRC on incoming data. CRCNIB is certainly fast enough to do it on the fly.

  • evanh Posts: 16,039
    edited 2022-06-12 12:30

    @Wuerfel_21 said:
    The CRC is only needed for writing/commands, you can still just ignore the CRC on incoming data. CRCNIB is certainly fast enough to do it on the fly.

    Yeah, you might be right ... the REP loop below will achieve a 512 byte block in 10us at 300 MHz sysclock, which translates to as much as 50 MB/s.

            rdfast  #0, buf
            mov crc, #0
            mov ptrb, #0
            mov polynomial, ##$8408  // from the left: x0 + x5 + x12
    
            rep @.rend, len
            rflong  val
            wrlut   val, ptrb++
            setq    val
            crcnib  crc, polynomial
            crcnib  crc, polynomial
            crcnib  crc, polynomial
            crcnib  crc, polynomial
            crcnib  crc, polynomial
            crcnib  crc, polynomial
            crcnib  crc, polynomial
            crcnib  crc, polynomial
    .rend
            rev crc
            shr crc, #16
    

    PS: Only just learnt what the frig the polynomial equations mean in practice - https://forums.parallax.com/discussion/comment/1498124/#Comment_1498124

    EDIT: Err, damn, I assumed it was possible for the streamer to block read lutRAM without using the FIFO. Turns out that requires the cog to feed the lutRAM addresses. So the FIFO is needed for streamer block ops ... will have to tack the CRC on afterwards then ...
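
For checking a hardware CRC loop like the one above on a PC, a plain C reference can help. This is a minimal sketch, assuming the standard SD data-block CRC16 (polynomial x^16 + x^12 + x^5 + 1, initial value 0, no final XOR, data taken MSB first as a 1-bit stream). The function names are hypothetical and not from any driver in this thread, and the sketch makes no claim about matching the byte order of the long-wise PASM loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT, MSB first: polynomial 0x1021, init 0, no final XOR.
   0x1021 is the bit-reversed form of the $8408 constant in the PASM loop. */
uint16_t sd_crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);   /* bring next byte into the top */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Hypothetical self-check using the usual "123456789" check string,
   whose CRC under these parameters is 0x31C3. */
void sd_crc16_selfcheck(void)
{
    assert(sd_crc16((const uint8_t *)"123456789", 9) == 0x31C3);
}
```

Feeding the same 512-byte buffer to this function and to the cog gives a quick way to spot a bit-order or byte-order mismatch before involving a real card.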

  • Not sure if using the streamer is terribly useful over just bitbanging (remember that you can ROLNIB from INx) or using multiple smartpins and MERGEB-ing the results together.

  • evanh Posts: 16,039

    Concurrently was the idea. Run CRC processing while block transmits - at up to sysclock/2 (which takes 2048 sysclock ticks per 512 byte block). The REP loop above takes a little under 3000 ticks.
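
The tick counts quoted here can be reproduced with a little arithmetic. The sketch below assumes a 4-bit bus (one nibble per SD clock) and 2 sysclock ticks per instruction in the REP loop; both are assumptions made for illustration, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { BLOCK_BYTES = 512 };

/* Sysclock ticks to clock one block over the bus:
   (bits per block / bus width) SD clocks, each lasting clk_div ticks. */
uint32_t transfer_ticks(uint32_t bus_bits, uint32_t clk_div)
{
    return (BLOCK_BYTES * 8u / bus_bits) * clk_div;
}

/* Sysclock ticks for a per-long loop over one block. */
uint32_t crc_loop_ticks(uint32_t instr_per_long, uint32_t ticks_per_instr)
{
    return (BLOCK_BYTES / 4u) * instr_per_long * ticks_per_instr;
}

/* Convert ticks to microseconds at a given sysclock frequency. */
double ticks_to_us(uint32_t ticks, double sysclk_hz)
{
    return (double)ticks / sysclk_hz * 1e6;
}

/* Reproduces the figures from the thread under the stated assumptions. */
void check_thread_numbers(void)
{
    assert(transfer_ticks(4, 2) == 2048);   /* 4-bit bus at sysclock/2 */
    assert(crc_loop_ticks(11, 2) == 2816);  /* 11 instructions in the REP loop */
}
```

Under these assumptions the loop (2816 ticks, roughly 9.4 us at 300 MHz) is longer than a sysclock/2 transfer (2048 ticks) but shorter than a sysclock/4 transfer (4096 ticks), so running the two concurrently only hides the CRC cost completely at /4 or slower.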

  • Why sysclk/2? The max bus speed is 50 MHz. IDK why you'd need the max possible SD speed down at 100 MHz.

  • evanh Posts: 16,039
    edited 2022-06-12 22:36

    Because it can. That's already what I'm doing in Flexspin's driver. Well, not /2, smartpins have a problem at /2, I used /4 to /8 there.

    EDIT: So do make sure the sysclock frequency is set up before initialising the SD driver. It is sensitive to sysclock staying put for the duration of use.

    PS: I use mount() and umount() pairs in the SD speed testing code I wrote to handle adjusting sysclock. eg:

                umount( "/sd" );
                _waitms( 200 );
                _clkset( clkmod, clkfrq );
                _setbaud( 230400 );
                printf( "\n\n   clkfreq = %d   clkmode = 0x%x\n", _clockfreq(), _clockmode() );
                mount( "/sd", _vfs_open_sdcard() );
    
  • evanh Posts: 16,039

    A better solution would be patching of _clkset() to have it adjust the smartpins. Wouldn't need the umount/mount duo at all then. There isn't much to update: The clock ratio of the SPI clock smartpin, and the tx smartpin's clock phase config to compensate for the lag effect from I/O staging.

  • Rayman Posts: 14,762
    edited 2022-06-17 23:44

    Got one of the 24 MB modules soldered up.
    Guess I'll see if MegaYume can work with it soon.

    It just clears the ground posts. Should have paid more attention to that. Got lucky...
