Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2) - Page 33 — Parallax Forums

Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

Comments

  • @evanh said:
    Got it, smaller drivers aren't specifically documented. Actually, the document doesn't name any files at all. Probably need to add something for that.

    Check the release notes file for file details.

  • evanhevanh Posts: 15,126

    Oh, that's not a great solution. At least name the driver that the doc covers.

  • evanhevanh Posts: 15,126

    Here are the corrected auto-delays in psram.spin2 for use on the Edge EC32MB card. The first line I've retained as the commented-out original table; the second line is the updated table.

    'delayTable  long    7,92_000000,150_000000,206_000000,258_000000,310_000000,333_000000,0  'Eval + Add-on
    delayTable  long    7,130_000000,190_000000,240_000000,290_000000,340_000000  'Edge EC32MB
    
  • Good. People using the P2-EC32MB boards should probably use that adjusted setting for now, until there is a better way to auto-detect optimized timing...

    For other PSRAM systems, you can always run the supplied psram_delay_test program to work out where the various frequency breakpoints are and create your own custom timing table optimized for your own setup. Unfortunately, with this board timing stuff there is no one-size-fits-all set of values, and even temperature can affect timing.

    We should figure out a routine to call periodically to monitor for optimized read input timing at the current temperature. For that, I know we can create secondary overlapping banks in the driver with new timing values that don't affect the regular bank timing, so the test can be run in parallel with normal operation, and the actual operational banks adjusted only once the new values are determined. The only real issue is that there needs to be a COG to perform this operation: the driver is busy servicing COGs and can't really stop to go off and do this work on its own, so it will need a client COG to make the read requests, set the timing registers, and hunt for the timing with minimum read errors, etc.
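
    A minimal sketch of how such a breakpoint table could be used, assuming (my reading, not necessarily the driver's actual scheme) that each listed frequency is the point above which one more clock of read-input delay is required. The `select_delay` name and the base-delay handling are hypothetical:

```python
# Hypothetical sketch: pick an input-delay value for a given sysclock,
# assuming each table entry is the frequency at or above which one more
# clock of read-input delay is needed. Not the driver's actual code.

def select_delay(sysclk_hz, breakpoints_hz, base_delay=0):
    """Return base_delay plus one extra delay per breakpoint reached."""
    return base_delay + sum(1 for f in breakpoints_hz if sysclk_hz >= f)

# Breakpoint frequencies from the Eval + Add-on table quoted above
EVAL_BREAKPOINTS = [92_000000, 150_000000, 206_000000,
                    258_000000, 310_000000, 333_000000]
```

    At 200 MHz, for example, this would select two extra clocks of delay (past the 92 and 150 MHz breakpoints).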

  • evanhevanh Posts: 15,126

    I'd do it as a user-callable tuning routine that is also used by the init function. If, say, the clock speed is adjusted, then the application program should be the one to manage the transition. That way the application can not only decide when to tune but also designate a block of RAM to tune with.

  • evanhevanh Posts: 15,126
    edited 2022-06-27 22:50

    Roger,
    I'd like to offer my PLL setting routine for your use, or at least the clkmode_ / clkfreq_ logic. They offer a more streamlined solution for correctly calculating the sysclock mode and frequency; no particular constant definitions are required.
    Their use recently gained compatibility in Flexspin - https://forums.parallax.com/discussion/comment/1539723/#Comment_1539723

    PS: It has two internal arbitrary presets:

    pub  pllset( targetfreq ) : xinfreq | mode, mult, divd, divp, post, Fpfd, Fvco, error, besterror, vcolimit
    
        mode := clkmode_
    
        if clkmode_ >> 24 & 1                           ' compiled with PLL on
            divd := clkmode_ >> 18 & $3f + 1
            mult := clkmode_ >> 8 & $3ff + 1
            divp := (clkmode_ >> 4 + 1) & $f
            divp := divp ? divp * 2 : 1
            xinfreq := muldiv65( divp * divd, clkfreq_, mult )
    
        elseif clkmode_ >> 2 & 3                        ' compiled with PLL off
            xinfreq := clkfreq_                     ' clock pass-through
        else                                            ' unknown build mode
            xinfreq := 20_000_000                   ' default to 20 MHz crystal
            mode := %10_00                          ' default to 15 pF loading
    
        mode := %11_11 & mode | %11                     ' keep %CC, force %SS, ditch the rest
    
        besterror := div33( targetfreq, 100 )           ' _errfreq at 1.0% of targetfreq
        vcolimit := targetfreq + besterror
        vcolimit := vcolimit < 201_000_000 ? 201_000_000 : vcolimit
    
        repeat post from 0 to 15
            divp := post ? post * 2 : 1
    
            repeat divd from 64 to 1
                Fpfd := div33( xinfreq, divd )
                mult := muldiv65( divp * divd, targetfreq, xinfreq )
                Fvco := muldiv65( xinfreq, mult, divd )
    
                if Fpfd >= 250_000 and mult <= 1024 and Fvco > 99_000_000 and Fvco <= vcolimit
                    error := div33( Fvco, divp ) - targetfreq
    
                    if abs( error ) <= abs( besterror )     ' the last iteration at equality gets priority
                        besterror := error
                        mode := (mode&%11_11) | 1<<24 | (divd-1)<<18 | (mult-1)<<8 | ((post-1)&15)<<4
    
        if mode.[24]                                            ' PLL-ON bit set when calculation is valid
            clkset( mode, targetfreq + besterror )          ' make the frequency change
    '       baudval()                                       ' recalibrate debug comms as well
        else
            xinfreq := -1                                   ' failed, no change
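
    For anyone wanting to sanity-check the search, here's a rough Python model of the same loop. A sketch only: the constants and rounding mirror the Spin2 above, but names like `pll_search` and `rdiv` are mine.

```python
# Rough Python model of the pllset() search loop above. The rounded
# integer division mirrors div33/muldiv65: add half the divisor
# before truncating division.

def rdiv(a, b):
    return (a + b // 2) // b

def pll_search(xin, target):
    besterror = target // 100                  # start at 1.0% of target
    vcolimit = max(target + besterror, 201_000_000)
    best = None
    for post in range(16):
        divp = post * 2 if post else 1
        for divd in range(64, 0, -1):
            fpfd = xin // divd
            mult = rdiv(divp * divd * target, xin)
            fvco = rdiv(xin * mult, divd)
            if fpfd >= 250_000 and mult <= 1024 and 99_000_000 < fvco <= vcolimit:
                error = rdiv(fvco, divp) - target
                if abs(error) <= abs(besterror):   # last equal match wins
                    besterror = error
                    best = (divd, mult, divp)
    return best, besterror
```

    For a 20 MHz crystal and a 297 MHz target this finds an exact solution (DIVD=20, MULT=297, DIVP=1), so the error is zero.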
    

    EDIT: Oops, forgot it also uses my rounding-to-nearest dividing routines too ... here's the simpler versions:

    PUB  div33( dividend, divisor ) : r
        org
            mov     r, divisor
            shr     r, #1
            add     dividend, r
            qdiv    dividend, divisor
            getqx   r
        end
    
    
    
    PUB  muldiv65( mult1, mult2, divisor ) : r
        org
            qmul    mult1, mult2
            mov     r, divisor
            shr     r, #1
            getqx   mult1
            getqy   mult2
            add     mult1, r    wc
            addx    mult2, #0
            setq    mult2
            qdiv    mult1, divisor
            getqx   r
        end
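
    In Python terms (just to document the rounding behaviour, not a replacement for the PASM), both routines are round-to-nearest division, with ties rounding up:

```python
# Python equivalents of the Spin2/PASM rounding routines above:
# add half the divisor to the dividend before truncating division,
# which rounds the quotient to nearest (ties round up).

def div33(dividend, divisor):
    return (dividend + divisor // 2) // divisor

def muldiv65(mult1, mult2, divisor):
    # The PASM version uses a 64-bit intermediate product; Python ints
    # are arbitrary precision, so no overflow handling is needed here.
    return (mult1 * mult2 + divisor // 2) // divisor
```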
    
  • evanhevanh Posts: 15,126
    edited 2022-06-28 02:18

    I offer it because, in your existing code, the %CC bits and the XIN frequency are somewhat assumed to be %11 and 20 MHz respectively.

    EDIT: Err, the %CC bits are being calculated, not unreasonably on assumption, rather than maintained.

  • Am distracted by other things right now but will try to take a look at this soon.

  • Briefly comparing the code...

    • Your version uses the existing PLL settings if enabled to compute the input frequency and can also determine it when using a direct pass through clock source, while mine needs to be specified... win to you. This part is especially useful.
    • Your version gains extra precision by introducing your own rounding divide and multiply code, rounding in multiple places - whether all of that is needed, or is overkill, is TBD. Mine achieves rounding by testing twice per mult value (with simple addition) to see which value is the closer match.
    • Yours looks for percent error less than 1% while mine uses a frequency difference as the tolerance (500kHz). Probably not that significant.
    • Mine exits when the best match is found at the lowest DIVD, DIVP values, which is a slight optimization in my code. Yours tests every possible value and does the calculations for all combinations, which may be slower to run.
    • Your source code is much tighter; mine is more verbose, although a little better commented. Perhaps it's unnecessarily verbose, as it was never really optimized as such.

    I don't know, maybe I'll start with just the clockmode_ stuff you added. That part seems handy.

  • evanhevanh Posts: 15,126

    @rogloh said:
    ... maybe I'll start with just the clockmode_ stuff you added. That part seems handy.

    I'd be happy with that alone updated.

    It required the recent change in Flexspin to operate below top-level anyway. I'd put it all aside a year or so back when it failed to compile in Flexspin as an object.

  • @evanh said:

    @rogloh said:
    ... maybe I'll start with just the clockmode_ stuff you added. That part seems handy.

    I'd be happy with that alone updated.

    It required the recent change in Flexspin to operate below top-level anyway. I'd put it all aside a year or so back when it failed to compile in Flexspin as an object.

    Yeah I know it was a problem in the past, but does this scheme now also work in PropTool?

  • evanhevanh Posts: 15,126
    edited 2022-06-28 07:10

    Sure does, it was a feature parity improvement for Flexspin.

  • Sure does, it was a feature parity improvement for Flexspin.

    Ok good, cheers.

    By the way this PLL setting topic is probably relevant to my p2 video driver thread not this memory driver thread, but as I've already followed it up here, who am I to mention that...? The memory driver doesn't specifically touch the PLL, only my video driver.

  • Also @evanh , I see this comment in the SPIN2 documentation... I take it this means the documentation is out of date with reality? This was the old behavior, right? Now clkmode_ is inherited from the top level file...

    The 'clkmode_' value may differ in each file of the application hierarchy. Files below the top-level file do not inherit the top-level file's value.

  • evanhevanh Posts: 15,126

    @rogloh said:
    By the way this PLL setting topic is probably relevant to my p2 video driver thread not this memory driver thread, but as I've already followed it up here, who am I to mention that...? The memory driver doesn't specifically touch the PLL, only my video driver.

    Yeah, I realised that myself after I'd posted. In my defence, I was reading the video driver code in the memory driver archive.

  • evanhevanh Posts: 15,126
    edited 2022-06-28 08:28

    @rogloh said:
    Also @evanh , I see this comment in the SPIN2 documentation...I take it this means the documenation is out of date with reality? This was the old behavior, right? Now clkmode_ is inherited from the top level file...

    The 'clkmode_' value may differ in each file of the application hierarchy. Files below the top-level file do not inherit the top-level file's value.

    Wow, you're right. I didn't read it all. It's weird, because how can an undefined symbol be anything other than inherited? Inheritance is basically exactly what I got added to Flexspin. Before the improvement there were no such symbols below top-level, so I just got compile errors when my code was put in an object file.

    And that is how it appears to work in Pnut and Proptool. Those two symbols aren't source-defined at any level; they are built by the compiler from the board config constants, or from defaults if there is no config. And they exist below top-level.

  • RaymanRayman Posts: 13,803

    @rogloh your psram video driver seems to do well in vga and xga resolutions, even in nibble mode with single chip.

    Think it could do 720p and/or 1080p?

  • roglohrogloh Posts: 5,122
    edited 2022-07-05 13:24

    @Rayman said:
    @rogloh your psram video driver seems to do well in vga and xga resolutions, even in nibble mode with single chip.

    Think it could do 720p and/or 1080p?

    Yes, it will work, but perhaps at the lower colour depths unless you use 16 bit PSRAM. A rough rule of thumb is that 4 bit PSRAM can peak-transfer at sysclk/4 bytes/second, 8 bit PSRAM at sysclk/2 bytes/second, and 16 bit PSRAM at sysclk bytes/second. I'm excluding setup, latency and fragmentation overheads as well as the horizontal blanking, assuming they'll approximately cancel out.

    1080p60 dot clock is 148.5 MHz. So it would need at least 148MB/s at 8bpp, or 74.25MB/s at 4bpp. If a P2 runs at 297MHz it would likely be able to do 2bpp for 4 bit PSRAM, 4bpp for 8 bit PSRAM and 8bpp for 16 bit PSRAM. In fact I just tried it out and seem to be able to get 2bpp with 4 bit PSRAM at 1080p60.
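
    That rule of thumb is easy to put into numbers. A sketch only: `max_bpp` is my name, and the strict inequality reflects that demand exactly equal to the peak rate leaves no margin.

```python
# Rule-of-thumb check from the post above: peak PSRAM transfer rate is
# sysclk/4 bytes/s for a 4-bit bus, sysclk/2 for 8-bit, sysclk for
# 16-bit. A video mode demands dot_clock * bpp / 8 bytes/s.

PEAK_DIVISOR = {4: 4, 8: 2, 16: 1}   # bus width (bits) -> sysclk divisor

def max_bpp(sysclk_hz, bus_width_bits, dot_clock_hz):
    """Largest power-of-two bit depth whose demand stays below peak."""
    peak = sysclk_hz / PEAK_DIVISOR[bus_width_bits]
    best = 0
    for bpp in (1, 2, 4, 8, 16, 32):
        if dot_clock_hz * bpp / 8 < peak:   # strictly below: '==' has no margin
            best = bpp
    return best
```

    At 297 MHz with the 148.5 MHz 1080p60 dot clock this gives 2bpp for 4 bit, 4bpp for 8 bit and 8bpp for 16 bit PSRAM, matching the figures above.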

    You can play with it just by modifying the requested resolution and bit depths. The demo itself won't work right, because it was designed for 8bpp and will draw weird things, but you'll see something, and you can write into the framebuffer yourself.

    UPDATE:
    Also just got 4bpp 1080p working with 8 bit PSRAMs. 8bpp works with 720p with 8 bit PSRAMs which looks really nice.

    Here are changes for 1080p, replace these lines accordingly:

        RESOLUTION = vid.RES_1920x1080    ' this should match the above
        XSIZE = 1920 
        YSIZE = 1080 
        BPP = 2
    
    'setup a 2bpp graphics region sourced by framebuffer in external RAM using memory "bus 1"
        vid.initRegion(@region, vid.LUT2, 0, 0, @vgapalette, 0, 0, (1<<28) | RAM, 0)
    

    You can try out 4bpp too if you replace the 4 bit PSRAM driver with the 8 bit generic PSRAM driver and setup your pins accordingly.

        RESOLUTION = vid.RES_1920x1080    ' this should match the above
        XSIZE = 1920 
        YSIZE = 1080 
        BPP = 4
    
    'setup a 4bpp graphics region sourced by framebuffer in external RAM using memory "bus 1"
        vid.initRegion(@region, vid.LUT4, 0, 0, @vgapalette, 0, 0, (1<<28) | RAM, 0)
    
  • pik33pik33 Posts: 2,347
    edited 2022-07-05 18:40

    @Rayman said:
    @rogloh your psram video driver seems to do well in vga and xga resolutions, even in nibble mode with single chip.

    Think it could do 720p and/or 1080p?

    I use the PSRAM driver in my video drivers. The first of these is an HDMI 1024x576x8bpp driver. It works with the 4-bit driver (and of course with 16-bit too) using about half of its bandwidth, so there is time to put pixels on the screen and also reuse it for audio. Nothing beyond this resolution is possible with HDMI, because of P2 and PSRAM instability above roughly 340 MHz.

    Of course you can lower the refresh rate, but this is already 50 Hz, and most monitors I tested (except one) don't support anything significantly lower than 50 Hz.

    When using VGA, 1024x768x8bpp will be very tight using a 4-bit RAM, but at 50 Hz it should be doable, saturating near the full bandwidth. 60 Hz is too fast for this single small chip.

    With a 16bit PSRAM 32 bpp should be possible at this resolution.

    Another of my drivers is a FullHD 1920x1080 @ 8bpp driver using 16 bit PSRAM. It uses more than 50% of the bandwidth - the 16-bit RAM is less than 4x faster than 4-bit, so no more than 8 bpp is possible here. Only 2bpp is possible using 4-bit RAM, but maybe an attribute map can help here, as in the good old ZX Spectrum.


    There is also a command list implemented in the PSRAM driver, which opens up the possibility of displaying windows. This slows the transfers, but is still fast enough to display even 15 segments of the picture from 15 different addresses in one display line. There is a short demo of two independent windows I put somewhere in the forum. This also enables, for example, reading a palette for the line at the start of it.

  • Here's some interesting information I have been putting together for the PSRAM driver overheads...

    I'm trying to come up with a function that returns the maximum burst size a non-video COG can safely make use of, such that it won't interfere with a video COG requiring a particular amount of bandwidth per scan line based on its timing, resolution, bit depth etc. The higher this burst, the more efficient things are, but there is a limit to how high you can go: you can't exceed the 8us CE low time, and you can't delay the real-time video COG for too long or you can starve it and corrupt the video.

    I've counted up overhead cycles assuming the worst case for the hub window cycles and the code path if branches are taken.

    Driver processing is made up of the following...

    • mailbox polling time to branching to per COG state code where request processing begins
    • burst read service time to the point where PSRAM chip select is dropped low
    • address transfer phase (14 nibbles fixed)
    • data transfer phase (directly proportional to burst size)
    • other instruction overheads while CE is low to the point where the chip select gets raised high and the service returns to the mailbox notification stage
    • mailbox notification overhead back to the start of the polling loop

    Interestingly it appears that due to the slightly simpler pathways in the code a 4 bit PSRAM can in fact be faster than 16 bit PSRAM for the very small burst transfers. Once the transfer size increases this of course goes away. The driver's overhead excluding the data transfer for 4 bit PSRAM is 32 P2 clocks shorter, which means that even though each byte takes 4 times as many P2 clocks to transfer using 4 bit PSRAM vs 16 bit PSRAM, for up to 8 byte bursts there is no real penalty.

    In the calculation for optimizing the maximum burst size, I will have to take into account that locked video requests can also be fragmented due to the 8us CE limit and page boundaries, and this adds more overheads. I'll also need to find some worst-case code path for the non-video COG, such that it delays video COG re-polling for the longest time. This is likely to be a list request with a graphics fill operation, or some type of burst that will need to resume after its own fragmentation... TBD. Even after this it will be wise to include a small safety margin on top, which might account for other pipeline delays I may not have accounted for precisely. It's hard to know everything to the exact clock cycle if it involves the streamer and input delays etc.
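
    The "no real penalty for small bursts" observation can be sketched like this. The 32-clock overhead difference is from the post above, but the absolute overhead figure is a made-up placeholder, since only the difference matters for the comparison:

```python
# Sketch of the small-burst break-even point described above.
# 4-bit PSRAM takes 4 P2 clocks per data byte, 16-bit takes 1, but
# the 4-bit code path's fixed overhead is 32 clocks shorter.

CLKS_PER_BYTE = {4: 4, 16: 1}
FIXED_OVERHEAD = {4: 200 - 32, 16: 200}   # 200 is a placeholder figure

def service_clocks(bus_width_bits, burst_bytes):
    """Total P2 clocks to service one burst, overhead plus data phase."""
    return (FIXED_OVERHEAD[bus_width_bits]
            + CLKS_PER_BYTE[bus_width_bits] * burst_bytes)
```

    With these numbers the 4-bit path is no slower up to roughly 10-byte bursts (4n - 32 <= n when n <= 10), consistent with the "up to 8 byte bursts" observation above.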

  • hinvhinv Posts: 1,252
    edited 2022-07-16 17:39

    @rogloh Do you think that if someone did a careful layout like the P2-EC32MB, but extended with 2 more rows of PSRAMs, it would likely perform at SYSCLK/2 for all 96MB? Edit: I mean at 338MHz also.

  • evanhevanh Posts: 15,126

    Hinv,
    That's the best chance of success. The 16-bit width helps too.

  • Yeah. 3 chips on a bus work fine, even on an accessory board.

  • roglohrogloh Posts: 5,122
    edited 2022-07-16 18:25

    @hinv said:
    @rogloh Do you think that if someone did a careful layout like the P2-EC32MB but extended up 2 rows of PSRAMs that it would likely perform at SYSCLK/2 for all 96MB? Edit: I mean at 338MHz also.

    Potentially yes. I've already seen it working at that frequency using the 4 bit board from Rayman, with 3 paralleled PSRAM devices as an external plug-in on P2-EVAL. It might still be violating the total capacitive load per the data sheet, and at that clock speed it is already overclocked anyway (i.e. it won't be guaranteed to work over its usual full temperature and voltage range etc), yet it seems to have enough margin to get away with it at room temperature, from what I've observed to date.

    My own 64MB test board worked also on P2-EVAL at this speed, but that only had two device banks in parallel.

  • Wuerfel_21Wuerfel_21 Posts: 4,374
    edited 2022-07-16 19:05

    Kind of obnoxious that all the high density parts (> 16 Mbit per bus pin) are 1.8V only. Has anyone tried if interfacing these really is a lost cause?

  • @Wuerfel_21 said:
    Kind of obnoxious that all the high density parts (> 16 Mbit per bus pin) are 1.8V only.

    Yeah and they are in that pseudo Hyper format too and in miniBGA making it harder to build your own.

    Has anyone tried if interfacing these really is a lost cause?

    Not to my knowledge.

  • @rogloh said:
    Yeah and they are in that pseudo Hyper format too and in miniBGA making it harder to build your own.

    They do make 512Mb actual-HyperRAM (though merely a 256Mb die stack), but yea, BGA...

  • hinvhinv Posts: 1,252

    @Wuerfel_21 said:
    Kind of obnoxious that all the high density parts (> 16 Mbit per bus pin) are 1.8V only. Has anyone tried if interfacing these really is a lost cause?

    Actually, since the P2 is internally 1.8V, it makes me wish that for memory we had 20 pins that didn't have smart pins attached and were just 1.8V from the core, with sturdy enough drivers to run them at 300+MHz. I think it would actually take up less die space. Maybe the P3 should have external hub RAM so that we could have 1GB of hub at full speed. ;^)

  • hinvhinv Posts: 1,252

    @rogloh said:
    Potentially yes. I've already seen it working at that frequency using the 4 bit board from Rayman with 3 paralleled PSRAM devices as an external plug in on P2-EVAL.

    My own 64MB test board worked also on P2-EVAL at this speed, but that only had two device banks in parallel.

    I guess I should try to keep up with that other thread because I didn't know we could get SYSCLK/2 over the P2-EVAL pins.

  • hinvhinv Posts: 1,252
    edited 2022-07-16 19:57

    @Wuerfel_21 said:
    Kind of obnoxious that all the high density parts (> 16 Mbit per bus pin) are 1.8V only. Has anyone tried if interfacing these really is a lost cause?

    Ooh look: https://www.alliancememory.com/new-8mb-to-128mb-high-speed-cmos-psrams/
    4x16MB would be nice, huh?
