PSRAM as VGA buffer tests --> Now aiming towards DPO scope

Rayman · 2024-01-02 23:16

@rogloh has a very nice PSRAM driver.
Here, looking at doing something a little different...

Want to explore having two independent 8-bit banks for double buffering of VGA content.
So far, was able to get this HyperRAM VGA driver converted to use 2 PSRAM chips instead.
This is 8bpp VGA. Not sure if it can do higher resolution than this or not...

rogloh · 2024-01-02 23:52

In my own setup I know PSRAM can support 1080p60 8bpp when the P2 is clocked fast (~300MHz) using a 16 bit memory bus. So I expect it should be able to do 16 colors at that resolution with an 8 bit memory bus.

My own P2 drivers (video + memory) were originally designed with double buffering capabilities. You would need to use the full-blown memory driver with two independent data buses configured, map a suitable range of addresses to the banks, and set up the video driver to alternate its external source address on every second frame or field (which it supports). Then the drivers will just display each field from each bank, and you get full write bandwidth to the currently inactive bank while the other bank is being read for display by the video driver. You can even mix HyperRAM for one bank and PSRAM for the other if you want. The only downside of this higher performance is that you need two memory driver COGs and a video COG (3 COGs total) instead of the usual two COGs, and the video memory has to be split up on a 16MB boundary (which typically isn't a big deal).

Rayman · 2024-01-03 00:04

Ok, so sounds like 4bpp is best one can do for 1080p and two chips?

I’m hoping for 8bpp but maybe that is not possible …

rogloh · 2024-01-03 00:42

8bpp should be doable with lower frame rates. 60Hz 8bpp 1080p needs to sustain 148.5MB/s out of the memory to keep up with the pixel rate excluding overheads. With PSRAM overheads you probably need to clock it at least 1.1x higher or so using large transfers which leaves at least some idle time, or maybe you can use some of the blanking interval for these setup overheads depending on your driver design. So maybe if you pushed the P2 above say 320MHz you could get 1080p 8bpp 60Hz from 8bit memory if you can use fractional P2 to pixel clock ratios as well. You'll need to try it out.
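
The arithmetic above can be sketched roughly as follows. This is only a back-of-envelope model: the 1.1x overhead factor comes from the estimate above, while the function name, the "one bus-wide transfer per PSRAM clock" model, and the sysclk/2 PSRAM clock are assumptions made for illustration.

```python
# Rough bandwidth budget for 1080p60 scan-out from PSRAM.
# Assumptions (for illustration only): one bus-wide transfer per PSRAM clock,
# PSRAM clocked at sysclk/2, and a flat 1.1x allowance for setup overhead.

PIXEL_CLOCK_HZ = 148_500_000  # 1080p60 pixel clock


def required_sysclk_hz(bits_per_pixel, bus_width_bits,
                       overhead=1.1, sysclk_per_psram_clk=2):
    """Return the P2 clock needed to sustain the pixel stream."""
    pixel_bits_per_second = PIXEL_CLOCK_HZ * bits_per_pixel
    psram_clk_hz = pixel_bits_per_second * overhead / bus_width_bits
    return psram_clk_hz * sysclk_per_psram_clk


if __name__ == "__main__":
    for bpp, bus in [(8, 16), (4, 8), (8, 8)]:
        mhz = required_sysclk_hz(bpp, bus) / 1e6
        print(f"{bpp} bpp on a {bus}-bit bus: about {mhz:.1f} MHz sysclk")
```

Under these assumptions, 8bpp on an 8-bit bus lands a little above 320 MHz, which is in line with the figure suggested above; real overhead depends on burst length and driver design.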

Rayman · 2024-01-03 00:49

In this scheme, each row of pixels is contained in one 1024 byte page of each chip, or 2048 bytes for the 8-bit bus.
That might reduce the overhead to work. We'll see...
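
A minimal sketch of that layout, assuming each scanline starts on its own 2048-byte row (two chips with a 1024-byte page each). The function names and the 1920-pixel example width are hypothetical, not taken from the actual driver.

```python
# One scanline per PSRAM row: two 4-bit chips, 1024-byte page each.
# Names and the example widths are illustrative assumptions.

PAGE_BYTES_PER_CHIP = 1024
CHIPS = 2
ROW_STRIDE = PAGE_BYTES_PER_CHIP * CHIPS  # 2048 bytes per scanline slot


def line_start(y, base=0):
    """Byte address where scanline y begins."""
    return base + y * ROW_STRIDE


def fits_in_one_row(width_pixels, bits_per_pixel=8):
    """True if a whole scanline fits without crossing a page boundary."""
    return width_pixels * bits_per_pixel // 8 <= ROW_STRIDE


if __name__ == "__main__":
    print(line_start(1), fits_in_one_row(1920), fits_in_one_row(1920, 16))
```

The trade-off in this sketch is that any bytes between the end of the visible line and the end of the row are simply left unused.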

Rayman · 2024-01-03 00:52

One kind of more advanced thing I'm thinking about is a HUB RAM buffer where graphics get drawn.
Drawing things like a circle takes a lot of overhead.
But if I can read an area of the screen into hub, modify it, then send it back to PSRAM, that might be faster.

Or, maybe it's fast enough as is...

rogloh · 2024-01-03 01:11

It's certainly more efficient to write stripes of data to PSRAM than individual pixels wherever possible. For example in block fills, or filled circles. Text rendering can be sped up significantly if you do it in hub then write it back to PSRAM. This also allows P2 transparency effects too, either using HW or SW methods.

rogloh · 2024-01-03 01:25

One other thing to mention. In my own driver, when enabled, the page flipping occurs automatically. So if you don't finish all graphics writes in time and the frame buffer is incomplete, it would still be displayed. For guaranteed frame timing applications like a locked 60Hz game or a decoder with deterministic execution this is useful, but if not, you'd have to manage the flip yourself (the actual buffer flip occurs at vsync time) or you'll see missing objects or other unwanted effects. If you flip manually you can flip once the frame is complete and still see a stable image, albeit with a variable frame rate. To flip manually you just set up the source address register to be used for the next odd and even frames and wait until the next frame starts before drawing again (fairly simple to implement).

pik33 · 2024-01-03 20:15

I wrote several versions of PSRAM based VGA and HDMI drivers. The math is simple: you can go up to 340 MHz. Half of it is 170 MHz. 1080p pixel clock is about 150 MHz at 60 Hz and about 125 MHz at 50 Hz refresh rate.
This means you can do one pixel at one PSRAM clock while the pixel depth equals PSRAM width. 8 bit PSRAM=8 bpp FullHD.

My HDMI driver, which I use in the media player and Basic interpreter, has a 4-line buffer in the hub. The video driver tells the PSRAM driver to fill the buffer for line+2. Then it draws HUB RAM based sprites on line+1 while the streamer pushes line+0 to the output. There are 4 lines so that the buffer size is a multiple of 64 (64*n), which lets the streamer loop over it for the full frame without restarting.
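
The rotation of the 4-line buffer can be sketched like this. It only models which buffer slot each stage touches on a given scanline; the function and key names are made up for illustration and are not the driver's own.

```python
# Which of the 4 hub line slots each pipeline stage uses on a given scanline.
# Stage names are hypothetical labels for the three steps described above.

LINES_IN_BUFFER = 4


def slots_for(line):
    """Map each pipeline stage to its slot index for this scanline."""
    return {
        "stream_out": line % LINES_IN_BUFFER,          # line+0 goes to the display
        "draw_sprites": (line + 1) % LINES_IN_BUFFER,  # line+1 gets sprites drawn
        "psram_fill": (line + 2) % LINES_IN_BUFFER,    # line+2 is fetched from PSRAM
    }


if __name__ == "__main__":
    for line in range(5):
        print(line, slots_for(line))
```

Because the three stages always sit in three different slots, none of them overwrites a line that another stage is still using.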

Drawing on the PSRAM buffer seems to be fast enough, see my youtube clips: https://www.youtube.com/@pik33100/videos I only made a fast assembly character rendering procedure to display characters much faster than by calling putpixel for every pixel of it. Also, the PSRAM driver 'fill' procedure can draw horizontal lines fast, so I use this for filled circles and rectangles.

Double buffering can be achieved simply by switching between 2 places in the PSRAM for displaying things. You can do a procedure in the driver that switches the buffer on demand, so you draw, and when finished, switch.

Or, as I did, add a display list to the driver, so every line can have its own address in the PSRAM. This also allows smooth scrolling
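
A display list of this kind can be sketched as a simple table of per-line start addresses. This is only an illustration of the idea; the function name, parameters, and the wrap-around scrolling scheme are assumptions and not the actual driver's format.

```python
# Display list sketch: one PSRAM start address per visible scanline.
# Scrolling or buffer switching then only means rewriting this small table.

def build_display_list(base, stride, visible_lines, scroll_y=0, total_lines=None):
    """Return the PSRAM address of each visible line, wrapping vertically."""
    total = total_lines if total_lines is not None else visible_lines
    return [base + ((y + scroll_y) % total) * stride
            for y in range(visible_lines)]


if __name__ == "__main__":
    # 3 visible lines scrolling over a 4-line image, 1024 bytes per line.
    for scroll in range(4):
        print(scroll, build_display_list(0, 1024, 3, scroll_y=scroll, total_lines=4))
```

Switching buffers in this model is just building the list with a different `base`, and smooth vertical scrolling is incrementing `scroll_y` once per frame.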

Here is an example of FullHD VGA driver in action:

Rayman · 2024-01-03 21:06

@pik33 Is that using four chips for 16 bit bus size?

pik33 · 2024-01-03 22:13

Yes, it is. This was an experiment with 1920x1080, 50 Hz, 8 bpp, to see if I can do fullHD resolution with the player, on an Edge EC32MB. The player was written for 1024x600, so it looks... strange in 1920x1080
The drivers (still WIP) are here:
https://gitlab.com/pik33/P2-retromachine/-/tree/main/Propeller/Videodriver_develop

Here https://github.com/pik33/P2-Retromachine-Basic is slightly modified HDMI driver for the Basic interpreter

They are named hg... ht... and vg... where h is hdmi, v is vga, t is 32bpp, g is 8 bpp, the rest is the version number.

Rayman · 2024-01-06 22:30

Here's a better version of the code in top post.
Now loads both images in PSRAM and then switches between them. (instead of loading from uSD constantly)
Also gets top and bottom lines right. There's some funniness with rdfast in VGA driver... If set to #0, you get some junk on last line. Used to remember why, but forgot...

Also, don't quite understand why I need the "waitx ##5000 " in the VGA buffer feeder. Used to be "waitx #50 " with hyperram. Don't think this is that much faster, but maybe was wrong before...
Have to pay attention to get top and bottom lines perfect. With photos, that can be hard to tell...

Ordered a 32MB Edge so can try to get this stuff working with it too...

Rayman · 2024-01-06 23:54

Sorta have 1080p with this 8-bit bus going, but it needs work to get the top, bottom, left, and right edges right...

Rayman · 2024-01-07 18:19

Ok it's working! Thought had an issue, but monitor is just picky about pixel clock. Tried to use 150 MHz, but monitor wants real 148.5 MHz.

Anyway, here is 1080p at 8 bpp using just two PSRAM chips.

I'm hopeful that this means the plan for having two independent banks here is going to work...

Rayman · 2024-01-07 18:24

Uh oh, just realized that the trick being used to read in RAM at sysclock/2 isn't going to work for the second group with basepin at P40.
This only works with a basepin of 0 or 32:

              'read in bytes
              'wfbyte takes the low byte of inb (P32..P39), so the data bus must start at P0 (ina) or P32 (inb)
              rep       #1,PsBytes 'tight loop of just wfbyte at clk/2 speed
              wfbyte    inb
.reploop2

Fortunately, @rogloh has already figured out how to use the streamer for this from basepin 40, so maybe I can borrow from that...

Rayman · 2024-01-07 18:51

Or, maybe one 16-bit bus is almost just as good as two independent 8-bit busses?
That would let me forgo learning how to use the streamer...

Wuerfel_21 · 2024-01-07 19:42

The streamer use for this is really quite simple. If anything bitbanging it is way harder and a very silly way of doing it.

For basic streamer reads, I could recommend reading the rather short PSRAM code in one of my emulators. Roger's is very convoluted because it supports a lot of dubiously useful things. Mine is simple and optimized for low latency. But it's really just straightforward: Queue up commands, use WAITXMT and WAITXFI to synchronize other operations to it. The main trick involved is that to broadcast the command to multiple 4-bit lanes, LUT mode is used with $0000, $1111, $2222 etc kinda values ($11 for 8 bit, of course. For 4-bit bus, just use direct 4-bit mode).
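
The lookup values mentioned above can be generated like this. It is only a sketch of the table contents, assuming one 4-bit lane per PSRAM chip; the function name is hypothetical and no actual streamer setup is shown.

```python
# Build the 16-entry table that copies one command nibble onto every 4-bit lane.
# Table contents only; loading it into LUT RAM and configuring the streamer
# are not modeled here.

def broadcast_lut(bus_width_bits):
    """Return 16 values, one per nibble, replicated across all lanes."""
    lanes = bus_width_bits // 4
    replicate = sum(1 << (4 * lane) for lane in range(lanes))  # $11 or $1111
    return [nibble * replicate for nibble in range(16)]


if __name__ == "__main__":
    print([f"${v:04X}" for v in broadcast_lut(16)[:4]])
    print([f"${v:02X}" for v in broadcast_lut(8)[:4]])
```

With a single 4-bit chip the table degenerates to the identity, which matches the remark that direct 4-bit mode is enough there.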

The emulator PSRAM code has some funnies:

- Access is always done on aligned longs. With the 16 bit bus, the access quantum is a 32 bit long. 8 bit PSRAM does the same with words, of course, but the code was first written for 16 bit, so the limitation sticks.
- Since it never does big block accesses, it can only handle one row boundary within a transfer. The row size is 4K for 16 bit, 2K for 8 bit, 1K for 4 bit.
- The Z80 access code in MegaYume and the sprite and ADPCM accesses in NeoYume are always of the same size (4, 8 and 16 bytes respectively) and never cross a row boundary, so they use specialized code paths to reduce overhead.
- The aforementioned NeoYume sprite transfers are a particular display of PSRAM access efficiency. For one, they're loosely asynchronous. The first rendering cog generates a list of sprites that need rendering on each scanline. It then hands that off to the PSRAM cog, which works through it at its own pace. It just has to get them all loaded during the timespan of a scanline. They then get picked up by the second rendering cog on the next scanline (+2 from when the first one generated the list). Note that nowhere in this scheme does anything ever stall while waiting on the PSRAM; everything works in parallel. The other thing is that since a big list of transfers is submitted at once, the next transfer in the list can be pre-processed while the streamer is working in the background, reducing overhead by a dozen cycles or so. That was added relatively recently. Though it of course makes the biggest difference if overhead is a big chunk of time spent per transfer, which it really wasn't before due to the constant size optimization mentioned in the previous item, especially on slower memory boards (where overhead is the same, but actual transfer is longer).
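
The row-boundary handling can be illustrated with a small splitter that uses the row sizes listed above. This is a generic sketch with a hypothetical function name: it loops over any number of boundaries, whereas the emulator code described here only ever needs to handle one.

```python
# Split a transfer so that no chunk crosses a PSRAM row boundary.
# Row sizes follow the list above; the general loop is an illustration only.

ROW_BYTES = {16: 4096, 8: 2048, 4: 1024}  # bus width in bits -> row size in bytes


def split_at_row_boundary(addr, length, bus_width_bits):
    """Return a list of (address, byte_count) chunks, each within one row."""
    row = ROW_BYTES[bus_width_bits]
    chunks = []
    while length > 0:
        room = row - (addr % row)   # bytes left before the next boundary
        count = min(room, length)
        chunks.append((addr, count))
        addr += count
        length -= count
    return chunks


if __name__ == "__main__":
    print(split_at_row_boundary(4090, 16, 16))
    print(split_at_row_boundary(0, 16, 16))
```

Transfers that are known never to cross a boundary, like the fixed-size ones mentioned above, can skip this check entirely, which is where the specialized code paths save their overhead.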

Rayman · 2024-01-15 19:31

This one does 720p at 16bpp with 8-bit PSRAM bus. Kinda like the idea of this one. Maybe with 16-bit bus, can do something interesting...

Update: Attached works better now, fixed bottom and top line issues.

cgracey · 2024-01-22 22:02

Does anyone know if you can clock these PSRAM chips on the P2-EC32MB at 160MHz?

The data sheet says they are good for 133MHz, but that's at 85C.

evanh · 2024-01-22 22:23

Yes, with ease.

When the EC32MB was in development I was actually hoping for double that using DDR.

cgracey · 2024-01-22 22:30

@evanh said:
Yes, with ease.

When the EC32MB was in development I was actually hoping for double that using DDR.

Super!

What about input latency into the P2 pins? Is there a one-size-fits-all number of clocks, or is this needing live adjustment?

evanh · 2024-01-22 22:35

To add more detail: The spec'd frequency limit of 133 MHz isn't any actual barrier. Or at least not at room temperature anyway. There seems to be a lot of safety margins.

That goes for the DRAM refresh too, which is geared for a substantial discharge degradation up at 100 degC.

evanh · 2024-01-22 22:45

@cgracey said:
What about input latency into the P2 pins? Is there a one-size-fits-all number of clocks, or is this needing live adjustment?

The EC32MB routing is the best tested but yes the lower the divider is the harsher that problem becomes. I have a bunch of compile time calculations for handling smartpin-streamer alignment timings for the setting of divider ratio and SPI clock mode and pin registration options. And that's just for tx phase alignment.

Rx turnaround lag has to be guessed on top of that. It can be preset and so far no one has tried to do otherwise. But a runtime calibration for temperature compensation would probably be ideal in the long run.

evanh · 2024-01-22 22:50

Further reading - https://forums.parallax.com/discussion/comment/1556355/#Comment_1556355

Wuerfel_21 · 2024-01-22 22:52

NeoYume's configuration for EC32MB runs the PSRAM at 169 MHz (338MHz / 2) and that shit is so solid that there's a second delay setting that almost works (tripped me up during development). The delay setting (consisting of cycle delay and pin sync/async settings) seems to be consistent for boards of a given type - i.e. all EC32MB seem to work with the same values, but a different board needs a different value. Also max. clock speed is very board design dependent. Code should always have a clkfreq/3 divider available for "lesser" boards (such as that 96MB one Ray made - only really reliable at room temp and with that /3 divider).

(Have you ever actually played with the funny emulators, @cgracey ? Running NeoYume seems to have somehow become the de-facto benchmark for new P2 boards. I'm pretty sure you have a keyboard and monitor available )

cgracey · 2024-01-22 23:08

@Wuerfel_21 said:
NeoYume's configuration for EC32MB runs the PSRAM at 169 MHz (338MHz / 2) and that shit is so solid that there's a second delay setting that almost works (tripped me up during development). The delay setting (consisting of cycle delay and pin sync/async settings) seems to be consistent for boards of a given type - i.e. all EC32MB seem to work with the same values, but a different board needs a different value. Also max. clock speed is very board design dependent. Code should always have a clkfreq/3 divider available for "lesser" boards (such as that 96MB one Ray made - only really reliable at room temp and with that /3 divider).

(Have you ever actually played with the funny emulators, @cgracey ? Running NeoYume seems to have somehow become the de-facto benchmark for new P2 boards. I'm pretty sure you have a keyboard and monitor available )

Cool!

A divide-by-3 would generate 1,0,0,1,0,0...

That would violate the 45-55% clock duty cycle that the data sheet specifies. I take it that doesn't really matter, though.

Where do I load your emulator code?

Wuerfel_21 · 2024-01-22 23:18

@cgracey said:
A divide-by-3 would generate 1,0,0,1,0,0...

That would violate the 45-55% clock duty cycle that the data sheet specifies. I take it that doesn't really matter, though.

Yes. Doesn't matter. I don't remember if I do 100100 or 110110, but whatever I do is working.
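
For reference, the duty cycle of the two candidate patterns works out as below. This is just the arithmetic behind the 45-55% remark; the helper names are made up for illustration.

```python
# Duty cycle of a repeating clock pattern, given as a string of '1' and '0'.
# The 45-55% window is the data sheet limit quoted above.

def duty_cycle(pattern):
    """Fraction of the period during which the clock is high."""
    return pattern.count("1") / len(pattern)


def within_spec(pattern, low=0.45, high=0.55):
    """True if the pattern's duty cycle falls inside the quoted window."""
    return low <= duty_cycle(pattern) <= high


if __name__ == "__main__":
    for pattern in ("10", "100", "110"):
        print(pattern, f"{duty_cycle(pattern):.0%}", within_spec(pattern))
```

Either divide-by-3 pattern sits at roughly one third or two thirds high, so both are outside the window, while the divide-by-2 clock is at exactly half.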

Where do I load your emulator code?

https://github.com/IRQsome/MegaYume and https://github.com/IRQsome/NeoYume respectively. Heed the readme. Both should be pre-configured for a EC32MB setup. Though audio/video/USB pins may need checking (config.spin2). Of course, needs some version of flexspin installed to build (and it really needs to be built through the included build.sh - much mindbending compiler skulduggery going on, NeoYume actually has 3 top Spin2 files)

evanh · 2024-01-22 23:37

Chip,
I've now added the required "stdlib.spin2" file in the topic that I linked. I hadn't noticed it was missing when I originally made the post.

Rayman · 2024-01-23 04:01

One really strange thing is coordinating the PSRAM driver with the VGA driver. Using cogatn in the VGA driver works, but not in a way you can predict or control…

Anyway, think I have a handle on it now, but it's not easy…

cgracey · 2024-01-23 04:14

@evanh said:
Chip,
I've now added the required "stdlib.spin2" file in the topic that I linked. I hadn't noticed it was missing when I originally made the post.

Thanks, Evanh.

pik33 · 2024-01-23 15:40

@cgracey said:
Does anyone know if you can clock these PSRAM chips on the P2-EC32MB at 160MHz?

The data sheet says they are good for 133MHz, but that's at 85C.

They are easily overclockable. My HDMI 1024x600 uses 336 MHz clock, so PSRAM got 168 MHz.

The rest depends on the wiring.

On P2-EC32 the limit is somewhere about 340 MHz. The same frequencies are available on a P2 Eval with 4-bit PSRAM attached to a pin bank (in my case, 40..47). I have 2 "backpacks" with a PSRAM chip for Eval, one of them slightly (5 MHz or something) better.

On P2 Edge without PSRAM, with a chip attached to a pin bank as for Eval, things are much worse. Even 300 (=150 on PSRAM) doesn't work. Maybe 280 is the real maximum for the PSRAM at clk/2, but I didn't even test this.
