PSRAM as VGA buffer tests --> Now aiming towards DPO scope

rogloh · 2024-01-28 01:45

@cgracey said:
In the DEBUG displays, I worked out how to do anti-aliased line and shape drawing. It is a huge improvement over straight pixels. With 24-bit pixels, we can do that on the P2. It makes graphics way better. I really like this quarter-HD mode because the pixel rate is only 32MHz and with low-res anti-aliased/dithered text, it's quite sufficient.

Yeah that resolution is a reasonable compromise and probably suits the P2 Edge PSRAM memory config nicely. Should allow 120 columns for typical 8 pixel wide fonts and a good number of rows for textual information. 54 rows for a 10 pixel high font is decent or 45 rows for 12 pixels high (up to 90 for a tiny font). It'd be awesome if it could be achieved with HDMI/DVI output at 60Hz instead of VGA. I think pik33's approaches suggest it could be made to work at 50Hz but it mostly depends on the amount of blanking the monitor tolerates. 50Hz is still decent and hopefully HD monitors/TVs in the US by now would support that rate fully, unlike SDTVs in the past. What line timing (hfp, sync, hbp, etc) are you using Chip?
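
As a quick check of those column and row counts, here is a small sketch. It assumes that quarter-HD means 960x540 visible pixels (the resolution is not spelled out above), and the function name is only for illustration.

```python
def text_grid(width_px, height_px, font_w, font_h):
    """Return (columns, rows) of whole character cells that fit on screen."""
    return width_px // font_w, height_px // font_h

# Assumed visible resolution for quarter-HD: 960 x 540.
WIDTH, HEIGHT = 960, 540

# 8 pixel wide fonts at 10, 12 and 6 pixels high.
for font_w, font_h in [(8, 10), (8, 12), (8, 6)]:
    cols, rows = text_grid(WIDTH, HEIGHT, font_w, font_h)
    print(f"{font_w}x{font_h} font: {cols} columns x {rows} rows")
```

Under that assumption the 8x6 case corresponds to the "tiny font" figure of 90 rows.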

cgracey · 2024-01-28 01:45

@Wuerfel_21 said:
I think 16bpp can be a nice compromise - can still do blending effects and such but requires a lot less bandwidth. Though I've mostly had that thought ringing around in the context of 3D rendering, where you'd really have to keep the back buffer in hub ram for speed (320x240 16bpp is already quite large) - though with a 32bpp buffer, opaque spans could be scatter-written into the PSRAM buffer due to matching the granularity (and for extra cleverness, align the scanlines to PSRAM rows to reduce transfer overhead)... hmm.

I have been thinking that 32-bit pixels are nice because to do blending, you don't have to break words in and out, like you would in 16-bit pixels. You have one pixel per register when you work on the data, which is really simple.

cgracey · 2024-01-28 01:49

@rogloh said:

@cgracey said:
In the DEBUG displays, I worked out how to do anti-aliased line and shape drawing. It is a huge improvement over straight pixels. With 24-bit pixels, we can do that on the P2. It makes graphics way better. I really like this quarter-HD mode because the pixel rate is only 32MHz and with low-res anti-aliased/dithered text, it's quite sufficient.

Yeah that resolution is a reasonable compromise and probably suits the P2 Edge PSRAM memory config nicely. Should allow 120 columns for typical 8 pixel wide fonts and a good number of rows for textual information. 54 rows for a 10 pixel high font is decent or 45 rows for 12 pixels high (up to 90 for a tiny font). It'd be awesome if it could be achieved with HDMI/DVI output at 60Hz instead of VGA. I think pik33's approaches suggest it could be made to work at 50Hz but it mostly depends on the amount of blanking the monitor tolerates. 50Hz is still decent and hopefully HD monitors/TVs in the US by now would support that rate fully, unlike SDTVs in the past. What line timing (hfp, sync, hbp, etc) are you using Chip?

I have found that in HDMI, because it's digital, you can cut front/back-porch timing to nearly zero. This makes it possible (on my Samsung TV, anyway), to do quarter-HD over HDMI at 60Hz.

I think people won't want to mess with VGA/component-to-HDMI converter boxes. It would be too much to ask them to do. But, if they can just plug an HDMI cable in directly, that's perfect.

rogloh · 2024-01-28 01:53

Another benefit is that 4 chips allow 4kB to be transferred without crossing pages, as the overall page size is 4 times that of a single chip, so in theory a line of 960 32-bit pixels fits in a single overall PSRAM page. Although the 8us maximum chip select low time might impact that and still require 2 burst transfers at 300MHz or so by the video driver to allow refresh to work properly. That's not something your driver deals with yet Chip but may need down the line unless you hard limit burst sizes by the clients - a bit of a PITA for a video driver that might want to do something else in the meantime to have to come back and restart the second half line transfer. Maybe a real-time video COG client's request can be handled specially if it is somehow nominated in the mailbox request (top bit of the first long etc) and then broken up by the driver. You can sort of make this happen if you artificially halve the page size boundary in the PASM2 code.

cgracey · 2024-01-28 02:03

@rogloh said:
Another benefit is that 4 chips allow 4kB to be transferred without crossing pages, as the overall page size is 4 times that of a single chip, so in theory a line of 960 32-bit pixels fits in a single overall PSRAM page. Although the 8us maximum chip select low time might impact that and still require 2 burst transfers at 300MHz or so by the video driver to allow refresh to work properly. That's not something your driver deals with yet Chip but may need down the line unless you hard limit burst sizes by the clients - a bit of a PITA for a video driver that might want to do something else in the meantime to have to come back and restart the second half line transfer. Maybe a real-time video COG client's request can be handled specially if it is somehow nominated in the mailbox request (top bit of the first long etc) and then broken up by the driver. You can sort of make this happen if you artificially halve the page size boundary in the PASM2 code.

1024 longs (4KB, or 1KB from each chip) would need 2,048 clocks plus a tiny overhead for the gap. 2,048 clocks at 160MHz would take almost 13us. The visible scan line is 30us, plus maybe another 2us for HSYNC. There's a lot of room.
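
The arithmetic can be sketched as below. The figures (2 clocks per long, 160MHz, 32MHz pixel rate, and the 8us chip select limit from the quoted post) come from this thread. The function names are made up for the example, and per-burst setup overhead is ignored.

```python
import math

def burst_time_us(longs, clocks_per_long, clock_mhz):
    """Time in microseconds to move `longs` 32-bit words over the PSRAM bus."""
    return longs * clocks_per_long / clock_mhz

def bursts_needed(total_us, cs_limit_us):
    """How many separate bursts keep each one within the chip select limit."""
    return math.ceil(total_us / cs_limit_us)

line_us = burst_time_us(1024, 2, 160)   # 1024 longs, 2 clocks each, 160MHz
print("transfer time (us):", line_us)
print("bursts for an 8us CS limit:", bursts_needed(line_us, 8.0))
print("visible line (us):", 960 / 32)   # 960 pixels at a 32MHz pixel rate
```

On these numbers the full 1024-long transfer is longer than a single 8us chip select window, so it would have to be split, but both pieces together still fit well inside one scan line.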

rogloh · 2024-01-28 02:21

There is room to do multiple transfers, but it does burden a video driver to have to come back for a second request if it can do something else useful in the meantime and just issue a (single) request in the blanking portion. My own video driver for example only issues a single request. Also if you break it up without any prioritization you'll lose some time to other COG's requests if they sneak in when you don't want them to. I'd suggest halving the effective page size in the driver in order to handle this - simplest solution to not overrun the chip select low time and the transfer will be made in two pieces. For lower rates you might need 1/4 of the size, TBD, but then you wouldn't be doing video with a slow P2.

Wuerfel_21 · 2024-01-28 02:30

@cgracey said:

@Wuerfel_21 said:
I think 16bpp can be a nice compromise - can still do blending effects and such but requires a lot less bandwidth. Though I've mostly had that thought ringing around in the context of 3D rendering, where you'd really have to keep the back buffer in hub ram for speed (320x240 16bpp is already quite large) - though with a 32bpp buffer, opaque spans could be scatter-written into the PSRAM buffer due to matching the granularity (and for extra cleverness, align the scanlines to PSRAM rows to reduce transfer overhead)... hmm.

I have been thinking that 32-bit pixels are nice because to do blending, you don't have to break words in and out, like you would in 16-bit pixels. You have one pixel per register when you work on the data, which is really simple.

IMO the big boon with 32bpp is that you can use WMLONG for blitting if you accept that you can't have true black, only #010101. Though that's not happening with direct-to-PSRAM rendering. When doing opaque drawing (no destination read) or something slightly more complex (gradients, texture mapping, etc), the overhead of doing it in 16 bits is not so big (really just RGBSQZ - great instruction). Slightly more if you want to do dithering (if range of pre-dither pixels is clamped to #F8FCF8, can use a standard ADD instead of slow ADDPIX - easy to do so when rendering a gradient of sorts. Of course when doing a gradient to black, the add can be rolled into MIXPIX and be basically free).

Rayman · 2024-01-28 02:32

@cgracey VGA to HDMI cables are $10 on Amazon. Basically the same price as an HDMI cable.

rogloh · 2024-01-28 02:58

To allow for prioritization in a very simple/fast way I'd suggest a flag bit that is passed in the mailbox request - e.g. bit=0 for normal round-robin requests, or bit=1 for special real-time requests.

The PSRAM driver code monitors this bit and changes the COG ID handling to be either real-time or round robin polled. If this state changes for a COG it quickly adjusts to move the COG into the round robin polled group or real-time COG polled group as required - this needs fast detection to see if things ever change for the COG from the last request made but that should be a simple bitmap store/test. Most of the time it will remain constant and no further work is needed, keeping it fast, yet adaptive. The polling loop will need to be split into two portions, and you'd repoll the real-time COGs after any COG is serviced if any real-time COGs exist - this means reading the mailboxes again (at least for the real-time COG mailboxes). When more than one real-time COG exists (e.g. a sound driver and video driver), I expect you might be best to re-load the entire mailbox group vs fiddling about figuring out which mailboxes to read - it's a tradeoff but setq reads are fast. And when only non-real-time COGs exist you could default back to the existing polling scheme as you already coded @cgracey - which is still useful for general purpose application level memory sharing with round robin access.

Non-realtime COGs would still need to use lower burst sizes to avoid bandwidth overload per scanline affecting a video COG. This size can be tuned by the developer if needed - enforcing it in software requires calculations based on real-time requirements. For a single video resolution/colour mode and single video COG this is computable statically but gets harder once you vary the application's real-time requirements or if they happen to be changed dynamically.
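
A rough model of that two-group polling, written in Python rather than PASM2 so the selection logic is easy to follow. It is only a sketch: the function name, the use of sets, and the choice to service real-time COGs in ascending ID order are all assumptions for illustration, not how any existing driver does it.

```python
def next_cog(pending, rt_cogs, rr_start, num_cogs=8):
    """Pick the next COG to service.

    pending  : set of COG ids with a request waiting in their mailbox
    rt_cogs  : set of COG ids whose last request had the real-time flag set
    rr_start : COG id where the round-robin scan resumes
    Returns (cog, new_rr_start), or (None, rr_start) if nothing is pending.
    """
    # Real-time COGs are re-polled first, after every serviced request.
    for cog in sorted(rt_cogs):
        if cog in pending:
            return cog, rr_start
    # Otherwise fall back to plain round-robin over the remaining COGs.
    for i in range(num_cogs):
        cog = (rr_start + i) % num_cogs
        if cog in pending and cog not in rt_cogs:
            return cog, (cog + 1) % num_cogs
    return None, rr_start

# COG 6 is a real-time client (say, video); COGs 1 and 3 are ordinary clients.
pending, rr = {1, 3, 6}, 0
while pending:
    cog, rr = next_cog(pending, {6}, rr)
    print("servicing COG", cog)
    pending.discard(cog)
```

In this model the real-time COG is always served before any round-robin COG, while the round-robin position is left untouched so ordinary clients keep their fair ordering.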

rogloh · 2024-01-28 03:24

Here is the idea for comparing real-time requests or not. Only 3 extra instructions are needed in the critical path for this approach if nothing changes, which is probably 99.99999% of the time after startup. I modified COG6 handler below. The "pollchange" code block would be responsible for altering the poller logic and "rtcogs" is a simple long with a bit for each COG indicating if the last request was real-time or not. Pretty simple, this part of it anyway. I'll look at the poller logic change when I can.

The code also clears the flag bit in the "hub" variable but may not need to and you could potentially keep this bit setup for later - that would mean only two P2 instructions of overhead if you set the WC flag in the preceding MOV instruction.
EDIT: I don't think it actually needs to be cleared when looking at the later code that uses it.

.cog5           mov     hub,cmd+5*3+0
                mov     adr,cmd+5*3+1
                mov     len,cmd+5*3+2
                callpa  #5*12+8,#.got
                jmp     #.chk6

.cog6           mov     hub,cmd+6*3+0
                bitl    hub, #31 wcz            ' clear msb, C = old msb (realtime or not); BITL only accepts WCZ
                testb   rtcogs, #6 xorc         ' test whether the real time group has changed
        if_c    callpa  #6, #pollchange         ' changes polling group for cog6
                mov     adr,cmd+6*3+1
                mov     len,cmd+6*3+2
                callpa  #6*12+8,#.got
                jmp     #.chk7

.cog7           mov     hub,cmd+7*3+0
                mov     adr,cmd+7*3+1
                mov     len,cmd+7*3+2
                callpa  #7*12+8,#.got
                jmp     #.chk0

pik33 · 2024-01-28 08:52

wastes a quarter of its bandwidth

2 bits of the lower byte in 32 bpp mode control the HDMI modulator, but it seems 6 bits may be used to keep information. For example, a playfield number, to detect collision of a sprite with an object for a game machine.

cgracey · 2024-01-28 09:33

@pik33 said:

wastes a quarter of its bandwidth

2 bits of the lower byte in 32 bpp mode control the HDMI modulator, but it seems 6 bits may be used to keep information. For example, a playfield number, to detect collision of a sprite with an object for a game machine.

I wish I would have given the FIFO a 24-bit mode. The streamer could have used it nicely for 8:8:8 RGB.

Rayman · 2024-01-28 12:01

My code copies psram into scanline buffer…

Maybe can do 24 to 32 bit conversion during that process?

cgracey · 2024-01-28 18:38

@Rayman said:
My code copies psram into scanline buffer…

Maybe can do 24 to 32 bit conversion during that process?

I think it would take several instructions to convert a batch of 3 longs into 4 pixels. So, this would only work for slower, low-res modes, which wouldn't take much memory, anyway.
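
To make the 3-longs-to-4-pixels step concrete, here is a sketch of the bit shuffling in Python. It assumes the packed 24-bit pixels are stored back to back in little-endian order and that each output pixel carries its colour in the top three bytes with the low byte left at zero. Both are assumptions for illustration, and the function name is made up.

```python
def unpack_3_longs(l0, l1, l2):
    """Expand three 32-bit longs holding four packed 24-bit pixels into
    four 32-bit pixels with the colour in the top three bytes."""
    packed = l0 | (l1 << 32) | (l2 << 64)      # 96 bits, little-endian order
    return [((packed >> (24 * n)) & 0xFFFFFF) << 8 for n in range(4)]

# Four 24-bit pixels $112233, $445566, $778899, $AABBCC packed into 3 longs.
pixels = unpack_3_longs(0x66112233, 0x88994455, 0xAABBCC77)
print([f"${p:08X}" for p in pixels])
```

On the P2 each output pixel would need its own shifts and masks across long boundaries, which is where the several instructions per batch come from.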

Wuerfel_21 · 2024-01-28 19:06

Actually drawing 24 bit graphics in the first place would be another headache without dedicated 24 bit instructions.

Wuerfel_21 · 2024-01-28 19:11

Though for the video playback thing, I did think it would be worthwhile to keep the PSRAM buffers in subsampled format, and expand each scanline pair into 32 bit pixels in real time - then just output those YUV directly (for VGA, colorspace unit can handle it, for HDMI the display needs to deal with it) - never got around to that though.

Rayman · 2024-01-29 23:02

Working on mouse cursor drawing now...
Was thinking would have to save area of PSRAM where mouse is, then write mouse cursor on top of that area.

But now, looks like there is time to just copy mouse cursor pixels onto the scanline, right after it is read into HUB RAM.
Makes things much simpler...

evanh · 2024-01-29 23:21

Yes, all sprites should be stored in independent memory and overlaid at display time. They wouldn't be much of a sprite otherwise.

rogloh · 2024-01-30 01:51

@Rayman said:
Working on mouse cursor drawing now...

But now, looks like there is time to just copy mouse cursor pixels onto the scanline, right after it is read into HUB RAM.
Makes things much simpler...

That's how I do it in my own video driver when running from external PSRAM framebuffer and displaying the mouse sprite. It does eat into the budget however so you want a fast way to do it.

pik33 · 2024-01-30 14:49

I linked my drivers repository in post #11.

They have sprites.

The driver has a 4 kB buffer in the hub. The streamer is started at the start of the frame and then it is continuously pushing these 4 kB (=4 lines of 1024 pixels) to the video output via the LUT color table (8 bpp) in the loop (or, in the 32bpp version, directly, and the buffer is 16 kB).
While the streamer outputs the line to the screen, the video driver calls the PSRAM driver to fill 1024 bytes of line+2 from PSRAM.
Then it overwrites line+1 with sprites.

So it is a pipeline. line+0 is displayed by the streamer, line+1 is at the same time filled with sprites by the video cog and line+2 is fetched by the PSRAM cog.
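
The rotation of roles through the 4-line buffer can be pictured with a small sketch. The slot arithmetic below is an assumed way of indexing the buffer, and the names are made up for the example.

```python
LINES_IN_BUFFER = 4  # the 4-line hub buffer described above

def pipeline_slots(display_line):
    """Return which buffer slot each pipeline stage uses for a given line."""
    return {
        "display": display_line % LINES_IN_BUFFER,        # streamer output
        "sprites": (display_line + 1) % LINES_IN_BUFFER,  # video cog overlay
        "fetch":   (display_line + 2) % LINES_IN_BUFFER,  # PSRAM cog fill
    }

for line in range(5):
    print(line, pipeline_slots(line))
```

With four slots and three stages, one slot is always idle, so a stage that runs slightly late does not immediately collide with the streamer.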

I didn't manage to place sprites in the PSRAM, so they are kept in the hub. Maybe it should be possible using the PSRAM's memory access list.

Rayman · 2024-01-31 22:56

Got the mouse cursor going... Almost too easy...
Seems to work perfectly.

Defined like this for now:

DAT 'Cursor Data

Cursor1  '8x12 words
        word  $0000,$0000,$CCCC,$CCCC,$CCCC,$CCCC,$CCCC,$CCCC
        word  $0000,$FFFF,$0000,$CCCC,$CCCC,$CCCC,$CCCC,$CCCC
        word  $0000,$FFFF,$FFFF,$0000,$CCCC,$CCCC,$CCCC,$CCCC
        word  $0000,$FFFF,$FFFF,$FFFF,$0000,$CCCC,$CCCC,$CCCC
        word  $0000,$FFFF,$FFFF,$FFFF,$FFFF,$0000,$CCCC,$CCCC
        word  $0000,$FFFF,$FFFF,$FFFF,$FFFF,$FFFF,$0000,$CCCC
        word  $0000,$FFFF,$FFFF,$FFFF,$FFFF,$FFFF,$FFFF,$0000
        word  $0000,$0000,$0000,$FFFF,$FFFF,$0000,$0000,$0000
        word  $CCCC,$CCCC,$0000,$FFFF,$FFFF,$0000,$CCCC,$CCCC
        word  $CCCC,$CCCC,$CCCC,$0000,$FFFF,$FFFF,$0000,$CCCC
        word  $CCCC,$CCCC,$CCCC,$0000,$FFFF,$FFFF,$0000,$CCCC
        word  $CCCC,$CCCC,$CCCC,$CCCC,$0000,$0000,$0000,$CCCC

Could be much better with this 16-bpp driver, but good enough for now...
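
For reference, the overlay step can be sketched like this in Python. It assumes $CCCC is being used as a transparent key colour, which the cursor data suggests but which is not stated, and the function name is made up for the example.

```python
TRANSPARENT = 0xCCCC  # assumed key colour, not confirmed above

def overlay_cursor_row(scanline, cursor_row, x):
    """Copy one row of cursor pixels onto a scanline (list of 16-bit pixels),
    skipping key-coloured pixels and clipping at the screen edges."""
    for i, pixel in enumerate(cursor_row):
        if pixel != TRANSPARENT and 0 <= x + i < len(scanline):
            scanline[x + i] = pixel
    return scanline

# Second row of Cursor1, drawn near the right edge of a 10-pixel test line.
row = [0x0000, 0xFFFF, 0x0000, 0xCCCC, 0xCCCC, 0xCCCC, 0xCCCC, 0xCCCC]
print([f"${p:04X}" for p in overlay_cursor_row([0x1234] * 10, row, 7)])
```

Because the overlay happens on the hub copy of the scanline, the PSRAM frame buffer is never modified and nothing needs to be saved or restored when the mouse moves.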

Rayman · 2024-02-04 23:11

One sorta less than stellar feature has to do with writing a vertical line into the screen buffer while doing video.
Seems that during horizontal refresh, there is only time for 5 pixels of a vertical line.
Guess that's OK, but was hoping for more...

DAT 'Vertical Line     '(Or1, And1, rows, startrow, offset, pBuffer)
VerticalLine_

              add       arg2,arg3           'calc end row (rows + startrow)
              mov       arg3,#41            'start row is hardcoded to 41 here, overriding startrow

VLLoop
              mov       ptrb,arg5
              mov       PsLongs,#1          'transfer a single long
              mov       PsRow,arg3
              mov       PsOffset,arg4       'offset from start of row
              call      #ReadRamBurstSub    'read next line
              waitx     #200

              rdlong    t3,arg5
              and       t3,arg1             'modify long as directed
              or        t3,arg0
              wrlong    t3,arg5

              mov       SourceAdd,arg5
              mov       PsLongs,#1
              mov       PsRow,arg3
              mov       PsOffset,arg4       'offset from start of row
              call      #WriteRamBurstSub   'write the modified long back
              waitx     #200

              add       arg3,#1             'next row
              cmp       arg3,arg2 wcz
        if_b  jmp       #VLLoop             'loop while row < end row

              jmp       #PsLoopEnd          'done

evanh · 2024-02-04 23:33

That's going to be highly dependent on how long the blanking is. "Reduced Blanking" timings impact horizontal blanking far more than vertical blanking.

rogloh · 2024-02-05 00:24

@Rayman said:
One sorta less than stellar feature has to do with writing a vertical line into the screen buffer while doing video.
Seems that during horizontal refresh, there is only time for 5 pixels of a vertical line.
Guess that's OK, but was hoping for more...

Plotting individual pixels is not that efficient when they are not in a horizontal line. This is one of the reasons I began adding some simple line drawing graphics acceleration into my own PSRAM driver and raw rectangular blitting. It saves the client from doing it in small chunks and waiting around to process the next pixel transfer etc, and it keeps the memory bus + driver COG busy. Currently it is rudimentary (just a single pixel, no AA fx etc), but it can do lines at any angle at 8/16/32bpp, which can be handy. It does make the driver more complex though and requires state to be maintained across accesses and preemption by the video COG when it needs it (by frequent mailbox polling) so it's not really for simple drivers. The number of pixels you can plot per line will then essentially be dependent on the P2 clock frequency and your video mode's line timing plus bit depth, which defines how much bandwidth is left over per scan line for other graphics accesses.

Rayman · 2024-02-05 00:31

On the bright side, the drivers seem to be in a good place. Maybe using more cogs than I'd like, but all working well.
For the next step, working on a form based GUI. First test will be the mixed signal scope, going from 2bpp to 16bpp.
Think we can have a DPO analog window...

Rayman · 2024-02-10 21:56

Ok, after some soul searching, decided to try this 720p, 16bpp, forms based approach to GUI with my row based PSRAM interface to Chip's PSRAM driver.

What's great about this is that for the usual form elements, only a small area of the screen needs to be updated to respond to mouse clicks.
That means it doesn't need the absolute ultimate speed.
There is plenty of time during horizontal refresh so that the update is flicker free.

As a first goal for this, want to implement a mixed signal scope. Hopefully with DPO mode.
I think I have the form system worked out.
The attached example has a single form with just one option, but it works.

The scope update would need a lot of PSRAM access, but hopefully can use vertical refresh time for that...

So the form resources in this case are bitmaps, which makes life a bit easier because they contain height and width info.
Want to make it so that they can come from Hub, Flash, or SD (maybe I should include from PSRAM option, now that I think about it). But, from Hub only for now.

These resources have to be 16bpp Windows BMP files for now, but I'm realizing that this is a pain. There are really only a few programs that can save this way. I'm using Photoshop, but I know not everybody has this. I'm toying with the idea of letting them be 24 bit and changing them to 16 bpp on first use or maybe on the fly. Since the 16bpp version is smaller, I can just convert them if stored in HUB ram. Have to think more about how to handle this if from SD or Flash. Or, maybe I can find or create some simple tool to do the conversion... I've read that it's tricky to do this with the absolute best results, but don't need best results for this...
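
A simple host-side conversion could look something like the sketch below. It assumes the 16bpp format is 5:6:5 RGB, which is not confirmed here, and it just truncates the low bits with no dithering, so it is not the "absolute best results" approach. The function names are made up for the example.

```python
def rgb888_to_rgb565(r, g, b):
    """Pack 8-bit R, G, B into one 16-bit 5:6:5 value by dropping low bits."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def convert_row(bgr_bytes):
    """Convert one row of 24-bit BMP pixel data (B, G, R byte order, with any
    row padding already removed) to a list of 16-bit values."""
    return [rgb888_to_rgb565(bgr_bytes[i + 2], bgr_bytes[i + 1], bgr_bytes[i])
            for i in range(0, len(bgr_bytes) - 2, 3)]

# One red pixel followed by one blue pixel, as stored in a 24-bit BMP row.
print([f"${p:04X}" for p in convert_row(bytes([0, 0, 255, 255, 0, 0]))])
```

Since each output pixel is smaller than its source, a conversion like this can be done in place from the start of a buffer without overwriting pixels that have not been read yet.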

Rayman · 2024-02-10 22:00

Also, I need to find a better tool to create graphics. Paint works, but the graphics are 100% pixelated, might have to use PowerPoint or something.
Paint is good for text though...

cgracey · 2024-02-14 09:28

@Rayman said:
The scope update would need a lot of PSRAM access, but hopefully can use vertical refresh time for that...

Use another cog's streamer to capture into hub RAM, then write it to PSRAM in maybe 512-long sets. There should be time for both the video cog and the pin-capture cog to write data. Maybe the hub is even sufficient for storing the pin-capture data.

Rayman · 2024-02-18 18:58

Thinking that having all text as BMP is a pain...
Now capturing a Windows font as BMP image and saving from uSD to PSRAM.
So, drawing text on the screen will just be a BitBlt operation for each character...

Rayman · 2024-02-18 22:16

Think I have text working...
Turns out it's easiest when the font character width is a multiple of 4, and it just happened to be 12x24 pixels.
This is because PSRAM natively reads and writes in longs...

So, it's fixed pitch for now, but looks OK. Maybe slightly too bold, might fix that later...
This is how the text is defined:

DAT 'Form1Text1
Form1Text1
            long    mText               '0 type 5=Option
            long    400            '1 top
            long    200            '2 left
            long    0               '3 Data (general purpose)
            long    0               '4 Status
            long    0       '5 Font#
            'long    @Form1Text1Text       '6  pText

Form1Text1Text byte "This is a Test!",0
