P2 DVI/VGA driver - Page 16 — Parallax Forums

P2 DVI/VGA driver


Comments

  • @evanh said:
    Those things were ugly hacks even then!

    LOL, you're not a fan of the old monitors it would seem.

    @evanh said:
    Leave LVDS for Chip to add, alongside TMDS, as a new mode for CMOD.

    Wouldn't that be nice. Yeah I'm not sure if/when that will happen... we should have pushed for it earlier. It probably would have been almost trivial vs TMDS encoding.

  • evanhevanh Posts: 15,183

    I would've been more vocal if I'd known they were different.

  • @rogloh said:

    I've added 6 control bits in the top byte of the driver initialization register that can select which outputs are generated in parallel digital RGB mode:

    bit 0 - disables CLK (CLK is generated by default, not sure why you wouldn't need a clock but whatever)

    So I finally found a use case for disabling the clock in my parallel output mode. This will be a good way to support old-school CGA/MDA/EGA monitors that don't need a clock (or DE, for that matter). My new RGB parallel output mode is maskable and will readily support 1, 2, 4, or 6 colour bits per pixel anyway, so assuming the video timing is set up correctly my driver should be able to output to these old TTL monitors nicely. It would be a nostalgic laugh to see the P2 generate some graphics on older digital monitors like these if I can source one. It should still look nice and vibrant even with their limited colour palette range. :smile:

    Since I've just added parallel digital output I have also been thinking about LVDS again, and located this thread: https://forums.parallax.com/discussion/171516/p2-fpd-link-lvds-displays. I figure if I can get the processing done in less than 14 clocks per pixel it might (only) just be achievable to output LVDS using my driver. There would need to be a lot of limitations imposed though: single region only, no mouse, no text render, no border, low-refresh-rate panels only (~24Hz?), either 4bpp or 8bpp gfx mode only, maybe with some pixel doubling (not sure). If I can't get pixel doubling it would not be possible to fit a framebuffer for 1024x600x8bpp displays in internal hub RAM, though it could be done for 4bpp. External RAM would have the capacity, but again I'd need to see if there is any time for that in the code, and it would need some more rework to read external data one scan line earlier so the pixels are ready to be processed in time.
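A quick back-of-envelope check of those framebuffer sizes (an illustrative Python sketch; the 512 KB hub RAM figure is the P2's total, ignoring whatever the driver itself needs):

```python
# Framebuffer sizes for a 1024x600 panel vs the P2's 512 KB hub RAM.
HUB_RAM = 512 * 1024                       # 524,288 bytes total

def framebuffer_bytes(width, height, bpp):
    """Bytes for a packed framebuffer at bpp bits per pixel."""
    return width * height * bpp // 8

fb8 = framebuffer_bytes(1024, 600, 8)      # 614,400 bytes - too big
fb4 = framebuffer_bytes(1024, 600, 4)      # 307,200 bytes - fits
print(fb8 > HUB_RAM, fb4 <= HUB_RAM)       # True True
```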

    The thread above mentioned that 20MHz pixel clocks (low refresh) were possible on a particular LVDS display. I came up with an instruction sequence below that can generate 8 LVDS encoded pixels at a rate of 12 P2 clocks/pixel in its inner loop which is still under the budget of 14 P2 clocks for a 280MHz P2 clocking out 20MHz LVDS encoded pixels. However I would also need to read the input pixels from HUB to LUT and then write back to HUB once encoded and this adds 1.125 more clocks per pixel. It's already really tight and yet to be determined if its execution timing fits in with the loop overheads and HUB burst read/write delays etc. Perhaps 21 P2 clocks per pixel would be needed instead, but that only reduces the frame rate even more... :'(

    ```

    rdlut colours, ptra++           ' read in 4 coloured 8 bit pixels from LUTRAM input buffer
    
    getbyte index, colours, #0      ' get first pixel colour (8bit colour index)
    rdlut pixels, index             ' read palette colour to translate to 7x4 bit LVDS patterns (18 or 24 bit colour)
    
    getbyte index, colours, #1      ' get second pixel colour
    rdlut pixels+1, index           ' get second pixel pattern
    setnib pixels, pixels+1, #7     ' merge new portion with 28 bits of first long
    shr pixels+1, #4                ' align second long
    wrlut pixels, ptrb++            ' write first long to output buffer in LUTRAM
    
    getbyte index, colours, #2      ' get third pixel colour
    rdlut pixels+2, index           ' get third pixel pattern
    setbyte pixels+1, pixels+2, #3  ' merge new portion with 24 bits of second long
    shr pixels+2, #8                ' align third long
    wrlut pixels+1, ptrb++          ' write second long to output buffer
    
    getbyte index, colours, #3      ' get fourth pixel colour
    rdlut pixels+3, index           ' get fourth pixel pattern
    setnib pixels+2, pixels+3, #5   ' merge new portion with 20 bits of third long
    shr pixels+3, #4                ' align fourth long
    setbyte pixels+2, pixels+3, #3  ' merge new portion with 24 bits of third long
    shr pixels+3, #8                ' align fourth long
    wrlut pixels+2, ptrb++          ' write third long to output buffer
        
    rdlut colours, ptra++           ' continue with next four pixels...writing four further longs to LUT (7 total)
    
    getbyte index, colours, #0
    rdlut pixels+4, index
    setword pixels+3, pixels+4, #1
    shr pixels+4, #16
    wrlut pixels+3, ptrb++
    
    getbyte index, colours, #1
    rdlut pixels+5, index
    setnib pixels+4, pixels+5, #3
    shr pixels+5, #4
    setword pixels+4, pixels+5, #1
    shr pixels+5, #16
    wrlut pixels+4, ptrb++
    
    getbyte index, colours, #2
    rdlut temp, index
    setbyte pixels+5, temp, #1
    shr temp, #8
    setword pixels+5, temp, #1
    wrlut pixels+5, ptrb++
    
    getbyte index, colours, #3
    rdlut pixels+6, index
    rolnib pixels+6, a, #4
    wrlut pixels+6, ptrb++
    
    
    ' Total loop for 8 pixels = 10xRDLUT(3) + 7xWRLUT(2) + 26x2 other instructions = 30+14+52 = 96 clocks or 12 clocks/pixel
    ' plus
    '   4 pixels read per long on input in 8bpp mode (~ 0.25 clocks/pixel in SETQ2 read burst)
    '   7 longs generated for 8 pixels on output (0.875 clocks/pixel to write to hub)
    ' making 12 + 0.25 + 0.875 = 13.125 clocks/pixel excluding other loop overheads etc.
    
    ' This number must remain below 14 after the overheads. It will be very tight!
    

    ```
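As a cross-check of the bit layout only (an illustrative Python model, not P2 code): the setnib/setbyte/setword merge-and-shift sequence above is equivalent to concatenating eight 28-bit patterns LSB-first and slicing the 224-bit result into seven 32-bit longs:

```python
# Model of the packing scheme: 8 x 28-bit LVDS patterns -> 7 x 32-bit longs.
def pack_28_to_longs(pixels):
    assert len(pixels) == 8
    big = 0
    for i, p in enumerate(pixels):
        assert 0 <= p < (1 << 28)
        big |= p << (28 * i)               # LSB-first, like the merge/shift steps
    return [(big >> (32 * i)) & 0xFFFFFFFF for i in range(7)]

def unpack_longs(longs):
    big = 0
    for i, l in enumerate(longs):
        big |= l << (32 * i)
    return [(big >> (28 * i)) & 0xFFFFFFF for i in range(8)]

pix = [(0x1234567 * (i + 1)) & 0xFFFFFFF for i in range(8)]
assert unpack_longs(pack_28_to_longs(pix)) == pix   # round-trips exactly
```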

    Maybe a partner cog working with LUT sharing could give the budget you need: one cog loading to and from hub RAM, and the other doing the LVDS encoding?

  • I have a PVM that does TV frequency CGA, with TTL inputs. My MDA appears dead now. Will have to investigate. Hope it is not the pixel tube.

  • @potatohead said:
    I have a PVM that does TV frequency CGA, with TTL inputs. My MDA appears dead now. Will have to investigate. Hope it is not the pixel tube.

    CGA is always TV frequency (which is why old CGA cards had composite output alongside TTL).

    I am reminded once again of some VGA monitors I have that need fixing, but one of them I'm very certain just has a bad solder joint in the horizontal geometry section.

  • It is! And CGA was so weird. Odd color palettes.

    I suppose I said TV frequency with EGA in mind.

    My one VGA needs powered up. Late model flat screen CRT. Does 1600x something.

  • @potatohead said:

    My one VGA needs powered up. Late model flat screen CRT. Does 1600x something.

    What goes real hard is when you push them to the max VSync frequency. That one I mentioned can do 960x720@120Hz and if you get a game to actually render to that it's really pretty.

  • potatoheadpotatohead Posts: 10,253
    edited 2021-10-08 18:11

    It is! Fast CRT displays are my favorite. At that resolution, seeing the pixels is nice, but not particularly limiting either.

    I am not sure how high refresh the one I have goes. Project for another day here soon. I have most of it stored and have just recently been digging stuff, seeing what still lives and is available to play with.

  • roglohrogloh Posts: 5,155
    edited 2021-10-09 02:37

    @AJL said:
    Maybe a partner cog working with LUT sharing could give the budget you need; One cog loading to and from hubram, and the other doing the LVDS encoding?

    Yeah, I was thinking the same last night and it might be one way to go. Although it might be possible to just remain within the 14 clock budget identified in the code above. That code seems to still be saving 5/7 of a clock per pixel after the streamer's hub transfer load is factored in; the streamer's reads happen in parallel with my own burst transfers to/from LUTRAM and periodically slow hub accesses down (the streamer needs 1 long read per 8 P2 clocks). For LVDS panels with 1024 pixels this equates to about 365 x 2-clock instructions worth of spare time accumulated per scan line. I'd expect that amount should still suffice for the additional loop and other setup overheads required, but to prove this out one way or another I could probably hack something up fairly quickly. For example, I could keep some VGA syncs running, mock up a static screen with some dummy streamer output, and check that the picture stays stable under this amount of LVDS encoding load per scan line.
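The spare-time figure works out as follows (trivial arithmetic in Python; the 5/7-clock-per-pixel saving is the estimate from the text above):

```python
# Spare time per 1024-pixel LVDS scan line at a 5/7 clock/pixel saving.
pixels_per_line = 1024
saved_clocks = pixels_per_line * 5 / 7     # ~731.4 clocks per line
spare_instructions = int(saved_clocks / 2) # 2-clock instructions
print(spare_instructions)                  # 365
```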

  • roglohrogloh Posts: 5,155
    edited 2021-10-09 02:12

    Hmm. Does LUT sharing allow the streamer in one of the COGs to also access the LUT? Maybe not...
    I found this in the P2 documentation:

    "Lookup-RAM writes from the adjacent cog are implemented on the 2nd port of the lookup RAM. The 2nd port is also shared by the streamer in DDS/LUT modes. If an external write occurs on the same clock as a streamer read, the external write gets priority. It is not intended that external writes would be enabled at the same time the streamer is in DDS/LUT mode."

    So it doesn't look good, as the LVDS mode certainly needs to use the streamer with LUT reads, unless we encode directly to byte lanes (2 longs per pixel - even more encoding work) or have nibbles output to external differential amps from 4 P2 pins that create the 8 pins needed for LVDS. It could actually be just 6 data wires if the clock is handled independently as a smartpin diff pair using a PWM mode. I hope that mode could also use the bitDAC at the same time for LVDS but haven't checked if that is doable.

    A different way to offload work to another COG is to have a second COG encode from pixels into packed LVDS nibbles via HUB reads and writes in the line buffer and not use LUT sharing, but this probably doesn't save a lot of work for the encoder. It might however mean that more of the normal video driver features could be supported which is still probably of benefit.

  • roglohrogloh Posts: 5,155
    edited 2021-10-09 02:51

    Found this LVDS device too, which may help solve everything really. It would work in my clocked parallel output mode basically as-is and would need no code changes. :smile: Runs up to 68 Mpixels/s at a ~US$11 part cost.

    https://www.ti.com/lit/ds/symlink/sn65lvds95.pdf

    Feels like cheating though...

    It does burn more pins too of course. Needs 20-22 IO pins depending on whether you need H&V sync signals incorporated in the signal vs 8 pins for 18 bit panels if encoded directly from the P2.

    I am somewhat tempted to buy a device and larger LVDS panel and build a board to test it.

  • evanhevanh Posts: 15,183

    @rogloh said:
    It does burn more pins too of course.

    Not cheating when it's not a savings. The reason to use LVDS on the Prop2 is to use less pins.

  • roglohrogloh Posts: 5,155
    edited 2021-10-09 04:53

    @evanh said:

    @rogloh said:
    It does burn more pins too of course.

    Not cheating when it's not a savings. The reason to use LVDS on the Prop2 is to use less pins.

    Well then that would imply a software encoding solution as the HW encoder solutions will need more pins.

    I also quickly found a higher speed and cheaper (~$9) part that might be better suited to supporting 10 inch 1280x800 panels that might like to operate around 72Mpixels/s. This one can do 4 data lanes too so could support up to 24 bit colour LCD panels.

    https://www.ti.com/lit/ds/symlink/sn65lvds93b.pdf

    There are probably other devices around that are even better given that LVDS based LCD panels are pretty common these days.

  • evanhevanh Posts: 15,183

    Well, from another perspective, ruggedising is a good approach too. Those interface chips are good line drivers as well. Makes it easy to have longer display cabling.

    And, at the Propeller, don't have to use all 24 RGB pins either.

  • Yeah, you can get by with 20 pins for 18-bit panels if they don't need H&V. So only 12 more than the 8 used directly in the software approach; it's not too bad. I am now thinking of making up a board for the P2-EVAL with one of these encoders on it, like those different SRAM boards I made. Should be simple enough to do. The main issue appears to be that, despite modern LCD panels using a standardized group of LVDS lanes for their pixel data, different LCD manufacturers use varying pinouts on their flat-flex cables, so you almost need a custom board per LCD panel selected. There's no real standard there from what I can see.

  • evanhevanh Posts: 15,183

    There were a couple of metric IDC connectors with common TTL pinouts back in the '90s when I was looking. But, LVDS, I never looked into actually buying those.

  • evanhevanh Posts: 15,183
    edited 2021-10-09 06:49

    Oh, and the tiny Hirose surface mount connectors. They were fun to make cables from ... not! The few of those I worked with did seem to follow a convention for pinout.

  • roglohrogloh Posts: 5,155
    edited 2021-10-09 08:51

    I just thought of another way that might be promising and perhaps get down to under 7 clocks per pixel (using a second COG).

    The main work needed is not so much the LVDS encode (you can get that for free in a palette lookup) but is packing groups of 8 x 28 bit pixel data sequences into 7 longs so the streamer can send out all the pixel sequences in a single operation, while the driver can then prepare the next scan line.

    If the second "LVDS COG" executed mainly from LUTRAM and had a working buffer in COGRAM that gets filled with pixels to be packed, using a burst read from hub RAM, it could then work in an unrolled loop to process groups of something like 64 or 128 pixels at a time, packing them into 56 or 112 longs and then writing back to hub RAM. The write back to hub RAM operation would be done at the end of the processed group with a burst write. I think with HUB read/write overhead it should still fit within the 7 clock budget per pixel as it avoids those extra LUT reads and writes. You could send the packed pixels back to the same hub buffer or a different one. The processing work looks like this...

        setq #64-1 ' read in next batch of pixels
        rdlong 0, ptra++  
    
        setnib 0, 1, #7      ' assume working buffer starts at COG location 0
        shr 1, #4
    
        setbyte 1, 2, #3
        shr 2, #8
    
        setnib 2, 3, #5
        shr 3, #4
        setbyte 2, 3, #3
        shr 3, #8
    
        setword 3, 4, #1
        shr 4, #16
    
        setnib 4, 5, #3
        shr 5, #4
        setword 4, 5, #1
        shr 5, #16
    
        setbyte 5, 6, #1
        shr 6, #8
        setword 5, 6, #1
        rolnib 7, 6,  #4
        mov 6, 7
    
       ' loop continues to be unrolled...
    
       ' setq #56-1
       ' wrlong 0, ptrb++
    
    ' works out at a rate of 19  x 2 clocks per 8 pixels packed = 38 clocks
    ' 38/8 is 4.75 clks/pixel, leaving time for burst reads/writes adding ~ 2 clocks per pixel more.
    
    
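The budget in those comments can be tallied the same way (illustrative Python; the instruction and burst counts are taken from the listing, and ~1 clock per long is assumed for the SETQ bursts):

```python
# Clock budget for the unrolled packing loop: 19 two-clock ALU ops per
# 8 pixels, plus hub bursts of 64 longs in / 56 longs out per 64 pixels.
merge_clocks_per_pixel = 19 * 2 / 8        # 4.75
burst_clocks_per_pixel = (64 + 56) / 64    # 1.875, at ~1 clock per long
total = merge_clocks_per_pixel + burst_clocks_per_pixel
print(total, total < 7)                    # 6.625 True
```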

    Update: I might actually need 2 extra COGs for LVDS... it looks just a fraction too tight for the main COG to do the palette translation to 28 bits within 7 cycles, so another COG is likely needed. 7 cycles per pixel is good as it allows 40-50MHz pixel clocks using a 280-350MHz P2 for example.

    1) Main p2 video driver COG - supporting majority of usual operations, pixel doubling, regions, external RAM, mouse etc. Internally uses 4 bit LUT during output mode only to translate nibbles from the streamer into LVDS byte lane data on pins. Generates scan line buffer data using colour indexes only for later LVDS translation, so only text and the palette based colour modes are allowed and no borders.

    2) primary helper COG - reads indexed colours from scanline buffer using rfbyte, then translates via 1/2/4/8 bit palette lookup into 28 bit colour (data written as blocks of longs). ~6 clocks per pixel plus loop overheads.

    3) secondary helper COG, reads groups of 8x28 bits from hub and packs into 7 longs and writes back. Code above is ~6.75 clocks /pixel.

  • evanhevanh Posts: 15,183

    That's so tight. Avoiding individual load/store is a huge reduction of overhead. Wouldn't have had a show without the block read/write.

  • roglohrogloh Posts: 5,155
    edited 2021-10-10 08:01

    Yeah 7 clocks per pixel is really tough with my video driver that needs to translate and load and store. If you did it all in a dedicated LVDS video driver outputting on the fly you would have a chance with one COG but you'd have no time for any other features...it would be raw framebuffer output only.

    This sample loop below does 8bpp palette mode LVDS data at 7 clock cycles per pixel; the trouble starts when you need to issue SYNCs. If you have an LCD that only uses the DE signal then you can just fix the output pattern for the entire blanking interval and use a smart pin in duty mode for the clock diff pair output. That's probably the simplest way it is achievable, and you can buy yourself lots of scan line setup time by sending a constant output data pattern of all zeroes during blanking for many pixels in a row. This could get pixel clock rates up to 50MHz using a 350MHz P2.

    rep #3, pixelcount
    rfbyte colourindex  ' get 8bpp colour index
    rdlut pixeldata, colourindex ' lookup 7x4 bit nibble data in LUTRAM 0-$ff for this colour, ie. 8bit index -> 18 or 24bit RGB pixel pattern 
    xcont streamermode, pixeldata ' use LUTRAM to translate 3 or 4 bit lane RGB data into 6 or 8 bit LVDS differential pin output pattern
    
    streamermode long $20880007  ' immediate 8x4 bit LUT lookup mode using LUTRAM table at $100, outputting for only 7 clock cycles (28 bits only)
    

    A 4 bit variant (16 colours) could be done with a 256 entry palette that has 16 colours replicated 16 times. Similar approach could be used for 1 and 2 bit colour palette modes with more replication. Pixel doubling could be achieved using a similar approach by using the same index data more than once.

    rep #6, pixelcount_div_2
    rfbyte colourindex  ' get 4bpp colour index
    rdlut pixeldata, colourindex ' lookup 7x4 bit nibble data in LUTRAM 0-$ff for this colour, ie. 4bit index -> 18 or 24bit RGB pixel pattern 
    xcont streamermode, pixeldata ' use LUTRAM to translate 3 or 4 bit lane RGB data into 6 or 8 bit LVDS differential pin output pattern
    shr colourindex,#4  ' get 4bpp colour index
    rdlut pixeldata, colourindex ' lookup 7x4 bit nibble data in LUTRAM 0-$ff for this colour, ie. 4bit index -> 18 or 24bit RGB pixel pattern 
    xcont streamermode, pixeldata ' use LUTRAM to translate 3 or 4 bit lane RGB data into 6 or 8 bit LVDS differential pin output pattern
    
    streamermode long $20880007  ' immediate 8x4 bit LUT lookup mode using LUTRAM table at $100, outputting for only 7 clock cycles (28 bits only)
    
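The replicated-palette trick is easy to model (illustrative Python; pal16 here is just 16 placeholder patterns). Every table entry depends only on the low nibble of its index, so the full-byte lookup and the post-SHR lookup both land on the right colour:

```python
# 256-entry LUT with 16 colours replicated 16 times for 4bpp mode.
def replicated_palette(pal16):
    assert len(pal16) == 16
    return [pal16[i & 0xF] for i in range(256)]

pal16 = list(range(16))                    # stand-ins for LVDS patterns
table = replicated_palette(pal16)
byte = 0xA3                                # two packed 4bpp pixels
assert table[byte] == pal16[0x3]           # first lookup: full byte as index
assert table[byte >> 4] == pal16[0xA]      # second lookup: after shr #4
```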
  • roglohrogloh Posts: 5,155
    edited 2021-10-10 04:17

    If I could incorporate that approach in the video driver, and patch the COG code in LVDS mode to generate data only during the scanline instead of one line before, I might be able to use this method in my driver (with H&V sync bits always set to 0). I might even be able to have the data sourced from the external RAM too, if there is time to setup the request in the hsync porches. I'm going to look into this.... it still would not support text or multiple regions of course.

  • Another approach might be to have a single LVDS helper COG working on the output data only by using the line buffer generated by my video COG as its data source. The advantage with that two COG driver solution would be that existing text/mouse/region stuff should still work and the main limitation would still be that any graphics regions have to use the LUTRAM based palette modes only. It does cost a second COG though.

    This helper COG would need to be well co-ordinated with palettes and region changes and fully synced to the driver COG.

    So I sort of have 3 approaches now:

    1. use the existing single video driver COG with an external RGB to LVDS encoder chip. This has the most pins used on the P2, but can work at higher resolutions supporting up to 1080p panels @ 60Hz for example and the advantage is that the LVDS voltage output is well controlled.

    2. have a single driver COG with LVDS support added for LUT based single region gfx only, with pixel/line doubling and optional external memory frame buffers supported, no borders and most likely no mouse sprite either. 1024x600 LCD panels would be possible @ 60Hz on a fast P2, and any higher resolution panels only with lowered refresh rate. HSync & VSync signals would not be generated, only DE.

    3. a two COG solution, allowing multiple text/LUT based gfx regions, pixel doubling and a mouse sprite plus external memory based frame buffers (I am just not yet sure about borders). The same panel size limits as in 2)

    Options 2 and 3 need more development to prove they would work though I'm now pretty confident they could from a software perspective. The P2 BitDAC LVDS output still needs proving however, I'm not sure how good a signal it will be.

  • With option 3, would it be possible for the 'standard' driver to be driving one of the existing display types (e.g. VGA) and the helper COG (suitably synchronized) drive an LVDS panel with the same image?

    If so, that might be an argument for option 3 as the same program could support both display types depending on the hardware setup.

  • roglohrogloh Posts: 5,155
    edited 2021-10-12 01:54

    Yeah that might be possible to get working. Basically mirroring two outputs from the same source. It would need to have the same sync/line frequencies though and that would limit the resolution, particularly if VGA monitors need something over 50Hz. You wouldn't be able to do DVI and LVDS at the same time as the p2:pixel clock ratios are different, being 10:1 and 7:1 respectively.

  • YanomaniYanomani Posts: 1,524
    edited 2021-10-12 06:18

    I really don't know how you'll deal with it, but when it comes to the magic code-shrinking business, you're the One, so here are some viable numbers (I hope):

    • LVDS horizontal total = 1140 pixels / scan line;
    • 1140 x 7 = 7980 clocks / scan line;
    • DVI/VGA horizontal total = 7980 / 10 = 798;
    • DVI horizontal visible = 640 pixels (minimum);
    • LVDS horizontal visible = 960 pixels (minimum);
    • VGA horizontal visible = 640 pixels;
    • DVI/LVDS/VGA vertical total = 600 scan lines;
    • DVI vertical visible = 480 scan lines (minimum);
    • LVDS vertical visible = 480 scan lines (minimum);
    • VGA vertical visible = 480 scan lines;
    • Sysclk ~=285 MHz
    • Hsync ~= 35.7 kHz
    • Vsync ~= 59.5 Hz

    As for Hub memory footprint, it would consume 960 x 480 = 460,800 bytes (63,488 "free" bytes).

    DVI/VGA image would be restricted to a subset window within the LVDS memory footprint (would it support horizontal/vertical panning?).

    Please don't blame me for any weird aspect ratios, and sure, feel free to rubbish it all if it's impossible... :lol:
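Those numbers do check out (a quick illustrative verification in Python):

```python
# Verify the shared-timing and memory figures above.
clocks_per_line = 1140 * 7                 # LVDS htotal x 7 clocks/pixel
dvi_htotal = clocks_per_line // 10         # DVI runs at 10 clocks/pixel
hsync = 285e6 / clocks_per_line            # ~35.7 kHz at sysclk 285 MHz
vsync = hsync / 600                        # ~59.5 Hz over 600 total lines
footprint = 960 * 480                      # 8bpp LVDS-visible framebuffer
print(clocks_per_line, dvi_htotal)         # 7980 798
print(round(hsync), round(vsync, 1))       # 35714 59.5
print(footprint, 512 * 1024 - footprint)   # 460800 63488
```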

  • Everything but the kitchen sink @Yanomani LOL!

    I'm not sure I want to bite that combination off right now, but I suspect we could probably get parallel digital RGB plus VGA output from the main driver, and LVDS from a helper COG, which is three outputs. Timing is critical: the generated horizontal and vertical sync frequencies would need to be fixed/common if the scan line buffer is shared between the two, but perhaps either driver could show a subset of active lines if their blanking values were set differently.

  • roglohrogloh Posts: 5,155
    edited 2021-10-12 06:56

    I wonder if the 2V DAC output with Bit DAC levels set to something around 1.65V and 0.75V (~1.2V common mode, 0.9V swing) would make a good LVDS output? In a differential pair arrangement this should send around 3.6mA into the 100 ohm terminating resistor at the receiver I would expect, given the two 75 ohm source impedances in series with that terminating resistor.

    Is the 4 bit BIT_DAC output granularity 1/15th, or 1/16th, of the DAC voltage range? Does anyone know? If it's 1/16th then these two BITDAC levels are going to be 6 and 13.
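For what it's worth, the nearest codes under both granularity assumptions come out like this (illustrative Python only; which assumption is correct is exactly the open question):

```python
# Nearest bitDAC codes for ~1.65 V and ~0.75 V on a 2 V DAC range.
def nearest_code(volts, vrange=2.0, steps=15):
    """steps=15: $0..$F spans 0..vrange; steps=16: 1/16th granularity."""
    return round(volts / vrange * steps)

print(nearest_code(1.65), nearest_code(0.75))                      # 12 6
print(nearest_code(1.65, steps=16), nearest_code(0.75, steps=16))  # 13 6
# Differential drive: 0.9 V swing across two 75 ohm source impedances
# in series with the 100 ohm termination -> 0.9 / 250 = 3.6 mA
print(round(0.9 / (75 + 75 + 100) * 1000, 1))                      # 3.6
```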

  • evanhevanh Posts: 15,183

    $0 is GIO and $f is VIO, has to be 1/15th of VIO.

  • roglohrogloh Posts: 5,155
    edited 2021-10-12 07:03

    Is that 1/15 of VIO always being 3.3V, or 1/15 of the 2V DAC range?

    I mean if I set the pin mode to DAC output with a reduced 2V range, will the BITDAC granularity be 1/15 of that selected range, or the full 3.3V range? I'm assuming the former makes more sense.

  • evanhevanh Posts: 15,183
    edited 2021-10-12 07:14

    Okay, yeah, 1/15th of 2.0 V when in that mode.

    BTW: Reading those datasheets for the SN65LVDS95 and 93B, it looks to me like the differential drive is spec'd at 3.3 V. So, just use $0 and $f for the bitDAC levels.
