P2 DVI/VGA driver - Page 17 — Parallax Forums

P2 DVI/VGA driver


Comments

  • Tubular Posts: 4,423

    Yes, what evanh said, 1/15th of the full scale range.

    There's data here from earlier testing: the first two readings are GIO, then the 256 DAC levels, followed by two VIO readings
    https://forums.parallax.com/discussion/169269/accurate-dac-reference-data-voltages-from-p2-silicon

    There's about 0.5 mV difference between GIO and outputting DAC 00

  • evanh Posts: 12,022
    edited 2021-10-12 15:14

    @Tubular said:
    There's about 0.5 mV difference between GIO and outputting DAC 00

    Huh, that must be ground rise from general power draw on GIO. As far as I know, there is nothing internal pulling high on the I/O pin when DAC value is $0. All 255 transistors are driving low.

  • Tubular Posts: 4,423

    That's right, I'm pretty sure that's what it is; there's a similar difference between VIO and 255 when driving on the 3v3 ranges

  • rogloh Posts: 3,654

    Had another idea for a different use of a second "helper" COG...

    It would be nice to get even higher resolutions from the P2 so we could feed a 2560x1600 monitor from the P2 using a frame buffer in external memory. I'm thinking I could potentially get two of those cheap VGA to DVI/HDMI converter boxes (which seem to do a good job) and make up a custom dual-link DVI cable for a 30 inch Dell monitor I have that still uses DL-DVI. I already know it would be relatively easy to get two instances of my driver outputting VGA on different pins.

    A setup like this would need two short VGA cables to feed the VGA video into those two converters and also merge their two HDMI outputs into a shared DL-DVI connector and cable attached to the monitor, taking the DVI clock pair from only one of these converters. I'm assuming that the PLLs in each device will lock to the exact same pixel rate anyway given the same sync inputs, so most of the work should be mechanical/connector oriented vs real electronic design. If I keep the connections very short and the same length, then hopefully there won't be major signal integrity issues that kill the concept. I guess if the two PLLs' clocks wander around with respect to one another during the scan line sync pulses it might not work, but I'm hopeful that may not happen if the PLLs are stable enough.

    An 8 pin group from a P2-EVAL breakout connector could output a shared sync H&V pair as well as two independent analog RGB pin triplets from the DACs. Like this:

    • H
    • B0
    • G0
    • R0
    • V
    • B1
    • G1
    • R1

    The 5V pin could power these converter boxes too.

    It is my understanding that for a dual-link DVI monitor the odd/even pixels get split over the two different DVI link groups, so the frame buffer would need to store odd/even pixels separately, which is a small annoyance to the software filling up the frame buffer, unless a third COG could read from these two pixel buffers and interleave the pixels into a merged HUB RAM buffer beforehand. At a combined rate of 268 Mpixels/s that could get rather tricky; I'd need to think about how to do it, and it might not be possible or worth the extra COG. For 60Hz operation each VGA output needs to generate 1280x1600 pixels in a total window of 2720x1646 according to the EDID data from my monitor. So that's 1360x1646 per output (134 Mpixels/s each), which is achievable with a 268 MHz P2:

    Detailed mode: Clock 268.000 MHz, 646 mm x 406 mm
                   2560 2608 2640 2720 hborder 0
                   1600 1603 1609 1646 vborder 0
                   +hsync +vsync 
                   VertFreq: 59 Hz, HorFreq: 98529 Hz
    

    It would be cool to see a 2560x1600 frame buffer coming from the P2. A colour depth of 4bpp should be okay for PSRAM, maybe even 8bpp is possible...
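    As a sanity check, the quoted EDID numbers are self-consistent (a quick host-side calculation in Python, not driver code; the figures are taken straight from the mode line above):

```python
# Sanity check of the EDID detailed-mode numbers quoted above.
htotal, vtotal = 2720, 1646       # total pixels/lines per frame
pixclk = 268_000_000              # dot clock, Hz

hfreq = pixclk / htotal           # horizontal scan rate
vfreq = hfreq / vtotal            # vertical refresh

print(round(hfreq))               # 98529, matching "HorFreq: 98529 Hz"
print(int(vfreq))                 # 59, matching "VertFreq: 59 Hz"

# Each of the two VGA outputs covers a 1360x1646 window, i.e. half the
# dot clock: 134 Mpixel/s per output, 268 Mpixel/s combined.
print(pixclk // 2)                # 134000000
```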

  • AJL Posts: 457

    I'm confused as to why you'd be reading from the two pixel buffers into a merged buffer. Don't you want to go the other way?

    Could the two instances use something like the pixel doubling routine, with each reading from the single frame buffer, and then processing only the odd or even pixel and writing into separate odd and even line buffers to drive the separate channels?

    I confess that I haven't looked at your driver code in a while.

  • rogloh Posts: 3,654
    edited 2021-10-13 07:57

    You are correct @AJL, I had that the wrong way around. I would need to read from a common buffer and split into two different buffers. Ideally at a rate of 1 P2 clock cycle per pixel so it's tough.

    @AJL said:
    Could the two instances use something like the pixel doubling routine, with each reading from the single frame buffer, and then processing only the odd or even pixel and writing into separate odd and even line buffers to drive the separate channels?

    Yeah I need something like that. Ideally that work load could be split over the two COGs somehow.

  • AJL Posts: 457

    If you halve the bit depth of what you can already achieve, you can read the full line into both drivers as normal, and then with patched code only process odd/even pixels as appropriate. Ultimately you’ll be limited by the cog to hub RAM bandwidth, right?

  • rogloh Posts: 3,654
    edited 2021-10-13 12:31

    Wow, this de-interleaving of odd and even pixels is almost doable for 8bpp using the video COGs themselves...

    If two driver COGs are used and each processes half the scan line I think we have a budget of 2 clocks per pixel on each COG with the 2560x1600 timing above.

    Using the approach below I would read 32 longs into COG RAM and build up two output buffers (one in COGRAM and one in LUTRAM), then write back these two buffers to separate HUB addresses, splitting the odd/even pixels in the process. The inner executable loop code is unrolled 16 times in LUT RAM and would use hard coded COG addresses.

    The sample code processes 8 pixels in 6 instructions (12 clocks), plus it needs two clocks to read in 8 pixels and two clocks to write out 8 pixels (ignoring overheads). That equates to 16 clocks for 8 pixels, which is the entire budget, so it would be really nice to shave another instruction off to pay for the loop and transfer setup overheads which this code doesn't account for. Can that be done somehow, perhaps with a muxq or both output buffers using LUTRAM, or something else? Any clever ideas with P2 instructions? Having this de-interleaving would be quite useful.

    I could probably double the unrolling again to work on 64 longs (256 pixels) at a time or so for higher efficiency. There would be space for that amount of unrolling in my driver. But we still have the other overheads to account for somehow...

    rep @loopend, #10 ' do 10 loops of 128 pixels per COG (1280 pixels each) for 2560x1600
    ' read some pixels
    setq #32-1
    rdlong 0, linebuf ' read a batch of pixels from HUB to be de-interleaved into the COGRAM input buffer (shown at address 0 here)
    add linebuf, #128 ' increase source buffer address
    
    movbyts 0, #%%3120  ' pixels 3210 becomes 3120
    movbyts 1, #%%2031  ' pixels 7654 becomes 6475
    mov outbuf, 1 ' output COGRAM buffer starts at outbuf
    rolword outbuf, 0, #1 ' odd pixels 7531
    setword 1, 0, #0 ' even pixels 6420
    wrlut 1, ptra++   ' save to LUTRAM output buffer
    
    ' ... repeat 15 more times in an unrolled loop before writing back both buffers
    
    setq #16-1
    wrlong outbuf, outlinebuf1 ' output line buffer 1 from COGRAM to HUB
    setq2 #16-1
    wrlong outbuf2-$200, outlinebuf2 ' output line buffer 2 from LUTRAM to HUB
    add outlinebuf1, #64 ' increase buffer address
    add outlinebuf2, #64 ' increase buffer address
    loopend
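    To double-check the shuffle, here is a host-side Python model of the six-instruction inner sequence (an illustrative sketch, not P2 code; it assumes the documented MOVBYTS/ROLWORD/SETWORD semantics, with the base-4 literal %%3120 written as 0b11_01_10_00):

```python
# Python model of the 8bpp odd/even de-interleave from the loop above.
M32 = 0xFFFFFFFF

def movbyts(x, pat):            # pat = %%dcba, 2 bits per destination byte
    b = [(x >> (8 * i)) & 0xFF for i in range(4)]
    return sum(b[(pat >> (2 * i)) & 3] << (8 * i) for i in range(4))

def rolword(d, s, n):           # D = {D[15:0] shifted up, word N of S}
    return ((d << 16) | ((s >> (16 * n)) & 0xFFFF)) & M32

def setword(d, s, n):           # word N of D = S[15:0]
    mask = 0xFFFF << (16 * n)
    return (d & ~mask & M32) | ((s & 0xFFFF) << (16 * n))

r0, r1 = 0x03020100, 0x07060504   # pixels 3..0 and 7..4, one byte each
r0 = movbyts(r0, 0b11_01_10_00)   # %%3120: pixels 3210 -> 3120
r1 = movbyts(r1, 0b10_00_11_01)   # %%2031: pixels 7654 -> 6475
outbuf = rolword(r1, r0, 1)       # odd pixels 7 5 3 1
r1 = setword(r1, r0, 0)           # even pixels 6 4 2 0
print(hex(outbuf), hex(r1))       # 0x7050301 0x6040200
```

    The two movbyts results match the comments in the PASM2 above, and the final pair splits cleanly into odd (7,5,3,1) and even (6,4,2,0) pixel bytes.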
    
  • rogloh Posts: 3,654
    edited 2021-10-13 12:49

    @AJL said:
    If you halve the bit depth of what you can already achieve you can read the full line into both drivers as normal, and then with patched code only process odd /even pixels as appropriate. Ultimately you’ll be limited by the cog to hubram bandwidth, right?

    Halving the bit depth to 4bpp might make de-interleaving slower... 8bpp might be a sweet spot because of movbyts. It's the pixel count that matters more here, less so the bandwidth.

  • rogloh Posts: 3,654

    To give an idea of the overheads above, outside the processing and transfer cycles there are 3 SETQs, 3 ADDs, 2 WRLONGs and one RDLONG which account for a worst case time of 45 clocks. This is done 10 times in the code above making 450 overhead clocks in the loop. I could have that down to 225 if I double my unroll. These excess cycles would have to fit within the blanking portion available to me which is going to be quite a lot less than 160 P2 clock cycles. So I think I need to shave another instruction to make it fit. :s
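    Restating that budget as arithmetic (a sketch using the worst-case figures assumed above, not measured numbers):

```python
# Loop-overhead budget from the post above (assumed worst-case figures).
overhead_per_iter = 45          # 3 SETQs + 3 ADDs + 2 WRLONGs + 1 RDLONG
print(overhead_per_iter * 10)   # 450 clocks: 10 iterations of 128 pixels
print(overhead_per_iter * 5)    # 225 clocks: 5 iterations if unroll doubled

# The shuffle itself consumes the whole 2-clocks-per-pixel budget, so the
# overhead has to hide in horizontal blanking: fewer than 160 P2 clocks.
blanking_budget = 160
assert overhead_per_iter * 5 > blanking_budget  # still doesn't fit as-is
```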

  • AJL Posts: 457

    So, is this still doing regions, text, pixel doubling, and a mouse pointer?
    Or is this purely byte shuffling?

  • TonyB_ Posts: 1,771
    edited 2021-10-13 23:26

    @rogloh said:
    To give an idea of the overheads above, outside the processing and transfer cycles there are 3 SETQs, 3 ADDs, 2 WRLONGs and one RDLONG which account for a worst case time of 45 clocks. This is done 10 times in the code above making 450 overhead clocks in the loop. I could have that down to 225 if I double my unroll. These excess cycles would have to fit within the blanking portion available to me which is going to be quite a lot less than 160 P2 clock cycles. So I think I need to shave another instruction to make it fit. :s

    After the first RDLONG all subsequent timing is predictable. Assuming as here that the number of longs for each SETQ+RDLONG/WRLONG is a multiple of 8 (every fast block move starts with same hub RAM eggbeater slice), then I think the fastest execution speed will occur if timing is exactly as follows:

    1. Between RDLONG and WRLONG = 8N cycles (0,8,16,...)
    2. Between WRLONG and WRLONG = 6 cycles
    3. Between WRLONG and RDLONG = 6 cycles

    If so then all of the RDLONGS (after the first) and WRLONGs will take the minimum cycles possible.

  • rogloh Posts: 3,654
    edited 2021-10-13 22:10

    @AJL said:
    So, is this still doing regions, text, pixel doubling, and a mouse pointer?
    Or is this purely byte shuffling?

    As it stands there are no free cycles for any of those extras like the mouse, doubling and multiple regions, but that's okay. Text would be an alternative to graphics but would probably be a different effort (it could be monochrome only), or there might be no text. This might just be a customized "special" driver, not the mainstream one, but it would be nice to have it work with only two COGs. It could probably work without the de-interleave function we are talking about, but if de-interleaving were possible to add that would be a real bonus. I'd still have to find space for the ~14 instructions needed to issue the request to external RAM too. If I can shave an instruction to get it below 16 clocks per 8 pixels it might work, but I'm not sure if that is achievable without a clever trick.

  • rogloh Posts: 3,654
    edited 2021-10-13 22:08

    @TonyB_ said:
    ...
    Second WRLONG will take one cycle more than minimum as SETQ2 + ADD take only 4 cycles. Similarly for next RDLONG after second WRLONG.

    Ok, this helps a bit more. So instead of the 45 extra clocks per loop iteration I assumed in the worst case, it could in reality be more like 26 once the loop gets going?

    EDIT: Doh! I was also forgetting about the simultaneous FIFO/streamer overhead which is running in parallel to this workload. I think we will lose one hub transfer in 8 if we are streaming 8bpp pixels out at sysclk/2. This means that we have to find a trick for sure.

  • TonyB_ Posts: 1,771
    edited 2021-10-13 22:15

    @rogloh said:

    @TonyB_ said:
    ...
    Second WRLONG will take one cycle more than minimum as SETQ2 + ADD take only 4 cycles. Similarly for next RDLONG after second WRLONG.

    Ok, this helps a bit more. So instead of the 45 extra clocks per loop iteration I assumed in the worst case, it could in reality be more like 26 once the loop gets going?

    I find it easier to think in terms of the extra cycles due to eggbeater waits. If you could ensure exactly 8N cycles between RDLONG and first WRLONG there should be only two wait cycles for each loop (only one cycle for last loop) with best-case timings for RDLONG/WRLONG.

    N.B. Timings are theoretical and ignore streamer.

  • evanh Posts: 12,022
    edited 2021-10-13 22:22

    Yeah, any streamer requests of that cog's FIFO will mess up the burst timing. The FIFO will, as needed, stall the ALU from accessing hub RAM.

  • rogloh Posts: 3,654
    edited 2021-10-13 22:38

    As an idea I was thinking I might be able to somehow make use of the alt instruction variant that alters the destination register so I don't clobber the data... but can that ever save a cycle if it costs one? This sequence below is what I need to reduce by one instruction. Maybe I'm all out of tricks now.

        movbyts 0, #%%3120  ' pixels 3210 becomes 3120
        movbyts 1, #%%2031  ' pixels 7654 becomes 6475
        mov outbuf, 1 ' output COGRAM buffer starts at outbuf
        rolword outbuf, 0, #1 ' odd pixels 7531
        setword 1, 0, #0 ' even pixels 6420
        wrlut 1, ptra++   ' save to LUTRAM output buffer
    
    
  • evanh Posts: 12,022
    edited 2021-10-13 22:45

    The only chance of any ALTx instruction saving execution time, versus unrolled code, is to use ALTI with at least two incrementing operations. If there is no elimination by doing the increments then it won't be faster.

    ALTI can potentially do three increments by working on S,D and R operands at once. R is an amusing case because there is no R field to replace. Which means it has different handling to S and D fields.

  • TonyB_ Posts: 1,771
    edited 2021-10-14 18:35

    Quickest time for a fast block write of 32 longs is 34 cycles, therefore shortest time between two 32-long writes starting at same slice is 6 cycles *, not 5 as I said earlier. Lowest multiple of 8 above 34 is 40 and 40 - 34 = 6. * Same for any 8N block size.

  • rogloh Posts: 3,654
    edited 2021-10-14 02:47

    For a 2560x1600 dual COG setup I believe I'll have to live with using odd/even pixel split external memory layouts, as I just can't figure out a way to shrink the processing code to save another instruction, particularly if I have to use the video timing above for my monitor to sync to it. I know it's very fussy: there is no scaler and I think it only supports 60Hz refresh (need to check).

    I guess all this is moot anyway until I prove I can use two of those VGA-DVI converter boxes in parallel.

    I also looked a bit at ALTx and MUXQ and they can't really help solve it either, from what I can tell. I guess I'd need to include a third COG to share some of the processing, but that itself then messes up the balancing of the workload. So I think I have finally reached the limits of the P2 and my code now. 2560x1600 with dual link will need to have odd/even pixels separated, which I know is not as convenient, but whatever...

    There's probably not enough PSRAM bandwidth anyway for 8bpp given the short blanking period available for hiding the memory driver's own overheads. 4bpp and lower depths might be the only supported formats and de-interleaving those could be even harder. I still want to see if I can at least get something working there with DL-DVI. Even 16 colour text from a 2560x1600 framebuffer could be useful. A 320x200 character text display might be fun to see with an 8x8 font. That's a lot of characters.

  • rogloh Posts: 3,654
    edited 2021-10-14 03:13

    Actually 4bpp de-interleave might be doable if I can process 16 pixels in say 28 clocks or so. I would just need a way to transform 8 packed nibbles from this:

    76543210

    into this:

    75316420

    in ~3-4 instructions or so. Maybe those split/merge ops can help there with movbyts or some such thing. I could unroll a sequence using this up to 16 times if the total count was 12 instructions or less.

    The split/merge stuff is shown here:
    https://forums.parallax.com/discussion/comment/1479691/#Comment_1479691

  • rogloh Posts: 3,654
    edited 2021-10-14 08:40

    Spent a couple more hours and I think I finally found a sequence below that could work for de-interleaving pixels at 2560x1600x4bpp with external memory. It uses mergew, splitw and movbyts to split the 16 nibbles per long pair into odd/even pixel groups in different buffers. There might be a shorter way to do this, but this way already looks promising if I have it correct...

        mergew  0          'long contains nibbles in this order 76543210
        movbyts 0, #%%3120
        splitw  0
        movbyts 0, #%%3120 'long now contains nibbles in this order 75316420
        mergew  1
        movbyts 1, #%%3120
        splitw  1
        movbyts 1, #%%3120
        getword buf, 1, #1 ' get high odd pixels
        rolword buf, 0, #1 ' rotate in the low odd pixels to the odd pixel buffer in COGRAM
        rolword 1, 0, #0   ' rolword in low even pixels to high even pixels
        wrlut   1, ptra++  ' store to even pixel buffer in LUTRAM
    
     ' unroll this code 16 times, consumes 192 longs of instruction space in LUTRAM buffer area (OK)
     ' repeat the unrolled sequence 5 times for processing 1280 pixels total per COG
     ' use a 32 long COGRAM pixel input buffer holding 256 pixels to be de-interleaved per iteration
     ' use a 16 long COGRAM output buffer for odd pixels
     ' use a 16 long LUTRAM output buffer for even pixels
     ' expect worst case loop overhead of 45 clocks per iteration, although the read/write bursting is also impacted by 1 streamer long transfer every 16 clocks.
     ' 12*2 + 4 = 28 clocks per 16 pixels read/written (ignoring lost cycles due to streamer)
     ' 5 * (45 + 16*28) = 2465 < 2560 budget by a good margin.  This should fit even with the streamer transfer impact AFAICT. 
    
  • rogloh Posts: 3,654

    First experiment on the 30 inch 2560x1600 Dell monitor with a P2 driving it worked out well. Note that this is NOT yet dual-link: it is 1280x800 VGA into an HDMI adapter, then into a dual-link input (running as single link). It gives me hope I may be able to use this adapter, although there will eventually be 1600 lines sent instead of 800 and that might push the adapter too far, so that's still TBD. Once the 6th lockdown is over (hopefully soon) I plan to buy a dual-link cable, hack its plug off, and solder wires to a couple of these cheap HDMI breakouts, which will then plug into my CM201 VGA to DVI adapters... it will be a cheap and nasty way to test it. Who cares about signal integrity... lol.
    https://www.jaycar.com.au/gold-plated-pcb-mount-hdmi-plug/p/PP0941
    Good thing is, even if the dual-link experiment doesn't work out, I now know I can still run this nice big monitor at 1280x800 from a P2 and it looks pretty cool.

  • AJL Posts: 457

    Nice work :smile:

  • rogloh Posts: 3,654
    edited 2021-10-15 06:41

    @AJL said:
    Nice work :smile:

    Cheers, but I'm only part of the way there...

    I have a feeling that this VGA to HDMI adapter is not going to like 1600 visible lines per frame; it probably wasn't designed to expect more than 1200 lines. That would be a show stopper if it can't support 1600 lines. Also, if the PLL doesn't stay stable between common horizontal sync pulses on both adapters, then there could be odd or even pixel corruption at the monitor, or loss of sync on one link, which will upset things. I can only take the dual-link DVI clock from one of these adapters, and I have to assume the other adapter outputs the exact same clock timing, which of course it may not. So a lot of things have to go right for this to even be doable. But it's fun to try.
