P2 DVI/VGA driver - Page 17 — Parallax Forums

P2 DVI/VGA driver


Comments

  • Tubular Posts: 4,423

    Yes, what evanh said, 1/15th of the full scale range.

    There's data here from earlier testing: the first two readings are GIO, then the 256 DAC levels, followed by two VIO readings
    https://forums.parallax.com/discussion/169269/accurate-dac-reference-data-voltages-from-p2-silicon

    There's about 0.5 mV difference between GIO and outputting DAC 00

  • evanh Posts: 12,022
    edited 2021-10-12 15:14

    @Tubular said:
    There's about 0.5 mV difference between GIO and outputting DAC 00

    Huh, that must be ground rise from general power draw on GIO. As far as I know, there is nothing internal pulling high on the I/O pin when DAC value is $0. All 255 transistors are driving low.

  • Tubular Posts: 4,423

    That's right, I'm pretty sure that's what it is; there's a similar difference between VIO and 255 when driving on the 3v3 ranges

  • rogloh Posts: 3,654

    Had another idea for a different use of a second "helper" COG...

    It would be nice to get even higher resolutions from the P2 so we could feed a 2560x1600 monitor from the P2 using a frame buffer in external memory. I'm thinking I could potentially get two of those cheap VGA to DVI/HDMI converter boxes (which seem to do a good job) and make up a custom dual-link DVI cable for a 30 inch Dell monitor I have that still uses DL-DVI. I already know it would be relatively easy to get two instances of my driver outputting VGA on different pins.

    A setup like this would need two short VGA cables to feed the VGA video into those two converters and also merge their two HDMI outputs into a shared DL-DVI connector and cable attached to the monitor, taking the DVI clock pair from only one of these converters. I'm assuming that the PLLs in each device will lock to the exact same pixel rate anyway given the same sync inputs, so most of the work should be mechanical/connector oriented vs real electronic design. If I keep the connections very short and the same length, then hopefully there won't be major signal integrity issues that kill the concept. I guess if the two PLLs' clocks wander around with respect to one another during the scan line sync pulses it might not work, but I'm hopeful that may not happen if the PLLs are stable enough.

    An 8 pin group from a P2-EVAL breakout connector could output a shared sync H&V pair as well as two independent analog RGB pin triplets from the DACs. Like this:

    • H
    • B0
    • G0
    • R0
    • V
    • B1
    • G1
    • R1

    The 5V pin could power these converter boxes too.

    It is my understanding that for a dual-link DVI monitor the odd/even pixels get split over the two different DVI link groups, so the frame buffer would need to store odd/even pixels separately, which is a small annoyance to the software filling up the frame buffer, unless a third COG could read from these two pixel buffers and interleave the pixels into a merged HUB RAM buffer beforehand. At a combined rate of 268 Mpixels/s that could get rather tricky; I'd need to think about how to do it, and it might not be possible or worth the extra COG. For 60Hz operation each VGA output needs to generate 1280x1600 pixels in a total window of 2720x1646 according to the EDID data from my monitor. So that's 1360x1646 per output (134 Mpixels/s each), which is achievable with a 268 MHz P2:

    Detailed mode: Clock 268.000 MHz, 646 mm x 406 mm
                   2560 2608 2640 2720 hborder 0
                   1600 1603 1609 1646 vborder 0
                   +hsync +vsync 
                   VertFreq: 59 Hz, HorFreq: 98529 Hz
    

    It would be cool to see a 2560x1600 frame buffer coming from the P2. A colour depth of 4bpp should be okay for PSRAM, maybe even 8bpp is possible...
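    As a sanity check, the quoted EDID numbers are self-consistent (a quick host-side calculation in Python, not driver code; the figures are taken straight from the mode line above):

```python
# Sanity check of the EDID detailed-mode numbers quoted above.
htotal, vtotal = 2720, 1646       # total pixels/lines per frame
pixclk = 268_000_000              # dot clock, Hz

hfreq = pixclk / htotal           # horizontal scan rate
vfreq = hfreq / vtotal            # vertical refresh

print(round(hfreq))               # 98529, matching "HorFreq: 98529 Hz"
print(int(vfreq))                 # 59, matching "VertFreq: 59 Hz"

# Each of the two VGA outputs covers a 1360x1646 window, i.e. half the
# dot clock: 134 Mpixel/s per output, 268 Mpixel/s combined.
print(pixclk // 2)                # 134000000
```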

  • AJL Posts: 457

    I'm confused as to why you'd be reading from the two pixel buffers into a merged buffer. Don't you want to go the other way?

    Could the two instances use something like the pixel doubling routine, with each reading from the single frame buffer, and then processing only the odd or even pixel and writing into separate odd and even line buffers to drive the separate channels?

    I confess that I haven't looked at your driver code in a while.

  • rogloh Posts: 3,654
    edited 2021-10-13 07:57

    You are correct @AJL, I had that the wrong way around. I would need to read from a common buffer and split into two different buffers. Ideally at a rate of 1 P2 clock cycle per pixel so it's tough.

    @AJL said:
    Could the two instances use something like the pixel doubling routine, with each reading from the single frame buffer, and then processing only the odd or even pixel and writing into separate odd and even line buffers to drive the separate channels?

    Yeah I need something like that. Ideally that work load could be split over the two COGs somehow.

  • AJL Posts: 457

    If you halve the bit depth of what you can already achieve, you can read the full line into both drivers as normal, and then with patched code only process odd/even pixels as appropriate. Ultimately you’ll be limited by the cog to hub RAM bandwidth, right?

  • rogloh Posts: 3,654
    edited 2021-10-13 12:31

    Wow, this de-interleaving of odd and even pixels is almost doable for 8bpp using the video COGs themselves...

    If two driver COGs are used and each processes half the scan line I think we have a budget of 2 clocks per pixel on each COG with the 2560x1600 timing above.

    Using the approach below I would read 32 longs into COG RAM and build up two output buffers (one in COGRAM and one in LUTRAM), then write back these two buffers to separate HUB addresses, splitting the odd/even pixels in the process. The inner executable loop code is unrolled 16 times in LUT RAM and would use hard coded COG addresses.

    The sample code processes 8 pixels in 6 instructions (12 clocks), plus it needs two clocks to read in 8 pixels and two clocks to write out 8 pixels (ignoring overheads). That equates to 16 clocks for 8 pixels, which is the entire budget, so it would be really nice to shave another instruction off to pay for the loop and transfer setup overheads which this code doesn't account for. Can that be done somehow, perhaps with a muxq or both output buffers using LUTRAM, or something else? Any clever ideas with P2 instructions? Having this de-interleaving would be quite useful.

    I could probably double the unrolling again to work on 64 longs (256 pixels) at a time or so for higher efficiency. There would be space for that amount of unrolling in my driver. But we still have the other overheads to account for somehow...

    rep @loopend, #10 ' do 10 loops of 128 pixels per COG (1280 pixels each) for 2560x1600
    ' read some pixels
    setq #32-1
    rdlong 0, linebuf ' read a batch of pixels from HUB to be de-interleaved into the COGRAM input buffer (shown at address 0 here)
    add linebuf, #128 ' increase source buffer address
    
    movbyts 0, #%%3120  ' pixels 3210 becomes 3120
    movbyts 1, #%%2031  ' pixels 7654 becomes 6475
    mov outbuf, 1 ' output COGRAM buffer starts at outbuf
    rolword outbuf, 0, #1 ' odd pixels 7531
    setword 1, 0, #0 ' even pixels 6420
    wrlut 1, ptra++   ' save to LUTRAM output buffer
    
    ' ... repeat 15 more times in an unrolled loop before writing back both buffers
    
    setq #16-1
    wrlong outbuf, outlinebuf1 ' output line buffer 1 from COGRAM to HUB
    setq2 #16-1
    wrlong outbuf2-$200, outlinebuf2 ' output line buffer 2 from LUTRAM to HUB
    add outlinebuf1, #64 ' increase buffer address
    add outlinebuf2, #64 ' increase buffer address
    loopend
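    To double-check the shuffle, here is a host-side Python model of the six-instruction inner sequence (an illustrative sketch, not P2 code; it assumes the documented MOVBYTS/ROLWORD/SETWORD semantics, with the base-4 literal %%3120 written as 0b11_01_10_00):

```python
# Python model of the 8bpp odd/even de-interleave from the loop above.
M32 = 0xFFFFFFFF

def movbyts(x, pat):            # pat = %%dcba, 2 bits per destination byte
    b = [(x >> (8 * i)) & 0xFF for i in range(4)]
    return sum(b[(pat >> (2 * i)) & 3] << (8 * i) for i in range(4))

def rolword(d, s, n):           # D = {D[15:0] shifted up, word N of S}
    return ((d << 16) | ((s >> (16 * n)) & 0xFFFF)) & M32

def setword(d, s, n):           # word N of D = S[15:0]
    mask = 0xFFFF << (16 * n)
    return (d & ~mask & M32) | ((s & 0xFFFF) << (16 * n))

r0, r1 = 0x03020100, 0x07060504   # pixels 3..0 and 7..4, one byte each
r0 = movbyts(r0, 0b11_01_10_00)   # %%3120: pixels 3210 -> 3120
r1 = movbyts(r1, 0b10_00_11_01)   # %%2031: pixels 7654 -> 6475
outbuf = rolword(r1, r0, 1)       # odd pixels 7 5 3 1
r1 = setword(r1, r0, 0)           # even pixels 6 4 2 0
print(hex(outbuf), hex(r1))       # 0x7050301 0x6040200
```

    The two movbyts results match the comments in the PASM2 above, and the final pair splits cleanly into odd (7,5,3,1) and even (6,4,2,0) pixel bytes.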
    
  • rogloh Posts: 3,654
    edited 2021-10-13 12:49

    @AJL said:
    If you halve the bit depth of what you can already achieve you can read the full line into both drivers as normal, and then with patched code only process odd /even pixels as appropriate. Ultimately you’ll be limited by the cog to hubram bandwidth, right?

    Halving the bit depth to 4bpp might make de-interleaving slower... 8bpp might be a sweet spot because of movbyts. It's the pixel count that matters more here, less so the bandwidth.

  • rogloh Posts: 3,654

    To give an idea of the overheads above, outside the processing and transfer cycles there are 3 SETQs, 3 ADDs, 2 WRLONGs and one RDLONG which account for a worst case time of 45 clocks. This is done 10 times in the code above making 450 overhead clocks in the loop. I could have that down to 225 if I double my unroll. These excess cycles would have to fit within the blanking portion available to me which is going to be quite a lot less than 160 P2 clock cycles. So I think I need to shave another instruction to make it fit. :s
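    Restating that budget as arithmetic (a sketch using the worst-case figures assumed above, not measured numbers):

```python
# Loop-overhead budget from the post above (assumed worst-case figures).
overhead_per_iter = 45          # 3 SETQs + 3 ADDs + 2 WRLONGs + 1 RDLONG
print(overhead_per_iter * 10)   # 450 clocks: 10 iterations of 128 pixels
print(overhead_per_iter * 5)    # 225 clocks: 5 iterations if unroll doubled

# The shuffle itself consumes the whole 2-clocks-per-pixel budget, so the
# overhead has to hide in horizontal blanking: fewer than 160 P2 clocks.
blanking_budget = 160
assert overhead_per_iter * 5 > blanking_budget  # still doesn't fit as-is
```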

  • AJL Posts: 457

    So, is this still doing regions, text, pixel doubling, and a mouse pointer?
    Or is this purely byte shuffling?

  • TonyB_ Posts: 1,771
    edited 2021-10-13 23:26

    @rogloh said:
    To give an idea of the overheads above, outside the processing and transfer cycles there are 3 SETQs, 3 ADDs, 2 WRLONGs and one RDLONG which account for a worst case time of 45 clocks. This is done 10 times in the code above making 450 overhead clocks in the loop. I could have that down to 225 if I double my unroll. These excess cycles would have to fit within the blanking portion available to me which is going to be quite a lot less than 160 P2 clock cycles. So I think I need to shave another instruction to make it fit. :s

    After the first RDLONG all subsequent timing is predictable. Assuming as here that the number of longs for each SETQ+RDLONG/WRLONG is a multiple of 8 (every fast block move starts with same hub RAM eggbeater slice), then I think the fastest execution speed will occur if timing is exactly as follows:

    1. Between RDLONG and WRLONG = 8N cycles (0,8,16,...)
    2. Between WRLONG and WRLONG = 6 cycles
    3. Between WRLONG and RDLONG = 6 cycles

    If so then all of the RDLONGS (after the first) and WRLONGs will take the minimum cycles possible.

  • rogloh Posts: 3,654
    edited 2021-10-13 22:10

    @AJL said:
    So, is this still doing regions, text, pixel doubling, and a mouse pointer?
    Or is this purely byte shuffling?

    As it stands there are no free cycles for any of those extras like the mouse, doubling and multiple regions, but that's okay. Text would be an alternative to graphics but would probably be a different effort (it could be monochrome only), or there might be no text. This might just be a customized "special" driver, not the mainstream one, but it would be nice to have it work with only two COGs. It could probably work without the de-interleave function we are talking about, but if de-interleaving were possible to add that would be a real bonus. I'd still have to find space for the ~14 instructions needed to issue the request to external RAM too. If I can shave an instruction to get it below 16 clocks per 8 pixels it might work, but I'm not sure if that is achievable without a clever trick.

  • rogloh Posts: 3,654
    edited 2021-10-13 22:08

    @TonyB_ said:
    ...
    Second WRLONG will take one cycle more than minimum as SETQ2 + ADD take only 4 cycles. Similarly for next RDLONG after second WRLONG.

    Ok, this helps a bit more. So instead of the 45 extra clocks per loop iteration I assumed in the worst case, it could in reality be more like 26 once the loop gets going?

    EDIT: Doh! I was also forgetting about the simultaneous FIFO/streamer overhead which is running in parallel to this workload. I think we will lose one hub transfer in 8 if we are streaming 8bpp pixels out at sysclk/2. This means that we have to find a trick for sure.

  • TonyB_ Posts: 1,771
    edited 2021-10-13 22:15

    @rogloh said:

    @TonyB_ said:
    ...
    Second WRLONG will take one cycle more than minimum as SETQ2 + ADD take only 4 cycles. Similarly for next RDLONG after second WRLONG.

    Ok, this helps a bit more. So instead of the 45 extra clocks per loop iteration I assumed in the worst case, it could in reality be more like 26 once the loop gets going?

    I find it easier to think in terms of the extra cycles due to eggbeater waits. If you could ensure exactly 8N cycles between RDLONG and first WRLONG there should be only two wait cycles for each loop (only one cycle for last loop) with best-case timings for RDLONG/WRLONG.

    N.B. Timings are theoretical and ignore streamer.

  • evanh Posts: 12,022
    edited 2021-10-13 22:22

    Yeah, any streamer requests of that cog's FIFO will mess up the burst timing. The FIFO will, as needed, stall the ALU from accessing hub RAM.

  • rogloh Posts: 3,654
    edited 2021-10-13 22:38

    As an idea I was thinking I might be able to somehow make use of the alt instruction variant that alters the destination register so I don't clobber the data... but can that ever save a cycle if it costs one? This sequence below is what I need to reduce by one instruction. Maybe I'm all out of tricks now.

        movbyts 0, #%%3120  ' pixels 3210 becomes 3120
        movbyts 1, #%%2031  ' pixels 7654 becomes 6475
        mov outbuf, 1 ' output COGRAM buffer starts at outbuf
        rolword outbuf, 0, #1 ' odd pixels 7531
        setword 1, 0, #0 ' even pixels 6420
        wrlut 1, ptra++   ' save to LUTRAM output buffer
    
    
  • evanh Posts: 12,022
    edited 2021-10-13 22:45

    The only chance of any ALTx instruction saving execution time, versus unrolled code, is to use ALTI with at least two incrementing operations. If there is no elimination by doing the increments then it won't be faster.

    ALTI can potentially do three increments by working on S,D and R operands at once. R is an amusing case because there is no R field to replace. Which means it has different handling to S and D fields.

  • TonyB_ Posts: 1,771
    edited 2021-10-14 18:35

    Quickest time for a fast block write of 32 longs is 34 cycles, therefore shortest time between two 32-long writes starting at same slice is 6 cycles *, not 5 as I said earlier. Lowest multiple of 8 above 34 is 40 and 40 - 34 = 6. * Same for any 8N block size.

  • rogloh Posts: 3,654
    edited 2021-10-14 02:47

    For a 2560x1600 dual COG setup I believe I'll have to live with using odd/even pixel split external memory layouts, as I just can't figure out a way to shrink the processing code to save another instruction, particularly if I have to use the video timing above for my monitor to sync to it. I know it's very fussy: there is no scaler and I think it only supports 60Hz refresh (need to check).

    I guess all this is moot anyway until I prove I can use two of those VGA-DVI converter boxes in parallel.

    I also looked a bit at ALTx and MUXQ and they can't really help solve it either, from what I can tell. I guess I'd need to include a third COG to share some of the processing, but that itself then messes up the balancing of the workload. So I think I have finally reached the limits of the P2 and my code now. 2560x1600 with dual link will need to have odd/even pixels separated, which I know is not as convenient, but whatever...

    There's probably not enough PSRAM bandwidth anyway for 8bpp given the short blanking period available for hiding the memory driver's own overheads. 4bpp and lower depths might be the only supported formats and de-interleaving those could be even harder. I still want to see if I can at least get something working there with DL-DVI. Even 16 colour text from a 2560x1600 framebuffer could be useful. A 320x200 character text display might be fun to see with an 8x8 font. That's a lot of characters.

  • rogloh Posts: 3,654
    edited 2021-10-14 03:13

    Actually 4bpp de-interleave might be doable if I can process 16 pixels in say 28 clocks or so. I would just need a way to transform 8 packed nibbles from this:

    76543210

    into this:

    75316420

    in ~3-4 instructions or so. Maybe those split/merge ops can help there with movbyts or some such thing. I could unroll a sequence using this up to 16 times if the total count was 12 instructions or less.

    The split/merge stuff is shown here:
    https://forums.parallax.com/discussion/comment/1479691/#Comment_1479691

  • rogloh Posts: 3,654
    edited 2021-10-14 08:40

    Spent a couple more hours and I think I finally found a sequence below that could work for de-interleaving pixels at 2560x1600x4bpp with external memory. It uses mergew, splitw and movbyts to split the 16 nibbles per long pair into odd/even pixel groups in different buffers. There might be a shorter way to do this, but this way already looks promising if I have it correct...

        mergew  0          'long contains nibbles in this order 76543210
        movbyts 0, #%%3120
        splitw  0
        movbyts 0, #%%3120 'long now contains nibbles in this order 75316420
        mergew  1
        movbyts 1, #%%3120
        splitw  1
        movbyts 1, #%%3120
        getword buf, 1, #1 ' get high odd pixels
        rolword buf, 0, #1 ' rotate in the low odd pixels to the odd pixel buffer in COGRAM
        rolword 1, 0, #0   ' rolword in low even pixels to high even pixels
        wrlut   1, ptra++  ' store to even pixel buffer in LUTRAM
    
     ' unroll this code 16 times, consumes 192 longs of instruction space in LUTRAM buffer area (OK)
     ' repeat the unrolled sequence 5 times for processing 1280 pixels total per COG
     ' use a 32 long COGRAM pixel input buffer holding 256 pixels to be de-interleaved per iteration
     ' use a 16 long COGRAM output buffer for odd pixels
     ' use a 16 long LUTRAM output buffer for even pixels
     ' expect worst case loop overhead of 45 clocks per iteration, although the read/write bursting is also impacted by 1 streamer long transfer every 16 clocks.
     ' 12*2 + 4 = 28 clocks per 16 pixels read/written (ignoring lost cycles due to streamer)
     ' 5 * (45 + 16*28) = 2465 < 2560 budget by a good margin.  This should fit even with the streamer transfer impact AFAICT. 
    
  • rogloh Posts: 3,654

    First experiment on the 30 inch 2560x1600 Dell monitor with a P2 driving it worked out well. Note that this is NOT yet dual-link: it is 1280x800 VGA into an HDMI adapter, then into a dual-link input (running as single link). It gives me hope I may be able to use this adapter, although there will eventually be 1600 lines sent instead of 800 and that might push the adapter too far, so that's still TBD. Once the 6th lockdown is over (hopefully soon) I plan to buy a dual-link cable, hack its plug off, and solder wires to a couple of these cheap HDMI breakouts, which will then plug into my CM201 VGA to DVI adapters... it will be a cheap and nasty way to test it. Who cares about signal integrity... lol.
    https://www.jaycar.com.au/gold-plated-pcb-mount-hdmi-plug/p/PP0941
    Good thing is, even if the dual-link experiment doesn't work out, I now know I can still run this nice big monitor at 1280x800 from a P2 and it looks pretty cool.

  • AJL Posts: 457

    Nice work :smile:

  • rogloh Posts: 3,654
    edited 2021-10-15 06:41

    @AJL said:
    Nice work :smile:

    Cheers, but I'm only part of the way there...

    I have a feeling that this VGA to HDMI adapter is not going to like 1600 visible lines per frame; it probably wasn't designed to expect more than 1200 lines. That would be a show stopper if it can't support 1600 lines. Also, if the PLL doesn't stay stable between common horizontal sync pulses on both adapters, then there could be odd or even pixel corruption at the monitor, or loss of sync on one link, which will upset things. I can only take the dual-link DVI clock from one of these adapters, and I have to assume the other adapter outputs the exact same clock timing, which of course it may not. So a lot of things have to go right for this to even be doable. But it's fun to try.
