HDMI discussion — Parallax Forums

HDMI discussion

This discussion was created from comments split from: Console Emulation.

Comments

  • rogloh Posts: 5,714

    Replying in your thread to avoid over-polluting MXXs thread...

    @Wuerfel_21 said:
    (hey @rogloh, you think there's space to wrench audio of some sort into your bitmap/text driver?)

    @rogloh said:
    LOL, yeah that's pushing it because the COG is busy on the line when rendering text, although I did have some earlier plans to try to encode all audio during vertical blanking - it would then need to buffer the audio samples arriving during the frame, and it introduces 16.7ms of latency at 60Hz. Whether there's enough time to do it is also resolution/timing dependent, so it's not ideal.

    @Wuerfel_21 said:
    The same scanfunc trick I use for pixel doubling should also work for text (though 8 or 16 px at a time rather than 4... might need to do some futzing so the same task code can run during pixel double or text modes).

    It might help, but you would need sufficient clocks for all the text work. The old comments in my code mention it can render 80 characters of 8-pixel-wide text, with independent FG/BG colours and optional flashing, in 2760 P2 clocks - that's about 4.3 P2 clocks per pixel. With 10 clocks per pixel in HDMI mode there should be time to render text and do quite a bit of other work. The bigger issue perhaps is that this solution uses LUT RAM to temporarily buffer the scan line output, plus another 64 COG longs for a per-scanline font. A solution using HUB RAM instead might reduce that usage, but would require more access cycles, which also have to compete with the streamer's accesses.
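    As a quick sanity check of those timing figures (a sketch in Python; the 80x8 character geometry, the 2760-clock render time, and the 10-clocks-per-pixel HDMI rate are all taken from the discussion above):

```python
# Re-derive the clocks-per-pixel figure quoted above.
COLS, CHAR_WIDTH = 80, 8        # 80 characters, 8 pixels wide each
RENDER_CLOCKS = 2760            # P2 clocks to render one text scanline
HDMI_CLOCKS_PER_PIXEL = 10      # streamer rate in HDMI/DVI mode

pixels_per_line = COLS * CHAR_WIDTH                 # 640 pixels of text
clocks_per_pixel = RENDER_CLOCKS / pixels_per_line  # ~4.31 clocks/pixel
spare_per_pixel = HDMI_CLOCKS_PER_PIXEL - clocks_per_pixel
print(clocks_per_pixel, spare_per_pixel)
```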

    Are you considering putting text into your own NeoVGA driver, or do you think I should try to add HDMI to mine?

    I was comparing what our different video drivers offer recently, and there is quite some overlap now, but differences still exist. My driver is more general purpose overall, while yours is still primarily aimed at emulators but could be expanded over time with more capabilities.

    p2videodrv:

    • grew organically rather than being fully designed up front; it's now somewhat constrained by all the optimizations required to make things fit, and its limited overlay use can restrict free memory for future features
    • multiple output types on different pins
    • configurable resolutions
    • linked list of multi-regions so you can split a screen into multiple parts with regions for colour text, pure framebuffers, or tiles & sprite using extra render COGs
    • interfaces to shared PSRAM/HyperRAM external memory drivers with mailbox or simpler hub based framebuffers
    • all colour modes the P2 offers
    • text mode rendering inbuilt, different fonts and heights supported
    • dual configurable "HW" cursors in text mode, e.g. for a mouse + keyboard
    • pixel doubling options in X/Y, uses idle time during stream out for performing pixel doubling on next scan line
    • optional mouse sprite in all colour modes
    • can run very fast, or as slow as ~2x the pixel clock for low power applications
    • support for borders
    • screen wrap and skew support

    NeoVGA:

    • very clean overlay design (for now, at least), making things more extensible
    • multiple output types on different pins
    • HDMI audio support with resampling - big win there!
    • configurable resolutions, simultaneous VGA+DVI/HDMI
    • single region only - intended application is primarily emulator use
    • single 24 bit colour mode only
    • accesses HUB RAM only and optimized for sprite driver model with other COGs rendering data into a limited scan line buffer in HUB
    • probably far cleaner PAL/NTSC handling vs my initial attempts, which are still not updated with better values from Chip and your work - I need to steal some of your colourspace settings, lol.
    • pixel doubling in X/Y with more options for tripling and other factors
    • intended for high speed P2 operation; doubling pixels on the fly requires sufficient clocks and fragments CPU time for other work
    • border/pillarbox handling
    • wrap/stride support
  • @rogloh said:
    Replying in your thread to avoid over-polluting MXXs thread...

    :+1:

    @Wuerfel_21 said:
    (hey @rogloh, you think there's space to wrench audio of some sort into your bitmap/text driver?)

    @rogloh said:
    LOL, yeah that's pushing it because the COG is busy on the line when rendering text, although I did have some earlier plans to try to encode all audio during vertical blanking - it would then need to buffer the audio samples arriving during the frame, and it introduces 16.7ms of latency at 60Hz. Whether there's enough time to do it is also resolution/timing dependent, so it's not ideal.

    @Wuerfel_21 said:
    The same scanfunc trick I use for pixel doubling should also work for text (though 8 or 16 px at a time rather than 4... might need to do some futzing so the same task code can run during pixel double or text modes).

    It might help, but you would need sufficient clocks for all the text work. The old comments in my code mention it can render 80 characters of 8-pixel-wide text, with independent FG/BG colours and optional flashing, in 2760 P2 clocks - that's about 4.3 P2 clocks per pixel. With 10 clocks per pixel in HDMI mode there should be time to render text and do quite a bit of other work. The bigger issue perhaps is that this solution uses LUT RAM to temporarily buffer the scan line output, plus another 64 COG longs for a per-scanline font. A solution using HUB RAM instead might reduce that usage, but would require more access cycles, which also have to compete with the streamer's accesses.

    You don't need to buffer the rendered line, you just feed one character sliver at a time into the streamer. It's essentially the same as pixel doubling. The font line buffer is needed though, RDBYTE per character is too much. Problem is that you can't byte-index LUT. (Though do we really need 256 characters? 128 is probably fine for basic debug/system output and if you want actually nice text you'd render it yourself (with proportional fonts, unicode, whatever) - ASCII only has 95 printable characters, so there's some space to squeeze umlauts).
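    For reference, the LUT cost of that font line buffer works out as follows (a quick check; it assumes one font byte per character per scanline, packed four bytes to a long as both drivers do):

```python
# LUT longs needed for one scanline's worth of font bytes.
BYTES_PER_LONG = 4
full_font = 256 // BYTES_PER_LONG    # all 256 characters
ascii_font = 128 // BYTES_PER_LONG   # if 128 characters suffice
print(full_font, ascii_font)         # longs of LUT consumed
```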

    Are you considering putting text into your own NeoVGA driver, or do you think I should try to add HDMI to mine?

    Uhhh good question. In general more independent-ish HDMI implementations would be the good and right thing.

    I was comparing what our different video drivers offer recently, and there is quite some overlap now, but differences still exist. My driver is more general purpose overall, while yours is still primarily aimed at emulators but could be expanded over time with more capabilities.

    The big difference IMO is that the ***VGA driver (both the horrid old version and the new clean one) is really designed for hacking in whatever feature-of-the-week is needed. It's really currently not something that can just be dropped into a new project. Though the new version tries to be better about it...
    PSRAM mailbox action and other color modes are trivial to add in, of course (I did it before, though I think that never got released). The new version reserves the first half of LUT so 8-bit mode can be added easily. It's just that I've found that having some other cog render stuff out to RGB24 line buffers has been the correct(TM) way of doing things in most of the projects I do. (Tempest 2000 uses framebuffer rendering, but multiple buffers need to be composited and the 16 bit CRY color format is of course not native to the P2 - so mikoobj.spin2 is what ends up doing the PSRAM request list, compositing all the CRY buffers together and converting the result to RGB24 - oops, same video driver model again.)

    Also, big feature: driver loads itself fully into cog/lut, so you can nuke the hub data after it gets going. This would be really pertinent if I could add overlays to flexspin. NeoYume's menu-overwriting trick is a bit brute-force...

  • rogloh Posts: 5,714

    @Wuerfel_21 said:
    Also, big feature: driver loads itself fully into cog/lut, so you can nuke the hub data after it gets going. This would be really pertinent if I could add overlays to flexspin. NeoYume's menu-overwriting trick is a bit brute-force...

    That's what my driver was intended to do too. It loads all config and code into the entire COG exec space (using LUT RAM as well), and the total ~4kB memory footprint can then be reused as a dual scan line buffer area, making overall HUB usage very low. Also, with no HUB exec required, any corruption of HUB RAM after the video driver COG is spawned can't really crash the driver - which is useful if you are using it to debug that very problem - although it can obviously still corrupt the screen output itself.

  • rogloh Posts: 5,714

    One other thing I do wonder is how high a resolution your driver can be operated at. I know mine can reach 1920x1200 in VGA mode; not sure about 2048x1536 - I think that exceeds my buffering capacity, with the mouse sprite stealing 16 longs in LUT, leaving only 240 for the temporary line buffer for doubling and 256 for the palette RAM. What would limit the resolution in your driver? Is it just the P2 clock speed for outputting doubled pixels, or other things like streamer ops?

  • The old driver could do 5x scaling from 384x240 to 1920x1200. No reason it couldn't do that with 1x native data. I haven't re-added such hi-res modes with the new one, but I guess it should work. Remember that software pixel doubling is only needed for DVI/HDMI (where there is always time for it because of the fixed clock ratio); for analog I can just set a slower clock and have the normal streamer DMA do the work.

  • rogloh Posts: 5,714
    edited 2024-10-08 03:46

    Aha, yeah pixel doubling via HW streamer DMA is nice. I guess I assumed I couldn't do that because different regions could exist with or without doubling present, unless I somehow tweak streamer clocks on the fly per region - hmm, maybe I could use SETQ to change the streamer rate per region if it were cleaned up a bit.

    Doing HW pixel doubling via the scanfunc trick could open up lots more LUT longs in my driver and free up space needed for things like HDMI and frequent audio polling. I should think about this if I want to restructure it.

    This is all very interesting as I can see you've probably learned a few things from what I did in my driver code (at least what parts to avoid and how to improve) and I can learn some from your approach too.

    Something that takes the best parts of both our drivers could be pretty nice - probably closer to your overlay framework. I do sort of like the simplicity of basic frame buffers for graphics work, and single COG operation with some built-in text support is nice too, with full per-character bg/fg colour like real VGA; otherwise you need to burn more COGs. Multi-region is probably not as frequently needed, but it can really help if you want pull-down screens with different COG applications running in each, not needing to share a single screen. Like a debug console that can be scrolled/hidden on demand, running alongside some sprite-based game or another GUI on-screen with a mouse cursor, which don't need to know about this console and can simply continue writing to their own buffers independently.

  • What would the DVI encoder do if operated with a pixel clock other than 1/10 sysclock? A couple possibilities.

    1. Best case. It samples data upon XINIT and every 10th clock afterward. This would result in a nearest neighbor interpolation.
    2. Restarts upon each new pixel, but will resend old data if that is all that is available. Hardware pixel doubling.
    3. Restarts upon each new pixel and sends undefined bits after the 10 TMDS bits. Bad, but this is what we are assuming.
  • rogloh Posts: 5,714
    edited 2024-10-08 05:54

    @SaucySoliton said:
    What would the DVI encoder do if operated with a pixel clock other than 1/10 sysclock? A couple possibilities.

    1. Best case. It samples data upon XINIT and every 10th clock afterward. This would result in a nearest neighbor interpolation.
    2. Restarts upon each new pixel, but will resend old data if that is all that is available. Hardware pixel doubling.
    3. Restarts upon each new pixel and sends undefined bits after the 10 TMDS bits. Bad, but this is what we are assuming.

    I don't think it works correctly with anything but a 10:1 ratio (you need 10 serial clocks to stream out all the TMDS/TERC4 encoded serial data). If you ran a 20:1 ratio it might do something interesting with every second pixel, but with the 25MHz minimum DVI pixel clock you'd technically need a 500MHz P2 to do that. I guess it should be tried out (didn't we try this once already? Can't recall, but I'm fairly confident it didn't double for us, otherwise I'd be making use of it).
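    The 500MHz figure follows directly from the two numbers above (the 25MHz minimum DVI pixel clock and the hypothetical 20:1 serial ratio):

```python
DVI_MIN_PIXEL_MHZ = 25   # minimum DVI pixel clock
SERIAL_RATIO = 20        # hypothetical 20:1 sysclock-to-pixel ratio
sysclk_needed = DVI_MIN_PIXEL_MHZ * SERIAL_RATIO
print(sysclk_needed)     # MHz of P2 clock required
```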

  • rogloh Posts: 5,714

    @Wuerfel_21 said:
    You don't need to buffer the rendered line, you just feed one character sliver at a time into the streamer. It's essentially the same as pixel doubling. The font line buffer is needed though, RDBYTE per character is too much. Problem is that you can't byte-index LUT. (Though do we really need 256 characters? 128 is probably fine for basic debug/system output and if you want actually nice text you'd render it yourself (with proportional fonts, unicode, whatever) - ASCII only has 95 printable characters, so there's some space to squeeze umlauts).

    Was thinking about this... I currently use ALTGB to read the byte array of font data from COG RAM, and it pulls out the correct byte from the long automatically in the following GETBYTE instruction. But the font could instead be read from LUT with the snippet below, assuming it executes from COG RAM (so it's easily patchable) and the flags can be used. This only adds 3 extra instructions per font byte (7 clocks with the RDLUT) vs the existing ALTGB method. Those 7 extra clocks thankfully get amortized over all 8 pixels, so it's not too much slower within the 80-clock total budget for DVI/HDMI.

      getbyte data, colordata, #0 ' extract character byte from color+char word
      rczr data wcz
      bitz instr, #19 ' patch instruction below
      bitc instr, #20 ' patch instruction below
      rdlut data, data ' keep font for scanline in lower 64 longs (256 bytes) of LUTRAM
    instr
      getbyte data, data, #0-0 ' byte lane gets patched
    
  • @rogloh said:
    Aha, yeah pixel doubling via HW streamer DMA is nice. I guess I expected I couldn't do that due to the fact that different regions could exist with/without doubling present, unless I somehow tweak streamer clocks on the fly per region - hmm maybe I could use setq to change the streamer rate per region if it were cleaned up a bit.

    Yes, SETQ+XCONT. You really want to set it for the active video only and change it back for the front porch.

    This is all very interesting as I can see you've probably learned a few things from what I did in my driver code (at least what parts to avoid and how to improve) and I can learn some from your approach too.

    🎵Live and Learn, from the works of yesterdaaaaay!🎵

    Something that takes the best parts of both our drivers could be pretty nice - probably closer to your overlay framework. I do sort of like the simplicity of basic frame buffers for graphics work, and single COG operation with some built-in text support is nice too, with full per-character bg/fg colour like real VGA; otherwise you need to burn more COGs.

    Agree, I usually bust out p2videodrv for simple demos or tests and such. Which reminds me that I never posted any of the texture mapping research code... (and I was going to add it here as a bonus but flexspin's project zipping just sharted itself)

    Multi-region is probably not as frequently needed, but it can really help if you want pull-down screens with different COG applications running in each, not needing to share a single screen. Like a debug console that can be scrolled/hidden on demand, running alongside some sprite-based game or another GUI on-screen with a mouse cursor, which don't need to know about this console and can simply continue writing to their own buffers independently.

    Yeah :+1:

    @rogloh said:

    @Wuerfel_21 said:

      getbyte data, colordata, #0 ' extract character byte from color+char word
      rczr data wcz
      bitz instr, #19 ' patch instruction below
      bitc instr, #20 ' patch instruction below
      rdlut data, data ' keep font for scanline in lower 64 longs (256 bytes) of LUTRAM
    instr
      getbyte data, data, #0-0 ' byte lane gets patched
    

    Not sure if that's worth it...

    The pixel doubling code is:

    scanfunc_2x_rgb24 ' scan out 2 pixels with 2x doubling -> 40 streamer cycles
                  ' actual function takes 18cy (inc. call overhead)
                  ' so 11 instructions can go between calls
                  rflong scantmp
                  andn scantmp,#255 ' mask alpha/control channel (DMA would do this for us)
                  xcont pixel_multiplier,scantmp
                  rflong scantmp
                  andn scantmp,#255
            _ret_ xcont pixel_multiplier,scantmp
    

    The task code gets 22 useful cycles out of 40. Also note that it leaves the flags alone (not sure if I rely on that currently).

    There may be a problem with doing both pixel doubling and text with the same scantask. My code uses 76 of these 22-cycle time slots for CRC+TERC4 encoding, so if you did 80 calls into a text function, there wouldn't be any time left. You could alternate it so that 160 calls produce 80 characters. Can certainly do monochrome text, but color seems like a pipe dream with this.
    May look a bit like this:

    scanfunc_text
    .even
                  rfbyte scantmp ' grab character
                  rczr scantmp wcz
                  bitz .ptch, #19 ' patch instruction below
                  bitc .ptch, #20 ' patch instruction below
                  mov scanfunc,#.odd
                  ret wcz
    .odd
                  rdlut scantmp, scantmp ' keep font for scanline in lower 64 longs (256 bytes) of LUTRAM
    .ptch         getbyte scantmp,scantmp,#0-0
                  xcont text_cmd,scantmp
            _ret_ mov scanfunc,#.even
    
  • rogloh Posts: 5,714
    edited 2024-10-09 08:21

    So, putting aside the doubling for now: if you are sending data out in immediate mode using the streamer, you don't have to issue an XCONT per pixel - you can do it per group of 8 pixels when running 16-colour text (a la VGA). This greatly reduces the frequency of calling the scanfunc, to just 40 times per line on a 640x480 screen for example. A lot of my earlier budget involved looking up character & colour data in LUT RAM and writing back; some of that goes away or reduces when reading from the FIFO and issuing XCONTs directly. Looking at my current text code, the basic character loop is something like this (the last two instructions after the rep loop label are included in the loop because of SKIPF processing, which skips two instructions inside the loop):

    patchtext
                                mov     a, #%11000 wc           'reset starting lookup index
    p3          if_z            rep     @endwide, #COLS/2       '2100 clocks for 40 double wide
    p4          if_nz           rep     @endnormal, #COLS       '2760 clocks for 80 normal wide
                                skipf   a                       'skip 2 of the next 5 instructions
                                xor     a, #%11110              'flip skip sequence for next time
                                rdlut   d, pb                   'read pair of characters/colours
                                getword c, d, #1                'select first word in long (skipf)
                                getword c, d, #0                'select second word in long (skipf)
                                sub     pb, #1                  'decrement LUT read index (skipf)
                                getbyte b, c, #0                'extract font offset for char
                                altgb   b, #font                'determine font lookup address
                                getbyte pixels, 0-0, #0         'get font for character's scanline
    testflash                   bitl    c, #15 wcz              'test (and clear) flashing bit 
    flash       if_c            and     pixels, #$ff            'make it all background if flashing
                                movbyts c, #%01010101           'colours becomes BF_BF_BF_BF
                                mov     b, c                    'grab a copy for muxing step next
                                rol     b, #4                   'b becomes FB_FB_FB_FB
                                setq    ##$F0FF000F             'mux mask adjusts fg and bg colours
                                muxq    c, b                    'c becomes FF_FB_BF_BB
                                testb   modedata, #10 wz        'repeat columns test, z was trashed
    endnormal                                                   'end rep 2 instructions early (skipf)
                if_nz           movbyts c, pixels               'select pixel colours for char
                if_nz           wrlut   c, ptrb--               'write coloured pixel data into LUT
    

    In your own case, the scanfunc needs to keep 2 XCONTs buffered up to gain cycles for other work. This means 2 x 8-pixel characters, or 4 total bytes including the colour attribute bytes, can be read from the FIFO and worked on at a time. Once loaded with the next pair of characters, this only has to happen on average every 160 clocks (16 pixels).
    It sounds like your code uses 76 x 22 = 1672 total cycles during the active portion to complete the other TERC4/audio polling work. As long as you get enough total cycles on the line, it probably doesn't matter precisely how they are distributed - assuming audio is still polled often enough not to miss a sample, you never underrun the streamer, and you can fit these scanfunc calls somewhere useful in your own code.

    With this in mind, I expect a variant of the above could be coded that is optimized for reading source buffer data from the FIFO and its font data out of LUT RAM (although you do need 64+ extra housekeeping clocks per line to fill the font buffer before active data can begin). It's mostly a matter of breaking this work into appropriately sized chunks so you get some free cycles for your operations. Something like this maybe?

    scanfunc_text
                                rep     @.endscan,#2            'do twice
                                rfbyte  data                    'read character from fifo
                                rfbyte  color                   'read colour attribute from fifo
                                rczr    data wcz                'divide by 4 and get c,z flags
                                bitz    .patch, #19             'patch instruction
                                bitc    .patch, #20             'patch instruction
                                rdlut   data, data              'read font in lower 64 longs
    .patch                      getbyte pixels, data, #0-0      'extract pixeldata from font
    testflash                   bitl    color, #7 wcz           'test (and clear) flashing bit 
    flash       if_c            and     pixels, #$ff            'make it all background if flashing
                                movbyts color, #0               'colours becomes BF_BF_BF_BF
                                mov     data, color             'grab a copy for muxing step next
                                rol     data, #4                'data becomes FB_FB_FB_FB
                                setq    ##$F0FF000F             'mux mask adjusts fg and bg colours
                                muxq    color, data             'color becomes FF_FB_BF_BB
                                movbyts color, pixels           'select pixel colours for char
                                xcont   X_IMM_8X4_LUT, color    'write colour indices into streamer as 8 nibbles
    .endscan                    ret     wcz                     'restore flags
    

    You have to repeat the work twice in the scanfunc before returning (hence the rep at the start) and call this snippet 40 times per line. As counted it would take 70 clocks + 6 call overhead = 76 clocks every 160 P2 clocks, leaving 40 x (160-76) = 3360 cycles remaining for TERC4/audio, which seems like it should be sufficient if you only needed 1672 before. However, you do have to read the 64-long font for the scan line into LUT RAM before it can be used, taking 64+ more clocks elsewhere in the horizontal blanking portion, or maybe at the end of the prior line if needed. Also, as coded, this loop consumes 4 extra clocks per character for the flashing text attribute, but it's good to keep that.
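    The spare-cycle claim is easy to double check (all figures come from the two posts above):

```python
# Budget check for the 40-call text scanfunc.
WORK = 70                 # clocks of work per scanfunc call
CALL_OVERHEAD = 6         # call/return overhead
PERIOD = 160              # 16 pixels at 10 sysclocks each
CALLS_PER_LINE = 40       # 640-pixel line / 16 pixels per call
spare = CALLS_PER_LINE * (PERIOD - (WORK + CALL_OVERHEAD))
print(spare)              # cycles left per line for TERC4/audio
```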

    Now for the pixel doubling case, I'd expect it should only increase the CPU time available between calls, as only half the number of characters need to be read and processed (40 vs 80, for example, with VGA). You can then duplicate the computed 8x4-bit nibbles across two 32-bit longs by replicating the nibble colour indices into bytes. So the 8 nibbles in the non-doubled colour long above, say N7_N6_N5_N4_N3_N2_N1_N0, become two longs, N7_N7_N6_N6_N5_N5_N4_N4 and N3_N3_N2_N2_N1_N1_N0_N0, to be sent to the streamer in the 40 iterations, and the rep loop part wouldn't be needed. This extra work is less than processing a whole other character and can save cycles using those handy-dandy SPLIT/MERGE style instructions. Something like the code below may work, which comes to roughly 55 clocks total including the 6-clock call overhead, instead of 76.

    scanfunc_text
                                rfbyte  data                    'read character from fifo
                                rfbyte  color                   'read colour attribute from fifo
                                rczr    data wcz                'divide by 4 and get c,z flags
                                bitz    .patch, #19             'patch instruction
                                bitc    .patch, #20             'patch instruction
                                rdlut   data, data              'read font in lower 64 longs
    .patch                      getbyte pixels, data, #0-0      'extract pixeldata from font
    testflash                   bitl    color, #7 wcz           'test (and clear) flashing bit 
    flash       if_c            and     pixels, #$ff            'make it all background if flashing
                                movbyts color, #0               'colours becomes BF_BF_BF_BF
                                mov     data, color             'grab a copy for muxing step next
                                rol     data, #4                'data becomes FB_FB_FB_FB
                                setq    ##$F0FF000F             'mux mask adjusts fg and bg colours
                                setword pixels, pixels, #1      'duplicate font byte into top word
                                mergew  pixels                  'interleave the words: doubles each font bit
    
                                muxq    color, data             'color becomes FF_FB_BF_BB
                                mov     color2, color           'save it
                                movbyts color, pixels           'select pixel colours for char
                                xcont   X_IMM_8X4_LUT, color    'write colour indices into streamer as 8 nibbles
    
                                getbyte pixels, pixels, #1      'use next group of pixels
                                muxq    color2, data            'color becomes FF_FB_BF_BB
                                movbyts color2, pixels          'select pixel colours for char
                                xcont   X_IMM_8X4_LUT, color2   'write colour indices into streamer as 8 nibbles
                                ret     wcz                     'restore flags
    
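    To illustrate the nibble-doubling step described above (a Python sketch of the intended bit result only, not of the P2 instruction sequence): one long of 8 colour nibbles N7..N0 becomes two longs with each nibble duplicated.

```python
# Double each of the 8 nibbles in x across two 32-bit longs:
# N7..N0 -> (N7N7N6N6N5N5N4N4, N3N3N2N2N1N1N0N0).
def double_nibbles(x):
    nibs = [(x >> (4 * i)) & 0xF for i in range(8)]   # N0..N7
    lo = hi = 0
    for i, n in enumerate(nibs[:4]):                  # N0..N3 -> low long
        lo |= (n | (n << 4)) << (8 * i)
    for i, n in enumerate(nibs[4:]):                  # N4..N7 -> high long
        hi |= (n | (n << 4)) << (8 * i)
    return hi, lo

hi, lo = double_nibbles(0x76543210)
print(hex(hi), hex(lo))  # 0x77665544 0x33221100
```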
  • rogloh Posts: 5,714
    edited 2024-10-09 08:18

    Actually there's another problem, which you might already be considering - I only just realized it. If the outer calling code is written to call fine-grained scanfuncs for RGB24 individual pixel-pair use, and that common code also needs to be shared with text scanfuncs, then the calls will happen far more frequently than the text case needs. So maybe the text scanfunc needs to know this and only truly execute once every xx actual calls, based on the number of pixels it generates, returning early otherwise - that way you can still share the audio+TERC calling code for both types of scanfunc. I see that issue now. An incmod counter, #limit wz instruction and an early return may help (but that still burns 10 P2 clocks per fake call, which might still be ok, not sure). It may also mess up your accounting scheme for how many total pixels have been sent, unless the scanfunc counts that for you.

    scanfunc_text
                                incmod  realcall, #7 wz         'count up to 8 or 4 or whatever multiple makes sense
                if_nz           ret     wcz                     'exit early if not ready
    
                                rep     @.endscan,#2            'do twice
                                rfbyte  data                    'read character from fifo
                                ...
    
  • That's what I was on about earlier. The task code will run for 0<=x<=22 cycles for each scanfunc call. There's essentially two somewhat related problems here:

    • If the scanfunc generates too many pixels per call, not enough of the task runs before the line is over (the 76 slot number above is just for packet encoding. Audio processing maybe adds another 20? Need to check. Let's assume ~96 slots are needed at all times)
    • If the scanfunc takes too many cycles, the streamer will run out of data.

    So if you have a 76cy function generate 16px (160cy) of streamer data each time, you only get 40 timeslots.
    If you add the incmod dealio, the base cost increases to 80.

    • (80+10)/2 -> 45cy on 80cy, 80 timeslots -> still too few timeslots
    • (80+10+10)/3 -> 33.3cy on 53.3cy, 120 timeslots -> enough slots, but overall time has now run out! 80+10+10+22+22+22 is 166 cycles.

    The incmod idea just kinda sucks; changing the function pointer costs 2 fewer cycles per call, so 78+8+8+22+22+22 = 160. That might work out exactly! Though introducing 3/16 as a factor somewhere is nasty, of course.
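    Re-running that timeslot arithmetic (all figures taken from the post above: a 76cy base text function producing 16px = 160cy of streamer data, incmod raising the base cost to 80cy, and 22cy task slots):

```python
two_way = (80 + 10) / 2              # cycles used per 80cy slot, 2-way split
print(two_way)                       # 45cy -> 80 timeslots, still too few
three_way_total = 80 + 10 + 10 + 22 + 22 + 22
print(three_way_total)               # 166cy against a 160cy budget: over!
fptr_total = 78 + 8 + 8 + 22 + 22 + 22
print(fptr_total)                    # 160cy: the function-pointer variant fits exactly
```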

  • rogloh Posts: 5,714

    Yeah, the incmod isn't ideal in the scanfunc as it incurs the call overhead every time - I knew that, it was just a starting idea. Best done somehow on the caller side, maybe with a conditional test that's only valid once every 8 times, for example. Sort of like an incmod on the caller side, but it will burn extra instruction space.

    I expect there will be some way to chop it up to make it work, just need to think more. There's enough total clocks per line for the work itself.

  • Did you see that post way earlier? You can just change the function pointer, and that only takes 2 cycles, which are inside the scanfunc. You do need enough _ret_ mov scanfunc,#.something lines to get to the desired division level, but for 1/3 calls that's fine. It can also be used to balance out the workload.

  • rogloh Posts: 5,714
    edited 2024-10-09 11:18

    If your call to the scanfunc had a preceding incmod before each call, I wonder how many extra instructions would be taken up in the COG's memory. If the accounting instructions were done in the scanfunc, maybe it would free up room too.

                   sub scantimecomp,#1
                   call scanfunc
    

    E.g. move the sub scantimecomp, #1 into the scanfunc for each pixel generated, and replace with this...

                   incmod realcall, limit wz  ' limit is set to 0 for rgb24 mode, and 3 for text mode EDITED: was 7 instead of 3 before.
       if_z        call scanfunc
    

    Then basically keep the scanfunc call spacing the same as you have and most of the time the call is not made which effectively gives you many more time slots than real scanfunc calls.

  • roglohrogloh Posts: 5,714
    edited 2024-10-09 12:02

    I see you'd probably only get 80 clocks per slot as the text code takes the other half.
    So you'd get to execute 3 * 22 clocks (your current spacing) for every scanfunc call, and also incur a 4-clock overhead penalty for the 3 out of 4 slots where the call is not made:
    (22+4) * 3 = 78. This gives 120 slots per line.
    When the call is made it takes the 2-cycle incmod test + 6 call overhead cycles plus 78 working cycles for 16 pixels (including the extra accounting instruction).
    This is 86 clocks.
    So we have 78+86 = 164 clock cycles in a 160-cycle budget. Damn, not enough, unless the flashing attribute is yanked or other optimizations are found. You could unroll the rep loop in the scanfunc to save two clocks.

  • roglohrogloh Posts: 5,714
    edited 2024-10-09 12:14

    Actually it's not as bad. I double counted the overhead. It was only 76 before including the call overhead so it's now 78 plus the prior 2 cycle incmod = 80. 78+80 will just fit the budget, but it's tight. :smile:

    Update: One hassle with this is there are no free cycles for any cursors to be done on the fly. :( You could do a soft cursor with a flashing block or underscore, but it's not the same and will destructively kill the character it is on. Maybe that's the final icing on top if there's a way. It's mostly a counter compare and an overriding mov instruction to do a simple cursor; 4 clocks would just about do it.

    Update2: One more problem in text mode. As the pixels need to be sent out right after the video guard symbols, we'd need an initial setup call just like the scanfunc, but saving off the first 16 pixels into two longs instead of sending them to the streamer, so they're ready to load into the streamer's buffer right after the video guard pixels go out. It's probably solvable, just yet another annoyance to deal with. Text mode is always a bit of a pain.

    Update3: If the 22-clock budget previously included the accounting overhead instruction, we can remove that - dropping it down to 20 cycles and freeing up 6 clocks (3 instructions) per real scanfunc call - and add this back into the scanfunc budget, possibly to help do a cursor or two :smile:
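    To tally up the corrected numbers from these updates, a small sketch (all figures are the assumed costs quoted in this thread, not hardware measurements):

```python
# 16 output pixels at 10 sysclks each in HDMI mode
CHUNK_BUDGET = 16 * 10  # 160 cycles

# 3 of every 4 call sites skip the call: 2-cycle incmod test + untaken
# if_z call (2 clocks) = 4 wasted clocks, plus the usual 22-clock task slot
skipped = (4 + 22) * 3  # 78

# The one real call: 76 clocks previously (incl. call overhead), +2 for the
# accounting instruction moved into the scanfunc, +2 for the incmod test
real = 76 + 2 + 2       # 80

print(skipped + real, "of", CHUNK_BUDGET)  # just fits the budget
```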

  • @rogloh said:
    If your call to the scanfunc had a preceding incmod before each call, how many extra instructions would be taken up in the COG's memory I wonder? If the accounting instructions were done in the scanfunc, maybe it would free up room too.

                   sub scantimecomp,#1
                   call scanfunc
    

    E.g. move the sub scantimecomp, #1 into the scanfunc for each pixel generated, and replace with this...

                   incmod realcall, limit wz  ' limit is set to 0 for rgb24 mode, and 3 for text mode EDITED: was 7 instead of 3 before.
       if_z        call scanfunc
    

    Then basically keep the scanfunc call spacing the same as you have and most of the time the call is not made which effectively gives you many more time slots than real scanfunc calls.

    You can't do that, it'll mess up the timing for pixel doubling. There's a reason the caller subtracts from scantimecomp, it can do it in batches, since it only needs to be right at the end.

    Please re-read my earlier posts. With changing the function pointer I calculated it down to 160 cycles, should work exactly like that and need no extra change to scantask code. (just need to change the initial value of scantimecomp)

    @rogloh said:
    Update2: One more problem in text mode. As the pixels need to be sent out right after the video guard symbols, we'd need an initial setup call just like the scanfunc, but saving off the first 16 pixels into two longs instead of sending them to the streamer, so they're ready to load into the streamer's buffer right after the video guard pixels go out. It's probably solvable, just yet another annoyance to deal with. Text mode is always a bit of a pain.

    Not actually a problem. Remember the streamer has a command buffer, so the guard band can be entered while the back porch is still going. The opposite is a problem, going from short immediate commands to border/blanking, but that's a lot easier to solve.

    @rogloh said:
    Update3: If the 22-clock budget previously included the accounting overhead instruction, we can remove that - dropping it down to 20 cycles and freeing up 6 clocks (3 instructions) per real scanfunc call - and add this back into the scanfunc budget, possibly to help do a cursor or two :smile:

    As mentioned, accounting for scantimecomp is done in huge batches. It only needs to have the correct value at the very end.

  • roglohrogloh Posts: 5,714

    @Wuerfel_21 said:

    @rogloh said:
    If your call to the scanfunc had a preceding incmod before each call, how many extra instructions would be taken up in the COG's memory I wonder? If the accounting instructions were done in the scanfunc, maybe it would free up room too.

                   sub scantimecomp,#1
                   call scanfunc
    

    E.g. move the sub scantimecomp, #1 into the scanfunc for each pixel generated, and replace with this...

                   incmod realcall, limit wz  ' limit is set to 0 for rgb24 mode, and 3 for text mode EDITED: was 7 instead of 3 before.
       if_z        call scanfunc
    

    Then basically keep the scanfunc call spacing the same as you have and most of the time the call is not made which effectively gives you many more time slots than real scanfunc calls.

    You can't do that, it'll mess up the timing for pixel doubling. There's a reason the caller subtracts from scantimecomp, it can do it in batches, since it only needs to be right at the end.

    Pixel doubling for non-text modes can be maintained, I think, if the incmod limit variable is set to zero, so it always calls your scanfunc in that case just like before. The extra incmod does of course need some adjustments to the sequence.
    When pixel doubling for text, there would still be 40 real calls made and the scanfuncs would work the same within the 160-clock budget; just half as much source data is read from the FIFO, and 16 total pixels would still go out like before.

    So are you mostly worried that the existing 22-clock maximum interval between scanfunc calls can't be messed with any more due to existing critical fragments? Maybe that's the irritation, if it was really hard to get this timing correct. I was envisioning that small adjustments to the call sites could still be made (which may slightly increase usage of the 120 available slots), so long as the scanfunc still gets called every 40 clocks total (or faster, in which case it'll wait in the scanfunc for the streamer buffer to be ready before returning), keeping the spacing intact. Or maybe you have some code that needs all 22 clocks for processing work in some cases, and reducing down to 20 cycles is really difficult or impossible.

    Are you worried it won't fit in COG RAM with the extra incmod needed before each scanfunc call, given your accounting is often done in batches so there won't be enough reclaimed COGRAM space to make up for them?
    Or maybe incmod's flag use killing Z (or C) per scanfunc call is a showstopper?

    The use of scantimecomp as it stands seems mainly to determine the end-of-line condition in the line loop. I was thinking that if we accounted for all these pixels within the scanfunc itself, a benefit is that this counter could do double duty and be used as the character index for a cursor position check... though you'd also need to put the same accounting operation back into the rgb24 and any other pixel-doubling variants, which will use two more cycles and obviously force adjustments to your code's call-slot sequences. That's really the only reason I like the incmod stuff on the caller side with the scantimecomp accounting moved into the scanfunc: it can free a few more clocks by avoiding the call overhead altogether (4 cycles instead of 8), though I can see it will consume more instruction space. I counted 28 calls to scanfunc and 8 adjustments to scantimecomp, which means it would consume 20 more instruction longs in COG/LUT - a downside of course if space is really limited.

    Please re-read my earlier posts. With changing the function pointer I calculated it down to 160 cycles, should work exactly like that and need no extra change to scantask code. (just need to change the initial value of scantimecomp)

    Yeah, I know how your approach works - it keeps all the scanfunc calls as-is and adjusts the next scanfunc address dynamically in a chain, creating a few shortened calls followed by the real call. I do rather like it for the small impact it has, but the main downside is that there's no budget left in the text loop code pasted above, so it won't ever be able to support a cursor. Minor gripe, but I'm looking for any solution that resolves that... although perhaps my solutions consume too many resources for the minor budget increase they provide.

    @rogloh said:
    Update2: One more problem in text mode. As the pixels need to be sent out right after the video guard symbols, we'd need an initial setup call just like the scanfunc, but saving off the first 16 pixels into two longs instead of sending them to the streamer, so they're ready to load into the streamer's buffer right after the video guard pixels go out. It's probably solvable, just yet another annoyance to deal with. Text mode is always a bit of a pain.

    Not actually a problem. Remember the streamer has a command buffer, so the guard band can be entered while the back porch is still going. The opposite is a problem, going from short immediate commands to border/blanking, but that's a lot easier to solve.

    Yeah, that's true: the guard band sends two identical pixels, so it'll probably be able to issue just a single immediate channel command for that while the back porch is running. I was thinking each pixel needed to get its own channel command and then the command FIFO would be blocked.

    @rogloh said:
    Update3: If the 22-clock budget previously included the accounting overhead instruction, we can remove that - dropping it down to 20 cycles and freeing up 6 clocks (3 instructions) per real scanfunc call - and add this back into the scanfunc budget, possibly to help do a cursor or two :smile:

    As mentioned, accounting for scantimecomp is done in huge batches. It only needs to have the correct value at the very end.

    Yeah, not all 22-cycle batches as coded change the scantimecomp variable, so we can't bank on gaining those extra 6 cycles at all times. If we stole them by dropping from 22 down to 20 useful clocks between scanfunc calls, slot usage would have to increase slightly - hopefully only about 10% or so. If only ~96 slots were used before, then hopefully 120 is still enough to finish before the end of the line...? It's tight.

    Though introducing 3/16 as a factor somewhere is nasty of course.

    What exactly is this 3/16 thing you mentioned earlier?

  • @rogloh said:

    Though introducing 3/16 as a factor somewhere is nasty of course.

    What exactly is this 3/16 thing you mentioned earlier?

    16 pixels over 3 calls -> nasty factor to get from 640 to 120

    Adding a cursor could be done as-is if flashing is dropped. Is that really an important feature? (ZX Spectrum programmers hate it!)

    I'm getting a minor headache off this I feel. Maybe I should just prototype an actual impl instead of dilly-dallying.

  • roglohrogloh Posts: 5,714

    @Wuerfel_21 said:

    @rogloh said:

    Though introducing 3/16 as a factor somewhere is nasty of course.

    What exactly is this 3/16 thing you mentioned earlier?

    16 pixels over 3 calls -> nasty factor to get from 640 to 120

    Ok I guess that's due to the accounting outside the scanfunc.

    Adding cursor could be done as-is if flashing is dropped. Is that really an important feature? (ZX Spectrum programmers hate it!)

    Yeah, I thought the same if it came down to it. Even VGA has a bit to ignore it and just use it for more background colours. Maybe I'll find more instructions to save in the text loop somehow. Last night I thought of a way that the scantimecomp use could be avoided for the cursor index counter: instead, once at the start of the line, just configure an underflowing counter which hits zero and sets the Z flag exactly when we reach the cursor position. It'd be nice to have a secondary mouse cursor too for TextUI use though.
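    A toy model of that underflowing-counter trick (hypothetical Python, just to show the idea: preload the counter so it reaches zero, i.e. sets Z, exactly at the cursor column and never again on that line):

```python
def cursor_hits(cursor_col, width=80):
    """Return the columns where the Z-flag-style zero check fires."""
    hits = []
    counter = cursor_col          # preload once at the start of the line
    for col in range(width):
        if counter == 0:          # 'Z set' -> override this cell with the cursor glyph
            hits.append(col)
        counter -= 1              # falls below zero for the rest of the line
    return hits

print(cursor_hits(5))   # fires once, at the cursor position only
```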

    I'm getting a minor headache off this I feel.

    Well, that's the way to shut down the argument. :wink: Yeah, probably best to just start coding it and things will work themselves out. I'm also looking for ways that my own features could potentially work in all this discussion, though I'm not too sure my model of always being one scan line ahead would fit with it, which probably also affects the PSRAM, mouse sprite and multi-region stuff badly.

    Maybe I should just prototype an actual impl instead of dilly-dallying.

    What was that SpaceX quip again, something like: it's easier to go to the moon than to convince NASA we can go to the moon! :smiley:

  • @rogloh said:
    I'm also looking for ways that my own features could potentially work in all this discussion, though I'm not too sure my model of always being one scan line ahead would fit with it, which probably also affects the PSRAM, mouse sprite and multi-region stuff badly.

    I mean you could probably make it work without any of this scanfunc headache then. If you always pre-render the next scanline into a buffer, you can run the audio task without timing headaches. Can also buffer the packets into hubRAM, which makes streaming them out nicer.

  • roglohrogloh Posts: 5,714
    edited 2024-10-09 23:22

    Agree. But how does doubling ever work then with HDMI? I think I'd have to give it up in HDMI mode, which I guess is a sacrifice I'd be willing to make. Plus the extra HDMI code needs space and would take over the 240 LUT longs, instead of how I use them today to both double pixels into and render text into before writing it out. HDMI with audio is especially hard, as you've discovered; HDMI video framing alone is okay.

    Well, if we say the HDMI audio needs some ~2200 cycles in a scanline, there is a lot left to copy and double up data for the next scanline.

  • roglohrogloh Posts: 5,714

    Yeah, apart from 24bpp mode which was a real killer IIRC. I'll have to dig up the actual cycle counts in the old DVI thread. It's different for the different depths: lower bit depths are harder to double, but more packed bits get processed at a time and share the same longs for transfers.

    The 24bpp mode was quite hard on the COG because for each pixel doubled you need one HUB long read into LUT, one read from LUT, two writes to LUT and two writes back to HUB, plus the loop overhead to fit in bursts over the 240 LUT longs. That's already 1+3+2*2+2 clocks, or 10 clocks per doubled input pixel, which is half the pixel budget - and it shares the memory bandwidth with the streamer too. At best that's probably going to leave around 6400 - 320x10 = 3200 cycles minus overheads in a 640x480 resolution with 6400 clocks per active portion of the scanline. Might just still fit the budget. :smile:

    But another problem is the free LUT space and existing usage. How much extra code space does a basic HDMI TERC encode + audio resampler take, I wonder? It's certainly a challenge. Maybe there's scope to dynamically swap out the TERC encoder code on the fly and reuse the LUT space afterwards for doubling, though that breaks my model of a self-contained rock-solid video COG that's independent of HUB RAM corruption. All sorts of tricks may be needed.
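    As a rough cross-check of that 24bpp arithmetic (all figures assumed from the description above, not measured):

```python
ACTIVE_CLOCKS = 640 * 10   # 6400 clocks across the active portion of the line
INPUT_PIXELS = 320         # each doubled to two output pixels

# Per doubled input pixel: amortised hub read + rdlut + two wrlut + loop overhead
per_pixel = 1 + 3 + 2 * 2 + 2   # 10 clocks
doubling = INPUT_PIXELS * per_pixel
print(doubling, ACTIVE_CLOCKS - doubling)  # half the line left for TERC/audio
```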

    ' Code to double pixels in all the different colour depths
    
    doublepixels
                                push    ptra                    'preserve current pointers
                                push    ptrb
                                mov     ptra, #$188             'setup pointers for LUT accesses
                                mov     ptrb, #$110
    doubleloop                  rep     #0-0, pb                'patched pixel doubling loop count
                                skipf   pattern                 '1  2  4  8 16 32
                                rdlut   a, ptra++               '*  *  *  *  *  *
                                setq    nibblemask              '      *
                                splitw  a                       '   *
                                mov     b, a                    '*  *  *  *  *
                                movbyts a, #%%2020              '   *
                                movbyts a, #%%1100              '      *  *
                                movbyts a, #%%1010              '*           *
                                wrlut   a, ptrb++               '               *
                                mergeb  a                       '   *
                                mergew  a                       '*
                                mov     pb, a                   '      *
                                shl     pb, #4                  '      *
                                muxq    a, pb                   '      *
                                wrlut   a, ptrb++               '*  *  *  *  *  *Short 32bpp loop
                                movbyts b, #%%3131              '   *           |returns here but
                                movbyts b, #%%3322              '      *  *     |falls through at
                                movbyts b, #%%3232              '*           *  |end after its REP
                                mergew  b                       '*              |block completes.
                                mergeb  b                       '   *           |
                                mov     pb, b                   '      *        |
                                shl     pb, #4                  '      *        |
                                muxq    b, pb                   '      *        |
                                wrlut   b, ptrb++               '*  *  *  *  *  |<skipped for 32bpp
                                jmp     #exitmouse              'share common code to restore ptrs
    
  • I found some info about the DVI encoder. There is Verilog near the bottom of the page. https://forums.parallax.com/discussion/comment/1450072/#Comment_1450072

    @cgracey said:

    @TonyB_ said:
    Decoding the TMDS 10-bit outputs reproduces the 8-bit inputs here (http://forums.parallax.com/discussion/comment/1449960/#Comment_1449960).

    Question:
    Does the HDMI logic support pixel repetition? E.g. TMDS clock = 250MHz and pixel clock = 12.5MHz for 320x240 source.

    Yes. If a new 10-clock frame starts and there's no new data, it repeats the last value, updating the ones' accumulator for 8b video data.

    Seems like we can't quote from other threads.

  • TonyB_TonyB_ Posts: 2,176
    edited 2024-10-10 18:31

    Quoting old messages seems to only quote the last replier's response. Nested quotes can be done manually, by quoting the earlier message, replacing > with >> and merging the two quotes, e.g.

    @cgracey said:

    @TonyB_ said:
    Decoding the TMDS 10-bit outputs reproduces the 8-bit inputs here.

    Question:
    Does the HDMI logic support pixel repetition? E.g. TMDS clock = 250MHz and pixel clock = 12.5MHz for 320x240 source.

    Yes. If a new 10-clock frame starts and there's no new data, it repeats the last value, updating the ones' accumulator for 8b video data.

    I notice that old messages say X wrote: and newer ones say X said:. I don't know when that change happened, probably years ago now!

    EDIT:
    1. I also had to manually do the here link.
    2. I really enjoyed the P2 design era and I miss it, although I showed up late in the process.

  • roglohrogloh Posts: 5,714
    edited 2024-10-10 23:21

    @SaucySoliton said:

    I found some info about the DVI encoder. There is Verilog near the bottom of the page. https://forums.parallax.com/discussion/comment/1450072/#Comment_1450072

    @TonyB_ said:
    Decoding the TMDS 10-bit outputs reproduces the 8-bit inputs here (http://forums.parallax.com/discussion/comment/1449960/#Comment_1449960).

    Question:
    Does the HDMI logic support pixel repetition? E.g. TMDS clock = 250MHz and pixel clock = 12.5MHz for 320x240 source.

    @cgracey said:
    Yes. If a new 10-clock frame starts and there's no new data, it repeats the last value, updating the ones' accumulator for 8b video data.

    Seems like we can't quote from other threads.

    This comment from Chip sounds promising. We'll have to retest this, it could save us a lot of problems if the DVI HW will repeat the pixels for us and open up the streamer use again.

  • roglohrogloh Posts: 5,714
    edited 2024-10-11 02:26

    @rogloh said:

    @SaucySoliton said:
    I found some info about the DVI encoder. There is Verilog near the bottom of the page. https://forums.parallax.com/discussion/comment/1450072/#Comment_1450072

    @TonyB_ said:
    Decoding the TMDS 10-bit outputs reproduces the 8-bit inputs here (http://forums.parallax.com/discussion/comment/1449960/#Comment_1449960).

    Question:
    Does the HDMI logic support pixel repetition? E.g. TMDS clock = 250MHz and pixel clock = 12.5MHz for 320x240 source.

    @cgracey said:
    Yes. If a new 10-clock frame starts and there's no new data, it repeats the last value, updating the ones' accumulator for 8b video data.

    Seems like we can't quote from other threads.

    This comment from Chip sounds promising. We'll have to retest this, it could save us a lot of problems if the DVI HW will repeat the pixels for us and open up the streamer use again.

    Just tried patching my code so it outputs half the active pixels at half the rate in DVI mode: a setq with the divide-by-20 value before the active pixels begin, restored back to divide-by-10 at the end of the active portion so the other sync timing stuff all still works. Couldn't get an image on screen. Might still be something I'm doing wrong, or maybe this repetition only works in immediate mode...? My code is using the streamer.

    Update: just verified that this same approach of the extra setq instructions, with VGA instead of DVI in my driver, does double the pixels on screen, so it would appear my patched code is okay. It's just that DVI output doesn't like it. I wonder if the streamer's NCO value still somehow affects the DVI encoder block's timing and messes with it, or if we have to do some special initial priming for it to work.
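    For what it's worth, the divide-by-10 and divide-by-20 frequency values in that experiment boil down to this (assuming the standard P2 streamer NCO behaviour, a 32-bit accumulator that adds the programmed value every sysclk and emits an element on rollover; the exact rounding may matter, so this is only a sketch):

```python
def nco_value(div):
    """32-bit streamer NCO increment for one output element per 'div' sysclks."""
    return round((1 << 32) / div)

print(hex(nco_value(10)))  # one pixel per 10 clocks (normal DVI pixel rate)
print(hex(nco_value(20)))  # half rate, i.e. 2x pixel repetition
```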
