HDMI discussion
This discussion was created from comments split from: Console Emulation.
Replying in your thread to avoid over-polluting MXXs thread...
It might help, but you would need sufficient clocks for all the text work. My old comments in my code mention it can render 80 characters of 8-pixel-wide text with independent FG/BG colours & optional flashing in 2760 P2 clocks, so that's about 4.3 P2 clocks per pixel. With 10 clocks per pixel in HDMI mode there should be time to render text and do quite a bit of other work. The bigger issue perhaps is that that solution uses LUT RAM to temporarily buffer the scan line output, plus another 64 COG longs for a per-scanline font. Another solution involving HUB RAM instead might reduce that usage but require more access cycles, which also have to compete with the streamer access.
Are you considering putting text into your own NeoVGA driver, or do you think I should try to add HDMI to mine?
I was comparing what our different video drivers offer recently and there is quite a lot of overlap now, but differences still exist. My driver is more general purpose overall, while yours is currently still primarily aimed at emulators but could be expanded over time with more capabilities.
p2videodrv:
NeoVGA:
You don't need to buffer the rendered line, you just feed one character sliver at a time into the streamer. It's essentially the same as pixel doubling. The font line buffer is needed though, RDBYTE per character is too much. Problem is that you can't byte-index LUT. (Though do we really need 256 characters? 128 is probably fine for basic debug/system output and if you want actually nice text you'd render it yourself (with proportional fonts, unicode, whatever) - ASCII only has 95 printable characters, so there's some space to squeeze umlauts).
Uhhh good question. In general more independent-ish HDMI implementations would be the good and right thing.
The big difference IMO is that the ***VGA driver (both the horrid old version and the new clean one) is really designed for hacking in whatever feature-of-the-week is needed. It's really currently not something that can just be dropped into a new project. Though the new version tries to be better about it...
PSRAM mailbox action and other color modes are trivial to add in, of course (I did it before, though I think that never got released). The new version reserves the first half of LUT so 8bit mode can be added easily. It's just that I've found that having some other cog render stuff out to RGB24 line buffers has been the correct™ way of doing things in most of the projects I do (Tempest 2000 uses framebuffer rendering, but multiple buffers need to be composited and the 16 bit CRY color format is of course not native to the P2 - so mikoobj.spin2 is what ends up doing the PSRAM request list, compositing all the CRY buffers together and converting the result to RGB24 - oops, same video driver model again).
Also, big feature: the driver loads itself fully into cog/lut, so you can nuke the hub data after it gets going. This would be really pertinent if I could add overlays to flexspin. NeoYume's menu-overwriting trick is a bit brute-force...
That's what my driver was intended to do too. It loads everything into the COG exec space (plus LUT RAM), and the total ~4kB HUB memory footprint can then be reused as a dual scan line buffer area, making HUB usage very low overall. Also, with no HUB exec required, any corruption of HUB RAM after the video driver COG is spawned can't really crash the driver, which is useful if you are using it to debug the problem - although it can obviously still corrupt the screen output itself.
One other thing is I do wonder how high a resolution your driver can be operated at? I know mine can reach 1920x1200 in VGA mode, not sure about 2048x1536 - I think that exceeds my buffering capacity with the mouse sprite stealing 16 longs in LUT leaving only 240 for the temporary line buffer for doubling and 256 for the palette RAM. What would limit the resolution in your driver? Is it just the P2 clock speed for outputting doubled pixels or other things like streamer ops?
The old driver could do 5x scaling from 384x240 to 1920x1200. No reason it couldn't do this with 1x native data. Haven't readded such hi-res modes with the new one, but I guess it should work. Remember that software pixel doubling is only needed for DVI/HDMI (where there is always time for it because of the fixed clock ratio), for analog I can just set a slower clock and have the normal streamer DMA do the work.
Aha, yeah pixel doubling via HW streamer DMA is nice. I guess I expected I couldn't do that due to the fact that different regions could exist with/without doubling present, unless I somehow tweak streamer clocks on the fly per region - hmm maybe I could use setq to change the streamer rate per region if it were cleaned up a bit.
Doing HW pixel doubling and/or the scanfunc trick could open up lots more LUT longs in my driver and free space needed for things like HDMI and frequent audio polling. I should think about this if I want to restructure it.
This is all very interesting as I can see you've probably learned a few things from what I did in my driver code (at least what parts to avoid and how to improve) and I can learn some from your approach too.
Something that takes the best parts of both our drivers could be pretty nice - probably closer to your overlay framework. I do sort of like the simplicity of basic frame buffers for graphics work, and single COG operation with some built-in text support is nice too, with full per-character bg/fg colour like real VGA - otherwise you need to burn more COGs. Multi-region is probably not as frequently needed, but it can really help if you want to have pull-down screens with different COG applications running in each, not needing to really share a single screen. Like a debug console that can be scrolled/hidden on demand, running alongside some sprite based game or another GUI on-screen with a mouse cursor, which don't need to know about this console and can simply continue writing to their own buffers independently.
What would the DVI encoder do if operated with a pixel clock other than 1/10 sysclock? A couple possibilities.
I don't think it works correctly with anything but a 10:1 ratio (as you need 10 serial clocks to stream out all the serial data with the TMDS/TERC4 encoded data). If you ran at 20:1 ratio it might do something interesting with every second pixel, but with 25MHz minimum DVI pixel clock you'd technically need a 500MHz P2 to do this. I guess it should be tried out (didn't we try this once already - can't recall, but fairly confident it didn't double for us otherwise I'd be making use of it).
Was thinking about this... I currently use altgb to just read the byte array of font data from COGRAM and this pulls out the correct byte from the long for you automatically in the following getbyte instruction. But it could be read from LUT with this, assuming this snippet executes from COGRAM (and is easily patchable) and flags can be used. This only adds 3 extra instructions per font byte (7 clocks with the RDLUT) vs the existing altgb method. These 7 extra clocks do thankfully get amortized over all 8 pixels so it's not too much slower in the 80 clock budget total for DVI/HDMI.
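Roughly along these lines (a sketch only - register names are placeholders and it's not the exact 3-instruction sequence counted above; the flags here just pick which byte of the rdlut'd long is wanted):

        ' sketch: fetch this character's font byte from the scanline font held in LUT RAM
        ' (char, fontbase, addr, fbits are placeholder names)
        mov     addr, char              ' character code 0..255
        shr     addr, #2                ' four font bytes packed per LUT long
        add     addr, fontbase          ' fontbase = LUT address of the scanline font buffer
        rdlut   fbits, addr             ' long holding this character's font byte (3 clocks)
        testb   char, #0 wc             ' C/Z = low two bits of the char code,
        testb   char, #1 wz             '  i.e. which byte of the long we want
if_nc_and_nz getbyte fbits, fbits, #0   ' byte 0
if_c_and_nz  getbyte fbits, fbits, #1   ' byte 1
if_nc_and_z  getbyte fbits, fbits, #2   ' byte 2
if_c_and_z   getbyte fbits, fbits, #3   ' byte 3 -> fbits = 8 font pixels for this char row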
Yes, SETQ+XCONT. You really want to set it for the active video only and change it back for the front porch.
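I.e. roughly like this (sketch only - the command longs and NCO values are placeholders for whatever the driver already sets up, e.g. the divide-by-10 and divide-by-20 values):

        ' per-region rate change: a SETQ right before the streamer command supplies the new NCO value
        setq    nco_div20               ' halve the pixel rate for this region
        xcont   m_active, #0            ' active-region streamer command runs at the new rate
        ' ...
        setq    nco_div10               ' restore the normal rate for the front porch/sync timing
        xcont   m_porch, #0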
🎵Live and Learn, from the works of yesterdaaaaay!🎵
Agree, I usually bust out p2videodrv for simple demos or tests and such. Which reminds me that I never posted any of the texture mapping research code... (and I was going to add it here as a bonus but flexspin's project zipping just sharted itself)
Yeah
Not sure if that's worth it...
The pixel doubling code is:
The task code gets 22 useful cycles out of 40. Also note that it leaves the flags alone (not sure if I rely on that currently).
There may be a problem with doing both pixel doubling and text with the same scantask. My code uses 76 of these 22-cycle time slots for CRC+TERC4 encoding, so if you did 80 calls into a text function, there wouldn't be any time left. You could alternate it so that 160 calls produce 80 characters. Can certainly do monochrome text, but color seems like a pipe dream with this.
May look a bit like this:
So just putting aside the doubling for now: if you are sending data out in immediate mode using the streamer you don't have to issue an XCONT per pixel, but can do it per group of 8 pixels when running 16 colour text (a la VGA). This greatly reduces the frequency of calling the scanfunc to just 40 times per line on a 640x480 sized screen, for example. A lot of my earlier budget involved looking up the LUT RAM for character & colour data and writing back. Some of this goes away or is reduced when reading the FIFO and issuing XCONTs directly. Looking at my current text code, the basic character loop is something like this, where the last two instructions after the rep loop label are actually included because of skipf processing which skips two instructions inside the loop:
In your own case, the scanfunc needs to be fully buffered up with 2 XCONTs to gain cycles to go do other work. This means 2 x 8 pixel characters or 4 total bytes with the colour attribute bytes included can be read from the FIFO and worked on at a time. Once loaded with the next pair of characters it would only have to happen on average every 160 clocks (16 pixels).
Sounds like your code uses 76x22 = 1672 total cycles during the active portion to complete the other TERC4/audio polling work. As long as you get enough total cycles on the line to do your work it probably doesn't matter precisely how they are distributed, assuming audio is still polled often enough to not miss the sample and that you don't ever underrun the streamer and you can fit these scanfunc calls somewhere useful in your own code.
With this in mind I expect a variant of the above could be coded which is optimized for reading source buffer data from the FIFO and its font data out of LUT RAM (although you do need 64+ extra housekeeping clocks per line to fill this before the active data begins). It's mostly a matter of breaking this work into appropriately sized chunks so you get some free cycles for your operations. Something like this maybe?
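(Very rough sketch of the shape only - placeholder names such as fontbase and m_8pix, no flashing attribute, not cycle-exact, and the font-bit-to-nibble expansion is just one way to do it; depending on font bit order vs streamer nibble order a REV may also be wanted somewhere.)

textfunc
        rep     @.tend, #2              ' two characters per call = 16 output pixels
        rfword  cdat                    ' char (low byte) + attribute (high byte) from FIFO
        getbyte char, cdat, #0
        getbyte attr, cdat, #1
        mov     addr, char
        shr     addr, #2
        add     addr, fontbase          ' scanline font buffer sits in LUT
        rdlut   mask, addr              ' long holding this char's font byte
        testb   char, #0 wc
        testb   char, #1 wz
if_nc_and_nz getbyte mask, mask, #0     ' pick byte char & 3
if_c_and_nz  getbyte mask, mask, #1
if_nc_and_z  getbyte mask, mask, #2
if_c_and_z   getbyte mask, mask, #3
        movbyts mask, #%%0000           ' font byte into all four byte lanes
        mergew  mask
        mergew  mask                    ' each font bit now fills one nibble of the mask
        getnib  pix, attr, #1           ' background colour index (VGA-style attribute)
        mul     pix, ##$1111
        movbyts pix, #%%1010            ' bg nibble replicated into all 8 nibbles
        getnib  fgc, attr, #0           ' foreground colour index
        mul     fgc, ##$1111
        movbyts fgc, #%%1010
        setq    mask
        muxq    pix, fgc                ' fg where font bit set, bg elsewhere
        xcont   m_8pix, pix             ' 8 pixels of immediate 4bpp (LUT-mapped) data
.tend   ret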
You have to repeat the work twice in the scanfunc before returning, using the rep at the start, and call this snippet 40 times per line. As counted it would take 70 clocks + 6 call overhead = 76 clocks every 160 P2 clocks, leaving 40 x (160-76) = 3360 cycles remaining for TERC4/audio, which seems like it should be sufficient if you only needed 1672 before. However you do have to read the 64 long font for the scan line into LUT RAM before it can be used, taking 64+ more clocks elsewhere in the horizontal blanking portion, or maybe at the end of the prior line if needed. Also as coded, this loop consumes 4 extra clocks per character for the flashing text attribute, but it's good to keep that.
Now for the pixel doubling case, I'd expect that it should only increase the CPU time available between calls, as only half the number of characters need to be read in and processed (40 vs 80, for example, with VGA). You can then duplicate the computed 8x4-bit nibbles across two 32 bit longs by replicating the nibble colour indices into bytes. So the generated 8x4-bit nibbles in the non-doubled colour long above, such as N7_N6_N5_N4_N3_N2_N1_N0, then become two longs, N7_N7_N6_N6_N5_N5_N4_N4 and N3_N3_N2_N2_N1_N1_N0_N0, to be sent to the streamer in the 40 iterations, and the rep loop part wouldn't be needed. This extra work is less than processing a whole other character and can save cycles using those handy dandy SPLIT/MERGE style instructions which assist here. Something like this below may work, which amounts to 51 clocks + 6 call overhead = 57 clocks total instead of 76.
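(Again only a sketch of one way the expansion might go, not cycle-counted and probably not the slickest trick: each half of the font byte is stretched so every font bit covers a whole byte of the mask, then the same muxq is applied; fgc/bgc are the replicated fg/bg colour longs as computed above, fbyte is the already-fetched font byte.)

        getnib  mask, fbyte, #1         ' high 4 font bits -> first 8 doubled pixels
        call    #stretch8
        mov     pix, bgc
        setq    mask
        muxq    pix, fgc
        xcont   m_8pix, pix
        getnib  mask, fbyte, #0         ' low 4 font bits -> second 8 doubled pixels
        call    #stretch8
        mov     pix, bgc
        setq    mask
        muxq    pix, fgc
        xcont   m_8pix, pix
        ret

stretch8                                ' expand 4 bits so each bit fills one byte of the mask
        setword mask, mask, #1          ' duplicate low word into the high word...
        mergew  mask                    ' ...so MERGEW doubles every bit
        setword mask, mask, #1
        mergew  mask                    ' x4
        setword mask, mask, #1
 _ret_  mergew  mask                    ' x8: one full byte of mask per font bit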
Actually there's another problem which you might be considering and I only just realized. If the outer calling code is written to call fine-grained scanfuncs for RGB24 individual pixel-pair use, and this common code also needs to be shared with text scanfuncs, then the calls will be happening far more frequently than we need. So maybe the text scanfunc needs to know this and only truly execute once every xx actual calls, based on the number of pixels it generates, returning early otherwise - so you can still share the audio+TERC calling code for both types of scanfunc. I see that issue now. An incmod counter, #limit wz instruction and early return may help (but that will still burn 10 P2 clocks per fake call, which might still be ok, not sure). It may mess up your accounting scheme for how many total pixels have been sent, unless the scanfunc counts it for you perhaps.
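I.e. something along these lines at the entry of the text scanfunc (placeholder names, and the 1-in-4 ratio is just an example):

        incmod  fakecnt, #3 wz          ' wraps to zero (Z set) every 4th call
if_nz   ret                             ' otherwise bail out early: ~10 clocks including the call overhead
        ' ...the real 16-pixel text work continues here...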
That's what I was on about earlier. The task code will run for 0<=x<=22 cycles for each scanfunc call. There's essentially two somewhat related problems here:
So if you have a 76cy function generate 16px (160cy) of streamer data each time, you only get 40 timeslots.
If you add the incmod dealio, the base cost increases to 80.
The incmod idea just kinda sucks, changing the function pointer uses 2 less cycles per call, so 78+8+8+22+22+22 = 160. That might work out exactly! Though introducing 3/16 as a factor somewhere is nasty of course.
Yeah the incmod isn't ideal in the scanfunc as it involves the call overhead every time - I knew that, it was just a starting idea. Best to be done somehow on the caller side, maybe with a conditional test that's only true once every 8 times for example. Sort of like the idea of an incmod on the caller side, but it will burn extra instruction space.
I expect there will be some way to chop it up to make it work, just need to think more. There's enough total clocks per line for the work itself.
Did you see that post way earlier? You can just change the function pointer and that only takes 2 cycles, which are inside the scanfunc. You do need enough
_ret_ mov scanfunc,#.something
lines to get to the desired division level, but for 1/3 calls that's fine. It can also be used to balance out the workload.

If your call to the scanfunc had a preceding incmod before each call, how many extra instructions would be taken up in the COG's memory, I wonder? If the accounting instructions were done in the scanfunc, maybe it would free up room too.
E.g. move the sub scantimecomp, #1 into the scanfunc for each pixel generated, and replace it with this...
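Roughly, with 'slot' being a hypothetical wrap counter at each call site:

        incmod  slot, #3 wc             ' 2 clocks; wraps (C=1) every 4th slot
if_c    call    #scanfunc               ' only the wrapping slot actually calls (2 clocks when skipped)
        ' (the scantimecomp accounting then lives inside the scanfunc itself)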
Then basically keep the scanfunc call spacing the same as you have and most of the time the call is not made which effectively gives you many more time slots than real scanfunc calls.
I see you'd probably only get 80 clocks per slot as the text code takes the other half.
So you would get to execute 3 * 22 clocks (your current spacing) for every scanfunc call, and also incur a 4 clock overhead penalty for the 3 out of 4 slots where the call is not made.
(22+4) * 3 = 78. This gives 120 per line.
When the call is made it's going to take the 2 cycle incmod test + 6 call overhead cycles plus 78 working cycles for 16 pixels (including the extra accounting instruction)
This is 86 clocks.
So we have 78+86 = 164 clock cycles in 160 budget. Damn not enough, unless flashing attribute is yanked or other optimizations are found. You could unroll the rep loop in the scanfunc for two clocks saved.
Actually it's not as bad. I double counted the overhead. It was only 76 before including the call overhead so it's now 78 plus the prior 2 cycle incmod = 80. 78+80 will just fit the budget, but it's tight.
Update: One hassle with this is there are no free cycles left for any cursors to be done on the fly. You could do a soft cursor with a flashing block or underscore written into the buffer, but it's not the same and will destructively kill the character it is on. Maybe that's the final icing on top if there's a way. It's mostly a compare of a counter value plus an overriding mov instruction to do a simple cursor - 4 clocks would just about do it.
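E.g. something like this per character (purely illustrative names):

        cmp     charcnt, cursorpos wz   ' at the cursor column?
if_z    mov     fbits, #$FF             ' then override the glyph with a solid block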
Update 2: One more problem in text mode. As the pixels need to be sent out right after the video guard symbols, we'd need an initial setup call just like the scanfunc, but one that saves off the first 16 pixels into two longs instead of sending them to the streamer, so they are ready to load into the streamer's buffer right after the video guard pixels go out. It's probably solvable, just yet another annoyance to deal with. Text mode is always a bit of a pain.
Update 3: If the 22 clock budget usually included the accounting overhead instruction before, we could remove that - dropping the slots down to 20 cycles and freeing up 6 clocks (3 instructions' worth) per real scanfunc call, which could then be added back into the scanfunc budget, possibly to help do a cursor or two.
You can't do that, it'll mess up the timing for pixel doubling. There's a reason the caller subtracts from scantimecomp - it can do it in batches, since it only needs to be right at the end.

Please re-read my earlier posts. With changing the function pointer I calculated it down to 160 cycles; it should work exactly like that and need no extra change to the scantask code (just need to change the initial value of scantimecomp).
Not actually a problem. Remember the streamer has a command buffer, so the guard band can be entered while the back porch is still going. The opposite is a problem, going from short immediate commands to border/blanking, but that's a lot easier to solve.
As mentioned, accounting for scantimecomp is done in huge batches. It only needs to have the correct value at the very end.
Pixel doubling for non-text modes can be maintained I think if the incmod limit variable is set to zero, so it always calls your scanfunc in that case just like before. The extra incmod does of course need to introduce some adjustments to the sequence.
When pixel doubling for text, there would still be 40 real calls made and the scanfuncs would be able to work the same in the 160 clock budget, just half as much source data is read from the fifo and 16 total pixels would still go out like before.
So then are you mostly worried that the existing 22 clock maximum interval between scanfunc calls can't ever be messed with any more due to existing critical fragments? Maybe that's the irritation, if it was really hard to get this timing correct...? I was envisioning that some small adjustments to the places where the calls are done could still be made (which may increase total slot usage of the 120 available slightly), so long as you still get to call it every 40 clock cycles total (or faster, in which case it'll wait in the scanfunc for the streamer buffer to be ready before returning), keeping the spacing intact. Maybe you have some code that needs all 22 clock cycles for some processing work in some cases, and reducing down to 20 cycles is really difficult or impossible.
Or are you worried it won't fit in COG RAM with the extra prior incmods needed before each scanfunc call, given your accounting is often done in batches, so there won't be enough reclaimed COG RAM space to make up for the extra incmods?
Or maybe the incmod's use of flags, killing either Z or C around each scanfunc call, is a showstopper?
The use of scantimecomp as it stands seems mainly to determine the end-of-line condition in the line loop. I was thinking that if we accounted for all these pixels sent out within the scanfunc itself, then a benefit is that this counter could do double duty and be used to reference the character index for a cursor position check... though you'd also need to put the same accounting operation back into the rgb24 and any other pixel doubling variants, which will use two more cycles and obviously force adjustments to your code's call slot sequences. That's really the only reason I like the incmod stuff put on the caller side and moving the scantimecomp accounting into the scanfunc, as it can free a few more clocks by avoiding the call overhead altogether (4 instead of 8), though I can see it will consume more instruction space. I counted 28 calls to scanfunc and 8 adjustments to scantimecomp. This means it would consume 20 more instruction longs in COG/LUT, which is a downside of course if space is really limited.
Yeah I know how your approach works - it'd just keep all the scanfunc calls as-is and adjust the next scanfunc function address dynamically in a chain, creating a few shortened calls followed by the real call. I do rather like it for the small impact it creates, but the main downside is that there is no more budget in the text loop code pasted above, so it won't ever be able to support a cursor. Minor gripe, but I am looking for any solutions that resolve that... although perhaps my solutions are consuming too many resources for the minor budget increase they provide.
Yeah that's true, the guard band sends two identical pixels, so it'll probably be able to issue just a single immediate channel command for that while the back porch is running. I was thinking each pixel needed to get its own channel command and then the command fifo would be blocked.
Yeah, not all 22 cycle batches as coded change the scantimecomp variable, so we can't bank on gaining those extra 6 cycles every time. If we stole them by dropping down to 20 useful cycles between the scanfunc calls (from 22), then slot usage has to increase slightly - hopefully only about 10% or so going from 22 clocks down to 20. If you only used 96 slots before then hopefully 120 is still enough to finish before the end of the line...? It's tight.
What exactly is this 3/16 thing you mentioned earlier?
16 pixels over 3 calls -> nasty factor to get from 640 to 120
Adding a cursor could be done as-is if flashing is dropped. Is that really an important feature? (ZX Spectrum programmers hate it!)
I'm getting a minor headache off this I feel. Maybe I should just prototype an actual impl instead of dilly-dallying.
Ok I guess that's due to the accounting outside the scanfunc.
Yeah, I thought the same if it came down to it. Even VGA has a bit to ignore it and just use it for more background colours. Maybe I'll come up with more instructions saved in the text loop somehow. Last night I thought of a way that the scantimecomp use could be avoided for the cursor index counter: instead, once at the start of the line, just configure an underflowing counter which hits zero and sets the Z flag exactly when we reach the cursor position. It'd be nice to have a secondary mouse cursor though, for TextUI use.
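Something like this, illustrative only (the initial value would need adjusting for wherever in the loop the test sits):

        ' once per line:
        mov     curscnt, cursorcol      ' character cells until the cursor
        ' per character:
        sub     curscnt, #1 wz          ' Z set exactly once, at the cursor cell
if_z    mov     fbits, #$FF             ' drop a block cursor over that character's font bits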
Well that's one way to shut down the argument. Yeah, probably best to just start coding it and things will work themselves out. I'm also looking for ways that my own features could potentially work in all this discussion, though I'm not too sure my model of always being one scan line ahead would fit with it, which probably also affects the PSRAM, mouse sprite and multi-region stuff badly.
What was that SpaceX quip again? Something like: it's just easier to go to the moon than to convince NASA we can go to the moon!
I mean you could probably make it work without any of this scanfunc headache then. If you always pre-render the next scanline into a buffer, you can run the audio task without timing headaches. Can also buffer the packets into hubRAM, which makes streaming them out nicer.
Agree. But how does doubling ever work then with HDMI? I think I'd have to give it up in HDMI mode, which I guess is a sacrifice I'd be willing to make. Plus the extra HDMI code needs space and would use the 240 LUT longs instead of how I use them today, to both double pixels and render text into before writing it out. HDMI with audio is especially hard, as you've discovered. HDMI video framing alone is okay.
Well if we say the HDMI audio needs some ~2200 cycles in a scanline, there is a lot left over to copy and double up data for the next scanline.
Yeah, apart from 24 bpp mode, which was a real killer IIRC. I'll have to dig up the actual cycle counts in the old DVI thread. It's different for the different depths. Lower bpp is harder to double, but more packed bits get processed at a time and share the same longs for transfers. The 24bpp mode was quite hard on the COG, because for each pixel doubled you need one HUB long read into LUT, one read from LUT, two writes to LUT and two writes back to HUB, plus the loop overhead to fit in bursts over the 240 LUT longs. That's already 1+3+2*2+2 clocks, or 10 clocks per doubled input pixel, which is half the pixel budget - and it shares the memory bandwidth with the streamer too. At best that's probably gonna leave around 320x10 = 3200 cycles minus overheads at 640x480 resolution, with 6400 clocks in the active portion of the scanline. Might just still fit the budget. But another problem is the free LUT space and existing usage. How much extra code space does a basic HDMI TERC encode + audio resampler take, I wonder? It's certainly a challenge. Maybe there's scope to dynamically swap out TERC encoder code on the fly and reuse the LUT space afterwards for doubling, though that breaks my model of a self-contained rock solid video COG independent of HUB RAM corruption. All sorts of tricks may be needed.
I found some info about the DVI encoder. There is Verilog near the bottom of the page. https://forums.parallax.com/discussion/comment/1450072/#Comment_1450072
Question:
Does the HDMI logic support pixel repetition? E.g. TMDS clock = 250MHz and pixel clock = 12.5MHz for 320x240 source.
Yes. If a new 10-clock frame starts and there's no new data, it repeats the last value, updating the ones' accumulator for 8b video data.
Seems like we can't quote from other threads.
Quoting old messages seems to only quote the last replier's response. Nested quotes can be done manually, by quoting the earlier message, replacing > with >> and merging the two quotes. I notice that old messages say "X wrote:" and newer ones say "X said:". I don't know when that change happened, probably years ago now!
EDIT:
1. I also had to manually do the here link.
2. I really enjoyed the P2 design era and I miss it, although I showed up late in the process.
@SaucySoliton said:
This comment from Chip sounds promising. We'll have to retest this, it could save us a lot of problems if the DVI HW will repeat the pixels for us and open up the streamer use again.
Just tried patching my code so it outputs half the active pixels at half the rate in DVI mode, with a setq using the divide-by-20 value before the active pixels begin, restored back to divide-by-10 at the end of the active portion so the other sync timing stuff still all works. Couldn't get an image on screen. Might still be something I'm doing wrong, or maybe this repetition only works in immediate mode...? My code is using the streamer.
Update: I just verified that this same approach with the extra setq instructions, using VGA instead of DVI in my driver, does double the pixels on screen, so it would appear my patched code is okay. It's just that DVI output doesn't like it. I wonder if the streamer's NCO value still somehow affects the DVI encoder block's timing and messes with it, or whether we have to do some special initial priming for it to work.