HDMI discussion - Page 4 — Parallax Forums

HDMI discussion



  • rogloh Posts: 5,615
    edited 2024-10-15 01:57

    @Wuerfel_21 , I had a look at your driver in terms of the COG space it might need for a port to my driver for HDMI audio.

    I counted up:

    185 longs for build_audio without scanfunc overhead - this would go into my spare LUTRAM

    COG RAM use:

    • 9 longs for packet buffers
    • 16 longs for TERC4 encoding table
    • 3 longs for TMDS mask, spdif parity
    • 2 longs for spdif chnl status
    • 1 long audioperiod
    • 2 longs for spleft/right
    • 24 longs for misc hdmi audio/terc encoding state
    • 6 longs for audio poll loop

    The known COG RAM use adds up to 63 longs, which would fit nicely inside my freed-up font area in COG RAM. Some extra COG RAM instructions are also needed for transmitting the HDMI sync/island output. This would vary a bit from yours as I'd need to integrate it into my own code's framework, but hopefully not too much extra code is required given it would stream from HUB data. I could move the region handling code and/or mouse sprite code to an overlay and free 54+ and 63 longs respectively for any extra HDMI sync logic that may be needed beyond what I can free by removing unnecessary PAL/NTSC/VGA stuff. Should be plenty of space.
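
The tally above can be checked with a few lines of Python. This is only a bookkeeping sketch: the dictionary keys are informal labels for the list items, not identifiers from either driver.

```python
# COG RAM longs needed for HDMI audio, as itemised in the list above.
# The keys are informal labels, not symbols from the driver source.
cog_ram_longs = {
    "packet buffers": 9,
    "TERC4 encoding table": 16,
    "TMDS mask, spdif parity": 3,
    "spdif channel status": 2,
    "audioperiod": 1,
    "spleft/right": 2,
    "misc hdmi audio/terc encoding state": 24,
    "audio poll loop": 6,
}

total_cog_longs = sum(cog_ram_longs.values())

# Longs that could be freed by moving code to overlays (lower bounds from the post).
freed_by_region_overlay = 54
freed_by_sprite_overlay = 63
freed_total = freed_by_region_overlay + freed_by_sprite_overlay

print(f"COG RAM needed: {total_cog_longs} longs")
print(f"Freed by overlays: at least {freed_total} longs")
```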

    I now think this approach has legs and would likely fit in my driver's space and hopefully timing budget even when you factor in the extra overhead of reading these overlays into LUT while pixels are streaming.

    E.g. consider the worst case: pixel doubling in 32 bpp mode during an active scan line of 6400 P2 clocks.
    Pixel doubling takes about 3648 P2 clocks.
    It would take 185*1.25 = 231 P2 clocks to read in your packet encoding code, from "build_audio" to the end of encoding. At the end of this a 32 byte packet would be created in LUT, with approximately 50 extra clocks to output it to HUB RAM.
    This still leaves 6400-3648-231-50 = 2471 clocks for a mouse sprite, any PSRAM mailbox request, and your TERC4 packet encoder to execute.

    A mouse sprite takes ~78 clocks to read in the code and 480 clocks to execute.

    This leaves 2471-78-480 = 1913 P2 clocks remaining in the active portion for packet encode, several audio polls on the scan line, and a PSRAM mailbox call, in case some of that isn't already doable during hblank. Hopefully it would suffice. I think you mentioned earlier that your code takes 1398 cycles to CRC+TERC encode the packet. That leaves about 515 clocks for audio resampling and populating the subpacket data in the best case. Your code uses ~180 clocks per sample to resample and ~24 clocks each time you slide the audio input history buffer down. Done twice on a scan line it'll just fit, but 3 samples per scan line might be too hard. Hopefully we don't ever need 3 samples encoded at 48kHz given a 640x480 minimum resolution. Or I could look at simpler resampling.
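
The whole worst-case budget can be laid out as a short Python calculation. The clock figures are the estimates quoted in this post; the variable names are made up for illustration, and the per-sample cost assumes one resample plus one history-buffer slide per sample.

```python
# Worst-case active-line budget, using the estimates quoted in the post.
ACTIVE_LINE_CLOCKS = 6400      # P2 clocks in the active part of a scan line
PIXEL_DOUBLING = 3648          # 32 bpp pixel doubling
OVERLAY_LOAD = int(185 * 1.25) # reading 185 longs of encoder code into LUT (231)
PACKET_TO_HUB = 50             # writing the encoded packet out to HUB RAM
SPRITE_LOAD = 78               # reading in the mouse sprite code
SPRITE_EXEC = 480              # executing the mouse sprite code
CRC_TERC_ENCODE = 1398         # CRC + TERC4 encode of one packet
RESAMPLE_PER_SAMPLE = 180      # resampling one audio sample
SLIDE_PER_SAMPLE = 24          # sliding the input history buffer (assumed once per sample)

after_overlay = ACTIVE_LINE_CLOCKS - PIXEL_DOUBLING - OVERLAY_LOAD - PACKET_TO_HUB
after_sprite = after_overlay - SPRITE_LOAD - SPRITE_EXEC
after_encode = after_sprite - CRC_TERC_ENCODE

cost_per_sample = RESAMPLE_PER_SAMPLE + SLIDE_PER_SAMPLE
samples_that_fit = after_encode // cost_per_sample

print(after_overlay, after_sprite, after_encode, samples_that_fit)
```

With these figures two samples cost 408 clocks and fit in the remainder, while three would cost 612 and would not.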

  • Good job counting all those cycles!

    You never need more than 2 samples per scanline, ever, unless you're doing 15kHz video (yes 480i over HDMI is valid) or hi-res 96kHz audio. Think about it, all the valid sample rates (32, 44.1, 48) divided by 31kHz are 1.something. I think I actually have a limit in place that stops it from encoding more than 2 per line.
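
A quick way to see this is to compute the ceiling of sample rate over line rate. The sketch below assumes nominal line rates of 31.5kHz and 15.75kHz; the function name is hypothetical and not taken from the driver.

```python
import math

def max_samples_per_line(sample_rate_hz: int, line_rate_hz: int) -> int:
    """Upper bound on audio samples that fall due within any single scan line."""
    return math.ceil(sample_rate_hz / line_rate_hz)

VGA_LINE_RATE = 31_500    # assumed nominal 31.5 kHz line rate
SDTV_LINE_RATE = 15_750   # assumed nominal 15.75 kHz line rate (480i)

for rate in (32_000, 44_100, 48_000, 96_000):
    print(rate,
          max_samples_per_line(rate, VGA_LINE_RATE),
          max_samples_per_line(rate, SDTV_LINE_RATE))
```

Only the 96kHz and 15kHz cases come out above 2.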

    In a pinch you can drop to linear resampling, but that has awful artifacts to it.

    You can actually do HDMI+VGA, so better not get rid of the VGA sync code. If you use the normal 252MHz VGA mode, it will even work properly! Maybe useful if someone wants to use a DVI-I connector.

  • rogloh Posts: 5,615
    edited 2024-10-15 03:04

    @Wuerfel_21 said:
    Good job counting all those cycles!

    Yes, tedious and easy to miss something.

    You never need more than 2 samples per scanline, ever, unless you're doing 15kHz video (yes 480i over HDMI is valid) or hi-res 96kHz audio. Think about it, all the valid sample rates (32, 44.1, 48) divided by 31kHz are 1.something. I think I actually have a limit in place that stops it from encoding more than 2 per line.

    Yeah, that's right. It could still potentially work with more samples per line but maybe just not during pixel doubling which is unfortunately what you may want if you are doing 480i anyway. It might still be useful for PSRAM based non-pixel doubled true colour setups. There are lots of extra cycles available in graphics modes without doubling enabled for more samples to be packed into the packet, although not many for text regions. Need to mull that over.

    In a pinch you can drop to linear resampling, but that has awful artifacts to it.

    Good to know.

    You can actually do HDMI+VGA, so better not get rid of the VGA sync code. If you use the normal 252MHz VGA mode, it will even work properly! Maybe useful if someone wants to use a DVI-I connector.

    Yep true, and I'll try to include it if it fits, but having multiple outputs is not the main priority right now. I will ultimately need to make good use of the sync time for the various things my driver does. Some parts of the mouse sprite stuff can probably go there. I think the audio is still gonna be real tight in the worst case scenario, but it'd be a great use of the time available in a single P2 COG if it all fits in the end, especially given all the features contained in my current P2 video driver. It'd be real nice to have all existing features working without too many compromises.

    One thing I want to do is to have a master volume control setting in the final sample output for left/right channels going out to HDMI. One of my per display longs updated per frame will suffice to hold dual 16 bit multiplier values - this could be used for panning effects too. Most display devices can also already control volume, and it's best to do it at the amp, but it'd be nice to be able to do it from the P2 as well even just for fading in/out etc. You do of course lose quality by scaling the signal range down, I know.
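
As a rough model of that idea, the Python sketch below packs two 16 bit gains into one 32 bit value and scales signed 16 bit samples with them. Everything about the layout is an assumption made for illustration: left gain in the low word, right gain in the high word, and gain treated as an unsigned fraction of 65536 (so $FFFF is just under unity).

```python
def unpack_gains(packed: int) -> tuple[int, int]:
    """Split one 32-bit long into (left_gain, right_gain).

    Assumed layout: left gain in the low word, right gain in the high word.
    """
    return packed & 0xFFFF, (packed >> 16) & 0xFFFF

def scale_sample(sample: int, gain: int) -> int:
    """Scale a signed 16-bit sample by gain/65536 (gain is unsigned 16-bit)."""
    return (sample * gain) >> 16  # Python's >> floors, like an arithmetic shift

def apply_volume(left: int, right: int, packed: int) -> tuple[int, int]:
    """Apply independent left/right gains, which also allows panning."""
    left_gain, right_gain = unpack_gains(packed)
    return scale_sample(left, left_gain), scale_sample(right, right_gain)

# Half volume on the left, muted on the right.
print(apply_volume(32767, 32767, 0x0000_8000))
```

Because the result is truncated to 16 bits again, low gains discard low-order bits, which is the quality loss mentioned above.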

  • TonyB_ Posts: 2,173

    @Wuerfel_21 said:
    You never need more than 2 samples per scanline, ever, unless you're doing 15kHz video (yes 480i over HDMI is valid) or hi-res 96kHz audio. Think about it, all the valid sample rates (32, 44.1, 48) divided by 31kHz are 1.something. I think I actually have a limit in place that stops it from encoding more than 2 per line.

    For 44.1/48kHz audio and 31.5kHz video, do you output 1 or 2 audio samples every line including vertical blanking & sync?

  • Yes. I always send it during HSync.

  • rogloh Posts: 5,615

    Just realized that in any 480i/576i case you'll have 2x the time per scan line (scan line frequency is halved) so there would be extra time to do more audio packet processing. This should work out okay if you need to encode 3-4 samples per line (15.75kHz * 4 > 48kHz). As you can still fit 4 stereo samples per audio packet you only need to encode a single packet so it won't mess with the data island arrangement. For text in 40 column output (double wide mode) you'd have to pixel quadruple however. Again there is more time to do this processing so it scales accordingly.

  • rogloh Posts: 5,615
    edited 2024-10-15 11:37

    @Wuerfel_21 , I've been looking at this code for a while and am wondering if there are any ways to optimize it further. It seems like a lot of data movement, but it's probably about as good as it's gonna get without unrolling loops, which won't buy much. Nice use of splitw and mergeb by the way.

    I still sort of wonder if something could be done during the prior CRC encode stage of subpackets when you are already working with nibbles. Maybe they could be split/merged then during that sequence in the rep loops? Something with ROLNIB perhaps accumulating across channels. The interleaving of red/green channel data is an added complication, isn't it.

    UPDATE: Actually on second thoughts it's probably pretty fast code as is. If you add anything more into the CRC REP it will multiply up pretty quickly and likely exceed this code's timing.

                  ' Assume PTRB is an appropriate LUT pointer to deposit 32 longs of encoded TERC4
    .serloop
                  sub scantimecomp,#3 + (3*4)
                  call scanfunc
    
                  ' serialize ch0
                  cmp serbyte,#0 wz
            if_z  rolbyte serch0,packet_extra,#2 ' FE for first packet in island, FF otherwise
            if_nz rolbyte serch0,#$FF,#0
                  altgb serbyte,#packet_header
                  rolbyte serch0
                  'rolbyte serch0,packet_vsync,#0 ' VSYNC on(?)
                  'rolbyte serch0,#$FF,#0 ' HSYNC on
                  rolword serch0,packet_extra,#0 ' Set sync status
    
                  altgw serbyte,#packet_data3
                  rolword temp1
    
                  call scanfunc
    
                  altgw serbyte,#packet_data2
                  rolword temp1
                  splitw temp1
                  rolword serch1,temp1,#0
                  rolword serch2,temp1,#1
    
                  altgw serbyte,#packet_data1
                  rolword temp1
                  altgw serbyte,#packet_data0
                  rolword temp1
                  splitw temp1
    
                  call scanfunc
    
                  rolword serch1,temp1,#0
                  rolword serch2,temp1,#1
    
                  mergeb serch0
                  mergeb serch1
                  mergeb serch2
                  'debug("TERC bytes ",uhex_long(serch0,serch1,serch2))
    
                  mov temp4,#4
    .byteenc
                  getnib temp1,serch0,#0
                  alts temp1,#terctable
                  mov temp2,0-0 ' This also copies HSync bit for VGA
                  getnib temp1,serch0,#1
    
                  call scanfunc
    
                  alts temp1,#terctable
                  mov temp3,0-0 ' This also copies HSync bit for VGA
    
                  getnib temp1,serch1,#0
                  setq tmds_ch1_mask
                  alts temp1,#terctable
                  muxq temp2,0-0
                  getnib temp1,serch1,#1
                  alts temp1,#terctable
                  muxq temp3,0-0
    
                  call scanfunc
    
                  getnib temp1,serch2,#0
                  setq tmds_ch2_mask
                  alts temp1,#terctable
                  muxq temp2,0-0
                  getnib temp1,serch2,#1
                  alts temp1,#terctable
                  muxq temp3,0-0
    
                  wrlut temp2,ptrb++
                  wrlut temp3,ptrb++
                  'debug("packet enc ",uhex_(temp2))
                  'debug("packet enc ",uhex_(temp3))
    
                  call scanfunc
                  shr serch0,#8
                  shr serch1,#8
                  shr serch2,#8
                  djnz temp4,#.byteenc
    
                  incmod serbyte,#3   wc
        if_nc     jmp #.serloop
    
    
                  sub scantimecomp,#1
                  jmp scanfunc ' tail call
    
    ' TERC packet pre-encoding buffers
    packet_header res 1
    packet_data0  res 2
    packet_data1  res 2
    packet_data2  res 2
    packet_data3  res 2
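
For readers less familiar with the bit shuffling in the listing above, here is a Python model of the two permutations. It is a sketch based on my reading of the P2 instruction descriptions (SPLITW puts the odd bits in the high word and the even bits in the low word; MERGEB gathers bit n of each of the four bytes into nibble n), so treat the exact bit order as an assumption to be checked against the official documentation. The function names simply mirror the mnemonics.

```python
def splitw(d: int) -> int:
    """Model of SPLITW: odd bits of d go to the high word, even bits to the low word."""
    result = 0
    for i in range(16):
        result |= ((d >> (2 * i)) & 1) << i             # even bit 2i   -> low word bit i
        result |= ((d >> (2 * i + 1)) & 1) << (16 + i)  # odd bit 2i+1  -> high word bit i
    return result

def mergeb(d: int) -> int:
    """Model of MERGEB: nibble j of the result holds bit j of bytes 0..3 of d."""
    result = 0
    for j in range(8):        # bit position within each byte
        for k in range(4):    # byte index
            result |= ((d >> (8 * k + j)) & 1) << (4 * j + k)
    return result

# After mergeb, each nibble is one 4-bit symbol built from four parallel byte streams,
# which is the form a nibble-indexed table lookup (like .byteenc) consumes.
print(hex(splitw(0xAAAAAAAA)), hex(mergeb(0x000000FF)))
```

In this model, splitw separates two interleaved bit streams into one word per channel, and mergeb turns four byte-wide streams into eight nibbles ready for the per-nibble table lookup.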
    
  • You can save some longs by re-rolling the .byteenc loop, I guess. Other than that, well, I didn't find any way to make it faster, but I also didn't lose sleep over it, since it's already quick enough for what I need. There is a theoretical trick I considered: if you only ever send 2 samples, the upper two bits of ch1/ch2 are not needed and can be zero, so you can reduce the processing that way.

    The CRC loop is separate because it is a separate thing really. The CRC bytes are just normal data to the TERC encoder.

    OT: Friggin vodafone is having a normal one, currently only have internetworking on my thinkpad (knew that LTE modem would come in handy one day), might run out of LTE quota and disappear for a while.
