
HDMI added to Prop2


Comments

  • TonyB_ Posts: 2,178
    edited 2018-11-04 02:17
    rogloh wrote: »
    In regards to your point about RDLUT taking 3 clocks instead of 2, I am hoping that, going back to the 8-instruction (16-clock) per-nibble-pair approaches that have been found, we should still have enough time left over on the scanline for a couple of additional/optional things in this idea for our HDMI gfx/text driver:

    You've come up with three different ways to convert pixel nibbles into TMDS data that have 3, 2 or 1 look-up tables and I think they will all work and take the same time.
    rogloh wrote: »
    1) font table lookup - need to block read up to 256 bytes as 64 LONGs for the current scanline and populate LUT RAM; this would take an extra 0.256us for the raw transfers plus some address/setup overheads, or half the transfers if only the top character range is dynamic.

    Yes, a great idea, the line buffer cog will have time to read 64 longs and the height of a character could be anything. I don't understand the need for the additional fixed fonts you mentioned, however. All the fonts will be soft, yet fixed if they aren't changed.
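
    A minimal sketch of that per-scanline font load, as a single SETQ2 block read into LUT RAM (FONT_BASE and font_addr are hypothetical names, assuming the font is stored row-major in hub):
    	setq2	#64-1			' next RDLONG becomes a 64-long block read into LUT RAM
    	rdlong	FONT_BASE, font_addr	' LUT[FONT_BASE..FONT_BASE+63] = one scan row of all 256 chars
    	add	font_addr, #256		' advance hub pointer to the next font row
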
    rogloh wrote: »
    2) transparent mouse cursor overlay - this feature would be very useful to support a decent 640x480x16 colour mode with a mouse pointer, enabling traditional GUI applications. Combining this into our two existing COGs could save burning another COG for a mouse pointer, so it's well worth it - I'd see it almost as a must. I don't quite know how many instructions it will need yet, but it will need to identify the appropriate graphics image for the scanline, rotate/mask accordingly and write over the pixel buffer before it is converted and written back to hub. A 32x32 pixel mouse pointer would need 4 or 5 longs in the (nibble) pixel buffer adjusted with the image overlaid. There should be some nice new instructions that can be put to good use here. This work needs to happen before the conversion using your 4b-10B long table stuff, which would otherwise complicate things further and require even more adjustments.

    Some sort of cursor or pointer is a must and I'm sure it could be implemented. If the bursts of TMDS data written to the line buffer lag behind the streamer rather than ahead, then it should be possible for either the line buffer cog or streamer cog to overlay the cursor on top of the relevant pixels and for the line buffer cog to write the cursor+pixel data to the line buffer after all the other TMDS data.

    The code for overlaying a transparent cursor or sprite is similar to that for a text character:
    ' Foreground pixel is non-transparent when 0000
    	movbyts	pixels,#0	' pixels = abcdefgh_abcdefgh_abcdefgh_abcdefgh
    	mergeb	pixels		' pixels = aaaabbbb_ccccdddd_eeeeffff_gggghhhh
    	setq	pixels		' set foreground mask
    	alts    bg, #fgbuf
    	mov	data,0-0	' get background colour into all eight pixels
    	alts    fg, #fgbuf
    	muxq	data,0-0	' overlay foreground where mask nibbles are set
    
    ' Foreground pixel is transparent when 0000
    	movbyts	pixels,#0	' replicate cursor mask byte four times
    	mergeb	pixels		' expand each mask bit to a nibble
    	alts    bg, #fgbuf
    	mov	data,0-0	' get background colour into all eight pixels
    	alts    fg, #fgbuf
    	and	pixels,0-0	' keep foreground colour only where mask nibbles are set
    	muxnibs	data,pixels	' copy non-zero foreground nibbles over the background
    

    fgbuf is misnamed now.
    rogloh wrote: »
    The "streamer" COG should certainly do any of the text attribute stuff desired like flashing foreground/background etc. I think there will be definitely be enough cycles left over for that with all the optimisations that were found there so far.

    Once the streamer has started, the streamer cog has nothing really to do in graphics mode and it could handle some sprites.
  • Cluso99 Posts: 18,069
    I presume there is available cog space? If so, a decoder and jump to separate instruction groups could be faster.
    There is another instruction that helps here (just cannot recall it atm). Alternatively, use a jump table.
  • rogloh Posts: 5,787
    edited 2018-11-04 02:55
    TonyB_ wrote: »
    I don't understand the need for the additional fixed fonts you mentioned, however. All the fonts will be soft, yet fixed if they aren't changed.

    Yeah, having this dynamic/soft font read in per scanline is very generic and could pretty much do everything. The idea of a fixed font sort of evolved from the original state of things before that idea was raised and is probably redundant now. The only minor benefit I can see is that it could reuse 2kB of hub memory and keep half a VGA character font fixed in COG+LUT RAM, and it avoids character set corruption, meaning ASCII-range text output is always possible no matter what state the dynamic font is in; but that itself can be quite restrictive as well. P2 has a lot more memory now than P1, so saving/reclaiming an extra 2kB is much less of an issue really. Everything fully dynamic is probably fine; in any case the possibility remains if ever needed.
    Some sort of cursor or pointer is a must and I'm sure it could be implemented. If the bursts of TMDS data written to the line buffer lag behind the streamer rather than ahead, then it should be possible for either the line buffer cog or streamer cog to overlay the cursor on top of the relevant pixels and for the line buffer cog to write the cursor+pixel data to the line buffer after all the other TMDS data.

    Well, the processing, pipelining and skew need to be considered, with lag/lead model choices. I definitely think it would be good to avoid double buffering if possible: there are 2000 longs in a scanline buffer in hub RAM, which is 8kB or so, and you'd need 16kB for two lines; that's too much. Single buffering is probably still doable with some tight synchronization between the two COGs. COGATN is probably our friend there, as the streamer COG can signal the line buffer COG reliably to begin generating data on a specific clock, and the PASM can be made mostly deterministic outside of any timing jitter introduced by the egg-beater or other content-related execution timing differences.
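
    A minimal sketch of that COGATN handshake (which cog carries which number is a hypothetical assumption here):
    ' streamer COG, at the agreed point in each scanline:
    	cogatn	#%0000_0100		' strobe attention to the line buffer COG (cog 2 here)
    ' line buffer COG, before starting its burst writes:
    	waitatn				' block until the streamer signals
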

    Also, it would be great to have something workable for later, so that some future "HDMI data specific" COG could also write into this same shared line buffer in its blanking regions, possibly supporting audio and other data island stuff. I would presume that might be achievable down the track too, even via bit-bang.
  • Actually, thinking further about the fonts: VGA had this concept of alternate font selection, by optionally using the foreground intensity bit (bit 3) to select the font. So on a P2 you could have something similar, where one of the attribute bits could optionally be assigned to select the character from either the dynamic soft font or a fixed "system" font held in COG+LUT RAM (which would be scanline limited), or just read in alongside the other dynamic font. This buys you two fonts on the screen at once if you sacrifice high-intensity foreground use. It doesn't have to be implemented, but all sorts of cool things are possible with text modes. There's possibly the cycle budget for that too, and definitely enough LUT RAM for two dynamic fonts to be loaded, as sketched below.
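    A minimal sketch of that font-bank select, reusing the intensity bit to offset the LUT font base (register/label names hypothetical, following the ALTS/RDLUT lookup pattern discussed in this thread):
    	mov	fbase, #font_row	' LUT address of the default font bank
    	test	fg, #%1000	wz	' Z=0 if foreground intensity bit set
    if_nz	add	fbase, #64		' select the alternate bank 64 longs later in LUT
    if_nz	and	fg, #%0111		' ...sacrificing high-intensity foreground
    	alts	char, fbase		' next instruction's S field = fbase + char
    	rdlut	pixels, 0-0		' read the selected font row long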
    Cluso99 wrote: »
    I presume there is available cog space? If so, a decoder and jump to separate instruction groups could be faster.
    There is another instruction that helps here (just cannot recall atm). Alternately use a jump table.

    I don't think we've checked everything we've discussed so far fits into COG RAM yet as such (we should look into it), first instinct tells me it would because the larger buffer use (outside of any static fonts) in COG RAM is not huge, and the loop unrolling for performance reasons actually doesn't have to be that much in the end and the streamer COG code is likely to be small too. In fact I'd be very surprised if it doesn't fit.
  • >Also, it would be great to have something workable for later, so that some future "HDMI data specific" COG could also write into this same shared line buffer.

    Without a second buffer, that either ends up being a race condition (changes happen after pixels are drawn to the display), or throughput is limited (blanking isn't all that much time). That can be enough for a cursor, or perhaps a pointer overlay, but it's all got to happen very early on, or it just won't be visible at the left side of the display consistently. Doing that generally takes away from the time the COGs prepping and/or streaming the line RAM have to do their work.

    Until I have a working setup, I have been reading here. Didn't someone say most of the data per scanline does not change? Perhaps just buffering changes, to be combined with the mostly static data, plays out better?

  • rogloh Posts: 5,787
    edited 2018-11-04 06:51
    @potatohead,
    I think I know what you mean about race conditions and throughput issues. You do have to time your hub writes at the suitable time and be deterministic enough to not get ahead of where the current streamer COG is actually reading from. Because all these pixel calculations are mostly time bounded I think this is achievable for the video data portion. However the mouse pointer / cursor overlay adjustment could also be done inline when you reach its pixels and not right at the end of the scanline. Yes that can make things a tight squeeze. Multiple sprites are trickier and potentially need double buffering. The thing is I think we can sort of achieve some double buffering internally in the shared LUT by passing the next 80 longs of pixel data or characters back and forth in the LUT with two separate in/out buffers. So we can be transforming text/gfx data one scanline ahead in the streamer COG and converting from 4b-10B using data results from the prior scanline in the line buffer COG. They are independent processes.

    The blanking part of the line buffer is the only part that would change if there was this "HDMI data specific" COG that writes into the data islands. Again that would need to be well synchronized and updated at the right time, however it doesn't need to act on every single line and there will be times when it could write out a computed sequence or fixed template and then give itself more processing time to generate the next data island some scanlines later. Right now I still think it is probably doable with a single buffer.

    I'm not saying it is easy though, it requires careful understanding of what is happening in each stage.
  • rogloh Posts: 5,787
    edited 2018-11-04 10:01
    I made a diagram showing timing issues for the single line buffer approach. If you assume the line buffer holding the 10B data begins being read at time t=0 and output by the streamer COG, then the line buffer COG should delay the writing of its output for the next scanline until some time x after this. The line buffer COG code also needs to make sure that it finishes writing to hub no earlier than t=y+25.6us (25.6us being the active video time), where y is non-zero, and x,y need to be made large enough to ensure any jitter during its burst writes does not cause any of its data to be put into the line buffer at an address not yet read by the streamer COG.

    The time left over at the end of the line buffer process until the end of the scanline is z. z+x then needs to be enough time to do all the remaining work outside of writing data to the line buffer itself. If these things are adhered to then there should not be a race condition, in my opinion.
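
    Schematically, per the diagram:
    	t=0 .............. streamer COG begins reading the line buffer
    	t=x .............. line buffer COG begins burst-writing the next line's data
    	t=y+25.6us ....... line buffer COG finishes its hub writes (y > 0)
    	then z ........... z+x covers all remaining non-write work before line end
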

    UPDATE: some numbers of interest below. Things are getting rather tight in the line buffer COG!

    For the critical loop in the line buffer COG that generates the 10B data per pixel and writes it back to hub, it would be taking the line buffer COG something like this amount of time to stream out data, assuming our 8 instruction timing for odd/even nibble processing:

    Unrolling the loop 4 times and processing these 4 copies of all the 8 pixels in each long from the LUT RAM, plus the rdlut and a rdlut address pointer increment overhead per long to read the required data back out from LUT RAM becomes 4 x (4 x 8 + 1 + 1.5) instructions actually executed (because rdlut takes 3 clocks, I treated it like 1.5 instructions). Total number of clocks needed is twice this or 276. This whole batch of work creates 80 longs of data needing to be written to the hub, representing 32 pixels worth of data and is also repeated 20 times in an outer loop for 640 pixels on the line.

    So the execution timing for each batch of 80 long writes looks like 276 clocks + 80 transfer clocks + (loop overhead + hub window delay). I'd expect the outer loop overhead to be small, say 3 instructions or 6 clocks (ideally less), and the hub window delay worst case should be below 16 clocks, bringing this number up to about 378 clocks for processing and transferring the 80 longs. Multiply this by our 20 repeats and you get 7560 clocks. This is a total of 30.24 us at 250MHz and exceeds the active video portion in my diagram which is good as that means it can't overtake the streamer COG if the work starts slightly after t=0. We then have 1.51us or 188 instructions left over to read in 80 longs/words of pixels/text characters, plus any dynamic font table(s) being 64 longs (each), plus any active line management logic overhead, hub burst latency, any per scanline data reads etc. I calculate about 84 instructions worth of time left over to do all that once these 208 longs of burst transfers are taken into account, even assuming 2 fonts are in use to be conservative. Not a lot of time left actually but should hopefully be sufficient for our needs if the mouse sprite computation is done in the streamer COG now. Being tight like this means we are making some pretty high use of the COG resources now (maybe by doing things the wrong way LOL).
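
    Summarising that arithmetic:
    	per 80-long batch : 276 + 80 + 6 + 16 = 378 clocks
    	per scanline (x20): 378 * 20 = 7560 clocks = 30.24us @ 250MHz
    	left over         : ~1.51us = ~188 instruction times (2 clocks each)
    	burst reads       : 80 + 64 + 64 = 208 longs ~= 104 instruction times
    	remaining budget  : ~84 instructions for everything else
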
    [Attached image: line buffer timing diagram, 888 x 486]
  • TonyB_ Posts: 2,178
    edited 2018-11-04 13:26
    rogloh wrote: »
    @potatohead,
    I think I know what you mean about race conditions and throughput issues. You do have to time your hub writes at the suitable time and be deterministic enough to not get ahead of where the current streamer COG is actually reading from. Because all these pixel calculations are mostly time bounded I think this is achievable for the video data portion. However the mouse pointer / cursor overlay adjustment could also be done inline when you reach its pixels and not right at the end of the scanline. Yes that can make things a tight squeeze. Multiple sprites are trickier and potentially need double buffering. The thing is I think we can sort of achieve some double buffering internally in the shared LUT by passing the next 80 longs of pixel data or characters back and forth in the LUT with two separate in/out buffers. So we can be transforming text/gfx data one scanline ahead in the streamer COG and converting from 4b-10B using data results from the prior scanline in the line buffer COG. They are independent processes.

    The streamer outputs TMDS bytes at a constant 250MBps, while the line buffer cog writes bursts of TMDS longs at 1000MBps. Even if the latter lagged the former at the start, it might still catch up so precise timing is vital. Double buffering in LUT RAM probably essential if there are sprites in graphics mode, otherwise maybe not as streamer cog can compose a long full of pixels faster than the line buffer cog can convert them to TMDS.
    rogloh wrote: »
    The blanking part of the line buffer is the only part that would change if there was this "HDMI data specific" COG that writes into the data islands. Again that would need to be well synchronized and updated at the right time, however it doesn't need to act on every single line and there will be times when it could write out a computed sequence or fixed template and then give itself more processing time to generate the next data island some scanlines later. Right now I still think it is probably doable with a single buffer.

    I'm not saying it is easy though, it requires careful understanding of what is happening in each stage.

    Single line buffer in hub RAM is feasible, I think. 8000 bytes/2000 longs for a whole line including blanking and hsync. The line buffer cog will need to add and remove the vsync and set pixels to black at end of last active display line if necessary. I'm wondering whether audio data islands could be entirely within vertical blanking but I haven't looked into that yet.

    Re tight timing, unrolling the loop as much as possible gives us the most time. My earlier timings were based on unrolling that was a little excessive perhaps.
  • "TonyB_ wrote:

    The streamer outputs TMDS bytes at a constant 250MBps, while the line buffer cog writes bursts of TMDS longs at 1000MBps. Even if the latter lagged the former at the start, it might still catch up so precise timing is vital. Double buffering in LUT RAM probably essential if there are sprites in graphics mode, otherwise maybe not as streamer cog can compose a long full of pixels faster than the line buffer cog can convert them to TMDS.

    Agree. Because we are producing data in bursts 4x faster than consuming it, we need to delay for at least 3/4 of the size of the burst initially before we begin our hub writes from the line buffer COG, or 60 longs (240 bytes) if the burst size written is 80 longs. We should make it slightly more than this for safety, and setting our delay "x" to 1 microsecond is about right and a nice round number of 250 bytes, just higher than 240. Because the average production rate gets spaced out and becomes slower than the consumption rate due to all our processing + overheads we don't have to worry about catching up after that. Other burst sizes would need different minimum delays.
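
    In numbers, for an 80-long (320 byte) burst:
    	streamer consumes it : 320 / 250MBps  = 1.28us
    	writer bursts it     : 320 / 1000MBps = 0.32us
    	minimum safe lead    : 3/4 * 1.28us = 0.96us (60 longs / 240 bytes)
    	chosen delay x       : 1.0us (streamer is 250 bytes ahead)
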
    Single line buffer in hub RAM is feasible, I think. 8000 bytes/2000 longs for a whole line including blanking and hsync. The line buffer cog will need to add and remove the vsync and set pixels to black at end of last active display line if necessary. I'm wondering whether audio data islands could be entirely within vertical blanking but I haven't looked into that yet.

    Not sure if it's possible to send the audio entirely in the vertical blanking region (though some other data formatting can be sent there). From memory, I think the minimum audio sample rate HDMI requires to be supported is 32kHz, and data islands containing audio are going to be required in most scanlines with the ~31.5kHz line frequency of VGA to keep jitter managed, unless the receiver has large buffers and can support very large blocks arriving in a row (some might, perhaps). IMHO you're almost guaranteed to need an additional COG for supporting the audio, unless the streamer COG can squeeze some of this work in too. But certainly some data encoding work could be done in the otherwise unused vertical blanking time if and where possible. I think AVI and audio InfoFrames need to be sent once a frame, for example.
  • TonyB_ Posts: 2,178
    edited 2018-11-05 13:43
    rogloh wrote: »
    Note: for oscillator selection P2 clocking at 250MHz (for HDMI) cannot be achieved with 12.288MHz, though it can be at 19.2MHz and 26MHz using the following divisor/multiplier ratios if these suit the P2 PLL range...

    250 = 19.2/48 * 625
    250 = 26/13 * 125, or 26/26 * 250, or 26/52 * 500

    Also for HDTV stuff 148.5 MHz happens to be 19.2/64 * 495 and 26 / 52 * 297

    We should be aiming for 25.175 * 10

    12.288 * 799/39 = 251.7464
    19.2 * 118/9 = 251.7333
    26.0 * 610/63 = 251.7460

    I can take requests! :)

    Addendum:
    12.288 * 568/47 = 148.5018
  • Question:

    Does or can SETQ+WRLONG modify PTRx automatically?
    	setq	#block-1	
    	wrlong	buffer,ptra
    

    ptra := ptra + 4 * block?
  • TonyB_ wrote: »
    Question:

    Does or can SETQ+WRLONG modify PTRx automatically?
    	setq	#block-1	
    	wrlong	buffer,ptra
    

    ptra := ptra + 4 * block?

    PTRA will remain unchanged.
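
    So the pointer has to be advanced manually after a block write, e.g.:
    	setq	#block-1	' next WRLONG writes 'block' longs
    	wrlong	buffer,ptra	' burst write; PTRA is left unchanged
    	add	ptra,#block*4	' advance by hand (immediate form assumes block*4 < 512)
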
  • ozpropdev wrote: »
    TonyB_ wrote: »
    Question:

    Does or can SETQ+WRLONG modify PTRx automatically?
    	setq	#block-1	
    	wrlong	buffer,ptra
    

    ptra := ptra + 4 * block?

    PTRA will remain unchanged.

    Thanks, ozpropdev.
  • TonyB_ Posts: 2,178
    edited 2018-11-05 03:21
    rogloh wrote: »
    For the critical loop in the line buffer COG that generates the 10B data per pixel and writes it back to hub, it would be taking the line buffer COG something like this amount of time to stream out data, assuming our 8 instruction timing for odd/even nibble processing:

    Unrolling the loop 4 times and processing these 4 copies of all the 8 pixels in each long from the LUT RAM, plus the rdlut and a rdlut address pointer increment overhead per long to read the required data back out from LUT RAM becomes 4 x (4 x 8 + 1 + 1.5) instructions actually executed (because rdlut takes 3 clocks, I treated it like 1.5 instructions). Total number of clocks needed is twice this or 276. This whole batch of work creates 80 longs of data needing to be written to the hub, representing 32 pixels worth of data and is also repeated 20 times in an outer loop for 640 pixels on the line.

    So the execution timing for each batch of 80 long writes looks like 276 clocks + 80 transfer clocks + (loop overhead + hub window delay). I'd expect the outer loop overhead to be small, say 3 instructions or 6 clocks (ideally less), and the hub window delay worst case should be below 16 clocks, bringing this number up to about 378 clocks for processing and transferring the 80 longs. Multiply this by our 20 repeats and you get 7560 clocks. This is a total of 30.24 us at 250MHz and exceeds the active video portion in my diagram which is good as that means it can't overtake the streamer COG if the work starts slightly after t=0. We then have 1.51us or 188 instructions left over to read in 80 longs/words of pixels/text characters, plus any dynamic font table(s) being 64 longs (each), plus any active line management logic overhead, hub burst latency, any per scanline data reads etc. I calculate about 84 instructions worth of time left over to do all that once these 208 longs of burst transfers are taken into account, even assuming 2 fonts are in use to be conservative. Not a lot of time left actually but should hopefully be sufficient for our needs if the mouse sprite computation is done in the streamer COG now. Being tight like this means we are making some pretty high use of the COG resources now (maybe by doing things the wrong way LOL).

    Thanks for the calculations. I've done some of my own, as shown in the table below. I've assumed the worst-case WRLONG time of 10 cycles for the first long, so every block write takes 20 * loop unrolls + 9 cycles.
    Loop	Clock	  Cog RAM longs used
    unrolls	cycles	instr +	buffer = total
    
    1	8480	 34	 20	  54
    2	7800	 62	 40	 102
    4	7460	118	 80	 198
    5	7392	146	100	 246
    8	7290	230	160	 390
    10	7256	286	200	 486
    
    If loop unrolls = n,
    cycles = 7120 + 1360/n = (89n + 17) * 80/n
    instr  = 6 + 28n
    buffer = 20n
    

    I suggest an unrolling of at least 5. The number of instructions (instr) is based on the third version of your loop code, repeated below, which I think is the best as it uses the minimum of cog/LUT RAM.
    	getnib  even, data, #0
    	rdlut   temp, even
    	setword buf+1, temp, #1
    	rol     temp,#16
    	setword buf+2, temp, #0
    	getnib  odd, data, #1
    	rdlut   buf+4, odd
    

    even and odd could be the same register.
  • Which pixel nibble should be the leftmost on the screen in graphics mode, bits[31:28] or bits[3:0]?

    The CGA used the low nibble of the attribute byte in text mode for the foreground and bit 7 of the attribute was either a flashing bit or background intensity (default was the former). Flashing would give us a poor man's cursor, until it's done properly.
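
    For reference, the standard CGA text attribute byte layout:
    	bit  7   : flash (or background intensity, if flashing is disabled)
    	bits 6-4 : background colour
    	bits 3-0 : foreground colour (bit 3 = intensity)
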
  • @TonyB_, I'm not sure about that offhand, but whatever nibble endianness the PC followed, we should certainly try to emulate it, in case there are any images/screenshots etc. available from PCs that can be presented as they were originally, without needing conversion. The same goes for text formats with FG/BG nibble order, attribute flashing bits etc.

    Decent cursor handling should be doable in the streamer COG. Will just need to pass in cursor co-ordinates from hub memory once per frame and modify constructed line during pixel render process. Simple enough I'd expect.

    As for the unroll, yes, the more the better, so long as there is space in COG RAM for the further required features like SYNC line handling etc. I thought there might be some sweet spot if the block transfer length is some multiple of 8 or 16 longs, but that might not be the case. I'm not yet familiar enough with P2 hub transfer burst timing to understand all the intricacies there.
  • Rayman Posts: 14,646
    Wonder if USB-C would make for the ultimate connector for a P2 SBC...

    Was just looking at this monitor and it shows how you can connect a mobile device via USB-C to the monitor and also plug a keyboard and mouse into the monitor:
    https://www.samsung.com/us/computing/monitors/flat/27--h850-wqhd-pls-monitor-ls27h850qfnxgo/

    It looks to me like it's just plain HDMI on the superspeed pins:
    https://en.wikipedia.org/wiki/USB-C

    So, you get USB + HDMI + power in a single cable.
    Actually, it's not 100% clear if the monitor would provide power, but I think it might...
  • Rayman Posts: 14,646
    Also, you could use USB speakers.
    Maybe easier than HDMI audio?
    Or maybe it'd work either way?
  • Rayman wrote: »
    Wonder if USB-C would make for the ultimate connector for a P2 SBC...

    Was just looking at this monitor and it shows how you can connect a mobile device via USB-C to the monitor and also plug a keyboard and mouse into the monitor:
    https://www.samsung.com/us/computing/monitors/flat/27--h850-wqhd-pls-monitor-ls27h850qfnxgo/

    It looks to me like it's just plain HDMI on the superspeed pins:
    https://en.wikipedia.org/wiki/USB-C

    So, you get USB + HDMI + power in a single cable.
    Actually, it's not 100% clear if the monitor would provide power, but I think it might...

    Thanks for the info, Rayman. The USB-C connector can be plugged in both ways around, but this should not be a problem as the P2 HDMI pins could be reversed if necessary:
    http://forums.parallax.com/discussion/comment/1451318/#Comment_1451318
  • Rayman Posts: 14,646
    It also looks like the USB 2.0 part can work in "full speed" mode, meaning a completely native P2 connection to monitor+keyboard+mouse+speakers+power over one cable...
  • TonyB_ Posts: 2,178
    edited 2018-11-11 17:08
    Here is possible code for streamer cog:
    ' Streamer cog text mode (non-flashing, 16 colours foreground & background)
    ' Loop for converting text codes and attributes into pixel nibbles
    
    loop	rdlut   text,text_ptr		' read next two text chars from LUT
            add     text_ptr,#1		' increment text pointer
    '
    ' text[31:16,15:0] = {bg1[3:0],fg1[3:0],char1[7:0]},{bg0[3:0],fg0[3:0],char0[7:0]}
    '
    ' Even char
    	getnib  fg,text,#2		' get foreground colour
            getnib  bg,text,#3		' get background colour
    	getbyte	char,text,#0		' char = char code
    	ror	char,#2			' char[5:0] = index to font long, char[31:30] = byte# in long
    	alts	char,#font_row		' LUT pre-loaded with 256 char row bit patterns in 64 longs
    	rdlut	pixels,0-0		' read bit pattern long from LUT
    	shr	char,#27		' char = 0/8/16/24 for byte# 0/1/2/3
    	ror	pixels,char		' char bit pattern in pixels[7:0] = abcdefgh
    	movbyts	pixels,#0		' pixels = abcdefgh_abcdefgh_abcdefgh_abcdefgh
    	mergeb	pixels			' pixels = aaaabbbb_ccccdddd_eeeeffff_gggghhhh
    	setq	pixels			' set foreground mask
    	alts    bg,#bgbuf
    	mov	data,0-0		' get background into all pixels
    	alts    fg,#fgbuf
    	muxq	data,0-0		' overlay foreground onto background
            wrlut   data,pixel_ptr		' write eight pixel nibbles to LUT
            add     pixel_ptr,#1		' increment pixel pointer
    ' Odd char
    	getnib  fg,text,#6
            getnib  bg,text,#7
    	getbyte	char,text,#2
    	ror	char,#2
    	alts	char,#font_row
    	rdlut	pixels,0-0
    	shr	char,#27
    	ror	pixels,char
    	movbyts	pixels,#0
    	mergeb	pixels
    	setq	pixels
    	alts    bg,#bgbuf
    	mov	data,0-0
    	alts    fg,#fgbuf
    	muxq	data,0-0
            wrlut   data,pixel_ptr
            add     pixel_ptr,#1
    '
    ' Instructions in loop = 2+17+17 = 36
    ' Cycles in loop = 3*3 + 33*2 = 75
    '
    

    240 cycles in total, or three per character, could be saved by copying all the font rows from LUT to cog RAM first, so that only char codes and attributes are read from the LUT during the loop. Either way, the streamer can process pixels more than twice as fast as the line buffer cog, and it would make sense to have separate 80-long pixel buffers in the shared LUTs to allow the streamer cog to do other things.
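
    A minimal sketch of that LUT-to-cog font copy, as a REP loop with ALTD (register/label names hypothetical):
    	mov	ix, #0			' font row index
    	mov	lutaddr, #FONT_LUT	' LUT address of the first font row long
    	rep	@.done, #64		' copy 64 longs
    	altd	ix, #font_row		' next RDLUT's D field = font_row + ix
    	rdlut	0-0, lutaddr		' cog font_row[ix] := LUT[lutaddr]
    	add	lutaddr, #1
    	add	ix, #1
    .done
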
  • rogloh Posts: 5,787
    edited 2018-11-06 05:11
    @TonyB_, I'm not sure that the USB-C alternate mode HDMI pair reversal order is exactly what is going to be supported by a future spin of P2 HW supporting TMDS encoding directly. However, a bitbang HDMI method with its look-up table implementation can map any combination of the streamed 8 bits to pins, which is very flexible.

    Talking about flexible mappings with look-up tables and HDMI, I wonder if there is any chance of supporting an 8 bit indexed colour lookup mode, perhaps at a lower resolution like 320x240 if required for reducing the processing bandwidth by pixel doubling, using a 256-entry table in LUT RAM to look up the symbols by byte instead of by nibble? Then we might be able to have a nice 320x240x256 screen out of a palette of 4096 over HDMI with fancier colours (12bpp, 16 levels of each primary colour), or hopefully even the full VGA 640x480 res with a bit more effort somehow. This might become rather useful for games and other animation, particularly once the pixels are byte addressable, which would be easier/faster to write/scroll on single pixel boundaries when writing sprites into a hub RAM frame buffer from external COGs, as there is no need for nibble reordering etc. Not sure about the disparity issues with a larger number of uneven symbols for these colours being sent out over TMDS, but it might be worth considering. If we keep 256 longs free in the LUT for this purpose maybe it could work out.

    Right now LUT RAM usage is looking something like this, mostly assuming the ideas we've discussed above:

    1) 80 Longs in / out for text/bitmap transfer buffer (1)
    2) 80 Longs in / out for text/bitmap transfer buffer (2) - if double buffering required for multiple sprite overlay etc
    3) 64 longs as a dynamic font table for the scanline being processed next (only a single font supported in this case, unless double buffering is not required). You also mention above this might be copyable to COG RAM in the streamer COG for faster access; that method could free some LUT space, meaning we could share one of the 80-long buffers if the COGs are already well synchronized and are prepared for this transfer of font data in advance of the text/bitmap data.
    4) 16 longs for 4b-10B LUT (ideally at index address 0 for speed)
    5) I think we would want to keep maybe 16? longs free for other purposes, e.g. cursor position, mouse cursor pos/data, and other per-scanline signalling data like text/gfx mode etc.

    That would still leave 256 longs free for other things, such as a larger 256-entry lookup table for 256 colour palette translation perhaps? Or for transferring/holding some limited sprite data in other modes. However, this 256-entry lookup table probably wants to sit at address 0 in LUT RAM as well.

    Also, because the streamer COG has access to this LUT RAM as well, containing the palette translation to 10B format, there may be scope to have the lookup workload split over both COGs. So if the streamer COG is not doing a complex text mode for the scanline, it could do translation of some of the 320 or 640 pixels, something like that. The streamer COG could do the evens, the line buffer COG could do the odds, and the data is interleaved. This also may help limit the transfer needed in/out of the LUT RAM on the streamer COG. Bear in mind that a 640 byte-pixel mode means reading 160 longs now. Maybe 80 longs could be sent to LUT and the other 80 retained in the line buffer COG. LUT RAM usage will become more of an issue now, because each long of 4 pixels will create 6 longs of look-up table results, or 4 longs needing to be further split/expanded into 6. Output bandwidth to hub remains the same however, which is very good.

    Got to think more about this...

    Here's an idea for the byte indexing version of the LUT to build up the line buffer (8b to 10B conversion), using just the line buffer COG.
            rdlut   data,  rdlutaddr ' read next 4 pixels from LUT (or it could be from COGRAM)
            add     rdlutaddr, #1
    
            getbyte byte, data, #0     ' get even pixel
            rdlut   buffer+1,  byte    ' read even pixel LUT value
            getbyte byte, data, #1     ' get odd pixel
            rdlut   buffer+4,  byte    ' read odd pixel LUT value
            rol     buffer+1,  #16     ' rotate into position
            movword buffer+2,  buffer+1, #1 ' copy LSW
            movword buffer+1,  fixup, #0   ' and patch up LSW with fix
            
            getbyte byte, data, #2     ' get even pixel
            rdlut   buffer+6,  byte    ' read even pixel LUT value
            getbyte byte, data, #3     ' get odd pixel
            rdlut   buffer+9,  byte    ' read odd pixel LUT value
            rol     buffer+6,  #16     ' rotate into position
            movword buffer+7,  buffer+6, #1 ' copy LSW
            movword buffer+6,  fixup, #0   ' and patch up LSW with fix
    

    This work seems to take 16 instructions (37 clocks if reading its data from LUT) to do 4 pixels, and reading from COG RAM may be faster. The problem is that we need to do 160 of these loops, not 80 like before. So we might have to do half the width (320 pixels) and replicate pixels somewhere, or double the scanline height, or else involve the other COG to help with the workload (which might be possible).

    UPDATE: maybe the 37 clocks above for 4 pixels is OK, because 160 loops x 37 clocks is still only 23.68us @ 250MHz. This should leave time for the remaining burst reads/writes in the line buffer COG. Somehow I thought we would have to do 2x the work for 8 bit mode compared to nibble mode, but each iteration of nibble mode does 8 pixels in 32 instructions, so it's comparable. I had confused myself a bit there and need to get these numbers right!

    UPDATE2: If the line buffer COG knows the scanline being processed is 8b indexed graphics, it won't have to read in 64 or more longs for dynamic font(s), so that helps buy more time to read the additional 80 longs of pixel data (160 instead of 80 for nibble mode, and 80 words for text mode). I think this might be doable at 640x480, if the TMDS side of things works out with more colour levels. Mouse pointer work may be trickier though...
  • rogloh wrote: »
    @TonyB_, I'm not sure that the USB-C alternate mode HDMI pair reversal order is exactly what is going to be supported by a future spin of P2 HW supporting TMDS encoding directly.

    USB-C is getting fairly important these days. P2 hardware should definitely support it (by way of adding that functionality if it doesn't exist yet; since there is already a full reverse, whatever it takes to support USB-C should not add too much logic, I think?).
  • Is the lane pair reversal meant to happen at the HDMI source transmitter end outputting onto a USB-C cable, or at the receiver on the other end of the USB-C cable, I wonder? That's probably important to understand first, and there's very likely some protocol meant to negotiate all this. If the general idea of using USB-C connectors can actually be supported at some point, that could be quite good though. I expect it may be rather more involved, and may drag in a bunch of protocols and possibly require USB 2.0 rates for signalling etc. Need to look at the specs.
  • william chan Posts: 1,326
    edited 2018-11-06 08:19
    Leon was hit and severely crippled by a car a few years ago. Last update I can find about him was him being moved into hospice care of some sort. (Details are vague.)

    Should somebody visit Leon to make sure he is OK and pass him a notebook computer so he can view this forum?
    I am sure he will want to keep track of P2's imminent release.
  • rogloh wrote: »
    Talking about flexible mappings with look-up tables and HDMI, I wonder if there is any chance of supporting an 8 bit indexed colour lookup mode, perhaps at a lower resolution like 320x240 if required for reducing the processing bandwidth by pixel doubling, using a 256-entry table in LUT RAM to look up the symbols by byte instead of by nibble? Then we might be able to have a nice 320x240x256 screen out of a palette of 4096 over HDMI with fancier colours (12bpp, 16 levels of each primary colour), or hopefully even the full VGA 640x480 res with a bit more effort somehow. This might become rather useful for games and other animation, particularly once the pixels are byte addressable, which would be easier/faster to write/scroll on single pixel boundaries when writing sprites into a hub RAM frame buffer from external COGs, as there is no need for nibble reordering etc. Not sure about the disparity issues with a larger number of uneven symbols for these colours being sent out over TMDS, but it might be worth considering. If we keep 256 longs free in the LUT for this purpose maybe it could work out.

    Some quick thoughts ...

    4bpp/colour means four of the five longs for a pixel pair would have to be modified, compared to three now, assuming the low four bits of the TMDS codes remain static:

    2bpp/colour:
    | R | G | B |CLK| R | G | B |CLK| R | G | B |CLK| R | G | B |CLK|
    |+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|
    |               |               |               |               | pixel[bits]
     0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0   even[3:0]
     * * * * * * 0 1 * * * * * * 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0   even[7:4]
     0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0/* * * * * * 0 1 * * * * * * 0 1    odd[1:0]/even[9:8]
     1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0    odd[5:2]
     * * * * * * 0 1 * * * * * * 0 1 * * * * * * 0 1 * * * * * * 0 1    odd[9:6]
    

    4bpp/colour:
    | R | G | B |CLK| R | G | B |CLK| R | G | B |CLK| R | G | B |CLK|
    |+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|
    |               |               |               |               | pixel[bits]
     0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0   even[3:0]
     * * * * * * 0 1 * * * * * * 0 1 * * * * * * 0 1 * * * * * * 1 0   even[7:4]
     0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0/* * * * * * 0 1 * * * * * * 0 1    odd[1:0]/even[9:8]
     * * * * * * 0 1 * * * * * * 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0    odd[5:2]
     * * * * * * 0 1 * * * * * * 0 1 * * * * * * 0 1 * * * * * * 0 1    odd[9:6]
    

    This means 80% of the pixel longs would have to be modified, which could be done if the two cogs do half each. The biggest disparity now is between the cog workload: line buffer > 90%, streamer < 40% for text mode (and much less for graphics without sprites).
  • I've been looking at 16 levels for each colour and the TMDS codes for the following byte values are all xx0, i.e. low nibble is zero:
    00-01
    10
    1E-21
    2E-31
    3E-41
    4E-51
    5E-61
    6E-71
    7E-81
    8E-91
    9E-A1
    AE-B1
    BE-C1
    CE-D1
    DE-E1
    EF
    FE-FF
    

    There are 17 groups and I suggest the following 16 from those above:
    10,20,30,40,50,60,70,80,8F,9F,AF,BF,CF,DF,EF,FF?
    

    I think we should use 10 for black as it is balanced output, in order to meet the spec. A similar argument could apply to white in which case EF could be the maximum level and FF discarded, which would allow for 15 shades plus transparent. It's a bit of a toss-up between 7F and 80 in terms of step size, but the latter has fewer transitions.
  • OK, the 12bpp colour is a bit of a challenge, with the four long updates required in the line buffer per odd/even pixel pair and a 480 line resolution. We would probably need to find a way to have 384 longs of actual LUT data for the 8b to 10B translation distributed over two COGs, and essentially do two table lookups per pixel. Maybe 320x240 is the limit, or even 640x240 with line doubling & double buffering in hub RAM.

    There may still be some way to do this, but this particular mode is quite a bit more challenging. Still, it would be very cool if doable, even if only supporting a single mouse pointer sprite for the GUI in our COG pair, with any real sprites just rendered into the frame buffer before any processing by the COG pair. Am going to think some more about it...
  • TonyB_ Posts: 2,178
    edited 2018-11-07 16:18
    I think 640x480x8bpp with 256 of 4096 colour palette might work. Firstly, the two cogs need to share the workload more evenly and to avoid confusion let's call them A and B. Each cog has a 256 long palette written to LUT RAM then copied to cog RAM but they are not the same: cog A has TMDS bits[7:4] and cog B has TMDS bits[9:6].

    Both cogs read longs of video data from a buffer in LUT RAM updated every line and use each byte in turn as a palette index. The blocks of five longs containing two TMDS pixels to be written to hub RAM are also stored in LUT RAM. Cog A writes longs 1 and 3* whereas cog B writes longs 2* and 4, as shown below. The asterisk means a little data manipulation is required. Therefore each cog needs to write only two longs and they do not clash. Long 0 never changes as the low nibble of the TMDS codes is always identical.
    | R | G | B |CLK| R | G | B |CLK| R | G | B |CLK| R | G | B |CLK|
    |+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|+ -|
    |               |               |               |               | pixel[bits]		Cog
     0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0   even[3:0]
     * * * * * * 0 1 * * * * * * 0 1 * * * * * * 0 1 * * * * * * 1 0   even[7:4]		A
     0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0/* * * * * * 0 1 * * * * * * 0 1    odd[1:0]/even[9:8]	B
     * * * * * * 0 1 * * * * * * 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0    odd[5:2]		A
     * * * * * * 0 1 * * * * * * 0 1 * * * * * * 0 1 * * * * * * 0 1    odd[9:6]		B
    

    EDIT:
    The palettes are modified only during vertical blanking. There are no character codes and attributes nor font rows to read as this is a graphics mode.
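
    A minimal sketch of cog A's share of one five-long group in LUT RAM (grp holds the LUT address of long 0; names hypothetical, and cog B would write longs 2 and 4 the same way):
    	add	grp, #1
    	wrlut	tmdsA1, grp		' long 1: even[7:4]
    	add	grp, #2
    	wrlut	tmdsA3, grp		' long 3: odd[5:2]
    	add	grp, #2			' on to long 0 of the next group
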
  • rogloh Posts: 5,787
    edited 2018-11-07 03:39
    Yeah, I was thinking along the same lines, TonyB_, when mulling over it. There sort of needs to be two different tables used by the two COGs, probably with some of it held in their local COG RAM (at least on one side), and this could get transferred across via the LUT at the start of frame once this particular graphics mode is determined. We would need to transfer some of the lookup results back from the streamer COG to the line buffer COG via the LUT too. I was thinking one table is for the lower 32 bits of the 48 bits and the other table is for the top 16 bits, or it could be split differently, like you had where some result bits are overlapped, and that could help equalise the processing too perhaps. The line buffer COG would combine its own results with whatever results come back from the streamer COG before writing to hub. In fact it might be best to just write all the shared results into LUT RAM before streaming out to the hub in one batch; hub writes are not as expensive as the reads, which is nice.

    It's good to find the most balanced use of processing resources and COG+LUT RAM allocation that fits and is not incompatible with other selectable mode requirements for other frames, as this avoids restarting COGs with different mode-specific code, causing loss of HDMI sync for a bit each time the mode changes (not what we want). The trick here is likely squeezing in the mouse cursor work somewhere suitable, and one time to do it could be once the original 160 source longs have been read in from hub, just before any translation work starts in both COGs. A 32 pixel wide mouse graphics overlay just needs to know the colours in the original 8-9 longs around the mouse's X position on the scanline, and to mask/rewrite this data before the main table processing begins. These 9 longs could be duplicated directly in COG RAM from the hub with a second burst to COG RAM after/before the LUT RAM burst; then this pixel data gets modified, re-written to the LUT, and can be accessed by both COGs. Yes, that's the way to do it.

    This is really good, and I think it might actually be doable in two COGs now, though I still need to count cycles to know for sure. Having this mode in addition to the others will be rather nice, and even though it burns up to 300kB of total hub RAM for the frame buffer, it will open up other applications that need a bit more HDMI colour until RevB silicon happens.