HDMI added to Prop2


Comments

  • rogloh wrote: »
    ps. TonyB_, yes it should be doable. I think there is a risk that it might blow to 9 instructions per pixel pair in the critical loop if evennibble needs to be read twice but even then it should (just) still fit in. Unrolling further buys a bit more.

    rogloh, thanks for the text mode explanation. I think we should start off with graphics mode. Who will be the first person to get this working? :)

    Chip, a couple of questions, if you have time:

    1. Can you confirm that ALTD D,{#}S and similar instructions leave D unchanged?

    2. If data are streamed from hub RAM to LUT RAM in cog 0 and LUT sharing is enabled in cog 1, will cog 1's LUT also receive the streamed data? Assuming cog 1 is not streaming using its LUT.
    Formerly known as TonyB
  • rogloh Posts: 1,012
    edited 2018-11-03 - 02:48:55
    I think the terminology of streaming may be getting confusing here. The "streamer" in the COG from what I understand at least streams hub data to/from PINs, potentially using the LUT as an intermediary to translate the data in LUT and DDS modes. But block transfers are also possible on a long per clock basis to/from HUB and COG or LUT memory. In our HDMI example, only the HDMI COG is streaming byte data to pins from hub at 250MHz, the other transfers are just block transfers. And thankfully because the streamer is not using the LUT modes, transfers within a shared LUT RAM when enabled should be possible between a COG pair and they won't need to interrupt the streamer because they don't have to involve the hub.

  • TonyB_ wrote: »
    rogloh, thanks for the text mode explanation. I think we should start off with graphics mode. Who will be the first person to get this working? :)

    Chip, a couple of questions, if you have time:

    1. Can you confirm that ALTD D,{#}S and similar instructions leave D unchanged?

    2. If data are streamed from hub RAM to LUT RAM in cog 0 and LUT sharing is enabled in cog 1, will cog 1's LUT also receive the streamed data? Assuming cog 1 is not streaming using its LUT.

    1) In the ALTx instructions, S[17:9] gets added into D. If those bits are zero, D stays the same.

    2) It looks like all software-controlled LUT writes get sent to the adjacent cog, in case it has LUT sharing enabled. So, SETQ2+WRLONG should write to the adjacent cog, as well.
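To make answer 1) concrete, here is a minimal Python model of the ALTS side effect. It is only a sketch: the function name `alts` and the example address `0x040` are hypothetical, and treating S[17:9] as a signed 9-bit step is an assumption that goes beyond what is stated above.

```python
MASK32 = 0xFFFFFFFF

def alts(d, s):
    """Model ALTS D,S: return (new D, S field patched into the next instruction)."""
    next_s_field = (d + (s & 0x1FF)) & 0x1FF   # 9-bit register address
    step = (s >> 9) & 0x1FF                    # S[17:9] is what gets added into D
    if step & 0x100:                           # assumption: step is signed
        step -= 0x200
    return (d + step) & MASK32, next_s_field

# Immediate form, e.g. "alts even, #table1": a 9-bit immediate has S[17:9] = 0.
table1 = 0x040                                 # hypothetical cog address
new_d, field = alts(5, table1)
print(new_d, hex(field))
```

With a plain 9-bit immediate such as `#table1`, S[17:9] is zero, so D is left alone; only a register S with nonzero upper bits would change D.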
  • rogloh Posts: 1,012
    edited 2018-11-03 - 03:42:33
    cgracey wrote: »
    1) In the ALTx instructions, S[17:9] gets added into D. If those bits are zero, D stays the same.

    2) It looks like all software-controlled LUT writes get sent to the adjacent cog, in case it has LUT sharing enabled. So, SETQ2+WRLONG should write to the adjacent cog, as well.

    Okay, so for 1) we will need the additional instruction in the critical loop as we can't guarantee those bits will be 0. This is still okay budget wise, but I will look for any other instruction optimizations to see what can be done. 8 instructions per pixel pair is a bit nicer than 9 as it is a good hub window multiple.

    EDIT: second thoughts, actually being a hub window multiple is not all that important, as there will be other setup overhead that needs to be included that likely takes us off a hub window anyway. But I'll still look to what can be done if anything.
  • rogloh Posts: 1,012
    edited 2018-11-03 - 05:13:37
    Think I managed to get it back down to 8 instructions per pixel pair in the critical loop path, and another bonus is that this version only needs two look up tables for translation now, saving 16 longs. :smile:
    ' table1 is 16 longs holding even nibble translation data
    ' table2 is 16 longs holding odd nibble translation data
    ' data is 32 bit long source reg holding next 8 pixels of bitmap (4bpp nibble based CGA like colours)
    ' buf is a temporary result buffer holding both some precomputed longs and also partially updated
    
            getnib  even, data, #0   ' extract even nibble from bitmap data
            alts    even, #table1    ' compute look up table 1 address
            mov     buf+1, 0-0       ' update long in buffer at long offset 1, trashes LS word temporarily
            setword buf+2, buf+1, #0 ' patch LS word in buffer at long offset 2
            setword buf+1, fixup, #0 ' fixup trashed LS word at buffer at long offset 1
    
            getnib  odd, data, #1    ' extract odd nibble from bitmap data
            alts    odd, #table2     ' compute lookup table 2 address
            mov     buf+4, 0-0       ' update long in buffer at long offset 4
    
            ' repeat for other pixels in data and further unrolled copies
            ...
    fixup   long  ... ' constant value for fixing buf[1] etc
    
    table1  long[16] ' 16 long fixed table for even pixels
            ...
    table2  long[16] ' 16 long fixed table for odd pixels
            ...
    
    buf     long[80] ' temp buffer of adequate block size for unrolled loop above, containing some preallocated data
            ..
    


    UPDATE: Just realised there might be a further optimisation possible if you can use a single 16 entry table in LUT RAM at offset 0 for the translation data, this one takes only 7 instructions per pixel pair and saves yet another 16 COG longs, as only one table needed now.
    getnib  even, data, #0
    rdlut   temp, even
    setword buf+1, temp, #1
    rol     temp,#16
    setword buf+2, temp, #0
    getnib  odd, data, #1
    rdlut   buf+4, odd
    

    This is good because it buys you a lot of spare time and a bit more COG space, in fact 2.56us extra per line which could be used for things like mouse pointer overlays for GUI, or flashing text cursor type calculations etc. At 250MHz, 2.56us is 320 extra instructions per scanline that can be executed, that is quite a bit.
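The 2.56us and 320-instruction figures can be reproduced with a short calculation. This is a sketch under stated assumptions: a 640-pixel active line and 2 clocks for every instruction, including RDLUT (a later reply in this thread says RDLUT actually takes 3).

```python
SYSCLK_HZ = 250_000_000      # clock rate quoted in the post
CLOCKS_PER_INSTR = 2         # assumption: every instruction costs 2 clocks
PIXELS_PER_LINE = 640        # assumption: 640-pixel active line

pixel_pairs = PIXELS_PER_LINE // 2
# Going from 8 to 7 instructions saves one instruction per pixel pair.
clocks_saved = pixel_pairs * (8 - 7) * CLOCKS_PER_INSTR
time_saved_us = clocks_saved / SYSCLK_HZ * 1e6
spare_instructions = clocks_saved // CLOCKS_PER_INSTR

print(f"{time_saved_us:.2f} us, {spare_instructions} instructions")
```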
  • For ALTD and ALTS, the source D is not changed. Only the following instruction D or S operand is modified.

    ALTI can inc/dec the content in source D.

    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • rogloh Posts: 1,012
    edited 2018-11-03 - 06:22:31
    evanh wrote: »
    For ALTD and ALTS, the source D is not changed. Only the following instruction D or S operand is modified.

    ALTI can inc/dec the content in source D.

    I had assumed that too, but Chip seems to indicate otherwise above for certain cases. In any case we have some decent workarounds to deal with it now.
    EDIT: in our case s[17:9] could be zero so yes it could be left unchanged.
  • rogloh wrote: »
    I had assumed that too, but Chip seems to indicate otherwise above for certain cases. In any case we have some decent workarounds to deal with it now.
    EDIT: in our case s[17:9] could be zero so yes it could be left unchanged.
    That's right Roger.
    This is not in the docs.
    In the ALTx instructions, S[17:9] gets added into D. If those bits are zero, D stays the same.
    Melbourne, Australia
  • evanh Posts: 6,845
    edited 2018-11-03 - 06:56:24
    Oh, I'm not sure now. Maybe I only ever used small numbers for S.
  • ozpropdev wrote: »
    That's right Roger.
    This is not in the docs.
    In the ALTx instructions, S[17:9] gets added into D. If those bits are zero, D stays the same.

    Whoops. I just added it to the docs.

    The instruction spreadsheet covered this functionality, though.
  • rogloh Posts: 1,012
    edited 2018-11-03 - 07:00:53
    Ok, seems some things are going to be partially undocumented, early on at least, and we may have to figure things out as we go. New discoveries are yet to be made with P2! LOL.
  • I had definitely missed that. Here's a snippet from my integer to decimal print routine:
    		add     bcdi, #1
    		altsn   bcdi, #bcdstr         '(instruction prefix)
    		setnib  parm                  'file the decimal digit as BCD:  bcdstr[bcdi] = parm
    
  • That's what Rev A is for. Clean up, final polish pass.



    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: https://forums.parallax.com/discussion/123709/commented-graphics-demo-spin
  • BTW: Chip, the nibble indexing is very cool for BCD work!
  • rogloh Posts: 1,012
    edited 2018-11-03 - 07:53:19
    evanh wrote: »
    BTW: Chip, the nibble indexing is very cool for BCD work!

    Yeah, it would be great for BCD, and nibble indexing is awesome for pixel fonts too in 4bpp mode. I just coded up a sample loop for a text mode font lookup on the HDMI streamer COG and found each 8 bit character can be done in 35 instructions with foreground and background colours applied per pixel. This is good because it is of the same order of time as the work needed to do the translation to 10bit format for HDMI (28 or 32 instructions for 8 pixels), so it won't have to hold up the processing much and can fit in the budget even without optimizations.

            rdlut   char, rdlutaddr  ' read LUT to get 16 bit word in [FG:BG:char] format
            add     rdlutaddr, #1 ' increment source address
    
            getnib  fg, char, #3  ' extract fg colour nibble
            getnib  bg, char, #2  ' extract bg colour nibble
            and     char, #$7f    ' mask off unwanted parts
    if_c    add     char, #128    ' c is already set outside the loop if we are on high scanlines of row (4-7); no wc here, so c survives for the next character
            alts    char, #font   ' determine font table location
            mov     pixels, 0-0   ' pixels contains 8 bit pixel data
            rol     pixels, scan  ' compensate for scanline (scan is maintained as scanline * 8 outside loop)
    
            testb   pixels, #0 wz ' z = bit 0 of pixels (testb tests a single bit; test would AND with the immediate)
    if_z    setnib  data, fg, #0  ' pixel lit: foreground colour
    if_nz   setnib  data, bg, #0  ' pixel clear: background colour
            testb   pixels, #1 wz
    if_z    setnib  data, fg, #1
    if_nz   setnib  data, bg, #1
            testb   pixels, #2 wz
    if_z    setnib  data, fg, #2
    if_nz   setnib  data, bg, #2
            testb   pixels, #3 wz
    if_z    setnib  data, fg, #3
    if_nz   setnib  data, bg, #3
            testb   pixels, #4 wz
    if_z    setnib  data, fg, #4
    if_nz   setnib  data, bg, #4
            testb   pixels, #5 wz
    if_z    setnib  data, fg, #5
    if_nz   setnib  data, bg, #5
            testb   pixels, #6 wz
    if_z    setnib  data, fg, #6
    if_nz   setnib  data, bg, #6
            testb   pixels, #7 wz
    if_z    setnib  data, fg, #7
    if_nz   setnib  data, bg, #7
    
            wrlut   data, wrlutaddr ' write back long to shared LUT
            add     wrlutaddr,#1
    
    
    
  • That looks like a DECOD and SKIP might work better here.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • Yeah, thanks Cluso I like this skip thing. It could get the unrolled part of my pixel loop down to 2 instructions per pixel. Would need two stages of 8 each and a complement of the font pixels in the middle plus a clearing of original accumulator beforehand, so from 24 (8*3) to perhaps 19 (16+3), making 30 total now instead of 35. This means my font loop code can potentially be made slightly faster than the 10bit format conversion work in the other COG and won't ever need to become a bottleneck as the resulting bitmap data will always be ready in time for the other COG to process, once the initial latency is factored in. Or it potentially buys a few extra cycles for some type of text flash attribute or inverse feature using the otherwise unused top bit of the character index to do something useful.
  • rogloh Posts: 1,012
    edited 2018-11-03 - 10:18:09
    Actually it's potentially better than that because I might be able to replicate some nibbles in the long accumulator first and then only update the other colour based on the lit pixels in the font. ie. I could assume same foreground colour for all pixels in the character scanline, using a LUT of FG colours that replicates same nibble into 8 nibbles of a long, then overwrite with background colour depending on skip test result. Have to be careful though with the SKIP as it works on 32 bits and I really only want 8 bits done so it may be problematic unless each new SKIP clobbers the existing running SKIP and I mask the upper 24 bits to 0 when I set things up.

    UPDATE: possible code with SKIP optimization, now 23 instructions per character and much faster than the HDMI 10 bit table stuff running in the other COG. This seems to be the way to go if it can work. Can also have some instructions left over to do something useful with the top bit of the character once we figure out the best thing for that.

    By the way, I realised that the font table doesn't have to be aligned in COGRAM anymore, so it might even be possible to have a font with more than 8 scanlines per character, like 10 or 12, giving 48 or 40 row modes too, assuming space for it exists. A 12 scanline font can look nicer than 8, so it's good to have that option now too. 12 scanlines needs another 128 COG RAM values, or we could have some of this font held in LUTRAM too. That might even allow 16 scanline fonts or the full 256 character range, which would be awesome!

    The transfer of pixel data between COGs only needs 80 longs in and out in the LUT, and it could even be shared in the same memory going in and coming out, so that leaves a lot of extra LUT space to hold additional font data/scanlines. The code below could be made smart enough to know when to read the font from COG vs LUT. One option is to have the bottom 128 characters as a fixed font, and the top 128 characters dynamically reprogrammable, read in per frame from hub RAM and stored in the LUT. This allows for cool animation stuff in text mode or simply two fonts selected by the high bit of the character. I hope the entire 512 addresses of LUTRAM are shared, but I don't really know, it might be half the range or something.
            rdlut   char, rdlutaddr  ' read LUT to get 16 bit word in [FG:BG:char] format
            add     rdlutaddr, #1 ' increment source address
    
            getnib  fg, char, #3  ' extract fg colour nibble
            getnib  bg, char, #2  ' extract bg colour nibble
            and     char, #$7f    ' mask off unwanted parts
    if_c    add     char, #128    ' c is already set outside the loop if we are on high scanlines of row (4-7)
            alts    char, #font   ' determine font table location
            mov     pixels, 0-0   ' pixels contains 8 bit pixel data
            rol     pixels, scan  ' compensate for scanline (scan is maintained as scanline * 8 outside loop)
            and     pixels, #$ff  ' ensure we don't continue to skip instructions beyond 8 bits (might be combinable with previous rol using getbyte instr. instead)
    
            alts    fg, #fgbuf    ' compute table address
            mov     data, 0-0     ' replicate same fg nibble 8 times into long from an 8 long table that does this, indexed by fg nibble
            skip    pixels
            setnib  data, bg, #0
            setnib  data, bg, #1
            setnib  data, bg, #2
            setnib  data, bg, #3
            setnib  data, bg, #4
            setnib  data, bg, #5
            setnib  data, bg, #6
            setnib  data, bg, #7
    
            wrlut   data, wrlutaddr ' write back long to shared LUT
            add     wrlutaddr,#1
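The effect of the SKIP sequence above can be checked with a small Python model. This is a sketch, not the driver itself: `render_with_skip` is a hypothetical name, and it assumes bit n of the skip mask cancels the n-th instruction after SKIP, so a lit font pixel leaves the preloaded foreground nibble in place.

```python
def render_with_skip(pixels, fg, bg):
    """Return the 8-nibble long produced for one character scanline."""
    data = fg * 0x11111111             # fg nibble replicated 8 times (the fgbuf lookup)
    skip_mask = pixels & 0xFF          # only 8 SETNIBs follow, so keep 8 bits
    for n in range(8):                 # the eight unrolled SETNIB instructions
        if (skip_mask >> n) & 1:
            continue                   # instruction skipped: nibble n stays fg
        data = (data & ~(0xF << (4 * n))) | (bg << (4 * n))
    return data & 0xFFFFFFFF

print(hex(render_with_skip(0b10000001, 0xA, 0x3)))
```

Masking `pixels` to 8 bits mirrors the `and pixels, #$ff` step, which keeps stray upper bits from cancelling instructions after the SETNIB run.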
    
  • Had another thought, I think it would be possible to just read in the font data dynamically for the scanline being rendered right into the LUT RAM alongside the character data from hub RAM. It's only another 64 longs worth for 256 characters which won't take long at 250Mlongs/sec read rates. This could allow arbitrary sized fonts to be supported as well or augmenting some fixed font available in COG RAM if needed. The font scanline packing and indexed addressing approach changes slightly but won't add that many more cycles to achieve.
  • If the SKIP is running in COG then SKIPF works faster!!! The instructions not executed have no clocks!!!

    One SKIP/SKIPF replaces another. Also, once the SKIP/SKIPF is zero it completes.
  • TonyB_ Posts: 1,178
    edited 2018-11-03 - 19:18:05
    rogloh wrote: »
            and     pixels, #$ff
            alts    fg, #fgbuf 
            mov     data, 0-0
            skip    pixels
            setnib  data, bg, #0
            setnib  data, bg, #1
            setnib  data, bg, #2
            setnib  data, bg, #3
            setnib  data, bg, #4
            setnib  data, bg, #5
            setnib  data, bg, #6
            setnib  data, bg, #7
    

    Skipping is not the best option in this case. A nibble mask can be created from the low byte in pixels using the following algorithm (based on ozpropdev's reply):
    			' d = xxxxxxxx xxxxxxxx xxxxxxxx abcdefgh
    	movbyts	d,#0	' d = abcdefgh abcdefgh abcdefgh abcdefgh
    	mergeb	d	' d = aaaabbbb ccccdddd eeeeffff gggghhhh
    

    and the code at the top can be replaced by
    	movbyts	pixels,#0
    	mergeb	pixels
    	setq	pixels
    	alts    bg, #fgbuf
    	mov	data,0-0
    	alts    fg, #fgbuf
    	muxq	data,0-0
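A Python model of the mask trick helps confirm the bit shuffling. It is a sketch under assumptions: the helper names are hypothetical, `mergeb` follows my reading that new bit 4*j+k comes from old bit 8*k+j (which matches the `aaaabbbb ...` comment above), and `muxq` copies S into D wherever the Q mask has a 1.

```python
def movbyts_0(d):
    """MOVBYTS D,#0: replicate byte 0 into all four byte lanes."""
    return (d & 0xFF) * 0x01010101

def mergeb(d):
    """MERGEB D: interleave the four bytes bit by bit."""
    out = 0
    for j in range(8):                 # bit position within each byte
        for k in range(4):             # source byte lane
            out |= ((d >> (8 * k + j)) & 1) << (4 * j + k)
    return out

def muxq(d, s, q):
    """MUXQ D,S: take S where Q is 1, keep D elsewhere."""
    return ((d & ~q) | (s & q)) & 0xFFFFFFFF

def render_with_muxq(pixels, fg, bg):
    mask = mergeb(movbyts_0(pixels))   # one nibble of 1s per lit pixel
    data = bg * 0x11111111             # bg nibble replicated (fgbuf[bg])
    return muxq(data, fg * 0x11111111, mask)

print(hex(mergeb(movbyts_0(0xA5))), hex(render_with_muxq(0b10000001, 0xA, 0x3)))
```

Lit pixels end up with the foreground nibble and clear pixels with the background nibble, without any per-pixel instruction.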
    
  • rogloh wrote: »
    Had another thought, I think it would be possible to just read in the font data dynamically for the scanline being rendered right into the LUT RAM alongside the character data from hub RAM. It's only another 64 longs worth for 256 characters which won't take long at 250Mlongs/sec read rates. This could allow arbitrary sized fonts to be supported as well or augmenting some fixed font available in COG RAM if needed. The font scanline packing and indexed addressing approach changes slightly but won't add that many more cycles to achieve.

    This is a good idea and the number of cycles might be the same or lower. Two characters could be packed into one long to save LUT RAM and block writing time.
  • rogloh Posts: 1,012
    edited 2018-11-03 - 21:22:49
    TonyB_ wrote: »
    and the code at the top can be replaced by
    	movbyts	pixels,#0
    	mergeb	pixels
    	setq	pixels
    	alts    bg, #fgbuf
    	mov	data,0-0
    	alts    fg, #fgbuf
    	muxq	data,0-0
    

    Wow, all these fancy new instructions to learn to play with! Another 5 instructions saved I think. I like it. P2 is cool.

    This is a good idea and the number of cycles might be the same or lower. Two characters could be packed into one long to save LUT RAM and block writing time.

    Yeah that way you'd only need 40 longs to hold the 80 character/colour words in the LUT and it would be 2x faster to transfer the text for processing by the other COG. I was originally thinking that the input/output buffer between COGs in the LUT could be shared, meaning there is more block RAM for extra font data. But it doesn't need to be shared.

    Right now for the LUTRAM usage, it could have 16 longs for a 10b translation table (7 instruction optimisation mentioned earlier), 40 or 80 longs for 80 characters input to the HDMI COG, either a shared 80 longs out, or another 80 long bitmap result buffer back, and the remaining free for other uses such as dynamic font scanlines for 256 characters. A single font table read into the LUT for the current scanline only takes 64 longs, so with everything combined in the maximum usage case with no shared input/output buffer etc I think we now have 240 longs used in the LUT and in the best case just 160. If 256 LUT RAM entries are left over after all this that provides enough space for enabling 16 scanline fixed fonts on the lower 128 characters with half this table stored in the HDMI COG RAM and the other half stored in the extra LUT RAM. We could have proper 16 scanline VGA text capability and dynamic programmable characters as well in the upper 128 character set range. That would be pretty cool.
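The LUT budget in the paragraph above can be tallied as follows. This is a sketch: the variable names are mine, and it assumes a 512-long LUT that is fully shared between the two cogs, which the thread has not yet confirmed.

```python
LUT_LONGS = 512                  # assumption: whole LUT is usable and shared

translation_table = 16           # 4-bit pixel translation table
text_in = 80                     # 80 character/colour words, one per long
bitmap_out = 80                  # 80 longs of 4bpp bitmap back to the other cog
font_scanline = 256 // 4         # 256 characters x 1 byte for the current scanline

worst_case = translation_table + text_in + bitmap_out + font_scanline
best_case = translation_table + 80 + font_scanline   # in/out share one 80-long buffer
left_over = LUT_LONGS - worst_case

# Half of a 16-scanline font for the lower 128 characters, stored in the LUT.
half_font_longs = (128 * 16 // 4) // 2

print(worst_case, best_case, left_over, half_font_longs)
```

Even in the worst case the space left over is at least as large as the half font table, which is what makes the 16 scanline option plausible.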
  • TonyB_ Posts: 1,178
    edited 2018-11-03 - 23:29:14
    rogloh wrote: »
    Think I managed to get it back down to 8 instructions per pixel pair in the critical loop path, and another bonus is that this version only needs two look up tables for translation now, saving 16 longs. :smile:
    ' table1 is 16 longs holding even nibble translation data
    ' table2 is 16 longs holding odd nibble translation data
    ' data is 32 bit long source reg holding next 8 pixels of bitmap (4bpp nibble based CGA like colours)
    ' buf is a temporary result buffer holding both some precomputed longs and also partially updated
    
            getnib  even, data, #0   ' extract even nibble from bitmap data
            alts    even, #table1    ' compute look up table 1 address
            mov     buf+1, 0-0       ' update long in buffer at long offset 1, trashes LS word temporarily
            setword buf+2, buf+1, #0 ' patch LS word in buffer at long offset 2
            setword buf+1, fixup, #0 ' fixup trashed LS word at buffer at long offset 1
    
            getnib  odd, data, #1    ' extract odd nibble from bitmap data
            alts    odd, #table2     ' compute lookup table 2 address
            mov     buf+4, 0-0       ' update long in buffer at long offset 4
    
            ' repeat for other pixels in data and further unrolled copies
            ...
    fixup   long  ... ' constant value for fixing buf[1] etc
    
    table1  long[16] ' 16 long fixed table for even pixels
            ...
    table2  long[16] ' 16 long fixed table for odd pixels
            ...
    
    buf     long[80] ' temp buffer of adequate block size for unrolled loop above, containing some preallocated data
            ..
    


    UPDATE: Just realised there might be a further optimisation possible if you can use a single 16 entry table in LUT RAM at offset 0 for the translation data, this one takes only 7 instructions per pixel pair and saves yet another 16 COG longs, as only one table needed now.
    getnib  even, data, #0
    rdlut   temp, even
    setword buf+1, temp, #1
    rol     temp,#16
    setword buf+2, temp, #0
    getnib  odd, data, #1
    rdlut   buf+4, odd
    

    This is good because it buys you a lot of spare time and a bit more COG space, in fact 2.56us extra per line which could be used for things like mouse pointer overlays for GUI, or flashing text cursor type calculations etc. At 250MHz, 2.56us is 320 extra instructions per scanline that can be executed, that is quite a bit.

    I think ALTx instructions are not a problem, but well done with finding an alternative that is no slower. RDLUT takes 3 cycles therefore bottom version no quicker than top one.
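The cycle comparison can be written out explicitly. A minimal sketch, assuming 2 clocks for every instruction other than RDLUT and 3 clocks for RDLUT as stated above; `loop_clocks` is a hypothetical helper.

```python
def loop_clocks(n_instr, n_rdlut, rdlut_clocks=3, other_clocks=2):
    """Clocks for one pixel pair given the instruction mix."""
    return (n_instr - n_rdlut) * other_clocks + n_rdlut * rdlut_clocks

cog_table_version = loop_clocks(8, 0)   # top version: ALTS + MOV from cog tables
lut_table_version = loop_clocks(7, 2)   # bottom version: two RDLUTs

print(cog_table_version, lut_table_version)
```

Both come to the same clock count per pixel pair, so the LUT version saves cog longs rather than time.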
  • rogloh wrote: »
    I think the terminology of streaming may be getting confusing here. The "streamer" in the COG from what I understand at least streams hub data to/from PINs, potentially using the LUT as an intermediary to translate the data in LUT and DDS modes. But block transfers are also possible on a long per clock basis to/from HUB and COG or LUT memory. In our HDMI example, only the HDMI COG is streaming byte data to pins from hub at 250MHz, the other transfers are just block transfers. And thankfully because the streamer is not using the LUT modes, transfers within a shared LUT RAM when enabled should be possible between a COG pair and they won't need to interrupt the streamer because they don't have to involve the hub.

    I'll use streamer more accurately from now on and use block reads or writes for the fast FIFO transfers. I find "HDMI cog" confusing. Can we just call the two cogs "line buffer" and "streamer"? They both deal with TMDS data that form a crucial part of DVI/HDMI.
  • TonyB_ wrote: »
    RDLUT takes 3 cycles therefore bottom version no quicker than top one.

    Ok, thanks for letting me know about RDLUT timing. Actually I didn't realise that. Seems WRLUT still takes 2 cycles at least.
  • cgracey wrote: »
    1) In the ALTx instructions, S[17:9] gets added into D. If those bits are zero, D stays the same.

    2) It looks like all software-controlled LUT writes get sent to the adjacent cog, in case it has LUT sharing enabled. So, SETQ2+WRLONG should write to the adjacent cog, as well.

    Thanks for the info, Chip, which means D stays the same in ALTx D,#S.
  • TonyB_ wrote: »
    Thanks for the info, Chip, which means D stays the same in ALTx D,#S.

    That is right.
  • TonyB_ wrote: »
    I'll use streamer more accurately from now on and use block reads or writes for the fast FIFO transfers. I find "HDMI cog" confusing. Can we just call the two cogs "line buffer" and "streamer"? They both deal with TMDS data that form a crucial part of DVI/HDMI.

    We can call them whatever we want, certainly those, or other names like render COG and display COG, although it can get confusing if this display COG is also effectively doing some text rendering work. Only your "line buffer" COG does the 4 bit to 10 byte table stuff though. Maybe a name that deals with that type of work is useful. I guess line buffer is fine for now, and I can try to use it too.

    In regards to your point about RDLUT taking 3 clocks instead of 2, I am hoping that, going back to the 8 instruction (16 clocks) per nibble pair approaches that have been found, we should still have enough time left over on the scanline for a couple of additional/optional things in this idea for our HDMI gfx/text driver:

    1) font table lookup - need to block read up to 256 bytes as 64 LONGs for the current scanline and populate LUT RAM, this would take an extra 0.256us for the raw transfers plus some address/setup overheads, or half the transfers if only the top character range is dynamic.

    2) transparent mouse cursor overlay - this feature would be very useful to support a decent 640x480x16 color mode with a mouse pointer for enabling traditional GUI applications. Combining this into our two existing COGs could save burning another COG for a mouse pointer, so well worth it - I'd see it almost as a must. I don't quite know how many instructions it will need yet, but it will need to identify the appropriate graphics image for the scanline, rotate/mask accordingly and write over the pixel buffer before it is converted and written back to hub. A 32x32 pixel sized mouse pointer would need 4 or 5 longs in the (nibble) pixel buffer adjusted with the image overlaid. There should be some nice new instructions that can be put to good use here. This work needs to happen before the conversion using your 4b-10B long table stuff, which would otherwise complicate things further and require even more adjustments.

    The "streamer" COG should certainly do any of the text attribute stuff desired like flashing foreground/background etc. I think there will definitely be enough cycles left over for that with all the optimisations that were found there so far.
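The figures in points 1) and 2) can be sanity-checked with a short calculation. A minimal sketch, assuming one long per clock for block reads at 250MHz and 8 nibble pixels per long; both helper names are hypothetical.

```python
def font_block_read(chars=256, sysclk_hz=250_000_000):
    """Longs and raw transfer time (us) for one scanline of font data."""
    longs = chars // 4                 # one byte per character per scanline
    return longs, longs / sysclk_hz * 1e6

def pointer_longs(x, width=32, pixels_per_long=8):
    """How many pixel-buffer longs a pointer starting at pixel x touches."""
    first = x // pixels_per_long
    last = (x + width - 1) // pixels_per_long
    return last - first + 1

longs, micros = font_block_read()
print(longs, round(micros, 3), pointer_longs(0), pointer_longs(3))
```

A pointer aligned to a long boundary touches 4 longs and any other horizontal position touches 5, which matches the "4 or 5 longs" estimate.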