ps. TonyB_, yes it should be doable. I think there is a risk that it might blow to 9 instructions per pixel pair in the critical loop if evennibble needs to be read twice but even then it should (just) still fit in. Unrolling further buys a bit more.
rogloh, thanks for the text mode explanation. I think we should start off with graphics mode. Who will be the first person to get this working?
Chip, a couple of questions, if you have time:
1. Can you confirm that ALTD D,{#}S and similar instructions leave D unchanged?
2. If data are streamed from hub RAM to LUT RAM in cog 0 and LUT sharing is enabled in cog 1, will cog 1's LUT also receive the streamed data? Assuming cog 1 is not streaming using its LUT.
I think the terminology of streaming may be getting confusing here. The "streamer" in the COG, from what I understand at least, streams hub data to/from pins, potentially using the LUT as an intermediary to translate the data in LUT and DDS modes. But block transfers are also possible on a long-per-clock basis to/from HUB and COG or LUT memory. In our HDMI example, only the HDMI COG is streaming byte data to pins from hub at 250MHz; the other transfers are just block transfers. And thankfully, because the streamer is not using the LUT modes, transfers within a shared LUT RAM (when enabled) should be possible between a COG pair, and they won't need to interrupt the streamer because they don't have to involve the hub.
1) In the ALTx instructions, S[17:9] gets added into D. If those bits are zero, D stays the same.
2) It looks like all software-controlled LUT writes get sent to the adjacent cog, in case it has LUT sharing enabled. So, SETQ2+WRLONG should write to the adjacent cog, as well.
Okay, so for 1) we will need the additional instruction in the critical loop as we can't guarantee those bits will be 0. This is still okay budget wise, but I will look for any other instruction optimizations to see what can be done. 8 instructions per pixel pair is a bit nicer than 9 as it is a good hub window multiple.
EDIT: second thoughts, actually being a hub window multiple is not all that important, as there will be other setup overhead that needs to be included that likely takes us off a hub window anyway. But I'll still look to what can be done if anything.
Think I managed to get it back down to 8 instructions per pixel pair in the critical loop path, and another bonus is that this version only needs two look up tables for translation now, saving 16 longs.
' table1 is 16 longs holding even nibble translation data
' table2 is 16 longs holding odd nibble translation data
' data is 32 bit long source reg holding next 8 pixels of bitmap (4bpp nibble based CGA like colours)
' buf is a temporary result buffer holding both some precomputed longs and also partially updated
getnib even, data, #0 ' extract even nibble from bitmap data
alts even, #table1 ' compute look up table 1 address
mov buf+1, 0-0 ' update long in buffer at long offset 1, trashes LS word temporarily
setword buf+2, buf+1, #0 ' patch LS word in buffer at long offset 2
setword buf+1, fixup, #0 ' fixup trashed LS word at buffer at long offset 1
getnib odd, data, #1 ' extract odd nibble from bitmap data
alts odd, #table2 ' compute lookup table 2 address
mov buf+4, 0-0 ' update long in buffer at long offset 4
' repeat for other pixels in data and further unrolled copies
...
fixup long ... ' constant value for fixing buf[1] etc
table1 long[16] ' 16 long fixed table for even pixels
...
table2 long[16] ' 16 long fixed table for odd pixels
...
buf long[80] ' temp buffer of adequate block size for unrolled loop above, containing some preallocated data
..
UPDATE: Just realised there might be a further optimisation possible if you can use a single 16 entry table in LUT RAM at offset 0 for the translation data: this one takes only 7 instructions per pixel pair and saves yet another 16 COG longs, as only one table is needed now.
This is good because it buys you a lot of spare time and a bit more COG space: 2.56us extra per line, which could be used for things like mouse pointer overlays for a GUI, or flashing text cursor type calculations etc. At 250MHz, 2.56us is 320 extra instructions per scanline that can be executed, which is quite a bit.
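As a sanity check on those figures (assuming a 640-pixel scanline and the usual 2 clocks per instruction), the arithmetic works out like this:

```python
CLOCK_HZ = 250_000_000       # assumed 250MHz system clock
CLOCKS_PER_INSTR = 2         # typical 2-clock P2 instructions
PIXEL_PAIRS = 640 // 2       # 320 pixel pairs in a 640-pixel scanline

# dropping from 8 to 7 instructions saves one instruction per pair
saved_clocks = PIXEL_PAIRS * 1 * CLOCKS_PER_INSTR
saved_us = saved_clocks / CLOCK_HZ * 1e6
extra_instructions = saved_clocks // CLOCKS_PER_INSTR

print(saved_us, extra_instructions)   # 2.56 320
```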
For ALTD and ALTS, the source D is not changed. Only the following instruction's D or S operand is modified.
ALTI can inc/dec the content in source D.
I had assumed that too, but Chip seems to indicate otherwise above for certain cases. In any case we have some decent workarounds to deal with it now.
EDIT: in our case S[17:9] could be zero, so yes, D could be left unchanged.
That's right Roger.
This is not in the docs.
In the ALTx instructions, S[17:9] gets added into D. If those bits are zero, D stays the same.
Whoops. I just added it to the docs.
The instruction spreadsheet covered this functionality, though.
Ok, seems some things are going to be partially undocumented, early on at least, and we may have to figure things out as we go. New discoveries are yet to be made with P2! LOL.
BTW: Chip, the nibble indexing is very cool for BCD work!
Yeah, it would be great for BCD, and nibble indexing is awesome for pixel fonts too in 4bpp mode. I just coded up a sample loop for a text mode font lookup on the HDMI streamer COG and found each 8 bit character can be done in 35 instructions with foreground and background colours applied per pixel. This is good because it is of the same order of time as the work needed to do the translation to 10bit format for HDMI (28 or 32 instructions for 8 pixels), so it won't hold up the processing much and can fit in the budget even without optimizations.
rdlut char, rdlutaddr ' read LUT to get 16 bit word in [FG:BG:char] format
add rdlutaddr, #1 ' increment source address
getnib fg, char, #3 ' extract fg colour nibble
getnib bg, char, #2 ' extract bg colour nibble
and char, #$7f ' mask off unwanted parts
if_c add char, #128 ' c is already set outside the loop if we are on high scanlines of row (4-7)
alts char, #font ' determine font table location
mov pixels, 0-0 ' pixels contains 8 bit pixel data
rol pixels, scan ' compensate for scanline (scan is maintained as scanline * 8 outside loop)
testb pixels, #0 wz ' z = font pixel bit (TEST would AND with the bit number here; TESTB tests the bit itself)
if_z setnib data, fg, #0
if_nz setnib data, bg, #0
testb pixels, #1 wz
if_z setnib data, fg, #1
if_nz setnib data, bg, #1
testb pixels, #2 wz
if_z setnib data, fg, #2
if_nz setnib data, bg, #2
testb pixels, #3 wz
if_z setnib data, fg, #3
if_nz setnib data, bg, #3
testb pixels, #4 wz
if_z setnib data, fg, #4
if_nz setnib data, bg, #4
testb pixels, #5 wz
if_z setnib data, fg, #5
if_nz setnib data, bg, #5
testb pixels, #6 wz
if_z setnib data, fg, #6
if_nz setnib data, bg, #6
testb pixels, #7 wz
if_z setnib data, fg, #7
if_nz setnib data, bg, #7
wrlut data, wrlutaddr ' write back long to shared LUT
add wrlutaddr,#1
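For reference, here is a small Python model of what the loop above computes per character scanline (a hypothetical helper name; it assumes a set font bit selects the foreground nibble, as the if_z/if_nz pairs do):

```python
def render_char_scanline(font_byte: int, fg: int, bg: int) -> int:
    """Pack 8 font pixels into one 32-bit long of 4bpp nibbles:
    nibble n gets the fg colour if font bit n is set, else bg."""
    data = 0
    for n in range(8):
        colour = fg if (font_byte >> n) & 1 else bg
        data |= (colour & 0xF) << (4 * n)
    return data

# e.g. only the outer column pixels lit, fg=3, bg=7
print(hex(render_char_scanline(0b10000001, 0x3, 0x7)))  # 0x37777773
```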
Yeah, thanks Cluso I like this skip thing. It could get the unrolled part of my pixel loop down to 2 instructions per pixel. Would need two stages of 8 each and a complement of the font pixels in the middle plus a clearing of original accumulator beforehand, so from 24 (8*3) to perhaps 19 (16+3), making 30 total now instead of 35. This means my font loop code can potentially be made slightly faster than the 10bit format conversion work in the other COG and won't ever need to become a bottleneck as the resulting bitmap data will always be ready in time for the other COG to process, once the initial latency is factored in. Or it potentially buys a few extra cycles for some type of text flash attribute or inverse feature using the otherwise unused top bit of the character index to do something useful.
Actually it's potentially better than that because I might be able to replicate some nibbles in the long accumulator first and then only update the other colour based on the lit pixels in the font. ie. I could assume same foreground colour for all pixels in the character scanline, using a LUT of FG colours that replicates same nibble into 8 nibbles of a long, then overwrite with background colour depending on skip test result. Have to be careful though with the SKIP as it works on 32 bits and I really only want 8 bits done so it may be problematic unless each new SKIP clobbers the existing running SKIP and I mask the upper 24 bits to 0 when I set things up.
UPDATE: possible code with the SKIP optimization is now 23 instructions per character and much faster than the HDMI 10 bit table stuff running in the other COG. This seems to be the way to go if it can work. It can also leave some instructions over to do something useful with the top bit of the character once we figure out the best thing for that.

By the way, I realised that the font table doesn't have to be aligned in COG RAM anymore, so it might even be possible to have a font with more than 8 scanlines per character, like 10 or 12, giving 40 or 48 row modes too, assuming space for it exists. A 12 scanline font can look nicer than 8, so it's good to have that option now too. 12 scanlines needs another 128 COG RAM values, or we could have some of this font held in LUT RAM too; that might even allow 16 scanline fonts or the full 256 character range, which would be awesome!

The transfer of pixel data between COGs only needs 80 longs in and out in the LUT, and it could even be shared in the same memory going in and coming out, so that leaves a lot of extra LUT space to hold additional font data/scanlines. The code below could be made smart enough to know when to read the font from COG vs LUT. One option is to have the bottom 128 characters as a fixed font, and the top 128 characters dynamically reprogrammable, read in per frame from hub RAM and stored in the LUT. This allows for cool animation stuff in text mode, or simply two fonts selected by the high bit of the character. I hope the entire 512 addresses of LUT RAM are shared, but I don't really know; it might be half the range or something.
rdlut char, rdlutaddr ' read LUT to get 16 bit word in [FG:BG:char] format
add rdlutaddr, #1 ' increment source address
getnib fg, char, #3 ' extract fg colour nibble
getnib bg, char, #2 ' extract bg colour nibble
and char, #$7f ' mask off unwanted parts
if_c add char, #128 ' c is already set outside the loop if we are on high scanlines of row (4-7)
alts char, #font ' determine font table location
mov pixels, 0-0 ' pixels contains 8 bit pixel data
rol pixels, scan ' compensate for scanline (scan is maintained as scanline * 8 outside loop)
and pixels, #$ff ' ensure we don't continue to skip instructions beyond 8 bits (might be combinable with previous rol using getbyte instr. instead)
alts fg, #fgbuf ' compute table address
mov data, 0-0 ' replicate same fg nibble 8 times into long from a 16 long table that does this, indexed by fg nibble
skip pixels
setnib data, bg, #0
setnib data, bg, #1
setnib data, bg, #2
setnib data, bg, #3
setnib data, bg, #4
setnib data, bg, #5
setnib data, bg, #6
setnib data, bg, #7
wrlut data, wrlutaddr ' write back long to shared LUT
add wrlutaddr,#1
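A quick model of the SKIP version above: pre-fill the long with the replicated fg colour, then let each set pixel bit skip the corresponding bg setnib (a sketch only; real SKIP consumes one pattern bit per following instruction):

```python
def render_with_skip(font_byte: int, fg: int, bg: int) -> int:
    """fg pre-replicated into all 8 nibbles (as from the fgbuf table),
    then bg written wherever the font bit is clear (i.e. not skipped)."""
    data = sum((fg & 0xF) << (4 * n) for n in range(8))
    for n in range(8):
        if not (font_byte >> n) & 1:   # bit clear -> setnib executes
            data = (data & ~(0xF << (4 * n))) | ((bg & 0xF) << (4 * n))
    return data
```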
Had another thought, I think it would be possible to just read in the font data dynamically for the scanline being rendered right into the LUT RAM alongside the character data from hub RAM. It's only another 64 longs worth for 256 characters which won't take long at 250Mlongs/sec read rates. This could allow arbitrary sized fonts to be supported as well or augmenting some fixed font available in COG RAM if needed. The font scanline packing and indexed addressing approach changes slightly but won't add that many more cycles to achieve.
Skipping is not the best option in this case. A nibble mask can be created from the low byte in pixels using the following algorithm (based on ozpropdev's reply):
' d = xxxxxxxx xxxxxxxx xxxxxxxx abcdefgh
movbyts d,#0 ' d = abcdefgh abcdefgh abcdefgh abcdefgh
mergeb d ' d = aaaabbbb ccccdddd eeeeffff gggghhhh
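Modelled in Python, the net effect is a bit-to-nibble mask expansion, which can then select between replicated fg and bg longs (the mask-combine step is my assumption about how the mask would be used; the MOVBYTS/MERGEB behaviour follows the comments above):

```python
def nibble_mask(byte: int) -> int:
    """Net effect of MOVBYTS d,#0 then MERGEB on the low byte:
    each bit n of the byte expands into nibble n of the result."""
    m = 0
    for n in range(8):
        if (byte >> n) & 1:
            m |= 0xF << (4 * n)
    return m

def render_with_mask(font_byte: int, fg_rep: int, bg_rep: int) -> int:
    # fg_rep/bg_rep are longs with the colour nibble replicated 8 times
    m = nibble_mask(font_byte)
    return (fg_rep & m) | (bg_rep & ~m & 0xFFFFFFFF)
```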
This is a good idea and the number of cycles might be the same or lower. Two characters could be packed into one long to save LUT RAM and block writing time.
Wow, all these fancy new instructions to learn to play with! Another 5 instructions saved I think. I like it. P2 is cool.
Yeah that way you'd only need 40 longs to hold the 80 character/colour words in the LUT and it would be 2x faster to transfer the text for processing by the other COG. I was originally thinking that the input/output buffer between COGs in the LUT could be shared, meaning there is more block RAM for extra font data. But it doesn't need to be shared.
Right now for the LUTRAM usage, it could have 16 longs for a 10b translation table (7 instruction optimisation mentioned earlier), 40 or 80 longs for 80 characters input to the HDMI COG, either a shared 80 longs out, or another 80 long bitmap result buffer back, and the remaining free for other uses such as dynamic font scanlines for 256 characters. A single font table read into the LUT for the current scanline only takes 64 longs, so with everything combined in the maximum usage case with no shared input/output buffer etc I think we now have 240 longs used in the LUT and in the best case just 160. If 256 LUT RAM entries are left over after all this that provides enough space for enabling 16 scanline fixed fonts on the lower 128 characters with half this table stored in the HDMI COG RAM and the other half stored in the extra LUT RAM. We could have proper 16 scanline VGA text capability and dynamic programmable characters as well in the upper 128 character set range. That would be pretty cool.
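Tallying the LUT RAM budget from that paragraph (long counts as stated above):

```python
LUT_SIZE     = 512
TRANSLATE    = 16   # 10b translation table
CHARS_WORDS  = 80   # 80 character/colour words, one per long
BITMAP_OUT   = 80   # separate bitmap result buffer back
FONT_LINE    = 64   # one font scanline for 256 characters

worst = TRANSLATE + CHARS_WORDS + BITMAP_OUT + FONT_LINE  # no shared in/out buffer
best  = TRANSLATE + CHARS_WORDS + FONT_LINE               # shared in/out buffer
print(worst, best, LUT_SIZE - worst)  # 240 160 272 -> over 256 longs still free
```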
I think ALTx instructions are not a problem, but well done with finding an alternative that is no slower. RDLUT takes 3 cycles, therefore the bottom version is no quicker than the top one.
I'll use streamer more accurately from now on and use block reads or writes for the fast FIFO transfers. I find "HDMI cog" confusing. Can we just call the two cogs "line buffer" and "streamer"? They both deal with TMDS data that forms a crucial part of DVI/HDMI.
Thanks for the info, Chip, which means D stays the same in ALTx D,#S.
We can call them whatever we want, certainly those, or other names like render COG and display COG, although it can get confusing if this display COG is also effectively doing some text rendering work. Only your "line buffer" COG does the 4 bit to 10 byte table stuff though. Maybe a name that deals with that type of work is useful. I guess line buffer is fine for now, and I can try to use it too.
In regards to your point about RDLUT taking 3 clocks instead of 2, I am hoping that going back to the 8 instruction (16 clocks) per nibble pair approaches that have been found we should still have enough time left over on the scanline for a couple of additional/optional things in this idea for our HDMI gfx/text driver:
1) font table lookup - need to block read up to 256 bytes as 64 LONGs for the current scanline and populate LUT RAM, this would take an extra 0.256us for the raw transfers plus some address/setup overheads, or half the transfers if only the top character range is dynamic.
2) transparent mouse cursor overlay - this feature would be very useful to support a decent 640x480x16 color mode with a mouse pointer for enabling traditional GUI applications. Combining this into our two existing COGs could save burning another COG for a mouse pointer, so it's well worth it - I'd see it almost as a must. I don't quite know how many instructions it will need yet, but it will need to identify the appropriate graphics image for the scanline, rotate/mask accordingly, and write over the pixel buffer before it is converted and written back to hub. A 32x32 pixel mouse pointer would need 4 or 5 longs in the (nibble) pixel buffer adjusted with the image overlaid. There should be some nice new instructions that can be put to good use here. This work needs to happen before the conversion using your 4b-10B long table stuff, which would otherwise complicate things further and require even more adjustments.
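As a sketch of that overlay step, with a hypothetical data layout (the scanline as a list of 4-bit colour values, a 32-entry cursor row, and a 32-bit opacity mask; on the P2 this would instead be shifted masks and mux-style operations on the 4 or 5 affected longs):

```python
def overlay_cursor_row(line, cursor_row, opaque_mask, x):
    """Write the opaque cursor pixels for one scanline over the 4bpp
    pixel buffer; transparent pixels leave the background intact."""
    for i, colour in enumerate(cursor_row):
        px = x + i
        if (opaque_mask >> i) & 1 and 0 <= px < len(line):
            line[px] = colour & 0xF

line = [0] * 640                                  # one blank scanline
overlay_cursor_row(line, [5] * 32, 0x0000FFFF, 100)  # left half of cursor row opaque
```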
The "streamer" COG should certainly do any of the text attribute stuff desired, like flashing foreground/background etc. I think there will definitely be enough cycles left over for that with all the optimisations that have been found so far.
One SKIP/SKIPF replaces another. Also, once the SKIP/SKIPF is zero it completes.
Ok, thanks for letting me know about RDLUT timing. Actually I didn't realise that. Seems WRLUT still takes 2 cycles at least.
That is right.