All PASM2 gurus - help optimizing a text driver over DVI?

in Propeller 2
So I've been adding a text driver for VGA over DVI/HDMI and I have it all working and am interested to see if it can be optimized further. I'll try to post it soon. It's really great what a single P2 COG can do now...
The text driver portion is fairly simple: it does 80 characters of 16-colour text per scan line, similar to what VGA would offer, including a flashing attribute option on bit 7 of the colour byte. Counting the instructions in my code snippet below, I think it effectively takes 25 clocks plus the hub read of the font data to compute the coloured pixel data for each character (I do two chars in each loop). A random hub read can take up to 16 clocks more, bringing the worst-case total to 41 clocks * 80 characters = 3280 clock cycles in the critical inner loop. It currently fits, but this encoding stage already takes over half of the visible scan line (640 pixels = 6400 clocks). I would like to tighten it up to leave room for other work on the same scan line: transferring the data to/from HUB (the streamer's bandwidth will interrupt and slow down these hub transfers), plus cursors and other fancy features later.
Ideally, if it could all be done in less than 3200 clocks with the transfer included, that would be awesome, as I know I could make good use of the remaining time per scan line for other very useful things... :-) I know I can already get below this number if I jettison the flashing text capability (saves two instructions per character = 320 clocks), but it would be nice to keep that around too, though it's hardly essential.
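As a quick sanity check of the numbers above, here is a minimal Python sketch of the per-scan-line budget. It uses only the figures quoted in the post; the constant names are made up for illustration, and the 2 clocks per instruction is inferred from "two instructions per character = 320 clocks".

```python
# Cycle budget per scan line, using only the figures quoted in the post.
CLOCKS_PER_PIXEL = 10       # 640 visible pixels = 6400 clocks
VISIBLE_PIXELS = 640
CHARS_PER_LINE = 80
COMPUTE_CLOCKS = 25         # per character, excluding the hub read wait
WORST_HUB_WAIT = 16         # worst-case extra clocks for a random hub read
FLASH_INSTRUCTIONS = 2      # per character
CLOCKS_PER_INSTRUCTION = 2  # inferred from the 320-clock saving

visible_clocks = VISIBLE_PIXELS * CLOCKS_PER_PIXEL
worst_case = (COMPUTE_CLOCKS + WORST_HUB_WAIT) * CHARS_PER_LINE
flash_cost = FLASH_INSTRUCTIONS * CLOCKS_PER_INSTRUCTION * CHARS_PER_LINE
without_flash = worst_case - flash_cost

print("visible scan line:", visible_clocks)
print("worst-case encode:", worst_case)
print("flashing overhead:", flash_cost)
print("encode without flashing:", without_flash)
```

With these inputs the worst case comes to 3280 clocks, and dropping the flashing test brings it to 2960, which is under the 3200 target before the hub transfers are counted.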
So does anyone know a clever and faster method to convert the 8 bit font data representing the selected 8 pixel colours and a pair of nibbles representing the foreground and background colours into a packed 32 bit long for the character and which always takes less than the 7 instructions that I have used below, for all source input data?
I was hoping to use the newer instructions like MUXNIBS/SPLITB/MERGEB/ROLNIB etc to do this but couldn't yet find a faster way than what I have below, basically by leveraging MOVBYTS which is a great new instruction by the way. MIXPIX may be of use too perhaps (although it is slow) or even a different data arrangement? One thing is that the current 16 bit word with the traditional VGA data format (BG | FG | character) can really help you with the font address translation and if you rearrange it you'll likely break that existing optimization, however if overall some further gains are clearly there it could be possible to reorder the source data order too.
Roger.
'generate the next text scan line
gen_text
mov fontaddr, scanline 'build font table base address
shl fontaddr, #8 'for this font and scanline
add fontaddr, fontbase 'font tables aligned on 256 bytes
setq2 #40-1 'get next 80 chars with colours
rdlong $100, ptra++ 'read 40 longs from HUB into LUT
push ptra 'preserve original pointers
push ptrb
mov ptra, #$100 'prepare LUT read/write pointers
mov ptrb, #$100+40
testb modedata, #0 wz 'flashing or high intensity bg?
if_nz mov test1, instr1 'include flashing bit test...
if_nz mov test2, instr2 '...when flashing mode is enabled
if_z mov test1, #0 'otherwise remove test instructions
if_z mov test2, #0 wc 'also need to clear C in this case
rep @.endloop, #40 'repeat loop 40 times for 80 chars
rdlut first, ptra++ 'read a pair of colours and chars
getword second, first, #1 '0000BFcc cc=char, B=bg,F=fg nbls
setbyte fontaddr, first, #0 'determine font lookup address
test1 bitl first, #15 wcz 'test (and clear) flashing bit
rdbyte pixels, fontaddr 'get font information for character
if_c and pixels, flash 'make it all background if flashing
movbyts first, #%01100101 'becomes BFxxBFBF
setnib first, first, #7 'becomes FFxxBFBF
setnib first, first, #5 'becomes FFFxBFBF
getnib bgcolour, first, #3 'get background colour
setnib first, bgcolour, #4 'becomes FFFBBFBF
setnib first, bgcolour, #0 'becomes FFFBBFBB
movbyts first, pixels 'select pixel colours
wrlut first, ptrb++ 'save to LUT memory
setbyte fontaddr, second, #0 'determine font lookup address
test2 bitl second, #15 wcz 'test (and clear) flashing bit
rdbyte pixels, fontaddr 'get font information for character
if_c and pixels, flash 'make it the background if flashing
movbyts second, #%01100101 'becomes BFxxBFBF
setnib second, second, #7 'becomes FFxxBFBF
setnib second, second, #5 'becomes FFFxBFBF
getnib bgcolour, second, #3 'get background colour
setnib second, bgcolour, #4 'becomes FFFBBFBF
setnib second, bgcolour, #0 'becomes FFFBBFBB
movbyts second, pixels 'select pixel colours
wrlut second, ptrb++ 'save to LUT memory
.endloop
pop ptrb
pop ptra
setq2 #80-1 'write 640 nibble pixels to HUB RAM
wrlong $100+40, ptrb++
ret
instr1 bitl first, #15 wcz 'test (and clear) flashing bit
instr2 bitl second, #15 wcz 'test (and clear) flashing bit
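To make the colour expansion in the inner loop easier to follow, here is a rough Python model of the sequence (MOVBYTS, SETNIB, GETNIB, then the final MOVBYTS driven by the font byte). The helper names mirror the PASM2 mnemonics but are plain Python written for this illustration; the semantics assumed are "MOVBYTS picks each result byte using a 2-bit selector" and "SETNIB writes the low nibble of S into nibble N of D". Which output nibble ends up leftmost on screen depends on the streamer mode and is not modelled.

```python
MASK32 = 0xFFFFFFFF

def movbyts(d, s):
    """Model of MOVBYTS: result byte i is old byte number S[2i+1:2i] of D."""
    old = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= old[(s >> (2 * i)) & 3] << (8 * i)
    return out

def getnib(s, n):
    """Model of GETNIB: return nibble n of S."""
    return (s >> (4 * n)) & 0xF

def setnib(d, s, n):
    """Model of SETNIB: nibble n of D takes the low nibble of S."""
    return (d & ~(0xF << (4 * n)) & MASK32) | ((s & 0xF) << (4 * n))

def expand_original(first, pixels):
    """Follow the posted sequence; 'first' holds xxxx_BF_cc in its low word."""
    first = movbyts(first, 0b01100101)   # BF xx BF BF
    first = setnib(first, first, 7)      # FF xx BF BF
    first = setnib(first, first, 5)      # FF Fx BF BF
    bg = getnib(first, 3)                # background colour
    first = setnib(first, bg, 4)         # FF FB BF BF
    first = setnib(first, bg, 0)         # FF FB BF BB
    return movbyts(first, pixels)        # each font bit pair picks one byte

# Example: background 1, foreground $E, character $41, junk in the upper word.
sample = 0xABCD1E41
print(hex(expand_original(sample, 0b11000011)))
```

In this model every output nibble n carries the foreground colour when font bit n is set and the background colour otherwise, for all 256 font bytes.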
Comments
32 bit character format is
FG | FG | FG | BG | BG | FG | char_code8
Just need this last bit to translate this format into the final 8 nibbles.
getnib bgcolour, first, #3 'get background colour
setnib first, bgcolour, #1 'becomes FF_FB_BF_Bc
setnib first, bgcolour, #0 'becomes FF_FB_BF_BB
movbyts first, pixels 'select pixel colours
Update:
Or even better, this...saving another instruction!
movbyts first, #%11100110 'becomes FF_FB_BF_FB
setnib first, first, #1 'becomes FF_FB_BF_BB
movbyts first, pixels 'select pixel colours
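A small Python sketch of this shorter variant, assuming the 32 bit character format FG | FG | FG | BG | BG | FG | char_code8 described above. The pack_char32 helper is hypothetical (it stands in for whatever writes the screen buffer), and the movbyts/setnib functions are plain Python models of the instructions, not real P2 code.

```python
MASK32 = 0xFFFFFFFF

def movbyts(d, s):
    """Model of MOVBYTS: result byte i is old byte number S[2i+1:2i] of D."""
    old = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= old[(s >> (2 * i)) & 3] << (8 * i)
    return out

def setnib(d, s, n):
    """Model of SETNIB: nibble n of D takes the low nibble of S."""
    return (d & ~(0xF << (4 * n)) & MASK32) | ((s & 0xF) << (4 * n))

def pack_char32(fg, bg, char):
    """Hypothetical host-side helper: build FF_FB_BF_cc from two nibbles and a char."""
    return (fg << 28) | (fg << 24) | (fg << 20) | (bg << 16) | (bg << 12) | (fg << 8) | char

def expand_char32(first, pixels):
    first = movbyts(first, 0b11100110)   # FF_FB_BF_FB
    first = setnib(first, first, 1)      # FF_FB_BF_BB
    return movbyts(first, pixels)        # select pixel colours

cell = pack_char32(0xE, 0x1, 0x41)
print(hex(cell), hex(expand_char32(cell, 0b10100101)))
```

The trade-off is the one mentioned in the thread: the screen buffer doubles to 80 longs per row, in exchange for two instructions of set-up per character instead of six.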
In my driver case while the colours are being processed the FIFO will very much be active, so I probably won't use that approach during that portion, but it could be used before/after the active pixel streaming portion starts running.
One bonus, once you use 80 longs per character row instead of 80 words, is that you save the unrolled loop instruction space for processing coloured pixels and can do the translation inline in LUT RAM saving 40 LUT longs and also sharing a common LUT pointer which in my case means I only need to preserve one pointer instead of two.
I still would quite like a conventional 16 bit screen data format though...we'll have to see how it pans out.
IFs become a mux. Equality compares and logic operators are cheap. Full compares and numerical operators not so much. What little you are doing is all logic and even that is just debug code I think.
If the streamer could be clocked at half the pixel rate, would it be able to replicate the last pixel value until the next pixel is read when the SETCMOD is setup for DVI/HDMI mode? That might generate two identical 10b patterns together but I'm wondering about the HDMI output clock as we still want it at 25.2MHz and generating the 10b codes at 252MHz, not something half this.
Tricky.
Keeping the packed format you can bring it down by a few instructions:
movbyts first, #%01010101 'becomes BF_BF_BF_BF 2 cycles
mov temp, first ' 2 cycles
rol temp, #4 'temp becomes FB_FB_FB_FB 2 cycles
setq ##$F0FF000F ' 2 cycles (+2 for the ## prefix)
muxq first, temp 'first becomes FF_FB_BF_BB 2 cycles
Rinse and repeat for second.
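The rotate-and-mux idea can be checked with a short Python model. The functions below are illustrative stand-ins for MOVBYTS, ROL and SETQ+MUXQ (assuming MUXQ computes (D AND NOT Q) OR (S AND Q)); they are not P2 code, and the sample value is arbitrary.

```python
MASK32 = 0xFFFFFFFF

def movbyts(d, s):
    """Model of MOVBYTS: result byte i is old byte number S[2i+1:2i] of D."""
    old = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= old[(s >> (2 * i)) & 3] << (8 * i)
    return out

def rol(x, n):
    """32-bit rotate left."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def muxq(d, s, q):
    """Model of SETQ + MUXQ: for each 1 bit in q, copy that bit from s into d."""
    return (d & ~q & MASK32) | (s & q)

def colour_table(first):
    first = movbyts(first, 0b01010101)     # BF_BF_BF_BF
    temp = rol(first, 4)                   # FB_FB_FB_FB
    return muxq(first, temp, 0xF0FF000F)   # FF_FB_BF_BB

# Background 1, foreground $E, character $41, junk in the upper word.
table = colour_table(0xABCD1E41)
print(hex(table))
```

The result can then be fed to the final "movbyts first, pixels" step exactly as before, since it has the same FF_FB_BF_BB layout.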
Does something like this help?
mov pb,#$c3 '8 pixels
pixels_x2
setword pb,pb,#1
mergew pb 'pb = $0000f00f
mov pb,#$c3 '8 pixels
pixels_x4
movbyts pb,#0
mergeb pb 'pb = $ff0000ff
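For readers who want to see why the merge instructions double and quadruple pixels, here is a hedged Python model. It assumes MERGEW interleaves the bits of the two words (result bit 2i+1 from D[16+i], bit 2i from D[i]) and MERGEB interleaves the bits of the four bytes (result bit 4i+j from D[8j+i]); the function names are illustrative only.

```python
def mergew(d):
    """Model of MERGEW: interleave the bits of the high and low words of d."""
    out = 0
    for i in range(16):
        out |= ((d >> i) & 1) << (2 * i)
        out |= ((d >> (16 + i)) & 1) << (2 * i + 1)
    return out

def mergeb(d):
    """Model of MERGEB: interleave the bits of the four bytes of d."""
    out = 0
    for i in range(8):
        for j in range(4):
            out |= ((d >> (8 * j + i)) & 1) << (4 * i + j)
    return out

def pixels_x2(byte):
    """setword pb,pb,#1 then mergew: every bit appears twice."""
    return mergew((byte << 16) | byte)

def pixels_x4(byte):
    """movbyts pb,#0 then mergeb: every bit appears four times."""
    return mergeb(byte * 0x01010101)

print(hex(pixels_x2(0xC3)), hex(pixels_x4(0xC3)))
```

Because both copies fed into the merge are identical, each source bit lands next to itself, which is what removes the need for a 16-entry lookup table in the 40 column case.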
Nice one AJL, I like it! Saves two instructions on my initial approach, and still allows 16 bit screen format. I'll have to remember to use this muxq thing more often.
ozpropdev, I think that top example would certainly help out for doing the 40 column text mode stuff, yes! I currently use two instructions to transform it as well, but with a 16 entry table in COGRAM, while your approach totally eliminates the need for this table.
It might help for duplicating bitwise pixels in graphics modes too, though that still needs to be read in, and operated on one item at a time and written back to the line buffer. That is going to burn extra instructions and we need to do several different combinations for doubling bits/twits/nibbles/bytes/words/longs etc. Can't really be helped I guess, unless the HDMI output hardware could replicate things for us when the streamer is clocked at half the normal rate, but I'm not sure that is possible.
abcdefghijklmnopqrstuvwxyzABCDEF (32 bits)
into two other longs:
ababcdcdefefghghijijklklmnmnopop
and
qrqrststuvuvwxwxyzyzABABCDCDEFEF
and also similarly for replicating nibbles?
ABCDEFGH (each letter a nibble of the long)
into
AABBCCDD
and
EEFFGGHH?
Is the latter just a getnib, rolnib, rolnib,
then getnib rolnib rolnib type of sequence repeating for all nibbles of the long?
I presume you've already looked at my VGA font driver (http://forums.parallax.com/discussion/169803/vga-640x480-800x600-1024x768-full-color-ansi-text-driver). Of course for HDMI you'll have to take quite a different approach, but the font conversion tools might be useful for you.
Not sure how much font memory I can fit into COG RAM, as it is rapidly filling up and these fonts are in the order of 4kB, but each scanline row is only 256 bytes, so if there was room to hold it, I guess you could burst it all in per scanline in anticipation of it being needed. It would only take 64 clocks to get it into the COG (~6 pixels of the 800 pixel budget). In my case there may ultimately not be enough room for holding it all, but it is something to be considered if it is doable and becomes required from a performance point of view. I can see that reading in executable code for different video modes (think overlays) is going to help keep memory use down, though I do also like the COG having everything inside so that once started it runs self contained; this helps prevent video driver lockups due to any other errant COG code that could be trashing HUB memory.
One of the problems with the egg-beater model is that random hub accesses have to be considered to always take the longest time of 16 clocks, unlike the P1 where once you locked into the hub pattern, you could get another fast access in the successive hub cycle independent of the particular hub address accessed. Doesn't work like that on the P2.
I have a testing program lying around if you want to try getting a handle on this.
if blinking and appropriate frame
  set_all_ones = 1
else if underline and scanline == 7
  set_all_ones = 1
else if strikethrough and scanline == 4
  set_all_ones = 1
else
  set_all_ones = 0
then in the inner loop if set_all_ones is true use $FF instead of the actual font data.
The two character formats I chose were 64 bits/character and 32 bits/character, organized as:
64 bits: FFFFFF is foreground color, BBBBBB is background color, cc is 8 bit character index, xx are special effect flags (underline, blinking, etc.)
  FF FF FF cc BB BB BB xx
32 bits: FF is 8 bit foreground color index, BB is background color index
  FF cc BB xx
If you are able to use a similar format then you can lift all of the ANSI escape sequence routines directly from my code, which would save you some work. The character processing is all done in the "host" COG in a high level language (I have a Spin driver, also translated into C). A driver COG reads the character text buffer and outputs the VGA signal.
I was able to fit the code for both character variants, plus 256 longs of color palette and a 64 long buffer for font data, into COG memory but it's a tight squeeze. I think for the next version I'll start moving code into LUT, or perhaps do an overlay for the inner loop.
OTOH the various methods of burst read/write on P2 are enormously faster than P1, and we have an additional 2K of internal memory in the form of the LUT, so the trick to good performance is caching HUB data wherever possible.
Sounds pretty fully featured Eric. Not sure I like a 64 bit size, that's huge but I can see that it would offer all the options. I think having extra attributes is nice, particularly in mono modes which is something else to consider down the track. As you say the underline and blink etc can often leverage some common logical code.
Algorithm goes something like this... 20 input longs get converted to 40 output longs; these must be read from and written back to hub, but that's only another 60 clocks or so.
rep @end, #20 '20 times
rdlut x, ptra++
mov y, x
mov a, x
mov b, x
ror a, #2
rol b, #2
setq ##$33333333 '(might be backwards, didn't check)
muxq x,a
setq ##$CCCCCCCC
muxq y,b
rolnib a,x,#7
rolnib a,y,#7
rolnib a,x,#6
rolnib a,y,#6
rolnib a,x,#5
rolnib a,y,#5
rolnib a,x,#4
rolnib a,y,#4
wrlut a, ptrb++
rolnib a,x,#3
rolnib a,y,#3
rolnib a,x,#2
rolnib a,y,#2
rolnib a,x,#1
rolnib a,y,#1
rolnib a,x,#0
rolnib a,y,#0
wrlut a, ptrb++
end
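Since the mask order above is flagged as unchecked, here is a Python model of one loop iteration that compares the result against a straightforward reference. The ror/rol/muxq/rolnib functions are illustrative models (MUXQ assumed to be (D AND NOT Q) OR (S AND Q), ROLNIB assumed to shift D left by four and append nibble N of S), not P2 code.

```python
MASK32 = 0xFFFFFFFF

def ror(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK32

def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def muxq(d, s, q):
    """Model of SETQ + MUXQ: for each 1 bit in q, copy that bit from s into d."""
    return (d & ~q & MASK32) | (s & q)

def rolnib(d, s, n):
    """Model of ROLNIB: shift d left one nibble and append nibble n of s."""
    return ((d << 4) & MASK32) | ((s >> (4 * n)) & 0xF)

def double_pairs(x):
    """One loop iteration: returns the two output longs (upper half first)."""
    y = x
    a = ror(x, 2)
    b = rol(x, 2)
    x = muxq(x, a, 0x33333333)   # each nibble now holds its upper bit pair twice
    y = muxq(y, b, 0xCCCCCCCC)   # each nibble now holds its lower bit pair twice
    outputs = []
    for nibbles in ((7, 6, 5, 4), (3, 2, 1, 0)):
        for n in nibbles:
            a = rolnib(a, x, n)
            a = rolnib(a, y, n)
        outputs.append(a)        # wrlut a, ptrb++
    return outputs

def reference(x):
    """Plain restatement: duplicate every 2-bit pair, most significant first."""
    nibs = []
    for i in range(15, -1, -1):
        pair = (x >> (2 * i)) & 3
        nibs.append((pair << 2) | pair)
    hi = 0
    for nib in nibs[:8]:
        hi = (hi << 4) | nib
    lo = 0
    for nib in nibs[8:]:
        lo = (lo << 4) | nib
    return [hi, lo]

print([hex(v) for v in double_pairs(0x0000001B)])
```

In this model the masks as posted ($33333333 with the rotated-right copy, $CCCCCCCC with the rotated-left copy) agree with the reference, so they do not appear to be backwards.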
Nibble mode is less instructions but more clocks ~1560, still okay.
rep @end, #40
rdlut x, ptra++
rolnib a,x,#7
rolnib a,x,#7
rolnib a,x,#6
rolnib a,x,#6
rolnib a,x,#5
rolnib a,x,#5
rolnib a,x,#4
rolnib a,x,#4
wrlut a, ptrb++
rolnib a,x,#3
rolnib a,x,#3
rolnib a,x,#2
rolnib a,x,#2
rolnib a,x,#1
rolnib a,x,#1
rolnib a,x,#0
rolnib a,x,#0
wrlut a, ptrb++
end
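A matching Python sketch for the nibble case, together with a rough clock tally. The rolnib function is an illustrative model of ROLNIB; the per-instruction timings (3 clocks for RDLUT, 2 for ROLNIB and WRLUT) are assumptions chosen here because they reproduce the ~1560 figure quoted above, so treat them as a guess rather than a datasheet value.

```python
MASK32 = 0xFFFFFFFF

def rolnib(d, s, n):
    """Model of ROLNIB: shift d left one nibble and append nibble n of s."""
    return ((d << 4) & MASK32) | ((s >> (4 * n)) & 0xF)

def double_nibbles(x):
    """One loop iteration: ABCDEFGH -> AABBCCDD and EEFFGGHH."""
    a = 0
    outputs = []
    for nibbles in ((7, 6, 5, 4), (3, 2, 1, 0)):
        for n in nibbles:
            a = rolnib(a, x, n)
            a = rolnib(a, x, n)
        outputs.append(a)   # wrlut a, ptrb++
    return outputs

# Assumed timings, in clocks (not taken from the thread).
RDLUT, ROLNIB, WRLUT = 3, 2, 2
loop_clocks = RDLUT + 16 * ROLNIB + 2 * WRLUT
total_clocks = 40 * loop_clocks

print([hex(v) for v in double_nibbles(0x12345678)], total_clocks)
```

The starting value of a does not matter, because eight ROLNIBs shift all 32 old bits out before each write.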
A line 1 | B line 1 | C line 1 | ...
A line 2 | B line 2 | C line 2 | ...
Using cog ram 0-255 for a lookup table saves pointer arithmetic. I don't know if this is always important on P2.
You could also use a lookup table to convert BF to FFFBBFBB.
If the pixel data was quadrupled to long size, then combine with AJL's code: need to invert some of the pixels for this.
setd fsq,first ' set later q value according to character
and fsq,NOT_D8 ' forgot that first[8] is not always zero
movbyts first, #%01010101 'becomes BF_BF_BF_BF 2 cycles
mov temp, first ' 2 cycles
rol temp, #4 'temp becomes FB_FB_FB_FB 2 cycles
fsq setq 0 ' 2 cycles get q from font data in 0-255
muxq first, temp 'first becomes packed pixels
I would guess the TMDS encoder just captures the streamer output every 10 clocks. If that is the case, then just run the streamer at sysclock/20 for pixel doubling.
It's easily do-able. As you say, the whole font scanline is only ~64 cycles to read in, and it saves a random read from HUB which is potentially 16 cycles per character; the replacement lookup using altgb is only 4 cycles per character, so with an 80 character wide screen that's a total savings of 80*(16-4) - 64 = 896 cycles per scan line, which is pretty significant.
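The saving quoted above can be reproduced in a few lines of Python. All inputs are the figures from the post; the variable names are made up for illustration.

```python
CHARS_PER_LINE = 80
RANDOM_HUB_READ = 16   # worst-case clocks per character for a random hub read
COG_LOOKUP = 4         # clocks per character for the in-COG font lookup
FONT_ROW_BURST = 64    # clocks to burst one 256-byte font row into the COG

saving = CHARS_PER_LINE * (RANDOM_HUB_READ - COG_LOOKUP) - FONT_ROW_BURST
print("clocks saved per scan line:", saving)
```

Note that this compares against the worst-case hub read, so the real saving depends on how often a random read actually hits the full 16 clocks.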
setq ##$33333333 '(might be backwards, didn't check)
muxq x,a
setq ##$CCCCCCCC
muxq y,b
could become
setq ##$33333333 '(might be backwards, didn't check)
muxq x,a
ror b,#2
muxq y,b
rol b, #2
It's the same footprint and clocks but maybe you can do something with this to save a rotate? Remember the setq/muxq retains the last setq value.
For a time-critical loop, better to use the following?
setq reg_33333333 '(might be backwards, didn't check)
muxq x,a
setq reg_CCCCCCCC
muxq y,b