All PASM2 gurus - help optimizing a text driver over DVI?
rogloh
So I've been adding a text driver for VGA-style output over DVI/HDMI. I have it all working and am interested to see whether it can be optimized further. I'll try to post it soon. It's really great what a single P2 COG can do now...
The text driver portion is fairly simple: it does 80 characters of 16-colour text on a scanline, similar to what VGA would offer, including a flashing attribute option on bit 7 of the colour byte. Counting up the instructions in my code snippet below, I think it effectively takes 25 clocks plus the hub read of the font data to compute the coloured pixel data for each character (I do two chars in each loop). A random hub read can take up to 16 clocks more, bringing the worst case to 41 clocks * 80 characters = 3280 clock cycles in the critical inner loop. It currently fits, but this encoding stage already takes over half of the visible scan line (640 pixels = 6400 clocks), and I would like to tighten it up if possible to leave room for other things on the same scan line: transferring the data to/from HUB (while allowing for the streamer to steal bandwidth and slow those hub transfers, which it will), plus cursors and other fancy features later, etc.
Ideally, if it could all be done in less than 3200 clocks with the transfer included, that would be awesome, as I know I could make good use of the remaining time per scan line for other very useful things... :-) I know I can already get below this number if I jettison the flashing text capability (saving two instructions per character = 320 clocks), but it would be nice to keep that around too, though it's hardly essential.
So does anyone know a clever, faster method to convert the 8-bit font data (selecting each of the 8 pixels) plus a pair of nibbles (the foreground and background colours) into a packed 32-bit long for the character, one that always takes fewer than the 7 instructions I have used below, for all source input data?
I was hoping to use the newer instructions like MUXNIBS/SPLITB/MERGEB/ROLNIB etc. to do this, but couldn't yet find a faster way than what I have below, which basically leverages MOVBYTS (a great new instruction, by the way). MIXPIX may be of use too perhaps (although it is slow), or even a different data arrangement? One thing to note: the current 16-bit word with the traditional VGA data format (BG | FG | character) really helps with the font address translation, and if you rearrange it you'll likely break that existing optimization; still, if clear gains are there overall, it would be possible to reorder the source data too.
Roger.
'generate the next text scan line
gen_text        mov     fontaddr, scanline      'build font table base address
                shl     fontaddr, #8            'for this font and scanline
                add     fontaddr, fontbase      'font tables aligned on 256 bytes
                setq2   #40-1                   'get next 80 chars with colours
                rdlong  $100, ptra++            'read 40 longs from HUB into LUT
                push    ptra                    'preserve original pointers
                push    ptrb
                mov     ptra, #$100             'prepare LUT read/write pointers
                mov     ptrb, #$100+40
                testb   modedata, #0 wz         'flashing or high intensity bg?
if_nz           mov     test1, instr1           'include flashing bit test...
if_nz           mov     test2, instr2           '...when flashing mode is enabled
if_z            mov     test1, #0               'otherwise remove test instructions
if_z            mov     test2, #0 wc            'also need to clear C in this case
                rep     @.endloop, #40          'repeat loop 40 times for 80 chars
                rdlut   first, ptra++           'read a pair of colours and chars
                getword second, first, #1       '0000BFcc cc=char, B=bg, F=fg nibbles
                setbyte fontaddr, first, #0     'determine font lookup address
test1           bitl    first, #15 wcz          'test (and clear) flashing bit
                rdbyte  pixels, fontaddr        'get font information for character
if_c            and     pixels, flash           'make it all background if flashing
                movbyts first, #%01100101       'becomes BFxxBFBF
                setnib  first, first, #7        'becomes FFxxBFBF
                setnib  first, first, #5        'becomes FFFxBFBF
                getnib  bgcolour, first, #3     'get background colour
                setnib  first, bgcolour, #4     'becomes FFFBBFBF
                setnib  first, bgcolour, #0     'becomes FFFBBFBB
                movbyts first, pixels           'select pixel colours
                wrlut   first, ptrb++           'save to LUT memory
                setbyte fontaddr, second, #0    'determine font lookup address
test2           bitl    second, #15 wcz         'test (and clear) flashing bit
                rdbyte  pixels, fontaddr        'get font information for character
if_c            and     pixels, flash           'make it the background if flashing
                movbyts second, #%01100101      'becomes BFxxBFBF
                setnib  second, second, #7      'becomes FFxxBFBF
                setnib  second, second, #5      'becomes FFFxBFBF
                getnib  bgcolour, second, #3    'get background colour
                setnib  second, bgcolour, #4    'becomes FFFBBFBF
                setnib  second, bgcolour, #0    'becomes FFFBBFBB
                movbyts second, pixels          'select pixel colours
                wrlut   second, ptrb++          'save to LUT memory
.endloop
                pop     ptrb
                pop     ptra
                setq2   #80-1                   'write 640 nibble pixels to HUB RAM
                wrlong  $100+40, ptrb++
                ret

instr1          bitl    first, #15 wcz          'test (and clear) flashing bit
instr2          bitl    second, #15 wcz         'test (and clear) flashing bit
Comments
The 32-bit character format is:
Just need this last bit to translate this format into the final 8 nibbles.
Update:
Or even better, this...saving another instruction!
In my driver's case the FIFO will very much be active while the colours are being processed, so I probably won't use that approach during that portion, but it could be used before the active pixel streaming starts or after it finishes.
One bonus of using 80 longs per character row instead of 80 words: you save the unrolled-loop instruction space for processing coloured pixels and can do the translation in place in LUT RAM, saving 40 LUT longs and sharing a common LUT pointer, which in my case means I only need to preserve one pointer instead of two.
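A rough sketch of that in-place arrangement (val is just a placeholder register, and the per-character conversion body is omitted):

                mov     ptra, #$100             'a single LUT pointer serves both read and write
                rep     @.done, #80
                rdlut   val, ptra               'read one 32-bit char+colour entry
                '...convert val into its 8 packed pixel nibbles here...
                wrlut   val, ptra++             'overwrite the same entry, then advance
.done

With the 16-bit screen format the input is only 40 longs, so the output has to land in a separate LUT region with its own pointer, as in the listing above.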
I still would quite like a conventional 16 bit screen data format though...we'll have to see how it pans out.
IFs become a mux. Equality compares and logic operators are cheap. Full compares and numerical operators not so much. What little you are doing is all logic and even that is just debug code I think.
If the streamer could be clocked at half the pixel rate, would it be able to replicate the last pixel value until the next pixel is read when SETCMOD is set up for DVI/HDMI mode? That might generate two identical 10b patterns together, but I'm wondering about the HDMI output clock, as we still want that at 25.2MHz with the 10b codes generated at 252MHz, not half that.
Tricky.
Keeping the packed format, you can bring it down by a few instructions:
Rinse and repeat for second.
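The attached code isn't reproduced above, but as a sketch of the kind of SETQ/MUXQ sequence being described (register names and the exact nibble mask are illustrative guesses, not the original posting):

                'outside the REP loop - Q is retained for MUXQ (see the later comments on this):
                setq    nib_mask                'mask of nibbles to take from the rotated copy

                'per character, after the flashing test and font rdbyte as before:
                movbyts first, #%01_01_01_01    'copy byte1 (BF) into all bytes  -> BFBFBFBF
                mov     temp, first
                rol     temp, #4                'rotate left by one nibble       -> FBFBFBFB
                muxq    first, temp             'merge nibbles 7,5,4,0 from temp -> FFFBBFBB
                movbyts first, pixels           'font bits pick the FF/FB/BF/BB byte per pixel pair

nib_mask        long    $F0FF000F               'nibbles 7,5,4,0 come from the rotated copy

That is five instructions per character against the original seven, while keeping the 16-bit screen format.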
Does something like this help?
Nice one AJL, I like it! Saves two instructions on my initial approach, and still allows 16 bit screen format. I'll have to remember to use this muxq thing more often.
ozpropdev I think that top example would certainly help out for doing the 40 column text mode stuff, yes! I currently use two instructions to transform it as well but with a 16 entry table in COGRAM, while your approach totally eliminates the need for this table.
It might help for duplicating bitwise pixels in graphics modes too, though the data still needs to be read in, operated on one item at a time, and written back to the line buffer. That is going to burn extra instructions, and we need to do several different combinations for doubling bits/twits/nibbles/bytes/words/longs etc. Can't really be helped I guess, unless the HDMI output hardware could replicate things for us when the streamer is clocked at half the normal rate, but I'm not sure that is possible.
For example, is there a fast way to replicate pairs of bits, converting

abcdefghijklmnopqrstuvwxyzABCDEF (32 bits)
into two other longs:
ababcdcdefefghghijijklklmnmnopop
and
qrqrststuvuvwxwxyzyzABABCDCDEFEF
and also similarly for replicating nibbles?
ABCDEFGH (each letter a nibble of the long)
into
AABBCCDD
and
EEFFGGHH?
Is the latter just a getnib, rolnib, rolnib, then getnib, rolnib, rolnib type of sequence, repeating for all nibbles of the long?
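For the nibble case, a sketch of that sort of sequence (ROLNIB shifts the destination left by 4 bits and inserts the selected source nibble at the bottom, so no GETNIB is needed; src/out1/out2 are placeholder names):

                rolnib  out1, src, #7           'A
                rolnib  out1, src, #7           'AA
                rolnib  out1, src, #6           'AAB
                rolnib  out1, src, #6           'AABB
                rolnib  out1, src, #5
                rolnib  out1, src, #5
                rolnib  out1, src, #4
                rolnib  out1, src, #4           'out1 = AABBCCDD
                rolnib  out2, src, #3
                rolnib  out2, src, #3
                rolnib  out2, src, #2
                rolnib  out2, src, #2
                rolnib  out2, src, #1
                rolnib  out2, src, #1
                rolnib  out2, src, #0
                rolnib  out2, src, #0           'out2 = EEFFGGHH

That works out to 16 instructions (32 clocks) per source long, which is simple but not especially quick.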
I presume you've already looked at my VGA font driver (http://forums.parallax.com/discussion/169803/vga-640x480-800x600-1024x768-full-color-ansi-text-driver). Of course for HDMI you'll have to take quite a different approach, but the font conversion tools might be useful for you.
Not sure how much font memory I can fit into COG RAM, as it is rapidly filling up. These fonts are on the order of 4kB, but each scanline row is only 256 bytes, so if there were room to hold it you could burst it all in per scanline in anticipation of it being needed. It would only take ~64 clocks to get it into the COG (~6 pixels of the 800-pixel budget). In my case there may ultimately not be enough room to hold it all, but it is something to consider if it is doable and becomes necessary from a performance point of view. I can see that reading in executable code for different video modes (think overlays) is going to help keep memory use down, though I do also like the COG having everything inside so that once started it runs self-contained, so to speak; that helps prevent video driver lockups caused by some errant COG code trashing HUB memory.
One of the problems with the egg-beater model is that random hub accesses have to be considered to always take the longest time of 16 clocks, unlike the P1 where once you locked into the hub pattern, you could get another fast access in the successive hub cycle independent of the particular hub address accessed. Doesn't work like that on the P2.
I have a testing program lying around if you want to try getting a handle on this.
The two character formats I chose were 64 bits/character and 32 bits/character, organized as:

If you are able to use a similar format then you can lift all of the ANSI escape sequence routines directly from my code, which would save you some work. The character processing is all done in the "host" COG in a high level language (I have a Spin driver, also translated into C). A driver COG reads the character text buffer and outputs the VGA signal.
I was able to fit the code for both character variants, plus 256 longs of color palette and a 64 long buffer for font data, into COG memory but it's a tight squeeze. I think for the next version I'll start moving code into LUT, or perhaps do an overlay for the inner loop.
OTOH the various methods of burst read/write on P2 are enormously faster than P1, and we have an additional 2K of internal memory in the form of the LUT, so the trick to good performance is caching HUB data wherever possible.
Sounds pretty fully featured, Eric. Not sure I like a 64-bit size, that's huge, but I can see that it would offer all the options. I think having extra attributes is nice, particularly in mono modes, which is something else to consider down the track. As you say, underline and blink etc. can often leverage some common logic.
Algorithm goes something like this...4020 input longs get converted to 8040 output longs, these must be read/written back to hub, but that's only another 12060 clocks or so.
Nibble mode takes fewer instructions but more clocks, ~1560; still okay.
You could also use a lookup table to convert BF to FFFBBFBB.
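A sketch of that lookup-table idea (the table size and LUT placement are assumptions: a 256-long table at LUT address 0, indexed by the colour byte, with each entry pre-built as the FFFBBFBB long for that B/F pair, and flashing handled by clearing bit 15 first or when the table is built):

                getbyte cindex, first, #1       'cindex = the colour byte (BF)
                rdlut   colours, cindex         'colours = pre-built FFFBBFBB long for this BF
                movbyts colours, pixels         'font bits pick the FF/FB/BF/BB byte per pixel pair

That is only three instructions per character in the inner loop, but it costs 256 LUT longs, which have to coexist with the scanline buffers (and any palette) already living in LUT.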
If the pixel data were quadrupled to long size, it could then be combined with AJL's code; some of the pixels need to be inverted for this.
I would guess the TMDS encoder just captures the streamer output every 10 clocks. If that is the case, then just run the streamer at sysclock/20 for pixel doubling.
It's easily do-able. As you say, the whole font scanline is only ~64 cycles to read in, and it saves a random read from HUB which is potentially 16 cycles per character; the replacement lookup using altgb is only 4 cycles per character, so with an 80 character wide screen that's a total savings of 80*(16-4) - 64 = 896 cycles per scan line, which is pretty significant.
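A sketch of how that could look (fontline, char and the other names are placeholders):

                'once per scanline: burst the 256-byte font row for this scanline into cog RAM
                setq    #64-1                   '64 longs = 256 font bytes
                rdlong  fontline, fontaddr      'fontaddr = fontbase + scanline*256

                'per character, replacing the random rdbyte from hub:
                getbyte char, first, #0         'character code 0..255
                altgb   char, #fontline         'point the next GETBYTE at byte 'char' of fontline
                getbyte pixels                  'pixels = font byte (altgb + getbyte = 4 cycles)

fontline        res     64                      'cog RAM buffer for one font row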
Remember that MUXQ uses the Q value from the last SETQ, and that value is retained.
For a time-critical loop, would it be better to use the following?