All PASM2 gurus - help optimizing a text driver over DVI?

rogloh · 2019-10-09 02:17

So I've been adding a text driver for VGA over DVI/HDMI and I have it all working and am interested to see if it can be optimized further. I'll try to post it soon. It's really great what a single P2 COG can do now...

The text driver portion is fairly simple: it does 80 characters of 16 colour text per scanline, similar to what VGA would offer, including a flashing attribute option on bit 7 of the colour byte. Counting up the instructions in my code snippet below, I think it effectively takes 25 clocks, plus the hub read of the font data, to compute the coloured pixel data for each character (I do two chars in each loop). A random hub read can take up to 16 clocks more, bringing the total up to 41 clocks * 80 characters = 3280 clock cycles in the critical inner loop in the worst case. It currently fits, but this encoding stage is already taking over half of the visible scan line (640 pixels = 6400 clocks). I would rather like to tighten it up to allow other things to be done on the same scan line: transferring the data to/from HUB (the streamer bandwidth will interrupt and slow down these hub transfers), plus some cursors and other fancy features later.

Ideally if it could all be done in less than 3200 clocks with the transfer included that would be awesome, as I know I could make good use of the remaining time per scan line for other very useful things... :-) I know I can already get it down below this number if I jettison the flashing text capability (saves two instructions per character = 320 clocks), but it would be sort of nice to keep that around too, though it's hardly essential.
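The budget arithmetic above can be reproduced with a minimal Python sketch. It only uses figures quoted in this thread (10 clocks per pixel from a 252 MHz clock and a 25.2 MHz pixel rate, 25 compute clocks and a 16 clock worst-case hub read per character); the constant and function names are made up for illustration.

```python
# Back-of-envelope scan line budget, using only figures quoted in the thread.
CLOCKS_PER_PIXEL = 10      # 252 MHz system clock / 25.2 MHz pixel clock
VISIBLE_PIXELS = 640
CHARS_PER_LINE = 80
COMPUTE_CLOCKS = 25        # per character, excluding the font RDBYTE
WORST_HUB_READ = 16        # extra clocks for a worst-case random hub read
FLASH_INSTRUCTIONS = 2     # per character, assumed 2 clocks each


def visible_line_clocks():
    """Clocks available during the visible part of one scan line."""
    return VISIBLE_PIXELS * CLOCKS_PER_PIXEL


def inner_loop_worst_case():
    """Worst-case clocks spent rendering all characters of one line."""
    return (COMPUTE_CLOCKS + WORST_HUB_READ) * CHARS_PER_LINE


def flash_saving():
    """Clocks saved per line if the flashing test is dropped."""
    return FLASH_INSTRUCTIONS * 2 * CHARS_PER_LINE


if __name__ == "__main__":
    print(visible_line_clocks(), inner_loop_worst_case(), flash_saving())
```

With these numbers the inner loop uses 3280 of the 6400 visible clocks, and dropping the flashing test would bring it to 2960.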

So does anyone know a clever and faster method to convert the 8 bit font data representing the selected 8 pixel colours and a pair of nibbles representing the foreground and background colours into a packed 32 bit long for the character and which always takes less than the 7 instructions that I have used below, for all source input data?

I was hoping to use the newer instructions like MUXNIBS/SPLITB/MERGEB/ROLNIB etc to do this but couldn't yet find a faster way than what I have below, basically by leveraging MOVBYTS which is a great new instruction by the way. MIXPIX may be of use too perhaps (although it is slow) or even a different data arrangement? One thing is that the current 16 bit word with the traditional VGA data format (BG | FG | character) can really help you with the font address translation and if you rearrange it you'll likely break that existing optimization, however if overall some further gains are clearly there it could be possible to reorder the source data order too.

Roger.

'generate the next text scan line
gen_text    
            mov     fontaddr, scanline      'build font table base address 
            shl     fontaddr, #8            'for this font and scanline
            add     fontaddr, fontbase      'font tables aligned on 256 bytes

            setq2   #40-1                   'get next 80 chars with colours
            rdlong  $100, ptra++            'read 40 longs from HUB into LUT

            push    ptra                    'preserve original pointers
            push    ptrb
            
            mov     ptra, #$100             'prepare LUT read/write pointers
            mov     ptrb, #$100+40

            testb   modedata, #0 wz         'flashing or high intensity bg?
    if_nz   mov     test1, instr1           'include flashing bit test...
    if_nz   mov     test2, instr2           '...when flashing mode is enabled
    if_z    mov     test1, #0               'otherwise remove test instructions
    if_z    mov     test2, #0 wc            'also need to clear C in this case

            rep     @.endloop, #40          'repeat loop 40 times for 80 chars
            rdlut   first, ptra++           'read a pair of colours and chars
            getword second, first, #1       '0000BFcc  cc=char, B=bg,F=fg nbls

            setbyte fontaddr, first, #0     'determine font lookup address
test1       bitl    first, #15 wcz          'test (and clear) flashing bit
            rdbyte  pixels, fontaddr        'get font information for character
    if_c    and     pixels, flash           'make it all background if flashing
            movbyts first, #%01100101       'becomes BFxxBFBF
            setnib  first, first, #7        'becomes FFxxBFBF
            setnib  first, first, #5        'becomes FFFxBFBF
            getnib  bgcolour, first, #3     'get background colour
            setnib  first, bgcolour, #4     'becomes FFFBBFBF
            setnib  first, bgcolour, #0     'becomes FFFBBFBB
            movbyts first, pixels           'select pixel colours
            wrlut   first, ptrb++           'save to LUT memory

            setbyte fontaddr, second, #0    'determine font lookup address
test2       bitl    second, #15 wcz         'test (and clear) flashing bit
            rdbyte  pixels, fontaddr        'get font information for character
    if_c    and     pixels, flash           'make it the background if flashing
            movbyts second, #%01100101      'becomes BFxxBFBF
            setnib  second, second, #7      'becomes FFxxBFBF
            setnib  second, second, #5      'becomes FFFxBFBF
            getnib  bgcolour, second, #3    'get background colour
            setnib  second, bgcolour, #4    'becomes FFFBBFBF
            setnib  second, bgcolour, #0    'becomes FFFBBFBB
            movbyts second, pixels          'select pixel colours
            wrlut   second, ptrb++          'save to LUT memory
.endloop
            pop     ptrb
            pop     ptra

            setq2   #80-1                   'write 640 nibble pixels to HUB RAM
            wrlong  $100+40, ptrb++

            ret

instr1      bitl    first, #15 wcz          'test (and clear) flashing bit
instr2      bitl    second, #15 wcz         'test (and clear) flashing bit
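To make the nibble comments in the listing easier to follow, here is a minimal Python model of the per-character colour expansion. It is a sketch, not the driver: `movbyts`, `getnib` and `setnib` are hand-written models of the PASM2 instructions, the flashing bit is assumed to be already cleared (or unused), and `render_reference` is a hypothetical straightforward per-pixel version used only for comparison.

```python
MASK32 = 0xFFFFFFFF


def movbyts(d, s):
    """Model of MOVBYTS: output byte i is the input byte selected by S[2i+1:2i]."""
    src = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= src[(s >> (2 * i)) & 3] << (8 * i)
    return out


def getnib(s, n):
    """Model of GETNIB: return nibble n of s."""
    return (s >> (4 * n)) & 0xF


def setnib(d, s, n):
    """Model of SETNIB: replace nibble n of d with the low nibble of s."""
    return (d & ~(0xF << (4 * n)) & MASK32) | ((s & 0xF) << (4 * n))


def render_listing(word, pixels):
    """Follow the listing: word is BFcc (bg, fg, char), pixels is the font byte."""
    first = word & 0xFFFF
    first = movbyts(first, 0b01100101)   # BFxxBFBF
    first = setnib(first, first, 7)      # FFxxBFBF
    first = setnib(first, first, 5)      # FFFxBFBF
    bg = getnib(first, 3)                # background colour
    first = setnib(first, bg, 4)         # FFFBBFBF
    first = setnib(first, bg, 0)         # FFFBBFBB
    return movbyts(first, pixels & 0xFF) # each pixel pair picks one byte


def render_reference(word, pixels):
    """Per-pixel reference: output nibble i is fg if pixel bit i is set, else bg."""
    bg, fg = (word >> 12) & 0xF, (word >> 8) & 0xF
    out = 0
    for i in range(8):
        out |= (fg if (pixels >> i) & 1 else bg) << (4 * i)
    return out
```

For example, `render_listing(0x1F41, 0x18)` (background 1, foreground F, two centre pixels set) gives `0x111FF111`. The trick is that the long FF_FB_BF_BB is a four-entry table indexed by each pair of font bits, so the final MOVBYTS colours two pixels per byte select.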

rogloh · 2019-10-09 02:37

Thinking about data order I just realised if we pre-encode some character colours in 32 bit longs instead of words in this format below, where BG and FG are the colour index nibbles, then we can shave off three instructions per character. Downsides are more overhead to write coloured characters into HUB, more memory use and longer HUB to HUB text copies and more time to read data into the DVI COG but it would save a LOT of time in the inner loop. Don't think there'll be a faster way than this but I'd like to be proved wrong. Ideally it would be nice to stick to a 16 bit format.

32 bit character format is

FG | FG | FG | BG | BG | FG | char_code8

Just need this last bit to translate this format into the final 8 nibbles.

            getnib  bgcolour, first, #3     'get background colour
            setnib  first, bgcolour, #1     'becomes FF_FB_BF_Bc
            setnib  first, bgcolour, #0     'becomes FF_FB_BF_BB
            movbyts first, pixels           'select pixel colours

Update:

Or even better, this...saving another instruction!

            movbyts first, #%11100110      'becomes FF_FB_BF_FB
            setnib  first, first, #1       'becomes FF_FB_BF_BB
            movbyts first, pixels          'select pixel colours
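A quick Python check of the three-instruction version, assuming the 32 bit layout given above (nibbles F F F B B F followed by the 8 bit character code). `pack32` is a hypothetical helper for building such a cell, and `movbyts`/`setnib` are hand-written models of the instructions.

```python
MASK32 = 0xFFFFFFFF


def movbyts(d, s):
    """Model of MOVBYTS: output byte i is the input byte selected by S[2i+1:2i]."""
    src = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= src[(s >> (2 * i)) & 3] << (8 * i)
    return out


def setnib(d, s, n):
    """Model of SETNIB: replace nibble n of d with the low nibble of s."""
    return (d & ~(0xF << (4 * n)) & MASK32) | ((s & 0xF) << (4 * n))


def pack32(fg, bg, ch):
    """Hypothetical helper: build FG|FG|FG|BG|BG|FG|char_code8."""
    return ((fg << 28) | (fg << 24) | (fg << 20) |
            (bg << 16) | (bg << 12) | (fg << 8) | (ch & 0xFF))


def render_long(cell, pixels):
    """Three-instruction expansion from the update above."""
    first = movbyts(cell, 0b11100110)     # FF_FB_BF_FB
    first = setnib(first, first, 1)       # FF_FB_BF_BB
    return movbyts(first, pixels & 0xFF)  # select pixel colours
```

`render_long(pack32(0xF, 0x1, 0x41), 0x18)` gives `0x111FF111`, the same result as the 16 bit path for the same colours and font byte.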

evanh · 2019-10-09 03:05

If the FIFO is free then you can ditch the SETQ2/RDLONG and RDLUT/GETWORD combo and replace them with a RDFAST and RFWORD instead. This way you can go back to a single staged #80 loop handling a character at a time. RDFAST can be told to not stall and therefore always takes 2 clocks only.

evanh · 2019-10-09 03:16

Alternatively, I'd try to find space to burst read the 40 longwords into cogRAM rather than lutRAM. Eliminate the RDLUT and index the cogRAM more directly.

rogloh · 2019-10-09 03:22

That's handy to know evanh. There are times on the scanline where the FIFO won't be active and at times it can be more convenient to read data in the appropriate width instead of always as longs with setq(2). In fact that caught me out initially, as I wrongly assumed you could burst rdword's into LUT from HUB this way, but you can't, setq(2) always transfers longs it appears.

In my driver case while the colours are being processed the FIFO will very much be active, so I probably won't use that approach during that portion, but it could be used before/after the active pixel streaming portion starts running.

One bonus, once you use 80 longs per character row instead of 80 words, is that you save the unrolled loop instruction space for processing coloured pixels and can do the translation inline in LUT RAM saving 40 LUT longs and also sharing a common LUT pointer which in my case means I only need to preserve one pointer instead of two.

I still would quite like a conventional 16 bit screen data format though...we'll have to see how it pans out.

rogloh · 2019-10-09 03:31

evanh wrote: »

Alternatively, I'd try to find space to burst read the 40 longwords into cogRAM rather than lutRAM. Eliminate the RDLUT and index the cogRAM more directly.

The benefit I see with RDLUT apart from the additional space it can offer is that you can get the read and pointer auto increment operation done in 3 clocks instead of an ALTxx + MOV equivalent which takes 4 clocks. However eventually I may need to keep the LUT free for other data tables.

evanh · 2019-10-09 03:36

rogloh wrote: »

One bonus, once you use 80 longs per character row instead of 80 words, ...

That sounds like a single text line buffer would be enough for streamer displaying. So converting to a small longword buffer, line by line, on the fly could be an option?

rogloh · 2019-10-09 03:45

That is sort of what I am doing though I do have more than one line buffer to get the benefit of the FIFO autowrap in all pixel modes until I figure out how to control it more dynamically. I read in 80 characters (right now as 40 longs), translate and write back the 80 "rendered" longs to HUB for the streamer to output over HDMI on the next scanline. I would imagine 80 longs per scanline is not a lot of space to burn on the P2. The other benefit of this approach is it is universal and allows other COGs (such as sprite COGs) to generate the data that will be output by the DVI cog.

evanh · 2019-10-09 03:49

Oh.

evanh · 2019-10-09 04:17

Hmm, your code would pack into an instruction or two as hardware. Assignments like that are free in hardware. All the gets and sets and moves with direct register addressing disappear entirely. Only the indexed ones stay.

IFs become a mux. Equality compares and logic operators are cheap. Full compares and numerical operators not so much. What little you are doing is all logic and even that is just debug code I think.

rogloh · 2019-10-09 04:22

LOL, yeah this is all doable in Verilog. I once put a minimal VGA type text driver into Verilog on the P1V and it didn't take too many resources. Was able to fit it into an Altera MAX 10M08 with 3 COGs IIRC. It could read directly from dual ported HUB RAM I think, plus did some bitmap graphics modes from external SRAM.

rogloh · 2019-10-09 06:50

One thing I am finding tricky on the P2 is pixel doubling in DVI modes. It is easy to line double, but generating duplicate pixels is a LOT of work, especially in graphics modes where the data is already packed and you don't want to duplicate it one bit/twit/nibble/byte/word/long at a time. So I'm not sure if I'll be able to do a 320 width graphics mode. Text is not so hard to double; I was able to try out a 40 column mode.

If the streamer could be clocked at half the pixel rate, would it be able to replicate the last pixel value until the next pixel is read when the SETCMOD is setup for DVI/HDMI mode? That might generate two identical 10b patterns together but I'm wondering about the HDMI output clock as we still want it at 25.2MHz and generating the 10b codes at 252MHz, not something half this.

Tricky.

AJL · 2019-10-09 08:17

rogloh wrote: »
Thinking about data order I just realised if we pre-encode some character colours in 32 bit longs instead of words in this format below, where BG and FG are the colour index nibbles, then we can shave off three instructions per character. Downsides are more overhead to write coloured characters into HUB, more memory use and longer HUB to HUB text copies and more time to read data into the DVI COG but it would save a LOT of time in the inner loop. Don't think there'll be a faster way than this but I'd like to be proved wrong. Ideally it would be nice to stick to a 16 bit format.

32 bit character format is
FG | FG | FG | BG | BG | FG | char_code8
Just need this last bit to translate this format into the final 8 nibbles.
            getnib  bgcolour, first, #3     'get background colour
            setnib  first, bgcolour, #1     'becomes FF_FB_BF_Bc
            setnib  first, bgcolour, #0     'becomes FF_FB_BF_BB
            movbyts first, pixels           'select pixel colours
Update:

Or even better, this...saving another instruction!
            movbyts first, #%11100110      'becomes FF_FB_BF_FB
            setnib  first, first, #1       'becomes FF_FB_BF_BB
            movbyts first, pixels          'select pixel colours

Keeping the packed format you can bring it down by a few instructions:

            movbyts first, #%01010101      'becomes BF_BF_BF_BF      2 cycles
            mov temp, first                '     2 cycles
            rol temp, #4                   'temp becomes FB_FB_FB_FB     2 cycles
            setq ##$F0FF000F               '    2 cycles (+2 for the ## prefix unless the mask is kept in a register)
            muxq first, temp               'first becomes FF_FB_BF_BB      2 cycles

Rinse and repeat for second.
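The MUXQ sequence can be checked with a small Python model. This is only a sketch: `movbyts`, `rol` and `muxq` are hand-written models of the instructions, with MUXQ taken to copy the bits of S selected by Q into D.

```python
MASK32 = 0xFFFFFFFF


def movbyts(d, s):
    """Model of MOVBYTS: output byte i is the input byte selected by S[2i+1:2i]."""
    src = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= src[(s >> (2 * i)) & 3] << (8 * i)
    return out


def rol(d, n):
    """Model of ROL: rotate a 32 bit value left by n bits."""
    return ((d << n) | (d >> (32 - n))) & MASK32


def muxq(d, s, q):
    """Model of MUXQ: bits where q is 1 come from s, the rest stay from d."""
    return (d & ~q & MASK32) | (s & q)


def colour_table(word):
    """Build FF_FB_BF_BB from a 16 bit BFcc cell using the MUXQ approach."""
    first = movbyts(word & 0xFFFF, 0b01010101)  # BF_BF_BF_BF
    temp = rol(first, 4)                        # FB_FB_FB_FB
    return muxq(first, temp, 0xF0FF000F)        # FF_FB_BF_BB


def render(word, pixels):
    """Full expansion: build the table, then select bytes with the font byte."""
    return movbyts(colour_table(word), pixels & 0xFF)
```

For the cell `0x1F41` the table comes out as `0xFFF11F11`, and `render(0x1F41, 0x18)` gives `0x111FF111`, so the 16 bit screen format is preserved.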

evanh · 2019-10-09 08:22

The DVI/HDMI encoding hardware I've not followed, so can't help there sorry. I would imagine the handcrafted TMDS for the revA chip can probably be customised for pixel doubling.

ozpropdev · 2019-10-09 09:06

Roger
Does something like this help?

		mov	pb,#$c3	'8 pixels
pixels_x2	setword	pb,pb,#1
		mergew	pb	'pb = $0000f00f

		mov	pb,#$c3	'8 pixels
pixels_x4	movbyts	pb,#0
		mergeb	pb	'pb = $ff0000ff
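Both snippets can be modelled in Python to see why they double and quadruple each font bit. The `mergew` and `mergeb` functions below are assumed models of the bit interleave (they reproduce the `$0000f00f` and `$ff0000ff` results quoted in the comments above); `double8` and `quadruple8` are hypothetical wrapper names.

```python
def mergew(d):
    """Assumed model of MERGEW: interleave the bits of the two words of d."""
    out = 0
    for i in range(16):
        out |= ((d >> i) & 1) << (2 * i)             # low word  -> even bits
        out |= ((d >> (16 + i)) & 1) << (2 * i + 1)  # high word -> odd bits
    return out


def mergeb(d):
    """Assumed model of MERGEB: interleave the bits of the four bytes of d."""
    out = 0
    for i in range(8):
        for k in range(4):
            out |= ((d >> (8 * k + i)) & 1) << (4 * i + k)
    return out


def double8(pixels):
    """SETWORD pb,pb,#1 then MERGEW: each of 8 pixel bits becomes 2 bits."""
    w = pixels & 0xFF
    return mergew(w | (w << 16))


def quadruple8(pixels):
    """MOVBYTS pb,#0 then MERGEB: each of 8 pixel bits becomes 4 bits."""
    b = pixels & 0xFF
    return mergeb(b * 0x01010101)
```

Copying the same value into every word (or byte) before the merge means each interleaved group is a repeat of one source bit, which is exactly pixel replication.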

rogloh · 2019-10-09 10:18

AJL wrote: »

Keeping the packed format you can bring it down by a few instructions:

            movbyts first, #%01010101      'becomes BF_BF_BF_BF      2 cycles
            mov temp, first                '     2 cycles
            rol temp, #4                   'temp becomes FB_FB_FB_FB     2 cycles
            setq ##$F0FF000F               '    2 cycles (+2 for the ## prefix unless the mask is kept in a register)
            muxq first, temp               'first becomes FF_FB_BF_BB      2 cycles

Rinse and repeat for second.

Nice one AJL, I like it! Saves two instructions on my initial approach, and still allows 16 bit screen format. I'll have to remember to use this muxq thing more often.

ozpropdev wrote: »

Roger
Does something like this help?

		mov	pb,#$c3	'8 pixels
pixels_x2	setword	pb,pb,#1
		mergew	pb	'pb = $0000f00f

		mov	pb,#$c3	'8 pixels
pixels_x4	movbyts	pb,#0
		mergeb	pb	'pb = $ff0000ff

ozpropdev I think that top example would certainly help out for doing the 40 column text mode stuff, yes! I currently use two instructions to transform it as well but with a 16 entry table in COGRAM, while your approach totally eliminates the need for this table.

It might help for duplicating bitwise pixels in graphics modes too, though that still needs to be read in, and operated on one item at a time and written back to the line buffer. That is going to burn extra instructions and we need to do several different combinations for doubling bits/twits/nibbles/bytes/words/longs etc. Can't really be helped I guess, unless the HDMI output hardware could replicate things for us when the streamer is clocked at half the normal rate, but I'm not sure that is possible.

rogloh · 2019-10-09 10:37

So it looks like replicating single bits is simple using ozpropdev 8 bit at a time approach, but what's the fastest reasonably simple way to convert a long with say these two bit pixels (each letter is a bit):

abcdefghijklmnopqrstuvwxyzABCDEF (32 bits)

into two other longs:

ababcdcdefefghghijijklklmnmnopop
and
qrqrststuvuvwxwxyzyzABABCDCDEFEF

and also similarly for replicating nibbles?
ABCDEFGH (each letter a nibble of the long)
into
AABBCCDD
and
EEFFGGHH?

Is the latter just a getnib, rolnib, rolnib,
then getnib rolnib rolnib type of sequence repeating for all nibbles of the long?

ersmith · 2019-10-09 10:40

Random HUB reads are really slow, as you've noticed, so it's really worth-while to keep the font data in COG memory where you can access it via altgb + getbyte (4 cycles instead of up to 16). It's ironic, but the LUT is actually better for code than for holding look up tables (the COG memory is better for look up tables).

I presume you've already looked at my VGA font driver (http://forums.parallax.com/discussion/169803/vga-640x480-800x600-1024x768-full-color-ansi-text-driver). Of course for HDMI you'll have to take quite a different approach, but the font conversion tools might be useful for you.

rogloh · 2019-10-09 10:54

No I haven't seen it ersmith, but I should really try to take a look when I can. I'm mainly now interested in pixel doubling as I think there are probably some reasonable solutions for the other things. A 320 wide pixel mode can still be useful for conserving HUB memory, instead of always using 640 pixels.

Not sure how much font memory I can fit into COG RAM, as it is rapidly filling up and these fonts are in the order of 4kB but each scanline row is only 256bytes so if there was room to hold it, I guess you could burst it all in per scanline in anticipation of it being needed. It would only take 64 clocks to get it into the COG (~6 pixels of the 800 pixel budget). In my case there may ultimately not be enough room for holding it all, but it is something to be considered if it is doable and becomes required from a performance point of view. I can see that reading in executable code for different video modes (think overlays) is going to help keep memory use down, though I do also like the COG having everything inside and once started it runs self contained so to speak and this helps prevent video driver lockups due to any other errant COG code that could be trashing HUB memory.

One of the problems with the egg-beater model is that random hub accesses have to be considered to always take the longest time of 16 clocks, unlike the P1 where once you locked into the hub pattern, you could get another fast access in the successive hub cycle independent of the particular hub address accessed. Doesn't work like that on the P2.

evanh · 2019-10-09 11:35

rogloh wrote: »

Doesn't work like that on the P2.

I think it's best to keep using those burst modes where possible ... but I'll jump in here to make the comment that hub accesses can be instruction counted like they were on the prop1. However the counting has to account for hub addresses too. If you are always repeating the same access order then the instruction complexity isn't any more. Just an extra component to the mental effort to get it sorted. A bonus in all this is the prop2 hubRAM writes can be a lot faster than what the prop1 ever could hope for. You can quite literally write every three clocks with the right conditions.

I have a testing program lying around if you want to try getting a handle on this.

ersmith · 2019-10-09 11:37

The other thing I found is that blinking, underline, and strikethrough are all essentially the same code in the inner loop, all testing a flag set up at the start of the scanline, something like the following pseudo-code:

if blinking and appropriate frame
   set_all_ones = 1
else if underline and scanline == 7
   set_all_ones = 1
else if strikethrough and scanline == 4
   set_all_ones = 1
else
  set_all_ones = 0

then in the inner loop if set_all_ones is true use $FF instead of the actual font data.

The two character formats I chose were 64 bits/character and 32 bits/character, organized as:

  64 bits: FFFFFF is foreground color, BBBBBB is background color, cc is 8 bit character index, xx are special effect flags (underline, blinking, etc.)
  FF FF FF cc BB BB BB xx
  32 bits: FF is 8 bit foreground color index, BB is background color index
  FF cc BB xx

If you are able to use a similar format then you can lift all of the ANSI escape sequence routines directly from my code, which would save you some work. The character processing is all done in the "host" COG in a high level language (I have a Spin driver, also translated into C). A driver COG reads the character text buffer and outputs the VGA signal.

I was able to fit the code for both character variants, plus 256 longs of color palette and a 64 long buffer for font data, into COG memory but it's a tight squeeze. I think for the next version I'll start moving code into LUT, or perhaps do an overlay for the inner loop.

ersmith · 2019-10-09 11:47

evanh wrote: »

rogloh wrote: »

Doesn't work like that on the P2.

I think it's best to keep using those burst modes where possible ... but I'll jump in here to make the comment that hub accesses can be instruction counted like they were on the prop1. However the counting has to account for hub addresses too.

Right, but if the accesses are random (e.g. looking up font data based on the current character) then there's no way to predict the hub addresses, so you pretty much have to assume the worst case. On P1 the timing didn't depend on hub address, only COG number.

OTOH the various methods of burst read/write on P2 are enormously faster than P1, and we have an additional 2K of internal memory in the form of the LUT, so the trick to good performance is caching HUB data wherever possible.

rogloh · 2019-10-09 11:47

evanh wrote: »

...A bonus in all this is the prop2 hubRAM writes can be a lot faster than what the prop1 ever could hope for. You can quite literally write every three clocks with the right conditions.

I know what you mean, there are certain read or write patterns that are sweet, but for random font table access directly to HUB this benefit is not realisable. The normal setq burst read/writes are great. So much memory bandwidth!!

rogloh · 2019-10-09 11:55

ersmith wrote: »

The other thing I found is that blinking, underline, and strikethrough are all essentially the same code in the inner loop, all testing a flag set up at the start of the scanline...

If you are able to use a similar format then you can lift all of the ANSI escape sequence routines directly from my code, which would save you some work. The character processing is all done in the "host" COG in a high level language (I have a Spin driver, also translated into C). A driver COG reads the character text buffer and outputs the VGA signal.

I was able to fit the code for both character variants, plus 256 longs of color palette and a 64 long buffer for font data, into COG memory but it's a tight squeeze. I think for the next version I'll start moving code into LUT, or perhaps do an overlay for the inner loop.

Sounds pretty fully featured Eric. Not sure I like a 64 bit size, that's huge but I can see that it would offer all the options. I think having extra attributes is nice, particularly in mono modes which is something else to consider down the track. As you say the underline and blink etc can often leverage some common logical code.

evanh · 2019-10-09 11:56

Just need the streamer able to write to lutRAM. That would be cool.

rogloh · 2019-10-09 12:09

I think I found a method to duplicate bit-pairs (4 colour) pixel modes. The inner loop is pretty slow though ~1140 cycles for doubling 320 pixels to 640 pixels, but it fits the budget.

Algorithm goes something like this...20 input longs get converted to 40 output longs, these must be read/written back to hub, but that's only another 60 clocks or so.

rep @end, #20 times
rdlut x, ptra++
mov y, x
mov a, x
mov b, x
ror a, #2
rol b, #2
setq ##$33333333 (might be backwards, didn't check)
muxq x,a
setq ##$CCCCCCCC
muxq y,b
rolnib a,x,#7
rolnib a,y,#7
rolnib a,x,#6
rolnib a,y,#6
rolnib a,x,#5
rolnib a,y,#5
rolnib a,x,#4
rolnib a,y,#4
wrlut a, ptrb++
rolnib a,x,#3
rolnib a,y,#3
rolnib a,x,#2
rolnib a,y,#2
rolnib a,x,#1
rolnib a,y,#1
rolnib a,x,#0
rolnib a,y,#0
wrlut a, ptrb++
end
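A Python model of one loop iteration helps confirm the masks. It is a sketch under the assumption that MUXQ copies the Q-selected bits of S into D and that ROLNIB shifts D left by a nibble and appends the chosen nibble of S; `reference` is a hypothetical plain pixel-doubling routine used only for comparison.

```python
MASK32 = 0xFFFFFFFF


def rol(d, n):
    return ((d << n) | (d >> (32 - n))) & MASK32


def ror(d, n):
    return ((d >> n) | (d << (32 - n))) & MASK32


def muxq(d, s, q):
    """Bits where q is 1 come from s, the rest stay from d."""
    return (d & ~q & MASK32) | (s & q)


def rolnib(d, s, n):
    """Shift d left one nibble and append nibble n of s."""
    return ((d << 4) & MASK32) | ((s >> (4 * n)) & 0xF)


def double_2bpp(x_in):
    """One loop iteration: 16 two-bit pixels in, two longs of doubled pixels out."""
    a = ror(x_in, 2)
    b = rol(x_in, 2)
    x = muxq(x_in, a, 0x33333333)  # each nibble = upper pixel twice
    y = muxq(x_in, b, 0xCCCCCCCC)  # each nibble = lower pixel twice
    acc, outs = 0, []              # the PASM reuses 'a'; its old bits shift out
    for n in range(7, -1, -1):
        acc = rolnib(acc, x, n)
        acc = rolnib(acc, y, n)
        if n in (4, 0):            # the two WRLUT points
            outs.append(acc)
    return tuple(outs)


def reference(x_in):
    """Plain per-pixel doubling for comparison."""
    wide = 0
    for i in range(16):
        p = (x_in >> (2 * i)) & 3
        wide |= (p | (p << 2)) << (4 * i)
    return (wide >> 32) & MASK32, wide & MASK32
```

In this model the masks work as written ($33333333 with the ROR result, $CCCCCCCC with the ROL result), and the first long written holds the pixels from the upper half of the input.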

Nibble mode is less instructions but more clocks ~1560, still okay.

rep @end, #40
rdlut x, ptra++
rolnib a,x,#7
rolnib a,x,#7
rolnib a,x,#6
rolnib a,x,#6
rolnib a,x,#5
rolnib a,x,#5
rolnib a,x,#4
rolnib a,x,#4
wrlut a, ptrb++
rolnib a,x,#3
rolnib a,x,#3
rolnib a,x,#2
rolnib a,x,#2
rolnib a,x,#1
rolnib a,x,#1
rolnib a,x,#0
rolnib a,x,#0
wrlut a, ptrb++
end
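The nibble version is simpler to model. In the sketch below `rolnib` is a hand-written model of ROLNIB, and `loop_clocks` is a hypothetical helper that assumes 3 clocks for RDLUT and 2 clocks for every other instruction, which reproduces the ~1560 figure.

```python
MASK32 = 0xFFFFFFFF


def rolnib(d, s, n):
    """Shift d left one nibble and append nibble n of s."""
    return ((d << 4) & MASK32) | ((s >> (4 * n)) & 0xF)


def double_nibbles(x):
    """One loop iteration: 8 nibble pixels in, two longs of doubled pixels out."""
    acc, outs = 0, []
    for n in range(7, -1, -1):
        acc = rolnib(acc, x, n)
        acc = rolnib(acc, x, n)
        if n in (4, 0):            # the two WRLUT points
            outs.append(acc)
    return tuple(outs)


def loop_clocks(iterations, two_clock_instructions, rdlut_clocks=3):
    """Assumed cost model: one RDLUT plus N two-clock instructions per pass."""
    return iterations * (rdlut_clocks + 2 * two_clock_instructions)


if __name__ == "__main__":
    print([hex(v) for v in double_nibbles(0x12345678)])
    print(loop_clocks(40, 18))     # 16 ROLNIB + 2 WRLUT per pass
```

ABCDEFGH becomes AABBCCDD and EEFFGGHH, and 40 passes of 39 clocks each give 1560 clocks.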

evanh · 2019-10-09 12:22

Nice! I remember finding those rolls handy for concatenating something too. Wasn't anywhere near that big though.

SaucySoliton · 2019-10-09 16:59

If the font data was arranged like this, then it would be possible to cache it into cog ram. A single burst read gets the pixel data for the current line for all characters.

A line 1  |  B line 1  |  C line 1  | ...
A line 2  |  B line 2  |  C line 2  | ...

Using cog ram 0-255 for a lookup table saves pointer arithmetic. I don't know if this is always important on P2.

You could also use a lookup table to convert BF to FFFBBFBB.

If the pixel data was quadrupled to long size, then combine with AJL's code: need to invert some of the pixels for this.

            setd  fsq,first                ' set later q value according to character 
            and  fsq,NOT_D8                ' forgot that first[8] is not always zero
            movbyts first, #%01010101      'becomes BF_BF_BF_BF      2 cycles
            mov temp, first                '     2 cycles
            rol temp, #4                   'temp becomes FB_FB_FB_FB     2 cycles
fsq         setq 0                         '    2 cycles get q from font data in 0-255
            muxq first, temp               'first becomes packed pixels

I would guess the TMDS encoder just captures the streamer output every 10 clocks. If that is the case, then just run the streamer at sysclock/20 for pixel doubling.

ersmith · 2019-10-09 17:55

rogloh wrote: »

I'm mainly now interested in pixel doubling as I think there are probably some reasonable solutions for the other things. A 320 wide pixel mode can still be useful for conserving HUB memory, instead of always using 640 pixels.

For graphics this would definitely be a big win. For text, I don't think it's worth sweating over -- an 80x30x2 text buffer such as you're proposing is less than 5K out of the 512K of hub memory; reducing it to 40x30 is not a huge saving. For that matter, you could achieve the same thing by using a 16 pixel wide font.

Not sure how much font memory I can fit into COG RAM, as it is rapidly filling up and these fonts are in the order of 4kB but each scanline row is only 256bytes so if there was room to hold it, I guess you could burst it all in per scanline in anticipation of it being needed.

It's easily do-able. As you say, the whole font scanline is only ~64 cycles to read in, and it saves a random read from HUB which is potentially 16 cycles per character; the replacement lookup using altgb is only 4 cycles per character, so with an 80 character wide screen that's a total savings of 80*(16-4) - 64 = 896 cycles per scan line, which is pretty significant.
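The saving quoted above can be written out as a small Python calculation. The parameter names are made up; the values (80 characters, 16 clock worst-case hub read, 4 clock altgb lookup, 64 clock burst for one 256 byte font row) are the ones given in this thread.

```python
def font_row_longs(row_bytes=256):
    """Longs needed to hold one scanline row of the font (one byte per char)."""
    return row_bytes // 4


def cached_font_saving(chars=80, hub_worst=16, cog_lookup=4, burst_clocks=64):
    """Worst-case clocks saved per scan line by caching the font row in COG RAM."""
    return chars * (hub_worst - cog_lookup) - burst_clocks


if __name__ == "__main__":
    print(font_row_longs(), cached_font_saving())
```

This gives 64 longs for the cached row and a worst-case saving of 896 clocks per scan line. The real saving depends on how often the hub reads actually hit the worst case.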

Cluso99 · 2019-10-09 20:37

This section

setq ##$33333333 (might be backwards, didn't check)
muxq x,a
setq ##$CCCCCCCC
muxq y,b

could become

setq ##$33333333 (might be backwards, didn't check)
muxq x,a
ror b,#2
muxq y,b
rol b, #2

It's the same footprint and clocks but maybe you can do something with this to save a rotate???
Remember the setq/muxq retains the last setq value.

TonyB_ · 2019-10-09 20:50

Cluso99 wrote: »
This section
setq ##$33333333 (might be backwards, didn't check)
muxq x,a
setq ##$CCCCCCCC
muxq y,b
could become
setq ##$33333333 (might be backwards, didn't check)
muxq x,a
ror b,#2
muxq y,b
rol b, #2
It's the same footprint and clocks but maybe you can do something with this to save a rotate???
Remember the setq/muxq retains the last setq value.

For a time-critical loop, better to use the following?

setq reg_33333333 (might be backwards, didn't check)
muxq x,a
setq reg_CCCCCCCC
muxq y,b
