All PASM2 gurus - help optimizing a text driver over DVI?

evanh · 2019-10-11 05:54

Note: Q retains its value. MUXQ can reuse Q as many times as you want.

AJL · 2019-10-11 06:00

rogloh wrote: »
Cool. I think this can be shaved by one instruction with a setq, muxq sequence.
movbyts pa, #%%1100
mov pb, pa
shl pb,#4
setq mask
muxq pa, pb

etc
Five is better than 8 I had before. Will code this up.

Am I missing something? Shouldn't it be movbyts pa, #%01010000 ?
Does the %% change the interpretation of the immediate?

rogloh · 2019-10-11 06:07

Yes it does, %% indicates we are using two-bit digits (twits, nits) from 0-3, representing 00, 01, 10, 11 in binary, so %%1100 is the same value as %01010000.
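
The equivalence AJL asked about can be checked off-chip. The following is a minimal Python sketch, not assembler source; the helper name quaternary is made up for illustration.

```python
def quaternary(digits: str) -> int:
    """Parse the digits of a %% (base-4) literal into an integer."""
    value = 0
    for d in digits:
        if d not in "0123":
            raise ValueError(f"not a base-4 digit: {d!r}")
        value = value * 4 + int(d)   # each base-4 digit contributes two bits
    return value


# %%1100 written out as eight binary digits
print(format(quaternary("1100"), "08b"))   # prints 01010000
```

Each base-4 digit maps directly onto one of the four 2-bit selector fields that MOVBYTS reads, which is why the %% form is easier to read for that instruction.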

evanh wrote: »

Note: Q retains its value. MUXQ can reuse Q as many times as you want.

This is good as I had it in an inner loop, which I can remove it from.

rogloh · 2019-10-11 06:19

Here's the updated code for pixel transfers; I still have to do the two-bit version. I've noticed many similarities in the sequences, which could probably be leveraged using skipf to reduce this down further.

pixelmove   long    normalpixels

normalpixels
            rep     #2, readburst
            rdlut   a, ptra++
            wrlut   a, ptrb++
            ret
            
doublebits  
            rep     #8, readburst
            rdlut   a, ptra++
            mov     b, a
            movbyts a, #%%1010
            movbyts b, #%%3232
            mergew  a
            wrlut   a, ptrb++
            mergew  b
            wrlut   b, ptrb++
            ret

doublebytes
            rep     #6, readburst
            rdlut   a, ptra++
            mov     b, a
            movbyts a, #%%1100
            wrlut   a, ptrb++
            movbyts b, #%%3322
            wrlut   b, ptrb++
            ret

doublewords
            rep     #6, readburst
            rdlut   a, ptra++
            mov     b, a
            movbyts a, #%%1010
            wrlut   a, ptrb++
            movbyts b, #%%3232
            wrlut   b, ptrb++
            ret

doublelongs 
            rep     #3, readburst
            rdlut   a, ptra++
            wrlong  a, ptrb++
            wrlong  a, ptrb++
            ret
 
doublenibbles
            setq    nibblemask
            rep     #12, readburst
            rdlut   b, ptra++
            getword a, b, #1                
            movbyts b, #%%1100
            mov     pb, b
            shl     pb, #4
            muxq    b, pb
            wrlut   b, ptrb++
            movbyts a, #%%1100
            mov     pb, a
            shl     pb, #4
            muxq    a, pb
            wrlut   a, ptrb++
            ret

nibblemask  long    $0ff00ff0
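
The doublenibbles sequence above can be sanity-checked off-chip. This is a rough Python model under the assumption that MOVBYTS selects one source byte per 2-bit field and that MUXQ copies S bits into D wherever Q has a 1 bit; the function names are illustrative only and this is not code from the driver.

```python
MASK32 = 0xFFFFFFFF
NIBBLEMASK = 0x0FF00FF0      # same constant as nibblemask in the listing


def movbyts(d, sel):
    """Model of MOVBYTS: 2-bit field i of sel picks the source byte for byte i."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out


def muxq(d, s, q):
    """Model of MUXQ: where q has a 1 bit take s, elsewhere keep d."""
    return (d & ~q & MASK32) | (s & q)


def double_nibbles_word(word16):
    """movbyts #%%1100 / mov / shl #4 / muxq, applied to one 16-bit word."""
    a = movbyts(word16, 0b01010000)    # %%1100: duplicate each byte
    pb = (a << 4) & MASK32             # shifted copy lines up the other nibbles
    return muxq(a, pb, NIBBLEMASK)


def double_nibbles(long32):
    """Return the two output longs: low word first, then the high word."""
    return double_nibbles_word(long32 & 0xFFFF), double_nibbles_word(long32 >> 16)


lo, hi = double_nibbles(0xABCD1234)
print(f"{lo:08X} {hi:08X}")            # prints 11223344 AABBCCDD
```

A source nibble of 0 also comes out as 00 in this model, because the mask selects bit positions rather than testing nibble values.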

ozpropdev · 2019-10-11 06:38

After another coffee I realized you could do this too.

		movbyts	pa,#%%1100
		mov	pb,pa
		shl	pb,#4		
		and	pb,mask
		muxnibs	pa,pb

Edit: Doh! No, that won't work for nibbles = 0. Scrap that idea.
Time to change brand of coffee I think!

rogloh · 2019-10-11 06:44

I know how you feel, this stuff is killing my brain right now. LOL.

ozpropdev · 2019-10-11 06:54

Keep CHIPping away Roger, you're getting close! Pardon the pun.

ozpropdev · 2019-10-11 07:15

AJL wrote: »

Am I missing something? Shouldn't it be movbyts pa, #%01010000 ?
Does the %% change the interpretation of the immediate?

The %% represents a base 4 number (Quaternary).

AJL · 2019-10-11 08:08

How about this, @rogloh :

  alts modedata, #skmasks  'set up skipmask based on mode
  mov skmask, 0-0  
  skipf skmask  
  alts modedata, #replengths  'replength must be adjusted for skipf
  mov replength, 0-0  
  alts modedata, #amaps  'first transform map if needed
  mov amap, 0-0  
  alts modedata, #bmaps  'second transform map if needed
  mov bmap, 0-0  
  setq nibblemask  
  rep replength, readburst  
  rdlut a, ptra++  
  mov b, a  'copy for later
  getword b, a, #1  'save second half
  movbyts a, amap  'first transform
  mov pb, a  'take a copy
  shl pb, #4  'nibble shift first half
  muxq a, pb  'merge nibble copy
  mergew a  'double bits from first transform
  wrlut a, ptrb++  'write first result
  movbyts b, bmap  'second transform
  mergew b  'double bits from second transform
  mov pb, b  'take a copy
  shl pb, #4  'nibble shift second half
  muxq b, pb  'merge nibble copy
  wrlut b, ptrb++  'write second result
  wrlong a, ptrb++  'write first long
  wrlong a, ptrb++  'write second long
  ret  'done
     
replengths
  long 3
  long 8
  long 6
  long 6
  long 4
  long 12

skmasks
  long %011111111011111110011100
  long %011011100001110100010000
  long %011011110011110100010000
  long %011011110011110100010000
  long %000111111111111110011100
  long %011000010010000010000000

amaps
  long 0
  long %%1010
  long %%1100
  long %%1010
  long 0
  long %%1100

bmaps
  long 0
  long %%3232
  long %%3322
  long %%3232
  long 0
  long %%1100

'modedata
' normalpixels  = 0
' doublebits = 1
' doublebytes = 2
' doublewords = 3
' doublelongs = 4
' doublenibbles = 5

Apologies for the comment alignment, but I've run out of time to adjust the number of spaces.

replengths have been adjusted to account for cancelled instructions in the pipeline where there are more than 7 in a row. I think that's necessary, but can't test to be certain.

Untested, but by my calculations each method takes:

'                single pass           640 output pixel burst
normalpixels     12 clocks             490 clocks
doublebits       22 clocks             658 clocks
doublebytes      18 clocks             498 clocks
doublewords      18 clocks             498 clocks
doublelongs      14 clocks             330 clocks 
doublenibbles    30 clocks             980 clocks

rogloh · 2019-10-11 09:13

Looks interesting, AJL. I'll have to take a look in more detail. I also need to make sure that the bursts are big enough to reduce the loop overhead; however, it can't all fit in COG or LUT memory at once, so I need to break it up into smaller transfers. Right now I transfer up to 40 longs in each burst when processing pixels from the source buffer, and obviously double this when writing back, so 80 longs go out at a time to the next scanline buffer. The number of bursts required varies with the pixel depth. I think right now the doubling of longs from 320 pixels to 640 is not going to meet the budget unless I optimize it. In general I want to keep all of these below about 3100 clocks or so, under half an active scanline. I think it is just about doable. I can also share the input and output buffer if I process backwards instead of forwards, using ptra-- and ptrb--, though I haven't resorted to that trick yet.
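
How the burst count scales with pixel depth can be worked out with a few lines of arithmetic. This is a back-of-envelope Python sketch using the 40-long burst and the 320-pixel source line mentioned above; it assumes pixels are packed tightly into longs and that 24bpp pixels occupy a full 32-bit long each, neither of which is confirmed in the thread.

```python
BURST_LONGS = 40       # source longs processed per burst (figure from the post)
SOURCE_PIXELS = 320    # pixels per source line before doubling


def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b


def bursts_per_line(bits_per_pixel):
    """Number of 40-long bursts needed to cover one 320-pixel source line."""
    source_longs = ceil_div(SOURCE_PIXELS * bits_per_pixel, 32)
    return ceil_div(source_longs, BURST_LONGS)


# 32 stands in for the 24bpp mode, ASSUMING one pixel per 32-bit long
for bpp in (1, 2, 4, 8, 16, 32):
    print(f"{bpp:2d} bpp: {bursts_per_line(bpp)} burst(s)")
```

Under these assumptions the deepest mode needs eight bursts per line, so the fixed per-burst overhead is paid eight times inside the same clock budget.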

I have some of the doubling working now on my LCD monitor: the 1bpp, 8bpp, 16bpp and 24bpp modes all seem to be doubling pixels on the screen, giving a 320 pixel resolution. I still need to work on the nibble and 2bpp modes. My nibble doubling mode seems to be broken, maybe a bug. 2bpp is not coded yet. Getting pretty close now...
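
The doublebits, doublebytes and doublewords routines posted earlier can be modelled off-chip to see why they double pixels. This is a small Python sketch, assuming MOVBYTS picks one source byte per 2-bit selector field and MERGEW interleaves the bits of the high and low words of its operand; the function names are illustrative only.

```python
def quat(digits):
    """Value of a %% (base-4) literal."""
    return int(digits, 4)


def movbyts(d, sel):
    """Model of MOVBYTS: 2-bit field i of sel picks the source byte for byte i."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out


def mergew(d):
    """Model of MERGEW: interleave the bits of the two 16-bit words of d."""
    out = 0
    for i in range(16):
        out |= ((d >> i) & 1) << (2 * i)               # low-word bit i  -> bit 2i
        out |= ((d >> (16 + i)) & 1) << (2 * i + 1)    # high-word bit i -> bit 2i+1
    return out


def double_bits(x):
    # Duplicate a word into both halves, then interleave: every bit appears twice.
    return mergew(movbyts(x, quat("1010"))), mergew(movbyts(x, quat("3232")))


def double_bytes(x):
    return movbyts(x, quat("1100")), movbyts(x, quat("3322"))


def double_words(x):
    return movbyts(x, quat("1010")), movbyts(x, quat("3232"))


print([f"{v:08X}" for v in double_bytes(0x44332211)])   # ['22221111', '44443333']
print([f"{v:08X}" for v in double_words(0x44332211)])   # ['22112211', '44334433']
print([f"{v:08X}" for v in double_bits(0x80000005)])    # ['00000033', 'C0000000']
```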

rogloh · 2019-10-11 09:28

Fixed the nibble mode. I just found out that setq does not retain its value after a muxq operation. In the sample code above I had it outside the loop, done just once at the start, but found it needs to be inside the loop for pixels to double, evanh.

evanh · 2019-10-11 09:43

I've tested MUXQ before, what I've said is correct. Either something else is modifying Q, unlikely, or the bug was coincidental, like the hardcoded REP length! I've learnt with experience not to do that.

rogloh · 2019-10-11 10:22

Weird, I changed it to be inside the loop and it fixed the problem with a modified loop length of 14. I would tend to agree hard coding it is rather dangerous and I've been caught out before with incorrect sizes. I think 12 is the right number though in this case. Maybe Q changes after one of the instructions in my loop?

rogloh · 2019-10-11 10:27

Not sure what's up with setq.

Here is the bad code: pixels are not doubled on the screen, though they are modified in a weird way, and the line is still the same size, 640 on screen.

doublenibbles
            setq    nibblemask
            rep     #12, readburst
            rdlut   b, ptra++
            getword a, b, #1                
            movbyts b, #%%1100
            mov     pb, b
            shl     pb, #4
            muxq    b, pb
            wrlut   b, ptrb++
            movbyts a, #%%1100
            mov     pb, a
            shl     pb, #4
            muxq    a, pb
            wrlut   a, ptrb++
            ret

Here is the working code, pixels doubled properly, 320 on screen.

doublenibbles
            rep     #14, readburst
            rdlut   b, ptra++
            getword a, b, #1                
            movbyts b, #%%1100
            mov     pb, b
            shl     pb, #4
            setq    nibblemask
            muxq    b, pb
            wrlut   b, ptrb++
            movbyts a, #%%1100
            mov     pb, a
            shl     pb, #4
            setq    nibblemask
            muxq    a, pb
            wrlut   a, ptrb++
            ret

ozpropdev · 2019-10-11 10:42

The RDLUT will be clearing the Q value.
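
One way to picture this suggestion is a toy Python model of the first half of the doublenibbles loop. It ASSUMES, purely for illustration, that RDLUT leaves the value it read in Q; the exact hardware behaviour is not established at this point in the thread, and the function and parameter names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
NIBBLEMASK = 0x0FF00FF0


def double_low_word(lut_long, setq_inside_loop, rdlut_clobbers_q=True):
    """Model one pass of the loop body for the low word of lut_long."""
    q = NIBBLEMASK                      # setq nibblemask, issued once before the loop
    b = lut_long                        # rdlut b, ptra++
    if rdlut_clobbers_q:
        q = lut_long                    # ASSUMPTION: the read overwrites Q
    # same effect as movbyts b, #%%1100: duplicate the two low bytes
    b = ((b & 0xFF00) * 0x10100) | ((b & 0x00FF) * 0x101)
    pb = (b << 4) & MASK32              # mov pb, b / shl pb, #4
    if setq_inside_loop:
        q = NIBBLEMASK                  # fresh setq right before muxq
    return (b & ~q & MASK32) | (pb & q)     # muxq b, pb


print(f"{double_low_word(0x1234, setq_inside_loop=True):08X}")   # prints 11223344
print(double_low_word(0x1234, setq_inside_loop=False) == 0x11223344)   # prints False
```

Under that assumption the output depends on the pixel data itself whenever SETQ sits outside the loop, which would match pixels being modified in a weird way rather than doubled.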

evanh · 2019-10-11 11:14

ozpropdev wrote: »

The RDLUT will be clearing the Q value.

Whoa! RDLUT fills Q with data it gets from lutRAM.

Why did that come about, Chip?

ozpropdev · 2019-10-11 11:22

Maybe something to do with RDLUT taking 3 clocks?

evanh · 2019-10-11 11:27

Yeah. But given there is already a ton of dedicated registers throughout the design I'm not sure why Q was sacrificed to achieve that.

Is lutRAM location $1ff real? Strangely, RDLUT reg, ##$1ff doesn't affect Q.

evanh · 2019-10-11 11:30

Holly Smile, I'm wrong, Q is untouched! It's just being switched out or something.

Ha! Ah, no, somehow lutRAM address $1ff had the same data I was using for SETQ

rogloh · 2019-10-11 11:36

So in the current state of my code now, all P2 colour modes seem to be now working at VGA resolution over DVI, and one of these video "modes" below is selectable per frame:

Two text modes, both 16 colour LUT based, optional line doubling, 
either a flashing or high intensity background, flashing block text cursor with
data read from 16 bit character screen memory (classic VGA type of screen buffer)
   - 80 column 16 colour text, 8xF size font
   - 40 column 16 colour text, 8xF size font
Multiple graphics modes in one of four resolutions:
   - 640xN
   - 320xN
   - 640xN/2 (line doubled)
   - 320xN/2 (line doubled)
Colour modes are:
   - LUMA8 (all 8 colours, 8 bit luminance) 
   - RGBI8 (3 bit colour, 5 bit luminance)
   - RGB8 (3:3:2)
   - RGB16 (5:6:5)
   - RGB24 (8:8:8)
   - LUT palette 1bpp
   - LUT palette 2bpp
   - LUT palette 4bpp
   - LUT palette 8bpp

The active scanline count N can be set up as 350, 400, 480, etc.; it's statically configurable with 
suitable frame timing blanking parameters at compile time.  Perhaps this could be made dynamic at some point. 
The font scanline height F is also configurable per frame.

As it stands today this driver still has ~60 COG longs free, 256 LUTRAM free (at times, depending on where I double the pixels). This space should be able to increase with optimizations. It can also be massively increased if the code for the chosen mode is dynamically loaded to be executed (though that risks crashing the video driver if hub memory is corrupted; right now everything is nicely self-contained and stable). I will look into including a few other niceties like a second block mouse cursor in text mode, frame sync state update, either a line or block cursor with or without flashing, and a graphics mouse sprite in graphics modes plus buffer wraparounds for scrolling. Then this should be plenty usable even before other things get added in time...especially HyperRAM buffer support, etc, etc.

Should be useful for others once P2 rev B's with HDMI are released in larger volumes. I'll certainly be using it for my own debug soon.

AJL · 2019-10-11 11:38

rogloh wrote: »

Weird, I changed it to be inside the loop and it fixed the problem with a modified loop length of 14. I would tend to agree hard coding it is rather dangerous and I've been caught out before with incorrect sizes. I think 12 is the right number though in this case. Maybe Q changes after one of the instructions in my loop?

Unless the tool set includes a method for automatically generating rep length parameters for code using skipf then hard coded is going to be necessary. I use a spreadsheet to write up the code, generate the skip patterns, and calculate the rep length values.
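
The mask-generation half of that spreadsheet workflow is easy to script. The following is a minimal Python sketch, assuming the usual SKIPF convention that bit 0 of the mask applies to the first instruction after SKIPF and that a 1 bit means skip; the instruction rows are hypothetical and are not taken from the driver. It deliberately does not compute REP lengths, since how skipped instructions count towards them is the untested part.

```python
def skip_mask(executed):
    """Build a SKIPF-style mask: bit n is 1 when the nth instruction is skipped."""
    mask = 0
    for n, keep in enumerate(executed):
        if not keep:
            mask |= 1 << n
    return mask


# Hypothetical 6-instruction sequence; True marks the instructions a mode needs.
mode_rows = {
    "mode_a": [True, False, False, True, True, False],
    "mode_b": [True, True, True, False, False, True],
}

for name, row in mode_rows.items():
    mask = skip_mask(row)
    print(f"{name}: %{mask:06b}  ({sum(row)} instructions executed)")
```

Generating the masks from one table of per-mode flags keeps them in step with the instruction list when the shared sequence is edited.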
I accept that this isn’t likely to occur much in general code, but for drivers that are trying to squeeze multimodal code into COG/LUT space while keeping as much space free for buffers it seems like it will be necessary.

evanh · 2019-10-11 11:57

Looking good Roger. With higher sysclock it should also be easy to do 848x480 and 424x240 for 16:9 aspect monitors.

TonyB_ · 2019-10-11 12:54

RDLUT destroying Q is important to know but it doesn't appear to be mentioned in either the spreadsheet or the documentation. So XBYTE also affects Q?

ersmith · 2019-10-11 13:16

TonyB_ wrote: »

RDLUT destroying Q is important to know but it doesn't appear to be mentioned in either the spreadsheet or the documentation. So XBYTE also affects Q?

I assume rdlut changes Q because setq+rdlut is a legal combination. I would expect any instruction for which setq is a valid prefix probably sets Q to a default value first.

evanh · 2019-10-11 13:22

TonyB_ wrote: »

So XBYTE also affects Q?

Ah, yes, that'll be the why. I've completely ignored xbyte discussions but I see the xbyte sequence uses a hidden rdlut. I'm guessing that data is needed stored close to the ALU/pipeline and Q is it.

So I also suspect xbyte is the only reason Q is used by RDLUT at all.

evanh · 2019-10-11 13:30

ersmith wrote: »

I assume rdlut changes Q because setq+rdlut is a legal combination. I would expect any instruction for which setq is a valid prefix probably sets Q to a default value first.

The Q register is just a parameter for those types of instructions. EDIT: Or more importantly, the modal changes in those instructions are not triggered unless a SETQ is the prefixing instruction.

Xbyte will be the reason why RDLUT is different.

cgracey · 2019-10-11 13:30

evanh wrote: »

ozpropdev wrote: »

The RDLUT will be clearing the Q value.

Whoa! RDLUT fills Q with data it gets from lutRAM.

Why did that come about, Chip?

Q gets used for data capture in a few ways. I will list them when I am at my computer today.

cgracey · 2019-10-11 13:36

ersmith wrote: »

TonyB_ wrote: »

RDLUT destroying Q is important to know but it doesn't appear to be mentioned in either the spreadsheet or the documentation. So XBYTE also affects Q?

I assume rdlut changes Q because setq+rdlut is a legal combination. I would expect any instruction for which setq is a valid prefix probably sets Q to a default value first.

SETQ also shields interrupts to protect the next instruction.

Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.

evanh · 2019-10-11 13:44

cgracey wrote: »

Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.

That's not much of an argument for an object that is running the whole cog - without interrupts. And debuggers can preserve Q using MUXQ and SETQ. And, the optimised, but buggy, example above actually is protected from interrupts anyway. The SETQ is followed immediately by a REP.
