All PASM2 gurus - help optimizing a text driver over DVI? - Page 3 — Parallax Forums


Comments

  • evanh Posts: 12,032
    edited 2019-10-11 05:58
    Note: Q retains its value. MUXQ can reuse Q as many times as you want.
  • rogloh wrote: »
    Cool. I think this can be shaved by one instruction with a setq, muxq sequence.
    movbyts pa, #%%1100
    mov pb, pa
    shl pb,#4
    setq mask
    muxq pa, pb
    
    etc
    

    Five is better than 8 I had before. Will code this up.

    Am I missing something? Shouldn't it be movbyts pa, #%01010000 ?
    Does the %% change the interpretation of the immediate?

  • Yes, it does. %% indicates a literal written in two-bit digits (twits, nits), each from 0-3, representing 00, 01, 10, 11 in binary.
    evanh wrote: »
    Note: Q retains its value. MUXQ can reuse Q as many times as you want.
    This is good, as I had the setq in an inner loop and can now move it out.
  • rogloh Posts: 3,654
    edited 2019-10-11 06:43
    Here's the updated code for pixel transfers; I still have to do the two-bit version. I've noticed many similarities in the sequences which could probably be leveraged using skipf to reduce this down further.
    pixelmove   long    normalpixels
    
    normalpixels
                rep     #2, readburst
                rdlut   a, ptra++
                wrlut   a, ptrb++
                ret
                
    doublebits  
                rep     #8, readburst
                rdlut   a, ptra++
                mov     b, a
                movbyts a, #%%1010
                movbyts b, #%%3232
                mergew  a
                wrlut   a, ptrb++
                mergew  b
                wrlut   b, ptrb++
                ret
    
    doublebytes
                rep     #6, readburst
                rdlut   a, ptra++
                mov     b, a
                movbyts a, #%%1100
                wrlut   a, ptrb++
                movbyts b, #%%3322
                wrlut   b, ptrb++
                ret
    
    doublewords
                rep     #6, readburst
                rdlut   a, ptra++
                mov     b, a
                movbyts a, #%%1010
                wrlut   a, ptrb++
                movbyts b, #%%3232
                wrlut   b, ptrb++
                ret
    
    doublelongs 
                rep     #3, readburst
                rdlut   a, ptra++
                wrlong  a, ptrb++
                wrlong  a, ptrb++
                ret
     
    doublenibbles
                setq    nibblemask
                rep     #12, readburst
                rdlut   b, ptra++
                getword a, b, #1                
                movbyts b, #%%1100
                mov     pb, b
                shl     pb, #4
                muxq    b, pb
                wrlut   b, ptrb++
                movbyts a, #%%1100
                mov     pb, a
                shl     pb, #4
                muxq    a, pb
                wrlut   a, ptrb++
                ret
    
    nibblemask  long    $0ff00ff0
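
To sanity-check the doubling sequences above off-chip, here is a rough Python model (not P2 code). The helper names are made up for illustration, and the bit-level behaviour of MOVBYTS, MUXQ and MERGEW is modelled from my reading of the instruction descriptions, so treat it as a sketch rather than a reference.

```python
MASK32 = 0xFFFFFFFF

def movbyts(d, sel):
    """Model of MOVBYTS: each 2-bit field of sel picks a source byte of d."""
    src = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for pos in range(4):                      # pos 0 is the low result byte
        out |= src[(sel >> (2 * pos)) & 3] << (8 * pos)
    return out

def muxq(d, s, q):
    """Model of MUXQ: take bits from s where q is 1, keep d elsewhere."""
    return ((d & ~q) | (s & q)) & MASK32

def mergew(d):
    """Model of MERGEW: interleave the bits of the high and low words."""
    out = 0
    for i in range(16):
        out |= ((d >> i) & 1) << (2 * i)              # low word -> even bits
        out |= ((d >> (16 + i)) & 1) << (2 * i + 1)   # high word -> odd bits
    return out

def double_nibbles(x, q=0x0FF00FF0):
    """Mirror of the doublenibbles loop body for one input long."""
    a = (x >> 16) & 0xFFFF                    # getword a, b, #1
    b = movbyts(x, 0b01010000)                # movbyts b, #%%1100
    b = muxq(b, (b << 4) & MASK32, q)         # mov pb,b / shl pb,#4 / muxq b,pb
    a = movbyts(a, 0b01010000)                # movbyts a, #%%1100
    a = muxq(a, (a << 4) & MASK32, q)         # mov pb,a / shl pb,#4 / muxq a,pb
    return b, a                               # in the order they are written out

def double_bits(x):
    """Mirror of the doublebits loop body for one input long."""
    a = mergew(movbyts(x, 0b01000100))        # movbyts a, #%%1010 / mergew a
    b = mergew(movbyts(x, 0b11101110))        # movbyts b, #%%3232 / mergew b
    return a, b

for name, fn in (("doublenibbles", double_nibbles), ("doublebits", double_bits)):
    first, second = fn(0x76543210)
    print(f"{name}: {first:08X} {second:08X}")
```

Feeding in $76543210 makes the result easy to eyeball, since every nibble of the input is distinct.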
    
    
  • ozpropdev Posts: 2,776
    edited 2019-10-11 06:43
    After another coffee I realized you could do this too.
    		movbyts	pa,#%%1100
    		mov	pb,pa
    		shl	pb,#4		
    		and	pb,mask
    		muxnibs	pa,pb
    
    

    Edit. Doh! No that won't work for nibbles = 0, Scrap that idea.
    Time to change brand of coffee I think!
  • I know how you feel, this stuff is killing my brain right now. LOL.
  • Keep CHIPping away Roger, you're getting close! Pardon the pun :lol:
  • AJL wrote: »
    Am I missing something? Shouldn't it be movbyts pa, #%01010000 ?
    Does the %% change the interpretation of the immediate?
    The %% represents a base 4 number (Quaternary).
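
For anyone who wants to check the equivalence by hand, here is a small Python sketch (not P2 code; the function name is just for illustration) that expands the digits of a %% literal into the binary value the assembler would see.

```python
def quaternary(digits: str) -> int:
    """Convert the digits of a %% (base-4) literal into an integer."""
    value = 0
    for ch in digits:
        digit = int(ch)
        if digit > 3:
            raise ValueError(f"not a base-4 digit: {ch!r}")
        value = (value << 2) | digit      # each digit contributes two bits
    return value

for literal in ("1100", "1010", "3232", "3322"):
    print(f"%%{literal} = %{quaternary(literal):08b}")
```

So %%1100 and %01010000 are the same immediate, just written in different bases.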

  • AJL Posts: 457
    edited 2019-10-11 08:10
    How about this, @rogloh :
      alts modedata, #skmasks  'set up skipmask based on mode
      mov skmask, 0-0  
      skipf skmask  
      alts modedata, #replengths  'replength must be adjusted for skipf
      mov replength, 0-0  
      alts modedata, #amaps  'first transform map if needed
      mov amap, 0-0  
      alts modedata, #bmaps  'second transform map if needed
      mov bmap, 0-0  
      setq nibblemask  
      rep replength, readburst  
      rdlut a, ptra++  
      mov b, a  'copy for later
      getword b, a, #1  'save second half
      movbyts a, amap  'first transform
      mov pb, a  'take a copy
      shl pb, #4  'nibble shift first half
      muxq a, pb  'merge nibble copy
      mergew a  'double bits from first transform
      wrlut a, ptrb++  'write first result
      movbyts b, bmap  'second transform
      mergew b  'double bits from second transform
      mov pb, b  'take a copy
      shl pb, #4  'nibble shift second half
      muxq b, pb  'merge nibble copy
      wrlut b, ptrb++  'write second result
      wrlong a, ptrb++  'write first long
      wrlong a, ptrb++  'write second long
      ret  'done
         
    replengths  
      3  
      8  
      6  
      6  
      4  
      12  
        
    skmasks  
      %011111111011111110011100  
      %011011100001110100010000  
      %011011110011110100010000  
      %011011110011110100010000  
      %000111111111111110011100  
      %011000010010000010000000  
        
    amaps  
      0  
      %%1010  
      %%1100  
      %%1010  
      0  
      %%1100  
        
    bmaps  
      0  
      %%3232  
      %%3322  
      %%3232  
      0  
      %%1100  
    
    'modedata
    ' normalpixels  = 0
    ' doublebits = 1
    ' doublebytes = 2
    ' doublewords = 3
    ' doublelongs = 4
    ' doublenibbles = 5
    

    Apologies for the comment alignment, but I've run out of time to adjust the number of spaces.

    replengths have been adjusted to account for cancelled instructions in the pipeline where there are more than 7 in a row. I think that's necessary, but can't test to be certain.

    Untested, but by my calculations each method takes:
    '                single pass           640 output pixel burst
    normalpixels     12 clocks             490 clocks
    doublebits       22 clocks             658 clocks
    doublebytes      18 clocks             498 clocks
    doublewords      18 clocks             498 clocks
    doublelongs      14 clocks             330 clocks 
    doublenibbles    30 clocks             980 clocks
    

  • Looks interesting, AJL. I'll have to take a look in more detail. I also need to make sure that the bursts are big enough to help reduce the loop overhead; however, it can't all fit in COG or LUT memory at once, so I need to break it up into smaller transfers. Right now I transfer up to 40 longs in each burst when processing pixels from the source buffer, and obviously double this when writing back, 80 longs at a time, to the next scanline buffer. The number of bursts required varies with the pixel depth. I think right now the doubling of longs from 320 pixels to 640 is not going to meet the budget unless I optimize it. In general I want to keep all of these below about 3100 clocks or so, under half an active scanline. I think it is just about doable. I can also share the input and output buffer if I process backwards instead of forwards, using ptra-- and ptrb-- etc., though I haven't resorted to that trick yet.
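
As a rough illustration of how the burst count scales with pixel depth, here is a small Python calculation (not P2 code). The 320 source pixels and 40-long burst size come from the post; treating the 24bpp mode as one 32-bit long per pixel is my assumption, as are the names used.

```python
import math

SRC_PIXELS = 320      # source pixels per scanline before doubling
BURST_LONGS = 40      # longs read from the source buffer per burst

def bursts_per_line(bits_per_pixel):
    """Return (source longs per line, bursts needed) for a given depth."""
    src_longs = SRC_PIXELS * bits_per_pixel // 32
    return src_longs, math.ceil(src_longs / BURST_LONGS)

# ASSUMPTION: RGB24 pixels occupy a full 32-bit long each.
for bpp in (1, 2, 4, 8, 16, 32):
    longs, bursts = bursts_per_line(bpp)
    print(f"{bpp:2d} bpp: {longs:3d} source longs, {bursts} burst(s)")
```

Under these assumptions the long-per-pixel case needs the most bursts per line, which fits with it being the mode most at risk of missing the clock budget.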

    I have some of the doubling working now on my LCD monitor: the 1bpp, 8bpp, 16bpp and 24bpp modes all seem to be doubling pixels on the screen, giving a 320 pixel resolution. I still need to work on the nibble and 2bpp modes. My nibble doubling mode seems to be broken, maybe a bug. 2bpp is not coded yet. Getting pretty close now...
  • Fixed the nibble mode. I just found out that the value loaded by setq is not retained after the muxq operation. In the sample code above I had the setq outside the loop, done just once at the start, but found it needs to be inside the loop for the pixels to double, evanh.

  • evanh Posts: 12,032
    edited 2019-10-11 09:44
    I've tested MUXQ before, what I've said is correct. Either something else is modifying Q, unlikely, or the bug was coincidental, like the hardcoded REP length! I've learnt with experience not to do that.

  • Weird, I changed it to be inside the loop and it fixed the problem with a modified loop length of 14. I would tend to agree hard coding it is rather dangerous and I've been caught out before with incorrect sizes. I think 12 is the right number though in this case. Maybe Q changes after one of the instructions in my loop?
  • Not sure what's up with setq.

    Here is the bad code. Pixels are not doubled on the screen, though they are modified in a weird way, and the image is still the same size, 640 pixels wide:
    doublenibbles
                setq    nibblemask
                rep     #12, readburst
                rdlut   b, ptra++
                getword a, b, #1                
                movbyts b, #%%1100
                mov     pb, b
                shl     pb, #4
                muxq    b, pb
                wrlut   b, ptrb++
                movbyts a, #%%1100
                mov     pb, a
                shl     pb, #4
                muxq    a, pb
                wrlut   a, ptrb++
                ret
    

    Here is the working code, pixels doubled properly, 320 on screen.
    doublenibbles
                rep     #14, readburst
                rdlut   b, ptra++
                getword a, b, #1                
                movbyts b, #%%1100
                mov     pb, b
                shl     pb, #4
                setq    nibblemask
                muxq    b, pb
                wrlut   b, ptrb++
                movbyts a, #%%1100
                mov     pb, a
                shl     pb, #4
                setq    nibblemask
                muxq    a, pb
                wrlut   a, ptrb++
                ret
    
  • The RDLUT will be clearing the Q value.
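
To see why a stale Q would break the nibble merge, here is a tiny Python model (not P2 code). It assumes, as reported in this thread and not otherwise verified here, that after RDLUT the Q register holds the long that was just read instead of nibblemask; the names are illustrative only.

```python
MASK32 = 0xFFFFFFFF
NIBBLEMASK = 0x0FF00FF0

def muxq(d, s, q):
    """Model of MUXQ: take bits from s where q is 1, keep d elsewhere."""
    return ((d & ~q) | (s & q)) & MASK32

lut_long = 0x76543210          # value RDLUT fetched from LUT RAM
b = 0x32321010                 # lut_long after movbyts b, #%%1100
pb = (b << 4) & MASK32         # mov pb, b / shl pb, #4

good = muxq(b, pb, NIBBLEMASK)     # Q refreshed by SETQ just before MUXQ
bad = muxq(b, pb, lut_long)        # ASSUMPTION: Q was overwritten by RDLUT

print(f"fresh Q: {good:08X}")
print(f"stale Q: {bad:08X}")
```

With the fresh mask the low four nibbles come out doubled; with the overwritten Q the result is scrambled, which would match pixels being "modified in a weird way".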
  • evanh Posts: 12,032
    edited 2019-10-11 11:17
    ozpropdev wrote: »
    The RDLUT will be clearing the Q value.
    Whoa! RDLUT fills Q with data it gets from lutRAM.

    Why did that come about, Chip?

  • Maybe something to do with RDLUT taking 3 clocks?
  • evanh Posts: 12,032
    Yeah. But given there is already a ton of dedicated registers throughout the design I'm not sure why Q was sacrificed to achieve that.

    Is lutRAM location $1ff real? Strangely, RDLUT reg, ##$1ff doesn't affect Q.
  • evanh Posts: 12,032
    edited 2019-10-11 11:37
    Holy crap, I'm wrong, Q is untouched! It's just being switched out or something.

    Ha! Ah, no, somehow lutRAM address $1ff had the same data I was using for SETQ

  • rogloh Posts: 3,654
    edited 2019-10-11 11:47
    So in the current state of my code now, all P2 colour modes seem to be now working at VGA resolution over DVI, and one of these video "modes" below is selectable per frame:
    Two text modes, both 16 colour LUT based, optional line doubling, 
    either a flashing or high intensity background, flashing block text cursor with
    data read from 16 bit character screen memory (classic VGA type of screen buffer)
       - 80 column 16 colour text, 8xF size font
       - 40 column 16 colour text, 8xF size font
    Multiple graphics modes in one of four resolutions:
       - 640xN
       - 320xN
       - 640xN/2 (line doubled)
       - 320xN/2 (line doubled)
    Colour modes are:
       - LUMA8 (all 8 colours, 8 bit luminance) 
       - RGBI8 (3 bit colour, 5 bit luminance)
       - RGB8 (3:3:2)
       - RGB16 (5:6:5)
       - RGB24 (8:8:8)
       - LUT palette 1bpp
       - LUT palette 2bpp
       - LUT palette 4bpp
       - LUT palette 8bpp
    
    The active scanline count N can be setup as 350, 400, 480, etc, it's statically configurable with 
    suitable frame timing blanking parameters at compile time.  Perhaps this could be made dynamic at some point. 
    The font scanline height F is also configurable per frame.
    

    As it stands today this driver still has ~60 COG longs free and 256 LUTRAM longs free (at times, depending on where I double the pixels). This space should be able to increase with optimizations. It can also be massively increased if the code for the chosen mode is dynamically loaded to be executed (though that risks crashing the video driver if hub memory is corrupted; right now everything is nicely self-contained and stable). I will look into including a few other niceties like a second block mouse cursor in text mode, frame sync state update, either a line or block cursor with or without flashing, and a graphics mouse sprite in graphics modes plus buffer wraparounds for scrolling. Then this should be plenty usable even before other things get added in time...especially HyperRAM buffer support, etc, etc.

    Should be useful for others once P2 rev B's with HDMI are released in larger volumes. I'll certainly be using it for my own debug soon. :smile:
  • rogloh wrote: »
    Weird, I changed it to be inside the loop and it fixed the problem with a modified loop length of 14. I would tend to agree hard coding it is rather dangerous and I've been caught out before with incorrect sizes. I think 12 is the right number though in this case. Maybe Q changes after one of the instructions in my loop?
    Unless the tool set includes a method for automatically generating rep length parameters for code using skipf then hard coded is going to be necessary. I use a spreadsheet to write up the code, generate the skip patterns, and calculate the rep length values.
    I accept that this isn’t likely to occur much in general code, but for drivers that are trying to squeeze multimodal code into COG/LUT space while keeping as much space free for buffers it seems like it will be necessary.
  • evanh Posts: 12,032
    Looking good Roger. With higher sysclock it should also be easy to do 848x480 and 424x240 for 16:9 aspect monitors.
  • RDLUT destroying Q is important to know but it doesn't appear to be mentioned in either the spreadsheet or the documentation. So XBYTE also affects Q?
  • TonyB_ wrote: »
    RDLUT destroying Q is important to know but it doesn't appear to be mentioned in either the spreadsheet or the documentation. So XBYTE also affects Q?

    I assume rdlut changes Q because setq+rdlut is a legal combination. I would expect any instruction for which setq is a valid prefix probably sets Q to a default value first.
  • evanh Posts: 12,032
    TonyB_ wrote: »
    So XBYTE also affects Q?
    Ah, yes, that'll be the why. I've completely ignored xbyte discussions but I see the xbyte sequence uses a hidden rdlut. I'm guessing that data needs to be stored close to the ALU/pipeline and Q is it.

    So I also suspect xbyte is the only reason Q is used by RDLUT at all.

  • evanh Posts: 12,032
    edited 2019-10-11 13:33
    ersmith wrote: »
    I assume rdlut changes Q because setq+rdlut is a legal combination. I would expect any instruction for which setq is a valid prefix probably sets Q to a default value first.
    The Q register is just a parameter for those types of instructions. EDIT: Or more importantly, the modal changes in those instructions are not triggered unless SETQ is the prefixing instruction.

    Xbyte will be the reason why RDLUT is different.

  • cgracey Posts: 13,631
    evanh wrote: »
    ozpropdev wrote: »
    The RDLUT will be clearing the Q value.
    Whoa! RDLUT fills Q with data it gets from lutRAM.

    Why did that come about, Chip?

    Q gets used for data capture in a few ways. I will list them when I am at my computer today.
  • cgracey Posts: 13,631
    ersmith wrote: »
    TonyB_ wrote: »
    RDLUT destroying Q is important to know but it doesn't appear to be mentioned in either the spreadsheet or the documentation. So XBYTE also affects Q?

    I assume rdlut changes Q because setq+rdlut is a legal combination. I would expect any instruction for which setq is a valid prefix probably sets Q to a default value first.

    SETQ also shields interrupts to protect the next instruction.

    Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.
  • evanh Posts: 12,032
    cgracey wrote: »
    Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.
    That's not much of an argument for an object that is running the whole cog - without interrupts. And debuggers can preserve Q using MUXQ and SETQ. And, the optimised, but buggy, example above actually is protected from interrupts anyway. The SETQ is followed immediately by a REP.