All PASM2 gurus - help optimizing a text driver over DVI? - Page 8 — Parallax Forums

All PASM2 gurus - help optimizing a text driver over DVI?


Comments

  • jmg Posts: 14,847
    cgracey wrote: »
    Seeing there was some concern here about FIFO metrics, I just added this to the documentation:

    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.

    Can you also add the low water mark number that evanh mentions ? - ie that (fixed?) trigger level, where another fill burst starts ?
    From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5 ? (seems too low?)
    A burst size of 8 gave an assumed load trigger of 11 in above discussion, but your wording suggests a larger burst ( but that may be initial load only ?)
    The trigger level may not be hard-fixed, if it has to wait until below, and then wait for a slot-align ?
    In this case of /10 empty rate, -1 is possible, but faster empty rates could remove many more until correct slot arrives.
  • cgracey Posts: 13,631
    edited 2019-10-16 19:28
    jmg wrote: »
    cgracey wrote: »
    Seeing there was some concern here about FIFO metrics, I just added this to the documentation:

    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.

    Can you also add the low water mark number that evanh mentions ? - ie that (fixed?) trigger level, where another fill burst starts ?
    From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5 ? (seems too low?)
    A burst size of 8 gave an assumed load trigger of 11 in above discussion, but your wording suggests a larger burst ( but that may be initial load only ?)
    The trigger level may not be hard-fixed, if it has to wait until below, and then wait for a slot-align ?
    In this case of /10 empty rate, -1 is possible, but faster empty rates could remove many more until correct slot arrives.

    I changed the wording. Does this answer your question? (I'm wondering if this change does the job.)
    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, potentially 5 more longs stream in, possibly filling all (cogs+11) stages. These metrics ensure that the FIFO never underflows, under all potential reading scenarios.
  • jmg Posts: 14,847
    cgracey wrote: »
    jmg wrote: »
    cgracey wrote: »
    Seeing there was some concern here about FIFO metrics, I just added this to the documentation:

    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.

    Can you also add the low water mark number that evanh mentions ? - ie that (fixed?) trigger level, where another fill burst starts ?
    From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5 ? (seems too low?)
    A burst size of 8 gave an assumed load trigger of 11 in above discussion, but your wording suggests a larger burst ( but that may be initial load only ?)
    The trigger level may not be hard-fixed, if it has to wait until below, and then wait for a slot-align ?
    In this case of /10 empty rate, -1 is possible, but faster empty rates could remove many more until correct slot arrives.

    I changed the wording. Does this answer your question? (I'm wondering if this change does the job.)
    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+6) stages are filled, after which point 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows, under all potential scenarios.

    That's better, I think you say here
    * P2 FIFO size is 19L deep
    * P2 FIFO trigger point is 14L
    * BURST size is 5L, so it reads in 15,16,17,18,19
    If a read(s) do occur, does burst increase to 6,7,8 as needed, or is burst size always fixed at 5 ?
    Is first read a continual 19L ? or does it burst 5,5,5,4 giving some CPU bus time ?
  • cgracey Posts: 13,631
    jmg wrote: »
    cgracey wrote: »
    jmg wrote: »
    cgracey wrote: »
    Seeing there was some concern here about FIFO metrics, I just added this to the documentation:

    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.

    Can you also add the low water mark number that evanh mentions ? - ie that (fixed?) trigger level, where another fill burst starts ?
    From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5 ? (seems too low?)
    A burst size of 8 gave an assumed load trigger of 11 in above discussion, but your wording suggests a larger burst ( but that may be initial load only ?)
    The trigger level may not be hard-fixed, if it has to wait until below, and then wait for a slot-align ?
    In this case of /10 empty rate, -1 is possible, but faster empty rates could remove many more until correct slot arrives.

    I changed the wording. Does this answer your question? (I'm wondering if this change does the job.)
    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+6) stages are filled, after which point 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows, under all potential scenarios.

    That's better, I think you say here
    * P2 FIFO size is 19L deep
    * P2 FIFO trigger point is 14L
    * BURST size is 5L, so it reads in 15,16,17,18,19
    If a read(s) do occur, does burst increase to 6,7,8 as needed, or is burst size always fixed at 5 ?
    Is first read a continual 19L ? or does it burst 5,5,5,4 giving some CPU bus time ?

    The first load is 19L, assuming no data was read from the FIFO (ie RFLONG). If continuous RFLONGs occur, it will just load continuously, since that is the requirement. Once ANY data is in the FIFO, though, reading can begin.
  • jmg Posts: 14,847
    edited 2019-10-17 03:55
    cgracey wrote: »
    jmg wrote: »
    That's better, I think you say here
    * P2 FIFO size is 19L deep
    * P2 FIFO trigger point is 14L
    * BURST size is 5L, so it reads in 15,16,17,18,19
    If a read(s) do occur, does burst increase to 6,7,8 as needed, or is burst size always fixed at 5 ?
    Is first read a continual 19L ? or does it burst 5,5,5,4 giving some CPU bus time ?

    The first load is 19L, assuming no data was read from the FIFO (ie RFLONG). If continuous RFLONGs occur, it will just load continuously, since that is the requirement. Once ANY data is in the FIFO, though, reading can begin.

    Thanks, Is burst fill always 5L, or can that increase, to always stop at Full = 19L ? I'm guessing a fill to 19L works better ?
    Taking the 10c empty rate, I get this for the FIFO and Streamer-BUS lockouts, for a 5 Size burst
    Rd Clock of 10c
    111111111111111111111111111111111111111111wwww1111111111111111111111111111111111111111111111wwww111111
    988888888887777777777666666666655555555554wwww5679988888888887777777777666666666655555555554wwww567998   FIFO Data
     /         /         /         /         /         /         /         /         /         /         /   RdCLK (-1)
                                                  /////                                             /////    Wr CLK (+1)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~wwwwSSSSS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~SSSSS~~~ Streamer has BUS
    
    Rd Clock of 2c (based on Fill-to-19)
    1111111111111111111111111111111111111111111111111111111111111111
    9887766554433233445566778898877665544332334455667788998877665544
     / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /  Rd CLK (-1)
                  /////////////             /////////////               Wr CLK (+1)
    ~~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~~ Streamer has BUS
    
    Rd Clock of 2c (based on 5 loads from 14, correct ) 
    1111111111111111111111111111111111111111111111111111111111111111
    9887766554433233445566766554433233445566734455667788998877665544
     / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /  Rd CLK (-1)
                  ////////         /////////                            Wr CLK (+1)
    ~~~~~~~~~~~wwwSSSSSSSS~~~~~wwwwSSSSSSSSS~~~~~~~~~~~ Streamer has BUS
                           123456781
    
    
    w = wait for streamer start slot match  S = Streamer FIFO write clock.
    
    The effect of the 5 x S on user HUB Data BUS access would be a stall of +8, as the access is forced to skip one slot and wait for the next go-around.
    For the other 45 clocks of the 50c repeat, the bus is free, and the usual wait for the go-around applies to any HUB Data BUS access.
    At higher streamer read rates the 50 shrinks and S widens, as some reads occur during the write; the limiting useful case looks to be 2c, where the streamer has the bus for 13c of every 26c of time slots ?
    Streamers at 1c use 100% of the BUS, so there are no HUB Data BUS slots left for that streamer's COG.
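A rough cross-check of the figures above: in steady state the FIFO has to take in one long for every long the streamer reads out, and each long fetched occupies the bus for one clock, so the streamer's share of its cog's hub bus comes out at about 1/N for one read every N clocks. The sketch below is only that back-of-envelope estimate; the function name is hypothetical, and slot-alignment waits (the `w` clocks in the diagrams) are ignored.

```python
from fractions import Fraction

def streamer_bus_share(read_period_clocks):
    """Estimated fraction of bus clocks used to refill the FIFO.

    Assumes steady state: longs fetched == longs consumed, one clock per long.
    """
    if read_period_clocks < 1:
        raise ValueError("read period must be at least 1 clock")
    return Fraction(1, read_period_clocks)

for period in (10, 2, 1):
    share = streamer_bus_share(period)
    print(f"read every {period}c -> streamer holds the bus {float(share):.0%} of the time")
```

This agrees with the diagrams: 5 of 50 clocks at 10c, 13 of 26 clocks at 2c, and the whole bus at 1c.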
  • evanh Posts: 12,021
    Chip,
    The wording of "after which point 5 more longs stream in" seems to imply a possible force feeding. ie: The FIFO didn't address those locations in hubRAM but is still getting them anyway.
  • evanh Posts: 12,021
    edited 2019-10-17 00:22
    jmg wrote: »
    That's better, I think you say here
    * P2 FIFO size is 19L deep
    * P2 FIFO trigger point is 14L
    * BURST size is 5L, so it reads in 15,16,17,18,19
    It's six, #14 is included. If I'm reading into what Chip has said correctly then that is also the minimum burst length, not fixed. Any consumption from the FIFO during refilling will extend the refilling. It's a little fancier than I first envisioned.
  • jmg Posts: 14,847
    evanh wrote: »
    It's six, #14 is included.
    I'm just using Chip's numbers, when fifo gets down to 14, next empty slot is 15, so it can fill 15,16,17,18,19 which is 5

    evanh wrote: »
    If I'm reading into what Chip has said correctly then that is also the minimum burst length, not fixed. Any consumption from the FIFO during refilling will extend the refilling. It's a little fancier than I first envisioned.
    That's what I've assumed above, that it can burst load 5 or more, and fills to 19.
    With /2c readout rate, and a 3c Wait, I get 7 bytes read out, during the 3wait+13write clocks, burst length here is 13.

  • evanh Posts: 12,021
    Yeah, maybe it is only 5 minimum. So that's the high and low marks.

    It also means determinism is out the door. The FIFO burst starts are going to be on a rotation themselves. There's no point trying to arrange for ideal timing of the SETQ+RDLONG.

  • cgracey Posts: 13,631
    It's like it turns the faucet on when it's under 14 stages full. As soon as it gets to 14 stages full, it turns the faucet off, but there could be five more stages of data streaming in because of register delays.
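The faucet description can be tried out with a toy model. This is a minimal sketch, not the actual hardware logic: it assumes 8 cogs, a fixed 5-clock delay between requesting a long and its arrival, and a faucet decision based on the fill level at the start of each clock. It leaves out hub slot alignment entirely, and all names in it are made up for illustration.

```python
from collections import deque

def simulate_fifo(cogs=8, latency=5, read_period=None, read_start=0, clocks=300):
    """Return (depth, levels), where levels[t] is the fill level after clock t."""
    depth = cogs + 11        # total stages, per the documentation text
    threshold = cogs + 6     # faucet is on while fewer stages than this are full
    in_flight = deque([0] * latency)  # longs requested but not yet arrived
    level = 0
    levels = []
    for t in range(clocks):
        # Faucet decision uses the level as seen at the start of this clock.
        request = 1 if level < threshold else 0
        # Optional consumer: one long read out every read_period clocks.
        if read_period and t >= read_start and (t - read_start) % read_period == 0:
            if level > 0:
                level -= 1
        # The new request joins the pipeline; the oldest entry arrives now.
        in_flight.append(request)
        level += in_flight.popleft()
        levels.append(level)
    return depth, levels

depth, idle = simulate_fifo()
print("depth:", depth, "peak with no reads:", max(idle))

_, busy = simulate_fifo(read_period=1, read_start=40)
print("lowest level while reading every clock:", min(busy[40:]))
```

With no reads, the model shuts the faucet off at 14 and the five longs still in flight carry the level to 19. With reads during a refill the faucet stays on longer, so the burst exceeds 5, and at one read per clock it never shuts off.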
  • cgracey Posts: 13,631
    edited 2019-10-17 03:14
    I was having an impossible time trying to figure out how many stages to make the FIFO. In the end, I made a simulator that did all kinds of worst-case banging on the FIFO, in order to learn the equation for the number of stages required. It was a lot more than I had figured before.
  • jmg Posts: 14,847
    cgracey wrote: »
    I was having an impossible time trying to figure out how many stages to make the FIFO.
    In the end, I made a simulator that did all kinds of worst-case banging on the FIFO, in order to learn the equation for the number of stages required.
    It was a lot more than I had figured before.
    Good idea, because this gets complex very quickly....
    Does that simulator report the user bandwidth ? - can a web calculator, or such a simulator be used to give an indication or range of bandwidths for various Streamer read rates ?

    cgracey wrote: »
    It's like it turns the faucet on when it's under 14 stages full. As soon as it gets to 14 stages full, it turns the faucet off, but there could be five more stages of data streaming in because of register delays.

    Hmm, so that's sounding a firmly fixed 5 size on the burst ? (I've modified the 2c example above for this 5-from-14 rule )
    Because the FIFO fills on every clock, and register delays should be fixed, is there ever <> 5 ?

  • evanh Posts: 12,021
    edited 2019-10-17 04:16
    Oh, it is like a force feed then. Lol, I thought I was pointing out poor wording.

    I guess that means all reads from hubRAM have this trailing buffering. So a single RDLONG has five subsequent longwords that are discarded.

  • cgracey Posts: 13,631
    jmg wrote: »
    cgracey wrote: »
    I was having an impossible time trying to figure out how many stages to make the FIFO.
    In the end, I made a simulator that did all kinds of worst-case banging on the FIFO, in order to learn the equation for the number of stages required.
    It was a lot more than I had figured before.
    Good idea, because this gets complex very quickly....
    Does that simulator report the user bandwidth ? - can a web calculator, or such a simulator be used to give an indication or range of bandwidths for various Streamer read rates ?

    cgracey wrote: »
    It's like it turns the faucet on when it's under 14 stages full. As soon as it gets to 14 stages full, it turns the faucet off, but there could be five more stages of data streaming in because of register delays.

    Hmm, so that's sounding a firmly fixed 5 size on the burst ? (I've modified the 2c example above for this 5-from-14 rule )
    Because the FIFO fills on every clock, and register delays should be fixed, is there ever <> 5 ?

    It could be >5 if data is being read out during the FIFO reload. It could even be continuous.
  • cgracey Posts: 13,631
    evanh wrote: »
    Oh, it is like a force feed then. Lol, I thought I was pointing out poor wording.

    I guess that means all reads from hubRAM have this trailing buffering. So a single RDLONG has five subsequent longwords that are discarded.

    Only hub-exec and RDFAST engage the FIFO in reading the hub, so that many longs are read in. RDxxxx only reads one or two longs, in cases of overlap on a word or a long.
  • evanh Posts: 12,021
    edited 2019-10-17 04:47
    But you said there was a bunch of buffering registers that drain into the FIFO for 5 clocks after the FIFO has stopped addressing hubram. Are those buffers not always in place as part of the egg-beater? Won't SETQ+RDLONG need them too?
  • jmg Posts: 14,847
    cgracey wrote: »
    It could be >5 if data is being read out during the FIFO reload.
    I probably did not word that well.
    The question should have been focused on the tail, ie once 14 is reached, and the 5 more rule starts, can anything change the 5-more ?

    The 2c example above, I now have reading 8 or 9 bursts, by applying a nominal wait-for-slot and also applying the 5-after-14 rule.
    Assuming that's modeled right, that gives gaps of 8 or 9 (for that wait case) which is just enough to ensure a slot exists in every gap between fill-bursts, for a possible user-data access.
    There may be other wait cases, that shrink the gap to < 8, and that means user data may (rarely) stretch across 2 FIFO fills+gaps, before it hits a slot align.
    cgracey wrote: »
    It could even be continuous.
    It is continuous FIFO load, if read is done at 1c ?
  • AJL wrote: »

    As you aren't likely to be switching modes from one line to the next, could you set a flag somewhere when the code is generated, and skip the generation if it is already there?

    Then the generation happens once at start of frame, and it collapses to a simple flag check each line.

    It could be done like that but that approach probably burns more COGRAM.
    1) It needs to set a flag somewhere.
    2) There needs to be a flag available (perhaps could be shared in another reg if another one is suitable for this)
    3) It needs to test this flag to see if it needs to generate the code.
    4) It needs to branch
    That's probably up to 3 or 4 more registers consumed. Simpler to do it every time you want it and not need to check. Once there are cycles available to do this the work is not a problem.

    While I'm not likely to switch modes from one line to the next in the first release of this driver, I'd quite like to be able to in the future to open up the concept of supporting display region lists... though in the worst case reading in a 256 entry palette each scanline is not ideal, and there certainly won't be time to clean it up to stop DVI locking up if bit 1 is set in the long. There would be time in the vertical blanking to do this however.
    If you do switch modes from line to line, you regenerate as necessary, with the code that overwrites the routine clearing the flag.

    Addit: Or use self-modifying code to switch between the generation routine and the generated routine as needed.

    I'd expect we will need to make use of self modifying code once we get to that point. And reclaiming COG/LUT space dynamically in different portions of the scanline to get everything to fit. I'm still trying to keep that in the back of my mind as I go along here to try not to preclude it. Something may possibly have to be dropped there to make room. Maybe pixel doubling won't be possible with it, or I'll need to read in dynamic code blocks from HUB, which I'm still hoping to avoid for stability reasons. Eg. you have an errant COG you are debugging using the video driver to display memory or other state etc and this errant COG goes and kills the video driver executable code. Annoying, but instead if the video driver stays up the whole time, you might be able to continue debugging things (to a point anyway).
  • cgracey Posts: 13,631
    edited 2019-10-17 11:09
    jmg wrote: »
    cgracey wrote: »
    It could be >5 if data is being read out during the FIFO reload.
    I probably did not word that well.
    The question should have been focused on the tail, ie once 14 is reached, and the 5 more rule starts, can anything change the 5-more ?

    The 2c example above, I now have reading 8 or 9 bursts, by applying a nominal wait-for-slot and also applying the 5-after-14 rule.
    Assuming that's modeled right, that gives gaps of 8 or 9 (for that wait case) which is just enough to ensure a slot exists in every gap between fill-bursts, for a possible user-data access.
    There may be other wait cases, that shrink the gap to < 8, and that means user data may (rarely) stretch across 2 FIFO fills+gaps, before it hits a slot align.
    cgracey wrote: »
    It could even be continuous.
    It is continuous FIFO load, if read is done at 1c ?

    Once a cog gives the eggbeater a command to read, it takes 5 clocks for the data to come out of the eggbeater. It doesn't matter if SETQ is used, or not.

    For hub-exec and RDFAST, cycle by cycle, data is coming in from the eggbeater into the FIFO, while data is being read out of the FIFO. This is why very long eggbeater bursts (more than 5 extra) can occur. Demand must be met.
  • rogloh Posts: 3,654
    edited 2019-10-18 11:23
    I was finally able to get my mouse sprite code in and working in all colour modes with clipping. It fits, takes 60 COG longs, plus the call setup overhead. The 4 temporary working longs can be reclaimed from elsewhere.

    Clipping code gets tedious and tricky to deal with and it gave me some grief with the flags. Be nice if there were some optimisations found there, if anyone knows a better way...

    There is one remaining side effect I think where the hub data may overwrite 15 longs beyond the last scanline buffer with the original data that it reads from that line. If this becomes an issue it could be accommodated by padding out the line buffer area by 60 extra bytes at the end, so I might just document that and leave it. Otherwise there would need to be some additional code put in to compute and enforce write size limits that I was trying to avoid as it increases the size of the code further than I'd like, though I will at least look into how to do it...
    do_mouse
                getword a, mouse_xy, #1         'get mouse y screen co-ordinate
                getnib  b, mouseptr, #7         'get y hotspot of mouse image
                sub     a, b                    'compensate for the y hotspot
                subr    a, scanline             'compute sprite row offset     
                cmpr    a, #15 wc               'check if sprite covers scanline
                
                alts    bpp, #bittable          'bpp starts out as index from 0-5
                mov     bitmask, 0-0            'get table entry using bpp index
                mul     a, bitmask              'multiply mouse row by its length
                shr     bitmask, #16 wz         'extract mask portion
        if_z    not     bitmask                 'fix up the 32 bpp case
                mov     bpp, bitmask
                ones    bpp                     'convert into real bpp
    
                add     a, mouseptr             'add offset to base mouse address
                setq2   #17-1                   'get 17 longs max, mouse mask+image
                rdlong  $120, a                 'read mouse data and store in LUT
                    
                getword offset, mouse_xy, #0    'get mouse x screen co-ordinate 
                getnib  b, mouseptr, #6         'get x hotspot of mouse image
    
                mov     pixels, offset
                sub     offset, b               'compensate for the x hotspot
       if_nc    subr    pixels, ##640 wcz       'compute pixels until end of line
                add     pixels, b               'increase by the x hotspot amount
                fle     pixels, #16             'limit drawn pixels to 16
    
     if_c_or_z  ret     wcz                     'exit if sprite is out of x/y range
    
                mov     ptrb, #$120             'ptrb is used for mouse image data
                rdlut   c, ptrb++               'read in the mouse mask first
    
                abs     a, offset               'retain bits
                muls    offset, bpp             'convert number of pixels into bits
                abs     b, offset wc            'test for negative value (clipped)
                mov     muxmask, bitmask        'setup mask for pixel's data size
        if_nc   rol     muxmask, b              'align mask for first data pixel
        if_c    rol     bitmask, b              'align mask for first mouse pixel
        if_c    shr     c, a                    'eliminate mouse pixels if clipped
                shr     b, #5                   'convert bits to longs
        if_c    add     ptrb, b                 'advance mouse data to skip pixels
    
                shl     b, #2                   'convert longs to bytes
        if_nc   add     save, b                 'adjust scanline buffer position                
    
                setq2   #16-1                   'read 16 scanline longs into LUT
                rdlong  $110, save              'using adjusted hub read position
    
                mov     ptra, #$110             'ptra used for source image data
                rdlut   a, ptra                 'get original scanline pixel data
                test    $, #1 wc                'c=1 will trigger initial read
    
                rep     @endmouse, pixels       'repeat loop once per drawn pixel (16 max)
        if_c    rdlut   b, ptrb++               'get next mouse sprite pixel(s)
        if_c    rol     b, offset               'align with the source input data
                shr     c, #1 wc                'get mask bit 1=set, 0=transparent
        if_c    setq    muxmask                 'apply bit mask to muxq mask
        if_nc   setq    #0                      'setup a transparent mouse pixel
                muxq    a, b                    'select original or mouse pixel
        if_c    wrlut   a, ptra                 'write back updated data if altered
                rol     muxmask, bpp wc         'advance mask by 1,2,4,8,16,32 bits
        if_c    rdlut   a, ++ptra               '...and read next source pixel(s)
                rol     bitmask, bpp wc         'rotate mask for mouse data reload
    
    endmouse
                setq2   #16-1 
                wrlong  $110, save              'write LUT image data back to hub
                ret     wcz
    
    bittable
                long    $0001_00_08             '1bpp
                long    $0003_00_08             '2bpp
                long    $000F_00_0C             '4bpp
                long    $00FF_00_14             '8bpp
                long    $FFFF_00_24             '16bpp
                long    $0000_00_44             '32bpp 
    
    bitmask     long    0  'TODO eventually share other temporary register space
    muxmask     long    0  'in order to eliminate these temp working variables
    offset      long    0 
    bpp         long    0 
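For anyone following the `bittable` layout, here is a small model of how the listing appears to unpack each entry: the low 16 bits feed `mul` as the per-row byte count, the high 16 bits are the pixel mask, and a zero mask is inverted to cover the 32 bpp case before `ones` counts the bits. The helper names are hypothetical. `expected_row_bytes` is an assumption inferred from the "17 longs max, mouse mask+image" comment: one mask long plus 16 pixels of image data rounded up to whole longs.

```python
BITTABLE = [
    0x0001_00_08,  # 1bpp
    0x0003_00_08,  # 2bpp
    0x000F_00_0C,  # 4bpp
    0x00FF_00_14,  # 8bpp
    0xFFFF_00_24,  # 16bpp
    0x0000_00_44,  # 32bpp
]

def decode_entry(entry):
    """Mirror the listing's mul / shr / not / ones sequence for one entry."""
    row_bytes = entry & 0xFFFF      # mul only uses the low 16 bits
    mask = entry >> 16              # shr bitmask, #16
    if mask == 0:                   # if_z not bitmask (the 32 bpp fix-up)
        mask = 0xFFFFFFFF
    bpp = bin(mask).count("1")      # ones bpp
    return row_bytes, mask, bpp

def expected_row_bytes(bpp, width=16):
    """Assumed row size: 1 mask long + image data rounded up to whole longs."""
    image_longs = (width * bpp + 31) // 32
    return 4 * (1 + image_longs)

for entry in BITTABLE:
    row_bytes, mask, bpp = decode_entry(entry)
    print(f"{bpp:2d} bpp: row = {row_bytes:2d} bytes, pixel mask = ${mask:08X}")
```

The decoded row sizes match the assumed mask-plus-image layout for all six modes, which suggests the low word really is the sprite row stride in bytes.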
    
    

    Edit: posted too soon. Last change I put in that I thought was innocuous and worked in one mode actually broke some clipping of colour modes I'd tested earlier... :frown: Very close though.

    Edit2: Just found the line I'd changed out in my code and put it back. Now clipping is working as I wanted. Code adjusted above and increased by one long. Also one constant used was with ## so I guess it is now 62 longs.

    I think I found a workaround for the minor issue described earlier, adds 4 longs though if I share the ##640 constant as a separate register instead of ##.
    ...
    endmouse
    
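                mul     bpp, ##640              'total bits in a 640 pixel scanline
                sub     bpp, offset             'bits from sprite start to end of line
                shr     bpp, #5                 'convert bits to longs
                fle     bpp, #15                'cap at 15 so setq2 moves 16 longs max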
    
                setq2   bpp ' was #16-1 
                wrlong  $110, save              'write LUT image data back to hub
                ret     wcz
    
  • @rogloh, what is your format for 16bpp character data? Is it 1 bit for a special effect, 4 bits for foreground color, 3 bits for background color, and 8 bits for the character? How are those arranged? I'd like to add a 16bpp mode to my VGA driver and it might as well be compatible with your DVI driver.
  • rogloh wrote: »
    I was finally able to get my mouse sprite code in and working in all colour modes with clipping. It fits, takes 60 COG longs, plus the call setup overhead. The 4 temporary working longs can be reclaimed from elsewhere.

    Clipping code gets tedious and tricky to deal with and it gave me some grief with the flags. Be nice if there were some optimisations found there, if anyone knows a better way...

    There is one remaining side effect I think where the hub data may overwrite 15 longs beyond the last scanline buffer with the original data that it reads from that line. If this becomes an issue it could be accommodated by padding out the line buffer area by 60 extra bytes at the end, so I might just document that and leave it. Otherwise there would need to be some additional code put in to compute and enforce write size limits that I was trying to avoid as it increases the size of the code further than I'd like, though I will at least look into how to do it...
    do_mouse
                getword a, mouse_xy, #1         'get mouse y screen co-ordinate
                getnib  b, mouseptr, #7         'get y hotspot of mouse image
                sub     a, b                    'compensate for the y hotspot
                subr    a, scanline             'compute sprite row offset     
                cmpr    a, #15 wc               'check if sprite covers scanline
                
                alts    bpp, #bittable          'bpp starts out as index from 0-5
                mov     bitmask, 0-0            'get table entry using bpp index
                mul     a, bitmask              'multiply mouse row by its length
                shr     bitmask, #16 wz         'extract mask portion
        if_z    not     bitmask                 'fix up the 32 bpp case
                mov     bpp, bitmask
                ones    bpp                     'convert into real bpp
    
                add     a, mouseptr             'add offset to base mouse address
                setq2   #17-1                   'get 17 longs max, mouse mask+image
                rdlong  $120, a                 'read mouse data and store in LUT
                    
                getword offset, mouse_xy, #0    'get mouse x screen co-ordinate 
                getnib  b, mouseptr, #6         'get x hotspot of mouse image
    
                mov     pixels, offset
                sub     offset, b               'compensate for the x hotspot
       if_nc    subr    pixels, ##640 wcz       'compute pixels until end of line
                add     pixels, b               'increase by the x hotspot amount
                fle     pixels, #16             'limit drawn pixels to 16
    
     if_c_or_z  ret     wcz                     'exit if sprite is out of x/y range
    
                mov     ptrb, #$120             'ptrb is used for mouse image data
                rdlut   c, ptrb++               'read in the mouse mask first
    
                abs     a, offset               'retain bits
                muls    offset, bpp             'convert number of pixels into bits
                abs     b, offset wc            'test for negative value (clipped)
                mov     muxmask, bitmask        'setup mask for pixel's data size
        if_nc   rol     muxmask, b              'align mask for first data pixel
        if_c    rol     bitmask, b              'align mask for first mouse pixel
        if_c    shr     c, a                    'eliminate mouse pixels if clipped
                shr     b, #5                   'convert bits to longs
        if_c    add     ptrb, b                 'advance mouse data to skip pixels
    
                shl     b, #2                   'convert longs to bytes
        if_nc   add     save, b                 'adjust scanline buffer position                
    
                setq2   #16-1                   'read 16 scanline longs into LUT
                rdlong  $110, save              'using adjusted hub read position
    
                mov     ptra, #$110             'ptra used for source image data
                rdlut   a, ptra                 'get original scanline pixel data
                test    $, #1 wc                'c=1 will trigger initial read
    
                rep     @endmouse, pixels       'repeat loop for 16 pixels
        if_c    rdlut   b, ptrb++               'get next mouse sprite pixel(s)
        if_c    rol     b, offset               'align with the source input data
                shr     c, #1 wc                'get mask bit 1=set, 0=transparent
        if_c    setq    muxmask                 'apply bit mask to muxq mask
        if_nc   setq    #0                      'setup a transparent mouse pixel
                muxq    a, b                    'select original or mouse pixel
        if_c    wrlut   a, ptra                 'write back updated data if altered
                rol     muxmask, bpp wc         'advance mask by 1,2,4,8,16,32 bits
        if_c    rdlut   a, ++ptra               '...and read next source pixel(s)
                rol     bitmask, bpp wc         'rotate mask for mouse data reload
    
    endmouse
                setq2   #16-1 
                wrlong  $110, save              'write LUT image data back to hub
                ret     wcz
    
    bittable
                long    $0001_00_08             '1bpp
                long    $0003_00_08             '2bpp
                long    $000F_00_0C             '4bpp
                long    $00FF_00_14             '8bpp
                long    $FFFF_00_24             '16bpp
                long    $0000_00_44             '32bpp 
    
    bitmask     long    0  'TODO eventually share other temporary register space
    muxmask     long    0  'in order to eliminate these temp working variables
    offset      long    0 
    bpp         long    0 
    
    

    Edit: Posted too soon. The last change I put in, which I thought was innocuous and which worked in one mode, actually broke clipping in some of the colour modes I'd tested earlier... :frown: Very close though.

    Edit2: Just found the line I'd changed in my code and put it back. Now clipping is working as I wanted. The code above has been adjusted and has grown by one long. Also, one constant is loaded with ##, which costs an extra long, so I guess it is now 62 longs.

    I think I found a workaround for the minor issue described earlier. It adds 4 longs, though, and that assumes I share the 640 constant as a separate register instead of using ##.
    ...
    endmouse
    
                mul     bpp, ##640
                sub     bpp, offset
                shr     bpp, #5
                fle     bpp, #15
    
                setq2   bpp ' was #16-1 
                wrlong  $110, save              'write LUT image data back to hub
                ret     wcz
    

    It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.
  • ersmith wrote: »
    @rogloh , what is your format for 16bpp character data? Is it 1 bit for a special effect, 4 bits for foreground color, 3 bits for background color, and 8 bits for the character? How are those arranged? I'd like to add a 16bpp mode to my VGA driver and it might as well be compatible with your DVI driver.

    It is the standard CGA/VGA text format with 16-bit cells. The LS byte is the text character; the MS byte holds a 4-bit foreground colour in bits 8-11 and a 3- (or 4-) bit background colour in bits 12-14 (or 12-15). Unless the driver is configured for 16 background colours, bit 15 is a text flash attribute, and in that case only the first 8 palette entries are used for background colours.
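
    As a rough illustration of that cell layout, the fields could be unpacked along the following lines. This is only a sketch, not the driver's actual code: the register names `cell`, `glyph`, `fg` and `bg` are hypothetical, and it assumes the flash attribute is enabled (8 background colours).

    ```pasm2
    ' Sketch only: unpack one 16-bit text cell held in "cell"
                    getbyte glyph, cell, #0         'bits 0-7: character code
                    getnib  fg, cell, #2            'bits 8-11: foreground colour
                    getnib  bg, cell, #3            'bits 12-15: background colour
                    testb   cell, #15 wc            'C = bit 15 (flash attribute in this mode)
                    and     bg, #7                  'keep 3 background bits; flags unchanged

    cell            long    0                       'hypothetical working registers
    glyph           long    0
    fg              long    0
    bg              long    0
    ```

    In a 16-background-colour configuration the final `and` would simply be left out, so that all four bits of `bg` select a palette entry.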
  • It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.
    Ok I'll look into that tomorrow. But pixels does not always equal longs unfortunately...

    There may still be some other optimizations buried there with any luck if you see any. I felt it got a little messy in the code to track the clipping and handle hotspot offsets and some reverse subtract code was used and that seems to set the flags differently to what I needed, so I had to do extra work there.
  • nice going, squeezing the mouse pointer code in too. I wouldn't have imagined all this could be done in a single cog
  • rogloh wrote: »
                shr     c, #1 wc                'get mask bit 1=set, 0=transparent
        if_c    setq    muxmask                 'apply bit mask to muxq mask
        if_nc   setq    #0                      'setup a transparent mouse pixel
                muxq    a, b                    'select original or mouse pixel
        if_c    wrlut   a, ptra                 'write back updated data if altered
    
                shr     c, #1 wc                'get mask bit 1=set, 0=transparent
                setq    muxmask                 'apply bit mask to muxq mask
        if_c    muxq    a, b                    'select mouse pixel if not transparent
        if_c    wrlut   a, ptra                 'write back updated data if altered
    
  • Tubular wrote: »
    nice going, squeezing the mouse pointer code in too. I wouldn't have imagined all this could be done in a single cog

    Thanks, same here. There's just about space left for a TERC4 encoder. That code needs ~200 longs and 16 LUT table entries plus quite a lot of extra state variables, but depending on how LUTRAM gets used, it might still eventually fit. It's a real juggling act. I can also try to reuse COGRAM buffer space across different functions like the mouse and pixel doubling code, unless that blows out the timing budget.

    The main annoyance is the 256 colour palette mode, which chews up half the LUTRAM. If it were a 4-bit LUT only, that would open up a huge amount of free space for code. I do think there is also time to load a palette per scanline instead of once per frame, which opens up the possibility of per-scanline mode regions. However, you are then at the mercy of the user failing to clear bit 1, as there certainly would not be time to do that clearing operation on all scanlines in an 8-bit LUT mode. There would be time if the palette gets loaded during vertical blanking, though I'm not doing that step myself yet either. It would be kind to do it, but it takes more COGRAM of course.

    I still want to get HyperRAM framebuffers in too. Still wanting to keep it all self contained. Lots of competing constraints.
  • TonyB_ wrote: »
                shr     c, #1 wc                'get mask bit 1=set, 0=transparent
                setq    muxmask                 'apply bit mask to muxq mask
        if_c    muxq    a, b                    'select mouse pixel if not transparent
        if_c    wrlut   a, ptra                 'write back updated data if altered
    
    Great. Makes total sense, had overlooked it. This is where other people's fresh eyes can help.
  • AJLAJL Posts: 457
    edited 2019-10-18 20:20
    rogloh wrote: »
    It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.
    Ok I'll look into that tomorrow. But pixels does not always equal longs unfortunately...

    There may still be some other optimizations buried there with any luck if you see any. I felt it got a little messy in the code to track the clipping and handle hotspot offsets and some reverse subtract code was used and that seems to set the flags differently to what I needed, so I had to do extra work there.

    Ok, I see that now. So, given that you set ptra to $110 before the pixel loop, then the number of longs to write is ptra - $110.

    If my figuring is correct this should work for all bit depths:
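                altd    ptra, #$EF              'next D field = (ptra + $EF) & $1FF = (ptra - $111) & $1FF
                setq2   0-0                     'D field (0-0) is patched by the altd above
                wrlong  $110, save              'write LUT image data back to hub
                ret     wcz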
    

    Edit: forgot the need to subtract one from the setq2 value. Code updated.
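
    The same arithmetic can also be written without patching an instruction field. The sketch below is not from the thread's driver; it assumes, as the post above does, that `ptra - $110` equals the number of longs touched and is at least 1, and that `ptra` is not needed again before the next call reloads it with `#$110`.

    ```pasm2
    ' Sketch only: explicit form of the "longs to write minus one" calculation
                    sub     ptra, #$111             'ptra = (longs to write) - 1
                    setq2   ptra                    'block length for the LUT-to-hub write
                    wrlong  $110, save              'write LUT image data back to hub
                    ret     wcz
    ```

    This is the same four longs as the `altd` version. If the loop can finish with `ptra` still at `$110`, the subtraction underflows, so that case would need a separate check.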
  • One stupid easy way to do clipping is to simply allocate undisplayed buffer RAM on either side of the scan line.

    Then just write the mouse data normally without clipping; all it needs is a normal bounds check.
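
    A minimal sketch of such a layout in hub RAM is shown below. The label names are hypothetical, and the sizes are assumptions: a 640-pixel line at 32bpp (640 longs) and guard regions of 16 longs each, matching the 16-pixel sprite width at 32bpp used earlier in the thread.

    ```pasm2
    DAT
    guard_left      long    0[16]                   'never displayed; absorbs writes left of x = 0
    linebuf         long    0[640]                  'the displayed scanline (assumed 640 pixels at 32bpp)
    guard_right     long    0[16]                   'never displayed; absorbs writes past the right edge
    ```

    With the guards in place, a sprite that hangs partly off either edge simply spills into memory that is never scanned out, so only a fully off-screen position has to be rejected.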