All PASM2 gurus - help optimizing a text driver over DVI?

jmg · 2019-10-16 18:32

cgracey wrote: »

Seeing there was some concern here about FIFO metrics, I just added this to the documentation:

The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.

Can you also add the low water mark number that evanh mentions ? - ie that (fixed?) trigger level, where another fill burst starts ?
From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5 ? (seems too low?)
A burst size of 8 gave an assumed load trigger of 11 in above discussion, but your wording suggests a larger burst ( but that may be initial load only ?)
The trigger level may not be hard-fixed, if it has to wait until below, and then wait for a slot-align ?
In this case of /10 empty rate, -1 is possible, but faster empty rates could remove many more until correct slot arrives.

cgracey · 2019-10-16 18:40

I changed the wording. Does this answer your question? (I'm wondering if this change does the job.)

The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, potentially 5 more longs stream in, possibly filling all (cogs+11) stages. These metrics ensure that the FIFO never underflows, under all potential reading scenarios.

jmg · 2019-10-16 19:37

cgracey wrote: »

The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+6) stages are filled, after which point 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows, under all potential scenarios.

That's better, I think you say here
* P2 FIFO size is 19L deep
* P2 FIFO trigger point is 14L
* BURST size is 5L, so it reads in 15,16,17,18,19
If one or more reads do occur, does the burst increase to 6, 7, 8 as needed, or is the burst size always fixed at 5 ?
Is the first read a continuous 19L, or does it burst 5,5,5,4 to give the CPU some bus time ?

cgracey · 2019-10-16 19:58

The first load is 19L, assuming no data was read from the FIFO (ie RFLONG). If continuous RFLONGs occur, it will just load continuously, since that is the requirement. Once ANY data is in the FIFO, though, reading can begin.

jmg · 2019-10-16 21:36

Thanks. Is the burst fill always 5L, or can it increase so that it always stops at Full = 19L ? I'm guessing a fill to 19L works better ?
Taking the 10c empty rate, I get this for the FIFO and Streamer-BUS lockouts, for a burst size of 5

Rd Clock of 10c
111111111111111111111111111111111111111111wwww1111111111111111111111111111111111111111111111wwww111111
988888888887777777777666666666655555555554wwww5679988888888887777777777666666666655555555554wwww567998   FIFO Data
 /         /         /         /         /         /         /         /         /         /         /   RdCLK (-1)
                                              /////                                             /////    Wr CLK (+1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~wwwwSSSSS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~SSSSS~~~ Streamer has BUS

Rd Clock of 2c (based on Fill-to-19)
1111111111111111111111111111111111111111111111111111111111111111
9887766554433233445566778898877665544332334455667788998877665544
 / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /  Rd CLK (-1)
              /////////////             /////////////               Wr CLK (+1)
~~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~~ Streamer has BUS

Rd Clock of 2c (based on 5 loads from 14, correct)
1111111111111111111111111111111111111111111111111111111111111111
9887766554433233445566766554433233445566734455667788998877665544
 / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /  Rd CLK (-1)
              ////////         /////////                            Wr CLK (+1)
~~~~~~~~~~~wwwSSSSSSSS~~~~~wwwwSSSSSSSSS~~~~~~~~~~~ Streamer has BUS
                       123456781


w = wait for streamer start slot match  S = Streamer FIFO write clock.

The effect of the 5 x S on user HUB Data BUS access would be a stall of +8, as the access is forced to skip one slot and wait for the next go-around.
For the other 45 clocks of the 50c repeat, the bus is free, and the usual wait for the go-around applies to any HUB Data BUS access.
At higher streamer read rates the 50 shrinks and S widens, as some reads-during-write occur; the limit of the useful cases looks to be 2c, where the streamer has the bus for 13c of 26c time slots ?
Streamers at 1c use 100% of the BUS, so there are no HUB Data BUS slots for that streamer's COG.
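
The bus-share figures above (5 of 50 clocks at 10c, 13 of 26 at 2c, all of them at 1c) can be checked with a little arithmetic. This is only a sketch in Python, not anything from the P2 documentation: it assumes every long the streamer consumes costs exactly one hub bus clock, and it ignores the slot-wait clocks marked w in the diagrams. The function names are made up for illustration.

```python
from fractions import Fraction


def streamer_bus_share(read_period_clocks):
    """Long-run fraction of clocks spent refilling the FIFO.

    Assumption: one bus clock per long consumed, slot-wait clocks ignored.
    """
    if read_period_clocks < 1:
        raise ValueError("read period must be at least 1 clock")
    return Fraction(1, read_period_clocks)


def refill_cycle(read_period_clocks, burst_longs):
    """Return (clocks per refill cycle, clocks in that cycle with the bus free)."""
    cycle = burst_longs * read_period_clocks
    return cycle, cycle - burst_longs


if __name__ == "__main__":
    for period, burst in ((10, 5), (2, 13), (1, 5)):
        cycle, free = refill_cycle(period, burst)
        print(period, streamer_bus_share(period), cycle, free)
```

Under these assumptions the share depends only on the read period, not on how long each burst turns out to be; the burst length only changes how the busy clocks are grouped.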

evanh · 2019-10-17 00:05

Chip,
The wording of "after which point 5 more longs stream in" seems to imply a possible force feeding. ie: The FIFO didn't address those locations in hubRAM but is still getting them anyway.

evanh · 2019-10-17 00:21

jmg wrote: »

That's better, I think you say here
* P2 FIFO size is 19L deep
* P2 FIFO trigger point is 14L
* BURST size is 5L, so it reads in 15,16,17,18,19

It's six, #14 is included. If I'm reading into what Chip has said correctly then that is also the minimum burst length, not fixed. Any consumption from the FIFO during refilling will extend the refilling. It's a little fancier than I first envisioned.

jmg · 2019-10-17 01:27

evanh wrote: »

It's six, #14 is included.

I'm just using Chip's numbers: when the FIFO gets down to 14, the next empty slot is 15, so it can fill 15,16,17,18,19, which is 5.

evanh wrote: »

If I'm reading into what Chip has said correctly then that is also the minimum burst length, not fixed. Any consumption from the FIFO during refilling will extend the refilling. It's a little fancier than I first envisioned.

That's what I've assumed above, that it can burst load 5 or more, and fills to 19.
With /2c readout rate, and a 3c Wait, I get 7 bytes read out, during the 3wait+13write clocks, burst length here is 13.

evanh · 2019-10-17 02:00

Yeah, maybe it is only 5 minimum. So that's the high and low marks.

It also means determinism is out the door. The FIFO burst starts are going to be on a rotation themselves. There's no point trying to arrange for ideal timing of the SETQ+RDLONG.

cgracey · 2019-10-17 03:10

It's like it turns the faucet on when it's under 14 stages full. As soon as it gets to 14 stages full, it turns the faucet off, but there could be five more stages of data streaming in because of register delays.
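
The faucet description can be turned into a toy model. The Python sketch below is not Chip's simulator and makes several assumptions that are not in the thread: the on/off decision uses the level as of the previous clock, each requested long arrives exactly 5 clocks later, the reader takes one long every read_period clocks, and hub slot alignment (the w clocks in the earlier diagrams) is ignored. All names are hypothetical.

```python
from collections import deque


def simulate_fifo(cogs=8, read_period=None, clocks=400, latency=5):
    """Return (levels, requests), one entry per clock, for a toy read-mode FIFO.

    Assumptions (not from the P2 docs): the faucet decision is registered,
    every request delivers one long `latency` clocks later, and hub slot
    alignment is ignored.
    """
    depth = cogs + 11       # total stages, 19 for 8 cogs
    threshold = cogs + 6    # faucet is on while fewer than 14 stages are full
    level = 0
    in_flight = deque([False] * latency)
    levels, requests = [], []
    for clk in range(clocks):
        request = level < threshold      # decided on the previous clock's level
        arrived = in_flight.popleft()    # long requested `latency` clocks ago
        in_flight.append(request)
        if arrived:
            level += 1
        if read_period and clk % read_period == 0 and level > 0:
            level -= 1                   # reader takes one long
        assert 0 <= level <= depth       # the model never overfills the FIFO
        levels.append(level)
        requests.append(request)
    return levels, requests


if __name__ == "__main__":
    levels, _ = simulate_fifo()                    # no reads at all
    print("settles at", levels[-1])
    levels, _ = simulate_fifo(read_period=10)
    print("10c range", min(levels[100:]), max(levels[100:]))
```

With no reads this model settles at cogs+11 = 19 stages, and with a read on every clock the faucet never turns off, which matches the "could even be continuous" remark. The exact burst lengths it produces depend on the assumed decision timing, so they should not be read as hardware figures.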

cgracey · 2019-10-17 03:12

I was having an impossible time trying to figure out how many stages to make the FIFO. In the end, I made a simulator that did all kinds of worst-case banging on the FIFO, in order to learn the equation for the number of stages required. It was a lot more than I had figured before.

jmg · 2019-10-17 03:58

cgracey wrote: »

I was having an impossible time trying to figure out how many stages to make the FIFO.
In the end, I made a simulator that did all kinds of worst-case banging on the FIFO, in order to learn the equation for the number of stages required.
It was a lot more than I had figured before.

Good idea, because this gets complex very quickly....
Does that simulator report the user bandwidth ? - can a web calculator, or such a simulator be used to give an indication or range of bandwidths for various Streamer read rates ?

cgracey wrote: »

It's like it turns the faucet on when it's under 14 stages full. As soon as it gets to 14 stages full, it turns the faucet off, but there could be five more stages of data streaming in because of register delays.

Hmm, so that's sounding a firmly fixed 5 size on the burst ? (I've modified the 2c example above for this 5-from-14 rule )
Because the FIFO fills on every clock, and register delays should be fixed, is there ever <> 5 ?

evanh · 2019-10-17 04:13

Oh, it is like a force feed then. Lol, I thought I was pointing out poor wording.

I guess that means all reads from hubRAM have this trailing buffering. So a single RDLONG has five subsequent longwords that are discarded.

cgracey · 2019-10-17 04:24

jmg wrote: »

Because the FIFO fills on every clock, and register delays should be fixed, is there ever <> 5 ?

It could be >5 if data is being read out during the FIFO reload. It could even be continuous.

cgracey · 2019-10-17 04:29

evanh wrote: »

Oh, it is like a force feed then. Lol, I thought I was pointing out poor wording.

I guess that means all reads from hubRAM have this trailing buffering. So a single RDLONG has five subsequent longwords that are discarded.

Only hub-exec and RDFAST engage the FIFO in reading the hub, so that many longs are read in. RDxxxx only reads one or two longs, in cases of overlap on a word or a long.

evanh · 2019-10-17 04:45

But you said there was a bunch of buffering registers that drain into the FIFO for 5 clocks after the FIFO has stopped addressing hubram. Are those buffers not always in place as part of the egg-beater? Won't SETQ+RDLONG need them too?

jmg · 2019-10-17 04:47

cgracey wrote: »

It could be >5 if data is being read out during the FIFO reload.

I probably did not word that well.
The question should have been focused on the tail, ie once 14 is reached, and the 5 more rule starts, can anything change the 5-more ?

In the 2c example above, I now have fill bursts of 8 or 9, after applying a nominal wait-for-slot and also the 5-after-14 rule.
Assuming that's modeled right, that gives gaps of 8 or 9 (for that wait case), which is just enough to ensure a slot exists in every gap between fill-bursts for a possible user-data access.
There may be other wait cases that shrink the gap to < 8, and that means a user data access may (rarely) stretch across 2 FIFO fills+gaps before it hits a slot align.

cgracey wrote: »

It could even be continuous.

It is continuous FIFO load, if read is done at 1c ?

rogloh · 2019-10-17 06:49

AJL wrote: »

As you aren't likely to be switching modes from one line to the next, could you set a flag somewhere when the code is generated, and skip the generation if it is already there?

Then the generation happens once at start of frame, and it collapses to a simple flag check each line.

It could be done like that but that approach probably burns more COGRAM.
1) It needs to set a flag somewhere.
2) There needs to be a flag available (perhaps could be shared in another reg if another one is suitable for this)
3) It needs to test this flag to see if it needs to generate the code.
4) It needs to branch
That's probably up to 3 or 4 more registers consumed. Simpler to do it every time you want it and not need to check. Once there are cycles available to do this the work is not a problem.

While I'm not likely to switch modes from one line to the next in the first release of this driver, I'd quite like to be able to in the future to open up the concept of supporting display region lists... though in the worst case reading in a 256 entry palette each scanline is not ideal, and there certainly won't be time to clean it up to stop DVI locking up if bit 1 is set in the long. There would be time in the vertical blanking to do this however.

AJL wrote: »

If you do switch modes from line to line, you regenerate as necessary, with the code that overwrites the routine clearing the flag.

Addit: Or use self-modifying code to switch between the generation routine and the generated routine as needed.

I'd expect we will need to make use of self modifying code once we get to that point. And reclaiming COG/LUT space dynamically in different portions of the scanline to get everything to fit. I'm still trying to keep that in the back of my mind as I go along here to try not to preclude it. Something may possibly have to be dropped there to make room. Maybe pixel doubling won't be possible with it, or I'll need to read in dynamic code blocks from HUB, which I'm still hoping to avoid for stability reasons. Eg. you have an errant COG you are debugging using the video driver to display memory or other state etc and this errant COG goes and kills the video driver executable code. Annoying, but instead if the video driver stays up the whole time, you might be able to continue debugging things (to a point anyway).

cgracey · 2019-10-17 11:06

jmg wrote: »

The question should have been focused on the tail, ie once 14 is reached, and the 5 more rule starts, can anything change the 5-more ?

It is continuous FIFO load, if read is done at 1c ?

Once a cog gives the eggbeater a command to read, it takes 5 clocks for the data to come out of the eggbeater. It doesn't matter if SETQ is used, or not.

For hub-exec and RDFAST, cycle by cycle, data is coming in from the eggbeater into the FIFO, while data is being read out of the FIFO. This is why very long eggbeater bursts (more than 5 extra) can occur. Demand must be met.

rogloh · 2019-10-18 09:08

I was finally able to get my mouse sprite code in and working in all colour modes with clipping. It fits, takes 60 COG longs, plus the call setup overhead. The 4 temporary working longs can be reclaimed from elsewhere.

Clipping code gets tedious and tricky to deal with and it gave me some grief with the flags. Be nice if there were some optimisations found there, if anyone knows a better way...

There is one remaining side effect I think where the hub data may overwrite 15 longs beyond the last scanline buffer with the original data that it reads from that line. If this becomes an issue it could be accommodated by padding out the line buffer area by 60 extra bytes at the end, so I might just document that and leave it. Otherwise there would need to be some additional code put in to compute and enforce write size limits that I was trying to avoid as it increases the size of the code further than I'd like, though I will at least look into how to do it...

do_mouse
            getword a, mouse_xy, #1         'get mouse y screen co-ordinate
            getnib  b, mouseptr, #7         'get y hotspot of mouse image
            sub     a, b                    'compensate for the y hotspot
            subr    a, scanline             'compute sprite row offset     
            cmpr    a, #15 wc               'check if sprite covers scanline
            
            alts    bpp, #bittable          'bpp starts out as index from 0-5
            mov     bitmask, 0-0            'get table entry using bpp index
            mul     a, bitmask              'multiply mouse row by its length
            shr     bitmask, #16 wz         'extract mask portion
    if_z    not     bitmask                 'fix up the 32 bpp case
            mov     bpp, bitmask
            ones    bpp                     'convert into real bpp

            add     a, mouseptr             'add offset to base mouse address
            setq2   #17-1                   'get 17 longs max, mouse mask+image
            rdlong  $120, a                 'read mouse data and store in LUT
                
            getword offset, mouse_xy, #0    'get mouse x screen co-ordinate 
            getnib  b, mouseptr, #6         'get x hotspot of mouse image

            mov     pixels, offset
            sub     offset, b               'compensate for the x hotspot
   if_nc    subr    pixels, ##640 wcz       'compute pixels until end of line
            add     pixels, b               'increase by the x hotspot amount
            fle     pixels, #16             'limit drawn pixels to 16

 if_c_or_z  ret     wcz                     'exit if sprite is out of x/y range

            mov     ptrb, #$120             'ptrb is used for mouse image data
            rdlut   c, ptrb++               'read in the mouse mask first

            abs     a, offset               'retain bits
            muls    offset, bpp             'convert number of pixels into bits
            abs     b, offset wc            'test for negative value (clipped)
            mov     muxmask, bitmask        'setup mask for pixel's data size
    if_nc   rol     muxmask, b              'align mask for first data pixel
    if_c    rol     bitmask, b              'align mask for first mouse pixel
    if_c    shr     c, a                    'eliminate mouse pixels if clipped
            shr     b, #5                   'convert bits to longs
    if_c    add     ptrb, b                 'advance mouse data to skip pixels

            shl     b, #2                   'convert longs to bytes
    if_nc   add     save, b                 'adjust scanline buffer position                

            setq2   #16-1                   'read 16 scanline longs into LUT
            rdlong  $110, save              'using adjusted hub read position

            mov     ptra, #$110             'ptra used for source image data
            rdlut   a, ptra                 'get original scanline pixel data
            test    $, #1 wc                'c=1 will trigger initial read

            rep     @endmouse, pixels       'repeat loop for 16 pixels
    if_c    rdlut   b, ptrb++               'get next mouse sprite pixel(s)
    if_c    rol     b, offset               'align with the source input data
            shr     c, #1 wc                'get mask bit 1=set, 0=transparent
    if_c    setq    muxmask                 'apply bit mask to muxq mask
    if_nc   setq    #0                      'setup a transparent mouse pixel
            muxq    a, b                    'select original or mouse pixel
    if_c    wrlut   a, ptra                 'write back updated data if altered
            rol     muxmask, bpp wc         'advance mask by 1,2,4,8,16,32 bits
    if_c    rdlut   a, ++ptra               '...and read next source pixel(s)
            rol     bitmask, bpp wc         'rotate mask for mouse data reload

endmouse
            setq2   #16-1 
            wrlong  $110, save              'write LUT image data back to hub
            ret     wcz

bittable
            long    $0001_00_08             '1bpp
            long    $0003_00_08             '2bpp
            long    $000F_00_0C             '4bpp
            long    $00FF_00_14             '8bpp
            long    $FFFF_00_24             '16bpp
            long    $0000_00_44             '32bpp 

bitmask     long    0  'TODO eventually share other temporary register space
muxmask     long    0  'in order to eliminate these temp working variables
offset      long    0 
bpp         long    0
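
For readers following the bittable lookup, here is a small Python model of what the first few instructions of do_mouse appear to extract from each table entry. It is a sketch based on reading the listing, not on the driver's documentation: the low 16 bits are taken as the sprite row length in bytes (MUL uses the low 16 bits), the high 16 bits as the per-pixel mask, and the "one mask long plus padded image longs" layout is an inference from the 17-long read.

```python
BITTABLE = [
    0x0001_00_08,  # 1bpp
    0x0003_00_08,  # 2bpp
    0x000F_00_0C,  # 4bpp
    0x00FF_00_14,  # 8bpp
    0xFFFF_00_24,  # 16bpp
    0x0000_00_44,  # 32bpp
]


def decode_entry(entry):
    """Model of: mul a,bitmask / shr bitmask,#16 wz / if_z not bitmask / ones bpp."""
    row_bytes = entry & 0xFFFF        # MUL only uses the low 16 bits
    mask = entry >> 16                # shr bitmask, #16 wz
    if mask == 0:                     # if_z not bitmask (the 32 bpp case)
        mask = 0xFFFF_FFFF
    bpp = bin(mask).count("1")        # ones bpp
    return row_bytes, mask, bpp


def expected_row_bytes(bpp):
    """Assumed layout: one 4-byte mask long plus 16 pixels padded to whole longs."""
    image_longs = (16 * bpp + 31) // 32
    return 4 + 4 * image_longs


if __name__ == "__main__":
    for entry in BITTABLE:
        print(decode_entry(entry))
```

If that layout assumption is right, every row length in the table equals expected_row_bytes(bpp), and the 32bpp row is 17 longs, which agrees with the setq2 #17-1 read.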

Edit: posted too soon. Last change I put in that I thought was innocuous and worked in one mode actually broke some clipping of colour modes I'd tested earlier... :frown: Very close though.

Edit2: Just found the line I'd changed out in my code and put it back. Now clipping is working as I wanted. Code adjusted above and increased by one long. Also one constant used was with ## so I guess it is now 62 longs.

I think I found a workaround for the minor issue described earlier, adds 4 longs though if I share the ##640 constant as a separate register instead of ##.

...
endmouse

            mul     bpp, ##640
            sub     bpp, offset
            shr     bpp, #5
            fle     bpp, #15

            setq2   bpp ' was #16-1 
            wrlong  $110, save              'write LUT image data back to hub
            ret     wcz
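
The write-limit arithmetic in this workaround can be modelled in a few lines of Python. This is a sketch under assumptions, not a statement about the real driver: it assumes bpp holds the real bits per pixel at endmouse, that offset holds the signed x offset already multiplied by bpp (from the earlier muls), and that save was advanced to the long containing the first pixel. The function names are hypothetical.

```python
LINE_PIXELS = 640


def setq2_value(bpp, x_offset_pixels):
    """Model of: mul bpp,##640 / sub bpp,offset / shr bpp,#5 / fle bpp,#15."""
    offset_bits = x_offset_pixels * bpp                   # earlier 'muls offset, bpp'
    value = (bpp * LINE_PIXELS - offset_bits) & 0xFFFF_FFFF
    value >>= 5                                           # bits to longs
    return min(value, 15)                                 # fle bpp, #15


def longs_left_in_line(bpp, x_offset_pixels):
    """Longs from the first written long to the end of the line buffer.

    Assumption: a negative (left-clipped) offset leaves 'save' at the line start.
    """
    start_long = max(x_offset_pixels, 0) * bpp // 32
    return bpp * LINE_PIXELS // 32 - start_long


if __name__ == "__main__":
    for x in (0, 630, 636, 637, 639):
        print(x, setq2_value(8, x) + 1, longs_left_in_line(8, x))
```

Since SETQ2 with value N transfers N+1 longs, the printout compares the longs written with the longs left. In this model the two agree when the first pixel is not long-aligned, but an exactly long-aligned start (for example x = 636 at 8bpp) comes out one long higher, which may be worth checking against the real register contents.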

ersmith · 2019-10-18 12:02

@rogloh , what is your format for 16bpp character data? Is it 1 bit for a special effect, 4 bits for foreground color, 3 bits for background color, and 8 bits for the character? How are those arranged? I'd like to add a 16bpp mode to my VGA driver and it might as well be compatible with your DVI driver.

AJL · 2019-10-18 12:34

It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.

rogloh · 2019-10-18 13:00

ersmith wrote: »

@rogloh , what is your format for 16bpp character data? Is it 1 bit for a special effect, 4 bits for foreground color, 3 bits for background color, and 8 bits for the character? How are those arranged? I'd like to add a 16bpp mode to my VGA driver and it might as well be compatible with your DVI driver.

It is the standard CGA/VGA text format, 16 bits per character. The LS byte is the text character. The MS byte holds a 4 bit foreground colour in bits 8-11 and a 3 (or 4) bit background colour in bits 12-14 (or 12-15). If the driver is not configured for 16 background colours, bit 15 is a text flash attribute instead, and only the first 8 palette entries are used for background colours.
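To make the layout described above concrete, here is a minimal host-side sketch in Python that splits one 16-bit text cell into its fields. The function name and the `sixteen_bg_colours` flag are made up for illustration and are not part of the driver.

```python
def decode_text_cell(cell, sixteen_bg_colours=False):
    """Split a 16-bit CGA/VGA-style text cell into (char, fg, bg, flash)."""
    char = cell & 0xFF                 # LS byte: character code
    fg = (cell >> 8) & 0xF             # bits 8-11: foreground colour
    if sixteen_bg_colours:
        bg = (cell >> 12) & 0xF        # bits 12-15: background colour
        flash = False                  # no flash attribute in this configuration
    else:
        bg = (cell >> 12) & 0x7        # bits 12-14: background colour
        flash = bool((cell >> 15) & 1) # bit 15: text flash attribute
    return char, fg, bg, flash
```

For example, the cell `$9F41` decodes as character `$41` with foreground 15; the background is 1 with flash set in the 8-colour configuration, or 9 with no flash in the 16-colour one.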

rogloh · 2019-10-18 13:08

It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.

Ok I'll look into that tomorrow. But pixels does not always equal longs unfortunately...

With any luck there are still some other optimizations buried in there, if you can see any. I felt the code got a little messy tracking the clipping and handling hotspot offsets. Some reverse subtract code was used, and that seems to set the flags differently to what I needed, so I had to do extra work there.

Tubular · 2019-10-18 13:19

nice going, squeezing the mouse pointer code in too. I wouldn't have imagined all this could be done in a single cog

TonyB_ · 2019-10-18 13:42

rogloh wrote: »

            shr     c, #1 wc                'get mask bit 1=set, 0=transparent
    if_c    setq    muxmask                 'apply bit mask to muxq mask
    if_nc   setq    #0                      'setup a transparent mouse pixel
            muxq    a, b                    'select original or mouse pixel
    if_c    wrlut   a, ptra                 'write back updated data if altered


            shr     c, #1 wc                'get mask bit 1=set, 0=transparent
            setq    muxmask                 'apply bit mask to muxq mask
    if_c    muxq    a, b                    'select mouse pixel if not transparent
    if_c    wrlut   a, ptra                 'write back updated data if altered
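One way to check that the shorter sequence behaves the same as the original is to model it on the host. The Python below is only a sketch: it assumes MUXQ computes (D & !Q) | (S & Q), and the function names are invented for the comparison. The conditional write-back is the same in both versions, so only the value of `a` is compared.

```python
MASK32 = 0xFFFFFFFF

def muxq(d, s, q):
    """Model of MUXQ: bits of s replace bits of d wherever q has a 1."""
    return ((d & ~q) | (s & q)) & MASK32

def original(a, b, muxmask, c):
    # if_c setq muxmask / if_nc setq #0 / muxq a, b
    q = muxmask if c else 0
    return muxq(a, b, q)

def shortened(a, b, muxmask, c):
    # setq muxmask / if_c muxq a, b
    return muxq(a, b, muxmask) if c else a
```

With C clear, the original runs MUXQ with an all-zero mask, which leaves `a` untouched, so skipping the MUXQ altogether gives the same result.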

rogloh · 2019-10-18 13:42

Tubular wrote: »

nice going, squeezing the mouse pointer code in too. I wouldn't have imagined all this could be done in a single cog

Thanks, same here. There's just about enough space left for a TERC4 encoder. That code needs ~200 longs and 16 LUT table entries plus quite a lot of extra state variables, but depending on how LUTRAM gets used, it might still eventually fit. It's a real juggling act. I can also try to reuse COGRAM buffer space across different functions, like the mouse and pixel doubling stuff, unless that blows out the timing budget.

The main annoyance is the 256 colour palette mode, which chews up half the LUTRAM. If it was a 4 bit LUT only, that would open up a huge amount of free space for code. I do think there is time to load a palette per scanline instead of once per frame, which opens up the possibility of per scanline mode regions. However, then you are at the mercy of the user not clearing bit 1: there certainly would not be time to do that clearing operation on every scanline in an 8 bit LUT mode, whereas there would be if the palette gets loaded during vertical blanking. I'm not doing that step myself yet either. It would be kind to do it, but it takes more COGRAM of course.

I still want to get HyperRAM framebuffers in too. Still wanting to keep it all self contained. Lots of competing constraints.

rogloh · 2019-10-18 13:45

TonyB_ wrote: »

            shr     c, #1 wc                'get mask bit 1=set, 0=transparent
            setq    muxmask                 'apply bit mask to muxq mask
    if_c    muxq    a, b                    'select mouse pixel if not transparent
    if_c    wrlut   a, ptra                 'write back updated data if altered

Great. Makes total sense, had overlooked it. This is where other people's fresh eyes can help.

AJL · 2019-10-18 14:56

rogloh wrote: »

It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.

Ok I'll look into that tomorrow. But pixels does not always equal longs unfortunately...

With any luck there are still some other optimizations buried in there, if you can see any. I felt the code got a little messy tracking the clipping and handling hotspot offsets. Some reverse subtract code was used, and that seems to set the flags differently to what I needed, so I had to do extra work there.

Ok, I see that now. So, given that you set ptra to $110 before the pixel loop, then the number of longs to write is ptra - $110.

If my figuring is correct this should work for all bit depths:

  altd ptra, #$EF
  setq2 0-0
  wrlong $110, save
  ret wcz

Edit: forgot the need to subtract one from the setq2 value. Code updated.
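The `#$EF` constant is easier to follow with the arithmetic written out. The Python sketch below assumes ALTD substitutes (D + S) & $1FF into the D field of the next instruction. It checks only that adding $EF modulo 512 is the same as subtracting $111 (that is, $110 plus the extra one), not how SETQ2 then interprets the substituted field. The helper name is hypothetical.

```python
LUT_BASE = 0x110   # ptra start value set before the pixel loop

def altd_field(ptra, s=0xEF):
    """9-bit D field that ALTD is assumed to substitute: (D + S) & $1FF."""
    return (ptra + s) & 0x1FF

# For ptra one to sixteen longs past the base, the field equals ptra - $110 - 1.
for ptra in range(LUT_BASE + 1, LUT_BASE + 17):
    assert altd_field(ptra) == ptra - LUT_BASE - 1
```

Note that if ptra were still exactly $110, the field would wrap to $1FF rather than give -1, so the trick relies on ptra having advanced by at least one.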

potatohead · 2019-10-18 17:03

One stupid easy way to do clipping is to simply allocate undisplayed buffer RAM on either side of the scan line.

Then, just write the mouse data normally without clipping, just a normal bounds check.
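A rough sketch of that padding idea, modelled in Python on the host rather than in PASM2. The 640 pixel width, the 16 pixel sprite and all of the names are assumptions chosen for illustration: the scanline buffer carries an undisplayed margin on each side, the sprite is written with no per-pixel clipping, and only the middle part is displayed.

```python
VISIBLE = 640       # displayed pixels per scanline (assumed)
SPRITE_W = 16       # mouse sprite width in pixels (assumed)
MARGIN = SPRITE_W   # undisplayed padding on each side of the line

def new_line():
    """Scanline buffer with a margin on either side of the visible area."""
    return [0] * (VISIBLE + 2 * MARGIN)

def draw_sprite(line, x, sprite, mask):
    """Draw the sprite at screen position x, which may be partly off-screen."""
    if x <= -SPRITE_W or x >= VISIBLE:   # the single bounds check
        return
    base = MARGIN + x                    # always lands inside the padded buffer
    for i in range(SPRITE_W):
        if mask[i]:                      # 1 = sprite pixel, 0 = transparent
            line[base + i] = sprite[i]

def visible(line):
    """The part of the buffer that actually gets displayed."""
    return line[MARGIN:MARGIN + VISIBLE]
```

The trade-off is extra buffer RAM (two sprite widths per line here) in exchange for removing the clipping arithmetic from the inner loop.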
