Seeing there was some concern here about FIFO metrics, I just added this to the documentation:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.
Can you also add the low-water-mark number that evanh mentions? I.e. that (fixed?) trigger level where another fill burst starts?
From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5? (That seems too low.)
A burst size of 8 gave an assumed load trigger of 11 in the discussion above, but your wording suggests a larger burst (though that may apply to the initial load only?).
The trigger level may not be hard-fixed, if the FIFO has to wait until it is below the mark and then wait for a slot-align?
In this case of a /10 empty rate, -1 is possible, but faster empty rates could remove many more longs before the correct slot arrives.
I changed the wording. Does this answer your question? (I'm wondering if this change does the job.)
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, potentially 5 more longs stream in, possibly filling all (cogs+11) stages. These metrics ensure that the FIFO never underflows, under all potential reading scenarios.
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+6) stages are filled, after which point 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows, under all potential scenarios.
That's better, I think you say here
* P2 FIFO size is 19L deep
* P2 FIFO trigger point is 14L
* BURST size is 5L, so it reads in 15,16,17,18,19
If reads do occur, does the burst increase to 6, 7, 8 as needed, or is the burst size always fixed at 5?
Is the first read a continuous 19L, or does it burst 5,5,5,4, giving some CPU bus time?
The first load is 19L, assuming no data was read from the FIFO (ie RFLONG). If continuous RFLONGs occur, it will just load continuously, since that is the requirement. Once ANY data is in the FIFO, though, reading can begin.
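To make the numbers in this exchange easy to check, here is a minimal sketch of the arithmetic, assuming an 8-cog P2 and the (cogs+6) wording of the documentation text quoted above. The function name `fifo_metrics` and the dictionary keys are made up for illustration; they are not from any Parallax tool.

```python
def fifo_metrics(cogs):
    """Return the FIFO figures implied by the quoted documentation wording.

    Assumption: the fill threshold is (cogs + 6) and up to 5 further longs
    stream in after the threshold is reached, as the quoted text states.
    """
    depth = cogs + 11          # total FIFO stages
    threshold = cogs + 6       # loading continues while fewer stages are filled
    overrun = 5                # longs that may still stream in afterwards
    return {
        "depth": depth,
        "threshold": threshold,
        "overrun": overrun,
        # stage numbers filled by the trailing longs if nothing was read out
        "tail_stages": list(range(threshold + 1, threshold + overrun + 1)),
    }


m = fifo_metrics(8)
print(m["depth"], m["threshold"], m["tail_stages"])
```

With cogs = 8 this reproduces the 19L depth, the 14L trigger point and the trailing stages 15 to 19 listed above.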
Thanks. Is the burst fill always 5L, or can that increase so that it always stops at full = 19L? I'm guessing a fill to 19L works better?
Taking the 10c empty rate, I get this for the FIFO and streamer bus lockouts, for a burst size of 5:
Rd Clock of 10c
111111111111111111111111111111111111111111wwww1111111111111111111111111111111111111111111111wwww111111
988888888887777777777666666666655555555554wwww5679988888888887777777777666666666655555555554wwww567998 FIFO Data
/ / / / / / / / / / / RdCLK (-1)
///// ///// Wr CLK (+1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~wwwwSSSSS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~SSSSS~~~ Streamer has BUS
Rd Clock of 2c (based on Fill-to-19)
1111111111111111111111111111111111111111111111111111111111111111
9887766554433233445566778898877665544332334455667788998877665544
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / Rd CLK (-1)
///////////// ///////////// Wr CLK (+1)
~~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~~ Streamer has BUS
Rd Clock of 2c (based on 5 loads from 14, correct )
1111111111111111111111111111111111111111111111111111111111111111
9887766554433233445566766554433233445566734455667788998877665544
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / Rd CLK (-1)
//////// ///////// Wr CLK (+1)
~~~~~~~~~~~wwwSSSSSSSS~~~~~wwwwSSSSSSSSS~~~~~~~~~~~ Streamer has BUS
123456781
w = wait for streamer start slot match; S = streamer FIFO write clock.
The effect of the 5 x S on user hub data bus access would be a stall of +8, as the access is forced to skip one slot and wait for the next go-around.
For the other 45 clocks of the 50c repeat the bus is free, and the usual wait for the go-around applies to any hub data bus access.
At higher streamer read rates the 50 shrinks and S widens as some reads-during-write occur; the limiting useful case looks to be 2c, where the streamer has the bus for 13c of 26c time slots?
Streamers at 1c use 100% of the bus, so there are no hub data bus slots left for that streamer's cog.
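The three cases above (5 of 50 clocks at 10c, 13 of 26 at 2c, all of them at 1c) are consistent with a simple steady-state rule: every long read out of the FIFO must eventually be replaced by one FIFO load clock. A small sketch of that arithmetic, with the helper name `streamer_bus_share` invented for illustration and slot-alignment waits ignored:

```python
from fractions import Fraction


def streamer_bus_share(read_period_clocks):
    """Average fraction of clocks the FIFO needs the hub bus in steady state.

    Assumption: one FIFO load clock per long read out, so a streamer that
    reads one long every N clocks needs the bus 1/N of the time on average.
    """
    return Fraction(1, read_period_clocks)


for period in (10, 2, 1):
    share = streamer_bus_share(period)
    print(f"{period}c read rate: FIFO uses {share} of the bus, "
          f"{1 - share} left over")
```

This is only an average; it says nothing about how the load clocks bunch into bursts or how long a given user access has to wait for its slot.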
Chip,
The wording of "after which point 5 more longs stream in" seems to imply a possible force feeding. ie: The FIFO didn't address those locations in hubRAM but is still getting them anyway.
That's better, I think you say here
* P2 FIFO size is 19L deep
* P2 FIFO trigger point is 14L
* BURST size is 5L, so it reads in 15,16,17,18,19
It's six, #14 is included. If I'm reading into what Chip has said correctly then that is also the minimum burst length, not fixed. Any consumption from the FIFO during refilling will extend the refilling. It's a little fancier than I first envisioned.
If I'm reading into what Chip has said correctly then that is also the minimum burst length, not fixed. Any consumption from the FIFO during refilling will extend the refilling. It's a little fancier than I first envisioned.
That's what I've assumed above, that it can burst load 5 or more, and fills to 19.
With /2c readout rate, and a 3c Wait, I get 7 bytes read out, during the 3wait+13write clocks, burst length here is 13.
Yeah, maybe it is only 5 minimum. So that's the high and low marks.
It also means determinism is out the door. The FIFO burst starts are going to be on a rotation themselves. There's no point trying to arrange for ideal timing of the SETQ+RDLONG.
It's like it turns the faucet on when it's under 14 stages full. As soon as it gets to 14 stages full, it turns the faucet off, but there could be five more stages of data streaming in because of register delays.
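The faucet description can be turned into a toy clock-by-clock model. This is only a sketch under stated assumptions: 8 cogs, a threshold of (cogs+6), a fixed 5-clock delay between requesting a long and it landing in the FIFO, and no wait for hub slot alignment. It is not the simulator mentioned in this thread, and `simulate_fifo` and its parameters are hypothetical names.

```python
from collections import deque


def simulate_fifo(cogs=8, read_period=None, clocks=200, latency=5):
    """Toy model of the read-mode FIFO 'faucet'.

    read_period: clocks between reads out of the FIFO (None = never read).
    latency:     assumed clocks between a request and the long arriving.
    Returns (peak_level, bursts); bursts holds the lengths of each run of
    consecutive request clocks.
    """
    threshold = cogs + 6
    level = 0
    in_flight = deque([0] * latency)   # longs requested but not yet arrived
    peak = 0
    bursts, run = [], 0
    for clk in range(clocks):
        # Faucet decision uses the level as it stood at the start of the clock.
        request = level < threshold
        level += in_flight.popleft()   # a long requested `latency` clocks ago
        in_flight.append(1 if request else 0)
        peak = max(peak, level)
        if request:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
        # Consumer side: take one long every read_period clocks if available.
        if read_period and (clk + 1) % read_period == 0 and level > 0:
            level -= 1
    if run:
        bursts.append(run)
    return peak, bursts


print(simulate_fifo())                 # no reads: a single initial fill
print(simulate_fifo(read_period=1))    # 1c reads: the faucet never turns off
print(simulate_fifo(read_period=10))
```

With no reads the model issues one 19-long fill and stops with the FIFO full; with a read on every clock the faucet never turns off. Burst lengths for the in-between rates depend on the assumed timing details and should not be taken as a statement about the real silicon.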
I was having an impossible time trying to figure out how many stages to make the FIFO. In the end, I made a simulator that did all kinds of worst-case banging on the FIFO, in order to learn the equation for the number of stages required. It was a lot more than I had figured before.
I was having an impossible time trying to figure out how many stages to make the FIFO.
In the end, I made a simulator that did all kinds of worst-case banging on the FIFO, in order to learn the equation for the number of stages required.
It was a lot more than I had figured before.
Good idea, because this gets complex very quickly....
Does that simulator report the user bandwidth ? - can a web calculator, or such a simulator be used to give an indication or range of bandwidths for various Streamer read rates ?
It's like it turns the faucet on when it's under 14 stages full. As soon as it gets to 14 stages full, it turns the faucet off, but there could be five more stages of data streaming in because of register delays.
Hmm, so that's sounding a firmly fixed 5 size on the burst ? (I've modified the 2c example above for this 5-from-14 rule )
Because the FIFO fills on every clock, and register delays should be fixed, is there ever <> 5 ?
It could be >5 if data is being read out during the FIFO reload. It could even be continuous.
Oh, it is like a force feed then. Lol, I thought I was pointing out poor wording.
I guess that means all reads from hubRAM have this trailing buffering. So a single RDLONG has five subsequent longwords that are discarded.
Only hub-exec and RDFAST engage the FIFO in reading the hub, so that many longs are read in. RDxxxx only reads one or two longs, in cases of overlap on a word or a long.
But you said there was a bunch of buffering registers that drain into the FIFO for 5 clocks after the FIFO has stopped addressing hubram. Are those buffers not always in place as part of the egg-beater? Won't SETQ+RDLONG need them too?
It could be >5 if data is being read out during the FIFO reload.
I probably did not word that well.
The question should have been focused on the tail, ie once 14 is reached, and the 5 more rule starts, can anything change the 5-more ?
The 2c example above, I now have reading 8 or 9 bursts, by applying a nominal wait-for-slot and also applying the 5-after-14 rule.
Assuming that's modeled right, that gives gaps of 8 or 9 (for that wait case) which is just enough to ensure a slot exists in every gap between fill-bursts, for a possible user-data access.
There may be other wait cases, that shrink the gap to < 8, and that means user data may (rarely) stretch across 2 FIFO fills+gaps, before it hits a slot align.
As you aren't likely to be switching modes from one line to the next, could you set a flag somewhere when the code is generated, and skip the generation if it is already there?
Then the generation happens once at start of frame, and it collapses to a simple flag check each line.
It could be done like that but that approach probably burns more COGRAM.
1) It needs to set a flag somewhere.
2) There needs to be a flag available (perhaps could be shared in another reg if another one is suitable for this)
3) It needs to test this flag to see if it needs to generate the code.
4) It needs to branch
That's probably up to 3 or 4 more registers consumed. Simpler to do it every time you want it and not need to check. Once there are cycles available to do this the work is not a problem.
While I'm not likely to switch modes from one line to the next in the first release of this driver, I'd quite like to be able to in the future, to open up the concept of supporting display region lists. In the worst case, though, reading in a 256-entry palette each scanline is not ideal, and there certainly won't be time to clean it up to stop DVI locking up if bit 1 is set in the long. There would be time in the vertical blanking to do this, however.
If you do switch modes from line to line, you regenerate as necessary, with the code that overwrites the routine clearing the flag.
Addit: Or use self-modifying code to switch between the generation routine and the generated routine as needed.
I'd expect we will need to make use of self modifying code once we get to that point. And reclaiming COG/LUT space dynamically in different portions of the scanline to get everything to fit. I'm still trying to keep that in the back of my mind as I go along here to try not to preclude it. Something may possibly have to be dropped there to make room. Maybe pixel doubling won't be possible with it, or I'll need to read in dynamic code blocks from HUB, which I'm still hoping to avoid for stability reasons. Eg. you have an errant COG you are debugging using the video driver to display memory or other state etc and this errant COG goes and kills the video driver executable code. Annoying, but instead if the video driver stays up the whole time, you might be able to continue debugging things (to a point anyway).
Is it a continuous FIFO load if reads are done at 1c?
Once a cog gives the eggbeater a command to read, it takes 5 clocks for the data to come out of the eggbeater. It doesn't matter if SETQ is used, or not.
For hub-exec and RDFAST, cycle by cycle, data is coming in from the eggbeater into the FIFO, while data is being read out of the FIFO. This is why very long eggbeater bursts (more than 5 extra) can occur. Demand must be met.
I was finally able to get my mouse sprite code in and working in all colour modes with clipping. It fits, takes 60 COG longs, plus the call setup overhead. The 4 temporary working longs can be reclaimed from elsewhere.
Clipping code gets tedious and tricky to deal with and it gave me some grief with the flags. Be nice if there were some optimisations found there, if anyone knows a better way...
There is one remaining side effect, I think: the write-back to hub may overwrite up to 15 longs beyond the last scanline buffer with the original data that was read from that line. If this becomes an issue it could be accommodated by padding out the line buffer area by 60 extra bytes at the end, so I might just document that and leave it. Otherwise some additional code would be needed to compute and enforce write size limits, which I was trying to avoid as it increases the size of the code further than I'd like, though I will at least look into how to do it...
do_mouse
            getword a, mouse_xy, #1       'get mouse y screen co-ordinate
            getnib  b, mouseptr, #7       'get y hotspot of mouse image
            sub     a, b                  'compensate for the y hotspot
            subr    a, scanline           'compute sprite row offset
            cmpr    a, #15 wc             'check if sprite covers scanline
            alts    bpp, #bittable        'bpp starts out as index from 0-5
            mov     bitmask, 0-0          'get table entry using bpp index
            mul     a, bitmask            'multiply mouse row by its length
            shr     bitmask, #16 wz       'extract mask portion
  if_z      not     bitmask               'fix up the 32 bpp case
            mov     bpp, bitmask
            ones    bpp                   'convert into real bpp
            add     a, mouseptr           'add offset to base mouse address
            setq2   #17-1                 'get 17 longs max, mouse mask+image
            rdlong  $120, a               'read mouse data and store in LUT
            getword offset, mouse_xy, #0  'get mouse x screen co-ordinate
            getnib  b, mouseptr, #6       'get x hotspot of mouse image
            mov     pixels, offset
            sub     offset, b             'compensate for the x hotspot
  if_nc     subr    pixels, ##640 wcz     'compute pixels until end of line
            add     pixels, b             'increase by the x hotspot amount
            fle     pixels, #16           'limit drawn pixels to 16
  if_c_or_z ret     wcz                   'exit if sprite is out of x/y range
            mov     ptrb, #$120           'ptrb is used for mouse image data
            rdlut   c, ptrb++             'read in the mouse mask first
            abs     a, offset             'retain bits
            muls    offset, bpp           'convert number of pixels into bits
            abs     b, offset wc          'test for negative value (clipped)
            mov     muxmask, bitmask      'setup mask for pixel's data size
  if_nc     rol     muxmask, b            'align mask for first data pixel
  if_c      rol     bitmask, b            'align mask for first mouse pixel
  if_c      shr     c, a                  'eliminate mouse pixels if clipped
            shr     b, #5                 'convert bits to longs
  if_c      add     ptrb, b               'advance mouse data to skip pixels
            shl     b, #2                 'convert longs to bytes
  if_nc     add     save, b               'adjust scanline buffer position
            setq2   #16-1                 'read 16 scanline longs into LUT
            rdlong  $110, save            'using adjusted hub read position
            mov     ptra, #$110           'ptra used for source image data
            rdlut   a, ptra               'get original scanline pixel data
            test    $, #1 wc              'c=1 will trigger initial read
            rep     @endmouse, pixels     'repeat loop for 16 pixels
  if_c      rdlut   b, ptrb++             'get next mouse sprite pixel(s)
  if_c      rol     b, offset             'align with the source input data
            shr     c, #1 wc              'get mask bit 1=set, 0=transparent
  if_c      setq    muxmask               'apply bit mask to muxq mask
  if_nc     setq    #0                    'setup a transparent mouse pixel
            muxq    a, b                  'select original or mouse pixel
  if_c      wrlut   a, ptra               'write back updated data if altered
            rol     muxmask, bpp wc       'advance mask by 1,2,4,8,16,32 bits
  if_c      rdlut   a, ++ptra             '...and read next source pixel(s)
            rol     bitmask, bpp wc       'rotate mask for mouse data reload
endmouse
            setq2   #16-1
            wrlong  $110, save            'write LUT image data back to hub
            ret     wcz
bittable
            long    $0001_00_08           '1bpp
            long    $0003_00_08           '2bpp
            long    $000F_00_0C           '4bpp
            long    $00FF_00_14           '8bpp
            long    $FFFF_00_24           '16bpp
            long    $0000_00_44           '32bpp
bitmask     long    0                     'TODO eventually share other temporary register space
muxmask     long    0                     'in order to eliminate these temp working variables
offset      long    0
bpp         long    0
Edit: posted too soon. Last change I put in that I thought was innocuous and worked in one mode actually broke some clipping of colour modes I'd tested earlier... :frown: Very close though.
Edit2: Just found the line I'd changed out in my code and put it back. Now clipping is working as I wanted. Code adjusted above and increased by one long. Also one constant used was with ## so I guess it is now 62 longs.
I think I found a workaround for the minor issue described earlier, adds 4 longs though if I share the ##640 constant as a separate register instead of ##.
...
endmouse
            mul     bpp, ##640
            sub     bpp, offset
            shr     bpp, #5
            fle     bpp, #15
            setq2   bpp                   ' was #16-1
            wrlong  $110, save            'write LUT image data back to hub
            ret     wcz
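For anyone trying to follow the bittable entries, here is a sketch of how the routine appears to unpack them, written in Python purely as a reading aid. The interpretation is inferred from the listing (mul uses the low 16 bits as the row length, shr #16 leaves the pixel mask, a zero mask is inverted for 32bpp, and ones counts the bits); the names `BITTABLE` and `decode_entry` are made up here and the row-length formula is my assumption, not something stated in the post.

```python
BITTABLE = [
    0x0001_00_08,  # 1bpp
    0x0003_00_08,  # 2bpp
    0x000F_00_0C,  # 4bpp
    0x00FF_00_14,  # 8bpp
    0xFFFF_00_24,  # 16bpp
    0x0000_00_44,  # 32bpp
]


def decode_entry(entry):
    """Split one table entry into (bits per pixel, pixel mask, row bytes)."""
    row_bytes = entry & 0xFFFF          # low word: bytes per mouse image row
    mask = entry >> 16                  # high word: mask for one pixel
    if mask == 0:                       # 32bpp case: "if_z not bitmask"
        mask = 0xFFFF_FFFF
    bpp = bin(mask).count("1")          # "ones bpp"
    return bpp, mask, row_bytes


for entry in BITTABLE:
    bpp, mask, row_bytes = decode_entry(entry)
    # Assumed layout: one mask long plus 16 pixels rounded up to whole longs.
    image_longs = -(-16 * bpp // 32)
    print(bpp, hex(mask), row_bytes, 4 + 4 * image_longs)
```

Under that assumed layout the largest row is 1 mask long plus 16 image longs, which lines up with the 17-long setq2 read in the listing.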
@rogloh , what is your format for 16bpp character data? Is it 1 bit for a special effect, 4 bits for foreground color, 3 bits for background color, and 8 bits for the character? How are those arranged? I'd like to add a 16bpp mode to my VGA driver and it might as well be compatible with your DVI driver.
It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.
@rogloh , what is your format for 16bpp character data? Is it 1 bit for a special effect, 4 bits for foreground color, 3 bits for background color, and 8 bits for the character? How are those arranged? I'd like to add a 16bpp mode to my VGA driver and it might as well be compatible with your DVI driver.
It is standard CGA/VGA text format, 16bit data. LS Byte is text character, MS byte holds 4 bit foreground colour in bits 8-11, 3 (or 4) bit background colour in bits 12-15. Bit15 is also a text flash attribute if not configured for 16 background colours and only the first 8 palette entries are used for background colours in that case.
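As a quick illustration of that layout, the fields can be pulled out of a 16-bit cell as below. This is a sketch in Python for readability only; `decode_text_cell` and the `sixteen_bg` flag are hypothetical names standing in for however the driver is configured.

```python
def decode_text_cell(cell, sixteen_bg=False):
    """Unpack a 16-bit CGA/VGA-style text cell.

    cell:        16-bit value, low byte = character, high byte = attribute.
    sixteen_bg:  True if the driver is set up for 16 background colours;
                 otherwise bit 15 is treated as the flash attribute.
    """
    char = cell & 0xFF                 # bits 0-7: character code
    fg = (cell >> 8) & 0xF             # bits 8-11: foreground colour
    if sixteen_bg:
        bg = (cell >> 12) & 0xF        # bits 12-15: background colour
        flash = False
    else:
        bg = (cell >> 12) & 0x7        # bits 12-14: background colour
        flash = bool(cell & 0x8000)    # bit 15: flash attribute
    return {"char": char, "fg": fg, "bg": bg, "flash": flash}


print(decode_text_cell(0x9F41))
print(decode_text_cell(0x9F41, sixteen_bg=True))
```

The same cell value therefore reads as background 1 with flash in the 8-background configuration and as background 9 without flash in the 16-background configuration.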
It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.
Ok I'll look into that tomorrow. But pixels does not always equal longs unfortunately...
There may still be some other optimizations buried there with any luck if you see any. I felt it got a little messy in the code to track the clipping and handle hotspot offsets and some reverse subtract code was used and that seems to set the flags differently to what I needed, so I had to do extra work there.
            shr     c, #1 wc              'get mask bit 1=set, 0=transparent
  if_c      setq    muxmask               'apply bit mask to muxq mask
  if_nc     setq    #0                    'setup a transparent mouse pixel
            muxq    a, b                  'select original or mouse pixel
  if_c      wrlut   a, ptra               'write back updated data if altered

            shr     c, #1 wc              'get mask bit 1=set, 0=transparent
            setq    muxmask               'apply bit mask to muxq mask
  if_c      muxq    a, b                  'select mouse pixel if not transparent
  if_c      wrlut   a, ptra               'write back updated data if altered
nice going, squeezing the mouse pointer code in too. I wouldn't have imagined all this could be done in a single cog
Thanks, same here. There's just about space left for a TERC4 encoder...that code needs ~200 longs and 16 LUT table entries plus quite a lot of extra state variables, but depending on how LUTRAM gets used, it might still eventually fit. It's a real juggling act. I can also try to reuse COGRAM buffer space more across different functions like the mouse and pixel doubling stuff unless it blows out the timing budget. The main annoyance is the 256 colour palette mode. That chews up half the LUTRAM. If it was 4 bit LUT only that would open up a huge amount of free space for code. I do think there is time to load a palette in per scanline too instead of once per frame which opens up the possibility of per scanline mode regions, however then you are at the mercy of the user not clearing bit 1 as there certainly would not be time to do that clearing operation on all scanlines in an 8 bit LUT mode whereas you could if it gets loaded during vertical blanking, though I'm still not yet doing that step myself there yet either. Would be kind to do it but takes more COGRAM of course.
I still want to get HyperRAM framebuffers in too. Still wanting to keep it all self contained. Lots of competing constraints.
shr c, #1 wc 'get mask bit 1=set, 0=transparent
setq muxmask 'apply bit mask to muxq mask
if_c muxq a, b 'select mouse pixel if not transparent
if_c wrlut a, ptra 'write back updated data if altered
Great. Makes total sense, had overlooked it. This is where other people's fresh eyes can help.
It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.
Ok I'll look into that tomorrow. But pixels does not always equal longs unfortunately...
There may still be some other optimizations buried there with any luck if you see any. I felt it got a little messy in the code to track the clipping and handle hotspot offsets and some reverse subtract code was used and that seems to set the flags differently to what I needed, so I had to do extra work there.
Ok, I see that now. So, given that you set ptra to $110 before the pixel loop, then the number of longs to write is ptra - $110.
If my figuring is correct this should work for all bit depths:
            altd    ptra, #$EF
            setq2   0-0
            wrlong  $110, save
            ret     wcz
Edit: forgot the need to subtract one from the setq2 value. Code updated.
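The $EF constant is easier to see with the arithmetic written out. Assuming ALTD substitutes (D + S) & $1FF into the next instruction's D field, adding $EF to a ptra value in the $111..$120 range works out to ptra - $110 - 1. The Python below is only a check of that arithmetic; `altd_field` is a made-up helper, and whether the result is the right long count depends on where ptra actually ends up after the loop, which this sketch does not try to settle.

```python
def altd_field(d_value, s_value):
    """9-bit field that ALTD is assumed to place in the next instruction."""
    return (d_value + s_value) & 0x1FF


LUT_BASE = 0x110   # LUT address the scanline longs were loaded to

for ptra in range(LUT_BASE + 1, LUT_BASE + 17):
    field = altd_field(ptra, 0xEF)
    # Same as (ptra - $110) - 1 for every ptra in $111..$120.
    print(hex(ptra), field, ptra - LUT_BASE - 1)
```

Note that a ptra still equal to $110 would wrap to $1FF rather than give -1, so that case would need separate handling.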
Comments
Can you also add the low water mark number that evanh mentions ? - ie that (fixed?) trigger level, where another fill burst starts ?
From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5 ? (seems too low?)
A burst size of 8 gave an assumed load trigger of 11 in above discussion, but your wording suggests a larger burst ( but that may be initial load only ?)
The trigger level may not be hard-fixed, if it has to wait until below, and then wait for a slot-align ?
In this case of /10 empty rate, -1 is possible, but faster empty rates could remove many more until correct slot arrives.
I changed the wording. Does this answer your question? (I'm wondering if this change does the job.)
That's better, I think you say here
* P2 FIFO size is 19L deep
* P2 FIFO trigger point is 14L
* BURST size is 5L, so it reads in 15,16,17,18,19
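Those three numbers fall straight out of the documentation wording quoted earlier ((cogs+11) stages, load until (cogs+6) are filled, then 5 more). A small Python sketch, assuming that wording is the whole story; the function name is just for illustration:

```python
def fifo_metrics(cogs):
    """FIFO depth, fill threshold and trailing longs for a given cog count."""
    depth = cogs + 11         # total FIFO stages
    trigger = cogs + 6        # loading continues until this many stages are filled
    tail = depth - trigger    # longs that still stream in after the threshold
    return depth, trigger, tail

print(fifo_metrics(8))   # (19, 14, 5)
```

Note that the tail comes out as 5 for any cog count, since both terms scale with cogs.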
If one or more reads do occur, does the burst increase to 6, 7, 8 as needed, or is the burst size always fixed at 5 ?
Is the first read a continual 19L ? Or does it burst 5,5,5,4, giving some CPU bus time ?
The first load is 19L, assuming no data was read from the FIFO (ie RFLONG). If continuous RFLONGs occur, it will just load continuously, since that is the requirement. Once ANY data is in the FIFO, though, reading can begin.
Thanks, Is burst fill always 5L, or can that increase, to always stop at Full = 19L ? I'm guessing a fill to 19L works better ?
Taking the 10c empty rate, I get this for the FIFO and Streamer-BUS lockouts, for a 5 Size burst. The effect of the 5 x S on user HUB Data BUS access would be a stall of +8, as the access is forced to skip one slot and wait for the next go-around.
For the other 45 clocks of the 50c repeat, the bus is free, and the usual wait for the go-around applies to any HUB Data BUS access.
At higher streamer read rates the 50 shrinks, S widens as some reads-during-write occur, and the limit useful case looks to be 2c, where streamer has bus for 13c of 26c time slots ?
Streamers at 1c use 100% of the BUS, so there are no HUB Data BUS slots for that streamer's COG.
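The three cases above (5 of 50 clocks at 10c, 13 of 26 at 2c, everything at 1c) are consistent with a simple long-run rule: if the FIFO loads one long per clock while bursting and the streamer drains one long every N clocks, the FIFO needs 1/N of the bus on average. A rough Python sketch under that assumption only; it ignores slot alignment and the exact burst boundaries:

```python
from fractions import Fraction

def fifo_bus_share(clocks_per_long):
    """Long-run fraction of bus time the FIFO needs, assuming it loads
    one long per clock and is drained one long every clocks_per_long clocks."""
    return Fraction(1, clocks_per_long)

for rate in (10, 2, 1):
    share = fifo_bus_share(rate)
    print(f"{rate}c: FIFO {float(share):.0%}, left for the cog {float(1 - share):.0%}")
```

This only gives the average. How the free time is split into gaps, and whether each gap contains a usable slot, is the part the discussion here is still trying to pin down.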
The wording of "after which point 5 more longs stream in" seems to imply a possible force feeding. ie: The FIFO didn't address those locations in hubRAM but is still getting them anyway.
That's what I've assumed above, that it can burst load 5 or more, and fills to 19.
With /2c readout rate, and a 3c Wait, I get 7 bytes read out, during the 3wait+13write clocks, burst length here is 13.
It also means determinism is out the door. The FIFO burst starts are going to be on a rotation themselves. There's no point trying to arrange for ideal timing of the SETQ+RDLONG.
Does that simulator report the user bandwidth ? - can a web calculator, or such a simulator be used to give an indication or range of bandwidths for various Streamer read rates ?
Hmm, so that's sounding a firmly fixed 5 size on the burst ? (I've modified the 2c example above for this 5-from-14 rule )
Because the FIFO fills on every clock, and register delays should be fixed, is there ever <> 5 ?
I guess that means all reads from hubRAM have this trailing buffering. So a single RDLONG has five subsequent longwords that are discarded.
It could be >5 if data is being read out during the FIFO reload. It could even be continuous.
Only hub-exec and RDFAST engage the FIFO in reading the hub, so that many longs are read in. RDxxxx only reads one or two longs, in cases of overlap on a word or a long.
The question should have been focused on the tail, ie once 14 is reached, and the 5 more rule starts, can anything change the 5-more ?
The 2c example above, I now have reading 8 or 9 bursts, by applying a nominal wait-for-slot and also applying the 5-after-14 rule.
Assuming that's modeled right, that gives gaps of 8 or 9 (for that wait case) which is just enough to ensure a slot exists in every gap between fill-bursts, for a possible user-data access.
There may be other wait cases, that shrink the gap to < 8, and that means user data may (rarely) stretch across 2 FIFO fills+gaps, before it hits a slot align.
Is it a continuous FIFO load if the read is done at 1c ?
It could be done like that but that approach probably burns more COGRAM.
1) It needs to set a flag somewhere.
2) There needs to be a flag available (perhaps could be shared in another reg if another one is suitable for this)
3) It needs to test this flag to see if it needs to generate the code.
4) It needs to branch
That's probably up to 3 or 4 more registers consumed. Simpler to do it every time you want it and not need to check. Once there are cycles available to do this the work is not a problem.
While I'm not likely to switch modes from one line to the next in the first release of this driver, I'd quite like to be able to in the future to open up the concept of supporting display region lists... though in the worst case reading in a 256 entry palette each scanline is not ideal, and there certainly won't be time to clean it up to stop DVI locking up if bit 1 is set in the long. There would be time in the vertical blanking to do this however.
I'd expect we will need to make use of self modifying code once we get to that point. And reclaiming COG/LUT space dynamically in different portions of the scanline to get everything to fit. I'm still trying to keep that in the back of my mind as I go along here to try not to preclude it. Something may possibly have to be dropped there to make room. Maybe pixel doubling won't be possible with it, or I'll need to read in dynamic code blocks from HUB, which I'm still hoping to avoid for stability reasons. Eg. you have an errant COG you are debugging using the video driver to display memory or other state etc and this errant COG goes and kills the video driver executable code. Annoying, but instead if the video driver stays up the whole time, you might be able to continue debugging things (to a point anyway).
Once a cog gives the eggbeater a command to read, it takes 5 clocks for the data to come out of the eggbeater. It doesn't matter if SETQ is used, or not.
For hub-exec and RDFAST, cycle by cycle, data is coming in from the eggbeater into the FIFO, while data is being read out of the FIFO. This is why very long eggbeater bursts (more than 5 extra) can occur. Demand must be met.
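To see how a 5-clock pipeline turns "stop at 14" into "5 more longs", here is a toy Python model. It is a sketch, not a description of the actual silicon. The assumptions: one read command can be issued per clock, data arrives 5 clocks after its command, the issue decision sees the fill level from before that clock's arrival, and slot alignment is ignored. The depth and threshold use the documented (cogs+11) and (cogs+6).

```python
from collections import deque

def simulate_fifo(cogs=8, read_period=None, clocks=200, latency=5):
    """Toy FIFO model. Returns (longs loaded, final level, peak level)."""
    depth = cogs + 11          # documented FIFO size
    trigger = cogs + 6         # documented fill threshold
    level = loaded = peak = 0
    in_flight = deque()        # arrival clock of each outstanding read command
    for t in range(clocks):
        # Assumption: the decision uses the level before this clock's arrival.
        if level < trigger:
            in_flight.append(t + latency)
        # Data commanded `latency` clocks ago lands in the FIFO now.
        if in_flight and in_flight[0] == t:
            in_flight.popleft()
            level += 1
            loaded += 1
        # Optional consumer taking one long every read_period clocks.
        if read_period and t % read_period == 0 and level > 0:
            level -= 1
        peak = max(peak, level)
    assert peak <= depth       # the model never overfills the FIFO
    return loaded, level, peak

print(simulate_fifo())                      # (19, 19, 19)
print(simulate_fifo(read_period=10)[0])     # more than 19 longs loaded
print(simulate_fifo(read_period=1)[0])      # loading never stops
```

With no reads the model stops at exactly 19 longs. With reads in progress the loading runs longer, and at one read per clock it never stops, which matches "it could even be continuous".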
Clipping code gets tedious and tricky to deal with and it gave me some grief with the flags. Be nice if there were some optimisations found there, if anyone knows a better way...
There is one remaining side effect I think where the hub data may overwrite 15 longs beyond the last scanline buffer with the original data that it reads from that line. If this becomes an issue it could be accommodated by padding out the line buffer area by 60 extra bytes at the end, so I might just document that and leave it. Otherwise there would need to be some additional code put in to compute and enforce write size limits that I was trying to avoid as it increases the size of the code further than I'd like, though I will at least look into how to do it...
Edit: posted too soon. Last change I put in that I thought was innocuous and worked in one mode actually broke some clipping of colour modes I'd tested earlier... :frown: Very close though.
Edit2: Just found the line I'd changed out in my code and put it back. Now clipping is working as I wanted. Code adjusted above and increased by one long. Also one constant used was with ## so I guess it is now 62 longs.
I think I found a workaround for the minor issue described earlier, adds 4 longs though if I share the ##640 constant as a separate register instead of ##.
It is standard CGA/VGA text format, 16bit data. LS Byte is text character, MS byte holds 4 bit foreground colour in bits 8-11, 3 (or 4) bit background colour in bits 12-15. Bit15 is also a text flash attribute if not configured for 16 background colours and only the first 8 palette entries are used for background colours in that case.
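For anyone building screen data by hand, the 16-bit cell layout described above can be unpacked like this. A minimal Python sketch; the function and parameter names are invented for illustration and do not come from the driver:

```python
def decode_text_cell(cell, sixteen_bg_colours=False):
    """Split a 16-bit CGA/VGA style text cell into its fields."""
    char = cell & 0xFF                 # LS byte: character code
    fg = (cell >> 8) & 0xF             # bits 8-11: foreground colour
    if sixteen_bg_colours:
        bg = (cell >> 12) & 0xF        # bits 12-15: background colour
        flash = False
    else:
        bg = (cell >> 12) & 0x7        # bits 12-14: background colour
        flash = bool(cell & 0x8000)    # bit 15: text flash attribute
    return char, fg, bg, flash

print(decode_text_cell(0x9F41))                            # (65, 15, 1, True)
print(decode_text_cell(0x9F41, sixteen_bg_colours=True))   # (65, 15, 9, False)
```

The same cell value gives a different background depending on the mode, because bit 15 is either the flash flag or the top bit of the background colour.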
Thanks, same here. There's just about space left for a TERC4 encoder...that code needs ~200 longs and 16 LUT table entries plus quite a lot of extra state variables, but depending on how LUTRAM gets used, it might still eventually fit. It's a real juggling act. I can also try to reuse COGRAM buffer space more across different functions like the mouse and pixel doubling stuff unless it blows out the timing budget. The main annoyance is the 256 colour palette mode. That chews up half the LUTRAM. If it was 4 bit LUT only that would open up a huge amount of free space for code. I do think there is time to load a palette in per scanline too instead of once per frame which opens up the possibility of per scanline mode regions, however then you are at the mercy of the user not clearing bit 1 as there certainly would not be time to do that clearing operation on all scanlines in an 8 bit LUT mode whereas you could if it gets loaded during vertical blanking, though I'm still not yet doing that step myself there yet either. Would be kind to do it but takes more COGRAM of course.
Then, just write the mouse data normally without clipping, just a normal bounds check.