Seeing there was some concern here about FIFO metrics, I just added this to the documentation:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.
Can you also add the low-water-mark number that evanh mentions? That is, the (fixed?) trigger level at which another fill burst starts.
From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5? (That seems too low.)
A burst size of 8 gave an assumed load trigger of 11 in the discussion above, but your wording suggests a larger burst (though that may be the initial load only?).
The trigger level may not be hard-fixed if the FIFO has to wait until the level is below it and then wait for a slot-align.
In this case of a /10 empty rate, -1 is possible, but faster empty rates could remove many more longs before the correct slot arrives.
I changed the wording. Does this answer your question? (I'm wondering if this change does the job.)
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, potentially 5 more longs stream in, possibly filling all (cogs+11) stages. These metrics ensure that the FIFO never underflows, under all potential reading scenarios.
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+6) stages are filled, after which point 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows, under all potential scenarios.
That's better. I think you are saying here:
* P2 FIFO size is 19L deep
* P2 FIFO trigger point is 14L
* BURST size is 5L, so it reads in 15,16,17,18,19
If reads do occur, does the burst increase to 6, 7, 8 as needed, or is the burst size always fixed at 5?
Is the first read a continual 19L, or does it burst 5,5,5,4, giving some CPU bus time?
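For anyone wanting to check the 19L / 14L / 5L figures for a different cog count, here is a minimal Python sketch of the arithmetic. It assumes the (cogs+11) and (cogs+6) wording quoted just above; the function name and the dictionary keys are made up for illustration and are not from the documentation.

```python
def fifo_metrics(cogs: int) -> dict:
    """Read-mode FIFO figures as quoted in the thread (illustrative helper)."""
    depth = cogs + 11             # total FIFO stages
    threshold = cogs + 6          # loading continues while fewer stages than this are filled
    trailing = depth - threshold  # longs that can still stream in after loading stops
    return {"depth": depth, "threshold": threshold, "trailing": trailing}


print(fifo_metrics(8))  # the 8-cog P2 discussed here
```

For 8 cogs this reproduces the three bullet points above.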
The first load is 19L, assuming no data was read from the FIFO (ie RFLONG). If continuous RFLONGs occur, it will just load continuously, since that is the requirement. Once ANY data is in the FIFO, though, reading can begin.
Thanks. Is the burst fill always 5L, or can that increase so that it always stops at Full = 19L? I'm guessing a fill to 19L works better?
Taking the 10c empty rate, I get this for the FIFO and Streamer-BUS lockouts, for a burst size of 5:
Rd Clock of 10c
    111111111111111111111111111111111111111111wwww1111111111111111111111111111111111111111111111wwww111111
    988888888887777777777666666666655555555554wwww5679988888888887777777777666666666655555555554wwww567998  FIFO Data
     /         /         /         /         /         /         /         /         /         /         /  Rd CLK (-1)
                                                  /////                                             /////   Wr CLK (+1)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~wwwwSSSSS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~SSSSS~~~  Streamer has BUS

Rd Clock of 2c (based on Fill-to-19)
    1111111111111111111111111111111111111111111111111111111111111111
    9887766554433233445566778898877665544332334455667788998877665544  FIFO Data
     / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /  Rd CLK (-1)
                  /////////////             /////////////             Wr CLK (+1)
    ~~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~~  Streamer has BUS

Rd Clock of 2c (based on 5 loads from 14, correct)
    1111111111111111111111111111111111111111111111111111111111111111
    9887766554433233445566766554433233445566734455667788998877665544  FIFO Data
     / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /  Rd CLK (-1)
                  ////////         /////////                          Wr CLK (+1)
    ~~~~~~~~~~~wwwSSSSSSSS~~~~~wwwwSSSSSSSSS~~~~~~~~~~~               Streamer has BUS
                          123456781

w = wait for streamer start slot match, S = Streamer FIFO write clock.
The effect of the 5 x S on user HUB Data BUS access would be a stall of +8, as the access is forced to skip one slot and wait for the next go-around.
For the other 45 clocks of the 50c repeat, the bus is free, and the usual wait for the go-around applies to any HUB Data BUS access.
At higher streamer read rates the 50 shrinks and S widens as some reads-during-write occur. The limiting useful case looks to be 2c, where the streamer has the bus for 13c of every 26c?
Streamers at 1c use 100% of the BUS, so there are no HUB Data BUS slots for that streamer's COG.
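The bus-share figures in this post reduce to simple ratios. A small Python sketch of that arithmetic follows, using only the numbers given above (5 of 50 clocks at 10c, 13 of 26 at 2c, everything at 1c). The helper name is hypothetical, and the ratios come from the hand-drawn timing estimate, not from measured hardware.

```python
from fractions import Fraction


def bus_share(busy_clocks: int, period_clocks: int) -> Fraction:
    """Fraction of clocks in which the streamer's FIFO holds the hub bus."""
    return Fraction(busy_clocks, period_clocks)


cases = {
    "10c": bus_share(5, 50),   # 5 x S in every 50c repeat
    "2c": bus_share(13, 26),   # 13 x S in every 26c repeat
    "1c": bus_share(1, 1),     # FIFO loads on every clock
}

for name, share in cases.items():
    print(f"{name}: streamer holds the bus {float(share):.0%} of the time")
```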
Chip,
The wording of "after which point 5 more longs stream in" seems to imply a possible force feeding. ie: The FIFO didn't address those locations in hubRAM but is still getting them anyway.
It's six, #14 is included. If I'm reading into what Chip has said correctly then that is also the minimum burst length, not fixed. Any consumption from the FIFO during refilling will extend the refilling. It's a little fancier than I first envisioned.
That's what I've assumed above, that it can burst load 5 or more, and fills to 19.
With /2c readout rate, and a 3c Wait, I get 7 bytes read out, during the 3wait+13write clocks, burst length here is 13.
Yeah, maybe it is only 5 minimum. So that's the high and low marks.
It also means determinism is out the door. The FIFO burst starts are going to be on a rotation themselves. There's no point trying to arrange for ideal timing of the SETQ+RDLONG.
It's like it turns the faucet on when it's under 14 stages full. As soon as it gets to 14 stages full, it turns the faucet off, but there could be five more stages of data streaming in because of register delays.
I was having an impossible time trying to figure out how many stages to make the FIFO. In the end, I made a simulator that did all kinds of worst-case banging on the FIFO, in order to learn the equation for the number of stages required. It was a lot more than I had figured before.
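To make the faucet description concrete, here is a toy Python model. It is not Chip's simulator, just a sketch under stated assumptions: the FIFO requests a long on every clock while its level is under (cogs+6), each requested long arrives 5 clocks later, and a burst that restarts after a pause first waits `slot_wait` clocks for its hub slot. The names `simulate`, `slot_wait` and `latency` are invented for this example, and the consumer is a plain fixed-rate reader.

```python
from collections import deque


def simulate(cogs=8, read_period=None, slot_wait=0, clocks=2000, latency=5):
    """Toy read-mode FIFO model: returns peak level, underflows, longest burst."""
    depth, threshold = cogs + 11, cogs + 6
    pipeline = deque([0] * latency)   # longs requested but not yet arrived
    level = peak = underflows = run = longest = 0
    requesting, wait_left = False, 0
    next_read = None                  # reads begin once the first long arrives
    for t in range(clocks):
        # The faucet decision uses the level seen at the start of the clock.
        if level < threshold:
            if not requesting:        # faucet was off: wait for the hub slot
                requesting, wait_left = True, slot_wait
        else:
            requesting = False
        issue = 0
        if requesting:
            if wait_left:
                wait_left -= 1
            else:
                issue = 1
        run = run + 1 if issue else 0
        longest = max(longest, run)
        # A long requested `latency` clocks ago arrives now.
        pipeline.append(issue)
        level += pipeline.popleft()
        peak = max(peak, level)
        # Optional consumer: one long every `read_period` clocks.
        if read_period is not None:
            if next_read is None and level > 0:
                next_read = t
            if next_read == t:
                if level:
                    level -= 1
                else:
                    underflows += 1
                next_read = t + read_period
    return {"depth": depth, "peak": peak,
            "underflows": underflows, "longest_burst": longest}


print(simulate())                             # no reads at all
print(simulate(read_period=1, slot_wait=7))   # reader drains on every clock
```

With no reads, this model fills to exactly the FIFO depth and then stops. With a reader draining on every clock, the faucet never turns off, which is the "it could even be continuous" case.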
Good idea, because this gets complex very quickly....
Does that simulator report the user bandwidth? Can a web calculator, or such a simulator, be used to give an indication or range of bandwidths for various Streamer read rates?
Hmm, so that sounds like a firmly fixed size of 5 on the burst? (I've modified the 2c example above for this 5-from-14 rule.)
Because the FIFO fills on every clock, and register delays should be fixed, is it ever anything other than 5?
It could be >5 if data is being read out during the FIFO reload. It could even be continuous.
Oh, it is like a force feed then. Lol, I thought I was pointing out poor wording.
I guess that means all reads from hubRAM have this trailing buffering. So a single RDLONG has five subsequent longwords that are discarded.
Only hub-exec and RDFAST engage the FIFO in reading the hub, so that many longs are read in. RDxxxx only reads one or two longs, in cases of overlap on a word or a long.
But you said there was a bunch of buffering registers that drain into the FIFO for 5 clocks after the FIFO has stopped addressing hubram. Are those buffers not always in place as part of the egg-beater? Won't SETQ+RDLONG need them too?
I probably did not word that well.
The question should have been focused on the tail, i.e. once 14 is reached and the 5-more rule starts, can anything change the 5 more?
In the 2c example above, I now have bursts reading 8 or 9 longs, by applying a nominal wait-for-slot and also applying the 5-after-14 rule.
Assuming that's modeled right, that gives gaps of 8 or 9 (for that wait case), which is just enough to ensure a slot exists in every gap between fill bursts for a possible user-data access.
There may be other wait cases that shrink the gap to < 8, which means a user-data access may (rarely) stretch across 2 FIFO fills+gaps before it hits a slot align.
As you aren't likely to be switching modes from one line to the next, could you set a flag somewhere when the code is generated, and skip the generation if it is already there?
Then the generation happens once at start of frame, and it collapses to a simple flag check each line.
It could be done like that but that approach probably burns more COGRAM.
1) It needs to set a flag somewhere.
2) There needs to be a flag available (perhaps could be shared in another reg if another one is suitable for this)
3) It needs to test this flag to see if it needs to generate the code.
4) It needs to branch
That's probably up to 3 or 4 more registers consumed. Simpler to do it every time you want it and not need to check. Once there are cycles available to do this the work is not a problem.
While I'm not likely to switch modes from one line to the next in the first release of this driver, I'd quite like to be able to in the future to open up the concept of supporting display region lists... though in the worst case reading in a 256 entry palette each scanline is not ideal, and there certainly won't be time to clean it up to stop DVI locking up if bit 1 is set in the long. There would be time in the vertical blanking to do this however.
If you do switch modes from line to line, you regenerate as necessary, with the code that overwrites the routine clearing the flag.
Addit: Or use self-modifying code to switch between the generation routine and the generated routine as needed.
I'd expect we will need to make use of self modifying code once we get to that point. And reclaiming COG/LUT space dynamically in different portions of the scanline to get everything to fit. I'm still trying to keep that in the back of my mind as I go along here to try not to preclude it. Something may possibly have to be dropped there to make room. Maybe pixel doubling won't be possible with it, or I'll need to read in dynamic code blocks from HUB, which I'm still hoping to avoid for stability reasons. Eg. you have an errant COG you are debugging using the video driver to display memory or other state etc and this errant COG goes and kills the video driver executable code. Annoying, but instead if the video driver stays up the whole time, you might be able to continue debugging things (to a point anyway).
Is it a continuous FIFO load if the read is done at 1c?
Once a cog gives the eggbeater a command to read, it takes 5 clocks for the data to come out of the eggbeater. It doesn't matter if SETQ is used, or not.
For hub-exec and RDFAST, cycle by cycle, data is coming in from the eggbeater into the FIFO, while data is being read out of the FIFO. This is why very long eggbeater bursts (more than 5 extra) can occur. Demand must be met.
I was finally able to get my mouse sprite code in and working in all colour modes with clipping. It fits, takes 60 COG longs, plus the call setup overhead. The 4 temporary working longs can be reclaimed from elsewhere.
Clipping code gets tedious and tricky to deal with and it gave me some grief with the flags. Be nice if there were some optimisations found there, if anyone knows a better way...
There is one remaining side effect I think where the hub data may overwrite 15 longs beyond the last scanline buffer with the original data that it reads from that line. If this becomes an issue it could be accommodated by padding out the line buffer area by 60 extra bytes at the end, so I might just document that and leave it. Otherwise there would need to be some additional code put in to compute and enforce write size limits that I was trying to avoid as it increases the size of the code further than I'd like, though I will at least look into how to do it...
do_mouse
              getword a, mouse_xy, #1  'get mouse y screen co-ordinate
              getnib  b, mouseptr, #7  'get y hotspot of mouse image
              sub     a, b  'compensate for the y hotspot
              subr    a, scanline  'compute sprite row offset
              cmpr    a, #15 wc  'check if sprite covers scanline
              alts    bpp, #bittable  'bpp starts out as index from 0-5
              mov     bitmask, 0-0  'get table entry using bpp index
              mul     a, bitmask  'multiply mouse row by its length
              shr     bitmask, #16 wz  'extract mask portion
    if_z      not     bitmask  'fix up the 32 bpp case
              mov     bpp, bitmask
              ones    bpp  'convert into real bpp
              add     a, mouseptr  'add offset to base mouse address
              setq2   #17-1  'get 17 longs max, mouse mask+image
              rdlong  $120, a  'read mouse data and store in LUT
              getword offset, mouse_xy, #0  'get mouse x screen co-ordinate
              getnib  b, mouseptr, #6  'get x hotspot of mouse image
              mov     pixels, offset
              sub     offset, b  'compensate for the x hotspot
    if_nc     subr    pixels, ##640 wcz  'compute pixels until end of line
              add     pixels, b  'increase by the x hotspot amount
              fle     pixels, #16  'limit drawn pixels to 16
    if_c_or_z ret     wcz  'exit if sprite is out of x/y range
              mov     ptrb, #$120  'ptrb is used for mouse image data
              rdlut   c, ptrb++  'read in the mouse mask first
              abs     a, offset  'retain bits
              muls    offset, bpp  'convert number of pixels into bits
              abs     b, offset wc  'test for negative value (clipped)
              mov     muxmask, bitmask  'setup mask for pixel's data size
    if_nc     rol     muxmask, b  'align mask for first data pixel
    if_c      rol     bitmask, b  'align mask for first mouse pixel
    if_c      shr     c, a  'eliminate mouse pixels if clipped
              shr     b, #5  'convert bits to longs
    if_c      add     ptrb, b  'advance mouse data to skip pixels
              shl     b, #2  'convert longs to bytes
    if_nc     add     save, b  'adjust scanline buffer position
              setq2   #16-1  'read 16 scanline longs into LUT
              rdlong  $110, save  'using adjusted hub read position
              mov     ptra, #$110  'ptra used for source image data
              rdlut   a, ptra  'get original scanline pixel data
              test    $, #1 wc  'c=1 will trigger initial read
              rep     @endmouse, pixels  'repeat loop for 16 pixels
    if_c      rdlut   b, ptrb++  'get next mouse sprite pixel(s)
    if_c      rol     b, offset  'align with the source input data
              shr     c, #1 wc  'get mask bit 1=set, 0=transparent
    if_c      setq    muxmask  'apply bit mask to muxq mask
    if_nc     setq    #0  'setup a transparent mouse pixel
              muxq    a, b  'select original or mouse pixel
    if_c      wrlut   a, ptra  'write back updated data if altered
              rol     muxmask, bpp wc  'advance mask by 1,2,4,8,16,32 bits
    if_c      rdlut   a, ++ptra  '...and read next source pixel(s)
              rol     bitmask, bpp wc  'rotate mask for mouse data reload
endmouse
              setq2   #16-1
              wrlong  $110, save  'write LUT image data back to hub
              ret     wcz
bittable
              long    $0001_00_08  '1bpp
              long    $0003_00_08  '2bpp
              long    $000F_00_0C  '4bpp
              long    $00FF_00_14  '8bpp
              long    $FFFF_00_24  '16bpp
              long    $0000_00_44  '32bpp
bitmask       long    0  'TODO eventually share other temporary register space
muxmask       long    0  'in order to eliminate these temp working variables
offset        long    0
bpp           long    0
Edit: posted too soon. Last change I put in that I thought was innocuous and worked in one mode actually broke some clipping of colour modes I'd tested earlier... :frown: Very close though.
Edit2: Just found the line I'd changed out in my code and put it back. Now clipping is working as I wanted. Code adjusted above and increased by one long. Also one constant used was with ## so I guess it is now 62 longs.
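As a reading aid for the `bittable` entries, here is a short Python sketch that unpacks them the way the listing appears to: the low 16 bits are the multiplier used as the row length, and the upper 16 bits are the pixel mask whose set bits give the bpp, with zero standing in for all 32 bits. The interpretation of the row length as one mask long plus 16 pixels rounded up to whole longs is my own inference from the "17 longs max" comment, not something stated in the post.

```python
# Table values copied from the listing; everything else here is illustrative.
BITTABLE = [0x0001_00_08, 0x0003_00_08, 0x000F_00_0C,
            0x00FF_00_14, 0xFFFF_00_24, 0x0000_00_44]


def decode(entry: int) -> tuple:
    """Return (bits per pixel, bytes per mouse image row) for one table entry."""
    row_bytes = entry & 0xFFFF       # low word: what the MUL by the row uses
    mask = entry >> 16               # high word: one pixel's bit mask
    if mask == 0:                    # 32 bpp case: NOT of zero is all ones
        mask = 0xFFFFFFFF
    bpp = bin(mask).count("1")       # same idea as the ONES instruction
    return bpp, row_bytes


for entry in BITTABLE:
    bpp, row_bytes = decode(entry)
    # Inferred layout: 4 bytes of mask + 16 pixels rounded up to whole longs.
    inferred = 4 + 4 * -(-16 * bpp // 32)
    print(f"{bpp:2d} bpp: {row_bytes:2d} bytes per row (inferred {inferred})")
```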
I think I found a workaround for the minor issue described earlier, adds 4 longs though if I share the ##640 constant as a separate register instead of ##.
...
endmouse
              mul     bpp, ##640
              sub     bpp, offset
              shr     bpp, #5
              fle     bpp, #15
              setq2   bpp  ' was #16-1
              wrlong  $110, save  'write LUT image data back to hub
              ret     wcz
@rogloh , what is your format for 16bpp character data? Is it 1 bit for a special effect, 4 bits for foreground color, 3 bits for background color, and 8 bits for the character? How are those arranged? I'd like to add a 16bpp mode to my VGA driver and it might as well be compatible with your DVI driver.
It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.
It is standard CGA/VGA text format, 16bit data. LS Byte is text character, MS byte holds 4 bit foreground colour in bits 8-11, 3 (or 4) bit background colour in bits 12-15. Bit15 is also a text flash attribute if not configured for 16 background colours and only the first 8 palette entries are used for background colours in that case.
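A small Python sketch of how a 16-bit cell in that layout could be unpacked, in case it helps with driver compatibility. The function name and the `sixteen_backgrounds` switch are hypothetical; only the bit positions come from the description above.

```python
def decode_cell(cell: int, sixteen_backgrounds: bool = False) -> dict:
    """Split one 16-bit text cell into character, colours and flash flag."""
    char = cell & 0xFF                 # LS byte: character code
    fg = (cell >> 8) & 0xF             # bits 8-11: foreground colour
    if sixteen_backgrounds:
        bg = (cell >> 12) & 0xF        # bits 12-15: 4-bit background colour
        flash = False
    else:
        bg = (cell >> 12) & 0x7        # bits 12-14: 3-bit background colour
        flash = bool((cell >> 15) & 1)  # bit 15: text flash attribute
    return {"char": char, "fg": fg, "bg": bg, "flash": flash}


print(decode_cell(0x9F41))
print(decode_cell(0x9F41, sixteen_backgrounds=True))
```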
Ok I'll look into that tomorrow. But pixels does not always equal longs unfortunately...
There may still be some other optimizations buried there with any luck if you see any. I felt it got a little messy in the code to track the clipping and handle hotspot offsets and some reverse subtract code was used and that seems to set the flags differently to what I needed, so I had to do extra work there.
              shr     c, #1 wc  'get mask bit 1=set, 0=transparent
    if_c      setq    muxmask  'apply bit mask to muxq mask
    if_nc     setq    #0  'setup a transparent mouse pixel
              muxq    a, b  'select original or mouse pixel
    if_c      wrlut   a, ptra  'write back updated data if altered

              shr     c, #1 wc  'get mask bit 1=set, 0=transparent
              setq    muxmask  'apply bit mask to muxq mask
    if_c      muxq    a, b  'select mouse pixel if not transparent
    if_c      wrlut   a, ptra  'write back updated data if altered
nice going, squeezing the mouse pointer code in too. I wouldn't have imagined all this could be done in a single cog
Thanks, same here. There's just about space left for a TERC4 encoder...that code needs ~200 longs and 16 LUT table entries plus quite a lot of extra state variables, but depending on how LUTRAM gets used, it might still eventually fit. It's a real juggling act. I can also try to reuse COGRAM buffer space more across different functions like the mouse and pixel doubling stuff unless it blows out the timing budget. The main annoyance is the 256 colour palette mode. That chews up half the LUTRAM. If it was 4 bit LUT only that would open up a huge amount of free space for code. I do think there is time to load a palette in per scanline too instead of once per frame which opens up the possibility of per scanline mode regions, however then you are at the mercy of the user not clearing bit 1 as there certainly would not be time to do that clearing operation on all scanlines in an 8 bit LUT mode whereas you could if it gets loaded during vertical blanking, though I'm still not yet doing that step myself there yet either. Would be kind to do it but takes more COGRAM of course.
I still want to get HyperRAM framebuffers in too. Still wanting to keep it all self contained. Lots of competing constraints.
Great. Makes total sense, had overlooked it. This is where other people's fresh eyes can help.
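The reason the shorter form gives the same pixel result can be checked with a quick Python model. It assumes MUXQ copies the bits of S selected by Q into D and leaves the other bits alone, so a zero mask changes nothing. The function names are invented for the comparison, and flags and timing are not modelled.

```python
import random

MASK32 = 0xFFFFFFFF


def muxq(d: int, s: int, q: int) -> int:
    """Bits of s selected by q replace the same bits in d."""
    return ((d & ~q) | (s & q)) & MASK32


def original(a: int, b: int, muxmask: int, c: bool) -> int:
    q = muxmask if c else 0           # if_c setq muxmask / if_nc setq #0
    return muxq(a, b, q)              # muxq always executes


def shortened(a: int, b: int, muxmask: int, c: bool) -> int:
    return muxq(a, b, muxmask) if c else a   # if_c muxq, otherwise a is untouched


random.seed(1)
for _ in range(1000):
    a, b, m = (random.getrandbits(32) for _ in range(3))
    for c in (False, True):
        assert original(a, b, m, c) == shortened(a, b, m, c)
print("both forms agree on every sampled input")
```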
Ok, I see that now. So, given that you set ptra to $110 before the pixel loop, then the number of longs to write is ptra - $110.
If my figuring is correct this should work for all bit depths:
              altd    ptra, #$EF
              setq2   0-0
              wrlong  $110, save
              ret     wcz
Edit: forgot the need to subtract one from the setq2 value. Code updated.
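The #$EF constant is just -$111 in 9-bit arithmetic. A quick Python check of that identity is below. It assumes the sum of ptra and the constant wraps to 9 bits and covers only the arithmetic, not how the following instruction then uses the value. `alt_offset` is a made-up name.

```python
BASE = 0x110  # first LUT address of the scanline data in the listing


def alt_offset(ptra: int) -> int:
    """(ptra + $EF) wrapped to 9 bits, i.e. ptra - $110 - 1."""
    return (ptra + 0xEF) & 0x1FF


for longs in range(1, 17):            # 1..16 longs, so ptra ends at $111..$120
    ptra = BASE + longs
    assert alt_offset(ptra) == longs - 1
print("ptra + $EF gives (longs - 1) for every length from 1 to 16")
```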
Comments
Can you also add the low water mark number that evanh mentions ? - ie that (fixed?) trigger level, where another fill burst starts ?
From your wording, does the FIFO wait until it can load 8+6, which makes the low water mark 5 ? (seems too low?)
A burst size of 8 gave an assumed load trigger of 11 in above discussion, but your wording suggests a larger burst ( but that may be initial load only ?)
The trigger level may not be hard-fixed, if it has to wait until below, and then wait for a slot-align ?
In this case of /10 empty rate, -1 is possible, but faster empty rates could remove many more until correct slot arrives.
I changed the wording. Does this answer your question? (I'm wondering if this change does the job.)
That's better, I think you say here
* P2 FIFO size is 19L deep
* P2 FIFO trigger point is 14L
* BURST size is 5L, so it reads in 15,16,17,18,19
If a read(s) do occur, does burst increase to 6,7,8 as needed, or is burst size always fixed at 5 ?
Is first read a continual 19L ? or does it burst 5,5,5,4 giving some CPU bus time ?
The first load is 19L, assuming no data was read from the FIFO (ie RFLONG). If continuous RFLONGs occur, it will just load continuously, since that is the requirement. Once ANY data is in the FIFO, though, reading can begin.
Thanks, Is burst fill always 5L, or can that increase, to always stop at Full = 19L ? I'm guessing a fill to 19L works better ?
Taking the 10c empty rate, I get this for the FIFO and Streamer-BUS lockouts, for a 5 Size burst
Rd Clock of 10c
111111111111111111111111111111111111111111wwww1111111111111111111111111111111111111111111111wwww111111
988888888887777777777666666666655555555554wwww5679988888888887777777777666666666655555555554wwww567998 FIFO Data
/ / / / / / / / / / / RdCLK (-1)
///// ///// Wr CLK (+1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~wwwwSSSSS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~SSSSS~~~ Streamer has BUS

Rd Clock of 2c (based in Fill-to-19)
1111111111111111111111111111111111111111111111111111111111111111
9887766554433233445566778898877665544332334455667788998877665544
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / Rd CLK (-1)
///////////// ///////////// Wr CLK (+1)
~~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~wwwSSSSSSSSSSSSS~~~~~~~~~~~ Streamer has BUS

Rd Clock of 2c (based on 5 loads from 14, correct )
1111111111111111111111111111111111111111111111111111111111111111
9887766554433233445566766554433233445566734455667788998877665544
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / Rd CLK (-1)
//////// ///////// Wr CLK (+1)
~~~~~~~~~~~wwwSSSSSSSS~~~~~wwwwSSSSSSSSS~~~~~~~~~~~ Streamer has BUS
123456781

w = wait for streamer start slot match
S = Streamer FIFO write clock.
The effect of the 5 x S on user HUB Data BUS access would be a stall of +8, as the access is forced to skip one slot and wait for the next go-around. For the other 45 clocks of the 50c repeat the bus is free, and the usual wait for the go-around applies to any HUB Data BUS access.
At higher streamer read rates the 50 shrinks, S widens as some reads-during-write occur, and the limit useful case looks to be 2c, where streamer has bus for 13c of 26c time slots ?
Streamers at 1c use 100% of the BUS, so there are no HUB Data BUS slots for that streamer's COG.
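The 5-of-50, 13-of-26 and 100% figures are all the same conservation argument: over a long run the FIFO has to write as many longs as the streamer reads, and each write takes one bus clock. A rough Python sketch of that arithmetic, assuming a fixed burst size and ignoring slot-wait effects (function names are illustrative only):

```python
from fractions import Fraction


def streamer_bus_share(read_period_clocks: int) -> Fraction:
    """Long-run fraction of hub bus clocks spent refilling the FIFO.

    One long is consumed every read_period_clocks, and each refill write
    occupies one bus clock, so the share is simply 1/read_period_clocks.
    """
    if read_period_clocks < 1:
        raise ValueError("read period must be at least 1 clock")
    return Fraction(1, read_period_clocks)


def repeat_period(read_period_clocks: int, burst_longs: int = 5) -> int:
    """Clocks between bursts if every burst is exactly burst_longs long."""
    return burst_longs * read_period_clocks


if __name__ == "__main__":
    for period in (10, 2, 1):
        print(period, streamer_bus_share(period), repeat_period(period))
```

For the 10c case this gives a 50-clock repeat with 1/10 of the bus taken, which matches the 5 x S in 50c above. It does not model where in the rotation the bursts land.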
The wording of "after which point 5 more longs stream in" seems to imply a possible force feeding. ie: The FIFO didn't address those locations in hubRAM but is still getting them anyway.
That's what I've assumed above, that it can burst load 5 or more, and fills to 19.
With /2c readout rate, and a 3c Wait, I get 7 bytes read out, during the 3wait+13write clocks, burst length here is 13.
It also means determinism is out the door. The FIFO burst starts are going to be on a rotation themselves. There's no point trying to arrange for ideal timing of the SETQ+RDLONG.
Does that simulator report the user bandwidth ? - can a web calculator, or such a simulator be used to give an indication or range of bandwidths for various Streamer read rates ?
Hmm, so that's sounding a firmly fixed 5 size on the burst ? (I've modified the 2c example above for this 5-from-14 rule )
Because the FIFO fills on every clock, and register delays should be fixed, is there ever <> 5 ?
I guess that means all reads from hubRAM have this trailing buffering. So a single RDLONG has five subsequent longwords that are discarded.
It could be >5 if data is being read out during the FIFO reload. It could even be continuous.
Only hub-exec and RDFAST engage the FIFO in reading the hub, so that many longs are read in. RDxxxx only reads one or two longs, in cases of overlap on a word or a long.
The question should have been focused on the tail, ie once 14 is reached, and the 5 more rule starts, can anything change the 5-more ?
The 2c example above, I now have reading 8 or 9 bursts, by applying a nominal wait-for-slot and also applying the 5-after-14 rule.
Assuming that's modeled right, that gives gaps of 8 or 9 (for that wait case) which is just enough to ensure a slot exists in every gap between fill-bursts, for a possible user-data access.
There may be other wait cases, that shrink the gap to < 8, and that means user data may (rarely) stretch across 2 FIFO fills+gaps, before it hits a slot align.
It is continuous FIFO load, if read is done at 1c ?
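To play with the "5 more after 14" rule, here is a toy clock-by-clock model in Python. It is only a sketch of the wording in this thread, not of the silicon: the restart condition (reload requested once the level drops below cogs+6) and the fixed slot wait are assumptions, and all names are invented for illustration.

```python
def simulate_fifo(read_period, wait_clocks, total_clocks, cogs=8):
    """Toy model: returns (completed burst lengths, max level seen, final state)."""
    depth = cogs + 11               # 19 stages for 8 cogs
    threshold = cogs + 6            # 14: continuous load stops here
    tail_longs = depth - threshold  # 5 more longs stream in afterwards

    level = depth                   # start with a full FIFO
    state = "idle"
    countdown = 0
    burst = 0
    bursts = []
    max_level = level

    for clock in range(total_clocks):
        # Consumer: one long leaves the FIFO every read_period clocks.
        if clock % read_period == 0 and level > 0:
            level -= 1

        # Loader state machine.
        if state == "idle":
            if level < threshold:   # ASSUMED restart condition
                state = "wait"
                countdown = wait_clocks  # ASSUMED fixed slot-align wait
        elif state == "wait":
            countdown -= 1
            if countdown <= 0:
                state = "fill"
                burst = 0
        elif state == "fill":
            level += 1
            burst += 1
            if level >= threshold:
                state = "tail"
                countdown = tail_longs
        else:  # "tail": exactly tail_longs more writes, whatever the reads do
            level += 1
            burst += 1
            countdown -= 1
            if countdown == 0:
                bursts.append(burst)
                state = "idle"

        max_level = max(max_level, level)

    return bursts, max_level, state


if __name__ == "__main__":
    print(simulate_fifo(read_period=10, wait_clocks=4, total_clocks=200))
    print(simulate_fifo(read_period=2, wait_clocks=3, total_clocks=200))
    print(simulate_fifo(read_period=1, wait_clocks=3, total_clocks=100))
```

Under these assumptions the burst is never shorter than the tail, it grows when reads land during the reload, and at 1c the loader never leaves the fill state, i.e. the load is continuous. Different restart or wait assumptions will shift the exact burst lengths.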
It could be done like that but that approach probably burns more COGRAM.
1) It needs to set a flag somewhere.
2) There needs to be a flag available (perhaps could be shared in another reg if another one is suitable for this)
3) It needs to test this flag to see if it needs to generate the code.
4) It needs to branch
That's probably up to 3 or 4 more registers consumed. Simpler to do it every time you want it and not need to check. Once there are cycles available to do this the work is not a problem.
While I'm not likely to switch modes from one line to the next in the first release of this driver, I'd quite like to be able to in the future, to open up the concept of supporting display region lists... though in the worst case reading in a 256 entry palette each scanline is not ideal, and there certainly won't be time to clean it up to stop DVI locking up if bit 1 is set in the long. There would be time in the vertical blanking to do this however.
I'd expect we will need to make use of self modifying code once we get to that point. And reclaiming COG/LUT space dynamically in different portions of the scanline to get everything to fit. I'm still trying to keep that in the back of my mind as I go along here to try not to preclude it. Something may possibly have to be dropped there to make room. Maybe pixel doubling won't be possible with it, or I'll need to read in dynamic code blocks from HUB, which I'm still hoping to avoid for stability reasons. Eg. you have an errant COG you are debugging using the video driver to display memory or other state etc and this errant COG goes and kills the video driver executable code. Annoying, but instead if the video driver stays up the whole time, you might be able to continue debugging things (to a point anyway).
Once a cog gives the eggbeater a command to read, it takes 5 clocks for the data to come out of the eggbeater. It doesn't matter if SETQ is used, or not.
For hub-exec and RDFAST, cycle by cycle, data is coming in from the eggbeater into the FIFO, while data is being read out of the FIFO. This is why very long eggbeater bursts (more than 5 extra) can occur. Demand must be met.
Clipping code gets tedious and tricky to deal with and it gave me some grief with the flags. Be nice if there were some optimisations found there, if anyone knows a better way...
There is one remaining side effect I think where the hub data may overwrite 15 longs beyond the last scanline buffer with the original data that it reads from that line. If this becomes an issue it could be accommodated by padding out the line buffer area by 60 extra bytes at the end, so I might just document that and leave it. Otherwise there would need to be some additional code put in to compute and enforce write size limits that I was trying to avoid as it increases the size of the code further than I'd like, though I will at least look into how to do it...
do_mouse
            getword a, mouse_xy, #1         'get mouse y screen co-ordinate
            getnib  b, mouseptr, #7         'get y hotspot of mouse image
            sub     a, b                    'compensate for the y hotspot
            subr    a, scanline             'compute sprite row offset
            cmpr    a, #15 wc               'check if sprite covers scanline
            alts    bpp, #bittable          'bpp starts out as index from 0-5
            mov     bitmask, 0-0            'get table entry using bpp index
            mul     a, bitmask              'multiply mouse row by its length
            shr     bitmask, #16 wz         'extract mask portion
    if_z    not     bitmask                 'fix up the 32 bpp case
            mov     bpp, bitmask
            ones    bpp                     'convert into real bpp
            add     a, mouseptr             'add offset to base mouse address
            setq2   #17-1                   'get 17 longs max, mouse mask+image
            rdlong  $120, a                 'read mouse data and store in LUT
            getword offset, mouse_xy, #0    'get mouse x screen co-ordinate
            getnib  b, mouseptr, #6         'get x hotspot of mouse image
            mov     pixels, offset
            sub     offset, b               'compensate for the x hotspot
    if_nc   subr    pixels, ##640 wcz       'compute pixels until end of line
            add     pixels, b               'increase by the x hotspot amount
            fle     pixels, #16             'limit drawn pixels to 16
    if_c_or_z ret   wcz                     'exit if sprite is out of x/y range
            mov     ptrb, #$120             'ptrb is used for mouse image data
            rdlut   c, ptrb++               'read in the mouse mask first
            abs     a, offset               'retain bits
            muls    offset, bpp             'convert number of pixels into bits
            abs     b, offset wc            'test for negative value (clipped)
            mov     muxmask, bitmask        'setup mask for pixel's data size
    if_nc   rol     muxmask, b              'align mask for first data pixel
    if_c    rol     bitmask, b              'align mask for first mouse pixel
    if_c    shr     c, a                    'eliminate mouse pixels if clipped
            shr     b, #5                   'convert bits to longs
    if_c    add     ptrb, b                 'advance mouse data to skip pixels
            shl     b, #2                   'convert longs to bytes
    if_nc   add     save, b                 'adjust scanline buffer position
            setq2   #16-1                   'read 16 scanline longs into LUT
            rdlong  $110, save              'using adjusted hub read position
            mov     ptra, #$110             'ptra used for source image data
            rdlut   a, ptra                 'get original scanline pixel data
            test    $, #1 wc                'c=1 will trigger initial read
            rep     @endmouse, pixels       'repeat loop for 16 pixels
    if_c    rdlut   b, ptrb++               'get next mouse sprite pixel(s)
    if_c    rol     b, offset               'align with the source input data
            shr     c, #1 wc                'get mask bit 1=set, 0=transparent
    if_c    setq    muxmask                 'apply bit mask to muxq mask
    if_nc   setq    #0                      'setup a transparent mouse pixel
            muxq    a, b                    'select original or mouse pixel
    if_c    wrlut   a, ptra                 'write back updated data if altered
            rol     muxmask, bpp wc         'advance mask by 1,2,4,8,16,32 bits
    if_c    rdlut   a, ++ptra               '...and read next source pixel(s)
            rol     bitmask, bpp wc         'rotate mask for mouse data reload
endmouse
            setq2   #16-1
            wrlong  $110, save              'write LUT image data back to hub
            ret     wcz

bittable
            long    $0001_00_08             '1bpp
            long    $0003_00_08             '2bpp
            long    $000F_00_0C             '4bpp
            long    $00FF_00_14             '8bpp
            long    $FFFF_00_24             '16bpp
            long    $0000_00_44             '32bpp

bitmask     long    0                       'TODO eventually share other temporary register space
muxmask     long    0                       'in order to eliminate these temp working variables
offset      long    0
bpp         long    0
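For anyone trying to follow the bittable packing, here is a small Python sketch of how the entries appear to decode. It follows the code above: the low 16 bits are what MUL uses as the row length, the high 16 bits are the pixel mask, and a zero mask is the 32 bpp case fixed up by NOT. Reading the row length as "one mask long plus the image longs for 16 pixels" is my interpretation, not something stated in the post.

```python
BITTABLE = [0x00010008, 0x00030008, 0x000F000C,
            0x00FF0014, 0xFFFF0024, 0x00000044]


def decode_entry(entry: int):
    """Split a bittable long into (row_bytes, pixel_mask, bpp)."""
    row_bytes = entry & 0xFFFF        # used by 'mul a, bitmask'
    mask = entry >> 16                # 'shr bitmask, #16 wz'
    if mask == 0:                     # 'if_z not bitmask' (32 bpp case)
        mask = 0xFFFFFFFF
    bpp = bin(mask).count("1")        # 'ones bpp'
    return row_bytes, mask, bpp


def assumed_row_bytes(bpp: int, width: int = 16) -> int:
    """ASSUMPTION: each sprite row is 1 mask long + enough longs for 16 pixels."""
    image_longs = (width * bpp + 31) // 32
    return 4 * (1 + image_longs)


if __name__ == "__main__":
    for entry in BITTABLE:
        row_bytes, mask, bpp = decode_entry(entry)
        print(f"{bpp:2d} bpp  mask={mask:#010x}  row={row_bytes} bytes")
```

The assumed row layout reproduces every row length in the table, and the 32 bpp row of 68 bytes is the 17 longs that the SETQ2 #17-1 read allows for.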
Edit: posted too soon. Last change I put in that I thought was innocuous and worked in one mode actually broke some clipping of colour modes I'd tested earlier... :frown: Very close though.
Edit2: Just found the line I'd changed out in my code and put it back. Now clipping is working as I wanted. Code adjusted above and increased by one long. Also one constant used was with ## so I guess it is now 62 longs.
I think I found a workaround for the minor issue described earlier, adds 4 longs though if I share the ##640 constant as a separate register instead of ##.
            ...
endmouse
            mul     bpp, ##640
            sub     bpp, offset
            shr     bpp, #5
            fle     bpp, #15
            setq2   bpp                     ' was #16-1
            wrlong  $110, save              'write LUT image data back to hub
            ret     wcz
It looks like when the code gets to endmouse, pixels still contains the number of pixels drawn. Subtract one from pixels before use in the setq2 and save yourself 3 longs in the code.
It is standard CGA/VGA text format, 16bit data. LS Byte is text character, MS byte holds 4 bit foreground colour in bits 8-11, 3 (or 4) bit background colour in bits 12-15. Bit15 is also a text flash attribute if not configured for 16 background colours and only the first 8 palette entries are used for background colours in that case.
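As a quick illustration of that 16-bit cell layout, a Python sketch of the decode (the function name and the sixteen_bg switch are invented here; how the driver actually selects the mode is not shown in this post):

```python
def decode_text_cell(cell: int, sixteen_bg: bool = False) -> dict:
    """Unpack one 16-bit CGA/VGA style text cell."""
    char = cell & 0xFF                 # LS byte: character code
    fg = (cell >> 8) & 0xF             # bits 8-11: foreground colour
    if sixteen_bg:
        bg = (cell >> 12) & 0xF        # bits 12-15: 4-bit background
        flash = False
    else:
        bg = (cell >> 12) & 0x7        # bits 12-14: 3-bit background
        flash = bool(cell & 0x8000)    # bit 15: flash attribute
    return {"char": char, "fg": fg, "bg": bg, "flash": flash}


if __name__ == "__main__":
    print(decode_text_cell(0x9F41))
    print(decode_text_cell(0x9F41, sixteen_bg=True))
```

The same word therefore means "flashing 'A' on background 1" in the 8-background mode and "steady 'A' on background 9" in the 16-background mode.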
There may still be some other optimizations buried there with any luck if you see any. I felt it got a little messy in the code to track the clipping and handle hotspot offsets and some reverse subtract code was used and that seems to set the flags differently to what I needed, so I had to do extra work there.
            shr     c, #1 wc                'get mask bit 1=set, 0=transparent
            setq    muxmask                 'apply bit mask to muxq mask
    if_c    muxq    a, b                    'select mouse pixel if not transparent
    if_c    wrlut   a, ptra                 'write back updated data if altered
Thanks, same here. There's just about space left for a TERC4 encoder...that code needs ~200 longs and 16 LUT table entries plus quite a lot of extra state variables, but depending on how LUTRAM gets used, it might still eventually fit. It's a real juggling act. I can also try to reuse COGRAM buffer space more across different functions like the mouse and pixel doubling stuff unless it blows out the timing budget. The main annoyance is the 256 colour palette mode. That chews up half the LUTRAM. If it was 4 bit LUT only that would open up a huge amount of free space for code. I do think there is time to load a palette in per scanline too instead of once per frame which opens up the possibility of per scanline mode regions, however then you are at the mercy of the user not clearing bit 1 as there certainly would not be time to do that clearing operation on all scanlines in an 8 bit LUT mode whereas you could if it gets loaded during vertical blanking, though I'm still not yet doing that step myself there yet either. Would be kind to do it but takes more COGRAM of course.
I still want to get HyperRAM framebuffers in too. Still wanting to keep it all self contained. Lots of competing constraints.
Ok, I see that now. So, given that you set ptra to $110 before the pixel loop, then the number of longs to write is ptra - $110.
If my figuring is correct this should work for all bit depths:
            altd    ptra, #$EF
            setq2   0-0
            wrlong  $110, save
            ret     wcz
Edit: forgot the need to subtract one from the setq2 value. Code updated.
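The #$EF constant looks odd until the 9-bit wrap is written out. As I understand ALTD, it substitutes (D + S) & $1FF into the next instruction's D field, so adding $EF is the same as subtracting $111, i.e. (ptra - $110) - 1. A small Python check of that arithmetic (the function name is made up, and this models only the address maths, not the instruction itself):

```python
LUT_BASE = 0x110


def setq2_value_via_altd(ptra: int, altd_offset: int = 0xEF) -> int:
    """Value the 'altd ptra, #$EF' trick feeds to SETQ2, assuming a 9-bit wrap."""
    return (ptra + altd_offset) & 0x1FF


if __name__ == "__main__":
    for longs in (1, 8, 16):
        ptra = LUT_BASE + longs
        print(longs, setq2_value_via_altd(ptra))
```

Note the edge case: if ptra is still $110 (nothing to write) the result wraps to $1FF rather than going negative, so that case would need its own check.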
Then, just write the mouse data normally without clipping, just a normal bounds check.