I have looked further at this with Chip's diagrams and I am thinking I finally might have come across a simple sequence for the 2-bit doubling.
32 bit initial input is: XXXXXXXX_XXXXXXXX_abcdefgh_ijklmnop
SPLITW a ' result XXXXXXXX_acegikmo_XXXXXXXX_bdfhjlnp
MOVBYTS a, #%%2020 ' result acegikmo_bdfhjlnp_acegikmo_bdfhjlnp
MERGEB a ' result ababcdcd_efefghgh_ijijklkl_mnmnopop
Simply repeat for the upper 16 bits, just changing the MOVBYTS value accordingly.
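To convince myself the shuffle really lands the bits where the comments claim, here's a small Python model of SPLITW, MOVBYTS and MERGEB (bit orderings as I understand the instruction descriptions) checked against a straightforward reference doubler. The %%3131 value for the upper halfword is my inference from the byte layout after SPLITW, not something tested on silicon:

```python
# Paper check of the 2-bit doubling sequence, not a silicon test.

def splitw(d):
    """SPLITW: odd-numbered bits -> top word, even-numbered bits -> bottom word."""
    top = bottom = 0
    for i in range(16):
        top    |= ((d >> (2 * i + 1)) & 1) << i
        bottom |= ((d >> (2 * i)) & 1) << i
    return (top << 16) | bottom

def movbyts(d, sel):
    """MOVBYTS: result byte k is the source byte selected by 2-bit field k of sel."""
    b = [(d >> (8 * k)) & 0xFF for k in range(4)]
    return sum(b[(sel >> (2 * k)) & 3] << (8 * k) for k in range(4))

def mergeb(d):
    """MERGEB: interleave the bits of the four bytes, MSB first."""
    r = 0
    for j in range(32):
        group, member = divmod(j, 4)
        r = (r << 1) | ((d >> (31 - 8 * member - group)) & 1)
    return r

def double2bpp_low(x):
    """The posted sequence for the low 16 bits (%%2020 == %10_00_10_00)."""
    return mergeb(movbyts(splitw(x), 0b10001000))

def double2bpp_high(x):
    """Presumed upper-half variant: select bytes 3 and 1 (%%3131 == %11_01_11_01)."""
    return mergeb(movbyts(splitw(x), 0b11011101))

def reference_double(halfword):
    """Each 2-bit pixel emitted twice, most significant pair first."""
    out = 0
    for k in range(8):
        pair = (halfword >> (14 - 2 * k)) & 3
        out = (out << 4) | (pair << 2) | pair
    return out

for x in (0x0000_1234, 0xFFFF_89AB, 0x5A5A_C3C3):
    assert double2bpp_low(x) == reference_double(x & 0xFFFF)
    assert double2bpp_high(x) == reference_double(x >> 16)
```

Running this, the posted %%2020 path doubles the low sixteen bits exactly, and picking bytes 3 and 1 with %%3131 does the same for the upper halfword.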
I think this is right, or have I messed up again...nothing surprises me with this shuffling thing, it's been doing my head in. I'll test it out.
Nibble doubling without muxq is tantalising now. Probably best to work backwards using different splits, merges and movbyts.
Yeah I would expect that too, but the exact sequence has so far eluded me. I think it could take more steps, and if it does that may become an issue. The good news is that the 2-bit doubling I found above seems to be working now and, most importantly, I was able to rework this new method into the earlier skipf framework nicely without violating the 7 skip maximum, which would have killed it for me. Even saved a couple of COG longs in the process. Bonus.
Here's the magic pixel doubling sequence that seems to work for all colour depths.
doublepixels
            rep     instcount, readburst    ' repeat instcount instructions, readburst times
            skipf   pattern                 ' per-depth skip pattern selects which instructions run
            rdlut   a, ptra++               ' fetch next source long from LUT buffer
            setq    nibblemask              ' load Q mask for the muxq merges below
            splitw  a
            mov     b, a                    ' keep a copy for the second output long
            movbyts a, bytemask1
            mov     pb, a
            mergeb  a
            mergew  a
            shl     pb, #4
            muxq    a, pb
            wrlut   a, ptrb++               ' first doubled long
            movbyts b, bytemask2
            mergew  b
            mergeb  b
            mov     pb, b
            shl     pb, #4
            muxq    b, pb
            wrlut   b, ptrb++               ' second doubled long
            ret
nibblemask long $0ff00ff0
' ___skip_pattern____bytemask_inst
doublebits long %11110001101100110_01000100_1001
doublenits long %11101001110100011_10001000_1010
doublenibs long %00011000011000100_01010000_1110
doublebytes long %11111001111100110_01010000_0111
doublewords long %11111001111100110_01000100_0111
doublelongs long %11111101111110110_00000000_0101
'bytemask2 is bytemask1^$AA when bit 0 of the skipf pattern is 0
'bytemask2 is bytemask1^$55 when bit 0 of the skipf pattern is 1 (and then cleared to 0)
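For reference, here's how I'm reading those packed table entries, as a quick Python sketch. The field widths (17-bit skipf pattern, 8-bit bytemask1, 4-bit instruction count) are taken from the header comment, and the bytemask2 rule from the two comments above; the field names are mine:

```python
# Hypothetical decode of the packed skip-table entries.

def unpack(entry):
    inst     = entry & 0xF          # inst field (low 4 bits)
    bytemask = (entry >> 4) & 0xFF  # bytemask1
    pattern  = entry >> 12          # 17-bit skipf pattern
    return pattern, bytemask, inst

def bytemask2(bytemask1, pattern):
    # bytemask2 = bytemask1 ^ $AA when pattern bit 0 is 0, else ^ $55
    return bytemask1 ^ (0x55 if pattern & 1 else 0xAA)

doublebits = 0b11110001101100110_01000100_1001
pattern, bm1, inst = unpack(doublebits)
assert (bm1, inst) == (0x44, 0b1001)
assert bytemask2(bm1, pattern) == 0xEE   # pattern bit 0 is 0 here
```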
It's not a 7 skip maximum per se. 8 or more skips in a row simply means the eighth instruction is cancelled (2 cycles consumed) instead of being skipped.
I can see why you'd want to avoid it, but it shouldn't be considered a showstopper.
It's only a showstopper for me on the 16bpp pixel doubling, which hits over 3200 cycles minimum once another instruction is added to the skipf loop, but it will be fine for pure DVI. I've figured out that doubling 16bpp pixels with the current skipf pattern is already a minimum of 2880 clocks, excluding some housekeeping and loop overheads, plus lost clocks from latencies during setq transfers while the streamer is operating, which I've not figured out yet. Maybe @evanh would have some idea of the amount of extra cycles we need to factor in if we are streaming longs from the hub through the FIFO, effectively sustaining long read rates of sys_clock/N, while doing large setq/setq2 based sequential burst transfers from the COG at the same time, where N=10 (for 32/24bpp), N=20 (for 16bpp), N=40 (for 8bpp), etc.
Using immediates instead of registers for the movbyts saves longs overall and fiddling about with skip patterns and masks:
In general yes, but in this case I think I won't be able to squeeze that many instructions in the gaps between different variants and it would go past the 7 skip limit, adding more cycles. It would be fine for depths <= 8 but not 16bpp or the 24bpp I have in there right now (probably only temporarily for truecolour as it is a bandwidth and clock cycle pig).
My changes make no difference to 16bpp, as its movbyts is placed last, so there is still room for two more instructions without any cancelling.
Ok, I can take a look at it if you found a working sequence that keeps things intact for 16bpp and below. I will likely pull that truecolour variant out anyway as a different case. There's always the chance of Saucy's approach for that with the LUT too, though I'd sort of like everything to work the same way, as special cases add more COG overhead to deal with them. With the fastest 7 clock rep loop instead of this common 11 clock version of doubling longs, it will get down to 2240 clocks for the doubling, which is a lot better than 3520! Overall it is still going to be over 3200, but there might be some extra processing time somewhere down the line (which I am sort of banking on).
Adding comments to skipping code makes a huge difference to understanding and is essential really. The maximum number of instructions skipped is quite easy to see (six for 8bpp and five for 16bpp in the code above, with a limit of seven for no wasted cycles).
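A quick check over the skip patterns in the table further up, assuming 1 bits mark skipped instructions, confirms the longest runs all stay within the seven-skip limit:

```python
# Longest run of consecutive skipped (1) bits per pattern,
# using the pattern values from the skip table above.

def max_skip_run(pattern_bits):
    longest = run = 0
    for ch in pattern_bits:
        run = run + 1 if ch == '1' else 0
        longest = max(longest, run)
    return longest

patterns = {
    'doublebits':  '11110001101100110',
    'doublenits':  '11101001110100011',
    'doublenibs':  '00011000011000100',
    'doublebytes': '11111001111100110',
    'doublewords': '11111001111100110',
    'doublelongs': '11111101111110110',
}
runs = {name: max_skip_run(p) for name, p in patterns.items()}
assert all(r <= 7 for r in runs.values())   # no cancelled instructions
```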
Just figured out a good way to resolve the slowness of truecolor pixel doubling to better support a 320 wide pixel mode over DVI, leaving enough cycles for other future things...
I can read the 320 wide pixel data (working on say 32 long chunks at a time) into COG RAM, then just use the following unrolled code where ptra is setup to point into LUT buffer area being written into:
I can easily build this unrolled code block dynamically once at the start and probably store it in the 64 long area I had reserved for a font buffer in text mode, plus I can use the existing 40 long character buffer for holding the 32 long source data read with repeated setq bursts.
The end result is now only 4 P2 clocks per doubled pixel multiplied by 320 = 1280 clocks, plus some outer loop overhead which will be small. This is way better than 7 clocks x 320 I had before with rdlut, wrlut, wrlut combo. Even when I add in the extra 320 clocks to read and 640 clocks to write back to HUB it will be easily less than my 3200 budget. Very happy now!
wrlut with an auto-incrementing pointer is a very powerful instruction for sequential transfers.
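The budget arithmetic above, written out as a check (the numbers are the ones quoted in the post, nothing new):

```python
# Clock budget for the unrolled truecolor doubling, per scan line.

UNROLLED = 4 * 320            # unrolled wrlut pairs: 4 clocks per doubled pixel
OLD      = 7 * 320            # previous rdlut + wrlut + wrlut combo
hub_io   = 320 + 640          # burst-read source + write doubled data back to hub

assert UNROLLED == 1280 and OLD == 2240
assert UNROLLED + hub_io < 3200   # fits the 3200-clock budget with room to spare
```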
I considered a REP too, though the approach in your example will effectively double the execution time, plus I think you missed doing what I want, as it just copies rather than replicates, and increments cogbuf by 1 (but I think I know what you meant). To fit it I think you'd want something like this...
            rep     #4, #32
            wrlut   cogbuf, ptra++
            add     $-1, d0
            wrlut   cogbuf, ptra++
            add     $-1, d0                 ' where d0 = $200 (or you could use the alti method to increment the d field instead)
It takes 8 clocks x 320 pixels = 2560 total.
Building my unrolled loop dynamically is not a problem, it won't take long, and I already have space for it because I needed a buffer for fonts, which will hold this.
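To see why the `add $-1, d0` trick replicates rather than copies, here's a toy Python model of the self-modifying REP loop; the D field of a P2 instruction sits at bits 9..18, which is why adding $200 advances the preceding wrlut's source register by one:

```python
# Toy model of the self-modifying REP loop: each add bumps the D field
# of the wrlut one instruction back, so every source long is written
# twice before the register address advances.

D0 = 0x200                        # 1 << 9, i.e. +1 in the D field

def rep_loop(src, iterations):
    lut = []
    d_field = [0, 0]              # D fields of the two wrlut instructions
    for _ in range(iterations):
        lut.append(src[d_field[0]])   # wrlut cogbuf, ptra++
        d_field[0] += D0 >> 9         # add $-1, d0 (modifies wrlut #1)
        lut.append(src[d_field[1]])   # wrlut cogbuf, ptra++
        d_field[1] += D0 >> 9         # add $-1, d0 (modifies wrlut #2)
    return lut

src = list(range(32))
assert rep_loop(src, 32) == [v for v in src for _ in range(2)]
```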
Ah, common slip-up with that ADD in the REP. The cogbuf address is coded into the instruction. To make it a variable (indirection), an ALTx prefixing instruction is required. In this case ALTI will be the most useful.
Man, I hadn't imagined anyone being that hoggy on the prop2.
LOL. I like to push things to the limit. Pixel doubling 32 bits from hub back into hub is something that does stress the P2 it seems.
While I have your attention evanh, I am wondering if you have any insights into the expected lack of bandwidth from the egg-beater accesses with the streamer/fifo reading out pixels sequentially at different bit depths and competing with a setq burst transfer at the same time.
For 24bpp (32 bit accesses from the streamer every 10th clock) I was thinking it might be something like 1 time in 10 that it is held up, but it may not be as simple as this, and if the FIFO has precedence perhaps you lose 8 or 16 clocks at a time, every time it lines up with the setq slice being requested or something like that. Based on your knowledge of access patterns, do you have anything to suggest? Maybe @cgracey would be best to ask.
I don't actually know, but what makes the most sense to me is that the FIFO reads in whole rotations of 8 clocks at a time, possibly 16 clocks. That has priority, so the hubRAM instructions have to fit in around the FIFO.
On that basis, the streamer has the FIFO bursting for 8 clocks every 80 clocks. You are wanting 40 clocks for an instruction burst. Makes it about 50/50 chance of coinciding and extending your burst by only 8 extra clocks.
If all that is correct, then the next question is what happens if the instruction burst is 80 clocks long. That's 100% coinciding. Is it always just +8 clocks or is there also a chance of +16 clocks?
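That coincidence estimate can be sanity-checked with a toy phase model. This only counts how often the two windows overlap under the assumed 8-in-80 FIFO cadence; it says nothing about the real arbitration, which needs measuring on hardware:

```python
# Toy overlap model: an 8-clock FIFO burst at the start of every
# 80-clock period, versus a 40-clock setq burst at an arbitrary phase.

PERIOD, FIFO_BURST, SETQ_BURST = 80, 8, 40

def overlaps(start, burst=SETQ_BURST):
    # Does a setq burst starting at `start` touch any FIFO-busy clock?
    return any((start + t) % PERIOD < FIFO_BURST for t in range(burst))

hits = sum(overlaps(s) for s in range(PERIOD))
assert hits == 47                                    # 47/80, roughly the "about 50/50" call
assert all(overlaps(s, 80) for s in range(PERIOD))   # an 80-clock burst always coincides
```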
Yes, my mistake: I missed cogbuf incrementing every second wrlut, and for some reason I also was thinking of cogbuf as a pointer to the location, rather than the location itself.
It makes my head swim, too. Best determined by experimentation.
Yeah I will be experimenting with it soon. I can capture the number of clocks taken to do various things and probably display it graphically in the picture on the fly to see how much headroom I have left after doubling, as a proportion of the 6400 cycles during the visible pixel portion for the different modes.
My concern is getting into some pathological case where it hits repeatedly. I am wondering if I might be best to read things in backwards, using ptra--? Is that even possible to do with setq? Perhaps that could help too, or does that increase the time to do the transfer? I could also try to offset the initial setq request in time by some number of clocks if that helps, so I am reading odd long addresses and the streamer might be reading an even long one or something like that. Time will tell what is best.
I think it is 16 clocks, and the FIFO depth is 16 longs, which it has to be for 16 cogs; it wasn't reduced for 8 cogs? Well, the RDFAST/WRFAST block size wasn't reduced from 64 to 32 bytes anyway when 16 cogs wouldn't fit. What I don't know is when the FIFO is reloaded. When it's almost empty? Or half empty? Only one person knows for sure.
For 24bpp, can't hardware do the pixel repeating? A cog doesn't need to manipulate any data within each long and it seems a waste. A related question, does the low byte of RGB24 streamer data have to be zero?
Re fast block moves yielding to RDFAST streaming, none of the addresses is random for video, so if there is a sweetspot timing then it could apply all the time.
SETQ+RD/WRLONG only work in a forward direction, since that is how the RAMs cycle in the egg beater.
FIFO depth was reduced for 8 cogs.
Bit 1 in the 32-bit streamer output selects literal mode, where the three 10-bit fields above bit 1 become serial channel data. If bit 1 is 0, the top three bytes are encoded as colors, each becoming 10 serial bits.
My concern is getting into some pathological case where it hits repeatedly. I am wondering if I might be best to read things in backwards, using ptra--? Is that even possible to do with setq? Perhaps that could help too, or does that increase the time to do the transfer?
No. It only increments. All bursts increment, per sysclock, through consecutive longword locations.
I could also try to offset the initial setq request in time by some number of clocks if that helps, so I am reading odd long addresses and the streamer might be reading an even long one or something like that. Time will tell what is best.
There are two factors, normally, just for the hub access alone, so there is variability even without the FIFO involved. The FIFO's added intrusions are extra on top. Still, managing it is entirely possible, simply because the start burst addresses of both the FIFO and the instruction are deterministic.
Chip, Does the FIFO refill in 8-clock bursts for an 8-cog prop2? Or is it always 16 clocks, to match the 64-byte block size? Is the initial filling different?
Thanks for the clarification, Chip. The 64-byte RDFAST/WRFAST block size was kept at the 16-cog level when the FIFO was reduced. My very late night post wasn't helping much, if at all; apologies.
I can read the 320 wide pixel data (working on say 32 long chunks at a time) into COG RAM, then just use the following unrolled code where ptra is setup to point into LUT buffer area being written into:
            wrlut   cogbuf, ptra++
            wrlut   cogbuf, ptra++
            wrlut   cogbuf+1, ptra++
            wrlut   cogbuf+1, ptra++
            wrlut   cogbuf+2, ptra++
            wrlut   cogbuf+2, ptra++
            wrlut   cogbuf+3, ptra++
            wrlut   cogbuf+3, ptra++        ' ...and so on for all 32 source longs
I can easily build this unrolled code block dynamically once at the start and probably store it in the 64 long area I had reserved for a font buffer in text mode, plus I can use the existing 40 long character buffer for holding the 32 long source data read with repeated setq bursts.
The end result is now only 4 P2 clocks per doubled pixel multiplied by 320 = 1280 clocks, plus some outer loop overhead which will be small. This is way better than 7 clocks x 320 I had before with rdlut, wrlut, wrlut combo. Even when I add in the extra 320 clocks to read and 640 clocks to write back to HUB it will be easily less than my 3200 budget. Very happy now!
wrlut with an auto-incrementing pointer is a very powerful instruction for sequential transfers.
Job for v34.
Is there a reason you can't use a rep block?
e.g.
Smaller code footprint, same execution speed.
Outer loop overhead is as simple as setting cogbuf, ptra, and loopcount.
I thought true colour was 24bpp.
https://en.wikipedia.org/wiki/Color_depth#True_color_(24-bit)