So here's something funny I came up with to reload fresh longs of layer data and apply the mosaic effect.
Original bad idea, with everything in LUT:
getbyte ppc_tmp1,ppc_bg1,#3 ' get last pixel of previous long
getbyte ppc_tmp2,ppc_bg2,#3
getbyte ppc_tmp3,ppc_bg3,#3
getbyte ppc_tmp4,ppc_bg4,#3
rdlut ppc_sprp,ptra[##384]
rdlut ppc_spr,ptra[##320]
rdlut ppc_bg4,ptra[##256]
rdlut ppc_bg3,ptra[##192]
rdlut ppc_bg2,ptra[##128]
rdlut ppc_bg1,ptra[##64]
rdlut ppc_tmp5,ptra++ wc ' mosaic table: MOVBYTS data + carryover flag
skip ppc_mosaicmask
if_c setbyte ppc_bg1,ppc_tmp1,#0 ' replace first pixel with last pixel of previous long
movbyts ppc_bg1,ppc_tmp5
if_c setbyte ppc_bg2,ppc_tmp2,#0
movbyts ppc_bg2,ppc_tmp5
if_c setbyte ppc_bg3,ppc_tmp3,#0
movbyts ppc_bg3,ppc_tmp5
if_c setbyte ppc_bg4,ppc_tmp4,#0
movbyts ppc_bg4,ppc_tmp5
59 cycles, awful.
New cracked idea (layers now need to live in cog, also mosaic table needs to have different contents):
rdlut ppc_tmp5,ptra++ wc ' mosaic table: MOVBYTS data + carryover flag
if_nc setq ppc_n1 ' ##$FF_FF_FF_FF, replace all pixels
if_c setq ppc_n256 ' ##$FF_FF_FF_00, keep lowest byte (last post-mosaic pixel in last quad)
skipf ppc_mosaicmask
alts ppc_bgptr,#ppc_bg1buffer
mov ppc_bg1,0-0 ' BG1 mosaic OFF
movbyts ppc_bg1,#%%0123 ' ^
muxq ppc_bg1,0-0 ' BG1 mosaic ON
movbyts ppc_bg1,ppc_tmp5 ' ^
alts ppc_bgptr,#ppc_bg2buffer
mov ppc_bg2,0-0 ' BG2 mosaic OFF
movbyts ppc_bg2,#%%0123 ' ^
muxq ppc_bg2,0-0 ' BG2 mosaic ON
movbyts ppc_bg2,ppc_tmp5 ' ^
alts ppc_bgptr,#ppc_bg3buffer
mov ppc_bg3,0-0 ' BG3 mosaic OFF
movbyts ppc_bg3,#%%0123 ' ^
muxq ppc_bg3,0-0 ' BG3 mosaic ON
movbyts ppc_bg3,ppc_tmp5 ' ^
alts ppc_bgptr,ppc_bg4buf_increment ' increment bgptr
mov ppc_bg4,0-0 ' BG4 mosaic OFF
movbyts ppc_bg4,#%%0123 ' ^
muxq ppc_bg4,0-0 ' BG4 mosaic ON
movbyts ppc_bg4,ppc_tmp5 ' ^
' sprites can not be mosaic'd
alti ppc_sprptr,#%100_111_000 ' must have ppc_spr in R field
movbyts ppc_spr,#%%0123 '
alti ppc_sprpptr,#%100_111_000 ' etc
movbyts ppc_sprp,#%%0123
This is 41 cycles. The trick is that the order of the bytes is always reversed. So byte 0 ends up with the 4th pixel. And when MUXQ-ing the new long over it, it can replace the 1st new pixel. (If that wasn't clear, replacing the first pixel with the last pixel of the previous four is needed when a single mosaic super-pixel crosses the long boundary). Since further processing is unrolled, taking the endian flip into account is free.
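To make the byte shuffling above easier to follow, here is a small Python model of the two instructions involved. It is only a sketch: `movbyts` and `muxq` are my own helper functions imitating the PASM2 instructions of the same name, the pixel values are made up, and the `%%0022` mosaic pattern is a hypothetical table entry (the real table contents aren't shown in the post).

```python
def movbyts(d, s):
    """Model of MOVBYTS D,S: result byte i is byte ((s >> 2*i) & 3) of d."""
    out = 0
    for i in range(4):
        sel = (s >> (2 * i)) & 3
        out |= ((d >> (8 * sel)) & 0xFF) << (8 * i)
    return out

def muxq(d, s, q):
    """Model of MUXQ D,S: every bit that is set in Q is copied from s into d."""
    return (d & ~q & 0xFFFFFFFF) | (s & q)

REVERSE = 0b00_01_10_11       # %%0123, plain byte reversal (mosaic OFF path)
KEEP_LOW_BYTE = 0xFFFFFF00    # Q value for the carry-over case

# Previous quad: pixels $01,$02,$03,$AA. After the reversal the last pixel
# ($AA) sits in byte 0 of the layer register.
prev = movbyts(0xAA030201, REVERSE)

# Fresh long from the layer buffer, pixel 0 = $11 in byte 0.
fresh = 0x44332211

# MUXQ the fresh long over the old register: byte 0 keeps the old last pixel,
# so it takes the place of the first new pixel.
merged = muxq(prev, fresh, KEEP_LOW_BYTE)

# Hypothetical mosaic pattern %%0022: pixels 0-1 repeat the carried pixel,
# pixels 2-3 repeat pixel 2, and the output is byte-reversed at the same time.
MOSAIC_PATTERN = 0b00_00_10_10
out = movbyts(merged, MOSAIC_PATTERN)

print(hex(prev), hex(merged), hex(out))
```

Byte 0 of `out` again holds the last post-mosaic pixel, so it is ready to be carried into the next quad the same way.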
(the "SPRP" buffer has the two priority bits for sprites. There are 9 bits for sprites: 4 data, 3 palette, 2 priority. My idea is to store the priority into a separate buffer. Though that'll be somewhat painful to mask when rendering.)
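A rough Python sketch of that 9-bit split, in case it helps. The bit positions are an assumption on my part, read off the `#%0111_0000` (palette) and `#%0011_0000` (priority) masks in the sprite tile draft further down; `split_sprite_pixel` is a made-up name, not something from the actual code.

```python
def split_sprite_pixel(data, palette, priority):
    """Split one sprite pixel (4 data + 3 palette + 2 priority bits)
    into a pixel-buffer byte and a separate priority-buffer byte.

    Assumed layout: palette in bits 6..4 and data in bits 3..0 of the
    pixel byte, priority in bits 5..4 of the priority byte.
    """
    if not (0 <= data < 16 and 0 <= palette < 8 and 0 <= priority < 4):
        raise ValueError("field out of range")
    pixel_byte = (palette << 4) | data
    prio_byte = priority << 4
    return pixel_byte, prio_byte

print([hex(b) for b in split_sprite_pixel(0xF, 5, 2)])
```

With this layout the pixel byte never needs more than 7 bits, and the priority byte has to be masked separately when rendering, which is the painful part mentioned above.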
Nice savings. If this is part of your critical path do these types of optimizations typically help reduce the P2 clock frequency a bit now, or are there other parts in the code that would still demand the 343MHz operation?
@rogloh said:
Nice savings. If this is part of your critical path do these types of optimizations typically help reduce the P2 clock frequency a bit now, or are there other parts in the code that would still demand the 343MHz operation?
This all is to get everything to fit at 343 at all.
Currently looking at sprite drawing. Once again just a tiny bit too slow. Well, the other option would be to make that Mode 7 routine from earlier faster, since that's the slowest BG mode (I think - only prototyped Modes 0, 1, 2 and 7)
Well, I think some 470 cycles can in fact be knocked loose on that Mode 7 routine by removing the SKIPF that always skips exactly one instruction (saves 514 cycles) and instead loading one of three specialized routines depending on the overflow mode (should cost at most 42 cycles more, or zero if there ends up being enough room in cog/LUT).
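Spelling that estimate out as arithmetic (a throwaway Python sketch; the variable names are mine, the numbers are the ones from the post):

```python
SKIPF_SAVING = 514      # cycles saved by dropping the SKIPF
LOAD_COST_WORST = 42    # worst-case cost of loading a specialized routine
LOAD_COST_BEST = 0      # if all three routines fit in cog/LUT

net_worst = SKIPF_SAVING - LOAD_COST_WORST
net_best = SKIPF_SAVING - LOAD_COST_BEST
print(net_worst, net_best)
```

So the "some 470" figure is the worst case (472 cycles), and the full 514 is only reached if no routine has to be loaded.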
Mode 0 (four layers) is now the slowest BG mode. EDIT: Oh, it always was, I just calculated it wrong.
EDIT 2: Mode 7 is slightly slower again if I block read entire rows of tiles (2*32 words) into cog RAM
Now this is where I need to either build a test program tomorrow or ask @evanh (he seems to actually understand this): How many instructions need to go in between the last RFLONG and a WMLONG that I don't want to stall?
@Wuerfel_21 said:
Now this is where I need to either build a test program tomorrow or ask @evanh (he seems to actually understand this): How many instructions need to go in between the last RFLONG and a WMLONG that I don't want to stall?
Six longs are read into the FIFO before RDFAST ends, leaving 13 more to read during subsequent instructions. A random hub write has a fixed pre-write time of two cycles, therefore six two-cycle instructions between RDFAST and WMLONG should be more than enough to prevent a stall.
EDIT:
Only true if there are no RFLONG's between RDFAST and WMLONG.
What I said earlier was wrong and I'll edit it later. Every RFLONG instruction that is executed while the FIFO is filling means another hub RAM read and FIFO write cycle will be needed, therefore the FIFO burst will be more than 19 cycles.
I think I need to make this optimization in the BG tile drawing functions, too. Would have to buffer the pixels into cog ram and burst write them. There already needs to be a 64 long buffer for mode 7, so that's ok. Not sure what to do about 8bpp and hires modes, those need twice the space. Hopefully it all just ends up fitting.
@Wuerfel_21 said:
Experimental verification: 3 instructions needed (2 if you exclude the SETQ)
rdfast bit31,pa
long 0[7]
rflong buffer+0
rflong buffer+1
rflong buffer+2
rflong buffer+3
rflong buffer+4
long 0[2]
setq #1
wmlong buffer,pb
Timing diagram below agrees with your test. FIFO burst is 19 + 5 = 24 cycles. I do know what I'm talking about when I remember all the relevant details!
WORST-CASE RDFAST
slice 0123456701234567012345670123456701234
!!!!!!!!!!!!!!!!!!!!!!!!| cycles
'RDFAST aaawwwwwwwrbbbbb | 16 wait
RDFAST xx | 2 no-wait
NOP xx | 2
NOP xx | 2
NOP xx | 2
NOP xx | 2
NOP xx | 2
NOP xx | 2
NOP xx | 2
RFLONG xx | 2
RFLONG xx | 2
RFLONG xx | 2
RFLONG xx | 2
RFLONG xx | 2
NOP xx | 2
NOP xx | 2
SETQ xx | 2
WMLONG aaw 3
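The burst arithmetic behind the diagram above can be sketched in a few lines of Python. This is only a model of the rule as stated in this thread (FIFO depth plus one extra cycle per RFLONG executed during the fill); the function and constant names are mine, and it makes no attempt to model hub slice alignment.

```python
FIFO_DEPTH = 19  # longs, per the thread

def fifo_burst_cycles(rflongs_during_fill, depth=FIFO_DEPTH):
    """Burst length after RDFAST: one cycle per long of FIFO depth, plus
    one more hub read for every RFLONG executed while it is still filling."""
    return depth + rflongs_during_fill

# Instruction sequence from the test: RDFAST (no-wait), 7 NOPs, 5 RFLONGs,
# 2 NOPs, SETQ. Every one of them is a two-cycle instruction.
instructions_before_wmlong = 1 + 7 + 5 + 2 + 1
cycles_before_wmlong = 2 * instructions_before_wmlong

print(fifo_burst_cycles(5), cycles_before_wmlong)
```

With five RFLONGs the burst comes out at 24 cycles, and WMLONG is issued 32 cycles after the RDFAST starts.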
I think I need to make this optimization in the BG tile drawing functions, too.
Hmm, it doesn't really benefit anything but the 2bpp function. The 4bpp one lacks unrelated instructions to fill the wait with and 8bpp can't really benefit because 13 RFLONGS take too long. So that solves the buffer dilemma. 2bpp is the critical path, anyways (well, haven't thought about what goes on in Mode 5 yet, which is 4bpp+2bpp hires)
@Wuerfel_21 said:
Experimental verification: 3 instructions needed (2 if you exclude the SETQ)
rdfast bit31,pa
long 0[7]
rflong buffer+0
rflong buffer+1
rflong buffer+2
rflong buffer+3
rflong buffer+4
long 0[2]
setq #1
wmlong buffer,pb
I think the last RFLONG could be replaced by RFWORD, which in theory should reduce FIFO burst from 24 to 23. If so, you'd need a RDLUT before SETQ+WMLONG to actually save a cycle.
For reference, current sprite tile code draft. May run 34 times per line
.tile
cmp ptra,ppr_sprbufleft wc ' is buffer - 7
if_nc cmpr ptra,ppr_sprbufright wc ' is buffer + 255
if_c jmp #.tile_clipped
' Tile address is something like
' %BBBY_YYYX_XXX0_yyy0
' B is base for sprite sheet (from OBJSEL and second sheet flag)
' Y is sprite sheet Y position
' X is sprite sheet X position
' y is bottom 3 bits of pixel Y position
mov ppr_tileid,ppr_tilebase ' already should have VRAM base, sheet base and sheet Y in it
rolnib ppr_tileid,ppr_tilex,#0
shl ppr_tileid,#1
rolnib ppr_tileid,ppr_tile,#2 ' (Y & 3) << 1
rdfast ppr_bit31, ppr_tileid
' get palette (do down here to save 28 MOVs and fill RDFAST delay slot)
getnib ppr_tilebuffer5,ppr_tile,#6
shl ppr_tilebuffer5,#3
and ppr_tilebuffer5,#%0111_0000
mov ppr_tilebuffer6,ppr_tilebuffer5
' get priority
getbyte ppr_tilebuffer3,ppr_tile,#3
and ppr_tilebuffer3,#%0011_0000
mov ppr_tilebuffer4,ppr_tilebuffer3
rflong ppr_tilebuffer1
rflong ppr_tilebuffer2 ' dummy read
rflong ppr_tilebuffer3 ' dummy read
rflong ppr_tilebuffer4 ' dummy read
rflong ppr_tilebuffer5
setword ppr_tilebuffer1,ppr_tilebuffer5,#1 ' combine bitplanes
if_z rev ppr_tilebuffer1
if_z movbyts ppr_tilebuffer1,#%%0123
' generate masks (transparent pixels high)
not ppr_tmp3,ppr_tilebuffer1
ror ppr_tilebuffer1,#16
andn ppr_tmp3,ppr_tilebuffer1
ror ppr_tilebuffer1,#8
andn ppr_tmp3,ppr_tilebuffer1
' maskify!
mergew ppr_tmp3
movbyts ppr_tilebuffer5,ppr_tmp3
movbyts ppr_tilebuffer3,ppr_tmp3
shr ppr_tmp3,#8
movbyts ppr_tilebuffer6,ppr_tmp3
movbyts ppr_tilebuffer4,ppr_tmp3
' write priority
setq #1
wmlong ppr_tilebuffer3,ptrb++
' fix tilebuffer (down here for slice alignment)
ror ppr_tilebuffer1,#8
' usual pixel conversion
mergeb ppr_tilebuffer1
movbyts ppr_tilebuffer1, #%%3120
mergew ppr_tilebuffer1
movbyts ppr_tilebuffer1, #%%3120
splitw ppr_tilebuffer1
mov ppr_tilebuffer2,ppr_tilebuffer1
shr ppr_tilebuffer1,#4
' add attributes to non-masked pixels
setq ppr_attrmask4
muxq ppr_tilebuffer1,ppr_tilebuffer5
muxq ppr_tilebuffer5,ppr_tilebuffer6
' write pixels
setq #1
wmlong ppr_tilebuffer1,ptra++
djz ppr_sprchkleft,#.draw_done ' line limit
.tile_next
sumz ppr_tilex,#1 ' advance tile (Z is still mirror bit)
djnf ppr_tileleft,#.tile
tjns ppr_spridx,#.sprite ' already decremented this earlier
jmp #.draw_done
' 50 alu ops + 2 WMLONG+1u + 1 branch
' ~122 per tile - 4148 total
' Buffer spacing is 264 bytes. There should be 11 ops between the WMLONGs (12 including SETQ)
' total time for both WMLONG together should then be 10..18 cycles (excluding SETQ)
.tile_clipped
cmp ptra,ppr_sprbufwtf wz ' is buffer - 256
if_nz add ptra,#8
if_nz add ptrb,#8
if_z djz ppr_sprchkleft,#.draw_done ' special case: X==-256 counts all tiles towards line limit
testb ppr_tile,#30 wz ' get mirror flag into Z again
jmp #.tile_next
Maybe I should fix the worst case for clipping. If 32 64x64 sprites are all clipped down to one tile, .tile_clipped runs 224 times.
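For anyone checking the numbers in the draft above, here is the arithmetic as a quick Python sketch. The constant names are mine; the 8-pixel tile width is an assumption (it matches the `add ptra,#8` in `.tile_clipped`), and the 21824-cycle line budget is the figure quoted elsewhere in this thread.

```python
CYCLES_PER_TILE = 122      # "~122 per tile" from the code comment
MAX_TILES_PER_LINE = 34    # "May run 34 times per line"
LINE_BUDGET = 21824        # cycles per line at 343 MHz, quoted in the thread

sprite_tile_cycles = CYCLES_PER_TILE * MAX_TILES_PER_LINE
share_of_line = sprite_tile_cycles / LINE_BUDGET

# Clipping worst case: 32 sprites, each 64 pixels wide, one tile visible.
SPRITES_PER_LINE = 32
TILE_WIDTH = 8                               # assumed tile width in pixels
tiles_per_sprite_row = 64 // TILE_WIDTH      # 8 tiles across a 64x64 sprite
clipped_runs = SPRITES_PER_LINE * (tiles_per_sprite_row - 1)

print(sprite_tile_cycles, round(share_of_line * 100), clipped_runs)
```

That gives 4148 cycles (roughly 19% of a line) for the visible tiles, and 224 trips through `.tile_clipped` in the worst case.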
@Wuerfel_21 said:
Experimental verification: 3 instructions needed (2 if you exclude the SETQ)
rdfast bit31,pa
long 0[7]
rflong buffer+0
rflong buffer+1
rflong buffer+2
rflong buffer+3
rflong buffer+4
long 0[2]
setq #1
wmlong buffer,pb
I think the last RFLONG could be replaced by RFWORD, which in theory should reduce FIFO burst from 24 to 23. If so, you'd need a RDLUT before SETQ+WMLONG to actually save a cycle.
After more thought, no need for RDLUT, just have one two-cycle instruction between RFWORD and SETQ. WMLONG will have a 1 in 8 chance of a stall and so will take 4-11 cycles, not 3-10, but overall one cycle saved if practice matches theory.
Hooked most of the pixel pipeline together. Just missing the actual layer rendering now.
Speed currently:
Cog2 ppc_time = 21_267
Cog3 ppm_time = 21_674
Remember, time available at 343 MHz is 21824. PPC jitters a little bit due to loads of hub access. I took the worst value.
This is under the worst conditions I can easily manifest - Both windows at different positions, BG1 fills screen and is blended with the fix color.
Can probably knock loose some 50 cycles in either's per-line init code if I wanted to. One thing currently not done that needs doing is to make delayed copies of certain registers and the palette RAM, so that raster effects are in-sync. Will need to hawk that into one of them. Need to finish out the layer stuff first.
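The headroom implied by those numbers, as a quick Python sketch. The measured times and the 21824 budget are from the post; the derivation of the budget as 1364 master clocks per scanline times a 16x sysclk multiplier is my assumption and is not stated in the thread.

```python
LINE_BUDGET = 21824      # cycles available per line at 343 MHz (from the post)
PPC_TIME = 21_267        # Cog2, worst observed value
PPM_TIME = 21_674        # Cog3

ppc_headroom = LINE_BUDGET - PPC_TIME
ppm_headroom = LINE_BUDGET - PPM_TIME

# Assumption: 1364 master clocks per scanline, sysclk at 16x the master clock.
assumed_budget = 1364 * 16

print(ppc_headroom, ppm_headroom, assumed_budget == LINE_BUDGET)
```

So PPC has 557 cycles to spare and PPM only 150, which is why shaving another 50 cycles off the per-line init still matters.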
@rogloh said:
Hopefully you won't find any last minute gotchas you may have overlooked that will blow this time out.
one can hope. Current shitpost status is that I have something really corrupted-looking rendering on one of the background layers, so something clearly is working to some extent.
Nice. Once you have some video data output as positive feedback I find it's usually enough of a motivation to continue and then it's only a matter of time until it all works... (I know you still have some CPU stuff to do).
Comments
FIFO is 19 longs deep so fast block write will be stalled for 24 cycles after loop ends. Inconsequential timing-wise but worth a brief mention.
Yea, might be better to do that after returning to whatever routine would call this. But then it'd have to be a plain RET.
Just wondering whether you could move this RDFAST to the start of the loop and REP 256 times, not 257.
It'd still have to be 257 loops because the calculations are interleaved, but moving it to the top would stop it from stalling at the end
(though maybe I'm not seeing something)
Okay, here's some more dumbops.
Remember the "reading two bitplanes at once" trick from earlier?
17..25 cycles
Turns out RDFAST abuse can be even faster! (given enough instructions to do while it fetches)
14 cycles flat. Brr brr ride the lightning.
EDITED.
So anyways,
@Wuerfel_21 said:
Hmm, it doesn't really benefit anything but the 2bpp function. The 4bpp one lacks unrelated instructions to fill the wait with and 8bpp can't really benefit because 13 RFLONGS take too long. So that solves the buffer dilemma. 2bpp is the critical path, anyways (well, haven't thought about what goes on in Mode 5 yet, which is 4bpp+2bpp hires)
I think the last RFLONG could be replaced by RFWORD, which in theory should reduce FIFO burst from 24 to 23. If so, you'd need a RDLUT before SETQ+WMLONG to actually save a cycle.
Then it would never cross into the next long, yea. Though in the actual code there end up being 14 ops between RFLONG and WMLONG
For reference, current sprite tile code draft. May run 34 times per line
Maybe I should fix the worst case for clipping. If 32 64x64 sprites are all clipped down to one tile, .tile_clipped runs 224 times.
After more thought, no need for RDLUT, just have one two-cycle instruction between RFWORD and SETQ. WMLONG will have a 1 in 8 chance of a stall and so will take 4-11 cycles, not 3-10, but overall one cycle saved if practice matches theory.
Making the clip worst case go away has caused these 7 horrid ops to appear in the per-sprite code (32 times per line).
(Right clip is only one op and a condition)
Hooked most of the pixel pipeline together. Just missing the actual layer rendering now.
Speed currently:
Remember, time available at 343 MHz is 21824. PPC jitters a little bit due to loads of hub access. I took the worst value.
This is under the worst conditions I can easily manifest - Both windows at different positions, BG1 fills screen and is blended with the fix color.
Can probably knock loose some 50 cycles in either's per-line init code if I wanted to. One thing currently not done that needs doing is to make delayed copies of certain registers and the palette RAM, so that raster effects are in-sync. Will need to hawk that into one of them. Need to finish out the layer stuff first.
Looks tight but within spec - nothing better than a hard constraint to help make it fit.
Hopefully you won't find any last minute gotchas you may have overlooked that will blow this time out.
one can hope. Current shitpost status is that I have something really corrupted-looking rendering on one of the background layers, so something clearly is working to some extent.
new shitpost status: got 3 layers drawing text, but for some reason BG4 is not getting through compositing.
(Also my left monitor died so now I'm stuck in 1280x1024 hell)
Ah, 'twas just a misaligned skip table. Happens to the best of us.
Nice. Once you have some video data output as positive feedback I find it's usually enough of a motivation to continue and then it's only a matter of time until it all works... (I know you still have some CPU stuff to do).
Is there any chance of using another cog for video to reduce sysclk?
It's already using 3 cogs, there isn't anywhere to go. And splitting the composite/math tasks to use multiple cogs would be good pain.