Comments
FIFO is 19 longs deep so fast block write will be stalled for 24 cycles after loop ends. Inconsequential timing-wise but worth a brief mention.
Yea, might be better to do that after returning to whatever routine would call this. But then it'd have to be a plain RET.
Just wondering whether you could move this RDFAST to the start of the loop and REP 256 times, not 257.
It'd still have to be 257 loops because the calculations are interleaved, but moving it to the top would stop it from stalling at the end
(though maybe I'm not seeing something)
So here's something funny I came up with to reload fresh longs of layer data and apply the mosaic effect.
Original bad idea, with everything in LUT:
    getbyte ppc_tmp1,ppc_bg1,#3 ' get last pixel of previous long
    getbyte ppc_tmp2,ppc_bg2,#3
    getbyte ppc_tmp3,ppc_bg3,#3
    getbyte ppc_tmp4,ppc_bg4,#3
    rdlut ppc_sprp,ptra[##384]
    rdlut ppc_spr,ptra[##320]
    rdlut ppc_bg4,ptra[##256]
    rdlut ppc_bg3,ptra[##192]
    rdlut ppc_bg2,ptra[##128]
    rdlut ppc_bg1,ptra[##64]
    rdlut ppc_tmp5,ptra++ wc ' mosaic table: MOVBYTS data + carryover flag
    skip ppc_mosaicmask
    if_c setbyte ppc_bg1,ppc_tmp1,#0 ' get replace first pixel
    movbyts ppc_bg1,ppc_tmp5
    if_c setbyte ppc_bg2,ppc_tmp2,#0
    movbyts ppc_bg2,ppc_tmp5
    if_c setbyte ppc_bg3,ppc_tmp3,#0
    movbyts ppc_bg3,ppc_tmp5
    if_c setbyte ppc_bg4,ppc_tmp4,#0
    movbyts ppc_bg4,ppc_tmp5
59 cycles, awful.
New cracked idea (layers now need to live in cog, also mosaic table needs to have different contents):
    rdlut ppc_tmp5,ptra++ wc ' mosaic table: MOVBYTS data + carryover flag
    if_nc setq ppc_n1 ' ##$FF_FF_FF_FF, replace all pixels
    if_c setq ppc_n256 ' ##$FF_FF_FF_00, keep lowest byte (last post-mosaic pixel in last quad)
    skipf ppc_mosaicmask
    alts ppc_bgptr,#ppc_bg1buffer
    mov ppc_bg1,0-0 ' BG1 mosaic OFF
    movbyts ppc_bg1,#%%0123 ' ^
    muxq ppc_bg1,0-0 ' BG1 mosaic ON
    movbyts ppc_bg1,ppc_tmp5 ' ^
    alts ppc_bgptr,#ppc_bg2buffer
    mov ppc_bg2,0-0 ' BG2 mosaic OFF
    movbyts ppc_bg2,#%%0123 ' ^
    muxq ppc_bg2,0-0 ' BG2 mosaic ON
    movbyts ppc_bg2,ppc_tmp5 ' ^
    alts ppc_bgptr,#ppc_bg3buffer
    mov ppc_bg3,0-0 ' BG3 mosaic OFF
    movbyts ppc_bg3,#%%0123 ' ^
    muxq ppc_bg3,0-0 ' BG3 mosaic ON
    movbyts ppc_bg3,ppc_tmp5 ' ^
    alts ppc_bgptr,ppc_bg4buf_increment ' increment bgptr
    mov ppc_bg4,0-0 ' BG4 mosaic OFF
    movbyts ppc_bg4,#%%0123 ' ^
    muxq ppc_bg4,0-0 ' BG4 mosaic ON
    movbyts ppc_bg4,ppc_tmp5 ' ^
    ' sprites can not be mosaic'd
    alti ppc_sprptr,#%100_111_000 ' must have ppc_spr in R field
    movbyts ppc_spr,#%%0123 '
    alti ppc_sprpptr,#%100_111_000 ' etc
    movbyts ppc_sprp,#%%0123
This is 41 cycles. The trick is that the order of the bytes is always reversed. So byte 0 ends up with the 4th pixel. And when MUXQ-ing the new long over it, it can replace the 1st new pixel. (If that wasn't clear, replacing the first pixel with the last pixel of the previous four is needed when a single mosaic super-pixel crosses the long boundary). Since further processing is unrolled, taking the endian flip into account is free.
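For anyone who wants to convince themselves that the reversed-byte trick works, here is a small Python model of the SETQ/MUXQ/MOVBYTS sequence. It is only a sketch: `movbyts` and `muxq` follow the documented P2 instruction behaviour, but `mosaic_entry` is a hypothetical way to build the table contents (the post does not show the real table), and pixels are assumed to be one byte each with byte 0 as the leftmost pixel of an incoming long.

```python
def movbyts(d: int, s: int) -> int:
    """Model of MOVBYTS D,S: result byte n is D.byte[(S >> 2n) & 3]."""
    out = 0
    for n in range(4):
        sel = (s >> (2 * n)) & 3
        out |= ((d >> (8 * sel)) & 0xFF) << (8 * n)
    return out


def muxq(d: int, s: int, q: int) -> int:
    """Model of SETQ + MUXQ D,S: copy the bits of S into D where Q is 1."""
    return (d & ~q & 0xFFFFFFFF) | (s & q)


def mosaic_entry(k: int, size: int) -> tuple[int, bool]:
    """Hypothetical table entry for long k: (MOVBYTS selector, carry flag)."""
    base = 4 * k
    carry = base % size != 0              # a super-pixel straddles the long boundary
    sel = 0
    for j in range(4):
        start = (base + j) - (base + j) % size     # first pixel of this super-pixel
        src = 0 if start < base else start - base  # byte 0 holds the carried pixel
        sel |= src << (2 * (3 - j))       # output pixel j lands in byte 3-j (reversed)
    return sel, carry


def apply_mosaic(pixels: list[int], size: int) -> list[int]:
    """Run the modelled sequence over a line of 8-bit pixels, four per long."""
    out, d = [], 0
    for k in range(len(pixels) // 4):
        new = int.from_bytes(bytes(pixels[4 * k:4 * k + 4]), "little")
        sel, carry = mosaic_entry(k, size)
        d = muxq(d, new, 0xFFFFFF00 if carry else 0xFFFFFFFF)
        d = movbyts(d, sel)
        # Output is byte-reversed: byte 3 is the first pixel, byte 0 the last.
        out += [(d >> 24) & 0xFF, (d >> 16) & 0xFF, (d >> 8) & 0xFF, d & 0xFF]
    return out
```

Because the last output pixel always sits in byte 0, keeping that byte with the `$FF_FF_FF_00` mask is all it takes to carry a super-pixel across the boundary.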
(the "SPRP" buffer has the two priority bits for sprites. There are 9 bits for sprites: 4 data, 3 palette, 2 priority. My idea is to store the priority into a separate buffer. Though that'll be somewhat painful to mask when rendering.)
Nice savings. If this is part of your critical path do these types of optimizations typically help reduce the P2 clock frequency a bit now, or are there other parts in the code that would still demand the 343MHz operation?
This all is to get everything to fit at 343 at all.
Currently looking at sprite drawing. Once again just a tiny bit too slow. Well, the other option would be to make that Mode 7 routine from earlier faster, since that's the slowest BG mode (I think - only prototyped Modes 0, 1, 2 and 7)
Well, I think some 470 cycles can in fact be knocked loose on that mode 7 routine by removing the SKIPF (save 514 cycles) that always skips exactly one instruction and instead loading one of three specialized routines depending on the overflow mode (should cost at most 42 cycles more - zero if there ends up being enough room in cog/lut).
Mode 0 (four layers) is now the slowest BG mode. EDIT: Oh, it always was, I just calculated it wrong.
EDIT 2: Mode 7 is slightly slower again if I block read entire rows of tiles (2*32 words) into cog RAM
Okay, here's some more dumbops.
Remember the "reading two bitplanes at once" trick from earlier?
    setq #4
    rdlong tilebuffer+0, tile
    setword tilebuffer+0,tilebuffer+4,#1
17..25 cycles
Turns out RDFAST abuse can be even faster! (given enough instructions to do while it fetches)
    rdfast bit31,tile
    ' imagine about 7 useful instructions here
    rflong tilebuffer+0
    rflong tilebuffer+1
    rflong tilebuffer+2
    rflong tilebuffer+3
    rflong tilebuffer+4
    setword tilebuffer+0,tilebuffer+4,#1
14 cycles flat. Brr brr ride the lightning.
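To make the data movement of the "two bitplanes at once" read concrete, here is a Python model of it. It is a sketch under one assumption not spelled out in the post: a 4bpp tile is 32 bytes, with the bitplane 0/1 byte pairs for rows 0-7 in the first 16 bytes and the bitplane 2/3 pairs in the next 16. `fetch_row`, `read_long` and `setword` are illustrative names, not anything from the actual code.

```python
def read_long(mem: bytes, addr: int) -> int:
    """Little-endian 32-bit read at any byte address (P2 hub reads need no alignment)."""
    return int.from_bytes(mem[addr:addr + 4], "little")


def setword(d: int, s: int, n: int) -> int:
    """Model of SETWORD D,S,#n: replace word n of D with the low word of S."""
    shift = 16 * n
    return (d & ~(0xFFFF << shift) & 0xFFFFFFFF) | ((s & 0xFFFF) << shift)


def fetch_row(vram: bytes, tile_addr: int, y: int) -> int:
    """Return row y of a tile as one long: bp0, bp1, bp2, bp3 from byte 0 upwards."""
    addr = tile_addr + 2 * y
    # Five consecutive longs, as SETQ #4 + RDLONG or five RFLONGs would deliver.
    longs = [read_long(vram, addr + 4 * i) for i in range(5)]
    # Long 4 starts 16 bytes later, so its low word is the bp2/bp3 pair of the same row.
    return setword(longs[0], longs[4], 1)
```

Longs 1 to 3 are read and thrown away, which is why the middle reads in the real code can be dummies.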
Now this is where I need to either build a test program tomorrow or ask @evanh (he seems to actually understand this): How many instructions need to go in between the last RFLONG and a WMLONG that I don't want to stall?
Six longs are read into the FIFO before RDFAST ends, leaving 13 more to read during subsequent instructions. A random hub write has a fixed pre-write time of two cycles, therefore six two-cycle instructions between RDFAST and WMLONG should be more than enough to prevent a stall.
EDIT:
Only true if there are no RFLONG's between RDFAST and WMLONG.
    WORST-CASE RDFAST
    slice   01234567012345670123456701234567
                      !!!!!!!!!!!!!!!!!!!     cycles
    RDFAST  aaawwwwwwwrbbbbb                  16

    ! = read from hub RAM and write to FIFO
Experimental verification: 3 instructions needed (2 if you exclude the SETQ)
    CON
    DAT
            org
    .loop
            getrnd pa
            getrnd pb
            waitx #15 wc ' randomize slice
            getct time1
            rdfast bit31,pa
            long 0[7]
            rflong buffer+0
            rflong buffer+1
            rflong buffer+2
            rflong buffer+3
            rflong buffer+4
            long 0[2]
            setq #1
            wmlong buffer,pb
            getct time2
            subr time1,time2
            fge maxtime,time1
            fle mintime,time1
            mov time2,maxtime
            sub time2,mintime
            debug(udec(time1,mintime,maxtime,time2))
            jmp #.loop

    bit31   long 1<<31
    mintime long posx
    maxtime long 0
    buffer  res 5
    time1   res 1
    time2   res 1
What I said earlier was wrong and I'll edit it later. Every RFLONG instruction that is executed while the FIFO is filling means another read from hub RAM and write to FIFO cycle will be needed, therefore the FIFO burst will be more than 19 cycles.
I think I need to make this optimization in the BG tile drawing functions, too. Would have to buffer the pixels into cog ram and burst write them. There already needs to be a 64 long buffer for mode 7, so that's ok. Not sure what to do about 8bpp and hires modes, those need twice the space. Hopefully it all just ends up fitting.
Timing diagram below agrees with your test. FIFO burst is 19 + 5 = 24 cycles. I do know what I'm talking about when I remember all the relevant details!
    WORST-CASE RDFAST
    slice   0123456701234567012345670123456701234
                      !!!!!!!!!!!!!!!!!!!!!!!!| cycles
    'RDFAST aaawwwwwwwrbbbbb                  | 16 wait
    RDFAST  xx                                |  2 no-wait
    NOP       xx                              |  2
    NOP         xx                            |  2
    NOP           xx                          |  2
    NOP             xx                        |  2
    NOP               xx                      |  2
    NOP                 xx                    |  2
    NOP                   xx                  |  2
    RFLONG                  xx                |  2
    RFLONG                    xx              |  2
    RFLONG                      xx            |  2
    RFLONG                        xx          |  2
    RFLONG                          xx        |  2
    NOP                               xx      |  2
    NOP                                 xx    |  2
    SETQ                                  xx  |  2
    WMLONG                                  aaw  3
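The arithmetic behind the diagram can be written down as a tiny Python model. It is a sketch built only from the figures quoted in this thread (19-long FIFO, 16-cycle worst-case waiting RDFAST with six longs loaded before it ends, two-cycle pre-write for WMLONG); it additionally assumes the fill burst starts on the same cycle for a no-wait RDFAST, and it ignores ordinary hub slice alignment. The function names are made up for illustration.

```python
FIFO_DEPTH = 19        # longs, as stated in the thread
RDFAST_WAIT = 16       # cycles taken by a worst-case waiting RDFAST
LONGS_BEFORE_END = 6   # longs already in the FIFO when that RDFAST ends
PREWRITE = 2           # fixed cycles before WMLONG's hub write

BURST_START = RDFAST_WAIT - LONGS_BEFORE_END   # cycle of the first FIFO load


def burst_end(bytes_consumed: int) -> int:
    """First cycle after the fill burst; each whole long consumed adds one load."""
    return BURST_START + FIFO_DEPTH + bytes_consumed // 4


def min_gap_ops(ops_before: int, bytes_consumed: int) -> int:
    """Fewest two-cycle ops (SETQ included) after the last read that avoid a stall."""
    t = 2 * ops_before            # cycle at which the last RFLONG has finished
    gap = 0
    while t + 2 * gap + PREWRITE < burst_end(bytes_consumed):
        gap += 1
    return gap
```

With the test program's layout (no-wait RDFAST, 7 filler ops, 5 RFLONGs, so 13 two-cycle ops and 20 bytes consumed) the model gives a 24-cycle burst and `min_gap_ops(13, 20)` returns 3, which lines up with the measured result of two ops plus SETQ.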
EDITED.
So anyways,
> @Wuerfel_21 said:
> I think I need to make this optimization in the BG tile drawing functions, too.
Hmm, it doesn't really benefit anything but the 2bpp function. The 4bpp one lacks unrelated instructions to fill the wait with and 8bpp can't really benefit because 13 RFLONGS take too long. So that solves the buffer dilemma. 2bpp is the critical path, anyways (well, haven't thought about what goes on in Mode 5 yet, which is 4bpp+2bpp hires)
I think the last RFLONG could be replaced by RFWORD, which in theory should reduce FIFO burst from 24 to 23. If so, you'd need a RDLUT before SETQ+WMLONG to actually save a cycle.
Then it would never cross into the next long, yea. Though in the actual code there end up being 14 ops between RFLONG and WMLONG
For reference, current sprite tile code draft. May run 34 times per line
    .tile
            cmp ptra,ppr_sprbufleft wc ' is buffer - 7
      if_nc cmpr ptra,ppr_sprbufright wc ' is buffer + 255
      if_c  jmp #.tile_clipped

            ' Tile address is something like
            ' %BBBY_YYYX_XXX0_yyy0
            ' B is base for sprite sheet (from OBJSEL and second sheet flag)
            ' Y is sprite sheet Y position
            ' X is sprite sheet X position
            ' y is bottom 3 bits of pixel Y position
            mov ppr_tileid,ppr_tilebase ' already should have VRAM base, sheet base and sheet Y in it
            rolnib ppr_tileid,ppr_tilex,#0
            shl ppr_tileid,#1
            rolnib ppr_tileid,ppr_tile,#2 ' (Y & 3) << 1
            rdfast ppr_bit31, ppr_tileid

            ' get palette (do down here to save 28 MOVs and fill RDFAST delay slot)
            getnib ppr_tilebuffer5,ppr_tile,#6
            shl ppr_tilebuffer5,#3
            and ppr_tilebuffer5,#%0111_0000
            mov ppr_tilebuffer6,ppr_tilebuffer5

            ' get priority
            getbyte ppr_tilebuffer3,ppr_tile,#3
            and ppr_tilebuffer3,#%0011_0000
            mov ppr_tilebuffer4,ppr_tilebuffer3

            rflong ppr_tilebuffer1
            rflong ppr_tilebuffer2 ' dummy read
            rflong ppr_tilebuffer3 ' dummy read
            rflong ppr_tilebuffer4 ' dummy read
            rflong ppr_tilebuffer5

            setword ppr_tilebuffer1,ppr_tilebuffer5,#1 ' combine bitplanes
      if_z  rev ppr_tilebuffer1
      if_z  movbyts ppr_tilebuffer1,#%%0123

            ' generate masks (transparent pixels high)
            not ppr_tmp3,ppr_tilebuffer1
            ror ppr_tilebuffer1,#16
            andn ppr_tmp3,ppr_tilebuffer1
            ror ppr_tilebuffer1,#8
            andn ppr_tmp3,ppr_tilebuffer1

            ' maskify!
            mergew ppr_tmp3
            movbyts ppr_tilebuffer5,ppr_tmp3
            movbyts ppr_tilebuffer3,ppr_tmp3
            shr ppr_tmp3,#8
            movbyts ppr_tilebuffer6,ppr_tmp3
            movbyts ppr_tilebuffer4,ppr_tmp3

            ' write priority
            setq #1
            wmlong ppr_tilebuffer3,ptrb++

            ' fix tilebuffer (down here for slice alignment)
            ror ppr_tilebuffer1,#8

            ' usual pixel conversion
            mergeb ppr_tilebuffer1
            movbyts ppr_tilebuffer1, #%%3120
            mergew ppr_tilebuffer1
            movbyts ppr_tilebuffer1, #%%3120
            splitw ppr_tilebuffer1
            mov ppr_tilebuffer2,ppr_tilebuffer1
            shr ppr_tilebuffer1,#4

            ' add attributes to non-masked pixels
            setq ppr_attrmask4
            muxq ppr_tilebuffer1,ppr_tilebuffer5
            muxq ppr_tilebuffer5,ppr_tilebuffer6

            ' write pixels
            setq #1
            wmlong ppr_tilebuffer1,ptra++

            djz ppr_sprchkleft,#.draw_done ' line limit
    .tile_next
            sumz ppr_tilex,#1 ' advance tile (Z is still mirror bit)
            djnf ppr_tileleft,#.tile
            tjns ppr_spridx,#.sprite ' already decremented this earlier
            jmp #.draw_done

            ' 50 alu ops + 2 WMLONG+1u + 1 branch
            ' ~122 per tile - 4148 total
            ' Buffer spacing is 264 bytes. There should be 11 ops between the WMLONGs (12 including SETQ)
            ' total time for both WMLONG together should then be 10..18 cycles (excluding SETQ)

    .tile_clipped
            cmp ptra,ppr_sprbufwtf wz ' is buffer - 256
      if_nz add ptra,#8
      if_nz add ptrb,#8
      if_z  djz ppr_sprchkleft,#.draw_done ' special case: X==-256 counts all tiles towards line limit
            testb ppr_tile,#30 wz ' get mirror flag into Z again
            jmp #.tile_next
maybe should fix the worst case for clipping. If 32 64x64 sprites are all clipped down to one tile, .tile_clipped runs 224 times.
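The two worst-case counts quoted here (4148 cycles for drawn tiles, 224 passes through `.tile_clipped`) fall out of simple arithmetic. The Python below just restates it; the constants are the ones from the post and the code comments, and the function names are only for illustration.

```python
TILE_CYCLES = 122        # "~122 per tile" estimate from the code comments
MAX_TILES_PER_LINE = 34  # "May run 34 times per line"
SPRITES_PER_LINE = 32    # sprites considered in the clipping worst case
TILE_WIDTH = 8           # pixels per tile


def drawn_tile_cycles() -> int:
    """Worst-case cycles spent in .tile when every allowed tile is drawn."""
    return TILE_CYCLES * MAX_TILES_PER_LINE


def clipped_tile_passes(sprite_width: int = 64) -> int:
    """Passes through .tile_clipped when each sprite shows only one tile."""
    tiles_per_row = sprite_width // TILE_WIDTH
    return SPRITES_PER_LINE * (tiles_per_row - 1)
```

So every 64-pixel-wide sprite contributes seven clipped tiles that cost time without drawing anything, which is the case the later left-clip logic removes.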
After more thought, no need for RDLUT, just have one two-cycle instruction between RFWORD and SETQ. WMLONG will have a 1 in 8 chance of a stall and so will take 4-11 cycles, not 3-10, but overall one cycle saved if practice matches theory.
    rdfast bit31,pa
    long 0[7]
    rflong buffer+0
    rflong buffer+1
    rflong buffer+2
    rflong buffer+3
    rfword buffer+4
    long 0[1]
    setq #1
    wmlong buffer,pb '4-11 cycles
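Here is a rough Python model of why the RFWORD variant should come out one cycle ahead in theory. It is a sketch with several assumptions: the burst starts at cycle 10 (16-cycle worst-case RDFAST minus the six longs loaded before it ends), each whole long consumed from the FIFO adds one load cycle, WMLONG has two pre-write cycles plus one write cycle, and only FIFO-induced stalls are counted (ordinary hub slice alignment is ignored). `wmlong_done` is a made-up helper name.

```python
def wmlong_done(bytes_consumed: int, gap_ops: int, ops_before: int = 13) -> tuple[int, int]:
    """Return (cycle at which WMLONG finishes, cycles WMLONG itself took)."""
    burst_end = (16 - 6) + 19 + bytes_consumed // 4   # first cycle the hub is free
    start = 2 * (ops_before + gap_ops)                # cycle WMLONG is issued
    write = max(start + 2, burst_end)                 # the write waits out the burst
    return write + 1, write + 1 - start


# Five RFLONGs (20 bytes), then two filler ops + SETQ before WMLONG.
rflong_case = wmlong_done(bytes_consumed=20, gap_ops=3)

# Four RFLONGs + RFWORD (18 bytes), then one filler op + SETQ before WMLONG.
rfword_case = wmlong_done(bytes_consumed=18, gap_ops=2)
```

In this model the RFLONG version finishes at cycle 35 with a 3-cycle WMLONG, and the RFWORD version finishes at cycle 34 with a 4-cycle WMLONG, matching the "one cycle saved" estimate.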
Making the clip worst case go away has caused these 7 horrid ops to appear in the per-sprite code (32 times per line).
            signx ptra,#8 wc ' PTRA has sprite X
            ' left clip logic
      if_c  neg ppr_tmp1,ptra
      if_c  shr ppr_tmp1,#3
      if_c  sub ppr_tileleft,ppr_tmp1
      if_c  tjs ppr_tileleft,#.sprite_clipped ' no tiles left
      if_c  sumz ppr_tilex,ppr_tmp1
      if_c  shl ppr_tmp1,#3
      if_c  add ptra,ppr_tmp1
(Right clip is only one op and a condition)
            cmp ptra,ppr_sprbufright wc ' is buffer + 256
      if_c  djnf ppr_tileleft,#.tile ' next tile unless clipping into right edge
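The left-clip sequence is easier to follow as plain arithmetic, so here is a Python transliteration of it. It is a sketch that assumes the sprite X is a 9-bit signed value, that `tiles_left` holds the tile count minus one (as the DJNF loop counter suggests), and that the mirror flag makes the tile index count downwards; PTRA is treated as a bare X coordinate rather than a buffer pointer.

```python
def signx9(v: int) -> int:
    """Sign-extend from bit 8, like SIGNX D,#8."""
    v &= 0x1FF
    return v - 0x200 if v & 0x100 else v


def left_clip(x: int, tiles_left: int, tile_x: int, mirrored: bool):
    """Skip tiles that are entirely off the left edge.

    Returns (x, tiles_left, tile_x) for the first tile worth drawing,
    or None when the whole sprite row is clipped.
    """
    x = signx9(x)
    if x < 0:                                   # C from SIGNX ... WC
        skip = (-x) >> 3                        # whole 8-pixel tiles left of X = 0
        tiles_left -= skip
        if tiles_left < 0:                      # TJS: no tiles left
            return None
        tile_x += -skip if mirrored else skip   # SUMZ, Z = mirror flag
        x += skip << 3
    return x, tiles_left, tile_x
```

For a 64-pixel sprite at X = -60, `left_clip(-60 & 0x1FF, 7, 0, False)` skips seven tiles and leaves one to draw at X = -4, so the per-tile loop no longer has to walk over the invisible ones.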
Hooked most of the pixel pipeline together. Just missing the actual layer rendering now.
Speed currently:
    Cog2 ppc_time = 21_267
    Cog3 ppm_time = 21_674
Remember, time available at 343 MHz is 21824. PPC jitters a little bit due to loads of hub access. I took the worst value.
This under worst conditions I can easily manifest - Both windows at different positions, BG1 fills screen and is blended with the fix color.
Can probably knock loose some 50 cycles in either's per-line init code if I wanted to. One thing currently not done that needs doing is to make delayed copies of certain registers and the palette RAM, so that raster effects are in-sync. Will need to hawk that into one of them. Need to finish out the layer stuff first.
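For a sense of how tight this is, the headroom implied by the timings quoted above can be worked out directly. The snippet below is plain arithmetic on those numbers; the dictionary and function names are only for illustration.

```python
LINE_BUDGET = 21_824   # cycles available per line at 343 MHz, as quoted

timings = {"ppc_time": 21_267, "ppm_time": 21_674}   # worst observed values


def headroom(cycles: int, budget: int = LINE_BUDGET) -> int:
    """Cycles left over in one line."""
    return budget - cycles


spare = {name: headroom(t) for name, t in timings.items()}
```

That leaves 557 spare cycles for PPC and 150 for PPM, so the roughly 50 cycles that could be shaved from per-line init would matter most on the PPM side.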
Looks tight but within spec - nothing better than a hard constraint to help make it fit.
Hopefully you won't find any last minute gotchas you may have overlooked that will blow this time out.
one can hope. Current shitpost status is that I have something really corrupted-looking rendering on one of the background layers, so something clearly is working to some extent.
new shitpost status: got 3 layers drawing text, but for some reason BG4 is not getting through compositing.
(Also my left monitor died so now I'm stuck in 1280x1024 hell)
Ah, 'twas just a misaligned skip table. Happens to the best of us.
Nice. Once you have some video data output as positive feedback I find it's usually enough of a motivation to continue and then it's only a matter of time until it all works... (I know you still have some CPU stuff to do).
Is there any chance of using another cog for video to reduce sysclk?
It's already using 3 cogs, there isn't anywhere to go. And splitting the composite/math tasks to use multiple cogs would be good pain.