Console Emulation

TonyB_ · 2024-03-28 15:33

@Wuerfel_21 said:

              rdfast ppr_bit31,ppr_tmp1 ' REQUEST PIXEL
.loop
              setq #63
        _ret_ wrlong ppr_tilebuffer1,ptrb

FIFO is 19 longs deep so fast block write will be stalled for 24 cycles after loop ends. Inconsequential timing-wise but worth a brief mention.

Wuerfel_21 · 2024-03-28 16:14

@TonyB_ said:
@Wuerfel_21 said:
              rdfast ppr_bit31,ppr_tmp1 ' REQUEST PIXEL
.loop
              setq #63
        _ret_ wrlong ppr_tilebuffer1,ptrb
FIFO is 19 longs deep so fast block write will be stalled for 24 cycles after loop ends. Inconsequential timing-wise but worth a brief mention.

Yea, might be better to do that after returning to whatever routine would call this. But then it'd have to be a plain RET.

TonyB_ · 2024-03-28 16:49

@Wuerfel_21 said:
@TonyB_ said:
@Wuerfel_21 said:
              rdfast ppr_bit31,ppr_tmp1 ' REQUEST PIXEL
.loop
              setq #63
        _ret_ wrlong ppr_tilebuffer1,ptrb
FIFO is 19 longs deep so fast block write will be stalled for 24 cycles after loop ends. Inconsequential timing-wise but worth a brief mention.
Yea, might be better to do that after returning to whatever routine would call this. But then it'd have to be a plain RET.

Just wondering whether you could move this RDFAST to the start of the loop and REP 256 times, not 257.

Wuerfel_21 · 2024-03-28 16:50

It'd still have to be 257 loops because the calculations are interleaved, but moving it to the top would stop it from stalling at the end.

(though maybe I'm not seeing something)

Wuerfel_21 · 2024-03-29 15:49

So here's something funny I came up with to reload fresh longs of layer data and apply the mosaic effect.

Original bad idea, with everything in LUT:

              getbyte ppc_tmp1,ppc_bg1,#3 ' get last pixel of previous long
              getbyte ppc_tmp2,ppc_bg2,#3
              getbyte ppc_tmp3,ppc_bg3,#3
              getbyte ppc_tmp4,ppc_bg4,#3
              rdlut ppc_sprp,ptra[##384]
              rdlut ppc_spr,ptra[##320]
              rdlut ppc_bg4,ptra[##256]
              rdlut ppc_bg3,ptra[##192]
              rdlut ppc_bg2,ptra[##128]
              rdlut ppc_bg1,ptra[##64]
              rdlut ppc_tmp5,ptra++ wc ' mosaic table: MOVBYTS data + carryover flag
              skip ppc_mosaicmask
        if_c  setbyte ppc_bg1,ppc_tmp1,#0  ' replace first pixel
              movbyts ppc_bg1,ppc_tmp5
        if_c  setbyte ppc_bg2,ppc_tmp2,#0
              movbyts ppc_bg2,ppc_tmp5
        if_c  setbyte ppc_bg3,ppc_tmp3,#0
              movbyts ppc_bg3,ppc_tmp5
        if_c  setbyte ppc_bg4,ppc_tmp4,#0
              movbyts ppc_bg4,ppc_tmp5

59 cycles, awful.

New cracked idea (layers now need to live in cog, also mosaic table needs to have different contents):

              rdlut ppc_tmp5,ptra++ wc ' mosaic table: MOVBYTS data + carryover flag
        if_nc setq ppc_n1   ' ##$FF_FF_FF_FF, replace all pixels
        if_c  setq ppc_n256 ' ##$FF_FF_FF_00, keep lowest byte  (last post-mosaic pixel in last quad)
              skipf ppc_mosaicmask
              alts ppc_bgptr,#ppc_bg1buffer
              mov ppc_bg1,0-0  ' BG1 mosaic OFF
              movbyts ppc_bg1,#%%0123  ' ^
              muxq ppc_bg1,0-0 ' BG1 mosaic ON
              movbyts ppc_bg1,ppc_tmp5 ' ^

              alts ppc_bgptr,#ppc_bg2buffer
              mov ppc_bg2,0-0  ' BG2 mosaic OFF
              movbyts ppc_bg2,#%%0123  ' ^
              muxq ppc_bg2,0-0 ' BG2 mosaic ON
              movbyts ppc_bg2,ppc_tmp5 ' ^

              alts ppc_bgptr,#ppc_bg3buffer
              mov ppc_bg3,0-0  ' BG3 mosaic OFF
              movbyts ppc_bg3,#%%0123  ' ^
              muxq ppc_bg3,0-0 ' BG3 mosaic ON
              movbyts ppc_bg3,ppc_tmp5 ' ^

              alts ppc_bgptr,ppc_bg4buf_increment ' increment bgptr
              mov ppc_bg4,0-0  ' BG4 mosaic OFF
              movbyts ppc_bg4,#%%0123  ' ^
              muxq ppc_bg4,0-0 ' BG4 mosaic ON
              movbyts ppc_bg4,ppc_tmp5 ' ^

              ' sprites can not be mosaic'd
              alti ppc_sprptr,#%100_111_000 ' must have ppc_spr in R field
              movbyts ppc_spr,#%%0123 '
              alti ppc_sprpptr,#%100_111_000 ' etc
              movbyts ppc_sprp,#%%0123

This is 41 cycles. The trick is that the order of the bytes is always reversed. So byte 0 ends up with the 4th pixel. And when MUXQ-ing the new long over it, it can replace the 1st new pixel. (If that wasn't clear, replacing the first pixel with the last pixel of the previous four is needed when a single mosaic super-pixel crosses the long boundary). Since further processing is unrolled, taking the endian flip into account is free.
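To make the byte-order trick concrete, here is a small Python model of the two instructions involved. Treat it as a sketch only: `movbyts` and `muxq` follow the documented P2 behaviour (MOVBYTS picks each result byte through a 2-bit field of the selector, MUXQ copies S bits into D wherever Q is 1), while the pixel values and the helper `reload_quad` are made up for illustration. The real mosaic table entries are not shown in this post, so the plain reversal selector stands in for them.

```python
MASK32 = 0xFFFFFFFF
REVERSE = 0b00_01_10_11      # %%0123 in PASM2 base-4 notation
KEEP_BYTE0 = 0xFFFFFF00      # Q value loaded when the carryover flag is set
REPLACE_ALL = 0xFFFFFFFF     # Q value when there is no carryover

def movbyts(d, sel):
    """Model of MOVBYTS: result byte i is d's byte number sel[2i+1:2i]."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out

def muxq(d, s, q):
    """Model of MUXQ: take bits from s where q is 1, keep d elsewhere."""
    return ((d & ~q) | (s & q)) & MASK32

def reload_quad(prev_reversed, fresh, carry, sel=REVERSE):
    """Hypothetical helper: merge a fresh long over the previous (already
    byte-reversed) one, then apply the MOVBYTS selector."""
    q = KEEP_BYTE0 if carry else REPLACE_ALL
    return movbyts(muxq(prev_reversed, fresh, q), sel)

# Previous quad held pixels 0x11,0x22,0x33,0x44; after reversal the last
# pixel (0x44) sits in byte 0.
prev = movbyts(0x44332211, REVERSE)
fresh = 0x88776655           # new pixels 0x55,0x66,0x77,0x88

with_carry = reload_quad(prev, fresh, carry=True)
without_carry = reload_quad(prev, fresh, carry=False)
```

With the carry flag set, the byte that would have held the first new pixel (0x55) holds 0x44 instead, i.e. the last pixel of the previous quad has been carried across the long boundary.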

(the "SPRP" buffer has the two priority bits for sprites. There are 9 bits for sprites: 4 data, 3 palette, 2 priority. My idea is to store the priority into a separate buffer. Though that'll be somewhat painful to mask when rendering.)

rogloh · 2024-03-30 00:27

Nice savings. If this is part of your critical path do these types of optimizations typically help reduce the P2 clock frequency a bit now, or are there other parts in the code that would still demand the 343MHz operation?

Wuerfel_21 · 2024-03-30 00:31

@rogloh said:
Nice savings. If this is part of your critical path do these types of optimizations typically help reduce the P2 clock frequency a bit now, or are there other parts in the code that would still demand the 343MHz operation?

This is all just to get everything to fit at 343 MHz in the first place.

Currently looking at sprite drawing. Once again just a tiny bit too slow. Well, the other option would be to make that Mode 7 routine from earlier faster, since that's the slowest BG mode (I think - only prototyped Modes 0, 1, 2 and 7)

Wuerfel_21 · 2024-03-30 01:32

Well, I think some 470 cycles can in fact be knocked loose on that Mode 7 routine by removing the SKIPF that always skips exactly one instruction (saves 514 cycles) and instead loading one of three specialized routines depending on the overflow mode (should cost at most 42 cycles more, or zero if there ends up being enough room in cog/LUT).

Mode 0 (four layers) is now the slowest BG mode. EDIT: Oh, it always was, I just calculated it wrong.

EDIT 2: Mode 7 is slightly slower again if I block read entire rows of tiles (2*32 words) into cog RAM

Wuerfel_21 · 2024-03-30 02:28

Okay, here's some more dumbops.

Remember the "reading two bitplanes at once" trick from earlier?

setq #4
rdlong tilebuffer+0, tile
setword tilebuffer+0,tilebuffer+4,#1

17..25 cycles

Turns out RDFAST abuse can be even faster! (given enough instructions to do while it fetches)

rdfast bit31,tile
' imagine about 7 useful instructions here
rflong tilebuffer+0
rflong tilebuffer+1
rflong tilebuffer+2
rflong tilebuffer+3
rflong tilebuffer+4
setword tilebuffer+0,tilebuffer+4,#1

14 cycles flat. Brr brr ride the lightning.

Now this is where I need to either build a test program tomorrow or ask @evanh (he seems to actually understand this): How many instructions need to go in between the last RFLONG and a WMLONG that I don't want to stall?

TonyB_ · 2024-03-30 12:35

@Wuerfel_21 said:
Now this is where I need to either build a test program tomorrow or ask @evanh (he seems to actually understand this): How many instructions need to go in between the last RFLONG and a WMLONG that I don't want to stall?

Six longs are read into the FIFO before RDFAST ends, leaving 13 more to read during subsequent instructions. A random hub write has a fixed pre-write time of two cycles, therefore six two-cycle instructions between RDFAST and WMLONG should be more than enough to prevent a stall.

EDIT:
Only true if there are no RFLONG's between RDFAST and WMLONG.

TonyB_ · 2024-03-30 12:51

WORST-CASE RDFAST

slice      01234567012345670123456701234567
                   !!!!!!!!!!!!!!!!!!!     cycles
RDFAST   aaawwwwwwwrbbbbb                  16

! = read from hub RAM and write to FIFO

Wuerfel_21 · 2024-03-30 13:01

Experimental verification: 3 instructions needed (2 if you exclude the SETQ)

CON

DAT
              org
.loop
              getrnd pa
              getrnd pb
              waitx #15 wc ' randomize slice
              getct time1
              rdfast bit31,pa
              long 0[7]
              rflong buffer+0
              rflong buffer+1
              rflong buffer+2
              rflong buffer+3
              rflong buffer+4
              long 0[2]
              setq #1
              wmlong buffer,pb
              getct time2
              subr time1,time2
              fge maxtime,time1
              fle mintime,time1
              mov time2,maxtime
              sub time2,mintime
              debug(udec(time1,mintime,maxtime,time2))
              jmp #.loop

bit31         long 1<<31

mintime       long posx
maxtime       long 0

buffer        res 5

time1         res 1
time2         res 1

TonyB_ · 2024-03-30 13:21

What I said earlier was wrong and I'll edit it later. Every RFLONG instruction that is executed while the FIFO is filling means another read from hub RAM and write to FIFO cycle will be needed, therefore the FIFO burst will be more than 19 cycles.

Wuerfel_21 · 2024-03-30 13:36

I think I need to make this optimization in the BG tile drawing functions, too. Would have to buffer the pixels into cog ram and burst write them. There already needs to be a 64 long buffer for mode 7, so that's ok. Not sure what to do about 8bpp and hires modes, those need twice the space. Hopefully it all just ends up fitting.

TonyB_ · 2024-03-30 13:48

@Wuerfel_21 said:
Experimental verification: 3 instructions needed (2 if you exclude the SETQ)

              rdfast bit31,pa
              long 0[7]
              rflong buffer+0
              rflong buffer+1
              rflong buffer+2
              rflong buffer+3
              rflong buffer+4
              long 0[2]
              setq #1
              wmlong buffer,pb

Timing diagram below agrees with your test. FIFO burst is 19 + 5 = 24 cycles. I do know what I'm talking about when I remember all the relevant details!

WORST-CASE RDFAST

slice      0123456701234567012345670123456701234
                   !!!!!!!!!!!!!!!!!!!!!!!!|    cycles
'RDFAST  aaawwwwwwwrbbbbb                  |    16 wait
RDFAST   xx                                |     2 no-wait
NOP        xx                              |     2
NOP          xx                            |     2
NOP            xx                          |     2
NOP              xx                        |     2
NOP                xx                      |     2
NOP                  xx                    |     2
NOP                    xx                  |     2
RFLONG                   xx                |     2
RFLONG                     xx              |     2
RFLONG                       xx            |     2
RFLONG                         xx          |     2
RFLONG                           xx        |     2
NOP                                xx      |     2
NOP                                  xx    |     2
SETQ                                   xx  |     2
WMLONG                                   aaw     3

EDITED.
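The arithmetic behind the diagram can be sketched in a few lines of Python. This is only a toy reading of the diagram above, not a real model of the hub: it assumes every instruction before WMLONG takes exactly two cycles, that the burst starts at cycle 10 after RDFAST begins (the worst-case slot shown), that WMLONG has the two-cycle pre-write time mentioned earlier, and that all RFLONGs execute while the FIFO is still filling. The function names are made up for the example.

```python
def wmlong_write_cycle(instrs_before_wmlong, prewrite=2):
    """Cycle (counted from the start of RDFAST) at which WMLONG wants the hub,
    assuming every earlier instruction takes exactly two cycles."""
    return 2 * instrs_before_wmlong + prewrite

def fifo_burst_end(rflongs_during_fill, burst_start=10, fifo_depth=19):
    """First cycle after the FIFO fill burst for the worst-case start slot;
    each RFLONG executed during the fill adds one more hub read."""
    return burst_start + fifo_depth + rflongs_during_fill

def stalls(fillers_after_rflong, nops_before_rflong=7, rflongs=5):
    """True if WMLONG would reach the hub before the burst has finished."""
    # RDFAST + leading NOPs + RFLONGs + trailing fillers + SETQ
    instrs = 1 + nops_before_rflong + rflongs + fillers_after_rflong + 1
    return wmlong_write_cycle(instrs) < fifo_burst_end(rflongs)

for fillers in range(4):
    print(fillers, "filler(s):", "stall" if stalls(fillers) else "ok")
```

Under those assumptions two fillers plus the SETQ are the minimum, which is the same answer as the experiment (3 instructions, 2 if you exclude the SETQ).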

Wuerfel_21 · 2024-03-30 13:57

So anyways,

@Wuerfel_21 said:

I think I need to make this optimization in the BG tile drawing functions, too.

Hmm, it doesn't really benefit anything but the 2bpp function. The 4bpp one lacks unrelated instructions to fill the wait with and 8bpp can't really benefit because 13 RFLONGS take too long. So that solves the buffer dilemma. 2bpp is the critical path, anyways (well, haven't thought about what goes on in Mode 5 yet, which is 4bpp+2bpp hires)

TonyB_ · 2024-03-30 14:13

@Wuerfel_21 said:
Experimental verification: 3 instructions needed (2 if you exclude the SETQ)

              rdfast bit31,pa
              long 0[7]
              rflong buffer+0
              rflong buffer+1
              rflong buffer+2
              rflong buffer+3
              rflong buffer+4
              long 0[2]
              setq #1
              wmlong buffer,pb

I think the last RFLONG could be replaced by RFWORD, which in theory should reduce FIFO burst from 24 to 23. If so, you'd need a RDLUT before SETQ+WMLONG to actually save a cycle.

Wuerfel_21 · 2024-03-30 14:20

Then it would never cross into the next long, yea. Though in the actual code there end up being 14 ops between RFLONG and WMLONG

Wuerfel_21 · 2024-03-30 14:37

For reference, current sprite tile code draft. May run 34 times per line

.tile
              cmp ptra,ppr_sprbufleft wc ' is buffer - 7
        if_nc cmpr ptra,ppr_sprbufright wc ' is buffer + 255
        if_c  jmp #.tile_clipped
              ' Tile address is something like
              ' %BBBY_YYYX_XXX0_yyy0
              ' B is base for sprite sheet (from OBJSEL and second sheet flag)
              ' Y is sprite sheet Y position
              ' X is sprite sheet X position
              ' y is bottom 3 bits of pixel Y position
              mov ppr_tileid,ppr_tilebase ' already should have VRAM base, sheet base and sheet Y in it
              rolnib ppr_tileid,ppr_tilex,#0
              shl ppr_tileid,#1
              rolnib ppr_tileid,ppr_tile,#2 ' (Y & 3) << 1

              rdfast ppr_bit31, ppr_tileid
              ' get palette (do down here to save 28 MOVs and fill RDFAST delay slot)
              getnib ppr_tilebuffer5,ppr_tile,#6
              shl ppr_tilebuffer5,#3
              and ppr_tilebuffer5,#%0111_0000
              mov ppr_tilebuffer6,ppr_tilebuffer5
              ' get priority
              getbyte ppr_tilebuffer3,ppr_tile,#3
              and ppr_tilebuffer3,#%0011_0000
              mov ppr_tilebuffer4,ppr_tilebuffer3

              rflong ppr_tilebuffer1
              rflong ppr_tilebuffer2 ' dummy read
              rflong ppr_tilebuffer3 ' dummy read
              rflong ppr_tilebuffer4 ' dummy read
              rflong ppr_tilebuffer5
              setword ppr_tilebuffer1,ppr_tilebuffer5,#1 ' combine bitplanes

        if_z  rev ppr_tilebuffer1
        if_z  movbyts ppr_tilebuffer1,#%%0123
              ' generate masks (transparent pixels high)
              not ppr_tmp3,ppr_tilebuffer1
              ror ppr_tilebuffer1,#16
              andn ppr_tmp3,ppr_tilebuffer1
              ror ppr_tilebuffer1,#8
              andn ppr_tmp3,ppr_tilebuffer1
              ' maskify!
              mergew ppr_tmp3 
              movbyts ppr_tilebuffer5,ppr_tmp3
              movbyts ppr_tilebuffer3,ppr_tmp3
              shr ppr_tmp3,#8
              movbyts ppr_tilebuffer6,ppr_tmp3
              movbyts ppr_tilebuffer4,ppr_tmp3
              ' write priority
              setq #1
              wmlong ppr_tilebuffer3,ptrb++
              ' fix tilebuffer (down here for slice alignment)
              ror ppr_tilebuffer1,#8
              ' usual pixel conversion
              mergeb ppr_tilebuffer1
              movbyts ppr_tilebuffer1, #%%3120
              mergew ppr_tilebuffer1
              movbyts ppr_tilebuffer1, #%%3120
              splitw ppr_tilebuffer1
              mov ppr_tilebuffer2,ppr_tilebuffer1
              shr ppr_tilebuffer1,#4
              ' add attributes to non-masked pixels
              setq ppr_attrmask4
              muxq ppr_tilebuffer1,ppr_tilebuffer5
              muxq ppr_tilebuffer5,ppr_tilebuffer6
              ' write pixels
              setq #1
              wmlong ppr_tilebuffer1,ptra++
              djz ppr_sprchkleft,#.draw_done ' line limit
.tile_next
              sumz ppr_tilex,#1 ' advance tile (Z is still mirror bit)
              djnf ppr_tileleft,#.tile
              tjns ppr_spridx,#.sprite ' already decremented this earlier
              jmp #.draw_done
              ' 50 alu ops + 2 WMLONG+1u + 1 branch
              ' ~122 per tile - 4148 total
              ' Buffer spacing is 264 bytes. There should be 11 ops between the WMLONGs (12 including SETQ)
              ' total time for both WMLONG together should then be 10..18 cycles (excluding SETQ)

.tile_clipped
              cmp ptra,ppr_sprbufwtf wz ' is buffer - 256
        if_nz add ptra,#8
        if_nz add ptrb,#8
        if_z  djz ppr_sprchkleft,#.draw_done ' special case: X==-256 counts all tiles towards line limit
              testb ppr_tile,#30 wz ' get mirror flag into Z again
              jmp #.tile_next

Maybe I should fix the worst case for clipping. If 32 64x64 sprites are all clipped down to one tile, .tile_clipped runs 224 times.

TonyB_ · 2024-03-30 14:47

@TonyB_ said:
@Wuerfel_21 said:
Experimental verification: 3 instructions needed (2 if you exclude the SETQ)
              rdfast bit31,pa
              long 0[7]
              rflong buffer+0
              rflong buffer+1
              rflong buffer+2
              rflong buffer+3
              rflong buffer+4
              long 0[2]
              setq #1
              wmlong buffer,pb
I think the last RFLONG could be replaced by RFWORD, which in theory should reduce FIFO burst from 24 to 23. If so, you'd need a RDLUT before SETQ+WMLONG to actually save a cycle.

After more thought, no need for RDLUT, just have one two-cycle instruction between RFWORD and SETQ. WMLONG will have a 1 in 8 chance of a stall and so will take 4-11 cycles, not 3-10, but overall one cycle saved if practice matches theory.

              rdfast bit31,pa
              long 0[7]
              rflong buffer+0
              rflong buffer+1
              rflong buffer+2
              rflong buffer+3
              rfword buffer+4
              long 0[1]
              setq #1
              wmlong buffer,pb   '4-11 cycles

Wuerfel_21 · 2024-03-30 19:03

Making the clip worst case go away has caused these 7 horrid ops to appear in the per-sprite code (32 times per line).

              signx ptra,#8 wc ' PTRA has sprite X
              ' left clip logic
        if_c  neg ppr_tmp1,ptra
        if_c  shr ppr_tmp1,#3
        if_c  sub ppr_tileleft,ppr_tmp1
        if_c  tjs ppr_tileleft,#.sprite_clipped ' no tiles left
        if_c  sumz ppr_tilex,ppr_tmp1
        if_c  shl ppr_tmp1,#3
        if_c  add ptra,ppr_tmp1

(Right clip is only one op and a condition)

              cmp ptra,ppr_sprbufright wc ' is buffer + 256
        if_c  djnf ppr_tileleft,#.tile ' next tile unless clipping into right edge
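For anyone following along without a P2 handy, the left clip logic can be approximated in Python. This is a rough sketch under assumptions, not the actual routine: `sign_extend` mirrors SIGNX with a top bit of 8, the mirror handling follows the "Z is still mirror bit" comment (SUMZ subtracts when Z is set), `tiles_left` is treated as the raw counter the PASM keeps, and nibble wrap-around of the tile X index is ignored. The function and parameter names are invented for the example.

```python
def sign_extend(value, top_bit):
    """Model of SIGNX: sign-extend `value` from bit `top_bit`."""
    value &= (1 << (top_bit + 1)) - 1
    if value & (1 << top_bit):
        value -= 1 << (top_bit + 1)
    return value

def left_clip(x9, tiles_left, tile_x, mirrored):
    """Skip whole tiles that lie left of the screen.

    Returns (x, tiles_left, tile_x) after clipping, or None when the
    counter goes negative (the .sprite_clipped case)."""
    x = sign_extend(x9, 8)
    if x >= 0:                       # C clear: nothing to clip
        return x, tiles_left, tile_x
    skipped = (-x) >> 3              # NEG + SHR #3: whole tiles off screen
    tiles_left -= skipped
    if tiles_left < 0:               # TJS: no tiles left
        return None
    tile_x += -skipped if mirrored else skipped   # SUMZ
    x += skipped << 3                # SHL #3 + ADD: move X right by skipped tiles
    return x, tiles_left, tile_x

print(left_clip(0x1F0, 7, 0, mirrored=False))   # X = -16
print(left_clip(0x1F7, 7, 7, mirrored=True))    # X = -9
print(left_clip(0x100, 7, 0, mirrored=False))   # X = -256
```

Note that a sprite at X = -9 ends up at X = -1 rather than 0: only whole tiles are skipped, and the partially visible tile is still drawn into the margin on the left of the buffer.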

Wuerfel_21 · 2024-03-31 17:34

Hooked most of the pixel pipeline together. Just missing the actual layer rendering now.

Speed currently:

Cog2  ppc_time = 21_267
Cog3  ppm_time = 21_674

Remember, time available at 343 MHz is 21824. PPC jitters a little bit due to loads of hub access. I took the worst value.
This is under the worst conditions I can easily manifest: both windows at different positions, BG1 fills the screen and is blended with the fixed color.
Can probably knock loose some 50 cycles in either's per-line init code if I wanted to. One thing currently not done that needs doing is to make delayed copies of certain registers and the palette RAM, so that raster effects are in-sync. Will need to hawk that into one of them. Need to finish out the layer stuff first.
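To put the numbers above in perspective, a quick Python check of the remaining headroom. The budget and the two measurements are taken straight from this post (the budget is presumably per scanline); the helper name `headroom` is just for the example.

```python
LINE_BUDGET = 21_824   # cycles available at 343 MHz, as quoted in the post

measured = {"ppc_time": 21_267, "ppm_time": 21_674}

def headroom(cycles, budget=LINE_BUDGET):
    """Return (spare cycles, spare fraction of the budget)."""
    spare = budget - cycles
    return spare, spare / budget

report = {name: headroom(t) for name, t in measured.items()}
for name, (spare, frac) in report.items():
    print(f"{name}: {spare} cycles spare ({frac:.2%})")
```

So the tighter of the two cogs has 150 cycles left, well under 1% of the budget, before the delayed register copies get added.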

rogloh · 2024-04-01 00:48

@Wuerfel_21 said:
Speed currently:
Cog2  ppc_time = 21_267
Cog3  ppm_time = 21_674
Remember, time available at 343 MHz is 21824. PPC jitters a little bit due to loads of hub access. I took the worst value.

Looks tight but within spec - nothing better than a hard constraint to help make it fit.

Hopefully you won't find any last minute gotchas you may have overlooked that will blow this time out.

Wuerfel_21 · 2024-04-01 02:09

@rogloh said:
Hopefully you won't find any last minute gotchas you may have overlooked that will blow this time out.

one can hope. Current shitpost status is that I have something really corrupted-looking rendering on one of the background layers, so something clearly is working to some extent.

Wuerfel_21 · 2024-04-01 03:02

new shitpost status: got 3 layers drawing text, but for some reason BG4 is not getting through compositing.

Wuerfel_21 · 2024-04-01 03:07

(Also my left monitor died so now I'm stuck in 1280x1024 hell)

Wuerfel_21 · 2024-04-01 03:41

Ah, 'twas just a misaligned skip table. Happens to the best of us.

rogloh · 2024-04-01 04:11

Nice. Once you have some video data output as positive feedback I find it's usually enough of a motivation to continue and then it's only a matter of time until it all works... (I know you still have some CPU stuff to do).

TonyB_ · 2024-04-01 11:38

Is there any chance of using another cog for video to reduce sysclk?

Wuerfel_21 · 2024-04-01 11:49

It's already using 3 cogs, there isn't anywhere to go. And splitting the composite/math tasks to use multiple cogs would be good pain.
