Console Emulation - Page 49 — Parallax Forums

Console Emulation


Comments

  • TonyB_TonyB_ Posts: 2,193
    edited 2024-03-28 15:39

    @Wuerfel_21 said:

                  rdfast ppr_bit31,ppr_tmp1 ' REQUEST PIXEL
    .loop
                  setq #63
            _ret_ wrlong ppr_tilebuffer1,ptrb
    

    FIFO is 19 longs deep so fast block write will be stalled for 24 cycles after loop ends. Inconsequential timing-wise but worth a brief mention.

  • Wuerfel_21Wuerfel_21 Posts: 5,105
    edited 2024-03-28 16:14

    @TonyB_ said:

    FIFO is 19 longs deep so fast block write will be stalled for 24 cycles after loop ends. Inconsequential timing-wise but worth a brief mention.

    Yea, might be better to do that after returning to whatever routine would call this. But then it'd have to be a plain RET.

  • TonyB_TonyB_ Posts: 2,193
    edited 2024-03-28 16:49

    @Wuerfel_21 said:

    Yea, might be better to do that after returning to whatever routine would call this. But then it'd have to be a plain RET.

    Just wondering whether you could move this RDFAST to the start of the loop and REP 256 times, not 257.

  • Wuerfel_21Wuerfel_21 Posts: 5,105
    edited 2024-03-28 16:53

    It'd still have to be 257 loops because the calculations are interleaved, but moving it to the top would stop it from stalling at the end

    (though maybe I'm not seeing something)

  • So here's something funny I came up with to reload fresh longs of layer data and apply the mosaic effect.

    Original bad idea, with everything in LUT:

                  getbyte ppc_tmp1,ppc_bg1,#3 ' get last pixel of previous long
                  getbyte ppc_tmp2,ppc_bg2,#3
                  getbyte ppc_tmp3,ppc_bg3,#3
                  getbyte ppc_tmp4,ppc_bg4,#3
                  rdlut ppc_sprp,ptra[##384]
                  rdlut ppc_spr,ptra[##320]
                  rdlut ppc_bg4,ptra[##256]
                  rdlut ppc_bg3,ptra[##192]
                  rdlut ppc_bg2,ptra[##128]
                  rdlut ppc_bg1,ptra[##64]
                  rdlut ppc_tmp5,ptra++ wc ' mosaic table: MOVBYTS data + carryover flag
                  skip ppc_mosaicmask
            if_c  setbyte ppc_bg1,ppc_tmp1,#0  ' replace first pixel with carried-over one
                  movbyts ppc_bg1,ppc_tmp5
            if_c  setbyte ppc_bg2,ppc_tmp2,#0
                  movbyts ppc_bg2,ppc_tmp5
            if_c  setbyte ppc_bg3,ppc_tmp3,#0
                  movbyts ppc_bg3,ppc_tmp5
            if_c  setbyte ppc_bg4,ppc_tmp4,#0
                  movbyts ppc_bg4,ppc_tmp5
    

    59 cycles, awful.

    New cracked idea (layers now need to live in cog, also mosaic table needs to have different contents):

                  rdlut ppc_tmp5,ptra++ wc ' mosaic table: MOVBYTS data + carryover flag
            if_nc setq ppc_n1   ' ##$FF_FF_FF_FF, replace all pixels
            if_c  setq ppc_n256 ' ##$FF_FF_FF_00, keep lowest byte  (last post-mosaic pixel in last quad)
                  skipf ppc_mosaicmask
                  alts ppc_bgptr,#ppc_bg1buffer
                  mov ppc_bg1,0-0  ' BG1 mosaic OFF
                  movbyts ppc_bg1,#%%0123  ' ^
                  muxq ppc_bg1,0-0 ' BG1 mosaic ON
                  movbyts ppc_bg1,ppc_tmp5 ' ^
    
                  alts ppc_bgptr,#ppc_bg2buffer
                  mov ppc_bg2,0-0  ' BG2 mosaic OFF
                  movbyts ppc_bg2,#%%0123  ' ^
                  muxq ppc_bg2,0-0 ' BG2 mosaic ON
                  movbyts ppc_bg2,ppc_tmp5 ' ^
    
                  alts ppc_bgptr,#ppc_bg3buffer
                  mov ppc_bg3,0-0  ' BG3 mosaic OFF
                  movbyts ppc_bg3,#%%0123  ' ^
                  muxq ppc_bg3,0-0 ' BG3 mosaic ON
                  movbyts ppc_bg3,ppc_tmp5 ' ^
    
                  alts ppc_bgptr,ppc_bg4buf_increment ' increment bgptr
                  mov ppc_bg4,0-0  ' BG4 mosaic OFF
                  movbyts ppc_bg4,#%%0123  ' ^
                  muxq ppc_bg4,0-0 ' BG4 mosaic ON
                  movbyts ppc_bg4,ppc_tmp5 ' ^
    
                  ' sprites can not be mosaic'd
                  alti ppc_sprptr,#%100_111_000 ' must have ppc_spr in R field
                  movbyts ppc_spr,#%%0123 '
                  alti ppc_sprpptr,#%100_111_000 ' etc
                  movbyts ppc_sprp,#%%0123
    

    This is 41 cycles. The trick is that the order of the bytes is always reversed. So byte 0 ends up with the 4th pixel. And when MUXQ-ing the new long over it, it can replace the 1st new pixel. (If that wasn't clear, replacing the first pixel with the last pixel of the previous four is needed when a single mosaic super-pixel crosses the long boundary). Since further processing is unrolled, taking the endian flip into account is free.
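To make the byte-order trick easier to follow, here is a small Python model of it. This is only a sketch: `movbyts` and `muxq` are hand-written stand-ins for the P2 instructions, and the pixel values and the `table_entry` selector are made up for illustration (the real mosaic table contents are not shown in the post).

```python
MASK32 = 0xFFFFFFFF

def movbyts(d, s):
    """Model of MOVBYTS D,S: result byte i is the byte of D selected by S[2i+1:2i]."""
    out = 0
    for i in range(4):
        sel = (s >> (2 * i)) & 3
        out |= ((d >> (8 * sel)) & 0xFF) << (8 * i)
    return out

def muxq(d, s, q):
    """Model of SETQ + MUXQ D,S: take S where Q has 1 bits, keep D elsewhere."""
    return ((d & ~q) | (s & q)) & MASK32

REVERSE = 0b00_01_10_11            # same bit pattern as the #%%0123 literal

# Mosaic OFF path: MOV + MOVBYTS #%%0123 simply flips the byte order.
fresh = 0x44332211                 # hypothetical pixels $11,$22,$33,$44 in bytes 0..3
mosaic_off = movbyts(fresh, REVERSE)

# Mosaic ON path with the carryover flag set:
prev = 0x0B0C0DA3                  # previous output long; byte 0 holds its LAST pixel ($A3)
merged = muxq(prev, fresh, 0xFFFFFF00)   # keep byte 0, replace bytes 1..3
table_entry = 0b00_01_01_01        # hypothetical entry: carried pixel, then new pixel 1 three times
mosaic_on = movbyts(merged, table_entry)

print(hex(mosaic_off), hex(merged), hex(mosaic_on))
```

Because the output is always byte-reversed, the last displayed pixel sits in byte 0, which is exactly the byte the `$FF_FF_FF_00` mask leaves alone on the next reload.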

    (the "SPRP" buffer has the two priority bits for sprites. There are 9 bits for sprites: 4 data, 3 palette, 2 priority. My idea is to store the priority into a separate buffer. Though that'll be somewhat painful to mask when rendering.)
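The 59 and 41 cycle totals can be reproduced with a rough tally. The per-instruction costs below are assumptions on my part (2 cycles for ordinary and cancelled instructions, 3 for RDLUT, 2 for the AUGS prefix implied by each `##` index, and nothing for instructions that SKIPF jumps over in cog exec), not something stated in the posts.

```python
ALU = 2     # assumed: ordinary instruction, also one cancelled by SKIP or a failed condition
RDLUT = 3   # assumed cost of one LUT read
AUGS = 2    # assumed: prefix generated by each ## index

def lut_version():
    getbytes = 4 * ALU
    indexed_reads = 6 * (AUGS + RDLUT)   # the six ptra[##n] reads
    table_read = RDLUT                   # rdlut ppc_tmp5,ptra++
    skip = ALU
    fixups = 8 * ALU                     # SKIP only cancels, so all eight slots still cost time
    return getbytes + indexed_reads + table_read + skip + fixups

def cog_version():
    table_read = RDLUT
    setqs = 2 * ALU                      # both conditional SETQs occupy a slot
    skipf = ALU
    layers = 4 * 3 * ALU                 # per layer: ALTS plus the two instructions SKIPF lets through
    sprites = 4 * ALU                    # two ALTI + MOVBYTS pairs
    return table_read + setqs + skipf + layers + sprites

print(lut_version(), cog_version())
```

Under those assumptions the saving comes mostly from dropping the six augmented LUT reads and from SKIPF not spending time on the skipped half of each layer block.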

  • roglohrogloh Posts: 5,837
    edited 2024-03-30 00:28

    Nice savings. If this is part of your critical path do these types of optimizations typically help reduce the P2 clock frequency a bit now, or are there other parts in the code that would still demand the 343MHz operation?

  • @rogloh said:
    Nice savings. If this is part of your critical path do these types of optimizations typically help reduce the P2 clock frequency a bit now, or are there other parts in the code that would still demand the 343MHz operation?

    This is all just to get everything to fit at 343 MHz in the first place.

    Currently looking at sprite drawing. Once again just a tiny bit too slow. Well, the other option would be to make that Mode 7 routine from earlier faster, since that's the slowest BG mode (I think - only prototyped Modes 0, 1, 2 and 7)

  • Wuerfel_21Wuerfel_21 Posts: 5,105
    edited 2024-03-30 01:57

    Well, I think some 470 cycles can in fact be knocked loose on that Mode 7 routine by removing the SKIPF that always skips exactly one instruction (saves 514 cycles) and instead loading one of three specialized routines depending on the overflow mode (should cost at most 42 cycles more, or zero if there ends up being enough room in cog/LUT).

    Mode 0 (four layers) is now the slowest BG mode. EDIT: Oh, it always was, I just calculated it wrong.

    EDIT 2: Mode 7 is slightly slower again if I block read entire rows of tiles (2*32 words) into cog RAM
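A quick check of the "some 470 cycles" estimate. This assumes the 514 figure is one 2-cycle SKIPF per iteration of a 257-iteration loop (the loop count mentioned earlier in the thread); that reading is a guess, the post only gives the totals.

```python
ITERATIONS = 257       # assumption: the 257-pass loop discussed earlier
SKIPF_CYCLES = 2       # assumption: cost of the SKIPF itself on each pass
RELOAD_COST_MAX = 42   # worst-case cost of loading a specialized routine (from the post)

saved = ITERATIONS * SKIPF_CYCLES
net_saving = saved - RELOAD_COST_MAX
print(saved, net_saving)
```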

  • Wuerfel_21Wuerfel_21 Posts: 5,105
    edited 2024-03-30 02:28

    Okay, here's some more dumbops.

    Remember the "reading two bitplanes at once" trick from earlier?

    setq #4
    rdlong tilebuffer+0, tile
    setword tilebuffer+0,tilebuffer+4,#1
    

    17..25 cycles

    Turns out RDFAST abuse can be even faster! (given enough instructions to do while it fetches)

    rdfast bit31,tile
    ' imagine about 7 useful instructions here
    rflong tilebuffer+0
    rflong tilebuffer+1
    rflong tilebuffer+2
    rflong tilebuffer+3
    rflong tilebuffer+4
    setword tilebuffer+0,tilebuffer+4,#1
    

    14 cycles flat. Brr brr ride the lightning.

    Now this is where I need to either build a test program tomorrow or ask @evanh (he seems to actually understand this): How many instructions need to go in between the last RFLONG and a WMLONG that I don't want to stall?
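For anyone who missed the earlier trick, this is roughly what the final SETWORD does in both versions, modelled in Python. The buffer contents are invented; the point is only that the low word of the fifth long lands in the high word of the first one, so one long ends up holding both bitplane pairs.

```python
def setword(d, s, n):
    """Model of SETWORD D,S,#n: copy S[15:0] into word n of D."""
    shift = 16 * n
    cleared = d & ~(0xFFFF << shift) & 0xFFFFFFFF
    return cleared | ((s & 0xFFFF) << shift)

# Hypothetical result of the 5-long read: only the low words of
# longs 0 and 4 matter, everything else is a dummy read.
tilebuffer = [0x9999BBAA, 0x11111111, 0x22222222, 0x33333333, 0x8888DDCC]

tilebuffer[0] = setword(tilebuffer[0], tilebuffer[4], 1)
print(hex(tilebuffer[0]))
```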

  • TonyB_TonyB_ Posts: 2,193
    edited 2024-03-30 13:35

    @Wuerfel_21 said:
    Now this is where I need to either build a test program tomorrow or ask @evanh (he seems to actually understand this): How many instructions need to go in between the last RFLONG and a WMLONG that I don't want to stall?

    Six longs are read into the FIFO before RDFAST ends, leaving 13 more to read during subsequent instructions. A random hub write has a fixed pre-write time of two cycles, therefore six two-cycle instructions between RDFAST and WMLONG should be more than enough to prevent a stall.

    EDIT:
    Only true if there are no RFLONGs between RDFAST and WMLONG.

  • TonyB_TonyB_ Posts: 2,193
    WORST-CASE RDFAST
    
    slice      01234567012345670123456701234567
                       !!!!!!!!!!!!!!!!!!!     cycles
    RDFAST   aaawwwwwwwrbbbbb                  16
    
    ! = read from hub RAM and write to FIFO
    
  • Wuerfel_21Wuerfel_21 Posts: 5,105
    edited 2024-03-30 13:02

    Experimental verification: 3 instructions needed (2 if you exclude the SETQ)

    CON
    
    DAT
                  org
    .loop
                  getrnd pa
                  getrnd pb
                  waitx #15 wc ' randomize slice
                  getct time1
                  rdfast bit31,pa
                  long 0[7]
                  rflong buffer+0
                  rflong buffer+1
                  rflong buffer+2
                  rflong buffer+3
                  rflong buffer+4
                  long 0[2]
                  setq #1
                  wmlong buffer,pb
                  getct time2
                  subr time1,time2
                  fge maxtime,time1
                  fle mintime,time1
                  mov time2,maxtime
                  sub time2,mintime
                  debug(udec(time1,mintime,maxtime,time2))
                  jmp #.loop
    
    bit31         long 1<<31
    
    mintime       long posx
    maxtime       long 0
    
    buffer        res 5
    
    time1         res 1
    time2         res 1
    
  • TonyB_TonyB_ Posts: 2,193
    edited 2024-03-30 13:22

    What I said earlier was wrong and I'll edit it later. Every RFLONG instruction that is executed while the FIFO is filling means another read from hub RAM and write to FIFO cycle will be needed, therefore the FIFO burst will be more than 19 cycles.

  • I think I need to make this optimization in the BG tile drawing functions, too. Would have to buffer the pixels into cog ram and burst write them. There already needs to be a 64 long buffer for mode 7, so that's ok. Not sure what to do about 8bpp and hires modes, those need twice the space. Hopefully it all just ends up fitting.

  • TonyB_TonyB_ Posts: 2,193
    edited 2024-03-30 13:53

    @Wuerfel_21 said:
    Experimental verification: 3 instructions needed (2 if you exclude the SETQ)

                  rdfast bit31,pa
                  long 0[7]
                  rflong buffer+0
                  rflong buffer+1
                  rflong buffer+2
                  rflong buffer+3
                  rflong buffer+4
                  long 0[2]
                  setq #1
                  wmlong buffer,pb
    

    Timing diagram below agrees with your test. FIFO burst is 19 + 5 = 24 cycles. I do know what I'm talking about when I remember all the relevant details!

    WORST-CASE RDFAST
    
    slice      0123456701234567012345670123456701234
                       !!!!!!!!!!!!!!!!!!!!!!!!|    cycles
    'RDFAST  aaawwwwwwwrbbbbb                  |    16 wait
    RDFAST   xx                                |     2 no-wait
    NOP        xx                              |     2
    NOP          xx                            |     2
    NOP            xx                          |     2
    NOP              xx                        |     2
    NOP                xx                      |     2
    NOP                  xx                    |     2
    NOP                    xx                  |     2
    RFLONG                   xx                |     2
    RFLONG                     xx              |     2
    RFLONG                       xx            |     2
    RFLONG                         xx          |     2
    RFLONG                           xx        |     2
    NOP                                xx      |     2
    NOP                                  xx    |     2
    SETQ                                   xx  |     2
    WMLONG                                   aaw     3
    

    EDITED.
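The diagram can be boiled down to a little arithmetic. The sketch below is an interpretation, not a hardware model: it assumes the worst-case burst starts 10 cycles after RDFAST begins (read off the diagram), that every instruction except WMLONG takes 2 cycles, and that the hub write lands on WMLONG's third cycle. The function names are made up.

```python
BURST_START = 10   # assumed: cycles from start of RDFAST to first FIFO load, worst case
FIFO_DEPTH = 19    # longs, per the thread

def burst_length(rflongs_during_fill):
    """Each RFLONG executed while the FIFO is filling adds one more hub read."""
    return FIFO_DEPTH + rflongs_during_fill

def filler_cycles_needed(burst, nops_before=7, rflongs=5):
    """Cycles of filler required between the last RFLONG and SETQ to avoid a stall."""
    busy = 2 + 2 * nops_before + 2 * rflongs   # no-wait RDFAST, NOPs, RFLONGs
    burst_end = BURST_START + burst            # first cycle after the burst
    write_cycle = busy + 2 + 2                 # SETQ, then WMLONG's two pre-write cycles
    return max(0, burst_end - write_cycle)

print(filler_cycles_needed(burst_length(5)))   # five RFLONGs
print(filler_cycles_needed(burst_length(4)))   # last read shortened so it never pulls a new long
```

Four filler cycles corresponds to the two instructions found in the test program (three counting SETQ).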

  • Wuerfel_21Wuerfel_21 Posts: 5,105
    edited 2024-03-30 13:58

    So anyways,

    @Wuerfel_21 said:

    I think I need to make this optimization in the BG tile drawing functions, too.

    Hmm, it doesn't really benefit anything but the 2bpp function. The 4bpp one lacks unrelated instructions to fill the wait with and 8bpp can't really benefit because 13 RFLONGS take too long. So that solves the buffer dilemma. 2bpp is the critical path, anyways (well, haven't thought about what goes on in Mode 5 yet, which is 4bpp+2bpp hires)

  • TonyB_TonyB_ Posts: 2,193
    edited 2024-03-30 14:18

    @Wuerfel_21 said:
    Experimental verification: 3 instructions needed (2 if you exclude the SETQ)

                  rdfast bit31,pa
                  long 0[7]
                  rflong buffer+0
                  rflong buffer+1
                  rflong buffer+2
                  rflong buffer+3
                  rflong buffer+4
                  long 0[2]
                  setq #1
                  wmlong buffer,pb
    

    I think the last RFLONG could be replaced by RFWORD, which in theory should reduce FIFO burst from 24 to 23. If so, you'd need a RDLUT before SETQ+WMLONG to actually save a cycle.

  • Then it would never cross into the next long, yea. Though in the actual code there end up being 14 ops between RFLONG and WMLONG

  • For reference, current sprite tile code draft. May run 34 times per line

    .tile
                  cmp ptra,ppr_sprbufleft wc ' is buffer - 7
            if_nc cmpr ptra,ppr_sprbufright wc ' is buffer + 255
            if_c  jmp #.tile_clipped
                  ' Tile address is something like
                  ' %BBBY_YYYX_XXX0_yyy0
                  ' B is base for sprite sheet (from OBJSEL and second sheet flag)
                  ' Y is sprite sheet Y position
                  ' X is sprite sheet X position
                  ' y is bottom 3 bits of pixel Y position
                  mov ppr_tileid,ppr_tilebase ' already should have VRAM base, sheet base and sheet Y in it
                  rolnib ppr_tileid,ppr_tilex,#0
                  shl ppr_tileid,#1
                  rolnib ppr_tileid,ppr_tile,#2 ' (Y & 3) << 1
    
                  rdfast ppr_bit31, ppr_tileid
                  ' get palette (do down here to save 28 MOVs and fill RDFAST delay slot)
                  getnib ppr_tilebuffer5,ppr_tile,#6
                  shl ppr_tilebuffer5,#3
                  and ppr_tilebuffer5,#%0111_0000
                  mov ppr_tilebuffer6,ppr_tilebuffer5
                  ' get priority
                  getbyte ppr_tilebuffer3,ppr_tile,#3
                  and ppr_tilebuffer3,#%0011_0000
                  mov ppr_tilebuffer4,ppr_tilebuffer3
    
                  rflong ppr_tilebuffer1
                  rflong ppr_tilebuffer2 ' dummy read
                  rflong ppr_tilebuffer3 ' dummy read
                  rflong ppr_tilebuffer4 ' dummy read
                  rflong ppr_tilebuffer5
                  setword ppr_tilebuffer1,ppr_tilebuffer5,#1 ' combine bitplanes
    
            if_z  rev ppr_tilebuffer1
            if_z  movbyts ppr_tilebuffer1,#%%0123
                  ' generate masks (transparent pixels high)
                  not ppr_tmp3,ppr_tilebuffer1
                  ror ppr_tilebuffer1,#16
                  andn ppr_tmp3,ppr_tilebuffer1
                  ror ppr_tilebuffer1,#8
                  andn ppr_tmp3,ppr_tilebuffer1
                  ' maskify!
                  mergew ppr_tmp3 
                  movbyts ppr_tilebuffer5,ppr_tmp3
                  movbyts ppr_tilebuffer3,ppr_tmp3
                  shr ppr_tmp3,#8
                  movbyts ppr_tilebuffer6,ppr_tmp3
                  movbyts ppr_tilebuffer4,ppr_tmp3
                  ' write priority
                  setq #1
                  wmlong ppr_tilebuffer3,ptrb++
                  ' fix tilebuffer (down here for slice alignment)
                  ror ppr_tilebuffer1,#8
                  ' usual pixel conversion
                  mergeb ppr_tilebuffer1
                  movbyts ppr_tilebuffer1, #%%3120
                  mergew ppr_tilebuffer1
                  movbyts ppr_tilebuffer1, #%%3120
                  splitw ppr_tilebuffer1
                  mov ppr_tilebuffer2,ppr_tilebuffer1
                  shr ppr_tilebuffer1,#4
                  ' add attributes to non-masked pixels
                  setq ppr_attrmask4
                  muxq ppr_tilebuffer1,ppr_tilebuffer5
                  muxq ppr_tilebuffer5,ppr_tilebuffer6
                  ' write pixels
                  setq #1
                  wmlong ppr_tilebuffer1,ptra++
                  djz ppr_sprchkleft,#.draw_done ' line limit
    .tile_next
                  sumz ppr_tilex,#1 ' advance tile (Z is still mirror bit)
                  djnf ppr_tileleft,#.tile
                  tjns ppr_spridx,#.sprite ' already decremented this earlier
                  jmp #.draw_done
                  ' 50 alu ops + 2 WMLONG+1u + 1 branch
                  ' ~122 per tile - 4148 total
                  ' Buffer spacing is 264 bytes. There should be 11 ops between the WMLONGs (12 including SETQ)
                  ' total time for both WMLONG together should then be 10..18 cycles (excluding SETQ)
    
    .tile_clipped
                  cmp ptra,ppr_sprbufwtf wz ' is buffer - 256
            if_nz add ptra,#8
            if_nz add ptrb,#8
            if_z  djz ppr_sprchkleft,#.draw_done ' special case: X==-256 counts all tiles towards line limit
                  testb ppr_tile,#30 wz ' get mirror flag into Z again
                  jmp #.tile_next
    

    Maybe I should fix the worst case for clipping. If 32 64x64 sprites are all clipped down to one tile, .tile_clipped runs 224 times.
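The numbers in the code comments and the clipping worst case work out as follows. The 8-pixel tile width and the per-line budget of 21824 cycles are taken from elsewhere in the thread; the variable names are just for illustration.

```python
CYCLES_PER_TILE = 122   # "~122 per tile" from the listing
TILES_PER_LINE = 34     # "May run 34 times per line"
LINE_BUDGET = 21824     # cycles per line at 343 MHz, as quoted in the thread

sprite_tile_cycles = CYCLES_PER_TILE * TILES_PER_LINE
share_percent = round(100 * sprite_tile_cycles / LINE_BUDGET)

# Clipping worst case: each 64-pixel-wide sprite row is 8 tiles,
# and all but one of them take the .tile_clipped path.
SPRITES = 32
tiles_per_row = 64 // 8
clipped_runs = SPRITES * (tiles_per_row - 1)

print(sprite_tile_cycles, share_percent, clipped_runs)
```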

  • TonyB_TonyB_ Posts: 2,193
    edited 2024-03-30 14:51

    @TonyB_ said:

    I think the last RFLONG could be replaced by RFWORD, which in theory should reduce FIFO burst from 24 to 23. If so, you'd need a RDLUT before SETQ+WMLONG to actually save a cycle.

    After more thought, no need for RDLUT, just have one two-cycle instruction between RFWORD and SETQ. WMLONG will have a 1 in 8 chance of a stall and so will take 4-11 cycles, not 3-10, but overall one cycle saved if practice matches theory.

                  rdfast bit31,pa
                  long 0[7]
                  rflong buffer+0
                  rflong buffer+1
                  rflong buffer+2
                  rflong buffer+3
                  rfword buffer+4
                  long 0[1]
                  setq #1
                  wmlong buffer,pb   '4-11 cycles
    
  • Wuerfel_21Wuerfel_21 Posts: 5,105
    edited 2024-03-30 19:04

    Making the clip worst case go away has caused these 7 horrid ops to appear in the per-sprite code (32 times per line).

                  signx ptra,#8 wc ' PTRA has sprite X
                  ' left clip logic
            if_c  neg ppr_tmp1,ptra
            if_c  shr ppr_tmp1,#3
            if_c  sub ppr_tileleft,ppr_tmp1
            if_c  tjs ppr_tileleft,#.sprite_clipped ' no tiles left
            if_c  sumz ppr_tilex,ppr_tmp1
            if_c  shl ppr_tmp1,#3
            if_c  add ptra,ppr_tmp1
    

    (Right clip is only one op and a condition)

                  cmp ptra,ppr_sprbufright wc ' is buffer + 256
            if_c  djnf ppr_tileleft,#.tile ' next tile unless clipping into right edge
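Here is how I read the left-clip sequence, written out as a Python sketch. It is a guess at the intent rather than a transcription: `left_clip` and its parameters are hypothetical, `tiles_left` is assumed to be "tiles remaining minus one" (to match DJNF), and the mirror flag stands in for Z as used by SUMZ.

```python
def signx(value, bit):
    """Model of SIGNX D,#bit: sign-extend from the given bit position."""
    width = bit + 1
    value &= (1 << width) - 1
    return value - (1 << width) if value & (1 << bit) else value

def left_clip(raw_x, tiles_left, tile_x, mirrored):
    """Return (x, tiles_left, tile_x) after skipping off-screen tiles, or None if nothing is left."""
    x = signx(raw_x, 8)                 # 9-bit sprite X, C set when negative
    if x >= 0:
        return x, tiles_left, tile_x    # no left clipping needed
    skipped = (-x) >> 3                 # whole 8-pixel tiles off the left edge
    tiles_left -= skipped
    if tiles_left < 0:
        return None                     # sprite row entirely clipped
    tile_x += -skipped if mirrored else skipped   # SUMZ: subtract when mirrored
    x += skipped << 3                   # advance X past the skipped tiles
    return x, tiles_left, tile_x

print(left_clip(0x1EC, 7, 0, False))    # X = -20, 8 tiles wide
print(left_clip(0x100, 0, 0, False))    # X = -256, 1 tile wide
```

The first case ends up at X = -4, so the first tile actually drawn is the one straddling the left edge.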
    
  • Wuerfel_21Wuerfel_21 Posts: 5,105
    edited 2024-03-31 17:43

    Hooked most of the pixel pipeline together. Just missing the actual layer rendering now.

    Speed currently:

    Cog2  ppc_time = 21_267
    Cog3  ppm_time = 21_674
    

    Remember, time available at 343 MHz is 21824. PPC jitters a little bit due to loads of hub access. I took the worst value.
    This is under the worst conditions I can easily manifest: both windows at different positions, BG1 fills the screen and is blended with the fixed color.
    Can probably knock loose some 50 cycles in either's per-line init code if I wanted to. One thing currently not done that needs doing is to make delayed copies of certain registers and the palette RAM, so that raster effects are in-sync. Will need to hawk that into one of them. Need to finish out the layer stuff first.
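Put as headroom per cog, using only the numbers from the post:

```python
LINE_BUDGET = 21_824                      # cycles available per line at 343 MHz
measured = {"ppc_time": 21_267, "ppm_time": 21_674}

# Remaining cycles per line for each cog.
headroom = {name: LINE_BUDGET - cycles for name, cycles in measured.items()}
print(headroom)
```

So the PPM cog is the tight one, and the roughly 50 cycles that could be recovered from per-line init would be a meaningful part of its remaining margin.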

  • roglohrogloh Posts: 5,837

    @Wuerfel_21 said:
    Speed currently:

    Cog2  ppc_time = 21_267
    Cog3  ppm_time = 21_674
    

    Remember, time available at 343 MHz is 21824. PPC jitters a little bit due to loads of hub access. I took the worst value.

    Looks tight but within spec - nothing better than a hard constraint to help make it fit. :smile:

    Hopefully you won't find any last minute gotchas you may have overlooked that will blow this time out.

  • @rogloh said:
    Hopefully you won't find any last minute gotchas you may have overlooked that will blow this time out.

    one can hope. Current shitpost status is that I have something really corrupted-looking rendering on one of the background layers, so something clearly is working to some extent.

  • new shitpost status: got 3 layers drawing text, but for some reason BG4 is not getting through compositing.

  • (Also my left monitor died so now I'm stuck in 1280x1024 hell)

  • Ah, 'twas just a misaligned skip table. Happens to the best of us.

  • roglohrogloh Posts: 5,837

    Nice. Once you have some video data output as positive feedback I find it's usually enough of a motivation to continue and then it's only a matter of time until it all works... (I know you still have some CPU stuff to do).

  • TonyB_TonyB_ Posts: 2,193

    Is there any chance of using another cog for video to reduce sysclk?

  • It's already using 3 cogs, there isn't anywhere to go. And splitting the composite/math tasks to use multiple cogs would be good pain.
