Console Emulation

Wuerfel_21 · 2023-11-26 02:17

Okay, we're dealing with a genuine flexsplorp, I think. @ersmith I think the constant folding is not masking bit shift amounts properly. Too tired to investigate that further today.
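
To illustrate the kind of bug suspected here (a Python sketch, not flexspin's actual code): P2 shift instructions only look at the low 5 bits of the shift amount, so a constant folder has to apply the same mask or its result diverges from what the hardware would compute. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def hw_shl(value, amount):
    """Model a 32-bit SHL that only uses the low 5 bits of the amount."""
    return (value << (amount & 31)) & MASK32

def naive_fold_shl(value, amount):
    """A folder that forgets the mask: amounts >= 32 shift everything out."""
    return (value << amount) & MASK32

if __name__ == "__main__":
    # Same source expression, two different answers.
    print(hex(hw_shl(1, 33)), hex(naive_fold_shl(1, 33)))
```

Any amount of 32 or more exposes the difference, which would fit a symptom that only shows up when the base pin is 32.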

Fixed it up on my end (see github). The listed timing seems to not work on pin 32, I need to change it to this:

HYPER_LATENCY = 6
HYPER_WAIT  = HYPER_LATENCY*4 - 2
HYPER_DELAY = 15
HYPER_SYNC_CLOCK = false
HYPER_SYNC_DATA = true

Also, if you need a <=16MB game, try Money Puzzle Exchanger. Probably one of the most funny ones if you plan on never pressing a button.

evanh · 2023-11-26 04:23

Bah! I need to kick Roger. The useable range for DELAY is only eight!
And I see in your init code in megayume_upper.spin2 you've used a fixed offset of -7 to move the min-max out to 7..14. Does that even match the driver code?

rogloh · 2023-11-26 04:28

@evanh said:
Bah! I need to kick Roger. The useable range for DELAY is only eight!
And I see in your init code in megayume_upper.spin2 you've used a fixed offset of -7 to move the min-max out to 7..14. Does that even match the driver code?

Our numbers are independent. I think Ada's matched the HW data sheet more closely whereas mine are purely synthetic and were also offset slightly to range extend a little and also fit in 4 bit driver nibble, but I think they are biased more to the lower end than the higher end. As the frequencies increased perhaps they have topped out vs what we saw with HyperRAM originally?

evanh · 2023-11-26 05:05

Roger,
Are you able to point me to a line number in hyperdrv.spin2 where the offset is applied?

rogloh · 2023-11-26 05:22

Ok I must have been thinking about PSRAM which is where I added it. (psram16drv.spin2)

                            setq    xfreq1                  'reconfigure with single cycle xfreq (sysclk/1)
                            xcont   delay, #0               'configurable fine input delay per P2 clock cycle
                            xcont   #6, #0                  'fixed delay offset to expand delay range

I can't find it being done in the original HyperRAM code, but if you want to try to expand it in hyperdrv.spin2 for the reads you might be able to change the waitx #2 instruction below to add some constant value to delay before it gets used further down. But the driver is very tight and does self modify, so hopefully this fits and is not gonna break something. Also I hope delay is not somehow used later on, and just gets re-read when required. Been quite a while since I looked at this stuff.

                            waitx   #2                      'delay long enough for DATA bus transfer to complete
                            fltl    datapins                'tri-state DATA bus
                            waitxfi                         'wait for address phase+latency to complete
p1                          wxpin   #2, clkpin              'adjust transition delay to # clocks
p2                          setxfrq xfreq2                  'setup streamer frequency
                            wypin   clks, clkpin            'setup number of transfer clocks
                            wrpin   regdatabus, datapins    'setup data bus inputs as registered or not
                            waitx   delay                   'tuning delay for input data reading
                            xinit   xrecv, #0               'start data transfer and then jump to setup code
                            call    resume                  'go see what we will do next while we are streaming
                            waitxfi                         'wait for streaming to finish
                            wrpin   registered, datapins    'prepare data pins for address phase transfer
            _ret_           drvh    cspin                   'de-assert chip select

rogloh · 2023-11-26 05:37

@evanh Looks like the COG is chockers at 1024 longs so if you want to make a change to make your thermal test run you'll need to save a long elsewhere by commenting something out for your test to make room.

Also this line worries me:

                            shr     delay, #5 wc            ' a b c d e | g prep delay and test for registered inputs

See the "f" case is missing. That means there is a path where delay is retained from last use. This is the special locked transfer case where you don't yield to other COGs in the middle of the transfer. So long as you don't lock the COG's transfers with the QoS settings you should be okay.

Update: Actually if this is just for a thermal test you may be able to just hard code this line to be whatever constant delay value you want.
Eg. change:

                            waitx   delay                   'tuning delay for input data reading

to

                            waitx   #9                   'tuning delay for input data reading

Wuerfel_21 · 2023-11-26 07:10

@evanh said:
Does that even match the driver code?

It ought to, but at some point I got tired of fixing bugs with that, so now the high level code only ever writes to memory (writes always work)

evanh · 2023-11-26 10:52

I can't find a value that works. I think the Latency value is screwed too. Time to go back to the 96MB add-on.

EDIT: Doh! I'd forgotten it works fine on base pin P0. The problem is with base pin P32. So the delay is the wrong place to look.

Wuerfel_21 · 2023-11-26 11:28

@evanh said:
EDIT: Doh! I'd forgotten it works fine on base pin P0. The problem is with base pin P32. So the delay is the wrong place to look.

You have updated to the latest commit that actually fixes pin32? The delay does need some tweaking due to trace lengths or something. See post above, that timing works for me in NeoYume. For MegaYume the stock settings work for pin 32.

evanh · 2023-11-27 07:50

Oops, didn't read that very well. What package was updated? Actually, it's MegaYume that's been failing for me. I haven't been testing this on NeoYume because I don't have a 16MB game for it.

Wuerfel_21 · 2023-11-27 09:50

I've updated both the emulators to fix the pin 32 problem. Was that not sufficiently communicated?

evanh · 2023-11-27 12:50

In that case it didn't work for me. No worries, I've been busy offline anyway.

EDIT: Err, my botch up. It's working thanks. I'd gone too far on the config and didn't put everything back.

msrobots · 2023-12-28 07:20

Hey @Wuerfel_21, can you do NES too?

Elite was the first 3D game, please look at this ... https://www.bbcelite.com

Mike

evanh · 2023-12-28 07:34

I played a lot of Frontier (Elite 2). It did a good job of handling motion in space correctly. Enjoyed docking with full manual controls.

It struggled with high-G physics though. I couldn't put my finger on what was wrong but some stuff just went batty.

Wuerfel_21 · 2023-12-28 14:28

@msrobots said:
Hey @Wuerfel_21, can you do NES too?

Theoretically yes, but the Famicom/NES is a truly accursed architecture, more than people realize.

There's no raster interrupt, instead all sorts of raster effects are achieved by either cycle counting or polling the PPU "sprite 0 hit" bit (weird vestigial hardware collision detector)
Every game more complex than the original SMB uses some kind of bankswitch IC. In the simplest case that's just some logic gates to switch program ROM around, but some of these add crazy features such as additional sound channels, remapping video memory at precise points mid-scanline, even a real raster IRQ. There's many, many different cartridge boards that behave very differently.
At the tail end the games become too large to fit reasonably in hub ram (freak outlier is Metal Slader Glory, which has 512k PRG + 512k CHR ROM)

msrobots · 2023-12-29 04:16

It was made for the BBC Micro and later converted to other 6502 machines. But those are monochrome for the 3D part; the NES was the only one with color in 3D. The menu is in color on all of them.

Original game was just 22k pure 6502 assembler...

Just watched a video about it. I think I played it on an Atari 800 or so.

Mike

Wuerfel_21 · 2024-03-05 00:53

Dubious and strange thing I just now realized: NeoYume has kinda muffled sound. For reasons that completely escape me, the IIR lowpass coefficient is set to $1000. (lower number -> stronger filter, more muffled) OPNACog, where this particular LPF code originates, uses $2200. But for some reason when I ported that code into NeoYume I guess I set it to $1000 (and just left it there after getting exhausted by fixing the countless sound bugs. My dreams are still haunted by Z80 NMI race conditions and that one corruption bug that turned out to be a byte/word mismatch). I feel like the YM2610 needs some additional filtering that the YM2608 doesn't (due to heavy use of super crunchy samples), but $1000 seems like too much. Might need to upload some comparisons.

Also, something I realized a long time ago but didn't want to rock the boat on: the actual audio levels are rather low (compare to playing a WAV file at full scale). However, actual full-scale is WAY TOO LOUD when connecting headphones directly to the amplified output, so uhhh....

Wuerfel_21 · 2024-03-07 00:41

So I ended up going with LPF coefficient $1800. I think this snippet (3rd demo from Ironclad) illustrates the issue:

$1000 (previous value) is a bit too dull
$2000 (closer to OPNACog default) is nice and bright, but some sounds become ear-piercing
$1800 (new value) is goldilocks' choice then - more detail from the instrumentation, but not so bright as to be annoying.
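
For a rough feel of what the coefficient does, here is a one-pole IIR lowpass modeled in Python. It assumes the coefficient is a fraction of $10000 applied as y += (x - y) * coef / $10000; the actual NeoYume/OPNACog filter may scale or round differently, so treat this as a sketch of the general technique only.

```python
def lowpass(samples, coef, shift=16):
    """One-pole IIR lowpass with a fixed-point coefficient (assumed form)."""
    y = 0
    out = []
    for x in samples:
        # Move a fraction coef / 2**shift of the way toward the input.
        y += ((x - y) * coef) >> shift
        out.append(y)
    return out

if __name__ == "__main__":
    step = [1000] * 8   # unit step scaled to 1000
    for coef in (0x1000, 0x1800, 0x2000):
        print(hex(coef), lowpass(step, coef)[-1])
```

A smaller coefficient makes the output creep toward a step more slowly, which is the "lower number -> stronger filter" behavior described above.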

@VonSzarvas forum is busted again, can't post MP3s

VonSzarvas · 2024-03-07 06:27

@Wuerfel_21 said:
@VonSzarvas forum is busted again, can't post MP3s

Have you ever been able to do that?
Regardless- the file types list is out of my reach

Wuerfel_21 · 2024-03-26 22:48

So I was thinking about making a SNES emulator to complete the set. I really should be working on other things, but this is kinda fun to work on so uhhh... don't expect anything to come of this, but I prototyped some code for the SNES video pipeline. I'm just posting this because this code is cracking my brain up and I want to show it.

So in MegaYume, video is pipelined across three cogs: [layer rendering] -> [composite/S&H/palette lookup] -> [output]

Layer rendering really just means to render both tile layers and the sprites. Composite means to combine layers according to their priority. These have to be separate steps to get the correct behavior (basically, you can create an ordering paradox between sprites and layers. This was known about and used for certain effects). (In NeoYume, everything is done in one go, since there's no layer priority system at all)

For the SNES emulator, this would need to be split into four cogs (in exchange, I can fit audio into just one cog), because the video hardware is just that much more complex:

[layer rendering] -> [h-mosaic/window mask/composite] -> [palette lookup/color math/fade-out] -> [output]

Layer rendering is the same idea, except up to 4 (!) tile layers may be active. Also there's mode 7 (which appears to need just as many cycles as the four normal layers AFTER optimizing it with RDFAST pipeline tricks)
Mosaic is the pixelation "blur" effect you often see. The vertical part is achieved by rendering the same line multiple times, but the horizontal part is difficult and is best done after rendering. Compositing always happens for 512 pixels per line, with odd and even pixels using different mask settings. After color lookup, the even and odd pixels can be blended together by color math. Finally, the lower nibble of the INIDISP register controls overall screen brightness (on the real machine this is the RGB DAC reference voltage).
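
The mosaic split described above can be sketched in Python. This assumes blocks are counted from pixel 0 and line 0; real hardware has its own rules for where the counters start, and the function names are made up for illustration.

```python
def h_mosaic(line, size):
    """Horizontal part: repeat the first pixel of each size-wide block."""
    return [line[(x // size) * size] for x in range(len(line))]

def v_mosaic_source_line(y, size):
    """Vertical part: which line to render again for output line y."""
    return (y // size) * size

if __name__ == "__main__":
    print(h_mosaic([1, 2, 3, 4, 5, 6, 7], 3))
    print([v_mosaic_source_line(y, 4) for y in range(8)])
```

The vertical part needs no pixel work at all, which is why it can be handled by re-rendering a line, while the horizontal part is a post-process on the finished line.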

Anyways, the color math cog! No idea if it actually works lmao. Debug claims this is 21702 cycles per scanline, which just flies in under the limit (assuming I get to clock at 343MHz, which is another 5 MHz higher than NeoYume). The entire thing is too long to fit in a code block, so here's the cliff notes version:

 DAT '' PPU Color Math cycle count test
    '' SNES master clock is 21.477 MHz
    '' multiplying this by 16 gives sysclk of 343.636 MHz
    '' SNES scanline is 1364 mclk -> 21824 P2 clocks

              org

              ' TAKING STOCK:
              ' priority circuit generates 512 pixels per line
              ' even are "sub", odd pixels are "main" (and use a different set of mask registers)
              ' color math runs on every pixel, but in lowres modes only main/odd pixels are visible
              ' color window and inhibit states only update on main/odd pixels
              '  - pixel 0 state may thus be undefined (BSNES sets everything as if previous pixels were all transparent)
              ' color window can, in any in/out combination
              '  - black out current color (CGWSEL[8..7])
              '  - inhibit color math (CGWSEL[6..5])
              ' color math is also inhibited based on main/odd pixel's layer (CGADSUB[5..0])
              ' color math is either with previous pixel or fixed color (CGWSEL[1]).
              ' when mathing with previous color and THE PREVIOUS EVEN PIXEL (either x-1 or x-2) was backdrop, pixels are
              '  instead mathed with the fixed color (really, BSNES?)
              ' half math flag is additionally inhibited (only update on odd pixel!):
              '  - when window is blacking out current color
              '  - when CGWSEL[1] is set and prev. color is replaced with fixed color (see above)
              ' at the very end, pixels are multiplied with INIDISP[3..0] (on real PPUs, this is an analog effect)

              ' our priority compositor outputs 32 bits per pixel pair in curious format
              ' we need to always output 512 truecolor pixels to the video driver, all effects applied


              ' odd/even are more natural to loop than even/odd
              ' will need 1 dummy iteration?

              ' due to need of inserting REP compensation NOPs
              ' a separate loop is used depending on CGADSUB[7] (add/sub mode)

              ' low LUT is reloaded with palette on each line
              ' high LUT always holds direct color lookup

ppm_addloop
              rep #29,#64 ' always 29 ops regardless of skip

              ' odd:  %WSSS_PPPD_IIII_IIII
              ' D is direct color flag
              ' PPP is direct color "palette"
              ' SSS is source ID
              ' W is color window state
              rfword ppm_odddat
              rdlut ppm_oddcol,ppm_odddat
              getnib ppm_src,ppm_odddat,#3
              getnib ppm_pppd,ppm_odddat,#2
              alts ppm_pppd,#ppm_dcolor_odd_tbl ' even somewheres are zero
              or ppm_oddcol,0-0
              altd ppm_src,#ppm_skip_table
              skipf 0-0
              ' 17 cyc
ppm_fc0 if_z  mov ppm_evencol,ppm_fixedcol ' self-modify condition and WZ bit
              ' we overwrite evencol here so oddcol is valid for next color math
              ' a : NO MATH
              ' b : NO MATH + BLANK
              ' c : MATH
              ' d : MATH + BLANK
              ' e : MATH + HALF                                         a b c d e
        if_nz shr ppm_evencol,#1 ' ADD + HALF                                   x
        if_nz shr ppm_oddcol,#1  ' ADD + HALF                                 x x
              addpix ppm_evencol,ppm_oddcol ' skip if window blanking       x   x
        if_nz shl ppm_oddcol,#1  ' ADD + HALF                                 x x
              mov ppm_evencol,ppm_oddcol ' SKIP unless math skipped     x        
              mov ppm_evencol,#0 ' SKIP unless (no math AND blanking)     x      
              nop ' REP compensator                                     x x x x  
              nop ' REP compensator                                     x x x x  
              nop ' REP compensator                                     x x x 

              and ppm_evencol,ppm_cmask ' always
              ' 17 + 17 cyc

              altr ppm_wrptr,ppm_pxwrite ' auto increment ALTR
              mixpix ppm_evencol,ppm_evencol ' PIV set to INIDISP-derived value
              altr ppm_wrptr,ppm_pxwrite     ' LOWRES only
              mixpix ppm_evencol,ppm_evencol ' LOWRES only
              ' 17 + 17 + 9 (assuming hires)

              ' even: %Dxxx_xxxD_IIII_IIII
              ' yes, two direct color flags
              ' (we already have a PPPx)
              rfword ppm_evendat wc
              rdlut ppm_evencol,ppm_evendat
        if_c  alts ppm_pppd,#ppm_dcolor_always_tbl ' in this table ignore LSB
        if_c  or ppm_evencol,0-0
              ' 17 + 17 + 9 + 9

ppm_fc1 if_z  mov ppm_oddcol,ppm_fixedcol ' self-modify condition
              ' we overwrite ppm_oddcol here so evencol is valid for next color math
              ' a : NO MATH
              ' b : NO MATH + BLANK
              ' c : MATH
              ' d : MATH + BLANK
              ' e : MATH + HALF                                         a b c d e
        if_nz shr ppm_oddcol,#1 ' ADD + HALF                                    x
        if_nz shr ppm_evencol,#1  ' ADD + HALF                                x x
              addpix ppm_oddcol,ppm_evencol ' skip if window blanking       x   x
        if_nz shl ppm_evencol,#1  ' ADD + HALF                                x x
              mov ppm_oddcol,ppm_evencol ' SKIP unless math skipped     x        
              mov ppm_oddcol,#0 ' SKIP unless (no math AND blanking)      x      
              nop ' REP compensator                                     x x x x  
              nop ' REP compensator                                     x x x x  
              nop ' REP compensator                                     x x x      

              and ppm_oddcol,ppm_cmask
              ' 17 + 17 + 9 + 9 + 17
              altr ppm_wrptr,ppm_pxwrite ' HIGHRES only
              mixpix ppm_oddcol,ppm_oddcol ' HIGHRES only

              test ppm_evendat,#$FF wz ' backdrop check
              ' 17 + 17 + 9 + 9 + 17 + 11
        _ret_ sub ppm_wrptr,#128


ppm_subloop
              rep #29,#64
              ' odd:  %WSSS_PPPD_IIII_IIII
              ' D is direct color flag
              ' PPP is direct color "palette"
              ' SSS is source ID
              ' W is color window state
              rfword ppm_odddat
              ' work odd pixel and load SKIP pattern
              rdlut ppm_oddcol,ppm_odddat
              getnib ppm_src,ppm_odddat,#3
              getnib ppm_pppd,ppm_odddat,#2
              alts ppm_pppd,#ppm_dcolor_odd_tbl ' even somewheres are zero
              or ppm_oddcol,0-0
              altd ppm_src,#ppm_skip_table
              skipf 0-0
              ' 17
ppm_fc2 if_z  mov ppm_evencol,ppm_fixedcol ' self-modify condition and WZ bit
              ' we overwrite evencol here so oddcol is valid for next color math
              ' a : NO MATH
              ' b : NO MATH + BLANK
              ' c : MATH
              ' d : MATH + BLANK (same as blank?)
              ' e : MATH + HALF                                         a b c d e  
              not ppm_evencol    ' SUB                                      x   x
              addpix ppm_evencol,ppm_oddcol ' skip if window blanking       x   x
              not ppm_evencol    ' SUB                                      x   x
        if_nz shr ppm_evencol,#1 ' SUB + HALF                                   x
              mov ppm_evencol,ppm_oddcol ' SKIP unless math skipped     x        
              mov ppm_evencol,#0 ' SKIP unless (no math AND blanking)     x   x  
              nop ' REP compensator                                     x x x x  
              nop ' REP compensator                                     x x   x  
              nop ' REP compensator                                     x x   x   

              and ppm_evencol,ppm_cmask
              ' 17 + 17

              altr ppm_wrptr,ppm_pxwrite ' auto increment ALTR
              mixpix ppm_evencol,ppm_evencol ' PIV set to INIDISP-derived value
              altr ppm_wrptr,ppm_pxwrite     ' LOWRES only
              mixpix ppm_evencol,ppm_evencol ' LOWRES only
              ' 17 + 17 + 9 (assuming hires)

              ' even: %Dxxx_xxxD_IIII_IIII
              ' yes, two direct color flags
              ' (we already have a PPPx)
              rfword ppm_evendat wc
              rdlut ppm_evencol,ppm_evendat
        if_c  alts ppm_pppd,#ppm_dcolor_always_tbl ' in this table ignore LSB
        if_c  or ppm_evencol,0-0
              ' 17 + 17 + 9 + 9

ppm_fc3 if_z  mov ppm_oddcol,ppm_fixedcol ' self-modify condition
              ' we overwrite ppm_oddcol here so evencol is valid for next color math
              ' a : NO MATH
              ' b : NO MATH + BLANK
              ' c : MATH
              ' d : MATH + BLANK
              ' e : MATH + HALF                                         a b c d e  
              not ppm_oddcol     ' SUB                                      x   x
              addpix ppm_oddcol,ppm_evencol ' skip if window blanking       x   x
              not ppm_oddcol    ' SUB                                       x   x
        if_nz shr ppm_oddcol,#1 ' SUB + HALF                                    x
              mov ppm_oddcol,ppm_evencol ' SKIP unless math skipped     x        
              mov ppm_oddcol,#0 ' SKIP unless (no math AND blanking)      x   x  
              nop ' REP compensator                                     x x x x  
              nop ' REP compensator                                     x x   x  
              nop ' REP compensator                                     x x   x   

              and ppm_oddcol,ppm_cmask
              ' 17 + 17 + 9 + 9 + 17
              altr ppm_wrptr,ppm_pxwrite ' HIGHRES
              mixpix ppm_oddcol,ppm_oddcol ' HIGHRES

              test ppm_evendat,#$FF wz ' backdrop check
              ' 17 + 17 + 9 + 9 + 17 + 11
        _ret_ sub ppm_wrptr,#128



ppm_fcmodmask   long %0101_0000000_010_000000000_000000000 ' if OR'd set if_always and WZ, if ANDN'd, set if_z

ppm_wrskip8_inc long ppm_skip_table+8 + 1<<9

ppm_pxwrite     long ppm_pxbuffer + 1<<9
ppm_dummydat    long $5_0_00 ' backdrop+outside pixel

ppm_skip_hires          long %00_0_000000000_0_0000_1100_0_000000000_0
ppm_skip_lores          long %11_0_000000000_0_0000_0000_0_000000000_0

ppm_skip_normal         long %00_0_000101111_0_0000_0000_0_000101111_0
ppm_skip_blank          long %00_0_000011111_0_0000_0000_0_000011111_0
ppm_skip_math_add       long %00_0_000111011_0_0000_0000_0_000111011_0
ppm_skip_math_sub       long %00_0_110111000_0_0000_0000_0_110111000_0
ppm_skip_math_blank_add long %00_0_100110101_0_0000_0000_0_100110101_0
                        ' sub from blanked pixel is same as just blank
ppm_skip_math_half      long %00_0_111110000_0_0000_0000_0_111110000_0

ppm_cbuf_even           long @composite_buffer
ppm_cbuf_odd            long @composite_buffer+512*2

ppm_512           long 512
ppm_1024          long 1024
ppm_2048          long 2048

ppm_dcolor_always_tbl
                  long $00_00_00_00[2]
                  long $10_00_00_00[2]
                  long $00_10_00_00[2]
                  long $10_10_00_00[2]
                  long $00_00_20_00[2]
                  long $10_00_20_00[2]
                  long $00_10_20_00[2]
                  long $10_10_20_00[2]
ppm_dcolor_odd_tbl
                  long 0, $00_00_00_00
                  long 0, $10_00_00_00
                  long 0, $00_10_00_00
                  long 0, $10_10_00_00
                  long 0, $00_00_20_00
                  long 0, $10_00_20_00
                  long 0, $00_10_20_00
                  long 0, $10_10_20_00



ppm_cmask long $F8F8F800

ppm_bit31 long negx

ppm_skip_table    res 7
                  res 1 ' unused!
                  res 7
                  res 1 ' unused!

                  res 2 ' space for dummy pickles
ppm_pxbuffer      res 128 
                  res 1

ppm_evendat res 1
ppm_odddat res 1
ppm_evencol res 1
ppm_oddcol res 1

ppm_src  res 1
ppm_pppd  res 1
ppm_wrptr res 1


ppm_tmp1  res 1
ppm_tmp2  res 1
ppm_tmp3  res 1
                    fit 496
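
The skip masks near the end of that listing can be sanity-checked with a small Python model. It assumes the usual P2 SKIPF convention (bit 0 of the mask applies to the first instruction after SKIPF, and a 1 bit means skip) and abbreviates the instruction list from ppm_addloop; the helper name is made up for illustration.

```python
# The ten instructions following SKIPF in ppm_addloop, in program order.
ADDLOOP_OPS = [
    "mov evencol,fixedcol",   # ppm_fc0
    "shr evencol,#1",
    "shr oddcol,#1",
    "addpix evencol,oddcol",
    "shl oddcol,#1",
    "mov evencol,oddcol",
    "mov evencol,#0",
    "nop", "nop", "nop",
]

PPM_SKIP_NORMAL    = 0b00_0_000101111_0_0000_0000_0_000101111_0
PPM_SKIP_MATH_HALF = 0b00_0_111110000_0_0000_0000_0_111110000_0

def executed(mask, ops):
    """Return the ops whose mask bit is 0, i.e. the ones that still run."""
    return [op for i, op in enumerate(ops) if not (mask >> i) & 1]

if __name__ == "__main__":
    print(executed(PPM_SKIP_NORMAL, ADDLOOP_OPS))
    print(executed(PPM_SKIP_MATH_HALF, ADDLOOP_OPS))
```

Under that assumption the results line up with columns "a" and "e" of the comment table, where an x marks an instruction that executes.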

rogloh · 2024-03-27 01:45

@Wuerfel_21 said:
Debug claims this is 21702 cycles per scanline, which just flies in under the limit (assuming I get to clock at 343MHz, which is another 5 MHz higher than NeoYume).

Nice. Could end up being a fun project. But 343MHz! That's getting high.
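
For reference, the headroom implied by those numbers works out as follows (a quick Python check using only figures from the post above; it does not depend on the exact clock frequency):

```python
MCLK_PER_LINE = 1364   # SNES master clocks per scanline (from the listing header)
P2_PER_MCLK = 16       # P2 sysclk = 16 x SNES master clock (from the listing header)
MEASURED = 21702       # cycles per scanline reported by debug

budget = MCLK_PER_LINE * P2_PER_MCLK   # P2 cycles available per scanline
margin = budget - MEASURED             # cycles left over

if __name__ == "__main__":
    print(budget, margin, f"{100 * margin / budget:.2f}%")
```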

Again it will need external memory for cartridge ROM data - so what's your expectation of overall COG allocation looking like for something like this? You'd need a different CPU emulator for audio now (SPC700 apparently), plus something to emulate whatever the main core uses, right?

pik33 · 2024-03-27 15:44

343 MHz is at the edge of EC32 PSRAM stability... That's high. The CPU is a modified 65816... the next one to emulate.

Wuerfel_21 · 2024-03-27 17:07

@rogloh said:
Nice. Could end up being a fun project. But 343MHz! That's getting high.

Again it will need external memory for cartridge ROM data - so what's your expectation of overall COG allocation looking like for something like this? You'd need a different CPU emulator for audio now (SPC700 apparently), plus something to emulate whatever the main core uses, right?

The sound part I already made a while ago as standalone player: https://obex.parallax.com/obex/spccog/ That fits in one cog rather easily. Might have to change the CPU/DSP task switching to be more granular. Apparently there's some very fun race conditions relating to the audio mailbox communications.

So the cogs need to end up as
1. Spin2 code (menu and disk I/O)
2. USB/other input
3. PPU layer rendering
4. PPU compositing
5. PPU color math
6. video output (+ HDMI audio encoder when I finally implement it)
7. CPU + DMA
8. Audio

The CPU is a 65816 theoretically running at 3.58 MHz (mclk/6), but in practice it gets clock stretched to mclk/8 on every RAM access cycle. Doing DMA on the same cog means that the PSRAM access won't need locks. Instruction table probably needs to go into hub so there can be 4 versions of it, for each possible M/X state combination. Maybe need to do something about the D bit as well, 16 bit BCD sounds painful.
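
The idea of one instruction table per M/X combination might look like this in Python. The flag positions are the standard 65816 status bits (M is bit 5, X is bit 4); the table contents and names are purely illustrative and cover only two opcodes.

```python
FLAG_X = 0x10   # index registers are 8-bit when set
FLAG_M = 0x20   # accumulator/memory accesses are 8-bit when set

def table_index(p):
    """Map the status register to one of four tables: bits 5..4 give M and X."""
    return (p >> 4) & 0b11

def build_tables():
    """One opcode -> (mnemonic, immediate size in bytes) table per M/X state."""
    tables = []
    for idx in range(4):
        m8 = bool(idx & 0b10)
        x8 = bool(idx & 0b01)
        tables.append({
            0xA9: ("LDA #imm", 1 if m8 else 2),   # width follows M
            0xA2: ("LDX #imm", 1 if x8 else 2),   # width follows X
        })
    return tables

TABLES = build_tables()

if __name__ == "__main__":
    p = FLAG_X                           # 16-bit accumulator, 8-bit index
    print(TABLES[table_index(p)][0xA9])
```

Picking the table once when M or X changes keeps width checks out of the per-instruction path, at the cost of four copies of the table.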

@pik33 said:
343 MHz is at the edge of EC32 PSRAM stability... That's high. The CPU is modified 65816... the next one to emulate.

Weren't you using it at like 350? Anyways, can always settle down for sysclk/3. My real concern is that it's creeping closer to where the P2 core starts crapping out. But as far as I can tell it really needs to be this high to get everything done in time. Well, mostly it's a waste since hires mode isn't super common, but oh well. The compositing so far is similarly tight and that always operates in high-res.

pik33 · 2024-03-27 20:50

Yes, I used a P2 at 354, many hours, stable, without problems. However, PSRAM on my Edge starts to give up about 340. 360 is where problems start with P2 itself, at least these units I tested.

rogloh · 2024-03-27 21:54

@Wuerfel_21 said:

So the cogs need to end up as
1. Spin2 code (menu and disk I/O)
2. USB/other input
3. PPU layer rendering
4. PPU compositing
5. PPU color math
6. video output (+ HDMI audio encoder when I finally implement it)
7. CPU + DMA
8. Audio

That's not sounding too difficult to fit. Having a full Spin2 COG at your disposal makes things a lot easier to co-ordinate.

Still hopeful of a P2 Edge version assuming it can run at 343MHz or you figure out a way to reduce that a little. I'm currently working on another dedicated system board for the P2Edge with VGA/HDMI/A/V and USB right now so will likely end up with a nice target platform to run this. Will post about that separately soon but here's a sneaky peek of its progress:

Wuerfel_21 · 2024-03-27 22:44

@rogloh said:
Still hopeful of a P2 Edge version assuming it can run at 343MHz or you figure out a way to reduce that a little.

EDGE is the most stable board overall, so if anything it'd be edge-exclusive

I'm currently working on another dedicated system board for the P2Edge with VGA/HDMI/A/V and USB right now so will likely end up with a nice target platform to run this. Will post about that separately soon but here's a sneaky peek of its progress:

looks neat

evanh · 2024-03-27 23:09

@pik33 said:
Yes, I used a P2 at 354, many hours, stable, without problems. However, PSRAM on my Edge starts to give up about 340. 360 is where problems start with P2 itself, at least these units I tested.

It'll be hot with all eight cogs busy. More cooling will likely be needed.

Wuerfel_21 · 2024-03-27 23:38

unrelatedly, I had figured out an... interesting way to read the strange planar tile format. So each word contains data for two bitplanes in high and low bytes. For 2bpp mode, that's that and 8 words define a tile. In 4bpp mode, 16 words are needed. The first two bitplanes are stored the same as in 2bpp mode and then the other two bitplanes are stored in the same way. Very odd.

So to get all the data for a line you'd have to

rdword tilebuffer+0,tile
add tile,#16
rdword tilebuffer+1,tile
setword tilebuffer+0,tilebuffer+1,#1

I think(tm) this takes 29..36 cycles (second rdword hits worst case, owie)

Instead, one can:

setq #4
rdlong tilebuffer+0, tile
setword tilebuffer+0,tilebuffer+4,#1

for 17..25 cycles, at the expense of trashing 3 registers in the process
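
As a host-side illustration of the layout described above, this Python sketch decodes one row of a 4bpp tile. It assumes the low byte of each word is the lower-numbered bitplane and that bit 7 is the leftmost pixel; the function name is hypothetical.

```python
def decode_row_4bpp(tile, row):
    """tile: 32 bytes of 4bpp tile data. Returns 8 palette indices for one row."""
    # Word at row*2 holds bitplanes 0 and 1, word 16 bytes later holds 2 and 3.
    lo = tile[row * 2] | (tile[row * 2 + 1] << 8)
    hi = tile[16 + row * 2] | (tile[16 + row * 2 + 1] << 8)
    planes = lo | (hi << 16)   # same merge the setword step performs
    pixels = []
    for x in range(8):
        bit = 7 - x            # assumed: bit 7 is the leftmost pixel
        value = 0
        for plane in range(4):
            value |= ((planes >> (plane * 8 + bit)) & 1) << plane
        pixels.append(value)
    return pixels

if __name__ == "__main__":
    tile = bytearray(32)
    tile[0], tile[1], tile[16], tile[17] = 0x80, 0x40, 0x20, 0x10
    print(decode_row_4bpp(tile, 0))
```

The burst version reads 20 contiguous bytes, which covers both words for the row at the cost of fetching the 12 bytes in between.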

Wuerfel_21 · 2024-03-28 00:03

Also, the optimized mode7 routine I spoke of earlier (also untested, so probably bugged). I calculated it to take under 12000 cycles.

ppr_draw_mode7 '' SUPER CRACKHEAD VERSION
              '' first iteration is nonsense!

              '' positions are naturally 7.3.8 %OOOO_OOOO_OOOO_OOTT_TTTT_TPPP_ssss_ssss
              '' we use 5 padding bits         %OOOO_OOOO_OTTT_TTTT_PPPs_ssss_sssp_pppp
              '' TODO: verify minimum number of overflow bits needed
              '' (IIRC BSNES uses a full int for this without padding)

              '' TODO: full init here (negligible cycles)

              ' compensate for nonsense
              sub ppr_m7x,ppr_m7a
              sub ppr_m7y,ppr_m7c
              sub ppr_m7x,ppr_m7a
              sub ppr_m7y,ppr_m7c

              rep @.loop-2,#257

              skipf something
              '' Advance X/Y
              add ppr_m7x,ppr_m7a
              add ppr_m7y,ppr_m7c

              ' make tile pointer
              getbyte ppr_tmp1,ppr_m7x,#2
              shl ppr_tmp1,#1
              getbyte ppr_tmp2,ppr_m7y,#2
              setbyte ppr_tmp1,ppr_tmp2,#1
              zerox ppr_tmp1,#14 ' clamp pointer to 32k (SKIP if other overflow handling)
              setword ppr_tmp1,#@vram>>16,#1 ' VRAM pointer

              rfbyte ppr_tmp2 ' CATCH PREVIOUS PIXEL
              rdfast ppr_bit31,ppr_tmp1 ' REQUEST TILE
        if_nz mov ppr_tilebuffer1,#0 ' overflow -> transparency (SKIP THIS)
              altsb ppr_tmp4,ppr_m7buf_write
              setbyte ppr_tmp2

              ' Check overflow for current pixel
              test ppr_m7x,ppr_m7over wz
        if_z  test ppr_m7y,ppr_m7over wz

              '' prep pixel pointer
              getnib ppr_tmp1,ppr_m7y,#3 ' get %PPPs
              shr ppr_tmp1,#1 ' discard %s
              rolnib ppr_tmp1,ppr_m7x,#3 ' shift in %PPPs
              or ppr_tile,ppr_m7pbase ' set lsb and VRAM pointer

              rfbyte ppr_tile ' CATCH TILE
        if_nz mov ppr_tile,#0 ' overflow -> tile 0 (SKIP THIS)

              ' make pixel pointer
              shl ppr_tile,#7
              or ppr_tmp1,ppr_tile

              rdfast ppr_bit31,ppr_tmp1 ' REQUEST PIXEL
.loop
              setq #63
        _ret_ wrlong ppr_tilebuffer1,ptrb


ppr_m7over    long -1024 << (8+5)
ppr_m7pbase   long @vram + 1 ' VRAM pointer (64K aligned) + LSB
ppr_m7buf_write long ppr_tilebuffer1 + 1<<9
ppr_bit31     long 1<<31
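
The two-step lookup that the loop pipelines (tile number first, then pixel) can be modeled in Python. This sketch assumes the usual Mode 7 VRAM interleave (tile numbers in the low byte of each word, pixel data in the high byte), 8 fraction bits, and wrap-around instead of the overflow-to-transparent handling; names are illustrative.

```python
def mode7_sample(vram, x, y):
    """x, y: integer texture coordinates. Returns the 8-bit pixel value."""
    x &= 1023                  # 128 tiles * 8 pixels, wrapped (assumption)
    y &= 1023
    tx, px = x >> 3, x & 7
    ty, py = y >> 3, y & 7
    tile = vram[((ty << 7) | tx) << 1]                      # low byte: tile number
    return vram[(((tile << 6) | (py << 3) | px) << 1) | 1]  # high byte: pixel

def mode7_line(vram, x0, y0, a, c, width=256):
    """Step across one scanline; x0, y0, a, c are fixed point with 8 fraction bits."""
    out = []
    x, y = x0, y0
    for _ in range(width):
        out.append(mode7_sample(vram, x >> 8, y >> 8))
        x += a
        y += c
    return out

if __name__ == "__main__":
    vram = bytearray(32768)
    vram[260] = 5       # tile map entry at tx=2, ty=1
    vram[711] = 0x42    # tile 5, px=3, py=4
    print(mode7_line(vram, 19 << 8, 12 << 8, 256, 0, width=2))
```

Each pixel depends on two dependent reads, which is why the listing overlaps the tile request for one pixel with the pixel request for the previous one.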

TonyB_ · 2024-03-28 15:24

@Wuerfel_21 said:
unrelatedly, I had figured out an... interesting way to read the strange planar tile format. So each word contains data for two bitplanes in high and low bytes. For 2bpp mode, that's that and 8 words define a tile. In 4bpp mode, 16 words are needed. The first two bitplanes are stored the same as in 2bpp mode and then the other two bitplanes are stored in the same way. Very odd.

So to get all the data for a line you'd have to
rdword tilebuffer+0,tile
add tile,#16
rdword tilebuffer+1,tile
setword tilebuffer+0,tilebuffer+1,#1
I think(tm) this takes 29..36 cycles (second rdword hits worst case, owie)

I think this takes 23-30 cycles if no long crossings, else 24-31, but still slower than your alternative.

BEST-CASE

slice       012345670123456701234567
            |           |           cycles
rdword   aaarbbbbb      |            9
add               xx    |            2
rdword              aaawrbbbbb      10
setword                       xx     2
                                    --
                                    23

WORST-CASE

slice      01234567012345670123456701234567
                   |           |           cycles
rdword   aaawwwwwwwrbbbbb      |           16
add                      xx    |            2
rdword                     aaawrbbbbb      10
setword                              xx     2
                                           --
                                           30

a = pre-read delay
b = post-read delay
r = read 
w = egg beater wait
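
The totals in the two diagrams can be re-added with a few lines of Python (the per-instruction figures are copied from the diagrams above; the extra cycle for a long crossing is the one mentioned in the text):

```python
BEST = {"rdword 1": 9, "add": 2, "rdword 2": 10, "setword": 2}
WORST = {"rdword 1": 16, "add": 2, "rdword 2": 10, "setword": 2}
LONG_CROSSING_PENALTY = 1   # per the "else 24-31" remark

best = sum(BEST.values())
worst = sum(WORST.values())

if __name__ == "__main__":
    print(best, worst)
    print(best + LONG_CROSSING_PENALTY, worst + LONG_CROSSING_PENALTY)
```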
