Console Emulation

rogloh · 2024-04-09 01:02

Sounds demanding. Looks like you have some work cut out for you again.

evanh · 2024-04-09 08:28

Ugh! Me offended! Amiga was way more serious than just a games console. It was superior to the Mac in some ways. Windoze was a joke in comparison back then. The pre-emptive multitasking stood out and got very good utilisation with features like AREXX inter-app controls and scripting.

Wuerfel_21 · 2024-04-09 11:37

The point is that the Amiga is the sort of machine where you create a gradient by changing a color register with precise timing.

evanh · 2024-04-09 12:36

I do remember marvelling at an Atari 800 doing that in a shop display early on. So I gather it wasn't a common feature of consoles going forward then? EDIT: You know, I don't remember any consoles in New Zealand before maybe the PS1.

Bit of a segue on fancy video hardware: Tempest's vector display (Original arcade version) impressed me, I guess exactly because it wasn't rasterised. The range of intensity in the colours as well as the natural fine line detail of the direct vector graphics. And it was the first vector display I'd ever seen anywhere.

But it was the weighted rotary control paddle that made it work so well as a game. Remakes are all just a shadow of the real game without that.

Wuerfel_21 · 2024-04-09 13:17

@evanh said:
I do remember marvelling at an Atari 800 doing that in a shop display early on. So it wasn't a common feature of consoles going forward then I gather?

Not very common on consoles and computers alike. Many systems (PCs, the original NES, anything using TMS9918, etc) don't have a scanline counter IRQ at all (not to mention dedicated hardware support like Amiga's Copper and SNES' H-DMA). And then you of course need a register that can actually create an interesting effect being changed. And as such it stopped being useful when going from these "video synthesizer" kind of 2D chips over to rendering full frame buffers using command-based GPUs.

On Megadrive there's a "free" raster effect in that the VDP can on its own load a different horizontal scroll value per-line from VRAM. Rewriting the colors has the problem that it will cause a glitch (not visible in most emulators) if done during active scan (current pixel replaced with the color just written to CRAM). And the 68000 IRQ timing is rather wobbly, so it's usually not done. You can see it with the water levels in the Sonic games. In that case it rewrites the entire CRAM at once, which causes a big line of glitches. The flickering wave effect that happens in the same screen region somewhat hides it. There doesn't seem to be a glitch or any tearing when writing VSRAM (where vertical scroll values live). Perhaps for that it buffers the write until H-blank like for VRAM. Certainly no glitches in OutRun etc.

Meanwhile in SNES-land: [forum refuses to properly autostart the video at 2:29]

Just casually changing 2 scroll registers per scanline. Also, the HUD is the same layer as the fire, so it needs to switch the layer routing in between to send BG3 to be blended with that background layer (presumably on BG2).

And of course any of the perspective effects done with Mode7 are also really raster effects on the matrix registers.

Bit of a segue on fancy video hardware: Tempest's (Original arcade version) vector display impressed me, I guess exactly because it wasn't rasterised. The range of intensity in the colours as well as the natural fine line detail of the direct vector graphics.

I played on an asteroids cab some years back. Totally unforgettable the smoothness of those vector graphics.

But it was the weighted rotary control paddle that made it work so well as a game. Remakes are all just a shadow of the real game without that.

People always forget external things like that are a part of the original game.

Wuerfel_21 · 2024-04-09 21:01

Cached ROM reads implemented for PSRAM. Also very basic SD card loading. Still a ways off from actually running any code. Also took out the HyperRAM for now to reduce headache. Will of course be added back later.

Pictured: Displaying the ROM title (found at $00FFC0 - which is at offset $007FC0 on LoROM, remember!).
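
The address-to-offset step above can be sketched like this. It is a simplified model, not the emulator's actual code: it assumes a headerless LoROM image where only the upper half ($8000-$FFFF) of each bank maps to ROM, and a 21-byte title field. The function names are made up for illustration.

```python
TITLE_ADDR = 0x00FFC0   # SNES-side address of the title, as stated in the post
TITLE_LEN = 21          # assumption: length of the title field in the header

def lorom_file_offset(snes_addr: int) -> int:
    """Map a 24-bit SNES address to an offset in a headerless LoROM image.

    Simplified: ignores WRAM banks and the lower-half mirrors some boards have.
    """
    bank = (snes_addr >> 16) & 0x7F     # banks $80+ mirror banks $00+
    offset = snes_addr & 0xFFFF
    if offset < 0x8000:
        raise ValueError("not a ROM address in this simplified LoROM model")
    # Each bank contributes 32 KiB, so the bank number lands at bit 15.
    return (bank << 15) | (offset & 0x7FFF)

def read_title(rom: bytes) -> str:
    """Return the title string stored at $00FFC0."""
    start = lorom_file_offset(TITLE_ADDR)
    raw = rom[start:start + TITLE_LEN]
    return raw.decode("ascii", errors="replace").rstrip()
```

So `lorom_file_offset(0x00FFC0)` lands on `0x7FC0`, which is where the loader has to look in the file.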

Unrelated: I finally need to fix some shit in my PC and figure out how to properly make use of my VisionRGB card so that I can have direct screen captures. Still have a PlayStation to fix though, owie.

Wuerfel_21 · 2024-04-10 18:01

Offtopic, but I did end up getting that ASUS monitor @rogloh

To make it work with your video code, I had to change the 1920x1200 mode to match standard CVT timings (otherwise it would detect as 1600x1200):

wuxga_timing ' experimental 1920x1200@60Hz for ASUS PA248QV (previously Dell 2405FPW) at 77*4 MHz YMMV
            long   CLK308MHz
            long   308000000
                   '_HSyncPolarity___FrontPorch__SyncWidth___BackPorch__Columns
                   '     1 bit         7 bits      8 bits      8 bits    8 bits
            long   (SYNC_POS<<31) | ( 48<<24) | ( 32<<16) | (80<<8 ) |(1920/8)

                   '_VSyncPolarity___FrontPorch__SyncWidth___BackPorch__Visible
                   '     1 bit         8 bits      3 bits      9 bits   11 bits
            long   (SYNC_NEG<<31) | (  3<<23) | (  6<<20) | ( 26<<11) | 1200
            long   2<<8
            long   0
            long   0   ' reserved for CFRQ parameter
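
For anyone wanting to sanity-check those numbers, here is a quick sketch that adds up the porches and computes the resulting line and frame rates. Two assumptions that are not in the listing itself: the horizontal fields are in pixels, and the pixel clock is half of the 308 MHz system clock (154 MHz, which is what reduced-blanking CVT uses for this mode).

```python
# Horizontal fields from the timing entry (assumed to be in pixels)
H_FRONT, H_SYNC, H_BACK, H_VISIBLE = 48, 32, 80, 1920
# Vertical fields from the timing entry (in lines)
V_FRONT, V_SYNC, V_BACK, V_VISIBLE = 3, 6, 26, 1200

SYSTEM_CLOCK_HZ = 308_000_000
PIXEL_CLOCK_HZ = SYSTEM_CLOCK_HZ // 2   # assumption: 2 system clocks per pixel

def h_total() -> int:
    """Pixels per scanline including blanking."""
    return H_FRONT + H_SYNC + H_BACK + H_VISIBLE

def v_total() -> int:
    """Lines per frame including blanking."""
    return V_FRONT + V_SYNC + V_BACK + V_VISIBLE

def line_rate_hz() -> float:
    return PIXEL_CLOCK_HZ / h_total()

def refresh_hz() -> float:
    return line_rate_hz() / v_total()

if __name__ == "__main__":
    print(h_total(), v_total(), round(line_rate_hz(), 1), round(refresh_hz(), 3))
```

Under those assumptions the totals come out as 2080 x 1235 and the refresh rate sits just under 60 Hz.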

Interestingly, it likes none of the higher line-multiply modes from my emulators, only VGA2X works well. The HDMI input also works. Even works with my HDMI audio PoC driver. Built-in speakers are Smile, but there's an output jack.

Funny shapes game at 1920x1200: (need to publish the current files, did some fixes and converted USB driver to usbnew at some point)

rogloh · 2024-04-10 23:11

@Wuerfel_21 said:
Offtopic, but I did end up getting that ASUS monitor @rogloh
To make it work with your video code, I had to change the 1920x1200 mode to match standard CVT timings (otherwise it would detect as 1600x1200):

Yeah I had similar detection problems until I stumbled on something that worked. When I wake up I'll give your parameters a try to see if they also work on my Dell. I've probably tried CVT before though.

Interestingly, it likes none of the higher line-multiply modes from my emulators, only VGA2X works well. The HDMI input also works. Even works with my HDMI audio PoC driver. Built-in speakers are Smile, but there's an output jack.

Very annoying that it doesn't like to scale up your emulators' outputs. Maybe keep tweaking and you'll find something it likes. Nice that HDMI audio works though. Since I resurrected both my Dell 2405FPW's I'm less urgently needing another HDMI monitor but would hopefully end up with one that works with HDMI audio which I could also work on with my original codebase and try your stuff too.

Funny shapes game at 1920x1200:

With good quality VGA cables I imagine it looks really crisp and clean if you are seeing 1:1 pixels.

Wuerfel_21 · 2024-04-13 13:33

(I started typing this 2 days ago but forgot to post)

I got a 5x mode that works ok - it gets detected as 1600x1280 (???), so not super ideal, but oh well. The main concern is MegaYume, since it can switch between 256x224 and 320x224 and those have to be the same visual size. For NeoYume, 640x480 modes are fine since it only uses 320x224. For the upcoming MisoYume, the HDMI output will be fine since any weird resolution can be stretched to the full screen. So if the 256x224 image is doubled up and padded to 672x480, the monitor takes care of the aspect ratio conversion. For a 16:9 monitor you'd need a wider active area of course. Forgot the math thereof.
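
The forgotten math can be worked out roughly like this. The sketch assumes the monitor stretches the whole active area to fill the panel, and that the 672x480 figure was chosen for a 16:10 panel (the 1920x1200 monitor above); the function names are just for illustration.

```python
from fractions import Fraction

def implied_pixel_aspect(active_w: int, active_h: int, panel: Fraction) -> Fraction:
    """Width:height of one source pixel after the monitor stretches
    the active area to fill a panel of the given aspect ratio."""
    return panel / Fraction(active_w, active_h)

def active_width_for(panel: Fraction, active_h: int, pixel_aspect: Fraction) -> Fraction:
    """Active width needed so a different panel shows the same pixel aspect."""
    return panel * active_h / pixel_aspect

# 672x480 stretched onto 16:10 implies this pixel aspect ratio:
PAR = implied_pixel_aspect(672, 480, Fraction(16, 10))

# Same pixel shape on a 16:9 panel needs a wider active area:
WIDTH_16_9 = active_width_for(Fraction(16, 9), 480, PAR)

if __name__ == "__main__":
    print(PAR, float(WIDTH_16_9))
```

The implied pixel aspect works out to 8:7, and the 16:9 case needs about 746.7 active pixels per line, which would then have to be rounded to whatever granularity the video driver allows.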

Wuerfel_21 · 2024-04-14 01:00

I implemented most of the PPU registers today. Unfortunately, still no code on the ground, as even the simple Hello World test I was trying to run needs the VBlank NMI to work. Oh well.

evanh · 2024-04-14 01:33

That's on the NMI?! If it exists it ends up being used I guess.

Wuerfel_21 · 2024-04-14 11:42

Yea they just used the NMI input for that. Then again, you really don't want that one to get delayed, since you can't write VRAM/CGRAM/OAM outside of blanking (will brap if you try anyways). Though many games will set the force-blank bit since sometimes the vblank task takes too long regardless.

evanh · 2024-04-14 13:06

There's no way to build any extension to atomic ops. So you're left with double/ring buffering solutions or serialising locks built off the basic atomic ops. That's sucky for those smaller data structures.

Wuerfel_21 · 2024-04-14 16:27

As usual, they do give you a register to turn the NMI off, which defeats the point, but lets you use it as a second "normal" IRQ vector. There usually wouldn't be much synchronization: just some kind of "ready flag" that the main process sets when it's done doing its thing and that the NMI handler then unsets when it's done transferring everything to VRAM.
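
A toy model of that ready-flag pattern, purely as a sketch (real games do this in 65816 code with a flag byte in RAM; the class and method names here are invented):

```python
class FrameHandoff:
    """Main loop prepares data, the VBlank NMI handler copies it to VRAM."""

    def __init__(self):
        self.ready = False    # set by the main loop, cleared by the NMI handler
        self.pending = None   # data staged in work RAM
        self.vram = None      # what is currently on screen
        self.missed = 0       # VBlanks where the main loop was not done yet

    def main_loop_step(self, new_data) -> bool:
        """Stage a new frame. Returns False if the last one is still waiting."""
        if self.ready:
            return False      # handler has not consumed the previous frame
        self.pending = new_data
        self.ready = True
        return True

    def nmi_handler(self) -> None:
        """Runs once per VBlank."""
        if not self.ready:
            self.missed += 1  # nothing staged: keep showing the old frame
            return
        self.vram = self.pending
        self.ready = False    # tell the main loop it may continue
```

The only shared state is the single flag, which is why no fancier atomic operations are needed here.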

Wuerfel_21 · 2024-04-14 21:42

Now, I could just hack in the interrupt, but really, there's a complex system of timing that that all plays into.

Some processing needs to run on a per-scanline basis. Notably:

A limited amount of CPU instructions need to run
HDMA happens during H-blank, which is at the end of the scanline. Problematically, there may be cycles left after the HDMA where the CPU might want to write something into the video register. HDMA consumes a variable amount of cycles.
An IRQ may happen at any horizontal position (the actual position doesn't matter since rendering is somewhat atomic per-scanline, just the amount of cycles that run before the IRQ hits)
NMI might happen at the beginning of a scanline (if it's the first blanking line - depends on overscan mode setting) (or immediately if NMI is unmasked while not yet ack'd)
NMI is auto-ack'd when blanking ends

Furthermore, I think I have come to the conclusion that the audio CPU needs to run in lockstep with the main CPU. So whenever the main CPU is done with an instruction, the audio CPU can get to execute the equivalent amount of cycles. When the main CPU wants to access the mailboxes, it needs to wait until the audio CPU is waiting. Slightly problematically, the audio section has its own clock source that doesn't have any relationship to the main clock. A further problem is that that means the audio clock now needs to be tied to the video Hsync rate (since the 65816 executes N cycles per scanline and thus gives the sound section some N/~21 cycles to run, which then produces one sample for every 32 cycles that it runs), in a way that causes quite a headache (eg. the TMDS modes can't match the correct scanline length exactly due to not being a multiple of 10. Also, the audio rate needs to be an integer number of P2 cycles per sample)
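
The lockstep idea can be sketched as a simple credit counter. This is only an illustration: the fixed divider of 21 master clocks per audio CPU cycle is an assumption taken from the "N/~21" figure, and the class is hypothetical rather than anything from the emulator.

```python
MCLK_PER_APU_CYCLE = 21   # assumption: approximate ratio quoted in the post

class LockstepScheduler:
    """Hands the audio CPU as many cycles as the main CPU just consumed."""

    def __init__(self):
        self.credit_mclk = 0      # master clocks not yet given to the audio CPU
        self.apu_cycles_run = 0   # total audio CPU cycles executed so far

    def after_main_instruction(self, mclk_spent: int) -> int:
        """Call after each main CPU instruction.

        Returns how many audio CPU cycles may run now. The remainder is
        carried over so no time is lost to rounding.
        """
        self.credit_mclk += mclk_spent
        cycles, self.credit_mclk = divmod(self.credit_mclk, MCLK_PER_APU_CYCLE)
        self.apu_cycles_run += cycles
        return cycles
```

For example, two 18-mclk instructions in a row give the audio CPU zero cycles, then one cycle, with 15 mclk left in the credit counter.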

evanh · 2024-04-14 23:54

LOL, a mask for NMI is certainly not usual! Yeah, so they really just wanted an extra IRQ and made do.

rogloh · 2024-04-15 00:39

@Wuerfel_21 said:
Furthermore, I think I have come to the conclusion that the audio CPU needs to run in lockstep with the main CPU. So whenever the main CPU is done with an instruction, the audio CPU can get to execute the equivalent amount of cycles. When the main CPU wants to access the mailboxes, it needs to wait until the audio CPU is waiting. Slightly problematically, the audio section has its own clock source that doesn't have any relationship to the main clock. A further problem is that that means the audio clock now needs to be tied to the video Hsync rate (since the 65816 executes N cycles per scanline and thus gives the sound section some N/~21 cycles to run, which then produces one sample for every 32 cycles that it runs), in a way that causes quite a headache (eg. the TMDS modes can't match the correct scanline length exactly due to not being a multiple of 10. Also, the audio rate needs to be an integer number of P2 cycles per sample)

Complex timing requirements there. I don't envy your efforts right now getting some of it working. Not 100% sure, but perhaps some effective CPU gapping at the end of scan lines (or just executing fake NOPs without incrementing the PC) while letting audio continue on could be of some use; they are then only semi-synchronized. Don't know how well that could fit in with the DMA and IRQ/NMI stuff though. Or if it turns out too difficult, just support TV/VGA only modes, which free up the clock rate a bit more vs TMDS?

Wuerfel_21 · 2024-04-15 11:23

Hmm, thinking about it more, if each scanline is some fraction of a master clock too short/long, I think that can be compensated per-scanline to make the overall audio clock constant. Just need to keep the audio sync counter as a separate thing and add/subtract 1 sometimes. As mentioned, there's no fixed clock relationship (the real SNES uses a ceramic resonator for the audio clock). It just has to be in the right ballpark. Notably, neither side can be "paused" for a long period of time. When I'm home again I'm gonna post the cartridge streaming code disassembled from Tales of Phantasia to illustrate why that is.

I think I'll just hack in the NMI first, though. Oh well. Might need to tweak the video code a bit, too.

Wuerfel_21 · 2024-04-15 16:17

@Wuerfel_21 said:
Hmm, thinking about it more, if each scanline is some fraction of a master clock too short/long, I think that can be compensated per-scanline to make the overall audio clock constant. Just need to keep the audio sync counter as a separate thing and add/subtract 1 sometimes.

The math on that checks out. Assuming the CPU cycles count in mclks and the P2 clock is 16*mclk... The theoretical length of a scanline is 1364*16 -> 21824 cycles. If this is rounded up to 21830, there's 6/16ths of a mclk too much, so on occasion, an additional mclk must be given to the audio counter to keep everything in sync. Luckily, the CPU already has to do some pseudo-cycles every scanline to account for DRAM refresh, so that could be handled quite easily together with that.
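
Spelled out as a sketch, the compensation is just a fractional accumulator: every scanline runs 6 P2 cycles (6/16 mclk) long, and whenever a whole mclk has piled up the audio counter gets one extra. The constants come from the figures above; the function itself is hypothetical and only illustrates the bookkeeping.

```python
P2_PER_MCLK = 16        # P2 cycles per SNES master clock
IDEAL_LINE_MCLK = 1364  # theoretical scanline length in master clocks
ACTUAL_LINE_P2 = 21830  # scanline length actually used, in P2 cycles

def audio_mclk_per_line(lines: int) -> list[int]:
    """Master clocks to credit to the audio counter on each scanline."""
    excess_p2 = ACTUAL_LINE_P2 - IDEAL_LINE_MCLK * P2_PER_MCLK   # 6 per line
    acc = 0
    credited = []
    for _ in range(lines):
        acc += excess_p2
        # Hand out whole master clocks, keep the fraction for later lines.
        extra, acc = divmod(acc, P2_PER_MCLK)
        credited.append(IDEAL_LINE_MCLK + extra)
    return credited

if __name__ == "__main__":
    print(audio_mclk_per_line(8))
```

Over any 8 scanlines the audio side gets exactly 3 extra mclks, so the long-term audio rate matches the real P2 time that has passed.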

Anyways, the aforementioned ToP cartridge streaming. Background: the SPC700/DSP can only access their own 64K of RAM and everything has to get there through a set of 4 mailboxes. ToP has some crackhead code to allow playing sounds directly from cartridge ROM, anyways. Here's what breaking into that in the Mesen-S debugger looks like (note: the disassembly is slightly bugged)

So on the 65816 side (DP is set to $2100, so $40..$43 are the mailboxes):

-       LDA $40
        BPL -       ; wait for negative handshake
        LDA $0092,X ; load 1st byte
        STA $41     ; send 1st byte
        LDA $0091,X ; load 2nd byte
        STA $42     ; send 2nd byte
        LDA $0090,X ; load 3rd byte
        STA $43     ; send 3rd byte
-       LDA $40
        BMI -       ; wait for positive handshake
        LDA $008F,X ; load 1st byte
        STA $41     ; send 1st byte
        LDA $008E,X ; load 2nd byte
        STA $42     ; send 2nd byte
        LDA $008D,X ; load 3rd byte
        STA $43     ; send 3rd byte
-       LDA $40
        BPL -       ; wait for negative handshake
        [... unrolled until it gets to zero ...]

And on the SPC700 side ($F4..$F7 are the mailboxes):

-       MOV A,$F5 ; receive 1st byte
        MOV [$AC]+Y,A ; store 1st byte
        EOR $80,#$80 ; flip handshake sign
        MOV $F4,$80 ; write handshake
        MOV A,$F6 ; receive 2nd byte
        MOV X,$F7 ; receive 3rd byte
        DEC Y
        MOV [$AC]+Y,A ; store 2nd byte
        DEC Y
        MOV A,X
        MOV [$AC]+Y,A ; store 3rd byte
        DBNZ Y,-

Notice that there are not one, but two hazards here. The SPC700 reads the 2nd and 3rd bytes after signalling the next handshake, so it relies on the 65816 to be slow enough to not send these immediately. There is no handshake going the other way, so it also relies on the 65816 to be fast enough to actually send everything in time (As can be gleaned from the event viewer window in the screenshot, the 65816 loops multiple times at each handshake (read events). This almost has to be the case for a poorly handshook transfer like this, since the 65816 experiences time gaps due to DRAM refresh (middle of the screen) and HDMA (just right of screen)).

Also notice that the programmer apparently was unaware that the mailboxes are separate for each direction (i.e. reading one always gives the last value the other processor wrote on its side, never the values you write on this side). Had they used that, the handshake port could have carried data too and 4 bytes could be transferred in each block.
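
A minimal model of that mailbox behaviour, as a sketch (the class and method names are invented; ports 0..3 stand for $2140..$2143 on the 65816 side and $F4..$F7 on the SPC700 side):

```python
class Mailboxes:
    """Four ports, each backed by two independent latches, one per direction."""

    def __init__(self):
        self.cpu_to_apu = [0] * 4   # written by the 65816, read by the SPC700
        self.apu_to_cpu = [0] * 4   # written by the SPC700, read by the 65816

    def cpu_write(self, port: int, value: int) -> None:
        self.cpu_to_apu[port] = value & 0xFF

    def cpu_read(self, port: int) -> int:
        # Never returns what the 65816 itself wrote.
        return self.apu_to_cpu[port]

    def apu_write(self, port: int, value: int) -> None:
        self.apu_to_cpu[port] = value & 0xFF

    def apu_read(self, port: int) -> int:
        # Never returns what the SPC700 itself wrote.
        return self.cpu_to_apu[port]
```

With this model, port 0 can carry the SPC700's handshake towards the 65816 while the 65816 sends a data byte through the same port in the other direction.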

EDIT 1: To calculate out the first hazard further:

; LDA $40
-2      LDA dp opcode
-1      operand $40
0       $2140 READ HERE, latest cycle the handshake can be set
; BMI - (not taken)
1       BMI opcode
2       BMI operand
; LDA $008F,X
3       LDA a,x opcode
4       operand low $8F
5       operand high $00
6       (index-across-pages penalty cycle)
7       data read
; STA $41
8       STA dp opcode
9       operand $41
10      data write
; LDA $008E,X
11      LDA a,x opcode
12      operand low $8E
13      operand high $00
14      (index-across-pages penalty cycle)
15      data read
; STA $42
16      STA dp opcode
17      operand $42
18      data write
; LDA $008D,X
19      LDA a,x opcode
20      operand low $8D
21      operand high $00
22      (index-across-pages penalty cycle)
23      data read
; STA $43
24      STA dp opcode
25      operand $43
26      data write

So the 65816 can at the earliest write to $2142 18 cycles after the handshake is set. These are fast cycles (6 mclk), so that's about 5 cycles on the SPC700. MOV A, dp is a 3-cycle instruction, so the 2nd byte is safe. $2143 gets written 8 cycles later (26 total, ~7 SPC700 cycles), which is also safe, but just by one cycle (i.e. the 3rd byte may have been overwritten when DEC Y starts executing).

The real hardware may have an additional cycle of safety due to propagation of the registers, idk.
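
The same arithmetic as a small sketch. Assumptions beyond the numbers in the post: an SPC700 cycle is treated as exactly 21 mclk, MOV X,dp is taken to be 3 cycles like MOV A,dp, and the handshake is treated as visible to the 65816 at SPC700 cycle 0.

```python
FAST_CYCLE_MCLK = 6      # 65816 fast cycle, in master clocks
APU_CYCLE_MCLK = 21      # assumption: approximate SPC700 cycle length
MOV_REG_DP_CYCLES = 3    # MOV A,dp per the post; MOV X,dp assumed the same

def to_apu_cycles(cpu_cycles: int) -> float:
    """Convert a count of fast 65816 cycles into SPC700 cycles."""
    return cpu_cycles * FAST_CYCLE_MCLK / APU_CYCLE_MCLK

# Earliest time (in SPC700 cycles after the handshake) each port is overwritten
WRITE_2142 = to_apu_cycles(18)
WRITE_2143 = to_apu_cycles(26)

# Latest time the SPC700 has finished reading each port
READ_F6_DONE = MOV_REG_DP_CYCLES        # MOV A,$F6
READ_F7_DONE = 2 * MOV_REG_DP_CYCLES    # followed by MOV X,$F7

MARGIN_2ND = WRITE_2142 - READ_F6_DONE
MARGIN_3RD = WRITE_2143 - READ_F7_DONE

if __name__ == "__main__":
    print(round(MARGIN_2ND, 2), round(MARGIN_3RD, 2))
```

Both margins come out positive, with the 3rd byte having only a bit more than one SPC700 cycle to spare.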

EDIT 2: In case you wonder what all this is actually accomplishing, watch this:

Wuerfel_21 · 2024-04-16 18:02

After significant amounts of pain:

Great Wisdoms:

One ought to make sure that when X=0, STX actually writes 16 bits
16 bit writes need to actually work in the first place
You shall not push the status register instead of the data bank register

Next up, fixing the tile mirror flag

EDIT: Since that was obviously trivial, the real next thing is to get the DMA controller going, at least for simple one-shot transfers.

Wuerfel_21 · 2024-04-16 23:47

Slightly less gormless Hello World using the very half-hearted DMA controller. Currently the main DMA loop is in hub RAM, what a bother. I could perhaps use stack magic to reduce the overhead (so the B-Bus write handler jumps right back into the top of the loop). Also, address decrement transfers would destroy the PSRAM cache mechanism. One hopes no one actually uses those...

evanh · 2024-04-17 00:04

That's probably true for all caches, including modern CPU hardware caches. SDRAMs use incrementing bursts therefore so does the caching hardware.

SNES must have been one of the last of any design that didn't use SDRAMs combined with hardware caching.

Wuerfel_21 · 2024-04-17 00:15

I think there was some amount of time where page-mode DRAM was the thing, which I think doesn't care about direction.

The SNES is pretty much a standard 8-bit 6502 bus with some DRAM and some mask ROM in the cartridge. Very 80s, really. The most notable thing is that there's a separate 8-bit I/O address bus (B-Bus) that allows DMA to transfer directly between memory and peripherals in one cycle. When accessing the registers normally through the $21xx range, presumably there's some logic to gate through the A-Bus address (since DMA controller and CPU are one chip - Ricoh 5A22)

Wuerfel_21 · 2024-04-17 00:30

Also, got this Bresenham thing going... All the previous tests (including my hand-rolled PPU testing) were with BG1SC=$00 (= BG1's nametable base pointer and size), so I never noticed that the base pointer wasn't getting shifted into quite the right position.
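
For reference, a sketch of how that register is usually documented to decode. The layout here (upper six bits = base in 1K-word steps, low two bits = size) comes from public SNES register documentation, not from this thread, so treat it as an assumption; the function name is made up.

```python
def decode_bgsc(value: int) -> tuple[int, int, int]:
    """Decode a BGnSC register value.

    Returns (nametable base as a VRAM word address, width in tiles,
    height in tiles).
    """
    # Bits 7-2: base address in 1K-word units, i.e. (value >> 2) << 10.
    base_word_addr = (value & 0xFC) << 8
    # Bit 0: 64 tiles wide instead of 32. Bit 1: 64 tiles tall instead of 32.
    width = 64 if value & 0x01 else 32
    height = 64 if value & 0x02 else 32
    return base_word_addr, width, height
```

It also shows why $00 hides the bug: a base of zero stays zero no matter how it is shifted.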

Also notice the white dots at the top border resulting from the video pipeline not being properly synchronized.

evanh · 2024-04-17 01:19

@Wuerfel_21 said:
I think there was some amount of time where page-mode DRAM was the thing, which I think doesn't care about direction.

Totally the norm up till then. SDRAM didn't exist at all.

Early caches all used page-mode DRAMs. So I guess SDRAMs were designed to suit how caches already worked rather than the other way around.

Wuerfel_21 · 2024-04-17 18:23

After some bugfixing, it can now boot some real games to some extent.

...

Do I really want to know everything that's going wrong here?

Wuerfel_21 · 2024-04-17 18:50

So I copied in the APU code from SPCcog and did the minimum changes to hook up the mailboxes...

@Wuerfel_21 said:
Maybe don't celebrate too early though, it's not working yet. Wait until I can get some corruption vaguely resembling Super Mario World on-screen

@avsa242

No inputs yet, so that's presumably why it's going straight into the menu. But the music is playing just fine!

avsa242 · 2024-04-17 21:17

/me beams!
You've almost saved the Triforce, excellent work regardless

Wuerfel_21 · 2024-04-17 22:38

After some amount of fixing the sprites, it now almost works if you squint. Interestingly, the sprite mask generation was completely broken and the fixed version now is also two instructions shorter.

There's a release ZIP, but good luck getting that to work. Has not been tested on anything besides my rather unusual EVAL+96MB setup.

Only games I have that aren't black screen or die before they can be properly started are Super Mario World and Castlevania Dracula X. No idea why these two. With MegaYume it was Sonic and Panorama Cotton, similarly odd.

Though Rock'n'Roll Racing also works, just without the racing.

rogloh · 2024-04-17 23:38

Every step is closer. Great work as usual. Nice you have some sounds now too.
