Console Emulation

Wuerfel_21 · 2024-04-18 00:20

Not just "some sounds", the sound core is more-or-less perfect on account of just being pasted in from SPCcog, which already went through multiple versions of improvement. (IIRC the DSP should be almost bit-for-bit with the real thing. The BRR decoder certainly is). The main problem will be getting the timing sync right for the 65816/SPC700 comms on games that do something strange with it, as lengthily illustrated a few posts ago.

Wuerfel_21 · 2024-04-18 19:02

@Wuerfel_21 said:
Interestingly, the sprite mask generation was completely broken and the fixed version now is also two instructions shorter.

Of course, my hubris got the better of me there and I need to add one instruction for it to work correctly, so it's only shorter than the original by 1.

Wuerfel_21 · 2024-04-18 23:42

Shitpost status: Trying to figure out why F-Zero is blackscreening. No idea why that game, it wouldn't work anyways.

The choice of $80 for BRA's opcode may be causing me a lot of trouble, it's a rather common byte. If the PC becomes misaligned, it may execute BRA something, which then lands on another BRA something...

Wuerfel_21 · 2024-04-19 00:36

So it turns out JMP (a,x) was broken in a very specific way: it would clear the program bank. I guess none of the test programs would catch this since all their code fits into bank 0. But it does seem like a pretty important instruction, doesn't it?
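For reference, a minimal Python sketch of what the instruction is expected to do, assuming the usual 65816 behaviour: the pointer is fetched from the current program bank and only the 16-bit PC changes. The function name and the `read8` callback are hypothetical, not MisoYume code.

```python
def jmp_abs_indexed_indirect(pbr, operand, x, read8):
    """Model of 65816 JMP (a,x): returns the (program bank, new PC) pair.

    pbr     -- current program bank (0..255)
    operand -- 16-bit absolute operand of the instruction
    x       -- X index register
    read8   -- callable taking a 24-bit address (hypothetical helper)
    """
    pointer = (operand + x) & 0xFFFF               # wraps inside the bank
    lo = read8((pbr << 16) | pointer)
    hi = read8((pbr << 16) | ((pointer + 1) & 0xFFFF))
    return pbr, lo | (hi << 8)                     # bank must survive the jump


# Tiny usage example: a jump table living in bank $03.
memory = {0x038012: 0x34, 0x038013: 0x92}
bank, pc = jmp_abs_indexed_indirect(0x03, 0x8010, 0x0002,
                                    lambda addr: memory.get(addr, 0))
```

A test program whose code all sits in bank 0 cannot tell a preserved bank from a cleared one, which matches the observation above.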

So I can now play garbled-graphics-F-Zero. How nice. Though it's even more garbled than it really should be. For some reason the sprites are wrong, too! But you can kinda tell that it's running properly underneath that.

Also now running better (of the ROMs I bothered to put on the SD card):

Plok: seems to run ok
Zelda: Still garbled but now doesn't crash. Needs NVRAM to get further.
Demon's Crest: Title screen ok, blank screen in game (game is running in the background)
Super Ghouls and Ghosts: ingame with garbled gfx
Pocky and Rocky: crash after character select
Wild Guns: crashes going ingame (also a Natsume game...)
Maerchen Adventure Cotton 100%: Shows logo (was previously completely dead, no PPU init)

Meanwhile Rock'n'Roll Racing is still just rock'n'roll with no racing. Also HiROM isn't implemented, so anything using that isn't even thinking about running.

Also still not entirely sure where the flickering black bars come from. I guess it could be CPU enabling force-blank while not properly synchronized to the screen refresh. I guess the timing nonsense shouldn't be put off much further.

Wuerfel_21 · 2024-04-19 18:39

Looking into Pocky & Rocky. So far I've got:

(I'm really just writing it down for myself)

Crashes due to RTS pulling a bad address
RTS address is bad because stack overflows into ROM (stack grows down, so "overflow" is from $0000 to $FFFF)
stack overflows because NMI takes too long and interrupts itself
NMI takes too long because loop @ $008E86 runs too long
loop too long because garbage at $0524 (loop length is ($0524,Y) AND $3FFF, should be $8002 on first call, is $0040, next call is even longer)
garbage is written at $00A343, which is STA $0442,Y with Y=$00E2 - I think Y shouldn't go that high in this function. Though I don't see the correct value being written, either. It does seem to appear 2 bytes later? (EDIT: the index the correct value gets stored to is at $0522, which also gets corrupted, it just happens to get a $0002 written into it)
Y is loaded from (and stored back to!) $0440, which appears to be some kind of ongoing index.
Edit2:
So the routine at $00A335 (that ends up corrupting the other stuff) copies 8 bytes from an array at $0010 to another at $0442 and increments $0440. Another routine at $008B3B copies 8 bytes from the $0442 array to (not an array) $0100 and decrements $0440. It would appear that these functions should be called in equilibrium to not overflow $0440. The second routine gets called during NMI. Both of them are really called during the character selection, but it seems that corrupting $0522 makes it crash when trying to draw the text for the story sequence. Scary action at a distance.
$008B3B only gets called once per NMI. So calling $00A335 more often than that will cause the $0442 array to overflow eventually. It probably has something to do with lazy-loading of data into VRAM? The first byte (that gets copied to $0100) seems to be used as a jump table index at $008A06.
AHA! There's a second routine that's supposed to decrement $0440 at $008B72. But it seems this is called from IRQ? So I guess the bug is just that misoyume doesn't have IRQs yet. What an odd (and somewhat lame) explanation.
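The address arithmetic in the notes above can be checked with a small Python model. It assumes 8-byte entries and an index in $0440 that moves by 8 per call (the thread only says "increments"), and the push/pop rates are invented for illustration.

```python
QUEUE_BASE = 0x0442      # array written by the routine at $00A335
ENTRY_SIZE = 8           # assumption: index advances by one 8-byte entry
VICTIMS = {
    0x0522: "index used to store the correct loop length",
    0x0524: "loop length read by the loop at $008E86",
}


def entry_range(index):
    """Addresses covered by one 8-byte entry stored at QUEUE_BASE + index."""
    start = QUEUE_BASE + index
    return range(start, start + ENTRY_SIZE)


def frames_until_corruption(pushes_per_frame, pops_per_frame, max_frames=1000):
    """Return the first frame in which a push lands on a victim address."""
    index = 0
    for frame in range(max_frames):
        for _ in range(pushes_per_frame):
            if any(addr in VICTIMS for addr in entry_range(index)):
                return frame
            index += ENTRY_SIZE
        index = max(0, index - pops_per_frame * ENTRY_SIZE)
    return None          # queue stays in equilibrium
```

With Y = $00E2, `QUEUE_BASE + 0xE2` is exactly $0524, so the observed garbage is consistent with the queue simply running 28 entries deep. With two pushes and one pop per frame the model hits $0522 on frame 27; with the missing IRQ-side pop restored (one push, one pop) it never does.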

Wuerfel_21 · 2024-04-20 12:10

Interesting discovery while trying to get the timing to stay in whack: For most scanlines, the CPU cog finishes its job in under 17k cycles (WITHOUT CODE CACHE - though on a slowROM game), but some take up almost 40k cycles, which really throws the timing out of whack. The total time available is ~21k.

It appears that big DMA transfers are causing this massive bottleneck. The allotted timing is 128 cycles per byte transferred, which sounds ok until you realize all the stuff involved in copying a single byte from CPU memory to VRAM. Though the actual hardware can't actually read from many sources in this case (I/O is off limits), so maybe I can optimize it by assuming the entire transfer will use the same source throughout (i.e. if the start address is in RAM, call directly into the RAM read function instead of taking the normal memory map detour for every byte)
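The ~21k and 128 figures can be reproduced from the clock ratio. The sketch below uses the `_CLKFREQ` value quoted later in the thread; the SNES-side constants (master clock = 6x colorburst, 1364 master clocks per scanline, 8 master clocks per DMA byte) are assumptions taken from general SNES hardware documentation, not from this thread.

```python
P2_CLOCK_HZ = 343_636_320          # _CLKFREQ quoted later in the thread
P2_PER_COLORBURST = 96             # "96x NTSC color burst"
MASTER_PER_COLORBURST = 6          # assumption: SNES master clock ratio
MASTER_PER_SCANLINE = 1364         # assumption: normal NTSC scanline
MASTER_PER_DMA_BYTE = 8            # assumption: DMA transfer rate

p2_per_master = P2_PER_COLORBURST // MASTER_PER_COLORBURST
cycles_per_scanline = MASTER_PER_SCANLINE * p2_per_master
cycles_per_dma_byte = MASTER_PER_DMA_BYTE * p2_per_master
dma_bytes_per_scanline = cycles_per_scanline // cycles_per_dma_byte


def overrun_per_byte(measured_cycles):
    """P2 cycles by which one emulated DMA byte exceeds its time slot."""
    return max(0, measured_cycles - cycles_per_dma_byte)


print(cycles_per_scanline, cycles_per_dma_byte, dma_bytes_per_scanline)
```

Under these assumptions a scanline is 21824 P2 cycles and holds at most 170 DMA bytes, so any per-byte cost above 128 cycles accumulates quickly on a large transfer.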

I'm actually considering shoving the EXECF table out in hub ram, since I'm already running short on cog/lut space and will need a lot still.

Wuerfel_21 · 2024-04-21 00:28

Completely unrelated to the MisoYume project, look at what I just remembered exists (actually, I remembered on wednesday but only now remembered to post here, after reading it a little bit):

https://github.com/mwenge/tempest2k

Yea, source code for Tempest 2000 on the Atari Jaguar. @evanh you said you like Tempest, right? How about T2K on P2?

Porting the GPU microcodes to P2 is probably easy. They're not super long and the Tom/Jerry architecture is actually very similar to P2 (2-operand RISC, 16 bit multiply, divide with huge delay). A P2 cog @320 MHz should be about 4 to 6 times faster than the real Jaguar GPU/DSP (~30 MHz, 1 IPC), though that's not accounting for the blitter, which allegedly(!) does ~120 MPix/s Gouraud shading. It's not really used like that, I think.

The audio code appears to be missing, but it seems like it just plays standard protracker MOD files and uncompressed effect samples.
The main logic code is a huge wad of poorly commented 68000 ASM, but I think that can just be stuffed through the 68k emulation I already have. The normally trapping $Axxx instructions could be used to handle all the I/O, so the emulated memory map would be extremely simple. Would need to change the screen logic from swapping between buffers to uploading them into PSRAM so that only one needs to be resident at any time (they're 384x240 16 bit, quite huge). Though IDK how that'd work with the screen feedback effects, they probably work better if the source screen is resident. It appears T2K always(?) has a 3rd screen buffer overlaid onto the main double-buffered screen, which appears to be used mostly to put stuff on top of those feedback effects without becoming part of them (like the bonus stage crosshairs). Some other graphics are also overlay objects. So also need another cog to handle object merging and CRY->RGB conversion (which is something like GETBYTE+MOVBYTS+ALTS+MULPIX for each pixel...).

rogloh · 2024-04-21 01:22

@Wuerfel_21 said:
It appears that big DMA transfers are causing this massive bottleneck. The allotted timing is 128 cycles per byte transferred, which sounds ok until you realize all the stuff involved in copying a single byte from CPU memory to VRAM. Though the actual hardware can't actually read from many sources in this case (I/O is off limits), so maybe I can optimize it by assuming the entire transfer will use the same source throughout (i.e. if the start address is in RAM, call directly into the RAM read function instead of taking the normal memory map detour for every byte)

Definitely sounds like you will need to optimize something further. Is the actual transfer timing per each byte of the DMA critical or can you just detect the start of the DMA operation, do it fast at native P2 transfer rates and figure out the time it'd normally take then notify of completion later at that time if required? Maybe that will buy you some time when more than one byte gets transferred by amortizing the initial overhead over the entire transfer, but given how tightly coupled things seem to be from what you've posted previously I'd expect it's probably not possible to do that.

Wuerfel_21 · 2024-04-21 01:34

@rogloh said:
Definitely sounds like you will need to optimize something further. Is the actual transfer timing per each byte of the DMA critical or can you just detect the start of the DMA operation, do it fast at native P2 transfer rates and figure out the time it'd normally take then notify of completion later at that time if required? Maybe that will buy you some time when more than one byte gets transferred by amortizing the initial overhead over the entire transfer, but given how tightly coupled things seem to be from what you've posted previously I'd expect it's probably not possible to do that.

That could/should work, but the problem is that there's a lot of variables and it's not as simple as just copying bytes. DMA can target any B-Bus register, though really the only useful ones (for one-shot DMA) are VMDATAL, VMDATAH, both VMDATAx together, CGDATA and OAMDATA. Now, VMDATAx has to do this crazy address remapping, CGDATA needs to convert colors to 32bpp and is one of those weird "write twice" registers and may start out in its "written once" state, OAMDATA needs to reformat the data a little bit to save cycles when rendering it. I already look up the handlers for the register(s) that are being used only once and then call them directly. So really, I need to optimize the source side of things first, as said. Some cycles could be shaved by unrolling the code a bit. But not sure if that'll be enough. The DMA loop is currently in hubexec, which isn't helping matters. Maybe I can put the small fragment that needs to go inbetween the source read and register write in cog ram to save a hub branch. Or maybe I can make separate loops with all possible sources (RAM or ROM) inlined into them.

Wuerfel_21 · 2024-04-21 10:54

@Wuerfel_21 said:
Maybe I can put the small fragment that needs to go inbetween the source read and register write in cog ram to save a hub branch.

I did this, except I reordered the operations and just did the stack push trick, so no extra cog ram needed. This also allowed me to structure the loop such that it only does 2 hubexec branches per iteration (call to the register handler and subsequent return to the top of the DMA loop).

Still way too slow.

Wuerfel_21 · 2024-04-21 11:03

The current DMA write loop:

              jmp #.wrstart

.wrloop
              add rk_ea,rk_dmainc
              setword rk_optmp1,rk_ea,#0

              add rk_cycles,#2
              andn rk_cycles,#7

              getword pa,rk_optmp2,#0
              'getct pb
              'subr rk_time,pb
              'debug("DMA byte ",udec(rk_time))
              djz pa,#.wrstop

.wrstart
              'getct rk_time
              setword rk_optmp2,pa,#0

              add rk_elapsed,rk_cycles
              sub rk_until_eol,rk_cycles
              sub rk_until_irq,rk_cycles
              mov rk_cycles,#0
              wrlong rk_elapsed,#elapsed_cycles
              tjs rk_until_eol,#.eol

              getbyte rk_ea,rk_optmp3,#3
              rolword rk_ea,rk_optmp1,#0

              push ##.wrloop
              altgw rk_dma_suboff,#rk_dma_bbptr
              getword pa
              incmod rk_dma_suboff,#3
              push pa
              jmp #rk_read8

(the optmp registers hold DMA parameters, the whole suboff/bbptr mechanism figures out the register handler to call - all register handlers are below $10000 so a word is sufficient as a pointer.)

Now I just need to pull ~80 more cycles out of my hat (~84 when we consider that this still needs a sub/tjs for HDMA timing?). Oh bother.

TonyB_ · 2024-04-21 13:24

A couple of thoughts:

I realised yesterday that I could read words from registers or RAM for both word and byte instructions and not worry about any non-zero bits above the msb provided I deal with the carry flag properly after addition or subtraction. This eliminates separate byte read instructions and also the need to zero high bits after sign-extending to 32 bits with SIGNX.

For your latest project how about no sound and throwing every cog you can at video and DMA?

Wuerfel_21 · 2024-04-21 13:30

@TonyB_ said:
A couple of thoughts:

I realised yesterday that I could read words from registers or RAM for both word and byte instructions and not worry about any non-zero bits above the msb provided I deal with the carry flag properly after addition or subtraction. This eliminates separate byte read instructions and also the need to zero high bits after sign-extending to 32 bits with SIGNX.

I constructed all the word reads as combining two byte reads. This is somewhat slow, but the side effects of memory access will be correct. Most registers have side effects. Also, I emulated open bus - unmapped memory will read stale values (usually ends up being the last byte of the instruction).
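A minimal Python sketch of that idea, with invented names: every access goes through one byte-read function, so register side effects fire once per byte and unmapped addresses return whatever was last on the bus. How the second address wraps is an assumption here and depends on the addressing mode in a real core.

```python
class Bus:
    """Toy memory bus with an open-bus latch (hypothetical structure)."""

    def __init__(self):
        self.open_bus = 0     # last byte driven onto the data bus
        self.handlers = {}    # address -> callable returning a byte

    def read8(self, addr):
        handler = self.handlers.get(addr)
        if handler is not None:
            self.open_bus = handler() & 0xFF   # side effects happen here
        return self.open_bus                   # unmapped: stale value

    def read16(self, addr):
        lo = self.read8(addr)
        hi = self.read8((addr + 1) & 0xFFFFFF)
        return lo | (hi << 8)


# Usage: a port that counts its own reads, next to an unmapped address.
reads = []

def port():
    reads.append(1)
    return 0xAA

bus = Bus()
bus.handlers[0x002139] = port
word = bus.read16(0x002139)   # high byte comes from open bus
```

Here the port's side effect runs exactly once and the unmapped high byte repeats the low byte, which a single 16-bit fetch from a flat array would not reproduce.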

For your latest project how about no sound and throwing every cog you can at video and DMA?

4 cogs doing the video pipeline conga (layer render -> mosaic+window+main/sub composite -> palette lookup+color math+fade -> output) seems to just work. Not sure if extra cogs would help with the DMA issue. No audio is no bueno. Aside from the idea of playing without sound being quite gormless, the game codes will hang without the audio CPU running. And the DSP fits in the same cog, so there's no reason not to have sound.

rogloh · 2024-04-21 13:40

Yeah, keep sound in. DMA emulation is a challenge if it has to modify data for different transfers on the fly and remap addresses. Sounds a bit like you will need to split it into all the different types and optimize each one in separate PASM2 paths accordingly, or come up with an alternative solution overall. Perhaps different small overlays could be read into COGRAM for different DMA actions on the fly if that buys anything vs HUBEXEC.

Wuerfel_21 · 2024-04-21 14:01

I know I can do it, but the obvious ways to make it fast aren't pretty and take a lot of work. So have to get clever.

The address remapping isn't technically part of the DMA. The DMA just writes data into registers every cycle. The VRAM registers have their behaviour controlled by the VMAIN register. Courtesy of snesdev wiki:

VMAIN - Video Port Control ($2115 write)

7  bit  0
---- ----
M... RRII
|    ||||
|    ||++- Address increment amount:
|    ||     0: Increment by 1 word
|    ||     1: Increment by 32 words
|    ||     2: Increment by 128 words
|    ||     3: Increment by 128 words
|    ++--- Address remapping: (VMADD -> Internal)
|           0: None
|           1: Remap rrrrrrrr YYYccccc -> rrrrrrrr cccccYYY (2bpp)
|           2: Remap rrrrrrrY YYcccccP -> rrrrrrrc ccccPYYY (4bpp)
|           3: Remap rrrrrrYY YcccccPP -> rrrrrrcc cccPPYYY (8bpp)
+--------- Address increment mode:
            0: Increment after writing $2118 or reading $2139
            1: Increment after writing $2119 or reading $213A

VMADD is the current VRAM address and is what gets incremented. The remapping happens between VMADD and the actual VRAM address bus.

This is implemented like this (I don't think I actually tested if it even works right). It's already optimized in that VMAIN is kept in cog RAM. (rk_memtmp1 is the address from VMADD)

              testb rk_ppu_vmain,#3 wc
              testb rk_ppu_vmain,#2 wz
              mov rk_memtmp2,rk_memtmp1
    if_not_00 shl rk_memtmp2,#3
              mov rk_memtmp3,rk_memtmp1
        if_01 shr rk_memtmp3,#5
        if_10 shr rk_memtmp3,#6
        if_11 shr rk_memtmp3,#7
        if_01 setq #$FF
        if_10 setq #$1FF
        if_11 setq rk_3ff
    if_not_00 muxq rk_memtmp1,rk_memtmp2
              setq #$07
              muxq rk_memtmp1,rk_memtmp3
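As a cross-check for the untested PASM above, here is a Python model of the same table from the wiki excerpt: each remap mode is a rotate-left-by-3 of the low 8, 9 or 10 address bits. The function names are made up for this sketch.

```python
VMAIN_INCREMENT_WORDS = (1, 32, 128, 128)   # indexed by VMAIN bits 1..0


def vmain_increment(vmain):
    """Word increment selected by the II field."""
    return VMAIN_INCREMENT_WORDS[vmain & 3]


def vmain_remap(vmadd, vmain):
    """Translate VMADD into the internal VRAM word address (RR field)."""
    mode = (vmain >> 2) & 3
    vmadd &= 0xFFFF
    if mode == 0:
        return vmadd                      # no remapping
    width = 7 + mode                      # 8, 9 or 10 low bits take part
    mask = (1 << width) - 1
    low = vmadd & mask
    # The top three bits of the field (YYY) move to the bottom.
    rotated = ((low << 3) | (low >> (width - 3))) & mask
    return (vmadd & ~mask) | rotated


print(hex(vmain_remap(0x1234, 0x04)))     # 2bpp remap of $1234
```

The shift amounts 5, 6 and 7 in the PASM correspond to `width - 3` here, and the $FF/$1FF/$3FF MUXQ masks correspond to `mask`.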

One obvious funny would be to create 4 separate implementations of ppu_write_vmdatal and ppu_write_vmdatah and modify the jump table when VMAIN is written. Maybe also pre-decode the increment. Also, VMADD can possibly be stored into cog RAM as well.

Though the actual slowest register handler I think is OAMDATA when OAMADD is in the small high OAM area. Though that's only 32 bytes long, so unless the code does something stupid, it's not so bad.

CGDATA I think is the fastest currently. The implementation is quite nice and short:

ppu_write_cgdata
              rdlong rk_memtmp1,#ppu_cgaddr ' also get latch
              'debug("CGDATA write ",uhex_byte(rk_memv),uhex_long(rk_pc,rk_memtmp1))
              testb rk_memtmp1,#0 wc
        if_nc setbyte rk_memtmp1,rk_memv,#2

              mov rk_memtmp2,rk_memtmp1
              add rk_memtmp2,#1
              bitl rk_memtmp2,#9 addbits 6

              wrlong rk_memtmp2,#ppu_cgaddr
        if_nc ret wcz

              rolbyte rk_memv,rk_memtmp1,#2
              shl rk_memv,#1
              mov rk_memtmp2,rk_memv ' blue in place
              shl rk_memv,#2
              setbyte rk_memtmp2,rk_memv,#3 ' red in place
              shr rk_memv,#5
              setbyte rk_memtmp2,rk_memv,#2 ' green
              and rk_memtmp2,##$F8F8F800
              getword rk_memtmp1,rk_memtmp1,#0
              shl rk_memtmp1,#1
              'debug("palette set ",uhex_long(rk_memtmp2,rk_memtmp1))
              add rk_memtmp1,##@palette_current - 2 ' minus 2 to compensate for set LSB
              wrlong rk_memtmp2,rk_memtmp1

              ret wcz
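For readers who do not speak PASM, the following Python sketch models what the handler appears to do, based on reading the listing: the first write is latched, the second combines both bytes into a 15-bit BGR colour and stores it as a 32-bit value with red, green and blue in the top three bytes. The class and function names are hypothetical.

```python
def bgr555_to_rgbx(color15):
    """Expand 0bbbbbgg gggrrrrr into RRRRR000 GGGGG000 BBBBB000 00000000."""
    r = color15 & 0x1F
    g = (color15 >> 5) & 0x1F
    b = (color15 >> 10) & 0x1F
    return (r << 27) | (g << 19) | (b << 11)


class CGRam:
    """Toy model of the write-twice CGDATA port."""

    def __init__(self):
        self.address = 0            # palette index, wraps at 256
        self.second_write = False   # latch state
        self.latch = 0              # low byte from the first write
        self.palette = [0] * 256    # converted 32-bit entries

    def write_cgdata(self, value):
        if not self.second_write:
            self.latch = value & 0xFF
            self.second_write = True
            return
        color15 = ((value << 8) | self.latch) & 0x7FFF
        self.palette[self.address] = bgr555_to_rgbx(color15)
        self.address = (self.address + 1) & 0xFF
        self.second_write = False
```

The PASM keeps the latch flag in bit 0 of the address word, so a single increment per write advances both; the model splits them into two fields for readability.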

With some setup, the source read could be replaced wholesale by an RDBYTE ptrb++, CMP and branch. (or ptrb-- for negative). The compare would need to handle running out of memory.
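A rough Python sketch of that fast path, under a deliberately simplified and hypothetical memory map (WRAM in banks $7E/$7F, a 32 KiB ROM mirrored into the upper half of every other bank): the source region is resolved once, bytes are then read straight from the backing buffer, and the generic per-byte decode is only used for whatever is left after the region runs out.

```python
WRAM = bytearray(128 * 1024)
ROM = bytes(i & 0xFF for i in range(32 * 1024))   # placeholder contents


def resolve(addr):
    """Return (buffer, index, bytes left before the bank wraps) or None."""
    bank, offset = addr >> 16, addr & 0xFFFF
    if bank in (0x7E, 0x7F):
        return WRAM, ((bank & 1) << 16) | offset, 0x10000 - offset
    if offset >= 0x8000:
        index = (((bank & 0x7F) << 15) | (offset & 0x7FFF)) % len(ROM)
        return ROM, index, 0x10000 - offset
    return None                                    # I/O etc.: not modelled


def read8(addr):
    """Generic path: full address decode on every access."""
    src = resolve(addr)
    return 0 if src is None else src[0][src[1]]


def dma_generic(addr, count, write_reg):
    bank, offset = addr & 0xFF0000, addr & 0xFFFF
    for _ in range(count):
        write_reg(read8(bank | offset))
        offset = (offset + 1) & 0xFFFF             # bank stays fixed


def dma_fast(addr, count, write_reg):
    src = resolve(addr)                            # decode once
    if src is None:
        return dma_generic(addr, count, write_reg)
    buf, index, room = src
    n = min(count, room)
    for value in buf[index:index + n]:             # the "RDBYTE ptrb++" loop
        write_reg(value)
    if n < count:                                  # ran out of region
        rest = (addr & 0xFF0000) | ((addr + n) & 0xFFFF)
        dma_generic(rest, count - n, write_reg)
```

The `min(count, room)` step plays the role of the compare mentioned above: it is what keeps the fast loop from running past the end of the resolved region.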

Rayman · 2024-04-21 14:25

Never heard of Atari Jaguar, but it is interesting. Mainly because the system and games are public domain, if I read it right.
I see the source code and ROM for tempest is on that link.
That's very nice.

Possible for P2 to emulate that?
Looks like came a couple years after Sega Genesis, so might be tough...

Wuerfel_21 · 2024-04-21 14:35

@Wuerfel_21 said:
Maybe also pre-decode the increment. Also, VMADD can possibly be stored into cog RAM as well.

Yea, squeezing VMADD and the pre-decoded increment into that same cog register is an obvious and easy win of quite some cycles in the VMDATAx handler.

@Rayman said:
Never heard of Atari Jaguar, but it is interesting. Mainly because the system and games are public domain, if I read it right.

You never heard of it because it flopped real hard. I don't think the games are supposed to be public domain, but they've been circulated online for ages and no one seems to care very much.

Possible for P2 to emulate that?
Looks like came a couple years after Sega Genesis, so might be tough...

Probably not. But given the tempest source code, it'd be possible to convert all the Jaguar specifics to native P2 equivalents (and possibly introduce a few improvements like restoring the missing music tracks).

Wuerfel_21 · 2024-04-21 15:08

it'd be possible to convert all the Jaguar specifics to native P2 equivalents

To illustrate that, consider this very simple Jaguar GPU routine (found near the end of donkey.gas) which is used to draw the background of the highscore screen (I think)

    MACRO ran           ;Sequence generator out of Graphics Gems    
    btst #0,\1
    jr z,\~noxortt
    shrq #1,\1      ;branch optimisation - the SHRQ is always done
    xor \2,\1
\~noxortt: nop
    ENDM


star_loop: ran xseed,xmask
    move xseed,px
    ran yseed,ymask
    move yseed,py       ;"random" XY star position
    add xdisp,px
    add ydisp,py        ;add XY offset passed in
    and andlim,px       
    and andlim,py       ;wrap to 0-255
    cmp maxx,px
    jump pl,(nopixl)    ;clip max X
    cmp maxy,py     ;no harm if this is done whatever
    jump pl,(nopixl)
    shlq #1,px      ;x to point at words
    mult linesize,py    ;offset in lines to bytes
    add px,py
    add scrbase,py      ;py now points at pixel
    storew starcol,(py) ;plot the star
no_pixl: subq #1,nstars
    jump ne,(starloop)  ;loop until nstars is 0
    nop
    movei #StopGPU,r0
    jump (r0)
    nop

Notice that this would translate very obviously into P2 instructions (except our assemblers for some reason lack macros) and we can even spot some easy optimizations:

star_loop
              ' "random" star X
              shr xseed,#1 wc
        if_c  xor xseed,xmask
              mov px, xseed
              ' "random" star Y
              shr yseed,#1 wc
        if_c  xor yseed,ymask
              mov py, yseed

              add px,xdisp
              add py,ydisp
              and px,andlim
              and py,andlim
              cmp px,maxx wc
        if_b  cmp py,maxy wc
        if_ae jmp #.no_pixl
              shl px,#1
              mul py,linesize
              add py,px
              add py,scrbase
              wrword starcol,py
.no_pixl 
              djnz nstars,#star_loop
              jmp #StopGPU
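Both versions of the `ran` step are a Galois-style shift register: shift right, and XOR in the mask if the bit that fell out was set. A small Python model follows; the seed and mask values are assumptions chosen for illustration (a well-known 16-bit maximal-length mask), since the actual `xmask`/`ymask` constants are not shown in the thread.

```python
def ran(seed, mask):
    """One step of the sequence generator (SHR ... WC / IF_C XOR)."""
    carry = seed & 1
    seed >>= 1
    if carry:
        seed ^= mask
    return seed


def period(seed, mask, limit=1 << 20):
    """Number of steps until the sequence returns to its starting seed."""
    start = seed
    for steps in range(1, limit + 1):
        seed = ran(seed, mask)
        if seed == start:
            return steps
    return None   # no cycle found within the limit


# Illustrative values only, not taken from the Tempest 2000 source.
EXAMPLE_MASK = 0xB400
EXAMPLE_SEED = 0xACE1
```

A seed of zero stays zero forever with this construction, so the seeds have to be initialised to something non-zero.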

Wuerfel_21 · 2024-04-21 16:25

Anyways, I think I got the average DMA transfer down to ~168 cycles by optimizing the register handlers, so just need 40 more cycles to fall off a truck

Wuerfel_21 · 2024-04-21 17:37

Though I just tried again and it seems the current DMA code is fast enough to not totally lose sync to the video (since these DMAs happen during Vblank, there's a few scanlines it can catch up on). I still want to make it properly fast enough.

Though maybe I'll look into something else to clear the mind. Maybe fix the mosaic effect.

Wuerfel_21 · 2024-04-21 23:43

I did try adding the mode 7 code, too. Interesting observations:

the versions with overflow check don't work correctly. I think there's a bug in the setup code.
at some point some of the video mode test ROMs I used earlier have stopped working correctly.

Wuerfel_21 · 2024-04-22 19:46

MisoYume Progress Report:

It appears that games start and stop working by some mysterious force (presumably memory shifting when I add or remove code???). Should look into that. Also Pocky & Rocky is still broken, even though I fixed the IRQ problem, or so I think. Might have to look back into that. The attract demo runs fine-ish. Mario World is now super corrupted.

However, I have implemented H-DMA and I'm very close to F-Zero being fully playable/enjoyable. Just need to figure out the Mode 7 address generator issues so that the cars are actually on the track.

Wuerfel_21 · 2024-04-22 22:13

Here it is! Alpha 03

Compatible games currently:

F-Zero: Works great! I ran an entire grand prix, very cromulent.
Plok: Works great!
Castlevania Dracula X: Works but interesting color math bug
Super Mario World: Works but garbled graphics ingame
Super Ghouls & Ghosts: Works but garbled graphics ingame

(Obviously "works great" still includes minor graphics oddities, but I know how to fix those)

Wuerfel_21 · 2024-04-22 22:52

@Wuerfel_21 said:
Here it is! Alpha 03

Compatible games currently:

Castlevania Dracula X: Works but interesting color math bug

Easy fix, sprites were using BG3's source ID. Classic SKIPF table blunder. Change PSKIP_MAIN_SPR = %11_1_11110_110100 to PSKIP_MAIN_SPR = %11_1_11110_111100

Rayman · 2024-04-22 22:52

Congrats!
Very specific clock freq:
_CLKFREQ = 343636320

NeoYume depends on video type? Did you say it was 336 MHz for VGA?

Wuerfel_21 · 2024-04-22 22:58

It's just 96x NTSC color burst.

NeoYume has a constant clock, which is 14x the 24.something MHz AES master clock. Ends up something like 338 MHz.

MegaYume depends on video mode but is in the vicinity of 328 MHz.

Rayman · 2024-04-22 23:32

Hopefully I can find time to try this very soon...
Be interesting to see how my boards fare...

BTW: Are you thinking about adding an HDMI with sound option to the emulators?
I imagine that's why you got HDMI with sound working, right?

Wuerfel_21 · 2024-04-22 23:58

@Rayman said:
Hopefully I can find time to try this very soon...
Be interesting to see how my boards fare...

Been living most our lives, in a voltage regulator's paradise.

Attached a binary I built. It ran once for about 5 seconds. And this is on the nice one where NeoYume mostly works. Maybe you have an even nicer one to play on. F-Zero game must be at /MISOYUME/FZEROU.SFC

EDIT: pulled out one of the older ones (with the bad audio) and it managed to get halfway across mute city until the USB driver crashed.

BTW: Are you thinking about adding an HDMI with sound option to the emulators?
I imagine that's why you got HDMI with sound working, right?

Yea, that was the idea. Just have not gotten to it.

Rayman · 2024-04-23 00:12

@Wuerfel_21 said:
Attached a binary I built. It ran once for about 5 seconds.

Ouch. Maybe I'll have to explore 6-layer boards after all... Hope not...

Actually, 5 seconds is promising... Maybe just a cooler or starting frozen can fix that....

Wuerfel_21 · 2024-04-23 00:30

@Rayman said:

@Wuerfel_21 said:
Attached a binary I built. It ran once for about 5 seconds.

Ouch. Maybe I'll have to explore 6-layer boards after all... Hope not...

Actually, 5 seconds is promising... Maybe just a cooler or starting frozen can fix that....

5 seconds once. It really has a problem getting started. I want to blame insufficient regulation of the 1.8v rail, but I have nothing to back that up.

I tried another board (the one that's missing the audio IC) and that made it all the way through the knight league - how curious?
