Console Emulation

Wuerfel_21 · 2021-12-20 02:57

@rogloh said:
So how full is the COG+LUT RAM now @Wuerfel_21 ?

Not sure (about to go to bed, it's like 4 am), but I think there's a decent bit of space left in both and the only things left to go in there are some I/O related bits, interrupt polling and, of course, the external ROM interface (and related opcode queue code). But for the ROM interface I'm already using a workalike (i.e. this is the same interface I plan for the PSRAM/HyperRAM bits):

mk_readrom_ea ' read single long, offset such that the requested
              ' address ends up at mk_romio_area
              mov pb,mk_effaddr
              mov mk_romio_length,#1
              rczr pb wcz
              mov mk_romio_target,mk_romio_area_ptr
        if_c  sub mk_romio_target,#2
        if_z  sub mk_romio_target,#1
mk_readrom    ' arbitrary block read
              shl pb,#2
              zerox pb,#15 ' <- ROM SIZE HERE
              add pb,##fake_rom
              mov mk_memtmp1,mk_romio_target
              debug("ROM read from ",uhex_long(pb)," to ",uhex_long(mk_romio_target))
              rep @.readrom,mk_romio_length
              rdlong mk_memtmp0,pb
              add pb,#4
              wrlong mk_memtmp0,mk_memtmp1
              add mk_memtmp1,#4
.readrom
              ret wcz

Note that opcode fetch is currently very primitive though, no queue and doesn't go through the actual ROM interface because uhhhh. So that might consume quite a couple longs.

... but if the space is not enough, I can simply move a few more opcodes to hub (or inversely, bring some opcodes or addressing modes into cog/lut (as of now, all addressing modes that are not register direct or simple (An) are a hub call))

rogloh · 2021-12-20 03:16

Ok, it should be possible to convert/hook that type of thing into the existing PSRAM driver at least for initial testing.

One thing you might find is that if you can cache some code snippets read from external RAM into a (smaller) simulated ROM area stored in HUB RAM you might get better performance reading from that block whenever you can, rather than making lots of individual accesses to the external memory. There's probably some scope there for some interesting performance improvements by playing with the burst sizes read and see how much you gain from the latency savings vs the overhead of checking addresses fall within a range already available in HUB. Or you can just try the individual random reads and compare those too.
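
To get a feel for the trade-off described above, here is a minimal Python model (not PASM2) of a single aligned block of external memory mirrored in hub RAM. `BlockCache`, `block_bytes` and `external_reads` are names made up for this illustration; it does not use the real driver interface and ignores actual latency figures.

```python
class BlockCache:
    """Toy model: one aligned block of external memory mirrored in hub RAM."""

    def __init__(self, backing: bytes, block_bytes: int = 32):
        self.backing = backing          # stands in for the external PSRAM
        self.block_bytes = block_bytes  # burst size (assumed: 32 bytes = 16 words)
        self.base = None                # start address of the cached block
        self.block = b""
        self.external_reads = 0         # number of bursts that were needed

    def read_word(self, addr: int) -> int:
        base = addr - (addr % self.block_bytes)
        if base != self.base:           # miss: fetch the whole aligned block
            self.block = self.backing[base:base + self.block_bytes]
            self.base = base
            self.external_reads += 1
        off = addr - base
        # 68k data is big-endian
        return int.from_bytes(self.block[off:off + 2], "big")


rom = bytes(range(64))
cache = BlockCache(rom)
words = [cache.read_word(a) for a in range(0, 34, 2)]  # 17 sequential word reads
print(cache.external_reads)  # 2 bursts instead of 17 single reads
```

Whether this wins in practice depends on how the per-access range check compares to the latency of an individual external read, which is exactly what would need measuring.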

Wuerfel_21 · 2021-12-20 14:54

Dealing with data cache/prefetch seems a bit... ehhhh.

Caching repeated access to the same address would be easy enough, but how often does that happen?

Code of course will be fetched in bigger blocks, maybe 16 words at a time? That should be big enough to eliminate code reads in hot loops and speed up short branches.

Also, current memory usage is as such: Registers up to $1D4 and LUT up to $30f. So basically, 3/4 full.

Wuerfel_21 · 2021-12-20 16:14

Also, current idea for interrupt implementation is to use JSEx instructions to check for lock changes before each instruction. Only need two of them because on the megadrive, there are only really two interrupt sources: VBlank and scanline counter interrupts from the VDP. There's technically also an external interrupt line for peripherals to use, but that's kinda out-of-scope.

rogloh · 2021-12-20 23:26

@Wuerfel_21 said:
... Code of course will be fetched in bigger blocks, maybe 16 words at a time? That should be big enough to eliminate code reads in hot loops and speed up short branches.

16 words read in at a time seems like a good place to start, given we can clock in at 320MB/s @ 320MHz. Can be tweaked further if required, eg. 8 or 32 etc (maybe nibble masked addresses will be fastest for testing given the getnib/setnib opcodes in the P2). A proper I-cache is hopefully not required to achieve some decent performance for this emulator, although for some it could be an interesting thing to examine if you ever wanted to execute directly from PSRAM in general.

@Wuerfel_21 said:
Also, current memory usage is as such: Registers up to $1D4 and LUT up to $30f. So basically, 3/4 full.

Impressive, looks like it should fit nicely in the end.

Wuerfel_21 · 2021-12-20 23:59

@rogloh said:

@Wuerfel_21 said:
... Code of course will be fetched in bigger blocks, maybe 16 words at a time? That should be big enough to eliminate code reads in hot loops and speed up short branches.

16 words read in at a time seems like a good place to start, given we can clock in at 320MB/s @ 320MHz. Can be tweaked further if required, eg. 8 or 32 etc (maybe nibble masked addresses will be fastest for testing given the getnib/setnib opcodes in the P2). A proper I-cache is hopefully not required to achieve some decent performance for this emulator, although for some it could be an interesting thing to examine if you ever wanted to execute directly from PSRAM in general.

Got that code block fetching system implemented now. Size can be configured to any even number of words and checking if a branch is inside the currently cached block is simple - we already need to keep track of how many words are left, so simply subtract the branch displacement from that counter and check if it is still in the valid range (which, since the valid range is 0..MK_ROMQUE_MAX, only requires a single unsigned compare).
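
The single-compare trick described above can be modelled in a few lines of Python. This is only a sketch: the value of `MK_ROMQUE_MAX`, the function name and the convention that displacements are counted in words are assumptions for illustration, not taken from the actual emulator source.

```python
MK_ROMQUE_MAX = 16      # words per fetched block (assumed value)
MASK32 = 0xFFFFFFFF     # model a 32-bit cog register


def branch_stays_in_block(words_left: int, disp_words: int) -> bool:
    """True if a branch by disp_words lands inside the currently cached block.

    words_left counts the words remaining ahead of the current position.
    A forward branch uses words up, a backward branch gives them back.
    """
    new_left = (words_left - disp_words) & MASK32  # wraps like a register
    # A negative result wraps to a huge unsigned value, so one unsigned
    # compare rejects both "past the end" and "before the start".
    return new_left <= MK_ROMQUE_MAX


print(branch_stays_in_block(10, 4))    # short forward branch
print(branch_stays_in_block(10, 11))   # runs off the end of the block
print(branch_stays_in_block(10, -7))   # reaches back before the block start
```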

Wuerfel_21 · 2021-12-21 00:00

@rogloh said:

@Wuerfel_21 said:
Also, current memory usage is as such: Registers up to $1D4 and LUT up to $30f. So basically, 3/4 full.

Impressive, looks like it should fit nicely in the end.

Mind you, there's a significant amount of hub code. Only what's really needed is in cog/lut right now.

rogloh · 2021-12-21 00:48

It will be good to simply try my existing PSRAM driver as is with your pre-fetching support to see if that has any hope of working without it being directly coupled to your COG. At 320MHz it can probably get close to 1us per request, so 16 words (32 bytes) per request is then 32MB/s, and some of it can potentially be done in parallel to your emulator code running (so not sure how that translates to final 68k MIPS, it depends on branches and wasted reads).
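
The 32MB/s figure is just block size divided by request time. As a quick check in Python (the function name is made up here, and the 1us request time is the rough estimate from the post, not a measurement):

```python
def block_throughput_mb_per_s(words_per_block: int, request_time_us: float) -> float:
    """Sustained throughput when every request fetches one block."""
    bytes_per_block = words_per_block * 2       # a 68k word is 2 bytes
    return bytes_per_block / request_time_us    # bytes per microsecond == MB/s


print(block_throughput_mb_per_s(16, 1.0))  # 32.0
```

With the same assumed request time, 8-word blocks would give 16MB/s and 32-word blocks 64MB/s, but only as long as the request time really stays constant and the extra words are not wasted on branches.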

I have attached some sample code showing a simple PSRAM config for the new P2EDGE and how the mailbox could be used from PASM2 for block reads/writes, but there are far more commands the mailbox can use than are shown here. Memory addresses can also be mapped differently in the banks if the address space used is to be shared with HUB RAM (sometimes it is useful to be able to do that and to reserve 0-16MB for indicating HUB addresses, other times not). This PASM2 demo already works with my driver.

{
 Propeller 2 PSRAM demo (PASM)
 =============================

 This software contains a simple demo showing how a PASM COG can use the PSRAM driver 
 without requiring the overhead of the complete SPIN2 based memory driver as well.

 The driver is initialized and a PASM COG then accesses the PSRAM with direct mailbox
 access using the burst write and read commands to transfer data.

 No QoS policy is used, so any COG can access the memory without prioritization.

 Run this with DEBUG mode enabled.
}
'----------------------------------------------------------------------- 

CON
    _clkfreq = 160000000

    DEBUG_BAUD = 115200

    MAXBURST = 512  ' set to a suitable device burst size & keep under maximum CS low time of 8us
    DELAY    = 8    ' set to an input delay suitable for this P2 clock frequency (from 0-15)
    ADDRSIZE = 25   ' number of address bits used in 32MB of PSRAM

    ' P2 EDGE PSRAM pin mappings
    DATABUS  = 40
    CLK_PIN  = 56
    CE_PIN   = 57

OBJ
    psram : "psramdrv"


PUB main() | driverAddr
    ' patch in the proper HUB addresses for Propeller Tool (redundant for FlexSpin)

    long[@startupData][5]:=@deviceData
    long[@startupData][6]:=@qosData
    long[@startupData][7]:=@mailboxes

    ' get the address of the PSRAM memory driver so we can start it

    driverAddr:= psram.getDriverAddr()

    ' start the PSRAM memory driver and wait for it to complete initialization

    coginit(NEWCOG, driverAddr, @startupData)
    repeat until long[@startupData] == 0 

    ' now just continue running as the PSRAM reader cog, pass mailbox base address as argument

    coginit(cogid(), @reader, @mailboxes)

DAT
    orgh 

'----------------------------------------------------------------------- 
' Reader Cog PASM2 code entry point
'----------------------------------------------------------------------- 

reader 
            org     0

            cogid   pb              'get COG ID
            mov     pa, #12     
            mul     pa, pb          'scale by 12 bytes per mailbox (3 longs)
            add     ptra, pa        'compute real mailbox start address for this COG

            add     msgaddr, ptrb   'determine real HUB RAM location of the test message

            'write the test message into PSRAM
            'NOTE: the setq burst write method used here can only be used without interruption and 
            'relies on these sequential addresses being written in order each clock cycle before the 
            'RAM driver poller can read mailbox data that is incomplete.  If you are using the streamer
            'or if interrupts could somehow delay the burst write part way through this would not work
            'and you would need to ensure you write the first mailbox long after the other two longs.

            setnib  addr, #%1111, #7'include the write burst command in the cmd+address parameter
            setq    #3-1            'write 3 longs to mailbox in a burst (can do this only without interruption)
            wrlong  addr, ptra      'trigger the write burst to external memory
pollwrite   rdlong  pa, ptra wcz    'check for the result
    if_c    jmp     #pollwrite      'wait until done or error
    if_nz   jmp     #error          'error check (optional but useful if you encounter setup problems)

            'read the message back to a new address (just using this COG's own HUB space as scratch buffer)

            setnib  addr, #%1011, #7'setup read burst command in the cmd+address parameter
            mov     msgaddr, ptrb   'update destination hub address to COG's scratch area
            setq    #3-1            'write 3 longs to mailbox in a burst (can do this only without interruption)
            wrlong  addr, ptra      'trigger the read burst command
pollread    rdlong  pa, ptra wcz    'get the result
    if_c    jmp     #pollread       'wait until done or error
    if_nz   jmp     #error          'error check (optional but useful if you encounter setup problems)

            'display the message we just read back with DEBUG statements

            DEBUG   (ZSTR(ptrb))    'print string we just read
            cogstop pb              'stop here

            'if an error occurred, display the error code to help debug code

error       DEBUG   ("Test failed, error code=-", SDEC_LONG_(pa), 13, 10)
            cogstop pb              'stop here

' 3 long structure to be written to mailbox
addr        long    $__0abcdef          ' command & some address in external memory
msgaddr     long    message - reader    ' HUB source/destination address for burst
msglen      long    msgend - message    ' length in bytes

            fit     502


'----------------------------------------------------------------------- 
        orgh

' data to be passed to driver when starting it
startupData
    long    _clkfreq    ' use current frequency
    long    0           ' optional flags
    long    0           ' reset pin mask on port A for PSRAM (none)
    long    0           ' reset pin mask on port B for PSRAM (none)
    long    DATABUS     ' PSRAM data bus start pin
    long    deviceData  ' address of devices data structure in HUBRAM
    long    qosData     ' address of QoS data structure in HUBRAM
    long    mailboxes   ' address of mailbox structure in HUBRAM

deviceData
    ' 16 bank parameters follow
    long    (MAXBURST << 16) | (DELAY << 12) | (ADDRSIZE - 1)   ' bank 0
    long    (MAXBURST << 16) | (DELAY << 12) | (ADDRSIZE - 1)   ' bank 1
    long    0[14]                                               ' bank 2-15
    ' 16 banks of pin parameters follow
    long    (CLK_PIN << 8) | CE_PIN                             ' bank 0 
    long    (CLK_PIN << 8) | CE_PIN                             ' bank 1 
    long    -1[14]                                              ' bank 2-15

qosData 
    long    $FFFF0000       ' cog 0 default QoS parameters
    long    $FFFF0000       ' cog 1 default QoS parameters
    long    $FFFF0000       ' cog 2 default QoS parameters
    long    $FFFF0000       ' cog 3 default QoS parameters
    long    $FFFF0000       ' cog 4 default QoS parameters
    long    $FFFF0000       ' cog 5 default QoS parameters
    long    $FFFF0000       ' cog 6 default QoS parameters
    long    $FFFF0000       ' cog 7 default QoS parameters


mailboxes
    long    0[8*3]          ' 3 longs per mailbox per COG


message byte "This message is coming at you today all the way from PSRAM!", 0
msgend  byte 0

{{
-------------
LICENSE TERMS
-------------
Copyright 2020, 2021 Roger Loh

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in 
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
THE SOFTWARE.
}}

Wuerfel_21 · 2021-12-21 02:30

Yeah, that looks doable.

Unrelatedly, I've just completed implementing all the instructions. Probably a million bugs, but I can't find a test program that can run from ROM so uhhhhh, let's not think about that too much. Next stop: hooking this up to the VDP so I can run some simple test programs

rogloh · 2021-12-21 03:15

Ok let me know when you need a PSRAM driver. I do intend to get this out before xmas and take a little break.

Wuerfel_21 · 2021-12-21 18:36

Status report: got video driver and VDP integrated into the binary and set up (using flexspin to compile the Spin code into high memory). Also, a simple "Hello World" ROM seems to end up at the correct STOP opcode (end of program, as opposed to an exception handler or getting stuck on some inappropriately decoded opcode), so I presume that when the VDP register interface is working, there'll be some letters on screen.

TonyB_ · 2021-12-21 20:03

Any chance of seeing the 68K emulator PASM2 code?

Wuerfel_21 · 2021-12-21 20:40

When it is done and working.

Note that there's a bunch of SEGA-isms in the code (memory map, broken TAS (though I know it's broken on Amiga, too. Not sure about other 68k systems), assumption that vector table is in ROM, etc), so you'd need to rework it a bit to use it for a different purpose.

Wuerfel_21 · 2021-12-21 22:34

Semi-relatedly, anyone got any idea of how colors in %0000_BBB0_GGG0_RRR0 CRAM format could be expanded to RGB24 at decent speed? I guess there's the nuclear option of precomputing a big table (would only be 8K!)... That could also allow emulating an accurate luma curve (real VDP DAC is slightly nonlinear), though that might look goofy with the shadow/highlight effect (which happens on the final RGB values)

Wuerfel_21 · 2021-12-22 00:17

Well, after tracking down a stupid issue wherein I fumbled TEST and TESTB....

ＢＥＨＯＬＤ

rogloh · 2021-12-22 00:53

Well done. All coming together now, and the external memory driver is now available for you too when you need it.

Wuerfel_21 · 2021-12-22 13:56

Cool register description...

In other words, owie ouch the interrupts hurt my head

Wuerfel_21 · 2021-12-22 14:02

In particular, I'm at the point where this simple and contrived program works.... BUT ONLY IN DEBUG MODE???? And without DEBUG, it's solid blue??????

Something something timing, I fear. Or maybe the debugger messes stuff up.

entry:
    move.w #$E0E,$FFFFFFE0.w ; init color variable (magenta)
    move #$2500,sr ; enable interrupts > 5 (VBlank)
    move.w #$8170,$C00004 ; enable VINT in VDP
spinlock:
    bra spinlock

vint_handler:
    move.l #$C0000000,$C00004 ; setup CRAM write
    move.w $FFFFFFE0.w,$C00000
    add.w #2,$FFFFFFE0.w
    rte

Wuerfel_21 · 2021-12-22 14:29

Yes, it is timing. If I insert a waitx #174 (and not one cycle less!) into the instruction loop, it works in non-debug mode....

So, I am sending an interrupt by pulsing (locking and immediately releasing) a lock inside the VDP. This is picked up by one of the event channels of the 68k cog and, before loading the next instruction, is checked using JSE2. How would timing differences cause this to break after one iteration??? ($E0E (magenta) + 2 = $E10 (blue)) The 68k code never touches the lock, it only ever polls the SE2 flag and only in the instruction fetch code. And I can verify that the VDP is still getting and releasing the lock just fine.

TonyB_ · 2021-12-22 15:28

@Wuerfel_21 said:
Semi-relatedly, anyone got any idea of how colors in %0000_BBB0_GGG0_RRR0 CRAM format could be expanded to RGB24 at decent speed? I guess there's the nuclear option of precomputing a big table (would only be 8K!)... That could also allow emulating an accurate luma curve (real VDP DAC is slightly nonlinear), though that might look goofy with the shadow/highlight effect (which happens on the final RGB values)

What is "decent speed"? Presumably this is for converting the palette data to P2 format as each palette word is loaded/written?

Converting
%0000__B2B1B0_0__G2G1G0_0__R2R1R0_0
to say
%R2R1R0_R2R1R0_R2R1__G2G1G0_G2G1G0_G2G1__B2B1B0_B2B1B0_B2B1__00000000
looks pretty horrible.

A table in hub RAM seems best option in terms of speed, saving cog/LUT space and emulating non-linearity.
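
As a rough illustration of the table approach, here is a Python sketch that builds the lookup table using the linear bit-replication mapping shown above. Indexing by the CRAM word shifted right by one (2048 entries of 4 bytes, matching the 8K estimate) is an assumption, as are all function names. The non-linear DAC curve is not modelled.

```python
def expand3(v: int) -> int:
    """Expand a 3-bit channel to 8 bits by bit replication (abc -> abcabcab)."""
    return (v << 5) | (v << 2) | (v >> 1)


def cram_to_rgb24(word: int) -> int:
    """Convert %0000_BBB0_GGG0_RRR0 to %RRRRRRRR_GGGGGGGG_BBBBBBBB_00000000."""
    r = (word >> 1) & 7
    g = (word >> 5) & 7
    b = (word >> 9) & 7
    return (expand3(r) << 24) | (expand3(g) << 16) | (expand3(b) << 8)


# Bit 0 of a CRAM word is always zero, so index by word >> 1: 2048 entries.
TABLE = [cram_to_rgb24(i << 1) for i in range(2048)]


def lookup(word: int) -> int:
    """Table lookup as the palette-write path would do it."""
    return TABLE[(word & 0x0EEE) >> 1]


print(hex(lookup(0x0E0E)))   # magenta
print(len(TABLE) * 4)        # table size in bytes
```

Swapping `expand3` for a measured DAC curve would change only the table contents, not the lookup.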

Wuerfel_21 · 2021-12-22 16:46

Yea, table is what I went with.

Unrelatedly, have not fixed the interrupt issue, but at least narrowed it down: it seems that for some reason it ends up doing a branch to an odd address, triggering an address error and getting stuck in that. That explains why it doesn't acknowledge any more interrupts, but how it gets into that situation and why it is timing-sensitive is still mystifying me.

Wuerfel_21 · 2021-12-22 17:22

Okay, I think I got it? I was reading the PC from the wrong address for RTE (incremented SP correctly after reading SR, but didn't move it into EA register again). Don't ask me how that is timing-sensitive though.
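
For reference, a small Python model of what RTE has to do with the short 68000 exception frame (SR word at SP, return PC long at SP+2). The separate `ea` variable mimics an effective-address register that must be refreshed after SP moves. This is an illustration of the fix as described, not the emulator's actual code, and all names are made up.

```python
def read_be(mem: bytes, addr: int, size: int) -> int:
    """Read a big-endian value of `size` bytes from `mem`."""
    return int.from_bytes(mem[addr:addr + size], "big")


def rte(mem: bytes, sp: int):
    """Pop SR and PC from a short 68000 exception frame; returns (sr, pc, sp)."""
    ea = sp
    sr = read_be(mem, ea, 2)   # status register word at SP
    sp += 2
    ea = sp                    # refresh EA after moving SP (the missing step)
    pc = read_be(mem, ea, 4)   # return address long at the updated SP
    sp += 4
    return sr, pc, sp


stack = bytearray(32)
stack[0x10:0x12] = (0x2500).to_bytes(2, "big")       # saved SR
stack[0x12:0x16] = (0x00000200).to_bytes(4, "big")   # saved PC
sr, pc, sp = rte(bytes(stack), 0x10)
print(hex(sr), hex(pc), hex(sp))
```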

Wuerfel_21 · 2021-12-23 00:50

Since I'll be away for a couple days, I guess this is as good as any occasion to release... something.

Introducing MegaYume version alpha 0.0.0, the terrible emulator that is entirely borken! Applause!

Not with that attitude

Features include:
- VGA graphic (just the one)
- USB keyboard input
- "Plays" Pong (it does not know the rules of Pong)
- bugs
- non-functioning spaghetti code 68000 emulator which is now called "MotoKore" because I had to figure out what mk_ actually stands for, lol. Just claim it's a mortal kombat reference or smth idk idgaf.
- LED debugging nonsense on pins 38/39 that I was too lazy to edit out

To build, use the included build.sh (or read it and just do the same thing on the terminal) and then load megayume_lower.binary. Needs a reasonably recent flexspin, but for a change from my usual releases, I don't think you need the bleeding edge version.

Keyboard is mapped as follows:

Enter -> START
X -> A
C -> B
V -> C
Directions are on arrow keys

There's also a few other contrived test ROMs bundled in - just scroll to the bottom of megayume_lower.spin2 and you'll find where the ROM file is included.

Wuerfel_21 · 2021-12-23 01:58

Ok, so to make the paddle not be.. like that, change the branch in line 193 from mk_cmp_common to mk_sub_common. Something something confusing parameter order.

rogloh · 2021-12-23 02:44

Great, I'll have a browse through this, maybe I can port it to my Voyager board too for fun.

Wuerfel_21 · 2021-12-23 13:23

Wuerfel_21 · 2021-12-23 14:45

Ok, so.... can you spot the problem(s)?

"But Ada, what does MOVE USP have to do with the aforementioned problem?" you say. Well...
As you may be able to tell, there's two mistakes here:

- direction test is wrong (should test bit 3, but tests bit 0 xor bit 1 instead)
- Address registers are indexed starting from A7

So what is supposed to be a MOVE A6,USP is interpreted as MOVE USP,A13. Now, A13 doesn't exist, so it writes the USP into the jump table entry for opcodes $3xxx (MOVE.W) instead. The USP is zero, so it writes zero.
Now, the next time a MOVE.W executes, the contents of D0 onwards will be executed as P2 instructions. D0 is zero (-> NOP), but D1 contains $0000FFFF which disassembles to _RET_ ROL $07F,INB. INB on my setup happens to be $FFFF_FF30, so it rotates the instruction at $07F by 16 bits and returns. As seen above, that instruction is the tailcall through mk_writef for MOVE.B instructions. But after the corruption it isn't a branch anymore, so execution falls through into the MOVE.W implementation, where an address error is triggered due to an odd address.
This leads to an infinite loop of address errors, because people writing these test ROMs don't seem to realize that you can't simply RTE from a bus or address error due to the special stack frame (will end up branching to the fault address, which in case of address error will always be odd, causing another address error in an infinite loop (until you run the SP out of RAM))
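
The two decode mistakes listed above can be reproduced in a small Python model. The 68000 encodes MOVE USP as $4E60 + (direction << 3) + register, with direction 0 meaning An to USP. The buggy variant below is a reconstruction from the description in this post, not the original PASM2, and the function names are made up.

```python
def decode_move_usp(opcode: int):
    """Decode a MOVE USP opcode ($4E60-$4E6F) into (source, destination)."""
    if opcode & 0xFFF0 != 0x4E60:
        raise ValueError("not a MOVE USP opcode")
    reg = opcode & 7
    usp_to_an = bool(opcode & 8)       # bit 3: 0 = An -> USP, 1 = USP -> An
    return ("USP", f"A{reg}") if usp_to_an else (f"A{reg}", "USP")


def decode_move_usp_buggy(opcode: int):
    """Same decode with both mistakes described in the post."""
    reg = (opcode & 7) + 7                           # indexed starting from A7
    usp_to_an = bool((opcode ^ (opcode >> 1)) & 1)   # bit 0 xor bit 1, not bit 3
    return ("USP", f"A{reg}") if usp_to_an else (f"A{reg}", "USP")


print(decode_move_usp(0x4E66))        # ('A6', 'USP')
print(decode_move_usp_buggy(0x4E66))  # ('USP', 'A13')
```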

Wuerfel_21 · 2021-12-23 14:56

To that, a new ZIP.

Changes:
- fixed aforementioned bugs
- removed pin 38/39 led stuff
- included some more ROMs (check out wintest.bin and sampler.bin for some interactivity)

Wuerfel_21 · 2021-12-26 23:07

So well, I'm back.

So, I've
a) started implementing DMA
b) tried some more ROMs

So, it transpires that Flicky actually already sorta does something. The SEGA logo, title screen and instruction screen work, but in-game the map is corrupted and it hangs immediately after Flicky comes out of the door. Oddly enough, the newer version with the DMA stuff seems to work less (no roof and floor, immediate crash), even if I disable it... odd.

Wuerfel_21 · 2021-12-26 23:40

So, it seems that shifting the VDP I/O code around in hub affects which of the two results (immediate crash, no roof/floor/score vs. crash after entry animation) happens. Oh, this is going to be another painful one, isn't it?
