Console Emulation - Page 5 — Parallax Forums

Console Emulation


Comments

  • Wuerfel_21 Posts: 5,105
    edited 2021-12-20 03:03

    @rogloh said:
    So how full is the COG+LUT RAM now @Wuerfel_21 ?

    Not sure (about to go to bed, it's like 4 am), but I think there's a decent bit of space left in both and the only things left to go in there are some I/O related bits, interrupt polling and of course, the external ROM interface (and related opcode queue code). But for the ROM interface I'm already using a workalike (i.e. this is the same interface I plan for the PSRAM/HyperRAM bits):

    mk_readrom_ea ' read single long, offset such that the requested
                  ' address ends up at mk_romio_area
                  mov pb,mk_effaddr
                  mov mk_romio_length,#1
                  rczr pb wcz
                  mov mk_romio_target,mk_romio_area_ptr
            if_c  sub mk_romio_target,#2
            if_z  sub mk_romio_target,#1
    mk_readrom    ' arbitrary block read
                  shl pb,#2
                  zerox pb,#15 ' <- ROM SIZE HERE
                  add pb,##fake_rom
                  mov mk_memtmp1,mk_romio_target
                  debug("ROM read from ",uhex_long(pb)," to ",uhex_long(mk_romio_target))
                  rep @.readrom,mk_romio_length
                  rdlong mk_memtmp0,pb
                  add pb,#4
                  wrlong mk_memtmp0,mk_memtmp1
                  add mk_memtmp1,#4
    .readrom
                  ret wcz
    

    Note that opcode fetch is currently very primitive though, no queue and doesn't go through the actual ROM interface because uhhhh. So that might consume quite a couple longs.

    ... but if the space is not enough, I can simply move a few more opcodes to hub (or inversely, bring some opcodes or addressing modes into cog/lut (as of now, all addressing modes that are not register direct or simple (An) are a hub call))
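The address arithmetic in mk_readrom_ea above can be modelled in a few lines. This is only a Python sketch of what the PASM2 appears to do, not the emulator's code: the function name and the sample addresses are made up for illustration, and the final `add pb,##fake_rom` (which turns the offset into a hub address) is left out.

```python
ROM_SIZE_BITS = 16  # mirrors `zerox pb,#15`: keep bits 15..0, i.e. a 64 KB window


def split_rom_address(effaddr, area_ptr):
    """Model of the single-long read setup (hypothetical helper name).

    Returns (long-aligned ROM offset, destination pointer).
    """
    # `rczr pb wcz` shifts the two low address bits out into C (bit 1) and Z (bit 0).
    byte_in_long = effaddr & 3
    # `shl pb,#2` restores a long-aligned address, `zerox` wraps it to the ROM size.
    rom_offset = (effaddr & ~3) & ((1 << ROM_SIZE_BITS) - 1)
    # `if_c sub ...,#2` and `if_z sub ...,#1` move the destination back by the same
    # amount, so the requested byte lands exactly at area_ptr.
    target = area_ptr - byte_in_long
    return rom_offset, target


if __name__ == "__main__":
    offset, target = split_rom_address(0x12346, 0x1000)
    print(hex(offset), hex(target))
```

The point of the offset trick is that the caller can always read from mk_romio_area without caring how the requested address was aligned.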

  • Ok, it should be possible to convert/hook that type of thing into the existing PSRAM driver at least for initial testing.

    One thing you might find is that if you can cache some code snippets read from external RAM into a (smaller) simulated ROM area stored in HUB RAM you might get better performance reading from that block whenever you can, rather than making lots of individual accesses to the external memory. There's probably some scope there for some interesting performance improvements by playing with the burst sizes and seeing how much you gain from the latency savings vs the overhead of checking whether addresses fall within a range already available in HUB. Or you can just try the individual random reads and compare those too.

  • Dealing with data cache/prefetch seems a bit... ehhhh.

    Caching repeated access to the same address would be easy enough, but how often does that happen?

    Code of course will be fetched in bigger blocks, maybe 16 words at a time? That should be big enough to eliminate code reads in hot loops and speed up short branches.

    Also, current memory usage is as such: Registers up to $1D4 and LUT up to $30f. So basically, 3/4 full.

  • Also, current idea for interrupt implementation is to use JSEx instructions to check for lock changes before each instruction. Only need two of them because on the megadrive, there are only really two interrupt sources: VBlank and scanline counter interrupts from the VDP. There's technically also an external interrupt line for peripherals to use, but that's kinda out-of-scope.

  • @Wuerfel_21 said:
    ... Code of course will be fetched in bigger blocks, maybe 16 words at a time? That should be big enough to eliminate code reads in hot loops and speed up short branches.

    16 words read in at a time seems like a good place to start, given we can clock in at 320MB/s @ 320MHz. Can be tweaked further if required, eg. 8 or 32 etc (maybe nibble masked addresses will be fastest for testing given the getnib/setnib opcodes in the P2). A proper I-cache is hopefully not required to achieve some decent performance for this emulator, although for some it could be an interesting thing to examine if you ever wanted to execute directly from PSRAM in general.

    Also, current memory usage is as such: Registers up to $1D4 and LUT up to $30f. So basically, 3/4 full.

    Impressive, looks like it should fit nicely in the end.

  • @rogloh said:

    @Wuerfel_21 said:
    ... Code of course will be fetched in bigger blocks, maybe 16 words at a time? That should be big enough to eliminate code reads in hot loops and speed up short branches.

    16 words read in at a time seems like a good place to start, given we can clock in at 320MB/s @ 320MHz. Can be tweaked further if required, eg. 8 or 32 etc (maybe nibble masked addresses will be fastest for testing given the getnib/setnib opcodes in the P2). A proper I-cache is hopefully not required to achieve some decent performance for this emulator, although for some it could be an interesting thing to examine if you ever wanted to execute directly from PSRAM in general.

    Got that code block fetching system implemented now. Size can be configured to any even number of words and checking if a branch is inside the currently cached block is simple - we already need to keep track of how many words are left, so simply subtract the branch displacement from that counter and check if it is still in the valid range (which, since the valid range is 0..MK_ROMQUE_MAX, only requires a single unsigned compare).
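The single-compare check described above can be sketched like this. It is a Python model under assumptions: the displacement is taken in words, the counter is treated as a 32-bit register, and the value 16 for MK_ROMQUE_MAX is just the starting size suggested earlier in the thread.

```python
MK_ROMQUE_MAX = 16      # block size in words (assumed value for this sketch)
MASK32 = 0xFFFFFFFF     # cog registers are 32 bits wide


def branch_stays_in_block(words_left, disp_words):
    """Return (hit, new_words_left) for a branch of disp_words words.

    A forward branch consumes words, so the counter shrinks; a backward
    branch grows it. Going past either end wraps or overshoots, and one
    unsigned compare against MK_ROMQUE_MAX catches both cases.
    """
    new_left = (words_left - disp_words) & MASK32
    return new_left <= MK_ROMQUE_MAX, new_left


if __name__ == "__main__":
    for disp in (4, -6, -7, 11):
        print(disp, branch_stays_in_block(10, disp))
```

A forward branch past the end of the block makes the subtraction go negative, which wraps to a huge unsigned value, so no separate lower-bound test is needed.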

  • @rogloh said:

    Also, current memory usage is as such: Registers up to $1D4 and LUT up to $30f. So basically, 3/4 full.

    Impressive, looks like it should fit nicely in the end.

    Mind you, there's a significant amount of hub code. Only what's really needed is in cog/lut right now.

    It will be good to simply try my existing PSRAM driver as is with your pre-fetching support to see if that has any hope of working without it being directly coupled to your COG. At 320MHz it can probably get close to 1us per request so 16 words is then 32MB/s and some of it can potentially be done in parallel to your emulator code running (so not sure how that translates to final 68k MIPS, it depends on branches and wasted reads).
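The 32MB/s figure is just block size divided by request time. A small Python sketch of that arithmetic, assuming 16-bit 68k words and that the roughly 1us request time stays the same for other block sizes (which is an assumption, not a measurement):

```python
def prefetch_throughput_mb_s(words_per_request, request_time_us, bytes_per_word=2):
    """Effective fetch rate in MB/s; bytes per microsecond equals MB/s."""
    return words_per_request * bytes_per_word / request_time_us


if __name__ == "__main__":
    for words in (8, 16, 32):
        print(words, "words ->", prefetch_throughput_mb_s(words, 1.0), "MB/s")
```

Any words fetched but skipped by a branch count against this number, which is why the final 68k speed is hard to predict from it.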

    I have attached some sample code showing a simple PSRAM config for the new P2EDGE and how the mailbox could be used from PASM2 for block reads/writes, but there are far more commands the mailbox can use than are shown here. Memory addresses can also be mapped differently in the banks if the address space used is to be shared with HUB RAM (sometimes it is useful to be able to do that and to reserve 0-16MB for indicating HUB addresses, other times not). This PASM2 demo already works with my driver.

    {
     Propeller 2 PSRAM demo (PASM)
     =============================
    
     This software contains a simple demo showing how a PASM COG can use the PSRAM driver 
     without requiring the overhead of the complete SPIN2 based memory driver as well.
    
     The driver is initialized and a PASM COG then accesses the PSRAM with direct mailbox
     access using the burst write and read commands to transfer data.
    
     No QoS policy is used, so any COG can access the memory without prioritization.
    
     Run this with DEBUG mode enabled.
    }
    '----------------------------------------------------------------------- 
    
    CON
        _clkfreq = 160000000
    
        DEBUG_BAUD = 115200
    
        MAXBURST = 512  ' set to a suitable device burst size & keep under maximum CS low time of 8us
        DELAY    = 8    ' set to an input delay suitable for this P2 clock frequency (from 0-15)
        ADDRSIZE = 25   ' number of address bits used in 32MB of PSRAM
    
        ' P2 EDGE PSRAM pin mappings
        DATABUS  = 40
        CLK_PIN  = 56
        CE_PIN   = 57
    
    OBJ
        psram : "psramdrv"
    
    
    PUB main() | driverAddr
        ' patch in the proper HUB addresses for Propeller Tool (redundant for FlexSpin)
    
        long[@startupData][5]:=@deviceData
        long[@startupData][6]:=@qosData
        long[@startupData][7]:=@mailboxes
    
        ' get the address of the PSRAM memory driver so we can start it
    
        driverAddr:= psram.getDriverAddr()
    
        ' start the PSRAM memory driver and wait for it to complete initialization
    
        coginit(NEWCOG, driverAddr, @startupData)
        repeat until long[@startupData] == 0 
    
        ' now just continue running as the PSRAM reader cog, pass mailbox base address as argument
    
        coginit(cogid(), @reader, @mailboxes)
    
    DAT
        orgh 
    
    '----------------------------------------------------------------------- 
    ' Reader Cog PASM2 code entry point
    '----------------------------------------------------------------------- 
    
    reader 
                org     0
    
                cogid   pb              'get COG ID
                mov     pa, #12     
                mul     pa, pb          'scale by 12 bytes per mailbox (3 longs)
                add     ptra, pa        'compute real mailbox start address for this COG
    
                add     msgaddr, ptrb   'determine real HUB RAM location of the test message
    
                'write the test message into PSRAM
                'NOTE: the setq burst write method used here can only be used without interruption and 
                'relies on these sequential addresses being written in order each clock cycle before the 
                'RAM driver poller can read mailbox data that is incomplete.  If you are using the streamer
                'or if interrupts could somehow delay the burst write part way through this would not work
                'and you would need to ensure you write the first mailbox long after the other two longs.
    
                setnib  addr, #%1111, #7'include the write burst command in the cmd+address parameter
                setq    #3-1            'write 3 longs to mailbox in a burst (can do this only without interruption)
                wrlong  addr, ptra      'trigger the write burst to external memory
    pollwrite   rdlong  pa, ptra wcz    'check for the result
        if_c    jmp     #pollwrite      'wait until done or error
        if_nz   jmp     #error          'error check (optional but useful if you encounter setup problems)
    
                'read the message back to a new address (just using this COG's own HUB space as scratch buffer)
    
                setnib  addr, #%1011, #7'setup read burst command in the cmd+address parameter
                mov     msgaddr, ptrb   'update destination hub address to COG's scratch area
                setq    #3-1            'write 3 longs to mailbox in a burst (can do this only without interruption)
                wrlong  addr, ptra      'trigger the read burst command
    pollread    rdlong  pa, ptra wcz    'get the result
        if_c    jmp     #pollread       'wait until done or error
        if_nz   jmp     #error          'error check (optional but useful if you encounter setup problems)
    
                'display the message we just read back with DEBUG statements
    
                DEBUG   (ZSTR(ptrb))    'print string we just read
                cogstop pb              'stop here
    
                'if an error occurred, display the error code to help debug code
    
    error       DEBUG   ("Test failed, error code=-", SDEC_LONG_(pa), 13, 10)
                cogstop pb              'stop here
    
    ' 3 long structure to be written to mailbox
    addr        long    $__0abcdef          ' command & some address in external memory
    msgaddr     long    message - reader    ' HUB source/destination address for burst
    msglen      long    msgend - message    ' length in bytes
    
                fit     502
    
    
    '----------------------------------------------------------------------- 
            orgh
    
    ' data to be passed to driver when starting it
    startupData
        long    _clkfreq    ' use current frequency
        long    0           ' optional flags
        long    0           ' reset pin mask on port A for PSRAM (none)
        long    0           ' reset pin mask on port B for PSRAM (none)
        long    DATABUS     ' PSRAM data bus start pin
        long    deviceData  ' address of devices data structure in HUBRAM
        long    qosData     ' address of QoS data structure in HUBRAM
        long    mailboxes   ' address of mailbox structure in HUBRAM
    
    deviceData
        ' 16 bank parameters follow
        long    (MAXBURST << 16) | (DELAY << 12) | (ADDRSIZE - 1)   ' bank 0
        long    (MAXBURST << 16) | (DELAY << 12) | (ADDRSIZE - 1)   ' bank 1
        long    0[14]                                               ' bank 2-15
        ' 16 banks of pin parameters follow
        long    (CLK_PIN << 8) | CE_PIN                             ' bank 0 
        long    (CLK_PIN << 8) | CE_PIN                             ' bank 1 
        long    -1[14]                                              ' bank 2-15
    
    qosData 
        long    $FFFF0000       ' cog 0 default QoS parameters
        long    $FFFF0000       ' cog 1 default QoS parameters
        long    $FFFF0000       ' cog 2 default QoS parameters
        long    $FFFF0000       ' cog 3 default QoS parameters
        long    $FFFF0000       ' cog 4 default QoS parameters
        long    $FFFF0000       ' cog 5 default QoS parameters
        long    $FFFF0000       ' cog 6 default QoS parameters
        long    $FFFF0000       ' cog 7 default QoS parameters
    
    
    mailboxes
        long    0[8*3]          ' 3 longs per mailbox per COG
    
    
    message byte "This message is coming at you today all the way from PSRAM!", 0
    msgend  byte 0
    
    {{
    -------------
    LICENSE TERMS
    -------------
    Copyright 2020, 2021 Roger Loh
    
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:
    
    The above copyright notice and this permission notice shall be included in 
    all copies or substantial portions of the Software.
    
    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
    THE SOFTWARE.
    }}
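The mailbox handling in the demo above can be summarised with a small model. This is a Python sketch built only from what the PASM2 shows (3 longs per COG, command in the top nibble of the first long, polling on the sign bit and on zero); the helper names are invented here and the driver supports more commands than these two.

```python
CMD_WRITE_BURST = 0b1111   # from `setnib addr, #%1111, #7`
CMD_READ_BURST = 0b1011    # from `setnib addr, #%1011, #7`
MAILBOX_BYTES = 12         # 3 longs per mailbox, one mailbox per COG


def mailbox_address(mailbox_base, cog_id):
    """Hub address of a COG's mailbox (the mov/mul/add on ptra in the demo)."""
    return mailbox_base + MAILBOX_BYTES * cog_id


def build_request(command, ext_addr, hub_addr, length_bytes):
    """The three longs written with `setq #3-1` + `wrlong`."""
    first = ((command & 0xF) << 28) | (ext_addr & 0x0FFFFFFF)
    return [first, hub_addr & 0xFFFFFFFF, length_bytes & 0xFFFFFFFF]


def poll_status(first_long):
    """Interpret the first mailbox long the way the polling loops do."""
    if first_long & 0x80000000:   # C set by `rdlong ... wcz`: request still pending
        return "busy"
    if first_long != 0:           # Z clear: the demo jumps to its error handler
        return "error"
    return "done"


if __name__ == "__main__":
    req = build_request(CMD_WRITE_BURST, 0x0ABCDEF, 0x2000, 61)
    print([hex(x) for x in req], poll_status(req[0]))
```

Both burst commands have their top bit set, which is what lets the same long double as the busy flag until the driver overwrites it.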
    
  • Yeah, that looks doable.

    Unrelatedly, I've just completed implementing all the instructions. Probably a million bugs, but I can't find a test program that can run from ROM so uhhhhh, let's not think about that too much. Next stop: hooking this up to the VDP so I can run some simple test programs

  • Ok let me know when you need a PSRAM driver. I do intend to get this out before xmas and take a little break.

  • Wuerfel_21 Posts: 5,105
    edited 2021-12-21 18:37

    Status report: got video driver and VDP integrated into the binary and set up (using flexspin to compile the Spin code into high memory). Also, a simple "Hello World" ROM seems to end up at the correct STOP opcode (end of program, as opposed to an exception handler or getting stuck on some inappropriately decoded opcode), so I presume that when the VDP register interface is working, there'll be some letters on screen.

  • Any chance of seeing the 68K emulator PASM2 code?

  • Wuerfel_21 Posts: 5,105
    edited 2021-12-21 20:41

    When it is done and working.

    Note that there's a bunch of SEGA-isms in the code (memory map, broken TAS (though I know it's broken on Amiga, too. Not sure about other 68k systems), assumption that vector table is in ROM, etc), so you'd need to rework it a bit to use it for a different purpose.

  • Semi-relatedly, anyone got any idea of how colors in %0000_BBB0_GGG0_RRR0 CRAM format could be expanded to RGB24 at decent speed? I guess there's the nuclear option of precomputing a big table (would only be 8K!)... That could also allow emulating an accurate luma curve (real VDP DAC is slightly nonlinear), though that might look goofy with the shadow/highlight effect (which happens on the final RGB values)

  • Well, after tracking down a stupid issue wherein I fumbled TEST and TESTB....

    BEHOLD

  • Well done. All coming together now, and the external memory driver is now available for you too when you need it.

  • Cool register description...

    In other words, owie ouch the interrupts hurt my head

  • Wuerfel_21 Posts: 5,105
    edited 2021-12-22 14:06

    In particular, I'm at the point where this simple and contrived program works.... BUT ONLY IN DEBUG MODE???? And without DEBUG, it's solid blue??????

    Something something timing, I fear. Or maybe the debugger messes stuff up.

    entry:
        move.w #$E0E,$FFFFFFE0.w ; init color variable (magenta)
        move #$2500,sr ; enable interrupts > 5 (VBlank)
        move.w #$8170,$C00004 ; enable VINT in VDP
    spinlock:
        bra spinlock
    
    vint_handler:
        move.l #$C0000000,$C00004 ; setup CRAM write
        move.w $FFFFFFE0.w,$C00000
        add.w #2,$FFFFFFE0.w
        rte
    
  • Wuerfel_21 Posts: 5,105
    edited 2021-12-22 14:35

    Yes, it is timing. If I insert a waitx #174 (and not one cycle less!) into the instruction loop, it works in non-debug mode....

    So, I am sending an interrupt by pulsing (locking and immediately releasing) a lock inside the VDP. This is picked up by one of the event channels of the 68k cog and before loading the next instruction, is checked using JSE2. How would timing differences cause this to break after one iteration??? ($E0E (magenta) + 2 = $E10 (blue)) The 68k code never touches the lock, it only ever polls the SE2 flag and only in the instruction fetch code. And I can verify that the VDP is still getting and releasing the lock just fine.

  • TonyB_ Posts: 2,193
    edited 2021-12-22 15:44

    @Wuerfel_21 said:
    Semi-relatedly, anyone got any idea of how colors in %0000_BBB0_GGG0_RRR0 CRAM format could be expanded to RGB24 at decent speed? I guess there's the nuclear option of precomputing a big table (would only be 8K!)... That could also allow emulating an accurate luma curve (real VDP DAC is slightly nonlinear), though that might look goofy with the shadow/highlight effect (which happens on the final RGB values)

    What is "decent speed"? Presumably this is for converting the palette data to P2 format as each palette word is loaded/written?

    Converting
    %0000__B2B1B0_0__G2G1G0_0__R2R1R0_0
    to say
    %R2R1R0_R2R1R0_R2R1__G2G1G0_G2G1G0_G2G1__B2B1B0_B2B1B0_B2B1__00000000
    looks pretty horrible.

    A table in hub RAM seems the best option in terms of speed, saving cog/LUT space and emulating non-linearity.
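A precomputed table along these lines could look like the following Python sketch. It uses the bit-replication layout written out above for a linear ramp; a measured DAC curve could be dropped in instead. Indexing by the CRAM word shifted right by one (11 bits, so 2048 longs) is an assumption made here because it gives exactly the 8K mentioned in the question.

```python
def expand3(v):
    """3-bit channel to 8 bits by replication: v2 v1 v0 v2 v1 v0 v2 v1."""
    return (v << 5) | (v << 2) | (v >> 1)


def cram_to_rgb24(cram):
    """%0000_BBB0_GGG0_RRR0 -> %RRRRRRRR_GGGGGGGG_BBBBBBBB_00000000."""
    r = (cram >> 1) & 7
    g = (cram >> 5) & 7
    b = (cram >> 9) & 7
    return (expand3(r) << 24) | (expand3(g) << 16) | (expand3(b) << 8)


# Assumed indexing: drop the always-zero bit 0, leaving 11 index bits.
PALETTE_TABLE = [cram_to_rgb24(i << 1) for i in range(2048)]


def lookup(cram):
    """One table read per palette write instead of the shift/mask sequence."""
    return PALETTE_TABLE[(cram >> 1) & 0x7FF]


if __name__ == "__main__":
    print(hex(lookup(0xE0E)), hex(lookup(0xE10)), len(PALETTE_TABLE) * 4)
```

Entries that differ only in the unused bits 4 and 8 are duplicates, which wastes space but keeps the index computation to a single shift.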

  • Yea, table is what I went with.

    Unrelatedly, have not fixed the interrupt issue, but at least narrowed it down: it seems that for some reason it ends up doing a branch to an odd address, triggering an address error and getting stuck in that. That explains why it doesn't acknowledge any more interrupts, but how it gets into that situation and why it is timing-sensitive is still mystifying me.

  • Wuerfel_21 Posts: 5,105
    edited 2021-12-22 17:23

    Okay, I think I got it? I was reading the PC from the wrong address for RTE (incremented SP correctly after reading SR, but didn't move it into EA register again). Don't ask me how that is timing-sensitive though.
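For reference, a plain 68000 interrupt frame holds SR as a word at SP and the return PC as a long at SP+2, so RTE has to read them from two different addresses. The Python sketch below models that, plus one possible reading of the bug described above (the effective address not being refreshed after SP moves). The buggy variant is a guess at the shape of the mistake, not the actual MotoKore code.

```python
def read_word(mem, addr):
    return int.from_bytes(mem[addr:addr + 2], "big")   # 68k is big-endian


def read_long(mem, addr):
    return int.from_bytes(mem[addr:addr + 4], "big")


def rte(mem, sp):
    """Correct RTE for the short 68000 frame: pop SR, then PC."""
    sr = read_word(mem, sp)
    sp += 2
    pc = read_long(mem, sp)      # read from the updated SP
    sp += 4
    return sr, pc, sp


def rte_stale_ea(mem, sp):
    """Hypothetical buggy variant: SP advances but the EA used for PC does not."""
    ea = sp
    sr = read_word(mem, ea)
    sp += 2
    pc = read_long(mem, ea)      # stale EA still points at SR
    sp += 4
    return sr, pc, sp


if __name__ == "__main__":
    mem = bytearray(16)
    mem[4:6] = (0x2500).to_bytes(2, "big")          # saved SR
    mem[6:10] = (0x00000200).to_bytes(4, "big")     # saved PC
    print(rte(mem, 4), rte_stale_ea(mem, 4))
```

In this model SP ends up correct either way, so only the return address is wrong, which fits the symptom of everything looking fine until the handler returns.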

  • Wuerfel_21 Posts: 5,105
    edited 2021-12-23 00:56

    Since I'll be away for a couple days, I guess this is as good as any occasion to release... something.

    Introducing MegaYume version alpha 0.0.0, the terrible emulator that is entirely borken! Applause!


    Not with that attitude

    Features include:
    - VGA graphic (just the one)
    - USB keyboard input
    - "Plays" Pong (it does not know the rules of Pong)
    - bugs
    - non-functioning spaghetti code 68000 emulator which is now called "MotoKore" because I had to figure out what mk_ actually stands for, lol. Just claim it's a mortal kombat reference or smth idk idgaf.
    - LED debugging nonsense on pins 38/39 that I was too lazy to edit out

    To build, use the included build.sh (or read it and just do the same thing on the terminal) and then load megayume_lower.binary. Needs a reasonably recent flexspin, but for a change from my usual releases, I don't think you need the bleeding edge version.

    Keyboard is mapped as follows:

    • Enter -> START
    • X -> A
    • C -> B
    • V -> C
    • Directions are on arrow keys

    There's also a few other contrived test ROMs bundled in - just scroll to the bottom of megayume_lower.spin2 and you'll find where the ROM file is included.

  • Ok, so to make the paddle not be.. like that, change the branch in line 193 from mk_cmp_common to mk_sub_common. Something something confusing parameter order.

  • Great, I'll have a browse through this, maybe I can port it to my Voyager board too for fun.

  • Wuerfel_21 Posts: 5,105
    edited 2021-12-23 15:01

    Ok, so.... can you spot the problem(s)?

    "But Ada, what does MOVE USP have to do with the aforementioned problem?" you say. Well...
    As you may be able to tell, there's two mistakes here:

    • direction test is wrong (should test bit 3, but tests bit 0 xor bit 1 instead)
    • Address registers are indexed starting from A7

    So what is supposed to be a MOVE A6,USP is interpreted as MOVE USP,A13. Now, A13 doesn't exist, so it writes the USP into the jump table entry for opcodes $3xxx (MOVE.W) instead. The USP is zero, so it writes zero.
    Now, the next time a MOVE.W executes, the contents of D0 onwards will be executed as P2 instructions. D0 is zero (-> NOP), but D1 contains $0000FFFF which disassembles to _RET_ ROL $07F,INB. INB on my setup happens to be $FFFF_FF30, so it rotates the instruction at $07F by 16 bits and returns. As seen above, that instruction is the tailcall through mk_writef for MOVE.B instructions. But after the corruption it isn't a branch anymore, so execution falls through into the MOVE.W implementation, where an address error is triggered due to an odd address. This leads to an infinite loop of address errors, because people writing these test ROMs don't seem to realize that you can't simply RTE from a bus or address error due to the special stack frame (will end up branching to the fault address, which in case of address error will always be odd, causing another address error in an infinite loop (until you run the SP out of RAM)).
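The MOVE USP encoding is $4E60 plus a direction bit (bit 3) and a 3-bit register number. The Python sketch below decodes it correctly and also reconstructs the two mistakes listed above, to show how MOVE A6,USP turns into MOVE USP,A13. The buggy decoder is an illustration of the description in this post, not the original PASM2.

```python
def decode_move_usp(opcode):
    """Correct decode: bit 3 clear = An -> USP, bit 3 set = USP -> An."""
    if opcode & 0xFFF0 != 0x4E60:
        return None                      # not a MOVE USP opcode
    reg = opcode & 7
    if opcode & 0x08:
        return f"MOVE USP,A{reg}"
    return f"MOVE A{reg},USP"


def decode_move_usp_buggy(opcode):
    """Reconstruction of the two bugs: wrong direction test, wrong register base."""
    if opcode & 0xFFF0 != 0x4E60:
        return None
    reg = 7 + (opcode & 7)               # indexed from A7 instead of A0
    wrong_dir = (opcode ^ (opcode >> 1)) & 1   # bit 0 xor bit 1 instead of bit 3
    if wrong_dir:
        return f"MOVE USP,A{reg}"
    return f"MOVE A{reg},USP"


if __name__ == "__main__":
    print(decode_move_usp(0x4E66), "vs", decode_move_usp_buggy(0x4E66))
```

For $4E66 the register field is 6 (%110), so bit 0 xor bit 1 is 1 and the buggy path picks the USP-to-register direction with index 7 + 6 = 13, matching the behaviour described.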

  • To that, a new ZIP.

    Changes:
    - fixed aforementioned bugs
    - removed pin 38/39 led stuff
    - included some more ROMs (check out wintest.bin and sampler.bin for some interactivity)

  • So well, I'm back.

    So, I've
    a) started implementing DMA
    b) tried some more ROMs

    So, it transpires that Flicky actually already sorta does something. The SEGA logo, title screen and instruction screen work, but in-game the map is corrupted and it hangs immediately after Flicky comes out of the door. Oddly enough, the newer version with the DMA stuff seems to work less (no roof and floor, immediate crash), even if I disable it... odd.

  • So, shifting the VDP I/O code around in hub seems to affect which of the two results (immediate crash, no roof/floor/score vs. crash after entry animation) happens. Oh, this is going to be another painful one, isn't it?
