6502 CPU Emulator

macca · 2021-04-27 06:31

Hello,

Here is a 6502 CPU emulator I wrote for the P2.

The demo program runs (and passes) the 6502 functional test from https://github.com/Klaus2m5/6502_65C02_functional_tests, which tests all instructions, addressing modes, math operations, etc.

Here is the debug output:

At this stage I focused mainly on accurate emulation, which is why the code is as it is now. The cycle count should be very accurate, including the additional cycles for page crossings, which should make it possible to add delay timings compatible with the original chip.

In its current state, without much optimization and with some (maybe unneeded) safety checks, it runs at about 22.659 P2 clocks per 6502 clock on average. If my math serves me right, that means a P2 at 160 MHz can run the emulation at about 7 MHz. With some minimal optimizations, a bare 20 MHz P2 should run at C64 speed without problems.
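As a quick check of the arithmetic above, here is a minimal sketch. The function name is made up for illustration; only the 22.659 ratio and the two P2 clock speeds come from the measurements quoted above.

```python
def emulated_mhz(p2_mhz, p2_clocks_per_6502_clock=22.659):
    """Effective 6502 clock (MHz) for a given P2 clock (MHz) at the measured average ratio."""
    return p2_mhz / p2_clocks_per_6502_clock

print(round(emulated_mhz(160), 2))  # 7.06
print(round(emulated_mhz(20), 2))   # 0.88
```

At 20 MHz the unoptimized ratio lands a little under the roughly 1 MHz of a C64, which is why some optimization is needed to get there.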

Enjoy!

Best regards,
Marco

Tubular · 2021-04-27 06:54

Nice!! Congratulations

rosco_pc · 2021-04-27 07:18

you're on a roll

pik33 · 2021-04-27 07:21

Wow ! I planned to do a 6502 and... I don't have to.

AJL · 2021-04-27 07:31

Congrats! Now to expand it to 65C816 :-)

pik33 · 2021-04-27 08:06

Now as I read the code... a lot can be optimized using skipf/execf/xbyte. Also, for Atari/Commodore emulation, "illegal opcodes" have to be added, but now we have an excellent starting point

Cluso99 · 2021-04-27 08:45

Congratulations!

rogloh · 2021-04-27 09:49

Great stuff. With a P2 SIDCOG and this 6502 probably what we need now is just some VIC-II graphics emulation plus CIA chip and some C64 code could potentially start to run in a complete emulator. Who's game to write a full VIC-II emulator? Are you intending to do one of those too next @macca?

Ahle2 · 2021-04-27 10:19

macca, fantastic news...

I too have had thoughts about making a 6502 emulator for the P2, since no one else seemed to be pursuing it. I'm buried in too many other tasks (so it feels like nothing gets done).
I would love to try it out with ".sid" files and SIDcog (since Crescendo isn't in a useful state right now)

macca · 2021-04-27 11:18

Thank you all!

@pik33 said:
Now as I read the code... a lot can be optimized using skipf/execf/xbyte. Also, for Atari/Commodore emulation, "illegal opcodes" have to be added, but now we have an excellent starting point

After cooling down from the relative call issue I had, I'll certainly add some skip patterns to optimize the addressing mode variants. I'm not sure about using xbyte since, as far as I know, it prevents some self-modifying code from working correctly. I'm also not sure about the performance gain: code for these CPUs tends to have many branches, which will frequently flush the pipeline. We'll see.

@rogloh said:
Great stuff. With a P2 SIDCOG and this 6502 probably what we need now is just some VIC-II graphics emulation plus CIA chip and some C64 code could potentially start to run in a complete emulator. Who's game to write a full VIC-II emulator? Are you intending to do one of those too next @macca?

Yes, but it is not immediate. I have already explored the VIC-II a bit and after the TMS9918 emulation I don't see problems doing it, however I would like to see if I can do an Atari 2600 emulation first.

@Ahle2 said:
I would love to try it out with ".sid" files and SIDcog (since Crescendo isn't in a useful state right now)

Ah, yes, I would love to do that too. Since the 6502 uses memory mapped I/O I think it is just a matter of peeking the right locations. Currently it doesn't support interrupts so if the code uses delay loops it should work, provided the clock is tuned appropriately.

Wuerfel_21 · 2021-04-27 11:23

@macca said:
Thank you all!

@pik33 said:
Now as I read the code... a lot can be optimized using skipf/execf/xbyte. Also, for Atari/Commodore emulation, "illegal opcodes" have to be added, but now we have an excellent starting point

After cooling down from the relative call issue I had, I'll certainly add some skip patterns to optimize the addressing mode variants. I'm not sure about using xbyte since, as far as I know, it prevents some self-modifying code from working correctly. I'm also not sure about the performance gain: code for these CPUs tends to have many branches, which will frequently flush the pipeline. We'll see.

Flushing the FIFO really isn't that bad, barely longer than a standard RDBYTE. IIRC if you can get 2 RFBYTEs without flushing the FIFO in between, it is always faster than two RDBYTEs. However, you would need to flush after every writing instruction to allow for self-modification.
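To illustrate why a flush is needed after writes, here is a toy model of a read-ahead buffer. It is only a sketch: the class, its depth of 4 and its refill behaviour are assumptions for illustration and do not describe the real P2 FIFO.

```python
class PrefetchReader:
    """Toy streaming reader that copies bytes ahead of the read position."""

    def __init__(self, memory, depth=4):
        self.memory = memory
        self.depth = depth
        self.start(0)

    def start(self, address):
        """Restart (flush) the stream at `address`, refilling from current memory."""
        self.next_fill = address
        self.fifo = []
        self._fill()

    def _fill(self):
        # Copy bytes into the buffer until it is full or memory ends.
        while len(self.fifo) < self.depth and self.next_fill < len(self.memory):
            self.fifo.append(self.memory[self.next_fill])
            self.next_fill += 1

    def read_byte(self):
        value = self.fifo.pop(0)
        self._fill()
        return value


memory = bytearray([0xEA] * 8)   # eight NOPs
reader = PrefetchReader(memory)
reader.read_byte()               # consume address 0
memory[1] = 0x60                 # self-modifying write to the next address
stale = reader.read_byte()       # buffered copy of address 1, still 0xEA
reader.start(1)                  # flush and restart at address 1
fresh = reader.read_byte()       # now sees 0x60
print(hex(stale), hex(fresh))    # 0xea 0x60
```

In this model the write lands in memory but not in the buffered copy, so the stale byte is returned until the stream is restarted.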

rogloh · 2021-04-27 13:07

In higher frequency P2 implementations that people might like for higher quality audio or for better/more accurate video emulation capabilities if needed, it seems like you should be able to use your total clock cycle count to introduce some waitx delays depending on how fast/slow the emulation is vs the real clock cycles in the instruction loop. It would be cool to have a cycle accurate emulator if possible. So you could run the P2 at some convenient multiple of the 14.31818MHz frequency for correct video output for example and divide down from there according to your slowest instruction and then just introduce the appropriate waitx delays to compensate for the faster P2 execution speed.

Eg. The P2 runs at (say) 14.31818MHz x 16 or ~ 229.1MHz as an exact multiple of the NTSC colour frequency used in the C64.
The slowest emulated 6502 instruction might (for argument's sake) take 150 P2 clocks to execute, but is really meant to take 7 C64 clocks. We can measure its execution time in the loop (which varies slightly due to variable hub latency), then compute the difference between 7 clock cycles at 1.023MHz and 150 clock cycles at 229.1MHz and add a waitx accordingly. So we'd compute 7 x 14 x 16 = 1568, which is the number of P2 clocks it should take for 1:1 emulation. Since it took only 150 P2 clock cycles, we would do a waitx (1568-150-2).
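The padding calculation in this example can be sketched as follows. The function and constant names are hypothetical; the numbers (14 x 16 P2 clocks per 6502 clock, 150 measured clocks, 2 clocks subtracted for the waitx itself) are the ones used in the example above.

```python
P2_CLOCKS_PER_6502_CLOCK = 14 * 16   # 224, from the 14.31818MHz x 16 example
WAITX_OVERHEAD = 2                   # the "-2" in the example above


def waitx_operand(cycles_6502, measured_p2_clocks):
    """Delay value that pads an emulated instruction out to its real duration."""
    target = cycles_6502 * P2_CLOCKS_PER_6502_CLOCK
    padding = target - measured_p2_clocks - WAITX_OVERHEAD
    if padding < 0:
        raise ValueError("emulation is slower than the real CPU at this clock")
    return padding


print(waitx_operand(7, 150))  # 1416
```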

Wuerfel_21 · 2021-04-27 13:53

Cycle-accurate emulation is a bit more difficult than that: You need to make the memory accesses happen on the right cycles. This is necessary both for Atari 2600 and C64 (less so on the latter, but still required for a lot of special effects, such as the scrolling in Mayhem in Monsterland)

AJL · 2021-04-27 14:46

The sensible way to do that is to break each 6502 operation down into actions for each 6502 clock cycle, and pad each 6502 clock cycle sequence so that they all take the same number of P2 cycles.
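A rough sketch of what a per-cycle breakdown could look like, using LDA absolute (opcode $AD, 4 cycles) as the example. This is an illustration only, not the emulator's actual structure: the generator, the dict-based CPU state and the trace format are all assumptions, and flag updates are left out.

```python
def lda_absolute(cpu, memory):
    """Generator for LDA absolute: each yield ends one 6502 clock cycle
    and reports the bus access made during that cycle."""
    pc = cpu["pc"]
    _opcode = memory[pc]              # cycle 1: fetch opcode
    yield ("read", pc)
    low = memory[pc + 1]              # cycle 2: fetch address low byte
    yield ("read", pc + 1)
    high = memory[pc + 2]             # cycle 3: fetch address high byte
    yield ("read", pc + 2)
    address = (high << 8) | low
    cpu["a"] = memory[address]        # cycle 4: read the operand
    cpu["pc"] = pc + 3
    yield ("read", address)


memory = bytearray(0x10000)
memory[0x0200:0x0203] = bytes([0xAD, 0x34, 0x12])   # LDA $1234
memory[0x1234] = 0x42
cpu = {"pc": 0x0200, "a": 0}

bus_trace = list(lda_absolute(cpu, memory))
print(len(bus_trace), hex(cpu["a"]))  # 4 0x42
```

Because every cycle is a separate step, each one can be padded to the same length, and each memory access happens in the cycle where the real chip would make it.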

ersmith · 2021-04-27 15:07

I think you'd want to use waitct to wait for a particular (exact) P2 cycle corresponding to the 6502 cycle, rather than padding or inserting delays with waitx

Coley · 2021-04-27 18:17

Nice work macca

macca · 2021-04-27 19:07

@rogloh said:
In higher frequency P2 implementations that people might like for higher quality audio or for better/more accurate video emulation capabilities if needed, it seems like you should be able to use your total clock cycle count to introduce some waitx delays depending on how fast/slow the emulation is vs the real clock cycles in the instruction loop.

Yes, that's exactly what I want to do. It is not yet implemented but the facility is there, there is a variable that holds the last instruction cycle count so it only needs to update a timer with clock ratio * cycles and wait.
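As a sketch of that idea (not the actual implementation), the timer can be kept as an absolute deadline that advances by clock ratio * cycles after each instruction, in the spirit of the ADDCT+WAITCT suggestion. The class name, the ratio of 224 and the 32-bit wrap are assumptions for illustration.

```python
COUNTER_MASK = 0xFFFFFFFF  # assumes a 32-bit counter that wraps around


class CycleTimer:
    """Tracks the absolute counter value at which the next instruction may start."""

    def __init__(self, start_count, p2_clocks_per_6502_clock):
        self.ratio = p2_clocks_per_6502_clock
        self.deadline = start_count & COUNTER_MASK

    def advance(self, cycles_6502):
        """Move the deadline forward by the duration of the last instruction."""
        self.deadline = (self.deadline + cycles_6502 * self.ratio) & COUNTER_MASK
        return self.deadline


timer = CycleTimer(start_count=1000, p2_clocks_per_6502_clock=224)
print(timer.advance(2))  # 1448
print(timer.advance(7))  # 3016
```

Since each deadline is derived from the previous deadline rather than from the current time, variations in how long the emulation code takes do not accumulate as drift.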

The major problem, as @Wuerfel_21 said, is obtaining accurate memory access timing.

@AJL said:
The sensible way to do that is to break each 6502 operation down into actions for each 6502 clock cycle, and pad each 6502 clock cycle sequence so that they all take the same number of P2 cycles.

Yes, a state machine would be the optimal solution, but I think it would negate most of the facilities provided by the P2 architecture (and anyway I wouldn't want to implement such a thing). We'll see.

@Wuerfel_21 said:
Flushing the FIFO really isn't that bad, barely longer than a standard RDBYTE. IIRC if you can get 2 RFBYTEs without flushing the FIFO in between, it is always faster than two RDBYTEs. However, you would need to flush after every writing instruction to allow for self-modification.

Well, this will be the last thing to do. I have implemented some failsafe methods that are not really needed, I think I'll use one of the ptr registers as program counter to use the auto-increment feature, this should save a lot of cycles.

Baggers · 2021-04-27 19:37

Nice work @macca, great to see a 6502 emu on the P2 too! Lots of fun times ahead, I reckon.

rogloh · 2021-04-28 00:46

@ersmith said:
I think you'd want to use waitct to wait for a particular (exact) P2 cycle corresponding to the 6502 cycle, rather than padding or inserting delays with waitx

My thought was the waitx value would have been computed based on the last time in the loop AND the time the instruction is meant to take so it would hit the right P2 clock cycle. Though the ADDCT+WAITCT approach would probably achieve the same thing.

@AJL said:
The sensible way to do that is to break each 6502 operation down into actions for each 6502 clock cycle, and pad each 6502 clock cycle sequence so that they all take the same number of P2 cycles.

@Wuerfel_21 said:
Cycle-accurate emulation is a bit more difficult than that: You need to make the memory accesses happen on the right cycles. This is necessary both for Atari 2600 and C64 (less so on the latter, but still required for a lot of special effects, such as the scrolling in Mayhem in Monsterland)

Ok, I didn't realize that you needed to go that fine-grained for aligning memory timing as well. I guess it could be an issue if you are emulating peripherals. However, there are also the HUB timing differences between COGs reading/writing memory, which would probably start to come into play too if other COGs are emulating video peripherals. If you needed to do that it certainly would make things more difficult. Probably good to start simple, then figure out the harder, more complete way if required.

cgracey · 2021-04-28 00:59

Maybe we could do one Apple ][ per cog.

potatohead · 2021-04-28 02:47

The very nice thing about an Apple is it has no interrupts and no timer.

A basic system is little more than a ROM, CPU and one or more display pages.

rogloh · 2021-04-28 03:03

@cgracey said:
Maybe we could do one Apple ][ per cog.

That'd be cool. Or have a mix of things, like a Z80 based arcade game, an Apple II and a C64 etc all running at the same time. That'd make quite a good demo too.

Cluso99 · 2021-04-28 03:07

@potatohead said:
The very nice thing about an Apple is it has no interrupts and no timer.

A basic system is little more than a ROM, CPU and one or more display pages.

Sorry but Apple does use interrupts. I have a patent based on interrupts and high speed transfer over the bus.

potatohead · 2021-04-28 06:57

Add on cards can use interrupts, also can employ DMA, and provide timers, and other things including specialized video displays and I/O, but the basic machine, sans any specialized cards does not.

As an emulation target, an Apple ][ 8 bit computer can be pretty lean and not tightly coupled to CPU cycles, or even speed.

You, and others, employed interrupts on an Apple, (which is cool) but the Apple ][ 8 bit machines do not employ them otherwise, Mouse and Network cards being the exception. I have neither now, and only used the mouse card back in the day. Someone is updating MouseDesk, so technically one could run a full GUI... but I digress.

If there is a program in memory, one only needs to monitor the soft switch addresses, provide one or more pages of display, and handle keyboard.

And, frankly, if one needs to put a program into memory, a simple paste operation into the system monitor can do that.

The ADT bootstrap program does that via the Super Serial card, or cassette interface.

DOS 3.3 or PRODOS literally gets typed in, via the serial port being redirected as the system input. "IN #[slot num]" then run to create bootable disks. That is what I did when I got my Apple //e some years ago. I have since installed a card that works with USB drives. But it was fun to bring up the floppy system to use prior to getting some form of diskless storage.

Spiffy!

That machine also has a CPU card in it that is a 65816, just for grins, and it runs the machine from something crazy like 0.1Mhz through about 16Mhz. (Which is crazy fast) I can turn a knob and modulate the CPU in real time! My granddaughter likes to twiddle it to play games at her speed.

Lol, she is 5 this year.

Anyway, the interrupt system is there, unused in most popular configurations, and an emulation capable of just that much can run pretty much all single load software.

Let's just say playing Space Invaders, Choplifter and friends can happen sans interrupts.

Adding a disk emulation would bring it up to a ton of software.

Mockingboard would include timers and an interrupt signal off those timers, and that usually saw use in games for background music and or managing screen redraws and related things. Very little software actually requires one of these cards.

Another Edit: I am writing about the 8 bit machines. The 16bit GS machine does use interrupts and is an order more sophisticated.

rogloh · 2021-04-28 07:09

Yeah the soft switches are an issue to deal with. If the emulated system needs to trigger soft switches with either reads or writes to particular addresses, something needs to sit before the hub accesses in the 6502 COG to deal with that. A subroutine call to read/write HUB will be needed that would intercept this and write some commands to some other COG mailbox area and/or use COGATN to trigger its action. Either that or deal with this aspect in a 6502 COG as well if it can emulate more than just the core CPU itself. It depends on just how complicated the HW emulation is, hopefully not that hard. Personally I'm not familiar with all the Apple II memory mapped peripherals/latches etc, or C64 for that matter, as I've not been brought up on those machines - I started out on Z80 systems instead and moved to PCs from there.

potatohead · 2021-04-28 07:36

One small upside is that, unlike other 8 bit machines, I believe they are all address triggered only. I am trying to recall any that are not, and am coming up empty.

A one COG Apple would have to scan them all.

A multi-COG one could just be watching those addresses and respond too, or get a signal as you say.

Some could be off in timing and to start many programs would not be impacted. This is true for the display at least. Memory related ones need to be bang on though.

Another upside is the display pages are fixed. Downside is the crazy addressing and artifact color.

Maybe a text mode Apple would fit into a COG? Probably needs two for a good baseline functioning emulation.

Re: COGATN combined with the 6502 CPU writing the address of interest somewhere would be enough for a service COG to get the request, fetch the address and do whatever.

Those addresses are all in a block of I/O space too, so there is an opportunity to range check, at least for the basics.

Might be reasonable and pretty lean?

Wuerfel_21 · 2021-04-28 07:46

@rogloh said:
Yeah the soft switches are an issue to deal with. If the emulated system needs to trigger soft switches with either reads or writes to particular addresses, something needs to sit before the hub accesses in the 6502 COG to deal with that. A subroutine call to read/write HUB will be needed that would intercept this and write some commands to some other COG mailbox area and/or use COGATN to trigger its action. Either that or deal with this aspect in a 6502 COG as well if it can emulate more than just the core CPU itself. It depends on just how complicated the HW emulation is, hopefully not that hard. Personally I'm not familiar with all the Apple II memory mapped peripherals/latches etc, or C64 for that matter, as I've not been brought up on those machines - I started out on Z80 systems instead and moved to PCs from there.

Yeah, a subroutine to read/write is necessary. It can be bypassed for ZP and stack access and, generally, for code reads too. Code doesn't generally run from IO registers (and also doesn't cross over between ROM/RAM/different banks, etc).

I think the ideal way is to just call some hubexec function to access device registers.
I'm currently trying to write a YM2612 sound chip emulation, and that would just not work with a mailbox+COGATN, at least not without having it skip a sample every time it happens. In particular, to save cycles in the emulator cog, the registers are remapped into a more sensible order, unused bits are masked out and certain values (notably operator detune amount) have to be updated every time a frequency register is written.

potatohead · 2021-04-28 07:47

The hardware is pretty simple.

1Mhz 6502, RAM, address triggered softswitches and the video system.

That is fixed memory map, text with no character redefinition that can also be a low resolution 40x48 pixel 16 color display. 2 pages of that, and 2 more pages of higher resolution bitmap display.

But, get this:

Seven, count 'em, seven pixels per byte!
High bit literally shifts the ones in that byte by a half pixel clock to offer two sets of artifact colors.

High res is 280x192 monochrome. 140'ish x 192 in 6 colors.
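A small sketch of how one hi-res byte unpacks, assuming the usual convention that bit 0 is the leftmost pixel (the bit order is not stated in this post, and the function name is made up):

```python
BYTES_PER_LINE = 40
PIXELS_PER_BYTE = 7


def decode_hires_byte(value):
    """Split a hi-res byte into 7 monochrome pixels plus the half-pixel shift flag."""
    pixels = [(value >> bit) & 1 for bit in range(PIXELS_PER_BYTE)]  # bit 0 first
    shifted = bool(value & 0x80)  # high bit delays this byte's pixels by half a pixel clock
    return pixels, shifted


print(BYTES_PER_LINE * PIXELS_PER_BYTE)  # 280
print(decode_hires_byte(0x81))           # ([1, 0, 0, 0, 0, 0, 0], True)
```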

The best?

Non linear memory map. Each horizontal line is 40 sequential bytes, but the vertical ones are arranged in a crazy pattern. This was done to avoid a dedicated RAM refresh circuit. And it reads on one phase of the system clock, making the whole works transparent to the 6502.
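For reference, the row addressing can be sketched like this. The formula and the $2000 base of hi-res page 1 are not given in this post; they are the commonly documented layout, so treat them as an assumption to check against a hardware reference.

```python
def hires_row_address(y, base=0x2000):
    """Address of the first byte of hi-res scanline y (0..191)."""
    if not 0 <= y < 192:
        raise ValueError("scanline out of range")
    return (base
            + (y & 7) * 0x400          # line within a group of 8
            + ((y >> 3) & 7) * 0x80    # group of 8 within a third of the screen
            + (y >> 6) * 0x28)         # which third of the screen (40 bytes apart)


for y in (0, 1, 8, 64, 191):
    print(y, hex(hires_row_address(y)))
# 0 0x2000
# 1 0x2400
# 8 0x2080
# 64 0x2028
# 191 0x3fd0
```

Consecutive scanlines are $400 bytes apart, which is why drawing a vertical line means jumping all over the page.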

The CPU runs at about 1.023Mhz, and there is no screen DMA. Simple, sort of.

The video system is done with discrete logic and basically runs all the time and can run when the CPU is dead even.

I saw some Z80 CPM machines, did most of my early computing on 6502 and 6809, and then went to the PC from there myself. At that same rough time, I also ended up in SGI IRIX land professionally and stayed in all of that until mid 00's. (Then left hard, gave it all to some guy who was as excited as I was done. Too perfect!)

rogloh · 2021-04-28 08:27

@Wuerfel_21 said:
I think the ideal way is to just call some hubexec function to access device registers.

Good idea. Perhaps a hub exec call on each read or write to identify addresses of interest in some particular range, and then branch to any finer-grained handling required from there. Luckily we have a fast P2 that probably still has 10-15x more performance headroom left over after the emulator's own needs at P2 speeds in the 225-350MHz range. There should be plenty of P2 cycles left to branch out and deal with special cases if needed.

potatohead · 2021-04-28 08:37

I just did a bit of catch up on the softswitches:

They are in page $C0

Pre Apple //e and //c, they are address sensitive only.

Some addresses were made to accept writes to engage the switch, and return states on reads.

The keyboard is an example. On everything prior to //e, reading this got keyboard data. Writes to it on newer machines would trigger a hardware function.

The current guidance on all this is to always use writes to trigger a switch, reads to get state info.

Looks like one page is active, $C0, and is only address sensitive in machines up through the ][+
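A minimal sketch of address-triggered switches, where any access (read or write) to the address flips the state. The thread only says the switches live in page $C0; the specific $C050-$C057 display switch assignments, the class name and the initial state below are assumptions taken from the commonly documented layout.

```python
class SoftSwitches:
    """Display soft switches at $C050-$C057: even address = off, odd address = on."""

    NAMES = ("text", "mixed", "page2", "hires")

    def __init__(self):
        # Initial state is an assumption for this sketch.
        self.text = True
        self.mixed = False
        self.page2 = False
        self.hires = False

    def access(self, address):
        """Call on any read or write; returns True if the address was a switch."""
        if not 0xC050 <= address <= 0xC057:
            return False
        name = self.NAMES[(address - 0xC050) >> 1]
        setattr(self, name, bool(address & 1))
        return True


switches = SoftSwitches()
switches.access(0xC050)   # graphics (text off)
switches.access(0xC057)   # hi-res on
print(switches.text, switches.hires)  # False True
```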

Wuerfel_21 · 2021-04-28 10:47

@rogloh said:

@Wuerfel_21 said:
I think the ideal way is to just call some hubexec function to access device registers.

Good idea. Perhaps a hub exec call on each read or write to identify addresses of interest in some particular range, and then branch to any finer-grained handling required from there. Luckily we have a fast P2 that probably still has 10-15x more performance headroom left over after the emulator's own needs at P2 speeds in the 225-350MHz range. There should be plenty of P2 cycles left to branch out and deal with special cases if needed.

Not every R/W, just the ones going to a tricky area. RAM and ROM usually sit at the bottom and top respectively, so those cases can be caught early with a CMP+JMP each. The C64 in particular has a weird memory map with lots of bank switching though. Still, the bottom of memory is always RAM on a 6502, for obvious reasons.
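Roughly like this, as a sketch: two range checks cover the common cases and only the leftover region goes to the slow handler. The boundaries ($C000 and $D000) and all names are hypothetical and would depend on the machine being emulated.

```python
RAM_TOP = 0xC000    # hypothetical: everything below is plain RAM
ROM_BASE = 0xD000   # hypothetical: everything from here up is plain ROM


def make_read(ram, rom, io_read):
    """Build a read function with two early-outs before the slow I/O path."""
    def read(address):
        if address < RAM_TOP:        # first compare: RAM fast path
            return ram[address]
        if address >= ROM_BASE:      # second compare: ROM fast path
            return rom[address - ROM_BASE]
        return io_read(address)      # only the tricky area pays for the call
    return read


io_log = []

def io_read(address):
    io_log.append(address)
    return 0xFF

ram = bytearray(RAM_TOP)
rom = bytes([0x4C]) * (0x10000 - ROM_BASE)
ram[0x0010] = 0x12
read = make_read(ram, rom, io_read)

print(hex(read(0x0010)), hex(read(0xFFFC)), hex(read(0xC030)), io_log)
# 0x12 0x4c 0xff [49200]
```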
