8086 emulation - just kicking around some ideas
Dennis Ferron
Posts: 480
With Baggers et al going strong on their 6502 emulator, and someone else working on an 8080 emulator, I thought I might leapfrog to thoughts of 8086 emulation. I'm much more familiar with x86 assembly and the PC architecture than I am with some of the older computers. I don't expect 8086 PC emulation to be very practical until the Propeller 2 comes out, but we can start with the Propeller 1, and I think most of the code would still be applicable. As an example of how it would be useful, you could make a hand-held Propeller-based device, something like a Game Boy, that could run old DOS games.
One of the things they've discussed in the 2600 emulation thread is whether to store registers in hub RAM so you can have more than one cog working on them, or store registers in cog RAM for speed. I'm thinking of a hybrid approach where the general purpose registers most used for math (AX, BX, CX, DX) are stored in cog RAM to make instructions like ADD AX,BX fast, while registers that address memory like SP, BP, and IP, and the segment registers (CS, DS, ES, SS) are stored in hub RAM. Instead of splitting up the work by running multiple identical cogs on the same opcode stream, let's split up the work according to processor architecture components. For instance, the original 8086 had an instruction fetch unit and an instruction execution unit that ran in tandem; on the Prop we could have one cog be the fetch unit and another cog be the execute unit. Since any useful code that will run on a PC will likely need 512K or 1M of RAM to function correctly, we'll certainly need an external RAM chip, even on the Prop2. The fetch unit would take care of accessing the external RAM chip, and it wouldn't need to know about AX, BX, etc. It would only need the segment registers and pointer registers, which, living in hub RAM, would provide a sort of "communication channel" between the execution unit and the fetch unit.
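To make the split concrete, here is a rough Python sketch (a model of the idea, not Propeller code). The `hub` dictionary stands in for hub RAM, and the `FetchUnit` and `ExecUnit` classes are hypothetical names for the two cogs. Only the register-to-register form of ADD is handled and flags are not updated, so treat it as an illustration of who owns which registers rather than a working core.

```python
# Sketch only: models the proposed register split, not real Propeller code.
# "hub" stands in for hub RAM shared by both cogs; all names are hypothetical.
hub = {"CS": 0x0000, "DS": 0x0000, "ES": 0x0000, "SS": 0x0000,
       "IP": 0x0100, "SP": 0xFFFE, "BP": 0x0000, "FLAGS": 0x0000}


class ExecUnit:
    """Execution cog: keeps AX, BX, CX, DX private (cog RAM in the proposal)."""

    REG16 = ["AX", "CX", "DX", "BX"]  # 8086 ModRM register numbers 0-3

    def __init__(self):
        self.regs = {name: 0 for name in self.REG16}

    def add_reg_reg(self, modrm):
        """ADD r/m16, r16 (opcode 0x01), register-to-register form only."""
        assert modrm >> 6 == 0b11, "sketch handles mod=11 only"
        src = (modrm >> 3) & 0b111   # 'reg' field is the source
        dst = modrm & 0b111          # 'r/m' field is the destination
        assert src < 4 and dst < 4, "sketch handles AX, CX, DX, BX only"
        s, d = self.REG16[src], self.REG16[dst]
        self.regs[d] = (self.regs[d] + self.regs[s]) & 0xFFFF  # flags omitted


class FetchUnit:
    """Fetch cog: sees only hub state (CS:IP) and memory, never AX..DX."""

    def __init__(self, memory):
        self.memory = memory

    def next_byte(self):
        addr = ((hub["CS"] << 4) + hub["IP"]) & 0xFFFFF
        hub["IP"] = (hub["IP"] + 1) & 0xFFFF
        return self.memory[addr]


memory = bytearray(1 << 20)
memory[0x0100:0x0102] = bytes([0x01, 0xD8])  # ADD AX, BX

fetch, execute = FetchUnit(memory), ExecUnit()
execute.regs["AX"], execute.regs["BX"] = 0x1234, 0x0001

opcode = fetch.next_byte()
if opcode == 0x01:
    execute.add_reg_reg(fetch.next_byte())
```

The fetch side never touches `execute.regs`, and the execute side never computes an address; everything they share goes through `hub`.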
With this design, for prototypes you could start out with the fetch unit wired to pull from hub RAM, and change the prefetch object to use various types of external RAM later without changing the rest of the emulator.
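One way to picture the swappable prefetch object, as a Python sketch: the `HubMemory`, `ExternalSram` and `Prefetcher` classes and their `read_byte`/`write_byte` methods are made-up names for illustration. The 32 KB size matches the Prop 1's hub RAM; wrapping out-of-range addresses in the prototype backend is just an assumption to keep the sketch short.

```python
class HubMemory:
    """Prototype backend: emulated memory lives in a hub-RAM-sized buffer."""

    def __init__(self, size=32 * 1024):
        self.data = bytearray(size)

    def read_byte(self, addr):
        return self.data[addr % len(self.data)]  # wrap: prototype shortcut

    def write_byte(self, addr, value):
        self.data[addr % len(self.data)] = value & 0xFF


class ExternalSram:
    """Stand-in for an external RAM driver; a real one would drive I/O pins."""

    SIZE = 1 << 20  # full 8086 address space

    def __init__(self):
        self.data = bytearray(self.SIZE)
        self.accesses = 0  # crude counter to show where the cost would be

    def read_byte(self, addr):
        self.accesses += 1
        return self.data[addr & (self.SIZE - 1)]

    def write_byte(self, addr, value):
        self.accesses += 1
        self.data[addr & (self.SIZE - 1)] = value & 0xFF


class Prefetcher:
    """Fetch unit front end: only knows the read_byte interface."""

    def __init__(self, backend):
        self.backend = backend

    def fetch(self, addr, count):
        return bytes(self.backend.read_byte(addr + i) for i in range(count))


def load(backend, addr, code):
    for i, b in enumerate(code):
        backend.write_byte(addr + i, b)


code = bytes([0x01, 0xD8, 0x90])  # ADD AX,BX ; NOP
results = []
for backend in (HubMemory(), ExternalSram()):
    load(backend, 0x0100, code)
    results.append(Prefetcher(backend).fetch(0x0100, len(code)))
```

Both backends hand the same bytes to `Prefetcher`, which is the point: nothing above the backend has to change when the RAM does.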
Prefixes that deal with looping like REP and REPNE could be pre-emptively decoded and handled by the fetch unit, so the execution unit would not need to know about them. The fetch unit could simply re-execute the same instruction until the loop is completed. You'd think it would need access to CX to know when the count is 0, but actually it only needs the execution unit to write a flags register to hub RAM, not CX. Because Z and V flags will be available from that, the fetch unit could also handle jump instructions itself without sending them to the execution unit, because at the time a conditional jump is encountered, the flags register can be inspected to find out whether to jump or not. (The flag being set by a cmp or test instruction previously in the execution unit.)
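The conditional-jump half of this can be sketched in Python using only a FLAGS word of the kind the execution unit would write to hub RAM. The function names are hypothetical; the opcode range (0x70-0x7F) and flag bit positions are the standard 8086 ones. One caveat worth keeping in mind: on the real 8086, plain REP stops when CX reaches zero and does not involve the flags (only REPE/REPNE additionally test ZF), so the fetch unit would still need some "count exhausted" signal from the execution unit for that case.

```python
# 8086 FLAGS bit masks
CF, PF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 6, 1 << 7, 1 << 11


def condition_met(opcode, flags):
    """Evaluate a short Jcc opcode (0x70-0x7F) against a FLAGS word."""
    cf, pf, zf, sf, of = (bool(flags & m) for m in (CF, PF, ZF, SF, OF))
    base = [
        of,                # 0x70 JO
        cf,                # 0x72 JB
        zf,                # 0x74 JZ
        cf or zf,          # 0x76 JBE
        sf,                # 0x78 JS
        pf,                # 0x7A JP
        sf != of,          # 0x7C JL
        zf or (sf != of),  # 0x7E JLE
    ][(opcode & 0x0F) >> 1]
    # Odd opcodes (JNO, JNB, JNZ, ...) are the negated form of the even one.
    return base != bool(opcode & 1)


def next_ip(ip, opcode, disp8, flags):
    """IP after a 2-byte Jcc located at `ip`, decided by the fetch unit alone."""
    fallthrough = (ip + 2) & 0xFFFF
    if not condition_met(opcode, flags):
        return fallthrough
    rel = disp8 - 0x100 if disp8 & 0x80 else disp8  # sign-extend 8-bit offset
    return (fallthrough + rel) & 0xFFFF
```

For example, `next_ip(0x0100, 0x74, 0xFE, ZF)` is a JZ that jumps back onto itself when the zero flag is set, while the same call with `flags=0` falls through to the next instruction.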
I don't think we need to go for cycle-perfect emulation for the 8086. It's too hard, but it's also less important here. With something like the Atari 2600, the instruction timing is tightly coupled to the TV signal generation. The PC, on the other hand, has a video card which decouples the CPU timing from the video timing. Also, games had to be able to run on many different models of PCs with different processor classes, and so the game code had to account for differences in execution speed anyway. We just need to reach an acceptable minimum speed, not be cycle-perfect, I think.
Comments
A two-Propeller approach, where one Propeller does nothing but emulate the whole CPU, could be a possibility.
The normal approach to creating emulators, at least in the ones that I have looked at since starting this whole thing, is just to write a whole bunch of code that logically implements each instruction.
What you seem to be proposing is more like: create, in code, some simulations of the hardware components (the Bus Interface Unit, the ALU, the registers, etc.) and then stitch them together to create the complete CPU.
This sounds much more like the approach someone would take if they were creating a CPU core in Verilog or VHDL for an FPGA, ASIC or, well, a real chip like the Propeller itself.
Now the traditional approach comes about because people are writing an emulator program on a single CPU machine, a PC for example, and everything has to be done sequentially.
The latter approach comes about because when designing with FPGA or "real" hardware you can make use of the fact that you have lots of gates that can do many things in parallel.
As such I think you have a novel idea for the Propeller which has at least some parallelism going on.
However, at the end of the day the logic required to implement all the instructions has to "be" somewhere. In the traditional case it is written in the emulator code. In the latter case it would end up in an emulation of the microcode of the emulated CPU.
I'm sure 8086 is possible, one way or the other. Given the amount of effort it has taken to get 8080 logically correct, as far as I know, I would have to take my hat off in a very big way to anyone with the tenacity to do the 8086.
As I have said elsewhere, even the Z80 seems to be an order of magnitude more work than the 8080.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Leon
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle
Post Edited (Leon) : 1/23/2009 12:48:44 AM GMT
I think the new Prop+SRAM that I am designing would be an ideal candidate for just the x86 emulator. Keep the peripherals (VGA, keyboard, etc.), but excluding the uSD card (disk emulator), on a separate Prop. Communication between the Props would be via high-speed async serial (tx & rx).
There is a 1Mx8 SRAM, but I don't think it is available in PDIP. I am unsure whether to go SMT.
PS. The original IBM PC ran at 4.77 MHz on an 8088. Could be a good place to start.
Just my 2c
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Prop Tools under Development or Completed (Index)
· Emulators (Micros eg Altair, and Terminals eg VT100) - index
· Search the Propeller forums (via Google)
My cruising website is: www.bluemagic.biz
I'm feeling that while the Prop has 8 COGs, it really does not have the parallelism of real gates in an FPGA etc. to pull off the latter.
For a CPU more complicated than an 8080 I agree with Cluso that you just have to distribute the opcode handlers over multiple COGS as he describes. Otherwise you have to use LMM or overlays just to get enough code in and that will kill speed.
At one point I did have all of the 8080 emulator in 1 COG with no LMM or overlays. But the contorted logic required to squeeze it all in made it very slow. It turned out faster when using a little LMM. I'm quite wedded to single-COG emulation at the moment as I want to use all the other COGs for the rest of the CP/M system. System On A Chip!
@Cluso: Forget about the 8088. Logically it is the same as an 8086 but with only an 8-bit data bus. Multiplexing high and low bytes will slow things down. Isn't it better to build an external RAM multiplexing 16 bits of data with the low 16 bits of address?
Also, musings re a Z80: there are a huge number of instructions you can cheat with, e.g. the set and reset instructions. No one ever used these to set a bit in a byte pointed to by the IX+n register. They would code it the 8080 way: read the byte into A, set the bit and write it back out again. So there would be lots of opcodes where you could replicate this. If you want to set a bit in D: push AF, put D in A, set the bit, A to D, and then pop AF. OK, it is slower, but those instructions are hardly used anyway. A quick scan down some real assembly source code shows that most coders use 20 or 30 instructions, so the less popular ones can be coded in much slower ways. Actually, I think an 8080 emulator is about 99% done for a Z80 too. You could even get lazy and code all the unused instructions to an error message, throw some real code at it and note the errors.
As for 8086 - this is stretching my memory banks a bit here, but didn't they (and all the chips IBM worked on) use a memory banking system? It had banks and an offset or something, but the bank only incremented the address a tiny bit each time. That might take some extra coding. And the 8086 range might also benefit more from the 512k RAM chip, as they used 640k instead of 64k.
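The "banks and an offset" scheme is the 8086's segment:offset addressing: the 16-bit segment register is shifted left 4 bits (so each step of the segment moves the address by only 16 bytes) and added to the 16-bit offset, giving a 20-bit address. A minimal Python sketch, with `physical_address` being a made-up helper name:

```python
def physical_address(segment, offset):
    """8086 real-mode address: segment * 16 + offset, truncated to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


# Different segment:offset pairs can name the same byte.
same_a = physical_address(0x1234, 0x0010)
same_b = physical_address(0x1235, 0x0000)

# The 8086 has only 20 address lines, so the top of memory wraps to zero.
wrapped = physical_address(0xFFFF, 0x0010)

# 640K of conventional memory ends where segment 0xA000 begins.
conventional_top = physical_address(0xA000, 0x0000)
```

So the extra coding is small: one shift, one add and a mask on every memory access, done with whichever segment register applies.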
I think in a generic sense that a propeller +512k ram chip with all the pins straight through and maybe one cog devoted to a uart serial simulation, but the other 7 free, is going to give the maximal flexibility and speed to emulate all sorts of chips.
Post Edited (Dr_Acula (James Moxham)) : 1/23/2009 9:49:10 AM GMT
Where we do need COGS is:
1) If you want to go faster put some opcode handlers in PASM in another COG or more.
2) Peripherals, UART, video.
Just now PropAltair doesn't do 1); rather, it uses pretty much all COGs and HUB to get a whole CP/M computer on a chip!
(Well the 8 floppy disks are on SD card)
Interesting musings about "cheating" in the Z80. I know there are some programs in the CP/M world that need a Z80 to run. I have always suspected that these programs probably don't use much of the Z80 capabilities. So as you say it should be possible to determine which z80 opcodes they do use and create a partial Z80 emulator that will run them. This is a task for another day.
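A rough Python sketch of that "find out which opcodes are really used" approach: run the program through a partial emulator and count every opcode that has no handler. Everything here (`HANDLERS`, `run`, the `cpu` dictionary) is hypothetical, and unknown opcodes are simply skipped as if they were one byte long, which a real tool would have to do better for prefixed Z80 instructions.

```python
from collections import Counter


def nop(cpu):
    pass


def halt(cpu):
    cpu["halted"] = True


# Opcodes the partial emulator already understands (shared by 8080 and Z80).
HANDLERS = {0x00: nop, 0x76: halt}


def run(program, max_steps=1000):
    """Execute what we can and return a count of unimplemented opcodes."""
    cpu = {"pc": 0, "halted": False}
    missing = Counter()
    steps = 0
    while not cpu["halted"] and cpu["pc"] < len(program) and steps < max_steps:
        opcode = program[cpu["pc"]]
        cpu["pc"] += 1
        handler = HANDLERS.get(opcode)
        if handler is None:
            missing[opcode] += 1  # note it and carry on (assumes 1-byte opcode)
        else:
            handler(cpu)
        steps += 1
    return missing


# 0x08 (EX AF,AF') and 0xD9 (EXX) are single-byte Z80-only instructions.
program = bytes([0x00, 0x08, 0xD9, 0x08, 0x76])
missing = run(program)
for opcode, count in sorted(missing.items()):
    print(f"unimplemented opcode 0x{opcode:02X} seen {count} time(s)")
```

The resulting list is the to-do list: only the opcodes that real programs actually hit need handlers.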
On the other hand, I know nothing about, for example, the Sinclair ZX81 or Spectrum world. Uncle Clive had a reputation for using devices in strange ways and exploiting every possible trick to get his products working. So maybe a different subset of Z80 features will be needed there, if not the whole set.
I seriously think we should design a twin Prop + RAM board to implement what you describe in your last paragraph.
Try to have one of the Props surrounded by a "standard" demo board circuit. So that it can be used as such. Other prop dedicated to the RAM heavy jobs like emulation.
Also I see comments sometimes about running out of longs. Could some of this storage go off to external ram as well?
As for running out of LONGs, generally I'm referring to my COG code space. There is not much storage going on there only the 8080 registers and a handful of variables.
BTW: N8VEM has landed here, check your PM. :)