@macca said:
Had to rewrite the processor emulation almost completely to fix the bugs. Thankfully I found a suite of tests that allowed me to fix everything...
They are for 186/286 but aside from some differences in the results they are valid for an 8086.
Thanks, I downloaded these tests last year but didn't understand how to use them at that time.
Each 64K .bin file replaces the BIOS ROM at F0000, but isn't there a potential problem? After reset, CS = FFFF and IP = 0000. The jmp start instruction (EB 0E) in the binary at offset FFF0 (physical address FFFF0) seems to change IP to 0010, and therefore the physical address of start = FFFF0+0010 = 100000, which wraps to 00000 on the 20-bit address bus.
It is a bit rough but should work with both FAT16 and FAT32 partitions.
You need to have the PCDOS 2.00 disk images in the root directory, named PCDOS_A.IMG and PCDOS_B.IMG (for drive A and B as the name implies) and the emulator should boot from drive A, otherwise without SD card or if the files are missing, it boots to the BASIC rom (the image files should be easily found on the 'net).
The SD card is not mapped on the P2-EDGE pins because I don't have a module with the microSD, and when I tried to wire an adapter it wouldn't even allow me to upload programs... maybe it is the adapter or I haven't wired it correctly... anyway, it should work. I don't remember where I got the SD code(*) but it should be compatible with the SD/Flash-CS wiring.
@pik33 said:
If there is a place for it, maybe consider adding 186 (improperly called "286") additional opcodes.
In the late 80s and early 90s there were a lot of AT computers in use, with 286 or even 386 processors, but nobody was using the 286 "protected mode" yet. Everything was DOS or Windows 3.11, with the CPU in "real mode" except when the BIOS switched it to emulate EMS. But there were several convenient instructions we liked to use in asm. The most popular of them were pusha and popa. The rest were shift, rol, push and mul with immediate operands, which I used, and insb/outsb/enter/leave/bound, which I don't remember using.
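For anyone considering those extra opcodes, here is a minimal Python model of what pusha and popa do on a 186-class CPU. It is only a sketch of the architectural behaviour, not code from the emulator: the register dictionary and the dict-based stack are hypothetical stand-ins. The detail that is easy to get wrong is that pusha stores the value SP had before the instruction started, and popa discards that slot instead of loading it.

```python
# Push order used by PUSHA; POPA restores in the reverse order.
REG_ORDER = ("ax", "cx", "dx", "bx", "sp", "bp", "si", "di")

def pusha(regs, stack):
    """Push all eight 16-bit registers; the SP slot holds the pre-push SP."""
    original_sp = regs["sp"]
    for name in REG_ORDER:
        value = original_sp if name == "sp" else regs[name]
        regs["sp"] = (regs["sp"] - 2) & 0xFFFF
        stack[regs["sp"]] = value & 0xFFFF

def popa(regs, stack):
    """Pop all registers in reverse order; the stored SP value is skipped."""
    for name in reversed(REG_ORDER):
        value = stack[regs["sp"]]
        regs["sp"] = (regs["sp"] + 2) & 0xFFFF
        if name != "sp":
            regs[name] = value

# Hypothetical register file and a stack modelled as {offset: word}.
regs = {"ax": 1, "cx": 2, "dx": 3, "bx": 4, "sp": 0x100, "bp": 5, "si": 6, "di": 7}
saved = dict(regs)
stack = {}

pusha(regs, stack)
sp_after_push = regs["sp"]
for name in ("ax", "cx", "dx", "bx", "bp", "si", "di"):
    regs[name] = 0xDEAD          # clobber everything
popa(regs, stack)
print(hex(sp_after_push), regs == saved)
```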
I remember the first 486 we bought for the faculty. Then we installed Matlab on it, and it ran at light speed.
Well, the 186 note brings to mind the HP200LX, which is the only system I have ever had with a 186 in it. Quite a handy little device it was. I still have one and could dump the ROM if needed (and instructions provided).
@Wuerfel_21 said:
Not a lot of software uses the 286's protected mode because Intel forgot to add a way to switch back to real mode. The 386 allows this, but of course if you're gonna write for the 386, you might as well use 32-bit mode.
cmps without flags set are simply nops. Maybe try adding wz to these? I will try to run this later. I have to find DOS 2.0, as the lowest DOS I have now is 3.
@rogloh said:
Such an old school look and feel. I love it.
When compiling on flexspin 5.9.12 I get these warnings... isn't this an indication of a bug?
i8086_xt.spin2:6908: warning: instruction cmp used without flags being set
i8086_xt.spin2:6915: warning: instruction cmp used without flags being set
Yes, those lines are missing a wz but, as you guessed, they are relevant only for the SD card access, and I also think they have no effect anyway... the FAT1 or FAT3 check is enough unless you have an SD formatted as FAT12...
It crashed in Basic with this:
10 print "Hello"
20 goto 10
Interesting, however I don't think it "crashed" in the real sense of the word. The BIOS screen output routines blank the video when scrolling, so it looks like it is always blanked... try a long for / next loop, like
10 for i=1 to 300
20 print "Hello"
30 next i
You should see a blank screen and at some point it terminates.
It should be possible to stop the program with a key combination, but I can't remember how...
Looking at your x86 emulator @macca, I see it accesses memory through iread_memb and iwrite_memb entry points. I sort of wonder what would happen if we tacked on some really tight PSRAM reading code (like Wuerfel_21 uses) using simple 4-bit wide PSRAM so as to avoid RMW requirements. At 320MHz there may be some potential to get 3 to 4 separate PSRAM reads or writes in per microsecond, excluding the instruction execution. I read that the original 8088/8086 took 4 clocks per bus cycle, transferring 8/16 bits respectively, so that's probably in the same ballpark for a machine in the 8-10MHz class, and likely fine for emulating slow 4.77MHz machines. Also, the PSRAM can read 16 bits in only 4 P2 clocks more than reading 8 bits, and alignment is not an issue with 4-bit PSRAM.
Alternatively if that's still slow I wonder if we had a COG paired to your emulator that is running some PSRAM driver code and can read and write bytes on behalf of the x86 emulator requests (made via upper LUT longs) and it also manages a decent sized cache in the HUB RAM for this x86 emulator COG while the PC's 640kB is actually stored in PSRAM. That paired COG could help keep the read/write requests latency rather low vs any HUB based scheme with mailboxes etc. Also if cache tags are managed in the paired COGs COGRAM, assuming it has enough room for some small sized block mappings (let's say 4kB block sized, 160 of them for 640kB) then maybe the cached blocks could be read from HUB RAM when known to be in memory and swapped out on demand when we're out of HUB space etc without needing lots of other HUB RAM accesses. It's somewhat tempting to try this out. I do wonder if a cached x86 would run slow as an old dog or still be "somewhat" usable. Most of those early XT type machines never really used a cache to my knowledge, that came later with 486's etc, so there's not much to compare against. It might work but be intermittently slow, not sure.
One problem is peripherals that need to access memory directly (like DMA controllers etc). In that case some DOS memory may have to be marked as non-cacheable, and that introduces more complexity. Certainly video RAM should remain in HUB, although the BIOS could be put into PSRAM, which frees more HUB for cache stuff.
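To make the block-mapping idea more concrete, here is a rough Python model of it. This is only a sketch under my own assumptions, not a proposal for the actual COG code: a bytearray stands in for PSRAM, a handful of 4kB slots stand in for HUB RAM, and the 160-entry tag table is what would live in the paired COG's COGRAM. The round-robin eviction, the unconditional write-back and all the names are hypothetical.

```python
BLOCK_SIZE = 4 * 1024
NUM_BLOCKS = (640 * 1024) // BLOCK_SIZE   # 160 tags cover the PC's 640kB

class BlockCache:
    def __init__(self, backing, slots):
        self.backing = backing                                       # stands in for PSRAM
        self.slots = [bytearray(BLOCK_SIZE) for _ in range(slots)]   # stands in for HUB RAM
        self.tag = [None] * NUM_BLOCKS    # block number -> slot index (or None)
        self.owner = [None] * slots       # slot index -> block number (or None)
        self.next_victim = 0              # simple round-robin replacement
        self.misses = 0

    def _slot_for(self, block):
        slot = self.tag[block]
        if slot is None:
            self.misses += 1
            slot = self.next_victim
            self.next_victim = (slot + 1) % len(self.slots)
            old = self.owner[slot]
            if old is not None:
                # Write the evicted block back (no dirty tracking in this sketch).
                start = old * BLOCK_SIZE
                self.backing[start:start + BLOCK_SIZE] = self.slots[slot]
                self.tag[old] = None
            start = block * BLOCK_SIZE
            self.slots[slot][:] = self.backing[start:start + BLOCK_SIZE]
            self.tag[block] = slot
            self.owner[slot] = block
        return slot

    def read(self, addr):
        block, offset = divmod(addr, BLOCK_SIZE)
        return self.slots[self._slot_for(block)][offset]

    def write(self, addr, value):
        block, offset = divmod(addr, BLOCK_SIZE)
        self.slots[self._slot_for(block)][offset] = value & 0xFF

psram = bytearray(640 * 1024)
cache = BlockCache(psram, slots=2)
cache.write(0x1234, 0xAB)      # block 1 loaded into slot 0
cache.read(0x5000)             # block 5 loaded into slot 1
cache.read(0x9000)             # block 9 evicts block 1, which is written back
print(hex(psram[0x1234]), hex(cache.read(0x1234)), cache.misses)
```

A real version would want a dirty bit per slot and a non-cacheable range for anything the DMA touches, which is where the extra complexity mentioned above comes in.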
Not sure if a cache system is necessarily helpful. PSRAM is just fast enough to where the overhead of a cache layer might just make everything worse. Though, what does the PSRAM do when you raise CS before completing a command? If that's not an issue, the cache logic can happen in parallel with the PSRAM setup, so the worst-case latency isn't affected.
@rogloh said:
Looking at your x86 emulator @macca, I see it accesses memory through iread_memb and iwrite_memb entry points. I sort of wonder what would happen if we tacked on some really tight PSRAM reading code (like Wuerfel_21 uses) using simple 4-bit wide PSRAM so as to avoid RMW requirements.
[...]
I don't know what may be a good caching scheme for that. I'm not familiar at all with PSRAM so I don't know how it works and what the limits are, apart from the impression that the timings are a bit critical... It depends on how the PSRAM read/write time compares with the hub rd/wrbyte. Maybe removing some instructions, like the need to add the hub ram start address, and rearranging the code a bit could help to gain some clock cycles.
Most of those early XT type machines never really used a cache to my knowledge, that came later with 486's etc, so there's not much to compare against. It might work but be intermittently slow, not sure.
The 8086 has a 6-byte prefetch queue where code and immediates are read from. This can be refilled automatically in a dedicated small fifo buffer starting from PC. The prefetch queue is always cleared with each branch, so there may be plenty of time to fetch the first byte.
One problem is peripherals that need to access memory directly (like DMA controllers etc). In that case some DOS memory may have to be marked as non-cacheable, and that introduces more complexity. Certainly video RAM should remain in HUB, although the BIOS could be put into PSRAM, which frees more HUB for cache stuff.
Currently the DMA is used only by the disk drives and it runs in its own COG. Since the disk drives were slow at the time, this can be slowed down at will (sort of, there is always a timeout in the BIOS, so better not be too slow...) and give priority to the CPU.
Not sure if a cache system is necessarily helpful.
If a PSRAM is used only by the 8086, then maybe not, but if something else needs this (eg. SoundBlaster or GUS) then yes.
eeeh, not necessarily. In NeoYume the RAM is constantly getting blasted with video and ADPCM reads and it still gets enough bandwidth to run the 68000 emulation at speeds cromulent enough to keep up with the 12MHz original. Though IDK how that'd look without the big code prefetch (which only really works for ROM code) and all memory being external.
@macca said:
The 8086 has a 6-byte prefetch queue where code and immediates are read from. This can be refilled automatically in a dedicated small fifo buffer starting from PC. The prefetch queue is always cleared with each branch, so there may be plenty of time to fetch the first byte.
Yes, I feel there's some scope to gain improvements here, because while it's not branching you could be reading from 6 bytes cached by an earlier PSRAM read. If you read 6 bytes on each jump you'll only incur a 20 P2 clock additional penalty vs reading one byte from PSRAM, and 20 clocks is already somewhat comparable to a single RDBYTE anyway when the egg beater is not nicely aligned. After that point you'll only need to read code from PSRAM again for a branch (which presumably took a while on an x86 anyway) or after a few more instructions exhaust the queue. It won't have to be for every single instruction, only the memory accessing ones. Surely that can help a little bit.
@pik33 said:
Not sure if a cache system is necessarily helpful.
If a PSRAM is used only by the 8086, then maybe not, but if something else needs this (eg. SoundBlaster or GUS) then yes.
Maybe caching CS segment only can be a good solution.
Interesting idea, and it's a smaller range to worry about, which is nice. Far calls/returns could change that dynamically and it may invalidate frequently though. I wasn't really considering caching read data, and it would need to use a write-through cache anyway in case of self-modifying code in other code segments read later. Not really worth caching the data I guess.
Liking the idea of a potential SoundBlaster/Adlib emulation down the track, as well as a serial port COG perhaps to let the emulated PC talk externally to other devices (e.g. an old serial mouse) or to the host PC via emulated COM ports. Maybe macca already has that in mind too or it could be contributed by others if needed...? Looks like 5 COGs are already used, leaving 3 more for such things. That could suffice to make a nice little self-contained emulated old school PC/XT to fit in the P2.
Didn't dig deep enough yet, but is there already an IO pin output that can be mapped to a squawker, or could that perhaps get added too?
All jumps and calls take at least 15 clock cycles. Any conditional jump requires four clock cycles if not taken, but if taken, it requires 16 cycles in addition to resetting the prefetch queue; therefore, conditional jumps should be arranged to be not taken most of the time, especially inside loops. In some cases, a sequence of logic and movement operations is faster than a conditional jump that skips over one or two instructions to achieve the same result.
This is very helpful for giving PSRAM plenty of time to prefetch a LOT of data. 15 cycles at 10MHz is 1.5us. You could easily go off and read 64 bytes in that time @320MHz with 4 bit PSRAM.
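The numbers above can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic under assumptions taken from elsewhere in the thread: 2 P2 clocks per nibble (consistent with "16 bits in only 4 P2 clocks more than 8 bits"), 8 nibbles for command plus address, and a wait phase of PSRAM_WAIT*2+PSRAM_DELAY P2 clocks. Call overhead and chip-select handling are ignored, and the function names are made up.

```python
P2_HZ = 320_000_000
X86_HZ = 10_000_000
BRANCH_X86_CYCLES = 15

P2_CLOCKS_PER_NIBBLE = 2        # assumption: one nibble every 2 P2 clocks
ADDRESS_NIBBLES = 8             # command byte + 24-bit address
PSRAM_WAIT = 10
PSRAM_DELAY = 4
WAIT_P2_CLOCKS = PSRAM_WAIT * 2 + PSRAM_DELAY

def branch_budget_p2_clocks():
    """P2 clocks that pass while a 10MHz 8086 spends 15 cycles on a jump."""
    return BRANCH_X86_CYCLES * P2_HZ // X86_HZ

def burst_read_p2_clocks(nbytes):
    """Rough P2 clock cost of one PSRAM burst read of nbytes."""
    address = ADDRESS_NIBBLES * P2_CLOCKS_PER_NIBBLE
    data = nbytes * 2 * P2_CLOCKS_PER_NIBBLE
    return address + WAIT_P2_CLOCKS + data

print(branch_budget_p2_clocks())                          # budget per branch
print(burst_read_p2_clocks(64))                           # 64-byte burst
print(burst_read_p2_clocks(6) - burst_read_p2_clocks(1))  # extra cost of 6 bytes vs 1
```

Under these assumptions a 64-byte burst fits inside the time of one emulated branch with room to spare, and the 6-byte versus 1-byte difference matches the 20-clock figure mentioned earlier.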
A small block of PSRAM (say 32 bytes) could be read into HUB acting as a larger prefetch buffer after every x86 branch and then while reading and executing these bytes from HUB you only really have to deal with running past the end of the block to trigger a new PSRAM load. For this you'd need to track an offset in the block and whenever it gets a bit too close to the end and you might go past it with the longest opcode sequence, then reload a fresh block from PSRAM. Also you need to track if the x86 writes or DMA are within the active block region to invalidate the block for self modifying code cases. That's just about it.
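Here is a small Python model of that bookkeeping, just to show how little state is involved. It is a sketch, not emulator code: the memory is a plain bytearray standing in for PSRAM, the 6-byte margin for the longest opcode sequence is a hypothetical value (prefixes could make real instructions longer), and the method names are invented.

```python
BLOCK = 32       # bytes fetched from PSRAM per reload
MAX_INSN = 6     # hypothetical margin for the longest opcode sequence

class PrefetchBlock:
    def __init__(self, memory):
        self.memory = memory      # stands in for PSRAM
        self.base = None          # address of the first byte held, None = invalid
        self.data = b""
        self.loads = 0            # number of PSRAM block reads performed

    def _load(self, addr):
        self.base = addr
        self.data = bytes(self.memory[addr:addr + BLOCK])
        self.loads += 1

    def begin_instruction(self, addr):
        """Call at every instruction start (and therefore after every branch)."""
        last_safe = None if self.base is None else self.base + BLOCK - MAX_INSN
        if self.base is None or not (self.base <= addr <= last_safe):
            self._load(addr)

    def fetch(self, addr):
        """Read one code byte of the current instruction from the buffer."""
        return self.data[addr - self.base]

    def note_write(self, addr):
        """Call for x86 or DMA writes; invalidates on self-modifying code."""
        if self.base is not None and self.base <= addr < self.base + BLOCK:
            self.base = None

memory = bytearray(range(256))
pf = PrefetchBlock(memory)
pf.begin_instruction(0x10)          # first load
first = pf.fetch(0x10)
pf.begin_instruction(0x12)          # still inside the block, no reload
pf.begin_instruction(0x2B)          # too close to the end, reload
memory[0x2C] = 0xFF
pf.note_write(0x2C)                 # write hits the active block
pf.begin_instruction(0x2C)          # reload picks up the new byte
print(first, pf.fetch(0x2C), pf.loads)
```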
@rogloh said:
Liking the idea of a potential SoundBlaster/Adlib emulation down the track, as well as a serial port COG perhaps to let the emulated PC talk externally to other devices (e.g. an old serial mouse) or to the host PC via emulated COM ports. Maybe macca already has that in mind too or it could be contributed by others if needed...? Looks like 5 COGs are already used, leaving 3 more for such things. That could suffice to make a nice little self-contained emulated old school PC/XT to fit in the P2.
The COM port support is on the TODO list, as well as hard disk support (which I think can be coupled with the floppy drive code and share the same COG). The 8250 serial chip was very simple and comparable to the SmartPin Async mode, so I don't think it will require a dedicated COG.
Didn't dig deep enough yet, but is there already an IO pin output that can be mapped to a squawker, or could that perhaps get added too?
If you mean the beeper, no, I haven't mapped the pin and the timer associated to it needs to be checked.
This is very helpful for giving PSRAM plenty of time to prefetch a LOT of data. 15 cycles at 10MHz is 1.5us. You could easily go off and read 64 bytes in that time @320MHz with 4 bit PSRAM.
Yes, branches are very slow anyway.
A small block of PSRAM (say 32 bytes) could be read into HUB acting as a larger prefetch buffer after every x86 branch and then while reading and executing these bytes from HUB you only really have to deal with running past the end of the block to trigger a new PSRAM load. For this you'd need to track an offset in the block and whenever it gets a bit too close to the end and you might go past it with the longest opcode sequence, then reload a fresh block from PSRAM. Also you need to track if the x86 writes or DMA are within the active block region to invalidate the block for self modifying code cases. That's just about it.
I was thinking of a ring buffer, or something, so the "feeder" COG can check when it is near the wrap-around and refill.
About the self-modifying code, I don't think this will ever be an issue, if the writes are within the prefetch queue they will be ignored anyway. It would help, to maintain the "compatibility" to keep the queue to 6 bytes.
Also was thinking about something like a "memory server" using a smartpin in long repository mode, with 32 bits it can hold 8-bits for data + 20-bits for address, and there are 4 spare bits that can be used as read/write flag and something else. Writes to smartpin can be hooked to an interrupt in the "server COG", the client COG should wait anyway for the read but can let the write go on its own. Depends if all that is compensated by removing (or better moving to the server COG) the code needed to calculate the hub ram address, check the ram/rom space and the call/returns to the subroutines (mainly the i_readmemb/i_writememb code).
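One possible packing of such a request into a single long, sketched in Python. The field positions are purely my assumption (address in bits 19..0, data in bits 27..20, read/write flag in bit 28, the other three bits left spare); nothing like this exists in the emulator yet.

```python
ADDR_MASK = 0xFFFFF     # 20-bit 8086 physical address
DATA_SHIFT = 20         # 8 data bits above the address
WRITE_BIT = 28          # one of the 4 spare bits used as a read/write flag

def pack_request(addr, data=0, write=False):
    """Build the 32-bit value the client would write to the smartpin."""
    return (int(write) << WRITE_BIT) | ((data & 0xFF) << DATA_SHIFT) | (addr & ADDR_MASK)

def unpack_request(word):
    """Decode a request on the server side: (address, data, is_write)."""
    return word & ADDR_MASK, (word >> DATA_SHIFT) & 0xFF, bool((word >> WRITE_BIT) & 1)

request = pack_request(0xB8000, 0x41, write=True)
print(hex(request), unpack_request(request))
```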
As last resort hook a parallel SRAM and go with it... maybe with multiplexed address/data bus like the original 8086 (20 address/data + 1 A/D flag, 3 as CS,OE,WE) or plain full 20+8+3 pins.
It would help to have a standard carrier board for these things...
ISA 8-bit bus? I have a real old CGA somewhere at the university. A full sized board with a hundred (?) ICs soldered on it and a composite video output. And a 2 MB memory extension board, also a full size, big, heavy PCB full of ICs soldered on it.
@macca,
here's a patch for your x86 emulator code to use 640kB (or more) with PSRAM. I don't have anything to exercise the extra memory but it now reports as 640kB in the BIOS self test at bootup, so that's a good start.
It's slowed it down a little bit but for now there is no pre-fetch implemented, all reads/writes below 640kB go to PSRAM and are byte oriented, so there is more scope to speed this up for word accesses and also with a PIQ (pre-fetch input queue) of some type, or further caching schemes. It's important to not cross any 1kB page boundaries in PSRAM if you are doing unaligned accesses, so some more work would be needed to split the access if that occurs.
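The page-boundary split mentioned above could look something like this Python sketch. It only shows the address arithmetic for cutting one access into pieces that each stay within a 1kB PSRAM page; the function name is hypothetical and the patch itself does not do this yet.

```python
PAGE = 1024   # PSRAM page size a single burst must not cross

def split_access(addr, length):
    """Split [addr, addr+length) into (address, count) chunks within 1kB pages."""
    chunks = []
    while length > 0:
        room = PAGE - (addr % PAGE)      # bytes left in the current page
        count = min(length, room)
        chunks.append((addr, count))
        addr += count
        length -= count
    return chunks

print(split_access(0x3FF, 2))    # unaligned word straddling a page boundary
print(split_access(0x400, 2))    # word fully inside one page
```

For byte-oriented accesses this never triggers, so it only matters once word reads or longer prefetch bursts are added.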
If you diff this against your original posted source you'll see what I did, and it's not that much of a change. I took my PSRAM init code and patched in some of Wuerfel_21's low latency read code and my own write code with a few tidying mods. This new PSRAM code in the x86 COG currently consumes about 37 longs or so of total LUT/COG space but could increase if you add more features, unless you can keep them in HUB RAM.
If you want to use this it should be easy for you to extend/modify as you see fit and make other adjustments to suit your preferred initialization sequence. At some point you may want your instruction fetch reads to be stored or copied over to a HUB area separate to the scratch space used for the other data reads so that a PIQ would be unaffected by normal data reads. You could potentially copy the ROM BIOS data into PSRAM at boot time too, if you need to free HUB RAM for other uses later...maybe a 256kB VGA buffer at some point if this emulator project really goes to town...
Also I wasn't confident about changes to your i_ea registers so I just kept the PSRAM read/write routines fully separated from your code and called from your .ram read/write routines, so there is some scope to merge/save cycles there too. If i_ea can be changed, it can be used instead of pa for the PSRAM cmd+address for example.
I set it up for the P2-Edge (P2-EC32MB) PSRAM pinout with the constants below but you can adjust the PINs in different setups. You may need to adjust the PSRAM_DELAY if you change the clock rate from 320MHz or see other issues.
Note: some other I/O base pins got changed in my P2-Edge setup so be aware of that too before you run it.
PSRAM_CLK_PIN = 56
PSRAM_CE_PIN = 57
PSRAM_DATA_PINS = 40+(3<<6) ' base data pin with 4 pin group
PSRAM_PORT_OFFSET = (PSRAM_DATA_PINS & $3c) << 17
PSRAM_WAIT = 10
PSRAM_DELAY = 4 ' adjust depending on CLKFREQ
PSRAM_INIT_DELAY = 150*(_CLKFREQ/1000000) ' ~150us
@Wuerfel_21 and @macca, I just found an optimization for all my PSRAM stuff.
In the code that prepares the address phase data order I just used to do this to prepare the address for streaming out:
setbyte pa, #$02, #3
splitb pa ' these 4 instructions reverse the nibble endianness
rev pa
movbyts pa, #%%0123
mergeb pa
But I've now realized if you use the "a" bit (bit 16) in the streamer command to reverse the data for you, I think you can just do this instead, saving 3 instructions for both reads and writes, nice:
setbyte pa, #$02, #3
movbyts pa, #%%0123
I don't know why I've missed this until now. Crazy. Seems to work fine in macca's emulator, although I should test it further to be sure it works with address bursts etc. But I think it actually works and saves 6 clocks.
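The reason the shorter sequence can work is a simple identity, shown here in Python rather than PASM2: reversing the order of all 8 nibbles in a long is the same as reversing the 4 bytes and then swapping the two nibbles inside each byte. My reading of the posts is that movbyts does the byte reversal and the streamer's "a" bit takes care of the nibble order within each byte; that last part is an assumption about the hardware, the code below only checks the arithmetic.

```python
def reverse_nibbles(x):
    """Reverse the order of the 8 nibbles in a 32-bit value."""
    out = 0
    for i in range(8):
        out = (out << 4) | ((x >> (4 * i)) & 0xF)
    return out

def reverse_bytes(x):
    """Reverse the order of the 4 bytes in a 32-bit value."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")

def swap_nibbles_in_bytes(x):
    """Swap the high and low nibble of every byte."""
    return ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)

value = 0x12345678
print(hex(reverse_nibbles(value)))
print(hex(swap_nibbles_in_bytes(reverse_bytes(value))))
```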
Here are the magic PSRAM access sequences, at least for 320MHz operation:
CON
PSRAM_CLK_PIN = 56
PSRAM_CE_PIN = 57
PSRAM_DATA_PINS = 40+(3<<6) ' base data pin with 4 pin group
PSRAM_PORT_OFFSET = (PSRAM_DATA_PINS & $3c) << 17
PSRAM_WAIT = 10
PSRAM_DELAY = 4 ' adjust depending on CLKFREQ
PSRAM_INIT_DELAY = 150*(_CLKFREQ/1000000) ' ~150us
DAT
psram_read8
wrfast bit31, hub_scratch ' pa = PSRAM read address, hub_scratch is where in hub to read PSRAM data into
setbyte pa, #$EB, #3
movbyts pa, #%%0123
drvl #PSRAM_CE_PIN
drvl #PSRAM_DATA_PINS
xinit ximm8, pa
wypin #(8+PSRAM_WAIT+2)*2, #PSRAM_CLK_PIN ' enough clocks for address phase, delay and 1 byte transfer
setq nco_fast
xcont #PSRAM_WAIT*2+PSRAM_DELAY,#0 ' send address
waitxmt
fltl #PSRAM_DATA_PINS
setq nco_slow
xcont xread2, #0 ' read data
waitxfi ' wait until streamer is done
_ret_ drvh #PSRAM_CE_PIN
psram_write8
' pa = PSRAM write address, i_tmpb is data byte to write
setbyte pa, #$02, #3
movbyts pa, #%%0123
drvl #PSRAM_CE_PIN
drvl #PSRAM_DATA_PINS
'setq nco_fast ' only needed if not already setup in a prior (first) read
xinit ximm8, pa
wypin #(16+4), #PSRAM_CLK_PIN ' number of clock transitions
xcont ximm2, i_tmpb
waitxfi
fltl #PSRAM_DATA_PINS
_ret_ drvh #PSRAM_CE_PIN
hub_scratch long @hub_scratch_buf
bit31 'alias
nco_fast long $8000_0000
nco_slow long $4000_0000
ximm2 long $60800002 + PSRAM_PORT_OFFSET ' write 2 nibbles
ximm8 long $60810008 + PSRAM_PORT_OFFSET ' write 8 nibbles
xread2 long $E0800002 + PSRAM_PORT_OFFSET ' read 2 nibbles
UPDATE: I've changed the code I posted earlier to include this, and attached the new zip above.
@rogloh said:
@macca,
here's a patch for your x86 emulator code to use 640kB (or more) with PSRAM. I don't have anything to exercise the extra memory but it now reports as 640kB in the BIOS self test at bootup, so that's a good start.
That's great! Thank you very much.
Unfortunately I don't have the PSRAM module so I can't do anything with it.
If you diff this against your original posted source you'll see what I did, and it's not that much of a change. I took my PSRAM init code and patched in some of Wuerfel_21's low latency read code and my own write code with a few tidying mods. This new PSRAM code in the x86 COG currently consumes about 37 longs or so of total LUT/COG space but could increase if you add more features, unless you can keep them in HUB RAM.
No problem, some of the hub code will be moved to COG or LUT to gain some clock cycles, but haven't decided what yet.
Also I wasn't confident about changes to your i_ea registers so I just kept the PSRAM read/write routines fully separated from your code and called from your .ram read/write routines, so there is some scope to merge/save cycles there too. If i_ea can be changed, it can be used instead of pa for the PSRAM cmd+address for example.
No, i_ea should not be changed, aside from keeping it within range. Word reads and writes increment it for the next byte; this may be optimized a bit by reading a whole word, but that should be done from the caller routine.
The only thing missing, if I'm not wrong, is the disk drive read/write (still using hub ram) so using a disk image won't work with PSRAM. I guess this will need some arbitration to avoid conflicts with the CPU access.
@macca said:
Unfortunately I don't have the PSRAM module so I can't do anything with it.
Oh man, missing out. You'll have to beg, borrow, buy, or build one. A single PSRAM chip needs only 6 pins for 8MB, and SOP8 chips are easy to solder.
No, i_ea should not be changed, aside from keeping it within range. Word reads and writes increment it for the next byte; this may be optimized a bit by reading a whole word, but that should be done from the caller routine.
Yeah I thought as much and left it be.
The only thing missing, if I'm not wrong, is the disk drive read/write (still using hub ram) so using a disk image won't work with PSRAM. I guess this will need some arbitration to avoid conflicts with the CPU access.
Disk/DMA may need some work. If it runs in another COG and needs to write memory that's going to be interesting. A lock may work out for this perhaps and you could duplicate the read/write routines in that COG.
Comments
Well, that's only slightly slow at 320MHz then.
What is this suite?
https://github.com/andreas-jonsson/virtualxt/tree/develop/tools/testdata
They are for 186/286 but aside from some differences in the results they are valid for an 8086.
Thanks, I downloaded these tests last year but didn't understand how to use them at that time.
Each 64K .bin file replaces the BIOS ROM at F0000, but isn't there a potential problem? After reset, CS = FFFF and IP = 0000. The jmp start instruction (EB 0E) in the binary at offset FFF0 (physical address FFFF0) seems to change IP to 0010, and therefore the physical address of start = FFFF0+0010 = 100000, which wraps to 00000 on the 20-bit address bus.
The test code sets CS=F000 and IP=FFF0 at start.
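The difference between the two starting points can be worked through in a few lines of Python. This is just the standard real-mode address arithmetic (segment shifted left by 4 plus offset, truncated to 20 bits) and the short-jump rule (target = IP of the next instruction plus the signed 8-bit displacement, modulo 64K); the helper names are made up for the example.

```python
def physical(cs, ip):
    """8086 real-mode physical address, truncated to the 20-bit bus."""
    return ((cs << 4) + ip) & 0xFFFFF

def short_jump_target(ip, disp8):
    """IP after executing a 2-byte short jmp (EB disp8) located at ip."""
    if disp8 >= 0x80:
        disp8 -= 0x100          # sign-extend the displacement
    return (ip + 2 + disp8) & 0xFFFF

# Hardware reset values from the question: CS=FFFF, IP=0000.
ip_hw = short_jump_target(0x0000, 0x0E)
hw_start = physical(0xFFFF, ip_hw)

# Values the test code uses: CS=F000, IP=FFF0.
ip_test = short_jump_target(0xFFF0, 0x0E)
test_start = physical(0xF000, ip_test)

print(f"CS=FFFF: IP={ip_hw:04X} physical={hw_start:05X}")
print(f"CS=F000: IP={ip_test:04X} physical={test_start:05X}")
```

Both CS:IP pairs fetch the jmp from the same physical address FFFF0, but with CS=F000 the IP wraps within the segment and start lands at F0000, the beginning of the 64K image, instead of wrapping to 00000.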
Getting closer...
Booting PCDOS 2.00 from an in-memory disk image.
Since the disk is only 180k it fits into hub ram with everything else, very little space left, can't even enable debug...
This also means I'm getting short of excuses to procrastinate the SD card driver...
Very neat!
And the SD card support is implemented...
It is a bit rough but should work with both FAT16 and FAT32 partitions.
You need to have the PCDOS 2.00 disk images in the root directory, named PCDOS_A.IMG and PCDOS_B.IMG (for drive A and B as the name implies) and the emulator should boot from drive A, otherwise without SD card or if the files are missing, it boots to the BASIC rom (the image files should be easily found on the 'net).
The SD card is not mapped on the P2-EDGE pins because I don't have a module with the microSD, and when I tried to wire an adapter it wouldn't even allow me to upload programs... maybe it is the adapter or I haven't wired it correctly... anyway, it should work. I don't remember where I got the SD code(*) but it should be compatible with the SD/Flash-CS wiring.
Edit: (*) it is the Catalina_SD_Plugin.spin2
Yeah, it seems the P2 is an emulation beast! Kudos @macca !
Well, the 186 note brings to mind the HP200LX, which is the only system I have ever had with a 186 in it. Quite a handy little device it was. I still have one and could dump the ROM if needed (and instructions provided).
Lol! More ammunition for this 68k fanboy.
Such an old school look and feel. I love it.
When compiling on flexspin 5.9.12 I get these warnings... isn't this an indication of a bug?
It crashed in Basic with this:
cmps without flags set are simply nops. Maybe try adding wz to these? I will try to run this later. I have to find DOS 2.0, as the lowest DOS I have now is 3.
It will run without the DOS images (boots to BASIC). I didn't use the SD.
I did patch in the WZ on those lines, but I think this is something relevant to the SD filesystem so it wasn't part of my issue under BASIC.
Yes, those lines are missing a wz but, as you guessed, they are relevant only for the SD card access, and I also think they have no effect anyway... the FAT1 or FAT3 check is enough unless you have an SD formatted as FAT12...
Interesting, however I don't think it "crashed" in the real sense of the word. The BIOS screen output routines blank the video when scrolling, so it looks like it is always blanked... try a long for / next loop, like
You should see a blank screen and at some point it terminates.
It should be possible to stop the program with a key combination, but I can't remember how...
Ah, yeah Ctrl-Scroll Lock seems to break into the program and it returns to BASIC and still works.
Looking at your x86 emulator @macca, I see it accesses memory through iread_memb and iwrite_memb entry points. I sort of wonder what would happen if we tacked on some really tight PSRAM reading code (like Wuerfel_21 uses) using simple 4-bit wide PSRAM so as to avoid RMW requirements. At 320MHz there may be some potential to get 3 to 4 separate PSRAM reads or writes in per microsecond, excluding the instruction execution. I read that the original 8088/8086 took 4 clocks per bus cycle, transferring 8/16 bits respectively, so that's probably in the same ballpark for a machine in the 8-10MHz class, and likely fine for emulating slow 4.77MHz machines. Also, the PSRAM can read 16 bits in only 4 P2 clocks more than reading 8 bits, and alignment is not an issue with 4-bit PSRAM.
Alternatively if that's still slow I wonder if we had a COG paired to your emulator that is running some PSRAM driver code and can read and write bytes on behalf of the x86 emulator requests (made via upper LUT longs) and it also manages a decent sized cache in the HUB RAM for this x86 emulator COG while the PC's 640kB is actually stored in PSRAM. That paired COG could help keep the read/write requests latency rather low vs any HUB based scheme with mailboxes etc. Also if cache tags are managed in the paired COGs COGRAM, assuming it has enough room for some small sized block mappings (let's say 4kB block sized, 160 of them for 640kB) then maybe the cached blocks could be read from HUB RAM when known to be in memory and swapped out on demand when we're out of HUB space etc without needing lots of other HUB RAM accesses. It's somewhat tempting to try this out. I do wonder if a cached x86 would run slow as an old dog or still be "somewhat" usable. Most of those early XT type machines never really used a cache to my knowledge, that came later with 486's etc, so there's not much to compare against. It might work but be intermittently slow, not sure.
One problem is peripherals that need to access memory directly (like DMA controllers etc). In that case some DOS memory may have to be marked as non-cacheable, and that introduces more complexity. Certainly video RAM should remain in HUB, although the BIOS could be put into PSRAM, which frees more HUB for cache stuff.
LUT sharing + streamer -> brap
Not sure if a cache system is necessarily helpful. PSRAM is just fast enough to where the overhead of a cache layer might just make everything worse. Though, what does the PSRAM do when you raise CS before completing a command? If that's not an issue, the cache logic can happen in parallel with the PSRAM setup, so the worst-case latency isn't affected.
But only if the streamer uses LUT modes which might be possible to avoid if we use nibble mode PSRAMs.
Yeah that is my same uncertainty too.
[...]
I don't know what may be a good caching scheme for that. I'm not familiar at all with PSRAM so I don't know how it works and what the limits are, apart from the impression that the timings are a bit critical... It depends on how the PSRAM read/write time compares with the hub rd/wrbyte. Maybe removing some instructions, like the need to add the hub ram start address, and rearranging the code a bit could help to gain some clock cycles.
The 8086 has a 6-byte prefetch queue where code and immediates are read from. This can be refilled automatically in a dedicated small fifo buffer starting from PC. The prefetch queue is always cleared with each branch, so there may be plenty of time to fetch the first byte.
Currently the DMA is used only by the disk drives and it runs in its own COG. Since the disk drives were slow at the time, this can be slowed down at will (sort of, there is always a timeout in the BIOS, so better not be too slow...) and give priority to the CPU.
If a PSRAM is used only by the 8086, then maybe not, but if something else needs this (eg. SoundBlaster or GUS) then yes.
Maybe caching CS segment only can be a good solution.
eeeh, not necessarily. In NeoYume the RAM is constantly getting blasted with video and ADPCM reads and it still gets enough bandwidth to run the 68000 emulation at speeds cromulent enough to keep up with the 12MHz original. Though IDK how that'd look without the big code prefetch (which only really works for ROM code) and all memory being external.
Yes, I feel there's some scope to gain improvements here, because while it's not branching you could be reading from 6 bytes cached by an earlier PSRAM read. If you read 6 bytes on each jump you'll only incur a 20 P2 clock additional penalty vs reading one byte from PSRAM, and 20 clocks is already somewhat comparable to a single RDBYTE anyway when the egg beater is not nicely aligned. After that point you'll only need to read code from PSRAM again for a branch (which presumably took a while on an x86 anyway) or after a few more instructions exhaust the queue. It won't have to be for every single instruction, only the memory accessing ones. Surely that can help a little bit.
Interesting idea, and it's a smaller range to worry about, which is nice. Far calls/returns could change that dynamically and it may invalidate frequently though. I wasn't really considering caching read data, and it would need to use a write-through cache anyway in case of self-modifying code in other code segments read later. Not really worth caching the data I guess.
Liking the idea of a potential SoundBlaster/Adlib emulation down the track, as well as a serial port COG perhaps to let the emulated PC talk externally to other devices (e.g. an old serial mouse) or to the host PC via emulated COM ports. Maybe macca already has that in mind too or it could be contributed by others if needed...? Looks like 5 COGs are already used, leaving 3 more for such things. That could suffice to make a nice little self-contained emulated old school PC/XT to fit in the P2.
Didn't dig deep enough yet, but is there already an IO pin output that can be mapped to a squawker, or could that perhaps get added too?
Just read this on Wikipedia's 8088/8086 pages.
This is very helpful for giving PSRAM plenty of time to prefetch a LOT of data. 15 cycles at 10MHz is 1.5us. You could easily go off and read 64 bytes in that time @320MHz with 4 bit PSRAM.
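To put rough numbers on that claim, here is a back-of-envelope calculation in Python (not PASM2). The constants marked "assumed" are illustrative guesses, not measurements: the PSRAM moves one nibble every two P2 clocks, and command, address and read latency together cost about 40 P2 clocks.

```python
# Back-of-envelope budget for one PSRAM burst inside a 15-cycle 8086 window.
# Constants marked "assumed" are illustrative, not measured values.
P2_CLOCK_HZ = 320_000_000      # P2 system clock from the post
X86_CLOCK_HZ = 10_000_000      # 8086 clock used in the post
WINDOW_CYCLES = 15             # 8086 cycles quoted in the post
PSRAM_BUS_BITS = 4             # single 4-bit PSRAM chip
P2_CLOCKS_PER_TRANSFER = 2     # assumed: PSRAM clocked at sysclk/2
OVERHEAD_P2_CLOCKS = 40        # assumed: command, address and read latency


def window_p2_clocks():
    """P2 clocks that elapse during WINDOW_CYCLES of the original CPU."""
    return WINDOW_CYCLES * P2_CLOCK_HZ // X86_CLOCK_HZ


def burst_p2_clocks(nbytes):
    """P2 clocks needed to read nbytes from PSRAM in a single burst."""
    transfers = nbytes * 8 // PSRAM_BUS_BITS
    return OVERHEAD_P2_CLOCKS + transfers * P2_CLOCKS_PER_TRANSFER


window = window_p2_clocks()    # 15 * 32 = 480 P2 clocks
needed = burst_p2_clocks(64)   # 40 + 128 * 2 = 296 P2 clocks
print(window, needed, needed <= window)
```

Under these assumptions a 64-byte burst fits in the window with room to spare; a different PSRAM clock divider or latency changes the margin but is easy to plug in.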
A small block of PSRAM (say 32 bytes) could be read into HUB acting as a larger prefetch buffer after every x86 branch and then while reading and executing these bytes from HUB you only really have to deal with running past the end of the block to trigger a new PSRAM load. For this you'd need to track an offset in the block and whenever it gets a bit too close to the end and you might go past it with the longest opcode sequence, then reload a fresh block from PSRAM. Also you need to track if the x86 writes or DMA are within the active block region to invalidate the block for self modifying code cases. That's just about it.
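The scheme described above can be sketched as a toy model in Python (not PASM2). Everything here is hypothetical: the class and method names, the `bytearray` standing in for PSRAM, and the 6-byte safety margin used for "the longest opcode sequence".

```python
BLOCK_SIZE = 32      # bytes per PSRAM burst, as suggested in the post
MAX_INSN_LEN = 6     # assumed safety margin for the longest opcode sequence


class BlockPrefetch:
    """Toy model of a hub-RAM code buffer refilled from external memory."""

    def __init__(self, psram):
        self.psram = psram          # bytearray standing in for the PSRAM
        self.base = 0               # address of the first buffered byte
        self.offset = 0             # position of the next code byte
        self.block = b""            # empty block forces a load on first use
        self.loads = 0              # number of simulated PSRAM bursts

    def _load(self, addr):
        self.base = addr
        self.offset = 0
        self.block = bytes(self.psram[addr:addr + BLOCK_SIZE])
        self.loads += 1

    def branch(self, addr):
        """Every x86 branch restarts the buffer at the target address."""
        self._load(addr)

    def begin_instruction(self):
        """Reload early if the next instruction could run past the block."""
        if len(self.block) - self.offset < MAX_INSN_LEN:
            self._load(self.base + self.offset)

    def fetch(self):
        """Return the next code byte from the buffered block."""
        value = self.block[self.offset]
        self.offset += 1
        return value

    def write(self, addr, value):
        """CPU or DMA write; drop the block if it touches buffered bytes."""
        self.psram[addr] = value
        if self.base <= addr < self.base + len(self.block):
            self.block = b""        # invalidated, reloaded at next instruction
```

The emulator loop would call `begin_instruction()` once per opcode and `fetch()` for each code byte, so the only per-byte cost in the common case is an index and an increment.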
The COM port support is on the TODO list, as well as hard disk support (which I think can be coupled with the floppy drive code and share the same COG). The 8250 serial chip was very simple and comparable to the SmartPin Async mode, so I don't think it will require a dedicated COG.
If you mean the beeper, no, I haven't mapped the pin and the timer associated to it needs to be checked.
Yes, branches are very slow anyway.
I was thinking of a ring buffer, or something, so the "feeder" COG can check when it is near the wrap-around and refill.
About the self-modifying code, I don't think this will ever be an issue: if the writes are within the prefetch queue they will be ignored anyway. To maintain "compatibility", it would help to keep the queue at 6 bytes.
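A minimal sketch of that behaviour in Python (not PASM2), assuming a simplified queue that refills byte by byte and only when it runs empty. The class name and methods are hypothetical; the point is only that a write landing on an already-queued byte is not seen until the next branch.

```python
from collections import deque

QUEUE_SIZE = 6   # 8086 prefetch queue length


class PrefetchQueue:
    """Simplified model: refills only when empty, never snoops writes."""

    def __init__(self, memory):
        self.memory = memory        # bytearray standing in for system RAM
        self.queue = deque()
        self.next_addr = 0          # address of the next byte to prefetch

    def branch(self, addr):
        """A branch flushes the queue and restarts prefetch at the target."""
        self.queue.clear()
        self.next_addr = addr

    def refill(self):
        """Top the queue up to QUEUE_SIZE bytes from memory."""
        while len(self.queue) < QUEUE_SIZE:
            self.queue.append(self.memory[self.next_addr])
            self.next_addr = (self.next_addr + 1) % len(self.memory)

    def fetch(self):
        """Pop the next code byte, refilling first if the queue is empty."""
        if not self.queue:
            self.refill()
        return self.queue.popleft()
```

Fetching one byte after a branch to address 4 leaves bytes 5 to 9 queued; writing to address 5 at that point changes memory but the next `fetch()` still returns the old value, until another `branch()` flushes the queue.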
I was also thinking about something like a "memory server" using a smartpin in long repository mode: with 32 bits it can hold 8 bits of data + 20 bits of address, and there are 4 spare bits that can be used as a read/write flag and for something else. Writes to the smartpin can be hooked to an interrupt in the "server COG"; the client COG has to wait for a read anyway, but can let a write go on its own. It depends on whether all that is compensated by removing (or better, moving to the server COG) the code needed to calculate the hub RAM address, check the RAM/ROM space, and call/return from the subroutines (mainly the i_readmemb/i_writememb code).
As a last resort, hook up a parallel SRAM and go with it... maybe with a multiplexed address/data bus like the original 8086 (20 address/data + 1 A/D flag, 3 as CS, OE, WE) or a plain full 20+8+3 pins.
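One possible bit layout for such a request long, sketched in Python (not PASM2). The field positions are an assumption for illustration only: data in bits 0..7, address in bits 8..27, and bit 28 as the write flag, leaving bits 29..31 spare.

```python
DATA_BITS = 8
ADDR_BITS = 20
ADDR_SHIFT = DATA_BITS                        # address sits above the data byte
WRITE_FLAG = 1 << (DATA_BITS + ADDR_BITS)     # bit 28, assumed read/write flag


def pack_request(addr, data=0, write=False):
    """Build the 32-bit long the client COG would write to the smartpin."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 20 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit in 8 bits")
    word = (addr << ADDR_SHIFT) | data
    if write:
        word |= WRITE_FLAG
    return word


def unpack_request(word):
    """Decode a request long on the server side: (addr, data, write)."""
    data = word & ((1 << DATA_BITS) - 1)
    addr = (word >> ADDR_SHIFT) & ((1 << ADDR_BITS) - 1)
    write = bool(word & WRITE_FLAG)
    return addr, data, write


request = pack_request(0xFFFF0, 0xEB, write=True)
print(hex(request), unpack_request(request))
```

Putting the data byte in the low bits means the server can reuse the same long for the read reply by just replacing the bottom byte.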
It would help to have a standard carrier board for these things...
ISA 8-bit bus? I have a real old CGA somewhere at the university. A full sized board with a hundred (?) ICs soldered on it and a composite video output. And a 2 MB memory extension board, also a full size, big, heavy PCB full of ICs soldered on it.
@macca,
here's a patch for your x86 emulator code to use 640kB (or more) with PSRAM. I don't have anything to exercise the extra memory but it now reports as 640kB in the BIOS self test at bootup, so that's a good start.
It has slowed things down a little bit, but for now there is no pre-fetch implemented: all reads/writes below 640kB go to PSRAM and are byte oriented, so there is more scope to speed this up for word accesses and also with a PIQ (pre-fetch input queue) of some type, or further caching schemes. It's important not to cross any 1kB page boundaries in PSRAM if you are doing unaligned accesses, so some more work would be needed to split the access if that occurs.
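The page-boundary split mentioned above could look roughly like this, shown in Python rather than PASM2. The function name is hypothetical; the only fact taken from the post is the 1kB page size.

```python
PAGE_SIZE = 1024   # PSRAM page size mentioned in the post


def split_at_page(addr, length):
    """Split [addr, addr + length) into chunks that never cross a page."""
    chunks = []
    while length > 0:
        room = PAGE_SIZE - (addr % PAGE_SIZE)   # bytes left in this page
        n = min(room, length)
        chunks.append((addr, n))
        addr += n
        length -= n
    return chunks


# An unaligned word access at the last byte of a page becomes two byte accesses.
print(split_at_page(0x3FF, 2))
```

For a word access the loop runs at most twice, so in PASM2 this would likely reduce to a single test of the low address bits before choosing the one-burst or two-burst path.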
If you diff this against your original posted source you'll see what I did, and it's not that much of a change. I took my PSRAM init code and patched in some of Wuerfel_21's low latency read code and my own write code with a few tidying mods. This new PSRAM code in the x86 COG currently consumes about 37 longs or so of total LUT/COG space but could increase if you add more features, unless you can keep them in HUB RAM.
If you want to use this it should be easy for you to extend/modify as you see fit and make other adjustments to suit your preferred initialization sequence. At some point you may want your instruction fetch reads to be stored or copied over to a HUB area separate to the scratch space used for the other data reads so that a PIQ would be unaffected by normal data reads. You could potentially copy the ROM BIOS data into PSRAM at boot time too, if you need to free HUB RAM for other uses later...maybe a 256kB VGA buffer at some point if this emulator project really goes to town...
Also I wasn't confident about changes to your i_ea registers so I just kept the PSRAM read/write routines fully separated from your code and called from your .ram read/write routines, so there is some scope to merge/save cycles there too. If i_ea can be changed, it can be used instead of pa for the PSRAM cmd+address for example.
I set it up for the P2-Edge (P2-EC32MB) PSRAM pinout with the constants below but you can adjust the PINs in different setups. You may need to adjust the PSRAM_DELAY if you change the clock rate from 320MHz or see other issues.
Note: some other I/O base pins got changed in my P2-Edge setup so be aware of that too before you run it.
Cheers,
Roger
@Wuerfel_21 and @macca, I just found an optimization for all my PSRAM stuff.
In the code that prepares the address phase data order, I used to do this to prepare the address for streaming out:
But I've now realized if you use the "a" bit (bit 16) in the streamer command to reverse the data for you, I think you can just do this instead, saving 3 instructions for both reads and writes, nice:
I don't know why I've missed this until now. Crazy. Seems to work fine in macca's emulator, although I should test it further to be sure it works with address bursts etc. But I think it actually works and saves 6 clocks.
Here are the magic PSRAM access sequences, at least for 320MHz operation:
UPDATE: I've changed the code I posted earlier to include this, and attached the new zip above.
That's great! Thank you very much.
Unfortunately I don't have the PSRAM module so I can't do anything with it.
No problem, some of the hub code will be moved to COG or LUT to gain some clock cycles, but haven't decided what yet.
No, i_ea should not be changed aside from that and to keep it within range. Word reads and writes increment it for the next byte; this may be optimized a bit by reading a whole word, but that should be done from the caller routine.
The only thing missing, if I'm not wrong, is the disk drive read/write (still using hub ram) so using a disk image won't work with PSRAM. I guess this will need some arbitration to avoid conflicts with the CPU access.
Oh man, missing out. You'll have to beg, borrow, buy, or build one. A single PSRAM chip only requires 6 pins for 8MB, and SOP8 chips are easy to solder.
Yeah I thought as much and left it be.
Disk/DMA may need some work. If it runs in another COG and needs to write memory that's going to be interesting. A lock may work out for this perhaps and you could duplicate the read/write routines in that COG.
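As a rough illustration of the lock idea, here is a Python sketch in which `threading.Lock` stands in for a P2 hardware lock and a `bytearray` stands in for the PSRAM. All names are hypothetical, and the 16-byte DMA chunk size is an assumption chosen so the CPU side never waits for a whole sector.

```python
import threading

psram_lock = threading.Lock()   # stands in for a P2 hardware lock
memory = bytearray(4096)        # stands in for the PSRAM


def locked_write(addr, data):
    """Write a run of bytes while holding the lock."""
    with psram_lock:
        memory[addr:addr + len(data)] = data


def locked_read(addr, length):
    """Read a run of bytes while holding the lock."""
    with psram_lock:
        return bytes(memory[addr:addr + length])


def dma_transfer(addr, sector, chunk=16):
    """Disk/DMA side: copy a sector in small chunks, releasing the lock
    between chunks so the CPU side can get its accesses in."""
    for i in range(0, len(sector), chunk):
        locked_write(addr + i, sector[i:i + chunk])
```

Each side keeps its own copy of the read/write routines and only shares the lock, which matches the idea of duplicating the routines in the DMA COG.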