@Wuerfel_21 said:
So, uhh, memory amiright.
@rogloh
Uhm, is this right? It seems to start up correctly, but all the data gets garbled (I'm trying with PSRAM4 right now).
You gotta be sure the timing parameters are set up right; they are critical. By bypassing the normal init with your own code, there is a reasonable chance that something is being set up incorrectly at init time. If you print out the values of the various structures from my demos and compare them to yours, you should be able to find the difference somewhere.
If the input latency clocks are wrong it could appear to introduce a data offset from the real addresses. They may need tuning in your setup too.
Also, if you are messing about with unregistered clocks in the flag settings, you might be introducing a timing change too. That could be it, because I don't enable that for PSRAM but you appear to have done so here.
Turns out I wasn't computing some addresses right and it does now work. But only once. Returning to the menu and loading again does work, but it doesn't run (menu remains on screen because lol). Returning to the menu again and loading yet again hangs totally. Strange.
Anyways, watch it load that enormous 8MB Bad Apple ROM:
@Wuerfel_21 said:
Turns out I should probably actually stop the cogs that I start.
LOL - I'm sure you'll get there in the end! Well done so far.
I wonder if the SD card can be sped up a bit. It seems to take a bit of time to read the 8MB. Is that because it has to decode something as it goes or is that just the transfer speed off the card? You might like to increase the transfer size if you can, that could help if you were writing very small blocks. You can always try to write much larger bursts.
It loads 16K at a time into alternating buffers (so the RAM write and SD read are parallelized). If it's an .SMD ROM (not actually determined by file extension, but by detecting the dump header) there's some byte twiddling going on, but for normal raw dumps it just loads straight into memory. I think it's mostly that flexspin's SD library is slow (look how long it takes to read the directory!). The very recent system function fcache enablement commit made it a bit faster though.
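For readers wondering what the "byte twiddling" for SMD dumps might look like, here is a minimal Python sketch. It assumes the commonly described SMD layout (a 512-byte copier header followed by 16 KB blocks, each storing the odd-address bytes in its first half and the even-address bytes in its second half). The function names and constants are made up for illustration and this is not MegaYume's actual loader code.

```python
SMD_HEADER_SIZE = 512    # assumption: size of the copier header in front of the data
SMD_BLOCK_SIZE = 16384   # assumption: interleave block size (16 KB)


def deinterleave_smd_block(block: bytes) -> bytes:
    """Undo the (assumed) SMD interleave for one full 16 KB block."""
    if len(block) != SMD_BLOCK_SIZE:
        raise ValueError("expected a full 16 KB block")
    half = SMD_BLOCK_SIZE // 2
    out = bytearray(SMD_BLOCK_SIZE)
    out[1::2] = block[:half]   # first half supplies the odd byte addresses
    out[0::2] = block[half:]   # second half supplies the even byte addresses
    return bytes(out)


def load_rom(chunks, is_smd: bool) -> bytes:
    """Concatenate 16 KB chunks, de-interleaving each one for SMD dumps."""
    rom = bytearray()
    for chunk in chunks:
        rom += deinterleave_smd_block(chunk) if is_smd else chunk
    return bytes(rom)


# Tiny demonstration with a synthetic block: 0x11 in the first half, 0x22 in the second.
demo = bytes([0x11] * 8192 + [0x22] * 8192)
print(deinterleave_smd_block(demo)[:4].hex())
```

Raw dumps skip the transform entirely, which matches the "loads straight into memory" case described above.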
Yeah okay. 16kB is plenty big enough to give decent performance. Must be that SD card library. Looks like it's doing something around 500-600kB/s or so which is reasonable but isn't awesome for large files.
I'm not a gamer -- retro or otherwise -- but if someone built a P2 version of the Hydra I would buy one just to play with the code you PASM wizards are writing. Neat stuff.
@JonnyMac said:
I'm not a gamer -- retro or otherwise -- but if someone built a P2 version of the Hydra I would buy one just to play with the code you PASM wizards are writing. Neat stuff.
Well, it might honor you that MegaYume was mostly written and debugged on a P2EDGE-32MB sitting in a "JonnyMac" breakout board.
@rogloh said:
Yeah okay. 16kB is plenty big enough to give decent performance. Must be that SD card library. Looks like it's doing something around 500-600kB/s or so which is reasonable but isn't awesome for large files.
Yes, it is. I tested the PSRAM4 library speed: it is about 500 cycles for a single read and about 5 cycles per byte for bursts, which is over 60 MB/s. But for a module loading from SD into PSRAM the speed is about 400 kB/s, so the SD library is the limiter there. That is enough for playing 16-bit 44100 Hz .wav files, but it is far from the maximum speed available from a modern SD card, even on a 1-bit interface bus.
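The numbers above can be sanity-checked with a bit of arithmetic. The sketch below assumes a 336 MHz system clock (the figure quoted elsewhere in this thread) and stereo audio for the .wav case. Both are assumptions, not measurements.

```python
SYSCLK_HZ = 336_000_000      # assumption: clock speed quoted elsewhere in the thread
CYCLES_PER_BYTE = 5          # burst cost quoted above
SD_BYTES_PER_S = 400_000     # SD-to-PSRAM load rate quoted above


def burst_bytes_per_second(sysclk_hz: int, cycles_per_byte: int) -> float:
    """PSRAM burst throughput implied by a per-byte cycle cost."""
    return sysclk_hz / cycles_per_byte


def load_seconds(size_bytes: int, bytes_per_second: float) -> float:
    """Time needed to move size_bytes at a given sustained rate."""
    return size_bytes / bytes_per_second


def pcm_bytes_per_second(rate_hz: int, bits: int, channels: int) -> int:
    """Data rate of uncompressed PCM audio."""
    return rate_hz * (bits // 8) * channels


psram_rate = burst_bytes_per_second(SYSCLK_HZ, CYCLES_PER_BYTE)
rom_load_time = load_seconds(8 * 1024 * 1024, SD_BYTES_PER_S)
wav_rate = pcm_bytes_per_second(44_100, 16, 2)   # assumption: stereo

print(f"PSRAM burst: {psram_rate / 1e6:.1f} MB/s")
print(f"8 MB ROM at 400 kB/s: {rom_load_time:.1f} s")
print(f"16-bit 44.1 kHz stereo: {wav_rate / 1e3:.1f} kB/s")
```

Under these assumptions the PSRAM side is more than two orders of magnitude faster than the SD side, so the SD path dominates the load time.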
@Wuerfel_21. Now you are using the HyperRAM board as well, you might be able to try to copy some of your images onto the HyperFlash. It can potentially even execute in place from there too via normal read/writes and should be very fast indeed. It's only 32MB in size but this might be okay for some frequently used games you want to load quickly vs getting them off SD.
BTW, there are still notable delays in the block completion handshaking in current implementations. Wild guess is this is limited by the command support in SPI mode. Maybe 1-bit SD mode can be engaged with the existing shared pinout, dunno.
@JonnyMac said:
I'm not a gamer -- retro or otherwise -- but if someone built a P2 version of the Hydra I would buy one just to play with the code you PASM wizards are writing. Neat stuff.
JonnyMac, Coley and I are doing a portable arcade cab, which could essentially be kinda what you are after, is that good enough?
@evanh said:
BTW, there are still notable delays in the block completion handshaking in current implementations. Wild guess is this is limited by the command support in SPI mode. Maybe 1-bit SD mode can be engaged with the existing shared pinout, dunno.
@rogloh said:
@Wuerfel_21. Now you are using the HyperRAM board as well, you might be able to try to copy some of your images onto the HyperFlash. It can potentially even execute in place from there too via normal read/writes and should be very fast indeed. It's only 32MB in size but this might be okay for some frequently used games you want to load quickly vs getting them off SD.
Yea, could do that, but I could also not. Doesn't seem very useful to me. I think with @evanh 's improved SD code it's probably going to load most games at good enough speed - have to try building with that.
@Baggers said:
@JonnyMac said:
I'm not a gamer -- retro or otherwise -- but if someone built a P2 version of the Hydra I would buy one just to play with the code you PASM wizards are writing. Neat stuff.
JonnyMac, Coley and I are doing a portable arcade cab, which could essentially be kinda what you are after, is that good enough?
Speaking of, have you got one of the 6 button cabs assembled yet? Could send you a build of the SD-loading megayume to try out. Isn't really ready for consumption yet, but works enough to show that it works (should probably push it onto a git branch too. )
In absence of anything else to say, have some CRT pics: *jumps behind a bush and disappears*
Time for MegaYume beta with GUI and SD loading, that is. You can get it off the beta branch on Github. It'll be merged into master when flexspin 5.9.10 actually releases (The new upper code relies on some bugfixes, expanded libc.a and Shift-JIS charset support). Thus also no release ZIP, because if you can't into git, it'd be worthless, anyways.
Major changes from the alpha versions:
There's a GUI for loading ROMs from SD card now.
You can return to the menu by pressing CTRL+Esc during emulation
If there is a MEGAYUME folder on your SD card, it will open that by default
SMD format ROMs can be loaded
Machine region is now auto-set based on ROM header info
Almost all configuration has been moved into a config.spin2
Known issues:
sometimes it fails to read the directory after cold boot. Just press enter and try again.
some games have issues if launched after returning to the menu from another game
controller types are still hardcoded in megayume_lower (Ideally I'd detect the handful of games that don't work with 6 button pads by serial number)
there is no GUI override for machine region
I haven't tested it with HyperRAM yet (but 4 bit and 16 bit PSRAM certainly work)
I've put beta02 on github - the main change is SRAM save support. Some notes on that:
SRAM is only emulated if the ROM header declares it (GenPlusGX always enables it on ROMs <= 2MB). There's also some known edge cases of misdeclared SRAM that aren't handled right now.
SRAM is always emulated as 16 bit wide (most cartridges use 8 bit RAM that only appears at odd addresses).
If SRAM is declared as non-volatile, it is stored on disk as an SRM file that should be compatible with most emulators (haven't actually tested that yet lol). It is recommended you name your ROM files to fit within the 8.3 limit, as the save file will be created using the 8.3 name.
SRAM is flushed to disk if 3 seconds have passed since a write (or when emulation is ended using Ctrl+Esc). This seems to work well with the way most (all?) games use SRAM
Fixes games that couldn't be started due to lack of SRAM (Landstalker, Shining in the Darkness, probably more)
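The delayed flush described in the notes above can be sketched as a small dirty-timer. This is only an illustration in Python: the class and method names are hypothetical, and it assumes the 3 seconds are counted from the most recent write.

```python
class SramFlusher:
    """Decides when dirty SRAM should be written back to disk (illustrative only)."""

    def __init__(self, delay_s: float = 3.0):
        self.delay_s = delay_s
        self.last_write = None   # time of the latest unflushed write, None if clean

    def on_write(self, now: float) -> None:
        """Record that the game wrote to SRAM at time `now` (seconds)."""
        self.last_write = now

    def should_flush(self, now: float, quitting: bool = False) -> bool:
        """True if SRAM is dirty and either the delay elapsed or emulation is ending."""
        if self.last_write is None:
            return False
        return quitting or (now - self.last_write) >= self.delay_s

    def mark_flushed(self) -> None:
        """Call after the save file has been written successfully."""
        self.last_write = None


flusher = SramFlusher()
flusher.on_write(10.0)
print(flusher.should_flush(12.0))                 # still inside the 3 s window
print(flusher.should_flush(13.0))                 # 3 s since the write
print(flusher.should_flush(12.0, quitting=True))  # leaving via Ctrl+Esc
```

Restarting the timer on every write means a game that saves in a burst of many small writes only triggers one disk write after the burst ends.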
Megayume beta03 is on github now (since flexspin 5.9.10 is out, the GUI version is now on the master branch):
Fixed race condition in ROM load code
Hitting Pause/Break will halt the 68000 CPU - nice for getting freeze frames without a "game paused" overlay (this obviously won't work at all with games that use H interrupt and may cause all sorts of issues in general)
Aaaand there's a beta04 now. Incorporated and adapted some of @macca 's code for USB game pad support. Very preliminary. As such the limitations:
Only one controller is supported (general driver issue)
No HID descriptor parsing - only generic XInput pads and RetroBit SEGA USB controllers are supported through hard-coding right now.
forgot to add @macca (and I guess @garryj to begin with) to the credit screen. But it's like 3AM I won't fix it now.
If you hold Start+Down or Start+Up for 3 seconds, it will quit or reset the game, respectively. I didn't figure out a clean way to disable this for keyboard yet, but you can disable it altogether in the config.spin2 file.
So, I've looked into NeoGeo emulation some. It really is the "big iron" of 2D game machines...
It has the same 68000 main CPU / Z80 sound CPU setup as the Megadrive, just clocked faster (12 MHz vs 7.67 MHz 68000, 4 MHz vs 3.58 MHz Z80). This shouldn't be a big concern, as even the lowly 4 bit PSRAM Megayume is generally faster than it needs to be. And this presumed "NeoYume" would only work with 16 bit PSRAM, anyways, because...
Big ROM hot cheese. Look up pictures of NeoGeo cartridges. TWO EDGE CONNECTORS. There are 5 separate ROM buses: P-ROM for 68000 code (up to 2MB without bankswitching), M-ROM for Z80 code (usually 128KB), S-ROM for text tiles (usually 128KB), C-ROM for sprite tiles (up to 128MB, though I think(tm) the largest actual game is 64MB) and V-ROM for ADPCM samples (up to 16MB). S and M can fit somewhere in Hub, but the rest need to live in PSRAM and there will need to be a dedicated cog to coordinate access between 68000, sprites and ADPCM. I have some idea of how I want this to work. In particular, instead of polling I want to use the cog pair mechanism such that the 68k cog can write a request into LUT and the PSRAM cog will get interrupted from polling Sprites/ADPCM. I also want to try using a smartpin for chip select, such that the next transfer can be prepared while the current one is still going.
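As a rough mental model of the arbitration idea (68000 requests jump the queue, sprite and ADPCM fetches are served otherwise), here is a small Python sketch. It is purely illustrative: the class, the round-robin policy for the background clients and the request tuple format are all assumptions, not the planned PASM implementation.

```python
from collections import deque


class PsramArbiter:
    """Toy arbiter: CPU requests always win, background clients take turns."""

    def __init__(self):
        self.cpu = deque()
        self.background = {"sprites": deque(), "adpcm": deque()}
        self._order = ["sprites", "adpcm"]   # round-robin order (assumption)

    def request(self, source: str, address: int, length: int) -> None:
        """Queue a transfer for 'cpu', 'sprites' or 'adpcm'."""
        queue = self.cpu if source == "cpu" else self.background[source]
        queue.append((address, length))

    def next_transfer(self):
        """Return (source, address, length) for the next transfer, or None if idle."""
        if self.cpu:
            return ("cpu",) + self.cpu.popleft()
        for _ in range(len(self._order)):
            name = self._order.pop(0)
            self._order.append(name)         # rotate so the other client goes first next time
            if self.background[name]:
                return (name,) + self.background[name].popleft()
        return None


arb = PsramArbiter()
arb.request("sprites", 0x100, 64)
arb.request("adpcm", 0x200, 16)
arb.request("cpu", 0x10, 2)
print(arb.next_transfer())   # the CPU request is served first even though it arrived last
```

On real hardware the interesting part is latency rather than ordering, which a queue model like this does not capture.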
I tried prototyping up parts of the graphics system, but I couldn't get sprite logic (i.e. figuring out which 16px slivers are to be rendered on a given line) fast enough yet (for the worst-case of 96 sprites on a line (one of which is #381), anyways). As is, it just eludes being able to fit in a cog @ 336MHz (another cog would then palette lookup and blit the data received from the PSRAM cog onto the screen buffer. Somewhat analogous to the LSPC and NEO-B1 ASICs on an actual NeoGeo motherboard). However, I have some ideas for more extreme optimization approaches, but don't expect to be able to read the code after I'm done.
Anyways, "no pics no clix", so here's a VRAM dump I rendered with a small script I wrote on PC, to test my understanding of the graphics system (this one is fairly pedestrian though - no scaling involved)
Also, here's a nice graph of the aforementioned five ROM busses and how they all connect in the machine. Note that SFIX and SM1 are not present on AES units (which is what I plan to emulate, because they're just that bit simpler)
Complicated! .... looking up the specs on the YM2610 I note it clearly states "fixed pitch" on all but one PCM channel.
It stands out that the Amiga's four channel "variable pitch" was quite unique. It is what allowed the tracker mods to use very little CPU. Everything later just seems to rely on CPU to digitally mix the PCM channels.
@evanh said:
Complicated! .... looking up the specs on the YM2610 I note it clearly states "fixed pitch" on all but one PCM channel.
It stands out that the Amiga's four channel "variable pitch" was quite unique. It is what allowed the tracker mods to use very little CPU. Everything later just seems to rely on CPU to digitally mix the PCM channels.
Eh, there are a couple other hardware PCM mixers, though I think the Amiga is unique in the single-cycle precision of the mixing. Everything else (including this here YM2610) mixes at some 2 figure kHz rate and either deals with the crunchiness or, see SNES audio board, runs a really muffled interpolation filter.
Though the YM2610 seems to do pretty fine with just the one repitchable ADPCM channel. That's generally used for the lead melody, while the fixed channels are used for percussion, sampled chords/riffs and of course sound effects. The FM (oddly, even though there are only 4 channels, they made it like a YM2608 with channels 1 and 3 removed, instead of just putting all the channels on one port) is used for bass and background melodies (sometimes also as a sort of echo effect for ADPCM-B). The SSG channels AFAICT are really only used to play the coin insert sound on MVS units.
@Wuerfel_21 said:
So, I've looked into NeoGeo emulation some. It really is the "big iron" of 2D game machines...
@Wuerfel_21 , given your success with your P2 Megadrive emulator this seems like another great project challenge for you. Wow, 5 ROM buses! I'm interested to see how you decide to get the PSRAM request mechanism going if you share memory access between multiple COGs. That's one of the key parts to figure out for a project like this, in a way that works well and minimizes latency where it's needed. It might be somewhat simpler to do if you have more than one memory bus, but the P2-EC32MB HW is not wired for that. I'm assuming the P2-EC32MB would be your target platform, rather than making your own HW with more than one memory bus.
@Wuerfel_21 said:
Eh, there are a couple other hardware PCM mixers, though I think the Amiga is unique in the single-cycle precision of the mixing. Everything else (including this here YM2610) mixes at some 2 figure kHz rate and either deals with the crunchiness or, see SNES audio board, runs a really muffled interpolation filter.
The Amiga's mixing is analogue, each channel has its own DAC and matching sample DMA/timing. I hadn't realised it was so uniquely powerful. It all fit so cleanly and simple to fully utilise.
EDIT: Hmm, just been looking at the schematics for the first time in 30 years ... seems I remembered it a little wrongly. I had thought the mixing was external to the Paula chip but it's not. However, it does state there are four DACs, one for each channel. So I'll conclude the mixing is still analogue, just inside Paula instead of on the PCB.
After a spiritual journey through the forests of imagination, I have come back mentally scarred for life, but with a grasp on how to get sprite logic worst case under the magic 21K cycles. Most important optimization is to reorder the representation of high VRAM such that all the sprite attributes can be fetched in two RFLONGs.
That allows checking one sprite in 18 cycles. The interesting data is SCB3, which looks like %YYYYYYYYY_S_HHHHHH (Y is inverse Y position, S is sticky bit (if set, reuse settings from previous sprite) and H is the height in 16 pixel increments). The Y check has to be done with 9-bit modular arithmetic (so you can have a 32-tall sprite wrap around the entire 512px virtual screen). Luckily, the Y is already left-justified within the word and the simplest reordering of the VRAM also left-justifies it in the first long read.
Behold:
              rdfast  #0,lspc_vramhiptr
              modc    _clr wc
.scanlp
              rflong  lspc_tmp1                   ' SCB2 in low word, SCB3 in high word
              rflong  lspc_tmp4                   ' SCB4 in low word, garbo in high word
              testb   lspc_tmp1,#6+16 wz          ' sticky bit
        if_nz getword lspc_tmp2,lspc_tmp1,#1
        if_nz shl     lspc_tmp2,#27 wc            ' left justify height. top bit (set->always visible) shifts into C
              subr    lspc_tmp1,lspc_chkline
if_nc_and_nz  cmp     lspc_tmp1,lspc_tmp2 wc
        if_nc djnz    lspc_sprleft,#.scanlp       ' 9 op per sprite
        if_nc jmp     #.scan_done
              '' Insert code to deal with sprite here
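For anyone who doesn't read PASM, the check can be modelled in a few lines of Python. This is a simplified model of the rule described above, not a translation of the loop: it takes an already-computed top line instead of the raw inverse Y, and it ignores the sticky-bit chaining. The 15,625 Hz line rate and the count of 381 sprites scanned per line are assumptions used only for the budget arithmetic.

```python
SYSCLK_HZ = 336_000_000
LINE_RATE_HZ = 15_625      # assumption: horizontal line rate of the video output
CYCLES_PER_CHECK = 18      # 9 two-cycle instructions per sprite, as quoted above
SPRITE_COUNT = 381         # assumption: number of sprites scanned per line


def decode_scb3(scb3: int):
    """Split a 16-bit SCB3 word (%YYYYYYYYY_S_HHHHHH) into (y, sticky, height)."""
    y = (scb3 >> 7) & 0x1FF
    sticky = bool((scb3 >> 6) & 1)
    height = scb3 & 0x3F
    return y, sticky, height


def covers_line(top_line: int, height_tiles: int, line: int) -> bool:
    """True if a sprite starting at top_line with this height touches `line`.

    The comparison wraps modulo 512, and a set top height bit means the sprite
    spans the whole virtual screen.
    """
    if height_tiles & 0x20:
        return True
    return ((line - top_line) & 0x1FF) < height_tiles * 16


cycles_per_line = SYSCLK_HZ // LINE_RATE_HZ
scan_cycles = SPRITE_COUNT * CYCLES_PER_CHECK
print(cycles_per_line, scan_cycles)
```

Under these assumptions the scan itself uses roughly a third of the per-line budget, with the rest left for the code that handles the sprites that are actually visible.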
Now to write the blitter and memory arbiter to figure out if any of this actually works.
Comments
I use psram4 in the module player and it works there. The driver is started with delay=12 at 336 MHz, then I load the module in 128-byte chunks.
Amazing! Bravo! Bravo!
Yes, it is. I tested the PSRAM4 library speed: it is about 500 cycles for a single read and about 5 cycles per byte for bursts, which is over 60 MB/s. But for a module loading from SD into PSRAM the speed is about 400 kB/s, so the SD library is the limiter there. That is enough for playing 16-bit 44100 Hz .wav files, but it is far from the maximum speed available from a modern SD card, even on a 1-bit interface bus.
@Wuerfel_21. Now you are using the HyperRAM board as well, you might be able to try to copy some of your images onto the HyperFlash. It can potentially even execute in place from there too via normal read/writes and should be very fast indeed. It's only 32MB in size but this might be okay for some frequently used games you want to load quickly vs getting them off SD.
eMMC chips might be perfect for this with both high volume and high speed
This is why a 4bit driver is needed for SD
The default pins for SD is only good for SPI mode. There would need to be an additional SD add-on using more pins.
BTW, there are still notable delays in the block completion handshaking in current implementations. Wild guess is this is limited by the command support in SPI mode. Maybe 1-bit SD mode can be engaged with the existing shared pinout, dunno.
I surprised myself, with some hints for others, and pretty much eliminated the delays. We now have performant SD in SPI mode - https://forums.parallax.com/discussion/174513/improving-sd-card-performance/p1 No driver cog, just inline bit-bashed with sysclock/8 for SPI clock.
It is time.
Time for MegaYume beta with GUI and SD loading, that is. You can get it off the beta branch on Github. It'll be merged into master when flexspin 5.9.10 actually releases (The new upper code relies on some bugfixes, expanded libc.a and Shift-JIS charset support). Thus also no release ZIP, because if you can't into git, it'd be worthless, anyways.
Related to the USB controller talk in the other thread...
Guess what I just found out about:
OMG they're making official SEGA USB controllers! (And they're not even ludicrously expensive!)
Well, now I have a reason to get USB gamepad reading going on P2 ;3