Console Emulation

hinv · 2022-06-01 02:07

@Wuerfel_21 said:
If the P2 had access to full speed 4bit SD bus, it would probably be faster.

As a hat with a 26-pin bus like the PiStorm uses, the Pi could serve the P2 files off of SD, USB or network a LOT faster, leaving the P2 to do what it does well: real-time processing and mixed-signal I/O.
I wonder if the Pi could serve main memory fast enough for Agnus, Denise, Paula and Gary emulations to do their DMA to/from it over those 26 pins. The PiStorm relies on the shared Chip memory staying in the Amiga, while the PiStorm emulates all of the Fast RAM, Hard Drive, Floppy Drive, Network and "Retargetable Graphics"...which means the screen comes out the HDMI port of the Pi, which would seem to suggest extra transfers of screen memory (highest bandwidth). The more I look into this, the more complex an Amiga emulation looks, but also the more doable it seems for someone with your genius, if you wanted to. MegaYume and NeoYume are very impressive.

evanh · 2022-06-01 02:13

@pik33 said:

Paula-like (don't know how much like) audio driver

It runs at the original Paula frequency and it can directly accept Paula's sample periods without any translation. Internally it simulates Paula's method of outputting samples as exactly as it can. What is different is the register structure and the additional features (more channels, non-integer periods, 16-bit samples, channel synchronizing). It also lacks Paula's method (PWM) of controlling channel volume; I multiply instead. This low-frequency PWM adds some distortion to Paula's sound.

I'm leaning towards not mixing in software to stereo DACs but rather have four DACs with analogue mixing, same as real Paula. This way each DAC has its own sample rate set in its associated smartpin, just like the real Paula. The downside is the Eval A/V accessory won't suit but, hey, the mixing is certainly simple with just four resistors.

The volume control would still be software integrated on a channel basis, just by scaling to the smartpin's 16-bit register. After all, the Prop2's own 16-bit output is PWM'd on the back of an 8-bit DAC.

Of course, there's also nothing stopping you from having eight Prop2 DACs wired like this either.

pik33 · 2022-06-01 06:26

@evanh said:
I'm leaning towards not mixing in software to stereo DACs but rather have four DACs with analogue mixing, same as real Paula. This way each DAC has its own sample rate set in its associated smartpin, just like the real Paula. The downside is the Eval A/V accessory won't suit but, hey, the mixing is certainly simple with just four resistors.

Using 4 DACs and pins is somewhat wasteful of resources, with no change in audio quality. Also, pure Paula stereo output is awful to hear on headphones: 100% stereo separation. Stereo mixing can handle this problem.

evanh · 2022-06-01 06:36

I wouldn't call it wasteful when it frees up so much processing. True about the stereo though, best to treat Amiga audio as mono.

pik33 · 2022-06-01 07:57

The driver code is now optimized and the amount of processing saved by using 4 pins is not that much, if any. The first versions of the driver really mixed these samples at the 3.5 MHz sample rate, but the current version doesn't do this anymore. It does this instead when it computes the next channel sample:

            sub     rs,oldrs                 ' replace the old sample with the new one in the mix
            add     rs,rs0
            sub     ls,oldls
            add     ls,ls0

In the 4-DAC version, samples still have to be buffered, as getting and processing a sample is time-consuming. You have to wait more than one microsecond to get a sample from PSRAM, while in the worst case you have 95 clocks to get and compute 4 samples, which is simply not possible. So you still need buffers, this time 4 of them, and have to precompute the samples earlier. I cannot see any processing saved at all.
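The replace-in-the-mix step shown in the four PASM lines above can be modelled in a few lines of Python. This is only a sketch of the idea, assuming a plain running sum per stereo side; the class and method names (`RunningMix`, `update`) are made up for illustration and are not taken from the driver.

```python
class RunningMix:
    """Running sum of the latest sample of every channel on one stereo side."""

    def __init__(self, channels):
        self.last = [0] * channels   # last sample contributed by each channel
        self.total = 0               # current mix value, always sum(self.last)

    def update(self, channel, sample):
        # Same idea as `sub rs,oldrs` / `add rs,rs0`: take out the channel's
        # old contribution and add the new one, without re-summing everything.
        self.total += sample - self.last[channel]
        self.last[channel] = sample
        return self.total


mix = RunningMix(channels=4)
for ch, s in [(0, 100), (1, -40), (0, 25), (3, 7)]:
    mix.update(ch, s)
```

The cost per new sample is one subtract and one add per side, regardless of how many channels are in the mix.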

evanh · 2022-06-01 08:41

@pik33 said:
The driver code is now optimized and the amount of processing saved by using 4 pins is not that much, if any. The first versions of the driver really mixed these samples at the 3.5 MHz sample rate, but the current version doesn't do this anymore. It does this instead when it computes the next channel sample:

Cool. That's new. How's the timing achieved?

pik33 · 2022-06-01 09:04

The driver keeps the periods for all the channels and the global time in Paula's clock. On every loop it determines which channel's sample has to be computed next. It computes the sample and replaces it in the mix. It also computes the Paula ticks between the last computed sample and the sample it just got. This tick count is the number of copies of the old mix value it has to place in the buffer. The circular buffer for 512 samples is placed in the LUT. The DAC consumes these samples at the 3.5 MHz rate using an interrupt. 512 samples at 3.5 MHz is about 150 microseconds (and this is the driver's delay), so there is always enough time to compute the next sample. To make sample access faster (on average), I added a simple cache system: on a cache hit the sample is loaded from the hub instead of PSRAM, and on a miss 256 new samples are loaded into the cache. The buffer controls the main loop, which has to wait for free space in the buffer before putting these samples into it. The worst case in this version is: all 8 samples have to be computed in the same clock, and all caches miss, using 4-bit PSRAM. That would need about 5000 P2 clocks, but the PSRAM can be busy (video driver), so the time can be up to 3 scan lines, 100 microseconds, which still fits in the buffer.
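The scheduling described above can be sketched in Python and checked against a brute-force per-tick mixer. This is a toy model under stated assumptions: integer periods, looping sample lists, a flat output list instead of the 512-entry LUT ring, and no PSRAM or cache. The function names and the example periods are illustrative only.

```python
def render_event_driven(periods, samples, total_ticks):
    """One mix value per Paula tick, touching the mix only when a channel changes."""
    n = len(periods)
    next_time = [0] * n      # tick at which each channel takes its next sample
    index = [0] * n          # how many samples each channel has consumed
    last = [0] * n           # each channel's current contribution to the mix
    mix, now, out = 0, 0, []
    while now < total_ticks:
        ch = min(range(n), key=lambda c: next_time[c])   # channel due next
        t = min(next_time[ch], total_ticks)
        out.extend([mix] * (t - now))    # hold the old mix for the elapsed ticks
        now = t
        if now >= total_ticks:
            break
        new = samples[ch][index[ch] % len(samples[ch])]
        mix += new - last[ch]            # replace the old sample in the mix
        last[ch] = new
        index[ch] += 1
        next_time[ch] += periods[ch]
    return out


def render_brute_force(periods, samples, total_ticks):
    """Reference: re-sum every channel on every tick."""
    return [sum(s[(tick // p) % len(s)] for p, s in zip(periods, samples))
            for tick in range(total_ticks)]


# Buffer depth quoted in the post: 512 samples at roughly 3.5 MHz.
buffer_us = 512 / 3.5e6 * 1e6
```

With a 3.5 MHz tick, `buffer_us` comes out a little under the "about 150 microseconds" quoted above, which is the time budget the worst case (about 100 microseconds) has to fit into.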

Wuerfel_21 · 2022-06-01 10:16

@hinv said:
@Wuerfel_21 said:

I haven't thought super thoroughly about Amiga (partially because a computer emulator entails a lot more fiddly boring stuff like disk drive emulation than a cartridge game system), but the big problem is that there's too much RAM to keep it all in Hub (unless you wanna go with an unexpanded A1000, lol) and having the external memory writeable causes a number of issues (native word size is 32 bit, exec queue needs to be invalidated/updated for self-modifying code, etc).

You bring up a good point. Most Amiga games expect 512KB of memory. I don't know how fast that could be paged in/out of PSRAM, as there seems to be a bit of overhead latency for small transfers. What about keeping the HUB for Chip RAM (the RAM of the Amiga that is shared between the 68000, Agnus, Gary, Paula and Denise) and loading everything else from PSRAM? https://www.cypherpunk.at/download/eagle/32c3/AMIGA/A500_BD.png
There is the issue of the ROM as well, which is 256KB at minimum.

That 512K is all chip though. Pulling the graphics from PSRAM is no problem, I think(tm). NeoYume does up to 96 individual requests for sprite data and manages to fit that with audio streaming and ROM access for the supposedly ~12MHz 68000 (P2 MotoKore's actual performance can't really be measured in 68000 MHz since some operations are very slow and some are very fast). The main problem is that writing to PSRAM would be rather slow due to dealing with the native word size and stuff. If a separate memory arbiter cog is used, some of that can be masked (because it can do the store while the CPU goes on doing whatever). And as said earlier, to make self-modifying code work, the exec queue needs to be flushed, which eats into performance any way you slice it.

evanh · 2022-06-01 10:43

@pik33 said:
... 512 samples at 3.5 MHz is about 150 microseconds (and this is the driver's delay), so there is always enough time to compute the next sample...

Ah, still seems rather processor heavy. Just like I imagined.

When using four DACs, the ring buffer vanishes and each DAC has its own smartpin hardware buffering. Lightens the timing burden.

Wuerfel_21 · 2022-06-01 11:59

Anyways, messed around some more, updated NeoYume compatibility list. I think most of the games with broken graphics are caused by them not clearing the top bits of the CROM addresses, so I guess I'll have to splice an instruction in somewhere to mask those off. Will also try to investigate the broken sound games and figure out if the Neo Drift Out issue really is another height 33 headache.

Game	Status
Metal Slug	Works
Metal Slug 2	Would work (needs bigger PSRAM to load all graphics)
Samurai Shodown	Works
Samurai Shodown 2	Works
Twinkle Star Sprites	Works
Ironclad	Works (need to rename files to fit in 8.3)
Neo Drift Out	Glitched graphics (fricking height 33 quirks again, I bet), sound seems to die after one game
Cyber-Lip	Doesn't boot
Viewpoint	Hangs immediately
Zed Blade	Works
Sengoku	Works
Sengoku 2	Hangs. (Graphics are OK now)
Magician Lord	M1 too big, also glitched floors
King of Fighters '95	Works in theory, in practice the memory map is [expletive] stupid and holey and thus we're short 1MB of RAM
Magical Drop 2	Broken sound
King of the Monsters	Graphics issues
Karnov's Revenge	Black screen after eyecatcher
World Heroes 2	Graphics issues
Waku Waku 7	Would be too big, but also crashes immediately
Mutation Nation	Works
Aero Fighters 2	Broken graphics, broken sound
Aero Fighters 3	Broken sound
Crossed Swords	Graphics issues
Fatal Fury	Works
Double Dragon	Works

macca · 2022-06-01 12:05

@Wuerfel_21 said:
That 512K is all chip though. Pulling the graphics from PSRAM is no problem, I think(tm). NeoYume does up to 96 individual requests for sprite data and manages to fit that with audio streaming and ROM access for the supposedly ~12MHz 68000 (P2 MotoKore's actual performance can't really be measured in 68000 MHz since some operations are very slow and some are very fast). The main problem is that writing to PSRAM would be rather slow due to dealing with the native word size and stuff. If a separate memory arbiter cog is used, some of that can be masked (because it can do the store while the CPU goes on doing whatever). And as said earlier, to make self-modifying code work, the exec queue needs to be flushed, which eats into performance any way you slice it.

I was wondering if a caching scheme of some kind could be implemented, like a modern processor's L1 (or L2) cache?
Not at all an expert in this field, don't even know where to start...

Wuerfel_21 · 2022-06-01 12:06

@evanh said:

@pik33 said:
... 512 samples at 3.5 MHz is about 150 microseconds (and this is the driver's delay), so there is always enough time to compute the next sample...

Ah, still seems rather processor heavy. Just like I imagined.

When using four DACs, the ring buffer vanishes and each DAC has its own smartpin hardware buffering. Lightens the timing burden.

For YM2610 sample streaming, I request 16 byte chunks of audio (32 samples because ADPCM, which is 1.7ms for the 18.6 kHz rate of fixed-pitch channels) in a double buffer fashion. So there's some variable delay on the initial sample key-on, but for the rest of the sample the next block is always requested ahead of time (ie. block N+1 is requested exactly when block N-1 has played through). Though I think on Amiga you'd have to fix the delay because there'd be phasing issues otherwise if someone's doing stereo panning with two channels.
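The numbers in that double-buffer scheme can be checked with a short Python sketch. It assumes 4-bit ADPCM (two samples per byte) and, purely for illustration, that the first two blocks are requested together at key-on; `request_times_ms` is a hypothetical helper, not NeoYume code.

```python
BLOCK_BYTES = 16
SAMPLES_PER_BYTE = 2      # 4-bit ADPCM: two samples per byte
RATE_HZ = 18_600          # approximate fixed-pitch channel rate from the post

samples_per_block = BLOCK_BYTES * SAMPLES_PER_BYTE
block_ms = samples_per_block / RATE_HZ * 1000


def request_times_ms(num_blocks):
    """Time (ms after playback starts) at which each block is requested.

    Blocks 0 and 1 are assumed to be requested up front. After that,
    block N+1 is requested at the moment block N-1 finishes playing.
    """
    times = {0: 0.0, 1: 0.0}
    for n in range(1, num_blocks - 1):
        times[n + 1] = n * block_ms   # block N-1 ends at N * block_ms
    return times


times = request_times_ms(6)
# Block k starts playing at k * block_ms, so this is the time each
# request has to complete before its data is needed.
slack_ms = {k: k * block_ms - t for k, t in times.items() if k >= 2}
```

Every request after the initial pair gets one full block time to complete, which is why only the key-on delay varies.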

Wuerfel_21 · 2022-06-01 12:13

@macca said:

@Wuerfel_21 said:
That 512K is all chip though. Pulling the graphics from PSRAM is no problem, I think(tm). NeoYume does up to 96 individual requests for sprite data and manages to fit that with audio streaming and ROM access for the supposedly ~12MHz 68000 (P2 MotoKore's actual performance can't really be measured in 68000 MHz since some operations are very slow and some are very fast). The main problem is that writing to PSRAM would be rather slow due to dealing with the native word size and stuff. If a separate memory arbiter cog is used, some of that can be masked (because it can do the store while the CPU goes on doing whatever). And as said earlier, to make self-modifying code work, the exec queue needs to be flushed, which eats into performance any way you slice it.

I was wondering if a caching scheme of some kind could be implemented, like a modern processor's L1 (or L2) cache?
Not at all an expert in this field, don't even know where to start...

Instructions are already cached to some extent. This is that "execution queue" I talk about. It always reads a couple of instructions after the one it actually wants, so those next instructions can just come from the queue. Branches (forward and backward) within the queued block are also fast (kinda like loop mode on 68010, but better).
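A toy Python model of such an execution queue, for readers who want to see why in-block branches are cheap and why writes force a flush. The emulator's real internals are not shown in this thread, so everything here (the `ExecQueue` class, the 8-word queue length, the flush-on-write policy) is an assumption made for illustration.

```python
class ExecQueue:
    """Holds a short run of instruction words read ahead from slow memory."""

    def __init__(self, memory, length=8):
        self.memory = memory    # list of instruction words (stands in for PSRAM)
        self.length = length    # words fetched per refill (assumed value)
        self.base = None        # address of the first queued word, None = empty
        self.words = []
        self.refills = 0        # number of slow-memory bursts so far

    def _holds(self, addr):
        return self.base is not None and self.base <= addr < self.base + len(self.words)

    def fetch(self, addr):
        if not self._holds(addr):
            # Miss: read the wanted word plus the ones that follow it.
            self.base = addr
            self.words = self.memory[addr:addr + self.length]
            self.refills += 1
        return self.words[addr - self.base]

    def write(self, addr, value):
        self.memory[addr] = value
        if self._holds(addr):
            self.base = None    # flush so self-modifying code sees the new word


memory = list(range(100))
queue = ExecQueue(memory)
for _ in range(5):              # a small loop that fits inside the queued block
    for pc in range(10, 14):
        queue.fetch(pc)
refills_after_loop = queue.refills
queue.write(12, 999)            # self-modifying store into the queued block
patched = queue.fetch(12)
```

The loop costs a single burst from slow memory, while the store into the queued block costs a second one.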

Wuerfel_21 · 2022-06-01 13:48

Seems that masking the top bits of the tile IDs away does indeed fix Crossed Swords, World Heroes 2 and King of the Monsters. Also fixed the graphics for Aero Fighters 2, which was just a typo in the load script.

macca · 2022-06-01 17:53

@Wuerfel_21 said:
Instructions are already cached to some extent. This is that "execution queue" I talk about. It always reads a couple of instructions after the one it actually wants, so those next instructions can just come from the queue. Branches (forward and backward) within the queued block are also fast (kinda like loop mode on 68010, but better).

Yes, I was thinking about something a bit more complicated, like a fully associative cache (hope this is the correct name), with lines of 32/64 bytes, or whatever may be more efficient when transferring data to/from the RAM. It should keep track of the lines that can be replaced when a memory chunk not present in cache is requested (a counter to track the memory accesses?). Critical memory regions (like video RAM) can be kept in hub or use a different caching scheme.
Don't know, maybe it is too complicated to implement efficiently in software.
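For reference, the kind of cache being described can be sketched in Python as a small fully associative cache with least-recently-used replacement. This is a generic illustration, not a proposal for the emulator: the `LineCache` name, the 32-byte lines, the line count, and the choice to write straight through to RAM (which sidesteps dirty-line tracking) are all assumptions.

```python
from collections import OrderedDict


class LineCache:
    """Fully associative cache: any line can live in any slot, LRU eviction."""

    def __init__(self, backing, line_bytes=32, num_lines=4):
        self.backing = backing          # bytearray standing in for external RAM
        self.line_bytes = line_bytes
        self.num_lines = num_lines
        self.lines = OrderedDict()      # tag -> bytearray, least recent first
        self.misses = 0

    def _line(self, addr):
        tag = addr // self.line_bytes
        if tag in self.lines:
            self.lines.move_to_end(tag)             # mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)      # evict least recently used
            start = tag * self.line_bytes
            self.lines[tag] = bytearray(self.backing[start:start + self.line_bytes])
        return self.lines[tag]

    def read(self, addr):
        return self._line(addr)[addr % self.line_bytes]

    def write(self, addr, value):
        # Write-through: RAM and the cached copy are updated together.
        self.backing[addr] = value
        self._line(addr)[addr % self.line_bytes] = value
```

Python's `OrderedDict` does the access tracking here; in software on the P2 that bookkeeping would have to run on every memory access.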

Wuerfel_21 · 2022-06-01 18:42

@macca said:
Don't know, maybe it is too complicated to implement efficiently in software.

Yep, waaay too. The PSRAM isn't actually so slow that implementing a caching scheme is really worth it (considering all the pickles like unaligned longwords, keeping consistency between CPU and Chips, tracking dirty lines, etc. The latter two could be avoided by making the cache write-through, but then it just slows down any bulk write operation, such as drawing to the screen).
The exec queue works because of specific optimizations that are possible with it.

EDIT:
I wouldn't worry too much. With Hub-as-RAM and direct PSRAM (as in MegaYume), it handily outperforms the actual console (7.67 MHz 68000, no waitstates) and it seems to do pretty well with interrupt PSRAM (vs actual NeoGeo's 12MHz 68000 with some (?) waitstates on ROM). The main area of enspeedment is the multiply/divide/shift instructions, which are microcoded on 68000.

Wuerfel_21 · 2022-06-01 19:17

Unrelatedly, speaking of boot times...
Oh my, the ROM loader is not fast. Which makes sense, given that it has to use RCFAST. I guess if fast boot is needed, it needs to be forgone in favor of a faster loader in flash. Or perhaps just a tiny stub _BOOT_P2.BIX that sets clkfreq and then calls back into ROM to load something else (does it even work at higher clkfreq? I guess we can hotpatch it though). IDK. Anyone remember where the ROM source code is?

Anyways, this is in SD-no-serial mode, just hitting the reset button. The capture card has no notable latency in responding to inputs and the upcode initializes the video immediately, this is just how long it takes the P2 to boot.

ke4pjw · 2022-06-01 19:27

So I have a P2 hooked to my Tandy Color Computer 3 and I have the HALT line pulled low (or high?). The P2 is powered by the CoCo and upon power on, the CoCo is halted until the P2 boots from SD, when it releases the HALT line. This only takes a few milliseconds and isn't noticeable much. My boot program is tiny and I never thought about what it would look like with several hundred K of program on the SD card.

How large is the _BOOT_P2.BIX for this project?

Wuerfel_21 · 2022-06-01 19:42

@ke4pjw said:
How large is the _BOOT_P2.BIX for this project?

449524 bytes. Good chunk of that is just empty though, LZ4 brings it to 65540 bytes. That'd be a good one for a potential P2 revision: loading compressed files might be quite a bit faster. Or giving the boot file an optional header that contains the correct clock settings.

Wuerfel_21 · 2022-06-01 22:09

Anyways, here's a version with initial menu load support (as somewhat seen in the boot time video). There's a define in config.spin2 that you can use to directly boot a game and skip the menu.

evanh · 2022-06-01 22:38

@Wuerfel_21 said:
Oh my, the ROM loader is not fast. Which makes sense, given that it has to use RCFAST. I guess if fast boot is needed, it needs to be forgone in favor of a faster loader in flash. Or perhaps just a tiny stub _BOOT_P2.BIX that sets clkfreq and then calls back into ROM to load something else (does it even work at higher clkfreq? I guess we can hotpatch it though). IDK. Anyone remember where the ROM source code is?

Yep, needs a loader. Might be tricky relying on existing calibrated RCFAST routines though. Probably easiest to extract such source and tweak the timing to suit faster sysclock. ROM source is with Pnut files - https://forums.parallax.com/discussion/171196/pnut-spin2-latest-version-v35s-floating-point-fixes-bytefit-wordfit-textstring/p1

rogloh · 2022-06-02 01:53

@Wuerfel_21 said:
Unrelatedly, speaking of boot times...
Oh my, the ROM loader is not fast. Which makes sense, given that it has to use RCFAST. I guess if fast boot is needed, it needs to be forgone in favor of a faster loader in flash. Or perhaps just a tiny stub _BOOT_P2.BIX that sets clkfreq and then calls back into ROM to load something else (does it even work at higher clkfreq? I guess we can hotpatch it though). IDK. Anyone remember where the ROM source code is?

Anyways, this is in SD-no-serial mode, just hitting the reset button. The capture card has no notable latency in responding to inputs and the upcode initializes the video immediately, this is just how long it takes the P2 to boot.

Yes P2 SD boot can be slow. If it were me I'd just be using the SPI flash for the application, and only keep the game files on the SD. What are the downsides to that?

From SPI flash you should be able to get close to a sub 1s boot process which is quite responsive from power on (might need a two stage loader). My LCD tablet thingy boots about that fast from flash to the point I can get a display screen presented on the LCD so it feels snappy. You don't start to question whether you switched it on right or otherwise get a little impatient as you could do in those extra seconds of startup delay with SD etc, particularly if booting is a common operation.

Wuerfel_21 · 2022-06-02 11:24

Yea for any sort of production use the SD loader is unsuitable for the simple fact that it doesn't produce any error messages to the user, so yeah you'd either put an application or a more sophisticated loader of some sort into flash.

Anyways, here's a simple turbo chainloader. Just put this as the _BOOT_P2.BIX file and it will go brr brr fast:

CON _CLKFREQ = 300_000_000
DAT
              org 0
              asmclk ' Switch to the clock configured by _CLKFREQ
              '' Patch pullup check
              wrlong #0,##$fc5b4 ' not sure why it fails but ok
              '' Move filename into place (12 bytes = 3 longs)
              mov $1DC,name+0
              mov $1DD,name+1
              mov $1DE,name+2
              drvh #38 ' Set LED
              call #$fc578 ' Call back into the ROM loader
              drvl #38 ' Clear LED if fail
              jmp #$ ' Hang here if the load failed

name          byte "LOADTEST","BIX",0

Yep, as simple as setting the clock (and patching that check that fails for some reason)
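The three `mov` instructions copy the file name as three longs. A small Python sketch shows what those longs look like, assuming the byte layout implied by the `byte "LOADTEST","BIX",0` line (8 name bytes, 3 extension bytes, one zero byte, little-endian). The `pack_83_name` helper and the space padding for shorter names are illustrative assumptions, not something the post confirms the ROM accepts.

```python
import struct


def pack_83_name(base, ext):
    """Pack an 8.3 name as 11 bytes plus a zero terminator, as three longs."""
    raw = (base.upper().ljust(8).encode("ascii")
           + ext.upper().ljust(3).encode("ascii")
           + b"\x00")
    if len(raw) != 12:
        raise ValueError("name does not fit the 8.3 format")
    return struct.unpack("<3I", raw)    # three little-endian 32-bit values


longs = pack_83_name("LOADTEST", "BIX")
hex_longs = [f"${value:08X}" for value in longs]
```

Each entry of `hex_longs` is the value that would land in one of the three registers `$1DC`, `$1DD` and `$1DE` named in the listing.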

Demonstration with the same binary as above:

evanh · 2022-06-02 11:37

Cool, that is tiny! And kudos to Cluso for a ROM-based loader that is robust at 300 MHz. And that was when knowledge of the I/O latencies was still utter guesswork.

On a separate note: One thing about booting from EEPROM first and then loading data from SD card, it's likely to be tricky to arbitrate which component, EEPROM or SD, responds to SPI commands. I recommend using only clock mode 3 for both components in this situation. With the SPI clock idling high there is much less confusion over which CS is valid.

Wuerfel_21 · 2022-06-02 14:45

@evanh said:
Cool, that is tiny! And kudos to Cluso for a ROM-based loader that is robust at 300 MHz. And that was when knowledge of the I/O latencies was still utter guesswork.

More like "it's so slow that it still works at 15x speed". It doesn't properly init the card though, but that's not an issue if the fast load stub itself is booting from SD.

hinv · 2022-06-02 15:42

@evanh said:

@Wuerfel_21 said:
Oh my, the ROM loader is not fast. Which makes sense, given that it has to use RCFAST. I guess if fast boot is needed, it needs to be forgone in favor of a faster loader in flash. Or perhaps just a tiny stub _BOOT_P2.BIX that sets clkfreq and then calls back into ROM to load something else (does it even work at higher clkfreq? I guess we can hotpatch it though). IDK. Anyone remember where the ROM source code is?

Yep, needs a loader. Might be tricky relying on existing calibrated RCFAST routines though. Probably easiest to extract such source and tweak the timing to suit faster sysclock. ROM source is with Pnut files - https://forums.parallax.com/discussion/171196/pnut-spin2-latest-version-v35s-floating-point-fixes-bytefit-wordfit-textstring/p1

Maybe write the boot loader in TAQOZ, since it is already in ROM?

pik33 · 2022-06-02 19:34

Propeller Spin/PASM Compiler 'FlexSpin' (c) 2011-2022 Total Spectrum Software Inc.
Version 5.9.12-beta-HEAD-v5.9.11-2-g85b1a2e2 Compiled on: May 30 2022
neoyume_upper.spin2
|-NeoVGA.spin2
|-NeoVGA.spin2
|-libc.a
Unable to open file `fsadapter.c': No such file or directory
Propeller Spin/PASM Compiler 'FlexSpin' (c) 2011-2022 Total Spectrum Software Inc.
Version 5.9.12-beta-HEAD-v5.9.11-2-g85b1a2e2 Compiled on: May 30 2022
neoyume_input.spin2
|-NeoVGA.spin2
|-1CogKbM_neoyume.spin2
neoyume_input.p2asm
Done.
Program size is 9160 bytes
Propeller Spin/PASM Compiler 'FlexSpin' (c) 2011-2022 Total Spectrum Software Inc.
Version 5.9.12-beta-HEAD-v5.9.11-2-g85b1a2e2 Compiled on: May 30 2022
neoyume_lower.spin2
neoyume_lower.spin2:7756: error: file neoyume_upper.binary: No such file or directory

Wuerfel_21 · 2022-06-02 19:43

oop, forgot to include the fsadapter file. Should probably add a function to that effect to the flexspin stdlib....
(Currently working on cleaning everything up a bit and then publishing to github...)

pik33 · 2022-06-02 20:24

Now this:
neoyume_upper.spin2:320: error: unknown identifier strcasecmp in class libc_a

evanh · 2022-06-02 21:40

@pik33 said:
Now this:
neoyume_upper.spin2:320: error: unknown identifier strcasecmp in class libc_a

Probably a newly added C lib function - Version 5.9.12-beta-HEAD-v5.9.11-2-g85b1a2e2 Compiled on: May 30 2022
