Okay, we're dealing with a genuine flexsplorp, I think. @ersmith I think the constant folding is not masking bit shift amounts properly. Too tired to investigate that further today.
Fixed it up on my end (see github). The listed timing seems to not work on pin 32, I need to change it to this:
Bah! I need to kick Roger. The useable range for DELAY is only eight!
And I see in your init code in megayume_upper.spin2 you've used a fixed offset of -7 to move the min-max out to 7..14. Does that even match the driver code?
Our numbers are independent. I think Ada's matched the HW data sheet more closely whereas mine are purely synthetic and were also offset slightly to range extend a little and also fit in 4 bit driver nibble, but I think they are biased more to the lower end than the higher end. As the frequencies increased perhaps they have topped out vs what we saw with HyperRAM originally?
Ok I must have been thinking about PSRAM which is where I added it. (psram16drv.spin2)
setq xfreq1 'reconfigure with single cycle xfreq (sysclk/1)
xcont delay, #0 'configurable fine input delay per P2 clock cycle
xcont #6, #0 'fixed delay offset to expand delay range
I can't find doing it in the original HyperRAM code but if you want to try to expand it in hyperdrv.spin for the reads you might be able to change the waitx #2 instruction below to add some constant value to delay before it gets used further down. But the driver is very tight and does self modify so hopefully this fits and is not gonna break something. Also I hope delay is not somehow used later on, and just gets re-read when required. Been quite a while since I looked at this stuff.
waitx #2 'delay long enough for DATA bus transfer to complete
fltl datapins 'tri-state DATA bus
waitxfi 'wait for address phase+latency to complete
p1 wxpin #2, clkpin 'adjust transition delay to # clocks
p2 setxfrq xfreq2 'setup streamer frequency
wypin clks, clkpin 'setup number of transfer clocks
wrpin regdatabus, datapins 'setup data bus inputs as registered or not
waitx delay 'tuning delay for input data reading
xinit xrecv, #0 'start data transfer and then jump to setup code
call resume 'go see what we will do next while we are streaming
waitxfi 'wait for streaming to finish
wrpin registered, datapins 'prepare data pins for address phase transfer
_ret_ drvh cspin 'de-assert chip select
@evanh Looks like the COG is chockers at 1024 longs so if you want to make a change to make your thermal test run you'll need to save a long elsewhere by commenting something out for your test to make room.
Also this line worries me:
shr delay, #5 wc ' a b c d e | g prep delay and test for registered inputs
See the "f" case is missing. That means there is a path where delay is retained from last use. This is the special locked transfer case where you don't yield to other COGs in the middle of the transfer. So long as you don't lock the COG's transfers with the QoS settings you should be okay.
Update: Actually if this is just for a thermal test you may be able to just hard code this line to be whatever constant delay value you want.
Eg. change:
@evanh said:
EDIT: Doh! I'd forgotten it works fine on base pin P0. The problem is with base pin P32. So the delay is the wrong place to look.
You have updated to the latest commit that actually fixes pin32? The delay does need some tweaking due to trace lengths or something. See post above, that timing works for me in NeoYume. For MegaYume the stock settings work for pin 32.
Oops, didn't read that very well. What package was updated? Actually, it's Megayume that's been failing for me. I haven't been testing this on Neoyume because I don't have a 16MB game for it.
@msrobots said:
Hey @Wuerfel_21, can you do NES too?
Theoretically yes, but the Famicom/NES is a truly accursed architecture, more than people realize.
There's no raster interrupt, instead all sorts of raster effects are achieved by either cycle counting or polling the PPU "sprite 0 hit" bit (weird vestigial hardware collision detector)
Every game more complex than the original SMB uses some kind of bankswitch IC. In the simplest case that's just some logic gates to switch program ROM around, but some of these add crazy features such as additional sound channels, remapping video memory at precise points mid-scanline, even a real raster IRQ. There's many, many different cartridge boards that behave very differently.
At the tail end the games become too large to fit reasonably in hub ram (freak outlier is Metal Slader Glory, which has 512k PRG + 512k CHR ROM)
It was made for the BBC Micro, later converted to other 6502 machines. But they are monochrome for the 3D part; the NES was the only color one in 3D. The menu is in color on all of them.
Original game was just 22k pure 6502 assembler...
Just looked at some video about it. I think I played it on Atari 800 or so.
Dubious and strange thing I just now realized: NeoYume has kinda muffled sound. For reasons that completely escape me, the IIR lowpass coefficient is set to $1000. (lower number -> stronger filter, more muffled) OPNACog, where this particular LPF code originates, uses $2200. But for some reason when I ported that code into NeoYume I guess I set it to $1000 (and just left it there after getting exhausted by fixing the countless sound bugs. My dreams are still haunted by Z80 NMI race conditions and that one corruption bug that turned out to be a byte/word mismatch). I feel like the YM2610 needs some additional filtering that the YM2608 doesn't (due to heavy use of super crunchy samples), but $1000 seems like too much. Might need to upload some comparisons.
Also, something I realized a long time ago but didn't want to rock the boat on: the actual audio levels are rather low (compare to playing a WAV file at full scale). However, actual full-scale is WAY TOO LOUD when connecting headphones directly to the amplified output, so uhhh....
So I was thinking about making a SNES emulator to complete the set. I really should be working on other things, but this is kinda fun to work on so uhhh... don't expect anything to come of this, but I prototyped some code for the SNES video pipeline. I'm just posting this because this code is cracking my brain up and I want to show it.
So in MegaYume, video is pipelined across three cogs: [layer rendering] -> [composite/S&H/palette lookup] -> [output]
Layer rendering really just means to render both tile layers and the sprites. Composite means to combine layers according to their priority. These have to be separate steps to get the correct behavior (basically, you can create an ordering paradox between sprites and layers. This was known about and used for certain effects). (In NeoYume, everything is done in one go, since there's no layer priority system at all)
For the SNES emulator, this would need to be split into four cogs (in exchange, I can fit audio into just one cog), because the video hardware is just that much more complex:
Layer rendering is the same idea, except up to 4 (!) tile layers may be active. Also there's mode 7 (which appears to need just as many cycles as the four normal layers AFTER optimizing it with RDFAST pipeline tricks)
Mosaic is the pixelation "blur" effect you often see. The vertical part is achieved by rendering the same line multiple times, but the horizontal part is difficult and is best done after rendering. Compositing always happens for 512 pixels per line, with odd and even pixels using different mask settings. After color lookup, the even and odd pixels can be blended together by color math. Finally, the lower nibble of the INIDISP register controls overall screen brightness (on the real machine this is the RGB DAC reference voltage).
Anyways, the color math cog! No idea if it actually works lmao. Debug claims this is 21702 cycles per scanline, which just flies in under the limit (assuming I get to clock at 343MHz, which is another 5 MHz higher than NeoYume). The entire thing is too long to fit in a code block, so here's the cliff notes version:
DAT '' PPU Color Math cycle count test
'' SNES master clock is 21.447 MHz
'' multiplying this by 16 gives sysclk of 343.152 MHz
'' SNES scanline is 1364 mclk -> 21824 P2 clocks
org
' TAKING STOCK:
' priority circuit generates 512 pixels per line
' even are "sub", odd pixels are "main" (and use a different set of mask registers)
' color math runs on every pixel, but in lowres modes only main/odd pixels are visible
' color window and inhibit states only update on main/odd pixels
' - pixel 0 state may thus be undefined (BSNES sets everything as if previous pixels were all transparent)
' color window can, in any in/out combination
' - black out current color (CGWSEL[8..7])
' - inhibit color math (CGWSEL[6..5])
' color math is also inhibited based on main/odd pixel's layer (CGADSUB[5..0])
' color math is either with previous pixel or fixed color (CGWSEL[1]).
' when mathing with previous color and THE PREVIOUS EVEN PIXEL (either x-1 or x-2) was backdrop, pixels are
' instead mathed with the fixed color (really, BSNES?)
' half math flag is additionally inhibited (only update on odd pixel!):
' - when window is blacking out current color
' - when CGWSEL[1] is set and prev. color is replaced with fixed color (see above)
' at the very end, pixels are multiplied with INIDISP[3..0] (on real PPUs, this is an analog effect)
' our priority compositor outputs 32 bits per pixel pair in curious format
' we need to always output 512 truecolor pixels to the video driver, all effects applied
' odd/even are more natural to loop than even/odd
' will need 1 dummy iteration?
' due to need of inserting REP compensation NOPs
' a separate loop is used depending on CGADSUB[7] (add/sub mode)
' low LUT is reloaded with palette on each line
' high LUT always holds direct color lookup
ppm_addloop
rep #29,#64 ' always 29 ops regardless of skip
' odd: %WSSS_PPPD_IIII_IIII
' D is direct color flag
' PPP is direct color "palette"
' SSS is source ID
' W is color window state
rfword ppm_odddat
rdlut ppm_oddcol,ppm_odddat
getnib ppm_src,ppm_odddat,#3
getnib ppm_pppd,ppm_odddat,#2
alts ppm_pppd,#ppm_dcolor_odd_tbl ' even somewheres are zero
or ppm_oddcol,0-0
altd ppm_src,#ppm_skip_table
skipf 0-0
' 17 cyc
ppm_fc0 if_z mov ppm_evencol,ppm_fixedcol ' self-modify condition and WZ bit
' we overwrite evencol here so oddcol is valid for next color math
' a : NO MATH
' b : NO MATH + BLANK
' c : MATH
' d : MATH + BLANK
' e : MATH + HALF a b c d e
if_nz shr ppm_evencol,#1 ' ADD + HALF x
if_nz shr ppm_oddcol,#1 ' ADD + HALF x x
addpix ppm_evencol,ppm_oddcol ' skip if window blanking x x
if_nz shl ppm_oddcol,#1 ' ADD + HALF x x
mov ppm_evencol,ppm_oddcol ' SKIP unless math skipped x
mov ppm_evencol,#0 ' SKIP unless (no math AND blanking) x
nop ' REP compensator x x x x
nop ' REP compensator x x x x
nop ' REP compensator x x x
and ppm_evencol,ppm_cmask ' always
' 17 + 17 cyc
altr ppm_wrptr,ppm_pxwrite ' auto increment ALTR
mixpix ppm_evencol,ppm_evencol ' PIV set to INIDISP-derived value
altr ppm_wrptr,ppm_pxwrite ' LOWRES only
mixpix ppm_evencol,ppm_evencol ' LOWRES only
' 17 + 17 + 9 (assuming hires)
' even: %Dxxx_xxxD_IIII_IIII
' yes, two direct color flags
' (we already have a PPPx)
rfword ppm_evendat wc
rdlut ppm_evencol,ppm_evendat
if_c alts ppm_pppd,#ppm_dcolor_always_tbl ' in this table ignore LSB
if_c or ppm_evencol,0-0
' 17 + 17 + 9 + 9
ppm_fc1 if_z mov ppm_oddcol,ppm_fixedcol ' self-modify condition
' we overwrite ppm_oddcol here so evencol is valid for next color math
' a : NO MATH
' b : NO MATH + BLANK
' c : MATH
' d : MATH + BLANK
' e : MATH + HALF a b c d e
if_nz shr ppm_oddcol,#1 ' ADD + HALF x
if_nz shr ppm_evencol,#1 ' ADD + HALF x x
addpix ppm_oddcol,ppm_evencol ' skip if window blanking x x
if_nz shl ppm_evencol,#1 ' ADD + HALF x x
mov ppm_oddcol,ppm_evencol ' SKIP unless math skipped x
mov ppm_oddcol,#0 ' SKIP unless (no math AND blanking) x
nop ' REP compensator x x x x
nop ' REP compensator x x x x
nop ' REP compensator x x x
and ppm_oddcol,ppm_cmask
' 17 + 17 + 9 + 9 + 17
altr ppm_wrptr,ppm_pxwrite ' HIGHRES only
mixpix ppm_oddcol,ppm_oddcol ' HIGHRES only
test ppm_evendat,#$FF wz ' backdrop check
' 17 + 17 + 9 + 9 + 17 + 11
_ret_ sub ppm_wrptr,#128
ppm_subloop
rep #29,#64
' odd: %WSSS_PPPD_IIII_IIII
' D is direct color flag
' PPP is direct color "palette"
' SSS is source ID
' W is color window state
rfword ppm_odddat
' work odd pixel and load SKIP pattern
rdlut ppm_oddcol,ppm_odddat
getnib ppm_src,ppm_odddat,#3
getnib ppm_pppd,ppm_odddat,#2
alts ppm_pppd,#ppm_dcolor_odd_tbl ' even somewheres are zero
or ppm_oddcol,0-0
altd ppm_src,#ppm_skip_table
skipf 0-0
' 17
ppm_fc2 if_z mov ppm_evencol,ppm_fixedcol ' self-modify condition and WZ bit
' we overwrite evencol here so oddcol is valid for next color math
' a : NO MATH
' b : NO MATH + BLANK
' c : MATH
' d : MATH + BLANK (same as blank?)
' e : MATH + HALF a b c d e
not ppm_evencol ' SUB x x
addpix ppm_evencol,ppm_oddcol ' skip if window blanking x x
not ppm_evencol ' SUB x x
if_nz shr ppm_evencol,#1 ' SUB + HALF x
mov ppm_evencol,ppm_oddcol ' SKIP unless math skipped x
mov ppm_evencol,#0 ' SKIP unless (no math AND blanking) x x
nop ' REP compensator x x x x
nop ' REP compensator x x x
nop ' REP compensator x x x
and ppm_evencol,ppm_cmask
' 17 + 17
altr ppm_wrptr,ppm_pxwrite ' auto increment ALTR
mixpix ppm_evencol,ppm_evencol ' PIV set to INIDISP-derived value
altr ppm_wrptr,ppm_pxwrite ' LOWRES only
mixpix ppm_evencol,ppm_evencol ' LOWRES only
' 17 + 17 + 9 (assuming hires)
' even: %Dxxx_xxxD_IIII_IIII
' yes, two direct color flags
' (we already have a PPPx)
rfword ppm_evendat wc
rdlut ppm_evencol,ppm_evendat
if_c alts ppm_pppd,#ppm_dcolor_always_tbl ' in this table ignore LSB
if_c or ppm_evencol,0-0
' 17 + 17 + 9 + 9
ppm_fc3 if_z mov ppm_oddcol,ppm_fixedcol ' self-modify condition
' we overwrite ppm_oddcol here so evencol is valid for next color math
' a : NO MATH
' b : NO MATH + BLANK
' c : MATH
' d : MATH + BLANK
' e : MATH + HALF a b c d e
not ppm_oddcol ' SUB x x
addpix ppm_oddcol,ppm_evencol ' skip if window blanking x x
not ppm_oddcol ' SUB x x
if_nz shr ppm_oddcol,#1 ' SUB + HALF x
mov ppm_oddcol,ppm_evencol ' SKIP unless math skipped x
mov ppm_oddcol,#0 ' SKIP unless (no math AND blanking) x x
nop ' REP compensator x x x x
nop ' REP compensator x x x
nop ' REP compensator x x x
and ppm_oddcol,ppm_cmask
' 17 + 17 + 9 + 9 + 17
altr ppm_wrptr,ppm_pxwrite ' HIGHRES
mixpix ppm_oddcol,ppm_oddcol ' HIGHRES
test ppm_evendat,#$FF wz ' backdrop check
' 17 + 17 + 9 + 9 + 17 + 11
_ret_ sub ppm_wrptr,#128
ppm_fcmodmask long %0101_0000000_010_000000000_000000000 ' if OR'd set if_always and WZ, if ANDN'd, set if_z
ppm_wrskip8_inc long ppm_skip_table+8 + 1<<9
ppm_pxwrite long ppm_pxbuffer + 1<<9
ppm_dummydat long $5_0_00 ' backdrop+outside pixel
ppm_skip_hires long %00_0_000000000_0_0000_1100_0_000000000_0
ppm_skip_lores long %11_0_000000000_0_0000_0000_0_000000000_0
ppm_skip_normal long %00_0_000101111_0_0000_0000_0_000101111_0
ppm_skip_blank long %00_0_000011111_0_0000_0000_0_000011111_0
ppm_skip_math_add long %00_0_000111011_0_0000_0000_0_000111011_0
ppm_skip_math_sub long %00_0_110111000_0_0000_0000_0_110111000_0
ppm_skip_math_blank_add long %00_0_100110101_0_0000_0000_0_100110101_0
' sub from blanked pixel is same as just blank
ppm_skip_math_half long %00_0_111110000_0_0000_0000_0_111110000_0
ppm_cbuf_even long @composite_buffer
ppm_cbuf_odd long @composite_buffer+512*2
ppm_512 long 512
ppm_1024 long 1024
ppm_2048 long 2048
ppm_dcolor_always_tbl
long $00_00_00_00[2]
long $10_00_00_00[2]
long $00_10_00_00[2]
long $10_10_00_00[2]
long $00_00_20_00[2]
long $10_00_20_00[2]
long $00_10_20_00[2]
long $10_10_20_00[2]
ppm_dcolor_odd_tbl
long 0, $00_00_00_00
long 0, $10_00_00_00
long 0, $00_10_00_00
long 0, $10_10_00_00
long 0, $00_00_20_00
long 0, $10_00_20_00
long 0, $00_10_20_00
long 0, $10_10_20_00
ppm_cmask long $F8F8F800
ppm_bit31 long negx
ppm_skip_table res 7
res 1 ' unused!
res 7
res 1 ' unused!
res 2 ' space for dummy pickles
ppm_pxbuffer res 128
res 1
ppm_evendat res 1
ppm_odddat res 1
ppm_evencol res 1
ppm_oddcol res 1
ppm_src res 1
ppm_pppd res 1
ppm_wrptr res 1
ppm_tmp1 res 1
ppm_tmp2 res 1
ppm_tmp3 res 1
fit 496
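The scanline budget quoted in the listing header can be checked with a few lines of arithmetic. This is only a sketch: the constants come from the comments in the listing and from the reported debug figure, while the assumption that one loop iteration handles one odd/even pixel pair (so 256 iterations per line) is my reading of the code, not something the post states outright.

```python
# Scanline cycle budget for the color math cog (host-side arithmetic only).
MCLK_PER_SCANLINE = 1364     # from the listing header
P2_CLOCKS_PER_MCLK = 16      # sysclk = 16 x SNES master clock, per the listing header
MEASURED_CYCLES = 21702      # figure reported by debug in the post

# Running totals annotated inside the loop body: 17 + 17 + 9 + 9 + 17 + 11
STAGE_CYCLES = (17, 17, 9, 9, 17, 11)
PIXEL_PAIRS_PER_LINE = 512 // 2   # ASSUMPTION: one iteration per odd/even pair

budget = MCLK_PER_SCANLINE * P2_CLOCKS_PER_MCLK
headroom = budget - MEASURED_CYCLES
cycles_per_pair = sum(STAGE_CYCLES)
loop_cycles = cycles_per_pair * PIXEL_PAIRS_PER_LINE
overhead = MEASURED_CYCLES - loop_cycles

print(f"budget   : {budget} P2 clocks per scanline")
print(f"headroom : {headroom} clocks")
print(f"loop     : {cycles_per_pair} clocks/pair -> {loop_cycles} clocks/line")
print(f"overhead : {overhead} clocks outside the loops")
```

Under those assumptions the loops account for most of the measured time, and the remaining headroom is only about half a percent of the scanline.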
@Wuerfel_21 said:
Debug claims this is 21702 cycles per scanline, which just flies in under the limit (assuming I get to clock at 343MHz, which is another 5 MHz higher than NeoYume).
Nice. Could end up being a fun project. But 343MHz! That's getting high.
Again it will need external memory for cartridge ROM data - so what's your expectation of overall COG allocation looking like for something like this? You'd need a different CPU emulator for audio now (SPC700 apparently), plus something to emulate whatever the main core uses, right?
The sound part I already made a while ago as standalone player: https://obex.parallax.com/obex/spccog/ That fits in one cog rather easily. Might have to change the CPU/DSP task switching to be more granular. Apparently there's some very fun race conditions relating to the audio mailbox communications.
So the cogs need to end up as
1. Spin2 code (menu and disk I/O)
2. USB/other input
3. PPU layer rendering
4. PPU compositing
5. PPU color math
6. video output (+ HDMI audio encoder when I finally implement it)
7. CPU + DMA
8. Audio
The CPU is a 65816 theoretically running at 3.58 MHz (mclk/6), but in practice it gets clock stretched to mclk/8 on every RAM access cycle. Doing DMA on the same cog means that the PSRAM access won't need locks. Instruction table probably needs to go into hub so there can be 4 versions of it, for each possible M/X state combination. Maybe need to do something about the D bit as well, 16 bit BCD sounds painful.
@pik33 said:
343 MHz is at the edge of EC32 PSRAM stability... That's high. The CPU is modified 65816... the next one to emulate.
Weren't you using it at like 350? Anyways, can always settle down for sysclk/3. My real concern is that it's creeping closer to where the P2 core starts crapping out. But as far as I can tell it really needs to be this high to get everything done in time. Well, mostly it's a waste since hires mode isn't super common, but oh well. The compositing so far is similarly tight and that always operates in high-res.
Yes, I used a P2 at 354, many hours, stable, without problems. However, PSRAM on my Edge starts to give up about 340. 360 is where problems start with P2 itself, at least these units I tested.
That's not sounding too difficult to fit. Having a full Spin2 COG at your disposal makes things a lot easier to co-ordinate.
Still hopeful of a P2 Edge version assuming it can run at 343MHz or you figure out a way to reduce that a little. I'm currently working on another dedicated system board for the P2Edge with VGA/HDMI/A/V and USB right now so will likely end up with a nice target platform to run this. Will post about that separately soon but here's a sneaky peek of its progress:
@rogloh said:
Still hopeful of a P2 Edge version assuming it can run at 343MHz or you figure out a way to reduce that a little.
EDGE is the most stable board overall, so if anything it'd be edge-exclusive
I'm currently working on another dedicated system board for the P2Edge with VGA/HDMI/A/V and USB right now so will likely end up with a nice target platform to run this. Will post about that separately soon but here's a sneaky peek of its progress:
@pik33 said:
Yes, I used a P2 at 354, many hours, stable, without problems. However, PSRAM on my Edge starts to give up about 340. 360 is where problems start with P2 itself, at least these units I tested.
It'll be hot with all eight cogs busy. More cooling will likely be needed.
unrelatedly, I had figured out an... interesting way to read the strange planar tile format. So each word contains data for two bitplanes in high and low bytes. For 2bpp mode, that's that and 8 words define a tile. In 4bpp mode, 16 words are needed. The first two bitplanes are stored the same as in 2bpp mode and then the other two bitplanes are stored in the same way. Very odd.
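For anyone following along, here is a small host-side Python model of that layout. It is a sketch, not the PASM approach from the post. Two details are assumptions taken from the usual description of the SNES tile format rather than from the text above: the low byte of each word is the lower-numbered bitplane, and bit 7 is the leftmost pixel. The function name is made up for illustration.

```python
def decode_tile_row(words, row, bpp=4):
    """Return the 8 pixel values of one tile row.

    words: list of 16-bit ints, 8 per bitplane pair (8 for 2bpp, 16 for 4bpp).
    """
    planes = []
    for pair in range(bpp // 2):
        word = words[row + 8 * pair]   # planes 2/3 start 8 words after planes 0/1
        planes.append(word & 0xFF)     # ASSUMPTION: low byte = lower plane of the pair
        planes.append(word >> 8)       # high byte = upper plane of the pair
    pixels = []
    for x in range(8):
        bit = 7 - x                    # ASSUMPTION: bit 7 is the leftmost pixel
        value = 0
        for index, plane in enumerate(planes):
            value |= ((plane >> bit) & 1) << index
        pixels.append(value)
    return pixels


tile = [0] * 16
tile[0] = 0x80FF   # row 0: plane 0 all set, plane 1 leftmost pixel only
tile[8] = 0x0100   # row 0: plane 3 rightmost pixel only
print(decode_tile_row(tile, 0))
print(decode_tile_row(tile[:8], 0, bpp=2))
```

The point of the model is just that a 4bpp row needs two word reads that sit 8 words apart, which is why the read pattern matters for cycle counts.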
Comments
Okay, we're dealing with a genuine flexsplorp, I think. @ersmith I think the constant folding is not masking bit shift amounts properly. Too tired to investigate that further today.
Fixed it up on my end (see github). The listed timing seems to not work on pin 32, I need to change it to this:
Also, if you need a <=16MB game, try Money Puzzle Exchanger. Probably one of the most funny ones if you plan on never pressing a button.
Bah! I need to kick Roger. The useable range for DELAY is only eight!
And I see in your init code in megayume_upper.spin2 you've used a fixed offset of -7 to move the min-max out to 7..14. Does that even match the driver code?
Our numbers are independent. I think Ada's matched the HW data sheet more closely whereas mine are purely synthetic and were also offset slightly to range extend a little and also fit in 4 bit driver nibble, but I think they are biased more to the lower end than the higher end. As the frequencies increased perhaps they have topped out vs what we saw with HyperRAM originally?
Roger,
Are you able to point me to a line number in hyperdrv.spin2 where the offset is applied?
Ok I must have been thinking about PSRAM which is where I added it. (psram16drv.spin2)
I can't find doing it in the original HyperRAM code but if you want to try to expand it in hyperdrv.spin for the reads you might be able to change the waitx #2 instruction below to add some constant value to delay before it gets used further down. But the driver is very tight and does self modify so hopefully this fits and is not gonna break something. Also I hope delay is not somehow used later on, and just gets re-read when required. Been quite a while since I looked at this stuff.
@evanh Looks like the COG is chockers at 1024 longs so if you want to make a change to make your thermal test run you'll need to save a long elsewhere by commenting something out for your test to make room.
Also this line worries me:
See the "f" case is missing. That means there is a path where delay is retained from last use. This is the special locked transfer case where you don't yield to other COGs in the middle of the transfer. So long as you don't lock the COG's transfers with the QoS settings you should be okay.
Update: Actually if this is just for a thermal test you may be able to just hard code this line to be whatever constant delay value you want.
Eg. change:
to
It ought to, but at some point I got tired of fixing bugs with that, so now the high level code only ever writes to memory (writes always work)
I can't find a value that works. I think the Latency value is screwed too. Time to go back to the 96MB add-on.
EDIT: Doh! I'd forgotten it works fine on base pin P0. The problem is with base pin P32. So the delay is the wrong place to look.
You have updated to the latest commit that actually fixes pin32? The delay does need some tweaking due to trace lengths or something. See post above, that timing works for me in NeoYume. For MegaYume the stock settings work for pin 32.
Oops, didn't read that very well. What package was updated? Actually, it's Megayume that's been failing for me. I haven't been testing this on Neoyume because I don't have a 16MB game for it.
I've updated both the emulators to fix the pin 32 problem. Was that not sufficiently communicated?
In that case it didn't work for me. No worries, I've been busy offline anyway.
EDIT: Err, my botch up. It's working thanks. I'd gone too far on the config and didn't put everything back.
Hey @Wuerfel_21, can you do NES too?
Elite was the first 3D game, please look at this ... https://www.bbcelite.com
Mike
I played a lot of Frontier (Elite 2). It did a good job of handling motion in space correctly. Enjoyed docking with full manual controls.
It struggled with high-G physics though. I couldn't put my finger on what was wrong but some stuff just went batty.
Theoretically yes, but the Famicom/NES is a truly accursed architecture, more than people realize.
It was made for the BBC Micro, later converted to other 6502 machines. But they are monochrome for the 3D part; the NES was the only color one in 3D. The menu is in color on all of them.
Original game was just 22k pure 6502 assembler...
Just looked at some video about it. I think I played it on Atari 800 or so.
Mike
Dubious and strange thing I just now realized: NeoYume has kinda muffled sound. For reasons that completely escape me, the IIR lowpass coefficient is set to $1000. (lower number -> stronger filter, more muffled) OPNACog, where this particular LPF code originates, uses $2200. But for some reason when I ported that code into NeoYume I guess I set it to $1000 (and just left it there after getting exhausted by fixing the countless sound bugs. My dreams are still haunted by Z80 NMI race conditions and that one corruption bug that turned out to be a byte/word mismatch). I feel like the YM2610 needs some additional filtering that the YM2608 doesn't (due to heavy use of super crunchy samples), but $1000 seems like too much. Might need to upload some comparisons.
Also, something I realized a long time ago but didn't want to rock the boat on: the actual audio levels are rather low (compare to playing a WAV file at full scale). However, actual full-scale is WAY TOO LOUD when connecting headphones directly to the amplified output, so uhhh....
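To get a feel for what "lower number -> stronger filter" means, here is a rough Python model of a one-pole IIR lowpass. This is a guess at the general shape of such a filter, not the actual OPNACog or NeoYume code. In particular the fixed-point scale (`frac_bits=16`, so $10000 would pass the input straight through) is an assumption, and the function name is made up.

```python
def one_pole_lowpass(samples, coeff, frac_bits=16):
    """y[n] = y[n-1] + ((x[n] - y[n-1]) * coeff >> frac_bits), integer math."""
    y = 0
    out = []
    for x in samples:
        # ASSUMPTION: coeff is a fraction with frac_bits fractional bits
        y += ((x - y) * coeff) >> frac_bits
        out.append(y)
    return out


step = [32767] * 8   # full-scale step input
for coeff in (0x1000, 0x1800, 0x2200):
    print(f"${coeff:04X}:", one_pole_lowpass(step, coeff))
```

With a smaller coefficient the output takes longer to follow the step, which is the same thing as more high-frequency content being removed. The real cutoff also depends on the sample rate and the true scaling, neither of which is given here.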
So I ended up going with LPF coefficient $1800. I think this snippet (3rd demo from Ironclad) illustrates the issue:
@VonSzarvas forum is busted again, can't post MP3s
Have you ever been able to do that?
Regardless- the file types list is out of my reach
So I was thinking about making a SNES emulator to complete the set. I really should be working on other things, but this is kinda fun to work on so uhhh... don't expect anything to come of this, but I prototyped some code for the SNES video pipeline. I'm just posting this because this code is cracking my brain up and I want to show it.
So in MegaYume, video is pipelined across three cogs: [layer rendering] -> [composite/S&H/palette lookup] -> [output]
Layer rendering really just means to render both tile layers and the sprites. Composite means to combine layers according to their priority. These have to be separate steps to get the correct behavior (basically, you can create an ordering paradox between sprites and layers. This was known about and used for certain effects). (In NeoYume, everything is done in one go, since there's no layer priority system at all)
For the SNES emulator, this would need to be split into four cogs (in exchange, I can fit audio into just one cog), because the video hardware is just that much more complex:
[layer rendering] -> [h-mosaic/window mask/composite] -> [palette lookup/color math/fade-out] -> [output]
Layer rendering is the same idea, except up to 4 (!) tile layers may be active. Also there's mode 7 (which appears to need just as many cycles as the four normal layers AFTER optimizing it with RDFAST pipeline tricks)
Mosaic is the pixelation "blur" effect you often see. The vertical part is achieved by rendering the same line multiple times, but the horizontal part is difficult and is best done after rendering. Compositing always happens for 512 pixels per line, with odd and even pixels using different mask settings. After color lookup, the even and odd pixels can be blended together by color math. Finally, the lower nibble of the INIDISP register controls overall screen brightness (on the real machine this is the RGB DAC reference voltage).
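One detail of the color math that may not be obvious: the subtract variant of the loop is built from NOT, ADDPIX, NOT. ADDPIX adds the bytes of two longs with saturation at $FF, so inverting, adding and inverting again gives a per-channel subtract that clamps at zero. The Python below is a host-side check of that identity. The helper names are mine and it models only the arithmetic, not the skip patterns or the half-math handling.

```python
def addpix(d, s):
    """Byte-wise saturating add of two packed 32-bit values (model of ADDPIX)."""
    out = 0
    for shift in (0, 8, 16, 24):
        total = ((d >> shift) & 0xFF) + ((s >> shift) & 0xFF)
        out |= min(total, 0xFF) << shift
    return out


def not32(value):
    """32-bit bitwise NOT."""
    return ~value & 0xFFFFFFFF


def sub_via_not(d, s):
    """Subtract built from NOT / ADDPIX / NOT."""
    return not32(addpix(not32(d), s))


def sub_clamped(d, s):
    """Reference: per-byte d - s, clamped at zero."""
    out = 0
    for shift in (0, 8, 16, 24):
        diff = ((d >> shift) & 0xFF) - ((s >> shift) & 0xFF)
        out |= max(diff, 0) << shift
    return out


a, b = 0x10F80800, 0x20080800
print(f"{sub_via_not(a, b):08X} {sub_clamped(a, b):08X}")
```

Per channel this is 255 - min(255 - a + b, 255) = max(a - b, 0), so no separate saturating-subtract step is needed.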
Anyways, the color math cog! No idea if it actually works lmao. Debug claims this is 21702 cycles per scanline, which just flies in under the limit (assuming I get to clock at 343MHz, which is another 5 MHz higher than NeoYume). The entire thing is too long to fit in a code block, so here's the cliff notes version:
Nice. Could end up being a fun project. But 343MHz! That's getting high.
Again it will need external memory for cartridge ROM data - so what's your expectation of overall COG allocation looking like for something like this? You'd need a different CPU emulator for audio now (SPC700 apparently), plus something to emulate whatever the main core uses, right?
343 MHz is at the edge of EC32 PSRAM stability... That's high. The CPU is modified 65816... the next one to emulate.
The sound part I already made a while ago as standalone player: https://obex.parallax.com/obex/spccog/ That fits in one cog rather easily. Might have to change the CPU/DSP task switching to be more granular. Apparently there's some very fun race conditions relating to the audio mailbox communications.
So the cogs need to end up as
1. Spin2 code (menu and disk I/O)
2. USB/other input
3. PPU layer rendering
4. PPU compositing
5. PPU color math
6. video output (+ HDMI audio encoder when I finally implement it)
7. CPU + DMA
8. Audio
The CPU is a 65816 theoretically running at 3.58 MHz (mclk/6), but in practice it gets clock stretched to mclk/8 on every RAM access cycle. Doing DMA on the same cog means that the PSRAM access won't need locks. Instruction table probably needs to go into hub so there can be 4 versions of it, for each possible M/X state combination. Maybe need to do something about the D bit as well, 16 bit BCD sounds painful.
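Putting numbers on that, as a sketch: with sysclk at 16x the SNES master clock (the figure from the color math listing), each emulated CPU cycle gets either 96 or 128 P2 clocks. The table index function is a hypothetical way of picking one of the four instruction tables. The M and X bit positions are those of the 65816 status register, but the ordering of the tables is made up here.

```python
P2_CLOCKS_PER_MCLK = 16   # sysclk = 16 x master clock, per the color math listing

FLAG_M = 0x20   # 65816 status register: accumulator/memory width flag
FLAG_X = 0x10   # 65816 status register: index register width flag


def p2_clocks_per_cpu_cycle(stretched):
    """P2 clocks available for one 65816 cycle (mclk/6, or mclk/8 when stretched)."""
    return P2_CLOCKS_PER_MCLK * (8 if stretched else 6)


def instruction_table_index(p_register):
    """Hypothetical table selector: 0..3 from the M and X flags."""
    return (p_register & (FLAG_M | FLAG_X)) >> 4


print(p2_clocks_per_cpu_cycle(False), p2_clocks_per_cpu_cycle(True))
print([instruction_table_index(p) for p in (0x00, 0x10, 0x20, 0x30)])
```

Any cycles spent on DMA or PSRAM access in the same cog come out of that same per-cycle allowance.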
Weren't you using it at like 350? Anyways, can always settle down for sysclk/3. My real concern is that it's creeping closer to where the P2 core starts crapping out. But as far as I can tell it really needs to be this high to get everything done in time. Well, mostly it's a waste since hires mode isn't super common, but oh well. The compositing so far is similarly tight and that always operates in high-res.
Yes, I used a P2 at 354, many hours, stable, without problems. However, PSRAM on my Edge starts to give up about 340. 360 is where problems start with P2 itself, at least these units I tested.
That's not sounding too difficult to fit. Having a full Spin2 COG at your disposal makes things a lot easier to co-ordinate.
Still hopeful of a P2 Edge version assuming it can run at 343MHz or you figure out a way to reduce that a little. I'm currently working on another dedicated system board for the P2Edge with VGA/HDMI/A/V and USB right now so will likely end up with a nice target platform to run this. Will post about that separately soon but here's a sneaky peek of its progress:
EDGE is the most stable board overall, so if anything it'd be edge-exclusive
looks neat
It'll be hot with all eight cogs busy. More cooling will likely be needed.
unrelatedly, I had figured out an... interesting way to read the strange planar tile format. So each word contains data for two bitplanes in high and low bytes. For 2bpp mode, that's that and 8 words define a tile. In 4bpp mode, 16 words are needed. The first two bitplanes are stored the same as in 2bpp mode and then the other two bitplanes are stored in the same way. Very odd.
So to get all the data for a line you'd have to
I think(tm) this takes 29..36 cycles (second rdword hits worst case, owie)
Instead, one can:
for 17..25 cycles, at the expense of trashing 3 registers in the process
Also, the optimized mode7 routine I spoke of earlier (also untested, so probably bugged). I calculated it to take under 12000 cycles.
I think this takes 23-30 cycles if no long crossings, else 24-31, but still slower than your alternative.