Ok, think we have it now. Last Blade and Metal Slug are working with this:
' Enable one of these to select the exmem type to use
#define USE_PSRAM16
'#define USE_PSRAM8
'#define USE_PSRAM4
'#define USE_HYPER
' For PSRAM (either type)
BasePin=32'16 'This BasePin constant makes it easier to move FPR around to different pins...
CenPin0=28'5+BasePin
CenPin1=29'6+BasePin
CenPin2=-1'7+BasePin
SckPin0=30'4+BasePin
' For PSRAM (either type)
PSRAM_CLK = SckPin0 addpins 1
PSRAM_SELECT = CenPin0
PSRAM_BASE = BasePin'32'40
PSRAM_BANKS = 2
' \/ Uncomment if PSRAM_BANKS = 1 for speedup
'#define USE_PSRAM_NOBANKS
PSRAM_WAIT = 5
PSRAM_DELAY = 14
PSRAM_SYNC_CLOCK = false
PSRAM_SYNC_DATA = false 'true
Comments
Actually, no need to mod PSRAM16 driver, just added "addpins 1" to clock pin setting and it's working!
This is great!
Now, just need to figure out why I don't have the NEO-PO.bin file on the uSD...
You probably just pressed B on the menu, which toggles between NEO-EPO (international BIOS) and NEO-PO (Japan BIOS).
Ok, thanks.
I'm seeing that Sonic Wings 2 works, but Metal Slug does not.
Guessing something wrong with second bank of 32 MB.
Metal Slug 1 fits within one 32 MB bank
may have been tripped up by USE_PSRAM_NOBANKS
That shouldn't be it... But there may be some bug related to when it thinks it needs to switch to the second bank (or rather, a disagreement on that between loader and custom arbiter). I don't think anyone ever ran a multi-bank 16 bit setup (not even @rogloh ). Let me check
Some glancing at the code would indicate that it's alright...
You have external access to those RAM pins, right? Can you put a pullup on the second select pin and then set BANKS to 1?
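For reference, the bank question being debugged here can be pictured with a small Python sketch. It assumes a plain linear mapping where the first 32 MB live in bank 0 and the next 32 MB in bank 1; the function names are made up for illustration and this is not the loader's or the arbiter's actual code.

```python
BANK_SIZE = 32 * 1024 * 1024  # one 16-bit PSRAM bank on this board (32 MB)
NUM_BANKS = 2                 # matches PSRAM_BANKS = 2 in the config

def bank_and_offset(addr):
    """Split a linear external-memory address into (bank index, offset in bank).

    Assumption: banks are mapped back to back, bank 0 first.
    """
    if not 0 <= addr < BANK_SIZE * NUM_BANKS:
        raise ValueError("address outside the populated PSRAM")
    return addr // BANK_SIZE, addr % BANK_SIZE

def crosses_bank(addr, length):
    """True if a transfer of `length` bytes starting at `addr` touches two banks."""
    first, _ = bank_and_offset(addr)
    last, _ = bank_and_offset(addr + length - 1)
    return first != last
```

If the loader and the arbiter ever computed the bank index differently, a game that fits in the first 32 MB would still run while anything reaching into the second bank would break, which is the symptom described above.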
@Wuerfel_21 Traveling now, so hard to do pullup, but maybe can use a USB cable as a wire or something...
Do the games Crossed Swords and Metal Slug use the same clkfreq if nothing else changes?
I got Metal Slug to behave once, but haven't been able to reproduce it since...
Yes, everything is always the same clkfreq in NeoYume.
You have PSRAM_SYNC_CLOCK = false, try it with true (might need to adjust DELAY and SYNC_DATA). Leave BANKS at 2 for now.
Ok, think we have it now. Last Blade and Metal Slug are working with the settings listed at the top:
Guess SYNC_DATA = false was the key change...
Success!
64 MB at 16 bit bandwidth without slow mode is a pretty good setup, dare I say.
Glad to hear you have a working setup Rayman. Your timing is going to be different as it is board dependent and you would certainly need to tweak the parameters to suit. SYNC_DATA acts as a "half step" and lets you dial in the sweet spot more. I do wonder if having two PSRAM clock signals has made it a little tighter to find this sweet spot at high P2 clock rates if there is any skew between these clock outputs on your board.
@rogloh Thanks for the memory driver. It appears to have every feature one could ever want.
I'm trying to figure out how to adapt your video demo to this board now...
@Wuerfel_21 Was just toying with the idea of an expansion board that would add another 64 MB.
Could use two more pins for CE and maybe have to use the slow mode with 8 chips on each clock pin...
How do these settings filter down to the PSRAM driver?
I'm not seeing it right away...
You'd have to use 26 and 27 as extra select pins. Bigger problem is that the expanded RAM would have different characteristics. Roger's driver actually can set different delay for different banks, but my optimized code can't. Might work out, might not.
Explained in https://github.com/IRQsome/NeoYume/blob/master/RAMCONFIG.MD
But in short:
WAIT is the command latency of the PSRAM. This is always 5 with the APMemory part.
DELAY is the delay between output and input from the P2's point of view, i.e. extra streamer cycles inserted between command output and data input.
SYNC_CLOCK controls the sync/async pin mode bit for the clock pin(s). Not sure what this is actually good for.
SYNC_DATA controls the sync/async pin mode bit for the data pins. Setting this to true is basically a -0.5 for DELAY.
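The "half step" relationship between DELAY and SYNC_DATA can be written down as a tiny Python sketch. The function name is made up for illustration, and the model only captures the rule of thumb stated above (sync data pins count as roughly -0.5 on DELAY); it says nothing about SYNC_CLOCK.

```python
def effective_delay(delay, sync_data):
    """Approximate read delay in streamer cycles.

    Rule of thumb from the thread: setting SYNC_DATA to true behaves
    like subtracting half a cycle from DELAY.
    """
    return delay - 0.5 if sync_data else float(delay)

# Values taken from the config posted in this thread.
current = effective_delay(14, sync_data=False)
half_step_earlier = effective_delay(14, sync_data=True)
```

So with DELAY fixed, flipping SYNC_DATA moves the sampling point by about half a streamer cycle, which is what makes it useful for dialing in a board.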
In theory this is translated to the format used by roger's driver in exmem_start in the upper code, but I have long given up on trying to keep that properly in sync and now use roger's driver for writing only.
No worries. Yeah it can be useful. For your board you should be able to just do the addpins 1 thing for the clock pin and set up two 32MB banks instead of the one in the P2 Edge defaults. My video driver is written to allow a frame buffer to be sourced from either bank, although it may or may not be able to span the 32MB boundary; I can't recall offhand if it will wrap there or not. I know HyperRam had a similar issue there at an 8MB (internal) boundary, but in this case perhaps the PSRAM page boundary fragmentation will help...
That's a lot of fanout for the data bus (4 loads). It might be hard to get it to work at high speeds, but would be fun to try.
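The loading numbers quoted in this thread follow from simple arithmetic, sketched here in Python. It assumes 8 MB chips with a 4-bit data interface (so four chips make one 32 MB, 16-bit bank) and two clock pins shared evenly; those constants and the function name are assumptions for illustration, not taken from any driver.

```python
CHIP_MB = 8      # assumed capacity of one PSRAM chip
CHIP_BITS = 4    # assumed data width of one chip
BUS_BITS = 16    # 16-bit PSRAM bus
CLOCK_PINS = 2   # the "addpins 1" clock pair

def layout(total_mb):
    """Return bank count, chip count and per-pin loading for a given capacity."""
    chips_per_bank = BUS_BITS // CHIP_BITS
    bank_mb = chips_per_bank * CHIP_MB
    banks, remainder = divmod(total_mb, bank_mb)
    if remainder:
        raise ValueError("capacity must be a whole number of banks")
    chips = banks * chips_per_bank
    return {
        "banks": banks,
        "chips": chips,
        "chips_per_clock_pin": chips // CLOCK_PINS,
        # each bank hangs one chip on every data pin
        "loads_per_data_pin": banks,
    }
```

Under these assumptions `layout(64)` gives 2 loads per data pin and 4 chips per clock pin, and `layout(128)` gives 4 loads per data pin and 8 chips per clock pin, matching the figures mentioned in the posts.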
@Wuerfel_21 Any NeoGeo games that need more than 96MB (or don't work in slow mode)?
Nope, they top out at 96 MB and 8bit+slow is sufficient to prevent graphics/sound dropouts (IIRC 4bit+slow isn't, if you want to see what that looks like).
The one with the most crazy performance issues (also present on real hardware) is Metal Slug 2 (esp when you get to the 3rd level lol). Twinkle Star Sprites is the runner-up for heavy slowdown and frequently maxing out the sprite buffer (noticeable when the border graphics go missing). Though I think(?) some of the slowdown in that one is also coded in intentionally.
Also, try the USB driver test branch. Supports more controllers and multiplayer over a USB hub.
What's possibly happened there is, without sync, the Prop2's internal uneven skews are exposed and are better matching the external skews than the more even sync mode. In other words, maybe you've lucked in.
Played with the PSRAM settings a bit.
These appear to be the only settings that work with NeoYume:
Delay 14, clock false, data false
Delay 15, clock true, data true
Ah, just super tight then. You'll probably still get the odd glitch. Particularly as room temperature changes.
Those are large delay values, thinking about it. Signal won't be square at all.
Having only one/two good settings is normal, esp. for multi-bank. I'd guess the sync/sync one would be more stable.
Right, but when the multiple cases shrink to one (at higher frequencies) then that's super tight for sysclock/2. Only DDR at sysclock/1 should be that tight normally.
EDIT: I guess Rayman still has two cases though. At lower frequencies sysclock/2 should be up to eight usable cases I think.
Yeah only two working cases is tight, although thankfully it still apparently functions at high speeds. I normally keep the clock pin as sync only and only vary the delay and data pin IO between sync and async. On the P2Edge at slower P2 clock rates doing this yields 4 settings that work, and at faster speeds over 270MHz yields either 3 or 2 working cases per P2 operating frequency as it transitions through the bands. Running the psram_delay_test on your board would be useful to see this profile. Give it a try.
One thing that still concerns me is the dual clock PCB routing and dual clock output pins. As a separate test of this in particular, you could temporarily short the two clock signals together right near the P2 with a solder bridge/link on your PCB and then run a delay test, carefully ensuring only one clock output pin is driven (so no P2 pin gets overloaded), before removing the short. If some pin-to-pin skew is involved, this config could show a delay test improvement over your current setup with just a single clock driving the routed traces; if it doesn't, a different issue is causing this.
Another thing that might be reducing the performance is the extra stub at the expansion connector perhaps causing some reflections or adverse effect for PSRAM signals. The P2 Edge didn't bring out its PSRAM pins to the Edge connector for example. Also you do have 2 loads per data pin too with 64MB. I remember my own 64MB board wasn't quite as good as 32MB.
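The sweep described above (clock pin kept sync, only DELAY and the data pin mode varied) could be organized along these lines in Python. This is only a sketch of the bookkeeping: `sweep` and its `test` callback are hypothetical names, and `test` stands in for whatever real memory test is run on the board; it is not how psram_delay_test is implemented.

```python
def effective_delay(delay, sync_data):
    """Sync data pins count as roughly -0.5 on DELAY (rule of thumb from the thread)."""
    return delay - 0.5 if sync_data else float(delay)

def sweep(test, delays, clock_sync=True):
    """Try every (delay, data mode) pair and return the ones that pass.

    `test(delay, clock_sync, sync_data)` is a caller-supplied callback
    returning True when the memory test passes with that setting.
    Results come out ordered by effective delay, in half-cycle steps.
    """
    working = []
    for delay in delays:
        for sync_data in (True, False):  # True is the earlier half step
            if test(delay, clock_sync, sync_data):
                working.append((delay, clock_sync, sync_data))
    return working

# Stand-in for a real board: a made-up window of good sampling points.
def fake_board(delay, clock_sync, sync_data):
    return 13.5 <= effective_delay(delay, sync_data) <= 14.5

working = sweep(fake_board, range(10, 18))
```

The length of the returned list is the "number of working cases" being discussed; a list of one or two entries means the timing window is narrow at that clock frequency.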
psram_delay_test sounds like a good idea.
NeoYume seems like a pretty good memory tester to me, but maybe there needs to be something that does a thorough test...
It appears that psram_delay_test only goes up to 14; it gives an error when I try to change that to 16...