@Rayman said:
PSRAM_SYNC_DATA = false ... seems to help
Just did a test where I piggybacked a chip removed from bank 5 on top of the upper nibble chip of bank 4 (ran a jumper wire for chip select). Seems to work now.
That's a good sign for 4 chips in parallel...
If you're up for putting chips on both sides of the PCB, that'll spare a lot of route-length-related parasitics (capacitance, inductance and noise pickup), at the expense of a single via per signal for each pair of PSRAMs.
@Rayman said:
PSRAM_SYNC_DATA = false ... seems to help
Assuming that is switching off registered I/O for the data pins, it does three things:
Subtracts one sysclock period of latency from both outgoing and incoming data. Easily compensated with the "delay" tuning value.
Presumed to increase the spread of skew timings between the data signals inside the Prop2. This could, by fluke, counter external skew, but is much more likely to be detrimental.
Adds maybe 1.0 ns of phase-shift to transmit data wrt serial clock pin. Presumed to subtract a similar amount for receive data. Can be relied on for setup and hold timings, but skew can easily swamp it.
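To put those numbers in perspective, here is a minimal sketch of the arithmetic: how long one sysclock period is at a given frequency, and what fraction of it a 1.0 ns shift represents. The function names are made up for illustration, and the 1.0 ns figure is only the rough estimate quoted above, not a measured value.

```python
def sysclock_period_ns(freq_hz: float) -> float:
    """Length of one sysclock period in nanoseconds."""
    return 1e9 / freq_hz


def shift_as_fraction_of_period(shift_ns: float, freq_hz: float) -> float:
    """Express a fixed phase shift as a fraction of one sysclock period."""
    return shift_ns / sysclock_period_ns(freq_hz)


if __name__ == "__main__":
    # 1.0 ns is the estimate from the post above (an assumption, not a measurement).
    for mhz in (200, 250, 300, 350):
        freq = mhz * 1e6
        period = sysclock_period_ns(freq)
        frac = shift_as_fraction_of_period(1.0, freq)
        print(f"{mhz} MHz: period {period:.2f} ns, 1.0 ns shift = {frac:.0%} of a period")
```

The whole-period latency change from switching registered I/O off is what the "delay" tuning value absorbs; the fixed shift grows as a fraction of the period as sysclock goes up.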
@Rayman said:
With sync on the errors start out as a few bad pixels but otherwise ok. Then goes nuts after a while…
Given that more chips and higher frequency and warmer temperature all make it worse in a similar way ... We're probably looking at attenuation as being the biggest enemy here. The pin drives just aren't strong enough to go as fast as we're demanding. Square waves are long gone.
Reducing track lengths and chip count is the solution. We're stuck with the fixed track lengths of the Eval Board, so only so much can be done there. We might be limited to 64 MB unless larger capacity RAM chips can be sourced.
@Rayman said:
@Yanomani if you mean repopulate upper half to check sync = false, I’ll probably do that.
I was thinking about if/when you decide to make a new layout, after testing those two chips, one piggybacked on top of the other.
At least to me, it looked like an enticing proof of concept of how much better it could be if all the tracks were shortened to a minimum length and most of the interconnecting vias were eliminated.
If/when /CE is permanently tied to ground, the whole command/address and data sequence gets messed up, and the PSRAM chip will soon lose track of what is being asked and how it's expected to respond, if it does anything really meaningful at all.
The proper falling/rising of /CE at the right time, in close coordination with the expected CLK pin transitions, is of prime importance!
P.S. If at least the first command is properly passed to the PSRAM and correctly "understood"/processed, it'll stay on it forever, tracking only the CLK pin, so as to advance to the next stage of that single first command it properly received and acknowledged.
@Rayman said:
With 5 chips in parallel (one of them piggy backed), the three with longest paths work, but the two closest to the header don't.
Need to remove piggy back and test the close ones again. Guess I assumed they would work, but need to verify...
Perhaps the extra length of the longer tracks is doing the "job" of a low-value series resistor, attenuating part of any excessive ringing and other detrimental waveform-deforming events.
Another possibility is that you're lucky, and within a real mess of many signal reflections, one of the delays you've chosen just happens to find an almost regular "eye aperture" where the right data pattern can be safely written and read back afterwards.
That should not survive some temperature fluctuations, though...
Ok, back to four chips in parallel (none piggy backed).
PSRAM_SYNC_DATA = true works for first three chips, but not the last.
PSRAM_SYNC_DATA = false works for the last three chips, but not the first...
It's 3 o'clock in the morning; can't sleep when I can't stop thinking about it...
It's like trying to assemble/repair a helicopter, blindfolded, inside a dark room, based solely on memories of a huge pile of blueprints and manuals.
Too much information, and not a single scope screenshot to guide the musings; I hope you agree that's a tough way of thinking about signal integrity, but, anyway...
If you have a PCB with four PSRAM chips on it, with their data buses connected in parallel (even worse, interconnected to another PCB) and with independent chip selects and clocks, and you observe that the first three work under certain conditions while the fourth one is kept deselected (aka muted), and that, in another set of tests, the last three work while the first one is kept deselected, then my best call is to "hear" what the "muted" chips themselves have to "say" about what "they" are experiencing.
For that to be possible, one of your drivers needs to be modified to put the ruled-out "observer" (first or fourth chip) into write mode, then perform each part of the test individually, remembering to terminate the observer's write in an orderly manner, so as to preserve whatever it was able to "grab" from the bus while each of the other three was being written or read, as applicable.
Then, in an extra operation, you'll need to dump the contents of the observer chip, check whether it recorded the control/address sequence of the last test (including the wait interval), then check the rest of its contents to identify what it "heard" of what was going on on the bus it was "snooping".
In essence, you'll need to turn the non-performing device on the bus (first or fourth) into a "lean, mean" logic analyzer.
It'll essentially just be recording the events it witnesses, so this is a completely harmless operation.
Also, if each pass lasts less than, say, 64 ms, refresh of the sampled contents will not be a concern at all, but proper Command Termination (APMemory device datasheet, item 8.6, pg 10) is a must, or this spy-movie-style "environmental listening" will not work.
For better results, random data should be replaced by DC-unbalanced patterns: long sequences of "zeros" interspersed with just a few "ones", starting from a single differing bit and then increasing their count as the test progresses. The inverse test (mostly "ones" with a few "zeros") needs to be performed in the same way.
Since you have four paralleled data lanes, perhaps the DC-unbalanced test could use the same pattern for all of them.
Also, introducing small checkerboard patterns in the middle of long DC-unbalanced ones could prove useful.
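A minimal sketch of how such test buffers could be generated on the host side, in Python purely for illustration (the real test would live in the Spin2/PASM tester). The function names, the even spacing of the flipped bits and the 0x55/0xAA checkerboard bytes are all assumptions, not something taken from the existing test program.

```python
from typing import Iterator


def sparse_pattern(length: int, flipped_bits: int, invert: bool = False) -> bytes:
    """Return `length` zero bytes with `flipped_bits` evenly spaced one-bits.

    With invert=True the result is mostly ones with a few zero-bits instead.
    """
    total_bits = length * 8
    if not 0 <= flipped_bits <= total_bits:
        raise ValueError("flipped_bits out of range")
    buf = bytearray(length)
    for i in range(flipped_bits):
        # Centre each flipped bit in its own equal-sized slice of the buffer.
        pos = (2 * i + 1) * total_bits // (2 * flipped_bits)
        buf[pos // 8] |= 0x80 >> (pos % 8)
    if invert:
        buf = bytearray(b ^ 0xFF for b in buf)
    return bytes(buf)


def with_checkerboard(pattern: bytes, offset: int, run: int) -> bytes:
    """Overwrite `run` bytes at `offset` with alternating 0x55 / 0xAA."""
    out = bytearray(pattern)
    for i in range(run):
        out[offset + i] = 0x55 if i % 2 == 0 else 0xAA
    return bytes(out)


def test_sequence(length: int, max_flipped: int) -> Iterator[bytes]:
    """Yield patterns with 1..max_flipped differing bits, each followed by its inverse."""
    for n in range(1, max_flipped + 1):
        yield sparse_pattern(length, n)
        yield sparse_pattern(length, n, invert=True)


if __name__ == "__main__":
    for pat in test_sequence(16, 2):
        print(pat.hex())
    print(with_checkerboard(sparse_pattern(16, 1), 4, 4).hex())
```

Each buffer would be written and read back in place of the random blocks, so any failures can be tied to a specific bit position rather than to noise in random data.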
As I said, it's kind of a blind-date, social experiment situation. Only the results will show the truth.
Lots of interesting ideas to see if chips can sense the bus data correctly, but who will do all these things....?
I'm currently looking into the 8bit variant driver coding... it's currently tight with 4 LUT registers free and 3 COGRAM registers free. Am seeing if/how the 16 bit variant could be modified. The RMW stuff adds a lot of complexity, although we don't have to select upper/lower 8 bit paths of 16 bit bus anymore (which we did for 16 bits) so that could free some space with any luck. The streamer combinations are all 8 bit now.
It just occurred to me that one can rely on the Streamers to generate both DC-balanced patterns (DVI normal mode) and partially DC-unbalanced ones (DVI literal mode), but the second option will not be perfect, since the running disparity will be automatically encoded and applied.
P2 sysclock will also need to be lowered, below 180 MHz, in order to produce meaningful PSRAM operations, but all eight bits are useful, especially the two clock lanes (just like a sound track recorded alongside the frames on movie film); it'll just need the lanes swapped periodically (the same control that enables the use of DVI signals running on either side of the PCB).
P.S. On second thought, DC-balanced tests need to be performed below 180 MHz, but partially DC-unbalanced ones can enjoy the full limit of PSRAM trials.
It'll only be necessary to ensure there are many consecutive "zeros" or "ones" in each pattern.
@rogloh said:
I'm currently looking into the 8bit variant driver coding... it's currently tight with 4 LUT registers free and 3 COGRAM registers free.
Ok so I'm more confident this would fit now. My current approach has likely freed sufficient space. I'm currently re-purposing the 16 bit driver code, by keeping the individual writes all done as 32 bits as required by that driver, instead of reducing it down to 16 bits and making the 32 bit stuff take a different/new code path. It keeps most of the already debugged and working write code intact, by mainly changing the number of clocks and streamed length calculations (now in bytes instead of words). This is a slight cheat and is not 100% optimal in terms of P2 cycles needed for individual word or byte writes (apart from write bursts which remain optimized), but should help get it working much faster by leveraging the previous working code. Further optimizations can come later.
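As a rough illustration of why the clock counts and streamed lengths change when the same 32 bit write path is reused on a narrower bus, here is a small sketch. It assumes each PSRAM clock moves exactly one bus-width of data during the data phase; that is an assumption made here for the arithmetic, not something taken from the driver source, and command/address clocks are left out entirely.

```python
def data_clocks(num_bytes: int, bus_width_bits: int) -> int:
    """Data-phase clocks needed to move num_bytes over a bus of the given width.

    Assumes one bus-width transfer per clock (hypothetical simplification).
    """
    if bus_width_bits not in (4, 8, 16):
        raise ValueError("expected a 4, 8 or 16 bit bus")
    bits = num_bytes * 8
    return -(-bits // bus_width_bits)  # ceiling division


if __name__ == "__main__":
    for width in (16, 8, 4):
        print(f"{width:2d} bit bus: a 32 bit write needs {data_clocks(4, width)} data clocks")
```

On these assumptions a 32 bit write takes twice as many data clocks on the 8 bit bus as on the 16 bit one, which fits with the length calculations moving from words to bytes.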
@rogloh said:
Exactly, what indicates that it "works"? Running NeoYume or my test program?
Comparing runs of the delay tester will tell us heaps. Each new hardware arrangement will produce different report patterns. And final check can be at 70 °C to see a degraded case.
PS: We already have some patterns of what does and does not work with NeoYume. As more patterns are collected we can narrow down what is borderline ... Most likely it's as simple as needing a solid 100% column right past 350 MHz sysclock.
Ok @Wuerfel_21 , here's a very early 8 bit PSRAM driver I've just got running. I've only tested the basic read and write block transfers using my memory test, so there could still be bugs I need to fix in other areas not yet tested, but given the memory test is passing I think hopefully at least the burst transfers work and you'll be able to use it to put some game ROM data into PSRAM.
Here's the driver code....including the 8 bit delay tester that uses it. I sped it up a bit with some inline assembly for the data compare process. This is still subject to change if I find bugs to fix etc.
Here's a run with it on my 64MB board that has two PSRAM chips in parallel per P2 IO pin group of 4...seems clean to 350MHz. Maybe you'll be able to get 48MB accessible on Rayman's board in 8 bit mode with 3 device loads populated or 64MB with 4? If it ever shows up I'll give it a go here.
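The capacity figures above can be reproduced with a little arithmetic. This sketch assumes 8 MB per PSRAM chip and one chip per 4-pin group per "device load"; both are inferred from the numbers in this thread rather than stated outright, and the function name is made up.

```python
CHIP_MB = 8  # assumed capacity of one PSRAM chip, in MB


def capacity_mb(bus_width_bits: int, loads_per_pin_group: int, chip_mb: int = CHIP_MB) -> int:
    """Total capacity when every 4-pin group carries `loads_per_pin_group` chips."""
    if bus_width_bits % 4:
        raise ValueError("bus width must be a multiple of 4 bits")
    pin_groups = bus_width_bits // 4
    return pin_groups * loads_per_pin_group * chip_mb


if __name__ == "__main__":
    print(capacity_mb(16, 2))  # 16 bit bus, two chips per pin group
    print(capacity_mb(8, 3))   # 8 bit bus, three device loads
    print(capacity_mb(8, 4))   # 8 bit bus, four device loads
```

Under those assumptions the three calls give 64, 48 and 64 MB, matching the 64MB board and the 48MB/64MB options mentioned for 8 bit mode.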
@rogloh said:
Ok @Wuerfel_21 , here's a very early 8 bit PSRAM driver I've just got running. I've only tested the basic read and write block transfers using my memory test, so there could still be bugs I need to fix in other areas not yet tested, but given the memory test is passing I think hopefully at least the burst transfers work and you'll be able to use it to put some game ROM data into PSRAM.
Here's the driver code....including the 8 bit delay tester that uses it. I sped it up a bit with some inline assembly for the data compare process. This is still subject to change if I find bugs to fix etc.
I think there's an init issue. It only works after running in 4bit mode on the top bus half once (when I added it to NeoYume. Also strange sound problems, but I'd believe that's a me bug).
Ok. Is there still a problem with the device init @Wuerfel_21 , or have you resolved it somewhere in your own code? I can take a look tomorrow if you think there is an issue there in my code.
EDIT: just found a problem. Due to porting from single bank 16 bit there was no support for more than one bank. Will fix it in a minute and repost here...
UPDATE: fixed I think. I've updated my 8, 16, and 4 bit versions to properly init with multiple PSRAM banks now.
@Rayman You would run psram8_delay_test.binary after building it with flexspin and running with loadp2, then you can enter the start pin of the data bus for your board fitted to P2-EVAL and use the pin numbers for CE and CLK to test the bank. You can also enter additional pins to drive high during the test so you can drive your floating CE pins.
Also you can add 64 to the base CLK pin number to get it to drive 2 adjacent pins if the CLK needs to be paired in a pin group as you have on your board.
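A small sketch of the "add 64" convention, just to make the numbers concrete. Only the rule itself (base CLK pin plus 64 means drive two adjacent pins) comes from the post above; how the tester actually decodes the value internally is an assumption here, and the helper names are hypothetical.

```python
CLK_PAIR_FLAG = 64  # adding this to the CLK pin number requests two adjacent pins


def encode_clk_entry(base_pin: int, paired: bool) -> int:
    """Value to type at the clock pin prompt."""
    if not 0 <= base_pin < 64:
        raise ValueError("P2 pin numbers are 0..63")
    return base_pin + (CLK_PAIR_FLAG if paired else 0)


def decode_clk_entry(value: int) -> tuple[int, list[int]]:
    """Return (base pin, list of pins that would be driven as CLK)."""
    base = value % CLK_PAIR_FLAG
    pins = [base, base + 1] if value >= CLK_PAIR_FLAG else [base]
    return base, pins


if __name__ == "__main__":
    entry = encode_clk_entry(18, paired=True)
    print(entry, decode_clk_entry(entry))
```

So for a CLK on pins 18 and 19 you would enter 82 at the clock prompt; for a single CLK pin you enter the plain pin number.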
Here's an example for my board, with no pin group for the CLK and no additional CE pins (mine have pullup resistors)
RLs-MacBook-Pro:p2memorydrv_09b-2 roger$ flexspin -2 psram8_delay_test.spin2
Propeller Spin/PASM Compiler 'FlexSpin' (c) 2011-2021 Total Spectrum Software Inc.
Version 5.9.3-beta-v5.9.2-33-g35412c83 Compiled on: Sep 26 2021
psram8_delay_test.spin2
|-psram8-generic.spin2
|-|-psram8drv.spin2
|-SmartSerial.spin2
|-ers_fmt.spin2
psram8_delay_test.p2asm
Done.
Program size is 87944 bytes
RLs-MacBook-Pro:p2memorydrv_09b-2 roger$ loadp2 -t psram8_delay_test.binary
( Entering terminal mode. Press Ctrl-] or Ctrl-Z to exit. )
PSRAM 8 bit memory read delay test over frequency, ESC exits
Enter the base pin number for your PSRAM (0,8,16...48) [40]: 0
Enter the chip enable pin number for your PSRAM [57]: 16
Enter the clock pin number for your PSRAM [56]: 18
Enter an additional CE/CLK P2 pin to drive high (0-55), or a higher value to exit [56]: 56
Enter a starting frequency to test in MHz (100-350) : [100] 300
Enter the ending frequency to test in MHz (300-350) : [300] 310
Enter 1 to use the automatic delay value only, or 0 to test over the delay range : [0] 0
Enter 1 to display the first error encountered, or 0 to not display error details : [0] 0
Testing P2 from 300000000 - 310000000 Hz
Successful data reads from 100 block transfers of 8192 random bytes
Frequency  Delay    3    4    5    6    7    8    9   10   11   12   13   14
300000000  (11)    0%   0%   0%   0%   0%   0%   0%  99% 100% 100%   0%   0%
301000000  (11)    0%   0%   0%   0%   0%   0%   0%  96% 100% 100%   0%   0%
302000000  (11)    0%   0%   0%   0%   0%   0%   0%  96% 100% 100%   0%   0%
303000000  (11)    0%   0%   0%   0%   0%   0%   0%  91% 100% 100%   0%   0%
304000000  (11)    0%   0%   0%   0%   0%   0%   0%  97% 100% 100%   0%   0%
305000000  (11)    0%   0%   0%   0%   0%   0%   0%  90% 100% 100%   0%   0%
306000000  (11)    0%   0%   0%   0%   0%   0%   0%  81% 100% 100%   1%   0%
307000000  (11)    0%   0%   0%   0%   0%   0%   0%  82% 100% 100%   0%   0%
308000000  (11)    0%   0%   0%   0%   0%   0%   0%  66% 100% 100%   6%   0%
309000000  (11)    0%   0%   0%   0%   0%   0%   0%  58% 100% 100%  14%   0%
310000000  (11)    0%   0%   0%   0%   0%   0%   0%  50% 100% 100%  46%   0%
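When comparing runs between hardware arrangements, it can help to pull out the delay values that hold 100% across every tested frequency, i.e. the "solid column" mentioned earlier in the thread. A minimal sketch, assuming the report is captured as plain text in the layout shown above; the function name is made up and only three of the rows are reproduced as sample input.

```python
REPORT = """\
Frequency Delay 3 4 5 6 7 8 9 10 11 12 13 14
300000000 (11) 0% 0% 0% 0% 0% 0% 0% 99% 100% 100% 0% 0%
305000000 (11) 0% 0% 0% 0% 0% 0% 0% 90% 100% 100% 0% 0%
310000000 (11) 0% 0% 0% 0% 0% 0% 0% 50% 100% 100% 46% 0%
"""


def solid_delays(report: str) -> list[int]:
    """Return the delay values that scored 100% at every frequency in the report."""
    lines = [ln.split() for ln in report.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    delays = [int(d) for d in header[2:]]  # skip the "Frequency" and "Delay" labels
    solid = set(delays)
    for row in rows:
        rates = [int(cell.rstrip("%")) for cell in row[2:]]  # skip frequency and "(11)"
        solid &= {d for d, r in zip(delays, rates) if r == 100}
    return sorted(solid)


if __name__ == "__main__":
    print(solid_delays(REPORT))  # → [11, 12]
```

Because it splits on whitespace, the same parser works whether or not the columns are space-aligned in the captured log.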
Comments
@evanh ok maybe should turn back on and try adjusting delay.
I think 64MB with 16bits is probably a sweet spot. Needs 3 header groups though.
Tried PSRAM_SYNC_DATA=true with +/-1 change on PSRAM_DELAY, no luck.
So this is strange...
Left in the piggy backed chip but with chip enable grounded.
Started adding back in upper nibble chips...
Worked with at least one upper nibble chip (maybe two, should have made notes).
Removed piggy backed chip and now only works with one of the upper nibble chips.
So, looks like 4 chips in parallel with below settings works:
Were all those tests conducted over a long range of P2 Sysclk frequencies, or have you been trying just some specific ones?
Imagine a hungry cannibal, with his hands and feet well tied around a huge tree.
It's my personal version of the "State of the Onion"; no mistyping here...
P.S. We can have a fantastic barbecue based solely on moistened cowboy boots, salt, and charcoal.
It's just a matter of imagination.
P.S.II
https://youtube.com/watch?v=92kcJeOcOTM
Sorry, should have said…
Running NeoYume with Crossed Swords game with no visible errors has been my definition of “Works”.
Things may otherwise work at low freq, but that’s the best test I have.
Guess I should also figure out how to run @evanh ‘s test program…
As said, Crossed Swords only uses the first chip; use a different game.
That looks like an easy win with two columns at 100%. 4 banks of 16 MB (8-bit) might be possible.
Yep, there was an issue with the bankswitching logic. So multibank at 16bit also wouldn't have worked.
@rogloh this sounds great!
How can I test with 96 MB board?
It’s now populated as 48 MB, but want to try piggy backing again if that shows hope…