Just a small doubt of mine: can your driver be tweaked to execute the write ops to the PSRAMs at lower frequencies (in the sense of the number of P2 sysclks spent at each high/low part of the PSRAM clock period), and then execute the read ops at the "normal" higher frequency, as it does right now?
Sure, the initial command/address phase would also need to be tweaked, just to ensure the RAM chips get exactly what they are expected to receive from your driver.
Comments
Perhaps, although writes typically don't have a problem because the clock and data phases are shifted by a full P2 clock.
EDIT: probably the simplest thing is to change the SPIN2 testing code to do the data writing at, say, 200MHz where it typically always works, and then change the frequency to do the actual reading and comparison. It will take a bit longer to run the test due to the additional clock switching, but it could help rule out write corruption effects. It looks like Wuerfel_21 already cut back on the delays tested, which helps speed things up.
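Roughly the shape I have in mind, as a sketch only (the psram object and its start/read/write method names are stand-ins rather than the real test code, and the two clkset() mode values are assumed encodings for the Eval board's 20MHz crystal; they'd need checking before use):

CON
  _clkfreq   = 200_000_000
  WRITE_HZ   = 200_000_000                        ' writes done here, where they always pass
  READ_HZ    = 340_000_000                        ' frequency actually under test
  WRITE_MODE = %1_000000_0000001001_1111_10_11    ' 20MHz x10 = 200MHz (assumed encoding)
  READ_MODE  = %1_000000_0000010000_1111_10_11    ' 20MHz x17 = 340MHz (assumed encoding)
  BYTES      = 4096

OBJ
  ram : "psram_test"                              ' stand-in for the real driver/test object

VAR
  byte txbuf[BYTES], rxbuf[BYTES]

PUB go() | i, errors
  ram.start()
  repeat i from 0 to BYTES-1                      ' build a test pattern in hub RAM
    txbuf[i] := i ^ (i >> 3)
  clkset(WRITE_MODE, WRITE_HZ)                    ' slow, known-good clock for the write pass
  ram.write(0, @txbuf, BYTES)
  clkset(READ_MODE, READ_HZ)                      ' speed under test for the read pass
  ram.read(@rxbuf, 0, BYTES)
  errors := 0
  repeat i from 0 to BYTES-1                      ' compare what came back against the pattern
    if rxbuf[i] <> txbuf[i]
      errors++
  debug(udec(errors))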
Thanks for your prompt answer! Much appreciated.
Another question, out of curiosity: how is your heart (and brain)? Any chance of a sudden heart attack (or stroke)?
Because the next question will "reveal" my darkest intentions:
Can your driver be tweaked to set/reset at least TWO of the PSRAMs' chip-enable lines at the same time, in order to read from/write to more than one chip at the same time?
Hung up on the bit errors being only on SIO0/SIO1...
SIO2 has better routing on account of running right by the pins, but SIO3 is the same as the other two really.
No, I just cut out the boring result columns afterwards.
Yeah, the normal PSRAM driver already has the ability to drive pin-group-assigned CE pins at the same time if they are consecutively allocated. We also do read from more than one chip on the P2-EVAL.
If you are asking about reading and writing in the same request to/from two different chips by sharing the bus, then no, it doesn't do that. Address setup would be a problem. Copies to/from external RAMs go through the HUB. This also allows copies to go to different buses (in case multiple disparate memory boards are fitted). So far I haven't encountered a huge requirement for intra-bus copying.
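For what it's worth, that kind of hub-staged copy is easy enough to do above the driver anyway; a minimal sketch, where the ram object name and its read/write methods are placeholders rather than the driver's actual API:

CON
  BUFSIZE = 1024                       ' hub staging buffer size

OBJ
  ram : "psram_driver"                 ' placeholder object name

VAR
  byte stage[BUFSIZE]

PUB copyext(dstaddr, srcaddr, count) | chunk
  ' copy count bytes between two external-RAM regions, staged through hub RAM;
  ' dst and src could even sit on different buses/boards
  repeat while count
    chunk := count <# BUFSIZE
    ram.read(@stage, srcaddr, chunk)   ' external -> hub
    ram.write(dstaddr, @stage, chunk)  ' hub -> external
    srcaddr += chunk
    dstaddr += chunk
    count -= chunk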
Just to be clear: I'm not chasing any occasional (and faint) write corruption; I'm just trying to be sure that none of any two (or, eventually, three) chips will get wrong (and differing) command/address/data sequences into them.
Because, in fact, I'm pursuing reads from more than one RAM chip in parallel, at the same time, in order to get ~25 ohm of drive capability (any two RAM chips in parallel), or even ~17 ohm (any three of them), on the same nibble...
At least this time, intra-bus copying is just not what I'm aiming at on my "radar"...
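(For scale, assuming the roughly 50 ohm per-chip output impedance those figures imply: R = 50/n ohm, so two chips in parallel give 25 ohm and three give about 16.7 ohm on each shared SIO line.)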
Ok, so try a write with more than one chip receiving the data and then read back with both chips outputting together to drive in parallel... might be risky if they get bad data or drive at different times and clash, but it's doable... hopefully it won't fry the drivers. Is this a safe test?
The test code already allows CE to be paralleled via inputting a group. It accepts pin numbers up to 255, which can support 3 additional sequential CE pins in the group... I'm reluctant to recommend doing that in case something fries it on readback. Is this a safe thing to do, Yanomani?
last := cepin
repeat
  send("Enter the chip enable pin number for your PSRAM [", f.dec(last), "]: ")
  cepin := getdec(last)
until cepin +<= 255                    ' upper limit raised from 63 to allow a CE pin group
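The value is just the usual Spin2 pin-field encoding (extra pin count in the upper bits), so, with made-up pin numbers, paralleling two consecutive CE pins would look like:

cepin := 16 addpins 1                  ' (1 << 6) + 16 = 80, i.e. CE on P16..P17

That is presumably how values above 63 at that prompt get interpreted.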
Personally, I don't expect any conflicting transitions to last more than, say, 500 ps (for any TWO paralleled RAM chips); they should never last anywhere near 1 ns, for sure.
Any risky situation will show itself very soon, through higher-than-usual heating of the devices involved; the P2 will never be at risk, unless a RAM chip burns so badly (totally unexpected) that it fuses/melts its own output CMOS structures at the pad ring.
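(Rough numbers, assuming 3.3V I/O and on the order of 50 ohm output impedance per driver, neither figure taken from this thread: a momentary clash pushes about 3.3V / (50 + 50) ohm, i.e. roughly 33mA, through the two fighting outputs, and for well under a nanosecond per edge that is a negligible amount of energy.)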
Well you and Wuerfel_21 can decide if you want to go down that path. I'm off to sleep now anyway....good night!
Ah, a note on facing some risks: I just discovered that this fantastic man was born in Brazil! And he lasted 89 full and productive years...
I've been kind of a fan of his since the first time I read about his experiments, many, many decades ago.
https://en.wikipedia.org/wiki/John_Stapp
Hmm, is there anything that'd stop the data bus from going like this? Not sure what'd be ideal for clock. I think it might be beneficial to run it along the data bus, but source it on the opposite end. So the banks further from the data pins are closer to the clock pin in equal proportion.
But a 3-header corner setup with 16 bit bus and 3 banks would also give 96MB with a lot less headache, I think.
[doh, oops, me typing before reading] ... Generally, the add-on boards are not ideal for top speed. The delay tests demonstrate notable improvement using the EC32MB's onboard RAM vs anything plugged into the Eval Boards.
PS: Sysclock/4 will work.
Roger,
I think the alternate "delay" values are data pins registered vs unregistered, right? And I'm guessing doing that with the clock pin, instead, failed to provide the desired timing spread?
I raise the question because it'd be ideal to use registered data pins all the time if possible, adjusting the clock pin instead. I would expect this to remove most of the I/O skew times in that OnSemi spreadsheet. I believe those numbers must have been only for unregistered I/O; the values would be far too large otherwise. But, most importantly, registering will even them up.
Yes. Although that was originally found with HyperRAM. Maybe the PSRAM drive gives different results? When we get the board we could try to vary the clock. I had some of that stuff experimentally optional in HyperRAM code. There might be some extra complexity to get the clock phase right at the start and centered for writes this way.
Yes, transmit data could be tricky to get the clock phase aligned. I know you were using "transition" smartpin mode with sysclock/2 so the clock smartpin didn't have to be resync'd each time it was used.
That's where an unregistered clock pin can be an advantage. It appears slightly phase-lagged, albeit one sysclock earlier. Therefore that lag can be used to meet the data setup time requirement. I.e., Tx data transitions about 1.0 ns ahead of the rising clock edge instead of the expected falling clock edge... taking a page out of the HyperRAM DDR method. Because the timing is tight, it's important that the data pins are registered to keep the skew spread minimal.
The thing is, the errors happen earlier over a wide range. This problem on the 96MB board looks like it is load related; no delay setting improves it. It just maxes out earlier on some chips. It could also be performance variation between devices. We've been lucky in the original batch of devices hitting a very high overclock, e.g. to maybe 170MHz, which is over their rated 133MHz (266MHz P2). Perhaps some of the chips Rayman received were not quite as good. If they could be tested in isolation we could tell, but attached to 5 other chips and on a new board layout it's hard to know.
It will be interesting to see if we obtain similar results to Wuerfel_21 with respect to which chips on the board perform better than others. If this matches a similar result for us, it might be a layout type of issue. If it's just totally random again, it could be load related and variation among the devices' output drivers.
Pity they don't sell 8-pin DIP versions of the PSRAM; then we could socket a board and plug-and-play the devices to test, and potentially do our own overclock binning.
In fact, there are some test sockets available for those 50mil-pitch, 8-pin SOP devices, though they're not cheap (at least they don't appear to be anywhere near Yamaichi prices...):
Model #: OTS-8(16)-1.27-03
(https://www.waveshare.com/ots-16-1.27-03-8.htm)
(https://www.test-socket.com/)
As for technical data availability (precise design info, land pattern, capacitive load per pin, ...), both sellers are scarce on it.
This make/model seems to be produced by:
(https://www.enplas.co.jp/english/)
I just registered at their website in order to be able to download the full catalog; waiting for email confirmation.
$13 for a single SOP8 socket isn't too bad. You could at least then make a single chip tester board and overclock each chip individually to see where it maxes out before you hand solder to a final board. Of course this is not going to be used for any real production purposes unless you could automate the entire testing process somehow with your chip supply before board production. But if you need a specific board to run at 340MHz maybe it would help.
Mind the 24MB board. Same batch of chips, gets cranked to 170MHz no problem.
Less load on that one yeah. So getting 96MB as 3x32MB in the 16 bit wide PSRAM arrangement is still a reasonable possibility, or at least 64MB with any luck with just two paralleled banks. I just need to obtain another chip (or probably just desolder one off VonSzarvas' original test boards) and fit my own board with 64MB to see how it fares on P2 EVAL.
Sysclock/4 will be fine.
Doubles the latency and halves the bandwidth. For a 4 bit system, it's already 1/4 of the P2-Edge so it's not ideal. Not sure if it would be fast enough for Wuerfel_21's emulator.
Maybe a dual sysclk/4 clock with alternating phase would be the way to go with the streamer running at sysclk/2, but this really complicates the access width and the address setup phase, and with the RMW needed it's starting to become a nightmare to figure all that out; I think my 16-bit driver implementation is already getting pretty packed out dealing with all that stuff. From a native word size point of view it could make a 2-phase x8 bus setup look like a 1x16 arrangement, which is natively 32 bits (at half the performance). The 16-bit arrangement is still king, though, I think, due to its reduction of parallel loads.
If you make a PSRAM board like I did for P2-EVAL we could consider a dual clock setup for achieving 4x16 bit banks perhaps, as we do have a total of 8 pins accessible on the third header group or even more if we steal another header group. This could keep the clock rate down for four load devices. But it's nice to keep the flexibility to be able to run with single clock too (using a pin group perhaps if the clocks are split over different devices) and not force ourselves down the dual phase clock path completely.
It won't double the real latency as most of that is software overhead.
Well for my driver yes, but for Wuerfel_21's emulator it could have more of an impact because the latency is tighter there. For my graphics stuff, the halving of bandwidth would be the main issue, not so much the latency. Not too many video modes would be able to make use of 4 bit PSRAM at sysclk/4 for a frame buffer. That's only sysclk/8 bytes/second with a 4 bit bus, so a 250MHz P2 could stream at up to ~31MB/s, which is really only enough for VGA @8bpp, with little left over for write bandwidth after the overheads etc.
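Spelling out that arithmetic (the VGA figure assumes the usual 640x480 at 60Hz):

CON
  SYSCLK       = 250_000_000
  PSRAM_BPS    = SYSCLK / 8             ' 4-bit bus at sysclk/4 = 1 byte per 8 sysclks, ~31 MB/s
  VGA_8BPP_BPS = 640 * 480 * 60         ' ~18.4 MB/s just to scan out an 8bpp VGA frame buffer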
Below 300 MHz is reliable at sysclock/2, right? PS: I'm just throwing out the idea so as to use Rayman's add-on at the needed ~340 MHz.
Ye. I tried NeoYume with clock multiplier lowered to 10, worked fine (aside from being 50% too slow).
Yeah it was okay at lower rates at least on one chip that was failing at higher speeds. Probably worth a shot anyway to see what happens. Wuerfel_21 will need to change the read rate of the emulator and see if it keeps up with half the speed.
Did your board arrive yet? Mine is still AWOL. Actually it was in Japan today, took 5 days to get there from NYC. Hopefully will arrive sometime next week.
Ok so it didn't work then.
No sign of progress since leaving Japan six days ago.