@evanh said:
Crossing page boundaries is crappy with these chips. I'm getting errors as low as 70 MHz SPI clock when doing larger than 1024 byte bursts. Smaller length page crossings can actually achieve 133 MHz, just. Whereas staying within a page works right up to Prop2's limits, ~200 MHz SPI clock when chilled. Definitely a good rule to split them up.
Of course. That is why my driver always follows this rule and splits it up. Break the rules and it's anyone's guess as to the result. PSRAM seems solid as anything when you keep within the 1kB page size (or 2kB for 8bit, 4kB for 16bit).
Aargh! Somehow I've made the timing different between Flexspin and Pnut. I was comparing them too, must have been a late change. Pnut is acting like it has one less instruction in the critical section ... EDIT: Bah! It's Flexspin trying to optimise by using an immediate operand when it determines there is a constant value. Which can be either # or ## depending on how large that constant is!
So my workaround for that issue didn't work either. I guess I hadn't checked it carefully enough. Time for a follow-up Flexspin bug report ...
EDIT: This is triggering a rethink. I might try introducing a block of parameters in hubRAM that can be fast copied into register space ...
EDIT2: Ha! Not bad at all. I can happily tack it on the end of the inline code. eg:
PUB tx_cmd( cmd )
  org
        setxfrq xfrq                    ' set sysclock/1 for lead in timing
        rolnib  cmd, cmd, #1            ' big-endian nibble swap
        xinit   leadin, #0              ' lead-in timing, at sysclock/1
        setq    nco                     ' streamer transfer rate
        xcont   ca8, cmd                ' tx Command only
        waitx   #2                      ' hacked in for quick test
        drvl    datp                    ' active for tx CA phase
        drvl    #PSRAM_CE_PIN
        dirh    #PSRAM_CLK_PIN          ' start smartpin internally cycling at SPI clock rate
        wypin   #2, #PSRAM_CLK_PIN      ' 2 SPI clocks for Command only
        waitx   #2 * CLK_DIV - 4
        dirl    #PSRAM_CLK_PIN          ' reset smartpin
        dirl    datp                    ' tristate the databus upon completion
  _ret_ drvh    #PSRAM_CE_PIN
  xfrq          long    $8000_0000
  leadin        long    M_LEADIN
  nco           long    M_NCO
  ca8           long    M_CA8
  datp          long    PSRAM_DATA_PIN
  end
Nice! It's solved the new problem completely. And behaves the same in Pnut now again.
EDIT: Err, not completely. There is still a variant where when the function parameters have constants passed to them, it can have the same issue. Problem here is said function parameters need to be able to modify that inline data block. I don't know if that's possible. Certainly no direct symbol access in Flexspin.
This is rearing its head mostly now because I'm trying to move to doing the 16-bit wide bus. And what's changed is the defined constant for the 16 data pins. It now has an ADDPINS 15 so it exceeds the 9-bit limit of a single # immediate. This has had catastrophic repercussions on timing when compiling with Flexspin.
And there is another issue, actually even worse, where a local register variable will be used if the passed in constant is non-zero. But if it's zero then the optimiser will generate an immediate operand with ## instead. This one occurs when selecting individual chips in SPI interface mode.
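For anyone following along, the 9-bit limit mentioned above is easy to check by hand. This is a minimal Python sketch, not anything taken from Flexspin or Pnut: `addpins` models the ADDPINS operator as base | (extra << 6), `needs_augment` is a hypothetical helper name, and the base pin of 32 for the 16-bit bus is an assumption made up purely for illustration. The 4-bit value (40 ADDPINS 3 = 232) is the one quoted in this thread.

```python
def addpins(base: int, extra: int) -> int:
    """Model of the ADDPINS operator: base pin in bits 5..0, extra pins in bits 10..6."""
    return base | (extra << 6)


def needs_augment(value: int) -> bool:
    """True when a constant does not fit the 9-bit '#' immediate field (0..511)."""
    return not 0 <= value <= 0x1FF


four_bit_bus = addpins(40, 3)      # 232, the 4-bit pin group from this thread
sixteen_bit_bus = addpins(32, 15)  # base pin 32 is an assumed, illustrative value

print(four_bit_bus, needs_augment(four_bit_bus))
print(sixteen_bit_bus, needs_augment(sixteen_bit_bus))
```

Any 16-pin group comes out at 960 or above whatever the base pin, so it can never fit a single # immediate. As I understand it, the ## form adds a prefix instruction, which is what changes the instruction count in the critical section.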
There's something else going on too. And I better make a backup of this one. If I shift position of certain data in DAT section the program crashes or does stupid things. It's a little like a buffer overrun but I'm pretty certain that isn't the reason. Just swapping two items ahead of the buffers can still blow it up. And moving a non-sensitive, but important, item to after the buffers is fine.
PS: I suspect this is the elusive change-one-innocuous-thing-and-it-goes-bats. The one that will vanish without a trace again.
Roger,
Thanks for putting me onto using lutRAM for mapping the CA phase of QPI interface onto multiple chips. It was a learning experience just in treating SPI and QPI interface modes so differently. It didn't take long to decide to access one chip at a time when in SPI mode. SPI is always only going to be for config here anyway.
No problem. In my driver suite I've coded up 3 different PSRAM drivers so far, plus some other special variants for Ada, so I guess I should probably know a little bit about getting them working on the P2 by now. You'll have fun with the RMW aspects during write bursts.
Heh, not my problem. I don't intend to make a finished product. It's mostly about demonstrating the streamer/smartpin aligning in software, instruction counting, and doing the timing calculations. It's a lot more important for perfection at sysclock/2 than at sysclock/8 with the SD cards.
EDIT: That might be next direction to go in. Use this knowledge to implement the 4-bit SD protocol at sysclock/2. Operate 50 MHz SD cards with just 100 MHz sysclock.
Yeah fair enough. A spin2 based PSRAM driver is good for education and for some single COG use, but once you need a couple of COGs sharing the memory, or add real-time constraints, it starts to be a bit of a limiting factor and you need an arbiter COG that can fragment and control access to the shared memory, e.g. for video use.
Oh, I pillaged the nibble-swap code from your driver. I couldn't get my head around using those split/merge instructions in combo like that. It looks like something Ariba came up with.
EDIT: Although I did once use them for the EPROM's DPI mode ... ah, it was merge-only:
read_byte4
        waitse1                         'wait for smartpin (spi_do) buffer full event
        rdpin   pa, #spi_do             '16-bit shift-in as little-endian (odd bits)
        rdpin   pb, #spi_di             '(even bits)
        rev     pa                      'but SPI data is stored as big-endian (odd bits)
        rev     pb                      '(even bits)
        rolword pa, pb, #0              'combine to a single 32-bit word
  _ret_ mergew  pa                      'untangle the odd-even pattern
You'll need to use the splitb, rev, movbyts, mergeb combination when the "a" bit is not available in the streamer command - i.e. for hub burst transfers in the 8 and 16 bit bus modes that can't use pure immediate nibble mode. This is required for correct address endianness.
For 4 bit mode, you can use the simpler movbyts command to swap bytes and have the "a" bit do the nibble reversals for you.
I find coming up with these bit twiddling sequences difficult when you need multiple in a row to achieve the final result. For some reason it's not something that just comes to my mind. In fact, in the past I've even resorted to writing some code that brute-force exercises a bunch of these commands in different sequences until it comes up with the result, e.g. trying out different combinations of split, rev, movbyts, rol, merge in different orders until it eventually stumbles onto what you want, lol. They are highly versatile instructions.
Comments
Yup, for that particular run.
Then I'm almost sure your Eval setup is suffering from Clock-Jitter Syndrome (frequency- and temperature-related, for sure), and this also reveals some extra information. You're relying on the Smart Pins to get proper PSRAM CLK and CE# signals, but can't do the same for the PSRAM I/Os (the Streamer is needed to get proper data signals). Even though the clocked action occurs at the pad ring, the resulting timings are NOT aligned, due to subtle differences between the signals that originate in the pad ring region itself (pure 3.3 V logic: CLK and CE#) and the ones that come from the raw OR'ed OUT bus (coming from Cogs/Streamers, so 1.8 V logic, which must be voltage-translated to 3.3 V before hitting the last latching stage at the pad ring).
It'll be transmission line effects. The paths are much longer out to the add-on boards. I had previously thought attenuation was the biggest issue but maybe bus terminations might do wonders.
Here's Rayman's 48 MB add-on attached to the same Eval Board. It has fewer ICs, only three per bus instead of six. The tracks are shorter too. The results are much improved over the 96 MB add-on.
Of course. That is why my driver always follows this rule and splits it up. Break the rules and it's anyone's guess as to the result. PSRAM seems solid as anything when you keep within the 1kB page size (or 2kB for 8bit, 4kB for 16bit).
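To make the splitting rule concrete, here is a minimal Python sketch of chopping a burst so that no piece crosses a page boundary. It is not taken from the driver mentioned above; `split_burst` is a hypothetical name, and the 1024-byte default is the single-chip (4-bit) page size quoted in this thread.

```python
def split_burst(addr: int, length: int, page_size: int = 1024):
    """Yield (start_address, byte_count) pieces that each stay within one page."""
    while length > 0:
        room = page_size - (addr % page_size)  # bytes left before the next boundary
        count = min(room, length)
        yield addr, count
        addr += count
        length -= count


# An 1100-byte burst starting 24 bytes short of a page boundary
chunks = list(split_burst(1000, 1100))
print(chunks)
```

For the 8-bit and 16-bit bus widths the same routine would be called with page_size set to 2048 or 4096, per the sizes given above.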
What are the P2 I/O pin numbers being used to drive the PSRAM's DSIO[3:0]?
P.S. I understand the CLK and CE#, but the DATA_PINS numbering system is messing with my brain contents...
DATA_PINS = 232 CE_PIN = 57 CLK_PIN = 56
DATA_PINS = 232
232 = 192 + 40 which is a pin group of 4 based at P40.
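The same decoding can be written out as a small Python sketch. It assumes the usual pin-field layout (base pin in the low six bits, number of additional pins in the five bits above that); `split_pin_field` is a hypothetical helper name, not part of any P2 tool.

```python
def split_pin_field(field: int):
    """Split a pin-group constant into (base pin, number of pins in the group)."""
    base = field & 0x3F            # bits 5..0: base pin number
    extra = (field >> 6) & 0x1F    # bits 10..6: additional pins (the ADDPINS part)
    return base, extra + 1


base, width = split_pin_field(232)
print(f"DATA_PINS = 232 -> {width} pins starting at P{base}")
```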
That's a good idea - Split off the ADDPINS component ...
Thanks guys! I was at the nearmost strabismus-contraption-point of my ever-suffering eyes here...
You and @evanh can also try splitting the CLK and DATA transactions in two, while keeping the PSRAM's CE# low and holding CLK low for a rest period in the interval between them. This would give the PSRAM more time to "prepare" the next data buffer, ready to continue the operation at the next row. The interval would also leave enough time for another command to be sent to the Streamer, without being forced to pre-buffer it.
Getting the interval timing right (counted in sysclock cycles) is of prime importance, so that the new series of Smart Pin CLK pulses stays in phase with the Streamer's I/O data transfers.
There would be no hiccup, and you'd avoid the burden of ending the current CE#-constrained transaction just to start a new one soon afterwards. It would be a kind of "command chaining", but without needing to re-send the same command or provide a new (and unneeded) address phase.
Okay, the 4-bit code is all sorted. I'm really happy - Managed to eliminate all uncalculated (hand-coded) timing values!
EDIT: I guess I should add a boilerplate ... done
EDIT2: Comments improved
EDIT3: Rename of udec() to udeci() to avoid name clash in Pnut
EDIT4: This one has a bug, the newer release supersedes it - https://forums.parallax.com/discussion/comment/1541443/#Comment_1541443
As for the tests of nibble-oriented devices with 1kB-long rows, maybe I can add some "little bits" (sic) of my own.
The annexed file contains 2kBytes of non-random, nibble-oriented data, composed by 64 different 32-byte-wide "mini-rows" of data-patterns, intended to produce a deterministic set of frequencies at PSRams SIO[3:0].
Each "mini-row" is identified by a one-byte Gray code representing its position in the sequence, spanning from 'h00 through 'h28, located at the 18th byte position of each 32-byte sequence.
The following excerpt (four mini-rows) shows an example of how they are individually coded:
"A5 5A A5 5A 01 80 FF 08 10 FF 01 80 5A A5 00 00 00 00 5A A5 FE 7F 00 FE 7F 00 FE 7F 5A A5 5A A5
A5 5A A5 5A 01 80 FF 08 10 FF 01 80 5A A5 00 00 00 01 5A A5 FE 7F 00 FE 7F 00 FE 7F 5A A5 5A A5
A5 5A A5 5A 01 80 FF 08 10 FF 01 80 5A A5 00 00 00 03 5A A5 FE 7F 00 FE 7F 00 FE 7F 5A A5 5A A5
A5 5A A5 5A 01 80 FF 08 10 FF 01 80 5A A5 00 00 00 02 5A A5 FE 7F 00 FE 7F 00 FE 7F 5A A5 5A A5"
Hope it helps a bit
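As a rough illustration of the coding scheme, the four rows of the excerpt can be regenerated with a few lines of Python. This is only a sketch: `gray` and `mini_row` are hypothetical names, and it assumes the surrounding 31 bytes are identical from row to row, which holds for the four rows shown but is not confirmed for the rest of the file.

```python
# The 32-byte pattern of the first row in the excerpt; byte 18 (index 17) is the ID
TEMPLATE = bytes.fromhex(
    "A5 5A A5 5A 01 80 FF 08 10 FF 01 80 5A A5 00 00"
    " 00 00 5A A5 FE 7F 00 FE 7F 00 FE 7F 5A A5 5A A5"
)


def gray(n: int) -> int:
    """Standard binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def mini_row(index: int) -> bytes:
    """Copy the template and stamp the Gray-coded row number into the 18th byte."""
    row = bytearray(TEMPLATE)
    row[17] = gray(index)
    return bytes(row)


rows = [mini_row(i) for i in range(4)]
for row in rows:
    print(" ".join(f"{b:02X}" for b in row))
```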
Damn, my newly fleshed out stdlib.spin2 has a namespace conflict with udec() in Pnut ... and I've used it lots too. EDIT: Now updated above.
Yeah that is handy. I think returning like that only works in the org/end not asm/endasm blocks, right?
Yes. ORG/END is Interpreter-style loaded ASM block, ASM/ENDASM just dumps your assembly into the compiler IR (can participate in inlining, constant propagation, CORDIC reorder etc)
Okay, a newer more generic release that handles one, two and four chips in parallel. Only have to adjust the pin constants to suit.
I'll work on the comments.
I tell you what. Having both QPI and SPI modes done with the streamer, and the equivalent timing calculations, provided a huge amount of troubleshooting. It was amazing how many times I got disparate outcomes between SPI and QPI. Which meant I could compare and come up with ideas for why and what was wrong. Very rarely did I revert back to older code to dig out of a hole.
Yeah it's nice to have a good working reference to compare against. You are not starting out from scratch.
I've used the same for all now, i.e. the nibble swapper below. The table in lutRAM is built according to the number of parallel chips and, in the case of a single chip, which half of the 8-bit pin group it is on.
Here's the table builder (Still needs comments added)
Looks unwieldy, why not just use mul to compute the values? Multiply idx by $1111 as you increment idx from 0 to $F and write the word or byte to LUT.
In fact I don't think it hurts to replicate the pattern to all 4 nibbles in all cases, as the streamer commands will select the appropriate number of pins that receive the data.
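The multiply trick is quick to sanity-check off-chip. Here is a minimal Python sketch of the arithmetic only (`nibble_table` is a hypothetical name, and nothing here models the actual LUT writes or streamer setup):

```python
def nibble_table():
    """Replicate each 4-bit index into all four nibbles of a 16-bit word."""
    return [(idx * 0x1111) & 0xFFFF for idx in range(16)]


table = nibble_table()
for idx, word in enumerate(table):
    print(f"{idx:X} -> ${word:04X}")
```

Since idx never exceeds $F, the product never exceeds $FFFF, so there is no carry between nibbles and a single multiply per entry is enough.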