I seriously recommend building off my earlier testing with the 4-bit wide RAMs. It was a test case for finding/providing all the timing switches and the maths for the Prop2 to master any clocked data bus interface. Starting from there will give you a lot more clarity of where everything fits instead of repeating the same try-until-it-works approach we've been going through over and over.
I definitely need to give it some attention though. It's not built to handle DDR and the datasheet looks to have extra latencies, eg: memory writes, compared to the 4-bit parts.
@evanh said:
Oh, ouch, CA phase is six bytes long. That's a spanner in the works ...
Yeah this is nothing like the PSRAM we've been using. It's a lot more like HyperRAM. Better to start with an existing HyperRAM driver to get this to work.
@Rayman , do you know why your first attempt failed?
LOL, no way I'm going back to the old hyperRAM code. It was a mess.
Okay, my second attempt at building the bigger CA phase is this (it's even bigger than the nibble-swapping code):
wrlut paddr, #4 ' insert Address[7:0] into CA data
shr paddr, #8
wrlut paddr, #3 ' insert Address[15:8] into CA data
shr paddr, #8
wrlut paddr, #2 ' insert Address[23:16] into CA data
shr paddr, #8
wrlut paddr, #1 ' insert Address[31:24] into CA data
wrlut pcmd, #0 ' insert Command into CA data
I might prefer using hubRAM instead. It would only require a single WRLONG, no byte shuffling, ... oops, no, using hubRAM for CA phase would require a RDFAST+WRFAST (or two RDFASTs). Which would really slow things down.
UPDATE: Didn't need the second WRLUT pcmd since it can be doubled in the sequence pattern. Saved one instruction. I feel better already.
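For anyone following along without a P2 to hand, here is a minimal Python model of the byte ordering that the WRLUT sequence above appears to produce. It is only a sketch: the names `build_ca_lut`, `CA_PATTERN` and `ca_bytes` are made up for illustration, the command value in the usage note is an arbitrary placeholder rather than a datasheet opcode, and the replay pattern (entry 0 sent twice) is my reading of the UPDATE note, not something taken from the driver source. The real WRLUT stores the whole 32-bit register; the masking to one byte here is a simplification.

```python
def build_ca_lut(cmd: int, addr: int) -> list[int]:
    """Model the WRLUT/SHR sequence: five LUT entries holding the CA bytes."""
    lut = [0] * 5
    lut[4] = addr & 0xFF      # Address[7:0]
    addr >>= 8
    lut[3] = addr & 0xFF      # Address[15:8]
    addr >>= 8
    lut[2] = addr & 0xFF      # Address[23:16]
    addr >>= 8
    lut[1] = addr & 0xFF      # Address[31:24]
    lut[0] = cmd & 0xFF       # Command
    return lut


# Assumed sequence pattern: the command entry is replayed twice, then the
# four address bytes follow, most significant first (six bytes in total).
CA_PATTERN = (0, 0, 1, 2, 3, 4)


def ca_bytes(cmd: int, addr: int) -> bytes:
    """Return the six CA-phase bytes in the order they would go on the bus."""
    lut = build_ca_lut(cmd, addr)
    return bytes(lut[i] for i in CA_PATTERN)
```

For example, `ca_bytes(0x20, 0x12345678)` gives the command byte twice followed by `12 34 56 78`, which is the six-byte CA phase mentioned earlier.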
@rogloh said:
Isn't it easier to stream from immediates than use the LUT to translate the output, or are you hoping to gang multiple chips together for a wider bus?
That was on the back of my mind, yes. The other, more immediate, reason is retaining unbroken clocks from CA phase to data phase, having a single lead-in and single WYPIN. Making room for bus tri-stating and rx pin registration is tight at sysclock/1.
EDIT: It's doable if I eat into the fixed latency interval. I guess that's acceptable ...
EDIT2: Man, it's seriously tight. This needs testing to see if the rx registration switchover is happening in time. I might have to do the tri-stating last.
EDIT3: Duh! I counted the latency wrong - It's in clock cycles, I had used transfer cycles ... And I'd not doubled it either, 2 x LC default. There's plenty of time.
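The counting slip in EDIT3 can be shown as a couple of lines of arithmetic. This is a sketch under assumptions: `lc_clocks` stands for the part's latency code in clock cycles (the value 5 in the usage note is purely a placeholder), "2 x LC default" is read as the default fixed latency being twice LC, and the bus is DDR so there are two transfers per clock. The function name is hypothetical.

```python
def latency_in_transfers(lc_clocks: int, doubled: bool = True) -> int:
    """Convert a latency given in clock cycles into DDR transfer periods."""
    # Assumption: the default fixed latency is 2 x LC clock cycles.
    clocks = lc_clocks * 2 if doubled else lc_clocks
    # DDR: one transfer on each clock edge, so two transfers per clock.
    return clocks * 2
```

With a placeholder LC of 5, treating the number as transfers would allow only 5 transfer periods, whereas `latency_in_transfers(5)` comes out at 20, which is where the "plenty of time" comes from.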
No idea why the modified hyperram driver didn't work. But, everything has to be exactly right and it's hard to tell without hooking up a scope.
Right now, I'm just at the stage of making sure the chips are soldered on right and the board works.
This is a 32 MB board with two 16 MB chips, each with own control signals but sharing 8-bit bus.
Two extra pins have USB connector.
Here's what it looks like.
Nice thing about this over hyperram is higher frequency rating, 266 MB/s.
Also, the address syntax is simpler, although that's not a big deal.
@Rayman said:
This is a 32 MB board with two 16 MB chips, each with own control signals but sharing 8-bit bus.
Two extra pins have USB connector.
Oh. uh, about that. You can't just throw USB on any old pin pair. Technically yes, but current driver needs at least one extra dummy pin I think (which is where the accessory board has the LED).
@Rayman said:
That said, if anybody can think of a better use for the two pins, I'd be interested to hear it.
No, USB is good. I guess when @macca 's hub driver gets ready, it'd be neat to have a board with a built-in 4 port hub. Speaking of, does that driver currently need the extra USB pins?
Oh, bugger, Rayman, looking at the pin configs, you've made the same mistake Parallax did with the HyperRAM add-on board. You've made it with double loading on the data pins and only single loading on the clock pins. That'll never make sysclock/1.
@evanh said:
Oh, bugger, Rayman, looking at the pin configs, you've made the same mistake Parallax did with the HyperRAM add-on board. You've made it with double loading on the data pins and only single loading on the clock pins. That'll never make sysclock/1.
In addition, splitting the pins up (instead of making them contiguous or shared) obstructs the software implementation: you then need to switch all the pins instead of just CE, since the others could otherwise be driven with pinfields.
@Rayman said:
That said, if anybody can think of a better use for the two pins, I'd be interested to hear it.
No, USB is good. I guess when @macca 's hub driver gets ready, it'd be neat to have a board with a built-in 4 port hub. Speaking of, does that driver currently need the extra USB pins?
The extra pins are used for notifications (in long-repository mode) and to drive an LED that signals correct operation to the user. In PASM-only both can be removed; in Spin (when I get to update the Spin driver) you need a way to notify the events if you don't want to use a pin. Both pins, however, can be placed anywhere; there's no need to have them near the USB pair.
@Rayman said:
@evanh I think you asked for and I added optional rc network on both clock signals
No, using the capacitor was always a workaround. I did it on Parallax's Hyper add-on board because of this very same lopsided signal behaviour that occurs when they aren't equally loaded. Doing the same with resistors will have the same effect. It's just an R-C curve either way.
When doing the layout:
Make all signals, except CS, have equal number of loads. Not doing this destroys the limited ability to configure the data setup time with I/O registration.
Try to keep all eight data paths, and DQS, relatively equal length. It's not critical but every bit helps.
Make the clock path longer than the data paths. This adds a natural clock delay to improve data setup time. The bulk of the setup time is set by programming the I/O registration but a little biasing in its favour is a good thing. Or just make them all the same length if you like.
And, I don't have personal experience but, the general guidelines I've looked at also specify that the data/clock signals should flow in a single contiguous bus. No branching in the paths. I suspect this guideline can be bent a little without problem. Short branches are better than long branches.
On the other hand, the longer the tracks run in parallel the more crosstalk occurs. So, tidy isn't always best. Interleaved ground tracks are added to shield from crosstalk in a tidy manner.
And on the subject of grounding, the rule of thumb is the more the merrier. Get as large a surface as possible running into each ground pin of the headers. Equally, make sure there is a wide plane path to every IC. It's vital that grounding is much lower impedance than everything else.
Hmm, thanks, not surprising. I'll get the scope out tonight and check the tx timings.
Right, found another "oops that also should have been changed to suit the new method" - The M_CA8 constant was still set for LUT when it should have been IMM.
And there is an off-by-one-transfer in the clock/data phase relationship too, about to delve into that one ... EDIT: I'm guessing it's related to the way I'm scaling the SPI clock. When the DDR switch is on it doubles the relative clock length so that when CLK_DIV = 1 (sysclock/1) then the SPI clock cycle is two sysclocks long.
EDIT2: Yeah, that looks to be it. Can't be 100% without a chip to test on ... I could do a variant for the HyperRAMs now though ...
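The clock scaling described here is easy to get wrong by one, so here is a tiny model of it. It is only a sketch: the function name is made up, and the non-DDR branch (clock period equal to CLK_DIV) is an assumption inferred from "doubles the relative clock length", not something confirmed from the code.

```python
def spi_clock_period_ticks(clk_div: int, ddr: bool) -> int:
    """SPI clock period in sysclock ticks; clk_div is one transfer period."""
    # With the DDR switch on there are two transfers per clock cycle,
    # so the clock cycle is twice the transfer period.
    return clk_div * 2 if ddr else clk_div
```

So at CLK_DIV = 1 with DDR on, `spi_clock_period_ticks(1, True)` is 2 sysclocks per SPI clock while data still moves every sysclock.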
Based on the image you posted at #9, I wasn't able to determine how the two available 3.3V power supplies (coming from P2 Eval connectors) are distributed to both PSRAM chips.
Are they split between VDD and VDDQ?
Does VDDQ for both chips come from the Basepin-tagged connector (from where ADQ[7:0] are brought), while the VDDs are fed from the other one?
Right, here's the fully working hyperRAM tester. Tested with Parallax's Hyper Eval Add-on.
CPOL/CPHA didn't make much sense in the context of DDR, so I split off from the 4-bit QPI code. Command sets are different anyway. Timing wise, I do have a working merged variant.
For you to test Rayman,
Here's the ported OPI variant. It's the same timings, just the command set changed. It should always start out looking the same as above at 60 MHz: the four columns u0, r0, u1 and r1 all at 100%.
EDIT: Bah, latencies are different to Hyper parts. Updated with what I think is correct from the datasheet
EDIT2: Grr, was too hasty last night. And too tired I guess. RX routine still had the HyperRAM CA sequencing. Fixed that one now too.
Hmmm, one really starts to rely on testing to catch all those mistakes. So when testing isn't available, or you're too lazy to run the tests, you're doomed to make the same mistakes.
The components of lead-in timing:
+ 5 is sysclock ticks from XINIT to DIRH's starting of the smartpin cycle. See code below.
+ CLK_ADV is optional, in sysclock ticks, to phase advance the PSRAM clock with respect to tx data.
+ CLK_DIV<<1 retards tx data by one DDR clock cycle. CLK_DIV is one transfer period. This compensates for the first internal smartpin cycle which always occurs between the DIRH and WYPIN instructions.
+ CLK_REGD retards tx data by one tick to compensate for delayed clock out when the clock pin is registered.
- TX_REGD advances tx data by one tick to compensate for delayed tx data when the tx pin is registered.
- CLK_DIV>>1 advances tx data by half a transfer period to provide setup and hold timings. Like CPHA=0 in SPI terms.
xinit mleadin, #0 ' lead-in timing, at sysclock/1 (immediate effect)
setq mnco ' streamer transfer rate (takes effect with buffered command below)
xcont mcmd, pcmd ' tx Command + Address (buffered-op to align with WYPIN below)
dirh #PSRAM_CLK_PIN ' start smartpin internally cycling at SPI clock rate
wypin len, #PSRAM_CLK_PIN ' SPI clocks for CA phase, RAM latency, and data phase
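The lead-in components listed above sum to a single tick count, which can be sketched as plain arithmetic. This is a Python model rather than driver code: the function name and the keyword defaults are assumptions, CLK_REGD and TX_REGD are treated as 0 or 1 flags, and how the result gets packed into `mleadin` is not shown.

```python
def lead_in_ticks(clk_div: int, clk_adv: int = 0,
                  clk_regd: int = 0, tx_regd: int = 0) -> int:
    """Sum the lead-in components, in sysclock ticks."""
    return (5                  # XINIT to DIRH's start of the smartpin cycle
            + clk_adv          # optional phase advance of the clock
            + (clk_div << 1)   # one DDR clock cycle (first smartpin cycle)
            + clk_regd         # clock pin registered: clock is one tick late
            - tx_regd          # tx pin registered: data is one tick late
            - (clk_div >> 1))  # half a transfer period for setup and hold
```

At sysclock/1 with nothing registered, `lead_in_ticks(1)` is 7. Note that `1 >> 1` is 0, so the half-period term vanishes at CLK_DIV = 1 and any setup margin there has to come from pin registration or CLK_ADV instead.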
I must still have something wrong with the OPI CA sequence ... there is an unexplained +1 on the read latency for hyperRAM parts. Maybe that's not suited for OPI parts. Try this:
Comments
Damn, I'm really out of time tonight ... it almost feels testable too ...
Ugh, give this a try. It compiles now at least. You'll likely want to make some edits to the first CON section.
If everything works first try (not likely) then you should see a report similar to this - https://forums.parallax.com/discussion/comment/1541657/#Comment_1541657
EDIT: First bug fix - I'd changed the meaning of some delays and missed making one correction to suit that change.
Oh. uh, about that. You can't just throw USB on any old pin pair. Technically yes, but current driver needs at least one extra dummy pin I think (which is where the accessory board has the LED).
I fixed that in the attached. Only two pins needed.
That said, if anybody can think of a better use for the two pins, I'd be interested to hear it.
@evanh Tried your code after adjusting pin #s and turning off the second chip.
Gives all 0% at all frequencies.
No, USB is good. I guess when @macca 's hub driver gets ready, it'd be neat to have a board with a built-in 4 port hub. Speaking of, does that driver currently need the extra USB pins?
The existing code already handles disabling of other bank selects even without declaring them.
Hmm, thanks, not surprising. I'll get the scope out tonight and check the tx timings.
Oh, bugger, Rayman, looking at the pin configs, you've made the same mistake Parallax did with the HyperRAM add-on board. You've made it with double loading on the data pins and only single loading on the clock pins. That'll never make sysclock/1.
Oh yeah....
In addition, splitting the pins up (instead of making them contiguous or shared) obstructs the software implementation: you then need to switch all the pins instead of just CE, since the others could otherwise be driven with pinfields.
I think this one warrants a redesign then.
@evanh I think you asked for and I added optional rc network on both clock signals
The extra pins are used for notifications (in long-repository mode) and to drive an LED that signals correct operation to the user. In PASM-only both can be removed; in Spin (when I get to update the Spin driver) you need a way to notify the events if you don't want to use a pin. Both pins, however, can be placed anywhere; there's no need to have them near the USB pair.
No, using the capacitor was always a workaround. I did it on Parallax's Hyper add-on board because of this very same lopsided signal behaviour that occurs when they aren't equally loaded. Doing the same with resistors will have the same effect. It's just an R-C curve either way.
When doing the layout:
And, I don't have personal experience but, the general guidelines I've looked at also specify that the data/clock signals should flow in a single contiguous bus. No branching in the paths. I suspect this guideline can be bent a little without problem. Short branches are better than long branches.
On the other hand, the longer the tracks run in parallel the more crosstalk occurs. So, tidy isn't always best. Interleaved ground tracks are added to shield from crosstalk in a tidy manner.
And on the subject of grounding, the rule of thumb is the more the merrier. Get as large a surface as possible running into each ground pin of the headers. Equally, make sure there is a wide plane path to every IC. It's vital that grounding is much lower impedance than everything else.
Right, found another "oops that also should have been changed to suit the new method" - The M_CA8 constant was still set for LUT when it should have been IMM.
And there is an off-by-one-transfer in the clock/data phase relationship too, about to delve into that one ... EDIT: I'm guessing it's related to the way I'm scaling the SPI clock. When the DDR switch is on it doubles the relative clock length so that when CLK_DIV = 1 (sysclock/1) then the SPI clock cycle is two sysclocks long.
EDIT2: Yeah, that looks to be it. Can't be 100% without a chip to test on ... I could do a variant for the HyperRAMs now though ...
Ha, got results. And I'd forgotten all about needing to drive DQS/RWDS low for data writes. It wasn't writing a damn thing without that!
Everything is 99% for the moment. Still got to sort out the latencies. Job for tomorrow night ...
I've attached the hyperRAM version:
Hi @Rayman
Based on the image you posted at #9, I wasn't able to determine how the two available 3.3V power supplies (coming from P2 Eval connectors) are distributed to both PSRAM chips.
Are they split between VDD and VDDQ?
Does VDDQ for both chips come from the Basepin-tagged connector (from where ADQ[7:0] are brought), while the VDDs are fed from the other one?
A joystick using a resistor ladder would be good. It takes a lot less overhead to read than a USB port.
Right, here's the fully working hyperRAM tester. Tested with Parallax's Hyper Eval Add-on.
CPOL/CPHA didn't make much sense in the context of DDR, so I split off from the 4-bit QPI code. Command sets are different anyway. Timing wise, I do have a working merged variant.
EDIT: Bug fix - round up the clock pulse count if byte count is odd number of bytes. Scratch that. It doesn't yet handle using DQS for odd byte count.
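The rounding mentioned in that EDIT amounts to a ceiling division, since a DDR 8-bit bus moves two bytes per clock. Below is a rough Python sketch of how the total pulse count for one transaction might be tallied. The function name and the simple three-term sum are assumptions, the six-byte CA default comes from earlier in the thread, and, as noted above, an odd byte count still needs DQS handling that this does not model.

```python
def clock_pulses(data_bytes: int, latency_clocks: int, ca_bytes: int = 6) -> int:
    """Clock pulses covering CA phase, RAM latency and data phase (DDR bus)."""
    ca_clocks = ca_bytes // 2            # six CA bytes -> three DDR clocks
    data_clocks = (data_bytes + 1) // 2  # round up when the byte count is odd
    return ca_clocks + latency_clocks + data_clocks
```

With a placeholder latency of 10 clocks, 16 data bytes and 15 data bytes both come out at 21 pulses; the odd case just clocks one extra byte slot.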
For you to test Rayman,
Here's the ported OPI variant. It's the same timings, just the command set changed. It should always start out looking the same as above at 60 MHz: the four columns u0, r0, u1 and r1 all at 100%.
EDIT: Bah, latencies are different to Hyper parts. Updated with what I think is correct from the datasheet
EDIT2: Grr, was too hasty last night. And too tired I guess. RX routine still had the HyperRAM CA sequencing. Fixed that one now too.
Hmmm, one really starts to rely on testing to catch all those mistakes. So when testing isn't available, or you're too lazy to run the tests, you're doomed to make the same mistakes.
The components of lead-in timing:
+ 5 is sysclock ticks from XINIT to DIRH's starting of the smartpin cycle. See code below.
+ CLK_ADV is optional, in sysclock ticks, to phase advance the PSRAM clock with respect to tx data.
+ CLK_DIV<<1 retards tx data by one DDR clock cycle. CLK_DIV is one transfer period. This compensates for the first internal smartpin cycle which always occurs between the DIRH and WYPIN instructions.
+ CLK_REGD retards tx data by one tick to compensate for delayed clock out when the clock pin is registered.
- TX_REGD advances tx data by one tick to compensate for delayed tx data when the tx pin is registered.
- CLK_DIV>>1 advances tx data by half a transfer period to provide setup and hold timings. Like CPHA=0 in SPI terms.
Further reading - https://forums.parallax.com/discussion/comment/1542073/#Comment_1542073
@evanh Tried your code with basepin changed to 0 to match my setup. Still all 0%
I must still have something wrong with the OPI CA sequence ... there is an unexplained +1 on the read latency for hyperRAM parts. Maybe that's not suited for OPI parts. Try this:
If you want to examine it with a scope, then comment out line 93, lib.pllset( mhz * 1_000_000 ). It'll then run repeating at a fixed 4 MHz sysclock.