I just noticed that 166MHz is the fastest speed listed. Now that the P2 might be clockable at 200MHz (or more?), none of these would be usable at full speed. Also, just to make sure I am understanding correctly, when I look at balls.spin, I don't see anything driving the CLK pin. I'm assuming this is being done implicitly by the FPGA. And I'm guessing that the final version of the driver will need to explicitly set up a counter output for CLK. So, even though it would be possible to clock the SDRAM at a lower rate and adjust command timing appropriately, the XFR would still be sampling at the internal clock rate.
It looks like there are some packages that run at 200MHz, but with a maximum memory of 16MB. To get any faster, you have to start looking at DDR...
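For concreteness, here is a quick back-of-envelope check of which common speed grades survive running 1:1 with a 200MHz core. This is plain C, nothing P2-specific, and the grades listed are just examples:

    /* Which SDRAM speed grades can run 1:1 with a given core clock?
       A grade's rating implies its minimum clock period; compare that
       against the period of the core-driven CLK. */
    #include <stdio.h>

    int main(void) {
        const double core_mhz = 200.0;                   /* proposed P2 clock */
        const double grades[] = { 143.0, 166.0, 183.0, 200.0 };
        for (int i = 0; i < 4; i++) {
            double t_clk = 1000.0 / core_mhz;            /* ns per CLK        */
            double t_min = 1000.0 / grades[i];           /* part's min cycle  */
            printf("%3.0fMHz part: needs >= %.2fns, 1:1 CLK gives %.2fns -> %s\n",
                   grades[i], t_min, t_clk,
                   t_clk >= t_min ? "OK at 1:1" : "derate the core, or run 1:2");
        }
        return 0;
    }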
Right. To use those big memories conservatively, you'd want to clock the Propeller at 160MHz. Not a huge price to pay.
Can you run the core at 200MHz, and clock the memory at 100MHz?
Not as planned.
Could pay to check into that, as pushing general purpose pins >> 100MHz is likely to be a challenge.
Be nice to not have the Memory constraining the core speed, as customers will then ask why it does not have DDR2.
A moderate clock ratio option for SDRAM could also help manage power drain.
- plus it gives users some means of margin checking, and gives latitude for over-clocking.
Those I/O pins are our own full-custom design, so they should work perfectly. They'll have their own local flops and everything will be tuned down to a few picoseconds.
When does XFR sample the pins? On the falling edge of the clock? The reason I'm asking is because I'm wondering whether we will be able to run the SDRAM at its maximum frequency. If the CLK signal is being generated by the counter module, how much timing skew is there between the output CLK signal and the internal clock that XFR is registered to?
If the big difference between SDR and DDR is the double-clocking, could you have an XFR mode that can handle that? The rest could still be taken care of in software, correct?
If speed is more important than memory size and price:
http://www.digikey.com/product-detail/en/MT48LC2M32B2P-5:J%20TR/MT48LC2M32B2P-5:J%20TR-ND/4315466
http://www.digikey.com/product-detail/en/W9812G6JH-5/W9812G6JH-5-ND/4037466
200MHz CLKs are only 2.5ns high, 2.5ns low, which I would still consider pushing things, on average PCBs and EMC thresholds.
Locking the SDR speed to be exactly the CPU speed seems quite a constraint, given the price/stock placements of SDRAM.
I could not find stock of 200MHz parts anywhere, only part codes for smaller parts (which may be EOL'd).
SDRAM is in the legacy basket, so supporting choices helps supply.
Alliance shows just one 200MHz part code, AS4C16M16S-5TCN (zero disti stock), so Parallax could contact Alliance and ask about supply.
I'd also check into Commercial/Industrial dual branding, as companies sometimes merge part codes for Commercial and Industrial (one speed grade down),
i.e. the 166MHz Industrial parts, which are available, may be 200MHz @ Commercial range, or even Commercial range with Vcc(3V3) +/-5% or +/-3%.
I agree that a 1:1 vs 1:2 speed selector would be beneficial.
I used to do a lot of testing of SDRAM, DDR, DDR2, and DDR3 memories, and there was quite a bit of headroom for running them beyond spec - especially towards the end of their production life cycle, by which time their processes were pretty much perfected.
In my experience, an SDRAM stick on a quality, well-bypassed DIMM module would run 20%+ past its rated speed in an average home/office environment. I often got DDR2 memory to run 50% above spec, with a 10%-25% reduction in CAS and RAS clocks and a small bump in voltage.
Given that we are talking about much shorter traces (a single RAM connected to a P2), I imagine most 166MHz SDRAM chips would work fine at 200MHz in an average environment, and all 183MHz chips would be rock solid at 200MHz. Running memory tests over a weekend should find any chips that would be an issue.
For industrial spec systems, dropping the P2 clock to match the chip, or running the memory at half clock speed would be the ticket.
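A minimal sketch of the kind of soak test meant above: an address-in-address pass plus a walking-ones pass. The sdram_write32()/sdram_read32() names are stand-ins that model the part with an array so the logic runs as-is; on hardware you'd swap them for the real driver calls:

    #include <stdint.h>
    #include <stdio.h>

    /* stand-ins so the test logic is runnable as written; replace with
       the actual SDRAM driver calls on hardware */
    static uint32_t model[1 << 16];
    static void     sdram_write32(uint32_t a, uint32_t v) { model[a >> 2] = v; }
    static uint32_t sdram_read32(uint32_t a)              { return model[a >> 2]; }

    /* one full pass; returns the number of miscompares */
    static uint32_t soak_pass(uint32_t words) {
        uint32_t errors = 0;
        for (uint32_t a = 0; a < words; a++)        /* address-in-address */
            sdram_write32(a * 4, a);
        for (uint32_t a = 0; a < words; a++)
            if (sdram_read32(a * 4) != a) errors++;
        for (uint32_t a = 0; a < words; a++)        /* walking ones: stresses DQ */
            sdram_write32(a * 4, 1u << (a & 31));
        for (uint32_t a = 0; a < words; a++)
            if (sdram_read32(a * 4) != (1u << (a & 31))) errors++;
        return errors;
    }

    int main(void) {
        /* loop this for a weekend against real hardware and log the total */
        printf("miscompares: %u\n", soak_pass(1u << 16));
        return 0;
    }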
So, the idea would be to add a speed selector option to SETXFR? If so, then the other things that would have to be kept in mind when running at 1:2 (see the sketch after this list) would be:
The CLK would also have to be set to 1:2.
The 16pin_to_AUX mode would take 4 clock cycles per 32-bit read/write.
All commands would need an additional clock cycle to make sure that the SDRAM latched the command.
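Putting assumed numbers on that bookkeeping (the 2-sysclks-per-32-bit-transfer baseline through the 16-pin path at 1:1 is an assumption, as is the one-clock command baseline):

    #include <stdio.h>

    int main(void) {
        for (int ratio = 1; ratio <= 2; ratio++) {
            int xfr_32 = 2 * ratio;        /* sysclks per 32-bit r/w via 16 pins  */
            int cmd    = 1 + (ratio == 2); /* extra sysclk so the command latches */
            printf("1:%d  32-bit transfer = %d clks, command = %d clk(s)\n",
                   ratio, xfr_32, cmd);
        }
        return 0;
    }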
Well, isn't it better just to use a clock frequency that the SDRAM can tolerate (as previously mentioned) rather than creating a bunch of extra timing headaches? I would definitely still like to allow running the chip as fast as possible without SDRAM attached.
How big a problem would it be to make the ratio selectable: 1:1 - 1:1.5 - 1:2?
That would be a real pain to synthesize, from my limited experience. Anything else?
The SDRAM always responds on a rising edge, so 1.5 is not possible.
In a new chip design, it might be possible to run a 400MHz PLL and divide that by 2, 3, or 4 for the SDRAM pacing.
It is common practice to divide the PLL by at least 2, to get 50% duty cycle primary clock signals.
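In numbers, under that assumed 400MHz PLL:

    /* Integer divides of a hypothetical 400MHz PLL. */
    #include <stdio.h>

    int main(void) {
        const double pll_mhz = 400.0;
        for (int div = 2; div <= 4; div++)
            printf("PLL/%d -> %5.1fMHz SDRAM clock\n", div, pll_mhz / div);
        return 0;   /* prints 200.0, 133.3, 100.0 */
    }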
I've been trying to come up with a way to allow multiple cogs to share some or all of the same pins for accessing multiple SDRAM chips (one per cog). In this case, I am less concerned with maximizing performance and more concerned with maximizing the memory available to each cog (with no expectation of sharing external memory between cogs and also with the knowledge that each cog will have a copy of the SDRAM driver running). As best I can tell, the address and command pins could be shared as long as they are cooperatively accessed (e.g. with the help of LOCKxxx). Each cog would still need its own CS and DQ lines (and possibly CKE). Or it *might* be possible to share DQ by judiciously using CKE, but I'm not sure about that.
Does this seem feasible, or is there something I'm forgetting?
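One way the cooperative access could look, written as a single-threaded toy so it runs anywhere. The lock and driver calls are stand-ins: on hardware they'd be the Propeller lock primitives (LOCKNEW/LOCKSET/LOCKCLR, where the real ones take a lock ID from LOCKNEW) and the per-cog SDRAM driver:

    #include <stdint.h>
    #include <stdio.h>

    /* placeholder lock + driver, modeled in software */
    static int  bus_lock_state = 0;
    static int  lockset(void) { int was = bus_lock_state; bus_lock_state = 1; return was; }
    static void lockclr(void) { bus_lock_state = 0; }
    static void sdram_cmd(uint32_t cmd, uint32_t addr) {
        printf("cmd %u @ %u driven on shared A/CMD pins\n",
               (unsigned)cmd, (unsigned)addr);
    }

    /* Each cog has its own CS and DQ; only the address/command pins are
       shared, so the lock serializes just the command phase. */
    void shared_cmd(uint32_t cmd, uint32_t addr) {
        while (lockset())      /* spin until we own the shared pins      */
            ;
        sdram_cmd(cmd, addr);  /* our own CS makes our chip the listener */
        lockclr();             /* hand the pins back to the other cogs   */
    }

    int main(void) { shared_cmd(0, 123); return 0; }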
Well, isn't it better just to use a clock frequency that the SDRAM can tolerate (as previously mentioned) rather than creating a bunch of extra timing headaches? ...
It is not really adding any timing headaches, as the clock simply skips.
The hardware is essentially a clock enable, and the clock enable is either logic 1, or 50%, so every second clock changes the SDRAM logic.
The timing of any edges relative to the master clock is the same in both cases.
This could be changed in SW, to allow power/speed tradeoffs.
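A toy simulation of that scheme, just to make the "clock simply skips" point concrete: the SDRAM-side state advances only on master edges where the enable is high, so edge placement itself never moves:

    #include <stdio.h>

    int main(void) {
        int ratio = 2;                        /* 1 = full speed, 2 = half */
        int sdram_ticks = 0;
        for (int edge = 0; edge < 8; edge++) {
            int en = (edge % ratio) == 0;     /* enable: logic 1, or 50%  */
            if (en) sdram_ticks++;            /* SDRAM logic advances     */
            printf("master edge %d: enable=%d, sdram ticks=%d\n",
                   edge, en, sdram_ticks);
        }
        return 0;
    }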
Could SETXFR include an 8-bit transfer mode (in addition to 16-bit and 32-bit)? This would help reduce the number of pins required (obviously at the cost of bandwidth), which can come in handy when using multiple SDRAM chips (e.g. as expressed in my prior post). I know there are 4-bit SDRAM versions, but that might be pushing things a little. Supporting 8, 16, and 32 bits in SETXFR seems natural.
I've been trying to come up with a way to allow multiple cogs to share some or all of the same pins for accessing multiple SDRAM chips (one per cog). ... Does this seem feasible, or is there something I'm forgetting?
Sounds like a can of worms... but might be doable on a relay (pass the baton) basis. (ie slow and SW paced)
You still need to ensure someone is responsible for refresh, and any Clock min specs need to be watched.
Such designs are not going to be portable to more advanced DRAM, which has tighter clock constraints, so are probably best avoided.
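The refresh obligation in numbers, assuming the common 8192-refresh-cycles-per-64ms spec of a typical 256Mb SDR part (check the actual datasheet):

    #include <stdio.h>

    int main(void) {
        const double retention_ms = 64.0;   /* full-array retention window */
        const int    cycles       = 8192;   /* refresh commands required   */
        double per_cmd_us = retention_ms * 1000.0 / cycles;
        printf("one AUTO REFRESH every %.2fus on average\n", per_cmd_us);
        /* ~7.81us: whoever owns the chip must keep this up, even while
           another cog holds the shared command pins */
        return 0;
    }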
Is there a time when we don't use a separate COG for transferring data to/from SDRAM? The SDRAM clock should be under control of a separate CTR anyway right? If we had direct SDRAM execute for example, the maximum clock speed would be more important (but that is not a P2 feature).
Could SETXFR include an 8-bit transfer mode (in addition to 16-bit and 32-bit)? ...
Alliance shows a single x8 part number; a stock check shows just 5 in stock, at Future, and it's a $10 part.
It is still in a 54-pin package.
SDRAM is in legacy mode, so you do not want to go too far from the mainstream, or supply will get even harder.
Sounds like a can of worms... but might be doable on a relay (pass the baton) basis. ...
I think I conveyed the idea badly. All I want to do is allow each cog to access its own SDRAM. Ideally, the P2 would have a humongous(er) number of pins and each cog could just use a dedicated set. But, as that's not the case, the thought is to give up some performance by sharing some of the pins. Each cog would only ever access their own SDRAM, be responsible for their own refreshes, etc. All that's shared is the pin resources.
Also, this is a one-off design. I would not expect this to be generally useful. Then again, this chip keeps defying expectations (and it's not even released yet!).
Note that it would be even easier to do this if XFR supported 8-bit transfers. With that, every cog could share the command and address lines, but have dedicated data lines, and still have pins left over! Not that I'm suggesting that every cog should have its own SDRAM, but you get the idea.
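Putting rough numbers on the pin budget (the pin counts are assumed, for a typical 54-pin x16-style SDR part; adjust for the actual part):

    #include <stdio.h>

    int main(void) {
        /* shared across all cogs */
        int shared  = 13 /*A0..A12*/ + 2 /*BA*/ + 3 /*RAS,CAS,WE*/ + 1 /*CLK*/;
        /* dedicated per cog, 8-bit data path */
        int per_cog = 8 /*DQ*/ + 1 /*CS*/ + 1 /*CKE*/ + 1 /*DQM*/;
        for (int cogs = 1; cogs <= 8; cogs++)
            printf("%d cog(s): %3d P2 pins\n", cogs, shared + cogs * per_cog);
        return 0;   /* e.g. 4 cogs -> 63 pins, 8 cogs -> 107 pins */
    }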
Is there a time when we don't use a separate COG for transferring data to/from SDRAM? ...
Not sure what direct SDRAM execute means here, but SDRAM has burst access, and refresh constraints.
It's cheap, but rather more of a pain than async SRAM.
Alliance shows a single x8 part number; a stock check shows just 5 in stock, at Future, and it's a $10 part. ...
Fair enough. Note, however, that there are other 8-bit buses that this could be used with. For instance, parallel-to-serial shift registers still have their place. And believe it or not, there are still legacy US Navy systems that use 8-bit ISA!
But I'm not going to push this one. If it's trivial to add, it might be worth doing. If it's not, then there's always bit-banging...
Not sure what direct SDRAM execute means here, but SDRAM has burst access, and refresh constraints. ...
Ya, that's fairly clear from having to write drivers for SDRAM in a P1 COG already. It's a big PITA to get started, but worth it for cache designs. If P2 did all the setup for us, it would be easier. That is not a feature request!
I don't recall seeing any information on how SDRAM works with P2 other than via a separate COG. There is an SDRAM clock pin, but that was never characterized very well AFAIK.
P2 docs on implementation of SDRAM support is a black hole to me. Are there any pointers?
I think I conveyed the idea badly. All I want to do is allow each cog to access its own SDRAM. ... the thought is to give up some performance by sharing some of the pins.
That does raise the question of how Chip has settled on the mapping of any SDRAM logic.
Does each COG have an SDRAM state engine and counter sets? Allocated to which pins?
Is completely independent (separate pins) operation supported in silicon?
Each cog would only ever access their own SDRAM, be responsible for their own refreshes, etc. All that's shared is the pin resources.
But once you share pins, you have timing gotchas waiting, as one-at-a-time access has to apply.
If you are going to share pins, you do not need two SDRAMs; just give each user a time-shared chunk of one larger one.
Video designs will get close to this: video-read will be given one set of hard time-slots, and video-write gets the crumbs.
Usually one COG would do either read or write.
Assuming each COG has a mappable SDRAM state engine, it could be a good area to test - state-engine hand-over across cogs may have some wrinkles, and what happens when a user gets the time-slots wrong?
If I get your meaning, that's why I asked if the XFR changes were going to include writing to cog ram (in addition to aux and hub). While this wouldn't allow you to execute out of external RAM "directly", it would be possible to perform LMM-style execution without ever having to touch the hub.
Or, with careful partitioning of the hub space and the use of the new HUBEXEC stuff, I could easily see each cog having a cog-local SDRAM driver that's managing the executable code blocks in the hub for that same cog. For instance, you could encode something like a "long jump" in the hub code, which would do the following steps (sketched below):
Invoke the local SDRAM driver (in cog-execution mode) to read the new code into the HUB.
Perform a JMPA (or whatever Chip's calling it now) to switch back to hub-execution mode.
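A sketch of how that two-step flow might look. Everything here is hypothetical: CODE_WINDOW and BLOCK_BYTES are made-up sizes, sdram_read_block() stands in for the cog-local driver, and hub_jump() for whatever the JMPA-style hub-execution entry ends up being (the stubs just print, so the flow runs as-is):

    #include <stdint.h>
    #include <stdio.h>

    #define CODE_WINDOW 0x4000u   /* assumed hub buffer owned by this cog */
    #define BLOCK_BYTES 2048u     /* assumed overlay size                 */

    static void sdram_read_block(uint32_t ext, uint32_t hub, uint32_t n) {
        printf("SDRAM %08x -> hub %04x (%u bytes)\n",
               (unsigned)ext, (unsigned)hub, (unsigned)n);
    }
    static void hub_jump(uint32_t hub) {
        printf("JMPA-style jump into hub code at %04x\n", (unsigned)hub);
    }

    /* entered from cog-execution mode when hub code hits a "long jump" */
    void long_jump(uint32_t ext_addr) {
        sdram_read_block(ext_addr, CODE_WINDOW, BLOCK_BYTES);  /* step 1 */
        hub_jump(CODE_WINDOW);                                 /* step 2 */
    }

    int main(void) { long_jump(0x100000u); return 0; }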
I don't recall seeing any information on how SDRAM works with P2 other than via a separate COG. ...
Still very much a moving target, but Chip did say this:
I'm changing the way XFR works, though, so the new version will be different. I'm making XFR handle complete metered transactions on its own in the background, instead of needing the cog to hover over everything.
...
We've been using a 256Mb chip that comes in a 54-pin package.
..
You still need to hover somewhat, to feed the SDRAM its commands, but this will simplify things.
So there looks to be some hardware support; it's unclear if that is per-cog or per-P2.
The full-page burst in SDRAMs is a reasonable size, at 512 locations, so even just HW streaming will be great to have.
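What a full-page burst buys, in rough numbers (a 16-bit part and a 200MHz SDRAM clock are assumed; row-open and precharge overhead ignored):

    #include <stdio.h>

    int main(void) {
        const double clk_mhz = 200.0;
        const int    page    = 512;            /* transfers per full page */
        double burst_us = page / clk_mhz;      /* one word per clock      */
        double mb_s     = 2.0 * clk_mhz;       /* 2 bytes per clock       */
        printf("one page: %d x 16 bits in %.2fus, ~%.0fMB/s peak\n",
               page, burst_us, mb_s);
        return 0;   /* 2.56us, ~400MB/s */
    }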