Good. They might become useful for trigger input/outputs in a 32 bit logic analyzer application using the remainder of port A, or other control lines etc.
IIRC, it was the HyperCK line that was being delayed, not CS#.
Ah, yes. That makes more sense.
Yep.
Write timing parameters at sysclock/2, I believe, can be reliably set the same for all board layouts and any temperature without any special care, thanks to the nice 90° clock-in-the-middle aspect.
Write timing at sysclock/1, without the easy 90° clock phase, is much more impedance sensitive when it comes to achieving the right setup and hold timings. However ... if the board layout is done right (clock and data have matched impedance), I think just using an unregistered output pin for clock and registered pins for data will give an ideal phase shift between clock and data. No capacitor needed at all. Testing on the Eval Board is tantalisingly close to this already. https://forums.parallax.com/discussion/comment/1507171/#Comment_1507171 I believe it's only prevented by the extra load on the data pins from the two hyper parts. Assuming this can be done without a capacitor, writes at sysclock/1 should be highly reliable too. Temperature has little to no impact even with the capacitor.
Read timing is where it gets rough, being layout and temperature sensitive at all decent data rates. The latencies through the prop2 (clock out and data return), plus the external latencies, all compound on each other.
The lookup table for read timings that Roger has mentioned will need to be calibrated for each board layout. Probably anything faster than sysclock/8 will need some knowledge or calibration smarts, and at the higher rates the temperature will need to be calibrated for as well. I recommend using a learning algorithm that periodically verifies a block of RAM to find the usable frequency bands and fill the lookup table.
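The learning-algorithm idea could be sketched like this in Python. This is only an illustration of the shape of it, not driver code: `set_read_delay` and `verify_ram_block` are hypothetical stand-ins for whatever the real driver exposes.

```python
# Sketch of read-delay calibration: sweep the candidate delay values at the
# current sysclock, test-verify a block of RAM at each one, and record which
# delays pass. The passing set then fills one row of the lookup table.

def calibrate_read_delay(delays, set_read_delay, verify_ram_block):
    """Return the list of delay values that verified 100% at this frequency."""
    working = []
    for d in delays:
        set_read_delay(d)               # hypothetical driver call
        if verify_ram_block():          # write + readback compare of a test block
            working.append(d)
    return working
```

A background cog could rerun this periodically to track temperature drift, as suggested above.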
Looking at a plot of the HyperRAM read timing for sysclk/2 reads, I think I'll need to tweak the breakpoints a bit more in that case. Right now the selected delay is not always the "inner" delay value when multiple choices are possible. This means we might be closer to the data bus edge transitions in time than we want, meaning setup/hold might be more easily violated using those delays.
See what I mean when comparing the delay the driver automatically chooses (in parentheses) vs where the working overlaps occur for the different P2 frequencies in my results below. I think I should keep at least one working delay value (100% transfer success) either side of the selected delay wherever possible. That's not happening with the table yet so I will fix that. Sysclk/1 doesn't have that luxury: there are only ever going to be two working values, and from an isolated measurement we don't really know in advance which one has better setup/hold time. We can only measure over a full range and guess where the better delay value will be.
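The "keep a working value either side" selection rule could be expressed as picking the centre of the widest contiguous run of passing delays. A minimal illustrative sketch (not the actual driver logic):

```python
# Pick the "inner" delay: the middle of the longest contiguous run of delays
# that verified at 100%, so the chosen value sits as far as possible from the
# data bus edge transitions where setup/hold would be violated.

def pick_inner_delay(working):
    """working: sorted list of delay values that passed verification."""
    if not working:
        return None
    best_start = best_len = 0
    start = 0
    for i in range(1, len(working) + 1):
        # a run ends at the list end or where values stop being consecutive
        if i == len(working) or working[i] != working[i - 1] + 1:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i
    run = working[best_start:best_start + best_len]
    return run[len(run) // 2]   # centre of the widest passing window
```

With a run of five working delays this leaves two working values either side of the selection; with only two working values (the sysclk/1 case) it can do no better than pick one of them.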
We'll make the layout so we can run them as a pair or separately, and we'll have the flexibility to do sysclk/1 or sysclk/2. It's going to be a very tight layout. I think the longest route will only be about 13mm.
Yes, it seems some periodic recalibration should be done to track temperature.
In the future, it would be really nice to build 8-tap digital delay lines into the I/O pins so that incoming and outgoing data could be registered at any 45-degree offset in the clock cycle. That would provide ultimate flexibility for full-speed data transactions. Wouldn't it be cool if there was a 3-bit selector for an I/O pin's phase? You'd still have multi-clock delays for fast signals, but you could have 8 divisions of the clock to pick from within the cycle of interest.
It would be much better to have SPI support with automatic clock generation or reception, sort of like the UART. A general-purpose extension to the UART so it could also do I2C. Not sure if I2S is of the same ilk too.
Of course, hindsight is wonderful. I guess it will be years before we see another P2 revision due to cost.
I still think a few really simple and really fast bit-bang CPUs would be better than smart pins. Think of a full-speed core (400+ MHz single clock, i.e. 4-port 128x32 cog RAM) with only basic bit/byte/word/long manipulation instructions, i.e. less than P1. If these were close to the I/O pins, i.e. no clock delays to input/output, a lot of current problems would disappear.
Anyway, 90nm looks to be more on the horizon now and that may bring 1GHz.
Funnily, sync serial output is generally the trickier one to handle in the smartpins because those modes use an external clock (as an enable rather than a true clock), whereas the streamer is blind to any clock; it just metronomically paces off the sysclock. This means that all the prop2 latencies then pile onto the sending smartpin. Only the external latencies stay with the reading smartpin.
In the end the streamer's clock blindness seems to play to its advantage ... when used as a master device at least.
Honestly, just having a "half clock" registered delay would have been helpful.
Which in practice would be the registered one-clock FF delay on the rising edge that exists now, plus an additional one-and-a-half clock FF delay option on the falling edge.
@evanh At the moment, my registered/unregistered clock option is applied only at startup time, to patch both the read and write code timing to account for it and to configure the clock pin appropriately (once) using WRPIN. This currently means that enabling unregistered clocks to improve sysclk/1 writes also enables it for the reads, and we know that setting reduces the read performance. To make the choice of reg/unreg clocks independent for reads and writes requires setting up this clock pin state on each transaction. That would need at least 3 more instructions as I have 3 read/write code paths, or maybe two if I limit it just to the RAM writes and restore the clock pin state at the end. If we lock down the unregistered clock write settings at some point to exclude flash and register writes and not apply it to reads, I could free this waitx in the read path, and the waitx in the flash/register write paths, and that gives me two instructions.
Otherwise for gaining more instructions I would need to make room by losing my fast vector jump table per bank, which in itself will add 2 more instructions to the path.
Right now all I need to do in the read/write code paths to accommodate an unregistered clock is to have a WAITX clkdelay, and I set up this clkdelay value to be either 1 or 0 at startup time, depending on the choice of reg/unreg clock. I'd still need to keep that, but I also would need the extra WRPIN instructions too.
Existing:
setxfrq xfreq2 'setup streamer frequency (sysclk/2)
waitx clkdelay 'odd delay shifts clock phase from data
xinit ximm4, addrhi 'send 4 bytes of addrhi data
wypin clks, clkpin 'start memory clock output
If I could only save something in the 3 instructions I already have to burn to reset the Smartpin clock phase for transition mode on the clock pin, and/or combine it with the WXPIN operation here, it could help give me more room... This gets done in 3 places too.
drvl cspin 'drop CS pin to start the next transfer
fltl clkpin 'disable Smartpin clock output mode
wxpin #2, clkpin '...to resync for 2 clocks between transitions
drvl clkpin '...and re-enable Smartpin
Ah, of course, use both stages to form a simple short route for the needed fast propagation.
No need to lower and raise DIR for low-level pin mode changes. The low-level P settings of WRPIN are separate from smartpin M mode changes. They could have been two instructions really.
I tidied up the automatic latency configuration after driver restart, which should suit boards like the future P2 Edge and P2D2 as they don't wire a dedicated P2 pin to control the HyperRAM reset. In theory this code should also now work with V2 HyperRAM, assuming the V2 HyperRAM detection works according to the section 6.1 information in the V1->V2 HyperRAM migration application note from Cypress below.
In fact I don't think we will need to do a lot more than this to get the V2 HyperRAM to work. I've made it such that the HyperRAM latency is set the same, unless the Hyper bus clock frequency is greater than 100MHz and we are using the newer V2 RAM which can be operated faster, but requires the larger latency in order to not overclock. Once we get V2 HyperRAM we can try to run a P2 up to 400MHz and see if the memory still works! LOL.
' Loop through bus devices and setup a default device latency in case it had been changed
' prior to this driver restarting, and if its reset pin was not enabled. An obscure case.
repeat i from 0 to NUMBANKS-1
    device := devices[bus * 2 * NUMBANKS + i]
    if device
        if device & F_FLASHFLAG
            repeat j from 0 to 15               ' find any address mapped to this bus
                if addrMap[j] == bus
                    setFlashLatency((j<<28)+(i<<24), DEFAULT_HYPERFLASH_LATENCY)
                    quit
        else
            repeat j from 0 to 15               ' find any address mapped to this bus
                if addrMap[j] == bus
                    ' assume Version 1 HyperRAM for now
                    latency := DEFAULT_HYPERRAM1_LATENCY
                    ' check for Version 2 HyperRAM
                    id := readRamIR((j<<28)+(i<<24), 1, 0)  ' read IR1
                    if id & $ff == 1            ' if V2 HyperRAM, check operating frequency
                        if ((flags & (F_FASTREAD|F_FASTWRITE)) && freq > 200000000)
                            latency := DEFAULT_HYPERRAM2_LATENCY
                    setRamLatency((j<<28)+(i<<24), latency)
                    quit
Had my DDR bus speed confused above, here's the actual mapping of V2 HyperRAM latency values with bus frequency (Cypress). Seems we can possibly tighten it up further, although I need some of the latency period to execute P2 code.
0000 - 5 Clock Latency @ 133 MHz Max Frequency
0001 - 6 Clock Latency @ 166 MHz Max Frequency
0010 - 7 Clock Latency @ 200 MHz/166 MHz Max Frequency (default)
0011 - Reserved
0100 - Reserved
...
1101 - Reserved
1110 - 3 Clock Latency @ 85 MHz Max Frequency
1111 - 4 Clock Latency @ 104 MHz Max Frequency
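For reference, the table can be expressed as a small lookup. This is illustrative Python only, with the latency/frequency pairs taken from the Cypress values above; the chooser function and its name are my own invention, not driver code:

```python
# V2 HyperRAM latency table, keyed by the 4-bit config value:
# (clock latency, max DDR bus frequency in MHz). Reserved codes omitted.
LATENCY_TABLE = {
    0b0000: (5, 133),
    0b0001: (6, 166),
    0b0010: (7, 200),   # default (166 MHz on the slower speed grade)
    0b1110: (3, 85),
    0b1111: (4, 104),
}

def min_latency_config(bus_mhz):
    """Pick the config with the fewest latency clocks that still allows bus_mhz.

    Returns None if the requested bus frequency exceeds every table entry.
    """
    candidates = [(lat, cfg) for cfg, (lat, fmax) in LATENCY_TABLE.items()
                  if bus_mhz <= fmax]
    return min(candidates)[1] if candidates else None
```

So a 100 MHz bus could run with only 4 clocks of latency, while anything above 104 MHz forces at least 5, consistent with wanting some of that latency period to execute P2 code.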
You should be able to use WXPIN #1,clkpin by itself, as a timer reset, straight after completion of the clock pulses. eg:
drvh cspin 'raise CS pin to complete transfer
wxpin #1, clkpin 'cancel the period timer of transition mode
Then, for the next burst of clocks, set the desired period at the same place as the critical DIRH clkpin was. DIR can be left high all the time then. The clock smartpin can be enabled (DIRH) at init time and left that way.
PS: Bear in mind this trick works with transition mode. It's not a cure all for other smartpin modes that may suffer similar niggles.
Thanks evanh I'll have to give that a try sometime to see if it works out - I know I've had issues with this stuff in the past, and once I found something that worked, I sort of took it and stopped searching for better ways to do it.
Actually, it's important that there are some sysclocks between the WXPIN #1 and any subsequent desired period setting. The exact requirement depends on what the prior period setting was. If the period was 500 then you'd want 500 sysclocks to ensure the cancellation takes effect. Otherwise the trick won't help. In other words, I'm counting on small periods always being applied.
PS: Of course, it's not really cancelling anything. I called it that just because a period of one eliminates the timing niggle that occurs when the period is more than one.
Look at it like the non-blocking FIFO setting. Where the data will be fetched while further instructions are executing, and as long as you have enough other things to do before needing the data then the FIFO will be ready in time. Issuing the WXPIN #1 early enough means the smartpin will be in a desirable state in time for the next burst of clocks to be exactly timed.
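The spacing rule above can be pinned down as a one-line check. A trivial sketch with hypothetical names, just to make the requirement concrete:

```python
# The WXPIN #1 "reset" only takes effect after the previously programmed
# period has elapsed, so the gap before writing the next period must be at
# least that many sysclocks.

def reset_gap_ok(prior_period, gap_sysclocks):
    """True if enough sysclocks have elapsed for the WXPIN #1 to take effect."""
    return gap_sysclocks >= prior_period
```

With the small periods used for memory clocking (e.g. 2), the handful of instructions between transfers easily satisfies this; a period of 500 would need a correspondingly long gap.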
In the future, it would be really nice to build 8-tap digital delay lines into the I/O pins so that incoming and outgoing data could be registered at any 45-degree offset in the clock cycle.
Wouldn't that need an 8x PLL running?
I think there are simpler circuit topologies used for doing that kind of thing, but a PLL would work.
It's like a PLL, but you get to adjust the integrator on EVERY source clock. The integrator controls delay elements. If you had eight of them, you could make 8 phases of the input clock. If we had ten delay elements, we could do HDMI very nicely.
DLLs are commonly used in high-speed communications among chips on a board (e.g., between a memory controller and its SDRAM chips) in order to "cancel out" things like input and output buffer delays as well as wiring delays, allowing very tight control over setup and hold times relative to the clock signal. This allows data rates to be much higher than would otherwise be possible.
That sure rings a bell.
The delay line ring is drawn as a chain of buffers. How fast is a cold buffer in the used On Semi process? 100 ps maybe. The delay chain wouldn't have to cover lower frequencies, I presume. Those can be dealt with using a high sysclock if desired. I'm guessing 20 ns, maybe 40 ns, max delay in the chain. It'll be temperature dependent so will need a worst-case number of elements. We're wanting sub-nanosecond resolution anyway, so 400 elements and taps needed?
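The 400-element guess is simple arithmetic, assuming the ~100 ps per buffer figure above:

```python
# Back-of-envelope delay chain sizing: at ~100 ps per buffer, covering a
# worst-case 40 ns span needs on the order of 400 elements and taps.
BUFFER_DELAY_PS = 100       # assumed cold-buffer delay in the process
MAX_CHAIN_NS = 40           # worst-case span the chain must cover

elements = MAX_CHAIN_NS * 1000 // BUFFER_DELAY_PS
print(elements)             # 400
```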
Gave this idea a quick go, but I'm seeing problems with the first read byte/word, which seems to be zeroed. Something is not right with the clocks, and I'm not willing to dig into it right now. I'm putting this idea on hold until after the initial release. If it can be figured out reliably/consistently it might be something that can save some instructions. It's tricky because introducing instructions like rdlut, which take 3 clocks, can throw it off, and there needs to be an exact number of clocks to align the data in the different cases of registered/unregistered clock and data buses, plus sysclk/1 and sysclk/2. Lots of cases to get right.
Those just go to the edge-card fingers.
Wouldn't that need an 8x PLL running?
8x PLL or 8-stage PLL VCO (ooops! my bad)?
I'm out of time now. Need to get to work. My first week of rotating shifts. Will be back on day shift next week.
https://www.cypress.com/file/498626/download
I found this about DLLs:
https://rtldigitaldesign.blogspot.com/2016/06/difference-between-pll-and-dll.html