I tried "> Prop_Chk 0 0 0 0"<cr> on CVA9 V27 at all kinds of baud rates, etc., and with some extra > characters as well. Monitored for glitches, anything... not a sausage.
So, the only thing that doesn't work is the dynamically switchable scheme? The other errors seem to have been related to the last byte of the download getting clipped off, right?
Yes, it makes no sense to me why it doesn't work.
Yes, it seems to make no sense at all. Why should a mux that affects only a few non-critical pins cause total failure?
I've been investigating why my SD card routines aren't working at 80MHz but are fine at 20MHz. It seems there is a timing problem when I use a REP loop.
For instance, this does not work at 80MHz:
' SPIRD ( dummy -- dat )
SPIRD       rep     @.end,#8        ' 8 bits
            outnot  sck             ' clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
.end        ret
But this does:
' SPIRD ( dummy -- dat )
SPIRD       mov     R2,#8           ' 8 bits
.L0         outnot  sck             ' clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
            djnz    R2,#.L0
.end        ret
On the scope, the data is ready from the falling edge of the previous clock, so it is very stable for 100ns @ 80MHz before the clock goes high; then the data is read, then the clock goes low. My SD init command, sent as 0 0 CMD, is looking for a response, and instead of reading $01 it ends up reading $00, but at 20MHz it is fine.
UPDATE: It works if I add a nop before taking the clock high, but not if I try to move the testp before the clock.
SPIRD       rep     @.end,#8        ' 8 bits
            nop
            outnot  sck             ' clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
.end        ret
In the non-rep version, there are two extra clock cycles between the second OUTNOT and the first OUTNOT. Does the REP version work if you put a NOP after the RCL (as part of the REP block)?
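For reference, that variant would look something like this (just the REP loop from above with a NOP added after the RCL; an untested sketch):
SPIRD       rep     @.end,#8        ' 8 bits
            outnot  sck             ' clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
            nop                     ' 2 extra clocks before the next rising OUTNOT
.end        ret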
I had just tested with a nop at the start of the loop and added that to my last post, in anticipation of someone asking me that.
Since the data is ready and stable, I can discount the source, which leaves the rep loop to look at. I am going to try unrolling the loop without a rep, just to check that too.
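Unrolled, that would just be the same four instructions written out eight times with nothing in between; a sketch, with only the first two bits shown. Note it keeps the same spacing between the falling OUTNOT and the next rising OUTNOT as the REP version does:
SPIRD       outnot  sck             ' bit 7: clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
            outnot  sck             ' bit 6: clock (low high)
            testp   miso wc
            outnot  sck
            rcl     tos,#1
            '...same four instructions for the remaining six bits...
            ret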
Maybe you are toggling the clock at the same time as actually reading the I/O?
That's right, guys. There are delays on IN and OUT which amount to several cycles. If you look at the ROM_Booter code, you can see a comment in there indicating that the SPI data pin is being sampled from before the clock transition, even though the clock-transition instruction precedes it. I found there was even room for three MORE cycles there. At 80MHz, any marginal timing would be made worse. I'm pretty certain this is the problem.
It is best to locate your sample so that it lands as late as possible before the clock transition, caused by one of your prior instructions, actually occurs at the pin. Well, maybe back it off one clock from there, just to be really safe.
That's a 4-instruction loop, with 2 instructions for toggling the clock and one for the read...
What about 40MHz?
That means the SD card being tested has delays of more than one 80MHz SysCLK?
Are they like SPI flash parts, where faster commands exist, but at the cost of more dummy bytes and less-portable code?
If you assume a 0ns memory, what is the turn-around delay, or 'NOPs needed', for a CLK to READ (pin out to pin in)?
Are you saying 3?
I made a program to test OUT-to-IN feedback time. It takes FIVE (20MHz) or SIX (80MHz) clocks! It's much safer to write code for the 80MHz reality.
'
' Check OUT to IN time
'
con
'           x = 2,  mode = $00      'at 20MHz, x=1 misses high, x=2 catches it
            x = 3,  mode = $FF      'at 80MHz, x=2 misses high, x=3 catches it
dat         org
            clkset  #mode
            drvl    #0              '2!    make pin 0 low
            waitx   #10             '12    give plenty of time
            drvh    #0              '2!    now make pin 0 high
            waitx   #x              '2+x   wait 2+x cycles
            testp   #0 wc           '1?1   sample pin 0 into c, !..? = 2+x+1
            drvc    #32             '2!    write c to led
            jmp     #$
So, if you're running at 20MHz and output a state to a pin, you must sample it 5 clocks later to see the change. At 80MHz, you must sample it 6 clocks later. Again, always code with a six-clock assumption, as it's safer.
This means that if you transition a SPI clock pin to the state in which new data will come out of a connected SPI device, you can reliably sample the data input pin 4 clocks after toggling the clock and still see the data that was coming out BEFORE the clock actually toggled.
The reason there's a clock-cycle difference between 20MHz and 80MHz is that at 20MHz the pin change is registered on the same clock in which it was changed, whereas at 80MHz the pin transition was underway but missed registration on the input circuit.
So you can sample IN bits 4 clocks after a related OUT change and still see the IN state that was before the already-executed OUT-state change.
To give enough time to see the OUT change take effect, you must wait 6 clocks before sampling IN. This will give you coverage at high speed, assuming there is no significant loading on the pin that would delay the transition by a whole clock.
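Applied to a clock-then-sample sequence, the rule works out like this. This is only a sketch, using the sck/miso/tos names from the loop earlier in the thread, and assuming OUTNOT's output event and TESTP's sample point land where DRVH's and TESTP's do in the test program above:
            outnot  sck             '2!    clock edge starts on its way to the pin
            waitx   #3              '2+3   five clocks of padding
            testp   miso wc         '1?1   samples one clock in, !..? = 5+1 = 6 clocks,
                                    '      so c reflects the pin state AFTER the transition
            rcl     tos,#1          '      shift the bit in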
Wow, that's now a lot of clock cycles, and the variance is also a concern.
What will the PAD Ring add to the delays? This is an FPGA-only test, right?
Most code timing will be 2T, unless it uses a WAIT to get a fractional-opcode time, so 6 SysCLKs == 3 opcodes.
- oh, I see you sample 50% of the way into the testp, so that makes code-only timing 2T+1, i.e. +2 opcodes gives 5T and +3 opcodes gives 7T, which has margin over the 6T.
What is the expected value for a 160MHz SysCLK, and how will final silicon compare with the FPGA's added delays?
If the delays can 'add' a whole SysCLK at moderately fast clock speeds, how can users know they are clear of that threshold?
I can see situations where it 'tests fine on the bench', but fails in the field, or across production batches...
Most MCUs have much less turn-around delay; this is from the AVR data - i.e. they need only ONE NOP to read the post-change value.
AVR: "As indicated by the two arrows tpd,max and tpd,min, a single signal transition on the pin will be delayed between ½ and 1½ system clock period depending upon the time of assertion.
When reading back a software assigned pin value, a nop instruction must be inserted as indicated in Figure 25. The out instruction sets the SYNC LATCH signal at the positive edge of the clock. In this case, the delay tpd through the synchronizer is one system clock period."
Ours is longer, and I don't know that it must be, but to meet timing on the FPGA I had to insert flops to cover the interconnect delays.
The variance on the FPGA is pretty understandable, I think. Perhaps with some timing-constraint assignments, I could make it behave at 80MHz like it does at 20MHz. Those paths look like this:
outgoing: register --> logic --> pin
incoming: pin --> logic --> register
On the silicon, those paths will be constrained to meet timing for our in-pad registers that can be enabled. Those registers will add an additional clock cycle in each direction.
A few posts back you wrote ... I am fairly sure you meant 10ns.
Many thanks for working through these timing issues. I need to understand them for the SD card boot code.
BTW, has anyone done a FullDuplexSerial P2 pasm equivalent object?
I wonder if we can use the "synchronous serial receive" of a smartpin to more quickly read SPI data...
Of course, any serious SPI speed work is going to need the smart pins (and maybe the streamer too).
I believe the boot code is avoiding the smart pins, mainly to lower risk (i.e. less of the chip has to work).
I was also wondering about I2C slave operation and sensing of START and STOP, and I wonder what CAN bus speeds will be possible with relatively high delays. I guess it just means higher SysCLK speeds will be needed than might otherwise have been necessary.
CAN bus uses OR-sense arbitration, so when you 'see' a signal that is not what you sent, you release the bus.
I think you missed the information in the timing diagram; here's the lower right side, where the timing is zoomed to 200ns/division (two hundred nanoseconds). You can see that the data is ready from the previous falling clock; about 100ns later the clock goes high, the code reads the data that has long been ready, and then the clock goes low.
It seems a shame that I have to waste time with a nop, and although this code may eventually use a smartpin after tests are complete, there is nonetheless a "gotcha" here that we need to be aware of. It's not as if I'm dealing with metastability issues by trying to read the pin just as it changes, yet the code without the nop is already flaky at 40MHz.
Hmm, if it is also unreliable at 40MHz, how many NOPs might be needed at 120MHz or 160MHz?
Having a large number of SysCLKs of delay is bad enough, but having the number of patch-NOPs needed also be frequency-dependent (which also means PVT-dependent) makes code writing and testing a risky business.
What is more of a concern to me is that we cannot drive the I/O pins like we could on the P1. There are gotchas even without using the smart pins, and that makes the whole P2 soft-peripherals concept a concern. A lot of users will be caught out because, as you know, most don't RTFM. It won't be a pleasant experience like the P1 was/is.
I don't know if there is any solution, but it sure doesn't look good to me.
Once you are aware of it, it's just something you incorporate into your coding.
And you are not going to need unknowable numbers of NOPs. Just follow the cycle-counting guidelines I gave above and everything will be okay, in all circumstances, unless a pin is heavily loaded and unable to transition within one full clock. And remember that it's a matter of clock cycles, not necessarily NOPs. And it's 4 clocks for reading before a transition gets output, and 6 clocks for reading after a transition.
With timing constraints, I'm pretty sure we could get that 6 down to 5 at 80MHz, as logic would dictate.
I sense we're all a little fatigued, as we get to the end of this project.
Some good news: I had a Webex meeting with OnSemi today and we went over ESD strategy. They have determined over the years that dirt-simple works best. We just need diodes for clamps and R-C-driven NMOS devices for trapping high voltages on the power supplies. Very simple. I will modify our schematics accordingly. I love it when "simple" is the best solution.