I tried "> Prop_Chk 0 0 0 0"<cr> on CVA9 V27 at all kinds of baud rates, etc., and with some extra > characters as well. Monitored for glitches, anything... not a sausage.
So, the only thing that doesn't work is the dynamically switchable scheme? The other errors seem to have been related to the last byte of the download getting clipped off, right?
Yes, it makes no sense to me why it doesn't work.
Yes, it seems to make no sense at all. Why should a mux that affects only a few non-critical pins cause total failure?
I've been investigating why my SD card routines aren't working at 80MHz but are fine at 20MHz. It seems there is a timing problem when I use a REP loop.
For instance, this does not work at 80MHz:
' SPIRD ( dummy -- dat )
SPIRD       rep     @.end,#8        ' 8 bits
            outnot  sck             ' clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
.end        ret
But this does:
' SPIRD ( dummy -- dat )
SPIRD       mov     R2,#8           ' 8 bits
.L0         outnot  sck             ' clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
            djnz    R2,#.L0
.end        ret
On the scope, the data is ready from the falling edge of the previous clock, so it is very stable for 100ns @ 80MHz before the clock goes high; then the data is read, then the clock goes low. My SD init command, sent as 0 0 CMD, is looking for a response, and instead of reading $01 it ends up reading $00, but at 20MHz it is fine.
UPDATE: It works if I add a nop before taking the clock high, but not if I try to move the testp before the clock.
SPIRD       rep     @.end,#8        ' 8 bits
            nop
            outnot  sck             ' clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
.end        ret
In the non-rep version, there are two extra clock cycles between the second OUTNOT and the first OUTNOT. Does the REP version work if you put a NOP after the RCL (as part of the REP block)?
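For reference, that variant would look something like this (just the REP loop from above with a NOP added after the RCL; an untested sketch):
SPIRD       rep     @.end,#8        ' 8 bits
            outnot  sck             ' clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
            nop                     ' 2 extra clocks before the next rising OUTNOT
.end        ret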
I had just tested with a nop at the start of the loop and added that to my last post, in anticipation of someone asking me that.
Since the data is ready and stable, I can discount the source, which leaves the rep loop to look at. I am going to try unrolling the loop without a rep, just to check that too.
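Unrolled, that would just be the same four instructions written out eight times with nothing in between; a sketch, with only the first two bits shown. Note it keeps the same spacing between the falling OUTNOT and the next rising OUTNOT as the REP version does:
SPIRD       outnot  sck             ' bit 7: clock (low high)
            testp   miso wc         ' read data from card
            outnot  sck
            rcl     tos,#1          ' shift in msb first
            outnot  sck             ' bit 6: clock (low high)
            testp   miso wc
            outnot  sck
            rcl     tos,#1
            '...same four instructions for the remaining six bits...
            ret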
Maybe you are toggling the clock at the same time as actually reading the I/O?
That's right, guys. There are delays on IN and OUT which amount to several cycles. If you look at the ROM_Booter code, you can see a comment in there indicating that the SPI data pin is being sampled from before the clock transition, even though the clock-transition instruction precedes it. I found there was even room for three MORE cycles there. At 80MHz, any marginal timing would be made worse. I'm pretty certain this is the problem.
It is best to locate your sample so that it lands as late as possible before the clock transition, caused by one of your prior instructions, actually occurs at the pin. Well, maybe back it off one clock from there, just to be really safe.
That's a 4-instruction loop, with 2 instructions for toggling the clock and one for the read...
What about 40MHz?
That means the SD card being tested has delays of more than one 80MHz SysCLK?
Are they like SPI flash parts, where faster commands exist, but at the cost of more dummy bytes and less-portable code?
If you assume a 0ns memory, what is the turn-around delay, or 'NOPs needed', for a CLK to READ (pin out to pin in)?
Are you saying 3?
I made a program to test OUT-to-IN feedback time. It takes FIVE (20MHz) or SIX (80MHz) clocks! It's much safer to write code for the 80MHz reality.
'
' Check OUT to IN time
'
con
'           x = 2,  mode = $00      'at 20MHz, x=1 misses high, x=2 catches it
            x = 3,  mode = $FF      'at 80MHz, x=2 misses high, x=3 catches it
dat         org
            clkset  #mode
            drvl    #0              '2!    make pin 0 low
            waitx   #10             '12    give plenty of time
            drvh    #0              '2!    now make pin 0 high
            waitx   #x              '2+x   wait 2+x cycles
            testp   #0 wc           '1?1   sample pin 0 into c, !..? = 2+x+1
            drvc    #32             '2!    write c to led
            jmp     #$
So, if you're running at 20MHz and output a state to a pin, you must sample it 5 clocks later to see the change. At 80MHz, you must sample it 6 clocks later. Again, always code with a six-clock assumption, as it's safer.
This means that if you transition a SPI clock pin to the state in which new data will come out of a connected SPI device, you can reliably sample the data input pin 4 clocks after toggling the clock and still see the data that was coming out BEFORE the clock actually toggled.
The reason there's a clock-cycle difference between 20MHz and 80MHz is that at 20MHz the pin change is registered on the same clock in which it was changed, whereas at 80MHz the pin transition was underway but missed registration on the input circuit.
So you can sample IN bits 4 clocks after a related OUT change and still see the IN state that was before the already-executed OUT-state change.
To give enough time to see the OUT change take effect, you must wait 6 clocks before sampling IN. This will give you coverage at high speed, assuming there is no significant loading on the pin that would delay the transition by a whole clock.
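Applied to a clock-then-sample sequence, the rule works out like this. This is only a sketch, using the sck/miso/tos names from the loop earlier in the thread, and assuming OUTNOT's output event and TESTP's sample point land where DRVH's and TESTP's do in the test program above:
            outnot  sck             '2!    clock edge starts on its way to the pin
            waitx   #3              '2+3   five clocks of padding
            testp   miso wc         '1?1   samples one clock in, !..? = 5+1 = 6 clocks,
                                    '      so c reflects the pin state AFTER the transition
            rcl     tos,#1          '      shift the bit in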
Wow, that's now a lot of clock cycles, and the variance is also a concern.
What will the PAD Ring add to the delays? This is an FPGA-only test, right?
Most code timing will be 2T, unless it uses a WAIT to get a fractional-opcode time, so 6 SysCLKs == 3 opcodes.
- oh, I see you sample 50% of the way into the testp, so that makes code-only timing 2T+1, i.e. +2 opcodes gives 5T and +3 opcodes gives 7T, which has margin over the 6T.
What is the expected value for a 160MHz SysCLK, and how will final silicon compare with the FPGA's added delays?
If the delays can 'add' a whole SysCLK at moderately fast clock speeds, how can users know they are clear of that threshold?
I can see situations where it 'tests fine on the bench', but fails in the field, or across production batches...
Most MCUs have much less turn-around delay; this is from the AVR data - i.e. they need only ONE NOP to read the post-change value.
AVR: "As indicated by the two arrows tpd,max and tpd,min, a single signal transition on the pin will be delayed between ½ and 1½ system clock period depending upon the time of assertion.
When reading back a software assigned pin value, a nop instruction must be inserted as indicated in Figure 25. The out instruction sets the SYNC LATCH signal at the positive edge of the clock. In this case, the delay tpd through the synchronizer is one system clock period."
Ours is longer, and I don't know that it must be, but to meet timing on the FPGA I had to insert flops to cover the interconnect delays.
The variance on the FPGA is pretty understandable, I think. Perhaps with some timing-constraint assignments, I could make it behave at 80MHz like it does at 20MHz. Those paths look like this:
outgoing: register --> logic --> pin
incoming: pin --> logic --> register
On the silicon, those paths will be constrained to meet timing for our in-pad registers that can be enabled. Those registers will add an additional clock cycle in each direction.
A few posts back you wrote ... I am fairly sure you meant 10ns.
Many thanks for working through these timing issues. I need to understand them for the SD card boot code.
BTW, has anyone done a FullDuplexSerial P2 pasm equivalent object?
I wonder if we can use the "synchronous serial receive" of a smartpin to more quickly read SPI data...
Of course, any serious SPI speed work is going to need the smart pins (and maybe the streamer too).
I believe the boot code is avoiding the smart pins, mainly to lower risk (i.e. less of the chip has to work).
I was also wondering about I2C slave operation and sensing of START and STOP, and I wonder what CAN bus speeds will be possible with relatively high delays. I guess it just means higher SysCLK speeds will be needed than might otherwise have been necessary.
CAN bus uses OR-sense arbitration, so when you 'see' a signal that is not what you sent, you release the bus.
I think you missed the information in the timing diagram; here's the lower right side, where the timing is zoomed to 200ns/division (two hundred nanoseconds). You can see that the data is ready from the previous falling clock; about 100ns later the clock goes high, the code reads the data that has long been ready, and then the clock goes low.
It seems a shame that I have to waste time with a nop, and although this code may eventually use a smartpin after tests are complete, there is nonetheless a "gotcha" here that we need to be aware of. It's not as if I'm dealing with metastability issues by trying to read the pin just as it changes, yet the code without the nop is already flaky at 40MHz.
Hmm, if it is also unreliable at 40MHz, how many NOPs might be needed at 120MHz or 160MHz?
Having a large number of SysCLKs of delay is bad enough, but having the number of patch-NOPs needed also be frequency-dependent (which also means PVT-dependent) makes code writing and testing a risky business.
What is more of a concern to me is that we cannot drive the I/O pins like we could on the P1. There are gotchas even without using the smart pins, and that makes the whole P2 soft-peripherals concept a concern. A lot of users will be caught out because, as you know, most don't RTFM. It won't be a pleasant experience like the P1 was/is.
I don't know if there is any solution, but it sure doesn't look good to me.
Once you are aware of it, it's just something you incorporate into your coding.
And you are not going to need unknowable numbers of NOPs. Just follow the cycle-counting guidelines I gave above and everything will be okay, in all circumstances, unless a pin is heavily loaded and unable to transition within one full clock. And remember that it's a matter of clock cycles, not necessarily NOPs. And it's 4 clocks for reading before a transition gets output, and 6 clocks for reading after a transition.
With timing constraints, I'm pretty sure we could get that 6 down to 5 at 80MHz, as logic would dictate.
I sense we're all a little fatigued, as we get to the end of this project.
Some good news: I had a Webex meeting with OnSemi today and we went over ESD strategy. They have determined over the years that dirt-simple works best. We just need diodes for clamps and R-C-driven NMOS devices for trapping high voltages on the power supplies. Very simple. I will modify our schematics accordingly. I love it when "simple" is the best solution.