sporadically failed boot from SPI flash

jef_vt · 2020-08-29 09:26

Hi,
I have a problem with the Prop 2 not always booting correctly.
I have a W25Q128 SPI flash connected, with all the software on it.
Battery power is always connected to the device, and it is clean, without spikes, as far as I can see: 3.3V and 1.8V.
At shutdown (for the whole device), it goes to sleep to save power. At startup, it does a reboot so the code is always loaded clean from the flash. This worked like a charm with the Prop 1.
With the Prop 2, it starts up perfectly more than 99% of the time. We now have more than 150 devices and have seen at least 6 that failed to reboot.
I managed to capture a failed startup.
The signals seem perfect and the boot checksum is correct (706F7250), but the prop does not continue. I think it has read 1 bit wrong.
Only P61 has a pull-up, so the boot order is: serial window of 100ms, then SPI flash. If SPI flash fails, then a serial window of 60s.

My question: is there a way to re-trigger the read command for flash?
When the prop reads 1 bit wrong, it shuts down and the device does not start up.

jmg · 2020-08-29 09:44

jef_vt wrote: »

Hi,
...
At shutdown (for the whole device), it goes to sleep to save power. at startup, it does a reboot so the code is always clean from the flash. Worked like a charm with prop 1.
With prop 2, more than 99% of the time it starts up perfect. We now have more than 150 devices and have seen at least 6 that failed to reboot.
...

Is this your own PCB layout?
Does a cold start (power ramp) always work OK?
A reboot without a power cycle is not going to fully reset the flash, so maybe that is an issue? (P2 does issue a flash reset, but maybe that does not give 100% coverage from all states?)

Are you saying that the 6 that show issues have approx. 99% boot yield, and the issues are only seen on those 6, while all others are OK (meaning 100% success)?

You could check the RCFAST speed on these, to see if it correlates with failures.
You could connect an edge counter to the SPI_CLK, and check what it gets to in the fail and pass cases.

jef_vt wrote: »

Hi,
My question: is there a way to re-trigger the read command for flash?
When the prop reads 1 bit wrong, it goes in shutdown and the device does not start up.

I think that would need some form of external watchdog. It could look at the SPI pins for some minimum count, or frequency, and if the activity fails to exceed that count, it issues a reset.
Of course, that assumes a 2nd boot attempt works - did you confirm that it does?

Tubular · 2020-08-29 09:46

Hmm, that's the same W25Q128 used by the P2 eval board.

One thought would be to look at adding a hardware watchdog, which is cancelled out by a successful boot, but it sounds like you already have a lot of hardware made.

Is the SPI bus shared by anything other than the flash chip?

jef_vt · 2020-08-29 10:22

jmg wrote: »

jef_vt wrote: »

Hi,
...
At shutdown (for the whole device), it goes to sleep to save power. at startup, it does a reboot so the code is always clean from the flash. Worked like a charm with prop 1.
With prop 2, more than 99% of the time it starts up perfect. We now have more than 150 devices and have seen at least 6 that failed to reboot.
...

Is this your own PCB layout ?
Does a cold start (power ramp) always work ok ?
A reboot without power cycle, is not going to fully reset the flash, so maybe that is an issue ? (P2 does issue a flash reset, but maybe that is not 100% coverage from all states?)

Are you saying, of the 6 that show issues, they have appx 99% boot yield, and the issues are only seen on those 6, all others are ok (meaning 100% success) ?

You could check the RCFAST speed on these, to see if it correlates with failures ?
You could connect a edge counter to the SPI_CLK, and check what that gets to on fail and pass cases ?

jef_vt wrote: »

Hi,
My question: is there a way to re-trigger the read command for flash?
When the prop reads 1 bit wrong, it goes in shutdown and the device does not start up.

I think that would need some form of external watchdog. It could look at SPI pins for some minimum count, or frequency, and if it fails to exceed that count, it issues a reset.
Of course, that assumes a 2nd boot attempt works - did you confirm that does ?

The PCB is my own design.
Cold start always works, as far as I know.
I have had problems in the past with a not-clean shutdown sequence. It now shuts down all communication before reboot.
I can see no difference in timing and data between a successful reboot and an unsuccessful reboot. The only difference is that, once the checksum is done, the prop does not continue to start up.

And you assumed correctly: all other devices have a 100% success rate at reboot. The hardware is identical, and the 6 devices have only experienced it once. After a hardware reset, it boots up.
But the reset button is not available to the customer.

Tubular wrote: »

Hmm thats the same W25Q128 used by the P2 eval board.

One thought would be to look at adding a hardware watchdog, which is cancelled out by successful boot, but it sounds like you already have a lot of hardware made

Is the SPI bus shared by anything other than the flash chip?

The PCBs are sold worldwide. It is only because of the 'huge' number of reboots that the lottery winners are calling.

The SPI bus is only connected to the flash.

I am thinking that a hardware watchdog is the only way.

evanh · 2020-08-29 10:29

jef_vt wrote: »

I managed to capture a failed startup.
The signals seems perfect and the boot checksum is correct (706F7250), but the prop does not continue. I think it has read 1 bit wrong.
Only P61 has pull-up. So Serial window of 100ms, then SPI flash. If SPI flash fails then serial window of 60s.

I would recommend more diagnostics be added.

One possibility is the ever-present PLL switching glitch. It won't be an issue if the sysclock is only set once per power-up. But, given the ease of specifying a desired clock frequency and the prevalence of doing so in forum examples, there's always the possibility the sysclock's PLL is being set twice during boot-up. If this is desired, then there is a safe method that has an extra step - https://forums.parallax.com/discussion/comment/1466702/#Comment_1466702

PS: The same applies when coming out of a low power mode, say. Or, more precisely, when entering RCSLOW for low power. The attempt to stop the PLL can cause a crash if not done right.

Rayman · 2020-08-29 10:43

I think @evanh is onto something...
I don’t think many people have worked with going into RC mode and then back again...
Maybe try just going to a low clock speed instead?
Or, dig up that info on dealing with the PLL...

By sleep, you mean RC mode, right?
How are you doing the reboot?

evanh · 2020-08-29 10:46

Slowing the PLL has the same risk as switching to RCFAST or RCSLOW. It's the PLL at XDIVP = 1 that has the issue, and that just happens to be the most used config.

EDIT: Experimenting has shown that retaining just XDIVP = 1 across PLL adjustments seems to be safe ... but Chip has indicated that this may not hold true across production batches of the prop2 - https://forums.parallax.com/discussion/comment/1466494/#Comment_1466494

jef_vt · 2020-08-29 11:57

I like the idea that a fault in the PLL is the cause.
That means a software patch can fix this.

Edit: important info: the crystal I use is a 10MHz Abracon: ABM7-10.0000HZ-D2Y-T

some defines:
  mode_20mhz   = %0000_0001_000000_0000000111_1111_10_00
  mode_20mhzLP = %0000_0000_000000_0000000000_0000_00_00
  mode_80mhz   = %0000_0001_000000_0000000111_1111_10_11
  mode_20khz   = %0000_0000_000000_0000000000_0000_00_01
  mode_reset   = %0001_0000_000000_0000000000_0000_00_00
'                         | |||||| |||||||||| |||| || 00 := 20MHZ
'                         | |||||| |||||||||| |||| || 01 := 20kHz
'                         | |||||| |||||||||| |||| || 10 := XI
'                         | |||||| |||||||||| |||| || 11 := PLL
'                         | |||||| |||||||||| |||| 00 := no XI
'                         | |||||| |||||||||| |||| 01 := no F
'                         | |||||| |||||||||| |||| 10 := 15pF
'                         | |||||| |||||||||| |||| 11 := 30pF
'                         | |||||| |||||||||| 1111 := division of VCO to clk
'                         | |||||| 0000000111 := division of VCO to PLL
'                         | 000000 := division of XI
'                         1 := PLL
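
For anyone checking these constants by hand, here is a minimal Python sketch that unpacks a clock-mode word into the fields the comment column points at. The field positions simply follow the 4_4_6_10_4_2_2 grouping of the defines. The helper name `decode_clkmode` is made up for illustration, the RC frequencies are the nominal values from the labels, and the field encodings (divider and multiplier stored as value minus one, %1111 meaning divide-by-1) are assumptions that should be checked against the P2 documentation.

```python
# Hypothetical helper, not part of any Parallax tool: unpack a HUBSET
# clock-mode word using the 4_4_6_10_4_2_2 grouping of the defines.
RCFAST_HZ = 20_000_000  # nominal value from the "00 := 20MHZ" label
RCSLOW_HZ = 20_000      # nominal value from the "01 := 20kHz" label

MODE_20MHZ   = 0b0000_0001_000000_0000000111_1111_10_00
MODE_20MHZLP = 0b0000_0000_000000_0000000000_0000_00_00
MODE_80MHZ   = 0b0000_0001_000000_0000000111_1111_10_11
MODE_20KHZ   = 0b0000_0000_000000_0000000000_0000_00_01
MODE_RESET   = 0b0001_0000_000000_0000000000_0000_00_00


def decode_clkmode(mode, xtal_hz):
    """Return the clock-mode fields and the resulting nominal sysclock."""
    pppp = (mode >> 4) & 0xF
    f = {
        "reset": (mode >> 28) & 1,             # the bit set in mode_reset
        "pll_on": (mode >> 24) & 1,            # "1 := PLL"
        "xi_div": ((mode >> 18) & 0x3F) + 1,   # assumed: field holds divisor - 1
        "vco_mul": ((mode >> 8) & 0x3FF) + 1,  # assumed: field holds multiplier - 1
        # assumed: %1111 means divide by 1, otherwise divide by 2 * (field + 1)
        "vco_div": 1 if pppp == 0b1111 else 2 * (pppp + 1),
        "xi_load": (mode >> 2) & 0b11,         # 00 no XI, 01 no F, 10 15pF, 11 30pF
        "source": mode & 0b11,                 # 00 RCFAST, 01 RCSLOW, 10 XI, 11 PLL
    }
    if f["source"] == 0b11:
        f["sysclk_hz"] = xtal_hz * f["vco_mul"] // (f["xi_div"] * f["vco_div"])
    elif f["source"] == 0b10:
        f["sysclk_hz"] = xtal_hz
    elif f["source"] == 0b01:
        f["sysclk_hz"] = RCSLOW_HZ
    else:
        f["sysclk_hz"] = RCFAST_HZ
    return f


if __name__ == "__main__":
    for name, mode in [("mode_20mhz", MODE_20MHZ), ("mode_20mhzLP", MODE_20MHZLP),
                       ("mode_80mhz", MODE_80MHZ), ("mode_20khz", MODE_20KHZ)]:
        print(name, decode_clkmode(mode, 10_000_000))
```

Under those assumptions, mode_20mhz selects the RC oscillator while leaving the PLL and crystal settings enabled, and mode_20mhzLP selects the same oscillator with everything else switched off.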

at startup I do this:
pub Main
    clkset(mode_20mhz, 20_000_000)
    waitcnt(CNT+clkfreq/100)
    clkset(mode_80mhz, 80_000_000)


To go to low power/sleep I do this:
    asm
    HUBSET ##mode_20mhz
    endasm
    waitcnt(CNT+20_000_000/100)
    asm
    HUBSET ##mode_20mhzLP
    endasm

For some background tasks in sleep, I switch between 2 clocks depending on the processing power needed:
        asm
        HUBSET ##mode_20khz
        endasm
and
        asm
        HUBSET ##mode_20mhzLP
        endasm


For a reset I do this:
    asm
    HUBSET ##mode_reset
    endasm

evanh · 2020-08-29 12:08

Oh, the Spin clkset() function will be doing it safely. If you use that in all cases, instead of the inlined assembly HUBSET, then the issue is taken care of for you. Presumably clkset() supports RCFAST, RCSLOW and RESET.

PS: And you only need a single clkset() at any one place. It does the multi-part delayed sequence internally itself.

cgracey · 2020-08-29 15:33

Jef_vt, would it be possible for you to post the code sections where the clock is getting changed? Then, we could probably tell right away if this is a PLL issue.

evanh · 2020-08-29 21:37

Chip,
He has those snippets posted with description just above. You can see he's using clkset() at boot time to get 80 MHz with XDIVP = 1, that one is fine. But then dropping back to 20 MHz still with XDIVP = 1 for lower power using just a HUBSET. That's one problem there although probably not the one that crashes.

Then followed by an RCFAST using HUBSET - that's the most likely crash point.

The other cases around RCSLOW are likely okay because the prop is no longer in PLL operation.

cgracey · 2020-08-29 21:57

evanh wrote: »

Chip,
He has those snippets posted with description just above. You can see he's using clkset() at boot time to get 80 MHz with XDIVP = 1, that one is fine. But then dropping back to 20 MHz still with XDIVP = 1 for lower power using just a HUBSET. That's one problem there although probably not the one that crashes.

Then followed by an RCFAST using HUBSET - that's the most likely crash point.

The other cases around RCSLOW are likely okay because the prop is no longer in PLL operation.

Thanks. I didn't see that there was more data in that window.

He needs to use CLKSET(mode,freq) in Spin2. It takes care of everything.

Here is the code from the Spin2 interpreter that executes for CLKSET:

'
'
' CLKSET(clkmode,clkfreq)
'
clkset_		mov	z,x			'get clkfreq into z

		setq	#2-1			'get clkmode into y
		rdlong	x,--ptra		'get stack top into x

		rdlong	w,#@clkmode_hub		'get current clkmode to avoid (PPPP = %1111) clock glitch
		andn	w,#%11			'switch to 20MHz while maintaining old pll/xtal settings
		hubset	w

clkset_init	test	y,#%10		wz	'if new pll/xtal settings then switch to 20MHz for 10ms
	if_nz	mov	w,y			'..while new pll/xtal settings take effect
	if_nz	andn	w,#%11
	if_nz	hubset	w
	if_nz	wrlong	##20_000_000,#@clkfreq_hub
	if_nz	waitx	##20_000_000/100

		hubset	y			'now switch to new settings

		setq	#2-1			'update clkmode and clkfreq
	_ret_	wrlong	y,#@clkmode_hub
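
To make the ordering in that routine easier to follow, here is a rough Python model of it. It only lists the HUBSET writes and the wait that the posted code would issue for a given current and new mode. The function name `clkset_steps` and the tuple format are invented for illustration, and this models the interpreter code shown here, not necessarily what another compiler's clkset does.

```python
# Rough model of the CLKSET routine quoted above; names are hypothetical.
RCFAST_HZ = 20_000_000  # the routine assumes 20 MHz while on the RC oscillator

MODE_20MHZLP = 0b0000_0000_000000_0000000000_0000_00_00
MODE_80MHZ   = 0b0000_0001_000000_0000000111_1111_10_11


def clkset_steps(current_mode, new_mode):
    """List the HUBSET writes and waits the routine would issue, in order."""
    steps = []
    # andn w,#%11 / hubset w: drop to RCFAST, keep the old PLL/xtal settings
    steps.append(("hubset", current_mode & ~0b11))
    # test y,#%10 wz: only modes that select XI or PLL take the slow path
    if new_mode & 0b10:
        steps.append(("hubset", new_mode & ~0b11))  # new PLL/xtal settings, still RCFAST
        steps.append(("waitx", RCFAST_HZ // 100))   # about 10 ms at 20 MHz
    steps.append(("hubset", new_mode))              # final switch to the new settings
    return steps


if __name__ == "__main__":
    for step in clkset_steps(MODE_20MHZLP, MODE_80MHZ):
        print("power-up -> 80 MHz:", step)
    for step in clkset_steps(MODE_80MHZ, MODE_20MHZLP):
        print("80 MHz -> RCFAST:", step)
```

In this model, going from the 80 MHz PLL mode to plain RCFAST produces two writes with no wait between them: first the old mode with the clock source bits cleared, then the new mode.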

evanh · 2020-08-29 22:11

cgracey wrote: »

		...
		setq	#2-1			'update clkmode and clkfreq
	_ret_	wrlong	y,#@clkmode_hub

Chip,
What happens with the hubRAM priorities there? The FIFO is flushed and waiting for its refill slot. And WRLONG is also waiting for its slot. Does WRLONG get pushed aside no matter what? Or can the WRLONG fit in a hubRAM write or two while the FIFO is still waiting for its slot to arrive?

EDIT: I'm presuming hubexec of course. Treat the question as hypothetical if not hubexec.

evanh · 2020-08-29 22:32

I'm guessing the former is correct. On that assumption, can a write occur before the FIFO butts in?

cgracey · 2020-08-30 00:04

evanh wrote: »

I'm guessing the former is correct. On that assumption, can a write occur before the FIFO butts in?

RDxxxx/WRxxxx can slip in amid FIFO activity if the FIFO cannot use that particular slot.

evanh · 2020-08-30 00:09

Oh, it's the latter then. Right, I think I remember you saying that in the past too. Thanks.

jmg · 2020-08-31 01:06

jef_vt wrote: »

and correctly assumed; all other devices have 100% succes rate at reboot. Hardware is identical and the 6 devices have only expirienced it once. After hardware reset, it boots up.
But the reset button is not available to the customer.

How many reboots per unit have you had so far (roughly) ?

jef_vt wrote: »

the PCB's are sold worldwide. Only by the 'huge' number of reboots, the lottery winners are calling.
SPI bus is only connected to the flash.
I am thinking that a hardware watchdog is the only way.

If the clock-handover steps detailed above do not work, it may need a HW WDOG.
FYI, there is a WDOG in the UB3 on the P2D2, and that can map to (e.g.) Flash.DO, so that a failed boot would auto-reset/retry.
A detail that could need attention is the WDOG's tolerance of the shortest pulses.
E.g. I find the STWD100NYWY3F specs 1us min, and ~100ns pulses are ignored, so Flash.DO may just be OK as a retrigger during the boot process (but not Flash.CK)?
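
The pulse-width concern can be illustrated with a toy Python model of a pulse-retriggered watchdog. This is only a sketch: the function name, the pulse lists and the 100ms timeout are all placeholders, and only the 1us minimum and the ~100ns runt width come from the figures quoted above.

```python
# Toy model of a pulse-retriggered watchdog; all names and the timeout are
# hypothetical. Times are integer nanoseconds to avoid float rounding.
MIN_PULSE_NS = 1_000        # 1us minimum kick width, as quoted above
TIMEOUT_NS = 100_000_000    # placeholder timeout, NOT taken from any datasheet


def first_timeout(pulses, observe_ns, timeout_ns=TIMEOUT_NS, min_pulse_ns=MIN_PULSE_NS):
    """Return the time (ns) at which the watchdog would fire, or None.

    pulses: list of (start_ns, width_ns) tuples sorted by start time.
    """
    last_kick = 0  # the watchdog is assumed to start counting at t = 0
    for start, width in pulses:
        if start - last_kick > timeout_ns:
            return last_kick + timeout_ns   # expired before this pulse arrived
        if width >= min_pulse_ns:
            last_kick = start               # wide enough: counts as a kick
    if observe_ns - last_kick > timeout_ns:
        return last_kick + timeout_ns       # went quiet after the last kick
    return None


healthy = [(t, 2_000) for t in range(0, 500_000_000, 10_000_000)]  # 2us pulses
runts = [(t, 100) for t, _ in healthy]      # same timing, 100ns pulses
stalled = healthy[:5]                       # activity stops after 40ms

if __name__ == "__main__":
    for name, pulses in [("healthy", healthy), ("runts", runts), ("stalled", stalled)]:
        print(name, first_timeout(pulses, observe_ns=500_000_000))
```

The point of the `runts` case is that a line which toggles constantly can still let the watchdog expire if every pulse is narrower than the minimum kick width.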

jef_vt · 2020-09-01 18:37

By my calculation, there have been about 20,000 reboots in total across all devices.
I asked around for the total number of failed reboots, and it is now about 20.
I heard some devices have more problems than others, but I am not sure about that information.
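
As a rough cross-check of those figures, a few lines of Python give the per-reboot failure rate and the average number of reboots per device. The numbers are the estimates from this thread (the device count is the "more than 150" mentioned at the start), so the results are only order-of-magnitude values.

```python
# Back-of-envelope figures using the rough estimates from this thread.
total_reboots = 20_000   # estimated reboots across all devices
failed_reboots = 20      # estimated failed reboots reported so far
devices = 150            # "more than 150 devices", so a lower bound

# Integer arithmetic keeps the results exact.
failures_per_million = failed_reboots * 1_000_000 // total_reboots
avg_reboots_per_device = total_reboots // devices

if __name__ == "__main__":
    print("failures per million reboots:", failures_per_million)
    print("average reboots per device:", avg_reboots_per_device)
```

That works out to roughly 1 failure per 1,000 reboots, with each device having rebooted a bit over a hundred times on average.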

For the HW WDOG, I think I will make a circuit with components already used on the board. The number of empty places on the pick-and-place machine is... well... not much.
But thanks for the info! I will surely check the design and use it as a guideline.

As I am now working on the firmware to change all HUBSET to clkset, I have found something strange.
I can't change from 80MHz to 20MHzLP with clkset.

some defines:

  mode_20mhz   = %0000_0001_000000_0000000111_1111_10_00 '20MHZ
  mode_20mhzLP = %0000_0000_000000_0000000000_0000_00_00 '20MHZ  
  mode_80mhz   = %0000_0001_000000_0000000111_1111_10_11 '80MHZ
'                         | |||||| |||||||||| |||| || 00 := 20MHZ
'                         | |||||| |||||||||| |||| || 01 := 20kHz
'                         | |||||| |||||||||| |||| || 10 := XI
'                         | |||||| |||||||||| |||| || 11 := PLL
'                         | |||||| |||||||||| |||| 00 := no XI
'                         | |||||| |||||||||| |||| 01 := no F
'                         | |||||| |||||||||| |||| 10 := 15pF
'                         | |||||| |||||||||| |||| 11 := 30pF
'                         | |||||| |||||||||| 1111 := division of VCO to clk
'                         | |||||| 0000000111 := division of VCO to PLL
'                         | 000000 := division of XI
'                         1 := PLL

at startup (this works)
  clkset(mode_80mhz, 80_000_000)

... boring firmware stuff

at 'sleep' (this does not work)
  clkset(mode_20mhzLP, 20_000_000)

at 'sleep' (this works)
  clkset(mode_20mhz, 20_000_000)
  waitcnt(CNT+20_000_000/100)
  clkset(mode_20mhzLP, 20_000_000)

Anyone have a wild guess?
I use the command-line fastspin compiler v4.0.4.
I know it's not the newest version, but a newer compiler broke some code, so I kept this version.

msrobots · 2020-09-01 20:03

How about making a reset pin available to the user, so as to avoid cutting all the power to get it running again?

It seems to be a rare event, with 20 out of 20,000, and the concept of reset buttons is - I guess - pretty common knowledge. Even the disposal thing in my kitchen sink has one.

As a wild guess, maybe switch to RCSLOW without PLL and then switch to 20MHz?

Enjoy!

Mike

jmg · 2020-09-01 20:45

jef_vt wrote: »

as I am now working on the firmware to change all hubset to clkset, I have found something strange.
I can't change from 80MHz to 20MHzLP with clkset.

anyone a wild gues?

Maybe that's too many bits changed? It may be that the PLL OFF bit acts before the clock handover, and it may spawn a runt pulse?
Clean clock handover usually takes a little time, as it waits for next-edges before acting.

You could try a change of clock source only, with the PLL bit still enabled, and see if that helps, then try how much delay is needed before the PLL bit is cleared?

evanh · 2020-09-02 04:18

jef_vt wrote: »

I use command line fastspin compiler v4.0.4
I know it's not the newest version, but a newer compiler broke some code, so I kept this version.

Eric went through a few changes for inlined assembly not that long ago. It might be okay for you again now.

evanh · 2020-09-02 04:23

jef_vt wrote: »

at 'sleep' (this does not work)
  clkset(mode_20mhzLP, 20_000_000)

at 'sleep (this works)
  clkset(mode_20mhz, 20_000_000)
  waitcnt(CNT+20_000_000/100)
  clkset(mode_20mhzLP, 20_000_000)

anyone a wild gues?

One notable difference is that any trailing serial output will be corrupted in the first case because 20 MHz is too far off the real frequency. RCFAST is somewhere around 24-25 MHz at 20 °C. And it'll go even faster at lower temperatures.

EDIT: And serial input may stop functioning - if that is used for wakeup for example.
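
To put a number on how far off the serial timing would be, here is a small Python sketch. It assumes the firmware derives its bit period from the 20 MHz passed to clkset while the oscillator really runs at 24-25 MHz. The 115200 baud rate and the function names are only examples, not taken from the thread.

```python
# Baud-rate error when the bit period is computed for the wrong clock.
# BAUD and the function names are illustrative assumptions.
BAUD = 115_200
ASSUMED_CLK_HZ = 20_000_000   # the frequency passed to clkset()


def actual_baud(real_clk_hz, assumed_clk_hz=ASSUMED_CLK_HZ, baud=BAUD):
    """Baud rate seen on the wire for a given real oscillator frequency."""
    clocks_per_bit = round(assumed_clk_hz / baud)  # what the firmware would use
    return real_clk_hz / clocks_per_bit


def baud_error(real_clk_hz, assumed_clk_hz=ASSUMED_CLK_HZ, baud=BAUD):
    """Relative error of the actual baud rate versus the intended one."""
    return actual_baud(real_clk_hz, assumed_clk_hz, baud) / baud - 1.0


if __name__ == "__main__":
    for real in (24_000_000, 25_000_000):
        print(real, "Hz -> error", round(baud_error(real) * 100, 1), "%")
```

With these example numbers the error comes out at roughly 20 to 25 percent, far beyond the few percent that asynchronous serial normally tolerates.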

Tubular · 2020-09-02 04:44

Being slightly pedantic here, but...
From earlier experiments with the glob-top P2D2, RCFAST actually had a slight negative temp coefficient (but was very stable).
RCSLOW has a fairly strong positive temperature coefficient, like you describe.
Also, we've seen several P2s in the "VGA club" - above the VGA 25.175 MHz dot clock.

evanh · 2020-09-02 04:45

Hehe, good to know. I just assumed.

rogloh · 2020-09-03 06:43

@Tubular When I tried doing VGA with RCFAST a couple of weeks back, I had some pretty nasty sync jitter issues that made it pretty unusable. Just letting you know. Now, this was with the LCD monitor. I didn't try the VGA monitor; I suspect it would probably cause it lots of wobbles.

Tubular · 2020-09-03 07:18

OK, good to know. I don't think I'd use it without a crystal tbh, just a fun curiosity for a minimal system.

Rayman · 2020-09-03 10:05

How much power do you save by turning off the PLL? I think I’d try just leaving the PLL on all the time if having issues...

evanh · 2020-09-04 04:37

I'm measuring about 35 µA more on VDD and 980 µA more on VIO with the PLL at 25 MHz vs RCFAST.

EDIT: Second attempt netted 42 µA more on VDD and 967 µA more on VIO, see https://forums.parallax.com/discussion/comment/1505575/#Comment_1505575
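
Converted to power, those currents can be tallied with a short Python sketch. The supply voltages are an assumption (nominal 1.8V on VDD and 3.3V on VIO); the actual rail voltages used for the measurement are not stated here.

```python
# Extra power drawn with the PLL running, from the measured extra currents.
# Rail voltages are ASSUMED nominal values, not measured ones.
VDD_MV = 1_800   # assumed core supply, millivolts
VIO_MV = 3_300   # assumed I/O supply, millivolts


def extra_power_nw(idd_ua, iio_ua):
    """Microamps times millivolts gives nanowatts, so integers stay exact."""
    return idd_ua * VDD_MV + iio_ua * VIO_MV


first_attempt_nw = extra_power_nw(35, 980)
second_attempt_nw = extra_power_nw(42, 967)

if __name__ == "__main__":
    print("first attempt:", first_attempt_nw / 1_000_000, "mW")
    print("second attempt:", second_attempt_nw / 1_000_000, "mW")
```

Either way it is about 3.3 mW, almost all of it on the VIO side.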

Rayman · 2020-09-04 09:50

That’s a lot on VIO...
Wonder why...

Tubular · 2020-09-04 10:39

I think the PLL circuit pulls its power from VIO2831.

Rayman · 2020-09-04 12:58

In that case I guess going to RCFAST makes sense, if it can be done reliably.
