sporadically failed boot from SPI flash
jef_vt
Posts: 23
in Propeller 2
Hi,
I have a problem with prop 2 booting correctly.
I have a SPI flash W25Q128 connected with all the software on it.
Battery power is always connected to the device, and it is clean without spikes as far as I can see. 3.3V and 1.8V.
At shutdown (for the whole device), it goes to sleep to save power. At startup, it does a reboot so the code is always a clean copy from the flash. Worked like a charm with prop 1.
With prop 2, more than 99% of the time it starts up perfectly. We now have more than 150 devices and have seen at least 6 that failed to reboot.
I managed to capture a failed startup.
The signals seem perfect and the boot checksum is correct (706F7250), but the prop does not continue. I think it has read 1 bit wrong.
Only P61 has pull-up. So Serial window of 100ms, then SPI flash. If SPI flash fails then serial window of 60s.
My question: is there a way to re-trigger the read command for flash?
When the prop reads 1 bit wrong, it goes in shutdown and the device does not start up.
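For anyone wanting to repeat the checksum check on a logic analyzer capture, here is a minimal Python sketch. It assumes (my reading of the P2 boot documentation, not something stated in this thread) that the ROM loader sums the first 256 little-endian longs read from flash and expects 706F7250, whose bytes spell "Prop". The helper names and the idea of comparing the capture against a reference image are illustrative only.

```python
import struct

PROP_MAGIC = 0x706F7250  # value quoted in the post; little-endian bytes are "Prop"
BOOT_LONGS = 256         # assumption: boot block is the first 256 longs (1024 bytes)


def boot_block_sum(block: bytes) -> int:
    """Sum the boot block as little-endian 32-bit longs, modulo 2**32."""
    if len(block) != BOOT_LONGS * 4:
        raise ValueError("expected exactly %d bytes" % (BOOT_LONGS * 4))
    return sum(struct.unpack("<%dI" % BOOT_LONGS, block)) & 0xFFFFFFFF


def checksum_ok(block: bytes) -> bool:
    """True if the block sums to the expected magic value."""
    return boot_block_sum(block) == PROP_MAGIC


def diff_bits(reference: bytes, captured: bytes):
    """List (byte_offset, xor_mask) for every byte that differs."""
    return [(i, a ^ b) for i, (a, b) in enumerate(zip(reference, captured)) if a != b]
```

Comparing the bytes decoded from the capture against the image that was programmed into the flash (diff_bits) would show whether the data on the bus really was identical. It cannot show what the P2 itself sampled, only what the flash sent.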
Comments
Does a cold start (power ramp) always work ok ?
A reboot without a power cycle is not going to fully reset the flash, so maybe that is an issue ? (P2 does issue a flash reset, but maybe that does not cover 100% of all states?)
Are you saying that the 6 that show issues have approx. 99% boot yield, and the issues are only seen on those 6, while all others are ok (meaning 100% success) ?
You could check the RCFAST speed on these, to see if it correlates with failures ?
You could connect an edge counter to the SPI_CLK, and check what count it reaches in fail and pass cases ?
I think that would need some form of external watchdog. It could look at SPI pins for some minimum count, or frequency, and if it fails to exceed that count, it issues a reset.
Of course, that assumes a 2nd boot attempt works - did you confirm that does ?
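The decision logic of such a watchdog can be sketched in a few lines of Python. This is only a model of the idea described above (count SPI edges in a window after reset, and reset again if the count is too low). The window length, the minimum edge count and the retry limit are hypothetical parameters that would need to be chosen from real captures.

```python
def needs_reset(edge_times_s, window_s, min_edges):
    """True if fewer than min_edges SPI edges were seen inside the window.

    edge_times_s: edge timestamps in seconds, measured from reset release.
    """
    seen = sum(1 for t in edge_times_s if 0.0 <= t < window_s)
    return seen < min_edges


def supervise(boot_attempts, window_s, min_edges, max_resets):
    """Walk through successive boot attempts (each a list of edge timestamps).

    Returns the number of resets issued before a boot passed,
    or None if the retry limit was hit or the attempts ran out.
    """
    resets = 0
    for edges in boot_attempts:
        if not needs_reset(edges, window_s, min_edges):
            return resets
        if resets == max_resets:
            break
        resets += 1
    return None
```

Note that in the failure described in this thread the flash read itself looks complete, so a pure edge count on the SPI clock might not distinguish pass from fail. A signal that only appears after the application starts would be a safer thing to watch.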
One thought would be to look at adding a hardware watchdog, which is cancelled out by successful boot, but it sounds like you already have a lot of hardware made
Is the SPI bus shared by anything other than the flash chip?
The PCB is my own design.
Cold start always works as far as I know.
I have had problems in the past with a not-clean shutdown sequence. It now shuts down all communication before reboot.
I can see no difference in timing and data between a successful and an unsuccessful reboot. Only, once the checksum is complete, the prop does not continue to start up.
And you assumed correctly: all other devices have a 100% success rate at reboot. Hardware is identical and the 6 devices have only experienced it once. After a hardware reset, it boots up.
But the reset button is not available to the customer.
The PCBs are sold worldwide. It is only because of the 'huge' number of reboots that the lottery winners are calling.
SPI bus is only connected to the flash.
I am thinking that a hardware watchdog is the only way.
One possibility is the ever present PLL switching glitch. It won't be an issue if only setting the sysclock once per power up. But, given the ease of specifying a desired clock frequency and the prevalence of doing so in forum examples, there's always the possibility the sysclock's PLL is being set twice during boot up. If this is desired then there is a safe method that has an extra step - https://forums.parallax.com/discussion/comment/1466702/#Comment_1466702
PS: Same applies if coming out of a low power mode, say. Or more precisely, when entering RCSLOW for low power. The attempt to stop the PLL can cause a crash if not done right.
I don’t think many people have worked with going into rc mode and then back again...
Maybe try just going to a low clock speed instead?
Or, dig up that info on dealing with the pll...
By sleep, you mean rc mode, right?
How are you doing the reboot?
EDIT: Experimenting has shown that retaining just XDIVP = 1 across PLL adjustments seems to be safe ... but Chip has indicated that this may not hold true across production batches of the prop2 - https://forums.parallax.com/discussion/comment/1466494/#Comment_1466494
That means a software patch can fix this.
edit: important info: the crystal I use is a 10MHz Abracon: ABM7-10.0000HZ-D2Y-T
PS: And you only need a single clkset() at any one place. It does the multi-part delayed sequence internally itself.
He has those snippets posted with description just above. You can see he's using clkset() at boot time to get 80 MHz with XDIVP = 1, that one is fine. But then dropping back to 20 MHz still with XDIVP = 1 for lower power using just a HUBSET. That's one problem there although probably not the one that crashes.
Then followed by an RCFAST using HUBSET - that's the most likely crash point.
The other cases around RCSLOW are likely okay because the prop is no longer in PLL operation.
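To make the ordering concrete, here is a small Python model of the multi-step handover as I understand it from the posts linked above: never change PLL settings while running from the PLL, so drop to RCFAST first with the old PLL left running, then load the new settings, wait, and only then switch the clock source. The only hardware detail assumed is that the low two bits of the clock mode select the source (%00 = RCFAST, %01 = RCSLOW). The 10 ms settle time and the mode values in the usage note are placeholders, not the values from the poster's firmware.

```python
SOURCE_MASK = 0b11  # assumption: low two bits of the clock mode select the source


def safe_clock_steps(old_mode, new_mode, settle_ms=10):
    """Return the ordered steps for moving from old_mode to new_mode.

    Each step is ("hubset", value) or ("wait_ms", milliseconds).
    """
    steps = []
    # Currently on crystal/PLL: switch the source to RCFAST, old PLL still running.
    if old_mode & SOURCE_MASK >= 0b10:
        steps.append(("hubset", old_mode & ~SOURCE_MASK))
    # Target uses crystal/PLL: load new settings while on RCFAST, let them settle.
    if new_mode & SOURCE_MASK >= 0b10:
        steps.append(("hubset", new_mode & ~SOURCE_MASK))
        steps.append(("wait_ms", settle_ms))
    # Finally select the requested clock source.
    steps.append(("hubset", new_mode))
    return steps
```

For example, safe_clock_steps(0x010007FB, 0x00) gives two HUBSETs (RCFAST with the PLL still on, then PLL off) rather than the single HUBSET identified above as the likely crash point. In Spin2 a single clkset() call is reported in this thread to perform this kind of sequence internally.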
Thanks. I didn't see that there was more data in that window.
He needs to use CLKSET(mode,freq) in Spin2. It takes care of everything.
Here is the code from the Spin2 interpreter that executes for CLKSET:
Chip,
What happens with the hubRAM priorities there? The FIFO is flushed and waiting for its refill slot. And WRLONG is also waiting for its slot. Does WRLONG get pushed aside no matter what? Or can the WRLONG fit in a hubRAM write or two while the FIFO is still waiting for its slot to arrive?
EDIT: I'm presuming hubexec of course. Treat the question as hypothetical if not hubexec.
RDxxxx/WRxxxx can slip in amid FIFO activity if the FIFO cannot use that particular slot.
If the clock-handover steps detailed above do not work, it may need a HW WDOG.
FYI, there is a WDOG in the UB3 on P2D2, and that can map to (eg) Flash.DO for example, so that a failed boot would auto-reset/retry.
A detail that could need attention, is the WDOG tolerance to shortest pulses.
eg I find STWD100NYWY3F specs 1us min and ~100ns pulses are ignored, so Flash.DO may just be ok as a retrigger during boot process (but not Flash.CK) ?
I asked around for the total of failed reboots and it is now about 20.
I heard some devices have more problems than others, but I am not sure about that information.
For HW WDOG, I think I will make a circuit with components already used on the board. The number of empty places on the pick and place machine is ... well... not much.
But thanks for the info! I surely will check the design and use it as a guideline.
As I am now working on the firmware to change all hubset to clkset, I have found something strange.
I can't change from 80MHz to 20MHz LP with clkset.
Anyone have a wild guess?
I use command line fastspin compiler v4.0.4
I know it's not the newest version, but a newer compiler broke some code, so I kept this version.
Seems to be a rare event with 20 out of 20,000, and the concept of reset buttons is, I guess, pretty common knowledge. Even the disposal thing in my kitchen sink has one.
As a wild guess, maybe switch to RCSLOW without PLL and then switch to 20 MHz?
Enjoy!
Mike
Clean Clock hand over usually takes a little time as it waits for next-edges before acting.
You could try changing the clock source only, with the PLL bit still enabled, and see if that helps, then test how much delay is needed before clearing the PLL bit ?
Eric went through a few changes for inlined assembly not that long ago. It might be okay for you again now.
EDIT: And serial input may stop functioning - if that is used for wakeup for example.
From earlier experiments with glob top P2D2, RCFAST actually had a slight negative temp coefficient (but was very stable)
RCSLOW has a fairly strong positive temperature coefficient like you describe
Also we've seen several P2's in the "VGA club" - above VGA 25.175 MHz dot clock
EDIT: Second attempt netted 42 µA more on VDD and 967 µA more on VIO, see https://forums.parallax.com/discussion/comment/1505575/#Comment_1505575
Wonder why...