watch dog timer
Zap-o
Posts: 452
Do any of you use a watchdog timer with the Propeller? I ask because I am reading a book that says all good engineers use a WDT with microprocessors. It explains how a WDT helps with devices that run in noisy environments, etc. I have never used one with the Propeller, so I ask: what do you think?
Comments
With ARM chips the WDT is the best way to implement a software reset. One sets it for a very short time to initiate the reset.
So you're saying a Propeller will never need a WDT?
An external oscillator is not going to help you with a WDT since the Prop uses a single system clock source for all of its cogs and the hub functions. If the system clock stops working, the whole chip stops. An errant program could change the clock mode register to something that won't work and it all would stop. You would need an independent WDT. Something as simple as a CMOS 555-type timer that's reset by the Prop would work.
http://datasheets.maxim-ic.com/en/ds/MAX6369-MAX6374.pdf
I think that, for the Propeller, a statistical software watchdog system, where everybody watches everybody else, would be effective. The idea is that each cog would watch all the others and do a software reset if any fail to meet their deadlines. This could be done without any pins, wherein each cog would periodically write the value of cnt to its own hub location. At the same time it would check the other seven locations to make sure each of the other cogs met its deadline. This would entail always running eight cogs, but the unused ones could idle in waitcnt without drawing excessive power.
Of course, this is not 100.00000% reliable, but from a statistical standpoint the probability of missing errant behavior is vanishingly low.
-Phil
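The "everybody watches everybody else" scheme above can be sketched roughly as follows. This is a Python model, not Propeller code: `stale_cogs`, `heartbeats`, and `deadline_ticks` are hypothetical names, and the list stands in for the eight hub locations. It assumes cnt is a 32-bit free-running counter that wraps around, so the age of a heartbeat is computed with a masked subtraction.

```python
CNT_MASK = 0xFFFFFFFF  # assumption: cnt is a 32-bit counter that wraps around


def stale_cogs(heartbeats, now, deadline_ticks, self_id):
    """Return the ids of the other cogs whose last heartbeat missed the deadline.

    heartbeats     -- list of the cnt values each cog last wrote (one slot per cog)
    now            -- current cnt value
    deadline_ticks -- maximum allowed age of a heartbeat, in counter ticks
    self_id        -- id of the cog doing the check (it skips its own slot)
    """
    stale = []
    for cog_id, stamp in enumerate(heartbeats):
        if cog_id == self_id:
            continue
        # Masked subtraction gives the right age even after cnt has wrapped.
        age = (now - stamp) & CNT_MASK
        if age > deadline_ticks:
            stale.append(cog_id)
    return stale
```

In this model, each cog would write its own slot and then call `stale_cogs`. A non-empty result is the cue for a software reset.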
That was the sort of thing I envisaged when I suggested an oscillator - something providing a regular pulse.
One fine day we sold a local chemical plant a very expensive drumfilling system based on this indicator. It was explosion proof (mounted in a heavy airtight steel enclosure) and arrayed with safety devices because the chemical it was dispensing into drums was highly toxic and flammable. Of course we were quite glad that the scale had a watchdog timer, which could be noted on the schematic diagram included with the service manual. (In those days we still did a lot of component level repair.)
Then one day it overfilled a drum, which exploded from the pressure, and there was much massing of HAZMAT gear and evacuations and other expensive results. We were called on the carpet and asked to figure out how it happened.
As it happens there was a rare failure mode where due to bus contention or outright failure of the 9901, it would become impossible to read the lower 14 bits of the counter. In this state the 16-bit ADC would become a 2-bit ADC capable of reporting only the raw count values 0, $1000, $2000, and $3000. Otherwise the scale would work fine though. I saw this happen four or five times in the 15 years we regularly serviced this model. And needless to say, of all the possible ways it could have failed, this is how the expensive explosion proof drumfiller decided to croak.
Once I figured out what had happened, they demanded that we GUARANTEE that this would NEVER HAPPEN AGAIN before they were willing to restart the line. Now this same scale had an output from the setpoint card for MOTION, in those days generally used to prevent BCD printers from printing if the weight wasn't stable. It occurred to me that during the process of filling a drum the weight is always changing, so I added a relay that interlocked the fill operation to the motion output; after a few seconds to get things started, if the weight stopped changing it would close the valve.
They're still using it today.
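The original fix was a relay wired to the scale's MOTION output, not software. As a rough software analogue of the same logic, here is a minimal Python sketch. The class name, the parameters `grace_s`, `stall_s`, and `min_delta`, and the sample values are all made up for illustration.

```python
class MotionInterlock:
    """Allow filling only while the weight keeps changing.

    grace_s   -- seconds after the start during which no motion is required
    stall_s   -- seconds without motion after which the valve must close
    min_delta -- smallest weight change that counts as motion
    """

    def __init__(self, grace_s, stall_s, min_delta):
        self.grace_s = grace_s
        self.stall_s = stall_s
        self.min_delta = min_delta
        self.start = None
        self.last_weight = None
        self.last_motion = None

    def update(self, t, weight):
        """Feed one (time, weight) sample; return True if the valve may stay open."""
        if self.start is None:
            # First sample: start the fill and treat it as the motion reference.
            self.start = t
            self.last_weight = weight
            self.last_motion = t
            return True
        if abs(weight - self.last_weight) >= self.min_delta:
            self.last_weight = weight
            self.last_motion = t
        if t - self.start < self.grace_s:
            return True  # still inside the start-up grace period
        return (t - self.last_motion) < self.stall_s
```

A reading that freezes, as the failed ADC's did, stops producing motion, so `update` returns False once `stall_s` has elapsed, whatever the frozen value happens to be.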
1) You suspect your code might crash under some as yet unforeseen circumstance.
Perhaps your design is not thorough or your testing not extensive. Or you
don't trust your compiler or whatever.
2) You suspect that the hardware might fail in a recoverable way. Could be a
glitch on the power supply. Could be other noise getting in and latching things
up. Could be that cosmic ray flipping a bit in your RAM.
These are very much the same. The assumption is that either you want the thing
brought to a dead halt when anything goes wrong, to stop it doing more damage,
or you are willing to bet that a reset will start it up again in a
sensible way.
Either way an external watchdog device could be a good idea.
Phil raises an interesting point. With a multi-core device like the Prop
perhaps you need a watchdog on each core to try to be sure they are all
operating.
Then of course you could have a single hardware watchdog on one core
and have that be the watch dog for functionality on all the others.
Or perhaps one core has a hardware watchdog and the others all supervise each
other in a chain. That might fit with the flow of processing and data in your
application.
I would be loath to get rid of the hardware watchdog though.
Still, there are many things that can go wrong. Local Roger has a good example
above.
I once saw a system where the background loop had crashed causing the
system to fail badly. However the watchdog was oblivious because it was
still being kicked from an interrupt routine, triggered from a timer, that was
still running OK. Note, this is similar to having one dead COG.
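One common way to avoid that trap is to let the timer routine kick the watchdog only when every supervised task has checked in since the last kick. The sketch below is a Python model of that idea, not code from the system described; `GatedWatchdogKicker` and the task names are hypothetical.

```python
class GatedWatchdogKicker:
    """Kick the hardware watchdog only if every task has checked in."""

    def __init__(self, task_names):
        self.all_tasks = frozenset(task_names)
        self.pending = set(self.all_tasks)  # tasks that still owe a check-in

    def check_in(self, name):
        """Called by a task (e.g. the background loop) each time it completes a pass."""
        self.pending.discard(name)

    def timer_tick(self):
        """Called from the timer routine; return True if the watchdog should be kicked."""
        if self.pending:
            # At least one task is silent: withhold the kick and let the
            # hardware watchdog time out.
            return False
        self.pending = set(self.all_tasks)  # start a new supervision window
        return True
```

For this to work, the timer tick has to be slow enough that every healthy task checks in at least once between ticks. The same gate would cover one dead cog among several live ones.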
Recently a system I looked at would fail at random when nearby equipment was
powered up, even though this system was opto-isolated on all its I/O and
running from its own batteries. It turned out that the parallel I/O outputs were
getting reset to inputs by some EMI. The system had no idea this had happened
and continued on its merry way.
A watchdog then is a means to ensure that a system recovers from a
temporary faulty situation or perhaps shuts itself down totally. The idea is
that it does what you want it to do over the long haul or gives up. The system
should not end up doing things it is not designed to do. A watch dog is a crude way of trying to build a "fault tolerant" system.
In general though, determining all possible failure modes turns out to be
rather hard and full of surprises, as we have seen above. It's a question of how far you want to
pursue it.
Ultimately you might get into the world of multiple-redundant systems with
multiple processors, multiple power supplies, etc., all working on the same
task such that if one node fails (produces a wrong result) the others can continue correctly (perhaps shutting down or restarting the failed node).
That turns out to be rather hard too. You might think that it is sufficient that
3 processors work on the job in a kind of democracy. In the event of one
failure the other working nodes have a majority vote on what to do and the right
thing gets done. Turns out that even then it can fail. You need 4 such nodes
to detect a failure in any one of them in some cases.
See http://en.wikipedia.org/wiki/Byzantine_fault_tolerance and be sure to check the paper on the Microsoft Research site linked from there.
You will be pleased to hear that even the Primary Flight Computers of the fly-by-wire Boeing 777 do not meet the Byzantine Generals Criteria:)
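As a small illustration of the voting idea, here is a Python sketch of a simple majority voter, plus the classic node-count bound from the Byzantine generals result (at least 3f + 1 nodes to tolerate f arbitrarily faulty ones when messages are unsigned). The function names are made up for this example.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of nodes, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def min_nodes_byzantine(f):
    """Minimum node count to tolerate f Byzantine faults with unsigned messages."""
    return 3 * f + 1
```

The voter only covers a node that reports the same wrong value to everyone. A Byzantine node can tell different peers different things, so the peers may not even agree on what was voted, and that is why three nodes are not enough and four are needed for a single such fault.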
In my programming, if halts are theoretically possible, I generally have each individual cog change the state of something (I've used HUB memory or pin states), and have a "master" cog checking for that change. But if you are running into halts or freezes, except in the most speed-intensive applications, the fix is probably more about sussing out coding problems, or it's just lazy programming.
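The "master cog checks for a change" approach needs no timestamps at all, just two snapshots of whatever each worker is changing. A minimal Python sketch follows, assuming each worker increments its own counter in shared memory; `stalled_workers` is a hypothetical name.

```python
def stalled_workers(previous, current):
    """Return the indices of workers whose counter did not change between two polls."""
    return [i for i, (old, new) in enumerate(zip(previous, current)) if old == new]
```

The master would take a snapshot, wait longer than the slowest worker's loop time, take another, and act on any index this returns.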
It seems that as long as you aren't doing extremely fast communication, or the like, with a little bit of work, you could get away with not using a traditional watchdog at all on the Propeller. It has a system-wide clock that can be used as a simpler WDT when needed.
Perhaps a cog as watchdog is sufficient in 99.999% of errant code cases. One might worry that an errant cog could kill off any other cog including the watch dog cog.
Also it does not cater for all the odd hardware glitches that can occur as described in previous posts.