watch dog timer

Zap-o · 2011-09-12 18:52

Do any of you use a watchdog timer with the Propeller? I ask because I am reading a book that says all good engineers use a WDT with microprocessors. It explains how a WDT helps with devices that are run in noisy environments, etc. I have never used one with the Propeller, so I ask: what do you think?

tonyp12 · 2011-09-12 19:05

One that monitors the power conditions or one that monitors that the code did not get stuck?

Zap-o · 2011-09-13 10:30

A watch dog timer for code.

Leon · 2011-09-13 10:50

They are essential in many systems. A cog could be devoted to the WDT function by providing it with an independent external oscillator.

With ARM chips the WDT is the best way to implement a software reset. One sets it for a very short time to initiate the reset.

Zap-o · 2011-09-13 10:52

Leon wrote: »

With ARM chips, the WDT is the best way to implement a software reset.

So you're saying a Propeller will never need a WDT?

Leon · 2011-09-13 10:59

Of course it will. I suggested a way to implement it.

Mike Green · 2011-09-13 11:34

Leon,
An external oscillator is not going to help you with a WDT since the Prop uses a single system clock source for all of its cogs and the hub functions. If the system clock stops working, the whole chip stops. An errant program could change the clock mode register to something that won't work and it all would stop. You would need an independent WDT. Something as simple as a CMOS 555-type timer that's reset by the Prop would work.
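As a rough illustration of the independent watchdog described above, here is a minimal Python model of an external timer that asserts reset unless the Prop kicks it in time. The class name, method names, and millisecond units are assumptions made for this sketch; a real 555-type circuit or supervisor chip does this in hardware.

```python
class ExternalWatchdog:
    """Toy model of an external watchdog: reset fires if no kick arrives in time."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms  # hypothetical timeout, set by R/C values on real hardware
        self.last_kick_ms = 0         # time of the most recent kick from the Prop

    def kick(self, now_ms):
        """The Prop toggles a pin; the timer restarts from now_ms."""
        self.last_kick_ms = now_ms

    def reset_asserted(self, now_ms):
        """True once more than timeout_ms has passed since the last kick."""
        return now_ms - self.last_kick_ms > self.timeout_ms
```

Because the timer keeps its own time base, it still fires if the Prop's system clock stops, which is the case a cog-based watchdog cannot cover.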

Zap-o · 2011-09-13 11:44

This is the unit I am considering

http://datasheets.maxim-ic.com/en/ds/MAX6369-MAX6374.pdf

Phil Pilgrim (PhiPi) · 2011-09-13 12:45

The difficulty with a software watchdog timer is, "Who watches the watchdog?" But, even with an external watchdog timer, how do you make sure that every cog is running properly? You'd almost have to have a watchdog timer for each cog with a wired-OR connection to /RST.

I think that, for the Propeller, a statistical software watchdog system, where everybody watches everybody else, would be effective. The idea is that each cog would watch all the others and do a software reset if any fail to meet their deadlines. This could be done without any pins, wherein each cog would periodically write the value of cnt to its own hub location. At the same time it would check the other seven locations to make sure each of the other cogs met its deadline. This would entail always running eight cogs, but the unused ones could idle in waitcnt without drawing excessive power.
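The check each cog would perform can be sketched in Python as follows. This is only a model of the idea, not Propeller code: the list `heartbeats` stands in for the eight hub locations, and `deadline_ticks`, `stale_cogs`, and `check_in` are names invented for this sketch. The 32-bit mask reflects that cnt is a free-running 32-bit counter that wraps around.

```python
MASK32 = 0xFFFFFFFF  # cnt wraps at 32 bits, so ages are computed modulo 2**32


def stale_cogs(heartbeats, now, deadline_ticks):
    """Return indices of cogs whose last stamp is older than deadline_ticks."""
    return [cog for cog, stamp in enumerate(heartbeats)
            if ((now - stamp) & MASK32) > deadline_ticks]


def check_in(heartbeats, my_cog, now, deadline_ticks):
    """One watchdog pass for a cog: stamp its own slot, then audit the others.

    Returns True when some other cog has missed its deadline, meaning
    this cog should request a software reset.
    """
    heartbeats[my_cog] = now & MASK32
    late = [c for c in stale_cogs(heartbeats, now, deadline_ticks) if c != my_cog]
    return bool(late)
```

The modular subtraction keeps the age calculation correct across a counter rollover, provided the deadline is well below 2**32 ticks.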

Of course, this is not 100.00000% reliable, but from a statistical standpoint the probability of missing errant behavior is vanishingly low.

-Phil

Leon · 2011-09-13 12:53

Mike Green wrote: »

Leon,
An external oscillator is not going to help you with a WDT since the Prop uses a single system clock source for all of its cogs and the hub functions. If the system clock stops working, the whole chip stops. An errant program could change the clock mode register to something that won't work and it all would stop. You would need an independent WDT. Something as simple as a CMOS 555-type timer that's reset by the Prop would work.

That was the sort of thing I envisaged when I suggested an oscillator - something providing a regular pulse.

localroger · 2011-09-13 15:57

I have a great story, which was much less funny at the time, about a watchdog timer that didn't. We used to sell a scale display that was designed around 1980 using the TMS9995 CPU and the 9901 controller chip. The discrete ADC used the 9901 controller counter as the accumulator for delta-sigma ADC.

One fine day we sold a local chemical plant a very expensive drumfilling system based on this indicator. It was explosion proof (mounted in a heavy airtight steel enclosure) and arrayed with safety devices because the chemical it was dispensing into drums was highly toxic and flammable. Of course we were quite glad that the scale had a watchdog timer, which could be noted on the schematic diagram included with the service manual. (In those days we still did a lot of component level repair.)

Then one day it overfilled a drum, which exploded from the pressure, and there was much massing of HAZMAT gear and evacuations and other expensive results. We were called on the carpet and asked to figure out how it happened.

As it happens there was a rare failure mode where due to bus contention or outright failure of the 9901, it would become impossible to read the lower 14 bits of the counter. In this state the 16-bit ADC would become a 2-bit ADC capable of reporting only the raw count values 0, $1000, $2000, and $3000. Otherwise the scale would work fine though. I saw this happen four or five times in the 15 years we regularly serviced this model. And needless to say, of all the possible ways it could have failed, this is how the expensive explosion proof drumfiller decided to croak.

Once I figured out what had happened, they demanded that we GUARANTEE that this would NEVER HAPPEN AGAIN before they were willing to restart the line. Now this same scale had an output from the setpoint card for MOTION, in those days generally used to prevent BCD printers from printing if the weight wasn't stable. It occurred to me that during the process of filling a drum the weight is always changing, so I added a relay that interlocked the fill operation to the motion output; after a few seconds to get things started, if the weight stopped changing it would close the valve.
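The original fix was a relay wired to the MOTION output, but the same logic can be sketched in software. The Python below is a hypothetical model only: the class name, the grace period, the stall time, and the minimum change that counts as motion are all assumptions chosen for illustration.

```python
class MotionInterlock:
    """Closes the fill valve if the weight stops changing after a start-up grace period."""

    def __init__(self, grace_s, stall_s, min_delta):
        self.grace_s = grace_s      # start-up time during which no motion is required
        self.stall_s = stall_s      # how long the weight may sit still before tripping
        self.min_delta = min_delta  # smallest change that counts as motion
        self.last_weight = None
        self.last_motion_t = 0.0
        self.tripped = False

    def update(self, t, weight):
        """Feed one reading taken t seconds into the fill.

        Returns True while the valve may stay open. Once tripped, it stays
        closed until a new interlock object is created for the next fill.
        """
        if self.tripped:
            return False
        if self.last_weight is None or abs(weight - self.last_weight) >= self.min_delta:
            self.last_weight = weight
            self.last_motion_t = t
        elif t >= self.grace_s and t - self.last_motion_t >= self.stall_s:
            self.tripped = True
        return not self.tripped
```

The point of the design is that it checks a physical consequence of correct operation (the weight keeps rising) rather than whether the processor is still running.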

They're still using it today.

Heater. · 2011-09-14 04:39

What is a watchdog for?

1) You suspect your code might crash under some as yet unforeseen circumstance.
Perhaps your design is not thorough or your testing not extensive. Or you
don't trust your compiler or whatever.

2) You suspect that the hardware might fail in a recoverable way. Could be a
glitch on the power supply. Could be other noise getting in and latching things
up. Could be that cosmic ray flipping a bit in your RAM.

These are very much the same. The assumption is that either you want the thing
brought to a dead halt when anything goes wrong, to stop it doing more damage,
or you are willing to bet that a reset will start it up again in a
sensible way.

Either way an external watchdog device could be a good idea.

Phil raises an interesting point. With a multi-core device like the Prop
perhaps you need a watchdog on each core to try to be sure they are all
operating.

Then of course you could have a single hardware watchdog on one core
and have that core be the watchdog for functionality on all the others.

Or perhaps one core has a hardware watchdog and the others all supervise each
other in a chain. That might fit with the flow of processing and data in your
application.

I would be loath to get rid of the hardware watchdog though.

Still there are many things that can go wrong; localroger has a good example
above.

I once saw a system where the background loop had crashed causing the
system to fail badly. However the watchdog was oblivious because it was
still being kicked from an interrupt routine, triggered from a timer, that was
still running OK. Note, this is similar to having one dead COG.
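One common way to avoid that trap is to kick the watchdog only after every monitored task has reported in since the last kick. The Python below is a minimal sketch of that idea; `GatedWatchdog`, the task names, and the `kick_hardware` callback are hypothetical and stand in for whatever actually services the hardware timer.

```python
class GatedWatchdog:
    """Kicks the hardware watchdog only when all registered tasks have checked in."""

    def __init__(self, task_names, kick_hardware):
        self.all_tasks = frozenset(task_names)
        self.pending = set(task_names)      # tasks that have not yet reported this cycle
        self.kick_hardware = kick_hardware  # callable that services the real watchdog

    def check_in(self, name):
        """Called by a task (main loop, cog, ISR) to show it is still alive."""
        self.pending.discard(name)

    def service(self):
        """Call periodically. Kicks and starts a new cycle only if nobody is missing."""
        if self.pending:
            return False
        self.kick_hardware()
        self.pending = set(self.all_tasks)
        return True
```

With this arrangement a healthy timer interrupt alone cannot keep the watchdog quiet: if the background loop stops checking in, the kicks stop and the hardware timer expires.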

Recently a system I looked at would fail at random when nearby equipment was
powered up even though this system was opto-isolated on all its I/O and
running from its own batteries. Turned out that the parallel I/O outputs were
getting reset to inputs by some EMI. The system had no idea this had happened
and continued on its merry way.

A watchdog then is a means to ensure that a system recovers from a
temporary faulty situation or perhaps shuts itself down totally. The idea is
that it does what you want it to do over the long haul or gives up. The system
should not end up doing things it is not designed to do. A watchdog is a crude way of trying to build a "fault tolerant" system.

In general though, determining all possible failure modes turns out to be
rather hard and full of surprises, as we have seen above. It's a question of how far you want to
pursue it.

Ultimately you might get into the world of multiple-redundant systems with
multiple processors, multiple power supplies, etc., all working on the same
task such that if one node fails (produces a wrong result) the others can continue correctly (perhaps shutting down or restarting the failed node).

That turns out to be rather hard too. You might think that it is sufficient that
3 processors work on the job in a kind of democracy. In the event of one
failure the other working nodes have a majority vote on what to do and the right
thing gets done. Turns out that even then it can fail. You need 4 such nodes
to detect a failure in any one of them in some cases.

I refer to http://en.wikipedia.org/wiki/Byzantine_fault_tolerance. Be sure to check the paper on the Microsoft Research site linked from there.
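The "3 is not enough, you need 4" point follows from the classic bound for agreement with unsigned messages: tolerating f arbitrarily faulty nodes takes at least 3f + 1 nodes in total. A small Python sketch of that arithmetic, with function names invented here:

```python
def min_nodes_byzantine(faults):
    """Smallest node count that tolerates the given number of Byzantine faults
    when messages are unsigned (the 3f + 1 bound)."""
    return 3 * faults + 1


def tolerated_faults(nodes):
    """Largest number of Byzantine faults a system of this size can tolerate
    under the same assumption."""
    return (nodes - 1) // 3
```

So three nodes tolerate zero such faults and four nodes tolerate one, which matches the case described above. Schemes that use signed messages or assume less hostile failure modes can get by with fewer nodes.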

You will be pleased to hear that even the Primary Flight Computers of the fly-by-wire Boeing 777 do not meet the Byzantine Generals criteria :)

ericball · 2011-09-14 06:37

Putting the watchdog reset in an ISR is really dumb IMHO, as two of the primary reasons to implement a watchdog timer are to detect infinite loops and cases where the PC no longer points to code (e.g. in PASM, JMP label instead of JMP #label), so the processor is executing nonsense. My two biggest worries with a watchdog timer are false triggers (because the code takes a little too long between watchdog resets) and the need for the initialization routine to be able to recover from all possible initialization states (since the watchdog reset could have occurred at any point). That being said, I don't design critical hardware like localroger.

Bobb Fwed · 2011-09-14 10:29

It seems a bit wasteful to use a discrete watchdog timer with the Propeller. I could see it under some circumstances (like if all eight cogs are being used to the absolute maximum capacity). But for (I'd imagine) 99% of applications, there is at least one cog that is waiting periodically. This wasted time could be used to implement a system-wide watchdog check, or pin oscillation, or whatever is needed.
In my programming, if halts are theoretically possible, I generally have each individual cog change the state of something (I've used hub memory or pin states), and have a "master" cog checking for that change. But if you are running into halts or freezes, except in the most speed-intensive applications, it is probably more about sussing out coding problems, or it's just lazy programming.
It seems that as long as you aren't doing extremely fast communication, or the like, with a little bit of work, you could get away with not using a traditional watchdog at all on the Propeller. It has a system-wide clock that can be used as a simpler WDT when needed.

Heater. · 2011-09-14 10:36

Bob,
Perhaps a cog as watchdog is sufficient in 99.999% of errant code cases. One might worry that an errant cog could kill off any other cog, including the watchdog cog.
Also it does not cater for all the odd hardware glitches that can occur as described in previous posts.
