How to implement a watchdog
ManAtWork
in Propeller 1
As explained in this thread, I recently had to learn the hard way that the propeller is somehow prone to runaway crashes where outputs toggle in an uncontrolled way. As the propeller has neither a hardware watchdog nor memory protection, it is very hard or impossible to actually protect memory from being overwritten. And, as I said, because the OUTA and DIRA contents of all cogs are ORed together, there is also no way of stopping another cog from switching on outputs.
The propeller philosophy is: as little as possible dedicated hardware, do everything in software. So if we have one spare cog left we could at least do some sanity check and trigger a reset after we found out that something has gone wrong. This doesn't prevent a glitch on the outputs but it limits the possible damage.
In this thread I'd like to discuss ideas for how a software watchdog could be implemented that doesn't have too much impact on performance but can detect dangerous effects of serious software failures such as buffer overruns, stack overflow...
Of course there is no 100% safety. A crashing cog executing garbage code could accidentally execute a cogstop for the watchdog cog. But if the watchdog detects illegal states and resets the whole propeller fast enough, there's a good chance that nothing serious happens.
Comments
Memory areas that do not change (code and const data) could be protected by a checksum. The watchdog re-calculates the checksum frequently and triggers if something changes.
This requires separation of const and var data which is easy for newly written code but difficult for pre-written libraries.
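As a rough sketch of the checksum idea, written in C for illustration only (a real watchdog cog would more likely be PASM): the function names are hypothetical, and Fletcher-16 is just one possible checksum, picked here because it is short. Any CRC would serve the same purpose.

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 checksum over a memory region.
   Chosen for brevity; this is an assumption, not something the thread prescribes. */
uint16_t region_checksum(const uint8_t *start, size_t len)
{
    uint16_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + start[i]) % 255);
        sum2 = (uint16_t)((sum2 + sum1) % 255);
    }
    return (uint16_t)((sum2 << 8) | sum1);
}

/* Returns 1 while the region still matches the reference value
   that was computed once at startup, 0 after any detected change. */
int region_intact(const uint8_t *start, size_t len, uint16_t reference)
{
    return region_checksum(start, len) == reference;
}
```

The watchdog would compute the reference once after boot, keep it in its own cog memory, and call `region_intact` periodically, triggering a reset when it returns 0.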
Random outputs are a particular concern... you would never intentionally energise the A and B coils of a solenoid valve simultaneously, so maybe a standalone micro watching for nonsensical logic?
Some outputs could be marked as critical. Authorized changes of those outputs must always update a "nominal state" variable in hub ram. The watchdog could detect differences between nominal and actual state of the critical outputs. (inputs and non-critical outputs can change at any time)
This would at least catch random writes to DIRA and OUTA by a crashed cog. However, it does not catch an erroneous jump into the code of the authorized output function.
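A minimal sketch of the nominal-versus-actual comparison, in C for illustration. The pin mask, the variable name `nominal_state` and both function names are assumptions made up for this example.

```c
#include <stdint.h>

/* Hypothetical mask of the pins treated as critical (here P16..P19). */
#define CRITICAL_MASK 0x000F0000u

/* Shadow copy in hub RAM, updated only by the authorized output routine. */
volatile uint32_t nominal_state = 0;

/* Authorized change: record the intended state first, then return the
   value the caller should write to OUTA. Non-critical bits are preserved. */
uint32_t set_critical_outputs(uint32_t current_outa, uint32_t new_bits)
{
    nominal_state = new_bits & CRITICAL_MASK;
    return (current_outa & ~CRITICAL_MASK) | nominal_state;
}

/* Watchdog check: pins_now is the pin state as read by the watchdog.
   Only the critical bits are compared; everything else may change freely. */
int outputs_consistent(uint32_t pins_now)
{
    return (pins_now & CRITICAL_MASK) == nominal_state;
}
```

Because the shadow variable and the pins cannot be updated in the same instant, a watchdog built this way would probably need to tolerate a mismatch for one poll and only trigger when it persists.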
Sometimes I even had a camera running just so I could physically see what should have been impossible
Have the RC circuit hold the reset line if there is a problem. Not an actual counter you reset, but nonetheless a time constant.
If you're experiencing runaway behavior, then the fault is with your program. I would never consider a watchdog circuit to be a panacea, when finding and fixing bugs in errant code is an option.
-Phil
+1
Always expect the unexpected.
That's not quite true, as a common approach for data-corruption protection is to modulus-mask any memory writes.
With binary sized buffers that can be a simple AND.
That means any data buffers cannot overwrite any other memory, but they can wrap onto their own area with 'bad' index or count values.
Those failures are easier to debug, and tend to be one-off and recoverable.
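To illustrate the masking approach in C (buffer size and names are hypothetical): with a power-of-two buffer, every write index is ANDed with `size - 1`, so a bad index can only land inside the buffer itself.

```c
#include <stdint.h>

#define BUF_SIZE 64u               /* must be a power of two */
#define BUF_MASK (BUF_SIZE - 1u)

typedef struct {
    uint8_t data[BUF_SIZE];
} masked_buf_t;

/* Every access goes through the mask, so even a corrupt index
   wraps back into the buffer instead of hitting neighbouring memory. */
void buf_write(masked_buf_t *b, uint32_t index, uint8_t value)
{
    b->data[index & BUF_MASK] = value;
}

uint8_t buf_read(const masked_buf_t *b, uint32_t index)
{
    return b->data[index & BUF_MASK];
}
```

A write with index 64 ends up in slot 0, and one with index 0xFFFFFFFF (a "negative" count) ends up in slot 63. The data is wrong in both cases, but nothing outside the buffer is touched.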
If you protect data space overflow, that leaves actual memory corruption, and Prop seems to be quite good there ?
If you have a spare COG, you can do simple heart-beat checks on other COGS, but if you protect against data overflows, that should be less needed.
It is useful to have a heartbeat led on almost any control product and that can be multi-cog checking.
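A heartbeat check across cogs could look roughly like this in C. The number of monitored cogs and all names are assumptions: each monitored cog increments its own counter in shared memory, and the watchdog verifies that every counter has moved since its last look.

```c
#include <stdint.h>

#define NUM_MONITORED 3   /* hypothetical number of monitored cogs */

/* Shared memory: each monitored cog increments its own slot once per loop pass. */
volatile uint32_t heartbeat[NUM_MONITORED];

/* Watchdog's private record of the values seen at the previous check. */
static uint32_t last_seen[NUM_MONITORED];

/* Call once per watchdog period. Returns a bit mask of the cogs whose
   counter did not advance since the previous call; 0 means all are alive. */
uint32_t stalled_cogs(void)
{
    uint32_t stalled = 0;
    for (int i = 0; i < NUM_MONITORED; i++) {
        uint32_t now = heartbeat[i];
        if (now == last_seen[i])
            stalled |= 1u << i;
        last_seen[i] = now;
    }
    return stalled;
}
```

The watchdog period has to be longer than the slowest monitored loop, otherwise a healthy cog would be reported as stalled. The same mask could drive the heartbeat LED.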
You can get regulators with watchdogs, some are windowed, which needs a pulse between some limits to avoid reset.
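The window rule such parts enforce can be expressed in a few lines of C. The tick limits below are made-up values for illustration, not taken from any datasheet: a pulse is accepted only if it arrives neither too early nor too late after the previous one.

```c
#include <stdint.h>

#define WINDOW_MIN_TICKS 800u    /* hypothetical: pulses faster than this are rejected */
#define WINDOW_MAX_TICKS 1200u   /* hypothetical: pulses slower than this are rejected */

/* Returns 1 if the pulse at time 'now' falls inside the allowed window
   measured from the previous pulse. Unsigned subtraction keeps the
   result correct when the tick counter wraps around. */
int kick_in_window(uint32_t last_kick, uint32_t now)
{
    uint32_t elapsed = now - last_kick;
    return elapsed >= WINDOW_MIN_TICKS && elapsed <= WINDOW_MAX_TICKS;
}
```

The lower limit is what distinguishes this from a plain timeout: runaway code that toggles the watchdog pin in a tight loop fails the check just as a hung program does.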
More paranoid systems implement a power-removal watchdog, as many chips today do not fully reset on a reset pin, and a reset pin does not recover latchup.
A special "SecureSpin" variant could add checksums. It could be every instruction, or every N bytes. Or only before writing to the hub or IO registers. The memory address should be part of the checksum, to guard against running code that was copied from another memory location. If the checksum included a unique random number as well it could provide some protection against intentional buffer overflows.
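A toy version of an address-dependent checksum, in C and purely illustrative. No such "SecureSpin" exists, the function name and mixing scheme are assumptions, and this simple rotate-and-XOR sum is not cryptographically strong. It only shows how the address and a secret value can be folded into the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Checksum of a code block that also depends on where the block lives
   and on a per-device secret. The same bytes copied to another address,
   or checked with another secret, give a different result. */
uint32_t keyed_block_sum(const uint8_t *block, size_t len,
                         uint32_t address, uint32_t secret)
{
    uint32_t sum = address ^ secret;
    for (size_t i = 0; i < len; i++) {
        sum = (sum << 5) | (sum >> 27);   /* rotate left by 5 bits */
        sum ^= block[i];
    }
    return sum;
}
```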
I think I'd use a second Propeller to act as a supervisor over the main Propeller.
Maybe have one cog in the main Prop check on things and then report over a serial link to the Supervisor.
I think this could be bullet proof...
A redesign with an extra propeller or a different CPU is not an option. If I did a redesign I'd completely leave out the propeller and take a CPU with watchdog and protection features. There are ARM based chips with two cores and a comparison circuit that verifies every single computation result and memory write.
The software watchdog solution is meant as in-the-field update to existing products to reduce the chance of a fatal failure to a minimum.
I agree with PhiPi that avoiding and fixing bugs has priority over "post mortem" fixes. But you can't foresee every possible pitfall. Even if I hadn't used Pham's driver but had written my own, I would never have thought about the ENC28J60 returning 0 or negative numbers as packet length. And even a 100% test where every branch of conditional code ran at least once wouldn't have caught this case. It was simply not possible to test against this before the "evil" switch was sold. A standard network interface card couldn't even send such a corrupted packet.
But chances are high that executing random memory contents as code generates loops that periodically trigger outputs or even overwrite timer registers to generate random frequencies. I actually have a simple RC watchdog on my board but in many cases it didn't time out. So the watchdog should at least be restrictive enough to require special patterns, checksums or other question-answer protocols where the "magic code" is unlikely to be hit by bad luck.
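One way to make the "magic code" hard to hit by accident is a token sequence that both sides compute independently. The sketch below, in C with hypothetical names, uses a 16-bit Galois LFSR (tap mask 0xB400, seed must be nonzero) as the sequence generator. That choice is an assumption for illustration; any sequence that runaway code is unlikely to reproduce would do.

```c
#include <stdint.h>

/* One step of a 16-bit Galois LFSR with tap mask 0xB400. */
uint16_t next_token(uint16_t t)
{
    uint16_t lsb = t & 1u;
    t >>= 1;
    if (lsb)
        t ^= 0xB400u;
    return t;
}

typedef struct {
    uint16_t expected;   /* token the watchdog wants to see next */
} wd_state_t;

/* Both sides start from the same nonzero seed. */
void wd_init(wd_state_t *wd, uint16_t seed)
{
    wd->expected = next_token(seed);
}

/* Returns 1 and advances the sequence if the token is the expected one.
   Returns 0 on any other value; the caller would then trigger a reset. */
int wd_feed(wd_state_t *wd, uint16_t token)
{
    if (token != wd->expected)
        return 0;
    wd->expected = next_token(token);
    return 1;
}
```

A loop that merely toggles a pin or rewrites the same value cannot satisfy this check, because every feed requires a different token. Combined with a timeout, a missing or wrong token both lead to a reset.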
Good idea, Tracy. Unfortunately, all the high-side drivers I use expect an active-high signal, and so does an N-channel MOSFET that is driven directly by a propeller pin. So we need external inverters.
ARM Cortex-M0 chips at least have a standard watchdog and some degree of IO-register protection. Some peripherals require special register write sequences to be enabled. And they have an oscillator watchdog against clock failure, which is useful in switching power applications (PWM). Cortex-M7 CPUs have memory protection and a flash/code checksum watchdog.
But all lamenting and arguing doesn't help much. I think a software watchdog is still the best option for a safety update. Not bulletproof but much better than nothing. Using inverters and low active outputs is good for new designs.
The thread title is "How to implement a watchdog" so in many respects it's not just about your specific requirements.
If you have a spare cog, writing a PASM watchdog routine is probably the best approach. This shouldn't have any impact on performance, and could use a combination of pin monitoring and code check-summing to achieve your needs. Being PASM based it can remain protected from your hubram overwrite issue.
As an aside, the code you posted doesn't match the version of the driver available in the OBEX:
The buggy switch might still cause problems even with the updated driver if it delivers headers indicating zero or negative length packets.
A simple fix for that might be:
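The code from the original post is not reproduced here. As a purely illustrative stand-in, written in C rather than the driver's own language, a length guard could reject any reported packet length that is zero, negative, or larger than either the receive buffer or a maximum frame size. The function name and the 1518-byte limit are assumptions.

```c
#include <stdint.h>

/* Assumed upper bound: a standard Ethernet frame including CRC. */
#define MAX_FRAME_LEN 1518

/* Returns 1 only if the length reported in the packet header is
   positive and fits both the frame limit and the caller's buffer. */
int packet_length_ok(int32_t reported_len, int32_t buffer_size)
{
    return reported_len > 0
        && reported_len <= MAX_FRAME_LEN
        && reported_len <= buffer_size;
}
```

A packet failing this check would be dropped before any copy loop runs, so a zero or negative count never reaches the code that writes to hub RAM.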
a) Code & data-table areas that should never change, always return the same checksum/crc.
b) Some data or code areas that should change, are checked to ensure they update at expected rates.
If you also want to watch the watchdog, you can use a regulator with a watchdog inbuilt, or a sub 50c MCU as a system monitor/watchdog, and that gives bonus security/serial numbers.
You cannot test all cases, but you can code as I described above, to bound limit any writes to areas that can also contain code.
That does not prevent errors, but it does ensure you cannot corrupt code yourself.
PC compilers used to have range checking options, and there could be a case for embedded compilers to add a range-bounding option as the code cost is quite low...
Here, we code that in manually.