How to implement a watchdog

As explained in this thread, I recently had to learn the hard way that the propeller is prone to runaway crashes in which outputs toggle in an uncontrolled way. As the propeller doesn't have a hardware watchdog nor memory protection it is very hard or impossible to actually protect memory from being overwritten. And, as I said, because of the OUTA and DIRA contents of all cogs being ORed together, there is also no way of stopping another cog from switching on outputs.

The propeller philosophy is: as little dedicated hardware as possible, do everything in software. So if we have one spare cog left, we could at least do some sanity checks and trigger a reset once we find out that something has gone wrong. This doesn't prevent a glitch on the outputs, but it limits the possible damage.

In this thread I'd like to discuss ideas for how to implement a software watchdog that doesn't have too much impact on performance but can detect dangerous effects of serious software failures such as buffer overruns, stack overflow...

Of course there is no 100% safety. A crashing cog executing garbage code could accidentally execute a cogstop for the watchdog cog. But if the watchdog detects illegal states and resets the whole propeller fast enough, there's a good chance that nothing serious happens.

Comments

  • Idea #1: Checksums for static hub ram areas
    Memory areas that do not change (code and const data) could be protected by a checksum. The watchdog re-calculates the checksum frequently and triggers if something changes.

    This requires separation of const and var data which is easy for newly written code but difficult for pre-written libraries.
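A minimal sketch of Idea #1 in C, runnable on a host for illustration. The function names and the choice of FNV-1a are assumptions made here, not something from the thread; on the Propeller the same loop would run in the watchdog cog over the hub RAM range that holds code and constants, and a CRC could be substituted if stronger detection is wanted.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over a byte range. Picked only because it is short;
   the area to protect and its length are supplied by the caller. */
uint32_t area_hash(const uint8_t *start, size_t len)
{
    uint32_t h = 2166136261u;              /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= start[i];
        h *= 16777619u;                    /* FNV prime */
    }
    return h;
}

/* Returns 1 while the protected area still matches the reference hash
   taken at startup, 0 once any byte has changed. */
int static_area_intact(const uint8_t *start, size_t len, uint32_t reference)
{
    return area_hash(start, len) == reference;
}
```

The reference value would be computed once at startup (or at build time) and the watchdog would call `static_area_intact` in its loop, triggering a reset on the first 0.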
  • For industrial applications, I don't skimp in this area.

    Random outputs are a particular concern...you would never intentionally energise A & B coils of a solenoid valve simultaneously so maybe a standalone micro watching for nonsensical logic?
    Failure is not an option...it's bundled with the software.
  • You could also implement a more "traditional" watchdog by having all the Spin cogs periodically increment a hub location. If this doesn't happen for a while or it gets corrupted, trigger the soft reset.
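A rough C sketch of such a heartbeat check, under a few assumptions that are not in the post: each monitored cog gets its own counter (so no locking is needed), the watchdog polls at a fixed rate, and a jump larger than `max_step` is treated as corruption. All names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t last_seen;    /* heartbeat value at the previous poll */
    uint32_t stale_polls;  /* consecutive polls without any change */
    uint32_t max_stale;    /* stale polls tolerated before a reset */
    uint32_t max_step;     /* largest plausible increment between polls */
} heartbeat_monitor;

void heartbeat_init(heartbeat_monitor *m, uint32_t initial,
                    uint32_t max_stale, uint32_t max_step)
{
    m->last_seen = initial;
    m->stale_polls = 0;
    m->max_stale = max_stale;
    m->max_step = max_step;
}

/* Called by the watchdog at a fixed rate with the current counter value.
   Returns 1 when a reset should be triggered, 0 otherwise. */
int heartbeat_poll(heartbeat_monitor *m, uint32_t current)
{
    uint32_t delta = current - m->last_seen;  /* wraps safely (unsigned) */

    if (delta == 0) {
        m->stale_polls++;
        return m->stale_polls >= m->max_stale;
    }
    if (delta > m->max_step)
        return 1;                             /* implausible jump: corrupted */

    m->last_seen = current;
    m->stale_polls = 0;
    return 0;
}
```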
  • Idea #2: authorization system for outputs
    Some outputs could be marked as critical. Authorized changes of those outputs must always update a "nominal state" variable in hub ram. The watchdog could detect differences between nominal and actual state of the critical outputs. (inputs and non-critical outputs can change at any time)

    This would at least catch random writes to DIRA and OUTA by a crashed cog. However, it does not catch an erroneous jump to the code of the authorized output function.
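The comparison in Idea #2 can be sketched in C as below. This is only an illustration: `pin_levels` stands in for a read of the pin input register (INA on the Propeller), and the function and parameter names are made up for the example.

```c
#include <stdint.h>

/* Compares the actual pin levels against the nominal state, looking only
   at the pins marked critical. Returns 1 if they agree, 0 on a mismatch.
   Inputs and non-critical outputs are ignored via critical_mask. */
int critical_outputs_ok(uint32_t pin_levels, uint32_t nominal,
                        uint32_t critical_mask)
{
    return ((pin_levels ^ nominal) & critical_mask) == 0;
}
```

Because an authorized change has to touch both the nominal variable and the pin, the watchdog will briefly see a mismatch during that window. A practical version would probably require the mismatch to persist for two or more polls before resetting.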
  • Peter Jakacki Posts: 8,511
    edited 2019-08-02 - 14:01:38
    For this type of industrial control if I didn't implement it as a redundant fail-safe using a second Prop circuit, then I would perhaps have another micro, which could be a Prop, but completely independent, monitoring I/O lines and signalling its disapproval of any big no-no's in the way of an emergency stop. Of course you could just reset the Prop but it would be much better that the e-stop killed the power to the Prop board and when the operator restarted it, then the Prop would start afresh. For this I would have the main Prop constantly logging to an SD card or EEPROM or something with timestamps so that you can go back through and find out what happened. That has saved me many times, both in heavy industrial control, and light commercial where there are units scattered all over the country.

    Sometimes I even had a camera running just so I could physically see what should have been impossible :)

    Tachyon Forth - compact, fast, forthwright and interactive
    Brisbane, Australia
  • Why not have an external RC that needs to be "fed" periodically like a traditional WDT?
    Have the RC circuit hold the reset line if there is a problem. Not an actual counter you reset, but nonetheless a time constant.


    Beau Schwabe -- Submicron Forensic Engineer
    www.Kit-Start.com - bschwabe@Kit-Start.com ෴෴ www.BScircuitDesigns.com - icbeau@bscircuitdesigns.com ෴෴

  • Phil Pilgrim (PhiPi) Posts: 22,386
    edited 2019-08-02 - 15:05:19
    ManAtWork wrote:
    And, as I said, because of the OUTA and DIRA contents of all cogs being ORed together, there is also no way of stopping another cog from switching on outputs.
    That's simply not true. The DIRA contents are unique to each cog. If you don't want a cog switching on an output pin, you set the pin's corresponding bit to zero. The OUTA contents are also unique to each cog. What gets ORed are the actual pin outputs when they're being driven, not the contents of the registers.

    If you're experiencing runaway behavior, then the fault is with your program. I would never consider a watchdog circuit to be a panacea, when finding and fixing bugs in errant code is an option.

    -Phil
    “Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” -Antoine de Saint-Exupery
  • Basically what Phil said but with a clarification. What you want to be able to do is to test your software without any watchdog protection and get it to run reliably 24/7. The watchdog may indeed catch bugs but that won't make the equipment reliable. But what I find on industrial equipment is that no matter how good and reliable the software is, you only need some suppressor or other component to fail and you can get unexpected EMI disrupting the signals or even the CPU itself. Hence the watchdog. If it is a safety issue then just use another Prop with very simple software that just monitors and acts. You could even parallel most of the I/O pins of the two Props but the supervisory Prop should have its own regulator and could even run RCFAST rather than a crystal, mainly to keep it as simple as possible and less prone to noise.

  • Well, the rule is, output trumps input and high trumps low. So, make low output be the active state of a certain control pin. Your watchdog drives that pin high if it needs to override a fault, if that pin locks low too long for example.
  • Tracy Allen Posts: 6,362
    edited 2019-08-02 - 16:16:25
    I’ve made a traditional COP watchdog using a cog counter or two. Simplest is to assign a pin to ctra and couple it through 1nF back to RST. Retard the phase and make it an output, so that its pin comes up output high. The main program loop has to feed the watchdog often enough to keep it from timing out. Main loop stuck = reset. Not foolproof, but effective. My concerns have been not so much to protect a control system as to forestall expensive visits to remote data logging sites.
  • But what I find on industrial equipment is that no matter how good and reliable the software is, you only need some suppressor or other component to fail and you can get unexpected EMI disrupting the signals or even the CPU itself.

    +1
    Always expect the unexpected.

  • jmg Posts: 13,778
    ManAtWork wrote: »
    .. As the propeller doesn't have a hardware watchdog nor memory protection it is very hard or impossible to actually protect memory from being overwritten..

    That's not quite true, as a common approach for data-corruption protection is to modulus-mask any memory writes.
    With binary sized buffers that can be a simple AND.
    That means any data buffers cannot overwrite any other memory, but they can wrap onto their own area with 'bad' index or count values.
    Those failures are easier to debug, and tend to be one-offs that recover.

    If you protect data space overflow, that leaves actual memory corruption, and Prop seems to be quite good there?
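The masking approach described above might look like this in C. It is a sketch with hypothetical names and an assumed 256-byte buffer; the point is that every index passes through the AND, so a bad index or count can only wrap inside the buffer itself.

```c
#include <stdint.h>

#define RX_BUF_SIZE 256u                   /* must be a power of two */
#define RX_BUF_MASK (RX_BUF_SIZE - 1u)

static uint8_t rx_buf[RX_BUF_SIZE];

/* Every write index is ANDed with the mask, so no value of index
   can reach memory outside rx_buf. */
void rx_put(uint32_t index, uint8_t value)
{
    rx_buf[index & RX_BUF_MASK] = value;
}

uint8_t rx_get(uint32_t index)
{
    return rx_buf[index & RX_BUF_MASK];
}
```

A corrupt length then overwrites older data in the same buffer rather than neighbouring code or variables, which is the "wrap onto their own area" behaviour mentioned in the post.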

    ManAtWork wrote: »
    In this thread I'd like to discuss ideas for how to implement a software watchdog that doesn't have too much impact on performance but can detect dangerous effects of serious software failures such as buffer overruns, stack overflow...
    Of course there is no 100% safety. A crashing cog executing garbage code could accidentally execute a cogstop for the watchdog cog. But if the watchdog detects illegal states and resets the whole propeller fast enough, there's a good chance that nothing serious happens.
    If you have a spare COG, you can do simple heart-beat checks on other COGS, but if you protect against data overflows, that should be less needed.
    It is useful to have a heartbeat led on almost any control product and that can be multi-cog checking.

    You can get regulators with watchdogs; some are windowed, which need a pulse between some limits to avoid reset.
    More paranoid systems implement a power-removal watchdog, as many chips today do not fully reset on a reset pin, and a reset pin does not recover latchup.
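To make the windowed behaviour concrete, here is a small C model of it. It does not describe any particular regulator part; the struct, the tick units and the limits are assumptions for illustration. A feed is accepted only if it arrives neither too early nor too late.

```c
#include <stdint.h>

typedef struct {
    uint32_t min_ticks;   /* feeds earlier than this are rejected */
    uint32_t max_ticks;   /* no feed within this time means expiry */
    uint32_t last_feed;   /* tick count of the last accepted feed */
} window_wdt;

void window_wdt_start(window_wdt *w, uint32_t now,
                      uint32_t min_ticks, uint32_t max_ticks)
{
    w->min_ticks = min_ticks;
    w->max_ticks = max_ticks;
    w->last_feed = now;
}

/* Returns 1 if the feed falls inside the window, 0 if it is too early
   or too late. Unsigned subtraction keeps this correct across wrap. */
int window_wdt_feed(window_wdt *w, uint32_t now)
{
    uint32_t elapsed = now - w->last_feed;
    if (elapsed < w->min_ticks || elapsed > w->max_ticks)
        return 0;
    w->last_feed = now;
    return 1;
}

/* Returns 1 once the upper limit has passed without an accepted feed. */
int window_wdt_expired(const window_wdt *w, uint32_t now)
{
    return (now - w->last_feed) > w->max_ticks;
}
```

The lower limit is what distinguishes this from a plain timeout: runaway code that toggles the feed pin in a tight loop is rejected as "too early".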
  • It seems like Spin has no invalid bytecodes. https://github.com/rosco-pc/propeller-wiki/wiki/Spin-Byte-Code While this is good for code density, it means that random data can likely be interpreted as a valid program. If we reduce the percentage of bytecodes that are valid, we are more likely to detect that we are executing random data instead of a program. We don't want to reduce the number of bytecodes, but we could add extra invalid ones. For example, if we have 256 instructions, encode them into 16 bits. When the interpreter runs, the probability of 16 bits of random data matching ANY instruction is about 0.4% (256/65536 = 1/256). This is inspired by the Shannon limit.

    A special "SecureSpin" variant could add checksums. It could be every instruction, or every N bytes. Or only before writing to the hub or IO registers. The memory address should be part of the checksum, to guard against running code that was copied from another memory location. If the checksum included a unique random number as well it could provide some protection against intentional buffer overflows.
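One way to picture the address-bound checksum is the C sketch below. "SecureSpin" does not exist, so everything here is hypothetical: the tag function, its name, and the use of an FNV-style mix seeded with the block's address.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes a tag over a block of code bytes, seeded with the address the
   block is supposed to live at. The same bytes copied to a different
   address produce a different tag, so misplaced code can be detected. */
uint32_t block_tag(uint32_t address, const uint8_t *bytes, size_t len)
{
    uint32_t h = 2166136261u ^ address;    /* bind the tag to the address */
    for (size_t i = 0; i < len; i++) {
        h ^= bytes[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 if the stored tag matches the block at this address. */
int block_tag_valid(uint32_t address, const uint8_t *bytes, size_t len,
                    uint32_t stored_tag)
{
    return block_tag(address, bytes, len) == stored_tag;
}
```

Mixing a per-device random number into the seed as well would correspond to the last suggestion in the post; that part is left out here.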
    James https://github.com/SaucySoliton/

    Invention is the Science of Laziness
  • I think Peter Jakacki's idea is really the only way to be sure if you have a critical system and complex software that may contain major bugs...

    I think I'd use a second Propeller to act as a supervisor over the main Propeller.
    Maybe have one cog in the main Prop check on things and then report over a serial link to the Supervisor.
    I think this could be bullet proof...
    Prop Info and Apps: http://www.rayslogic.com/
  • frank freedman Posts: 1,478
    edited 2019-08-06 - 04:58:08
    Rayman, Rayman, Rayman...for zis you haf no chance. As soon as you said bulletproof, it will no longer be...... If bullet proof was possible, Lloyds, Hanford and other commercial and industrial insurers would be quite obsoleted.
    Ordnung ist das halbe Leben
    I gave up on that half long ago.........
  • Well, I should have been more precise about what my requirements are.

    A redesign with an extra propeller or a different CPU is not an option. If I did a redesign I'd completely leave out the propeller and take a CPU with watchdog and protection features. There are ARM based chips with two cores and a comparison circuit that verifies every single computation result and memory write.

    The software watchdog solution is meant as an in-the-field update to existing products to reduce the chance of a fatal failure to a minimum.

    I agree with PhiPi that avoiding and fixing bugs has priority over "post mortem" fixes. But you can't foresee every possible pitfall. Even if I hadn't used Pham's driver but had written my own, I'd never have thought about the ENC28J60 returning 0 or negative numbers as packet length. And even a 100% test where every branch of conditional code ran at least once wouldn't have caught this case. It was simply not possible to test against this before the "evil" switch was sold. A standard network interface card couldn't even send such a corrupted packet.
  • Unfortunately, a simple RC or countdown watchdog doesn't work well for code runaway. They are good for detecting hangup-crashes but that is not the worst-case scenario for me. A hangup/freeze would stop the machine and leave it in the current state. That is safe.

    But chances are high that executing random memory contents as code generates loops that periodically trigger outputs or even overwrite timer registers to generate random frequencies. I actually have a simple RC watchdog on my board but in many cases it didn't time out. So the watchdog should at least be restrictive enough to require special patterns, checksums or other question-answer protocols where the "magic code" is unlikely to be hit by chance.
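A question-answer protocol of this kind could be sketched in C as follows. The response formula, the constants and the function names are invented for the example; the only property that matters is that the correct answer depends on a fresh challenge, so replaying an old answer or writing a fixed pattern fails.

```c
#include <stdint.h>

/* Hypothetical response function, known to both the monitored code and
   the watchdog. Any cheap mixing function would serve the same purpose. */
uint32_t wdt_expected_response(uint32_t challenge)
{
    return (challenge ^ 0xA5A5A5A5u) * 2654435761u + 1u;
}

/* Produces the next challenge with a xorshift32 step, so that each
   round asks a different question. Zero is avoided as a state. */
uint32_t wdt_next_challenge(uint32_t previous)
{
    uint32_t x = previous ? previous : 1u;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* Watchdog side: returns 1 only if the answer fits the current challenge. */
int wdt_response_valid(uint32_t challenge, uint32_t response)
{
    return response == wdt_expected_response(challenge);
}
```

In use, the watchdog cog would publish a challenge in hub RAM, the main loop would write back `wdt_expected_response(challenge)` within a time limit, and the watchdog would reset on a wrong or missing answer before moving on to the next challenge.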
    Well, the rule is, output trumps input and high trumps low. So, make low output be the active state of a certain control pin. Your watchdog drives that pin high if it needs to override a fault, if that pin locks low too long for example.

    Good idea, Tracy. Unfortunately, all the high-side drivers I use expect an active-high signal, and so does an N-channel MOSFET driven directly by a propeller pin. So we need external inverters.
  • To be fair, most microcontrollers don't have any memory or IO pin protection.

    ARM Cortex-M0 parts at least have a standard watchdog and some degree of IO-register protection. Some peripherals require special register write sequences to be enabled. And they have an oscillator watchdog against clock failure which is useful in switching power applications (PWM). Cortex-M7 CPUs have memory protection and a flash/code checksum watchdog.

    But all lamenting and arguing doesn't help much. I think a software watchdog is still the best option for a safety update. Not bulletproof but much better than nothing. Using inverters and active-low outputs is good for new designs.
  • MJB Posts: 1,104
    And if you have a spare COG running in COG mode, at least its code is safe.
  • AJL Posts: 163
    edited 2019-08-06 - 12:59:27
    ManAtWork wrote: »
    Well, I should have been more precise about what my requirements are.

    A redesign with an extra propeller or a different CPU is not an option. If I did a redesign I'd completely leave out the propeller and take a CPU with watchdog and protection features. There are ARM based chips with two cores and a comparison circuit that verifies every single computation result and memory write.

    The software watchdog solution is meant as an in-the-field update to existing products to reduce the chance of a fatal failure to a minimum.

    I agree with PhiPi that avoiding and fixing bugs has priority over "post mortem" fixes. But you can't foresee every possible pitfall. Even if I hadn't used Pham's driver but had written my own, I'd never have thought about the ENC28J60 returning 0 or negative numbers as packet length. And even a 100% test where every branch of conditional code ran at least once wouldn't have caught this case. It was simply not possible to test against this before the "evil" switch was sold. A standard network interface card couldn't even send such a corrupted packet.

    The thread title is "How to implement a watchdog" so in many respects it's not just about your specific requirements.

    If you have a spare cog, writing a PASM watchdog routine is probably the best approach. This shouldn't have any impact on performance, and could use a combination of pin monitoring and code check-summing to achieve your needs. Being PASM based it can remain protected from your hubram overwrite issue.

    As an aside, the code you posted doesn't match the version of the driver available in the OBEX:
    PUB get_frame(pktptr) | packet_addr, new_rdptr
    '' Get Ethernet Frame from Buffer
    
      banksel(ERDPTL)
      wr_reg(ERDPTL, packetheader[nextpacket_low])
      wr_reg(ERDPTH, packetheader[nextpacket_high])
    
      repeat packet_addr from 0 to 5
        packetheader[packet_addr] := rd_sram
    
      rxlen := (packetheader[rec_bytecnt_high] << 8) + packetheader[rec_bytecnt_low]
    
      'bytefill(@packet, 0, MAXFRAME)                       ' Uncomment this if you want to clean out the buffer first
                                                            '  otherwise, leave commented since it's faster to just leave stuff
                                                            '  in the buffer
      
      ' protect from oversized packet
      if rxlen =< MAXFRAME
        rd_block(pktptr, rxlen)
        {repeat packet_addr from 0 to rxlen - 1
          BYTE[@packet][packet_addr] := rd_sram}
    

    The buggy switch might still cause problems even with the updated driver if it delivers headers indicating zero or negative length packets.

    A simple fix for that might be:
    if rxlen =< MAXFRAME
        rd_block(pktptr, rxlen#>1)
    
  • A watchdog could also check for free cogs (which can be done on the side by simply checking the carry out from hub RAM instructions) and reset when it finds any. That protects it from 7/8ths of random cog stops.
  • jmg Posts: 13,778
    ManAtWork wrote: »
    A redesign with an extra propeller or a different CPU is not an option.
    The software watchdog solution is meant as an in-the-field update to existing products to reduce the chance of a fatal failure to a minimum.
    You can do a quite useful one, by scanning memory and checking
    a) Code & data-table areas that should never change, always return the same checksum/crc.
    b) Some data or code areas that should change, are checked to ensure they update at expected rates.

    If you also want to watch the watchdog, you can use a regulator with a watchdog inbuilt, or a sub 50c MCU as a system monitor/watchdog, and that gives bonus security/serial numbers.

    ManAtWork wrote: »
    But you can't foresee every possible pitfall. Even if I hadn't used Pham's driver but had written my own, I'd never have thought about the ENC28J60 returning 0 or negative numbers as packet length. And even a 100% test where every branch of conditional code ran at least once wouldn't have caught this case. It was simply not possible to test against this before the "evil" switch was sold. A standard network interface card couldn't even send such a corrupted packet.
    You cannot test all cases, but you can code as I described above, to bound limit any writes to areas that can also contain code.
    That does not prevent errors, but it does ensure you cannot corrupt code yourself.

    PC compilers used to have range checking options, and there could be a case for embedded compilers to add a range-bounding option as the code cost is quite low...
    Here, we code that in manually.
  • I wonder if any bounds checking options in GCC for P1 would have prevented this...