Prop Lockup - Help with Watchdog Object

Joms · 2009-07-24 19:31

I wrote a program to monitor the weight of two scales, which many people here helped with. Now I have a new problem with this project.

It seems that after about 24-36 hours of working properly, the MAIN public locks up. I am not exactly sure where in the program it happens, but I know it is not the entire prop. The video still outputs and the heartbeat LED still flashes, but it will not update the display until I reboot.

Here are two solutions, which bring up a few more questions:

1. I was thinking about programming a pin to go high for a bit every time the MAIN public repeats. I would then take this pin to a 555 timer and reset the chip if it doesn't detect a pulse every few seconds. I understand this to be a watchdog circuit.

2. When I was searching the forum for watchdog posts, I noticed that some people run a watchdog in another cog instead of a hardware watchdog. I didn't find any objects for this in the OBEX, or is this something I just write? I don't know how to start that code; are there any examples to learn from somewhere? What is the theory behind the operation of a software watchdog?

3. If someone looks at my program, do they see anything that would cause it to lock up? Am I programming something wrong? Should I be using a REPEAT somewhere I am not?

Thanks in advance for the help with this one. I just installed my project and a day later lost an afternoon of data because the thing was locked up and I didn't notice it. I am open to any other ideas people might have...

Attached should be my entire program...

Post Edited (Joms) : 7/25/2009 3:41:32 AM GMT

MagIO2 · 2009-07-24 20:04

You already mentioned that only your main is locked up. So, there is no need to restart the whole prop. If you restart the whole prop, the video generation stops as well until it's reloaded. With a software watchdog you could simply restart the one COG without a dropout of the video signal.
How to do the software watchdog depends on how you use the prop. If you still have free COGs, the easiest way is to dedicate one to it. For example: for each COG you want watched, you reserve a byte at a fixed place in HUB RAM. The watched COG has to change this byte frequently. The watchdog COG, in turn, checks from time to time that the value has changed. If it has not changed within a given timeframe, the watchdog restarts that COG. The tricky part is that the watchdog COG needs to know how to start each COG it supervises, and the code itself has to be restartable. Restartable can mean two different things:
1. It can continue where it was stopped. That's hard, as the variables in use might be in any state.
2. It's a real restart. In this case you should not rely on variables being zero after boot, but initialize each variable yourself.
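
A minimal sketch of the scheme described above, assuming the watched code runs in its own COG rather than in the boot COG. The names `Worker`, `Watchdog`, `heartbeat`, and the stack sizes are illustrative and not taken from the attached program. Instead of comparing against the previous value, the watchdog clears the byte and expects the worker to set it again, so a busy loop cannot accidentally land on the same value at check time.

```spin
VAR
  long workerCog                ' ID of the COG running Worker
  long workerStack[64]          ' stack sizes are guesses for a small loop
  long dogStack[32]
  byte heartbeat                ' set to 1 by Worker on every pass

PUB Start
  workerCog := cognew(Worker, @workerStack)
  if workerCog => 0             ' only supervise if the worker really started
    cognew(Watchdog, @dogStack)
  ' Start returns here, so the boot COG stops and becomes free

PRI Worker
  ' "Real restart" case: initialize every variable here,
  ' do not rely on the zeros that are only there after boot
  repeat
    heartbeat := 1              ' tell the watchdog this pass completed
    ' ... do the actual work ...

PRI Watchdog
  repeat
    heartbeat := 0                              ' clear the flag ...
    waitcnt(clkfreq * 2 + cnt)                  ' ... and give Worker 2 seconds to set it
    if heartbeat == 0                           ' still clear: Worker is stuck or stopped
      cogstop(workerCog)
      coginit(workerCog, Worker, @workerStack)  ' restart it in the same COG
```

Because only the one COG is restarted, a video driver running in another COG would keep going. The 2 second window is arbitrary and has to be longer than the slowest legitimate pass through the worker loop.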

But ..... I did not have a look at your code yet .. but I guess it should be possible to find the reason for the lock and avoid it.

Mike Green · 2009-07-24 20:05

Common reasons for a program to lock up after running fine for a while:

1) Somewhere the program uses recursion improperly. Typically this is due to calling a method in an attempt to make a GOTO like:

PRI myRoutine
   ' ... Do something
   ' ... Do more stuff
   myRoutine

2) A subscript for an array is out of range and the program eventually starts writing its data on top of the stack area or the program itself. Typically, you'd have an array declared as having 16 bytes (like "VAR byte myArray[16]") and your program doesn't check whether the subscript goes beyond the valid range of 0-15. The program stores something in the next variable in memory, which affects another routine, which eventually causes something that catches your attention.
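
Both pitfalls can be avoided with a few lines. The following is only a sketch: `StoreByte` and `ARRAY_SIZE` are made-up names, and the inner loop runs past the end of the array on purpose to show the check rejecting those writes.

```spin
CON
  ARRAY_SIZE = 16

VAR
  byte myArray[ARRAY_SIZE]

PUB Main | i
  repeat                                    ' loop forever with REPEAT; the stack does not grow
    repeat i from 0 to 20                   ' deliberately runs past the end of the array
      StoreByte(i, i)                       ' indexes 16..20 are rejected instead of written

PRI StoreByte(index, value)
  ' Returns true if the byte was stored, false if index was out of range
  if index => 0 and index < ARRAY_SIZE
    myArray[index] := value
    return true
  return false
```

Without the range check, the writes to indexes 16 to 20 would land on whatever the compiler placed after `myArray`.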

MagIO2 · 2009-07-24 20:13

The abort might be the problem, as you don't catch it ... so, whenever the character received is not 0-9 your main will simply stop.

Joms · 2009-07-24 20:22

Ok, thanks for the help so far...

MagIO2,
It would be OK if the whole chip reset. If I lost video for a few seconds, it would be better than losing data for an afternoon. To be honest, I don't know if I am educated enough to direct a new cog to look directly at a part of memory somewhere else. I will start looking in the prop manual for how to access just one byte in the memory table. I should have 3 cogs left over, so it should be easy if I figure the program out.

Mike,
You actually caught my first problem before, where I was doing what you explained above. I have since replaced that code with a repeat loop and that did not fix it. I am just looking through my code and thinking about doing away with the 3 separate publics and making them into one large public with a repeat. Basically this is how it works now and could be causing an issue...

PUB Main --> PUB ReadLoadInWt --> PUB Main --> PUB ReadLoadOutWt --> PUB Main --> Repeats back to start of PUB Main... Maybe I am going in and out of the main public too often?

Joms · 2009-07-24 20:24

MagIO2,

I shouldn't be getting anything else, but perhaps that could cause it? Does the abort just end the cog that PUB Main is running in? I will pull out my manual and read more about the abort command.

Do you think I should replace it with a 'main' command, so it will just start over the process, or do you think 'return' would be better?

MagIO2 · 2009-07-24 20:52

No, don't replace with main!

Your function is called by main. If you call main from this function again you have a recursion. This will eat up your stack-space and in the end you have a real crash.

To go back to main you have to use return or abort ....
The abort itself is exactly for this kind of problem. Your function detected a problem which it should report to the caller up to a level where the problem can be handled. If no upper-level function can handle it, the COG is simply stopped.

As I read in the manual, by using the abort you lose the possibility to return a value. So what would help in your case is:
You only return a positive number which has max. 6 digits. So, the value is max. 999999. In case of an error you could simply return negative values. For example -1 to tell the caller that a timeout occurred and -2 to tell that you received an invalid character.
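
For comparison, here is a sketch of the other route, catching the abort. As far as the Propeller Manual describes it, `abort` accepts an optional value, and a caller that prefixes the call with `\` (the abort trap) receives that value instead of having its COG stopped. Everything named here (`ReadWeight`, `NextChar`, the `ERR_` constants) is hypothetical; `NextChar` only stands in for a timed serial read such as `loadindata.rxtime(100)`, which is assumed to return -1 on timeout.

```spin
CON
  ERR_TIMEOUT = -1
  ERR_BADCHAR = -2

VAR
  long lastGood                             ' most recent valid weight
  long errors                               ' number of rejected readings

PUB Main | weight
  repeat
    weight := \ReadWeight                   ' "\" traps an abort raised in ReadWeight or below it
    if weight < 0
      errors++                              ' bad reading: count it and keep looping
    else
      lastGood := weight

PRI ReadWeight | c
  result := 0
  repeat 6                                  ' six characters, leading spaces allowed
    c := NextChar
    if c < 0
      abort ERR_TIMEOUT
    if c <> $20
      if c < "0" or c > "9"
        abort ERR_BADCHAR
      result := result * 10 + (c - "0")

PRI NextChar
  ' Stand-in for the real serial read; always delivers the digit 7
  return "7"
```

Returning negative values with a plain `return`, as suggested above, gives the same effect when the error is detected directly in the called method. The trap mainly pays off when the error is raised several calls deep.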

MagIO2 · 2009-07-24 21:03

BTW, maybe it's worth thinking about the serial connection as well. Obviously you did not really expect the character you received. So, how often does it happen that you receive wrong characters? Is it possible that you receive accepted characters that are wrong? (E.g. a 9 has been sent but a 0 has been received.) How important is it to always have a valid value? Maybe you need to add some kind of error detection to your project?! Or harden the serial link against electromagnetic noise.

Joms · 2009-07-24 21:21

The serial connection is actually RS-422 that I convert to RS-232 about 2 inches away from the prop. The RS-422 cable is fairly long, around 1000 feet, so it is entirely possible that it will receive something that shouldn't be there. However, I do not really care if it does, as 99.9% of the time the data is correct. The scale sends out this data about 3 times a second, so if a wrong byte is received it will just be overwritten with the proper one anyway.

Whenever I looked at the display the data is correct, it just seems at some point overnight or some time that it must receive a bad byte and stop.

I also read through the abort command. I am guessing that if it does attempt to abort and the Main doesn't know what to do with the abort it receives back, it will just stop. Because I don't care about getting a bad byte, I would rather the program just keep going, do you think I should just get rid of the 'abort' and 'return' commands? If so, what should I put there?

MagIO2 · 2009-07-24 21:57

pub readloadinwt | c, n, t                               'Read Load-In Weight
  t := 0                                                 'Reset Time-Out Counter to Zero
  repeat while (c := loadindata.rxtime(100)) <> 2        'Wait for STX   
    ++t                                                  'Add 1 to the Time-Out Counter 
    if t > 25                                            'If Time-Out goes over 25 then return 
      Display.loadin(string("    --"))                   'Display -- if Time-Out occurs 
      return -1                                          'Return to Main  
  n := 0
  repeat while n < 3                                     'Get 3 Characters, resetting if another rx
    if (c := loadindata.rxtime(100)) == 2
      n := 0
    else
      n++
  result := 0                                            'Get 6 Characters and accumulate them as weight 
  n := 0
  repeat 6
    c := loadindata.rxtime(100)
    loadinwt[n++] := c
    if c <> $20                                          'Ignore leading spaces 
      if c < "0" or c > "9"                              'Make sure we're really reading numbers 
        repeat -2                                        'Replaces the struck-out abort: report that something is wrong
      else                                               'Mask this digit to convert ASCII to numeric and shift into result
        result := (result * 10) + (c & $F)
  loadinwt[n] := 0                                       'Null terminate string result

  REPEAT
     
    loadin := readloadinwt                               'Store Load-In weight to VAR

    case loadin
      -1:
         Display.loadin( string("  EXP"))
      -2:
         Display.loadin( string("  NAN"))
      other:
         Display.loadin(@loadinwt)                            'Update display driver with Load-In weight

Supposing that the first 2 repeats in readloadinwt are for synchronisation with the sender, that's what I'd do. EXP shows you that the wait time has expired ... if this is displayed for a longer time, you have lost the connection.
NAN shows you that a wrong character has been received ... if this is displayed for a longer time, you have a problem with receiving the right data.

Joms · 2009-07-24 23:22

Wow, I think I am understanding this... Good info...

If the program goes to the first return, it will return a -1.

I am assuming the second edit should be 'return' instead of 'repeat' because that is the only way I can make it work. And if it does use that it returns -2 in which case the display will show NAN.

Do you think I should rely on this change completely? Or should I try to program another cog to be a watchdog? I was looking at still maybe using the 555 timer, but I would have to put a transistor or inverter between the 555 & the prop because the 555 outputs a high when a reset is needed, while the prop is looking for a low...
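
On the software side, the hardware watchdog only needs one short pulse per pass of the main loop. A minimal sketch, assuming a spare pin (`WD_PIN = 12` is made up; the program already uses P8, P9 and P10) and a normal crystal or RCFAST clock so that the 1 ms `waitcnt` is long enough:

```spin
CON
  WD_PIN = 12                               ' hypothetical free pin wired to the 555 circuit

PUB Main
  dira[WD_PIN]~~                            ' make the pin an output
  repeat
    outa[WD_PIN]~~                          ' start of the watchdog pulse
    waitcnt(clkfreq / 1000 + cnt)           ' hold it high for about 1 ms
    outa[WD_PIN]~                           ' end of the pulse
    ' ... read the scales and update the display here ...
```

If the loop stalls anywhere, the pulses stop and the 555 times out. The pulse polarity and width would have to match however the 555 stage is actually wired.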

Joms · 2009-07-25 03:43

Instead of the hardware version I am attempting to write an object that I can use for this and future projects. I would like to make it something universal so I can put it in the exchange.

Basically the object will take a byte address, a delay, and a count as inputs. It will look at the specified byte every 1 second (or the specified delay) and reboot after 10 (or the specified count) failed attempts.

Question: What am I doing wrong? I know this is pretty general, but there isn't much code here. Below is what is in my main program; attached is the object.

VAR
  Byte   toggle

OBJ
  Watchdog: "Watchdog"

PUB Start
  toggle~
  Watchdog.Start(@toggle,1000,10)

  main

PUB Main

  REPEAT

    !outa[8]
    !toggle
   
    loadin := readloadinwt                               'Store Load-In weight to VAR
    Display.loadin(@loadinwt)                            'Update display driver with Load-In weight

    loadout := readloadoutwt                             'Store Load-Out weight to VAR
    Display.loadout(@loadoutwt)                          'Update display driver with Load-Out weight
    
    if loadin > 2000                                     'Activate Load-In Relay if over limit 
      outa[10]~~                                         'P10 High  
    else                                                 'De-Activate Load-In Relay if under limit
      outa[10]~                                          'P10 Low

    if loadout > 2000                                    'Activate Load-Out Relay if over limit
      outa[9]~~                                          'P9 High
    else                                                 'De-Activate Load-Out Relay if under limit
      outa[9]~                                           'P9 Low

Post Edited (Joms) : 7/25/2009 4:09:16 AM GMT

Timmoore · 2009-07-25 04:05

I use this technique a lot in my bots. I start a cog that loops once a second checking a variable; if the variable hasn't incremented since last time, it reboots. The main processing code increments the variable during its processing. If the main code ever fails to increment the variable for 1 second, the system reboots.
If you look in the attached file for the routine watchdog, you can see the watchdog loop checking the variable watchdogcounter. This routine is started from a cognew during initialization, and you can see the watchdogcounter variable incremented in the main processing loop.
I have found this pretty effective. In theory there are ways of writing the watchdog code so that it will not work, but in practice I have found that if something goes wrong, the watchdog catches it.
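
A rough sketch of such a reboot watchdog as a reusable object. It is not the attached code: the file name `Watchdog.spin`, the method names, and the parameters are assumptions, loosely following the address, delay, and count inputs proposed earlier in the thread. It watches a long counter rather than a toggled byte, because a toggle or a small counter can happen to show the same value at two consecutive checks even though the main loop is running.

```spin
' Watchdog.spin (hypothetical sketch)
VAR
  long cog                                   ' COG ID + 1, or 0 when not running
  long stack[32]

PUB Start(counterAddr, delayMs, maxMisses)
  ' counterAddr: hub address of a long that the main loop increments on every pass
  ' delayMs:     time between checks, intended for delays of a few seconds at most
  ' maxMisses:   consecutive unchanged checks tolerated before rebooting
  Stop
  cog := cognew(Watch(counterAddr, delayMs, maxMisses), @stack) + 1
  return cog                                 ' 0 means no COG was free

PUB Stop
  if cog
    cogstop(cog~ - 1)                        ' stop the COG and clear the ID

PRI Watch(counterAddr, delayMs, maxMisses) | last, misses
  last := long[counterAddr]
  misses := 0
  repeat
    waitcnt(clkfreq / 1000 * delayMs + cnt)
    if long[counterAddr] == last             ' counter did not move since the last check
      if ++misses => maxMisses
        reboot                               ' restart the whole chip
    else
      misses := 0
      last := long[counterAddr]
```

The main program would declare a long such as `beat`, call `Watchdog.Start(@beat, 1000, 10)` once, and execute `beat++` on every pass of its loop.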

Joms · 2009-07-25 04:11

Ok, I was able to go through and see how your code worked. I made mine do about the same thing.

Instead of looking at a timer, I just have a variable counting up by one until it reaches 100 at which time I have it reset back to zero.

Basically, I got it working, THANKS!

Post Edited (Joms) : 7/25/2009 5:24:01 AM GMT

MagIO2 · 2009-07-25 16:44

Now that you solved it ... some thoughts about a watchdog ...

OK, good lesson for you, but I think you don't really need a watchdog. You simply had a bug in your software (the abort). Once you have fixed that, there is no reason for the main loop to fail again! You use the timed version (rxtime) for receiving bytes from the serial interface, so there is no external condition that could block your main loop.

Watchdogs are used in highly reliable systems as the last line of defense, for example where external systems could cause a microcontroller to hang. Say you have a microcontroller with an interrupt input connected to an external device. The external device goes crazy and, because of that, the microcontroller never leaves the interrupt routine again.
Or you have several interrupt sources which can fire at nearly the same time and can't be processed fast enough.
These microcontrollers usually have a watchdog circuit in hardware, because if your controller gets stuck in an interrupt routine, a software watchdog has no chance to run. AND interrupts, with all those problems, are needed, as one CPU can only do one thing at a time.

On the Propeller this is different. We don't have interrupts, because we can do 8 things at a time. So, it is only a matter of programming to make your software failsafe. An external device going crazy will of course generate invalid data, but it will never keep your program in a permanent interrupt request routine, if you did your homework.
