Prop Lockup - Help with Watchdog Object
Joms
I wrote a program to monitor the weight of two scales, which many people here helped with. Now I have a new problem with this project.
It seems that after about 24-36 hours of working properly the MAIN public will lock up. I am not exactly sure where in the program it happens, but I know it is not the entire prop. The video still outputs and the heartbeat LED still flashes, but it will not update the display until I reboot.
Here are two possible solutions, plus a few more questions they bring up:
1. I was thinking about programming a pin to go high for a bit every time the MAIN public repeats. I would then take this pin to a 555 timer and reset the chip if it doesn't detect a pulse every few seconds. I understand this to be a watchdog circuit.
2. When I was searching the forum for watchdog posts I noticed that some people run a watchdog in another cog instead of a hardware watchdog. I didn't find any objects for this in the OBEX, or is this something I just write? I don't know how to start that code. Are there any examples to learn from somewhere? What is the theory behind the operation of a software watchdog?
3. If someone looks at my program, do they see anything that would cause it to lock up? Am I programming something wrong? Should I be using a REPEAT somewhere I am not?
Thanks in advance for the help with this one. I just installed my project and a day later lost an afternoon of data because the thing was locked up and I didn't notice it. I am open to any other ideas people might have...
Attached should be my entire program...
Post Edited (Joms) : 7/25/2009 3:41:32 AM GMT
Comments
How to do the SW watchdog depends on your prop usage. If you still have free COGs, it's easiest to use one of them. For example, for each COG you want watched, you reserve a byte at a fixed place in HUB RAM. The watched COG has to change this byte frequently. The watchdog COG, on the other hand, checks from time to time that the value has changed. If it has not changed within a given timeframe, it restarts that COG. The tricky thing is that the watchdog COG needs to know how to start each COG it supervises, and the code itself has to be restartable. Restartable can mean two different things:
1. It can continue where it was stopped. That's hard, as the variables in use might be in any state.
2. It's a real restart. In this case you should not rely on variables being zero as they are after boot, but initialize each variable explicitly.
But I have not had a look at your code yet. I guess it should be possible to find the reason for the lock-up and avoid it.
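The heartbeat scheme described above can be sketched in a few lines. This is a Python simulation rather than Spin, purely to show the logic: a plain list stands in for the heartbeat bytes in HUB RAM, and the restart handlers are hypothetical placeholders for whatever code would relaunch each COG. The class and method names are made up for this sketch and do not come from any existing object.

```python
class HeartbeatWatchdog:
    """Simulates a watchdog COG that supervises several worker COGs."""

    def __init__(self, heartbeats, restart_handlers):
        # heartbeats: shared list, one value per watched COG
        #             (stand-in for bytes at fixed HUB RAM addresses)
        # restart_handlers: one callable per watched COG (hypothetical)
        self.heartbeats = heartbeats
        self.restart_handlers = restart_handlers
        self.last_seen = list(heartbeats)

    def check(self):
        """Call once per timeframe. Restarts every COG whose value has not
        changed since the previous check and returns their indices."""
        restarted = []
        for cog, value in enumerate(self.heartbeats):
            if value == self.last_seen[cog]:
                self.restart_handlers[cog]()
                restarted.append(cog)
            # Remember the current value for the next timeframe.
            self.last_seen[cog] = self.heartbeats[cog]
        return restarted
```

Each watched loop would bump its own entry once per pass (for example `heartbeats[0] = (heartbeats[0] + 1) & 0xFF`), and the watchdog would call `check()` at an interval comfortably longer than the slowest pass.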
1) Somewhere the program uses recursion improperly. Typically this is due to calling a method in an attempt to make a GOTO like:
2) A subscript for an array is out of range and the program eventually starts writing its data on top of the stack area or the program itself. Typically, you'd have an array declared as having 16 bytes (like "VAR byte myArray[16]") and your program doesn't check whether the subscript goes beyond the valid range of 0-15. The program stores something in the next variable in memory, which affects another routine, which eventually causes something that catches your attention.
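To make the second point concrete, here is a small Python model of it. Python itself would raise an error on a bad index, so the sketch uses one flat `bytearray` as "memory", with a 16-byte array followed directly by one neighbouring variable. That layout and all the names are assumptions for illustration only.

```python
# Flat memory model (assumed layout): a 16-byte array at addresses 0..15,
# followed directly by one unrelated variable at address 16.
ARRAY_BASE = 0
ARRAY_SIZE = 16
NEIGHBOUR = 16

def store_unchecked(memory, index, value):
    """Writes without any range check, so index 16 lands on the neighbour."""
    memory[ARRAY_BASE + index] = value

def store_checked(memory, index, value):
    """Writes only when index is within 0..ARRAY_SIZE-1.
    Returns True on success, False if the index was rejected."""
    if 0 <= index < ARRAY_SIZE:
        memory[ARRAY_BASE + index] = value
        return True
    return False
```

With the unchecked version, a subscript of 16 silently changes the neighbouring variable; the checked version leaves it alone and reports the problem to the caller.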
MagIO2,
It would be OK for the whole chip to reset. If I lost video for a few seconds it would be better than losing data for an afternoon. To be honest I don't know if I am educated enough to direct a new cog to look directly at a part of memory somewhere else. I will start looking in the prop manual for how to access just one byte in the memory table. I should have 3 cogs left over, so it should be easy if I figure the program out.
Mike,
You actually caught my first problem before, where I was doing what you explained above. I have since replaced that code with a repeat loop, and that did not fix it. I am just looking through my code and thinking about doing away with the 3 separate publics and making them into one large public with a repeat. Basically this is how it works now, and it could be causing an issue...
PUB Main --> PUB ReadLoadInWt --> PUB Main --> PUB ReadLoadOutWt --> PUB Main --> Repeats back to start of PUB Main... Maybe I am going in and out of the main public too often?
I shouldn't be getting anything else, but perhaps that could cause it? Does the abort just end the cog that PUB Main is running in? I will pull out my manual and read more about the abort command.
Do you think I should replace it with a 'main' command, so it will just start over the process, or do you think 'return' would be better?
Your function is called by main. If you call main from this function again you have a recursion. This will eat up your stack-space and in the end you have a real crash.
To go back to main you have to use return or abort ....
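The difference between calling main again and simply returning can be shown with a call-depth counter. This is a Python sketch with made-up function names, not the code from the attached program: `CallDepth` just records how many calls are open at once, which is what consumes stack space.

```python
class CallDepth:
    """Tracks how many calls are open at once and the deepest level seen."""
    def __init__(self):
        self.current = 0
        self.deepest = 0

    def enter(self):
        self.current += 1
        self.deepest = max(self.deepest, self.current)

    def leave(self):
        self.current -= 1

def main_recursive(depth, passes_left):
    depth.enter()
    if passes_left > 0:
        read_weight_recursive(depth, passes_left)
    depth.leave()

def read_weight_recursive(depth, passes_left):
    depth.enter()
    main_recursive(depth, passes_left - 1)  # wrong: calls main to "go back"
    depth.leave()

def main_loop(depth, passes):
    depth.enter()
    for _ in range(passes):                 # right: main repeats in a loop
        read_weight(depth)
    depth.leave()

def read_weight(depth):
    depth.enter()
    depth.leave()                           # a plain return goes back to main
```

In the recursive version the deepest level grows by two with every pass (3 passes reach depth 7), so a program that runs for a day eventually runs out of stack. The loop version never goes deeper than 2, no matter how many passes it makes.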
The abort itself is exactly for this kind of problem. Your function detected a problem which it should report to the caller up to a level where the problem can be handled. If no upper-level function can handle it, the COG is simply stopped.
As I read in the manual, by using abort you lose the possibility to return a value. So what would help in your case is:
You only return a positive number which has max. 6 digits. So, the value is max. 999999. In case of an error you could simply return negative values. For example -1 to tell the caller that a timeout occurred and -2 to tell that you received an invalid character.
Whenever I looked at the display the data was correct; it just seems that at some point overnight it must receive a bad byte and stop.
I also read through the abort command. I am guessing that if it does attempt to abort and the Main doesn't know what to do with the abort it receives back, it will just stop. Because I don't care about getting a bad byte, I would rather the program just keep going, do you think I should just get rid of the 'abort' and 'return' commands? If so, what should I put there?
Supposing that the first 2 repeats in readloadinwt are for synchronisation with the sender, that's what I'd do. EXP shows you that the wait time has expired ... if this is displayed for longer time you lost connection.
NAN shows you that a wrong character has been received ... if this is displayed for longer time you have problem with receiving the right data.
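The negative-return-value idea, together with the EXP and NAN display texts, can be sketched as follows. This is Python, not Spin, and the input format is an assumption: `chars` is any sequence of single characters in which `None` stands for "the wait time expired". The real scale protocol is not shown in this thread, so treat the parsing as a placeholder.

```python
TIMEOUT = -1    # the wait time expired before a byte arrived
BAD_CHAR = -2   # a character that is not a digit was received

def read_weight(chars):
    """Reads up to 6 digits and returns the value (0..999999),
    or a negative error code instead of aborting."""
    value = 0
    count = 0
    for ch in chars:
        if ch is None:
            return TIMEOUT
        if not ("0" <= ch <= "9"):
            return BAD_CHAR
        value = value * 10 + (ord(ch) - ord("0"))
        count += 1
        if count == 6:
            break
    return value

def display_text(result):
    """Maps a result to what the display should show."""
    if result == TIMEOUT:
        return "EXP"
    if result == BAD_CHAR:
        return "NAN"
    return str(result)
```

Because valid weights are never negative, the caller can tell errors from data with a single comparison and carry on with the next pass of its loop either way.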
If the program goes to the first return, it will return a -1.
I am assuming the second edit should be 'return' instead of 'repeat' because that is the only way I can make it work. And if it does use that it returns -2 in which case the display will show NAN.
Do you think I should rely on this change completely, or should I try to program another cog to be a watchdog? I was still looking at maybe using the 555 timer, but I would have to put a transistor or inverter between the 555 and the prop, because the 555 outputs a high when a reset is needed, while the prop is looking for a low...
Basically the object will take a byte address, delay, and count input. It will look at the specified byte every 1 second (or specified delay) and reboot after 10 (or specified count) failed attempts.
Question: What am I doing wrong? I know this is pretty general, but there isn't much code here. Below is what is in my main program; attached is the object.
Post Edited (Joms) : 7/25/2009 4:09:16 AM GMT
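For readers following along without the attachment, the behaviour described in the post above (look at a byte every so often, reboot after a number of failed attempts) might look roughly like this. It is a Python sketch of the logic only, not the attached object: `read_byte` and `reboot` are hypothetical callables, and "failed attempt" is assumed to mean a poll in which the value had not changed.

```python
import time

def watch_byte(read_byte, reboot, delay=1.0, count=10, sleep=time.sleep):
    """Polls read_byte() every `delay` seconds and calls reboot() after
    `count` consecutive polls in which the value did not change."""
    last = read_byte()
    misses = 0
    while misses < count:
        sleep(delay)
        current = read_byte()
        if current == last:
            misses += 1      # no sign of life since the previous poll
        else:
            misses = 0       # the watched loop is alive; start counting again
            last = current
    reboot()
```

On the Propeller the poll loop would run in its own cog, and `reboot` would presumably be Spin's REBOOT command. Resetting the miss counter on any change means only an unbroken run of stale polls triggers the reset.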
If you look in the attached file for the routine watchdog, you can see the watchdog loop, checking the variable watchdogcounter. This routine is started from a cognew during initialization, and you can see the watchdogcounter variable incremented in the main processing loop.
I have found this pretty effective. In theory there are ways of writing the watchdog code so that it will not work, but in practice I have found that if something goes wrong the watchdog catches it.
Instead of looking at a timer, I just have a variable counting up by one until it reaches 100 at which time I have it reset back to zero.
Basically, I got it working, THANKS!
Post Edited (Joms) : 7/25/2009 5:24:01 AM GMT
OK, good lesson for you, but I think you don't really need a watchdog. You simply had a bug in your software (abort). Now that you have fixed that, there is no reason for the main loop to fail again! You use the timed version for receiving bytes from the serial interface, so there is no external condition which could block your main loop.
Watchdogs are used in highly reliable systems as the last line of defense, where for example external systems could cause a microcontroller to hang. Say you have a microcontroller which has an interrupt input connected to an external device. The external device goes crazy and the microcontroller never leaves the interrupt routine again because of that.
Or you have several interrupt sources which can fire at nearly the same time and can't be processed fast enough.
In these microcontrollers you usually have a watchdog circuit in hardware, because if your controller gets stuck in an interrupt routine there is no chance for a software watchdog to run. AND interrupts, with all their problems, are needed, as one CPU can only do one thing at a time.
On the propeller this is different. We don't have interrupts because we can do at least 8 things at a time. So, it is only a matter of programming to make your software failsafe. An external device running crazy will of course generate invalid data, but it will never keep your program in a permanent interrupt request routine - if you did your homework.