At wit's end: A call for (debugging) suggestions

ags · 2013-05-08 14:33

"I will do my best to take the time to make this a short letter..." - paraphrase of Mark Twain/Blaise Pascal (take your pick)

I've been struggling with a bug for a month (or more). I have not yet resolved it.
I'm using a Prop ProtoBoard USB. I've integrated a Wiznet WIZ812MJ (using the W5100, similar to the Spinneret) as well as an MP3 decoding chip, SD card and RTC.
The symptoms are an (apparently) random yet frequent reboot of the Propeller.

I've gone through every line of code at least once with a thorough review process. I've reviewed the design as well.
I've re-written the W5100 driver from scratch, as well as an abstract socket interface.
I've re-written the MP3 decoder driver from scratch.
I've written the RTC driver from scratch.
I have not touched the fsrw/spi_safe.spin SD driver by lonesock and others.
I am using multiple cogs which may be contending for resources (but I am using locks to prevent that, for sockets as an example).
Other than the main cog (SPIN), all but one are running PASM drivers. The other SPIN cog has an enormous stack (512 longs) which I've moved around in memory. There is no recursion (that I am aware of).

My normal world is large-scale (>1MLOC) software so I am no expert here. I am used to different tools and techniques for debugging. I've spent a great deal of time grasping at straws that have not proven to be the problem. I don't think it's possible to debug this problem by examining bits of code, and I won't ask anyone to look at the entire program.

My request: are there any suggestions as to where to focus next? What is the most usual cause of a reboot? I can describe the symptoms in more detail, but that may just add to the confusion. I have lots of logging statements being written, so I can see how long it takes from the last log statement to the reboot, and it seems almost immediate. I've not seen the usual "off in space for a while - then bad things happen" symptoms. Now at 20MIPS it doesn't take very long to bounce around in 32kB of code space looking for a reboot instruction (I suppose). I guess something can be happening in the hardware which might be causing a brownout - not clear how but it's possible. The regulators on the ProtoBoard are pretty beefy, sized to drive servos. I will make a note to myself to check the output of the wall-wart I'm using to power the board. The WIZ812MJ is soldered to the protoboard; the other items are using a breadboard and wires up to 6" long. I could see that being an issue that might cause improper data reads, but I'm not relying on any value read from the SD card or RTC to determine a prop memory location to be written. I'm only writing to the MP3 decoder (after initialization).

Ideas welcome. In fairness I have to state that I may say "already tried that" or "doesn't seem likely" - but I will absolutely listen.

Thanks.

Mike Green · 2013-05-08 14:44

Either your program is referencing an array out of range or using an invalid pointer or the stack is overflowing. This is the usual cause of these intermittent restarts. If you were driving a motor, noise from the motor might cause a random restart, but, in your case, there's nothing like a motor involved.

SRLM · 2013-05-08 15:00

Do you have serial going out the USB? It may be this bug: http://forums.parallax.com/showthread.php/116711-Can-Parallax-do-something-about-the-FTDI-reset-bug

What have you checked on the hardware side? I'd look at the power supply and the reset line.

ags · 2013-05-08 18:23

Thanks for the information on the FTDI issue, I hadn't heard of that. After reading through the threads, I don't think this is my problem (though haven't tested yet) because it seems this particular problem results in the prop constantly cycling through resets. My condition is resets happening after minutes, hours, or even days of operation. However, this may be another lead. I may put a scope on pin 31 to see if it ever goes high. If I'm somehow driving that pin high through some error in my code wouldn't that explain the problem? A miscreant cog might do such a thing.

Here's the code I've used to conditionally enable the serial lines. Even if I have debug enabled, I think it avoids the problem. Is this correct?

  debug~~ ' ~=OFF ~~=ON

  if debug
    if (ina[30..31]<>)
      debug~

  if debug
    PST.Start(115_200)
    waitcnt(clkfreq + cnt)
    PST.Home
    PST.Clear

Still good to know about this problem for future work. Thanks.

ags · 2013-05-08 18:33

Mike Green wrote: »

Either your program is referencing an array out of range or using an invalid pointer or the stack is overflowing. This is the usual cause of these intermittent restarts. If you were driving a motor, noise from the motor might cause a random restart, but, in your case, there's nothing like a motor involved.

Mike, that is where I've been focusing my attention, and I agree that seems most likely. However, I haven't come up with anything (not that that proves anything other than it's a hard problem and/or I'm not great at debugging). There are other ways of getting into trouble, like uninitialized locals (example is when writing a byte value to a local then assigning to another long. The upper bytes are "dirty" and cause seemingly random results. Don't ask me how I know)

What seems unusual to me is that the reboot sometimes happens after 10 minutes, sometimes after 10 hours. I'm not saying it's impossible, but for a stack overflow I wouldn't expect to see such a wide variation. Things have to line up "just wrong" of course, and that could look very random. Same for an improperly indexed array. And all that the program (under test) is doing is reading from an SD card, writing to the MP3 decoder, and looping for a TCP connect request. Sometimes it reboots with no input at all, sometimes on the first few connect requests, and sometimes not until connecting the next day.

TinkersALot · 2013-05-08 18:37

I can't tell you how many times I have gone witless looking for a problem that ended up being an underpowered supply.

Other than that, the best advice I can give is: Challenge all assumptions.....

rogloh · 2013-05-08 19:19

Sorry to hear of your frustration. My ideas:

1) You should definitely check the power supply and if you can just get another one try it out as well.

2) You should write a dumb program that loads all COGs so that they just sit in tight loops and also access memory intensively (i.e. it loads up the board with the highest current usage) but otherwise should not be able to crash. If it can also enable the other devices you have on the board at startup of this looping program, that would be good too. Then let it run and prove your problem still happens with some highly uncrashable code. If so it could be bad power or bad board H/W.

3) Add large caps ~470uF to the power rails of the board and some 0.1uF bypass near your other devices' power pins to see if it prevents voltage drops or noise that could trigger brownouts or other problems. Maybe your board has some bad capacitors or insufficient bypassing for the other devices?

4) Noise/spikes coming in from USB port? Try to run your board without USB attached to the PC and see if that helps.

5) If you have a DSO, monitor the voltage of the prop and set it to trigger at something a little higher than brownout voltage to see if you are getting power glitches. Also check your reset line.

6) If all else fails in tracking it down from a software side perhaps invest in another board..., or replace the voltage regulator?

I know these are mostly hardware suggestions, but it could always still be a software issue. After once staring at my code intently for days and not seeing anything, I spent ages replacing lots of components on boards, which didn't help as the problem was ultimately in my software. A comma had somehow got added to a line in some C code where it should not have been (and spaced way out over 80 columns on the line, so I didn't even see it in my editor, which wasn't set to do line wrapping, which sucked!). I wish you good luck...

SRLM · 2013-05-08 19:29

ags wrote: »
At first I was really upset at the thought of a problem like this (the FTDI issue) costing me so much time. After reading through the threads, I think it isn't the case. Now I'm back to being disappointed at not having a clue what is wrong! I don't think this is my problem (though haven't tested yet) because it seems the known problem results in the prop constantly cycling through resets. My condition is resets happening after minutes, hours, or even days of operation. However, this may be another lead. I may put a scope on pin 31 to see if it ever goes high. If I'm somehow driving that pin high through some error in my code wouldn't that explain the problem? A miscreant cog might do such a thing.

Here's the code I've used to conditionally enable the serial lines. Even if I have debug enabled, I think it avoids the problem. Is this correct?
  debug~~ ' ~=OFF ~~=ON

  if debug
    if (ina[30..31]<>)
      debug~

  if debug
    PST.Start(115_200)
    waitcnt(clkfreq + cnt)
    PST.Home
    PST.Clear

The easy way to check if it's the FTDI issue is to run with the USB plugged in. The USB voltage powers the FTDI chip, so you won't see any resets due to that.

Mike G · 2013-05-08 20:28

I feel for ya... I created a memory leak in an early version of the Spinneret PASM libraries. It took me weeks to figure it out. What seemed random actually had a pattern. It just took a long while for the problem to affect the application. I ended up building an instrumented tester which allowed me to recognize the pattern. From there I was able to isolate the problematic code.

But it was a tough road... Good luck.

ags · 2013-05-08 20:41

Mike G wrote: »

I feel for ya... I created a memory leak in an early version of the Spinneret PASM libraries. It took me weeks to figure it out. What seemed random actually had a pattern. It just took a long while for the problem to affect the application. I ended up building an instrumented tester which allowed me to recognize the pattern. From there I was able to isolate the problematic code.

But it was a tough road... Good luck.

Mike G, can you elaborate a bit more on your "memory leak" in the Spinneret code? I understand the concept well in a large-scale environment. With no heap or dynamic memory allocation (other than the stack) what exactly is a memory leak on the propeller? That is exactly what I would be looking for in a different context. I just tried another test and it ran fine... until three hours later I connected to the HTTP server and bang! Another reset. A memory leak (over time) would explain this behavior.

Mike Green · 2013-05-08 21:52

1) Each cog that's running Spin has to have a run-time stack. The default start goes from the end of the program upwards towards the end of memory and that's usually not a problem. Each cog that's started running the Spin interpreter has to have its own stack. This is usually allocated from an array declared for that purpose and these arrays have a fixed size. If it's not big enough for the worst case use, the stack may overflow and write over other variables or code that lies beyond the upper bound of the array used for the stack. Errors like this can take a while to declare themselves since the overwritten code or variables might not be needed for some time.

2) You can have pointers and array subscripts in Spin just like in a "large-scale environment". You can create a heap or other dynamic memory allocation mechanism. Most programs don't have this because it's expensive in terms of code size and adds overhead to execution time, but your program could have used one of the dynamic memory allocation objects in the ObEx. There are uses of pointers other than dynamic memory allocation, and these can go awry too.
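
One common way to check for the overflow described in point 1 is to "paint" the stack array with a sentinel value before starting the cog, then look at how many sentinel longs are still untouched. On the Propeller you would do this in Spin against the real array; the Python below is only an off-target model of the logic, and the names (SENTINEL, paint, high_water_mark, nearly_full) and the 16-long margin are made up for illustration.

```python
SENTINEL = 0xDEADBEEF  # arbitrary fill pattern, unlikely to occur as real data

def paint(size):
    """Return a model stack of `size` longs, filled with the sentinel."""
    return [SENTINEL] * size

def high_water_mark(stack):
    """Number of longs used so far, assuming the stack grows upward from index 0.

    Scans down from the top and stops at the first long that is no longer
    the sentinel; everything at or below that index counts as used.
    """
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] != SENTINEL:
            return i + 1
    return 0

def nearly_full(stack, margin=16):
    """True if fewer than `margin` untouched longs remain at the top."""
    return len(stack) - high_water_mark(stack) < margin

# Model a 512-long stack whose deepest call chain touched 130 longs.
stack = paint(512)
for i in range(130):
    stack[i] = i  # stand-in for real stack contents
print(high_water_mark(stack), nearly_full(stack))
```

A check like this only shows how deep the stack has gone so far, so it needs to run after the worst-case path has been exercised.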

kwinn · 2013-05-08 21:57

Since this could be a hardware or software problem, you could eliminate the hardware by using two high speed comparators and RS flip-flops to monitor the reset pin and the power supply. Use trim pots to set the comparator thresholds, and have the comparator outputs set the RS flip-flops. An LED on each flip-flop output would indicate its state. This circuit would need its own power supply and a push button to reset the flip-flops.

If it turns out not to be hardware you could have each subroutine print a unique character on entry and log that output on the PC to see if there is a pattern of some sort to where the reset occurs. Tedious and time consuming to do, but sometimes the only way to figure out what is happening.
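
A rough sketch of the PC side of that idea, in Python. It assumes (my assumption, not something from the posts above) that the boot code prints a distinctive marker character, here "!", before anything else, so every reset shows up in the captured log. The function name and the sample log are hypothetical.

```python
from collections import Counter

def tails_before_reset(log, marker="!", window=3):
    """Return the last `window` trace characters seen before each reboot."""
    runs = log.split(marker)
    # runs[0] is whatever preceded the first boot marker (normally empty);
    # the final run is still in progress, so it did not end in a reset.
    finished = runs[1:-1]
    return [run[-window:] for run in finished]

# Made-up capture: routines A, B and C each print their own letter on entry.
log = "!ABCABCAB!ABCA!ABCABCABCABCAB!ABC"
tails = tails_before_reset(log)
print(tails)
print(Counter(tails).most_common(1))
```

If the same tail keeps turning up, that routine is the place to look. If the tails are different on every reset, that in itself hints that the cause may not be tied to one code path.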

cavelamb · 2013-05-08 22:20

First things first, ags.

What are you using for power?

ags · 2013-05-09 09:26

Good grief.

Original Post wrote:

I guess something can be happening in the hardware which might be causing a brownout - not clear how but it's possible. The regulators on the ProtoBoard are pretty beefy, sized to drive servos. I will make a note to myself to check the output of the wall-wart I'm using to power the board.

SRLM wrote:

What have you checked on the hardware side? I'd look at the power supply and the reset line.

TinkersALot wrote:

I can't tell you how many times I have gone witless looking for a problem that ended up being an underpowered supply.

rogloh wrote:

You should definitely check the power supply and if you can just get another one try it out as well

cavelamb wrote:

First things first, ags. What are you using for power?

Have you seen cartoons where the character's eyes pop about a foot out of his skull? That's what it felt like when I looked at the wall-wart providing 9vDC to my board. Rating: 210mA...

I found another rated for 500mA, fired it up and let it run over night. This morning, as I walked over to run some quick tests I realized that I'd be frustrated if the problem remained, and frustrated if it was resolved and I had spent >100 hours rewriting and debugging code when the problem was a $5 supply.

The problem is gone. Turns out I'm more relieved than frustrated, and will put this experience to good use. As TinkersALot said, "Challenge all assumptions". The board and supply had worked for years. I forgot that when I first ordered and received the board, I didn't realize it wasn't powered by the onboard USB connection. I scrounged and found a supply lying around. I checked the voltage and tip/polarity, but other than that never thought about it again. I forgot that my other projects end up on boards I design, typically driven by 5V 20A supplies running the rest of the equipment. I didn't stop to think that instead of just one peripheral on the board, I now had an Ethernet module, an RTC, an MP3 decoder, and an SD card. And 6 cogs running at once. Now that I have found the problem, all the symptoms make sense. I must have been right on the edge of the envelope, making it seem random.
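
For anyone who wants to do the sanity check I skipped, here is a back-of-the-envelope sketch in Python. Only the 210mA and 500mA ratings come from this thread. The per-device currents are placeholders, not measured or datasheet values, and the 80% derating is just a rule of thumb I'm assuming. It also assumes linear regulators, so the wall-wart has to supply roughly the sum of the load currents.

```python
def headroom_ma(supply_rating_ma, loads_ma, derate_pct=80):
    """Spare current in mA after derating the supply; negative means overloaded."""
    usable = supply_rating_ma * derate_pct // 100
    return usable - sum(loads_ma.values())

# PLACEHOLDER estimates only: replace with datasheet or measured figures.
loads_ma = {
    "propeller_6_cogs": 80,
    "ethernet_module": 180,
    "mp3_decoder": 40,
    "sd_card": 50,
    "rtc": 1,
}

for rating in (210, 500):  # the two wall-wart ratings mentioned in this thread
    print(rating, headroom_ma(rating, loads_ma))
```

With these placeholder loads the smaller supply comes out negative and the larger one only slightly positive, which is the kind of margin that would be worth measuring rather than guessing.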

I appreciate the constructive replies to this thread, and the PM'd offers to review the entire code. This is a great community.

Heater. · 2013-05-09 09:58

I think many of us are having a little chuckle at your expense just now. Mostly because we have done that kind of running around in circles in vain ourselves.

My first Prop experiments would crash at a certain point in the code. Not always. Turned out that at that point in the code it lit up just one LED too many which clobbered the power supply. Of course, like you, I was totally convinced it was a problem with my ever so cunning and devious code. (Read "overly complicated").

SRLM · 2013-05-09 09:59

I'm glad that you found the problem.

Now that you've found it, I'll divert the thread slightly:

kwinn wrote: »

If it turns out not to be hardware you could have each subroutine print a unique character on entry and log that output on the PC to see if there is a pattern of some sort to where the reset occurs. Tedious and time consuming to do, but sometimes the only way to figure out what is happening.

Thanks for this technique! I really like the simplicity, and I'll add it to my debugging tricks bag.

prof_braino · 2013-05-09 10:16

I was going to suggest "check power", but now I'll suggest editing the original post and marking this as SOLVED

ags · 2013-05-09 10:24

@SRLM: I implemented that as part of my software debugging strategy (using Ethernet connectivity rather than the serial/USB method). The problem was that the failures (now known to be resets) still happened "randomly", and the logging didn't pinpoint the problem. It only pointed to new random areas to review/rewrite (which I now regret...)

@prof_braino: Tried to do that before, it took two attempts to get it right.

TinkersALot · 2013-05-09 10:58

ags wrote: »

Good grief.
....and frustrated if it was resolved and I had spent >100 hours rewriting and debugging code

Well, in the times that I have fallen into this trap, I have done the same thing: crawled over the code to the point of being cross-eyed with it. The upside is that after these exercises I know the code is freaking bulletproof (after I give it enough electrons to work with, that is).

ags · 2013-05-09 13:03

I couldn't agree more. Even after complete rewrites, I spent so much time looking for off-by-one errors, concurrency issues, proper initialization, etc. - this code has "been reviewed".

Additionally, I learned a lot more about the Propeller and Propeller Tool. Didn't want to, hoped not to have to, but now (if I can keep it all in my head - or write down some notes for myself) I understand the fine details of variable size & alignment, referencing vs including objects, object vs instance variables, the unusual DAT symbol with different alignment than size, compiler reordering of VAR symbols, intermediate and immediate operations, have memorized SPIN operator precedence, the difference between SPIN symbol address and cog address, use of "@@"... and even found mistakes in documentation. For instance, without even testing it, I'm comfortable predicting that the locknew example in the datasheet (v1.2, page 123) will not work as expected.

SRLM · 2013-05-09 13:49

ags wrote: »

For instance, without even testing it, I'm comfortable predicting that the locknew example in the datasheet (v1.2, page 123) will not work as expected.

Oooh, do tell! I'm interested, since I've been documenting lock* for the PropGCC effort.

ags · 2013-05-09 15:32

It actually has nothing to do with lock* - it's an issue of size. The SemID variable is a byte. locknew returns a long, which is truncated when it is assigned to a byte. This is then compared to -1, which is represented as a long. The comparison forces the SemID value to be a long, but it is not sign extended when that happens. So in the case of no lock being available, the comparison will be

if 255 == -1

which will always fail. So you will always think you acquired a lock.

To make it work, the comparison should be

if ~(SemID:=locknew) == -1

or

if (SemID:=locknew) == 255

or make SemID a long.

I haven't tested it, but I'm pretty sure.

Like I said, way too much staring at code...

kwinn · 2013-05-09 19:00

SRLM wrote: »

I'm glad that you found the problem.

Now that you've found it, I'll divert the thread slightly:

Thanks for this technique! I really like the simplicity, and I'll add it to my debugging tricks bag.

Glad you like the idea. Another useful thing you can do with it is count how many times each subroutine is called either by counting the number of occurrences of each character in the log data or having a program on the PC count each occurrence as it is received. Very handy information to have if you need to speed up your program.
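
A minimal Python sketch of the counting program on the PC side. The function name, the characters treated as non-trace bytes, and the sample log are all hypothetical.

```python
from collections import Counter

def call_counts(log, ignore="!\r\n"):
    """Count how often each trace character appears, skipping non-trace bytes."""
    return Counter(ch for ch in log if ch not in ignore)

# Made-up capture: one letter per subroutine entry, "!" printed at boot.
log = "!ABCABCAB!ABCA"
for routine, calls in call_counts(log).most_common():
    print(routine, calls)
```

Sorting by count puts the most frequently called routines first, and those are usually the ones worth optimizing.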

ags · 2013-05-09 20:49

I would remove or redact post #22 but it would make the thread disjointed. I'm now here to eat crow. I was wrong. The example in the manual does work, even for the case of failing to get a lock.

I still believe what I wrote is correct, with the exception of the value resulting from an assignment. It appears that the result of this assignment:

VAR
  byte lock
PUB getLock
  if (lock:=locknew) == -1
    <failure>
  else
    <success>

is not the value of lock after the assignment, but the value which was assigned to it (the value returned by locknew). So while it is true a comparison of:

lock:=locknew
if lock == -1

will always fail, it's not the value of lock that matters in the parenthesized assignment inside the if, but the return value of locknew.
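
The byte-versus-long part of this, meaning the two-statement form, can be modeled off-target. The Python below is only a model with made-up helper names (store_byte, sign_extend_byte). It covers truncation on store and sign extension from bit 7, not what the Spin interpreter does with the value of an assignment expression.

```python
def store_byte(value):
    """Model writing a 32-bit value into a byte variable: keep the low 8 bits."""
    return value & 0xFF

def sign_extend_byte(value):
    """Model Spin's unary ~ operator: sign-extend from bit 7."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value

NO_LOCK = -1                     # locknew's "none available" result
lock = store_byte(NO_LOCK)       # what a byte variable would hold afterwards

print(lock)                      # the byte reads back zero-extended
print(lock == -1)                # comparison from the separate-statement form
print(sign_extend_byte(lock) == -1)
```

The stored byte reads back as 255, so the plain comparison with -1 fails, and sign-extending first makes it succeed.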

Maybe I should consider a career in pottery instead. Ugh.

kwinn · 2013-05-09 22:55

ags wrote: »

.........

Maybe I should consider a career in pottery instead. Ugh.

No, don't do that. Playing with the propeller is much more fun. We all jump to incorrect conclusions occasionally. There are times when I feel the only thing I open my mouth for is to change feet.
