At wit's end: A call for (debugging) suggestions
ags
Posts: 386
"I will do my best to take the time to make this a short letter..." - paraphrase of Mark Twain/Blaise Pascal (take your pick)
I've been struggling with a bug for a month (or more). I have not yet resolved it.
I'm using a Prop ProtoBoard USB. I've integrated a Wiznet W812MJ (using the W5100, similar to the Spinneret) as well as an MP3 decoding chip, SD card and RTC.
The symptoms are an (apparently) random yet frequent reboot of the Propeller.
I've gone through every line of code at least once with a thorough review process. I've reviewed the design as well.
I've re-written the W5100 driver from scratch, as well as an abstract socket interface.
I've re-written the MP3 decoder driver from scratch.
I've written the RTC driver from scratch.
I have not touched the fsrw/spi_safe.spin SD driver by lonesock and others.
I am using multiple cogs which may be contending for resources (but I am using locks to prevent that, for sockets as an example).
Other than the main cog (SPIN), all but one are running PASM drivers. The other SPIN cog has an enormous stack (512 longs), which I've moved around in memory. There is no recursion (that I am aware of).
My normal world is large-scale (>1MLOC) software so I am no expert here. I am used to different tools and techniques for debugging. I've spent a great deal of time grasping at straws that have not proven to be the problem. I don't think it's possible to debug this problem by examining bits of code, and I won't ask anyone to look at the entire program.
My request: are there any suggestions as to where to focus next? What is the most usual cause of a reboot? I can describe the symptoms in more detail, but that may just add to the confusion. I have lots of logging statements being written, so can see how long it takes from the last log statement to the reboot - and it seems almost immediate. I've not seen the usual "off in space for a while - then bad things happen" symptoms. Now at 20MIPS it doesn't take very long to bounce around in 32kB of code space looking for a reboot instruction (I suppose). I guess something can be happening in the hardware which might be causing a brownout - not clear how but it's possible. The regulators on the ProtoBoard are pretty beefy, sized to drive servos. I will make a note to myself to check the output of the wall-wart I'm using to power the board. The W812MJ is soldered to the protoboard, the other items are using a breadboard and wires up to 6" long. I could see that being an issue that might cause improper data reads but I'm not relying on any value read from the SD card or RTC to determine a prop memory location to be written. I'm only writing to the MP3 decoder (after initialization).
Ideas welcome. In fairness I have to state that I may say "already tried that" or "doesn't seem likely" - but I will absolutely listen.
Thanks.
Comments
What have you checked on the hardware side? I'd look at the power supply and the reset line.
Here's the code I've used to conditionally enable the serial lines. Even if I have debug enabled, I think it avoids the problem. Is this correct?
Still good to know about this problem for future work. Thanks.
Mike, that is where I've been focusing my attention, and I agree that seems most likely. However, I haven't come up with anything (not that that proves anything other than it's a hard problem and/or I'm not great at debugging). There are other ways of getting into trouble, like uninitialized locals. For example, writing a byte value to a local and then assigning it to another long leaves the upper bytes "dirty", which causes seemingly random results. (Don't ask me how I know.)
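The "dirty upper bytes" trap can be modeled off-target. This is only a rough Python sketch, not Spin: the 4-byte slot stands in for a long-sized local, and the leftover contents (`0xDEADBEEF`) are a made-up value chosen for illustration.

```python
import struct

def write_byte_then_read_long(slot, value):
    """Overwrite only byte 0 of a 4-byte slot, then read the whole slot
    back as a little-endian unsigned 32-bit long."""
    slot[0] = value & 0xFF                      # only the low byte changes
    return struct.unpack("<I", bytes(slot))[0]  # all four bytes are read

# Hypothetical leftovers from an earlier call (illustrative value only).
dirty_slot = bytearray(b"\xEF\xBE\xAD\xDE")
# The same slot, explicitly zeroed before use.
clean_slot = bytearray(4)

dirty_result = write_byte_then_read_long(dirty_slot, 0x12)
clean_result = write_byte_then_read_long(clean_slot, 0x12)

print(hex(dirty_result))  # upper three bytes are whatever was there before
print(hex(clean_result))  # just the byte that was written
```

The point is that the result depends on what happened to be in memory earlier, which is why the failure looks random from run to run.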
What seems unusual to me is that the reboot sometimes happens after 10 minutes, sometimes after 10 hours. I'm not saying it's impossible, but for a stack overflow I wouldn't expect to see such a wide variation. Things have to line up "just wrong" of course, and that could look very random. Same for an improperly indexed array. And all that the program (under test) is doing is reading from an SD card, writing to the MP3 decoder, and looping for a TCP connect request. Sometimes it reboots with no input at all, sometimes on the first few connect requests, and sometimes not until connecting the next day.
Other than that, the best advice I can give is: Challenge all assumptions.....
1) You should definitely check the power supply and if you can just get another one try it out as well.
2) You should write a dumb program that loads all COGs with code that just sits in tight loops and also accesses memory intensively (i.e. loads up the board with the highest current usage) but otherwise should not be able to crash. If it can also enable the other devices you have on the board at startup, that would be good too. Then let it run and see whether your problem still happens with some highly uncrashable code. If so, it could be bad power or bad board H/W.
3) Add large caps (~470uF) to the power rails of the board and some 0.1uF bypass caps near your other devices' power pins to see if that prevents voltage drops or noise that could trigger brownouts or other problems. Maybe your board has some bad capacitors or insufficient bypassing for the other devices?
4) Noise/spikes coming in from USB port? Try to run your board without USB attached to the PC and see if that helps.
5) If you have a DSO, monitor the voltage of the prop and set it to trigger at something a little higher than brownout voltage to see if you are getting power glitches. Also check your reset line.
6) If all else fails in tracking it down from a software side perhaps invest in another board..., or replace the voltage regulator?
I know these are mostly hardware suggestions, but it could always still be a software issue. Once, after staring at my code intently for days and not seeing anything, I spent ages replacing lots of components on boards, which didn't help, as the problem was ultimately in my software. A comma had somehow got added to a line in some C code where it should not have been (and spaced way out over 80 columns on the line, so I didn't even see it in my editor, which wasn't set to do line wrapping, which sucked!). I wish you good luck...
The easy way to check if it's the FTDI issue is to run with the USB plugged in. The USB voltage powers the FTDI chip, so you won't see any resets due to that.
But it was a tough road... Good luck.
Mike G, can you elaborate a bit more on your "memory leak" in the Spinneret code? I understand the concept well in a large-scale environment. With no heap or dynamic memory allocation (other than the stack) what exactly is a memory leak on the propeller? That is exactly what I would be looking for in a different context. I just tried another test and it ran fine... until three hours later I connected to the HTTP server and bang! Another reset. A memory leak (over time) would explain this behavior.
2) You can have pointers and array subscripts in Spin just like in a "large-scale environment". You can create a heap or other dynamic memory allocation mechanism. Most programs don't have this because it's expensive in terms of code size and adds overhead to execution time, but your program could have used one of the dynamic memory allocation objects in the ObEx. There are other uses of pointers other than for dynamic memory allocation and these can go awry.
If it turns out not to be hardware you could have each subroutine print a unique character on entry and log that output on the PC to see if there is a pattern of some sort to where the reset occurs. Tedious and time consuming to do, but sometimes the only way to figure out what is happening.
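On the PC side, the logged characters can be cut into one chunk per boot so the last few calls before each reset line up for comparison. A minimal sketch, assuming the firmware prints a boot marker character at startup (the `!` marker and the sample log below are invented for illustration):

```python
def tails_before_reset(log, boot_marker="!", depth=3):
    """Return the last `depth` trace characters seen before each reboot.

    Assumes every boot begins by printing `boot_marker`, so the text
    between two markers is one complete run that ended in a reset.
    """
    runs = log.split(boot_marker)
    finished = [run for run in runs[:-1] if run]  # last run is still alive
    return [run[-depth:] for run in finished]

# Hypothetical capture: three boots, the third one still running.
sample_log = "!ABCABD!ABCAB!AB"
tails = tails_before_reset(sample_log)
print(tails)
```

If the tails keep ending in the same routine, that routine is worth a close look. If they end all over the place, as they would with a marginal power supply, the code is a less likely suspect.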
What are you using for power?
Have you seen cartoons where the character's eyes pop about a foot out of his skull? That's what it felt like when I looked at the wall-wart providing 9vDC to my board. Rating: 210mA...
I found another rated for 500mA, fired it up and let it run over night. This morning, as I walked over to run some quick tests I realized that I'd be frustrated if the problem remained, and frustrated if it was resolved and I had spent >100 hours rewriting and debugging code when the problem was a $5 supply.
The problem is gone. Turns out I'm more relieved than frustrated, and will put this experience to good use. As TikersALot said, "Test all assumptions". The board and supply had worked for years. I forgot that when I first ordered and received the board, I didn't realize it wasn't powered by the onboard USB connection. I scrounged and found a supply lying around. I checked the voltage and tip/polarity, but other than that never thought about it again. I forgot that my other projects end up on boards I design, typically driven by 5v20A supplies running the rest of the equipment. I didn't stop to think that instead of just one peripheral on the board, I now had an Ethernet module, an RTC, an MP3 decoder, and an SD card. And 6 cogs running at once. Now that I have found the problem, all the symptoms make sense. I must have been right on the edge of the envelope, making it seem random.
I appreciate the constructive replies to this thread, and the PM'd offers to review the entire code. This is a great community.
My first Prop experiments would crash at a certain point in the code. Not always. Turned out that at that point in the code it lit up just one LED too many which clobbered the power supply. Of course, like you, I was totally convinced it was a problem with my ever so cunning and devious code. (Read "overly complicated").
Now that you've found it, I'll divert the thread slightly:
Thanks for this technique! I really like the simplicity, and I'll add it to my debugging tricks bag.
@prof_braino: Tried to do that before, it took two attempts to get it right.
Well, in the times that I have fallen into this trap, I have done the same thing: crawled over the code to the point of being cross-eyed with it. The upside is that after these exercises I know the code is freaking bulletproof (after I give it enough electrons to work with, that is).
Additionally, I learned a lot more about the Propeller and Propeller Tool. Didn't want to, hoped not to have to, but now (if I can keep it all in my head - or write down some notes for myself) I understand the fine details of variable size & alignment, referencing vs including objects, object vs instance variables, the unusual DAT symbol with different alignment than size, compiler reordering of VAR symbols, intermediate and immediate operations, have memorized SPIN operator precedence, the difference between SPIN symbol address and cog address, use of "@@"... and even found mistakes in documentation. For instance, without even testing it, I'm comfortable predicting that the locknew example in the datasheet (v1.2, page 123) will not work as expected.
Oooh, do tell! I'm interested, since I've been documenting lock* for the PropGCC effort.
if 255 == -1
which will always fail. So you will always think you acquired a lock.
To make it work, the comparison should be
if ~(SemID:=locknew) == -1
or
if (SemID:=locknew) == 255
or make SemID a long.
I haven't tested it, but I'm pretty sure.
Like I said, way too much staring at code...
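The byte-versus-long comparison above can be checked off-target with a small Python model. This is a sketch, not Spin: `store_byte` and `sign_extend_8` are hypothetical helpers standing in for a byte-sized variable and for sign-extension from bit 7. It only models what a byte variable holds after the store; it makes no claim about what value a Spin assignment expression itself evaluates to.

```python
def store_byte(value):
    """Model storing a 32-bit result in a byte-sized variable:
    only the low 8 bits survive."""
    return value & 0xFF

def sign_extend_8(value):
    """Model sign-extending bit 7 of a byte into a signed integer."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value

NO_LOCK = -1                   # "no lock available" result
sem_id = store_byte(NO_LOCK)   # what a byte-sized SemID would hold

print(sem_id)                       # the truncated byte value
print(sem_id == -1)                 # raw byte compared against -1
print(sign_extend_8(sem_id) == -1)  # compared after sign-extension
print(sem_id == 255)                # compared against the byte pattern
```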
Glad you like the idea. Another useful thing you can do with it is count how many times each subroutine is called either by counting the number of occurrences of each character in the log data or having a program on the PC count each occurrence as it is received. Very handy information to have if you need to speed up your program.
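Counting the occurrences on the PC takes only a few lines. A rough sketch, assuming the trace has been captured as a plain string with one character per subroutine entry (the sample string is invented for illustration):

```python
from collections import Counter

def call_counts(log, ignore="\r\n"):
    """Count how often each trace character appears, skipping line endings."""
    return Counter(ch for ch in log if ch not in ignore)

# Hypothetical captured trace: one letter per subroutine entry.
sample = "ABCABCAADB\n"
counts = call_counts(sample)

# Most frequently called routines first.
for ch, n in counts.most_common():
    print(ch, n)
```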
I still believe what I wrote is correct, with the exception of the value resulting from an assignment. It appears that the result of this assignment:
is not the value of lock after the assignment, but the value which was assigned to it (the value returned by locknew). So while it is true a comparison of:
will always fail, it's not the value of lock that matters in the assignment in red, but the return value of locknew.
Maybe I should consider a career in pottery instead. Ugh.
No, don't do that. Playing with the propeller is much more fun. We all jump to incorrect conclusions occasionally. There are times when I feel the only thing I open my mouth for is to change feet.