Tools for a memory corruption hunt. Have any/favorite?

Stephen Moraco · 2008-02-29 08:34

I've been buried by my real job lately but in my propCAN prototype I'm
experiencing a memory corruption (wrong values being written to regions
I care about).

I've no details on the problem yet other than I have cogs driving video
out and they are watching a set of key locations (continually dumping
the current values to the screen) and I see these locations get
overwritten on certain events with incorrect values.

I'm about to begin the hunt (when next I can work on this project ;-)
but just before I do, I thought I'd ask if anybody has had a reason
to chase such problems and if they therefore have any advice for
this kind of chase on the propeller...

As a quick reminder, I've 7 cogs running, some of which are running
Spin and some of which are running assembly code. I also know that
the culprit code is not running on the two cogs involved in creating
the video output (since the problem existed before I activated the
video out).

All "constructive" suggestions welcome ;-)

Regards,
Stephen, KZ0Q
--

http://propcandev.blogspot.com/
http://propcan.moraco.us/

deSilva · 2008-02-29 08:59

In fact, this can be ANY bug.

This list is incomplete of course, but arranged by my sense of likelihood:

(1) local stack overflow in SPIN COGs (nesting? long parameter lists? local parameters?)
COG stacks should start at 20 longs or more.
(2) global stack overflow, i.e. the stack reaches the graphics area at the top of RAM, or ROM.

However, both will soon lead to more severe problems, so they are not the most likely ones in your case.

(3) bad vector size: short by 1, or an old value you forgot to increase because it is a literal.
(4) bad VAR variable alignment. Remember that the compiler rearranges VARs: first LONGs, then WORDs, last BYTEs.
(5) miscomputation of a vector index; often by 1; often caused by a REPEAT loop that is off by one...
(6) wrong dereferencing of addresses; note you can write the more readable
LONG[address][element]
rather than
LONG[address+4*element]
(7) bad address computation from PASM COG.. many possibilities
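
Items (5) and (6) can be sketched in a few lines of Spin. This is only an illustration, not code from the project in question: the names BUF_LONGS, buffer and FillBuffer are made up for the example. It shows the two equivalent ways of addressing an element and where an off-by-one loop bound would land.

```spin
CON
  BUF_LONGS = 8                 ' hypothetical buffer length, in longs

VAR
  long buffer[BUF_LONGS]        ' valid indices are 0..BUF_LONGS-1

PUB FillBuffer(value) | i, base
  base := @buffer               ' hub address of the first element

  ' These two statements write the same long (element 3):
  long[base][3] := value
  long[base + 4*3] := value

  ' Correct bounds: touches elements 0..BUF_LONGS-1 only.
  repeat i from 0 to BUF_LONGS - 1
    long[base][i] := value

  ' Off by one: "repeat i from 0 to BUF_LONGS" would also write
  ' long[base][BUF_LONGS], i.e. whatever sits right after buffer in memory.
```

The indexed form keeps the element size out of the arithmetic, which removes one chance to get the multiplier wrong when a buffer later changes from LONG to WORD or BYTE.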

The only SYSTEMATIC way to hunt such bugs is to comment out the code that you suspect can WRITE to the area that gets corrupted, piece by piece. As this might disturb the general functionality, you may have to substitute constant values, etc., to keep things running.

It is most important to do this SYSTEMATICALLY, i.e. in any case write down what you have already done.
A simple way is to SAVE ALL those test versions so you can return to them if you are no longer sure whether it really worked with that variant the day before yesterday.

Post Edited (deSilva) : 2/29/2008 10:36:05 AM GMT

mirror · 2008-02-29 10:18

I had an odd cause of a memory corruption at one stage.
I had a PASM subroutine that was called using a CALL instruction, but later called using a JMP instruction (accidentally of course, as the result of some refactoring) - which means that the code returned to the address stored by the prior CALL. What makes this bug subtle is that the return address was valid, so the code did appear to work, but the code was being executed in an invalid context - hence memory corruption. If the cog had just hung/crashed it would have been easy to find, but half-working code is more dangerous than non-working code.
I eliminated / rewrote pieces of code until I found the culprit - about 2 days solid debugging.
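
A minimal sketch of that failure mode, assuming nothing about the original code (the labels entry, bump and counter are invented for illustration). On the Propeller, CALL #bump stores its return address into the instruction at bump_ret; a JMP #bump leaves that stored address untouched.

```spin
PUB Start
  cognew(@entry, 0)

DAT
              org     0
entry         call    #bump             ' CALL patches bump_ret to return to the next line
              jmp     #bump             ' BUG: enters bump without patching bump_ret
              jmp     #entry            ' never reached

bump          add     counter, #1
bump_ret      ret                       ' returns to wherever the most recent CALL pointed

counter       long    0
```

Here the RET after the JMP goes back to the line following the CALL, which is the JMP itself, so the cog loops through bump forever. Nothing crashes, which is exactly why this kind of bug is hard to spot.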

hippy · 2008-02-29 14:26

Apart from commenting out all the code and slowly adding it back in, looking for when the problem arises, I don't have much to suggest. Removing code until the problem goes away is a similar approach. You may not identify the actual problem but may get a pointer as to what it relates to.

Analysing the data which is causing the corruption may point to where it's originating from.

One trick may be to add delays in the various Cogs one at a time. If you can slow down the Cog and correlate that with a slowdown of corruption you may have a pointer to at least which Cog is causing a problem.

Stephen Moraco · 2008-04-08 03:18

Thanks for all the helpful hints. I thought I'd add that I did find the problem. It should serve as a reminder of
the diligence we need when using (1) self-modifying code, and (2) instructions whose documented operand roles
run contrary to the usual instruction form.

I offer a quick explanation:

The appearance of memory corruption was actually caused by my having incorrectly initialized a Cog RAM indexing
loop which was writing to Main RAM. I guess it takes a little more time for some concepts to fully manifest in my
brain.

This is part 1. Part 2 I really had to review a number of times. The wr[long|word|byte] series
of instructions demands careful attention! The destination part of the instruction encoding specifies the source
data for the write!!! This tripped me up a couple of times while getting this working... (of course, this is also a reminder
to code when one is more alert!)
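
For anyone who trips over the same thing, here is a small sketch of the operand order. It is not the code from propCAN; mailbox, hubaddr, value and check are hypothetical names chosen for the example.

```spin
VAR
  long mailbox                          ' hypothetical hub variable the cog writes to

PUB Start
  cognew(@entry, @mailbox)              ' PAR will hold the hub address of mailbox

DAT
              org     0
entry         mov     hubaddr, par      ' mov: destination first, source second
              wrlong  value, hubaddr    ' wrlong: DATA in the destination field, hub ADDRESS in the source field
              rdlong  check, hubaddr    ' rdlong reads the usual way round: check := long at hubaddr
idle          jmp     #idle             ' park the cog

value         long    $1234_5678
hubaddr       res     1
check         res     1
```

Swapping the two operands of wrlong still assembles without complaint; it just writes the address value to whatever hub location the data value happens to name, which looks exactly like random memory corruption.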

So, there you have it. I found the problem (two bad loops) and fixed it. Once again I feel I can write stable code.

I thought I'd share...

Regards,
Stephen, KC0Q
