Tools for a memory corruption hunt. Have any/favorite?

Stephen Moraco Posts: 303
edited 2008-04-08 03:18 in Propeller 1
I've been buried by my real job lately but in my propCAN prototype I'm
experiencing a memory corruption (wrong values being written to regions
I care about).

I've no details on the problem yet other than I have cogs driving video
out and they are watching a set of key locations (continually dumping
the current values to the screen) and I see these locations get
overwritten on certain events with incorrect values.

I'm about to begin the hunt (when next I can work on this project ;-)
but just before I do, I thought I'd ask if anybody has had a reason
to chase such problems and if they therefore have any advice for
this kind of chase on the propeller...

As a quick reminder I've 7 cogs running... some of which are running
spin and some of which are running assembly code. I know also that
the culprit code is not running on the two cogs involved in creating
the video output. (since the problem existed before I activated the
video out.)

All "constructive" suggestions welcome ;-)


Regards,
Stephen, KZ0Q
--

http://propcandev.blogspot.com/
http://propcan.moraco.us/

Comments

  • deSilva Posts: 2,967
    edited 2008-02-29 08:59
    In fact, this can be ANY bug :-)
    This list is incomplete of course, but it is ordered by my sense of likelihood:

    (1) local stack overflow in SPIN COGs (nesting? long parameter lists? local variables?)
    COG stacks should start at 20 longs or more.
    (2) global stack overflow, i.e. the stack reaches the graphics area at the top of memory, or ROM.

    However both will soon lead to more severe problems, so they are not the most likely ones in your case

    (3) bad vector size: short by 1, or an old value that was never increased because it is a literal.
    (4) bad VAR variable alignment. Remember that the compiler re-arranges VARs as: first LONGs, then WORDs, last BYTEs
    (5) mis-computation of a vector index; often by 1; often caused by a REPEAT loop that is off by one...
    (6) wrong dereferencing of addresses; note you can write the more readable
    LONG[address][element]
    rather than
    LONG[address+4*element]
    (7) bad address computation from PASM COG.. many possibilities
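
    For (1), one common way to check a Spin cog's stack usage is to pre-fill the stack with a known pattern and later count how many longs are still untouched. The following is only a minimal sketch: the names STACK_LONGS, CANARY, Worker and StackFree are made up for illustration, and it assumes the stack grows upward from the address handed to cognew.

    ```spin
    CON
      STACK_LONGS = 40                 ' hypothetical size, tune for your code
      CANARY      = $DEADBEEF          ' arbitrary fill pattern

    VAR
      long  stack[STACK_LONGS]

    PUB Start
      longfill(@stack, CANARY, STACK_LONGS)   ' mark every stack long before launch
      cognew(Worker, @stack)                  ' run Worker in a new Spin cog

    PUB StackFree : n | i
      ' Count untouched longs from the top of the stack downward.
      repeat i from STACK_LONGS - 1 to 0
        if stack[i] <> CANARY
          quit
        n++

    PRI Worker
      repeat
        ' ... the code under suspicion goes here ...
    ```

    If StackFree ever reports 0, the stack has very likely run past its end and into whatever VAR follows it. The count is an estimate only, since a pushed value could happen to equal the pattern.
    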

    The only SYSTEMATIC way to hunt such bugs is to comment out, piece by piece, the code you suspect can WRITE to the area that gets corrupted. As this may disturb the general functionality, you may have to stand in for the removed code with constant values, etc.

    It is most important to do this SYSTEMATICALLY, i.e. in any case write down what you have already done.
    A simple way is to SAVE ALL those test versions so you can return to them if you are no longer sure whether it really worked with that variant the day before yesterday :-)

    Post Edited (deSilva) : 2/29/2008 10:36:05 AM GMT
  • mirror Posts: 322
    edited 2008-02-29 10:18
    I had an odd cause of a memory corruption at one stage.
    I had a PASM subroutine that was called using a CALL instruction, but later called using a JMP instruction (accidentally of course, as the result of some refactoring), which means that the code returned to the address stored by the prior CALL. What makes this bug subtle is that the return address was valid, so the code did appear to work, but the code was being executed in an invalid context, hence the memory corruption. If the cog had just hung/crashed it would have been easy to find, but half-working code is more dangerous than non-working code.
    I eliminated / rewrote pieces of code until I found the culprit - about 2 days solid debugging.

  • hippy Posts: 1,981
    edited 2008-02-29 14:26
    Apart from commenting out all the code and slowly adding it back in, looking for when the problem arises, I don't have much to suggest. Removing code until the problem goes away is a similar approach. You may not identify the actual problem but may get a pointer as to what it relates to.

    Analysing the data which is causing the corruption may point to where it's originating from.

    One trick may be to add delays in the various Cogs one at a time. If you can slow down the Cog and correlate that with a slowdown of corruption you may have a pointer to at least which Cog is causing a problem.
  • Stephen Moraco Posts: 303
    edited 2008-04-08 03:18
    Thanks for all the helpful hints. I thought I'd add that I did find the problem. It should serve as a reminder of
    the diligence we need when using (1) self-modifying code, and (2) instructions whose documented operand roles
    run contrary to the usual instruction form.

    I offer a quick explanation:

    The appearance of memory corruption was actually caused by my having incorrectly initialized a Cog RAM indexing
    loop which was writing to Main RAM. I guess it takes a little more time for some concepts to fully manifest in my
    brain ;-) This is part 1. The 2nd part I really had to review a number of times. The wr[long|word|byte] series
    of instructions demands careful attention! The destination part of the instruction encoding specifies the source
    data for the write!!! This tripped me up a couple of times while getting this working... (of course, this is also a reminder
    to code when one is more alert! ;-)
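
    For anyone who trips over the same thing, here is a small illustrative loop (not the propCAN code; the names buffer, value, hubPtr and count are invented for this sketch). It assumes the caller passes the hub address of an 8-long buffer through PAR. Note that in wrlong the first (destination) operand holds the data and the second (source) operand holds the Main RAM address, and that both the pointer and the counter are initialized before the loop runs.

    ```spin
    VAR
      long  buffer[8]                           ' hub RAM target, 8 longs

    PUB Start
      cognew(@entry, @buffer)                   ' PAR will hold @buffer

    DAT
                  org     0
    entry         mov     hubPtr, par           ' initialize the hub pointer
                  mov     count, #8             ' initialize the loop counter
    :loop         wrlong  value, hubPtr         ' D field = data, S field = hub address
                  add     hubPtr, #4            ' next long in hub RAM
                  add     value, #1
                  djnz    count, #:loop         ' stop after exactly 8 writes
    halt          jmp     #halt                 ' park the cog

    value         long    0
    hubPtr        res     1
    count         res     1
    ```

    Swapping the two wrlong operands, or leaving count uninitialized, makes the cog write to hub addresses nobody intended, which looks just like random memory corruption from the outside.
    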

    So, there you have it. I found the problem (two loops bad) and fixed it. Once again I feel I can write stable code.

    I thought I'd share...

    Regards,
    Stephen, KC0Q
