How to debug strange behaviour

ManAtWork · 2010-10-09 04:03

Hello,

While working on my new stepper motor controller yesterday, I ran into a really strange problem. After I added some lines to my code, the program suddenly stopped working. I removed the new instructions and, of course, the program worked again. But I couldn't find an error no matter how hard I tried. So I simply added NOPs instead of real instructions, and ... Oh no! The problem was back. I found out that the number of NOPs inserted has a great effect on the program's behaviour. Some magic numbers even made the program non-deterministic: I pressed F10 or F11 and sometimes it worked, the next time it didn't.

Even more confusing, when I cycled power several times and loaded the program from EEPROM, it behaved differently each time (so the prog plug/serial connection can't be at fault).

My first thought was a hardware problem. I took another board: same problem. I measured the supply voltage with the scope. There was some noise, but only narrow spikes of 100 to 200mV around 3.3V. I have a 4 layer PCB with four 100nF caps placed directly around the QFP Propeller. The voltage regulator is a TS1117B-33 rated for 800mA, loaded with 60mA and bypassed with two 4.7uF caps, so there is lots of headroom here. I have separate ground planes for the controller and the power stage, which only meet at a single via connecting the two. So no motor current flows through the controller ground plane. Anyway, I tried with the motor disconnected, but the problems persisted.

So I gave up and went back to the code backup from one day before. That version was much less complex and didn't launch cogs on the fly, which makes it much more robust. But for "fun" I tried adding a random number of NOPs again. This time it didn't cause freezing or other severe misbehaviour. However, I could "disable" some features, like automatic current reduction at standstill, with the "right" number of NOPs. I didn't add them inside a time-critical loop but in the initialization part of a cog's assembler code. So the NOPs shouldn't change the relevant timing but only shift the rest of the code towards higher addresses.

My suspicion is that something is corrupting my memory. I have no idea whether the hub memory is overwritten before the code gets loaded into cog memory or whether it's corrupted afterwards. Depending on memory usage, the overwritten variables or code locations differ. With the right number of NOPs, the corrupted address may hit a variable whose initial value is unimportant, thus masking the problem.

I don't expect that anybody can solve my problem right away. But hopefully somebody can give me a hint about what to look for or can recommend a good debugging tool. Unfortunately, the "divide and conquer" rule does not help here. Any part of the code tested separately will work perfectly, simply because it doesn't use all of the address space. So the corrupted memory address could be located far beyond the end of the used memory.

Of course, assembler programming is always "dangerous" and a small typing error can lead to unexpected results. My favorite one is forgetting the "#" for jump instructions. On the one hand, the Propeller is ideally suited for hardware hacks and "cycle squeezing" programming. On the other hand, programming techniques like self-modifying code and the use of raw address pointers without type checking make bugs easy to introduce and hard to find.
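
A minimal sketch of the missing-"#" pitfall mentioned above. The labels entry and counter are hypothetical and chosen only for illustration; they are not from the controller code.

```spin
DAT
              org     0
entry         add     counter, #1     ' instruction stored at cog address 0
              jmp     #entry          ' correct: "#" makes the label itself the target
'             jmp     entry           ' pitfall: the target is taken from the low 9 bits
'                                     ' of the long stored at "entry", not from the label
counter       long    0
              fit     496
```

Without the "#", JMP treats entry as a register and uses its contents as the jump target. In this sketch that long is the ADD instruction, whose source field is 1, so execution would most likely continue at cog address 1 instead of 0. The code still assembles without any warning.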

Another pitfall is using the same label for different assembler DAT sections. For example I want to share a hub variable between two cogs. I write something like

DAT ' cog 1
... wrlong a,adrHub

adrHub  long  0

DAT ' cog 2
... rdlong b,adrHub

Even if adrHub refers to the same hub address, the two occurrences have to be labeled differently in the two cogs, of course. The label adrHub wouldn't have a valid address for cog #2 in the example above. The compiler doesn't warn you because labels share a global namespace. However, the erroneous code might still work in some cases, although a different (undefined) address would be used in cog 2.

This is just one example. It would be nice if we could make some sort of checklist of dangerous pitfalls so it would be easier to hunt such nasty bugs.

Regards

Heater. · 2010-10-09 05:58

We probably need to see more of your code (all) to help with this.

I presume the DAT code you are running in two COGS is actually written in the same .spin file.

You should have ORG 0 at the beginning of each such PASM section.

You should always put FIT at the end.

You do realize that "rdlong a, adrhub" and "wrlong a, adrhub" actually read/write the content of the HUB location whose address is held in adrhub. In your example that is zero, so you are reading and writing HUB location zero. Probably not what you intended.

You will of course need two such adrhub pointers, and they will need to be initialized to the HUB address you want to access before loading the cog. Or they can be set by passing the address into the COG via PAR at start-up.
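
The points above (ORG 0, FIT, one pointer per cog, address passed via PAR) could look roughly like the sketch below. All names (shared, cog1, cog2, adrHub1, adrHub2, value, result) are assumptions made up for this illustration and are not taken from the original code.

```spin
VAR
  long  shared                          ' hub long used by both cogs (hypothetical)

PUB start
  cognew(@cog1, @shared)                ' PAR in cog 1 = hub address of shared
  cognew(@cog2, @shared)                ' PAR in cog 2 = the same hub address

DAT                                     ' cog 1: writer
              org     0
cog1          mov     adrHub1, par      ' fetch the hub address passed by cognew
:loop         wrlong  value, adrHub1    ' write value to the shared hub long
              add     value, #1
              jmp     #:loop
value         long    0
adrHub1       res     1                 ' pointer register belonging to cog 1 only
              fit     496

DAT                                     ' cog 2: reader
              org     0
cog2          mov     adrHub2, par      ' same hub address, but its own register
:loop         rdlong  result, adrHub2   ' read the shared hub long
              jmp     #:loop
adrHub2       res     1
result        res     1
              fit     496
```

Each DAT section only uses labels defined after its own ORG 0, so every label resolves to a register inside the cog that actually runs the code.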

ManAtWork · 2010-10-09 07:50

Hello Heater,

I don't know if it helps much and I don't want to bore anybody with my long code, but here it is (at least the assembler code of the affected cog)
[code]
CON
_CLKMODE = XTAL1 + PLL16X
_XINFREQ = 4_910_000 ' -> 79MHz system clock
clkfrq = 78_560_000
tickSpin = clkfrq / 10 ' Schleifenzeit f

bill190 · 2010-10-09 08:09

This is the way I create stack space for cogs. Using separate variable names...

VAR
long GoStack[120] 'Stack space for go cog
long StackSpaceSendBits[120] 'Stack space for SendBits cog

PUB
cognew(go, @GoStack )
cognew(@SendBits, @StackSpaceSendBits )

ManAtWork · 2010-10-11 01:31

Hello,

after using the weekend to clear my mind, I did a simple test this morning. I commented out all other cog DAT sections. Then I recompiled to check if there are any cross-border references to labels of other DAT sections.

I got a hit! adrError was not defined, resulting in a compiler error message. This could be a logical explanation for the misbehaviour. Since adrError was defined in a DAT section before the cog code that uses it, it had a negative address (relative to ORG 0). Address bits are truncated to 9 bits (cog RAM space), so this points to a long at the end of the RAM.

As long as the used code and variable space is small, nothing important is overwritten and the program still works as expected. Both writes and reads go to the same (wrong) address. However, if the program grows bigger, variables at the end of the DAT sections get overwritten. This explains why inserting NOPs changes the behaviour.

I hope this was the only bug (although Murphy says it wasn't).

Regards

ManAtWork · 2010-10-11 09:09

I tested my fixed code and - good news - it seems to run stably again.

Of course, there still are other bugs but no more scary, unexplainable phenomena.

So I think I panicked too much and should have verified my first suspicion instead of wasting time doubting. But often you just need to share your thoughts with somebody to fight desperation...

Thanks for listening
