Propeller Chip - Apparent Cog Instability

deSilva · 2008-02-19 08:05

It is very obvious to me that this explains the situation! I have not been very systematic in my experiments, but I followed A LOT of theories; bringing them to the end always showed: "Surprise! That was not it!!"
What Chip said is nearly the only possibility left: Sherlock Holmes would have come to it as well: "Eliminating the impossible..."

I had been the prejudiced one!! I couldn't imagine that something would happen with the stack BEFORE the new SPIN interpreter had been loaded to the COG. However it is obvious it needs some parametrization, and as PAR is rather limited it MUST be done on the stack.

So I missed a good opportunity to enhance my reputation

Like Mike, I should say it will suffice just to document it: It is not a bug, as it was implemented deliberately and, nota bene, correctly, but it is as UNEXPECTED as the failing COGNEW with a routine of a foreign object. So we have two very awkward traps here now....

I must also apologize to people such as Phil and Hippy, who had a very sound suspicion of COGINIT. I had not! I was sure it worked, but I had used only the bare machine instruction from PASM!!!

Post Edited (deSilva) : 2/19/2008 8:11:06 AM GMT

Phil Pilgrim (PhiPi) · 2008-02-19 08:41

deSilva,

What would a point be, absent a counterpoint?

No apology necessary. It was a great discussion from which we've all benefitted, and I'm glad you were willing to pursue it to the bitter end. (I pretty much bowed out with, "Just say 'no' to COGINIT and other dangerous drugs.") And thanks to Chip for taking time out from the Prop II to shed the light of authority on the root cause.

-Phil

hippy · 2008-02-19 14:03

@ deSilva : No apology necessary, and while I'm averse to CogInit ( for reasons explained ) I
didn't consider CogInit to be bad to use when used "safely". Equally, I was wrong when I said that
I did not think a CogStop before CogInit would improve stability.

@ OzStamp : I don't really see there being any "accusations", certainly not in a bad way. Something
didn't work and everything said was nothing more than trying to find the cause.

@ Chip : While CogInit trampling over the stack in use by the Cog being re-CogInit'd would be a
problem, how does that explain that changes in the code invoking CogInit affect it ? For example,
removing the 'repeat i from 1 to 2' should not alter what's happening with the other Cog's stack.

Is it just a case that with this particular example the timing is such that it does affect the Cog, and
in most cases the other Cog is in WaitCnt so the adverse stack changes have no effect ?

Paul Voss · 2008-02-19 15:18

Another twist in the problem...

I ran the code last night with all the coginits done carefully so cogs always stop on their own - they are never hit with a coginit while running. Just when I thought everything was stable, the freeze-up/garbled text problem happened again.

Next I took out all the case statements from my (real) program. With this one change, the complex code ran fine overnight. Here's what I conclude:

1) clobbering running cogs with coginit is not the source of the problem
2) the case statement is a prime suspect - when it is removed, the program runs well (so far for 8 hours)
3) it could be an interaction between case and coginit?
4) I am now perplexed that deSilva found that case isn't involved???

The mystery deepens..

Paul

PS: I am using coginit to have deterministic operation - I just set the cog IDs once (in CON) and never have to worry about keeping track of who is running and who is not - or running out of cogs. The only additional object I am using is FullDuplexSerial and I modified it so that I specify its cog as well. As for stack space, I made it much larger than needed (by ~10x) so that there would be no questions there. My entire program only uses half the Propstick's memory (program and variables) as shown in the compiler's memory map - so no harm in allocating lots of stack to the cogs.

Paul Voss · 2008-02-19 15:30

Sorry I missed some good comments on the forum before posting my reply - just didn't follow the thread correctly. Nonetheless, the experiment last night still stands (see my previous post for details). So here is my question:

Is it ok to use coginit if it never hits a running cog? For example, my gps cog runs for 9.5 sec max and is called at intervals of 10 sec. At this late date, I am reluctant to change all the coginits to cognews. It would be easy though to add a cogstop prior to every coginit. Any advice?

Given the experiment last night and my (still imperfect) understanding of the deeper issues, I am still suspicious of the case statement.

Any advice? I want to lock down the code this afternoon so there is still time to test.

Paul

deSilva · 2008-02-19 15:51

Paul,
(a) COGINIT is absolutely safe to use, provided that the "stack" is not already in use (by a former activation)
(b) The problem with our "modify-and-try" debugging is that it will only catch situations where the already started COG runs wild due to the stack manipulation in the main program. This only works with a carefully constructed COG code. There are many undetected "problem cases"
(c) I have a deep mistrust of complex CASE match patterns, but no proof for it...
(d) Already yesterday I was not happy with the good reproducibility of my test code... The CHECK routine spends 99.99% of its time in the WAIT instruction.... Chances that the stack manipulations Chip was talking of affect it are terribly low. I also inserted this REPEAT 3000 loop to change the phase, but it made no difference...

I shall invest another hour now for some further investigations...

Advice? Just put a COGSTOP in front of every COGINIT just for safety...
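
A minimal Spin sketch of that advice, assuming a worker that is always restarted on the same cog with the same stack. The names GPS_COG, gpsStack, RestartGps and GpsTask, the stack size and the 10-second cycle are all hypothetical, not taken from Paul's program:

```spin
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

  GPS_COG = 4                            ' hypothetical fixed cog ID

VAR
  long gpsStack[64]                      ' hypothetical stack, reused on every restart

PUB Main
  repeat
    RestartGps
    waitcnt(clkfreq * 10 + cnt)          ' hypothetical 10-second control cycle

PRI RestartGps
  cogstop(GPS_COG)                       ' make sure no old activation still uses gpsStack
  coginit(GPS_COG, GpsTask, @gpsStack)   ' only now let COGINIT prepare the stack

PRI GpsTask
  waitcnt(clkfreq + cnt)                 ' placeholder for the real work
                                         ' returning from the method stops the cog
```

Stopping a cog that has already stopped is harmless, so the extra COGSTOP costs nothing when the worker has exited on its own.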

Post Edited (deSilva) : 2/19/2008 4:10:48 PM GMT

hippy · 2008-02-19 17:09

If it is CogInit trampling on the bottom of the stack, it may be possible to push stack active usage
further up the stack within the CogInit'd Cog by simply adding more local variables.

I don't understand though why removing the one local variable in deSilva's version gets rid of the
corruption. I'd have thought it would make it worse, although we don't know exactly how the stack
is being trampled on. I'm guessing that a push within the bytecode overwrites what the pre-CogInit
action put there, so when the Cog is finally initialised it goes wrong.

Phil Pilgrim (PhiPi) · 2008-02-19 18:14

While it's fascinating to speculate on what exactly is happening with COGINIT, I sincerely hope this discussion is no more than academic. At the risk of sounding strident on the subject, I have to take serious issue with Paul's assertion that he's using COGINIT to ensure determinism. COGSTOP and COGINIT are not really program statements at all, but cudgels used to bludgeon running processes and coerce new ones into unnatural servitude. Except perhaps as watchdogs to deal with emergent situations like runaway processes, I don't believe they belong in a well-written program and, as a matter of style, find their use brutish at best. Using COGSTOP, for example, is like hitting Ctl-Alt-Del or invoking SIGKILL to end a program, cutting it off at the knees, when it could just as easily be made to exit gracefully on its own and tell the world that it's done so.

I would gently suggest that if one thinks his program needs either of these two commands to operate reliably or deterministically (except perhaps in a watchdog process), he needs to reexamine the premises that got him to that point. I'm glad, for example, that Chip did not include a GOTO in Spin. If he had, we'd be having the same discussion in another thread, with one side saying, "If it's there, why not use it?", and the other side pointing out its evils.

Paul, in your case in particular, if the GPS cog can run 95% of the time, why not 100%? What I'm reading between the lines is that you're trying to use a cog-resident process as a subroutine when, in fact, it's something quite different.

I apologize if my tone seems harshly critical. After all, we're all learning, and the reliability of parallel processes is an interesting topic in its own right; so I'm always happy to entertain countervailing opinions. Moreover, if CASE is, in whole or in part, the culprit here, the discussion becomes more than academic, since there are good and valid reasons for wanting to use it.

-Phil

cgracey · 2008-02-19 18:32

Phil, I agree wholeheartedly with what you wrote.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Paul Voss · 2008-02-19 21:23

Just a quick reply to Phil (and Chip). First, thanks for your thoughtful replies - I am learning this parallel processing as I go and am by no means an expert - and despite the challenges the past couple days - I am COMPLETELY sold on the Propeller chip!!!

Phil brought up the simple question "why not run the gps cog 100% of the time since it is running 95% already". In my real flight program, the gps cog is running for 95% of one or two 10-sec control cycles - then it is off for 10 to 60 control cycles to save power. Perhaps there is a better way to do this but... at least for the first flight with the propeller, it is comforting to know that the gps cogcode will run only on cog 4 (for example) and no other part of the code will contend for this cog. It is somewhat an embarrassment of riches - with eight cogs, I don't have to share them and worry about what code is running where. Cogstop(4), for example, always stops the gps - without question.

I plan to put a cogstop before every coginit just to be safe. The case statement is still a concern so I will not use it in the flight code until I understand why removing it fixed the lockup problem last night (in which no clobbered cogs were involved).

Paul

cgracey · 2008-02-19 21:48

Paul, do you think you could post some code that demonstrates the difference between using and not using the CASE statement?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Phil Pilgrim (PhiPi) · 2008-02-19 22:02

Paul,

I understand you may be up against a deadline and are possibly desperate to get something, anything, that works. I sincerely hope, after your first flight, that you'll take the time to understand why the use of COGSTOP and COGINIT is to be strictly avoided. You say that you take comfort in knowing that "the gps cogcode will run only on cog 4 (for example) and no other part of the code will contend for this cog." I wish I could convince you that it doesn't matter one little bit which cog a process runs in. It's best to think of the cogs as residing in an anonymous pool from which they can be fished at will using COGNEW. Although COGNEW returns the number of the cog it assigns, don't think of that number as identifying a fixed slot that you need to know anything at all about, but rather as a receipt that can be presented to COGSTOP in the event that an emergency shutdown of the cog is necessary.

I agree that saving power is a valid reason not to keep a cog running all the time. But I would urge you to consider just letting the cog take a graceful exit after informing the top-level system that it's doing so, rather than using COGSTOP to end it or COGINIT to restart it. The latter will surely lead you down the road of bad programming habits that you'll come to regret eventually.
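
One way to read that suggestion, as a rough Spin sketch: the worker is fished from the pool with COGNEW, sets a flag when it has finished, and then simply returns, which stops its cog. The names gpsStack, gpsCog, gpsDone, StartGps and GpsTask are hypothetical, and the sketch assumes the top level starts the next run well after the flag was set (here, one 10-second cycle later):

```spin
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

VAR
  long gpsStack[64]                      ' hypothetical stack size
  long gpsCog                            ' receipt from COGNEW: cog ID + 1, 0 = never started
  long gpsDone                           ' set by the worker just before it exits

PUB Main
  repeat
    StartGps
    waitcnt(clkfreq * 10 + cnt)          ' hypothetical 10-second control cycle

PUB StartGps : okay
  if gpsCog and not gpsDone              ' previous run still active: leave its stack alone
    return false
  gpsDone := false
  okay := gpsCog := cognew(GpsTask, @gpsStack) + 1   ' 0 if no cog was free

PRI GpsTask
  waitcnt(clkfreq + cnt)                 ' placeholder for one measurement cycle
  gpsDone := true                        ' tell the top level we are finished
                                         ' falling off the end of the method stops this cog
```

The cog number is never chosen by the program; it is only kept (as gpsCog - 1) in case an emergency COGSTOP is ever needed.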

-Phil

Paul Baker · 2008-02-19 23:28

Achieving low power consumption is best accomplished by using a WAITxx instruction: while in a waitcnt/waitpeq/waitpne, the clock to the cog is disabled, placing it into an ultra-low power mode.
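
As an illustration only, a cog that stays loaded but spends its idle time parked in WAITCNT might look like the following sketch. IDLE_SECONDS and the measurement placeholder are hypothetical:

```spin
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

  IDLE_SECONDS = 10                      ' hypothetical idle time between measurements

PUB Main | t
  t := cnt                               ' reference point for the wake-up schedule
  repeat
    ' ... take one measurement here ...
    t += clkfreq * IDLE_SECONDS          ' next wake-up time, without cumulative drift
    waitcnt(t)                           ' the cog waits here until cnt reaches t
```

At 80 MHz the system counter wraps roughly every 53 seconds, so a longer idle period would have to be split into several shorter waits.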

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

cgracey · 2008-02-19 23:44

Jeff and I just ran some tests using a Spin COGINIT within a CASE structure and we could not find any problems that were not remedied by the insertion of a COGSTOP before the Spin COGINIT. We did observe the stack conflict problem theorized in my previous posts, though, where a Spin COGINIT was being performed on a cog that was already executing a Spin routine using the same stack space. As expected, that would cause frequent blow-ups. As long as cogs were stopped before being restarted with Spin code using the same stack space, though, everything worked fine.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Paul Voss · 2008-02-20 02:54

Dear Chip,

I posted the code that locked up last night (despite the fact that all cogs should have stopped on their own well before the 10-sec control cycle clobbered them with a coginit). However, running the code tonight (just a few times), I cannot get it to hang again - maybe it was a fluke. No need to carefully examine this complicated (and now outdated) program - could just file it should any independent concern arise about the case statement in the future. The new code (no case statements, no clobbered cogs) runs well and appears to be very stable. I'll be shipping a Propeller-controlled balloon by FedEx to Svalbard tomorrow. Thanks!

Paul

Paul Baker · 2008-02-20 02:57

Congratulations on getting your project done, I know you were on a really tight schedule. Please keep us informed on how your experiment goes.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Phil Pilgrim (PhiPi) · 2008-02-20 06:26

Paul V.,

I'm so glad you got everything working! I was telling some dinner guests this evening about this thread and your plan to send up a balloon in the arctic. One of them asked where, and I had to say that I didn't know. I wish I had read your latest post before they left; the person who asked was from Norway.

Now that all your coding stuff has been put to bed, can you tell us a little more about the mission?

Thanks, and good luck!

-Phil

deSilva · 2008-02-20 08:30

Also - good luck! I have no doubt that the program will be stable now!

One - though remote - explanation of the one bad test yesterday could be that you had the wrong code in the EEPROM. I know well that configuration management is generally a challenge in the last phase of an overdue project.

AP · 2008-03-19 04:10

I didn't see a comment on this (perhaps I read too fast):

Your code has _xinfreq = 10_000_000, and the _clkmode is using PLL (granted it is just 8x, but read on)

So what gives? All the documentation I have read says that ANYTIME you use the PLL, the xtal frequency is actually boosted 16x inside the PLL; if your _clkmode setting is, say, 8x, then the 'real' PLL output (16x) is cut in half, and a clock frequency of 8x is what the rest of the prop chip sees (outside of the PLL circuit).

Isn't it true that the documentation clearly states that with PLL, the highest _xinfreq is 8MHz? Trying to run > 8MHz _xinfreq with PLL enabled is asking for instability, no? I ran into problems using a 16MHz crystal anytime I tried using the PLL, even just 4x (I thought surely 64 MHz is allowed). When I instead used just _clkmode = xtal1 without the + PLL4x my code worked fine all of the time (granted at only 16MHz clkfreq).

Does your code problem persist with PLL disabled? What if you try it with a 5MHz crystal and 16x PLL?
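
For reference, the 5 MHz / 16x configuration suggested above would be declared roughly as follows. It gives the same 80 MHz system clock as 10 MHz with 8x, but keeps the crystal inside the range the post recalls from the documentation:

```spin
CON
  _clkmode = xtal1 + pll16x              ' crystal oscillator, PLL tapped at 16x
  _xinfreq = 5_000_000                   ' 5 MHz crystal -> 5 MHz * 16 = 80 MHz system clock
```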

Dean

Peter Jakacki · 2008-03-19 07:32

Paul, that garbled hyperterminal stuff got me to thinking. Here is this new board in front of me that was garbling the greeting message on startup. I looked at the code and saw I had the greeting part of it directly after invoking the start function (which calls cognew). So I inserted a 20ms delay between the two and the problem went away. I'm not really trying to get into the discussion but thought I'd give you some feedback on that aspect of it.
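
A sketch of that kind of fix, assuming the stock FullDuplexSerial object on the usual programming pins. The pin numbers, baud rate and greeting text are hypothetical:

```spin
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

OBJ
  serial : "FullDuplexSerial"

PUB Main
  serial.start(31, 30, 0, 115_200)       ' start launches the serial driver in a new cog
  waitcnt(clkfreq / 50 + cnt)            ' about 20 ms for the new cog to come up
  serial.str(string("Hello", 13))        ' greeting is sent only after the delay
```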

*Peter*
