Propeller Chip - Apparent Cog Instability
Paul Voss
I am using a Propeller chip to control small research balloons in the Arctic next month - until recently, everything was going great - the parallel processors are a dream to work with.
However, I have found a very strange instability that I am worried could be a flaw in the chip (hopefully I've just made an idiotic mistake and someone will straighten me out).
I reduced the offending code to a short and simple program (attached). The problem is very fussy, depending on the exact timing of two coginits - change one little thing and the problem goes away. However, as written the problem reliably occurs on both raw and board-integrated Propsticks with differing power supplies and external connections. Although the symptom is garbled text on HyperTerminal, all the serial code can be removed and the problem still persists - in this case, the LED (if enabled) will flash about 9-15 times and then go out - the main and LED cogs both lock up, so it appears to be a cog interaction. Also note that the debug cog is not specified here and could be stopped by a subsequent coginit - in other tests, I have specified the debug cog and got the same lockup problem.
If some of the smart people on this forum could take a look at the attached code excerpt, I would greatly appreciate it! The code is a bit unusual due to the complexity of the parent program it came from. Note that I am not looking for just a fix (there are many simple changes that miraculously fix the problem) - rather, I need to understand what is going on so that I don't fly unstable code. The balloons need to ship very soon - this problem was an unfortunate last-minute surprise.
Thanks
Paul
Comments
Might not be the problem but it's true :)
Graham
Another thing is that, by reinitializing the LED and GPS cogs, you're stopping them at an arbitrary place, then restarting them. Also, a COGINIT will take several milliseconds to perform at 10MHz.
according to Paul the problem occurs at 2 seconds, but not at 1 second - which is opposite to your understanding - and somewhat changes the timing issues.
Paul,
I had a similar garbling of serial characters a while ago (8 to 10 months). It seemed to be dependent on how the code was laid out. Swapping lines of code and changing delays made the problem change - also I only seemed to have a problem when sending non-printable characters.
The horrible answer is that as I wrote more code - a whole lot more - the problem mysteriously disappeared. I never did find out what the problem was, but it hasn't bugged me since. I wasn't starting/stopping/interrupting the cogs like you are, but I did have 7 of the cogs occupied.
I wish I could shed more light. I'm reasonably sure it's not a problem with the chip as when the problem occurred I'd only been using the chip for a very short amount of time (<1 month).
Just one thing out of curiosity, is this what you mean by your case statement?
First, on the c:=cnt being reversed - I agree it is awkward, however, it was a deliberate reversal due to the multiple possible threads after c is initialized - there is simply no place to put the waitcnt at the end of the (real) program that works for all scenarios. Putting the waitcnt at the very beginning works well, with the one exception that I need to be careful on the first pass through (before the c:=cnt line is executed).
Second, on starting and stopping the cogs at arbitrary places - I thought this was ok - it's only flashing an LED and all pins revert to 0 when the cog is stopped (LED off). Perhaps not the ideal way to do it, but it seems it should not be causing the lockup I am seeing.
Mike - I think you and I may have had the same issues. The "mysteriously disappear" thing would normally work - however, with the flight issues, I need to be 100% certain what is going on. The code I posted seems very simple and should not be causing lockup problems. Hopefully, someone will prove me wrong - show me my error. This is what I am hoping for. And yes, the logic in your if-then statement is what is in the case I believe.
Thanks again - please let me know if you have further thoughts - this remains a deep (and urgent) mystery.
-Phil
It's impossible to be 100% sure with your program as written. You've written it with the potential for a fault ... missing a WAITCNT time. You need to do a timing analysis of the delays introduced by the COGINITs and the debug output or you need to rewrite the code to be independent of the delays introduced by them. Unfortunately, there's no execution time chart for the Spin operators to make it easy. You'll need to do some testing to determine the actual time involved. I'm a firm believer in designing programs to do what they need to do rather than relying on testing. You can't always do that or do it completely, but, to the extent you can do it, it improves your program's reliability and your faith in it.
I agree that waitcnt timing is critical. In this case, though, the delay of the coginits is a few tens of milliseconds versus the 2 seconds of the waitcnt, so it's not even close to causing a problem in this instance. Also, the symptom I see (printing random characters continuously to HyperTerminal, or all the cogs freezing forever) is completely different from what a hanging waitcnt would cause. Thanks for your reply.
Paul
I'm not a huge fan of using COGINIT in Spin. In my mind, it's too dangerous, presupposes too much, and ought to be banned from the language. My recommendation: just don't use it! Here's a version of your code that uses COGNEW and COGSTOP and seems not to suffer the hangups that occur in the original code:
-Phil
However the things you describe GENERALLY have one cause only - stack (or other memory) overflows.
I notice in Phil's posting that you allocate 500 LONGs for them. This is curious. Either you need so much.... so are you sure you do not need even more???
On the other hand this is considerable space... Are you sure the main COG has still got enough memory? You have no _STACK safety belt instruction in it...
A second remark: It is generally a better technique to use WAITCNT as it is intended, as "waiting up to a deadline", and to refer to CNT once only, rather than twist it into a "delay" instruction....
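A minimal sketch of the deadline style described above, not taken from Paul's program. The clock settings, the pin constant LED_PIN, and the 2-second period are assumptions for illustration only.

```spin
CON
  _clkmode = xtal1 + pll16x      ' assumed clock setup (5 MHz crystal, 80 MHz system clock)
  _xinfreq = 5_000_000

  LED_PIN  = 16                  ' hypothetical LED pin

PUB Main | deadline
  dira[LED_PIN]~~                ' make the LED pin an output
  deadline := cnt                ' read CNT once, before the loop
  repeat
    deadline += clkfreq * 2      ' next deadline is 2 s after the previous deadline
    waitcnt(deadline)            ' wait for the deadline; loop overhead does not add up as drift
    !outa[LED_PIN]               ' periodic work goes here
```

The work done in each pass must still finish before the next deadline. If it overruns, WAITCNT has already missed its target and waits for CNT to wrap around (2^32 clock ticks), which looks like a hang.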
Did you try Paul's first version of code before writing your own? Were you able to reproduce the problem, or did you just write the most likely workaround?
Why not use coginit - if it is part of the language then why not use it?
It's more error prone than CogNew, requiring that the Cog to be used isn't already in use. With
the right precautions and checks it is okay. To know which Cogs are available is not always easy
when using sub-objects, and, once working, hard-wired CogInits may cause program failure if sub-
objects are changed or more sub-objects are added, or where the code is included as a sub-
object itself.
A CogInit stops whatever may be running in that Cog even if essential to the program, and there's
no feedback on whether the Cog was previously in use or not.
I wouldn't ban CogInit but would recommend CogNew in preference unless there were compelling
reasons to use CogInit.
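As a rough sketch of the CogNew approach, assuming a hypothetical LED-blinker object (the names ledStack, ledCog, StartLed, StopLed and Blink, and the stack size, are illustrative and not from the original program):

```spin
VAR
  long ledStack[32]              ' stack for the Spin cog; the size is a guess, not measured
  long ledCog                    ' cog ID + 1, or 0 when no cog is running

PUB StartLed(pin) : ok
  StopLed                        ' never leave an old copy running on the same stack
  ok := ledCog := cognew(Blink(pin), @ledStack) + 1   ' cognew returns -1 if no cog is free

PUB StopLed
  if ledCog
    cogstop(ledCog~ - 1)         ' stop the cog, then post-clear ledCog to 0

PRI Blink(pin)
  dira[pin]~~                    ' pin becomes an output in the new cog
  repeat
    !outa[pin]
    waitcnt(clkfreq / 4 + cnt)   ' toggle every quarter second
```

StartLed returns 0 when no cog was available, so the caller gets the feedback that a hard-wired CogInit cannot give, and no cog number is ever chosen by hand.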
Post Edited (hippy) : 2/18/2008 11:10:22 AM GMT
-Martin
to result in any different stability.
I personally wouldn't CogStop then CogNew/CogInit unless I had to. I prefer to get the Cogs running
at the start of the program and then control them by updating 'shared variables', but appreciate that
may not always be possible.
Yes, I tried his original code and was able to reproduce the problem. I made the least modification possible to exorcize the COGINITs, and that seems to have fixed things. (I added the CON for the LED pin, only because pin 27 isn't available on my proto board.)
Just because a feature is available is no reason to use it. Many otherwise structured languages provide GOTO, too, but good coding practice disdains its use.
___
Hippy,
I agree: stopping and restarting a cog is not the best practice. If the interval between stopping and restarting were long enough, I could see it as a way to save power, but that's about it.
-Phil
It is most unlikely that anything discussed above is the cause for the problems.
Stack-Overflow...
DeSilva now took a deeper look at the program.. No memory usage of any kind at all...
Other funny things.. DeSilva had never seen such a CASE construct.... but there is nothing against it in the Manual... However:
No, that cannot be!!
Notice that CASE is not used very often... many programmers shy away from this construct and there have been reports from time to time that it needs STACK of unclear amount...
Some experimenting shows that a problem occurs ONLY with a most specific CASE match pattern...
As soon as you omit ",1" or ",2" everything works fine.
It also seems to run when ordering the values, i.e. "1,i:" and "2,i+1:"
Conclusion: There is something weird with complex match patterns, disturbing the stack. This might or might not self-repair in a normal program, but COGINIT after some case labels seems to be very susceptible to it...
Post Edited (deSilva) : 2/18/2008 8:50:42 PM GMT
When I mentioned that I had a problem some months ago, there was no stopping and starting of cogs but there was a huge case statement used to parse incoming comms messages. So maybe there is something with case statements!?
Hippy, There is reason to start and stop cogs and Paul has identified the situation where it is of most benefit. He is trying to save power by running a slow clock (10MHz) and by having the minimal number of tasks (cogs) running at a time.
Phil, I'm not convinced of the error of using coginit. Once again Paul has identified the exact situation where it's needed, for complete control of the cog allocation. Maybe it's possible that there's a problem if a cog is forcibly stopped while in the middle of a wait instruction or some other specific situation!?
What Paul has so kindly given us is a minimal piece of code which demonstrates the problem. His logic and deduction are superior to have been able to give us such a concise piece of code with which even Phil was able to reproduce the problem (thanks Phil).
Unfortunately I'm without usable hardware right at this moment. I'm interested in this thread because I've seen the instability Paul has spoken of but in a different set of circumstances. It's not the sort of thing that's going to stop me from enjoying my work with the Propeller, but if we can identify the BEWARE then we will all write better code!
I will also be careful using the case statement - it is efficient in my real code because there are many possible values of N and hence all checked cases do see action. None the less, if I continue to have any problems, it would be simple enough to replace case with some if-then lines.
If the code behaves well over the next 24 hours, I will be able to ship the balloons. In a couple of weeks, live flight data will be posted on www.science.smith.edu/cmet.
Thanks to all.
Paul
But please, mirror et al.: Listen to what I posted, not to your prejudices wrt COGINIT, which is a very fine and reliable instruction
And yes, there is something with case match patterns
Post Edited (deSilva) : 2/18/2008 11:10:36 PM GMT
Sorry to disagree, but there's no good reason ever to have or want complete control over cog allocation. All the cogs are alike. Why insist on picking one over another? It doesn't save any time, and doing so can be a recipe for disaster, particularly when using third-party objects that spawn their own cogs. The Propeller provides a completely transparent cog allocation mechanism via COGNEW, which returns the cog number for those rare occasions when you need to know it. It's simply the right tool for the job.
Of course, the question of why Paul's original program fails is still open, and those into Propeller program pathology can perhaps dig up a cause. My approach is more like the doctor in the following conversation:
    Patient: "Doc, it hurts when I use COGINIT."
    Doctor: "Then don't use COGINIT."
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Doctor. "Just sit up and beg."
After some more testing it becomes curiouser and curiouser. It HAS TO DO with COGINIT, WAITCNT, but with many things more..... Maybe I shall start reading the bytecode....
However I still do not think it is an issue of COGINIT but a compiler bug... I can produce the issue with IF as well, not needing a CASE.... But there is NEVER an issue with linear code and simple loops....
I give up for the moment...
It would be good if the creator " herr Chip " could step in here and sniff it out..
cheers Ron
E.g. removing the unused local variable in CHECK or changing the WAIT intervals...
Post Edited (deSilva) : 2/19/2008 2:33:20 AM GMT
The other possibility is a bug in the interpreter - but I really really hope not.
Chip is pretty amazing, but to err is human. It wouldn't be the first chip with an errata, so that in itself doesn't bother me. What bothers me is that I possibly stumbled into and back out of it months ago without being able to extract a sufficiently compact piece of code to post to the forum at the time.
The blow-up occurs when this new stack frame being built is in the same area that a Spin cog is already working in. This can cause nasty problems, but may not always, making it all the more dangerous.
I'm confident that if the above example were modified so that the COGINIT used alternating stack areas (not always "@aStack"), there would be no problem, as the new stack frame being built wouldn't already be in active use.
Also, a COGSTOP before the COGINIT, in this case, would solve this problem. However, it would introduce a new possible problem of allowing other cogs to grab that temporarily-stopped cog by a COGNEW of their own, before your own COGINIT would actually execute.
Perhaps this would be the simplest solution: have the Spin routine that you are referencing in the COGINIT ("check" in this case) consist of nothing but a call to another Spin routine, which would then form the loop, and never return. This would build the stack to a height that would exceed the top-most long being modified by the re-launching COGINIT.
Basically, relaunching Spin code into an already-being-used stack area is like playing Russian Roulette. You need to either do a COGSTOP first, use a different stack area, or know that the already-active stack is currently at a height which won't mind its bottom being modified.
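One of the options above, using a different stack area for each relaunch, might be sketched as follows. This is only an illustration under assumptions: the names stackA, stackB, useB and Relaunch and the stack sizes are hypothetical, Check stands in for the real "check" routine, and the same cog number is assumed to be relaunched every time.

```spin
VAR
  long stackA[40]                ' two separate stack areas (hypothetical names and sizes)
  long stackB[40]
  byte useB                      ' nonzero when the next relaunch should use stackB

PUB Relaunch(cogNum)
  ' Launch Check into cogNum on whichever stack the running copy is NOT using,
  ' so the new stack frame is never built in memory that is in active use.
  if useB
    coginit(cogNum, Check, @stackB)
  else
    coginit(cogNum, Check, @stackA)
  useB := not useB               ' alternate for the next relaunch

PRI Check
  repeat                         ' stand-in for the real worker loop
    waitcnt(clkfreq / 10 + cnt)
```

Because the COGINIT targets the same cog each time, the old copy is stopped by the relaunch itself, and the stack it was using is left alone until the relaunch after that.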
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 2/19/2008 4:17:03 AM GMT
So for all those accusations... there is no compiler error.. no hardware bug..
No mystery.. just plain facts.. just what was needed
Thanks Chip.. take care..
Ron Nollet Mel OZ
I hope this is it. DeSilva's example made it easy for me to see.
Perhaps one of you can confirm the theory. I've just got Propeller II stuff in front of me.
I could modify the compiler to generate roughly this sequence in response to a COGINIT(cognum, spinroutine, @stack):
  COGINIT(cognum, @asmloop, 0)            'sort of like COGSTOP, but keeps the cog tied up
  COGINIT(cognum, spinroutine, @stack)    'do COGINIT as usual
DAT
asmloop   jmp   #0                        'an assembly-language infinite loop
Can any of you think of any related pitfall scenarios that might still be out there? Would this compiler modification be a good idea? It would only apply to the case of COGINIT being used to launch a Spin routine, and would always burden that sequence with perhaps 10 bytes of code.
And, if any of you can provide examples of problems with CASE, I'm very interested in addressing this. I don't know of any trouble, myself, but a few of you mentioned there might be some issues.
Thanks.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 2/19/2008 4:49:14 AM GMT
Please leave the compiler as it is (in regard to this case). This is an issue for documentation. This goes under Tricks and Traps and hopefully becomes one of many examples in an "Introduction to Multiprocessing with the Propeller" tutorial.