Mothra - Major Propeller Problem or At Least To Me It Is
idbruce
Posts: 6,197
Hello Everyone
I have some code that has been giving me some serious grief, and tracking down the source of my grief has been a real pain.
A large portion of this grief has been caused by mathematical operations, performed either immediately before or during the processing of a waitcnt(). These grief stricken operations apply to both adding and subtracting, but for the sake of simplicity, I will just discussion the subtraction portion of my grief. To make matters worse, I had no problems with performing these operations in a similar fashion in different code.
Anyhow please notice my grief
Program hangs at runtime
Program hangs at runtime
Program hangs at runtime
Program hangs at runtime
Program hangs at runtime
Program runs perfectly, but without the necessary subtraction
I would say "good grief", but this is some very bad grief!
I have some code that has been giving me some serious grief, and tracking down the source of my grief has been a real pain.
A large portion of this grief has been caused by mathematical operations, performed either immediately before or during the processing of a waitcnt(). These grief stricken operations apply to both adding and subtracting, but for the sake of simplicity, I will just discussion the subtraction portion of my grief. To make matters worse, I had no problems with performing these operations in a similar fashion in different code.
Anyhow please notice my grief
Program hangs at runtime
follower_1_start_stop_speed--; follower_1_start_stop_speed--; follower_1_start_stop_speed--; waitcnt(follower_1_start_stop_speed + CNT);
Program hangs at runtime
follower_1_start_stop_speed -= 3; waitcnt(follower_1_start_stop_speed + CNT);
Program hangs at runtime
follower_1_start_stop_speed = follower_1_start_stop_speed - 3; waitcnt(follower_1_start_stop_speed + CNT);
Program hangs at runtime
waitcnt((follower_1_start_stop_speed -= follower_1_ramp_inc_dec) + CNT);
Program hangs at runtime
waitcnt((follower_1_start_stop_speed -= 3) + CNT);
Program runs perfectly, but without the necessary subtraction
waitcnt(follower_1_start_stop_speed + CNT);
I would say "good grief", but this is some very bad grief!
Comments
I can't believe that a simple subtraction causes this kind of problem, however it is possible if the value of follower_1_start_stop_speed is exactly the time needed for the waitcnt to execute. Looks like you are using C, so remember that machine code instructions takes at least 16 clock cycles to execute in LMM (more if you are using CMM) and even the simplest function may take a number of instructions.
It would help if you could post a complete, reproducible code snippet.
As seen in the code below, follower_1_start_stop_speed starts out with a value of 14000. In it's current state, this code will run and output to the serial terminal, but if you try to change it to using one of the subtractions that have been commented out, it will hang during the first loop.
Thanks for your input.
Yes, I arrived at the 385 minimum about 4 hours ago
I did not know that for a fact, but I suspected that and I have tried slightly larger numbers, but I will try higher.
Hey now.... Similar code with addition and subtraction worked fine in Spin, but then I went to C with this code and then HALT However, I thought I had a similar section of code working under C when you and I were working on the 3D printer code, but I cannot find it anywhere. Maybe cause it never existed
I have tried this code in both CMM and LMM, but to no avail.
Here's something that runs on the PAB:
If the delay is too low (which will cause a roll-over), the LEDs will not come on at [apparently] the same time. The low value will be dependent on compiler settings.
That is a darn FINE idea!!!! And sometimes I do that to, but I did not think of it this time. THANK YOU!
When you say it runs on the PAB, does that mean you are sneaking in the 33 offset somehow or are you just using that as an example of a test for a too short delay?
The behaviour is really weird, I have added some debug output to see the values used by the various waitcnt and I don't see any obvious suspect value, aside from the 385 used by follower_1_pulse_width and copy_1_masters_pulse_width but should be enough in both LMM and CMM.
The weird thing is that adding debug outputs also changes the behaviour. With your code it works as you describe and "hangs", but if I add print lines to see where it hangs, the program stops in another point. Seems that something corrupts the memory used by the waitcnt variables, I don't see anything that could do that (no pointers, no unusual variable usage, etc.).
Just for testing I tried to change the floating point operations to integer and it works. I'm not sure if it is really that or it is just a side effect, but can you try to remove all floating point operations and see what happens ? follower_1_dec_part looks very precise but maybe the calculation could be rounded. Just for reference, I remember that the propeller gcc compiler had a bug with the floating point variables, it should have been fixed but the fix may not be included in the version used by SimpleIDE (if you are using it) (I'm using a recent version of GCC here so may not be that).
Oh, follower_1_step_counter is uninitialized, should not make any difference, but better initialize all variables.
Thanks for taking a good look at it. It is really kind of mind boggling.
Yes and this is the third similar piece of code where a print statement can cause strange behavior.
Thanks again for taking a look. Just imagine the GRIEF
Any other brave souls want to take a crack at it?
You are correct...
I have pin pointed it to the following code
If I comment out the waitcnt(), then it will run, and if I uncomment it, and change follower_1_offset to a whole number, then it will run also. So it is the follower_1_offset that is the source of all my grief.
Somehow the data type of follower_1_offset is being changed.
CNT runs at 80 MHz, or whatever, regardless of what your program does.
Your program, in whatever language, runs around at whatever instruction rate it does.
If you order a wait on CNT. As in:
Then either the count you ordered is coming up soon or it is not. If it is not you are hung for another total count around the counter.
If you want to use waitcnt as a delay then you need to know what is the value of CNT now and calculate what value of CNT is in the future to complete the delay. And take account of the time it take your code to run whilst doing that.
It is my understanding, that when a waitcnt is encountered, the program stops execution. If the waitcnt is too small, then the counter rolls over, and if it is a very large number, then you will be waiting a while. Either way, the program or cog should stop execution at waitcnt and wait for the time specified, regardless if the clock rolls around or just waits a long time. In other words, the program should not stop indefinitely.
It worked fine until you delete the print statement
The following code will work flawlessly, until you delete the following print statement, at which point, the program will hang:
Are you saying the program freezes for longer than 2^32/SysCLK ?!
I (and I think most) have assumed when you said 'hangs' you meant in real-time terms, and that it did resume 53.687s later ?
If the program freezes totally >> 53s, you do not have a WAITCNT issue, rather, something has crashed/corrupted entirely.
The worst possible outcome for true WAITCNT issues, is to roll over to the next equals.
You should specify which exact tool chains/versions you are using here.
It is definitely not a waitcnt issue, or as you would surmise after waiting several minutes.
There are two print statements in the small test program above. With both print statements uncommented, the code works flawlessly as stated, but then if you comment out the first print statement, you would expect the second print statement to show up, but after several minutes, still no output.
And as macca stated above, he had similar problems.....
AND over in the General threads, about h-bridges, plau45 stated a similar experience with the print statements.
I am using SimpleIDE Version 1.1.0
macca said he was using GCC
I haven't tested the last snipped, but your original code worked after few minutes, you need to multiply the 53 seconds by the number of times the affected waitcnt is called and if I remember correctly it was called 6-7 times, so at least 5-6 minutes to wait.
The bug I was referring to is this:
http://forums.parallax.com/discussion/comment/1359250/
AFAIK, it is fixed in the GCC version built here http://david.zemon.name:8111/project.html?projectId=PropGCC&tab=projectOverview&guest=1 but not in the gcc version shipped with SimpleIDE and I don't think that the double-to-uint32_t conversion is a workaround (I see that the site doesn't work, too bad, hope it is a temporary problem).
My suggestion is still the same, can you try to remove all floating point operations so to be sure that you are not hitting that (or a similar) bug ?
Very interesting and you are correct. I just can't understand, why the print statement allows it to work, but if you comment it out, then it becomes compounded wait, based on the number of loops. This is some very crazy #$*&%.
Years ago, when I started working on several G-Code efforts, I came to the conclusion that it would be so much easier to just use floating point, if the use of memory and speed were within reason, which is one of the main reasons I switched to C from Spin. Now, just when I have something that I really want to sink my teeth into, the floating point is failing me. A complete and utter disappointment and I cannot believe this is a problem. This kind of bull should have been fixed a long time ago.
Your original code specified the mult as a double (64 bit float) not a float (32 bit). The values in the expression will all be promoted to double (cast up), then the mult will happen, and then the result will be cast back down to a uint32. It'll be horribly expensive on the Prop because float math in C runs in whatever mode you're compiling in (LMM or CMM), not as native PASM, and unless you tick the "32 bit doubles" box in there you're making it worse.
He above version does the exact same work, but in integer math. As long as you're careful that the intermediate value won't exceed 32 bits this approach is fine, and preferable, because it'll be an order of magnitude faster.
J
Of course, that all seems fine and dandy for a hardcoded variable used for testing, but this goes much deeper than one hard coded variable.
And of course, I would like to use as much of my previous code that I possibly can. The use of doubles, goes back a long way into my current and past efforts.
It all starts with the G-Code struct, that was intended to be passed from one Propeller to another.
I am sure you remember this structure and what I am referring to. This structure is inter-mingled through so much code, that it would take a very long time to switch it all over:
Please excuse the mess within the project, because it is a work in progress, with quite a bit of redundancy.
I agree with you, however we are trying to debug a problem and converting to integer math takes some things out of the equation.
There is definitely something wrong here, could be the floating point operation, could be the version of gcc used by SimpleIDE, could be the options used by the compiler, or a combination of things.
I tried to simplify the things a bit more getting rid of simplelibrary, the results are interesting.
With floating point, LMM works, CMM doesn't work. Looks like some value gets too small for waitcnt, if I change follower_1_start_stop_speed to 24000 it works, so looks like the calculated value is too small for CMM which is the default mode for SimpleIDE.
With integer math, LMM and CMM works, so, unless I did something plain wrong, the initial values are correct and the resulting calculations are still correct after all.
What does this means ? Floating point math is wrong in CMM ?
Or maybe there is something else that we don't see.
If you want to experment with command line, I'm attaching a simple Makefile project that doesn't use simplelibrary and uses its own uart driver, this should take any library issue out of the equation.
It has been a while since I tested with LMM and since you mentioned it, I figured I would give it a try with that small test snippet above.
With the first "print" commented out and the project built in LMM, it ran just fine, as long as the number of ramps did not exceed 2100. Then I uncommented the print statement to see what the numbers were looking like and it locked up just after printing the follower_1_offset value of 4753.
It is all very interesting, but very aggravating.
Going back to what you said earlier and how I responded...
I posted this snippet:
All three of the structure members being initialized above, are all currently double data types. I suppose I could eliminate current_move.follower_1_ratio completely, and I could change current_move.follower_1_int_part and current_move.follower_1_dec_part to 32 bit unsigned integers. Then I am thinking that I could replace the code above, with something similar to this:
and then, going in line with what you said, in the problem area:
Does that sound and look right?
Your print is killing the code because they take *forever* in Prop land. Comment out the waitcnt and run the loop with the print to see what the values would have been.
Could it be that CMM is just too slow for the values used? I previously ran some tests looking at how long it take for the prop to switch a pin from output high to low. I compared different languages (PASM, Spin, Prop C in LMM and CMM, pfth) and different methods of making the switch. I used one of the counters to count the number of clocks from issuing the command to the time the pin went low.
When using C, changing nothing but the memory model, CMM was about 5 times slower than LMM.
For example when using the function low(pinnum) from SimpleTools LMM took129 clocks and CMM took 689 clocks. (PASM took 6 clocks). Note that using the Propeller.h library instead of SimpleTools functions reduced the times greatly for both CMM and LMM.
All the results are shown in this thread:
https://forums.parallax.com/discussion/164271/solved-with-timing-results-problem-with-pasm-counters-ctra
Tom