Trouble with multiple cog methods reading same variables
Erlend
Posts: 612
Just when I believed I had a breakthrough-
I am making a gas burner hot water pumping process as part of an espresso machine. The code is organized such that the top level object defines a list of contiguous variables that will hold all measured and calculated process variables. A pointer to this block of variables is passed to an ADC scanner object running in a separate cog. The same goes for the other I/O systems. These all do the job of feeding the variables with values.
In the same manner the pointer is passed to an object which displays the variables (for debugging) through PST. For the HotWaterControl, also running in a dedicated cog, pointers are passed as parameters to give the object access to such values as measured water temperature.
Trouble is, sometimes it works, sometimes not, so I got suspicious that the various objects were competing for access to the variables. I tried putting some delays in the various repeat loops, but that did not fix it. As an example:
    'Wait with starting pump until initial heating of water is finished
    '-------------------------------------------------------------------
    REPEAT UNTIL LONG[LptrTempWater] > 90  'LiSPtemp   'Wait until water is hot enough for brewing
      WAITCNT(clkfreq/2 + cnt)

does not (always) work, but sometimes triggers only when the temperature (as displayed through PST) reaches much higher, say 115 deg - but this varies.
I have already tried setting all Stack sizes very high.
Something I do not know that I should?
Erlend
Comments
The technique you're using is very common. Cogs can only access hub variables one cog at a time, so a cog will never see a half-written variable.
The problem is likely something else. If you're willing to post an archive of your code and describe what you expect it to do and what it does do, some of us would likely be willing to take a look at it.
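A minimal sketch of the one-writer, many-readers pattern discussed here, assuming an 80 MHz clock from a 5 MHz crystal. FakeAdc is a hypothetical stand-in for a real ADC scanner, and the names are illustrative, not taken from the posted code:

```spin
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

VAR
  long  tempWater                 ' shared: written by one cog, read by others
  long  stack[32]                 ' stack space for the writer cog

PUB Main
  cognew(FakeAdc(@tempWater), @stack)    ' hand the address to the writer cog
  repeat until long[@tempWater] > 90     ' this cog only ever reads the value
    waitcnt(clkfreq / 2 + cnt)
  ' ...continue once the threshold has been reached

PRI FakeAdc(ptr)
  ' Hypothetical stand-in for an ADC scanner: counts up once per 100 ms.
  repeat
    long[ptr] := long[ptr] + 1           ' the only writer of this long
    waitcnt(clkfreq / 10 + cnt)
```

Since a long is transferred in a single hub access, the reader sees either the old or the new value and no lock is needed for this case.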
-the hotwatercontrol
the adc
the value monitor
the Main - very much a debug version
This is still not all, but maybe enough?
Erlend
Surely it can be done. But it is another source of bugs if the value of the pointer is wrong.
You have to check whether you use the "@"-operator or not to access the right location in memory.
As long as you keep all methods within ONE *.SPIN-file you can use all variables even ACROSS cogs
If you have methods in different *.SPIN-files you can use "SetVar" and "GetVar" methods.
As a general advice:
Code in small steps. Code one method for one thing to do.
Do extensive tests to each piece of your code. Testing all situations: value zero, value is max, value is min, value is out of allowed range
After all these tests. Code next piece of code. In this manner you can be 99% sure that a bug is NOT inside previous written code
but most of the time in the new method you are coding right now.
All this testing takes time. But in the long run it will save a lot of time because you are not searching for bugs across hundreds of lines of code.
As I have been coding for more than 25 years now, my experience is: if it should go fast, it will turn out to be really slow.
Here is a small demo that shows how variables can be accessed across cogs directly by their name.
http://forums.parallax.com/showthread.php/113075-Variables-across-objects-again
best regards
Stefan
If two COGS are simply writing some values to the same HUB location the result in that location will be whatever the last COG wrote. It will make no difference if you use locks or not.
But if they are doing a read/update/write operation, for example incrementing a counter, then locks will be required around the read and write, otherwise things will go wrong.
Admittedly the first case above is not very useful as far as I can tell. Although if the writes and reads are somehow ordered nicely in time (COG A writes, COG C reads, COG B writes, COG C reads), no locks would be required.
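A minimal sketch of guarding a read/update/write with one of the Propeller's hardware locks. The names counter, lockId, Init and Increment are hypothetical, and the sketch assumes Init is called once before any cog calls Increment:

```spin
VAR
  long  counter                   ' shared by several cogs
  long  lockId                    ' lock number returned by LOCKNEW

PUB Init
  lockId := locknew               ' 0..7, or -1 if no lock is available
  counter := 0

PUB Increment
  repeat until not lockset(lockId)       ' retry until this cog owns the lock
  counter := counter + 1                 ' read / update / write, now uninterrupted
  lockclr(lockId)                        ' release the lock for other cogs
```

LOCKSET returns the previous state of the lock, so the loop exits only when this cog was the one that set it.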
I would much rather download an archive. Trying to read through the code in the browser does not sound like fun.
The archive should have all the code needed to compile.
I do work according to the principle of testing piece by piece, module by module, but I know from lifelong experience (from other technologies) that the really hard part is bringing the pieces together into a system. I have tested basic stuff, such as that the pointers point to where they should and that the values arrive, when tested as small pieces. So, referring to my first post: the repeat loop does work as intended when tested in isolation, but as part of the complete system it only occasionally works.
I feel brave posting all this more or less half-cooked and untidy code. Thankfully the forum is very kind at heart.
Erlend
What does it do when it works and what does it do (or not do) when it doesn't?
I do see a problem with your P control. I think you're getting a rounding error when the temperature is too hot.
Line #286 of "HeatWaterControl".
This probably doesn't do what you want. The value of iHeatP will only change with every ten degrees of temperature difference.
I think the equation below would give a little more resolution:
With small values, you want to wait to divide until the end of the calculation. There's likely a better formula than the one above for the intended purpose since the number being divided by ten is still pretty small.
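A small sketch of the "divide at the end" advice. The numbers and the local names (setpoint, temp, gain) are made up for illustration, and this is not the formula from the original post, which is not preserved here:

```spin
PUB DivideLate | setpoint, temp, gain, early, late
  setpoint := 93
  temp     := 88
  gain     := 4
  ' Dividing first throws the small difference away: 5 / 10 = 0
  early := gain * (||(setpoint - temp) / 10)     ' 4 * 0 = 0
  ' Multiplying first keeps it: (4 * 5) / 10 = 2
  late  := (gain * ||(setpoint - temp)) / 10
```

With integer math the first form only changes for every ten degrees of error, while the second changes every few degrees for the same inputs.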
I'm not sure if this is at all related to the issue you're having, since I don't know what "not working" means.
You're right that, strictly speaking, multiple cogs can write to the same variable while other cogs are reading that variable as long as no cog makes a private copy of the variable value while it changes the value (read / modify / write). The private copy might be implicit (like in a register or stack). It's difficult to design multiple cog routines that keep in synchrony in a particular order of execution as you illustrated, but it can be done (usually with WAITCNT) ... not an exercise for anyone but a very experienced programmer used to working with multiprocessing (multiple processors with shared resources).
The value monitor looks good.
Erlend
Yes accuracy and control algorithms are basic still, I know. But they work.
What's the indicator the main process control has started?
I'm not sure; maybe you are already doing it this way.
What I meant with testing was
method A testing for itself
adding method A to the whole thing
testing it there
writing method B testing it for itself
adding it to the whole thing testing it in the whole thing.
etc.
Spin supports global variables and parameters
Changing global variables sometimes causes strange behaviour if you lose track of
which parts of the code change the value of which variable at which time.
And as the code in each cog runs independently you can't predict which loop is at which point.
Whenever a method just needs to read a value, why not pass the parameter by value?
OK, but if the value is updated by one cog and another cog needs to read it regularly,
why not access the variable by its variable name instead of via a pointer?
If I understand right, your method "HotWcontrol" runs through one time and that's all. There is no loop.
hwc.start is called once and that's all
Duane posted while I was still writing
So how about monitoring the intermediate results of the calculation?
I mean monitoring
LiSPtemp - iTempWater
(LiSPtemp - iTempWater) / 10
||(LiSPtemp - iTempWater)/10
(1+ ||(LiSPtemp - iTempWater)/10)
(iGainP * (1+ ||(LiSPtemp - iTempWater)/10))
etc.
best regards
Stefan
I can't tell you how many times something like this has helped me catch integer math errors (which in hindsight are always obvious).
My guess right now is there's an integer math issue causing the problem.
The top level object starts the hot water control in a separate cog. That method starts by initializing variables, sets up the 'scheduler', and then does the following: 1) ignites the burner - 2) waits for the water to heat up - then 3) starts the PID control of the heater and speed control of the pump, which runs until a cup of espresso coffee is finished brewing. Step 3 is what I refer to as the 'main process'. My present trouble is that step 2 is governed by a repeat loop waiting for the temperature to be reached (around line 210-), simply by doing a comparison to the measured temperature which is held in a global variable - there is no complexity, it just LONG reads the value and does a > comparison. Problem is, it behaves unpredictably.
@Stefan
I do not want to go into a discussion of why I structure my code as I do, because then I would need to describe all the features of the finished system as well as many other coding-philosophical things. I do not think it is a bad way to split my code into many independent modules, centered around a common set of global (measured and derived value) variables. It is a well proven concept. The mechanism of passing pointers to variables is also straightforward when the rules are as simple as in my code (one can write, many can read, assume asynchronous updating).
The calculations you are pointing to are still to be polished - but they are not executed until later - until after the problem I described. Yes, I know the pains of integer math traps.
Erlend
What's your indicator the PID control is active?
I think it would help if you added some sort of indicator so you know when the "main process" starts.
Thanks,
Erlend
as the propeller-chip has no interrupts the behaviour of code is completely deterministic.
old programmers wisdom:
a program ALWAYS does what the programmer has coded.
If the program does some unexpected things it STILL does exactly what the programmer has coded.
The only thing is that the programmer coded something he does not fully understand, and that's the reason why unexpected
things happen.
The only thing that really helps is to monitor every damned detail of the code, beginning at the place you find suspicious.
Your monitor-method uses LONG[intLong0][1]
the "wait-for-temperature high enough"-loop uses LONG[LptrTempWater]
I guess you have thought lots of times that "LONG[LptrTempWater]" has exactly the same value as "LONG[intLong0] [1]"
but did you really monitor it?
You are using multiple cogs.
This opens the possibility that parts of the code execute from loop to loop in different amounts of time.
Over several test runs,
did you monitor how far all the code involved gets executed?
You could do this with extra longs at hardcoded addresses, defined as constants, that are in memory far away from the end of your code.
Something like $7F80. By using named constants you make sure that the address value is REALLY constant.
Each part of the code that runs in its own cog gets one such long. Then insert statements in your code that assign increasing values
to these longs
example
or alternatively switching on/off different LEDs
to get feedback which part of the code is executing where
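The telltale idea above could be sketched like this. The constant names, the address $7F80 and the method HotWaterLoop are hypothetical, and the sketch assumes nothing else (stack, buffers) uses that part of hub RAM:

```spin
CON
  TELLTALE_HWC = $7F80            ' hypothetical fixed hub address for one cog's telltale
  TELLTALE_ADC = $7F84            ' a second one for another cog, used the same way

PRI HotWaterLoop(ptrTemp)
  long[TELLTALE_HWC] := 1                ' reached: about to wait for hot water
  repeat until long[ptrTemp] > 90
    waitcnt(clkfreq / 2 + cnt)
  long[TELLTALE_HWC] := 2                ' reached: water is hot
  ' ...start pump and control here...
  long[TELLTALE_HWC] := 3                ' reached: main process running
```

A monitor cog can then display long[TELLTALE_HWC] alongside the process values to show how far each cog has progressed.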
You are right: it's not necessary to start a discussion about coding philosophies.
But copying values to new variables opens the possibility that values can be different
indeed it is really strange that your loop
leaves the loop sometimes at a temperature of 115 degrees
to me this would mean:
- the code starts looping at 115 degrees
- the value of LONG[LptrTempWater] is different from the monitored one LONG[intLong0] [1]
best regards
Stefan
Erlend
When I compare the call to hwc.Start(...) with PUB Start(...) I see a difference in the parameters.
In PUB start there is ptrOpmode the opposite side has
@gQuadPos{@gHWCmode}.
As I see it, you are using @gQuadPos, while {@gHWCmode} is a comment.
I guess that it should be @gHWCmode, but I could be wrong.
You've spotted right, but I am afraid it is only the result of me being half-way to setting up a variable handle for hotwatercontrol to write back its status - for debugging purposes. The gQuadPos is an easy one to use for this purpose, as it is already being displayed by ValueMonitor. Sorry, but I am sharing half-baked code here.
Erlend
Thanks all, for the help to trace the problem, and for helping me to take the approach of tracing the problem instead of drilling down into what I thought was the problem.
It turns out that the scheme of sharing-reading global variables was fine, and also that the REPEAT UNTIL Temp_measured > Temp_setpoint worked fine.
I inserted 'telltales' into the code to see at what stage it got stuck - it was just after the temperature comparison code! So why didn't the pump start running? The 'milestone scheduler' code is all governed by the CNT values as it executes, and in my original code the 'milestone' values were referenced to CNT before the REPEAT UNTIL loop that waits for the water to get hot. So when the water was finally hot enough, that CNT value was minutes old. I assume therefore that some 32-bit 'counting-around effect' caused the problem. I cannot explain exactly how. Maybe someone can? When I moved the code which sets up the milestones referenced to CNT to just before the 'scheduler' code, it all worked perfectly.
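A sketch of the fix described above, under assumptions: the method name BrewSequence, the local tStart and the 5 second milestone are hypothetical and only illustrate where the CNT reference is captured:

```spin
PRI BrewSequence(ptrTemp) | tStart
  ' Capturing the reference here, before the wait, is the pattern that failed:
  '   tStart := cnt
  repeat until long[ptrTemp] > 90        ' heating may take minutes
    waitcnt(clkfreq / 2 + cnt)
  tStart := cnt                          ' capture the reference only now
  repeat until (cnt - tStart) > 5 * clkfreq    ' hypothetical milestone, 5 s after start
    ' ...control loop work goes here...
```

With the reference taken right before the scheduler, every milestone is compared against a CNT value that is at most a few seconds old.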
Again, thanks for the help. Now I can go on to improve and tune the PID control algorithm.
Erlend
Present code - with telltales:
I'm glad you found it.
I'm sure you're not the first or last to look for a problem in the wrong place.
I often use a multiple cog and multiple object serial driver with locks to find these sorts of bugs.
You can only test for delays less than (about) 26 seconds (at 80MHz). If the delay is longer than 26s, the interval will be negative and fail the ">" comparison used to see if a set time has passed.
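The arithmetic behind the 26 second limit can be sketched with fixed numbers. This assumes an 80 MHz clock and made-up values in place of real CNT readings; the names are illustrative:

```spin
PUB WrapDemo | tStart, secs, now, elapsed, late
  tStart  := 1_000_000
  secs    := 30
  now     := tStart + secs * 80_000_000  ' stand-in for cnt read 30 s later
  elapsed := now - tStart                ' 2_400_000_000 ticks, more than 2^31 - 1
  ' As a signed long, elapsed wraps to 2_400_000_000 - 4_294_967_296 = -1_894_967_296
  late := elapsed > 10 * 80_000_000      ' FALSE, although 30 s is more than 10 s
```

Since 2^31 / 80,000,000 is about 26.8, any interval longer than that shows up as a negative number in a signed comparison.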
congrats that you found it!
the WaitCnt-command works as follows:
the parameter of WaitCnt (usually something like ClkFreq multiplied or divided by some factor, plus cnt)
gives a value which will be reached after some time in the future.
The system variable cnt delivers the value of an always free-running 32-bit counter which rolls over to zero in just one clock tick when the counter reaches its
maximum value (which is 2^32 - 1)
The WaitCnt command stops the cog completely until the value given as parameter in the WaitCnt command matches the value of the system counter
with smaller and easier to understand numbers let's say maximum-value is 1000
let's say cnt has a value of 600
if you execute a WaitCnt with a value 500
the freerunning counter has already a higher value.
this means the next match will occur after counting up to 1000 then roll over to zero and start counting up again until 500 is reached.
With the real system running at 80 MHz, counting all the way around will take 2^32 / 80,000,000 = 53.7 seconds
So whenever code reacts after approx. one minute you have a "cnt has already passed the matching value" problem
and if code does not react soon, wait for at least a bit more than a minute to see if the code reacts then.
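A short sketch of the "target already passed" case, assuming an 80 MHz clock. The method name PastTarget is hypothetical:

```spin
PUB PastTarget | target
  target := cnt                   ' by the time the next line runs, cnt is already larger
  waitcnt(target)                 ' so the cog sleeps until cnt wraps around: about 53.7 s
  ' Execution continues here almost a minute later.
```

Adding a margin larger than the interpreter overhead, as in waitcnt(clkfreq / 2 + cnt), keeps the target in the future.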
In Spin there is a minimum time of about 385 clock ticks which it takes to interpret a WaitCnt command.
So if you want to create a pulse shorter than 1/80,000,000 * 385 = 4.8 microseconds it will not work in Spin
by using WaitCnt. You would have to do this in assembler or by using the counter modules.
If you can still spare a cog, how about using a software RTC like Jon McPhalen's softrtc http://obex.parallax.com/object/322
with milliseconds and "summarised seconds of day" as a timereference?
best regards
Stefan
Stefan,
I don't think the issue was the waitcnt statement. I think it was the way time periods were compared using a start time more than 26 seconds from the current time.
With waitcnt you can pause up to about 53 seconds, but when comparing intervals the longest interval one can safely compare is about 26 seconds. Intervals longer than 26 seconds will be calculated as a negative number.
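For milestones longer than 26 seconds, one option is to count whole seconds in a spare cog and compare those instead of raw CNT values. This is only a sketch with hypothetical names (seconds, StartTicker, Ticker, SecondsSince), not the softrtc object mentioned earlier:

```spin
VAR
  long  seconds                   ' elapsed seconds, written only by the Ticker cog
  long  tickStack[32]

PUB StartTicker
  seconds := 0
  cognew(Ticker, @tickStack)

PUB SecondsSince(startSeconds)
  return seconds - startSeconds   ' safe for intervals of minutes or hours

PRI Ticker | t
  t := cnt
  repeat
    t += clkfreq                  ' next deadline, exactly one second after the last
    waitcnt(t)
    seconds := seconds + 1
```

A caller would store the value of seconds at the start of a phase and later test SecondsSince against the milestone in seconds.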
you are right. In SPIN MSB is the sign. So it cuts down to half the time.
best regards
Stefan