Weird bug, spooky timing or code size dependencies

ManAtWork · 2022-06-01 14:50

Oh man, I feel so tired. I spent the whole day desperately hunting a very strange bug. It began to show up when I added code part B to my already working code A. A stopped working although I thought that A had no dependencies on B. This is normally a sign of memory being corrupted or of timing problems (race conditions...). I started commenting out parts of code B to find out which line caused the problem.

But (as you can expect) I couldn't find any logical pattern. Totally randomly, sometimes it worked and sometimes it didn't. Because I was searching for memory being accidentally overwritten, I commented out all writes of any results to hub memory. As expected the bug went away. But it came back as soon as I replaced the writes with NOPs! A single NOP could make the difference between everything working perfectly or nothing at all, although my code wasn't really timing critical. OK, it has to process real-time ADC signals every 80µs but I could see on the scope that there was at least 40µs idle time left before the next loop iteration.

This immediately reminded me of RossH's problem and my mood became worse and worse. After an hour of shotgun debugging I managed to get everything reduced to the absolute minimum.

CON
  _xtlfreq = 25_000_000
  _clkfreq = 200_000_000

OBJ
  pwm   : "SC3_PwmCtrl"
  com   : "SC3_HvCom"
  par   : "Parameters"

PUB Main () | error, t
  t:= pwm.Start ()
  debug ("Start ", uhex (t))
  repeat

"com" was the object I added last. Although I had removed all calls and references to it, it still stopped "pwm" from working. Only when I also commented out the com : "SC3_HvCom" line did it work normally again. How is this possible?

The problem is when you experience such unexplainable behaviour you start doubting everything. I remembered I once had a "partial" crash of the Propeller Tool. It couldn't compile correctly anymore and the problem went away after restarting it. But not this time.

So I decided to dig deeper and try to find out what exactly didn't work. Until then I only knew that pwm.Start() didn't return. After several more hours of replacing commands with NOPs I found out that interrupts in the PWM/ADC handler cog didn't work, so the main program was waiting endlessly for results that never came.

To be continued soon...

Wuerfel_21 · 2022-06-01 15:05

Hmm, maybe try building it with flexspin, sometimes it spits out useful warnings when you're doing something subtly dumb. Might also give some other insight.

I've had a funny issue like that just a few days ago where I used a cog label in the wrong cog, but I don't think that'd apply here.

And yes, you can get PropTool to corrupt itself in a couple of ways. A reliable way to accomplish this is to try to open a binary file (which sometimes happens accidentally when you have included one via FILE and do that thing where it opens all dependencies).

ManAtWork · 2022-06-01 15:33

The next idea was that part A (PwmCtrl) was faulty from the beginning, e.g. overwriting some memory region, and I was just lucky that the region was somewhere it didn't hurt until I added part B (HvCom). But adding padding or alignment to my DAT sections didn't change anything and when I verified all pointers I always got the correct addresses.

What attracted my attention was that when I commented out my calibration routines for the ADCs and replaced them by loading of constants for gain and offset everything worked reliably even if I added delays to compensate the shorter execution time of the dummy calibration.

Because the current sensor ADCs are too fast to be processed with Spin code and the operations are too complex for inline assembly, I have to start a cog multiple times for the calibration:
1. do a pinstart (pinAdcAll, P_ADC | P_ADC_GIO, adcFilter, 256)
2. start cog to process some samples and store the results (cog shuts down when done)
3. do a pinstart (pinAdcAll, P_ADC | P_ADC_VIO, adcFilter, 256)
4. start cog to process some samples and store the results
5. calculate gain and offset
6. start the cog again with the main code and interrupts

Because the cog shuts itself down after completion of the calibration sampling procedure the same cog is used three times. I have the suspicion that something in the cog is not cleared when it is stopped and restarted later. I use the event system to trigger the interrupts. I use a sequence of FLTL, WRPIN, WXPIN, WYPIN and DRVL to reset and initialize the smart pins in the main code. However, this doesn't seem sufficient to clear everything. Something from the previous use of the ADC pins seems to remain hidden inside. Or the event is not cleared when the cog is restarted so that no new event/interrupt is triggered even if the smart pin IN is raised.

I still have no explanation why the present or missing OBJ reference without any call to SC3_HvCom makes any difference. But if I add a
pinclear (pinAdcAll)
to step 5. everything seems to work normally.
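
Put together, the calibration sequence with the extra pinclear() looks roughly like this. This is only a sketch: RunSampleCog, CalcGainOffset and StartMainCog are hypothetical placeholders for code not shown here, and pinAdcAll and adcFilter are passed in as parameters because their real values are not given. Per the Spin2 documentation, pinclear() sets DIR=0 and then writes WRPIN=0, so it clears the mode word as well as resetting the pin.

```spin2
' Sketch only: the three PRI methods are hypothetical placeholders.

PUB Calibrate(pinAdcAll, adcFilter)
  pinstart(pinAdcAll, P_ADC | P_ADC_GIO, adcFilter, 256)   ' 1. ADC looks at the GIO reference
  RunSampleCog()                                           ' 2. cog samples, stores results, stops itself
  pinstart(pinAdcAll, P_ADC | P_ADC_VIO, adcFilter, 256)   ' 3. ADC looks at the VIO reference
  RunSampleCog()                                           ' 4. same cog code again
  CalcGainOffset()                                         ' 5. calculate gain and offset ...
  pinclear(pinAdcAll)                                      '    ... then clear the pins: DIR=0, WRPIN=0
  StartMainCog()                                           ' 6. main code sets up the pins itself

PRI RunSampleCog()
  ' hypothetical: start the sampling cog and wait until it has finished

PRI CalcGainOffset()
  ' hypothetical: derive gain and offset from the stored GIO/VIO samples

PRI StartMainCog()
  ' hypothetical: start the PWM/ADC handler cog with interrupts enabled
```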

ke4pjw · 2022-06-01 19:39

@ManAtWork said:
The next idea was that part A (PwmCtrl) was faulty from the beginning, e.g. overwriting some memory region, and I was just lucky that the region was somewhere it didn't hurt until I added part B (HvCom). But adding padding or alignment to my DAT sections didn't change anything and when I verified all pointers I always got the correct addresses.

What attracted my attention was that when I commented out my calibration routines for the ADCs and replaced them by loading of constants for gain and offset everything worked reliably even if I added delays to compensate the shorter execution time of the dummy calibration.

Because the current sensor ADCs were too fast for being processed with Spin code and the operations were too complex for inline assembly I have to start a cog multiple times for the calibration.
1. do a pinstart (pinAdcAll, P_ADC | P_ADC_GIO, adcFilter, 256)
2. start cog to process some samples and store the results (cog shuts down when done)
3. do a pinstart (pinAdcAll, P_ADC | P_ADC_VIO, adcFilter, 256)
4. start cog to process some samples and store the results
5. calculate gain and offset
6. start the cog again with the main code and interrupts

Because the cog shuts itself down after completion of the calibration sampling procedure the same cog is used three times. I have the suspicion that something in the cog is not cleared when it is stopped and restarted later. I use the event system to trigger the interrupts. I use a sequence of FLTL, WRPIN, WXPIN, WYPIN and DRVL to reset and initialize the smart pins in the main code. However, this doesn't seem sufficient to clear everything. Something from the previous use of the ADC pins seems to remain hidden inside. Or the event is not cleared when the cog is restarted so that no new event/interrupt is triggered even if the smart pin IN is raised.

I still have no explanation why the present or missing OBJ reference without any call to SC3_HvCom makes any difference. But if I add a
pinclear (pinAdcAll)
to step 5. everything seems to work normally.

I suspect that whatever pin state you leave the cog in will remain that way when the cog is started again. I have multiple cogs accessing the same pins and use locks to determine who has access to the pins. I had to ensure that my pins were in a floating state when I released the lock, otherwise the pin states are ORed and it interferes with the other cog.

Not sure if this helps

evanh · 2022-06-01 22:17

@ManAtWork said:
... But if I add a
pinclear (pinAdcAll)
to step 5. everything seems to work normally.

Does sound like it's to do with events and how a smartpin re-arms itself.

ManAtWork · 2022-06-02 07:47

Yes, I'm fully aware that the smart pins do not belong to a specific cog but have their own state machine and "memory". Thus, it's important to always reset and initialize them properly and avoid conflicts between cogs, if necessary with locks or mailboxes. But even resetting (DIR=lo/hi) and overwriting all the wr/wx/wypin registers was not enough in this case.

I do not blame Chip for anything. The P2 is definitely the best processor design I've ever seen or could even dream of. The documentation is still a little sparse and sometimes the model in my mind does not match the exact behaviour of the real thing. But when things don't work they normally do it in a consistent way. It just doesn't work until I add some AKPIN here and a WAITX there... What made me worry in this case was the dependency on program length or timing (NOPs or OBJ include) that lacked any logical explanation. There seemed to be a small timing or memory alignment slot that made it work by chance although the code was faulty or at least incomplete.

You just have to be aware of it. The first signs were clearly pointing toward a memory corruption problem. But that was not the actual cause, only the symptoms were similar.

evanh · 2022-06-02 08:34

@ke4pjw said:
I suspect whatever state pin state you leave the cog in, it will remain that way when the cog is started back. I have multiple cogs accessing the same pins and use locks to determine who has access to the pins. I had to ensure that my pins were in a floating state when I released the lock, otherwise pins states are ORed and it interferes with the other cog.

Terry,
That's true for smartpins but not cogs. A COGSTOP or COGINIT will clear the cog's DIRA/B and OUTA/B registers, which frees up that cog's contribution to the wired-OR.

Smartpins, on the other hand, stay configured with whatever mode was last set with a WRPIN. Only a hard reset, or WRPIN #0, will clear the mode word.

Christof Eb. · 2022-06-04 05:33

Do I understand this correctly? You could alter behavior by placing a nop into a routine, which is never executed?

I would think of a bug either in the compiler or in the loader then.

ManAtWork · 2022-06-04 09:50

Not exactly. The behaviour is as follows:

With the pinclear() inserted as explained in post #3 the whole program works very reliably and does not depend on any NOPs inserted or removed. I've even added other functions and objects in the meantime and it remains stable.
Without the pinclear() the behaviour was very unpredictable. Adding and removing NOPs (to routines that were executed, of course) changed the results.
I've never tested adding NOPs to routines that were not executed.
But removing an OBJ reference changed the behaviour even though no function of that object was ever called. E.g. it worked with neither or only one of "com" and "par" included but not with both. As the Propeller Tool doesn't support dead code removal, that obviously changed the length of the executable and the memory arrangement.

ManAtWork · 2022-06-04 10:18

I don't think it's a compiler bug. I guess that there might be a problem with the way the selectable event "INA/INB bit of pin %PPPPPP rises" works. When you assign an event to an ADC smart pin the intention is to get notified whenever there is a new sample ready in the Z register of that pin in which case the IN signal changes from low to high. The possible problem is that the state of the input is undefined before the ADC is running because the pin might be floating or driven to any voltage below or above the digital threshold.

You need to set DIR=0 to reset the smart pin during initialisation. But this changes IN to an unknown state because the now non-smart pin follows the digital input signal of that pin. This could trigger false events. I've already tried to avoid that by executing the SETSEx command after the initialisation of the ADC pin. I've also noticed that it's always a good idea to place an AKPIN and POLLSEx instruction to clear the pin-ready and event latches that might have been unintentionally triggered before you actually want to listen to them. But as it seems there has to be some mechanism of failure that I forgot to consider. For example the internal sample window counter of the ADC pins might continue to run even if the smart pin is reset (DIR=0). This could potentially cause IN to rise immediately after reset and before my SETSEx instruction. This in turn could cause the event to never happen because IN is already high and therefore there is no further rising edge that would trigger an event.

If this is true (I can't tell for sure) then it should be documented that there is an exact procedure to follow to safely set up an ADC pin with an event, independent of any previous use of that pin.
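
For reference, the setup order described above can be sketched in PASM2 as follows. ADC_PIN and the register names adc_mode, adc_x, adc_y and sample are made up for this example, and the mode and filter values are placeholders. Doing POLLSE1 before AKPIN is a choice made here, the thread does not specify an order. This shows the sequence being discussed, not a confirmed fix.

```spin2
CON
  ADC_PIN = 8                                   ' hypothetical pin number

DAT           org
adc_setup     fltl    #ADC_PIN                  ' DIR=0: hold the smart pin in reset
              wrpin   adc_mode, #ADC_PIN        ' set the pin mode
              wxpin   adc_x, #ADC_PIN           ' X parameter (filter mode and period)
              wypin   adc_y, #ADC_PIN           ' Y parameter
              drvl    #ADC_PIN                  ' DIR=1: release reset, ADC starts running
              setse1  #%001_000000 | ADC_PIN    ' selectable event 1: IN of ADC_PIN rises
              pollse1                           ' drop an event flag that may already be set
              akpin   #ADC_PIN                  ' acknowledge the pin so IN goes low
.loop         waitse1                           ' block until IN rises for a new sample
              rdpin   sample, #ADC_PIN          ' read Z, which also acknowledges the pin
              jmp     #.loop

adc_mode      long    P_ADC | P_ADC_1X          ' placeholder mode word
adc_x         long    0                         ' placeholder, stands in for adcFilter
adc_y         long    256                       ' Y value used in the pinstart calls
sample        res     1
```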

Christof Eb. · 2022-06-04 12:21

Just an idea to find out about the loader: Is the behavior the same, if you load from SD?
I still think that changing the memory layout or the amount of transferred code that is never executed should have nothing to do with details of code (setup of an ADC) that is used.

evanh · 2022-06-04 13:19

@ManAtWork said:
If this is true (I can't tell for sure) then it should be documented that there is a procedure that has to be followed exactly of how to safely setup an ADC pin with an event that does not depend on any previous use of that pin.

I've never nailed down an exact best practise but I've often had to fiddle around with the order to get what I've wanted. So, yeah, no surprise smartpins linking to events having caused some grief for you.

ManAtWork · 2022-06-04 13:30

All tests were done with my custom servo controller board. There is no easy way to boot from SD card. Although the code should also run on a blank KISS or EVAL board it might produce different results because the ADC pins are then floating instead of being driven by the sensor signals. Any results from such tests would be of very limited value.

@"Christof Eb." said:
I still think, that changing the memory layout or the amount of transfered code, which is never executed, should have nothing to do with details in code (set up of an adc), which is used.

Even code that is not executed can have a (small) effect on execution timing because every single address change can affect the timing of hub RAM fetches.

For example the internal sample window counter of the ADC pins might continue to run even if the smart pin is reset (DIR=0). This could potentially cause IN to rise immediately after reset and before my SETSEx instruction. This in turn could cause the event to never happen because the IN is already high and therefore there is no more rising edge which would trigger an event.

A delay of 0..7 clock cycles can make a difference here in very unlucky conditions. I'm still not sure if this is the actual cause of the effects I was seeing. If I were paid for that job and willing to spend many days on it I could probably find the answer eventually. But as I'm always short on time anyway and like to invest as much as possible in productive work, my motivation for that has decreased now that I have a solution that at least works, although I don't know exactly why.

The main reason I'm writing this is to tell other people about it. It's the nature of such "Heisenbugs" that they are hard to find. It's important to not only look at what you think is "obvious" (it's a compiler bug or hardware design flaw) but to stay vigilant and creative. The actual cause could be something completely different. And sometimes we need and can find fixes even if we don't fully understand the reason for the error, although it's a lot more satisfying if we do.

Yanomani · 2022-06-04 13:32

@ManAtWork said:

You need to set DIR=0 to reset the smart pin during initialisation. But this changes IN to an unknown state because the now non-smart pin follows the digital input signal of that pin. This could trigger false events. I've already tried to avoid that by executing the SETSEx command after the initialisation of the ADC pin. I've also noticed that it's always a good idea to place an AKPIN and POLLSEx instruction to clear the pin-ready and event latches that might have been unintentionally triggered before you actually want to listen to them. But as it seems there has to be some mechanism of failure that I forgot to consider. For example the internal sample window counter of the ADC pins might continue to run even if the smart pin is reset (DIR=0). This could potentially cause IN to rise immediately after reset and before my SETSEx instruction. This in turn could cause the event to never happen because the IN is already high and therefore there is no more rising edge which would trigger an event.

If this is true (I can't tell for sure) then it should be documented that there is a procedure that has to be followed exactly of how to safely setup an ADC pin with an event that does not depend on any previous use of that pin.

Perhaps I'm interpreting the docs in a wrong way, but...

In order to control the "logic" state of any pin beforehand, can't you use the %TT control bits to override DIR and, at the same time, set the HHH/LLL drive-strength controls to impose a suitable High or Low level (avoiding damage to externally connected hardware)?

evanh · 2022-06-04 13:34

@ManAtWork said:
... I've also noticed that it's always a good idea to place an AKPIN and POLLSEx instruction to clear the pin-ready and event latches that might have been unintentionally triggered before you actually want to listen to them.

Yes, that's a good rugged and simple all-rounder. Otherwise requires a lot more state tracking.

But as it seems there has to be some mechanism of failure that I forgot to consider. For example the internal sample window counter of the ADC pins might continue to run even if the smart pin is reset (DIR=0).

It won't be that. DIR=0 always holds a smartpin in reset.

This could potentially cause IN to rise immediately after reset and before my SETSEx instruction. This in turn could cause the event to never happen because the IN is already high and therefore there is no more rising edge which would trigger an event.

It just seems to happen with some smartpin modes. And the opposite too for some modes - Where it needs a kick to trigger the first buffer empty state.

evanh · 2022-06-04 13:43

@Yanomani said:
In order to control the "logic" state of any pin beforehand, can't you use the %TT control bits to override DIR, and, at the same time, set the HHH/LLL drive-strength controls to impose a suitable High or Low level (avoiding damage to externally-connected hardware)?

Not an output. "IN" triggering an event from a smartpin, namely Sinc3 filter mode. DIR controls reset/enable of the smartpin.

Yanomani · 2022-06-04 14:12

@evanh said:

Not an output. "IN" triggering an event from a smartpin, namely Sinc3 filter mode. DIR controls reset/enable of the smartpin.

Then I got it wrong, because, in my mind, associating OUTn = "1" and %TT = "01" along with proper HHH/LLL values would suffice, without interfering with DIRn usage, during the intended smart pin setup.

evanh · 2022-06-04 14:41

Yep, but ManAtWork only wants ADC input. The DIR talk is all about smartpin control of input.

RossH · 2022-06-08 00:04

Interesting. The erratic nature of the Propeller behavior described here does indeed remind me very much of my own problem - which I never tracked down, just avoided by changing the clock frequency (and I am still not sure why this seems to work!).

Swapping two instructions when one of them is a NOP could make the program fail or work for no logical reason. I suspected a memory corruption, but could never find one. I also suspected that something in the Propeller was not being reset properly, since sometimes the program would simply work fine for a while before failing - but I never thought of the smartpins (which I use for serial I/O if nothing else).

So, what is the best way to reset all the smartpins to a known state? Is it sufficient to use a single DIRL instruction followed by a single WRPIN instruction to set all pins to DIR=0 and mode=0?

evanh · 2022-06-08 09:46

@RossH said:
So, what is the best way to reset all the smartpins to a known state? Is it sufficient to use a single DIRL instruction followed by a single WRPIN instruction to set all pins to DIR=0 and mode=0?

Yep that would be closest to hard reset. But for a given mode only DIRL is needed.
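
In Spin2 terms that would be pinclear(), which does DIR=0 followed by WRPIN=0 for every pin in the field. A rough sketch, with ClearPins and ClearPortA as made-up helper names. A pin field covers at most 32 pins within one port, so clearing everything takes one call per port. Note that on a typical board P58..P63 are the boot flash/SD and serial pins, so you may want to leave those out.

```spin2
PUB ClearPins(base, count)
  ' base = first pin, count = number of pins (1..32, all within P0..P31 or P32..P63)
  pinclear(base addpins (count - 1))            ' DIR=0, then WRPIN=0 for each pin

PUB ClearPortA()
  ClearPins(0, 32)                              ' P0..P31
```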

Brian Fairchild · 2022-06-08 20:05

What do Parallax say about this? If it is a silicon problem then it feels like a bit of a showstopper.

evanh · 2022-06-08 23:00

Can't jump to that conclusion from the small evidence so far. We'd need a source code that can then be whittled down to minimum needed to demonstrate the behaviour.

ManAtWork · 2022-06-09 10:26

@"Brian Fairchild" : I don't think it's a silicon problem or anything that we could call a real bug. At most it's missing documentation or me interpreting it incorrectly. I just posted this because I wanted to give others a hint of what to look for if they encounter similar problems, which can often be very frustrating.

@evanh : Do you really want to do further investigations? I could share the code and if we are lucky the behaviour can hopefully be reproduced on a blank EVAL or KISS board... But this might need a lot of time and the outcome is very uncertain...

evanh · 2022-06-09 10:48

Can't say I'm up for it. Probably just one of a gazillion possible coding bugs in assembly. They're damn easy to make, no matter what CPU architecture it is - https://forums.parallax.com/discussion/174628/addpins-i-dont-understand

pik33 · 2022-06-09 19:28

A single NOP could make the difference between everything working perfectly or nothing at all

....look at the console emulator topic. This time one NOP repaired a program on one particular P2. I spent maybe 20 hours experimenting to find the solution, and what caused the problem is still not explained. Maybe this cog has a slower signal path somewhere, or something like that, and as a result it needs a NOP in this one and only place in over 5000 lines of code to stabilize the result in the pointer register.

msrobots · 2022-06-09 22:04

Well, the chip is built and certified for 180 MHz and all of you are running at over 300++, maybe it is related?

Claiming a 'Silicon Failure' is kind of hard here.

just curious,

Mike

RossH · 2022-06-10 00:39

@pik33 said:

A single NOP could make the difference between everything working perfectly or nothing at all

....look at the console emulator topic. This time one NOP repaired a program on one particular P2. I spent maybe 20 hours experimenting to find the solution, and what caused the problem is still not explained. Maybe this cog has a slower signal path somewhere, or something like that, and as a result it needs a NOP in this one and only place in over 5000 lines of code to stabilize the result in the pointer register.

Interesting. I'm certainly not going to start worrying yet - I am sure it is solvable - but it seems to me that what we may all be seeing is basically a memory alignment issue - i.e. that the P2 can behave differently depending on which of the 8 possible memory alignments the code uses, and by which of the 8 cogs. I have certainly seen cases that hint at this, and indeed there is still some code in Catalina that is designed to avoid using specific memory alignments that seemed more prone to failure. It seemed at the time to work so I left it in place, but now I think I was probably just seeing symptoms of a deeper problem and the fact that this solved that particular case was largely coincidental. Especially since a very similar problem came back later that seemed to also depend on the specific clock frequency in use. When you look at the Propeller 2 "egg beater" architecture, you can see how either or both of these could easily be true.

One thing we all seem to find is that the problem only shows up with large and complex programs, which makes it very hard to rule out that the problem is just a bug in our own software, and also makes it difficult to produce a small example that demonstrates the issue. The number of cogs executing may also contribute, which I guess could mean it is also power related.

I wonder if Parallax might consider spending some time developing a test program that really thrashes the egg-beater across the whole of Hub RAM, using different combinations of cogs, and at various clock speeds, and detects any failures.

Or perhaps they have already done so and ruled this out? That would also be worth knowing!

Ross.

evanh · 2022-06-10 22:44

@RossH said:
I wonder if Parallax might consider spending some time developing a test program that really thrashes the egg-beater across the whole of Hub RAM, using different combinations of cogs, and at various clock speeds, and detects any failures.

Or perhaps they have already done so and ruled this out? That would also be worth knowing!

I've done a couple different testers along this line: One was for achieving max power consumption, so it didn't verify any data. It used all cogs and all streamers to fully max out both Cordic ops and hubRAM bandwidth in parallel. This one needed external cooling to prevent PLL self-limiting from ruining the manual measurement of 5 Volt supply current. PS: Achieved nearly 4 Watts at 250 MHz, 4.7 Watts at 300 MHz. PPS: Exceeding 900 mA from 5 Volt, USB would have shutdown from overload if used.

The other program, just recently, was for measuring hubRAM R/W corruption when at or near thermal self-limiting of the PLL. It used only cog0 and had pauses at 10 MHz, to minimise uncontrolled heating, but did verify 100% of hubRAM at any range of frequencies.

Wuerfel_21 · 2022-06-10 22:53

Pik's funny issue isn't really depending on hub alignment though. The issue seems to be caused by a certain instruction pattern, regardless of hub alignment, clock speed and surrounding code changes (and of course localized entirely to one particular cog on one particular chip)

Here's the entire graphics fetching loop from NeoYume:

              '' adding NOP here does not bypass bug
.slotlp
              '' NOP here does bypass bug
              rdlong ma_mtmp1,ptrb[2] wc
        if_c  jmp #ma_lineloop ' got sentinel

              shl     ma_mtmp1,#1 ' sprite lines are 2 longs
              add     ma_mtmp1,ma_char_base
              setbyte ma_mtmp1,#$EB,#3
              splitb  ma_mtmp1
              rev     ma_mtmp1
              movbyts ma_mtmp1, #%%0123
              mergeb  ma_mtmp1
              rep @.irqshield,#1
              drvh  #PSRAM_SELECT
              drvl  ma_psram_pinfield
              xinit ma_psram_addr_cmd,ma_mtmp1
              wypin #(8+PSRAM_WAIT+4)*2,#PSRAM_CLK
              setq ma_nco_fast
              xcont #PSRAM_WAIT*2+PSRAM_DELAY,#0
              wrfast ma_bit31,ptrb
              waitxmt
              fltl ma_psram_pinfield
              setq ma_nco_slow
              xcont ma_psram_readspr_cmd,#0
              waitxfi
              drvl #PSRAM_SELECT
.irqshield
              add ptrb,#4*4
              '' adding NOP here does not bypass bug
              djnz ma_slotleft,#.slotlp

As you may see, the only hub-aligning instruction is the one RDLONG and since adding a NOP at the bottom of the loop doesn't bypass the bug, it is most certainly not hub alignment.

evanh · 2022-06-10 22:56

@Wuerfel_21 said:
Pik's funny issue isn't really depending on hub alignment though. The issue seems to be caused by a certain instruction pattern, regardless of hub alignment, clock speed and surrounding code changes (and of course localized entirely to one particular cog on one particular chip)

Agreed. I would've called software bugs long ago but for the fact it affects only Pik's one chip.

I'm just throwing in my tester info because Ross showed some interest as a general tool.

RossH · 2022-06-11 00:41

@evanh said:

@Wuerfel_21 said:
Pik's funny issue isn't really depending on hub alignment though. The issue seems to be caused by a certain instruction pattern, regardless of hub alignment, clock speed and surrounding code changes (and of course localized entirely to one particular cog on one particular chip)

Agreed. I would've called software bugs long ago but for the fact it affects only Pik's one chip.

I'm just throwing in my tester info because Ross showed some interest as a general tool.

I am still more inclined to opt for a software bug than a hardware problem. However, a tool to verify a particular chip under maximum stress would be very useful, for testing overclocking limits if nothing else.

I would also like a simpler tool that initialized all cog RAM, all Hub RAM, all hardware stacks and all I/O pin registers. But that one I can probably write myself, and will do so when I get some time.

Ross.
