Possible TLOCK/TFREE multi-tasking bug
ozpropdev
Hi All
I am experiencing a weird bug in my Invaders code that might be a HW bug.
The code runs 3 tasks in a single cog in "COG" mode.
Code snippets are loaded from hub and executed in cog. This code has been running fine
for over 6 months now with no hiccups.
Now I am seeing random screen blanking of approx. 0.5 to 15 seconds duration. The period between failures
ranges from ~3 minutes to 2 hours.
This fault appears on DE0 and DE2 FPGA versions. I have run the code on my DE2 with Toolbox running and have
found some odd signals using TRACE.
The first oddity is the TRACE is showing FETCH activity even though I am NOT running Hubexec mode.
The P2 docs define FETCH as " - pipeline stall due to hub instruction fetch"
I was lucky enough to capture TRACE data during a failure and a second oddity was spotted.
The task switching mechanism is locked on a single task as if a TLOCK instruction was executed.
There are definitely no TLOCK instructions in my code, or patterns that would represent that instruction.
What is even stranger is that a ghost TFREE instruction also seems to materialize to kick start it all again.
My SETTASK instruction is only used on start up and is overwritten after the code starts. The time slot value
is also overwritten at startup so it can't be executed again.
I have spent countless hours trying to find a trigger scenario or a bug in my code. I have checked the integrity of the code
snippets sitting in HUB ram in case I am getting corruption, all checks out fine. When the multi-tasking kicks in
again all is well as if nothing happened. Even comms @1M still runs during the lockup. I have even swapped the tasks
around and the main task is always the one that locks. I even considered a LIFO stack problem, but how could that cause
a phantom TLOCK? The code can run for hours without a glitch.
I've included my OK and FAIL trace results for the curious.
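For anyone wanting to check a capture for the same symptom, here is a minimal sketch of how a locked task could be spotted automatically. It assumes the trace has already been decoded into a plain list of task numbers, one per executed instruction. The function names, the sample data and the `max_expected_run` threshold are all hypothetical, and a real capture will also contain stalls and other signals that this ignores.

```python
from itertools import groupby


def longest_run(task_ids):
    """Return (task, length, start_index) of the longest run of a single task."""
    best = (None, 0, 0)
    index = 0
    for task, group in groupby(task_ids):
        length = sum(1 for _ in group)
        if length > best[1]:
            best = (task, length, index)
        index += length
    return best


def looks_locked(task_ids, max_expected_run=4):
    """True if one task holds the pipeline longer than max_expected_run slots.

    max_expected_run is an assumed threshold; pick it from the time slot
    pattern the program actually uses.
    """
    return longest_run(task_ids)[1] > max_expected_run


# Hypothetical decoded traces, not real capture data.
ok_trace = [0, 1, 2] * 10                    # tasks rotating normally
fail_trace = [0, 1, 2] + [0] * 40 + [1, 2, 0]  # task 0 hogging the slots

print(looks_locked(ok_trace), looks_locked(fail_trace))
```

Running a check like this over a long log would at least give the sample index where the lock starts, which can then be lined up with whatever code was loaded at the time.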
Weird huh? Any suggestions?
Cheers
Brian
Comments
C.W.
The TRACE with apparent TLOCK is strange.
Would it be possible to divide your code down until you isolate occurrence of the bug?
Chip
(you've probably considered this already)
If your code hasn't changed from emulation to emulation for a while, is it possible to roll back through the emulations and see when the behavior started? If your code has changed, this will certainly be harder to do unless you have previous versions saved and paired to emulation releases.
I've seen my Nano pick up spurious resets (my assumption) where it will stop running whatever was running, and when you talk to it you are back in the monitor. But your description states it stalls and resumes, so your case is different.
On the capture_fail dump, does that task 0 execution pattern make sense?
I only ask because I've had some really weird bugs due to accidentally-corrupted memory. The bugs masqueraded as other kinds of problems that sent me on wild goose chases.
Also, I posted the current TRACE bit definitions in your thread regarding TRACE questions. In looking at your trace dumps, I thought some things didn't look properly labelled. My fault for giving you out-of-sync documentation. I think once you make some changes, that trace output will start making more sense.
At the time, I wrote it off to a glitchy connection. Guess I'll have to just observe it running later today to see.
Confirmed. Glitch about 10 minutes in. Changed to the default, simple video example code shipped with the FPGA image. I saw it on the TV driver. Didn't get to check VGA, etc...
Since I have to update anyway, I'll load an older one and observe it later today.
It sounds like you saw this glitch on the NTSC example driver that comes with the FPGA configuration files? I'll look into this. Perhaps WAITVID is the common denominator among these failures.
Chip
It shows as either a flash as the frame is missed or corrupt, or a color shift, then shift back. I noted events about 10 mins into a fresh FPGA boot. On my older testing set, I can set the vhold to just barely hold the frame, and the glitch will often trigger a roll as the TV resyncs. That was the easiest to see.
And that is the image from the 20th. I'm one behind.
Cheers
Brian
I have gone back 2 releases of FPGA to 06 Feb 2014 and code runs fine.
The only difference between my last Invaders code and my current version is the use of the new COGRUN format
and the new JMPTn instead of JMPTASK, that's all. Testing continues.... I even made a version that uses a HUB stack to verify if maybe a LIFO stack over/under flow was
a possible contender, alas no. 3 levels max was verified.
I'm on a mission now to narrow down the offending gremlin.
@Chip. Thanks for the TRACE info update, it helped heaps!
Cheers
Brian
I ran Brian's Invaders in Demo mode instead and let it run forever.
Mostly after the "Pres Key to start" screen, there is sometimes a short delay before the next screen comes on.
On one pass my screen said OUT OF SYNC for about one second.
Then the demo restarted.
Okay. I ran the DE0 version and saw the video stop for a few seconds.
This Invaders program is a little large for me to find a problem in, though. We need to isolate the bug more so that we can identify it.
I altered the simple_ntsc demo to use multitasking. At first, it failed, because I had WAITVID instructions in the delay slots after an RETD. After I fixed that, it worked fine, indefinitely.
I know that this program is a little big.
But it shows what the problem is.
Then again, it need not be a hardware problem.
But maybe Brian can find what the problem is in his part.
It would be good if he could identify it.
Brian, did you get those updated TRACE bit definitions into your tool, yet? I really like that thing! You've got the data presented in an easy-to-view format.
I have updated my TRACE labelling now, thanks.
I'm having another attempt today to isolate this screen blanking issue. It's a bit of a mystery.
Cheers
Brian
eg just what time-base this has, and the size of disturbances could give clues, and this is not such a good scope problem.
What I'm trying to find is where/why/how the time slots are being locked?
The detail I was getting at is that 'intermittent' is not usually intermittent; it will have a definite and likely regular trigger.
If you can find better values on 'how often' and 'how long' that could help where to look for the origin.
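One way to get those 'how often' and 'how long' numbers is to log each blanking event and summarize the log. The sketch below assumes each event is recorded as a (start, end) pair in seconds since boot. The function name and the sample values are hypothetical and are not measurements from this thread.

```python
from statistics import mean


def summarize_glitches(events):
    """Summarize glitch durations and the gaps between glitches.

    events: list of (start_s, end_s) tuples sorted by start time.
    Needs at least two events so that a gap can be computed.
    """
    if len(events) < 2:
        raise ValueError("need at least two events to measure a gap")
    durations = [end - start for start, end in events]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(events, events[1:])]
    return {
        "count": len(events),
        "duration_min": min(durations),
        "duration_max": max(durations),
        "duration_mean": mean(durations),
        "gap_min": min(gaps),
        "gap_max": max(gaps),
        "gap_mean": mean(gaps),
    }


# Hypothetical log entries (seconds since boot), for illustration only.
sample = [(180.0, 180.5), (1500.0, 1515.0), (5200.0, 5203.0)]
for name, value in summarize_glitches(sample).items():
    print(name, value)
```

If the gaps or durations cluster around a fixed value or a multiple of some period, that would point toward a regular trigger rather than a random one.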
I think I'm drawing closer now to the problem area but as to why the tasks are being locked is still a mystery.
After days of testing I noticed that if a random glitch occurred it always happened when the player has been killed.
By a process of elimination I arrived at a single code snippet "vaporize" which as its name implies animates the
player's demise. Skipping this snippet removed the glitch altogether. Aha! I've found it.. or so I thought.
The code snippet basically gets a random bit pattern from the LFSR and AND's it with the "cannon" sprite to
get the vaporize effect. Nothing complex there. Once again by a process of elimination it seemed that the
GETLFSR instruction was the culprit. I placed IF_NEVER conditions before all three instructions and the problem disappeared.
Ok, I'll just use SEUSSF instead to get a random pattern. This made things worse as the phantom lock never unlocked.
This was verified with extensive SETRACE logging. Hmmm, I'll try something else then. INCPAT,DECPAT had the
same results as SEUSSF, when the lockup occurs it never unlocked. Now things are getting weird.
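For readers who have not seen the vaporize effect, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 16-bit Galois LFSR below is a textbook one and not the P2's GETLFSR hardware, and the `CANNON` bitmap and function names are made up for the example.

```python
def lfsr16_step(state):
    """Advance a 16-bit Galois LFSR one step (taps 16, 14, 13, 11)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state


def vaporize_step(sprite_rows, state):
    """AND each sprite row with a fresh pseudo-random mask.

    Returns the thinned-out rows and the updated LFSR state.
    """
    out = []
    for row in sprite_rows:
        for _ in range(16):  # clock out 16 new bits for every row
            state = lfsr16_step(state)
        out.append(row & state)
    return out, state


# Hypothetical 16-pixel-wide "cannon" bitmap, one integer per row.
CANNON = [0x0180, 0x03C0, 0x03C0, 0x7FFE, 0xFFFF, 0xFFFF]

rows, state = CANNON, 0xACE1  # any non-zero seed works
for frame in range(4):
    rows, state = vaporize_step(rows, state)
    print(frame, [format(r, "016b") for r in rows])
```

Because each frame can only clear pixels and never set them, repeating the step makes the sprite dissolve over a few frames.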
My next plan of attack was I needed to capture a trace of this actual snippet in progress to see what's going on.
This at first was going to be tricky as the code loads snippets to the same COG space ~8000 times a second.
Identifying which snippet was currently loaded was going to require a trick of some sort.
What I did was convert just that snippet to run in HUBEXEC mode and that way I could use trace to detect
the FETCH signal to indicate that I was watching the right piece of code. Too easy...except now the problem
has disappeared again. Huh?
Wait a minute.. the only thing that changed was a single cog register definition.
Each snippet has its own register definitions at the end of the code and is part of the snippet load.
Ok, now we're getting somewhere. I added some NOP's at the start of the snippet thinking that a bug in my code
must be corrupting COG ram somehow. Various amounts of spacer nops had no effect. I even moved the
snippet location in HUB to see if that was related.
The only reliable fix I found was a single NOP inserted between the JMP #NEXT_COGLET and the
register definition. This fix hasn't failed in hours of parallel testing on DE0 + DE2.
So this is where I am at after 6 days of bug chasing. I'm close to something but what?
@Chip
Is it possible this is a fmax problem with the FPGA build?
Cheers
Brian
ozpropdev, it's a pain, but can you set your FPGA to a lower clock, adjust video, and test the fmax theory?
On that note, I'm confused about how to do that. Are the docs on CLKSET current? I would love to test parametric video at a few clocks.
With all the discussion, I'm very reluctant to continue to code on the FPGA. Perhaps I should continue... Hard to know. Nice to see you working anyway.