Possible TLOCK/TFREE multi-tasking bug

ozpropdev · 2014-03-30 05:23

Hi All

I am experiencing a weird bug in my Invaders code that might be a HW bug.
The code runs 3 tasks in a single cog in "COG" mode.
Code snippets are loaded from hub and executed in cog. This code has been running fine
for over 6 months now with no hiccups.
Now I am seeing random screen blanking lasting approximately 0.5 to 15 seconds. The interval between failures
ranges from ~3 minutes to 2 hours.
This fault appears on both the DE0 and DE2 FPGA versions. I have run the code on my DE2 with Toolbox running and have
found some odd signals using TRACE.
The first oddity is that TRACE is showing FETCH activity even though I am NOT running Hubexec mode.
The P2 docs define FETCH as " - pipeline stall due to hub instruction fetch".

I was lucky enough to capture TRACE data during a failure and a second oddity was spotted.
The task switching mechanism is locked on a single task, as if a TLOCK instruction had been executed.
There are definitely no TLOCK instructions in my code, or bit patterns that would represent that instruction.
What is even stranger is that a ghost TFREE instruction seems to materialize to kick-start it all again.
My SETTASK instruction is only used on start-up and is overwritten after the code starts. The time slot value
is also overwritten at startup so it can't be executed again.
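
One way to spot this kind of lock in a capture is to scan the per-sample task numbers for an unusually long run of a single task. The following is only a sketch in Python: the trace format (a flat list of task IDs, one per sample) and the `min_run` threshold are assumptions for illustration, not the actual TRACE output layout.

```python
from itertools import groupby

def find_lock_runs(task_ids, min_run=8):
    """Return (start_index, task, length) for every run of a single task
    that lasts at least min_run consecutive samples."""
    runs = []
    index = 0
    for task, group in groupby(task_ids):
        length = sum(1 for _ in group)
        if length >= min_run:
            runs.append((index, task, length))
        index += length
    return runs

# Hypothetical capture: three tasks rotating, then task 0 holding every slot.
healthy = [0, 1, 2] * 4
locked = [0] * 10
trace = healthy + locked + healthy
print(find_lock_runs(trace))
```

The threshold needs to be larger than the longest legitimate run that the time slot assignment can produce, otherwise normal scheduling would be reported as a lock.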

I have spent countless hours trying to find a trigger scenario or a bug in my code. I have checked the integrity of the code
snippets sitting in HUB RAM in case I am getting corruption; it all checks out fine. When the multi-tasking kicks in
again, all is well, as if nothing happened. Even comms @1M still run during the lockup. I have even swapped the tasks
around, and the main task is always the one that locks. I even considered a LIFO stack problem, but how could that cause
a phantom TLOCK? The code can run for hours without a glitch.

I've included my OK and FAIL trace results for the curious.

Weird huh? Any suggestions?

Cheers
Brian

ctwardell · 2014-03-30 07:13

No answers, just a confirmation that I have seen the behavior when running Invaders on my nano.

C.W.

cgracey · 2014-03-30 07:13

Brian,

The TRACE with apparent TLOCK is strange.

Would it be possible to divide your code down until you isolate occurrence of the bug?

Chip

mindrobots · 2014-03-30 07:28

Brian,

(you've probably considered this already)

If your code hasn't changed from emulation to emulation for a while, is it possible to roll back through the emulations and see when the behavior started? If your code has changed, this will certainly be harder to do unless you have previous versions saved and paired to emulation releases.

I've seen my Nano pick up spurious resets (my assumption) where it will stop running whatever was running and when you talk to it you go back to the monitor. But your description states it stalls and resumes, so your case is different.

cgracey · 2014-03-30 08:17

Brian,

On the capture_fail dump, does that task 0 execution pattern make sense?

I only ask because I've had some really weird bugs due to accidentally-corrupted memory. The bugs masqueraded as other kinds of problems that sent me on wild goose chases.

Also, I posted the current TRACE bit definitions in your thread regarding TRACE questions. In looking at your trace dumps, I thought some things didn't look properly labelled. My fault for giving you out-of-sync documentation. I think once you make some changes, that trace output will start making more sense.

potatohead · 2014-03-30 10:52

I am currently working on a video driver and saw random screen blanks twice last night, of the same duration. On the P1, this would sometimes happen in a driver every so often due to improper handling of the cnt rollover case. This is on the image from the 20th; I just haven't updated yet. I was planning to after I finished solving a problem.

At the time, I wrote it off to a glitchy connection. Guess I'll have to just observe it running later today to see.

Confirmed. Glitch about 10 minutes in. Changed to the default, simple video example code shipped with the FPGA image. I saw it on the TV driver. Didn't get to check VGA, etc...

Since I have to update anyway, I'll load an older one and observe it later today.

cgracey · 2014-03-30 14:05

Potatohead,

It sounds like you saw this glitch on the NTSC example driver that comes with the FPGA configuration files? I'll look into this. Perhaps WAITVID is the common denominator among these failures.

Chip

potatohead · 2014-03-30 16:33

Yes I did. I am building a parametric version of some drivers and saw the glitch. Thought it best to run the dead simple code to verify.

It shows as either a flash as the frame is missed or corrupt, or a color shift, then shift back. I noted events about 10 mins into a fresh FPGA boot. On my older testing set, I can set the vhold to just barely hold the frame, and the glitch will often trigger a roll as the TV resyncs. That was the easiest to see.

And that is the image from the 20th. I'm one behind.

ozpropdev · 2014-03-30 17:46

cgracey wrote: »

Also, I posted the current TRACE bit definitions in your thread regarding TRACE questions. In looking at your trace dumps, I thought some things didn't look properly labelled. My fault for giving you out-of-sync documentation. I think once you make some changes, that trace output will start making more sense.

Thanks Chip, I'll have another good look at it all today.
Cheers
Brian

ozpropdev · 2014-03-31 07:41

Update:
I have gone back 2 releases of FPGA to 06 Feb 2014 and code runs fine.
The only difference between my last Invaders code and my current version is the use of the new COGRUN format
and the new JMPTn instead of JMPTASK, that's all. Testing continues....

		'	coginit	_coglet,_game1,#0

'_coglet			long	@game_launch
'_game1			long	@game1
..is now
			cogrun	_game1,#0

_game1			long	(@game1 >> 2) << 16 | (@coglet >> 2)
..and
		'	jmptask	#%0010,#vga       'Setup multi-tasking
		'	jmptask	#%0100,#timers
..is now
			jmpt1	#vga
			jmpt2	#timers
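
The new `_game1` value packs two hub addresses into one long. The arithmetic can be checked with a small Python sketch. It only mirrors the expression `(@game1 >> 2) << 16 | (@coglet >> 2)` from the post; the example addresses are made up, and the alignment and range checks are assumptions added for illustration rather than anything COGRUN is documented to require.

```python
def pack_cogrun_pointer(game1_addr, coglet_addr):
    """Mirror (@game1 >> 2) << 16 | (@coglet >> 2): two byte addresses are
    converted to long addresses and placed in the upper and lower 16 bits."""
    for addr in (game1_addr, coglet_addr):
        if addr % 4 != 0:
            # Assumption: a misaligned address would silently lose its low bits.
            raise ValueError(f"address {addr:#x} is not long-aligned")
    high = game1_addr >> 2
    low = coglet_addr >> 2
    if high > 0xFFFF or low > 0xFFFF:
        raise ValueError("long address does not fit in 16 bits")
    return (high << 16) | low

def unpack_cogrun_pointer(value):
    """Recover the two byte addresses from a packed value."""
    return ((value >> 16) & 0xFFFF) << 2, (value & 0xFFFF) << 2

# Hypothetical hub addresses, chosen only for the example.
packed = pack_cogrun_pointer(0x1000, 0x0E80)
print(hex(packed))
print([hex(a) for a in unpack_cogrun_pointer(packed)])
```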

I even made a version that uses a HUB stack to check whether a LIFO stack overflow or underflow was
a possible contender; alas, no. A maximum depth of 3 levels was verified.
I'm on a mission now to narrow down the offending gremlin.

@Chip. Thanks for the TRACE info update, it helped heaps!
Cheers
Brian

cgracey · 2014-03-31 07:52

I'm running the simple_ntsc demo now with the scope set to trigger on an excessive period of no sync pulses. So far, nothing.

Sapieha · 2014-03-31 08:20

Hi Chip.

Instead, run Brian's Invaders in Demo mode --- run forever.

Mostly after the "Pres Key to start" screen, there will sometimes be a little delay before the next screen comes on.
On one pass my screen said OUT OF SYNC in about one sec.
Then restarted demo

cgracey wrote: »

I'm running the simple_ntsc demo now with the scope set to trigger on an excessive period of no sync pulses. So far, nothing.

cgracey · 2014-03-31 09:18

Sapieha wrote: »

Hi Chip.

Instead, run Brian's Invaders in Demo mode --- run forever.

Mostly after the "Pres Key to start" screen, there will sometimes be a little delay before the next screen comes on.
On one pass my screen said OUT OF SYNC in about one sec.
Then restarted demo

Okay. I ran the DE0 version and saw the video stop for a few seconds.

This Invaders program is a little large for me to find a problem in, though. We need to isolate the bug more so that we can identify it.

I altered the simple_ntsc demo to use multitasking. At first, it failed, because I had WAITVID instructions in the delay slots after an RETD. After I fixed that, it worked fine, indefinitely.

Sapieha · 2014-03-31 09:34

Hi Chip.

I know that this program is a little big.
But it shows what the problem is.

Then still --- it need not be a hardware problem.

But maybe Brian can find what the problem is in his part.

cgracey wrote: »

Okay. I ran the DE0 version and saw the video stop for a few seconds.

This Invaders program is a little large for me to find a problem in, though. We need to isolate the bug more so that we can identify it.

I altered the simple_ntsc demo to use multitasking. At first, it failed, because I had WAITVID instructions in the delay slots after an RETD. After I fixed that, it worked fine, indefinitely.

cgracey · 2014-03-31 09:45

Sapieha wrote: »

Hi Chip.

I know that this program is a little big.
But it shows what the problem is.

Then still --- it need not be a hardware problem.

But maybe Brian can find what the problem is in his part.

It would be good if he could identify it.

Brian, did you get those updated TRACE bit definitions into your tool, yet? I really like that thing! You've got the data presented in an easy-to-view format.

ozpropdev · 2014-03-31 15:04

cgracey wrote: »

It would be good if he could identify it.

Brian, did you get those updated TRACE bit definitions into your tool, yet? I really like that thing! You've got the data presented in an easy-to-view format.

Hi Chip
I have updated my TRACE labelling now, thanks.
I'm having another attempt today to isolate this screen blanking issue. It's a bit of a mystery.
Cheers
Brian

jmg · 2014-03-31 15:09

Is there a spare resource to set up a monostable-style time-checker, and capture some info whenever it is seen disturbed?
e.g. just what time-base this has, and the size of disturbances, could give clues, and this is not such a good scope problem.
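
The monostable idea can be sketched offline in Python, treating each expected periodic event (for example a sync pulse) as a retrigger and logging every gap longer than the timeout. The timestamps, their units and the timeout value below are invented for illustration; on the real hardware the check would have to run in a spare task or cog.

```python
def find_disturbances(pulse_times, timeout):
    """Software model of a retriggerable monostable: every pulse restarts the
    timeout, and a gap longer than the timeout is logged as (start, gap)."""
    disturbances = []
    for previous, current in zip(pulse_times, pulse_times[1:]):
        gap = current - previous
        if gap > timeout:
            disturbances.append((previous, gap))
    return disturbances

def intervals_between(disturbances):
    """Time from the start of each disturbance to the start of the next."""
    starts = [start for start, _ in disturbances]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]

# Hypothetical pulse timestamps in microseconds, nominally 64 us apart.
pulses = [0, 64, 128, 192, 1000, 1064, 1128, 5000]
events = find_disturbances(pulses, timeout=100)
print(events)
print(intervals_between(events))
```

The first list gives "how long" for each disturbance and the second gives "how often", which are the two figures that would show whether the fault follows a regular time base.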

ozpropdev · 2014-03-31 15:19

It seems that the video glitch is a result of the video task being turned off intermittently.
What I'm trying to find is where/why/how the time slots are being locked?

jmg · 2014-03-31 15:25

ozpropdev wrote: »

It seems that the video glitch is a result of the video task being turned off intermittently.
What I'm trying to find is where/why/how the time slots are being locked?

What I was getting at is that 'intermittent' faults are usually not truly intermittent, but have a definite and likely regular trigger.
If you can find better values for 'how often' and 'how long', that could help with where to look for the origin.

ozpropdev · 2014-04-05 04:38

Update 2: The plot thickens.

I think I'm drawing closer to the problem area now, but why the tasks are being locked is still a mystery.
After days of testing I noticed that whenever a random glitch occurred, it always happened when the player had been killed.
By a process of elimination I arrived at a single code snippet, "vaporize", which as its name implies animates the
player's demise. Skipping this snippet removed the glitch altogether. Aha! I've found it.. or so I thought.

The code snippet basically gets a random bit pattern from the LFSR and ANDs it with the "cannon" sprite to
get the vaporize effect. Nothing complex there. Once again by a process of elimination it seemed that the
GETLFSR instruction was the culprit. I placed IF_NEVER conditions before all three instructions and the problem disappeared.
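
The effect itself is easy to model. The Python sketch below ANDs each sprite byte with fresh bits from a software LFSR so that a random subset of pixels is cleared. The thread does not give the polynomial behind GETLFSR, so the 32-bit Galois tap mask, the seed and the sprite bytes here are assumptions chosen only to make the example run.

```python
def lfsr_step(state, taps=0x80200003):
    """Advance a 32-bit Galois LFSR by one bit (right-shift form).
    The tap mask is an assumed, commonly cited one, not the P2's."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps
    return state & 0xFFFFFFFF

def vaporize_bytes(sprite_bytes, state):
    """AND each sprite byte with the low 8 bits of the LFSR, stepping the
    LFSR 8 times per byte so each mask is made of fresh bits."""
    result = []
    for value in sprite_bytes:
        for _ in range(8):
            state = lfsr_step(state)
        result.append(value & state & 0xFF)
    return result, state

# Hypothetical three-byte slice of a sprite and an arbitrary non-zero seed.
sprite = [0x18, 0x3C, 0xFF]
masked, next_state = vaporize_bytes(sprite, 0xACE1ACE1)
print([f"{b:08b}" for b in masked])
```

Because the operation is an AND, pixels can only be removed, never added, so calling it again on successive frames thins the sprite out progressively.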

Ok, I'll just use SEUSSF instead to get a random pattern. This made things worse, as the phantom lock never unlocked.
This was verified with extensive SETRACE logging. Hmmm, I'll try something else then. INCPAT and DECPAT had the
same results as SEUSSF: when the lockup occurred, it never unlocked. Now things are getting weird.

My next plan of attack was to capture a trace of this actual snippet in progress to see what was going on.
This at first was going to be tricky, as the code loads snippets to the same COG space ~8000 times a second.
Identifying which snippet was currently loaded was going to require a trick of some sort.

What I did was convert just that snippet to run in HUBEXEC mode and that way I could use trace to detect
the FETCH signal to indicate that I was watching the right piece of code. Too easy...except now the problem
has disappeared again. Huh?

Wait a minute.. the only thing that changed was a single cog register definition.
Each snippet has its own register definitions at the end of the code and is part of the snippet load.
Ok, now we're getting somewhere. I added some NOPs at the start of the snippet, thinking that a bug in my code
must be corrupting COG RAM somehow. Various amounts of spacer NOPs had no effect. I even moved the
snippet location in HUB to see if that was related.

The only reliable fix I found was a single NOP inserted between the JMP #NEXT_COGLET and the
register definition. This fix hasn't failed in hours of parallel testing on DE0 + DE2.

			org
_g_vaporize		mov	entity,#e_cannon
			call	#get_metrics
			setptra	et_v
			setptry	et_b
			mov	cxf,#8
			mov	byte_ofst,#0

loop_v			rdaux	dxf,ptry++
			shl	dxf,et_o
			getbyte	m0,dxf,#2
			getbyte	m1,dxf,#1
			getbyte	m2,dxf,#0
			getlfsr	:ax
			and	m0,:ax
			getlfsr	:ax
			and	m1,:ax
			getlfsr	:ax
			and	m2,:ax
			wrbyte	m0,ptra++
			wrbyte	m1,ptra++
			wrbyte	m2,ptra++
			addptra	#bytes_per_col- 3 '29
			djnz	cxf,@loop_v
			jmp	#next_coglet

                        nop     '***** FIX ??? ****

:ax			long	0

vap_end			long	$cececece

So this is where I am at after 6 days of bug chasing. I'm close to something but what?

@Chip
Is it possible this is a fmax problem with the FPGA build?

Cheers
Brian

potatohead · 2014-04-05 10:55

BTW, the glitches I saw were very similar to what was being described here. Finally traced it to a noisy electrical outlet. (grrr)

ozpropdev, it's a pain, but can you set your FPGA to a lower clock, adjust video, and test the fmax theory?

On that note, I'm confused on how to do that. Are the docs on CLKSET current? I would love to test parametric video at a few clocks.

With all the discussion, I'm very reluctant to continue to code on the FPGA. Perhaps I should continue... Hard to know. Nice to see you working anyway.
