Using WAITVID with INDA++ and multi-tasking + other observations
ozpropdev
Hi All
I've discovered a gotcha when using WAITVID with indirect registers and multi-tasking.
When in multi-tasking mode the WAITVID jumps to itself while waiting for the video hardware.
This causes the INDA register to increment each time it jumps to itself!
waitvid inda++,color 'increments inda while waiting for video hardware
This needs to be replaced with the following code:
waitvid inda,color setinda ++1 'works for multi-tasking
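To make the effect concrete, here is an illustration (the pass count is arbitrary and the snippet is not from any actual driver):

' Suppose the video hardware stays busy for 10 task passes.
' Post-increment form: the instruction re-executes (jumps to itself) on every
' pass, so INDA ends up advanced by 10 by the time the data is handed off.
        waitvid inda++,color
' SETINDA form: works under multi-tasking, with INDA advanced by 1 as intended.
        waitvid inda,color  setinda ++1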
Cheers
Brian
Comments
I thought the general rule of Prop WAITxx opcodes was that they idled the core, and so saved power?
The previous update to the FPGA core changed the way WAITVID operates.
It detects multi-tasking and switches WAITVID to a jump to itself.
Ahh, too slow. Oh well.
Nice report ozpropdev. I'm thinking it might be a feature too. In any case, good to know.
I guess it is a feature, but only applicable to multi-tasking.
If there is a power saving in non-tasking mode, then a 'fix' has an impact if it loses that.
However, there certainly is a code-management risk, so a warning in the Assembler may be better, and the Assembler should know when tasking is enabled?
It's not reasonable to follow that pattern, just in case, IMHO. The person wanting to multi-task it can and should consider this right along with other things as part of assembling a multi-tasking COG.
Of course, in an ideal world, users know all details and forget nothing.
However, this is a rather hidden gotcha, and relatively easy to catch with simple Assembler checking, so the ASM should be made tasking-aware.
-Phil
That is to say there should be some simple model of how things work and how features interact with each other so that one is not surprised by a lot of exceptions, corner cases, incompatible features etc.
This "feature" does seem to violate that principle. At least it was a surprise to Ozpropdev:)
I want to say bug too.
But then again, the multitask mode here is to have the instruction jump back to itself, which does make sense. So the instruction is executing repeatedly in multitask mode. The gotcha makes total sense due to the repeated execution going on.
An assembler warning is warranted for sure.
Glad he found it, and it will be interesting to see what Chip does with it.
And I'm having trouble with a good use case. Not sure there is one for the repeated increments. Since the instruction is modal, the multi-tasking datasheet descriptions will need to be clear too. It really is a JMPVID in multitask mode, not a passive wait like WAITVID is.
Personally, fixing the gotcha case makes sense. Fixing it because it is generally undesirable makes sense. Calling it a bug does not.
More like works as designed. Not sure it works as intended...
The only caveat here is: what if applying the fix gives much higher power consumption on non-tasked waits? So some care is needed.
Now, if the argument were ++inda, that might be a different story.
-Phil
I have made another observation.
When using GETMULL in multi-tasking, I have observed what appears to be an unexpected pipeline stall.
It appears that this instruction is not jumping to itself in multi-tasking mode?
Adding a small loop seems to fix the issue. Sorry Chip!
Cheers
Brian
It seems to me that this may not be a hard thing to fix?
Basically don't execute the INDx stuff if condition is not met.
Yeah, I think it's going to come down to a few gates for a fix. In the Verilog, lots of things happen in parallel, so an if statement added in there will just add the decision circuit as part of the instruction somehow. There are per-cycle / pipeline limits, and I have no idea what those are. But the fix seems like a line or two max.
Yeah, observation, gotcha. Agreed.
@Phil, Chip used the words "jump to themselves" and I took that to mean the instruction actually does execute repeatedly. The idea was to cut down on COG code as a few instructions were needed for polling, so why not just have the instruction do it and cut out all the BS?
Getmull not doing that seems to make sense. A program could just do other things to consume the time before the multiply is done... I'll bet Chip only did the wait*** instructions.
And maybe that one isn't so easy to have repeatedly execute? Or, maybe it's time to complete the multiply is known to a degree where it makes best sense to just stuff instructions in there to make best case use of the time.
I'm unclear on what happens when multiple threads ask for the math... ???
@jmg, I'm not sure there is a power consideration. The waitvid executes over and over. Before, the COG TASK would be in a small loop, polling over and over. In both cases power is being used.
In the single task case, the wait probably does save power as expected. However, P2 is significantly more power hungry than P1 is anyway, due to the much higher process leakage according to Chip. How big of a difference would it make, and is that enough to warrant single tasking things?
I gotta get back on my DE2 and get this Mac repair done. Sheesh, do not coffee-spill a MacBook Pro. It will cost you hours... lots of hours.
-Phil
In multi-task mode, waits don't make sense because of how multi-task mode is implemented. Some things are shared. There really is one video circuit, one math circuit, etc., and there is basically one pipeline too. If it stalls, all the threads stall.
Originally, waits became polls to address this and make the special features useful in multi-task mode, so that all the threads could operate smoothly; that required the programmer to author a small loop that didn't need to be there in single-task mode.
In the most recent iteration, the polling loops got compressed into a self-looping "wait" instruction that executes over and over until the condition is met, essentially saving the additional loop instructions so that more can be done per task per COG.
In a real sense, as a poll instruction, it does get executed repeatedly, but it's part of a little loop.
Now in this iteration, it simply executes repeatedly without having to be explicitly coded into part of a little loop, thus "jump to themselves" as Chip said they do.
And the advantage of this is the same code basically works multi-tasked and single-tasked, but for the "gotcha" we are discussing here; namely, auto incrementing happening as the instruction executes. Which is why I said, "works as designed", but maybe not desirable as designed, and not bug.
What we really need is the auto inc/dec applied ONCE, when the wait condition is met, so that the repeated execution used to perform polling in a multi-task setting behaves like the wait form of the instruction does in a single-task setting. And that would satisfy the principle of least surprise we are all discussing here.
Actually, it might / should be more complicated. If it's a post dec/inc operation, it should get done when condition is met, but when it's a pre dec/inc operation, it should get done once before condition is met.
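Put as code, the behavior being proposed might look like this; it is a sketch of the proposal only, not of how the current FPGA image behaves:

        waitvid inda++,color    ' post-increment: INDA would be bumped exactly once, when the wait condition is met
        waitvid ++inda,color    ' pre-increment: INDA would be bumped exactly once, before waiting on the condition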
-Phil
Ooh, and look! Getmull is in there... You had to use a loop, and I think that's probably a bug. Nice catch, IMHO.
In multi-task mode, they basically aren't wait instructions. They do execute, and optionally do something or set something conditionally. Before a loop was needed for program control, now that loop is contained in the instruction.
Seems to be worth it to me:)
Execute is a vague term.
Better would be to use the term complete: many could argue that a waiting instruction is executing just fine, but it is not completing until the wait condition is met.
On Prop 1 that also meant a HW based wait, and the opcode fetch could pause, saving significant power.
I can also see some use for an opcode that can report how long it actually waited.
The Ideal solution is to have a safe default, but to not penalize Power, or extended uses.
-Phil
What if it does: "load internal registers, check shifter state; if ready: write results and set flags"?
Is this half executed?
It may be a pipeline issue: The register may already be loaded before the Video-shifter state can be checked.
Andy
So then, if the pipe gets stalled like we expect a wait instruction to do, all the tasks suffer that stall. At first, the restriction was that we just don't do waitvid and friends in multi-task mode, but somebody (Bill, I think) chimed in and asked for polling so we could make multi-function COG drivers with video, mouse, whatever.
So we did polling to get around that whole thing with POLLVID, etc., with the restriction that we just would not use the wait form in multi-task mode.
After some thinking, it became apparent that the instruction could be changed to a polling model in multi-task mode, and then we got here, where the little loop needed to do the polling in multi-task mode becomes an instruction that jumps to itself over and over until its condition is met.
So, it's executing. It's executing the same way it did last iteration, where we would run it and, if the condition were met, the result would happen and our loop could do program control; all of that took extra instructions only needed for multi-task mode...
My point here really is the "executing" part of it is unchanged from last FPGA core to this one. The difference is the addition of the "jump to itself on condition not met" part, which just eliminates the loops we had to write.
We really can't have the classic wait behavior in multi-task mode because we have a shared pipeline across all tasks. What we do get is check, check, check, check, do it!
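For readers following along, here is a minimal sketch of the two models described above. The POLVID operand order, its use of WC, and the polarity of C are assumptions for illustration only; what matters is the shape of the explicit loop versus the single self-jumping instruction.

' Previous image: explicit polling loop authored in a multi-tasked COG.
' Assumed for illustration: POLVID takes WAITVID-style operands and, with WC,
' returns C=1 once the video hardware has accepted the data.
poll          polvid  inda,color  wc    ' try to hand the data to the video hardware
        if_nc jmp     #poll             ' not taken yet - the other tasks keep getting their slots

' Current image: the wait form jumps to itself until it can complete, so the
' two-instruction loop above collapses into a single long.
              waitvid inda,color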
If this holds true, then in the present situation Chip may be facing a challenge: slicing the operations to be done by each instruction as they progress through the pipeline.
In cases of post-incrementing behavior, such as INDA++, I'm supposing it happens in the passage from the third to the fourth pipeline stage, since he must reserve the fourth-stage write window to update the IP if the condition was met; otherwise the IP remains the same, causing the effect of jumping to itself.
Now, in cases of pre-incrementing behavior, things get worse, because the update must occur early in the pipeline so that the right pointer value is used to gather the data.
I'm also assuming that the final decision about whether or not to execute the instruction is taken at stage four, to allow more room for a timing coincidence to occur.
Perhaps, if the original value could be latched before the pre- or post-increment is performed, then the write window freed by not having to update the IP could be used to mux back the original value, leaving it untouched.
Only my two cents.
Yanomani
Both require different styles of coding and design.
Does it matter that we have to do things slightly differently depending on what tasking model we are running in?
My original suggestion to Chip was: can POLVID be made to operate like PASSCNT, which jumps to itself?
The primary goal of this was to reduce the amount of cog space consumed.
Chip went one step better and incorporated his new concept of auto-detecting multi-tasking mode.
Maybe the gotcha is too complicated to fix because of the pipeline complexity...
If we need to take one step back and make the POLVID instruction "jump to itself" then that seems Ok too.
We still get the saving of cog space and it's a simple and fast solution to the problem.
WAITVID could then be restored back to how it was and all is well?
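To make the suggestion concrete, here is a sketch of how the split might look; the POLVID operand order is assumed to mirror WAITVID, and none of this reflects an actual FPGA image:

              polvid  inda,color        ' multi-task driver: would jump to itself until the hardware is ready
              waitvid inda++,color      ' single-task driver: true hardware wait, so INDA is bumped exactly once
' (the INDA pointer handling discussed at the top of the thread would still
'  apply to whichever form ends up jumping to itself)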
Just an idea
Brian