Using WAITVID with INDA++ and multi-tasking + other observations
ozpropdev
Hi All
I've discovered a gotcha when using WAITVID with indirect registers and multi-tasking.
When in multi-tasking mode the WAITVID jumps to itself while waiting for the video hardware.
This causes the INDA register to increment each time it jumps to itself!
waitvid inda++,color 'increments inda while waiting for video hardware
This needs to be replaced with the following code:
waitvid inda,color setinda ++1 'works for multi-tasking
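To make the effect concrete, here is an illustration (the pass count is arbitrary and the snippet is not from any actual driver):

' Suppose the video hardware stays busy for 10 task passes.
' Post-increment form: the instruction re-executes (jumps to itself) on every
' pass, so INDA ends up advanced by 10 by the time the data is handed off.
        waitvid inda++,color
' SETINDA form: works under multi-tasking, with INDA advanced by 1 as intended.
        waitvid inda,color  setinda ++1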
Cheers
Brian
Comments
I thought the general rule of Prop WAITxx opcodes was that they idled the core, and so saved power?
The previous update to the FPGA core changed the way WAITVID operates.
It detects multi-tasking and switches WAITVID to a jump to itself.
Ahh, too slow. Oh well.
Nice report ozpropdev. I'm thinking it might be a feature too. In any case, good to know.
I guess it is a feature, but only applicable to multi-tasking.
If there is a power saving in non-tasking mode, then a 'fix' has an impact if it loses that.
However, there certainly is a code-management risk, so a warning in the Assembler may be better, and the Assembler should know when tasking is enabled?
It's not reasonable to follow that pattern, just in case, IMHO. The person wanting to multi-task it can and should consider this right along with other things as part of assembling a multi-tasking COG.
Of course, in an ideal world, users know all details and forget nothing.
However, this is a rather hidden gotcha, and relatively easy to catch with simple Assembler checking, so the ASM should be made tasking-aware.
-Phil
That is to say there should be some simple model of how things work and how features interact with each other so that one is not surprised by a lot of exceptions, corner cases, incompatible features etc.
This "feature" does seem to violate that principle. At least it was a surprise to Ozpropdev:)
I want to say bug too.
But then again, the multitask mode here is to have the instruction jump back to itself, which does make sense. So the instruction is executing repeatedly in multitask mode. The gotcha makes total sense due to the repeated execution going on.
An assembler warning is warranted for sure.
Glad he found it, and it will be interesting to see what Chip does with it.
And I'm having trouble with a good use case. Not sure there is one for the repeated increments. Since the instruction is modal, the multi-tasking datasheet descriptions will need to be clear too. It really is a JMPVID in multitask mode, not a passive wait like WAITVID is.
Personally, fixing the gotcha case makes sense. Fixing it because it is generally undesirable makes sense. Calling it a bug does not.
More like works as designed. Not sure it works as intended...
The only caveat here is: what if applying the fix gives much higher power consumption on non-tasked waits? So some care is needed.
Now, if the argument were ++inda, that might be a different story.
-Phil
I have made another observation.
When using GETMULL in multi-tasking, I have observed what appears to be an unexpected pipeline stall.
It appears that this instruction is not jumping to itself in multi-tasking mode?
Adding a small loop seems to fix the issue. Sorry Chip!
Cheers
Brian
It seems to me that this may not be a hard thing to fix?
Basically don't execute the INDx stuff if condition is not met.
Yeah, I think it's going to come down to a few gates for a fix. In the Verilog, lots of things happen in parallel, so an if statement added in there will just add the decision circuit as part of the instruction somehow. There are per-cycle / pipeline limits, and I have no idea what those are. But the fix seems like a line or two max.
Yeah, observation, gotcha. Agreed.
@Phil, Chip used the words "jump to themselves" and I took that to mean the instruction actually does execute repeatedly. The idea was to cut down on COG code as a few instructions were needed for polling, so why not just have the instruction do it and cut out all the BS?
Getmull not doing that seems to make sense. A program could just do other things to consume the time before the multiply is done... I'll bet Chip only did the wait*** instructions.
And maybe that one isn't so easy to have repeatedly execute? Or, maybe it's time to complete the multiply is known to a degree where it makes best sense to just stuff instructions in there to make best case use of the time.
I'm unclear on what happens when multiple threads ask for the math... ???
@jmg, I'm not sure there is a power consideration. The waitvid executes over and over. Before, the COG TASK would be in a small loop, polling over and over. In both cases power is being used.
In the single task case, the wait probably does save power as expected. However, P2 is significantly more power hungry than P1 is anyway, due to the much higher process leakage according to Chip. How big of a difference would it make, and is that enough to warrant single tasking things?
I gotta get back on my DE2 and get this Mac repair done. Sheesh, do not coffee-spill a MacBook Pro. It will cost you hours... lots of hours.
-Phil
In multi-task mode, waits don't make sense because of how multi-task mode is implemented. Some things are shared. There really is one video circuit, one math circuit, etc., and there is basically one pipeline too. If it stalls, all the threads stall.
Originally, waits became polls to address this and make the special features useful in multi-task mode, so that all the threads could operate smoothly; that required the programmer to author a small loop that didn't need to be there in single-task mode.
In the most recent iteration, the polling loops got compressed into a self-looping "wait" instruction that executes over and over until the condition is met, essentially saving the additional loop instructions so that more can be done per task per COG.
In a real sense, as a poll instruction, it does get executed repeatedly, but it's part of a little loop.
Now in this iteration, it simply executes repeatedly without having to be explicitly coded into part of a little loop, thus "jump to themselves" as Chip said they do.
And the advantage of this is the same code basically works multi-tasked and single-tasked, but for the "gotcha" we are discussing here; namely, auto incrementing happening as the instruction executes. Which is why I said, "works as designed", but maybe not desirable as designed, and not bug.
What we really need is the auto inc/dec applied ONCE, when the wait condition is met, so that the repeated execution used to perform polling in a multi-task setting behaves like the wait form of the instruction does in a single-task setting. And that would satisfy the principle of least surprise we are all discussing here.
Actually, it might / should be more complicated. If it's a post dec/inc operation, it should get done when condition is met, but when it's a pre dec/inc operation, it should get done once before condition is met.
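Put as code, the behavior being proposed might look like this; it is a sketch of the proposal only, not of how the current FPGA image behaves:

        waitvid inda++,color    ' post-increment: INDA would be bumped exactly once, when the wait condition is met
        waitvid ++inda,color    ' pre-increment: INDA would be bumped exactly once, before waiting on the condition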
-Phil
Ooh, and look! Getmull is in there... You had to use a loop, and I think that's probably a bug. Nice catch, IMHO.
In multi-task mode, they basically aren't wait instructions. They do execute, and optionally do something or set something conditionally. Before a loop was needed for program control, now that loop is contained in the instruction.
Seems to be worth it to me:)
Execute is a vague term.
Better would be to use the term complete: many could argue that a waiting instruction is executing just fine, but it is not completing until the wait condition is met.
On Prop 1 that also meant a HW based wait, and the opcode fetch could pause, saving significant power.
I can also see some use for an opcode that can report how long it actually waited.
The Ideal solution is to have a safe default, but to not penalize Power, or extended uses.
-Phil
What if it does: "load internal registers, check shifter state; if ready: write results and set flags"?
Is this half executed?
It may be a pipeline issue: The register may already be loaded before the Video-shifter state can be checked.
Andy
So then, if the pipe gets stalled like we expect a wait instruction to do, all the tasks suffer that stall. At first, the restriction was that we just don't do waitvid and friends in multi-task mode, but somebody (Bill, I think) chimed in and asked for polling so we could make multi-function COG drivers with video, mouse, whatever.
So we did polling to get around that whole thing with POLLVID, etc., with the restriction that we just would not use the wait form in multi-task mode.
After some thinking, it became apparent that the instruction could be changed to a polling model in multi-task mode, and then we got here, where the little loop needed to do the polling in multi-task mode becomes an instruction that jumps to itself over and over until its condition is met.
So, it's executing. It's executing the same way it did last iteration, where we would run it and, if the condition were met, the result would happen and our loop could do program control; all of that took extra instructions only needed for multi-task mode...
My point here really is the "executing" part of it is unchanged from last FPGA core to this one. The difference is the addition of the "jump to itself on condition not met" part, which just eliminates the loops we had to write.
We really can't have the classic wait behavior in multi-task mode because we have a shared pipeline across all tasks. What we do get is check, check, check, check, do it!
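For readers following along, here is a minimal sketch of the two models described above. The POLVID operand order, its use of WC, and the polarity of C are assumptions for illustration only; what matters is the shape of the explicit loop versus the single self-jumping instruction.

' Previous image: explicit polling loop authored in a multi-tasked COG.
' Assumed for illustration: POLVID takes WAITVID-style operands and, with WC,
' returns C=1 once the video hardware has accepted the data.
poll          polvid  inda,color  wc    ' try to hand the data to the video hardware
        if_nc jmp     #poll             ' not taken yet - the other tasks keep getting their slots

' Current image: the wait form jumps to itself until it can complete, so the
' two-instruction loop above collapses into a single long.
              waitvid inda,color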
If this holds true, then in the present situation Chip may be facing a challenge: slicing the operations to be done by each instruction as they progress through the pipeline.
In cases of post-incrementing behavior, such as INDA++, I'm supposing it happens in the passage from the third to the fourth pipeline stage, since he must reserve the fourth-stage write window to update the IP if the condition was met; otherwise the IP remains the same, causing the effect of jumping to itself.
Now, in cases of pre-incrementing behavior, things get worse, because the update must occur early in the pipeline so that the right pointer value is used to gather the data.
I'm also assuming that the final decision about whether or not to execute the instruction is taken at stage four, to allow more room for a timing coincidence to occur.
Perhaps, if the original value could be latched before the pre- or post-increment is performed, then the write window freed by not having to update the IP could be used to mux back the original value, leaving it untouched.
Only my two cents.
Yanomani
Both require different styles of coding and design.
Does it matter that we have to do things slightly differently depending on what tasking model we are running in?
My original suggestion to Chip was: can POLVID be made to operate like PASSCNT, which jumps to itself?
The primary goal of this was to reduce the amount of cog space consumed.
Chip went one step better and incorporated his new concept of auto-detecting multi-tasking mode.
Maybe the gotcha is too complicated to fix because of the pipeline complexity...
If we need to take one step back and make the POLVID instruction "jump to itself" then that seems Ok too.
We still get the saving of cog space and it's a simple and fast solution to the problem.
WAITVID could then be restored back to how it was and all is well?
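To make the suggestion concrete, here is a sketch of how the split might look; the POLVID operand order is assumed to mirror WAITVID, and none of this reflects an actual FPGA image:

              polvid  inda,color        ' multi-task driver: would jump to itself until the hardware is ready
              waitvid inda++,color      ' single-task driver: true hardware wait, so INDA is bumped exactly once
' (the INDA pointer handling discussed at the top of the thread would still
'  apply to whichever form ends up jumping to itself)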
Just an idea
Brian