Using WAITVID with INDA++ and multi-tasking + other observations

cgracey · 2013-11-02 22:54

jmg wrote: »

The only caveat here: what if applying the fix then gives much higher power consumption on non-tasked waits?
- so some care is needed.

There's not much opportunity for power-saving in the cog. As long as the clock is switching, there's a lot going on. A pipeline stall, though, does inhibit 99% of state changes, which are what consume power.

cgracey · 2013-11-02 22:57

Phil Pilgrim (PhiPi) wrote: »

But is it really executing? To me (a casual observer), it seems it's only branching back to itself because it could not execute. So nothing should happen in that case.

Now, if the argument were ++inda, that might be a different story.

-Phil

I'll see what I can do about this. I have a feeling that it's either going to be impossible or easy to fix.

Cluso99 · 2013-11-02 23:49

cgracey wrote: »

I'll see what I can do about this. I have a feeling that it's either going to be impossible or easy to fix.

Hope it's easy, but if not, then it's a caveat we have to be aware of (and only in multithreading). We now have a fantastic threaded model thanks to a lot of input.

cgracey · 2013-11-03 00:14

ozpropdev wrote: »
I was reluctant to use the term bug, maybe I will use the term observation in the future.

I have made another observation

When using GETMULL in multi-tasking I have observed what appears to be unexpected pipeline stall.
        setmula reg1
        setmulb reg2
        mov     mx,#7
        djnz    mx,$        'loop added to eliminate stall
        getmull reg3
It appears that this instruction is not jumping to itself in multi-tasking mode?
Adding the small loop seems to fix the issue. Sorry Chip!

Cheers
Brian

I just did a test and it seems to be working fine. There could have been some incidental thing I noticed and fixed since that FPGA file was put out. Thanks for finding things like this. I would imagine there will be a few of these kinds of problems.

cgracey · 2013-11-03 00:18

Phil Pilgrim (PhiPi) wrote: »

But it's a wait instruction. Wait instructions do not "execute" until the wait condition is satisfied. The only reason to jump back to itself is not to stall the pipeline when the instruction cannot execute -- not to execute it repeatedly.

-Phil

The problem is that INDx happens in stage 2 of the pipeline, while "execution" happens at stage 4. INDA/INDB is already modified by the time the JMP #$ occurs - and it keeps getting modified with every iteration. I'll look more into it, but I think this is something we will have to be aware of in multitasking code.
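To make the drift concrete, here is a toy Python model of the behavior described above. It is only a sketch under stated assumptions: `run_wait`, `passes_until_ready`, and the example address are hypothetical names and values, INDA wrap-around is ignored, and the pipeline is reduced to "stage 2 copies and increments, stage 4 either executes or loops".

```python
def run_wait(inda_start, passes_until_ready, multitasking):
    """Toy model of a waiting instruction with an INDA++ operand.

    Returns (register address actually used, final INDA value).
    passes_until_ready is the number of extra trips through the
    pipeline before the wait condition is met (assumed, illustrative).
    """
    inda = inda_start
    if not multitasking:
        # Single-task: the pipeline stalls, so stage 2 runs exactly once.
        used = inda
        inda += 1
        return used, inda
    # Multitasking: the instruction re-enters the pipeline (JMP #$)
    # until the wait condition is finally met at stage 4.
    for _ in range(passes_until_ready + 1):
        used = inda   # stage 2: INDA's value is copied into the instruction
        inda += 1     # stage 2: the post-increment has already happened
        # stage 4: only the last pass executes; earlier passes loop back
    return used, inda


single = run_wait(0x100, 3, multitasking=False)
multi = run_wait(0x100, 3, multitasking=True)
print([hex(v) for v in single], [hex(v) for v in multi])
```

In this model the single-task case uses register 0x100 and leaves INDA at 0x101, while the multitasking case with three extra passes ends up using 0x103 and leaves INDA at 0x104.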

potatohead · 2013-11-03 00:23

*bookmarked* for documentation later on.

cgracey · 2013-11-03 00:30

Yanomani wrote: »

IMHO this is exactly the caveat here! I remember I've read at some point before, that Cog's memory has two ports for reading and one for writing.
If this holds true, in the present situation Chip can be facing a challenge, to slice operations to be done by each instruction, as they progress thru the pipeline.
In cases of post-incrementing behaviors, such as INDA++, I'm supposing it was crafted in the passing from third to fourth pipeline stage, since he must reserve the fourth stage write window, to update the IP, if condition was met, otherwise the IP remains the same, causing the effect of jumping to itself.
Now, in cases of pre-incrementing behaviors, things get worse, because updates must occur early in the pipeline, to allow using the right pointer value to gather data.
I'm also assuming that the final decision about executing or not the instruction, is being taken at stage four, to allow more room for a timing coincidence to occur.
Perhaps, if the original value could be latched, before pre or post increments are performed, then the write window, freed by not having to update the IP, could be used to mux back the original value, letting it "untouched" at all.

Only my two cents.

Yanomani

You understand the dilemma pretty well.

There actually are restore-to-prior-state-if-pipeline-stage-cancelled circuits for INDA/INDB. They operate in pipeline stages 2 and 3, but not 4. They are already huge, and making them able to back up at stage 4 would make them enough slower that it would create the new critical path. I'll look into it, anyway, but I think we might have to live with things as they are.

Phil Pilgrim (PhiPi) · 2013-11-03 00:33

cgracey wrote:

The problem is that INDx happens in stage 2 of the pipeline, while "execution" happens at stage 4. ...

So I take it that the normal (non-multitasked) sequence of events for waitvid inda++,src is this?

1. Load the contents pointed to by inda into a temporary register.
2. Increment inda.
3. Wait until video is ready to accept new data.
4. Transfer contents of temporary register to video buffer.

That does complicate things. In that case, avoiding autoincrementing during multitasking and doing it in a separate instruction after the fact is probably the only way out.

-Phil

cgracey · 2013-11-03 00:34

ozpropdev wrote: »

Are we trying too hard to make single/multi-tasking one and the same?
Both require different styles of coding and design.
Does it matter that we have to do things slightly differently depending on what tasking model we are running in?

My original suggestion to Chip was can POLVID be made to operate like PASSCNT that jumps to itself.
The primary goal of this was to reduce the amount of cog space consumed.
Chip went one step better and incorporated his new concept of "auto" detect multi-tasking mode.
Maybe the gotcha is too complicated to fix because of the pipeline complexity..
If we need to take one step back and make the POLVID instruction "jump to itself" then that seems Ok too.
We still get the saving of cog space and it's a simple and fast solution to the problem.
WAITVID could then be restored back to how it was and all is well?

Just an idea
Brian

We could put POLVID back in, as well as make POLMUL, POLDIV, POLSQRT, POLCOR, etc., or we could just use the multitasking WAITVID, GETMULL, etc., as they are and not use INDx with them while multitasking. This will result in smaller code almost all the time, and equal-size code when you need to use a direct register, along with a separate MOV w/INDx.

cgracey · 2013-11-03 00:43

Phil Pilgrim (PhiPi) wrote: »

So I take it that the normal (non-multitasked) sequence of events for waitvid inda++,src is this?
1. Load the contents pointed to by inda into a temporary register.
2. Increment inda.
3. Wait until video is ready to accept new data.
4. Transfer contents of temporary register to video buffer.

That does complicate things. In that case, avoiding autoincrementing during multitasking and doing it in a separate instruction after the fact is probably the only way out.

-Phil

It's more like this:

Stage 2: Copy INDA/INDB's value into the S and/or D field(s) of the 32-bit instruction, increment/decrement INDA/INDB.
Stage 3: Issue register reads from S and D fields.
Stage 4: Write D if 'write' instruction.

There are circuits for rolling back INDA/INDB in the event of a pipeline cancellation, but cancellations only occur at stages 2 and 3, not 4. As I said in the other post, adding stage 4 INDA/INDB rollback would probably blow the clock-period time budget.

It took me several days last time to get my head around all the rules needed to address rollback under different situations, just for stages 2 and 3.
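The stage list and rollback rule above can be sketched as a small Python model. This is an illustration only, not the actual logic: `issue` and `cancel_stage` are hypothetical names, and "cancellation at stage 4" here stands for the multitasking case where a waiting instruction jumps back to itself instead of completing.

```python
def issue(inda, cancel_stage=None):
    """Pass one INDA++ instruction through stages 2..4 of a toy pipeline.

    Returns (operand address used, INDA afterwards). The operand is
    None when the instruction did not complete.
    """
    saved = inda      # value the rollback circuits could restore
    operand = inda    # stage 2: INDA copied into the S/D field
    inda += 1         # stage 2: auto-increment
    if cancel_stage in (2, 3):
        # Rollback exists for stages 2 and 3: INDA is restored.
        return None, saved
    if cancel_stage == 4:
        # No stage 4 rollback: the increment sticks.
        return None, inda
    # Stage 3 reads the registers, stage 4 completes the instruction.
    return operand, inda


for stage in (2, 3, 4, None):
    print(stage, issue(10, cancel_stage=stage))
```

Cancelling at stage 2 or 3 leaves INDA at 10 in this model, while a stage 4 re-issue leaves it at 11 even though nothing was executed.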

Phil Pilgrim (PhiPi) · 2013-11-03 00:49

Chip,

In the example cited, inda is the destination. But it appears that an increment is not the same as a "write." Correct? So, in such a case, an nr qualifier does not pertain to autoincrementing indx?

-Phil

cgracey · 2013-11-03 01:35

Phil Pilgrim (PhiPi) wrote: »

Chip,

In the example cited, inda is the destination. But it appears that an increment is not the same as a "write." Correct? So, in such a case, an nr qualifier does not pertain to autoincrementing indx?

-Phil

I'm not sure I understand the question, but whether a read or write instruction, when it finally does its job, the register it actually reads or writes could be far away from what was intended, as many auto-inc/dec's could have occurred.

Cluso99 · 2013-11-03 01:37

Phil, remember there is no NR qualifier in the new instruction set so the point will become moot.

potatohead · 2013-11-03 09:18

I think simply not using the register that way makes the most sense at this point. We still get the benefit of smaller code a lot of the time and we get similar behavior for a lot of code a lot of the time across single task / multi-task modes.

ozpropdev · 2013-11-03 16:43

potatohead wrote: »

I think simply not using the register that way makes the most sense at this point. We still get the benefit of smaller code a lot of the time and we get similar behavior for a lot of code a lot of the time across single task / multi-task modes.

I agree.

If the documentation says "don't do this" then who am I to argue.
As long as it then follows with "do this instead" .

Edit:

So in conclusion we end up with:

Never use pre/post inc/dec INDx functions within polling multi-tasking instructions. Use a second instruction to achieve this.

There, problem solved!
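As a rough illustration of why the two-instruction form helps, the Python sketch below compares both styles in a toy model. Everything in it is an assumption for illustration: `send_all`, `waits`, and `split_increment` are hypothetical names, the PASM shown in comments is only an assumed shape of the code, and register indices stand in for real cog addresses.

```python
def send_all(waits, split_increment):
    """Model a multitasking loop feeding successive registers to a
    waiting instruction (WAITVID, GETMULL, ...).

    waits[i] is the number of extra passes the i-th wait makes before
    its condition is met. Returns the register indices actually used.
    """
    ptr = 0
    sent = []
    for extra_passes in waits:
        if split_increment:
            # Assumed shape:  mov temp, inda++   (issues exactly once)
            #                 waitvid temp, ...  (may loop, no INDx operand)
            temp = ptr
            ptr += 1
            sent.append(temp)
        else:
            # Assumed shape:  waitvid inda++, ...
            # Stage 2 increments the pointer on every pass.
            for _ in range(extra_passes + 1):
                used = ptr
                ptr += 1
            sent.append(used)
    return sent


print(send_all([2, 0, 1], split_increment=False))
print(send_all([2, 0, 1], split_increment=True))
```

With wait counts of 2, 0, and 1, the combined form uses registers 2, 3, and 5 in this model, while the split form uses 0, 1, and 2 as intended.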

Yanomani · 2013-11-03 19:50

cgracey wrote: »

You understand the dilemma pretty well.

There actually are restore-to-prior-state-if-pipeline-stage-cancelled circuits for INDA/INDB. They operate in pipeline stages 2 and 3, but not 4. They are already huge, and making them able to back up at stage 4 would make them enough slower that it would create the new critical path. I'll look into it, anyway, but I think we might have to live with things as they are.

Thanks Chip, for the insight on pipeline operation. I've been blind flying inside its belly since the beginning, and each glimpse we get about its complexity and the way you managed to solve this puzzle, only adds to our knowledge base.

On the subject of instructions that jump to themselves, have you ever considered using an extra bit for each of the four possible tasks IPs?

Perhaps it could be bit IP(-1), intended not to be involved during IP increments( except being cleared, if the corresponding IP is incremented), nor PUSHs or POPs, only to live on its own, always cleared, except when a re-instantiation condition exists, forcing the instruction it points to jump to itself?

If it can be crafted someway like this, its non zero state could be used to avoid pre or post increments, or whichever deleterious effects re-instantiation could produce, next time the instruction passes thru the pipeline.
On successful instruction exit, the corresponding IP is incremented, which also forces IP(-1) to be cleared, thus restoring normal behavior.

It can be seen as an on-the-fly exception flag whose persistence is restricted to the processing of a single instruction. On successful exit of the instruction, its usefulness ceases, so it is reset, ready for the next round.

As I said at the beginning, I'm blind flying, thus hitting the nose at some mountain is not a totally unexpected hazard.

Yanomani

Bill Henning · 2013-11-04 07:04

Personally, I am not bothered by the auto-increment side effect; it is easy enough to note in the documentation what happens.

Frankly, I can think of some cases where it might be useful to get that count (think someone mentioned that already) ... it could be used as an indication of "free" cpu time in that thread.

cgracey wrote: »

We could put POLVID back in, as well as make POLMUL, POLDIV, POLSQRT, POLCOR, etc., or we could just use the multitasking WAITVID, GETMULL, etc., as they are and not use INDx with them while multitasking. This will result in smaller code almost all the time, and equal-size code when you need to use a direct register, along with a separate MOV w/INDx.
