[SOLVED] Using SETINDx with REPS, a little gotcha.

jmg · 2014-01-30 14:59

Dave Hein wrote: »
It might be a good idea to include some form of conditional assembly in the assembler so that code can work when running stand-alone or with other tasks. Maybe something like the #define and #ifdef directives used with C. The NOPs would be added, or not depending on an assembler flag. So using the previous example, the code would look something like this.
#define MULTI_TASK
...
    
dummy   long  $FACE0000
        mov   myreg,dummy
        reps  #4,#4
#ifndef MULTI_TASK
        nop
#endif
        shr   myreg,#1
        shr   myreg,#1
        shr   myreg,#1
        shr   myreg,#1

Close, but even that code is not user tolerant. Worse, it gives the illusion that it is.

The code Ariba gave above IS user tolerant, for the current silicon.

Tor · 2014-01-30 15:09

jmg wrote: »

Sigh. It is actually this simple : The assembler does nothing you do not ask it to do. Period.

But then we are all in agreement, aren't we? I may have misunderstood. I thought this was about whether the assembler should supply spacer instructions where they are needed for e.g. REPS to work correctly.
If that is not the issue then I admit to being confused about what is actually discussed here and what the disagreement really is about. Maybe someone could provide a short description of exactly what the issues are. I for one would be grateful.

-Tor

ctwardell · 2014-01-30 15:12

Dave Hein wrote: »
It might be a good idea to include some form of conditional assembly in the assembler so that code can work when running stand-alone or with other tasks. Maybe something like the #define and #ifdef directives used with C. The NOPs would be added, or not depending on an assembler flag. So using the previous example, the code would look something like this.
#define MULTI_TASK
...
    
dummy   long  $FACE0000
        mov   myreg,dummy
        reps  #4,#4
#ifndef MULTI_TASK
        nop
#endif
        shr   myreg,#1
        shr   myreg,#1
        shr   myreg,#1
        shr   myreg,#1

It isn't quite that simple because it depends on how the multitasking slots are allocated.

I think this really is just something the user needs to handle on their own.

C.W.

ctwardell · 2014-01-30 15:15

jmg wrote: »

Close, but even that code is not user tolerant. Worse, it gives the illusion that it is.

The code Ariba gave above IS user tolerant, for the current silicon.

The user needs to bring some intelligence to the table, trying to protect users from themselves is a losing battle.

C.W.

ozpropdev · 2014-01-30 15:17

Ariba wrote: »

- Add a spacer NOP after REPS
- Add an additional NOP at the end of the loop
- Set the instruction count in REPS so that it includes the last NOP.

I think Andy has the solution.

If all REPS follow the conventions above, it covers both single-task and multi-task scenarios safely.
Then the only warnings needed in the DOCS are about time slot influences and a reminder that there is only one REPx circuit per cog.

No assembler changes required.
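
For illustration, here is a rough sketch of that convention applied to the four-shift example quoted earlier in the thread. It is not Ariba's original code. It assumes the second REPS operand is the number of instructions in the repeated block, which the `reps #4,#4` example does not make explicit, and it has not been run on hardware.

```pasm
        mov   myreg,dummy
        reps  #4,#5          ' assumption: first operand = repeat count, second = block length (4 shr + 1 nop)
        nop                  ' leading NOP: acts as the spacer when the task runs alone
        shr   myreg,#1
        shr   myreg,#1
        shr   myreg,#1
        shr   myreg,#1
        nop                  ' trailing NOP: counted in the block length above
```

Either the leading or the trailing NOP ends up inside the repeated block, depending on the slot allocation, so the four shifts should repeat the same way in both cases, at the cost of one extra instruction per pass.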

Brian

Dave Hein · 2014-01-30 15:21

But doesn't that make the code inefficient if you are only repeating a single instruction? It basically runs at half-speed.

ozpropdev · 2014-01-30 15:41

Dave Hein wrote: »

But doesn't that make the code inefficient if you are only repeating a single instruction? It basically runs at half-speed.

It all comes back to the DOCS again. Those coders who want to get the MAX out of REPS will still be able to tweak the code for performance.

jmg · 2014-01-30 16:01

Dave Hein wrote: »

But doesn't that make the code inefficient if you are only repeating a single instruction? It basically runs at half-speed.

Yes, there is a run time cost inside the loop, (which I missed earlier) - that makes it a context safe, but less than optimal, solution.
That run-time cost is highest on shortest loops.

Chip may yet find a way to deliver both context safe and optimal (or at least no inner loop cost) - he's done quite well so far.

Tubular · 2014-01-30 16:05

Am I the only one who finds a thread like this vaguely reassuring, with regard to what we're fussing over?

I think essentially jmg's suggestion is a "user friendly" one and is really just a bit ahead of its time wrt where we are with prop2 right now. It's probably a whole lot easier to provide a kind of tool-tip that reminds the programmer of the potential trap, together with a clear and complete description in the manual.

I wonder whether Chip doesn't look at a thread like this and think it would be easier to add the remaining three repeater blocks to at least remove that gotcha. I remember thinking about this in some detail and giving my response here.

I still think it's fine to have to think about these resources and how they must be used carefully. May help to prevent the onset of Alzheimer's.

ozpropdev · 2014-01-30 17:40

So what have we learned here?

* Spacer instructions ARE required in 1-task applications to allow the pipeline to prime
before repeating can commence. If REPS is used by a task that uses no more than every 2nd
time slot, NO spacers are needed. If REPD is used by a task that uses no more than every
4th time slot, NO spacers are needed, as three intervening instructions will be provided
by the other task(s).

"Spacers CANNOT be used in REPx loops in multi-tasking"

Brian

mindrobots · 2014-01-30 17:44

Wow! If it's that simple, it makes all this brilliant discussion seem a bit silly! :0)

jmg · 2014-01-30 19:02

ozpropdev wrote: »

So what have we learned here?

"Spacers CANNOT be used in REPx loops in multi-tasking"

Almost, but not quite: your own typo-test and Ariba's code show there is another middle case. If you multi-task and fail to meet the greater-than setting, then you do need the user-spacer.
In your case, you got two different unstable results on that setting.

I think Ariba's code also works safely in that middle case, but at the cost of some added loop overhead.

Also, REPD may need one or two or three user-provided spacers, depending on the slot settings.
The REPD has multiple middle cases, and they may prove harder to avoid - eg someone may want to tune the slots a little.

You can avoid user-spacers in Multi-tasking, provided you always keep above a certain slot space, but fall below that, and your code will become erratic.

Ariba's solution (REPS) gives two user spacers, of which one is always used within the loop (sometimes leading, sometimes trailing), for 100% predictable outcomes (any slot map tolerated, even a mixed or dynamic one).

A similar redundant REPD solution will exist, but at the cost of more code-padding, and more wasted cycles in the loop.

Simple ? Not really. Dangerous ? Yes.

jmg · 2014-01-30 19:08

Tubular wrote: »

I think essentially jmg's suggestion is a "user friendly" one and is really just a bit ahead of its time wrt where we are with prop2 right now.

Correct, it is looking ahead to how users might/will stumble, and how the Obex can be made more robust.

ozpropdev · 2014-01-30 20:26

jmg wrote: »

Almost, but not quite: your own typo-test and Ariba's code show there is another middle case. If you multi-task and fail to meet the greater-than setting, then you do need the user-spacer.
In your case, you got two different unstable results on that setting.

I think Ariba's code also works safely in that middle case, but at the cost of some added loop overhead.

Also, REPD may need one or two or three user-provided spacers, depending on the slot settings.
The REPD has multiple middle cases, and they may prove harder to avoid - eg someone may want to tune the slots a little.

You can avoid user-spacers in Multi-tasking, provided you always keep above a certain slot space, but fall below that, and your code will become erratic.

Ariba's solution (REPS) gives two user spacers, of which one is always used within the loop (sometimes leading, sometimes trailing), for 100% predictable outcomes (any slot map tolerated, even a mixed or dynamic one).

A similar redundant REPD solution will exist, but at the cost of more code-padding, and more wasted cycles in the loop.

Simple ? No. Dangerous ? Yes.

Running my same test with a different schedule produced erratic behaviour with a spacer NOP and without.

schedule		long	%%2010_1010_1010_1010
changed to
schedule		long	%%2111_1111_0000_0000

It seems there is no guarantee when the REPS block starts in the schedule. In this schedule most of the time the
REPS instruction will need a spacer except in one case where it doesn't. A bit of a lottery.
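
Reading the two schedules digit by digit may help here. This is only an annotated restatement, assuming each base-4 digit (`%%`) names the task that owns one of the 16 time slots. The labels `sched_a` and `sched_b` are hypothetical.

```pasm
sched_a   long  %%2010_1010_1010_1010   ' task 0: 8 slots, never two in a row; task 1: 7 slots; task 2: 1 slot
sched_b   long  %%2111_1111_0000_0000   ' same totals per task, but task 0's 8 slots (and task 1's 7) are back to back
```

Both schedules give each task the same share of slots and differ only in spacing, which would be consistent with the spacer requirement depending on how the slots are spaced rather than on the share alone.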

Based on the results I have revised my statement.

"Spacers CANNOT be used reliably in REPx loops in multi-tasking"

Brian

jmg · 2014-01-30 20:48

ozpropdev wrote: »

It seems there is no guarantee when the REPS block starts in the schedule. In this schedule most of the time the
REPS instruction will need a spacer except in one case where it doesn't. A bit of a lottery.

Lotteries are what worries me.

Did you test that with Ariba's code ? ( #71) - ie with a NOP at each end and a plus 1 on the block size.
The hope is that 'covers all timing bases', and so will be stable. Be nice to confirm Y/N on his suggestion.

ozpropdev · 2014-01-30 21:26

jmg wrote: »

Lotteries are what worries me.

Did you test that with Ariba's code ? ( #71) - ie with a NOP at each end and a plus 1 on the block size.
The hope is that 'covers all timing bases', and so will be stable. Be nice to confirm Y/N on his suggestion.

Sorry I didn't make that clear, (Doing too many things at once at the moment).
I was talking about time slots < every 2nd in this test.
I tried the original way as well as Ariba's way.
Yes, Ariba's suggested convention works perfectly in all the scenarios I tested.

Brian

jmg · 2014-01-30 21:41

ozpropdev wrote: »

Yes, Ariba's suggested convention works perfectly in all the scenarios I tested.

Good - so there is a 'high tolerance' structure for tasks (amongst the many minefields); it's just a pity it has a loop-penalty as well.
The code-overhead side is almost tolerable, if the speed hit could be avoided.
Something for Chip to think about ?

cgracey · 2014-01-30 22:11

This is a hard thing to overcome in hardware. The pipeline needs spacers to implement REPS and REPD, and I don't think I can magically insert them into the pipeline, as there's too much going on.

In 4-way multitasking, a jmp (or loop) takes only one clock - same as a NOP. So, in 4-way multitasking, REPS/REPD offers no speed or code size advantage, anyway.

Maybe the documentation should just state that REPS/REPD are for single-task programs, only. That would save a lot of headaches.

This spacer thing really nailed me once on the Spin2 interpreter. I turned on multitasking and everything blew up. I had actually forgotten that spacers weren't needed in REPS cases where a task was getting half or less of the clock cycles. I just recoded everything to use DJNZ's. Problem solved. Sanity maintained.
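
As a rough illustration of that kind of recoding (not the actual Spin2 interpreter code), a small repeated block could be written with DJNZ along these lines. The names `count` and `loop` are hypothetical, and the loop body is borrowed from the shift example earlier in the thread.

```pasm
        mov   count,#4        ' hypothetical counter: number of passes through the loop
loop    shr   myreg,#1        ' loop body (any instructions)
        djnz  count,#loop     ' decrement count and jump back until it reaches zero

count   res   1               ' hypothetical register, not part of the original example
```

Written this way the loop needs no spacer and does not depend on the REPx circuit, so the slot allocation no longer matters for correctness.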

The thing that I cannot figure out how to solve is the spacer issue. If that were solved, things would simplify and there would be strong reason to add 3 more instances of the REPS/REPD hardware so that every task could have one to its advantage.

Sapieha · 2014-01-30 22:31

Hi Chip.

Is it not possible to stall the pipeline so that REPS can settle before it starts executing?
It would cost some speed, but I think it would solve the problems.

cgracey wrote: »

This is a hard thing to overcome in hardware. The pipeline needs spacers to implement REPS and REPD, and I don't think I can magically insert them into the pipeline, as there's too much going on.

In 4-way multitasking, a jmp (or loop) takes only one clock - same as a NOP. So, in 4-way multitasking, REPS/REPD offers no speed or code size advantage, anyway.

Maybe the documentation should just state that REPS/REPD are for single-task programs, only. That would save a lot of headaches.

This spacer thing really nailed me once on the Spin2 interpreter. I turned on multitasking and everything blew up. I had actually forgotten that spacers weren't needed in REPS cases where a task was getting half or less of the clock cycles. I just recoded everything to use DJNZ's. Problem solved. Sanity maintained.

The thing that I cannot figure out how to solve is the spacer issue. If that were solved, things would simplify and there would be strong reason to add 3 more instances of the REPS/REPD hardware so that every task could have one to its advantage.

cgracey · 2014-01-30 22:34

Sapieha wrote: »

Hi Chip.

Is it not possible to stall the pipeline so that REPS can settle before it starts executing?
It would cost some speed, but I think it would solve the problems.

It's not a matter of stalling the pipeline, but of stepping things through it.

Sapieha · 2014-01-30 22:44

Hi Chip.

Sorry, I still don't fully grasp pipelining in the P2,
so I may be posting dumb questions.

What if REPS discarded the previous instructions in the pipeline,
and then loaded with the pipeline inactive so that it can settle?

cgracey wrote: »

It's not a matter of stalling the pipeline, but of stepping things through it.

cgracey · 2014-01-30 23:25

Sapieha wrote: »

What if REPS discarded the previous instructions in the pipeline,
and then loaded with the pipeline inactive so that it can settle?

Since your post, I've been thinking about how to get around this problem by doing what you mentioned above.

I could cancel the pipeline by doing a JMP to PC+1. That would start REPx with a clean pipe. The only problem is that GETPIX requires 3 clocks in its two prior pipeline stages, so we wouldn't be able to repeat GETPIX, which is important to be able to do.

The alternative is to add a state to the repeat circuit where we don't repeat (JMP PC-n) until one (REPS) or three (REPD) instructions pass from the same task. That wouldn't require any code modification and would be fastest for single-task programs, as the pipeline wouldn't need to be cleared. I'm going to try to do it that way. If it works, I'll add three more circuits so each task can have one.

Sapieha · 2014-01-30 23:29

Hi Chip.

Thanks.

Any solution that skips the spacers will be GOOD.

It would remove all confusion for users.

cgracey wrote: »

Since your post, I've been thinking about how to get around this problem by doing what you mentioned above.

I could cancel the pipeline by doing a JMP to PC+1. That would start REPx with a clean pipe. The only problem is that GETPIX requires 3 clocks in its two prior pipeline stages, so we wouldn't be able to repeat GETPIX, which is important to be able to do.

The alternative is to add a state to the repeat circuit where we don't repeat (JMP PC-n) until one (REPS) or three (REPD) instructions pass from the same task. That wouldn't require any code modification and would be fastest for single-task programs, as the pipeline wouldn't need to be cleared. I'm going to try to do it that way. If it works, I'll add three more circuits so each task can have one.

cgracey · 2014-01-30 23:37

Sapieha wrote: »

Hi Chip.

Thanks.

Any solution that skips the spacers will be GOOD.

It would remove all confusion for users.

I could skip the spacers by clearing the pipeline, but this would waste three clocks in single-task mode. I've also found that it's sometimes very handy to have an instruction or three to do some pin output with, just before the repeating block executes.

Code would be easier to write without spacers, though. You just couldn't do any output right before the repeating block executes.

cgracey · 2014-01-30 23:41

Would you guys like to see REPS/REPD work so that no spacers are ever required?

This would make code easy to write, but would waste three clocks in single-task mode and would not allow you to abut some pin output instruction(s) right up against the start of the repeating block.

I could stall the pipeline for the two instructions before GETPIX, without needing two discrete 'WAIT #2' instructions, so GETPIX would still work.

potatohead · 2014-01-30 23:43

I like the spacers for that reason (setup), rather than just burn the clocks. If it's not easily sorted in tasking mode, so be it. We've got the DJNZ option handy for that case. Deffo want to keep GETPIX optimized.

I don't understand what this means:

The alternative is to add a state to the repeat circuit where we don't repeat (JMP PC-n) until one (REPS) or three (REPD) instructions pass from the same task.

Is this essentially burning the cycles?

Sapieha · 2014-01-30 23:44

Hi Chip.

If it wastes 3 clocks once for the entire function and is still deterministic -- that is NOT a BIG problem --

BUT if it wastes 3 clocks every round --- BIG problem

cgracey wrote: »

I could skip the spacers by clearing the pipeline, but this would waste three clocks in single-task mode. I've also found that it's sometimes very handy to have an instruction or three to do some pin output with, just before the repeating block executes.

Code would be easier to write without spacers, though. You just couldn't do any output right before the repeating block executes.

cgracey · 2014-01-30 23:46

potatohead wrote: »

I like the spacers for that reason (setup), rather than just burn the clocks. If it's not easily sorted in tasking mode, so be it. We've got the DJNZ option handy for that case.

I don't understand what this means:

Is this essentially burning the cycles?

I should have said, instead of "(REPS)", "(in the case of REPS)". Same for REPD. I meant that if REPS had executed, I would ensure that one instruction from that task passed before repeating. I would wait for three instructions in the case of REPD.

I'm talking about making coding consistent in all task modes.

cgracey · 2014-01-30 23:48

Sapieha wrote: »

Hi Chip.

If it wastes 3 clocks once for the entire function and is still deterministic -- that is NOT a BIG problem --

BUT if it wastes 3 clocks every round --- BIG problem

It would only waste three clocks initially. The loops would be zero-overhead.

Sapieha · 2014-01-30 23:50

Hi Chip.

That sounds VERY good

cgracey wrote: »

It would only waste three clocks initially. The loops would be zero-overhead.
