C: What Is The Closest C Statement To A SPIN REPEAT

idbruce · 2015-04-11 13:56

I realize there are several ways to do this, but I am looking for the fastest possible execution.

For example, for a 10 iteration loop in SPIN, we simply say:

REPEAT 10

Now let's say that we assign 10 to a variable called STEPS, our code then becomes:

REPEAT STEPS

During the iterations, the STEPS variable remains unchanged, but you do not get to see what is going on in the background. Now I would assume that in the background of the REPEAT execution, there is a counter variable to which a value of STEPS would be assigned, and that variable would be decremented or incremented for each iteration.

With that in mind, I would assume that something like the following would probably be the fastest:

int i = STEPS;

while(i != 0) 
{
    SOME CODE;

    i--;
}

It would be nice if I could just use STEPS without creating a variable and assigning a value, and without the value of STEPS being altered. Just looking for ideas and opinions. However, like I said, I am looking for the fastest possible execution. I would imagine someone might know the answer without testing.

David Betz · 2015-04-11 14:04

I think what you did is probably about the best you'll do in C. In fact, it probably isn't much different from what Spin is doing except that the variable is explicit instead of being hidden like in Spin.

abecedarian · 2015-04-11 14:14

int STEPS = 10;

// loop to iterate "steps" number of times:
for (int i = 1; i <= STEPS; i++) {
  // do something
  // variable "i" can be used do indicate which "step" you're currently in
  // and is only available for use within the scope of this "for" loop
}

idbruce · 2015-04-11 14:15

David

I think what you did is probably about the best you'll do in C. In fact, it probably isn't much different from what Spin is doing except that the variable is explicit instead of being hidden like in Spin.

Yea, I would assume SPIN is very similar.

Actually, I just thought of something......

In my case, STEPS actually represents both the ramp-up and ramp-down steps for a motor driver. If I separate these two and have a variable for each of them, I could then eliminate the assignment within the driver, for both the ramp-up and ramp-down iterations. So then I could decrement the variable itself and not worry about it being altered.

Hmmm... I do believe that is what I am going to do.
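
A rough sketch of that split, assuming hypothetical names (run_ramps, step_pulse, pulses_issued) that are not from the actual driver. It relies on C passing arguments by value, so the two counts arrive as private copies that can be decremented directly without touching the caller's variables:

```c
/* Hypothetical stand-in for the real per-step work; it only counts calls. */
static unsigned pulses_issued = 0;

void step_pulse(void)
{
    pulses_issued++;
}

/* ramp_up and ramp_down are copies of the caller's values, so they can
   be counted down in place with no extra local variable. */
void run_ramps(unsigned ramp_up, unsigned ramp_down)
{
    while (ramp_up != 0) {
        step_pulse();
        ramp_up--;
    }
    while (ramp_down != 0) {
        step_pulse();
        ramp_down--;
    }
}
```

Calling run_ramps(steps, steps) leaves steps unchanged in the caller. Whether this is any faster than an explicit local copy depends on the compiler and has not been measured here.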

jmg · 2015-04-11 14:16

idbruce wrote: »

Just looking for ideas and opinions. However, like I said, I am looking for the fastest possible execution. I would imagine someone might know the answer without testing

Mostly people look for smallest, and fast comes along too...

There is a DJNZ opcode, so ideally, you want to aim for that.

For that, see this code example from David Betz (#48 & #45)
http://forums.parallax.com/showthread.php/160731-What-are-the-limits-of-in-COG-with-C-or-BASIC?p=1325462&viewfull=1#post1325462

In that example

for (count = 8; --count >= 0; ) {

compiles to this loop control code:

	mov	r6, #9    
	jmp	#.L9      
.L10    
' Loop stuff here                   
.L9                       
	djnz	r6,#.L10

That gives the smallest, and fastest, looping.
It takes 3 lines to 'get going', and the iterating value runs 8..1, but as you see, there is nothing faster for looping speed.
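
To see where the 8..1 comes from, here is a plain C model of that control flow. It is only a sketch of the decrement-then-jump-if-not-zero behaviour, not compiler output, and the function name djnz_model is made up for illustration:

```c
/* Models:  mov r6,#9 / jmp #.L9 / .L10: body / .L9: djnz r6,#.L10
   Records the register value seen on each pass and returns the pass count. */
int djnz_model(int values_seen[], int capacity)
{
    int r6 = 9;               /* mov r6, #9 */
    int passes = 0;

    /* djnz: decrement first, then jump to the body if the result is non-zero */
    while (--r6 != 0) {
        if (passes < capacity)
            values_seen[passes] = r6;   /* body sees 8, 7, ..., 1 */
        passes++;
    }
    return passes;
}
```

The body runs 8 times, the same count as the C source loop, even though the variable in the source would read 7..0 while the register holds 8..1.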

idbruce · 2015-04-11 14:39

@jmg

but as you see, there is nothing faster for looping speed.

I will just have to take your word for it, because I am ASM illiterate.

Heater. · 2015-04-11 14:46

Here are some loops:

void loop1() {
    int i = steps;
    while(i != 0)
    {
        someCode();
        i--;
    }
}

void loop2() {
    int i = steps;
    while(i-- != 0)
    {
        someCode();
    }
}

void loop3() {
    int i = steps;
loop:
    someCode();
    if (--i) goto loop;
}

void loop4() {
    int i = steps;
    for (int i = 0; i < 10; i++)
    {
        someCode();
    }
}

Here is what they compile to:

loop1:
_loop1
	mov	__TMP0,#(2<<4)+14
	call	#__LMM_PUSHM
	mvi	r7,#_steps
	rdlong	r14, r7
	brs	#.L3
.L4
	lcall	#_someCode
	sub	r14, #1
.L3
	cmps	r14, #0 wz,wc
	IF_NE	brs	#.L4
	mov	__TMP0,#(2<<4)+15
	call	#__LMM_POPRET


_loop2
	mov	__TMP0,#(2<<4)+14
	call	#__LMM_PUSHM
	mvi	r7,#_steps
	rdlong	r14, r7
	brs	#.L6
.L7
	lcall	#_someCode
	sub	r14, #1
.L6
	cmps	r14, #0 wz,wc
	IF_NE	brs	#.L7
	mov	__TMP0,#(2<<4)+15
	call	#__LMM_POPRET


_loop3
	mov	__TMP0,#(2<<4)+14
	call	#__LMM_PUSHM
	mvi	r7,#_steps
	rdlong	r14, r7
.L9
	lcall	#_someCode
	djnz	r14,#__LMM_JMP
	long	.L9
	mov	__TMP0,#(2<<4)+15
	call	#__LMM_POPRET


_loop4
	mov	__TMP0,#(2<<4)+14
	call	#__LMM_PUSHM
	mov	r14, #10
.L12
	lcall	#_someCode
	djnz	r14,#__LMM_JMP
	long	.L12
	mov	__TMP0,#(2<<4)+15
	call	#__LMM_POPRET

Here is the summary of instructions used:

        Loop   func
loop1 -    4     11
loop2 -    4     11
loop3 -    3      9
loop4 -    3      8

Just use "for".

idbruce · 2015-04-11 14:59

Heater

Very interesting....

In my case, STEPS actually represents both the ramp-up and ramp-down steps for a motor driver. If I separate these two and have a variable for each of them, I could then eliminate the assignment within the driver, for both the ramp-up and ramp-down iterations. So then I could decrement the variable itself and not worry about it being altered.

In the case of a while loop, what happens if the assignment is made outside the function? I realize that it is extra instructions elsewhere, but the initialization would not be in a time critical area?

Heater. · 2015-04-11 15:03

Hmm...anyone have any idea why the code generated to read the steps variable is different in loop4() above?

loop4() uses just:

	mov	r14, #10

but the other three all do:

	mvi	r7,#_steps
	rdlong	r14, r7

abecedarian · 2015-04-11 15:10

Probably because you assign i the value of steps before entering the for loop, then in the for loop you declare a new i (shadowing the first) that starts at 0 and counts to the literal 10.
The compiler sees that the first i is never used and discards it, along with the read of steps.

void loop4() {
    int i = steps;                // <- gets optimized away
    for (int i = 0; i < 10; i++)  // because i is redefined here
    {
        someCode();
    }
}

// could be re-written:
void loop4() {
    for (int i = 0; i < steps; i++)
    {
        someCode();
    }
}
// or:
void loop4() {
   for (int i = steps; i > 0; i--)
    {
        someCode();
    }
}

Heater. · 2015-04-11 15:14

You mean like:

int i;

void loop1() {
    while(i != 0)
    {
        someCode();
        i--;
    }
}

int main()
{
    i = 10;
    loop1();
}

That would work.

Don't do that though. Having global variables lying around will get you into a mess.

Anyway, skip the premature optimization, get the code running correctly first.

jmg · 2015-04-11 15:15

Heater. wrote: »

Hmm...anyone have any idea why the code generate to read the steps variable is different in loop4() above?

I'd say a for loop is by far the most common, so gets the most focus, and thus the best attention to size/speed efforts.
Note the for structure example I gave above, that compiles to a single DJNZ looping instruction.

jmg · 2015-04-11 15:26

idbruce wrote: »

@jmg
I will just have to take your word for it, because I am ASM illiterate.

You do not need to be very ASM literate, just enough to be able to see the looping structure.
Note also the optimizer can do things you may not expect. (so checking ASM is always a good idea)

As well as removing the preload, look carefully at the count direction in loop4 (post #7).
The code says i++, but the ASM says DJNZ?!
That happens because the loop body did not reference i, so the optimizer flips to the more compact DJNZ. Of course, if you actually use i in the loop, the generated structure will change.
I.e. use care when doing small benchmarks.

Heater. · 2015-04-11 15:35

jmg,

My code also compiles to djnz (oops you may have read the wrong asm, I posted the wrong file initially, it's correct now)

Your example is for in-COG code, if I'm not mistaken.

Which starts me thinking....this is LMM code, so in the "while" loop examples we see things like this:

.L4
        lcall   #_someCode
        sub     r14, #1
.L3
        cmps    r14, #0 wz,wc
        IF_NE   brs     #.L4

And the "for" loop looks more efficient like so:

.L12
        lcall   #_someCode
        djnz    r14,#__LMM_JMP
        long    .L12

But wait...they are both doing more work than it appears. The "for" loop case is jumping off to a kernel routine __LMM_JMP to do the LMM jump. The "while" loop is using "brs" which is not a Propeller instruction!

Until we know how long those things take to run we still don't know which loop is faster.

Heater. · 2015-04-11 15:37

abecedarian,

Oops, yes, well spotted. I forgot to remove that redundant declaration. Which was half the point of showing the code!

Heater. · 2015-04-11 15:55

@abecedarian,

Unfortunately, correcting my code (well, actually using yours) makes our for loop bigger:

void loop4a() {
    for (int i = 0; i < steps; i++)
    {
        someCode();
    }
}

void loop4b() {
    for (int i = steps; i > 0; i--)
    {
        someCode();
    }
}

Becomes:

_loop4a
	mov	__TMP0,#(3<<4)+13
	call	#__LMM_PUSHM
	mov	r14, #0
	mvi	r13,#_steps
	brs	#.L12
.L13
	lcall	#_someCode
	add	r14, #1
.L12
	rdlong	r7, r13
	cmps	r14, r7 wz,wc
	IF_B 	brs	#.L13
	mov	__TMP0,#(3<<4)+15
	call	#__LMM_POPRET


_loop4b
	mov	__TMP0,#(2<<4)+14
	call	#__LMM_PUSHM
	mvi	r7,#_steps
	rdlong	r14, r7
	brs	#.L15
.L16
	lcall	#_someCode
	sub	r14, #1
.L15
	cmps	r14, #0 wz,wc
	IF_A 	brs	#.L16
	mov	__TMP0,#(2<<4)+15
	call	#__LMM_POPRET

The loops are now 5 and 4 instructions long, making the while loops the winners!

jmg is so right about this micro-optimization problem.

idbruce · 2015-04-11 17:14

Anyway, skip the premature optimization, get the code running correctly first.

I hear ya

but thanks for the testing... I thought it was fairly interesting

davidsaunders · 2015-04-11 17:32

idbruce wrote: »

I hear ya but thanks for the testing... I thought it was fairly interesting

The easiest implementation would be:

{
long tmp;
 for (tmp = STEPS; tmp; tmp--)
 {
   /*code in loop*/
 }
}

And that is just the way to do things in C.

ersmith · 2015-04-11 17:54

If you're running in LMM mode, and your loop body has function calls in it, it's probably not going to matter -- the LMM jump overhead is going to dominate the instruction counting.

For COG mode, or for LMM code with no (non-native) functions in the loop, you can force a djnz with:

int i = steps;
do {
   // stuff
} while(--i != 0);

which, if you think about it, is exactly what djnz does (note that the loop will always be executed at least once though).
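
Because the body always runs once, a count that might be zero needs a guard in front of the loop. A minimal sketch, with a hypothetical function name (run_steps) and a counter standing in for the real loop body; whether the compiler still emits djnz with the guard in place has not been checked here:

```c
/* Runs the loop body exactly `steps` times, including zero times. */
unsigned run_steps(unsigned steps)
{
    unsigned executed = 0;

    /* Without this guard, steps == 0 would run the body once and then
       wrap the counter around, looping a very large number of times. */
    if (steps != 0) {
        unsigned i = steps;
        do {
            executed++;          /* stand-in for the real loop body */
        } while (--i != 0);      /* decrement, repeat while non-zero */
    }
    return executed;
}
```

The guard is paid once, outside the loop, so it does not add to the per-iteration cost.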

Usually though one writes a loop like this as:

for (i = 0; i < steps; i++) {
  // stuff
}

This is a common enough pattern that the compiler will usually optimize it very well, especially if it knows that steps is non-negative (e.g. if it is declared unsigned).

In LMM mode, try very hard to keep the loop small ( < 1K code) and simple (no function calls). This will allow the compiler to put the loop into FCACHE, which will speed it up enormously (4X or so).

jmg · 2015-04-11 18:22

Heater. wrote: »

Anyway, skip the premature optimization, get the code running correctly first.

This type of analysis I call less "premature optimization" and more 'getting one's head in sync with how the compiler thinks', which is certainly a good idea for any embedded developer looking for the best results.
The lowest looping overhead occurs when the compiler uses DJNZ, but that does decrement the variable, and the last loop value is 1, not the 0 that some may expect.
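
If a 0-based index is needed inside such a down-counting loop, it can be derived from the counter. A small sketch, using a hypothetical helper name (fill_indices) that is not from any library:

```c
/* Writes the 0-based index of each pass into out[0..steps-1]. */
void fill_indices(unsigned steps, unsigned out[])
{
    if (steps == 0)
        return;                       /* do/while would otherwise run once */

    unsigned i = steps;
    do {
        unsigned index = steps - i;   /* i runs steps..1, so index runs 0..steps-1 */
        out[index] = index;
    } while (--i != 0);
}
```

The subtraction is extra work on every pass, and using the counter inside the body may change what the compiler generates, so the resulting ASM is worth checking.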

I guess that shows just how old the DJNZ opcode is, and a smarter opcode would have been DJNU (dec & Jump if not underflow), which would have allowed 0 based indexing and cleaner nesting of DJNU opcodes.
Do we blame Intel for that oversight?

Heater. · 2015-04-11 18:49

jmg,

I'm all in favour of getting to know what the compiler does. The question is: is it worth worrying about micro-optimizations like this if the cost might be writing less than blindingly obvious code?

Currently I can't be sure that using DJNZ in LMM loops is the lowest overhead. That code does a jump to some LMM kernel routine in the COG at __LMM_JMP. So far I have no idea what that routine looks like.

I like the DJNU idea.

abecedarian · 2015-04-11 19:12

Heater. wrote: »

@abecedarian,
>snip<
Unfortunately correcting my code, well actually using yours, makes our for loop bigger.
the loops are now 5 and 4 instructions long. Making the while loops the winners!

jmg is so right about this micro-optimization problem.

Probably the re-cast of i from int to long causing part of that. I keep forgetting everything is a long in Propeller land.
I wonder how it'd add up if "i" was declared long?

Heater. · 2015-04-11 19:22

Makes no difference. A long and an int are both 32 bit signed in prop-gcc.

Genetix · 2015-04-12 23:40

Bruce, for what it's worth, REPEAT is the Spin equivalent of FOR...NEXT in PBASIC.

Also, FOR is usually used for counting from or to a particular value, but WHILE is used either to wait for something to happen or to do something while something else is happening.

For example, I used this to wait for C to be pressed in a QBASIC program. INKEY$ grabs whatever is sitting in the Keyboard buffer even if it's nothing.

DO : LOOP UNTIL INKEY$ = "C"
