Code execution time. C "compiled with blocklyprop"

Shawna · 2018-10-03 00:09

I am trying to figure out how to determine execution time for lines of c code compiled by blocklyprop. I have multiple ideas on how to do it, but I have been shocked by the results so far. I would like to discuss what i am doing with you guys.

So this is where I am at.

I wanted to see which was faster.

This

while (!((pin == 1))) {
      pin = input(2);
    }

or this

waitpeq(state, mask);

I thought maybe I could measure the time by toggling an output and reading it with a scope.

So I used this code, which had a high time of 23uS and a low time of 16uS.

while (1) {
    high(3);
    while (!((pin == 1))) {
      pin = input(2);
    }
    low(3);
  }

And this Code, which had a high time of 23uS and a low time of 16uS.

while (1) {
    high(3);
    waitpeq(state, mask);
    low(3);
  }

To my surprise the high time and low time were the same for both loops, or pretty close to the same. I thought the waitpeq(state, mask); loop would be a lot faster. I have done a lot of reading and I thought the waitpeq statement should take 6 clock cycles, I probably read something wrong.
This seemed really slow so I thought that maybe I was measuring something wrong.

So then I tried this.

while (1) {
    high(3);
    low(3);
  }

One period or cycle was 31.5uS. For some reason I assumed it would be faster.

23+16 = 39uS
39-31.5 = 7.5uS
7.5uS/12.5nS = 600 clock cylcles
600 clock cycles is way more than 6 or 10 or whatever the waitpeq statement should take. Maybe that 6 clock cycles is for pasm only.

Am I thinking about this properly or am I way off?

Thanks guys

Peter Jakacki · 2018-10-03 03:07

I think this is a problem because it sounds like it has been compiled for the compact model CMM rather than LMM or even PASM. By comparison even a slow Tachyon version that pushes and pops the data stack each time takes 5.8us with:

BEGIN 5 HIGH 5 LOW AGAIN

But using a different method it takes 1.4us for the period:

5 MASK BEGIN H L AGAIN

I'm sure you can configure the compiler to generate faster but less compact code.

DavidZemon · 2018-10-03 03:18

Can I recommend you change your measurement style to something like so:

unsigned int start;
unsigned int done;
while (1) {
  start = CNT;
  waitpeq(state, mask);
  done = CNT;
  printi("Clocks = %d --- Time = %d us\n", done - start, (done - start) / 80);
}

If you don't have a serial terminal hooked up to read the console output, then switch from using the high()/low() functions to manipulating the OUTA register directly. Those functions aren't very efficient.

twm47099 · 2018-10-03 05:28

I had run some tests to measure the overhead time for a pin to go from an output high to a low based on the Propeller Education Kit (PEK) example of RC timing using counters (starts on page 126 of the PEK book.) The explanation of how I did this is in the first post of this thread:
https://forums.parallax.com/discussion/164271/solved-with-timing-results-problem-with-pasm-counters-ctra

I had looked at a few languages (PASM, Spin, Prop C LMM and CMM) and used a few different methods (for example in Prop C I compared the Simple Tools functions input(pin) and low(pin) with using the propeller.h variables DIRA and OUTA. The results in system clocks are in the last post of that thread.

But to summarize the min & max clocks for each language/memory model:
PASM: 6 clocks
Spin: 592 to 643 clocks
Prop C CMM: 145 clocks (using propeller.h variables) to 707 clocks (using Simple Tools)
Prop C LMM: 17 clocks (using propeller.h variables) to 147 clocks (using Simple Tools)

So with C using LMM is significantly faster than CMM (if you can fit it in the memory) and as David mentioned using the propeller.h variables is significantly faster than the SimpleTools functions (at least for this simple example.)

I don't have the C program I used on this computer so unfortunately I can't show the details of the methods I used.

Tom

Shawna · 2018-10-03 19:21

Peter Jakacki wrote: »

I'm sure you can configure the compiler to generate faster but less compact code.

Yeah, I will download the files from blocklyprop and recompile them in simple ide as CMM and see if there is any difference in speed.

DavidZemon wrote: »
Can I recommend you change your measurement style to something like so:
unsigned int start;
unsigned int done;
while (1) {
  start = CNT;
  waitpeq(state, mask);
  done = CNT;
  printi("Clocks = %d --- Time = %d us\n", done - start, (done - start) / 80);
}

That is originally what I did but the execution time seemed so long that I switched over to the high low pulse and the scope, I thought it would be more accurate. I also was not sure how long the command start = CNT; took to execute.

I tried something like this to determine how long start = CNT would take to execute. My thought was however long it took to run the 3 lines of code could be divided by 3 and that would be the time for each line. The results of that seemed to be off also.

int start;
 int done;
int temp;
while (1) {
  start = CNT;
  temp = CNT;
  done = CNT;
  printi("Clocks = %d --- Time = %d us\n", done - start, (done - start) / 80);
}

twm47099 wrote: »

I had run some tests to measure the overhead time for a pin to go from an output high to a low based on the Propeller Education Kit (PEK) example of RC timing using counters (starts on page 126 of the PEK book.) The explanation of how I did this is in the first post of this thread:
https://forums.parallax.com/discussion/164271/solved-with-timing-results-problem-with-pasm-counters-ctra

Thanks for the link, this learning experience is leading into using the counters. I am trying to figure out the best way to get the speed I need. For most things speed does not matter, but for other things it is critical.

I will play around with different modes and methods tonight.

Thanks Guys

pmrobert · 2018-10-03 19:34

If you want it faster you'll want LMM and enabling fcache is another speed help.

-Mike R.

Roy Eltham · 2018-10-03 20:14

The code in SimpleTools/BlocklyProp is meant to be clear and easy to understand. It sacrifices top performance for that. Also, CMM is used because otherwise you can't fit very much into your programs.
It's working pretty darn well for its target audience.

It's still great for getting a lot of things working quickly and easily.

If you are after high performance, then you need to be getting more "down to the metal" with PASM and other SimpleIDE/propgcc features (like cog target and LMM, etc.).

Shawna · 2018-10-03 20:34

I've seen the term fcache thrown around the forum, but I do not know what it is, I will have to investigate it.

Roy Eltham wrote: »

The code in SimpleTools/BlocklyProp is meant to be clear and easy to understand. It sacrifices top performance for that. Also, CMM is used because otherwise you can't fit very much into your programs.
It's working pretty darn well for its target audience.

It's still great for getting a lot of things working quickly and easily.

If you are after high performance, then you need to be getting more "down to the metal" with PASM and other SimpleIDE/propgcc features (like cog target and LMM, etc.).

Blocklyprop is great, I'm not knocking it! I am not very good at programming in C, or anything else for that matter, blocklyprop is a great tool. It has been a great way to get my son involved with programming also.

My son got a robot for Christmas which used another form of Blockly and it was garbage. Very slow and used the computer as an emulator to run the code! It was not enjoyable for him to play with after a couple of hours. So we bought an activity bot, which has been great fun and very educational for both of us.

I am kicking around the idea of trying to put some in-line pasm in my blocklyprop project using the user defined block. I have read a few threads on it!

I'll see how far I get on it tonight.

DavidZemon · 2018-10-03 21:31

For fcache, see my explanation in list item #5 of this comment: https://forums.parallax.com/discussion/comment/1441822/#Comment_1441822
You'll also find in that comment a link to how I use it. In that link, you'll find lots of other examples (some complicated, some simple) of performance-oriented code. For instance, how might you create an easy-to-use but also performance-oriented function for toggling GPIO? Why, how about wrapping it in a class and then writing a method on that class which uses inline assembly to directly manipulate the OUTA register? See my example in PropWare's Port class here. That lets you write really simple code like so:

#include <PropWare/gpio/pin.h>

using PropWare::Pin;

int main () {
    const auto quarterSecond = CLKFREQ / 4;

    const Pin led1(Pin::P17, Pin::Dir::OUT);
    while (1) {
        led1.toggle();
        waitcnt(quarterSecond + CNT);
    }
}

Which generates very efficient assembly:

.section        .text.startup.main,"ax",@progbits
        .balign 4
        .global _main
_main
        mvi     r7,#__clkfreq
        mvi     r5,#-131073
        rdlong  r6, r7
        mov     r7, DIRA
        and     r7, r5
        mvi     r5,#131072
        or      r7, r5
        shr     r6, #2
        mov     DIRA, r7
        jmp     #__LMM_FCACHE_LOAD
        long    .L4-.L3
.L3
.L2
' 273 "/home/david/reusable/Documents/Programming/PropellerProjects/PropWare/CMakeModules/../PropWare/gpio/port.h" 1
        xor outa, r5
' 0 "" 2
        mov     r7, CNT
        add     r7, r6
        waitcnt r7,#0
        jmp     #__LMM_FCACHE_START+(.L2-.L3)
        jmp     __LMM_RET
        .compress default
.L4

It's using fcache automagically here, which makes it a little harder to read. But, just know that when it says "jmp #__LMM_FCACHE_START+(.L2-.L3)", the target of that jump is the "xor outa, r5" instruction. So from there, you can see the tight loop that's been generated from some very easy-to-read C++ code:

xor outa, r5
mov     r7, CNT
add     r7, r6
waitcnt r7,#0
jmp     #__LMM_FCACHE_START+(.L2-.L3)

One can, of course, get an even better loop by using waitcnt2 instead of waitcnt, which gives you proper access to the waitcnt PASM instruction:

#include <PropWare/gpio/pin.h>

using PropWare::Pin;

int main () {
    const auto quarterSecond = CLKFREQ / 4;

    const Pin led1(Pin::P17, Pin::Dir::OUT);
    auto timer = CNT + quarterSecond;
    while (1) {
        led1.toggle();
        timer = waitcnt2(timer, quarterSecond);
    }
}

And the resulting PASM loop is now exactly what you'd expect hand-written code to look like:

xor outa, r5
waitcnt r7,r6
jmp     #__LMM_FCACHE_START+(.L2-.L3)

Shawna · 2018-10-04 18:09

@David
I am still digesting the code you posted, thanks.

While playing with execution timing, I came across something I didn't anticipate happening. I ran the code below to see how long the start = (CNT) and stop = (CNT) would take to execute. Using SimpleIDE in CMM mode it took 144 clock cycles.

I then commented out the first term_cmd(CLS) and used the term_cmd(CLS) right before the print command, if that makes sense. I ran the program again and this time it took 256 clock cycles. I don't understand why. Could someone please explain this too me.

// ------ Libraries and Definitions ------
#include "simpletools.h"

// ------ Global Variables and Objects ------
int start;
int stop;

// ------ Main Program ------
int main() {

  while (1) { 
    term_cmd(CLS);
    start = (CNT);
    stop = (CNT);
    //term_cmd(CLS);
    print("%d", (stop - start));
    pause(150);
  }

}

Dave Hein · 2018-10-04 20:43

The difference in time is probably due to C's code optimizer. Another issue is that start and stop are global variables, which causes them to be stored in hub RAM. If you make them local variables within the main routine it will probably reduce the number of cycles.

Code execution time. C "compiled with blocklyprop"

Comments