Code execution time. C "compiled with BlocklyProp"
Shawna
Posts: 508
I am trying to figure out how to determine execution time for lines of C code compiled by BlocklyProp. I have multiple ideas on how to do it, but I have been shocked by the results so far. I would like to discuss what I am doing with you guys.
So this is where I am at.
I wanted to see which was faster.
This
while (!((pin == 1))) { pin = input(2); }
or this
waitpeq(state, mask);
I thought maybe I could measure the time by toggling an output and reading it with a scope.
So I used this code, which had a high time of 23 µs and a low time of 16 µs.
while (1) { high(3); while (!((pin == 1))) { pin = input(2); } low(3); }
And this code, which had a high time of 23 µs and a low time of 16 µs.
while (1) { high(3); waitpeq(state, mask); low(3); }
To my surprise, the high time and low time were the same for both loops, or pretty close to the same. I thought the waitpeq(state, mask); loop would be a lot faster. I have done a lot of reading, and I thought the waitpeq statement should take 6 clock cycles, so I probably read something wrong.
This seemed really slow so I thought that maybe I was measuring something wrong.
So then I tried this.
while (1) { high(3); low(3); }
One period or cycle was 31.5 µs. For some reason I assumed it would be faster.
23 µs + 16 µs = 39 µs
39 µs - 31.5 µs = 7.5 µs
7.5 µs / 12.5 ns = 600 clock cycles (12.5 ns is one clock at 80 MHz)
600 clock cycles is way more than the 6 or 10 or whatever the waitpeq statement should take. Maybe that 6-clock figure is for PASM only.
Am I thinking about this properly or am I way off?
Thanks guys
Comments
But using a different method, it takes 1.4 µs for the period.
I'm sure you can configure the compiler to generate faster but less compact code.
If you don't have a serial terminal hooked up to read the console output, then switch from using the high()/low() functions to manipulating the OUTA register directly. Those functions aren't very efficient.
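For illustration, here is a minimal sketch of that approach, toggling P3 through the DIRA and OUTA registers from propeller.h instead of calling high(3)/low(3) (the pin numbers just follow the earlier posts):

#include "propeller.h"

int main()
{
  DIRA |= 1 << 3;          // make P3 an output (high()/low() also set this on every call)
  while (1)
  {
    OUTA |= 1 << 3;        // drive P3 high, without the function-call overhead of high(3)
    OUTA &= ~(1 << 3);     // drive P3 low
  }
}

Each toggle here is a single read-modify-write of OUTA with no function call, which is where the speedup comes from.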
https://forums.parallax.com/discussion/164271/solved-with-timing-results-problem-with-pasm-counters-ctra
I had looked at a few languages (PASM, Spin, Prop C LMM and CMM) and used a few different methods (for example, in Prop C I compared the SimpleTools functions input(pin) and low(pin) with using the propeller.h variables DIRA and OUTA). The results in system clocks are in the last post of that thread.
But to summarize the min & max clocks for each language/memory model:
PASM: 6 clocks
Spin: 592 to 643 clocks
Prop C CMM: 145 clocks (using propeller.h variables) to 707 clocks (using SimpleTools)
Prop C LMM: 17 clocks (using propeller.h variables) to 147 clocks (using SimpleTools)
So with C, using LMM is significantly faster than CMM (if you can fit it in memory), and as David mentioned, using the propeller.h variables is significantly faster than the SimpleTools functions (at least for this simple example).
I don't have the C program I used on this computer so unfortunately I can't show the details of the methods I used.
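Roughly, though, the propeller.h side of that comparison amounts to reading INA and writing DIRA/OUTA directly instead of calling the SimpleTools functions; a sketch along those lines (not the original test program, with pin numbers following the earlier posts):

#include "propeller.h"

int main()
{
  DIRA |= 1 << 3;                // P3 as output
  while (1)
  {
    OUTA |= 1 << 3;              // roughly what high(3) does
    while (!((INA >> 2) & 1));   // poll P2 through INA instead of calling input(2)
    OUTA &= ~(1 << 3);           // roughly what low(3) does
  }
}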
Tom
Yeah, I will download the files from BlocklyProp and recompile them in SimpleIDE as CMM and see if there is any difference in speed.
That is originally what I did, but the execution time seemed so long that I switched over to the high/low pulse and the scope; I thought it would be more accurate. I also was not sure how long the command start = CNT; took to execute.
I tried something like this to determine how long start = CNT would take to execute. My thought was that however long it took to run the 3 lines of code could be divided by 3, and that would be the time for each line. The results of that seemed to be off also.
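Something along these lines (a sketch of the idea rather than the exact program):

#include "simpletools.h"

int main()
{
  int start, mid, stop;

  start = CNT;    // capture 1
  mid = CNT;      // capture 2
  stop = CNT;     // capture 3

  // stop - start spans the two captures made after 'start',
  // so roughly half of it is the cost of one "x = CNT;" line
  print("first gap = %d, both gaps = %d clocks\n", mid - start, stop - start);
}

The count you get also depends heavily on the memory model and optimization settings, as the CMM/LMM numbers above show.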
Thanks for the link; this learning experience is leading into using the counters. I am trying to figure out the best way to get the speed I need. For most things speed does not matter, but for other things it is critical.
I will play around with different modes and methods tonight.
Thanks Guys
-Mike R.
It's working pretty darn well for its target audience.
It's still great for getting a lot of things working quickly and easily.
If you are after high performance, then you need to be getting more "down to the metal" with PASM and other SimpleIDE/propgcc features (like cog target and LMM, etc.).
BlocklyProp is great, I'm not knocking it! I am not very good at programming in C, or anything else for that matter, and BlocklyProp is a great tool. It has also been a great way to get my son involved with programming.
My son got a robot for Christmas which used another form of Blockly, and it was garbage: very slow, and it used the computer as an emulator to run the code! It was not enjoyable for him to play with after a couple of hours. So we bought an ActivityBot, which has been great fun and very educational for both of us.
I am kicking around the idea of trying to put some inline PASM in my BlocklyProp project using the user-defined block. I have read a few threads on it!
I'll see how far I get on it tonight.
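For what it's worth, PropGCC (the compiler behind both BlocklyProp and SimpleIDE) accepts GCC-style inline assembly, so a single PASM instruction can be dropped into C roughly like this. It's only a sketch: whether a BlocklyProp user-defined block passes it through untouched is an assumption, and propeller.h already wraps this particular instruction as waitpeq(), so it's purely an illustration of the mechanics.

#include "propeller.h"

int main()
{
  unsigned int state = 1 << 2;   // wait for P2 to be high
  unsigned int mask  = 1 << 2;

  DIRA |= 1 << 3;                // P3 as output
  while (1)
  {
    OUTA |= 1 << 3;
    // issue the native waitpeq instruction directly
    __asm__ volatile ("waitpeq %0, %1" : : "r" (state), "r" (mask));
    OUTA &= ~(1 << 3);
  }
}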
You'll also find in that comment a link to how I use it. In that link, you'll find lots of other examples (some complicated, some simple) of performance-oriented code. For instance, how might you create an easy-to-use but also performance-oriented function for toggling GPIO? Why, how about wrapping it in a class and then writing a method on that class which uses inline assembly to directly manipulate the OUTA register? See my example in PropWare's Port class here. That lets you write really simple code like so:
Which generates very efficient assembly:
It's using fcache automagically here, which makes it a little harder to read. But, just know that when it says "jmp #__LMM_FCACHE_START+(.L2-.L3)", the target of that jump is the "xor outa, r5" instruction. So from there, you can see the tight loop that's been generated from some very easy-to-read C++ code:
One can, of course, get an even better loop by using waitcnt2 instead of waitcnt, which gives you proper access to the waitcnt PASM instruction.
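In plain Propeller C the same idiom looks roughly like this (a sketch assuming PropGCC's propeller.h, where waitcnt2(target, delta) issues the waitcnt instruction and returns the updated target):

#include "propeller.h"

int main()
{
  unsigned int period = CLKFREQ / 100000;   // arbitrary half-period for the example
  unsigned int timer = CNT + period;

  DIRA |= 1 << 3;
  while (1)
  {
    OUTA ^= 1 << 3;                    // toggle P3
    timer = waitcnt2(timer, period);   // wait for 'timer', get back timer + period
  }
}

Because waitcnt2 hands back the next target, the loop keeps a fixed period without re-reading CNT each time around.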
And the resulting PASM loop is now exactly what you'd expect hand-written code to look like:
I am still digesting the code you posted, thanks.
While playing with execution timing, I came across something I didn't anticipate happening. I ran the code below to see how long the start = (CNT) and stop = (CNT) would take to execute. Using SimpleIDE in CMM mode it took 144 clock cycles.
I then commented out the first term_cmd(CLS) and used the term_cmd(CLS) right before the print command, if that makes sense. I ran the program again and this time it took 256 clock cycles. I don't understand why. Could someone please explain this to me?