How to get accurate microsecond pauses?

BTL24 · 2013-11-20 18:04

I am having trouble getting microsecond pauses to work with the Propeller Activity Board. I followed the library guidelines and called set_pause_dt(CLKFREQ/1000000), which is supposed to switch the pause() function from 1 mS increments to 1 uS increments.

When I program pause(10), I see 40 uS on the scope. When I program pause(20), I see 50 uS.

I have tried a higher resolution with set_pause_dt(CLKFREQ/2000000), but all I see is a constant high output.

I have also gone the other way and programmed 10 uS increments with set_pause_dt(CLKFREQ/100000) and then pause(1), but again I see 40 uS.

What am I doing wrong?

Is there a better way to get 10 uS pulses than using pause() command?

Thanks,
BTL24 (Brian).

jmg · 2013-11-20 18:39

Try some values further from the origin, eg see if 200, 201, 202 give 1us changes.

Any delay library is going to have some minimum time, and then some granularity.
With Prop GCC there may be other loading delays to consider as well.

BTL24 · 2013-11-20 20:16

I ran some more tests (found in the library) that measure clock ticks against the system timer, CNT. The results are discouraging, but they explain why I can't get 1 uS increments from the pause() function.

Optimization levels apparently make a difference, but I still can't get to 1 uS increments. The results below were built optimized for size. When I optimized for speed, the system just hung. Other optimization levels gave worse results.

Results:

For ptick10 (10 uS), display reports...PTick = 800 ticks against measured delay tick = 1216 ticks (or 1.52 times greater).

For ptick20 (20uS), display reports ...PTick = 1600 ticks against measured delay tick = 2016 ticks (1.26 times greater).

In summary, I am surprised that a processor of the Prop's caliber can't get below a 20 uS pause. What am I doing wrong? Is it the overhead of C?

If this truly is the case, I will have to abandon the Prop for another processor. In my work, I have to manipulate bits and timing down to 100s of nS, sometimes smaller.

#include "simpletools.h"                // Include simpletools header

unsigned int us, ptick10, ptick20, ti, tf;

int main()                              // main function
{
  us = CLKFREQ/1000000;                // System clock ticks per 1 uS
  ptick10 = 10 * us;                   // Ticks in 10 uS
  ptick20 = 20 * us;                   // Ticks in 20 uS (swap into pause_ticks to test)
  ti = CNT;                            // Timestamp before the delay
  pause_ticks(ptick10);
  tf = CNT;                            // Timestamp after the delay
  print("PTick = %d\n", ptick10);
  print("delay tick = %d\n", tf-ti);
  return 0;
}
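
The two results above differ from the requested tick counts by the same amount, which points to a fixed call overhead rather than a scaling error. A minimal sketch of that arithmetic in plain C follows. The helper names are hypothetical, and the 80 MHz clock is an assumption inferred from the 800 ticks reported for 10 uS.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed overhead of a delay call: ticks measured minus ticks requested. */
uint32_t overhead_ticks(uint32_t measured, uint32_t requested)
{
    assert(measured >= requested);   /* a delay should never finish early */
    return measured - requested;
}

/* Convert system-clock ticks to nanoseconds for a given clock frequency. */
uint64_t ticks_to_ns(uint32_t ticks, uint32_t clkfreq_hz)
{
    assert(clkfreq_hz > 0);
    return (uint64_t)ticks * 1000000000ULL / clkfreq_hz;
}
```

With the numbers above, overhead_ticks(1216, 800) and overhead_ticks(2016, 1600) both give 416 ticks, and ticks_to_ns(416, 80000000) gives 5200 nS, so the delay itself appears accurate and roughly 5.2 uS is spent around it.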

Phil Pilgrim (PhiPi) · 2013-11-20 20:24

The problem with any kind of pause function is, "Where does the pause begin and end?" The answer is that it begins after the procedure call overhead and ends before the procedure return overhead. In between those two events, it's totally accurate.

The Propeller, with a timing granularity of 12.5 ns, can definitely do what you want; but don't expect it to happen in C, Spin, or any other high-level language. For that, you will have to take the time to learn Propeller assembly (PASM). It's nothing to be afraid of, BTW: PASM is one of the friendliest, most approachable assembly languages out there and well worth the time to master.

-Phil

BTL24 · 2013-11-20 20:41

Phil Pilgrim (PhiPi) wrote: »

The problem with any kind of pause function is, "Where does the pause begin and end?" The answer is that it begins after the procedure call overhead and ends before the procedure return overhead. In between those two events, it's totally accurate.

The Propeller, with a timing granularity of 12.5 ns, can definitely do what you want; but don't expect it to happen in C, Spin, or any other high-level language. For that, you will have to take the time to learn Propeller assembly (PASM). It's nothing to be afraid of, BTW: PASM is one of the friendliest, most approachable assembly languages out there and well worth the time to master.

-Phil

Phil,

Thanks for the quick response. I was just coming to that same conclusion. For the critical 10 uS and 20 uS timing I need, I thought of Spin, but I take it assembly would be better.

Can inline assembly be placed inside a C program? Or is it best to assemble outside, and then include the function later during the build?

Regards,
BTL24 (Brian)

jmg · 2013-11-20 20:56

BTL24 wrote: »

In summary, I am surprised that a processor of the Prop's caliber can't get below a 20 uS pause. What am I doing wrong? Is it the overhead of C?

If this truly is the case, I will have to abandon the Prop for another processor. In my work, I have to manipulate bits and timing down to 100s of nS, sometimes smaller.

On any uC sub-us is really going to need Assembler, and care.

Can you clarify "manipulate bits and timing " - what granularity in edges do you need, what is Min and max time between edges, is this output only, or do you need to also capture/time edges arriving ?

jazzed · 2013-11-20 21:13

BTL24 wrote: »

Can inline assembly be placed inside a C program?

Yes. There are several examples. Here is one that mixes C+ASM. It could be faster without the C code.

Care to guess what it does?

__attribute__((fcache)) static void _outbyte(int bitcycles, int txmask, int value)
{
  int j = 10;
  int waitcycles;

  waitcycles = CNT + bitcycles;
  while(j-- > 0) {
    __asm__ volatile("waitcnt %[_waitcycles], %[_bitcycles]"
    : [_waitcycles] "+r" (waitcycles)
    : [_bitcycles] "r" (bitcycles));

    __asm__ volatile("shr %[_value],#1 wc \n\t"
    "muxc outa, %[_mask]"
    : [_value] "+r" (value)
    : [_mask] "r" (txmask));
  }
}

For guidance on GCC in-line ASM look to this: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Note that propeller-gcc in-line ASM is like PASM with regard to source and destination.

SRLM · 2013-11-20 21:42

BTL24 wrote: »

Can inline assembly be placed inside a C program? Or is it best to assemble outside, and then include the function later during the build?

Well, by definition, "inline assembly" is inline and not linked in. Here's a tutorial on that:

https://sites.google.com/site/propellergcc/documentation/tutorials/inline-asm-basics

And this I2C driver uses inline assembly for the bit banging: https://github.com/libpropeller/libpropeller/blob/master/libpropeller/i2c/i2c_base.h

ersmith · 2013-11-21 04:46

BTL24 wrote: »

Is there a better way to get 10 uS pulses than using pause() command?

Possibly not with the pause() function, but the waitcnt() function can give pauses down to about 0.2us (or 0.05us if you're inside an FCACHE loop or otherwise in COG code rather than LMM mode).
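
To relate those figures to CNT ticks, here is a small conversion sketch in plain C. The function name is hypothetical and the 80 MHz clock used in the note below is an assumption; on real hardware CLKFREQ would be passed in.

```c
#include <assert.h>
#include <stdint.h>

/* Convert nanoseconds to whole system-clock ticks, rounding down. */
uint32_t ns_to_ticks(uint32_t clkfreq_hz, uint32_t ns)
{
    uint64_t ticks = (uint64_t)clkfreq_hz * ns / 1000000000ULL;
    assert(ticks <= UINT32_MAX);     /* result must fit a 32-bit counter */
    return (uint32_t)ticks;
}
```

At 80 MHz, 0.2us works out to 16 ticks and 0.05us to 4 ticks, while the 10 uS pulse asked about is 800 ticks.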

ersmith · 2013-11-21 05:05

Phil Pilgrim (PhiPi) wrote: »

The Propeller, with a timing granularity of 12.5 ns, can definitely do what you want; but don't expect it to happen in C, Spin, or any other high-level language.

It certainly can happen in C, although to get the maximum 12.5ns granularity will take some care, especially in LMM mode.

For example:

#include <stdio.h>
#include <propeller.h>

//
// wait for N cycles
// returns the actual number of cycles waited
//

__attribute__((fcache))
int fastwait(int N)
{
    int ti, tf;
    ti = CNT;
    waitcnt(ti+N);
    tf = CNT;
    return tf - ti;
}

int
main()
{
    unsigned int us, delay;
    us = CLKFREQ / 1000000; // aim for 1 uS time
    printf("timing test:\n");
    delay = fastwait(us);
    printf("us = %d actual delay = %d\n", us, delay);
    return 0;
}

Produces:

[ Entering terminal mode. Type ESC or Control-C to exit. ]
timing test:
us = 80 actual delay = 85

4 of the extra 5 cycles are due to the fetch of CNT after the waitcnt; I'm not sure where the other cycle came from, but it's not C overhead (you'll get the same result with PASM).

But Phil is right, PASM is a very friendly assembly language, and probably worth learning!

Eric

BTL24 · 2013-11-21 06:18

jmg wrote: »

On any uC sub-us is really going to need Assembler, and care.

Can you clarify "manipulate bits and timing " - what granularity in edges do you need, what is Min and max time between edges, is this output only, or do you need to also capture/time edges arriving ?

Agreed on assembler and care. I measured the Prop C high() and low() functions: they each take 12 uS to execute, just by themselves! That, added to the other overhead of calling functions and such, means I will never hit the pulse times I need with this approach. It now makes sense what is happening in the Prop with C code. Assembly appears to be the answer.

My Goal:

What I am trying to do is create a serial stream of pulses whose high and low times are precise to the microsecond. This will be used to send a signal to a string of GE Color Effects (GECE) smart pixel lights that reside on a 3-wire system: +5V, Gnd, and signal (serial stream). After an enumeration process (assigning each bulb an address), the color and intensity can be managed per bulb, as each is addressable.

FYI, I have done this very reliably with an SX processor for years using simple HIGH and LOW commands with the PAUSEUS() function. By bit banging the signal line with pulses that meet the unique format and timing, I was able to talk to the bulbs successfully. Obviously the overhead with the SX is less than with Prop C code. I am now trying to convert that code to the Prop.

GECE Serial Signal Format is as follows...

Serial stream consists of :
Start Pulse is high pulse for 10 uS
Address of bulb - 6 bits
Intensity - 8 bits
Blue Color - 4 bits
Green Color - 4 bits
Red Color - 4 bits
Idle time is 20 uS at a logical level of low

Special Timing/Structure of "1" and "0"...
Logical 0 is 10 uS low followed by 20 uS of high
Logical 1 is 20 uS low followed by 10 uS of high
Data is sent MSB first, LSB last
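
The format above can be sketched as plain C helpers that pack one frame and report the per-bit timing. This is only an illustration: the function names are hypothetical, and it assumes the fields are transmitted in the order listed, with the address in the most significant bits.

```c
#include <assert.h>
#include <stdint.h>

#define GECE_FRAME_BITS 26   /* 6 address + 8 intensity + 3 x 4 color bits */

/* Pack the fields into one word; bit 25 is sent first, bit 0 last. */
uint32_t gece_pack(uint32_t addr, uint32_t intensity,
                   uint32_t blue, uint32_t green, uint32_t red)
{
    assert(addr < 64 && intensity < 256);
    assert(blue < 16 && green < 16 && red < 16);
    return (addr << 20) | (intensity << 12) | (blue << 8) | (green << 4) | red;
}

/* Low time in uS for the bit at position 'bit': 20 for a 1, 10 for a 0. */
uint32_t gece_low_us(uint32_t frame, int bit)
{
    assert(bit >= 0 && bit < GECE_FRAME_BITS);
    return ((frame >> bit) & 1u) ? 20 : 10;
}

/* High time in uS: every bit cell is 30 uS, so high = 30 - low. */
uint32_t gece_high_us(uint32_t frame, int bit)
{
    return 30 - gece_low_us(frame, bit);
}

/* Total frame time in uS: 10 start + 26 bits x 30 + 20 idle. */
uint32_t gece_frame_us(void)
{
    return 10 + GECE_FRAME_BITS * 30 + 20;
}
```

A sender would walk bit positions 25 down to 0, holding the line low for gece_low_us() and then high for gece_high_us(). One complete frame takes 810 uS with these timings.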

I also may have to buffer the Prop's 3.3 V output up to 5 V levels. I am not sure the GECE bulbs will recognize that low a level, but that is another matter that I can easily solve.

Regards,
BTL24 (Brian)

BTL24 · 2013-11-21 06:20

jazzed wrote: »
Yes. There are several examples. Here is one that mixes C+ASM. It could be faster without the C code.

Care to guess what it does?
__attribute__((fcache)) static void _outbyte(int bitcycles, int txmask, int value)
{
  int j = 10;
  int waitcycles;

  waitcycles = CNT + bitcycles;
  while(j-- > 0) {
    __asm__ volatile("waitcnt %[_waitcycles], %[_bitcycles]"
    : [_waitcycles] "+r" (waitcycles)
    : [_bitcycles] "r" (bitcycles));

    __asm__ volatile("shr %[_value],#1 wc \n\t"
    "muxc outa, %[_mask]"
    : [_value] "+r" (value)
    : [_mask] "r" (txmask));
  }
}
For guidance on GCC in-line ASM look to this: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Note that propeller-gcc in-line ASM is like PASM with regard to source and destination.

Outstanding! It creates a delay in very fine resolution. AND I get to see how to inject assembly into C..

I need to better understand fcache. I think this will be the way to speed things up. (or run in own cog).

Thanks.

Regards.
BTL24 (Brian)

BTL24 · 2013-11-21 06:21

SRLM wrote: »

Well, by definition, "inline assembly" is inline and not linked in. Here's a tutorial on that:

https://sites.google.com/site/propellergcc/documentation/tutorials/inline-asm-basics

And this I2C driver uses inline assembly for the bit banging: https://github.com/libpropeller/libpropeller/blob/master/libpropeller/i2c/i2c_base.h

Thanks for the links...

Brian

BTL24 · 2013-11-21 06:23

ersmith wrote: »
It certainly can happen in C, although to get the maximum 12.5ns granularity will take some care, especially in LMM mode.

For example:
#include <stdio.h>
#include <propeller.h>

//
// wait for N cycles
// returns the actual number of cycles waited
//

__attribute__((fcache))
int fastwait(int N)
{
    int ti, tf;
    ti = CNT;
    waitcnt(ti+N);
    tf = CNT;
    return tf - ti;
}

int
main()
{
    unsigned int us, delay;
    us = CLKFREQ / 1000000; // aim for 1 uS time
    printf("timing test:\n");
    delay = fastwait(us);
    printf("us = %d actual delay = %d\n", us, delay);
    return 0;
}
Produces:
[ Entering terminal mode. Type ESC or Control-C to exit. ]
timing test:
us = 80 actual delay = 85
4 of the extra 5 cycles are due to the fetch of CNT after the waitcnt; I'm not sure where the other cycle came from, but it's not C overhead (you'll get the same result with PASM).

But Phil is right, PASM is a very friendly assembly language, and probably worth learning!

Eric

Thank you for the example. I learn the fastest with examples like yours. fcache apparently is critical to getting things moving fast with the prop.

Regards,
Brian

jmg · 2013-11-21 14:40

ersmith wrote: »

4 of the extra 5 cycles are due to the fetch of CNT after the waitcnt; I'm not sure where the other cycle came from, but it's not C overhead (you'll get the same result with PASM).

The unexpected 5th cycle has to come from somewhere.

What happens as you vary the delay value ? (there will be some min where it breaks )

Also, how does the call overhead change ? - user code often does the pin-IO stuff either side of a customdelay() call, but for best precision, (least SW jitter), it is probably better to do the pin-IO inside the function, right next to the waitcnt() ?

kuroneko · 2013-11-21 15:08

jmg wrote: »

The unexpected 5th cycle has to come from somewhere.

There is no unexpected cycle. The code comes down to the following sequence:

mov     one, cnt
add     one, #delay
waitcnt one, #0
mov     two, cnt

The difference between two and one is the time taken for add/waitcnt + 4. IOW it gives us the time for the first 3 insns. Which breaks down to 9{14} + 71. Nine cycles are required to stop waitcnt from blocking for a full period which gives us a base delay of 14 cycles. With the remainder of 71 you'll get 85.

jmg · 2013-11-21 16:17

kuroneko wrote: »

The difference between two and one is the time taken for add/waitcnt + 4. IOW it gives us the time for the first 3 insns. Which breaks down to 9{14} + 71. Nine cycles are required to stop waitcnt from blocking for a full period which gives us a base delay of 14 cycles. With the remainder of 71 you'll get 85.

I do not quite follow this. The P1 data says MOV/ADD are 4 cycles, and that leaves WAITCNT, which says 6+N, but it auto-adjusts once it has got going, so I can see 4 extras come from one ASM line. I can get 14 min as 4+4+6.

Maybe WaitCNT effectively needs one clock to complete/exit ? (which would make sense to me)

kuroneko · 2013-11-21 16:41

jmg wrote: »

I do not quite follow this. The P1 data says MOV/ADD are 4 cycles, and that leaves WAITCNT, which says 6+N, but it auto-adjusts once it has got going, so I can see 4 extras come from one ASM line. I can get 14 min as 4+4+6.

In order to achieve 14 cycle minimum you need to add 9 to cnt (being fetched first), otherwise you wait for rollover (cnt is sampled during the 3rd insn cycle, waitcnt does the first match during its 4th).

mov     cnt, cnt
add     cnt, #9{14} + N
waitcnt cnt, #0

With N equal 0 you get 14 cycles. With 80 passed in, N is 71 (14 + 71 = 85).

[Attached image: cycle timing diagram for the sequence above]
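
The breakdown above can be restated as a small model in plain C: the mov/add/waitcnt/mov sequence reports a 14-cycle minimum, of which 9 cycles come out of the requested delay. The names are hypothetical, and the model only covers delays of 9 cycles or more, since smaller values wait for rollover.

```c
#include <assert.h>
#include <stdint.h>

#define WAITCNT_MIN_CYCLES 14u  /* shortest measurable mov/add/waitcnt/mov span */
#define WAITCNT_MIN_DELAY   9u  /* smallest offset that avoids the rollover wait */

/* Cycles reported by tf - ti for a requested delay, per the breakdown above. */
uint32_t measured_cycles(uint32_t delay)
{
    assert(delay >= WAITCNT_MIN_DELAY);  /* smaller values wait for CNT rollover */
    return WAITCNT_MIN_CYCLES + (delay - WAITCNT_MIN_DELAY);
}
```

For the 80-cycle request in the earlier test, measured_cycles(80) gives 85, matching the reported output.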

Mike Green · 2013-11-21 16:55

In case you get confused about Kuroneko's example and its use of the "cnt" register, remember that the cog memory is actually 512 longs and each special function register (SFR) like CNT has a corresponding "shadow ram" location in the cog memory. For all read / write SFRs, you can't really access the shadow ram since all accesses refer to the SFR. For read-only SFRs like CNT, the SFR is only accessible as a source field. Any use of that location as a destination field actually accesses the shadow ram memory and Kuroneko is making use of that shadow ram location for temporary storage. Other, stranger, uses of shadow ram can occur. For example, read / write accesses write to shadow ram as well as the hardware register and instruction fetches are done from shadow ram rather than the SFRs.

jmg · 2013-11-21 17:21

kuroneko wrote: »

With N equal 0 you get 14 cycles. With 80 passed in, N is 71 (14 + 71 = 85).

- but the user asked for 80, so the number added is 80. [add one, #delay]

Missing from the cycles diagram are the exit cases, and exactly where in the 6 cycles the added wait-clocks get packed.
That detail determines where cnt is when the next opcode starts, which is the info I am chasing.

The ideal code is where that 5 cycles is corrected for, inside the delay routine.

jmg · 2013-11-21 17:24

Mike Green wrote: »

In case you get confused about Kuroneko's example and its use of the "cnt" register, remember that the cog memory is actually 512 longs and each special function register (SFR) like CNT has a corresponding "shadow ram" location in the cog memory.

Yes, I prefer the format of #16, especially for examples/tutorials.

kuroneko · 2013-11-21 17:36

jmg wrote: »

- but the user asked for 80, so the number added is 80. [add one, #delay]

Yes, did I say anything different? The point is that the insn sequence steals 9 out of those 80 to maintain its 14 cycle minimum. Which only leaves 71 extra (above minimum) cycles. Are we talking about the same thing here?

jmg wrote: »

Missing from the cycles diagram are the exit cases, and exactly where in the 6 cycles the added wait-clocks get packed.
That detail determines where cnt is when the next opcode starts, which is the info I am chasing.

Can you elaborate? Not sure I understand the question.

jmg · 2013-11-21 17:48

kuroneko wrote: »

Can you elaborate? Not sure I understand the question.

Your diagram shows S D w m . R for the 6 cycles min of WAIT, but does not show where that stretches.
ie does it do [S D w m . . . . . R] or [S D w m . R R R R R ] or ?

Whether it gains 1 count on entry, or exit, is less important than correcting for the +5 overall path.

kuroneko · 2013-11-21 17:51

jmg wrote: »

Your diagram shows S D w m . R for the 6 cycles min of WAIT, but does not show where that stretches.
ie does it do [S D w m . . . . . R] or [S D w m . R R R R R ] or ?

The match cycle repeats, i.e.

[S D w m m m m m m . R]

which is actually not active code but the cog sleeping until some h/w decides it can continue.

jmg · 2013-11-21 18:04

kuroneko wrote: »
The match cycle repeats, i.e.
[S D w m m m m m m . R]
which is actually not active code but the cog sleeping until some h/w decides it can continue.

Cool, so I make that, merging with your other code

[ S  D  w  m  m  m  m  m  m  .  R][ S  D  e  R]
          14  .. 77 78 79 80 81 82  83 84 85
                          ^^EV            ^^RV
EV is Exit-trigger value, and RV is read value.

Porting this to a PinAction function, like the OP seeks, using the above example would this be coded something like ?

__attribute__((fcache))
void fastPulse(int N)
{
    PinAction;
    waitcnt(CNT+N-5);  // 5 or now 9 ?
    PinAction;   // Pin edges are exactly N cycles apart, N >  ?
}

or maybe less compiler dependent is ?

__attribute__((fcache))
void fastPulse(int N)
{
    int ti;
    PinAction;
    ti = CNT;
    waitcnt(ti+N-9);
    PinAction;   // Pin edges are exactly N cycles apart, N >  ?
}

kuroneko · 2013-11-21 18:19

Whether it needs five or nine adjustment cycles really depends on the final order of insns. I'll have to play with the compiler a bit to see how the order is affected. But I'm sure that's now a minor issue.
