Speed test @ toggling a P2 pin, the race is on! — Parallax Forums

AndyProp Posts: 60
edited 2024-02-23 05:23 in PASM2/Spin2 (P2)

So there are a bunch of compilers and languages for the P2, from BASIC to C, but how do they each fare in a pin-toggling speed test?

I suspect PASM will destroy them all, but will it?

Does anyone have a fast frequency counter and could test each tool at toggling a pin, then record the results here?

Comments

  • @AndyProp said:

    So there are a bunch of compilers and languages for the P2, from BASIC to C, but how do they each fare in a pin-toggling speed test?

    I suspect PASM will destroy them all, but will it?

    Does anyone have a fast frequency counter and could test each tool at toggling a pin, then record the results here?

    Is using smartpins cheating? If not, then even the slowest P2 language can toggle a pin at the theoretical maximum rate of sysclk/2. The hard part of fast I/O on the P2 tends to be more a matter of getting the cog to dance in step with the smartpins rather than of just manually twiddling pins as fast as possible, which is nowhere near as fast.

  • Is using smartpins cheating?

    I am thinking more of the compiled code than the direct ability of the P2.

  • evanh Posts: 15,912
    edited 2024-02-23 06:37

    That challenge is too easy for Flexspin. Flexspin compiles all its supported languages to the same sequence of instructions, which runs at the same speed as hand-crafted Pasm. That's because each language has primitives for the DRVx/FLTL instructions, e.g. Spin uses PINH/PINL/PINT/PINF.

    The only question mark is how to trigger the REP instruction compiler optimisation.

    Here's the Pasm for bit-bashing a pin at sysclock/4, assuming it's running in cogRAM.

            rep     @.rend, #0    ' loop forever
            drvnot  #pinnum
    .rend
    
  • evanh Posts: 15,912

    An equivalent (same speed) would be:

            rep     @.rend, #0    ' loop forever
            drvh    #pinnum
            drvl    #pinnum
    .rend
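
The sysclock/4 figure for both loops follows from simple arithmetic, sketched below in host-side C. It assumes each DRVx instruction takes 2 clocks in cogRAM and that REP adds no per-iteration overhead; the function names and the 200 MHz example clock are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a DRVH/DRVL/DRVNOT instruction takes 2 clocks in cogRAM. */
#define CLOCKS_PER_INSTR 2u

/* Output frequency of a zero-overhead REP loop that needs
   instr_per_period instruction executions for one full high+low cycle.
   DRVNOT alone needs 2 executions per cycle; DRVH+DRVL is also 2. */
uint32_t rep_loop_freq_hz(uint32_t sysclk_hz, uint32_t instr_per_period)
{
    assert(instr_per_period > 0);
    return sysclk_hz / (instr_per_period * CLOCKS_PER_INSTR);
}

/* Smartpin ceiling mentioned earlier in the thread: sysclk/2. */
uint32_t smartpin_max_freq_hz(uint32_t sysclk_hz)
{
    return sysclk_hz / 2u;
}
```

With a 200 MHz system clock, rep_loop_freq_hz(200000000u, 2u) gives 50 MHz for either loop, while smartpin_max_freq_hz(200000000u) gives 100 MHz.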
    
  • evanh Posts: 15,912
    edited 2024-02-23 06:55

    Yep, here's an example:

    void  main(void)
    {
        while(1)
        {
            _pinh(56);
            _pinl(56);
        }
    }
    

    And the resulting compiled .p2asm:

    _main
    ' {
    '     while(1)
        callpa  #(@LR__0003-@LR__0001)>>2,fcache_load_ptr_
    LR__0001
        rep @LR__0004, #0
    LR__0002
        drvh    #56
        drvl    #56
    LR__0003
    LR__0004
    

    So that CALLPA even automatically drops the loop into cogRAM for max speed.

  • The really interesting question is: how can you split a task into 8 parts that run on 8 cogs in parallel? I have not yet seen a compiler that can do this.

  • @"Christof Eb." said:
    The really interesting question is: how can you split a task into 8 parts that run on 8 cogs in parallel? I have not yet seen a compiler that can do this.

    You haven't seen a compiler that can do this because, in the general case, it's mathematically impossible...

    Nine women can't make a baby in one month.
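
One common way to put a number on this limit is Amdahl's law, which is not mentioned in the thread but captures the same idea: if only a fraction p of the work can be spread over n workers, the best possible speed-up is 1 / ((1 - p) + p / n). A minimal sketch in C, with a hypothetical function name:

```c
#include <assert.h>

/* Amdahl's law: upper bound on speed-up when a fraction
   parallel_fraction (0.0 to 1.0) of the work is spread evenly
   over `workers` workers and the rest stays serial. */
double amdahl_speedup(double parallel_fraction, unsigned workers)
{
    assert(workers > 0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

For a task that is entirely serial (p = 0) the result is 1 no matter how many cogs are used; for a task that is half parallelisable, 8 cogs give at most about 1.78 times.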

  • RossH Posts: 5,462

    @"Christof Eb." said:
    The really interesting question is: how can you split a task into 8 parts that run on 8 cogs in parallel? I have not yet seen a compiler that can do this.

    Catalina can.

    Here is a trivial implementation of the Sieve of Eratosthenes, which Catalina speeds up by a factor of 4 by spreading the algorithm across multiple cogs (in general, the actual speed-up factor depends on both the algorithm and the number of available cogs):

    /******************************************************************************
     *                                                                            *
     *                         The Sieve of Eratosthenes                          *
     *                                                                            *
     *      This is a "classic" version of the sieve program, augmented           *
     *      with the new Catalina Multi-processing pragmas, which will            *
     *      enable multi-processing if this program is compiled using the         *
     *      Catalina compiler, but ignored if it is compiled with another         *
     *      compiler. Multi-processing typically improves the program             *
     *      performance by 3 or 4 times (depending on the number of cogs          *
     *      available).                                                           *
     *                                                                            *
     * Commands to compile this program as a serial program on a P2 might be:     *
     *                                                                            *
     *    catalina -p2 sieve.c -lci -O5                                 *
     *                                                                            *
     * Commands to compile this program as a parallel program on a P2 might be:   *
     *                                                                            *
     *    catalina -p2 -Z sieve.c -lthreads -lci -O5                    *
     *                                                                            *
     ******************************************************************************/
    
    #include <stdio.h>
    #include <stdlib.h>
    
    // define the size of the sieve (if not already defined):
    #ifndef SIEVE_SIZE
    #if defined(__P2__)||defined(__CATALINA_P2)
    #define SIEVE_SIZE   400000
    #else
    #define SIEVE_SIZE   12000
    #endif
    #endif
    
    unsigned char *primes = NULL;
    
    #pragma propeller worker(unsigned long i) local(unsigned long j) stack(60)
    
    // main : allocate and initialize the sieve, then eliminate all multiples
    //        of primes, then print the time taken, and all the resulting primes. 
    void main(void){
    
       unsigned long i, j;
       unsigned long k = 1;
       unsigned long count;
    
       // allocate a byte array of suitable size
       primes = malloc(SIEVE_SIZE);
    
       if (primes == NULL) {
          // cannot allocate array
          exit(1); 
       }
    
       // initialize sieve array to zero
       for (i = 0; i < SIEVE_SIZE; i++) {
          primes[i] = 0;
       }
    
       t_printf("starting ...\n");
    
       #pragma propeller start
    
       // remember starting time
       count = _cnt();
    
       // eliminate multiples of primes
       for (i = 2; i < SIEVE_SIZE/2; i++) {
          if (primes[i] == 0) {
    
         #pragma propeller begin 
             for (j = 2; i*j < SIEVE_SIZE; j++) {
                primes[i*j] = 1;
             }
         #pragma propeller end
          }
       }
    
       #pragma propeller wait
    
       // calculate time taken
       count = _cnt() - count;
       t_printf("... done - %ld clocks\n", count);
    
       t_printf("\npress a key to see results\n");
       k_wait();
    
       // print the resulting primes, starting from 2
       for (i = 2; i < SIEVE_SIZE; i++) {
          if (primes[i] == 0) {
             t_printf("prime(%d)= %d, ", k++, i);
          }
       }
    
       while(1);
    }
    
    

    In the Catalina 'Parallelizer' documentation, you will find another example - speeding up a Fast Fourier Transform algorithm. In that case, the speed improvement is about 2.5 times.

    Ross.
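
One property that makes the sieve above a good candidate for this treatment is that each inner loop only ever writes 1 into the array, so the order in which the inner loops run does not change the final contents. The host-side sketch below illustrates that with plain C and no Catalina pragmas; it is not how Catalina schedules its workers, and count_primes is a hypothetical helper. It runs the outer loop either ascending or descending and returns the number of primes found.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: sieve the range [0, size) and return how many
   primes it contains. A non-zero `descending` walks the outer loop from
   high to low to show that the visiting order does not affect the result. */
unsigned long count_primes(unsigned long size, int descending)
{
    unsigned long n, i, j, count = 0;
    unsigned long limit = size / 2 + 1;   /* outer loop covers 2..size/2 */
    unsigned char *composite;

    if (size < 3) {
        return 0;                         /* no primes below 2 */
    }
    composite = calloc(size, 1);          /* zero-initialised flags */
    assert(composite != NULL);

    for (n = 2; n < limit; n++) {
        /* ascending: i = n; descending: i runs from limit-1 down to 2 */
        i = descending ? (limit + 1 - n) : n;
        if (composite[i] == 0) {
            for (j = 2; i * j < size; j++) {
                composite[i * j] = 1;     /* idempotent write */
            }
        }
    }
    for (i = 2; i < size; i++) {
        if (composite[i] == 0) {
            count++;
        }
    }
    free(composite);
    return count;
}
```

Both count_primes(1000, 0) and count_primes(1000, 1) return 168. The descending pass does more redundant marking, which mirrors how parallel workers may repeat some work without affecting correctness.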

  • evanh Posts: 15,912
    edited 2024-02-24 04:27

    Ross,
    Just been looking at Catalina. You've got something strange going on with the .TXT doc files. Sometimes when you've used the double quote character, your editor seems to be inserting something that isn't ASCII. And it doesn't simply convert to ASCII in my editor.
    Eg: xcopy /e /i ”%LCCDIR%\demos” ”%HOMEPATH%\demos\”

    EDIT: Ah, reply to this message to see it. Its code is 0x94.
    EDIT2: Actually, might be just README_P2.TXT alone. I just happened to look at this one first.

  • RossH Posts: 5,462
    edited 2024-02-24 04:50

    @evanh said:
    Ross,
    Just been looking at Catalina. You've got something strange going on with the .TXT doc files. Sometimes when you've used the double quote character, your editor seems to be inserting something that isn't ASCII. And it doesn't simply convert to ASCII in my editor.
    Eg: xcopy /e /i ”%LCCDIR%\demos” ”%HOMEPATH%\demos\”

    EDIT: Ah, reply to this message to see it. Its code is 0x94.
    EDIT2: Actually, might be just README_P2.TXT alone. I just happened to look at this one first.

    Yes, I see what you mean. Some spurious binary characters have ended up in that file, with value 0x94. Only that file, it seems - probably from cutting and pasting characters from a terminal window.

    Attached is the file with those characters removed.

    Thanks,
    Ross.

    Oops: Forgot to add back in the double quotes - file updated again!

  • @RossH said:

    Catalina can.

    Thanks for the information!
    Christof
