Speed test @ toggling a P2 pin, the race is on!

AndyProp · 2024-02-23 05:22

So there are a bunch of compilers and languages for the P2, from BASIC to C, but how does each fare in a pin-toggling speed test?

I suspect PASM will destroy them all, but will it?

Does anyone have a fast frequency counter and can test each toolchain toggling a pin, then record the results here?

Electrodude · 2024-02-23 05:52

@AndyProp said:

So there are a bunch of compilers and languages for the P2, from BASIC to C, but how does each fare in a pin-toggling speed test?

I suspect PASM will destroy them all, but will it?

Does anyone have a fast frequency counter and can test each toolchain toggling a pin, then record the results here?

Is using smartpins cheating? If not, then even the slowest P2 language can toggle a pin at the theoretical maximum rate of sysclk/2. The hard part of fast I/O on the P2 tends to be more a matter of getting the cog to dance in step with the smartpins rather than of just manually twiddling pins as fast as possible, which is nowhere near as fast.

AndyProp · 2024-02-23 05:59

Is using smartpins cheating?

I am thinking more of the compiled code than the direct ability of the P2.

evanh · 2024-02-23 06:35

That challenge is too easy for Flexspin. Flexspin compiles all its supported languages to the same sequence of instructions, which runs at the same speed as hand-crafted Pasm. That's because each language has primitives for the DRVx/FLTL instructions, e.g. Spin uses PINH/PINL/PINT/PINF.

The only question mark is how to trigger the REP instruction compiler optimisation.

Here's the Pasm for bit-bashing at a sysclock/4 pin frequency, assuming it's running in cogRAM (DRVNOT takes two clocks and REP adds no loop overhead, so a full high/low period is four clocks):

        rep     @.rend, #0    ' loop forever
        drvnot  #pinnum
.rend

evanh · 2024-02-23 06:39

An equivalent (same speed) would be:

        rep     @.rend, #0    ' loop forever
        drvh    #pinnum
        drvl    #pinnum
.rend
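
The sysclock/4 figure for both loops follows from simple cycle counting. Here is a rough host-side sketch in C (not P2 code), assuming each DRVx instruction takes two clocks from cogRAM and REP adds no loop overhead. The helper name `pin_freq_hz` and the 200 MHz sysclock used below are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: DRVH/DRVL/DRVNOT each take 2 clocks when run from cogRAM. */
#define CLOCKS_PER_INSTR 2u

/*
 * Hypothetical helper: pin frequency in Hz for a zero-overhead REP loop.
 * instrs_per_period = pin instructions executed per full high+low period:
 *   - DRVNOT loop:    2 (two passes, one toggle each)
 *   - DRVH/DRVL loop: 2 (one pass)
 */
uint32_t pin_freq_hz(uint32_t sysclk_hz, uint32_t instrs_per_period)
{
    assert(instrs_per_period > 0u);
    return sysclk_hz / (instrs_per_period * CLOCKS_PER_INSTR);
}
```

For example, `pin_freq_hz(200000000u, 2u)` gives 50000000, i.e. 50 MHz on a 200 MHz sysclock, for either loop.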

evanh · 2024-02-23 06:53

Yep, here's an example:

void  main(void)
{
    while(1)
    {
        _pinh(56);
        _pinl(56);
    }
}

And the resulting compiled .p2asm:

_main
' {
'     while(1)
    callpa  #(@LR__0003-@LR__0001)>>2,fcache_load_ptr_
LR__0001
    rep @LR__0004, #0
LR__0002
    drvh    #56
    drvl    #56
LR__0003
LR__0004

So that CALLPA even automatically drops the loop into cogRAM for max speed.

Christof Eb. · 2024-02-23 18:47

The really interesting question is: how can you split a task into 8 parts that run on 8 cogs in parallel? I have not yet seen a compiler that can do this.

Electrodude · 2024-02-23 19:59

@"Christof Eb." said:
The really interesting question is: how can you split a task into 8 parts that run on 8 cogs in parallel? I have not yet seen a compiler that can do this.

You haven't seen a compiler that can do this because, in the general case, it's mathematically impossible...

Nine women can't make a baby in one month.

RossH · 2024-02-23 22:15

@"Christof Eb." said:
The really interesting question is: how can you split a task into 8 parts that run on 8 cogs in parallel? I have not yet seen a compiler that can do this.

Catalina can.

Here is a trivial implementation of the Sieve of Eratosthenes, which Catalina speeds up by a factor of 4 by spreading the algorithm across multiple cogs (in general, the actual speed-up factor depends on both the algorithm and the number of available cogs):

/******************************************************************************
 *                                                                            *
 *                         The Sieve of Eratosthenes                          *
 *                                                                            *
 *      This is a "classic" version of the sieve program, augmented           *
 *      with the new Catalina Multi-processing pragmas, which will            *
 *      enable multi-processing if this program is compiled using the         *
 *      Catalina compiler, but ignored if it is compiled with another         *
 *      compiler. Multi-processing typically improves the program             *
 *      performance by 3 or 4 times (depending on the number of cogs          *
 *      available).                                                           *
 *                                                                            *
 * Commands to compile this program as a serial program on a P2 might be:     *
 *                                                                            *
 *    catalina -p2 sieve.c -lci -O5                                           *
 *                                                                            *
 * Commands to compile this program as a parallel program on a P2 might be:   *
 *                                                                            *
 *    catalina -p2 -Z sieve.c -lthreads -lci -O5                              *
 *                                                                            *
 ******************************************************************************/

#include <stdio.h>
#include <stdlib.h>

// define the size of the sieve (if not already defined):
#ifndef SIEVE_SIZE
#if defined(__P2__)||defined(__CATALINA_P2)
#define SIEVE_SIZE   400000
#else
#define SIEVE_SIZE   12000
#endif
#endif

unsigned char *primes = NULL;

#pragma propeller worker(unsigned long i) local(unsigned long j) stack(60)

// main : allocate and initialize the sieve, then eliminate all multiples
//        of primes, then print the time taken, and all the resulting primes. 
void main(void){

   unsigned long i, j;
   unsigned long k = 1;
   unsigned long count;

   // allocate a byte array of suitable size
   primes = malloc(SIEVE_SIZE);

   if (primes == NULL) {
      // cannot allocate array
      exit(1); 
   }

   // initialize sieve array to zero
   for (i = 0; i < SIEVE_SIZE; i++) {
      primes[i] = 0;
   }

   t_printf("starting ...\n");

   #pragma propeller start

   // remember starting time
   count = _cnt();

   // eliminate multiples of primes
   for (i = 2; i < SIEVE_SIZE/2; i++) {
      if (primes[i] == 0) {

     #pragma propeller begin 
         for (j = 2; i*j < SIEVE_SIZE; j++) {
            primes[i*j] = 1;
         }
     #pragma propeller end
      }
   }

   #pragma propeller wait

   // calculate time taken
   count = _cnt() - count;
   t_printf("... done - %ld clocks\n", count);

   t_printf("\npress a key to see results\n");
   k_wait();

   // print the resulting primes, starting from 2
   for (i = 2; i < SIEVE_SIZE; i++) {
      if (primes[i] == 0) {
         t_printf("prime(%d)= %d, ", k++, i);
      }
   }

   while(1);
}

In the Catalina 'Parallelizer' documentation, you will find another example - speeding up a Fast Fourier Transform algorithm. In that case, the speed improvement is about 2.5 times.

Ross.
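
One hedged way to read the 4x and 2.5x figures above is through Amdahl's law, which relates speedup to the fraction of run time that can be spread across workers. The sketch below is plain host-side C, not Catalina-specific; the function names are hypothetical, and the worker count of 8 used in the note afterwards is an assumption, since the thread does not say how many cogs did the work:

```c
#include <assert.h>

/* Amdahl's law: speedup on n workers when a fraction p of the run time
 * (0 <= p <= 1) can be spread evenly across them. */
double amdahl_speedup(double p, unsigned n)
{
    assert(n > 0u && p >= 0.0 && p <= 1.0);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Inverse of the above: the parallel fraction implied by an observed
 * speedup s on n workers (requires n > 1 and 1 <= s <= n). */
double implied_parallel_fraction(double s, unsigned n)
{
    assert(n > 1u && s >= 1.0 && s <= (double)n);
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / (double)n);
}
```

Under the assumption of 8 workers, a 4x speedup would imply that roughly 86% of the run time was parallelised, and a 2.5x speedup roughly 69%. The remaining serial part is what limits the gain from adding cogs.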

evanh · 2024-02-24 04:14

Ross,
Just been looking at Catalina. You've got something strange going on with the .TXT doc files. Sometimes when you've used the double quote character, your editor seems to be inserting something that isn't ASCII, and it doesn't simply convert to ASCII in my editor.
Eg: xcopy /e /i %LCCDIR%\demos %HOMEPATH%\demos\

EDIT: Ah, reply to this message to see it. Its code is 0x94.
EDIT2: Actually, might be just README_P2.TXT alone. I just happened to look at this one first.

RossH · 2024-02-24 04:46

@evanh said:
Ross,
Just been looking at Catalina. You've got something strange going on with the .TXT doc files. Sometimes when you've used the double quote character, your editor seems to be inserting something that isn't ASCII, and it doesn't simply convert to ASCII in my editor.
Eg: xcopy /e /i %LCCDIR%\demos %HOMEPATH%\demos\

EDIT: Ah, reply to this message to see it. Its code is 0x94.
EDIT2: Actually, might be just README_P2.TXT alone. I just happened to look at this one first.

Yes, I see what you mean. Some spurious binary characters have ended up in that file, with value 0x94. Only that file, it seems - probably from cutting and pasting characters from a terminal window.

Attached is the file with those characters removed.

Thanks,
Ross.

Oops: Forgot to add back in the double quotes - file updated again!

Christof Eb. · 2024-02-24 19:32

Thanks for the information!
Christof
