Is a P2 Edge "Multi" Board feasible?

RossH · 2023-03-11 09:12

Okay! My simple Prop-to-Prop cables arrived this week, so I have connected a P2_EVAL to a P2_EDGE. Here is my initial test program: simple but reliable synchronous transfers over a 32-bit parallel bus between two or more Propellers at 32 MB/s ...

/*
 * Program to test how fast and reliably bus read/writes can be done using a 
 * simple synchronous parallel bus connecting two or more Propellers. 
 *
 * There can be only one sender cog at a time, but can be multiple receiver 
 * cogs. The intention is to start one sender cog on one Propeller, and 
 * multiple receiver cogs on the other Propellers. This program supports
 * four receivers on a single Propeller.
 *
 * The Propellers must be connected pin to pin on pins 00 .. 31 (i.e. port A). 
 * If you have more than two Propellers connected, you could start additional 
 * receiver cogs on the other Propellers.
 *
 * This is a synchronous transfer, so how much data can be transferred depends
 * on how closely the respective Propeller clocks are synchronized. The crystal 
 * accuracy is typically +/-0.5 PPM. At 25 clocks per long, this means the 
 * clocks may be out of sync by up to one clock after 40,000 longs. A suitable 
 * maximum size for a single synchronous transfer might therefore be 20,000 
 * longs, which is why this value is used by this program. Larger transfers 
 * can be performed by doing multiple smaller transfers (which is the point 
 * of this test program!).
 *
 * The program uses P2 NATIVE PASM, so it must be compiled in P2 NATIVE mode
 * (which is the default mode for the Propeller 2). 
 *
 * To maximize available cogs, add -C NO_MOUSE and -C NO_FLOAT (and -C SIMPLE
 * if using a serial HMI). However in this simple-minded program, buffer space
 * is likely to be the limiting factor, not cogs.
 *
 *  To build as a sender, add -C SENDER
 *
 *  To build as a receiver, add -C RECEIVER
 *
 * For example, compile with a command like:
 *
 *    catalina -p2 -lci p2_bus.c -o sender -C NO_MOUSE -C NO_FLOAT -C SENDER
 * or
 *    catalina -p2 -lci p2_bus.c -o receiver -C NO_MOUSE -C NO_FLOAT -C RECEIVER
 *
 * Then load and execute with commands like:
 *
 *    payload sender -PN -i
 *    payload receiver -PM -i
 *
 * where N & M are the Propeller ports to use for the sender and receiver,
 * respectively (add more 'receiver' commands if there are more than two
 * Propellers on the bus).
 */

#if !defined (__CATALINA_SENDER) && !defined(__CATALINA_RECEIVER)
#error EITHER SENDER OR RECEIVER MUST BE DEFINED!
#endif

#include <stdio.h>
#include <stdlib.h>
#include <catalina_cog.h>
#include <catalina_plugin.h>

#define XFER_DELAY 7             // clocks per long is 18+this (7 works!)

#define NUM_LONGS  20000         // number of longs to transfer (20000 works!)

#define STACK_SIZE 500           // size of cog stack (stdio requires 500)

#define TOTAL_ERRORS_ONLY        // print only total errors, not details

static unsigned long start = 0;  // clock count used to synchronize cogs
static int lock = 0;             // lock to protect I/O

static unsigned long *send_buff; // data to be sent
static unsigned long *rcv1_buff; // data received (1)
static unsigned long *rcv2_buff; // data received (2)
static unsigned long *rcv3_buff; // data received (3)
static unsigned long *rcv4_buff; // data received (4)

/*
 * sync - synchronize multiple cogs to start on a specific clock count
 *
 *    'start' should be set to a clock count some time in the 
 *            future - e.g. _cnt() + _clockfreq() for one second
 */
int sync(unsigned long start) {
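   return PASM (
      " getct  r0\n"       // r0 = current clock count
      " sub    r2, r0\n"   // r2 = clocks remaining until 'start'
      " waitx  r2\n"       // wait (about) that many clocks
      " getct  r0\n"       // return the clock count on waking
   );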
}

/*
 * send - pasm code to write a number of longs to the 32 bit bus
 *        Note: sending starts immediately, and then sends a new long
 *              every 'time'+18 ticks.
 *        Note: loads the code into LUT RAM and executes it there.
 *
 *    'time' (passed in r4) is the time between writes in clocks (+18)
 *    'buff' (passed in r3) is an array holding longs to send
 *    'size' (passed in r2) is the size of the array
 */
int send(int time, void *buff, int size) {
   return PASM (
      // set pins 00 .. 31 as outputs
      "        mov     outa, #0\n"
      "        or      dira,##$FFFFFFFF\n"

      // load LUT RAM:
      "        setq2   #(send_end - send_start - 1)\n"
      "        rdlong  0, ##@send_start\n"

      // jump to code in LUT RAM:
      "        jmp     #send_start\n" 

      // code to be executed in LUT RAM:
      "        org     $200\n"
      "send_start\n"
      "        getct   r0\n"             // LUT: 2 (clocks)
      "\n"
      "        rep     #7, r2\n"         // LUT: 2
      "        addct1  r0, #18\n"        // LUT: 2
      "        rdlong  r1, r3\n"         // LUT: 9 .. 16
      "        waitct1\n"                // LUT: 2
      "        add     r3, #4\n"         // LUT: 2
      "        mov     outa, r1\n"       // LUT: 2
      "        addct1  r0, r4\n"         // LUT: 2
      "        waitct1\n"                // LUT: 2
      "\n"
      "        addct1  r0, r4\n"
      "        addct1  r0, #18\n"
      "        waitct1\n"
      "        mov     outa, #0\n"
      "        getct   r0\n"
      "        jmp     #send_cont\n"
      "send_end\n" 

      // resume Hub Execution:
      "       orgh\n"
      "send_cont\n" 
   );
}

/*
 * sender - send an array of longs to one or more receivers
 *
 *    'buff' is an array of NUM_LONGS longs.
 */
void sender(void *buff) {
   unsigned int started, stopped, total;
   int me = _cogid();

   started = _cnt();
   stopped = send(XFER_DELAY, buff, NUM_LONGS);
   total   = stopped - started;
   ACQUIRE(lock);
   printf("send (cog %d) took %u clocks (%u per long)\n\n", 
           me, total, total/NUM_LONGS);
   RELEASE(lock);
   while(1); // don't exit
}

/*
 * recv - pasm code to read a number of longs from the 32 bit bus. 
 *        Note: receiving starts 'time' clock ticks after any non-zero
 *              value is detected on the bus, and then reads a new long 
 *              every 'time'+18 ticks.
 *        Note: loads the code into LUT RAM and executes it there.
 *
 *    'time' (passed in r4) is the time between reads in clocks (+18)
 *    'buff' (passed in r3) is an array to hold longs received
 *    'size' (passed in r2) is the size of the array
 */
int recv(int time, void *buff, int size) {
   return PASM (
      // set pins 00 .. 31 as inputs
      "        andn    dira,##$FFFFFFFF\n"

      // load LUT RAM:
      "        setq2   #(recv_end - recv_start - 1)\n"
      "        rdlong  0, ##@recv_start\n"

      // jump to code in LUT RAM:
      "        jmp     #recv_start\n"

      // code to be executed in LUT RAM:
      "        org     $200\n"
      "recv_start\n"
      "        mov     r1, ina\n"        // LUT: 2 (clocks)
      "        cmp     r1, #0 wz\n"      // LUT: 2
      " if_z   jmp     #recv_start\n"    // LUT: 4
      "        getct   r0\n"             // LUT: 2 
      "\n"
      "        rep     #7, r2\n"         // LUT: 2
      "        addct1  r0, #18\n"        // LUT: 2
      "        wrlong  r1, r3\n"         // LUT: 3 .. 10
      "        waitct1\n"                // LUT: 2
      "        add     r3, #4\n"         // LUT: 2
      "        addct1  r0, r4\n"         // LUT: 2
      "        waitct1\n"                // LUT: 2
      "        mov     r1, ina\n"        // LUT: 2 
      "\n"
      "        getct   r0\n"
      "        jmp     #recv_cont\n"
      "recv_end\n"

      // resume Hub Execution:
      "        orgh\n"
      "recv_cont\n"
   );
}

/*
 * receiver - receive an array of longs from the 32 bit bus
 *
 *    'buff' is an array of NUM_LONGS longs.
 */
void receiver(void *buff) {
   unsigned int started, stopped;
   int me = _cogid();

   started = sync(start);
   stopped = recv(XFER_DELAY, buff, NUM_LONGS);
   ACQUIRE(lock);
   printf("recv (cog %d) started at clock 0x%08x\n", me, started);
   RELEASE(lock);
   while(1); // don't exit
}


void main(void) {
   unsigned long i;
   long send_stack[STACK_SIZE];
   long rcv1_stack[STACK_SIZE];
   long rcv2_stack[STACK_SIZE];
   long rcv3_stack[STACK_SIZE];
   long rcv4_stack[STACK_SIZE];
   int send_cog;
   int rcv1_cog;
   int rcv2_cog;
   int rcv3_cog;
   int rcv4_cog;
   int errors = 0;
   int transfers = 0;

   // assign a lock to be used to avoid plugin contention
   lock = _locknew();

   // give the vt100 emulator (if used) a chance to start
#ifdef __CATALINA_VT100
   _waitms(500);
#endif

   // allocate the arrays (we allocate them all whether we are a sender
   // or a receiver)
   send_buff = malloc(NUM_LONGS*4);
   rcv1_buff = malloc(NUM_LONGS*4);
   rcv2_buff = malloc(NUM_LONGS*4);
   rcv3_buff = malloc(NUM_LONGS*4);
   rcv4_buff = malloc(NUM_LONGS*4);

   // initialize the arrays
   for (i = 0; i < NUM_LONGS; i++) {
      send_buff[i] = i+1; // can be anything, but must be non-zero
      rcv1_buff[i] = 0;
      rcv2_buff[i] = 0;
      rcv3_buff[i] = 0;
      rcv4_buff[i] = 0;
   }

#ifdef __CATALINA_SENDER

   // set all bus pins to zero
   PASM(
      "       mov     outa, #0\n"
      "       or      dira,##$FFFFFFFF\n"
   );

   // start ONE sender
   k_clear();
   ACQUIRE(lock);
   printf("SENDER (Clock = %lu Hz)\n\n", _clockfreq());
   printf("Start the receiver cogs in the Receiver program,\n");
   printf("then press any key to start sender cog\n");
   RELEASE(lock);
   k_wait();

   // keep transferring forever
   while (1) {
      ACQUIRE(lock);
      printf("\nStarting sender cog ...\n\n");
      RELEASE(lock);
      send_cog = _cogstart_C(&sender, send_buff, send_stack, STACK_SIZE);

      // give the sender a chance to send (3 seconds is generous!)
      _waitms(3000);
      ACQUIRE(lock);
      printf("... done\n");
      RELEASE(lock);

      // cancel the sender (we restart it again for each transfer)
      _cogstop(send_cog);

      // update and print statistics
      transfers++;
      ACQUIRE(lock);
      printf("\nTotal %d transfers\n",transfers); 
      RELEASE(lock);
   }

#endif

#ifdef __CATALINA_RECEIVER

   ACQUIRE(lock);
   printf("RECEIVER (Clock = %lu Hz)\n\n", _clockfreq());
   printf("Press any key to start receiver cogs, then start the sender\n");
   printf("cog in the sender program\n");
   RELEASE(lock);
   k_wait();
   ACQUIRE(lock);
   printf("Starting receiver cogs ...\n\n");
   RELEASE(lock);

   while (1) {
      // set a start time for the receiver cogs to use in the sync function
      start = _cnt() + _clockfreq(); // set start time for +1 seconds

      // start FOUR receiver cogs
      rcv1_cog = _cogstart_C(&receiver, rcv1_buff, rcv1_stack, STACK_SIZE);
      rcv2_cog = _cogstart_C(&receiver, rcv2_buff, rcv2_stack, STACK_SIZE);
      rcv3_cog = _cogstart_C(&receiver, rcv3_buff, rcv3_stack, STACK_SIZE);
      rcv4_cog = _cogstart_C(&receiver, rcv4_buff, rcv4_stack, STACK_SIZE);
      ACQUIRE(lock);
      printf("... done\n");
      RELEASE(lock);

      // give receiver a chance to receive
      _waitms(1000);

      // wait until all receivers complete
      while (
          (rcv1_buff[NUM_LONGS - 1] == 0)
       || (rcv2_buff[NUM_LONGS - 1] == 0)
       || (rcv3_buff[NUM_LONGS - 1] == 0)
       || (rcv4_buff[NUM_LONGS - 1] == 0)
      ) {
          _waitms(1000);
      }

      // terminate the receiver cogs (we restart them again for each transfer)
      _cogstop(rcv1_cog);
      _cogstop(rcv2_cog);
      _cogstop(rcv3_cog);
      _cogstop(rcv4_cog);

      // check the results
      ACQUIRE(lock);
      printf("\nChecking data ...\n");
      for (i = 0; i < NUM_LONGS; i++) {
         // check receiver 1 got the correct data
         if (send_buff[i] != rcv1_buff[i]) {
            printf("send[%3lu]=0x%08lX != rcv1[%3lu]=0x%08lX\n",
                   i, send_buff[i], i, rcv1_buff[i]);
            _waitms(5);
            errors++;
#ifdef TOTAL_ERRORS_ONLY
            // if only counting total errors, one mismatch per transfer
            // is enough - stop checking rather than report every one
            break;
#endif
         }
         // check receiver 2 got the correct data
         if (send_buff[i] != rcv2_buff[i]) {
            printf("send[%3lu]=0x%08lX != rcv2[%3lu]=0x%08lX\n",
                   i, send_buff[i], i, rcv2_buff[i]);
            _waitms(5);
            errors++;
#ifdef TOTAL_ERRORS_ONLY
            // if only counting total errors, one mismatch per transfer
            // is enough - stop checking rather than report every one
            break;
#endif
         }
         // check receiver 3 got the correct data
         if (send_buff[i] != rcv3_buff[i]) {
            printf("send[%3lu]=0x%08lX != rcv3[%3lu]=0x%08lX\n",
                   i, send_buff[i], i, rcv3_buff[i]);
            _waitms(5);
            errors++;
#ifdef TOTAL_ERRORS_ONLY
            // if only counting total errors, one mismatch per transfer
            // is enough - stop checking rather than report every one
            break;
#endif
         }
         // check receiver 4 got the correct data
         if (send_buff[i] != rcv4_buff[i]) {
            printf("send[%3lu]=0x%08lX != rcv4[%3lu]=0x%08lX\n",
                   i, send_buff[i], i, rcv4_buff[i]);
            _waitms(5);
            errors++;
#ifdef TOTAL_ERRORS_ONLY
            // if only counting total errors, one mismatch per transfer
            // is enough - stop checking rather than report every one
            break;
#endif
         }
      }

      // update and print statistics
      transfers++;
      printf("Total errors = %d (from %d transfers)\n", errors, transfers); 
      RELEASE(lock);

      // re-initialize the arrays for the next transfer
      for (i = 0; i < NUM_LONGS; i++) {
         rcv1_buff[i] = 0;
         rcv2_buff[i] = 0;
         rcv3_buff[i] = 0;
         rcv4_buff[i] = 0;
      }
   }

#endif

}

More to come!

RossH · 2023-03-12 02:07

I've updated my "Propeller2Propeller" bus test program (now called p2p.c) so that a Propeller can be a 'sender', a 'receiver', or a 'transceiver' that alternates between sending and receiving (which makes it a more realistic test).

Also, I had some failures when transferring synchronous blocks of 20,000 longs, so I have wound it back to 10,000 longs at a time for the moment. I think this is due to slight differences between the Propeller clocks. Larger transfers will need to be done in multiple synchronous blocks anyway, but eventually I might add some code to auto-detect the maximum block size so the user doesn't need to configure it.
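
To put rough numbers on the clock-drift limit, here is a minimal sketch of the arithmetic. The helper name `longs_before_slip` and the relative-error figures are illustrative assumptions, not measurements from these boards:

```c
#include <assert.h>

/*
 * Hypothetical helper: how many longs can be transferred before two
 * free-running clocks drift apart by one full clock cycle.
 *
 *   rel_ppm         worst-case relative frequency error between the two
 *                   clocks, in parts per million (assumed, not measured)
 *   clocks_per_long clocks per transferred long (18 + XFER_DELAY in the
 *                   test program, i.e. 25)
 */
unsigned long longs_before_slip(unsigned long rel_ppm,
                                unsigned long clocks_per_long) {
    assert(rel_ppm > 0 && clocks_per_long > 0);
    /* one clock of drift accumulates every 1,000,000 / rel_ppm clocks */
    return (1000000UL / rel_ppm) / clocks_per_long;
}
```

With two crystals each at +/-0.5 PPM (1 PPM worst-case relative error) and 25 clocks per long, this gives 40,000 longs, the same figure as the estimate in the program's header comment. A relative error of 4 PPM would cut that to 10,000 longs, so the usable block size is quite sensitive to how well matched the crystals really are.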

The next step is to add a higher level protocol to allow multiple Propellers to share the bus and send and receive without bus contention.

Also, I intend to make it configurable whether the P2P bus is 8, 16 or 32 bits wide.

Ross.

evanh · 2023-03-12 04:53

PASM(" andn dira, ##$FFFFFFFF\n");

can be PASM(" and dira, #0\n");
or PASM(" mov dira, #0\n");

or dira, ##$FFFFFFFF

can be not dira, #0
or neg dira, #1
or bmask dira, #31

RossH · 2023-03-12 07:01

@evanh said:

PASM(" andn dira, ##$FFFFFFFF\n");

can be PASM(" and dira, #0\n");
or PASM(" mov dira, #0\n");

or dira, ##$FFFFFFFF

can be not dira, #0
or neg dira, #1
or bmask dira, #31

True, but eventually I will allow for 8, 16 or 32 bit bus configurations using any combination of the 4 bytes in the port, so I was keeping it straightforward - i.e. you set the corresponding bit to 1 to include that bit in the bus. Also, this means I can eventually just define one mask to represent the bus bits and use it everywhere.
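
As a sketch of the single-mask idea (the helper `bus_mask` and its byte-lane encoding are hypothetical, not part of p2p.c):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a 32-bit pin mask from a 4-bit byte-lane
 * selection. Bit 0 of 'lanes' selects pins 0..7, bit 1 selects pins
 * 8..15, bit 2 selects pins 16..23 and bit 3 selects pins 24..31.
 */
uint32_t bus_mask(unsigned lanes) {
    uint32_t mask = 0;
    unsigned lane;

    assert(lanes <= 0xF);
    for (lane = 0; lane < 4; lane++) {
        if (lanes & (1u << lane)) {
            mask |= (uint32_t)0xFF << (8 * lane);
        }
    }
    return mask;
}
```

For example, `bus_mask(0xF)` yields `0xFFFFFFFF` (the full 32-bit bus used in the test program) and `bus_mask(0x3)` yields `0x0000FFFF` for a 16-bit bus on the low two bytes. The same value could then serve as the operand of both the `or dira` and `andn dira` instructions.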

evanh · 2023-03-12 08:13

There is also dirl #basepin | 7<<6 and dirh #basepin | 7<<6 for eight consecutive pins. And this also works for port B, although you can't straddle both ports in one op.

PS: The previous 32-bit ops can also be dirl #0 | 31<<6 and dirh #0 | 31<<6 respectively.

RossH · 2023-03-12 08:55

@evanh said:
There is also dirl #basepin | 7<<6 and dirh #basepin | 7<<6 for eight consecutive pins. And this also works for port B, although you can't straddle both ports in one op.

PS: The previous 32-bit ops can also be dirl #0 | 31<<6 and dirh #0 | 31<<6 respectively.

The P2 has more possibilities than I have had hot dinners!

evanh · 2023-03-12 10:00

Oops, those "PS:" don't encode without ##. #7<<6 is the largest immediate.
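
A quick check of why the 32-pin form needs `##`, assuming the pin-field layout implied by the `basepin | 7<<6` form (base pin in bits 5..0, number of additional pins in bits 10..6) and a 9-bit `#` immediate. The helper names are made up for illustration:

```c
#include <assert.h>

/*
 * Hypothetical helper: compose a pin-field operand for dirl/dirh.
 *   basepin     first pin of the group (0..63)
 *   extra_pins  number of additional pins after basepin (0..31)
 */
unsigned pin_field(unsigned basepin, unsigned extra_pins) {
    assert(basepin <= 63 && extra_pins <= 31);
    return basepin | (extra_pins << 6);
}

/* A plain '#' immediate holds 9 bits, so the largest value is 511. */
int fits_immediate(unsigned value) {
    return value <= 511;
}
```

Eight pins from any base pin fit (`pin_field(63, 7)` is exactly 511), but `pin_field(0, 31)` is 1984, which is why the 32-pin versions only encode with `##`.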
