Is there a better way to do this?

RossH · 2023-03-05 01:23

All

I have a need to transfer arbitrary amounts of data synchronously between one sender and one or more receivers. The sender and each receiver run on different cogs.

I know that I could read and write the whole buffer using fast reads and writes, or just pass the address of the buffer between cogs, but I need the transfers to be done with the senders and receivers sharing only a single Hub long.

The following code works, but due to the Propeller egg-beater architecture, it takes a minimum of 50 clock ticks per long transferred - but I am hoping I can actually do it much faster than that.

Is there a clever method that I am missing?

Here is my sender PASM:

'
' send - send data via a single Hub long
'
'   r5 is the clock ticks between hub writes
'   r4 is the hub long to write
'   r3 is an array holding longs to send
'   r2 is the size of the array
'   r0, r1 are temp storage
'
send
   getct  r0          ' 2 (clocks)
   addct1 r0, #4      ' 2
   rep    #5, r2      ' 2
      waitct1         ' 2
      addct1 r0, r5   ' 2
      rdlong r1, r3   ' 9 .. 44
      wrlong r1, r4   ' 3 .. 38
      add    r3, #4   ' 2
   ret

Here is my receiver PASM (runs on a different cog to the code above):

'
' receive - receive data via a single Hub long
'
'   r5 is the clock ticks between hub reads
'   r4 is the hub long to read
'   r3 is an array to hold longs received
'   r2 is the size of the array
'   r0, r1 are temp storage
'
recv
   getct  r0          ' 2 (clocks)
   addct1 r0, r5      ' 2
   rep    #5, r2      ' 2
      waitct1         ' 2
      addct1 r0, r5   ' 2
      rdlong r1, r4   ' 9 .. 44
      wrlong r1, r3   ' 3 .. 38
      add    r3, #4   ' 2
   ret

Note that the above code relies on the sender and all receivers starting on exactly the same clock. Here is some code that does that:

'
' sync - synchronize multiple cogs to start on a specific clock count (run by all senders and receivers)
'
' r2 should be set to a clock count some time in the future
'    e.g. current clock + clock frequency for one second
' r0 is temp storage
'
sync
   getct  r0
   sub    r2, r0
   waitx  r2
   ret

If anyone wants to try it, here is the full program to test the code above (in Catalina C but with the PASM embedded):

/*
 * Program to test how fast Hub read/writes can be done between multiple cogs.
 * There is one sender cog, but can be multiple receiver cogs.
 *
 * The program uses P2 NATIVE PASM, so it must be compiled in P2 NATIVE mode
 * (which is the default mode). To maximize available cogs, add -C NO_KEYBOARD 
 * -C NO_MOUSE -C NO_FLOAT
 *
 * For example, compile with a command like:
 *
 *    catalina -p2 -lci p2_hubtest.c -C NO_KEYBOARD -C NO_MOUSE -C NO_FLOAT
 *
 * Then load and execute with a command like:
 *
 *    payload p2_hubtest -i
 */

#include <catalina_cog.h>
#include <stdio.h>

#define XFER_TIME  50            // clocks per hub transfer (minimum 50!)

#define NUM_LONGS  1024          // number of longs to transfer

#define STACK_SIZE 500           // size of stack for cogs

static unsigned long start = 0;  // clock count used to synchronize cogs
static unsigned long xfer;       // long to use to transfer data
static int lock = 0;             // lock to protect I/O

static unsigned long send_buff[NUM_LONGS]; // data to be sent
static unsigned long rcv1_buff[NUM_LONGS]; // data received (1)
static unsigned long rcv2_buff[NUM_LONGS]; // data received (2)
static unsigned long rcv3_buff[NUM_LONGS]; // data received (3)

/*
 * sync - synchronize multiple cogs to start on a specific clock count
 *
 *    'start' should be set to a clock count some time in the 
 *            future - e.g. _cnt() + _clockfreq() for one second
 */
int sync(unsigned long start) {
   return PASM (
      " getct  r0\n"
      " sub    r2, r0\n"
      " waitx  r2\n"
      " getct  r0\n"
   );
}

/*
 * send - pasm code to write a number of longs to a hub ram location
 *
 *    'time' (passed in r5) is the clock ticks between hub writes
 *    'xfer' (passed in r4) is the hub long to write
 *    'buff' (passed in r3) is an array holding longs to send
 *    'size' (passed in r2) is the size of the array
 */
int send(int time, void *xfer, void *buff, int size) {
   return PASM (
      " getct  r0\n"          // 2 (clocks)
      " addct1 r0, #4\n"      // 2
      " rep    #5, r2\n"      // 2
      "    waitct1\n"         // 2
      "    addct1 r0, r5\n"   // 2
      "    rdlong r1, r3\n"   // 9 .. 44
      "    wrlong r1, r4\n"   // 3 .. 38
      "    add    r3, #4\n"   // 2
      " getct  r0\n"          // 2
   );
}

/*
 * sender - send an array of longs to one or more receivers
 *
 *    'buff' is an array of NUM_LONGS longs.
 */
void sender(void *buff) {
   unsigned int started, stopped, total;
   int me = _cogid();

   started = sync(start);
   stopped = send(XFER_TIME, &xfer, buff, NUM_LONGS);
   total   = stopped - started;
   ACQUIRE(lock);
   printf("send (cog %d) started at clock 0x%08x\n", me, started);
   printf("send (cog %d) stopped at clock 0x%08x\n", me, stopped);
   printf("send (cog %d) took %d clocks (%d per long)\n\n", 
           me, total, total/NUM_LONGS);
   RELEASE(lock);
   while(1); // don't exit
}

/*
 * recv - pasm code to read a number of longs from a hub ram location
 *
 *    'time' (passed in r5) is the clock ticks between hub reads
 *    'xfer' (passed in r4) is the hub long to read
 *    'buff' (passed in r3) is an array to hold longs received
 *    'size' (passed in r2) is the size of the array
 */
int recv(int time, void *xfer, void *buff, int size) {
   return PASM (
      " getct  r0\n"          // 2 (clocks)
      " addct1 r0, r5\n"      // 2
      " rep    #5, r2\n"      // 2
      "    waitct1\n"         // 2
      "    addct1 r0, r5\n"   // 2
      "    rdlong r1, r4\n"   // 9 .. 44
      "    wrlong r1, r3\n"   // 3 .. 38
      "    add    r3, #4\n"   // 2
      " getct  r0\n"          // 2
   );
}

/*
 * receiver- receive an array of longs from a sender
 *
 *    'buff' is an array of NUM_LONGS longs.
 */
void receiver(void *buff) {
   unsigned int started, stopped, total;
   int me = _cogid();

   started = sync(start);
   stopped = recv(XFER_TIME, &xfer, buff, NUM_LONGS);
   total   = stopped - started;
   ACQUIRE(lock);
   printf("recv (cog %d) started at clock 0x%08x\n", me, started);
   printf("recv (cog %d) stopped at clock 0x%08x\n", me, stopped);
   printf("recv (cog %d) took %d clocks (%d per long)\n\n", 
          me, total, total/NUM_LONGS);
   RELEASE(lock);
   while(1); // don't exit
}


void main(void) {
   unsigned long i;
   long send_stack[STACK_SIZE];
   long rcv1_stack[STACK_SIZE];
   long rcv2_stack[STACK_SIZE];
   long rcv3_stack[STACK_SIZE];

   // assign a lock to be used to avoid plugin contention
   lock = _locknew();

   // give the vt100 emulator a chance to start
   _waitms(500);

   // initialize the arrays
   for (i = 0; i < NUM_LONGS; i++) {
      send_buff[i] = i; // can be anything 
      rcv1_buff[i] = 0;
      rcv2_buff[i] = 0;
      rcv3_buff[i] = 0;
   }

   ACQUIRE(lock);
   printf("starting cogs ...\n\n");
   RELEASE(lock);

   // set a start time for the cogs to use in the sync function
   start = _cnt() + _clockfreq(); // set start time for +1 seconds

   // start ONE sender and THREE receivers
   _cogstart_C(&sender, send_buff, send_stack, STACK_SIZE);
   _cogstart_C(&receiver, rcv1_buff, rcv1_stack, STACK_SIZE);
   _cogstart_C(&receiver, rcv2_buff, rcv2_stack, STACK_SIZE);
   _cogstart_C(&receiver, rcv3_buff, rcv3_stack, STACK_SIZE);

   // give the cogs a chance to do the transfers (and also
   // print their output, which can take a second)
   _waitms(1000);

   // check the results
   ACQUIRE(lock);
   printf("checking data ...\n");
   for (i = 0; i < NUM_LONGS; i++) {
     // check receiver 1 got the correct data
     if (send_buff[i] != rcv1_buff[i]) {
        printf("send[%3d]=0x%08X != rcv1[%3d]=0x%08X\n",  
               i, send_buff[i], i, rcv1_buff[i]);
        _waitms(5);
     }
     // check receiver 2 got the correct data
     if (send_buff[i] != rcv2_buff[i]) {
        printf("send[%3d]=0x%08X != rcv2[%3d]=0x%08X\n",  
               i, send_buff[i], i, rcv2_buff[i]);
        _waitms(5);
     }
     // check receiver 3 got the correct data
     if (send_buff[i] != rcv3_buff[i]) {
        printf("send[%3d]=0x%08X != rcv3[%3d]=0x%08X\n",  
               i, send_buff[i], i, rcv3_buff[i]);
        _waitms(5);
     }
   }
   printf("... done\n");
   RELEASE(lock);

   while(1); // don't exit
}

Wuerfel_21 · 2023-03-05 01:44

The first obvious thing you should try is using the FIFO to read/write the source/destination buffers. This now also makes timing within your loop always be the same, so you can drop the waitct stuff.

Since an RDLONG is at least 9 cycles, the receiver loop will end up syncing at 16 cycles:

rep #2,size
rdlong tmp,mailbox
wflong tmp

The send loop needs a wait to keep pace

rep #3,size
rflong tmp
wrlong tmp,mailbox
waitx #4

RossH · 2023-03-05 01:49

@Wuerfel_21 said:
The first obvious thing you should try is using the FIFO to read/write the source/destination buffers. This now also makes timing within your loop always be the same, so you can drop the waitct stuff.

Since an RDLONG is at least 9 cycles, the receiver loop will end up syncing at 16 cycles
rep #2,size
rdlong tmp,mailbox
wflong tmp
The send loop needs a wait to keep pace
rep #3,size
rflong tmp
wrlong tmp,mailbox
waitx #4

This is where I started, but I soon discovered the egg-beater architecture means that it doesn't work reliably unless you allow at least 50 clocks per transfer. That's why I found I needed the waitct stuff. Perhaps this is due to the fact that I am using hub execution.

Wuerfel_21 · 2023-03-05 01:51

@RossH said:
Perhaps this is due to the fact that I am using hub execution.

Then do not. REP in hubexec is slooooow.

AJL · 2023-03-05 02:00

Why not use a smart pin in long repository mode?

WXPIN to write to the repository takes 2 clocks.
RDPIN or RQPIN to read from the repository takes 2 clocks.

For multiple receivers you’d use RQPIN for all but the last. The sender waits for the IN signal to drop before the next WXPIN.

The pin in question could still be used as an output.

Edit: If either PTRA or PTRB is free you could speed things up further by using that in place of R3 and using the auto increment feature.

RossH · 2023-03-05 02:00

@Wuerfel_21 said:

@RossH said:
Perhaps this is due to the fact that I am using hub execution.

Then do not. REP in hubexec is slooooow.

Unfortunately, I have to do so unless I can dedicate a cog to just doing the transfers. But I'm out of cogs

RossH · 2023-03-05 03:41

@AJL said:
Why not use a smart pin in long repository mode?

WXPIN to write to the repository takes 2 clocks.
RDPIN or RQPIN to read from the repository takes 2 clocks.

For multiple receivers you’d use RQPIN for all but the last. The sender waits for the IN signal to drop before the next WXPIN.

The pin in question could still be used as an output.

Edit: If either PTRA or PTRB is free you could speed things up further by using that in place of R3 and using the auto increment feature.

Yes, I thought of the smart pin repository. I could do that in some cases, but not all - not enough pins to spare.

EDIT: And also, I currently use both PTRA and PTRB for other things.

RossH · 2023-03-05 04:04

@AJL said:
Why not use a smart pin in long repository mode?

...

The pin in question could still be used as an output.

Actually, this idea has potential. If I can use the repository mode on pins that are also being used by other cogs as simple outputs then I may be able to find enough pins. The main problem is that I would have to limit the use of those pins to always being used as simple outputs.

Has anyone tried this?

Thanks.

AJL · 2023-03-05 04:17

@RossH said:

@AJL said:
Why not use a smart pin in long repository mode?

...

The pin in question could still be used as an output.

Actually, this idea has potential. If I can use the repository mode on pins that are also being used by other cogs as simple outputs then I may be able to find enough pins. The main problem is that I would have to limit the use of those pins to always being used as simple outputs.

Has anyone tried this?

Thanks.

I haven’t tried it myself. The same pins could also be used as simple outputs on the same cog(s).
The smart pin usage is through WxPIN and RxPIN, while the simple output is controlled by writing to OUTx and DIRx.

rogloh · 2023-03-05 05:33

What about signaling the receiver COG(s) with a COGATN from the sender? Can that shave any cycles vs the WAITCT methods, and reduce the loop time?

Another possibility which may speed things up is to burst read your data in chunks via the LUTRAM or COGRAM where you get to transfer more data in one go with SETQ/SETQ2 read/write transfers rather than doing one long in each 50 P2 clocks. You'll need some amount of free scratch COG/LUT memory for this of course, but it wouldn't necessarily have to be that large, maybe even doing up to 16 longs at a time could help out.

Tubular · 2023-03-05 05:52

Yeah, Brian has successfully tried that smart pin repository along with simple output

RossH · 2023-03-05 06:09

@AJL said:

I haven’t tried it myself. The same pins could also be used as simple outputs on the same cog(s).
The smart pin usage is through WxPIN and RxPIN, while the simple output is controlled by writing to OUTx and DIRx.

Hmmm. I need them to be usable as outputs on the cogs that already use them, while other cogs use them only in repository mode. However, the DIR bit is also used to reset the smart mode, so at the very least everything would have to be configured in the correct order.

Some experiments required when I get time.

In the meantime, I have realized that my original solution is probably not as bad as I first thought. I cannot use the FIFO or WRFAST because I use Hub Execution mode, and so a normal WRLONG can take up to 20 clocks anyway. 50 clocks per long still represents a throughput of 16 MB/s (assuming 200 MHz, and also that I have my maths correct!). This would be sufficient for most applications except video, and is certainly sufficient for my purposes.

rogloh · 2023-03-05 06:22

@RossH said:
... I cannot use the FIFO or WRFAST because I use Hub Execution mode

You probably can if you get the current FIFO state with GETPTR at the start of your inline PASM and then restore the fifo position back to this address with a RDFAST at the end of your inline PASM block. I've used this technique once before and it appeared to let me share the FIFO in Hub Exec mode while running my own PASM code. In fact IIRC someone recently mentioned this is already being done in PropTool inline PASM but not flex I think.

RossH · 2023-03-05 06:31

Thanks all.

I have some things to try. All transfers need to be synchronous and I was trying to keep things simple and have only the one transfer mechanism, but I didn't want to limit the size of transfer. It may not be possible to achieve all of these at once.

RossH · 2023-03-05 06:36

@rogloh said:

@RossH said:
... I cannot use the FIFO or WRFAST because I use Hub Execution mode

You probably can if you get the current FIFO state with GETPTR at the start of your inline PASM and then restore the fifo position back to this address with a RDFAST at the end of your inline PASM block. I've used this technique once before and it appeared to let me share the FIFO in Hub Exec mode while running my own PASM code. In fact IIRC someone recently mentioned this is already being done in PropTool inline PASM but not flex I think.

Thanks! That sounds interesting. I'll check out GETPTR. I should have known there would be a way to do it on the Propeller!

RossH · 2023-03-05 07:22

@rogloh said:

@RossH said:
... I cannot use the FIFO or WRFAST because I use Hub Execution mode

You probably can if you get the current FIFO state with GETPTR at the start of your inline PASM and then restore the fifo position back to this address with a RDFAST at the end of your inline PASM block. I've used this technique once before and it appeared to let me share the FIFO in Hub Exec mode while running my own PASM code. In fact IIRC someone recently mentioned this is already being done in PropTool inline PASM but not flex I think.

Actually, maybe I am not understanding your point. I found code of yours that uses GETPTR/RDFAST in Cog execution mode to call a function in Hub execution mode and then restore the FIFO, but is this necessary the other way around? If I am in Hub execution mode I can execute a function in Cog execution mode just by calling it. Do I need to save and restore the FIFO in such cases? Perhaps if I was in an interrupt service routine, but in other code why would I need to do so?

Sorry if I am being thick!

Ross.

evanh · 2023-03-05 10:51

Roger probably has done that because there is no cog space left and the streamer is stopped during that routine. Presumably for reconfigure feature.

TonyB_ · 2023-03-05 11:24

@RossH said:
If I am in Hub execution mode I can execute a function in Cog execution mode just by calling it. Do I need to save and restore the FIFO in such cases? Perhaps if I was in an interrupt service routine, but in other code why would I need to do so?

As mentioned before, it would be quicker to run the send/receive code in cog/LUT RAM. If that code doesn't use the FIFO then it won't need to do GETPTR ... RDFAST. For successive RDLONGs, the shortest time of 9 cycles occurs when the 2nd long is in the next egg beater slice after the 1st. Put another way, byte address of 2nd long is 4 more than 1st. (Longest time of 16 cycles occurs when egg beater slices are the same.) This means that you could read one long every 17 cycles, with 8 of these cycles available for up to four 2-cycle instructions between RDLONGs. If one intermediate instruction then next RDLONG would take 15 cycles, if two ... 13, if three ... 11 and if four ... 9.

evanh · 2023-03-05 11:36

@RossH said:

@Wuerfel_21 said:

@RossH said:
Perhaps this is due to the fact that I am using hub execution.

Then do not. REP in hubexec is slooooow.

Unfortunately, I have to do so unless I can dedicate a cog to just doing the transfers. But I'm out of cogs

The solution is to temporarily block-copy a small sub-routine into cogRAM. Then call it.

rogloh · 2023-03-05 11:47

@RossH said:

@rogloh said:

Actually, maybe I am not understanding your point. I found code of yours that uses GETPTR/RDFAST in Cog execution mode to call a function in Hub execution mode and then restore the FIFO, but is this necessary the other way around? If I am in Hub execution mode I can execute a function in Cog execution mode just by calling it. Do I need to save and restore the FIFO in such cases? Perhaps if I was in an interrupt service routine, but in other code why would I need to do so?

Sorry if I am being thick!

Ross.

IMO you'd only have to save the FIFO read position and restore it if you were running some inline PASM while in hubexec mode and wanted to override the use of the FIFO temporarily before returning to normal hubexec mode. If your inline PASM is however executed in code from cogexec mode there's likely no need to do this save/restore, because the FIFO would have to be restarted for you via the return (or branch) back to hubexec from cogexec mode. That's what I expect anyway - but maybe give it a go and see what happens....I'm possibly a little rusty now on the P2 after a nice summer break.

RossH · 2023-03-05 22:46

@TonyB_ said:

As mentioned before, it would be quicker to run the send/receive code in cog/LUT RAM. If that code doesn't use the FIFO then it won't need to do GETPTR ... RDFAST. For successive RDLONGs, the shortest time of 9 cycles occurs when the 2nd long is in the next egg beater slice after the 1st. Put another way, byte address of 2nd long is 4 more than 1st. (Longest time of 16 cycles occurs when egg beater slices are the same.) This means that you could read one long every 17 cycles.

Yes, this is pretty much what I had hoped for and expected, but I no longer program the Propeller much in anything except C, and I forget that things are different when using Hub Execution mode.

@evanh said:

The solution is to temporarily block-copy a small sub-routine into cogRAM. Then call it.

Yes, this is what I will try next. If this works (which I believe it will) then that is the easiest solution.

After that, I will look deeper into Roger's suggestions.

Ross.

RossH · 2023-03-06 06:57

Progress ...

Just executing the code from LUT RAM doubles the speed (now 25 clocks per long instead of 50). Note that the addct1/waitct1 instructions are still required to keep everything synchronous.

The updated Catalina C code is below. The updated PASM code is embedded in the send() and recv() functions and is the same as above except it is now executed from LUT RAM ...

/*
 * Program to test how fast Hub read/writes can be done between multiple cogs.
 * There is one sender cog, but can be multiple receiver cogs.
 *
 * The program uses P2 NATIVE PASM, so it must be compiled in P2 NATIVE mode
 * (which is the default mode). To maximize available cogs, add -C NO_KEYBOARD 
 * -C NO_MOUSE -C NO_FLOAT
 *
 * For example, compile with a command like:
 *
 *    catalina -p2 -lci p2_hubtest.c -C NO_KEYBOARD -C NO_MOUSE -C NO_FLOAT
 *
 * Then load and execute with a command like:
 *
 *    payload p2_hubtest -i
 */

#include <catalina_cog.h>
#include <stdio.h>

#define XFER_TIME  25            // clocks per hub transfer (minimum 25!)

#define NUM_LONGS  1024          // number of longs to transfer

#define STACK_SIZE 500           // size of stack for cogs

static unsigned long start = 0;  // clock count used to synchronize cogs
static unsigned long xfer;       // long to use to transfer data
static int lock = 0;             // lock to protect I/O

static unsigned long send_buff[NUM_LONGS]; // data to be sent
static unsigned long rcv1_buff[NUM_LONGS]; // data received (1)
static unsigned long rcv2_buff[NUM_LONGS]; // data received (2)
static unsigned long rcv3_buff[NUM_LONGS]; // data received (3)

/*
 * sync - synchronize multiple cogs to start on a specific clock count
 *
 *    'start' should be set to a clock count some time in the 
 *            future - e.g. _cnt() + _clockfreq() for one second
 */
int sync(unsigned long start) {
   return PASM (
      " getct  r0\n"
      " sub    r2, r0\n"
      " waitx  r2\n"
      " getct  r0\n"
   );
}

/*
 * send - pasm code to write a number of longs to a hub ram location
 *        (now loads the code into LUT RAM and executes it there)
 *
 *    'time' (passed in r5) is the clock ticks between hub writes
 *    'xfer' (passed in r4) is the hub long to write
 *    'buff' (passed in r3) is an array holding longs to send
 *    'size' (passed in r2) is the size of the array
 */
int send(int time, void *xfer, void *buff, int size) {
   return PASM (
      // load LUT RAM:
      " setq2  #(send_end - send_start - 1)\n"
      " rdlong 0, ##@send_start\n"

      // jump to code in LUT RAM:
      " jmp    #send_start\n" 

      // code to be executed in LUT RAM:
      " org $200\n"
      "send_start\n"
      " getct  r0\n"          // LUT: 2 (clocks)
      " addct1 r0, #4\n"      // LUT: 2
      " rep    #5, r2\n"      // LUT: 2
      "    waitct1\n"         // LUT: 2
      "    addct1 r0, r5\n"   // LUT: 2
      "    rdlong r1, r3\n"   // LUT: 9 .. 16
      "    wrlong r1, r4\n"   // LUT: 3 .. 10
      "    add    r3, #4\n"   // LUT: 2
      " getct  r0\n"          // LUT: 2
      " jmp #send_cont\n"     // LUT: 4
      "send_end\n" 

      // resume Hub Execution:
      " orgh\n"
      "send_cont\n" 
   );
}

/*
 * sender - send an array of longs to one or more receivers
 *
 *    'buff' is an array of NUM_LONGS longs.
 */
void sender(void *buff) {
   unsigned int started, stopped, total;
   int me = _cogid();

   started = sync(start);
   stopped = send(XFER_TIME, &xfer, buff, NUM_LONGS);
   total   = stopped - started;
   ACQUIRE(lock);
   printf("send (cog %d) started at clock 0x%08x\n", me, started);
   printf("send (cog %d) stopped at clock 0x%08x\n", me, stopped);
   printf("send (cog %d) took %d clocks (%d per long)\n\n", 
           me, total, total/NUM_LONGS);
   RELEASE(lock);
   while(1); // don't exit
}

/*
 * recv - pasm code to read a number of longs from a hub ram location
 *        (now loads the code into LUT RAM and executes it there)
 *
 *    'time' (passed in r5) is the clock ticks between hub reads
 *    'xfer' (passed in r4) is the hub long to read
 *    'buff' (passed in r3) is an array to hold longs received
 *    'size' (passed in r2) is the size of the array
 */
int recv(int time, void *xfer, void *buff, int size) {
   return PASM (
      // load LUT RAM:
      " setq2  #(recv_end - recv_start - 1)\n"
      " rdlong 0, ##@recv_start\n"

      // jump to code in LUT RAM:
      " jmp    #recv_start\n"

      // code to be executed in LUT RAM:
      " org $200\n"
      "recv_start\n"
      " getct  r0\n"          // LUT: 2 (clocks)
      " addct1 r0, r5\n"      // LUT: 2
      " rep    #5, r2\n"      // LUT: 2
      "    waitct1\n"         // LUT: 2
      "    addct1 r0, r5\n"   // LUT: 2
      "    rdlong r1, r4\n"   // LUT: 9 .. 16
      "    wrlong r1, r3\n"   // LUT: 3 .. 10
      "    add    r3, #4\n"   // LUT: 2
      " getct  r0\n"          // LUT: 2
      " jmp #recv_cont\n"     // LUT: 4
      "recv_end\n"

      // resume Hub Execution:
      " orgh\n"
      "recv_cont\n"
   );
}

/*
 * receiver- receive an array of longs from a sender
 *
 *    'buff' is an array of NUM_LONGS longs.
 */
void receiver(void *buff) {
   unsigned int started, stopped, total;
   int me = _cogid();

   started = sync(start);
   stopped = recv(XFER_TIME, &xfer, buff, NUM_LONGS);
   total   = stopped - started;
   ACQUIRE(lock);
   printf("recv (cog %d) started at clock 0x%08x\n", me, started);
   printf("recv (cog %d) stopped at clock 0x%08x\n", me, stopped);
   printf("recv (cog %d) took %d clocks (%d per long)\n\n", 
          me, total, total/NUM_LONGS);
   RELEASE(lock);
   while(1); // don't exit
}


void main(void) {
   unsigned long i;
   long send_stack[STACK_SIZE];
   long rcv1_stack[STACK_SIZE];
   long rcv2_stack[STACK_SIZE];
   long rcv3_stack[STACK_SIZE];

   // assign a lock to be used to avoid plugin contention
   lock = _locknew();

   // give the vt100 emulator a chance to start
   _waitms(500);

   // initialize the arrays
   for (i = 0; i < NUM_LONGS; i++) {
      send_buff[i] = i; // can be anything 
      rcv1_buff[i] = 0;
      rcv2_buff[i] = 0;
      rcv3_buff[i] = 0;
   }

   ACQUIRE(lock);
   printf("starting cogs ...\n\n");
   RELEASE(lock);

   // set a start time for the cogs to use in the sync function
   start = _cnt() + _clockfreq(); // set start time for +1 seconds

   // start ONE sender and THREE receivers
   _cogstart_C(&sender, send_buff, send_stack, STACK_SIZE);
   _cogstart_C(&receiver, rcv1_buff, rcv1_stack, STACK_SIZE);
   _cogstart_C(&receiver, rcv2_buff, rcv2_stack, STACK_SIZE);
   _cogstart_C(&receiver, rcv3_buff, rcv3_stack, STACK_SIZE);

   // give the cogs a chance to do the transfers (and also
   // print their output, which can take a second)
   _waitms(1000);

   // check the results
   ACQUIRE(lock);
   printf("checking data ...\n");
   for (i = 0; i < NUM_LONGS; i++) {
     // check receiver 1 got the correct data
     if (send_buff[i] != rcv1_buff[i]) {
        printf("send[%3d]=0x%08X != rcv1[%3d]=0x%08X\n",  
               i, send_buff[i], i, rcv1_buff[i]);
        _waitms(5);
     }
     // check receiver 2 got the correct data
     if (send_buff[i] != rcv2_buff[i]) {
        printf("send[%3d]=0x%08X != rcv2[%3d]=0x%08X\n",  
               i, send_buff[i], i, rcv2_buff[i]);
        _waitms(5);
     }
     // check receiver 3 got the correct data
     if (send_buff[i] != rcv3_buff[i]) {
        printf("send[%3d]=0x%08X != rcv3[%3d]=0x%08X\n",  
               i, send_buff[i], i, rcv3_buff[i]);
        _waitms(5);
     }
   }
   printf("... done\n");
   RELEASE(lock);

   while(1); // don't exit
}

This is a synchronous throughput of 32 MB/s or 256 Mbps (again assuming 200 MHz), which I think is going to be fast enough.

evanh · 2023-03-06 12:08

How much spare lutRAM and cogRAM is available? And what sizes are typical? And how much of a benefit would 320 MBytes/s be?
PS: There are ways to make it even faster but these are definitely more complicated with more overheads.

RossH · 2023-03-07 00:48

@evanh said:
How much spare lutRAM and cogRAM is available? And what sizes are typical? And how much of a benefit would 320 MBytes/s be?
PS: There are ways to make it even faster but these are definitely more complicated with more overheads.

I don't have enough spare cogs to run dedicated sender or receiver cogs, so that means I have to be able to do it in a cog that is also used to run C code (which is one reason my test program uses C "wrapper" functions around the PASM - I could instead have started the PASM senders and receivers directly on a bare cog).

This means Cog RAM is very tight - Catalina has only a few longs to spare. However, Catalina does guarantee that 256 longs of LUT RAM will always be available (which is what my program currently uses). I could probably lift that to about 400 longs if I had to. But this is probably not enough space for complex code as well as any significant buffers.

For my application a typical transfer size would be about 2kb, but it could go up to 512kb.

Also, my application requires that all transfers be synchronous and use only a single long of Hub RAM.

320MB/s is not necessary for my application, but 64MB/s would be nice. However, I'd certainly like to see how 320MB/s could be done, and it may also be useful for others.

Ross.

evanh · 2023-03-07 12:45

Hmm, might be tricky to devise an alternative to the single longword exchange mechanism but one thing that can be done straight up is make use of the FIFO for the sender's consecutive data fetches and receiver's consecutive data stores.

On that note, the critical loop speed will be dictated by the hub access of the receiver's RDLONG (The WRLONG becomes WFLONG, which is 2 clock ticks only) ...

evanh · 2023-03-07 13:03

The main thing I was first thinking about is, when not using the FIFO, there is expected to be deterministic (unbroken) burst copying using SETQ/SETQ2 + RDLONG/WRLONG. Because each cog copies at the same speed of one longword per tick this approach can then be used just like you're using the single RDLONG/WRLONG but with a block at a time.

And if the exchange block size was say 16 longwords in hubRAM then it could be expected to achieve something like 10x the data rate. A guess.

Obviously this isn't a single longword exchange any longer.

RossH · 2023-03-07 22:38

@evanh said:
Hmm, might be tricky to devise an alternative to the single longword exchange mechanism but one thing that can be done straight up is make use of the FIFO for the sender's consecutive data fetches and receiver's consecutive data stores.

On that note, the critical loop speed will be dictated by the hub access of the receiver's RDLONG (The WRLONG becomes WFLONG, which is 2 clock ticks only) ...

Yes, the FIFO can be used for transferring data to/from Hub RAM by both the sender and receiver - but only if you do not also have to use Hub RAM for the transfer between cogs. However, it may be possible to do that using the smart pin repository mode rather than using the Hub, so that the sender and receiver FIFOs can just run continuously and at full speed. This is not exactly what I had in mind, but I may be able to use it.

I'll do some experimenting.

TonyB_ · 2023-03-07 22:49

Why can't the receiver cog read from the hub RAM buffer directly?

RossH · 2023-03-08 05:38

@TonyB_ said:
Why can't the receiver cog read from the hub RAM buffer directly?

The buffers only exist in my test program. The final program may or may not have them, depending on the application. The source of the data may not be Hub RAM, and neither may the destination. Forcing the sender to buffer the data and then send the buffer to the receiver would require buffer space I just don't have, and also slow down the application and/or limit the size and number of concurrent transfers - e.g. the send buffer could not be used to send more data (perhaps to a different receiver) until the first receiver had finished with it.

I realized while I was writing this that what I really want is for each cog to be able to use its FIFO to send data to another cog, rather than between cog and hub!

Ross.

evanh · 2023-03-08 07:29

@RossH said:
Yes, the FIFO can be used for transferring data to/from Hub RAM by both the sender and receiver - but only if you do not also have to use Hub RAM for the transfer between cogs.

No conflict there. RDLONG/WRLONG can happily co-mingle with RFLONG/WFLONG.

However, it may be possible to do that using the smart pin repository mode rather than using the Hub, so that the sender and receiver FIFOs can just run continuously and at full speed. This is not exactly what I had in mind, but I may be able to use it.

That's a good idea anyway. The only catch is you're thieving a smartpin that might be wanted for something else.

An option could be to temporarily use one of the SD/EEPROM pins. This is similar to what Chip has recently done with the UART pins when debug is enabled. The RX pin (P63) gets set to repository mode while user code is executing but becomes a serial pin each time the debug ISR is called. This way a parameter is picked up by the ISR and used to set the baud (a feature request by me) on each interrupt. This solved an issue where the compile-time baud was locked in to the protected debugger code and couldn't be overridden by the user when the system clock frequency is changed.

msrobots · 2023-03-10 04:20

One thing I would like to explore is using the streamer. I have not found it yet, but one last addition to the streamer modes was one that could stream in with an external clock.
It was added for ADC/DAC pins to/from the streamer, but Chip confirmed it also works without activating ADC/DAC.

Still thinking about that ring buffer concept @"Beau Schwabe" did with the P1: sending a memory block from one P2 to the next in line, pausing at each P2 so it can be cooperatively shared, and then sending it along to the next one.

I am sadly too stupid yet with P2 assembly.

Mike
