
What is the fastest method for data transfer between COGs?

Bill Baird Posts: 23
edited 2006-07-27 16:09 in Propeller 1
It seems that data must be transferred between COGs by going through Main RAM using the read and write Hub commands.

If the COGs are properly synchronized to catch the minimum Hub delay, the maximum transfer rate in assembly language might be something like 4 clocks for a write long instruction from COG 1 to Main RAM + 7 clocks for the Hub delay + 4 clocks for a read long instruction from Main RAM into COG 2 + 7 clocks Hub delay = 22 clocks per long transfer x 13.3 ns per clock = 292.6 ns = 3.417 MHz.

Worst case might be 4 + 22 + 4 + 22 = 52 clocks = 691.6 ns = 1.4459 MHz.
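
For concreteness, the kind of transfer I have in mind is roughly the following (the labels, register names, and the hub "mailbox" address are just placeholders, not anything specific to the chip):

    ' COG 1 program: push longs into a hub mailbox.
    tx_loop   wrlong  txdata, hubaddr     ' cog register -> Main RAM (waits for this cog's hub window)
              ' ... produce the next value in txdata ...
              jmp     #tx_loop

    ' COG 2 program (a separate cog image): pull longs back out of the same mailbox.
    rx_loop   rdlong  rxdata, hubaddr     ' Main RAM -> cog register (waits for this cog's hub window)
              ' ... consume rxdata ...
              jmp     #rx_loop

    txdata    long    0
    rxdata    long    0
    hubaddr   long    0                   ' assumed loaded with the mailbox's hub address beforehand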

Is this right, or is it worse because semaphores have to be used? And in that case, should the writes from COG 1 be done in a chunk to a block of Main RAM and read out to COG 2 in a succeeding chunk, to reduce the semaphore overhead?

Or is there a global shared resource register I don't see that could be used to pass data in two 4-clock reads and writes by the two COGs? Or possibly you could wire together two sections of the I/O pins and transfer between the I and O registers of the 2 COGs at this max rate that way?

Or am I out of it altogether in my understanding here?

EDIT: Thread moved to Propeller chip forum -Ryan

Comments

  • Paul Baker Posts: 6,351
    edited 2006-07-10 21:50
    The fastest method would be to use a group of pins to transfer data. If you used all 32 pins, you could transfer one long every machine instruction (20 MLPS). You don't need to wire pins together, since all cogs have access to the pins: one cog configures them as inputs, the other configures them as outputs. Of course this isn't really a reasonable situation; you would more likely use a subset of pins. Doing the shifting required to reassemble the data will slow things down a bit, but it will still be faster if you use 8 pins.
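
    Roughly, the two cogs might look something like this in assembly. It's only a sketch: the pin group (P0..P7 here), the register names, and whatever handshaking you use to mark a value as valid are placeholders you'd fill in yourself.

        ' Cog A: drive P0..P7 with one byte per pass through the loop.
                mov     dira, #$FF          ' make P0..P7 outputs in this cog
        txloop  mov     outa, txbyte        ' present the byte on P0..P7
                ' ... load the next byte into txbyte, signal "valid" somehow ...
                jmp     #txloop

        ' Cog B: leave the pins as inputs (DIRA starts at 0) and sample them.
        rxloop  mov     rxbyte, ina         ' sample all 32 pins
                and     rxbyte, #$FF        ' keep only P0..P7
                ' ... shift/store rxbyte to reassemble wider data ...
                jmp     #rxloop

        txbyte  long    0
        rxbyte  long    0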

    As far as using memory goes, you won't need to use semaphores in this situation.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Life is one giant teacup ride.
  • Bill Baird Posts: 23
    edited 2006-07-11 00:53
    Great - thanks for the quick reply.

    So if I understand, the COGs can just read and write the INA and OUTA registers.

    Then these 2 registers, seen as separate in the software, are really one hardware port register which can be accessed as INA in the software when the pins are set to input by the DIRA register, or as OUTA for pins set as output?

    So then this port register or parts of it can act as a shared resource register for transfer of data - as I read you.

    Unfortunately I do mostly need the I/O pins for external communication, but maybe I can find some time windows to use some pins as data transfer registers.

    Can any other shared special-purpose register be used this way?

    The counter A control, frequency, and PLL registers can be read and written by any COG, it seems. But then it seems you need to pick one or two pins for I/O in the control register to activate the counters and the other config registers, and then the counters would be driven crazy by the different long values passing through the other registers.

    But maybe this wouldn't bother anything if you didn't care about using the counters ?

    How about the video config registers? If you weren't using video, could they be used for COG data transfers?

    Thanks again.
  • Mike Green Posts: 23,101
    edited 2006-07-11 01:49
    I think that anything much more complex than using a block of 8 I/O pins for data transfer between cogs would not save much time over transferring through hub memory. You don't need to use a semaphore if the two cogs run synchronized (in terms of hub cycles) and if the transfer is one-way (cog to cog).
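
    For example, both cogs could pick up the same start time from a long in hub memory and WAITCNT on it, so that their relative timing (and hence their hub windows) is fixed from then on. A rough sketch, where PAR is assumed to point at a start count written by the launching cog:

        startup rdlong  t0, par             ' both cogs read the same target CNT value from hub
                waitcnt t0, #0              ' wait until the system counter reaches it
                ' ... from here on the two cogs' relative timing is known ...

        t0      long    0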
  • Dennis Ferron Posts: 480
    edited 2006-07-27 01:41
    Actually that is a really good question.

    Suppose I wanted to use a number of cogs to form a sort of "3-d pipeline in reverse" to process vision data for a robot. I.e., I might view the vision processing as a series of data transforms, where each cog in the pipeline bangs on the image data in some way to extract feature data which will be interpreted by the next cog in the pipeline. How much and how fast could I expect to pump data from cog to cog through this pipeline?
  • Paul Baker Posts: 6,351
    edited 2006-07-27 16:09
    Assuming real-time computation, the fastest the pipeline can go is determined by the slowest stage in the pipeline. There are two things which slow down a stage: the first is data bandwidth, or the number of data elements which must be operated upon; the second is computational complexity, or how many computations must be performed per data element. By splitting the entire computation into stages in a pipeline you are hacking away at the second criterion. The data bandwidth can be lowered by dedicating more than one cog to a high-bandwidth stage and interleaving the handling of data elements between the multiple cogs.
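
    In assembly, one stage of such a pipeline might look roughly like this. It's only a sketch of the read-transform-write structure; the buffer pointers, the element count, and whatever signalling tells the next stage a block is ready are placeholders you'd set up yourself (e.g. through PAR).

        ' One pipeline stage, running in its own cog: walk an input hub buffer,
        ' transform each long, and write the result to an output hub buffer.
        stage   rdlong  t, ptr_in           ' fetch one data element from Main RAM
                ' ... per-element computation on t goes here ...
                wrlong  t, ptr_out          ' hand the result to the next stage's buffer
                add     ptr_in, #4          ' advance both pointers by one long
                add     ptr_out, #4
                djnz    n, #stage           ' repeat for the rest of the block
                ' ... flag the next stage, reload ptr_in/ptr_out/n, start over ...

        ptr_in  long    0                   ' hub address of this stage's input buffer
        ptr_out long    0                   ' hub address of the next stage's buffer
        n       long    0                   ' number of longs per block
        t       long    0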

    To optimize the pipeline for maximum efficiency of stages in the type of application you are interested in, the bandwidth vs. computation graph will look like a supply-demand curve:

    [attachment: bandwidth vs. computation curve]

    IOW, a high bandwidth stage will have very few computations on each data element, and a low bandwidth stage will have a high number of computations on each data element.

    Post Edited (Paul Baker) : 7/27/2006 10:48:26 PM GMT