What is the fastest method for data transfer between COGs?
Bill Baird
Posts: 23
It seems that data must be transferred between COGs by going through Main RAM using the read and write Hub commands.
If the COGs are properly synchronized to catch the minimum Hub delay, the maximum transfer rate in assembly language might be something like 4 clocks for a write-long instruction from COG 1 to Main RAM + 7 clocks for the Hub delay + 4 clocks for a read-long instruction from Main RAM to COG 2 + 7 clocks Hub delay = 22 clocks per long transfer x 13.3 ns per clock = 292.6 ns, or about 3.417 MHz.
Worst case might be 4 + 22 + 4 + 22 = 52 clks = 691.6 ns = 1.4459 MHz.
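The arithmetic above can be checked with a short script. This is only a sketch of the estimate as posted: the clock counts and the 13.3 ns clock period are the figures from this post, not datasheet values, and the names (`transfer_estimate`, `BEST_HUB_WAIT`, and so on) are made up for illustration.

```python
# All figures below are the poster's estimates, not verified hardware timings.
CLOCK_NS = 13.3        # ns per system clock, as assumed in the post
INSTR_CLKS = 4         # clocks charged to each Hub read or write instruction
BEST_HUB_WAIT = 7      # Hub delay when the COG is synchronized to its Hub slot
WORST_HUB_WAIT = 22    # Hub delay when the Hub slot was just missed


def transfer_estimate(hub_wait_clks):
    """Return (clocks, ns, MHz) for one long moved COG -> Main RAM -> COG."""
    clocks = 2 * (INSTR_CLKS + hub_wait_clks)  # one write plus one read
    ns = clocks * CLOCK_NS
    mhz = 1000.0 / ns                          # 1/ns is GHz, so scale to MHz
    return clocks, ns, mhz


best = transfer_estimate(BEST_HUB_WAIT)
worst = transfer_estimate(WORST_HUB_WAIT)
print("best:  %d clks, %.1f ns, %.4f MHz" % best)
print("worst: %d clks, %.1f ns, %.4f MHz" % worst)
```

Changing `CLOCK_NS` to match the actual crystal and PLL settings rescales both figures without touching the clock counts.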
Is this right, or is it worse because semaphores have to be used? And if so, should the writes from COG 1 be done in a chunk to a block of Main RAM and read out to COG 2 in a succeeding chunk to reduce semaphore overhead?
Or is there a global shared resource register I don't see that could be used to pass data in two 4-clock reads and writes by the two COGs? Or possibly you could wire together two sections of the I/O pins and transfer between the I and O registers of the 2 COGs at this max rate that way?
Or am I out of it altogether in my understanding here?
EDIT: Thread moved to Propeller chip forum -Ryan
Comments
As far as using memory goes, you won't need to use semaphores in this situation.
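One way to picture a semaphore-free handoff is a one-slot mailbox with a single writer and a single reader. The sketch below is a plain Python model, not Propeller code, and it rests on assumptions that are not stated in this thread: each long read or write to Main RAM is taken to be atomic, and one extra long is reserved as a "full" flag. `Mailbox`, `producer_step`, `consumer_step` and `run` are hypothetical names.

```python
class Mailbox:
    """Two longs in a pretend Main RAM: a data slot and a full/empty flag."""

    def __init__(self):
        self.data = 0
        self.full = 0  # set to 1 only by the producer, cleared only by the consumer


def producer_step(box, value):
    """Try to post one long; return True if it was accepted."""
    if box.full:        # the consumer has not taken the previous value yet
        return False
    box.data = value    # write the data first...
    box.full = 1        # ...then publish it by setting the flag
    return True


def consumer_step(box):
    """Try to take one long; return it, or None if the slot is empty."""
    if not box.full:
        return None
    value = box.data    # read the data first...
    box.full = 0        # ...then release the slot
    return value


def run(values):
    """Alternate the two sides one step at a time and collect what arrives."""
    box, received, pending = Mailbox(), [], list(values)
    while pending or box.full:
        if pending and producer_step(box, pending[0]):
            pending.pop(0)
        got = consumer_step(box)
        if got is not None:
            received.append(got)
    return received


print(run([10, 20, 30]))
```

Because each side only ever writes its own transition of the flag, neither can overwrite the other's work in this model. With several writers to the same slot that argument no longer holds.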
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life is one giant teacup ride.
So if I understand, the COGs can just read and write the INA and OUTA registers.
Then these 2 registers, seen as separate in the software, are really one hardware port register, which can be accessed as INA for pins set to input through the DIRA register, or as OUTA for pins set as output?
So then this port register or parts of it can act as a shared resource register for transfer of data - as I read you.
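A rough model of that idea, for illustration only. It assumes, going by the usual description of the chip rather than anything stated in this thread, that every COG has its own OUTA and DIRA, that a pin's state is the OR of the outputs of all COGs driving it, and that nothing external drives the pins. The class `Pins` and its layout are hypothetical.

```python
class Pins:
    """Toy model of 32 shared I/O pins as seen by several COGs."""

    def __init__(self, cogs=8):
        self.outa = [0] * cogs  # per-COG output register
        self.dira = [0] * cogs  # per-COG direction register (1 = output)

    def ina(self):
        """Pin states any COG would read: OR of every COG's driven outputs."""
        state = 0
        for out, direction in zip(self.outa, self.dira):
            state |= out & direction  # a COG only drives pins it set to output
        return state & 0xFFFFFFFF


pins = Pins()
pins.dira[0] = 0x000000FF     # COG 0 takes the low 8 pins as outputs
pins.outa[0] = 0x000000A5     # COG 0 puts a byte on them
byte_seen_by_cog1 = pins.ina() & 0xFF   # COG 1 just reads the pins
print(hex(byte_seen_by_cog1))
```

In this model a second COG that also drives one of those pins would OR its bits into the result, so only one COG should drive a given transfer pin at a time.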
Unfortunately I do mostly need the I/O pins for external communication, but maybe I can find some time windows to use some pins as data transfer registers.
Can any other shared special purpose register be used this way?
The counter A control, frequency, and PLL registers can be read and written to by any COG, it seems, but then it seems you need to pick one or two pins for I/O in the control register to activate the counters and the other config registers, and then the counters would be driven crazy by different long values passing through the other registers.
But maybe this wouldn't bother anything if you didn't care about using the counters?
How about the Video config registers? If you weren't using video, could they be used for COG data transfers?
Thanks again.
Suppose I wanted to use a number of cogs to form a sort of "3-d pipeline in reverse" to process vision data for a robot. I.e., I might view the vision processing as a series of data transforms, where each cog in the pipeline bangs on the image data in some way to extract feature data which will be interpreted by the next cog in the pipeline. How much and how fast could I expect to pump data from cog to cog through this pipeline?
To optimize the pipeline for maximum efficiency of stages in the type of application you are interested in, the bandwidth vs. computation graph will look like a supply-demand curve:
IOW, a high bandwidth stage will have very few computations on each data element, and a low bandwidth stage will have a high number of computations on each data element.
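The tradeoff can be put in numbers with a small sketch. The clock rate is derived from the 13.3 ns figure used earlier in the thread, and the per-stage clock costs are invented purely for illustration. `clocks_per_element` and `pipeline_rate` are hypothetical helper names.

```python
CLOCK_HZ = 1e9 / 13.3   # system clock implied by 13.3 ns per clock (assumption)


def clocks_per_element(clock_hz, elements_per_sec):
    """Clocks a stage may spend on each element, transfer time included."""
    return clock_hz / elements_per_sec


def pipeline_rate(clock_hz, stage_clocks):
    """Elements per second through the pipeline, set by its slowest stage."""
    return clock_hz / max(stage_clocks)


# Hypothetical stages: a fast front end, a filter, and a heavy feature extractor.
stage_clocks = [52, 400, 4000]
rate = pipeline_rate(CLOCK_HZ, stage_clocks)
print("pipeline rate: %.0f elements/s" % rate)
print("budget at 1M elements/s: %.1f clocks" % clocks_per_element(CLOCK_HZ, 1e6))
```

A stage that has to keep up with a high element rate gets only a few dozen clocks per element, most of which go to the transfer itself. A stage further down the pipeline that sees far fewer elements can afford thousands.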
Post Edited (Paul Baker) : 7/27/2006 10:48:26 PM GMT