Running two cogs in phase with each other
bob_g4bby
- The goal - compute a cordic function on a 1024 sample array in the least amount of time.
- The problem - moving samples into and out of cog registers cannot be done with RDLONG and WRLONG, because these take a variable number of cycles to execute, and cordic outputs will get missed if the tight program loops vary in cycles.
- So read from hub RAM using the FIFO: set it up with RDFAST, then each RFLONG read always takes two cycles.
- BUT the same cog can't also use WRFAST to write the data back; a cog has only one FIFO and it takes too long to set it up in the opposite direction.
- So use two cogs, running loops of identical size, and run them in phase. One cog does the reads using its FIFO (RDFAST), the other does the writes using its FIFO (WRFAST). Both cogs share LUT.
But how to get both cogs exactly in phase? The following piece of code is an experiment to get two cogs to pulse two smartpins exactly at the same time by the use of WAITATN / COGATN instructions. I had to put one NOP instruction after COGATN to get the two cogs running exactly in phase.
Sharing the data movement has another benefit: each cog only needs to run a 14 cycle loop (a quarter of the cordic process time of 56 cycles). The next best loop time would be 28 cycles - 1/2 the speed. Here's the experiment to test phase lock. I used an oscilloscope on pins P56 and P57 to get the two pulses lined up with each other:-
'' File .......... post WAITATN synchronism test.spin2
'' Version........ 1
'' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /
''=============================================================
CON {timing constants}
_xinfreq = 20_000_000
_clkfreq = 180_000_000
CON {fixed IO pins}
CON {application IO pins}
p56 = 56
p57 = 57
CON {Enumerated constants}
CON {Data Structure definitions}
OBJ {Child Objects}
VAR {Variable Declarations}
{Public / Private methods}
pub main()
  coginit(2,@cog2,0)
  coginit(1,@cog1,0)
DAT {Symbols and Data}
DAT {Data Pointers}
DAT {Cog-exec assembly language}
org
cog1
drvl #p56
waitx ##100000000 ' wait for cog2 to load and execute waitatn
cogatn #4 ' kick cog2 off
nop
cog1a
drvh #p56 ' compare this output pulse on pin 56
drvl #p56
jmp #cog1a
cog2
drvl #p57
waitatn ' wait for cog1 to kick me off
cog2a
drvh #p57 ' with this one on pin 57. Adjust waitx instructions until both pulses are coincident
drvl #p57
jmp #cog2a
DAT {Hub-exec assembly language}
Cordic example to follow ...
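The constants in the listing above can be sanity-checked with a little host-side arithmetic. The following is a Python sketch, not Propeller 2 code, and the function names are made up for illustration. It assumes COGATN takes a bit mask with one bit per cog, and it ignores the couple of clocks of instruction overhead in WAITX.

```python
# Host-side arithmetic only; this is not Propeller 2 code.
# Function names are hypothetical, chosen for this illustration.

def cogatn_mask(*cog_ids):
    """Build a COGATN operand: bit n set means 'signal cog n'."""
    mask = 0
    for cog in cog_ids:
        if not 0 <= cog <= 7:
            raise ValueError("P2 cog ids run from 0 to 7")
        mask |= 1 << cog
    return mask

def waitx_seconds(count, clkfreq):
    """Approximate WAITX delay in seconds (instruction overhead ignored)."""
    return count / clkfreq

print(cogatn_mask(2))                              # operand used by 'cogatn #4'
print(waitx_seconds(100_000_000, 180_000_000))     # start-up delay given to cog2
print(56 // 14, 56 // 28)                          # loop passes per cordic time, per the post's figures
```

With the post's 180 MHz clock, the start-up delay works out to a little over half a second, which is far longer than a cog needs to load and reach WAITATN.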

Comments
Get a clock count in spin, add a couple hundred to it, send the pointer over to the two pasm cogs on startup... have the two pasm routines sit on a waitcnt until the count you passed to them...
If both are doing very similar things, easy way is to use same assembly for both cogs and edit the assembly before launching the second cog.
Use the main cog to give them ATN at same time...
Both good tips, there, thanks. Nice simple techniques.
Your technique is the simplest.
Do note that if you want to use LUT sharing, you need an even/odd pair like 0/1 or 2/3; 1/2 would not work.
There's a special value you can pass to coginit to make it start up such a pair automatically, which is in theory time-aligned +/- some constant. In practice, if you have the debugger going, it will delay an unpredictable amount due to the serial lock.
So here's me starting an odd/even pair of cogs with one command. They both run the same cog code in their respective cog RAMS. However, the cogs read their ID numbers and split off to run 'cogeven' and 'cogodd' code. I've kept my method of synchronising the two loops starting at 'cogeven1' and 'cogodd1':-
'' File .......... post WAITATN synchronism test.spin2
'' Version........ 2
'' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /
''=============================================================
CON {timing constants}
_xinfreq = 20_000_000
_clkfreq = 180_000_000
CON {fixed IO pins}
CON {application IO pins}
p56 = 56
p57 = 57
CON {Enumerated constants}
CON {Data Structure definitions}
OBJ {Child Objects}
VAR {Variable Declarations}
{Public / Private methods}
pub main()
  coginit(COGEXEC_NEW_PAIR, @cogstart, 0 ) ' start two cogs with linked LUT reads
DAT {Symbols and Data}
DAT {Data Pointers}
DAT {Cog-exec assembly language}
org
cogstart
cogid _cogid
testb _cogid, #0 WZ
IF_Z jmp #cogodd
cogeven
drvl #p56
waitx ##100000000 ' wait for odd cog to load and execute waitatn
add _cogid, #1 ' now the odd cog number
mov pr0, #0
bith pr0, _cogid
cogatn pr0 ' wake up the odd number cog
nop ' just two cycle delay needed for output pins to be in sync
cogeven1
drvh #p56 ' compare this output pulse on pin 56
drvl #p56
jmp #cogeven1
cogodd
drvl #p57
setluts #1 ' writes to even cog LUT will also write to the odd cog LUT
waitatn ' wait for even cog to kick me off
cogodd1
drvh #p57 ' with this one on pin 57
drvl #p57
jmp #cogodd1
_cogid res 1
DAT {Hub-exec assembly language}

@bob_g4bby, another approach would be to use a block transfer to copy 512 consecutive samples into the LUT of each cog (the array split between the two cogs), do the calculation in both cogs in parallel, and then copy the results back to hub with another block transfer. Then you would only need to wake up the second cog with a COGATN instruction to start its work, and you avoid the need for exact synchronization.
As the runtime of the calculation and data transfer time is the same in each cog, both will end at the same time.
The block transfer is the fastest memory transfer on the P2; from the second long onward it takes only one clock per long copied.
'takes 6 + 9 ... 16 + 511 = 526 ... 533 clocks to copy 512 longs from hub
mov ptrb,##ptr_hub
setq2 #512-1
rdlong $200-$200,ptrb++ 'copy from hub RAM to LUT
'takes 6 + 3 .. 10 + 511 = 520 ... 527 clocks to copy 512 longs to hub
mov ptrb,##ptr_hub
setq2 #512-1
wrlong $200-$200,ptrb++ 'copy back from LUT to hub RAM
To modify the data in LUT RAM you can use RDLUT/WRLUT, which together take 5 clocks for each sample.
The fast block move from LUT RAM takes 6 + 4...11 + 511 = 521... 528 cycles. I tested this thoroughly just a few days ago and the first write from LUT RAM takes one more cycle than from register RAM in the worst case. The reason for this difference is because it takes three cycles, not two, to read from LUT RAM.
The extra cycle causes the hub RAM slot for a 3 cycle write to be missed, there is a wait of one hub RAM revolution and therefore the write takes 3 + 8 = 11 cycles. This applies to only one of the eight hub RAM egg beater slices. For the other seven the writes take 4...10 cycles and are identical for reg and LUT RAM.
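The cycle ranges quoted in the last few posts all follow one pattern: a fixed 6-clock term, a variable cost for the first long, then one clock for each remaining long. Here is a small Python sketch of that arithmetic (host-side only; the function name is hypothetical, and the 6-clock fixed term is taken from the posts as given rather than derived here):

```python
# Host-side arithmetic only; reproduces the ranges quoted in the thread.

def block_transfer_cycles(longs, first_min, first_max, fixed=6):
    """Return (best, worst) clocks for a SETQ2 block transfer of `longs` longs.

    fixed      - fixed term quoted in the thread (assumed, not derived here)
    first_min  - fastest possible first long, depends on hub slot alignment
    first_max  - slowest possible first long
    Every long after the first costs one clock.
    """
    remaining = longs - 1
    return (fixed + first_min + remaining, fixed + first_max + remaining)

hub_to_lut = block_transfer_cycles(512, 9, 16)   # RDLONG into LUT
reg_to_hub = block_transfer_cycles(512, 3, 10)   # WRLONG from register RAM
lut_to_hub = block_transfer_cycles(512, 4, 11)   # WRLONG from LUT, one clock slower

print("hub -> LUT:", hub_to_lut)
print("reg -> hub:", reg_to_hub)
print("LUT -> hub:", lut_to_hub)
```

The one-clock difference between the last two lines is the extra LUT read cycle described above.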
That's some useful timing info. I too had wondered whether block move into and out of LUT was the way to go. Assuming the fast block move into LUT takes the same time as the move out (521 .. 528 cycles each way), we have a total of 1042 .. 1056 cycles spent just pushing the data around, on top of the time doing the cordic function. You've highlighted another difference in speed there - reading LUT RAM takes 3 cycles, whereas reading and writing hub RAM through each cog's FIFO (set up with RDFAST and WRFAST) takes 2 cycles. Sounds like another overhead there too. Is it possible that calculations using LUT RAM are slower overall than calculating directly in hub RAM?
I'll get down to writing the cordic code, and publish it here. I envisage it turning into a bit of a library for doing dsp things on arrays of data - like audio or software radio streams etc.
@bob_g4bby, yes that is correct, as each cog will process only 512 longs. But your approach takes about four times as many cycles to move the data, as one cog must read all 1024 longs and the second cog must write all of them. Hence it will take 2048 clocks to read and 2048 clocks to write the data, in addition to the time spent doing the cordic function.
One can use the FIFO with WFLONG on both cogs to write the result back to hub, as all the data to be processed is already in the LUT RAM. This would avoid writing the result to LUT (saving 2 clocks for each sample), which is the same time it takes to write the data using the FIFO.
Unfortunately reading a LUT sample needs 3 clocks.
My approach is quite easy as the code used in both cogs is identical. Only the pointer to read and write the 512 longs block is different.
With your approach you need to hand over the data to the second cog for the calculation, and also the result of the calculation from the first cog, so that both can be written back to hub afterwards. I think this will add overhead.
@bob_g4bby, one thing I did not mention before, and which is important to know when using your approach, is that you need an additional 5120 clocks to move the data (1024 longs) from the first cog to the second cog. Even if you use the LUT to share the data between the two cogs, you will not get the data transfer for free.
As I already mentioned, it takes 5 clocks per long to write and read the LUT, which is necessary in your approach.
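The data-movement figures quoted for the two-cog FIFO approach are simple products of the sample count and the per-long costs given in the thread. A short Python sketch of that arithmetic follows (host-side only; the constant names are made up for illustration, and the per-long costs are the thread's figures, not measurements made here):

```python
# Host-side arithmetic only; per-long costs are the figures quoted in the thread.

SAMPLES = 1024              # longs in the array
FIFO_CLOCKS_PER_LONG = 2    # FIFO read or write of one long
LUT_HANDOVER_CLOCKS = 5     # WRLUT in one cog plus RDLUT in the other, per long

fifo_read_clocks = SAMPLES * FIFO_CLOCKS_PER_LONG    # reader cog, whole array
fifo_write_clocks = SAMPLES * FIFO_CLOCKS_PER_LONG   # writer cog, whole array
handover_clocks = SAMPLES * LUT_HANDOVER_CLOCKS      # passing data through shared LUT

print("FIFO read :", fifo_read_clocks)
print("FIFO write:", fifo_write_clocks)
print("handover  :", handover_clocks)
```

The three figures are printed separately rather than summed, because the two cogs run concurrently and the posts do not say how much of this time overlaps.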
The conclusion is that the less data each cog has to process, the better the performance of the calculation. If you need speed for the calculation, consider using more cogs.
With my approach the calculation is easily scalable, as the code is identical for each cog. Each cog calculates a small part of the data.
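To get a rough feel for how the block-transfer approach scales, the worst-case transfer cost per cog can be estimated for different cog counts. This Python sketch is host-side arithmetic only and rests on several assumptions: the worst-case first-long figures from earlier in the thread (16 clocks into LUT, 11 clocks out), a 6-clock fixed term per transfer, an even split of the 1024 longs, and no interaction between cogs. The function name is hypothetical, and the cordic time itself is not included.

```python
# Host-side estimate only; cordic processing time is deliberately left out.

def worst_case_round_trip(longs_per_cog, fixed=6, first_in=16, first_out=11):
    """Worst-case clocks for one cog to load its block into LUT and store it back."""
    load = fixed + first_in + (longs_per_cog - 1)    # hub -> LUT
    store = fixed + first_out + (longs_per_cog - 1)  # LUT -> hub
    return load + store

TOTAL_LONGS = 1024
for cogs in (2, 4, 8):
    per_cog = TOTAL_LONGS // cogs
    print(cogs, "cogs:", per_cog, "longs each,",
          worst_case_round_trip(per_cog), "clocks of block transfer per cog")
```

Under these assumptions the transfer cost per cog roughly halves each time the cog count doubles, since the fixed and first-long terms are small next to the one-clock-per-long term.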