Running two cogs in phase with each other
bob_g4bby
- The goal - compute a cordic function on a 1024-sample array in the least amount of time.
- The problem - samples cannot be moved into and out of cog registers with RDLONG and WRLONG, because these take a variable number of cycles to execute, and cordic outputs will get missed if the tight program loops vary in length.
- So read from hub RAM through the FIFO, set up with RDFAST; each FIFO read then always takes two cycles.
- BUT the same cog can't use WRFAST to write the data back: switching the FIFO to the opposite direction takes too long.
- So use two cogs, running loops of identical size, and run them in phase. One cog does the reads through its FIFO (RDFAST), the other does the writes through its FIFO (WRFAST). Both cogs share LUT.
But how to get both cogs exactly in phase? The following piece of code is an experiment to get two cogs to pulse two smartpins exactly at the same time by the use of WAITATN / COGATN instructions. I had to put one NOP instruction after COGATN to get the two cogs running exactly in phase.
Sharing the data movement has another benefit: each cog only needs to run a 14-cycle loop (a quarter of the 56-cycle cordic process time, so four iterations fit into it). The next best loop time would be 28 cycles, which is half the speed. Here's the experiment to test phase lock. I used an oscilloscope on pins P56 and P57 to get the two pulses lined up with each other:-
'' File .......... post WAITATN synchronism test.spin2
'' Version........ 1
'' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /
''=============================================================
CON {timing constants}
_xinfreq = 20_000_000
_clkfreq = 180_000_000
CON {fixed IO pins}
CON {application IO pins}
p56 = 56
p57 = 57
CON {Enumerated constants}
CON {Data Structure definitions}
OBJ {Child Objects}
VAR {Variable Declarations}
{Public / Private methods}
pub main()
  coginit(2,@cog2,0)
  coginit(1,@cog1,0)
DAT {Symbols and Data}
DAT {Data Pointers}
DAT {Cog-exec assembly language}
        org
cog1
        drvl #p56
        waitx ##100000000 ' wait for cog2 to load and execute waitatn
        cogatn #4 ' kick cog2 off
        nop
cog1a
        drvh #p56 ' compare this output pulse on pin 56
        drvl #p56
        jmp #cog1a
cog2
        drvl #p57
        waitatn ' wait for cog1 to kick me off
cog2a
        drvh #p57 ' with this one on pin 57. Adjust waitx instructions until both pulses are coincident
        drvl #p57
        jmp #cog2a
DAT {Hub-exec assembly language}
Cordic example to follow ...

Comments
Get a clock count in spin, add a couple hundred to it, send the pointer over to the two pasm cogs on startup... have the two pasm routines sit on a waitcnt until the count you passed to them...
If both are doing very similar things, easy way is to use same assembly for both cogs and edit the assembly before launching the second cog.
Use the main cog to give them ATN at same time...
Both good tips, there, thanks. Nice simple techniques.
Your technique is the simplest.
Do note that if you want to use LUT sharing, you need an even/odd pair like 0/1 or 2/3; 1/2 would not work.
There's a special value you can pass to coginit to make it start up such a pair automatically. In theory the two cogs are time-aligned +/- some constant; in practice, if you have the debugger going, it will delay them by an unpredictable amount due to the serial lock.
So here's me starting an odd/even pair of cogs with one command. They both run the same cog code in their respective cog RAMS. However, the cogs read their ID numbers and split off to run 'cogeven' and 'cogodd' code. I've kept my method of synchronising the two loops starting at 'cogeven1' and 'cogodd1':-
'' File .......... post WAITATN synchronism test.spin2
'' Version........ 2
'' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /
''=============================================================
CON {timing constants}
_xinfreq = 20_000_000
_clkfreq = 180_000_000
CON {fixed IO pins}
CON {application IO pins}
p56 = 56
p57 = 57
CON {Enumerated constants}
CON {Data Structure definitions}
OBJ {Child Objects}
VAR {Variable Declarations}
{Public / Private methods}
pub main()
  coginit(COGEXEC_NEW_PAIR, @cogstart, 0 ) ' start two cogs with linked LUT reads
DAT {Symbols and Data}
DAT {Data Pointers}
DAT {Cog-exec assembly language}
        org
cogstart
        cogid _cogid
        testb _cogid, #0 WZ
        IF_Z jmp #cogodd
cogeven
        drvl #p56
        waitx ##100000000 ' wait for odd cog to load and execute waitatn
        add _cogid, #1 ' now the odd cog number
        mov pr0, #0
        bith pr0, _cogid
        cogatn pr0 ' wake up the odd number cog
        nop ' just two cycle delay needed for output pins to be in sync
cogeven1
        drvh #p56 ' compare this output pulse on pin 56
        drvl #p56
        jmp #cogeven1
cogodd
        drvl #p57
        setluts #1 ' writes to even cog LUT will also write to the odd cog LUT
        waitatn ' wait for even cog to kick me off
cogodd1
        drvh #p57 ' with this one on pin 57
        drvl #p57
        jmp #cogodd1
_cogid res 1
DAT {Hub-exec assembly language}

@bob_g4bby, another approach would be to use block transfer to copy 512 consecutive samples (half of the array for each cog) into the LUT of each cog, then do the calculation in both cogs in parallel and copy the results back to hub with block transfer. Then you would only need to wake up the second cog with a COGATN instruction to start its work, and you avoid exact synchronization.
As the calculation time and the data transfer time are the same in each cog, both will finish at the same time.
Block transfer is the fastest memory transfer on the P2: from the second long onwards, each long copied takes only one clock.
'takes 6 + 9 ... 16 + 511 = 526 ... 533 clocks to copy 512 longs from hub
        mov ptrb,##ptr_hub
        setq2 #512-1
        rdlong $200-$200,ptrb++ 'copy from hub RAM to LUT
'takes 6 + 3 .. 10 + 511 = 520 ... 527 clocks to copy 512 longs to hub
        mov ptrb,##ptr_hub
        setq2 #512-1
        wrlong $200-$200,ptrb++ 'copy back from LUT to hub RAM

To modify the data in LUT RAM you can use RDLUT/WRLUT, which take 5 clocks in combination for each sample.
The fast block move from LUT RAM takes 6 + 4...11 + 511 = 521... 528 cycles. I tested this thoroughly just a few days ago and the first write from LUT RAM takes one more cycle than from register RAM in the worst case. The reason for this difference is because it takes three cycles, not two, to read from LUT RAM.
The extra cycle causes the hub RAM slot for a 3-cycle write to be missed, so there is a wait of one hub RAM revolution and the write takes 3 + 8 = 11 cycles. This applies to only one of the eight hub RAM egg beater slices. For the other seven the writes take 4...10 cycles and are identical for reg and LUT RAM.
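The three block-move figures quoted above share one pattern: 6 clocks up front, a first transfer whose length depends on the hub RAM slot, then one clock for each of the remaining 511 longs. As a cross-check of the arithmetic, here is a minimal Python sketch. The function name is made up for illustration, and the first-transfer ranges are simply the ones stated in the posts, not new measurements.

```python
def block_move_clocks(first_min, first_max, longs=512, setup=6):
    """Best/worst-case clocks for a SETQ2 block move, using the thread's figures.

    setup     : the leading 6 clocks quoted in the posts
    first_*   : range for the first long, which depends on hub slot alignment
    longs - 1 : every further long is assumed to take exactly one clock
    """
    rest = longs - 1
    return setup + first_min + rest, setup + first_max + rest


hub_to_lut = block_move_clocks(9, 16)   # RDLONG from hub into LUT
reg_to_hub = block_move_clocks(3, 10)   # WRLONG figure that holds for register RAM
lut_to_hub = block_move_clocks(4, 11)   # WRLONG from LUT RAM, one cycle worse

for name, (best, worst) in [("hub -> LUT", hub_to_lut),
                            ("reg -> hub", reg_to_hub),
                            ("LUT -> hub", lut_to_hub)]:
    print(f"{name}: {best} ... {worst} clocks")
```

The LUT-to-hub case differs from the register case only in the first transfer, which is why the two ranges are offset by a single clock.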
That's some useful timing info. I too had wondered whether block move into and out of LUT was the way to go. Assuming the fast block move out of LUT takes the same time, we have a total of 1042 .. 1056 cycles spent just pushing the data around on top of the time doing the cordic function. You've highlighted another difference in speed there - reading and writing LUT ram takes 3 cycles whereas using the FIFOs in each cog, reading and writing Hub ram with WRFAST and RDFAST takes 2 cycles. Sounds like another overhead there too. It's possible that calculations using LUT ram are slower overall than calculating directly in Hub RAM?
I'll get down to writing the cordic code, and publish it here. I envisage it turning into a bit of a library for doing dsp things on arrays of data - like audio or software radio streams etc.
@bob_g4bby, yes, that is correct, as each cog will process only 512 longs. But with your approach it takes about four times as many cycles to move the data, as one cog must read all 1024 longs and the second cog must write all of them. Hence it will take 2048 clocks to read and 2048 clocks to write the data, in addition to the time spent on the cordic function.
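To put numbers on the "four times" remark, the sketch below just redoes the arithmetic from the posts in Python. It assumes 2 clocks per long through the FIFO and the 526 ... 533 clock block read quoted earlier for 512 longs. Variable names are for illustration only.

```python
SAMPLES = 1024            # whole array, handled by one cog per direction
FIFO_CLOCKS_PER_LONG = 2  # per-long FIFO cost, as stated in this thread

fifo_read = SAMPLES * FIFO_CLOCKS_PER_LONG    # one cog reads every sample
fifo_write = SAMPLES * FIFO_CLOCKS_PER_LONG   # the other cog writes every sample

# Block read of 512 longs per cog, best and worst case as quoted earlier
block_read_best, block_read_worst = 526, 533

ratio_low = fifo_read / block_read_worst
ratio_high = fifo_read / block_read_best

print(f"FIFO read: {fifo_read} clocks, FIFO write: {fifo_write} clocks")
print(f"FIFO read is {ratio_low:.2f}x to {ratio_high:.2f}x the block read")
```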
One can use the FIFO with WFLONG on both cogs to write the result back to hub, as all the data to be processed is already in LUT RAM. This avoids writing the result to LUT (saving 2 clocks per sample), which is the same time needed to write the data using the FIFO.
Unfortunately reading a LUT sample needs 3 clocks.
My approach is quite easy, as the code used in both cogs is identical. Only the pointer used to read and write the 512-long block is different.
With your approach you need to hand the data over to the second cog for the calculation, and also the result of the calculation from the first cog, so that both can be written back to hub afterwards. I think this will add overhead.
@bob_g4bby, one thing I did not mention before, and which is important to know when using your approach, is that you need an additional 5120 clocks to move the data (1024 longs) from the first cog to the second cog. Even if you use the LUT to share the data between the two cogs, you will not get the data transfer for free.
As I already mentioned, it takes 5 clocks per long to write and read the LUT, which is necessary in your approach.
The conclusion is that the less data each cog processes, the better the performance of the calculation. If you need speed, consider using more cogs.
With my approach the calculation is easily scalable, as the code is identical for each cog. Each cog calculates a small part of the data.
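For a rough feel of how the split scales, the Python sketch below assumes the block-move pattern quoted earlier (6 clocks, a 9 ... 16 clock first read or a 4 ... 11 clock first write from LUT, then one clock per further long) also holds for block sizes other than 512, which nobody in this thread has measured. It adds the 5 clocks per sample for RDLUT/WRLUT and ignores the cordic time itself. The function name is hypothetical.

```python
LUT_LONGS = 512  # each cog's share has to fit in its LUT RAM


def per_cog_move_clocks(total_samples, cogs):
    """Worst-case clocks one cog spends moving and touching its share of the data."""
    n = total_samples // cogs            # longs handled by each cog
    if n > LUT_LONGS:
        raise ValueError("share does not fit in LUT RAM")
    block_in = 6 + 16 + (n - 1)          # hub -> LUT, worst case
    block_out = 6 + 11 + (n - 1)         # LUT -> hub, worst case
    lut_access = 5 * n                   # RDLUT + WRLUT per sample
    return block_in + block_out + lut_access


for cogs in (2, 4, 8):
    print(f"{cogs} cogs: {per_cog_move_clocks(1024, cogs)} clocks per cog")

# Cost quoted above for handing all 1024 longs from one cog to the other via LUT
handover = 1024 * 5
print(f"LUT handover in the two-cog FIFO scheme: {handover} clocks")
```

Halving the share per cog roughly halves the per-cog clocks, because the fixed part of each block move is small compared with the per-long part.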
This code achieves true rendezvous between two cogs, no matter which arrives first. The other will wait for it. Here's the code I used to test the idea:-
'' File .......... rendezvous.spin2
'' Version........ 2
'' Purpose........ 'rendezvous' code to synchronise two cogs instruction streams, no matter who arrives first
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /
''=============================================================
{Spin2_v50}
CON {timing constants}
_xinfreq = 20_000_000
_clkfreq = 180_000_000
CON {fixed IO pins}
CON {application IO pins}
p56 = 56
p57 = 57
CON {Enumerated constants}
CON {Data Structure definitions}
OBJ {Child Objects}
VAR {Variable Declarations}
{Public / Private methods}
pub main()
  coginit(COGEXEC_NEW_PAIR, @cogstart, ptra ) ' start two unused adjacent cogs with linked LUT
DAT {Symbols and Data}
DAT {Data Pointers}
DAT {Cog-exec assembly language}
'Two low-going pulses on pin 56 and 57 remain in step on the same clock edge, no matter how out-of-step two cogs are to start with
' Can be run with or without DEBUG
        org
cogstart
        cogid _cogid
        testb _cogid, #0 WZ
        IF_Z jmp #cogodd ' split the two cogs out to each do their own code
cogeven
        drvh #p56
        getrnd pr1
        and pr1, ##$FFFFF
        waitx pr1 ' wait a random time - cogs severely out of step
        cogid _cogid
        add _cogid, #1 ' this is the odd cog ID
        mov pr0, #0
        bith pr0, _cogid ' bit pattern required to ATN the odd cog
        cogatn pr0 ' rendezvous code
        waitatn ' rendezvous code
cogeven1
        drvl #p56 ' compare this output pulse on pin 56 ...
        waitx ##1000
        drvh #p56
        debug("Cog 2 here") ' just so we know it's running ok
        jmp #cogeven
cogodd
        drvh #p57
        getrnd pr1
        and pr1, ##$FFFFF
        waitx pr1 ' wait a random time - cogs severely out of step
        cogid _cogid
        sub _cogid, #1 ' this is the even cog ID
        mov pr0, #0
        bith pr0, _cogid ' bit pattern required to ATN the even cog
        waitatn ' rendezvous code
        cogatn pr0 ' rendezvous code
        nop ' rendezvous code
cogodd1
        drvl #p57 ' ... with this one on pin 57 - they always remain in step on the same clock pulse
        waitx ##1000
        drvh #p57
        debug("Cog 3 here") ' just so we know it's running ok
        jmp #cogodd
_cogid res 1
DAT {Hub-exec assembly language}

The attached pdf explains the rendezvous code a little more - I'm quite proud of this! - cheers, Bob
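Why the handshake works whichever cog arrives first can be seen in a toy model. The Python sketch below is not P2 code: it assumes that an attention request stays latched in the receiving cog until WAITATN consumes it, and it treats every instruction as taking zero time, so the fixed offset that the NOP compensates for is left out. Function and variable names are made up.

```python
def rendezvous(even_arrival, odd_arrival):
    """Return the clock at which (even cog, odd cog) leave the rendezvous.

    Idealised model: ATN is a latched flag and instructions take no time.
    """
    # Even cog: COGATN first, so the odd cog's flag is set on arrival.
    odd_flag_set_at = even_arrival
    # Odd cog: WAITATN returns once it has arrived AND its flag is set.
    odd_release = max(odd_arrival, odd_flag_set_at)
    # Odd cog then issues COGATN, which sets the even cog's flag.
    even_flag_set_at = odd_release
    # Even cog: WAITATN returns once it has arrived AND its flag is set.
    even_release = max(even_arrival, even_flag_set_at)
    return even_release, odd_release


for even_t, odd_t in [(100, 5000), (5000, 100), (42, 42)]:
    e, o = rendezvous(even_t, odd_t)
    print(f"even arrives {even_t}, odd arrives {odd_t} -> released at {e} and {o}")
```

In every case both cogs are released at the later of the two arrival times, which is the "no matter who arrives first" property described above.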
So, having synchronised the code execution of two cogs with a common LUT:-
1. Cog 1 writes a long to LUT with the WRLUT instruction (2 cycles)
2. Cog 2 reads the same long from LUT with the RDLUT instruction (3 cycles)
How soon after WRLUT executes can RDLUT execute and still read the data correctly?
The answer turns out to be that RDLUT can execute on Cog 2 at the same time as WRLUT is executing on Cog 1. This is good news, allowing really tight coding.
The data sent by that WRLUT will be received by the RDLUT reliably.
Here's my test code. Run with debug and observe the data count continues. The program will stop if the WRLUT / RDLUT data is mismatched. Notice I've used pulses on pin 56 and 57 to verify timing is spot on.
'' File .......... WRLUT-RDLUT test.spin2
'' Version........ 2
'' Purpose........ How soon after a WRLUT instruction in one cog can the other cog RDLUT that data?
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ 8/1/26
'' Latest update.. / /
''=============================================================
{Spin2_v50}
CON {timing constants}
_xinfreq = 20_000_000
_clkfreq = 180_000_000
CON {fixed IO pins}
CON {application IO pins}
p56 = 56
p57 = 57
CON {Enumerated constants}
#0, cog2val, cog3val, paramcount
CON {Data Structure definitions}
OBJ {Child Objects}
VAR {Variable Declarations}
long param[paramcount]
{Public / Private methods}
pub main()
  repeat
    ' waitms(100)
    param[cog2val] := param[cog2val] + 1
    coginit(COGEXEC_NEW_PAIR, @cogstart, @param ) ' start two unused adjacent cogs with linked LUT
    waitms(1)
    debug(udec(param[cog2val], param[cog3val]))
  until (param[cog2val]<>param[cog3val]) ' Stop the program if the WRLUT / RDLUT data disagrees
DAT {Symbols and Data}
DAT {Data Pointers}
DAT {Cog-exec assembly language}
        org
cogstart
        drvl #p56
        drvl #p57
        cogid _cogid
        testb _cogid, #0 WZ
        IF_Z jmp #cogodd
cogeven
        add _cogid, #1 ' this is the odd cog ID
        setluts #1 ' all writes from the odd cog will be applied to the even cog LUT too
        mov pr1, #0
        bith pr1, _cogid ' bit pattern required to ATN the odd cog
cogeven1
        rdlong pr0, ptra[cog2val]
        cogatn pr1 ' rendezvous code
        waitatn ' rendezvous code
        drvh #p56 ' this code is in sync with the code below *****
        wrlut pr0, #countlut ' and this
        drvl #p56 ' but not this because wrlut takes 2 cycles
        debug("Cog 2 here")
        cogid _cogid
        cogstop _cogid
cogodd
        cogid _cogid
        sub _cogid, #1 ' this is the even cog id
        mov pr0, #0
        bith pr0, _cogid ' bit pattern required to ATN the even cog
        setluts #1 ' all writes from the even cog will be applied to the odd cog LUT too
'
cogodd1
        waitatn ' rendezvous code
        cogatn pr0 ' rendezvous code
        nop ' rendezvous code
        drvh #p57 ' ***** this code with the code above
        rdlut pr0, #countlut ' and this (so WRLUT and RDLUT happen simultaneously and the data is passed over correctly!)
        drvl #p57 ' but not this because rdlut takes 3 cycles
        wrlong pr0, ptra[cog3val]
        debug("Cog 3 here")
        cogid _cogid
        cogstop _cogid
' COG memory
_cogid res 1
count res 1
' LUT memory
        org
countlut res 1
DAT {Hub-exec assembly language}

I've implemented cartesian to polar conversion in my growing dsp library. For speed, I start a pair of cogs with shared LUT RAM. These are kicked into life with a COGATN in library method "xytopol", which waits until it receives a COGATN back from the cog pair to signal conversion complete. The cog pair collaborate closely to feed data in and out of the qvector cordic instructions. My "rendezvous" codelet is used to start them off together and various nops and waitx keep the cogs synced pretty well. I have to admit, my initial timing planned out in a spreadsheet was way off and the polar results were sometimes failing. I cheated, resorting to my trusty Chinese oscilloscope to measure various pulses output on P56 and P57 to debug the timing. The delays you see got the timing close enough. Ah well!
The following demo runs very well under Spin Tools 0.51.0. Don't use 0.52.0 if your OS is Windows: the scope displays have gone very slow. I'm sure that'll get fixed soon. Run with ctrl-F10.
Key to scope displays:-
Scope 1 100Hz sinewave
Scope 2 150Hz sinewave
Scope 3 100Hz sine multiplied by 150Hz sine (modulation)
Scope 4 100Hz sine added to 150Hz sine (Mixing)
Scope 5 Conversion of 100Hz sine to polar coordinates. Real is the magnitude (which dithers a bit around 1000) and Imag is the angle, which advances linearly until it reaches full circle, then drops back to zero again - all as expected. I've had it soaking for a bit.
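For anyone wanting to sanity-check what Scope 5 should show, here is a small floating-point model in Python. It assumes the 100 Hz input is a quadrature pair (cosine as X, sine as Y) of amplitude 1000 and picks an arbitrary 8 kHz sample rate; neither detail is stated above. math.hypot and math.atan2 stand in for the cordic, so there is no dither in this model.

```python
import math

SAMPLE_RATE = 8000   # Hz, arbitrary choice for this illustration
FREQ = 100           # Hz, as on Scope 1
AMPLITUDE = 1000     # assumed, to match the magnitude seen on Scope 5


def xy_to_polar(x, y):
    """Cartesian to polar: returns (magnitude, angle in radians, 0 <= angle < 2*pi)."""
    return math.hypot(x, y), math.atan2(y, x) % (2 * math.pi)


magnitudes = []
angles = []
for n in range(SAMPLE_RATE // FREQ):          # one full cycle of the sine
    phase = 2 * math.pi * FREQ * n / SAMPLE_RATE
    mag, ang = xy_to_polar(AMPLITUDE * math.cos(phase),
                           AMPLITUDE * math.sin(phase))
    magnitudes.append(mag)
    angles.append(ang)

print(f"magnitude range: {min(magnitudes):.3f} ... {max(magnitudes):.3f}")
print(f"angle range: {angles[0]:.3f} ... {angles[-1]:.3f} rad")
```

The magnitude stays flat and the angle ramps steadily over the cycle before wrapping, which is the shape described for Scope 5.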
Next step is to add the companion polar to cartesian conversion - maybe by self-modifying code to save space, swapping the Qvector instructions for Qrotate.
@Kaio, I think your approach and this example from Chip point the way to a two-times-faster single-cog solution to array cordics. I'll have another go at these methods with that in mind.