Running two cogs in phase with each other

bob_g4bby · 2025-12-16 11:04

The goal - compute a cordic function on a 1024-sample array in the least amount of time.
The problem - moving samples into and out of cog registers cannot be done with RDLONG and WRLONG, because these take a variable number of cycles to execute and cordic outputs will get missed if the tight program loops vary in cycles.
So read from hub RAM using the FIFO with RDFAST, which always takes two cycles.
BUT we can't also use WRFAST to write the data back, because it takes too long to set the FIFO up in the opposite direction.
So use two cogs, running loops of identical size, and run them in phase. One cog does the reads (RDFAST) using its FIFO, the other does the writes (WRFAST) using its FIFO. Both cogs share LUT.

But how to get both cogs exactly in phase? The following piece of code is an experiment to get two cogs to pulse two smartpins at exactly the same time using the WAITATN / COGATN instructions. I had to put one NOP instruction after COGATN to get the two cogs running exactly in phase.

Sharing the data movement has another benefit: each cog only needs to run a 14-cycle loop (a quarter of the 56-cycle cordic process time). The next best loop time would be 28 cycles - half the speed. Here's the experiment to test phase lock. I used an oscilloscope on pins P56 and P57 to get the two pulses lined up with each other:-

'' File .......... post WAITATN synchronism test.spin2
'' Version........ 1
'' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /

''=============================================================

CON {timing constants}
        _xinfreq =  20_000_000
        _clkfreq = 180_000_000


CON {fixed IO pins}

CON {application IO pins}
        p56 = 56
        p57 = 57

CON {Enumerated constants}

CON {Data Structure definitions}

OBJ {Child Objects}

VAR {Variable Declarations}

{Public / Private methods}

pub main()
    coginit(2,@cog2,0)
    coginit(1,@cog1,0)

DAT {Symbols and Data}

DAT {Data Pointers}

DAT {Cog-exec assembly language}

        org


cog1
        drvl #p56
        waitx ##100000000               ' wait for cog2 to load and execute waitatn
        cogatn #4                       ' kick cog2 off
        nop
cog1a
        drvh #p56                       ' compare this output pulse on pin 56
        drvl #p56
        jmp #cog1a

cog2
        drvl #p57
        waitatn                         ' wait for cog1 to kick me off
cog2a
        drvh #p57                       ' with this one on pin 57. Adjust waitx instructions until both pulses are coincident
        drvl #p57
        jmp #cog2a


DAT {Hub-exec assembly language}

Cordic example to follow ...
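
A quick arithmetic check of the loop-time figures above, written as a small Python sketch. The 56-cycle cordic time, the 14- and 28-cycle loops and the 1024-sample array come from the post; the function names are made up for illustration, and the totals ignore setup and pipeline-drain time.

```python
CORDIC_CYCLES = 56   # cordic process time quoted in the post
SAMPLES = 1024       # array size quoted in the post

def ops_in_flight(loop_cycles):
    """Cordic operations that overlap when one is issued every loop_cycles."""
    return CORDIC_CYCLES // loop_cycles

def total_cycles(loop_cycles, samples=SAMPLES):
    """Steady-state issue time for the whole array (ignores setup and drain)."""
    return loop_cycles * samples

for loop in (14, 28):
    print(loop, ops_in_flight(loop), total_cycles(loop))
```

With these numbers the 14-cycle loop keeps four cordic operations in flight and needs 14336 cycles of issue time for the array, against two in flight and 28672 cycles for the 28-cycle loop.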

mwroberts · 2025-12-16 15:33

Get a clock count in spin, add a couple hundred to it, send the pointer over to the two pasm cogs on startup... have the two pasm routines sit on a waitcnt until the count you passed to them...

Rayman · 2025-12-16 16:49

If both are doing very similar things, easy way is to use same assembly for both cogs and edit the assembly before launching the second cog.
Use the main cog to give them ATN at same time...

bob_g4bby · 2025-12-16 19:37

Both good tips, there, thanks. Nice simple techniques.

TonyB_ · 2025-12-16 20:58

@bob_g4bby said:
Both good tips, there, thanks. Nice simple techniques.

Your technique is the simplest.

Wuerfel_21 · 2025-12-16 22:09

Do note that if you want to use LUT sharing, you need an odd/even pair like 0/1 or 2/3; 1/2 would not work.
There's a special value you can pass to coginit to make it start up such a pair automatically, which is in theory time-aligned +/- some constant. In practice, if you have the debugger going, it will delay by an unpredictable amount due to the serial lock.

bob_g4bby · 2025-12-17 17:22

So here's me starting an odd/even pair of cogs with one command. They both run the same cog code in their respective cog RAMs. However, the cogs read their ID numbers and split off to run the 'cogeven' and 'cogodd' code. I've kept my method of synchronising the two loops starting at 'cogeven1' and 'cogodd1':-

'' File .......... post WAITATN synchronism test.spin2
'' Version........ 2
'' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /

''=============================================================

CON {timing constants}
        _xinfreq =  20_000_000
        _clkfreq = 180_000_000


CON {fixed IO pins}

CON {application IO pins}
        p56 = 56
        p57 = 57

CON {Enumerated constants}

CON {Data Structure definitions}

OBJ {Child Objects}

VAR {Variable Declarations}

{Public / Private methods}

pub main()
    coginit(COGEXEC_NEW_PAIR, @cogstart, 0 )        ' start two cogs with linked LUT reads

DAT {Symbols and Data}

DAT {Data Pointers}

DAT {Cog-exec assembly language}

        org
cogstart
        cogid _cogid
        testb _cogid, #0 WZ
IF_Z    jmp #cogodd

cogeven
        drvl #p56
        waitx ##100000000               ' wait for odd cog to load and execute waitatn
        add _cogid, #1                 ' now the odd cog number
        mov pr0, #0
        bith pr0, _cogid
        cogatn pr0                      ' wake up the odd number cog
        nop                             ' just two cycle delay needed for output pins to be in sync
cogeven1
        drvh #p56                       ' compare this output pulse on pin 56
        drvl #p56
        jmp #cogeven1

cogodd
        drvl #p57
        setluts #1                      ' writes to even cog LUT will also write to the odd cog LUT
        waitatn                         ' wait for even cog to kick me off
cogodd1
        drvh #p57                       ' with this one on pin 57
        drvl #p57
        jmp #cogodd1

        _cogid      res 1

DAT {Hub-exec assembly language}

Kaio · 2025-12-18 21:32

@bob_g4bby said:

The goal - compute a cordic function on a 1024-sample array in the least amount of time.

The problem - moving samples into and out of cog registers cannot be done with RDLONG and WRLONG, because these take a variable number of cycles to execute and cordic outputs will get missed if the tight program loops vary in cycles.

@bob_g4bby, another approach would be to use block transfer to copy 512 consecutive samples (half of the array for each cog) into the LUT of each cog, then do the calculation in both cogs in parallel and copy the results back to hub with block transfer. Then you would only need to wake up the second cog with a COGATN instruction to do its work, and you avoid exact synchronization.
As the calculation and data transfer times are the same in each cog, both will finish at the same time.

The block transfer is the fastest memory transfer on the P2; it takes only one clock per long, starting from the second long to be copied.

'takes 6 + 9 ... 16 + 511 = 526 ... 533 clocks to copy 512 longs from hub
        mov      ptrb,##ptr_hub
        setq2   #512-1
        rdlong  $200-$200,ptrb++    'copy from hub RAM to LUT

'takes 6 + 3 .. 10 + 511 = 520 ... 527 clocks to copy 512 longs to hub
        mov      ptrb,##ptr_hub
        setq2   #512-1
        wrlong  $200-$200,ptrb++    'copy back from LUT to hub RAM

To modify the data in LUT RAM you can use RDLUT/WRLUT which takes 5 clocks in combination for each sample.

TonyB_ · 2025-12-18 23:53

@Kaio said:
The block transfer is the fastest memory transfer on the P2; it takes only one clock per long, starting from the second long to be copied.
'takes 6 + 9 ... 16 + 511 = 526 ... 533 clocks to copy 512 longs from hub
        mov      ptrb,##ptr_hub
        setq2   #512-1
        rdlong  $200-$200,ptrb++    'copy from hub RAM to LUT

'takes 6 + 3 .. 10 + 511 = 520 ... 527 clocks to copy 512 longs to hub
        mov      ptrb,##ptr_hub
        setq2   #512-1
        wrlong  $200-$200,ptrb++    'copy back from LUT to hub RAM
To modify the data in LUT RAM you can use RDLUT/WRLUT which takes 5 clocks in combination for each sample.

The fast block move from LUT RAM takes 6 + 4...11 + 511 = 521...528 cycles. I tested this thoroughly just a few days ago and the first write from LUT RAM takes one more cycle than from register RAM in the worst case. The reason for this difference is that it takes three cycles, not two, to read from LUT RAM.

The extra cycle causes the hub RAM slot for a 3-cycle write to be missed, so there is a wait of one hub RAM revolution and the write therefore takes 3 + 8 = 11 cycles. This applies to only one of the eight hub RAM egg beater slices. For the other seven the writes take 4...10 cycles and are identical for reg and LUT RAM.
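
The block-transfer totals quoted in the last two posts can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 6-clock setup, the first-long ranges and the one-clock-per-remaining-long figure are taken from the posts, while the function and variable names are invented for illustration.

```python
LONGS = 512   # longs moved per block in the posts above
SETUP = 6     # fixed overhead quoted in the posts above

def block_cycles(first_min, first_max, longs=LONGS, setup=SETUP):
    """(best, worst) clocks: setup + first long + one clock per remaining long."""
    rest = longs - 1
    return setup + first_min + rest, setup + first_max + rest

hub_to_lut = block_cycles(9, 16)   # Kaio's figures for the block read
hub_write = block_cycles(3, 10)    # Kaio's figures for the block write
lut_to_hub = block_cycles(4, 11)   # TonyB_'s figures for writes from LUT RAM

print(hub_to_lut, hub_write, lut_to_hub)
```

This prints (526, 533), (520, 527) and (521, 528), matching the ranges given in the posts.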

bob_g4bby · 2025-12-19 08:54

That's some useful timing info. I too had wondered whether block move into and out of LUT was the way to go. Assuming the fast block move out of LUT takes the same time, we have a total of 1042 .. 1056 cycles spent just pushing the data around on top of the time doing the cordic function. You've highlighted another difference in speed there - reading and writing LUT ram takes 3 cycles whereas using the FIFOs in each cog, reading and writing Hub ram with WRFAST and RDFAST takes 2 cycles. Sounds like another overhead there too. It's possible that calculations using LUT ram are slower overall than calculating directly in Hub RAM?

I'll get down to writing the cordic code, and publish it here. I envisage it turning into a bit of a library for doing dsp things on arrays of data - like audio or software radio streams etc.

Kaio · 2025-12-19 23:04

@bob_g4bby said:
Assuming the fast block move out of LUT takes the same time, we have a total of 1042 .. 1056 cycles spent just pushing the data around on top of the time doing the cordic function.

@bob_g4bby, yes, that is correct, as each cog will process only 512 longs. But with your approach it takes four times as many cycles to move the data, as one cog must read all 1024 longs and the second cog must write all of them. Hence, it will take 2048 clocks to read and 2048 clocks to write the data, in addition to the time for the cordic function.

You've highlighted another difference in speed there - reading and writing LUT ram takes 3 cycles whereas using the FIFOs in each cog, reading and writing Hub ram with WRFAST and RDFAST takes 2 cycles. Sounds like another overhead there too. It's possible that calculations using LUT ram are slower overall than calculating directly in Hub RAM?

One can use the FIFO with WFLONG on both cogs to write the results back to hub, as all the data to process is already in LUT RAM. This would avoid writing the results to LUT (saving 2 clocks for each sample). That is the same time as is needed to write the data using the FIFO.
Unfortunately reading a LUT sample needs 3 clocks.

My approach is quite easy as the code used in both cogs is identical. Only the pointer to read and write the 512 longs block is different.

With your approach you need to hand over data to the second cog for the calculation, and hand over the results of the first cog's calculation, so that both can be written back to hub afterwards. I think this will add overhead.

Kaio · 2025-12-20 09:19

@bob_g4bby, one important thing I did not mention before about your approach is that you need an additional 5120 clocks to move the data (1024 longs) from the first cog to the second cog. Even if you use the LUT to share the data between two cogs, you will not get the data transfer for free.
As I already mentioned, it takes 5 clocks per long to write and read the LUT, which is necessary in your approach.

The conclusion is that the less data each cog has to process, the better the calculation performance. If you need speed for the calculation, consider using more cogs.

With my approach the calculation is easily scalable, as the code is identical for each cog. Each cog calculates a small part of the data.
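
To put the data-movement figures from this exchange side by side, here is a small Python tally. It is only a sketch using the numbers quoted in the thread (2 clocks per long through the FIFO, 5 clocks per long for a WRLUT/RDLUT pair, and bob_g4bby's 1042 .. 1056 cycle estimate for the block moves); the variable names are invented for illustration and cordic time is left out.

```python
SAMPLES = 1024   # array size from the start of the thread

# Two-cog FIFO approach, per-long costs as quoted by Kaio
fifo_read = 2 * SAMPLES      # reading cog: 2 clocks per long
fifo_write = 2 * SAMPLES     # writing cog: 2 clocks per long
lut_handover = 5 * SAMPLES   # WRLUT + RDLUT pair: 5 clocks per long

# Block-transfer approach: bob_g4bby's estimate for moving one cog's
# 512 longs into LUT and back out again (best case, worst case)
block_round_trip = (1042, 1056)

print("FIFO read:   ", fifo_read)
print("FIFO write:  ", fifo_write)
print("LUT handover:", lut_handover)
print("Block moves: ", block_round_trip)
print("FIFO read+write vs block moves:",
      round((fifo_read + fifo_write) / block_round_trip[0], 1), "x")
```

The ratio on the last line comes out at roughly four, which is where the "four times" figure above comes from.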

bob_g4bby · 2026-01-08 10:48

This code achieves true rendezvous between two cogs, no matter which arrives first. The other will wait for it. Here's the code I used to test the idea:-

'' File .......... rendezvous.spin2
'' Version........ 2
'' Purpose........ 'rendezvous' code to synchronise two cogs instruction streams, no matter who arrives first
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /

''=============================================================

{Spin2_v50}

CON {timing constants}
        _xinfreq =  20_000_000
        _clkfreq = 180_000_000

CON {fixed IO pins}

CON {application IO pins}
        p56 = 56
        p57 = 57

CON {Enumerated constants}

CON {Data Structure definitions}

OBJ {Child Objects}

VAR {Variable Declarations}

{Public / Private methods}

pub main()
    coginit(COGEXEC_NEW_PAIR, @cogstart, ptra )        ' start two unused adjacent cogs with linked LUT

DAT {Symbols and Data}

DAT {Data Pointers}

DAT {Cog-exec assembly language}

'Two low-going pulses on pin 56 and 57 remain in step on the same clock edge, no matter how out-of-step two cogs are to start with
' Can be run with or without DEBUG

        org

cogstart
        cogid _cogid
        testb _cogid, #0 WZ
IF_Z    jmp #cogodd                     ' split the two cogs out to each do their own code

cogeven
        drvh #p56
        getrnd pr1
        and pr1, ##$FFFFF
        waitx pr1                       ' wait a random time - cogs severely out of step
        cogid _cogid
        add _cogid, #1                  ' this is the odd cog ID
        mov pr0, #0
        bith pr0, _cogid                ' bit pattern required to ATN the odd cog
        cogatn pr0                      ' rendezvous code
        waitatn                         ' rendezvous code
cogeven1
        drvl #p56                       ' compare this output pulse on pin 56 ...
        waitx ##1000
        drvh #p56
        debug("Cog 2 here")             ' just so we know it's running ok
        jmp #cogeven

cogodd
        drvh #p57
        getrnd pr1
        and pr1, ##$FFFFF
        waitx pr1                       ' wait a random time - cogs severely out of step
        cogid _cogid
        sub _cogid, #1                  ' this is the even cog ID
        mov pr0, #0
        bith pr0, _cogid                ' bit pattern required to ATN the even cog
        waitatn                         ' rendezvous code
        cogatn pr0                      ' rendezvous code
        nop                             ' rendezvous code
cogodd1
        drvl #p57                       ' ... with this one on pin 57 - they always remain in step on the same clock pulse
        waitx ##1000
        drvh #p57
        debug("Cog 3 here")             ' just so we know it's running ok
        jmp #cogodd

_cogid  res 1

DAT {Hub-exec assembly language}

The attached pdf explains the rendezvous code a little more - I'm quite proud of this! - cheers, Bob
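
For readers without the PDF, here is a toy Python model of why the handshake ends up aligned whichever cog arrives first. It is a sketch under loud assumptions, not a cycle-accurate description of the P2: ATN_DELAY is a made-up latency (chosen so that a single NOP lines the cogs up, as observed on the scope), a WAITATN that finds attention already pending is assumed to cost 2 clocks, and the function name is hypothetical.

```python
ATN_DELAY = 4   # ASSUMED clocks from COGATN issue until a waiting cog resumes
INSTR = 2       # clocks assumed for COGATN, NOP, or a WAITATN that falls through

def rendezvous(even_arrives, odd_arrives):
    """Return (even_resume, odd_resume) clock times after the rendezvous code."""
    # Even cog: COGATN then WAITATN
    even_waitatn_start = even_arrives + INSTR
    # Odd cog: WAITATN completes once the even cog's attention has arrived
    odd_cogatn_issue = max(odd_arrives + INSTR, even_arrives + ATN_DELAY)
    # Odd cog then runs COGATN and NOP before its synchronised code
    odd_resume = odd_cogatn_issue + INSTR + INSTR
    # Even cog is released by the odd cog's COGATN
    even_resume = max(even_waitatn_start + INSTR, odd_cogatn_issue + ATN_DELAY)
    return even_resume, odd_resume

for even_t, odd_t in [(0, 1000), (1000, 0), (500, 500), (0, 1)]:
    print(even_t, odd_t, rendezvous(even_t, odd_t))
```

In this model the even cog is always released by the odd cog's COGATN, so the final skew depends only on fixed instruction timings and not on the random arrival times.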

bob_g4bby · 2026-01-08 14:48

So, having synchronised the code execution of two cogs with a common LUT:-
1. Cog 1 writes a long to LUT with the WRLUT instruction (2 cycles)
2. Cog 2 reads the same long from LUT with the RDLUT instruction (3 cycles)
How soon after WRLUT executes can RDLUT execute and still read the data correctly?

The answer turns out to be that RDLUT can execute on Cog 2 at the same time as WRLUT is executing on Cog 1. This is good news, as that knowledge allows really tight coding.

The data sent by that WRLUT will be received by the RDLUT reliably.
Here's my test code. Run it with debug and observe that the data count keeps advancing. The program will stop if the WRLUT / RDLUT data is mismatched. Notice I've used pulses on pins 56 and 57 to verify the timing is spot on.

'' File .......... WRLUT-RDLUT test.spin2
'' Version........ 2
'' Purpose........ How soon after a WRLUT instruction in one cog can the other cog RDLUT that data?
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ 8/1/26
'' Latest update.. / /

''=============================================================

{Spin2_v50}

CON {timing constants}
        _xinfreq =  20_000_000
        _clkfreq = 180_000_000

CON {fixed IO pins}

CON {application IO pins}
        p56 = 56
        p57 = 57

CON {Enumerated constants}

#0, cog2val, cog3val, paramcount

CON {Data Structure definitions}

OBJ {Child Objects}

VAR {Variable Declarations}

long param[paramcount]

{Public / Private methods}

pub main()
    repeat
        ' waitms(100)
        param[cog2val] := param[cog2val] + 1
        coginit(COGEXEC_NEW_PAIR, @cogstart, @param )        ' start two unused adjacent cogs with linked LUT
        waitms(1)
        debug(udec(param[cog2val], param[cog3val]))
    until (param[cog2val]<>param[cog3val])                   ' Stop the program if the WRLUT / RDLUT data disagrees

DAT {Symbols and Data}

DAT {Data Pointers}

DAT {Cog-exec assembly language}

        org

cogstart
        drvl #p56
        drvl #p57
        cogid _cogid
        testb _cogid, #0 WZ
IF_Z    jmp #cogodd

cogeven
        add _cogid, #1                  ' this is the odd cog ID
        setluts #1                      ' all writes from the odd cog will be applied to the even cog LUT too
        mov pr1, #0
        bith pr1, _cogid                ' bit pattern required to ATN the odd cog
cogeven1
        rdlong pr0, ptra[cog2val]
        cogatn pr1                      ' rendezvous code
        waitatn                         ' rendezvous code
        drvh #p56                       ' this code is in sync with the code below *****
        wrlut pr0, #countlut            ' and this
        drvl #p56                       ' but not this because wrlut takes 2 cycles
        debug("Cog 2 here")
        cogid _cogid
        cogstop _cogid

cogodd
        cogid _cogid
        sub _cogid, #1                  ' this is the even cog id
        mov pr0, #0
        bith pr0, _cogid                ' bit pattern required to ATN the even cog
        setluts #1                      ' all writes from the even cog will be applied to the odd cog LUT too
cogodd1
        waitatn                         ' rendezvous code
        cogatn pr0                      ' rendezvous code
        nop                             ' rendezvous code
        drvh #p57                       ' ***** this code with the code above
        rdlut pr0, #countlut            ' and this (so WRLUT and RDLUT happen simultaneously and the data is passed over correctly!)
        drvl #p57                       ' but not this because rdlut takes 3 cycles
        wrlong pr0, ptra[cog3val]
        debug("Cog 3 here")
        cogid _cogid
        cogstop _cogid
' COG memory
_cogid  res 1
count   res 1

 ' LUT memory
org

countlut res 1

DAT {Hub-exec assembly language}

bob_g4bby · 2026-01-11 15:07

I've implemented cartesian to polar conversion in my growing dsp library. For speed, I start a pair of cogs with shared LUT RAM. These are kicked into life with a COGATN in the library method "xytopol", which then waits until it receives a COGATN back from the cog pair to signal that conversion is complete. The cog pair collaborate closely to feed data in and out of the qvector cordic instructions. My "rendezvous" codelet is used to start them off together, and various nops and waitx keep the cogs synced pretty well. I have to admit, my initial timing, planned out in a spreadsheet, was way off and the polar results were sometimes failing. I cheated, resorting to my trusty Chinese oscilloscope to measure various pulses output on P56 and P57 to debug the timing. The delays you see got the timing close enough. Ah well!

The following demo runs very well under Spin Tools 0.51.0. Don't use 0.52.0 if your OS is Windows, as the scope displays have gone very slow. I'm sure that'll get fixed soon. Run with ctrl-F10.
Key to scope displays:-
Scope 1 100Hz sinewave
Scope 2 150Hz sinewave
Scope 3 100Hz sine multiplied by 150Hz sine (modulation)
Scope 4 100Hz sine added to 150Hz sine (mixing)
Scope 5 Conversion of 100Hz sine to polar coordinates. Real is the magnitude (which dithers a bit around 1000) and Imag is the angle, which advances linearly until it reaches full circle, then drops back to zero again - all as expected. I've had it soaking for a bit.

Next step is to add the companion polar to cartesian conversion - maybe by self-modifying code to save space, since only the Qvector and Qrotate instructions swap.

bob_g4bby · 2026-01-12 03:56

@Kaio, I think your approach and this example from Chip point the way to a two-times-faster single-cog solution to array cordics. I'll have another go at these methods with that in mind.

bob_g4bby · 2026-01-13 07:46

So there's no need to run two helper cogs in close synchronism - that code would have been difficult to maintain. This version rewrites 'xytopol' as inline assembly, processing 16 IQ samples at a time in a small buffer in cog RAM. The transfer of data back and forth from hub RAM now takes roughly 1 cycle per long; it was 2. The qvector instruction executes every 8 cycles, instead of the original ~14 cycles. The code is much easier to understand. I will now look at the other dsp methods to see where this same technique can be applied. I can also write the polar to xy conversion method using the qrotate instruction and check that the original sine wave (or close) is reconstituted.

' convert x,y samples in buffin to magnitude, angle samples in buffout - optionally, buffout can be the same as buffin
pub xytopol(buffin, buffout) | buffinptr, buffoutptr, counter

    org

        mov buffinptr, buffin
        mov buffoutptr, buffout
        mov counter, #(sigbuffsize/16)
xypol1
            setq #31
            rdlong array, buffinptr
            qvector array, array+1
            nop
            qvector array+2, array+3
            nop
            qvector array+4, array+5
            nop
            qvector array+6, array+7
            nop
            qvector array+8, array+9
            nop
            qvector array+10, array+11
            nop
            qvector array+12, array+13
            nop
            qvector array+14, array+15
            getqy   array
            getqx   array+1
            nop
            qvector array+16, array+17
            getqy   array+2
            getqx   array+3
            nop
            qvector array+18, array+19
            getqy   array+4
            getqx   array+5
            nop
            qvector array+20, array+21
            getqy   array+6
            getqx   array+7
            nop
            qvector array+22, array+23
            getqy   array+8
            getqx   array+9
            nop
            qvector array+24, array+25
            getqy   array+10
            getqx   array+11
            nop
            qvector array+26, array+27
            getqy   array+12
            getqx   array+13
            nop
            qvector array+28, array+29
            getqy   array+14
            getqx   array+15
            nop
            qvector array+30, array+31
            getqy   array+16
            getqx   array+17
            getqy   array+18
            getqx   array+19
            getqy   array+20
            getqx   array+21
            getqy   array+22
            getqx   array+23
            getqy   array+24
            getqx   array+25
            getqy   array+26
            getqx   array+27
            getqy   array+28
            getqx   array+29
            getqy   array+30
            getqx   array+31

            setq #31
            wrlong array, buffoutptr
            add buffinptr, #(32*4)              ' remember hub ram is addressed in bytes!
            add buffoutptr, #(32*4)
            djnz counter, #xypol1
            ret

array res    32

    end

TonyB_ · 2026-01-13 13:14

@bob_g4bby said:

            setq #31
            wrlong array, buffoutptr
            add buffinptr, #(32*4)              ' remember hub ram is addressed in bytes!
            add buffoutptr, #(32*4)

No need for add nor to remember if you use ptra or ptrb:

            setq #31
            wrlong array, ptrb++

bob_g4bby · 2026-01-13 13:44

I'll change that, @TonyB_, thanks for the tip. That's the great thing about showing your code on this forum - the long-time P2ers can always find a way to polish off an instruction or two.

I've got polar to cartesian working this morning - almost a carbon copy of cartesian to polar, so your change can go in both methods. They both execute in around 56 µs with a 320 MHz CPU clock.
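
As a rough sanity check on that figure, the execution time can be turned into clock cycles with a couple of lines of Python. This is just a sketch of the arithmetic: the 56 µs and 320 MHz values are from the post, while the 1024-sample buffer size is an assumption borrowed from the goal stated at the start of the thread (sigbuffsize itself is not given), and the function name is made up.

```python
def clocks(duration_us, clock_mhz):
    """Clock cycles in duration_us microseconds at clock_mhz megahertz."""
    return duration_us * clock_mhz

ASSUMED_SAMPLES = 1024   # assumption: sigbuffsize is not stated in the thread

total = clocks(56, 320)
print(total)                     # 17920
print(total / ASSUMED_SAMPLES)   # 17.5
```

Under that assumption the method spends about 17.5 clocks per sample in total, covering the block transfers, the qvector issue slots and the pipeline drain.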

Wuerfel_21 · 2026-01-13 14:22

In inline ASM you can only use PTRB; PTRA is used for the stack pointer.

bob_g4bby · 2026-01-13 15:44

@Wuerfel_21, I've heard about not using PTRA in inline code before, Ada. The obvious question - what if I saved PTRA in a local variable first thing and restored it just before the RET of the inline assembly method? Why couldn't I use PTRA then?

Wuerfel_21 · 2026-01-13 16:20

I think that works

Rayman · 2026-01-13 18:18

Maybe push and pop would work to save/restore ptra...
