Running two cogs in phase with each other — Parallax Forums

bob_g4bby Posts: 507
edited 2025-12-16 11:32 in PASM2/Spin2 (P2)
  • The goal - compute a cordic function on a 1024 sample array in the least amount of time.
  • The problem - samples cannot be moved into and out of cog registers with RDLONG and WRLONG, because these take a variable number of cycles to execute, and cordic outputs will get missed if the tight program loops vary in length.
  • So read from hub RAM through the FIFO (RDFAST), where a read always takes two cycles.
  • BUT we can't also use WRFAST to write the data back - switching the same FIFO over to the opposite direction takes too long.
  • So use two cogs running loops of identical size, in phase with each other. One cog does the reads (RDFAST) using its FIFO, the other does the writes (WRFAST) using its FIFO. Both cogs share LUT.

But how to get both cogs exactly in phase? The following piece of code is an experiment to get two cogs to pulse two smartpins exactly at the same time by the use of WAITATN / COGATN instructions. I had to put one NOP instruction after COGATN to get the two cogs running exactly in phase.

Sharing the data movement has another benefit: each cog only needs to run a 14 cycle loop (a quarter of the cordic process time of 56 cycles). The next best loop time would be 28 cycles - half the speed. Here's the experiment to test phase lock. I used an oscilloscope on pins P56 and P57 to get the two pulses lined up with each other:-

'' File .......... post WAITATN synchronism test.spin2
'' Version........ 1
'' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /

''=============================================================

CON {timing constants}
        _xinfreq =  20_000_000
        _clkfreq = 180_000_000


CON {fixed IO pins}

CON {application IO pins}
        p56 = 56
        p57 = 57

CON {Enumerated constants}

CON {Data Structure definitions}

OBJ {Child Objects}

VAR {Variable Declarations}

{Public / Private methods}

pub main()
    coginit(2,@cog2,0)
    coginit(1,@cog1,0)

DAT {Symbols and Data}

DAT {Data Pointers}

DAT {Cog-exec assembly language}

        org


cog1
        drvl #p56
        waitx ##100000000               ' wait for cog2 to load and execute waitatn
        cogatn #4                       ' kick cog2 off
        nop
cog1a
        drvh #p56                       ' compare this output pulse on pin 56
        drvl #p56
        jmp #cog1a

cog2
        drvl #p57
        waitatn                         ' wait for cog1 to kick me off
cog2a
        drvh #p57                       ' with this one on pin 57. Adjust waitx instructions until both pulses are coincident
        drvl #p57
        jmp #cog2a


DAT {Hub-exec assembly language}

Cordic example to follow ...

Comments

  • Get a clock count in spin, add a couple hundred to it, send the pointer over to the two pasm cogs on startup... have the two pasm routines sit on a waitcnt until the count you passed to them...
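
    A minimal sketch of this idea, assuming the P2 counter instructions (GETCT in Spin2, ADDCT1 / WAITCT1 in PASM2) take the place of the P1-style waitcnt. The names `startct`, `entry` and `target`, and the one-million-tick margin, are illustrative assumptions and not from the thread:

    ```spin2
    CON
            _clkfreq = 180_000_000

    VAR
            long startct

    PUB main()
        startct := getct() + 1_000_000          ' a start time safely in the future (assumed margin)
        coginit(COGEXEC_NEW, @entry, @startct)  ' both cogs receive the same pointer in PTRA
        coginit(COGEXEC_NEW, @entry, @startct)

    DAT
            org
    entry
            rdlong  target, ptra                ' fetch the shared start time from hub RAM
            addct1  target, #0                  ' arm the CT1 event for that absolute count
            waitct1                             ' both cogs should be released on the same clock
            ' ... synchronised loop goes here ...
            jmp     #$                          ' placeholder: park the cog

    target  res 1
    ```

    The margin only has to be long enough for both cogs to load and reach WAITCT1. Whether the two cogs really come out on the same clock edge is best checked on a scope, as in the original post.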

  • Rayman Posts: 15,951

    If both are doing very similar things, an easy way is to use the same assembly for both cogs and edit the assembly before launching the second cog.
    Use the main cog to give them ATN at the same time...
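
    A rough sketch of that suggestion, assuming the Spin2 `cogatn(mask)` built-in and a DAT long that is patched between launches. The names `entry` and `pin` and the 1 ms settling delays are illustrative assumptions:

    ```spin2
    CON
            _clkfreq = 180_000_000

    PUB main() | a, b
        pin := 56                               ' patch the DAT image before each launch
        a := coginit(COGEXEC_NEW, @entry, 0)
        waitms(1)                               ' let the first cog finish loading before patching again
        pin := 57
        b := coginit(COGEXEC_NEW, @entry, 0)
        waitms(1)                               ' let the second cog reach WAITATN
        cogatn((1 << a) | (1 << b))             ' one ATN strobe releases both cogs together

    DAT
            org
    entry
            drvl    pin
            waitatn                             ' sleep until the main cog strobes ATN
    .loop
            drvh    pin                         ' pulses to compare on the scope
            drvl    pin
            jmp     #.loop

    pin     long    0                           ' patched by main() before each coginit
    ```

    Because both cogs run identical code and are released by the same strobe, no NOP padding should be needed, but that is an expectation to confirm on the scope rather than a measured result.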

  • Both good tips, there, thanks. Nice simple techniques.

  • @bob_g4bby said:
    Both good tips, there, thanks. Nice simple techniques.

    Your technique is the simplest.

  • Do note that if you want to use LUT sharing, you need an odd/even pair like 0/1 or 2/3; 1/2 would not work.
    There's a special value you can pass to coginit to make it start up such a pair automatically. In theory the pair is time-aligned +/- some constant; in practice, if you have the debugger going, startup will be delayed by an unpredictable amount due to the serial lock.

  • bob_g4bby Posts: 507
    edited 2025-12-25 09:19

    So here's me starting an odd/even pair of cogs with one command. They both run the same cog code in their respective cog RAMs. However, the cogs read their ID numbers and split off to run the 'cogeven' and 'cogodd' code. I've kept my method of synchronising the two loops starting at 'cogeven1' and 'cogodd1':-

    '' File .......... post WAITATN synchronism test.spin2
    '' Version........ 2
    '' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
    '' Author......... Bob Edwards
    '' Email.......... bob.edwards50@yahoo.com
    '' Started........ / /
    '' Latest update.. / /
    
    ''=============================================================
    
    CON {timing constants}
            _xinfreq =  20_000_000
            _clkfreq = 180_000_000
    
    
    CON {fixed IO pins}
    
    CON {application IO pins}
            p56 = 56
            p57 = 57
    
    CON {Enumerated constants}
    
    CON {Data Structure definitions}
    
    OBJ {Child Objects}
    
    VAR {Variable Declarations}
    
    {Public / Private methods}
    
    pub main()
        coginit(COGEXEC_NEW_PAIR, @cogstart, 0)        ' start two cogs with linked LUT reads
    
    DAT {Symbols and Data}
    
    DAT {Data Pointers}
    
    DAT {Cog-exec assembly language}
    
            org
    cogstart
            cogid _cogid
            testb _cogid, #0 WZ
    IF_Z    jmp #cogodd
    
    cogeven
            drvl #p56
            waitx ##100000000               ' wait for odd cog to load and execute waitatn
            add _cogid, #1                 ' now the odd cog number
            mov pr0, #0
            bith pr0, _cogid
            cogatn pr0                      ' wake up the odd number cog
            nop                             ' just two cycle delay needed for output pins to be in sync
    cogeven1
            drvh #p56                       ' compare this output pulse on pin 56
            drvl #p56
            jmp #cogeven1
    
    cogodd
            drvl #p57
            setluts #1                      ' writes to even cog LUT will also write to the odd cog LUT
            waitatn                         ' wait for even cog to kick me off
    cogodd1
            drvh #p57                       ' with this one on pin 57
            drvl #p57
            jmp #cogodd1
    
            _cogid      res 1
    
    DAT {Hub-exec assembly language}
    
  • @bob_g4bby said:

    • The goal - compute a cordic function on a 1024 sample array in the least amount of time.
    • The problem - moving samples into and out of cog registers cannot be done with RDLONG and WRLONG because these take a variable number of cycles to execute and cordic outputs will get missed if the tight program loops vary in cycles

    @bob_g4bby, another approach would be to use block transfers: copy 512 consecutive samples into the LUT of each cog (splitting the array between the two), do the calculation in both cogs in parallel, and copy the results back to hub with another block transfer. Then you would only need to wake up the second cog with a COGATN instruction to start its work, and exact synchronization is avoided.
    As the calculation and data transfer times are the same in each cog, both will finish at the same time.

    The block transfer is the fastest memory transfer on the P2: from the second long onwards, each long copied takes only one clock.

    'takes 6 + 9 ... 16 + 511 = 526 ... 533 clocks to copy 512 longs from hub
            mov      ptrb,##ptr_hub
            setq2   #512-1
            rdlong  $200-$200,ptrb++    'copy from hub RAM to LUT
    
    'takes 6 + 3 .. 10 + 511 = 520 ... 527 clocks to copy 512 longs to hub
            mov      ptrb,##ptr_hub
            setq2   #512-1
            wrlong  $200-$200,ptrb++    'copy back from LUT to hub RAM
    

    To modify the data in LUT RAM you can use RDLUT/WRLUT, which together take 5 clocks for each sample.
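
    Putting those pieces together, one worker cog might look roughly like the sketch below. It is an illustration only: the names `samples`, `worker`, `idx`, `val` and `loops` are assumptions, and the `add` is a placeholder for the real calculation, not a cordic operation.

    ```spin2
    VAR
            long samples[1024]

    PUB main()
        coginit(COGEXEC_NEW, @worker, @samples[0])      ' first cog takes the lower 512 longs
        coginit(COGEXEC_NEW, @worker, @samples[512])    ' second cog takes the upper 512 longs

    DAT
            org
    worker
            setq2   #512-1                  ' block read: 512 longs, hub RAM -> LUT
            rdlong  0, ptra                 ' D = first LUT address, PTRA = this cog's block

            mov     idx, #0
            rep     @.done, loops           ' repeat the four instructions below 512 times
            rdlut   val, idx                ' fetch a sample from LUT (3 clocks)
            add     val, #1                 ' placeholder for the real calculation
            wrlut   val, idx                ' store the result in place (2 clocks)
            add     idx, #1
    .done
            setq2   #512-1                  ' block write: 512 longs, LUT -> hub RAM
            wrlong  0, ptra

            cogid   val
            cogstop val                     ' finished, release the cog

    loops   long    512
    idx     res 1
    val     res 1
    ```

    The same image serves every worker; only the pointer passed in PTRA differs, which is what makes the scheme easy to spread over more cogs.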

  • TonyB_ Posts: 2,258
    edited 2025-12-19 00:01

    @Kaio said:
    The block transfer is the fastest memory transfer of P2 which takes only one clock starting from the second long to be copied.

    'takes 6 + 9 ... 16 + 511 = 526 ... 533 clocks to copy 512 longs from hub
            mov      ptrb,##ptr_hub
            setq2   #512-1
            rdlong  $200-$200,ptrb++    'copy from hub RAM to LUT
    
    'takes 6 + 3 .. 10 + 511 = 520 ... 527 clocks to copy 512 longs to hub
            mov      ptrb,##ptr_hub
            setq2   #512-1
            wrlong  $200-$200,ptrb++    'copy back from LUT to hub RAM
    

    To modify the data in LUT RAM you can use RDLUT/WRLUT which takes 5 clocks in combination for each sample.

    The fast block move from LUT RAM takes 6 + 4...11 + 511 = 521...528 cycles. I tested this thoroughly just a few days ago, and in the worst case the first write from LUT RAM takes one more cycle than from register RAM. The difference arises because it takes three cycles, not two, to read from LUT RAM.

    The extra cycle causes the hub RAM slot for a 3 cycle write to be missed, so there is a wait of one hub RAM revolution and the write takes 3 + 8 = 11 cycles. This applies to only one of the eight hub RAM egg beater slices. For the other seven, the writes take 4...10 cycles and are identical for register and LUT RAM.

  • bob_g4bby Posts: 507
    edited 2025-12-19 09:00

    That's some useful timing info. I too had wondered whether block move into and out of LUT was the way to go. Assuming the fast block move out of LUT takes the same time, we have a total of 1042 .. 1056 cycles spent just pushing the data around on top of the time doing the cordic function. You've highlighted another difference in speed there - reading and writing LUT ram takes 3 cycles whereas using the FIFOs in each cog, reading and writing Hub ram with WRFAST and RDFAST takes 2 cycles. Sounds like another overhead there too. It's possible that calculations using LUT ram are slower overall than calculating directly in Hub RAM?

    I'll get down to writing the cordic code, and publish it here. I envisage it turning into a bit of a library for doing dsp things on arrays of data - like audio or software radio streams etc.

  • Kaio Posts: 281
    edited 2025-12-20 09:23

    @bob_g4bby said:
    Assuming the fast block move out of LUT takes the same time, we have a total of 1042 .. 1056 cycles spent just pushing the data around on top of the time doing the cordic function.

    @bob_g4bby, yes, that is correct, as each cog will process only 512 longs. But with your approach it takes about four times as many cycles to move the data, as one cog must read all 1024 longs and the second cog must write all of them. Hence it will take 2048 clocks to read and 2048 clocks to write the data (1024 longs at 2 clocks each), in addition to the time spent on the cordic function.

    You've highlighted another difference in speed there - reading and writing LUT ram takes 3 cycles whereas using the FIFOs in each cog, reading and writing Hub ram with WRFAST and RDFAST takes 2 cycles. Sounds like another overhead there too. It's possible that calculations using LUT ram are slower overall than calculating directly in Hub RAM?

    One can use the FIFO with WFLONG on both cogs to write the results back to hub, as all the data to process is already in LUT RAM. This avoids writing the result to LUT (saving 2 clocks per sample), which is the same time needed to write the data using the FIFO.
    Unfortunately, reading a sample from LUT still needs 3 clocks.

    My approach is quite easy as the code used in both cogs is identical. Only the pointer to read and write the 512 longs block is different.

    With your approach you need to hand over data to the second cog for the calculation, and the result of the calculation from the first cog, so that both can be written back to hub afterwards. I think this will add overhead.

  • @bob_g4bby, one thing I did not mention before, and which is important to know when using your approach, is that you need an additional 5120 clocks to move the data (1024 longs) from the first cog to the second cog. Even if you use the LUT to share the data between the two cogs, you will not get the data transfer for free.
    As I already mentioned, it takes 5 clocks per long to write and read the LUT, which is necessary in your approach.

    The conclusion is that the less data each cog has to process, the better the performance of the calculation. If you need speed, consider using more cogs.

    With my approach the calculation is easily scalable, as the code is identical for each cog. Each cog calculates a small part of the data.

  • bob_g4bby Posts: 507
    edited 2026-01-10 05:39

    This code achieves true rendezvous between two cogs, no matter which arrives first. The other will wait for it. Here's the code I used to test the idea:-

    '' File .......... rendezvous.spin2
    '' Version........ 2
    '' Purpose........ 'rendezvous' code to synchronise two cogs instruction streams, no matter who arrives first
    '' Author......... Bob Edwards
    '' Email.......... bob.edwards50@yahoo.com
    '' Started........ / /
    '' Latest update.. / /
    
    ''=============================================================
    
    {Spin2_v50}
    
    CON {timing constants}
            _xinfreq =  20_000_000
            _clkfreq = 180_000_000
    
    CON {fixed IO pins}
    
    CON {application IO pins}
            p56 = 56
            p57 = 57
    
    CON {Enumerated constants}
    
    CON {Data Structure definitions}
    
    OBJ {Child Objects}
    
    VAR {Variable Declarations}
    
    {Public / Private methods}
    
    pub main()
        coginit(COGEXEC_NEW_PAIR, @cogstart, ptra )        ' start two unused adjacent cogs with linked LUT
    
    DAT {Symbols and Data}
    
    DAT {Data Pointers}
    
    DAT {Cog-exec assembly language}
    
    'Two low-going pulses on pin 56 and 57 remain in step on the same clock edge, no matter how out-of-step two cogs are to start with
    ' Can be run with or without DEBUG
    
            org
    
    cogstart
            cogid _cogid
            testb _cogid, #0 WZ
    IF_Z    jmp #cogodd                     ' split the two cogs out to each do their own code
    
    cogeven
            drvh #p56
            getrnd pr1
            and pr1, ##$FFFFF
            waitx pr1                       ' wait a random time - cogs severely out of step
            cogid _cogid
            add _cogid, #1                  ' this is the odd cog ID
            mov pr0, #0
            bith pr0, _cogid                ' bit pattern required to ATN the odd cog
            cogatn pr0                      ' rendezvous code
            waitatn                         ' rendezvous code
    cogeven1
            drvl #p56                       ' compare this output pulse on pin 56 ...
            waitx ##1000
            drvh #p56
            debug("Cog 2 here")             ' just so we know it's running ok
            jmp #cogeven
    
    cogodd
            drvh #p57
            getrnd pr1
            and pr1, ##$FFFFF
            waitx pr1                       ' wait a random time - cogs severely out of step
            cogid _cogid
            sub _cogid, #1                  ' this is the even cog ID
            mov pr0, #0
            bith pr0, _cogid                ' bit pattern required to ATN the even cog
            waitatn                         ' rendezvous code
            cogatn pr0                      ' rendezvous code
            nop                             ' rendezvous code
    cogodd1
            drvl #p57                       ' ... with this one on pin 57 - they always remain in step on the same clock pulse
            waitx ##1000
            drvh #p57
            debug("Cog 3 here")             ' just so we know it's running ok
            jmp #cogodd
    
    _cogid  res 1
    
    DAT {Hub-exec assembly language}
    
    

    The attached pdf explains the rendezvous code a little more - I'm quite proud of this! - cheers, Bob

  • bob_g4bby Posts: 507
    edited 2026-01-08 18:56

    So, having synchronised the code execution of two cogs with a common LUT:-
    1. Cog 1 writes a long to LUT with the WRLUT instruction (2 cycles)
    2. Cog 2 reads the same long from LUT with the RDLUT instruction (3 cycles)
    How soon after WRLUT executes can RDLUT execute and still read the data correctly?

    The answer turns out to be that RDLUT can execute on Cog 2 at the same time as WRLUT is executing on Cog 1. This is good news, allowing really tight coding with that knowledge.

    The data sent by that WRLUT will be received by the RDLUT reliably.
    Here's my test code. Run with debug and observe that the data count keeps advancing. The program will stop if the WRLUT / RDLUT data is mismatched. Notice I've used pulses on pins 56 and 57 to verify the timing is spot on.

    '' File .......... WRLUT-RDLUT test.spin2
    '' Version........ 2
    '' Purpose........ How soon after a WRLUT instruction in one cog can the other cog RDLUT that data?
    '' Author......... Bob Edwards
    '' Email.......... bob.edwards50@yahoo.com
    '' Started........ 8/1/26
    '' Latest update.. / /
    
    ''=============================================================
    
    {Spin2_v50}
    
    CON {timing constants}
            _xinfreq =  20_000_000
            _clkfreq = 180_000_000
    
    CON {fixed IO pins}
    
    CON {application IO pins}
            p56 = 56
            p57 = 57
    
    CON {Enumerated constants}
    
    #0, cog2val, cog3val, paramcount
    
    CON {Data Structure definitions}
    
    OBJ {Child Objects}
    
    VAR {Variable Declarations}
    
    long param[paramcount]
    
    {Public / Private methods}
    
    pub main()
        repeat
            ' waitms(100)
            param[cog2val] := param[cog2val] + 1
            coginit(COGEXEC_NEW_PAIR, @cogstart, @param )        ' start two unused adjacent cogs with linked LUT
            waitms(1)
            debug(udec(param[cog2val], param[cog3val]))
        until (param[cog2val]<>param[cog3val])                   ' Stop the program if the WRLUT / RDLUT data disagrees
    
    DAT {Symbols and Data}
    
    DAT {Data Pointers}
    
    DAT {Cog-exec assembly language}
    
            org
    
    cogstart
            drvl #p56
            drvl #p57
            cogid _cogid
            testb _cogid, #0 WZ
    IF_Z    jmp #cogodd
    
    cogeven
            add _cogid, #1                  ' this is the odd cog ID
            setluts #1                      ' all writes from the odd cog will be applied to the even cog LUT too
            mov pr1, #0
            bith pr1, _cogid                ' bit pattern required to ATN the odd cog
    cogeven1
            rdlong pr0, ptra[cog2val]
            cogatn pr1                      ' rendezvous code
            waitatn                         ' rendezvous code
            drvh #p56                       ' this code is in sync with the code below *****
            wrlut pr0, #countlut            ' and this
            drvl #p56                       ' but not this because wrlut takes 2 cycles
            debug("Cog 2 here")
            cogid _cogid
            cogstop _cogid
    
    cogodd
            cogid _cogid
            sub _cogid, #1                  ' this is the even cog id
            mov pr0, #0
            bith pr0, _cogid                ' bit pattern required to ATN the even cog
            setluts #1                      ' all writes from the even cog will be applied to the odd cog LUT too
    cogodd1
            waitatn                         ' rendezvous code
            cogatn pr0                      ' rendezvous code
            nop                             ' rendezvous code
            drvh #p57                       ' ***** this code with the code above
            rdlut pr0, #countlut            '  and this (so WRLUT and RDLUT happen simultaneously and the data is passed over correctly!)
            drvl #p57                       ' but not this because rdlut takes 3 cycles
            wrlong pr0, ptra[cog3val]
            debug("Cog 3 here")
            cogid _cogid
            cogstop _cogid
    ' COG memory
    _cogid  res 1
    count   res 1
    
     ' LUT memory
    org
    
    countlut res 1
    
    DAT {Hub-exec assembly language}
    
    
  • I've implemented cartesian to polar conversion in my growing dsp library. For speed, I start a pair of cogs with shared LUT RAM. These are kicked into life with a COGATN in the library method "xytopol", which waits until it receives a COGATN back from the cog pair to signal that the conversion is complete. The cog pair collaborate closely to feed data in and out of the qvector cordic instructions. My "rendezvous" codelet is used to start them off together, and various nops and waitx keep the cogs synced pretty well. I have to admit, my initial timing planned out in a spreadsheet was way off and the polar results were sometimes failing. I cheated, resorting to my trusty Chinese oscilloscope to measure various pulses output on P56 and P57 to debug the timing. The delays you see got the timing close enough. Ah well!

    The following demo runs very well under Spin Tools 0.51.0. Don't use 0.52.0 if your OS is Windows, the scope displays have gone very slow. I'm sure that'll get fixed soon. Run with ctrl-F10
    Key to scope displays:-
    Scope 1 100Hz sinewave
    Scope 2 150Hz sinewave
    Scope 3 100Hz sine multiplied by 150Hz sine (modulation)
    Scope 4 100Hz sine added to 150Hz sine (Mixing)
    Scope 5 Conversion of 100Hz sine to polar coordinates. Real is the magnitude (which dithers a bit around 1000) and Imag is the angle, which advances linearly until it reaches full circle, then drops back to zero again - all as expected. I've had it soaking for a bit

    Next step is to add the companion polar to cartesian conversion - maybe by self-modifying code to save space, swapping the Qvector instruction for Qrotate.

  • @Kaio , I think your approach and this example from Chip point the way to a two-times-faster single-cog solution to array cordics. I'll have another go at these methods with that in mind.
