Running two cogs in phase with each other — Parallax Forums

Running two cogs in phase with each other

bob_g4bby Posts: 489
edited 2025-12-16 11:32 in PASM2/Spin2 (P2)
  • The goal - compute a cordic function on a 1024 sample array in the least amount of time.
  • The problem - moving samples into and out of cog registers cannot be done with RDLONG and WRLONG, because these take a variable number of cycles to execute, and cordic outputs will get missed if the tight program loops vary in cycles
  • So read from hub RAM through the FIFO, set up with RDFAST, where each read always takes two cycles
  • BUT we can't use WRFAST to write the data back from the same cog, as it takes too long to switch the FIFO to the opposite direction
  • So use two cogs, running loops of identical size, and run them in phase. One cog does the reads using its FIFO (RDFAST), the other does the writes using its FIFO (WRFAST). Both cogs share LUT.

But how to get both cogs exactly in phase? The following piece of code is an experiment to get two cogs to pulse two smartpins exactly at the same time by the use of WAITATN / COGATN instructions. I had to put one NOP instruction after COGATN to get the two cogs running exactly in phase.

Sharing the data movement has another benefit: each cog only needs to run a 14 cycle loop (one quarter of the cordic process time of 56 cycles). The next best loop time would be 28 cycles - 1/2 the speed. Here's the experiment to test phase lock. I used an oscilloscope on pins P56 and P57 to get the two pulses lined up with each other:-

'' File .......... post WAITATN synchronism test.spin2
'' Version........ 1
'' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
'' Author......... Bob Edwards
'' Email.......... bob.edwards50@yahoo.com
'' Started........ / /
'' Latest update.. / /

''=============================================================

CON {timing constants}
        _xinfreq =  20_000_000
        _clkfreq = 180_000_000


CON {fixed IO pins}

CON {application IO pins}
        p56 = 56
        p57 = 57

CON {Enumerated constants}

CON {Data Structure definitions}

OBJ {Child Objects}

VAR {Variable Declarations}

{Public / Private methods}

pub main()
    coginit(2,@cog2,0)
    coginit(1,@cog1,0)

DAT {Symbols and Data}

DAT {Data Pointers}

DAT {Cog-exec assembly language}

        org


cog1
        drvl #p56
        waitx ##100000000               ' wait for cog2 to load and execute waitatn
        cogatn #4                       ' kick cog2 off
        nop
cog1a
        drvh #p56                       ' compare this output pulse on pin 56
        drvl #p56
        jmp #cog1a

cog2
        drvl #p57
        waitatn                         ' wait for cog1 to kick me off
cog2a
        drvh #p57                       ' with this one on pin 57. Adjust waitx instructions until both pulses are coincident
        drvl #p57
        jmp #cog2a


DAT {Hub-exec assembly language}
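
The loop-time argument in the post can be checked with a little arithmetic. The following is a minimal sketch in Python rather than PASM2, using the 56-cycle cordic figure and the 180 MHz clock quoted above. The function names are made up for illustration, and the model ignores pipeline fill and drain, so treat the numbers as rough budgets rather than measurements.

```python
# Rough cycle budget for streaming 1024 samples through a fixed-length loop.
# All figures come from the post; the model itself is an assumption.
CLKFREQ_HZ = 180_000_000   # _clkfreq from the listing
SAMPLES = 1024             # array size stated in the goal
CORDIC_CYCLES = 56         # cordic process time quoted in the post


def loop_budget(loop_cycles, samples=SAMPLES, clk_hz=CLKFREQ_HZ):
    """Return (total cycles, microseconds) for one pass over the array."""
    total = loop_cycles * samples
    return total, total / clk_hz * 1e6


def operations_in_flight(loop_cycles, cordic_cycles=CORDIC_CYCLES):
    """How many cordic operations overlap if one is issued per loop pass."""
    return cordic_cycles // loop_cycles


if __name__ == "__main__":
    for cycles in (14, 28):
        total, micros = loop_budget(cycles)
        print(f"{cycles}-cycle loop: {total} cycles, about {micros:.1f} us, "
              f"{operations_in_flight(cycles)} operations in flight")
```

Under these assumptions the 14-cycle loop keeps four cordic operations overlapped, while the 28-cycle loop keeps two and takes twice as long overall.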

Cordic example to follow ...

Comments

  • Get a clock count in spin, add a couple of hundred to it, and send the pointer over to the two pasm cogs on startup. Have the two pasm routines sit on a waitcnt until the count you passed to them.

  • Rayman Posts: 15,925

    If both are doing very similar things, easy way is to use same assembly for both cogs and edit the assembly before launching the second cog.
    Use the main cog to give them ATN at same time...

  • Both good tips, there, thanks. Nice simple techniques.

  • TonyB_ Posts: 2,254

    @bob_g4bby said:
    Both good tips, there, thanks. Nice simple techniques.

    Your technique is the simplest.

  • Do note that if you want to use LUT sharing, you need an odd/even pair like 0/1 or 2/3; 1/2 would not work.
    There's a special value you can pass to coginit to make it start up such a pair automatically. In theory the two cogs are then time-aligned +/- some constant; in practice, if you have the debugger going, it will delay by an unpredictable amount due to the serial lock.
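
The pairing rule above (0/1 and 2/3 work, 1/2 does not) comes down to the two cog IDs differing only in bit 0. A small Python sketch of that check follows; the function name is hypothetical and is not part of any P2 tool.

```python
def is_lut_sharing_pair(cog_a: int, cog_b: int) -> bool:
    """True when the two cog IDs form an even/odd pair such as 0/1 or 2/3.

    Two IDs belong to the same pair when they differ only in bit 0,
    which is the same as saying their XOR equals 1.
    """
    for cog in (cog_a, cog_b):
        if not 0 <= cog <= 7:
            raise ValueError("P2 cog IDs run from 0 to 7")
    return (cog_a ^ cog_b) == 1


if __name__ == "__main__":
    for pair in ((0, 1), (2, 3), (1, 2)):
        print(pair, is_lut_sharing_pair(*pair))
```

The first listing in this thread starts cogs 1 and 2, which this check rejects, so that pair could not share LUT.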

  • bob_g4bby Posts: 489
    edited 2025-12-25 09:19

    So here's me starting an odd/even pair of cogs with one command. They both run the same cog code in their respective cog RAMS. However, the cogs read their ID numbers and split off to run 'cogeven' and 'cogodd' code. I've kept my method of synchronising the two loops starting at 'cogeven1' and 'cogodd1':-

    '' File .......... post WAITATN synchronism test.spin2
    '' Version........ 2
    '' Purpose........ To determine what waitx adjustment is needed to synchronise instructions between two COGS, after executing WAITATN / COGATN
    '' Author......... Bob Edwards
    '' Email.......... bob.edwards50@yahoo.com
    '' Started........ / /
    '' Latest update.. / /
    
    ''=============================================================
    
    CON {timing constants}
            _xinfreq =  20_000_000
            _clkfreq = 180_000_000
    
    
    CON {fixed IO pins}
    
    CON {application IO pins}
            p56 = 56
            p57 = 57
    
    CON {Enumerated constants}
    
    CON {Data Structure definitions}
    
    OBJ {Child Objects}
    
    VAR {Variable Declarations}
    
    {Public / Private methods}
    
    pub main()
        coginit(COGEXEC_NEW_PAIR, @cogstart, 0 )        ' start two cogs with linked LUT reads
    
    DAT {Symbols and Data}
    
    DAT {Data Pointers}
    
    DAT {Cog-exec assembly language}
    
            org
    cogstart
            cogid _cogid
            testb _cogid, #0 WZ
    IF_Z    jmp #cogodd
    
    cogeven
            drvl #p56
            waitx ##100000000               ' wait for odd cog to load and execute waitatn
            add _cogid, #1                 ' now the odd cog number
            mov pr0, #0
            bith pr0, _cogid
            cogatn pr0                      ' wake up the odd number cog
            nop                             ' just two cycle delay needed for output pins to be in sync
    cogeven1
            drvh #p56                       ' compare this output pulse on pin 56
            drvl #p56
            jmp #cogeven1
    
    cogodd
            drvl #p57
            setluts #1                      ' writes to even cog LUT will also write to the odd cog LUT
            waitatn                         ' wait for even cog to kick me off
    cogodd1
            drvh #p57                       ' with this one on pin 57
            drvl #p57
            jmp #cogodd1
    
            _cogid      res 1
    
    DAT {Hub-exec assembly language}
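
Both listings pass COGATN a bit mask with one bit per cog: `cogatn #4` in version 1 sets bit 2 for cog 2, and the `mov`/`bith` pair in version 2 builds the same kind of mask for the odd partner. The Python sketch below models only that mask arithmetic; the helper names are invented for illustration.

```python
def atn_mask(*cog_ids: int) -> int:
    """Build a COGATN-style mask with one bit set per target cog."""
    mask = 0
    for cog in cog_ids:
        if not 0 <= cog <= 7:
            raise ValueError("P2 cog IDs run from 0 to 7")
        mask |= 1 << cog
    return mask


def odd_partner_mask(even_cog: int) -> int:
    """Mask the even cog would send to wake its odd partner.

    Mirrors 'add _cogid,#1 / mov pr0,#0 / bith pr0,_cogid' in version 2.
    """
    if even_cog % 2 != 0:
        raise ValueError("expected an even cog ID")
    return atn_mask(even_cog + 1)


if __name__ == "__main__":
    print(atn_mask(2))          # the literal used by 'cogatn #4' in version 1
    print(odd_partner_mask(2))  # even cog 2 waking odd cog 3
    print(atn_mask(1, 2))       # one ATN to two cogs at once, as Rayman suggests
```

A mask with two bits set is how a third cog could signal both workers with a single COGATN.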
    
  • Kaio Posts: 281

    @bob_g4bby said:

    • The goal - compute a cordic function on a 1024 sample array in the least amount of time.
    • The problem - moving samples into and out of cog registers cannot be done with RDLONG and WRLONG because these take a variable number of cycles to execute and cordic outputs will get missed if the tight program loops vary in cycles

    @bob_g4bby, another approach would be to use a block transfer to copy 512 consecutive samples (half of the array per cog) into the LUT of each cog, do the calculation in both cogs in parallel, and then copy the results back to hub with another block transfer. Then you would only need to wake up the second cog with a COGATN instruction to start its work, and you avoid exact synchronization.
    As the runtime of the calculation and the data transfer time are the same in each cog, both will end at the same time.

    The block transfer is the fastest memory transfer on the P2: after the first long, each further long takes only one clock.

    'takes 6 + 9 ... 16 + 511 = 526 ... 533 clocks to copy 512 longs from hub
            mov      ptrb,##ptr_hub
            setq2   #512-1
            rdlong  $200-$200,ptrb++    'copy from hub RAM to LUT
    
    'takes 6 + 3 .. 10 + 511 = 520 ... 527 clocks to copy 512 longs to hub
            mov      ptrb,##ptr_hub
            setq2   #512-1
            wrlong  $200-$200,ptrb++    'copy back from LUT to hub RAM
    

    To modify the data in LUT RAM you can use RDLUT/WRLUT, which together take 5 clocks for each sample.
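
The cycle counts in the two code comments above follow one pattern: setup, plus a variable cost for the first long, plus one clock for each remaining long. Here is a short Python sketch of that sum. The 4 + 2 split of the 6 setup clocks (the `mov` with its `##` prefix, then `setq2`) is my reading of the listing, not something Kaio states.

```python
SETUP_CYCLES = 6  # assumed: mov ptrb,##ptr_hub (4 with prefix) + setq2 (2)


def block_transfer_cycles(longs, first_min, first_max, setup=SETUP_CYCLES):
    """Return (best, worst) total cycles for a SETQ2 block transfer.

    first_min/first_max are the cycle range for the first long; every
    further long is counted as one clock, as described in the post.
    """
    if longs < 1:
        raise ValueError("need at least one long")
    rest = longs - 1
    return setup + first_min + rest, setup + first_max + rest


if __name__ == "__main__":
    print("hub -> LUT:", block_transfer_cycles(512, 9, 16))
    print("LUT -> hub:", block_transfer_cycles(512, 3, 10))
```

With 512 longs this reproduces the 526 ... 533 and 520 ... 527 ranges quoted in the comments.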

  • TonyB_ Posts: 2,254
    edited 2025-12-19 00:01

    @Kaio said:
    The block transfer is the fastest memory transfer on the P2: after the first long, each further long takes only one clock.

    'takes 6 + 9 ... 16 + 511 = 526 ... 533 clocks to copy 512 longs from hub
            mov      ptrb,##ptr_hub
            setq2   #512-1
            rdlong  $200-$200,ptrb++    'copy from hub RAM to LUT
    
    'takes 6 + 3 .. 10 + 511 = 520 ... 527 clocks to copy 512 longs to hub
            mov      ptrb,##ptr_hub
            setq2   #512-1
            wrlong  $200-$200,ptrb++    'copy back from LUT to hub RAM
    

    To modify the data in LUT RAM you can use RDLUT/WRLUT, which together take 5 clocks for each sample.

    The fast block move from LUT RAM takes 6 + 4...11 + 511 = 521...528 cycles. I tested this thoroughly just a few days ago and the first write from LUT RAM takes one more cycle than from register RAM in the worst case. The reason for this difference is that it takes three cycles, not two, to read from LUT RAM.

    The extra cycle causes the hub RAM slot for a 3 cycle write to be missed, so there is a wait of one hub RAM revolution and the write takes 3 + 8 = 11 cycles. This applies to only one of the eight hub RAM egg beater slices. For the other seven the writes take 4...10 cycles and are identical for reg and LUT RAM.
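
The slice behaviour described above can be written down as a tiny model. This Python sketch only encodes the figures in the post: a register-sourced first write costs 3 to 10 cycles depending on the slice, and a LUT-sourced write matches it except on the one slice where the 3-cycle slot is missed. The name `slice_offset` is an invented label for "how many slots away the target slice is", not a P2 term.

```python
HUB_SLICES = 8  # one hub RAM revolution, per the post


def first_write_cycles(slice_offset: int, from_lut: bool) -> int:
    """Cycles for the first long of a block write, by slice offset 0..7."""
    if not 0 <= slice_offset < HUB_SLICES:
        raise ValueError("slice offset must be 0..7")
    reg_cycles = 3 + slice_offset
    if from_lut and slice_offset == 0:
        # The extra LUT read cycle misses the slot: wait one revolution.
        return reg_cycles + HUB_SLICES
    return reg_cycles


def block_write_range(longs: int, from_lut: bool, setup: int = 6):
    """(best, worst) total cycles for a block write of the given length."""
    firsts = [first_write_cycles(s, from_lut) for s in range(HUB_SLICES)]
    rest = longs - 1
    return setup + min(firsts) + rest, setup + max(firsts) + rest


if __name__ == "__main__":
    print("from reg:", block_write_range(512, from_lut=False))
    print("from LUT:", block_write_range(512, from_lut=True))
```

Seven of the eight offsets give identical results for both sources, which matches the statement that only one slice is affected.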

  • bob_g4bby Posts: 489
    edited 2025-12-19 09:00

    That's some useful timing info. I too had wondered whether block move into and out of LUT was the way to go. Assuming the fast block move out of LUT takes the same time, we have a total of 1042 .. 1056 cycles spent just pushing the data around on top of the time doing the cordic function. You've highlighted another difference in speed there - reading and writing LUT ram takes 3 cycles whereas using the FIFOs in each cog, reading and writing Hub ram with WRFAST and RDFAST takes 2 cycles. Sounds like another overhead there too. It's possible that calculations using LUT ram are slower overall than calculating directly in Hub RAM?

    I'll get down to writing the cordic code, and publish it here. I envisage it turning into a bit of a library for doing dsp things on arrays of data - like audio or software radio streams etc.

  • Kaio Posts: 281
    edited 2025-12-20 09:23

    @bob_g4bby said:
    Assuming the fast block move out of LUT takes the same time, we have a total of 1042 .. 1056 cycles spent just pushing the data around on top of the time doing the cordic function.

    @bob_g4bby, yes, that is correct, as each cog will process only 512 longs. But with your approach it takes about four times as many cycles to move the data, since one cog must read all 1024 longs and the second cog must write them all. Hence it will take 2048 clocks to read and 2048 clocks to write the data, in addition to the time spent on the cordic function.

    @bob_g4bby said:
    You've highlighted another difference in speed there - reading and writing LUT ram takes 3 cycles whereas using the FIFOs in each cog, reading and writing Hub ram with WRFAST and RDFAST takes 2 cycles. Sounds like another overhead there too. It's possible that calculations using LUT ram are slower overall than calculating directly in Hub RAM?

    One can use the FIFO with WFLONG on both cogs to write the results back to hub, as all the data to process is already in LUT RAM. This avoids writing the result to LUT (saving 2 clocks for each sample), which is the same time needed to write the data using the FIFO.
    Unfortunately, reading a LUT sample needs 3 clocks.

    My approach is quite easy as the code used in both cogs is identical. Only the pointer to read and write the 512 longs block is different.

    With your approach you need to hand over data to the second cog for the calculation, as well as the results calculated by the first cog, so that both can be written back to hub afterwards. I think this will add overhead.

  • Kaio Posts: 281

    @bob_g4bby, one thing I did not mention before, and which is important to know when using your approach, is that you need an additional 5120 clocks to move the data (1024 longs) from the first cog to the second cog. Even if you use the LUT to share the data between two cogs, you will not get the data transfer for free.
    As I already mentioned, it takes 5 clocks per long to write and read the LUT, which is necessary in your approach.

    The conclusion is that the less data each cog has to process, the better the performance of the calculation. If you need speed, consider using more cogs.

    With my approach the calculation is easily scalable, as the code is identical for each cog. Each cog calculates a small part of the data.
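
For reference, the data-movement figures quoted across the last few posts can be tallied side by side. This Python sketch simply adds up the numbers as stated in the thread (2 clocks per long through the FIFO, 5 clocks per long for a WRLUT plus RDLUT handover, and 1042 .. 1056 cycles for the block copies). It counts cycles, not elapsed time, since the cogs run in parallel, and it leaves out the cordic work entirely.

```python
SAMPLES = 1024
FIFO_CYCLES_PER_LONG = 2      # FIFO read or write, per the thread
HANDOVER_CYCLES_PER_LONG = 5  # WRLUT + RDLUT between the two cogs


def fifo_pipeline_cycles(samples=SAMPLES):
    """Cycle tally for the two-cog FIFO pipeline, using Kaio's figures."""
    tally = {
        "read": samples * FIFO_CYCLES_PER_LONG,
        "write": samples * FIFO_CYCLES_PER_LONG,
        "handover": samples * HANDOVER_CYCLES_PER_LONG,
    }
    tally["total"] = sum(tally.values())
    return tally


def block_copy_cycles():
    """(best, worst) per-cog range for copy in plus copy out, as quoted."""
    return 1042, 1056


if __name__ == "__main__":
    print("FIFO pipeline:", fifo_pipeline_cycles())
    print("block copy per cog:", block_copy_cycles())
```

The tally is only as good as the per-long figures it uses; measuring both variants on hardware would be the real test.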
