The unofficial P2 documentation project

Seairth · 2013-04-28 08:34

cgracey wrote: »

It's my lack of experience in actually programming DSP that caused me to come up with this overly-complex solution to what was a simple problem. If we revise the die, we'll improve this mechanism.

I suspect your experience is still better than most. Nevertheless, for anyone interested in DSP, I strongly recommend reading The Scientist and Engineer's Guide to Digital Signal Processing (http://www.dspguide.com, PDF version found at http://www.dspguide.com/pdfbook.htm). It's very accessible. It won't give code, but it should give enough understanding to write the code.

timx8 · 2013-04-28 10:27

This is very cool stuff. So, of the CORDIC unit, big multiplier, big divider, and big square-rooter: do these live in independent hardware, or do they share resources? Could an ambitious coder run all 4 simultaneously? I'm assuming the fast MUL/SCL/MAC instructions are independent of these?

Kye · 2013-04-28 15:21

Yeah, I believe they are all separate state machines.

cgracey · 2013-04-28 20:42

Kye wrote: »

Yeah, I believe they are all separate state machines.

That's right. They can all run at the same time.

potatohead · 2013-04-28 21:26

!!!

Wow. We have very luxurious math in PASM now. So far, I've used them a few times. Haven't interleaved ops yet, but obviously it's an option. I find myself working shifts and adds, only to remember that we've got fast math now. Fun!

pedward · 2013-04-28 22:04

Chip, I'd like to see documentation on the register remapping, could you detail this?

MJB · 2013-04-29 01:00

Seairth wrote: »

I suspect your experience is still better than most. Nevertheless, for anyone interested in DSP, I strongly recommend reading The Scientist and Engineer's Guide to Digital Signal Processing (http://www.dspguide.com, PDF version found at http://www.dspguide.com/pdfbook.htm). It's very accessible. It won't give code, but it should give enough understanding to write the code.

The PDFs are a little tedious to find, but if you start here:
http://www.dspguide.com/CH1.PDF
then it is CH1 .. CH34.PDF

I didn't find it in one file - which would be around 15MB

Seairth · 2013-04-29 06:20

MJB wrote: »

The PDFs are a little tedious to find, but if you start here:
http://www.dspguide.com/CH1.PDF
then it is CH1 .. CH34.PDF

I didn't find it in one file - which would be around 15MB

That's true. Each chapter is a separate PDF. I used an online tool to merge them into one file. I'd put that up here, but I don't think it falls under "permissible use".

cgracey · 2013-04-29 08:03

pedward wrote: »

Chip, I'd like to see documentation on the register remapping, could you detail this?

I'll work on that today. I should have it done by this evening.

User Name · 2013-04-29 12:46

potatohead wrote: »

We have very luxurious math in PASM now.

Seriously. I've never seen such luxury in all my years of embedded design.

At one point I wondered why only one cog would fit in a Cyclone IV 4C22. Now I know why.

There's really an extraordinary amount of logic packed in the P2.

pedward · 2013-04-29 13:21

User Name wrote: »

Seriously. I've never seen such luxury in all my years of embedded design.

At one point I wondered why only one cog would fit in a Cyclone IV 4C22. Now I know why. There's really an extraordinary amount of logic packed in the P2.

Did you think Chip has really been foolin' around for the last 5 years?

Sapieha · 2013-04-30 07:15

Hi Chip.

You said you would post info on COG-to-COG communication via the internal port, and on remapping of COG registers.

How is it going?

cgracey · 2013-04-30 08:11

Sapieha wrote: »

Hi Chip.

You said you would post info on COG-to-COG communication via the internal port, and on remapping of COG registers.

How is it going?

I'm still working on the register remapping. After that I'll cover the Port D cog-to-cog communication.

Sapieha · 2013-04-30 08:37

Hi Chip.

Thanks

cgracey · 2013-05-01 13:35

Okay. Here are the latest docs, which now include register remapping:

Prop2_Docs.txt

Here's the new section:

REGISTER REMAPPING
------------------

The SETMAP instruction is used to remap a 2^n-sized block of registers starting at $000, so
that direct accesses to those registers will be redirected to a range of identically-sized
blocks, which also build from $000. This feature allows a single program to run multiple
instances of itself by having unique sets of statically-addressable registers which switch
according to either INDA or the current task.

When using remapping, you must locate your program code above the highest block of
registers that the bottom-most block can be remapped to. For example, if you select
8 blocks of 16 registers, but are only using 6 of those blocks, your program code must
not start below register 96 (6*16), to avoid encroaching on the registers which are
the targets of remapping.
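
The placement rule above is plain arithmetic, and can be sketched in Python as a quick check. The function name `min_code_start` is made up for illustration and is not part of any Parallax tool:

```python
def min_code_start(blocks_used: int, regs_per_block: int) -> int:
    """Lowest register address where program code may begin.

    Blocks occupy registers 0 .. blocks_used*regs_per_block - 1, so code
    has to start at or above the first register past the last used block.
    """
    if blocks_used < 1 or regs_per_block < 1:
        raise ValueError("blocks_used and regs_per_block must be positive")
    return blocks_used * regs_per_block


# 8 blocks of 16 registers selected, but only 6 blocks actually used:
print(min_code_start(6, 16))  # 96
```

With 4 blocks of 4 registers all in use, the same rule gives 16 ($010) as the first register free for anything else.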

Here is the SETMAP instruction:

    SETMAP  D/#n            - Configure register remapping to %M_BBB_RRR

        %M = mode

            %0 = INDA selects the block
            %1 = task number selects the block

        %BBB = block count

            %000 = 1 block          remapping disabled for %000
            %001 = 2 blocks         remapping enabled for %001..%111
            %010 = 4 blocks
            %011 = 8 blocks
            %100 = 16 blocks
            %101 = 32 blocks
            %110 = 64 blocks
            %111 = 128 blocks

        %RRR = register count

            %000 = 1 register       remap $000
            %001 = 2 registers      remap $000..$001
            %010 = 4 registers      remap $000..$003
            %011 = 8 registers      remap $000..$007
            %100 = 16 registers     remap $000..$00F
            %101 = 32 registers     remap $000..$01F
            %110 = 64 registers     remap $000..$03F
            %111 = 128 registers    remap $000..$07F
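
As a rough aid for building the %M_BBB_RRR operand by hand, here is a small Python sketch that packs the three fields as listed above. The helper name `setmap_operand` and its arguments are hypothetical; only the bit layout comes from the documentation:

```python
VALID_SIZES = (1, 2, 4, 8, 16, 32, 64, 128)


def setmap_operand(by_task: bool, blocks: int, regs: int) -> int:
    """Pack a SETMAP operand as %M_BBB_RRR.

    by_task: False -> INDA selects the block (%M = 0),
             True  -> task number selects the block (%M = 1).
    blocks, regs: powers of two from 1 to 128, encoded as log2.
    """
    for name, value in (("blocks", blocks), ("regs", regs)):
        if value not in VALID_SIZES:
            raise ValueError(f"{name} must be one of {VALID_SIZES}")
    bbb = blocks.bit_length() - 1   # log2(blocks)
    rrr = regs.bit_length() - 1     # log2(regs)
    return (int(by_task) << 6) | (bbb << 3) | rrr


# 4 blocks of 4 registers, block chosen by INDA:
print(f"%{setmap_operand(False, 4, 4):07b}")  # %0010010
```

Note that the helper accepts every field combination; it does not check whether blocks times registers still fits in the 9-bit register address space.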


The new mapping scheme will be in effect on the third instruction after SETMAP. After that,
changes to INDA or the task number will have an immediate effect on block selection. The
remapping mechanism only works with hard-coded D and S addresses, not via INDA and INDB
accesses.

Below is an elaboration of all uniquely-useful remapping schemes:


                                  S/D addresses
%M_BBB_RRR    blocks regs      initial -> remapped       block selector
-----------------------------------------------------------------------------
%x_000_xxx    1      x               <same>

%0_001_000    2      1      %000000000 -> %00000000P     P = INDA[0]
%0_001_001    2      2      %00000000X -> %0000000PX
%0_001_010    2      4      %0000000XX -> %000000PXX     (2 threads)
%0_001_011    2      8      %000000XXX -> %00000PXXX
%0_001_100    2      16     %00000XXXX -> %0000PXXXX
%0_001_101    2      32     %0000XXXXX -> %000PXXXXX
%0_001_110    2      64     %000XXXXXX -> %00PXXXXXX
%0_001_111    2      128    %00XXXXXXX -> %0PXXXXXXX

%0_010_000    4      1      %000000000 -> %0000000PP     PP = INDA[1..0]
%0_010_001    4      2      %00000000X -> %000000PPX
%0_010_010    4      4      %0000000XX -> %00000PPXX     (4 threads)
%0_010_011    4      8      %000000XXX -> %0000PPXXX
%0_010_100    4      16     %00000XXXX -> %000PPXXXX
%0_010_101    4      32     %0000XXXXX -> %00PPXXXXX
%0_010_110    4      64     %000XXXXXX -> %0PPXXXXXX
%0_010_111    4      128    %00XXXXXXX -> %PPXXXXXXX

%0_011_000    8      1      %000000000 -> %000000PPP     PPP = INDA[2..0]
%0_011_001    8      2      %00000000X -> %00000PPPX
%0_011_010    8      4      %0000000XX -> %0000PPPXX     (8 threads)
%0_011_011    8      8      %000000XXX -> %000PPPXXX
%0_011_100    8      16     %00000XXXX -> %00PPPXXXX
%0_011_101    8      32     %0000XXXXX -> %0PPPXXXXX
%0_011_110    8      64     %000XXXXXX -> %PPPXXXXXX

%0_100_000    16     1      %000000000 -> %00000PPPP     PPPP = INDA[3..0]
%0_100_001    16     2      %00000000X -> %0000PPPPX
%0_100_010    16     4      %0000000XX -> %000PPPPXX     (16 threads)
%0_100_011    16     8      %000000XXX -> %00PPPPXXX
%0_100_100    16     16     %00000XXXX -> %0PPPPXXXX
%0_100_101    16     32     %0000XXXXX -> %PPPPXXXXX

%0_101_000    32     1      %000000000 -> %0000PPPPP     PPPPP = INDA[4..0]
%0_101_001    32     2      %00000000X -> %000PPPPPX
%0_101_010    32     4      %0000000XX -> %00PPPPPXX     (32 threads)
%0_101_011    32     8      %000000XXX -> %0PPPPPXXX
%0_101_100    32     16     %00000XXXX -> %PPPPPXXXX

%0_110_000    64     1      %000000000 -> %000PPPPPP     PPPPPP = INDA[5..0]
%0_110_001    64     2      %00000000X -> %00PPPPPPX
%0_110_010    64     4      %0000000XX -> %0PPPPPPXX     (64 threads)
%0_110_011    64     8      %000000XXX -> %PPPPPPXXX

%0_111_000    128    1      %000000000 -> %00PPPPPPP     PPPPPPP = INDA[6..0]
%0_111_001    128    2      %00000000X -> %0PPPPPPPX
%0_111_010    128    4      %0000000XX -> %PPPPPPPXX     (128 threads)

%1_001_000    2      1      %000000000 -> %00000000T     T = bit 0 of the task number
%1_001_001    2      2      %00000000X -> %0000000TX
%1_001_010    2      4      %0000000XX -> %000000TXX     (2 tasks)
%1_001_011    2      8      %000000XXX -> %00000TXXX
%1_001_100    2      16     %00000XXXX -> %0000TXXXX
%1_001_101    2      32     %0000XXXXX -> %000TXXXXX
%1_001_110    2      64     %000XXXXXX -> %00TXXXXXX
%1_001_111    2      128    %00XXXXXXX -> %0TXXXXXXX

%1_010_000    4      1      %000000000 -> %0000000TT     TT = task number
%1_010_001    4      2      %00000000X -> %000000TTX
%1_010_010    4      4      %0000000XX -> %00000TTXX     (4 tasks)
%1_010_011    4      8      %000000XXX -> %0000TTXXX
%1_010_100    4      16     %00000XXXX -> %000TTXXXX
%1_010_101    4      32     %0000XXXXX -> %00TTXXXXX
%1_010_110    4      64     %000XXXXXX -> %0TTXXXXXX
%1_010_111    4      128    %00XXXXXXX -> %TTXXXXXXX
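
The table can be read as one rule: if a hard-coded address falls inside the first block, the selector bits are placed directly above the register-offset bits. Below is a Python model of that rule, not a description of the actual hardware. The function name `remap` is made up, and masking the result to 9 bits is an assumption, since the table only lists schemes that fit in 9 bits:

```python
def remap(addr: int, operand: int, selector: int) -> int:
    """Model the S/D address remapping for a SETMAP operand %M_BBB_RRR.

    selector is INDA (M = 0) or the task number (M = 1); only its low
    BBB bits are used, as in the 'block selector' column of the table.
    """
    bbb = (operand >> 3) & 0b111
    rrr = operand & 0b111
    if bbb == 0 or addr >= (1 << rrr):
        return addr                       # remapping disabled, or outside the window
    block = selector & ((1 << bbb) - 1)   # e.g. INDA[1..0] for 4 blocks
    return ((block << rrr) | addr) & 0x1FF  # assumption: truncate to 9 bits


# %0_010_010 with INDA = $013: register $001 lands in block 3
print(hex(remap(0x001, 0b0_010_010, 0x013)))  # 0xd
```

Addresses at or above the window size (here $004 and up) pass through unchanged, which is why code and other data can sit above the blocks.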


Here is an example program which uses remapping with multi-threading:

DAT             org

period          long    2-1             '$000, thread 0   (20 longs initially execute as NOPs)
time            long    0               '$001, thread 0
pin_x           long    0               '$002, thread 0
pin_y           long    1               '$003, thread 0

                long    4-1             '$000, thread 1
                long    0               '$001, thread 1
                long    2               '$002, thread 1
                long    3               '$003, thread 1

                long    8-1             '$000, thread 2
                long    0               '$001, thread 2
                long    4               '$002, thread 2
                long    5               '$003, thread 2

                long    16-1            '$000, thread 3
                long    0               '$001, thread 3
                long    6               '$002, thread 3
                long    7               '$003, thread 3

pc              long    loop[4]         '$010..$013, all threads start at loop

                setmap  #%0_010_010     'remap 4 blocks of 4 regs by INDA[1..0]
                fixinda #pc+3,#pc       'set INDA to cycle through blocks and threads
                nop                     'allow SETMAP 3 clocks to take effect

loop            tasksw                  'switch to next thread
                incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 threads executing same code with unique variables)
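
To see what the four threads end up doing, here is a loose Python simulation of the loop above. It assumes INCMOD resets D to 0 and sets C when D equals S, and otherwise increments D and clears C; that reading is inferred from the comments and the "2-1" style period values, not taken from the instruction documentation. Pin hardware is reduced to a toggle counter, and all names are illustrative:

```python
def incmod(value: int, limit: int):
    """Assumed INCMOD behavior: wrap to 0 with C=1 when value == limit."""
    if value == limit:
        return 0, True
    return value + 1, False


def run_threads(rounds: int):
    """Run `rounds` round-robin passes; return pin_x toggle counts per thread."""
    # One register block per thread, mirroring the data longs in the listing.
    blocks = [{"period": p, "time": 0, "toggles": 0}
              for p in (2 - 1, 4 - 1, 8 - 1, 16 - 1)]
    for _ in range(rounds):
        for regs in blocks:              # TASKSW hands over to the next block
            regs["time"], c = incmod(regs["time"], regs["period"])
            if c:                        # if_c notp pin_x
                regs["toggles"] += 1
    return [b["toggles"] for b in blocks]


print(run_threads(16))  # [8, 4, 2, 1]
```

Under these assumptions each thread toggles its pin_x at half the rate of the thread before it, all from one copy of the loop.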


Here is an example program which uses remapping with multi-tasking:

DAT             org

period          long    2-1             '$000, task 0   (16 longs initially execute as NOPs)
time            long    0               '$001, task 0
pin_x           long    0               '$002, task 0
pin_y           long    1               '$003, task 0

                long    4-1             '$000, task 1
                long    0               '$001, task 1
                long    2               '$002, task 1
                long    3               '$003, task 1

                long    8-1             '$000, task 2
                long    0               '$001, task 2
                long    4               '$002, task 2
                long    5               '$003, task 2

                long    16-1            '$000, task 3
                long    0               '$001, task 3
                long    6               '$002, task 3
                long    7               '$003, task 3


                setmap  #%1_010_010     'remap 4 blocks of 4 regs by task
                settask #%11_10_01_00   'set all 4 tasks in motion
                jmptask #loop,#%1111    'herd tasks to loop


loop            incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 tasks executing same code with unique registers)

pedward · 2013-05-01 15:36

Wow, I didn't realize the P2 had both threading and multi-tasking.

Threading appears to be cooperative multi-tasking, yielding control of the COG when the loop is finished, whereas the multi-tasking appears to be more like temporal multi-threading.

TASKSW only yields control of the main COG after a single section of code runs, executing only one PC at a time.

SETTASK allows for up to 4 PCs to be executing simultaneously, but at different pipeline stages, so each PC moves forward in lockstep with another.

TASKSW is useful for applications where you have either very time sensitive, or blocking code that you want to run, where other tasks don't have hard realtime demands.

SETTASK is useful for applications where you need hard realtime in multiple threads at once, but at the expense of only using non-blocking, non-flushing instructions.

cgracey · 2013-05-01 15:50

I just added some details to the latest docs in post #316, in case anyone already grabbed them.

Seairth · 2013-05-01 19:07

cgracey wrote: »

I just added some details to the latest docs in post #316, in case anyone already grabbed them.

The attached document states the following for TASKSW: "Instructions trailing TASKSWD are in the next thread". However, this would seem to contradict the way that the other xxxD instructions seem to work (i.e. trailing instructions that are already in the pipeline are associated with the code that's *before* the jump, not after). If TASKSW is conceptually different this way (the documentation is correct), I suggest emphasizing that in the document.

cgracey · 2013-05-01 20:26

Seairth wrote: »

The attached document states the following for TASKSW: "Instructions trailing TASKSWD are in the next thread". However, this would seem to contradict the way that the other xxxD instructions seem to work (i.e. trailing instructions that are already in the pipeline are associated with the code that's *before* the jump, not after). If TASKSW is conceptually different this way (the documentation is correct), I suggest emphasizing that in the document.

The reason is that TASKSWD is (I think) 'JMPRETD INDA,++INDA WZ, WC' and when INDA gets incremented, the next instruction has the remapped registers already pointing to the next thread's register block and the flags have been saved and updated, as well. So, the thread context has switched and those trailing instructions are in the next thread.

I'll make sure this is documented better. Thanks for pointing this out.

Seairth · 2013-05-01 21:34

The threading example makes my brain hurt, which might explain why it looks "wrong" to me. When that code runs, do you actually end up with an initial four switches that basically do nothing but fix up the PC array? Would this also work:

DAT             org

period          long    2-1             '$000, thread 0   (20 longs initially execute as NOPs)
time            long    0               '$001, thread 0
pin_x           long    0               '$002, thread 0
pin_y           long    1               '$003, thread 0

                long    4-1             '$000, thread 1
                long    0               '$001, thread 1
                long    2               '$002, thread 1
                long    3               '$003, thread 1

                long    8-1             '$000, thread 2
                long    0               '$001, thread 2
                long    4               '$002, thread 2
                long    5               '$003, thread 2

                long    16-1            '$000, thread 3
                long    0               '$001, thread 3
                long    6               '$002, thread 3
                long    7               '$003, thread 3

pc              long    thread[4]       '$010..$013, all threads start at thread

                setmap  #%0_010_010     'remap 4 blocks of 4 regs by INDA[1..0]
                fixinda #pc+3,#pc       'set INDA to cycle through blocks and threads
                nop                     'allow SETMAP 3 clocks to take effect

loop            tasksw                  'switch to next thread
thread          incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 threads executing same code with unique variables)

My reasoning here is that the pc array will contain the addresses for the thread label (not loop), and TASKSW (rather, JMPRET) is going to load that address from the next array element while storing PC+1 (with PC value being the address of the TASKSW instruction) in the current array element (which is always the same address as the thread label).

Or do I have this all wrong?

cgracey · 2013-05-01 22:14

Seairth wrote: »

The threading example makes my brain hurt, which might explain why it looks "wrong" to me. When that code runs, do you actually end up with an initial four switches that basically do nothing but fix up the PC array? Would this also work:

That code will do the same thing as my example. It will just execute the loop's body code before the first TASKSW for each thread.

My example would have been a lot richer if I showed a more complex program (instead of a loop) which made conditional branches with TASKSW's placed throughout. That way, the PC array wouldn't always contain the same values, but possibly a different return address for each thread, most of the time.
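
For anyone else whose brain hurts, the PC-array shuffle can be traced with a small Python model. It follows the reading discussed above: TASKSW stores the return address (TASKSW address + 1) in the current array element, steps INDA to the next element with wrap-around, and jumps to the address found there. The starting INDA index of 0 and the addresses $017/$018 (counted from the listing) are assumptions, and the function name is made up:

```python
def trace_tasksw(entry: int, tasksw_addr: int, switches: int, threads: int = 4):
    """Return (thread index, resume address) after each TASKSW.

    entry:       initial value of every element in the pc array
    tasksw_addr: address of the single TASKSW in the loop
    Assumes every thread reaches that same TASKSW again before its next
    switch, which holds for the simple loop in the example.
    """
    pc = [entry] * threads
    inda = 0                             # assumption: first thread uses element 0
    trace = []
    for _ in range(switches):
        pc[inda] = tasksw_addr + 1       # save return address for this thread
        inda = (inda + 1) % threads      # ++INDA, wrapping like FIXINDA's range
        trace.append((inda, pc[inda]))   # jump to the next thread's saved PC
    return trace


LOOP, BODY = 0x017, 0x018
# Original example: pc array preloaded with 'loop'
print(trace_tasksw(LOOP, LOOP, 5))
# Variant: pc array preloaded with the instruction after TASKSW
print(trace_tasksw(BODY, LOOP, 5))
```

In this model the original example spends its first three switches landing back on TASKSW, while the variant runs the loop body right away; after the first full rotation the two traces are identical.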

Sapieha · 2013-05-01 22:33

Hi Chip.

Nice - Thanks

BEEP · 2013-05-02 01:28

Prop2_Docs_130501.pdf

Edit:
File deleted.

cgracey · 2013-05-02 14:00

Okay. I got the inter-cog exchange documented:

Prop2_Docs.txt

Here's the new part:

INTER-COG EXCHANGE
------------------

The fourth I/O port of each cog (PIND/DIRD) is implemented as a 32-bit inter-cog data
exchange, instead of 32 external I/O pins.

Each cog outputs 32 bits of data via PIND, with the actual output being the logical-AND of
the PIND and DIRD registers. Each cog can select which of the other cogs' PIND outputs are
going to be gated into its own PIND inputs, on a per-byte basis.

The only control over the inter-cog exchange is each cog's PIND input filter.

The SETXCH instruction is used to set the PIND input filter:

    SETXCH  D/#n            - Set PIND input filter to %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA

        %DDDDDDDD = filter for PIND input bits 31..24

            %xxxxxxx1 = cog 0's PIND[31..24] output will be OR'd into PIND[31..24] input
            %xxxxxx1x = cog 1's PIND[31..24] output will be OR'd into PIND[31..24] input
            %xxxxx1xx = cog 2's PIND[31..24] output will be OR'd into PIND[31..24] input
            %xxxx1xxx = cog 3's PIND[31..24] output will be OR'd into PIND[31..24] input
            %xxx1xxxx = cog 4's PIND[31..24] output will be OR'd into PIND[31..24] input
            %xx1xxxxx = cog 5's PIND[31..24] output will be OR'd into PIND[31..24] input
            %x1xxxxxx = cog 6's PIND[31..24] output will be OR'd into PIND[31..24] input
            %1xxxxxxx = cog 7's PIND[31..24] output will be OR'd into PIND[31..24] input

        %CCCCCCCC = filter for PIND input bits 23..16

            %xxxxxxx1 = cog 0's PIND[23..16] output will be OR'd into PIND[23..16] input
            %xxxxxx1x = cog 1's PIND[23..16] output will be OR'd into PIND[23..16] input
            %xxxxx1xx = cog 2's PIND[23..16] output will be OR'd into PIND[23..16] input
            %xxxx1xxx = cog 3's PIND[23..16] output will be OR'd into PIND[23..16] input
            %xxx1xxxx = cog 4's PIND[23..16] output will be OR'd into PIND[23..16] input
            %xx1xxxxx = cog 5's PIND[23..16] output will be OR'd into PIND[23..16] input
            %x1xxxxxx = cog 6's PIND[23..16] output will be OR'd into PIND[23..16] input
            %1xxxxxxx = cog 7's PIND[23..16] output will be OR'd into PIND[23..16] input

        %BBBBBBBB = filter for PIND input bits 15..8

            %xxxxxxx1 = cog 0's PIND[15..8] output will be OR'd into PIND[15..8] input
            %xxxxxx1x = cog 1's PIND[15..8] output will be OR'd into PIND[15..8] input
            %xxxxx1xx = cog 2's PIND[15..8] output will be OR'd into PIND[15..8] input
            %xxxx1xxx = cog 3's PIND[15..8] output will be OR'd into PIND[15..8] input
            %xxx1xxxx = cog 4's PIND[15..8] output will be OR'd into PIND[15..8] input
            %xx1xxxxx = cog 5's PIND[15..8] output will be OR'd into PIND[15..8] input
            %x1xxxxxx = cog 6's PIND[15..8] output will be OR'd into PIND[15..8] input
            %1xxxxxxx = cog 7's PIND[15..8] output will be OR'd into PIND[15..8] input

        %AAAAAAAA = filter for PIND input bits 7..0

            %xxxxxxx1 = cog 0's PIND[7..0] output will be OR'd into PIND[7..0] input
            %xxxxxx1x = cog 1's PIND[7..0] output will be OR'd into PIND[7..0] input
            %xxxxx1xx = cog 2's PIND[7..0] output will be OR'd into PIND[7..0] input
            %xxxx1xxx = cog 3's PIND[7..0] output will be OR'd into PIND[7..0] input
            %xxx1xxxx = cog 4's PIND[7..0] output will be OR'd into PIND[7..0] input
            %xx1xxxxx = cog 5's PIND[7..0] output will be OR'd into PIND[7..0] input
            %x1xxxxxx = cog 6's PIND[7..0] output will be OR'd into PIND[7..0] input
            %1xxxxxxx = cog 7's PIND[7..0] output will be OR'd into PIND[7..0] input


To input only cog 0's 32-bit output, you would use the filter value $01_01_01_01. To input
the logical-OR of cog 0's and cog 1's 32-bit outputs, you would use $03_03_03_03. In most
programming cases, it may be desirable to just see one other cog's full 32-bit output in
your PIND input, but many other arrangements are possible. For example, by using 8-bit or
16-bit fields with SETF/MOVF to transfer data piecewise from several cogs, a final cog can
read the aggregate 32-bit result.
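
As a rough illustration of the filter semantics described above, here is a small Python model (not PASM, and not part of the original docs). The function names `make_filter` and `pind_input` are hypothetical, and the model assumes the input is simply the OR of every selected cog's (PIND AND DIRD) output, byte lane by byte lane, as the text states.

```python
def make_filter(d=0, c=0, b=0, a=0):
    """Pack four 8-bit cog masks into a %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA value."""
    for mask in (d, c, b, a):
        if not 0 <= mask <= 0xFF:
            raise ValueError("each cog mask must fit in 8 bits")
    return (d << 24) | (c << 16) | (b << 8) | a


def pind_input(filter_value, pind, dird):
    """Model the gated PIND input seen by one cog.

    pind and dird are 8-element lists holding each cog's PIND and DIRD values.
    A cog's actual output is modeled as PIND AND DIRD.
    """
    result = 0
    for lane in range(4):                       # lane 0 = bits 7..0, lane 3 = bits 31..24
        cog_mask = (filter_value >> (8 * lane)) & 0xFF
        lane_mask = 0xFF << (8 * lane)
        for cog in range(8):
            if cog_mask & (1 << cog):           # this cog is selected for this byte lane
                result |= pind[cog] & dird[cog] & lane_mask
    return result


# Made-up register contents for cogs 0 and 1; the other cogs output nothing.
pind = [0x11223344, 0xAABBCCDD, 0, 0, 0, 0, 0, 0]
dird = [0xFFFFFFFF, 0x0000FFFF, 0, 0, 0, 0, 0, 0]

only_cog0 = pind_input(0x01_01_01_01, pind, dird)
cog0_or_cog1 = pind_input(0x03_03_03_03, pind, dird)
# Upper 16 bits from cog 0, lower 16 bits from cog 1.
mixed = pind_input(make_filter(d=0x01, c=0x01, b=0x02, a=0x02), pind, dird)
```

With these made-up values, cog 1's DIRD masks off its upper 16 bits, so only its lower two bytes ever reach another cog's input.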

After SETXCH, PIND can be read for newly-filtered data on the third clock:

        SETXCH  #$00000001      'change filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from new filter


Likewise, data written to a cog's PIND becomes readable from a PIND input on the third clock.
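
The timing above can be pictured with a toy Python model. This is only a sketch: the class `FilterPipeline` is hypothetical, it counts reads rather than real clocks (assuming one read per clock, as in the MOV sequence shown), and it says nothing about how the hardware actually implements the delay.

```python
from collections import deque


class FilterPipeline:
    """Toy model: a new filter value is first seen by the third read after SETXCH."""

    DELAY = 2  # assumption: the first two reads still see the old filter

    def __init__(self, initial_filter=0):
        self.active = initial_filter
        self.pending = deque()          # entries are [new_filter, reads_remaining]

    def setxch(self, new_filter):
        self.pending.append([new_filter, self.DELAY])

    def read_filter(self):
        """Return the filter in effect for this read, then advance one step."""
        while self.pending and self.pending[0][1] == 0:
            self.active = self.pending.popleft()[0]
        current = self.active
        for entry in self.pending:
            entry[1] -= 1
        return current


pipe = FilterPipeline(initial_filter=0x02020202)
pipe.setxch(0x00000001)
seen = [pipe.read_filter() for _ in range(4)]
```

Here `seen` holds the old filter for the first two reads and the new filter from the third read onward, matching the comments in the MOV sequence.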

The PIND port does not connect to CTRA, CTRB, XFR, or SER, but it does support the
following pin instructions, as if it were a regular I/O port:

    GETP/GETNP                                      - pin reads
    OFFP/NOTP/CLRP/SETP/SETPC/SETPNC/SETPZ/SETPNZ   - pin writes
    JP/JPD/JNP/JNPD                                 - pin branches

Sapieha · 2013-05-02 14:01

Hi Chip.

BIG Thanks.

cgracey wrote: »

Okay. I got the inter-cog exchange documented:

Bill Henning · 2013-05-02 14:13

Thanks Chip!

Sapieha just messaged me that you added the PIND docs... I am now digesting it.

cgracey wrote: »

Okay. I got the inter-cog exchange documented:

jazzed · 2013-05-02 14:27

Chip, what is SER? I see brief references to SNDSER and RCVSER in the doc, but nothing else.

Bill Henning · 2013-05-02 14:51

Steve,

SNDSER & RCVSER are Chip's new inter-prop communications instructions. Last I heard was:

- 32-bit buffered input and output (1 long buffer I think)
- three lines needed: TXD, RXD, CLK
- can block on read/write, or poll for completion
- may need shared crystal between the props
- 1 bit per clock cycle (original plan was PLL'd higher)

Many of us are waiting for more info :-)

pedward · 2013-05-02 15:00

The SER runs at clock speed, without any specific control, last I heard. Similar to how SDRAM runs at clock rate.

The synthesis was too muddy to make an independent clock work.

jazzed · 2013-05-02 15:14

Wish it was SERDES. I begged, and begged, and begged.
