PROPELLER 2 MEMORY
------------------

In the Propeller 2, there are two primary types of memory:

HUB MEMORY

    256K bytes of 1-port main memory shared by all cogs

        - cogs launch from this memory
        - cogs can read/write this memory as bytes, words, longs, and wides (8 longs)
        - $00000..$00DFF is ROM - contains Booter, SHA-256/HMAC, and Monitor
        - $00E00..$3FFFF is RAM - for application usage


COG MEMORY (8 instances)

    512 longs of 4-port register RAM for code and data usage

        - simultaneous instruction, source, and destination reading, plus destination writing
        - $000..$1F1 = RAM
        - $1F2       = INDA, indirect window
        - $1F3       = INDB, indirect window
        - $1F4..$1F7 = PINA..PIND, pin input (read-only)
        - $1F8..$1FB = OUTA..OUTD, pin output state control
        - $1FC..$1FF = DIRA..DIRD, pin output drive control

    256 longs of 2-port auxiliary RAM for data and video usage

        - readable and writeable via instructions or free-running pin-transfer circuit
        - video circuit can read pixel data asynchronously from second port

    4 longs x 4 tasks' worth of LIFO stacks for CALL/RET/PUSH/POP instructions

    8 longs x 1 line of data cache for RDBYTEC/RDWORDC/RDLONGC/RDWIDEC instructions

    8 longs x 4 lines of instruction cache for executing from hub memory



INSTRUCTION ENCODING
--------------------

Cog instructions are 32 bits long and composed of several bit fields. There are two main types of
instructions: dual-operand and single-operand. Dual-operand instructions specify both a D register
for read/write access or an immediate D value, and an S register for read access or an immediate S
value. Single-operand instructions specify only a D register or immediate value.


Dual-operand encoding:

TTTTTTT ZC I CCCC DDDDDDDDD SSSSSSSSS     IF_x    MNEM    D/#,S/#   WZ,WC


Single-operand encoding:

1111111 ZC x CCCC DDDDDDDDD TTTTTTTTT     IF_x    MNEM    D/#       WZ,WC



      TTTTTTT = Instruction according to MNEM

            Z = Z flag write control: 0=don't write Z, 1=write Z
                Defaults to 0, but may be set to 1 by adding WZ (Write Z) after operand(s)

                Unless specified otherwise, the value written to Z is the NOR of the 32-bit D result.

            C = C flag write control: 0=don't write C, 1=write C
                Defaults to 0, but may be set to 1 by adding WC (Write C) after operand(s)

            I = SSSSSSSSS is register or immediate, 0=register address (S), 1=immediate (#n)

         CCCC = Execution condition (expressed by IF_x prefix)
                Determines Z/C flag conditions upon which the instruction will execute

                CCCC  condition       CCCC  mnemonic prefixes (in easy-to-read order)
                ---------------------------------------------------------------------
                0000  never           1111  IF_ALWAYS (default)
                0001  nc &  nz        1100  IF_C                          IF_B
                0010  nc &  z         0011  IF_NC                         IF_AE
                0011  nc              1010  IF_Z                          IF_E
                0100  c  &  nz        0101  IF_NZ                         IF_NE
                0101  nz              1000  IF_C_AND_Z     IF_Z_AND_C
                0110  c  <> z         0100  IF_C_AND_NZ    IF_NZ_AND_C
                0111  nc |  nz        0010  IF_NC_AND_Z    IF_Z_AND_NC
                1000  c  &  z         0001  IF_NC_AND_NZ   IF_NZ_AND_NC   IF_A
                1001  c  =  z         1110  IF_C_OR_Z      IF_Z_OR_C      IF_BE
                1010  z               1101  IF_C_OR_NZ     IF_NZ_OR_C
                1011  nc |  z         1011  IF_NC_OR_Z     IF_Z_OR_NC
                1100  c               0111  IF_NC_OR_NZ    IF_NZ_OR_NC
                1101  c  |  nz        1001  IF_C_EQ_Z      IF_Z_EQ_C
                1110  c  |  z         0110  IF_C_NE_Z      IF_Z_NE_C
                1111  always          0000  IF_NEVER

    DDDDDDDDD = Destination register address (D) or zero-extended immediate value (#n)

    SSSSSSSSS = Source register address (S) or zero-extended immediate value (#n)
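
As an illustration only, the dual-operand layout and the CCCC table above can be modeled in Python.
The names encode_dual and condition_met are hypothetical and not part of any Propeller tool. The
sketch uses one observation read off the condition table: each CCCC bit enables execution for one
(C,Z) combination, and the bit number is C*2 + Z.

```python
def encode_dual(opcode, z, c, i, cond, d, s):
    """Pack TTTTTTT ZC I CCCC DDDDDDDDD SSSSSSSSS into one 32-bit word."""
    assert 0 <= opcode < 128 and 0 <= cond < 16
    assert z in (0, 1) and c in (0, 1) and i in (0, 1)
    assert 0 <= d < 512 and 0 <= s < 512
    return (opcode << 25) | (z << 24) | (c << 23) | (i << 22) | (cond << 18) | (d << 9) | s


def condition_met(cond, c_flag, z_flag):
    """Return True if an instruction with this CCCC executes for the given flags."""
    return bool((cond >> (c_flag * 2 + z_flag)) & 1)


IF_ALWAYS = 0b1111
IF_C_NE_Z = 0b0110
IF_NEVER = 0b0000

# RDBYTE D,PTRA with D = register 0: opcode %0000000, I = 1, S field = %000000000
word = encode_dual(0b0000000, 0, 0, 1, IF_ALWAYS, 0, 0b000000000)
```

Here word evaluates to $007C0000, and condition_met(IF_C_NE_Z, 1, 0) is True while
condition_met(IF_C_NE_Z, 1, 1) is False.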



HUB MEMORY INSTRUCTIONS
-----------------------

These instructions read and write hub memory.

All of these instructions use D as the data conduit, except RDWIDE/RDWIDEC/WRWIDE, which use the eight
WIDE registers. The WIDEs can be mapped into cog register space using the SETWIDE instruction or kept
hidden, in which case they are still useful as a data conduit and as a read cache. If mapped, the WIDEs
overlay eight contiguous cog registers and can be read or written like any other registers, though they
cannot be executed from. Any write via D to the WIDE registers, when mapped, will also affect the
underlying cog registers. A RDWIDE/RDWIDEC will affect the WIDE registers, but not the underlying cog
registers.

The cached reads RDBYTEC/RDWORDC/RDLONGC/RDWIDEC will do a RDWIDE if the current read address is
outside of the 8-long hub window of the prior RDWIDE. Otherwise, they will immediately return cached
data. The DCACHEX instruction invalidates the cache, forcing a fresh RDWIDE next time a cached read
executes.
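
The hit/miss rule described above can be sketched as a toy Python model. The class and method names
are hypothetical. Two assumptions are made that the text does not state: the cache starts out
invalid, and the 8-long window is the aligned 32-byte block selected by address bits 17..5.

```python
class DataCache:
    """Toy model of the single 8-long (32-byte) data cache line."""

    def __init__(self):
        self.line = None                # address bits 17..5 of the cached wide; None = invalid

    def rdwide(self, addr):
        """A plain RDWIDE always reads the hub and defines the cached window."""
        self.line = (addr & 0x3FFFF) >> 5

    def cached_read(self, addr):
        """Model RDBYTEC/RDWORDC/RDLONGC/RDWIDEC: return True on a cache hit."""
        line = (addr & 0x3FFFF) >> 5
        hit = self.line == line
        if not hit:
            self.rdwide(addr)           # a miss performs a RDWIDE of the new window
        return hit

    def dcachex(self):
        """Invalidate the cache so that the next cached read misses."""
        self.line = None
```

With this model, a cached read at $00100 misses, a following read at $0011F hits, and a read at
$00120 misses again because it falls in the next 32-byte window.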

Hub memory instructions must wait for their cog's hub cycle, which comes once every 8 clocks. The
timing relationship between a cog's instruction stream and its hub cycle is generally indeterminate,
causing these instructions to take varying numbers of clocks. Timing can be made deterministic, though,
by intentionally spacing these instructions apart so that after the first in a series executes, the
subsequent hub memory instructions fall on hub cycles, making them take the minimal number of
clocks. The trick is to write useful code to go in between them.

WRBYTE/WRWORD/WRLONG/WRWIDE/RDWIDE complete on the hub cycle, making them take 1..8 clocks.

RDBYTE/RDWORD/RDLONG complete on the 2nd clock after the hub cycle, making them take 3..10 clocks.

RDBYTEC/RDWORDC/RDLONGC take only 1 clock if data is already cached, otherwise 3..10 clocks.

RDWIDEC takes only 1 clock if data is cached, otherwise 1..8 clocks.
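
The clock ranges above can be reproduced with a very small Python model. This is a sketch under one
assumption: the distance to the cog's next hub cycle is expressed as 0..7 clocks, where 0 means the
hub cycle coincides with the instruction. The function name hub_clocks and the "kind" labels are
hypothetical.

```python
def hub_clocks(clocks_until_hub_cycle, kind):
    """Return the clocks taken by an uncached hub memory instruction.

    kind is "write" (WRBYTE/WRWORD/WRLONG/WRWIDE), "rdwide" (RDWIDE),
    or "read" (RDBYTE/RDWORD/RDLONG).
    """
    assert 0 <= clocks_until_hub_cycle <= 7
    # Writes and RDWIDE complete on the hub cycle; other reads complete 2 clocks later.
    base = {"write": 1, "rdwide": 1, "read": 3}[kind]
    return clocks_until_hub_cycle + base


write_range = [hub_clocks(n, "write") for n in range(8)]
read_range = [hub_clocks(n, "read") for n in range(8)]
```

write_range spans 1..8 and read_range spans 3..10, matching the figures quoted above.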

After a RDWIDE, mapped WIDE registers are accessible via D and S after three clocks:


        RDWIDE  hubaddress      'read a wide into the WIDE registers mapped at wide0..wide7

        NOP                     'do something for at least 3 clocks to allow WIDEs to update
        NOP
        NOP

        CMP     wide0,wide1     'mapped WIDEs are now accessible via D and S


After a SETWIDE/SETWIDZ, mapped WIDE registers are writable immediately at their new address, but
their contents only become readable via D and S after 2 instructions:


        SETWIDE #wide0          'map WIDEs to wide0..wide7 (three LSBs of address must be %000)

        NOP                     'do at least two instructions to queue up WIDEs
        NOP

        CMP     wide0,wide1     'mapped WIDEs are now accessible via D and S


On cog startup, the WIDE registers are hidden and cleared to 0's.


instructions  (PTRx = PTRA/PTRB)                                                                   clocks
---------------------------------------------------------------------------------------------------------
0000000 ZC 0 CCCC DDDDDDDDD SSSSSSSSS     RDBYTE  D,S       'read byte at S into D                  3..10
0000000 ZC 1 CCCC DDDDDDDDD SUPNNNNNN     RDBYTE  D,PTRx    'read byte at PTRx into D               3..10
0000001 ZC 0 CCCC DDDDDDDDD SSSSSSSSS     RDBYTEC D,S       'read cached byte at S into D        1, 3..10 
0000001 ZC 1 CCCC DDDDDDDDD SUPNNNNNN     RDBYTEC D,PTRx    'read cached byte at PTRx into D     1, 3..10
1101000 00 0 CCCC DDDDDDDDD SSSSSSSSS     WRBYTE  D,S       'write lower byte in D at S              1..8
1101000 00 1 CCCC DDDDDDDDD SUPNNNNNN     WRBYTE  D,PTRx    'write lower byte in D at PTRx           1..8
1101000 01 0 CCCC DDDDDDDDD SSSSSSSSS     WRBYTE  #D,S      'write immediate D at S                  1..8
1101000 01 1 CCCC DDDDDDDDD SUPNNNNNN     WRBYTE  #D,PTRx   'write immediate D at PTRx               1..8

0000010 ZC 0 CCCC DDDDDDDDD SSSSSSSSS     RDWORD  D,S       'read word at S into D                  3..10
0000010 ZC 1 CCCC DDDDDDDDD SUPNNNNNN     RDWORD  D,PTRx    'read word at PTRx into D               3..10
0000011 ZC 0 CCCC DDDDDDDDD SSSSSSSSS     RDWORDC D,S       'read cached word at S into D        1, 3..10 
0000011 ZC 1 CCCC DDDDDDDDD SUPNNNNNN     RDWORDC D,PTRx    'read cached word at PTRx into D     1, 3..10
1101000 10 0 CCCC DDDDDDDDD SSSSSSSSS     WRWORD  D,S       'write lower word in D at S              1..8
1101000 10 1 CCCC DDDDDDDDD SUPNNNNNN     WRWORD  D,PTRx    'write lower word in D at PTRx           1..8
1101000 11 0 CCCC DDDDDDDDD SSSSSSSSS     WRWORD  #D,S      'write immediate D at S                  1..8
1101000 11 1 CCCC DDDDDDDDD SUPNNNNNN     WRWORD  #D,PTRx   'write immediate D at PTRx               1..8

0000100 ZC 0 CCCC DDDDDDDDD SSSSSSSSS     RDLONG  D,S       'read long at S into D                  3..10
0000100 ZC 1 CCCC DDDDDDDDD SUPNNNNNN     RDLONG  D,PTRx    'read long at PTRx into D               3..10
0000101 ZC 0 CCCC DDDDDDDDD SSSSSSSSS     RDLONGC D,S       'read cached long at S into D        1, 3..10 
0000101 ZC 1 CCCC DDDDDDDDD SUPNNNNNN     RDLONGC D,PTRx    'read cached long at PTRx into D     1, 3..10
1101001 00 0 CCCC DDDDDDDDD SSSSSSSSS     WRLONG  D,S       'write long in D at S                    1..8
1101001 00 1 CCCC DDDDDDDDD SUPNNNNNN     WRLONG  D,PTRx    'write long in D at PTRx                 1..8
1101001 01 0 CCCC DDDDDDDDD SSSSSSSSS     WRLONG  #D,S      'write immediate D at S                  1..8
1101001 01 1 CCCC DDDDDDDDD SUPNNNNNN     WRLONG  #D,PTRx   'write immediate D at PTRx               1..8

1111111 00 0 CCCC DDDDDDDDD 000101101     RDWIDEC D         'read cached wide at D into WIDEs     1, 1..8
1111111 00 1 CCCC SUPNNNNNN 000101101     RDWIDEC PTRx      'read cached wide at PTRx into WIDEs  1, 1..8
1111111 00 0 CCCC DDDDDDDDD 000101110     RDWIDE  D         'read wide at D into WIDEs               1..8
1111111 00 1 CCCC SUPNNNNNN 000101110     RDWIDE  PTRx      'read wide at PTRx into WIDEs            1..8
1111111 00 0 CCCC DDDDDDDDD 000101111     WRWIDE  D         'write WIDEs at D                        1..8
1111111 00 1 CCCC SUPNNNNNN 000101111     WRWIDE  PTRx      'write WIDEs at PTRx                     1..8
---------------------------------------------------------------------------------------------------------


PTRx expressions:

    INDEX = -32..+31 for simple offsets, 0..31 for ++'s, or 0..32 for --'s
    SCALE = 1 for byte, 2 for word, 4 for long, or 32 for wide

    S = 0 for PTRA, 1 for PTRB
    U = 0 to keep PTRx same, 1 to update PTRx
    P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
    NNNNNN = INDEX
    nnnnnn = -INDEX


    SUPNNNNNN     PTR expression
    -----------------------------------------------------------------------------
    000000000     PTRA              'use PTRA
    100000000     PTRB              'use PTRB
    011000001     PTRA++            'use PTRA,                PTRA += SCALE
    111000001     PTRB++            'use PTRB,                PTRB += SCALE
    011111111     PTRA--            'use PTRA,                PTRA -= SCALE
    111111111     PTRB--            'use PTRB,                PTRB -= SCALE
    010000001     ++PTRA            'use PTRA + SCALE,        PTRA += SCALE
    110000001     ++PTRB            'use PTRB + SCALE,        PTRB += SCALE
    010111111     --PTRA            'use PTRA - SCALE,        PTRA -= SCALE
    110111111     --PTRB            'use PTRB - SCALE,        PTRB -= SCALE

    000NNNNNN     PTRA[INDEX]       'use PTRA + INDEX*SCALE
    100NNNNNN     PTRB[INDEX]       'use PTRB + INDEX*SCALE
    011NNNNNN     PTRA++[INDEX]     'use PTRA,                PTRA += INDEX*SCALE
    111NNNNNN     PTRB++[INDEX]     'use PTRB,                PTRB += INDEX*SCALE
    011nnnnnn     PTRA--[INDEX]     'use PTRA,                PTRA -= INDEX*SCALE
    111nnnnnn     PTRB--[INDEX]     'use PTRB,                PTRB -= INDEX*SCALE
    010NNNNNN     ++PTRA[INDEX]     'use PTRA + INDEX*SCALE,  PTRA += INDEX*SCALE
    110NNNNNN     ++PTRB[INDEX]     'use PTRB + INDEX*SCALE,  PTRB += INDEX*SCALE
    010nnnnnn     --PTRA[INDEX]     'use PTRA - INDEX*SCALE,  PTRA -= INDEX*SCALE
    110nnnnnn     --PTRB[INDEX]     'use PTRB - INDEX*SCALE,  PTRB -= INDEX*SCALE


Examples:

0000000 00 1 1111 DDDDDDDDD 000000000     RDBYTE  D,PTRA         'read byte at PTRA into D
1101000 10 1 1111 DDDDDDDDD 111000001     WRWORD  D,PTRB++       'write lower word in D at PTRB,      PTRB += 1*2
0000100 00 1 1111 DDDDDDDDD 011111111     RDLONG  D,PTRA--       'read long at PTRA into D,           PTRA -= 1*4
1111111 00 1 1111 110000001 000101110     RDWIDE  ++PTRB         'read wide at PTRB+32 into WIDEs,    PTRB += 1*32
1101000 00 1 1111 DDDDDDDDD 010111111     WRBYTE  D,--PTRA       'write lower byte in D at PTRA-1,    PTRA -= 1*1

1101000 10 1 1111 DDDDDDDDD 100000111     WRWORD  D,PTRB[7]      'write lower word in D to PTRB+7*2
0000101 00 1 1111 DDDDDDDDD 011011111     RDLONGC D,PTRA++[31]   'read cached long at PTRA into D,    PTRA += 31*4
1111111 00 1 1111 111111101 000101111     WRWIDE  PTRB--[3]      'write WIDEs at PTRB,                PTRB -= 3*32
1101000 00 1 1111 DDDDDDDDD 010000110     WRBYTE  D,++PTRA[6]    'write lower byte in D to PTRA+6*1,  PTRA += 6*1
0000010 00 1 1111 DDDDDDDDD 110110110     RDWORD  D,--PTRB[10]   'read word at PTRB-10*2 into D,      PTRB -= 10*2
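
The SUPNNNNNN field and its effect on PTRx can be checked against the examples above with a short
Python sketch. The function names encode_ptr and access are hypothetical. Wrapping the pointer to
18 bits is an assumption based on PTRA/PTRB being 18-bit registers. The U=0, P=1 combination does
not appear in the table, so the model simply leaves PTRx unchanged for it.

```python
def encode_ptr(ptrb, update, post, index):
    """Build the 9-bit SUPNNNNNN field; index is signed (negative for the -- forms)."""
    assert ptrb in (0, 1) and update in (0, 1) and post in (0, 1)
    assert -32 <= index <= 31
    return (ptrb << 8) | (update << 7) | (post << 6) | (index & 0x3F)


def access(ptr, field, scale):
    """Return (hub address used, new PTRx value) for one PTRx expression."""
    update = (field >> 7) & 1
    post = (field >> 6) & 1
    index = field & 0x3F
    if index >= 32:
        index -= 64                         # sign-extend the 6-bit index
    target = (ptr + index * scale) & 0x3FFFF
    address = ptr if post else target       # P=1 uses PTRx before modification
    return address, (target if update else ptr)


wrwide_field = encode_ptr(1, 1, 1, -3)      # WRWIDE PTRB--[3]
rdword_field = encode_ptr(1, 1, 0, -10)     # RDWORD D,--PTRB[10]
```

wrwide_field is %111111101 and rdword_field is %110110110, as in the example table. Starting from
PTRB = $01000, access(0x1000, wrwide_field, 32) gives address $01000 and a new PTRB of $00FA0.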


Bytes, words, longs, and wides are addressed as follows: 

    for RDBYTE/RDBYTEC/WRBYTE, address = %XXXXXXXXXXXXXXXXXX (bits 17..0 are used)
    for RDWORD/RDWORDC/WRWORD, address = %XXXXXXXXXXXXXXXXX- (bits 17..1 are used)
    for RDLONG/RDLONGC/WRLONG, address = %XXXXXXXXXXXXXXXX-- (bits 17..2 are used)
    for RDWIDE/RDWIDEC/WRWIDE, address = %XXXXXXXXXXXXX----- (bits 17..5 are used)

address  byte  word    long        wide
-------------------------------------------------------------------
00000-   50   *7250   *706F7250   *0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00001-   72    7250    706F7250    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00002-   6F   *706F    706F7250    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00003-   70    706F    706F7250    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00004-   32   *2E32   *20302E32    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00005-   2E    2E32    20302E32    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00006-   30   *2030    20302E32    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00007-   20    2030    20302E32    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00008-   00   *2000   *0C7C2000    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00009-   20    2000    0C7C2000    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0000A-   7C   *0C7C    0C7C2000    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0000B-   0C    0C7C    0C7C2000    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0000C-   03   *CC03   *0C7CCC03    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0000D-   CC    CC03    0C7CCC03    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0000E-   7C   *0C7C    0C7CCC03    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0000F-   0C    0C7C    0C7CCC03    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00010-   45   *FE45   *0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00011-   FE    FE45    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00012-   C1   *0DC1    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00013-   0D    0DC1    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00014-   E3   *B6E3   *0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00015-   B6    B6E3    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00016-   FC   *0CFC    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00017-   0C    0CFC    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00018-   01   *C601   *0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
00019-   C6    C601    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0001A-   7C   *0C7C    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0001B-   0C    0C7C    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0001C-   01   *C601   *0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0001D-   C6    C601    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0001E-   7C   *0D7C    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250
0001F-   0D    0D7C    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE450C7CCC030C7C200020302E32706F7250

* new word/long/wide
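
The table can be reproduced with a few lines of Python, shown here as a sketch. ROM_START holds the
32 bytes from the byte column above, and hub_read is a hypothetical helper that drops the unused low
address bits and assembles the value in little-endian order.

```python
# First 32 bytes of hub memory, copied from the byte column of the table
ROM_START = bytes.fromhex(
    "50726F70322E3020"
    "00207C0C03CC7C0C"
    "45FEC10DE3B6FC0C"
    "01C67C0C01C67C0D"
)

SIZES = {"byte": 1, "word": 2, "long": 4, "wide": 32}


def hub_read(mem, addr, kind):
    """Read a byte, word, long, or wide at addr, ignoring the unused low address bits."""
    size = SIZES[kind]
    base = addr & ~(size - 1)               # align downward to the access size
    return int.from_bytes(mem[base:base + size], "little")


word_at_9 = hub_read(ROM_START, 0x09, "word")
long_at_e = hub_read(ROM_START, 0x0E, "long")
```

word_at_9 is $2000 and long_at_e is $0C7CCC03, matching the rows for addresses 00009 and 0000E.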



PTRA/PTRB INSTRUCTIONS
----------------------

Each cog has two 18-bit pointers, PTRA and PTRB, which can be read, written, modified,
and used to access hub memory.

At cog startup, the PTRA and PTRB registers are initialized as follows:

    PTRA = %XX_XXXXXXXX_XXXXXXXX, data from launching cog, usually a pointer
    PTRB = %XX_XXXXXXXX_XXXXXX00, long address in hub where cog code was loaded from


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
1111111 ZC 0 CCCC DDDDDDDDD 000001010     GETPTRA D         'get PTRA into D                    1
1111111 ZC 0 CCCC DDDDDDDDD 000001011     GETPTRB D         'get PTRB into D                    1

1111111 00 0 CCCC DDDDDDDDD 010001000     SETPTRA D         'set PTRA to D                      1
1111111 00 1 CCCC DDDDDDDDD 010001000     SETPTRA #D        'set PTRA to #D                     1
1111111 00 0 CCCC DDDDDDDDD 010001001     SETPTRB D         'set PTRB to D                      1
1111111 00 1 CCCC DDDDDDDDD 010001001     SETPTRB #D        'set PTRB to #D                     1

1111111 00 0 CCCC DDDDDDDDD 010001010     ADDPTRA D         'add D into PTRA                    1
1111111 00 1 CCCC DDDDDDDDD 010001010     ADDPTRA #D        'add #D into PTRA                   1
1111111 00 0 CCCC DDDDDDDDD 010001011     ADDPTRB D         'add D into PTRB                    1
1111111 00 1 CCCC DDDDDDDDD 010001011     ADDPTRB #D        'add #D into PTRB                   1

1111111 00 0 CCCC DDDDDDDDD 010001100     SUBPTRA D         'subtract D from PTRA               1
1111111 00 1 CCCC DDDDDDDDD 010001100     SUBPTRA #D        'subtract #D from PTRA              1
1111111 00 0 CCCC DDDDDDDDD 010001101     SUBPTRB D         'subtract D from PTRB               1
1111111 00 1 CCCC DDDDDDDDD 010001101     SUBPTRB #D        'subtract #D from PTRB              1
-------------------------------------------------------------------------------------------------



WIDE-RELATED INSTRUCTIONS
-------------------------

Each cog has eight WIDE registers which form a 256-bit conduit between the hub memory and the cog.
This conduit can transfer eight longs every 8 clocks via the RDWIDE/WRWIDE instructions. It can
also be used as an 8-long/16-word/32-byte read cache, by using RDBYTEC/RDWORDC/RDLONGC/RDWIDEC.

Initially hidden and cleared to zero, the WIDE registers are mappable into cog register space by
using the SETWIDE/SETWIDZ instructions to set an 8-register-aligned base address (%xxxxxx000) where
the WIDE registers are to appear. If the three LSBs are not %000, the WIDEs will be hidden. SETWIDZ
works just like SETWIDE, but also clears the eight WIDE registers.
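
The mapping rule can be expressed as a small Python sketch. The function name wide_window is
hypothetical, and truncating the base to 9 bits is an assumption based on the 9-bit cog register
address space.

```python
def wide_window(base):
    """Return the cog registers overlaid by the WIDEs, or None when they are hidden."""
    base &= 0x1FF                           # 9-bit cog register address
    if base & 0b111:
        return None                         # three LSBs are not %000: WIDEs stay hidden
    return list(range(base, base + 8))


mapped = wide_window(0x1E8)
hidden = wide_window(0x1E9)
```

mapped lists registers $1E8..$1EF, while hidden is None.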


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
1111111 00 0 CCCC DDDDDDDDD 010011110     SETWIDE D         'set WIDE base to D                 1
1111111 00 1 CCCC DDDDDDDDD 010011110     SETWIDE #D        'set WIDE base to #D                1
1111111 00 0 CCCC DDDDDDDDD 010011111     SETWIDZ D         'set WIDE base to D, WIDEs=0        1
1111111 00 1 CCCC DDDDDDDDD 010011111     SETWIDZ #D        'set WIDE base to #D, WIDEs=0       1
1111111 00 0 CCCC 000000000 100011000     DCACHEX           'invalidate cache                   1
-------------------------------------------------------------------------------------------------



HUB CONTROL INSTRUCTIONS
------------------------

These instructions are used to control hub circuits and cogs.

Hub instructions must wait for their cog's hub cycle, which comes once every 8 clocks. In cases where
there is no result to wait for (no Z, C, or D), these instructions complete on the hub cycle, making
them take 1..8 clocks, depending on where the hub cycle is in relation to the instruction. In cases
where a result is anticipated (Z, C, or D), these instructions complete on the 1st clock after the
hub cycle, making them take 2..9 clocks.


COGNEW  D, S/#
--------------

COGNEW starts the lowest-numbered idle cog.

For COGNEW, D specifies a long address in hub memory that is the start of the program that is to be
loaded into the idle cog, while S is an 18-bit parameter (usually an address) that will be conveyed to
PTRA of that cog. PTRB of that cog will be set to the start address of its new program in hub memory,
which is the same as the D value used in the COGNEW instruction, AND'd with $3FFFC to form a hub long
address.

COGNEW will return the number of the started cog (0..7) into D, with C=0 indicating success or C=1
indicating failure, in which case no cog was idle and so D is invalid.
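
The COGNEW behavior described above can be sketched in Python as follows. The function name and the
list-of-booleans representation of cog state are hypothetical. Truncating S to 18 bits is an
assumption that follows from PTRA being an 18-bit register.

```python
def cognew(running, d, s):
    """Start the lowest-numbered idle cog.

    running is a list of 8 booleans (True = cog running) and is updated in place.
    Returns (cog number, C flag, new cog's PTRA, new cog's PTRB).
    """
    for cog, busy in enumerate(running):
        if not busy:
            running[cog] = True
            return cog, 0, s & 0x3FFFF, d & 0x3FFFC
    return None, 1, None, None              # no idle cog: C=1 and D is invalid


cogs = [True, True, False, False, True, False, False, False]
result = cognew(cogs, 0x1236, 0x7FFFF)
```

result is (2, 0, $3FFFF, $01234): cog 2 was the lowest idle cog, C=0 reports success, and the
program address has been AND'd with $3FFFC.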


COGINIT D, S/#, #0..7
---------------------

COGINIT is used to start a cog by its number (0..7). Any cog can be (re)started, whether it is idle
or running. A cog can even execute a COGINIT to restart itself with a new program.

COGINIT uses D and S identically to COGNEW, but doesn't return anything in D or C, as its behavior
is deterministic.

COGINIT uses a third operand to convey the number of the cog to be started (0..7). Those three bits,
with a leading 0 bit, are located in nibble 6 of the COGINIT instruction. The SETNIB instruction can
be used to make the cog number variable:


        COGID   x                'get my cog number into x
        SETNIB  :inst,x,#6       'install x into COGINIT
        NOP                      'must execute two instructions before modified code can execute
        NOP                      '(NOPs are not required in 4-way round-robin multitasking)
:inst   COGINIT pgm,ptr,#0       'restart me
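
The effect of the SETNIB patch on the instruction word can be illustrated in Python. This is a
sketch: the helper setnib is hypothetical, and it assumes nibbles are numbered from 0 at bits 3..0
up to 7 at bits 31..28, so that nibble 6 covers bits 27..24 of the COGINIT encoding 11000nn n0.

```python
def setnib(word, value, nib):
    """Replace 4-bit nibble number nib (0 = bits 3..0) of a 32-bit word with value."""
    shift = nib * 4
    cleared = word & ~(0xF << shift) & 0xFFFFFFFF
    return cleared | ((value & 0xF) << shift)


COGINIT_COG0 = 0b1100000 << 25              # opcode bits of COGINIT with cog number 0
patched = setnib(COGINIT_COG0, 5, 6)        # install cog number 5

cog_number = (patched >> 24) & 0b111        # the three n bits sit at bits 26..24
```

patched is $C5000000 and cog_number reads back as 5.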


When a cog is started, $1F4 contiguous longs are read from hub memory and written to cog registers
$000..$1F3. The cog will then begin execution at $000. This process takes 1,017 clocks.


CLKSET  D
---------

CLKSET writes the lower 9 bits of D to the hub clock register:

%R_MMMM_XX_SS

R = 1 for hardware reset, 0 for continued operation

MMMM = PLL mode:

        %1111 for multiply XI by 16
        %1110 for multiply XI by 15
        %1101 for multiply XI by 14
        %1100 for multiply XI by 13
        %1011 for multiply XI by 12
        %1010 for multiply XI by 11
        %1001 for multiply XI by 10
        %1000 for multiply XI by 9
        %0111 for multiply XI by 8
        %0110 for multiply XI by 7
        %0101 for multiply XI by 6
        %0100 for multiply XI by 5
        %0011 for multiply XI by 4
        %0010 for multiply XI by 3
        %0001 for multiply XI by 2
        %0000 for disabled, else XX must be set for XI input or XI/XO crystal oscillator

XX = XI/XO pin mode:

        %11 for XI/XO crystal oscillator with 30pF internal loading and 1M-ohm feedback
        %10 for XI/XO crystal oscillator with 15pF internal loading and 1M-ohm feedback
        %01 for XI input, XO floats
        %00 for XI reads low, XO floats

SS = Clock selector:

        %11 for PLL
        %10 for XTAL (10MHz-20MHz)
        %01 for RCSLOW (~20kHz)
        %00 for RCFAST (~20MHz)


Because the clock register is cleared to %0_0000_00_00 on reset, the chip starts up in RCFAST mode
with both the crystal oscillator and the PLL disabled. Before switching to XTAL or PLL mode from RCFAST
or RCSLOW, the crystal oscillator must be enabled and given 10ms to stabilize. The PLL stabilizes within
10us, so it can be enabled at the same time as the crystal oscillator. Once the crystal is stabilized, you
can switch between XTAL and RCFAST/RCSLOW without any stability concerns. If the PLL is also enabled, you
can switch freely among PLL, XTAL, and RCFAST/RCSLOW modes. You can change the PLL multiplier while
in PLL mode, but beware that some frequency overshoot and undershoot will occur as the PLL settles to its
new frequency. This only poses a hardware problem if you are switching upwards and the resulting overshoot
might exceed the speed limit of the chip.
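
As a sketch of how the %R_MMMM_XX_SS fields combine, the following Python helper composes the 9-bit
value written by CLKSET. The function name and its parameters are hypothetical. The two values at
the end follow the startup sequence described above: first enable the crystal oscillator and PLL
while still running from RCFAST, then switch the selector to PLL once the crystal has stabilized.

```python
def clkset_value(pll_mult=0, xtal_mode=0b00, clock_sel=0b00, reset=0):
    """Compose the 9-bit %R_MMMM_XX_SS clock register value.

    pll_mult is the XI multiplier (2..16), or 0 to leave the PLL disabled.
    """
    assert pll_mult == 0 or 2 <= pll_mult <= 16
    assert 0 <= xtal_mode <= 3 and 0 <= clock_sel <= 3 and reset in (0, 1)
    mmmm = pll_mult - 1 if pll_mult else 0  # %0001 = x2 ... %1111 = x16
    return (reset << 8) | (mmmm << 4) | (xtal_mode << 2) | clock_sel


# Step 1: stay on RCFAST (%00), enable the 15pF crystal oscillator (%10) and the x16 PLL
enable = clkset_value(pll_mult=16, xtal_mode=0b10, clock_sel=0b00)

# Step 2: after the 10ms crystal settling time, select the PLL (%11)
run = clkset_value(pll_mult=16, xtal_mode=0b10, clock_sel=0b11)
```

enable is %0_1111_10_00 ($0F8) and run is %0_1111_10_11 ($0FB).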


COGID   D
---------

If WC is not specified, COGID returns the number of the cog (0..7) into D.

If WC is specified, COGID returns the state of cog D into C, where 0=idle / 1=running, without writing D.


COGSTOP D/#
-----------

COGSTOP stops the cog specified in D/# (0..7). The stopped cog will return to a reset state in which all
of its output signals will be held low, cancelling any effects it was having on I/O pins.


LOCKNEW D
LOCKRET D/#
LOCKSET D/#
LOCKCLR D/#
-----------

There are eight semaphore locks available in the chip which can be borrowed with LOCKNEW, returned with
LOCKRET, set with LOCKSET, and cleared with LOCKCLR.

While any cog can set or clear any lock without using LOCKNEW or LOCKRET, LOCKNEW and LOCKRET are provided
so that cog programs have a dynamic and simple means of acquiring and relinquishing the locks at run-time.

When a lock is set with LOCKSET, its state is set to 1 and its prior state is returned in C. LOCKCLR works
the same way, but clears the lock's state to 0. By having the hub perform the atomic operation of setting/
clearing and reporting the prior state, cogs can utilize locks to ensure that only one cog has permission
to do something at once. If a lock starts out cleared and multiple cogs vie for the lock by doing a
'LOCKSET locknum  wc', the cog to get C=0 back 'wins' and can have exclusive access to some shared
resource while the other cogs get C=1 back. When the winning cog is done, it can do a 'LOCKCLR locknum' to
clear the lock and give another cog the opportunity to get C=0 back.

LOCKNEW returns the next available lock into D, with C=1 if no lock was free.

LOCKRET frees the lock in D so that it can be checked out again by LOCKNEW.

LOCKSET sets the lock in D and returns its prior state in C.

LOCKCLR clears the lock in D and returns its prior state in C.
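
The lock semantics above can be modeled in a few lines of Python. This is only a sketch of the
set-and-report-prior-state behavior: the class name is hypothetical, and treating "next available"
as the lowest-numbered free lock is an assumption.

```python
class HubLocks:
    """Toy model of the eight hub semaphore locks."""

    def __init__(self):
        self.state = [0] * 8                # lock bits changed by LOCKSET/LOCKCLR
        self.taken = [False] * 8            # locks checked out with LOCKNEW

    def locknew(self):
        """Return (lock number, C); C=1 means no lock was free."""
        for n in range(8):
            if not self.taken[n]:
                self.taken[n] = True
                return n, 0
        return None, 1

    def lockret(self, n):
        """Make lock n available to LOCKNEW again."""
        self.taken[n & 7] = False

    def lockset(self, n):
        """Set lock n and return its prior state (the C flag)."""
        prior, self.state[n & 7] = self.state[n & 7], 1
        return prior

    def lockclr(self, n):
        """Clear lock n and return its prior state (the C flag)."""
        prior, self.state[n & 7] = self.state[n & 7], 0
        return prior


hub = HubLocks()
lock, c = hub.locknew()
first = hub.lockset(lock)                   # C=0: this cog wins the lock
second = hub.lockset(lock)                  # C=1: a competing cog must retry
released = hub.lockclr(lock)                # winner releases; prior state was 1
```

Only the caller that gets 0 back from lockset owns the shared resource, which is the contention
pattern described for 'LOCKSET locknum  wc'.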


instructions                                                                                 clocks
---------------------------------------------------------------------------------------------------
1001111 0C 0 CCCC DDDDDDDDD SSSSSSSSS     COGNEW  D,S     'launch new cog at D, cog PTRA = S   2..9
1001111 0C 1 CCCC DDDDDDDDD SSSSSSSSS     COGNEW  D,#S    'launch new cog at D, cog PTRA = #S  2..9

11000nn n0 0 CCCC DDDDDDDDD SSSSSSSSS     COGINIT D,S,#n  'launch cog n at D, cog PTRA = S     1..8
11000nn n0 1 CCCC DDDDDDDDD SSSSSSSSS     COGINIT D,#S,#n 'launch cog n at D, cog PTRA = #S    1..8

1111111 Z0 0 CCCC DDDDDDDDD 000000000     COGID   D       'get cog number into D               2..9
1111111 Z1 0 CCCC DDDDDDDDD 000000000     COGID   D   WC  'get cog D state, C = running        2..9

1111111 ZC 0 CCCC DDDDDDDDD 000000010     LOCKNEW D       'get new lock into D, C = busy       2..9

1111111 00 0 CCCC DDDDDDDDD 010000000     CLKSET  D       'set clock to D                      1..8
1111111 00 1 CCCC DDDDDDDDD 010000000     CLKSET  #D      'set clock to #D                     1..8

1111111 00 0 CCCC DDDDDDDDD 010000001     COGSTOP D       'stop cog D                          1..8
1111111 00 1 CCCC DDDDDDDDD 010000001     COGSTOP #D      'stop cog #D                         1..8

1111111 0C 0 CCCC DDDDDDDDD 010000010     LOCKSET D       'set lock D, C = prev state          1..9
1111111 0C 1 CCCC DDDDDDDDD 010000010     LOCKSET #D      'set lock #D, C = prev state         1..9

1111111 0C 0 CCCC DDDDDDDDD 010000011     LOCKCLR D       'clear lock D, C = prev state        1..9
1111111 0C 1 CCCC DDDDDDDDD 010000011     LOCKCLR #D      'clear lock #D, C = prev state       1..9

1111111 00 0 CCCC DDDDDDDDD 010000100     LOCKRET D       'return lock D                       1..8
1111111 00 1 CCCC DDDDDDDDD 010000100     LOCKRET #D      'return lock #D                      1..8
---------------------------------------------------------------------------------------------------



INDIRECT REGISTERS
------------------

Each cog has two indirect registers: INDA and INDB. They are located at $1F2 and $1F3.

By using INDA or INDB for D or S, the register pointed at by INDA or INDB is addressed.

INDA and INDB each have three hidden 9-bit registers associated with them: the pointer, the bottom limit, and
the top limit. The bottom and top limits are inclusive values which set automatic wrapping boundaries for the
pointer. This way, circular buffers can be established within cog RAM and accessed using simple INDA/INDB
references.

FIXINDA/FIXINDB/FIXINDS set the pointer(s) to an initial value, while setting the bottom limit(s) to the
lower of the initial and terminal values and the top limit(s) to the higher.

SETINDA/SETINDB/SETINDS is used to set or adjust the pointer value(s) while forcing the associated bottom and
top limit(s) to $000 and $1FF, respectively.
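
The pointer and limit behavior can be sketched in Python as below. The class and method names are
hypothetical. The sketch assumes that stepping past the top limit wraps to the bottom limit and
vice versa, which is how the inclusive wrapping boundaries are read here. The state chosen in the
constructor is arbitrary, since the text does not give a startup value.

```python
class IndPointer:
    """Toy model of one INDx pointer with its hidden bottom and top limits."""

    def __init__(self):
        self.setind(0)                      # arbitrary starting state for the model

    def fixind(self, terminal, initial):
        """FIXINDx #terminal,#initial: set the pointer and both limits."""
        self.ptr = initial & 0x1FF
        self.bottom = min(initial, terminal) & 0x1FF
        self.top = max(initial, terminal) & 0x1FF

    def setind(self, addr):
        """SETINDx #addr: set the pointer and open the limits to $000..$1FF."""
        self.ptr, self.bottom, self.top = addr & 0x1FF, 0x000, 0x1FF

    def inc(self):
        """INDx++ effect: step forward, wrapping from top to bottom."""
        self.ptr = self.bottom if self.ptr == self.top else self.ptr + 1

    def dec(self):
        """INDx-- effect: step backward, wrapping from bottom to top."""
        self.ptr = self.top if self.ptr == self.bottom else self.ptr - 1


ring = IndPointer()
ring.fixind(0x13, 0x10)                     # 4-register circular buffer at $010..$013
visited = []
for _ in range(6):
    visited.append(ring.ptr)
    ring.inc()
```

visited comes out as $010, $011, $012, $013, $010, $011, showing the circular-buffer wrap.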

Because indirect addressing must occur in the 2nd stage of the pipeline, long before C and Z are valid for
conditional execution in the 4th stage, all instructions which use indirect addressing are forced to always
execute. This frees the conditional bit field (CCCC) for specifying indirect operations. The top two bits of
CCCC are used for indirect D and the bottom two bits are used for indirect S. If only D or S is indirect, the
other two bits in CCCC are ignored.

Here is the INDA/INDB usage scheme which repurposes the CCCC field:

TTTTTTT ZC I CCCC DDDDDDDDD SSSSSSSSS
-------------------------------------
xxxxxxx xx x 00xx 111110010 xxxxxxxxx        D = INDA        'use INDA
xxxxxxx xx x 00xx 111110011 xxxxxxxxx        D = INDB        'use INDB
xxxxxxx xx x 01xx 111110010 xxxxxxxxx        D = INDA++      'use INDA,      INDA += 1
xxxxxxx xx x 01xx 111110011 xxxxxxxxx        D = INDB++      'use INDB,      INDB += 1
xxxxxxx xx x 10xx 111110010 xxxxxxxxx        D = INDA--      'use INDA,      INDA -= 1
xxxxxxx xx x 10xx 111110011 xxxxxxxxx        D = INDB--      'use INDB,      INDB -= 1
xxxxxxx xx x 11xx 111110010 xxxxxxxxx        D = ++INDA      'use INDA+1,    INDA += 1
xxxxxxx xx x 11xx 111110011 xxxxxxxxx        D = ++INDB      'use INDB+1,    INDB += 1

xxxxxxx xx 0 xx00 xxxxxxxxx 111110010        S = INDA        'use INDA
xxxxxxx xx 0 xx00 xxxxxxxxx 111110011        S = INDB        'use INDB
xxxxxxx xx 0 xx01 xxxxxxxxx 111110010        S = INDA++      'use INDA,      INDA += 1
xxxxxxx xx 0 xx01 xxxxxxxxx 111110011        S = INDB++      'use INDB,      INDB += 1
xxxxxxx xx 0 xx10 xxxxxxxxx 111110010        S = INDA--      'use INDA,      INDA -= 1
xxxxxxx xx 0 xx10 xxxxxxxxx 111110011        S = INDB--      'use INDB,      INDB -= 1
xxxxxxx xx 0 xx11 xxxxxxxxx 111110010        S = ++INDA      'use INDA+1,    INDA += 1
xxxxxxx xx 0 xx11 xxxxxxxxx 111110011        S = ++INDB      'use INDB+1,    INDB += 1


If both D and S are the same indirect register, the two 2-bit fields in CCCC are OR'd together to get the
post-modifier effect:

0100000 00 0 0011 111110010 111110010        MOV INDA,++INDA    'Move @INDA+1 into @INDA,   INDA += 1
0101000 00 0 1100 111110011 111110011        ADD ++INDB,INDB    'Add @INDB into @INDB+1,    INDB += 1

Note that only '++INDx,INDx'/'INDx,++INDx' combinations can address different registers from the same INDx.
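
As a cross-check of the CCCC scheme, the following Python sketch builds the 4-bit field from the
2-bit codes in the table and applies the OR rule for the shared-register case. The names MODES,
indirect_cccc, and shared_post_modifier are hypothetical, and writing 0 into the ignored bit pair
is a choice of this sketch, not something the hardware requires.

```python
# 2-bit indirect modifier codes, taken from the INDA/INDB usage table
MODES = {"INDx": 0b00, "INDx++": 0b01, "INDx--": 0b10, "++INDx": 0b11}


def indirect_cccc(d_mode=None, s_mode=None):
    """Return the CCCC field: top two bits for indirect D, bottom two for indirect S.

    Pass None for an operand that is not indirect; its two bits are ignored
    by the cog, so this sketch simply leaves them at 0.
    """
    d_bits = MODES[d_mode] if d_mode is not None else 0
    s_bits = MODES[s_mode] if s_mode is not None else 0
    return (d_bits << 2) | s_bits


def shared_post_modifier(cccc):
    """OR the two 2-bit fields, as done when D and S are the same INDx."""
    return (cccc >> 2) | (cccc & 0b11)


mov_cccc = indirect_cccc("INDx", "++INDx")   # MOV INDA,++INDA
add_cccc = indirect_cccc("++INDx", "INDx")   # ADD ++INDB,INDB
```

Here mov_cccc is %0011 and add_cccc is %1100, matching the two encodings shown above, and
shared_post_modifier() returns %11 for both, which corresponds to the 'INDx += 1' effect.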


Here are the instructions which are used to set the pointer and limit values for INDA and INDB:

instructions *                                                                             clocks
-------------------------------------------------------------------------------------------------
1111110 00 1 0001 TTTTTTTTT IIIIIIIII        FIXINDA #terminal,#initial                         1
1111110 00 1 0100 TTTTTTTTT IIIIIIIII        FIXINDB #terminal,#initial                         1
1111110 00 1 0101 TTTTTTTTT IIIIIIIII        FIXINDS #terminal,#initial                         1

1111110 00 1 0010 000000000 AAAAAAAAA        SETINDA #addrA                                     1
1111110 00 1 0011 000000000 AAAAAAAAA        SETINDA ++/--deltA                                 1

1111110 00 1 1000 BBBBBBBBB 000000000        SETINDB #addrB                                     1
1111110 00 1 1100 BBBBBBBBB 000000000        SETINDB ++/--deltB                                 1 

1111110 00 1 1010 BBBBBBBBB AAAAAAAAA        SETINDS #addrB,#addrA                              1
1111110 00 1 1011 BBBBBBBBB AAAAAAAAA        SETINDS #addrB,++/--deltA                          1
1111110 00 1 1110 BBBBBBBBB AAAAAAAAA        SETINDS ++/--deltB,#addrA                          1
1111110 00 1 1111 BBBBBBBBB AAAAAAAAA        SETINDS ++/--deltB,++/--deltA                      1
-------------------------------------------------------------------------------------------------
* addrA/addrB/terminal/initial = register address (0..511),
  deltA/deltB = 9-bit signed delta --256..++255

Examples:

1111110 00 1 0010 000000000 000000101        SETINDA #5        'INDA = 5, bottom = 0, top = 511
1111110 00 1 0011 000000000 000000011        SETINDA ++3       'INDA += 3, bottom = 0, top = 511
1111110 00 1 1100 111111100 000000000        SETINDB --4       'INDB -= 4, bottom = 0, top = 511
1111110 00 1 1011 000000111 000001000        SETINDS #7,++8    'INDB = 7, INDA += 8, bottoms = 0, tops = 511

1111110 00 1 0001 000001111 000001000        FIXINDA #15,#8    'INDA = 8, bottom = 8, top = 15
1111110 00 1 0100 000010000 000011111        FIXINDB #16,#31   'INDB = 31, bottom = 16, top = 31
1111110 00 1 0101 001100011 000110010        FIXINDS #99,#50   'INDA/INDB = 50, bottoms = 50, tops = 99
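
The encodings in the table can be reproduced mechanically. The Python sketch below assembles the
32-bit words for a few of the SETINDx/FIXINDx forms and prints them in the same field grouping
used in this document. The function names are hypothetical; the bit patterns are taken from the
table above.

```python
def ind_word(cccc, d, s):
    """Assemble 1111110 00 1 CCCC DDDDDDDDD SSSSSSSSS as a 32-bit integer."""
    return (0b1111110 << 25) | (1 << 22) | (cccc << 18) | ((d & 0x1FF) << 9) | (s & 0x1FF)


def fields(word):
    """Format a 32-bit word as 'TTTTTTT ZC I CCCC DDDDDDDDD SSSSSSSSS'."""
    b = format(word, "032b")
    return " ".join([b[0:7], b[7:9], b[9:10], b[10:14], b[14:23], b[23:32]])


def fixinda(terminal, initial):
    return ind_word(0b0001, terminal, initial)


def fixindb(terminal, initial):
    return ind_word(0b0100, terminal, initial)


def fixinds(terminal, initial):
    return ind_word(0b0101, terminal, initial)


def setinda(addr):
    return ind_word(0b0010, 0, addr)


def setinda_delta(delta):
    # 9-bit signed delta goes in the S field
    return ind_word(0b0011, 0, delta)


def setindb_delta(delta):
    # 9-bit signed delta goes in the D field
    return ind_word(0b1100, delta, 0)


print(fields(setinda(5)))          # SETINDA #5
print(fields(setindb_delta(-4)))   # SETINDB --4
print(fields(fixinds(99, 50)))     # FIXINDS #99,#50
```

The three printed lines match the corresponding rows of the Examples list above.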



AUXILIARY RAM
-------------

Each cog has a 256-long auxiliary RAM called AUX that can be used for data, call/return (Z,C,PC) stacks,
and streaming buffers for video pixels and pin-transfers. AUX's contents are not initialized at either
reset or cog start. So, at cog (re)start, it will contain whatever it happened to power up with, or
whatever was last written to it.

There are two complementary sets of AUX read/write instructions. One set addresses AUX from 0..255,
while the other addresses AUX in reverse order from 255..0. This scheme allows for simple operation of
separate data/program stacks (LIFO's) which can grow towards each other. There are also two 8-bit AUX
pointer registers, PTRX and PTRY, which can be used in AUX addressing expressions.

Here are the forward-addressing (0..255) read/write instructions for AUX:

RDAUX   D,S           read AUX[S] into D
RDAUX   D,#0..255     read AUX[0..255] into D
RDAUX   D,PTRX        read AUX[PTRX] into D, can update PTRX
RDAUX   D,PTRY        read AUX[PTRY] into D, can update PTRY

WRAUX   D/#,S         write D/# to AUX[S]
WRAUX   D/#,#0..255   write D/# to AUX[0..255]
WRAUX   D/#,PTRX      write D/# to AUX[PTRX], can update PTRX
WRAUX   D/#,PTRY      write D/# to AUX[PTRY], can update PTRY


The reverse-addressing (255..0) read/write instructions for AUX are just like those above, except
that they have an "R" in their mnemonics and apply a 1's-complement (!) to the apparent address:

RDAUXR  D,S           read AUX[!S] into D
RDAUXR  D,#0..255     read AUX[!0..255] into D
RDAUXR  D,PTRX        read AUX[!PTRX] into D, can update PTRX
RDAUXR  D,PTRY        read AUX[!PTRY] into D, can update PTRY

WRAUXR  D/#,S         write D/# to AUX[!S]
WRAUXR  D/#,#0..255   write D/# to AUX[!0..255]
WRAUXR  D/#,PTRX      write D/# to AUX[!PTRX], can update PTRX
WRAUXR  D/#,PTRY      write D/# to AUX[!PTRY], can update PTRY


There are also push/pop/call/ret instructions which use AUX. Those using PTRX are forward-addressing
and those using PTRY are reverse-addressing:

PUSHX   D/#           alias for 'WRAUX D/#,PTRX++'
PUSHY   D/#           alias for 'WRAUXR D/#,PTRY++'

POPX    D             alias for 'RDAUX D,--PTRX'
POPY    D             alias for 'RDAUXR D,--PTRY'

CALLX   D/#/@         write {Z,C,PC} to AUX[PTRX++], jump to D/#/@, cancel same-task pipelined instructions
CALLY   D/#/@         write {Z,C,PC} to AUX[!PTRY++], jump to D/#/@, cancel same-task pipelined instructions

CALLXD  D/#/@         write {Z,C,PC} to AUX[PTRX++], jump to D/#/@, don't cancel same-task pipelined instructions
CALLYD  D/#/@         write {Z,C,PC} to AUX[!PTRY++], jump to D/#/@, don't cancel same-task pipelined instructions

RETX                  read {Z,C,PC} from AUX[--PTRX], cancel same-task pipelined instructions
RETY                  read {Z,C,PC} from AUX[!--PTRY], cancel same-task pipelined instructions

RETXD                 read {Z,C,PC} from AUX[--PTRX], don't cancel same-task pipelined instructions
RETYD                 read {Z,C,PC} from AUX[!--PTRY], don't cancel same-task pipelined instructions
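
To illustrate how the forward (PTRX) and reverse (PTRY) stacks grow toward each other, here is a
small Python model of PUSHX/POPX/PUSHY/POPY. It is a sketch only: the class name Aux is
hypothetical, AUX is zero-filled for convenience (real AUX is uninitialized), both pointers are
assumed to start at 0, and '!' is modeled as an 8-bit 1's-complement.

```python
class Aux:
    """Behavioral model of the 256-long AUX RAM with two opposing stacks."""

    def __init__(self):
        self.ram = [0] * 256   # real AUX contents are undefined at cog start
        self.ptrx = 0          # assumption: as if set by SETPTRX #0
        self.ptry = 0          # assumption: as if set by SETPTRY #0

    def pushx(self, value):
        """PUSHX = WRAUX value,PTRX++ (forward addressing)."""
        self.ram[self.ptrx] = value
        self.ptrx = (self.ptrx + 1) & 0xFF

    def popx(self):
        """POPX = RDAUX D,--PTRX."""
        self.ptrx = (self.ptrx - 1) & 0xFF
        return self.ram[self.ptrx]

    def pushy(self, value):
        """PUSHY = WRAUXR value,PTRY++ (address is 1's-complemented)."""
        self.ram[~self.ptry & 0xFF] = value
        self.ptry = (self.ptry + 1) & 0xFF

    def popy(self):
        """POPY = RDAUXR D,--PTRY."""
        self.ptry = (self.ptry - 1) & 0xFF
        return self.ram[~self.ptry & 0xFF]


aux = Aux()
aux.pushx(11)   # lands in AUX[0]
aux.pushx(22)   # lands in AUX[1]
aux.pushy(33)   # lands in AUX[255]
aux.pushy(44)   # lands in AUX[254]
```

In this model the X stack fills AUX upward from 0 while the Y stack fills it downward from 255,
so the two only collide when their combined depth exceeds 256 longs.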


PTRX and PTRY can be set, added to, subtracted from, read, or checked using the following instructions:

SETPTRX D/#           set PTRX to D/#
SETPTRY D/#           set PTRY to D/#
ADDPTRX D/#           add D/# to PTRX
ADDPTRY D/#           add D/# to PTRY
SUBPTRX D/#           subtract D/# from PTRX
SUBPTRY D/#           subtract D/# from PTRY
GETPTRX D             get PTRX into D, PTRX==0 into Z, PTRX.7 into C
GETPTRY D             get PTRY into D, PTRY==0 into Z, PTRY.7 into C
CHKPTRX               get PTRX==0 into Z, PTRX.7 into C
CHKPTRY               get PTRY==0 into Z, PTRY.7 into C


PTRX/PTRY expressions for RDAUX/RDAUXR/WRAUX/WRAUXR:

    INDEX = -16..+15 for simple offsets, 0..15 for ++'s, or 0..16 for --'s

    X = 1 for PTRX/PTRY expression, 0 for constant in 8 LSBs
    S = 0 for PTRX, 1 for PTRY
    U = 0 to keep PTRX/PTRY same, 1 to update PTRX/PTRY
    P = 0 to use PTRX/PTRY + INDEX, 1 to use PTRX/PTRY (post-modify)
    NNNNN = INDEX
    nnnnn = -INDEX


    XSUPNNNNN     SPx expression
    ----------------------------------------------------------------------
    100000000     PTRX             'use PTRX
    110000000     PTRY             'use PTRY
    101100001     PTRX++           'use PTRX,                PTRX += 1
    111100001     PTRY++           'use PTRY,                PTRY += 1
    101111111     PTRX--           'use PTRX,                PTRX -= 1
    111111111     PTRY--           'use PTRY,                PTRY -= 1
    101000001     ++PTRX           'use PTRX + 1,            PTRX += 1
    111000001     ++PTRY           'use PTRY + 1,            PTRY += 1
    101011111     --PTRX           'use PTRX - 1,            PTRX -= 1
    111011111     --PTRY           'use PTRY - 1,            PTRY -= 1

    1000NNNNN     PTRX[INDEX]      'use PTRX + INDEX
    1100NNNNN     PTRY[INDEX]      'use PTRY + INDEX
    1011NNNNN     PTRX++[INDEX]    'use PTRX,                PTRX += INDEX
    1111NNNNN     PTRY++[INDEX]    'use PTRY,                PTRY += INDEX
    1011nnnnn     PTRX--[INDEX]    'use PTRX,                PTRX -= INDEX
    1111nnnnn     PTRY--[INDEX]    'use PTRY,                PTRY -= INDEX
    1010NNNNN     ++PTRX[INDEX]    'use PTRX + INDEX,        PTRX += INDEX
    1110NNNNN     ++PTRY[INDEX]    'use PTRY + INDEX,        PTRY += INDEX
    1010nnnnn     --PTRX[INDEX]    'use PTRX - INDEX,        PTRX -= INDEX
    1110nnnnn     --PTRY[INDEX]    'use PTRY - INDEX,        PTRY -= INDEX


Examples:

0000110 00 1 1111 DDDDDDDDD 100000000     RDAUX   D,PTRX         'read AUX[PTRX] into D
0000111 00 1 1111 DDDDDDDDD 101111111     RDAUXR  D,PTRX--       'read AUX[!PTRX] into D,          PTRX -= 1
1101010 00 1 1111 DDDDDDDDD 111000001     WRAUX   D,++PTRY       'write D to AUX[PTRY+1],          PTRY += 1
1101010 10 1 1111 DDDDDDDDD 110000111     WRAUXR  D,PTRY[7]      'write D to AUX[!(PTRY+7)]
0000110 00 1 1111 DDDDDDDDD 101101111     RDAUX   D,PTRX++[15]   'read AUX[PTRX] into D,           PTRX += 15
1101010 00 1 1111 DDDDDDDDD 111010110     WRAUX   D,--PTRY[10]   'write D to AUX[PTRY-10],         PTRY -= 10
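
The 9-bit XSUPNNNNN field can be generated from the rules above. The following Python sketch does
so and can be checked against the S fields in the examples. The function name ptr_expr and its
parameters are hypothetical; a negative index stands for the '--' forms, so the accepted range is
-16..+15.

```python
def ptr_expr(ptr, index=0, update=False, post=False):
    """Encode a PTRX/PTRY expression as the 9-bit XSUPNNNNN field.

    ptr    -- "X" for PTRX or "Y" for PTRY
    index  -- signed INDEX; use a negative value for the '--' forms
    update -- True to write the modified value back to the pointer (U bit)
    post   -- True to address with the unmodified pointer (P bit)
    """
    if ptr not in ("X", "Y"):
        raise ValueError("ptr must be 'X' or 'Y'")
    if not -16 <= index <= 15:
        raise ValueError("index must fit in 5 signed bits")
    s_bit = 0 if ptr == "X" else 1
    return (1 << 8) | (s_bit << 7) | (int(update) << 6) | (int(post) << 5) | (index & 0x1F)


def bits9(value):
    """Format a 9-bit field the way it appears in the tables."""
    return format(value, "09b")


print(bits9(ptr_expr("X")))                              # PTRX
print(bits9(ptr_expr("X", -1, update=True, post=True)))  # PTRX--
print(bits9(ptr_expr("Y", -10, update=True)))            # --PTRY[10]
```

These print 100000000, 101111111, and 111010110, the same S fields used in the examples above.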


instructions                                                                                clocks
------------------------------------------------------------------------------------------------------
0000110 ZC 0 CCCC DDDDDDDDD SSSSSSSSS     RDAUX   D,S        'read AUX[S] into D                  2
0000110 ZC 1 CCCC DDDDDDDDD 0SSSSSSSS     RDAUX   D,#S       'read AUX[#S] into D                 1
0000110 ZC 1 CCCC DDDDDDDDD 1SUPNNNNN     RDAUX   D,PTRX/Y   'read AUX[PTRX/Y exp] into D         1

0000111 ZC 0 CCCC DDDDDDDDD SSSSSSSSS     RDAUXR  D,S        'read AUX[!S] into D                 2
0000111 ZC 1 CCCC DDDDDDDDD 0SSSSSSSS     RDAUXR  D,#S       'read AUX[#!S] into D                1
0000111 ZC 1 CCCC DDDDDDDDD 1SUPNNNNN     RDAUXR  D,PTRX/Y   'read AUX[!PTRX/Y exp] into D        1

1101010 00 0 CCCC DDDDDDDDD SSSSSSSSS     WRAUX   D,S        'write D into AUX[S]                 1 **
1101010 00 1 CCCC DDDDDDDDD 0SSSSSSSS     WRAUX   D,#S       'write D into AUX[#S]                1 **
1101010 00 1 CCCC DDDDDDDDD 1SUPNNNNN     WRAUX   D,PTRX/Y   'write D into AUX[PTRX/Y exp]        1 **
1101010 01 0 CCCC DDDDDDDDD SSSSSSSSS     WRAUX   #D,S       'write #D into AUX[S]                1 **
1101010 01 1 CCCC DDDDDDDDD 0SSSSSSSS     WRAUX   #D,#S      'write #D into AUX[#S]               1 **
1101010 01 1 CCCC DDDDDDDDD 1SUPNNNNN     WRAUX   #D,PTRX/Y  'write #D into AUX[PTRX/Y exp]       1 **

1101010 10 0 CCCC DDDDDDDDD SSSSSSSSS     WRAUXR  D,S        'write D into AUX[!S]                1 **
1101010 10 1 CCCC DDDDDDDDD 0SSSSSSSS     WRAUXR  D,#S       'write D into AUX[#!S]               1 **
1101010 10 1 CCCC DDDDDDDDD 1SUPNNNNN     WRAUXR  D,PTRX/Y   'write D into AUX[!PTRX/Y exp]       1 **
1101010 11 0 CCCC DDDDDDDDD SSSSSSSSS     WRAUXR  #D,S       'write #D into AUX[!S]               1 **
1101010 11 1 CCCC DDDDDDDDD 0SSSSSSSS     WRAUXR  #D,#S      'write #D into AUX[#!S]              1 **
1101010 11 1 CCCC DDDDDDDDD 1SUPNNNNN     WRAUXR  #D,PTRX/Y  'write #D into AUX[!PTRX/Y exp]      1 **

1111111 ZC 0 CCCC DDDDDDDDD 000001100     GETPTRX D          'get PTRX into D                     1
1111111 ZC 0 CCCC DDDDDDDDD 000001101     GETPTRY D          'get PTRY into D                     1

1111111 00 0 CCCC DDDDDDDDD 010000000     SETPTRX D          'set PTRX to D                       1
1111111 00 1 CCCC DDDDDDDDD 010000000     SETPTRX #D         'set PTRX to #D                      1
1111111 00 0 CCCC DDDDDDDDD 010000001     SETPTRY D          'set PTRY to D                       1
1111111 00 1 CCCC DDDDDDDDD 010000001     SETPTRY #D         'set PTRY to #D                      1

1111111 00 0 CCCC DDDDDDDDD 010000010     ADDPTRX D          'add D into PTRX                     1
1111111 00 1 CCCC DDDDDDDDD 010000010     ADDPTRX #D         'add #D into PTRX                    1
1111111 00 0 CCCC DDDDDDDDD 010000011     ADDPTRY D          'add D into PTRY                     1
1111111 00 1 CCCC DDDDDDDDD 010000011     ADDPTRY #D         'add #D into PTRY                    1

1111111 00 0 CCCC DDDDDDDDD 010000100     SUBPTRX D          'subtract D from PTRX                1
1111111 00 1 CCCC DDDDDDDDD 010000100     SUBPTRX #D         'subtract #D from PTRX               1
1111111 00 0 CCCC DDDDDDDDD 010000101     SUBPTRY D          'subtract D from PTRY                1
1111111 00 1 CCCC DDDDDDDDD 010000101     SUBPTRY #D         'subtract #D from PTRY               1

1111110 11 0 CCCC 00 nnnnnnnnnnnnnnnn     CALLX   #n         'write Z,C,PC* into [PTRX++], PC=n   4 **
1111110 11 0 CCCC 01 nnnnnnnnnnnnnnnn     CALLX   @n         'write Z,C,PC* into [PTRX++], PC+=n  4 **
1111110 11 0 CCCC 10 nnnnnnnnnnnnnnnn     CALLXD  #n         'write Z,C,PC* into [PTRX++], PC=n   4 **
1111110 11 0 CCCC 11 nnnnnnnnnnnnnnnn     CALLXD  @n         'write Z,C,PC* into [PTRX++], PC+=n  4 **

1111110 11 1 CCCC 00 nnnnnnnnnnnnnnnn     CALLY   #n         'write Z,C,PC* into [!PTRY++], PC=n  4 **
1111110 11 1 CCCC 01 nnnnnnnnnnnnnnnn     CALLY   @n         'write Z,C,PC* into [!PTRY++], PC+=n 4 **
1111110 11 1 CCCC 10 nnnnnnnnnnnnnnnn     CALLYD  #n         'write Z,C,PC* into [!PTRY++], PC=n  4 **
1111110 11 1 CCCC 11 nnnnnnnnnnnnnnnn     CALLYD  @n         'write Z,C,PC* into [!PTRY++], PC+=n 4 **

1111111 ZC 0 CCCC xxxxxxxxx 100000100     RETX               'read Z,C,PC* from [--PTRX]          4
1111111 ZC 0 CCCC xxxxxxxxx 100000101     RETXD              'read Z,C,PC* from [--PTRX]          4

1111111 ZC 0 CCCC xxxxxxxxx 100000110     RETY               'read Z,C,PC* from [!--PTRY]         4
1111111 ZC 0 CCCC xxxxxxxxx 100000111     RETYD              'read Z,C,PC* from [!--PTRY]         4
------------------------------------------------------------------------------------------------------
* bit 17 is Z, bit 16 is C, bits 15..0 are PC, upper bits are ignored (RETx) or cleared (CALLx)
** if followed by 'RDAUX/RDAUXR D,#S/PTRX/PTRY' or RETX/RETXD/RETY/RETYD, add one clock
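
The {Z,C,PC} long layout given in the first footnote can be expressed as a pair of Python
helpers. This is an illustrative sketch; the function names are hypothetical.

```python
def pack_zcpc(z, c, pc):
    """Build the long written by CALLx: bit 17 = Z, bit 16 = C, bits 15..0 = PC.

    Upper bits are left at 0, as the footnote says CALLx clears them.
    """
    return (int(bool(z)) << 17) | (int(bool(c)) << 16) | (pc & 0xFFFF)


def unpack_zcpc(long_value):
    """Recover (Z, C, PC) as RETx would, ignoring bits 31..18."""
    return (long_value >> 17) & 1, (long_value >> 16) & 1, long_value & 0xFFFF


saved = pack_zcpc(z=1, c=0, pc=0x1234)
restored = unpack_zcpc(saved | 0xFFFC0000)   # junk in the upper bits is ignored
```

Here 'saved' is $21234, and 'restored' is (1, 0, $1234) even with the upper 14 bits set.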



MULTI-TASKING
-------------

Each cog has four sets of flags and program counters (Z/C/PC), constituting four unique tasks that
can execute and switch on each instruction cycle.

At cog startup, the tasks are initialized as follows, with only task 0 enabled:


task Z  C  PC
----------------
0    0  0  $0000
1    0  0  $0001
2    0  0  $0002
3    0  0  $0003


The SETTASK instruction is used to set the number of time slots and the sequence of tasks within
those time slots. SETTASK's 32-bit operand consists of 16 bit pairs which declare tasks (0..3)
from the bottom bit pair, upwards, with any leading %00 bit pairs declaring unused time slots. This
way, simple task sequences can be established with immediate values:


    SETTASK #%%3210     'set repeating task sequence of 0-1-2-3

        4 time slots:     -  -  -  -  -  -  -  -  -  -  -  -  3  2  1  0
                          |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
        TASK register:  %00_00_00_00_00_00_00_00_00_00_00_00_11_10_01_00


    SETTASK #%%210      'set repeating task sequence of 0-1-2

        3 time slots:     -  -  -  -  -  -  -  -  -  -  -  -  -  2  1  0
                          |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
        TASK register:  %00_00_00_00_00_00_00_00_00_00_00_00_00_10_01_00


By providing a 32-bit value via D, up to 16 time slots can be defined:


        16 time slots:    3  0  1  0  2  0  1  0  2  0  1  0  2  0  1  0
                          |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
        TASK register:  %11_00_01_00_10_00_01_00_10_00_01_00_10_00_01_00


In the case above, task 0 gets 1/2 of the time slots, task 1 gets 1/4, task 2 gets 3/16 and
task 3 gets 1/16. It is generally a good idea to intermingle the tasks evenly so that I/O
behavior is not lumpy in time.

Below, tasks 0..3 get 1/3, 1/3, 1/6, and 1/6 of the time slots, all perfectly spaced:


        6 time slots:     -  -  -  -  -  -  -  -  -  -  3  1  0  2  1  0
                          |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
        TASK register:  %00_00_00_00_00_00_00_00_00_00_11_01_00_10_01_00


If you want task 0 to run most of the time, with task 1 running as seldom as possible:


        16 time slots:    1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
                          |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
        TASK register:  %01_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00
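
The TASK register values above can be derived and checked with a short Python sketch. The
function names are hypothetical. The decoder follows the rule stated earlier, that leading %00
bit pairs are unused time slots; treating an all-zero value as a single task-0 slot is an
assumption of this sketch.

```python
from fractions import Fraction


def task_register(slots):
    """Pack a slot sequence into the SETTASK operand; slots[0] is the bottom bit pair."""
    if not 1 <= len(slots) <= 16:
        raise ValueError("1..16 time slots are allowed")
    value = 0
    for position, task in enumerate(slots):
        if not 0 <= task <= 3:
            raise ValueError("tasks are numbered 0..3")
        value |= task << (2 * position)
    return value


def active_slots(value):
    """Unpack a SETTASK operand, dropping leading %00 pairs (unused slots)."""
    pairs = [(value >> (2 * i)) & 0b11 for i in range(16)]
    while len(pairs) > 1 and pairs[-1] == 0:
        pairs.pop()
    return pairs


def shares(value):
    """Fraction of the time slots given to each of the four tasks."""
    slots = active_slots(value)
    return {task: Fraction(slots.count(task), len(slots)) for task in range(4)}


round_robin = task_register([0, 1, 2, 3])                 # SETTASK #%%3210
six_slots = task_register([0, 1, 2, 0, 1, 3])             # the 6-slot example
sixteen = task_register([0, 1, 0, 2, 0, 1, 0, 2,
                         0, 1, 0, 2, 0, 1, 0, 3])         # the first 16-slot example
```

With these inputs, shares(sixteen) gives 1/2, 1/4, 3/16, and 1/16 for tasks 0..3, and
shares(six_slots) gives 1/3, 1/3, 1/6, and 1/6. Under this reading of the encoding, a sequence
shorter than 16 slots cannot have task 0 in its topmost slot, since that pair would be
indistinguishable from an unused slot.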


The task identified in the bottom two bits of the SETTASK operand will be at the execution stage on
the 5th instruction cycle after SETTASK.

If a task is given no time slot, it doesn't execute and its flags and PC stay at initial values. If a
task is given a time slot, it will execute and its Z/C/PC will be updated at every instruction cycle,
or time slot, allotted to it. If an active task's time slots are all taken away, that task's Z/C/PC
remain in the state where they left off, until it is given another time slot.

To immediately force any of the four tasks' PC's to a new address, JMPTASK can be used. JMPTASK uses a
4-bit mask to select which PC's are going to be written. Mask bits 3..0 represent PC's 3..0. The mask
value %1010 would write PC 3 and PC 1, while %0100 would write PC 2, only:


JMPTASK D/#,S/#         force PC's in mask D/# to address S/#


For every task affected by a JMPTASK instruction, all affected-task instructions currently in the
pipeline are cancelled. This ensures that after JMPTASK executes, the next instruction from each
affected task will be from the new address. Also, instruction block repeating will be cancelled for
any affected task that was using REPS/REPD.
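
The mask selection can be pictured with a one-function Python model. It only covers which PCs are
written; pipeline cancellation and REPS/REPD cancellation are not modeled, and the function name
is hypothetical.

```python
def jmptask(pcs, mask, address):
    """Return the four task PCs after 'JMPTASK mask,address'.

    pcs[n] is the PC of task n; mask bit n selects task n.
    """
    return [address if (mask >> task) & 1 else pc for task, pc in enumerate(pcs)]


startup = [0x0000, 0x0001, 0x0002, 0x0003]   # PCs at cog startup
after_1010 = jmptask(startup, 0b1010, 0x0040)
after_0100 = jmptask(startup, 0b0100, 0x0040)
```

Here after_1010 has PC 1 and PC 3 set to $0040, while after_0100 changes only PC 2.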


Here is an example in which all four tasks are started and each task toggles an I/O pin at a different
rate:


        ORG

        JMP     #task0          'task 0 begins here when the cog starts (this JMP takes 4 clocks)
        JMP     #task1          'task 1 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
        JMP     #task2          'task 2 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
        JMP     #task3          'task 3 begins here after task 0 executes SETTASK (this JMP takes 1 clock)

task0   SETTASK #%%3210         'enable all tasks in 0-1-2-3 round-robin sequence

:loop   NOTP    #0              'task 0, toggle pin 0               - loops every 8 clocks
        JMP     #:loop          '(this JMP takes 1 clock)

task1   NOTP    #1              'task 1, toggle pin 1               - loops every 12 clocks
        NOP
        JMP     #task1          '(this JMP takes 1 clock)

task2   NOTP    #2              'task 2, toggle pin 2               - loops every 16 clocks
        NOP                     
        NOP
        JMP     #task2          '(this JMP takes 1 clock)

task3   NOTP    #3              'task 3, toggle pin 3               - loops every 20 clocks
        NOP
        NOP
        NOP
        JMP     #task3          '(this JMP takes 1 clock)
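
The loop periods quoted in the comments follow from simple arithmetic, sketched here in Python.
The function name is hypothetical, and the model assumes that every instruction in the loop,
including the JMP, completes in a single time slot and that no task stalls the pipeline.

```python
from fractions import Fraction


def loop_clocks(instructions_in_loop, slots, task):
    """Average clocks per loop iteration for 'task' under the given slot sequence.

    One pass through the slot sequence takes len(slots) clocks and executes
    slots.count(task) instructions of the task.
    """
    return Fraction(instructions_in_loop * len(slots), slots.count(task))


slots = [0, 1, 2, 3]   # SETTASK #%%3210
periods = [loop_clocks(n, slots, task) for task, n in enumerate([2, 3, 4, 5])]
```

For the listing above, 'periods' comes out as 8, 12, 16, and 20 clocks for tasks 0..3.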


------------------------------------------------------------------------------------------------------------
NOTE: When a normal branch instruction (JMP, CALL, RET, etc.) executes in the 4th and final stage of the
pipeline, all instructions progressing through the lower three stages which belong to the same task as the
branch instruction are cancelled. This inhibits execution of incidental data that was trailing the branch
instruction.

The delayed branch instructions (JMPD, CALLD, RETD, etc.) don't do any pipeline instruction cancellation and
exist to provide 1-clock branches, where three instructions belonging to the same task as the branch will
execute before instructions begin executing from the location branched to. For single-task programs this
is the natural consequence of allowing the three lower pipeline stages to advance to execution before the
instructions from the new address start executing. For multi-task programs that may not have had three
instructions in the pipeline from the branching task, the deficit of three instructions will be waited for
before the new address takes effect. This way, all code may be written for optimal single-task execution,
but it still works in all task modes.

For normal (non-delayed) CALLs, PC+1 is stored as the return address. For delayed CALLs, PC+4 is stored, to
accommodate three trailing instructions.

For single-task programs, normal branches take 4 clocks: 1 clock for the branch and 3 clocks for the
cancelled instructions to come through the pipeline before the new instruction stream begins to execute.
------------------------------------------------------------------------------------------------------------


Tips for coding multi-tasking programs
--------------------------------------

While all tasks in a multi-tasking program can execute atomic instructions without any inter-task conflict,
remember that there's only one of each of the following cog resources and only one task can use it at a time:

  Singular resource      Some related instructions that could cause conflicts
  ----------------------------------------------------------------------------------------------------------
  WIDE registers         RDBYTEC/RDWORDC/RDLONGC/RDWIDEC/RDWIDE/WRWIDE/SETWIDE/SETWIDZ
  INDA                   FIXINDA/FIXINDS/SETINDA/SETINDS / INDA modification via INDA usage
  INDB                   FIXINDB/FIXINDS/SETINDB/SETINDS / INDB modification via INDB usage
  PTRA                   SETPTRA/ADDPTRA/SUBPTRA / PTRA modification via RDxxxx/WRxxxx
  PTRB                   SETPTRB/ADDPTRB/SUBPTRB / PTRB modification via RDxxxx/WRxxxx
  PTRX                   SETPTRX/ADDPTRX/SUBPTRX/CALLX/RETX/PUSHX/POPX / PTRX modification via RDAUXx/WRAUXx
  PTRY                   SETPTRY/ADDPTRY/SUBPTRY/CALLY/RETY/PUSHY/POPY / PTRY modification via RDAUXx/WRAUXx
  ACCA                   SETACCA/SETACCS/MACA/SARACCA/SARACCS/CLRACCA/CLRACCS
  ACCB                   SETACCB/SETACCS/MACB/SARACCB/SARACCS/CLRACCB/CLRACCS
  32x32 multiplier       MUL32/MUL32U
  64/32 divider          FRAC/DIV32/DIV32U/DIV64/DIV64U/DIV64D
  64-bit square rooter   SQRT64/SQRT32
  CORDIC computer        QSINCOS/QARCTAN/QROTATE/QLOG/QEXP/SETQI/SETQZ
  SERA                   SETSERA/SERINA/SEROUTA
  SERB                   SETSERB/SERINB/SEROUTB
  XFR                    SETXFR
  VID                    WAITVID/SETVID/SETVIDY/SETVIDI/SETVIDQ/POLVID
  CTRA                   SETCTRA/SETWAVA/SETPHSA/ADDPHSA/SUBPHSA/GETPHZA/POLCTRA/CAPCTRA/SYNCTRA
  CTRB                   SETCTRB/SETWAVB/SETPHSB/ADDPHSB/SUBPHSB/GETPHZB/POLCTRB/CAPCTRB/SYNCTRB
  PIX                    (not usable in multi-tasking, requires single-task timing)


When writing multi-task programs, be aware that any multi-clock instructions will stall the pipeline,
creating ripple effects in other tasks' timing. This may be impossible to avoid, as some task will
likely need to access hub memory, and hub instructions are mostly multi-clock. For absolutely deterministic
timing, it may be necessary to write a single-task program.

Some instructions which stall the pipeline during single-task execution will, instead, jump back to
themselves during multi-task execution (JMP #$), until their release condition is met. This way they
avoid stalling the pipeline, allowing other tasks to execute in the interstitial time slots:

  WAITVID D/#,S/#    wait for VID to grab new data

  SERINA  D          wait for serial input on SERA
  SERINB  D          wait for serial input on SERB
  SEROUTA D/#        wait to send serial output on SERA
  SEROUTB D/#        wait to send serial output on SERB

  GETMULL D          wait for lower multiplier result
  GETMULH D          wait for upper multiplier result
  GETDIVQ D          wait for divider quotient result
  GETDIVR D          wait for divider remainder result
  GETSQRT D          wait for square root result
  GETQX   D          wait for CORDIC X result
  GETQY   D          wait for CORDIC Y result
  GETQZ   D          wait for CORDIC Z result

  SYNCTRA            wait for PHSA to roll over
  SYNCTRB            wait for PHSB to roll over


For the above instructions, multi-tasking is considered to be active when SETTASK D/# has written
a mixture of tasks to the time slots. Remember that in multi-tasking, the above instructions behave
as branches, and therefore cannot be used in REPD/REPS instruction-repeat blocks. Also, you should
not use INDx++/INDx--/++INDx with these instructions during multi-tasking, as they will cause
INDA/INDB to increment or decrement each time they loop back to themselves, before the release
condition is met.

To avoid excessively stalling the pipeline during multi-tasking, the WAITCNT/WAITPEQ/WAITPNE
instructions can be substituted with non-stalling alternatives:

  PASSCNT D/#        jumps to itself if some amount of time has not passed, use instead of WAITCNT
  JP/JNP  D/#,S/#    jumps based on pin states, use instead of WAITPEQ/WAITPNE


The following instruction will not work in a multi-tasking program:

  GETPIX             needs 3 clocks in stages 2 and 3, takes 3 clocks in stage 4 - single-task only


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
1111001 10 0 CCCC DDDDDDDDD SSSSSSSSS        JMPTASK D,S      'Set PC's in mask D to S          1
1111001 10 1 CCCC DDDDDDDDD SSSSSSSSS        JMPTASK D,#S     'Set PC's in mask D to #S         1
1111001 11 0 CCCC DDDDDDDDD SSSSSSSSS        JMPTASK #D,S     'Set PC's in mask #D to S         1
1111001 11 1 CCCC DDDDDDDDD SSSSSSSSS        JMPTASK #D,#S    'Set PC's in mask #D to #S        1

1111111 00 0 CCCC DDDDDDDDD 010010011        SETTASK D        'Set TASK to D                    1
1111111 00 1 CCCC DDDDDDDDD 010010011        SETTASK #D       'Set TASK to #D                   1
-------------------------------------------------------------------------------------------------



PIPELINE
--------

Each cog has a 4-stage pipeline which all instructions progress through, in order to execute:


  1st stage    - Read instruction from cog register RAM

  2nd stage    - Determine any indirect or remapped D and S addresses within instruction
               - Update INDA and INDB

  3rd stage    - Read D and S from cog register RAM

  4th stage    - Execute instruction using D and S
               - Write any D result to cog register RAM
               - Update Z/C/PC and any other results


On every clock cycle, the instruction data in each stage advances to the next stage, unless the instruction
in the 4th stage is stalling the pipeline because it's waiting for something (e.g. WRBYTE waits for the hub).

To keep D and S data current within the pipeline, the resultant D from the 4th stage is passed back to
the 3rd stage to substitute for any obsoleted D or S data currently being read from the cog register RAM.
The same is done for instruction data currently being read in the 1st stage, but this still leaves a two-
stage gap between when a register is modified and when it can be executed:


        'single-task self-modifying code

        SETI    :inst,top9         '(initially 4th stage) modify instruction
        NOP                        '(initially 3rd stage) 1...
        NOP                        '(initially 2nd stage) 2... at least two instructions in-between
:inst   ADD     A,B                '(initially 1st stage) modified instruction properly executes


Tasks that execute no more frequently than every 3rd time slot don't need to observe this 2-instruction
spacer rule when executing self-modifying code, because their instructions will always be sufficiently spread
apart in the pipeline by other tasks' instructions, enabling a just-modified instruction to be properly read
and executed in that task's next time slot. If fewer than two spacers are afforded to a modify-execute sequence
in a single-task program, the old instruction will be read and executed, instead of the newly-modified one.
This can be used to advantage for efficient overlapped modify-execute sequences.

When a branch instruction executes, that task's program counter is abruptly changed from what had been a
steadily incrementing course, requiring that the pipeline be reloaded, beginning at the new program counter
address. This can leave up to three instructions in the pipeline which were trailing the branch instruction
and belong to the same task as the branch.

Normally, these trailing instructions are incidental data which are not intended for execution, and therefore
must be cancelled within the pipeline, so that they pass through without doing anything. However, in the case
of a single-task program, it may be desirable to allow those instructions to execute, without cancellation, to
increase pipeline efficiency. This will result in the branch taking just 1 clock cycle, but three trailing
instructions will be executed before the branch appears to take effect:


        'single-task delayed branch

        JMPD    #somewhere      '(initially 4th stage) do a delayed jmp, then toggle P0 and cycle P1
        NOTP    #0              '(initially 3rd stage)
        NOTP    #1              '(initially 2nd stage)
        NOTP    #1              '(initially 1st stage) next instruction is loaded from 'somewhere'


To accommodate both cancelling and non-cancelling branches, branch instructions have two versions. The ones
that end in the letter 'D' for 'delayed' are non-cancelling and take only one clock.

The branch instructions that don't end in the letter 'D' are what would be considered 'normal' branches, as
they cancel any same-task instructions in the pipeline, so that the next instruction to execute after the
branch would be the instruction which was branched to.

For code compatibility across all task modes, three trailing instructions from the same task as the delayed
branch will always be executed before the delayed branch takes effect, regardless of whether a program is
single- or multi-task.

Here are all the branching instructions:


       'normal'        'delayed'
        cancelling      non-cancelling
        ----------      --------------
        JMP             JMPD                     jump to address

        CALL            CALLD                    call subroutine using task's 4-level stack
        RET             RETD                     return from subroutine using task's 4-level stack

        CALLA           CALLAD                   call subroutine using HUB[PTRA++]
        RETA            RETAD                    return from subroutine using HUB[--PTRA]

        CALLB           CALLBD                   call subroutine using HUB[PTRB++]
        RETB            RETBD                    return from subroutine using HUB[--PTRB]

        CALLX           CALLXD                   call subroutine using AUX[PTRX++]
        RETX            RETXD                    return from subroutine using AUX[--PTRX]

        CALLY           CALLYD                   call subroutine using AUX[!PTRY++]
        RETY            RETYD                    return from subroutine using AUX[!--PTRY]

        JMPSW           JMPSWD                   jmp/call with Z/C/PC store
        SWITCH          SWITCHD                  switch between threads (JMPSW/JMPSWD INDB,++INDB)

        IJZ             IJZD                     increment D and jump if result zero
        IJNZ            IJNZD                    increment D and jump if result not zero
        DJZ             DJZD                     decrement D and jump if result zero
        DJNZ            DJNZD                    decrement D and jump if result not zero

        JP              JPD                      jump if pin D reads high
        JNP             JNPD                     jump if pin D reads low

        JZ              JZD                      jump if D zero
        JNZ             JNZD                     jump if D not zero

        JMPLIST                                  jump to position in jump list

        JMPTASK                                  jump selected tasks to address


Here is an example of a delayed branch:


loop            MOV     X,#100          'toggle P0/P1/P2 100 times, then toggle P3 (single-task)

loop2           DJNZD   X,@loop2        'loop, delayed branch executes 3 trailing instructions
                NOTP    #0              'toggle P0
                NOTP    #1              'toggle P1
                NOTP    #2              'toggle P2

                NOTP    #3              'now toggle P3
                JMP     #loop           'do it again



INSTRUCTION-BLOCK REPEATING
---------------------------

Each task within a cog has an instruction-block repeater that can variably repeat up to 64
instructions without any clock-cycle overhead.

REPS and REPD are used to initiate block repeats. These instructions specify how many times the
trailing instruction block will be executed and how many instructions are in the block:


REPS    #n,#i    - execute 1..64 instructions 1..65536 times,   requires 1 spacer instruction

REPD    #i       - execute 1..64 instructions infinitely,       requires 3 spacer instructions
REPD    D,#i     - execute 1..64 instructions D+1 times,        requires 3 spacer instructions
REPD    #n,#i    - execute 1..64 instructions 1..512 times,     requires 3 spacer instructions


REPS differs from REPD by executing at the 2nd stage of the pipeline, instead of the 4th. By
executing two stages earlier, it needs only one spacer instruction. Because of its earliness,
no conditional execution is possible, so it is forced to always execute, allowing the CCCC bits
to be repurposed, affording a contiguous 16-bit constant for the repeat count.

The instruction-block repeater will quit repeating the block if a branch instruction executes
within the block, or if a JMPTASK instruction affects the task which is using the repeater.

The following instructions potentially jump to themselves (JMP #$) and, by branching, will
cancel the block repeater if executed within a repeat block:

    PASSCNT                                                     - always
    SERINA/SERINB/SEROUTA/SEROUTB                               - only during multi-tasking
    GETMULL/GETMULH/GETDIVQ/GETDIVR/GETSQRT/GETQX/GETQY/GETQZ   - only during multi-tasking
    WAITVID/SYNCTRA/SYNCTRB                                     - only during multi-tasking


Example (1-task):

        REPD    D,#1            'execute 1 instruction D times (if D=0, same as D=1)

        NOP                     '3 spacer instructions needed (could do something useful)
        NOP
        NOP

        NOTP    #0              'toggle P0, block repeats every 1 clock


Example (1-task):

        REPS    #20_000,#4      'execute 4 instructions 20,000 times

        NOP                     '1 spacer instruction needed (make the most of it)

        NOTP    #0              'toggle P0
        NOTP    #1              'toggle P1
        NOTP    #2              'toggle P2
        NOTP    #3              'toggle P3, block repeats every 4 clocks


instructions (iiiiii = #i-1, nnnn_nnnnnnnnn_nnn/nnnnnnnnn = #n-1)                              clocks
-----------------------------------------------------------------------------------------------------
1111101 01 1 nnnn nnnnnnnnn nnniiiiii        REPS    #n,#i   'execute 1..64 inst's 1..65536 times   1

1111111 00 0 CCCC 111111111 001iiiiii        REPD    #i      'execute 1..64 inst's infinitely       1
1111111 00 0 CCCC DDDDDDDDD 001iiiiii        REPD    D,#i    'execute 1..64 inst's D times          1
1111111 00 1 CCCC nnnnnnnnn 001iiiiii        REPD    #n,#i   'execute 1..64 inst's 1..512 times     1
-----------------------------------------------------------------------------------------------------
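
As a cross-check of the REPS row above, the following Python sketch packs and unpacks its fields.
The function names are hypothetical, and the field positions are read off the encoding row (fixed
opcode/ZC/I bits, a contiguous 16-bit n-1 field, then a 6-bit i-1 field). It is an illustration
of the layout, not assembler source.

```python
# Fixed bits taken from the REPS row: opcode 1111101, ZC = 01, I = 1.
REPS_FIXED = (0b1111101 << 25) | (0b01 << 23) | (1 << 22)

def encode_reps(n, i):
    """Pack REPS #n,#i: bits 21..6 hold n-1, bits 5..0 hold i-1."""
    if not 1 <= n <= 65536:
        raise ValueError("repeat count must be 1..65536")
    if not 1 <= i <= 64:
        raise ValueError("block length must be 1..64")
    return REPS_FIXED | ((n - 1) << 6) | (i - 1)

def decode_reps(word):
    """Recover (n, i) from a REPS instruction word."""
    return ((word >> 6) & 0xFFFF) + 1, (word & 0x3F) + 1
```

Under these assumptions, encode_reps(20_000, 4) corresponds to the 'REPS #20_000,#4' example above.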



HUB COUNTER
-----------

The hub contains a 64-bit counter called CNT that increments on each clock cycle. Each cog can use CNT
to mark time in various ways. On chip reset, the ROM Booter initializes CNT to $00000000_00000000, from
which point it begins incrementing.

Here are the instructions which relate to CNT:

GETCNT  D               Get CNT[31..0] into D.

GETCNTX D               Get CNT[63..32], delayed by 1 clock, into D. A single-task program executing a
                        GETCNT, immediately followed by a GETCNTX, would get a 64-bit snapshot of CNT.

SUBCNT  D               Get CNT[31..0] minus D into D. If another SUBCNT is executed in the next clock
                        cycle by the same task, it gets CNT[63..32], delayed by 1 clock, minus D minus
                        the carry from the previous SUBCNT (not the C flag) into D. In either case, the
                        logical not of the MSB of the D result (not the carry) goes into C, indicating
                        by C=1 if CNT[31..0] or CNT[63..0] has exceeded the original D value(s).

CMPCNT  D               Same as SUBCNT, but doesn't store the D result(s). Useful for periodic checking
                        if a time target has been reached yet.

PASSCNT D               Jump to self if MSB of CNT[31..0] minus D is 1. In other words, loop until
                        CNT[31..0] exceeds D. This is intended as a non-pipeline-stalling alternative
                        to WAITCNT, for use in multi-task programs.

WAITCNT D,S/#           Wait for CNT[31..0] to be equal to D. Adds S/# into D.

WAITCNT D,S/#       WC  Wait for CNT[63..0] to be equal to the concatenation of the last-written D
                        value and the D expressed in the WAITCNT. Adds S/# into D. Carry from the
                        addition goes into C. This instruction only works within single-task programs,
                        as the last-written D needs to be from the same task.

WAITPEQ D,S/#,#port WC  Like WAITPEQ without WC, except the last-written D value becomes a CNT[31..0]
                        timeout target, with C returning 0 if the WAITPEQ condition was met, or 1 if the
                        timeout occurred first. This instruction only works within single-task programs,
                        as the last-written D needs to be from the same task.

WAITPNE D,S/#,#port WC  Like WAITPNE without WC, except the last-written D value becomes a CNT[31..0]
                        timeout target, with C returning 0 if the WAITPNE condition was met, or 1 if the
                        timeout occurred first. This instruction only works within single-task programs,
                        as the last-written D needs to be from the same task.
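
The C result described for SUBCNT and CMPCNT amounts to a wraparound-safe comparison. A small
Python model is sketched here, with hypothetical function names and 32-bit arithmetic emulated by
masking. It covers only the single 32-bit case, not the chained CNT[63..32] form.

```python
MASK32 = 0xFFFFFFFF

def subcnt(cnt_lo, d):
    """Model SUBCNT D against CNT[31..0]: returns (new D, C)."""
    result = (cnt_lo - d) & MASK32
    c = 0 if result >> 31 else 1      # C = logical NOT of the result's MSB
    return result, c

def cmpcnt(cnt_lo, d):
    """Model CMPCNT D: same C as SUBCNT, but D is not written."""
    return subcnt(cnt_lo, d)[1]
```

Because the test looks at the MSB of the difference, this model stays correct across a CNT[31..0]
rollover, provided the target is less than 2^31 clocks away.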


Examples:

        'Measure time using lower 32 bits of CNT

        GETCNT  ticks            'get CNT[31..0] into ticks
        <somecode>               'execute some code
        SUBCNT  ticks            'get CNT[31..0] minus ticks into ticks, <somecode> took ticks-1 clocks


        'Measure time using full 64 bits of CNT (single-task)

        GETCNT  ticks_lo         'get CNT[63..0] into {ticks_hi, ticks_lo}
        GETCNTX ticks_hi
        <somecode>               'execute some code
        SUBCNT  ticks_lo         'get CNT[63..0] minus {ticks_hi, ticks_lo} into {ticks_hi, ticks_lo}
        SUBCNT  ticks_hi         '<somecode> took {ticks_hi, ticks_lo}-1 clocks


        'Do something for some time

        GETCNT  ticks            'get CNT[31..0]
        ADD     ticks,#500       'add 500

loop    <somecode>               'execute some code
        CMPCNT  ticks       WC   'check if 500 clocks have elapsed yet
 if_nc  JMP     #loop            'if not, loop


        'Do something every Nth clock (multi-task)

        GETCNT  ticks            'get CNT[31..0]

loop    ADD     ticks,#500       'add 500
        PASSCNT ticks            'wait for next 500th clock
        <somecode>               'execute some code
        jmp     #loop            'loop


        'Do something every Nth clock using CNT[31..0] (single-task)

        GETCNT  ticks            'get CNT[31..0]
        ADD     ticks,#500       'add initial 500

loop    WAITCNT ticks,#500       'wait for next 500th clock, add next 500, jitter-free
        <somecode>               'execute some code
        jmp     #loop            'loop


        'Do something every Nth clock using CNT[63..0] (single-task)

        GETCNT  ticks_lo         'get CNT[63..0] into {ticks_hi, ticks_lo}
        GETCNTX ticks_hi

loop    ADD     ticks_lo,lo WC   'add 64-bit clock offset
        ADDX    ticks_hi,hi      'this last-written D value becomes the CNT[63..32] target
        WAITCNT ticks_lo,#0 WC   'wait for next {hi,lo}th clock, don't add here (easier above), jitter-free
        <somecode>               'execute some code
        jmp     #loop            'loop


        'Wait for pins to equal a value, with time-out (single-task)

        GETCNT  ticks            'get CNT[31..0] as timeout base
        ADD     ticks,#200       'add timeout of 200 clocks, last-written D value is timeout target
        WAITPEQ value,mask,#0 WC 'wait for (PINA & mask) == value, with timeout
 if_c   JMP     #timeout         'if C=1 then timeout occurred, else pin condition was met


instructions                                                                                                                     clocks
---------------------------------------------------------------------------------------------------------------------------------------
1111111 ZC 0 CCCC DDDDDDDDD 000000100        GETCNT  D             'get CNT[31..0] into D, C=MSB                                      1
1111111 ZC 0 CCCC DDDDDDDDD 000000101        GETCNTX D             'get prior CNT[63..32] into D, C=MSB                               1

1111111 ZC 0 CCCC DDDDDDDDD 000100010        SUBCNT  D             'get CNT[31..0] (then prior CNT[63..32]) minus D into D, C=passed  1
1111111 0C 0 CCCC DDDDDDDDD 010001100        CMPCNT  D             'compare CNT[31..0] (then prior CNT[63..32]) to D, C=passed        1

1111111 00 0 CCCC DDDDDDDDD 010100110        PASSCNT D             'loop until CNT[31..0] passes D                                    1*
1111111 00 1 CCCC DDDDDDDDD 010100110        PASSCNT #D            'loop until CNT[31..0] passes #D                                   1*

1001111 10 0 CCCC DDDDDDDDD SSSSSSSSS        WAITCNT D,S           'wait for CNT[31..0] == D, D += S                                  ?
1001111 10 1 CCCC DDDDDDDDD SSSSSSSSS        WAITCNT D,#S          'wait for CNT[31..0] == D, D += #S                                 ?
1001111 11 0 CCCC DDDDDDDDD SSSSSSSSS        WAITCNT D,S     WC    'wait for CNT[63..0] == {last D, D}, D += S, C=carry               ?
1001111 11 1 CCCC DDDDDDDDD SSSSSSSSS        WAITCNT D,#S    WC    'wait for CNT[63..0] == {last D, D}, D += #S, C=carry              ?

110010n n1 0 CCCC DDDDDDDDD SSSSSSSSS        WAITPEQ D,S,#n  WC    'wait for (PINn & S) == D, or CNT[31..0] == last D, C=timeout      ?
110010n n1 1 CCCC DDDDDDDDD SSSSSSSSS        WAITPEQ D,#S,#n WC    'wait for (PINn & #S) == D, or CNT[31..0] == last D, C=timeout     ?

110011n n1 0 CCCC DDDDDDDDD SSSSSSSSS        WAITPNE D,S,#n  WC    'wait for (PINn & S) <> D, or CNT[31..0] == last D, C=timeout      ?
110011n n1 1 CCCC DDDDDDDDD SSSSSSSSS        WAITPNE D,#S,#n WC    'wait for (PINn & #S) <> D, or CNT[31..0] == last D, C=timeout     ?
---------------------------------------------------------------------------------------------------------------------------------------
* 1 + number of other instructions in the pipeline (0..3) which belong to the same task



HUB EXECUTION
-------------

When a cog is started, registers $000..$1F3 are loaded sequentially from hub memory and then
execution commences at register $000. Executing code in this initial mode, from within the
cog, is fastest and deterministic, though cog space is limited, with some of the registers
invariably serving as data and variables, possibly limiting your code size.

Large programs, or programs which don't need to be deterministic and would like to free up the
cog register space for data, may be executed from hub memory, instead. These programs address
the 256K byte hub memory as 64K longs, ranging from $0000..$FFFF. To accommodate this, all cog
program counters are 16-bit, and there are 16-bit-constant 'jump', 'call', and 'return'
instructions.

To execute from the hub, simply branch outside of the cog address space of $000..$1FF to the
executable hub address space of $0200..$FFFF. You can jump, call, and return to and from
any address. If an instruction's address is $000..$1FF, it is fetched from cog memory. If an
instruction's address is $0200..$FFFF, it is fetched from hub memory.
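
The fetch rule above can be modelled in a few lines of Python. The function name is hypothetical,
and the hub byte address is assumed to be the 16-bit long address times four, which follows from
addressing 256K bytes as 64K longs.

```python
def fetch_source(pc):
    """Return ('cog', register) or ('hub', byte_address) for a 16-bit instruction address."""
    if not 0 <= pc <= 0xFFFF:
        raise ValueError("program counter is 16 bits")
    if pc <= 0x1FF:
        return ("cog", pc)            # $000..$1FF: fetched from cog memory
    return ("hub", pc << 2)           # $0200..$FFFF: one long = 4 bytes
```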

Each cog has four instruction cache lines of eight longs each, which serve as intermediaries
between the hub memory and instruction pipeline. Whenever an instruction is needed from the
hub that is not currently cached, a cache line is loaded on the next hub cycle, temporarily
stalling the pipeline. Cache lines are reloaded on a least-recently-used basis. A prefetch
mode, enabled on cog start, allows straight-line code without hub instructions to execute at
full speed, as if it were running in cog memory. Prefetch may be turned off to speed up
programs which have multiple tasks executing from the hub, and would be hindered by irrelevant
prefetches. It may also be turned off to allow a single-task program to cache four lines that
can be looped within, without cache disruption.

Here are the instructions which govern the instruction cache:

        ICACHEX         'invalidate instruction cache, forces reloads on next hub instructions
        ICACHEP         'enable prefetch (this mode is enabled on cog start)
        ICACHEN         'disable prefetch


To help make hub execution practical, there are two instructions, AUGS and AUGD, which each
provide 23 bits of data to extend 9-bit constants in subsequent instructions to 32 bits:

        AUGS    #longvalue >> 9
        MOV     reg,#longvalue & $1FF

        AUGD    #longvalue >> 9
        SETXCH  #longvalue & $1FF

        AUGS    #frq32a >> 9
        AUGD    #frq32b >> 9
        SETFRQS #frq32b & $1FF,#frq32a & $1FF


For simplicity, these can be coded as such:

        MOV     reg,##longvalue

        SETXCH  ##longvalue

        SETFRQS ##frq32b,##frq32a


AUGS is cancelled when a subsequent instruction expresses a constant S. AUGD is cancelled when
a subsequent instruction expresses a constant D. There are separate AUGS/AUGD circuits for each
of the four tasks within a cog.

Remember that for every ##, you are generating an AUGS/AUGD instruction.
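
The split that ## performs can be checked with a short Python sketch. The function names are
hypothetical; the arithmetic mirrors the '>> 9' and '& $1FF' expressions shown above.

```python
def split_constant(value):
    """Split a 32-bit constant into (23-bit AUGS/AUGD payload, 9-bit instruction field)."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF

def join_constant(aug23, imm9):
    """Rebuild the 32-bit constant from the two pieces."""
    return ((aug23 & 0x7FFFFF) << 9) | (imm9 & 0x1FF)
```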


All 'jump' and 'call' instructions have 16-bit-constant and D-register variants:

        (delayed '-D' versions omitted for brevity)

        JMP     #absolute16     'jump to 16-bit absolute address
        JMP     @relative16     'jump to 16-bit relative address
        JMP     D               'jump to D[15:0], WZ/WC load Z/C from D[31:30]

        CALL    #absolute16     'call to 16-bit absolute address, push {Z,C,PC+1} into task's 4-level stack
        CALL    @relative16     'call to 16-bit relative address, push {Z,C,PC+1} into task's 4-level stack
        CALL    D               'call to D[15:0], push {Z,C,PC+1} into task's 4-level stack, WZ/WC load Z/C from D[31:30]

        CALLA   #absolute16     'call to 16-bit absolute address, WRLONG {Z,C,PC+1},PTRA++
        CALLA   @relative16     'call to 16-bit relative address, WRLONG {Z,C,PC+1},PTRA++
        CALLA   D               'call to D[15:0], WRLONG {Z,C,PC+1},PTRA++, WZ/WC load Z/C from D[31:30]

        CALLB   #absolute16     'call to 16-bit absolute address, WRLONG {Z,C,PC+1},PTRB++
        CALLB   @relative16     'call to 16-bit relative address, WRLONG {Z,C,PC+1},PTRB++
        CALLB   D               'call to D[15:0], WRLONG {Z,C,PC+1},PTRB++, WZ/WC load Z/C from D[31:30]

        CALLX   #absolute16     'call to 16-bit absolute address, WRAUX {Z,C,PC+1},PTRX++
        CALLX   @relative16     'call to 16-bit relative address, WRAUX {Z,C,PC+1},PTRX++
        CALLX   D               'call to D[15:0], WRAUX {Z,C,PC+1},PTRX++, WZ/WC load Z/C from D[31:30]

        CALLY   #absolute16     'call to 16-bit absolute address, WRAUXR {Z,C,PC+1},PTRY++
        CALLY   @relative16     'call to 16-bit relative address, WRAUXR {Z,C,PC+1},PTRY++
        CALLY   D               'call to D[15:0], WRAUXR {Z,C,PC+1},PTRY++, WZ/WC load Z/C from D[31:30]


The 'return' instructions can use WZ/WC to restore Z/C to the caller's states:

        RET                     'return, pop {Z,C,PC} from task's 4-level stack
        RETA                    'return, RDLONG {Z,C,PC},--PTRA
        RETB                    'return, RDLONG {Z,C,PC},--PTRB
        RETX                    'return, RDAUX  {Z,C,PC},--PTRX
        RETY                    'return, RDAUXR {Z,C,PC},--PTRY


The 'push' and 'pop' instructions:

        PUSH    D/#             'push D/# into task's 4-level stack
        PUSHA   D/#             'WRLONG D/#,PTRA++
        PUSHB   D/#             'WRLONG D/#,PTRB++
        PUSHX   D/#             'WRAUX  D/#,PTRX++
        PUSHY   D/#             'WRAUXR D/#,PTRY++

        POP     D               'pop D from task's 4-level stack
        POPA    D               'RDLONG D,--PTRA
        POPB    D               'RDLONG D,--PTRB
        POPX    D               'RDAUX  D,--PTRX
        POPY    D               'RDAUXR D,--PTRY


The conditional jumps, which specify a register or a 9-bit constant for their branch address,
all sign-extend their 9-bit constants for use as a relative address - unless AUGS is used to
express a full 16-bit relative address:

        IJZ     D,@relative9        'increment D and jump to 9-bit relative address if zero
        IJZ     D,@@relative16      'increment D and jump to 16-bit relative address if zero
        IJZ     D,S                 'increment D and jump to S[15:0] if zero

        IJNZ    D,@relative9        'increment D and jump to 9-bit relative address if not zero
        IJNZ    D,@@relative16      'increment D and jump to 16-bit relative address if not zero
        IJNZ    D,S                 'increment D and jump to S[15:0] if not zero

        DJZ     D,@relative9        'decrement D and jump to 9-bit relative address if zero
        DJZ     D,@@relative16      'decrement D and jump to 16-bit relative address if zero
        DJZ     D,S                 'decrement D and jump to S[15:0] if zero

        DJNZ    D,@relative9        'decrement D and jump to 9-bit relative address if not zero
        DJNZ    D,@@relative16      'decrement D and jump to 16-bit relative address if not zero
        DJNZ    D,S                 'decrement D and jump to S[15:0] if not zero

        JZ      D,@relative9        'test D and jump to 9-bit relative address if zero
        JZ      D,@@relative16      'test D and jump to 16-bit relative address if zero
        JZ      D,S                 'test D and jump to S[15:0] if zero

        JNZ     D,@relative9        'test D and jump to 9-bit relative address if not zero
        JNZ     D,@@relative16      'test D and jump to 16-bit relative address if not zero
        JNZ     D,S                 'test D and jump to S[15:0] if not zero

        JP      D/#,@relative9      'jump to 9-bit relative address if pin D/# reads high
        JP      D/#,@@relative16    'jump to 16-bit relative address if pin D/# reads high
        JP      D/#,S               'jump to S[15:0] if pin D/# reads high

        JNP     D/#,@relative9      'jump to 9-bit relative address if pin D/# reads low
        JNP     D/#,@@relative16    'jump to 16-bit relative address if pin D/# reads low
        JNP     D/#,S               'jump to S[15:0] if pin D/# reads low
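
The sign extension of a 9-bit relative constant can be sketched in Python as follows. Both
function names are hypothetical. Which program-counter value serves as the base of the relative
address is not stated in this section, so 'base' is left as a caller-supplied assumption.

```python
def sign_extend9(imm9):
    """Interpret a 9-bit field as a signed offset in -256..255."""
    imm9 &= 0x1FF
    return imm9 - 0x200 if imm9 & 0x100 else imm9

def relative_target(base, imm9):
    """Add the signed 9-bit offset to an assumed base, wrapping to a 16-bit address."""
    return (base + sign_extend9(imm9)) & 0xFFFF
```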


JMPSW jumps according to the S field and stores {Z,C,PC} into D. WZ and WC can be used to load
{Z,C} from S[17:16]:

        JMPSW   D,@relative9        'jump to 9-bit relative address, store {Z,C,PC} into D
        JMPSW   D,@@relative16      'jump to 16-bit relative address, store {Z,C,PC} into D
        JMPSW   D,S                 'jump to S[15:0], store {Z,C,PC} into D
        JMPSW   D,S    WZ,WC        'jump to S[15:0], store {Z,C,PC} into D, Z=S[17], C=S[16]

        SWITCH                      'alias for 'JMPSW INDB,++INDB WZ,WC'
                                    'For round-robin switching among threads
                                    'Use FIXINDB to set up a loop of {Z,C,PC} registers for threads
                                    'Can be used with register remapping for multiple program instances
                                    'Instructions trailing SWITCHD are contextually in the next thread


JMPLIST jumps to a base address (S/@/@@) plus index (D).

        JMPLIST D,@relative9        'jump to D plus 9-bit relative address
        JMPLIST D,@@relative16      'jump to D plus 16-bit relative address
        JMPLIST D,S                 'jump to D plus S


LOCBASE converts a 16-bit hub instruction address into a normal 18-bit hub address for use
with RDxxxx/WRxxxx instructions:

        LOCBASE D,@relative9        'get 18-bit hub address from 9-bit relative address into D
        LOCBASE D,@@relative16      'get 18-bit hub address from 16-bit relative address into D
        LOCBASE D,S                 'get 18-bit hub address from S[15:0] into D


LOCBYTE/LOCWORD/LOCLONG are like LOCBASE, but use the initial D value as an index which gets
scaled and added to the normal 18-bit hub address:

        LOCBYTE D,@relative9        'get 18-bit byte-indexed hub address from 9-bit relative address into D
        LOCBYTE D,@@relative16      'get 18-bit byte-indexed hub address from 16-bit relative address into D
        LOCBYTE D,S                 'get 18-bit byte-indexed hub address from S[15:0] into D

        LOCWORD D,@relative9        'get 18-bit word-indexed hub address from 9-bit relative address into D
        LOCWORD D,@@relative16      'get 18-bit word-indexed hub address from 16-bit relative address into D
        LOCWORD D,S                 'get 18-bit word-indexed hub address from S[15:0] into D

        LOCLONG D,@relative9        'get 18-bit long-indexed hub address from 9-bit relative address into D
        LOCLONG D,@@relative16      'get 18-bit long-indexed hub address from 16-bit relative address into D
        LOCLONG D,S                 'get 18-bit long-indexed hub address from S[15:0] into D
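
As a rough model of the LOCxxxx address arithmetic, the Python sketch below converts a 16-bit
instruction address to an 18-bit byte address and adds a scaled index. The function name, the
scale factors of 1, 2, and 4 bytes, and the wrap to 18 bits are assumptions made for illustration.

```python
# Assumed bytes per index step; LOCBASE takes no index, so its scale is 0.
SCALE = {"LOCBASE": 0, "LOCBYTE": 1, "LOCWORD": 2, "LOCLONG": 4}

def loc(op, instr_addr, index=0):
    """Return the 18-bit hub byte address for a 16-bit instruction address plus scaled index."""
    base = (instr_addr & 0xFFFF) << 2             # long address -> byte address
    return (base + index * SCALE[op]) & 0x3FFFF
```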


Remember that @@ is going to generate an AUGS instruction.


LOCPTRA/LOCPTRB convert 16-bit constant hub instruction addresses into normal 18-bit hub addresses and then store
them into PTRA/PTRB:

        LOCPTRA #absolute16         'get 18-bit hub address into PTRA from 16-bit absolute instruction address
        LOCPTRA @relative16         'get 18-bit hub address into PTRA from 16-bit relative instruction address

        LOCPTRB #absolute16         'get 18-bit hub address into PTRB from 16-bit absolute instruction address
        LOCPTRB @relative16         'get 18-bit hub address into PTRB from 16-bit relative instruction address


There are five assembler directives which are used to position instructions and set cog vs hub assembly modes:

        ORGH    absolute16          'set 16-bit-address hub mode, advances to absolute16 and sets origin
        ORGH                        'set 16-bit-address hub mode, initial state in DAT block

        ORG     absolute9           'set 9-bit-address cog mode, sets origin to absolute9
        ORG                         'set 9-bit-address cog mode, sets origin to 0

        ORGF    absolute9           'advances to absolute9, must be in cog mode

        RES     regcount            'reserves regcount locations, must be in cog mode
        RES                         'reserves 0 locations, must be in cog mode

        FIT     address             'errors out if address exceeded, works in both modes
        FIT                         'if cog mode, error if origin > $1F2; if hub mode, error if origin > $10000


Here is an example PASM application (use F11 to download) which demonstrates hub execution:


        orgh    $380            '$380 = 18-bit load address $E00

        org                     'internal cog code

        jmp     @go             'jump to hub memory

x       long    3               'cog register variable

        orgh    $1000           'some hub code at $1000

go      incmod  x,#3
        jmplist x,@@list

        orgh    $1400           'some hub code at $1400

list    jmp     @z0
        jmp     @z1
        jmp     @z2
        jmp     @z3

        orgh    $1800           'some hub code at $1800

z0      notp    #0
        jmp     @go

z1      notp    #1
        jmp     @go

z2      notp    #2
        jmp     @go

z3      notp    #3
        jmp     @go



COUNTERS - this section is not done yet!!!
--------

Each cog has two configurable counters. They are named CTRA and CTRB and are accessed by
thirteen instructions each. The instructions which end in "A" are for CTRA and those that
end in "B" are for CTRB. For brevity, only CTRA instructions are used in the definitions and
examples that follow.

        GETPHSA D               - Get PHSA into D
        GETPHZA D               - Get PHSA into D, simultaneously clear PHSA to 0
        GETCOSA D               - Get COSA into D
        GETSINA D               - Get SINA into D

        SETCTRA D/#             - Set CTRA configuration
        SETWAVA D/#             - Set WAVA
        SETFRQA D/#             - Set FRQA
        SETPHSA D/#             - Set PHSA
        ADDPHSA D/#             - Add to PHSA
        SUBPHSA D/#             - Subtract from PHSA

        SYNCTRA                 - Wait for PHSA to roll over
        POLCTRA WC              - Check if PHSA has rolled over (C=1 if rolled over)
        CAPCTRA                 - Capture CTRA accumulators into COSA and SINA

Modes:

  (QDR = PHS[31] XNOR PHS[30], or PHS[31] delayed by 90 degrees)


  Off Mode
  -------------------------------------------------------------------------------
  %00000 = Counter off (initial state after cog start)


  NCO Modes
  -------------------------------------------------------------------------------
  %00001 = NCO output + video PLL mode, PLL output = PHS[31] (reference signal)
  %00010 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 32
  %00011 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 16
  %00100 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 8
  %00101 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 4
  %00110 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 2
  %00111 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 1
  %01000 = NCO output

  DUAL Modes
  -------------------------------------------------------------------------------
  %000_01001 = dual NCO outputs + dual COUNT_LOWS inputs
  %001_01001 = dual NCO outputs + dual COUNT_HIGHS inputs
  %010_01001 = dual NCO outputs + dual COUNT_NEGATIVE_EDGES inputs
  %011_01001 = dual NCO outputs + dual COUNT_POSITIVE_EDGES inputs
  %100_01001 = dual NCO outputs + dual TIME_LOWS inputs
  %101_01001 = dual NCO outputs + dual TIME_HIGHS inputs
  %110_01001 = dual NCO outputs + dual TIME_NEGATIVE_EDGES inputs
  %111_01001 = dual NCO outputs + dual TIME_POSITIVE_EDGES inputs

  %000_01010 = dual DUTY outputs + dual COUNT_LOWS inputs
  %001_01010 = dual DUTY outputs + dual COUNT_HIGHS inputs
  %010_01010 = dual DUTY outputs + dual COUNT_NEGATIVE_EDGES inputs
  %011_01010 = dual DUTY outputs + dual COUNT_POSITIVE_EDGES inputs
  %100_01010 = dual DUTY outputs + dual TIME_LOWS inputs
  %101_01010 = dual DUTY outputs + dual TIME_HIGHS inputs
  %110_01010 = dual DUTY outputs + dual TIME_NEGATIVE_EDGES inputs
  %111_01010 = dual DUTY outputs + dual TIME_POSITIVE_EDGES inputs

  %000_01011 = dual PWM outputs + dual COUNT_LOWS inputs
  %001_01011 = dual PWM outputs + dual COUNT_HIGHS inputs
  %010_01011 = dual PWM outputs + dual COUNT_NEGATIVE_EDGES inputs
  %011_01011 = dual PWM outputs + dual COUNT_POSITIVE_EDGES inputs
  %100_01011 = dual PWM outputs + dual TIME_LOWS inputs
  %101_01011 = dual PWM outputs + dual TIME_HIGHS inputs
  %110_01011 = dual PWM outputs + dual TIME_NEGATIVE_EDGES inputs
  %111_01011 = dual PWM outputs + dual TIME_POSITIVE_EDGES inputs

  WAVE Modes
  -------------------------------------------------------------------------------
  %01100 = dual SQR_WAVE output + GOERTZEL input
  %01101 = dual SAW_WAVE output + GOERTZEL input
  %01110 = dual TRI_WAVE output + GOERTZEL input
  %01111 = dual SIN_WAVE output + GOERTZEL input

In the WAVE modes, FRQ is added into PHS on every clock cycle. The top nine bits of PHS
are used to drive sine and cosine lookup tables which are used for sine output functions
and GOERTZEL computations. While the sine/cosine output functions are the most useful for
signal processing, triangle-, sawtooth-, and square-wave output functions are also selectable,
being derived from the top nine bits of PHS, as well.

The WAVE modes output both parallel DAC signals and duty-modulated pin signals. All
output signals are nine bits in base quality with an additional nine sub-bits of dithering
to maintain base quality after attenuative scaling. The dual outputs differ only in phase
and are set up by the WAV register:


  WAV register in WAVE modes (can be changed by SETWAVA/SETWAVB instruction)
  -------------------------------------------------------------------------------
  %PPPPPPPPP_xxxxx_TTTTTTTTT_AAAAAAAAA

      PPPPPPPPP = phase advance for OUTA (0 to 511/512 revolutions)
          xxxxx = unused for WAVE modes
      TTTTTTTTT = offset for OUTA and OUTB
      AAAAAAAAA = amplitude for OUTA and OUTB


  Initial value after cog start:

  %010000000_00000_100000000_111111111

      010000000 = 90-degree phase advance for GOERTZEL use (OUTA=cosine, OUTB=sine)
          00000 = unused
      100000000 = mid-point offset (allows maximum amplitude)
      111111111 = maximum amplitude
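
The WAV field layout can be checked with a short Python sketch. The function names are
hypothetical; the bit positions follow the %PPPPPPPPP_xxxxx_TTTTTTTTT_AAAAAAAAA pattern above.

```python
def pack_wav(phase, offset, amplitude):
    """Pack three 9-bit fields into a WAV value; the five unused bits are left at 0."""
    for name, field in (("phase", phase), ("offset", offset), ("amplitude", amplitude)):
        if not 0 <= field <= 0x1FF:
            raise ValueError(f"{name} must fit in nine bits")
    return (phase << 23) | (offset << 9) | amplitude

def unpack_wav(wav):
    """Return (phase, offset, amplitude) from a 32-bit WAV value."""
    return (wav >> 23) & 0x1FF, (wav >> 9) & 0x1FF, wav & 0x1FF
```

With these definitions, pack_wav(%010000000, %100000000, %111111111) reproduces the initial value
listed above.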


The GOERTZEL computation works as follows, on every clock:

    Nine-bit sine and cosine values are looked up using the top nine bits of PHS.
    The sine and cosine values are negated if INA is 0, else they remain the same.
    The sine and cosine values are added into separate sine and cosine accumulators.

This process measures the energy content of INA at the frequency of PHS rollover.
To make this work, the INA pin should be configured for delta-sigma ADC mode, so
that it streams back 1's and 0's that ratiometrically represent the voltage of the
I/O pin.

To make a GOERTZEL measurement:

    - The top nine bits of WAV should be set to %010000000 for proper cosine lookup.
    - FRQ must be set to generate the frequency of interest in PHS rollovers (SETFRQA).
    - PHS and the accumulators should be cleared to 0 (SETPHSA #0, then CAPCTRA).
    - Some number of complete PHS rollovers must be waited for (SYNCTRA/POLLCTRA).
    - The accumulators must be captured and read (CAPCTRA + GETCOSA + GETSINA).
    - The length of the accumulators' hypotenuse indicates signal strength; its angle, phase.

By making swept FRQ measurements in a closed loop, where OUTA is used to output a reference
frequency of known phase to stimulate a system, and INA receives a signal back that
is somehow coupled to OUTA, you can determine things such as spectral response, resonant
frequency, and frequency vs. phase of a system.

The more PHS rollovers in a measurement, the more selective the result will be. For open-
loop measurements, this means tighter bandwidth. For closed-loop measurements, the angle
of the hypotenuse becomes meaningful. The QARCTAN instruction can translate the sine and
cosine accumulations into power and phase values.
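
The per-clock GOERTZEL steps described above can be modelled in software. The sketch below is
a Python approximation under several assumptions: the nine-bit lookup tables are modelled as
round(255*sin(x)), which may not match the silicon tables; the accumulators are unbounded
integers; delta_sigma is a hypothetical first-order modulator standing in for the pin's ADC
mode; and math.hypot/math.atan2 stand in for what QARCTAN would compute.

```python
import math


def delta_sigma(samples):
    """Hypothetical first-order modulator: bit density tracks samples in 0..1."""
    acc, bits = 0.0, []
    for s in samples:
        acc += s
        if acc >= 1.0:
            acc -= 1.0
            bits.append(1)
        else:
            bits.append(0)
    return bits


def goertzel(bits, frq, phase_advance=0b010000000):
    """Accumulate sine and cosine correlations of a 1-bit stream, one bit per clock."""
    phs = 0
    cos_acc = sin_acc = 0
    for bit in bits:
        idx = phs >> 23                                  # top nine bits of 32-bit PHS
        sin_v = round(255 * math.sin(2 * math.pi * idx / 512))
        cos_v = round(255 * math.sin(2 * math.pi * ((idx + phase_advance) & 511) / 512))
        sign = 1 if bit else -1                          # negate when INA is 0
        sin_acc += sign * sin_v
        cos_acc += sign * cos_v
        phs = (phs + frq) & 0xFFFFFFFF                   # FRQ is added on every clock
    return cos_acc, sin_acc


# 64 complete PHS rollovers of 256 clocks each (FRQ = 2^32 / 256)
clocks = 256 * 64
stream = delta_sigma([0.5 + 0.4 * math.sin(2 * math.pi * i / 256) for i in range(clocks)])
cos_acc, sin_acc = goertzel(stream, frq=1 << 24)
strength = math.hypot(cos_acc, sin_acc)      # length of the hypotenuse
phase = math.atan2(sin_acc, cos_acc)         # angle of the hypotenuse
```

Running the same stream through goertzel with a different FRQ gives a much smaller
hypotenuse, which is the selectivity the text describes.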


  LOGIC Modes
  -------------------------------------------------------------------------------
  %10000 = LOGIC_A_POSEDGE input    INA & !INA previous
  %10001 = LOGIC_NA_AND_NB input   !INA & !INB
  %10010 = LOGIC_A_AND_NB input     INA & !INB
  %10011 = LOGIC_NB input                 !INB
  %10100 = LOGIC_NA_AND_B input    !INA &  INB
  %10101 = LOGIC_NA input          !INA
  %10110 = LOGIC_A_NE_B input       INA <> INB
  %10111 = LOGIC_NA_OR_NB input    !INA | !INB
  %11000 = LOGIC_A_AND_B input      INA &  INB
  %11001 = LOGIC_A_EQ_B input       INA == INB
  %11010 = LOGIC_A input            INA
  %11011 = LOGIC_A_OR_NB input      INA | !INB
  %11100 = LOGIC_B input                   INB
  %11101 = LOGIC_NA_OR_B input     !INA |  INB
  %11110 = LOGIC_A_OR_B input       INA |  INB
  %11111 = LOGIC_ENCODER input      INA,   INB encoder

    OUTA = ADD signal (condition met or LOGIC_ENCODER forward step)
    OUTB = SUB signal (LOGIC_ENCODER reverse step)

In the LOGIC modes, FRQ is conditionally added to PHS on each clock cycle that meets that
mode's requirement. In the case of the LOGIC_ENCODER mode, FRQ may be added or subtracted
to/from PHS when a half-step is registered. OUTA and OUTB reflect the ADD and SUB states
for each cycle, and are more likely to be useful to other CTRs than as signals sent to
output pins.


DACS
----

Each cog outputs 4 channels of DAC data, named DAC0..DAC3. These DAC data channels can be
set to values or actively driven from CTRA, CTRB, or VID. In all cases but VID, the source
data is 18 bits and is dithered on every clock cycle for 9-bit DAC output. In the case of
VID, the source data is just 9 bits, so no dithering is performed.

Each I/O pin has a 75-ohm 9-bit DAC which can be configured using CFGPINS to output a
fixed DAC channel from any cog. Every cog's DAC0..DAC3 are available, in that sequence,
to P0..P3, then to P4..P7, then to the next four pins, and so on, as shown below:


PortA   PortB   PortC       DACx
--------------------------------
P0      P32     P64         DAC0 from any cog
P1      P33     P65         DAC1 from any cog
P2      P34     P66         DAC2 from any cog
P3      P35     P67         DAC3 from any cog
P4      P36     P68         DAC0 from any cog
P5      P37     P69         DAC1 from any cog
P6      P38     P70         DAC2 from any cog
P7      P39     P71         DAC3 from any cog
P8      P40     P72         DAC0 from any cog
P9      P41     P73         DAC1 from any cog
P10     P42     P74         DAC2 from any cog
P11     P43     P75         DAC3 from any cog
P12     P44     P76         DAC0 from any cog
P13     P45     P77         DAC1 from any cog
P14     P46     P78         DAC2 from any cog
P15     P47     P79         DAC3 from any cog
P16     P48     P80         DAC0 from any cog
P17     P49     P81         DAC1 from any cog
P18     P50     P82         DAC2 from any cog
P19     P51     P83         DAC3 from any cog
P20     P52     P84         DAC0 from any cog
P21     P53     P85         DAC1 from any cog
P22     P54     P86         DAC2 from any cog
P23     P55     P87         DAC3 from any cog
P24     P56     P88         DAC0 from any cog
P25     P57     P89         DAC1 from any cog
P26     P58     P90         DAC2 from any cog
P27     P59     P91         DAC3 from any cog
P28     P60     P92         DAC0 from any cog
P29     P61     P93         DAC1 from any cog
P30     P62     P94         DAC2 from any cog
P31     P63     P95         DAC3 from any cog
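
The table above repeats every four pins, so the channel available to a pin is just the pin
number modulo 4. A one-line Python sketch (dac_channel_for_pin is a name invented here):

```python
def dac_channel_for_pin(pin):
    """Return which DAC channel (0..3) pin P0..P95 can output, per the table above."""
    if not 0 <= pin <= 95:
        raise ValueError("pin must be in 0..95")
    return pin % 4
```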


Here are the instructions which configure DAC0..DAC3:

    CFGDAC0 D/#     - Configure DAC0

        %00 = Software controlled (default)
        %01 = CTRA SIGA
        %10 = CTRA SIGA + CTRB SIGA
        %11 = VID SIG0

    CFGDAC1 D/#     - Configure DAC1

        %00 = Software controlled (default)
        %01 = CTRA SIGB
        %10 = CTRA SIGB + CTRB SIGB
        %11 = VID SIG1

    CFGDAC2 D/#     - Configure DAC2

        %00 = Software controlled (default)
        %01 = CTRB SIGA
        %10 = CTRA SIGA + CTRB SIGA
        %11 = VID SIG2

    CFGDAC3 D/#     - Configure DAC3

        %00 = Software controlled (default)
        %01 = CTRB SIGB
        %10 = CTRA SIGB + CTRB SIGB
        %11 = VID SIG3

    CFGDACS D/#     - Configure DAC3..DAC0 from four 2-bit fields: %33_22_11_00
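
As an illustration of the %33_22_11_00 layout, the following Python sketch builds a CFGDACS
operand from four 2-bit configurations. The constant names and pack_cfgdacs are assumptions
made for this example only.

```python
DAC_SOFTWARE = 0b00   # software controlled (default)
DAC_CTR = 0b01        # single counter signal
DAC_CTR_SUM = 0b10    # sum of CTRA and CTRB signals
DAC_VID = 0b11        # video signal


def pack_cfgdacs(dac0, dac1, dac2, dac3):
    """Build the CFGDACS operand: DAC3 in bits 7..6 down to DAC0 in bits 1..0."""
    for cfg in (dac0, dac1, dac2, dac3):
        if not 0 <= cfg <= 3:
            raise ValueError("each configuration is a 2-bit value")
    return (dac3 << 6) | (dac2 << 4) | (dac1 << 2) | dac0
```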


For configurations %00..%10, the data sources are 18 bits wide, with the 9 lower bits
being dithered by a 32-bit LFSR to realize more DAC resolution. This improves dynamic
range, but introduces a white noise of one step in amplitude in the 9-bit DAC output.
As dynamic signals get smaller in amplitude, they appear to sink into the dither noise,
but actually remain very high-Q, as the dither noise is very low-Q. For configuration
%11 (VID), the data is a straight 9 bits with no dithering.

The dithering works by taking nine fixed bits from a 32-bit LFSR and sign-extending
them to 18 bits. This yields a pseudo-random value ranging from %111111111_100000000
(negative) to %000000000_011111111 (positive) on every clock cycle. When added to the
18-bit source data, the lower 9 bits of source data are realized as a proportional
toggling between two adjacent values in the top 9 bits of the sum, which form the DAC
output data. It will take at least 512 (2^9) clocks for the DAC output to average to
the intended 18-bit source value, assuming source data is static.

On cog start, all configurations are cleared to %00 and the source values are set to
%000000000_100000000, which is effectively zero, since dithering will never cause an
output step toggle when the nine lower source bits are %100000000:


       source data %XXXXXXXXX_100000000
  + minimum dither %111111111_100000000
                   --------------------
                 = %XXXXXXXXX_000000000    (top 9 bits are unchanged)


       source data %XXXXXXXXX_100000000
  + maximum dither %000000000_011111111
                   --------------------
                 = %XXXXXXXXX_111111111    (top 9 bits are unchanged)
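
The dither addition can be checked with a short Python model. It does not model the LFSR
itself; instead it averages over each of the 512 possible dither values once, which assumes
the nine LFSR bits are uniformly distributed. The function names are invented for this sketch.

```python
def dac_output(source18, dither):
    """Top 9 bits of the 18-bit sum of source data and sign-extended dither."""
    return ((source18 + dither) & 0x3FFFF) >> 9


def average_output(source18):
    """Mean DAC code over every dither value from -256 to +255."""
    return sum(dac_output(source18, d) for d in range(-256, 256)) / 512


# Lower bits %100000000: no dither value ever changes the top 9 bits
steady = {dac_output((100 << 9) | 0b100000000, d) for d in range(-256, 256)}

# Lower bits %110000000: output toggles between two adjacent codes
toggling = {dac_output((100 << 9) | 0b110000000, d) for d in range(-256, 256)}
```

In this model the average output works out to (source - 256) / 512, so the %100000000
mid-point in the lower bits acts as the zero reference for the fractional part.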


Here are the instructions which set DAC0..DAC3 source values in software:


    SETDAC0 #n      - Set DAC0 to %nnnnnnnnn_100000000, force configuration to %00
    SETDAC0 D       - Set DAC0 to D[31..14], force configuration to %00 *

    SETDAC1 #n      - Set DAC1 to %nnnnnnnnn_100000000, force configuration to %00
    SETDAC1 D       - Set DAC1 to D[31..14], force configuration to %00 *

    SETDAC2 #n      - Set DAC2 to %nnnnnnnnn_100000000, force configuration to %00
    SETDAC2 D       - Set DAC2 to D[31..14], force configuration to %00 *

    SETDAC3 #n      - Set DAC3 to %nnnnnnnnn_100000000, force configuration to %00
    SETDAC3 D       - Set DAC3 to D[31..14], force configuration to %00 *

    SETDACS #n      - Set DAC3..DAC0 to %nnnnnnnnn_100000000
                      Force DAC3..DAC0 configurations to %00

    SETDACS D       - Set DAC3 to %dddddddd0_100000000, where dddddddd is D[31..24]
                      Set DAC2 to %dddddddd0_100000000, where dddddddd is D[23..16]
                      Set DAC1 to %dddddddd0_100000000, where dddddddd is D[15..8]
                      Set DAC0 to %dddddddd0_100000000, where dddddddd is D[7..0]
                      Force DAC3..DAC0 configurations to %00

             
    * Be aware when using SETDACx D, that if D < $00400000 or D > $FFC03FFF, full-
      scale toggling will occur, as the dither addition will cause wrapping. For
      ground-based DAC output, you can add $00400000 to each output sample to
      prevent this from happening.
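
The two limits in the note follow from the dither range of -256..+255 applied to D[31..14].
A minimal Python check, with setdac_wraps being a hypothetical helper name:

```python
def setdac_wraps(d):
    """True if some dither value would wrap the 18-bit sum for SETDACx D."""
    v = (d >> 14) & 0x3FFFF           # the 18 bits taken from D[31..14]
    return v - 256 < 0 or v + 255 > 0x3FFFF
```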



VIDEO
-----

Each cog has a video generator (VID) that can stream pixel data and perform colorspace
conversion and modulation, so that final video signals can be output to the 75-ohm DACs
on the I/O pins.

Pixel streaming, colorspace conversion, modulation, DAC channel driving, and DAC pin
updating are all performed in a pipelined fashion on each cycle of VID's dot clock.

VID gets its dot clock from CTRA's PLL. CTRA must be configured for PLL operation in
order for VID to operate.

The DAC channel(s) must be configured for video output by using CFGDAC0..CFGDAC3 or
CFGDACS. To set all DAC channels to video, do 'CFGDACS #%11_11_11_11'.

The I/O pins which will output the DAC channels must be configured to do so via CFGPINS.

To turn on VID and configure its DAC channel outputs, the SETVID instruction is used:

    SETVID  D/#     - Set video configuration register (VCFG)

        %00xxxxx = off (default)             SIG3    SIG2    SIG1    SIG0
                                             ----------------------------
        %01xxxxx = SDTV/HDTV/VGA             Y_R     I_G     Q_B     SYN
        %10xxxxx = NTSC/PAL S-VIDEO          YIQ     YIQ     _IQ     Y__
        %11xxxxx = NTSC/PAL COMPOSITE        YIQ     YIQ     YIQ     YIQ

        %xx0xxxx = zero-extend Y/I/Q coefficients for VGA colorspace (allows +$80, or '+1.0')
        %xx1xxxx = sign-extend Y/I/Q coefficients for NTSC/PAL/SDTV/HDTV colorspace

        %xxx0xxx = no sync on Y_R         (VGA)
        %xxx1xxx = sync on Y_R            (SDTV/HDTV)

        %xxxx0xx = no sync on I_G         (VGA)
        %xxxx1xx = sync on I_G            (SDTV/HDTV)

        %xxxxx0x = no sync on Q_B         (VGA)
        %xxxxx1x = sync on Q_B            (SDTV/HDTV)

        %xxxxxx0 = positive sync on SYN   (VGA)
        %xxxxxx1 = negative sync on SYN   (VGA)


Before any meaningful video signals can be output, you must set the colorspace coefficients
and offset levels, which are each 8 bits:

    SETVIDY D/#     - Set Y_R's offset level and RGB colorspace coefficients: $YO_YR_YG_YB

    SETVIDI D/#     - Set I_G's offset level and RGB colorspace coefficients: $IO_IR_IG_IB

    SETVIDQ D/#     - Set Q_B's offset level and RGB colorspace coefficients: $QO_QR_QG_QB


All pixels are internally handled by VID as 8:8:8 bit R:G:B data.

Colorspace conversion is performed as sum-of-products calculations on the R:G:B pixel data
and the colorspace coefficients, yielding Y, I, and Q components:

    Where R, G, B are 8-bit pixel color components and Y, I, Q are 9-bit sums (MOD 512):

        Y = (R*YR + G*YG + B*YB)/64        Where YR, YG, YB are 8-bit Y coefficients
        I = (R*IR + G*IG + B*IB)/64        Where IR, IG, IB are 8-bit I coefficients
        Q = (R*QR + G*QG + B*QB)/64        Where QR, QG, QB are 8-bit Q coefficients


    For outputs Y_R, I_G, and Q_B, offset levels are added to the Y, I, and Q components to
    properly position the final signals for SDTV/HDTV. In the case of VGA outputs, the
    offset levels are set to 0, since VGA signals are ground-based.

    For modulated outputs YIQ and _IQ, the I and Q components, treated as (I,Q), are rotated
    around (0,0) by an angle that steps 1/16th of a revolution on each dot clock, yielding
    Q'. In the case of YIQ output, the Y component (luma) and Q' (chroma) are added to form
    a composite video signal. In the case of _IQ output, an offset level is added to Q' to
    form an s-video chroma signal. For Y__ output, the Y component (luma) is output alone to
    form an s-video luma signal.
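
The sum-of-products step can be sketched in Python as follows. The function names are
invented here, and one detail is an assumption: negative sums are divided with an arithmetic
right shift (rounding toward negative infinity), which the text does not specify.

```python
def extend(coef, signed):
    """Interpret an 8-bit coefficient as sign-extended (VCFG[4]=1) or zero-extended."""
    coef &= 0xFF
    return coef - 256 if signed and coef & 0x80 else coef


def component(r, g, b, coefs, signed):
    """One 9-bit Y, I, or Q sum: (R*cR + G*cG + B*cB)/64, MOD 512."""
    cr, cg, cb = (extend(c, signed) for c in coefs)
    total = r * cr + g * cg + b * cb
    return (total >> 6) & 0x1FF       # assumed: floor division, then 9-bit wrap


# VGA: coefficient $80 is '+1.0', so an 8-bit colour maps to twice its value
vga_red = component(255, 0, 0, (0x80, 0x00, 0x00), signed=False)

# HDTV Pr row (+64, -58, -6): a white pixel carries no colour difference
pr_white = component(255, 255, 255, (64, -58 & 0xFF, -6 & 0xFF), signed=True)
```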


Below are some common colorspace coefficient sets. Note that these values are normalized
to 1.0. In the sum-of-products calculations, 128 is equal to 1.0, so the values below
should all be multiplied by 128 to get the proper 8-bit values for usage as coefficients.
In practice, the values will need to be scaled down so that under 75-ohm load, they will
peak at 1.0V (not 1.65V, which is 3.3V/2). This scaling will compromise DAC span by ~39%,
leaving you with a still-sufficient ~8.3 bits of DAC resolution. However, if you'd like
to keep DAC span maximal, you may leave the coefficients as originally computed and
achieve the proper voltage under load by using an external voltage divider made from two
resistors, being sure to maintain the 75 ohms source impedance.


coefficient positions
-----------------------
YR       YG       YB
IR       IG       IB
QR       QG       QB
-----------------------

RGB (VGA)     VCFG[4]=0
-----------------------
1        0        0           R sums to 1
0        1        0           G sums to 1
0        0        1           B sums to 1
-----------------------

YPbPr (HDTV)  VCFG[4]=1                             x128
-----------------------                             -------------
+.213    +.715    +.072       Y  sums to 1          +27  +92  +9
-.115    -.385    +.500       Pb sums to 0          -15  -49  +64
+.500    -.454    -.046       Pr sums to 0          +64  -58  -6
-----------------------

YPbPr (SDTV)  VCFG[4]=1
-----------------------
+.299    +.587    +.114       Y  sums to 1
-.169    -.331    +.500       Pb sums to 0
+.500    -.419    -.081       Pr sums to 0
-----------------------

YIQ (NTSC)    VCFG[4]=1
-----------------------
+.299    +.587    +.114       Y sums to 1
+.596    -.274    -.322       I sums to 0 *
+.212    -.523    +.311       Q sums to 0 *
-----------------------

YUV (PAL)     VCFG[4]=1
-----------------------
+.299    +.587    +.114       Y sums to 1
-.147    -.289    +.436       U sums to 0 *
+.615    -.515    -.100       V sums to 0 *
-----------------------

* These sets of three coefficients must be scaled by 0.608 to pre-compensate for
  CORDIC rotator expansion which will occur in the video modulator.
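
Converting the normalized tables into register values is a matter of multiplying by 128,
applying any extra scale factor, and packing the results as 8-bit two's-complement bytes. The
Python sketch below shows one way to do that; to_coef_bytes and pack_setvid are hypothetical
helpers, and simple rounding to nearest is assumed.

```python
def to_coef_bytes(row, scale=1.0):
    """Turn three normalized coefficients into 8-bit values (128 = 1.0)."""
    return [round(c * 128 * scale) & 0xFF for c in row]


def pack_setvid(offset, row, scale=1.0):
    """Build a SETVIDY/SETVIDI/SETVIDQ operand: offset, then R, G, B coefficients."""
    cr, cg, cb = to_coef_bytes(row, scale)
    return ((offset & 0xFF) << 24) | (cr << 16) | (cg << 8) | cb


hdtv_y = to_coef_bytes((+.213, +.715, +.072))            # matches the x128 column
hdtv_pb = to_coef_bytes((-.115, -.385, +.500))           # negatives become two's complement
ntsc_i = to_coef_bytes((+.596, -.274, -.322), 0.608)     # pre-compensated for the rotator
vga_r = pack_setvid(0, (1, 0, 0))                        # offset 0, R coefficient $80
```

The scale argument can also carry the reduction needed to peak at 1.0V under a 75-ohm load.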


Once VID is configured, WAITVID instructions are used to issue contiguous commands
to keep the pixel streamer busy:

    WAITVID --> pixel streamer --> colorspace/modulator --> DAC signals --> I/O pins


VID double-buffers WAITVID commands to relax WAITVID timing requirements.

In single-task mode (on cog start or after 'SETTASK zero'), WAITVID will stall the
pipeline as it waits for VID to take the command. In multi-task mode (after
'SETTASK nonzero'), WAITVID will keep jumping back to itself until VID takes the
command, in order to free up clock cycles for other tasks. In either case, the
POLVID instruction may be used to test whether or not VID is ready for another
command. If it is ready, WAITVID will release immediately, taking only one clock.

    POLVID  WC      - Check if VID ready for another WAITVID, C=1 if ready


Here is the WAITVID instruction:

    WAITVID D/#,S/# - Wait for VID ready, then give next command via D and S

When WAITVID executes, the D and S values are captured by VID and used for the duration
of the command.

The WAITVID instruction has special encoding so that immediate D values can range from
0 to 3583, or $DFF. These large immediate D values are helpful in reducing code size
when issuing WAITVIDs that generate sync signals.

The D operand of WAITVID has four fields:

    %AAAAAAAA_MMMM_PPPPPPP_CCCCCCCCCCCCC

             %AAAAAAAA = AUX base address for pixel lookup (0..255)
                 %MMMM = pixel mode (0..15), elaborated below
              %PPPPPPP = number of dot clocks per pixel (1..127, 0 acts as 128)
        %CCCCCCCCCCCCC = number of dot clocks in WAITVID (1..8191, 0 acts as 8192)
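
A small Python sketch of how the four D fields combine, including the "0 acts as 128/8192"
encodings and the immediate limit of 3583. The helper names are assumptions for this example.

```python
def pack_waitvid_d(aux_base, mode, clocks_per_pixel, total_clocks):
    """Build WAITVID's D operand as %AAAAAAAA_MMMM_PPPPPPP_CCCCCCCCCCCCC."""
    if not 0 <= aux_base <= 255:
        raise ValueError("AUX base address is 0..255")
    if not 0 <= mode <= 15:
        raise ValueError("pixel mode is 0..15")
    if not 1 <= clocks_per_pixel <= 128:
        raise ValueError("dot clocks per pixel is 1..128")
    if not 1 <= total_clocks <= 8192:
        raise ValueError("dot clocks in WAITVID is 1..8192")
    # 128 and 8192 wrap to 0 in their 7-bit and 13-bit fields
    return ((aux_base << 24) | (mode << 20)
            | ((clocks_per_pixel & 0x7F) << 13) | (total_clocks & 0x1FFF))


def fits_immediate(d):
    """True if D can be coded as an immediate (0..3583, or $DFF)."""
    return 0 <= d <= 0xDFF


CLU8_RGB24 = 0b0100
d_visible = pack_waitvid_d(0x10, CLU8_RGB24, 4, 640)
```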


The D operand's %MMMM field determines which pixel mode will be used for the WAITVID and
what the S operand will be used for:

    %0000 = LIT_RGBS32    - S is used as a literal 8:8:8:8 bit R:G:B:SYNC pixel. This is
                            the only mode which can generate sync signals. In this mode,
                            only the %CCCCCCCCCCCCC bits of D are used, so all other bits
                            can be 0.

    %0001 = CLU1_RGB24    - 32 1-bit offsets in S lookup 8:8:8 pixel longs in AUX
    %0010 = CLU2_RGB24    - 16 2-bit offsets in S lookup 8:8:8 pixel longs in AUX
    %0011 = CLU4_RGB24    - 8 4-bit offsets in S lookup 8:8:8 pixel longs in AUX
    %0100 = CLU8_RGB24    - 4 8-bit offsets in S lookup 8:8:8 pixel longs in AUX
    %0101 = CLU8_RGB15    - 4 8-bit offsets in S lookup 5:5:5 pixel words in AUX
    %0110 = CLU8_RGB16    - 4 8-bit offsets in S lookup 5:6:5 pixel words in AUX

                            The CLUx modes use the 1/2/4/8-bit fields of S, lowest field
                            first, as offsets for looking up pixels in AUX, starting at
                            %AAAAAAAA. Upon completion of each pixel, the next higher
                            bit field is used, with the highest field repeating.

                            For CLU1_RGB24..CLU8_RGB24, the 1/2/4/8-bit fields are used
                            as long offsets into AUX, yielding 8:8:8 pixel data from AUX
                            data bits 23..0.

                            For CLU8_RGB15 and CLU8_RGB16, bits 7..1 of each 8-bit field
                            are used as the long offset into AUX, while bit 0 selects the
                            low or high word containing the 5:5:5 (LSB-justified) or
                            5:6:5 pixel data.

    %0111 = STR1_RGB9     - 1-bit pixels streamed from AUX select between 3:3:3 colors in
                            S[17..9] and S[26..18]. The stream start address in AUX is
                            %AAAAAAAA plus S[7..0], with S[31..27] selecting the starting
                            bit.

    %1000 = STR4_RGBI4    - 4-bit pixels are streamed from AUX starting at %AAAAAAAA plus
                            S[7..0], with S[31..29] selecting the starting nibble. The
                            pixels are colored as:

                            %0000 = black
                            %0001 = dark grey
                            %0010 = dark blue
                            %0011 = bright blue
                            %0100 = dark green
                            %0101 = bright green
                            %0110 = dark cyan
                            %0111 = bright cyan
                            %1000 = dark red
                            %1001 = bright red
                            %1010 = dark magenta
                            %1011 = bright magenta
                            %1100 = olive
                            %1101 = yellow
                            %1110 = light grey
                            %1111 = white

    %1001 = STR4_LUMA4    - 4-bit pixels are streamed from AUX starting at %AAAAAAAA plus
                            S[7..0], with S[31..29] selecting the starting nibble. The
                            pixels are used as brightness values for colors determined by
                            S[11..9]:

                            %000 = black..orange
                            %001 = black..blue
                            %010 = black..green
                            %011 = black..cyan
                            %100 = black..red
                            %101 = black..magenta
                            %110 = black..yellow
                            %111 = black..white

    %1010 = STR8_RGBI8    - 8-bit pixels are streamed from AUX starting at %AAAAAAAA plus
                            S[7..0], with S[31..30] selecting the starting byte. The
                            pixels are colored as:

                            $00..$1F = black..orange
                            $20..$3F = black..blue
                            $40..$5F = black..green
                            $60..$7F = black..cyan
                            $80..$9F = black..red
                            $A0..$BF = black..magenta
                            $C0..$DF = black..yellow
                            $E0..$FF = black..white

    %1011 = STR8_LUMA8    - 8-bit pixels are streamed from AUX starting at %AAAAAAAA plus
                            S[7..0], with S[31..30] selecting the starting byte. The
                            pixels are used as brightness values for colors determined by
                            S[11..9]:

                            %000 = black..orange
                            %001 = black..blue
                            %010 = black..green
                            %011 = black..cyan
                            %100 = black..red
                            %101 = black..magenta
                            %110 = black..yellow
                            %111 = black..white

    %1100 = STR8_RGB8     - 8-bit 3:3:2 pixels are streamed from AUX starting at %AAAAAAAA
                            plus S[7..0], with S[31..30] selecting the starting byte.

    %1101 = STR16_RGB15   - 15-bit 5:5:5 pixels are streamed from AUX starting at %AAAAAAAA
                            plus S[7..0], with S[31] selecting the starting word.

    %1110 = STR16_RGB16   - 16-bit 5:6:5 pixels are streamed from AUX starting at %AAAAAAAA
                            plus S[7..0], with S[31] selecting the starting word.

    %1111 = STR32_RGB24   - 24-bit 8:8:8 pixels are streamed from AUX starting at %AAAAAAAA
                            plus S[7..0].


For outputting SYNC signals, the LIT_RGBS32 mode must be used. Because WAITVID's D can be an
immediate value up to 3583, and because S values that generate sync all fit within 9 bits, any
fixed sync pattern can be coded directly with a few 'WAITVID #D,#S' instructions.


    DAC channel outputs (9 bits each, MOD 512) according to S input using LIT_RGBS32 mode
    --------------------------------------------------------------------------------------------------------
    Y_R     %RRRRRRRR_GGGGGGGG_BBBBBBBB_xxxxxxxx = YO*2 + Y                 component/vga pixel  VCFG[3] = 0
            %RRRRRRRR_GGGGGGGG_BBBBBBBB_SSSSSSSS = YO*2 + Y + SSSSSSSS*2    component sync       VCFG[3] = 1

    I_G     %RRRRRRRR_GGGGGGGG_BBBBBBBB_xxxxxxxx = IO*2 + I                 component/vga pixel  VCFG[2] = 0
            %RRRRRRRR_GGGGGGGG_BBBBBBBB_SSSSSSSS = IO*2 + I + SSSSSSSS*2    component sync       VCFG[2] = 1

    Q_B     %RRRRRRRR_GGGGGGGG_BBBBBBBB_xxxxxxxx = QO*2 + Q                 component/vga pixel  VCFG[1] = 0
            %RRRRRRRR_GGGGGGGG_BBBBBBBB_SSSSSSSS = QO*2 + Q + SSSSSSSS*2    component sync       VCFG[1] = 1

    SYN     %xxxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxx0 = VCFG[0]*511              vga sync unasserted
            %xxxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxx1 = !VCFG[0]*511             vga sync asserted

    Y__     %RRRRRRRR_GGGGGGGG_BBBBBBBB_xxxxxx00 = YO*2 + Y                 s-video luma pixel
            %xxxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxx01 = IO*2                     s-video luma sync high
            %xxxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxx1x = 0                        s-video luma sync low

    _IQ     %xxxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx = QO*2 + Q'                s-video chroma

    YIQ     %RRRRRRRR_GGGGGGGG_BBBBBBBB_xxxxxx00 = YO*2 + Y + Q'            composite pixel
            %xxxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxx01 = IO*2 + Q'                composite sync high
            %xxxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxx1x = Q'                       composite sync low


The following example programs display luma-graduated color bars in various output modes:

    simple_VGA_1280x1024.spin
    simple_VGA_800x600.spin
    simple_VGA_640x480.spin
    simple_HDTV_1920x1080p.spin
    simple_HDTV_1280x720p.spin
    simple_NTSC_256x192.spin



TEXTURE MAPPER
--------------

Each cog has a texture mapper (PIX) which can navigate a rectangular 2D texture with Z-perspective
correction to locate a texture pixel, translate that texture pixel into A:R:G:B (Alpha:Red:Green:Blue)
pixel data, perform discrete scaling on those A:R:G:B components, and then mix the resulting pixel
with another pixel for multi-layered 3D effects.

A texture is stored in register RAM as a sequence of 1/2/4/8-bit texture pixels which build from
the bottom bits of an initial register, upwards, and then into subsequent registers. They are
stored contiguously in row order, from the top row (left to right) down to the bottom row.
These texture pixels get used as offsets into AUX to look up A:R:G:B pixel data which may be either
8:8:8:8 bits (long) or 1:5:5:5 bits (word). Texture width and height are individually settable to
1/2/4/8/16/32/64/128 pixel(s).


To configure PIX, the SETPIX instruction is used:

    SETPIX  D/#  - Set PIX configuration to %WWW_HHH_PP_S_H_V_xxxx_AAAAAAAA_RRRRRRRRR

          %WWW = texture map width, %HHH = texture map height

                 %000 =   1 pixel
                 %001 =   2 pixels
                 %010 =   4 pixels
                 %011 =   8 pixels
                 %100 =  16 pixels
                 %101 =  32 pixels
                 %110 =  64 pixels
                 %111 = 128 pixels

           %PP = texture pixel size

                 %00 = 1 bit
                 %01 = 2 bits
                 %10 = 4 bits
                 %11 = 8 bits

            %S = AUX pixel data size

                 %0 = long, 8:8:8:8 bit A:R:G:B data
                 %1 = word, 1:5:5:5 bit A:R:G:B data (gets expanded to 8:8:8:8)

            %H = horizontal mirroring

                 %0 = OFF, image repeats when U'[15] is 1
                 %1 = ON,  image mirrors when U'[15] is 1

            %V = vertical mirroring

                 %0 = OFF, image repeats when V'[15] is 1
                 %1 = ON,  image mirrors when V'[15] is 1

     %AAAAAAAA = base address in AUX of A:R:G:B pixel data

    %RRRRRRRRR = base address in register RAM of texture pixels
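
The SETPIX fields can be assembled as in the Python sketch below, which takes sizes in pixels
and bits rather than field codes. pack_setpix and the two lookup tables are names invented for
this illustration.

```python
SIZE_CODES = {1 << n: n for n in range(8)}     # 1..128 pixels -> %000..%111
DEPTH_CODES = {1: 0, 2: 1, 4: 2, 8: 3}         # texture pixel bits -> %00..%11


def pack_setpix(width, height, texel_bits, word_data,
                mirror_h, mirror_v, aux_base, reg_base):
    """Build the SETPIX operand %WWW_HHH_PP_S_H_V_xxxx_AAAAAAAA_RRRRRRRRR."""
    if not 0 <= aux_base <= 0xFF:
        raise ValueError("AUX base address is 8 bits")
    if not 0 <= reg_base <= 0x1FF:
        raise ValueError("register base address is 9 bits")
    return (SIZE_CODES[width] << 29 | SIZE_CODES[height] << 26
            | DEPTH_CODES[texel_bits] << 24
            | int(word_data) << 23 | int(mirror_h) << 22 | int(mirror_v) << 21
            | aux_base << 9 | reg_base)


# 32x8 texture of 4-bit pixels at register $100, long A:R:G:B data at AUX 0
cfg = pack_setpix(32, 8, 4, False, False, False, 0x00, 0x100)
```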


Aside from SETPIX, which configures PIX's base metrics, there are seven other instructions
which establish initial values and deltas for the Z perspective, U/V texture coordinates,
and A/R/G/B scalers. These instructions are likely to be used before every sequence of GETPIX
instructions. They each set the value of their respective 16-bit parameter to the high word
of the operand, while the low word sets the 16-bit delta which gets added to the parameter
upon every GETPIX instruction:

    SETPIXZ D/#       - Set {Z,DZ} to D/#
    SETPIXU D/#       - Set {U,DU} to D/#
    SETPIXV D/#       - Set {V,DV} to D/#
    SETPIXA D/#       - Set {A,DA} to D/#
    SETPIXR D/#       - Set {R,DR} to D/#
    SETPIXG D/#       - Set {G,DG} to D/#
    SETPIXB D/#       - Set {B,DB} to D/#


These instructions can be used to establish two settings at a time:

    SETPIX0 D/#,S/#   - Set config to D/# and {Z,DZ} to S/#
    SETPIX1 D/#,S/#   - Set {U,DU} to D/# and {V,DV} to S/#
    SETPIX2 D/#,S/#   - Set {A,DA} to D/# and {R,DR} to S/#
    SETPIX3 D/#,S/#   - Set {G,DG} to D/# and {B,DB} to S/#


Once PIX is configured and initial parameters are set, the GETPIX instruction may be used to
look up the current texture pixel, scale its A/R/G/B components, mix it with a pixel in D,
and update the U/V/Z/A/R/G/B parameters with their deltas. GETPIX only works in single-task
programs, as it requires 3 clocks in pipeline stages 2 and 3:

        WAIT    #3              'ready pipeline, GETPIX needs 3 clocks in pipeline stage 2
        WAIT    #3              'ready pipeline, GETPIX needs 3 clocks in pipeline stage 3
        GETPIX  pixel           'execute GETPIX, GETPIX takes 3 clocks in pipeline stage 4


To make GETPIX more efficient, it can be repeated using REPD to perform a sequence of pixel
operations, taking only 3 clocks per pixel:

        REPD    #64,#1          'render 64 texture pixels and blend them with 'pixels'
        SETINDA #pixels         'point INDA to pixels
        WAIT    #3              'ready pipeline, 3 clocks in initial pipeline stage 2
        WAIT    #3              'ready pipeline, 3 clocks in initial pipeline stage 3
        GETPIX  INDA++          'execute GETPIX, 3 clocks per repeating GETPIX


As GETPIX executes, the following sequence occurs over three pipeline stages:


    In pipeline stage 2:

        Z-perspective correction *
        ------------------------
        Z' = 256 - Z[31..24]
        U' = (U[31..16] / Z') MOD 256
        V' = (V[31..16] / Z') MOD 256

        A texture pixel is read from register RAM at texture location (U',V'), with the
        U' and V' top-most bits being used as coordinates. For example, if the texture
        size is 32x8, then the top 5 bits of U' and the top 3 bits of V' would be used
        to locate the texture pixel.

        parameter updating
        ------------------
        Z = Z + DZ
        U = U + DU
        V = V + DV


    In pipeline stage 3:

        The texture pixel is used as an offset to look up A:R:G:B pixel data in AUX.
        If the AUX data is a word (1:5:5:5 bit A:R:G:B), the fields get expanded so
        that %A_BCDEF_GHIJK_LMNOP becomes %AAAAAAAA_BCDEFBCD_GHIJKGHI_LMNOPLMN. If the
        AUX data is a long (8:8:8:8 bit A:R:G:B), it is used directly. These expanded
        or direct 8:8:8:8 bit fields become TA:TR:TG:TB.


    In pipeline stage 4:

        pixel scaling
        -------------
        A' = (TA * A[31..24]  +  255) / 256
        R' = (TR * R[31..24]  +  255) / 256
        G' = (TG * G[31..24]  +  255) / 256
        B' = (TB * B[31..24]  +  255) / 256

        parameter updating **
        ------------------
        A = A + DA
        R = R + DR
        G = G + DG
        B = B + DB

        pixel mixing
        ------------
        A':R':G':B' is mixed with the pixel in D according to the MIX configuration
        <see the PIXEL MIXER description>

        If WC is used with GETPIX, C will return 1 if A' is not 0.


*  Note that if Z[31..24] = 0, no scaling occurs, or (U',V') = (U[31..24],V[31..24]).
   The bigger Z[31..24] gets, the more compressed the texture rendering becomes, until
   when Z[31..24] = 255, (U',V') = (U[23..16],V[23..16]).

** A/R/G/B are actually updated in pipeline stage 2, but their original values are
   propagated to pipeline stage 4.
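
The arithmetic in the three pipeline stages can be followed with a Python model. This is a
sketch of the formulas as written above, not a cycle-accurate description: z, u, and v are
the 16-bit parameter values (the high words of {Z,DZ}, {U,DU}, and {V,DV}), integer division
is assumed to truncate, and all function names are made up for the example.

```python
def perspective(z, u, v):
    """Stage 2: Z-perspective correction of the 16-bit U and V parameters."""
    z_prime = 256 - (z >> 8)                   # Z[31..24] is the top byte of Z
    return (u // z_prime) % 256, (v // z_prime) % 256


def texel_xy(u_prime, v_prime, width, height):
    """Use the top-most bits of U' and V' as texture coordinates."""
    u_bits = width.bit_length() - 1            # e.g. 32 pixels -> 5 bits
    v_bits = height.bit_length() - 1
    return u_prime >> (8 - u_bits), v_prime >> (8 - v_bits)


def expand_1555(word):
    """Stage 3: expand %A_BCDEF_GHIJK_LMNOP to 8:8:8:8 A:R:G:B."""
    def five_to_eight(c):
        return (c << 3) | (c >> 2)             # BCDEF -> BCDEFBCD
    a = 0xFF if word & 0x8000 else 0x00
    return (a, five_to_eight((word >> 10) & 0x1F),
            five_to_eight((word >> 5) & 0x1F), five_to_eight(word & 0x1F))


def scale(t, s):
    """Stage 4: scale a texture component by the top byte of its scaler."""
    return (t * s + 255) // 256
```

With Z[31..24] = 0, perspective returns the top bytes of U and V; with Z[31..24] = 255, it
returns their bottom bytes, matching the first footnote.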


The following program provides a simplistic example of how PIX is used:

    texture_NTSC_256x192.spin



PIXEL MIXER
-----------

Each cog has a pixel mixer called MIX that can combine two pixels in a sum-of-products
operation, where:

  inputs:

    DA = D pixel A component (8 bits)
    DR = D pixel R component (8 bits)
    DG = D pixel G component (8 bits)
    DB = D pixel B component (8 bits)

    SA = S pixel A component or GETPIX A' component (8 bits)
    SR = S pixel R component or GETPIX R' component (8 bits)
    SG = S pixel G component or GETPIX G' component (8 bits)
    SB = S pixel B component or GETPIX B' component (8 bits)

  outputs:

    A' = ((DA * DAX  +  SA * SAX  +  255) / 256) max 255
    R' = ((DR * DRX  +  SR * SRX  +  255) / 256) max 255
    G' = ((DG * DGX  +  SG * SGX  +  255) / 256) max 255
    B' = ((DB * DBX  +  SB * SBX  +  255) / 256) max 255


The DAX/DRX/DGX/DBX/SAX/SRX/SGX/SBX terms determine the type of mixing that will be done.
The terms are configurable for the MIXPIX/GETPIX instructions, but fixed for the others:

    ADDPIX  D,S/#    - Add and clamp A:R:G:B components into D

                       DAX = $FF   SAX = $FF
                       DRX = $FF   SRX = $FF
                       DGX = $FF   SGX = $FF
                       DBX = $FF   SBX = $FF


    MULPIX  D,S/#    - Multiply A:R:G:B components into D

                       DAX = SA    SAX = $00
                       DRX = SR    SRX = $00
                       DGX = SG    SGX = $00
                       DBX = SB    SBX = $00


    BLNPIX  D,S/#    - Blend A:R:G:B components by SA into D

                       DAX = !SA   SAX = SA
                       DRX = !SA   SRX = SA
                       DGX = !SA   SGX = SA
                       DBX = !SA   SBX = SA


Here is the general-purpose MIXPIX instruction:

    MIXPIX  D,S/#    - Mix A:R:G:B components according to SETMIX into D


To configure for MIXPIX/GETPIX usage, the SETMIX instruction is used:

    SETMIX  D/#,S/#  - Set MIX configuration to D/#[8..0], S/#[31..0]

                       D/#[8..0]   sets M       - initialized to $001 *

                       S/#[31..24] sets DAB     - initialized to $00
                       S/#[23..16] sets DCB     - initialized to $00
                       S/#[15..8]  sets SAB     - initialized to $FF *
                       S/#[7..0]   sets SCB     - initialized to $00


        M[8] = 0 for long mode, where D and S pixels are 8:8:8:8 bit A:R:G:B

        M[8] = 1 for word mode, where D and S pixels are 1:5:5:5 bit A:R:G:B

               1:5:5:5 pixels are expanded so that %A_BCDEF_GHIJK_LMNOP becomes
               %AAAAAAAA_BCDEFBCD_GHIJKGHI_LMNOPLMN for the mixing computation.
               When being packed back down to 1:5:5:5 bit A:R:G:B, the single A
               bit will be 1 if the resultant A was not 0, and the R:G:B fields
               will be set to the top 5 bits of the resultant R:G:B.

               In word mode, the low word in D will be operated on and the words
               will be swapped, leaving the mixed pixel in the new high word and
               the old high word in the new low word. Also, pixel data from S
               will be taken alternately from the low and high word with each
               operation, with SETMIX resetting the selector to the low word.

               Word mode affects all ADDPIX/MULPIX/BLNPIX/MIXPIX/GETPIX.
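
The 1:5:5:5 expansion and repacking described above can be sketched in Python. The
function names are made up for illustration, and the word swapping and alternating
S-word selection are not modeled.

```python
def expand_1555(w):
    """%A_BCDEF_GHIJK_LMNOP -> (AAAAAAAA, BCDEFBCD, GHIJKGHI, LMNOPLMN)."""
    a = (w >> 15) & 1
    r = (w >> 10) & 0x1F
    g = (w >> 5) & 0x1F
    b = w & 0x1F

    def widen(c):
        # 5 bits become 8 by repeating the top 3 bits in the low positions
        return (c << 3) | (c >> 2)

    return (0xFF if a else 0x00, widen(r), widen(g), widen(b))


def pack_1555(a, r, g, b):
    """A bit is 1 if A was not 0; R:G:B keep their top 5 bits."""
    return ((1 if a else 0) << 15) | ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3)
```

Expanding and then repacking returns the original word for every 16-bit value, so an
unmodified pixel survives a round trip.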


        M field          000   001   010   011   100   101   110   111
        --------------------------------------------------------------
        M[7]      DAX =  DAB    SA
        M[6..4]   DRX =  $00   $FF    SA   !SA    DA   !DA   DCB    SR
        M[6..4]   DGX =  $00   $FF    SA   !SA    DA   !DA   DCB    SG
        M[6..4]   DBX =  $00   $FF    SA   !SA    DA   !DA   DCB    SB
        --------------------------------------------------------------
        M[3]      SAX =  SAB    DA
        M[2..0]   SRX =  $00   $FF    SA   !SA    DA   !DA   SCB    DR
        M[2..0]   SGX =  $00   $FF    SA   !SA    DA   !DA   SCB    DG
        M[2..0]   SBX =  $00   $FF    SA   !SA    DA   !DA   SCB    DB


      * M and SAB are initialized on cog start so that GETPIX will return the
        scaled A:R:G:B texture pixel without any blending.
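
The term selection in the M field table can be modeled in Python as shown below. This is
a sketch under stated assumptions, not a description of the hardware: pixels are
(A, R, G, B) tuples of 8-bit values, the function names are made up, !x is taken as the
8-bit bitwise inverse, and word mode (M[8]) is ignored.

```python
def mix(d, dx, s, sx):
    """One MIX output: (d*dx + s*sx + 255) / 256, limited to 255."""
    return min((d * dx + s * sx + 255) >> 8, 255)


def mixpix(d, s, m=0x001, dab=0x00, dcb=0x00, sab=0xFF, scb=0x00):
    """Mix D and S pixels using SETMIX-style settings (defaults = cog start)."""
    da, sa = d[0], s[0]
    dax = sa if (m >> 7) & 1 else dab        # M[7] selects DAX
    sax = da if (m >> 3) & 1 else sab        # M[3] selects SAX
    out = [mix(da, dax, sa, sax)]
    for dc, sc in zip(d[1:], s[1:]):         # R, G, B share M[6..4] and M[2..0]
        d_terms = (0x00, 0xFF, sa, sa ^ 0xFF, da, da ^ 0xFF, dcb, sc)
        s_terms = (0x00, 0xFF, sa, sa ^ 0xFF, da, da ^ 0xFF, scb, dc)
        out.append(mix(dc, d_terms[(m >> 4) & 7], sc, s_terms[m & 7]))
    return tuple(out)
```

With the cog-start defaults the model returns the S pixel unchanged. Setting M = $011 with
DAB = $FF reproduces the ADDPIX terms, and M = $0F0 with SAB = $00 reproduces the MULPIX
terms.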


The ADDPIX/MULPIX/BLNPIX/MIXPIX instructions all take 2 clocks, while GETPIX
takes 3 clocks.



PIN TRANSFER
------------

Each cog has a pin-transfer circuit (XFR) which can automatically move data between pins and
WIDEs/AUX, in the background, while instructions execute normally.

XFR is configured with the SETXFR instruction:

    SETXFR  D/#     - Set XFR configuration to %E_MMM_PPP

          %E = enable

                %0 = off (initial state after cog start)
                %1 = on

          %MMM = mode

                %000 = WIDEs_to_16_pins
                %001 = WIDEs_to_32_pins
                %010 = AUX_to_16_pins
                %011 = AUX_to_32_pins
                %100 = 16_pins_to_WIDEs
                %101 = 32_pins_to_WIDEs
                %110 = 16_pins_to_AUX
                %111 = 32_pins_to_AUX

          %PPP = pin group

                %000 = pins 15..0    for 16-pin modes,   pins 31..0   for 32-pin modes
                %001 = pins 31..16   for 16-pin modes,   pins 31..0   for 32-pin modes
                %010 = pins 47..32   for 16-pin modes,   pins 63..32  for 32-pin modes
                %011 = pins 63..48   for 16-pin modes,   pins 63..32  for 32-pin modes
                %100 = pins 79..64   for 16-pin modes,   pins 95..64  for 32-pin modes
                %101 = pins 95..80   for 16-pin modes,   pins 95..64  for 32-pin modes
                %110 = pins 111..96  for 16-pin modes,   pins 127..96 for 32-pin modes
                %111 = pins 127..112 for 16-pin modes,   pins 127..96 for 32-pin modes
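
The %E_MMM_PPP layout and the pin-group table can be sketched in Python. The helper names
are made up for illustration. The model relies on the observation that every 32-pin mode
in the table has an odd %MMM value.

```python
def setxfr_value(enable, mode, group):
    """Build the %E_MMM_PPP configuration value for SETXFR."""
    return ((enable & 1) << 6) | ((mode & 7) << 3) | (group & 7)


def xfr_pins(mode, group):
    """Return (highest_pin, lowest_pin) for a %MMM mode and %PPP pin group."""
    if mode & 1:                       # odd modes transfer 32 pins
        low = (group >> 1) * 32        # %PPP bit 0 is ignored
        return (low + 31, low)
    low = group * 16                   # even modes transfer 16 pins
    return (low + 15, low)
```

For example, setxfr_value(1, 0b011, 0b010) gives %1_011_010, which enables AUX_to_32_pins
on pins 63..32.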


For WIDEs_to_16_pins mode (%000), on the cycle after SETXFR is executed, the following
16-clock pattern begins and then repeats indefinitely:

    1st clock: WIDE0 low word is output to pins
    2nd clock: WIDE0 high word is output to pins
    3rd clock: WIDE1 low word is output to pins
    4th clock: WIDE1 high word is output to pins
    5th clock: WIDE2 low word is output to pins
    6th clock: WIDE2 high word is output to pins
    7th clock: WIDE3 low word is output to pins
    8th clock: WIDE3 high word is output to pins
    9th clock: WIDE4 low word is output to pins
   10th clock: WIDE4 high word is output to pins
   11th clock: WIDE5 low word is output to pins
   12th clock: WIDE5 high word is output to pins
   13th clock: WIDE6 low word is output to pins
   14th clock: WIDE6 high word is output to pins
   15th clock: WIDE7 low word is output to pins
   16th clock: WIDE7 high word is output to pins


For WIDEs_to_32_pins mode (%001), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

    1st clock: WIDE0 is output to pins
    2nd clock: WIDE1 is output to pins
    3rd clock: WIDE2 is output to pins
    4th clock: WIDE3 is output to pins
    5th clock: WIDE4 is output to pins
    6th clock: WIDE5 is output to pins
    7th clock: WIDE6 is output to pins
    8th clock: WIDE7 is output to pins


For AUX_to_16_pins mode (%010), on the second cycle after SETXFR is executed, the
following 2-clock pattern begins and then repeats indefinitely:

    1st clock: AUX[SPB] low word is output to pins
    2nd clock: AUX[SPB++] high word is output to pins


For AUX_to_32_pins mode (%011), on the second cycle after SETXFR is executed, the
following 1-clock pattern begins and then repeats indefinitely:

    1st clock: AUX[SPB++] is output to pins


For 16_pins_to_WIDEs mode (%100), on the cycle after SETXFR is executed, the following
16-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to WIDE0
    3rd clock: pins are sampled into low word
    4th clock: pins are sampled into high word, long is written to WIDE1
    5th clock: pins are sampled into low word
    6th clock: pins are sampled into high word, long is written to WIDE2
    7th clock: pins are sampled into low word
    8th clock: pins are sampled into high word, long is written to WIDE3
    9th clock: pins are sampled into low word
   10th clock: pins are sampled into high word, long is written to WIDE4
   11th clock: pins are sampled into low word
   12th clock: pins are sampled into high word, long is written to WIDE5
   13th clock: pins are sampled into low word
   14th clock: pins are sampled into high word, long is written to WIDE6
   15th clock: pins are sampled into low word
   16th clock: pins are sampled into high word, long is written to WIDE7


For 32_pins_to_WIDEs mode (%101), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to WIDE0
    2nd clock: pins are sampled and written to WIDE1
    3rd clock: pins are sampled and written to WIDE2
    4th clock: pins are sampled and written to WIDE3
    5th clock: pins are sampled and written to WIDE4
    6th clock: pins are sampled and written to WIDE5
    7th clock: pins are sampled and written to WIDE6
    8th clock: pins are sampled and written to WIDE7


For 16_pins_to_AUX mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to AUX[SPB++]


For 32_pins_to_AUX mode (%111), on the cycle after SETXFR is executed, the following
1-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to AUX[SPB++]


While an AUX_to_pins or pins_to_AUX mode is active, you should not read or write AUX or
modify SPB, as such attempts will likely interfere with XFR operation and cause unexpected
results. VID, however, has an asynchronous second port to AUX, so it can, for example,
stream pixels out at the same time XFR streams them in.

To stop XFR, execute 'SETXFR #0' on the last cycle of desired XFR operation.

An example of XFR usage is in the following program:

    balls.spin



BIG MULTIPLIER
--------------

Aside from the 1-clock MACA/MACB instructions and the 2-clock MUL/SCL instructions which
perform 20x20-bit signed multiplications, each cog has a separate, larger multiplier that
can do 32x32-bit signed or unsigned multiplication while other instructions execute.

To start a 32x32-bit multiply, execute one of the following:

    MUL32   D/#,S/#     - Begin 32x32-bit signed multiply of D/# and S/#
    MUL32U  D/#,S/#     - Begin 32x32-bit unsigned multiply of D/# and S/#

You'll have 17 clock cycles to execute other code, if you wish, before GETMULL/GETMULH
will return the low/high long(s) of the result:

    GETMULL D           - Get low long of result
    GETMULH D           - Get high long of result

In single-task mode, GETMULL/GETMULH will stall the pipeline until the result is ready.
In multi-task mode, GETMULL/GETMULH will jump to themselves until the result is ready,
freeing clocks for other tasks.
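
The relationship between the operands and the two result longs can be modeled in Python.
This is a sketch that assumes the result is a 64-bit two's-complement product split into
a low and a high long. The function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit value as a two's-complement signed number."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul32(d, s, signed=True):
    """Return (low_long, high_long), as GETMULL and GETMULH would read them."""
    if signed:
        product = to_signed32(d) * to_signed32(s)     # MUL32
    else:
        product = (d & MASK32) * (s & MASK32)         # MUL32U
    product &= (1 << 64) - 1
    return product & MASK32, product >> 32
```

The same bit patterns give different high longs depending on signedness:
mul32($FFFFFFFF, 2) is -2 when signed, but $1_FFFFFFFE when unsigned.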


BIG DIVIDER
-----------

Each cog has a 64-over-32-bit divider which can perform signed and unsigned divides, as
well as calculate 32-bit fractions, while other instructions execute. For signed divides,
the remainder result will have the sign of the numerator. Both the quotient and the
remainder results are 32 bits.

To start a 32/32-bit divide, execute one of the following:

    DIV32   D/#,S/#     - Begin 32/32-bit signed divide of D/# over S/#
    DIV32U  D/#,S/#     - Begin 32/32-bit unsigned divide of D/# over S/#


To start a 64/32-bit divide, first set the denominator:

    DIV64D  D/#         - Set the 32-bit denominator to D/#

Then execute one of the following:

    DIV64   D/#,S/#     - Set the 64-bit numerator to {S/#,D/#} and begin signed divide
    DIV64U  D/#,S/#     - Set the 64-bit numerator to {S/#,D/#} and begin unsigned divide


To start a 32-bit fraction calculation, use FRAC:

    FRAC    D/#,S/#     - Begin calculating the unsigned fraction of D/# over S/#, where
                          D/# and S/# are unsigned 32-bit values and D/# is less than S/#.
                          Use GETDIVQ to get the result.

                          Examples:

                              FRAC #1,#2 yields $80000000 (1/2 of $1_00000000)
                              FRAC #1,#3 yields $55555555 (1/3 of $1_00000000)
                              FRAC #1,#4 yields $40000000 (1/4 of $1_00000000)
                              FRAC #15,#16 yields $F0000000
                              FRAC $80000000,$90000000 yields $E38E38E3
                              FRAC 31_250,80_000_000 yields $00199999
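
The FRAC examples above are consistent with computing (D * $1_00000000) / S and discarding
the fractional part. A Python sketch of that reading, with a made-up function name:

```python
def frac(d, s):
    """Model of FRAC: unsigned 32-bit fraction of d over s, truncated."""
    if not 0 <= d < s <= 0xFFFFFFFF:
        raise ValueError("FRAC expects unsigned 32-bit values with d < s")
    return (d << 32) // s
```

This reproduces each example listed above. The last one, frac(31_250, 80_000_000), is the
kind of value needed to derive a 31.25 kHz rate from an 80 MHz clock.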


After starting the divider, you'll have 17 clock cycles to execute other code, if you
wish, before GETDIVQ/GETDIVR will return the quotient/remainder long(s) of the result:

    GETDIVQ D           - Get quotient result
    GETDIVR D           - Get remainder result

In single-task mode, GETDIVQ/GETDIVR will stall the pipeline until the result is ready.
In multi-task mode, GETDIVQ/GETDIVR will jump to themselves until the result is ready,
freeing clocks for other tasks.
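
The divide results can be modeled in Python as follows. This is a sketch under stated
assumptions: a remainder that takes the sign of the numerator implies a quotient that
truncates toward zero, which is what the model does. Divide-by-zero and quotient overflow
are not modeled, and the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def divide_signed(numerator, denominator):
    """Truncating signed divide; the remainder takes the numerator's sign."""
    quotient = abs(numerator) // abs(denominator)
    if (numerator < 0) != (denominator < 0):
        quotient = -quotient
    remainder = numerator - quotient * denominator
    return quotient & MASK32, remainder & MASK32      # (GETDIVQ, GETDIVR)


def div64u(d, s, denominator):
    """DIV64U model: the 64-bit numerator is {S,D}, with S as the high long."""
    numerator = ((s & MASK32) << 32) | (d & MASK32)
    return (numerator // denominator) & MASK32, numerator % denominator
```

For example, -7 over 2 gives a quotient of -3 ($FFFFFFFD) and a remainder of -1
($FFFFFFFF), while 7 over -2 gives a quotient of -3 and a remainder of +1.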



SQUARE ROOTER
-------------

Each cog has a 32/64-bit square root calculator which can compute square roots from
unsigned values, while other instructions execute.

To start a square root computation, execute one of the following:

    SQRT32  D/#         - Begin computing square root of 32-bit unsigned D/#
    SQRT64  D/#,S/#     - Begin computing square root of 64-bit unsigned {S/#,D/#}

In the case of SQRT32, you'll have 16 clock cycles to execute other code, if you wish,
or 32 clock cycles in the case of SQRT64, before GETSQRT will return the result:

    GETSQRT D           - Get root result

In single-task mode, GETSQRT will stall the pipeline until the result is ready. In
multi-task mode, GETSQRT will jump to itself until the result is ready, freeing clocks
for other tasks.



CORDIC ENGINE
-------------

Each cog has a CORDIC engine which can perform trigonometric, logarithmic, exponential,
and hyperbolic functions while other instructions execute.

Here are the instructions associated with the CORDIC engine:

    QLOG    D/#         - Compute logarithm of D/#
                          (unsigned number -> log-base-2)

    QEXP    D/#         - Compute exponential of D/#
                          (log-base-2 -> unsigned number)

    QSINCOS D/#,S/#     - Compute sine and cosine of D/# with amplitude S/#
                          (polar -> cartesian)

    QARCTAN D/#,S/#     - Compute distance and angle of (D/#,S/#) to (0,0)
                          (cartesian -> polar)

    SETQZ   D/#         - Set CORDIC Z to D/#, used to set angle before QROTATE
    QROTATE D/#,S/#     - Rotate (D/#,S/#) around (0,0) by an angle

    GETQX   D           - Get CORDIC X result
    GETQY   D           - Get CORDIC Y result
    GETQZ   D           - Get CORDIC Z result

    SETQI   D/#         - Set CORDIC trigonometric/hyperbolic and iteration modes

In single-task mode, GETQX/GETQY/GETQZ will stall the pipeline until the result is ready.
In multi-task mode, GETQX/GETQY/GETQZ will jump to themselves until the result is ready,
freeing clocks for other tasks.


QLOG/QEXP usage:

To convert between 32-bit unsigned numbers and 32-bit log values, use QLOG or QEXP to set
the input term and begin the computation. Then do GETQZ to get the result. Log values are
encoded with the whole exponent in the top 5 bits and the fractional exponent in the
bottom 27 bits. Here are some examples of numbers converted to log values, then back to
numbers again using QLOG and QEXP:

    number ->   QLOG ->     QEXP
    ---------------------------------
    $00000000   $00000000   $00000001   (0 same as 1)
    $00000001   $00000000   $00000001
    $00000002   $08000000   $00000002
    $00000003   $0CAE00D2   $00000003
    $00000004   $10000000   $00000004
    $00000005   $12934F09   $00000005
    $07ADCBD8   $D786F595   $07ADCBD9   (first lossy bidirectional conversion, +1)
    $20000000   $E8000000   $20000000
    $40000000   $F0000000   $40000000
    $80000000   $F8000000   $80000000
    $FFFFFFFF   $FFFFFFFF   $FFFFFFE9   (last lossy bidirectional conversion, -22)
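
The log encoding (5-bit whole exponent, 27-bit fraction) amounts to log2(number) * 2^27.
The following Python sketch models that with floating-point math. It is not a model of
the CORDIC iterations, so results may differ from the hardware in the lowest bits, and the
rounding and clamping choices are assumptions. The function names are made up.

```python
import math


def qlog(x):
    """Approximate QLOG: unsigned 32-bit number -> 5.27 fixed-point log2."""
    if x == 0:
        x = 1                          # the table above treats 0 the same as 1
    return min(round(math.log2(x) * (1 << 27)), 0xFFFFFFFF)


def qexp(v):
    """Approximate QEXP: 5.27 fixed-point log2 -> unsigned 32-bit number."""
    return min(round(2.0 ** (v / (1 << 27))), 0xFFFFFFFF)
```

Powers of two convert exactly in both directions, for example qlog($00000004) is
$10000000 and qexp($10000000) is $00000004.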


QSINCOS/QARCTAN/QROTATE usage:

For the circular functions, angles are 32 bits wide and roll over at 360 degrees:

    $00000000 = 0 degrees                (360 * $00000000 / $1_00000000)
    $00000001 = ~0.000000083819 degrees  (360 * $00000001 / $1_00000000)
    $00B60B61 = ~1 degree                (360 * $00B60B61 / $1_00000000)
    $20000000 = 45 degrees               (360 * $20000000 / $1_00000000)
    $40000000 = 90 degrees               (360 * $40000000 / $1_00000000)
    $80000000 = 180 degrees              (360 * $80000000 / $1_00000000)
    $C0000000 = 270 degrees              (360 * $C0000000 / $1_00000000)
    $FFFFFFFF = ~359.9999999162 degrees  (360 * $FFFFFFFF / $1_00000000)
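
Converting between degrees and the 32-bit angle format follows directly from the formula
in the table. A Python sketch, with made-up function names and round-to-nearest assumed
when converting from degrees:

```python
def degrees_to_angle(deg):
    """Degrees -> 32-bit angle; the mask makes 360 degrees roll over to 0."""
    return round(deg / 360 * (1 << 32)) & 0xFFFFFFFF


def angle_to_degrees(angle):
    """32-bit angle -> degrees in the range 0 <= degrees < 360."""
    return 360 * (angle & 0xFFFFFFFF) / (1 << 32)
```

Negative angles wrap as expected under this model: degrees_to_angle(-90) gives $C0000000,
the same value as 270 degrees.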


The X and Y inputs to the circular functions are signed 30-bit values, ranging from
-$2000_0000..+$1FFF_FFFF, conveyed by D and S (top two bits are ignored). No matter the
sizes of X and Y, the pair is internally MSB-justified to achieve maximal precision during
the CORDIC iterations, after which they are shifted back down and rounded to form the X
and Y results.

The circular functions will return X and Y results that are scaled by constant K, which is
~1.64676025812 for trigonometric mode or ~0.82815936096 for hyperbolic mode. This CORDIC
scaling can be compensated for, if necessary, by pre- or post-scaling X and/or Y by 1/K.
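
The two K constants are the usual CORDIC gains, and they can be reproduced numerically.
The Python sketch below multiplies the per-iteration gain factors, using the standard
textbook factors sqrt(1 + 4^-i) for circular rotations and sqrt(1 - 4^-i) for hyperbolic
ones. For the hyperbolic case it skips iteration 0 and repeats iterations 4 and 13, as
described under CORDIC modes. The function name and the iteration count of 32 are
assumptions made for illustration.

```python
import math


def cordic_gain(hyperbolic=False, iterations=32):
    """Product of the per-iteration CORDIC gain factors."""
    gain = 1.0
    if hyperbolic:
        for i in range(1, iterations):             # zeroth iteration skipped
            repeats = 2 if i in (4, 13) else 1     # 4th and 13th are repeated
            gain *= math.sqrt(1 - 4.0 ** -i) ** repeats
    else:
        for i in range(iterations):
            gain *= math.sqrt(1 + 4.0 ** -i)
    return gain


if __name__ == "__main__":
    print(cordic_gain())          # trigonometric K
    print(cordic_gain(True))      # hyperbolic K
```

Multiplying an X or Y result by 1 / cordic_gain() is one way to undo the scaling after the
fact.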

To compute sine and cosine simultaneously, the 'QSINCOS D/#,S/#' instruction can be used,
with the angle supplied in D/# and the amplitude in S/#. Immediate values of S/# are
special cases which produce the following amplitudes, where n is the immediate value:

    #$00..$1F produces +/- 2^(n[4..0]-1)
    #$20..$3F produces +/- 2^(n[4..0]-1) * 255/256
    #$40..$5F produces +/- 2^(n[4..0]-1) * 7/8
    #$60..$7F produces +/- 2^(n[4..0]-1) * 3/4

For example, #$09 will yield results ranging from -$100..$100 and #$29 will yield results
ranging from -$FF..$FF. Use GETQX and GETQY to retrieve the cosine and sine results.
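
The immediate-amplitude table can be expressed as a short Python function. This is a
sketch with a made-up name. It covers only the immediates #$00..$7F listed above and uses
exact fractions so that n[4..0] = 0 (an amplitude of 1/2) is represented without rounding.

```python
from fractions import Fraction

# Scale factor selected by bits 6..5 of the immediate value
SCALE = (Fraction(1), Fraction(255, 256), Fraction(7, 8), Fraction(3, 4))


def qsincos_amplitude(n):
    """Peak amplitude produced by 'QSINCOS D/#,#n' for n in $00..$7F."""
    if not 0 <= n <= 0x7F:
        raise ValueError("only immediates $00..$7F are listed in the table")
    return Fraction(2) ** ((n & 0x1F) - 1) * SCALE[n >> 5]
```

This reproduces the two examples in the text: #$09 gives $100 and #$29 gives $FF.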

To convert an (X,Y) coordinate into a distance and angle relative to (0,0), do
'QARCTAN D/#,S/#' with the X in D/# and the Y in S/#. Use GETQX to get the distance and
GETQZ to get the angle.

To rotate an (X,Y) coordinate around (0,0), first do SETQZ to set the rotation angle, then
do 'QROTATE D/#,S/#', with the X in D/# and the Y in S/#. Use GETQX and GETQY to retrieve
the rotated (X,Y) coordinate.


CORDIC modes:

The SETQI instruction is used to switch between trigonometric and hyperbolic modes, and to
select between adaptive and fixed iterations:

    SETQI   D/#     - Set CORDIC configuration to %M_IIIII (%0_00000 on cog start)

        %M = mode

            %0 = trigonometric (K = ~1.64676025812)
            %1 = hyperbolic    (K = ~0.82815936096)

        %IIIII = iterations

                    %00000 = adaptive iterations (adaptive resolution, variable time)
            %00001..%11111 = 1..31 fixed iterations (fixed resolution, constant time)


Hyperbolic mode changes the functionality of the QSINCOS/QARCTAN/QROTATE instructions so
that hyperbolics can be computed. When in hyperbolic mode, the CORDIC engine uses different
internal constants to track the angle, it skips the zeroth iteration, and the fourth and
thirteenth iterations are repeated to ensure convergence. Hence, both K and the clock-cycle
count differ between trigonometric and hyperbolic modes.

When %IIIII is %00000, the CORDIC engine selects an iteration count based on the magnitude
of the X and Y inputs to ensure an efficient computation which preserves initial precision.
For very exact QARCTAN computations, setting %IIIII to %11111 will ensure calculator-like
precision, even though (X,Y) may be small. In some cases, you may want to fix the iteration
count to ensure good-enough precision, but with budgeted or exact timing.


CORDIC timing:

Here is a table that shows how many free clocks are available for other instructions to
execute between QLOG/QEXP/QSINCOS/QARCTAN/QROTATE and GETQX/GETQY/GETQZ:

    i = %IIIII           i = 0 (adaptive)                     i = 1..31 (fixed)
    operation            clocks free                          clocks free
    ---------------------------------------------------------------------------
    QLOG    D/#          35                                   2 + i + h
    QEXP    D/#          35                                   2 + i + h

    Trigonometric mode

    QSINCOS D/#,#n       2 + n                                2 + i
    QSINCOS D/#,S        5 + mag(abs(D/#) | abs(S))           3 + i
    QARCTAN D/#,S/#      5 + mag(abs(D/#) | abs(S/#))         3 + i
    QROTATE D/#,S/#      5 + mag(abs(D/#) | abs(S/#))         3 + i

    Hyperbolic mode

    QSINCOS D/#,#n       1 + n + j                            1 + i + h
    QSINCOS D/#,S        4 + mag(abs(D/#) | abs(S)) + k       2 + i + h
    QARCTAN D/#,S/#      4 + mag(abs(D/#) | abs(S/#)) + k     2 + i + h
    QROTATE D/#,S/#      4 + mag(abs(D/#) | abs(S/#)) + k     2 + i + h
    --------------------------------------------------------------------------

    h = 0 if i is 0..3       j = 0 if n is 1..3        k = 0 if mag is 0..1
        1 if i is 4..12          1 if n is 4..12           1 if mag is 2..10
        2 if i is 13..31         2 if n is 13..31          2 if mag is 11..30
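
For fixed iteration counts (i = 1..31), the right-hand column of the table can be written
as a small Python function for budgeting clocks. This is a sketch with made-up names. It
assumes the QLOG/QEXP figure does not depend on the trigonometric/hyperbolic mode, since
the table lists it above both mode headings. The adaptive column is not modeled.

```python
def h_term(i):
    """Extra clocks h from the table: 0 for i <= 3, 1 for 4..12, 2 for 13..31."""
    return 0 if i <= 3 else 1 if i <= 12 else 2


def free_clocks_fixed(op, i, hyperbolic=False, immediate_s=False):
    """Free clocks between a CORDIC operation and GETQX/GETQY/GETQZ."""
    if not 1 <= i <= 31:
        raise ValueError("fixed iteration counts are 1..31")
    if op in ("QLOG", "QEXP"):
        return 2 + i + h_term(i)
    # 'QSINCOS D/#,#n' is one clock shorter than the register forms
    base = 2 if (op == "QSINCOS" and immediate_s) else 3
    if hyperbolic:
        return base - 1 + i + h_term(i)
    return base + i
```

For example, a trigonometric QROTATE with 31 fixed iterations leaves 34 free clocks.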



MULTIPLY AND ACCUMULATE
-----------------------

Each cog has two 64-bit accumulators, ACCA and ACCB, which accumulate products from the
MACA/MACB instructions. The accumulators can also be cleared, set to arbitrary values,
arithmetically shifted right, and read back. On cog start, ACCA and ACCB are both cleared
to $00000000_00000000.

The MACA/MACB instructions each perform a 20x20-bit signed multiply and then add the
resultant 40-bit product into ACCA or ACCB in a single clock:

    MACA    D/#,S/#         - multiply D/#[19..0] by S/#[19..0] and accumulate into ACCA
    MACB    D/#,S/#         - multiply D/#[19..0] by S/#[19..0] and accumulate into ACCB


By using MACA/MACB with indirect addressing in a REPS/REPD loop, tap-per-clock FIR filters
can be realized in a few instructions:

        FIXINDA #buff+15,#buff          'set circular sample buffer
        FIXINDB #taps+15,#taps          'set circular tap buffer

:loop   REPS    #16,#1                  'ready for 16-tap FIR
        CLRACCA                         'clear ACCA
        MACA    INDB++,INDA++           'multiply and accumulate buff and taps (16 clocks)

        GETACAL result                  'get lower long of result
        '<use result>                   'use result

        '<get sample>                   'get new sample
        MOV     --INDA,sample           'enter new sample, buff scrolls against taps

        JMP     #:loop                  'loop
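
The arithmetic performed by the FIR loop above can be modeled in Python to check expected
results offline. This is a sketch, not Propeller code: the function names are made up,
operands are sign-extended from their low 20 bits, and the accumulator is assumed to wrap
at 64 bits.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sx20(x):
    """Sign-extend the low 20 bits of x."""
    x &= 0xFFFFF
    return x - (1 << 20) if x & 0x80000 else x


def maca(acc, d, s):
    """One MACA step: add the signed 20x20-bit product to the accumulator."""
    return (acc + sx20(d) * sx20(s)) & MASK64


def fir(buff, taps):
    """Sum of products over one pass of the sample and tap buffers."""
    acc = 0                                   # CLRACCA
    for sample, tap in zip(buff, taps):
        acc = maca(acc, tap, sample)          # MACA INDB++,INDA++
    return acc
```

With 16 samples of 1 and 16 taps of 2, fir() returns 32.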


The accumulators may be cleared by the following instructions:

    CLRACCA                 - clear ACCA to $00000000_00000000
    CLRACCB                 - clear ACCB to $00000000_00000000
    CLRACCS                 - clear ACCA and ACCB to $00000000_00000000


The accumulators may be set to arbitrary values by these instructions:

    SETACCA D/#,S/#         - set ACCA to {S/#,D/#}
    SETACCB D/#,S/#         - set ACCB to {S/#,D/#}


To make post-MACA/MACB computations simpler, the SARACCA/SARACCB/SARACCS instructions can
be used to arithmetically shift the accumulators downward, in order to consolidate their
leading bits into the lower long. This shifting can be performed on ACCA and ACCB
individually, or together. The SARACCA/SARACCB/SARACCS instructions take 1 clock, but won't
execute until 2 clocks after MACA/MACB. So, if SARACCA immediately follows MACA, SARACCA
will take 3 clocks:

    SARACCA D/#             - arithmetically right-shift ACCA by D/#[5..0] (0..63)
    SARACCB D/#             - arithmetically right-shift ACCB by D/#[5..0] (0..63)
    SARACCS D/#             - arithmetically right-shift ACCA and ACCB by D/#[5..0] (0..63)


To read back the contents of the accumulators, GETACAL/GETACAH/GETACBL/GETACBH instructions
are used. These instructions take 1 clock, but won't execute until 2 clocks after MACA/MACB.
So, if GETACAL immediately follows MACA, GETACAL will take three clocks:

    GETACAL D               - get lower long of ACCA into D
    GETACAH D               - get upper long of ACCA into D
    GETACBL D               - get lower long of ACCB into D
    GETACBH D               - get upper long of ACCB into D
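
The shift and read-back operations can be modeled in Python on a 64-bit accumulator value.
This is a sketch with made-up function names, where the accumulator is held as an unsigned
64-bit integer and interpreted as two's complement for the arithmetic shift.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def saracc(acc, shift):
    """Arithmetic right shift of a 64-bit accumulator by shift[5..0]."""
    shift &= 0x3F
    signed = acc - (1 << 64) if acc >> 63 else acc
    return (signed >> shift) & MASK64


def getacal(acc):
    """Lower long of the accumulator."""
    return acc & 0xFFFFFFFF


def getacah(acc):
    """Upper long of the accumulator."""
    return (acc >> 32) & 0xFFFFFFFF
```

Shifting right by 32 moves the upper long into the lower long while preserving the sign,
so a single lower-long read then returns the scaled result.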



REGISTER REMAPPING
------------------

The SETMAP instruction is used to remap a 2^n-sized block of registers starting at $000, so
that direct accesses to those registers will be redirected to a range of identically-sized
blocks, which also build from $000. This feature allows a single program to run multiple
instances of itself by having unique sets of statically-addressable registers which switch
according to either INDB or the current task.

When using remapping, you must locate your program code above the highest block of
registers that will serve as a remapping target. For example, if you select 8 blocks of 16
registers, but are only using 6 of those blocks, your program code must not start below
register 96 (6*16), to avoid encroaching into the registers which are going to be the
recipients of remapping.

Here is the SETMAP instruction:

    SETMAP  D/#             - Configure register remapping to %M_BBB_RRR

        %M = mode

            %0 = INDB selects the block
            %1 = task number selects the block

        %BBB = block count

            %000 = 1 block          remapping disabled for %000
            %001 = 2 blocks         remapping enabled for %001..%111
            %010 = 4 blocks
            %011 = 8 blocks
            %100 = 16 blocks
            %101 = 32 blocks
            %110 = 64 blocks
            %111 = 128 blocks

        %RRR = register count

            %000 = 1 register       remap $000
            %001 = 2 registers      remap $000..$001
            %010 = 4 registers      remap $000..$003
            %011 = 8 registers      remap $000..$007
            %100 = 16 registers     remap $000..$00F
            %101 = 32 registers     remap $000..$01F
            %110 = 64 registers     remap $000..$03F
            %111 = 128 registers    remap $000..$07F


The new mapping scheme will be in effect on the third instruction after SETMAP. After that,
changes to INDB or the task number will have an immediate effect on block selection. The
remapping mechanism only works with hard-coded D and S addresses which range from $000 to
the remapped-register-count minus 1 (see %RRR above), not via INDA and INDB accesses.

Below is an elaboration of all uniquely-useful remapping schemes:


                                  S/D addresses
%M_BBB_RRR    blocks regs      initial -> remapped       block selector
-----------------------------------------------------------------------------
%x_000_xxx    1      x               <same>

%0_001_000    2      1      %000000000 -> %00000000P     P = INDB[0]
%0_001_001    2      2      %00000000X -> %0000000PX
%0_001_010    2      4      %0000000XX -> %000000PXX     (2 threads)
%0_001_011    2      8      %000000XXX -> %00000PXXX
%0_001_100    2      16     %00000XXXX -> %0000PXXXX
%0_001_101    2      32     %0000XXXXX -> %000PXXXXX
%0_001_110    2      64     %000XXXXXX -> %00PXXXXXX
%0_001_111    2      128    %00XXXXXXX -> %0PXXXXXXX

%0_010_000    4      1      %000000000 -> %0000000PP     PP = INDB[1..0]
%0_010_001    4      2      %00000000X -> %000000PPX
%0_010_010    4      4      %0000000XX -> %00000PPXX     (4 threads)
%0_010_011    4      8      %000000XXX -> %0000PPXXX
%0_010_100    4      16     %00000XXXX -> %000PPXXXX
%0_010_101    4      32     %0000XXXXX -> %00PPXXXXX
%0_010_110    4      64     %000XXXXXX -> %0PPXXXXXX
%0_010_111    4      128    %00XXXXXXX -> %PPXXXXXXX

%0_011_000    8      1      %000000000 -> %000000PPP     PPP = INDB[2..0]
%0_011_001    8      2      %00000000X -> %00000PPPX
%0_011_010    8      4      %0000000XX -> %0000PPPXX     (8 threads)
%0_011_011    8      8      %000000XXX -> %000PPPXXX
%0_011_100    8      16     %00000XXXX -> %00PPPXXXX
%0_011_101    8      32     %0000XXXXX -> %0PPPXXXXX
%0_011_110    8      64     %000XXXXXX -> %PPPXXXXXX

%0_100_000    16     1      %000000000 -> %00000PPPP     PPPP = INDB[3..0]
%0_100_001    16     2      %00000000X -> %0000PPPPX
%0_100_010    16     4      %0000000XX -> %000PPPPXX     (16 threads)
%0_100_011    16     8      %000000XXX -> %00PPPPXXX
%0_100_100    16     16     %00000XXXX -> %0PPPPXXXX
%0_100_101    16     32     %0000XXXXX -> %PPPPXXXXX

%0_101_000    32     1      %000000000 -> %0000PPPPP     PPPPP = INDB[4..0]
%0_101_001    32     2      %00000000X -> %000PPPPPX
%0_101_010    32     4      %0000000XX -> %00PPPPPXX     (32 threads)
%0_101_011    32     8      %000000XXX -> %0PPPPPXXX
%0_101_100    32     16     %00000XXXX -> %PPPPPXXXX

%0_110_000    64     1      %000000000 -> %000PPPPPP     PPPPPP = INDB[5..0]
%0_110_001    64     2      %00000000X -> %00PPPPPPX
%0_110_010    64     4      %0000000XX -> %0PPPPPPXX     (64 threads)
%0_110_011    64     8      %000000XXX -> %PPPPPPXXX

%0_111_000    128    1      %000000000 -> %00PPPPPPP     PPPPPPP = INDB[6..0]
%0_111_001    128    2      %00000000X -> %0PPPPPPPX
%0_111_010    128    4      %0000000XX -> %PPPPPPPXX     (128 threads)

%1_001_000    2      1      %000000000 -> %00000000T     T = bit 0 of the task number
%1_001_001    2      2      %00000000X -> %0000000TX
%1_001_010    2      4      %0000000XX -> %000000TXX     (2 tasks)
%1_001_011    2      8      %000000XXX -> %00000TXXX
%1_001_100    2      16     %00000XXXX -> %0000TXXXX
%1_001_101    2      32     %0000XXXXX -> %000TXXXXX
%1_001_110    2      64     %000XXXXXX -> %00TXXXXXX
%1_001_111    2      128    %00XXXXXXX -> %0TXXXXXXX

%1_010_000    4      1      %000000000 -> %0000000TT     TT = task number
%1_010_001    4      2      %00000000X -> %000000TTX
%1_010_010    4      4      %0000000XX -> %00000TTXX     (4 tasks)
%1_010_011    4      8      %000000XXX -> %0000TTXXX
%1_010_100    4      16     %00000XXXX -> %000TTXXXX
%1_010_101    4      32     %0000XXXXX -> %00TTXXXXX
%1_010_110    4      64     %000XXXXXX -> %0TTXXXXXX
%1_010_111    4      128    %00XXXXXXX -> %TTXXXXXXX
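
The address translation in the table above can be summarized by a short Python model. This
is a sketch with a made-up function name. It does not reject %BBB/%RRR combinations that
the table omits because they would exceed 9 address bits.

```python
def remap(addr, config, indb=0, task=0):
    """Return the register actually accessed for a hard-coded D or S address.

    config is the %M_BBB_RRR value given to SETMAP.
    """
    mode = (config >> 6) & 1
    block_bits = (config >> 3) & 7            # 2^BBB blocks
    reg_bits = config & 7                     # 2^RRR registers per block
    if block_bits == 0 or addr >= (1 << reg_bits):
        return addr                           # disabled, or outside the remapped range
    selector = task if mode else indb
    block = selector & ((1 << block_bits) - 1)
    return (block << reg_bits) | addr
```

For example, with SETMAP #%0_010_010 and INDB[1..0] = 3, an access to $002 lands on $00E,
while an access to $004 is left alone.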


Here is an example program which uses remapping with multi-threading:

DAT             org

period          long    2-1             '$000, thread 0   (20 longs initially execute as NOPs)
time            long    0               '$001, thread 0
pin_x           long    0               '$002, thread 0
pin_y           long    1               '$003, thread 0

                long    4-1             '$000, thread 1
                long    0               '$001, thread 1
                long    2               '$002, thread 1
                long    3               '$003, thread 1

                long    8-1             '$000, thread 2
                long    0               '$001, thread 2
                long    4               '$002, thread 2
                long    5               '$003, thread 2

                long    16-1            '$000, thread 3
                long    0               '$001, thread 3
                long    6               '$002, thread 3
                long    7               '$003, thread 3

pc              long    loop[4]         '$010..$013, all threads start at loop

                setmap  #%0_010_010     'remap 4 blocks of 4 regs by INDB[1..0]
                fixindb #pc+3,#pc       'set INDB to cycle through blocks and threads
                nop                     'allow SETMAP to take effect before 'switch'

loop            switch                  'switch to next thread
                incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'set pin_y to C (high only when period reached)
                jmp     #loop           '(4 threads executing same code with unique variables)


Here is an example program which uses remapping with multi-tasking:

DAT             org

period          long    2-1             '$000, task 0   (16 longs initially execute like NOPs)
time            long    0               '$001, task 0
pin_x           long    0               '$002, task 0
pin_y           long    1               '$003, task 0

                long    4-1             '$000, task 1
                long    0               '$001, task 1
                long    2               '$002, task 1
                long    3               '$003, task 1

                long    8-1             '$000, task 2
                long    0               '$001, task 2
                long    4               '$002, task 2
                long    5               '$003, task 2

                long    16-1            '$000, task 3
                long    0               '$001, task 3
                long    6               '$002, task 3
                long    7               '$003, task 3


                setmap  #%1_010_010     'remap 4 blocks of 4 regs by task
                settask #%%3210         'set all 4 tasks in motion
                jmptask #%1111,#loop    'herd tasks to loop


loop            incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'set pin_y to C (high only when period reached)
                jmp     #loop           '(4 tasks executing same code with unique registers)
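
To make the effect of the per-task register blocks concrete, here is a small Python model of
the multi-tasking example. It is a simplification, not a cycle-accurate simulation: each task
runs one whole pass of the loop in turn, INCMOD is assumed to reset D to 0 and set C when D
equals S (otherwise D is incremented and C is cleared), and SETPC is assumed to copy C to the
pin. The names incmod() and run() are invented for this sketch.

```python
def incmod(d, s):
    """Assumed INCMOD behavior: returns (new D, C)."""
    if d == s:
        return 0, 1
    return d + 1, 0


def run(passes):
    """Run 'passes' loop iterations for each of the 4 tasks."""
    # 16 longs laid out as in the DAT section: period, time, pin_x, pin_y per task.
    regs = []
    for task in range(4):
        regs += [(2 << task) - 1, 0, 2 * task, 2 * task + 1]
    pins = [0] * 8
    toggles = [0] * 4
    for _ in range(passes):
        for task in range(4):            # tasks take turns in this model
            base = task * 4              # remap: 4 blocks of 4 regs by task
            period = regs[base + 0]
            pin_x = regs[base + 2]
            pin_y = regs[base + 3]
            regs[base + 1], c = incmod(regs[base + 1], period)
            if c:                        # if_c notp pin_x
                pins[pin_x] ^= 1
                toggles[task] += 1
            pins[pin_y] = c              # setpc pin_y
    return toggles, pins
```

Under these assumptions, every task executes identical code, yet task 0 toggles its pin_x
every 2 passes, task 1 every 4, task 2 every 8, and task 3 every 16, because each task sees
its own period and time registers at addresses $000 and $001.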



PORT D INTER-COG EXCHANGE
-------------------------

Port A, associated with PINA/OUTA/DIRA, connects to external pins 0..31.    *** SAME
Port B, associated with PINB/OUTB/DIRB, connects to external pins 32..63.   *** SAME
Port C, associated with PINC/OUTC/DIRC, connects to external pins 64..91.   *** SAME
Port D, associated with PIND/OUTD/DIRD, connects to internal pins 96..127.  *** DIFFERENT!!!

The internal pins of port D differ from the external pins of ports A/B/C in regard to both
outputs and inputs:

    Each cog generates its port D outputs in the same pattern it generates its port A/B/C
    outputs:

        OUTD is OR'd with SERA/SERB/CTRA/CTRB/XFR/TRACE outputs 127..96, then those 32 bits
        get AND'd with DIRD to form the port D outputs.

    The difference is that all the cogs' port D outputs are not OR'd together before going
    to a set of 32 I/O pins. Instead, each cog's port D outputs are kept separated, and
    every cog can determine which other cogs' port D outputs it wants to see in its own
    PIND input, which also feeds SERA/SERB/CTRA/CTRB/XFR inputs 127..96.


The SETXCH instruction is used to set the PIND input filter:

    SETXCH  D/#             - Set PIND input filter to %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA

        %DDDDDDDD = filter for PIND[31..24]

            %xxxxxxx1 = cog 0's port D output [31..24] will be OR'd into PIND[31..24] input
            %xxxxxx1x = cog 1's port D output [31..24] will be OR'd into PIND[31..24] input
            %xxxxx1xx = cog 2's port D output [31..24] will be OR'd into PIND[31..24] input
            %xxxx1xxx = cog 3's port D output [31..24] will be OR'd into PIND[31..24] input
            %xxx1xxxx = cog 4's port D output [31..24] will be OR'd into PIND[31..24] input
            %xx1xxxxx = cog 5's port D output [31..24] will be OR'd into PIND[31..24] input
            %x1xxxxxx = cog 6's port D output [31..24] will be OR'd into PIND[31..24] input
            %1xxxxxxx = cog 7's port D output [31..24] will be OR'd into PIND[31..24] input

        %CCCCCCCC = filter for PIND[23..16]

            %xxxxxxx1 = cog 0's port D output [23..16] will be OR'd into PIND[23..16] input
            %xxxxxx1x = cog 1's port D output [23..16] will be OR'd into PIND[23..16] input
            %xxxxx1xx = cog 2's port D output [23..16] will be OR'd into PIND[23..16] input
            %xxxx1xxx = cog 3's port D output [23..16] will be OR'd into PIND[23..16] input
            %xxx1xxxx = cog 4's port D output [23..16] will be OR'd into PIND[23..16] input
            %xx1xxxxx = cog 5's port D output [23..16] will be OR'd into PIND[23..16] input
            %x1xxxxxx = cog 6's port D output [23..16] will be OR'd into PIND[23..16] input
            %1xxxxxxx = cog 7's port D output [23..16] will be OR'd into PIND[23..16] input

        %BBBBBBBB = filter for PIND[15..8]

            %xxxxxxx1 = cog 0's port D output [15..8] will be OR'd into PIND[15..8] input
            %xxxxxx1x = cog 1's port D output [15..8] will be OR'd into PIND[15..8] input
            %xxxxx1xx = cog 2's port D output [15..8] will be OR'd into PIND[15..8] input
            %xxxx1xxx = cog 3's port D output [15..8] will be OR'd into PIND[15..8] input
            %xxx1xxxx = cog 4's port D output [15..8] will be OR'd into PIND[15..8] input
            %xx1xxxxx = cog 5's port D output [15..8] will be OR'd into PIND[15..8] input
            %x1xxxxxx = cog 6's port D output [15..8] will be OR'd into PIND[15..8] input
            %1xxxxxxx = cog 7's port D output [15..8] will be OR'd into PIND[15..8] input

        %AAAAAAAA = filter for PIND[7..0]

            %xxxxxxx1 = cog 0's port D output [7..0] will be OR'd into PIND[7..0] input
            %xxxxxx1x = cog 1's port D output [7..0] will be OR'd into PIND[7..0] input
            %xxxxx1xx = cog 2's port D output [7..0] will be OR'd into PIND[7..0] input
            %xxxx1xxx = cog 3's port D output [7..0] will be OR'd into PIND[7..0] input
            %xxx1xxxx = cog 4's port D output [7..0] will be OR'd into PIND[7..0] input
            %xx1xxxxx = cog 5's port D output [7..0] will be OR'd into PIND[7..0] input
            %x1xxxxxx = cog 6's port D output [7..0] will be OR'd into PIND[7..0] input
            %1xxxxxxx = cog 7's port D output [7..0] will be OR'd into PIND[7..0] input


To input only cog 0's port D output into PIND, you would use the filter value $01_01_01_01.
To input the logical OR of cog 0's and cog 1's port D outputs into PIND, you would use
$03_03_03_03. In most cases, it may be desirable to just see one other cog's full port D
output in a PIND input, but many other arrangements are possible. SETBYTE and GETBYTE
instructions can be used to efficiently move bytes via OUTD/PIND windows.
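
The filter arithmetic above can be checked with a short Python sketch. The helper names
xch_filter() and pind_input() are hypothetical; the model only reproduces the OR-and-select
behavior described in this section and ignores the clock latency.

```python
def xch_filter(cogs_per_byte):
    """Build a SETXCH filter value.

    cogs_per_byte: four iterables of cog numbers (0..7), for PIND bytes
    [7..0], [15..8], [23..16], [31..24], in that order.
    """
    value = 0
    for byte_index, cogs in enumerate(cogs_per_byte):
        mask = 0
        for cog in cogs:
            if not 0 <= cog <= 7:
                raise ValueError("cog must be 0..7")
            mask |= 1 << cog
        value |= mask << (8 * byte_index)
    return value


def pind_input(filter_value, port_d_outputs):
    """Model PIND: OR together the selected cogs' port D output bytes.

    port_d_outputs: list of eight 32-bit values, one per cog.
    """
    pind = 0
    for byte_index in range(4):
        mask = (filter_value >> (8 * byte_index)) & 0xFF
        byte_lane = 0xFF << (8 * byte_index)
        for cog in range(8):
            if mask & (1 << cog):
                pind |= port_d_outputs[cog] & byte_lane
    return pind
```

Here xch_filter([[0]] * 4) gives $01_01_01_01 and xch_filter([[0, 1]] * 4) gives $03_03_03_03,
matching the two filter values quoted above. Mixed arrangements, such as taking the even bytes
from cog 0 and the odd bytes from cog 1, are built the same way.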

After SETXCH, PIND can be read for newly-filtered data on the third clock:

        SETXCH  #$00000001      'change filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from new filter


Writes to an OUTD are readable from a PIND on the third clock, as well.



SERIAL TRANSCEIVERS
-------------------

Each cog has two asynchronous full-duplex serial transceivers, called SERA and SERB, which
can transmit and receive 8-bit and/or 32-bit data, with an optionally-appended 4-bit ID to
enable automatic data filtering on the receiver side.

To use SERA/SERB:

    - Configure the transceiver and set the baud rate(s) using SETSERA/SETSERB.
    - Make the TX pin an output if you are going to transmit.
    - Execute SEROUTA/SEROUTB instructions to transmit data.
    - Execute SERINA/SERINB instructions to receive data.


Baud rates are established in terms of clocks per bit, or by the bit period. Valid bit
periods range from 1..65535 (160Mbps..2441bps @160MHz). The practical minimum bit period
between same-frequency Propeller chips is 3, which yields 53.333Mbps @160MHz.
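
The relationship between clock frequency, bit period, and baud rate is simple division. The
following Python sketch (function names are illustrative only) shows how a period could be
chosen for a target baud rate and what rate a given period actually produces.

```python
def bit_period(clock_hz, baud):
    """Clocks per bit for a desired baud rate, rounded to the nearest clock."""
    period = round(clock_hz / baud)
    if not 1 <= period <= 65535:
        raise ValueError("bit period must be in 1..65535")
    return period


def actual_baud(clock_hz, period):
    """Baud rate produced by a given bit period."""
    return clock_hz / period
```

For instance, bit_period(80_000_000, 2_000_000) returns 40, and actual_baud(160_000_000, 3)
is about 53.333 million, in line with the figures quoted above.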

Before transmitting or receiving data, SERA/SERB must be configured:


    SETSERA D/#,S/#   - Set SERA configuration to %KKKK_NNNN_MMMM_R_T_DD_CCCCCCC_BB_AAAAAAA
                        using D/#.

                        Set SERA transmit period to S/#[15..0]. Set SERA receive period to
                        S/#[31..16], unless value is 0, in which case use S/#[15..0].


    SETSERB D/#,S/#   - Set SERB configuration to %KKKK_NNNN_MMMM_R_T_DD_CCCCCCC_BB_AAAAAAA
                        using D/#.

                        Set SERB transmit period to S/#[15..0]. Set SERB receive period to
                        S/#[31..16], unless value is 0, in which case use S/#[15..0].


        %KKKK = transmitter ID

        %NNNN = receiver ID target

        %MMMM = receiver ID mask

        %R = receiver ID mode

            %0 = receiver ID disabled, only 8 or 32 data bits will be received

            %1 = receiver ID enabled, four additional ID bits (%JJJJ) will be received,
                 received data will only be captured if (%JJJJ & %MMMM) = %NNNN

        %T = transmitter ID mode

            %0 = transmitter ID disabled, only 8 or 32 data bits will be transmitted

            %1 = transmitter ID enabled, %KKKK will be appended to the transmit data

        %DD = receiver mode

            %00 = receiver disabled
            %01 = 32-bit data, inverse RX polarity (STOP=L, START=H)
            %10 = 8-bit data,  true    RX polarity (STOP=H, START=L)
            %11 = 8-bit data,  inverse RX polarity (STOP=L, START=H)

        %CCCCCCC = RX pin, 0..127

        %BB = transmitter mode

            %00 = transmitter disabled
            %01 = 32-bit data, inverse TX polarity (STOP=L, START=H)
            %10 = 8-bit data,  true    TX polarity (STOP=H, START=L)
            %11 = 8-bit data,  inverse TX polarity (STOP=L, START=H)

        %AAAAAAA = TX pin, 0..127


The SERA/SERB configuration registers are initialized to $00000000 on cog start.
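
As a cross-check of the field layout, the configuration long can be assembled in Python. This
is only a sketch: ser_config() and the field names are made up here, and the bit positions
are derived by counting field widths in %KKKK_NNNN_MMMM_R_T_DD_CCCCCCC_BB_AAAAAAA from the
right.

```python
# (name, width in bits, LSB position), taken from the field layout above
FIELDS = [
    ("tx_pin",       7, 0),    # %AAAAAAA
    ("tx_mode",      2, 7),    # %BB
    ("rx_pin",       7, 9),    # %CCCCCCC
    ("rx_mode",      2, 16),   # %DD
    ("tx_id_enable", 1, 18),   # %T
    ("rx_id_enable", 1, 19),   # %R
    ("rx_id_mask",   4, 20),   # %MMMM
    ("rx_id_target", 4, 24),   # %NNNN
    ("tx_id",        4, 28),   # %KKKK
]


def ser_config(**values):
    """Pack named fields into a 32-bit SETSERA/SETSERB configuration value."""
    unknown = set(values) - {name for name, _, _ in FIELDS}
    if unknown:
        raise ValueError("unknown field(s): " + ", ".join(sorted(unknown)))
    config = 0
    for name, width, shift in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        config |= value << shift
    return config
```

With this layout, ser_config(tx_mode=0b10, tx_pin=3) equals %10<<7 + 3, and
ser_config(rx_mode=0b01, rx_pin=33) equals %01<<16 + 33<<9.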


Once a transmitter is enabled, the following instructions may be used to transmit data:


        SEROUTA D/#         - wait to transmit D/# on SERA
                            - if single-task, stalls pipeline until D/# captured
                            - if multi-task, loops until D/# captured (frees pipeline)

        SEROUTA D/#   WC    - try to transmit D/# on SERA, C=1 if D/# captured
                            - always takes 1 clock

        SEROUTB D/#         - wait to transmit D/# on SERB
                            - if single-task, stalls pipeline until D/# captured
                            - if multi-task, loops until D/# captured (frees pipeline)

        SEROUTB D/#   WC    - try to transmit D/# on SERB, C=1 if D/# captured
                            - always takes 1 clock


The transmitters operate by capturing data from a SEROUTA/SEROUTB instruction and then
outputting timed states on TX. First, a STOP state is output, then a START state, followed
by the data bits (and optional ID bits), LSB first. A final STOP state is then output, but
it is not timed: at that point the transmitter is no longer busy and is ready to accept more
data from another SEROUTA/SEROUTB instruction.
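
The sequence of TX states for one transmission can be sketched in Python as follows. The
function name tx_states() is invented for illustration, and the model covers true polarity
only (STOP=H, START=L). It assumes the ID bits follow the data bits and are themselves sent
LSB first.

```python
def tx_states(data, data_bits=8, tx_id=None):
    """Return the list of TX line states (1 = high, 0 = low) for one frame."""
    if data_bits not in (8, 32):
        raise ValueError("data_bits must be 8 or 32")
    high, low = 1, 0
    states = [high, low]                               # STOP, then START
    states += [(data >> i) & 1 for i in range(data_bits)]   # data, LSB first
    if tx_id is not None:
        states += [(tx_id >> i) & 1 for i in range(4)]      # optional 4-bit ID
    states.append(high)                                # final STOP (not timed)
    return states
```

Each entry except the last lasts one bit period. The final STOP state simply persists until
the next frame begins.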


Once a receiver is enabled, the following instructions may be used to receive data:


        SERINA  D           - wait to receive data from SERA into D
                            - if single-task, stalls pipeline until data captured
                            - if multi-task, loops until data captured (frees pipeline)

        SERINA  D     WC    - try to receive new data from SERA into D, C=1 if new data
                            - always takes 1 clock

        SERINB  D           - wait to receive data from SERB into D
                            - if single-task, stalls pipeline until data captured
                            - if multi-task, loops until data captured (frees pipeline)

        SERINB  D     WC    - try to receive new data from SERB into D, C=1 if new data
                            - always takes 1 clock


The receivers wait for a STOP state on RX, then a START state, and then they sample the data
bits (and optional ID bits), LSB first, on the center of each bit period, until the last bit
is sampled. At that point, the received data is captured and made available via SERINA/SERINB,
and the receiver goes back to waiting for another STOP state.


To transmit "Hello" at 2M baud, if you're running at 80MHz:


        SETSERA #%10<<7 + 3,#40     'set SERA for 8-bit transmit on pin 3 at 40 clocks/bit
        CLRP    #3                  'make pin3 an output, SERA drives it high

        SEROUTA #"H"                'send message
        SEROUTA #"e"
        SEROUTA #"l"
        SEROUTA #"l"
        SEROUTA #"o"

        JMP     #$


Here is an example which receives 32-bit data and outputs it to pins 31..0:


        SETSERA _sera,#3            'set 32-bit data, pin 33, use fast bit period of 3
        NEG     DIRA,#1             'make P31..P0 outputs

LOOP    SERINA  OUTA                'receive 32 bits into P31..P0
        JMP     #LOOP               'loop

_sera   LONG    %01<<16 + 33<<9     '32-bit data, pin 33


To do the same thing, but with filtering, just change _sera:


_sera   LONG    %0110_1110_1<<19 + %01<<16 + 33<<9   'only allow ID's %0110 and %0111
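
The comment on the filtered _sera value can be verified by applying the receiver's acceptance
rule, (%JJJJ & %MMMM) = %NNNN, to all sixteen possible IDs. The helper name id_accepted() is
hypothetical.

```python
def id_accepted(received_id, mask, target):
    """Receiver ID rule: data is captured only if (JJJJ & MMMM) == NNNN."""
    return (received_id & mask) == target


# NNNN = %0110 and MMMM = %1110, as in the filtered _sera value
accepted = [j for j in range(16) if id_accepted(j, mask=0b1110, target=0b0110)]
```

The mask ignores the lowest ID bit, so exactly two IDs satisfy the rule: %0110 and %0111.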



TRACE
-----

A cog can cause its execution state (from pipeline stage 4) to be output to pins on
every clock cycle by using the SETRACE instruction:

    SETRACE D/#     - Set trace configuration to %TTTT

                      %TTTT = trace configuration

                          %0xx0 = trace output disabled (initial state on cog start)

                          %0001 = output 32-bit trace to pins 31..0
                          %0011 = output 32-bit trace to pins 63..32
                          %0101 = output 32-bit trace to pins 95..64
                          %0111 = output 32-bit trace to pins 127..96

                          %1000 = output 16-bit trace to pins 15..0
                          %1001 = output 16-bit trace to pins 31..16
                          %1010 = output 16-bit trace to pins 47..32
                          %1011 = output 16-bit trace to pins 63..48
                          %1100 = output 16-bit trace to pins 79..64
                          %1101 = output 16-bit trace to pins 95..80 (pins 95..92 don't exist)
                          %1110 = output 16-bit trace to pins 111..96
                          %1111 = output 16-bit trace to pins 127..112


The 32-bit trace output is comprised of the following signals, from MSB to LSB:

    TASK[1..0]   - the executing task, 0..3
    HUB          - hub cycle that comes once every 8 clocks
    FETCH        - pipeline stall due to hub instruction fetch
    GO           - pipeline not stalled and instruction done
    COND         - execution condition
    JUMP         - a jump is executing
    VID_ACK      - WAITVID able to execute
    CTRA_SYNC    - CTRA is rolling over
    CTRB_SYNC    - CTRB is rolling over
    SERA_RX_RDY  - SERA's receive buffer is full, ready for SERINA
    SERA_TX_RDY  - SERA's transmit buffer is empty, ready for SEROUTA
    SERB_RX_RDY  - SERB's receive buffer is full, ready for SERINB
    SERB_TX_RDY  - SERB's transmit buffer is empty, ready for SEROUTB
    PC[15..0]    - full 16 bits of the program counter


The 16-bit trace output is comprised of the following signals, from MSB to LSB:

    TASK[1..0]   - the executing task, 0..3
    HUB          - hub cycle that comes once every 8 clocks
    FETCH        - pipeline stall due to hub instruction fetch
    GO           - pipeline not stalled and instruction done
    COND         - execution condition
    JUMP         - a jump is executing
    PC[8..0]     - lower 9 bits of the program counter
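
A logging tool could unpack the 16-bit trace word as in the Python sketch below. The function
name decode_trace16() is made up, and the sketch assumes the signals listed above occupy bits
15..0 contiguously in the order given (2 + 5 + 9 = 16 bits).

```python
def decode_trace16(word):
    """Split a 16-bit trace sample into its named signals."""
    return {
        "task":  (word >> 14) & 0b11,   # TASK[1..0]
        "hub":   (word >> 13) & 1,      # HUB
        "fetch": (word >> 12) & 1,      # FETCH
        "go":    (word >> 11) & 1,      # GO
        "cond":  (word >> 10) & 1,      # COND
        "jump":  (word >> 9) & 1,       # JUMP
        "pc":    word & 0x1FF,          # PC[8..0]
    }
```

For example, the sample %10_0_0_1_1_0_000010011 decodes to task 2, GO and COND set, and
PC = $013.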


For the output to appear, the DIR bits corresponding to the trace pins must be set.

Idea: By outputting trace data to the internal port D pins (%TTTT = %0111, %1110, or %1111),
and having another cog trigger on it using WAITPEQ before logging the trace data, a trace
debugger could be made.



INSTRUCTION LIST
----------------

ZCDS (for D column: W=write, M=modify, R=read, L=read/immediate)
----------------------------------------------------------------------------------------------------------------------
ZCWS  0000000 ZC I CCCC DDDDDDDDD SSSSSSSSS     RDBYTE  D,S/PTRA/PTRB           (waits for hub)
ZCWS  0000001 ZC I CCCC DDDDDDDDD SSSSSSSSS     RDBYTEC D,S/PTRA/PTRB           (waits for hub if dcache miss)
ZCWS  0000010 ZC I CCCC DDDDDDDDD SSSSSSSSS     RDWORD  D,S/PTRA/PTRB           (waits for hub)
ZCWS  0000011 ZC I CCCC DDDDDDDDD SSSSSSSSS     RDWORDC D,S/PTRA/PTRB           (waits for hub if dcache miss)
ZCWS  0000100 ZC I CCCC DDDDDDDDD SSSSSSSSS     RDLONG  D,S/PTRA/PTRB           (waits for hub)
ZCWS  0000101 ZC I CCCC DDDDDDDDD SSSSSSSSS     RDLONGC D,S/PTRA/PTRB           (waits for hub if dcache miss)
ZCWS  0000110 ZC I CCCC DDDDDDDDD SSSSSSSSS     RDAUX   D,S/#0..$FF/PTRX/PTRY
ZCWS  0000111 ZC I CCCC DDDDDDDDD SSSSSSSSS     RDAUXR  D,S/#0..$FF/PTRX/PTRY

ZCMS  0001000 ZC I CCCC DDDDDDDDD SSSSSSSSS     ISOB    D,S/#
ZCMS  0001001 ZC I CCCC DDDDDDDDD SSSSSSSSS     NOTB    D,S/#
ZCMS  0001010 ZC I CCCC DDDDDDDDD SSSSSSSSS     CLRB    D,S/#
ZCMS  0001011 ZC I CCCC DDDDDDDDD SSSSSSSSS     SETB    D,S/#
ZCMS  0001100 ZC I CCCC DDDDDDDDD SSSSSSSSS     SETBC   D,S/#
ZCMS  0001101 ZC I CCCC DDDDDDDDD SSSSSSSSS     SETBNC  D,S/#
ZCMS  0001110 ZC I CCCC DDDDDDDDD SSSSSSSSS     SETBZ   D,S/#
ZCMS  0001111 ZC I CCCC DDDDDDDDD SSSSSSSSS     SETBNZ  D,S/#

ZCMS  0010000 ZC I CCCC DDDDDDDDD SSSSSSSSS     ANDN    D,S/#
ZCMS  0010001 ZC I CCCC DDDDDDDDD SSSSSSSSS     AND     D,S/#
ZCMS  0010010 ZC I CCCC DDDDDDDDD SSSSSSSSS     OR      D,S/#
ZCMS  0010011 ZC I CCCC DDDDDDDDD SSSSSSSSS     XOR     D,S/#
ZCMS  0010100 ZC I CCCC DDDDDDDDD SSSSSSSSS     MUXC    D,S/#
ZCMS  0010101 ZC I CCCC DDDDDDDDD SSSSSSSSS     MUXNC   D,S/#
ZCMS  0010110 ZC I CCCC DDDDDDDDD SSSSSSSSS     MUXZ    D,S/#
ZCMS  0010111 ZC I CCCC DDDDDDDDD SSSSSSSSS     MUXNZ   D,S/#

ZCMS  0011000 ZC I CCCC DDDDDDDDD SSSSSSSSS     ROR     D,S/#
ZCMS  0011001 ZC I CCCC DDDDDDDDD SSSSSSSSS     ROL     D,S/#
ZCMS  0011010 ZC I CCCC DDDDDDDDD SSSSSSSSS     SHR     D,S/#
ZCMS  0011011 ZC I CCCC DDDDDDDDD SSSSSSSSS     SHL     D,S/#
ZCMS  0011100 ZC I CCCC DDDDDDDDD SSSSSSSSS     RCR     D,S/#
ZCMS  0011101 ZC I CCCC DDDDDDDDD SSSSSSSSS     RCL     D,S/#
ZCMS  0011110 ZC I CCCC DDDDDDDDD SSSSSSSSS     SAR     D,S/#
ZCMS  0011111 ZC I CCCC DDDDDDDDD SSSSSSSSS     REV     D,S/#

ZCWS  0100000 ZC I CCCC DDDDDDDDD SSSSSSSSS     MOV     D,S/#
ZCWS  0100001 ZC I CCCC DDDDDDDDD SSSSSSSSS     NOT     D,S/#
ZCWS  0100010 ZC I CCCC DDDDDDDDD SSSSSSSSS     ABS     D,S/#
ZCWS  0100011 ZC I CCCC DDDDDDDDD SSSSSSSSS     NEG     D,S/#
ZCWS  0100100 ZC I CCCC DDDDDDDDD SSSSSSSSS     NEGC    D,S/#
ZCWS  0100101 ZC I CCCC DDDDDDDDD SSSSSSSSS     NEGNC   D,S/#
ZCWS  0100110 ZC I CCCC DDDDDDDDD SSSSSSSSS     NEGZ    D,S/#
ZCWS  0100111 ZC I CCCC DDDDDDDDD SSSSSSSSS     NEGNZ   D,S/#

ZCMS  0101000 ZC I CCCC DDDDDDDDD SSSSSSSSS     ADD     D,S/#
ZCMS  0101001 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUB     D,S/#
ZCMS  0101010 ZC I CCCC DDDDDDDDD SSSSSSSSS     ADDX    D,S/#
ZCMS  0101011 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUBX    D,S/#
ZCMS  0101100 ZC I CCCC DDDDDDDDD SSSSSSSSS     ADDS    D,S/#
ZCMS  0101101 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUBS    D,S/#
ZCMS  0101110 ZC I CCCC DDDDDDDDD SSSSSSSSS     ADDSX   D,S/#
ZCMS  0101111 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUBSX   D,S/#

ZCMS  0110000 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUMC    D,S/#
ZCMS  0110001 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUMNC   D,S/#
ZCMS  0110010 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUMZ    D,S/#
ZCMS  0110011 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUMNZ   D,S/#
ZCMS  0110100 ZC I CCCC DDDDDDDDD SSSSSSSSS     MIN     D,S/#
ZCMS  0110101 ZC I CCCC DDDDDDDDD SSSSSSSSS     MAX     D,S/#
ZCMS  0110110 ZC I CCCC DDDDDDDDD SSSSSSSSS     MINS    D,S/#
ZCMS  0110111 ZC I CCCC DDDDDDDDD SSSSSSSSS     MAXS    D,S/#

ZCMS  0111000 ZC I CCCC DDDDDDDDD SSSSSSSSS     ADDABS  D,S/#
ZCMS  0111001 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUBABS  D,S/#
ZCMS  0111010 ZC I CCCC DDDDDDDDD SSSSSSSSS     INCMOD  D,S/#
ZCMS  0111011 ZC I CCCC DDDDDDDDD SSSSSSSSS     DECMOD  D,S/#
ZCMS  0111100 ZC I CCCC DDDDDDDDD SSSSSSSSS     CMPSUB  D,S/#
ZCMS  0111101 ZC I CCCC DDDDDDDDD SSSSSSSSS     SUBR    D,S/#
ZCMS  0111110 ZC I CCCC DDDDDDDDD SSSSSSSSS     MUL     D,S/#                   (waits one clock)
ZCMS  0111111 ZC I CCCC DDDDDDDDD SSSSSSSSS     SCL     D,S/#                   (waits one clock)

ZCWS  1000000 ZC I CCCC DDDDDDDDD SSSSSSSSS     DECOD2  D,S/#
ZCWS  1000001 ZC I CCCC DDDDDDDDD SSSSSSSSS     DECOD3  D,S/#
ZCWS  1000010 ZC I CCCC DDDDDDDDD SSSSSSSSS     DECOD4  D,S/#
ZCWS  1000011 ZC I CCCC DDDDDDDDD SSSSSSSSS     DECOD5  D,S/#
Z-WS  1000100 Z0 I CCCC DDDDDDDDD SSSSSSSSS     ENCOD   D,S/#
Z-WS  1000100 Z1 I CCCC DDDDDDDDD SSSSSSSSS     BLMASK  D,S/#
Z-WS  1000101 Z0 I CCCC DDDDDDDDD SSSSSSSSS     ONECNT  D,S/#                   (waits one clock)
Z-WS  1000101 Z1 I CCCC DDDDDDDDD SSSSSSSSS     ZERCNT  D,S/#                   (waits one clock)
-CWS  1000110 0C I CCCC DDDDDDDDD SSSSSSSSS     INCPAT  D,S/#
-CWS  1000110 1C I CCCC DDDDDDDDD SSSSSSSSS     DECPAT  D,S/#
--WS  1000111 00 I CCCC DDDDDDDDD SSSSSSSSS     SPLITB  D,S/#                   (also MERGEN)
--WS  1000111 01 I CCCC DDDDDDDDD SSSSSSSSS     MERGEB  D,S/#                   (also SPLITN)
--WS  1000111 10 I CCCC DDDDDDDDD SSSSSSSSS     SPLITW  D,S/#
--WS  1000111 11 I CCCC DDDDDDDDD SSSSSSSSS     MERGEW  D,S/#

--MS  10010nn n0 I CCCC DDDDDDDDD SSSSSSSSS     GETNIB  D,S/#,#0..7
--MS  10010nn n1 I CCCC DDDDDDDDD SSSSSSSSS     SETNIB  D,S/#,#0..7
--MS  1001100 n0 I CCCC DDDDDDDDD SSSSSSSSS     GETWORD D,S/#,#0..1
--MS  1001100 n1 I CCCC DDDDDDDDD SSSSSSSSS     SETWORD D,S/#,#0..1
--MS  1001101 00 I CCCC DDDDDDDDD SSSSSSSSS     SETWRDS D,S/#
--MS  1001101 01 I CCCC DDDDDDDDD SSSSSSSSS     ROLNIB  D,S/#
--MS  1001101 10 I CCCC DDDDDDDDD SSSSSSSSS     ROLBYTE D,S/#
--MS  1001101 11 I CCCC DDDDDDDDD SSSSSSSSS     ROLWORD D,S/#
--MS  1001110 00 I CCCC DDDDDDDDD SSSSSSSSS     SETS    D,S/#
--MS  1001110 01 I CCCC DDDDDDDDD SSSSSSSSS     SETD    D,S/#
--MS  1001110 10 I CCCC DDDDDDDDD SSSSSSSSS     SETX    D,S/#
--MS  1001110 11 I CCCC DDDDDDDDD SSSSSSSSS     SETI    D,S/#
-CMS  1001111 0C I CCCC DDDDDDDDD SSSSSSSSS     COGNEW  D,S/#                   (waits for hub)
-CMS  1001111 1C I CCCC DDDDDDDDD SSSSSSSSS     WAITCNT D,S/#                   (waits for CNT, +CNTX if WC)

--MS  101000n n0 I CCCC DDDDDDDDD SSSSSSSSS     GETBYTE D,S/#,#0..3
--MS  101000n n1 I CCCC DDDDDDDDD SSSSSSSSS     SETBYTE D,S/#,#0..3
--WS  1010010 00 I CCCC DDDDDDDDD SSSSSSSSS     SETBYTS D,S/#
--MS  1010010 01 I CCCC DDDDDDDDD SSSSSSSSS     MOVBYTS D,S/#                   (move bytes in D, S = %11_10_01_00 = D same)
--MS  1010010 10 I CCCC DDDDDDDDD SSSSSSSSS     PACKRGB D,S/#                   (S 8:8:8 -> D 5:5:5 << 16 | D >> 16)
--WS  1010010 11 I CCCC DDDDDDDDD SSSSSSSSS     UNPKRGB D,S/#                   (S 5:5:5 -> D 8:8:8)
--MS  1010011 00 I CCCC DDDDDDDDD SSSSSSSSS     ADDPIX  D,S/#                   (waits one clock)
--MS  1010011 01 I CCCC DDDDDDDDD SSSSSSSSS     MULPIX  D,S/#                   (waits one clock)
--MS  1010011 10 I CCCC DDDDDDDDD SSSSSSSSS     BLNPIX  D,S/#                   (waits one clock)
--MS  1010011 11 I CCCC DDDDDDDDD SSSSSSSSS     MIXPIX  D,S/#                   (waits one clock)

ZCMS  1010100 ZC I CCCC DDDDDDDDD SSSSSSSSS     JMPSW   D,S/@
ZCMS  1010101 ZC I CCCC DDDDDDDDD SSSSSSSSS     JMPSWD  D,S/@
--MS  1010110 00 I CCCC DDDDDDDDD SSSSSSSSS     IJZ     D,S/@
--MS  1010110 01 I CCCC DDDDDDDDD SSSSSSSSS     IJZD    D,S/@
--MS  1010110 10 I CCCC DDDDDDDDD SSSSSSSSS     IJNZ    D,S/@
--MS  1010110 11 I CCCC DDDDDDDDD SSSSSSSSS     IJNZD   D,S/@
--MS  1010111 00 I CCCC DDDDDDDDD SSSSSSSSS     DJZ     D,S/@
--MS  1010111 01 I CCCC DDDDDDDDD SSSSSSSSS     DJZD    D,S/@
--MS  1010111 10 I CCCC DDDDDDDDD SSSSSSSSS     DJNZ    D,S/@
--MS  1010111 11 I CCCC DDDDDDDDD SSSSSSSSS     DJNZD   D,S/@

ZCRS  1011000 ZC I CCCC DDDDDDDDD SSSSSSSSS     TESTB   D,S/#
ZCRS  1011001 ZC I CCCC DDDDDDDDD SSSSSSSSS     TESTN   D,S/#
ZCRS  1011010 ZC I CCCC DDDDDDDDD SSSSSSSSS     TEST    D,S/#
ZCRS  1011011 ZC I CCCC DDDDDDDDD SSSSSSSSS     CMP     D,S/#
ZCRS  1011100 ZC I CCCC DDDDDDDDD SSSSSSSSS     CMPX    D,S/#
ZCRS  1011101 ZC I CCCC DDDDDDDDD SSSSSSSSS     CMPS    D,S/#
ZCRS  1011110 ZC I CCCC DDDDDDDDD SSSSSSSSS     CMPSX   D,S/#
ZCRS  1011111 ZC I CCCC DDDDDDDDD SSSSSSSSS     CMPR    D,S/#

--RS  11000nn n0 I CCCC DDDDDDDDD SSSSSSSSS     COGINIT D,S/#,#0..7             (waits for hub) (use SETNIB :coginit,cog,#6 before)
---S  11000nn n1 I CCCC nnnnnnnnn SSSSSSSSS     WAITVID #0..$DFF,S/#            (waits for vid if single-task, loops if multi-task)
--RS  1100011 11 I CCCC DDDDDDDDD SSSSSSSSS     WAITVID D,S/#                   (waits for vid if single-task, loops if multi-task)
-CRS  110010n nC I CCCC DDDDDDDDD SSSSSSSSS     WAITPEQ D,S/#,#0..3             (waits for pins, plus CNT if WC)
-CRS  110011n nC I CCCC DDDDDDDDD SSSSSSSSS     WAITPNE D,S/#,#0..3             (waits for pins, plus CNT if WC)

--LS  1101000 0L I CCCC DDDDDDDDD SSSSSSSSS     WRBYTE  D/#,S/PTRA/PTRB         (waits for hub)
--LS  1101000 1L I CCCC DDDDDDDDD SSSSSSSSS     WRWORD  D/#,S/PTRA/PTRB         (waits for hub)
--LS  1101001 0L I CCCC DDDDDDDDD SSSSSSSSS     WRLONG  D/#,S/PTRA/PTRB         (waits for hub)
--LS  1101001 1L I CCCC DDDDDDDDD SSSSSSSSS     FRAC    D/#,S/#
--LS  1101010 0L I CCCC DDDDDDDDD SSSSSSSSS     WRAUX   D/#,S/#0..$FF/PTRX/PTRY
--LS  1101010 1L I CCCC DDDDDDDDD SSSSSSSSS     WRAUXR  D/#,S/#0..$FF/PTRX/PTRY
--LS  1101011 0L I CCCC DDDDDDDDD SSSSSSSSS     SETACCA D/#,S/#
--LS  1101011 1L I CCCC DDDDDDDDD SSSSSSSSS     SETACCB D/#,S/#
--LS  1101100 0L I CCCC DDDDDDDDD SSSSSSSSS     MACA    D/#,S/#
--LS  1101100 1L I CCCC DDDDDDDDD SSSSSSSSS     MACB    D/#,S/#
--LS  1101101 0L I CCCC DDDDDDDDD SSSSSSSSS     MUL32   D/#,S/#
--LS  1101101 1L I CCCC DDDDDDDDD SSSSSSSSS     MUL32U  D/#,S/#
--LS  1101110 0L I CCCC DDDDDDDDD SSSSSSSSS     DIV32   D/#,S/#
--LS  1101110 1L I CCCC DDDDDDDDD SSSSSSSSS     DIV32U  D/#,S/#
--LS  1101111 0L I CCCC DDDDDDDDD SSSSSSSSS     DIV64   D/#,S/#
--LS  1101111 1L I CCCC DDDDDDDDD SSSSSSSSS     DIV64U  D/#,S/#

--LS  1110000 0L I CCCC DDDDDDDDD SSSSSSSSS     SQRT64  D/#,S/#
--LS  1110000 1L I CCCC DDDDDDDDD SSSSSSSSS     QSINCOS D/#,S/#
--LS  1110001 0L I CCCC DDDDDDDDD SSSSSSSSS     QARCTAN D/#,S/#
--LS  1110001 1L I CCCC DDDDDDDDD SSSSSSSSS     QROTATE D/#,S/#
--LS  1110010 0L I CCCC DDDDDDDDD SSSSSSSSS     SETSERA D/#,S/#                 (config,baud)
--LS  1110010 1L I CCCC DDDDDDDDD SSSSSSSSS     SETSERB D/#,S/#                 (config,baud)
--LS  1110011 0L I CCCC DDDDDDDDD SSSSSSSSS     SETCTRS D/#,S/#                 (ctrb,ctra)
--LS  1110011 1L I CCCC DDDDDDDDD SSSSSSSSS     SETWAVS D/#,S/#                 (ctrb,ctra)
--LS  1110100 0L I CCCC DDDDDDDDD SSSSSSSSS     SETFRQS D/#,S/#                 (ctrb,ctra)
--LS  1110100 1L I CCCC DDDDDDDDD SSSSSSSSS     SETPHSS D/#,S/#                 (ctrb,ctra)
--LS  1110101 0L I CCCC DDDDDDDDD SSSSSSSSS     ADDPHSS D/#,S/#                 (ctrb,ctra)
--LS  1110101 1L I CCCC DDDDDDDDD SSSSSSSSS     SUBPHSS D/#,S/#                 (ctrb,ctra)
--LS  1110110 0L I CCCC DDDDDDDDD SSSSSSSSS     JP      D/#,S/@
--LS  1110110 1L I CCCC DDDDDDDDD SSSSSSSSS     JPD     D/#,S/@
--LS  1110111 0L I CCCC DDDDDDDDD SSSSSSSSS     JNP     D/#,S/@
--LS  1110111 1L I CCCC DDDDDDDDD SSSSSSSSS     JNPD    D/#,S/@

--LS  111100n nL I CCCC DDDDDDDDD SSSSSSSSS     CFGPINS D/#,S/#,#0..2           (waits for alt)
--LS  1111001 1L I CCCC DDDDDDDDD SSSSSSSSS     JMPTASK D/#,S/#                 (mask,address)
--LS  1111010 0L I CCCC DDDDDDDDD SSSSSSSSS     SETXFR  D/#,S/#
--LS  1111010 1L I CCCC DDDDDDDDD SSSSSSSSS     SETMIX  D/#,S/#

--RS  1111011 00 I CCCC DDDDDDDDD SSSSSSSSS     JZ      D,S/@
--RS  1111011 01 I CCCC DDDDDDDDD SSSSSSSSS     JZD     D,S/@
--RS  1111011 10 I CCCC DDDDDDDDD SSSSSSSSS     JNZ     D,S/@
--RS  1111011 11 I CCCC DDDDDDDDD SSSSSSSSS     JNZD    D,S/@

--WS  1111100 00 I CCCC DDDDDDDDD SSSSSSSSS     LOCBASE D,S/@                   (if S:        S<<2, if @S:        (P+@S)<<2)
--MS  1111100 01 I CCCC DDDDDDDDD SSSSSSSSS     LOCBYTE D,S/@                   (if S: D<<0 + S<<2, if @S: D<<0 + (P+@S)<<2)
--MS  1111100 10 I CCCC DDDDDDDDD SSSSSSSSS     LOCWORD D,S/@                   (if S: D<<1 + S<<2, if @S: D<<1 + (P+@S)<<2)
--MS  1111100 11 I CCCC DDDDDDDDD SSSSSSSSS     LOCLONG D,S/@                   (if S: D<<2 + S<<2, if @S: D<<2 + (P+@S)<<2)

--RS  1111101 00 I CCCC DDDDDDDDD SSSSSSSSS     JMPLIST D,S/@                   (if S: D<<0 + S<<0, if @S: D<<0 + (P+@S)<<0)

--W-  1111101 01 0 CCCC DDDDDDDDD SSSSSSSSS     LOCINST D,@S                    (P+@S)
----  1111101 01 1 nnnn nnnnnnnnn nnniiiiii     REPS    #1..$10000,#1..64

----  1111101 10 n nnnn nnnnnnnnn nnnnnnnnn     AUGS    #23bits                 (appends n to upper bits of next immediate S)
----  1111101 11 n nnnn nnnnnnnnn nnnnnnnnn     AUGD    #23bits                 (appends n to upper bits of next immediate D)

----  1111110 00 0 BBAA ddddddddd sssssssss     FIXINDA #d,#s / FIXINDB #d,#s / FIXINDS #d,#s / SETINDA #s / SETINDB #d / SETINDS #d,#s

----  1111110 00 1 CCCC 00 nnnnnnnnnnnnnnnn     LOCPTRA #abs
----  1111110 00 1 CCCC 01 nnnnnnnnnnnnnnnn     LOCPTRA @rel
----  1111110 00 1 CCCC 10 nnnnnnnnnnnnnnnn     LOCPTRB #abs
----  1111110 00 1 CCCC 11 nnnnnnnnnnnnnnnn     LOCPTRB @rel

----  1111110 01 0 CCCC 00 nnnnnnnnnnnnnnnn     JMP     #abs
----  1111110 01 0 CCCC 01 nnnnnnnnnnnnnnnn     JMP     @rel
----  1111110 01 0 CCCC 10 nnnnnnnnnnnnnnnn     JMPD    #abs
----  1111110 01 0 CCCC 11 nnnnnnnnnnnnnnnn     JMPD    @rel

----  1111110 01 1 CCCC 00 nnnnnnnnnnnnnnnn     CALL    #abs
----  1111110 01 1 CCCC 01 nnnnnnnnnnnnnnnn     CALL    @rel
----  1111110 01 1 CCCC 10 nnnnnnnnnnnnnnnn     CALLD   #abs
----  1111110 01 1 CCCC 11 nnnnnnnnnnnnnnnn     CALLD   @rel

----  1111110 10 0 CCCC 00 nnnnnnnnnnnnnnnn     CALLA   #abs
----  1111110 10 0 CCCC 01 nnnnnnnnnnnnnnnn     CALLA   @rel
----  1111110 10 0 CCCC 10 nnnnnnnnnnnnnnnn     CALLAD  #abs
----  1111110 10 0 CCCC 11 nnnnnnnnnnnnnnnn     CALLAD  @rel

----  1111110 10 1 CCCC 00 nnnnnnnnnnnnnnnn     CALLB   #abs
----  1111110 10 1 CCCC 01 nnnnnnnnnnnnnnnn     CALLB   @rel
----  1111110 10 1 CCCC 10 nnnnnnnnnnnnnnnn     CALLBD  #abs
----  1111110 10 1 CCCC 11 nnnnnnnnnnnnnnnn     CALLBD  @rel

----  1111110 11 0 CCCC 00 nnnnnnnnnnnnnnnn     CALLX   #abs
----  1111110 11 0 CCCC 01 nnnnnnnnnnnnnnnn     CALLX   @rel
----  1111110 11 0 CCCC 10 nnnnnnnnnnnnnnnn     CALLXD  #abs
----  1111110 11 0 CCCC 11 nnnnnnnnnnnnnnnn     CALLXD  @rel

----  1111110 11 1 CCCC 00 nnnnnnnnnnnnnnnn     CALLY   #abs
----  1111110 11 1 CCCC 01 nnnnnnnnnnnnnnnn     CALLY   @rel
----  1111110 11 1 CCCC 10 nnnnnnnnnnnnnnnn     CALLYD  #abs
----  1111110 11 1 CCCC 11 nnnnnnnnnnnnnnnn     CALLYD  @rel

ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000000000     COGID   D                       (waits for hub) (doesn't write D if WC)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000000001     TASKID  D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000000010     LOCKNEW D                       (waits for hub)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000000011     GETLFSR D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000000100     GETCNT  D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000000101     GETCNTX D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000000110     GETACAL D                       (waits for mac)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000000111     GETACAH D                       (waits for mac)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000001000     GETACBL D                       (waits for mac)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000001001     GETACBH D                       (waits for mac)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000001010     GETPTRA D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000001011     GETPTRB D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000001100     GETPTRX D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000001101     GETPTRY D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000001110     SERINA  D                       (waits for rx if single-task, loops if multi-task, releases if WC)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000001111     SERINB  D                       (waits for rx if single-task, loops if multi-task, releases if WC)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000010000     GETMULL D                       (waits for mul if single-task, loops if multi-task)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000010001     GETMULH D                       (waits for mul if single-task, loops if multi-task)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000010010     GETDIVQ D                       (waits for div if single-task, loops if multi-task)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000010011     GETDIVR D                       (waits for div if single-task, loops if multi-task)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000010100     GETSQRT D                       (waits for sqrt if single-task, loops if multi-task)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000010101     GETQX   D                       (waits for cordic if single-task, loops if multi-task)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000010110     GETQY   D                       (waits for cordic if single-task, loops if multi-task)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000010111     GETQZ   D                       (waits for cordic if single-task, loops if multi-task)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000011000     GETPHSA D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000011001     GETPHZA D                       (clears phsa)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000011010     GETCOSA D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000011011     GETSINA D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000011100     GETPHSB D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000011101     GETPHZB D                       (clears phsb)
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000011110     GETCOSB D
ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000011111     GETSINB D

ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000100000     PUSHZC  D
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000100001     POPZC   D
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000100010     SUBCNT  D                       (subtracts D from CNT, then CNTX if same thread)
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000100011     GETPIX  D                       (takes 3 clocks, needs 3 clocks in prior two stages, no condition allowed)
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000100100     BINBCD  D
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000100101     BCDBIN  D
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000100110     BINGRY  D
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000100111     GRYBIN  D                       (waits one clock)
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000101000     ESWAP4  D
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000101001     ESWAP8  D
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000101010     SEUSSF  D
ZCM-  1111111 ZC 0 CCCC DDDDDDDDD 000101011     SEUSSR  D
Z-M-  1111111 ZC 0 CCCC DDDDDDDDD 000101100     INCD    D                       (D += $200)
Z-M-  1111111 ZC 0 CCCC DDDDDDDDD 000101101     DECD    D                       (D -= $200)
Z-M-  1111111 ZC 0 CCCC DDDDDDDDD 000101110     INCDS   D                       (D += $201)
Z-M-  1111111 ZC 0 CCCC DDDDDDDDD 000101111     DECDS   D                       (D -= $201)

ZCW-  1111111 ZC 0 CCCC DDDDDDDDD 000110000     POP     D                       (pops from task's 4-level stack)

--L-  1111111 00 L CCCC DDDDDDDDD 001iiiiii     REPD    D/#1..512,#1..64        (REPD $1FF,#1..64 = infinite repeat, can use REPD #i)

--L-  1111111 00 L CCCC DDDDDDDDD 010000000     CLKSET  D/#                     (waits for hub)
--L-  1111111 00 L CCCC DDDDDDDDD 010000001     COGSTOP D/#                     (waits for hub)
-CL-  1111111 0C L CCCC DDDDDDDDD 010000010     LOCKSET D/#                     (waits for hub)
-CL-  1111111 0C L CCCC DDDDDDDDD 010000011     LOCKCLR D/#                     (waits for hub)
--L-  1111111 00 L CCCC DDDDDDDDD 010000100     LOCKRET D/#                     (waits for hub)
--L-  1111111 00 L CCCC DDDDDDDDD 010000101     RDWIDEC D/PTRA/PTRB             (waits for hub if dcache miss)
--L-  1111111 00 L CCCC DDDDDDDDD 010000110     RDWIDE  D/PTRA/PTRB             (waits for hub)
--L-  1111111 00 L CCCC DDDDDDDDD 010000111     WRWIDE  D/PTRA/PTRB             (waits for hub)

ZCL-  1111111 ZC L CCCC DDDDDDDDD 010001000     GETP    D/#                     (pin into !Z/C via WZ/WC)
ZCL-  1111111 ZC L CCCC DDDDDDDDD 010001001     GETNP   D/#                     (pin into Z/!C via WZ/WC)
-CL-  1111111 0C L CCCC DDDDDDDDD 010001010     SEROUTA D/#                     (waits for tx if single-task, loops if multi-task, releases if WC)
-CL-  1111111 0C L CCCC DDDDDDDDD 010001011     SEROUTB D/#                     (waits for tx if single-task, loops if multi-task, releases if WC)
-CL-  1111111 0C L CCCC DDDDDDDDD 010001100     CMPCNT  D/#                     (subtracts D from CNT, then CNTX if same thread)
-CL-  1111111 0C L CCCC DDDDDDDDD 010001101     WAITPX  D/#                     (waits for any edge, +CNT if WC)
-CL-  1111111 0C L CCCC DDDDDDDDD 010001110     WAITPR  D/#                     (waits for pos edge, +CNT if WC)
-CL-  1111111 0C L CCCC DDDDDDDDD 010001111     WAITPF  D/#                     (waits for neg edge, +CNT if WC)

ZCL-  1111111 ZC L CCCC DDDDDDDDD 010010000     SETZC   D/#                     (D[1:0] into Z/C via WZ/WC)
--L-  1111111 00 L CCCC DDDDDDDDD 010010001     SETMAP  D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010010010     SETXCH  D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010010011     SETTASK D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010010100     SETRACE D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010010101     SARACCA D/#                     (waits for mac)
--L-  1111111 00 L CCCC DDDDDDDDD 010010110     SARACCB D/#                     (waits for mac)
--L-  1111111 00 L CCCC DDDDDDDDD 010010111     SARACCS D/#                     (waits for mac)

--L-  1111111 00 L CCCC DDDDDDDDD 010011000     SETPTRA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010011001     SETPTRB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010011010     ADDPTRA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010011011     ADDPTRB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010011100     SUBPTRA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010011101     SUBPTRB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010011110     SETWIDE D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010011111     SETWIDZ D/#

--L-  1111111 00 L CCCC DDDDDDDDD 010100000     SETPTRX D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010100001     SETPTRY D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010100010     ADDPTRX D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010100011     ADDPTRY D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010100100     SUBPTRX D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010100101     SUBPTRY D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010100110     PASSCNT D/#                     (loops if (CNT - D) msb set)
--L-  1111111 00 L CCCC DDDDDDDDD 010100111     WAIT    D/#                     (waits 1+ clocks, 0 same as 1)

--L-  1111111 00 L CCCC DDDDDDDDD 010101000     OFFP    D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010101001     NOTP    D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010101010     CLRP    D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010101011     SETP    D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010101100     SETPC   D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010101101     SETPNC  D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010101110     SETPZ   D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010101111     SETPNZ  D/#

--L-  1111111 00 L CCCC DDDDDDDDD 010110000     DIV64D  D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010110001     SQRT32  D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010110010     QLOG    D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010110011     QEXP    D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010110100     SETQI   D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010110101     SETQZ   D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010110110     CFGDACS D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010110111     SETDACS D/#

--L-  1111111 00 L CCCC DDDDDDDDD 010111000     CFGDAC0 D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010111001     CFGDAC1 D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010111010     CFGDAC2 D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010111011     CFGDAC3 D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010111100     SETDAC0 D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010111101     SETDAC1 D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010111110     SETDAC2 D/#
--L-  1111111 00 L CCCC DDDDDDDDD 010111111     SETDAC3 D/#

--L-  1111111 00 L CCCC DDDDDDDDD 011000000     SETCTRA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011000001     SETWAVA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011000010     SETFRQA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011000011     SETPHSA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011000100     ADDPHSA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011000101     SUBPHSA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011000110     SETVID  D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011000111     SETVIDY D/#

--L-  1111111 00 L CCCC DDDDDDDDD 011001000     SETCTRB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011001001     SETWAVB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011001010     SETFRQB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011001011     SETPHSB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011001100     ADDPHSB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011001101     SUBPHSB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011001110     SETVIDI D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011001111     SETVIDQ D/#

--L-  1111111 00 L CCCC DDDDDDDDD 011010000     SETPIX  D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011010001     SETPIXZ D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011010010     SETPIXU D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011010011     SETPIXV D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011010100     SETPIXA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011010101     SETPIXR D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011010110     SETPIXG D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011010111     SETPIXB D/#

--L-  1111111 00 L CCCC DDDDDDDDD 011011000     SETPORA D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011011001     SETPORB D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011011010     SETPORC D/#
--L-  1111111 00 L CCCC DDDDDDDDD 011011011     SETPORD D/#

--L-  1111111 00 L CCCC DDDDDDDDD 011011100     PUSH    D/#                     (pushes into task's 4-level stack)

--R-  1111111 ZC 0 CCCC DDDDDDDDD 011110100     JMP     D                       (D[31:30] into Z/C via WZ/WC for JMP..CALLYD)
--R-  1111111 ZC 0 CCCC DDDDDDDDD 011110101     JMPD    D

--R-  1111111 ZC 0 CCCC DDDDDDDDD 011110110     CALL    D
--R-  1111111 ZC 0 CCCC DDDDDDDDD 011110111     CALLD   D

--R-  1111111 ZC 0 CCCC DDDDDDDDD 011111000     CALLA   D
--R-  1111111 ZC 0 CCCC DDDDDDDDD 011111001     CALLAD  D

--R-  1111111 ZC 0 CCCC DDDDDDDDD 011111010     CALLB   D
--R-  1111111 ZC 0 CCCC DDDDDDDDD 011111011     CALLBD  D

--R-  1111111 ZC 0 CCCC DDDDDDDDD 011111100     CALLX   D
--R-  1111111 ZC 0 CCCC DDDDDDDDD 011111101     CALLXD  D

--R-  1111111 ZC 0 CCCC DDDDDDDDD 011111110     CALLY   D
--R-  1111111 ZC 0 CCCC DDDDDDDDD 011111111     CALLYD  D

ZC--  1111111 ZC x CCCC xxxxxxxxx 100000000     RETA
ZC--  1111111 ZC x CCCC xxxxxxxxx 100000001     RETAD
ZC--  1111111 ZC x CCCC xxxxxxxxx 100000010     RETB
ZC--  1111111 ZC x CCCC xxxxxxxxx 100000011     RETBD
ZC--  1111111 ZC x CCCC xxxxxxxxx 100000100     RETX
ZC--  1111111 ZC x CCCC xxxxxxxxx 100000101     RETXD
ZC--  1111111 ZC x CCCC xxxxxxxxx 100000110     RETY
ZC--  1111111 ZC x CCCC xxxxxxxxx 100000111     RETYD

ZC--  1111111 ZC x CCCC xxxxxxxxx 100001000     RET
ZC--  1111111 ZC x CCCC xxxxxxxxx 100001001     RETD
ZC--  1111111 ZC x CCCC xxxxxxxxx 100001010     POLCTRA                         (ctra-rollover into !Z/C)
ZC--  1111111 ZC x CCCC xxxxxxxxx 100001011     POLCTRB                         (ctrb-rollover into !Z/C)

ZC--  1111111 ZC x CCCC xxxxxxxxx 100001100     POLVID                          (vid-ready into !Z/C)
----  1111111 00 x CCCC xxxxxxxxx 100001101     CAPCTRA
----  1111111 00 x CCCC xxxxxxxxx 100001110     CAPCTRB
----  1111111 00 x CCCC xxxxxxxxx 100001111     CAPCTRS

----  1111111 00 x CCCC xxxxxxxxx 100010000     SETPIXW
----  1111111 00 x CCCC xxxxxxxxx 100010001     CLRACCA
----  1111111 00 x CCCC xxxxxxxxx 100010010     CLRACCB
----  1111111 00 x CCCC xxxxxxxxx 100010011     CLRACCS

ZC--  1111111 ZC x CCCC xxxxxxxxx 100010100     CHKPTRX
ZC--  1111111 ZC x CCCC xxxxxxxxx 100010101     CHKPTRY
----  1111111 00 x CCCC xxxxxxxxx 100010110     SYNCTRA                         (waits for ctra if single-task, loops if multi-task)
----  1111111 00 x CCCC xxxxxxxxx 100010111     SYNCTRB                         (waits for ctrb if single-task, loops if multi-task)

----  1111111 00 x CCCC xxxxxxxxx 100011000     DCACHEX
----  1111111 00 x CCCC xxxxxxxxx 100011001     ICACHEX
----  1111111 00 x CCCC xxxxxxxxx 100011010     ICACHEP
----  1111111 00 x CCCC xxxxxxxxx 100011011     ICACHEN


x = don't care, use 0
----------------------------------------------------------------------------------------------------------------------
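
The 16-bit-address forms at the top of this listing (LOCPTRA through CALLYD) share one layout: the fixed
prefix 1111110, a 3-bit instruction select, CCCC, a 2-bit mode field, and the 16-bit value n. The Python
sketch below packs that layout into a 32-bit word. It is illustrative only: the names encode_branch, SELECT,
variant, and relative are hypothetical, and n is treated as a raw 16-bit field because the listing does not
say how an @rel offset is derived.

```python
# Sketch only: field positions are read off the listing above; all names are hypothetical.
BRANCH_PREFIX = 0b1111110  # bits 31..25 for every 16-bit-address form

# 3-bit select field (bits 24..22) as it appears in the listing.
SELECT = {
    "LOCPTR": 0b001,  # variant=False -> LOCPTRA, variant=True -> LOCPTRB
    "JMP":    0b010,  # variant=True selects the D-suffixed mnemonic (JMPD, CALLD, ...)
    "CALL":   0b011,
    "CALLA":  0b100,
    "CALLB":  0b101,
    "CALLX":  0b110,
    "CALLY":  0b111,
}


def encode_branch(name, n, variant=False, relative=False, cond=0b1111):
    """Pack 1111110 | select(3) | CCCC(4) | mode(2) | n(16) into one 32-bit word."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("n must fit in 16 bits")
    if not 0 <= cond <= 0b1111:
        raise ValueError("cond must fit in 4 bits")
    # Mode bit 1 is the variant (B pointer or D suffix); mode bit 0 is #abs (0) or @rel (1).
    mode = (int(variant) << 1) | int(relative)
    return (BRANCH_PREFIX << 25) | (SELECT[name] << 22) | (cond << 18) | (mode << 16) | n


print(hex(encode_branch("JMP", 0x1234)))  # JMP #abs with CCCC=1111 -> 0xfcbc1234
```

The default cond=0b1111 corresponds to "always" in the condition list further down.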


Z effect
------------------------------------------------------------------------------------------
0 <none>
1 wz


C effect
------------------------------------------------------------------------------------------
0 <none>
1 wc


L     DDDDDDDDD        destination operand
------------------------------------------------------------------------------------------
0/na  DDDDDDDDD        register
1     #DDDDDDDDD       immediate, zero-extended
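
As a rough illustration of how the Z, C, and L bits combine with the D field, the Python sketch below packs
a single-operand instruction as 1111111 | Z | C | L | CCCC | D | T. The function name encode_single and its
parameters are hypothetical. It does not check whether a given instruction actually accepts an immediate,
WZ, or WC; the flag column and the bit pattern of each row in the listing remain the authority for that.

```python
# Sketch only: layout follows "1111111 ZC L CCCC DDDDDDDDD TTTTTTTTT"; names are hypothetical.
SINGLE_PREFIX = 0b1111111  # bits 31..25 for every single-operand instruction


def encode_single(t, d, immediate=False, wz=False, wc=False, cond=0b1111):
    """Pack a single-operand instruction word.

    t         : 9-bit T field from the listing (e.g. 0b010101011 for SETP)
    d         : 9-bit register address, or 9-bit zero-extended immediate
    immediate : sets L=1, meaning D is written as #D
    wz, wc    : set the Z and C effect bits
    """
    for value, bits, label in ((t, 9, "t"), (d, 9, "d"), (cond, 4, "cond")):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{label} must fit in {bits} bits")
    return (
        (SINGLE_PREFIX << 25)
        | (int(wz) << 24)
        | (int(wc) << 23)
        | (int(immediate) << 22)
        | (cond << 18)
        | (d << 9)
        | t
    )


SETP = 0b010101011   # T field of the SETP row
COGID = 0b000000000  # T field of the COGID row

print(hex(encode_single(SETP, 5, immediate=True)))  # SETP #5    -> 0xfe7c0aab
print(hex(encode_single(COGID, 0x010)))             # COGID $010 -> 0xfe3c2000
```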


I     SSSSSSSSS        source operand
------------------------------------------------------------------------------------------
0/na  SSSSSSSSS        register
1     #SSSSSSSSS       immediate, zero-extended


CCCC  condition        (easier-to-read list)
------------------------------------------------------------------------------------------
0000  never            1111  always (default)
0001  nc  &  nz        1100  if_c                              if_b
0010  nc  &  z         0011  if_nc                             if_ae
0011  nc               1010  if_z                              if_e
0100   c  &  nz        0101  if_nz                             if_ne
0101  nz               1000  if_c_and_z      if_z_and_c
0110   c  <> z         0100  if_c_and_nz     if_nz_and_c
0111  nc  |  nz        0010  if_nc_and_z     if_z_and_nc
1000   c  &  z         0001  if_nc_and_nz    if_nz_and_nc      if_a
1001   c  =  z         1110  if_c_or_z       if_z_or_c         if_be
1010   z               1101  if_c_or_nz      if_nz_or_c
1011  nc  |  z         1011  if_nc_or_z      if_z_or_nc
1100   c               0111  if_nc_or_nz     if_nz_or_nc
1101   c  |  nz        1001  if_c_eq_z       if_z_eq_c
1110   c  |  z         0110  if_c_ne_z       if_z_ne_c
1111  always           0000  never
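
Read as a whole, the list above makes CCCC a 4-entry truth table indexed by the current flags: bit 0 covers
nc & nz, bit 1 covers nc & z, bit 2 covers c & nz, and bit 3 covers c & z. A minimal Python sketch of that
reading follows; the function name condition_met is hypothetical, and the indexing is inferred from the
table rather than stated by it.

```python
# Sketch only: treats CCCC as a lookup table indexed by (C << 1) | Z, as inferred from the list above.
def condition_met(cccc, c, z):
    """Return True if an instruction with condition field cccc would execute for flags c and z."""
    index = (int(c) << 1) | int(z)  # 0: nc&nz, 1: nc&z, 2: c&nz, 3: c&z
    return bool((cccc >> index) & 1)


IF_C = 0b1100       # if_c / if_b
IF_A = 0b0001       # if_nc_and_nz / if_a
IF_C_NE_Z = 0b0110  # if_c_ne_z

print(condition_met(IF_C, c=True, z=False))      # True
print(condition_met(IF_A, c=False, z=True))      # False
print(condition_met(IF_C_NE_Z, c=True, z=True))  # False
```

Under this reading, 1111 executes for every flag combination and 0000 for none, which matches the "always"
and "never" rows.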


CCCC  inda/indb usage (indx = inda/indb) - CCCC is forced to 1111 after pipeline stage 2 if inda/indb is used
------------------------------------------------------------------------------------------
xx00  source indx
xx01  source indx++
xx10  source indx--
xx11  source ++indx

00xx  destination indx
01xx  destination indx++
10xx  destination indx--
11xx  destination ++indx
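
When INDA/INDB is used, the two tables above split CCCC into a source mode (low two bits) and a destination
mode (high two bits). The Python sketch below decodes both halves. The names decode_indirect and
INDIRECT_MODES are hypothetical, and the mode strings are copied from the tables without further
interpretation.

```python
# Sketch only: splits CCCC per the two tables above; names are hypothetical.
INDIRECT_MODES = {
    0b00: "indx",
    0b01: "indx++",
    0b10: "indx--",
    0b11: "++indx",
}


def decode_indirect(cccc):
    """Return the source and destination indirect modes encoded in a 4-bit CCCC field."""
    if not 0 <= cccc <= 0b1111:
        raise ValueError("cccc must fit in 4 bits")
    return {
        "source": INDIRECT_MODES[cccc & 0b11],              # bits 1..0
        "destination": INDIRECT_MODES[(cccc >> 2) & 0b11],  # bits 3..2
    }


print(decode_indirect(0b0110))  # {'source': 'indx--', 'destination': 'indx++'}
```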