PROPELLER 2 MEMORY
------------------

In the Propeller 2, there are two primary types of memory:

HUB MEMORY

    128K bytes of main memory shared by all cogs

        - cogs launch from this memory
        - cogs can access this memory as bytes, words, longs, and quads (4 longs)
        - $00000..$00E7F is ROM - contains Booter, SHA-256/HMAC, and Monitor
        - $00E80..$1FFFF is RAM - for application usage


COG MEMORY (8 instances)

    512 longs of register RAM for code and data usage

        - simultaneous instruction, source, and destination reading, plus writing
        - last eight registers are for I/O pin control

    256 longs of stack RAM for data and video usage

        - accessible via push and pop operations
        - video circuit can read data simultaneously and asynchronously



INSTRUCTION ENCODING
--------------------

Cog instructions are 32 bits long and are composed of several bit fields. There are two main types of
instructions: dual-operand and single-operand. Dual-operand instructions specify both a D register, which
is usually read and written back, and an S register, which is either read or used as an immediate value.
Single-operand instructions specify only a D register.


Dual-operand encoding:

TTTTTT ZCR I CCCC DDDDDDDDD SSSSSSSSS     IF_x    MNEM    D,S/#n  WZ,WC,NR

       TTTTTT = Opcode for the instruction mnemonic (MNEM)
            I = SSSSSSSSS register or immediate, 0=register address (S), 1=immediate (#n)


Single-operand encoding:

000011 ZCR 1 CCCC DDDDDDDDD TTTTTTTTT     IF_x    MNEM    D       WZ,WC,NR

    TTTTTTTTT = Opcode for the instruction mnemonic (MNEM)
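
As a rough illustration of the two layouts above, the following Python sketch packs and unpacks the
bit fields by position. The helper names (pack, unpack, as_text, pack_single) are hypothetical and
not part of any Propeller tool. Field order and widths come from the encoding lines above. The
opcode bits in the two examples are borrowed from listings that appear later in this document
(101000 from the MOV example, 000000001 from COGID); the D and S values are arbitrary.

```python
# Field widths, most significant first, per the dual-operand layout:
# TTTTTT ZCR I CCCC DDDDDDDDD SSSSSSSSS
FIELDS = (("T", 6), ("ZCR", 3), ("I", 1), ("CCCC", 4), ("D", 9), ("S", 9))

def pack(**values):
    """Pack named field values into one 32-bit instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word):
    """Split a 32-bit instruction word back into its fields."""
    out = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out

def as_text(word):
    """Render a word in the spaced binary form used by this document."""
    fields = unpack(word)
    return " ".join(format(fields[name], f"0{width}b") for name, width in FIELDS)

def pack_single(opcode9, d, zcr=0, cccc=0b1111):
    """Single-operand form: T = 000011, I = 1, and the S field holds the 9-bit opcode."""
    return pack(T=0b000011, ZCR=zcr, I=1, CCCC=cccc, D=d, S=opcode9)

# Dual-operand example: arbitrary register addresses $010 and $011, condition 'always'.
dual = pack(T=0b101000, ZCR=0b001, I=0, CCCC=0b1111, D=0x010, S=0x011)
print(as_text(dual))

# Single-operand example: 9-bit opcode 000000001 with D = 5 and ZCR = 001.
single = pack_single(0b000000001, d=5, zcr=0b001)
print(as_text(single))
```

Keeping the field table as data means the same widths drive packing, unpacking, and printing, so
the three cannot drift apart.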


For both cases:

            Z = Z flag write control: 0=don't write Z, 1=write Z
                Defaults to 0, but may be set to 1 by adding WZ (Write Z) after operand(s)

                Unless specified otherwise, the value written to Z is the NOR of all 32 bits of the
                D result, so Z = 1 only when the result is zero.

            C = C flag write control: 0=don't write C, 1=write C
                Defaults to 0, but may be set to 1 by adding WC (Write C) after operand(s)

            R = D register write control: 0=don't write D, 1=write D
                Default varies by instruction, but may be cleared to 0 by adding NR (No Result)

         CCCC = Execution condition (expressed by IF_x mnemonic prefix)
                Determines Z/C flag conditions upon which the instruction will execute

                CCCC  condition       CCCC  mnemonic prefixes (in easy-to-read order)
                ---------------------------------------------------------------------
                0000  never           1111  IF_ALWAYS (default)
                0001  nc &  nz        1100  IF_C                          IF_B
                0010  nc &  z         0011  IF_NC                         IF_AE
                0011  nc              1010  IF_Z                          IF_E
                0100  c  &  nz        0101  IF_NZ                         IF_NE
                0101  nz              1000  IF_C_AND_Z     IF_Z_AND_C
                0110  c  <> z         0100  IF_C_AND_NZ    IF_NZ_AND_C
                0111  nc |  nz        0010  IF_NC_AND_Z    IF_Z_AND_NC
                1000  c  &  z         0001  IF_NC_AND_NZ   IF_NZ_AND_NC   IF_A
                1001  c  =  z         1110  IF_C_OR_Z      IF_Z_OR_C      IF_BE
                1010  z               1101  IF_C_OR_NZ     IF_NZ_OR_C
                1011  nc |  z         1011  IF_NC_OR_Z     IF_Z_OR_NC
                1100  c               0111  IF_NC_OR_NZ    IF_NZ_OR_NC
                1101  c  |  nz        1001  IF_C_EQ_Z      IF_Z_EQ_C
                1110  c  |  z         0110  IF_C_NE_Z      IF_Z_NE_C
                1111  always          0000  IF_NEVER

    DDDDDDDDD = Destination register address (D)

    SSSSSSSSS = Source register address (S) or zero-extended immediate value (#n)
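
One way to read the CCCC table above is as a 4-bit truth table indexed by the current flags: bit
(C*2 + Z) of CCCC decides whether the instruction executes. The Python sketch below checks a few
rows under that reading. The function names are hypothetical, and the indexing rule is an
observation drawn from the table, not a statement about how the hardware is built.

```python
def executes(cccc, c, z):
    """Return True if condition code cccc passes for flag values c and z (0 or 1)."""
    return bool((cccc >> ((c << 1) | z)) & 1)

# A few rows from the table, written as CCCC -> predicate on C and Z.
ROWS = {
    0b0000: lambda c, z: False,            # never
    0b0001: lambda c, z: not c and not z,  # nc & nz
    0b0110: lambda c, z: c != z,           # c <> z
    0b1010: lambda c, z: bool(z),          # z
    0b1100: lambda c, z: bool(c),          # c
    0b1110: lambda c, z: bool(c or z),     # c | z
    0b1111: lambda c, z: True,             # always
}

for cccc, predicate in ROWS.items():
    for c in (0, 1):
        for z in (0, 1):
            assert executes(cccc, c, z) == predicate(c, z)

def z_result(d_result):
    """Default Z value: NOR of the 32 result bits, i.e. 1 only when the result is zero."""
    return int((d_result & 0xFFFFFFFF) == 0)
```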



HUB MEMORY INSTRUCTIONS
-----------------------

These instructions read and write hub memory.

All instructions use D as the data conduit, except WRQUAD/RDQUAD/RDQUADC, which use the four QUAD
registers. The QUADs can be mapped into cog register space using the SETQUAD instruction or kept
hidden, in which case they are still useful as a data conduit and as a read cache. If mapped, the QUADs
overlay four contiguous cog registers. These overlaid registers can be read and written like any other
registers, and they can also be executed. Any write via D to the QUAD registers, when mapped, also
affects the underlying cog registers. A RDQUAD/RDQUADC affects the QUAD registers, but not the
underlying cog registers.

The cached reads RDBYTEC/RDWORDC/RDLONGC/RDQUADC will do a RDQUAD if the current read address is
outside of the 4-long window of the prior RDQUAD. Otherwise, they will immediately return cached
data. The CACHEX instruction invalidates the cache, forcing a fresh RDQUAD next time a cached read
executes.
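
The hit/miss rule described above can be sketched as a small Python model. This is only an
illustration: the class and method names are hypothetical, the model tracks just which 16-byte
window is cached (not the data), and it assumes the cache starts out invalid.

```python
class QuadCache:
    """Toy model of the read cache: remembers which 16-byte window was last read."""

    def __init__(self):
        self.window = None      # None means the cache is invalid (assumed initial state)
        self.rdquads = 0        # number of RDQUADs that the cached reads triggered

    def cached_read(self, address):
        """Model RDBYTEC/RDWORDC/RDLONGC/RDQUADC; return True on a cache hit."""
        window = address >> 4   # a quad is 4 longs = 16 bytes
        if window == self.window:
            return True
        self.window = window    # miss: a fresh RDQUAD refills the cache
        self.rdquads += 1
        return False

    def cachex(self):
        """Model CACHEX: invalidate so that the next cached read does a RDQUAD."""
        self.window = None

cache = QuadCache()
hits = [cache.cached_read(a) for a in (0x100, 0x104, 0x10F, 0x110, 0x114)]
cache.cachex()
after_cachex = cache.cached_read(0x114)
```

In this model, sequential reads through a buffer miss once per 16 bytes and hit otherwise, which
is why the cached reads suit streaming access.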

Hub memory instructions must wait for their cog's hub cycle, which comes once every 8 clocks. The
timing relationship between a cog's instruction stream and its hub cycle is generally indeterminate,
causing these instructions to take varying numbers of clocks. Timing can be made deterministic, though,
by intentionally spacing these instructions apart so that after the first in a series executes, the
subsequent hub memory instructions fall on hub cycles, making them take the minimum number of
clocks. The trick is to write useful code to go in between them.

WRBYTE/WRWORD/WRLONG/WRQUAD/RDQUAD complete on the hub cycle, making them take 1..8 clocks.

RDBYTE/RDWORD/RDLONG complete on the 2nd clock after the hub cycle, making them take 3..10 clocks.

RDBYTEC/RDWORDC/RDLONGC take only 1 clock if data is cached, otherwise 3..10 clocks.

RDQUADC takes only 1 clock if data is cached, otherwise 1..8 clocks.
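
The clock ranges above follow from simple modular arithmetic, which the Python sketch below works
through. It is a simplified model, not a cycle-accurate simulator: the names are hypothetical, the
hub cycle is assumed to fall on clocks where (clock mod 8) equals an arbitrary slot value, and
every non-hub instruction is assumed to take 1 clock.

```python
HUB_PERIOD = 8  # a cog's hub cycle comes once every 8 clocks

def hub_clocks(t, slot=0, extra=0):
    """Clocks taken by a hub instruction issued at clock t.

    slot:  clock (mod 8) at which this cog's hub cycle occurs (assumed value)
    extra: 0 for WRxxxx/RDQUAD, 2 for RDBYTE/RDWORD/RDLONG
    """
    wait = (slot - t) % HUB_PERIOD
    return wait + 1 + extra

writes = sorted({hub_clocks(t) for t in range(HUB_PERIOD)})
reads = sorted({hub_clocks(t, extra=2) for t in range(HUB_PERIOD)})

def run(gap, count, start=3):
    """Cost of each of `count` RDLONGs separated by `gap` one-clock instructions."""
    t = start
    costs = []
    for _ in range(count):
        cost = hub_clocks(t, extra=2)
        costs.append(cost)
        t += cost + gap
    return costs

print(writes)   # range of clocks for writes and RDQUAD
print(reads)    # range of clocks for RDBYTE/RDWORD/RDLONG
print(run(gap=5, count=4))
print(run(gap=0, count=4))
```

Under these assumptions, back-to-back RDLONGs each stall until the next hub cycle, while placing
five 1-clock instructions between them lets every read after the first take its 3-clock minimum.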

After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:


        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

        NOP                     'do something for at least 3 clocks to allow QUADs to update
        NOP
        NOP

        CMP     quad0,quad1     'mapped QUADs are now accessible via D and S


After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:


        SETQUAD #quad0          'map QUADs to quad0..quad3

        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

        NOP                     'do something for at least 3 clocks to allow QUADs to update
        NOP
        NOP

        NOP                     'do at least 1 instruction to get QUADs into pipeline

quad0   NOP                     'QUAD0..QUAD3 are now executable
quad1   NOP
quad2   NOP
quad3   NOP


After a SETQUAD, mapped QUAD registers are writable immediately, but original contents are
readable via D and S after 2 instructions:


        SETQUAD #quad0          'map QUADs to quad0..quad3 (new address)

        NOP                     'do at least two instructions to queue up QUADs
        NOP

        CMP     quad0,quad1     'mapped QUADs are now accessible via D and S


On cog startup, the QUAD registers are cleared to 0's.


instructions                                                                                       clocks
---------------------------------------------------------------------------------------------------------
000000 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRBYTE  D,S       'write lower byte in D at S              1..8
000000 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRBYTE  D,PTR     'write lower byte in D at PTR            1..8
000000 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDBYTE  D,S       'read byte at S into D                  3..10
000000 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDBYTE  D,PTR     'read byte at PTR into D                3..10
000000 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDBYTEC D,S       'read cached byte at S into D        1, 3..10 
000000 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDBYTEC D,PTR     'read cached byte at PTR into D      1, 3..10

000001 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRWORD  D,S       'write lower word in D at S              1..8
000001 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRWORD  D,PTR     'write lower word in D at PTR            1..8
000001 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDWORD  D,S       'read word at S into D                  3..10
000001 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDWORD  D,PTR     'read word at PTR into D                3..10
000001 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDWORDC D,S       'read cached word at S into D        1, 3..10
000001 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDWORDC D,PTR     'read cached word at PTR into D      1, 3..10

000010 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRLONG  D,S       'write D at S                            1..8
000010 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRLONG  D,PTR     'write D at PTR                          1..8
000010 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDLONG  D,S       'read long at S into D                  3..10
000010 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDLONG  D,PTR     'read long at PTR into D                3..10
000010 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDLONGC D,S       'read cached long at S into D        1, 3..10
000010 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDLONGC D,PTR     'read cached long at PTR into D      1, 3..10

000011 000 1 CCCC DDDDDDDDD 010110000     WRQUAD  D         'write QUADs at D                        1..8
000011 001 1 CCCC SUPNNNNNN 010110000     WRQUAD  PTR       'write QUADs at PTR                      1..8
000011 000 1 CCCC DDDDDDDDD 010110001     RDQUAD  D         'read quad at D into QUADs               1..8
000011 001 1 CCCC SUPNNNNNN 010110001     RDQUAD  PTR       'read quad at PTR into QUADs             1..8
000011 010 1 CCCC DDDDDDDDD 010110001     RDQUADC D         'read cached quad at D into QUADs     1, 1..8
000011 011 1 CCCC SUPNNNNNN 010110001     RDQUADC PTR       'read cached quad at PTR into QUADs   1, 1..8
---------------------------------------------------------------------------------------------------------


PTR expressions:

    INDEX = -32..+31 for simple offsets, 0..31 for ++'s, or 0..32 for --'s
    SCALE = 1 for byte, 2 for word, 4 for long, or 16 for quad

    S = 0 for PTRA, 1 for PTRB
    U = 0 to keep PTRx same, 1 to update PTRx
    P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
    NNNNNN = INDEX
    nnnnnn = -INDEX


    SUPNNNNNN     PTR expression
    -----------------------------------------------------------------------------
    000000000     PTRA              'use PTRA
    100000000     PTRB              'use PTRB
    011000001     PTRA++            'use PTRA,                PTRA += SCALE
    111000001     PTRB++            'use PTRB,                PTRB += SCALE
    011111111     PTRA--            'use PTRA,                PTRA -= SCALE
    111111111     PTRB--            'use PTRB,                PTRB -= SCALE
    010000001     ++PTRA            'use PTRA + SCALE,        PTRA += SCALE
    110000001     ++PTRB            'use PTRB + SCALE,        PTRB += SCALE
    010111111     --PTRA            'use PTRA - SCALE,        PTRA -= SCALE
    110111111     --PTRB            'use PTRB - SCALE,        PTRB -= SCALE

    000NNNNNN     PTRA[INDEX]       'use PTRA + INDEX*SCALE
    100NNNNNN     PTRB[INDEX]       'use PTRB + INDEX*SCALE
    011NNNNNN     PTRA++[INDEX]     'use PTRA,                PTRA += INDEX*SCALE
    111NNNNNN     PTRB++[INDEX]     'use PTRB,                PTRB += INDEX*SCALE
    011nnnnnn     PTRA--[INDEX]     'use PTRA,                PTRA -= INDEX*SCALE
    111nnnnnn     PTRB--[INDEX]     'use PTRB,                PTRB -= INDEX*SCALE
    010NNNNNN     ++PTRA[INDEX]     'use PTRA + INDEX*SCALE,  PTRA += INDEX*SCALE
    110NNNNNN     ++PTRB[INDEX]     'use PTRB + INDEX*SCALE,  PTRB += INDEX*SCALE
    010nnnnnn     --PTRA[INDEX]     'use PTRA - INDEX*SCALE,  PTRA -= INDEX*SCALE
    110nnnnnn     --PTRB[INDEX]     'use PTRB - INDEX*SCALE,  PTRB -= INDEX*SCALE


Examples:

000000 Z01 1 CCCC DDDDDDDDD 000000000     RDBYTE  D,PTRA         'read byte at PTRA into D
000001 000 1 CCCC DDDDDDDDD 111000001     WRWORD  D,PTRB++       'write lower word in D at PTRB,      PTRB += 2
000010 Z01 1 CCCC DDDDDDDDD 011111111     RDLONG  D,PTRA--       'read long at PTRA into D,           PTRA -= 4
000011 001 1 CCCC 110000001 010110001     RDQUAD  ++PTRB         'read quad at PTRB+16 into QUADs,    PTRB += 16
000000 000 1 CCCC DDDDDDDDD 010111111     WRBYTE  D,--PTRA       'write lower byte in D at PTRA-1,    PTRA -= 1

000001 000 1 CCCC DDDDDDDDD 100000111     WRWORD  D,PTRB[7]      'write lower word in D to PTRB+7*2
000010 Z11 1 CCCC DDDDDDDDD 011001111     RDLONGC D,PTRA++[15]   'read cached long at PTRA into D,    PTRA += 15*4
000011 001 1 CCCC 111111101 010110000     WRQUAD  PTRB--[3]      'write QUADs at PTRB,                PTRB -= 3*16
000000 000 1 CCCC DDDDDDDDD 010000110     WRBYTE  D,++PTRA[6]    'write lower byte in D to PTRA+6*1,  PTRA += 6*1
000001 Z01 1 CCCC DDDDDDDDD 110110110     RDWORD  D,--PTRB[10]   'read word at PTRB-10*2 into D,      PTRB -= 10*2
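
The SUPNNNNNN field can be reproduced with a few lines of Python, shown here as a sketch for
cross-checking the examples above. The function names and argument conventions are hypothetical
(index is passed as a signed value, negative for the -- forms), and the address arithmetic simply
follows the U and P definitions given earlier.

```python
SCALE = {"byte": 1, "word": 2, "long": 4, "quad": 16}

def encode_ptr(ptr, index=0, update=False, post=False):
    """Build the 9-bit SUPNNNNNN field.

    ptr: "A" or "B"; index: signed -32..31 (negative for the -- forms);
    update: U bit; post: P bit (use PTRx unmodified for the access).
    """
    if ptr not in ("A", "B"):
        raise ValueError("ptr must be 'A' or 'B'")
    if not -32 <= index <= 31:
        raise ValueError("index out of range")
    s = 1 if ptr == "B" else 0
    return (s << 8) | (int(update) << 7) | (int(post) << 6) | (index & 0x3F)

def access(ptr_value, size, index=0, update=False, post=False):
    """Return (hub address used, new PTRx value) for one access, both kept to 17 bits."""
    offset = index * SCALE[size]
    address = ptr_value if post else ptr_value + offset
    new_ptr = ptr_value + offset if update else ptr_value
    return address & 0x1FFFF, new_ptr & 0x1FFFF

print(format(encode_ptr("B", index=1, update=True, post=True), "09b"))   # PTRB++
print(format(encode_ptr("B", index=-10, update=True), "09b"))            # --PTRB[10]
print(access(0x1000, "word", index=-10, update=True))                    # RDWORD D,--PTRB[10]
```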


Bytes, words, longs, and quads are addressed as follows: 

    for WRBYTE/RDBYTE/RDBYTEC, address = %XXXXXXXXXXXXXXXXX (bits 16..0 are used)
    for WRWORD/RDWORD/RDWORDC, address = %XXXXXXXXXXXXXXXX- (bits 16..1 are used)
    for WRLONG/RDLONG/RDLONGC, address = %XXXXXXXXXXXXXXX-- (bits 16..2 are used)
    for WRQUAD/RDQUAD/RDQUADC, address = %XXXXXXXXXXXXX---- (bits 16..4 are used)

address  byte  word    long        quad
-------------------------------------------------------------------
00000-   50   *7250   *706F7250   *0C7CCC030C7C200020302E32706F7250
00001-   72    7250    706F7250    0C7CCC030C7C200020302E32706F7250
00002-   6F   *706F    706F7250    0C7CCC030C7C200020302E32706F7250
00003-   70    706F    706F7250    0C7CCC030C7C200020302E32706F7250
00004-   32   *2E32   *20302E32    0C7CCC030C7C200020302E32706F7250
00005-   2E    2E32    20302E32    0C7CCC030C7C200020302E32706F7250
00006-   30   *2030    20302E32    0C7CCC030C7C200020302E32706F7250
00007-   20    2030    20302E32    0C7CCC030C7C200020302E32706F7250
00008-   00   *2000   *0C7C2000    0C7CCC030C7C200020302E32706F7250
00009-   20    2000    0C7C2000    0C7CCC030C7C200020302E32706F7250
0000A-   7C   *0C7C    0C7C2000    0C7CCC030C7C200020302E32706F7250
0000B-   0C    0C7C    0C7C2000    0C7CCC030C7C200020302E32706F7250
0000C-   03   *CC03   *0C7CCC03    0C7CCC030C7C200020302E32706F7250
0000D-   CC    CC03    0C7CCC03    0C7CCC030C7C200020302E32706F7250
0000E-   7C   *0C7C    0C7CCC03    0C7CCC030C7C200020302E32706F7250
0000F-   0C    0C7C    0C7CCC03    0C7CCC030C7C200020302E32706F7250
00010-   45   *FE45   *0DC1FE45   *0D7CC6010C7CC6010CFCB6E30DC1FE45
00011-   FE    FE45    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
00012-   C1   *0DC1    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
00013-   0D    0DC1    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
00014-   E3   *B6E3   *0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
00015-   B6    B6E3    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
00016-   FC   *0CFC    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
00017-   0C    0CFC    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
00018-   01   *C601   *0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
00019-   C6    C601    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001A-   7C   *0C7C    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001B-   0C    0C7C    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001C-   01   *C601   *0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001D-   C6    C601    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001E-   7C   *0D7C    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001F-   0D    0D7C    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45

* new word/long/quad
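
The table above can be reproduced from the byte column alone, as in the following Python sketch.
It assumes only what the table shows: values are little-endian, and the low address bits not used
by a given size are ignored. The names MEM, SIZES, and read are hypothetical.

```python
# First 16 bytes of hub memory, copied from the byte column of the table.
MEM = bytes.fromhex("50726F70322E302000207C0C03CC7C0C")

SIZES = {"byte": 1, "word": 2, "long": 4, "quad": 16}

def read(address, size):
    """Read a little-endian value, ignoring the low address bits that the size does not use."""
    n = SIZES[size]
    base = (address & 0x1FFFF) & ~(n - 1)
    return int.from_bytes(MEM[base:base + n], "little")

for address in (0x00005, 0x0000B):
    print(f"{address:05X}-   {read(address, 'byte'):02X}    {read(address, 'word'):04X}    "
          f"{read(address, 'long'):08X}    {read(address, 'quad'):032X}")
```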



PTRA/PTRB INSTRUCTIONS
----------------------

Each cog has two 17-bit pointers, PTRA and PTRB, which can be read, written, modified,
and used to access hub memory.

At cog startup, the PTRA and PTRB registers are initialized as follows:

    PTRA = %X_XXXXXXXX_XXXXXXXX, data from launching cog, usually a pointer
    PTRB = %X_XXXXXXXX_XXXXXX00, long address in hub where cog code was loaded from


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 ZCR 1 CCCC DDDDDDDDD 000010010     GETPTRA D         'get PTRA into D, C = PTRA[16]      1
000011 ZCR 1 CCCC DDDDDDDDD 000010011     GETPTRB D         'get PTRB into D, C = PTRB[16]      1

000011 000 1 CCCC DDDDDDDDD 010110010     SETPTRA D         'set PTRA to D                      1
000011 001 1 CCCC nnnnnnnnn 010110010     SETPTRA #n        'set PTRA to 0..511                 1
000011 000 1 CCCC DDDDDDDDD 010110011     SETPTRB D         'set PTRB to D                      1
000011 001 1 CCCC nnnnnnnnn 010110011     SETPTRB #n        'set PTRB to 0..511                 1

000011 000 1 CCCC DDDDDDDDD 010110100     ADDPTRA D         'add D into PTRA                    1
000011 001 1 CCCC nnnnnnnnn 010110100     ADDPTRA #n        'add 0..511 into PTRA               1
000011 000 1 CCCC DDDDDDDDD 010110101     ADDPTRB D         'add D into PTRB                    1
000011 001 1 CCCC nnnnnnnnn 010110101     ADDPTRB #n        'add 0..511 into PTRB               1

000011 000 1 CCCC DDDDDDDDD 010110110     SUBPTRA D         'subtract D from PTRA               1
000011 001 1 CCCC nnnnnnnnn 010110110     SUBPTRA #n        'subtract 0..511 from PTRA          1
000011 000 1 CCCC DDDDDDDDD 010110111     SUBPTRB D         'subtract D from PTRB               1
000011 001 1 CCCC nnnnnnnnn 010110111     SUBPTRB #n        'subtract 0..511 from PTRB          1
-------------------------------------------------------------------------------------------------



QUAD-RELATED INSTRUCTIONS
-------------------------

Each cog has four QUAD registers which form a 128-bit conduit between the hub memory and the cog.
This conduit can transfer four longs every 8 clocks via the WRQUAD/RDQUAD instructions. It can
also be used as a 4-long/8-word/16-byte read cache, utilized by RDBYTEC/RDWORDC/RDLONGC/RDQUADC.

Initially hidden, these QUAD registers are mappable into cog register space by using the SETQUAD
instruction to set an address where the base register is to appear, with the other three registers
following. To hide the QUAD registers, use SETQUAD to set an address of $1FF. SETQUAZ works just
like SETQUAD, but also clears the four QUAD registers.


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 000 1 CCCC 000000000 000001000     CACHEX            'invalidate cache                   1
000011 Z01 1 CCCC DDDDDDDDD 000010001     GETTOPS D         'get top bytes of QUADs into D      1
000011 000 1 CCCC DDDDDDDDD 011100010     SETQUAD D         'set QUAD base to D                 1
000011 001 1 CCCC nnnnnnnnn 011100010     SETQUAD #n        'set QUAD base to 0..511            1
000011 010 1 CCCC DDDDDDDDD 011100010     SETQUAZ D         'set QUAD base to D, QUAD=0         1
000011 011 1 CCCC nnnnnnnnn 011100010     SETQUAZ #n        'set QUAD base to 0..511, QUAD=0    1
-------------------------------------------------------------------------------------------------



HUB CONTROL INSTRUCTIONS
------------------------

These instructions are used to control hub circuits and cogs.

Hub instructions must wait for their cog's hub cycle, which comes once every 8 clocks. In cases where
there is no result to wait for (ZCR = %000), these instructions complete on the hub cycle, making
them take 1..8 clocks, depending on where the hub cycle is in relation to the instruction. In cases
where a result is anticipated (ZCR <> %000), these instructions complete on the 1st clock after the
hub cycle, making them take 2..9 clocks.


COGINIT D,S
-----------

COGINIT is used to start cogs. Any cog can be (re)started, whether it is idle or running. A cog
can even execute a COGINIT to restart itself with a new program.

COGINIT uses D to specify a long address in hub memory that is the start of the program that is to be
loaded into a cog, while S is a 17-bit parameter (usually an address) that will be conveyed to PTRA
of the started cog. PTRB of the started cog will be set to the start address of its program that was
loaded from hub memory.

SETCOG must be executed before COGINIT to set the number of the cog to be started (0..7). If SETCOG
sets a value with bit 3 set (%1xxx), this will cause the next idle cog to be started when COGINIT is
executed, with the number of the cog started being returned in D, and the C flag returning 0 if okay,
or 1 if no idle cog was available. At cog startup, SETCOG is initialized to %0000.

When a cog is started, $1F8 contiguous longs are read from hub memory and written to cog registers
$000..$1F7. The cog will then begin execution at $000. This process takes 1,016 clocks.

Example:

        COGID   COGNUM           'what cog am I?
        SETCOG  COGNUM           'set my cog number
        COGINIT COGPGM,COGPTR    'restart me with the ROM Monitor

COGPGM  LONG    $0070C           'address of the ROM Monitor
COGPTR  LONG    90<<9 + 91       'tx = P90, rx = P91

COGNUM  RES     1
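
For readers checking the constants in the example on a host machine, here is a small Python
sketch. The helper name is hypothetical. It assumes the COGPTR expression groups as (90<<9) + 91,
which is what the 'tx = P90, rx = P91' comment implies; note that Python itself needs the
parentheses, because in Python + binds more tightly than <<.

```python
def pack_param(tx_pin, rx_pin):
    """Combine two pin numbers as (tx << 9) + rx, the grouping the example's comment implies."""
    value = (tx_pin << 9) + rx_pin
    if not 0 <= value < (1 << 17):
        raise ValueError("S parameter must fit in the 17 bits conveyed to PTRA")
    return value

COGPGM = 0x0070C              # program address, copied from the example
COGPTR = pack_param(90, 91)   # tx = P90, rx = P91

assert COGPGM % 4 == 0        # D names a long address, so its low two bits are zero
print(hex(COGPTR), COGPTR >> 9, COGPTR & 0x1FF)
```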


CLKSET  D
---------

CLKSET writes the lower 9 bits of D to the hub clock register:

%R_MMMM_XX_SS

R = 1 for hardware reset, 0 for continued operation

MMMM = PLL mode:

        %0000 for disabled, else XX must be set for XI input or XI/XO crystal oscillator
        %0001 for multiply XI by 2
        %0010 for multiply XI by 3
        %0011 for multiply XI by 4
        %0100 for multiply XI by 5
        %0101 for multiply XI by 6
        %0110 for multiply XI by 7
        %0111 for multiply XI by 8
        %1000 for multiply XI by 9
        %1001 for multiply XI by 10
        %1010 for multiply XI by 11
        %1011 for multiply XI by 12
        %1100 for multiply XI by 13
        %1101 for multiply XI by 14
        %1110 for multiply XI by 15
        %1111 for multiply XI by 16

XX = XI/XO pin mode:

        %00 for XI reads low, XO floats
        %01 for XI input, XO floats
        %10 for XI/XO crystal oscillator with 15pF internal loading and 1M-ohm feedback
        %11 for XI/XO crystal oscillator with 30pF internal loading and 1M-ohm feedback

SS = Clock selector:

        %00 for RCFAST (~20MHz)
        %01 for RCSLOW (~20KHz)
        %10 for XTAL (10MHz-20MHz)
        %11 for PLL


Because the clock register is cleared to %0_0000_00_00 on reset, the chip starts up in RCFAST mode
with both the crystal oscillator and the PLL disabled. Before switching to XTAL or PLL mode from RCFAST
or RCSLOW, the crystal oscillator must be enabled and given 10ms to stabilize. The PLL stabilizes within
10us, so it can be enabled at the same time as the crystal oscillator. Once the crystal is stabilized, you
can switch between XTAL and RCFAST/RCSLOW without any stability concerns. If the PLL is also enabled, you
can switch freely among PLL, XTAL, and RCFAST/RCSLOW modes. You can change the PLL multiplier while
in PLL mode, but beware that some frequency overshoot and undershoot will occur as the PLL settles to its
new frequency. This only poses a hardware problem if you are switching upwards and the resulting overshoot
might exceed the speed limit of the chip.
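
The register layout and the start-up order described above can be sketched in Python as follows.
This only computes the values that would be handed to CLKSET; the function and constant names are
hypothetical, the x8 multiplier is an arbitrary choice, and the settle count assumes the nominal
~20MHz RCFAST rate.

```python
def clkset_value(multiplier=None, xx=0b00, ss=0b00, reset=False):
    """Compose the 9-bit %R_MMMM_XX_SS value.

    multiplier: None for PLL disabled, else 2..16 (MMMM = multiplier - 1).
    """
    if multiplier is None:
        mmmm = 0
    elif 2 <= multiplier <= 16:
        mmmm = multiplier - 1
    else:
        raise ValueError("PLL multiplier must be 2..16")
    if mmmm and xx == 0b00:
        raise ValueError("PLL needs XX set for XI input or crystal oscillator")
    return (int(reset) << 8) | (mmmm << 4) | (xx << 2) | ss

RCFAST, RCSLOW, XTAL, PLL = 0b00, 0b01, 0b10, 0b11   # SS values
XTAL_15PF = 0b10                                     # XX value: crystal, 15pF loading

# Step 1: enable the crystal oscillator and the PLL, but keep running from RCFAST.
step1 = clkset_value(multiplier=8, xx=XTAL_15PF, ss=RCFAST)

# Step 2: wait about 10ms; at a nominal 20MHz that is roughly this many clocks.
settle_clocks = 20_000_000 * 10 // 1000

# Step 3: switch the clock selector to PLL, leaving the other fields unchanged.
step3 = clkset_value(multiplier=8, xx=XTAL_15PF, ss=PLL)

print(format(step1, "09b"), settle_clocks, format(step3, "09b"))
```

Only the SS field differs between the two CLKSET values, so the oscillator and PLL keep running
undisturbed across the switch.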


COGID   D
---------

COGID returns the number of the cog (0..7) into D.


COGSTOP D
---------

COGSTOP stops the cog specified in D (0..7).


LOCKNEW D
LOCKRET D
LOCKSET D
LOCKCLR D
---------

There are eight semaphore locks available in the chip which can be borrowed with LOCKNEW, returned with
LOCKRET, set with LOCKSET, and cleared with LOCKCLR.

While any cog can set or clear any lock without using LOCKNEW or LOCKRET, LOCKNEW and LOCKRET are provided
so that cog programs have a dynamic and simple means of acquiring and relinquishing the locks at run-time.

When a lock is set with LOCKSET, its state is set to 1 and its prior state is returned in C. LOCKCLR works
the same way, but clears the lock's state to 0. By having the hub perform the atomic operation of
setting/clearing and reporting the prior state, cogs can utilize locks to ensure that only one cog has
permission to do something at a time. If a lock starts out cleared and multiple cogs vie for the lock by
doing a 'LOCKSET locknum  wc', the cog that gets C=0 back 'wins' and can have exclusive access to some
shared resource while the other cogs get C=1 back. When the winning cog is done, it can do a
'LOCKCLR locknum' to clear the lock and give another cog the opportunity to get C=0 back.

LOCKNEW returns the next available lock into D, with C=1 if no lock was free.

LOCKRET frees the lock in D so that it can be checked out again by LOCKNEW.

LOCKSET sets the lock in D and returns its prior state in C.

LOCKCLR clears the lock in D and returns its prior state in C.
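
The lock protocol above can be modeled in a few lines of Python. This is a sketch, not a
description of the hub circuitry: the class is hypothetical, each method call stands for one
atomic hub operation, and handing out the lowest-numbered free lock first is an assumption.

```python
class HubLocks:
    """Toy model of the eight hub locks; each call stands for one atomic hub operation."""

    def __init__(self):
        self.state = [0] * 8              # lock bits
        self.checked_out = [False] * 8    # which locks LOCKNEW has handed out

    def locknew(self):
        """Return (lock number, C). C = 1 and no number if every lock is checked out."""
        for n in range(8):                # assumption: lowest free lock is handed out first
            if not self.checked_out[n]:
                self.checked_out[n] = True
                return n, 0
        return None, 1

    def lockret(self, n):
        """Make lock n available to LOCKNEW again."""
        self.checked_out[n] = False

    def lockset(self, n):
        """Set the lock and return its prior state (the C flag)."""
        prior, self.state[n] = self.state[n], 1
        return prior

    def lockclr(self, n):
        """Clear the lock and return its prior state (the C flag)."""
        prior, self.state[n] = self.state[n], 0
        return prior

hub = HubLocks()
locknum, c = hub.locknew()
results = [hub.lockset(locknum) for _cog in range(3)]   # three cogs vie for the lock
released = hub.lockclr(locknum)                         # the winner releases it
retry = hub.lockset(locknum)                            # another cog now wins
```

Because the set and the report of the prior state happen in one step, exactly one of the
competing LOCKSETs sees C = 0.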


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS     COGINIT D,S     'launch cog at D, cog PTRA = S     1..9
000011 000 1 CCCC DDDDDDDDD 000000000     CLKSET  D       'set clock to D                    1..8
000011 001 1 CCCC DDDDDDDDD 000000001     COGID   D       'get cog number into D             2..9
000011 000 1 CCCC DDDDDDDDD 000000011     COGSTOP D       'stop cog in D                     1..8
000011 ZC1 1 CCCC DDDDDDDDD 000000100     LOCKNEW D       'get new lock into D, C = busy     2..9
000011 000 1 CCCC DDDDDDDDD 000000101     LOCKRET D       'return lock in D                  1..8
000011 0C0 1 CCCC DDDDDDDDD 000000110     LOCKSET D       'set lock in D, C = prev state     1..9
000011 0C0 1 CCCC DDDDDDDDD 000000111     LOCKCLR D       'clear lock in D, C = prev state   1..9
-------------------------------------------------------------------------------------------------



INDIRECT REGISTERS
------------------

Each cog has two indirect registers: INDA and INDB. They are located at $1F6 and $1F7.

By using INDA or INDB for D or S, the register pointed at by INDA or INDB is addressed.

INDA and INDB each have three hidden 9-bit registers associated with them: the pointer, the bottom limit, and
the top limit. The bottom and top limits are inclusive values which set automatic wrapping boundaries for the
pointer. This way, circular buffers can be established within cog RAM and accessed using simple INDA/INDB
references.

SETINDA/SETINDB/SETINDS are used to set or adjust the pointer value(s) while forcing the associated bottom and
top limit(s) to $000 and $1FF, respectively.

FIXINDA/FIXINDB/FIXINDS set the pointer(s) to an initial value, while setting the bottom limit(s) to the
lower of the initial and terminal values and the top limit(s) to the higher.

Because indirect addressing must occur in the 2nd stage of the pipeline, long before C and Z are valid for
conditional execution in the 4th stage, all instructions which use indirect addressing are forced to always
execute. This frees the conditional bit field (CCCC) for specifying indirect operations. The top two bits of
CCCC are used for indirect D and the bottom two bits are used for indirect S. If only D or S is indirect, the
other two bits in CCCC are ignored.

Here is the INDA/INDB usage scheme which repurposes the CCCC field:

OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS
-------------------------------------
xxxxxx xxx x 00xx 111110110 xxxxxxxxx        D = INDA        'use INDA
xxxxxx xxx x 00xx 111110111 xxxxxxxxx        D = INDB        'use INDB
xxxxxx xxx x 01xx 111110110 xxxxxxxxx        D = INDA++      'use INDA,      INDA += 1
xxxxxx xxx x 01xx 111110111 xxxxxxxxx        D = INDB++      'use INDB,      INDB += 1
xxxxxx xxx x 10xx 111110110 xxxxxxxxx        D = INDA--      'use INDA,      INDA -= 1
xxxxxx xxx x 10xx 111110111 xxxxxxxxx        D = INDB--      'use INDB,      INDB -= 1
xxxxxx xxx x 11xx 111110110 xxxxxxxxx        D = ++INDA      'use INDA+1,    INDA += 1
xxxxxx xxx x 11xx 111110111 xxxxxxxxx        D = ++INDB      'use INDB+1,    INDB += 1

xxxxxx xxx 0 xx00 xxxxxxxxx 111110110        S = INDA        'use INDA
xxxxxx xxx 0 xx00 xxxxxxxxx 111110111        S = INDB        'use INDB
xxxxxx xxx 0 xx01 xxxxxxxxx 111110110        S = INDA++      'use INDA,      INDA += 1
xxxxxx xxx 0 xx01 xxxxxxxxx 111110111        S = INDB++      'use INDB,      INDB += 1
xxxxxx xxx 0 xx10 xxxxxxxxx 111110110        S = INDA--      'use INDA,      INDA -= 1
xxxxxx xxx 0 xx10 xxxxxxxxx 111110111        S = INDB--      'use INDB,      INDB -= 1
xxxxxx xxx 0 xx11 xxxxxxxxx 111110110        S = ++INDA      'use INDA+1,    INDA += 1
xxxxxx xxx 0 xx11 xxxxxxxxx 111110111        S = ++INDB      'use INDB+1,    INDB += 1


If both D and S are the same indirect register, the two 2-bit fields in CCCC are OR'd together to get the
post-modifier effect:

101000 001 0 0011 111110110 111110110        MOV INDA,++INDA    'Move @INDA+1 into @INDA,   INDA += 1
100000 001 0 1100 111110111 111110111        ADD ++INDB,INDB    'Add @INDB into @INDB+1,    INDB += 1

Note that only '++INDx,INDx'/'INDx,++INDx' combinations can address different registers from the same INDx.


Here are the instructions which are used to set the pointer and limit values for INDA and INDB:

instructions *                                                                             clocks
-------------------------------------------------------------------------------------------------
111000 000 0 0001 000000000 AAAAAAAAA        SETINDA #addrA                                     1
111000 000 0 0011 000000000 AAAAAAAAA        SETINDA ++/--deltA                                 1

111000 000 0 0100 BBBBBBBBB 000000000        SETINDB #addrB                                     1
111000 000 0 1100 BBBBBBBBB 000000000        SETINDB ++/--deltB                                 1 

111000 000 0 0101 BBBBBBBBB AAAAAAAAA        SETINDS #addrB,#addrA                              1
111000 000 0 0111 BBBBBBBBB AAAAAAAAA        SETINDS #addrB,++/--deltA                          1
111000 000 0 1101 BBBBBBBBB AAAAAAAAA        SETINDS ++/--deltB,#addrA                          1
111000 000 0 1111 BBBBBBBBB AAAAAAAAA        SETINDS ++/--deltB,++/--deltA                      1

111001 000 0 0001 TTTTTTTTT IIIIIIIII        FIXINDA #terminal,#initial                         1
111001 000 0 0100 TTTTTTTTT IIIIIIIII        FIXINDB #terminal,#initial                         1
111001 000 0 0101 TTTTTTTTT IIIIIIIII        FIXINDS #terminal,#initial                         1
-------------------------------------------------------------------------------------------------
* addrA/addrB/terminal/initial = register address (0..511),
  deltA/deltB = 9-bit signed delta --256..++255

Examples:

111000 000 0 0001 000000000 000000101        SETINDA #5        'INDA = 5, bottom = 0, top = 511
111000 000 0 0011 000000000 000000011        SETINDA ++3       'INDA += 3, bottom = 0, top = 511
111000 000 0 1100 111111100 000000000        SETINDB --4       'INDB -= 4, bottom = 0, top = 511
111000 000 0 0111 000000111 000001000        SETINDS #7,++8    'INDB = 7, INDA += 8, bottoms = 0, tops = 511

111001 000 0 0001 000001111 000001000        FIXINDA #15,#8    'INDA = 8, bottom = 8, top = 15
111001 000 0 0100 000010000 000011111        FIXINDB #16,#31   'INDB = 31, bottom = 16, top = 31
111001 000 0 0101 001100011 000110010        FIXINDS #99,#50   'INDA/INDB = 50, bottoms = 50, tops = 99



STACK RAM
---------

Each cog has a 256-long stack RAM that is accessible via push and pop operations. Its contents
are not initialized at either reset or cog startup. So, at cog startup, it will contain whatever
it happened to power up with, or whatever was last written.

There are two stack pointers called SPA and SPB which are used to address the stack memory. Aside
from automatically incrementing and decrementing via pushes and pops, SPA and SPB can be set,
modified, read back, and checked:

SETSPA  D/#n      set SPA
SETSPB  D/#n      set SPB
ADDSPA  D/#n      add to SPA
ADDSPB  D/#n      add to SPB
SUBSPA  D/#n      subtract from SPA
SUBSPB  D/#n      subtract from SPB
GETSPA  D         get SPA, SPA==0 into Z, SPA.7 into C
GETSPB  D         get SPB, SPB==0 into Z, SPB.7 into C
GETSPD  D         get SPA minus SPB, SPA==SPB into Z, SPA<SPB into C
CHKSPA            check SPA, SPA==0 into Z, SPA.7 into C
CHKSPB            check SPB, SPB==0 into Z, SPB.7 into C
CHKSPD            check SPA minus SPB, SPA==SPB into Z, SPA<SPB into C

Data can be pushed and popped in both normal and reverse directions:

PUSHA   D/#n      push using SPA
PUSHB   D/#n      push using SPB
PUSHAR  D/#n      push using SPA, use pop addressing
PUSHBR  D/#n      push using SPB, use pop addressing
POPA    D         pop using SPA
POPB    D         pop using SPB
POPAR   D         pop using SPA, use push addressing
POPBR   D         pop using SPB, use push addressing

Aside from data, the program counter and flags can be pushed and popped using calls and returns:

CALLA   D/#n      call using SPA, zeros/Z/C/PC+1 are written @SPA, SPA += 1
CALLB   D/#n      call using SPB, zeros/Z/C/PC+1 are written @SPB, SPB += 1
RETA              return using SPA, Z/C/PC are read @SPA-1, SPA -= 1, if WZ/WC then Z/C updated
RETB              return using SPB, Z/C/PC are read @SPB-1, SPB -= 1, if WZ/WC then Z/C updated

CALLAD/CALLBD/RETAD/RETBD are delayed versions of CALLA/CALLB/RETA/RETB.

SPA and SPB are both initialized to 0 at cog startup.
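
As a sketch of how these instructions combine (X and Y are hypothetical registers declared elsewhere,
and 'main' and 'sub' are hypothetical labels), the following swaps two registers through stack A and
then calls a subroutine whose return address is kept in stack RAM:


        SETSPA  #0              'point SPA at the bottom of stack RAM (also its startup value)

        PUSHA   X               'write X into [SPA++]
        PUSHA   Y               'write Y into [SPA++]
        POPA    X               'read [--SPA] into X, X now holds the old Y (read after write, +1 clock)
        POPA    Y               'read [--SPA] into Y, Y now holds the old X

main    CALLA   #sub            'write Z/C/PC+1 into [SPA++], jump to sub
        <somecode>              'execution resumes here after RETA
        JMP     #main           'loop, so that execution never falls into sub

sub     NOTP    #0              'toggle P0
        RETA                    'read [--SPA] into PC (Z/C are only restored if WZ/WC are added)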


instructions (stack RAM access is shown as [SPx++] and [--SPx])                            clocks
-------------------------------------------------------------------------------------------------
000011 ZC0 1 CCCC 000000000 000010101        CHKSPD          'SPA==SPB into Z, SPA<SPB into C   1
000011 ZC1 1 CCCC DDDDDDDDD 000010101        GETSPD  D       'SPA-SPB into D, Z/C as CHKSPD     1

000011 ZC0 1 CCCC 000000000 000010110        CHKSPA          'SPA==0 into Z, SPA.7 into C       1
000011 ZC1 1 CCCC DDDDDDDDD 000010110        GETSPA  D       'SPA into D, Z/C as CHKSPA         1

000011 ZC0 1 CCCC 000000000 000010111        CHKSPB          'SPB==0 into Z, SPB.7 into C       1
000011 ZC1 1 CCCC DDDDDDDDD 000010111        GETSPB  D       'SPB into D, Z/C as CHKSPB         1

000011 ZC1 1 CCCC DDDDDDDDD 000011000        POPAR   D       'read [SPA++] into D, MSB into C   1
000011 ZC1 1 CCCC DDDDDDDDD 000011001        POPBR   D       'read [SPB++] into D, MSB into C   1

000011 ZC1 1 CCCC DDDDDDDDD 000011010        POPA    D       'read [--SPA] into D, MSB into C   1
000011 ZC1 1 CCCC DDDDDDDDD 000011011        POPB    D       'read [--SPB] into D, MSB into C   1

000011 ZC0 1 CCCC 000000000 000011100        RETA            'read [--SPA] into Z/C/PC*         4
000011 ZC0 1 CCCC 000000000 000011101        RETB            'read [--SPB] into Z/C/PC*         4

000011 ZC0 1 CCCC 000000000 000011110        RETAD           'read [--SPA] into Z/C/PC*         1
000011 ZC0 1 CCCC 000000000 000011111        RETBD           'read [--SPB] into Z/C/PC*         1

000011 000 1 CCCC DDDDDDDDD 010100010        SETSPA  D       'set SPA to D                      1
000011 001 1 CCCC 0nnnnnnnn 010100010        SETSPA  #n      'set SPA to n                      1
000011 000 1 CCCC DDDDDDDDD 010100011        SETSPB  D       'set SPB to D                      1
000011 001 1 CCCC 0nnnnnnnn 010100011        SETSPB  #n      'set SPB to n                      1

000011 000 1 CCCC DDDDDDDDD 010100100        ADDSPA  D       'add D into SPA                    1
000011 001 1 CCCC 0nnnnnnnn 010100100        ADDSPA  #n      'add n into SPA                    1
000011 000 1 CCCC DDDDDDDDD 010100101        ADDSPB  D       'add D into SPB                    1
000011 001 1 CCCC 0nnnnnnnn 010100101        ADDSPB  #n      'add n into SPB                    1

000011 000 1 CCCC DDDDDDDDD 010100110        SUBSPA  D       'subtract D from SPA               1
000011 001 1 CCCC 0nnnnnnnn 010100110        SUBSPA  #n      'subtract n from SPA               1
000011 000 1 CCCC DDDDDDDDD 010100111        SUBSPB  D       'subtract D from SPB               1
000011 001 1 CCCC 0nnnnnnnn 010100111        SUBSPB  #n      'subtract n from SPB               1

000011 000 1 CCCC DDDDDDDDD 010101000        PUSHAR  D       'write D into [--SPA]              1 **
000011 001 1 CCCC nnnnnnnnn 010101000        PUSHAR  #n      'write n into [--SPA]              1 **
000011 000 1 CCCC DDDDDDDDD 010101001        PUSHBR  D       'write D into [--SPB]              1 **
000011 001 1 CCCC nnnnnnnnn 010101001        PUSHBR  #n      'write n into [--SPB]              1 **

000011 000 1 CCCC DDDDDDDDD 010101010        PUSHA   D       'write D into [SPA++]              1 **
000011 001 1 CCCC nnnnnnnnn 010101010        PUSHA   #n      'write n into [SPA++]              1 **
000011 000 1 CCCC DDDDDDDDD 010101011        PUSHB   D       'write D into [SPB++]              1 **
000011 001 1 CCCC nnnnnnnnn 010101011        PUSHB   #n      'write n into [SPB++]              1 **

000011 000 1 CCCC DDDDDDDDD 010101100        CALLA   D       'write Z/C/PC* into [SPA++], PC=D  4 **
000011 001 1 CCCC nnnnnnnnn 010101100        CALLA   #n      'write Z/C/PC* into [SPA++], PC=n  4 **
000011 000 1 CCCC DDDDDDDDD 010101101        CALLB   D       'write Z/C/PC* into [SPB++], PC=D  4 **
000011 001 1 CCCC nnnnnnnnn 010101101        CALLB   #n      'write Z/C/PC* into [SPB++], PC=n  4 **

000011 000 1 CCCC DDDDDDDDD 010101110        CALLAD  D       'write Z/C/PC* into [SPA++], PC=D  1 **
000011 001 1 CCCC nnnnnnnnn 010101110        CALLAD  #n      'write Z/C/PC* into [SPA++], PC=n  1 **
000011 000 1 CCCC DDDDDDDDD 010101111        CALLBD  D       'write Z/C/PC* into [SPB++], PC=D  1 **
000011 001 1 CCCC nnnnnnnnn 010101111        CALLBD  #n      'write Z/C/PC* into [SPB++], PC=n  1 **
-------------------------------------------------------------------------------------------------
* bit 10 is Z, bit 9 is C, bits 8..0 are PC, upper bits are ignored or cleared
** if a stack RAM write is immediately followed by a stack RAM read, add one clock
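
Because PUSHA writes [SPA++] and POPBR reads [SPB++], the two pointers can be paired to use stack RAM
as a first-in-first-out buffer. This is only a sketch inferred from the table above, with A, B, and N
as hypothetical registers:

        SETSPA  #0              'SPA serves as the write pointer
        SETSPB  #0              'SPB serves as the read pointer, starting at the same address

        PUSHA   #10             'write 10 into [SPA++]
        PUSHA   #20             'write 20 into [SPA++]

        GETSPD  N               'N = SPA minus SPB = 2 longs waiting
        POPBR   A               'read [SPB++] into A, A = 10 (oldest entry first)
        POPBR   B               'read [SPB++] into B, B = 20
        CHKSPD          WZ      'Z = 1, since SPA==SPB means the buffer is empty again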



BYTE/WORD FIELD MOVER
---------------------

Each cog has a field mover that can move a byte or word from any field in S into any field in D. To use
the field mover, you must first configure it using SETF. Then, you can use MOVF to perform the moves.

SETF uses a 9-bit value to configure the field mover:

    %W_DDdd_SSss

        W = 1 for word mode, 0 for byte mode

        DD = D field mode:     %00 = D field pointer stays same after MOVF
                               %01 = D field pointer stays same after MOVF, D rotates left by byte/word
                               %10 = D field pointer increments after MOVF
                               %11 = D field pointer decrements after MOVF

        dd = D field pointer:  %00 = byte 0 / word 0
                               %01 = byte 1 / word 0
                               %10 = byte 2 / word 1
                               %11 = byte 3 / word 1

        SS = S field mode:     %0x = S field pointer stays same after MOVF
                               %10 = S field pointer increments after MOVF
                               %11 = S field pointer decrements after MOVF

        ss = S field pointer:  %00 = byte 0 / word 0
                               %01 = byte 1 / word 0
                               %10 = byte 2 / word 1
                               %11 = byte 3 / word 1

On cog startup, SETF is initialized to %0_0100_0000, so that MOVF will rotate D left by 8 bits and
then fill the bottom byte with the lower byte in S.
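
As a sketch of both uses (packed, b0..b3, src, and dst are hypothetical registers), the first sequence
relies on the startup configuration to pack four bytes into one long, and the second reconfigures the
field mover to reverse the byte order of a long. The second sequence assumes that SETF takes effect
before the next MOVF and that the field pointers step once per MOVF, exactly as described above:


        'pack the lower bytes of b3..b0 into one long (startup configuration %0_0100_0000)

        MOVF    packed,b3       'rotate packed left by 8 bits, bottom byte = lower byte of b3
        MOVF    packed,b2       'rotate again, bottom byte = lower byte of b2
        MOVF    packed,b1       'rotate again, bottom byte = lower byte of b1
        MOVF    packed,b0       'packed = {b3, b2, b1, b0}, with b3 in the top byte


        'reverse the four bytes of src into dst

        SETF    #%0_1000_1111   'byte mode, D pointer = byte 0 incrementing, S pointer = byte 3 decrementing
        MOVF    dst,src         'src byte 3 into dst byte 0
        MOVF    dst,src         'src byte 2 into dst byte 1
        MOVF    dst,src         'src byte 1 into dst byte 2
        MOVF    dst,src         'src byte 0 into dst byte 3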


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 000 1 CCCC DDDDDDDDD 011001010        SETF    D       'Configure field mover with D      1
000011 001 1 CCCC nnnnnnnnn 011001010        SETF    #n      'Configure field mover with 0..511 1
000101 000 0 CCCC DDDDDDDDD SSSSSSSSS        MOVF    D,S     'Move field from S into D          1
000101 000 1 CCCC DDDDDDDDD nnnnnnnnn        MOVF    D,#n    'Move field from 0..511 into D     1
-------------------------------------------------------------------------------------------------



MULTI-TASKING
-------------

Each cog has four sets of flags and program counters (Z/C/PC), constituting four unique tasks that
can execute and switch on each instruction cycle.

At cog startup, the tasks are initialized as follows:


task Z  C  PC
---------------
0    0  0  $000
1    0  0  $001
2    0  0  $002
3    0  0  $003


There are 16 rotating time slots in the TASK register that determine task sequence. Initially, all
time slots are set to 0, causing task 0 to execute exclusively, starting at address $000:


   time slots:   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
                  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
TASK register:  %00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00


The two LSB's of TASK always determine which task will execute next. After each instruction cycle,
the TASK register is rotated right by two bits, recycling slot 0 to slot 15 and getting the next task
into the 2 LSB's.


To enable other tasks, SETTASK is used to set the TASK register:

SETTASK D               write D to the TASK register
SETTASK #n              write {n[7:0], n[7:0], n[7:0], n[7:0]} to the TASK register

If a task is given no time slot, it doesn't execute and its flags and PC stay at initial values. If a
task is given a time slot, it will execute and its flags and PC will be updated at every instruction,
or time slot. If an active task's time slots are all taken away, that task's flags and PC remain in the
state where they left off, until it is given another time slot.
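

As a sketch of uneven time-slot allocation ('main' and 'blink' are hypothetical labels), the following
gives task 0 three of every four time slots and task 1 the remaining one:


        ORG

        JMP     #main           'task 0 begins here when the cog starts
        JMP     #blink          'task 1 begins here once it is given a time slot

main    SETTASK #%%1000         'slots 0..2 = task 0, slot 3 = task 1 (TASK = %01_00_00_00 repeated 4x)

:loop   <somecode>              'task 0 work, runs in 3 of every 4 time slots
        JMP     #:loop

blink   NOTP    #1              'task 1, toggle pin 1, runs in 1 of every 4 time slots
        JMP     #blink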


To immediately force any of the four PC's to a new address, JMPTASK can be used. JMPTASK uses a 4-bit
mask to select which PC's are going to be written. Mask bits 0..3 represent PC's 0..3. The mask value
%1010 would write PC 3 and PC 1, while %0100 would write PC 2, only.

JMPTASK D,#mask         force PC's in mask to D
JMPTASK #addr,#mask     force PC's in mask to #addr

For every PC/task affected by a JMPTASK instruction, all affected-task instructions currently in the
pipeline are cancelled. This ensures that once JMPTASK executes, the next instruction from each
affected task will be from the new address.
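
A minimal sketch of the mask usage, assuming the executing task acts as a supervisor ('restart' is a
hypothetical label and 'target' a hypothetical register holding an address):

        JMPTASK #restart,#%0010 'force PC 1 to restart, task 1's pipelined instructions are cancelled
        JMPTASK target,#%1100   'force PC 3 and PC 2 to the address held in target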


Here is an example in which all four tasks are started and each task toggles an I/O pin at a
different rate:


        ORG

        JMP     #task0          'task 0 begins here when the cog starts (this JMP takes 4 clocks)
        JMP     #task1          'task 1 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
        JMP     #task2          'task 2 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
        JMP     #task3          'task 3 begins here after task 0 executes SETTASK (this JMP takes 1 clock)

task0   SETTASK #%%3210         'enable all tasks (TASK = %11_10_01_00_11_10_01_00_11_10_01_00_11_10_01_00)

:loop   NOTP    #0              'task 0, toggle pin 0               - loops every 8 clocks
        JMP     #:loop          '(this JMP takes 1 clock)

task1   NOTP    #1              'task 1, toggle pin 1               - loops every 12 clocks
        NOP
        JMP     #task1          '(this JMP takes 1 clock)

task2   NOTP    #2              'task 2, toggle pin 2               - loops every 16 clocks
        NOP                     
        NOP
        JMP     #task2          '(this JMP takes 1 clock)

task3   NOTP    #3              'task 3, toggle pin 3               - loops every 20 clocks
        NOP
        NOP
        NOP
        JMP     #task3          '(this JMP takes 1 clock)


------------------------------------------------------------------------------------------------------------
NOTE: When a normal branch instruction (JMP, CALL, RET, etc.) executes in the 4th and final stage of the
pipeline, all instructions progressing through the lower three stages, which belong to the same task as the
branch instruction, are cancelled. This inhibits execution of incidental data that was trailing the branch
instruction.

The delayed branch instructions (JMPD, CALLD, RETD, etc.) don't do any pipeline instruction cancellation and
exist to provide 1-clock branches to single-task programs, where the three instructions following the branch
are allowed to execute before the new instruction stream begins to execute.

For single-task programs, normal branches take 4 clocks: 1 clock for the branch and 3 clocks for the
cancelled instructions to come through the pipeline before the new instruction stream begins to execute.

For multi-tasking programs that use all four tasks in sequence (i.e. SETTASK #%%3210), there are never any
same-task instructions in the pipeline that would require cancellation due to branching, so all branches
take just 1 clock.
------------------------------------------------------------------------------------------------------------


Tips for coding multi-tasking programs
--------------------------------------

While all tasks in a multi-tasking program can execute atomic instructions without any inter-task conflict,
remember that there's only one of each of the following cog resources and only one task can use it at a time:

  SPA
  SPB
  INDA
  INDB
  PTRA
  PTRB
  ACCA
  ACCB
  32x32 multiplier
  64/32 divider
  64-bit square rooter
  CORDIC computer
  CTRA
  CTRB
  VID
  PIX (not usable in multi-tasking, requires single-task timing)
  XFR
  SER
  REPS/REPD
  SETF/MOVF

When writing multi-task programs, be aware that instructions that take multiple clocks will stall the
pipeline and have a ripple effect on the tasks' timing. This may be impossible to avoid, as some task
might need to access hub memory, and those instructions are not single-clock.

The WAITCNT/WAITPEQ/WAITPNE instructions should be recoded discretely using 1-clock instructions, to
avoid stalling the pipeline for excessive amounts of time.

The following instructions (WC versions) will take 1 clock, instead of potentially many, and return 1 in
C if they were successful:

  SNDSER  D  WC      attempt to send serial
  RCVSER  D  WC      attempt to receive serial
  GETMULL D  WC      attempt to get lower multiplier result
  GETMULH D  WC      attempt to get upper multiplier result
  GETDIVQ D  WC      attempt to get divider quotient result
  GETDIVR D  WC      attempt to get divider remainder result
  GETSQRT D  WC      attempt to get square root result
  GETQX   D  WC      attempt to get CORDIC X result
  GETQY   D  WC      attempt to get CORDIC Y result
  GETQZ   D  WC      attempt to get CORDIC Z result

Other instruction alternatives:

  POLCTRA    WC      returns 1 in C if CTRA rolled over, use instead of SYNCTRA
  POLCTRB    WC      returns 1 in C if CTRB rolled over, use instead of SYNCTRB
  POLVID     WC      returns 1 in C if WAITVID is ready, use to execute WAITVID without stalling
  PASSCNT D          jumps to itself if some amount of time has not passed, use instead of WAITCNT
  JP/JNP  D,S        jumps based on pin states, use instead of WAITPEQ/WAITPNE
  DJNZ    D,#$       loops until done, use instead of NOP D/#n

The following instruction will not work in a multi-tasking program:

  GETPIX             needs steady pipeline delays for perspective divider time - single-task only
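

A sketch of the polling style described above, for one task of a multi-tasking program ('wait' is a
hypothetical label, 'result' a hypothetical register, and the multiply is assumed to have been started
earlier):

wait    GETMULL result  WC      'attempt to get lower multiplier result, C=1 if successful
 if_nc  JMP     #wait           'not ready yet, try again in this task's next time slot

        <somecode>              'result is valid here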


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 000 1 CCCC DDDDDDDDD 01001mmmm        JMPTASK D,#mask  'Set PC's in mask to D            1
000011 001 1 CCCC nnnnnnnnn 01001mmmm        JMPTASK #n,#mask 'Set PC's in mask to 0..511       1

000011 000 1 CCCC DDDDDDDDD 011001011        SETTASK D        'Set TASK to D                    1
000011 001 1 CCCC 0nnnnnnnn 011001011        SETTASK #n       'Set TASK to n[7:0] copied 4x     1
-------------------------------------------------------------------------------------------------



PIPELINE
--------

Each cog has a 4-stage pipeline which all instructions progress through, in order to execute:


  1st stage    - Read instruction from cog register RAM

  2nd stage    - Determine any indirect or remapped D and S addresses within instruction
                 Update INDA and INDB

  3rd stage    - Read D and S from cog register RAM

  4th stage    - Execute instruction using D and S
                 Write any D result to cog register RAM
                 Update Z/C/PC and any other results


On every clock cycle, the instruction data in each stage advances to the next stage, unless the instruction
in the 4th stage is stalling the pipeline because it's waiting for something (e.g. WRBYTE waits for the hub).

To keep D and S data current within the pipeline, the resultant D from the 4th stage is passed back to
the 3rd stage to substitute for any obsoleted D or S data currently being read from the cog register RAM.
The same is done for instruction data currently being read in the 1st stage, but this still leaves a two-
stage gap between when a register is modified and when it can be executed:


        'single-task self-modifying code

        MOVI    :inst,top9         '(initially 4th stage) modify instruction
        NOP                        '(initially 3rd stage) 1...
        NOP                        '(initially 2nd stage) 2... at least two instructions in-between
:inst   ADD     A,B                '(initially 1st stage) modified instruction properly executes


Tasks that execute no more frequently than every 3rd time slot don't need to observe this 2-instruction
spacer rule when executing self-modifying code, because their instructions will always be sufficiently spread
apart in the pipeline by other tasks' instructions, enabling a just-modified instruction to be properly read
and executed in that task's next time slot. If fewer than two spacers are afforded to a modify-execute sequence,
the old instruction will be read and executed, instead of the newly-modified one. This can be used to advantage
for efficient overlapped modify-execute sequences.

When a branch instruction executes, that task's program counter is abruptly changed from what had been a
steadily incrementing course, requiring that the pipeline be reloaded, beginning at the new program counter
address. This can leave up to three instructions in the pipeline which were trailing the branch instruction
and belong to the same task as the branch.

Normally, these trailing instructions are incidental data which are not intended for execution, and therefore
must be cancelled within the pipeline, so that they pass through without doing anything. However, in the case
of a single-task program, it may be desirable to allow those instructions to execute, without cancellation, to
increase pipeline efficiency. This will result in the branch taking just 1 clock cycle, but three trailing
instructions will be executed before the branch appears to take effect:


        'single-task delayed branch

        JMPD    #somewhere      '(initially 4th stage) do a delayed jmp, then toggle P0 and cycle P1
        NOTP    #0              '(initially 3rd stage)
        NOTP    #1              '(initially 2nd stage)
        NOTP    #1              '(initially 1st stage) next instruction is loaded from 'somewhere'


To accommodate both cancelling and non-cancelling branches, branch instructions have two versions. The ones
that end in the letter 'D' for 'delayed' are non-cancelling and take only one clock, and are intended only for
use in single-task programs.

The branch instructions that don't end in the letter 'D' are what would be considered 'normal' branches, as
they cancel any same-task instructions in the pipeline, so that the next instruction to execute after the
branch would be the instruction which was branched to.

Here are all the branching instructions:


       'normal'        'delayed'
        cancelling      non-cancelling
        ----------      --------------
        JMP             JMPD                     jump to address
        CALL            CALLD                    call subroutine
        RET             RETD                     return from subroutine
        JMPRET          JMPRETD                  general case branch instruction
        TASKSW          TASKSWD                  switch between threads
        CALLA           CALLAD                   call using stack @SPA
        CALLB           CALLBD                   call using stack @SPB
        RETA            RETAD                    return using stack @SPA
        RETB            RETBD                    return using stack @SPB
        IJZ             IJZD                     increment D and jump if result zero
        IJNZ            IJNZD                    increment D and jump if result not zero
        DJZ             DJZD                     decrement D and jump if result zero
        DJNZ            DJNZD                    decrement D and jump if result not zero
        TJZ             TJZD                     test D and jump if result zero
        TJNZ            TJNZD                    test D and jump if result not zero
        JP              JPD                      jump if pin D reads high
        JNP             JNPD                     jump if pin D reads low

        PASSCNT                                  loop until CNTL passes D
        JMPTASK                                  jump selected tasks to address



INSTRUCTION-BLOCK REPEATING
---------------------------

Each cog has an instruction-block repeater that can variably repeat up to 64 instructions without
any clock-cycle overhead.

REPD and REPS are used to initiate block repeats. These instructions specify how many times the
trailing instruction block will be executed and how many instructions are in the block:


REPD    #i       - execute 1..64 instructions infinitely, requires 3 spacer instructions *
REPD    D,#i     - execute 1..64 instructions D+1 times, requires 3 spacer instructions *
REPD    #n,#i    - execute 1..64 instructions 1..512 times, requires 3 spacer instructions *

REPS    #n,#i    - execute 1..64 instructions 1..16384 times, requires 1 spacer instruction *


REPS differs from REPD by executing at the 2nd stage of the pipeline, instead of the 4th. By
executing two stages earlier, it needs only one spacer instruction *. Because of its earliness,
no conditional execution is possible, so it is forced to always execute, allowing the CCCC bits
to be repurposed, along with Z, to provide a 14-bit constant for the repeat count.

The instruction-block repeater will quit repeating the block if a branch instruction executes
within the block. Care must be taken, though, if using JMPTASK to affect a task which may be
using the block repeater, as it will not cancel the block repeater. To get around this, the block
repeater can be benignly reassigned to the task doing the JMPTASK, before the JMPTASK executes:


        REPS    #1,#1           'effectively cancel the block repeater
        JMPTASK D/#n,#mask      'now do the JMPTASK


* Spacer instructions are required in 1-task applications to allow the pipeline to prime before
repeating can commence. If REPD is used by a task that uses no more than every 4th time slot, no
spacers are needed, as three intervening instructions will be provided by the other task(s). If
REPS is used by a task that uses no more than every 2nd time slot, no spacers are needed.


Example (1-task):

        REPD    D,#1            'execute 1 instruction D+1 times

        NOP                     '3 spacer instructions needed (could do something useful)
        NOP
        NOP

        NOTP    #0              'toggle P0, block repeats every 1 clock


Example (1-task):

        REPS    #10_000,#4      'execute 4 instructions 10,000 times (REPS allows 1..16384)

        NOP                     '1 spacer instruction needed (make the most of it)

        NOTP    #0              'toggle P0
        NOTP    #1              'toggle P1
        NOTP    #2              'toggle P2
        NOTP    #3              'toggle P3, block repeats every 4 clocks


Example (4-task, SETTASK #%%3210 timing):

task0   REPD    #1             'task0 will own the block repeater (no need for spacers)
        NOTP    #0             'toggle P0 every 4 clocks

task1   NOTP    #1             'toggle P1 every 8 clocks
        JMP     #task1

task2   NOTP    #2             'toggle P2 every 8 clocks
        JMP     #task2

task3   NOTP    #3             'toggle P3 every 8 clocks
        JMP     #task3


instructions (iiiiii = #i-1, nnnnnnnnn/n___nnnn_nnnnnnnnn = #n-1)                             clocks
----------------------------------------------------------------------------------------------------
000011 000 1 CCCC 111111111 001iiiiii        REPD    #i      'execute 1..64 inst's infinitely      1
000011 000 1 CCCC DDDDDDDDD 001iiiiii        REPD    D,#i    'execute 1..64 inst's D+1 times       1
000011 001 1 CCCC nnnnnnnnn 001iiiiii        REPD    #n,#i   'execute 1..64 inst's 1..512 times    1
000011 n11 1 nnnn nnnnnnnnn 001iiiiii        REPS    #n,#i   'execute 1..64 inst's 1..16384 times  1
----------------------------------------------------------------------------------------------------
Note that the %iiiiii field holds #i-1, so encoded values 0..63 represent 1..64 instructions. The
%nnnnnnnnn and %n___nnnn_nnnnnnnnn fields likewise hold #n-1.
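
Examples (a sketch of how the operands map onto the fields, assuming the 14-bit REPS count is laid out
most-significant bit first as %n___nnnn_nnnnnnnnn):

000011 001 1 CCCC 000000111 001000011        REPD    #8,#4       '#n-1 = 7, #i-1 = 3
000011 011 1 0000 000000000 001000000        REPS    #1,#1       '#n-1 = 0, #i-1 = 0
000011 111 1 0011 100001111 001000011        REPS    #10_000,#4  '#n-1 = 9999 = %1_0011_100001111, #i-1 = 3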



HUB COUNTER
-----------

The hub contains a 64-bit counter called CNT that increments on each clock cycle. Each cog can use CNT
to mark time in various ways. On chip reset, the ROM Booter initializes CNT to $00000000_00000000. For
the purpose of describing the cog instructions which relate to CNT, the lower long of CNT is also
called CNTL and the upper long, delayed by one clock cycle, is called CNTH. The one-clock delay of CNTH
enables proper reading of the entire CNT value when two instructions must be used in sequence to access
its bottom and top longs.

Here are the instructions which relate to CNT:

GETCNT  D               Get CNTL into D. If another GETCNT is executed in the next clock cycle by the
                        same task, it gets CNTH into D.

SUBCNT  D               Get CNTL minus D into D. If another SUBCNT is executed in the next clock cycle
                        by the same task, it gets CNTH minus D minus carry from previous SUBCNT into D.
                        In either case, the logical not of the MSB of the D result (not the carry) goes
                        into C, indicating by C=1 if CNTL (or CNT) has exceeded the original D value(s).

CMPCNT  D               Same as SUBCNT, but doesn't store the D result(s). Useful for periodic checking
                        if a time target has been reached yet.

PASSCNT D               Jump to self if MSB of CNTL minus D is 1. In other words, loop until CNTL
                        exceeds D. This is intended as a non-pipeline-stalling alternative to WAITCNT,
                        for use in multi-task programs.

WAITCNT D,S/#n          Wait for CNTL to be equal to D. Adds S/#n into D.

WAITCNT D,S/#n  WC      Wait for CNT to be equal to the concatenation of the last-written D value and
                        the D expressed in the WAITCNT. Adds S/#n into D. Carry from the addition goes
                        into C.

WAITPEQ D,S/#n  WC      Like WAITPEQ without WC, except the last-written D value becomes a CNTL timeout
                        target, with C returning 0 if the WAITPEQ condition was met, or 1 if the timeout
                        occurred first.

WAITPNE D,S/#n  WC      Like WAITPNE without WC, except the last-written D value becomes a CNTL timeout
                        target, with C returning 0 if the WAITPNE condition was met, or 1 if the timeout
                        occurred first.


Examples:

        'Measure time using lower 32 bits of CNT

        GETCNT  ticks           'get CNTL into ticks
        <somecode>              'execute some code
        SUBCNT  ticks           'get CNTL minus ticks into ticks, <somecode> took ticks-1 clocks to execute


        'Measure time using full 64 bits of CNT (single task)

        GETCNT  ticks_low       'get CNT into {ticks_high, ticks_low}
        GETCNT  ticks_high
        <somecode>              'execute some code
        SUBCNT  ticks_low       'get CNT minus {ticks_high, ticks_low} into {ticks_high, ticks_low}
        SUBCNT  ticks_high      '<somecode> took {ticks_high, ticks_low}-1 clocks to execute


        'Do something for some time

        GETCNT  ticks           'get CNTL
        ADD     ticks,#500      'add 500

loop    <somecode>              'execute some code
        CMPCNT  ticks       WC  'check if 500 clocks have elapsed yet
 if_nc  JMP     #loop           'if not, loop


        'Do something every Nth clock (multi-task)

        GETCNT  ticks           'get CNTL

loop    ADD     ticks,#500      'add 500
        PASSCNT ticks           'wait for next 500th clock
        <somecode>              'execute some code
        JMP     #loop           'loop


        'Do something every Nth clock (single-task)

        GETCNT  ticks           'get CNTL
        ADD     ticks,#500      'add initial 500

loop    WAITCNT ticks,#500      'wait for next 500th clock, add next 500
        <somecode>              'execute some code
        JMP     #loop           'loop


        'Wait for pins to equal a value, with time-out

        GETCNT  ticks           'get CNTL
        ADD     ticks,#200      'allow 200 clock cycles for WAITPEQ (CNTL target is last-stored value)
        WAITPEQ value,mask  WC  'wait for (pins & mask) = value
 if_c   JMP     #timeout        'if C=1 then timeout occurred, else pin condition was met


instructions                                                                                  clocks
----------------------------------------------------------------------------------------------------
000011 ZC0 1 CCCC DDDDDDDDD 000001100        CMPCNT  D       'compares D to CNTL, C = CNTL >= D    1
000011 ZC1 1 CCCC DDDDDDDDD 000001100        SUBCNT  D       'subtracts D from CNTL, then CNTH     1
000011 000 1 CCCC DDDDDDDDD 000001101        PASSCNT D       'loops until CNTL passes D            1*
000011 001 1 CCCC DDDDDDDDD 000001101        GETCNT  D       'gets CNTL, then CNTH                 1
111111 0CR I CCCC DDDDDDDDD SSSSSSSSS        WAITCNT D,S     'wait for CNTL or CNT (WC), D += S    ?
111111 110 I CCCC DDDDDDDDD SSSSSSSSS        WAITPEQ D,S  WC 'wait for (pins & S) = D, do timeout  ?
111111 111 I CCCC DDDDDDDDD SSSSSSSSS        WAITPNE D,S  WC 'wait for (pins & S) <> D, do timeout ?
----------------------------------------------------------------------------------------------------
* 1 + number of other instructions in the pipeline (0..3) which belong to the executing task



BRANCHES
--------

As elaborated on in the pipeline section, there are both normal and delayed branching instructions.
The normal branching instructions cancel any same-task instructions which are in the pipeline, causing
the next instruction that executes in that task to be from the address that was branched to. The delayed
branching instructions, intended only for single-task programs, do not cancel any pipelined instructions,
allowing the three trailing instructions in the pipeline to execute before the branch appears to take
effect. The advantage in using delayed branches is that they only take one clock, but careful programming
is required to accommodate the three trailing instructions:


loop            MOV     X,#100          'toggle P0/P1/P2 100 times, then toggle P3

loop2           DJNZD   X,#loop2        'loop, delayed branch executes 3 trailing instructions
                NOTP    #0              'toggle P0
                NOTP    #1              'toggle P1
                NOTP    #2              'toggle P2

                NOTP    #3              'now toggle P3
                JMP     #loop           'do it again
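

The delayed-branch behavior in the example above can be checked with a small behavioral model. This
is only a sketch in Python, not a cycle-accurate simulator: it assumes a single task, one clock per
instruction, and exactly three trailing instructions after a taken delayed branch. The function
name and the STOP pseudo-instruction (standing in for the final JMP) are hypothetical.

```python
def run_delayed_loop(count=100):
    """Model one pass of the DJNZD example; returns (pin toggle counts, clocks)."""
    program = [
        ("MOV", count),   # 0: loop   MOV   X,#count
        ("DJNZD", 1),     # 1: loop2  DJNZD X,#loop2
        ("NOTP", 0),      # 2: trailing instruction 1
        ("NOTP", 1),      # 3: trailing instruction 2
        ("NOTP", 2),      # 4: trailing instruction 3
        ("NOTP", 3),      # 5: runs only after the loop ends
        ("STOP", None),   # 6: stands in for JMP #loop (assumption: stop after one pass)
    ]
    x = 0
    toggles = [0, 0, 0, 0]
    pending = None        # (target, trailing instructions still to execute)
    pc = clocks = 0
    while program[pc][0] != "STOP":
        op, arg = program[pc]
        clocks += 1       # assumption: every instruction takes one clock
        next_pc = pc + 1
        if pending is not None:
            target, remaining = pending
            remaining -= 1
            if remaining == 0:
                next_pc, pending = target, None   # branch finally takes effect
            else:
                pending = (target, remaining)
        if op == "MOV":
            x = arg
        elif op == "DJNZD":
            x -= 1
            if x != 0:
                pending = (arg, 3)                # three trailing instructions follow
        elif op == "NOTP":
            toggles[arg] += 1
        pc = next_pc
    return toggles, clocks
```

In this model, the three NOTP instructions after DJNZD execute on every pass, including the final
pass where the branch is not taken, so P0/P1/P2 toggle once per loop iteration and P3 toggles once.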


In the branch instruction definitions below, only normal branches are shown, though any of them can be
made into delayed branches by adding a 'D' to their mnemonic (e.g. JMP becomes JMPD).

The JMP (jump), CALL, and RET (return) instructions are specific cases of the JMPRET instruction. CALL
works by simultaneously jumping to a labeled subroutine and storing the return address (the address after
the CALL) into a RET instruction that has the same label as the subroutine, but with '_RET' at the end:


loop            CALL    #sub1           'call to sub1, store next address into bits 8..0 of sub1_ret
                CALL    #sub2           'call to sub2, store next address into bits 8..0 of sub2_ret
                JMP     #loop           'loop back to first call

sub1            NOTP    #0              'start of sub1 routine
sub1_ret        RET                     'return to caller (actually JMP #returnaddress)

sub2            NOTP    #1              'start of sub2 routine
sub2_ret        RET                     'return to caller (actually JMP #returnaddress)


Because the return address is stored in an actual instruction at the end of the subroutine, these kinds
of calls cannot be recursive, unlike the stack RAM-based calls and returns which are elaborated on in the
STACK RAM section.

The WZ and WC suffixes can be used with CALL/RET instructions to control flag updating. For example,
if you wish to call a subroutine and preserve the Z and/or C flags, you can add the WZ and/or WC suffixes
to both the CALL and RET instructions to cause the flags to be initially saved on CALL and subsequently
restored on RET:


loop            CMP     a,b      WZ,WC  'compare a to b, affect Z and C
                CALL    #sub     WZ,WC  'call to sub and save Z/C/PC into bits 10..0 of the RET
    IF_C_OR_Z   JMP     #loop           'loop if a <= b
                JMP     #else           'else, branch

sub             GETP    #0       WC     'get pin 0 into C (mess up C and Z)
                GETNP   #1       WZ     'get pin 1 into Z
                SETPC   #6              'set pin 6 to C
                SETPZ   #7              'set pin 7 to Z
sub_ret         RET              WZ,WC  'return to caller, restore Z/C/PC from bits 10..0 in RET


Here are the discrete JMP/CALL/RET instructions and the general-case JMPRET instruction:


        JMP     S               - Jump to address in S[8..0]
                                  If WC then C = S[9]
                                  If WZ then Z = S[10]

        JMP     #n              - Jump to immediate 0..511
                                  If WC then C = bit 9 of JMP instruction (in unused D field)
                                  If WZ then Z = bit 10 of JMP instruction (in unused D field)

        CALL    #label          - Jump to label which begins subroutine
                                  The assembler points the D field to the RET at label_RET
                                  PC+1 is written to D[8..0] (PC+4 for CALLD)
                                  If WC then C is written to D[9]
                                  If WZ then Z is written to D[10]
                                  D[31..11], plus D[10]/D[9] per WZ/WC, are preserved

        RET                     - Jump to bits 8..0 of RET instruction (assembled as JMP #0)
                                  If WC then C = bit 9 of RET instruction (in unused D field)
                                  If WZ then Z = bit 10 of RET instruction (in unused D field)


        JMPRET  D,#n    NR      - Jump to immediate 0..511 (same as 'JMP #n' and 'RET')
                                  If WC then C = bit 9 of JMPRET instruction (in D field)
                                  If WZ then Z = bit 10 of JMPRET instruction (in D field)

        JMPRET  D,S     NR      - Jump to address in S[8..0] (same as 'JMP S')
                                  If WC then C = S[9]
                                  If WZ then Z = S[10]

        JMPRET  D,#n            - Jump to immediate 0..511 (same as 'CALL #label')
                                  PC+1 is written to D[8..0] (PC+4 for JMPRETD)
                                  If WC then C is written to D[9], else D[9] same
                                  If WZ then Z is written to D[10], else D[10] same
                                  D[31..11] are preserved

        JMPRET  D,S             - Jump to address in S[8..0]
                                  PC+1 is written to D[8..0] (PC+4 for JMPRETD)
                                  If WC then C is written to D[9] and reloaded from S[9]
                                  If WZ then Z is written to D[10] and reloaded from S[10]
                                  D[31..11], and D[10]/D[9] per WZ/WC, are preserved


        TASKSW                  - Short for 'JMPRET INDA,++INDA WZ,WC'
                                  For round-robin switching among threaded tasks
                                  Use FIXINDA to set up a ring of Z/C/PC registers
                                  Use with register remapping for multiple program instances
                                  Instructions trailing TASKSWD are in the next thread


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000111 ZC0 0 CCCC 000000000 SSSSSSSSS        JMP     S       'jump to S                         4 *
000111 ZC0 1 CCCC 000000000 nnnnnnnnn        JMP     #n      'jump to 0..511                    4 *
000111 ZC0 1 CCCC 000000000 000000000        RET             'return from subroutine            4 *
000111 ZC1 1 CCCC DDDDDDDDD LLLLLLLLL        CALL    #label  'call subroutine                   4 *
000111 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS        JMPRET  D,S     'jump to S, store return in D      4 *
000111 ZCR 1 CCCC DDDDDDDDD nnnnnnnnn        JMPRET  D,#n    'jump to 0..511, store return in D 4 *
000111 111 0 0011 111110110 111110110        TASKSW          'JMPRET INDA,++INDA WZ,WC          4 *

010111 ZC0 0 CCCC 000000000 SSSSSSSSS        JMPD    S       'jump to S                         1
010111 ZC0 1 CCCC 000000000 nnnnnnnnn        JMPD    #n      'jump to 0..511                    1
010111 ZC0 1 CCCC 000000000 000000000        RETD            'return from subroutine            1
010111 ZC1 1 CCCC DDDDDDDDD LLLLLLLLL        CALLD   #label  'call subroutine                   1
010111 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS        JMPRETD D,S     'jump to S, store return in D      1
010111 ZCR 1 CCCC DDDDDDDDD nnnnnnnnn        JMPRETD D,#n    'jump to 0..511, store return in D 1
010111 111 0 0011 111110110 111110110        TASKSWD         'JMPRETD INDA,++INDA WZ,WC         1
-------------------------------------------------------------------------------------------------
* 4 clocks for single-task, actual count is 1 + number of same-task instructions in pipeline
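

The JMPRET write-back and RET reload rules described above can be sketched as bit operations. This
is a Python model for illustration only; the function names are hypothetical and the 32-bit register
values are passed in as plain integers.

```python
def jmpret_writeback(d, pc, c, z, wc=False, wz=False, delayed=False):
    """Return the new D register value after a JMPRET/CALL without NR."""
    ret = (pc + (4 if delayed else 1)) & 0x1FF        # PC+1, or PC+4 for the 'D' variants
    result = (d & ~0x1FF & 0xFFFFFFFF) | ret          # D[8..0] = return address
    if wc:
        result = (result & ~(1 << 9)) | (int(c) << 9)     # D[9] = C
    if wz:
        result = (result & ~(1 << 10)) | (int(z) << 10)   # D[10] = Z
    return result & 0xFFFFFFFF                        # D[31..11] are preserved


def ret_decode(s, wc=False, wz=False):
    """Return (target, C, Z) for JMP S / RET; a flag is None when it is not reloaded."""
    target = s & 0x1FF
    c = (s >> 9) & 1 if wc else None
    z = (s >> 10) & 1 if wz else None
    return target, c, z
```

For example, a CALL with WZ,WC at address $010 with C=1 and Z=0 stores $011 in bits 8..0, sets bit 9
and clears bit 10 of the RET instruction, so a later RET with WZ,WC restores both flags.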


Here are the conditional branches:


        IJZ     D,S/#n          - Increment D and Jump to S/#n if result is zero
        IJNZ    D,S/#n          - Increment D and Jump to S/#n if result is not zero
        DJZ     D,S/#n          - Decrement D and Jump to S/#n if result is zero
        DJNZ    D,S/#n          - Decrement D and Jump to S/#n if result is not zero
        TJZ     D,S/#n          - Jump to S/#n if D is zero
        TJNZ    D,S/#n          - Jump to S/#n if D is not zero
        JP      D,S/#n          - Jump to S/#n if pin D reads high
        JNP     D,S/#n          - Jump to S/#n if pin D reads low


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
111100 00R I CCCC DDDDDDDDD SSSSSSSSS        IJZ     D,S     'increment D and jump if zero      4 *
111100 10R I CCCC DDDDDDDDD SSSSSSSSS        IJNZ    D,S     'increment D and jump if not zero  4 *
111101 00R I CCCC DDDDDDDDD SSSSSSSSS        DJZ     D,S     'decrement D and jump if zero      4 *
111101 10R I CCCC DDDDDDDDD SSSSSSSSS        DJNZ    D,S     'decrement D and jump if not zero  4 *
111110 000 I CCCC DDDDDDDDD SSSSSSSSS        TJZ     D,S     'test D and jump if zero           4 *
111110 100 I CCCC DDDDDDDDD SSSSSSSSS        TJNZ    D,S     'test D and jump if not zero       4 *
111110 001 I CCCC DDDDDDDDD SSSSSSSSS        JP      D,S     'jump if pin D high                4 *
111110 101 I CCCC DDDDDDDDD SSSSSSSSS        JNP     D,S     'jump if pin D low                 4 *

111100 01R I CCCC DDDDDDDDD SSSSSSSSS        IJZD    D,S     'increment D and jump if zero      1
111100 11R I CCCC DDDDDDDDD SSSSSSSSS        IJNZD   D,S     'increment D and jump if not zero  1
111101 01R I CCCC DDDDDDDDD SSSSSSSSS        DJZD    D,S     'decrement D and jump if zero      1
111101 11R I CCCC DDDDDDDDD SSSSSSSSS        DJNZD   D,S     'decrement D and jump if not zero  1
111110 010 I CCCC DDDDDDDDD SSSSSSSSS        TJZD    D,S     'test D and jump if zero           1
111110 110 I CCCC DDDDDDDDD SSSSSSSSS        TJNZD   D,S     'test D and jump if not zero       1
111110 011 I CCCC DDDDDDDDD SSSSSSSSS        JPD     D,S     'jump if pin D high                1
111110 111 I CCCC DDDDDDDDD SSSSSSSSS        JNPD    D,S     'jump if pin D low                 1
-------------------------------------------------------------------------------------------------
* 4 clocks for single-task, actual count is 1 + number of same-task instructions in pipeline



COUNTERS - this section is not done yet!!!
--------

Each cog has two configurable counters. They are named CTRA and CTRB and are accessed by
thirteen instructions each. The instructions which end in "A" are for CTRA and those that
end in "B" are for CTRB. For brevity, only CTRA instructions are used in the definitions and
examples that follow.

        GETPHSA D               - Get PHSA into D
        GETPHZA D               - Get PHSA into D, simultaneously clear PHSA to 0
        GETCOSA D               - Get COSA into D
        GETSINA D               - Get SINA into D

        SETCTRA D/#n            - Set CTRA configuration
        SETWAVA D/#n            - Set WAVA
        SETFRQA D/#n            - Set FRQA
        SETPHSA D/#n            - Set PHSA
        ADDPHSA D/#n            - Add to PHSA
        SUBPHSA D/#n            - Subtract from PHSA

        SYNCTRA                 - Wait for PHSA to roll over
        POLCTRA WC              - Check if PHSA has rolled over (C=1 if rolled over)
        CAPCTRA                 - Capture CTRA accumulators into COSA and SINA

Modes:

  (QDR = PHS[31] XNOR PHS[30], or PHS[31] delayed by 90 degrees)


  Off Mode
  -------------------------------------------------------------------------------
  %00000 = Counter off (initial state after cog start)


  NCO Modes
  -------------------------------------------------------------------------------
  %00001 = NCO output + video PLL mode, PLL output = PHS[31] (reference signal)
  %00010 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 32
  %00011 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 16
  %00100 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 8
  %00101 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 4
  %00110 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 2
  %00111 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 1
  %01000 = NCO output

  DUAL Modes
  -------------------------------------------------------------------------------
  %000_01001 = dual NCO outputs + dual COUNT_LOWS inputs
  %001_01001 = dual NCO outputs + dual COUNT_HIGHS inputs
  %010_01001 = dual NCO outputs + dual COUNT_NEGATIVE_EDGES inputs
  %011_01001 = dual NCO outputs + dual COUNT_POSITIVE_EDGES inputs
  %100_01001 = dual NCO outputs + dual TIME_LOWS inputs
  %101_01001 = dual NCO outputs + dual TIME_HIGHS inputs
  %110_01001 = dual NCO outputs + dual TIME_NEGATIVE_EDGES inputs
  %111_01001 = dual NCO outputs + dual TIME_POSITIVE_EDGES inputs

  %000_01010 = dual DUTY outputs + dual COUNT_LOWS inputs
  %001_01010 = dual DUTY outputs + dual COUNT_HIGHS inputs
  %010_01010 = dual DUTY outputs + dual COUNT_NEGATIVE_EDGES inputs
  %011_01010 = dual DUTY outputs + dual COUNT_POSITIVE_EDGES inputs
  %100_01010 = dual DUTY outputs + dual TIME_LOWS inputs
  %101_01010 = dual DUTY outputs + dual TIME_HIGHS inputs
  %110_01010 = dual DUTY outputs + dual TIME_NEGATIVE_EDGES inputs
  %111_01010 = dual DUTY outputs + dual TIME_POSITIVE_EDGES inputs

  %000_01011 = dual PWM outputs + dual COUNT_LOWS inputs
  %001_01011 = dual PWM outputs + dual COUNT_HIGHS inputs
  %010_01011 = dual PWM outputs + dual COUNT_NEGATIVE_EDGES inputs
  %011_01011 = dual PWM outputs + dual COUNT_POSITIVE_EDGES inputs
  %100_01011 = dual PWM outputs + dual TIME_LOWS inputs
  %101_01011 = dual PWM outputs + dual TIME_HIGHS inputs
  %110_01011 = dual PWM outputs + dual TIME_NEGATIVE_EDGES inputs
  %111_01011 = dual PWM outputs + dual TIME_POSITIVE_EDGES inputs

  WAVE Modes
  -------------------------------------------------------------------------------
  %01100 = dual SQR_WAVE output + GOERTZEL input
  %01101 = dual SAW_WAVE output + GOERTZEL input
  %01110 = dual TRI_WAVE output + GOERTZEL input
  %01111 = dual SIN_WAVE output + GOERTZEL input

In the WAVE modes, FRQ is added into PHS on every clock cycle. The top nine bits of PHS
are used to drive sine and cosine lookup tables which are used for sine output functions
and GOERTZEL computations. While the sine/cosine output functions are the most useful for
signal processing, triangle-, sawtooth-, and square-wave output functions are also selectable,
being derived from the top nine bits of PHS, as well.
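
Because FRQ is added into the 32-bit PHS on every clock, PHS rolls over at FRQ/2^32 times the clock
frequency. The Python sketch below computes an FRQ value for a wanted rollover frequency and
extracts the nine-bit table index. The function names are hypothetical, and the clock frequency is
a parameter because this document does not fix one.

```python
def frq_for(freq_hz, clk_hz):
    """FRQ value that makes PHS roll over freq_hz times per second at clk_hz."""
    return round(freq_hz * 2**32 / clk_hz) & 0xFFFFFFFF


def table_index(phs):
    """Top nine bits of PHS, used to address the sine/cosine lookup tables."""
    return (phs >> 23) & 0x1FF
```

With FRQ = 2^23 the table index advances by one per clock, giving one rollover every 512 clocks.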

The WAVE modes output both parallel DAC signals and duty-modulated pin signals. All
output signals are nine bits in base quality with an additional nine sub-bits of dithering
to maintain base quality after attenuative scaling. The dual outputs differ only in phase
and are set up by the WAV register:


  WAV register in WAVE modes (can be changed by SETWAVA/SETWAVB instruction)
  -------------------------------------------------------------------------------
  %PPPPPPPPP_xxxxx_TTTTTTTTT_AAAAAAAAA

      PPPPPPPPP = phase advance for OUTA (0 to 511/512 revolutions)
          xxxxx = unused for WAVE modes
      TTTTTTTTT = offset for OUTA and OUTB
      AAAAAAAAA = amplitude for OUTA and OUTB


  Initial value after cog start:

  %010000000_00000_100000000_111111111

      010000000 = 90-degree phase advance for GOERTZEL use (OUTA=cosine, OUTB=sine)
          00000 = unused
      100000000 = mid-point offset (allows maximum amplitude)
      111111111 = maximum amplitude
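

The WAV field layout above can be packed and unpacked as follows. This is a Python sketch with
hypothetical helper names; the unused middle field is written as zero.

```python
def pack_wav(phase, offset, amplitude):
    """Build a WAV value from three nine-bit fields (%PPPPPPPPP_xxxxx_TTTTTTTTT_AAAAAAAAA)."""
    for name, value in (("phase", phase), ("offset", offset), ("amplitude", amplitude)):
        if not 0 <= value <= 511:
            raise ValueError(f"{name} must be 0..511")
    return (phase << 23) | (offset << 9) | amplitude


def unpack_wav(wav):
    """Return (phase, offset, amplitude) from a 32-bit WAV value."""
    return (wav >> 23) & 0x1FF, (wav >> 9) & 0x1FF, wav & 0x1FF
```

pack_wav(128, 256, 511) reproduces the initial value after cog start, since 128/512 of a revolution
is the 90-degree phase advance.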


The GOERTZEL computation works as follows, on every clock:

    Nine-bit sine and cosine values are looked up using the top nine bits of PHS.
    The sine and cosine values are negated if INA is 0, else they remain the same.
    The sine and cosine values are added into separate sine and cosine accumulators.

This process measures the energy content of INA at the frequency of PHS rollover.
To make this work, the INA pin should be configured for delta-sigma ADC mode, so
that it streams back 1's and 0's that ratiometrically represent the voltage of the
I/O pin.

To make a GOERTZEL measurement:

    - The top nine bits of WAV should be set to %010000000 for proper cosine lookup.
    - FRQ must be set to generate the frequency of interest in PHS rollovers (SETFRQA).
    - PHS and the accumulators should be cleared to 0 (SETPHSA #0, then CAPCTRA).
    - Some number of complete PHS rollovers must be waited for (SYNCTRA/POLCTRA).
    - The accumulators must be captured and read (CAPCTRA + GETCOSA + GETSINA).
    - The hypotenuse of the accumulators indicates signal strength (its length) and phase (its angle).

By making swept FRQ measurements in a closed loop, where OUTA is used to output a reference
frequency of known phase to stimulate a system, and INA receives a signal back that
is somehow coupled to OUTA, you can determine things such as spectral response, resonant
frequency, and frequency vs. phase of a system.

The more PHS rollovers in a measurement, the more selective the result will be. For open-
loop measurements, this means tighter bandwidth. For closed-loop measurements, the angle
of the hypotenuse becomes meaningful. The QARCTAN instruction can translate the sine and
cosine accumulations into power and phase values.
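

The per-clock GOERTZEL steps described above can be modeled in a few lines of Python. This is a
sketch under stated assumptions, not a description of the actual circuit: the table scaling
(-255..+255), the order of lookup versus the FRQ addition, and the function names are all
assumptions, and math.hypot/math.atan2 stand in for what QARCTAN would compute.

```python
import math

# Assumed table contents: signed values scaled to -255..+255 over 512 entries.
SIN_TABLE = [round(255 * math.sin(2 * math.pi * i / 512)) for i in range(512)]
COS_TABLE = [round(255 * math.cos(2 * math.pi * i / 512)) for i in range(512)]


def goertzel(bits, frq):
    """Accumulate sine/cosine over a 1-bit input stream; returns (cos_acc, sin_acc)."""
    phs = 0
    cos_acc = sin_acc = 0
    for bit in bits:
        index = phs >> 23                   # top nine bits of PHS
        sign = 1 if bit else -1             # values are negated when the input is 0
        cos_acc += sign * COS_TABLE[index]
        sin_acc += sign * SIN_TABLE[index]
        phs = (phs + frq) & 0xFFFFFFFF      # FRQ is added into PHS every clock
    return cos_acc, sin_acc


def strength_and_phase(cos_acc, sin_acc):
    """Hypotenuse length and angle (radians) of the two accumulators."""
    return math.hypot(cos_acc, sin_acc), math.atan2(sin_acc, cos_acc)
```

Feeding this model a bitstream that is high for the first half of every PHS rollover gives a large
hypotenuse, while a bitstream at twice that frequency accumulates to nearly zero, which is the
selectivity the text describes.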


  LOGIC Modes
  -------------------------------------------------------------------------------
  %10000 = LOGIC_A_POSEDGE input    INA & !INA previous
  %10001 = LOGIC_NA_AND_NB input   !INA & !INB
  %10010 = LOGIC_A_AND_NB input     INA & !INB
  %10011 = LOGIC_NB input                 !INB
  %10100 = LOGIC_NA_AND_B input    !INA &  INB
  %10101 = LOGIC_NA input          !INA
  %10110 = LOGIC_A_NE_B input       INA <> INB
  %10111 = LOGIC_NA_OR_NB input    !INA | !INB
  %11000 = LOGIC_A_AND_B input      INA &  INB
  %11001 = LOGIC_A_EQ_B input       INA == INB
  %11010 = LOGIC_A input            INA
  %11011 = LOGIC_A_OR_NB input      INA | !INB
  %11100 = LOGIC_B input                   INB
  %11101 = LOGIC_NA_OR_B input     !INA |  INB
  %11110 = LOGIC_A_OR_B input       INA |  INB
  %11111 = LOGIC_ENCODER input      INA,   INB encoder

    OUTA = ADD signal (condition met or LOGIC_ENCODER forward step)
    OUTB = SUB signal (LOGIC_ENCODER reverse step)

In the LOGIC modes, FRQ is conditionally added to PHS on each clock cycle that meets that
mode's requirement. In the case of the LOGIC_ENCODER mode, FRQ may be added or subtracted
to/from PHS when a half-step is registered. OUTA and OUTB reflect the ADD and SUB states
for each cycle, and are more likely to be useful to other CTRs than to be sent to
output pins.
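
As an illustration of conditional accumulation, the Python sketch below models LOGIC_A_POSEDGE
(%10000): FRQ is added to PHS on each clock where INA is high and was low on the previous clock.
The function name is hypothetical, and the assumption that the "previous" INA state starts at 0 is
not stated in this document.

```python
def logic_a_posedge(ina_samples, frq=1, phs=0):
    """Return PHS after clocking in one INA sample per clock in LOGIC_A_POSEDGE mode."""
    previous = 0                              # assumed initial state of 'INA previous'
    for ina in ina_samples:
        if ina and not previous:              # INA & !INA previous
            phs = (phs + frq) & 0xFFFFFFFF    # condition met: add FRQ to PHS
        previous = ina
    return phs
```

With FRQ set to 1, PHS simply counts positive edges on INA.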


DACS
----

Each cog outputs 4 channels of DAC data, named DAC0..DAC3. These DAC data channels can be
set to values in software or actively driven from CTRA/CTRB or VID. In all cases but VID,
the source data is 18 bits and is dithered on every clock cycle for 9-bit DAC output. In
the case of VID, the source data is just 9 bits, so no dithering is performed.

Each I/O pin has a 75-ohm 9-bit DAC which can be configured using CFGPINS to output a
fixed DACx channel from any cog. Every cog's DAC0..DAC3 are available, in that sequence,
to P0..P3, then to the next four pins, and so on, as shown below:


PortA   PortB   PortC       DACx
--------------------------------
P0      P32     P64         DAC0
P1      P33     P65         DAC1
P2      P34     P66         DAC2
P3      P35     P67         DAC3
P4      P36     P68         DAC0
P5      P37     P69         DAC1
P6      P38     P70         DAC2
P7      P39     P71         DAC3
P8      P40     P72         DAC0
P9      P41     P73         DAC1
P10     P42     P74         DAC2
P11     P43     P75         DAC3
P12     P44     P76         DAC0
P13     P45     P77         DAC1
P14     P46     P78         DAC2
P15     P47     P79         DAC3
P16     P48     P80         DAC0
P17     P49     P81         DAC1
P18     P50     P82         DAC2
P19     P51     P83         DAC3
P20     P52     P84         DAC0
P21     P53     P85         DAC1
P22     P54     P86         DAC2
P23     P55     P87         DAC3
P24     P56     P88         DAC0
P25     P57     P89         DAC1
P26     P58     P90         DAC2
P27     P59     P91         DAC3
P28     P60     P92         DAC0
P29     P61     P93         DAC1
P30     P62     P94         DAC2
P31     P63     P95         DAC3
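

The table above reduces to the pin number modulo 4. A one-line Python sketch (hypothetical function
name):

```python
def dac_channel_for_pin(pin):
    """Return the DACx channel number (0..3) available to I/O pin P0..P95."""
    if not 0 <= pin <= 95:
        raise ValueError("pin must be 0..95")
    return pin % 4
```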


Here are the instructions which configure DAC0..DAC3:

    CFGDAC0 D/#n    - Configure DAC0

        %00 = Software controlled (default)
        %01 = CTRA SIGA
        %10 = CTRA SIGA + CTRB SIGA
        %11 = VID SIG0

    CFGDAC1 D/#n    - Configure DAC1

        %00 = Software controlled (default)
        %01 = CTRA SIGB
        %10 = CTRA SIGB + CTRB SIGB
        %11 = VID SIG1

    CFGDAC2 D/#n    - Configure DAC2

        %00 = Software controlled (default)
        %01 = CTRB SIGA
        %10 = CTRA SIGA + CTRB SIGA
        %11 = VID SIG2

    CFGDAC3 D/#n    - Configure DAC3

        %00 = Software controlled (default)
        %01 = CTRB SIGB
        %10 = CTRA SIGB + CTRB SIGB
        %11 = VID SIG3

    CFGDACS D/#n    - Configure DAC3..DAC0 from four 2-bit fields: %33_22_11_00
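

The %33_22_11_00 operand for CFGDACS can be assembled from the four 2-bit settings as sketched
below in Python. The constant and function names are hypothetical labels for the values listed
above.

```python
SOFTWARE, CTR_SINGLE, CTR_SUM, VIDEO = 0b00, 0b01, 0b10, 0b11


def pack_cfgdacs(dac3, dac2, dac1, dac0):
    """Build the CFGDACS operand: DAC3 in bits 7..6 down to DAC0 in bits 1..0."""
    for value in (dac3, dac2, dac1, dac0):
        if not 0 <= value <= 3:
            raise ValueError("each configuration is a 2-bit value")
    return (dac3 << 6) | (dac2 << 4) | (dac1 << 2) | dac0
```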


For configurations %00..%10, the data sources are 18 bits wide, with the 9 lower bits
being dithered by a 32-bit LFSR to realize more DAC resolution. This improves dynamic
range, but introduces a white noise of one step in amplitude in the 9-bit DAC output.
As dynamic signals get smaller in amplitude, they appear to sink into the dither noise,
but actually remain very high-Q, as the dither noise is very low-Q. For configuration
%11 (VID), the data is a straight 9 bits with no dithering, as pixels could only be
dithered once per frame, resulting only in visible luminance noise, which is not
desirable.

The dithering works by taking nine fixed bits from a 32-bit LFSR and sign-extending
them to 18 bits. This yields a pseudo-random value ranging from %111111111_100000000
(negative) to %000000000_011111111 (positive) on every clock cycle. When added to the
18-bit source data, the lower 9 bits of source data are realized as a proportional
toggling between two adjacent values in the top 9 bits of the sum, which form the DAC
output data. It will take at least 512 (2^9) clocks for the DAC output to average to
the intended 18-bit source value, assuming source data is static.

On cog start, all configurations are cleared to %00 and the source values are set to
%000000000_100000000, which is effectively zero, since dithering will never cause an
output step toggle when the nine lower source bits are %100000000:


       source data %XXXXXXXXX_100000000
  + minimum dither %111111111_100000000
                   --------------------
                 = %XXXXXXXXX_000000000    (top 9 bits are unchanged)


       source data %XXXXXXXXX_100000000
  + maximum dither %000000000_011111111
                   --------------------
                 = %XXXXXXXXX_111111111    (top 9 bits are unchanged)
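

The dither arithmetic can be explored with the Python sketch below. Since the LFSR taps are not
given here, the model assumes the dither visits each of its 512 possible values (-256..+255)
exactly once; the function names are hypothetical.

```python
def dithered_outputs(source):
    """9-bit DAC values produced by an 18-bit source across all 512 dither values."""
    outputs = []
    for dither in range(-256, 256):            # sign-extended nine-bit dither
        total = (source + dither) & 0x3FFFF    # 18-bit sum, wraps on overflow
        outputs.append(total >> 9)             # top 9 bits form the DAC output
    return outputs


def average_output(source):
    """Mean DAC value over one full pass through the assumed dither values."""
    outputs = dithered_outputs(source)
    return sum(outputs) / len(outputs)
```

In this model, a source whose lower nine bits are %100000000 never toggles, while a source with
lower bits %110000000 spends a quarter of the time one step higher, averaging 0.25 of a step above
its top nine bits.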


Here are the instructions which set DAC0..DAC3 source values in software:


    SETDAC0 #n      - Set DAC0 to %nnnnnnnnn_100000000, force configuration to %00
    SETDAC0 D       - Set DAC0 to D[31..14], force configuration to %00 *

    SETDAC1 #n      - Set DAC1 to %nnnnnnnnn_100000000, force configuration to %00
    SETDAC1 D       - Set DAC1 to D[31..14], force configuration to %00 *

    SETDAC2 #n      - Set DAC2 to %nnnnnnnnn_100000000, force configuration to %00
    SETDAC2 D       - Set DAC2 to D[31..14], force configuration to %00 *

    SETDAC3 #n      - Set DAC3 to %nnnnnnnnn_100000000, force configuration to %00
    SETDAC3 D       - Set DAC3 to D[31..14], force configuration to %00 *

    SETDACS #n      - Set DAC3..DAC0 to %nnnnnnnnn_100000000
                      Force DAC3..DAC0 configurations to %00

    SETDACS D       - Set DAC3 to %dddddddd0_100000000, where dddddddd is D[31..24]
                      Set DAC2 to %dddddddd0_100000000, where dddddddd is D[23..16]
                      Set DAC1 to %dddddddd0_100000000, where dddddddd is D[15..8]
                      Set DAC0 to %dddddddd0_100000000, where dddddddd is D[7..0]
                      Force DAC3..DAC0 configurations to %00

             
    * Be aware when using SETDACx D, that if D < $00400000 or D > $FFC03FFF, full-
      scale toggling will occur, as the dither addition will cause wrapping. For
      ground-based DAC output, you can add $00400000 to each output sample to
      prevent this from happening.
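
The limits in the note above follow from the dither range: D[31..14] must stay at least 256 above
zero and at least 255 below the 18-bit maximum. A Python sketch of the check (hypothetical function
name):

```python
def setdac_wraps(d):
    """True if the 32-bit value D would let dither wrap the 18-bit DAC source."""
    source = (d & 0xFFFFFFFF) >> 14            # D[31..14] becomes the 18-bit source
    return source - 256 < 0 or source + 255 > 0x3FFFF
```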



VIDEO
-----

Each cog has a video generator (VID) that can stream pixel data and perform colorspace
conversion and modulation, so that final video signals can be output to the 75-ohm DACs
on the I/O pins.

Pixel streaming, colorspace conversion, modulation, DAC channel driving, and DAC pin
updating are all performed in a pipelined fashion on each cycle of VID's dot clock.

VID gets its dot clock from CTRA's PLL. So, CTRA must be configured for PLL operation in
order for VID to operate.

The DACx channels must be configured for video output by using CFGDACx. To set all DACx
channels to video, do 'CFGDACS #%11_11_11_11'.

The I/O pins which will output the DACx channels must be configured to do so via CFGPINS.

To turn on VID and configure its DAC channel outputs, the SETVID instruction is used:

    SETVID  D/#n    - Set video configuration register (VCFG)

        %00xx = off (default)             SIG3    SIG2    SIG1    SIG0
                                          ----------------------------
        %01xx = SDTV/HDTV/VGA             Y_R     I_G     Q_B     SYN
        %10xx = NTSC/PAL S-VIDEO          YIQ     YIQ     _IQ     Y__
        %11xx = NTSC/PAL COMPOSITE        YIQ     YIQ     YIQ     YIQ

        %xx0x = zero-extend Y/I/Q coefficients for VGA colorspace (allows +$80, or '1.0')
        %xx1x = sign-extend Y/I/Q coefficients for NTSC/PAL/SDTV/HDTV colorspace

        %xxx0 = positive VGA sync on SYN / positive modulation phase
        %xxx1 = negative VGA sync on SYN / negative modulation phase (used in PAL video)


Before any meaningful video signals can be output, you must set the colorspace coefficients
and offset levels, which are each 8 bits:

    SETVIDY D/#n    - Set Y_R offset level and RGB colorspace coefficients: $YO_YR_YG_YB

    SETVIDI D/#n    - Set I_G offset level and RGB colorspace coefficients: $IO_IR_IG_IB

    SETVIDQ D/#n    - Set Q_B offset level and RGB colorspace coefficients: $QO_QR_QG_QB


All pixels are internally handled by VID as 2:8:8:8 bit SYNC:R:G:B data.

Colorspace conversion is performed as sum-of-products calculations on the R:G:B pixel data
and the colorspace coefficients, yielding Y, I, and Q components:

    Where R, G, B are 8-bit pixel color components and Y, I, Q are 9-bit sums (MOD 512):

        Y = R*YR/64 + G*YG/64 + B*YB/64        Where YR, YG, YB are 8-bit Y coefficients
        I = R*IR/64 + G*IG/64 + B*IB/64        Where IR, IG, IB are 8-bit I coefficients
        Q = R*QR/64 + G*QG/64 + B*QB/64        Where QR, QG, QB are 8-bit Q coefficients


    For outputs Y_R, I_G, and Q_B, offset levels are added to the Y, I, and Q components to
    properly position the final signals for SDTV/HDTV. In the case of VGA outputs, the
    offset levels are set to 0, since they are ground-based.

    For modulated outputs YIQ and _IQ, the I and Q components, treated as (I,Q), are rotated
    around (0,0) by an angle that steps 1/16th of a revolution on each dot clock, yielding
    Q'. In the case of YIQ output, the Y component (luma) and Q' (chroma) are added to form
    a composite video signal. In the case of _IQ output, an offset level is added to Q' to
    form an s-video chroma signal. For Y__ output, the Y component (luma) is output alone to
    form an s-video luma signal.
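

The sum-of-products step can be sketched in Python as below. The coefficient extension follows
VCFG[1]; whether the hardware divides each product or the whole sum, and how it rounds, is not
stated here, so dividing the sum with floor rounding is an assumption. The function names are
hypothetical.

```python
def extend(coeff, signed):
    """Interpret an 8-bit coefficient as sign-extended (VCFG[1]=1) or zero-extended."""
    coeff &= 0xFF
    return coeff - 256 if signed and coeff >= 128 else coeff


def component(r, g, b, coeffs, signed):
    """One 9-bit Y, I, or Q sum (MOD 512) from 8-bit R, G, B and three coefficients."""
    cr, cg, cb = (extend(c, signed) for c in coeffs)
    return ((r * cr + g * cg + b * cb) // 64) & 0x1FF   # assumed: divide the sum, floor
```

With the VGA coefficient $80 zero-extended, full-scale red gives 510, the top of the 9-bit range.
With the x128 HDTV Y coefficients (+27, +92, +9), white also gives 510, and the Pb row gives 0.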


For sync 'pixels', bit 24 or 25 is set in the pixel word and various formulas are used for
generating the different output signals. When fewer than 32 bits are expressed per pixel, the
SYNC bits will be %00.


    DAC channel outputs per pixel data input (outputs are 9 bits each, MOD 512)
    ------------------------------------------------------------------------------------
    Y_R     %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y             component/vga pixel
            %x1_0xxxxxxx_xxxxxxxx_xxxxxxxx = YO*2                 component/vga black
            %x1_1xxxxxxx_xxxxxxxx_SSSSSSSS = YO*2 + SSSSSSSS*2    component sync

    I_G     %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = IO*2 + I             component/vga pixel
            %x1_x0xxxxxx_xxxxxxxx_xxxxxxxx = IO*2                 component/vga black
            %x1_x1xxxxxx_xxxxxxxx_SSSSSSSS = IO*2 + SSSSSSSS*2    component sync

    Q_B     %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = QO*2 + Q             component/vga pixel
            %x1_xx0xxxxx_xxxxxxxx_xxxxxxxx = QO*2                 component/vga black
            %x1_xx1xxxxx_xxxxxxxx_SSSSSSSS = QO*2 + SSSSSSSS*2    component sync

    SYN     %x0_xxxxxxxx_xxxxxxxx_xxxxxxxx = VCFG[0]*511          vga sync unasserted
            %x1_xxxxxxxx_xxxxxxxx_xxxxxxxx = !VCFG[0]*511         vga sync asserted

    Y__     %00_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y             s-video luma pixel
            %01_xxxxxxxx_xxxxxxxx_xxxxxxxx = IO*2                 s-video luma sync high
            %1x_xxxxxxxx_xxxxxxxx_xxxxxxxx = 0                    s-video luma sync low

    _IQ     %xx_xxxxxxxx_xxxxxxxx_xxxxxxxx = QO*2 + Q'            s-video chroma

    YIQ     %00_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y + Q'        composite pixel
            %01_xxxxxxxx_xxxxxxxx_xxxxxxxx = IO*2 + Q'            composite sync high
            %1x_xxxxxxxx_xxxxxxxx_xxxxxxxx = Q'                   composite sync low


Below are some common colorspace coefficient sets. Note that these values are normalized
to 1. In the sum-of-products calculations, 128 is equal to 1, so the values below should all
be multiplied by 128 to get the proper 8-bit values for usage as coefficients. In practice,
the values will need to be scaled down so that under 75-ohm load, they will peak at 1.0V or
0.7V (not 1.65V, which is 3.3V/2). This scaling will compromise DAC span by ~39%..~58%,
leaving you with a still-sufficient ~8 bits of DAC resolution. However, if you'd like to
keep DAC span maximal, you may leave the coefficients as originally computed and achieve
the proper voltage under load by using external resistors, being sure to maintain 75 ohms
source impedance.


coefficient positions
-----------------------
YR       YG       YB
IR       IG       IB
QR       QG       QB
-----------------------

RGB (VGA)     VCFG[1]=0
-----------------------
1        0        0           R sums to 1
0        1        0           G sums to 1
0        0        1           B sums to 1
-----------------------

YPbPr (HDTV)  VCFG[1]=1                             x128
-----------------------                             -------------
+.213    +.715    +.072       Y  sums to 1          +27  +92  +9
-.115    -.385    +.500       Pb sums to 0          -15  -49  +64
+.500    -.454    -.046       Pr sums to 0          +64  -58  -6
-----------------------

YPbPr (SDTV)  VCFG[1]=1
-----------------------
+.299    +.587    +.114       Y  sums to 1
-.169    -.331    +.500       Pb sums to 0
+.500    -.419    -.081       Pr sums to 0
-----------------------

YIQ (NTSC)    VCFG[1]=1
-----------------------
+.299    +.587    +.114       Y sums to 1
+.596    -.274    -.322       I sums to 0 *
+.212    -.523    +.311       Q sums to 0 *
-----------------------

YUV (PAL)     VCFG[1]=1
-----------------------
+.299    +.587    +.114       Y sums to 1
-.147    -.289    +.436       U sums to 0 *
+.615    -.515    -.100       V sums to 0 *
-----------------------

* These three coefficients must be scaled by 0.608 to pre-compensate for CORDIC
  rotator expansion which will occur in the video modulator.
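
As a rough illustration of the x128 conversion, here is a small Python sketch (not part of the
chip documentation). The function name and the row_scale parameter are made up for this
example; row_scale stands in for whatever extra factor you apply to a row, such as the 0.608
pre-compensation or a factor to reduce the DAC swing:

```python
# Normalized YPbPr (HDTV) coefficient set, copied from the table above.
HDTV = [
    [+0.213, +0.715, +0.072],   # Y
    [-0.115, -0.385, +0.500],   # Pb
    [+0.500, -0.454, -0.046],   # Pr
]

def to_coefficients(rows, row_scale=(1.0, 1.0, 1.0)):
    """Scale normalized coefficients so that 128 represents 1.0.

    row_scale is a hypothetical per-row factor (one entry per row),
    for example 0.608 for a starred I/Q or U/V row.
    """
    return [[round(c * s * 128) for c in row] for row, s in zip(rows, row_scale)]

for row in to_coefficients(HDTV):
    print(row)
```

After rounding, a row may no longer sum exactly to 128 or 0, so it is worth checking the sums
of the integer values before using them.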


Once VID is configured, WAITVID instructions are used to issue contiguous commands
which keep the pixel streamer busy:

    WAITVID --> pixel streamer --> colorspace/modulator --> DACx signals --> I/O pins


VID double-buffers WAITVID commands to relax WAITVID timing requirements.

A WAITVID stalls the instruction pipeline until VID is ready for another command. If you
don't want to commit to that stall, you can use the POLVID instruction to test whether VID
is ready for another WAITVID. If it is ready, a subsequent WAITVID will take only one
clock:

    POLVID  WC      - Check if VID ready for another WAITVID, C=1 if ready


Here is the WAITVID instruction:

    WAITVID D,S/#n  - Wait for VID ready, then give next command via D and S


When WAITVID executes, the D and S values are captured by VID and used for the duration
of the command.

The D operand in WAITVID has four fields:

    %AAAAAAAA_MMMM_PPPPPPP_CCCCCCCCCCCCC

             %AAAAAAAA = stack RAM base address for pixel lookup (0..255)
                 %MMMM = pixel mode (0..15), elaborated below
              %PPPPPPP = number minus 1 of dot clocks per pixel (0..127 --> 1..128)
        %CCCCCCCCCCCCC = number minus 1 of dot clocks in WAITVID (0..8191 --> 1..8192)
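
The field layout above can be packed on a host machine with a few shifts. This Python sketch
is only an illustration; the function and argument names are invented here, and the two clock
counts are passed as real counts (1..128 and 1..8192) so the "minus 1" is applied in one place:

```python
def waitvid_d(base, mode, clocks_per_pixel, clocks_total):
    """Pack the four WAITVID D fields into one 32-bit value."""
    if not 0 <= base <= 255:
        raise ValueError("stack RAM base address must be 0..255")
    if not 0 <= mode <= 15:
        raise ValueError("pixel mode must be 0..15")
    if not 1 <= clocks_per_pixel <= 128:
        raise ValueError("dot clocks per pixel must be 1..128")
    if not 1 <= clocks_total <= 8192:
        raise ValueError("dot clocks in WAITVID must be 1..8192")
    return (base << 24                        # %AAAAAAAA      -> bits 31..24
            | mode << 20                      # %MMMM          -> bits 23..20
            | (clocks_per_pixel - 1) << 13    # %PPPPPPP       -> bits 19..13
            | (clocks_total - 1))             # %CCCCCCCCCCCCC -> bits 12..0

print(hex(waitvid_d(0x10, 0b0100, 2, 8)))
```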


The D operand's %MMMM field determines which pixel mode will be used for the WAITVID and
what the S operand will be used for:

    %0000 = LIT_SRGB26    - S is used as a literal 2:8:8:8 pixel. Only the %CCCCCCCCCCCCC
                            bits of D are used (all other bits can be 0).

    %0001 = CLU1_SRGB26   - 32 1-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM
    %0010 = CLU2_SRGB26   - 16 2-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM
    %0011 = CLU4_SRGB26   - 8 4-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM
    %0100 = CLU8_SRGB26   - 4 8-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM
    %0101 = CLU8_RGB15 *  - 4 8-bit offsets in S lookup 0:5:5:5 pixel words in stack RAM
    %0110 = CLU8_RGB16 *  - 4 8-bit offsets in S lookup 0:5:6:5 pixel words in stack RAM

                            The CLUx modes capture S, using its 1/2/4/8-bit fields, lowest
                            field first, as offsets for looking up pixels in stack RAM,
                            starting at %AAAAAAAA. Upon completion of each pixel, the next
                            higher bit field is used, with the highest field repeating.

                            For CLU1_SRGB26..CLU8_SRGB26, the 1/2/4/8-bit fields are used
                            as long offsets into stack RAM, yielding 2:8:8:8 pixel data.

                            For CLU8_RGB15 and CLU8_RGB16, bits 7..1 of each 8-bit field
                            are used as the long offset, while bit 0 selects the low/high
                            word containing the 0:5:5:5 or 0:5:6:5 pixel data.

    %0111 = STR1_RGB9 *   - 1-bit pixels streamed from stack RAM select between 0:3:3:3
                            colors in S[17..9] and S[26..18]. The stream start address in
                            stack RAM is %AAAAAAAA plus S[7..0], with S[31..27] selecting
                            the starting bit.

    %1000 = STR4_RGBI4 *  - 4-bit pixels are streamed from stack RAM starting at %AAAAAAAA
                            plus S[7:0], with S[31..29] selecting the starting nibble. The
                            pixels are colored as:

                            %0000 = black
                            %0001 = dark grey
                            %0010 = dark blue
                            %0011 = bright blue
                            %0100 = dark green
                            %0101 = bright green
                            %0110 = dark cyan
                            %0111 = bright cyan
                            %1000 = dark red
                            %1001 = bright red
                            %1010 = dark magenta
                            %1011 = bright magenta
                            %1100 = olive
                            %1101 = yellow
                            %1110 = light grey
                            %1111 = white

    %1001 = STR4_LUMA4 *  - 4-bit pixels are streamed from stack RAM starting at %AAAAAAAA
                            plus S[7:0], with S[31..29] selecting the starting nibble. The
                            pixels are used as brightness values for colors determined by
                            S[11..9]:

                            %000 = black..orange
                            %001 = black..blue
                            %010 = black..green
                            %011 = black..cyan
                            %100 = black..red
                            %101 = black..magenta
                            %110 = black..yellow
                            %111 = black..white

    %1010 = STR8_RGBI8 *  - 8-bit pixels are streamed from stack RAM starting at %AAAAAAAA
                            plus S[7:0], with S[31..30] selecting the starting byte. The
                            pixels are colored as:

                            $00..$1F = black..orange
                            $20..$3F = black..blue
                            $40..$5F = black..green
                            $60..$7F = black..cyan
                            $80..$9F = black..red
                            $A0..$BF = black..magenta
                            $C0..$DF = black..yellow
                            $E0..$FF = black..white

    %1011 = STR8_LUMA8 *  - 8-bit pixels are streamed from stack RAM starting at %AAAAAAAA
                            plus S[7:0], with S[31..30] selecting the starting byte. The
                            pixels are used as brightness values for colors determined by
                            S[11..9]:

                            %000 = black..orange
                            %001 = black..blue
                            %010 = black..green
                            %011 = black..cyan
                            %100 = black..red
                            %101 = black..magenta
                            %110 = black..yellow
                            %111 = black..white

    %1100 = STR8_RGB8 *   - 8-bit 0:3:3:2 pixels are streamed from stack RAM starting at
                            %AAAAAAAA plus S[7:0], with S[31..30] selecting the starting byte.

    %1101 = STR16_RGB15 * - 15-bit 0:5:5:5 pixels are streamed from stack RAM starting at
                            %AAAAAAAA plus S[7:0], with S[31] selecting the starting word.

    %1110 = STR16_RGB16 * - 16-bit 0:5:6:5 pixels are streamed from stack RAM starting at
                            %AAAAAAAA plus S[7:0], with S[31] selecting the starting word.

    %1111 = STR32_SRGB26  - 26-bit 2:8:8:8 pixels are streamed from stack RAM starting at
                            %AAAAAAAA plus S[7:0].


    * SYNC bits are set to %00 for these modes, since they specify color data only.
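
To make the CLUx field ordering concrete, here is a small Python model (an illustration only,
with invented function names). It assumes that bit 0 = 0 selects the low word and bit 0 = 1
selects the high word in the CLU8_RGB15/CLU8_RGB16 modes, which is how the "low/high" wording
above is read here. Adding %AAAAAAAA and any address wrapping are not modeled:

```python
def clu_offsets(s, bits, n_pixels):
    """Return the offsets a CLUx mode would take from S for n_pixels pixels.

    Fields are consumed lowest first. Once the highest field has been
    reached, it repeats for any remaining pixels.
    """
    fields = 32 // bits
    mask = (1 << bits) - 1
    return [(s >> (min(i, fields - 1) * bits)) & mask for i in range(n_pixels)]

def clu8_word_select(field):
    """Split an 8-bit CLU8_RGB15/RGB16 field into (long offset, word select)."""
    return field >> 1, field & 1      # bits 7..1, then bit 0 (assumed 0 = low word)

print(clu_offsets(0x76543210, 4, 10))   # ten pixels from eight 4-bit fields
print(clu8_word_select(0x05))
```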


The following example programs display luma-graduated color bars in various output modes:

    simple_VGA_1280x1024.spin
    simple_VGA_800x600.spin
    simple_VGA_640x480.spin
    simple_HDTV_1920x1080p.spin
    simple_HDTV_1280x720p.spin
    simple_NTSC_256x192.spin



TEXTURE MAPPER
--------------

Each cog has a texture mapper (PIX) which can sequentially navigate a rectangular 2D texture
map with Z-perspective correction to locate a texture pixel, translate that texture pixel into
A:R:G:B (Alpha:Red:Green:Blue) pixel data, perform discrete scaling on those A:R:G:B components,
and then alpha-blend the resulting pixel with another pixel for multi-layered 3D effects.

A texture map is stored in register RAM as a sequence of 1/2/4/8-bit texture pixels which build
from the bottom bits of an initial register, upward, then into subsequent registers. They are
ordered, in contiguous sequence, from top-left to top-right down to bottom-left to bottom-right.
These texture pixels get used as offsets into stack RAM to look up A:R:G:B pixel data. Texture
map width and height are individually settable to 1/2/4/8/16/32/64/128 pixel(s).

The SETPIX instruction is used to configure PIX:

    SETPIX  D/#n    - Set PIX configuration to %UUU_VVV_PP_W_H_V_xxxx_AAAAAAAA_RRRRRRRRR

          %UUU = texture map width, %VVV = texture map height

                 %000 =   1 pixel
                 %001 =   2 pixels
                 %010 =   4 pixels
                 %011 =   8 pixels
                 %100 =  16 pixels
                 %101 =  32 pixels
                 %110 =  64 pixels
                 %111 = 128 pixels

           %PP = texture pixel size

                 %00 = 1 bit
                 %01 = 2 bits
                 %10 = 4 bits
                 %11 = 8 bits

            %W = stack RAM pixel data offset/size

                 %0 = long offset, 8:8:8:8 bit A:R:G:B data
                 %1 = word offset, 1:5:5:5 bit A:R:G:B data (gets expanded to 8:8:8:8)

            %H = horizontal mirroring

                 %0 = OFF, image repeats when U'[15] set
                 %1 = ON,  image mirrors when U'[15] set

            %V = vertical mirroring

                 %0 = OFF, image repeats when V'[15] set
                 %1 = ON,  image mirrors when V'[15] set

     %AAAAAAAA = base address in stack RAM of A:R:G:B pixel data

    %RRRRRRRRR = base address in register RAM of texture pixels
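
The SETPIX fields can be assembled on a host machine as follows. This Python sketch is an
illustration with invented names; it takes the texture size and pixel depth as plain numbers
and converts them to the %UUU/%VVV/%PP codes listed above. The four unused bits are left at 0:

```python
def setpix_config(width, height, pixel_bits, word_data=False,
                  mirror_h=False, mirror_v=False, stack_base=0, reg_base=0):
    """Pack a SETPIX value: %UUU_VVV_PP_W_H_V_xxxx_AAAAAAAA_RRRRRRRRR."""
    sizes = {1 << n: n for n in range(8)}     # 1..128 pixels -> %000..%111
    depths = {1: 0, 2: 1, 4: 2, 8: 3}         # 1/2/4/8 bits  -> %00..%11
    if width not in sizes or height not in sizes:
        raise ValueError("width and height must be 1/2/4/8/16/32/64/128")
    if pixel_bits not in depths:
        raise ValueError("texture pixel size must be 1, 2, 4 or 8 bits")
    if not 0 <= stack_base <= 0xFF or not 0 <= reg_base <= 0x1FF:
        raise ValueError("base address out of range")
    return (sizes[width] << 29 | sizes[height] << 26 | depths[pixel_bits] << 24
            | int(word_data) << 23 | int(mirror_h) << 22 | int(mirror_v) << 21
            | stack_base << 9 | reg_base)

# 32x8 texture of 4-bit pixels, texture pixels starting at register $100
print(hex(setpix_config(32, 8, 4, reg_base=0x100)))
```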


Aside from SETPIX, which configures PIX's base metrics, there are seven other instructions
which establish initial values and deltas for the (U,V) texture coordinates, Z perspective,
and A/R/G/B scalers. These instructions are likely to be used before every sequence of GETPIX
instructions. They each set the value of their respective 16-bit parameter to the low word of
their operand, while the high word sets the 16-bit delta which gets added to the parameter
upon every GETPIX instruction:

    SETPIXU D/#n    - Set U to low word and DU to high word
    SETPIXV D/#n    - Set V to low word and DV to high word
    SETPIXZ D/#n    - Set Z to low word and DZ to high word
    SETPIXA D/#n    - Set A to low word and DA to high word
    SETPIXR D/#n    - Set R to low word and DR to high word
    SETPIXG D/#n    - Set G to low word and DG to high word
    SETPIXB D/#n    - Set B to low word and DB to high word
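
A parameter and its delta share one operand, so a host-side helper can build it. In this
Python sketch (names invented for illustration), a negative delta is simply stored as its
16-bit two's complement. The assumption that the parameter wraps at 16 bits when the delta is
added is ours and is not stated above:

```python
def pix_param(value, delta=0):
    """Build a SETPIXx operand: value in the low word, delta in the high word."""
    return (delta & 0xFFFF) << 16 | (value & 0xFFFF)

def step(operand):
    """Return the parameter after one GETPIX (assumes 16-bit wraparound)."""
    value, delta = operand & 0xFFFF, operand >> 16
    return (value + delta) & 0xFFFF

u = pix_param(0x8000, 0x0100)
print(hex(u), hex(step(u)))
```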


Once PIX is configured and initial parameters are set, the GETPIX instruction may be used to
look up the current texture pixel, scale its A/R/G/B components, blend it with a pixel in D,
and update the U/V/Z/A/R/G/B parameters with their deltas. GETPIX takes 3 clocks and also
needs 3 clocks in each of pipeline stages 2 and 3:

        NOP     #2              'ready pipeline, GETPIX needs 3 clocks in pipeline stage 2
        NOP     #2              'ready pipeline, GETPIX needs 3 clocks in pipeline stage 3
        GETPIX  pixel           'execute GETPIX, GETPIX takes 3 clocks in pipeline stage 4


To make GETPIX more efficient, it can be repeated using REPD to perform a sequence of pixel
operations:

        REPD    #64,#1          'render 64 texture pixels and blend them with 'pixels'
        SETINDA #pixels         'point INDA to pixels
        NOP     #2              'ready pipeline, 3 clocks in initial pipeline stage 2
        NOP     #2              'ready pipeline, 3 clocks in initial pipeline stage 3
        GETPIX  INDA++          'execute GETPIX, 3 clocks per repeating GETPIX


As GETPIX executes, the following sequence occurs over three pipeline stages:


    In pipeline stage 2:

        Z-perspective correction
        ------------------------
        Z' = 256 - Z[15:8]
        U' = (U[15:0] / Z') MOD 256
        V' = (V[15:0] / Z') MOD 256

        A texture pixel is read from register RAM at texture map location (U',V'), with
        the U' and V' top-most bits being used as coordinates. For example, if the texture
        size is 32x8, then the top 5 bits of U' and the top 3 bits of V' would be used to
        locate the texture pixel.

        parameter updating
        ------------------
        Z = Z + DZ
        U = U + DU
        V = V + DV


    In pipeline stage 3:

        The texture pixel is used as an offset to look up A:R:G:B pixel data in stack RAM,
        which gets assigned to TA:TR:TG:TB.


    In pipeline stage 4:

        pixel scaling
        -------------
        A' = (TA * A[15:8]  +  255) / 256
        R' = (TR * R[15:8]  +  255) / 256
        G' = (TG * G[15:8]  +  255) / 256
        B' = (TB * B[15:8]  +  255) / 256

        pixel blending
        --------------
        D[31..24] = 0
        D[23..16] = (A' * R'  +  (255 - A') * D[23..16]  +  255) / 256
        D[15..8]  = (A' * G'  +  (255 - A') * D[15..8]   +  255) / 256
        D[7..0]   = (A' * B'  +  (255 - A') * D[7..0]    +  255) / 256

        C = A' <> 0     (for GETPIX D/#n WC, C = texture pixel opacity <> 0)

        parameter updating
        ------------------
        A = A + DA
        R = R + DR
        G = G + DG
        B = B + DB


Note that if Z[15:8] = 0, no scaling occurs; that is, (U',V') = (U[15:8],V[15:8]). The
bigger Z[15:8] gets, the more compressed the texture rendering becomes, until when
Z[15:8] = 255, (U',V') = (U[7:0],V[7:0]).
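
The GETPIX arithmetic can be checked with a short Python model. This is an illustration
rather than a description of the hardware: the function names are invented, the texture
lookup itself is left out (the looked-up TA:TR:TG:TB values are passed in), and all divisions
are assumed to truncate:

```python
def perspective(u, v, z):
    """Z-perspective correction: returns (U', V') for 16-bit U, V and Z."""
    z_prime = 256 - ((z >> 8) & 0xFF)
    return (u // z_prime) % 256, (v // z_prime) % 256

def scale(component, param):
    """Scale an 8-bit texture component by the top byte of a 16-bit parameter."""
    return (component * ((param >> 8) & 0xFF) + 255) // 256

def blend(alpha, src, dst):
    """Alpha-blend one 8-bit component over the existing one."""
    return (alpha * src + (255 - alpha) * dst + 255) // 256

def getpix_blend(d, texel, a, r, g, b):
    """Return (new D, C flag) for texel = (TA, TR, TG, TB)."""
    ta, tr, tg, tb = texel
    a2 = scale(ta, a)
    out = 0                                   # D[31..24] = 0
    for shift, t, p in ((16, tr, r), (8, tg, g), (0, tb, b)):
        out |= blend(a2, scale(t, p), (d >> shift) & 0xFF) << shift
    return out, a2 != 0

print(perspective(0x1234, 0xABCD, 0x0000))    # Z[15:8] = 0
print(perspective(0x1234, 0xABCD, 0xFF00))    # Z[15:8] = 255
```

With all scalers at $FF00, an opaque texel replaces the pixel in D and a fully transparent
texel leaves D[23..0] unchanged, which is a quick sanity check on the rounding terms.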

The following program provides a simplistic example of how PIX is used:

    texture_NTSC_256x192.spin



PIN TRANSFER
------------

Each cog has a pin transfer (XFR) which can automatically move data between pins and QUADs
or from pins to stack RAM, in the background, while instructions execute normally.

XFR is configured with the SETXFR instruction:

    SETXFR  D/#n    - Set XFR configuration to %MMM_PPP

          %MMM = mode

                 %00x = off (initial state after cog start)
                 %010 = QUADs_to_16_pins
                 %011 = QUADs_to_32_pins
                 %100 = 16_pins_to_QUADs
                 %101 = 32_pins_to_QUADs
                 %110 = 16_pins_to_stack
                 %111 = 32_pins_to_stack

          %PPP = pin group

                %000 = pins 15..0  for 16-pin modes, pins 31..0  for 32-pin modes
                %001 = pins 31..16 for 16-pin modes, pins 31..0  for 32-pin modes
                %010 = pins 47..32 for 16-pin modes, pins 63..32 for 32-pin modes
                %011 = pins 63..48 for 16-pin modes, pins 63..32 for 32-pin modes
                %100 = pins 79..64 for 16-pin modes, pins 95..64 for 32-pin modes
                %101 = pins 95..80 for 16-pin modes, pins 95..64 for 32-pin modes
                %11x = no pins (reads 0's)
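
The mode and pin-group tables reduce to a little arithmetic, shown here as a Python sketch
with invented helper names. It only restates the two tables above and does not model the
transfer itself:

```python
def xfr_config(mode, group):
    """Pack a SETXFR value: %MMM_PPP."""
    return (mode & 0b111) << 3 | (group & 0b111)

def xfr_pins(mode, group):
    """Return (high pin, low pin) for a mode/group pair, or None if no pins."""
    if mode < 0b010 or group >= 0b110:
        return None                   # XFR off, or %11x = no pins
    if mode & 1:                      # %011, %101, %111 are the 32-pin modes
        low = (group >> 1) * 32
        return low + 31, low
    low = group * 16                  # %010, %100, %110 are the 16-pin modes
    return low + 15, low

print(bin(xfr_config(0b101, 0b010)), xfr_pins(0b101, 0b010))
```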


For QUADs_to_16_pins mode (%010), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 low word is output to pins
    2nd clock: QUAD0 high word is output to pins
    3rd clock: QUAD1 low word is output to pins
    4th clock: QUAD1 high word is output to pins
    5th clock: QUAD2 low word is output to pins
    6th clock: QUAD2 high word is output to pins
    7th clock: QUAD3 low word is output to pins
    8th clock: QUAD3 high word is output to pins


For QUADs_to_32_pins mode (%011), on the cycle after SETXFR is executed, the following
4-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 is output to pins
    2nd clock: QUAD1 is output to pins
    3rd clock: QUAD2 is output to pins
    4th clock: QUAD3 is output to pins


For 16_pins_to_QUADs mode (%100), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to QUAD0
    3rd clock: pins are sampled into low word
    4th clock: pins are sampled into high word, long is written to QUAD1
    5th clock: pins are sampled into low word
    6th clock: pins are sampled into high word, long is written to QUAD2
    7th clock: pins are sampled into low word
    8th clock: pins are sampled into high word, long is written to QUAD3


For 32_pins_to_QUADs mode (%101), on the cycle after SETXFR is executed, the following
4-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to QUAD0
    2nd clock: pins are sampled and written to QUAD1
    3rd clock: pins are sampled and written to QUAD2
    4th clock: pins are sampled and written to QUAD3


For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to stack at SPA++


For 32_pins_to_stack mode (%111), on the cycle after SETXFR is executed, the following
1-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to stack at SPA++


While a pins_to_stack mode is active, you should not read or write stack RAM or modify
SPA, as such attempts will likely interfere with XFR operation and cause unexpected
results. VID, however, has an asynchronous second port to the stack RAM, so it can
stream pixels at the same time XFR streams them in.

To stop XFR, execute 'SETXFR #0' on the last cycle of desired XFR operation.

An example of XFR usage is in the following program:

    SDRAM_Driver.spin



BIG MULTIPLIER
--------------

Aside from the 1-clock MACA/MACB instructions and the 2-clock MUL/SCL instructions which
perform 20x20-bit signed multiplies, each cog has a separate, larger multiplier that can
do 32x32-bit signed or unsigned multiplies while other instructions execute.

To start a big multiply, do either SETMULU (unsigned) or SETMULA (signed) to set the
first term, then do SETMULB to set the second term and start the multiplier. You'll
have 17 clocks of time to execute other code, if you wish, before doing GETMULL/GETMULH
to get the low/high long(s) of the result.

Here are the big multiplier instructions:

    SETMULU D/#n    - Set 1st input term and set unsigned operation
    SETMULA D/#n    - Set 1st input term and set signed operation
    SETMULB D/#n    - Set 2nd input term and start multiplier

    GETMULL D       - Get low long of result, waits if multiplier not done
    GETMULL D  WC   - Poll low long of result, C=1 if D valid, C=0 if multiplier busy
    GETMULH D       - Get high long of result, waits if multiplier not done
    GETMULH D  WC   - Poll high long of result, C=1 if D valid, C=0 if multiplier busy
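
For checking results on a host machine, the 32x32-bit multiply can be modeled in Python as
below. The function names are invented for this sketch; it returns the two longs in the order
that GETMULL and GETMULH would deliver them:

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def big_multiply(a, b, signed=False):
    """Return (low long, high long) of the 64-bit product of two 32-bit terms."""
    if signed:
        a, b = to_signed32(a), to_signed32(b)
    else:
        a, b = a & MASK32, b & MASK32
    product = (a * b) & 0xFFFFFFFF_FFFFFFFF   # keep 64 bits, two's complement
    return product & MASK32, product >> 32

low, high = big_multiply(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(low), hex(high))
```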



BIG DIVIDER
-----------

Each cog has a 64-over-32-bit divider which can do signed or unsigned divides while other
instructions execute. For signed divides, the remainder result will have the sign of the
numerator. Both the quotient and the remainder results are 32 bits.

To start a 64-over-32-bit divide, do SETDIVU (unsigned) or SETDIVA (signed) to set the
low long of the numerator, followed by another SETDIVU or SETDIVA to set the high long
of the numerator. Then do SETDIVB to load the denominator and start the divider. There
will be 17 clocks of time to execute other code, if you wish, before doing GETDIVQ/GETDIVR
to get the quotient/remainder long(s) of the result.

To start a 32-over-32-bit divide, just do one SETDIVU or SETDIVA before the SETDIVB.

Here are the divider instructions:

    SETDIVU D/#n    - Set low (then high) long of numerator and set unsigned operation
    SETDIVA D/#n    - Set low (then high) long of numerator and set signed operation
    SETDIVB D/#n    - Set denominator and start divider

    GETDIVQ D       - Get quotient result, waits if divider not done
    GETDIVQ D  WC   - Poll quotient result, C=1 if D valid, C=0 if divider busy
    GETDIVR D       - Get remainder result, waits if divider not done
    GETDIVR D  WC   - Poll remainder result, C=1 if D valid, C=0 if divider busy


To compute a 32-bit fractional value of A-over-B where A < B, you can do SETDIVU #0,
SETDIVU A, then SETDIVB B. GETDIVQ will return the fraction. For example: SETDIVU #0,
SETDIVU #1, SETDIVB #3 yields a quotient of $55555555, or 1/3 of $1_00000000.
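
The fractional trick works because the second SETDIVU places A in the high long, so the
numerator is A * $1_00000000. A Python sketch of the arithmetic follows (names invented
here). The quotient is assumed to truncate toward zero, which is consistent with the
remainder taking the sign of the numerator but is not stated explicitly above:

```python
def big_divide(numerator, denominator, signed=False):
    """Return (quotient, remainder), truncating toward zero.

    For unsigned use, pass non-negative values. For signed use, the
    remainder takes the sign of the numerator.
    """
    q = abs(numerator) // abs(denominator)
    r = abs(numerator) % abs(denominator)
    if signed:
        if (numerator < 0) != (denominator < 0):
            q = -q
        if numerator < 0:
            r = -r
    return q, r

def fraction32(a, b):
    """32-bit fraction A/B for A < B: low long 0, high long A, divided by B."""
    return big_divide(a << 32, b)[0]

print(hex(fraction32(1, 3)))
```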



SQUARE ROOTER
-------------

Each cog has a 64-bit square rooter which can compute square roots from unsigned values
while other instructions execute.

To start a 64-bit square root computation, do SETSQRH to set the high long of the input
term, then do SETSQRL to set the low long and start the square rooter. There will be 32
clocks of time to execute other code, if you wish, before doing GETSQRT to get the result.

To start a 32-bit square root computation, just do SETSQRL to set the low long and start
the square rooter. There will be 16 clocks of time to execute other code, if you wish,
before doing GETSQRT to get the result.

    SETSQRH D/#n    - Set high long of input term
    SETSQRL D/#n    - Set low long of input term and start square rooter

    GETSQRT D       - Get root result, waits if square rooter not done
    GETSQRT D  WC   - Poll root result, C=1 if D valid, C=0 if square rooter busy



CORDIC ENGINE
-------------

Each cog has a CORDIC engine which can perform logarithmic, exponential, trigonometric,
and hyperbolic functions while other instructions execute.

Here are the instructions associated with the CORDIC engine:

    QLOG    D/#n    - Compute logarithm                    (unsigned number -> log-base-2)
    QEXP    D/#n    - Compute exponential                  (log-base-2 -> unsigned number)

    QSINCOS D,S/#n  - Compute sine and cosine with amplitude          (polar -> cartesian)
    QARCTAN D,S/#n  - Compute distance and angle of (X,Y) to (0,0)    (cartesian -> polar)

    SETQZ   D/#n    - Set CORDIC Z, used to set angle before QROTATE
    QROTATE D,S/#n  - Rotate (X,Y) around (0,0) by an angle

    GETQX   D       - Get CORDIC X result, waits if CORDIC busy
    GETQX   D  WC   - Poll CORDIC X result, C=1 if D valid, C=0 if CORDIC busy
    GETQY   D       - Get CORDIC Y result, waits if CORDIC busy
    GETQY   D  WC   - Poll CORDIC Y result, C=1 if D valid, C=0 if CORDIC busy
    GETQZ   D       - Get CORDIC Z result, waits if CORDIC busy
    GETQZ   D  WC   - Poll CORDIC Z result, C=1 if D valid, C=0 if CORDIC busy

    SETQI   D/#n    - Set CORDIC trigonometric/hyperbolic and iteration modes


QLOG/QEXP usage:

To convert between 32-bit unsigned numbers and 32-bit log values, use QLOG or QEXP to set
the input term and begin the computation. Then do GETQZ to get the result. Log values are
encoded with the whole exponent in the top 5 bits and the fractional exponent in the
bottom 27 bits. Here are some examples of numbers converted to log values, then back to
numbers again using QLOG and QEXP:

    number ->   QLOG ->     QEXP
    ---------------------------------
    $00000000   $00000000   $00000001   (0 same as 1)
    $00000001   $00000000   $00000001
    $00000002   $08000000   $00000002
    $00000003   $0CAE00D2   $00000003
    $00000004   $10000000   $00000004
    $00000005   $12934F09   $00000005
    $07ADCBD8   $D786F595   $07ADCBD9   (first lossy bidirectional conversion, +1)
    $20000000   $E8000000   $20000000
    $40000000   $F0000000   $40000000
    $80000000   $F8000000   $80000000
    $FFFFFFFF   $FFFFFFFF   $FFFFFFE9   (last lossy bidirectional conversion, -22)
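
The log encoding can be approximated on a host machine as log2(n) * 2^27. The Python sketch
below is a floating-point model, not a description of the CORDIC hardware: rounding to nearest
and clamping at $FFFFFFFF are assumptions chosen because they reproduce the small entries in
the table above, and the hardware may differ in the last bits for other inputs:

```python
import math

def qlog_model(n):
    """Approximate the QLOG result for a 32-bit unsigned number."""
    if n <= 1:
        return 0                              # the table shows 0 treated as 1
    return min(round(math.log2(n) * (1 << 27)), 0xFFFFFFFF)

def log_fields(value):
    """Split a log value into (whole exponent, fractional exponent)."""
    return value >> 27, value & 0x07FFFFFF    # top 5 bits, bottom 27 bits

for n in (1, 2, 3, 4, 5):
    print(f"${n:08X} -> ${qlog_model(n):08X}")
```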


QSINCOS/QARCTAN/QROTATE usage:

For the circular functions, angles are 32 bits and roll over at 360 degrees:

    $00000000 = 0 degrees                (360 * $00000000 / $1_00000000)
    $00000001 = ~0.000000083819 degrees  (360 * $00000001 / $1_00000000)
    $00B60B61 = ~1 degree                (360 * $00B60B61 / $1_00000000)
    $20000000 = 45 degrees               (360 * $20000000 / $1_00000000)
    $40000000 = 90 degrees               (360 * $40000000 / $1_00000000)
    $80000000 = 180 degrees              (360 * $80000000 / $1_00000000)
    $C0000000 = 270 degrees              (360 * $C0000000 / $1_00000000)
    $FFFFFFFF = ~359.9999999162 degrees  (360 * $FFFFFFFF / $1_00000000)
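
Converting between degrees and this 32-bit angle format is a single scale factor. The Python
helpers below are illustrative (the names are not from the chip documentation); masking to
32 bits gives the same rollover at 360 degrees:

```python
def degrees_to_angle(degrees):
    """Convert degrees to a 32-bit angle, rounding to the nearest step."""
    return round(degrees / 360 * (1 << 32)) & 0xFFFFFFFF

def angle_to_degrees(angle):
    """Convert a 32-bit angle back to degrees in the range 0..<360."""
    return 360 * (angle & 0xFFFFFFFF) / (1 << 32)

print(f"${degrees_to_angle(1):08X}")
print(angle_to_degrees(0xC0000000))
```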


The X and Y inputs to the circular functions are signed 30-bit values, ranging from
-$2000_0000..+$1FFF_FFFF, conveyed by D and S (top two bits are ignored). No matter the
sizes of X and Y, the pair is internally MSB-justified to achieve maximal precision during
the CORDIC iterations, after which they are shifted back down and rounded to form the X
and Y results.

The circular functions will return X and Y results that are scaled by constant K, which is
~1.64676025812 for trigonometric mode or ~0.82815936096 for hyperbolic mode. This CORDIC
scaling can be compensated for, if necessary, by pre- or post-scaling X and/or Y by 1/K.

To compute sine and cosine simultaneously, the 'QSINCOS D,S/#n' instruction can be used,
with the angle supplied in D and the amplitude in S. Immediate #n values are special cases
where $00..$1F produce +/- 2^(n-1) amplitudes and $20..$3F produce 7/8ths of those
amplitudes. For example, #$09 will yield results ranging from -$100..$100 and #$29 will
yield results ranging from -$E0..$E0. Use GETQX and GETQY to retrieve the cosine and sine
results.
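
The immediate amplitude rule can be written out as follows. This Python sketch only restates
the rule in the paragraph above; the function name is invented, and it returns the nominal
amplitude as a float so that small n values need no special handling:

```python
def qsincos_immediate_amplitude(n):
    """Nominal amplitude selected by a QSINCOS immediate #n ($00..$3F)."""
    amplitude = 2.0 ** ((n & 0x1F) - 1)   # $00..$1F: 2^(n-1)
    if n & 0x20:                          # $20..$3F: 7/8ths of that amplitude
        amplitude *= 7 / 8
    return amplitude

print(qsincos_immediate_amplitude(0x09))  # matches the -$100..$100 example
print(qsincos_immediate_amplitude(0x29))  # matches the -$E0..$E0 example
```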

To convert an (X,Y) coordinate into a distance and angle relative to (0,0), do
'QARCTAN D,S/#n' with the X in D and the Y in S/#n. Use GETQX to get the distance and
GETQZ to get the angle.

To rotate an (X,Y) coordinate around (0,0), first do SETQZ to set the rotation angle, then
do 'QROTATE D,S/#n', with the X in D and the Y in S/#n. Use GETQX and GETQY to retrieve
the rotated (X,Y) coordinate.


CORDIC modes:

The SETQI instruction is used to switch between trigonometric and hyperbolic modes, and to
select between adaptive and fixed iterations:

    SETQI   D/#n    - Set CORDIC configuration to %M_IIIII (%0_00000 on cog start)

        %M = mode

            %0 = trigonometric (K = ~1.64676025812)
            %1 = hyperbolic    (K = ~0.82815936096)

        %IIIII = iterations

                    %00000 = adaptive iterations (adaptive resolution, variable time)
            %00001..%11111 = 1..31 fixed iterations (fixed resolution, constant time)


Hyperbolic mode changes the functionality of the QSINCOS/QARCTAN/QROTATE instructions so
that hyperbolics can be computed. When in hyperbolic mode, the CORDIC engine uses different
internal constants to track the angle, it skips the zeroth iteration, and the fourth and
thirteenth iterations are repeated to ensure convergence. Hence, both K and the number of
clock cycles differ between trigonometric and hyperbolic modes.

When %IIIII is %00000, the CORDIC engine selects an iteration count based on the magnitude
of the X and Y inputs to ensure an efficient computation which preserves initial precision.
For very exact QARCTAN computations, setting %IIIII to %11111 will ensure calculator-like
precision, even though (X,Y) may be small. In some cases, you may want to fix the iteration
count to ensure good-enough precision, but with budgeted/exact timing.


CORDIC timing:

Here is a table that shows how many free clocks are available for other instructions to
execute between QLOG/QEXP/QSINCOS/QARCTAN/QROTATE and GETQX/GETQY/GETQZ:

    i = %IIIII           i = 0 (adaptive)                    i = 1..31 (fixed)
    operation            clocks free                         clocks free
    --------------------------------------------------------------------------
    QLOG    D/#n         35                                  2 + i + h
    QEXP    D/#n         35                                  2 + i + h

    Trigonometric mode

    QSINCOS D,#n         2 + n                               2 + i
    QSINCOS D,S          5 + mag(abs(D) | abs(S))            3 + i
    QARCTAN D,S/#n       5 + mag(abs(D) | abs(S/#n))         3 + i
    QROTATE D,S/#n       5 + mag(abs(D) | abs(S/#n))         3 + i

    Hyperbolic mode

    QSINCOS D,#n         1 + n + j                           1 + i + h
    QSINCOS D,S          4 + mag(abs(D) | abs(S)) + k        2 + i + h
    QARCTAN D,S/#n       4 + mag(abs(D) | abs(S/#n)) + k     2 + i + h
    QROTATE D,S/#n       4 + mag(abs(D) | abs(S/#n)) + k     2 + i + h
    --------------------------------------------------------------------------

    h = 0 if i is 0..3       j = 0 if n is 1..3        k = 0 if mag is 0..1
        1 if i is 4..12          1 if n is 4..12           1 if mag is 2..10
        2 if i is 13..31         2 if n is 13..31          2 if mag is 11..30
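
For budgeting code between a CORDIC instruction and its GETQx, the fixed-iteration column of
the table can be turned into a lookup. The Python sketch below is a transcription of that
column only, with invented function and argument names. The adaptive column is not modeled:

```python
def h_term(i):
    """The h value from the table: extra clocks by iteration count."""
    if i <= 3:
        return 0
    return 1 if i <= 12 else 2

def free_clocks_fixed(op, i, hyperbolic=False, immediate=False):
    """Free clocks for fixed iterations i = 1..31.

    immediate=True selects the 'QSINCOS D,#n' row and is ignored for the
    other operations.
    """
    if not 1 <= i <= 31:
        raise ValueError("fixed iteration count must be 1..31")
    if op in ("QLOG", "QEXP"):
        return 2 + i + h_term(i)
    if op not in ("QSINCOS", "QARCTAN", "QROTATE"):
        raise ValueError("unknown CORDIC operation: " + op)
    short = op == "QSINCOS" and immediate
    if hyperbolic:
        return (1 if short else 2) + i + h_term(i)
    return (2 if short else 3) + i

print(free_clocks_fixed("QROTATE", 16))
print(free_clocks_fixed("QROTATE", 16, hyperbolic=True))
```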



MULTIPLY AND ACCUMULATE
-----------------------

Each cog has two 64-bit accumulators, ACCA and ACCB, which accumulate products from the
MACA/MACB instructions. The accumulators can also be cleared, set to arbitrary values,
adjusted to exponent and mantissa, and read back. On cog start, ACCA and ACCB are both
cleared to $00000000_00000000.

The MACA/MACB instructions each perform a 20x20-bit signed multiply and then add the
resultant 40-bit product into ACCA or ACCB in a single clock:

    MACA    D,S/#n          - multiply D[19:0] by S[19:0]/#n and accumulate into ACCA
    MACB    D,S/#n          - multiply D[19:0] by S[19:0]/#n and accumulate into ACCB


By using MACA/MACB with indirect addressing in a REPS/REPD loop, tap-per-clock FIR filters
can be realized in a few instructions:

        FIXINDA #buff+15,#buff          'set circular sample buffer
        FIXINDB #taps+15,#taps          'set circular tap buffer

:loop   REPS    #16,#1                  'ready for 16-tap FIR
        CLRACCA                         'clear ACCA
        MACA    INDB++,INDA++           'multiply and accumulate buff and taps (16 clocks)

        GETACCA result                  'get result
        '<use result>                   'use result

        '<get sample>                   'get new sample
        MOV     --INDA,sample           'enter new sample, buff scrolls against taps

        JMP     #:loop                  'loop
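
To verify filter output off-chip, the MACA arithmetic can be modeled in Python. This sketch
uses invented names and treats the accumulator as a 64-bit two's-complement value; it models
only the arithmetic of the loop above, not its timing or the circular buffer addressing:

```python
MASK64 = 0xFFFFFFFF_FFFFFFFF

def sign_extend20(x):
    """Interpret bits 19..0 of x as a signed 20-bit value."""
    x &= 0xFFFFF
    return x - (1 << 20) if x & 0x80000 else x

def mac(acc, d, s):
    """One MACA/MACB step: add the signed 20x20-bit product to a 64-bit accumulator."""
    return (acc + sign_extend20(d) * sign_extend20(s)) & MASK64

def fir(samples, taps):
    """Accumulate sample*tap products, starting from a cleared accumulator."""
    acc = 0                               # CLRACCA
    for sample, tap in zip(samples, taps):
        acc = mac(acc, tap, sample)
    return acc

print(fir([1] * 16, [1] * 16))
```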


The accumulators may be cleared by the following instructions:

    CLRACCA                 - clear ACCA to $00000000_00000000
    CLRACCB                 - clear ACCB to $00000000_00000000
    CLRACCS                 - clear ACCA and ACCB to $00000000_00000000


The accumulators may be set to arbitrary values by these instructions:

    SETACCA D,S/#n          - set the lower long of ACCA to D and upper long to S/#n
    SETACCB D,S/#n          - set the lower long of ACCB to D and upper long to S/#n


To make post-MACA/MACB computations simpler, the FITACCA/FITACCB/FITACCS instructions can
be used to shift the accumulators downward, in order to consolidate their leading bits into
the lower long, while the upper long gets set to a 6-bit exponent which represents how many
shifts were needed, if any, to fit the value (including the sign bit) into the lower long.
This fitting can be performed on ACCA and ACCB individually, or on ACCA and ACCB together,
in order to preserve their relative magnitudes. The FITACCA/FITACCB/FITACCS instructions
take 2 clocks, but won't execute until 2 clocks after MACA/MACB. So, if FITACCA immediately
follows MACA, FITACCA will take 4 clocks:

    FITACCA                 - fit ACCA
    FITACCB                 - fit ACCB
    FITACCS                 - fit ACCA and ACCB with a common exponent
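
One way to picture the fitting operation is the Python model below. It is a sketch under
stated assumptions, not a description of the hardware: the shift is taken to be an arithmetic
right shift that discards the low bits, and the function names are invented here:

```python
MASK32 = 0xFFFFFFFF

def to_signed64(x):
    """Interpret the low 64 bits of x as a two's-complement value."""
    x &= 0xFFFFFFFF_FFFFFFFF
    return x - (1 << 64) if x & (1 << 63) else x

def fit(acc):
    """Return (lower long, exponent) after fitting a 64-bit accumulator."""
    value, exponent = to_signed64(acc), 0
    while not -(1 << 31) <= value < (1 << 31):
        value >>= 1                       # assumed: arithmetic shift, bits dropped
        exponent += 1
    return value & MASK32, exponent

def fit_pair(acc_a, acc_b):
    """Fit both accumulators with a common exponent, as FITACCS is described."""
    exponent = max(fit(acc_a)[1], fit(acc_b)[1])
    return ((to_signed64(acc_a) >> exponent) & MASK32,
            (to_signed64(acc_b) >> exponent) & MASK32,
            exponent)

print(fit(1 << 40))
print(fit_pair(1 << 40, 1 << 20))
```

In this model the exponent never exceeds 32, which fits in the 6-bit exponent field.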


The GETACCA/GETACCB instructions are used to read back the contents of the accumulators.
GETACCA/GETACCB will always return the lower long of the accumulator, unless the lower long
has already been read and no intervening operation has changed the accumulator's contents,
in which case the upper long will be returned. These instructions take 1 clock, but won't
execute until 2 clocks after MACA/MACB. So, if GETACCA immediately follows MACA, GETACCA
will take 3 clocks:

    GETACCA D               - get lower long of ACCA, then upper long
    GETACCB D               - get lower long of ACCB, then upper long