Propeller II: Emulation of the P2 on FPGA boards (Prop123-A7/A9, DE0-NANO, DE2-115, etc) - Page 4 — Parallax Forums



Comments

  • jmg Posts: 15,179
    edited 2012-11-29 22:35
    cgracey wrote: »
    What if we launch pnut.exe with a command switch that tells it to compile, download, and then shut down if there were no problems? That would be totally hands-free then. Would that work for you?

    Yes, that option is always a good idea in tools. It should create a log file too.
    That allows users to slave the compile & download to a smarter process.
    Of course, for small changes, they can work in the pnut editor.
  • cgracey Posts: 14,232
    edited 2012-11-29 22:43
    I can get around this with a key sequence or macro but the biggest problem at the moment is the 2K bin file limit, is that a bug?

    Here is a new version of pnut.exe (in .zip) which, when called with a filename.spin in the parameter, will load, compile, download, and close. Just do 'pnut filename.spin' from your editor.

    pnut.zip

    If a compile error occurs, it will be shown and the app will not close automatically.
  • cgracey Posts: 14,232
    edited 2012-11-29 22:50
    As I explained above, we are just loading the 'loader' which is given $1F8 longs + 8 longs of SHA-256/HMAC key to make a 2K initial load.

    It is that code's job to further load more data and execute it. If anyone is making development tools, this is the hook that the chip offers. What that $1F8-long program does is up to you. Eventually, tools will mask this issue and make the phenomenon transparent to the programmer, but if you are making a low-level tool system, this is your hookup.

    I will eventually make a re-loader that brings in more data, but right now I'm working on the documentation for the instructions. So, for now, this is all we've got.
  • Peter Jakacki Posts: 10,193
    edited 2012-11-29 22:57
    cgracey wrote: »
    As I explained above, we are just loading the 'loader' which is given $1F8 longs + 8 longs of SHA-256/HMAC key to make a 2K initial load.

    It is that code's job to further load more data and execute it. If anyone is making development tools, this is the hook that the chip offers. What that $1F8-long program does is up to you. Eventually, tools will mask this issue and make the phenomenon transparent to the programmer, but if you are making a low-level tool system, this is your hookup.

    I will eventually make a re-loader that brings in more data, but right now I'm working on the documentation for the instructions. So, for now, this is all we've got.

    So I won't be able to do what I want to do then :( PNUT seems to compile the obj file fine but how do I convert that? Is it usable?
    BTW, thanks for the PNUT update :) (EDIT: Worked like a charm, I assigned it to F10 in Context and bingo!)
  • cgracey Posts: 14,232
    edited 2012-11-29 23:21
    So I won't be able to do what I want to do then :( PNUT seems to compile the obj file fine but how do I convert that? Is it usable?
    BTW, thanks for the PNUT update :) (EDIT: Worked like a charm, I assigned it to F10 in Context and bingo!)

    Every time you compile, a filename.obj file is output that contains all the binary data. You can use this program to inspect it (rename it to hexedit.exe - note that this viewer always shows a $00 byte at the end of the file which is not really there):

    hexedit.txt

    Can you imagine a way to get that object data into the emulator?
  • Peter Jakacki Posts: 10,193
    edited 2012-11-29 23:27
    cgracey wrote: »
    Every time you compile, a filename.obj file is output that contains all the binary data. You can use this program to inspect it (rename it to hexedit.exe - note that this viewer always shows a $00 byte at the end of the file which is not really there):

    hexedit.txt

    Can you imagine a way to get that object data into the emulator?
    I always use ZTREE a lot if I happen to be using windows so I can see into all the files easily. Is there anything special about obj files? I just got ZTREE to do a file comparison in hex mode and the first 2K is identical so that means I can just modify my kernel to accept a binary loader.
  • cgracey Posts: 14,232
    edited 2012-11-29 23:34
    I always use ZTREE a lot if I happen to be using windows so I can see into all the files easily. Is there anything special about obj files? I just got ZTREE to do a file comparison in hex mode and the first 2K is identical so that means I can just modify my kernel to accept a binary loader.

    The filename.bin is always 2KB, but the filename.obj is as big as all the code you compiled.

    The program that you download must resume communication that the booter was carrying on, in order to bring the rest of the data into the hub. Imagine you download a $1F8-long program. Then, you execute another application on the PC that communicates serially with your downloaded $1F8-long program to load the rest of your stuff in and execute it.
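
As a rough illustration of the sizes involved, the Python sketch below splits a compiled image into the 2 KB first-stage load and the remainder that a second-stage loader would still have to bring in. The name `split_image` is hypothetical, and the sketch assumes (per the hex comparison reported earlier in the thread) that the first 2 KB of filename.obj match filename.bin. It says nothing about the serial protocol itself, which is up to whoever writes the loader.

```python
FIRST_STAGE_LONGS = 0x1F8   # loader program size in longs, as stated in the thread
KEY_LONGS = 8               # SHA-256/HMAC key size in longs
BYTES_PER_LONG = 4
FIRST_STAGE_BYTES = (FIRST_STAGE_LONGS + KEY_LONGS) * BYTES_PER_LONG  # 2048 bytes


def split_image(obj_bytes):
    """Split an image into the initial 2 KB load and whatever follows it.

    Hypothetical helper: assumes the image starts with the same 2 KB
    that the booter receives.
    """
    return obj_bytes[:FIRST_STAGE_BYTES], obj_bytes[FIRST_STAGE_BYTES:]


# Stand-in data only; a real tool would read filename.obj from disk.
first_stage, remainder = split_image(bytes(3000))
```

With a 3000-byte stand-in image, `first_stage` holds 2048 bytes and `remainder` holds the other 952 bytes that the downloaded program would need to fetch.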
  • Peter Jakacki Posts: 10,193
    edited 2012-11-29 23:39
    cgracey wrote: »
    The filename.bin is always 2KB, but the filename.obj is as big as all the code you compiled.

    The program that you download must resume communication that the booter was carrying on, in order to bring the rest of the data into the hub. Imagine you download a $1F8-long program. Then, you execute another application on the PC that communicates serially with your downloaded $1F8-long program to load the rest of your stuff in and execute it.
    Yes, I want to use the same code to handle the second stage as well so that I just send the obj file to it and it fills in the blanks from there. Teraterm allows me to send a file in binary form so that might be good enough for now on the second stage load.

    Once this kernel is running I will use it to interact both at a high level and low level with the P2 and hardware. Should be fun!
  • cgracey Posts: 14,232
    edited 2012-11-30 00:13
    Yes, I want to use the same code to handle the second stage as well so that I just send the obj file to it and it fills in the blanks from there. Teraterm allows me to send a file in binary form so that might be good enough for now on the second stage load.

    Once this kernel is running I will use it to interact both at a high level and low level with the P2 and hardware. Should be fun!

    I think you're on your way. It's a good thing you've got the big DE2-115 so that you've got more than one cog and a full 128KB memory.
  • Peter Jakacki Posts: 10,193
    edited 2012-11-30 00:19
    cgracey wrote: »
    I think you're on your way. It's a good thing you've got the big DE2-115 so that you've got more than one cog and a full 128KB memory.
    Yep, and these boards are really good value for money and I can see them being very useful for debugging plus it's not possible to really brick them either. I've built in my serial routines and just putting in the basics to get it to chain load in the rest, fingers crossed.
  • Baggers Posts: 3,019
    edited 2012-11-30 02:47
    Looks like you guys are all having fun :)
    Can't wait to join in!
    But in the mean time I'm still playing on Prop1!
  • Sapieha Posts: 2,964
    edited 2012-11-30 04:18
    Hi Chip.

    Can you post a map of the COG's special registers?

    That would make it simpler to understand the instructions in "Propeller2DetailedPreliminaryFeatureList-v2.0.pdf"
  • David Betz Posts: 14,516
    edited 2012-11-30 04:42
    cgracey wrote: »
    Only $1F8 longs are being loaded.

    What we are doing is just loading what is actually the 'loader' that would perform further loading, decryption, etc. I will have to make a re-loader program to download, in lieu of the $1F8 longs which are now being sent, which will then load the user code in, up to the top of memory, if there is so much.
    Do you want me to work on a second stage loader? I could use some of the code I wrote for propeller-load.
  • nutson Posts: 242
    edited 2012-11-30 05:21
    Sapieha. I remember Chip saying there were more than 40 registers now. We will get a full description in due time.

    Just read the preliminary feature list and made this list for my own reference:

    There are 10 memory mapped registers:

    INDA/B 0x1F6 - 0x1F7 Indirect access to COG memory
    PINA/B/C/D 0x1F8 - 0x1FB Read / write I/O ports
    DIRA/B/C/D 0x1FC - 0x1FF Set pins to output

    All other registers can be accessed only with specialised instructions

    PTRA/B Pointer for hub access
    SPA/B CLUT (stack) pointer
    CNT System time counter
    LFSR Random number generator
    MACA/B Accu for 64 bit MAC operation
    CTRA/B Each have FRQ, PHS, SIN and COS registers
    MULLL/H etc, registers to access the multiply, divide, SQRT and CORDIC operations
    DAC0/3 configuration and data for the DACs
  • cgracey Posts: 14,232
    edited 2012-11-30 05:48
    I've been working on the instruction set documentation and I've completed the parts that cover:

    1) Hub memory instructions
    2) Hub control instructions
    3) Cog RAM indirect instructions - New # syntax for SETINDx/FIXINDx
    4) Cog stack RAM instructions
    5) Multi-tasking

    There is a new PNUT.EXE in this .zip which supports the new SETINDx/FIXINDx syntax. Also, all the files anyone needs to use the DE0-Nano or DE2-115 are in here:

    Terasic_Prop2.zip

    Current docs:
    PROPELLER 2 MEMORY
    ------------------
    
    In the Propeller 2, there are two primary types of memory:
    
    HUB MEMORY
    
        128K bytes of main memory shared by all cogs
    
            - cogs launch from this memory
            - cogs can access this memory as bytes, words, longs, and quads (4 longs)
            - $00000..$00E7F is ROM - contains Booter, SHA-256/HMAC, and Monitor
            - $00E80..$1FFFF is RAM - for application usage
    
    
    COG MEMORY (8 instances)
    
        512 longs of register RAM for code and data usage
    
            - simultaneous instruction, source, and destination reading, plus writing
            - last eight registers are for I/O pin control
    
        256 longs of stack RAM for data and video usage
    
            - accessible via push and pop operations
            - video circuit can read data simultaneously and asynchronously
    
    
    
    HUB MEMORY INSTRUCTIONS
    -----------------------
    
    These instructions read and write hub memory.
    
    All instructions use D as the data conduit, except WRQUAD/RDQUAD/RDQUADC, which use the four QUAD
    registers. The QUADs can be mapped into cog register space using the SETQUAD instruction or kept
    hidden, in which case they are still useful as data conduit and as a read cache. If mapped, the QUADs
    overlay four contiguous cog registers which can begin at any double-even address (%xxxxxxx00). These
    overlaid registers can be read and written as any other registers, as well as executed. Any write via
    D to the QUAD registers, when mapped, will affect the underlying cog registers, as well. A RDQUAD/
    RDQUADC will affect the QUAD registers, but not the underlying cog registers.
    
    The cached reads RDBYTEC/RDWORDC/RDLONGC/RDQUADC will do a RDQUAD if the current read address is
    outside of the 4-long window of the prior RDQUAD. Otherwise, they will immediately return cached
    data. The CACHEX instruction invalidates the cache, forcing a fresh RDQUAD next time a cached read
    executes.
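
The hit/miss rule for the cached reads can be sketched in Python. This is only a toy model under stated assumptions: the class and method names are hypothetical, the "4-long window" is taken to be the 16-byte-aligned block containing the address, and only the bookkeeping is modeled, not the data or the timing.

```python
QUAD_BYTES = 16  # a quad is four longs


class QuadCacheModel:
    """Toy model of the read cache: tracks which quad window was read last."""

    def __init__(self):
        self.window = None   # base address of the cached quad, None if invalid
        self.hub_reads = 0   # how many times a RDQUAD had to be performed

    def cached_read(self, address):
        """Return True on a cache hit, False if a fresh RDQUAD was needed."""
        base = address & ~(QUAD_BYTES - 1)   # assumed: window is 16-byte aligned
        if base == self.window:
            return True
        self.window = base
        self.hub_reads += 1
        return False

    def cachex(self):
        """Model CACHEX: invalidate so the next cached read goes to the hub."""
        self.window = None
```

In this model, reading $100 misses, reading $10F then hits, and reading $110 misses again because it falls in the next window.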
    
    Hub memory instructions must wait for their cog's hub cycle, which comes once every 8 clocks. The
    timing relationship between a cog's instruction stream and its hub cycle is generally indeterminate,
    causing these instructions to take varying numbers of clocks. Timing can be made deterministic, though,
    by intentionally spacing these instructions apart so that after the first in a series executes, the
    subsequent hub memory instructions fall on hub cycles, making them take the minimal numbers of
    clocks. The trick is to write useful code to go in between them.
    
    WRBYTE/WRWORD/WRLONG/WRQUAD/RDQUAD complete on the hub cycle, making them take 1..8 clocks.
    
    RDBYTE/RDWORD/RDLONG complete on the 2nd clock after the hub cycle, making them take 3..10 clocks.
    
    RDBYTEC/RDWORDC/RDLONGC take only 1 clock if data is cached, otherwise 3..10 clocks.
    
    RDQUADC takes only 1 clock if data is cached, otherwise 1..8 clocks.
    
    After a RDQUAD, the QUAD registers are accessible via D and S on the 3rd clock and executable on the
    5th clock.
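
The clock ranges quoted above follow from how long the instruction waits for its hub cycle. A minimal Python sketch of that arithmetic, assuming a hypothetical `wait` value of 0..7 clocks until the cog's hub cycle arrives (the function and parameter names are not from the documentation):

```python
HUB_PERIOD = 8  # a cog's hub cycle comes once every 8 clocks


def hub_clocks(wait, is_bwl_read):
    """Clocks taken by an uncached hub memory instruction.

    wait        : clocks (0..7) until the cog's hub cycle arrives
    is_bwl_read : True for RDBYTE/RDWORD/RDLONG, which complete on the
                  2nd clock after the hub cycle; False for the writes and
                  RDQUAD, which complete on the hub cycle itself
    """
    if not 0 <= wait < HUB_PERIOD:
        raise ValueError("wait must be 0..7")
    return wait + (3 if is_bwl_read else 1)
```

With `wait` ranging over 0..7 this gives 1..8 clocks for the writes and RDQUAD, and 3..10 clocks for RDBYTE/RDWORD/RDLONG, matching the figures in the text.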
    
    
    instructions                                                                                       clocks
    ---------------------------------------------------------------------------------------------------------
    000000 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRBYTE  D,S       'write lower byte in D at S              1..8
    000000 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRBYTE  D,PTR     'write lower byte in D at PTR            1..8
    000000 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDBYTE  D,S       'read byte at S into D                  3..10
    000000 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDBYTE  D,PTR     'read byte at PTR into D                3..10
    000000 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDBYTEC D,S       'read cached byte at S into D        1, 3..10 
    000000 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDBYTEC D,PTR     'read cached byte at PTR into D      1, 3..10
    
    000001 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRWORD  D,S       'write lower word in D at S              1..8
    000001 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRWORD  D,PTR     'write lower word in D at PTR            1..8
    000001 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDWORD  D,S       'read word at S into D                  3..10
    000001 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDWORD  D,PTR     'read word at PTR into D                3..10
    000001 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDWORDC D,S       'read cached word at S into D        1, 3..10
    000001 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDWORDC D,PTR     'read cached word at PTR into D      1, 3..10
    
    000010 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRLONG  D,S       'write D at S                            1..8
    000010 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRLONG  D,PTR     'write D at PTR                          1..8
    000010 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDLONG  D,S       'read long at S into D                  3..10
    000010 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDLONG  D,PTR     'read long at PTR into D                3..10
    000010 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDLONGC D,S       'read cached long at S into D        1, 3..10
    000010 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDLONGC D,PTR     'read cached long at PTR into D      1, 3..10
    
    000011 000 0 CCCC DDDDDDDDD 010110000     WRQUAD  D         'write QUADs at D                        1..8
    000011 001 1 CCCC SUPNNNNNN 010110000     WRQUAD  PTR       'write QUADs at PTR                      1..8
    000011 000 0 CCCC DDDDDDDDD 010110001     RDQUAD  D         'read quad at D into QUADs               1..8
    000011 001 1 CCCC SUPNNNNNN 010110001     RDQUAD  PTR       'read quad at PTR into QUADs             1..8
    000011 010 0 CCCC DDDDDDDDD 010110001     RDQUADC D         'read cached quad at D into QUADs     1, 1..8
    000011 011 1 CCCC SUPNNNNNN 010110001     RDQUADC PTR       'read cached quad at PTR into QUADs   1, 1..8
    ---------------------------------------------------------------------------------------------------------
    
    
    PTR expressions:
    
        INDEX = -32..+31 for simple offsets, 0..31 for ++'s, or 0..32 for --'s
        SCALE = 1 for byte, 2 for word, 4 for long, or 16 for quad
    
        S = 0 for PTRA, 1 for PTRB
        U = 0 to keep PTRx same, 1 to update PTRx
        P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
        NNNNNN = INDEX
        nnnnnn = -INDEX
    
    
        SUPNNNNNN     PTR expression
        -----------------------------------------------------------------------------
        000000000     PTRA              'use PTRA
        100000000     PTRB              'use PTRB
        011000001     PTRA++            'use PTRA,                PTRA += SCALE
        111000001     PTRB++            'use PTRB,                PTRB += SCALE
        011111111     PTRA--            'use PTRA,                PTRA -= SCALE
        111111111     PTRB--            'use PTRB,                PTRB -= SCALE
        010000001     ++PTRA            'use PTRA + SCALE,        PTRA += SCALE
        110000001     ++PTRB            'use PTRB + SCALE,        PTRB += SCALE
        010111111     --PTRA            'use PTRA - SCALE,        PTRA -= SCALE
        110111111     --PTRB            'use PTRB - SCALE,        PTRB -= SCALE
    
        000NNNNNN     PTRA[INDEX]       'use PTRA + INDEX*SCALE
        100NNNNNN     PTRB[INDEX]       'use PTRB + INDEX*SCALE
        011NNNNNN     PTRA++[INDEX]     'use PTRA,                PTRA += INDEX*SCALE
        111NNNNNN     PTRB++[INDEX]     'use PTRB,                PTRB += INDEX*SCALE
        011nnnnnn     PTRA--[INDEX]     'use PTRA,                PTRA -= INDEX*SCALE
        111nnnnnn     PTRB--[INDEX]     'use PTRB,                PTRB -= INDEX*SCALE
        010NNNNNN     ++PTRA[INDEX]     'use PTRA + INDEX*SCALE,  PTRA += INDEX*SCALE
        110NNNNNN     ++PTRB[INDEX]     'use PTRB + INDEX*SCALE,  PTRB += INDEX*SCALE
        010nnnnnn     --PTRA[INDEX]     'use PTRA - INDEX*SCALE,  PTRA -= INDEX*SCALE
        110nnnnnn     --PTRB[INDEX]     'use PTRB - INDEX*SCALE,  PTRB -= INDEX*SCALE
    
    
    Examples:
    
    000000 Z01 1 CCCC DDDDDDDDD 000000000     RDBYTE  D,PTRA         'read byte at PTRA into D
    000001 000 1 CCCC DDDDDDDDD 111000001     WRWORD  D,PTRB++       'write lower word in D at PTRB,      PTRB += 2
    000010 Z01 1 CCCC DDDDDDDDD 011111111     RDLONG  D,PTRA--       'read long at PTRA into D,           PTRA -= 4
    000011 001 1 CCCC 110000001 010110001     RDQUAD  ++PTRB         'read quad at PTRB+16 into QUADs,    PTRB += 16
    000000 000 1 CCCC DDDDDDDDD 010111111     WRBYTE  D,--PTRA       'write lower byte in D at PTRA-1,    PTRA -= 1
    
    000001 000 1 CCCC DDDDDDDDD 100000111     WRWORD  D,PTRB[7]      'write lower word in D to PTRB+7*2
    000010 Z11 1 CCCC DDDDDDDDD 011001111     RDLONGC D,PTRA++[15]   'read cached long at PTRA into D,    PTRA += 15*4
    000011 001 1 CCCC 111111101 010110000     WRQUAD  PTRB--[3]      'write QUADs at PTRB,                PTRB -= 3*16
    000000 000 1 CCCC DDDDDDDDD 010000110     WRBYTE  D,++PTRA[6]    'write lower byte in D to PTRA+6*1,  PTRA += 6*1
    000001 Z01 1 CCCC DDDDDDDDD 110110110     RDWORD  D,--PTRB[10]   'read word at PTRB-10*2 into D,      PTRB -= 10*2
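
The SUPNNNNNN field in these examples can be reproduced with a short Python sketch. The function name and the mode strings are hypothetical; the bit layout and index ranges are taken from the PTR expression table above. A bare `PTRA++` or `--PTRB` corresponds to an index of 1.

```python
def encode_ptr(ptr, mode="", index=0):
    """Return the 9-bit SUPNNNNNN field for a PTR expression.

    ptr   : "A" or "B"
    mode  : ""             for PTRx[index] (no update)
            "x++" / "x--"  for post-modify
            "++x" / "--x"  for pre-modify
    index : INDEX as written in the source (use 1 for a bare ++ or --)
    """
    s = {"A": 0, "B": 1}[ptr]
    if mode == "":
        if not -32 <= index <= 31:
            raise ValueError("simple offset must be -32..+31")
        u, p, n = 0, 0, index
    elif mode in ("x++", "x--", "++x", "--x"):
        increment = "++" in mode
        if not 0 <= index <= (31 if increment else 32):
            raise ValueError("index out of range for this mode")
        u = 1
        p = 1 if mode.startswith("x") else 0   # P = 1 means post-modify
        n = index if increment else -index
    else:
        raise ValueError("unknown mode")
    return (s << 8) | (u << 7) | (p << 6) | (n & 0x3F)


# Two of the examples above: PTRB[7] and PTRB--[3]
assert encode_ptr("B", "", 7) == 0b100000111
assert encode_ptr("B", "x--", 3) == 0b111111101
```

Negative indexes are stored as 6-bit two's complement values, which is why `PTRB--[3]` ends in `111101`.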
    
    
    Bytes, words, longs, and quads are addressed as follows: 
    
        for WRBYTE/RDBYTE/RDBYTEC, address = %XXXXXXXXXXXXXXXXX (bits 16..0 are used)
        for WRWORD/RDWORD/RDWORDC, address = %XXXXXXXXXXXXXXXX- (bits 16..1 are used)
        for WRLONG/RDLONG/RDLONGC, address = %XXXXXXXXXXXXXXX-- (bits 16..2 are used)
        for WRQUAD/RDQUAD/RDQUADC, address = %XXXXXXXXXXXXX---- (bits 16..4 are used)
    
    address  byte  word    long        quad
    -------------------------------------------------------------------
    00000-   50   *7250   *706F7250   *0C7CCC030C7C200020302E32706F7250
    00001-   72    7250    706F7250    0C7CCC030C7C200020302E32706F7250
    00002-   6F   *706F    706F7250    0C7CCC030C7C200020302E32706F7250
    00003-   70    706F    706F7250    0C7CCC030C7C200020302E32706F7250
    00004-   32   *2E32   *20302E32    0C7CCC030C7C200020302E32706F7250
    00005-   2E    2E32    20302E32    0C7CCC030C7C200020302E32706F7250
    00006-   30   *2030    20302E32    0C7CCC030C7C200020302E32706F7250
    00007-   20    2030    20302E32    0C7CCC030C7C200020302E32706F7250
    00008-   00   *2000   *0C7C2000    0C7CCC030C7C200020302E32706F7250
    00009-   20    2000    0C7C2000    0C7CCC030C7C200020302E32706F7250
    0000A-   7C   *0C7C    0C7C2000    0C7CCC030C7C200020302E32706F7250
    0000B-   0C    0C7C    0C7C2000    0C7CCC030C7C200020302E32706F7250
    0000C-   03   *CC03   *0C7CCC03    0C7CCC030C7C200020302E32706F7250
    0000D-   CC    CC03    0C7CCC03    0C7CCC030C7C200020302E32706F7250
    0000E-   7C   *0C7C    0C7CCC03    0C7CCC030C7C200020302E32706F7250
    0000F-   0C    0C7C    0C7CCC03    0C7CCC030C7C200020302E32706F7250
    00010-   45   *FE45   *0DC1FE45   *0D7CC6010C7CC6010CFCB6E30DC1FE45
    00011-   FE    FE45    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
    00012-   C1   *0DC1    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
    00013-   0D    0DC1    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
    00014-   E3   *B6E3   *0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
    00015-   B6    B6E3    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
    00016-   FC   *0CFC    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
    00017-   0C    0CFC    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
    00018-   01   *C601   *0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
    00019-   C6    C601    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
    0001A-   7C   *0C7C    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
    0001B-   0C    0C7C    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
    0001C-   01   *C601   *0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
    0001D-   C6    C601    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
    0001E-   7C   *0D7C    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
    0001F-   0D    0D7C    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
    
    * new word/long/quad
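
The table can be reproduced with a few lines of Python: each access size ignores the low address bits shown as dashes above and assembles the value little-endian. The helper name `hub_read` is hypothetical, and the byte values are copied from the first sixteen rows of the table.

```python
# Bytes at $00000..$0000F, copied from the table above
HUB_BYTES = bytes.fromhex("50726F70322E3020" "00207C0C03CC7C0C")

SIZES = {"byte": 1, "word": 2, "long": 4, "quad": 16}


def hub_read(memory, address, kind):
    """Read a little-endian value, ignoring the low address bits for its size."""
    size = SIZES[kind]
    base = address & ~(size - 1)   # drops bit 0, bits 1..0, or bits 3..0
    return int.from_bytes(memory[base:base + size], "little")


assert hub_read(HUB_BYTES, 0x3, "word") == 0x706F
assert hub_read(HUB_BYTES, 0x5, "long") == 0x20302E32
```

The word read at address 3 returns the same value as at address 2, and the long read at address 5 returns the long starting at address 4, just as the unstarred rows in the table show.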
    
    
    
    
    PTRA/PTRB INSTRUCTIONS
    ----------------------
    
    Each cog has two 17-bit pointers, PTRA and PTRB, which can be read, written, modified,
    and used to access hub memory.
    
    At cog startup, the PTRA and PTRB registers are initialized as follows:
    
        PTRA = %X_XXXXXXXX_XXXXXXXX, data from launching cog, usually a pointer
        PTRB = %X_XXXXXXXX_XXXXXX00, long address in hub where cog code was loaded from
    
    
    instructions                                                                               clocks
    -------------------------------------------------------------------------------------------------
    000011 ZCR 1 CCCC DDDDDDDDD 000010010     GETPTRA D         'get PTRA into D, C = PTRA[16]      1
    000011 ZCR 1 CCCC DDDDDDDDD 000010011     GETPTRB D         'get PTRB into D, C = PTRB[16]      1
    
    000011 000 1 CCCC DDDDDDDDD 010110010     SETPTRA D         'set PTRA to D                      1
    000011 001 1 CCCC nnnnnnnnn 010110010     SETPTRA #n        'set PTRA to 0..511                 1
    000011 000 1 CCCC DDDDDDDDD 010110011     SETPTRB D         'set PTRB to D                      1
    000011 001 1 CCCC nnnnnnnnn 010110011     SETPTRB #n        'set PTRB to 0..511                 1
    
    000011 000 1 CCCC DDDDDDDDD 010110100     ADDPTRA D         'add D into PTRA                    1
    000011 001 1 CCCC nnnnnnnnn 010110100     ADDPTRA #n        'add 0..511 into PTRA               1
    000011 000 1 CCCC DDDDDDDDD 010110101     ADDPTRB D         'add D into PTRB                    1
    000011 001 1 CCCC nnnnnnnnn 010110101     ADDPTRB #n        'add 0..511 into PTRB               1
    
    000011 000 1 CCCC DDDDDDDDD 010110110     SUBPTRA D         'subtract D from PTRA               1
    000011 001 1 CCCC nnnnnnnnn 010110110     SUBPTRA #n        'subtract 0..511 from PTRA          1
    000011 000 1 CCCC DDDDDDDDD 010110111     SUBPTRB D         'subtract D from PTRB               1
    000011 001 1 CCCC nnnnnnnnn 010110111     SUBPTRB #n        'subtract 0..511 from PTRB          1
    -------------------------------------------------------------------------------------------------
    
    
    
    QUAD-RELATED INSTRUCTIONS
    -------------------------
    
    Each cog has four QUAD registers which form a 128-bit conduit between the hub memory and the cog.
    This conduit can transfer four longs every 8 clocks via the WRQUAD/RDQUAD instructions. It can
    also be used as a 4-long/8-word/16-byte read cache, utilized by RDBYTEC/RDWORDC/RDLONGC/RDQUADC.
    
    Initially hidden, these QUAD registers are mappable into cog register space by using the SETQUAD
    instruction to set a double-even address (%xxxxxxx00) where the base register is to appear, with
    the other three registers following. To hide the QUAD registers, use SETQUAD to set an address
    which is not double-even.
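
The mapping rule can be restated as a small Python check. The function name is hypothetical, and the sketch only covers addresses in the 0..511 range that the immediate form of SETQUAD accepts.

```python
def quad_registers(setquad_address):
    """Return the four cog registers the QUADs overlay, or None if hidden."""
    if not 0 <= setquad_address <= 0x1FF:
        raise ValueError("cog register addresses are 0..511")
    if setquad_address & 0b11:     # not double-even: QUADs stay hidden
        return None
    return [setquad_address + i for i in range(4)]
```

For example, an address of $1F0 maps the QUADs onto $1F0..$1F3, while $1F1 leaves them hidden.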
    
    
    instructions                                                                               clocks
    -------------------------------------------------------------------------------------------------
    000011 000 1 CCCC 000000000 000001000     CACHEX            'invalidate cache                   1
    000011 Z01 1 CCCC DDDDDDDDD 000010001     GETTOPS D         'get top bytes of QUADs into D      1
    000011 000 1 CCCC DDDDDDDDD 011100010     SETQUAD D         'set QUAD base address to D         1
    000011 001 1 CCCC nnnnnnnnn 011100010     SETQUAD #n        'set QUAD base address to 0..511    1
    -------------------------------------------------------------------------------------------------
    
    
    
    HUB CONTROL INSTRUCTIONS
    ------------------------
    
    These instructions are used to control hub circuits and cogs.
    
    Hub instructions must wait for their cog's hub cycle, which comes once every 8 clocks. In cases where
    there is no result to wait for (ZCR = %000), these instructions complete on the hub cycle, making
    them take 1..8 clocks, depending on where the hub cycle is in relation to the instruction. In cases
    where a result is anticipated (ZCR <> %000), these instructions complete on the 1st clock after the
    hub cycle, making them take 2..9 clocks.
    
    
    COGINIT D,S
    -----------
    
    COGINIT is used to start cogs. Any cog can be (re)started, whether it is idle or running. A cog
    can even execute a COGINIT to restart itself with a new program.
    
    COGINIT uses D to specify a long address in hub memory that is the start of the program that is to be
    loaded into a cog, while S is a 17-bit parameter (usually an address) that will be conveyed to PTRA
    of the started cog. PTRB of the started cog will be set to the start address of its program that was
    loaded from hub memory.
    
    SETCOG must be executed before COGINIT to set the number of the cog to be started (0..7). If SETCOG
    sets a value with bit 3 set (%1xxx), this will cause the next idle cog to be started when COGINIT is
    executed, with the number of the cog started being returned in D, and the C flag returning 0 if okay,
    or 1 if no idle cog was available. At cog startup, SETCOG is initialized to %0000.
    
    When a cog is started, $1F8 contiguous longs are read from hub memory and written to cog registers
    $000..$1F7. The cog will then begin execution at $000. This process takes 1,016 clocks.
    
    Example:
    
            COGID   COGNUM           'what cog am I?
            SETCOG  COGNUM           'set my cog number
            COGINIT COGPGM,COGPTR    'restart me with the ROM Monitor
    
    COGPGM  LONG    $0070C           'address of the ROM Monitor
    COGPTR  LONG    90<<9 + 91       'tx = P90, rx = P91
    
    COGNUM  RES     1
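
The COGPTR constant in this example packs two pin numbers into the parameter that COGINIT passes to PTRA. A Python sketch of the same arithmetic follows; the function name is hypothetical, and the layout (tx in the upper field, rx in the lower 9 bits) is read off the comment in the example rather than from any Monitor documentation.

```python
def monitor_pins(tx_pin, rx_pin):
    """Pack tx/rx pin numbers the way the COGPTR constant above does."""
    if not (0 <= rx_pin < 1 << 9 and tx_pin >= 0):
        raise ValueError("pin number out of range")
    value = (tx_pin << 9) + rx_pin
    if value >= 1 << 17:           # S is conveyed as a 17-bit parameter
        raise ValueError("parameter must fit in 17 bits")
    return value


print(hex(monitor_pins(90, 91)))   # 0xb45b
```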
    
    
    CLKSET  D
    ---------
    
    CLKSET writes the lower 9 bits of D to the hub clock register:
    
    %R_MMMM_XX_SS
    
    R = 1 for hardware reset, 0 for continued operation
    
    MMMM = PLL multiplying factor for XI pin input:
            %0000 for PLL disabled
            %0001..%1111 for 2..16 multiply (XX must be set for XI input or XI/XO crystal oscillator)
    
    XX = XI/XO pin mode:
            00 for XI reads low, XO floats
            01 for XI input, XO floats
            10 for XI/XO crystal oscillator with 15pF internal loading and 1M-ohm feedback
            11 for XI/XO crystal oscillator with 30pF internal loading and 1M-ohm feedback
    
    SS = Clock selector:
            00 for RCFAST (~20MHz)
            01 for RCSLOW (~20KHz)
            10 for XTAL (10MHz-20MHz)
            11 for PLL
    
    Because the clock register is cleared to %0_0000_00_00 on reset, the chip starts up in RCFAST mode
    with both the crystal oscillator and the PLL disabled. Before switching to XTAL or PLL mode from RCFAST
    or RCSLOW, the crystal oscillator must be enabled and given 10ms to stabilize. The PLL stabilizes within
    10us, so it can be enabled at the same time as the crystal oscillator. Once the crystal is stabilized, you
    can switch between XTAL and RCFAST/RCSLOW without any stability concerns. If the PLL is also enabled, you
    can switch freely among PLL, XTAL, and RCFAST/RCSLOW modes. You can change the PLL multiplier while being
    in PLL mode, but beware that some frequency overshoot and undershoot will occur as the PLL settles to its
    new frequency. This only poses a hardware problem if you are switching upwards and the resulting overshoot
    might exceed the speed limit of the chip.
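
To make the field layout concrete, here is a Python sketch that assembles the 9-bit %R_MMMM_XX_SS value. The function and parameter names are hypothetical; the field positions and encodings come from the description above.

```python
def clkset_value(reset=False, pll_multiply=0, xtal_mode=0, clock_select=0):
    """Build the 9-bit %R_MMMM_XX_SS clock register value.

    pll_multiply : 0 to disable the PLL, or 2..16 for the multiply factor
    xtal_mode    : the XX field, 0..3
    clock_select : the SS field (0 = RCFAST, 1 = RCSLOW, 2 = XTAL, 3 = PLL)
    """
    if pll_multiply == 0:
        mmmm = 0
    elif 2 <= pll_multiply <= 16:
        mmmm = pll_multiply - 1    # %0001..%1111 select 2..16
    else:
        raise ValueError("pll_multiply must be 0 or 2..16")
    if not 0 <= xtal_mode <= 3 or not 0 <= clock_select <= 3:
        raise ValueError("xtal_mode and clock_select must be 0..3")
    return (int(reset) << 8) | (mmmm << 4) | (xtal_mode << 2) | clock_select


# Step 1: enable the crystal (15pF mode) and a 16x PLL, but stay on RCFAST
warm_up = clkset_value(pll_multiply=16, xtal_mode=0b10, clock_select=0b00)
# Step 2: after the 10ms stabilization time, switch over to the PLL
run_pll = clkset_value(pll_multiply=16, xtal_mode=0b10, clock_select=0b11)
```

The two-step sequence mirrors the stabilization requirement described above: the oscillator and PLL are enabled first, and only the SS field changes once they have settled. The delay itself is not modeled here.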
    
    
    COGID   D
    ---------
    
    COGID returns the number of the cog (0..7) into D.
    
    
    COGSTOP D
    ---------
    
    COGSTOP stops the cog specified in D (0..7).
    
    
    LOCKNEW D
    LOCKRET D
    LOCKSET D
    LOCKCLR D
    ---------
    
    There are eight semaphore locks available in the chip which can be borrowed with LOCKNEW, returned with
    LOCKRET, set with LOCKSET, and cleared with LOCKCLR.
    
    While any cog can set or clear any lock without using LOCKNEW or LOCKRET, LOCKNEW and LOCKRET are provided
    so that cog programs have a dynamic and simple means of acquiring and relinquishing the locks at run-time.
    
    When a lock is set with LOCKSET, its state is set to 1 and its prior state is returned in C. LOCKCLR works
    the same way, but clears the lock's state to 0. By having the hub perform the atomic operation of setting/
    clearing and reporting the prior state, cogs can utilize locks to ensure that only one cog has permission
    to do something at once. If a lock starts out cleared and multiple cogs vie for the lock by doing a
    'LOCKSET locknum  wc', the cog to get C=0 back 'wins' and he can have exclusive access to some shared
    resource while the other cogs get C=1 back. When the winning cog is done, he can do a 'LOCKCLR locknum' to
    clear the lock and give another cog the opportunity to get C=0 back.
    
    LOCKNEW returns the next available lock into D, with C=1 if no lock was free.
    
    LOCKRET frees the lock in D so that it can be checked out again by LOCKNEW.
    
    LOCKSET sets the lock in D and returns its prior state in C.
    
    LOCKCLR clears the lock in D and returns its prior state in C.
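
The lock semantics can be modeled in a few lines of Python. This is a toy model, not the hub's implementation: the class name is hypothetical, LOCKNEW is assumed to hand out the lowest-numbered free lock, and what D receives when no lock is free is not stated above, so the model returns None in that case.

```python
class HubLocksModel:
    """Toy model of the eight hub locks; methods return what the cog would see."""

    def __init__(self):
        self.checked_out = [False] * 8   # bookkeeping for LOCKNEW/LOCKRET
        self.state = [0] * 8             # lock bits for LOCKSET/LOCKCLR

    def locknew(self):
        """Return (lock number, C), with C = 1 if no lock was free."""
        for n, taken in enumerate(self.checked_out):
            if not taken:                # assumed: lowest free lock first
                self.checked_out[n] = True
                return n, 0
        return None, 1

    def lockret(self, n):
        """Free lock n so LOCKNEW can hand it out again."""
        self.checked_out[n] = False

    def lockset(self, n):
        """Set lock n and return its prior state (the C flag)."""
        prior, self.state[n] = self.state[n], 1
        return prior

    def lockclr(self, n):
        """Clear lock n and return its prior state (the C flag)."""
        prior, self.state[n] = self.state[n], 0
        return prior
```

A cog that gets 0 back from `lockset` has won the lock; any other cog calling `lockset` on the same lock gets 1 back until the winner calls `lockclr`.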
    
    
    instructions                                                                               clocks
    -------------------------------------------------------------------------------------------------
    000011 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS     COGINIT D,S     'launch cog at D, cog PTRA = S     1..9
    000011 000 1 CCCC DDDDDDDDD 000000000     CLKSET  D       'set clock to D                    1..8
    000011 001 1 CCCC DDDDDDDDD 000000001     COGID   D       'get cog number into D             2..9
    000011 000 1 CCCC DDDDDDDDD 000000011     COGSTOP D       'stop cog in D                     1..8
    000011 ZC1 1 CCCC DDDDDDDDD 000000100     LOCKNEW D       'get new lock into D, C = busy     2..9
    000011 000 1 CCCC DDDDDDDDD 000000101     LOCKRET D       'return lock in D                  1..8
    000011 0C0 1 CCCC DDDDDDDDD 000000110     LOCKSET D       'set lock in D, C = prev state     1..9
    000011 0C0 1 CCCC DDDDDDDDD 000000111     LOCKCLR D       'clear lock in D, C = prev state   1..9
    -------------------------------------------------------------------------------------------------
    
    
    
    INDIRECT REGISTERS
    ------------------
    
    Each cog has two indirect registers: INDA and INDB. They are located at $1F6 and $1F7.
    
    By using INDA or INDB for D or S, the register pointed at by INDA or INDB is addressed.
    
    INDA and INDB each have three hidden 9-bit registers associated with them: the pointer, the bottom limit, and
    the top limit. The bottom and top limits are inclusive values which set automatic wrapping boundaries for the
    pointer. This way, circular buffers can be established within cog RAM and accessed using simple INDA/INDB
    references.
    
    SETINDA/SETINDB/SETINDS is used to set or adjust the pointer value(s) while forcing the associated bottom and
    top limit(s) to $000 and $1FF, respectively.
    
    FIXINDA/FIXINDB/FIXINDS sets the pointer(s) to an initial value, while setting the bottom limit(s) to the
    lower of the initial and terminal values and the top limit(s) to the higher.
    
    Because indirect addressing occurs very early in the pipeline and indirect pointers are affected earlier than
    the final stage where the conditional bit field (CCCC) normally comes into use, the CCCC field is repurposed
    for indirect operations. The top two bits of CCCC are used for indirect D and the bottom two bits are used
    for indirect S. All instructions which use indirect registers will execute unconditionally, regardless of the
    CCCC bits.
    
    Here is the INDA/INDB usage scheme which repurposes the CCCC field:
    
    OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS
    -------------------------------------
    xxxxxx xxx x 00xx 111110110 xxxxxxxxx        D = INDA        'use INDA
    xxxxxx xxx x 00xx 111110111 xxxxxxxxx        D = INDB        'use INDB
    xxxxxx xxx x 01xx 111110110 xxxxxxxxx        D = INDA++      'use INDA,      INDA += 1
    xxxxxx xxx x 01xx 111110111 xxxxxxxxx        D = INDB++      'use INDB,      INDB += 1
    xxxxxx xxx x 10xx 111110110 xxxxxxxxx        D = INDA--      'use INDA,      INDA -= 1
    xxxxxx xxx x 10xx 111110111 xxxxxxxxx        D = INDB--      'use INDB,      INDB -= 1
    xxxxxx xxx x 11xx 111110110 xxxxxxxxx        D = ++INDA      'use INDA+1,    INDA += 1
    xxxxxx xxx x 11xx 111110111 xxxxxxxxx        D = ++INDB      'use INDB+1,    INDB += 1
    
    xxxxxx xxx 0 xx00 xxxxxxxxx 111110110        S = INDA        'use INDA
    xxxxxx xxx 0 xx00 xxxxxxxxx 111110111        S = INDB        'use INDB
    xxxxxx xxx 0 xx01 xxxxxxxxx 111110110        S = INDA++      'use INDA,      INDA += 1
    xxxxxx xxx 0 xx01 xxxxxxxxx 111110111        S = INDB++      'use INDB,      INDB += 1
    xxxxxx xxx 0 xx10 xxxxxxxxx 111110110        S = INDA--      'use INDA,      INDA -= 1
    xxxxxx xxx 0 xx10 xxxxxxxxx 111110111        S = INDB--      'use INDB,      INDB -= 1
    xxxxxx xxx 0 xx11 xxxxxxxxx 111110110        S = ++INDA      'use INDA+1,    INDA += 1
    xxxxxx xxx 0 xx11 xxxxxxxxx 111110111        S = ++INDB      'use INDB+1,    INDB += 1
    
    
    If both D and S are the same indirect register, the two 2-bit fields in CCCC are OR'd together to get the
    post-modifier effect:
    
    101000 001 0 0011 111110110 111110110        MOV INDA,++INDA    'Move @INDA+1 into @INDA,   INDA += 1
    100000 001 0 1100 111110111 111110111        ADD ++INDB,INDB    'Add @INDB into @INDB+1,    INDB += 1
    
    Note that only '++INDx,INDx'/'INDx,++INDx' combinations can address different registers from the same INDx.
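
To make the CCCC repurposing concrete, here is a small Python sketch that pulls the fields out of a 32-bit instruction word and reports how D and S use INDA/INDB. The function names are invented for this illustration. The bit positions follow the 'OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS' layout shown above, and S is only treated as indirect when the I bit is 0, as in the table.

```python
INDA, INDB = 0x1F6, 0x1F7
NAMES = {INDA: "INDA", INDB: "INDB"}
MODES = {0b00: "{r}", 0b01: "{r}++", 0b10: "{r}--", 0b11: "++{r}"}


def from_bits(text):
    """Convert a spaced bit pattern, as written in the tables, to an integer."""
    return int(text.replace(" ", ""), 2)


def decode_indirect(word):
    """Return (d_text, s_text); an entry is None when that field is not indirect."""
    i_bit = (word >> 22) & 1
    cccc = (word >> 18) & 0xF
    d = (word >> 9) & 0x1FF
    s = word & 0x1FF
    d_text = MODES[cccc >> 2].format(r=NAMES[d]) if d in NAMES else None
    s_text = MODES[cccc & 3].format(r=NAMES[s]) if s in NAMES and i_bit == 0 else None
    return d_text, s_text


def combined_modifier(word):
    """When D and S are the same indirect register, the two 2-bit fields are OR'd."""
    cccc = (word >> 18) & 0xF
    return (cccc >> 2) | (cccc & 3)


mov_word = from_bits("101000 001 0 0011 111110110 111110110")
add_word = from_bits("100000 001 0 1100 111110111 111110111")
mov_example = decode_indirect(mov_word)   # MOV INDA,++INDA
add_example = decode_indirect(add_word)   # ADD ++INDB,INDB
```

Both example words give a combined modifier of %11, which matches the single 'INDx += 1' effect shown in their comments.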
    
    
    Here are the instructions which are used to set the pointer and limit values for INDA and INDB:
    
    instructions *                                                                             clocks
    -------------------------------------------------------------------------------------------------
    111000 000 0 0001 000000000 AAAAAAAAA        SETINDA #addrA                                     1
    111000 000 0 0011 000000000 AAAAAAAAA        SETINDA ++/--deltA                                 1
    
    111000 000 0 0100 BBBBBBBBB 000000000        SETINDB #addrB                                     1
    111000 000 0 1100 BBBBBBBBB 000000000        SETINDB ++/--deltB                                 1 
    
    111000 000 0 0101 BBBBBBBBB AAAAAAAAA        SETINDS #addrB,#addrA                              1
    111000 000 0 0111 BBBBBBBBB AAAAAAAAA        SETINDS #addrB,++/--deltA                          1
    111000 000 0 1101 BBBBBBBBB AAAAAAAAA        SETINDS ++/--deltB,#addrA                          1
    111000 000 0 1111 BBBBBBBBB AAAAAAAAA        SETINDS ++/--deltB,++/--deltA                      1
    
    111001 000 0 0001 TTTTTTTTT IIIIIIIII        FIXINDA #terminal,#initial                         1
    111001 000 0 0100 TTTTTTTTT IIIIIIIII        FIXINDB #terminal,#initial                         1
    111001 000 0 0101 TTTTTTTTT IIIIIIIII        FIXINDS #terminal,#initial                         1
    -------------------------------------------------------------------------------------------------
    * addrA/addrB/terminal/initial = register address (0..511),
      deltA/deltB = 9-bit signed delta --256..++255
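
The encodings in the table can be checked mechanically. The following Python sketch builds a few of the bit patterns from the table; the helper names (encode, bits, setinda, setindb, fixinds) are made up for this illustration and are not part of any Parallax tool.

```python
def encode(opcode, cccc, d, s):
    """Pack fields into a 32-bit word: OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS (ZCR and I are 0)."""
    return (opcode << 26) | (cccc << 18) | ((d & 0x1FF) << 9) | (s & 0x1FF)


def bits(word):
    """Format a word with the same grouping the tables use."""
    b = format(word, "032b")
    return " ".join([b[0:6], b[6:9], b[9], b[10:14], b[14:23], b[23:32]])


def setinda(value, relative=False):
    """SETINDA #addr, or SETINDA ++/--delta when relative is True."""
    return encode(0b111000, 0b0011 if relative else 0b0001, 0, value)


def setindb(value, relative=False):
    """SETINDB #addr, or SETINDB ++/--delta when relative is True."""
    return encode(0b111000, 0b1100 if relative else 0b0100, value, 0)


def fixinds(terminal, initial):
    """FIXINDS #terminal,#initial."""
    return encode(0b111001, 0b0101, terminal, initial)


set_a = bits(setinda(5))               # SETINDA #5
dec_b = bits(setindb(-4, True))        # SETINDB --4 (delta stored as 9-bit two's complement)
fix_s = bits(fixinds(99, 50))          # FIXINDS #99,#50
```

Masking with $1FF is what turns a negative delta such as -4 into the 9-bit pattern 111111100.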
    
    Examples:
    
    111000 000 0 0001 000000000 000000101        SETINDA #5        'INDA = 5, bottom = 0, top = 511
    111000 000 0 0011 000000000 000000011        SETINDA ++3       'INDA += 3, bottom = 0, top = 511
    111000 000 0 1100 111111100 000000000        SETINDB --4       'INDB -= 4, bottom = 0, top = 511
    111000 000 0 0111 000000111 000001000        SETINDS #7,++8    'INDB = 7, INDA += 8, bottoms = 0, tops = 511
    
    111001 000 0 0001 000001111 000001000        FIXINDA #15,#8    'INDA = 8, bottom = 8, top = 15
    111001 000 0 0100 000010000 000011111        FIXINDB #16,#31   'INDB = 31, bottom = 16, top = 31
    111001 000 0 0101 001100011 000110010        FIXINDS #99,#50   'INDA/INDB = 50, bottoms = 50, tops = 99
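
The pointer and limit behavior can be sketched as a small Python model. It assumes that stepping past the top limit wraps to the bottom limit and vice versa, which is one reading of 'inclusive wrapping boundaries'; the startup values in __init__ are also an assumption.

```python
class IndirectPointer:
    """Toy model of one indirect register's hidden pointer, bottom limit and top limit."""

    def __init__(self):
        # Assumption: startup values are not given in this section.
        self.ptr, self.bottom, self.top = 0x000, 0x000, 0x1FF

    def set(self, addr):
        """SETINDx #addr: set the pointer, force limits to $000/$1FF."""
        self.ptr, self.bottom, self.top = addr & 0x1FF, 0x000, 0x1FF

    def adjust(self, delta):
        """SETINDx ++/--delta: adjust the pointer, force limits to $000/$1FF."""
        self.bottom, self.top = 0x000, 0x1FF
        self.ptr = (self.ptr + delta) & 0x1FF

    def fix(self, terminal, initial):
        """FIXINDx #terminal,#initial: pointer = initial, limits = lower/higher of the two."""
        self.ptr = initial
        self.bottom, self.top = min(initial, terminal), max(initial, terminal)

    def step(self, delta):
        """Effect of INDx++ (delta=+1) or INDx-- (delta=-1), with wrapping at the limits."""
        if delta > 0 and self.ptr == self.top:
            self.ptr = self.bottom
        elif delta < 0 and self.ptr == self.bottom:
            self.ptr = self.top
        else:
            self.ptr += delta


inda = IndirectPointer()
inda.fix(15, 8)                 # FIXINDA #15,#8
visited = []
for _ in range(9):
    visited.append(inda.ptr)
    inda.step(+1)               # as if each access used INDA++
```

With the FIXINDA #15,#8 example, nine INDA++ accesses walk registers 8 through 15 and then return to 8, which is the circular-buffer behavior described earlier.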
    
    
    
    STACK RAM
    ---------
    
    Each cog has a 256-long stack RAM that is accessible via push and pop operations. Its contents
    are not initialized at either reset or cog startup. So, at cog startup, it will contain whatever
    it happened to power up with, or whatever was last written.
    
    There are two stack pointers called SPA and SPB which are used to address the stack memory. Aside
    from automatically incrementing and decrementing via pushes and pops, SPA and SPB can be set,
    modified, read back, and checked:
    
    SETSPA  D/#n      set SPA
    SETSPB  D/#n      set SPB
    ADDSPA  D/#n      add to SPA
    ADDSPB  D/#n      add to SPB
    SUBSPA  D/#n      subtract from SPA
    SUBSPB  D/#n      subtract from SPB
    GETSPA  D         get SPA, SPA==0 into Z, SPA.7 into C
    GETSPB  D         get SPB, SPB==0 into Z, SPB.7 into C
    GETSPD  D         get SPA minus SPB, SPA==SPB into Z, SPA<SPB into C
    CHKSPA            check SPA, SPA==0 into Z, SPA.7 into C
    CHKSPB            check SPB, SPB==0 into Z, SPB.7 into C
    CHKSPD            check SPA minus SPB, SPA==SPB into Z, SPA<SPB into C
    
    Data can be pushed and popped in both normal and reverse directions:
    
    PUSHA   D/#n      push using SPA
    PUSHB   D/#n      push using SPB
    PUSHAR  D/#n      push using SPA, use pop addressing
    PUSHBR  D/#n      push using SPB, use pop addressing
    POPA    D         pop using SPA
    POPB    D         pop using SPB
    POPAR   D         pop using SPA, use push addressing
    POPBR   D         pop using SPB, use push addressing
    
    Aside from data, the program counter and flags can be pushed and popped using calls and returns:
    
    CALLA   D/#n      call using SPA
    CALLB   D/#n      call using SPB
    CALLAD  D/#n      call using SPA, delay branch until three trailing instructions executed
    CALLBD  D/#n      call using SPB, delay branch until three trailing instructions executed
    RETA              return using SPA
    RETB              return using SPB
    RETAD             return using SPA, delay branch until three trailing instructions executed
    RETBD             return using SPB, delay branch until three trailing instructions executed
    
    
    instructions (stack RAM access is shown as [SPx++] and [--SPx])                            clocks
    -------------------------------------------------------------------------------------------------
    000011 ZC0 1 CCCC 000000000 000010101        CHKSPD          'SPA==SPB into Z, SPA<SPB into C   1
    000011 ZC1 1 CCCC DDDDDDDDD 000010101        GETSPD  D       'SPA-SPB into D, Z/C as CHKSPD     1
    
    000011 ZC0 1 CCCC 000000000 000010110        CHKSPA          'SPA==0 into Z, SPA.7 into C       1
    000011 ZC1 1 CCCC DDDDDDDDD 000010110        GETSPA  D       'SPA into D, Z/C as CHKSPA         1
    
    000011 ZC0 1 CCCC 000000000 000010111        CHKSPB          'SPB==0 into Z, SPB.7 into C       1
    000011 ZC1 1 CCCC DDDDDDDDD 000010111        GETSPB  D       'SPB into D, Z/C as CHKSPB         1
    
    000011 ZC1 1 CCCC DDDDDDDDD 000011000        POPAR   D       'read [SPA++] into D, MSB into C   1
    000011 ZC1 1 CCCC DDDDDDDDD 000011001        POPBR   D       'read [SPB++] into D, MSB into C   1
    
    000011 ZC1 1 CCCC DDDDDDDDD 000011010        POPA    D       'read [--SPA] into D, MSB into C   1
    000011 ZC1 1 CCCC DDDDDDDDD 000011011        POPB    D       'read [--SPB] into D, MSB into C   1
    
    000011 ZC0 1 CCCC 000000000 000011100        RETA            'read [--SPA] into Z/C/PC*         4
    000011 ZC0 1 CCCC 000000000 000011101        RETB            'read [--SPB] into Z/C/PC*         4
    
    000011 ZC0 1 CCCC 000000000 000011110        RETAD           'read [--SPA] into Z/C/PC*         1
    000011 ZC0 1 CCCC 000000000 000011111        RETBD           'read [--SPB] into Z/C/PC*         1
    
    000011 000 1 CCCC DDDDDDDDD 010100010        SETSPA  D       'set SPA to D                      1
    000011 001 1 CCCC 0nnnnnnnn 010100010        SETSPA  #n      'set SPA to n                      1
    000011 000 1 CCCC DDDDDDDDD 010100011        SETSPB  D       'set SPB to D                      1
    000011 001 1 CCCC 0nnnnnnnn 010100011        SETSPB  #n      'set SPB to n                      1
    
    000011 000 1 CCCC DDDDDDDDD 010100100        ADDSPA  D       'add D into SPA                    1
    000011 001 1 CCCC 0nnnnnnnn 010100100        ADDSPA  #n      'add n into SPA                    1
    000011 000 1 CCCC DDDDDDDDD 010100101        ADDSPB  D       'add D into SPB                    1
    000011 001 1 CCCC 0nnnnnnnn 010100101        ADDSPB  #n      'add n into SPB                    1
    
    000011 000 1 CCCC DDDDDDDDD 010100110        SUBSPA  D       'subtract D from SPA               1
    000011 001 1 CCCC 0nnnnnnnn 010100110        SUBSPA  #n      'subtract n from SPA               1
    000011 000 1 CCCC DDDDDDDDD 010100111        SUBSPB  D       'subtract D from SPB               1
    000011 001 1 CCCC 0nnnnnnnn 010100111        SUBSPB  #n      'subtract n from SPB               1
    
    000011 000 1 CCCC DDDDDDDDD 010101000        PUSHAR  D       'write D into [--SPA]              1 **
    000011 001 1 CCCC nnnnnnnnn 010101000        PUSHAR  #n      'write n into [--SPA]              1 **
    000011 000 1 CCCC DDDDDDDDD 010101001        PUSHBR  D       'write D into [--SPB]              1 **
    000011 001 1 CCCC nnnnnnnnn 010101001        PUSHBR  #n      'write n into [--SPB]              1 **
    
    000011 000 1 CCCC DDDDDDDDD 010101010        PUSHA   D       'write D into [SPA++]              1 **
    000011 001 1 CCCC nnnnnnnnn 010101010        PUSHA   #n      'write n into [SPA++]              1 **
    000011 000 1 CCCC DDDDDDDDD 010101011        PUSHB   D       'write D into [SPB++]              1 **
    000011 001 1 CCCC nnnnnnnnn 010101011        PUSHB   #n      'write n into [SPB++]              1 **
    
    000011 000 1 CCCC DDDDDDDDD 010101100        CALLA   D       'write Z/C/PC* into [SPA++], PC=D  4 **
    000011 001 1 CCCC nnnnnnnnn 010101100        CALLA   #n      'write Z/C/PC* into [SPA++], PC=n  4 **
    000011 000 1 CCCC DDDDDDDDD 010101101        CALLB   D       'write Z/C/PC* into [SPB++], PC=D  4 **
    000011 001 1 CCCC nnnnnnnnn 010101101        CALLB   #n      'write Z/C/PC* into [SPB++], PC=n  4 **
    
    000011 000 1 CCCC DDDDDDDDD 010101110        CALLAD  D       'write Z/C/PC* into [SPA++], PC=D  1 **
    000011 001 1 CCCC nnnnnnnnn 010101110        CALLAD  #n      'write Z/C/PC* into [SPA++], PC=n  1 **
    000011 000 1 CCCC DDDDDDDDD 010101111        CALLBD  D       'write Z/C/PC* into [SPB++], PC=D  1 **
    000011 001 1 CCCC nnnnnnnnn 010101111        CALLBD  #n      'write Z/C/PC* into [SPB++], PC=n  1 **
    -------------------------------------------------------------------------------------------------
    * bit 10 is Z, bit 9 is C, bits 8..0 are PC, upper bits are ignored or cleared
    ** if a stack RAM write is immediately followed by a stack RAM read, add one clock
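
The addressing shown in the table ([SPx++] for pushes, [--SPx] for pops, swapped for the 'R' variants) and the Z/C/PC packing from the first footnote can be modeled like this in Python. The sketch tracks a single stack pointer, assumes the pointer is 8 bits and wraps at 256, and starts it at 0; none of those details are confirmed by the text beyond the 256-long size.

```python
class StackRAM:
    """Toy model of the 256-long stack RAM with one stack pointer (SPA or SPB)."""

    SIZE = 256

    def __init__(self):
        self.ram = [0] * self.SIZE   # real contents are undefined at startup
        self.sp = 0                  # assumption: initial pointer value

    def push(self, value):
        """PUSHA/PUSHB: write value into [SP++]."""
        self.ram[self.sp] = value & 0xFFFFFFFF
        self.sp = (self.sp + 1) % self.SIZE

    def pop(self):
        """POPA/POPB: read [--SP]."""
        self.sp = (self.sp - 1) % self.SIZE
        return self.ram[self.sp]

    def push_r(self, value):
        """PUSHAR/PUSHBR: write value into [--SP] (pop addressing)."""
        self.sp = (self.sp - 1) % self.SIZE
        self.ram[self.sp] = value & 0xFFFFFFFF

    def pop_r(self):
        """POPAR/POPBR: read [SP++] (push addressing)."""
        value = self.ram[self.sp]
        self.sp = (self.sp + 1) % self.SIZE
        return value

    def call(self, z, c, return_pc):
        """CALLA/CALLB: push Z (bit 10), C (bit 9) and PC (bits 8..0)."""
        self.push((z << 10) | (c << 9) | (return_pc & 0x1FF))

    def ret(self):
        """RETA/RETB: pop a word and unpack it into (Z, C, PC)."""
        word = self.pop()
        return (word >> 10) & 1, (word >> 9) & 1, word & 0x1FF


stack = StackRAM()
stack.push(1)
stack.push(2)
top_value = stack.pop()          # last value pushed
stack.call(1, 0, 0x123)
restored = stack.ret()           # flags and PC come back unchanged
stack.push_r(7)
reverse_value = stack.pop_r()    # reverse push followed by reverse pop
```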
    
    
    
    MULTI-TASKING
    -------------
    
    Each cog has four sets of flags and program counters (Z/C/PC), constituting four unique tasks that
    can execute and switch on each instruction cycle.
    
    At cog startup, the tasks are initialized as follows:
    
    
    task Z  C  PC
    ---------------
    0    0  0  $000
    1    0  0  $001
    2    0  0  $002
    3    0  0  $003
    
    
    There are 16 rotating time slots in the TASK register that determine task sequence. Initially, all
    time slots are set to 0, causing task 0 to execute exclusively, starting at address $000:
    
    
       time slots:   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
                      |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
    TASK register:  %00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00
    
    
    The two LSB's of TASK always determine which task will execute next. After each instruction cycle,
    the TASK register is rotated right by two bits, recycling slot 0 to slot 15 and getting the next task
    into the 2 LSB's.
    
    
    To enable other tasks, SETTASK is used to set the TASK register:
    
    SETTASK D               write D to the TASK register
    SETTASK #n              write {n[7:0], n[7:0], n[7:0], n[7:0]} to the TASK register
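
Here is a short Python sketch of the two behaviors just described: how SETTASK #n expands the low 8 bits of n into the 32-bit TASK register, and how rotating TASK right by two bits per instruction cycle produces the task order. The function names are made up for the illustration.

```python
def settask_immediate(n):
    """SETTASK #n: copy n[7:0] into all four bytes of the TASK register."""
    return (n & 0xFF) * 0x01010101


def task_sequence(task_reg, count):
    """Return the task chosen in each of `count` instruction cycles."""
    order = []
    for _ in range(count):
        order.append(task_reg & 3)   # the two LSBs select the task
        # Rotate right by two bits: slot 0 is recycled into slot 15.
        task_reg = ((task_reg >> 2) | ((task_reg & 3) << 30)) & 0xFFFFFFFF
    return order


task_reg = settask_immediate(0b11_10_01_00)   # slots 3,2,1,0 hold tasks 3,2,1,0
order = task_sequence(task_reg, 8)
idle_order = task_sequence(0, 4)              # startup value: task 0 only
```

With all 16 slots at zero the sequence is task 0 every cycle, and with the byte %11_10_01_00 repeated four times the tasks run in the order 0, 1, 2, 3 and repeat.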
    
    If a task is given no time slot, it doesn't execute and its flags and PC stay at initial values. If a
    task is given a time slot, it will execute and its flags and PC will be updated at every instruction,
    or time slot. If an active task's time slots are all taken away, that task's flags and PC remain in the
    state where they left off, until it is given another time slot.
    
    
    To immediately force any of the four PC's to a new address, JMPTASK can be used. JMPTASK uses a 4-bit
    mask to select which PC's are going to be written. Mask bits 0..3 represent PC's 0..3. The mask value
    %1010 would write PC 3 and PC 1, while %0100 would write PC 2, only.
    
    JMPTASK D,#mask         force PC's in mask to D
    JMPTASK #addr,#mask     force PC's in mask to #addr
    
    For every PC/task affected by a JMPTASK instruction, all affected-task instructions currently in the
    pipeline are cancelled. This ensures that once JMPTASK executes, the next instruction from each
    affected task will be from the new address.
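
The mask selection can be shown with a few lines of Python. This only models which PCs get written; the pipeline cancellation is not modeled, and the function name and target addresses are arbitrary choices for the sketch.

```python
def jmptask(pcs, target, mask):
    """Return the four PCs after JMPTASK: mask bit n set means PC n is forced to target."""
    return [
        (target & 0x1FF) if (mask >> task) & 1 else pc
        for task, pc in enumerate(pcs)
    ]


startup_pcs = [0x000, 0x001, 0x002, 0x003]
after_1010 = jmptask(startup_pcs, 0x100, 0b1010)   # writes PC 3 and PC 1
after_0100 = jmptask(startup_pcs, 0x080, 0b0100)   # writes PC 2 only
```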
    
    
    Here is an example in which all four tasks are started and each task toggles an I/O pin at a
    different rate:
    
    
            ORG
    
            JMP     #task0          'task 0 begins here when the cog starts (this JMP takes 4 clocks)
            JMP     #task1          'task 1 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
            JMP     #task2          'task 2 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
            JMP     #task3          'task 3 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
    
    task0   SETTASK #%%3210         'enable all tasks (TASK = %11_10_01_00_11_10_01_00_11_10_01_00_11_10_01_00)
    
    :loop   NOTP    #0              'task 0, toggle pin 0       (loops every 8 clocks)
            JMP     #:loop          '(this JMP takes 1 clock)
    
    task1   NOTP    #1              'task 1, toggle pin 1       (loops every 12 clocks)
            NOP
            JMP     #task1          '(this JMP takes 1 clock)
    
    task2   NOTP    #2              'task 2, toggle pin 2       (loops every 16 clocks)
            NOP                     
            NOP
            JMP     #task2          '(this JMP takes 1 clock)
    
    task3   NOTP    #3              'task 3, toggle pin 3       (loops every 20 clocks)
            NOP
            NOP
            NOP
            JMP     #task3          '(this JMP takes 1 clock)
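
The loop periods in the comments above follow from the slot schedule: with four tasks in rotation, each task gets one slot in four, so a loop of k instructions repeats every 4k clocks. The Python sketch below counts this out. It assumes every instruction in the loops, including the JMP, uses exactly one time slot, as the comments in the example state for this schedule.

```python
def loop_period(task, loop_len, schedule):
    """Clocks between two successive passes through a task's loop.

    schedule is the repeating list of task numbers, one per clock.
    """
    if task not in schedule:
        raise ValueError("task has no time slot")
    starts = []       # clocks at which the first loop instruction executes
    executed = 0      # instructions this task has executed so far
    clock = 0
    while len(starts) < 2:
        if schedule[clock % len(schedule)] == task:
            if executed % loop_len == 0:
                starts.append(clock)
            executed += 1
        clock += 1
    return starts[1] - starts[0]


round_robin = [0, 1, 2, 3]                      # the order SETTASK #%%3210 produces
loop_lengths = {0: 2, 1: 3, 2: 4, 3: 5}         # instructions per loop in the example
periods = {t: loop_period(t, n, round_robin) for t, n in loop_lengths.items()}
```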
    
    
    ------------------------------------------------------------------------------------------------------------
    NOTE: When a normal branch instruction (JMP, CALL, RET, etc.) executes in the fourth and final stage of the
    pipeline, all instructions progressing through the lower three stages, which belong to the same task as the
    branch instruction, are cancelled. This inhibits execution of incidental data that was trailing the branch
    instruction.
    
    The delayed branch instructions (JMPD, CALLD, RETD, etc.) don't do any pipeline instruction cancellation and
    exist to provide 1-clock branches to single-task programs, where the three instructions following the branch
    are allowed to execute before the new instruction stream begins to execute.
    
    For single-task programs, normal branches take 4 clocks: 1 clock for the branch and 3 clocks for the
    cancelled instructions to come through the pipeline before the new instruction stream begins to execute.
    
    For multi-tasking programs that use all four tasks in sequence (i.e., SETTASK #%%3210), there are never any
    same-task instructions in the pipeline that would require cancellation due to branching, so all branches
    take just 1 clock.
    ------------------------------------------------------------------------------------------------------------
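
The two cases in the note (4 clocks for a single task, 1 clock with four tasks in rotation) can be reproduced with a simple counting rule: one clock for the branch plus one for each same-task instruction in the three slots that trail it. The Python sketch below applies that rule. Its answers for schedules between those two cases are an extrapolation from the note, not something the text states.

```python
def branch_clocks(schedule, slot):
    """Estimate the cost of a normal branch issued in `slot` of a repeating schedule.

    Counts the branch itself plus every same-task instruction in the three
    trailing slots, since those are the ones that get cancelled.
    """
    task = schedule[slot % len(schedule)]
    trailing = [schedule[(slot + k) % len(schedule)] for k in (1, 2, 3)]
    return 1 + sum(1 for t in trailing if t == task)


single_task = branch_clocks([0], 0)            # task 0 running alone
four_tasks = branch_clocks([0, 1, 2, 3], 0)    # all four tasks in rotation
two_tasks = branch_clocks([0, 1], 0)           # extrapolated case, not from the text
```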
    
    
    Tips for coding multi-tasking programs
    --------------------------------------
    
    While all tasks in a multi-tasking program can execute atomic instructions without any inter-task conflict,
    remember that there's only one of each of the following cog resources and only one task can use it at a time:
    
      SPA
      SPB
      INDA
      INDB
      PTRA
      PTRB
      ACCA
      ACCB
      32x32 multiplier
      64/32 divider
      64-bit square rooter
      CORDIC computer
      CTRA
      CTRB
      VID
      PIX (not usable in multi-tasking, requires single-task timing)
      XFR
      SER
      Bitfield mover
    
    When writing multi-task programs, be aware that instructions that take multiple clocks will stall the
    pipeline and have a ripple effect on the tasks' timing. This may be impossible to avoid, as some task
    might need to access hub memory, and those instructions are not single-clock.
    
    The WAITCNT/WAITPEQ/WAITPNE instructions should be coded discretely using 1-clock instructions, to avoid
    stalling the pipeline for excessive amounts of time.
    
    The following instructions (WC versions) will take 1 clock, instead of potentially many, and return 1 in
    C if they were successful:
    
      SNDSER  D  WC      attempt to send serial
      RCVSER  D  WC      attempt to receive serial
      GETMULL D  WC      attempt to get lower multiplier result
      GETMULH D  WC      attempt to get upper multiplier result
      GETDIVQ D  WC      attempt to get divider quotient result
      GETDIVR D  WC      attempt to get divider remainder result
      GETSQRT D  WC      attempt to get square root result
      GETQX   D  WC      attempt to get CORDIC X result
      GETQY   D  WC      attempt to get CORDIC Y result
      GETQZ   D  WC      attempt to get CORDIC Z result
    
    Other instruction alternatives:
    
      POLCTRA    WC      returns 1 in C if CTRA rolled over, use instead of SYNCTRA
      POLCTRB    WC      returns 1 in C if CTRB rolled over, use instead of SYNCTRB
      POLVID     WC      returns 1 in C if WAITVID is ready, use to execute WAITVID without stalling
      PASSCNT D          jumps to itself if some amount of time has not passed, use instead of WAITCNT
      JP/JNP  D,S        jumps based on pin states, use instead of WAITPEQ/WAITPNE
      DJNZ    D,#$       loops until done, use instead of NOP D/#n
    
    The following instructions will not work in a multi-tasking program:
    
      REPS/REPD          operate by subtracting a value from the PC every n clocks - single-task only
      GETPIX             needs steady pipeline delays for perspective divider time - single-task only
    
    
    instructions                                                                               clocks
    -------------------------------------------------------------------------------------------------
    000011 000 1 CCCC DDDDDDDDD 01001mmmm        JMPTASK D,#mask  'Set PC's in mask to D            1
    000011 001 1 CCCC nnnnnnnnn 01001mmmm        JMPTASK #n,#mask 'Set PC's in mask to 0..511       1
    
    000011 000 1 CCCC DDDDDDDDD 011001011        SETTASK D        'Set TASK to D                    1
    000011 001 1 CCCC nnnnnnnnn 011001011        SETTASK #n       'Set TASK to n[7:0] copied 4x     1
    -------------------------------------------------------------------------------------------------
    
    
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-11-30 06:02
    cgracey wrote: »
    I've been working on the instruction set documentation and I've completed the part that covers the hub memory instructions:

    Thanks Chip, this is really useful, straight away I can see the RDBYTEC with PTR++ operation really speeding up bytecode operations.

    BTW, this operation here -> 000011 000 1 CCCC 000000000 000001000 CACHEX 'invalidate cash
    Isn't this what the GFC has done?
  • cgraceycgracey Posts: 14,232
    edited 2012-11-30 06:12
    Thanks Chip, this is really useful, straight away I can see the RDBYTEC with PTR++ operation really speeding up bytecode operations.

    BTW, this operation here -> 000011 000 1 CCCC 000000000 000001000 CACHEX 'invalidate cash
    Isn't this what the GFC has done?

    Woops. I'm getting ahead of the NWO here.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-11-30 06:18
    Thanks Chip - it's great to have docs on those new instructions!

    In order to print them for reference, I quickly converted your text to Libre/Open office format, and also made a PDF - I am attaching them below.
  • evanhevanh Posts: 16,070
    edited 2012-11-30 06:51
    Good trick with the forward calculation on the pre(inc/dec)! Such an obvious solution to the iteration delay problem. I guess that shows how little hands on I've done.

    Chip, thanks for showing the working.
  • evanhevanh Posts: 16,070
    edited 2012-11-30 07:03
    If I'm not mistaken, mapping the QUADs to Cog space and exclusively managing Hub accesses with WRQUAD and RDQUAD then one could prevent any instruction stalling due to Hub accesses, right?

    PS: Of course, the limit with this approach is working in chunks of 16 bytes at a time and the resulting obligatory read-modify-write.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-11-30 07:14
    Thanks Chip - it's great to have docs on those new instructions!

    In order to print them for reference, I quickly converted your text to Libre/Open office format, and also made a PDF - I am attaching them below.
    This is great! Any chance you could add the descriptions of SETTASK and JMPTASK that Chip posted earlier in message #77?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-11-30 07:21
    No problem!

    I need to print those too...
  • evanhevanh Posts: 16,070
    edited 2012-11-30 07:26
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-11-30 07:30
    oops!

    Evan, you were faster than me :-)
  • evanhevanh Posts: 16,070
    edited 2012-11-30 07:32
    hehe, I was just being cheeky. :)
  • David BetzDavid Betz Posts: 14,516
    edited 2012-11-30 07:34
    No problem!

    I need to print those too...
    Thanks Bill!!

    And thanks to Chip for providing these descriptions!
  • RaymanRayman Posts: 14,789
    edited 2012-11-30 12:05
    RDQUAD and WRQUAD look to be very nice as only needing 1 clock in a fast loop.

    Can the video generator handle a Quad?
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-01 14:39
    Ugh. I was just getting ready to spend the weekend working on P2 code. I decided to try running PNut.exe under Parallels on the Mac and when I fired up Windows XP under Parallels I was told that several updates were available. First, I tried PNut.exe and putty for talking to my DE0-Nano board and both worked fine. I then decided to go ahead and do the various updates (Parallels, Windows, and Microsoft Security Essentials). Unfortunately, after doing all of those updates my FTDI driver no longer works. It gives me an error "This device cannot start. (Code 10)". Has anyone seen this? Any idea how to get around it? I have verified that I have the latest FTDI driver installed. What else could be causing this problem?
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-01 14:46
    Hi David

    Remove it and reinstall - In some cases that helps

    David Betz wrote: »
    Ugh. I was just getting ready to spend the weekend working on P2 code. I decided to try running PNut.exe under Parallels on the Mac and when I fired up Windows XP under Parallels I was told that several updates were available. First, I tried PNut.exe and putty for talking to my DE0-Nano board and both worked fine. I then decided to go ahead and do the various updates (Parallels, Windows, and Microsoft Security Essentials). Unfortunately, after doing all of those updates my FTDI driver no longer works. It gives me an error "This device cannot start. (Code 10)". Has anyone seen this? Any idea how to get around it? I have verified that I have the latest FTDI driver installed. What else could be causing this problem?
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-01 15:04
    Sapieha wrote: »
    Hi David

    Remove it and reinstall - In some cases that helps
    That's a good suggestion but I already tried it and it didn't work. I'm wondering if I should try to back out some of my updates. Either that or I need to break down and bring my Dell laptop with a native Windows install upstairs and use that. :-(