Propeller II: Emulation of the P2 on FPGA boards (Prop123-A7/A9, DE0-NANO, DE2-115, etc)

jmg · 2012-11-29 22:35

cgracey wrote: »

What if we launch pnut.exe with a command switch that tells it to compile, download, and then shut down if there were no problems? That would be totally hands-free then. Would that work for you?

Yes, that option is always a good idea in tools. It should create a log file too.
That allows users to slave the compile & download to a smarter process.
Of course, for small changes, they can work in the pnut editor.

cgracey · 2012-11-29 22:43

Peter Jakacki wrote: »

I can get around this with a key sequence or macro but the biggest problem at the moment is the 2K bin file limit, is that a bug?

Here is a new version of pnut.exe (in .zip) which, when called with a filename.spin in the parameter, will load, compile, download, and close. Just do 'pnut filename.spin' from your editor.

pnut.zip

If a compile error occurs, it will be shown and the app will not close automatically.

cgracey · 2012-11-29 22:50

As I explained above, we are just loading the 'loader' which is given $1F8 longs + 8 longs of SHA-256/HMAC key to make a 2K initial load.

It is that code's job to further load more data and execute it. If anyone is making development tools, this is the hook that the chip offers. What that $1F8-long program does is up to you. Eventually, tools will mask this issue and make the phenomenon transparent to the programmer, but if you are making a low-level tool system, this is your hookup.

I will eventually make a re-loader that brings in more data, but right now I'm working on the documentation for the instructions. So, for now, this is all we've got.

Peter Jakacki · 2012-11-29 22:57

cgracey wrote: »

As I explained above, we are just loading the 'loader' which is given $1F8 longs + 8 longs of SHA-256/HMAC key to make a 2K initial load.

It is that code's job to further load more data and execute it. If anyone is making development tools, this is the hook that the chip offers. What that $1F8-long program does is up to you. Eventually, tools will mask this issue and make the phenomenon transparent to the programmer, but if you are making a low-level tool system, this is your hookup.

I will eventually make a re-loader that brings in more data, but right now I'm working on the documentation for the instructions. So, for now, this is all we've got.

So I won't be able to do what I want to do then

PNUT seems to compile the obj file fine but how do I convert that? Is it usable?
BTW, thanks for the PNUT update

(EDIT: Worked like a charm, I assigned it to F10 in Context and bingo!)

cgracey · 2012-11-29 23:21

Peter Jakacki wrote: »

So I won't be able to do what I want to do then PNUT seems to compile the obj file fine but how do I convert that? Is it usable?
BTW, thanks for the PNUT update (EDIT: Worked like a charm, I assigned it to F10 in Context and bingo!)

Every time you compile, a filename.obj file is output that contains all the binary data. You can use this program to inspect it (rename it to hexedit.exe - note that this viewer always shows a $00 byte at the end of the file which is not really there):

hexedit.txt

Can you imagine a way to get that object data into the emulator?

Peter Jakacki · 2012-11-29 23:27

cgracey wrote: »

Every time you compile, a filename.obj file is output that contains all the binary data. You can use this program to inspect it (rename it to hexedit.exe - note that this viewer always shows a $00 byte at the end of the file which is not really there):

hexedit.txt

Can you imagine a way to get that object data into the emulator?

I always use ZTREE a lot if I happen to be using windows so I can see into all the files easily. Is there anything special about obj files? I just got ZTREE to do a file comparison in hex mode and the first 2K is identical so that means I can just modify my kernel to accept a binary loader.

cgracey · 2012-11-29 23:34

Peter Jakacki wrote: »

I always use ZTREE a lot if I happen to be using windows so I can see into all the files easily. Is there anything special about obj files? I just got ZTREE to do a file comparison in hex mode and the first 2K is identical so that means I can just modify my kernel to accept a binary loader.

The filename.bin is always 2KB, but the filename.obj is as big as all the code you compiled.

The program that you download must resume communication that the booter was carrying on, in order to bring the rest of the data into the hub. Imagine you download a $1F8-long program. Then, you execute another application on the PC that communicates serially with your downloaded $1F8-long program to load the rest of your stuff in and execute it.
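
The size arithmetic and the two-stage split described above can be sketched from the PC side. This is only an illustration in Python: `split_image` is a hypothetical helper, not part of PNUT, and it assumes that the first 2 KB of the compiled image is what goes out as the initial load and that a short image may be zero-padded.

```python
FIRST_STAGE_LONGS = 0x1F8   # longs of loader code, per the post above
KEY_LONGS = 8               # longs of SHA-256/HMAC key
INITIAL_LOAD_BYTES = (FIRST_STAGE_LONGS + KEY_LONGS) * 4   # $200 longs = 2048 bytes


def split_image(image: bytes):
    """Split a compiled image into the 2 KB initial load and the remainder.

    Assumption: the second-stage loader running on the chip receives `rest`
    over the same serial link after the initial load has started executing.
    """
    first = image[:INITIAL_LOAD_BYTES].ljust(INITIAL_LOAD_BYTES, b"\x00")
    rest = image[INITIAL_LOAD_BYTES:]
    return first, rest
```

A PC-side tool would send `first` through the normal download path and then stream `rest` to whatever protocol the downloaded loader implements.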

Peter Jakacki · 2012-11-29 23:39

cgracey wrote: »

The filename.bin is always 2KB, but the filename.obj is as big as all the code you compiled.

The program that you download must resume communication that the booter was carrying on, in order to bring the rest of the data into the hub. Imagine you download a $1F8-long program. Then, you execute another application on the PC that communicates serially with your downloaded $1F8-long program to load the rest of your stuff in and execute it.

Yes, I want to use the same code to handle the second stage as well so that I just send the obj file to it and it fills in the blanks from there. Teraterm allows me to send a file in binary form so that might be good enough for now on the second stage load.

Once this kernel is running I will use it to interact both at a high level and low level with the P2 and hardware. Should be fun!

cgracey · 2012-11-30 00:13

Peter Jakacki wrote: »

Yes, I want to use the same code to handle the second stage as well so that I just send the obj file to it and it fills in the blanks from there. Teraterm allows me to send a file in binary form so that might be good enough for now on the second stage load.

Once this kernel is running I will use it to interact both at a high level and low level with the P2 and hardware. Should be fun!

I think you're on your way. It's a good thing you've got the big DE2-115 so that you've got more than one cog and a full 128KB memory.

Peter Jakacki · 2012-11-30 00:19

cgracey wrote: »

I think you're on your way. It's a good thing you've got the big DE2-115 so that you've got more than one cog and a full 128KB memory.

Yep, and these boards are really good value for money and I can see them being very useful for debugging plus it's not possible to really brick them either. I've built in my serial routines and just putting in the basics to get it to chain load in the rest, fingers crossed.

Baggers · 2012-11-30 02:47

Looks like you guys are all having fun

Can't wait to join in!
But in the meantime I'm still playing on Prop1!

Sapieha · 2012-11-30 04:18

Hi Chip.

Can you post a map of the cog's special registers?

That would make it simpler to understand the instructions in "Propeller2DetailedPreliminaryFeatureList-v2.0.pdf".

David Betz · 2012-11-30 04:42

cgracey wrote: »

Only $1F8 longs are being loaded.

What we are doing is just loading what is actually the 'loader' that would perform further loading, decryption, etc. I will have to make a re-loader program to download, in lieu of the $1F8 longs which are now being sent, which will then load the user code in, up to the top of memory, if there is so much.

Do you want me to work on a second stage loader? I could use some of the code I wrote for propeller-load.

nutson · 2012-11-30 05:21

Sapieha. I remember Chip saying there were more than 40 registers now. We will get a full description in due time.

Just read the preliminary feature list and made this list for my own reference:

There are 10 memory-mapped registers:

INDA/B        0x1F6 - 0x1F7    Indirect access to COG memory
PINA/B/C/D    0x1F8 - 0x1FB    Read / write I/O ports
DIRA/B/C/D    0x1FC - 0x1FF    Set pins to output

All other registers can be accessed only with specialised instructions

PTRA/B        Pointer for hub access
SPA/B         CLUT (stack) pointer
CNT           System time counter
LFSR          Random number generator
MACA/B        Accumulator for 64-bit MAC operation
CTRA/B        Each has FRQ, PHS, SIN and COS registers
MULLL/H etc.  Registers to access the multiply, divide, SQRT and CORDIC operations
DAC0/3        Configuration and data for the DACs

cgracey · 2012-11-30 05:48

I've been working on the instruction set documentation and I've completed the parts that cover:

1) Hub memory instructions
2) Hub control instructions
3) Cog RAM indirect instructions - New # syntax for SETINDx/FIXINDx
4) Cog stack RAM instructions
5) Multi-tasking

There is a new PNUT.EXE in this .zip which supports the new SETINDx/FIXINDx syntax. Also, all the files anyone needs to use the DE0-Nano or DE2-115 are in here:

Terasic_Prop2.zip

Current docs:

PROPELLER 2 MEMORY
------------------

In the Propeller 2, there are two primary types of memory:

HUB MEMORY

    128K bytes of main memory shared by all cogs

        - cogs launch from this memory
        - cogs can access this memory as bytes, words, longs, and quads (4 longs)
        - $00000..$00E7F is ROM - contains Booter, SHA-256/HMAC, and Monitor
        - $00E80..$1FFFF is RAM - for application usage


COG MEMORY (8 instances)

    512 longs of register RAM for code and data usage

        - simultaneous instruction, source, and destination reading, plus writing
        - last eight registers are for I/O pin control

    256 longs of stack RAM for data and video usage

        - accessible via push and pop operations
        - video circuit can read data simultaneously and asynchronously



HUB MEMORY INSTRUCTIONS
-----------------------

These instructions read and write hub memory.

All instructions use D as the data conduit, except WRQUAD/RDQUAD/RDQUADC, which use the four QUAD
registers. The QUADs can be mapped into cog register space using the SETQUAD instruction or kept
hidden, in which case they are still useful as a data conduit and as a read cache. If mapped, the QUADs
overlay four contiguous cog registers which can begin at any double-even address (%xxxxxxx00). These
overlaid registers can be read and written as any other registers, as well as executed. Any write via
D to the QUAD registers, when mapped, will affect the underlying cog registers, as well. A RDQUAD/
RDQUADC will affect the QUAD registers, but not the underlying cog registers.

The cached reads RDBYTEC/RDWORDC/RDLONGC/RDQUADC will do a RDQUAD if the current read address is
outside of the 4-long window of the prior RDQUAD. Otherwise, they will immediately return cached
data. The CACHEX instruction invalidates the cache, forcing a fresh RDQUAD next time a cached read
executes.
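
The hit/miss rule for the cached reads can be modelled in a few lines. This is a toy Python model, not a description of the hardware: the class and method names are made up for illustration, and it only tracks which 16-byte window is cached, not the data.

```python
class QuadCache:
    """Toy model of the 4-long read cache behind RDBYTEC/RDWORDC/RDLONGC/RDQUADC."""

    def __init__(self):
        self.window = None          # base address of the cached quad, None = invalid

    @staticmethod
    def _base(address: int) -> int:
        # Quads use address bits 16..4, so the window is 16-byte aligned.
        return address & 0x1FFF0

    def rdquad(self, address: int) -> None:
        """A plain RDQUAD always reads the hub and sets the window."""
        self.window = self._base(address)

    def cached_read(self, address: int) -> bool:
        """Return True on a cache hit, False if a fresh RDQUAD was needed."""
        if self.window == self._base(address):
            return True
        self.rdquad(address)
        return False

    def cachex(self) -> None:
        """CACHEX invalidates the cache."""
        self.window = None
```

Reading sequentially through a buffer with this model gives one miss followed by hits until the address crosses into the next 16-byte window.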

Hub memory instructions must wait for their cog's hub cycle, which comes once every 8 clocks. The
timing relationship between a cog's instruction stream and its hub cycle is generally indeterminate,
causing these instructions to take varying numbers of clocks. Timing can be made deterministic, though,
by intentionally spacing these instructions apart so that after the first in a series executes, the
subsequent hub memory instructions fall on hub cycles, making them take the minimal numbers of
clocks. The trick is to write useful code to go in between them.

WRBYTE/WRWORD/WRLONG/WRQUAD/RDQUAD complete on the hub cycle, making them take 1..8 clocks.

RDBYTE/RDWORD/RDLONG complete on the 2nd clock after the hub cycle, making them take 3..10 clocks.

RDBYTEC/RDWORDC/RDLONGC take only 1 clock if data is cached, otherwise 3..10 clocks.

RDQUADC takes only 1 clock if data is cached, otherwise 1..8 clocks.

After a RDQUAD, the QUAD registers are accessible via D and S on the 3rd clock and executable on the
5th clock.
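
The clock counts above follow from how far the instruction is from its cog's next hub cycle. Here is a small Python sketch of that arithmetic, under a simplifying assumption of mine: the hub cycle falls on clocks where `clock % 8 == hub_slot`, and an instruction issued on that clock catches it. The function names are illustrative only.

```python
HUB_PERIOD = 8   # a cog's hub cycle comes once every 8 clocks


def hub_clocks(issue_clock: int, hub_slot: int, extra: int) -> int:
    """Clocks taken by a hub instruction issued at issue_clock."""
    wait = (hub_slot - issue_clock) % HUB_PERIOD   # 0..7 clocks until the hub cycle
    return wait + 1 + extra


def write_clocks(issue_clock: int, hub_slot: int = 0) -> int:
    """WRBYTE/WRWORD/WRLONG/WRQUAD/RDQUAD: complete on the hub cycle (1..8)."""
    return hub_clocks(issue_clock, hub_slot, extra=0)


def read_clocks(issue_clock: int, hub_slot: int = 0) -> int:
    """RDBYTE/RDWORD/RDLONG: complete 2 clocks after the hub cycle (3..10)."""
    return hub_clocks(issue_clock, hub_slot, extra=2)


# Deterministic spacing: after one write completes, 7 clocks of other
# instructions put the next write exactly on the following hub cycle.
first_issue = 3
next_free = first_issue + write_clocks(first_issue)
aligned_cost = write_clocks(next_free + 7)   # minimal cost under this model
```

In this model the first hub instruction in a series absorbs the variable wait, and every later one that is spaced a multiple of 8 clocks from it takes the minimum.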


instructions                                                                                       clocks
---------------------------------------------------------------------------------------------------------
000000 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRBYTE  D,S       'write lower byte in D at S              1..8
000000 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRBYTE  D,PTR     'write lower byte in D at PTR            1..8
000000 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDBYTE  D,S       'read byte at S into D                  3..10
000000 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDBYTE  D,PTR     'read byte at PTR into D                3..10
000000 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDBYTEC D,S       'read cached byte at S into D        1, 3..10 
000000 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDBYTEC D,PTR     'read cached byte at PTR into D      1, 3..10

000001 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRWORD  D,S       'write lower word in D at S              1..8
000001 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRWORD  D,PTR     'write lower word in D at PTR            1..8
000001 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDWORD  D,S       'read word at S into D                  3..10
000001 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDWORD  D,PTR     'read word at PTR into D                3..10
000001 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDWORDC D,S       'read cached word at S into D        1, 3..10
000001 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDWORDC D,PTR     'read cached word at PTR into D      1, 3..10

000010 000 0 CCCC DDDDDDDDD SSSSSSSSS     WRLONG  D,S       'write D at S                            1..8
000010 000 1 CCCC DDDDDDDDD SUPNNNNNN     WRLONG  D,PTR     'write D at PTR                          1..8
000010 Z01 0 CCCC DDDDDDDDD SSSSSSSSS     RDLONG  D,S       'read long at S into D                  3..10
000010 Z01 1 CCCC DDDDDDDDD SUPNNNNNN     RDLONG  D,PTR     'read long at PTR into D                3..10
000010 Z11 0 CCCC DDDDDDDDD SSSSSSSSS     RDLONGC D,S       'read cached long at S into D        1, 3..10
000010 Z11 1 CCCC DDDDDDDDD SUPNNNNNN     RDLONGC D,PTR     'read cached long at PTR into D      1, 3..10

000011 000 0 CCCC DDDDDDDDD 010110000     WRQUAD  D         'write QUADs at D                        1..8
000011 001 1 CCCC SUPNNNNNN 010110000     WRQUAD  PTR       'write QUADs at PTR                      1..8
000011 000 0 CCCC DDDDDDDDD 010110001     RDQUAD  D         'read quad at D into QUADs               1..8
000011 001 1 CCCC SUPNNNNNN 010110001     RDQUAD  PTR       'read quad at PTR into QUADs             1..8
000011 010 0 CCCC DDDDDDDDD 010110001     RDQUADC D         'read cached quad at D into QUADs     1, 1..8
000011 011 1 CCCC SUPNNNNNN 010110001     RDQUADC PTR       'read cached quad at PTR into QUADs   1, 1..8
---------------------------------------------------------------------------------------------------------


PTR expressions:

    INDEX = -32..+31 for simple offsets, 0..31 for ++'s, or 0..32 for --'s
    SCALE = 1 for byte, 2 for word, 4 for long, or 16 for quad

    S = 0 for PTRA, 1 for PTRB
    U = 0 to keep PTRx same, 1 to update PTRx
    P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
    NNNNNN = INDEX
    nnnnnn = -INDEX


    SUPNNNNNN     PTR expression
    -----------------------------------------------------------------------------
    000000000     PTRA              'use PTRA
    100000000     PTRB              'use PTRB
    011000001     PTRA++            'use PTRA,                PTRA += SCALE
    111000001     PTRB++            'use PTRB,                PTRB += SCALE
    011111111     PTRA--            'use PTRA,                PTRA -= SCALE
    111111111     PTRB--            'use PTRB,                PTRB -= SCALE
    010000001     ++PTRA            'use PTRA + SCALE,        PTRA += SCALE
    110000001     ++PTRB            'use PTRB + SCALE,        PTRB += SCALE
    010111111     --PTRA            'use PTRA - SCALE,        PTRA -= SCALE
    110111111     --PTRB            'use PTRB - SCALE,        PTRB -= SCALE

    000NNNNNN     PTRA[INDEX]       'use PTRA + INDEX*SCALE
    100NNNNNN     PTRB[INDEX]       'use PTRB + INDEX*SCALE
    011NNNNNN     PTRA++[INDEX]     'use PTRA,                PTRA += INDEX*SCALE
    111NNNNNN     PTRB++[INDEX]     'use PTRB,                PTRB += INDEX*SCALE
    011nnnnnn     PTRA--[INDEX]     'use PTRA,                PTRA -= INDEX*SCALE
    111nnnnnn     PTRB--[INDEX]     'use PTRB,                PTRB -= INDEX*SCALE
    010NNNNNN     ++PTRA[INDEX]     'use PTRA + INDEX*SCALE,  PTRA += INDEX*SCALE
    110NNNNNN     ++PTRB[INDEX]     'use PTRB + INDEX*SCALE,  PTRB += INDEX*SCALE
    010nnnnnn     --PTRA[INDEX]     'use PTRA - INDEX*SCALE,  PTRA -= INDEX*SCALE
    110nnnnnn     --PTRB[INDEX]     'use PTRB - INDEX*SCALE,  PTRB -= INDEX*SCALE


Examples:

000000 Z01 1 CCCC DDDDDDDDD 000000000     RDBYTE  D,PTRA         'read byte at PTRA into D
000001 000 1 CCCC DDDDDDDDD 111000001     WRWORD  D,PTRB++       'write lower word in D at PTRB,      PTRB += 2
000010 Z01 1 CCCC DDDDDDDDD 011111111     RDLONG  D,PTRA--       'read long at PTRA into D,           PTRA -= 4
000011 001 1 CCCC 110000001 010110001     RDQUAD  ++PTRB         'read quad at PTRB+16 into QUADs,    PTRB += 16
000000 000 1 CCCC DDDDDDDDD 010111111     WRBYTE  D,--PTRA       'write lower byte in D at PTRA-1,    PTRA -= 1

000001 000 1 CCCC DDDDDDDDD 100000111     WRWORD  D,PTRB[7]      'write lower word in D to PTRB+7*2
000010 Z11 1 CCCC DDDDDDDDD 011001111     RDLONGC D,PTRA++[15]   'read cached long at PTRA into D,    PTRA += 15*4
000011 001 1 CCCC 111111101 010110000     WRQUAD  PTRB--[3]      'write QUADs at PTRB,                PTRB -= 3*16
000000 000 1 CCCC DDDDDDDDD 010000110     WRBYTE  D,++PTRA[6]    'write lower byte in D to PTRA+6*1,  PTRA += 6*1
000001 Z01 1 CCCC DDDDDDDDD 110110110     RDWORD  D,--PTRB[10]   'read word at PTRB-10*2 into D,      PTRB -= 10*2
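
The SUPNNNNNN field can be reproduced mechanically from the rules above. The Python sketch below is an illustration only (`encode_ptr` and `effective_address` are hypothetical helper names, and wrapping addresses to 17 bits is my assumption based on the pointer width). A negative `index` stands for the `--` forms, so `PTRB--[3]` is `index=-3`.

```python
def encode_ptr(ptrb: bool, index: int = 0, update: bool = False, post: bool = False) -> str:
    """Return the 9-bit SUPNNNNNN field as a binary string."""
    if not -32 <= index <= 31:
        raise ValueError("INDEX must fit in 6 signed bits")
    s = 1 if ptrb else 0        # S = 0 for PTRA, 1 for PTRB
    u = 1 if update else 0      # U = 1 to update PTRx
    p = 1 if post else 0        # P = 1 to use PTRx unmodified (post-modify)
    n = index & 0x3F            # two's complement, so a negative INDEX gives nnnnnn
    return format((s << 8) | (u << 7) | (p << 6) | n, "09b")


def effective_address(ptr: int, index: int, scale: int, update: bool, post: bool):
    """Return (address used, new pointer value), both wrapped to 17 bits."""
    offset = index * scale
    address = ptr if post else ptr + offset
    new_ptr = ptr + offset if update else ptr
    return address & 0x1FFFF, new_ptr & 0x1FFFF


# RDQUAD ++PTRB with PTRB = $100: reads at $110 and leaves PTRB = $110.
field = encode_ptr(ptrb=True, index=1, update=True, post=False)
addr, new_ptrb = effective_address(0x100, 1, 16, update=True, post=False)
```

The encoder agrees with the bit patterns listed in the examples above, which is a reasonable cross-check of the S/U/P rules.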


Bytes, words, longs, and quads are addressed as follows: 

    for WRBYTE/RDBYTE/RDBYTEC, address = %XXXXXXXXXXXXXXXXX (bits 16..0 are used)
    for WRWORD/RDWORD/RDWORDC, address = %XXXXXXXXXXXXXXXX- (bits 16..1 are used)
    for WRLONG/RDLONG/RDLONGC, address = %XXXXXXXXXXXXXXX-- (bits 16..2 are used)
    for WRQUAD/RDQUAD/RDQUADC, address = %XXXXXXXXXXXXX---- (bits 16..4 are used)

address  byte  word    long        quad
-------------------------------------------------------------------
00000-   50   *7250   *706F7250   *0C7CCC030C7C200020302E32706F7250
00001-   72    7250    706F7250    0C7CCC030C7C200020302E32706F7250
00002-   6F   *706F    706F7250    0C7CCC030C7C200020302E32706F7250
00003-   70    706F    706F7250    0C7CCC030C7C200020302E32706F7250
00004-   32   *2E32   *20302E32    0C7CCC030C7C200020302E32706F7250
00005-   2E    2E32    20302E32    0C7CCC030C7C200020302E32706F7250
00006-   30   *2030    20302E32    0C7CCC030C7C200020302E32706F7250
00007-   20    2030    20302E32    0C7CCC030C7C200020302E32706F7250
00008-   00   *2000   *0C7C2000    0C7CCC030C7C200020302E32706F7250
00009-   20    2000    0C7C2000    0C7CCC030C7C200020302E32706F7250
0000A-   7C   *0C7C    0C7C2000    0C7CCC030C7C200020302E32706F7250
0000B-   0C    0C7C    0C7C2000    0C7CCC030C7C200020302E32706F7250
0000C-   03   *CC03   *0C7CCC03    0C7CCC030C7C200020302E32706F7250
0000D-   CC    CC03    0C7CCC03    0C7CCC030C7C200020302E32706F7250
0000E-   7C   *0C7C    0C7CCC03    0C7CCC030C7C200020302E32706F7250
0000F-   0C    0C7C    0C7CCC03    0C7CCC030C7C200020302E32706F7250
00010-   45   *FE45   *0DC1FE45   *0D7CC6010C7CC6010CFCB6E30DC1FE45
00011-   FE    FE45    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
00012-   C1   *0DC1    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
00013-   0D    0DC1    0DC1FE45    0D7CC6010C7CC6010CFCB6E30DC1FE45
00014-   E3   *B6E3   *0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
00015-   B6    B6E3    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
00016-   FC   *0CFC    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
00017-   0C    0CFC    0CFCB6E3    0D7CC6010C7CC6010CFCB6E30DC1FE45
00018-   01   *C601   *0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
00019-   C6    C601    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001A-   7C   *0C7C    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001B-   0C    0C7C    0C7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001C-   01   *C601   *0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001D-   C6    C601    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001E-   7C   *0D7C    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45
0001F-   0D    0D7C    0D7CC601    0D7CC6010C7CC6010CFCB6E30DC1FE45

* new word/long/quad
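
The table can be reproduced by masking off the low address bits and reading little-endian. The Python sketch below uses the first 16 bytes from the byte column above; `read_hub` is a hypothetical helper written for this illustration, not an actual tool.

```python
# First 16 bytes of hub memory, copied from the byte column of the table.
HUB = bytes.fromhex("50726F70322E302000207C0C03CC7C0C")


def read_hub(mem: bytes, address: int, size: int) -> int:
    """Read a byte (1), word (2), long (4) or quad (16) the way the table shows.

    The low address bits below the access size are ignored, and the data is
    assembled little-endian (lowest address = least significant byte).
    """
    if size not in (1, 2, 4, 16):
        raise ValueError("size must be 1, 2, 4 or 16")
    base = (address & 0x1FFFF) & ~(size - 1)
    return int.from_bytes(mem[base:base + size], "little")


word_at_3 = read_hub(HUB, 0x00003, 2)    # same word as address $00002
long_at_5 = read_hub(HUB, 0x00005, 4)    # same long as address $00004
quad_at_f = read_hub(HUB, 0x0000F, 16)   # same quad as address $00000
```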




PTRA/PTRB INSTRUCTIONS
----------------------

Each cog has two 17-bit pointers, PTRA and PTRB, which can be read, written, modified,
and used to access hub memory.

At cog startup, the PTRA and PTRB registers are initialized as follows:

    PTRA = %X_XXXXXXXX_XXXXXXXX, data from launching cog, usually a pointer
    PTRB = %X_XXXXXXXX_XXXXXX00, long address in hub where cog code was loaded from


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 ZCR 1 CCCC DDDDDDDDD 000010010     GETPTRA D         'get PTRA into D, C = PTRA[16]      1
000011 ZCR 1 CCCC DDDDDDDDD 000010011     GETPTRB D         'get PTRB into D, C = PTRB[16]      1

000011 000 1 CCCC DDDDDDDDD 010110010     SETPTRA D         'set PTRA to D                      1
000011 001 1 CCCC nnnnnnnnn 010110010     SETPTRA #n        'set PTRA to 0..511                 1
000011 000 1 CCCC DDDDDDDDD 010110011     SETPTRB D         'set PTRB to D                      1
000011 001 1 CCCC nnnnnnnnn 010110011     SETPTRB #n        'set PTRB to 0..511                 1

000011 000 1 CCCC DDDDDDDDD 010110100     ADDPTRA D         'add D into PTRA                    1
000011 001 1 CCCC nnnnnnnnn 010110100     ADDPTRA #n        'add 0..511 into PTRA               1
000011 000 1 CCCC DDDDDDDDD 010110101     ADDPTRB D         'add D into PTRB                    1
000011 001 1 CCCC nnnnnnnnn 010110101     ADDPTRB #n        'add 0..511 into PTRB               1

000011 000 1 CCCC DDDDDDDDD 010110110     SUBPTRA D         'subtract D from PTRA               1
000011 001 1 CCCC nnnnnnnnn 010110110     SUBPTRA #n        'subtract 0..511 from PTRA          1
000011 000 1 CCCC DDDDDDDDD 010110111     SUBPTRB D         'subtract D from PTRB               1
000011 001 1 CCCC nnnnnnnnn 010110111     SUBPTRB #n        'subtract 0..511 from PTRB          1
-------------------------------------------------------------------------------------------------



QUAD-RELATED INSTRUCTIONS
-------------------------

Each cog has four QUAD registers which form a 128-bit conduit between the hub memory and the cog.
This conduit can transfer four longs every 8 clocks via the WRQUAD/RDQUAD instructions. It can
also be used as a 4-long/8-word/16-byte read cache, utilized by RDBYTEC/RDWORDC/RDLONGC/RDQUADC.

Initially hidden, these QUAD registers are mappable into cog register space by using the SETQUAD
instruction to set a double-even address (%xxxxxxx00) where the base register is to appear, with
the other three registers following. To hide the QUAD registers, use SETQUAD to set an address
which is not double-even.
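
The mapping rule is a simple test of the low two address bits. A minimal Python sketch, with `quad_registers` being a made-up name for illustration:

```python
def quad_registers(setquad_address: int):
    """Return the four cog registers overlaid by the QUADs, or None if hidden."""
    addr = setquad_address & 0x1FF       # cog register addresses are 9 bits
    if addr & 0b11:                      # not double-even: QUADs stay hidden
        return None
    return [addr, addr + 1, addr + 2, addr + 3]
```

For instance, `quad_registers(0x1A0)` gives `$1A0..$1A3`, while `quad_registers(0x1A1)` gives `None`.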


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 000 1 CCCC 000000000 000001000     CACHEX            'invalidate cache                   1
000011 Z01 1 CCCC DDDDDDDDD 000010001     GETTOPS D         'get top bytes of QUADs into D      1
000011 000 1 CCCC DDDDDDDDD 011100010     SETQUAD D         'set QUAD base address to D         1
000011 001 1 CCCC nnnnnnnnn 011100010     SETQUAD #n        'set QUAD base address to 0..511    1
-------------------------------------------------------------------------------------------------



HUB CONTROL INSTRUCTIONS
------------------------

These instructions are used to control hub circuits and cogs.

Hub instructions must wait for their cog's hub cycle, which comes once every 8 clocks. In cases where
there is no result to wait for (ZCR = %000), these instructions complete on the hub cycle, making
them take 1..8 clocks, depending on where the hub cycle is in relation to the instruction. In cases
where a result is anticipated (ZCR <> %000), these instructions complete on the 1st clock after the
hub cycle, making them take 2..9 clocks.


COGINIT D,S
-----------

COGINIT is used to start cogs. Any cog can be (re)started, whether it is idle or running. A cog
can even execute a COGINIT to restart itself with a new program.

COGINIT uses D to specify a long address in hub memory that is the start of the program that is to be
loaded into a cog, while S is a 17-bit parameter (usually an address) that will be conveyed to PTRA
of the started cog. PTRB of the started cog will be set to the start address of its program that was
loaded from hub memory.

SETCOG must be executed before COGINIT to set the number of the cog to be started (0..7). If SETCOG
sets a value with bit 3 set (%1xxx), this will cause the next idle cog to be started when COGINIT is
executed, with the number of the cog started being returned in D, and the C flag returning 0 if okay,
or 1 if no idle cog was available. At cog startup, SETCOG is initialized to %0000.

When a cog is started, $1F8 contiguous longs are read from hub memory and written to cog registers
$000..$1F7. The cog will then begin execution at $000. This process takes 1,016 clocks.

Example:

        COGID   COGNUM           'what cog am I?
        SETCOG  COGNUM           'set my cog number
        COGINIT COGPGM,COGPTR    'restart me with the ROM Monitor

COGPGM  LONG    $0070C           'address of the ROM Monitor
COGPTR  LONG    90<<9 + 91       'tx = P90, rx = P91

COGNUM  RES     1
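
The SETCOG/COGINIT selection rule can be modelled as follows. This is a Python sketch under assumptions of mine: that "next idle cog" means the lowest-numbered idle cog, and that C is 0 when a specific cog is named. Neither is stated in the text, and `select_cog` is a hypothetical name.

```python
def select_cog(setcog_value: int, running: list):
    """Return (cog started or None, C flag) for a COGINIT after SETCOG.

    running[n] is True if cog n is currently executing.
    """
    if setcog_value & 0b1000:            # bit 3 set: start the next idle cog
        for cog in range(8):
            if not running[cog]:
                return cog, 0            # C = 0, cog number would be returned in D
        return None, 1                   # C = 1, no idle cog was available
    return setcog_value & 0b111, 0       # a specific cog is (re)started
```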


CLKSET  D
---------

CLKSET writes the lower 9 bits of D to the hub clock register:

%R_MMMM_XX_SS

R = 1 for hardware reset, 0 for continued operation

MMMM = PLL multiplying factor for XI pin input:
        %0000 for PLL disabled
        %0001..%1111 for 2..16 multiply (XX must be set for XI input or XI/XO crystal oscillator)

XX = XI/XO pin mode:
        00 for XI reads low, XO floats
        01 for XI input, XO floats
        10 for XI/XO crystal oscillator with 15pF internal loading and 1M-ohm feedback
        11 for XI/XO crystal oscillator with 30pF internal loading and 1M-ohm feedback

SS = Clock selector:
        00 for RCFAST (~20MHz)
        01 for RCSLOW (~20KHz)
        10 for XTAL (10MHz-20MHz)
        11 for PLL

Because the clock register is cleared to %0_0000_00_00 on reset, the chip starts up in RCFAST mode
with both the crystal oscillator and the PLL disabled. Before switching to XTAL or PLL mode from RCFAST
or RCSLOW, the crystal oscillator must be enabled and given 10ms to stabilize. The PLL stabilizes within
10us, so it can be enabled at the same time as the crystal oscillator. Once the crystal is stabilized, you
can switch between XTAL and RCFAST/RCSLOW without any stability concerns. If the PLL is also enabled, you
can switch freely among PLL, XTAL, and RCFAST/RCSLOW modes. You can change the PLL multiplier while being
in PLL mode, but beware that some frequency overshoot and undershoot will occur as the PLL settles to its
new frequency. This only poses a hardware problem if you are switching upwards and the resulting overshoot
might exceed the speed limit of the chip.
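
Packing the %R_MMMM_XX_SS fields is straightforward bit arithmetic. The Python sketch below is only an illustration; `clkset_value` and its parameter names are mine. It also shows the two values a program might write in sequence: first enable the crystal and PLL while staying on RCFAST, then, after the 10ms settling time, select the PLL.

```python
def clkset_value(reset: bool = False, pll_multiply: int = 0,
                 xi_mode: int = 0b00, clock_select: int = 0b00) -> int:
    """Build the 9-bit value for CLKSET.

    pll_multiply is 0 for PLL disabled, or 2..16 for the multiplying factor.
    xi_mode is the XX field and clock_select is the SS field, both 0..3.
    """
    if pll_multiply == 0:
        mmmm = 0b0000
    elif 2 <= pll_multiply <= 16:
        mmmm = pll_multiply - 1          # %0001..%1111 means 2..16
    else:
        raise ValueError("pll_multiply must be 0 or 2..16")
    if not (0 <= xi_mode <= 3 and 0 <= clock_select <= 3):
        raise ValueError("xi_mode and clock_select are 2-bit fields")
    return (int(reset) << 8) | (mmmm << 4) | (xi_mode << 2) | clock_select


# Step 1: crystal oscillator (15pF) and PLL x8 enabled, still running from RCFAST.
warm_up = clkset_value(pll_multiply=8, xi_mode=0b10, clock_select=0b00)
# Step 2 (after 10ms): same settings, now clocked from the PLL.
run_pll = clkset_value(pll_multiply=8, xi_mode=0b10, clock_select=0b11)
```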


COGID   D
---------

COGID returns the number of the cog (0..7) into D.


COGSTOP D
---------

COGSTOP stops the cog specified in D (0..7).


LOCKNEW D
LOCKRET D
LOCKSET D
LOCKCLR D
---------

There are eight semaphore locks available in the chip which can be borrowed with LOCKNEW, returned with
LOCKRET, set with LOCKSET, and cleared with LOCKCLR.

While any cog can set or clear any lock without using LOCKNEW or LOCKRET, LOCKNEW and LOCKRET are provided
so that cog programs have a dynamic and simple means of acquiring and relinquishing the locks at run-time.

When a lock is set with LOCKSET, its state is set to 1 and its prior state is returned in C. LOCKCLR works
the same way, but clears the lock's state to 0. By having the hub perform the atomic operation of setting/
clearing and reporting the prior state, cogs can utilize locks to ensure that only one cog has permission
to do something at once. If a lock starts out cleared and multiple cogs vie for the lock by doing a
'LOCKSET locknum  wc', the cog to get C=0 back 'wins' and it can have exclusive access to some shared
resource while the other cogs get C=1 back. When the winning cog is done, it can do a 'LOCKCLR locknum' to
clear the lock and give another cog the opportunity to get C=0 back.

LOCKNEW returns the next available lock into D, with C=1 if no lock was free.

LOCKRET frees the lock in D so that it can be checked out again by LOCKNEW.

LOCKSET sets the lock in D and returns its prior state in C.

LOCKCLR clears the lock in D and returns its prior state in C.
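
The lock semantics can be modelled with two small arrays. This Python sketch is illustrative only: the class is hypothetical, it assumes LOCKNEW hands out the lowest-numbered free lock, and it ignores hub timing entirely.

```python
class HubLocks:
    """Toy model of the eight hub semaphore locks."""

    def __init__(self, count: int = 8):
        self.checked_out = [False] * count   # tracked by LOCKNEW/LOCKRET
        self.state = [0] * count             # changed by LOCKSET/LOCKCLR

    def locknew(self):
        """Return (lock number or None, C); C = 1 if no lock was free."""
        for n, used in enumerate(self.checked_out):
            if not used:
                self.checked_out[n] = True
                return n, 0
        return None, 1

    def lockret(self, n: int) -> None:
        """Free lock n so LOCKNEW can hand it out again."""
        self.checked_out[n % len(self.checked_out)] = False

    def lockset(self, n: int) -> int:
        """Set lock n and return its prior state (the C flag)."""
        n %= len(self.state)
        prior, self.state[n] = self.state[n], 1
        return prior

    def lockclr(self, n: int) -> int:
        """Clear lock n and return its prior state (the C flag)."""
        n %= len(self.state)
        prior, self.state[n] = self.state[n], 0
        return prior
```

Two cogs calling `lockset` on the same cleared lock get 0 and 1 back respectively, so only the first one proceeds; the second keeps retrying until the winner calls `lockclr`.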


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS     COGINIT D,S     'launch cog at D, cog PTRA = S     1..9
000011 000 1 CCCC DDDDDDDDD 000000000     CLKSET  D       'set clock to D                    1..8
000011 001 1 CCCC DDDDDDDDD 000000001     COGID   D       'get cog number into D             2..9
000011 000 1 CCCC DDDDDDDDD 000000011     COGSTOP D       'stop cog in D                     1..8
000011 ZC1 1 CCCC DDDDDDDDD 000000100     LOCKNEW D       'get new lock into D, C = busy     2..9
000011 000 1 CCCC DDDDDDDDD 000000101     LOCKRET D       'return lock in D                  1..8
000011 0C0 1 CCCC DDDDDDDDD 000000110     LOCKSET D       'set lock in D, C = prev state     1..9
000011 0C0 1 CCCC DDDDDDDDD 000000111     LOCKCLR D       'clear lock in D, C = prev state   1..9
-------------------------------------------------------------------------------------------------



INDIRECT REGISTERS
------------------

Each cog has two indirect registers: INDA and INDB. They are located at $1F6 and $1F7.

By using INDA or INDB for D or S, the register pointed at by INDA or INDB is addressed.

INDA and INDB each have three hidden 9-bit registers associated with them: the pointer, the bottom limit, and
the top limit. The bottom and top limits are inclusive values which set automatic wrapping boundaries for the
pointer. This way, circular buffers can be established within cog RAM and accessed using simple INDA/INDB
references.

SETINDA/SETINDB/SETINDS are used to set or adjust the pointer value(s) while forcing the associated bottom and
top limit(s) to $000 and $1FF, respectively.

FIXINDA/FIXINDB/FIXINDS set the pointer(s) to an initial value, while setting the bottom limit(s) to the
lower of the initial and terminal values and the top limit(s) to the higher.
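
The pointer-plus-limits arrangement behaves like a circular buffer index. The Python model below is a sketch under one assumption of mine: stepping past the top limit lands on the bottom limit, and stepping below the bottom lands on the top. The class and method names are invented for illustration.

```python
class IndPointer:
    """Toy model of one indirect register's pointer, bottom and top limits."""

    def __init__(self):
        self.pointer, self.bottom, self.top = 0x000, 0x000, 0x1FF

    def setind(self, addr: int) -> None:
        """SETINDx #addr: set the pointer, limits forced to $000 and $1FF."""
        self.pointer, self.bottom, self.top = addr & 0x1FF, 0x000, 0x1FF

    def fixind(self, initial: int, terminal: int) -> None:
        """FIXINDx: pointer = initial, limits = lower/higher of the two values."""
        initial, terminal = initial & 0x1FF, terminal & 0x1FF
        self.pointer = initial
        self.bottom, self.top = min(initial, terminal), max(initial, terminal)

    def increment(self) -> None:
        self.pointer = self.bottom if self.pointer == self.top else self.pointer + 1

    def decrement(self) -> None:
        self.pointer = self.top if self.pointer == self.bottom else self.pointer - 1
```

With `fixind(0x10, 0x13)`, repeated increments visit $10, $11, $12, $13 and then return to $10.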

Because indirect addressing occurs very early in the pipeline and indirect pointers are affected earlier than
the final stage where the conditional bit field (CCCC) normally comes into use, the CCCC field is repurposed
for indirect operations. The top two bits of CCCC are used for indirect D and the bottom two bits are used
for indirect S. All instructions which use indirect registers will execute unconditionally, regardless of the
CCCC bits.

Here is the INDA/INDB usage scheme which repurposes the CCCC field:

OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS
-------------------------------------
xxxxxx xxx x 00xx 111110110 xxxxxxxxx        D = INDA        'use INDA
xxxxxx xxx x 00xx 111110111 xxxxxxxxx        D = INDB        'use INDB
xxxxxx xxx x 01xx 111110110 xxxxxxxxx        D = INDA++      'use INDA,      INDA += 1
xxxxxx xxx x 01xx 111110111 xxxxxxxxx        D = INDB++      'use INDB,      INDB += 1
xxxxxx xxx x 10xx 111110110 xxxxxxxxx        D = INDA--      'use INDA,      INDA -= 1
xxxxxx xxx x 10xx 111110111 xxxxxxxxx        D = INDB--      'use INDB,      INDB -= 1
xxxxxx xxx x 11xx 111110110 xxxxxxxxx        D = ++INDA      'use INDA+1,    INDA += 1
xxxxxx xxx x 11xx 111110111 xxxxxxxxx        D = ++INDB      'use INDB+1,    INDB += 1

xxxxxx xxx 0 xx00 xxxxxxxxx 111110110        S = INDA        'use INDA
xxxxxx xxx 0 xx00 xxxxxxxxx 111110111        S = INDB        'use INDB
xxxxxx xxx 0 xx01 xxxxxxxxx 111110110        S = INDA++      'use INDA,      INDA += 1
xxxxxx xxx 0 xx01 xxxxxxxxx 111110111        S = INDB++      'use INDB,      INDB += 1
xxxxxx xxx 0 xx10 xxxxxxxxx 111110110        S = INDA--      'use INDA,      INDA -= 1
xxxxxx xxx 0 xx10 xxxxxxxxx 111110111        S = INDB--      'use INDB,      INDB -= 1
xxxxxx xxx 0 xx11 xxxxxxxxx 111110110        S = ++INDA      'use INDA+1,    INDA += 1
xxxxxx xxx 0 xx11 xxxxxxxxx 111110111        S = ++INDB      'use INDB+1,    INDB += 1


If both D and S are the same indirect register, the two 2-bit fields in CCCC are OR'd together to get the
post-modifier effect:

101000 001 0 0011 111110110 111110110        MOV INDA,++INDA    'Move @INDA+1 into @INDA,   INDA += 1
100000 001 0 1100 111110111 111110111        ADD ++INDB,INDB    'Add @INDB into @INDB+1,    INDB += 1

Note that only '++INDx,INDx'/'INDx,++INDx' combinations can address different registers from the same INDx.
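
The field layout in the tables above can be checked with a small decoder. This is a sketch in Python, not anything from the tool chain: `parse_bits`, `decode_indirect` and `MODES` are hypothetical names, and the decoder only reports what the tables state (per-operand mode, plus the OR'd field when D and S name the same register).

```python
INDA, INDB = 0x1F6, 0x1F7
NAMES = {INDA: "INDA", INDB: "INDB"}
MODES = {0b00: "INDx", 0b01: "INDx++", 0b10: "INDx--", 0b11: "++INDx"}


def parse_bits(text):
    """Turn 'OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS' into a 32-bit integer."""
    return int(text.replace(" ", ""), 2)


def decode_indirect(instr):
    """Return (d_mode, s_mode, combined) for one 32-bit instruction.

    d_mode/s_mode are None when that operand is not INDA/INDB.
    combined is the OR of both 2-bit fields, or None unless D and S
    are the same indirect register.
    """
    i_bit = (instr >> 22) & 1
    cccc = (instr >> 18) & 0xF
    d = (instr >> 9) & 0x1FF
    s = instr & 0x1FF

    d_mode = s_mode = combined = None
    if d in NAMES:
        d_mode = MODES[cccc >> 2].replace("INDx", NAMES[d])
    if s in NAMES and i_bit == 0:  # the S table requires I = 0
        s_mode = MODES[cccc & 0b11].replace("INDx", NAMES[s])
    if d_mode and s_mode and d == s:
        combined = (cccc >> 2) | (cccc & 0b11)
    return d_mode, s_mode, combined


print(decode_indirect(parse_bits("101000 001 0 0011 111110110 111110110")))
print(decode_indirect(parse_bits("100000 001 0 1100 111110111 111110111")))
```

Feeding it the two example encodings gives back the MOV INDA,++INDA and ADD ++INDB,INDB operand forms, with a combined field of %11 in both cases.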


Here are the instructions which are used to set the pointer and limit values for INDA and INDB:

instructions *                                                                             clocks
-------------------------------------------------------------------------------------------------
111000 000 0 0001 000000000 AAAAAAAAA        SETINDA #addrA                                     1
111000 000 0 0011 000000000 AAAAAAAAA        SETINDA ++/--deltA                                 1

111000 000 0 0100 BBBBBBBBB 000000000        SETINDB #addrB                                     1
111000 000 0 1100 BBBBBBBBB 000000000        SETINDB ++/--deltB                                 1 

111000 000 0 0101 BBBBBBBBB AAAAAAAAA        SETINDS #addrB,#addrA                              1
111000 000 0 0111 BBBBBBBBB AAAAAAAAA        SETINDS #addrB,++/--deltA                          1
111000 000 0 1101 BBBBBBBBB AAAAAAAAA        SETINDS ++/--deltB,#addrA                          1
111000 000 0 1111 BBBBBBBBB AAAAAAAAA        SETINDS ++/--deltB,++/--deltA                      1

111001 000 0 0001 TTTTTTTTT IIIIIIIII        FIXINDA #terminal,#initial                         1
111001 000 0 0100 TTTTTTTTT IIIIIIIII        FIXINDB #terminal,#initial                         1
111001 000 0 0101 TTTTTTTTT IIIIIIIII        FIXINDS #terminal,#initial                         1
-------------------------------------------------------------------------------------------------
* addrA/addrB/terminal/initial = register address (0..511),
  deltA/deltB = 9-bit signed delta --256..++255

Examples:

111000 000 0 0001 000000000 000000101        SETINDA #5        'INDA = 5, bottom = 0, top = 511
111000 000 0 0011 000000000 000000011        SETINDA ++3       'INDA += 3, bottom = 0, top = 511
111000 000 0 1100 111111100 000000000        SETINDB --4       'INDB -= 4, bottom = 0, top = 511
111000 000 0 0111 000000111 000001000        SETINDS #7,++8    'INDB = 7, INDA += 8, bottoms = 0, tops = 511

111001 000 0 0001 000001111 000001000        FIXINDA #15,#8    'INDA = 8, bottom = 8, top = 15
111001 000 0 0100 000010000 000011111        FIXINDB #16,#31   'INDB = 31, bottom = 16, top = 31
111001 000 0 0101 001100011 000110010        FIXINDS #99,#50   'INDA/INDB = 50, bottoms = 50, tops = 99
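
As a cross-check on the encodings above, the following Python sketch rebuilds the example bit patterns from their operands. The helper names (`setind`, `fixind`, `fmt`) and the `('addr', n)` / `('delta', n)` tuples are invented for this illustration; only the bit layout comes from the table.

```python
def fmt(opcode, cccc, d, s):
    """Format fields as 'OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS' (ZCR = 000, I = 0)."""
    return f"{opcode:06b} 000 0 {cccc:04b} {d & 0x1FF:09b} {s & 0x1FF:09b}"


def setind(a=None, b=None):
    """Encode SETINDA/SETINDB/SETINDS.

    a and b are ('addr', n) or ('delta', n), or None to leave that pointer alone.
    Deltas are stored as 9-bit two's complement (-256..255).
    """
    kind = {"addr": 0b01, "delta": 0b11}
    cccc = d = s = 0
    if a is not None:
        cccc |= kind[a[0]]        # bottom two CCCC bits describe INDA
        s = a[1]
    if b is not None:
        cccc |= kind[b[0]] << 2   # top two CCCC bits describe INDB
        d = b[1]
    return fmt(0b111000, cccc, d, s)


def fixind(terminal, initial, a=False, b=False):
    """Encode FIXINDA (a), FIXINDB (b) or FIXINDS (both)."""
    cccc = (0b0001 if a else 0) | (0b0100 if b else 0)
    return fmt(0b111001, cccc, terminal, initial)


print(setind(b=("delta", -4)))             # SETINDB --4
print(setind(a=("delta", 8), b=("addr", 7)))  # SETINDS #7,++8
print(fixind(99, 50, a=True, b=True))      # FIXINDS #99,#50
```

Note how SETINDB --4 puts %111111100 in the D field: the delta is simply -4 truncated to 9 bits.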



STACK RAM
---------

Each cog has a 256-long stack RAM that is accessible via push and pop operations. Its contents
are not initialized at either reset or cog startup. So, at cog startup, it will contain whatever
it happened to power up with, or whatever was last written.

There are two stack pointers called SPA and SPB which are used to address the stack memory. Aside
from automatically incrementing and decrementing via pushes and pops, SPA and SPB can be set,
modified, read back, and checked:

SETSPA  D/#n      set SPA
SETSPB  D/#n      set SPB
ADDSPA  D/#n      add to SPA
ADDSPB  D/#n      add to SPB
SUBSPA  D/#n      subtract from SPA
SUBSPB  D/#n      subtract from SPB
GETSPA  D         get SPA, SPA==0 into Z, SPA.7 into C
GETSPB  D         get SPB, SPB==0 into Z, SPB.7 into C
GETSPD  D         get SPA minus SPB, SPA==SPB into Z, SPA<SPB into C
CHKSPA            check SPA, SPA==0 into Z, SPA.7 into C
CHKSPB            check SPB, SPB==0 into Z, SPB.7 into C
CHKSPD            check SPA minus SPB, SPA==SPB into Z, SPA<SPB into C

Data can be pushed and popped in both normal and reverse directions:

PUSHA   D/#n      push using SPA
PUSHB   D/#n      push using SPB
PUSHAR  D/#n      push using SPA, use pop addressing
PUSHBR  D/#n      push using SPB, use pop addressing
POPA    D         pop using SPA
POPB    D         pop using SPB
POPAR   D         pop using SPA, use push addressing
POPBR   D         pop using SPB, use push addressing
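
The difference between the normal and reverse forms is easiest to see in a model. The Python sketch below follows the [SPx++] / [--SPx] addressing given in the instruction table further down; the `StackRam` class is hypothetical, the pointers are assumed to wrap at 8 bits (256 longs), and the RAM is zeroed here only so the demo is repeatable (the real stack RAM is uninitialized).

```python
class StackRam:
    """Toy model of the 256-long stack RAM with two pointers, SPA and SPB."""

    def __init__(self):
        self.ram = [0] * 256            # zeroed for the demo; real contents are undefined
        self.sp = {"A": 0, "B": 0}      # starting at 0 is an assumption for the demo

    def push(self, which, value):
        """PUSHA/PUSHB: write [SPx++]."""
        self.ram[self.sp[which]] = value & 0xFFFFFFFF
        self.sp[which] = (self.sp[which] + 1) & 0xFF

    def pop(self, which):
        """POPA/POPB: read [--SPx]."""
        self.sp[which] = (self.sp[which] - 1) & 0xFF
        return self.ram[self.sp[which]]

    def push_r(self, which, value):
        """PUSHAR/PUSHBR: write [--SPx] (push data, pop addressing)."""
        self.sp[which] = (self.sp[which] - 1) & 0xFF
        self.ram[self.sp[which]] = value & 0xFFFFFFFF

    def pop_r(self, which):
        """POPAR/POPBR: read [SPx++] (pop data, push addressing)."""
        value = self.ram[self.sp[which]]
        self.sp[which] = (self.sp[which] + 1) & 0xFF
        return value


stack = StackRam()
for word in (10, 20, 30):
    stack.push("A", word)

# SPB starts at 0, so reverse pops via SPB read the same data oldest-first.
fifo_order = [stack.pop_r("B") for _ in range(3)]
lifo_order = [stack.pop("A") for _ in range(3)]
print(fifo_order, lifo_order)
```

With one pointer pushing and the other reverse-popping, the same RAM behaves as a FIFO, while push/pop on a single pointer gives the usual LIFO.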

Aside from data, the program counter and flags can be pushed and popped using calls and returns:

CALLA   D/#n      call using SPA
CALLB   D/#n      call using SPB
CALLAD  D/#n      call using SPA, delay branch until three trailing instructions executed
CALLBD  D/#n      call using SPB, delay branch until three trailing instructions executed
RETA              return using SPA
RETB              return using SPB
RETAD             return using SPA, delay branch until three trailing instructions executed
RETBD             return using SPB, delay branch until three trailing instructions executed


instructions (stack RAM access is shown as [SPx++] and [--SPx])                            clocks
-------------------------------------------------------------------------------------------------
000011 ZC0 1 CCCC 000000000 000010101        CHKSPD          'SPA==SPB into Z, SPA<SPB into C   1
000011 ZC1 1 CCCC DDDDDDDDD 000010101        GETSPD  D       'SPA-SPB into D, Z/C as CHKSPD     1

000011 ZC0 1 CCCC 000000000 000010110        CHKSPA          'SPA==0 into Z, SPA.7 into C       1
000011 ZC1 1 CCCC DDDDDDDDD 000010110        GETSPA  D       'SPA into D, Z/C as CHKSPA         1

000011 ZC0 1 CCCC 000000000 000010111        CHKSPB          'SPB==0 into Z, SPB.7 into C       1
000011 ZC1 1 CCCC DDDDDDDDD 000010111        GETSPB  D       'SPB into D, Z/C as CHKSPB         1

000011 ZC1 1 CCCC DDDDDDDDD 000011000        POPAR   D       'read [SPA++] into D, MSB into C   1
000011 ZC1 1 CCCC DDDDDDDDD 000011001        POPBR   D       'read [SPB++] into D, MSB into C   1

000011 ZC1 1 CCCC DDDDDDDDD 000011010        POPA    D       'read [--SPA] into D, MSB into C   1
000011 ZC1 1 CCCC DDDDDDDDD 000011011        POPB    D       'read [--SPB] into D, MSB into C   1

000011 ZC0 1 CCCC 000000000 000011100        RETA            'read [--SPA] into Z/C/PC*         4
000011 ZC0 1 CCCC 000000000 000011101        RETB            'read [--SPB] into Z/C/PC*         4

000011 ZC0 1 CCCC 000000000 000011110        RETAD           'read [--SPA] into Z/C/PC*         1
000011 ZC0 1 CCCC 000000000 000011111        RETBD           'read [--SPB] into Z/C/PC*         1

000011 000 1 CCCC DDDDDDDDD 010100010        SETSPA  D       'set SPA to D                      1
000011 001 1 CCCC 0nnnnnnnn 010100010        SETSPA  #n      'set SPA to n                      1
000011 000 1 CCCC DDDDDDDDD 010100011        SETSPB  D       'set SPB to D                      1
000011 001 1 CCCC 0nnnnnnnn 010100011        SETSPB  #n      'set SPB to n                      1

000011 000 1 CCCC DDDDDDDDD 010100100        ADDSPA  D       'add D into SPA                    1
000011 001 1 CCCC 0nnnnnnnn 010100100        ADDSPA  #n      'add n into SPA                    1
000011 000 1 CCCC DDDDDDDDD 010100101        ADDSPB  D       'add D into SPB                    1
000011 001 1 CCCC 0nnnnnnnn 010100101        ADDSPB  #n      'add n into SPB                    1

000011 000 1 CCCC DDDDDDDDD 010100110        SUBSPA  D       'subtract D from SPA               1
000011 001 1 CCCC 0nnnnnnnn 010100110        SUBSPA  #n      'subtract n from SPA               1
000011 000 1 CCCC DDDDDDDDD 010100111        SUBSPB  D       'subtract D from SPB               1
000011 001 1 CCCC 0nnnnnnnn 010100111        SUBSPB  #n      'subtract n from SPB               1

000011 000 1 CCCC DDDDDDDDD 010101000        PUSHAR  D       'write D into [--SPA]              1 **
000011 001 1 CCCC nnnnnnnnn 010101000        PUSHAR  #n      'write n into [--SPA]              1 **
000011 000 1 CCCC DDDDDDDDD 010101001        PUSHBR  D       'write D into [--SPB]              1 **
000011 001 1 CCCC nnnnnnnnn 010101001        PUSHBR  #n      'write n into [--SPB]              1 **

000011 000 1 CCCC DDDDDDDDD 010101010        PUSHA   D       'write D into [SPA++]              1 **
000011 001 1 CCCC nnnnnnnnn 010101010        PUSHA   #n      'write n into [SPA++]              1 **
000011 000 1 CCCC DDDDDDDDD 010101011        PUSHB   D       'write D into [SPB++]              1 **
000011 001 1 CCCC nnnnnnnnn 010101011        PUSHB   #n      'write n into [SPB++]              1 **

000011 000 1 CCCC DDDDDDDDD 010101100        CALLA   D       'write Z/C/PC* into [SPA++], PC=D  4 **
000011 001 1 CCCC nnnnnnnnn 010101100        CALLA   #n      'write Z/C/PC* into [SPA++], PC=n  4 **
000011 000 1 CCCC DDDDDDDDD 010101101        CALLB   D       'write Z/C/PC* into [SPB++], PC=D  4 **
000011 001 1 CCCC nnnnnnnnn 010101101        CALLB   #n      'write Z/C/PC* into [SPB++], PC=n  4 **

000011 000 1 CCCC DDDDDDDDD 010101110        CALLAD  D       'write Z/C/PC* into [SPA++], PC=D  1 **
000011 001 1 CCCC nnnnnnnnn 010101110        CALLAD  #n      'write Z/C/PC* into [SPA++], PC=n  1 **
000011 000 1 CCCC DDDDDDDDD 010101111        CALLBD  D       'write Z/C/PC* into [SPB++], PC=D  1 **
000011 001 1 CCCC nnnnnnnnn 010101111        CALLBD  #n      'write Z/C/PC* into [SPB++], PC=n  1 **
-------------------------------------------------------------------------------------------------
* bit 10 is Z, bit 9 is C, bits 8..0 are PC, upper bits are ignored or cleared
** if a stack RAM write is immediately followed by a stack RAM read, add one clock
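
The Z/C/PC word described in the first footnote can be sketched as a pair of Python helpers. The function names are made up; the bit positions are the ones stated above, and upper bits are written as zero and ignored on read.

```python
def pack_return(z, c, pc):
    """Build the long that CALLx writes: bit 10 = Z, bit 9 = C, bits 8..0 = PC."""
    return (int(bool(z)) << 10) | (int(bool(c)) << 9) | (pc & 0x1FF)


def unpack_return(word):
    """Recover (Z, C, PC) as RETx would, ignoring everything above bit 10."""
    return (word >> 10) & 1, (word >> 9) & 1, word & 0x1FF


saved = pack_return(z=1, c=0, pc=0x123)
print(hex(saved), unpack_return(saved))
```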



MULTI-TASKING
-------------

Each cog has four sets of flags and program counters (Z/C/PC), constituting four unique tasks that
can execute and switch on each instruction cycle.

At cog startup, the tasks are initialized as follows:


task Z  C  PC
---------------
0    0  0  $000
1    0  0  $001
2    0  0  $002
3    0  0  $003


There are 16 rotating time slots in the TASK register that determine task sequence. Initially, all
time slots are set to 0, causing task 0 to execute exclusively, starting at address $000:


   time slots:   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
                  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
TASK register:  %00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00


The two LSB's of TASK always determine which task will execute next. After each instruction cycle,
the TASK register is rotated right by two bits, recycling slot 0 to slot 15 and getting the next task
into the 2 LSB's.


To enable other tasks, SETTASK is used to set the TASK register:

SETTASK D               write D to the TASK register
SETTASK #n              write {n[7:0], n[7:0], n[7:0], n[7:0]} to the TASK register
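
Here is a small Python model of the TASK register, covering both the immediate expansion and the rotate-right-by-two sequencing described above. `settask_imm` and `schedule` are hypothetical names used only for this sketch.

```python
def settask_imm(n):
    """SETTASK #n: replicate n[7:0] four times to fill the 32-bit TASK register."""
    b = n & 0xFF
    return b | (b << 8) | (b << 16) | (b << 24)


def schedule(task_reg, cycles):
    """Return the task number chosen on each of the next `cycles` instruction cycles."""
    order = []
    for _ in range(cycles):
        order.append(task_reg & 0b11)  # the two LSBs pick the task
        # rotate right by two bits: slot 0 is recycled into slot 15
        task_reg = ((task_reg >> 2) | (task_reg << 30)) & 0xFFFFFFFF
    return order


task = settask_imm(0b11_10_01_00)   # same value as %%3210
print(hex(task), schedule(task, 8))
```

Because the register rotates rather than shifts, any 16-slot pattern repeats indefinitely, so an uneven split (for example, one task in 8 slots and two others in 4 each) is just a different register value.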

If a task is given no time slot, it doesn't execute and its flags and PC stay at initial values. If a
task is given a time slot, it will execute and its flags and PC will be updated at every instruction,
or time slot. If an active task's time slots are all taken away, that task's flags and PC remain in the
state where they left off, until it is given another time slot.


To immediately force any of the four PC's to a new address, JMPTASK can be used. JMPTASK uses a 4-bit
mask to select which PC's are going to be written. Mask bits 0..3 represent PC's 0..3. The mask value
%1010 would write PC 3 and PC 1, while %0100 would write PC 2, only.

JMPTASK D,#mask         force PC's in mask to D
JMPTASK #addr,#mask     force PC's in mask to #addr

For every PC/task affected by a JMPTASK instruction, all affected-task instructions currently in the
pipeline are cancelled. This ensures that once JMPTASK executes, the next instruction from each
affected task will be from the new address.
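
The mask handling can be sketched in a few lines of Python. `jmptask` here is a hypothetical helper that models only the PC update, not the pipeline cancellation.

```python
def jmptask(pcs, addr, mask):
    """Return the four task PCs after JMPTASK: mask bit t set means PC t = addr."""
    return [addr & 0x1FF if (mask >> t) & 1 else pc for t, pc in enumerate(pcs)]


pcs = [0x000, 0x001, 0x002, 0x003]       # startup values
print(jmptask(pcs, 0x100, 0b1010))       # writes PC 3 and PC 1
print(jmptask(pcs, 0x100, 0b0100))       # writes PC 2 only
```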


Here is an example in which all four tasks are started and each task toggles an I/O pin at a
different rate:


        ORG

        JMP     #task0          'task 0 begins here when the cog starts (this JMP takes 4 clocks)
        JMP     #task1          'task 1 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
        JMP     #task2          'task 2 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
        JMP     #task3          'task 3 begins here after task 0 executes SETTASK (this JMP takes 1 clock)

task0   SETTASK #%%3210         'enable all tasks (TASK = %11_10_01_00_11_10_01_00_11_10_01_00_11_10_01_00)

:loop   NOTP    #0              'task 0, toggle pin 0       (loops every 8 clocks)
        JMP     #:loop          '(this JMP takes 1 clock)

task1   NOTP    #1              'task 1, toggle pin 1       (loops every 12 clocks)
        NOP
        JMP     #task1          '(this JMP takes 1 clock)

task2   NOTP    #2              'task 2, toggle pin 2       (loops every 16 clocks)
        NOP                     
        NOP
        JMP     #task2          '(this JMP takes 1 clock)

task3   NOTP    #3              'task 3, toggle pin 3       (loops every 20 clocks)
        NOP
        NOP
        NOP
        JMP     #task3          '(this JMP takes 1 clock)
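
The loop periods quoted in the comments (8, 12, 16 and 20 clocks) follow from each task getting one slot in four. The Python sketch below reproduces them under the assumption that every instruction in the loops takes exactly 1 clock, which is what the JMP comments claim for this four-task rotation; `loop_periods` is a made-up name.

```python
def loop_periods(task_reg, loop_lengths, clocks=400):
    """Clocks between successive passes through the top of each task's loop.

    loop_lengths[t] is the number of instructions in task t's loop.
    Assumes every instruction takes 1 clock.
    """
    pos = [0] * 4            # position of each task inside its loop
    last_top = [None] * 4    # clock at which each task last hit its loop top
    period = [None] * 4
    for clock in range(clocks):
        task = task_reg & 0b11
        task_reg = ((task_reg >> 2) | (task_reg << 30)) & 0xFFFFFFFF
        if pos[task] == 0:
            if last_top[task] is not None:
                period[task] = clock - last_top[task]
            last_top[task] = clock
        pos[task] = (pos[task] + 1) % loop_lengths[task]
    return period


# task0: NOTP+JMP, task1: +1 NOP, task2: +2 NOPs, task3: +3 NOPs
print(loop_periods(0xE4E4E4E4, [2, 3, 4, 5]))
```

Changing the TASK value changes the periods: a task that owns two of every four slots would loop twice as fast for the same instruction count, provided the 1-clock assumption still holds.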


------------------------------------------------------------------------------------------------------------
NOTE: When a normal branch instruction (JMP, CALL, RET, etc.) executes in the fourth and final stage of the
pipeline, all instructions progressing through the lower three stages, which belong to the same task as the
branch instruction, are cancelled. This inhibits execution of incidental data that was trailing the branch
instruction.

The delayed branch instructions (JMPD, CALLD, RETD, etc.) don't do any pipeline instruction cancellation and
exist to provide 1-clock branches to single-task programs, where the three instructions following the branch
are allowed to execute before the new instruction stream begins to execute.

For single-task programs, normal branches take 4 clocks: 1 clock for the branch and 3 clocks for the
cancelled instructions to come through the pipeline before the new instruction stream begins to execute.

For multi-tasking programs that use all four tasks in sequence (i.e. SETTASK #%%3210), there are never any
same-task instructions in the pipeline that would require cancellation due to branching, so all branches
take just 1 clock.
------------------------------------------------------------------------------------------------------------


Tips for coding multi-tasking programs
--------------------------------------

While all tasks in a multi-tasking program can execute atomic instructions without any inter-task conflict,
remember that there's only one of each of the following cog resources and only one task can use it at a time:

  SPA
  SPB
  INDA
  INDB
  PTRA
  PTRB
  ACCA
  ACCB
  32x32 multiplier
  64/32 divider
  64-bit square rooter
  CORDIC computer
  CTRA
  CTRB
  VID
  PIX (not usable in multi-tasking, requires single-task timing)
  XFR
  SER
  Bitfield mover

When writing multi-task programs, be aware that instructions that take multiple clocks will stall the
pipeline and have a ripple effect on the tasks' timing. This may be impossible to avoid, as some task
might need to access hub memory, and those instructions are not single-clock.

The WAITCNT/WAITPEQ/WAITPNE instructions should be coded discretely using 1-clock instructions, to avoid
stalling the pipeline for excessive amounts of time.

The following instructions (WC versions) will take 1 clock, instead of potentially many, and return 1 in
C if they were successful:

  SNDSER  D  WC      attempt to send serial
  RCVSER  D  WC      attempt to receive serial
  GETMULL D  WC      attempt to get lower multiplier result
  GETMULH D  WC      attempt to get upper multiplier result
  GETDIVQ D  WC      attempt to get divider quotient result
  GETDIVR D  WC      attempt to get divider remainder result
  GETSQRT D  WC      attempt to get square root result
  GETQX   D  WC      attempt to get CORDIC X result
  GETQY   D  WC      attempt to get CORDIC Y result
  GETQZ   D  WC      attempt to get CORDIC Z result

Other instruction alternatives:

  POLCTRA    WC      returns 1 in C if CTRA rolled over, use instead of SYNCTRA
  POLCTRB    WC      returns 1 in C if CTRB rolled over, use instead of SYNCTRB
  POLVID     WC      returns 1 in C if WAITVID is ready, use to execute WAITVID without stalling
  PASSCNT D          jumps to itself if some amount of time has not passed, use instead of WAITCNT
  JP/JNP  D,S        jumps based on pin states, use instead of WAITPEQ/WAITPNE
  DJNZ    D,#$       loops until done, use instead of NOP D/#n

The following instructions will not work in a multi-tasking program:

  REPS/REPD          operate by subtracting a value from the PC every n clocks - single-task only
  GETPIX             needs steady pipeline delays for perspective divider time - single-task only


instructions                                                                               clocks
-------------------------------------------------------------------------------------------------
000011 000 1 CCCC DDDDDDDDD 01001mmmm        JMPTASK D,#mask  'Set PC's in mask to D            1
000011 001 1 CCCC nnnnnnnnn 01001mmmm        JMPTASK #n,#mask 'Set PC's in mask to 0..511       1

000011 000 1 CCCC DDDDDDDDD 011001011        SETTASK D        'Set TASK to D                    1
000011 001 1 CCCC nnnnnnnnn 011001011        SETTASK #n       'Set TASK to n[7:0] copied 4x     1
-------------------------------------------------------------------------------------------------

Peter Jakacki · 2012-11-30 06:02

cgracey wrote: »

I've been working on the instruction set documentation and I've completed the part that covers the hub memory instructions:

Thanks Chip, this is really useful, straight away I can see the RDBYTEC with PTR++ operation really speeding up bytecode operations.

BTW, this operation here -> 000011 000 1 CCCC 000000000 000001000 CACHEX 'invalidate cash
Isn't this what the GFC has done?

cgracey · 2012-11-30 06:12

Peter Jakacki wrote: »

Thanks Chip, this is really useful, straight away I can see the RDBYTEC with PTR++ operation really speeding up bytecode operations.

BTW, this operation here -> 000011 000 1 CCCC 000000000 000001000 CACHEX 'invalidate cash
Isn't this what the GFC has done?

Woops. I'm getting ahead of the NWO here.

Bill Henning · 2012-11-30 06:18

Thanks Chip - it's great to have docs on those new instructions!

In order to print them for reference, I quickly converted your text to Libre/Open office format, and also made a PDF - I am attaching them below.

evanh · 2012-11-30 06:51

Good trick with the forward calculation on the pre(inc/dec)! Such an obvious solution to the iteration delay problem. I guess that shows how little hands on I've done.

Chip, thanks for showing the working.

evanh · 2012-11-30 07:03

If I'm not mistaken, by mapping the QUADs to Cog space and exclusively managing Hub accesses with WRQUAD and RDQUAD, one could prevent any instruction stalling due to Hub accesses, right?

PS: Of course, the limit with this approach is working in chunks of 16 bytes at a time and the resulting obligatory read-modify-write.

David Betz · 2012-11-30 07:14

Bill Henning wrote: »

Thanks Chip - it's great to have docs on those new instructions!

In order to print them for reference, I quickly converted your text to Libre/Open office format, and also made a PDF - I am attaching them below.

This is great! Any chance you could add the descriptions of SETTASK and JMPTASK that Chip posted earlier in message #77?

Bill Henning · 2012-11-30 07:21

No problem!

I need to print those too...

evanh · 2012-11-30 07:26

Here's my attempt ...

Bill Henning · 2012-11-30 07:30

oops!

Evan, you were faster than me :-)

evanh · 2012-11-30 07:32

hehe, I was just being cheeky.

David Betz · 2012-11-30 07:34

Bill Henning wrote: »

No problem!

I need to print those too...

Thanks Bill!!

And thanks to Chip for providing these descriptions!

Rayman · 2012-11-30 12:05

RDQUAD and WRQUAD look very nice, needing only 1 clock in a fast loop.

Can the video generator handle a Quad?

David Betz · 2012-12-01 14:39

Ugh. I was just getting ready to spend the weekend working on P2 code. I decided to try running PNut.exe under Parallels on the Mac and when I fired up Windows XP under Parallels I was told that several updates were available. First, I tried PNut.exe and putty for talking to my DE0-Nano board and both worked fine. I then decided to go ahead and do the various updates (Parallels, Windows, and Microsoft Security Essentials). Unfortunately, after doing all of those updates my FTDI driver no longer works. It gives me an error "This device cannot start. (Code 10)". Has anyone seen this? Any idea how to get around it? I have verified that I have the latest FTDI driver installed. What else could be causing this problem?

Sapieha · 2012-12-01 14:46

Hi David

Remove it and reinstall - in some cases that helps

David Betz wrote: »

Ugh. I was just getting ready to spend the weekend working on P2 code. I decided to try running PNut.exe under Parallels on the Mac and when I fired up Windows XP under Parallels I was told that several updates were available. First, I tried PNut.exe and putty for talking to my DE0-Nano board and both worked fine. I then decided to go ahead and do the various updates (Parallels, Windows, and Microsoft Security Essentials). Unfortunately, after doing all of those updates my FTDI driver no longer works. It gives me an error "This device cannot start. (Code 10)". Has anyone seen this? Any idea how to get around it? I have verified that I have the latest FTDI driver installed. What else could be causing this problem?

David Betz · 2012-12-01 15:04

Sapieha wrote: »

Hi David

Remove it and reinstall - in some cases that helps

That's a good suggestion but I already tried it and it didn't work. I'm wondering if I should try to back out some of my updates. Either that or I need to break down and bring my Dell laptop with a native Windows install upstairs and use that. :-(
