PROPELLER 2 MEMORY ------------------ In the Propeller 2, there are two primary types of memory: HUB MEMORY 128K bytes of main memory shared by all cogs - cogs launch from this memory - cogs can access this memory as bytes, words, longs, and quads (4 longs) - $00000..$00E7F is ROM - contains Booter, SHA-256/HMAC, and Monitor - $00E80..$1FFFF is RAM - for application usage COG MEMORY (8 instances) 512 longs of register RAM for code and data usage - simultaneous instruction, source, and destination reading, plus writing - last eight registers are for I/O pin control 256 longs of stack RAM for data and video usage - accessible via push and pop operations - video circuit can read data simultaneously and asynchronously INSTRUCTION ENCODING -------------------- Cog instructions are 32 bits long and comprised of several bit fields. There are two main types of instructions: dual-operand and single-operand. Dual-operand instructions specify both a D register, which usually is read and written back, and an S register which is read or used as an immediate value. Single- operand instructions specify only a D register. Dual-operand encoding: TTTTTT ZCR I CCCC DDDDDDDDD SSSSSSSSS IF_x MNEM D,S/#n WZ,WC,NR TTTTTT = Instruction according to instruction (MNEM) I = SSSSSSSSS register or immediate, 0=register address (S), 1=immediate (#n) Single-operand encoding: 000011 ZCR 1 CCCC DDDDDDDDD TTTTTTTTT IF_x MNEM D WZ,WC,NR TTTTTTTTT = Instruction according to instruction (MNEM) For both cases: Z = Z flag write control: 0=don't write Z, 1=write Z Defaults to 0, but may be set to 1 by adding WZ (Write Z) after operand(s) Unless specified otherwise, the value written to Z is the NOR of the 32-bit D result. 
C = C flag write control: 0=don't write C, 1=write C Defaults to 0, but may be set to 1 by adding WC (Write C) after operand(s) R = D register write control: 0=don't write D, 1=write D Default varies by instruction, but may be cleared to 0 by adding NR (No Result) CCCC = Execution condition (expressed by IF_x mnemonic prefix) Determines Z/C flag conditions upon which the instruction will execute CCCC condition CCCC mnemonic prefixes (in easy-to-read order) --------------------------------------------------------------------- 0000 never 1111 IF_ALWAYS (default) 0001 nc & nz 1100 IF_C IF_B 0010 nc & z 0011 IF_NC IF_AE 0011 nc 1010 IF_Z IF_E 0100 c & nz 0101 IF_NZ IF_NE 0101 nz 1000 IF_C_AND_Z IF_Z_AND_C 0110 c <> z 0100 IF_C_AND_NZ IF_NZ_AND_C 0111 nc | nz 0010 IF_NC_AND_Z IF_Z_AND_NC 1000 c & z 0001 IF_NC_AND_NZ IF_NZ_AND_NC IF_A 1001 c = z 1110 IF_C_OR_Z IF_Z_OR_C IF_BE 1010 z 1101 IF_C_OR_NZ IF_NZ_OR_C 1011 nc | z 1011 IF_NC_OR_Z IF_Z_OR_NC 1100 c 0111 IF_NC_OR_NZ IF_NZ_OR_NC 1101 c | nz 1001 IF_C_EQ_Z IF_Z_EQ_C 1110 c | z 0110 IF_C_NE_Z IF_Z_NE_C 1111 always 0000 IF_NEVER DDDDDDDDD = Destination register address (D) SSSSSSSSS = Source register address (S) or zero-extended immediate value (#n) HUB MEMORY INSTRUCTIONS ----------------------- These instructions read and write hub memory. All instructions use D as the data conduit, except WRQUAD/RDQUAD/RDQUADC, which uses the four QUAD registers. The QUADs can be mapped into cog register space using the SETQUAD instruction or kept hidden, in which case they are still useful as data conduit and as a read cache. If mapped, the QUADs overlay four contiguous cog registers. These overlaid registers can be read and written as any other registers, as well as executed. Any write via D to the QUAD registers, when mapped, will affect the underlying cog registers, as well. A RDQUAD/RDQUADC will affect the QUAD registers, but not the underlying cog registers. 
The cached reads RDBYTEC/RDWORDC/RDLONGC/RDQUADC will do a RDQUAD if the current read address is outside of the 4-long window of the prior RDQUAD. Otherwise, they will immediately return cached data. The CACHEX instruction invalidates the cache, forcing a fresh RDQUAD the next time a cached read executes.

Hub memory instructions must wait for their cog's hub cycle, which comes once every 8 clocks. The timing relationship between a cog's instruction stream and its hub cycle is generally indeterminate, causing these instructions to take varying numbers of clocks. Timing can be made deterministic, though, by intentionally spacing these instructions apart so that after the first in a series executes, the subsequent hub memory instructions fall on hub cycles, making them take the minimal number of clocks. The trick is to write useful code to go in between them.

WRBYTE/WRWORD/WRLONG/WRQUAD/RDQUAD complete on the hub cycle, making them take 1..8 clocks.

RDBYTE/RDWORD/RDLONG complete on the 2nd clock after the hub cycle, making them take 3..10 clocks.

RDBYTEC/RDWORDC/RDLONGC take only 1 clock if data is cached, otherwise 3..10 clocks.

RDQUADC takes only 1 clock if data is cached, otherwise 1..8 clocks.
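The clock ranges above can be reproduced with a small model. This is only a sketch in Python, not a description of the silicon: it assumes that a hub instruction which starts exactly on its cog's hub cycle takes the minimum time, that the next instruction starts as soon as the previous one completes, and that filler instructions take the number of clocks given. The names `hub_slot`, `hub_instruction_clocks`, and `run_sequence` are hypothetical.

```python
HUB_PERIOD = 8  # a cog's hub cycle comes once every 8 clocks


def clocks_to_hub_cycle(start_clock, hub_slot):
    """Clocks spent waiting, counting the hub-cycle clock itself (1..8)."""
    return (hub_slot - start_clock) % HUB_PERIOD + 1


def hub_instruction_clocks(kind, start_clock, hub_slot, cached=False):
    """kind is 'write' (WRBYTE/WRWORD/WRLONG/WRQUAD), 'rdquad' (RDQUAD/RDQUADC)
    or 'read' (RDBYTE/RDWORD/RDLONG and their cached forms)."""
    if cached:
        return 1  # cached reads return immediately on a cache hit
    wait = clocks_to_hub_cycle(start_clock, hub_slot)
    # Assumption: byte/word/long reads finish on the 2nd clock after the hub cycle.
    return wait + 2 if kind == "read" else wait


def run_sequence(sequence, start_clock, hub_slot):
    """sequence mixes hub instruction kinds (str) and filler clock counts (int).
    Returns the clocks taken by each hub instruction, in order."""
    clock = start_clock
    costs = []
    for item in sequence:
        if isinstance(item, int):
            clock += item  # non-hub filler code
        else:
            cost = hub_instruction_clocks(item, clock, hub_slot)
            costs.append(cost)
            clock += cost
    return costs


# Three writes separated by 7 clocks of other code: only the first one waits.
print(run_sequence(["write", 7, "write", 7, "write"], start_clock=3, hub_slot=5))
# Two reads separated by 5 clocks of other code.
print(run_sequence(["read", 5, "read"], start_clock=3, hub_slot=5))
```

Under these assumptions, 7 clocks of filler after a write (or 5 after a byte/word/long read) line the next hub instruction up with the following hub cycle. Those filler counts come from the model, not from the source text.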
After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:

                RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
                NOP                     'do something for at least 3 clocks to allow QUADs to update
                NOP
                NOP
                CMP     quad0,quad1     'mapped QUADs are now accessible via D and S

After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:

                SETQUAD #quad0          'map QUADs to quad0..quad3
                RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
                NOP                     'do something for at least 3 clocks to allow QUADs to update
                NOP
                NOP
                NOP                     'do at least 1 instruction to get QUADs into pipeline
quad0           NOP                     'QUAD0..QUAD3 are now executable
quad1           NOP
quad2           NOP
quad3           NOP

After a SETQUAD, mapped QUAD registers are writable immediately, but original contents are readable via D and S after 2 instructions:

                SETQUAD #quad0          'map QUADs to quad0..quad3 (new address)
                NOP                     'do at least two instructions to queue up QUADs
                NOP
                CMP     quad0,quad1     'mapped QUADS are now accessible via D and S

On cog startup, the QUAD registers are cleared to 0's.
instructions clocks --------------------------------------------------------------------------------------------------------- 000000 000 0 CCCC DDDDDDDDD SSSSSSSSS WRBYTE D,S 'write lower byte in D at S 1..8 000000 000 1 CCCC DDDDDDDDD SUPNNNNNN WRBYTE D,PTR 'write lower byte in D at PTR 1..8 000000 Z01 0 CCCC DDDDDDDDD SSSSSSSSS RDBYTE D,S 'read byte at S into D 3..10 000000 Z01 1 CCCC DDDDDDDDD SUPNNNNNN RDBYTE D,PTR 'read byte at PTR into D 3..10 000000 Z11 0 CCCC DDDDDDDDD SSSSSSSSS RDBYTEC D,S 'read cached byte at S into D 1, 3..10 000000 Z11 1 CCCC DDDDDDDDD SUPNNNNNN RDBYTEC D,PTR 'read cached byte at PTR into D 1, 3..10 000001 000 0 CCCC DDDDDDDDD SSSSSSSSS WRWORD D,S 'write lower word in D at S 1..8 000001 000 1 CCCC DDDDDDDDD SUPNNNNNN WRWORD D,PTR 'write lower word in D at PTR 1..8 000001 Z01 0 CCCC DDDDDDDDD SSSSSSSSS RDWORD D,S 'read word at S into D 3..10 000001 Z01 1 CCCC DDDDDDDDD SUPNNNNNN RDWORD D,PTR 'read word at PTR into D 3..10 000001 Z11 0 CCCC DDDDDDDDD SSSSSSSSS RDWORDC D,S 'read cached word at S into D 1, 3..10 000001 Z11 1 CCCC DDDDDDDDD SUPNNNNNN RDWORDC D,PTR 'read cached word at PTR into D 1, 3..10 000010 000 0 CCCC DDDDDDDDD SSSSSSSSS WRLONG D,S 'write D at S 1..8 000010 000 1 CCCC DDDDDDDDD SUPNNNNNN WRLONG D,PTR 'write D at PTR 1..8 000010 Z01 0 CCCC DDDDDDDDD SSSSSSSSS RDLONG D,S 'read long at S into D 3..10 000010 Z01 1 CCCC DDDDDDDDD SUPNNNNNN RDLONG D,PTR 'read long at PTR into D 3..10 000010 Z11 0 CCCC DDDDDDDDD SSSSSSSSS RDLONGC D,S 'read cached long at S into D 1, 3..10 000010 Z11 1 CCCC DDDDDDDDD SUPNNNNNN RDLONGC D,PTR 'read cached long at PTR into D 1, 3..10 000011 000 1 CCCC DDDDDDDDD 010110000 WRQUAD D 'write QUADs at D 1..8 000011 001 1 CCCC SUPNNNNNN 010110000 WRQUAD PTR 'write QUADs at PTR 1..8 000011 000 1 CCCC DDDDDDDDD 010110001 RDQUAD D 'read quad at D into QUADs 1..8 000011 001 1 CCCC SUPNNNNNN 010110001 RDQUAD PTR 'read quad at PTR into QUADs 1..8 000011 010 1 CCCC DDDDDDDDD 010110001 RDQUADC D 'read cached quad 
at D into QUADs 1, 1..8 000011 011 1 CCCC SUPNNNNNN 010110001 RDQUADC PTR 'read cached quad at PTR into QUADs 1, 1..8 --------------------------------------------------------------------------------------------------------- PTR expressions: INDEX = -32..+31 for simple offsets, 0..31 for ++'s, or 0..32 for --'s SCALE = 1 for byte, 2 for word, 4 for long, or 16 for quad S = 0 for PTRA, 1 for PTRB U = 0 to keep PTRx same, 1 to update PTRx P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify) NNNNNN = INDEX nnnnnn = -INDEX SUPNNNNNN PTR expression ----------------------------------------------------------------------------- 000000000 PTRA 'use PTRA 100000000 PTRB 'use PTRB 011000001 PTRA++ 'use PTRA, PTRA += SCALE 111000001 PTRB++ 'use PTRB, PTRB += SCALE 011111111 PTRA-- 'use PTRA, PTRA -= SCALE 111111111 PTRB-- 'use PTRB, PTRB -= SCALE 010000001 ++PTRA 'use PTRA + SCALE, PTRA += SCALE 110000001 ++PTRB 'use PTRB + SCALE, PTRB += SCALE 010111111 --PTRA 'use PTRA - SCALE, PTRA -= SCALE 110111111 --PTRB 'use PTRB - SCALE, PTRB -= SCALE 000NNNNNN PTRA[INDEX] 'use PTRA + INDEX*SCALE 100NNNNNN PTRB[INDEX] 'use PTRB + INDEX*SCALE 011NNNNNN PTRA++[INDEX] 'use PTRA, PTRA += INDEX*SCALE 111NNNNNN PTRB++[INDEX] 'use PTRB, PTRB += INDEX*SCALE 011nnnnnn PTRA--[INDEX] 'use PTRA, PTRA -= INDEX*SCALE 111nnnnnn PTRB--[INDEX] 'use PTRB, PTRB -= INDEX*SCALE 010NNNNNN ++PTRA[INDEX] 'use PTRA + INDEX*SCALE, PTRA += INDEX*SCALE 110NNNNNN ++PTRB[INDEX] 'use PTRB + INDEX*SCALE, PTRB += INDEX*SCALE 010nnnnnn --PTRA[INDEX] 'use PTRA - INDEX*SCALE, PTRA -= INDEX*SCALE 110nnnnnn --PTRB[INDEX] 'use PTRB - INDEX*SCALE, PTRB -= INDEX*SCALE Examples: 000000 Z01 1 CCCC DDDDDDDDD 000000000 RDBYTE D,PTRA 'read byte at PTRA into D 000001 000 1 CCCC DDDDDDDDD 111000001 WRWORD D,PTRB++ 'write lower word in D at PTRB, PTRB += 2 000010 Z01 1 CCCC DDDDDDDDD 011111111 RDLONG D,PTRA-- 'read long at PTRA into D, PTRA -= 4 000011 001 1 CCCC 110000001 010110001 RDQUAD ++PTRB 'read quad at PTRB+16 into 
QUADs, PTRB += 16 000000 000 1 CCCC DDDDDDDDD 010111111 WRBYTE D,--PTRA 'write lower byte in D at PTRA-1, PTRA -= 1 000001 000 1 CCCC DDDDDDDDD 100000111 WRWORD D,PTRB[7] 'write lower word in D to PTRB+7*2 000010 Z11 1 CCCC DDDDDDDDD 011001111 RDLONGC D,PTRA++[15] 'read cached long at PTRA into D, PTRA += 15*4 000011 001 1 CCCC 111111101 010110000 WRQUAD PTRB--[3] 'write QUADs at PTRB, PTRB -= 3*16 000000 000 1 CCCC DDDDDDDDD 010000110 WRBYTE D,++PTRA[6] 'write lower byte in D to PTRA+6*1, PTRA += 6*1 000001 Z01 1 CCCC DDDDDDDDD 110110110 RDWORD D,--PTRB[10] 'read word at PTRB-10*2 into D, PTRB -= 10*2 Bytes, words, longs, and quads are addressed as follows: for WRBYTE/RDBYTE/RDBYTEC, address = %XXXXXXXXXXXXXXXXX (bits 16..0 are used) for WRWORD/RDWORD/RDWORDC, address = %XXXXXXXXXXXXXXXX- (bits 16..1 are used) for WRLONG/RDLONG/RDLONGC, address = %XXXXXXXXXXXXXXX-- (bits 16..2 are used) for WRQUAD/RDQUAD/RDQUADC, address = %XXXXXXXXXXXXX---- (bits 16..4 are used) address byte word long quad ------------------------------------------------------------------- 00000- 50 *7250 *706F7250 *0C7CCC030C7C200020302E32706F7250 00001- 72 7250 706F7250 0C7CCC030C7C200020302E32706F7250 00002- 6F *706F 706F7250 0C7CCC030C7C200020302E32706F7250 00003- 70 706F 706F7250 0C7CCC030C7C200020302E32706F7250 00004- 32 *2E32 *20302E32 0C7CCC030C7C200020302E32706F7250 00005- 2E 2E32 20302E32 0C7CCC030C7C200020302E32706F7250 00006- 30 *2030 20302E32 0C7CCC030C7C200020302E32706F7250 00007- 20 2030 20302E32 0C7CCC030C7C200020302E32706F7250 00008- 00 *2000 *0C7C2000 0C7CCC030C7C200020302E32706F7250 00009- 20 2000 0C7C2000 0C7CCC030C7C200020302E32706F7250 0000A- 7C *0C7C 0C7C2000 0C7CCC030C7C200020302E32706F7250 0000B- 0C 0C7C 0C7C2000 0C7CCC030C7C200020302E32706F7250 0000C- 03 *CC03 *0C7CCC03 0C7CCC030C7C200020302E32706F7250 0000D- CC CC03 0C7CCC03 0C7CCC030C7C200020302E32706F7250 0000E- 7C *0C7C 0C7CCC03 0C7CCC030C7C200020302E32706F7250 0000F- 0C 0C7C 0C7CCC03 0C7CCC030C7C200020302E32706F7250 
00010- 45 *FE45 *0DC1FE45 *0D7CC6010C7CC6010CFCB6E30DC1FE45 00011- FE FE45 0DC1FE45 0D7CC6010C7CC6010CFCB6E30DC1FE45 00012- C1 *0DC1 0DC1FE45 0D7CC6010C7CC6010CFCB6E30DC1FE45 00013- 0D 0DC1 0DC1FE45 0D7CC6010C7CC6010CFCB6E30DC1FE45 00014- E3 *B6E3 *0CFCB6E3 0D7CC6010C7CC6010CFCB6E30DC1FE45 00015- B6 B6E3 0CFCB6E3 0D7CC6010C7CC6010CFCB6E30DC1FE45 00016- FC *0CFC 0CFCB6E3 0D7CC6010C7CC6010CFCB6E30DC1FE45 00017- 0C 0CFC 0CFCB6E3 0D7CC6010C7CC6010CFCB6E30DC1FE45 00018- 01 *C601 *0C7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45 00019- C6 C601 0C7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45 0001A- 7C *0C7C 0C7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45 0001B- 0C 0C7C 0C7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45 0001C- 01 *C601 *0D7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45 0001D- C6 C601 0D7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45 0001E- 7C *0D7C 0D7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45 0001F- 0D 0D7C 0D7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45 * new word/long/quad PTRA/PTRB INSTRUCTIONS ---------------------- Each cog has two 17-bit pointers, PTRA and PTRB, which can be read, written, modified, and used to access hub memory. 
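The SUPNNNNNN field from the PTR expression table in the HUB MEMORY INSTRUCTIONS section can be sketched in Python as follows. This is an illustrative model only: `encode_ptr` and `apply_ptr` are hypothetical helper names, the 17-bit wrap of the pointer is an assumption based on the stated pointer width, and the U=0/P=1 combination (not listed in the table) is treated here as "use PTRx, no update".

```python
import re

ADDR_MASK = 0x1FFFF  # PTRA/PTRB are 17 bits wide
_PTR_RE = re.compile(r"^(\+\+|--)?PTR([AB])(\+\+|--)?(?:\[(-?\d+)\])?$")


def encode_ptr(expr):
    """Encode a PTR expression such as 'PTRB--[3]' into the 9-bit SUPNNNNNN field."""
    m = _PTR_RE.match(expr.replace(" ", ""))
    if not m:
        raise ValueError("not a PTR expression: " + expr)
    pre, reg, post, idx = m.groups()
    if pre and post:
        raise ValueError("use either a pre- or a post-modifier, not both")
    s = 1 if reg == "B" else 0
    op = pre or post
    if op is None:
        u, p = 0, 0                       # simple offset, pointer unchanged
        index = int(idx) if idx is not None else 0
    else:
        u = 1                             # pointer is updated
        p = 1 if post else 0              # post-modify uses the old pointer
        index = int(idx) if idx is not None else 1
        if index < 0:
            raise ValueError("index must not be negative with ++/--")
        if op == "--":
            index = -index
    if not -32 <= index <= 31:
        raise ValueError("index out of range for a 6-bit field")
    return (s << 8) | (u << 7) | (p << 6) | (index & 0x3F)


def apply_ptr(field, ptra, ptrb, scale):
    """Return (hub address used, new PTRA, new PTRB) for a SUPNNNNNN field.
    scale is 1, 2, 4, or 16 for byte, word, long, or quad accesses."""
    s, u, p = (field >> 8) & 1, (field >> 7) & 1, (field >> 6) & 1
    index = field & 0x3F
    if index & 0x20:
        index -= 0x40                     # sign-extend the 6-bit index
    base = ptrb if s else ptra
    moved = (base + index * scale) & ADDR_MASK  # assumed 17-bit wrap
    address = base if p else moved
    if u:
        if s:
            ptrb = moved
        else:
            ptra = moved
    return address, ptra, ptrb


print(format(encode_ptr("PTRA++[15]"), "09b"))
print(apply_ptr(encode_ptr("++PTRB"), ptra=0, ptrb=0x100, scale=16))
```

The encodings this produces can be checked against the "Examples" rows of the PTR expression table, for instance `WRQUAD PTRB--[3]` and `RDWORD D,--PTRB[10]`.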
At cog startup, the PTRA and PTRB registers are initialized as follows: PTRA = %X_XXXXXXXX_XXXXXXXX, data from launching cog, usually a pointer PTRB = %X_XXXXXXXX_XXXXXX00, long address in hub where cog code was loaded from instructions clocks ------------------------------------------------------------------------------------------------- 000011 ZCR 1 CCCC DDDDDDDDD 000010010 GETPTRA D 'get PTRA into D, C = PTRA[16] 1 000011 ZCR 1 CCCC DDDDDDDDD 000010011 GETPTRB D 'get PTRB into D, C = PTRB[16] 1 000011 000 1 CCCC DDDDDDDDD 010110010 SETPTRA D 'set PTRA to D 1 000011 001 1 CCCC nnnnnnnnn 010110010 SETPTRA #n 'set PTRA to 0..511 1 000011 000 1 CCCC DDDDDDDDD 010110011 SETPTRB D 'set PTRB to D 1 000011 001 1 CCCC nnnnnnnnn 010110011 SETPTRB #n 'set PTRB to 0..511 1 000011 000 1 CCCC DDDDDDDDD 010110100 ADDPTRA D 'add D into PTRA 1 000011 001 1 CCCC nnnnnnnnn 010110100 ADDPTRA #n 'add 0..511 into PTRA 1 000011 000 1 CCCC DDDDDDDDD 010110101 ADDPTRB D 'add D into PTRB 1 000011 001 1 CCCC nnnnnnnnn 010110101 ADDPTRB #n 'add 0..511 into PTRB 1 000011 000 1 CCCC DDDDDDDDD 010110110 SUBPTRA D 'subtract D from PTRA 1 000011 001 1 CCCC nnnnnnnnn 010110110 SUBPTRA #n 'subtract 0..511 from PTRA 1 000011 000 1 CCCC DDDDDDDDD 010110111 SUBPTRB D 'subtract D from PTRB 1 000011 001 1 CCCC nnnnnnnnn 010110111 SUBPTRB #n 'subtract 0..511 from PTRB 1 ------------------------------------------------------------------------------------------------- QUAD-RELATED INSTRUCTIONS ------------------------- Each cog has four QUAD registers which form a 128-bit conduit between the hub memory and the cog. This conduit can transfer four longs every 8 clocks via the WRQUAD/RDQUAD instructions. It can also be used as a 4-long/8-word/16-byte read cache, utilized by RDBYTEC/RDWORDC/RDLONGC/RDQUADC. 
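As a rough illustration of the read cache, the following Python sketch models a single cached quad together with the address alignment rules from the addressing table. It is not the hardware implementation: `QuadCache` is a hypothetical name, and the hub contents are just the 32 sample bytes copied from the addressing table.

```python
# Sample bytes for hub addresses $00000..$0001F, copied from the addressing table.
HUB = bytes.fromhex(
    "50726F70322E302000207C0C03CC7C0C"
    "45FEC10DE3B6FC0C01C67C0C01C67C0D"
)


class QuadCache:
    """Model of the 4-long read cache used by RDBYTEC/RDWORDC/RDLONGC/RDQUADC."""

    def __init__(self, hub):
        self.hub = hub
        self.base = None   # hub address of the cached quad, None = invalid
        self.quad = b""

    def cachex(self):
        """CACHEX: invalidate the cache so the next cached read reloads it."""
        self.base = None

    def read(self, address, size):
        """Cached read of size 1, 2, 4, or 16 bytes. Returns (value, cache_hit)."""
        # Low address bits are ignored according to the access size.
        aligned = address & 0x1FFFF & ~(size - 1)
        quad_base = aligned & ~0xF
        hit = self.base == quad_base
        if not hit:
            # Outside the cached 4-long window: behaves like a fresh RDQUAD.
            self.base = quad_base
            self.quad = self.hub[quad_base:quad_base + 16]
        offset = aligned - quad_base
        # Hub memory is little-endian: the lowest address holds the low byte.
        return int.from_bytes(self.quad[offset:offset + size], "little"), hit


cache = QuadCache(HUB)
print(cache.read(0x05, 4))   # long containing address $00005, first access misses
print(cache.read(0x0B, 2))   # word containing address $0000B, same quad, so a hit
```

Each value returned can be compared with the byte/word/long/quad columns of the addressing table for the same address.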
Initially hidden, these QUAD registers are mappable into cog register space by using the SETQUAD instruction to set an address where the base register is to appear, with the other three registers following. To hide the QUAD registers, use SETQUAD to set an address of $1FF. SETQUAZ works just like SETQUAD, but also clears the four QUAD registers. instructions clocks ------------------------------------------------------------------------------------------------- 000011 000 1 CCCC 000000000 000001000 CACHEX 'invalidate cache 1 000011 Z01 1 CCCC DDDDDDDDD 000010001 GETTOPS D 'get top bytes of QUADs into D 1 000011 000 1 CCCC DDDDDDDDD 011100010 SETQUAD D 'set QUAD base to D 1 000011 001 1 CCCC nnnnnnnnn 011100010 SETQUAD #n 'set QUAD base to 0..511 1 000011 010 1 CCCC DDDDDDDDD 011100010 SETQUAZ D 'set QUAD base to D, QUAD=0 1 000011 011 1 CCCC nnnnnnnnn 011100010 SETQUAZ #n 'set QUAD base to 0..511, QUAD=0 1 ------------------------------------------------------------------------------------------------- HUB CONTROL INSTRUCTIONS ------------------------ These instructions are used to control hub circuits and cogs. Hub instructions must wait for their cog's hub cycle, which comes once every 8 clocks. In cases where there is no result to wait for (ZCR = %000), these instructions complete on the hub cycle, making them take 1..8 clocks, depending on where the hub cycle is in relation to the instruction. In cases where a result is anticipated (ZCR <> %000), these instructions complete on the 1st clock after the hub cycle, making them take 2..9 clocks. COGINIT D,S ----------- COGINIT is used to start cogs. Any cog can be (re)started, whether it is idle or running. A cog can even execute a COGINIT to restart itself with a new program. COGINIT uses D to specify a long address in hub memory that is the start of the program that is to be loaded into a cog, while S is a 17-bit parameter (usually an address) that will be conveyed to PTRA of the started cog. 
PTRB of the started cog will be set to the start address of its program that was loaded from hub memory.

SETCOG must be executed before COGINIT to set the number of the cog to be started (0..7). If SETCOG sets a value with bit 3 set (%1xxx), this will cause the next idle cog to be started when COGINIT is executed, with the number of the cog started being returned in D, and the C flag returning 0 if okay, or 1 if no idle cog was available. At cog startup, SETCOG is initialized to %0000.

When a cog is started, $1F8 contiguous longs are read from hub memory and written to cog registers $000..$1F7. The cog will then begin execution at $000. This process takes 1,016 clocks.

Example:

                COGID   COGNUM          'what cog am I?
                SETCOG  COGNUM          'set my cog number
                COGINIT COGPGM,COGPTR   'restart me with the ROM Monitor

COGPGM          LONG    $0070C          'address of the ROM Monitor
COGPTR          LONG    90<<9 + 91      'tx = P90, rx = P91
COGNUM          RES     1


CLKSET D
---------
CLKSET writes the lower 9 bits of D to the hub clock register:

        %R_MMMM_XX_SS

        R    = 1 for hardware reset, 0 for continued operation

        MMMM = PLL mode:        %0000 for disabled, else XX must be set for XI input or XI/XO crystal oscillator
                                %0001 for multiply XI by 2
                                %0010 for multiply XI by 3
                                %0011 for multiply XI by 4
                                %0100 for multiply XI by 5
                                %0101 for multiply XI by 6
                                %0110 for multiply XI by 7
                                %0111 for multiply XI by 8
                                %1000 for multiply XI by 9
                                %1001 for multiply XI by 10
                                %1010 for multiply XI by 11
                                %1011 for multiply XI by 12
                                %1100 for multiply XI by 13
                                %1101 for multiply XI by 14
                                %1110 for multiply XI by 15
                                %1111 for multiply XI by 16

        XX   = XI/XO pin mode:  %00 for XI reads low, XO floats
                                %01 for XI input, XO floats
                                %10 for XI/XO crystal oscillator with 15pF internal loading and 1M-ohm feedback
                                %11 for XI/XO crystal oscillator with 30pF internal loading and 1M-ohm feedback

        SS   = Clock selector:  %00 for RCFAST (~20MHz)
                                %01 for RCSLOW (~20KHz)
                                %10 for XTAL (10MHz-20MHz)
                                %11 for PLL

Because the clock register is cleared to %0_0000_00_00 on reset, the chip starts up in RCFAST mode with both the crystal oscillator and the PLL disabled. Before switching to XTAL or PLL mode from RCFAST or RCSLOW, the crystal oscillator must be enabled and given 10ms to stabilize. The PLL stabilizes within 10us, so it can be enabled at the same time as the crystal oscillator. Once the crystal is stabilized, you can switch between XTAL and RCFAST/RCSLOW without any stability concerns. If the PLL is also enabled, you can switch freely among PLL, XTAL, and RCFAST/RCSLOW modes. You can change the PLL multiplier while being in PLL mode, but beware that some frequency overshoot and undershoot will occur as the PLL settles to its new frequency. This only poses a hardware problem if you are switching upwards and the resulting overshoot might exceed the speed limit of the chip.


COGID D
---------
COGID returns the number of the cog (0..7) into D.


COGSTOP D
---------
COGSTOP stops the cog specified in D (0..7).


LOCKNEW D
LOCKRET D
LOCKSET D
LOCKCLR D
---------
There are eight semaphore locks available in the chip which can be borrowed with LOCKNEW, returned with LOCKRET, set with LOCKSET, and cleared with LOCKCLR.

While any cog can set or clear any lock without using LOCKNEW or LOCKRET, LOCKNEW and LOCKRET are provided so that cog programs have a dynamic and simple means of acquiring and relinquishing the locks at run-time.

When a lock is set with LOCKSET, its state is set to 1 and its prior state is returned in C. LOCKCLR works the same way, but clears the lock's state to 0. By having the hub perform the atomic operation of setting/clearing and reporting the prior state, cogs can utilize locks to ensure that only one cog has permission to do something at once. If a lock starts out cleared and multiple cogs vie for the lock by doing a 'LOCKSET locknum wc', the cog to get C=0 back 'wins' and he can have exclusive access to some shared resource while the other cogs get C=1 back.
When the winning cog is done, he can do a 'LOCKCLR locknum' to clear the lock and give another cog the opportunity to get C=0 back. LOCKNEW returns the next available lock into D, with C=1 if no lock was free. LOCKRET frees the lock in D so that it can be checked out again by LOCKNEW. LOCKSET sets the lock in D and returns its prior state in C. LOCKCLR clears the lock in D and returns its prior state in C. instructions clocks ------------------------------------------------------------------------------------------------- 000011 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS COGINIT D,S 'launch cog at D, cog PTRA = S 1..9 000011 000 1 CCCC DDDDDDDDD 000000000 CLKSET D 'set clock to D 1..8 000011 001 1 CCCC DDDDDDDDD 000000001 COGID D 'get cog number into D 2..9 000011 000 1 CCCC DDDDDDDDD 000000011 COGSTOP D 'stop cog in D 1..8 000011 ZC1 1 CCCC DDDDDDDDD 000000100 LOCKNEW D 'get new lock into D, C = busy 2..9 000011 000 1 CCCC DDDDDDDDD 000000101 LOCKRET D 'return lock in D 1..8 000011 0C0 1 CCCC DDDDDDDDD 000000110 LOCKSET D 'set lock in D, C = prev state 1..9 000011 0C0 1 CCCC DDDDDDDDD 000000111 LOCKCLR D 'clear lock in D, C = prev state 1..9 ------------------------------------------------------------------------------------------------- INDIRECT REGISTERS ------------------ Each cog has two indirect registers: INDA and INDB. They are located at $1F6 and $1F7. By using INDA or INDB for D or S, the register pointed at by INDA or INDB is addressed. INDA and INDB each have three hidden 9-bit registers associated with them: the pointer, the bottom limit, and the top limit. The bottom and top limits are inclusive values which set automatic wrapping boundaries for the pointer. This way, circular buffers can be established within cog RAM and accessed using simple INDA/INDB references. SETINDA/SETINDB/SETINDS is used to set or adjust the pointer value(s) while forcing the associated bottom and top limit(s) to $000 and $1FF, respectively. 
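The pointer-and-limits behaviour of INDA/INDB described in this section can be sketched as a small Python model. Several details here are assumptions rather than documented facts: that a pointer sitting on the top limit wraps to the bottom limit on increment (and the reverse on decrement), that deltas applied by SETINDx wrap within 9 bits, and the initial state used in the constructor. The class and method names are hypothetical.

```python
REG_MASK = 0x1FF  # cog register addresses are 9 bits


class IndPointer:
    """Model of one indirect pointer (INDA or INDB) with its two limits."""

    def __init__(self):
        # Assumed initial state; the source does not state the startup values.
        self.ptr, self.bottom, self.top = 0, 0, REG_MASK

    def setind(self, addr):
        """SETINDx #addr: set the pointer, force limits to $000 and $1FF."""
        self.ptr, self.bottom, self.top = addr & REG_MASK, 0, REG_MASK

    def setind_delta(self, delta):
        """SETINDx ++/--delta: adjust the pointer, force limits to $000 and $1FF."""
        self.ptr = (self.ptr + delta) & REG_MASK
        self.bottom, self.top = 0, REG_MASK

    def fixind(self, terminal, initial):
        """FIXINDx #terminal,#initial: pointer = initial, limits = min/max of the two."""
        self.ptr = initial & REG_MASK
        self.bottom = min(terminal, initial) & REG_MASK
        self.top = max(terminal, initial) & REG_MASK

    def step(self, direction):
        """Apply INDx++ (direction=+1) or INDx-- (direction=-1) with assumed wrapping."""
        if direction > 0:
            self.ptr = self.bottom if self.ptr == self.top else (self.ptr + 1) & REG_MASK
        else:
            self.ptr = self.top if self.ptr == self.bottom else (self.ptr - 1) & REG_MASK
        return self.ptr


# Circular buffer over registers 8..15, as set up by 'FIXINDA #15,#8'.
inda = IndPointer()
inda.fixind(terminal=15, initial=8)
print([inda.step(+1) for _ in range(8)])
```

With limits of 8 and 15, eight increments bring the pointer back to its starting register, which is the circular-buffer use the text describes.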
FIXINDA/FIXINDB/FIXINDS sets the pointer(s) to an initial value, while setting the bottom limit(s) to the lower of the initial and terminal values and the top limit(s) to the higher.

Because indirect addressing must occur in the 2nd stage of the pipeline, long before C and Z are valid for conditional execution in the 4th stage, all instructions which use indirect addressing are forced to always execute. This frees the conditional bit field (CCCC) for specifying indirect operations. The top two bits of CCCC are used for indirect D and the bottom two bits are used for indirect S. If only D or S is indirect, the other two bits in CCCC are ignored.

Here is the INDA/INDB usage scheme which repurposes the CCCC field:

OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS
-------------------------------------
xxxxxx xxx x 00xx 111110110 xxxxxxxxx   D = INDA        'use INDA
xxxxxx xxx x 00xx 111110111 xxxxxxxxx   D = INDB        'use INDB
xxxxxx xxx x 01xx 111110110 xxxxxxxxx   D = INDA++      'use INDA, INDA += 1
xxxxxx xxx x 01xx 111110111 xxxxxxxxx   D = INDB++      'use INDB, INDB += 1
xxxxxx xxx x 10xx 111110110 xxxxxxxxx   D = INDA--      'use INDA, INDA -= 1
xxxxxx xxx x 10xx 111110111 xxxxxxxxx   D = INDB--      'use INDB, INDB -= 1
xxxxxx xxx x 11xx 111110110 xxxxxxxxx   D = ++INDA      'use INDA+1, INDA += 1
xxxxxx xxx x 11xx 111110111 xxxxxxxxx   D = ++INDB      'use INDB+1, INDB += 1

xxxxxx xxx 0 xx00 xxxxxxxxx 111110110   S = INDA        'use INDA
xxxxxx xxx 0 xx00 xxxxxxxxx 111110111   S = INDB        'use INDB
xxxxxx xxx 0 xx01 xxxxxxxxx 111110110   S = INDA++      'use INDA, INDA += 1
xxxxxx xxx 0 xx01 xxxxxxxxx 111110111   S = INDB++      'use INDB, INDB += 1
xxxxxx xxx 0 xx10 xxxxxxxxx 111110110   S = INDA--      'use INDA, INDA -= 1
xxxxxx xxx 0 xx10 xxxxxxxxx 111110111   S = INDB--      'use INDB, INDB -= 1
xxxxxx xxx 0 xx11 xxxxxxxxx 111110110   S = ++INDA      'use INDA+1, INDA += 1
xxxxxx xxx 0 xx11 xxxxxxxxx 111110111   S = ++INDB      'use INDB+1, INDB += 1

If both D and S are the same indirect register, the two 2-bit fields in CCCC are OR'd together to get the post-modifier effect:
101000 001 0 0011 111110110 111110110 MOV INDA,++INDA 'Move @INDA+1 into @INDA, INDA += 1 100000 001 0 1100 111110111 111110111 ADD ++INDB,INDB 'Add @INDB into @INDB+1, INDB += 1 Note that only '++INDx,INDx'/'INDx,++INDx' combinations can address different registers from the same INDx. Here are the instructions which are used to set the pointer and limit values for INDA and INDB: instructions * clocks ------------------------------------------------------------------------------------------------- 111000 000 0 0001 000000000 AAAAAAAAA SETINDA #addrA 1 111000 000 0 0011 000000000 AAAAAAAAA SETINDA ++/--deltA 1 111000 000 0 0100 BBBBBBBBB 000000000 SETINDB #addrB 1 111000 000 0 1100 BBBBBBBBB 000000000 SETINDB ++/--deltB 1 111000 000 0 0101 BBBBBBBBB AAAAAAAAA SETINDS #addrB,#addrA 1 111000 000 0 0111 BBBBBBBBB AAAAAAAAA SETINDS #addrB,++/--deltA 1 111000 000 0 1101 BBBBBBBBB AAAAAAAAA SETINDS ++/--deltB,#addrA 1 111000 000 0 1111 BBBBBBBBB AAAAAAAAA SETINDS ++/--deltB,++/--deltA 1 111001 000 0 0001 TTTTTTTTT IIIIIIIII FIXINDA #terminal,#initial 1 111001 000 0 0100 TTTTTTTTT IIIIIIIII FIXINDB #terminal,#initial 1 111001 000 0 0101 TTTTTTTTT IIIIIIIII FIXINDS #terminal,#initial 1 ------------------------------------------------------------------------------------------------- * addrA/addrB/terminal/initial = register address (0..511), deltA/deltB = 9-bit signed delta --256..++255 Examples: 111000 000 0 0001 000000000 000000101 SETINDA #5 'INDA = 5, bottom = 0, top = 511 111000 000 0 0011 000000000 000000011 SETINDA ++3 'INDA += 3, bottom = 0, top = 511 111000 000 0 1100 111111100 000000000 SETINDB --4 'INDB -= 4, bottom = 0, top = 511 111000 000 0 0111 000000111 000001000 SETINDS #7,++8 'INDB = 7, INDA += 8, bottoms = 0, tops = 511 111001 000 0 0001 000001111 000001000 FIXINDA #15,#8 'INDA = 8, bottom = 8, top = 15 111001 000 0 0100 000010000 000011111 FIXINDB #16,#31 'INDB = 31, bottom = 16, top = 31 111001 000 0 0101 001100011 000110010 FIXINDS #99,#50 'INDA/INDB = 
50, bottoms = 50, tops = 99 STACK RAM --------- Each cog has a 256-long stack RAM that is accessible via push and pop operations. Its contents are not initialized at either reset or cog startup. So, at cog startup, it will contain whatever it happened to power up with, or whatever was last written. There are two stack pointers called SPA and SPB which are used to address the stack memory. Aside from automatically incrementing and decrementing via pushes and pops, SPA and SPB can be set, modified, read back, and checked: SETSPA D/#n set SPA SETSPB D/#n set SPB ADDSPA D/#n add to SPA ADDSPB D/#n add to SPB SUBSPA D/#n subtract from SPA SUBSPB D/#n subtract from SPB GETSPA D get SPA, SPA==0 into Z, SPA.7 into C GETSPB D get SPB, SPB==0 into Z, SPB.7 into C GETSPD D get SPA minus SPB, SPA==SPB into Z, SPA 'execute some code SUBCNT ticks 'get CNTL minus ticks into ticks, took ticks-1 to execute 'Measure time using full 64 bits of CNT (single task) GETCNT ticks_low 'get CNT into {ticks_high, ticks_low} GETCNT ticks_high 'execute some code SUBCNT ticks_low 'get CNT minus {ticks_high, ticks_low} into {ticks_high, ticks_low} SUBCNT ticks_high ' took {ticks_high, ticks_low}-1 clocks to execute 'Do something for some time GETCNT ticks 'get CNTL ADD ticks,#500 'add 500 loop 'execute some code CMPCNT ticks WC 'check if 500 clocks have elapsed yet if_nc JMP #loop 'if not, loop 'Do something every Nth clock (multi-task) GETCNT ticks 'get CNTL loop ADD ticks,#500 'add 500 PASSCNT ticks 'wait for next 500th clock 'execute some code jmp #loop 'loop 'Do something every Nth clock (single-task) GETCNT ticks 'get CNTL ADD ticks,#500 'add initial 500 loop WAITCNT ticks,#500 'wait for next 500th clock, add next 500 'execute some code jmp #loop 'loop 'Wait for pins to equal a value, with time-out GETCNT ticks 'get CNTL ADD ticks,#200 'allow 200 clock cycles for WAITPEQ (CNTL target is last-stored value) WAITPEQ value,mask WC 'wait for (pins & mask) = value if_c JMP #timeout 'if C=1 then 
timeout occurred, else pin condition was met instructions clocks ---------------------------------------------------------------------------------------------------- 000011 ZC0 1 CCCC DDDDDDDDD 000001100 CMPCNT D 'compares D to CNTL, C = D > CNTL 1 000011 ZC1 1 CCCC DDDDDDDDD 000001100 SUBCNT D 'subtracts D from CNTL, then CNTH 1 000011 000 1 CCCC DDDDDDDDD 000001101 PASSCNT D 'loops until CNTL passes D 1* 000011 001 1 CCCC DDDDDDDDD 000001101 GETCNT D 'gets CNTL, then CNTH 1 111111 0CR I CCCC DDDDDDDDD SSSSSSSSS WAITCNT D,S 'wait for CNTL or CNT (WC), D += S ? 111111 110 I CCCC DDDDDDDDD SSSSSSSSS WAITPEQ D,S WC 'wait for (pins & S) = D, do timeout ? 111111 111 I CCCC DDDDDDDDD SSSSSSSSS WAITPNE D,S WC 'wait for (pins & S) <> D, do timeout ? ---------------------------------------------------------------------------------------------------- * 1 + number of other instructions in the pipeline (0..3) which belong to the executing task BRANCHES -------- As elaborated on in the pipeline section, there are both normal and delayed branching instructions. The normal branching instructions cancel any same-task instructions which are in the pipeline, causing the next instruction that executes in that task to be from the address that was branched to. The delayed branching instructions, intended only for single-task programs, do not cancel any pipelined instructions, allowing the three trailing instructions in the pipeline to execute before the branch appears to take effect. 
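The difference between normal and delayed branches can be illustrated with a toy single-task simulation. This Python sketch is not a pipeline model: it only assumes that a taken delayed branch lets exactly three trailing instructions run before the jump takes effect, while a taken normal branch jumps immediately. The tuple-based program format and the `run` function are hypothetical, and the program ends after the last instruction instead of looping forever.

```python
from collections import Counter

# (opcode, argument): MOV loads X, DJNZ decrements X and branches if non-zero,
# NOTP toggles a pin. The DJNZ target is an index into this list.
PROGRAM = [
    ("MOV", 100),   # X = 100
    ("DJNZ", 1),    # loop2: decrement X, branch back to itself if X <> 0
    ("NOTP", 0),    # toggle P0
    ("NOTP", 1),    # toggle P1
    ("NOTP", 2),    # toggle P2
    ("NOTP", 3),    # toggle P3
]


def run(program, delayed):
    """Count pin toggles when DJNZ acts as a delayed (DJNZD) or a normal branch."""
    x = 0
    toggles = Counter()
    pc = 0
    pending = None  # [target, trailing instructions still to execute]
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        taken_target = None
        if op == "MOV":
            x = arg
        elif op == "NOTP":
            toggles[arg] += 1
        elif op == "DJNZ":
            x -= 1
            if x != 0:
                taken_target = arg
        # An earlier delayed branch takes effect after its trailing instructions.
        if pending is not None:
            pending[1] -= 1
            if pending[1] == 0:
                pc = pending[0]
                pending = None
        if taken_target is not None:
            if delayed:
                pending = [taken_target, 3]
            else:
                pc = taken_target  # normal branch: trailing instructions cancelled
    return dict(toggles)


print(run(PROGRAM, delayed=True))
print(run(PROGRAM, delayed=False))
```

In the delayed case the three instructions after the branch become part of the loop body, so P0..P2 toggle on every pass; in the normal case they run only once, after the loop has finished.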
The advantage in using delayed branches is that they only take one clock, but careful programming is required to accommodate the three trailing instructions:

loop            MOV     X,#100          'toggle P0/P1/P2 100 times, then toggle P3

loop2           DJNZD   X,#loop2        'loop, delayed branch executes 3 trailing instructions
                NOTP    #0              'toggle P0
                NOTP    #1              'toggle P1
                NOTP    #2              'toggle P2

                NOTP    #3              'now toggle P3
                JMP     #loop           'do it again

In the branch instruction definitions below, only normal branches are shown, though any of them can be made into delayed branches by adding a 'D' to their mnemonic (e.g. JMP becomes JMPD).

The JMP (jump), CALL, and RET (return) instructions are specific cases of the JMPRET instruction. CALL works by simultaneously jumping to a labeled subroutine and storing the return address (the address after the CALL) into a RET instruction that has the same label as the subroutine, but with '_RET' at the end:

loop            CALL    #sub1           'call to sub1, store next address into bits 8..0 of sub1_ret
                CALL    #sub2           'call to sub2, store next address into bits 8..0 of sub2_ret
                JMP     #loop           'loop back to first call

sub1            NOTP    #0              'start of sub1 routine
sub1_ret        RET                     'return to caller (actually JMP #returnaddress)

sub2            NOTP    #1              'start of sub2 routine
sub2_ret        RET                     'return to caller (actually JMP #returnaddress)

Because the return address is stored in an actual instruction at the end of the subroutine, these kinds of calls cannot be recursive, unlike the stack RAM-based calls and returns which are elaborated on in the STACK RAM section.

The WZ and WC suffixes can be used with CALL/RET instructions to control flag updating.
For example, if you wish to call a subroutine and preserve the Z and/or C flags, you can add the WZ and/or WC suffixes to both the CALL and RET instructions to cause the flags to be initially saved on CALL and subsequently restored on RET: loop CMP a,b WZ,WC 'compare a to b, affect Z and C CALL #sub WZ,WC 'call to sub and save Z/C/PC into bits 10..0 of the RET IF_C_OR_Z JMP #loop 'loop if a =< b JMP #else 'else, branch sub GETP #0 WC 'get pin 0 into C (mess up C and Z) GETNP #1 WZ 'get pin 1 into Z SETPC #6 'set pin 6 to C SETPZ #7 'set pin 7 to Z sub_ret RET WZ,WC 'return to caller, restore Z/C/PC from bits 10..0 in RET Here are the discrete JMP/CALL/RET instructions and the general-case JMPRET instruction: JMP S - Jump to address in S[8..0] If WC then C = S[9] If WZ then Z = S[10] JMP #n - Jump to immediate 0..511 If WC then C = bit 9 of JMP instruction (in unused D field) If WZ then Z = bit 10 of JMP instruction (in unused D field) CALL #label - Jump to label which begins subroutine The assembler points the D field to the RET at label_RET PC+1 is written to D[8..0] (PC+4 for CALLD) If WC then C is written to D[9] If WZ then Z is written to D[10] D[31..11], plus D[10]/D[9] per WZ/WC, are preserved RET - Jump to bits 8..0 of RET instruction (assembled as JMP #0) If WC then C = bit 9 of RET instruction (in unused D field) If WZ then Z = bit 10 of RET instruction (in unused D field) JMPRET D,#n NR - Jump to immediate 0..511 (same as 'JMP #n' and 'RET') If WC then C = bit 9 of JMPRET instruction (in D field) If WZ then Z = bit 10 of JMPRET instruction (in D field) JMPRET D,S NR - Jump to address in S[8..0] (same as 'JMP S') If WC then C = S[9] If WZ then Z = S[10] JMPRET D,#n - Jump to immediate 0..511 (same as 'CALL #label') PC+1 is written to D[8..0] (PC+4 for JMPRETD) If WC then C is written to D[9], else D[9] same If WZ then Z is written to D[10], else D[10] same D[31..11] are preserved JMPRET D,S - Jump to address in S[8..0] PC+1 is written to D[8..0] (PC+4 for 
JMPRETD) If WC then C is written to D[9] and reloaded from S[9] If WZ then Z is written to D[10] and reloaded from S[10] D[31..11], and D[10]/D[9] per WZ/WC, are preserved TASKSW - Short for 'JMPRET INDA,++INDA WZ,WC' For round-robin switching among threaded tasks Use FIXINDA to set up a ring of Z/C/PC registers Use with register remapping for multiple program instances Instructions trailing TASKSWD are in the next thread instructions clocks ------------------------------------------------------------------------------------------------- 000111 ZC0 0 CCCC 000000000 SSSSSSSSS JMP S 'jump to S 4 * 000111 ZC0 1 CCCC 000000000 nnnnnnnnn JMP #n 'jump to 0..511 4 * 000111 ZC0 1 CCCC 000000000 000000000 RET 'return from subroutine 4 * 000111 ZC1 1 CCCC DDDDDDDDD LLLLLLLLL CALL #label 'call subroutine 4 * 000111 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS JMPRET D,S 'jump to S, store return in D 4 * 000111 ZCR 1 CCCC DDDDDDDDD nnnnnnnnn JMPRET D,#n 'jump to 0..511, store return in D 4 * 000111 111 0 0011 111110110 111110110 TASKSW 'JMPRET INDA,++INDA WZ,WC 4 * 010111 ZC0 0 CCCC 000000000 SSSSSSSSS JMPD S 'jump to S 1 010111 ZC0 1 CCCC 000000000 nnnnnnnnn JMPD #n 'jump to 0..511 1 010111 ZC0 1 CCCC 000000000 000000000 RETD 'return from subroutine 1 010111 ZC1 1 CCCC DDDDDDDDD LLLLLLLLL CALLD #label 'call subroutine 1 010111 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS JMPRETD D,S 'jump to S, store return in D 1 010111 ZCR 1 CCCC DDDDDDDDD nnnnnnnnn JMPRETD D,#n 'jump to 0..511, store return in D 1 010111 111 0 0011 111110110 111110110 TASKSWD 'JMPRETD INDA,++INDA WZ,WC 1 ------------------------------------------------------------------------------------------------- * 4 clocks for single-task, actual count is 1 + number of same-task instructions in pipeline Here are the conditional branches: IJZ D,S/#n - Increment D and Jump to S/#n if result is zero IJNZ D,S/#n - Increment D and Jump to S/#n if result is not zero DJZ D,S/#n - Decrement D and Jump to S/#n if result is zero DJNZ D,S/#n - Decrement D 
and Jump to S/#n if result is not zero TJZ D,S/#n - Jump to S/#n if D is zero TJNZ D,S/#n - Jump to S/#n if D is not zero JP D,S/#n - Jump to S/#n if pin D reads high JNP D,S/#n - Jump to S/#n if pin D reads low instructions clocks ------------------------------------------------------------------------------------------------- 111100 00R I CCCC DDDDDDDDD SSSSSSSSS IJZ D,S 'increment D and jump if zero 4 * 111100 10R I CCCC DDDDDDDDD SSSSSSSSS IJNZ D,S 'increment D and jump if not zero 4 * 111101 00R I CCCC DDDDDDDDD SSSSSSSSS DJZ D,S 'decrement D and jump if zero 4 * 111101 10R I CCCC DDDDDDDDD SSSSSSSSS DJNZ D,S 'decrement D and jump if not zero 4 * 111110 000 I CCCC DDDDDDDDD SSSSSSSSS TJZ D,S 'test D and jump if zero 4 * 111110 100 I CCCC DDDDDDDDD SSSSSSSSS TJNZ D,S 'test D and jump if not zero 4 * 111110 001 I CCCC DDDDDDDDD SSSSSSSSS JP D,S 'jump if pin D high 4 * 111110 101 I CCCC DDDDDDDDD SSSSSSSSS JNP D,S 'jump if pin D low 4 * 111100 01R I CCCC DDDDDDDDD SSSSSSSSS IJZD D,S 'increment D and jump if zero 1 111100 11R I CCCC DDDDDDDDD SSSSSSSSS IJNZD D,S 'increment D and jump if not zero 1 111101 01R I CCCC DDDDDDDDD SSSSSSSSS DJZD D,S 'decrement D and jump if zero 1 111101 11R I CCCC DDDDDDDDD SSSSSSSSS DJNZD D,S 'decrement D and jump if not zero 1 111110 010 I CCCC DDDDDDDDD SSSSSSSSS TJZD D,S 'test D and jump if zero 1 111110 110 I CCCC DDDDDDDDD SSSSSSSSS TJNZD D,S 'test D and jump if not zero 1 111110 011 I CCCC DDDDDDDDD SSSSSSSSS JPD D,S 'jump if pin D high 1 111110 111 I CCCC DDDDDDDDD SSSSSSSSS JNPD D,S 'jump if pin D low 1 ------------------------------------------------------------------------------------------------- * 4 clocks for single-task, actual count is 1 + number of same-task instructions in pipeline COUNTERS - this section is not done yet!!! -------- Each cog has two configurable counters. They are named CTRA and CTRB and are accessed by thirteen instructions each. 
The instructions which end in "A" are for CTRA and those that end in "B" are for CTRB. For brevity, only CTRA instructions are used in the definitions and examples that follow. GETPHSA D - Get PHSA into D GETPHZA D - Get PHSA into D, simultaneously clear PHSA to 0 GETCOSA D - Get COSA into D GETSINA D - Get SINA into D SETCTRA D/#n - Set CTRA configuration SETWAVA D/#n - Set WAVA SETFRQA D/#n - Set FRQA SETPHSA D/#n - Set PHSA ADDPHSA D/#n - Add to PHSA SUBPHSA D/#n - Subtract from PHSA SYNCTRA - Wait for PHSA to roll over POLCTRA WC - Check if PHSA has rolled over (C=1 if rolled over) CAPCTRA - Capture CTRA accumulators into COSA and SINA Modes: (QDR = PHS[31] XNOR PHS[30], or PHS[31] delayed by 90 degrees) Off Mode ------------------------------------------------------------------------------- %00000 = Counter off (initial state after cog start) NCO Modes ------------------------------------------------------------------------------- %00001 = NCO output + video PLL mode, PLL output = PHS[31] (reference signal) %00010 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 32 %00011 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 16 %00100 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 8 %00101 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 4 %00110 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 2 %00111 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 1 %01000 = NCO output DUAL Modes ------------------------------------------------------------------------------- %000_01001 = dual NCO outputs + dual COUNT_LOWS inputs %001_01001 = dual NCO outputs + dual COUNT_HIGHS inputs %010_01001 = dual NCO outputs + dual COUNT_NEGATIVE_EDGES inputs %011_01001 = dual NCO outputs + dual COUNT_POSITIVE_EDGES inputs %100_01001 = dual NCO outputs + dual TIME_LOWS inputs %101_01001 = dual NCO outputs + dual TIME_HIGHS inputs %110_01001 = dual NCO outputs + 
dual TIME_NEGATIVE_EDGES inputs %111_01001 = dual NCO outputs + dual TIME_POSITIVE_EDGES inputs %000_01010 = dual DUTY outputs + dual COUNT_LOWS inputs %001_01010 = dual DUTY outputs + dual COUNT_HIGHS inputs %010_01010 = dual DUTY outputs + dual COUNT_NEGATIVE_EDGES inputs %011_01010 = dual DUTY outputs + dual COUNT_POSITIVE_EDGES inputs %100_01010 = dual DUTY outputs + dual TIME_LOWS inputs %101_01010 = dual DUTY outputs + dual TIME_HIGHS inputs %110_01010 = dual DUTY outputs + dual TIME_NEGATIVE_EDGES inputs %111_01010 = dual DUTY outputs + dual TIME_POSITIVE_EDGES inputs %000_01011 = dual PWM outputs + dual COUNT_LOWS inputs %001_01011 = dual PWM outputs + dual COUNT_HIGHS inputs %010_01011 = dual PWM outputs + dual COUNT_NEGATIVE_EDGES inputs %011_01011 = dual PWM outputs + dual COUNT_POSITIVE_EDGES inputs %100_01011 = dual PWM outputs + dual TIME_LOWS inputs %101_01011 = dual PWM outputs + dual TIME_HIGHS inputs %110_01011 = dual PWM outputs + dual TIME_NEGATIVE_EDGES inputs %111_01011 = dual PWM outputs + dual TIME_POSITIVE_EDGES inputs WAVE modes ------------------------------------------------------------------------------- %01100 = dual SQR_WAVE output + GOERTZEL input %01101 = dual SAW_WAVE output + GOERTZEL input %01110 = dual TRI_WAVE output + GOERTZEL input %01111 = dual SIN_WAVE output + GOERTZEL input In the WAVE modes, FRQ is added into PHS on every clock cycle. The top nine bits of PHS are used to drive sine and cosine lookup tables which are used for sine output functions and GOERTZEL computations. While the sine/cosine output functions are the most useful for signal processing, triangle-, sawtooth-, and square-wave output functions are also selectable, being derived from the top nine bits of PHS, as well. The WAVE modes output both parallel DAC signals and duty-modulated pin signals. All output signals are nine bits in base quality with an additional nine sub-bits of dithering to maintain base quality after attenuative scaling. 
The dual outputs differ only in phase and are set up by the WAV register:

WAV register in WAVE modes (can be changed by SETWAVA/SETWAVB instruction)
-------------------------------------------------------------------------------
%PPPPPPPPP_xxxxx_TTTTTTTTT_AAAAAAAAA

    PPPPPPPPP = phase advance for OUTA (0 to 511/512 revolutions)
    xxxxx     = unused for WAVE modes
    TTTTTTTTT = offset for OUTA and OUTB
    AAAAAAAAA = amplitude for OUTA and OUTB

Initial value after cog start: %010000000_00000_100000000_111111111

    010000000 = 90-degree phase advance for GOERTZEL use (OUTA=cosine, OUTB=sine)
    00000     = unused
    100000000 = mid-point offset (allows maximum amplitude)
    111111111 = maximum amplitude

The GOERTZEL computation works as follows, on every clock:

    Nine-bit sine and cosine values are looked up using the top nine bits of PHS.
    The sine and cosine values are negated if INA is 0, else they remain the same.
    The sine and cosine values are added into separate sine and cosine accumulators.

This process measures the energy content of INA at the frequency of PHS rollover. To make this work, the INA pin should be configured for delta-sigma ADC mode, so that it streams back 1's and 0's that ratiometrically represent the voltage of the I/O pin.

To make a GOERTZEL measurement:

    - The top nine bits of WAV should be set to %010000000 for proper cosine lookup.
    - FRQ must be set to generate the frequency of interest in PHS rollovers (SETFRQA).
    - PHS and the accumulators should be cleared to 0 (SETPHSA #0, then CAPCTRA).
    - Some number of complete PHS rollovers must be waited for (SYNCTRA/POLCTRA).
    - The accumulators must be captured and read (CAPCTRA + GETCOSA + GETSINA).
    - The length of the accumulators' hypotenuse indicates signal strength, and its angle indicates phase.
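As a rough illustration of the per-clock GOERTZEL accumulation and the measurement steps listed above, here is a small Python model. It is a sketch under assumptions, not a description of the actual circuit: the lookup tables are assumed to hold signed values scaled to +/-255, the lookup is assumed to use PHS before FRQ is added, and a plain square wave stands in for the delta-sigma bitstream that a real INA pin would deliver. The names goertzel and square_wave_bits are hypothetical.

```python
import math

TABLE_SIZE = 512      # the top nine bits of PHS select a table entry
AMPLITUDE = 255       # assumed table scaling (not stated in the text)

SIN_TABLE = [round(AMPLITUDE * math.sin(2 * math.pi * i / TABLE_SIZE))
             for i in range(TABLE_SIZE)]
COS_TABLE = [round(AMPLITUDE * math.cos(2 * math.pi * i / TABLE_SIZE))
             for i in range(TABLE_SIZE)]


def goertzel(ina_bits, frq):
    """Model the cosine/sine accumulators for a stream of INA samples, one per clock."""
    phs = 0                                  # as after SETPHSA #0
    cos_acc = 0                              # accumulators start cleared
    sin_acc = 0
    for bit in ina_bits:
        index = phs >> 23                    # top nine bits of the 32-bit PHS
        sign = 1 if bit else -1              # values are negated when INA is 0
        cos_acc += sign * COS_TABLE[index]
        sin_acc += sign * SIN_TABLE[index]
        phs = (phs + frq) & 0xFFFFFFFF       # FRQ is added into PHS on every clock
    return cos_acc, sin_acc


def square_wave_bits(period, count):
    """Hypothetical 1-bit test input: high for the first half of each period."""
    return [1 if (n % period) < period // 2 else 0 for n in range(count)]


FRQ = 1 << 26          # PHS rolls over every 64 clocks
CLOCKS = 64 * 32       # 32 complete PHS rollovers

# Input at the measured frequency, then input at four times that frequency.
cos_on, sin_on = goertzel(square_wave_bits(64, CLOCKS), FRQ)
cos_off, sin_off = goertzel(square_wave_bits(16, CLOCKS), FRQ)

strength_on = math.hypot(cos_on, sin_on)     # length of the hypotenuse
strength_off = math.hypot(cos_off, sin_off)
phase_on = math.atan2(sin_on, cos_on)        # angle of the hypotenuse, in radians

print(strength_on, strength_off, phase_on)
```

In this model the input that matches the PHS rollover frequency produces a much longer hypotenuse than the off-frequency input, which is the selectivity the measurement relies on.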
By making swept FRQ measurements in a closed loop, where OUTA is used to output a reference frequency of known phase to stimulate a system, and INA receives a signal back that is somehow coupled to OUTA, you can determine things such as spectral response, resonant frequency, and frequency vs. phase of a system. The more PHS rollovers in a measurement, the more selective the result will be. For open-loop measurements, this means tighter bandwidth. For closed-loop measurements, the angle of the hypotenuse becomes meaningful. The QARCTAN instruction can translate the sine and cosine accumulations into power and phase values.

LOGIC Modes
-------------------------------------------------------------------------------
%10000 = LOGIC_A_POSEDGE     input    INA & !INA previous
%10001 = LOGIC_NA_AND_NB     input    !INA & !INB
%10010 = LOGIC_A_AND_NB      input    INA & !INB
%10011 = LOGIC_NB            input    !INB
%10100 = LOGIC_NA_AND_B      input    !INA & INB
%10101 = LOGIC_NA            input    !INA
%10110 = LOGIC_A_NE_B        input    INA <> INB
%10111 = LOGIC_NA_OR_NB      input    !INA | !INB
%11000 = LOGIC_A_AND_B       input    INA & INB
%11001 = LOGIC_A_EQ_B        input    INA == INB
%11010 = LOGIC_A             input    INA
%11011 = LOGIC_A_OR_NB       input    INA | !INB
%11100 = LOGIC_B             input    INB
%11101 = LOGIC_NA_OR_B       input    !INA | INB
%11110 = LOGIC_A_OR_B        input    INA | INB
%11111 = LOGIC_ENCODER       input    INA, INB encoder

OUTA = ADD signal (condition met or LOGIC_ENCODER forward step)
OUTB = SUB signal (LOGIC_ENCODER reverse step)

In the LOGIC modes, FRQ is conditionally added to PHS on each clock cycle that meets that mode's requirement. In the case of the LOGIC_ENCODER mode, FRQ may be added to or subtracted from PHS when a half-step is registered. OUTA and OUTB reflect the ADD and SUB states for each cycle, and are more likely to be useful to other CTRs than as signals sent to output pins.


DACS
----
Each cog outputs 4 channels of DAC data, named DAC0..DAC3. These DAC data channels can be set to values in software or actively driven from CTRA/CTRB or VID.
In all cases but VID, the source data is 18 bits and is dithered on every clock cycle for 9-bit DAC output. In the case of VID, the source data is just 9 bits, so no dithering is performed. Each I/O pin has a 75-ohm 9-bit DAC which can be configured using CFGPINS to output a fixed DACx channel from any cog. Every cog's DAC0..DAC3 are available, in that sequence, to P0..P3, then to the next four pins, and so on, as shown below: PortA PortB PortC DACx -------------------------------- P0 P32 P64 DAC0 P1 P33 P65 DAC1 P2 P34 P66 DAC2 P3 P35 P67 DAC3 P4 P36 P68 DAC0 P5 P37 P69 DAC1 P6 P38 P70 DAC2 P7 P39 P71 DAC3 P8 P40 P72 DAC0 P9 P41 P73 DAC1 P10 P42 P74 DAC2 P11 P43 P75 DAC3 P12 P44 P76 DAC0 P13 P45 P77 DAC1 P14 P46 P78 DAC2 P15 P47 P79 DAC3 P16 P48 P80 DAC0 P17 P49 P81 DAC1 P18 P50 P82 DAC2 P19 P51 P83 DAC3 P20 P52 P84 DAC0 P21 P53 P85 DAC1 P22 P54 P86 DAC2 P23 P55 P87 DAC3 P24 P56 P88 DAC0 P25 P57 P89 DAC1 P26 P58 P90 DAC2 P27 P59 P91 DAC3 P28 P60 P92 DAC0 P29 P61 P93 DAC1 P30 P62 P94 DAC2 P31 P63 P95 DAC3 Here are the instructions which configure DAC0..DAC3: CFGDAC0 D/#n - Configure DAC0 %00 = Software controlled (default) %01 = CTRA SIGA %10 = CTRA SIGA + CTRB SIGA %11 = VID SIG0 CFGDAC1 D/#n - Configure DAC1 %00 = Software controlled (default) %01 = CTRA SIGB %10 = CTRA SIGB + CTRB SIGB %11 = VID SIG1 CFGDAC2 D/#n - Configure DAC2 %00 = Software controlled (default) %01 = CTRB SIGA %10 = CTRA SIGA + CTRB SIGA %11 = VID SIG2 CFGDAC3 D/#n - Configure DAC3 %00 = Software controlled (default) %01 = CTRB SIGB %10 = CTRA SIGB + CTRB SIGB %11 = VID SIG3 CFGDACS D/#n - Configure DAC3..DAC0 from four 2-bit fields: %33_22_11_00 For configurations %00..%10, the data sources are 18 bits wide, with the 9 lower bits being dithered by a 32-bit LFSR to realize more DAC resolution. This improves dynamic range, but introduces a white noise of one step in amplitude in the 9-bit DAC output. 
As dynamic signals get smaller in amplitude, they appear to sink into the dither noise, but actually remain very high-Q, as the dither noise is very low-Q. For configuration %11 (VID), the data is a straight 9 bits with no dithering, as pixels could only be dithered once per frame, resulting only in visible luminance noise, which is not desirable. The dithering works by taking nine fixed bits from a 32-bit LFSR and sign-extending them to 18 bits. This yields a pseudo-random value ranging from %111111111_100000000 (negative) to %000000000_011111111 (positive) on every clock cycle. When added to the 18-bit source data, the lower 9 bits of source data are realized as a proportional toggling between two adjacent values in the top 9 bits of the sum, which form the DAC output data. It will take at least 512 (2^9) clocks for the DAC output to average to the intended 18-bit source value, assuming source data is static. On cog start, all configurations are cleared to %00 and the source values are set to %000000000_100000000, which is effectively zero, since dithering will never cause an output step toggle when the nine lower source bits are %100000000: source data %XXXXXXXXX_100000000 + minimum dither %111111111_100000000 -------------------- = %XXXXXXXXX_000000000 (top 9 bits are unchanged) source data %XXXXXXXXX_100000000 + maximum dither %000000000_011111111 -------------------- = %XXXXXXXXX_111111111 (top 9 bits are unchanged) Here are the instructions which set DAC0..DAC3 source values in software: SETDAC0 #n - Set DAC0 to %nnnnnnnnn_100000000, force configuration to %00 SETDAC0 D - Set DAC0 to D[31..14], force configuration to %00 * SETDAC1 #n - Set DAC1 to %nnnnnnnnn_100000000, force configuration to %00 SETDAC1 D - Set DAC1 to D[31..14], force configuration to %00 * SETDAC2 #n - Set DAC2 to %nnnnnnnnn_100000000, force configuration to %00 SETDAC2 D - Set DAC2 to D[31..14], force configuration to %00 * SETDAC3 #n - Set DAC3 to %nnnnnnnnn_100000000, force 
configuration to %00
SETDAC3 D       - Set DAC3 to D[31..14], force configuration to %00 *

SETDACS #n      - Set DAC3..DAC0 to %nnnnnnnnn_100000000
                  Force DAC3..DAC0 configurations to %00

SETDACS D       - Set DAC3 to %dddddddd0_100000000, where dddddddd is D[31..24]
                  Set DAC2 to %dddddddd0_100000000, where dddddddd is D[23..16]
                  Set DAC1 to %dddddddd0_100000000, where dddddddd is D[15..8]
                  Set DAC0 to %dddddddd0_100000000, where dddddddd is D[7..0]
                  Force DAC3..DAC0 configurations to %00

* Be aware when using SETDACx D, that if D < $00400000 or D > $FFC03FFF, full-scale toggling will occur, as the dither addition will cause wrapping. For ground-based DAC output, you can add $00400000 to each output sample to prevent this from happening.


VIDEO
-----
Each cog has a video generator (VID) that can stream pixel data and perform colorspace conversion and modulation, so that final video signals can be output to the 75-ohm DACs on the I/O pins. Pixel streaming, colorspace conversion, modulation, DAC channel driving, and DAC pin updating are all performed in a pipelined fashion on each cycle of VID's dot clock.

VID gets its dot clock from CTRA's PLL, so CTRA must be configured for PLL operation in order for VID to operate. The DACx channels must be configured for video output by using CFGDACx. To set all DACx channels to video, do 'CFGDACS #%11_11_11_11'. The I/O pins which will output the DACx channels must be configured to do so via CFGPINS.
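To make the %33_22_11_00 layout of the CFGDACS operand concrete, here is a small Python helper that packs four 2-bit configurations into one value. It is only an illustrative sketch; the function and constant names are hypothetical and are not part of any Propeller tool.

```python
# 2-bit DAC channel configurations, as listed for CFGDAC0..CFGDAC3
CFG_SOFTWARE = 0b00   # software controlled (default)
CFG_CTR = 0b01        # a single counter signal
CFG_CTR_SUM = 0b10    # sum of a CTRA signal and a CTRB signal
CFG_VID = 0b11        # VID SIGx


def cfgdacs_operand(dac3, dac2, dac1, dac0):
    """Pack four 2-bit fields into the %33_22_11_00 CFGDACS layout."""
    fields = (dac3, dac2, dac1, dac0)
    if any(not 0 <= field <= 3 for field in fields):
        raise ValueError("each DAC configuration must be in 0..3")
    value = 0
    for field in fields:              # DAC3 ends up in the top two bits
        value = (value << 2) | field
    return value


all_video = cfgdacs_operand(CFG_VID, CFG_VID, CFG_VID, CFG_VID)
print(f"CFGDACS #%{all_video:08b}")
```

Setting every field to CFG_VID gives the same operand as the 'CFGDACS #%11_11_11_11' example above.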
To turn on VID and configure its DAC channel outputs, the SETVID instruction is used: SETVID D/#n - Set video configuration register (VCFG) %00xx = off (default) SIG3 SIG2 SIG1 SIG0 ---------------------------- %01xx = SDTV/HDTV/VGA Y_R I_G Q_B SYN %10xx = NTSC/PAL S-VIDEO YIQ YIQ _IQ Y__ %11xx = NTSC/PAL COMPOSITE YIQ YIQ YIQ YIQ %xx0x = zero-extend Y/I/Q coefficients for VGA colorspace (allows +$80, or '1.0') %xx1x = sign-extend Y/I/Q coefficients for NTSC/PAL/SDTV/HDTV colorspace %xxx0 = positive VGA sync on SYN / positive modulation phase %xxx1 = negative VGA sync on SYN / negative modulation phase (used in PAL video) Before any meaningful video signals can be output, you must set the colorspace coefficients and offset levels, which are each 8 bits: SETVIDY D/#n - Set Y_R offset level and RGB colorspace coefficients: $YO_YR_YG_YB SETVIDI D/#n - Set I_G offset level and RGB colorspace coefficients: $IO_IR_IG_IB SETVIDQ D/#n - Set Q_B offset level and RGB colorspace coefficients: $QO_QR_QG_QB All pixels are internally handled by VID as 2:8:8:8 bit SYNC:R:G:B data. Colorspace conversion is performed as sum-of-products calculations on the R:G:B pixel data and the colorspace coefficients, yielding Y, I, and Q components: Where R, G, B are 8-bit pixel color components and Y, I, Q are 9-bit sums (MOD 512): Y = R*YR/64 + G*YG/64 + B*YB/64 Where YR, YG, YB are 8-bit Y coefficients I = R*IR/64 + G*IG/64 + B*IB/64 Where IR, IG, IB are 8-bit I coefficients Q = R*QR/64 + G*QG/64 + B*QB/64 Where QR, QG, QB are 8-bit Q coefficients For outputs Y_R, I_G, and Q_B, offset levels are added to the Y, I, and Q components to properly position the final signals for SDTV/HDTV. In the case of VGA outputs, the offset levels are set to 0, since they are ground-based. For modulated outputs YIQ and _IQ, the I and Q components, treated as (I,Q), are rotated around (0,0) by an angle that steps 1/16th of a revolution on each dot clock, yielding Q'. 
In the case of YIQ output, the Y component (luma) and Q' (chroma) are added to form a composite video signal. In the case of _IQ output, an offset level is added to Q' to form an s-video chroma signal. For Y__ output, the Y component (luma) is output alone to form an s-video luma signal. For sync 'pixels', bit 24 or 25 is set in the pixel word and various formulas are used for generating the different output signals. When less than 32 bits are expressed per pixel, the SYNC bits will be %00. DAC channel outputs per pixel data input (outputs are 9 bits each, MOD 512) ------------------------------------------------------------------------------------ Y_R %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y component/vga pixel %x1_0xxxxxxx_xxxxxxxx_xxxxxxxx = YO*2 component/vga black %x1_1xxxxxxx_xxxxxxxx_SSSSSSSS = YO*2 + SSSSSSSS*2 component sync I_G %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = IO*2 + I component/vga pixel %x1_x0xxxxxx_xxxxxxxx_xxxxxxxx = IO*2 component/vga black %x1_x1xxxxxx_xxxxxxxx_SSSSSSSS = IO*2 + SSSSSSSS*2 component sync Q_B %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = QO*2 + Q component/vga pixel %x1_xx0xxxxx_xxxxxxxx_xxxxxxxx = QO*2 component/vga black %x1_xx1xxxxx_xxxxxxxx_SSSSSSSS = QO*2 + SSSSSSSS*2 component sync SYN %x0_xxxxxxxx_xxxxxxxx_xxxxxxxx = VCFG[0]*511 vga sync unasserted %x1_xxxxxxxx_xxxxxxxx_xxxxxxxx = !VCFG[0]*511 vga sync asserted Y__ %00_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y s-video luma pixel %01_xxxxxxxx_xxxxxxxx_xxxxxxxx = IO*2 s-video luma sync high %1x_xxxxxxxx_xxxxxxxx_xxxxxxxx = 0 s-video luma sync low _IQ %xx_xxxxxxxx_xxxxxxxx_xxxxxxxx = QO*2 + Q' s-video chroma YIQ %00_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y + Q' composite pixel %01_xxxxxxxx_xxxxxxxx_xxxxxxxx = IO*2 + Q' composite sync high %1x_xxxxxxxx_xxxxxxxx_xxxxxxxx = Q' composite sync low Below are some common colorspace coefficient sets. Note that these values are normalized to 1. 
In the sum-of-products calculations, 128 is equal to 1, so the values below should all be multiplied by 128 to get the proper 8-bit values for usage as coefficients. In practice, the values will need to be scaled down so that under 75-ohm load, they will peak at 1.0V or 0.7V (not 1.65V, which is 3.3V/2). This scaling will compromise DAC span by ~39%..~58%, leaving you with a still-sufficient ~8 bits of DAC resolution. However, if you'd like to keep DAC span maximal, you may leave the coefficients as originally computed and achieve the proper voltage under load by using external resistors, being sure to maintain 75 ohms source impedance. coefficient positions ----------------------- YR YG YB IR IG IB QR QG QB ----------------------- RGB (VGA) VCFG[1]=0 ----------------------- 1 0 0 R sums to 1 0 1 0 G sums to 1 0 0 1 B sums to 1 ----------------------- YPbPr (HDTV) VCFG[1]=1 x128 ----------------------- ------------- +.213 +.715 +.072 Y sums to 1 +27 +92 +9 -.115 -.385 +.500 Pb sums to 0 -15 -49 +64 +.500 -.454 -.046 Pr sums to 0 +64 -58 -6 ----------------------- YPbPr (SDTV) VCFG[1]=1 ----------------------- +.299 +.587 +.114 Y sums to 1 -.169 -.331 +.500 Pb sums to 0 +.500 -.419 -.081 Pr sums to 0 ----------------------- YIQ (NTSC) VCFG[1]=1 ----------------------- +.299 +.587 +.114 Y sums to 1 +.596 -.274 -.322 I sums to 0 * +.212 -.523 +.311 Q sums to 0 * ----------------------- YUV (PAL) VCFG[1]=1 ----------------------- +.299 +.587 +.114 Y sums to 1 -.147 -.289 +.436 U sums to 0 * +.615 -.515 -.100 V sums to 0 * ----------------------- * These three coefficients must be scaled by 0.608 to pre-compensate for CORDIC rotator expansion which will occur in the video modulator. Once VID is configured, WAITVID instructions are used to issue contiguous commands which keep the pixel streamer busy: WAITVID --> pixel streamer --> colorspace/modulator --> DACx signals --> I/O pins VID double-buffers WAITVID commands to relax WAITVID timing requirements. 
In case you don't want to commit to a WAITVID, which will stall the instruction pipeline until VID is ready for another command, you can use the POLVID instruction to test whether or not VID is ready for another WAITVID, in which case a subsequent WAITVID will take only one clock: POLVID WC - Check if VID ready for another WAITVID, C=1 if ready Here is the WAITVID instruction: WAITVID D,S/#n - Wait for VID ready, then give next command via D and S When WAITVID executes, the D and S values are captured by VID and used for the duration of the command. The D operand in WAITVID has four fields: %AAAAAAAA_MMMM_PPPPPPP_CCCCCCCCCCCCC %AAAAAAAA = stack RAM base address for pixel lookup (0..255) %MMMM = pixel mode (0..15), elaborated below %PPPPPPP = number minus 1 of dot clocks per pixel (0..127 --> 1..128) %CCCCCCCCCCCCC = number minus 1 of dot clocks in WAITVID (0..8191 --> 1..8192) The D operand's %MMMM field determines which pixel mode will be used for the WAITVID and what the S operand will be used for: %0000 = LIT_SRGB26 - S is used as a literal 2:8:8:8 pixel. Only the %CCCCCCCCCCCCC bits of D are used (all other bits can be 0). %0001 = CLU1_SRGB26 - 32 1-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM %0010 = CLU2_SRGB26 - 16 2-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM %0011 = CLU4_SRGB26 - 8 4-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM %0100 = CLU8_SRGB26 - 4 8-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM %0101 = CLU8_RGB15 * - 4 8-bit offsets in S lookup 0:5:5:5 pixel words in stack RAM %0110 = CLU8_RGB16 * - 4 8-bit offsets in S lookup 0:5:6:5 pixel words in stack RAM The CLUx modes capture S, using its 1/2/4/8-bit fields, lowest field first, as offsets for looking up pixels in stack RAM, starting at %AAAAAAAA. Upon completion of each pixel, the next higher bit field is used, with the highest field repeating. 
For CLU1_SRGB26..CLU8_SRGB26, the 1/2/4/8-bit fields are used as long offsets into stack RAM, yielding 2:8:8:8 pixel data. For CLU8_RGB15 and CLU8_RGB16, bits 7..1 of each 8-bit field are used as the long offset, while bit 0 selects the low/high word containing the 0:5:5:5 or 0:5:6:5 pixel data.

%0111 = STR1_RGB9 *     - 1-bit pixels streamed from stack RAM select between 0:3:3:3
                          colors in S[17..9] and S[26..18]. The stream start address in
                          stack RAM is %AAAAAAAA plus S[7..0], with S[31..27] selecting
                          the starting bit.

%1000 = STR4_RGBI4 *    - 4-bit pixels are streamed from stack RAM starting at %AAAAAAAA
                          plus S[7:0], with S[31..29] selecting the starting nibble.
                          The pixels are colored as:

                              %0000 = black
                              %0001 = dark grey
                              %0010 = dark blue
                              %0011 = bright blue
                              %0100 = dark green
                              %0101 = bright green
                              %0110 = dark cyan
                              %0111 = bright cyan
                              %1000 = dark red
                              %1001 = bright red
                              %1010 = dark magenta
                              %1011 = bright magenta
                              %1100 = olive
                              %1101 = yellow
                              %1110 = light grey
                              %1111 = white

%1001 = STR4_LUMA4 *    - 4-bit pixels are streamed from stack RAM starting at %AAAAAAAA
                          plus S[7:0], with S[31..29] selecting the starting nibble.
                          The pixels are used as brightness values for colors determined
                          by S[11..9]:

                              %000 = black..orange
                              %001 = black..blue
                              %010 = black..green
                              %011 = black..cyan
                              %100 = black..red
                              %101 = black..magenta
                              %110 = black..yellow
                              %111 = black..white

%1010 = STR8_RGBI8 *    - 8-bit pixels are streamed from stack RAM starting at %AAAAAAAA
                          plus S[7:0], with S[31..30] selecting the starting byte.
                          The pixels are colored as:

                              $00..$1F = black..orange
                              $20..$3F = black..blue
                              $40..$5F = black..green
                              $60..$7F = black..cyan
                              $80..$9F = black..red
                              $A0..$BF = black..magenta
                              $C0..$DF = black..yellow
                              $E0..$FF = black..white

%1011 = STR8_LUMA8 *    - 8-bit pixels are streamed from stack RAM starting at %AAAAAAAA
                          plus S[7:0], with S[31..30] selecting the starting byte.
The pixels are used as brightness values for colors determined by S[11..9]: %000 = black..orange %001 = black..blue %010 = black..green %011 = black..cyan %100 = black..red %101 = black..magenta %110 = black..yellow %111 = black..white %1100 = STR8_RGB8 * - 8-bit 0:3:3:2 pixels are streamed from stack RAM starting at %AAAAAAAA plus S[7:0], with S[31..30] selecting the starting byte. %1101 = STR16_RGB15 * - 15-bit 0:5:5:5 pixels are streamed from stack RAM starting at %AAAAAAAA plus S[7:0], with S[31] selecting the starting word. %1110 = STR16_RGB16 * - 16-bit 0:5:6:5 pixels are streamed from stack RAM starting at %AAAAAAAA plus S[7:0], with S[31] selecting the starting word. %1111 = STR32_SRGB26 - 26-bit 2:8:8:8 pixels are streamed from stack RAM starting at %AAAAAAAA plus S[7:0]. * SYNC bits are set to %00 for these modes, since they specify color data, only. The following example programs display luma-graduated color bars in various output modes: simple_VGA_1280x1024.spin simple_VGA_800x600.spin simple_VGA_640x480.spin simple_HDTV_1920x1080p.spin simple_HDTV_1280x720p.spin simple_NTSC_256x192.spin TEXTURE MAPPER -------------- Each cog has a texture mapper (PIX) which can sequentially navigate a rectangular 2D texture map with Z-perspective correction to locate a texture pixel, translate that texture pixel into A:R:G:B (Alpha:Red:Green:Blue) pixel data, perform discrete scaling on those A:R:G:B components, and then alpha-blend the resulting pixel with another pixel for multi-layered 3D effects. A texture map is stored in register RAM as a sequence of 1/2/4/8-bit texture pixels which build from the bottom bits of an initial register, upward, then into subsequent registers. They are ordered, in contiguous sequence, from top-left to top-right down to bottom-left to bottom-right. These texture pixels get used as offsets into stack RAM to look up A:R:G:B pixel data. Texture map width and height are individually settable to 1/2/4/8/16/32/64/128 pixel(s). 
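The packing order described above (texture pixels filling a register from its bottom bits upward, then continuing into subsequent registers, ordered from top-left to bottom-right) can be sketched in Python as follows. This only models the addressing arithmetic; read_texel and pack_texels are hypothetical helper names, and the register list stands in for cog register RAM starting at the texture's base address.

```python
def pack_texels(texels, bits_per_pixel):
    """Pack texel values into 32-bit registers, lowest bits first."""
    mask = (1 << bits_per_pixel) - 1
    per_register = 32 // bits_per_pixel
    registers = [0] * ((len(texels) + per_register - 1) // per_register)
    for i, texel in enumerate(texels):
        bit_offset = i * bits_per_pixel
        registers[bit_offset // 32] |= (texel & mask) << (bit_offset % 32)
    return registers


def read_texel(registers, width, bits_per_pixel, x, y):
    """Return the texel at column x, row y (row 0 is the top row)."""
    bit_offset = (y * width + x) * bits_per_pixel
    register = registers[bit_offset // 32]
    return (register >> (bit_offset % 32)) & ((1 << bits_per_pixel) - 1)


# A 4x2 texture of 4-bit texels, listed from top-left to bottom-right.
texture = pack_texels([1, 2, 3, 4, 5, 6, 7, 8], bits_per_pixel=4)
print([hex(register) for register in texture])
print(read_texel(texture, width=4, bits_per_pixel=4, x=3, y=1))
```

Because the texture pixel sizes are 1, 2, 4, or 8 bits, each of which divides 32 evenly, a texture pixel never straddles two registers in this model.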
The SETPIX instruction is used to configure PIX: SETPIX D/#n - Set PIX configuration to %UUU_VVV_PP_W_H_V_xxxx_AAAAAAAA_RRRRRRRRR %UUU = texture map width, %VVV = texture map height %000 = 1 pixel %001 = 2 pixels %010 = 4 pixels %011 = 8 pixels %100 = 16 pixels %101 = 32 pixels %110 = 64 pixels %111 = 128 pixels %PP = texture pixel size %00 = 1 bit %01 = 2 bits %10 = 4 bits %11 = 8 bits %W = stack RAM pixel data offset/size %0 = long offset, 8:8:8:8 bit A:R:G:B data %1 = word offset, 1:5:5:5 bit A:R:G:B data (gets expanded to 8:8:8:8) %H = horizontal mirroring %0 = OFF, image repeats when U'[15] set %1 = ON, image mirrors when U'[15] set %V = vertical mirroring %0 = OFF, image repeats when V'[15] set %1 = ON, image mirrors when V'[15] set %AAAAAAAA = base address in stack RAM of A:R:G:B pixel data %RRRRRRRRR = base address in register RAM of texture pixels Aside from SETPIX, which configures PIX's base metrics, there are seven other instructions which establish initial values and deltas for the (U,V) texture coordinates, Z perspective, and A/R/G/B scalers. These instructions are likely to be used before every sequence of GETPIX instructions. They each set the value of their respective 16-bit parameter to the low word of their operand, while the high word sets the 16-bit delta which gets added to the parameter upon every GETPIX instruction: SETPIXU D/#n - Set U to low word and DU to high word SETPIXV D/#n - Set V to low word and DV to high word SETPIXZ D/#n - Set Z to low word and DZ to high word SETPIXA D/#n - Set A to low word and DA to high word SETPIXR D/#n - Set R to low word and DR to high word SETPIXG D/#n - Set G to low word and DG to high word SETPIXB D/#n - Set B to low word and DB to high word Once PIX is configured and initial parameters are set, the GETPIX instruction may be used to look up the current texture pixel, scale its A/R/G/B components, blend it with a pixel in D, and update the U/V/Z/A/R/G/B parameters with their deltas. 
GETPIX takes 3 clocks and also needs 3 clocks in pipeline stages 2 and 3:

                NOP     #2              'ready pipeline, GETPIX needs 3 clocks in pipeline stage 2
                NOP     #2              'ready pipeline, GETPIX needs 3 clocks in pipeline stage 3
                GETPIX  pixel           'execute GETPIX, GETPIX takes 3 clocks in pipeline stage 4

To make GETPIX more efficient, it can be repeated using REPD to perform a sequence of pixel operations:

                REPD    #64,#1          'render 64 texture pixels and blend them with 'pixels'
                SETINDA #pixels         'point INDA to pixels
                NOP     #2              'ready pipeline, 3 clocks in initial pipeline stage 2
                NOP     #2              'ready pipeline, 3 clocks in initial pipeline stage 3
                GETPIX  INDA++          'execute GETPIX, 3 clocks per repeating GETPIX

As GETPIX executes, the following sequence occurs over three pipeline stages:

In pipeline stage 2:

    Z-perspective correction
    ------------------------
    Z' = 256 - Z[15:8]
    U' = (U[15:0] / Z') MOD 256
    V' = (V[15:0] / Z') MOD 256

    A texture pixel is read from register RAM at texture map location (U',V'), with the U' and V' top-most bits being used as coordinates. For example, if the texture size is 32x8, then the top 5 bits of U' and the top 3 bits of V' would be used to locate the texture pixel.

    parameter updating
    ------------------
    Z = Z + DZ
    U = U + DU
    V = V + DV

In pipeline stage 3:

    The texture pixel is used as an offset to look up A:R:G:B pixel data in stack RAM, which gets assigned to TA:TR:TG:TB.

In pipeline stage 4:

    pixel scaling
    -------------
    A' = (TA * A[15:8] + 255) / 256
    R' = (TR * R[15:8] + 255) / 256
    G' = (TG * G[15:8] + 255) / 256
    B' = (TB * B[15:8] + 255) / 256

    pixel blending
    --------------
    D[31..24] = 0
    D[23..16] = (A' * R' + (255 - A') * D[23..16] + 255) / 256
    D[15..8]  = (A' * G' + (255 - A') * D[15..8]  + 255) / 256
    D[7..0]   = (A' * B' + (255 - A') * D[7..0]   + 255) / 256
    C = A' <> 0    (for GETPIX D/#n WC, C = texture pixel opacity <> 0)

    parameter updating
    ------------------
    A = A + DA
    R = R + DR
    G = G + DG
    B = B + DB

Note that if Z[15:8] = 0, no scaling occurs, or (U',V') = (U[15:8],V[15:8]).
The bigger Z[15:8] gets, the more compressed the texture rendering becomes, until when
Z[15:8] = 255, (U',V') = (U[7:0],V[7:0]).

The following program provides a simplistic example of how PIX is used:

    texture_NTSC_256x192.spin


PIN TRANSFER
------------

Each cog has a pin transfer (XFR) which can automatically move data between pins and QUADs or
from pins to stack RAM, in the background, while instructions execute normally.

XFR is configured with the SETXFR instruction:

    SETXFR  D/#n    - Set XFR configuration to %MMM_PPP

        %MMM = mode

            %00x = off (initial state after cog start)
            %010 = QUADs_to_16_pins
            %011 = QUADs_to_32_pins
            %100 = 16_pins_to_QUADs
            %101 = 32_pins_to_QUADs
            %110 = 16_pins_to_stack
            %111 = 32_pins_to_stack

        %PPP = pin group

            %000 = pins 15..0  for 16-pin modes, pins 31..0  for 32-pin modes
            %001 = pins 31..16 for 16-pin modes, pins 31..0  for 32-pin modes
            %010 = pins 47..32 for 16-pin modes, pins 63..32 for 32-pin modes
            %011 = pins 63..48 for 16-pin modes, pins 63..32 for 32-pin modes
            %100 = pins 79..64 for 16-pin modes, pins 95..64 for 32-pin modes
            %101 = pins 95..80 for 16-pin modes, pins 95..64 for 32-pin modes
            %11x = no pins (reads 0's)

For QUADs_to_16_pins mode (%010), on the cycle after SETXFR is executed, the following 8-clock
pattern begins and then repeats indefinitely:

    1st clock: QUAD0 low word is output to pins
    2nd clock: QUAD0 high word is output to pins
    3rd clock: QUAD1 low word is output to pins
    4th clock: QUAD1 high word is output to pins
    5th clock: QUAD2 low word is output to pins
    6th clock: QUAD2 high word is output to pins
    7th clock: QUAD3 low word is output to pins
    8th clock: QUAD3 high word is output to pins

For QUADs_to_32_pins mode (%011), on the cycle after SETXFR is executed, the following 4-clock
pattern begins and then repeats indefinitely:

    1st clock: QUAD0 is output to pins
    2nd clock: QUAD1 is output to pins
    3rd clock: QUAD2 is output to pins
    4th clock: QUAD3 is output to pins

For 16_pins_to_QUADs mode (%100), on the cycle after SETXFR is executed, the following 8-clock
pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to QUAD0
    3rd clock: pins are sampled into low word
    4th clock: pins are sampled into high word, long is written to QUAD1
    5th clock: pins are sampled into low word
    6th clock: pins are sampled into high word, long is written to QUAD2
    7th clock: pins are sampled into low word
    8th clock: pins are sampled into high word, long is written to QUAD3

For 32_pins_to_QUADs mode (%101), on the cycle after SETXFR is executed, the following 4-clock
pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to QUAD0
    2nd clock: pins are sampled and written to QUAD1
    3rd clock: pins are sampled and written to QUAD2
    4th clock: pins are sampled and written to QUAD3

For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following 2-clock
pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to stack at SPA++

For 32_pins_to_stack mode (%111), on the cycle after SETXFR is executed, the following 1-clock
pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to stack at SPA++

While a pins_to_stack mode is active, you should not read or write stack RAM or modify SPA, as
such attempts will likely interfere with XFR operation and cause unexpected results. VID, however,
has an asynchronous second port to the stack RAM, so it can stream pixels at the same time XFR
streams them in.

To stop XFR, execute 'SETXFR #0' on the last cycle of desired XFR operation.
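
The word ordering in the 16-pin modes is easy to get backwards, so a short software model may help. The Python sketch below is an illustration only, not a driver: it packs the %MMM_PPP configuration value and reproduces the low-word-first ordering described above. The function names are hypothetical.

```python
def setxfr_config(mode, pin_group):
    """Pack the %MMM_PPP configuration value for SETXFR."""
    if not (0 <= mode <= 7 and 0 <= pin_group <= 7):
        raise ValueError("mode and pin_group are 3-bit fields")
    return (mode << 3) | pin_group


def quads_to_16_pins(quads):
    """Yield the eight 16-bit values driven during one pass of mode %010."""
    for q in quads:                     # QUAD0..QUAD3, in order
        yield q & 0xFFFF                # low word first
        yield (q >> 16) & 0xFFFF        # then high word


def pins16_to_long(low_sample, high_sample):
    """Combine two consecutive 16-pin samples, as in modes %100 and %110."""
    return ((high_sample & 0xFFFF) << 16) | (low_sample & 0xFFFF)
```

For example, setxfr_config(0b010, 0b001) returns %010_001, which selects QUADs_to_16_pins on pins 31..16, and a QUAD0 value of $11112222 is output as $2222 followed by $1111.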
An example of XFR usage is in the following program:

    SDRAM_Driver.spin


BIG MULTIPLIER
--------------

Aside from the 1-clock MACA/MACB instructions and the 2-clock MUL/SCL instructions which perform
20x20-bit signed multiplies, each cog has a separate, larger multiplier that can do 32x32-bit
signed or unsigned multiplies while other instructions execute.

To start a big multiply, do either SETMULU (unsigned) or SETMULA (signed) to set the first term,
then do SETMULB to set the second term and start the multiplier. You'll have 17 clocks of time to
execute other code, if you wish, before doing GETMULL/GETMULH to get the low/high long(s) of the
result.

Here are the big multiplier instructions:

    SETMULU D/#n    - Set 1st input term and set unsigned operation
    SETMULA D/#n    - Set 1st input term and set signed operation
    SETMULB D/#n    - Set 2nd input term and start multiplier

    GETMULL D       - Get low long of result, waits if multiplier not done
    GETMULL D WC    - Poll low long of result, C=1 if D valid, C=0 if multiplier busy
    GETMULH D       - Get high long of result, waits if multiplier not done
    GETMULH D WC    - Poll high long of result, C=1 if D valid, C=0 if multiplier busy


BIG DIVIDER
-----------

Each cog has a 64-over-32-bit divider which can do signed or unsigned divides while other
instructions execute. For signed divides, the remainder result will have the sign of the
numerator. Both the quotient and the remainder results are 32 bits.

To start a 64-over-32-bit divide, do SETDIVU (unsigned) or SETDIVA (signed) to set the low long of
the numerator, followed by another SETDIVU or SETDIVA to set the high long of the numerator. Then
do SETDIVB to load the denominator and start the divider. There will be 17 clocks of time to
execute other code, if you wish, before doing GETDIVQ/GETDIVR to get the quotient/remainder
long(s) of the result.

To start a 32-over-32-bit divide, just do one SETDIVU or SETDIVA before the SETDIVB.
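
The 64-over-32-bit operation described above can be modelled in a few lines. The Python sketch below is a software illustration, not a description of the hardware: the function name big_divide is hypothetical, the low and high numerator longs are passed in the same order as the two SETDIVU/SETDIVA instructions, and two behaviours the text does not specify (quotient overflow and division by zero) are handled by assumption, as the comments note.

```python
MASK32 = 0xFFFFFFFF


def big_divide(num_low, num_high, denom, signed=False):
    """Model a 64-over-32-bit divide; returns (quotient, remainder) as 32-bit values."""
    n = ((num_high & MASK32) << 32) | (num_low & MASK32)
    d = denom & MASK32
    if signed:
        # Interpret the raw bit patterns as two's-complement values.
        if n >= 1 << 63:
            n -= 1 << 64
        if d >= 1 << 31:
            d -= 1 << 32
    if d == 0:
        # Assumption: the text does not say what the divider does here.
        raise ZeroDivisionError("divide-by-zero behaviour is not described")
    q = abs(n) // abs(d)
    if (n < 0) != (d < 0):
        q = -q                          # truncate toward zero
    r = n - q * d                       # remainder takes the sign of the numerator
    return q & MASK32, r & MASK32       # assumption: results wrap to 32 bits
```

As a check, big_divide(0, 1, 3) divides $1_00000000 by 3 and returns a quotient of $55555555 with a remainder of 1. A signed divide of -7 by 2 returns -3 and -1 (as $FFFFFFFD and $FFFFFFFF), showing the remainder following the numerator's sign.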
Here are the divider instructions:

    SETDIVU D/#n    - Set low (then high) long of numerator and set unsigned operation
    SETDIVA D/#n    - Set low (then high) long of numerator and set signed operation
    SETDIVB D/#n    - Set denominator and start divider

    GETDIVQ D       - Get quotient result, waits if divider not done
    GETDIVQ D WC    - Poll quotient result, C=1 if D valid, C=0 if divider busy
    GETDIVR D       - Get remainder result, waits if divider not done
    GETDIVR D WC    - Poll remainder result, C=1 if D valid, C=0 if divider busy

To compute a 32-bit fractional value of A-over-B where A < B, you can do SETDIVU #0, SETDIVU A,
then SETDIVB B. GETDIVQ will return the fraction. For example: SETDIVU #0, SETDIVU #1, SETDIVB #3
yields a quotient of $55555555, or 1/3 of $1_00000000.


SQUARE ROOTER
-------------

Each cog has a 64-bit square rooter which can compute square roots from unsigned values while
other instructions execute.

To start a 64-bit square root computation, do SETSQRH to set the high long of the input term, then
do SETSQRL to set the low long and start the square rooter. There will be 32 clocks of time to
execute other code, if you wish, before doing GETSQRT to get the result.

To start a 32-bit square root computation, just do SETSQRL to set the low long and start the
square rooter. There will be 16 clocks of time to execute other code, if you wish, before doing
GETSQRT to get the result.

    SETSQRH D/#n    - Set high long of input term
    SETSQRL D/#n    - Set low long of input term and start square rooter

    GETSQRT D       - Get root result, waits if square rooter not done
    GETSQRT D WC    - Poll root result, C=1 if D valid, C=0 if square rooter busy


CORDIC ENGINE
-------------

Each cog has a CORDIC engine which can perform logarithmic, exponential, trigonometric, and
hyperbolic functions while other instructions execute.
Here are the instructions associated with the CORDIC engine:

    QLOG    D/#n    - Compute logarithm (unsigned number -> log-base-2)
    QEXP    D/#n    - Compute exponential (log-base-2 -> unsigned number)

    QSINCOS D,S/#n  - Compute sine and cosine with amplitude (polar -> cartesian)
    QARCTAN D,S/#n  - Compute distance and angle of (X,Y) to (0,0) (cartesian -> polar)

    SETQZ   D/#n    - Set CORDIC Z, used to set angle before QROTATE
    QROTATE D,S/#n  - Rotate (X,Y) around (0,0) by an angle

    GETQX   D       - Get CORDIC X result, waits if CORDIC busy
    GETQX   D WC    - Poll CORDIC X result, C=1 if D valid, C=0 if CORDIC busy
    GETQY   D       - Get CORDIC Y result, waits if CORDIC busy
    GETQY   D WC    - Poll CORDIC Y result, C=1 if D valid, C=0 if CORDIC busy
    GETQZ   D       - Get CORDIC Z result, waits if CORDIC busy
    GETQZ   D WC    - Poll CORDIC Z result, C=1 if D valid, C=0 if CORDIC busy

    SETQI   D/#n    - Set CORDIC trigonometric/hyperbolic and iteration modes

QLOG/QEXP usage:

To convert between 32-bit unsigned numbers and 32-bit log values, use QLOG or QEXP to set the
input term and begin the computation. Then do GETQZ to get the result. Log values are encoded with
the whole exponent in the top 5 bits and the fractional exponent in the bottom 27 bits.
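
The 5.27 fixed-point encoding just described can be approximated in floating point. The Python sketch below is only a model of the number format, not of the CORDIC algorithm, so its least significant bits are not guaranteed to match the hardware. The names qlog_model, qexp_model and split_log are hypothetical, and treating an input of 0 like 1 is taken from the example table in this document.

```python
import math


def qlog_model(n):
    """Approximate QLOG: 32-bit unsigned number -> 5.27 fixed-point log-base-2."""
    if n <= 1:
        return 0                        # 0 is treated the same as 1
    return min(round(math.log2(n) * (1 << 27)), 0xFFFFFFFF)


def split_log(z):
    """Return the (whole exponent, fractional exponent) fields of a log value."""
    return z >> 27, z & ((1 << 27) - 1)


def qexp_model(z):
    """Approximate QEXP: 5.27 fixed-point log-base-2 -> unsigned number."""
    return round(2.0 ** (z / (1 << 27)))
```

Exact powers of two land on whole-exponent boundaries, so qlog_model(2) is $08000000 and qlog_model($80000000) is $F8000000. split_log($0CAE00D2) returns a whole exponent of 1 and a fractional field of $4AE00D2.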
Here are some examples of numbers converted to log values, then back to numbers again using QLOG
and QEXP:

    number  ->  QLOG    ->  QEXP
    ---------------------------------
    $00000000   $00000000   $00000001   (0 same as 1)
    $00000001   $00000000   $00000001
    $00000002   $08000000   $00000002
    $00000003   $0CAE00D2   $00000003
    $00000004   $10000000   $00000004
    $00000005   $12934F09   $00000005
    $07ADCBD8   $D786F595   $07ADCBD9   (first lossy bidirectional conversion, +1)
    $20000000   $E8000000   $20000000
    $40000000   $F0000000   $40000000
    $80000000   $F8000000   $80000000
    $FFFFFFFF   $FFFFFFFF   $FFFFFFE9   (last lossy bidirectional conversion, -22)

QSINCOS/QARCTAN/QROTATE usage:

For the circular functions, angles are 32 bits and roll over at 360 degrees:

    $00000000 = 0 degrees                   (360 * $00000000 / $1_00000000)
    $00000001 = ~0.000000083819 degrees     (360 * $00000001 / $1_00000000)
    $00B60B61 = ~1 degree                   (360 * $00B60B61 / $1_00000000)
    $20000000 = 45 degrees                  (360 * $20000000 / $1_00000000)
    $40000000 = 90 degrees                  (360 * $40000000 / $1_00000000)
    $80000000 = 180 degrees                 (360 * $80000000 / $1_00000000)
    $C0000000 = 270 degrees                 (360 * $C0000000 / $1_00000000)
    $FFFFFFFF = ~359.9999999162 degrees     (360 * $FFFFFFFF / $1_00000000)

The X and Y inputs to the circular functions are signed 30-bit values, ranging from
-$2000_0000..+$1FFF_FFFF, conveyed by D and S (top two bits are ignored). No matter the sizes of X
and Y, the pair is internally MSB-justified to achieve maximal precision during the CORDIC
iterations, after which they are shifted back down and rounded to form the X and Y results.

The circular functions will return X and Y results that are scaled by constant K, which is
~1.64676025812 for trigonometric mode or ~0.82815936096 for hyperbolic mode. This CORDIC scaling
can be compensated for, if necessary, by pre- or post-scaling X and/or Y by 1/K.

To compute sine and cosine simultaneously, the 'QSINCOS D,S/#n' instruction can be used, with the
angle supplied in D and the amplitude in S.
Immediate #n values are special cases where $00..$1F produce +/- 2^(n-1) amplitudes and $20..$3F
produce 7/8ths of those amplitudes. For example, #$09 will yield results ranging from -$100..$100
and #$29 will yield results ranging from -$E0..$E0. Use GETQX and GETQY to retrieve the cosine and
sine results.

To convert an (X,Y) coordinate into a distance and angle relative to (0,0), do 'QARCTAN D,S/#n'
with the X in D and the Y in S/#n. Use GETQX to get the distance and GETQZ to get the angle.

To rotate an (X,Y) coordinate around (0,0), first do SETQZ to set the rotation angle, then do
'QROTATE D,S/#n', with the X in D and the Y in S/#n. Use GETQX and GETQY to retrieve the rotated
(X,Y) coordinate.

CORDIC modes:

The SETQI instruction is used to switch between trigonometric and hyperbolic modes, and to select
between adaptive and fixed iterations:

    SETQI   D/#n    - Set CORDIC configuration to %M_IIIII (%0_00000 on cog start)

        %M = mode

            %0 = trigonometric (K = ~1.64676025812)
            %1 = hyperbolic    (K = ~0.82815936096)

        %IIIII = iterations

            %00000         = adaptive iterations (adaptive resolution, variable time)
            %00001..%11111 = 1..31 fixed iterations (fixed resolution, constant time)

Hyperbolic mode changes the functionality of the QSINCOS/QARCTAN/QROTATE instructions so that
hyperbolics can be computed. When in hyperbolic mode, the CORDIC engine uses different internal
constants to track the angle, it skips the zeroth iteration, and the fourth and thirteenth
iterations are repeated to ensure convergence. Hence, K differs between trigonometric and
hyperbolic modes, as well as clock cycles.

When %IIIII is %00000, the CORDIC engine selects an iteration count based on the magnitude of the
X and Y inputs to ensure an efficient computation which preserves initial precision. For very
exact QARCTAN computations, setting %IIIII to %11111 will ensure calculator-like precision, even
though (X,Y) may be small.
In some cases, you may want to fix the iteration count to ensure good-enough precision, but with
budgeted/exact timing.

CORDIC timing:

Here is a table that shows how many free clocks are available for other instructions to execute
between QLOG/QEXP/QSINCOS/QARCTAN/QROTATE and GETQX/GETQY/GETQZ:

    i = %IIIII          i = 0 (adaptive)                    i = 1..31 (fixed)
    operation           clocks free                         clocks free
    --------------------------------------------------------------------------
    QLOG    D/#n        35                                  2 + i + h
    QEXP    D/#n        35                                  2 + i + h

    Trigonometric mode
    QSINCOS D,#n        2 + n                               2 + i
    QSINCOS D,S         5 + mag(abs(D) | abs(S))            3 + i
    QARCTAN D,S/#n      5 + mag(abs(D) | abs(S/#n))         3 + i
    QROTATE D,S/#n      5 + mag(abs(D) | abs(S/#n))         3 + i

    Hyperbolic mode
    QSINCOS D,#n        1 + n + j                           1 + i + h
    QSINCOS D,S         4 + mag(abs(D) | abs(S)) + k        2 + i + h
    QARCTAN D,S/#n      4 + mag(abs(D) | abs(S/#n)) + k     2 + i + h
    QROTATE D,S/#n      4 + mag(abs(D) | abs(S/#n)) + k     2 + i + h
    --------------------------------------------------------------------------
    h = 0 if i is 0..3      j = 0 if n is 1..3      k = 0 if mag is 0..1
        1 if i is 4..12         1 if n is 4..12         1 if mag is 2..10
        2 if i is 13..31        2 if n is 13..31        2 if mag is 11..30


MULTIPLY AND ACCUMULATE
-----------------------

Each cog has two 64-bit accumulators, ACCA and ACCB, which accumulate products from the MACA/MACB
instructions. The accumulators can also be cleared, set to arbitrary values, adjusted to exponent
and mantissa, and read back. On cog start, ACCA and ACCB are both cleared to $00000000_00000000.
The MACA/MACB instructions each perform a 20x20-bit signed multiply and then add the resultant
40-bit product into ACCA or ACCB in a single clock:

    MACA    D,S/#n  - multiply D[19:0] by S[19:0]/#n and accumulate into ACCA
    MACB    D,S/#n  - multiply D[19:0] by S[19:0]/#n and accumulate into ACCB

By using MACA/MACB with indirect addressing in a REPS/REPD loop, tap-per-clock FIR filters can be
realized in a few instructions:

        FIXINDA #buff+15,#buff  'set circular sample buffer
        FIXINDB #taps+15,#taps  'set circular tap buffer

:loop   REPS    #16,#1          'ready for 16-tap FIR
        CLRACCA                 'clear ACCA
        MACA    INDB++,INDA++   'multiply and accumulate buff and taps (16 clocks)
        GETACCA result          'get result

        '                       'use result
        '                       'get new sample

        MOV     --INDA,sample   'enter new sample, buff scrolls against taps
        JMP     #:loop          'loop

The accumulators may be cleared by the following instructions:

    CLRACCA         - clear ACCA to $00000000_00000000
    CLRACCB         - clear ACCB to $00000000_00000000
    CLRACCS         - clear ACCA and ACCB to $00000000_00000000

The accumulators may be set to arbitrary values by these instructions:

    SETACCA D,S/#n  - set the lower long of ACCA to D and upper long to S/#n
    SETACCB D,S/#n  - set the lower long of ACCB to D and upper long to S/#n

To make post-MACA/MACB computations simpler, the FITACCA/FITACCB/FITACCS instructions can be used
to shift the accumulators downward, in order to consolidate their leading bits into the lower
long, while the upper long gets set to a 6-bit exponent which represents how many shifts were
needed, if any, to fit the value (including the sign bit) into the lower long. This fitting can be
performed on ACCA and ACCB individually, or on ACCA and ACCB together, in order to preserve their
relative magnitudes. The FITACCA/FITACCB/FITACCS instructions take 2 clocks, but won't execute
until 2 clocks after MACA/MACB.
So, if FITACCA immediately follows MACA, FITACCA will take 4 clocks:

    FITACCA         - fit ACCA
    FITACCB         - fit ACCB
    FITACCS         - fit ACCA and ACCB with a common exponent

The GETACCA/GETACCB instructions are used to read back the contents of the accumulators.
GETACCA/GETACCB will always return the lower long of the accumulator, unless the lower long has
already been read and no intervening operation has changed the accumulator's contents, in which
case the upper long will be returned. These instructions take 1 clock, but won't execute until 2
clocks after MACA/MACB. So, if GETACCA immediately follows MACA, GETACCA will take 3 clocks:

    GETACCA D       - get lower long of ACCA, then upper long
    GETACCB D       - get lower long of ACCB, then upper long
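
The multiply-accumulate arithmetic in this section can be modelled in software to check expected filter outputs. The Python sketch below is an illustration under stated assumptions, not a cycle-accurate model: it assumes the accumulator wraps at 64 bits, it ignores all timing, and the names sext20, mac, fir and get_longs are hypothetical.

```python
MASK64 = (1 << 64) - 1


def sext20(x):
    """Sign-extend the low 20 bits of x."""
    return ((x & 0xFFFFF) ^ 0x80000) - 0x80000


def mac(acc, d, s):
    """One MACA/MACB step: add the signed 20x20-bit product to a 64-bit accumulator."""
    return (acc + sext20(d) * sext20(s)) & MASK64   # assumption: wraps at 64 bits


def fir(buff, taps):
    """Accumulate one output sample, starting from a cleared accumulator."""
    acc = 0                                         # corresponds to CLRACCA
    for sample, tap in zip(buff, taps):
        acc = mac(acc, tap, sample)
    return acc


def get_longs(acc):
    """Return (lower long, upper long), the order in which GETACCA reads them."""
    return acc & 0xFFFFFFFF, (acc >> 32) & 0xFFFFFFFF
```

For example, multiplying $FFFFF (which is -1 as a 20-bit signed value) by 3 from a cleared accumulator leaves -3 in the accumulator, which get_longs reports as $FFFFFFFD followed by $FFFFFFFF.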