...
All branches can access all of hub space. The caveat is the 9-bit immediate-address branches like DJNZ; they'll become relative branches in hub exec mode. JMPSW (was JMPRET) always stores the return address in D, but can only reach all of hub address space using the S register.
...
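A minimal sketch of what that difference means for a DJNZ, assuming the 9-bit S field holds an absolute cog register address in cog mode but a signed instruction offset in hub exec mode (the exact offset convention, and the acc/count registers, are assumptions, not from the post above):

:loop   rdlongc acc, ptra++           ' cog mode: the S field of the DJNZ below holds the
        djnz    count, #:loop         '   absolute 9-bit cog address of :loop

        ' hub exec mode (sketch): the same 9-bit field would be taken as a relative
        ' offset, roughly (:loop - $ - 1), emitted by the assembler instead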
Would it make more sense for the cases like DJNZ to always be relative in both COG and HUB EXEC modes?
It's still deterministic, anyway, with any number of threads in use:
"Thread scheduling is a simple round robin process with each active thread being executed in the next system clock cycle. This gives the appearance of up to eight concurrent threads per XCore. All threads are independent and have equal priority meaning that each task always receives a guaranteed minimum number of MIPS; this is central to building deterministic and responsive systems."
Yes and no. The caveat is that you have to know how many threads the system will launch, to know how much time you have.
That means any 'late changes' to code can bite big time.
The safest way to manage unknown later additions is to always run the highest thread count, i.e. with some threads as dummy time-swallowers.
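The same guard can be applied to a P2 cog's hardware tasks. A minimal sketch, assuming SETTASK takes a long of sixteen 2-bit time-slot assignments (the %%-quaternary format below is an assumption from earlier P2 material, not from this thread):

        settask taskmap               ' lock the schedule at four tasks from day one,
                                      '   so late additions can't change anyone's timing
dummy   jmp     #dummy                ' dummy 'time-swallower' task

taskmap long    %%3210_3210_3210_3210 ' tasks 0..3 each own every fourth slot (assumed format)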
Everything can be simulated and timed very accurately, enabling suitable hardware to be selected, with the xSOFTip Explorer software. It's very easy to add additional chips, if more processing power is required.
Chip,
It appears everyone agrees that multiple tasks using hubexec mode are just not going to work because of caching etc. Wouldn't it therefore be much simpler to not permit multiple hubexec tasks (in a single cog), and instead allocate any/all cache to a single hubexec task?
This would not prevent three other cog-mode tasks from running alongside the hubexec task anyway?
Seems as though we are overcomplicating the whole hubexec mode for something we will never use (multiple hubexec tasks in a single cog).
Would it make more sense for the cases like DJNZ to always be relative in both COG and HUB EXEC modes?
They could be. The drawback for cog mode would be that you couldn't just write an address field into an instruction like you can now using SETS. You would have to pre-compute an address relative to the instruction that will be modified.
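For reference, a sketch of the cog-mode idiom being preserved here (jump_ins, new_target, and the placement of the patch are made up for illustration):

patch   sets    jump_ins, new_target  ' write an absolute 9-bit cog address into the S field
        ...
jump_ins djnz   count, #0-0           ' placeholder S field, filled in by the SETS above
'
' if DJNZ were always relative, the patch would instead have to pre-compute an offset,
' e.g.  sub new_target, #jump_ins+1  before the SETS (exact convention is an assumption)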
Chip,
It appears everyone agrees that multiple tasks using hubexec mode are just not going to work because of caching etc. Wouldn't it therefore be much simpler to not permit multiple hubexec tasks (in a single cog), and instead allocate any/all cache to a single hubexec task?
This would not prevent three other cog-mode tasks from running alongside the hubexec task anyway?
Seems as though we are overcomplicating the whole hubexec mode for something we will never use (multiple hubexec tasks in a single cog).
I understand the consensus. If those 4 cache lines use an LRU algorithm, though, it won't matter if one hub task is running or four are. In the case of four, the cache lines would effectively be distributed among the tasks, still giving decent performance.
Everything can be simulated and timed very accurately, enabling suitable hardware to be selected, with the xSOFTip Explorer software. It's very easy to add additional chips, if more processing power is required.
That 'everything can be simulated and timed very accurately' again reminds me of the important missing piece of the P2 puzzle: a good simulator.
I believe the high-end HDL simulators work by compiling the VHDL/Verilog into C (or similar) and then running that.
So they effectively 'build a new simulator' on every run.
It would be very nice if the P2 simulator core could be auto-built this way, from the HDL.
Has anyone seen that done? Would it be fast enough?
That 'everything can be simulated and timed very accurately' again reminds me of the important missing piece of the P2 puzzle: a good simulator.
I believe the high-end HDL simulators work by compiling the VHDL/Verilog into C (or similar) and then running that.
So they effectively 'build a new simulator' on every run.
It would be very nice if the P2 simulator core could be auto-built this way, from the HDL.
Has anyone seen that done? Would it be fast enough?
That's a lot of logic to simulate. It seems a behavioral model would be a lot more efficient, but take a custom effort.
They could be. The drawback for cog mode would be that you couldn't just write an address field into an instruction like you can now using SETS. You would have to pre-compute an address relative to the instruction that will be modified.
Good point!
I wonder how many cases of self-modifying code change a DJNZ-style instruction. It's normally the JMPRETs that are modified (the RET versions).
That's a lot of logic to simulate. It seems a behavioral model would be a lot more efficient, but take a custom effort.
There is strong appeal to deriving the simulator core from the HDL, as that avoids the raft of 'not quite correct' divergence issues that can plague simulators, and it has more chance of simulating the peripherals and all the HW interactions.
Seems there are Icarus, Veriwell, and Veripool, with Veripool sounding the fastest and best supported?
http://opencores.org/opencores,tools
http://www.veripool.org/wiki/veripool/Verilog_Simulator_Benchmarks
Bill,
Why are you insisting on ignoring the parts of my message that say leaf functions are often NOT large numbers of cycles? You keep saying they will have hundreds or thousands of cycles, but in practice on real C/C++ code I see many leaf functions that are much smaller (think in the 10s of cycles or less). So if you can accept that leaf functions can be very small, then you can see that the overhead can be much larger than 1-3%.
I understand the consensus. If those 4 cache lines use an LRU algorithm, though, it won't matter if one hub task is running or four are. In the case of four, the cache lines would effectively be distributed among the tasks, still giving decent performance.
I was thinking overnight that an instruction that controlled the instruction cache fetching might be an advantage...
My thoughts were the Instruction Cache would be say 4 lines. It would start out with the first wide loaded into line0. Then automatically the next sequential line1 would be fetched while the 8 longs in line 0 were being executed (in hubexec mode). As line1 was being executed, line 2 would be fetched. Then line 3 and then back to line 0.
This would be controlled by a small state m/c fetching the wide hub lines into the 4 cache lines.
Now, what would be nice is that the program could control the state m/c by executing a "fetch" instruction that says "fetch the next 3 lines". This would take 1 clock to set the state m/c.
'the following is the first instruction cache line0 already fetched and now executing in hubexec mode...
$2200: FETCH $4000,#3 'fetch the next 3 wides into the cache from hub $4000
$2204: hubexec code
$2208: hubexec code
$220C: hubexec code
$2210: hubexec code
$2214: hubexec code - this is going to call a routine at hub $4000
$2218: hubexec code
$221C: hubexec code
'while the above was executing, the state m/c fetches 3 lines from hub $4000 - because we know that will be required next.
I have no idea about how complex this may be. I just thought it could be simpler than LRU, and with smart programming, the program could control what was loaded into the instruction cache and avoid stalls by cache misses.
Good point!
I wonder how many cases of self-modifying code change a DJNZ-style instruction. It's normally the JMPRETs that are modified (the RET versions).
A general solution should be checked.
I can see DJNZ would be useful for timeout/watchdog style case statements/state machines.
The jump is patched, and usually taken; some state kicks the DJNZ variable, and if the code never crosses a kick, it will time out eventually.
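A sketch of that watchdog pattern (the labels, the TIMEOUT constant, and the state_next register are made up for illustration):

kick    mov     wd_count, #TIMEOUT    ' an event 'kicks' the countdown back to full
        sets    wd_jmp, state_next    ' and patches in the next state handler
        ...
wd_jmp  djnz    wd_count, #0-0        ' usually taken, to whatever state was patched in
        jmp     #expired              ' countdown ran out with no kicks: timeout path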
I was thinking overnight that an instruction that controlled the instruction cache fetching might be an advantage...
My thoughts were the Instruction Cache would be say 4 lines. It would start out with the first wide loaded into line0. Then automatically the next sequential line1 would be fetched while the 8 longs in line 0 were being executed (in hubexec mode). As line1 was being executed, line 2 would be fetched. Then line 3 and then back to line 0.
This would be controlled by a small state m/c fetching the wide hub lines into the 4 cache lines.
Now, what would be nice is that the program could control the state m/c by executing a "fetch" instruction that says "fetch the next 3 lines". This would take 1 clock to set the state m/c.
'the following is the first instruction cache line0 already fetched and now executing in hubexec mode...
$2200: FETCH $4000,#3 'fetch the next 3 wides into the cache from hub $4000
$2204: hubexec code
$2208: hubexec code
$220C: hubexec code
$2210: hubexec code
$2214: hubexec code - this is going to call a routine at hub $4000
$2218: hubexec code
$221C: hubexec code
'while the above was executing, the state m/c fetches 3 lines from hub $4000 - because we know that will be required next.
I have no idea about how complex this may be. I just thought it could be simpler than LRU, and with smart programming, the program could control what was loaded into the instruction cache and avoid stalls by cache misses.
Cluso99
IMHO, it would be excellent to have this feature for a handcrafted PASM program.
As for automatically generated code, I believe the linker could be crafted to check for straight-through code pieces that are branched to from other ones, and decide where to insert the instruction and how many lines of cache (up to the maximum of three) to request in advance.
Though I'm unsure how it could be crafted to handle conditional execution at all.
Also, if the code piece now being loaded contains another advance cache-line request of its own, and so on... how could we ensure that any automated means, at link time, selects the best path anyway?
If only we could do real-time analysis on the incoming cache lines as they are fetched, to avoid conditionals recursively trashing them too often...
Perhaps a P3 feature....
I did not ignore, I disagreed. I even posted about a minimal (very dumb) leaf function that just executed "a = b"; perhaps you missed it in the flurry of postings.
Please show me a useful, non-trivial leaf function's disassembly from the test P2 gcc. Note, the function should not have any decorations, and should be for LMM mode (i.e. a cog-only-mode example is not valid); furthermore, include the function prologue and epilogue code. I am working on some other products right now, and do not currently have the p2test branch installed, so I do not have the time to generate the sample function and disassembled code.
I'll count the cycles, and then we will know what the percentage is.
Given what I remember seeing about the prologue and epilogue code, it consumed ~16*8 cycles each, so about 256 cycles total - without even counting what the function did, or calling the function, or returning from it:
an average of 4 cycles, divided by 256 cycles, is roughly a 1.6% slowdown. Worst case, about 3%.
If the function does actual work, and counting the call/return, we should be below 1%.
Please note, it is not fair to add attributes or command line switches to ensure that gcc does not generate prologue/epilogue code.
If the code generator, without attributes or command line switches, can on its own minimize the prologue/epilogue code, then yes, for an utterly trivial (by which I mean very-few-cycle) leaf function, the four-cycle delay could be a higher percentage.
To have a 10% impact on the performance of the function, the whole function, including prologue and epilogue, including calling and return, and any hub access cycles it performs, including looping, would have to complete in 40 clock cycles.
It is literally impossible to have a 2x-4x slowdown on the whole program when using a hub stack for leaf functions.
Bill,
Why are you insisting on ignoring the parts of my message that say leaf functions are often NOT large numbers of cycles? You keep saying they will have hundreds or thousands of cycles, but in practice on real C/C++ code I see many leaf functions that are much smaller (think in the 10s of cycles or less). So if you can accept that leaf functions can be very small, then you can see that the overhead can be much larger than 1-3%.
I did not ignore, I disagreed. I even posted about a minimal (very dumb) leaf function that just executed "a = b"; perhaps you missed it in the flurry of postings.
Please show me a useful, non-trivial leaf function's disassembly from the test P2 gcc. Note, the function should not have any decorations, and should be for LMM mode (i.e. a cog-only-mode example is not valid); furthermore, include the function prologue and epilogue code. I am working on some other products right now, and do not currently have the p2test branch installed, so I do not have the time to generate the sample function and disassembled code.
I'll count the cycles, and then we will know what the percentage is.
Given what I remember seeing about the prologue and epilogue code, it consumed ~16*8 cycles each, so about 256 cycles total - without even counting what the function did, or calling the function, or returning from it:
an average of 4 cycles, divided by 256 cycles, is roughly a 1.6% slowdown. Worst case, about 3%.
If the function does actual work, and counting the call/return, we should be below 1%.
Please note, it is not fair to add attributes or command line switches to ensure that gcc does not generate prologue/epilogue code.
If the code generator, without attributes or command line switches, can on its own minimize the prologue/epilogue code, then yes, for an utterly trivial (by which I mean very-few-cycle) leaf function, the four-cycle delay could be a higher percentage.
To have a 10% impact on the performance of the function, the whole function, including prologue and epilogue, including calling and return, and any hub access cycles it performs, including looping, would have to complete in 40 clock cycles.
It is literally impossible to have a 2x-4x slowdown on the whole program when using a hub stack for leaf functions.
I'm not sure why we're rehashing this. Chip has already proposed a couple of ways to handle this. I assume he'll look over our comments and choose one. Sounds like we'll get this in some form or other.
Bill,
My assumptions are based on optimized code that typically comes out of GCC, which often trims prologue and epilogue away and can result in functions that are only a small number of instructions. If the PropGCC compiler doesn't do even the most basic optimizations, then you are correct and the impact will always be small. If PropGCC does do these optimizations, then leaf functions can be quite small and result in much larger impact from the overhead.
If PropGCC does not do these optimizations, then they should, because the result will be MUCH smaller and MUCH faster code across the board. I find it hard to imagine that it's not doing them...
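For contrast with the prologue-heavy case discussed above, a hypothetical fully optimized leaf might be little more than its body; a sketch (register names and calling convention are assumptions, not actual PropGCC output):

get_x   rdlong  r0, x_addr            ' do the work: fetch one hub long into the result register
        ret                           ' return immediately: no registers saved, no hub stack traffic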
Bill,
My assumptions are based on optimized code that typically comes out of GCC, which often trims prologue and epilogue away and can result in functions that are only a small number of instructions. If the PropGCC compiler doesn't do even the most basic optimizations, then you are correct and the impact will always be small. If PropGCC does do these optimizations, then leaf functions can be quite small and result in much larger impact from the overhead.
If PropGCC does not do these optimizations, then they should, because the result will be MUCH smaller and MUCH faster code across the board. I find it hard to imagine that it's not doing them...
PropGCC does not do optimization by default but does if you specify an option. We usually use -Os but -O2 may generate faster code in some cases.
Bill,
My assumptions are based on optimized code that typically comes out of GCC, which often trims prologue and epilogue away and can result in functions that are only a small number of instructions. If the PropGCC compiler doesn't do even the most basic optimizations, then you are correct and the impact will always be small. If PropGCC does do these optimizations, then leaf functions can be quite small and result in much larger impacts.
If PropGCC does not do these optimizations, then they should, because the result will be MUCH smaller and MUCH faster code across the board. I find it hard to imagine that it's not doing them...
My numbers were based on examining the propgcc-generated code directly. Every register saved to the hub stack takes 8+ cycles; every one restored takes somewhat less if RDxxxC is used (about 4).
Say a leaf only saved/restored 4 registers: that would be 4*(8+4) cycles right there. Add the call and the return and we are at a minimum of 51 cycles. That's without the function doing anything. Any non-trivial function will take ~50+ cycles (hub access, loops, calculation - the exact usage is irrelevant), bringing us to 100 cycles minimum, even using RDLONGC, even for a very lightweight leaf function.
4/100 = 4%
Most leaf functions will do a lot more, for a much smaller percentage.
The truly trivial leaf functions I expect GCC is smart enough to inline (e.g. the "a = b" example); str* and mem* will take less code space inlined and be faster. I think GCC is smart enough to automatically inline such tiny functions.
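To make those numbers concrete, a sketch of the save/restore being counted (hypothetical register names; the per-access cycle notes are the figures quoted above, not measurements):

        wrlong  r8,  ptra++           ' prologue: ~8+ cycles per save, hub-slot dependent
        wrlong  r9,  ptra++
        wrlong  r10, ptra++
        wrlong  r11, ptra++
        ...                           ' function body
        rdlongc r11, --ptra           ' epilogue: ~4 cycles per restore when the line is cached
        rdlongc r10, --ptra
        rdlongc r9,  --ptra
        rdlongc r8,  --ptra
        ret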
You keep saying they will have hundreds or thousands of cycles, but in practice on real C/C++ code I see many leaf functions that are much smaller (think in the 10s of cycles or less). So if you can accept that leaf functions can be very small, then you can see that the overhead can be much larger than 1-3%.
That was not a technical argument. Again.
I back up my technical discussions, and I am happy to accept arguments that are backed up with technically accurate data, which is what I asked for in my previous post.
I am actually interested in how good the code is with -O ... what I did not want to see was something like -fomit_prologue -fomit_epilog (I am certain I have the wrong names for the options)
I know Chip is working on a solution, and I think all of us will like it. But I cannot let personal attacks, or technically incorrect responses to my posts go.
Bill,
I'm done trying to talk to you. It never does anything but waste time for both of us. We'll see how things shake out of all this when Chip delivers the final results. Have a nice day.
There are now 16-bit immediate jumps and calls that can be relative or absolute. The jumps and calls that end in an underscore ("_") toggle hub execution mode. If you are running in the cog, a CALL_ #address will jump to hub memory. When that routine does a RET, it will return to cog memory. It works the other way, too. A CALL or JMP without an underscore stays in the cog or hub.
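A minimal sketch of that toggling (hubsub is a made-up label assumed to sit in hub memory):

        ' running in cog memory
        call_   #hubsub               ' underscore form: branch to hub memory and enter hub exec
        ...                           ' RET brings execution back here, in cog mode
'
hubsub  ' running in hub memory, hub exec mode
        ...
        ret                           ' restores {hubmode,Z,C,PC}, so we land back in the cog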
The JMPSW/JMPSWD instruction can be used to switch among threads. It will store {hubmode,Z,C,PC} into D and load {hubmode,Z,C,PC} from S. So, it tracks threads wherever they are executing. All CALLs and RETs save and restore {hubmode,Z,C,PC}. PC is 16 bits so that it can span the entire 64K longs in the 256KB hub memory.
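A sketch of cooperative switching with it, in the spirit of P1 JMPRET coroutines (a_state/b_state are made-up longs; it is assumed the saved PC sits in the low bits, so a plain entry address works as an initial value):

threadA ...
        jmpsw   a_state, b_state      ' park A's {hubmode,Z,C,PC} in a_state, resume B
        jmp     #threadA

threadB ...                           ' threadB could just as well live in hub memory
        jmpsw   b_state, a_state      ' park B, resume A right after its JMPSW
        jmp     #threadB

a_state long    0
b_state long    threadB               ' initial entry point for B, flags and hubmode clear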
Here is the list (Prop2_Instructions_12_17_13.txt):
ZCDS (for D column: W=write, M=modify, R=read, L=read/immediate)
---------------------------------------------------------------------------------------------------------------------
ZCWS 0000000 ZC I CCCC DDDDDDDDD SSSSSSSSS RDBYTE D,S/PTRA/PTRB (waits for hub)
ZCWS 0000001 ZC I CCCC DDDDDDDDD SSSSSSSSS RDBYTEC D,S/PTRA/PTRB (waits for hub if cache miss)
ZCWS 0000010 ZC I CCCC DDDDDDDDD SSSSSSSSS RDWORD D,S/PTRA/PTRB (waits for hub)
ZCWS 0000011 ZC I CCCC DDDDDDDDD SSSSSSSSS RDWORDC D,S/PTRA/PTRB (waits for hub if cache miss)
ZCWS 0000100 ZC I CCCC DDDDDDDDD SSSSSSSSS RDLONG D,S/PTRA/PTRB (waits for hub)
ZCWS 0000101 ZC I CCCC DDDDDDDDD SSSSSSSSS RDLONGC D,S/PTRA/PTRB (waits for hub if cache miss)
ZCWS 0000110 ZC I CCCC DDDDDDDDD SSSSSSSSS RDAUX D,S/#0..$FF/PTRX/PTRY
ZCWS 0000111 ZC I CCCC DDDDDDDDD SSSSSSSSS RDAUXR D,S/#0..$FF/PTRX/PTRY
ZCMS 0001000 ZC I CCCC DDDDDDDDD SSSSSSSSS ISOB D,S/#
ZCMS 0001001 ZC I CCCC DDDDDDDDD SSSSSSSSS NOTB D,S/#
ZCMS 0001010 ZC I CCCC DDDDDDDDD SSSSSSSSS CLRB D,S/#
ZCMS 0001011 ZC I CCCC DDDDDDDDD SSSSSSSSS SETB D,S/#
ZCMS 0001100 ZC I CCCC DDDDDDDDD SSSSSSSSS SETBC D,S/#
ZCMS 0001101 ZC I CCCC DDDDDDDDD SSSSSSSSS SETBNC D,S/#
ZCMS 0001110 ZC I CCCC DDDDDDDDD SSSSSSSSS SETBZ D,S/#
ZCMS 0001111 ZC I CCCC DDDDDDDDD SSSSSSSSS SETBNZ D,S/#
ZCMS 0010000 ZC I CCCC DDDDDDDDD SSSSSSSSS ANDN D,S/#
ZCMS 0010001 ZC I CCCC DDDDDDDDD SSSSSSSSS AND D,S/#
ZCMS 0010010 ZC I CCCC DDDDDDDDD SSSSSSSSS OR D,S/#
ZCMS 0010011 ZC I CCCC DDDDDDDDD SSSSSSSSS XOR D,S/#
ZCMS 0010100 ZC I CCCC DDDDDDDDD SSSSSSSSS MUXC D,S/#
ZCMS 0010101 ZC I CCCC DDDDDDDDD SSSSSSSSS MUXNC D,S/#
ZCMS 0010110 ZC I CCCC DDDDDDDDD SSSSSSSSS MUXZ D,S/#
ZCMS 0010111 ZC I CCCC DDDDDDDDD SSSSSSSSS MUXNZ D,S/#
ZCMS 0011000 ZC I CCCC DDDDDDDDD SSSSSSSSS ROR D,S/#
ZCMS 0011001 ZC I CCCC DDDDDDDDD SSSSSSSSS ROL D,S/#
ZCMS 0011010 ZC I CCCC DDDDDDDDD SSSSSSSSS SHR D,S/#
ZCMS 0011011 ZC I CCCC DDDDDDDDD SSSSSSSSS SHL D,S/#
ZCMS 0011100 ZC I CCCC DDDDDDDDD SSSSSSSSS RCR D,S/#
ZCMS 0011101 ZC I CCCC DDDDDDDDD SSSSSSSSS RCL D,S/#
ZCMS 0011110 ZC I CCCC DDDDDDDDD SSSSSSSSS SAR D,S/#
ZCMS 0011111 ZC I CCCC DDDDDDDDD SSSSSSSSS REV D,S/#
ZCWS 0100000 ZC I CCCC DDDDDDDDD SSSSSSSSS MOV D,S/#
ZCWS 0100001 ZC I CCCC DDDDDDDDD SSSSSSSSS NOT D,S/#
ZCWS 0100010 ZC I CCCC DDDDDDDDD SSSSSSSSS ABS D,S/#
ZCWS 0100011 ZC I CCCC DDDDDDDDD SSSSSSSSS NEG D,S/#
ZCWS 0100100 ZC I CCCC DDDDDDDDD SSSSSSSSS NEGC D,S/#
ZCWS 0100101 ZC I CCCC DDDDDDDDD SSSSSSSSS NEGNC D,S/#
ZCWS 0100110 ZC I CCCC DDDDDDDDD SSSSSSSSS NEGZ D,S/#
ZCWS 0100111 ZC I CCCC DDDDDDDDD SSSSSSSSS NEGNZ D,S/#
ZCMS 0101000 ZC I CCCC DDDDDDDDD SSSSSSSSS ADD D,S/#
ZCMS 0101001 ZC I CCCC DDDDDDDDD SSSSSSSSS SUB D,S/#
ZCMS 0101010 ZC I CCCC DDDDDDDDD SSSSSSSSS ADDX D,S/#
ZCMS 0101011 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBX D,S/#
ZCMS 0101100 ZC I CCCC DDDDDDDDD SSSSSSSSS ADDS D,S/#
ZCMS 0101101 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBS D,S/#
ZCMS 0101110 ZC I CCCC DDDDDDDDD SSSSSSSSS ADDSX D,S/#
ZCMS 0101111 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBSX D,S/#
ZCMS 0110000 ZC I CCCC DDDDDDDDD SSSSSSSSS SUMC D,S/#
ZCMS 0110001 ZC I CCCC DDDDDDDDD SSSSSSSSS SUMNC D,S/#
ZCMS 0110010 ZC I CCCC DDDDDDDDD SSSSSSSSS SUMZ D,S/#
ZCMS 0110011 ZC I CCCC DDDDDDDDD SSSSSSSSS SUMNZ D,S/#
ZCMS 0110100 ZC I CCCC DDDDDDDDD SSSSSSSSS MIN D,S/#
ZCMS 0110101 ZC I CCCC DDDDDDDDD SSSSSSSSS MAX D,S/#
ZCMS 0110110 ZC I CCCC DDDDDDDDD SSSSSSSSS MINS D,S/#
ZCMS 0110111 ZC I CCCC DDDDDDDDD SSSSSSSSS MAXS D,S/#
ZCMS 0111000 ZC I CCCC DDDDDDDDD SSSSSSSSS ADDABS D,S/#
ZCMS 0111001 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBABS D,S/#
ZCMS 0111010 ZC I CCCC DDDDDDDDD SSSSSSSSS INCMOD D,S/#
ZCMS 0111011 ZC I CCCC DDDDDDDDD SSSSSSSSS DECMOD D,S/#
ZCMS 0111100 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPSUB D,S/#
ZCMS 0111101 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBR D,S/#
ZCMS 0111110 ZC I CCCC DDDDDDDDD SSSSSSSSS MUL D,S/# (waits one clock)
ZCMS 0111111 ZC I CCCC DDDDDDDDD SSSSSSSSS SCL D,S/# (waits one clock)
ZCWS 1000000 ZC I CCCC DDDDDDDDD SSSSSSSSS DECOD2 D,S/#
ZCWS 1000001 ZC I CCCC DDDDDDDDD SSSSSSSSS DECOD3 D,S/#
ZCWS 1000010 ZC I CCCC DDDDDDDDD SSSSSSSSS DECOD4 D,S/#
ZCWS 1000011 ZC I CCCC DDDDDDDDD SSSSSSSSS DECOD5 D,S/#
Z-WS 1000100 Z0 I CCCC DDDDDDDDD SSSSSSSSS ENCOD D,S/#
Z-WS 1000100 Z1 I CCCC DDDDDDDDD SSSSSSSSS BLMASK D,S/#
Z-WS 1000101 Z0 I CCCC DDDDDDDDD SSSSSSSSS ONECNT D,S/# (waits one clock)
Z-WS 1000101 Z1 I CCCC DDDDDDDDD SSSSSSSSS ZERCNT D,S/# (waits one clock)
-CWS 1000110 0C I CCCC DDDDDDDDD SSSSSSSSS INCPAT D,S/#
-CWS 1000110 1C I CCCC DDDDDDDDD SSSSSSSSS DECPAT D,S/#
--WS 1000111 00 I CCCC DDDDDDDDD SSSSSSSSS SPLITB D,S/# (also MERGEN)
--WS 1000111 01 I CCCC DDDDDDDDD SSSSSSSSS MERGEB D,S/# (also SPLITN)
--WS 1000111 10 I CCCC DDDDDDDDD SSSSSSSSS SPLITW D,S/#
--WS 1000111 11 I CCCC DDDDDDDDD SSSSSSSSS MERGEW D,S/#
--MS 10010nn n0 I CCCC DDDDDDDDD SSSSSSSSS GETNIB D,S/#,#0..7
--MS 10010nn n1 I CCCC DDDDDDDDD SSSSSSSSS SETNIB D,S/#,#0..7
--MS 1001100 n0 I CCCC DDDDDDDDD SSSSSSSSS GETWORD D,S/#,#0..1
--MS 1001100 n1 I CCCC DDDDDDDDD SSSSSSSSS SETWORD D,S/#,#0..1
--MS 1001101 00 I CCCC DDDDDDDDD SSSSSSSSS STWORDS D,S/#
--MS 1001101 01 I CCCC DDDDDDDDD SSSSSSSSS ROLNIB D,S/#
--MS 1001101 10 I CCCC DDDDDDDDD SSSSSSSSS ROLBYTE D,S/#
--MS 1001101 11 I CCCC DDDDDDDDD SSSSSSSSS ROLWORD D,S/#
--MS 1001110 00 I CCCC DDDDDDDDD SSSSSSSSS SETS D,S/#
--MS 1001110 01 I CCCC DDDDDDDDD SSSSSSSSS SETD D,S/#
--MS 1001110 10 I CCCC DDDDDDDDD SSSSSSSSS SETX D,S/#
--MS 1001110 11 I CCCC DDDDDDDDD SSSSSSSSS SETI D,S/#
-CMS 1001111 0C I CCCC DDDDDDDDD SSSSSSSSS COGNEW D,S/# (waits for hub)
-CMS 1001111 1C I CCCC DDDDDDDDD SSSSSSSSS WAITCNT D,S/# (waits for CNT, +CNTX if WC)
--MS 101000n n0 I CCCC DDDDDDDDD SSSSSSSSS GETBYTE D,S/#,#0..3
--MS 101000n n1 I CCCC DDDDDDDDD SSSSSSSSS SETBYTE D,S/#,#0..3
--WS 1010010 00 I CCCC DDDDDDDDD SSSSSSSSS STBYTES D,S/#
--MS 1010010 01 I CCCC DDDDDDDDD SSSSSSSSS SWBYTES D,S/# (switch/copy bytes in D, S = %11_10_01_00 = D same)
--MS 1010010 10 I CCCC DDDDDDDDD SSSSSSSSS PACKRGB D,S/# (S 8:8:8 -> D 5:5:5 << 16 | D >> 16)
--WS 1010010 11 I CCCC DDDDDDDDD SSSSSSSSS UNPKRGB D,S/# (S 5:5:5 -> D 8:8:8)
--MS 1010011 00 I CCCC DDDDDDDDD SSSSSSSSS ADDPIX D,S/# (waits one clock)
--MS 1010011 01 I CCCC DDDDDDDDD SSSSSSSSS MULPIX D,S/# (waits one clock)
--MS 1010011 10 I CCCC DDDDDDDDD SSSSSSSSS BLNPIX D,S/# (waits one clock)
--MS 1010011 11 I CCCC DDDDDDDDD SSSSSSSSS MIXPIX D,S/# (waits one clock)
ZCMS 1010100 ZC I CCCC DDDDDDDDD SSSSSSSSS JMPSW D,S/#
ZCMS 1010101 ZC I CCCC DDDDDDDDD SSSSSSSSS JMPSWD D,S/#
--MS 1010110 00 I CCCC DDDDDDDDD SSSSSSSSS IJZ D,S/#
--MS 1010110 01 I CCCC DDDDDDDDD SSSSSSSSS IJZD D,S/#
--MS 1010110 10 I CCCC DDDDDDDDD SSSSSSSSS IJNZ D,S/#
--MS 1010110 11 I CCCC DDDDDDDDD SSSSSSSSS IJNZD D,S/#
--MS 1010111 00 I CCCC DDDDDDDDD SSSSSSSSS DJZ D,S/#
--MS 1010111 01 I CCCC DDDDDDDDD SSSSSSSSS DJZD D,S/#
--MS 1010111 10 I CCCC DDDDDDDDD SSSSSSSSS DJNZ D,S/#
--MS 1010111 11 I CCCC DDDDDDDDD SSSSSSSSS DJNZD D,S/#
ZCRS 1011000 ZC I CCCC DDDDDDDDD SSSSSSSSS TESTB D,S/#
ZCRS 1011001 ZC I CCCC DDDDDDDDD SSSSSSSSS TESTN D,S/#
ZCRS 1011010 ZC I CCCC DDDDDDDDD SSSSSSSSS TEST D,S/#
ZCRS 1011011 ZC I CCCC DDDDDDDDD SSSSSSSSS CMP D,S/#
ZCRS 1011100 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPX D,S/#
ZCRS 1011101 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPS D,S/#
ZCRS 1011110 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPSX D,S/#
ZCRS 1011111 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPR D,S/#
--RS 11000nn n0 I CCCC DDDDDDDDD SSSSSSSSS COGINIT D,S/#,#0..7 (waits for hub) (SETNIB :coginit,cog,#6)
---S 11000nn n1 I CCCC nnnnnnnnn SSSSSSSSS WAITVID #0..$DFF,S/# (waits for vid if single-task, loops if multi-task)
--RS 1100011 11 I CCCC DDDDDDDDD SSSSSSSSS WAITVID D,S/# (waits for vid if single-task, loops if multi-task)
-CRS 110010n nC I CCCC DDDDDDDDD SSSSSSSSS WAITPEQ D,S/#,#0..3 (waits for pins, plus CNT if WC)
-CRS 110011n nC I CCCC DDDDDDDDD SSSSSSSSS WAITPNE D,S/#,#0..3 (waits for pins, plus CNT if WC)
--LS 1101000 0L I CCCC DDDDDDDDD SSSSSSSSS WRBYTE D/#,S/PTRA/PTRB (waits for hub)
--LS 1101000 1L I CCCC DDDDDDDDD SSSSSSSSS WRWORD D/#,S/PTRA/PTRB (waits for hub)
--LS 1101001 0L I CCCC DDDDDDDDD SSSSSSSSS WRLONG D/#,S/PTRA/PTRB (waits for hub)
--LS 1101001 1L I CCCC DDDDDDDDD SSSSSSSSS FRAC D/#,S/#
--LS 1101010 0L I CCCC DDDDDDDDD SSSSSSSSS WRAUX D/#,S/#0..$FF/PTRX/PTRY
--LS 1101010 1L I CCCC DDDDDDDDD SSSSSSSSS WRAUXR D/#,S/#0..$FF/PTRX/PTRY
--LS 1101011 0L I CCCC DDDDDDDDD SSSSSSSSS SETACCA D/#,S/#
--LS 1101011 1L I CCCC DDDDDDDDD SSSSSSSSS SETACCB D/#,S/#
--LS 1101100 0L I CCCC DDDDDDDDD SSSSSSSSS MACA D/#,S/#
--LS 1101100 1L I CCCC DDDDDDDDD SSSSSSSSS MACB D/#,S/#
--LS 1101101 0L I CCCC DDDDDDDDD SSSSSSSSS MUL32 D/#,S/#
--LS 1101101 1L I CCCC DDDDDDDDD SSSSSSSSS MUL32U D/#,S/#
--LS 1101110 0L I CCCC DDDDDDDDD SSSSSSSSS DIV32 D/#,S/#
--LS 1101110 1L I CCCC DDDDDDDDD SSSSSSSSS DIV32U D/#,S/#
--LS 1101111 0L I CCCC DDDDDDDDD SSSSSSSSS DIV64 D/#,S/#
--LS 1101111 1L I CCCC DDDDDDDDD SSSSSSSSS DIV64U D/#,S/#
--LS 1110000 0L I CCCC DDDDDDDDD SSSSSSSSS SQRT64 D/#,S/#
--LS 1110000 1L I CCCC DDDDDDDDD SSSSSSSSS QSINCOS D/#,S/#
--LS 1110001 0L I CCCC DDDDDDDDD SSSSSSSSS QARCTAN D/#,S/#
--LS 1110001 1L I CCCC DDDDDDDDD SSSSSSSSS QROTATE D/#,S/#
--LS 1110010 0L I CCCC DDDDDDDDD SSSSSSSSS SETSERA D/#,S/# (config,baud)
--LS 1110010 1L I CCCC DDDDDDDDD SSSSSSSSS SETSERB D/#,S/# (config,baud)
--LS 1110011 0L I CCCC DDDDDDDDD SSSSSSSSS SETCTRS D/#,S/# (ctrb,ctra)
--LS 1110011 1L I CCCC DDDDDDDDD SSSSSSSSS SETWAVS D/#,S/# (ctrb,ctra)
--LS 1110100 0L I CCCC DDDDDDDDD SSSSSSSSS SETFRQS D/#,S/# (ctrb,ctra)
--LS 1110100 1L I CCCC DDDDDDDDD SSSSSSSSS SETPHSS D/#,S/# (ctrb,ctra)
--LS 1110101 0L I CCCC DDDDDDDDD SSSSSSSSS ADDPHSS D/#,S/# (ctrb,ctra)
--LS 1110101 1L I CCCC DDDDDDDDD SSSSSSSSS SUBPHSS D/#,S/# (ctrb,ctra)
--LS 1110110 0L I CCCC DDDDDDDDD SSSSSSSSS JP D/#,S/#
--LS 1110110 1L I CCCC DDDDDDDDD SSSSSSSSS JPD D/#,S/#
--LS 1110111 0L I CCCC DDDDDDDDD SSSSSSSSS JNP D/#,S/#
--LS 1110111 1L I CCCC DDDDDDDDD SSSSSSSSS JNPD D/#,S/#
--LS 111100n nL I CCCC DDDDDDDDD SSSSSSSSS CFGPINS D/#,S/#,#0..2 (waits for alt)
--LS 1111001 1L I CCCC DDDDDDDDD SSSSSSSSS JMPTASK D/#,S/# (mode:mask,address)
--LS 1111010 0L I CCCC DDDDDDDDD SSSSSSSSS SETXFR D/#,S/#
--LS 1111010 1L I CCCC DDDDDDDDD SSSSSSSSS SETMIX D/#,S/#
--LS 1111011 0L I CCCC DDDDDDDDD SSSSSSSSS <empty> D/#,S/#
--LS 1111011 1L I CCCC DDDDDDDDD SSSSSSSSS <empty> D/#,S/#
--RS 1111100 00 I CCCC DDDDDDDDD SSSSSSSSS JZ D,S/#
--RS 1111100 01 I CCCC DDDDDDDDD SSSSSSSSS JZD D,S/#
--RS 1111100 10 I CCCC DDDDDDDDD SSSSSSSSS JNZ D,S/#
--RS 1111100 11 I CCCC DDDDDDDDD SSSSSSSSS JNZD D,S/#
---- 1111101 00 n nnnn nnnnnnnnn nnnnnnnnn AUGI #23bits (appends n to upper bits of next S or D immediate)
---- 1111101 01 0 nnnn nnnnnnnnn nnniiiiii REPS #1..$10000,#1..64
---- 1111101 01 1 BBAA ddddddddd sssssssss FIXINDA #d,#s / FIXINDB #d,#s / FIXINDS #d,#s / SETINDA #s / SETINDB #d / SETINDS #d,#s
---- 1111101 10 0 CCCC 00 nnnnnnnnnnnnnnnn JMP #abs
---- 1111101 10 0 CCCC 01 nnnnnnnnnnnnnnnn JMP_ #abs
---- 1111101 10 0 CCCC 10 nnnnnnnnnnnnnnnn JMP @rel
---- 1111101 10 0 CCCC 11 nnnnnnnnnnnnnnnn JMP_ @rel
---- 1111101 10 1 CCCC 00 nnnnnnnnnnnnnnnn JMPD #abs
---- 1111101 10 1 CCCC 01 nnnnnnnnnnnnnnnn JMPD_ #abs
---- 1111101 10 1 CCCC 10 nnnnnnnnnnnnnnnn JMPD @rel
---- 1111101 10 1 CCCC 11 nnnnnnnnnnnnnnnn JMPD_ @rel
---- 1111101 11 0 CCCC 00 nnnnnnnnnnnnnnnn CALL #abs
---- 1111101 11 0 CCCC 01 nnnnnnnnnnnnnnnn CALL_ #abs
---- 1111101 11 0 CCCC 10 nnnnnnnnnnnnnnnn CALL @rel
---- 1111101 11 0 CCCC 11 nnnnnnnnnnnnnnnn CALL_ @rel
---- 1111101 11 1 CCCC 00 nnnnnnnnnnnnnnnn CALLD #abs
---- 1111101 11 1 CCCC 01 nnnnnnnnnnnnnnnn CALLD_ #abs
---- 1111101 11 1 CCCC 10 nnnnnnnnnnnnnnnn CALLD @rel
---- 1111101 11 1 CCCC 11 nnnnnnnnnnnnnnnn CALLD_ @rel
---- 1111110 00 0 CCCC 00 nnnnnnnnnnnnnnnn CALLA #abs
---- 1111110 00 0 CCCC 01 nnnnnnnnnnnnnnnn CALLA_ #abs
---- 1111110 00 0 CCCC 10 nnnnnnnnnnnnnnnn CALLA @rel
---- 1111110 00 0 CCCC 11 nnnnnnnnnnnnnnnn CALLA_ @rel
---- 1111110 00 1 CCCC 00 nnnnnnnnnnnnnnnn CALLAD #abs
---- 1111110 00 1 CCCC 01 nnnnnnnnnnnnnnnn CALLAD_ #abs
---- 1111110 00 1 CCCC 10 nnnnnnnnnnnnnnnn CALLAD @rel
---- 1111110 00 1 CCCC 11 nnnnnnnnnnnnnnnn CALLAD_ @rel
---- 1111110 01 0 CCCC 00 nnnnnnnnnnnnnnnn CALLB #abs
---- 1111110 01 0 CCCC 01 nnnnnnnnnnnnnnnn CALLB_ #abs
---- 1111110 01 0 CCCC 10 nnnnnnnnnnnnnnnn CALLB @rel
---- 1111110 01 0 CCCC 11 nnnnnnnnnnnnnnnn CALLB_ @rel
---- 1111110 01 1 CCCC 00 nnnnnnnnnnnnnnnn CALLBD #abs
---- 1111110 01 1 CCCC 01 nnnnnnnnnnnnnnnn CALLBD_ #abs
---- 1111110 01 1 CCCC 10 nnnnnnnnnnnnnnnn CALLBD @rel
---- 1111110 01 1 CCCC 11 nnnnnnnnnnnnnnnn CALLBD_ @rel
---- 1111110 10 0 CCCC 00 nnnnnnnnnnnnnnnn CALLX #abs
---- 1111110 10 0 CCCC 01 nnnnnnnnnnnnnnnn CALLX_ #abs
---- 1111110 10 0 CCCC 10 nnnnnnnnnnnnnnnn CALLX @rel
---- 1111110 10 0 CCCC 11 nnnnnnnnnnnnnnnn CALLX_ @rel
---- 1111110 10 1 CCCC 00 nnnnnnnnnnnnnnnn CALLXD #abs
---- 1111110 10 1 CCCC 01 nnnnnnnnnnnnnnnn CALLXD_ #abs
---- 1111110 10 1 CCCC 10 nnnnnnnnnnnnnnnn CALLXD @rel
---- 1111110 10 1 CCCC 11 nnnnnnnnnnnnnnnn CALLXD_ @rel
---- 1111110 11 0 CCCC 00 nnnnnnnnnnnnnnnn CALLY #abs
---- 1111110 11 0 CCCC 01 nnnnnnnnnnnnnnnn CALLY_ #abs
---- 1111110 11 0 CCCC 10 nnnnnnnnnnnnnnnn CALLY @rel
---- 1111110 11 0 CCCC 11 nnnnnnnnnnnnnnnn CALLY_ @rel
---- 1111110 11 1 CCCC 00 nnnnnnnnnnnnnnnn CALLYD #abs
---- 1111110 11 1 CCCC 01 nnnnnnnnnnnnnnnn CALLYD_ #abs
---- 1111110 11 1 CCCC 10 nnnnnnnnnnnnnnnn CALLYD @rel
---- 1111110 11 1 CCCC 11 nnnnnnnnnnnnnnnn CALLYD_ @rel
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000000 COGID D (waits for hub) (doesn't write D if WC)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000001 LOCKNEW D (waits for hub)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000010 GETPC D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000011 GETLFSR D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000100 GETCNT D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000101 GETCNTX D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000110 GETACAL D (waits for mac)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000111 GETACAH D (waits for mac)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001000 GETACBL D (waits for mac)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001001 GETACBH D (waits for mac)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001010 GETPTRA D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001011 GETPTRB D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001100 GETPTRX D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001101 GETPTRY D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001110 SERINA D (waits for rx if single-task, loops if multi-task, releases if WC)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001111 SERINB D (waits for rx if single-task, loops if multi-task, releases if WC)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010000 GETMULL D (waits for mul if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010001 GETMULH D (waits for mul if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010010 GETDIVQ D (waits for div if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010011 GETDIVR D (waits for div if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010100 GETSQRT D (waits for sqrt if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010101 GETQX D (waits for cordic if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010110 GETQY D (waits for cordic if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010111 GETQZ D (waits for cordic if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011000 GETPHSA D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011001 GETPHZA D (clears phsa)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011010 GETCOSA D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011011 GETSINA D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011100 GETPHSB D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011101 GETPHZB D (clears phsb)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011110 GETCOSB D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011111 GETSINB D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100000 PUSHZC D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100001 POPZC D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100010 SUBCNT D (subtracts D from CNT, then CNTX if same thread)
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100011 GETPIX D (takes 3 clocks, needs 3 clocks per two prior stages, no condition allowed)
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100100 BINBCD D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100101 BCDBIN D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100110 BINGRY D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100111 GRYBIN D (waits one clock)
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000101000 ESWAP4 D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000101001 ESWAP8 D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000101010 SEUSSF D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000101011 SEUSSR D
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 000101100 INCD D (D += $200)
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 000101101 DECD D (D -= $200)
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 000101110 INCDS D (D += $201)
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 000101111 DECDS D (D -= $201)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000110000 POP D (pops from task's tiny stack)
--L- 1111111 00 L CCCC DDDDDDDDD 001iiiiii REPD D/#1..512,#1..64 (REPD $1FF,#1..64 = infinite repeat, can use REPD #i)
--L- 1111111 00 L CCCC DDDDDDDDD 010000000 CLKSET D/# (waits for hub)
--L- 1111111 00 L CCCC DDDDDDDDD 010000001 COGSTOP D/# (waits for hub)
-CL- 1111111 0C L CCCC DDDDDDDDD 010000010 LOCKSET D/# (waits for hub)
-CL- 1111111 0C L CCCC DDDDDDDDD 010000011 LOCKCLR D/# (waits for hub)
--L- 1111111 00 L CCCC DDDDDDDDD 010000100 LOCKRET D/# (waits for hub)
--L- 1111111 00 L CCCC DDDDDDDDD 010000101 RDWIDEC D/PTRA/PTRB (waits for hub if cache miss)
--L- 1111111 00 L CCCC DDDDDDDDD 010000110 RDWIDE D/PTRA/PTRB (waits for hub)
--L- 1111111 00 L CCCC DDDDDDDDD 010000111 WRWIDE D/PTRA/PTRB (waits for hub)
ZCL- 1111111 ZC L CCCC DDDDDDDDD 010001000 GETP D/# (pin into !Z/C via WZ/WC)
ZCL- 1111111 ZC L CCCC DDDDDDDDD 010001001 GETNP D/# (pin into Z/!C via WZ/WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001010 SEROUTA D/# (waits for tx if single-task, loops if multi-task, releases if WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001011 SEROUTB D/# (waits for tx if single-task, loops if multi-task, releases if WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001100 CMPCNT D/# (subtracts D from CNT, then CNTX if same thread)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001101 WAITPX D/# (waits for any edge, +CNT if WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001110 WAITPR D/# (waits for pos edge, +CNT if WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001111 WAITPF D/# (waits for neg edge, +CNT if WC)
ZCL- 1111111 ZC L CCCC DDDDDDDDD 010010000 SETZC D/# (D[1:0] into Z/C via WZ/WC)
--L- 1111111 00 L CCCC DDDDDDDDD 010010001 SETMAP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010010010 SETXCH D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010010011 SETTASK D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010010100 SETRACE D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010010101 SARACCA D/# (waits for mac)
--L- 1111111 00 L CCCC DDDDDDDDD 010010110 SARACCB D/# (waits for mac)
--L- 1111111 00 L CCCC DDDDDDDDD 010010111 SARACCS D/# (waits for mac)
--L- 1111111 00 L CCCC DDDDDDDDD 010011000 SETPTRA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011001 SETPTRB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011010 ADDPTRA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011011 ADDPTRB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011100 SUBPTRA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011101 SUBPTRB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011110 SETWIDE D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011111 SETWIDZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100000 SETPTRX D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100001 SETPTRY D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100010 ADDPTRX D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100011 ADDPTRY D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100100 SUBPTRX D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100101 SUBPTRY D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100110 PASSCNT D/# (loops if (CNT - D) msb set)
--L- 1111111 00 L CCCC DDDDDDDDD 010100111 WAIT D/# (waits 1+ clocks, 0 same as 1)
--L- 1111111 00 L CCCC DDDDDDDDD 010101000 OFFP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101001 NOTP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101010 CLRP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101011 SETP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101100 SETPC D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101101 SETPNC D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101110 SETPZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101111 SETPNZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110000 DIV64D D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110001 SQRT32 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110010 QLOG D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110011 QEXP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110100 SETQI D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110101 SETQZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110110 CFGDACS D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110111 SETDACS D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111000 CFGDAC0 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111001 CFGDAC1 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111010 CFGDAC2 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111011 CFGDAC3 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111100 SETDAC0 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111101 SETDAC1 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111110 SETDAC2 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111111 SETDAC3 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000000 SETCTRA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000001 SETWAVA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000010 SETFRQA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000011 SETPHSA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000100 ADDPHSA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000101 SUBPHSA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000110 SETVID D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000111 SETVIDY D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001000 SETCTRB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001001 SETWAVB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001010 SETFRQB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001011 SETPHSB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001100 ADDPHSB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001101 SUBPHSB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001110 SETVIDI D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001111 SETVIDQ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010000 SETPIX D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010001 SETPIXZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010010 SETPIXU D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010011 SETPIXV D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010100 SETPIXA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010101 SETPIXR D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010110 SETPIXG D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010111 SETPIXB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011000 SETPORA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011001 SETPORB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011010 SETPORC D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011011 SETPORD D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011100 PUSH D/# (pushes into task's tiny stack)
--R- 1111111 00 0 CCCC DDDDDDDDD 011100110 JMPREL D
--R- 1111111 00 0 CCCC DDDDDDDDD 011100111 JMPRELD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101000 JMP D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101001 JMP_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101010 JMPD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101011 JMPD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101100 CALL D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101101 CALL_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101110 CALLD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101111 CALLD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110000 CALLA D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110001 CALLA_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110010 CALLAD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110011 CALLAD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110100 CALLB D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110101 CALLB_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110110 CALLBD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110111 CALLBD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111000 CALLX D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111001 CALLX_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111010 CALLXD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111011 CALLXD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111100 CALLY D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111101 CALLY_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111110 CALLYD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111111 CALLYD_ D
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000000 RETA
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000001 RETAD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000010 RETB
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000011 RETBD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000100 RETX
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000101 RETXD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000110 RETY
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000111 RETYD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001000 RET
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001001 RETD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001010 POLCTRA (ctra-rollover into !Z/C)
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001011 POLCTRB (ctrb-rollover into !Z/C)
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001100 POLVID (vid-ready into !Z/C)
---- 1111111 00 x CCCC xxxxxxxxx 100001101 CAPCTRA
---- 1111111 00 x CCCC xxxxxxxxx 100001110 CAPCTRB
---- 1111111 00 x CCCC xxxxxxxxx 100001111 CAPCTRS
---- 1111111 00 x CCCC xxxxxxxxx 100010000 CACHEX
---- 1111111 00 x CCCC xxxxxxxxx 100010001 CLRACCA
---- 1111111 00 x CCCC xxxxxxxxx 100010010 CLRACCB
---- 1111111 00 x CCCC xxxxxxxxx 100010011 CLRACCS
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100010100 CHKPTRX
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100010101 CHKPTRY
---- 1111111 00 x CCCC xxxxxxxxx 100010110 SYNCTRA (waits for ctra if single-task, loops if multi-task)
---- 1111111 00 x CCCC xxxxxxxxx 100010111 SYNCTRB (waits for ctrb if single-task, loops if multi-task)
---- 1111111 00 x CCCC xxxxxxxxx 100011000 SETPIXW
x = don't care, use 0
---------------------------------------------------------------------------------------------------------------------
Z effect
------------------------------------------------------------------------------------------
0 <none>
1 wz
C effect
------------------------------------------------------------------------------------------
0 <none>
1 wc
L DDDDDDDDD destination operand
------------------------------------------------------------------------------------------
0/na DDDDDDDDD register
1 #DDDDDDDDD immediate, zero-extended
I SSSSSSSSS source operand
------------------------------------------------------------------------------------------
0/na SSSSSSSSS register
1 #SSSSSSSSS immediate, zero-extended
CCCC condition (easier-to-read list)
------------------------------------------------------------------------------------------
0000 never 1111 always (default)
0001 nc & nz 1100 if_c if_b
0010 nc & z 0011 if_nc if_ae
0011 nc 1010 if_z if_e
0100 c & nz 0101 if_nz if_ne
0101 nz 1000 if_c_and_z if_z_and_c
0110 c <> z 0100 if_c_and_nz if_nz_and_c
0111 nc | nz 0010 if_nc_and_z if_z_and_nc
1000 c & z 0001 if_nc_and_nz if_nz_and_nc if_a
1001 c = z 1110 if_c_or_z if_z_or_c if_be
1010 z 1101 if_c_or_nz if_nz_or_c
1011 nc | z 1011 if_nc_or_z if_z_or_nc
1100 c 0111 if_nc_or_nz if_nz_or_nc
1101 c | nz 1001 if_c_eq_z if_z_eq_c
1110 c | z 0110 if_c_ne_z if_z_ne_c
1111 always 0000 never
CCCC inda/indb - CCCC=1111 after stage 2 of pipeline if inda/indb used (indx=inda/indb)
------------------------------------------------------------------------------------------
xx00 source indx
xx01 source indx++
xx10 source indx--
xx11 source ++indx
00xx destination indx
01xx destination indx++
10xx destination indx--
11xx destination ++indx
I'm getting all these changes into PNut.exe now. It's taking a while because the assembler must be made to work in hub space, plus all the branches work differently now.
Even though it doesn't say so, I assume that all of these instructions wait for a hub slot. Is that correct? Or is there some sort of data cache in between to allow them to continue without waiting?
Comments
Good point.
I'll review some PASM source and see how often this poses a problem for self-modifying code.
Self-modifying code would still be possible? It would just need an abs-to-rel patch added if the user only has absolute address values - one line of PASM?
Obviously not by default, but certainly most people will use the -O options on their release code like any sensible coder.