...
All branches can access all of hub space. The caveat is the 9-bit immediate-address branches like DJNZ; they'll become relative branches in hub exec mode. JMPSW (was JMPRET) always stores the return address in D, but can only reach all of hub address space using the S register.
...
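A minimal sketch of what that difference means for a DJNZ, assuming the 9-bit S field holds an absolute cog register address in cog mode but a signed instruction offset in hub exec mode (the exact offset convention, and the acc/count registers, are assumptions, not from the post above):

:loop   rdlongc acc, ptra++           ' cog mode: the S field of the DJNZ below holds the
        djnz    count, #:loop         '   absolute 9-bit cog address of :loop

        ' hub exec mode (sketch): the same 9-bit field would be taken as a relative
        ' offset, roughly (:loop - $ - 1), emitted by the assembler instead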
Would it make more sense for the cases like DJNZ to always be relative in both COG and HUB EXEC modes?
It's still deterministic, anyway, with any number of threads in use:
"Thread scheduling is a simple round robin process with each active thread being executed in the next system clock cycle. This gives the appearance of up to eight concurrent threads per XCore. All threads are independent and have equal priority meaning that each task always receives a guaranteed minimum number of MIPS; this is central to building deterministic and responsive systems."
Yes and no. The caveat is that you have to know how many threads the system will launch, to know how much time you have.
That means any 'late changes' to code can bite big time.
The safest way to manage unknown later additions is to always run the highest thread count, i.e. with some threads as dummy time-swallowers.
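The same guard can be applied to a P2 cog's hardware tasks. A minimal sketch, assuming SETTASK takes a long of sixteen 2-bit time-slot assignments (the %%-quaternary format below is an assumption from earlier P2 material, not from this thread):

        settask taskmap               ' lock the schedule at four tasks from day one,
                                      '   so late additions can't change anyone's timing
dummy   jmp     #dummy                ' dummy 'time-swallower' task

taskmap long    %%3210_3210_3210_3210 ' tasks 0..3 each own every fourth slot (assumed format)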
Everything can be simulated and timed very accurately, enabling suitable hardware to be selected, with the xSOFTip Explorer software. It's very easy to add additional chips, if more processing power is required.
Chip,
It appears everyone agrees that multiple tasks using hubexec mode are just not going to work because of caching etc. Wouldn't it therefore be much simpler to not permit multiple hubexec tasks (in a single cog), and instead allocate any/all cache to a single hubexec task?
This would not prevent three other cog-mode tasks from running alongside the hubexec task anyway?
Seems as though we are overcomplicating the whole hubexec mode for something we will never use (multiple hubexec tasks in a single cog).
Would it make more sense for the cases like DJNZ to always be relative in both COG and HUB EXEC modes?
They could be. The drawback for cog mode would be that you couldn't just write an address field into an instruction like you can now using SETS. You would have to pre-compute an address relative to the instruction that will be modified.
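For reference, a sketch of the cog-mode idiom being preserved here (jump_ins, new_target, and the placement of the patch are made up for illustration):

patch   sets    jump_ins, new_target  ' write an absolute 9-bit cog address into the S field
        ...
jump_ins djnz   count, #0-0           ' placeholder S field, filled in by the SETS above
'
' if DJNZ were always relative, the patch would instead have to pre-compute an offset,
' e.g.  sub new_target, #jump_ins+1  before the SETS (exact convention is an assumption)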
Chip,
It appears everyone agrees that multiple tasks using hubexec mode are just not going to work because of caching etc. Wouldn't it therefore be much simpler to not permit multiple hubexec tasks (in a single cog), and instead allocate any/all cache to a single hubexec task?
This would not prevent three other cog-mode tasks from running alongside the hubexec task anyway?
Seems as though we are overcomplicating the whole hubexec mode for something we will never use (multiple hubexec tasks in a single cog).
I understand the consensus. If those 4 cache lines use an LRU algorithm, though, it won't matter if one hub task is running or four are. In the case of four, the cache lines would effectively be distributed among the tasks, still giving decent performance.
Everything can be simulated and timed very accurately, enabling suitable hardware to be selected, with the xSOFTip Explorer software. It's very easy to add additional chips, if more processing power is required.
That 'everything can be simulated and timed very accurately' again reminds me of the important missing piece of the P2 puzzle: a good simulator.
I believe the high-end HDL simulators work by compiling the VHDL/Verilog into C (or similar) and then running that.
So they effectively 'build a new simulator' on every run.
It would be very nice if the P2 simulator core could be auto-built this way, from the HDL.
Has anyone seen that done? Would it be fast enough?
That 'everything can be simulated and timed very accurately' again reminds me of the important missing piece of the P2 puzzle: a good simulator.
I believe the high-end HDL simulators work by compiling the VHDL/Verilog into C (or similar) and then running that.
So they effectively 'build a new simulator' on every run.
It would be very nice if the P2 simulator core could be auto-built this way, from the HDL.
Has anyone seen that done? Would it be fast enough?
That's a lot of logic to simulate. It seems a behavioral model would be a lot more efficient, but take a custom effort.
They could be. The drawback for cog mode would be that you couldn't just write an address field into an instruction like you can now using SETS. You would have to pre-compute an address relative to the instruction that will be modified.
Good point!
I wonder how many cases of self-modifying code change a DJNZ-style instruction. It's normally the JMPRETs that are modified (the RET versions).
That's a lot of logic to simulate. It seems a behavioral model would be a lot more efficient, but take a custom effort.
There is strong appeal to deriving the simulator core from the HDL, as that avoids the raft of 'not quite correct' divergence issues that can plague simulators, and it has more chance of simulating the peripherals and all the HW interactions.
Seems there are Icarus, Veriwell, and Veripool, with Veripool sounding the fastest and best supported?
http://opencores.org/opencores,tools
http://www.veripool.org/wiki/veripool/Verilog_Simulator_Benchmarks
Bill,
Why are you insisting on ignoring the parts of my message that say leaf functions are often NOT large numbers of cycles? You keep saying they will have hundreds or thousands of cycles, but in practice on real C/C++ code I see many leaf functions that are much smaller (think in the 10s of cycles or less). So if you can accept that leaf functions can be very small, then you can see that the overhead can be much larger than 1-3%.
I understand the consensus. If those 4 cache lines use an LRU algorithm, though, it won't matter if one hub task is running or four are. In the case of four, the cache lines would effectively be distributed among the tasks, still giving decent performance.
I was thinking overnight that an instruction that controlled the instruction cache fetching might be an advantage...
My thoughts were the Instruction Cache would be say 4 lines. It would start out with the first wide loaded into line0. Then automatically the next sequential line1 would be fetched while the 8 longs in line 0 were being executed (in hubexec mode). As line1 was being executed, line 2 would be fetched. Then line 3 and then back to line 0.
This would be controlled by a small state m/c fetching the wide hub lines into the 4 cache lines.
Now, what would be nice is that the program could control the state m/c by executing a "fetch" instruction that says "fetch the next 3 lines". This would take 1 clock to set the state m/c.
'the following is the first instruction cache line0 already fetched and now executing in hubexec mode...
$2200: FETCH $4000,#3 'fetch the next 3 wides into the cache from hub $4000
$2204: hubexec code
$2208: hubexec code
$220C: hubexec code
$2210: hubexec code
$2214: hubexec code - this is going to call a routine at hub $4000
$2218: hubexec code
$221C: hubexec code
'while the above was executing, the state m/c fetches 3 lines from hub $4000 - because we know that will be required next.
I have no idea about how complex this may be. I just thought it could be simpler than LRU, and with smart programming, the program could control what was loaded into the instruction cache and avoid stalls by cache misses.
Good point!
I wonder how many cases of self-modifying code change a DJNZ-style instruction. It's normally the JMPRETs that are modified (the RET versions).
A general solution should be checked.
I can see DJNZ would be useful for timeout/watchdog style case statements/state machines.
The jump is patched, and usually taken; some state kicks the DJNZ variable, and if the code never crosses a kick, it will time out eventually.
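A sketch of that watchdog pattern (the labels, the TIMEOUT constant, and the state_next register are made up for illustration):

kick    mov     wd_count, #TIMEOUT    ' an event 'kicks' the countdown back to full
        sets    wd_jmp, state_next    ' and patches in the next state handler
        ...
wd_jmp  djnz    wd_count, #0-0        ' usually taken, to whatever state was patched in
        jmp     #expired              ' countdown ran out with no kicks: timeout path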
I was thinking overnight that an instruction that controlled the instruction cache fetching might be an advantage...
My thoughts were the Instruction Cache would be say 4 lines. It would start out with the first wide loaded into line0. Then automatically the next sequential line1 would be fetched while the 8 longs in line 0 were being executed (in hubexec mode). As line1 was being executed, line 2 would be fetched. Then line 3 and then back to line 0.
This would be controlled by a small state m/c fetching the wide hub lines into the 4 cache lines.
Now, what would be nice is that the program could control the state m/c by executing a "fetch" instruction that says "fetch the next 3 lines". This would take 1 clock to set the state m/c.
'the following is the first instruction cache line0 already fetched and now executing in hubexec mode...
$2200: FETCH $4000,#3 'fetch the next 3 wides into the cache from hub $4000
$2204: hubexec code
$2208: hubexec code
$220C: hubexec code
$2210: hubexec code
$2214: hubexec code - this is going to call a routine at hub $4000
$2218: hubexec code
$221C: hubexec code
'while the above was executing, the state m/c fetches 3 lines from hub $4000 - because we know that will be required next.
I have no idea about how complex this may be. I just thought it could be simpler than LRU, and with smart programming, the program could control what was loaded into the instruction cache and avoid stalls by cache misses.
Cluso99
IMHO, it would be excellent to have this feature for a handcrafted PASM program.
As for automatically generated code, I believe the linker could be crafted to check for straight-through code pieces that are branched to from other ones, and decide where to insert the instruction and how many lines of cache (up to the maximum of three) to request in advance.
Though I'm unsure how it could be crafted to handle conditional execution at all.
Also, if the code piece now being loaded contains another advance cache-line request of its own, and so on... how could we ensure that any automated means, at link time, selects the best path anyway?
If only we could do real-time analysis on the incoming cache lines as they are fetched, to avoid conditionals recursively trashing them too often...
Perhaps a P3 feature....
I did not ignore, I disagreed. I even posted about a minimal (very dumb) leaf function that just executed "a = b"; perhaps you missed it in the flurry of postings.
Please show me a useful, non-trivial leaf function's disassembly from the test P2 gcc. Note, the function should not have any decorations, and should be for LMM mode (i.e. a cog-only-mode example is not valid); furthermore, include the function prologue and epilogue code. I am working on some other products right now, and do not currently have the p2test branch installed, so I do not have the time to generate the sample function and disassembled code.
I'll count the cycles, and then we will know what the percentage is.
Given what I remember seeing about the prologue and epilogue code, it consumed ~16*8 cycles each, so about 256 cycles total - without even counting what the function did, or calling the function, or returning from it:
an average of 4 cycles, divided by 256 cycles, is roughly a 1.6% slowdown. Worst case, about 3%.
If the function does actual work, and counting the call/return, we should be below 1%.
Please note, it is not fair to add attributes or command line switches to ensure that gcc does not generate prologue/epilogue code.
If the code generator, without attributes or command line switches, can on its own minimize the prologue/epilogue code, then yes, for an utterly trivial (by which I mean very-few-cycle) leaf function, the four-cycle delay could be a higher percentage.
To have a 10% impact on the performance of the function, the whole function, including prologue and epilogue, including calling and return, and any hub access cycles it performs, including looping, would have to complete in 40 clock cycles.
It is literally impossible to have a 2x-4x slowdown on the whole program when using a hub stack for leaf functions.
Bill,
Why are you insisting on ignoring the parts of my message that say leaf functions are often NOT large numbers of cycles? You keep saying they will have hundreds or thousands of cycles, but in practice on real C/C++ code I see many leaf functions that are much smaller (think in the 10s of cycles or less). So if you can accept that leaf functions can be very small, then you can see that the overhead can be much larger than 1-3%.
I did not ignore, I disagreed. I even posted about a minimal (very dumb) leaf function that just executed "a = b"; perhaps you missed it in the flurry of postings.
Please show me a useful, non-trivial leaf function's disassembly from the test P2 gcc. Note, the function should not have any decorations, and should be for LMM mode (i.e. a cog-only-mode example is not valid); furthermore, include the function prologue and epilogue code. I am working on some other products right now, and do not currently have the p2test branch installed, so I do not have the time to generate the sample function and disassembled code.
I'll count the cycles, and then we will know what the percentage is.
Given what I remember seeing about the prologue and epilogue code, it consumed ~16*8 cycles each, so about 256 cycles total - without even counting what the function did, or calling the function, or returning from it:
an average of 4 cycles, divided by 256 cycles, is roughly a 1.6% slowdown. Worst case, about 3%.
If the function does actual work, and counting the call/return, we should be below 1%.
Please note, it is not fair to add attributes or command line switches to ensure that gcc does not generate prologue/epilogue code.
If the code generator, without attributes or command line switches, can on its own minimize the prologue/epilogue code, then yes, for an utterly trivial (by which I mean very-few-cycle) leaf function, the four-cycle delay could be a higher percentage.
To have a 10% impact on the performance of the function, the whole function, including prologue and epilogue, including calling and return, and any hub access cycles it performs, including looping, would have to complete in 40 clock cycles.
It is literally impossible to have a 2x-4x slowdown on the whole program when using a hub stack for leaf functions.
I'm not sure why we're rehashing this. Chip has already proposed a couple of ways to handle this. I assume he'll look over our comments and choose one. Sounds like we'll get this in some form or other.
Bill,
My assumptions are based on optimized code that typically comes out of GCC, which often trims prologue and epilogue away and can result in functions that are only a small number of instructions. If the PropGCC compiler doesn't do even the most basic optimizations, then you are correct and the impact will always be small. If PropGCC does do these optimizations, then leaf functions can be quite small and result in much larger impact from the overhead.
If PropGCC does not do these optimizations, then they should, because the result will be MUCH smaller and MUCH faster code across the board. I find it hard to imagine that it's not doing them...
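For contrast with the prologue-heavy case discussed above, a hypothetical fully optimized leaf might be little more than its body; a sketch (register names and calling convention are assumptions, not actual PropGCC output):

get_x   rdlong  r0, x_addr            ' do the work: fetch one hub long into the result register
        ret                           ' return immediately: no registers saved, no hub stack traffic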
Bill,
My assumptions are based on optimized code that typically comes out of GCC, which often trims prologue and epilogue away and can result in functions that are only a small number of instructions. If the PropGCC compiler doesn't do even the most basic optimizations, then you are correct and the impact will always be small. If PropGCC does do these optimizations, then leaf functions can be quite small and result in much larger impact from the overhead.
If PropGCC does not do these optimizations, then they should, because the result will be MUCH smaller and MUCH faster code across the board. I find it hard to imagine that it's not doing them...
PropGCC does not do optimization by default but does if you specify an option. We usually use -Os but -O2 may generate faster code in some cases.
Bill,
My assumptions are based on optimized code that typically comes out of GCC, which often trims prologue and epilogue away and can result in functions that are only a small number of instructions. If the PropGCC compiler doesn't do even the most basic optimizations, then you are correct and the impact will always be small. If PropGCC does do these optimizations, then leaf functions can be quite small and result in much larger impacts.
If PropGCC does not do these optimizations, then they should, because the result will be MUCH smaller and MUCH faster code across the board. I find it hard to imagine that it's not doing them...
My numbers were based on examining the propgcc-generated code directly. Every register saved to the hub stack takes 8+ cycles; every one restored takes somewhat less if RDxxxC is used (about 4).
Say a leaf only saved/restored 4 registers: that would be 4*(8+4) cycles right there. Add the call and the return and we are at a minimum of 51 cycles. That's without the function doing anything. Any non-trivial function will take ~50+ cycles (hub access, loops, calculation - the exact usage is irrelevant), bringing us to 100 cycles minimum, even using RDLONGC, even for a very lightweight leaf function.
4/100 = 4%
Most leaf functions will do a lot more, for a much smaller percentage.
The truly trivial leaf functions I expect GCC is smart enough to inline (e.g. the "a = b" example); str* and mem* will take less code space inlined and be faster. I think GCC is smart enough to automatically inline such tiny functions.
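To make those numbers concrete, a sketch of the save/restore being counted (hypothetical register names; the per-access cycle notes are the figures quoted above, not measurements):

        wrlong  r8,  ptra++           ' prologue: ~8+ cycles per save, hub-slot dependent
        wrlong  r9,  ptra++
        wrlong  r10, ptra++
        wrlong  r11, ptra++
        ...                           ' function body
        rdlongc r11, --ptra           ' epilogue: ~4 cycles per restore when the line is cached
        rdlongc r10, --ptra
        rdlongc r9,  --ptra
        rdlongc r8,  --ptra
        ret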
You keep saying they will have hundreds or thousands of cycles, but in practice on real C/C++ code I see many leaf functions that are much smaller (think in the 10s of cycles or less). So if you can accept that leaf functions can be very small, then you can see that the overhead can be much larger than 1-3%.
That was not a technical argument. Again.
I back up my technical discussions, and I am happy to accept arguments that are backed up with technically accurate data, which is what I asked for in my previous post.
I am actually interested in how good the code is with -O ... what I did not want to see was something like -fomit_prologue -fomit_epilog (I am certain I have the wrong names for the options)
I know Chip is working on a solution, and I think all of us will like it. But I cannot let personal attacks, or technically incorrect responses to my posts go.
Bill,
I'm done trying to talk to you. It never does anything but waste time for both of us. We'll see how things shake out of all this when Chip delivers the final results. Have a nice day.
There are now 16-bit immediate jumps and calls that can be relative or absolute. The jumps and calls that end in an underscore ("_") toggle hub execution mode. If you are running in the cog, a CALL_ #address will jump to hub memory. When that routine does a RET, it will return to cog memory. It works the other way, too. A CALL or JMP without an underscore stays in the cog or hub.
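A minimal sketch of that toggling (hubsub is a made-up label assumed to sit in hub memory):

        ' running in cog memory
        call_   #hubsub               ' underscore form: branch to hub memory and enter hub exec
        ...                           ' RET brings execution back here, in cog mode
'
hubsub  ' running in hub memory, hub exec mode
        ...
        ret                           ' restores {hubmode,Z,C,PC}, so we land back in the cog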
The JMPSW/JMPSWD instruction can be used to switch among threads. It will store {hubmode,Z,C,PC} into D and load {hubmode,Z,C,PC} from S. So, it tracks threads wherever they are executing. All CALLs and RETs save and restore {hubmode,Z,C,PC}. PC is 16 bits so that it can span the entire 64K longs in the 256KB hub memory.
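A sketch of cooperative switching with it, in the spirit of P1 JMPRET coroutines (a_state/b_state are made-up longs; it is assumed the saved PC sits in the low bits, so a plain entry address works as an initial value):

threadA ...
        jmpsw   a_state, b_state      ' park A's {hubmode,Z,C,PC} in a_state, resume B
        jmp     #threadA

threadB ...                           ' threadB could just as well live in hub memory
        jmpsw   b_state, a_state      ' park B, resume A right after its JMPSW
        jmp     #threadB

a_state long    0
b_state long    threadB               ' initial entry point for B, flags and hubmode clear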
Here is the list (Prop2_Instructions_12_17_13.txt):
ZCDS (for D column: W=write, M=modify, R=read, L=read/immediate)
---------------------------------------------------------------------------------------------------------------------
ZCWS 0000000 ZC I CCCC DDDDDDDDD SSSSSSSSS RDBYTE D,S/PTRA/PTRB (waits for hub)
ZCWS 0000001 ZC I CCCC DDDDDDDDD SSSSSSSSS RDBYTEC D,S/PTRA/PTRB (waits for hub if cache miss)
ZCWS 0000010 ZC I CCCC DDDDDDDDD SSSSSSSSS RDWORD D,S/PTRA/PTRB (waits for hub)
ZCWS 0000011 ZC I CCCC DDDDDDDDD SSSSSSSSS RDWORDC D,S/PTRA/PTRB (waits for hub if cache miss)
ZCWS 0000100 ZC I CCCC DDDDDDDDD SSSSSSSSS RDLONG D,S/PTRA/PTRB (waits for hub)
ZCWS 0000101 ZC I CCCC DDDDDDDDD SSSSSSSSS RDLONGC D,S/PTRA/PTRB (waits for hub if cache miss)
ZCWS 0000110 ZC I CCCC DDDDDDDDD SSSSSSSSS RDAUX D,S/#0..$FF/PTRX/PTRY
ZCWS 0000111 ZC I CCCC DDDDDDDDD SSSSSSSSS RDAUXR D,S/#0..$FF/PTRX/PTRY
ZCMS 0001000 ZC I CCCC DDDDDDDDD SSSSSSSSS ISOB D,S/#
ZCMS 0001001 ZC I CCCC DDDDDDDDD SSSSSSSSS NOTB D,S/#
ZCMS 0001010 ZC I CCCC DDDDDDDDD SSSSSSSSS CLRB D,S/#
ZCMS 0001011 ZC I CCCC DDDDDDDDD SSSSSSSSS SETB D,S/#
ZCMS 0001100 ZC I CCCC DDDDDDDDD SSSSSSSSS SETBC D,S/#
ZCMS 0001101 ZC I CCCC DDDDDDDDD SSSSSSSSS SETBNC D,S/#
ZCMS 0001110 ZC I CCCC DDDDDDDDD SSSSSSSSS SETBZ D,S/#
ZCMS 0001111 ZC I CCCC DDDDDDDDD SSSSSSSSS SETBNZ D,S/#
ZCMS 0010000 ZC I CCCC DDDDDDDDD SSSSSSSSS ANDN D,S/#
ZCMS 0010001 ZC I CCCC DDDDDDDDD SSSSSSSSS AND D,S/#
ZCMS 0010010 ZC I CCCC DDDDDDDDD SSSSSSSSS OR D,S/#
ZCMS 0010011 ZC I CCCC DDDDDDDDD SSSSSSSSS XOR D,S/#
ZCMS 0010100 ZC I CCCC DDDDDDDDD SSSSSSSSS MUXC D,S/#
ZCMS 0010101 ZC I CCCC DDDDDDDDD SSSSSSSSS MUXNC D,S/#
ZCMS 0010110 ZC I CCCC DDDDDDDDD SSSSSSSSS MUXZ D,S/#
ZCMS 0010111 ZC I CCCC DDDDDDDDD SSSSSSSSS MUXNZ D,S/#
ZCMS 0011000 ZC I CCCC DDDDDDDDD SSSSSSSSS ROR D,S/#
ZCMS 0011001 ZC I CCCC DDDDDDDDD SSSSSSSSS ROL D,S/#
ZCMS 0011010 ZC I CCCC DDDDDDDDD SSSSSSSSS SHR D,S/#
ZCMS 0011011 ZC I CCCC DDDDDDDDD SSSSSSSSS SHL D,S/#
ZCMS 0011100 ZC I CCCC DDDDDDDDD SSSSSSSSS RCR D,S/#
ZCMS 0011101 ZC I CCCC DDDDDDDDD SSSSSSSSS RCL D,S/#
ZCMS 0011110 ZC I CCCC DDDDDDDDD SSSSSSSSS SAR D,S/#
ZCMS 0011111 ZC I CCCC DDDDDDDDD SSSSSSSSS REV D,S/#
ZCWS 0100000 ZC I CCCC DDDDDDDDD SSSSSSSSS MOV D,S/#
ZCWS 0100001 ZC I CCCC DDDDDDDDD SSSSSSSSS NOT D,S/#
ZCWS 0100010 ZC I CCCC DDDDDDDDD SSSSSSSSS ABS D,S/#
ZCWS 0100011 ZC I CCCC DDDDDDDDD SSSSSSSSS NEG D,S/#
ZCWS 0100100 ZC I CCCC DDDDDDDDD SSSSSSSSS NEGC D,S/#
ZCWS 0100101 ZC I CCCC DDDDDDDDD SSSSSSSSS NEGNC D,S/#
ZCWS 0100110 ZC I CCCC DDDDDDDDD SSSSSSSSS NEGZ D,S/#
ZCWS 0100111 ZC I CCCC DDDDDDDDD SSSSSSSSS NEGNZ D,S/#
ZCMS 0101000 ZC I CCCC DDDDDDDDD SSSSSSSSS ADD D,S/#
ZCMS 0101001 ZC I CCCC DDDDDDDDD SSSSSSSSS SUB D,S/#
ZCMS 0101010 ZC I CCCC DDDDDDDDD SSSSSSSSS ADDX D,S/#
ZCMS 0101011 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBX D,S/#
ZCMS 0101100 ZC I CCCC DDDDDDDDD SSSSSSSSS ADDS D,S/#
ZCMS 0101101 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBS D,S/#
ZCMS 0101110 ZC I CCCC DDDDDDDDD SSSSSSSSS ADDSX D,S/#
ZCMS 0101111 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBSX D,S/#
ZCMS 0110000 ZC I CCCC DDDDDDDDD SSSSSSSSS SUMC D,S/#
ZCMS 0110001 ZC I CCCC DDDDDDDDD SSSSSSSSS SUMNC D,S/#
ZCMS 0110010 ZC I CCCC DDDDDDDDD SSSSSSSSS SUMZ D,S/#
ZCMS 0110011 ZC I CCCC DDDDDDDDD SSSSSSSSS SUMNZ D,S/#
ZCMS 0110100 ZC I CCCC DDDDDDDDD SSSSSSSSS MIN D,S/#
ZCMS 0110101 ZC I CCCC DDDDDDDDD SSSSSSSSS MAX D,S/#
ZCMS 0110110 ZC I CCCC DDDDDDDDD SSSSSSSSS MINS D,S/#
ZCMS 0110111 ZC I CCCC DDDDDDDDD SSSSSSSSS MAXS D,S/#
ZCMS 0111000 ZC I CCCC DDDDDDDDD SSSSSSSSS ADDABS D,S/#
ZCMS 0111001 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBABS D,S/#
ZCMS 0111010 ZC I CCCC DDDDDDDDD SSSSSSSSS INCMOD D,S/#
ZCMS 0111011 ZC I CCCC DDDDDDDDD SSSSSSSSS DECMOD D,S/#
ZCMS 0111100 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPSUB D,S/#
ZCMS 0111101 ZC I CCCC DDDDDDDDD SSSSSSSSS SUBR D,S/#
ZCMS 0111110 ZC I CCCC DDDDDDDDD SSSSSSSSS MUL D,S/# (waits one clock)
ZCMS 0111111 ZC I CCCC DDDDDDDDD SSSSSSSSS SCL D,S/# (waits one clock)
ZCWS 1000000 ZC I CCCC DDDDDDDDD SSSSSSSSS DECOD2 D,S/#
ZCWS 1000001 ZC I CCCC DDDDDDDDD SSSSSSSSS DECOD3 D,S/#
ZCWS 1000010 ZC I CCCC DDDDDDDDD SSSSSSSSS DECOD4 D,S/#
ZCWS 1000011 ZC I CCCC DDDDDDDDD SSSSSSSSS DECOD5 D,S/#
Z-WS 1000100 Z0 I CCCC DDDDDDDDD SSSSSSSSS ENCOD D,S/#
Z-WS 1000100 Z1 I CCCC DDDDDDDDD SSSSSSSSS BLMASK D,S/#
Z-WS 1000101 Z0 I CCCC DDDDDDDDD SSSSSSSSS ONECNT D,S/# (waits one clock)
Z-WS 1000101 Z1 I CCCC DDDDDDDDD SSSSSSSSS ZERCNT D,S/# (waits one clock)
-CWS 1000110 0C I CCCC DDDDDDDDD SSSSSSSSS INCPAT D,S/#
-CWS 1000110 1C I CCCC DDDDDDDDD SSSSSSSSS DECPAT D,S/#
--WS 1000111 00 I CCCC DDDDDDDDD SSSSSSSSS SPLITB D,S/# (also MERGEN)
--WS 1000111 01 I CCCC DDDDDDDDD SSSSSSSSS MERGEB D,S/# (also SPLITN)
--WS 1000111 10 I CCCC DDDDDDDDD SSSSSSSSS SPLITW D,S/#
--WS 1000111 11 I CCCC DDDDDDDDD SSSSSSSSS MERGEW D,S/#
--MS 10010nn n0 I CCCC DDDDDDDDD SSSSSSSSS GETNIB D,S/#,#0..7
--MS 10010nn n1 I CCCC DDDDDDDDD SSSSSSSSS SETNIB D,S/#,#0..7
--MS 1001100 n0 I CCCC DDDDDDDDD SSSSSSSSS GETWORD D,S/#,#0..1
--MS 1001100 n1 I CCCC DDDDDDDDD SSSSSSSSS SETWORD D,S/#,#0..1
--MS 1001101 00 I CCCC DDDDDDDDD SSSSSSSSS STWORDS D,S/#
--MS 1001101 01 I CCCC DDDDDDDDD SSSSSSSSS ROLNIB D,S/#
--MS 1001101 10 I CCCC DDDDDDDDD SSSSSSSSS ROLBYTE D,S/#
--MS 1001101 11 I CCCC DDDDDDDDD SSSSSSSSS ROLWORD D,S/#
--MS 1001110 00 I CCCC DDDDDDDDD SSSSSSSSS SETS D,S/#
--MS 1001110 01 I CCCC DDDDDDDDD SSSSSSSSS SETD D,S/#
--MS 1001110 10 I CCCC DDDDDDDDD SSSSSSSSS SETX D,S/#
--MS 1001110 11 I CCCC DDDDDDDDD SSSSSSSSS SETI D,S/#
-CMS 1001111 0C I CCCC DDDDDDDDD SSSSSSSSS COGNEW D,S/# (waits for hub)
-CMS 1001111 1C I CCCC DDDDDDDDD SSSSSSSSS WAITCNT D,S/# (waits for CNT, +CNTX if WC)
--MS 101000n n0 I CCCC DDDDDDDDD SSSSSSSSS GETBYTE D,S/#,#0..3
--MS 101000n n1 I CCCC DDDDDDDDD SSSSSSSSS SETBYTE D,S/#,#0..3
--WS 1010010 00 I CCCC DDDDDDDDD SSSSSSSSS STBYTES D,S/#
--MS 1010010 01 I CCCC DDDDDDDDD SSSSSSSSS SWBYTES D,S/# (switch/copy bytes in D, S = %11_10_01_00 = D same)
--MS 1010010 10 I CCCC DDDDDDDDD SSSSSSSSS PACKRGB D,S/# (S 8:8:8 -> D 5:5:5 << 16 | D >> 16)
--WS 1010010 11 I CCCC DDDDDDDDD SSSSSSSSS UNPKRGB D,S/# (S 5:5:5 -> D 8:8:8)
--MS 1010011 00 I CCCC DDDDDDDDD SSSSSSSSS ADDPIX D,S/# (waits one clock)
--MS 1010011 01 I CCCC DDDDDDDDD SSSSSSSSS MULPIX D,S/# (waits one clock)
--MS 1010011 10 I CCCC DDDDDDDDD SSSSSSSSS BLNPIX D,S/# (waits one clock)
--MS 1010011 11 I CCCC DDDDDDDDD SSSSSSSSS MIXPIX D,S/# (waits one clock)
ZCMS 1010100 ZC I CCCC DDDDDDDDD SSSSSSSSS JMPSW D,S/#
ZCMS 1010101 ZC I CCCC DDDDDDDDD SSSSSSSSS JMPSWD D,S/#
--MS 1010110 00 I CCCC DDDDDDDDD SSSSSSSSS IJZ D,S/#
--MS 1010110 01 I CCCC DDDDDDDDD SSSSSSSSS IJZD D,S/#
--MS 1010110 10 I CCCC DDDDDDDDD SSSSSSSSS IJNZ D,S/#
--MS 1010110 11 I CCCC DDDDDDDDD SSSSSSSSS IJNZD D,S/#
--MS 1010111 00 I CCCC DDDDDDDDD SSSSSSSSS DJZ D,S/#
--MS 1010111 01 I CCCC DDDDDDDDD SSSSSSSSS DJZD D,S/#
--MS 1010111 10 I CCCC DDDDDDDDD SSSSSSSSS DJNZ D,S/#
--MS 1010111 11 I CCCC DDDDDDDDD SSSSSSSSS DJNZD D,S/#
ZCRS 1011000 ZC I CCCC DDDDDDDDD SSSSSSSSS TESTB D,S/#
ZCRS 1011001 ZC I CCCC DDDDDDDDD SSSSSSSSS TESTN D,S/#
ZCRS 1011010 ZC I CCCC DDDDDDDDD SSSSSSSSS TEST D,S/#
ZCRS 1011011 ZC I CCCC DDDDDDDDD SSSSSSSSS CMP D,S/#
ZCRS 1011100 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPX D,S/#
ZCRS 1011101 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPS D,S/#
ZCRS 1011110 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPSX D,S/#
ZCRS 1011111 ZC I CCCC DDDDDDDDD SSSSSSSSS CMPR D,S/#
--RS 11000nn n0 I CCCC DDDDDDDDD SSSSSSSSS COGINIT D,S/#,#0..7 (waits for hub) (SETNIB :coginit,cog,#6)
---S 11000nn n1 I CCCC nnnnnnnnn SSSSSSSSS WAITVID #0..$DFF,S/# (waits for vid if single-task, loops if multi-task)
--RS 1100011 11 I CCCC DDDDDDDDD SSSSSSSSS WAITVID D,S/# (waits for vid if single-task, loops if multi-task)
-CRS 110010n nC I CCCC DDDDDDDDD SSSSSSSSS WAITPEQ D,S/#,#0..3 (waits for pins, plus CNT if WC)
-CRS 110011n nC I CCCC DDDDDDDDD SSSSSSSSS WAITPNE D,S/#,#0..3 (waits for pins, plus CNT if WC)
--LS 1101000 0L I CCCC DDDDDDDDD SSSSSSSSS WRBYTE D/#,S/PTRA/PTRB (waits for hub)
--LS 1101000 1L I CCCC DDDDDDDDD SSSSSSSSS WRWORD D/#,S/PTRA/PTRB (waits for hub)
--LS 1101001 0L I CCCC DDDDDDDDD SSSSSSSSS WRLONG D/#,S/PTRA/PTRB (waits for hub)
--LS 1101001 1L I CCCC DDDDDDDDD SSSSSSSSS FRAC D/#,S/#
--LS 1101010 0L I CCCC DDDDDDDDD SSSSSSSSS WRAUX D/#,S/#0..$FF/PTRX/PTRY
--LS 1101010 1L I CCCC DDDDDDDDD SSSSSSSSS WRAUXR D/#,S/#0..$FF/PTRX/PTRY
--LS 1101011 0L I CCCC DDDDDDDDD SSSSSSSSS SETACCA D/#,S/#
--LS 1101011 1L I CCCC DDDDDDDDD SSSSSSSSS SETACCB D/#,S/#
--LS 1101100 0L I CCCC DDDDDDDDD SSSSSSSSS MACA D/#,S/#
--LS 1101100 1L I CCCC DDDDDDDDD SSSSSSSSS MACB D/#,S/#
--LS 1101101 0L I CCCC DDDDDDDDD SSSSSSSSS MUL32 D/#,S/#
--LS 1101101 1L I CCCC DDDDDDDDD SSSSSSSSS MUL32U D/#,S/#
--LS 1101110 0L I CCCC DDDDDDDDD SSSSSSSSS DIV32 D/#,S/#
--LS 1101110 1L I CCCC DDDDDDDDD SSSSSSSSS DIV32U D/#,S/#
--LS 1101111 0L I CCCC DDDDDDDDD SSSSSSSSS DIV64 D/#,S/#
--LS 1101111 1L I CCCC DDDDDDDDD SSSSSSSSS DIV64U D/#,S/#
--LS 1110000 0L I CCCC DDDDDDDDD SSSSSSSSS SQRT64 D/#,S/#
--LS 1110000 1L I CCCC DDDDDDDDD SSSSSSSSS QSINCOS D/#,S/#
--LS 1110001 0L I CCCC DDDDDDDDD SSSSSSSSS QARCTAN D/#,S/#
--LS 1110001 1L I CCCC DDDDDDDDD SSSSSSSSS QROTATE D/#,S/#
--LS 1110010 0L I CCCC DDDDDDDDD SSSSSSSSS SETSERA D/#,S/# (config,baud)
--LS 1110010 1L I CCCC DDDDDDDDD SSSSSSSSS SETSERB D/#,S/# (config,baud)
--LS 1110011 0L I CCCC DDDDDDDDD SSSSSSSSS SETCTRS D/#,S/# (ctrb,ctra)
--LS 1110011 1L I CCCC DDDDDDDDD SSSSSSSSS SETWAVS D/#,S/# (ctrb,ctra)
--LS 1110100 0L I CCCC DDDDDDDDD SSSSSSSSS SETFRQS D/#,S/# (ctrb,ctra)
--LS 1110100 1L I CCCC DDDDDDDDD SSSSSSSSS SETPHSS D/#,S/# (ctrb,ctra)
--LS 1110101 0L I CCCC DDDDDDDDD SSSSSSSSS ADDPHSS D/#,S/# (ctrb,ctra)
--LS 1110101 1L I CCCC DDDDDDDDD SSSSSSSSS SUBPHSS D/#,S/# (ctrb,ctra)
--LS 1110110 0L I CCCC DDDDDDDDD SSSSSSSSS JP D/#,S/#
--LS 1110110 1L I CCCC DDDDDDDDD SSSSSSSSS JPD D/#,S/#
--LS 1110111 0L I CCCC DDDDDDDDD SSSSSSSSS JNP D/#,S/#
--LS 1110111 1L I CCCC DDDDDDDDD SSSSSSSSS JNPD D/#,S/#
--LS 111100n nL I CCCC DDDDDDDDD SSSSSSSSS CFGPINS D/#,S/#,#0..2 (waits for alt)
--LS 1111001 1L I CCCC DDDDDDDDD SSSSSSSSS JMPTASK D/#,S/# (mode:mask,address)
--LS 1111010 0L I CCCC DDDDDDDDD SSSSSSSSS SETXFR D/#,S/#
--LS 1111010 1L I CCCC DDDDDDDDD SSSSSSSSS SETMIX D/#,S/#
--LS 1111011 0L I CCCC DDDDDDDDD SSSSSSSSS <empty> D/#,S/#
--LS 1111011 1L I CCCC DDDDDDDDD SSSSSSSSS <empty> D/#,S/#
--RS 1111100 00 I CCCC DDDDDDDDD SSSSSSSSS JZ D,S/#
--RS 1111100 01 I CCCC DDDDDDDDD SSSSSSSSS JZD D,S/#
--RS 1111100 10 I CCCC DDDDDDDDD SSSSSSSSS JNZ D,S/#
--RS 1111100 11 I CCCC DDDDDDDDD SSSSSSSSS JNZD D,S/#
---- 1111101 00 n nnnn nnnnnnnnn nnnnnnnnn AUGI #23bits (appends n to upper bits of next S or D immediate)
---- 1111101 01 0 nnnn nnnnnnnnn nnniiiiii REPS #1..$10000,#1..64
---- 1111101 01 1 BBAA ddddddddd sssssssss FIXINDA #d,#s / FIXINDB #d,#s / FIXINDS #d,#s / SETINDA #s / SETINDB #d / SETINDS #d,#s
---- 1111101 10 0 CCCC 00 nnnnnnnnnnnnnnnn JMP #abs
---- 1111101 10 0 CCCC 01 nnnnnnnnnnnnnnnn JMP_ #abs
---- 1111101 10 0 CCCC 10 nnnnnnnnnnnnnnnn JMP @rel
---- 1111101 10 0 CCCC 11 nnnnnnnnnnnnnnnn JMP_ @rel
---- 1111101 10 1 CCCC 00 nnnnnnnnnnnnnnnn JMPD #abs
---- 1111101 10 1 CCCC 01 nnnnnnnnnnnnnnnn JMPD_ #abs
---- 1111101 10 1 CCCC 10 nnnnnnnnnnnnnnnn JMPD @rel
---- 1111101 10 1 CCCC 11 nnnnnnnnnnnnnnnn JMPD_ @rel
---- 1111101 11 0 CCCC 00 nnnnnnnnnnnnnnnn CALL #abs
---- 1111101 11 0 CCCC 01 nnnnnnnnnnnnnnnn CALL_ #abs
---- 1111101 11 0 CCCC 10 nnnnnnnnnnnnnnnn CALL @rel
---- 1111101 11 0 CCCC 11 nnnnnnnnnnnnnnnn CALL_ @rel
---- 1111101 11 1 CCCC 00 nnnnnnnnnnnnnnnn CALLD #abs
---- 1111101 11 1 CCCC 01 nnnnnnnnnnnnnnnn CALLD_ #abs
---- 1111101 11 1 CCCC 10 nnnnnnnnnnnnnnnn CALLD @rel
---- 1111101 11 1 CCCC 11 nnnnnnnnnnnnnnnn CALLD_ @rel
---- 1111110 00 0 CCCC 00 nnnnnnnnnnnnnnnn CALLA #abs
---- 1111110 00 0 CCCC 01 nnnnnnnnnnnnnnnn CALLA_ #abs
---- 1111110 00 0 CCCC 10 nnnnnnnnnnnnnnnn CALLA @rel
---- 1111110 00 0 CCCC 11 nnnnnnnnnnnnnnnn CALLA_ @rel
---- 1111110 00 1 CCCC 00 nnnnnnnnnnnnnnnn CALLAD #abs
---- 1111110 00 1 CCCC 01 nnnnnnnnnnnnnnnn CALLAD_ #abs
---- 1111110 00 1 CCCC 10 nnnnnnnnnnnnnnnn CALLAD @rel
---- 1111110 00 1 CCCC 11 nnnnnnnnnnnnnnnn CALLAD_ @rel
---- 1111110 01 0 CCCC 00 nnnnnnnnnnnnnnnn CALLB #abs
---- 1111110 01 0 CCCC 01 nnnnnnnnnnnnnnnn CALLB_ #abs
---- 1111110 01 0 CCCC 10 nnnnnnnnnnnnnnnn CALLB @rel
---- 1111110 01 0 CCCC 11 nnnnnnnnnnnnnnnn CALLB_ @rel
---- 1111110 01 1 CCCC 00 nnnnnnnnnnnnnnnn CALLBD #abs
---- 1111110 01 1 CCCC 01 nnnnnnnnnnnnnnnn CALLBD_ #abs
---- 1111110 01 1 CCCC 10 nnnnnnnnnnnnnnnn CALLBD @rel
---- 1111110 01 1 CCCC 11 nnnnnnnnnnnnnnnn CALLBD_ @rel
---- 1111110 10 0 CCCC 00 nnnnnnnnnnnnnnnn CALLX #abs
---- 1111110 10 0 CCCC 01 nnnnnnnnnnnnnnnn CALLX_ #abs
---- 1111110 10 0 CCCC 10 nnnnnnnnnnnnnnnn CALLX @rel
---- 1111110 10 0 CCCC 11 nnnnnnnnnnnnnnnn CALLX_ @rel
---- 1111110 10 1 CCCC 00 nnnnnnnnnnnnnnnn CALLXD #abs
---- 1111110 10 1 CCCC 01 nnnnnnnnnnnnnnnn CALLXD_ #abs
---- 1111110 10 1 CCCC 10 nnnnnnnnnnnnnnnn CALLXD @rel
---- 1111110 10 1 CCCC 11 nnnnnnnnnnnnnnnn CALLXD_ @rel
---- 1111110 11 0 CCCC 00 nnnnnnnnnnnnnnnn CALLY #abs
---- 1111110 11 0 CCCC 01 nnnnnnnnnnnnnnnn CALLY_ #abs
---- 1111110 11 0 CCCC 10 nnnnnnnnnnnnnnnn CALLY @rel
---- 1111110 11 0 CCCC 11 nnnnnnnnnnnnnnnn CALLY_ @rel
---- 1111110 11 1 CCCC 00 nnnnnnnnnnnnnnnn CALLYD #abs
---- 1111110 11 1 CCCC 01 nnnnnnnnnnnnnnnn CALLYD_ #abs
---- 1111110 11 1 CCCC 10 nnnnnnnnnnnnnnnn CALLYD @rel
---- 1111110 11 1 CCCC 11 nnnnnnnnnnnnnnnn CALLYD_ @rel
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000000 COGID D (waits for hub) (doesn't write D if WC)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000001 LOCKNEW D (waits for hub)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000010 GETPC D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000011 GETLFSR D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000100 GETCNT D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000101 GETCNTX D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000110 GETACAL D (waits for mac)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000000111 GETACAH D (waits for mac)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001000 GETACBL D (waits for mac)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001001 GETACBH D (waits for mac)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001010 GETPTRA D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001011 GETPTRB D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001100 GETPTRX D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001101 GETPTRY D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001110 SERINA D (waits for rx if single-task, loops if multi-task, releases if WC)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000001111 SERINB D (waits for rx if single-task, loops if multi-task, releases if WC)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010000 GETMULL D (waits for mul if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010001 GETMULH D (waits for mul if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010010 GETDIVQ D (waits for div if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010011 GETDIVR D (waits for div if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010100 GETSQRT D (waits for sqrt if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010101 GETQX D (waits for cordic if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010110 GETQY D (waits for cordic if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000010111 GETQZ D (waits for cordic if single-task, loops if multi-task)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011000 GETPHSA D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011001 GETPHZA D (clears phsa)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011010 GETCOSA D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011011 GETSINA D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011100 GETPHSB D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011101 GETPHZB D (clears phsb)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011110 GETCOSB D
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000011111 GETSINB D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100000 PUSHZC D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100001 POPZC D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100010 SUBCNT D (subtracts D from CNT, then CNTX if same thread)
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100011 GETPIX D (takes 3 clocks, needs 3 clocks per two prior stages, no condition allowed)
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100100 BINBCD D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100101 BCDBIN D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100110 BINGRY D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000100111 GRYBIN D (waits one clock)
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000101000 ESWAP4 D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000101001 ESWAP8 D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000101010 SEUSSF D
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 000101011 SEUSSR D
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 000101100 INCD D (D += $200)
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 000101101 DECD D (D -= $200)
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 000101110 INCDS D (D += $201)
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 000101111 DECDS D (D -= $201)
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 000110000 POP D (pops from task's tiny stack)
--L- 1111111 00 L CCCC DDDDDDDDD 001iiiiii REPD D/#1..512,#1..64 (REPD $1FF,#1..64 = infinite repeat, can use REPD #i)
--L- 1111111 00 L CCCC DDDDDDDDD 010000000 CLKSET D/# (waits for hub)
--L- 1111111 00 L CCCC DDDDDDDDD 010000001 COGSTOP D/# (waits for hub)
-CL- 1111111 0C L CCCC DDDDDDDDD 010000010 LOCKSET D/# (waits for hub)
-CL- 1111111 0C L CCCC DDDDDDDDD 010000011 LOCKCLR D/# (waits for hub)
--L- 1111111 00 L CCCC DDDDDDDDD 010000100 LOCKRET D/# (waits for hub)
--L- 1111111 00 L CCCC DDDDDDDDD 010000101 RDWIDEC D/PTRA/PTRB (waits for hub if cache miss)
--L- 1111111 00 L CCCC DDDDDDDDD 010000110 RDWIDE D/PTRA/PTRB (waits for hub)
--L- 1111111 00 L CCCC DDDDDDDDD 010000111 WRWIDE D/PTRA/PTRB (waits for hub)
ZCL- 1111111 ZC L CCCC DDDDDDDDD 010001000 GETP D/# (pin into !Z/C via WZ/WC)
ZCL- 1111111 ZC L CCCC DDDDDDDDD 010001001 GETNP D/# (pin into Z/!C via WZ/WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001010 SEROUTA D/# (waits for tx if single-task, loops if multi-task, releases if WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001011 SEROUTB D/# (waits for tx if single-task, loops if multi-task, releases if WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001100 CMPCNT D/# (subtracts D from CNT, then CNTX if same thread)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001101 WAITPX D/# (waits for any edge, +CNT if WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001110 WAITPR D/# (waits for pos edge, +CNT if WC)
-CL- 1111111 0C L CCCC DDDDDDDDD 010001111 WAITPF D/# (waits for neg edge, +CNT if WC)
ZCL- 1111111 ZC L CCCC DDDDDDDDD 010010000 SETZC D/# (D[1:0] into Z/C via WZ/WC)
--L- 1111111 00 L CCCC DDDDDDDDD 010010001 SETMAP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010010010 SETXCH D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010010011 SETTASK D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010010100 SETRACE D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010010101 SARACCA D/# (waits for mac)
--L- 1111111 00 L CCCC DDDDDDDDD 010010110 SARACCB D/# (waits for mac)
--L- 1111111 00 L CCCC DDDDDDDDD 010010111 SARACCS D/# (waits for mac)
--L- 1111111 00 L CCCC DDDDDDDDD 010011000 SETPTRA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011001 SETPTRB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011010 ADDPTRA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011011 ADDPTRB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011100 SUBPTRA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011101 SUBPTRB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011110 SETWIDE D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010011111 SETWIDZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100000 SETPTRX D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100001 SETPTRY D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100010 ADDPTRX D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100011 ADDPTRY D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100100 SUBPTRX D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100101 SUBPTRY D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010100110 PASSCNT D/# (loops if (CNT - D) msb set)
--L- 1111111 00 L CCCC DDDDDDDDD 010100111 WAIT D/# (waits 1+ clocks, 0 same as 1)
--L- 1111111 00 L CCCC DDDDDDDDD 010101000 OFFP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101001 NOTP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101010 CLRP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101011 SETP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101100 SETPC D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101101 SETPNC D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101110 SETPZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010101111 SETPNZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110000 DIV64D D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110001 SQRT32 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110010 QLOG D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110011 QEXP D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110100 SETQI D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110101 SETQZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110110 CFGDACS D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010110111 SETDACS D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111000 CFGDAC0 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111001 CFGDAC1 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111010 CFGDAC2 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111011 CFGDAC3 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111100 SETDAC0 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111101 SETDAC1 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111110 SETDAC2 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 010111111 SETDAC3 D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000000 SETCTRA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000001 SETWAVA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000010 SETFRQA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000011 SETPHSA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000100 ADDPHSA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000101 SUBPHSA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000110 SETVID D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011000111 SETVIDY D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001000 SETCTRB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001001 SETWAVB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001010 SETFRQB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001011 SETPHSB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001100 ADDPHSB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001101 SUBPHSB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001110 SETVIDI D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011001111 SETVIDQ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010000 SETPIX D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010001 SETPIXZ D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010010 SETPIXU D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010011 SETPIXV D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010100 SETPIXA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010101 SETPIXR D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010110 SETPIXG D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011010111 SETPIXB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011000 SETPORA D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011001 SETPORB D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011010 SETPORC D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011011 SETPORD D/#
--L- 1111111 00 L CCCC DDDDDDDDD 011011100 PUSH D/# (pushes into task's tiny stack)
--R- 1111111 00 0 CCCC DDDDDDDDD 011100110 JMPREL D
--R- 1111111 00 0 CCCC DDDDDDDDD 011100111 JMPRELD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101000 JMP D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101001 JMP_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101010 JMPD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101011 JMPD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101100 CALL D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101101 CALL_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101110 CALLD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011101111 CALLD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110000 CALLA D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110001 CALLA_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110010 CALLAD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110011 CALLAD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110100 CALLB D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110101 CALLB_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110110 CALLBD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011110111 CALLBD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111000 CALLX D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111001 CALLX_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111010 CALLXD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111011 CALLXD_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111100 CALLY D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111101 CALLY_ D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111110 CALLYD D
--R- 1111111 00 0 CCCC DDDDDDDDD 011111111 CALLYD_ D
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000000 RETA
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000001 RETAD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000010 RETB
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000011 RETBD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000100 RETX
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000101 RETXD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000110 RETY
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100000111 RETYD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001000 RET
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001001 RETD
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001010 POLCTRA (ctra-rollover into !Z/C)
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001011 POLCTRB (ctrb-rollover into !Z/C)
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100001100 POLVID (vid-ready into !Z/C)
---- 1111111 00 x CCCC xxxxxxxxx 100001101 CAPCTRA
---- 1111111 00 x CCCC xxxxxxxxx 100001110 CAPCTRB
---- 1111111 00 x CCCC xxxxxxxxx 100001111 CAPCTRS
---- 1111111 00 x CCCC xxxxxxxxx 100010000 CACHEX
---- 1111111 00 x CCCC xxxxxxxxx 100010001 CLRACCA
---- 1111111 00 x CCCC xxxxxxxxx 100010010 CLRACCB
---- 1111111 00 x CCCC xxxxxxxxx 100010011 CLRACCS
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100010100 CHKPTRX
ZC-- 1111111 ZC x CCCC xxxxxxxxx 100010101 CHKPTRY
---- 1111111 00 x CCCC xxxxxxxxx 100010110 SYNCTRA (waits for ctra if single-task, loops if multi-task)
---- 1111111 00 x CCCC xxxxxxxxx 100010111 SYNCTRB (waits for ctrb if single-task, loops if multi-task)
---- 1111111 00 x CCCC xxxxxxxxx 100011000 SETPIXW
x = don't care, use 0
---------------------------------------------------------------------------------------------------------------------
Z effect
------------------------------------------------------------------------------------------
0 <none>
1 wz
C effect
------------------------------------------------------------------------------------------
0 <none>
1 wc
L DDDDDDDDD destination operand
------------------------------------------------------------------------------------------
0/na DDDDDDDDD register
1 #DDDDDDDDD immediate, zero-extended
I SSSSSSSSS source operand
------------------------------------------------------------------------------------------
0/na SSSSSSSSS register
1 #SSSSSSSSS immediate, zero-extended
CCCC condition (easier-to-read list)
------------------------------------------------------------------------------------------
0000 never 1111 always (default)
0001 nc & nz 1100 if_c if_b
0010 nc & z 0011 if_nc if_ae
0011 nc 1010 if_z if_e
0100 c & nz 0101 if_nz if_ne
0101 nz 1000 if_c_and_z if_z_and_c
0110 c <> z 0100 if_c_and_nz if_nz_and_c
0111 nc | nz 0010 if_nc_and_z if_z_and_nc
1000 c & z 0001 if_nc_and_nz if_nz_and_nc if_a
1001 c = z 1110 if_c_or_z if_z_or_c if_be
1010 z 1101 if_c_or_nz if_nz_or_c
1011 nc | z 1011 if_nc_or_z if_z_or_nc
1100 c 0111 if_nc_or_nz if_nz_or_nc
1101 c | nz 1001 if_c_eq_z if_z_eq_c
1110 c | z 0110 if_c_ne_z if_z_ne_c
1111 always 0000 never
CCCC inda/indb - CCCC=1111 after stage 2 of pipeline if inda/indb used (indx=inda/indb)
------------------------------------------------------------------------------------------
xx00 source indx
xx01 source indx++
xx10 source indx--
xx11 source ++indx
00xx destination indx
01xx destination indx++
10xx destination indx--
11xx destination ++indx
I'm getting all these changes into PNut.exe now. It's taking a while because the assembler must be made to work in hub space, plus all the branches work differently now.
Even though it doesn't say so, I assume that all of these instructions wait for a hub slot. Is that correct? Or is there some sort of data cache in between to allow them to continue without waiting?
Comments
Good point.
I'll review some PASM source and see how often this poses a problem for self-modifying code.
Self-modifying code would still be possible? It would just need an abs-to-rel patch added if the user only has absolute address values - one line of PASM?
Obviously not by default, but certainly most people will use the -O options on their release code like any sensible coder.