BTW: The assembler is simple and expressive enough that people can code directly in hex, mix mnemonics, labels, and whatever else they want in one nice file, and specify data in a lot of easy, readable ways.
That's fantastic! People can use whatever expression makes the most sense to them at the time, with no worries.
MOV [ptr.D++],[ptr.S++]
becomes 2 instructions...
ALTDS D,#rrr_ddd_sss
MOV 0-0,0-0
so a REP instruction is likely to cause the programmer to introduce a bug by forgetting the inserted instruction in the count.
You cannot perform MOVI/MOVD/MOVS to a hub instruction, only a cog instruction.
But
:_LoopS REP #count,#:_LoopE-:_LoopS
MOV [ptr.D++],[ptr.S++]
.. other code
:_LoopE
overcomes this problem.
BTW, no compiler mod is required for this (the REP instruction, I mean). It should handle the #:_LoopE-:_LoopS expression AFAIK.
So jmg and potatohead can have it whichever way they like.
Now, instead of having three dead-simple instructions (MOV, MOVI, MOVS), which by the way have about a decade of common and well-understood use, you want to load it all up under MOV, which then becomes one harder-to-understand thing.
- only I did not say anywhere that those opcodes would be removed, so your whole point is moot.
In P2 there was a strict requirement of a preamble/pipeline delay (unclear if that is still the case in P1+?).
Which is why the dual-label form, with labels at the actual REP start and REP end, allows the assembler to check that the programmer is meeting the 'fine print', without the programmer having to remember all the fine print.
(i.e. the trivial stuff the PC should be doing)
Importantly, the simplest of edits (inserting, removing, or commenting out lines) remain safe to do.
This surely gives #2 (+ size of the other code) for the REP instruction count. I would have done it differently, which would give #1 (+ size of the other code) for the REP instruction count.
Loopy, I think most of us need the OBEX. If you really want to kill P1+ just tell everybody they have to program it in Forth. I do plan on porting pfth or Fast to the P1+, but I view that as more of an academic exercise. The real work will be done in Spin, C and PASM.
Noted, but I have all along thought that Forth would provide a lot of users with useful understanding of the new Propeller.
I really can't add much constructively to this 16-cog, 512KB, 64 ADC/DAC chip but my enthusiasm. It is exciting news, as it will allow a lot more to get done with just one chip. SOC chips are never going to be anything but an accessory to the Propeller, certainly not a direct competitor -- but other chips that are comparable have long offered more memory for larger programs.
Those chips can also load Forth, but on only one CPU. Forth with 8 cpus is much faster. With 16 cpus, even better. I guess I am just excited to have this coming soon. It will be a good thing for GCC as well.
I need to read a bunch of preceding posts to know what has been floated, but I think we only need ONE instruction to handle the whole indirect mechanism:
ALTDS D,S/# - Selectively alter D and S fields in next instruction by using D as dual 9-bit pointers with S/# specifying the mode
S/# = %ddd_sss
ddd = 000: don't alter D field of next instruction
ddd = 001: alter D field of next instruction by substituting current D[17:9]
ddd = 010: <some mode we could define>
ddd = 011: <some mode we could define>
ddd = 100: alter D field of next instruction by substituting current D[17:9], increment current D[17:9]
ddd = 101: alter D field of next instruction by substituting current D[17:9], decrement current D[17:9]
ddd = 110: alter D field of next instruction by substituting current D[17:9]+1, increment current D[17:9]
ddd = 111: alter D field of next instruction by substituting current D[17:9]-1, decrement current D[17:9]
sss = 000: don't alter S field of next instruction
sss = 001: alter S field of next instruction by substituting current D[8:0]
sss = 010: <some mode we could define>
sss = 011: <some mode we could define>
sss = 100: alter S field of next instruction by substituting current D[8:0], increment current D[8:0]
sss = 101: alter S field of next instruction by substituting current D[8:0], decrement current D[8:0]
sss = 110: alter S field of next instruction by substituting current D[8:0]+1, increment current D[8:0]
sss = 111: alter S field of next instruction by substituting current D[8:0]-1, decrement current D[8:0]
Usage could be made simple by the assembler:
MOVS ptr,#from
MOVD ptr,#to
REP #count,#2
MOV [ptr++],[ptr++] 'move using pointers in ptr, this is actually two instructions: ALTDS and MOV
We've only got six bits specified for S/# in ALTDS, so we can use the three bits above to specify write-register alteration, with D[31:23] serving as the pointer for write redirection:
ALTDS D,S/#
S/# = %rrr_ddd_sss
rrr = same as ddd/sss, but uses D[31:23] as a write redirection pointer.
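To make the field substitution concrete, here is a minimal C sketch of the ddd/sss behavior described above, treating the next instruction as a 32-bit word with D in bits 17:9 and S in bits 8:0. The rrr write-redirection field and the undefined modes are left out, and none of this is actual hardware; it is just an illustration.

#include <stdint.h>

/* Sketch of ALTDS ddd/sss substitution (rrr omitted).
   'ptr' is the ALTDS D register: bits 17:9 hold the D pointer,
   bits 8:0 hold the S pointer.  'mode' is %ddd_sss.             */
static uint32_t altds(uint32_t *ptr, uint32_t mode, uint32_t next_inst)
{
    uint32_t dptr = (*ptr >> 9) & 0x1FF;      /* current D[17:9] */
    uint32_t sptr = *ptr & 0x1FF;             /* current D[8:0]  */
    uint32_t ddd  = (mode >> 3) & 7;
    uint32_t sss  = mode & 7;

    switch (ddd) {
    case 1: next_inst = (next_inst & ~(0x1FFu << 9)) | (dptr << 9); break;
    case 4: next_inst = (next_inst & ~(0x1FFu << 9)) | (dptr << 9); dptr++; break;
    case 5: next_inst = (next_inst & ~(0x1FFu << 9)) | (dptr << 9); dptr--; break;
    case 6: next_inst = (next_inst & ~(0x1FFu << 9)) | (((dptr + 1) & 0x1FF) << 9); dptr++; break;
    case 7: next_inst = (next_inst & ~(0x1FFu << 9)) | (((dptr - 1) & 0x1FF) << 9); dptr--; break;
    default: break;                           /* 0 = don't alter; 2/3 left undefined here */
    }
    switch (sss) {
    case 1: next_inst = (next_inst & ~0x1FFu) | sptr; break;
    case 4: next_inst = (next_inst & ~0x1FFu) | sptr; sptr++; break;
    case 5: next_inst = (next_inst & ~0x1FFu) | sptr; sptr--; break;
    case 6: next_inst = (next_inst & ~0x1FFu) | ((sptr + 1) & 0x1FF); sptr++; break;
    case 7: next_inst = (next_inst & ~0x1FFu) | ((sptr - 1) & 0x1FF); sptr--; break;
    default: break;
    }
    *ptr = (*ptr & ~0x3FFFFu) | ((dptr & 0x1FF) << 9) | (sptr & 0x1FF);
    return next_inst;
}

The point is simply that one register carries two independent 9-bit pointer fields, each of which can be substituted into the next instruction and optionally pre/post incremented or decremented.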
Will each task have its own state for the associated data?
Chris Wardell
Either every task must have its own set of state data for ALTDS, or else we have just one set of state data that remembers what task is using it. I think the latter might be fine.
Another thing: RDxxxxC aren't going to work anymore because there's no time to interpret the opcode and substitute the DCACHE address into the D field. This is the exact same dilemma that INDA/INDB suffered from, and that ALTDS gets us around. We have 4-register transfers now via RDQUAD and WRQUAD, but we've lost the convenience of RDxxxxC.
To work around this (and not lose byte code interpreter / word code interpreter speedup due to cache) how about:
movbf dest wc
movwf dest wc
that walks the quad d-cache setting C if it goes past the end, at which time it resets to point at the start of the dcache?
inner loop becomes:
init: rdquad pcode
' rest of init code
next: movbf opcode wc ' basically round-robin read of bytes in quad, wraps, sets C when wrapping
if_c rdlong pcode++ ' only executed 1/32 of the time
.. decode instructions
jmp #next
Prop style:
ALT D,S/#
INST D,S/#
Three-operand style:
ALT A,B/#B
INST C,#MODE
- C is the destination, A & B are sources
- A & B are specified earlier, so there is more time to do the op
- MODE could be 10 bits, because it would always be immediate, so the I bit can be re-purposed
Proposed MODE encoding:
%AABBCC_NNNN
Where:
AA/BB/CC:
00=use AA/BB/CC directly, no offset, no updating
01=add NNNN to AA/BB/CC before using pointer, DO NOT update register
10=add NNNN to AA/BB/CC before using pointer, update register
11=add NNNN to AA/BB/CC after using pointer, update register
An alternate encoding for NNNN leaves 8 more possible modes:
SNNN = 0, +1, +2, +4, +16, -1, -2, -4, -16
For two-op instructions, e.g.
MOV C, A
RDxxx C,A
perhaps B could be used as an optional index?
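As a rough illustration of the %AABBCC_NNNN idea (the function name and types here are invented, not part of any proposal), one 2-bit pointer field might decode like this in C:

#include <stdint.h>

/* Sketch of decoding one 2-bit field (AA, BB or CC) of %AABBCC_NNNN.
   'reg' is the pointer register; returns the effective address.      */
static uint32_t apply_mode(uint32_t *reg, unsigned field2, int32_t nnnn)
{
    uint32_t ea;
    switch (field2) {
    case 0:  ea = *reg;                  break; /* 00: use directly, no offset, no update */
    case 1:  ea = *reg + nnnn;           break; /* 01: add NNNN before use, don't update  */
    case 2:  ea = *reg + nnnn; *reg = ea; break; /* 10: add NNNN before use, update       */
    default: ea = *reg; *reg += nnnn;    break; /* 11: add NNNN after use, update         */
    }
    return ea;
}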
Do we even need indirect access for RDxxxxC? The S field isn't immediate. If you need indirect addressing for a RDxxxxC, just use
mov temp, [inda]
rdbytec x, temp
If someone tries rdbytec x, [inda], just say it's undefined behavior (or make the indirection just not happen).
electrodude
Good idea. The quad address could be expressed in D and then an index that is reset by RD/WRQUAD would pick the byte/word/long, setting C when rollover occurs.
I like the +1/-1 without affecting the D field, for ALTDS.
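A minimal C model of the movbf idea combined with the reset-on-RDQUAD index, assuming a 16-byte quad cache (the names and details here are illustrative only, not a spec):

#include <stdint.h>
#include <stdbool.h>

/* A 4-long (16-byte) quad cache walked byte-by-byte.  'idx' is reset
   to 0 by RDQUAD; movbf returns the next byte and sets C when the
   index wraps back to the start of the cache.                        */
typedef struct { uint8_t quad[16]; unsigned idx; } quad_cache_t;

static void rdquad(quad_cache_t *q, const uint8_t *hub_src)
{
    for (int i = 0; i < 16; i++) q->quad[i] = hub_src[i];
    q->idx = 0;                    /* RD/WRQUAD resets the walk index */
}

static uint8_t movbf(quad_cache_t *q, bool *c)
{
    uint8_t b = q->quad[q->idx++];
    *c = (q->idx == 16);           /* C set when the last byte is consumed... */
    if (*c) q->idx = 0;            /* ...and the index wraps to the start     */
    return b;
}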
I was thinking about everything with P16X512, and took a mental step back.
1) As far as I understand it, the issues with INDA/INDB/RDxxxC are simply not having enough time in the two cycle instructions to do all the work in time (without lots of complexity, or big drop in clock speed).
2) I've caught myself making calculation mistakes due to the 200MHz/100MIPS dichotomy, with the 2+ cycle instructions
I think there may be a solution that does not involve prefix instructions and simplifies things... but I don't know how much it affects the Verilog or gate count, so please let me know.
Here is what I think may work better:
We always talk in terms of clock cycles. No more cycle count confusion. We already have hub instructions etc that take more than 2 cycles - and as long as we know the cycle count, it is still deterministic
INDA/INDB:
Add a clock cycle. It should be easy to detect whether an instruction refers to the INDA/INDB registers, so add an index-compute cycle. Heck, add two if needed, because we save a LOT of memory using one op instead of two.
RDxxxxC:
Add a clock cycle. Still cheaper than adding more instructions to do the same thing, still faster. Hopefully that cycle is enough in case INDx is also used, but if not, add another cycle. Still saves a lot of memory.
Much faster than losing cached data reads, and takes less memory than my movf-reads-quad suggestion.
The point is that even if some instruction cycle counts increase (when using INDx or RDxxxxC), it will still be faster than not having them or adding instructions... and use a lot less memory, so more fits in a P16X512.
Q1) Am I right in thinking that, as things stand, cores running out of hub ram will run at 50 MIPS and that all 16 cores will be able to run at that rate simultaneously?
Q2) And that when running from registers they will run at 100MIPS?
Finding the current spec is getting harder by the day
I believe this is correct assuming the clock rate is 200 MHz. The 50 MIPS for hubex is for straight-line code, where all 4 longs in a quad are executed.
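For what it's worth, here is the arithmetic that reproduces those figures, assuming a 200 MHz clock, 2 clocks per cog instruction, and one 4-long quad fetched per 16-clock hub slot (these are assumptions, not official spec):

#include <stdio.h>

int main(void)
{
    const double clk_mhz  = 200.0;                /* assumed system clock           */
    const double cog_mips = clk_mhz / 2.0;        /* 2 clocks per cog instruction   */
    const double hub_mips = (clk_mhz / 16.0) * 4; /* one 4-long quad per 16-clock
                                                     hub slot, straight-line code   */
    printf("cog: %.0f MIPS, hubexec: %.0f MIPS\n", cog_mips, hub_mips);
    return 0;   /* prints: cog: 100 MIPS, hubexec: 50 MIPS */
}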
One question I have about HUBEXEC, apologies in advance if it's been explained and I've missed it in the masses of the various threads.
Say the first of the 4 instructions in the quad is a RDQUAD: does it have to wait for the next slot, then read the desired quad, then re-read the initial instruction's quad to continue, or is it still in cache somewhere?
I guess that's two questions in one lol
So to clarify... the two questions are :-
1. Assuming the HUBEXEC read takes the hub RAM slot for that cog: if any of those 4 instructions uses a HUB-OP, does it have to wait for the next free slot? (i.e. if it's the first instruction, does it have to wait for 3 instructions to do the HUB-OP, thus delaying execution, albeit still deterministically?)
2. If you have a HUB-OP in one of the four instructions, does it need to re-read the quad of instructions, or are they cached, even if one of those instructions read a quad in the meantime?
Thought of this as a possible way to implement Bill's suggestion in #830.
If an opcode has sufficient space to contain all the data needed for an operation, but timing prevents doing the operation in 2 clocks, what about having instructions that execute by taking two trips through the 'pipeline'?
The first trip through does part of the processing; the second completes it.
More detail to follow...
C.W.
The problem with adding a 3rd clock to indirect instructions is that it takes us right back to the INDA/INDB situation where we have to analyze the instruction data currently being read, in order to issue an optional change-of-D-register before the next clock. That just tacks time onto the clock cycle.
We can design all the logic so that it is faster than the RAMs, the RAMs being things we cannot make go any faster - they can define the critical path, while we stay out of the way. These RAMs can actually clock at 250MHz+ and if we can keep logic out of their paths, we can easily go there.
If feasible, it would be nice not to require prefix instructions and not to lose RDxxxxC and INDA/INDB, for performance and code density.
If not feasible, then it is not feasible
Don't the D/S addresses already have to be checked for the "special register range" for multiplexing special registers instead of the shadow registers (currently used for cache if I correctly recall)? Does that not give you what you would need for an optional change-of-D?
The reason I'd hate to lose INDx is that it makes for much faster table lookup code, cog-based stacks, etc., than not having it. If it can't be done in 2 cycles, 3 or 4 is still far preferable to self-modifying code in 99% of possible cases.
Same for RDxxxC - if two clocks is not feasible, even if it had to go to 4 clocks, it is still much better than 16 clocks for the next hub cycle.
Mind you, resurrecting your movef{b/w/l} on the quad can take care of the RDxxxC cases, at the expense of more complicated code and slightly lower code density, and an ALT variation can substitute for INDx at the expense of memory and speed.
Only you (and deep diving your Verilog) can figure out the best option
p.s.
The modified ALT would be a very handy addition to INDA/INDB, as it would effectively provide MANY additional (slower) IND registers.
The trouble is, we get only one early shot at reading D and S registers. To make everything go as fast as the RAMs, we need to feed the instruction data bits coming out of the RAM straight back into the address inputs. There is a mux there, of course, to accommodate the two phases of memory access, but its selector is ready long before the data passes through. To do some logic based on the instruction bits, then drive a mux (which also needs buffering and takes time) with the result, would be very slow. The special registers, on the other hand, are mux'd after D and S are read.
At one point you, or maybe it was Ken, suggested that you might make the RTL for P1 available after P2 shipped. Now it seems that the RTL for P1+ is going to be an extension of the RTL for P1. Do you still plan to release any RTL, either before or after you ship the next chip? Did you by any chance archive the RTL for P1 before you started morphing it into P1+ or P2 or whatever the chip being described in this thread will be called?
Thanks,
David
Thanks Chip, I am learning a lot from you about the guts of P16x512 !!
Well, ALT is a nice instruction, and movf{b,w,l} with the mod addressing of the quad will allow pretty good performance, and we will have 512KB to play with
This is intended as a possibly simple way to implement instructions that would have issues due to the timing problem Chip mentioned: the instruction would basically loop to itself one time, doing a different operation depending on whether it is the first or second execution.
So for an INDA/INDB situation, the first pass would compute the actual D/S values based on the indirect values.
The second pass would then use the computed D/S values.
Without knowing the implementation details of the P1+ it is hard to suggest an actual implementation, but conceptually I see something like this:
- An instruction using INDA/INDB is executed.
- Since this uses INDA/INDB it is treated as a two phase instruction.
- Phase one computes the absolute values and places an absolute version of the instruction in an alternate instruction register and does not increment the PC.
- The instruction in the alternate instruction register is fetched instead of from memory and is executed like normal.
C.W.
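A very rough C sketch of that two-pass flow, just to make the control sequence concrete; the register addresses, field layout, and helper names here are invented for illustration and are not actual P1+ details:

#include <stdint.h>
#include <stdbool.h>

#define INDA_ADDR 0x1F6u   /* hypothetical register addresses, illustration only */
#define INDB_ADDR 0x1F7u

typedef struct {
    uint32_t ram[512];     /* cog register space               */
    uint32_t pc, inda, indb;
    uint32_t alt_ir;       /* alternate instruction register   */
    bool     alt_valid;
} cog_t;

static uint32_t ind_value(const cog_t *c, uint32_t field)
{
    if (field == INDA_ADDR) return c->inda & 0x1FF;
    if (field == INDB_ADDR) return c->indb & 0x1FF;
    return field;
}

static bool uses_ind(uint32_t inst)
{
    uint32_t d = (inst >> 9) & 0x1FF, s = inst & 0x1FF;
    return d == INDA_ADDR || d == INDB_ADDR || s == INDA_ADDR || s == INDB_ADDR;
}

static void execute(cog_t *c, uint32_t inst) { (void)c; (void)inst; } /* ALU/memory path omitted in this sketch */

/* Phase 1 resolves INDA/INDB into an absolute copy of the instruction and
   holds the PC; phase 2 executes that copy and then advances the PC.      */
static void step(cog_t *c)
{
    if (c->alt_valid) {                               /* second pass        */
        execute(c, c->alt_ir);
        c->alt_valid = false;
        c->pc++;
        return;
    }
    uint32_t inst = c->ram[c->pc];
    if (uses_ind(inst)) {                             /* first pass         */
        uint32_t d = ind_value(c, (inst >> 9) & 0x1FF);
        uint32_t s = ind_value(c, inst & 0x1FF);
        c->alt_ir = (inst & ~0x3FFFFu) | (d << 9) | s;
        c->alt_valid = true;                          /* PC not incremented */
    } else {
        execute(c, inst);
        c->pc++;
    }
}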