Suggestion for an unconditional JMPRET instruction
Dave Hein
Posts: 6,347
I hate to suggest any changes to P2, but I think an unconditional jmpret instruction would be useful for improving pipeline utilization. The trace shown below is from spinsim running the pfth code. The column with the X, I and H characters indicates instructions that are conditionally not executed, invalidated instructions after a jump and hub waits. The hub waits can be reduced by using cached reads and organizing hub access to align with the cog's hub slot. The pipeline invalidates can be reduced by using delayed jumps. However, there are cases where delayed jumps cannot be used, and it would be useful to have an unconditional jump where the pipeline fill would follow the jump target. I'm suggesting a P1-like JMPRET instruction so that returns could be made invalidate-free also. I think this instruction would improve the utilization of the instruction pipeline.
Cog 0: 000000ba 0012 053c0617 rdword 003, 017 wz, rdw[1eba], z = 0, cram[3] = 1a2c
Cog 0: 000000bb 0013 fca80018 X if_z jmp #000, #018
Cog 0: 000000bc 0014 507c2e02 add 017, #002, cram[17] = 1ebc
Cog 0: 000000bd 0015 043c1203 H rdword 009, 003
Cog 0: 000000be 0015 043c1203 H rdword 009, 003
Cog 0: 000000bf 0015 043c1203 H rdword 009, 003
Cog 0: 000000c0 0015 043c1203 H rdword 009, 003
Cog 0: 000000c1 0015 043c1203 H rdword 009, 003
Cog 0: 000000c2 0015 043c1203 rdword 009, 003, rdw[1a2c], cram[9] = cc
Cog 0: 000000c3 0016 fe3c12f4 jmp 009, #0f4
Cog 0: 000000c4 0017 00001eba I if_never rdbyte 00f, 0ba
Cog 0: 000000c5 0018 527fbc04 I sub 1de, #004
Cog 0: 000000c6 0019 083c2fde I rdlong 017, 1de
Cog 0: 000000c7 00cc 043c0817 H rdword 004, 017
Cog 0: 000000c8 00cc 043c0817 H rdword 004, 017
Cog 0: 000000c9 00cc 043c0817 H rdword 004, 017
Cog 0: 000000ca 00cc 043c0817 rdword 004, 017, rdw[1ebc], cram[4] = 0
Cog 0: 000000cb 00cd 507c2e02 add 017, #002, cram[17] = 1ebe
Cog 0: 000000cc 00ce fcbc01a9 jmp #000, #1a9
Cog 0: 000000cd 00cf fcfc01ac I call #000, #1ac
Cog 0: 000000ce 00d0 fcfc0198 I call #000, #198
Cog 0: 000000cf 00d1 403c0803 I mov 004, 003
Cog 0: 000000d0 01a9 d23c09dc wrlong 004, 1dc, wrl[15b0] = 0
Cog 0: 000000d1 01aa 507fb804 add 1dc, #004, cram[1dc] = 15b4
Cog 0: 000000d2 01ab fcbc0012 jmp #000, #012
Cog 0: 000000d3 01ac 527fb804 I sub 1dc, #004
Cog 0: 000000d4 01ad 083c0bdc I rdlong 005, 1dc
Cog 0: 000000d5 01ae 527fb804 I sub 1dc, #004
Cog 0: 000000d6 0012 053c0617 H rdword 003, 017 wz
Cog 0: 000000d7 0012 053c0617 H rdword 003, 017 wz
Cog 0: 000000d8 0012 053c0617 H rdword 003, 017 wz
Cog 0: 000000d9 0012 053c0617 H rdword 003, 017 wz
Cog 0: 000000da 0012 053c0617 rdword 003, 017 wz, rdw[1ebe], z = 0, cram[3] = 1c14
Cog 0: 000000db 0013 fca80018 X if_z jmp #000, #018
Cog 0: 000000dc 0014 507c2e02 add 017, #002, cram[17] = 1ec0
Cog 0: 000000dd 0015 043c1203 H rdword 009, 003
Cog 0: 000000de 0015 043c1203 H rdword 009, 003
Cog 0: 000000df 0015 043c1203 H rdword 009, 003
Cog 0: 000000e0 0015 043c1203 H rdword 009, 003
Cog 0: 000000e1 0015 043c1203 H rdword 009, 003
Cog 0: 000000e2 0015 043c1203 rdword 009, 003, rdw[1c14], cram[9] = b5
Cog 0: 000000e3 0016 fe3c12f4 jmp 009, #0f4
Cog 0: 000000e4 0017 00001ebe I if_never rdbyte 00f, 0be
Cog 0: 000000e5 0018 527fbc04 I sub 1de, #004
Cog 0: 000000e6 0019 083c2fde I rdlong 017, 1de
Cog 0: 000000e7 00b5 fcfc01ae call #000, #1ae
Cog 0: 000000e8 00b6 fcfc01b1 I call #000, #1b1
Cog 0: 000000e9 00b7 fcbc01a9 I jmp #000, #1a9
Cog 0: 000000ea 00b8 fcfc01ae I call #000, #1ae
Cog 0: 000000eb 01ae 527fb804 sub 1dc, #004, cram[1dc] = 15b0
Cog 0: 000000ec 01af 093c09dc H rdlong 004, 1dc wz
Cog 0: 000000ed 01af 093c09dc H rdlong 004, 1dc wz
Cog 0: 000000ee 01af 093c09dc H rdlong 004, 1dc wz
Cog 0: 000000ef 01af 093c09dc H rdlong 004, 1dc wz
Cog 0: 000000f0 01af 093c09dc H rdlong 004, 1dc wz
Cog 0: 000000f1 01af 093c09dc H rdlong 004, 1dc wz
Cog 0: 000000f2 01af 093c09dc rdlong 004, 1dc wz, rdl[15b0], z = 1, cram[4] = 0
Cog 0: 000000f3 01b0 fe3c0108 ret #000, #108
Cog 0: 000000f4 01b1 463c1204 I neg 009, 004
Cog 0: 000000f5 01b2 367c1202 I shl 009, #002
Cog 0: 000000f6 01b3 527c1204 I sub 009, #004
Cog 0: 000000f7 00b6 fcfc01b1 call #000, #1b1
Cog 0: 000000f8 00b7 fcbc01a9 I jmp #000, #1a9
Cog 0: 000000f9 00b8 fcfc01ae I call #000, #1ae
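For anyone who wants to put a number on the wasted slots, here is a rough Python sketch that tallies the X, I and H markers in trace lines like the ones above. The field layout it assumes (cog label, cycle count, PC, opcode, optional marker, then the instruction) is inferred from this trace, not taken from spinsim documentation, and `tally_trace` is a made-up helper name.

```python
from collections import Counter

# Marker column as described in the post: X = conditionally not executed,
# I = invalidated after a jump, H = hub wait.
MARKERS = {"X": "skipped", "I": "invalidated", "H": "hub_wait"}

def tally_trace(lines):
    """Count executed vs. wasted pipeline slots in spinsim-style trace lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: "Cog", "0:", cycle, pc, opcode, [marker], mnemonic...
        if len(fields) < 6 or fields[0] != "Cog":
            continue
        counts[MARKERS.get(fields[5], "executed")] += 1
    return counts

sample = [
    "Cog 0: 000000c2 0015 043c1203 rdword 009, 003, rdw[1a2c], cram[9] = cc",
    "Cog 0: 000000c3 0016 fe3c12f4 jmp 009, #0f4",
    "Cog 0: 000000c4 0017 00001eba I if_never rdbyte 00f, 0ba",
    "Cog 0: 000000c5 0018 527fbc04 I sub 1de, #004",
    "Cog 0: 000000c6 0019 083c2fde I rdlong 017, 1de",
    "Cog 0: 000000c7 00cc 043c0817 H rdword 004, 017",
]
counts = tally_trace(sample)
utilization = counts["executed"] / sum(counts.values())
```

Feeding it a whole trace file gives the fraction of clocks that did useful work, which is the figure the proposal is trying to raise.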
Comments
I am curious, how would that be better/faster than using one of the aux or 4 level fifo stack instructions?
Thanks,
Bill
It's sort of like the difference between REPS and REPD. REPD requires three instructions between REPD and the block of instructions that are going to be repeated. This is because REPD is handled at the end of the pipeline to get the value from the D register. This also allows REPD to be conditionally executed. REPS is processed two stages earlier in the pipeline, so it only requires one instruction before the block of repeated instructions. However, because it is processed earlier in the pipe, REPS cannot be conditionally executed and it must use immediate parameters.
I'm proposing two instructions -- UJMP and UJMPRET. UJMP would unconditionally jump to a new address. UJMPRET would also be unconditional, but would also save the return address in a cog memory location, similar to P1's JMPRET. The assembler could also support a UCALL instruction, which is a special case of UJMPRET similar to P1's CALL instruction. Since UJMP and UJMPRET would be unconditional and only use immediate values, they could be processed earlier in the pipeline. If they are handled at the first stage of the pipeline there would be no need to do any invalidates.
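The cost argument can be sketched with a toy model in Python. It assumes one instruction enters the pipeline per clock and that a jump which redirects the PC in stage N already has N - 1 younger instructions fetched behind it; the function names are hypothetical and this is not how spinsim or the real hardware computes timing.

```python
def invalidated_slots(resolve_stage: int) -> int:
    """Instructions already fetched behind a jump when it changes the PC.

    Simplified model (an assumption): stages are numbered from 1 and one
    instruction is fetched per clock, so a jump resolved in stage N has
    N - 1 younger instructions in the pipeline behind it.
    """
    if resolve_stage < 1:
        raise ValueError("pipeline stages are numbered from 1")
    return resolve_stage - 1

def clocks_for_jump(resolve_stage: int, delayed: bool = False) -> int:
    """Clocks charged to the jump: itself plus any invalidated slots.

    A delayed jump executes the trailing instructions instead of
    cancelling them, so none of those slots count as wasted.
    """
    wasted = 0 if delayed else invalidated_slots(resolve_stage)
    return 1 + wasted
```

With `resolve_stage=4` the model gives three invalidated slots, which matches the three I lines after each jmp in the trace; with `resolve_stage=1` it gives none, which is the case UJMP is aiming for.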
I think the unconditional jumps would help to make native PASM code run more efficiently, and it would probably help the C compiler generate more efficient code.
Thanks, now I understand.
Unfortunately, as far as I know, that is not possible, but I could be wrong.
The instructions at the destination address have not been read into the pipeline yet, and the PC is not updated until the fourth pipeline stage.
What you are proposing would require the PC to be updated in the first or second pipeline stage.
I understand. The difference is that the PC would be pointing to the next instruction already, so it is accessible.
I understand, however my understanding (Chip would know) is that it can't work; if it could, then we would already have two-cycle calls and jumps, and one-cycle delayed jumps and calls.
If you are correct, I think in theory all JMP's could be short circuited into 2 cycle instructions, and I am not sure of the effect of this on the four cycle state machine.
Writing the return address normally would happen in the fourth cycle, so I am not sure how the short circuit would work for CALL's.
The condition bits they are predicated on are available in the first cycle, and the register contents I think are available on the second.
Good point about it not mattering if the return address is written after potentially executing instructions from the destination.
Good discussion.
I'd call that more a FASTJMP, and for that to work, would require that the pipeline is largely redundant, and can be flipped faster than usual. Removing conditional tests may save some cycles, but I'm not sure you could save 3.
All that's required is something like a 10-bit and-gate with inverters on the inputs for the zero bits to decode the UJMP opcode, which then controls a mux that selects either the PC or 16 bits from the UJMP instruction. If the delay is low enough where this can be used at the first stage of the pipeline the next instruction fetch will be from the UJMP target address. However, if the decode and mux takes too long the jump wouldn't occur until after the next sequential instruction. In that case, there could be delayed and non-delayed versions of UJMP which would either execute the next sequential instruction before jumping, or invalidate it.
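In software terms the decode-and-mux idea looks roughly like the Python sketch below. The 10-bit pattern is a placeholder I picked for illustration, not a real or proposed P2 encoding, and `next_pc` stands for the PC that is already pointing at the next sequential instruction.

```python
UJMP_OPCODE_BITS = 0b1111110010  # hypothetical 10-bit pattern, NOT a real P2 opcode
OPCODE_SHIFT = 22                # top 10 bits of a 32-bit instruction
TARGET_MASK = 0xFFFF             # 16-bit immediate target field

def is_ujmp(instr: int) -> bool:
    """The 'and-gate with inverters': true only when all 10 bits match."""
    return ((instr >> OPCODE_SHIFT) & 0x3FF) == UJMP_OPCODE_BITS

def next_fetch_address(next_pc: int, instr: int) -> int:
    """The mux: fetch from the UJMP target if decoded, else from next_pc."""
    if is_ujmp(instr):
        return instr & TARGET_MASK
    return next_pc

# A made-up UJMP to $1a9, and an ordinary add taken from the trace.
ujmp = (UJMP_OPCODE_BITS << OPCODE_SHIFT) | 0x01A9
fetch_after_ujmp = next_fetch_address(0x00CF, ujmp)
fetch_after_add = next_fetch_address(0x00CF, 0x507C2E02)
```

Whether this fits in the first stage depends entirely on the gate and mux delay, which a software model cannot say anything about.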
inst (fjmp modifies PC here, in stage 2)
inst
fjmp #somewhere (fjmp becomes nop in stage 4)
inst (fjmpd would execute this instruction, fjmp would cancel it)
inst (1st instruction from 'somewhere')
This would help single-task programs, but wouldn't do anything for 4-task programs, as there are no same-task instructions ever in the pipeline.
As FCALL goes, this would only work with the 4-level FIFO stack, since there are no prior stages to compute pointers in.
Running without slot waits produced an improvement of 30% to 36%. Of course, this could only be achieved if a single cog was running. With multiple cogs running the gains would be less, and it may be difficult to implement a hub arbitration scheme that doesn't impact hub timing.
I need to think about the ramifications of this, but I imagine something could be done here. I think someone suggested that we could use the CCCC case of %0000 to force a fast jump within existing JMP opcodes. I mean, there's no practical use for having all the CCCC bits cleared, as there's no possibility of execution. We could use that case to mean 'always' and 'early'. This could really help single-task execution speed.
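To illustrate why CCCC = %0000 is otherwise a dead encoding, here is a small Python sketch. It assumes the condition field sits in bits 21..18 and is evaluated the P1 way (bit number C*2+Z of the field enables execution). That is an assumption carried over from P1, though it agrees with the opcodes in the trace: fca80018 decodes to if_z and was skipped with z = 0.

```python
def cccc(instr: int) -> int:
    """Extract the 4-bit condition field (assumed to be bits 21..18)."""
    return (instr >> 18) & 0xF

def executes(cond: int, c: int, z: int) -> bool:
    """P1-style predicate (assumed): bit (c*2 + z) of the field enables execution."""
    return bool((cond >> ((c << 1) | z)) & 1)

def never_executes(cond: int) -> bool:
    """True if no combination of C and Z lets the instruction run."""
    return not any(executes(cond, c, z) for c in (0, 1) for z in (0, 1))

always_jmp = cccc(0xFCBC0012)   # jmp #000, #012 from the trace
if_z_jmp = cccc(0xFCA80018)     # if_z jmp #000, #018 from the trace
```

Only %0000 makes `never_executes` true, so giving it the meaning 'always and early' on JMP would cost nothing that a normal program could use.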
It's great that someone's been thinking about this issue.
If so, that would be a nice win!
Maybe it could be handy to be able to (temporarily/dynamically) patch an instruction as a NOP by setting its CCCC bits to 0, but maybe this feature would trump that case.
Actually I didn't see the explicit NOP instruction listed in the Instruction List of my copy of the Prop2_Docs.txt file. Is NOP already implemented by just setting CCCC as 0?
or reg,reg
and reg,reg
and so on... TONS of possible NOP candidates
Some instructions, like REPS, will execute with CCCC=0. For a NOP, just use $00000000.
We have variables that are treated as NOPs at the moment (as long as they're < 19 bits).
This would only affect operation of the 16-bit-constant jumps and calls which have quite a few upper bits set in their opcodes.
Got it! Thanks
However, all is not lost, because cccc=1111 could be used. But it would be a caveat that "always execute jumps" would be faster.
Next question: is it possible to execute the jump in stage 2 while still doing the Z & C saving in stage 4? This seems awkward, but Chip is the one to answer this.
Perhaps the real possibility is...
For cccc=1111, can jumps/calls all execute in stage 2 so we could have a global caveat. Does AUGS/AUGD come into play here???
It's certainly a big advantage to be able to execute jmps/calls/rets faster. ie it would only then be the conditional jumps that took longer.
This afternoon, I replaced TPUSH D/#,S/# with PUSHT0..PUSHT3 D/#, and I replaced TPOP D,S/# with POPT0..POPT3 D. This was to reclaim space for richer opcodes which we'll probably want to implement soon for SERDES or USB. So, other-task pushes and pops are now hard-coded by task, which is probably no practical loss. I also renamed T3LOAD/T3SAVE to LOADT3/SAVET3.
After I've tested these changes, I'll look into early branching. Off the bat, I know that an early JMP is possible, but early CALLS, especially early CALLA/CALLB which use the stack, may not be able to be made early. CALLA/CALLB have to wait for the hub cycle, anyway. We'll see.
The only branches that I think can be made to execute early are the following:
if_always JMP #/@
if_always CALL #/@
if_always RET
if_always LINK #/@ - maybe
Every other branch involves things happening later in the pipeline.
There are some hard-to-think-about what-ifs to consider.
NOP
PUSH #x
RET
The RET would execute at the same time as the NOP - before the PUSH. This could maybe be gotten around by not executing the RET in stage 2 if there is a PUSH or a POP in a higher stage of the pipeline. CALL would have the same issues as RET. I wonder if there'd be any other rules that would need to be followed. I don't have a lot of confidence at the moment about getting CALL/RET to execute early.
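The ordering problem and the suggested interlock can be shown with a toy Python model. The stack is just a list, the addresses are invented, and `ret_target` and `may_execute_early` are hypothetical names; none of this models the real 4-level stack hardware.

```python
def ret_target(stack, older_in_flight, early):
    """Address a RET would jump to.

    older_in_flight lists older instructions, as (mnemonic, operand)
    tuples, that have not yet reached the stage where they update the
    stack. An early RET reads the stack before they take effect.
    """
    pending = list(stack)
    if not early:
        # Normal ordering: older PUSH/POP complete before RET reads the stack.
        for op, arg in older_in_flight:
            if op == "PUSH":
                pending.append(arg)
            elif op == "POP":
                pending.pop()
    return pending[-1]

def may_execute_early(older_in_flight):
    """Interlock: no early RET while a PUSH or POP is still ahead of it."""
    return all(op not in ("PUSH", "POP") for op, _ in older_in_flight)

stack = [0x100]                             # made-up stale return address
ahead = [("NOP", None), ("PUSH", 0x200)]    # the NOP / PUSH #x case above
in_order = ret_target(stack, ahead, early=False)
too_early = ret_target(stack, ahead, early=True)
```

In order, the RET goes to the pushed address; executed early, it goes to the stale one. `may_execute_early(ahead)` returns False here, which is the check that would hold the RET back.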
A plain JMP #/@ could be executed early with some circuitry to patch up an errant PC if it was cancelled later in the pipeline by the instruction above it. That should work. This means that only a plain JMP could be likely made to execute in 2 clocks without going to the extremes that other branches would require.
The reason REPS can work in stage 2 is because it doesn't subtract the repeat-block size from the target PC any earlier than when it's in stage 3, where it may be getting cancelled by a branch in the higher stage, cancelling its PC-subtraction effect. So, REPS is kind of harmless because it doesn't modify the target PC until it's in the next stage, unlike what an early JMP would have to do - that is - affect the target PC right away, before we know if we're getting cancelled in the next stage or not. This leaves a potential mess to be cleaned up. This would not be too hard to accommodate, though. The question is, is it worth doing for what is basically a hard-wired GOTO? It's mainly CALLs and RETs that need speeding up, but they are perhaps too complicated to handle early.
I guess, in summary, there are strong reasons for having instructions execute at some constant point in the pipeline. It keeps things properly ordered and keeps one sane.
I think it is NOT a good idea -- let it be as it is -- or else it could create problems we can't yet imagine.
+1
c.w.