Suggestion for an unconditional JMPRET instruction

Dave Hein Posts: 6,347
edited 2014-03-12 08:39 in Propeller 2
I hate to suggest any changes to P2, but I think an unconditional jmpret instruction would be useful for improving pipeline utilization. The trace shown below is from spinsim running the pfth code. The column with the X, I and H characters marks instructions that are conditionally not executed (X), instructions invalidated after a jump (I) and hub waits (H). The hub waits can be reduced by using cached reads and organizing hub access to align with the cog's hub slot. The pipeline invalidates can be reduced by using delayed jumps. However, there are cases where delayed jumps cannot be used, and it would be useful to have an unconditional jump where the pipeline fill would follow the jump target. I'm suggesting a P1-like JMPRET instruction so that returns could also be made invalidate-free. I think this instruction would improve the utilization of the instruction pipeline.
Cog 0: 000000ba 0012 053c0617                rdword   003,  017 wz, rdw[1eba], z = 0, cram[3] = 1a2c
Cog 0: 000000bb 0013 fca80018 X if_z         jmp     #000, #018
Cog 0: 000000bc 0014 507c2e02                add      017, #002, cram[17] = 1ebc
Cog 0: 000000bd 0015 043c1203 H              rdword   009,  003
Cog 0: 000000be 0015 043c1203 H              rdword   009,  003
Cog 0: 000000bf 0015 043c1203 H              rdword   009,  003
Cog 0: 000000c0 0015 043c1203 H              rdword   009,  003
Cog 0: 000000c1 0015 043c1203 H              rdword   009,  003
Cog 0: 000000c2 0015 043c1203                rdword   009,  003, rdw[1a2c], cram[9] = cc
Cog 0: 000000c3 0016 fe3c12f4                jmp      009, #0f4
Cog 0: 000000c4 0017 00001eba I if_never     rdbyte   00f,  0ba
Cog 0: 000000c5 0018 527fbc04 I              sub      1de, #004
Cog 0: 000000c6 0019 083c2fde I              rdlong   017,  1de
Cog 0: 000000c7 00cc 043c0817 H              rdword   004,  017
Cog 0: 000000c8 00cc 043c0817 H              rdword   004,  017
Cog 0: 000000c9 00cc 043c0817 H              rdword   004,  017
Cog 0: 000000ca 00cc 043c0817                rdword   004,  017, rdw[1ebc], cram[4] = 0
Cog 0: 000000cb 00cd 507c2e02                add      017, #002, cram[17] = 1ebe
Cog 0: 000000cc 00ce fcbc01a9                jmp     #000, #1a9
Cog 0: 000000cd 00cf fcfc01ac I              call    #000, #1ac
Cog 0: 000000ce 00d0 fcfc0198 I              call    #000, #198
Cog 0: 000000cf 00d1 403c0803 I              mov      004,  003
Cog 0: 000000d0 01a9 d23c09dc                wrlong   004,  1dc, wrl[15b0] = 0
Cog 0: 000000d1 01aa 507fb804                add      1dc, #004, cram[1dc] = 15b4
Cog 0: 000000d2 01ab fcbc0012                jmp     #000, #012
Cog 0: 000000d3 01ac 527fb804 I              sub      1dc, #004
Cog 0: 000000d4 01ad 083c0bdc I              rdlong   005,  1dc
Cog 0: 000000d5 01ae 527fb804 I              sub      1dc, #004
Cog 0: 000000d6 0012 053c0617 H              rdword   003,  017 wz
Cog 0: 000000d7 0012 053c0617 H              rdword   003,  017 wz
Cog 0: 000000d8 0012 053c0617 H              rdword   003,  017 wz
Cog 0: 000000d9 0012 053c0617 H              rdword   003,  017 wz
Cog 0: 000000da 0012 053c0617                rdword   003,  017 wz, rdw[1ebe], z = 0, cram[3] = 1c14
Cog 0: 000000db 0013 fca80018 X if_z         jmp     #000, #018
Cog 0: 000000dc 0014 507c2e02                add      017, #002, cram[17] = 1ec0
Cog 0: 000000dd 0015 043c1203 H              rdword   009,  003
Cog 0: 000000de 0015 043c1203 H              rdword   009,  003
Cog 0: 000000df 0015 043c1203 H              rdword   009,  003
Cog 0: 000000e0 0015 043c1203 H              rdword   009,  003
Cog 0: 000000e1 0015 043c1203 H              rdword   009,  003
Cog 0: 000000e2 0015 043c1203                rdword   009,  003, rdw[1c14], cram[9] = b5
Cog 0: 000000e3 0016 fe3c12f4                jmp      009, #0f4
Cog 0: 000000e4 0017 00001ebe I if_never     rdbyte   00f,  0be
Cog 0: 000000e5 0018 527fbc04 I              sub      1de, #004
Cog 0: 000000e6 0019 083c2fde I              rdlong   017,  1de
Cog 0: 000000e7 00b5 fcfc01ae                call    #000, #1ae
Cog 0: 000000e8 00b6 fcfc01b1 I              call    #000, #1b1
Cog 0: 000000e9 00b7 fcbc01a9 I              jmp     #000, #1a9
Cog 0: 000000ea 00b8 fcfc01ae I              call    #000, #1ae
Cog 0: 000000eb 01ae 527fb804                sub      1dc, #004, cram[1dc] = 15b0
Cog 0: 000000ec 01af 093c09dc H              rdlong   004,  1dc wz
Cog 0: 000000ed 01af 093c09dc H              rdlong   004,  1dc wz
Cog 0: 000000ee 01af 093c09dc H              rdlong   004,  1dc wz
Cog 0: 000000ef 01af 093c09dc H              rdlong   004,  1dc wz
Cog 0: 000000f0 01af 093c09dc H              rdlong   004,  1dc wz
Cog 0: 000000f1 01af 093c09dc H              rdlong   004,  1dc wz
Cog 0: 000000f2 01af 093c09dc                rdlong   004,  1dc wz, rdl[15b0], z = 1, cram[4] = 0
Cog 0: 000000f3 01b0 fe3c0108                ret     #000, #108
Cog 0: 000000f4 01b1 463c1204 I              neg      009,  004
Cog 0: 000000f5 01b2 367c1202 I              shl      009, #002
Cog 0: 000000f6 01b3 527c1204 I              sub      009, #004
Cog 0: 000000f7 00b6 fcfc01b1                call    #000, #1b1
Cog 0: 000000f8 00b7 fcbc01a9 I              jmp     #000, #1a9
Cog 0: 000000f9 00b8 fcfc01ae I              call    #000, #1ae

Comments

  • Bill Henning Posts: 6,445
    edited 2014-02-25 07:17
    David,

    I am curious, how would that be better/faster than using one of the aux or 4 level fifo stack instructions?

    Thanks,

    Bill
  • Dave Hein Posts: 6,347
    edited 2014-02-25 07:58
    A non-delayed call or jmp instruction requires that the next three instructions already in the instruction pipeline be invalidated, which wastes three cycles. A delayed call or jmp instruction causes the three instructions in the pipeline to be executed. What I'm proposing is something like a delayed call or jmp, but where the three instructions in the pipeline are fetched from the call/jmp target address instead of being the next three instructions directly after the call/jmp. This is how P1 works, except that it only had to worry about one instruction in the pipeline.
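
    To make that cycle accounting concrete, here is a minimal C sketch -- an illustrative model only, not spinsim code -- that counts the issue slots each flavor of jump costs in a 4-stage pipeline, with the proposed UJMP handled at the very first stage:

    #include <stdio.h>

    #define PIPE_DEPTH    4                  /* fetch .. execute stages        */
    #define SLOTS_BEHIND (PIPE_DEPTH - 1)    /* younger instructions in flight */

    /* non-delayed jmp/call: the three younger instructions are invalidated */
    static int cost_plain(void)         { return 1 + SLOTS_BEHIND; }

    /* delayed jmp/call: the slots still issue, but only the ones the
       programmer managed to fill with useful work actually help */
    static int cost_delayed(int useful) { return 1 + (SLOTS_BEHIND - useful); }

    /* proposed UJMP/UJMPRET: the pipeline refills from the target, so
       nothing behind the jump is wasted */
    static int cost_early(void)         { return 1; }

    int main(void)
    {
        printf("plain jmp            : %d cycles\n", cost_plain());
        printf("delayed jmp, 1 slot  : %d cycles\n", cost_delayed(1));
        printf("delayed jmp, 3 slots : %d cycles\n", cost_delayed(3));
        printf("early ujmp           : %d cycle\n",  cost_early());
        return 0;
    }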

    It's sort of like the difference between REPS and REPD. REPD requires three instructions between REPD and the block of instructions that are going to be repeated. This is because REPD is handled at the end of the pipeline so it can get the value from the D register. This also allows REPD to be conditionally executed. REPS is processed two stages earlier in the pipeline, so it only requires one instruction before the block of repeated instructions. However, because it is processed earlier in the pipe, REPS cannot be conditionally executed and it must use immediate parameters.

    I'm proposing two instructions -- UJMP and UJMPRET. UJMP would unconditionally jump to a new address. UJMPRET would also be unconditional, but would additionally save the return address in a cog memory location, similar to P1's JMPRET. The assembler could also support a UCALL instruction, which is a special case of UJMPRET similar to P1's CALL instruction. Since UJMP and UJMPRET would be unconditional and would only use immediate values, they could be processed earlier in the pipeline. If they were handled at the first stage of the pipeline there would be no need to do any invalidates.

    I think the unconditional jumps would help to make native PASM code run more efficiently, and it would probably help the C compiler generate more efficient code.
  • Bill Henning Posts: 6,445
    edited 2014-02-25 08:05
    Dave Hein wrote: »
    A non-delayed call or jmp instruction requires that the next three instructions already in the instruction pipeline be invalidated, which wastes three cycles. A delayed call or jmp instruction causes the three instructions in the pipeline to be executed. What I'm proposing is something like a delayed call or jmp, but where the three instructions in the pipeline are fetched from the call/jmp target address instead of being the next three instructions directly after the call/jmp. This is how P1 works, except that it only had to worry about one instruction in the pipeline.

    Thanks, now I understand.

    Unfortunately, as far as I know, that is not possible, but I could be wrong.

    The instructions at the destination address have not been read into the pipeline yet, and the PC is not updated until the fourth pipeline stage.

    What you are proposing would require the PC to be updated in the first or second pipeline stage.
    Dave Hein wrote: »
    It's sort of like the difference between REPS and REPD. REPD requires three instructions between REPD and the block of instructions that are going to be repeated. This is because REPD is handled at the end of the pipeline so it can get the value from the D register. This also allows REPD to be conditionally executed. REPS is processed two stages earlier in the pipeline, so it only requires one instruction before the block of repeated instructions. However, because it is processed earlier in the pipe, REPS cannot be conditionally executed and it must use immediate parameters.

    I understand. The difference is that the PC would be pointing to the next instruction already, so it is accessible.
    Dave Hein wrote: »
    I'm proposing two instructions -- UJMP and UJMPRET. UJMP would unconditionally jump to a new address. UJMPRET would also be unconditional, but would additionally save the return address in a cog memory location, similar to P1's JMPRET. The assembler could also support a UCALL instruction, which is a special case of UJMPRET similar to P1's CALL instruction. Since UJMP and UJMPRET would be unconditional and would only use immediate values, they could be processed earlier in the pipeline. If they were handled at the first stage of the pipeline there would be no need to do any invalidates.

    I understand, however my understanding (Chip would know) is that it can't work; if it could, then we would already have two-cycle calls and jumps, and one-cycle delayed jumps and calls.
  • Dave Hein Posts: 6,347
    edited 2014-02-25 08:27
    A REPS that only repeats one instruction is already changing the path of the instruction fetcher that fills the pipeline even before the REPS instruction hits the end of the pipe. I can easily modify the C code in spinsim to change how the pipeline is filled, and I assume it could be written in Verilog as well. There may be some limitation that I'm not aware of that would require UJMP to invalidate or do a delayed execution of a single instruction in the pipeline. It seems that once the UJMP instruction is clocked into the first register of the pipeline it could be used on the next cycle to override the PC and fetch from the target address instead. The PC would then continue to increment from that point on.
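
    For what it's worth, a rough sketch in C of that kind of pipeline-fill change (the cog_t structure, its field names and the is_ujmp() encoding test are hypothetical placeholders, not spinsim's actual code):

    #include <stdint.h>

    typedef struct {
        uint32_t fetch_pc;      /* address the next instruction is fetched from */
        uint32_t pipe[4];       /* pipeline registers, pipe[0] = first stage    */
        uint32_t cogram[512];   /* cog memory                                   */
    } cog_t;

    /* placeholder decode for an unconditional immediate jump */
    static int is_ujmp(uint32_t instr)
    {
        return (instr & 0xffc00000u) == 0xfc800000u;    /* assumed encoding */
    }

    /* one fetch step: shift the pipeline, then decide where the next fetch
       comes from.  Once a UJMP is clocked into the first pipeline register,
       the PC is overridden so the very next fetch is from the target.       */
    void fetch_step(cog_t *cog)
    {
        uint32_t instr = cog->cogram[cog->fetch_pc & 0x1ff];

        cog->pipe[3] = cog->pipe[2];
        cog->pipe[2] = cog->pipe[1];
        cog->pipe[1] = cog->pipe[0];
        cog->pipe[0] = instr;

        if (is_ujmp(instr))
            cog->fetch_pc = instr & 0x1ff;   /* jump target from the S field */
        else
            cog->fetch_pc = cog->fetch_pc + 1;
    }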
  • Bill Henning Posts: 6,445
    edited 2014-02-25 08:53
    I am curious to see what Chip will say.

    If you are correct, I think in theory all JMPs could be short-circuited into 2-cycle instructions, and I am not sure of the effect of this on the four-cycle state machine.

    Writing the return address normally would happen in the fourth cycle, so I am not sure how the short-circuit would work for CALLs.
  • Dave Hein Posts: 6,347
    edited 2014-02-25 09:14
    Not all JMPs can be short-circuited, because they can be conditionally executed and they can use register values. Writing the return address would happen on the fourth cycle. That would mean the return could not be done immediately; there would have to be at least three intervening instructions. That doesn't seem like a big restriction, since a called routine normally executes a few instructions before it returns.
  • Bill Henning Posts: 6,445
    edited 2014-02-25 09:26
    Dave Hein wrote: »
    Not all JMPs can be short-circuited, because they can be conditionally executed and they can use register values. Writing the return address would happen on the fourth cycle. That would mean the return could not be done immediately; there would have to be at least three intervening instructions. That doesn't seem like a big restriction, since a called routine normally executes a few instructions before it returns.

    The condition bits they are predicated on are available in the first cycle, and the register contents I think are available on the second.

    Good point about it not mattering if the return address is written after potentially executing instructions from the destination.

    Good discussion.
  • jmg Posts: 15,148
    edited 2014-02-25 12:18
    Dave Hein wrote: »
    ... What I'm proposing is something like a delayed call or jmp, but the three instructions in the pipeline are fetched from the call/jmp target address instead of the next three instructions directly after the call/jmp.

    I'd call that more of a FASTJMP, and for that to work it would require that the pipeline is largely redundant and can be flipped faster than usual. Removing conditional tests may save some cycles, but I'm not sure you could save 3.
  • Dave Hein Posts: 6,347
    edited 2014-02-25 13:25
    I'm not suggesting any redundancy in the pipeline, and I'm not sure what you mean by "flipped faster than usual". I'm certainly not suggesting clocking it any faster than it is now. Since it works at the beginning of the pipeline it would have to be unconditional.

    All that's required is something like a 10-bit and-gate with inverters on the inputs for the zero bits to decode the UJMP opcode, which then controls a mux that selects either the PC or 16 bits from the UJMP instruction. If the delay is low enough that this can be used at the first stage of the pipeline, the next instruction fetch will be from the UJMP target address. However, if the decode and mux take too long, the jump wouldn't occur until after the next sequential instruction. In that case, there could be delayed and non-delayed versions of UJMP, which would either execute the next sequential instruction before jumping or invalidate it.
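
    In C terms, that decode-and-mux boils down to something like this (the bit positions and the opcode pattern are assumptions for illustration, not the real encoding):

    #include <stdint.h>

    /* models the "10-bit and-gate plus mux": a fixed pattern match on the
       upper opcode bits selects between the incremented PC and the 16-bit
       immediate carried in the UJMP instruction                            */
    uint32_t next_fetch_addr(uint32_t instr, uint32_t pc)
    {
        int      ujmp   = ((instr >> 22) & 0x3ff) == 0x3f2;  /* assumed pattern  */
        uint32_t target = instr & 0xffffu;                   /* 16-bit immediate */

        return ujmp ? target : pc + 1;                       /* the mux          */
    }

    Whether that path is fast enough to feed the very next fetch is exactly the timing question raised here.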
  • cgracey Posts: 14,133
    edited 2014-02-25 13:32
    Interesting idea, Dave. This would be possible for non-conditional constant jumps/calls. It would be funny looking:

    inst (fjmp modifies PC here, in stage 2)
    inst
    fjmp #somewhere (fjmp becomes nop in stage 4)
    inst (fjmpd would execute this instruction, fjmp would cancel it)

    inst (1st instruction from 'somewhere')


    This would help single-task programs, but wouldn't do anything for 4-task programs, as no two instructions from the same task are ever in the pipeline at once.

    As far as FCALL goes, this would only work with the 4-level FIFO stack, since there are no prior stages to compute pointers in.
  • Dave Hein Posts: 6,347
    edited 2014-02-25 14:29
    OK, so I guess it's not possible to do the jump in stage 1, but doing it in stage 2 still reduces the invalidates/delayed-executions from 3 to 1. I was thinking that fcall would write the return address at stage 4, but writing to the FIFO would also work. In that case it would be good to have a fret instruction also.
  • Dave Hein Posts: 6,347
    edited 2014-03-11 10:50
    I was thinking about the FJMP instruction again, and I realized that we wouldn't actually need a new set of instructions, but we could just treat unconditional JMP, JMPD, CALL, CALLD, RET and RETD instructions as fast jumps. I tried this in spinsim along with a mode with no hub waits, and I got the following Dhrystone times running under p1spin.
    Normal jumps and hub waits    185 msec
    Fast jumps and hub waits      176 msec
    Normal jumps and no hub waits 128 msec
    Fast jumps and no hub waits   113 msec
    
    So there is only about a 5% to 12% improvement using fast jumps, which may not be worth the effort. However, I didn't use delayed fast jumps, which would help a bit more. A delayed fast jump would be used more often than a normal delayed jump since it only requires one executed instruction after the jump instead of three.

    Running without slot waits produced an improvement of 30% to 36%. Of course, this could only be achieved if a single cog was running. With multiple cogs running the gains would be less, and it may be difficult to implement a hub arbitration scheme that doesn't impact hub timing.
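
    For reference, those percentages come straight from the four timings: 185 to 176 msec is 9/185, about 5%, and 128 to 113 msec is 15/128, about 12%, for the fast jumps; 185 to 128 msec is 57/185, about 31%, and 176 to 113 msec is 63/176, about 36%, for removing the hub waits.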
  • cgracey Posts: 14,133
    edited 2014-03-11 15:36
    Dave Hein wrote: »
    I was thinking about the FJMP instruction again, and I realized that we wouldn't actually need a new set of instructions, but we could just treat unconditional JMP, JMPD, CALL, CALLD, RET and RETD instructions as fast jumps. I tried this in spinsim along with a mode with no hub waits, and I got the following Dhrystone times running under p1spin.
    Normal jumps and hub waits    185 msec
    Fast jumps and hub waits      176 msec
    Normal jumps and no hub waits 128 msec
    Fast jumps and no hub waits   113 msec
    
    So there is only about a 5% to 12% improvement using fast jumps, which may not be worth the effort. However, I didn't use delayed fast jumps, which would help a bit more. A delayed fast jump would be used more often than a normal delayed jump since it only requires one executed instruction after the jump instead of three.

    Running without slot waits produced an improvement of 30% to 36%. Of course, this could only be achieved if a single cog was running. With multiple cogs running the gains would be less, and it may be difficult to implement a hub arbitration scheme that doesn't impact hub timing.


    I need to think about the ramifications of this, but I imagine something could be done here. I think someone suggested that we could use the CCCC case of %0000 to force a fast jump within existing JMP opcodes. I mean, there's no practical use for having all the CCCC bits cleared, as there's no possibility of execution. We could use that case to mean 'always' and 'early'. This could really help single-task execution speed.

    It's great that someone's been thinking about this issue.
  • Bill Henning Posts: 6,445
    edited 2014-03-11 15:52
    If I understand correctly, this means that any unconditional non-delayed jump would take two cycles? (regardless of register, relative, cog addr, hub addr)

    If so, that would be a nice win!
    cgracey wrote: »
    I need to think about the ramifications of this, but I imagine something could be done here. I think someone suggested that we could use the CCCC case of %0000 to force a fast jump within existing JMP opcodes. I mean, there's no practical use for having all the CCCC bits cleared, as there's no possibility of execution. We could use that case to mean 'always' and 'early'. This could really help single-task execution speed.

    It's great that someone's been thinking about this issue.
  • Dave Hein Posts: 6,347
    edited 2014-03-11 16:02
    Bill Henning wrote: »
    If I understand correctly, this means that any unconditional non-delayed jump would take two cycles? (regardless of register, relative, cog addr, hub addr)

    If so, that would be a nice win!
    It would only work with jumps that used an immediate address. Jumps that use a register would have to wait until the 4th stage when the register contents are available. In my tests I used a CCCC value of %1111 to indicate a fast jump. For calls I performed the jump at stage 2, but wrote the Z, C and return address at stage 4. On return, the jump is done at stage 2, but the Z and C flags are set at stage 4.
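
    A sketch of how that split might look in simulator code (the structures, field positions and the is_fast_call() test are illustrative assumptions, not the actual spinsim implementation):

    #include <stdint.h>

    typedef struct {
        uint32_t instr;
        uint32_t return_addr;   /* PC of the instruction after the call      */
        int      cancelled;     /* set when an earlier branch invalidates it */
    } slot_t;

    typedef struct {
        uint32_t fetch_pc;
        slot_t   stage[5];      /* stage[1]..stage[4] are in use             */
        uint32_t cogram[512];
        int      z, c;
    } cog_t;

    /* placeholder decode: immediate call with CCCC forced to %1111 */
    static int is_fast_call(uint32_t instr)
    {
        return ((instr >> 26) & 0x3f) == 0x37 &&    /* assumed opcode field */
               ((instr >> 18) & 0xf)  == 0xf;       /* assumed CCCC field   */
    }

    /* stage 2: redirect the fetch PC early so the pipeline refills from the
       target, but change no architectural state yet                         */
    void do_stage2(cog_t *cog)
    {
        slot_t *s = &cog->stage[2];
        if (!s->cancelled && is_fast_call(s->instr))
            cog->fetch_pc = s->instr & 0x1ff;
    }

    /* stage 4: the return address and the Z/C flags are only written here,
       after any cancellation from a prior conditional jump has happened     */
    void do_stage4(cog_t *cog)
    {
        slot_t *s = &cog->stage[4];
        if (!s->cancelled && is_fast_call(s->instr)) {
            uint32_t d = (s->instr >> 9) & 0x1ff;   /* assumed D field       */
            cog->cogram[d] = s->return_addr;        /* save return address   */
            /* Z and C updates would also land here, never in stage 2        */
        }
    }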
  • rogloh Posts: 5,168
    edited 2014-03-11 16:07
    cgracey wrote: »
    I need to think about the ramifications of this, but I imagine something could be done here. I think someone suggested that we could use the CCCC case of %0000 to force a fast jump within existing JMP opcodes. I mean, there's no practical use for having all the CCCC bits cleared, as there's no possibility of execution. We could use that case to mean 'always' and 'early'. This could really help single-task execution speed.

    It's great that someone's been thinking about this issue.


    Maybe it could be handy to be able to (temporarily/dynamically) patch an instruction as a NOP by setting its CCCC bits to 0, but maybe this feature would trump that case.

    Actually I didn't see the explicit NOP instruction listed in the Instruction List of my copy of the Prop2_Docs.txt file. Is NOP already implemented by just setting CCCC as 0?
  • Bill Henning Posts: 6,445
    edited 2014-03-11 16:24
    or reg,#0

    or reg,reg

    and reg,reg

    and so on... TONS of possible NOP candidates :)
    rogloh wrote: »
    Maybe it could be handy to be able to (temporarily/dynamically) patch an instruction as a NOP by setting its CCCC bits to 0, but maybe this feature would trump that case.

    Actually I didn't see the explicit NOP instruction listed in the Instruction List of my copy of the Prop2_Docs.txt file. Is NOP already implemented by just setting CCCC as 0?
  • cgracey Posts: 14,133
    edited 2014-03-11 16:24
    rogloh wrote: »
    Maybe it could be handy to be able to (temporarily/dynamically) patch an instruction as a NOP by setting its CCCC bits to 0, but maybe this feature would trump that case.

    Actually I didn't see the explicit NOP instruction listed in the Instruction List of my copy of the Prop2_Docs.txt file. Is NOP already implemented by just setting CCCC as 0?


    Some instructions, like REPS, will execute with CCCC=0. For a NOP, just use $00000000.
  • ozpropdev Posts: 2,791
    edited 2014-03-11 16:27
    If cccc=%0000 represents fast jumps, this will affect register block mapping, won't it?
    We have variables that are treated as NOPs at the moment (as long as they're < 19 bits).
  • rogloh Posts: 5,168
    edited 2014-03-11 16:29
    Yeah, using 0 for NOP is nice and simple to remember. I guess that is technically the same as "never RDBYTE D,S".
  • cgracey Posts: 14,133
    edited 2014-03-11 16:33
    ozpropdev wrote: »
    If cccc=%0000 represents fast jumps, this will affect register block mapping, won't it?
    We have variables that are treated as NOPs at the moment (as long as they're < 19 bits).


    This would only affect operation of the 16-bit-constant jumps and calls which have quite a few upper bits set in their opcodes.
  • ozpropdev Posts: 2,791
    edited 2014-03-11 16:37
    cgracey wrote: »
    This would only affect operation of the 16-bit-constant jumps and calls which have quite a few upper bits set in their opcodes.

    Got it! Thanks
  • Cluso99 Posts: 18,069
    edited 2014-03-11 16:41
    cgracey wrote: »
    I need to think about the ramifications of this, but I imagine something could be done here. I think someone suggested that we could use the CCCC case of %0000 to force a fast jump within existing JMP opcodes. I mean, there's no practical use for having all the CCCC bits cleared, as there's no possibility of execution. We could use that case to mean 'always' and 'early'. This could really help single-task execution speed.
    As has been noted below, using cccc=0000 is a bad idea. We now have the capacity to change the cccc bits using SETCOND, and making instructions never/always is likely the prime usage here.
    However, all is not lost, because cccc=1111 could be used. It would just carry the caveat that "always execute" jumps would be faster.

    Next question: is it possible to execute the jump in stage 2 while still doing the Z & C saving in stage 4? This seems awkward, but Chip is the one to answer this.

    Perhaps the real possibility is...
    For cccc=1111, can jumps/calls all execute in stage 2 so we could have a global caveat? Does AUGS/AUGD come into play here???
    It's certainly a big advantage to be able to execute jmps/calls/rets faster, i.e. it would only then be the conditional jumps that took longer.
  • Cluso99 Posts: 18,069
    edited 2014-03-11 16:45
    OT - gee these threads can be fast and furious at times. Just while composing a reply, a number of posts have occurred ;)
  • Dave Hein Posts: 6,347
    edited 2014-03-11 19:04
    Cluso99 wrote: »
    Next question: is it possible to execute the jump in stage 2 while still doing the Z & C saving in stage 4? This seems awkward, but Chip is the one to answer this.
    The return address is also saved at stage 4. This is required because there is a chance the instruction could be invalidated at stage 3 if a conditional jump is executed just before the unconditional call. I ran into this problem when I first coded it in the simulator. Also, the Z and C flags aren't available until stage 4.
  • cgracey Posts: 14,133
    edited 2014-03-11 20:57
    I'm going to investigate this early branching tonight.

    This afternoon, I replaced TPUSH D/#,S/# with PUSHT0..PUSHT3 D/#, and I replaced TPOP D,S/# with POPT0..POPT3 D. This was to reclaim space for richer opcodes which we'll probably want to implement soon for SERDES or USB. So, other-task pushes and pops are now hard-coded by task, which is probably no practical loss. I also renamed T3LOAD/T3SAVE to LOADT3/SAVET3.

    After I've tested these changes, I'll look into early branching. Off the bat, I know that an early JMP is possible, but CALLs, especially CALLA/CALLB which use the stack, may not be able to be made early. CALLA/CALLB have to wait for the hub cycle, anyway. We'll see.
  • cgracey Posts: 14,133
    edited 2014-03-12 04:15
    I've been drawing up pipeline state diagrams so that I can figure out if this early branching can work.

    The only branches that I think can be made to execute early are the following:

    if_always JMP #/@
    if_always CALL #/@
    if_always RET
    if_always LINK #/@ - maybe

    Every other branch involves things happening later in the pipeline.

    There are some hard-to-think-about what-ifs to consider.
  • cgracey Posts: 14,133
    edited 2014-03-12 06:05
    Well, I don't see how this sequence could work if the RET executed in stage 2 of the pipeline:

    NOP
    PUSH #x
    RET

    The RET would execute at the same time as the NOP - before the PUSH. This could maybe be gotten around by not executing the RET in stage 2 if there is a PUSH or a POP in a higher stage of the pipeline. CALL would have the same issues as RET. I wonder if there'd be any other rules that would need to be followed. I don't have a lot of confidence at the moment about getting CALL/RET to execute early.
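
    One way to express that interlock, as a C sketch (the decode helpers and opcode patterns below are made-up placeholders):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint32_t instr; bool valid; } slot_t;

    /* placeholder decoders -- the real patterns would come from the encoding */
    static bool is_push_or_pop(uint32_t instr) { return (instr >> 26) == 0x08; }
    static bool is_ret(uint32_t instr)         { return (instr >> 26) == 0x09; }

    /* true when the RET sitting in stage 2 may redirect the PC this cycle:
       a PUSH or POP still in stage 3 or 4 has not touched the stack yet,
       so the RET has to wait for it                                          */
    bool ret_may_execute_early(const slot_t stage[5])
    {
        if (!stage[2].valid || !is_ret(stage[2].instr))
            return false;

        for (int i = 3; i <= 4; i++)
            if (stage[i].valid && is_push_or_pop(stage[i].instr))
                return false;

        return true;
    }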

    A plain JMP #/@ could be executed early with some circuitry to patch up an errant PC if it was cancelled later in the pipeline by the instruction above it. That should work. This means that only a plain JMP could be likely made to execute in 2 clocks without going to the extremes that other branches would require.
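
    As a C sketch, that patch-up could look something like this (the names and the stage at which the cancel arrives are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t fetch_pc;        /* where the next instruction is fetched from */
        uint32_t saved_pc;        /* fall-through PC captured at the early JMP  */
        bool     early_jmp_taken; /* an early JMP has redirected the fetch path */
    } frontend_t;

    /* stage 2: a plain unconditional JMP #/@ redirects the fetch PC, but the
       fall-through address is remembered in case the JMP itself is cancelled
       by the instruction ahead of it                                          */
    void early_jmp(frontend_t *fe, uint32_t target)
    {
        fe->saved_pc        = fe->fetch_pc;
        fe->fetch_pc        = target;
        fe->early_jmp_taken = true;
    }

    /* one stage later: if a branch above cancels the JMP, restore the PC      */
    void cancel_early_jmp(frontend_t *fe)
    {
        if (fe->early_jmp_taken) {
            fe->fetch_pc        = fe->saved_pc;
            fe->early_jmp_taken = false;
        }
    }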

    The reason REPS can work in stage 2 is because it doesn't subtract the repeat-block size from the target PC any earlier than when it's in stage 3, where it may be getting cancelled by a branch in the higher stage, cancelling its PC-subtraction effect. So, REPS is kind of harmless because it doesn't modify the target PC until it's in the next stage, unlike what an early JMP would have to do - that is - affect the target PC right away, before we know if we're getting cancelled in the next stage or not. This leaves a potential mess to be cleaned up. This would not be too hard to accommodate, though. The question is, is it worth doing for what is basically a hard-wired GOTO? It's mainly CALLs and RETs that need speeding up, but they are perhaps too complicated to handle early.

    I guess, in summary, there are strong reasons for having instructions execute at some constant point in the pipeline. It keeps things properly ordered and keeps one sane.
  • Sapieha Posts: 2,964
    edited 2014-03-12 06:24
    Hi Chip.

    I think it is not a good idea -- let's leave it as it is -- else it could create problems we still can't imagine.

    cgracey wrote: »
    Well, I don't see how this sequence could work if the RET executed in stage 2 of the pipeline:

    NOP
    PUSH #x
    RET

    The RET would execute at the same time as the NOP - before the PUSH. This could maybe be gotten around by not executing the RET in stage 2 if there is a PUSH or a POP in a higher stage of the pipeline. CALL would have the same issues as RET. I wonder if there'd be any other rules that would need to be followed. I don't have a lot of confidence at the moment about getting CALL/RET to execute early.

    A plain JMP #/@ could be executed early with some circuitry to patch up an errant PC if it was cancelled later in the pipeline by the instruction above it. That should work. This means that only a plain JMP could be likely made to execute in 2 clocks without going to the extremes that other branches would require.

    The reason REPS can work in stage 2 is because it doesn't subtract the repeat-block size from the target PC any earlier than when it's in stage 3, where it may be getting cancelled by a branch in the higher stage, cancelling its PC-subtraction effect. So, REPS is kind of harmless because it doesn't modify the target PC until it's in the next stage, unlike what an early JMP would have to do - that is - affect the target PC right away, before we know if we're getting cancelled in the next stage or not. This leaves a potential mess to be cleaned up. This would not be too hard to accommodate, though. The question is, is it worth doing for what is basically a hard-wired GOTO? It's mainly CALLs and RETs that need speeding up, but they are perhaps too complicated to handle early.

    I guess, in summary, there are strong reasons for having instructions execute at some constant point in the pipeline. It keeps things properly ordered and keeps one sane.
  • ctwardell Posts: 1,716
    edited 2014-03-12 06:28
    Sapieha wrote: »
    Hi Chip.

    I think it is not a good idea -- let's leave it as it is -- else it could create problems we still can't imagine.

    +1

    c.w.