LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz) - Page 11 — Parallax Forums



Comments

  • David Betz Posts: 14,516
    edited 2012-12-13 15:20
    Yes.

    That is the best that can be done without fairly major changes.

    Frankly, I'd have to audit the back end to see how much needs to be changed. Even worse, it cannot be done one change at a time.
    I still think it would be best to do some hand-compiling of non-trivial examples, to see whether it is even possible to generate efficient code within the constraints of some of these more advanced loops, and whether the speedup is worth it if it also bloats the generated code significantly. After all, the P2 only has 128k of hub memory, which is still pretty small compared with other MCUs, and using a sparse instruction encoding to gain speed might not be worth the tradeoff.

    Instead, maybe Chip will put in the 15 bit PC and automatic rdlongc to fill the instruction pipeline. :-)
  • Bill Henning Posts: 6,445
    edited 2012-12-13 15:27
    David Betz wrote: »
    I still think it would be best to do some hand-compiling of non-trivial examples, to see whether it is even possible to generate efficient code within the constraints of some of these more advanced loops, and whether the speedup is worth it if it also bloats the generated code significantly. After all, the P2 only has 128k of hub memory, which is still pretty small compared with other MCUs, and using a sparse instruction encoding to gain speed might not be worth the tradeoff.

    Past experience suggests very little bloat; I expect all four slots filled in 90%+ of cases... so 10% of 25% is 2.5% bloat (i.e. in 10% of quads, one long is lost... 2.5%)
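    A quick sanity check of that arithmetic, as a sketch (the 90%+ fill rate is an estimate from the post above, not a measured figure):

```python
# If ~10% of instruction quads have one of their four slots unfilled,
# the wasted space is 10% * 1/4 = 2.5% of the generated code.
quads_with_gap = 0.10       # fraction of quads with an empty slot (estimate)
longs_lost_per_gap = 1 / 4  # one long lost out of four in an affected quad
bloat = quads_with_gap * longs_lost_per_gap
print(f"expected bloat: {bloat:.1%}")  # expected bloat: 2.5%
```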

    P2 has almost 4x the hub of P1, so programs can be much bigger - besides, Chip may have added the compact 16-bit mode; the one I suggested when he asked would require very little logic to implement.
    David Betz wrote: »
    Instead, maybe Chip will put in the 15 bit PC and automatic rdlongc to fill the instruction pipeline. :-)

    Unfortunately, due to the current architecture, that would still only execute 1 in 8... no faster than LMM2C1.

    Otherwise, major architecture changes would be needed for Prop2 (a 6-month to 1-year delay).

    I'll be back in an hour or so, I have to run some errands.
  • Ariba Posts: 2,690
    edited 2012-12-13 15:27
    Hello David

    I never looked into the LMM kernel source of PropGCC, so I need some time to study it.
    Basically, the 2-instr LMM loads both instructions first and then executes the first. If you do an fjump in the first instruction, the second must be cancelled; I do that by restarting the LMM loop (jmp #lmm). The comment about the movi means the long should be seen as a NOP when executed; I don't know yet exactly what is necessary for that in Prop2.

    What are the fibo(26) results of Prop1-GCC in LMM mode?

    Andy
  • Ariba Posts: 2,690
    edited 2012-12-13 15:50
    David

    You can try this 2-instr LMM loop; it is a bit slower than the other one I posted, but it may work for PropGCC without changes:
    lmm  reps #511,#8
         nop
          rdlongc ins1,pc
          add pc,#4
          nop
    ins1  nop
          rdlongc ins2,pc
          add pc,#4
          nop
    ins2  nop
         jmp #lmm
    

    Andy
  • David Betz Posts: 14,516
    edited 2012-12-13 16:14
    Ariba wrote: »
    David

    You can try this 2-instr LMM loop; it is a bit slower than the other one I posted, but it may work for PropGCC without changes:
    lmm  reps #511,#8
         nop
          rdlongc ins1,pc
          add pc,#4
          nop
    ins1  nop
          rdlongc ins2,pc
          add pc,#4
          nop
    ins2  nop
         jmp #lmm
    

    Andy

    I'll try this as soon as I add the reps instruction to gas. :-)

    What I don't understand is how this is any faster than the same loop with only one instruction per iteration. Is there a delay associated with the automatic branch back to the start? If so, wouldn't unwinding the loop further speed things up a little more?
  • Kye Posts: 2,200
    edited 2012-12-13 16:45
    Hmm, Chip could just build all this functionality into the processor. But that would mean the P2 would never come out.

    Good work guys! :)
  • Ariba Posts: 2,690
    edited 2012-12-13 16:47
    The REPS instruction unwinds the loop for you (the repeat needs no additional cycles).
    If the long is already in the cache, RDLONGC needs only 1 cycle; if both LMM instructions are also single-cycle, then the 2 iterations fit exactly in a hub window.
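    A tiny model of that timing argument (the 8-clock hub window and single-cycle figures are taken from the description above; this is just arithmetic, not a simulation of the chip):

```python
# The reps body is 8 instructions: rdlongc, add, nop, the executed LMM
# instruction, then the same again for the second slot.  If rdlongc hits
# the cache (1 cycle) and the executed LMM instructions are single-cycle,
# one pass through the body takes 8 cycles -- exactly one hub window.
body_instructions = 8       # reps #511,#8
cycles_per_instruction = 1  # assumes cache hits and single-cycle LMM ops
hub_window = 8              # assumed hub access period in clocks
cycles_per_pass = body_instructions * cycles_per_instruction
print(cycles_per_pass == hub_window)  # the pass fits exactly in one window
```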

    But I have now looked into the LMM kernel source. All LMM-support routines which change the pc jump back to __LMM_loop except POP_RET, because that is a subroutine.
    You can try this modification to POP_RET:
    __LMM_POPRET
            call    #__LMM_POPM
            mov     pc,lr
        mov     L_ins2,dec_pc   'compensate 2nd inc_pc if popret is executed in L_ins1
    __LMM_POPRET_ret
            ret
    
    dec_pc  sub  pc,#4
    
    If this works, you don't need to add the REPS instruction.

    Andy
  • Kye Posts: 2,200
    edited 2012-12-13 16:51
    I think it might be a good idea to reserve the PTRA and PTRB registers for the PC and stack pointer, respectively. Improved LMM performance could then be had.

    However, they would also be effective as loop counters.
  • David Betz Posts: 14,516
    edited 2012-12-13 16:52
    Ariba wrote: »
    The REPS instruction unwinds the loop for you (the repeat needs no additional cycles).
    If the long is already in the cache, RDLONGC needs only 1 cycle; if both LMM instructions are also single-cycle, then the 2 iterations fit exactly in a hub window.

    But I have now looked into the LMM kernel source. All LMM-support routines which change the pc jump back to __LMM_loop except POP_RET, because that is a subroutine.
    You can try this modification to POP_RET:
    __LMM_POPRET
            call    #__LMM_POPM
            mov     pc,lr
        mov     L_ins2,dec_pc   'compensate 2nd inc_pc if popret is executed in L_ins1
    __LMM_POPRET_ret
            ret
    
    dec_pc  sub  pc,#4
    
    If this works, you don't need to add the REPS instruction.

    Andy

    Thanks! I'll try that change to LMM_POPRET in a little while. However, if REPS eliminates the looping cost, then why even have two instructions in the loop? Wouldn't a loop with only a single instruction be just as fast?
  • Ariba Posts: 2,690
    edited 2012-12-13 17:22
    David Betz wrote: »
    Thanks! I'll try that change to LMM_POPRET in a little while. However, if REPS eliminates the looping cost, then why even have two instructions in the loop? Wouldn't a loop with only a single instruction be just as fast?

    Yes, you are right; repeating a single instruction will be nearly the same.
    I misread your last question - yes, the jmp #lmm takes another 4 cycles when the repeat loop is over and needs to be restarted; then there are another 2 cycles for establishing the repeat again. But this happens only every 511 * 2 (or 511 * 1) instructions, so you will not gain much by unrolling more iterations.
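    Sketching that overhead in numbers (cycle counts as stated above; the amortization is just the division):

```python
# Restart cost: 4 cycles for the jmp #lmm plus 2 cycles to re-establish
# the repeat, paid once per 511 passes of the 2-instruction body.
restart_cycles = 4 + 2
instructions_per_restart = 511 * 2
overhead = restart_cycles / instructions_per_restart
print(f"~{overhead:.2%} extra cycles per LMM instruction")
```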
  • Ariba Posts: 2,690
    edited 2012-12-13 17:48
    Kye wrote: »
    I think it might be a good idea to reserve the PTRA and PTRB registers for the PC and stack pointer, respectively. Improved LMM performance could then be had.

    However, they would also be effective as loop counters.

    PTRB as stack pointer can bring a lot of advantages because of the addressing mode: rdlong PTRB+-0..32. This can access local variables easily.

    Another area for improvement is the FCACHE loader. With a load loop like this, the cache is loaded much faster:
    ' Load 64 longs from hub memory (@PTRA) into $100
    
    	REPS	#64,#1
    	SETINDA	#$100
    	RDLONGC	INDA++,PTRA++
    

    Andy
  • cgracey Posts: 14,251
    edited 2012-12-13 17:49
    Remember that if you set up 4-way multitasking, all jumps take 1 clock. Have you thought about multi-tasked LMM? Multi-threaded LMM might be safer, since it gives you resource control above atomic operations.
  • David Betz Posts: 14,516
    edited 2012-12-13 18:07
    Ariba wrote: »
    David

    You can try this 2-instr LMM loop; it is a bit slower than the other one I posted, but it may work for PropGCC without changes:
    lmm  reps #511,#8
         nop
          rdlongc ins1,pc
          add pc,#4
          nop
    ins1  nop
          rdlongc ins2,pc
          add pc,#4
          nop
    ins2  nop
         jmp #lmm
    

    Andy

    Oddly, this loop seems a bit slower than your simplest suggestion.
    __LMM_loop
            rdlongc L_ins1,pc   'read instruction
            add pc,#4           'point to next instr
            jmpd #__LMM_loop
    L_ins1    nop               'execute instruction
          nop               '3 delay slots for jmpd
              nop
            jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code
    
    result:
    
    fibo(26) = 121393 (995ms) (59711062 ticks)
    
    __LMM_loop
            reps #511,#8
            nop
            rdlongc L_ins1,pc
            add pc,#4
            nop
    L_ins1  nop
            rdlongc L_ins2,pc
            add pc,#4
            nop
    L_ins2  nop
            jmp #__LMM_loop
    
    result:
    
    fibo(26) = 121393 (1031ms) (61882602 ticks)
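    For what it's worth, the two tick counts above differ by a few percent; a quick sketch of the comparison (numbers copied from the results above, nothing new measured):

```python
jmpd_ticks = 59711062  # jmpd delay-slot loop
reps_ticks = 61882602  # reps #511,#8 loop
slowdown = reps_ticks / jmpd_ticks - 1
print(f"reps loop is {slowdown:.1%} slower by tick count")
```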
    
  • Ariba Posts: 2,690
    edited 2012-12-13 18:34
    cgracey wrote: »
    Remember that if you set up 4-way multitasking, all jumps take 1 clock. Have you thought about multi-tasked LMM? Multi-threaded LMM might be safer, since it gives you resource control above atomic operations.

    Do you mean 4 tasks that together execute one LMM code? I think that will be hard to do.

    I have tried 1 task which executes LMM and 3 other tasks with native PASM. With settask I can nicely weight the tasks. But if I use rdlongc for LMM, the other tasks execute with a lot of jitter, so it is better to use only rdlong (without cache) for LMM.

    Andy
  • Roy Eltham Posts: 3,000
    edited 2012-12-13 18:48
    Ariba,
    Using REPS with 511 for the loop count means loop forever, so it should only fall out when the repeated section does a jmp/call.
  • David Betz Posts: 14,516
    edited 2012-12-13 18:52
    Roy Eltham wrote: »
    Ariba,
    Using REPS with 511 for the loop count means loop forever, so it should only fall out when the repeated section does a jmp/call.
    Why does 511 mean to repeat forever? It looks like the instruction has a 14-bit field for the instruction count.
  • Ariba Posts: 2,690
    edited 2012-12-13 18:53
    Roy

    I think this is only the case with REPD 511,#xx.
    But REPD needs 3 delay slots before the loop, so it is slower when the LMM support functions jump back to the LMM loop.

    Andy
  • Ariba Posts: 2,690
    edited 2012-12-13 19:02
    David Betz wrote: »
    Oddly, this loop seems a bit slower than your simplest suggestion.
    __LMM_loop
            rdlongc L_ins1,pc   'read instruction
            add pc,#4           'point to next instr
            jmpd #__LMM_loop
    L_ins1    nop               'execute instruction
          nop               '3 delay slots for jmpd
              nop
            jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code
    
    result:
    
    fibo(26) = 121393 (995ms) (59711062 ticks)
    
    __LMM_loop
            reps #511,#8
            nop
            rdlongc L_ins1,pc
            add pc,#4
            nop
    L_ins1  nop
            rdlongc L_ins2,pc
            add pc,#4
            nop
    L_ins2  nop
            jmp #__LMM_loop
    
    result:
    
    fibo(26) = 121393 (1031ms) (61882602 ticks)
    

    Interesting, but nothing can really surprise me after trying so many LMM variants with unexpected results.

    I see now that the simplest loop can be made 1 instruction shorter; perhaps it will be a bit faster with that:
    __LMM_loop
            rdlongc L_ins1,pc   'read instruction
            jmpd #__LMM_loop
              add pc,#4           'point to next instr
    L_ins1    nop               'execute instruction
          nop               '3 delay slots for jmpd
            jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code
    
  • David Betz Posts: 14,516
    edited 2012-12-13 19:05
    Ariba wrote: »
    Interesting, but nothing can really surprise me after trying so many LMM variants with unexpected results.

    I see now that the simplest loop can be made 1 instruction shorter; perhaps it will be a bit faster with that:
    __LMM_loop
            rdlongc L_ins1,pc   'read instruction
            jmpd #__LMM_loop
              add pc,#4           'point to next instr
    L_ins1    nop               'execute instruction
          nop               '3 delay slots for jmpd
            jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code
    

    Yes, it helped a little:
    fibo(26) = 121393 (926ms) (55597232 ticks)
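    Compared with the earlier 59711062-tick run, that is roughly a 7% improvement; as a quick sketch:

```python
old_ticks = 59711062  # original jmpd loop (995 ms)
new_ticks = 55597232  # shortened loop with add in the first delay slot
speedup = old_ticks / new_ticks - 1
print(f"~{speedup:.1%} faster")
```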
    
  • Ariba Posts: 2,690
    edited 2012-12-13 19:33
    David

    Do these fibo benchmarks use FCACHE or not?
  • David Betz Posts: 14,516
    edited 2012-12-13 19:40
    Ariba wrote: »
    David

    Do these fibo benchmarks use FCACHE or not?
    No, I have fcache turned off because I want to see how well the various LMM loops perform.
  • Ariba Posts: 2,690
    edited 2012-12-13 19:53
    Okay, another one to try:
    __LMM_loop
            reps #511,#6
            nop
              rdlongc L_ins1,pc
              add pc,#4
              rdlongc L_ins2,pc
    L_ins1    nop
              add pc,#4
    L_ins2    nop
            jmp #__LMM_loop
    
    But for that you must also make the modification of POPRET I posted earlier:
    __LMM_POPRET
            call    #__LMM_POPM
            mov     pc,lr
        mov     L_ins2,L_dec_pc   'compensate 2nd inc_pc if popret is executed in L_ins1
    __LMM_POPRET_ret
            ret
    
    L_dec_pc  sub  pc,#4
    
  • David Betz Posts: 14,516
    edited 2012-12-13 20:00
    Ariba wrote: »
    Okay, another one to try:
    __LMM_loop
            reps #511,#6
            nop
              rdlongc L_ins1,pc
              add pc,#4
              rdlongc L_ins2,pc
    L_ins1    nop
              add pc,#4
    L_ins2    nop
            jmp #__LMM_loop
    
    But for that you must also make the modification of POPRET I posted earlier:
    __LMM_POPRET
            call    #__LMM_POPM
            mov     pc,lr
        mov     L_ins2,L_dec_pc   'compensate 2nd inc_pc if popret is executed in L_ins1
    __LMM_POPRET_ret
            ret
    
    L_dec_pc  sub  pc,#4
    
    This one gives me:
    fibo(26) = 121393 (1031ms) (61882602 ticks)
    

    I have to say that I'm a bit worried that I'm assembling the REPS instruction wrong. The instruction encoding table I got from Chip indicates that REPS has a 14-bit immediate field for the loop count, but you seem to be assuming it has a 9-bit field. There is some disconnect here, I think.
  • cgracey Posts: 14,251
    edited 2012-12-13 20:08
    There are two repeat instructions:

    REPS #loops,#ins - executes early in the pipeline, uses a 14-bit constant, needs only one spacer instruction
    REPD D,#ins - executes late in the pipeline, uses D or a 9-bit constant, needs three spacer instructions, if D is $1FF then infinite repeat

    If REPS is used by a task that has at least one other task between its own time slots, no spacers are needed.
    If REPD is used by a task that has at least three other tasks between its own time slots, no spacers are needed.
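    A minimal sketch of the loop-count semantics described above (the field widths and the $1FF sentinel come from the post; the helper names are mine, purely illustrative):

```python
def reps_loops(n):
    """REPS takes a 14-bit immediate loop count; there is no infinite form."""
    assert 0 <= n < (1 << 14), "count must fit in 14 bits"
    return n  # always a finite count

def repd_loops(d):
    """REPD takes D or a 9-bit constant; register address $1FF means repeat forever."""
    assert 0 <= d <= 0x1FF, "count must fit in 9 bits"
    return None if d == 0x1FF else d  # None = infinite repeat

print(reps_loops(511))    # 511 is a finite count here, not "forever"
print(repd_loops(0x1FF))  # None: infinite repeat
```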
  • David Betz Posts: 14,516
    edited 2012-12-13 20:13
    cgracey wrote: »
    There are two repeat instructions:

    REPS #loops,#ins - executes early in the pipeline, uses a 14-bit constant, needs only one spacer instruction
    REPD D,#ins - executes late in the pipeline, uses D or a 9-bit constant, needs three spacer instructions, if D is $1FF then infinite repeat

    If REPS is used by a task that has at least one other task between its own time slots, no spacers are needed.
    If REPD is used by a task that has at least three other tasks between its own time slots, no spacers are needed.
    Thanks for the explanation. I guess my instruction encodings are okay then.
  • David Betz Posts: 14,516
    edited 2012-12-13 20:18
    Chip: could you describe the timing of REPS? In particular, is any time spent between iterations of the loop or is the jump back to the start essentially free?
  • cgracey Posts: 14,251
    edited 2012-12-13 20:18
    David, I see in your post above that you are doing a REPS #511,#, which won't repeat forever, but will do 511 loops. For infinite repeat, you must do REPD $1FF,#. When the hardware sees register address $1FF, that means infinite. Sorry if you already understand this.
  • David Betz Posts: 14,516
    edited 2012-12-13 20:20
    cgracey wrote: »
    David, I see in your post above that you are doing a REPS #511,#, which won't repeat forever, but will do 511 loops. For infinite repeat, you must do REPD $1FF,#. When the hardware sees register address $1FF, that means infinite. Sorry if you already understand this.
    Actually, I was given that code by Ariba. I'll try your REPD example as soon as I teach the GNU assembler how to assemble REPD instructions! :-)
    I've been adding P2 instructions slowly as I need them.
  • Ariba Posts: 2,690
    edited 2012-12-13 20:21
    David Betz wrote: »
    This one gives me:
    fibo(26) = 121393 (1031ms) (61882602 ticks)
    

    I have to say that I'm a bit worried that I'm assembling the REPS instruction wrong. The instruction encoding table I got from Chip indicates that REPS has a 14-bit immediate field for the loop count, but you seem to be assuming it has a 9-bit field. There is some disconnect here, I think.

    Maybe - if reps doesn't loop correctly, then the jmp #__LMM_loop is executed, which would explain the slowness.
    I think the 14 bits are split into a 9-bit and a 6-bit field (9 bits for the repeat count, and 6 bits for the number of instructions). EDIT: 9+6 = 15, not 14 - something must be wrong ;-)

    You can try the simplest LMM loop with repeats to test the reps:
    __LMM_loop
            reps #511,#5
            nop
              rdlongc L_ins1,pc
              add pc,#4
              nop
    L_ins1    nop
              nop
            jmp #__LMM_loop
    
  • David Betz Posts: 14,516
    edited 2012-12-13 20:23
    Ariba wrote: »
    Maybe - if reps doesn't loop correctly, then the jmp #__LMM_loop is executed, which would explain the slowness.
    I think the 14 bits are split into a 9-bit and a 6-bit field (9 bits for the repeat count, and 6 bits for the number of instructions).

    You can try the simplest LMM loop with repeats to test the reps:
    __LMM_loop
            reps #511,#5
            nop
              rdlongc L_ins1,pc
              add pc,#4
              nop
    L_ins1    nop
              nop
            jmp #__LMM_loop
    
    Yes, but how do I know if it is actually repeating? If the REPS instruction were treated as a NOP, the code would still work. Under what circumstances will that last JMP instruction be executed normally?