LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

David Betz · 2012-12-13 15:20

Bill Henning wrote: »

Yes.

That is the best that can be done without fairly major changes.

Frankly, I'd have to audit the back end to see how much needs to be changed. Even worse, it cannot be done one change at a time.

I still think it would be best to do some hand-compiling of some non-trivial examples to see if it is even possible to generate efficient code within the constraints of some of these more advanced loops and also whether it is worth the speedup if it also bloats the generated code significantly. After all, the P2 only has 128k of hub memory which is still pretty small compared with other MCUs and using a sparse instruction encoding to gain speed might not be worth the tradeoff.

Instead, maybe Chip will put in the 15 bit PC and automatic rdlongc to fill the instruction pipeline. :-)

Bill Henning · 2012-12-13 15:27

David Betz wrote: »

I still think it would be best to do some hand-compiling of some non-trivial examples to see if it is even possible to generate efficient code within the constraints of some of these more advanced loops and also whether it is worth the speedup if it also bloats the generated code significantly. After all, the P2 only has 128k of hub memory which is still pretty small compared with other MCUs and using a sparse instruction encoding to gain speed might not be worth the tradeoff.

Past experience suggests very little bloat, I expect all four slots filled in 90%+ of cases... so 10% of 25% is 2.5% bloat (ie in 10% of quads, one long lost... 2.5%)

P2 has almost 4x the hub of P1, so programs can be much bigger - besides, Chip may have added the compact 16 bit mode, the one I suggested when he asked would require very little logic to implement.

David Betz wrote: »

Instead, maybe Chip will put in the 15 bit PC and automatic rdlongc to fill the instruction pipeline. :-)

Unfortunately due to the current architecture that would still only execute 1 in 8.... no faster than LMM2C1

Major architecture changes would be needed for Prop2 (6 month to 1 year delay) otherwise.

I'll be back in an hour or so, I have to run some errands.

Ariba · 2012-12-13 15:27

Hello David

I never looked into the LMM kernel source of PropGCC, so I need some time to study it.
Basically the 2-instr LMM loads both instructions first and executes then the first. If you do an fjump in the first instruction the second must be cancelled, I do that with restarting the LMM loop (jmp #lmm). The comment about the movi means the long should be seen as a NOP when executed, I don't know exactly what is necessary for that in Prop2 yet.

What are the fibo(26) results of Prop1-GCC in LMM mode?

Andy

Ariba · 2012-12-13 15:50

David

You can try this 2-instr LMM loop, it is a bit slower than the other one I have posted, but may work for PropGCC without changes:

lmm  reps #511,#8
     nop
      rdlongc ins1,pc
      add pc,#4
      nop
ins1  nop
      rdlongc ins2,pc
      add pc,#4
      nop
ins2  nop
     jmp #lmm

Andy

David Betz · 2012-12-13 16:14

Ariba wrote: »
David

You can try this 2-instr LMM loop, it is a bit slower than the other one I have posted, but may work for PropGCC without changes:
lmm  reps #511,#8
     nop
      rdlongc ins1,pc
      add pc,#4
      nop
ins1  nop
      rdlongc ins2,pc
      add pc,#4
      nop
ins2  nop
     jmp #lmm
Andy

I'll try this as soon as I add the reps instruction to gas. :-)

What I don't understand is how this is any faster than the same loop with only one instruction per iteration? Is there a delay associated with the automatic branch back to the start? If so, wouldn't unwinding the loop further speed things up a little more?

Kye · 2012-12-13 16:45

Hmm, Chip could just build all this functionality into the processor. But, that would mean the P2 would never come out.

Good work guys!

Ariba · 2012-12-13 16:47

The REPS instruction unwinds the loop for you (repeat needs no additional cycles).
If the long is already in the cache RDLONGC needs only 1 cycle, if both LMM instructions are also single cycle then the 2 iterations fit exactly in a hub window.

But I have now looked into the LMM kernel source. All LMM-support routines which change the pc jumps back to __LMM_loop exept the POP_RET, because that is a subroutine.
You can try this modification to POP_RET:

__LMM_POPRET
        call    #__LMM_POPM
        mov     pc,lr
        mov     L_ins2,dec_pc   'compensate 2nd inc_pc if popret is executet in L_ins1 
__LMM_POPRET_ret
        ret

dec_pc  sub  pc,#4

If this works, you don't need to add the REPS instruction.

Andy

Kye · 2012-12-13 16:51

I think it might be a good idea to reserve the PTRA and PTRB variables for the PC and stack pointer respectively. Improved LMM performance could then be had.

However, for loop counters they will also be effective.

David Betz · 2012-12-13 16:52

Ariba wrote: »
The REPS instruction unwinds the loop for you (repeat needs no additional cycles).
If the long is already in the cache RDLONGC needs only 1 cycle, if both LMM instructions are also single cycle then the 2 iterations fit exactly in a hub window.

But I have now looked into the LMM kernel source. All LMM-support routines which change the pc jumps back to __LMM_loop exept the POP_RET, because that is a subroutine.
You can try this modification to POP_RET:
__LMM_POPRET
        call    #__LMM_POPM
        mov     pc,lr
        mov     L_ins2,dec_pc   'compensate 2nd inc_pc if popret is executet in L_ins1 
__LMM_POPRET_ret
        ret

dec_pc  sub  pc,#4
If this works, you don't need to add the REPS instruction.

Andy

Thanks! I'll try that change to LMM_POPRET in a little while. However, I if REPS eliminates the looping cost then why even have two instructions in the loop? Wouldn't a loop with only a single instruction be just as fast?

Ariba · 2012-12-13 17:22

David Betz wrote: »

Thanks! I'll try that change to LMM_POPRET in a little while. However, I if REPS eliminates the looping cost then why even have two instructions in the loop? Wouldn't a loop with only a single instruction be just as fast?

Yes, you are right, repaeting a single instruction will be nearly the same.
I missread your last question - yes the jmp #lmm takes another 4 cycles when the repeat loop is over and needs to be restarted, then there are another 2 cycles for establishing the repeat again. But this happens only every 511 * 2 (or 511 * 1) instructions. So you will not gain much with unwinding more iterations.

Ariba · 2012-12-13 17:48

Kye wrote: »

I think it might be a good idea to reserve the PTRA and PTRB variables for the PC and stack pointer respectively. Improved LMM performance could then be had.

However, for loop counters they will also be effective.

PTRB as stackpointer can bring a lot of advantages, because of the addressing mode: rdlong PTRB+-0..32. This can access local variables easy.

Another part for improvement is the FCACHE loader. With a load loop like that, the cache is loaded much faster:

' Load 64 longs from hub memory (@PTRA) into $100

	REPS	#64,#1
	SETINDA	#$100
	RDLONGC	INDA++,PTRA++

Andy

cgracey · 2012-12-13 17:49

Remember that if you set up 4-way multitasking, all jumps take 1 clock. Have you thought about multi-tasked LMM? Multi-threaded LMM might be safer, since it gives you resource control above atomic operations.

David Betz · 2012-12-13 18:07

Ariba wrote: »
David

You can try this 2-instr LMM loop, it is a bit slower than the other one I have posted, but may work for PropGCC without changes:
lmm  reps #511,#8
     nop
      rdlongc ins1,pc
      add pc,#4
      nop
ins1  nop
      rdlongc ins2,pc
      add pc,#4
      nop
ins2  nop
     jmp #lmm
Andy

Oddly, this loop seems a bit slower than your simplest suggestion.

__LMM_loop
        rdlongc L_ins1,pc   'read instruction
        add pc,#4           'point to next instr
        jmpd #__LMM_loop
L_ins1    nop               'execute instruction
          nop               '3 delay slots for jumpd
          nop
        jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code

result:

fibo(26) = 121393 (995ms) (59711062 ticks)

__LMM_loop
        reps #511,#8
        nop
        rdlongc L_ins1,pc
        add pc,#4
        nop
L_ins1  nop
        rdlongc L_ins2,pc
        add pc,#4
        nop
L_ins2  nop
        jmp #__LMM_loop

result:

fibo(26) = 121393 (1031ms) (61882602 ticks)

Ariba · 2012-12-13 18:34

cgracey wrote: »

Remember that if you set up 4-way multitasking, all jumps take 1 clock. Have you thought about multi-tasked LMM? Multi-threaded LMM might be safer, since it gives you resource control above atomic operations.

Do you mean 4 tasks that execute together one LMM-code ? I think that will be hard to do.

I have tried 1 task which executes LMM and 3 other tasks with native PASM. With settask I can nicely weight the tasks. But if I use rdlongc for LMM the other tasks executes with a lot of jitter, so it is better to use only rdlong (without cache) for LMM.

Andy

Roy Eltham · 2012-12-13 18:48

Ariba,
Using REPS with 511 for the loop count means loop forever, so it should only fall out when the repeated section does a jmp/call.

David Betz · 2012-12-13 18:52

Roy Eltham wrote: »

Ariba,
Using REPS with 511 for the loop count means loop forever, so it should only fall out when the repeated section does a jmp/call.

Why does 511 mean to repeat forever. It looks like the instruction has a 14 bit field for instruction count.

Ariba · 2012-12-13 18:53

Roy

I think this is only the case with REPD 511,#xx.
But REPD needs 3 delay slots before the loop, so it is slower when the LMM support funtions jump back to the LMM loop.

Andy

Ariba · 2012-12-13 19:02

David Betz wrote: »

Oddly, this loop seems a bit slower than your simplest suggestion.

__LMM_loop
        rdlongc L_ins1,pc   'read instruction
        add pc,#4           'point to next instr
        jmpd #__LMM_loop
L_ins1    nop               'execute instruction
          nop               '3 delay slots for jumpd
          nop
        jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code

result:

fibo(26) = 121393 (995ms) (59711062 ticks)

__LMM_loop
        reps #511,#8
        nop
        rdlongc L_ins1,pc
        add pc,#4
        nop
L_ins1  nop
        rdlongc L_ins2,pc
        add pc,#4
        nop
L_ins2  nop
        jmp #__LMM_loop

result:

fibo(26) = 121393 (1031ms) (61882602 ticks)

Interesting, but nothing can really surprise me after trying so many LMM variants with unexpected results.

I see now that the simplest loop can be made 1 instruction shorter, perhaps it will be a bit faster with that:

__LMM_loop
        rdlongc L_ins1,pc   'read instruction
        jmpd #__LMM_loop
          add pc,#4           'point to next instr
L_ins1    nop               'execute instruction
          nop               '3 delay slots for jumpd
        jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code

David Betz · 2012-12-13 19:05

Ariba wrote: »
Interesting, but nothing can really surprise me after trying so many LMM variants with unexpected results.

I see now that the simplest loop can be made 1 instruction shorter, perhaps it will be a bit faster with that:
__LMM_loop
        rdlongc L_ins1,pc   'read instruction
        jmpd #__LMM_loop
          add pc,#4           'point to next instr
L_ins1    nop               'execute instruction
          nop               '3 delay slots for jumpd
        jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code

Yes, it helped a little:

fibo(26) = 121393 (926ms) (55597232 ticks)

Ariba · 2012-12-13 19:33

David

Do this fibo benchmarks use FCACHE or not?

David Betz · 2012-12-13 19:40

Ariba wrote: »

David

Do this fibo benchmarks use FCACHE or not?

No, I have fcache turned off because I want to see how well the various LMM loops perform.

Ariba · 2012-12-13 19:53

Okay, another one to try:

__LMM_loop
        reps #511,#6
        nop
          rdlongc L_ins1,pc
          add pc,#4
          rdlongc L_ins2,pc
L_ins1    nop
          add pc,#4
L_ins2    nop
        jmp #__LMM_loop

But for that you must also make the modification of POPRET I postet earlier:

__LMM_POPRET
        call    #__LMM_POPM
        mov     pc,lr
        mov     L_ins2,L_dec_pc   'compensate 2nd inc_pc if popret is executet in L_ins1 
__LMM_POPRET_ret
        ret

L_dec_pc  sub  pc,#4

David Betz · 2012-12-13 20:00

Ariba wrote: »

Okay, another one to try:

__LMM_loop
        reps #511,#6
        nop
          rdlongc L_ins1,pc
          add pc,#4
          rdlongc L_ins2,pc
L_ins1    nop
          add pc,#4
L_ins2    nop
        jmp #__LMM_loop

But for that you must also make the modification of POPRET I postet earlier:

__LMM_POPRET
        call    #__LMM_POPM
        mov     pc,lr
        mov     L_ins2,L_dec_pc   'compensate 2nd inc_pc if popret is executet in L_ins1 
__LMM_POPRET_ret
        ret

L_dec_pc  sub  pc,#4

This one gives me:

fibo(26) = 121393 (1031ms) (61882602 ticks)

I have to say that I'm a bit worried that I'm assembling the REPS instruction wrong. The instruction encoding table I got from Chip indicates that REPS has a 14 bit immediate field for instruction count but you seem to be assuming it has a 9 bit field. There is some disconnect here I think.

cgracey · 2012-12-13 20:08

There are two repeat instructions:

REPS #loops,#ins - executes early in the pipeline, uses a 14-bit constant, needs only one spacer instruction
REPD D,#ins - executes late in the pipeline, uses D or a 9-bit constant, needs three spacer instructions, if D is $1FF then infinite repeat

If REPS is used by a task that has at least one other task(s) between its own time slots, no spacers are needed.
If REPD is used by a task that has at least three other task(s) between its own time slots, no spacers are needed.

David Betz · 2012-12-13 20:13

cgracey wrote: »

There are two repeat instructions:

REPS #loops,#ins - executes early in the pipeline, uses a 14-bit constant, needs only one spacer instruction
REPD D,#ins - executes late in the pipeline, uses D or a 9-bit constant, needs three spacer instructions, if D is $1FF then infinite repeat

If REPS is used by a task that has at least one other task(s) between its own time slots, no spacers are needed.
If REPD is used by a task that has at least three other task(s) between its own time slots, no spacers are needed.

Thanks for the explanation. I guess my instruction encodings are okay then.

David Betz · 2012-12-13 20:18

Chip: could you describe the timing of REPS? In particular, is any time spent between iterations of the loop or is the jump back to the start essentially free?

cgracey · 2012-12-13 20:18

David, I see in your post above that you are doing a REPS #511,#, which won't repeat forever, but do 511 loops. For infinite repeat, you must do REPD $1FF,#. When the hardware sees register address $1FF, that means infinite. Sorry if you already understand this.

David Betz · 2012-12-13 20:20

cgracey wrote: »

David, I see in your post above that you are doing a REPS #511,#, which won't repeat forever, but do 511 loops. For infinite repeat, you must do REPD $1FF,#. When the hardware sees register address $1FF, that means infinite. Sorry if you already understand this.

Actually, I was given that code by Ariba. I'll try your REPD example as soon as I teach the GNU assembler how to assemble REPD instructions! :-)
I've been adding P2 instructions slowly as I need them.

Ariba · 2012-12-13 20:21

David Betz wrote: »
This one gives me:
fibo(26) = 121393 (1031ms) (61882602 ticks)
I have to say that I'm a bit worried that I'm assembling the REPS instruction wrong. The instruction encoding table I got from Chip indicates that REPS has a 14 bit immediate field for instruction count but you seem to be assuming it has a 9 bit field. There is some disconnect here I think.

Maybe, if reps not loops correct, then the jmp #__LMM_loop is executed which would explain the slowness.
I think the 14bits are split in a 9bit and a 6bit field. (9bit for the repeat count, and 6 bit for the number of instructions). EDIT: 9+6 = 14 - something must be wrong ;-)

You can try the simplest LMM loop with repeats to test the reps:

__LMM_loop
        reps #511,#5
        nop
          rdlongc L_ins1,pc
          add pc,#4
          nop
L_ins1    nop
          nop
        jmp #__LMM_loop

David Betz · 2012-12-13 20:23

Ariba wrote: »
Maybe, if reps not loops correct, then the jmp #__LMM_loop is executed which would explain the slowness.
I think the 14bits are split in a 9bit and a 6bit field. (9bit for the repeat count, and 6 bit for the number of instructions).

You can try the simplest LMM loop with repeats to test the reps:
__LMM_loop
        reps #511,#5
        nop
          rdlongc L_ins1,pc
          add pc,#4
          nop
L_ins1    nop
          nop
        jmp #__LMM_loop

Yes, but how do I know if it is actually repeating? If the REPS instruction were treated as a NOP the code would still work. Under what circumstances will that last JMP instruction be executed normally?

LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Comments