That is the best that can be done without fairly major changes.
Frankly, I'd have to audit the back end to see how much needs to be changed. Even worse, it cannot be done one change at a time.
I still think it would be best to hand-compile some non-trivial examples to see whether it is even possible to generate efficient code within the constraints of some of these more advanced loops, and whether the speedup is worth it if it also bloats the generated code significantly. After all, the P2 only has 128K of hub memory, which is still pretty small compared with other MCUs, and using a sparse instruction encoding to gain speed might not be worth the tradeoff.
Instead, maybe Chip will put in the 15-bit PC and automatic rdlongc to fill the instruction pipeline. :-)
Past experience suggests very little bloat; I expect all four slots filled in 90%+ of cases, so 10% of 25% is 2.5% bloat (i.e. in 10% of quads, one long is lost: 2.5%).
P2 has almost 4x the hub RAM of P1, so programs can be much bigger. Besides, Chip may have added the compact 16-bit mode; the one I suggested when he asked would require very little logic to implement.
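The 2.5% figure above can be reproduced with a quick calculation (the 90% fill rate is an estimate from the post, not a measured number):

```python
# Expected code bloat if 90% of instruction quads are fully packed and
# the remaining 10% each waste one of their four longs.
filled_fraction = 0.90          # assumed fraction of fully packed quads
waste_per_partial_quad = 1 / 4  # one long lost out of four
bloat = (1 - filled_fraction) * waste_per_partial_quad
print(f"{bloat:.1%}")           # → 2.5%
```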
I never looked into the LMM kernel source of PropGCC, so I need some time to study it.
Basically the 2-instr LMM loop loads both instructions first and then executes the first. If you do an fjump in the first instruction, the second must be cancelled; I do that by restarting the LMM loop (jmp #lmm). The comment about the movi means the long should be seen as a NOP when executed; I don't know yet exactly what is necessary for that on the Prop2.
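A tiny Python model of the scheme described above may make the ordering clearer: each iteration fetches two longs, executes the first, and cancels the second whenever the first one changed the pc. (The tuple "instruction set" here is invented purely for illustration; it is not P2 code.)

```python
# Toy model of the 2-instruction LMM loop: fetch two longs, execute the
# first, and cancel the second if the first changed the pc (the
# "restart the LMM loop" behaviour described in the post).
def run_lmm2(program):
    pc, trace = 0, []
    while pc < len(program):
        first = program[pc]
        second = program[pc + 1] if pc + 1 < len(program) else None
        pc += 2                     # both longs have been fetched
        if first[0] == "jmp":       # branch in slot 1:
            pc = first[1]           # cancel slot 2, restart the loop
            continue
        trace.append(first[1])
        if second is None:
            continue
        if second[0] == "jmp":
            pc = second[1]
        else:
            trace.append(second[1])
    return trace

prog = [("op", "A"), ("op", "B"),
        ("jmp", 6),  ("op", "cancelled"),  # slot 2 cancelled by the branch
        ("op", "x"), ("op", "y"),          # skipped over by the jmp
        ("op", "C"), ("op", "D")]
print(run_lmm2(prog))                      # → ['A', 'B', 'C', 'D']
```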
What are the fibo(26) results of Prop1-GCC in LMM mode?
I'll try this as soon as I add the reps instruction to gas. :-)
What I don't understand is how this is any faster than the same loop with only one instruction per iteration. Is there a delay associated with the automatic branch back to the start? If so, wouldn't unwinding the loop further speed things up a little more?
The REPS instruction unwinds the loop for you (the repeat itself needs no additional cycles).
If the long is already in the cache, RDLONGC needs only 1 cycle; if both LMM instructions are also single-cycle, then the 2 iterations fit exactly in a hub window.
But I have now looked into the LMM kernel source. All the LMM-support routines which change the pc jump back to __LMM_loop, except POP_RET, because that is a subroutine.
You can try this modification to POP_RET:
__LMM_POPRET
        call    #__LMM_POPM
        mov     pc,lr
        mov     L_ins2,dec_pc   'compensate for the 2nd inc_pc if popret is executed in L_ins1
__LMM_POPRET_ret
        ret
dec_pc  sub     pc,#4
If this works, you don't need to add the REPS instruction.
I think it might be a good idea to reserve the PTRA and PTRB registers for the PC and stack pointer respectively. Improved LMM performance could then be had.
However, they would also be effective as loop counters.
Thanks! I'll try that change to LMM_POPRET in a little while. However, if REPS eliminates the looping cost, then why even have two instructions in the loop? Wouldn't a loop with only a single instruction be just as fast?
Yes, you are right, repeating a single instruction will be nearly the same.
I misread your last question - yes, the jmp #lmm takes another 4 cycles when the repeat loop is over and needs to be restarted, and then there are another 2 cycles for establishing the repeat again. But this happens only every 511 * 2 (or 511 * 1) instructions, so you will not gain much by unwinding more iterations.
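Using the cycle counts quoted above (4 cycles for the jmp #lmm restart plus 2 to re-establish the repeat, once every 511 iterations), the amortized overhead works out to well under 2%, which is why further unrolling buys so little:

```python
# Amortized cost of restarting the repeat block, per the figures above.
restart_cycles = 4 + 2   # jmp #lmm + re-establishing the repeat
iterations = 511         # iterations executed before the repeat expires
for instrs_per_iter in (1, 2):
    overhead = restart_cycles / (iterations * instrs_per_iter)
    print(f"{instrs_per_iter} instr/iter: {overhead:.2%} restart overhead")
# → 1 instr/iter: 1.17% restart overhead
# → 2 instr/iter: 0.59% restart overhead
```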
PTRB as stack pointer brings a lot of advantages because of the addressing mode (rdlong PTRB+-0..32), which makes local variables easy to access.
Another area for improvement is the FCACHE loader. With a load loop like this, the cache is loaded much faster:
' Load 64 longs from hub memory (@PTRA) into $100
        REPS    #64,#1
        SETINDA #$100
        RDLONGC INDA++,PTRA++
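To see why the streamed loader matters, compare it against a conventional read loop under two assumptions (neither figure is stated in the thread): RDLONGC inside a REPS block sustains one long per cycle once aligned to the hub window, while a plain rdlong/djnz loop pays a full 8-cycle hub window per long:

```python
# Rough cost of filling a 64-long FCACHE line, under stated assumptions.
longs = 64
streamed_cycles = longs * 1   # assumed: 1 cycle/long with REPS + RDLONGC
looped_cycles = longs * 8     # assumed: one hub window (8 cycles) per rdlong
print(streamed_cycles, looped_cycles, looped_cycles // streamed_cycles)
# → 64 512 8
```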
Remember that if you set up 4-way multitasking, all jumps take 1 clock. Have you thought about multi-tasked LMM? Multi-threaded LMM might be safer, since it gives you resource control above atomic operations.
Do you mean 4 tasks that together execute one LMM code stream? I think that will be hard to do.
I have tried 1 task which executes LMM and 3 other tasks with native PASM. With SETTASK I can nicely weight the tasks. But if I use rdlongc for LMM, the other tasks execute with a lot of jitter, so it is better to use only rdlong (without the cache) for LMM.
I think this is only the case with REPD 511,#xx.
But REPD needs 3 delay slots before the loop, so it is slower when the LMM support functions jump back to the LMM loop.
But for that you must also make the modification to POPRET that I posted earlier:
__LMM_POPRET
         call    #__LMM_POPM
         mov     pc,lr
         mov     L_ins2,L_dec_pc  'compensate for the 2nd inc_pc if popret is executed in L_ins1
__LMM_POPRET_ret
         ret
L_dec_pc sub     pc,#4
This one gives me:
fibo(26) = 121393 (1031ms) (61882602 ticks)
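The reported numbers are self-consistent: fibo(26) under the usual doubly recursive benchmark definition is indeed 121393, and 61882602 ticks comes to 1031 ms if the system clock is 60 MHz (the clock rate is my assumption; the post doesn't state it):

```python
def fibo(n):
    # doubly recursive fibo, as the benchmark defines it
    return n if n < 2 else fibo(n - 1) + fibo(n - 2)

ticks, clock_hz = 61882602, 60_000_000   # 60 MHz assumed, not stated
print(fibo(26), round(ticks / clock_hz * 1000))  # → 121393 1031
```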
I have to say that I'm a bit worried that I'm assembling the REPS instruction wrong. The instruction encoding table I got from Chip indicates that REPS has a 14-bit immediate field for the instruction count, but you seem to be assuming it has a 9-bit field. There is some disconnect here, I think.
REPS #loops,#ins - executes early in the pipeline, uses a 14-bit constant, needs only one spacer instruction
REPD D,#ins - executes late in the pipeline, uses D or a 9-bit constant, needs three spacer instructions, if D is $1FF then infinite repeat
If REPS is used by a task that has at least one other task between its own time slots, no spacers are needed.
If REPD is used by a task that has at least three other tasks between its own time slots, no spacers are needed.
Thanks for the explanation. I guess my instruction encodings are okay then.
Chip: could you describe the timing of REPS? In particular, is any time spent between iterations of the loop or is the jump back to the start essentially free?
David, I see in your post above that you are doing a REPS #511,#, which won't repeat forever, but do 511 loops. For infinite repeat, you must do REPD $1FF,#. When the hardware sees register address $1FF, that means infinite. Sorry if you already understand this.
Actually, I was given that code by Ariba. I'll try your REPD example as soon as I teach the GNU assembler how to assemble REPD instructions! :-)
I've been adding P2 instructions slowly as I need them.
Maybe, if REPS does not loop correctly, then the jmp #__LMM_loop is executed, which would explain the slowness.
I think the 14 bits are split into a 9-bit and a 6-bit field (9 bits for the repeat count, and 6 bits for the number of instructions). EDIT: 9+6 = 15, not 14 - something must be wrong ;-)
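The arithmetic behind that edit: a 9-bit repeat count plus a 6-bit instruction count needs 15 bits, one more than the 14-bit immediate field, so at least one of the guessed widths must be off:

```python
# Bit-budget check for the guessed REPS field split.
count_bits, ins_bits, imm_field_bits = 9, 6, 14  # widths as guessed above
needed = count_bits + ins_bits
print(needed, needed <= imm_field_bits)  # → 15 False
```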
You can try the simplest LMM loop with repeats to test the reps:
Yes, but how do I know if it is actually repeating? If the REPS instruction were treated as a NOP the code would still work. Under what circumstances will that last JMP instruction be executed normally?
Unfortunately, due to the current architecture, that would still only execute 1 in 8... no faster than LMM2C1.
Otherwise, major architecture changes would be needed for Prop2 (a 6-month to 1-year delay).
I'll be back in an hour or so, I have to run some errands.
You can try this 2-instr LMM loop; it is a bit slower than the other one I have posted, but it may work for PropGCC without changes:
Andy
Good work guys!
Oddly, this loop seems a bit slower than your simplest suggestion.
Using REPS with 511 for the loop count means loop forever, so it should only fall out when the repeated section does a jmp/call.
Interesting, but nothing can really surprise me after trying so many LMM variants with unexpected results.
I see now that the simplest loop can be made 1 instruction shorter, perhaps it will be a bit faster with that:
Yes, it helped a little:
Do these fibo benchmarks use FCACHE or not?