I suppose there is no way to tap into the parts of the COGINIT instruction?
What I can see is that it has...
1. Reinitialise/reset the cog
2. An inbuilt fast quad loader - Starts at a given hub address D and loads $1F8/4 quad longs into cog $000..$1F7.
3. Commence cog execution at $000
Would it be possible to have a variant that...
1. Did not reinitialise/reset the cog
2. Only loaded a number of quad longs (or longs), as specified by S, from the given hub address into cog $000 onward
3. Commenced execution at cog $000 (as normal)
Uses:
Fast overlay loader - could be used in GCC by loading sub-routines. I expect this could actually be faster than LMM with the right compiler changes, which I think could be reasonably simple.
Fast overlay loader for pasm programmers - each overlay kept in separate DAT sections.
Fast dynamic driver reloading
Fast block data loads (perhaps video blocks, etc)
As always, I am not sure what it entails, and therefore how complex or simple it is to implement (or what its risks are). I just put it out there in case it is really simple to do.
Fast loading from hub to cog ram can be done with just a few instructions:
' Load 64 longs from hub memory (@PTRA) into $100
REPS #64,#1
SETINDA #$100
RDLONGC INDA++,PTRA++
This way, you can load as much or as little as you please, to wherever in the cog you'd like. Then, you can jump to it.
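For example, a minimal load-and-run overlay sequence could be as little as this (just a sketch - the 64-long size, the $100 destination, and the assumption that PTRA already points at the overlay image in hub are all placeholders):
' Copy a 64-long overlay from hub memory (@PTRA) into cog $100, then run it
REPS #64,#1 'repeat the next instruction 64 times
SETINDA #$100 'INDA -> destination cog address $100
RDLONGC INDA++,PTRA++ 'copy one long per iteration, hub -> cog
JMP #$100 'jump into the freshly loaded overlay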
Here is my personal suggestion for an easy-to-use LMM:
lmm rdlongc ins1,pc 'read instruction
add pc, #4 'point to next instr
jmpd #lmm
ins1 nop 'execute instruction
nop '3 delay slots for jumpd
nop
jmp #lmm 'loop in case jmpd got cancelled by LMM code
fjmp: rdlongc pc,pc
long jumpaddress 'bit31..18 must be 0
call subroutine:
jmp #fcall
long address
fcall: pusha pc
rdlongc pc,pc
jmp #lmm
fret: popa pc
movi: rdlongc rx,pc
long value 'bit31..18 must be 0
branch: sub pc,#displacement
or add pc,#displacement
This will execute 1 instruction every 8 cycles = 20 MIPS @ 160MHz
with 2 spare cycles in the loop, so the LMM instruction can take up to
3 cycles without slowing down the MIPS.
------------------------------------------------------------------
lmm rdlongc ins1,pc 'read instruction 1
add pc, #4 'point to ins1+1
rdlongc ins2,pc 'read instruction 2
jmpd #lmm
ins1 nop 'execute instruction 1
add pc,#4 'point to ins2+1
ins2 nop 'execute instruction 2
jmp #lmm 'loop in case jmpd got cancelled by LMM code
This will execute 2 instructions in 8 cycles = 40 MIPS @ 160MHz
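(A quick sanity check on the MIPS arithmetic, assuming one 8-cycle hub window per pass through the loop: 160 MHz / 8 cycles per window = 20 M windows/s, so 1 LMM instruction per window gives 20 MIPS and 2 per window give 40 MIPS.)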
fjump in LMM
jmp #fjmp
long jumpaddress
...
fjmp: rdlongc pc,pc
jmp #lmm 'cancel ins2 if fjump on ins1
call subroutine in LMM:
jmp #fcall
long address
...
fcall: pusha pc
rdlongc pc,pc
jmp #lmm
return:
jmp #fret
fret: popa pc
jmp #lmm
movi: rdlongc rx,pc
long value 'bit31..18 must be 0
branch: sub pc,#displacement
jmp #lmm 'cancel ins2 if branch on ins1
or add pc,#displacement
nop
This also includes the LMM primitives for fjmp and so on. Because all the faster LMM variants load instructions before the previous ones have executed, these primitives get more complicated.
Andy
Thanks Andy! That looks quite reasonable as a next step. I'll try it once I teach GAS some more P2 instructions.
Bill,
Your thinking about the sequence of events is wrong. I never saw your reply before editing my post. The time put into the system is when I click submit. I posted my first message, re-read it and realized my own mistake, and edited the post, but it took a couple of minutes. Your reply came in during that time. Then all the other stuff ensued. I apologize for not realizing you had already done a 3-instruction variant with rdlongc, but you have posted MANY variants in this thread, and I missed that one. I wasn't trying to be confrontational or whatever here, just offering up what I thought was an interesting alternative (that turned out to be already discussed).
Also, I'm not sure why people think calling it a "fancy new LMM variant" is a negative thing? I generally think of fancy and new as positive. I like them. I think it's awesome that you guys are doing it and found the bug in the chip in doing so.
Most of my talking about wanting the simpler LMM version for Prop2GCC is in reply to Sapieha who seems to think we should only go down the longer route and forego getting something working more quickly with a simpler LMM first. As I can see from the other PropGCC guys posting, they agree with me on getting it working quickly with a simpler version, and then optimizing it from there.
Also, the forcing comment was to Sapieha, not you. It's very difficult to properly understand him many times, and the tone of his statements often comes across as negative or personal. It's quite likely that I misunderstood him, but I'm not sure I'll ever know. Anyway, that's not the point...
Bill,
Also, I am sorry it came across to you that I was "being a strawman" or attempting to discourage a fancy LMM2 with all the tricks. I most certainly am not. I want all this stuff. However, I know from talking with Ken recently that it's very important to him to have a good working toolset at launch. I am going to do what I can to help make that happen. Currently, I am working on the Prop2 version of my open source C/C++ Spin/PASM compiler, among other things. I wish I had not missed the fact that you had already done the 3-instruction variant with rdlongc; then I wouldn't have even posted mine and started all this (obviously).
Sapieha,
Ok, going forward I will assume you are not being negative and that I am just misunderstanding things. Hopefully it will go more smoothly between us now. Sorry for the previous posts directed at you here.
I'm very impressed by the efforts of Bill, Andy, and others to push the performance envelope. I am not the P2 GCC lead, so it is not my decision whether any of it is used or not. That being said, most companies want to get a marketable product out the door.
Parallax is no different about "getting it out", and I'm sure they will use goal-driven value engineering as in the past.
An evolutionary approach is not unreasonable, but I do think the bar should be set high enough to compete effectively. I hope some consensus on what is good enough in a reasonable time can be achieved.
So, what is good enough? Andy has started to answer the question. Comparisons are difficult, but must be made.
I'll leave it at that. I have more useful things to do with my life these days, so don't expect many posts from me beyond this or some technical support as required.
With what ARM chips should Prop2 compete? There are ARMs from 8 pins / 40 MIPS for a few cents up to chips with four 1.5 GHz cores, which cost not much more than the Prop2 will. So just forget that.
I don't think that the GCC makers are too lazy to implement the Quad-LMM. The overlayed Quad-LMM has efficiency issues which eat up all the speed gain, in that it needs to execute more instructions than normal LMM for the same code.
Say you copy a chunk of bytes. Normal LMM code can look like this:
bcopy rdbyte tmp,ptra++
wrbyte tmp,ptrb++
sub count,#1 wz
if_nz sub pc,#4*4 'jump back
If you want to do the same with Quad-LMM you can't use ptra (it's used as the pc), the byte access must be at the end of the quad packets, and the jump back in the loop needs an additional delayed quad packet to execute. In the worst case it looks like this:
bcopy nop
nop
nop
rdbyte tmp,srcaddr
add srcaddr,#1
sub count,#1 wz
if_nz getptra bcopy_location 'jump back
wrbyte tmp,destaddr
add destaddr,#1 'this quad will execute before jump happens
nop
nop
nop
Even if it may execute in the same time, this needs 3 times the code space. And code space will be even more of a problem on the Prop2 than on the Prop1.
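(Spelling out the code-space arithmetic from the two listings: the normal LMM loop is 4 longs, the Quad-LMM version is 12 longs = 3 quads, so 12 / 4 = 3 times the hub space for the same loop.)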
Thanks Andy - however for this case, PropGCC already uses FCACHE to make it fast
I knew you would say this.
But that is really more an argument against fast LMM, because FCACHEd routines will execute at maximum speed even with the slowest LMM implementation.
All I will say is that the RDQUAD based LMM2 is the fastest LMM style engine, and will run linear code (that is not FCACHE-able) much faster.
The best that single-RDLONGC per loop engines can do is 3 ins / 24 cycles, ie 1/6 efficiency, compared to 4/8 ie 1/2 efficiency ... so LMM2 RDQUAD version is three times faster for non-cached code.
That is an extremely compelling argument for LMM2Q (quad version short name)
What exactly are the constraints placed on the generated code for LMM2Q? I probably won't be working on GCC code generator changes so I guess only Eric really needs to know but I'm curious.
The biggest one I am aware of right now is that the last slot (of four) should be used for any RDxxxx/WRxxxx instruction, and I am guessing that is the safe place for all other long (multi-cycle) instructions.
Also, the VLIW-style four-instruction packets must be quad-long aligned in the hub, which should be very easy for GCC "alignment rules" to implement.
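For illustration only (not output from any real compiler - the register names and hubptr are placeholders), a packet following those two rules might look like:
' One LMM2Q packet: quad-long aligned in hub, hub access kept in the last slot
add r0,r1 'slot 0: single-cycle instruction
shl r0,#2 'slot 1: single-cycle instruction
mov r2,r0 'slot 2: single-cycle instruction
rdlong r3,hubptr 'slot 3: hub read/write goes in the last slot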
I will be testing all the opcodes I can over the next few days/weeks; LMM2Q will be the central feature of a couple of projects/products I've been working on since the initial P2 thread, so I have to make sure it will work with all instructions (that make sense in an LMM-style hypervisor).
(said projects/products were on hold until I could actually try P2 instructions and grok the pipeline - thanks Chip for the Nano FPGA image!)
1) Code must be executed in quads; you can't jump to locations that are not quad-aligned.
2) The first 3 instructions in the quad must be single-cycle instructions.
3) If a jump occurs, the following quad is already loaded into the quad registers and will be executed, or must be zeroed.
4) The pc is at least 1 quad ahead, which makes it difficult to implement the jumps, movi and so on. Also, the pc is incremented in quads, not in instructions, so you need to place constants for jump/movi at a defined slot in the quad to know where they are. And you need to load the constant again from hub, because the current quad has already been overwritten.
I'm sure there is more that I'm not aware of yet.
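To make points 1 and 3 a bit more concrete, a purely illustrative fragment (placeholder label, not from a working kernel):
' A branch target has to sit on a quad (16-byte) boundary (point 1). One way
' to cope with point 3 is to pad the longs behind a jump with zeros, so that
' if the prefetched quad does get executed it does nothing.
long 0,0,0,0 'zeroed quad behind a jump (point 3)
target add r0,r1 'quad-aligned target; rest of the packet as in the earlier sketch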
Andy
Thanks for the explanation. I wonder if someone could do an analysis of whether this would even be worth it for common code sequences. It might be useful to hand-assemble some non-trivial code sequences to see whether these fairly severe constraints will be balanced out by the faster code execution.
Yes, that was also my conclusion. That's why I decided to take a step back and do LMM with rdlongc. The cached rdlongs also profit a lot from the quad load mechanism, but hide the complicated stuff behind the cache manager in the cog. Compared to the LMM2Q you need an additional cycle for every read from the cache, but that's only 3 cycles more every 4 instructions. EDIT: The rdlongc takes 3 cycles if the cache must be loaded, so it's 5 cycles more every 4 instructions.
I've done a little comparison of the LMM variants. They all execute a chunk of 1024 instructions with no jumps - it's the code that Bill used to test the LMM2Q.
This kind of code favors the LMM2Q and also the 3-instruction variant. But I haven't implemented jumps for them, so I can't use jumps in the test code.
1, 2, 3 instr means an LMM loop which executes 1, 2 or 3 instructions inside a hub window in the best case. How often this best case happens depends on the LMM code. This test code has a rdlong and a wrlong in it, which makes it impossible for any of the 4 variants to avoid missing hub windows. With only single-cycle instructions the LMM2Q would be further ahead.
Ariba: I have a question about a comment in your two-instruction loop. Why do immediate values need to have bits 18-31 zero? This is probably what is causing this loop to fail in the P2 LMM kernel.
movi: rdlongc rx,pc
long value 'bit31..18 must be 0
Edit: Maybe this isn't the problem. The GCC LMM kernel handles constants differently.
I've tried both of Andy's suggestions in the P2 GCC LMM kernel. The first works but the second doesn't. I've attached the source for the LMM kernel. I'd appreciate it if someone would look it over to see why Andy's second loop doesn't work. The loops are all at the address __LMM_loop. All of the loops are commented out except Andy's second option. Can anyone see what might be causing trouble here?
And, before anyone mentions it, I will certainly replace the software multiply and divide code with P2 instructions soon! I just want to get the LMM loop working first. :-)
Edit: Sorry! Forgot the attachment the first time!
I suggest you concentrate on LMM2C1; it should require few, if any, changes in PropGCC.
As I mentioned in post#1, LMM2C2 and faster require significant PropGCC changes, as they are not PropGCC compatible.
What I'm trying is what Andy suggested. I don't know which of your acronyms it corresponds to. Could you look at the code and let me know what might be going wrong? It doesn't really seem much different than the unwound loop we had in the P1 LMM kernel, so I don't see why it would require lots of code generator changes.
Okay, I guess I see the problem. Any LMM macros that are executed in the delay slots will mess things up. So I guess the slowest option is our only choice unless we make compiler changes.
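To illustrate what I mean (my reading of the failure, based on Andy's two-instruction loop above; the macro name is just a placeholder):
' Suppose the compiled hub code contains a kernel-macro call:
' jmp #__LMM_call_macro 'this long lands in the ins1 delay slot
' long target_address
' By the time ins1 executes, the loop has already done "add pc,#4" and fetched
' the address long into ins2, so a macro that assumes pc still points at the
' long right after the jmp (as in the old P1 kernel) no longer finds pc - or
' the pending delayed jump - in the state it expects.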
That is basically LMM2C2 (two cached reads in the same 8 cycle hub loop, LMM2C3 is the one with three cached reads - you can find the acronym definitions in post#287 above, also the FAQ in post#2)
David, it would take me far too much time to explain, definitely more than wifey will let me do for free.
It will be easy for you to get the LMM2C1 variant running, and I strongly recommend that path to you. Anything faster than that will require a LOT of changes to GCC, which I will be happy to explain in detail once Ken gives me the go-ahead.
You found one of the approximately 30-50 changes required.
I will be happy when you get the LMM2C1 variant running, it will give us all a quick gcc fix, and a basis for comparison.
Unfortunately, as I told Ken at the original kickoff meeting 2.5 years ago, Propeller 2 GCC essentially requires a new back end for anything beyond LMM2C1. For which I have the CS background.
Edit:
Sorry, above line sounds bad, not what I meant. What I meant is that fortunately I have the background so that I know what is needed - and would be able to direct the port to get it done much faster.
Please see the post you quoted - after pressing "Post" I realized it might sound arrogant, which definitely was not my intention!
LMM2C1 = LMM for Propeller 2, C uses RDLONGC, 1 per 8 cycle hub window
I think you can see why I use acronyms :-)
I *really* am looking forward to trying your LMM2C1 modification of PropGCC - I am not being sarcastic or kidding (added to avoid potential misunderstanding)
Are you talking about this?
__LMM_loop
rdlongc L_ins1,pc 'read instruction
add pc,#4 'point to next instr
jmpd #__LMM_loop
L_ins1 nop 'execute instruction
nop '3 delay slots for jumpd
nop
jmp #__LMM_loop 'loop in case jmpd got cancelled by LMM code
Comments
This should hit the sweet spot for 4x longs in 8 clocks! This is great Chip, thanks
Results:
Andy
Thank you for the benchmarks!
Today I am working on a Xmas present for the forum... could I ask a favor?
Would you mind adding another comparison result to the benchmarks? Cog code for native MIPS?
I suspect the result will be around $800 cycles as I think the hub window "shadows" will swallow all the single cycle instructions.
Also, can you add the eight instructions that are repeated? I have so many versions here that I am not sure which you used.
Thank you again for all the help / testing / contributions!
LMM2Q - Propeller 2 LMM, RDQUAD 4/8 efficiency
LMM2C - Propeller 2 LMM, one RDLONGC per hub window, 1/8 efficiency
LMM2C2 - two RDLONGC's per hub window, 2/8 efficiency
LMM2C3 - three RDLONGC's per hub window, 3/8 efficiency
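For a rough feel of what those efficiencies mean at 160 MHz (peak figures only, ignoring hub-access stalls and jumps): peak MIPS = 160 MHz x (instructions per 8-cycle window) / 8, i.e. roughly 20 / 40 / 60 / 80 MIPS for LMM2C / LMM2C2 / LMM2C3 / LMM2Q respectively.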
Revised LMM2Q torture test:
Results:
Note: The loop now executes 257 times, so 256 LMM2Q "quad" packets are executed - this makes checking the results for correctness easier.
For the LMM2Q I just took your results, so the instruction order was:
For the other variants I had the old instruction order:
So I have tried the LMM2Q order also for the other variants. Interestingly, this affects the LMM2C3 results a lot. This variant is very sensitive to the placement of hub accesses. In the worst case you get only the same performance as with LMM2C2.
Results:
Andy
With the old P1 LMM loop: With "option 1": I haven't been able to get your two-instruction loop to work yet.
Edit: I just thought of something. I never installed the new FPGA configuration with Chip's bug fix. Could that cause the two-instruction loop to fail?
crt0_p2.S
Thanks!
I am now curious what the ratios would be if all eight test instructions were single-cycle ones - maybe "add countX,#1" with X=1..8
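Something like this, say - two quads of four single-cycle adds (just sketching the kind of test packet I mean):
add count1,#1
add count2,#1
add count3,#1
add count4,#1
add count5,#1
add count6,#1
add count7,#1
add count8,#1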
I'd run the tests but I am working on a surprise
David:
I took a quick look as requested.
What is LMM2C1?
LMM2C1 = LMM for Propeller 2, C uses RDLONGC, 1 per 8 cycle hub window
That is the best that can be done without fairly major changes.
Frankly, I'd have to audit the back end to see how much needs to be changed. Even worse, it cannot be done one change at a time.