LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz) - Page 10 — Parallax Forums

LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)


Comments

  • cgraceycgracey Posts: 14,253
    edited 2012-12-12 16:45
    Cluso99 wrote: »
    Chip: It's a bit late to ask, but just in case...

    I suppose there is no way to tap the COGINIT parts of the instruction ???

    What I can see is that it has...
    1. Reinitialise/reset the cog
    2. An inbuilt fast quad loader - Starts at a given hub address D and loads $1F8/4 quad longs into cog $000..$1F7.
    3. Commence cog execution at $000

    Would it be possible to have a variant that...
    1. Did not reinitialise/reset the cog
    2. Only loaded from the given hub address into cog $000... a number of quad longs (or longs) as specified by S
    3. Commenced execution at cog $000 (as normal)

    Uses:
    Fast overlay loader - could be used in GCC by loading sub-routines. I expect this could actually be faster than LMM with the right, but what I think could be reasonably simple, compiler changes.
    Fast overlay loader for pasm programmers - each overlay kept in separate DAT sections.
    Fast dynamic driver reloading
    Fast block data loads (perhaps video blocks, etc)

    As always, I am not sure of what it entails and therefore how complex/simple it is to implement (and its risks). I just put it out there in case it is really simple to do.

    Fast loading from hub to cog ram can be done with just a few instructions:
    ' Load 64 longs from hub memory (@PTRA) into $100
    
    	REPS	#64,#1
    	SETINDA	#$100
    	RDLONGC	INDA++,PTRA++
    

    This way, you can load as much or as little as you please, to wherever in the cog you'd like. Then, you can jump to it.
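As a rough illustration of the three-instruction loader above, here is a minimal Python model of the copy it performs. The function name `block_load` and the sample hub contents are made up for this sketch, and hub memory is modeled as a list of longs rather than byte addresses.

```python
def block_load(hub, cog, hub_index, cog_addr, count):
    """Copy `count` longs from hub[hub_index:] into cog[cog_addr:].

    Loosely mirrors REPS #count,#1 / SETINDA #cog_addr / RDLONGC INDA++,PTRA++:
    both pointers advance by one long per repetition.
    """
    for _ in range(count):
        cog[cog_addr] = hub[hub_index]
        cog_addr += 1      # INDA++ (cog registers are long-addressed)
        hub_index += 1     # PTRA++ (modeled as a long index, not a byte address)
    return cog


hub = list(range(1000, 1256))   # 256 longs of made-up hub contents
cog = [0] * 0x200               # 512 cog registers, all cleared
block_load(hub, cog, hub_index=0, cog_addr=0x100, count=64)
```

Only registers $100..$13F change; everything else in the cog is left alone, which is the point of loading "as much or as little as you please".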
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-12 17:02
    Ariba wrote: »
    Here is my personal consensus for easy to use LMM:
    lmm    rdlongc ins1,pc     'read instruction
           add pc, #4          'point to next instr
           jmpd #lmm
    ins1   nop                 'execute instruction
           nop                 '3 delay slots for jmpd
           nop
           jmp #lmm            'loop in case jmpd got cancelled by LMM code
    
    
    fjmp:  rdlongc pc,pc
           long jumpaddress    'bit31..18 must be 0
    
    call subroutine:
           jmp #fcall
           long address
    
    fcall: pusha pc
           rdlongc pc,pc
           jmp #lmm
    
    fret:  popa pc
    
    movi:  rdlongc rx,pc
           long value           'bit31..18 must be 0
    
    branch: sub pc,#displacement
      or    add pc,#displacement
    
    
    This will execute 1 instruction every 8 cycles = 20 MIPS @ 160MHz
    with 2 spare cycles in the loop, so the LMM instruction can take up to
    3 cycles without slowing down the MIPS.
    
    
    ------------------------------------------------------------------
    
    
    lmm    rdlongc ins1,pc     'read instruction 1
           add pc, #4          'point to ins1+1
           rdlongc ins2,pc     'read instruction 2
           jmpd #lmm
    ins1   nop                 'execute instruction 1
           add pc,#4           'point to ins2+1
    ins2   nop                 'execute instruction 2
           jmp #lmm            'loop in case jmpd got cancelled by LMM code
    
    
    This will execute 2 instructions in 8 cycles = 40 MIPS @ 160MHz
    
    
    
    fjump in LMM
           jmp #fjmp
           long jumpaddress
           ...
            
    fjmp:  rdlongc pc,pc
           jmp #lmm             'cancel ins2 if fjump on ins1
    
    call subroutine in LMM:
           jmp #fcall
           long address
           ...
    
    fcall: pusha pc
           rdlongc pc,pc
           jmp #lmm
    
    return:
           jmp #fret
    
    fret:  popa pc
           jmp #lmm
    
    movi:  rdlongc rx,pc
           long value           'bit31..18 must be 0
    
    branch: sub pc,#displacement
           jmp #lmm             'cancel ins2 if branch on ins1
    
      or    add pc,#displacement
            nop
    
    This also includes the LMM primitives for fjmp and so on. Because all the faster LMM variants load instructions before the previous ones have executed, these primitives get more complicated with them.
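To make the fetch/advance/execute pattern of the one-instruction loop concrete, here is a toy Python model. It is only a sketch: `run_lmm`, the tuple instruction format and the sample program are invented for illustration, and treating an inline constant as a no-op when it is fetched is an assumption about what the "bit31..18 must be 0" comments are for.

```python
def run_lmm(hub, max_steps=1000):
    """Toy one-instruction LMM loop over a list of 'longs' (byte address = index * 4)."""
    regs = {"rx": 0, "acc": 0}
    pc = 0
    executed = 0
    while executed < max_steps:
        ins = hub[pc // 4]              # rdlongc ins1,pc
        pc += 4                         # add pc,#4
        executed += 1
        if isinstance(ins, int):
            continue                    # inline constant: modeled as a no-op (assumption)
        op = ins[0]
        if op == "halt":                # hypothetical stop marker, not a real primitive
            break
        elif op == "add":
            regs[ins[1]] += ins[2]
        elif op == "movi":
            regs["rx"] = hub[pc // 4]   # rdlongc rx,pc (pc already points at the constant)
        elif op == "fjmp":
            pc = hub[pc // 4]           # rdlongc pc,pc
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs, executed


program = [
    ("movi",), 5,            # load rx with the long that follows
    ("add", "acc", 1),
    ("fjmp",), 24,           # jump to byte address 24
    ("add", "acc", 100),     # skipped by the jump
    ("halt",),
]
regs, executed = run_lmm(program)
```

Because the pc has already been advanced when a primitive runs, `movi` and `fjmp` can read their operand from the long right after them without any extra address arithmetic.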

    Andy

    Thanks Andy! That looks quite reasonable as a next step. I'll try it once I teach GAS some more P2 instructions.
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-12 17:27
    cgracey wrote: »
    Fast loading from hub to cog ram can be done with just a few instructions:
    ' Load 64 longs from hub memory (@PTRA) into $100
    
        REPS    #64,#1
        SETINDA    #$100
        RDLONGC    INDA++,PTRA++
    

    This way, you can load as much or as little as you please, to wherever in the cog you'd like. Then, you can jump to it.

    This should hit the sweet spot for 4x longs in 8 clocks! This is great Chip, thanks ;)
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 05:46
    Mistakes happen.
    Roy Eltham wrote: »
    Bill,
    Your thinking about the sequence of events is wrong. I never saw your reply before editing my post. The time is put into the system when I click submit. I posted my first message, re-read it, realized my own mistake and edited the post, but it took a couple of minutes. Your reply came in during that time. Then all the other stuff ensued. I apologize for not realizing you had already done a 3 instruction variant with rdlongc, but you have posted MANY variants in this thread, and I missed that one. I wasn't trying to be confrontational or whatever here, just offering up what I thought was an interesting alternative (that turned out to be already discussed).

    Also, I'm not sure why people think calling it a "fancy new LMM variant" is a negative thing? I generally think of fancy and new as positive. I like them. I think it's awesome that you guys are doing it and found the bug in the chip in doing so.

    Most of my talking about wanting the simpler LMM version for Prop2GCC is in reply to Sapieha who seems to think we should only go down the longer route and forego getting something working more quickly with a simpler LMM first. As I can see from the other PropGCC guys posting, they agree with me on getting it working quickly with a simpler version, and then optimizing it from there.

    Also, the forcing comment was to Sapieha, not you. It's very difficult to properly understand him many times, and the tone of his statements often comes across as negative or personal. It's likely that I misunderstood him, but I'm not sure I'll ever know. Anyway, that's not the point...


    Roy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 05:46
    Water under the bridge.
    Roy Eltham wrote: »
    Bill,
    Also, I am sorry that it came across to you that I was "being a strawman" or attempting to discourage a fancy LMM2 with all the tricks. I most certainly am not. I want all this stuff. However, I know from talking with Ken recently that it's very important to him to have a good working toolset at launch. I am going to do what I can to help make that happen. Currently, I am working on the Prop2 version of my open source C/C++ Spin/PASM compiler among other things. I wish I had not missed the fact that you had already done the 3 instruction variant with rdlongc; then I wouldn't have even posted mine and started all this (obviously).

    Sapieha,
    Ok, I will assume you are not being negative, and that I am just misunderstanding things in the future. Hopefully it will go more smoothly between us now. Sorry for the previous posts directed at you here.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 05:47
    Thanks Steve.
    jazzed wrote: »
    I'm very impressed by Bill's, Andy's, and others' efforts to push the performance envelope. I am not the P2 GCC lead, so it is not my decision whether any of it is used or not. That being said, most companies want to get a marketable product out the door.

    Parallax is no different about "getting it out" and I'm sure they will use goal driven, value engineering as in the past.

    An evolutionary approach is not unreasonable, but I do think the bar should be set high enough to compete effectively. I hope some consensus of what is good enough in a reasonable time can be achieved.

    So, what is good enough? Andy has started to answer the question. Comparisons are difficult, but must be made.

    I'll leave it at that. I have more useful things to do with my life these days, so don't expect many posts from me beyond this or some technical support as required.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 05:48
    Thanks Andy - however for this case, PropGCC already uses FCACHE to make it fast :)
    Ariba wrote: »
    With what ARM chips should Prop2 compete? There are ARMs from 8 pins / 40 MIPS for a few cents up to chips with four 1.5 GHz cores, which cost not much more than the Prop2 will. So just forget that.


    I don't think that the GCC makers are too lazy to implement the Quad-LMM. The overlaid Quad-LMM has efficiency issues, which eat up all the speed gain in that it needs to execute more instructions than normal LMM for the same code.

    Say you copy a chunk of bytes. Normal LMM code can look like this:
    bcopy   rdbyte tmp,ptra++
            wrbyte tmp,ptrb++
            sub count,#1  wz
     if_nz  sub pc,#4*4         'jump back
    

    If you want to do the same with Quad-LMM, you can't use ptra (it is used as the pc), the byte access must be at the end of a quad packet, and the jump back in
    the loop needs an additional delayed quad packet to execute. In the worst case it looks like this:
    bcopy   nop
            nop
            nop
            rdbyte tmp,srcaddr
    
            add srcaddr,#1
            sub count,#1 wz
     if_nz  getptra bcopy_location  'jump back
            wrbyte tmp,destaddr
    
            add destaddr,#1         'this quad will execute before jump happens 
            nop
            nop
            nop
    
    Even if it may execute in the same time, this needs 3 times the code space. And code space will be even more of a problem on Prop2 than on Prop1.
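The code-space figure can be checked by counting the longs in the two bcopy listings. The Python below just tallies mnemonics transcribed from those listings; the variable names are made up for this sketch.

```python
# Mnemonics of the plain LMM byte-copy loop (one long each).
normal_lmm = ["rdbyte", "wrbyte", "sub", "sub"]

# Mnemonics of the worst-case Quad-LMM version, three quads of four longs.
quad_lmm = [
    "nop", "nop", "nop", "rdbyte",
    "add", "sub", "getptra", "wrbyte",
    "add", "nop", "nop", "nop",
]

size_ratio = len(quad_lmm) / len(normal_lmm)        # hub longs, quad vs. plain
useful = sum(op != "nop" for op in quad_lmm)        # instructions doing real work
padding = len(quad_lmm) - useful                    # nop filler

print(f"{len(quad_lmm)} longs vs {len(normal_lmm)} longs: {size_ratio:.0f}x the code space")
print(f"{useful} useful instructions, {padding} nops")
```

Half of the quad version is padding, and even the useful part is longer because the pointer increments become separate `add` instructions.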

    Andy
  • AribaAriba Posts: 2,690
    edited 2012-12-13 06:48
    Thanks Andy - however for this case, PropGCC already uses FCACHE to make it fast :)

    I knew you would say this :smile:

    But that is really more an argument against fast LMM, because FCACHEd routines will execute at max speed, even with the slowest LMM implementation.

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 06:55
    All I will say is that the RDQUAD based LMM2 is the fastest LMM style engine, and will run linear code (that is not FCACHE-able) much faster.

    The best that single-RDLONGC per loop engines can do is 3 ins / 24 cycles, i.e. 1/8 efficiency, compared to 4/8, i.e. 1/2 efficiency ... so the LMM2 RDQUAD version is four times faster for non-cached code.

    That is an extremely compelling argument for LMM2Q (quad version short name)
    Ariba wrote: »
    I knew you would say this :smile:

    But that is really more an argument against fast LMM, because FCACHEd routines will execute at max speed, even with the slowest LMM implementation.

    Andy
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 06:59
    All I will say is that the RDQUAD based LMM2 is the fastest LMM style engine, and will run linear code (that is not FCACHE-able) much faster.

    The best that single-RDLONGC per loop engines can do is 3 ins / 24 cycles, i.e. 1/8 efficiency, compared to 4/8, i.e. 1/2 efficiency ... so the LMM2 RDQUAD version is four times faster for non-cached code.

    That is an extremely compelling argument for LMM2Q (quad version short name)
    What exactly are the constraints placed on the generated code for LMM2Q? I probably won't be working on GCC code generator changes so I guess only Eric really needs to know but I'm curious.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 07:11
    The biggest one I am aware of right now is that the last slot (of four) should be used for any RDxxxx/WRxxxx instruction, and I am guessing that is the safe place for all other long instructions.

    Also, the VLIW-style four-instruction packets must be quad-long aligned in the hub, which should be very easy for GCC "alignment rules" to implement.

    I will be testing all the opcodes I can over the next few days/weeks, as LMM2Q will be the central feature of a couple of projects/products I've been working on since the initial P2 thread, so I have to make sure it will work with all instructions (that make sense in a LMM-style hypervisor)

    (said projects/products were on hold until I could actually try P2 instructions and grok the pipeline - thanks Chip for the Nano fpga image!)
    David Betz wrote: »
    What exactly are the constraints placed on the generated code for LMM2Q? I probably won't be working on GCC code generator changes so I guess only Eric really needs to know but I'm curious.
  • AribaAriba Posts: 2,690
    edited 2012-12-13 07:14
    David Betz wrote: »
    What exactly are the constraints placed on the generated code for LMM2Q? I probably won't be working on GCC code generator changes so I guess only Eric really needs to know but I'm curious.
    1) Code must be executed in quads; you can't jump to locations that are not quad-aligned.
    2) The first 3 instructions in the quad must be single-cycle instructions.
    3) If a jump occurs, the following quad is already loaded into the quad registers and will be executed, or must be zeroed.
    4) The pc is at least 1 quad ahead, which makes it difficult to implement jumps, movi and so on. Also, the pc is incremented in quads, not in instructions, so you need to place constants for jump/movi at a defined slot in the quad to know where they are. And you need to load the constant again from hub, because the current quad has already been overwritten.

    I'm sure there is more that I am not aware of yet.
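Rules 1 and 2 lend themselves to a mechanical check. The Python sketch below is one hypothetical way a tool could flag violations; `check_quads` and `HUB_OPS` are invented names, only hub reads and writes are treated as multi-cycle (an assumption), and rules 3 and 4 are not modeled at all.

```python
# Hub access mnemonics assumed to be the only multi-cycle instructions here.
HUB_OPS = {"rdbyte", "rdword", "rdlong", "wrbyte", "wrword", "wrlong"}


def check_quads(mnemonics):
    """Return a list of rule violations for a flat list of instruction mnemonics."""
    problems = []
    if len(mnemonics) % 4 != 0:
        problems.append("code length is not a whole number of quads")
    for i, op in enumerate(mnemonics):
        slot = i % 4                     # position inside the quad, 0..3
        if op in HUB_OPS and slot != 3:
            problems.append(f"{op} at index {i} is in slot {slot}, not the last slot")
    return problems


# Illustrative sequences: hub accesses in the last slot vs. scattered.
ok_order = ["add", "add", "add", "rdlong", "add", "add", "add", "wrlong"]
bad_order = ["add", "add", "add", "add", "rdlong", "add", "wrlong", "add"]

print(check_quads(ok_order))
print(check_quads(bad_order))
```

A code generator would have to do the reverse job: reorder or pad with nops until such a check comes back clean.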

    Andy
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 07:30
    Ariba wrote: »
    1) Code must be executed in quads; you can't jump to locations that are not quad-aligned.
    2) The first 3 instructions in the quad must be single-cycle instructions.
    3) If a jump occurs, the following quad is already loaded into the quad registers and will be executed, or must be zeroed.
    4) The pc is at least 1 quad ahead, which makes it difficult to implement jumps, movi and so on. Also, the pc is incremented in quads, not in instructions, so you need to place constants for jump/movi at a defined slot in the quad to know where they are. And you need to load the constant again from hub, because the current quad has already been overwritten.

    I'm sure there is more that I am not aware of yet.

    Andy
    Thanks for the explanation. I wonder if someone could do an analysis of whether this would even be worth it for common code sequences. It might be useful to hand-assemble some non-trivial code sequences to see whether these fairly severe constraints will be balanced out by the faster code execution.
  • AribaAriba Posts: 2,690
    edited 2012-12-13 08:06
    David Betz wrote: »
    Thanks for the explanation. I wonder if someone could do an analysis of whether this would even be worth it for common code sequences. It might be useful to hand-assemble some non-trivial code sequences to see whether these fairly severe constraints will be balanced out by the faster code execution.

    Yes, that was also my conclusion. That's why I decided to take a step back and do LMM with rdlongc. The cached rdlongc reads also benefit a lot from the quad load mechanism, but the complicated stuff is hidden by the cache manager in the cog. Compared to the LMM2Q you need an additional cycle for every read from the cache, but that is only 3 cycles more every 4 instructions. EDIT: The rdlongc takes 3 cycles if the cache must be loaded, so it's 5 cycles more every 4 instructions.

    I've done a little comparison of the LMM variants. They all execute a chunk of 1024 instructions with no jumps - it's the code that Bill used to test the LMM2Q.
    This kind of code favors the LMM2Q and also the 3-instruction variant. But I haven't implemented jumps for them, so I can't use jumps in the test code.
    1, 2, 3 instr means an LMM loop which executes 1, 2 or 3 instructions inside a hub window in the best case. How often this best case happens depends on the LMM code. This test code has a rdlong and a wrlong in it, which makes it impossible for any of the 4 variants to avoid missing hub windows. With only single-cycle instructions the LMM2Q would be further ahead.

    Results:
               meas.   theoretical    
    LMM type:  cycles  max MIPS     real MIPS testcode        %
    ------------------------------------------------------------
    Quad-LMM   $1007   (80 MIPS)    4 cy = 40 MIPS @160MHz   100
    3 instr    $13FE   (60 MIPS)    5 cy = 32 MIPS ''         80
    2 instr    $17FF   (40 MIPS)    6 cy = 27 MIPS ''         67
    1 instr    $23FD   (20 MIPS)    9 cy = 18 MIPS ''         45
    

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 08:48
    Andy,

    Thank you for the benchmarks!

    Today I am working on a Xmas present for the forum... could I ask a favor?

    Would you mind adding another comparison result to the benchmarks? Cog code for native MIPS?
        REPS #128,#8
        getcnt   start
        ' [please insert the 8 torture-test instructions here]
    

    I suspect the result will be around $800 cycles as I think the hub window "shadows" will swallow all the single cycle instructions.

    Also, can you add the eight instructions that are repeated? I have so many versions here that I am not sure which you used.

    Thank you again for all the help / testing / contributions!
    Ariba wrote: »
    Yes, that was also my conclusion. That's why I decided to take a step back and do LMM with rdlongc. The cached rdlongc reads also benefit a lot from the quad load mechanism, but the complicated stuff is hidden by the cache manager in the cog. Compared to the LMM2Q you need an additional cycle for every read from the cache, but that is only 3 cycles more every 4 instructions. EDIT: The rdlongc takes 3 cycles if the cache must be loaded, so it's 5 cycles more every 4 instructions.

    I've done a little comparison of the LMM variants. They all execute a chunk of 1024 instructions with no jumps - it's the code that Bill used to test the LMM2Q.
    This kind of code favors the LMM2Q and also the 3-instruction variant. But I haven't implemented jumps for them, so I can't use jumps in the test code.
    1, 2, 3 instr means an LMM loop which executes 1, 2 or 3 instructions inside a hub window in the best case. How often this best case happens depends on the LMM code. This test code has a rdlong and a wrlong in it, which makes it impossible for any of the 4 variants to avoid missing hub windows. With only single-cycle instructions the LMM2Q would be further ahead.

    Results:
               meas.   theoretical    
    LMM type:  cycles  max MIPS     real MIPS testcode        %
    ------------------------------------------------------------
    Quad-LMM   $1007   (80 MIPS)    4 cy = 40 MIPS @160MHz   100
    3 instr    $13FE   (60 MIPS)    5 cy = 32 MIPS ''         80
    2 instr    $17FF   (40 MIPS)    6 cy = 27 MIPS ''         67
    1 instr    $23FD   (20 MIPS)    9 cy = 18 MIPS ''         45
    

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 09:01
    To help with my carpal tunnel syndrome, and yours as well :)

    LMM2Q - Propeller 2 LMM, RDQUAD 4/8 efficiency
    LMM2C - Propeller 2 LMM, one RDLONGC per hub window, 1/8 efficiency
    LMM2C2 - two RDLONGC's per hub window, 2/8 efficiency
    LMM2C3 - three RDLONGC's per hub window, 3/8 efficiency
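The efficiency fractions above translate directly into the peak MIPS figures quoted in the thread: instructions per hub window, divided by the 8 cycles of the window, times the clock. A small Python sketch of that arithmetic (the names are chosen for illustration only):

```python
CLOCK_MHZ = 160           # clock rate used throughout the thread
HUB_WINDOW_CYCLES = 8     # one hub window every 8 clocks


def theoretical_mips(ins_per_window):
    """Peak LMM MIPS if every hub window delivers `ins_per_window` instructions."""
    return CLOCK_MHZ * ins_per_window / HUB_WINDOW_CYCLES


variants = {"LMM2C": 1, "LMM2C2": 2, "LMM2C3": 3, "LMM2Q": 4}
peak = {name: theoretical_mips(n) for name, n in variants.items()}

for name, mips in peak.items():
    print(f"{name:7s} {variants[name]}/8 efficiency -> {mips:.0f} MIPS peak")
```

These are best-case numbers; any instruction that misses a hub window pulls the real figure down.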
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 10:30
    Here is the latest version of my pipeline / LMM2Q test framework, with the revised torture test that obeys the LMM2Q VLIW rules.

    Revised LMM2Q torture test:
    i1	add	count1,#1
    i2	add	count2,#1
    i3	add	count3,#1
    i4	rdlong	lc,result10
    i5	add	count4,#1
    i6	add	lc,#1
    i7	add	count1,#1
    i8	wrlong	lc,result10
    

    Results:
    >n
    >2000.2027
    02000- 00000100 00000080 00000080 00000080   '................'
    02010- 00000000 00000000 00000000 00000000   '................'
    02020- 00001007 00000200   '........'
    

    Note: The loop now executes 257 times, so 256 LMM2Q "quad" packets are executed - this makes checking the results for correctness easier.
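The counter values in the dump can be sanity-checked by replaying the eight instructions in Python. This is only a sketch: it assumes the first four longs of the dump are count1..count4, that 256 quad packets means 128 passes through the eight instructions, and that the hub long at `result10` starts at zero.

```python
def run_torture(repeats):
    """Replay the revised 8-instruction torture test `repeats` times."""
    regs = {"count1": 0, "count2": 0, "count3": 0, "count4": 0, "lc": 0}
    hub_long = 0                    # stands in for the long at result10 (assumed 0 at start)
    for _ in range(repeats):
        regs["count1"] += 1         # i1  add count1,#1
        regs["count2"] += 1         # i2  add count2,#1
        regs["count3"] += 1         # i3  add count3,#1
        regs["lc"] = hub_long       # i4  rdlong lc,result10
        regs["count4"] += 1         # i5  add count4,#1
        regs["lc"] += 1             # i6  add lc,#1
        regs["count1"] += 1         # i7  add count1,#1
        hub_long = regs["lc"]       # i8  wrlong lc,result10
    return regs, hub_long


quads = 256
regs, hub_long = run_torture(quads * 4 // 8)   # 8 instructions = 2 quads per pass
print({name: hex(value) for name, value in regs.items()})
```

Under those assumptions count1 ends at $100 and the other three counters at $80, which matches the first row of the dump.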
  • AribaAriba Posts: 2,690
    edited 2012-12-13 10:58
    OK, I have also added the native PASM speed test.

    For the LMM2Q I just took your results, so the instruction order was:
    i1      add     count, #1
    i2      add     count2,#1
    i3      add     count3,#1
    i4      rdlong  lc,result6
    i5      add     count, #1
    i6      add     count4,#1
    i7      add     lc,#1
    i8      wrlong  lc,result6
    

    For the other variants I had the old instruction order:
    i1      add     count, #1
    i2      add     count2,#1
    i3      add     count3,#1
    i4      add     count, #1
    i5      rdlong  lc,result6
    i6      add     lc,#1
    i7      wrlong  lc,result6
    i8      add     count4,#1
    

    So I have also tried the LMM2Q order for the other variants. Interestingly, this affects the LMM2C3 results a lot. This variant is very sensitive to the placement of hub accesses. In the worst case you get only the same performance as with LMM2C2.

    Results:
               meas.   theoretical                           % of   % of
    LMM type:  cycles  max MIPS     real MIPS testcode       LMM2Q  PASM2
    ---------------------------------------------------------------------
    PASM2      $0807   (160 MIPS)   2 cy = 80 MIPS @160MHz    200    100
    LMM2Q      $1007   (80 MIPS)    4 cy = 40 MIPS ''         100     50
    LMM2C3     $13FE   (60 MIPS)    5 cy = 32 MIPS ''          80     40
    LMM2C3 *   $16A7   (60 MIPS)    5.7cy= 28 MIPS ''          70     35
    LMM2C2     $17FF   (40 MIPS)    6 cy = 27 MIPS ''          67     34
    LMM2C1     $23FD   (20 MIPS)    9 cy = 18 MIPS ''          45     23

    * unfavorably timed hub accesses
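The "real MIPS" column follows from the measured cycle counts. The Python sketch below redoes that arithmetic, assuming every row ran the same 1024-instruction chunk at 160 MHz and that the few extra cycles in each count are measurement overhead; the dictionary keys are just labels for this example.

```python
CLOCK_MHZ = 160          # clock rate used throughout the thread
INSTRUCTIONS = 1024      # assumed length of the test chunk for every row

measured_cycles = {
    "PASM2": 0x0807,
    "LMM2Q": 0x1007,
    "LMM2C3": 0x13FE,
    "LMM2C3 (unfavorable)": 0x16A7,
    "LMM2C2": 0x17FF,
    "LMM2C1": 0x23FD,
}


def cycles_per_instruction(cycles):
    return cycles / INSTRUCTIONS


def real_mips(cycles):
    return CLOCK_MHZ * INSTRUCTIONS / cycles


for name, cycles in measured_cycles.items():
    print(f"{name:22s} {cycles_per_instruction(cycles):4.1f} cy  {real_mips(cycles):5.1f} MIPS")
```

The percentage columns are then each row's MIPS figure divided by the LMM2Q or PASM2 figure.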
    

    Andy
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 11:24
    Ariba wrote: »
    Here is my personal consensus for easy to use LMM: ...
    I tried your first simplest suggestion and got the following improvement:
    With the old P1 LMM loop:
    fibo(26) = 121393 (1492ms) (89566584 ticks)
    
    With "option 1":
    fibo(26) = 121393 (995ms) (59711070 ticks)
    
    I haven't been able to get your two-instruction loop to work yet.

    Edit: I just thought of something. I never installed the new FPGA configuration with Chip's bug fix. Could that cause the two-instruction loop to fail?
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 11:36
    Ariba: I have a question about a comment in your two-instruction loop. Why do immediate values need to have bits 18-31 zero? This is probably what is causing this loop to fail in the P2 LMM kernel.
    movi:  rdlongc rx,pc
           long value           'bit31..18 must be 0
    
    Edit: Maybe this isn't the problem. The GCC LMM kernel handles constants differently.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 14:18
    I've tried both of Andy's suggestions in the P2 GCC LMM kernel. The first works but the second doesn't. I've attached the source for the LMM kernel. I'd appreciate it if someone would look it over to see why Andy's second loop doesn't work. The loops are all at the address __LMM_loop. All of the loops are commented out except Andy's second option. Can anyone see what might be causing trouble here?

    crt0_p2.S

    And, before anyone mentions it, I will certainly replace the software multiply and divide code with P2 instructions soon! I just want to get the LMM loop working first. :-)

    Edit: Sorry! Forgot the attachment the first time!
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 14:27
    Andy:

    Thanks!

    I am now curious what the ratios would be if all eight test instructions were single cycle ones - maybe "add countX,#1" with X=1..8

    I'd run the tests but I am working on a surprise :)

    David:

    I suggest you concentrate on LMM2C1, it should require few if any changes in PropGCC.

    As I mentioned in post#1, LMM2C2 and faster require significant PropGCC changes as they are not PropGCC compatible.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 14:47
    David:

    I suggest you concentrate on LMM2C1, it should require few if any changes in PropGCC.

    As I mentioned in post#1, LMM2C2 and faster require significant PropGCC changes as they are not PropGCC compatible.
    What I'm trying is what Andy suggested. I don't know which of your acronyms it corresponds to. Could you look at the code and let me know what might be going wrong? It doesn't really seem much different than the unwound loop we had in the P1 LMM kernel so I don't see why it would require lots of code generator changes.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 14:53
    David Betz wrote: »
    What I'm trying is what Andy suggested. I don't know which of your acronyms it corresponds to. Could you look at the code and let me know what might be going wrong? It doesn't really seem much different than the unwound loop we had in the P1 LMM kernel so I don't see why it would require lots of code generator changes.
    Okay, I guess I see the problem. Any LMM macros that are executed in the delay slots will mess things up. So I guess the slowest option is our only choice unless we make compiler changes.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 14:55
    Hi David,

    I took a quick look as requested.

    That is basically LMM2C2: two cached reads in the same 8-cycle hub loop. LMM2C3 is the one with three cached reads. You can find the acronym definitions in post#287 above, and also in the FAQ in post#2.

    David, it would take me far too much time to explain, definitely more than wifey will let me do for free.

    It will be easy for you to get the LMM2C1 variant running, and I strongly recommend that path to you. Anything faster than that will require a LOT of changes to GCC, which I will be happy to explain in detail once Ken gives me the go-ahead.
    David Betz wrote: »
    What I'm trying is what Andy suggested. I don't know which of your acronyms it corresponds to. Could you look at the code and let me know what might be going wrong? It doesn't really seem much different than the unwound loop we had in the P1 LMM kernel so I don't see why it would require lots of code generator changes.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 14:57
    You found one of the approximately 30-50 changes required.

    I will be happy when you get the LMM2C1 variant running, it will give us all a quick gcc fix, and a basis for comparison.

    Unfortunately, as I told Ken at the original kickoff meeting 2.5 years ago, Propeller 2 GCC essentially requires a new back end for anything beyond LMM2C1. For which I have the CS background.

    Edit:

    Sorry, above line sounds bad, not what I meant. What I meant is that fortunately I have the background so that I know what is needed - and would be able to direct the port to get it done much faster.
    David Betz wrote: »
    Okay, I guess I see the problem. Any LMM macros that are executed in the delay slots will mess things up. So I guess the slowest option is our only choice unless we make compiler changes.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 15:02
    You found one of the approximately 30-50 changes required.

    I will be happy when you get the LMM2C1 variant running, it will give us all a quick gcc fix, and a basis for comparison.

    Unfortunately, as I told Ken at the original kickoff meeting 2.5 years ago, Propeller 2 GCC essentially requires a new back end for anything beyond LMM2C1. For which I have the CS background.

    What is LMM2C1?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 15:06
    Hi David,

    Please see the post you quoted - after pressing "Post" I realized it might sound arrogant, which definitely was not my intention!

    LMM2C1 = LMM for Propeller 2, C uses RDLONGC, 1 per 8 cycle hub window

    I think you can see why I use acronyms :-)

    I *really* am looking forward to trying your LMM2C1 modification of PropGCC - I am not being sarcastic or kidding (added to avoid potential misunderstanding).
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-13 15:13
    Hi David,

    Please see the post you quoted - after pressing "Post" I realized it might sound arrogant, which definitely was not my intention!

    LMM2C1 = LMM for Propeller 2, C uses RDLONGC, 1 per 8 cycle hub window

    I think you can see why I use acronyms :-)

    I *really* am looking forward to trying your LMM2C1 modification of PropGCC - I am not being sarcastic or kidding (added to avoid potential misunderstanding).
    Are you talking about this?
    __LMM_loop
            rdlongc L_ins1,pc   'read instruction
            add pc,#4           'point to next instr
            jmpd #__LMM_loop
    L_ins1    nop               'execute instruction
              nop               '3 delay slots for jmpd
              nop
            jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code
    
    That is already working.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-13 15:16
    Yes.

    That is the best that can be done without fairly major changes.

    Frankly, I'd have to audit the back end to see how much needs to be changed. Even worse, it cannot be done one change at a time.
    David Betz wrote: »
    Are you talking about this?
    __LMM_loop
            rdlongc L_ins1,pc   'read instruction
            add pc,#4           'point to next instr
            jmpd #__LMM_loop
    L_ins1    nop               'execute instruction
              nop               '3 delay slots for jmpd
              nop
            jmp #__LMM_loop     'loop in case jmpd got cancelled by LMM code
    
    That is already working.