LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz) - Page 3 — Parallax Forums

LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Comments

  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 11:08
    cgracey wrote: »
    Not if you wrap it in a loop. Just make sure the RDQUAD is followed by the QUAD registers with a max of one instruction between the RDQUAD and the mapped QUADs. That way, the QUADs will execute the *previous* RDQUAD longs without one or more of the *recent* RDQUAD longs slipping in and causing woes.

    Here is my latest experiment:
    	reps	#256,#8
    	getcnt	start
    
    '-- must be on a XXXXXXX00 address in the cog
    ins1	nop
    ins2	nop
    ins3	nop
    ins4	nop
    ins5	setquad	#ins1
    	rdquad	pc
    ins6	add	pc,#16
    ins7	nop
    

    And the result looks correct:
    >n
    >2000.201f
    02000- 000000FE 0000007F 000000FE 00000080   '................'
    02010- 00000BFD 0000007E 00000000 00000000   '....~...........'
    

    The difference here is that I don't SETQUAD the cache in until after the NOP's are executed on the first pass, avoiding executing unknown code.

    I'll try some more code mixes that I can verify a bit later.
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 11:23
    I just looked at the Verilog code and did some tests to confirm mapped QUAD behavior:
    After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:
    
    
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow the QUADs to update
            NOP
            NOP
    
            CMP     quad0,quad1     'mapped QUADs are now accessible via D and S
    
    After a RDQUAD, mapped QUAD registers can be executed after three clocks and two instructions:
    
    
            SETQUAD #quad0          'quad0 must be a quad-aligned address
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow the QUADs to update
            NOP
            NOP
    
            NOP                     'do at least 2 instructions to allow QUADs to enter pipeline
            NOP
    
    quad0   NOP                     'QUAD0..QUAD3 are now executable
    quad1   NOP
    quad2   NOP
    quad3   NOP
    

    This is different than what I initially documented. Sorry for the confusion.
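The two timing rules in the post above can be sketched as a small Python model. This is only a reading of Chip's description, not his Verilog: it assumes the "3 clocks" are counted as the summed clock cost of the instructions that follow the RDQUAD, and the function names are made up for illustration.

```python
def instructions_to_wait(clocks_per_instr, wait_clocks=3):
    """Count how many instructions after RDQUAD are needed to cover the wait.

    clocks_per_instr lists the clock cost of each instruction that follows
    the RDQUAD (assumption: the wait is the sum of those costs).
    """
    elapsed = 0
    for count, clocks in enumerate(clocks_per_instr, start=1):
        elapsed += clocks
        if elapsed >= wait_clocks:
            return count
    raise ValueError("not enough instructions listed to cover the wait")


def first_safe_slot(clocks_per_instr, execute=False):
    """0-based index (after RDQUAD) of the first instruction that may use the QUADs.

    execute=False: access through D and S (wait only).
    execute=True:  executing the mapped QUADs (wait plus 2 more instructions).
    """
    slot = instructions_to_wait(clocks_per_instr)
    return slot + (2 if execute else 0)


# Three single-clock NOPs, as in the listings above.
print(first_safe_slot([1, 1, 1]))                # D/S access slot
print(first_safe_slot([1, 1, 1], execute=True))  # execution slot
```

With three single-clock NOPs the model puts the CMP in slot 3 and the first executable QUAD in slot 5, which is the layout of the two listings above.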
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 11:33
    Thank you for checking.

    I guess I will need to experiment more to see if somehow it can be squeezed into an 8 or 16 instruction loop.

    What worries me is that the loop I have in post #62 may execute a mix of fresh/stale cache data.

    I'll think about it for a while, and come up with some test cases I can run on my nano.
    cgracey wrote: »
    I just looked at the Verilog and did some tests to confirm mapped QUAD behavior:
    After a RDQUAD, mapped QUAD registers are accessible via D and S after 3 clocks:
    
    
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow the QUADs to update
            NOP
            NOP
    
            CMP     quad0,quad1     'mapped QUADs are now accessible via D and S
    
    After a RDQUAD, mapped QUAD registers can be executed after three clocks and two instructions:
    
    
            SETQUAD #quad0          'quad0 must be a quad-aligned address
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow the QUADs to update
            NOP
            NOP
    
            NOP                     'do at least 2 instructions to allow QUADs to enter pipeline
            NOP
    
    quad0   NOP                     'QUAD0..QUAD3 are now executable
    quad1   NOP
    quad2   NOP
    quad3   NOP
    
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 12:36
    I'm finding that if the RDQUAD is the first instruction in the REPS block, it doesn't get executed. Have you guys noticed this?
  • AribaAriba Posts: 2,690
    edited 2012-12-10 12:38
    Thank you Chip and Bill

    With the new timing information, I was able to get it to work:
    ' Quad LMM
    DAT
            org 0
    lmm     reps #511,#7
            setquad #ins1
             rdquad pc
             nop
    ins1     nop            'quad aligned
             nop
             nop
             nop
             add pc,#16
            jmp #lmm        'after 511 repeats
    
    pc      long @lmmcode+$0E80
    t1      long 0
    
            long 0[4-($-($ & $1FC))]  'quad align
    
    lmmcode setp #2         'toggle pin2 25%        
            notp #4
            clrp #2
            sub pc,#32      'relative jump (delayed by 1 quad)
    
            nop
            notp #4         'toggle pin4 50%
            mul t1,t1       '2 cycle instr
            nop
                            '< jump happens here
    
    The loop is 7 instructions but takes 8 cycles, so the sync to the hub slot now works too.
    The unused cycle in the loop allows for one two-cycle instruction inside an LMM-Quad packet without slowing down the instruction rate (shown with the MUL instruction).

    The problem with Quad-LMM will be the jumps, which are delayed by up to 7 instructions and can only go to a quad-aligned location.

    Andy
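The `long 0[4-($-($ & $1FC))]` line in the listing above is plain address arithmetic, so it can be checked in Python. The sketch below reads `long 0[n]` as "emit n zero longs" and `$` as the current cog address; the function names are hypothetical, and `pad_minimal` is an alternative shown for comparison, not something from the thread.

```python
def pad_as_posted(addr):
    """Longs emitted by `long 0[4-($-($ & $1FC))]` when `$` equals addr."""
    return 4 - (addr - (addr & 0x1FC))


def pad_minimal(addr):
    """Alternative (not from the thread): emit 0 longs when already aligned."""
    return (-addr) & 3


def is_quad_aligned(addr):
    """A cog address is quad-aligned when its two low bits are zero."""
    return addr & 3 == 0


for addr in (0x0C, 0x0D, 0x0E, 0x0F):
    print(hex(addr), pad_as_posted(addr), pad_minimal(addr))
```

Both forms land on a quad-aligned address. The posted form emits four padding longs when `$` is already aligned, which costs cog space but is harmless. Counting one long per line from `org 0`, `$` would be `$00C` at that point in the listing, so `lmmcode` would sit at `$010`; that count is an estimate, not something stated in the post.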
  • AribaAriba Posts: 2,690
    edited 2012-12-10 12:45
    cgracey wrote: »
    I'm finding that if the RDQUAD is the first instruction in the REPS block, it doesn't get executed. Have you guys noticed this?
    Yes and No
    My conclusion before my previous test was that the RDQUAD gets executed only the first time through the loop, but not afterwards.
    But in my code above it now always gets executed. So it must depend on the mix of read-quad timing and repeat logic.

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 13:01
    I think I saw that on occasion.
    cgracey wrote: »
    I'm finding that if the RDQUAD is the first instruction in the REPS block, it doesn't get executed. Have you guys noticed this?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 13:06
    Thanks Ariba,

    Do you have the problem Chip mentioned of the cache being random code on the first pass?

    I tried to get around that by "priming" the cache.

    I'll try to torture your flavor.

    Regarding delayed jumps... PropGCC2 will have to schedule instructions very carefully; using RDQUAD essentially turns LMM2 into a VLIW with a fairly deep pipe.

    The pain will be worth it - it will be able to *very closely* approach 50% of native mips even without FCACHE etc, and almost 100% with all the tricks :-)

    I'll try the loop you posted shortly.
    Ariba wrote: »
    Thank you Chip and Bill

    With the new timing information, I was able to get it to work:
    ' Quad LMM
    DAT
            org 0
    lmm     reps #511,#7
            setquad #ins1
             rdquad pc
             nop
    ins1     nop            'quad aligned
             nop
             nop
             nop
             add pc,#16
            jmp #lmm        'after 511 repeats
    
    pc      long @lmmcode+$0E80
    t1      long 0
    
            long 0[4-($-($ & $1FC))]  'quad align
    
    lmmcode setp #2         'toggle pin2 25%        
            notp #4
            clrp #2
            sub pc,#32      'relative jump (delayed by 1 quad)
    
            nop
            notp #4         'toggle pin4 50%
            mul t1,t1       '2 cycle instr
            nop
                            '< jump happens here
    
    The loop is 7 instructions but takes 8 cycles, so the sync to the hub slot now works too.
    The unused cycle in the loop allows for one two-cycle instruction inside an LMM-Quad packet without slowing down the instruction rate (shown with the MUL instruction).

    The problem with Quad-LMM will be the jumps, which are delayed by up to 7 instructions and can only go to a quad-aligned location.

    Andy
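The delayed jump that Bill and Andy describe can be illustrated with a toy model of the loop. It assumes the behaviour Chip described earlier, that each pass executes the longs from the *previous* RDQUAD while the current one is still arriving, and it ignores the first pass, where the window holds whatever was there before. The hub addresses and the function name are invented for the example.

```python
def run_quad_lmm(start_pc, pc_deltas, iterations):
    """Return the hub addresses of the quads executed, in order.

    pc_deltas maps a quad's hub address to the net change its instructions
    make to pc (for example -32 for a quad containing `sub pc,#32`).
    """
    pc = start_pc
    fetched = None          # quad read by the previous RDQUAD
    executed = []
    for _ in range(iterations):
        incoming = pc       # rdquad pc: start reading the quad at pc
        if fetched is not None:
            executed.append(fetched)         # run the previously read quad
            pc += pc_deltas.get(fetched, 0)  # its own jump, if it has one
        pc += 16            # add pc,#16
        fetched = incoming
    return executed


QUAD_A = 0x1000             # hypothetical hub address of the first quad
QUAD_B = QUAD_A + 16        # the quad that follows it
trace = run_quad_lmm(QUAD_A, {QUAD_A: -32}, iterations=5)
print([hex(q) for q in trace])
```

In this model the `sub pc,#32` in the first quad only takes effect after the second quad has run, so the two quads alternate, as the "jump happens here" comment suggests. A jump placed in the first slot of its quad would be followed by 3 + 4 = 7 more instructions before it takes effect, which agrees with Andy's "up to 7 instructions".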
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 13:13
    Andy,

    Unless I goofed in the attached file, your latest version did not work (for me)
  • AribaAriba Posts: 2,690
    edited 2012-12-10 13:33
    Bill
    I don't know what the results should look like. I get:
    === Propeller II Monitor ===
    
    >n2000.2010
    02000- 00000080 00000080 000000FF 000000FF   '................'
    02010- 00000BFF   '....'
    >
    
    .. if I change it so that ins1 is quad aligned and not ins7

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 13:36
    ARGH!

    Thanks - I don't believe I quad-aligned the delay slot... duh. I've been staring at variations of this code too long :)

    Thanks Andy.

    I now get the same result you do.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 13:44
    Uh-Oh

    I just realized something. It is not executing the hub instructions in i4 and i6

    $2014 should have $0000007F in it (result6)
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 13:46
    This is weird. It seems to me that the QUAD window is appearing 1 location earlier than it should. If I set it at $014, it begins at $013. I can't understand how this is happening, or if I've just got something wrong in my perception.

    Are you guys seeing this?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 13:51
    Frankly I am seeing all sorts of unusual behaviour, depending on where the RDQUAD is in relation to REPS, the delay slots etc.

    I'll need to make some more complex test cases, because strangely enough, the only code that seems to pass my "torture" test is the one where the quad buffer is aligned, and comes *before* the RDQUAD

    What was really weird is Ariba's last code passed 5 of the 6 tests, only failing the readlong/add#1/writelong embedded LMM code...

    It might be a good idea to hold off on the shuttle run until we figure this out.
    cgracey wrote: »
    This is weird. It seems to me that the QUAD window is appearing 1 location earlier than it should. If I set it at $014, it begins at $013. I can't understand how this is happening, or if I've just got something wrong in my perception.

    Are you guys seeing this?
  • AribaAriba Posts: 2,690
    edited 2012-12-10 13:52
    Chip

    To me it looks like mapping the quad registers into the register space affects the register right before the first mapped address, so that it is no longer executable.
    In my earlier test this was the RDQUAD, which was not executed correctly. Now I have a NOP at that position and it works, but if I try to place the ADD PC,#16 there, it does not seem to execute correctly.

    Andy

    Sorry, I read this post of yours after writing the above:
    cgracey wrote: »
    This is weird. It seems to me that the QUAD window is appearing 1 location earlier than it should. If I set it at $014, it begins at $013. I can't understand how this is happening, or if I've just got something wrong in my perception.

    Are you guys seeing this?
    Hmm, but all 4 instructions get executed, including the last one, so the QUAD window must then be 5 instructions long. But yes, as I wrote above, the instruction before the QUAD window does not execute correctly.

    I now think it has nothing to do with the REPS; I see the same if I build the loop with jumps.
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 13:55
    Ariba wrote: »
    Chip

    To me it looks like mapping the quad registers into the register space affects the register right before the first mapped address, so that it is no longer executable.
    In my earlier test this was the RDQUAD, which was not executed correctly. Now I have a NOP at that position and it works, but if I try to place the ADD PC,#16 there, it does not seem to execute correctly.

    Andy

    That's what I found, too, but the register before the QUADs seems to be QUAD0, like everything is shifted up one location.
  • AribaAriba Posts: 2,690
    edited 2012-12-10 14:22
    Yes you are right, the window is shifted up 1 address.
    If I place a pin toggle instruction as the last of the 4 quad instructions, it gets executed in addition to the 4 instructions loaded per RDQUAD:
    ' Quad LMM
    DAT
            org 0
    lmm     reps #511,#7
            setquad #ins1
             rdquad pc
             nop
    ins1     nop            'quad aligned
             nop
             nop
             notp #2
             add pc,#16
            jmp #lmm        'after 511 repeats
    
    pc      long @lmmcode+$0E80
    t1      long 0
    
            long 0[4-($-($ & $1FC))]  'quad align
    
    lmmcode setp #2         'toggle pin2 25%        
            notp #4
            clrp #2
            sub pc,#32      'relative jump (delayed by 1 quad)
    
            nop
            notp #4         'toggle pin4 50%
            mul t1,t1       '2 cycle instr
            nop
                            '< jump happens here
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 14:35
    This test also fails for me, it does not execute the WRLONG in i6
    Ariba wrote: »
    Yes you are right, the window is shifted up 1 address.
    If I place a pin toggle instruction as the last of the 4 quad instructions, it gets executed in addition to the 4 instructions loaded per RDQUAD:
    ' Quad LMM
    DAT
            org 0
    lmm     reps #511,#7
            setquad #ins1
             rdquad pc
             nop
    ins1     nop            'quad aligned
             nop
             nop
             notp #2
             add pc,#16
            jmp #lmm        'after 511 repeats
    
    pc      long @lmmcode+$0E80
    t1      long 0
    
            long 0[4-($-($ & $1FC))]  'quad align
    
    lmmcode setp #2         'toggle pin2 25%        
            notp #4
            clrp #2
            sub pc,#32      'relative jump (delayed by 1 quad)
    
            nop
            notp #4         'toggle pin4 50%
            mul t1,t1       '2 cycle instr
            nop
                            '< jump happens here
    
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 14:39
    I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.

    I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.

    Thanks for discovering this, Bill, Ariba, and any others who've worked on this!
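One way to picture how a compare against an address from the wrong pipeline stage could shift the window by one location is the toy model below. It is a guess at the kind of skew involved, not Chip's Verilog: `skew` stands for how far the compared address runs ahead of the register actually being accessed, and the function name is hypothetical.

```python
def window_addresses(base, skew=0, cog_size=0x200):
    """Cog addresses treated as QUAD-mapped when the comparator sees addr + skew.

    base is the quad-aligned address given to SETQUAD.
    """
    return [a for a in range(cog_size) if ((a + skew) & ~3) == base]


print([hex(a) for a in window_addresses(0x014)])          # intended window
print([hex(a) for a in window_addresses(0x014, skew=1)])  # compare one address ahead
```

With a skew of one, a window set at `$014` covers `$013` to `$016`, which lines up with Chip's "if I set it at $014, it begins at $013" and with Andy's observation that the register just before the window is affected.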
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 14:42
    You are most welcome!

    I REALLY enjoyed this (even with the excessive head scratching involved) - plus I am very glad it was caught before the shuttle run.
    cgracey wrote: »
    I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.

    I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.

    Thanks for discovering this, Bill, Ariba, and any others who've worked on this!
  • FredBlaisFredBlais Posts: 370
    edited 2012-12-10 14:49
    This is really great, the propeller 2 emulation is worth the effort finally!
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-10 14:50
    Hi Chip.

    While you fix that, would it be possible to add one more display mode to the Monitor?

    You have Byte, Word, and Long. I think one more could be Binary, showing a long the way we see it in an instruction listing.


    cgracey wrote: »
    I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.

    I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.

    Thanks for discovering this, Bill, Ariba, and any others who've worked on this!
  • AribaAriba Posts: 2,690
    edited 2012-12-10 14:50
    Bill Henning wrote: »
    This test also fails for me, it does not execute the WRLONG in i6

    Bill
    This was not meant as a new improved test. It shows only how I have tested the shift up of the 4 quad longs in cog memory.
    I replaced the 4th instruction with a pin toggle so that I can see something on the scope. The 4 instructions that get executed are:
    ins1-1, ins1, ins1+1, ins1+2, where ins1 must be quad-aligned.
    But that will change soon, when Chip posts the new FPGA files.

    Andy
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 14:52
    Bill Henning wrote: »
    You are most welcome!

    I REALLY enjoyed this (even with the excessive head scratching involved) - plus I am very glad it was caught before the shuttle run.

    Can any of you think of any reason why QUAD registers should be assignable to 4 separate addresses, without concern for alignment? Just curious.
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 14:54
    Sapieha wrote: »
    Hi Chip.

    While you fix that, would it be possible to add one more display mode to the Monitor?

    You have Byte, Word, and Long. I think one more could be Binary, showing a long the way we see it in an instruction listing.

    I've thought about this, but there is no more room in the Monitor code. I mean not room for one more instruction. Binary would have been really nice for configuring the I/O pins, too.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 14:57
    My only concern for it is that there is some way to have a loop that takes 8 cycles that reads and executes the four quad registers... because I really want the 2 clock cycle effective LMM2 instructions (for normally single cycle instructions) as that makes a properly optimized PropGCC2 generate code comparable in speed with a 80-120MHz ARM (without FPU).
    cgracey wrote: »
    Can any of you think of any reason why QUAD registers should be assignable to 4 separate addresses, without concern for alignment? Just curious.
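The "2 clock cycle effective" figure Bill mentions is simple arithmetic over the loop. The sketch below works it through, taking the 160 MHz clock from the thread title and assuming a native rate of one single-cycle instruction per clock; the function name is made up for illustration.

```python
def lmm_mips(clock_mhz, instrs_per_loop, clocks_per_loop):
    """LMM instructions per microsecond for a loop of the given shape."""
    return clock_mhz * instrs_per_loop / clocks_per_loop


CLOCK_MHZ = 160                       # from the thread title
quad_loop = lmm_mips(CLOCK_MHZ, instrs_per_loop=4, clocks_per_loop=8)
native = lmm_mips(CLOCK_MHZ, instrs_per_loop=1, clocks_per_loop=1)

print(quad_loop)            # LMM2 MIPS for 4 instructions every 8 clocks
print(quad_loop / native)   # fraction of native speed
```

Four hub instructions per 8-clock pass works out to 2 clocks per LMM instruction, or 80 MIPS at 160 MHz, which is half the assumed native rate and the top of the range in the thread title.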
  • AribaAriba Posts: 2,690
    edited 2012-12-10 14:58
    cgracey wrote: »
    I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.

    I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.

    Thanks for discovering this, Bill, Ariba, and any others who've worked on this!

    Great that you found the reason :thumb:

    Andy
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-10 14:58
    Hi Chip.

    From my point of view, the best case would be

    if the QUAD aligns to the first free long address and the next 3 after it.



    cgracey wrote: »
    Can any of you think of any reason why QUAD registers should be assignable to 4 separate addresses, without concern for alignment? Just curious.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 15:02
    Hi Andy,

    I was not criticizing... just pointing out that it did not work for some instructions. Frankly, I am glad the issue came up before the test chips - far cheaper to fix it now!
    Ariba wrote: »
    Bill
    This was not meant as a new improved test. It shows only how I have tested the shift up of the 4 quad longs in cog memory.
    I replaced the 4th instruction with a pin toggle so that I can see something on the scope. The 4 instructions that get executed are:
    ins1-1, ins1, ins1+1, ins1+2, where ins1 must be quad-aligned.
    But that will change soon, when Chip posts the new FPGA files.

    Andy
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 15:04
    Sapieha wrote: »
    Hi Chip.

    From my point of view, the best case would be

    if the QUAD aligns to the first free long address and the next 3 after it.

    I think having to set the QUAD window at a QUAD-aligned address is a big pain. Do you guys agree? It keeps the circuitry simple but causes pain for the programmer.