LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz) - Page 2 — Parallax Forums

Comments

  • AribaAriba Posts: 2,690
    edited 2012-12-10 05:26
    I've tried for a few hours, with all sorts of variations, but can't get the QUAD-LMM code to work.
    Here is the simplest version:
    ' Quad LMM
    DAT
            org 0
            mov pc,start
    lmm     reps #511,#6
            setquad #ins1
             rdquad pc
    ins1     nop            'quad aligned
             nop
             nop
             nop
             add pc,#16
            jmp #lmm        'after 511 repeats
    
    start   long @lmmcode+$0E80
    pc      long 0
    t1      long 0
    
            long 0[4-($-($ & $1FC))]  'quad align
    
    lmmcode setp #2         'toggle pin2 50%        
            nop
            clrp #2
            nop
    
            notp #4         'toggle pin4 25%
            notp #4
            nop
            sub pc,#32      'relative jump (delayed)
    
            nop             'the 4 delay instructions
            nop
            nop
            nop
    
    The good thing: I get 40 LMM-MIPS on a 60 MHz Prop2.
    The bad thing: this cannot be true!
    This code executes only the first 4 LMM instructions and repeats them endlessly. It never loads the next 4 instructions; the ADD PC,#16 seems to have no effect.
    Additionally, the LMM loop takes only 6 cycles, so it seems RDQUAD does not wait for its hub window.

    Perhaps somebody else can see where my mistake is.

    BTW: I find it much easier to check LMM loops by toggling pins and watching them on a scope than by writing to hub memory and checking the results with the monitor.

    Andy
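
For reference, the 40 LMM-MIPS figure in this post follows directly from the loop length: four LMM instructions per pass of a 6-cycle loop at 60 MHz. Here is a small Python sketch of that arithmetic. The helper name `lmm_mips` is made up for illustration, and the model assumes one clock per loop instruction with no hub stall, which is exactly the part that looks suspicious.

```python
def lmm_mips(clock_mhz, lmm_instructions_per_pass, cycles_per_pass):
    """Hub-resident (LMM) instructions executed per microsecond."""
    return clock_mhz * lmm_instructions_per_pass / cycles_per_pass

print(lmm_mips(60, 4, 6))   # 6-cycle pass, as measured on the scope -> 40.0
print(lmm_mips(60, 4, 8))   # 8-cycle pass, for comparison           -> 30.0
```

If the loop really stalled for a hub window on every pass, the second figure is the one you would expect to see.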
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 05:36
    Thanks, I will try this code.
    cgracey wrote: »
    That might do it. I'm thinking if you did something like...

    WRBYTE h00,PTRB++
    WRBYTE h01,PTRB++
    WRBYTE h02,PTRB++
    WRBYTE h03,PTRB++
    WRBYTE h04,PTRB++
    WRBYTE h05,PTRB++
    WRBYTE h06,PTRB++
    WRBYTE h07,PTRB++
    <repeat>

    ...you could see very easily if you were executing part of one RDQUAD and part of another.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 05:39
    Hi Ariba,

    Check post#17 in this thread for a version that appears to work - it gives 30MIPS at 60MHz (ok, four cycles less than 30MIPS)

    I will be doing more testing on it, as Chip warned me that there may be some pipelining effects being hidden by fetching the same instruction.

    You can download the Spin file for it from post#2
  • KyeKye Posts: 2,200
    edited 2012-12-10 05:40
    For those extra three instructions...

    You could do a simple:
    mov temp, ind
    test temp, mask wz 
    if_nz jmp #interrupt_code
    

    This would give you the ability to look at the port D register for masked interrupt pins. If one of these pins goes high, you would use the Prop ASM instruction that finds the highest set bit in the temp variable above, then use that bit number to index into an interrupt vector table and execute the interrupt handler code.

    So... this would give us an up to 32 channel interrupt controller.

    Thanks,
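
A rough model of the dispatch idea in this post, written in Python rather than PASM so it can be run anywhere. Everything here is hypothetical: `dispatch`, the 32-entry `vector_table`, and the handler names are illustrative only, and the highest-set-bit step stands in for whatever instruction the Prop2 provides for it.

```python
def dispatch(pins, mask, vector_table):
    """Return the handler for the highest-numbered pending pin, or None."""
    pending = pins & mask                 # keep only the enabled interrupt pins
    if pending == 0:
        return None                       # nothing pending, carry on with LMM code
    highest = pending.bit_length() - 1    # index of the highest set bit
    return vector_table[highest]

vector_table = [f"isr{n}" for n in range(32)]   # one entry per pin
pins = (1 << 17) | (1 << 3)

print(dispatch(pins, 0xFFFFFFFF, vector_table))  # isr17
print(dispatch(pins, 0x0000FFFF, vector_table))  # isr3
print(dispatch(0, 0xFFFFFFFF, vector_table))     # None
```

Taking the highest set bit gives a fixed priority order, with higher pin numbers winning when several pins are active at once.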
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 05:45
    Hi Kye,

    Yep that would work!

    I think two instructions would work:

    test mask,pinx wz
    if_nz jmp #interrupt_handler
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-10 05:47
    Kye wrote: »
    For those extra three instructions...

    You could do a simple:
    mov temp, ind
    test temp, mask wz 
    if_nz jmp #interrupt_code
    

    This would give you the ability to look at the port D register for masked interrupt pins. If one of these pins goes high, you would use the Prop ASM instruction that finds the highest set bit in the temp variable above, then use that bit number to index into an interrupt vector table and execute the interrupt handler code.

    So... this would give us an up to 32 channel interrupt controller.

    Thanks,

    This will trash the z flag. Won't that be a problem for the LMM code?
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-10 05:50
    Hi.

    I think both Bill's and Kye's versions need push/pop instructions (wz, wc) to work properly


    David Betz wrote: »
    This will trash the z flag. Won't that be a problem for the LMM code?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 05:50
    Good point David & Sapieha, I guess I'd better have my first coffee of the day :-)

    mov tmp, pinX
    and tmp, mask
    tjnz tmp,#int_handler

    should not destroy the Z flag
  • AribaAriba Posts: 2,690
    edited 2012-12-10 06:24
    Bill Henning wrote: »
    Hi Ariba,

    Check post#17 in this thread for a version that appears to work - it gives 30MIPS at 60MHz (ok, four cycles less than 30MIPS)

    I will be doing more testing on it, as Chip warned me that there may be some pipelining effects being hidden by fetching the same instruction.

    You can download the Spin file for it from post#2

    Hello Bill

    Of course I checked your code. It's basically the same as mine, and I think it has the same problems.
    If I make my loop 8 instructions I also get the right timing, but that is only because the loop now takes 8 cycles, not because the RDQUAD syncs to the hub slots.
    If I remove 2 of the 3 delay slots in your code and make the loop 6 instructions, I also get much faster execution.
    And you always execute the same instruction, so you cannot tell whether only the first 4 get executed.

    I also tried it with PTRA as the program counter, with the same results. I just think the PTRx registers are too valuable to be used as the PC if you have free timing slots to increment the PC anyway.

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 06:27
    I just tried a pipeline correctness test for RDQUAD based LMM2.

    In this version, I place 256 copies of the following eight instructions in the hub:
    add count1,#1
    add count2,#1
    add count3,#1
    add count4,#1
    add count4,#1
    add count3,#1
    add count2,#1
    add count1,#1
    

    and I execute the RDQUAD based LMM loop 256 times, executing a total of 1020 LMM instructions (the pipe is not primed on the first pass)

    (I am attaching lmm2_test2.spin to post#2 so y'all can replicate my results)

    After the run, execute the following commands in the monitor to view the results:

    n
    2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000803   '....'
    

    The top four numbers are the counts for the four LMM instructions that are executed in a loop.

    As they all show a count of 255, we know that each "countX" add was executed the correct number of times; and as I changed the pattern, the count should have been uneven if the wrong instructions got executed due to pipeline issues. Due to the pattern used, it is very unlikely that we would get the same result if there were pipeline issues.

    The last number is the number of cycles it took to execute the 256 RDQUAD LMM2 cycles.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 06:38
    Hi Andy,
    Ariba wrote: »
    Hello Bill

    Of course I checked your code. It's basically the same as mine, and I think it has the same problems.

    I agree that it is basically the same code... I just have not seen the problem yet; but I will try many other tests to try to find any problems.
    Ariba wrote: »
    If I make my loop 8 instructions I also get the right timing, but that is only because the loop now takes 8 cycles, not because the RDQUAD syncs to the hub slots.
    If I remove 2 of the 3 delay slots in your code and make the loop 6 instructions, I also get much faster execution.

    Now that is strange - the loop should take 8 cycles regardless of the delay slots as RDQUAD should always sync to the hub. Mind you, if it was always reading the same four longs from the hub, perhaps it just picks it up from the local quad cache - which is the only explanation I can think of for the faster execution.

    I'll look at this situation, including using a PC register.
    Ariba wrote: »
    And you always execute the same instruction, so you cannot tell whether only the first 4 get executed.

    I just ran a test that uses 8 different instructions, where I check the count of how many times each got executed - and it uses them in a different sequence in two different RDQUADS.

    See post#2 for "lmm2_test2.spin"
    Ariba wrote: »
    I also tried it with PTRA as the program counter, with the same results. I just think the PTRx registers are too valuable to be used as the PC if you have free timing slots to increment the PC anyway.

    Andy

    I would also like to keep PTRA free.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 06:49
    Hi Andy,

    You are definitely on to something!

    Please see lmm2_test3.spin in post#2, I tried to replicate your results - and I got the same results you did!

    It can run both the 8 cycle and the 6 cycle version, just change 6/8 and un-comment the extra delay slots.

    - I have verified that the 8 cycle version takes 2051 cycles to execute 256 loops of 4 instructions

    - I have also verified that the 6 cycle version takes 1537 cycles to execute 256 loops of 4 instructions

    What is strange is that I use a pattern of 8 instructions that I hoped would reveal any pipeline issues.

    The only thing I can think of that would give this result is if RDQUAD was not waiting for the next hub cycle, but simply returned what was in its cache.

    I will try more/other tests...
  • AribaAriba Posts: 2,690
    edited 2012-12-10 06:50
    Hi Bill

    I checked your second test code, and I am now sure you have the same problem.
    Your code has a small error: you used "addins3,ptra++" twice instead of "addins2,ptra++" and "addins3,ptra++".
    So you should get different results for count2 and count3.

    You can also try this modified version of your code. I just replaced the second 4 instructions with subs, so you should get zero or 1 as the result, but I still get FF. This means only the first 4 instructions get executed.
    '
    ' rdquad lmm2 test - now working LMM2 loop!
    '
    ' William Henning
    ' http://Mikronauts.com
    '
    
    CON
    
            CLOCK_FREQ = 60_000_000
            BAUD = 115_200
    
    DAT
            org     0
    
            ' make 4096 LMM instructions of "add countX,#1"
    
    loop    reps    #256,#8
            setptra what            ' point to 16k buffer of "add count,#1" lmm code
    
            wrlong  addins1,ptra++  ' save the add instruction into the hub
            wrlong  addins2,ptra++  ' save the add instruction into the hub
            wrlong  addins3,ptra++  ' save the add instruction into the hub
            wrlong  addins4,ptra++  ' save the add instruction into the hub
            wrlong  subins4,ptra++  ' save the sub instruction into the hub
            wrlong  subins3,ptra++  ' save the sub instruction into the hub
            wrlong  subins2,ptra++  ' save the sub instruction into the hub
            wrlong  subins1,ptra++  ' save the sub instruction into the hub
    
            ' point at start of lmm code
    
            setptra what    
    
            ' execute 256 LMM instructions
    
            nop
    '--
            setquad #ins1
    
            reps    #256,#8
            getcnt  start
    
            rdquad   ptra++
    '--
    ins1    nop     ' must be quad-long aligned!
    ins2    nop
    ins3    nop
    ins4    nop
            nop     ' delay slot
            nop     ' delay slot
            nop     ' delay slot
    
            getcnt  stop
    
            mov     cycles,stop
            sub     cycles,start
    
            wrlong  count, result1
            wrlong  count2,result2
            wrlong  count3,result3
            wrlong  count4,result4
    
            wrlong  cycles,result5
    
            coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!
    
    monitor_pgm     long    $70C                    'monitor program address
    monitor_ptr     long    90<<9 + 91              'monitor parameter (conveys tx/rx pins)
    
    addins1 add     count, #1
    addins2 add     count2,#1
    addins3 add     count3,#1
    addins4 add     count4,#1
    subins1 sub     count, #1
    subins2 sub     count2,#1
    subins3 sub     count3,#1
    subins4 sub     count4,#1
    
    start   long    0
    stop    long    0
    cycles  long    0
    
    count   long    0
    count2  long    0
    count3  long    0
    count4  long    0
    times   long    0
    
    what            long    16384
    result1         long    8192
    result2         long    8196
    result3         long    8200
    result4         long    8204
    result5         long    8208
    

    Andy
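
To see why the add/sub pattern exposes the problem while an add-only pattern cannot, here is an idealized Python model. It is a sketch under stated assumptions, not a model of the real pipeline: the first pass is assumed to execute the four NOP placeholders (pipe not primed), every later pass executes exactly one quad, and `stuck=True` models a loop that keeps re-executing the first quad. All names are made up for illustration.

```python
ADD, SUB = +1, -1

# Each entry is (counter index, delta); eight instructions = two quads.
ADD_ONLY = [(0, ADD), (1, ADD), (2, ADD), (3, ADD),
            (3, ADD), (2, ADD), (1, ADD), (0, ADD)]
ADD_SUB = [(0, ADD), (1, ADD), (2, ADD), (3, ADD),
           (3, SUB), (2, SUB), (1, SUB), (0, SUB)]

def run(pattern, stuck, passes=256):
    """Return the four counters after `passes` loop iterations."""
    counts = [0, 0, 0, 0]
    for n in range(passes - 1):            # first pass only runs the NOPs
        quad = 0 if stuck else n % 2       # which half of the pattern runs
        for index, delta in pattern[quad * 4: quad * 4 + 4]:
            counts[index] += delta
    return counts

print(run(ADD_ONLY, stuck=False))  # [255, 255, 255, 255]
print(run(ADD_ONLY, stuck=True))   # [255, 255, 255, 255]  (indistinguishable)
print(run(ADD_SUB, stuck=False))   # [1, 1, 1, 1]
print(run(ADD_SUB, stuck=True))    # [255, 255, 255, 255]
```

In this model the add-only pattern touches every counter once per quad, so a stuck loop and a correct loop produce the same FF values; only the add/sub version separates the two cases.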
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 07:07
    I tried to count cycles with 0 or 1 delay slots in lmm2_test2.spin and got some interesting results

    With three delay slots:
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000803   '....'
    
    2051 cycles for 1024 LMM2 instructions

    With two delay slots:
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000703   '....'
    
    1795 cycles for 1024 LMM2 instructions

    With one delay slot:
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000603   '....'
    >
    
    1539 cycles for 1024 LMM2 instructions

    with 0 delay slots:
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000602   '....'
    
    1538 cycles for 1024 LMM2 instructions

    This made me think of running a 9 instruction REP as a test...
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000903   '....'
    
    $903 cycles for 1024 LMM2 instructions

    10 instruction REP loop
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000A03   '....'
    
    $A03 cycles for 1024 LMM2 instructions

    15 instruction REP loop
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000F04   '....'
    
    $F04 cycles for 1024 LMM2 instructions

    20 instruction REP loop
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00001404   '....'
    
    $1404 (5124) cycles for 1024 LMM2 instructions... REP 20 loop * 256

    Conclusion:

    It looks highly probable that RDQUAD does not wait for the next set of data from the hub, but instead uses the data present in its latches.

    Test for conclusion:

    If I use eight counters, only four should increment

    Possible workaround:

    I will try using a RDLONG first, then staying in sync with the hub with RDQUADS

    UPDATE:

    A different order of instructions appears to work, so it looks like it was just pipeline misuse.
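
The monitor dumps in this post are easier to compare once the hex cycle counts are converted and divided by the 256 passes. The Python sketch below does that. The mapping from REPS block length to cycle count is my reading of the post (block length = RDQUAD + four instruction slots + delay slots), so treat the keys as an assumption.

```python
PASSES = 256  # REPS repeat count used in the tests

# REPS block length in instructions -> cycle count shown by the monitor.
measured = {
    5: 0x602, 6: 0x603, 7: 0x703, 8: 0x803,
    9: 0x903, 10: 0xA03, 15: 0xF04, 20: 0x1404,
}

def per_pass(cycles, passes=PASSES):
    """Split a total cycle count into whole cycles per pass and overhead."""
    return divmod(cycles, passes)

for length, cycles in measured.items():
    whole, overhead = per_pass(cycles)
    print(f"REPS #{length:2}: ${cycles:04X} = {cycles:4} cycles, "
          f"{whole} per pass + {overhead} overhead")
```

Read this way, every run costs one clock per instruction in the block, with a floor of six clocks per pass, which is consistent with the suspicion that RDQUAD was not stalling for a hub window in this layout.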
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 07:09
    Hi Andy,

    Yes, I was able to replicate your results. You are entirely correct.

    See my previous post for a lot more tests.

    It is looking like RDQUAD only fetches once... very strange, must investigate some more.

    (and thanks for finding the error!)
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 07:23
    I tried one more test... I was wondering what happens if we don't use REP's, but DJNZD instead:
    again	rdquad   pc
    '--
    ins1    nop	' must be quad-long aligned!
    ins2    nop
    ins3    nop
    ins4    nop
    	djnzd	howmany,#again
            add pc,#16	' no PTRA usage, trying to replicate Andy's issue
    	nop	' delay slot
    

    As per Andy's suggestion, four adds are followed by four subtracts in the LMM code

    Here are the results:
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000902   '....'
    

    Note the cycle count - looks like DJNZD takes two cycles when taken

    I am afraid it looks like RDQUAD has an issue. I hope I am wrong.

    Chip, can you take a look at this?

    UPDATE:

    Looks like it was just weird pipeline effects... a different layout appears to work!
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 07:37
    Boy did I have a bad scare... I ran a test that had suggested that RDLONGC had an issue - BUT IT DOES NOT.

    My test for RDLONGC had a bug, I am happy to report that RDLONGC works fine for LMM2.

    UPDATE#1:
    	reps	#256,#5
    	getcnt	start
    
    	rdlongc	ins3,ptra++
    ins1	rdlongc	ins4,ptra++
    ins2	nop		' verified, does not execute
    ins3	nop		' verified, EXECUTES
    ins4	nop
    
    	getcnt	stop
    

    Works as well, so at least we can execute $200 LMM instructions in $800 cycles (1/4)

    UPDATE#2:
    	reps	#256,#6
    	getcnt	start
    
    	rdlongc	ins3,ptra++
    ins1	rdlongc	ins4,ptra++
    ins2	rdlongc ins5,ptra++
    ins3	nop		' verified, EXECUTES
    ins4	nop
    ins5	nop
    
    	getcnt	stop
    

    Also works, $300 LMM instructions in $C00 cycles - no better than the above.

    UPDATE#3:
    	reps	#256,#8
    	getcnt	start
    
    	rdlongc	ins3,ptra++
    	rdlongc	ins4,ptra++
    	rdlongc ins5,ptra++
    	rdlongc ins6,ptra++
    ins3	nop		' verified, EXECUTES
    ins4	nop
    ins5	nop
    ins6	nop
    
    	getcnt	stop
    

    Works, but no point to using it - takes $FFE cycles for $400 instructions, same 1/4 efficiency as UPDATE#1

    It looks like until we figure out why the RDQUAD version does not work, the best efficiency we can get is 25%... but 40MIPS is not a bad start!
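
The 25% figure can be checked from the three counts reported in this post. A short Python sketch follows. The 160 MHz clock is an assumption taken from the thread title, and `efficiency` is a made-up helper name.

```python
def efficiency(lmm_instructions, cycles):
    """Fraction of clock cycles that execute a hub-resident instruction."""
    return lmm_instructions / cycles

runs = {
    "UPDATE#1": (0x200, 0x800),
    "UPDATE#2": (0x300, 0xC00),
    "UPDATE#3": (0x400, 0xFFE),
}

TARGET_MHZ = 160  # assumed target clock, from the thread title

for name, (instructions, cycles) in runs.items():
    e = efficiency(instructions, cycles)
    print(f"{name}: {e:.3f} efficiency, about {TARGET_MHZ * e:.1f} LMM MIPS")
```

All three RDLONGC layouts land at essentially one LMM instruction per four clocks, so unrolling further does not help.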
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 07:47
    Bill Henning wrote: »
    I tried one more test... I was wondering what happens if we don't use REP's, but DJNZD instead:
    again	rdquad   pc
    '--
    ins1    nop	' must be quad-long aligned!
    ins2    nop
    ins3    nop
    ins4    nop
    	djnzd	howmany,#again
            add pc,#16	' no PTRA usage, trying to replicate Andy's issue
    	nop	' delay slot
    

    As per Andy's suggestion, four adds are followed by four subtracts in the LMM code

    Here are the results:
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000902   '....'
    

    Note the cycle count - looks like DJNZD takes two cycles when taken

    I am afraid it looks like RDQUAD has an issue. I hope I am wrong.

    Chip, can you take a look at this?

    I will be able to look at it in 1/2 hour. In the meantime, can you do a Ctrl-L on your code in PNUT.EXE to check the encoding of the RDQUAD instruction? My test suite checked RDQUAD a few ways, so I'm thinking there might be some other problem, maybe in the assembler. If we do have a hardware bug, it means making a Verilog change and having the synthesis guys respin the synthesized block, which takes a few days. Better to do it now, before the test chip is sent to fab.
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 07:50
    Bill Henning wrote: »
    I tried one more test... I was wondering what happens if we don't use REP's, but DJNZD instead:
    again	rdquad   pc
    '--
    ins1    nop	' must be quad-long aligned!
    ins2    nop
    ins3    nop
    ins4    nop
    	djnzd	howmany,#again
            add pc,#16	' no PTRA usage, trying to replicate Andy's issue
    	nop	' delay slot
    

    As per Andy's suggestion, four adds are followed by four subtracts in the LMM code

    Here are the results:
    >n
    >2000.2013
    02000- 000000FF 000000FF 000000FF 000000FF   '................'
    02010- 00000902   '....'
    

    Note the cycle count - looks like DJNZD takes two cycles when taken

    I am afraid it looks like RDQUAD has an issue. I hope I am wrong.

    Chip, can you take a look at this?

    Bill, I see one problem... When you do a delayed jump (DJNZD in this case), you must follow it with THREE instructions, not two.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 07:59
    Thanks Chip.

    But don't worry, I figured it out - you were VERY right earlier to warn me about possible RDQUAD pipeline weirdness, and Ariba's testing was invaluable in generating a better verification.

    Here is code that apparently works fine as an RDQUAD LMM2 loop - but the pipeline layout has to be a bit... unusual.

    Could you and Ariba check my results?
    	reps	#256,#8
    	getcnt	start
    
    '-- must be on a XXXXXXX00 address in the cog
    ins1	nop
    ins2	nop
    ins3	nop
    ins4	nop
    	rdquad	ptra++
    ins5	nop
    ins6	nop
    ins7	nop
    
    	getcnt	stop
    

    I am attaching the full spin file.

    Here are the results:
    >n
    >2000.2013
    02000- 00000000 00000000 00000001 00000001   '................'
    02010- 00000806   '....'
    

    The 1s are due to the un-primed LMM pipeline

    UPDATE:

    It looks like the three delay-slot nops (repeating 8 instructions) are necessary; when I tried to leave them out, I got incorrect results when I made the adds/subs unbalanced
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 08:03
    Thanks Chip - it was NOT a problem with RDQUAD, but with how its pipelining was not used correctly by me :(

    You were absolutely right to warn me about potential pipeline weirdness with RDQUAD... but it looks like I got a layout that works.

    I really appreciate your help.
    cgracey wrote: »
    I will be able to look at it in 1/2 hour. In the meantime, can you do a Ctrl-L on your code in PNUT.EXE to check the encoding of the RDQUAD instruction? My test suite checked RDQUAD a few ways, so I'm thinking there might be some other problem, maybe in the assembler. If we do have a hardware bug, it means making a Verilog change and having the synthesis guys respin the synthesized block, which takes a few days. Better to do it now, before the test chip is sent to fab.

    I agree totally - any potential bugs found now can be fixed for far less than having to redo things after the shuttle run and requiring an additional shuttle run.

    Fortunately it looks like RDQUAD works fine, as long as the pipeline is arranged to its liking.

    Sorry for thinking there may be a bug - it was just the pipeline behaving different from what I expected.
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 08:12
    Bill,

    I still don't understand why RDQUAD was taking only one clock.

    Could it have been that RDQUAD was overshadowed by one of the QUAD registers, making its location into a single-clock instruction?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 08:20
    I have no idea, but testing shows it likes to take one clock cycle, and it really wants to be followed by exactly seven slots if we want optimal timing.

    Take a peek at the layout that appears to work for me in post#51 ... that gives the correct result with Ariba's add/sub test, and also the correct result when I unbalance the add/sub's as follows:
    addins1	add	count, #1
    addins2	add	count2,#1
    addins3	add	count3,#1
    addins4	add	count4,#1
    subins1	sub	count, #1
    subins2 sub	count2,#1
    subins3 sub	count3,#1
    subins4 sub	count3,#1   ' normally refers to count4 to balance add/subs
    

    The weird thing is that it requires the three delay slots - if I reduce the REPS instruction count from 8 to 5, so that it does not have the three delay slots, it no longer works when unbalanced.

    I think I will try some experiments with multi-cycle instructions thrown into the mix to see if that will work properly.
    cgracey wrote: »
    Bill,

    I still don't understand why RDQUAD was taking only one clock.

    Could it have been that RDQUAD was overshadowed by one of the QUAD registers, making its location into a single-clock instruction?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 08:35
    I made a new torture test for LMM2 using RDQUAD

    This version keeps executing the following eight instructions:
    i1	add	count, #1
    i2	add	count2,#1
    i3	add	count3,#1
    i4	add	count4,#1
    i5	rdlong	lc,result6
    i6	add	lc,#1
    i7	wrlong	lc,result6
    i8	add	count,#1
    

    And here are the results:
    >2000.201f
    02000- 000000FE 0000007F 00000080 00000080   '................'
    02010- 00000FF6 0000007E 00000000 00000000   '....~...........'
    

    The results make sense even though it executed $800 cycles faster than I expected. It makes me think that each hub cycle can do both a hub read and a hub write...

    i1-i4 are executed 128 times
    i5-i8 are executed 128 times

    count is incremented 256 times (less initial non-priming loss)
    count2,3,4 are incremented 127,128,128 times (the 127 is due to non-priming)
    lc is incremented 126 times, which also makes sense due to non-priming, and the rdlong/add/wrlong being in the second group of 8

    I am surprised that the loop did not take around $17F6 cycles; that is why I suspect a hub cycle can involve both a read and a write.

    So it looks like RDQUAD-based LMM2 can execute both single- and multi-cycle instructions fine :)
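    For anyone who wants to check the numbers above without doing hex arithmetic by hand, here is a small Python sketch that decodes the monitor dump. The mapping of longs to count, count2, count3, count4, cycle count and lc is inferred from the description in this post, and `parse_dump` is a hypothetical helper, not part of any Parallax tool.

```python
def parse_dump(text):
    """Turn monitor lines like '02000- 000000FE ...' into a list of longs."""
    longs = []
    for line in text.strip().splitlines():
        _, _, rest = line.partition("- ")
        hex_part = rest.split("'")[0]          # drop the ASCII column
        longs.extend(int(word, 16) for word in hex_part.split())
    return longs


dump = """
02000- 000000FE 0000007F 00000080 00000080   '................'
02010- 00000FF6 0000007E 00000000 00000000   '....~...........'
"""

# Assumed order of the result longs, inferred from the post.
count, count2, count3, count4, cycles, lc = parse_dump(dump)[:6]

iterations = 128 + 128                   # i1-i4 and i5-i8 passes, per the post
expected_cycles = 0x17F6                 # the figure the post says was expected

print("counters:", count, count2, count3, count4, "lc:", lc)
print("cycles:", cycles, "average per pass:", round(cycles / iterations, 2))
print("faster than expected by: $%X cycles" % (expected_cycles - cycles))
```

    The decoded values match the counts quoted in the post, and the measured total averages out to roughly 16 clocks per pass over the 256 passes.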
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 08:42
    Andy,

    You will be happy to know the new RDQUAD pipe layout works just as well without using PTRA!

    Please find attached the non-PTRA version of the lmm2_quadtorture2.spin

    Here are the results it gave:
    >n
    >2000.201f
    02000- 000000FE 0000007F 00000080 00000080   '................'
    02010- 00000FF6 0000007E 00000000 00000000   '....~...........'
    

    Leaving PTRA available for user code is well worth one delay slot.
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 09:41
    Andy,

    You will be happy to know the new RDQUAD pipe layout works just as well without using PTRA!

    Please find attached the non-PTRA version of the lmm2_quadtorture2.spin

    Here are the results it gave:
    >n
    >2000.201f
    02000- 000000FE 0000007F 00000080 00000080   '................'
    02010- 00000FF6 0000007E 00000000 00000000   '....~...........'
    

    Leaving PTRA available for user code is well worth one delay slot.

    Bill, I see a problem with your code in post #51. You do a SETQUAD, then execute those locations before doing a RDQUAD. The QUAD registers are not initialized, so they will bank unknown data into where you have your NOP's, and execute unknown instructions.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 09:48
    Thanks Chip, I was thinking that the NOP's would execute, and RDQUAD overwrite them - I forgot they were mapped over, instead of copied over.

    I will now try priming them...
    cgracey wrote: »
    Bill, I see a problem with your code in post #51. You do a SETQUAD, then execute those locations before doing a RDQUAD. The QUAD registers are not initialized, so they will bank unknown data into where you have your NOP's, and execute unknown instructions.
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 10:05
    I just did a test and found that the QUADs are executable starting on the 6th instruction after the RDQUAD:
    After a RDQUAD, the QUAD registers can be executed beginning at the 6th instruction:
    
            SETQUAD #quad0          'quad0 must be a quad-aligned address
            RDQUAD  hubaddress      'read a quad into the QUAD registers which are mapped at quad0..quad3
            NOP                     'do 5 other instructions to allow QUADs to enter execution pipeline
            NOP
            NOP
            NOP
            NOP
    quad0   NOP                     'QUAD0..QUAD3 are executable from this point
    quad1   NOP
    quad2   NOP
    quad3   NOP
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-10 10:12
    Interesting - but that makes it take 10 cycles... ie 2 hub cycles :(

    Other than the first iteration executing garbage, why does the version I tried appear to work? Does being in a REPS loop somehow make it executable in fewer cycles?

    Also, why do I get a very weird cycle count when I try to prime the RDQUAD in the attached file?

    Very interesting stuff.

    Look at the huge cycle count for the primed torture test:
    >2000.201f
    02000- 00000101 00000081 00000080 00000080   '................'
    02010- 024A6AF4 0000007F 00000000 00000000   '.jJ.............'
    
  • cgraceycgracey Posts: 14,155
    edited 2012-12-10 10:28
    Interesting - but that makes it take 10 cycles... ie 2 hub cycles :(

    Not if you wrap it in a loop. Just make sure the RDQUAD is followed by the QUAD registers with a max of one instruction between the RDQUAD and the mapped QUADs. That way, the QUADs will execute the *previous* RDQUAD longs without one or more of the *recent* RDQUAD longs slipping in and causing woes.
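    The "max of one instruction between" rule can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: it counts instructions rather than clocks, ignores hub-window stalls entirely, and takes the "executable from the 6th instruction after RDQUAD" figure at face value. The function names and the loop layout (RDQUAD, some filler, then quad0..quad3, then padding) are hypothetical.

```python
QUAD_LATENCY = 6   # assumption: new QUAD data is executable at the 6th instruction after RDQUAD


def quad_sources(gap, loop_len, iterations=4):
    """For each pass, list which RDQUAD (by pass number) feeds quad0..quad3.

    Layout per pass: RDQUAD, `gap` filler instructions, quad0..quad3,
    then padding up to `loop_len` instructions. -1 means no RDQUAD data
    has arrived yet. Hub stalls are not modelled.
    """
    assert loop_len >= 1 + gap + 4, "loop too short for this layout"
    rows = []
    for it in range(iterations):
        row = []
        for slot in range(4):
            pos = it * loop_len + 1 + gap + slot   # instruction index of this quad slot
            if pos >= QUAD_LATENCY:
                # newest RDQUAD issued at least QUAD_LATENCY instructions earlier
                row.append((pos - QUAD_LATENCY) // loop_len)
            else:
                row.append(-1)
        rows.append(row)
    return rows


def is_consistent(gap, loop_len):
    """True if every pass executes four longs that came from a single RDQUAD."""
    return all(len(set(row)) == 1 for row in quad_sources(gap, loop_len))


for gap in range(4):
    print("gap", gap, "consistent:", is_consistent(gap, loop_len=8))
```

    In this model, a gap of 0 or 1 keeps each pass on the longs from the previous RDQUAD, while a gap of 2 or more lets the last QUAD slot(s) pick up longs from the most recent RDQUAD, which is the mixing described above.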