LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Ariba · 2012-12-10 05:26

I've spent a few hours trying every variation I could think of, but I can't get the QUAD-LMM code to work.
Here is the simplest version:

' Quad LMM
DAT
        org 0
        mov pc,start
lmm     reps #511,#6
        setquad #ins1
         rdquad pc
ins1     nop            'quad aligned
         nop
         nop
         nop
         add pc,#16
        jmp #lmm        'after 511 repeats

start   long @lmmcode+$0E80
pc      long 0
t1      long 0

        long 0[4-($-($ & $1FC))]  'quad align

lmmcode setp #2         'toggle pin2 50%        
        nop
        clrp #2
        nop

        notp #4         'toggle pin4 25%
        notp #4
        nop
        sub pc,#32      'relative jump (delayed)

        nop             'the 4 delay instructions
        nop
        nop
        nop

The good thing: I get 40 LMM-MIPS on a 60 MHz Prop2.
The bad thing: this cannot be true!
The code executes only the first 4 LMM instructions and repeats them endlessly. It never loads the next 4 instructions; the ADD PC,#16 seems to have no effect.
Additionally, the LMM loop takes only 6 cycles, so it seems RDQUAD does not wait for its hub window.
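
The 40 MIPS figure is what a 6-cycle loop delivering 4 LMM instructions per pass works out to at 60 MHz, so the arithmetic itself is a useful sanity check. A minimal sketch in Python; the function name `lmm_mips` is made up for illustration, and the 160 MHz case is taken only from the thread title:

```python
def lmm_mips(clock_mhz, instructions_per_pass, cycles_per_pass):
    """LMM throughput in MIPS for a loop that runs a fixed batch per pass."""
    return clock_mhz * instructions_per_pass / cycles_per_pass

# 4 LMM instructions per pass at 60 MHz
print(lmm_mips(60, 4, 6))    # the 6-cycle loop measured on the scope
print(lmm_mips(60, 4, 8))    # a loop that takes 8 cycles per pass

# Same 8-cycle loop at the 160 MHz mentioned in the thread title
print(lmm_mips(160, 4, 8))
```

A loop that really waited for the hub on every pass could not finish in 6 cycles if the hub window is 8 clocks, which is why the 40 MIPS reading looks suspicious rather than good.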

Perhaps somebody else can see where my mistake is.

BTW: I find it much easier to check LMM loops by toggling pins and watching them on a scope than by writing to hub memory and checking the results with the monitor.

Andy

Bill Henning · 2012-12-10 05:36

Thanks, I will try this code.

cgracey wrote: »

That might do it. I'm thinking if you did something like...

WRBYTE h00,PTRB++
WRBYTE h01,PTRB++
WRBYTE h02,PTRB++
WRBYTE h03,PTRB++
WRBYTE h04,PTRB++
WRBYTE h05,PTRB++
WRBYTE h06,PTRB++
WRBYTE h07,PTRB++
<repeat>

...you could see very easily if you were executing part of one RDQUAD and part of another.

Bill Henning · 2012-12-10 05:39

Hi Ariba,

Check post#17 in this thread for a version that appears to work - it gives 30MIPS at 60MHz (ok, four cycles less than 30MIPS)

I will be doing more testing on it, as Chip warned me that there may be some pipelining effects being hidden by fetching the same instruction.

You can download the Spin file for it from post#2

Ariba wrote: »
I've spent a few hours trying every variation I could think of, but I can't get the QUAD-LMM code to work.
Here is the simplest version:
' Quad LMM
DAT
        org 0
        mov pc,start
lmm     reps #511,#6
        setquad #ins1
         rdquad pc
ins1     nop            'quad aligned
         nop
         nop
         nop
         add pc,#16
        jmp #lmm        'after 511 repeats

start   long @lmmcode+$0E80
pc      long 0
t1      long 0

        long 0[4-($-($ & $1FC))]  'quad align

lmmcode setp #2         'toggle pin2 50%        
        nop
        clrp #2
        nop

        notp #4         'toggle pin4 25%
        notp #4
        nop
        sub pc,#32      'relative jump (delayed)

        nop             'the 4 delay instructions
        nop
        nop
        nop
The good thing: I get 40 LMM-MIPS on a 60 MHz Prop2.
The bad thing: this cannot be true!
The code executes only the first 4 LMM instructions and repeats them endlessly. It never loads the next 4 instructions; the ADD PC,#16 seems to have no effect.
Additionally, the LMM loop takes only 6 cycles, so it seems RDQUAD does not wait for its hub window.

Perhaps somebody else can see where my mistake is.

BTW: I find it much easier to check LMM loops by toggling pins and watching them on a scope than by writing to hub memory and checking the results with the monitor.

Andy

Kye · 2012-12-10 05:40

For those extra three instructions...

You could do a simple:

mov temp, ind
test temp, mask wz 
if_nz jmp #interrupt_code

This would let you check the port D register for masked interrupt pins. If one of these pins goes high, you would use the Prop ASM instruction that finds the highest set bit in the temp variable above, then use that bit index to look up an entry in an interrupt vector table and execute the corresponding interrupt handler code.

So... this would give us an interrupt controller with up to 32 channels.
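
The dispatch idea can be modelled in a few lines of Python. This is only a sketch of the logic, not cog code: `pending_vector`, `dispatch` and the handler table are hypothetical names, and the pin register is assumed to be 32 bits wide.

```python
def pending_vector(pins, mask):
    """Return the index of the highest masked pin that is high, or None."""
    active = pins & mask & 0xFFFFFFFF      # 32-bit port assumed
    if active == 0:
        return None
    return active.bit_length() - 1         # position of the highest set bit

def dispatch(pins, mask, vector_table):
    """Call the handler registered for the highest pending pin, if any."""
    index = pending_vector(pins, mask)
    if index is None:
        return None
    return vector_table[index]()

# Hypothetical handlers for pins 1 and 2
handlers = {1: lambda: "pin 1 handler", 2: lambda: "pin 2 handler"}
print(dispatch(0b1010, 0b0110, handlers))  # pin 3 is high but masked out
```

Picking the highest set bit gives a fixed priority order, with higher-numbered pins winning when several are active at once.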

Thanks,

Bill Henning · 2012-12-10 05:45

Hi Kye,

Yep that would work!

I think two instructions would work:

test mask,pinx wz
if_nz jmp #interrupt_handler

Kye wrote: »
For those extra three instructions...

You could do a simple:
mov temp, ind
test temp, mask wz 
if_nz jmp #interrupt_code
This would let you check the port D register for masked interrupt pins. If one of these pins goes high, you would use the Prop ASM instruction that finds the highest set bit in the temp variable above, then use that bit index to look up an entry in an interrupt vector table and execute the corresponding interrupt handler code.

So... this would give us an interrupt controller with up to 32 channels.

Thanks,

David Betz · 2012-12-10 05:47

Kye wrote: »
For those extra three instructions...

You could do a simple:
mov temp, ind
test temp, mask wz 
if_nz jmp #interrupt_code
This would let you check the port D register for masked interrupt pins. If one of these pins goes high, you would use the Prop ASM instruction that finds the highest set bit in the temp variable above, then use that bit index to look up an entry in an interrupt vector table and execute the corresponding interrupt handler code.

So... this would give us an interrupt controller with up to 32 channels.

Thanks,

This will trash the z flag. Won't that be a problem for the LMM code?

Sapieha · 2012-12-10 05:50

Hi.

I think both Bill's and Kye's versions need instructions to push and pop the Z and C flags (wz, wc) to work properly.

David Betz wrote: »

This will trash the z flag. Won't that be a problem for the LMM code?

Bill Henning · 2012-12-10 05:50

Good point David & Sapieha, I guess I'd better have my first coffee of the day :-)

mov tmp, pinX
and tmp, mask
tjnz tmp,#int_handler

should not destroy the Z flag

Ariba · 2012-12-10 06:24

Bill Henning wrote: »

Hi Ariba,

Check post#17 in this thread for a version that appears to work - it gives 30MIPS at 60MHz (ok, four cycles less than 30MIPS)

I will be doing more testing on it, as Chip warned me that there may be some pipelining effects being hidden by fetching the same instruction.

You can download the Spin file for it from post#2

Hello Bill

I have certainly checked your code. It's basically the same as mine and, I think, has the same problems.
If I make my loop 8 instructions I also get the right timing, but that is only because the loop now takes 8 cycles, not because the RDQUAD syncs to the hub slots.
If I remove 2 of the 3 delay slots in your code and make the loop #6 instructions, I also get much faster execution.
And you always execute the same instruction, so you cannot tell whether only the first 4 get executed.

I also tried it with PTRA as the program counter, with the same results. I just think the PTRx registers are too valuable to be used as the PC if you have free timing slots to increment the PC anyway.

Andy

Bill Henning · 2012-12-10 06:27

I just tried a pipeline correctness test for RDQUAD based LMM2.

In this version, I place 256 copies of the following eight instructions in the hub:

add count1,#1
add count2,#1
add count3,#1
add count4,#1
add count4,#1
add count3,#1
add count2,#1
add count1,#1

and I execute the RDQUAD based LMM loop 256 times, executing a total of 1020 LMM instructions (the pipe is not primed on the first pass)

(I am attaching lmm2_test2.spin to post#2 so y'all can replicate my results)

After the run, execute the following commands in the monitor to view the results:

n
2000.2013

02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000803   '....'

The top four numbers are the counts for the four LMM instructions that are executed in a loop.

As they all show a count of 255 ($FF), we know that each "countX" add was executed the correct number of times. Because I changed the pattern between the two quads, the counts should have come out uneven if the wrong instructions had been executed due to pipeline issues, so it is very unlikely that we would get this result if there were such issues.

The last number is the number of cycles it took to execute the 256 RDQUAD LMM2 cycles.
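
Since the monitor prints hex, the counts are easier to compare once converted to decimal. A minimal Python sketch; the helper name `parse_monitor_dump` is hypothetical, and the line layout (address, dash, hex words, quoted ASCII column) is assumed from the dump above:

```python
def parse_monitor_dump(text):
    """Collect the 32-bit hex words from monitor dump lines."""
    words = []
    for line in text.splitlines():
        if "-" not in line:
            continue                          # skip commands such as 'n' or '2000.2013'
        _, rest = line.split("-", 1)          # drop the address field
        hex_part = rest.split("'", 1)[0]      # drop the ASCII column
        words.extend(int(token, 16) for token in hex_part.split())
    return words

dump = """n
2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000803   '....'
"""

counts = parse_monitor_dump(dump)
print(counts[:4])    # the four counters
print(counts[4])     # elapsed cycles
```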

Bill Henning · 2012-12-10 06:38

Hi Andy,

Ariba wrote: »

Hello Bill

I have certainly checked your code. It's basically the same as mine and, I think, has the same problems.

I agree that it is basically the same code... I just have not seen the problem yet; but I will try many other tests to try to find any problems.

Ariba wrote: »

If I make my loop 8 instructions I also get the right timing, but that is only because the loop now takes 8 cycles, not because the RDQUAD syncs to the hub slots.
If I remove 2 of the 3 delay slots in your code and make the loop #6 instructions, I also get much faster execution.

Now that is strange - the loop should take 8 cycles regardless of the delay slots as RDQUAD should always sync to the hub. Mind you, if it was always reading the same four longs from the hub, perhaps it just picks it up from the local quad cache - which is the only explanation I can think of for the faster execution.

I'll look at this situation, including using a PC register.

Ariba wrote: »

And you always execute the same instruction, so you cannot tell whether only the first 4 get executed.

I just ran a test that uses 8 different instructions, where I check the count of how many times each got executed - and it uses them in a different sequence in two different RDQUADS.

See post#2 for "lmm2_test2.spin"

Ariba wrote: »

I also tried it with PTRA as the program counter, with the same results. I just think the PTRx registers are too valuable to be used as the PC if you have free timing slots to increment the PC anyway.

Andy

I would also like to keep PTRA free.

Bill Henning · 2012-12-10 06:49

Hi Andy,

You are definitely on to something!

Please see lmm2_test3.spin in post#2, I tried to replicate your results - and I got the same results you did!

It can run both the 8 cycle and the 6 cycle version, just change 6/8 and un-comment the extra delay slots.

- I have verified that the 8 cycle version takes 2051 cycles to execute 256 loops of 4 instructions

- I have also verified that the 6 cycle version takes 1537 cycles to execute 256 loops of 4 instructions

What is strange is that I use a pattern of 8 instructions that I hoped would reveal any pipeline issues.

The only thing I can think of that would give this result is if RDQUAD was not waiting for the next hub cycle, but simply returned what was in its cache.

I will try more/other tests...

Ariba · 2012-12-10 06:50

Hi Bill

I checked your second test code, and I am now sure you have the same problem.
Your code has a small error: you used "addins3,ptra++" twice instead of "addins2,ptra++" and "addins3,ptra++".
So you should get different results for count2 and count3.

You can also try this modified version of your code. I just replaced the second 4 instructions with subs, so you should get zero or 1 as the result, but I still get FF. This means only the first 4 instructions get executed.

'
' rdquad lmm2 test - now working LMM2 loop!
'
' William Henning
' http://Mikronauts.com
'

CON

        CLOCK_FREQ = 60_000_000
        BAUD = 115_200

DAT
        org     0

        ' make 4096 LMM instructions of "add countX,#1"

loop    reps    #256,#8
        setptra what            ' point to 16k buffer of "add count,#1" lmm code

        wrlong  addins1,ptra++  ' save the add instruction into the hub
        wrlong  addins2,ptra++  ' save the add instruction into the hub
        wrlong  addins3,ptra++  ' save the add instruction into the hub
        wrlong  addins4,ptra++  ' save the add instruction into the hub
        wrlong  subins4,ptra++  ' save the sub instruction into the hub
        wrlong  subins3,ptra++  ' save the sub instruction into the hub
        wrlong  subins2,ptra++  ' save the sub instruction into the hub
        wrlong  subins1,ptra++  ' save the sub instruction into the hub

        ' point at start of lmm code

        setptra what    

        ' execute 256 LMM instructions

        nop
'--
        setquad #ins1

        reps    #256,#8
        getcnt  start

        rdquad   ptra++
'--
ins1    nop     ' must be quad-long aligned!
ins2    nop
ins3    nop
ins4    nop
        nop     ' delay slot
        nop     ' delay slot
        nop     ' delay slot

        getcnt  stop

        mov     cycles,stop
        sub     cycles,start

        wrlong  count, result1
        wrlong  count2,result2
        wrlong  count3,result3
        wrlong  count4,result4

        wrlong  cycles,result5

        coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!

monitor_pgm     long    $70C                    'monitor program address
monitor_ptr     long    90<<9 + 91              'monitor parameter (conveys tx/rx pins)

addins1 add     count, #1
addins2 add     count2,#1
addins3 add     count3,#1
addins4 add     count4,#1
subins1 sub     count, #1
subins2 sub     count2,#1
subins3 sub     count3,#1
subins4 sub     count4,#1

start   long    0
stop    long    0
cycles  long    0

count   long    0
count2  long    0
count3  long    0
count4  long    0
times   long    0

what            long    16384
result1         long    8192
result2         long    8196
result3         long    8200
result4         long    8204
result5         long    8208

Andy

Bill Henning · 2012-12-10 07:07

I tried counting cycles with 3, 2, 1 and 0 delay slots in lmm2_test2.spin and got some interesting results

With three delay slots:

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000803   '....'

2051 cycles ($803) for 1024 LMM2 instructions

With two delay slots:

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000703   '....'

1795 cycles for 1024 LMM2 instructions

With one delay slot:

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000603   '....'
>

1539 cycles ($603) for 1024 LMM2 instructions

with 0 delay slots:

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000602   '....'

1538 cycles ($602) for 1024 LMM2 instructions

This made me think of running a 9 instruction REP as a test...

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000903   '....'

$903 cycles for 1024 LMM2 instructions

10 instruction REP loop

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000A03   '....'

$A03 cycles for 1024 LMM2 instructions

15 instruction REP loop

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000F04   '....'

$F04 cycles for 1024 LMM2 instructions

20 instruction REP loop

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00001404   '....'

$1404 (5124) cycles for 1024 LMM2 instructions (REP 20 loop * 256)
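
Dividing each cycle count above by the 256 passes makes the pattern easier to see. A small Python sketch of that arithmetic; the labels are just descriptions of the runs above, and the helper name `per_pass` is made up:

```python
PASSES = 256  # REP count used in every run above

# Cycle counts copied from the monitor dumps above (the word at $2010).
measurements = {
    "3 delay slots": 0x803,
    "2 delay slots": 0x703,
    "1 delay slot": 0x603,
    "0 delay slots": 0x602,
    "REP 9": 0x903,
    "REP 10": 0xA03,
    "REP 15": 0xF04,
    "REP 20": 0x1404,
}

def per_pass(cycles, passes=PASSES):
    """Split a total into whole cycles per pass and leftover overhead cycles."""
    return divmod(cycles, passes)

for label, cycles in measurements.items():
    whole, overhead = per_pass(cycles)
    print(f"{label:14s} {cycles:5d} cycles = {whole:2d} per pass + {overhead}")
```

In this breakdown the per-pass figure follows the loop length (9, 10, 15, 20) instead of rounding up to a multiple of 8, which is what one would expect to see if RDQUAD stalled for a hub window on every pass, assuming an 8-clock hub window.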

Conclusion:

It looks highly probable that RDQUAD does not wait for the next set of data from the hub, but instead uses the data present in its latches.

Test for conclusion:

If I use eight counters, only four should increment

Possible workaround:

I will try using a RDLONG first, then staying in sync with the hub with RDQUADS

UPDATE:

A different order of instructions appears to work, so it looks like it was just pipeline misuse.

Bill Henning · 2012-12-10 07:09

Hi Andy,

Yes, I was able to replicate your results. You are entirely correct.

See my previous post for a lot more tests.

It is looking like RDQUAD only fetches once... very strange, must investigate some more.

(and thanks for finding the error!)

Ariba wrote: »

Hi Bill

I checked your second test code, and I am now sure you have the same problem.
Your code has a small error: you used "addins3,ptra++" twice instead of "addins2,ptra++" and "addins3,ptra++".
So you should get different results for count2 and count3.

You can also try this modified version of your code. I just replaced the second 4 instructions with subs, so you should get zero or 1 as the result, but I still get FF. This means only the first 4 instructions get executed.

'
' rdquad lmm2 test - now working LMM2 loop!
'
' William Henning
' http://Mikronauts.com
'

CON

        CLOCK_FREQ = 60_000_000
        BAUD = 115_200

DAT
        org     0

        ' make 4096 LMM instructions of "add countX,#1"

loop    reps    #256,#8
        setptra what            ' point to 16k buffer of "add count,#1" lmm code

        wrlong  addins1,ptra++  ' save the add instruction into the hub
        wrlong  addins2,ptra++  ' save the add instruction into the hub
        wrlong  addins3,ptra++  ' save the add instruction into the hub
        wrlong  addins4,ptra++  ' save the add instruction into the hub
        wrlong  subins4,ptra++  ' save the sub instruction into the hub
        wrlong  subins3,ptra++  ' save the sub instruction into the hub
        wrlong  subins2,ptra++  ' save the sub instruction into the hub
        wrlong  subins1,ptra++  ' save the sub instruction into the hub

        ' point at start of lmm code

        setptra what    

        ' execute 256 LMM instructions

        nop
'--
        setquad #ins1

        reps    #256,#8
        getcnt  start

        rdquad   ptra++
'--
ins1    nop     ' must be quad-long aligned!
ins2    nop
ins3    nop
ins4    nop
        nop     ' delay slot
        nop     ' delay slot
        nop     ' delay slot

        getcnt  stop

        mov     cycles,stop
        sub     cycles,start

        wrlong  count, result1
        wrlong  count2,result2
        wrlong  count3,result3
        wrlong  count4,result4

        wrlong  cycles,result5

        coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!

monitor_pgm     long    $70C                    'monitor program address
monitor_ptr     long    90<<9 + 91              'monitor parameter (conveys tx/rx pins)

addins1 add     count, #1
addins2 add     count2,#1
addins3 add     count3,#1
addins4 add     count4,#1
subins1 sub     count, #1
subins2 sub     count2,#1
subins3 sub     count3,#1
subins4 sub     count4,#1

start   long    0
stop    long    0
cycles  long    0

count   long    0
count2  long    0
count3  long    0
count4  long    0
times   long    0

what            long    16384
result1         long    8192
result2         long    8196
result3         long    8200
result4         long    8204
result5         long    8208

Andy

Bill Henning · 2012-12-10 07:23

I tried one more test... I was wondering what happens if we don't use REP's, but DJNZD instead:

again	rdquad   pc
'--
ins1    nop	' must be quad-long aligned!
ins2    nop
ins3    nop
ins4    nop
	djnzd	howmany,#again
        add pc,#16	' no PTRA usage, trying to replicate Andy's issue
	nop	' delay slot

As per Andy's suggestion, four adds are followed by four subtracts in the LMM code

Here are the results:

>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000902   '....'

Note the cycle count - looks like DJNZD takes two cycles when taken

I am afraid it looks like RDQUAD has an issue. I hope I am wrong.

Chip, can you take a look at this?

UPDATE:

Looks like it was just weird pipeline effects... a different layout appears to work!

Bill Henning · 2012-12-10 07:37

Boy did I have a bad scare... I ran a test that had suggested that RDLONGC had an issue - BUT IT DOES NOT.

My test for RDLONGC had a bug; I am happy to report that RDLONGC works fine for LMM2.

UPDATE#1:

	reps	#256,#5
	getcnt	start

	rdlongc	ins3,ptra++
ins1	rdlongc	ins4,ptra++
ins2	nop		' verified, does not execute
ins3	nop		' verified, EXECUTES
ins4	nop

	getcnt	stop

Works as well, so at least we can execute $200 LMM instructions in $800 cycles (1/4)

UPDATE#2:

	reps	#256,#6
	getcnt	start

	rdlongc	ins3,ptra++
ins1	rdlongc	ins4,ptra++
ins2	rdlongc ins5,ptra++
ins3	nop		' verified, EXECUTES
ins4	nop
ins5	nop

	getcnt	stop

Also works, $300 LMM instructions in $C00 cycles - no better than the above.

UPDATE#3:

	reps	#256,#8
	getcnt	start

	rdlongc	ins3,ptra++
	rdlongc	ins4,ptra++
	rdlongc ins5,ptra++
	rdlongc ins6,ptra++
ins3	nop		' verified, EXECUTES
ins4	nop
ins5	nop
ins6	nop

	getcnt	stop

Works, but no point to using it - takes $FFE cycles for $400 instructions, same 1/4 efficiency as UPDATE#1

It looks like, until we figure out why the RDQUAD version does not work, the best efficiency we can get is 25%... but 40 MIPS is not a bad start!
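
The 25% figure and the 40 MIPS estimate follow directly from the three RDLONGC runs above. A short Python check; the 160 MHz clock is taken from the thread title, and `efficiency` is just an illustrative helper name:

```python
def efficiency(lmm_instructions, cycles):
    """Fraction of clock cycles that execute an LMM instruction."""
    return lmm_instructions / cycles

# (LMM instructions, cycles) as reported for each RDLONGC variant above
runs = {
    "UPDATE#1": (0x200, 0x800),
    "UPDATE#2": (0x300, 0xC00),
    "UPDATE#3": (0x400, 0xFFE),
}

CLOCK_MHZ = 160  # assumption: the clock quoted in the thread title

for name, (instructions, cycles) in runs.items():
    e = efficiency(instructions, cycles)
    print(f"{name}: {e:.4f} efficiency, {e * CLOCK_MHZ:.1f} MIPS at {CLOCK_MHZ} MHz")
```

All three variants land on essentially the same ratio, so adding more RDLONGC instructions per pass buys nothing here.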

cgracey · 2012-12-10 07:47

Bill Henning wrote: »
I tried one more test... I was wondering what happens if we don't use REP's, but DJNZD instead:
again	rdquad   pc
'--
ins1    nop	' must be quad-long aligned!
ins2    nop
ins3    nop
ins4    nop
	djnzd	howmany,#again
        add pc,#16	' no PTRA usage, trying to replicate Andy's issue
	nop	' delay slot
As per Andy's suggestion, four adds are followed by four subtracts in the LMM code

Here are the results:
>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000902   '....'
Note the cycle count - looks like DJNZD takes two cycles when taken

I am afraid it looks like RDQUAD has an issue. I hope I am wrong.

Chip, can you take a look at this?

I will be able to look at it in 1/2 hour. In the meantime, can you do a Ctrl-L on your code in PNUT.EXE to check the encoding of the RDQUAD instruction? My test suite checked RDQUAD a few ways, so I'm thinking there might be some other problem, maybe in the assembler. If we do have a hardware bug, it means making a Verilog change and having the synthesis guys respin the synthesized block, which takes a few days. Better to do it now, before the test chip is sent to fab.

cgracey · 2012-12-10 07:50

Bill Henning wrote: »
I tried one more test... I was wondering what happens if we don't use REP's, but DJNZD instead:
again	rdquad   pc
'--
ins1    nop	' must be quad-long aligned!
ins2    nop
ins3    nop
ins4    nop
	djnzd	howmany,#again
        add pc,#16	' no PTRA usage, trying to replicate Andy's issue
	nop	' delay slot
As per Andy's suggestion, four adds are followed by four subtracts in the LMM code

Here are the results:
>n
>2000.2013
02000- 000000FF 000000FF 000000FF 000000FF   '................'
02010- 00000902   '....'
Note the cycle count - looks like DJNZD takes two cycles when taken

I am afraid it looks like RDQUAD has an issue. I hope I am wrong.

Chip, can you take a look at this?

Bill, I see one problem... When you do a delayed jump (DJNZD in this case), you must follow it with THREE instructions, not two.

Bill Henning · 2012-12-10 07:59

Thanks Chip.

But don't worry, I figured it out - you were VERY right earlier to warn me about possible RDQUAD pipeline weirdness, and Ariba's testing was invaluable in producing a better verification.

Here is code that apparently works fine as a RDQUAD LMM2 loop - but the pipeline layout has to be a bit... unusual.

Could you and Ariba check my results?

	reps	#256,#8
	getcnt	start

'-- must be on a XXXXXXX00 address in the cog
ins1	nop
ins2	nop
ins3	nop
ins4	nop
	rdquad	ptra++
ins5	nop
ins6	nop
ins7	nop

	getcnt	stop

I am attaching the full spin file.

Here are the results:

>n
>2000.2013
02000- 00000000 00000000 00000001 00000001   '................'
02010- 00000806   '....'

The 1s are due to the un-primed LMM pipeline.

UPDATE:

It looks like the three delay slot nops (repeating 8 instructions) are necessary; when I tried leaving them out, I got incorrect results once I made the add/subs unbalanced.

Bill Henning · 2012-12-10 08:03

Thanks Chip - it was NOT a problem with RDQUAD, but with me not using its pipelining correctly.

You were absolutely right to warn me about potential pipeline weirdness with RDQUAD... but it looks like I got a layout that works.

I really appreciate your help.

cgracey wrote: »

I will be able to look at it in 1/2 hour. In the meantime, can you do a Ctrl-L on your code in PNUT.EXE to check the encoding of the RDQUAD instruction? My test suite checked RDQUAD a few ways, so I'm thinking there might be some other problem, maybe in the assembler. If we do have a hardware bug, it means making a Verilog change and having the synthesis guys respin the synthesized block, which takes a few days. Better to do it now, before the test chip is sent to fab.

I agree totally - any potential bugs found now can be fixed for far less than having to redo things after the shuttle run, and needing an additional shuttle run.

Fortunately it looks like RDQUAD works fine, as long as the pipeline is arranged to its liking.

Sorry for thinking there might be a bug - it was just the pipeline behaving differently from what I expected.

cgracey · 2012-12-10 08:12

Bill,

I still don't understand why RDQUAD was taking only one clock.

Could it have been that RDQUAD was overshadowed by one of the QUAD registers, making its location into a single-clock instruction?

Bill Henning · 2012-12-10 08:20

I have no idea, but testing shows it likes to take one clock cycle, and it really wants to be followed by exactly seven slots if we want optimal timing.

Take a peek at the layout that appears to work for me in post#51 ... that gives the correct result with Ariba's add/sub test, and also the correct result when I unbalance the add/sub's as follows:

addins1	add	count, #1
addins2	add	count2,#1
addins3	add	count3,#1
addins4	add	count4,#1
subins1	sub	count, #1
subins2 sub	count2,#1
subins3 sub	count3,#1
subins4 sub	count3,#1   ' normally refers to count4 to balance add/subs

The weird thing is that it requires the three delay slots - if I reduce the repeat count to 5, so it does not have the three delay slots, it no longer works when unbalanced.

I think I will try some experiments with multi-cycle instructions thrown into the mix to see if that will work properly.

cgracey wrote: »

Bill,

I still don't understand why RDQUAD was taking only one clock.

Could it have been that RDQUAD was overshadowed by one of the QUAD registers, making its location into a single-clock instruction?

Bill Henning · 2012-12-10 08:35

I made a new torture test for LMM2 using RDQUAD

This version keeps executing the following eight instructions:

i1	add	count, #1
i2	add	count2,#1
i3	add	count3,#1
i4	add	count4,#1
i5	rdlong	lc,result6
i6	add	lc,#1
i7	wrlong	lc,result6
i8	add	count,#1

And here are the results:

>2000.201f
02000- 000000FE 0000007F 00000080 00000080   '................'
02010- 00000FF6 0000007E 00000000 00000000   '....~...........'

The results make sense even though it executed $800 cycles faster than I expected. It makes me think that each hub cycle can do both a hub read and a hub write...

i1-i4 are executed 128 times
i5-i8 are executed 128 times

count is incremented 256 times (less the initial non-priming loss)
count2, count3 and count4 are incremented 127, 128 and 128 times (the 127 is due to non-priming)
lc is incremented 126 times, which also makes sense given the non-priming and the rdlong/add/wrlong being in the second half of the group of 8

I am surprised that the loop did not take around $17F6 cycles; that is why I suspect a hub cycle can involve both a read and a write.

So it looks like RDQUAD based LMM2 can execute both single and multi cycle instructions fine.
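
The expected counts can be worked out with a small model and compared against the dump. This Python sketch assumes the 256 passes alternate strictly between the two quads, starting with i1..i4, and deliberately ignores the un-primed first pass; `ideal_counts` is a hypothetical helper:

```python
from collections import Counter

def ideal_counts(passes):
    """Counter values if every pass executed exactly the quad it should."""
    counts = Counter()
    for p in range(passes):
        if p % 2 == 0:                      # quad holding i1..i4
            for name in ("count", "count2", "count3", "count4"):
                counts[name] += 1
        else:                               # quad holding i5..i8
            counts["lc"] += 1               # rdlong / add / wrlong: net +1
            counts["count"] += 1            # i8
    return counts

# Values read from the monitor dump above
observed = {"count": 0xFE, "count2": 0x7F, "count3": 0x80, "count4": 0x80, "lc": 0x7E}

ideal = ideal_counts(256)
for name, seen in observed.items():
    print(f"{name:7s} ideal {ideal[name]:3d}  observed {seen:3d}  short by {ideal[name] - seen}")

# Gap between the cycle count that was expected and the one measured
print(hex(0x17F6 - 0x0FF6))
```

Every counter is within two of the ideal value, which fits the explanation that only the start-up passes are lost.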

Bill Henning · 2012-12-10 08:42

Andy,

You will be happy to know the new RDQUAD pipe layout works just as well without using PTRA!

Please find attached the non-PTRA version of the lmm2_quadtorture2.spin

Here are the results it gave:

>n
>2000.201f
02000- 000000FE 0000007F 00000080 00000080   '................'
02010- 00000FF6 0000007E 00000000 00000000   '....~...........'

Leaving PTRA available for user code is well worth one delay slot.

cgracey · 2012-12-10 09:41

Bill Henning wrote: »
Andy,

You will be happy to know the new RDQUAD pipe layout works just as well without using PTRA!

Please find attached the non-PTRA version of the lmm2_quadtorture2.spin

Here are the results it gave:
>n
>2000.201f
02000- 000000FE 0000007F 00000080 00000080   '................'
02010- 00000FF6 0000007E 00000000 00000000   '....~...........'
Leaving PTRA available for user code is well worth one delay slot.

Bill, I see a problem with your code in post #51. You do a SETQUAD, then execute those locations before doing a RDQUAD. The QUAD registers are not initialized, so they will bank unknown data into where you have your NOP's, and execute unknown instructions.

Bill Henning · 2012-12-10 09:48

Thanks Chip, I was thinking that the NOP's would execute, and RDQUAD overwrite them - I forgot they were mapped over, instead of copied over.

I will now try priming them...

cgracey wrote: »

Bill, I see a problem with your code in post #51. You do a SETQUAD, then execute those locations before doing a RDQUAD. The QUAD registers are not initialized, so they will bank unknown data into where you have your NOP's, and execute unknown instructions.

cgracey · 2012-12-10 10:05

I just did a test and found that the QUADs are executable starting on the 6th instruction after the RDQUAD:

After a RDQUAD, the QUAD registers can be executed beginning at the 6th instruction:

        SETQUAD #quad0          'quad0 must be a quad-aligned address
        RDQUAD  hubaddress      'read a quad into the QUAD registers which are mapped at quad0..quad3
        NOP                     'do 5 other instructions to allow QUADs to enter execution pipeline
        NOP
        NOP
        NOP
        NOP
quad0   NOP                     'QUAD0..QUAD3 are executable from this point
quad1   NOP
quad2   NOP
quad3   NOP

Bill Henning · 2012-12-10 10:12

Interesting - but that makes it take 10 cycles... i.e. 2 hub cycles

Other than the first iteration executing garbage, why does the version I tried appear to work? Does it being in a REPS loop somehow make it executable in fewer cycles?

Also, why do I get a very weird cycle count when I try to prime the RDQUAD in the attached file?

Very interesting stuff.

Look at the huge cycle count for the primed torture test:

>2000.201f
02000- 00000101 00000081 00000080 00000080   '................'
02010- 024A6AF4 0000007F 00000000 00000000   '.jJ.............'

cgracey · 2012-12-10 10:28

Bill Henning wrote: »

Interesting - but that makes it take 10 cycles... ie 2 hub cycles

Not if you wrap it in a loop. Just make sure the RDQUAD is followed by the QUAD registers with a max of one instruction between the RDQUAD and the mapped QUADs. That way, the QUADs will execute the *previous* RDQUAD longs without one or more of the *recent* RDQUAD longs slipping in and causing woes.
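
The "execute the previous RDQUAD's longs" idea is ordinary software pipelining, and a toy model shows why an un-primed loop comes up one quad short. This Python sketch is a deliberate simplification, not a cycle-accurate model of the cog: `run_pipelined` is a made-up name, and an un-primed first pass is modelled as executing nothing, whereas real hardware would execute whatever the QUAD registers happen to hold.

```python
def run_pipelined(hub_quads, initial_quad=None):
    """Each pass starts a fetch of the next quad and executes the one
    delivered by the previous pass."""
    executed = []
    latched = initial_quad            # contents of the QUAD registers before the loop
    for quad in hub_quads:
        if latched is not None:
            executed.extend(latched)  # run the longs fetched on the previous pass
        latched = quad                # this pass's fetch becomes visible next pass
    return executed

# 256 quads of 4 instructions each, as in the tests earlier in the thread
program = [[f"ins{q}_{i}" for i in range(4)] for q in range(256)]

print(len(run_pipelined(program)))                              # un-primed loop
print(len(run_pipelined(program, initial_quad=["nop"] * 4)))    # primed with NOPs
```

In the un-primed case the last quad fetched is never executed, which is consistent with the 1020 of 1024 instructions reported earlier in the thread.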
