I've spent a few hours trying every variation, but I can't get the QUAD-LMM code to work.
Here is the simplest version:
' Quad LMM
DAT
org 0
mov pc,start
lmm reps #511,#6
setquad #ins1
rdquad pc
ins1 nop 'quad aligned
nop
nop
nop
add pc,#16
jmp #lmm 'after 511 repeats
start long @lmmcode+$0E80
pc long 0
t1 long 0
long 0[4-($-($ & $1FC))] 'quad align
lmmcode setp #2 'toggle pin2 50%
nop
clrp #2
nop
notp #4 'toggle pin4 25%
notp #4
nop
sub pc,#32 'relative jump (delayed)
nop 'the 4 delay instructions
nop
nop
nop
The good thing: I get 40 LMM-MIPS on a 60 MHz Prop2.
The bad thing: this cannot be true!
This code executes only the first 4 LMM instructions, and repeats them endlessly. It never loads the next 4 instructions - the ADD PC,#16 seems to have no effect.
Additionally the LMM loop takes only 6 cycles, so it seems RDQUAD does not wait for its hub window.
Perhaps somebody else can see where my mistake is.
BTW: I find it much easier to check LMM loops by toggling pins and watching them on a scope than by writing to hub memory and checking the results with the monitor.
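To make the "40 LMM-MIPS" figure concrete, here is a small arithmetic sketch in Python. It assumes every slot in the loop takes one clock and that four LMM instructions execute per pass; the function name is made up for illustration. An 8-clock loop, which is what you would expect if RDQUAD waited for its hub window, gives 30 MIPS instead.

```python
def lmm_mips(clock_hz, instructions_per_loop, cycles_per_loop):
    """Throughput in MIPS, assuming the loop repeats back to back."""
    return clock_hz * instructions_per_loop / cycles_per_loop / 1e6

CLOCK_HZ = 60_000_000  # 60 MHz Prop2, as stated in the post

# Observed: 4 LMM instructions per 6-clock loop
print(lmm_mips(CLOCK_HZ, 4, 6))
# Expected if the loop were paced to 8 clocks per pass
print(lmm_mips(CLOCK_HZ, 4, 8))
```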
mov temp, ind
test temp, mask wz
if_nz jmp #interrupt_code
This would give you the ability to look at the port D register for masked interrupt pins. If one of these pins goes high, you would use the Prop ASM instruction that returns the highest bit set in the temp variable above, then use that bit to look up an entry in an interrupt vector table and execute the interrupt handler code.
So... this would give us an up to 32 channel interrupt controller.
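The polling idea can be sketched in Python as a model of the logic only, not of the PASM. The names `poll_interrupts` and `vector_table` are hypothetical, and picking the highest set bit as the priority rule is taken from the description above.

```python
def highest_set_bit(value):
    """Index of the most significant set bit, or -1 when value is 0."""
    return value.bit_length() - 1

def poll_interrupts(pins, mask, vector_table):
    """Dispatch to the handler of the highest pending masked pin, if any."""
    pending = pins & mask & 0xFFFFFFFF  # 32 pins on the port
    if pending == 0:
        return None                     # nothing to service
    channel = highest_set_bit(pending)
    return vector_table[channel]()

# One hypothetical handler per channel, 32 channels in total
vector_table = [lambda ch=ch: f"handler {ch}" for ch in range(32)]

print(poll_interrupts(0b1001_0000, 0b0001_0000, vector_table))
```

Only pins enabled in the mask can trigger a handler, so pin 7 in the example is ignored and channel 4 is serviced.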
Thanks,
This will trash the z flag. Won't that be a problem for the LMM code?
Check post#17 in this thread for a version that appears to work - it gives 30MIPS at 60MHz (ok, four cycles less than 30MIPS)
I will be doing more testing on it, as Chip warned me that there may be some pipelining effects being hidden by fetching the same instruction.
You can download the Spin file for it from post#2
Hello Bill
I did check your code. It's basically the same as mine, and I think it has the same problems.
If I make my loop 8 instructions I also get the right timing, but that is only because the loop now takes 8 cycles, not because the RDQUAD syncs to the hub slots.
If I remove 2 of the 3 delay slots in your code and make the loop 6 instructions, I also get much faster execution.
And you always execute the same instruction, so you cannot tell whether only the first 4 get executed.
I also tried it with PTRA as the program counter, with the same results. I just think the PTRx registers are too valuable to be used as the PC if you have free timing slots to increment the PC anyway.
The top four numbers are the counts for the four LMM instructions that are executed in a loop.
As they all show a count of 255, we know that each "countX" add was executed the correct number of times; and as I changed the pattern, the count should have been uneven if the wrong instructions got executed due to pipeline issues. Due to the pattern used, it is very unlikely that we would get the same result if there were pipeline issues.
The last number is the number of cycles it took to execute the 256 RDQUAD LMM2 cycles.
If I make my loop 8 instructions I also get the right timing, but that is only because the loop now takes 8 cycles, not because the RDQUAD syncs to the hub slots.
If I remove 2 of the 3 delay slots in your code and make the loop 6 instructions, I also get much faster execution.
Now that is strange - the loop should take 8 cycles regardless of the delay slots as RDQUAD should always sync to the hub. Mind you, if it was always reading the same four longs from the hub, perhaps it just picks it up from the local quad cache - which is the only explanation I can think of for the faster execution.
I'll look at this situation, including using a PC register.
And you always execute the same instruction, so you cannot tell whether only the first 4 get executed.
I just ran a test that uses 8 different instructions, where I check the count of how many times each got executed - and it uses them in a different sequence in two different RDQUADS.
I also tried it with PTRA as the program counter, with the same results. I just think the PTRx registers are too valuable to be used as the PC if you have free timing slots to increment the PC anyway.
Please see lmm2_test3.spin in post#2, I tried to replicate your results - and I got the same results you did!
It can run both the 8 cycle and the 6 cycle version, just change 6/8 and un-comment the extra delay slots.
- I have verified that the 8 cycle version takes 2051 cycles to execute 256 loops of 4 instructions
- I have also verified that the 6 cycle version takes 1537 cycles to execute 256 loops of 4 instructions
What is strange is that I use a pattern of 8 instructions that I hoped would reveal any pipeline issues.
The only thing I can think of that would give this result is if RDQUAD was not waiting for the next hub cycle, but simply returned what was in its cache.
I checked your second testcode, and I am now sure you have the same problem.
Your code has a little error in that you have used "addins3,ptra++" two times instead of "addins2,ptra++" and "addins3,ptra++"
So you should get a different result for count2 and count3.
You can also try this modified version of your code, I just replaced the second 4 instructions with subs, so you should get zero or 1 as result, but I get still FF. This means only the first 4 instructions get executed.
'
' rdquad lmm2 test - now working LMM2 loop!
'
' William Henning
' http://Mikronauts.com
'
CON
CLOCK_FREQ = 60_000_000
BAUD = 115_200
DAT
org 0
' make 4096 LMM instructions of "add countX,#1"
loop reps #256,#8
setptra what ' point to 16k buffer of "add count,#1" lmm code
wrlong addins1,ptra++ ' save the add instruction into the hub
wrlong addins2,ptra++ ' save the add instruction into the hub
wrlong addins3,ptra++ ' save the add instruction into the hub
wrlong addins4,ptra++ ' save the add instruction into the hub
wrlong subins4,ptra++ ' save the sub instruction into the hub
wrlong subins3,ptra++ ' save the sub instruction into the hub
wrlong subins2,ptra++ ' save the sub instruction into the hub
wrlong subins1,ptra++ ' save the sub instruction into the hub
' point at start of lmm code
setptra what
' execute 256 LMM instructions
nop
'--
setquad #ins1
reps #256,#8
getcnt start
rdquad ptra++
'--
ins1 nop ' must be quad-long aligned!
ins2 nop
ins3 nop
ins4 nop
nop ' delay slot
nop ' delay slot
nop ' delay slot
getcnt stop
mov cycles,stop
sub cycles,start
wrlong count, result1
wrlong count2,result2
wrlong count3,result3
wrlong count4,result4
wrlong cycles,result5
coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!
monitor_pgm long $70C 'monitor program address
monitor_ptr long 90<<9 + 91 'monitor parameter (conveys tx/rx pins)
addins1 add count, #1
addins2 add count2,#1
addins3 add count3,#1
addins4 add count4,#1
subins1 sub count, #1
subins2 sub count2,#1
subins3 sub count3,#1
subins4 sub count4,#1
start long 0
stop long 0
cycles long 0
count long 0
count2 long 0
count3 long 0
count4 long 0
times long 0
what long 16384
result1 long 8192
result2 long 8196
result3 long 8200
result4 long 8204
result5 long 8208
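Why a result of $FF points to "only the first 4 instructions get executed" can be shown with a toy Python model of the add/sub test. It assumes the first pass runs unprimed (nothing useful in the QUAD registers yet) and that each later pass executes the quad fetched by the previous RDQUAD; `simulate` and `pointer_advances` are names invented for this sketch.

```python
ADD_QUAD = [("count", 1), ("count2", 1), ("count3", 1), ("count4", 1)]
SUB_QUAD = [("count4", -1), ("count3", -1), ("count2", -1), ("count", -1)]

def simulate(loops=256, pointer_advances=True):
    """Return the four counters after `loops` passes of the RDQUAD loop."""
    hub = [ADD_QUAD, SUB_QUAD] * loops   # alternating quads, as written by the test
    counts = dict.fromkeys(["count", "count2", "count3", "count4"], 0)
    quad = []                            # unprimed on the first pass
    ptr = 0
    for _ in range(loops):
        for name, delta in quad:         # execute what the last RDQUAD delivered
            counts[name] += delta
        quad = hub[ptr]                  # the RDQUAD of this pass
        if pointer_advances:
            ptr += 1
    return counts

print(simulate(pointer_advances=True))   # adds and subs nearly cancel
print(simulate(pointer_advances=False))  # same quad every time
```

If new quads really arrive each pass, every counter ends at 1. If the same first quad is executed on every pass, every counter ends at 255, which is the $FF seen in the monitor.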
Note the cycle count - looks like DJNZD takes two cycles when taken
I am afraid it looks like RDQUAD has an issue. I hope I am wrong.
Chip, can you take a look at this?
I will be able to look at it in 1/2 hour. In the meantime, can you do a Ctrl-L on your code in PNUT.EXE to check the encoding of the RDQUAD instruction? My test suite checked RDQUAD a few ways, so I'm thinking there might be some other problem, maybe in the assembler. If we do have a hardware bug, it means making a Verilog change and having the synthesis guys respin the synthesized block, which takes a few days. Better to do it now, before the test chip is sent to fab.
But don't worry, I figured it out - you were VERY right earlier warning me about possible RDQUAD pipeline weirdness, and Ariba's testing was invaluable at generating a better verification.
Here is code that apparently works fine with RDQUAD LMM2 loop - but the pipeline layout has to be a bit... unusual.
Could you and Ariba check my results?
reps #256,#8
getcnt start
'-- must be on a XXXXXXX00 address in the cog
ins1 nop
ins2 nop
ins3 nop
ins4 nop
rdquad ptra++
ins5 nop
ins6 nop
ins7 nop
getcnt stop
It looks like the three delay slot nops / repeating 8 instructions is necessary; when I tried to leave them out, I get incorrect results when I make the add/sub's unbalanced
I will be able to look at it in 1/2 hour. In the meantime, can you do a Ctrl-L on your code in PNUT.EXE to check the encoding of the RDQUAD instruction? My test suite checked RDQUAD a few ways, so I'm thinking there might be some other problem, maybe in the assembler. If we do have a hardware bug, it means making a Verilog change and having the synthesis guys respin the synthesized block, which takes a few days. Better to do it now, before the test chip is sent to fab.
I agree totally - any potential bugs found now can be fixed for far less than having to redo the work after the shuttle run and requiring an additional shuttle run.
Fortunately it looks like RDQUAD works fine, as long as the pipeline is arranged to its liking.
Sorry for thinking there may be a bug - it was just the pipeline behaving differently from what I expected.
I have no idea, but testing shows it likes to take one clock cycle, and it really wants to be followed by exactly seven slots if we want optimal timing.
Take a peek at the layout that appears to work for me in post#51 ... that gives the correct result with Ariba's add/sub test, and also the correct result when I unbalance the add/sub's as follows:
addins1 add count, #1
addins2 add count2,#1
addins3 add count3,#1
addins4 add count4,#1
subins1 sub count, #1
subins2 sub count2,#1
subins3 sub count3,#1
subins4 sub count3,#1 ' normally refers to count4 to balance add/subs
The weird thing is that it requires the three delay slots - if I reduce the repeat count to 5, so it does not have the three delay slots, it no longer works when unbalanced.
I think I will try some experiments with multi-cycle instructions thrown into the mix to see if that will work properly.
The results make sense even though it executed $800 cycles faster than I expected. It makes me think that each hub cycle can do both a hub read and a hub write...
i1-i4 are executed 128 times
i5-i8 are executed 128 times
count1 is incremented 256 times (less initial non-priming loss)
count2,3,4 are incremented 127,128,128 times (the 127 is due to non-priming)
lc is incremented 126 times, which also makes sense due to non-priming, and the rdlong/add/wrlong being in the second group of 8
I am surprised that the loop did not take around $17F6 cycles; that is why I suspect a hub cycle can involve both a read and a write.
So it looks like RDQUAD based LMM2 can execute both single and multi cycle instructions fine
Leaving PTRA available for user code is well worth one delay slot.
Bill, I see a problem with your code in post #51. You do a SETQUAD, then execute those locations before doing a RDQUAD. The QUAD registers are not initialized, so they will bank unknown data into where you have your NOP's, and execute unknown instructions.
I just did a test and found that the QUADs are executable starting on the 6th instruction after the RDQUAD:
After a RDQUAD, the QUAD registers can be executed beginning at the 6th instruction:
SETQUAD #quad0 'quad0 must be a quad-aligned address
RDQUAD hubaddress 'read a quad into the QUAD registers which are mapped at quad0..quad3
NOP 'do 5 other instructions to allow QUADs to enter execution pipeline
NOP
NOP
NOP
NOP
quad0 NOP 'QUAD0..QUAD3 are executable from this point
quad1 NOP
quad2 NOP
quad3 NOP
Interesting - but that makes it take 10 cycles... ie 2 hub cycles
Other than the first iteration executing garbage, why does the version I tried appear to work? Does it being in a REPS loop somehow make it executable in fewer cycles?
Also, why do I get a very weird cycle count when I try to prime the RDQUAD in the attached file?
Very interesting stuff.
Look at the huge cycle count for the primed torture test:
Interesting - but that makes it take 10 cycles... ie 2 hub cycles
Not if you wrap it in a loop. Just make sure the RDQUAD is followed by the QUAD registers with a max of one instruction between the RDQUAD and the mapped QUADs. That way, the QUADs will execute the *previous* RDQUAD longs without one or more of the *recent* RDQUAD longs slipping in and causing woes.
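The "max of one instruction between the RDQUAD and the mapped QUADs" rule can be checked with a toy timing model in Python. It is only a sketch under stated assumptions: every slot takes one clock, the loop is 8 slots long, and freshly read QUAD data becomes executable exactly 6 instructions after the RDQUAD (the figure from the test above). The function names and the `gap` parameter are invented for this sketch.

```python
VISIBLE_AFTER = 6  # QUADs are executable from the 6th instruction after RDQUAD

def generations_seen(gap, loop_len=8, iterations=4):
    """For each pass, which RDQUAD (by pass number) each of the four QUAD slots
    executes, with `gap` instructions between the RDQUAD and quad0.
    -1 means no RDQUAD data has arrived yet (unprimed)."""
    result = []
    for k in range(iterations):
        rd_time = k * loop_len                    # RDQUAD is slot 0 of pass k
        seen = []
        for q in range(4):
            t = rd_time + 1 + gap + q             # time this QUAD slot executes
            seen.append((t - VISIBLE_AFTER) // loop_len)  # newest visible RDQUAD
        result.append(seen)
    return result

def layout_is_consistent(gap, loop_len=8):
    """True when all four QUAD slots of a pass come from the same RDQUAD."""
    return all(len(set(seen)) == 1 for seen in generations_seen(gap, loop_len))

for gap in range(4):
    print(gap, layout_is_consistent(gap), generations_seen(gap)[:2])
```

With a gap of 0 or 1, each pass executes the longs from the previous pass's RDQUAD. With a gap of 2 or more, the last QUAD slots already pick up the newest data, which is the mixing the rule is meant to avoid.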
Comments
Here is the simplest version:
The good thing: I get 40 LMM-MIPS on a 60 MHz Prop2.
The bad thing: this cannot be true!
This code executes only the first 4 LMM instructions, and repeats them endlessly. It never loads the next 4 instructions - the ADD PC,#16 seems to have no effect.
Additionally the LMM loop takes only 6 cycles, so it seems RDQUAD does not wait for its hub window.
Perhaps somebody else can see where my mistake is.
BTW: I find it much easier to check LMM loops by toggling pins and watching them on a scope than by writing to hub memory and checking the results with the monitor.
Andy
Check post#17 in this thread for a version that appears to work - it gives 30MIPS at 60MHz (ok, four cycles less than 30MIPS)
I will be doing more testing on it, as Chip warned me that there may be some pipelining effects being hidden by fetching the same instruction.
You can download the Spin file for it from post#2
You could do a simple:
This would give you the ability to look at the port D register for masked interrupt pins. If one of these pins goes high, you would use the Prop ASM instruction that returns the highest bit set in the temp variable above, then use that bit to look up an entry in an interrupt vector table and execute the interrupt handler code.
So... this would give us an up to 32 channel interrupt controller.
Thanks,
Yep that would work!
I think two instructions would work:
test mask,pinx wz
if_nz jmp #interrupt_handler
This will trash the z flag. Won't that be a problem for the LMM code?
I think both Bill's and Kye's versions need ---- Push Pop wz wc instructions to work properly
mov tmp, pinX
and tmp, mask
tjnz tmp,#int_handler
should not destroy the Z flag
Hello Bill
I did check your code. It's basically the same as mine, and I think it has the same problems.
If I make my loop 8 instructions I also get the right timing, but that is only because the loop now takes 8 cycles, not because the RDQUAD syncs to the hub slots.
If I remove 2 of the 3 delay slots in your code and make the loop 6 instructions, I also get much faster execution.
And you always execute the same instruction, so you cannot tell whether only the first 4 get executed.
I also tried it with PTRA as the program counter, with the same results. I just think the PTRx registers are too valuable to be used as the PC if you have free timing slots to increment the PC anyway.
Andy
In this version, I place 256 copies of the following eight instructions in the hub:
and I execute the RDQUAD based LMM loop 256 times, executing a total of 1020 LMM instructions (the pipe is not primed on the first pass)
(I am attaching lmm2_test2.spin to post#2 so y'all can replicate my results)
After the run, execute the following commands in the monitor to view the results:
n
2000.2013
The top four numbers are the counts for the four LMM instructions that are executed in a loop.
As they all show a count of 255, we know that each "countX" add was executed the correct number of times; and as I changed the pattern, the count should have been uneven if the wrong instructions got executed due to pipeline issues. Due to the pattern used, it is very unlikely that we would get the same result if there were pipeline issues.
The last number is the number of cycles it took to execute the 256 RDQUAD LMM2 cycles.
I agree that it is basically the same code... I just have not seen the problem yet; but I will try many other tests to try to find any problems.
Now that is strange - the loop should take 8 cycles regardless of the delay slots as RDQUAD should always sync to the hub. Mind you, if it was always reading the same four longs from the hub, perhaps it just picks it up from the local quad cache - which is the only explanation I can think of for the faster execution.
I'll look at this situation, including using a PC register.
I just ran a test that uses 8 different instructions, where I check the count of how many times each got executed - and it uses them in a different sequence in two different RDQUADS.
See post#2 for "lmm2_test2.spin"
I would also like to keep PTRA free.
You are definitely on to something!
Please see lmm2_test3.spin in post#2, I tried to replicate your results - and I got the same results you did!
It can run both the 8 cycle and the 6 cycle version, just change 6/8 and un-comment the extra delay slots.
- I have verified that the 8 cycle version takes 2051 cycles to execute 256 loops of 4 instructions
- I have also verified that the 6 cycle version takes 1537 cycles to execute 256 loops of 4 instructions
What is strange is that I use a pattern of 8 instructions that I hoped would reveal any pipeline issues.
The only thing I can think of that would give this result is if RDQUAD was not waiting for the next hub cycle, but simply returned what was in its cache.
I will try more/other tests...
I checked your second testcode, and I am now sure you have the same problem.
Your code has a little error in that you have used "addins3,ptra++" two times instead of "addins2,ptra++" and "addins3,ptra++"
So you should get a different result for count2 and count3.
You can also try this modified version of your code, I just replaced the second 4 instructions with subs, so you should get zero or 1 as result, but I get still FF. This means only the first 4 instructions get executed.
Andy
With three delay slots: 2050 cycles for 1024 LMM2 instructions
With two delay slots: 1795 cycles for 1024 LMM2 instructions
With one delay slot: 1551 cycles for 1024 LMM2 instructions
with 0 delay slots: 1550 cycles for 1024 LMM2 instructions
This made me think of running a 9 instruction REP as a test...
$903 cycles for 1024 LMM2 instructions
10 instruction REP loop $A03 cycles for 1024 LMM2 instructions
15 instruction REP loop $F04 cycles for 1024 LMM2 instructions
20 instruction REP loop $1404 cycles (5124) cycles for 1024 lmm instructions... REP 20 loop * 256
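The measurements above can be broken down per pass with a few lines of Python. The sketch assumes the "three delay slots" run is the 8-instruction REP and that every run used 256 passes; both are read from the posts, not verified independently.

```python
LOOPS = 256

# REP length -> measured total cycles, copied from the results above
measurements = {8: 2050, 9: 0x903, 10: 0xA03, 15: 0xF04, 20: 0x1404}

for rep_len, cycles in measurements.items():
    per_pass, overhead = divmod(cycles, LOOPS)
    print(f"REP {rep_len:2d}: {cycles:4d} cycles = {per_pass} per pass + {overhead} overhead")
```

In every case the cycles per pass equal the REP length. A 9-instruction loop costs 9 clocks per pass rather than being stretched to the next hub window, which fits the conclusion that RDQUAD is not waiting for the hub here.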
Conclusion:
It looks highly probable that RDQUAD does not wait for the next set of data from the hub, but instead uses the data present in its latches.
Test for conclusion:
If I use eight counters, only four should increment
Possible workaround:
I will try using a RDLONG first, then staying in sync with the hub with RDQUADS
UPDATE:
A different order of instructions appears to work, so it looks like it was just pipeline misuse.
Yes, I was able to replicate your results. You are entirely correct.
See my previous post for a lot more tests.
It is looking like RDQUAD only fetches once... very strange, must investigate some more.
(and thanks for finding the error!)
As per Andy's suggestion, four adds are followed by four subtracts in the LMM code
Here are the results:
Note the cycle count - looks like DJNZD takes two cycles when taken
I am afraid it looks like RDQUAD has an issue. I hope I am wrong.
Chip, can you take a look at this?
UPDATE:
Looks like it was just weird pipeline effects... a different layout appears to work!
My test for RDLONGC had a bug, I am happy to report that RDLONGC works fine for LMM2.
UPDATE#1:
Works as well, so at least we can execute $200 LMM instructions in $800 cycles (1/4)
UPDATE#2:
Also works, $300 LMM instructions in $C00 cycles - no better than the above.
UPDATE#3:
Works, but no point to using it - takes $FFE cycles for $400 instructions, same 1/4 efficiency as UPDATE#1
It looks like until we figure out why the RDQUAD version does not work the best efficiency we can get is 25%... but 40MIPS is not a bad start!
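The "1/4" and "25%" figures follow directly from the counts quoted in the three updates. A small Python check (the `efficiency` helper is only illustrative):

```python
def efficiency(lmm_instructions, cycles):
    """Fraction of clock cycles that end up executing an LMM instruction."""
    return lmm_instructions / cycles

results = {
    "UPDATE#1": (0x200, 0x800),
    "UPDATE#2": (0x300, 0xC00),
    "UPDATE#3": (0x400, 0xFFE),
}
for name, (instructions, cycles) in results.items():
    print(name, f"{efficiency(instructions, cycles):.4f}")
```

All three land at about 0.25. The 40 MIPS figure would then correspond to a 160 MHz clock; that clock rate is an inference from the numbers and is not stated in the post.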
I will be able to look at it in 1/2 hour. In the meantime, can you do a Ctrl-L on your code in PNUT.EXE to check the encoding of the RDQUAD instruction? My test suite checked RDQUAD a few ways, so I'm thinking there might be some other problem, maybe in the assembler. If we do have a hardware bug, it means making a Verilog change and having the synthesis guys respin the synthesized block, which takes a few days. Better to do it now, before the test chip is sent to fab.
Bill, I see one problem... When you do a delayed jump (DJNZD in this case), you must follow it with THREE instructions, not two.
But don't worry, I figured it out - you were VERY right earlier warning me about possible RDQUAD pipeline weirdness, and Ariba's testing was invaluable at generating a better verification.
Here is code that apparently works fine with RDQUAD LMM2 loop - but the pipeline layout has to be a bit... unusual.
Could you and Ariba check my results?
I am attaching the full spin file.
Here are the results:
The '1's are due to the un-primed LMM pipeline
UPDATE:
It looks like the three delay slot nops / repeating 8 instructions is necessary; when I tried to leave them out, I get incorrect results when I make the add/sub's unbalanced
You were absolutely right to warn me about potential pipeline weirdness with RDQUAD... but it looks like I got a layout that works.
I really appreciate your help.
I agree totally - any potential bugs found now can be fixed for far less than having to redo the work after the shuttle run and requiring an additional shuttle run.
Fortunately it looks like RDQUAD works fine, as long as the pipeline is arranged to its liking.
Sorry for thinking there may be a bug - it was just the pipeline behaving differently from what I expected.
I still don't understand why RDQUAD was taking only one clock.
Could it have been that RDQUAD was overshadowed by one of the QUAD registers, making its location into a single-clock instruction?
Take a peek at the layout that appears to work for me in post#51 ... that gives the correct result with Ariba's add/sub test, and also the correct result when I unbalance the add/sub's as follows:
The weird thing is that it requires the three delay slots - if I reduce the repeat count to 5, so it does not have the three delay slots, it no longer works when unbalanced.
I think I will try some experiments with multi-cycle instructions thrown into the mix to see if that will work properly.
This version keeps executing the following eight instructions:
And here are the results:
The results make sense even though it executed $800 cycles faster than I expected. It makes me think that each hub cycle can do both a hub read and a hub write...
i1-i4 are executed 128 times
i5-i8 are executed 128 times
count1 is incremented 256 times (less initial non-priming loss)
count2,3,4 are incremented 127,128,128 times (the 127 is due to non-priming)
lc is incremented 126 times, which also makes sense due to non-priming, and the rdlong/add/wrlong being in the second group of 8
I am surprised that the loop did not take around $17F6 cycles; that is why I suspect a hub cycle can involve both a read and a write.
So it looks like RDQUAD based LMM2 can execute both single and multi cycle instructions fine
You will be happy to know the new RDQUAD pipe layout works just as well without using PTRA!
Please find attached the non-PTRA version of the lmm2_quadtorture2.spin
Here are the results it gave:
Leaving PTRA available for user code is well worth one delay slot.
Bill, I see a problem with your code in post #51. You do a SETQUAD, then execute those locations before doing a RDQUAD. The QUAD registers are not initialized, so they will bank unknown data into where you have your NOP's, and execute unknown instructions.
I will now try priming them...
Other than the first iteration executing garbage, why does the version I tried appear to work? Does it being in a REPS loop somehow make it executable in fewer cycles?
Also, why do I get a very weird cycle count when I try to prime the RDQUAD in the attached file?
Very interesting stuff.
Look at the huge cycle count for the primed torture test:
Not if you wrap it in a loop. Just make sure the RDQUAD is followed by the QUAD registers with a max of one instruction between the RDQUAD and the mapped QUADs. That way, the QUADs will execute the *previous* RDQUAD longs without one or more of the *recent* RDQUAD longs slipping in and causing woes.