Not if you wrap it in a loop. Just make sure the RDQUAD is followed by the QUAD registers with a max of one instruction between the RDQUAD and the mapped QUADs. That way, the QUADs will execute the *previous* RDQUAD longs without one or more of the *recent* RDQUAD longs slipping in and causing woes.
Here is my latest experiment:
        reps    #256,#8
        getcnt  start
'-- must be on a XXXXXXX00 address in the cog
ins1    nop
ins2    nop
ins3    nop
ins4    nop
ins5    setquad #ins1
        rdquad  pc
ins6    add     pc,#16
ins7    nop
I just looked at the Verilog code and did some tests to confirm mapped QUAD behavior:
After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:
        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
        NOP                     'do something for at least 3 clocks to allow the QUADs to update
        NOP
        NOP
        CMP     quad0,quad1     'mapped QUADs are now accessible via D and S
After a RDQUAD, mapped QUAD registers can be executed after three clocks and two instructions:
        SETQUAD #quad0          'quad0 must be a quad-aligned address
        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
        NOP                     'do something for at least 3 clocks to allow the QUADs to update
        NOP
        NOP
        NOP                     'do at least 2 instructions to allow QUADs to enter pipeline
        NOP
quad0   NOP                     'QUAD0..QUAD3 are now executable
quad1   NOP
quad2   NOP
quad3   NOP
This is different than what I initially documented. Sorry for the confusion.
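For anyone who wants to sanity-check a filler sequence on paper, here is a minimal Python sketch of the two rules as I read them for the straight-line case only (not the REPS loop case, where the QUADs execute the previous RDQUAD's longs). The function names and the way "three clocks, then two instructions" is counted are my own assumptions, not something taken from the Verilog.

```python
# Assumed reading of the rules above (straight-line code only):
#  - D/S access needs at least 3 clocks after the RDQUAD
#  - execution needs those 3 clocks plus at least 2 further instructions
DATA_READY_CLOCKS = 3
EXTRA_INSTRUCTIONS_FOR_EXEC = 2


def operands_ready(filler_clocks):
    """filler_clocks: clock cost of each instruction placed after the RDQUAD."""
    return sum(filler_clocks) >= DATA_READY_CLOCKS


def execution_ready(filler_clocks):
    """True if the filler covers 3 clocks and then 2 more instructions."""
    elapsed = 0   # clocks counted towards the 3-clock update time
    extra = 0     # instructions issued after the update time has passed
    for cost in filler_clocks:
        if elapsed >= DATA_READY_CLOCKS:
            extra += 1
        else:
            elapsed += cost
    return elapsed >= DATA_READY_CLOCKS and extra >= EXTRA_INSTRUCTIONS_FOR_EXEC


if __name__ == "__main__":
    print(operands_ready([1, 1, 1]))         # three NOPs before the CMP
    print(execution_ready([1, 1, 1, 1, 1]))  # five NOPs before quad0
    print(execution_ready([1, 1, 1, 1]))     # one NOP short
```

The five-NOP case corresponds to the second listing above; anything shorter should be treated as suspect until it has been tried on the FPGA.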
With the new timing information, I was able to get it to work:
' Quad LMM
DAT
        org     0
lmm     reps    #511,#7
        setquad #ins1
        rdquad  pc
        nop
ins1    nop                     'quad aligned
        nop
        nop
        nop
        add     pc,#16
        jmp     #lmm            'after 511 repeats
pc      long    @lmmcode+$0E80
t1      long    0
        long    0[4-($-($ & $1FC))]     'quad align
lmmcode setp    #2              'toggle pin2 25%
        notp    #4
        clrp    #2
        sub     pc,#32          'relative jump (delayed by 1 quad)
        nop
        notp    #4              'toggle pin4 50%
        mul     t1,t1           '2 cycle instr
        nop
                                '< jump happens here
The loop is 7 instructions but takes 8 cycles, so it now also stays in sync with the hub slot.
The unused cycle in the loop leaves room for one two-cycle instruction inside an LMM-Quad packet without slowing the instruction rate (shown here with the MUL instruction).
The problem with Quad-LMM will be the jumps, which are delayed by up to 7 instructions and can only go to a quad-aligned location.
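The cycle budget behind those numbers can be written down as a small Python sketch. It assumes one hub window every 8 clocks (taken from "takes 8 cycles" above) and one clock per ordinary instruction; the constant and function names are made up for illustration.

```python
# Rough cycle budget for the Quad-LMM loop (assumed figures, see lead-in).
HUB_WINDOW_CLOCKS = 8   # clocks per pass through the loop (assumed hub window)
LOOP_INSTRUCTIONS = 7   # instructions inside the REPS block
QUAD_LONGS = 4          # LMM instructions fetched per RDQUAD


def spare_clocks(extra_clocks_in_packet=0):
    """Clocks left per pass after the loop body and any multi-cycle extras."""
    return HUB_WINDOW_CLOCKS - LOOP_INSTRUCTIONS - extra_clocks_in_packet


def clocks_per_lmm_instruction():
    """Effective clocks spent per LMM instruction fetched from hub."""
    return HUB_WINDOW_CLOCKS / QUAD_LONGS


def fraction_of_native():
    """LMM instruction rate relative to one instruction per clock."""
    return QUAD_LONGS / HUB_WINDOW_CLOCKS


if __name__ == "__main__":
    print(spare_clocks())                # spare clock with only 1-cycle instructions
    print(spare_clocks(1))               # one 2-cycle instruction (e.g. MUL) uses it up
    print(clocks_per_lmm_instruction())
    print(fraction_of_native())
```

Under these assumptions a second two-cycle instruction in the same packet would push the loop past the 8-clock window, which is why only one fits without losing the hub sync.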
I'm finding that if the RDQUAD is the first instruction in the REPS block, it doesn't get executed. Have you guys noticed this?
Yes and no.
Before my previous test, my conclusion was that the RDQUAD gets executed only on the first pass through the loop and not afterwards.
But in my code above it now always gets executed, so it must depend on how the RDQUAD timing interacts with the repeat logic.
Do you have the problem Chip mentioned of the cache being random code on the first pass?
I tried to get around that by "priming" the cache.
I'll try to torture your flavor.
Regarding delayed jumps... PropGCC2 will have to schedule instructions very carefully; using RDQUAD essentially turns LMM2 into a VLIW with a fairly deep pipe.
The pain will be worth it - it will be able to *very closely* approach 50% of native mips even without FCACHE etc, and almost 100% with all the tricks :-)
This is weird. It seems to me that the QUAD window is appearing 1 location earlier than it should. If I set it at $014, it begins at $013. I can't understand how this is happening, or if I've just got something wrong in my perception.
Frankly I am seeing all sorts of unusual behaviour, depending on where the RDQUAD is in relation to REPS, the delay slots etc.
I'll need to make some more complex test cases, because strangely enough, the only code that seems to pass my "torture" test is the one where the quad buffer is aligned, and comes *before* the RDQUAD
What was really weird is Ariba's last code passed 5 of the 6 tests, only failing the readlong/add#1/writelong embedded LMM code...
It might be a good idea to hold off on the shuttle run until we figure this out.
To me it looks like mapping the QUAD registers into the register space affects the register right before the first mapped address, so that it is no longer executable.
In my earlier test that register held the RDQUAD, which did not execute correctly. Now I have a NOP at that position and it works, but if I place the ADD PC,#16 there it does not seem to execute correctly.
Andy
Sorry, I read this post of yours only after writing the above:
This is weird. It seems to me that the QUAD window is appearing 1 location earlier than it should. If I set it at $014, it begins at $013. I can't understand how this is happening, or if I've just got something wrong in my perception.
Are you guys seeing this?
Hmm, but all 4 instructions get executed, including the last one, so the QUAD window would then have to be 5 instructions long. But yes, as I wrote above, the instruction before the QUAD window does not execute correctly.
I now think it has nothing to do with the REPS; I see the same thing if I build the loop with jumps.
That's what I found, too, but the register before the QUADs seems to be QUAD0, like everything is shifted up one location.
Yes, you are right, the window is shifted up by 1 address.
If I place a pin-toggle instruction as the last of the 4 quad instructions, it gets executed in addition to the 4 instructions loaded per RDQUAD:
' Quad LMM
DAT
        org     0
lmm     reps    #511,#7
        setquad #ins1
        rdquad  pc
        nop
ins1    nop                     'quad aligned
        nop
        nop
        notp    #2
        add     pc,#16
        jmp     #lmm            'after 511 repeats
pc      long    @lmmcode+$0E80
t1      long    0
        long    0[4-($-($ & $1FC))]     'quad align
lmmcode setp    #2              'toggle pin2 25%
        notp    #4
        clrp    #2
        sub     pc,#32          'relative jump (delayed by 1 quad)
        nop
        notp    #4              'toggle pin4 50%
        mul     t1,t1           '2 cycle instr
        nop
                                '< jump happens here
I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.
I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.
Thanks for discovering this, Bill, Ariba, and any others who've worked on this!
This test also fails for me, it does not execute the WRLONG in i6
Bill,
This was not meant as a new, improved test. It only shows how I tested the shift of the 4 quad longs in cog memory.
I replaced the 4th instruction with a pin toggle so that I can see something on the scope. The 4 instructions that get executed are:
ins1-1, ins1, ins1+1, ins1+2, where ins1 must be quad-aligned.
But that will change soon, when Chip posts the new FPGA files.
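To make the off-by-one easier to talk about, here is a small Python model of the intended window versus what was observed on the current FPGA image. It only restates the observations in this thread (a window set at $014 begins at $013); the function names and the `shift` parameter are hypothetical, and it says nothing about how the fixed Verilog will behave.

```python
QUAD_LONGS = 4  # longs in one QUAD window


def intended_window(base):
    """Cog addresses that SETQUAD #base is meant to map (base quad-aligned)."""
    if base % QUAD_LONGS != 0:
        raise ValueError("SETQUAD base must be quad-aligned")
    return [base + i for i in range(QUAD_LONGS)]


def observed_window(base, shift=-1):
    """Addresses that actually behaved as QUAD0..QUAD3 in the tests here.

    shift=-1 models the reported bug: the window starts one address early.
    """
    return [addr + shift for addr in intended_window(base)]


if __name__ == "__main__":
    print([hex(a) for a in intended_window(0x014)])
    print([hex(a) for a in observed_window(0x014)])
```

With `ins1` as the base, the observed list is ins1-1, ins1, ins1+1, ins1+2, which is why the instruction just before the window misbehaves and the long at ins1+3 still runs from cog memory.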
While you are fixing that, would it be possible to add one more display mode to the Monitor?
It has Byte, Word, and Long. I think one more, Binary, would be useful: it would show a long the way we see it in an instruction listing.
I've thought about this, but there is no more room in the Monitor code. I mean not room for one more instruction. Binary would have been really nice for configuring the I/O pins, too.
My only concern for it is that there is some way to have a loop that takes 8 cycles that reads and executes the four quad registers... because I really want the 2 clock cycle effective LMM2 instructions (for normally single cycle instructions) as that makes a properly optimized PropGCC2 generate code comparable in speed with a 80-120MHz ARM (without FPU).
I was not criticizing... just pointing out that it did not work for some instructions. Frankly, I am glad the issue came up before the test chips - far cheaper to fix it now!
If Quad aligns to first free Long address and next 3 after
I think having to set the QUAD window at a QUAD-aligned address is a big pain. Do you guys agree? It keeps the circuitry simple but causes pain for the programmer.
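For reference, "quad-aligned" here means a long address that is a multiple of four. The Python sketch below checks that, and also evaluates the padding expression from the listings, `long 0[4-($-($ & $1FC))]`, for a given cog address. The helper names are mine; `minimal_padding` is only a suggested alternative for comparison, not something anyone in this thread posted.

```python
QUAD_LONGS = 4


def is_quad_aligned(addr):
    """True if a long address sits on a 4-long boundary."""
    return addr % QUAD_LONGS == 0


def padding_as_posted(addr):
    """Mirrors 4-($-($ & $1FC)) with $ = addr (9-bit cog address)."""
    return 4 - (addr - (addr & 0x1FC))


def minimal_padding(addr):
    """Fewest filler longs needed to reach the next quad boundary."""
    return (-addr) % QUAD_LONGS


if __name__ == "__main__":
    for addr in (0x00C, 0x00D, 0x00E, 0x00F):
        print(hex(addr), padding_as_posted(addr), minimal_padding(addr))
```

Both forms land on a quad boundary. As far as I can tell the posted expression inserts a full four longs when the address is already aligned, which is harmless but costs cog space.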
Comments
Here is my latest experiment:
And the result looks correct:
The difference here is that I don't SETQUAD the cache in until after the NOP's are executed on the first pass, avoiding executing unknown code.
I'll try some more code mixes that I can verify a bit later.
I guess I will need to experiment more to see if it can somehow be squeezed into an 8 or 16 instruction loop.
What worries me is that the loop I have in post #62 may execute a mix of fresh and stale cache data.
I'll think about it for a while, and come up with some test cases I can run on my nano.
I'll try the loop you posted shortly.
Unless I goofed in the attached file, your latest version did not work (for me)
I don't know what the results should look like. I get: .. if I change it so that ins1 is quad-aligned and not ins7
Andy
Thanks - I don't believe I quad-aligned the delay slot... duh. I've been staring at variations of this code too long
Thanks Andy.
I now get the same result you do.
I just realized something. It is not executing the hub instructions in i4 and i6
$2014 should have $0000007F in it (result6)
Are you guys seeing this?
I REALLY enjoyed this (even with the excessive head scratching involved) - plus I am very glad it was caught before the shuttle run.
Can any of you think of any reason why QUAD registers should be assignable to 4 separate addresses, without concern for alignment? Just curious.
Great that you found the reason :thumb:
Andy
From my point of view, the best case would be if the QUAD aligned to the first free long address and the next 3 after it.