I suppose you could map the QUADs up there, but you usually need to surround them with other instructions to facilitate RDQUAD, etc.
I was not thinking of executing them there, but to then have access to them, either for copying to cog ram (as perhaps in overlay loading) or for loading some blocks of data, or just a quad long, to/from cog/hub.
For example, sometimes we pass more than a long between cogs. We have to ensure that a particular long is the last long to be updated because we use this one for the "available" flag (i.e. we don't use locks). Often, those updates are 4 longs or less, so with rdquad and wrquad we can do 4 longs at once. In these cases, there is no need to waste cog space for the mapping if we could use the PINx space.
There may be some very good reasons why this is not possible such as conflicts with the PINx, so I am just explaining the reasons.
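For readers following along, here is a rough host-side sketch (plain Python, not cog code) of the write-the-flag-long-last convention described above; the variable names are made up for illustration:

mailbox = [0, 0, 0, 0]            # three data longs plus one "available" flag

def producer(values):
    mailbox[0], mailbox[1], mailbox[2] = values
    mailbox[3] = 1                # the flag long is written last, so the reader
                                  # never sees partially updated data

def consumer():
    while mailbox[3] == 0:        # poll the "available" flag instead of taking a lock
        pass
    data = mailbox[0:3]
    mailbox[3] = 0                # hand the mailbox back to the producer
    return data

producer((1, 2, 3))
print(consumer())                 # -> [1, 2, 3]

With rdquad/wrquad all four longs move in one hub operation, which is the point of the suggestion: the ordering concern above goes away.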
Bill, without my morning coffee (I don't often drink coffee), here are a few ideas (without much thought)...
1. I wonder if you could use the task switcher to read quads and execute them as a second task? This would at least make the flags available for the LMM load routine.
2. I wonder about going back to using the REPS instruction and/or flags to perform the loop.
3. As long as the LMM instructions in the quad are single cycle, there are no problems. The compiler will need to take special care for multi-cycle instructions - you already do that for calls/jumps, so maybe a similar mechanism will be required here too.
Maybe these suggestions will at least provoke some further thoughts.
I've been trying many variations of the ordering of the LMM code, and the instructions in it since I posted #213
I cannot bust it, and RDLONG from the hub also works now.
It appears to be executing the four instructions in the RDQUAD mapped cache just fine, with the simplest REPS RDLONG loop, which I do not understand, because earlier it was not executing the RDLONG.
I haven't been following this thread too closely, but I wonder if the P2 would be more efficient running pieces of FCACHE code instead of using the LMM method. Small loops would run at the full processor speed. Straight-line code would execute by loading a few quads and then running it. The initial hub stalls and pipeline delays would be averaged over the size of the chunk that is read each time. An optimal chunk size could be determined that works best over a variety of code snippets.
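A quick way to play with Dave's amortization argument (plain Python; the setup and per-quad load costs below are made-up placeholders, not measured P2 numbers):

def clocks_per_instruction(chunk_longs, setup_clocks=16, load_clocks_per_quad=8):
    # load the chunk into cog RAM a quad at a time, then execute it at
    # 1 clock per instruction; the fixed costs get averaged over the chunk size
    quads = (chunk_longs + 3) // 4
    total = setup_clocks + quads * load_clocks_per_quad + chunk_longs
    return total / chunk_longs

for size in (8, 16, 32, 64, 128):
    print(size, round(clocks_per_instruction(size), 2))

The curve flattens as the chunk grows, which is exactly the trade-off behind picking an optimal chunk size.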
I am beginning to understand the RDQUAD/pipeline interaction quite well, and I'd say the loop is working, but it requires careful scheduling.
I've been running more experiments, and it appears that the first slot does get executed, but not until the next loop. I've been executing the test loop repeating 1, 2, 3 times, and it makes what is happening clear. I am now trying some experiments to see if it is feasible to schedule it so it works as expected.
The alternate three-instruction / one-23-bit-constant variant (because that slot is executed sometimes) appears to work fine. I am trying to nail down the exact rules governing 'sometimes'.
If the mapped quad were fully executable just one cycle earlier, there would be no difficulties. C'est la vie.
Bill, how about setting up the first executed instruction set to be something like
i1 add count,#$100 ' should NOT execute
i2 add count2,#$110
i3 add count3,#$120
i4 add count4,#$130
i5 add count4,#$140
This way, you would know if the instructions from the first loop were being executed. Having the same increments doesn't show this up properly.
I ended up doing something similar - I figured out a way to really test the pipeline.
The findings are interesting to say the least... but it looks like it is quite doable for a compiler that understands instruction scheduling - like gcc.
I am going to run some tests with multi-cycle instructions, but what appears to happen is that the instruction that should not be executed DOES get to execute; however, it is deferred.
in memory
addr
addr+4
addr+8
addr+12
execution order
addr+4
addr+8
addr+12
addr (on the next iteration, it executes in the first slot)
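A rough host-side model of the deferral Bill describes (plain Python, not P2 code; addresses are arbitrary and the quad size is taken as 4 longs = 16 bytes):

def execution_order(iterations, base=0):
    order = []
    deferred = None                   # the first slot of the previously loaded quad
    for k in range(iterations):
        addr = base + 16 * k
        if deferred is not None:
            order.append(deferred)    # it finally executes, in the first slot
        order.extend([addr + 4, addr + 8, addr + 12])
        deferred = addr               # slot 1 of this quad is pushed to the next pass
    return order

print(execution_order(2))             # [4, 8, 12, 0, 20, 24, 28]

which matches the order listed above.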
From what you are seeing, it is as expected. i1 does not execute, as you correctly assumed, because it has not filled; the next loop has i1 left over from the previous rdquad.
I think you just need aspirin and coffee.
What may be better is to preset the ix instructions with a rdquad, quickly followed by another rdquad within the reps loop, immediately followed by the ix data. Something like...
setquaz #i1
rdquad pc 'prefill i1-i4
reps #500,#6
add pc,#16 'not part of reps
rdquad pc
i1 nop
i2 nop
i3 nop
i4 nop
add pc,#16
Then, provided i1-i4 do not take >5 clocks, all should be fine.
(sorry - it's impossible to correct typos on my Android/Xoom on this forum)
(but as Andy pointed out, Chip beat me to it in post#177, which I somehow missed reading)
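Restating Cluso's prefill idea as a tiny model (plain Python; it only tracks which fetch each iteration executes, not the clock-level timing):

# one RDQUAD before the loop, then one at the top of every iteration;
# i1..i4 always execute the longs brought in by the *previous* fetch,
# provided they finish before the new fetch lands (the ">5 clocks" caveat)
fetches = ["prefill"] + [f"loop fetch {k}" for k in range(1, 4)]
for k in range(1, 4):
    print(f"iteration {k}: executes {fetches[k - 1]}, issues {fetches[k]}")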
I had to add more result registers, and repeat the LMM2 loop 1,2,3+++ times in sequence to experiment with the pipeline.
Right now, it is looking like the following is possible:
- For simple instructions, 4 instructions per 8 clocks
- RDxxx/WRxxx must be in the last of the four slots (otherwise incorrect results)
- as the cache is mapped in to be executed, you cannot use RDxxxC
- until their minimum position is determined, assume all multi-cycle instructions belong in the last slot
Here is the LMM2 loop:
setquaz #op0
reps #257,#8 ' since we know first iteration just primes the cache
getcnt start
rdquad ptra++
op0 nop ' we deliberately map into this spot to execute next iteration
op1 nop
op2 nop
op3 nop ' this is the problematic spot with previous LMM2 attempt
op4 nop ' normally executable from here, ops above not executable
op5 nop
op6 nop
getcnt stop
The results above would not be possible unless 257 iterations of the LMM2 loop were executed. Any pipeline error would disturb the results (just move the hub access to see it).
I said earlier that it was impossible.
Looks like the forum gods may prove me wrong.
I am attaching the code for this test, please try it, and try any other LMM code you can think of. My gut says it will all work, as long as multi-cycle instructions are placed as the last instruction in each quad.
I suspect that delayed jumps will work if placed in the first slot, I intend to test that in a while... but I need a break now.
Your input & feedback is MUCH appreciated.
Bill
p.s.
No need to use ptra! See lmm2_pipeline_explorer4.spin above
Bill
So you just figured out something that Chip said a few hours ago ;-) (post #177)
I think this Quad-LMM is a bit too inflexible to be used as a general LMM approach. It's not only the multi-cycle instructions; jumps and load-constants are also hard to do if you have a PC which only increments every quad and not every instruction. The compiler needs to arrange the code in quad packets, which can get inefficient.
So I will concentrate more on an rdlongc version with 1/4 clockfreq, which should be easier to use and not that much slower.
Andy
Edit: Perhaps your example No. 4 in the first post may solve some issues with the Quad-LMM we have now. With two alternating load-and-execute quad locations, the multi-cycle instructions cannot affect the newly loaded instructions in the other quad block.
I can't change the timing of RDxxxx. That's kind of set, at this point.
I think one thing you could do would be to put the RDLONG at the 4th (last) location, and have three single-cycle instructions in front of it. This would inhibit the problem.
I missed that message - I just looked, and you are right!
I don't really think it is too inflexible; if you think about it, even the Prop1 GCC has to schedule hub access, etc.
So far, the rule seems very simple - hub access instructions must be in the last slot of four, any single cycle instruction can be in any of the three slots.
GCC supports such rules in its machine definition files, and many existing VLIW architectures have *MUCH* more complex rules.
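As an illustration of how little the rule amounts to (plain Python standing in for whatever form it would really take in a GCC machine description; the mnemonic list is just an assumption for the sketch):

HUB_OPS = {"rdbyte", "wrbyte", "rdword", "wrword", "rdlong", "wrlong", "rdquad", "wrquad"}

def quad_bundle_ok(bundle):
    # a bundle is the 4 mnemonics destined for one RDQUAD'd quad;
    # hub-access instructions may only occupy the last slot
    return len(bundle) == 4 and all(op not in HUB_OPS for op in bundle[:3])

print(quad_bundle_ok(["add", "sub", "mov", "rdlong"]))   # True
print(quad_bundle_ok(["rdlong", "add", "sub", "mov"]))   # False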
I think the RDLONGC version is a decent approach, but due to the delay before an instruction is executable I think it is more like 1/6 clock frequency (I'll try it and report the results) instead of the 1/2 clkfreq of LMM2 (the RDQUAD version).
Bill
I took a look back in time (in this thread) and noticed that in post#9 Chip suggested the exact loop I am now using in post#224 (to replace the complicated ones I thought I might have to use).
I tried it after he posted it, but it did not seem to work correctly - which turned out to be due to the issue Chip found in the Verilog file.
Argh! If I had re-tried it after Chip fixed it, I (and a lot of you) could have saved a lot of time, but I don't think it was wasted time - I learned a LOT about how the pipelining works on Prop2.
I wish to thank everyone, especially Chip, for all of your feedback, suggestions, help and your experiments - it works now, and it is FAST!
I'll experiment some more, to make sure other long instructions work well, but no more today, or wifey will kill me.
This may be completely wrong, but I think this does 3 one clock instructions per iteration and hits the hub window each time.
again reps #511,#6
nop ' delay slot
rdlongc in1,ptra++ ' 1 or 3 clocks each time once synced (only one of the 3 takes 3 clocks each time around), also ptra++ will advance PC by 4 since we are using rdlongc
rdlongc in2,ptra++ ' 1 or 3 clock
rdlongc in3,ptra++ ' 1 or 3 clock
in1 nop
in2 nop
in3 nop
jmp #again ' jump back in if the reps breaks due to jmp/call or whatever
It requires using PTRA as the PC, which I think works, just means you can't use PTRA in LMM code.
Thinking about it more, the four executable instructions would require waiting for a second hub cycle, and as such it will take 16 cycles (assuming single-cycle instructions), resulting in a 4/16 effective rate.
Argh, can't test it until tomorrow morning or wifey will kill me!
I think you saw my post before editing, it's only 3 executed instructions per iteration. So a 3 per 8 clocks rate.
Not sure how you count 12 clocks. Once primed, RDLONGC will take 3 clocks when it hits the hub, then 1 clock for 3 subsequent calls. Once primed the loop has an interesting pattern.
It takes 8 clocks each for 3 iterations, then the 4th one takes 6 clocks (all 3 of its rdlongc's take 1 clock), then on the 5th iteration the first rdlongc takes 5 clocks, but you are in sync with the hub again (since you finished early and just waited 2 clocks for it), and you repeat this pattern.
The net clock count for the 5 iterations is 40.
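Re-tallying Roy's numbers (plain Python, using only the per-access costs he states: 3 clocks for the rdlongc that hits the hub, 5 for the re-syncing one, 1 otherwise, and 1 clock per executed instruction):

iterations = [
    [3, 1, 1, 1, 1, 1],   # iterations 1-3: the first rdlongc hits the hub
    [3, 1, 1, 1, 1, 1],
    [3, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],   # iteration 4: all three rdlongc's come out of the cached quad
    [5, 1, 1, 1, 1, 1],   # iteration 5: the first rdlongc waits 2 extra clocks to re-sync
]
print([sum(i) for i in iterations], sum(map(sum, iterations)))   # [8, 8, 8, 6, 10] 40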
Also, since there is no mapping for quad registers over cog registers (rdlongc writes to the actual cog register), you don't have aliasing issues with overlapping. So you can run normal LMM code without special ordering or grouping, as long as that code doesn't use the QUAD registers or PTRA.
See the first or second experiment at the top of this thread, I tried them and reported on them on the first page.
I could only fit three RDLONGC's and three single cycle instructions into a single hub cycle.
Of the code you proposed, only two of the fetched instructions would fit within the first hub cycle, the other two would incur a second 8 clock hub cycle.
(plus I have a foggy memory of trying what you are asking about a couple of days ago, and it taking two hub cycles)
I'll retest tomorrow as soon as wifey leaves for work :-)
Many other changes will be required for PropGCC2, but I don't have time to get into that until Ken gives me the go-ahead to work on PropGCC2.
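For reference, the effective fetch/execute rates quoted so far in the thread (plain Python; these are the figures claimed above, not new measurements):

variants = {
    "LMM2 (RDQUAD mapped quad)":        (4, 8),    # Bill: 4 instructions per 8 clocks
    "RDLONGC, 3 fetched + 3 executed":  (3, 8),    # Roy/Andy: 3 per 8 clocks
    "RDLONGC, 4 fetched per group":     (4, 16),   # Bill: 4/16 due to the second hub cycle
}
for name, (instr, clocks) in variants.items():
    print(f"{name}: {instr}/{clocks} = {instr / clocks:.3f} instructions per clock")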
Bill,
My example is just 3 instructions per hub cycle. I first posted the variant with 4, but quickly corrected it. You must have seen that one and not rechecked the post for the version it has now.
I think something that works more like Prop1 LMM will be easier to get up and running quickly for Prop2GCC, then later we can explore these special case variants.
Here is a 3 instruction per 8 cycle LMM code which has the benefit that PTRA always points to the next instruction. This makes fjmp, fcall and load-long much simpler than with the pipelined Quad-LMM.
It takes $13FE clock cycles for 256 instructions compared to $1007 cycles for Bill's latest Quad-LMM results.
The nice thing: It has absolutely no problems with rdlongs and wrlongs in LMM code at any position.
CON
CLOCK_FREQ = 60_000_000
BAUD = 115_200
DAT
org 0
' make 4096 LMM instructions of "add countX,#1"
loop reps #256,#8
setptra what ' point to 16k buffer of "add count,#1" lmm code
wrlong i1,ptra++ ' save the add instruction into the hub
wrlong i2,ptra++ ' save the add instruction into the hub
wrlong i3,ptra++ ' save the add instruction into the hub
wrlong i4,ptra++ ' save the add instruction into the hub
wrlong i5,ptra++ ' save the add instruction into the hub
wrlong i6,ptra++ ' save the add instruction into the hub
wrlong i7,ptra++ ' save the add instruction into the hub
wrlong i8,ptra++ ' save the add instruction into the hub
' point at start of lmm code
setptra what
mov pc,what
getcnt start
'--
' execute 256 LMM instructions
' Andy's 3 instr per 8 cycle LMM with ptra always = instr+1:
lmm reps #341,#8
nop
rdlongc ins1,ptra++ 'fetch ins1
rdlongc ins2,ptra++ 'fetch ins2
rdlongc ins3,ptra-- 'fetch ins3, point to ins1+1
ins1 nop 'execute ins1
addptra #4 'point to ins2+1
ins2 nop 'execute ins2
addptra #4 'point to ins3+1
ins3 nop 'execute ins3
getcnt stop
mov cycles,stop
sub cycles,start
wrlong count, result1
wrlong count2,result2
wrlong count3,result3
wrlong count4,result4
wrlong cycles,result5
coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!
monitor_pgm long $70C 'monitor program address
monitor_ptr long 90<<9 + 91 'monitor parameter (conveys tx/rx pins)
i1 add count, #1
i2 add count2,#1
i3 add count3,#1
i4 add count4,#1
i5 rdlong lc,result6
i6 add lc,#1
i7 wrlong lc,result6
i8 add count, #1
{
i5 sub count, #1
i6 sub count2,#1
i7 sub count3,#1
i8 sub count3,#1
}
pc long 0
howmany long 0
start long 0
stop long 0
cycles long 0
count long 0
count2 long 0
count3 long 0
count4 long 0
times long 0
what long 16384
result1 long $2000
result2 long $2004
result3 long $2008
result4 long $200C
result5 long $2010
result6 long $2014
lc long 0
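To see why PTRA always ends up pointing at the next LMM instruction in Andy's loop, here is a one-iteration trace (plain Python; byte addresses start at 0 for convenience):

ptra = 0                          # points at the long that feeds ins1
fetch1, ptra = ptra, ptra + 4     # rdlongc ins1,ptra++
fetch2, ptra = ptra, ptra + 4     # rdlongc ins2,ptra++
fetch3, ptra = ptra, ptra - 4     # rdlongc ins3,ptra--
print(f"ins1 (fetched from {fetch1}) runs with ptra = {ptra}")    # 4  = address of ins2
ptra += 4                         # addptra #4
print(f"ins2 (fetched from {fetch2}) runs with ptra = {ptra}")    # 8  = address of ins3
ptra += 4                         # addptra #4
print(f"ins3 (fetched from {fetch3}) runs with ptra = {ptra}")    # 12 = first long of the next group

So each executed slot sees PTRA exactly one long past its own address, which is what makes fjmp, fcall and load-long straightforward.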
Bill,
Have you considered the idea of using RDLONGC to make a LMM loop instead? Where the first one takes the long hit, but the next 3 are 1 cycle? They don't rely on the mapping thing, and so might have less problems with timing causing aliasing.
It might not be as simple and small, but could work more reliably with less restrictions on the code being fed through it.
Agreed! But it has surprised me how much mileage you and others have gotten out of the DE0 single cog emulation. What a remarkable thing!
I've been working night and day on an unrelated project. Now that it's nearly finished, I'm looking forward to dusting off my DE0 and torturing my own small part of the P2 instruction set.
Using rdlongc like that will not yield an 8 cycle loop. You only get an 8 cycle loop once in 5 iterations. Most of the iterations one of the rdlongc's will take at least 3 clocks.
We need to get Prop2GCC up and running ASAP. We can make it faster and better in time, but Parallax really needs to ship a good set of tools including cross platform spin2/pasm2 compiler and prop2gcc with the prop2 as soon as it's available. Forcing a massive amount of extra compiler work on the GCC side just to support some fancy new LMM variant seems like something that could happen after we have a working version.
I was working on the 3ins/1arg version, and accidentally ran a test with an op in the arg slot... and it seemed to work!
I restored the torture test, including initializing the hub variable being incremented... and it now works.
Here are the results, attached is the source - can someone else verify the results please? Thanks in advance.
If the results are verified... maybe it was the forum gods, because I declared it impossible?
1 - I think it would slow things down too much
2 - I am back to REPS, but I am avoiding the flags so they are available for LMM code
3 - true, very very careful scheduling should work, BUT
As my latest post shows, it seems to be working again... I am hoping others will be able to replicate my results!
I cannot bust it, and RDLONG from the hub also works now.
It appears to be executing the four instructions in the RDQUAD mapped cache just fine, with the simplest REPS RDLONG loop, which I do not understand, because earlier it was not executing the RDLONG.
It is short and copies N*4 longs in 8 cycles per long (once synced to the hub with the first access)
Pleased you have the loop working now. Will just wait for some confirmations.
Just looked through your code. I think you should use SETQUAZ #arg instead of setquad #arg just to make sure there is nothing in the quad cache.
I ended up doing something similar - I figured out a way to really test the pipeline.
The findings are interesting to say the least... but it looks like it is quite doable for a compiler that understands instruction scheduling - like gcc.
I am going to run some tests with multi-cycle instructions, but what appears to happen is that the instruction that should not be executed DOES get to execute, however it is deferred.
in memory: addr, addr+4, addr+8, addr+12
execution order: addr+4, addr+8, addr+12, then addr in the first slot of the next iteration
(a longer example, laid out the same way as memory order versus execution order, followed here)
Assuming other instructions follow the same pattern, this is pretty easy to schedule in the compiler!
Now I will check to see how more complex instructions (multi-cycle ones) will affect code generation
I suspect that if they are not placed in the "moving" slot everything will be fine - but I need to test this
I've modified my test sample to make pipeline tests easier
Update:
Above instruction schedule works for single cycle instructions, but not yet on hub access.
I need a break from pipeline fighting...
Here are the results for the LMM2 loop above:
The debug results are written into the hub as follows:
Result9 is the number of cycles used
Result10 is the hub variable, pre-loaded with $180 on every run
As you can see:
count1 is $100 (added to in both quads)
count2 is $80
count3 is $80
count4 is $80
lc is $200 ($80 plus the $180 initial value)
The LMM code that is repeated in the hub 2048 times is in the attached source.
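A quick arithmetic check on those numbers (plain Python; this assumes the repeated LMM block follows the i1..i8 pattern shown in Andy's listing above - count incremented twice per block, count2/count3/count4 once each, and lc read-modify-written once):

blocks = 0x80                   # count2/count3/count4 each ended at $80 blocks
count  = 2 * blocks             # count is bumped in both quads of the block
lc     = 0x180 + blocks         # lc starts at $180 and gains 1 per block
print(hex(count), hex(lc))      # 0x100 0x200 - matching the values reported above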
I wish I had not missed Chip's post #177 - it would have saved me a lot of work.
You are correct, it works!
It requires using PTRA as the PC, which I think works, just means you can't use PTRA in LMM code.
Roy
p.s. haven't actually tried it, can one of you?
I'll try it tomorrow, I can't right now as wifey has dragged me out of my lab for tonight.
Looking at it, it looks like it would take 12 clock cycles when executing four single-cycle instructions, which would mean a 4/12 efficiency ratio.
It would also require almost the same changes as RDQUAD to PropGCC.
I don't think it is a good idea to fall back to the simple way as soon as we run into problems.
If everyone thought that way, we would never make any new design that is better.
As I said in my previous post to you, we need to learn a lot before we can discover ALL the possibilities,
and all the info you can provide gives that learning a push forward.
I agree with what Bill said.
But I'm not surprised at how fast things are going.
Roy
The pursuit of a smoking-fast LMM2 led to finding an error in the Verilog code for RDQUAD, and to Chip making changes to allow more flexibility in the placement of the quad cache, so it has already shown very real benefits.
Once Bill, Ariba and others have worked out the various options then they can be reviewed with the GCC team to see what makes sense.
It sounds like GCC has good facilities for handling out of order pipelining so it may not be too difficult once the various LMM techniques for the Prop2 have been worked out.
C.W.