I suppose you could map the QUADs up there, but you usually need to surround them with other instructions to facilitate RDQUAD, etc.
I was not thinking of executing them there, but to then have access to them, either for copying to cog ram (as perhaps in overlay loading) or for loading some blocks of data, or just a quad long, to/from cog/hub.
For example, sometimes we pass more than a long between cogs. We have to ensure that a particular long is the last long to be updated because we use this one for the "available" flag (i.e. we don't use locks). Often, those updates are 4 longs or less, so with rdquad and wrquad we can do 4 longs at once. In these cases, there is no need to waste cog space for the mapping if we could use the PINx space.
There may be some very good reasons why this is not possible such as conflicts with the PINx, so I am just explaining the reasons.
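For readers following along, here is a rough host-side sketch (plain Python, not cog code) of the write-the-flag-long-last convention described above; the variable names are made up for illustration:

mailbox = [0, 0, 0, 0]            # three data longs plus one "available" flag

def producer(values):
    mailbox[0], mailbox[1], mailbox[2] = values
    mailbox[3] = 1                # the flag long is written last, so the reader
                                  # never sees partially updated data

def consumer():
    while mailbox[3] == 0:        # poll the "available" flag instead of taking a lock
        pass
    data = mailbox[0:3]
    mailbox[3] = 0                # hand the mailbox back to the producer
    return data

producer((1, 2, 3))
print(consumer())                 # -> [1, 2, 3]

With rdquad/wrquad all four longs move in one hub operation, which is the point of the suggestion: the ordering concern above goes away.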
Bill, without my morning coffee (I don't often drink coffee), here are a few ideas (without much thought)...
1. I wonder if you could use the task switcher to read quads and execute them as a second task? This would at least make the flags available for the LMM load routine.
2. I wonder about going back to using the REPS instruction and/or flags to perform the loop.
3. As long as the LMM instructions in the quad are single cycle, there are no problems. The compiler will need to take special care for multi-cycle instructions - you already do that for calls/jumps, so maybe a similar mechanism will be required here too.
Maybe these suggestions will at least provoke some further thoughts.
I've been trying many variations of the ordering of the LMM code, and the instructions in it since I posted #213
I cannot bust it, and RDLONG from the hub also works now.
It appears to be executing the four instructions in the RDQUAD mapped cache just fine, with the simplest REPS RDLONG loop, which I do not understand, because earlier it was not executing the RDLONG.
I haven't been following this thread too closely, but I wonder if the P2 would be more efficient running pieces of FCACHE code instead of using the LMM method. Small loops would run at the full processor speed. Straight-line code would execute by loading a few quads and then running it. The initial hub stalls and pipeline delays would be averaged over the size of the chunk that is read each time. An optimal chunk size could be determined that works best over a variety of code snippets.
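A quick way to play with Dave's amortization argument (plain Python; the setup and per-quad load costs below are made-up placeholders, not measured P2 numbers):

def clocks_per_instruction(chunk_longs, setup_clocks=16, load_clocks_per_quad=8):
    # load the chunk into cog RAM a quad at a time, then execute it at
    # 1 clock per instruction; the fixed costs get averaged over the chunk size
    quads = (chunk_longs + 3) // 4
    total = setup_clocks + quads * load_clocks_per_quad + chunk_longs
    return total / chunk_longs

for size in (8, 16, 32, 64, 128):
    print(size, round(clocks_per_instruction(size), 2))

The curve flattens as the chunk grows, which is exactly the trade-off behind picking an optimal chunk size.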
I am beginning to understand the RDQUAD/pipeline interaction quite well, and I'd say the loop is working, but it requires careful scheduling.
I've been running more experiments, and it appears that the first slot does get executed, but not until the next loop. I've been executing the test loop repeating 1, 2, 3 times, and it makes what is happening clear. I am now trying some experiments to see if it is feasible to schedule it so it works as expected.
The alternate three-instruction / one-23-bit-constant variant (because that slot is executed sometimes) appears to work fine. I am trying to nail down the exact rules governing 'sometimes'.
If the mapped quad were fully executable just one cycle earlier, there would be no difficulties. C'est la vie.
Bill, how about setting up the first executed instruction set to be something like
i1 add count,#$100 ' should NOT execute
i2 add count2,#$110
i3 add count3,#$120
i4 add count4,#$130
i5 add count4,#$140
This way, you would know if the instructions from the first loop were being executed. Having the same increments doesn't show this up properly.
I ended up doing something similar - I figured out a way to really test the pipeline.
The findings are interesting to say the least... but it looks like it is quite doable for a compiler that understands instruction scheduling - like gcc.
I am going to run some tests with multi-cycle instructions, but what appears to happen is that the instruction that should not be executed DOES get to execute; however, it is deferred.
in memory
addr
addr+4
addr+8
addr+12
execution order
addr+4
addr+8
addr+12
addr (on the next iteration, it executes in the first slot)
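A rough host-side model of the deferral Bill describes (plain Python, not P2 code; addresses are arbitrary and the quad size is taken as 4 longs = 16 bytes):

def execution_order(iterations, base=0):
    order = []
    deferred = None                   # the first slot of the previously loaded quad
    for k in range(iterations):
        addr = base + 16 * k
        if deferred is not None:
            order.append(deferred)    # it finally executes, in the first slot
        order.extend([addr + 4, addr + 8, addr + 12])
        deferred = addr               # slot 1 of this quad is pushed to the next pass
    return order

print(execution_order(2))             # [4, 8, 12, 0, 20, 24, 28]

which matches the order listed above.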
From what you are seeing, it is as expected. i1 does not execute, as you correctly assumed, because it has not filled; the next loop has i1 left over from the previous rdquad.
I think you just need aspirin and coffee.
What may be better is to preset the ix instructions with a rdquad, quickly followed by another rdquad within the reps loop, immediately followed by the ix data. Something like...
setquaz #i1
rdquad pc 'prefill i1-i4
reps #500,#6
add pc,#16 'not part of reps
rdquad pc
i1 nop
i2 nop
i3 nop
i4 nop
add pc,#16
Then, provided i1-i4 do not take >5 clocks, all should be fine.
(sorry - it's impossible to correct typos on my Android/Xoom on this forum)
(but as Andy pointed out, Chip beat me to it in post#177, which I somehow missed reading)
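Restating Cluso's prefill idea as a tiny model (plain Python; it only tracks which fetch each iteration executes, not the clock-level timing):

# one RDQUAD before the loop, then one at the top of every iteration;
# i1..i4 always execute the longs brought in by the *previous* fetch,
# provided they finish before the new fetch lands (the ">5 clocks" caveat)
fetches = ["prefill"] + [f"loop fetch {k}" for k in range(1, 4)]
for k in range(1, 4):
    print(f"iteration {k}: executes {fetches[k - 1]}, issues {fetches[k]}")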
I had to add more result registers, and repeat the LMM2 loop 1,2,3+++ times in sequence to experiment with the pipeline.
Right now, it is looking like the following is possible:
- For simple instructions, 4 instructions per 8 clocks
- RDxxx/WRxxx must be in the last of the four slots (otherwise incorrect results)
- as the cache is mapped in to be executed, you cannot use RDxxxC
- until their minimum position is determined, assume all multi-cycle instructions belong in the last slot
Here is the LMM2 loop:
setquaz #op0
reps #257,#8 ' since we know first iteration just primes the cache
getcnt start
rdquad ptra++
op0 nop ' we deliberately map into this spot to execute next iteration
op1 nop
op2 nop
op3 nop ' this is the problematic spot with previous LMM2 attempt
op4 nop ' normally executable from here, ops above not executable
op5 nop
op6 nop
getcnt stop
The results above would not be possible unless 257 iterations of the LMM2 loop were executed. Any pipeline error would disturb the results (just move the hub access to see it).
I said earlier that it was impossible.
Looks like the forum gods may prove me wrong.
I am attaching the code for this test, please try it, and try any other LMM code you can think of. My gut says it will all work, as long as multi-cycle instructions are placed as the last instruction in each quad.
I suspect that delayed jumps will work if placed in the first slot, I intend to test that in a while... but I need a break now.
Your input & feedback is MUCH appreciated.
Bill
p.s.
No need to use ptra! See lmm2_pipeline_explorer4.spin above
Bill
So you just figured out something that Chip said a few hours ago ;-) (post #177)
I think this Quad-LMM is a bit too inflexible to be used as a general LMM approach. It's not only the multi-cycle instructions; jumps and load-constants are also hard to do if you have a PC which only increments every quad and not every instruction. The compiler needs to arrange the code in quad packets, which can get inefficient.
So I will concentrate more on an rdlongc version with 1/4 clockfreq, which should be easier to use and not that much slower.
Andy
Edit: Perhaps your example No. 4 in the first post may solve some issues with the Quad-LMM we have now. With two alternating load-and-execute quad locations, the multi-cycle instructions cannot affect the newly loaded instructions in the other quad block.
I can't change the timing of RDxxxx. That's kind of set, at this point.
I think one thing you could do would be to put the RDLONG at the 4th (last) location, and have three single-cycle instructions in front of it. This would inhibit the problem.
I missed that message - I just looked, and you are right!
I don't really think it is too inflexible; if you think about it, even the Prop1 GCC has to schedule hub access, etc.
So far, the rule seems very simple - hub access instructions must be in the last slot of four, any single cycle instruction can be in any of the three slots.
GCC supports such rules in its machine definition files, and many existing VLIW architectures have *MUCH* more complex rules.
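As an illustration of how little the rule amounts to (plain Python standing in for whatever form it would really take in a GCC machine description; the mnemonic list is just an assumption for the sketch):

HUB_OPS = {"rdbyte", "wrbyte", "rdword", "wrword", "rdlong", "wrlong", "rdquad", "wrquad"}

def quad_bundle_ok(bundle):
    # a bundle is the 4 mnemonics destined for one RDQUAD'd quad;
    # hub-access instructions may only occupy the last slot
    return len(bundle) == 4 and all(op not in HUB_OPS for op in bundle[:3])

print(quad_bundle_ok(["add", "sub", "mov", "rdlong"]))   # True
print(quad_bundle_ok(["rdlong", "add", "sub", "mov"]))   # False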
I think the RDLONGC version is a decent approach, but due to the delay before an instruction is executable I think it is more like 1/6 clock frequency (I'll try it and report the results) instead of the 1/2 clkfreq of LMM2 (the RDQUAD version).
Bill
I took a look back in time (in this thread) and noticed that in post#9 Chip suggested the exact loop I am now using in post#224 (to replace the complicated ones I thought I might have to use).
I tried it after he posted it, but it did not seem to work correctly - which turned out to be due to the issue Chip found in the Verilog file.
Argh! If I had re-tried it after Chip fixed it, I (and a lot of you) could have saved a lot of time, but I don't think it was wasted time - I learned a LOT about how the pipelining works on Prop2.
I wish to thank everyone, especially Chip, for all of your feedback, suggestions, help and your experiments - it works now, and it is FAST!
I'll experiment some more, to make sure other long instructions work well, but no more today, or wifey will kill me.
This may be completely wrong, but I think this does 3 one clock instructions per iteration and hits the hub window each time.
again reps #511,#6
nop ' delay slot
rdlongc in1,ptra++ ' 1 or 3 clocks each time once synced (only one of the 3 takes 3 clocks each time around), also ptra++ will advance PC by 4 since we are using rdlongc
rdlongc in2,ptra++ ' 1 or 3 clock
rdlongc in3,ptra++ ' 1 or 3 clock
in1 nop
in2 nop
in3 nop
jmp #again ' jump back in if the reps breaks due to jmp/call or whatever
It requires using PTRA as the PC, which I think works, just means you can't use PTRA in LMM code.
Thinking about it more, the four executable instructions would require waiting for a second hub cycle, and as such it will take 16 cycles (assuming single-cycle instructions), resulting in a 4/16 effective rate.
Argh, can't test it until tomorrow morning or wifey will kill me!
I think you saw my post before editing, it's only 3 executed instructions per iteration. So a 3 per 8 clocks rate.
Not sure how you count 12 clocks. Once primed, RDLONGC will take 3 clocks when it hits the hub, then 1 clock for 3 subsequent calls. Once primed the loop has an interesting pattern.
It takes 8 clocks each for 3 iterations, then the 4th one takes 6 clocks (all 3 of its rdlongc's take 1 clock), then on the 5th iteration the first rdlongc takes 5 clocks, but you are in sync with the hub again (since you finished early and just waited 2 clocks for it), and you repeat this pattern.
The net clock count for the 5 iterations is 40.
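Re-tallying Roy's numbers (plain Python, using only the per-access costs he states: 3 clocks for the rdlongc that hits the hub, 5 for the re-syncing one, 1 otherwise, and 1 clock per executed instruction):

iterations = [
    [3, 1, 1, 1, 1, 1],   # iterations 1-3: the first rdlongc hits the hub
    [3, 1, 1, 1, 1, 1],
    [3, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],   # iteration 4: all three rdlongc's come out of the cached quad
    [5, 1, 1, 1, 1, 1],   # iteration 5: the first rdlongc waits 2 extra clocks to re-sync
]
print([sum(i) for i in iterations], sum(map(sum, iterations)))   # [8, 8, 8, 6, 10] 40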
Also, since there is no mapping for quad registers over cog registers (rdlongc writes to the actual cog register), you don't have aliasing issues with overlapping. So you can run normal LMM code without special ordering or grouping, as long as that code doesn't use the QUAD registers or PTRA.
See the first or second experiment at the top of this thread, I tried them and reported on them on the first page.
I could only fit three RDLONGC's and three single cycle instructions into a single hub cycle.
Of the code you proposed, only two of the fetched instructions would fit within the first hub cycle, the other two would incur a second 8 clock hub cycle.
(plus I have a foggy memory of trying what you are asking about a couple of days ago, and it taking two hub cycles)
I'll retest tomorrow as soon as wifey leaves for work :-)
Many other changes will be required for PropGCC2, but I don't have time to get into that until Ken gives me the go-ahead to work on PropGCC2.
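For reference, the effective fetch/execute rates quoted so far in the thread (plain Python; these are the figures claimed above, not new measurements):

variants = {
    "LMM2 (RDQUAD mapped quad)":        (4, 8),    # Bill: 4 instructions per 8 clocks
    "RDLONGC, 3 fetched + 3 executed":  (3, 8),    # Roy/Andy: 3 per 8 clocks
    "RDLONGC, 4 fetched per group":     (4, 16),   # Bill: 4/16 due to the second hub cycle
}
for name, (instr, clocks) in variants.items():
    print(f"{name}: {instr}/{clocks} = {instr / clocks:.3f} instructions per clock")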
Bill,
My example is just 3 instructions per hub cycle. I first posted the variant with 4, but quickly corrected it. You must have seen that one and not rechecked the post for the version it has now.
I think something that works more like Prop1 LMM will be easier to get up and running quickly for Prop2GCC, then later we can explore these special case variants.
Here is a 3 instruction per 8 cycle LMM code which has the benefit that PTRA always points to the next instruction. This makes fjmp, fcall and load-long much simpler than with the pipelined Quad-LMM.
It takes $13FE clock cycles for 256 instructions compared to $1007 cycles for Bill's latest Quad-LMM results.
The nice thing: It has absolutely no problems with rdlongs and wrlongs in LMM code at any position.
CON
CLOCK_FREQ = 60_000_000
BAUD = 115_200
DAT
org 0
' make 4096 LMM instructions of "add countX,#1"
loop reps #256,#8
setptra what ' point to 16k buffer of "add count,#1" lmm code
wrlong i1,ptra++ ' save the add instruction into the hub
wrlong i2,ptra++ ' save the add instruction into the hub
wrlong i3,ptra++ ' save the add instruction into the hub
wrlong i4,ptra++ ' save the add instruction into the hub
wrlong i5,ptra++ ' save the add instruction into the hub
wrlong i6,ptra++ ' save the add instruction into the hub
wrlong i7,ptra++ ' save the add instruction into the hub
wrlong i8,ptra++ ' save the add instruction into the hub
' point at start of lmm code
setptra what
mov pc,what
getcnt start
'--
' execute 256 LMM instructions
' Andy's 3 instr per 8 cycle LMM with ptra always = instr+1:
lmm reps #341,#8
nop
rdlongc ins1,ptra++ 'fetch ins1
rdlongc ins2,ptra++ 'fetch ins2
rdlongc ins3,ptra-- 'fetch ins3, point to ins1+1
ins1 nop 'execute ins1
addptra #4 'point to ins2+1
ins2 nop 'execute ins2
addptra #4 'point to ins3+1
ins3 nop 'execute ins3
getcnt stop
mov cycles,stop
sub cycles,start
wrlong count, result1
wrlong count2,result2
wrlong count3,result3
wrlong count4,result4
wrlong cycles,result5
coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!
monitor_pgm long $70C 'monitor program address
monitor_ptr long 90<<9 + 91 'monitor parameter (conveys tx/rx pins)
i1 add count, #1
i2 add count2,#1
i3 add count3,#1
i4 add count4,#1
i5 rdlong lc,result6
i6 add lc,#1
i7 wrlong lc,result6
i8 add count, #1
{
i5 sub count, #1
i6 sub count2,#1
i7 sub count3,#1
i8 sub count3,#1
}
pc long 0
howmany long 0
start long 0
stop long 0
cycles long 0
count long 0
count2 long 0
count3 long 0
count4 long 0
times long 0
what long 16384
result1 long $2000
result2 long $2004
result3 long $2008
result4 long $200C
result5 long $2010
result6 long $2014
lc long 0
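To see why PTRA always ends up pointing at the next LMM instruction in Andy's loop, here is a one-iteration trace (plain Python; byte addresses start at 0 for convenience):

ptra = 0                          # points at the long that feeds ins1
fetch1, ptra = ptra, ptra + 4     # rdlongc ins1,ptra++
fetch2, ptra = ptra, ptra + 4     # rdlongc ins2,ptra++
fetch3, ptra = ptra, ptra - 4     # rdlongc ins3,ptra--
print(f"ins1 (fetched from {fetch1}) runs with ptra = {ptra}")    # 4  = address of ins2
ptra += 4                         # addptra #4
print(f"ins2 (fetched from {fetch2}) runs with ptra = {ptra}")    # 8  = address of ins3
ptra += 4                         # addptra #4
print(f"ins3 (fetched from {fetch3}) runs with ptra = {ptra}")    # 12 = first long of the next group

So each executed slot sees PTRA exactly one long past its own address, which is what makes fjmp, fcall and load-long straightforward.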
Bill,
Have you considered the idea of using RDLONGC to make a LMM loop instead? Where the first one takes the long hit, but the next 3 are 1 cycle? They don't rely on the mapping thing, and so might have less problems with timing causing aliasing.
It might not be as simple and small, but could work more reliably with less restrictions on the code being fed through it.
Agreed! But it has surprised me how much mileage you and others have gotten out of the DE0 single cog emulation. What a remarkable thing!
I've been working night and day on an unrelated project. Now that it's nearly finished, I'm looking forward to dusting off my DE0 and torturing my own small part of the P2 instruction set.
Using rdlongc like that will not yield an 8 cycle loop. You only get an 8 cycle loop once in 5 iterations. Most of the iterations one of the rdlongc's will take at least 3 clocks.
We need to get Prop2GCC up and running ASAP. We can make it faster and better in time, but Parallax really needs to ship a good set of tools including cross platform spin2/pasm2 compiler and prop2gcc with the prop2 as soon as it's available. Forcing a massive amount of extra compiler work on the GCC side just to support some fancy new LMM variant seems like something that could happen after we have a working version.
I was working on the 3ins/1arg version, and accidentally ran a test with an op in the arg slot... and it seemed to work!
I restored the torture test, including initializing the hub variable being incremented... and it now works.
Here are the results, attached is the source - can someone else verify the results please? Thanks in advance.
If the results are verified... maybe it was the forum gods, because I declared it impossible?
1 - I think it would slow things down too much
2 - I am back to REPS, but I am avoiding the flags so they are available for LMM code
3 - true, very very careful scheduling should work, BUT
As my latest post shows, it seems to be working again... I am hoping others will be able to replicate my results!
I cannot bust it, and RDLONG from the hub also works now.
It appears to be executing the four instructions in the RDQUAD mapped cache just fine, with the simplest REPS RDLONG loop, which I do not understand, because earlier it was not executing the RDLONG.
It is short and copies N*4 longs in 8 cycles per long (once synced to the hub with the first access)
Pleased you have the loop working now. Will just wait for some confirmations.
Just looked through your code. I think you should use SETQUAZ #arg instead of setquad #arg just to make sure there is nothing in the quad cache.
I ended up doing something similar - I figured out a way to really test the pipeline.
The findings are interesting to say the least... but it looks like it is quite doable for a compiler that understands instruction scheduling - like gcc.
I am going to run some tests with multi-cycle instructions, but what appears to happen is that the instruction that should not be executed DOES get to execute, however it is deferred.
in memory: addr, addr+4, addr+8, addr+12
execution order: addr+4, addr+8, addr+12, then addr in the first slot of the next iteration
(a longer example, laid out the same way as memory order versus execution order, followed here)
Assuming other instructions follow the same pattern, this is pretty easy to schedule in the compiler!
Now I will check to see how more complex instructions (multi-cycle ones) will affect code generation
I suspect that if they are not placed in the "moving" slot everything will be fine - but I need to test this
I've modified my test sample to make pipeline tests easier
Update:
Above instruction schedule works for single cycle instructions, but not yet on hub access.
I need a break from pipeline fighting...
Here are the results for the LMM2 loop above:
The debug results are written into the hub as follows:
Result9 is the number of cycles used
Result10 is the hub variable, pre-loaded with $180 on every run
As you can see:
count1 is $100 (added to in both quads)
count2 is $80
count3 is $80
count4 is $80
lc is $200 ($80 plus the $180 initial value)
The LMM code that is repeated in the hub 2048 times is in the attached source.
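A quick arithmetic check on those numbers (plain Python; this assumes the repeated LMM block follows the i1..i8 pattern shown in Andy's listing above - count incremented twice per block, count2/count3/count4 once each, and lc read-modify-written once):

blocks = 0x80                   # count2/count3/count4 each ended at $80 blocks
count  = 2 * blocks             # count is bumped in both quads of the block
lc     = 0x180 + blocks         # lc starts at $180 and gains 1 per block
print(hex(count), hex(lc))      # 0x100 0x200 - matching the values reported above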
I wish I had not missed Chip's post #177 - it would have saved me a lot of work.
You are correct, it works!
It requires using PTRA as the PC, which I think works, just means you can't use PTRA in LMM code.
Roy
p.s. haven't actually tried it, can one of you?
I'll try it tomorrow, I can't right now as wifey has dragged me out of my lab for tonight.
Looking at it, it looks like it would take 12 clock cycles when executing four single-cycle instructions, which would mean a 4/12 efficiency ratio.
It would also require almost the same changes as RDQUAD to PropGCC.
I don't think it is a good idea to fall back to the simple way as soon as we run into problems.
If everyone thought that way, we would never make any new design that is better.
As I said in my previous post to you, we need to learn a lot before we can discover ALL the possibilities,
and all the info you can provide gives that learning a push forward.
I agree with what Bill said.
But I'm not surprised at how fast things are going.
Roy
The pursuit of a smoking-fast LMM2 led to finding an error in the Verilog code for RDQUAD, and to Chip making changes to allow more flexibility in the placement of the quad cache, so it has already shown very real benefits.
Once Bill, Ariba and others have worked out the various options then they can be reviewed with the GCC team to see what makes sense.
It sounds like GCC has good facilities for handling out of order pipelining so it may not be too difficult once the various LMM techniques for the Prop2 have been worked out.
C.W.