
LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)


Comments

  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-11 14:46
    cgracey wrote: »
    I suppose you could map the QUADs up there, but you usually need to surround them with other instructions to facilitate RDQUAD, etc.
    I was not thinking of executing them there, but of having access to them, either for copying to cog RAM (as in overlay loading, perhaps), or for loading blocks of data, or just a single quad of longs, to/from cog/hub RAM.
    For example, sometimes we pass more than one long between cogs. We have to ensure that a particular long is the last one to be updated, because we use it as the "available" flag (i.e. we don't use locks). Often those updates are 4 longs or less, so with rdquad and wrquad we can do all 4 longs at once. In these cases there is no need to waste cog space for the mapping if we could use the PINx space instead.
    There may be some very good reasons why this is not possible, such as conflicts with the PINx registers, so I am just explaining my reasoning.
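    For what it's worth, here is a rough sketch of that mailbox idea, assuming SETQUAD maps the quad cache over four cog registers and WRQUAD/RDQUAD move all four longs in a single hub window. The names (msg, buf, boxadr, val0..val2) are placeholders, and the spacer nops allow for the cache-fill delay discussed further down this thread:

    ' writer cog: publish four longs in one hub window; the flag travels with the data
    	setquad	#msg		' map the quad cache over msg..msg+3
    	mov	msg+0,val0
    	mov	msg+1,val1
    	mov	msg+2,val2
    	mov	msg+3,#1	' the "available" flag is the last long of the same quad
    	wrquad	boxadr		' boxadr holds the hub address of the mailbox
    
    ' reader cog: poll until the flag long is set; the other three longs arrive with it
    	setquad	#buf
    poll	rdquad	boxadr
    	nop			' allow the cache to fill (see the pipeline discussion below)
    	nop
    	cmp	buf+3,#0  wz
    if_z	jmp	#poll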
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 14:53
    Ok, now my mind is thoroughly confused.

    I was working on the 3ins/1arg version, and accidentally ran a test with an op in the arg slot... and it seemed to work!

    I restored the torture test, including initializing the hub variable being incremented... and it now works

    Here are the results, attached is the source - can someone else verify the results please? Thanks in advance.

    >n
    >2000.201f
    02000- 00000080 00000080 00000080 000000FF   '................'
    02010- 00000FFC 00000200 00000000 00000000   '................'
    

    If the results are verified... maybe it was the forum gods, because I declared it impossible?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 14:56
    Thank you - I appreciate the suggestions!

    1 - I think it would slow things down too much

    2 - I am back to REPS, but I am avoiding the flags so they are available for LMM code

    3 - true, very very careful scheduling should work, BUT

    As my latest post shows, it seems to be working again... I am hoping others will be able to replicate my results!
    Cluso99 wrote: »
    Bill, without my morning coffee (I don't often drink coffee ;) ), here are a few ideas (without much thought)...

    1. I wonder if you could use the task switcher to readquads and execute them as a second task ??? This would at least make the flags available for the LMM load routine.

    2. I wonder about going back to using the REPS instruction and/or flags to perform the loop.

    3. As long as the LMM instructions in the quad are single cycle, there are no problems. The compiler will need to take special care for multi-cycle instructions - you already do that for calls/jumps, so maybe a similar mechanism will be required here too.

    Maybe these suggestions will at least provoke some further thoughts.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 15:08
    I've been trying many variations of the ordering of the LMM code, and of the instructions in it, since I posted #213.

    I cannot bust it, and RDLONG from the hub also works now.

    It appears to be executing the four instructions in the RDQUAD mapped cache just fine, with the simplest REPS RDLONG loop, which I do not understand, because earlier it was not executing the RDLONG.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 15:11
    I think Chip gave this sample in another thread...
       REPS  #100,#4
       nop
       RDLONGC inda++,ptra++
       RDLONGC inda++,ptra++
       RDLONGC inda++,ptra++
       RDLONGC inda++,ptra++
    

    It is short and copies N*4 longs at 8 cycles per quad of 4 longs (once synced to the hub with the first access)
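    (For reference, the loop assumes INDA and PTRA have been pointed at the cog destination and hub source first, something like the two lines below; setinda as the INDA setup instruction is an assumption on my part, and dest/hubsrc are placeholders.)
       setinda  #dest          ' cog address the longs get copied to
       setptra  hubsrc         ' hub address to copy from (hubsrc is a register holding the address)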
    Cluso99 wrote: »
    I was not thinking of executing them there, but of having access to them, either for copying to cog RAM (as in overlay loading, perhaps), or for loading blocks of data, or just a single quad of longs, to/from cog/hub RAM.
    For example, sometimes we pass more than one long between cogs. We have to ensure that a particular long is the last one to be updated, because we use it as the "available" flag (i.e. we don't use locks). Often those updates are 4 longs or less, so with rdquad and wrquad we can do all 4 longs at once. In these cases there is no need to waste cog space for the mapping if we could use the PINx space instead.
    There may be some very good reasons why this is not possible, such as conflicts with the PINx registers, so I am just explaining my reasoning.
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-11 15:26
    Thanks Bill. Yes I had forgotten that one.
    Pleased you have the loop working now. Will just wait for some confirmations.
  • Dave HeinDave Hein Posts: 6,347
    edited 2012-12-11 15:33
    I haven't been following this thread too closely, but I wonder if the P2 would be more efficient running pieces of FCACHE code instead of using the LMM method. Small loops would run at the full processor speed. Straight-line code would execute by loading a few quads and then running it. The initial hub stalls and pipeline delays would be averaged over the size of the chunk that is read each time. An optimal chunk size could be determined that works best over a variety of code snippets.
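    In rough outline (untested; the labels, the setinda setup instruction and the chunk-size constant are placeholders), such an FCACHE-style loader could reuse the REPS/RDLONGC block copy quoted above and then jump straight into the copied code:

    fcache_load
    	setinda	#fcache_area	' cog RAM area reserved for cached code
    	setptra	chunk_addr	' hub address of the compiled chunk (a register)
    	reps	#CHUNK_QUADS,#4
    	nop			' REPS delay slot
    	rdlongc	inda++,ptra++
    	rdlongc	inda++,ptra++
    	rdlongc	inda++,ptra++
    	rdlongc	inda++,ptra++
    	jmp	#fcache_area	' run the chunk at full speed; it ends with a jump back to the LMM kernel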
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-11 15:35
    Bill,
    Just looked through your code. I think you should use SETQUAZ #arg instead of setquad #arg just to make sure there is nothing in the quad cache.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 15:39
    Thanks Cluso!

    I am beginning to understand the RDQUAD/pipeline interaction quite well, and I'd say the loop is working, but requires careful scheduling

    I've been running more experiments, and it appears that the first slot does get executed, but not until the next loop. I've been executing the test loop repeating 1, 2 and 3 times, and that makes what is happening clear. I am now trying some experiments to see if it is feasible to schedule it so it works as expected.

    The alternate three instruction / one 23-bit constant layout (because that slot is executed sometimes) appears to work fine. I am trying to nail down the exact rules governing 'sometimes'.

    If the mapped quad were fully executable just one cycle earlier, there would be no difficulties. C'est la vie.
    Cluso99 wrote: »
    Thanks Bill. Yes I had forgotten that one.
    Pleased you have the loop working now. Will just wait for some confirmations.
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-11 16:00
    Bill, how about setting up the first executed instruction set to be something like
    i1 add count,#$100 ' should NOT execute
    i2 add count2,#$110
    i3 add count3,#$120
    i4 add count4,#$130
    i5 add count4,#$140

    This way, you would know if the instructions from the first loop were being executed. Having the same increments doesn't show this up properly.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 16:16
    Thanks Cluso.

    I ended up doing something similar - I figured out a way to really test the pipeline.

    The findings are interesting, to say the least... but it looks like it is quite doable for a compiler that understands instruction scheduling - like gcc.

    I am going to run some tests with multi-cycle instructions, but what appears to happen is that the instruction that should not be executed DOES get to execute; however, it is deferred.

    in memory
    addr
    addr+4
    addr+8
    addr+12
    

    execution order
    addr+4
    addr+8
    addr+12
    *addr*        on the next iteration, it executes in the first slot
    

    a longer example

    memory
    i1
    i2
    i3
    i4
    i5
    i6
    i7
    i8
    i9
    i10
    i11
    i12
    

    execution order
    i2
    i3
    i4
    *i1*
    i6
    i7
    i8
    *i5*
    i10
    i11
    i12
    *i9*   - following quad cycle
    

    Assuming other instructions follow the same pattern, this is pretty easy to schedule in the compiler!
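    For instance (my reading of the pattern above, single-cycle instructions only, and ignoring the very first quad, which just primes the cache): if the logical program order is a b c d e f g h, the compiler simply rotates each quad so that the instruction which logically comes last sits in the deferred first slot:

    memory order        executes as
    d  a  b  c          a  b  c  d
    h  e  f  g          e  f  g  h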

    Now I will check to see how more complex instructions (multi-cycle ones) will affect code generation

    I suspect that if they are not placed in the "moving" slot everything will be fine - but I need to test this

    I've modified my test sample to make pipeline tests easier

    Update:

    The above instruction schedule works for single-cycle instructions, but not yet for hub access.

    I need a break from pipeline fighting...
    Cluso99 wrote: »
    Bill, how about setting up the first executed instruction set to be something like
    i1 add count,#$100 ' should NOT execute
    i2 add count2,#$110
    i3 add count3,#$120
    i4 add count4,#$130
    i5 add count4,#$140

    This way, you would know if the instructions from the first loop were being executed. Having the same increments doesn't show this up properly.
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-11 17:18
    From what you are seeing, it is as expected. i1 does not execute, as you correctly assumed, because it has not filled; the next loop has i1 left over from the previous rdquad.
    I think you just need aspirin and coffee ;)

    What may be better is to preset the ix instructions with a rdquad, quickly followed by another rdquad within the reps loop, immediately followed by the ix data. Something like...

    setquaz #i1
    rdquad pc 'prefill i1-i4
    reps #500,#6
    add pc,#16 'not part of reps
    rdquad pc
    i1 nop
    i2 nop
    i3 nop
    i4 nop
    add pc,#16

    then providing i1-i4 do not take >5 clocks all should be fine.

    (sorry - my android/xoom is impossible to correct typos on this forum)
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 17:32
    I think I have it figured out.

    (but as Andy pointed out, Chip beat me to it in post#177, which I somehow missed reading)

    I had to add more result registers, and repeat the LMM2 loop 1,2,3+++ times in sequence to experiment with the pipeline.

    Right now, it is looking like the following is possible:

    - For simple instructions, 4 instructions per 8 clocks
    - RDxxx/WRxxx must be in the last of the four slots (otherwise incorrect results)
    - as the cache is mapped in to be executed, you cannot use RDxxxC
    - until their minimum position is determined, assume all multi-cycle instructions belong in the last slot

    Here is the LMM2 loop:
    	setquaz	#op0
    	reps	#257,#8 ' since we know first iteration just primes the cache
    	getcnt	start
    
    	rdquad	ptra++
    op0	nop	' we deliberately map into this spot to execute next iteration
    op1	nop
    op2	nop
    op3	nop     ' this is the problematic spot with previous LMM2 attempt
    op4	nop     ' normally executable from here, ops above not executable
    op5	nop
    op6	nop
    
    	getcnt	stop
    


    Here is the result:
    >n
    >2000.2027
    02000- 00000100 00000080 00000080 00000080   '................'
    02010- 00000000 00000000 00000000 00000000   '................'
    02020- 00001007 00000200   '........'
    


    The debug results are written into the hub as follows:
    >n
    >2000.2027
    02000-  result1  result2  result3  result4
    02010-  result5  result6  result7  result8
    02020-  result9  result10
    


    Result9 is the number of cycles used
    Result10 is a hub variable pre-loaded with $180 on every run

    As you can see above (257 REPS iterations, with the first one just priming the cache, means 256 executed quads = 1024 LMM instructions = 128 passes over the 8-instruction block below):

    count1 is $100 (incremented twice per block, i.e. 256 times)
    count2 is $80 (incremented once per block, 128 times)
    count3 is $80
    count4 is $80

    lc is $200 ($80 of increments on top of the $180 initial value)

    Here is the LMM code that is repeated in the hub 2048 times
    i1	add	count1,#1
    i2	add	count2,#1
    i3	add	count3,#1
    i4	rdlong	lc,result10
    
    i5	add	count4,#1
    i6	add	lc,#1
    i7	add	count1,#1
    i8	wrlong	lc,result10
    


    The results above would not be possible unless 257 iterations of the LMM2 loop were executed. Any pipeline error would disturb the results (just move the hub access to see it).

    I said earlier that it was impossible.

    Looks like the forum gods may prove me wrong.

    I am attaching the code for this test, please try it, and try any other LMM code you can think of. My gut says it will all work, as long as multi-cycle instructions are placed in the last slot of each quad.

    I suspect that delayed jumps will work if placed in the first slot; I intend to test that in a while... but I need a break now.

    Your input & feedback is MUCH appreciated.

    Bill

    p.s.

    No need to use ptra! See lmm2_pipeline_explorer4.spin above
  • AribaAriba Posts: 2,690
    edited 2012-12-11 18:11
    Bill
    So you just figured out something that Chip said a few hours ago ;-) (post #177)

    I think this Quad-LMM is a bit too inflexible to be used as a general LMM approach. It's not only the multi-cycle instructions; the jumps and load-constants are also hard to do if you have a pc which only increments every quad and not every instruction. The compiler needs to arrange the code in quad packets, which can get inefficient.
    So I will concentrate more on a rdlongc version with 1/4 clockfreq, which should be easier to use and not that much slower.

    Andy

    Edit: Perhaps your example No. 4 in the first post may solve some issues with the Quad-LMM we have now. With two alternating load-and-execute quad locations, the multi-cycle instructions cannot affect the newly loaded instructions in the other quad block.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 18:33
    Sorry Chip I missed this!

    I wish I did not miss it, it would have saved me a lot of work.

    You are correct, it works!
    cgracey wrote: »
    I can't change the timing of RDxxxx. That's kind of set, at this point.

    I think one thing you could do would be to put the RDLONG at the 4th (last) location, and have three single-cycle instructions in front of it. This would inhibit the problem.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 18:39
    I missed that message - I just looked, and you are right!

    I don't really think it is too inflexible; if you think about it, even the Prop1 GCC has to schedule hub access etc.

    So far, the rule seems very simple: hub access instructions must be in the last slot of four, and any single-cycle instruction can be in any of the other three slots.
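    For instance, a quad obeying that rule (register names are just placeholders) looks like:

    i1	add	count1,#1	' single-cycle ops fill the first three slots
    i2	mov	temp,count2
    i3	add	count3,#1
    i4	wrlong	count1,resptr	' the hub access sits in the last slot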

    GCC supports such rules in its machine definition files, and many existing VLIW architectures have *MUCH* more complex rules.

    I think the RDLONGC version is a decent approach, but due to the delay before an instruction is executable I think it is more like 1/6 clock frequency (I'll try it and report the results) instead of the 1/2 clkfreq of LMM2 (the RDQUAD version).
    Ariba wrote: »
    Bill
    So you just figured out something that Chip said a few hours ago ;-) (post #177)

    I think this Quad-LMM is a bit too inflexible to be used as a general LMM approach. It's not only the multi-cycle instructions; the jumps and load-constants are also hard to do if you have a pc which only increments every quad and not every instruction. The compiler needs to arrange the code in quad packets, which can get inefficient.
    So I will concentrate more on a rdlongc version with 1/4 clockfreq, which should be easier to use and not that much slower.

    Andy

    Edit: Perhaps your example No. 4 in the first post may solve some issues with the Quad-LMM we have now. With two alternating load-and-execute quad locations, the multi-cycle instructions cannot affect the newly loaded instructions in the other quad block.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 18:59
    I took a look back in time (in this thread) and noticed that in post#9 Chip suggested the exact loop I am now using in post#224 (to replace the complicated ones I thought I might have to use).

    I tried it after he posted it, but it did not seem to work correctly - which turned out to be due to the issue Chip found in the Verilog file.

    Argh! If I had re-tried it after Chip fixed it, I (and a lot of you) could have saved a lot of time, but I don't think it was wasted time - I learned a LOT about how the pipelining works on Prop2.

    I wish to thank everyone, especially Chip, for all of your feedback, suggestions, help and your experiments - it works now, and it is FAST!

    I'll experiment some more, to make sure other long instructions work well, but no more today, or wifey will kill me.
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-11 20:13
    This may be completely wrong, but I think this does 3 one clock instructions per iteration and hits the hub window each time.
    again    reps       #511,#6
             nop                         ' delay slot
    
             rdlongc in1,ptra++  ' 1 or 3 clocks each time once synced (only one of the 3 takes 3 clocks each time around), also ptra++ will advance PC by 4 since we are using rdlongc
             rdlongc in2,ptra++  ' 1 or 3 clock
             rdlongc in3,ptra++  ' 1 or 3 clock
    in1      nop 
    in2      nop 
    in3      nop 
    
             jmp #again             ' jump back in if the reps breaks due to jmp/call or whatever
    

    It requires using PTRA as the PC, which I think works, just means you can't use PTRA in LMM code.

    Roy

    p.s. haven't actually tried it, can one of you?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 20:27
    Hi Roy,

    I'll try it tomorrow, I can't right now as wifey has dragged me out of my lab for tonight.

    Looking at it, it looks like it would take 12 clock cycles when executing four single-cycle instructions, which would mean a 4/12 efficiency ratio.

    It would also require almost the same changes as RDQUAD to PropGCC.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 20:34
    Thinking about it more, the four executable instructions would require waiting for a second hub cycle, and as such it will take 16 cycles (assuming single-cycle instructions), resulting in a 4/16 effective rate.

    Argh, can't test it until tomorrow morning or wifey will kill me!
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-11 20:45
    I think you saw my post before editing, it's only 3 executed instructions per iteration. So a 3 per 8 clocks rate.

    Not sure how you count 12 clocks. Once primed, RDLONGC will take 3 clocks when it hits the hub, then 1 clock for 3 subsequent calls. Once primed the loop has an interesting pattern.
    It takes 8 clocks each for 3 iterations, then the 4th one takes 6 clocks (all 3 of its rdlongc's take 1 clock), then on the 5th iteration the first rdlongc takes 5 clocks, but you are back in sync with the hub (since you finished early and just waited 2 clocks for it), and the pattern repeats.
    The net clock count for the 5 iterations is 40 (8+8+8+6+10), i.e. 15 executed LMM instructions in 40 clocks, which averages out to the 3-per-8 rate.

    Also, since there is no mapping for quad registers over cog registers (rdlongc writes to the actual cog register), you don't have aliasing issues with overlapping. So you can run normal LMM code without special ordering or grouping, as long as that code doesn't use the QUAD registers or PTRA.

    Roy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 20:59
    See the first or second experiment at the top of this thread, I tried them and reported on them on the first page.

    I could only fit three RDLONGC's and three single cycle instructions into a single hub cycle.

    Of the code you proposed, only two of the fetched instructions would fit within the first hub cycle, the other two would incur a second 8 clock hub cycle.

    (plus I have a foggy memory of trying what you are asking about a couple of days ago, and it taking two hub cycles)

    I'll retest tomorrow as soon as wifey leaves for work :-)

    Many other changes will be required for PropGCC2, but I don't have time to get into that until Ken gives me the go-ahead to work on PropGCC2.
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-11 21:36
    Bill,
    My example is just 3 instructions per hub cycle. I first posted the variant with 4, but quickly corrected it. You must have seen that one and not rechecked the post for the version it has now.

    I think something that works more like Prop1 LMM will be easier to get up and running quickly for Prop2GCC, then later we can explore these special case variants.
  • AribaAriba Posts: 2,690
    edited 2012-12-11 22:42
    Here is a 3-instructions-per-8-cycles LMM loop which has the benefit that PTRA always points to the next instruction. This makes fjmp, fcall and load-long much simpler than with the pipelined Quad-LMM.
    It takes $13FE clock cycles for 256 instructions, compared to $1007 cycles for Bill's latest Quad-LMM results.

    The nice thing: It has absolutely no problems with rdlongs and wrlongs in LMM code at any position.
    CON
    
            CLOCK_FREQ = 60_000_000
            BAUD = 115_200
    
    DAT
            org     0
    
            ' make 4096 LMM instructions of "add countX,#1"
    
    loop    reps    #256,#8
            setptra what            ' point to 16k buffer of "add count,#1" lmm code
    
            wrlong  i1,ptra++       ' save the add instruction into the hub
            wrlong  i2,ptra++       ' save the add instruction into the hub
            wrlong  i3,ptra++       ' save the add instruction into the hub
            wrlong  i4,ptra++       ' save the add instruction into the hub
            wrlong  i5,ptra++       ' save the add instruction into the hub
            wrlong  i6,ptra++       ' save the add instruction into the hub
            wrlong  i7,ptra++       ' save the add instruction into the hub
            wrlong  i8,ptra++       ' save the add instruction into the hub
    
            ' point at start of lmm code
    
            setptra what    
            mov     pc,what
            getcnt  start
    '--
            ' execute 256 LMM instructions
    
    ' Andy's 3 instr per 8 cycle LMM with ptra always = instr+1:
    
    lmm     reps #341,#8
            nop
             rdlongc ins1,ptra++    'fetch ins1
             rdlongc ins2,ptra++    'fetch ins2
             rdlongc ins3,ptra--    'fetch ins3, point to ins1+1
    ins1     nop                    'execute ins1
             addptra #4             'point to ins2+1
    ins2     nop                    'execute ins2
             addptra #4             'point to ins3+1
    ins3     nop                    'execute ins3
    
            getcnt  stop
    
            mov     cycles,stop
            sub     cycles,start
    
            wrlong  count, result1
            wrlong  count2,result2
            wrlong  count3,result3
            wrlong  count4,result4
    
            wrlong  cycles,result5
    
            coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!
    
    monitor_pgm     long    $70C                    'monitor program address
    monitor_ptr     long    90<<9 + 91              'monitor parameter (conveys tx/rx pins)
    
    i1      add     count, #1
    i2      add     count2,#1
    i3      add     count3,#1
    i4      add     count4,#1
    i5      rdlong  lc,result6
    i6      add     lc,#1
    i7      wrlong  lc,result6
    i8      add     count, #1
    
    {
    i5      sub     count, #1
    i6      sub     count2,#1
    i7      sub     count3,#1
    i8      sub     count3,#1
    }
    pc      long    0
    howmany long    0
    
    start   long    0
    stop    long    0
    cycles  long    0
    
    count   long    0
    count2  long    0
    count3  long    0
    count4  long    0
    times   long    0
    
    what            long    16384
    result1         long    $2000
    result2         long    $2004
    result3         long    $2008
    result4         long    $200C
    result5         long    $2010
    result6         long    $2014
    
    lc              long    0
    

    Andy
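    As a rough sketch of why the load-long primitive gets simpler here (this is not from the code above; LMM_ldlong, dest and the re-entry point are hypothetical): because PTRA already points at the long after the executing instruction, a 32-bit constant can simply sit inline in the LMM stream after a "jmp #LMM_ldlong", and the helper picks it up directly:

    LMM_ldlong
    	rdlong	dest,ptra	' PTRA is already the address of the inline constant
    	addptra	#4		' step past the constant so fetching resumes after it
    	jmp	#lmm		' re-enter the fetch loop (the jmp out of the REPS block ended the repeat, as Roy noted)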
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-12 00:29
    Hi Roy.

    I don't think it is a good idea to go the simple way as soon as we run into problems.
    If everyone thought the same way, we would never make any new design that is better.


    Roy Eltham wrote: »
    Bill,
    Have you considered the idea of using RDLONGC to make a LMM loop instead? Where the first one takes the long hit, but the next 3 are 1 cycle? They don't rely on the mapping thing, and so might have less problems with timing causing aliasing.
    It might not be as simple and small, but could work more reliably with less restrictions on the code being fed through it.
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-12 00:35
    Hi Chip.

    As I said in my previous post to you:

    We need to learn a lot before we can discover ALL the possibilities.
    And all the info you can provide gives that learning a push forward.

    cgracey wrote: »
    I think you CAN have 4 instructions executing every 8 clocks, as long as they are single-cycle, so that RDQUADs can be overlapped. Any agreement?
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-12 00:38
    Hi User Name.

    I agree with what Bill said.

    But I'm not surprised at how fast things are going.

    User Name wrote: »
    Agreed! But it has surprised me how much mileage you and others have gotten out of the DE0 single cog emulation. What a remarkable thing!

    I've been working night and day on an unrelated project. Now that it's nearly finished, I'm looking forward to dusting off my DE0 and torturing my own small part of the P2 instruction set.
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-12 03:55
    Using rdlongc like that will not yield an 8-cycle loop. You only get an 8-cycle loop once in 5 iterations; on most iterations one of the rdlongc's will take at least 3 clocks.
    Ariba wrote: »
    Here is a 3-instructions-per-8-cycles LMM loop which has the benefit that PTRA always points to the next instruction. This makes fjmp, fcall and load-long much simpler than with the pipelined Quad-LMM.
    It takes $13FE clock cycles for 256 instructions, compared to $1007 cycles for Bill's latest Quad-LMM results.

    The nice thing: It has absolutely no problems with rdlongs and wrlongs in LMM code at any position.
    CON
    
            CLOCK_FREQ = 60_000_000
            BAUD = 115_200
    
    DAT
            org     0
    
            ' make 4096 LMM instructions of "add countX,#1"
    
    loop    reps    #256,#8
            setptra what            ' point to 16k buffer of "add count,#1" lmm code
    
            wrlong  i1,ptra++       ' save the add instruction into the hub
            wrlong  i2,ptra++       ' save the add instruction into the hub
            wrlong  i3,ptra++       ' save the add instruction into the hub
            wrlong  i4,ptra++       ' save the add instruction into the hub
            wrlong  i5,ptra++       ' save the add instruction into the hub
            wrlong  i6,ptra++       ' save the add instruction into the hub
            wrlong  i7,ptra++       ' save the add instruction into the hub
            wrlong  i8,ptra++       ' save the add instruction into the hub
    
            ' point at start of lmm code
    
            setptra what    
            mov     pc,what
            getcnt  start
    '--
            ' execute 256 LMM instructions
    
    ' Andy's 3 instr per 8 cycle LMM with ptra always = instr+1:
    
    lmm     reps #341,#8
            nop
             rdlongc ins1,ptra++    'fetch ins1
             rdlongc ins2,ptra++    'fetch ins2
             rdlongc ins3,ptra--    'fetch ins3, point to ins1+1
    ins1     nop                    'execute ins1
             addptra #4             'point to ins2+1
    ins2     nop                    'execute ins2
             addptra #4             'point to ins3+1
    ins3     nop                    'execute ins3
    
            getcnt  stop
    
            mov     cycles,stop
            sub     cycles,start
    
            wrlong  count, result1
            wrlong  count2,result2
            wrlong  count3,result3
            wrlong  count4,result4
    
            wrlong  cycles,result5
    
            coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!
    
    monitor_pgm     long    $70C                    'monitor program address
    monitor_ptr     long    90<<9 + 91              'monitor parameter (conveys tx/rx pins)
    
    i1      add     count, #1
    i2      add     count2,#1
    i3      add     count3,#1
    i4      add     count4,#1
    i5      rdlong  lc,result6
    i6      add     lc,#1
    i7      wrlong  lc,result6
    i8      add     count, #1
    
    {
    i5      sub     count, #1
    i6      sub     count2,#1
    i7      sub     count3,#1
    i8      sub     count3,#1
    }
    pc      long    0
    howmany long    0
    
    start   long    0
    stop    long    0
    cycles  long    0
    
    count   long    0
    count2  long    0
    count3  long    0
    count4  long    0
    times   long    0
    
    what            long    16384
    result1         long    $2000
    result2         long    $2004
    result3         long    $2008
    result4         long    $200C
    result5         long    $2010
    result6         long    $2014
    
    lc              long    0
    

    Andy
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-12 03:58
    We need to get Prop2GCC up and running ASAP. We can make it faster and better in time, but Parallax really needs to ship a good set of tools, including a cross-platform spin2/pasm2 compiler and prop2gcc, with the prop2 as soon as it's available. Forcing a massive amount of extra compiler work onto the GCC side just to support some fancy new LMM variant seems like something that could happen after we have a working version.

    Roy
    Sapieha wrote: »
    Hi Roy.

    I don't think it is a good idea to go the simple way as soon as we run into problems.
    If everyone thought the same way, we would never make any new design that is better.
  • ctwardellctwardell Posts: 1,716
    edited 2012-12-12 04:18
    Roy Eltham wrote: »
    We need to get Prop2GCC up and running ASAP. We can make it faster and better in time, but Parallax really needs to ship a good set of tools, including a cross-platform spin2/pasm2 compiler and prop2gcc, with the prop2 as soon as it's available. Forcing a massive amount of extra compiler work onto the GCC side just to support some fancy new LMM variant seems like something that could happen after we have a working version.

    Roy

    The pursuit of a smoking-fast LMM2 led to finding an error in the Verilog code for RDQUAD, and to Chip making changes to allow more flexibility in the placement of the quad cache, so it has already shown very real benefits.

    Once Bill, Ariba and others have worked out the various options then they can be reviewed with the GCC team to see what makes sense.

    It sounds like GCC has good facilities for handling out-of-order pipelining, so it may not be too difficult once the various LMM techniques for the Prop2 have been worked out.

    C.W.