LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz) - Page 6 — Parallax Forums

LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)


Comments

  • SapiehaSapieha Posts: 2,964
    edited 2012-12-11 05:58
    Hi Chip.

    NANO reprogrammed
    > Function correctly

    Thanks
  • AribaAriba Posts: 2,690
    edited 2012-12-11 06:25
    This version of the Quad-LMM loop now also works with no problems - well done Chip!
    ' Quad LMM
    DAT
            org 0
            mov pc,start
    lmm     setquaz #ins1
            rdquad pc
    ins1     nop
             nop
             nop
    ins4     nop
             jmpd #ins1
             add pc,#16
             nop
             rdquad pc
    
    start   long @lmmcode+$0E80
    pc      long 0
    t1      long 0
    
            long 0[4-($-($ & $1FC))]  'quad align following code
    
    lmmcode setp #2         'toggle pin2 25%        
            notp #4
            clrp #2
            sub pc,#32      'rel jump, delayed by 1 quad
    
            nop
            notp #4         'toggle pin4 50%
            nop
            nop
    

    I will try now some other features of the Prop2...

    Andy
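The `long 0[4-($-($ & $1FC))]` padding line in the listing above can be checked with a short Python sketch. It only models the arithmetic of the expression; `pad_longs` and the sample addresses are hypothetical names and values chosen for illustration, and whether the resulting hub address is also quad aligned depends on where the DAT block lands in hub memory.

```python
def pad_longs(addr):
    """Filler longs emitted by `long 0[4-($-($ & $1FC))]` when $ == addr."""
    # (addr & 0x1FC) rounds the cog address down to a multiple of 4,
    # so addr - (addr & 0x1FC) is addr modulo 4.
    return 4 - (addr - (addr & 0x1FC))

for addr in (13, 14, 15, 16):
    pad = pad_longs(addr)
    print(f"$ = {addr}: pad {pad} long(s), next address {addr + pad}")
```

Note that an address that is already a multiple of 4 still gets four filler longs, so the following code stays aligned at the cost of one unused quad.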
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 06:35
    WOW!

    I go to dinner, and to sleep, and look how much happens....

    I will head down to the lab shortly, update my Nano, run tests, and catch up on the posts :-)
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 07:54
    Hi guys,

    Here is my massive "Catch Up" post:

    jmg,David:

    I think Chip wants something *REALLY* simple in hardware, I don't believe he has the time / logic budget for anything complicated. If he did, I think all of us would inundate him with additional suggestions :)

    Chip:

    First, thanks for SETQUADZ and everything!

    Right after this message I'll program your latest Nano image & update to the latest PNut; then I will torture-test RDQUADLONG.

    Cluso99:

    I like the movx (condition code)

    For byte packing/unpacking Chip has a MOVF that he will document (can't wait)

    potatohead:

    Yep, testing RDQUAD was "interesting", and non-cog aligned RDQUAD will be nice.

    re/ compressed: take a look at what I proposed. I was wary of any more complexity due to implementation delay, logic budget, and Chip's time.

    Sapieha:

    I think movf will do this... (I hope!)

    Cluso99:

    Of course I don't mind you posting! An efficient loader will be needed for FCACHE

    I will try your code... looks good at a quick glance.


    Baggers:

    You *REALLY* should get a Nano...

    Chip:

    Downloading new .zip ... thanks!

    Andy:

    Looks good!

    I will torture your version too :)
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 08:48
    Latest Prop2 configuration / PNut torture tests:

    Chip's sample:
    >n
    >2000.201f
    02000- 00000080 00000080 000000FF 000000FF   '................'
    02010- 00000FF3 00000000 00000000 00000000   '................'
    

    not what we would like to see :-(

    Andy's main loop, modified for my test framework:
    >n
    >2000.201f
    02000- 00000080 00000080 000000FF 000000FF   '................'
    02010- 00000FF3 0000007F 00000000 00000000   '................'
    

    not what we would like to see :-(

    My latest main loop, quad aligned:
    >n
    >2000.201f
    02000- 000000FE 00000080 00000080 00000080   '................'
    02010- 00000FF6 0000007F 00000000 00000000   '................'
    

    IT WORKS!!!

    My latest main loop, NOT quad aligned:
    >n
    >2000.201f
    02000- 000000FE 00000080 00000080 00000080   '................'
    02010- 00000FF7 0000007F 00000000 00000000   '................'
    

    THANKS CHIP - WORKS PERFECTLY!!!

    No more worries about alignment...

    I am attaching the torture test to this message, it has all three different LMM2 loops in it - just uncomment the one you want to run.

    THE TORTURE TEST EXPLAINED

    Currently, the torture test consists of the following eight instructions, repeated 256 times in the hub (2k instructions, 8k hub used)
    i1	add	count, #1
    i2	add	count2,#1
    i3	add	count3,#1
    i4	add	count4,#1
    i5	rdlong	lc,result6
    i6	add	lc,#1
    i7	wrlong	lc,result6
    i8	add	count,#1
    


    Due to how RDQUAD works, the LMM2 loop will alternately execute i1-i4, then i5-i8, ensuring that RDQUAD reads a different group of four instructions on every loop iteration.

    i1-i4 are simple adds to counters, which are later deposited into result1-result4

    i5-i7 increment result6 in the hub

    i8 does an extra increment of result1

    The LMM2 loop's RDQUAD is executed 256 times (so we get nice, easy-to-read cycle counts in hex); however, the last four instructions fetched are not executed, and the first four are executed as NOPs, as they would not have been fetched yet.

    The first iteration executes four NOP's on the first pass (thanks to SETQUAZ)

    This means that i1-i4 gets executed 128 times, and i5-i8 gets executed 127 times
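The counting argument above can be sketched in a few lines of Python. This is only a tally under the post's stated assumptions (256 RDQUADs, a first pass of NOPs, a last quad that never runs); the variable names are hypothetical.

```python
RDQUADS = 256  # RDQUAD count stated in the post

# The first pass runs the four NOPs left by SETQUAZ and the last quad
# fetched is never executed, so 255 fetched quads actually run.
executed_quads = RDQUADS - 1

# Quads alternate i1-i4, i5-i8, i1-i4, ... starting with i1-i4.
first_half = (executed_quads + 1) // 2   # passes through i1-i4
second_half = executed_quads // 2        # passes through i5-i8

print(f"i1-i4: {first_half} (${first_half:X})")    # i1-i4: 128 ($80)
print(f"i5-i8: {second_half} (${second_half:X})")  # i5-i8: 127 ($7F)
```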

    Therefore the mathematically correct expected results (in hex) are:

    result 1: $0000FE ($7F+$7F)
    result 2: $000080
    result 3: $000080
    result 4: $000080

    result 5: $000FF7 *** approximate
    result 6: $00007F

    256*8 RDQUAD's ($800 cycles), 127*8 RDLONG's ($3F8 cycles), 127*8 WRLONG's ($3F8 cycles), as after the first hub access all will hit hub sync sweet spots

    $800+$3F8+$3F8 = $FF0 cycles
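Working the stated counts through in hex is easy to get wrong by hand, so here is a minimal Python sketch of the estimate. It assumes, as the post does, that every RDQUAD, RDLONG and WRLONG costs one full 8-clock hub window; the names are hypothetical. With 127*8 = $3F8, the total comes to $FF0, within a few clocks of the measured $FF6/$FF7.

```python
HUB_WINDOW = 8  # clocks per hub window, as assumed in the post

rdquad_cycles = 256 * HUB_WINDOW
rdlong_cycles = 127 * HUB_WINDOW
wrlong_cycles = 127 * HUB_WINDOW
total = rdquad_cycles + rdlong_cycles + wrlong_cycles

print(f"${rdquad_cycles:X} + ${rdlong_cycles:X} + ${wrlong_cycles:X} = ${total:X}")
# $800 + $3F8 + $3F8 = $FF0
```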

    Sorry about the incorrect calculation in an earlier edit of this message - the timing works as expected!

    p.s.

    Other instructions can be tried in i1-i8, but then predicted results should be calculated for them.

    The attached file is basically a framework for testing LMM2 and native instructions.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 09:13
    I wonder if the torture test is faster because there is only a single cog implemented on my nano, and thus it actually ends up using "other cogs" hub cycles?

    I fixed my initial incorrect calculation, the cycle count was correct as well.

    Could someone try the torture test using my loops on a DE2-115 and post the results?
  • potatoheadpotatohead Posts: 10,261
    edited 2012-12-11 09:14
    Setting up now... Programming. Bill, I've got time for a quick run. I'll do the loops and post up results here in a moment.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 09:35
    Thanks!

    Meanwhile, I found an error in my initial calculation for expected cycle count - and the result I am getting is correct.
    potatohead wrote: »
    Setting up now... Programming. Bill, I've got time for a quick run. I'll do the loops and post up results here in a moment.
  • potatoheadpotatohead Posts: 10,261
    edited 2012-12-11 09:41
    Bill, I uncommented them in sequence, from top to bottom.

    #1 --> "my out of order LMM2 inner loop"


    === Propeller II Monitor ===

    >n2000.201f
    02000- 000000FE 00000080 00000080 00000080 '................'
    02010- 00000FF6 0000007F 00000000 00000000 '................'
    >

    #2 --> "Chip's easy to read LMM2 loop"



    === Propeller II Monitor ===

    >n2000.201f
    02000- 00000080 00000080 000000FF 000000FF '................'
    02010- 00000FF3 00000000 00000000 00000000 '................'
    >

    #3 --> "Andy's non-pta LMM2 loop with jumpd, modified for 256 iterations"



    === Propeller II Monitor ===

    >N2000.201f
    02000- 00000080 00000080 000000FF 000000FF '................'
    02010- 00000FF3 00000000 00000000 00000000 '................'
    >
  • potatoheadpotatohead Posts: 10,261
    edited 2012-12-11 09:44
    These don't make sense to me yet. Did I get the right test file Bill?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 09:47
    #1 makes sense - you might want to read the (long) explanation I put in the post#156 with the code that calculates the expected results for the torture test.

    Unfortunately Chip's and Andy's loops have pipeline issues, whereas my out of order loop gets the right result.

    I am planning to come up with other torture tests over time - we really need to get LMM2 right.
    potatohead wrote: »
    These don't make sense to me yet. Did I get the right test file Bill?
  • potatoheadpotatohead Posts: 10,261
    edited 2012-12-11 09:50
    Got it. I see that explanation now.

    Well, we've now got a nice test framework, and a template to calculate expected results. Time to try different sets of instructions, among other things. If needed, I can run things this evening. The pipeline is going to be interesting to work with...
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 09:57
    Thanks that would be great!

    The more verification we can get done, the more likely the test shuttle chips (pun intended) will just work!

    I am going to concentrate on making sure LMM2 is bullet proof, and appreciate additional torture testing on that.

    It would be very useful if we could get every instruction verified in both LMM2 and native cog environment... but we'd need many volunteers for that. I would prefer that LMM2 test results be reported in this thread.

    Ideally people would post a list of instructions they are working on, and the verified/not status.

    (I am certain that Chip already has great test vectors for his Verilog files, but extra verification never hurts)
    potatohead wrote: »
    Got it. I see that explanation now.

    Well, we've now got a nice test framework, and a template to calculate expected results. Time to try different sets of instructions, among other things. If needed, I can run things this evening. The pipeline is going to be interesting to work with...
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 10:04
    I added the Torture Test documentation and latest version to post#2 - this version adds a test for my out of order LMM2 loop that does not use PTRA, but uses a pc register.

    Code snippet:
    	'nop	' needed for aligned, comment out for un-aligned test
    	setquaz	#ins1
    	reps	#256,#8
    	getcnt	start
    ins1	nop			'four LMM instructions from RDQUAD before last execute here
    ins2	nop
    ins3	nop
    ins4	nop
    ins0	rdquad	pc		'LMM loop
    ins5	add	pc,#16		'(this is where the mapped QUADs actually become executable after RDQUAD)
    ins6	nop
    ins7	nop
    

    Test results:
    >n
    >2000.201f
    02000- 000000FE 00000080 00000080 00000080   '................'
    02010- 00000FF7 0000007F 00000000 00000000   '................'
    

    The results are correct :)
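Comparing monitor dumps by eye gets tedious across many runs. The sketch below parses the two dump lines above and checks them against the expected values from the torture test explanation. It assumes result1..result6 are consecutive longs starting at $2000, which is inferred from the dumps rather than stated; `parse_dump` is a hypothetical helper.

```python
def parse_dump(text):
    """Return {address: value} for every long in monitor dump lines."""
    longs = {}
    for line in text.strip().splitlines():
        addr_field, rest = line.split("-", 1)
        base = int(addr_field, 16)
        hex_part = rest.split("'", 1)[0]  # drop the ASCII column
        for i, word in enumerate(hex_part.split()):
            longs[base + 4 * i] = int(word, 16)
    return longs

dump = """
02000- 000000FE 00000080 00000080 00000080   '................'
02010- 00000FF7 0000007F 00000000 00000000   '................'
"""

longs = parse_dump(dump)
results = [longs[0x2000 + 4 * i] for i in range(6)]
# None marks result5, the cycle count, which is only approximate.
expected = [0xFE, 0x80, 0x80, 0x80, None, 0x7F]

for n, (got, want) in enumerate(zip(results, expected), start=1):
    status = "n/a" if want is None else ("ok" if got == want else "MISMATCH")
    print(f"result{n}: ${got:06X} {status}")
```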
  • potatoheadpotatohead Posts: 10,261
    edited 2012-12-11 10:05
    I'm wondering about the WAITVID, and friends... Need docs for those though. When Chip gets to it, that's another round of tests.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 10:08
    DEFINITELY!
    potatohead wrote: »
    I'm wondering about the WAITVID, and friends... Need docs for those though. When Chip gets to it, that's another round of tests.

    I really hope the non-DAC parallel mode output for VGA is still possible, I need it for LCDs; otherwise I'll have to bit (byte? word? long?) bang it.
  • potatoheadpotatohead Posts: 10,261
    edited 2012-12-11 10:10
    Me too, I've been wondering. I'm half tempted to just attempt it anyway, tinkering with the modes until I see that. Hope it's there, because it's useful on a few levels, not just video. Bit banging is better now, but still... (fingers crossed)

    waitvid classic mode
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 10:11
    Very true... on P1 I've also used it for fast clocked serial out and 8 bit DDS experiments.
    potatohead wrote: »
    Me too, I've been wondering. I'm half tempted to just attempt it anyway, tinkering with the modes until I see that. Hope it's there, because it's useful on a few levels, not just video. Bit banging is better now, but still... (fingers crossed)
  • AribaAriba Posts: 2,690
    edited 2012-12-11 11:25
    Bill
    There seem to be issues with hub access instructions inside LMM code. Unfortunately your code also does not work right; you get the (nearly) right results only by chance. Try the following:
    1) Initialize lc with #$180, for example, before the LMM loop - you will get $1FF as result6, so lc is just incremented and written to hub, but never loaded by rdlong.
    2) Make i5 a NOP instead of rdlong lc,result6 - and all 3 versions show the same results.

    What happens is that your "out of order loop" is so out of order that the first instruction is not ready to execute after the previous rdquad.
    I've tried for half an hour to find instruction slots that always work, but it looks like the rdlong destroys something in the pipeline, and the working slots change for every code variant I tried.
    With normal 1-cycle instructions I don't see these problems.
    I think Chip needs to look again at his Verilog code....

    Andy
  • BaggersBaggers Posts: 3,019
    edited 2012-12-11 11:29
    cgracey wrote: »
    Baggers, that would be maybe too much at this late stage. I wish I would have known about this earlier. You told me, but I didn't picture it this clearly back then. That would be pretty simple.

    No worries, you can save it for P3 :)
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 11:37
    Thanks for the thorough look Andy!

    I will try your modifications, and no doubt get the same results, and I am sure Chip will look over the Verilog code again - all of us want this to work :)

    Releasing the Terasic loadables was a brilliant move - it is letting us find bugs before the shuttle run.
    Ariba wrote: »
    Bill
    There seem to be issues with hub access instructions inside LMM code. Unfortunately your code also does not work right; you get the (nearly) right results only by chance. Try the following:
    1) Initialize lc with #$180, for example, before the LMM loop - you will get $1FF as result6, so lc is just incremented and written to hub, but never loaded by rdlong.
    2) Make i5 a NOP instead of rdlong lc,result6 - and all 3 versions show the same results.

    What happens is that your "out of order loop" is so out of order that the first instruction is not ready to execute after the previous rdquad.
    I've tried for half an hour to find instruction slots that always work, but it looks like the rdlong destroys something in the pipeline, and the working slots change for every code variant I tried.
    With normal 1-cycle instructions I don't see these problems.
    I think Chip needs to look again at his Verilog code....

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 11:51
    Good catch Andy!

    I added:

    testvalue long $180

    at the bottom of the file, and

    wrlong testvalue,result6

    before the loop, and I verified that at the end of the run, result6 was $7F

    Thinking about it, I think that only RDLONG has an issue, otherwise we would not see $7F... and the single cycle adds are all executing correctly.

    Update:

    Thinking some more about it, it pretty much has to be a difference in how the Verilog handles executing RDLONG from a "real cog register" and a "RDQUAD cache mapped to registers"

    Reason I think this is the case is that two days ago the RDLONG tests worked fine - but I am going to go and re-test that right now...
    Ariba wrote: »
    Bill
    There seem to be issues with hub access instructions inside LMM code. Unfortunately your code also does not work right; you get the (nearly) right results only by chance. Try the following:
    1) Initialize lc with #$180, for example, before the LMM loop - you will get $1FF as result6, so lc is just incremented and written to hub, but never loaded by rdlong.
    2) Make i5 a NOP instead of rdlong lc,result6 - and all 3 versions show the same results.

    What happens is that your "out of order loop" is so out of order that the first instruction is not ready to execute after the previous rdquad.
    I've tried for half an hour to find instruction slots that always work, but it looks like the rdlong destroys something in the pipeline, and the working slots change for every code variant I tried.
    With normal 1-cycle instructions I don't see these problems.
    I think Chip needs to look again at his Verilog code....

    Andy
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 12:33
    Ok, looks like my hypothesis is confirmed.

    The problem pretty much has to be the difference between executing RDLONG from regular registers and executing it from the RDQUAD cache buffer.

    I initialized result6 to $180, then I executed the following code:
    	reps	#127,#8
    	getcnt	start
    	add	count, #1
    	add	count2,#1
    	add	count3,#1
    	add	count4,#1
    	rdlong	lc,result6
    	add	lc,#1
    	wrlong	lc,result6
    	add	count,#1
    

    in the torture test framework, and got the following result:
    >n
    >2000.201f
    02000- 000000FE 0000007F 0000007F 0000007F   '................'
    02010- 000007F6 000001FF 00000000 00000000   '................'
    

    Note that result six is $180+$7F, and all the other results are correct as well.
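The values in this native (non-LMM) run follow directly from the REPS count. A minimal Python sketch of the tally, with hypothetical variable names; the cycle figure additionally assumes that each pass costs two 8-clock hub windows (one for RDLONG, one for WRLONG), which is my reading of the measured $7F6 rather than something the post states.

```python
ITERATIONS = 127          # from "reps #127,#8"
INITIAL_RESULT6 = 0x180   # value written to result6 before the loop

count = 2 * ITERATIONS                   # two "add count,#1" per pass
count2 = count3 = count4 = ITERATIONS    # one add each per pass
result6 = INITIAL_RESULT6 + ITERATIONS   # rdlong / add / wrlong once per pass

# Assumption: RDLONG and WRLONG each sync to their own hub window.
estimated_cycles = ITERATIONS * 2 * 8

print(f"count   = ${count:X}")             # $FE
print(f"count2  = ${count2:X}")            # $7F
print(f"result6 = ${result6:X}")           # $1FF
print(f"cycles ~= ${estimated_cycles:X}")  # $7F0, measured $7F6
```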
  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 12:43
    Ariba wrote: »
    Bill
    There seem to be issues with hub access instructions inside LMM code. Unfortunately your code also does not work right; you get the (nearly) right results only by chance. Try the following:
    1) Initialize lc with #$180, for example, before the LMM loop - you will get $1FF as result6, so lc is just incremented and written to hub, but never loaded by rdlong.
    2) Make i5 a NOP instead of rdlong lc,result6 - and all 3 versions show the same results.

    What happens is that your "out of order loop" is so out of order that the first instruction is not ready to execute after the previous rdquad.
    I've tried for half an hour to find instruction slots that always work, but it looks like the rdlong destroys something in the pipeline, and the working slots change for every code variant I tried.
    With normal 1-cycle instructions I don't see these problems.
    I think Chip needs to look again at his Verilog code....

    Andy


    The problem is that RDLONG's take more than one clock and therefore make subsequent instructions in QUADs available for execution earlier. I don't know what to do about this, other than suggest timing provisions are made to execute the just-read QUADs and not execute overlapped QUADs, as the overlapped execution is where timing gets advanced by multi-clock instructions and causes new instructions to execute in place of old.

    Read this carefully and it will make sense:
    After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:
    
    
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow QUADs to update
            NOP
            NOP
    
    	CMP     quad0,quad1     'mapped QUADs are now accessible via D and S
    
    
    After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:
    
    
            SETQUAD #quad0          'map QUADs to quad0..quad3
    
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow QUADs to update
            NOP
            NOP
    
            NOP                     'do at least 1 instruction to get QUADs into pipeline
    
    quad0   NOP                     'QUAD0..QUAD3 are now executable
    quad1   NOP
    quad2   NOP
    quad3   NOP
    
    
    After a SETQUAD, mapped QUAD registers are writable immediately, but original contents are
    readable via D and S after 2 instructions:
    
    
            SETQUAD #quad0          'map QUADs to quad0..quad3 (new address)
    
            NOP                     'do at least two instructions to queue up QUADs
            NOP
    
            CMP     quad0,quad1     'mapped QUADS are now accessible via D and S
    
    
    On cog startup, the QUAD registers are cleared to 0's.
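The timing rules above can be turned into a small Python model to see why a multi-clock instruction shifts things. This is a deliberately simplified sketch of my reading of the rules (3 clocks until the QUADs are readable via D and S, plus one more instruction until they are executable), not a description of the actual Verilog; the function names and the 3-clock duration used for the slow instruction are hypothetical.

```python
QUAD_UPDATE_CLOCKS = 3  # "at least 3 clocks" after RDQUAD

def first_readable(durations):
    """Index of the first instruction after RDQUAD that sees the new QUADs via D/S.

    durations[i] is the clock count of the i-th instruction after RDQUAD.
    """
    elapsed = 0
    for i, clocks in enumerate(durations):
        if elapsed >= QUAD_UPDATE_CLOCKS:
            return i
        elapsed += clocks
    return len(durations) if elapsed >= QUAD_UPDATE_CLOCKS else None

def first_executable(durations):
    """One further instruction is needed to get the QUADs into the pipeline."""
    readable = first_readable(durations)
    return None if readable is None else readable + 1

single_cycle = [1, 1, 1, 1, 1, 1]
print(first_readable(single_cycle), first_executable(single_cycle))   # 3 4

# Hypothetical: a 3-clock instruction directly after RDQUAD.
with_slow_op = [3, 1, 1, 1, 1, 1]
print(first_readable(with_slow_op), first_executable(with_slow_op))   # 1 2
```

In the all-single-cycle case the model reproduces the two listings above (three NOPs before the CMP, four NOPs before quad0). With a slow instruction in front, the new QUADs become executable two slots earlier, which is the effect described for RDLONG.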
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 12:49
    Ok that makes sense.

    Question:

    RDQUAD fully exposes the 7 cycles after it (in an 8 cycle hub window)

    Is it possible for the RDBYTE/WORD/LONG (non cache version) to do the same? it would make it simpler to schedule.

    I'll go try to make a new torture test to show RDLONG does work if scheduled right.

    The current thread expected RDLONG to pause the pipe until synced to the hub like in non-quad code, which makes writing code for it easier.
  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 12:54
    Bill Henning wrote: »
    Ok that makes sense.

    Question:

    RDQUAD fully exposes the 7 cycles after it (in an 8 cycle hub window)

    Is it possible for the RDBYTE/WORD/LONG (non cache version) to do the same? it would make it simpler to schedule.

    I can't change the timing of RDxxxx. That's kind of set, at this point.

    I think one thing you could do would be to put the RDLONG at the 4th (last) location, and have three single-cycle instructions in front of it. This would inhibit the problem.
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-11 12:55
    Hi Chip.

    Are the registers that are hidden by the QUAD filled with the same data as the QUAD?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 12:57
    THANK YOU CHIP! IT WORKS!!!!

    Your explanation let me schedule the code properly for the kernel.

    Re-written, scheduled LMM2 code:
    i1	rdlong	lc,result6
    i2	add	count, #1
    i3	add	count2,#1
    i4	add	count3,#1
    i5	add	count4,#1
    i6	add	lc,#1
    i7	wrlong	lc,result6
    i8	add	count,#1
    

    Test results - which make sense:
    >n
    >2000.201f
    02000- 000000FF 00000080 00000080 0000007F   '................'
    02010- 00000FF7 000001FF 00000000 00000000   '................'
    


    Basically, it was NOT a Verilog issue, just improper scheduling of code to be run through the LMM2 kernel.
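The dump for the re-scheduled code can be reproduced with a plain tally that ignores pipeline timing altogether. The Python sketch below walks the eight instructions in alternating quads, 255 executed quads in total, and assumes result6 was still initialized to $180 as in the earlier test; the data structures are hypothetical.

```python
program = [
    ("rdlong", "lc"), ("add", "count"), ("add", "count2"), ("add", "count3"),  # i1-i4
    ("add", "count4"), ("add", "lc"), ("wrlong", "lc"), ("add", "count"),      # i5-i8
]

regs = {"count": 0, "count2": 0, "count3": 0, "count4": 0, "lc": 0}
hub_result6 = 0x180      # assumed initial value of result6 in hub memory
EXECUTED_QUADS = 255     # 256 RDQUADs, the last quad fetched never runs

for quad in range(EXECUTED_QUADS):
    half = program[:4] if quad % 2 == 0 else program[4:]
    for op, reg in half:
        if op == "add":
            regs[reg] += 1
        elif op == "rdlong":
            regs[reg] = hub_result6
        elif op == "wrlong":
            hub_result6 = regs[reg]

for name in ("count", "count2", "count3", "count4"):
    print(f"{name:6s} = ${regs[name]:X}")   # $FF, $80, $80, $7F
print(f"result6 = ${hub_result6:X}")        # $1FF
```

The tally matches the dump: count4 moved into the second quad, so it runs 127 times instead of 128, and count now collects 128 + 127 increments.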

  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 13:00
    The only way overlapped QUADs can be executed is if the instructions within them are all single-cycle. Otherwise, the overlapping gets mixed up, affecting the 4th and maybe even 3rd instruction.
  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 13:02
    Sapieha wrote: »
    Hi Chip.

    Are the registers that are hidden by the QUAD filled with the same data as the QUAD?

    No, it has to do with when new QUAD values enter the pipeline, based on how many clocks have elapsed since the last RDQUAD.