Waitpeq/waitpna resume timing — Parallax Forums

Cliff L. Biffle Posts: 206
edited 2009-06-19 13:28 in Propeller 1
I'm extending my instruction scheduler to handle external events, and I was wondering if anyone could help me with the precise timing behavior of waitpne and waitpeq on the P8X32A.

I'm dealing with some particularly tricky bus interface code, so the specifics have suddenly become important. :) I don't have the bench equipment I need to answer this question on my own.

In the below, I make some assumptions: INAx is the single pin we're monitoring; we're waiting for it to go high; we're using waitpeq. If the timing can vary in other cases (such as high-low transitions or waitpne), please let me know.

If INAx is held high, waitpeq should take 5 cycles.

If INAx is held low, waitpeq should spin indefinitely.

But if INAx transitions during execution of waitpeq, how soon is it noticed? Or, put another way, what hold time is required before the next instruction is allowed to execute?

If I've been unclear, tell me and I'll rephrase or elaborate. I'll add the results to my Propeller timing info page after testing them. For my application I need precision of no more than about 0.5clk, but if anyone has better it'd be great.

Thanks!

Edit: Just to clarify, I've seen a post from Paul Baker about a year ago that specifies that the next instruction executes on the next cycle after the condition is true, which almost answers my question -- but I'm hoping someone (possibly Paul) can be more specific about hold times for the comparison.

Post Edited (Cliff L. Biffle) : 5/9/2008 8:19:09 AM GMT

Comments

  • kuroneko Posts: 3,623
    edited 2008-05-09 09:09
    Cliff L. Biffle said...
    If INAx is held high, waitpeq should take 5 cycles.

    FWIW, it's 6 cycles. http://forums.parallax.com/showthread.php?p=722987
  • Ken Peterson Posts: 806
    edited 2008-05-09 12:33
    Perhaps someone from Parallax can chime in on why the data sheet says 5+, and exactly how many clock cycles after the transition does the next instruction execute?
  • Cliff L. Biffle Posts: 206
    edited 2008-05-09 15:54
    kuroneko said...
    Cliff L. Biffle said...
    If INAx is held high, waitpeq should take 5 cycles.

    FWIW, it's 6 cycles. http://forums.parallax.com/showthread.php?p=722987

    I notice you haven't gotten confirmation from Parallax on this.
    Ken Peterson said...
    Perhaps someone from Parallax can chime in on why the data sheet says 5+, and exactly how many clock cycles after the transition does the next instruction execute?

    Parallax has chimed in on this before; Paul Baker said
    "To clarify a point, waitpeq is deterministic with respect to the pin state. The reason it is listed as 5+ is it takes 4 clocks to process the instruction, plus however many cycles of compare necessary to achieve the wait state. If the value is true at the beginning it will take 5 cycles since only one compare cycle occurs. For situations where more than one compare cycle occurs, the next instruction begins execution on the next clock cycle after a comparison evaluates true."

    (http://forums.parallax.com/showthread.php?p=656005)

    Kuroneko's statement would seem to conflict with that, however.
  • kuroneko Posts: 3,623
    edited 2008-05-10 06:05
    Cliff L. Biffle said...
    Kuroneko's statement would seem to conflict with that, however.

    Well, I wish it was 5+ and I simply overlooked something. But as mentioned in the other thread, even with the monitored pin being static I get a 6 cycle delay (as opposed to having the equation always resolve to true and thereby ignoring the pin state altogether).
  • Harley Posts: 997
    edited 2008-05-14 19:12
    Funny, just the other day I measured an additional 2 cycles on top of the 4 after the leading edge of a true state for WAITPEQ.

    The instruction is set up well before the input pulse appears. Using ViewPort and some 'test pulses' on another I/O to mark events, I see a 100 nsec delay between the awaited pulse and a 'test pulse' (running at 80 MHz). I never see a total of 5 cycles, but a constant 6, costing another 50 nsec in response. LIFE!!!

    What might the condition have to be to see a 5 clock response?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Harley Shanko
  • Cliff L. Biffle Posts: 206
    edited 2008-05-14 23:40
    Well, with no word from Parallax (though I haven't contacted them directly), I've got a Proto Board on the bench with my scope. I'll post traces when I've got 'em.
  • Cliff L. Biffle Posts: 206
    edited 2008-05-15 05:20
    Here's my preliminary report. Kuroneko appears to be correct -- I've not been able to get any wait* instruction to take less than 6 cycles.

    Bench configuration:
    - P8X32A (LQFP) on Parallax Proto Board. Date code 0641.
    - Regulator producing 3.6V (slightly out of spec)
    - Crystal stable at 80MHz.
    - Tektronix TDS1012B.

    Automated test suites run using propasm, make, and Remy Blank's Loader.py. I've omitted the clock configuration directives from the source here.

    Baseline: toggle.pa

    toggle
      mov DIRA, MASK
    :loop
      or OUTA, MASK
      andn OUTA, MASK
      jmp #:loop
    
    mask long $FFFFFFFF
    
    



    pt-toggle.png

    As expected this shows a 33% duty cycle, 50ns (4 cyc) high, 100ns (8c) low.
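The figures above can be reproduced with simple cycle accounting. This is a minimal sketch in Python, assuming each of `or`, `andn` and `jmp` takes 4 clocks and that the pin changes at the same point within each instruction; `cycles_to_ns` and the constant names are illustrative and not part of the test suite.

```python
CLOCK_HZ = 80_000_000  # crystal frequency from the bench configuration

def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a clock-cycle count to nanoseconds."""
    return cycles * 1e9 / clock_hz

# Assumption: or, andn and jmp each take 4 clocks.
OR_C, ANDN_C, JMP_C = 4, 4, 4

# The pin goes high when `or` completes and low when `andn` completes,
# so the high phase lasts one andn and the low phase lasts jmp + or.
high_c = ANDN_C
low_c = JMP_C + OR_C

high_ns = cycles_to_ns(high_c)
low_ns = cycles_to_ns(low_c)
duty = high_c / (high_c + low_c)
print(high_ns, low_ns, round(duty * 100))
```

The same helper converts the other figures in this report; for instance, 6 clocks at 80 MHz is 75 ns.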

    Testing waitpeq
    For this run P0 was pulled directly to ground.

    waitgnd
      or DIRA, OUTPUT
    :loop
      or OUTA, OUTPUT
      waitpeq ZERO, INPUT
      andn OUTA, OUTPUT
      jmp #:loop
    
    OUTPUT long $0000FF00
    INPUT long $00000001
    ZERO long 0
    
    



    pt-waitpeq.png

    The high phase of the signal has grown by 75ns/6c. This supports Kuroneko's hypothesis that the fastest possible invocation of waitpeq will take 6c.

    Testing waitpne and resume latency
    For this test, I added a switch to pull P0 to rail. Channel 2 was tied to P0 with a rising-edge trigger at 1.63V. The chip was allowed to idle at waitpne for several seconds before P0 went high.

    waithi
      or DIRA, OUTPUT
    :loop
      or OUTA, OUTPUT
      waitpne ZERO, INPUT
      andn OUTA, OUTPUT
      jmp #:loop
    
    OUTPUT    long $0000FF00
    INPUT     long $00000001
    ZERO      long 0
    
    



    pt-waitpeq2.png

    This one answers my original question: there appears to be a minimum latency of two cycles from the time that a wait condition is satisfied to the time that the next instruction begins execution. If anyone else's data fail to support this, I can do a more thorough test using a random external source and the next instruction's input latching characteristics.

    I repeated this test several times and saw 75ns/2c - 80ns most of the time. If my model is correct the latency could approach 87.5ns/3c, but the data failed to support this thus far.

    Testing waitcnt

    While I was at it I constructed a test fixture for waitcnt. A script repeatedly recompiled the code below with different values for {CYCLES} and recorded the traces that resulted at P15.

    timecnt
      mov DIRA, OUTPUT
    :loop
      or OUTA, OUTPUT
      mov time, CNT
      add  time, #{CYCLES}
      waitcnt time, #0
      andn OUTA, OUTPUT
      jmp #:loop
    
    time      long 0
    OUTPUT    long $0000FF00
    
    



    Values of CYCLES in 0..8 (inclusive) failed to transition in the time allotted (500ms).

    Values of 9 and greater had high-phase periods of 212.5ns + (CYCLES - 8) * 12.5ns, and the expected 100ns low period.

    Here's the trace for CYCLES=11:
    pt-waitcnt.png

    Commenting out the waitcnt instruction gave a high period of 150ns, giving a best-case waitcnt time of 225ns - 150ns = 75ns, or 6c. This supports Kuroneko's hypothesis.
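The measured relationship can be written down as a small model. The sketch below is Python and uses only the numbers reported in this post; the function names are hypothetical, and the linear formula is only claimed for the CYCLES values that were actually observed to transition.

```python
CLOCK_NS = 12.5      # one clock period at 80 MHz
BASELINE_NS = 150.0  # high period with the waitcnt commented out

def high_phase_ns(cycles):
    """Measured high-phase period reported above, valid for CYCLES >= 9."""
    if cycles < 9:
        raise ValueError("CYCLES in 0..8 did not transition within 500 ms")
    return 212.5 + (cycles - 8) * CLOCK_NS

def waitcnt_clocks(cycles):
    """Clocks attributable to waitcnt alone for a given CYCLES value."""
    return (high_phase_ns(cycles) - BASELINE_NS) / CLOCK_NS

print(high_phase_ns(9), waitcnt_clocks(9))
print(high_phase_ns(11), waitcnt_clocks(11))
```

With CYCLES=9 the model gives the 225 ns high phase and the 6-clock best case quoted above.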

    I'm not sure I can tell, from this data, what the resume latency of waitcnt is. It's made more difficult by the ill-defined latching behavior of operands. I can say with confidence that it's between 2c and 3c, and judging from the behavior of waitpeq/waitpne I suspect 2c -- but again, my data is inconclusive.

    If anyone has suggestions for improving my methods, lemme know.

    Edit: Actually, I think the data may suggest an event-to-resume latency of three cycles, at least for waitcnt.

    Post Edited (Cliff L. Biffle) : 5/15/2008 5:49:11 AM GMT
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-05-15 05:47
    I find it interesting that the minimum delays observed for holding a pin in the desired state are consistent with those for transitioning a pin to the desired state. Since the measured delay is from the edge detected at the pin, I would expect the latter to be a further cycle or two delayed due to input synchronization. I haven't found any references in the docs to indicate how many stages of synchronization the Prop employs, though.

    -Phil
  • Cliff L. Biffle Posts: 206
    edited 2008-05-15 06:26
    Personal correspondence from a couple years back suggested that external inputs were latched within the corresponding fetch cycle (cycle 0 or 1 of the instruction), but I haven't tested it. I believe that this is supported by my PFcam driver for the OV6620, which assumes no input buffering or propagation delay (and works fine).

    I'll work out a way to test this.
  • Ale Posts: 2,363
    edited 2008-05-15 08:26
    I did similar experiments, at 80 MHz clock, but using a slightly modified loop, i.e. not writing the variable before the waitcnt in case the pipeline gets *somehow* stalled, something that should not happen :), and it does not happen.
    org
    
    c5_start      mov       DIRA,k5_pin
                  mov       OUTA,#0
                  
    c5_lbl0
                  mov       k5_temp,CNT
                  add       k5_temp,#18
                  or        OUTA,k5_pin
                  nop
                  waitcnt   k5_temp,#0
                  andn      OUTA,k5_pin
                  jmp       #c5_lbl0             
    
    k5_temp       long 0
    k5_pin        long $8000
    
    
    



    Well, seems that 17 is the minimum (or it waits 2^32-16 or so cycles). If I remove the nop... 13 as expected is the minimum.
    [Attached scope capture: 320 x 234]
  • kuroneko Posts: 3,623
    edited 2008-05-15 09:20
    Ale said...
    Well, seems that 17 is the minimum (or it waits 2^32-16 or so cycles). If I remove the nop... 13 as expected is the minimum.

    Well, that just proves to me that a NOP takes 4 cycles ... (reading 17 and 13 as replacement for #18 in your code, correct me if I'm wrong).

    Also, assuming the picture shows the behaviour of the code you posted (#18), then I can see the following:

    - the 190ns are roughly 15 cycles (at 80MHz)
    - 8 are consumed by the NOP and the toggle
    - that leaves 7 cycles for the waitcnt

    You also stated that #17 is the minimum for waitcnt not to stall for a very long time, which makes me believe that the pulse width could go down to 14 cycles which still leaves 6 for waitcnt. Would it be possible to take a measurement for case #17?

    Post Edited (kuroneko) : 5/15/2008 9:32:09 AM GMT
  • Ale Posts: 2,363
    edited 2008-05-15 11:10
    Sure,

    here it is for 17 cycles. 16 sends it to the no trigger version ;)
    [Attached scope capture: 320 x 234]
  • stevenmess2004 Posts: 1,102
    edited 2008-05-15 11:16
    This is weird because I'm sure that Paul needed the 1 cycle offset in his high speed serial object.
  • kuroneko Posts: 3,623
    edited 2008-05-15 11:17
    Thanks. Seems that pipeline stalls don't come into it. Well, I can live with 6+ but there must be a reason that the docs mention 5+ ...
  • stevenmess2004 Posts: 1,102
    edited 2008-05-15 12:09
    With the 17 we are getting 5+ for the waitcnt. (17-4-4-4=5)

    Don't know what's going on with the other waits though. What happens with a waitvid?
  • kuroneko Posts: 3,623
    edited 2008-05-15 12:20
    stevenmess2004 said...
    With the 17 we are getting 5+ for the waitcnt. (17-4-4-4=5)

    What do you mean by 17? CNT adjustment? From the last posted scope screen I count 14 cycles which is 4 + 4 + 6.
  • stevenmess2004 Posts: 1,102
    edited 2008-05-15 12:59
    True, sorry. But I wonder if that is where the 5+ came from? Would also be interesting to change the wait instruction to something like this
    waitcnt OUTA,%111
    



    Don't know if it will do anything useful but could be interesting.
  • Cliff L. Biffle Posts: 206
    edited 2008-05-15 20:24
    Steven, Kuroneko's right -- your test fixture is identical to mine with an additional 8 cycle latency. So 17 cycles is the result I would have predicted (9 + 8).
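A short sketch of that accounting, in Python. It assumes each ordinary instruction costs 4 clocks, that waitcnt costs at least 6, and that each extra clock of offset lengthens the wait by one clock; the function names are made up for illustration and the pulse model covers only the loop Ale posted.

```python
INSTR_CLOCKS = 4     # assumed cost of each ordinary instruction
BASE_MIN_OFFSET = 9  # smallest working CYCLES in the mov/add/waitcnt fixture
MIN_WAITCNT = 6      # best-case waitcnt cost measured earlier in the thread

def min_offset(extra_instructions):
    """Smallest CNT offset that avoids waiting for counter wrap-around,
    given extra 4-clock instructions between the add and the waitcnt."""
    return BASE_MIN_OFFSET + extra_instructions * INSTR_CLOCKS

def pulse_clocks(offset):
    """High-pulse width in the posted loop: nop + waitcnt + andn."""
    floor = min_offset(2)  # `or` and `nop` sit between add and waitcnt
    if offset < floor:
        raise ValueError("offset too small; waitcnt would wait for wrap-around")
    waitcnt = MIN_WAITCNT + (offset - floor)
    return INSTR_CLOCKS + waitcnt + INSTR_CLOCKS

print(min_offset(2), min_offset(1))
print(pulse_clocks(17), pulse_clocks(18))
```

Under these assumptions the minimum offsets come out as 17 with the nop and 13 without it, and the pulse widths as 14 and 15 clocks for offsets of 17 and 18, which is what the scope captures were read as showing.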

    Having waitcnt write directly to the output latch would be interesting to verify the latching behavior of the Writeback stage. I haven't tested it specifically, but I believe my measurements show the Writeback affecting outputs within the final cycle of the instruction.

    If I've miscalculated the waitpne resume latency it might be +1 cycle however. I can see a test fixture for this.

    I'll ping some of the Parallax guys directly on this -- there's only so much reverse engineering I want to do when I'm not getting paid to do it. :)


    The good news is, using the timings calculated in my last post, I've now got perfect phase lock in my OV6620 driver -- despite the mutually prime clock frequencies (17.73MHz vs. 96MHz).
  • kuroneko Posts: 3,623
    edited 2008-05-16 00:18
    Cliff L. Biffle said...
    Having waitcnt write directly to the output latch would be interesting to verify the latching behavior of the Writeback stage. I haven't tested it specifically, but I believe my measurements show the Writeback affecting outputs within the final cycle of the instruction.

    http://forums.parallax.com/showthread.php?p=720785 may be of interest to you. It helped me :)
  • Cliff L. Biffle Posts: 206
    edited 2008-05-16 22:39
    Yeah, I've read that before. After the timing error, though, I'm trying to measure rather than assume.
  • Paul Baker Posts: 6,351
    edited 2008-05-19 22:46
    I have verified that WAITPxx does indeed take 6+ cycles to execute; we will be updating our documents to reflect this. Nice catch.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.
  • Cliff L. Biffle Posts: 206
    edited 2008-05-20 01:05
    Thanks for the confirmation, Paul. Do you have any insight on my original question -- namely, after the waitxxx condition is met, how long until the next instruction kicks in? Two cycles? Three?
  • kuroneko Posts: 3,623
    edited 2008-05-20 04:46
    Cliff L. Biffle said...
    Thanks for the confirmation, Paul. Do you have any insight on my original question -- namely, after the waitxxx condition is met, how long until the next instruction kicks in? Two cycles? Three?

    I was curious myself. So I put together a quick and dirty test program. This does a waitpeq followed by an OUTA modification. Two observers wait on the waitpeq pin and the modified OUTA pin respectively and then record CNT. The difference is 7. Given that OUTA happens in the result phase, the next instruction therefore starts at 7-3 cycles after the wait condition has been met, i.e. if the wait condition happens at CNT, the next instruction starts at CNT+4.
  • Paul Baker Posts: 6,351
    edited 2008-05-20 19:18
    Kuroneko, I too confirm a 7 cycle delay for an already established wait period + mov instruction, but I don't concur with your conclusion. I have discussed this behavior with another engineer and this is what we think is happening. But first a review of the Propeller Pipeline:
    IdSDeR                 1st instruction
        IdSDeR             2nd instruction
            IdSDeR         3rd instruction
     
    Where 
    I = instruction fetch
    d = decode
    S = fetch source
    D = fetch destination
    e = execute
    R = write result
     
    

    When a waitxxx instruction is executed, the pipeline stalls. Now we know that the output of the mov is done in the result stage. If we count back 7 cycles that places us in the D stage of the waitcnt, but this doesn't make sense because this is not a logical place for the waitcnt to stall (whereas stalling at stage e makes sense).

    So a further examination of the waitxxx behavior is needed:

    The following is a representation of the pin state that a waitpxx instruction is waiting on. The | represents a clock boundary, a _ represents no change of the pin, and a / represents a change of the pin.

     
    |_|_|/|_|_|
    1 2 3 4 5 6
     
    

    As you can see, the event happens between cycles 3 and 4. At clock cycle 4 the waitpxx registers a true condition, but this does not immediately restart the pipeline, because the true condition must be given time to propagate throughout the cog. Therefore the cog does not restart the pipeline until cycle 5. When the pipeline restarts it will be at the R stage of the waitxxx and the d stage of the following instruction. Adding the 2 cycles needed to register and propagate the true condition to the 5 remaining stages of the next instruction gives 7 cycles.

    While I am not absolutely certain this is precisely what's going on, it is a good enough explanation of the observed behavior that it should suffice to think of waitxxx's behavior in this way.

    One further note: the tests that we both did were synchronous in nature. For asynchronous tests the response time will be in (6,7], that is, anything between 6 and 7 cycles, but not including 6 itself (i.e. any marginal value above 6, such as 6.02).
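One way to read the (6,7] range is as a sampling delay of up to one clock plus a fixed six clocks. The Python sketch below is an interpretation of the explanation above, not a confirmed description of the silicon; `response_clocks` and `edge_phase` are hypothetical names.

```python
def response_clocks(edge_phase):
    """Clocks from the pin edge to the next instruction's result stage.

    edge_phase is how far into a clock period the edge arrives,
    with 0 <= edge_phase < 1 (0 means exactly on a clock boundary).
    """
    if not 0.0 <= edge_phase < 1.0:
        raise ValueError("edge_phase must be in [0, 1)")
    sample_wait = 1.0 - edge_phase  # until the next boundary registers the edge
    propagate = 1                   # one clock for the condition to propagate
    remaining_stages = 5            # d, S, D, e, R of the following instruction
    return sample_wait + propagate + remaining_stages

print(response_clocks(0.0))   # synchronous case
print(response_clocks(0.98))  # edge arriving just before a boundary
```

In this reading an edge aligned with the clock gives the full 7 clocks seen in the synchronous tests, while an edge arriving late in a period approaches, but never reaches, 6.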

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.

    Post Edited (Paul Baker (Parallax)) : 5/20/2008 7:55:04 PM GMT
  • Cluso99 Posts: 18,069
    edited 2008-05-20 23:17
    I released a DataLogger that can read all pins for 1880 clock cycles (yes, ALL clock cycles @ 12.5ns at 80MHz). It uses 4 cogs to do this and all 8 are used in my demo.

    But one cog is doing the presetting of the triggering point and this is easily modified to place whatever instructions you like after the initial set of pin2. If I recall, there is about 15 clocks of latency before the sampling takes effect, but you can see that from the dataset. Place your code once Pin2 starts toggling. Enjoy :)

    I am just about to drive from Sydney to Brisbane (actually Gosford to Gold Coast for those who know Australia) so will be out of action for a few days. I can then have a look at the timings.
  • kuroneko Posts: 3,623
    edited 2008-05-20 23:23
    Paul Baker (Parallax) said...
    Kuroneko, I too confirm a 7 cycle delay for an already established wait period + mov instruction, but I don't concur with your conclusion.

    Thanks for this useful piece of information. I agree that I could always be wrong as I can only observe so much (not knowing the internals). While investigating a different problem I noticed that the stall for a waitcnt would end up in the D phase which - as you put it - doesn't make much sense (given that there are valid S and D registers to be fetched).

    Post Edited (kuroneko) : 5/21/2008 3:08:17 AM GMT
  • Cliff L. Biffle Posts: 206
    edited 2008-05-23 03:36
    Paul, your description of the pipeline implies that the next instruction's source value is latched before the waitpeq instruction waits -- which could obviously be a long time.

    This should be easy enough to verify, and could actually be useful (to e.g. detect pin changes).

    Stalling between the completion of the S and D stages matches my measurements, I think. Nice to see that your description matches the pipeline model I inferred a couple years back (www.cliff.biffle.org/software/propeller/notes.html).
  • Cluso99 Posts: 18,069
    edited 2008-05-23 11:26
    Attached is a timing diagram (sample_43A.spin) of waitpeq/waitpne instructions and mov outa,#x. I have added the IdSDER info as provided by Paul and my observations confirm his statements. The source code of the instructions is also attached.

    The reason the waitpeq takes 6 cycles is because the instruction pipeline is flushed. Therefore, the pin sampling takes place on the E cycle for the waitpeq instruction. The mov outa,#x writes to the output pin during the R cycle, as stated by Paul.

    The clocking diagram is output by my program and saved by Hyperterminal into a compatible *.spin file.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-05-23 16:43
    This is all conjecture on my part, but Paul mentioned the pipeline stalling, not flushing. There's really no reason to flush the pipe, since the next instruction never changes. It would make more sense, I think, to assume that the execution phase of the WAIT instructions requires at least three clocks to complete. Moreover, I'm guessing that the execution phase itself is pipelined, overlapping latching of the inputs, ANDing with the source, and comparing with the destination. OTOH, the ANDing and comparison could be done as a Boolean in one step, with an additional step required for deciding the next state. That the execution phase be pipelined (assuming it's multi-step) is necessary; otherwise, the timing granularity would be greater than one clock.

    Here's a diagram that illustrates my conjecture:

    IdSDeR              OR      x, y
        IdSDlac         WAITPEQ one, pin10
             lac        'Latch inputs, AND source, compare with destination.
              lac
               lacR
            I????IdSDeR ADD     a, b
            
            ...or...
            
    IdSDeR              OR      x, y
        IdSDlcd         WAITPEQ one, pin0
             lcd        'Latch inputs, AND and compare, decide next state.
              lcd
               lcdR
            I????IdSDeR ADD     a, b
    
    
    


    What's unknowable is where, in the execute pipeline, the next instruction is read: at the beginning, at the end, or multiple times at each stage? This uncertainty is represented by the question marks above. But, in any event, it's read before the result stage of the WAIT, so the actual mechanism is unimportant.

    This hypothesis, compared with Cluso's, is only testable if the WAITPEQ/WAITPNE instruction can be forced (via "wr") to write something to the destination different from what was there before. The destination address would have to be the address of the next instruction. Then we could tell whether that instruction was fetched before (my hypothesis) or after (Cluso's hypothesis) the result of the WAIT was written. I haven't tried a WAITPEQ/WAITPNE with a forced "wr", so I don't know whether it can be forced to write something. If it can, what it writes could well be the result of the AND, in which case a WAITPNE would be capable of changing the destination to a different value.

    For the WAITCNT instruction, this test is much easier, since there is a known result to write. So I tried the following:

    CON
    
      _clkmode = xtal1 + pll16x
      _xinfreq = 5_000_000
    
    PUB start        
    
      cognew(@waittest, 0)
    
    DAT
    
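    waittest      mov       dira,#3             ' make A0 and A1 outputs
                  waitcnt   nexti, #1           ' wait for CNT = nexti, then write nexti + 1 back
    nexti         mov       outa, #2            ' reads as mov outa, #3 once incremented
                  jmp       #nexti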
    
    
    


    As a result of the WAITCNT, nexti should be incremented by one, which would cause both A0 and A1 to rise simultaneously, assuming the result gets written before the MOV instruction is fetched. In fact, A1 rises 100ns ahead of A0, which indicates that the pipe was never flushed during the WAITCNT, but merely stalled.
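The two outcomes can be compared with a small calculation. This Python sketch assumes mov and jmp each take 4 clocks and an 80 MHz system clock (5 MHz crystal with pll16x); `a0_lag_ns` is a hypothetical helper and models only which copy of nexti is executed first.

```python
CLOCK_NS = 12.5   # one clock period at 80 MHz
INSTR_CLOCKS = 4  # assumed cost of mov and of jmp

def a0_lag_ns(fetched_before_writeback):
    """How long A0 rises after A1 under each hypothesis."""
    # Source value seen the first time nexti executes.
    first_source = 2 if fetched_before_writeback else 3
    if first_source & 1:
        return 0.0  # bit 0 is already set, so A0 and A1 rise together
    # Otherwise A0 waits for the jmp and the second mov, which now reads #3.
    return 2 * INSTR_CLOCKS * CLOCK_NS

print(a0_lag_ns(True))   # stalled pipe: stale instruction already fetched
print(a0_lag_ns(False))  # flushed pipe: incremented instruction fetched
```

Only the stalled case produces a lag of two instruction times, which matches the 100ns observed.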

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 5/23/2008 4:50:29 PM GMT
  • kuroneko Posts: 3,623
    edited 2009-06-19 00:52
    Some after breakfast findings:
    • waitpeq with wr set performs an add dst, src
    • the next instruction is not affected
    • mask value (not location) is fetched during Execution stage (same as mov dst, src)
    • pins are first sampled in the stage after that (L)
    • result is written - if applicable - in the last cycle
    Which boils down to something like this (for running straight through):

    sdEL?R
    


    Not sure about cycle 5, maybe it is used for decision making as Phil suggests. Seems odd though because sample resolution is 1 cycle.