Waitpeq/waitpna resume timing

Cliff L. Biffle · 2008-05-09 08:11

I'm extending my instruction scheduler to handle external events, and I was wondering if anyone could help me with the precise timing behavior of waitpne and waitpeq on the P8X32A.

I'm dealing with some particularly tricky bus interface code, so the specifics have suddenly become important.

I don't have the bench equipment I need to answer this question on my own.

In what follows, I make some assumptions: INAx is the single pin we're monitoring; we're waiting for it to go high; we're using waitpeq. If the timing can vary in other cases (such as high-to-low transitions or waitpne), please let me know.

If INAx is held high, waitpeq should take 5 cycles.

If INAx is held low, waitpeq should spin indefinitely.

But if INAx transitions during execution of waitpeq, how soon is it noticed? Or, put another way, what hold time is required before the next instruction is allowed to execute?

If I've been unclear, tell me and I'll rephrase or elaborate. I'll add the results to my Propeller timing info page after testing them. For my application I need the timing known to within about 0.5clk, but if anyone has better it'd be great.

Thanks!

Edit: Just to clarify, I've seen a post from Paul Baker about a year ago that specifies that the next instruction executes on the next cycle after the condition is true, which almost answers my question -- but I'm hoping someone (possibly Paul) can be more specific about hold times for the comparison.

Post Edited (Cliff L. Biffle) : 5/9/2008 8:19:09 AM GMT

kuroneko · 2008-05-09 09:09

Cliff L. Biffle said...
If INAx is held high, waitpeq should take 5 cycles.

FWIW, it's 6 cycles. http://forums.parallax.com/showthread.php?p=722987

Ken Peterson · 2008-05-09 12:33

Perhaps someone from Parallax can chime in on why the data sheet says 5+, and exactly how many clock cycles after the transition the next instruction executes?

Cliff L. Biffle · 2008-05-09 15:54

kuroneko said...

Cliff L. Biffle said...
If INAx is held high, waitpeq should take 5 cycles.

FWIW, it's 6 cycles. http://forums.parallax.com/showthread.php?p=722987

I notice you haven't gotten confirmation from Parallax on this.

Ken Peterson said...
Perhaps someone from Parallax can chime in on why the data sheet says 5+, and exactly how many clock cycles after the transition the next instruction executes?

Parallax has chimed in on this before; Paul Baker said:
"To clarify a point, waitpeq is deterministic with respect to the pin state. The reason it is listed as 5+ is it takes 4 clocks to process the instruction, plus however many cycles of compare necessary to achieve the wait state. If the value is true at the beginning it will take 5 cycles since only one compare cycle occurs. For situations where more than one compare cycle occurs, the next instruction begins execution on the next clock cycle after a comparison evaluates true."

(http://forums.parallax.com/showthread.php?p=656005)

Kuroneko's statement would seem to conflict with that, however.

kuroneko · 2008-05-10 06:05

Cliff L. Biffle said...
Kuroneko's statement would seem to conflict with that, however.

Well, I wish it was 5+ and I simply overlooked something. But as mentioned in the other thread, even with the monitored pin being static I get a 6 cycle delay (as opposed to having the equation always resolve to true and thereby ignoring the pin state altogether).

Harley · 2008-05-14 19:12

Funny, just the other day I measured an additional 2 cycles on top of the 4 after the leading edge of a true state for WAITPEQ.

The instruction is set up well before the input pulse appears. Using ViewPort and some 'test pulses' on another I/O to mark events, I see a 100 nsec delay between the awaited pulse and a 'test pulse' (running at 80 MHz). I never see a total of 5 cycles, but a constant 6, costing another 50 nsec of response. LIFE!!!

What might the condition have to be to see a 5 clock response?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Harley Shanko

Cliff L. Biffle · 2008-05-14 23:40

Well, with no word from Parallax (though I haven't contacted them directly), I've got a Proto Board on the bench with my scope. I'll post traces when I've got 'em.

Cliff L. Biffle · 2008-05-15 05:20

Here's my preliminary report. Kuroneko appears to be correct -- I've not been able to get any wait* instruction to take less than 6 cycles.

Bench configuration:
- P8X32A (LQFP) on Parallax Proto Board. Date code 0641.
- Regulator producing 3.6V (slightly out of spec)
- Crystal stable at 80MHz.
- Tektronix TDS1012B.

Automated test suites run using propasm, make, and Remy Blank's Loader.py. I've omitted the clock configuration directives from the source here.

Baseline: toggle.pa

toggle
  mov DIRA, MASK
:loop
  or OUTA, MASK
  andn OUTA, MASK
  jmp #:loop

MASK long $FFFFFFFF

As expected this shows a 33% duty cycle, 50ns (4 cyc) high, 100ns (8c) low.
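The cycle counts quoted alongside these measurements follow directly from the clock period. Below is a minimal Python sketch of that conversion, assuming the 80MHz clock from the bench configuration and 4 clocks per ordinary instruction. The helper names are illustrative and do not come from the original test suite.

```python
# Assumption: 80 MHz system clock, as listed in the bench configuration.
CLOCK_HZ = 80_000_000
NS_PER_CYCLE = 1e9 / CLOCK_HZ  # 12.5 ns per clock


def cycles_to_ns(cycles):
    """Convert a number of system clocks to nanoseconds."""
    return cycles * NS_PER_CYCLE


def ns_to_cycles(ns):
    """Convert a scope measurement in nanoseconds to system clocks."""
    return ns / NS_PER_CYCLE


# Baseline loop: the pins stay high for one instruction (andn, 4 clocks)
# and low for two (jmp + or, 8 clocks).
high_cycles = 4
low_cycles = 8
duty = high_cycles / (high_cycles + low_cycles)

print(cycles_to_ns(high_cycles), cycles_to_ns(low_cycles), round(duty * 100))
```

The same helpers turn any later scope reading into clocks; for instance, a 75ns increase corresponds to 6 clocks.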

Testing waitpeq
For this run P0 was pulled directly to ground.

waitgnd
  or DIRA, OUTPUT
:loop
  or OUTA, OUTPUT
  waitpeq ZERO, INPUT
  andn OUTA, OUTPUT
  jmp #:loop

OUTPUT long $0000FF00
INPUT long $00000001
ZERO long 0

The high phase of the signal has grown by 75ns/6c. This supports Kuroneko's hypothesis that the fastest possible invocation of waitpeq will take 6c.

Testing waitpne and resume latency
For this test, I added a switch to pull P0 to rail. Channel 2 was tied to P0 with a rising-edge trigger at 1.63V. The chip was allowed to idle at waitpne for several seconds before P0 went high.

waithi
  or DIRA, OUTPUT
:loop
  or OUTA, OUTPUT
  waitpne ZERO, INPUT
  andn OUTA, OUTPUT
  jmp #:loop

OUTPUT    long $0000FF00
INPUT     long $00000001
ZERO      long 0

This one answers my original question: there appears to be a minimum latency of two cycles from the time that a wait condition is satisfied to the time that the next instruction begins execution. If anyone else's data fail to support this, I can do a more thorough test using a random external source and the next instruction's input latching characteristics.

I repeated this test several times and saw 75ns/2c - 80ns most of the time. If my model is correct the latency could approach 87.5ns/3c, but the data failed to support this thus far.

Testing waitcnt

While I was at it I constructed a test fixture for waitcnt. A script repeatedly recompiled the code below with different values for {CYCLES} and recorded the traces that resulted at P15.

timecnt
  mov DIRA, OUTPUT
:loop
  or OUTA, OUTPUT
  mov time, CNT
  add  time, #{CYCLES}
  waitcnt time, #0
  andn OUTA, OUTPUT
  jmp #:loop

time      long 0
OUTPUT    long $0000FF00

Values of CYCLES in 0..8 (inclusive) failed to transition in the time allotted (500ms).

Values of 9 and greater had high-phase periods of 212.5ns + (CYCLES - 8) * 12.5ns, and the expected 100ns low period.

Here's the trace for CYCLES=11:

Commenting out the waitcnt instruction gave a high period of 150ns, giving a best-case waitcnt time of 225ns - 150ns = 75ns, or 6c. This supports Kuroneko's hypothesis.
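The arithmetic behind that 6c figure can be checked with a short Python sketch. It assumes the 80MHz clock and takes the high-phase formula and the 150ns baseline straight from the measurements above; the function and constant names are made up for illustration.

```python
NS_PER_CYCLE = 12.5   # assumed 80 MHz clock
BASELINE_NS = 150.0   # high period measured with waitcnt commented out
MIN_CYCLES = 9        # smallest CYCLES value that did not stall


def high_phase_ns(cycles):
    """High-phase width reported for a given CYCLES value (valid for 9 and up)."""
    if cycles < MIN_CYCLES:
        raise ValueError("CYCLES below 9 missed the target and never transitioned")
    return 212.5 + (cycles - 8) * NS_PER_CYCLE


# Best case: shortest high phase minus the baseline without waitcnt.
best_waitcnt_ns = high_phase_ns(MIN_CYCLES) - BASELINE_NS
best_waitcnt_cycles = best_waitcnt_ns / NS_PER_CYCLE

print(best_waitcnt_ns, best_waitcnt_cycles)
```

The 150ns baseline is itself 12 clocks, which matches the three 4-clock instructions (mov, add, andn) that execute while the output is high.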

I'm not sure I can tell, from this data, what the resume latency of waitcnt is. It's made more difficult by the ill-defined latching behavior of operands. I can say with confidence that it's between 2c and 3c, and judging from the behavior of waitpeq/waitpne I suspect 2c -- but again, my data is inconclusive.

If anyone has suggestions for improving my methods, lemme know.

Edit: Actually, I think the data may suggest an event-to-resume latency of three cycles, at least for waitcnt.

Post Edited (Cliff L. Biffle) : 5/15/2008 5:49:11 AM GMT

Phil Pilgrim (PhiPi) · 2008-05-15 05:47

I find it interesting that the minimum delays observed for holding a pin in the desired state are consistent with those for transitioning a pin to the desired state. Since the measured delay is from the edge detected at the pin, I would expect the latter to be a further cycle or two delayed due to input synchronization. I haven't found any references in the docs to indicate how many stages of synchronization the Prop employs, though.

-Phil

Cliff L. Biffle · 2008-05-15 06:26

Personal correspondence from a couple years back suggested that external inputs were latched within the corresponding fetch cycle (cycle 0 or 1 of the instruction), but I haven't tested it. I believe that this is supported by my PFcam driver for the OV6620, which assumes no input buffering or propagation delay (and works fine).

I'll work out a way to test this.

Ale · 2008-05-15 08:26

I did similar experiments, at 80 MHz clock, but using a slightly modified loop, i.e. not writing the variable right before the waitcnt in case the pipeline gets *somehow* stalled, something that should not happen, and it does not happen.

org

c5_start      mov       DIRA,k5_pin
              mov       OUTA,#0
              
c5_lbl0
              mov       k5_temp,CNT
              add       k5_temp,#18
              or        OUTA,k5_pin
              nop
              waitcnt   k5_temp,#0
              andn      OUTA,k5_pin
              jmp       #c5_lbl0             

k5_temp       long 0
k5_pin        long $8000

Well, seems that 17 is the minimum (or it waits 2^32-16 or so cycles). If I remove the nop... 13 as expected is the minimum.

kuroneko · 2008-05-15 09:20

Ale said...
Well, seems that 17 is the minimum (or it waits 2^32-16 or so cycles). If I remove the nop... 13 as expected is the minimum.

Well, that just proves to me that a NOP takes 4 cycles ... (reading 17 and 13 as replacement for #18 in your code, correct me if I'm wrong).

Also, assuming the picture shows the behaviour of the code you posted (#18), then I can see the following:

- the 190ns are roughly 15 cycles (at 80MHz)
- 8 are consumed by the NOP and the toggle
- that leaves 7 cycles for the waitcnt

You also stated that #17 is the minimum for waitcnt not to stall for a very long time, which makes me believe that the pulse width could go down to 14 cycles which still leaves 6 for waitcnt. Would it be possible to take a measurement for case #17?

Post Edited (kuroneko) : 5/15/2008 9:32:09 AM GMT

Ale · 2008-05-15 11:10

Sure,

here it is for 17 cycles. 16 sends it to the no trigger version

stevenmess2004 · 2008-05-15 11:16

This is weird because I'm sure that Paul needed the 1 cycle offset in his high speed serial object.

kuroneko · 2008-05-15 11:17

Thanks. Seems that pipeline stalls don't come into it. Well, I can live with 6+ but there must be a reason that the docs mention 5+ ...

stevenmess2004 · 2008-05-15 12:09

With the 17 we are getting 5+ for the waitcnt. (17-4-4-4=5)

Don't know what's going on with the other waits though. What happens with a waitvid?

kuroneko · 2008-05-15 12:20

stevenmess2004 said...
With the 17 we are getting 5+ for the waitcnt. (17-4-4-4=5)

What do you mean by 17? CNT adjustment? From the last posted scope screen I count 14 cycles which is 4 + 4 + 6.

stevenmess2004 · 2008-05-15 12:59

True, sorry. But I wonder if that is where the 5+ came from? Would also be interesting to change the wait instruction to something like this

waitcnt OUTA,%111

Don't know if it will do anything useful but could be interesting.

Cliff L. Biffle · 2008-05-15 20:24

Steven, Kuroneko's right -- your test fixture is identical to mine with an additional 8 cycle latency. So 17 cycles is the result I would have predicted (9 + 8).

Having waitcnt write directly to the output latch would be interesting to verify the latching behavior of the Writeback stage. I haven't tested it specifically, but I believe my measurements show the Writeback affecting outputs within the final cycle of the instruction.

If I've miscalculated the waitpne resume latency it might be +1 cycle however. I can see a test fixture for this.

I'll ping some of the Parallax guys directly on this -- there's only so much reverse engineering I want to do when I'm not getting paid to do it.

The good news is, using the timings calculated in my last post, I've now got perfect phase lock in my OV6620 driver -- despite the mutually prime clock frequencies (17.73MHz vs. 96MHz).

kuroneko · 2008-05-16 00:18

Cliff L. Biffle said...
Having waitcnt write directly to the output latch would be interesting to verify the latching behavior of the Writeback stage. I haven't tested it specifically, but I believe my measurements show the Writeback affecting outputs within the final cycle of the instruction.

http://forums.parallax.com/showthread.php?p=720785 may be of interest to you. It helped me :)

Cliff L. Biffle · 2008-05-16 22:39

Yeah, I've read that before. After the timing error, though, I'm trying to measure rather than assume.

Paul Baker · 2008-05-19 22:46

I have verified that WAITPxx does indeed take 6+ cycles to execute; we will be updating our documents to reflect this. Nice catch.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Cliff L. Biffle · 2008-05-20 01:05

Thanks for the confirmation, Paul. Do you have any insight on my original question -- namely, after the waitxxx condition is met, how long until the next instruction kicks in? Two cycles? Three?

kuroneko · 2008-05-20 04:46

Cliff L. Biffle said...
Thanks for the confirmation, Paul. Do you have any insight on my original question -- namely, after the waitxxx condition is met, how long until the next instruction kicks in? Two cycles? Three?

I was curious myself. So I put together a quick and dirty test program. This does a waitpeq followed by an OUTA modification. Two observers wait on the waitpeq pin and the modified OUTA pin respectively and then record CNT. The difference is 7. Given that OUTA happens in the result phase, the next instruction therefore starts at 7-3 cycles after the wait condition has been met, i.e. if the wait condition happens at CNT, the next instruction starts at CNT+4.

Paul Baker · 2008-05-20 19:18

Kuroneko, I too confirm a 7 cycle delay for an already established wait period + mov instruction, but I don't concur with your conclusion. I have discussed this behavior with another engineer and this is what we think is happening. But first a review of the Propeller Pipeline:

IdSDeR                 1st instruction
    IdSDeR             2nd instruction
        IdSDeR         3rd instruction
 
Where 
I = instruction fetch
d = decode
S = fetch source
D = fetch destination
e = execute
R = write result

When a waitxxx instruction is executed, the pipeline stalls. Now we know that the output of the mov is done in the result stage. If we count back 7 cycles that places us in the D stage of the waitcnt, but this doesn't make sense because this is not a logical place for the waitcnt to stall (whereas stalling at stage e makes sense).

So a further examination of the waitxxx behavior is needed:

The following is a representation of the state of a pin that a waitpxx instruction is waiting on. A | represents a clock boundary, a _ represents no change of the pin, and a / represents a change of the pin.

 
|_|_|/|_|_|
1 2 3 4 5 6

As you can see, the event happens between cycles 3 and 4. At clock cycle 4 the waitpxx registers a true condition, but this does not immediately restart the pipeline, because the true condition must be given time to propagate throughout the cog. Therefore the cog does not restart the pipeline until cycle 5. When the pipeline restarts it will be at the R stage of the waitxxx and the d stage of the following instruction. Adding the 2 cycles needed to register and propagate the true condition to the 5 remaining stages of the next instruction gives 7 cycles.

While I am not absolutely certain this is precisely what's going on, it is a good enough explanation of observed behavior that it should suffice to think of waitxxx's behavior in this way.

One further note: the tests that we both did were synchronous in nature. For an asynchronous test, the response time will be in (6,7], that is, anything between 6 and 7 cycles, not including 6 itself (i.e. any marginal value above 6, such as 6.02).
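The (6,7] interval can be illustrated with a toy model. The Python sketch below is an assumption made for illustration, not a description of the silicon: it supposes the pin is sampled on clock edges, that a change landing exactly on an edge is only seen at the following edge, and that a fixed 6 clocks elapse from the sampling edge to the output change of the next instruction.

```python
import math

# Assumed constant: clocks from the sampling edge to the next instruction's
# visible result. Chosen so that a synchronous event yields the observed 7.
FIXED_CYCLES_AFTER_SAMPLE = 6


def response_cycles(event_time):
    """Clocks from a pin change to the result of the next instruction.

    event_time is measured in clock periods; integer values fall exactly
    on a clock edge and are treated as caught by the following edge.
    """
    sampling_edge = math.floor(event_time) + 1
    return (sampling_edge - event_time) + FIXED_CYCLES_AFTER_SAMPLE


for t in (3.0, 3.25, 3.5, 3.98):
    print(t, response_cycles(t))
```

Under these assumptions an event aligned with the clock gives exactly 7 clocks, while an event arriving just before an edge approaches 6 clocks without reaching it.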

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Post Edited (Paul Baker (Parallax)) : 5/20/2008 7:55:04 PM GMT

Cluso99 · 2008-05-20 23:17

I released a DataLogger that can read all pins for 1880 clock cycles (yes, ALL clock cycles @ 12.5nS with 80MHz). It uses 4 cogs to do this and all 8 are used in my demo.

But one cog is doing the presetting of the triggering point and this is easily modified to place whatever instructions you like after the initial set of pin2. If I recall, there is about 15 clock latency before the sampling takes effect but you can see that from the dataset. Place your code once Pin2 starts toggling. Enjoy

I am just about to drive from Sydney to Brisbane (actually Gosford to Gold Coast for those who know Australia) so will be out of action for a few days. I can then have a look at the timings.

kuroneko · 2008-05-20 23:23

Paul Baker (Parallax) said...
Kuroneko, I too confirm a 7 cycle delay for an already established wait period + mov instruction, but I don't concur with your conclusion.

Thanks for this useful piece of information. I agree that I could always be wrong as I can only observe so much (not knowing the internals). While investigating a different problem I noticed that the stall for a waitcnt would end up in the D phase which - as you put it - doesn't make much sense (given that there are valid S and D registers to be fetched).

Post Edited (kuroneko) : 5/21/2008 3:08:17 AM GMT

Cliff L. Biffle · 2008-05-23 03:36

Paul, your description of the pipeline implies that the next instruction's source value is latched before the waitpeq instruction waits -- which could obviously be a long time.

This should be easy enough to verify, and could actually be useful (to e.g. detect pin changes).

Stalling between the completion of the S and D stages matches my measurements, I think. Nice to see that your description matches the pipeline model I inferred a couple years back (www.cliff.biffle.org/software/propeller/notes.html).

Cluso99 · 2008-05-23 11:26

Attached is a timing diagram (sample_43A.spin) of waitpeq/waitpne instructions and mov outa,#x. I have added the IdSDER info as provided by Paul and my observations confirm his statements. The source code of the instructions is also attached.

The reason the waitpeq takes 6 cycles is because the instruction pipeline is flushed. Therefore, the pin sampling takes place on the E cycle for the waitpeq instruction. The mov outa,#x writes to the output pin during the R cycle, as stated by Paul.

The clocking diagram is output by my program and saved by Hyperterminal into a compatible *.spin file.

Phil Pilgrim (PhiPi) · 2008-05-23 16:43

This is all conjecture on my part, but Paul mentioned the pipeline stalling, not flushing. There's really no reason to flush the pipe, since the next instruction never changes. It would make more sense, I think, to assume that the execution phase of the WAIT instructions requires at least three clocks to complete. Moreover, I'm guessing that the execution phase itself is pipelined, overlapping latching of the inputs, ANDing with the source, and comparing with the destination. OTOH, the ANDing and comparison could be done as a Boolean in one step, with an additional step required for deciding the next state. That the execution phase be pipelined (assuming it's multi-step) is necessary; otherwise, the timing granularity would be greater than one clock.

Here's a diagram that illustrates my conjecture:

IdSDeR              OR      x, y
    IdSDlac         WAITPEQ one, pin10
         lac        'Latch inputs, AND source, compare with destination.
          lac
           lacR
        I????IdSDeR ADD     a, b
        
        ...or...
        
IdSDeR              OR      x, y
    IdSDlcd         WAITPEQ one, pin0
         lcd        'Latch inputs, AND and compare, decide next state.
          lcd
           lcdR
        I????IdSDeR ADD     a, b

What's unknowable is where, in the execute pipeline, the next instruction is read: at the beginning, at the end, or multiple times at each stage? This uncertainty is represented by the question marks above. But, in any event, it's read before the result stage of the WAIT, so the actual mechanism is unimportant.

This hypothesis, compared with Cluso's, is only testable if the WAITPEQ/WAITPNE instruction can be forced (via "wr") to write something to the destination different from what was there before. The destination address would have to be the address of the next instruction. Then we could tell whether that instruction was fetched before (my hypothesis) or after (Cluso's hypothesis) the result of the WAIT was written. I haven't tried a WAITPEQ/WAITPNE with a forced "wr", so I don't know whether it can be forced to write something. If it can, what it writes could well be the result of the AND, in which case a WAITPNE would be capable of changing the destination to a different value.

For the WAITCNT instruction, this test is much easier, since there is a known result to write. So I tried the following:

CON

  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

PUB start        

  cognew(@waittest, 0)

DAT

waittest      mov       dira,#3
              waitcnt   nexti, #1
nexti         mov       outa, #2
              jmp       #nexti

As a result of the WAITCNT, nexti should be incremented by one, which would cause both A0 and A1 to rise simultaneously, assuming the result gets written before the MOV instruction is fetched. In fact, A1 rises 100ns ahead of A0, which indicates that the pipe was never flushed during the WAITCNT, but merely stalled.
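Why incrementing nexti turns mov outa, #2 into mov outa, #3 can be seen from the instruction encoding. The Python sketch below assumes the usual P8X32A field layout (6-bit opcode, ZCRI flags, 4-bit condition, 9-bit destination, 9-bit source) and an 80MHz clock; the encode helper is hypothetical and exists only for this illustration.

```python
NS_PER_CYCLE = 12.5  # assumed 80 MHz clock (5 MHz crystal with pll16x)


def encode(opcode, zcri, cond, dest, src):
    """Pack the fields of a 32-bit P8X32A instruction word."""
    return (opcode << 26) | (zcri << 22) | (cond << 18) | (dest << 9) | src


MOV = 0b101000      # MOV opcode
OUTA = 0x1F4        # cog address of the OUTA register
WRITE_IMM = 0b0011  # Z=0, C=0, R=1 (write result), I=1 (immediate source)
ALWAYS = 0b1111     # unconditional execution

nexti = encode(MOV, WRITE_IMM, ALWAYS, OUTA, 2)  # mov outa, #2
patched = (nexti + 1) & 0xFFFFFFFF               # value written back by waitcnt nexti, #1

src_before = nexti & 0x1FF    # source field as already fetched
src_after = patched & 0x1FF   # source field after the write-back

# If the stale copy runs first, A0 only rises on the second pass through
# the loop: one jmp plus one mov, 4 clocks each.
lag_ns = 2 * 4 * NS_PER_CYCLE

print(hex(nexti), src_before, src_after, lag_ns)
```

The source field occupies the low 9 bits, so the +1 only changes the immediate from 2 to 3. The computed lag of two instructions agrees with the 100ns head start observed on A1.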

-Phil

Post Edited (Phil Pilgrim (PhiPi)) : 5/23/2008 4:50:29 PM GMT

kuroneko · 2009-06-19 00:52

Some after breakfast findings:

waitpeq with wr set performs an add dst, src
the next instruction is not affected
mask value (not location) is fetched during Execution stage (same as mov dst, src)
pins are first sampled in the stage after that (L)
result is written - if applicable - in the last cycle

Which boils down to something like this (for running straight through):

sdEL?R

Not sure about cycle 5, maybe it is used for decision making as Phil suggests. Seems odd though because sample resolution is 1 cycle.
