Waitpeq/waitpne resume timing
Cliff L. Biffle
Posts: 206
I'm extending my instruction scheduler to handle external events, and I was wondering if anyone could help me with the precise timing behavior of waitpne and waitpeq on the P8X32A.
I'm dealing with some particularly tricky bus interface code, so the specifics have suddenly become important. I don't have the bench equipment I need to answer this question on my own.
Below, I make some assumptions: INAx is the single pin we're monitoring; we're waiting for it to go high; we're using waitpeq. If the timing can vary in other cases (such as high-low transitions or waitpne), please let me know.
If INAx is held high, waitpeq should take 5 cycles.
If INAx is held low, waitpeq should spin indefinitely.
But if INAx transitions during execution of waitpeq, how soon is it noticed? Or, put another way, what hold time is required before the next instruction is allowed to execute?
If I've been unclear, tell me and I'll rephrase or elaborate. I'll add the results to my Propeller timing info page after testing them. For my application I need the timing known to within about 0.5clk, but if anyone has better it'd be great.
Thanks!
Edit: Just to clarify, I've seen a post from Paul Baker about a year ago that specifies that the next instruction executes on the next cycle after the condition is true, which almost answers my question -- but I'm hoping someone (possibly Paul) can be more specific about hold times for the comparison.
Post Edited (Cliff L. Biffle) : 5/9/2008 8:19:09 AM GMT
Comments
FWIW, it's 6 cycles. http://forums.parallax.com/showthread.php?p=722987
I notice you haven't gotten confirmation from Parallax on this.
Parallax has chimed in on this before; Paul Baker said
"To clarify a point, waitpeq is deterministic with respect to the pin state. The reason it is listed as 5+ is it takes 4 clocks to process the instruction, plus however many cycles of compare necessary to achieve the wait state. If the value is true at the beginning it will take 5 cycles since only one compare cycle occurs. For situations where more than one compare cycle occurs, the next instruction begins execution on the next clock cycle after a comparison evaluates true."
(http://forums.parallax.com/showthread.php?p=656005)
Kuroneko's statement would seem to conflict with that, however.
Well, I wish it was 5+ and I simply overlooked something. But as mentioned in the other thread, even with the monitored pin being static I get a 6 cycle delay (as opposed to having the equation always resolve to true and thereby ignoring the pin state altogether).
The instruction is set up way before the input pulse appears. Using ViewPort and some 'test pulses' on another I/O pin to mark events, I see a 100 nsec delay between the awaited pulse and a 'test pulse' (running at 80 MHz). I never see a total of 5 cycles, but a constant 6, costing another 50 nsec of response time. LIFE!!!
What might the condition have to be to see a 5 clock response?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Harley Shanko
Bench configuration:
- P8X32A (LQFP) on Parallax Proto Board. Date code 0641.
- Regulator producing 3.6V (slightly out of spec)
- Crystal stable at 80MHz.
- Tektronix TDS1012B.
Automated test suites run using propasm, make, and Remy Blank's Loader.py. I've omitted the clock configuration directives from the source here.
Baseline: toggle.pa
As expected, this shows a 33% duty cycle: 50ns (4c) high, 100ns (8c) low.
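For anyone following along without a scope: the cycle counts quoted in this thread are just the measured durations divided by the clock period (12.5ns at 80MHz). Here is a minimal Python sketch of that arithmetic. The function names are made up for illustration and are not part of propasm or any Parallax tool.

```python
CLOCK_HZ = 80_000_000  # system clock from the bench configuration above


def ns_to_cycles(ns, clock_hz=CLOCK_HZ):
    """Convert a duration in nanoseconds to system clock cycles."""
    return ns * clock_hz / 1e9


def duty_cycle(high_ns, low_ns):
    """Fraction of the period spent high."""
    return high_ns / (high_ns + low_ns)


# Baseline trace: 50ns high, 100ns low
high_cycles = ns_to_cycles(50)
low_cycles = ns_to_cycles(100)
duty = duty_cycle(50, 100)
print(high_cycles, low_cycles, round(duty * 100))
```

The same conversion applies to every other figure in these measurements, e.g. 75ns corresponds to 6 cycles.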
Testing waitpeq
For this run P0 was pulled directly to ground.
The high phase of the signal has grown by 75ns/6c. This supports Kuroneko's hypothesis that the fastest possible invocation of waitpeq will take 6c.
Testing waitpne and resume latency
For this test, I added a switch to pull P0 to rail. Channel 2 was tied to P0 with a rising-edge trigger at 1.63V. The chip was allowed to idle at waitpne for several seconds before P0 went high.
This one answers my original question: there appears to be a minimum latency of two cycles from the time that a wait condition is satisfied to the time that the next instruction begins execution. If anyone else's data fail to support this, I can do a more thorough test using a random external source and the next instruction's input latching characteristics.
I repeated this test several times and saw 75ns/2c - 80ns most of the time. If my model is correct the latency could approach 87.5ns/3c, but the data have failed to support this so far.
Testing waitcnt
While I was at it I constructed a test fixture for waitcnt. A script repeatedly recompiled the code below with different values for {CYCLES} and recorded the traces that resulted at P15.
Values of CYCLES in 0..8 (inclusive) failed to transition in the time allotted (500ms).
Values of 9 and greater had high-phase periods of 212.5ns + (CYCLES - 8) * 12.5ns, and the expected 100ns low period.
Here's the trace for CYCLES=11:
Commenting out the waitcnt instruction gave a high period of 150ns, giving a best-case waitcnt time of 225ns - 150ns = 75ns, or 6c. This supports Kuroneko's hypothesis.
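The waitcnt numbers above can be checked with a few lines of Python. This is only a sketch of the fitted formula as reported in the post; it says nothing about why the chip behaves this way, and the function name is hypothetical.

```python
CYCLE_NS = 12.5          # one clock period at 80MHz
NO_WAITCNT_HIGH_NS = 150.0  # high period measured with waitcnt commented out


def high_phase_ns(cycles):
    """High-phase period reported for a given CYCLES value (9 and greater only)."""
    if cycles < 9:
        # Values 0..8 never transitioned within the 500ms window.
        raise ValueError("CYCLES below 9 did not produce a transition")
    return 212.5 + (cycles - 8) * CYCLE_NS


best_case_wait_ns = high_phase_ns(9) - NO_WAITCNT_HIGH_NS
best_case_wait_cycles = best_case_wait_ns / CYCLE_NS
print(best_case_wait_ns, best_case_wait_cycles)
print(high_phase_ns(11))  # the CYCLES=11 case shown in the trace
```

With CYCLES=9 the formula gives 225ns, so the shortest waitcnt costs 75ns, or 6 cycles, matching the figure quoted above.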
I'm not sure I can tell, from this data, what the resume latency of waitcnt is. It's made more difficult by the ill-defined latching behavior of operands. I can say with confidence that it's between 2c and 3c, and judging from the behavior of waitpeq/waitpne I suspect 2c -- but again, my data is inconclusive.
If anyone has suggestions for improving my methods, lemme know.
Edit: Actually, I think the data may suggest an event-to-resume latency of three cycles, at least for waitcnt.
Post Edited (Cliff L. Biffle) : 5/15/2008 5:49:11 AM GMT
-Phil
I'll work out a way to test this.
Well, seems that 17 is the minimum (or it waits 2^32-16 or so cycles). If I remove the nop... 13 as expected is the minimum.
Well, that just proves to me that a NOP takes 4 cycles ... (reading 17 and 13 as replacement for #18 in your code, correct me if I'm wrong).
Also, assuming the picture shows the behaviour of the code you posted (#18), then I can see the following:
- the 190ns are roughly 15 cycles (at 80MHz)
- 8 are consumed by the NOP and the toggle
- that leaves 7 cycles for the waitcnt
You also stated that #17 is the minimum for waitcnt not to stall for a very long time, which makes me believe that the pulse width could go down to 14 cycles which still leaves 6 for waitcnt. Would it be possible to take a measurement for case #17?
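The cycle accounting above can be written out as a short Python sketch. The 8-cycle overhead (NOP plus toggle) is taken from the post; rounding the scope reading to the nearest whole cycle is an assumption, and the function name is made up.

```python
CYCLE_NS = 12.5  # one clock period at 80MHz


def split_pulse(pulse_ns, overhead_cycles=8):
    """Return (total cycles, cycles left for waitcnt) for a measured pulse width.

    overhead_cycles covers the NOP (4) and the toggle (4).
    The measured width is rounded to the nearest whole cycle (assumption).
    """
    total = round(pulse_ns / CYCLE_NS)
    return total, total - overhead_cycles


print(split_pulse(190))        # the posted trace: roughly 15 cycles in total
print(split_pulse(14 * 12.5))  # a 14-cycle pulse, as conjectured for case #17
```

A 190ns pulse leaves 7 cycles for the waitcnt, and a 14-cycle pulse would leave 6.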
Post Edited (kuroneko) : 5/15/2008 9:32:09 AM GMT
Here it is for 17 cycles. 16 sends it to the no-trigger version.
Don't know what's going on with the other waits though. What happens with a waitvid?
What do you mean by 17? CNT adjustment? From the last posted scope screen I count 14 cycles which is 4 + 4 + 6.
Don't know if it will do anything useful but could be interesting.
Having waitcnt write directly to the output latch would be interesting to verify the latching behavior of the Writeback stage. I haven't tested it specifically, but I believe my measurements show the Writeback affecting outputs within the final cycle of the instruction.
If I've miscalculated the waitpne resume latency it might be +1 cycle however. I can see a test fixture for this.
I'll ping some of the Parallax guys directly on this -- there's only so much reverse engineering I want to do when I'm not getting paid to do it.
The good news is, using the timings calculated in my last post, I've now got perfect phase lock in my OV6620 driver -- despite the mutually prime clock frequencies (17.73MHz vs. 96MHz).
http://forums.parallax.com/showthread.php?p=720785 may be of interest to you. It helped me :)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
I was curious myself. So I put together a quick and dirty test program. This does a waitpeq followed by an OUTA modification. Two observers wait on the waitpeq pin and the modified OUTA pin respectively and then record CNT. The difference is 7. Given that OUTA happens in the result phase, the next instruction therefore starts at 7-3 cycles after the wait condition has been met, i.e. if the wait condition happens at CNT, the next instruction starts at CNT+4.
When a waitxxx instruction is executed, the pipeline stalls. Now we know that the output of the mov is done in the result stage. If we count back 7 cycles that places us in the D stage of the waitcnt, but this doesn't make sense because this is not a logical place for the waitcnt to stall (whereas stalling at the E stage makes sense).
So a further examination of the waitxxx behavior is needed:
The following is a representation of the pin state a waitpxx instruction is operating on. The | represents a clock boundary, a _ represents no change of the pin and a / represents a change of the pin.
As you can see, the event happens between cycle 3 and 4. At clock cycle 4 the waitpxx registers a true condition, but this does not immediately result in restarting the pipeline because the true condition must be given time to propagate throughout the cog. Therefore the cog does not restart the pipeline until cycle 5. When the pipeline restarts it will be at the R stage of the waitxxx and the D stage of the following instruction. Take the 2 cycles needed to register and propagate the true condition plus the 5 remaining stages of the next instruction and we are left with 7 cycles.
While I am not absolutely certain this is precisely what's going on, it is a good enough explanation of observed behavior that it should suffice to think of waitxxx's behavior in this way.
One further note: the tests that we both did were synchronous in nature. For an asynchronous test, the response time will be (6,7], i.e. anything between 6 and 7 cycles, but not including 6 itself (any marginal value above 6, such as 6.02).
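The (6,7] interval can be illustrated with a simplified Python model. It assumes the pin change is registered at the first clock boundary strictly after the event, followed by a fixed 6 cycles until the OUTA change becomes visible. That split is an assumption chosen to reproduce the numbers above, not a description of the actual silicon.

```python
import math


def response_cycles(event_time, fixed_cycles=6):
    """Cycles from a pin event to the observed output change.

    event_time is measured in clock cycles and may be fractional.
    Assumption: the event is registered at the first clock boundary
    strictly after it occurs, then fixed_cycles more cycles elapse.
    """
    next_edge = math.floor(event_time) + 1
    return (next_edge - event_time) + fixed_cycles


print(response_cycles(3.0))   # synchronous case: event lands on a clock boundary
print(response_cycles(3.98))  # asynchronous event just before the next boundary
```

An event aligned with the clock gives the 7 cycles measured in the synchronous tests, while an event arriving just before a boundary approaches, but never reaches, 6.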
Post Edited (Paul Baker (Parallax)) : 5/20/2008 7:55:04 PM GMT
But one cog is doing the presetting of the triggering point and this is easily modified to place whatever instructions you like after the initial set of pin2. If I recall, there is about 15 clock latency before the sampling takes effect but you can see that from the dataset. Place your code once Pin2 starts toggling. Enjoy
I am just about to drive from Sydney to Brisbane (actually Gosford to Gold Coast for those who know Australia) so will be out of action for a few days. I can then have a look at the timings.
Thanks for this useful piece of information. I agree that I could always be wrong as I can only observe so much (not knowing the internals). While investigating a different problem I noticed that the stall for a waitcnt would end up in the D phase which - as you put it - doesn't make much sense (given that there are valid S and D registers to be fetched).
Post Edited (kuroneko) : 5/21/2008 3:08:17 AM GMT
This should be easy enough to verify, and could actually be useful (to e.g. detect pin changes).
Stalling between the completion of the S and D stages matches my measurements, I think. Nice to see that your description matches the pipeline model I inferred a couple years back (www.cliff.biffle.org/software/propeller/notes.html).
The reason the waitpeq takes 6 cycles is because the instruction pipeline is flushed. Therefore, the pin sampling takes place on the E cycle for the waitpeq instruction. The mov outa,#x writes to the output pin during the R cycle, as stated by Paul.
The clocking diagram is output by my program and saved by HyperTerminal into a compatible *.spin file.
Here's a diagram that illustrates my conjecture:
What's unknowable is where, in the execute pipeline, the next instruction is read: at the beginning, at the end, or multiple times at each stage? This uncertainty is represented by the question marks above. But, in any event, it's read before the result stage of the WAIT, so the actual mechanism is unimportant.
This hypothesis, compared with Cluso's, is only testable if the WAITPEQ/WAITPNE instruction can be forced (via "wr") to write something to the destination different from what was there before. The destination address would have to be the address of the next instruction. Then we could tell whether that instruction was fetched before (my hypothesis) or after (Cluso's hypothesis) the result of the WAIT was written. I haven't tried a WAITPEQ/WAITPNE with a forced "wr", so I don't know whether it can be forced to write something. If it can, what it writes could well be the result of the AND, in which case a WAITPNE would be capable of changing the destination to a different value.
For the WAITCNT instruction, this test is much easier, since there is a known result to write. So I tried the following:
As a result of the WAITCNT, nexti should be incremented by one, which would cause both A0 and A1 to rise simultaneously, assuming the result gets written before the MOV instruction is fetched. In fact, A1 rises 100ns ahead of A0, which indicates that the pipe was never flushed during the WAITCNT, but merely stalled.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 5/23/2008 4:50:29 PM GMT
- waitpeq with wr set performs an add dst, src
- the next instruction is not affected
- mask value (not location) is fetched during Execution stage (same as mov dst, src)
- pins are first sampled in the stage after that (L)
- result is written - if applicable - in the last cycle
Which boils down to something like this (for running straight through):
Not sure about cycle 5, maybe it is used for decision making as Phil suggests. Seems odd though because sample resolution is 1 cycle.