WAITCNT vs. WAIT_CNT_PASSED — Parallax Forums

fki Posts: 10
edited 2014-09-24 23:51 in Propeller 2
The current asm op:

WAITCNT Target,Delta

halts until (cnt==Target). If the intended Target has already passed, it halts for a full cnt overflow period.
I cannot imagine a use case where this is intended. I suppose most code assumes Target to be in the near future.

Therefore I suggest an operation:

WAIT_CNT_PASSED Target

to halt until (cnt-Target) & $8000_0000 == 0
The ALU can calculate the difference with NR. The Z flag can tell later whether the Target was missed.
Open question: is it possible to wire bit 31 of the ALU output to the halting circuit?

Fabian
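
The difference between the two wait conditions can be sketched in a few lines of Python. This is only a model, assuming a free-running 32-bit counter and ignoring instruction latency; the function names are made up for illustration and are not part of any Propeller tool.

```python
MASK32 = 0xFFFFFFFF  # CNT is modeled as a free-running 32-bit counter


def target_passed(cnt: int, target: int) -> bool:
    """Proposed test: bit 31 of (cnt - target) is clear once cnt is at or past target."""
    diff = (cnt - target) & MASK32
    return (diff & 0x80000000) == 0


def equality_stall(cnt: int, target: int) -> int:
    """Cycles an equality-only wait stalls before cnt == target (latency ignored)."""
    return (target - cnt) & MASK32


if __name__ == "__main__":
    # Target missed by one cycle: the equality wait stalls for almost a full wrap.
    print(hex(equality_stall(101, 100)))
    # The sign-bit test sees the target as already passed.
    print(target_passed(101, 100))
```

Because the subtraction is taken modulo 2^32, the test gives the same answer whether or not the counter wraps between cnt and Target, at the price of treating anything more than 2^31 cycles ahead as already passed.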

Comments

  • Electrodude Posts: 1,621
    edited 2014-08-29 09:04
    If you do that then you can only wait half as long.

    How about an instruction that sets the c flag to (cnt-Target) & $8000_0000 == 0 but doesn't actually wait? This would be very useful for doing something until a timer expires.

    Maybe one big instruction that does all of this: wc means set c and return immediately instead of waiting, wz determines if it should go when cnt==Target or when (cnt-Target)>0

    electrodude
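
The "do something until a timer expires" pattern can be modeled as below. It is a sketch only: the flag-setting instruction does not exist, so `timer_expired` stands in for it, and `cycles_per_step` is a hypothetical cost of one loop iteration.

```python
MASK32 = 0xFFFFFFFF


def timer_expired(cnt: int, target: int) -> bool:
    """Stand-in for the suggested flag: true once (cnt - target) has bit 31 clear."""
    return ((cnt - target) & MASK32) < 0x80000000


def work_until(start_cnt: int, target: int, cycles_per_step: int):
    """Run fixed-cost work steps until the timer expires; return (steps, final cnt)."""
    cnt = start_cnt & MASK32
    steps = 0
    while not timer_expired(cnt, target):
        steps += 1  # one unit of useful work would happen here
        cnt = (cnt + cycles_per_step) & MASK32
    return steps, cnt


if __name__ == "__main__":
    # The deadline lies across a counter wrap; the loop still ends on time.
    print(work_until(0xFFFFFFF0, 0x10, 8))
```

The loop can overshoot the deadline by up to one step, which is the jitter such a polling approach has to accept.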
  • ErNa Posts: 1,742
    edited 2014-08-30 01:27
    I would find it useful to have a wait for an input event OR a timer event, which would allow checking for an event with a timeout.
  • fki Posts: 10
    edited 2014-09-03 01:17
    @Electrodude:
    Halving the maximum wait time is no problem IMHO, because I can use several wait instructions in a row if I plan to wait for a long time.
    Setting a flag would require a conditional jump afterwards, which means up to 8 cycles of jitter.

    If jitter is acceptable, we can do:
    mov cnt,cnt ' update shadow
    sub cnt,Target 'calc diff
    shl cnt,#1 WC 'extract Bit31

    @ErNa:
    waitpeq with timeout would be cool, but I suspect it would be difficult in the VHDL.

    Fabian
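
Chaining several waits to cover a long delay could look like the following sketch. The per-wait limit of 2^31 - 1 cycles is an assumption taken from the halved range discussed above, and both function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MAX_CHUNK = 0x7FFFFFFF  # assumed longest single wait under the halved range


def split_wait(total_cycles: int, max_chunk: int = MAX_CHUNK) -> list:
    """Split a long delay into consecutive waits of at most max_chunk cycles."""
    if total_cycles < 0:
        raise ValueError("total_cycles must be non-negative")
    chunks = []
    remaining = total_cycles
    while remaining > 0:
        step = min(remaining, max_chunk)
        chunks.append(step)
        remaining -= step
    return chunks


def chained_targets(start_cnt: int, total_cycles: int, max_chunk: int = MAX_CHUNK) -> list:
    """32-bit Target values to hand to each successive wait instruction."""
    targets = []
    target = start_cnt & MASK32
    for step in split_wait(total_cycles, max_chunk):
        target = (target + step) & MASK32
        targets.append(target)
    return targets


if __name__ == "__main__":
    print([hex(t) for t in chained_targets(0, 0x80000000)])
```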
  • evanh Posts: 15,171
    edited 2014-09-03 07:11
    As Chip indicated, there is both a performance cost and power cost when performing a subtraction compare instead of equality compare.

    I proposed a masking compromise at the bottom of that page. However, a programmable prescaler with equality compare would probably be the best improvement, since it provides the optional extended equality period while making full use of the 32-bit compare.

    The downside is that a prescaler has a global effect. This prolly won't fly as it'll break compatibility between too many shared objects. ... Back to the masking method then.
  • jmg Posts: 15,144
    edited 2014-09-04 19:43
    fki wrote: »
    waitpeq with timeout would be cool, but I suspect it would be difficult in the VHDL.

    It is a very common requirement.
    I think Chip was (looking at/had done) that, not sure if it is still in the latest variant.

    It is not that hard to do, you just OR the exits, and map to flags so you can tell which exit occurred.
    The WAITCNT would be queued somehow, so it allowed one opcode thru (the WAITPEQ) before applying.
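
"OR the exits and map to flags" can be modeled as follows. This is a behavioral sketch, not a description of any existing instruction: the pin samples are fed in as a list with one value per clock, and the function name and return layout are invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def wait_pin_or_timeout(pin_samples, start_cnt: int, target: int, mask: int, state: int):
    """Model of a WAITPEQ-style wait ORed with a counter match.

    pin_samples holds one input-register value per clock. Returns
    (pin_hit, timer_hit, cnt) for the clock on which the wait ends, so the
    caller can tell which exit occurred.
    """
    cnt = start_cnt & MASK32
    for ina in pin_samples:
        pin_hit = (ina & mask) == (state & mask)
        timer_hit = cnt == (target & MASK32)
        if pin_hit or timer_hit:  # the two exits are simply ORed
            return pin_hit, timer_hit, cnt
        cnt = (cnt + 1) & MASK32
    raise RuntimeError("ran out of pin samples before either exit fired")


if __name__ == "__main__":
    # The pin never goes high, so the wait ends on the timeout at cnt == 15.
    print(wait_pin_or_timeout([0] * 20, 10, 15, mask=1, state=1))
```

Returning both flags separately also covers the case where the pin event and the timeout land on the same clock.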
  • jmg Posts: 15,144
    edited 2014-09-04 19:45
    evanh wrote: »
    As Chip indicated, there is both a performance cost and power cost when performing a subtraction compare instead of equality compare.
    Counters within themselves have an adder, so it is unlikely to impact the overall MHz.
    Provided it has no impact on the system speed, the extra logic is not large in the scheme of things.
  • evanh Posts: 15,171
    edited 2014-09-05 03:26
    jmg wrote: »
    Counters within themselves have an adder, so it is unlikely to impact the overall MHz.
    Provided it has no impact on the system speed, the extra logic is not large in the scheme of things.

    We could have a scaled timer per Cog on that basis ... but Chip wasn't talking about space issues.

    With a little bit of thought put into the coding, I think a coarser-grained equality compare can accomplish what people are really wanting. And the 32-bit range becomes less important with an effectively slower tick rate. Of course, setting that mask will require another instruction or special register.

    I presume the timing issue of subtraction compare that Chip mentions is for real; it'll produce a result every clock, unlike the ALU compare, which does so every second clock. Or maybe this could be another minor compromise and actually use the ALU's compare function at its slower rate. Half resolution isn't so bad. After all, reduced resolution is exactly what I'm thinking about with the masking idea.

    Using an adder will make quite a difference on power usage though. Power usage is a nicety of the simple equality compare.
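
One possible reading of the masking idea is an equality compare that ignores some low-order bits, so the match holds for a window of cycles instead of a single one. The sketch below assumes that reading; `ignore_bits` is a hypothetical parameter standing in for whatever instruction or special register would set the mask.

```python
MASK32 = 0xFFFFFFFF


def masked_match(cnt: int, target: int, ignore_bits: int) -> bool:
    """Equality compare on the upper bits only; the low ignore_bits bits are don't-care."""
    mask = (MASK32 << ignore_bits) & MASK32
    return ((cnt ^ target) & mask) == 0


def match_window(target: int, ignore_bits: int):
    """First and last cnt value (inclusive) for which masked_match is true."""
    mask = (MASK32 << ignore_bits) & MASK32
    first = target & mask
    return first, first + (1 << ignore_bits) - 1


if __name__ == "__main__":
    # With 4 bits ignored the match stays true for 16 consecutive cycles.
    print([hex(v) for v in match_window(0x1000, 4)])
```

The window is aligned to the mask rather than centered on the target, so the wait can release up to 2^ignore_bits - 1 cycles away from the exact cycle. That is the reduced resolution traded for the wider match.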
  • evanh Posts: 15,171
    edited 2014-09-05 03:33
    evanh wrote: »
    ... Or maybe this could be another minor compromise and actually use the ALU's compare function at its slower rate. Half resolution isn't so bad. ...

    That will also require a staging latch to make the CNT value stable for the adder's two clocks, giving the result a 3-clock lag alongside the 2-clock resolution.
  • evanh Posts: 15,171
    edited 2014-09-05 04:00
    jmg wrote: »
    Counters within themselves have an adder, ...

    I think unidirectional counters can cheat by using a few more latches to form a delay-line and have a lot less logic. Again, it creates lag if there is any feedback, eg: count reset. However, the Prop's CNT has no such controls so any lag is invisible.
  • evanh Posts: 15,171
    edited 2014-09-06 03:33
    Have a look at the attached. Ignoring the extra latch and tristate buffer, this looks like a good example of how compact a straight incremental counter can be. No fancy staging needed at all. Okay, a 32bit counter will need some big, up to 32 input, AND gates but I don't see that as a serious issue.
  • jmg Posts: 15,144
    edited 2014-09-06 17:23
    That uses toggle counters, which implement the adder using toggle FFs and AND gates.
    You can also add Up/Down using toggle FFs, with a second set of gates.

    Counters are not really a big issue in P2, as they can 32b count faster than the core can run.
    (I think the 64b CNT Chip had in one P2 variant, may have used a simple pipeline)

    The COG 'counter' is a 32b full adder, so that also proves you can run a full adder at fSys speeds.
  • evanh Posts: 15,171
    edited 2014-09-06 17:45
    Here's the staged counter example using D-Type. The XOR achieves toggle function.
    [Attachment: image, 1024 x 1033, 227K]
  • jmg Posts: 15,144
    edited 2014-09-06 18:19
    evanh wrote: »
    Here's the staged counter example using D-Type. The XOR achieves toggle function.
    That does not look quite right - the second block needs a ripple carry style enable.
    You can also pipeline only the RCO, by making it (0FFH-1), which uses fewer FF's
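
The effect of pipelining only the ripple-carry output, decoded one count early at 0xFF-1 = 0xFE, can be checked with a small simulation. The attached schematic is not reproduced here, so this assumes a 16-bit counter built from two 8-bit stages with a single flip-flop carrying the registered carry; all names are illustrative.

```python
def staged_counter(clocks: int) -> list:
    """Simulate a 16-bit counter made of two 8-bit stages.

    The carry into the high stage goes through one flip-flop, so it is
    decoded at low == 0xFE and arrives on the clock where the low stage
    rolls over from 0xFF to 0x00.
    """
    low, high, carry_ff = 0, 0, 0
    values = []
    for _ in range(clocks):
        next_low = (low + 1) & 0xFF
        next_high = (high + carry_ff) & 0xFF  # high stage counts only when the carry is set
        next_carry = 1 if low == 0xFE else 0  # early decode, registered for one clock
        low, high, carry_ff = next_low, next_high, next_carry
        values.append((high << 8) | low)
    return values


if __name__ == "__main__":
    seq = staged_counter(300)
    print(hex(seq[254]), hex(seq[255]), hex(seq[256]))
```

In this model the staged counter produces the same sequence as a plain 16-bit incrementer, which is what the early decode is meant to achieve.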
  • evanh Posts: 15,171
    edited 2014-09-07 01:02
    Ah, of course, 0xFE is obvious now. That's a major saving on an incremental like this!

    I've worked out how to run the simulation part of this Logisim software now too. You can see in the snapshot how I've just clocked the staged flipflop. The clock is still in high state and there is a high on the input to bit8 of the counter.

    What's the suggested ripple enable for?
    [Attachment: image, 1024 x 1033, 210K]
  • fki Posts: 10
    edited 2014-09-23 06:11
    Having a second look at it: the ALU does not need to constantly subtract.

    suggested change in cog_alu.v:
    wire add_sub = [..]
    i[5:2] == 4'b1111 ? 1'b1 // waitcnt -- now calcing sub

    suggested change in cog.v:
    wire wait_2late = m[4] && (alu_r[31] || (alu_r[30:2]==29'b0));
    wire waitx = i[oh:ol+2] == 4'b0000__ ? !bus_ack
    : i[oh:ol+1] == 5'b11110_ ? !match
    : i[oh:ol+0] == 6'b111110 ? ( !match && !wait_2late)
    : i[oh:ol+0] == 6'b111111 ? !vidack
    : 1'b0;

    usage:

    WAITCNT target, cnt
    This should abort waiting whenever (target-cnt) was negative or less than 4 cycles at call time.

    Beware: if you nearly hit the cnt match, you might get out too early. This could of course be fixed by looking closely at the lower bits of alu_r.
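
The intended behavior of the wait_2late term can be modeled outside the Verilog. This is a behavioral sketch only: it assumes alu_r holds (target - cnt) as sampled when the instruction starts, ignores pipeline latency, and has not been checked against the actual cog.v timing.

```python
MASK32 = 0xFFFFFFFF


def wait_2late(target: int, cnt_at_fetch: int) -> bool:
    """Model of: alu_r[31] || (alu_r[30:2] == 29'b0), with alu_r = target - cnt."""
    alu_r = (target - cnt_at_fetch) & MASK32
    negative = (alu_r & 0x80000000) != 0       # target already passed
    near = ((alu_r >> 2) & 0x1FFFFFFF) == 0    # bits 30..2 are all zero
    return negative or near


def stall_cycles(target: int, cnt_at_fetch: int, patched: bool) -> int:
    """Cycles the cog would stall, with or without the suggested early exit."""
    if patched and wait_2late(target, cnt_at_fetch):
        return 0
    return (target - cnt_at_fetch) & MASK32


if __name__ == "__main__":
    # Target missed by one cycle: near-full wrap originally, no stall when patched.
    print(hex(stall_cycles(100, 101, patched=False)), stall_cycles(100, 101, patched=True))
```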
  • evanh Posts: 15,171
    edited 2014-09-23 14:20
    Ya, that's the mask idea I've mentioned.
  • fki Posts: 10
    edited 2014-09-24 01:46
    evanh wrote: »
    Ya, that's the mask idea I've mentioned.
    IMHO there is an improvement over
    http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261442&viewfull=1#post1261442

    A) the compare (wire "match" in cog.v) is not masked, it is still matching the exact cycle
    B) the ALU (which in P1 calculates d+=s) is used to find out whether the deadline has already passed. In this case the P1 waits for cnt to overflow;
    in the above, this would fire wire "wait_2late" (because alu_r[31] is set), so the COG can continue.
    C) the value of S (in this case CNT) is fetched a few cycles before "match" can trigger the next instruction. In these rare cases (e.g. the ALU sees target=cnt+1), the ALU would calculate alu_r==1 but "match" would not fire. Therefore one of the following conditions is needed:
    alu_r[30:2]==29'b0 // means alu_r < 4
    alu_r[30:2]==29'b0 && !(alu_r[1:0]==2'b11) // means alu_r < 3
    alu_r[30:1]==30'b0 // means alu_r < 2
    Which line fits depends on the design of the COG cycles.

    BTW: These calculations hold regardless whether a counter overflow lies between cnt and target or not.
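
The three bit-pattern conditions and the overflow remark can be checked numerically. The sketch below assumes alu_r is the 32-bit result of target - cnt with bit 31 handled separately, as in the wait_2late expression; the helper names are made up.

```python
MASK32 = 0xFFFFFFFF


def below_4(alu_r: int) -> bool:
    """alu_r[30:2] == 29'b0"""
    return ((alu_r >> 2) & 0x1FFFFFFF) == 0


def below_3(alu_r: int) -> bool:
    """alu_r[30:2] == 29'b0 && !(alu_r[1:0] == 2'b11)"""
    return below_4(alu_r) and (alu_r & 0b11) != 0b11


def below_2(alu_r: int) -> bool:
    """alu_r[30:1] == 30'b0"""
    return ((alu_r >> 1) & 0x3FFFFFFF) == 0


def diff(target: int, cnt: int) -> int:
    """32-bit wrapped difference, as the ALU would produce it."""
    return (target - cnt) & MASK32


if __name__ == "__main__":
    # The same distance of 3 cycles, once across the counter wrap and once not.
    print(diff(2, 0xFFFFFFFF), diff(1003, 1000))
```

For non-negative differences the three patterns behave as plain comparisons against 4, 3 and 2, and the wrapped subtraction yields the same distance on either side of a counter overflow.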
  • evanh Posts: 15,171
    edited 2014-09-24 02:08
    Ah, okay, I have no idea what your snippet is doing. There were some ANDs and ORs and you described an amount of jitter, which is just what one would expect from masking the lower bits.

    As for using the ALU adder, as already highlighted, it's not a great idea. You'd have to latch the CNT before processing it to slow the update rate down to the speed of the ALU, which, of course, permanently lowers precision. And power consumption is also higher compared to the equality wait, which is a bit of a downer.
  • fki Posts: 10
    edited 2014-09-24 05:18
    evanh wrote: »
    You'd have to latch the CNT before processing [..] lowers precision. And power consumption[..]

    latch)
    When using CNT as S, it is already latched on its way to the ALU. Therefore S represents CNT at the beginning of the currently executed instruction.

    power)
    I use the ALU only for a single calculation per WAIT instruction. So power consumption is the same as in P1 where the ALU is used once to increment target by delta. (WAITCNT target, delta)

    precision)
    The halting circuit is unchanged; only the ALU checks whether the wait would last more than 2^31 cycles. In that case execution continues.

    effect A)
    slightly missed targets won't halt the COG for a full overflow period
    effect B)
    the maximum wait time is halved

    Does anyone have an FPGA board and is willing to test my files? Send me a private message!
  • evanh Posts: 15,171
    edited 2014-09-24 14:59
    Ah, so the extra part only tests on entry to the waiting. You should prolly start a topic in the P1V FPGA section.
  • fki Posts: 10
    edited 2014-09-24 23:51
    evanh wrote: »
    Ah, so the extra part only tests on entry to the waiting.
    Exactly. Sorry for my unclear description.