WAITCNT vs. WAIT_CNT_PASSED

fki · 2014-08-29 01:17

The current asm op:

WAITCNT Target,Delta

halts until (cnt==Target). If the intended Target has already passed, it halts for a full period of cnt overflow.
I cannot imagine a use case where this is intended. I suppose most code will assume Target to be in the near future.

Therefore I suggest an operation:

WAIT_CNT_PASSED Target

to halt until (cnt-Target) & $8000_0000 == 0
The ALU can calculate the difference with NR; the Z flag can tell later whether the Target was missed or not.
Open question: is it possible to wire bit 31 of the ALU output to the halting circuit?
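
A minimal Python sketch of the two wait conditions, as a software model only. The function names are hypothetical and the model assumes a free-running 32-bit cnt that wraps modulo 2^32; it says nothing about how the hardware would implement the check.

```python
MASK32 = 0xFFFF_FFFF

def cycles_until_equal(cnt, target):
    """Cycles an equality wait stalls: until cnt == target (mod 2**32)."""
    return (target - cnt) & MASK32

def has_passed(cnt, target):
    """Proposed condition: bit 31 of (cnt - target) is clear."""
    return ((cnt - target) & 0x8000_0000) == 0

def cycles_until_passed(cnt, target):
    """Cycles a 'passed'-style wait stalls; 0 if target is already behind cnt."""
    return 0 if has_passed(cnt, target) else (target - cnt) & MASK32

if __name__ == "__main__":
    # Target missed by one cycle: equality wait stalls almost a full wrap.
    print(cycles_until_equal(1000, 999), cycles_until_passed(1000, 999))
    # Target 32 cycles ahead across the cnt wrap: both stall 32 cycles.
    print(cycles_until_equal(0xFFFF_FFF0, 0x10), cycles_until_passed(0xFFFF_FFF0, 0x10))
```

In this model a target more than 2^31 cycles ahead is indistinguishable from one that has already passed, so the usable wait range is half of the counter period.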

Fabian

Electrodude · 2014-08-29 09:04

If you do that then you can only wait half as long.

How about an instruction that sets the c flag to (cnt-Target) & $8000_0000 == 0 but doesn't actually wait? This would be very useful for doing something until a timer expires.

Maybe one big instruction that does all of this: wc means set c and return immediately instead of waiting, wz determines if it should go when cnt==Target or when (cnt-Target)>0

electrodude

ErNa · 2014-08-30 01:27

I would find it useful to have a wait for an input event OR a timer event, which would allow checking for an event with a timeout.

fki · 2014-09-03 01:17

@Electrodude:
Halving the maximum wait time is no problem IMHO, because I can use several wait instructions in a row if I plan to wait for a long time.
Setting a flag would require a conditional jump afterwards, which means up to 8 cycles of jitter.

If jitter is acceptable, we can do:
mov cnt,cnt ' update shadow
sub cnt,Target 'calc diff
shl cnt,#1 WC 'extract Bit31
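
A rough Python model of the three instructions above, used in a polling loop. It assumes SHL with WC copies bit 31 of the value before the shift into C; the names target_pending, poll_until_passed and cycles_per_pass are made up for this sketch.

```python
MASK32 = 0xFFFF_FFFF

def target_pending(cnt, target):
    """Return the modelled C flag: 1 while target is still ahead of cnt."""
    shadow = cnt & MASK32                # mov cnt,cnt    ' update shadow
    shadow = (shadow - target) & MASK32  # sub cnt,Target ' calc diff
    carry = shadow >> 31                 # shl cnt,#1 WC  ' C = bit 31 before shift
    return carry

def poll_until_passed(start_cnt, target, cycles_per_pass):
    """Count loop passes until C clears; cycles_per_pass is a hypothetical loop cost."""
    cnt, passes = start_cnt & MASK32, 0
    while target_pending(cnt, target):
        cnt = (cnt + cycles_per_pass) & MASK32
        passes += 1
    return passes, cnt

if __name__ == "__main__":
    passes, cnt = poll_until_passed(0xFFFF_FFF0, 0x10, 12)
    print(passes, hex(cnt))
```

The loop leaves only on the first pass after the target, so the exit point overshoots the target by up to one loop period, which is the jitter mentioned above.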

@ErNa:
waitpeq with timeout would be cool, but I suspect it would be difficult in the VHDL.

Fabian

fki · 2014-09-03 01:44

Just found similar thoughts in:
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261428&viewfull=1#post1261428

#1193, #1195, #1198

evanh · 2014-09-03 07:11

As Chip indicated, there is both a performance cost and power cost when performing a subtraction compare instead of equality compare.

I proposed a masking compromise at the bottom of that page. However, a programmable prescaler with equality compare would probably be the best improvement: it provides the optional extended equality period while making full use of the 32 bit compare.

The downside is that a prescaler has a global effect. This prolly won't fly as it'll break compatibility between too many shared objects. ... Back to the masking method then.

jmg · 2014-09-04 19:43

fki wrote: »

waitpeq with timeout would be cool, but I suspect it would be difficult in the VHDL.

It is a very common requirement.
I think Chip was (looking at/had done) that, not sure if it is still in the latest variant.

It is not that hard to do, you just OR the exits, and map to flags so you can tell which exit occurred.
The WAITCNT would be queued somehow, so it allowed one opcode thru (the WAITPEQ) before applying.
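
A small Python sketch of the "OR the exits, map to flags" idea, as a cycle-by-cycle software model. It assumes the usual WAITPEQ condition (INA & mask) == state and an equality timeout on cnt; the function name, the pin_samples list and the returned flag tuple are hypothetical.

```python
MASK32 = 0xFFFF_FFFF

def wait_pin_or_timeout(pin_samples, state, mask, start_cnt, target):
    """Stall until the pin condition OR the cnt match fires.

    pin_samples holds one INA value per clock. Returns
    (cycles_waited, pin_hit, timed_out) so the caller can tell which exit occurred.
    """
    for elapsed, ina in enumerate(pin_samples):
        cnt = (start_cnt + elapsed) & MASK32
        pin_hit = (ina & mask) == (state & mask)
        timed_out = cnt == (target & MASK32)
        if pin_hit or timed_out:  # the two exits are simply ORed
            return elapsed, pin_hit, timed_out
    raise RuntimeError("ran out of samples before either exit fired")

if __name__ == "__main__":
    print(wait_pin_or_timeout([0, 0, 0, 4, 4], 4, 4, 100, 110))  # pin exit
    print(wait_pin_or_timeout([0] * 20, 4, 4, 100, 105))         # timeout exit
```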

jmg · 2014-09-04 19:45

evanh wrote: »

As Chip indicated, there is both a performance cost and power cost when performing a subtraction compare instead of equality compare.

Counters within themselves have an adder, so it is unlikely to impact the overall MHz.
Provided it has no impact on the system speed, the extra logic is not large in the scheme of things.

evanh · 2014-09-05 03:26

jmg wrote: »

Counters within themselves have an adder, so it is unlikely to impact the overall MHz.
Provided it has no impact on the system speed, the extra logic is not large in the scheme of things.

We could have a scaled timer per Cog on that basis ... but Chip wasn't talking about space issues.

With a little bit of thought put in to the coding I think a coarser grain equality compare can accomplish what people are really wanting. And 32bit range becomes less important with an effective slower tick rate. Of course, setting that mask will require another instruction or special register.

I presume the timing issue of subtraction compare that Chip mentions is for real: it'll produce a result every clock, unlike the ALU compare, which is every second clock. Or maybe this could be another minor compromise and actually use the ALU's compare function at its slower rate. Half resolution isn't so bad. After all, reduced resolution is exactly what I'm thinking about with the masking idea.

Using an adder will make quite a difference on power usage though. Power usage is a nicety of the simple equality compare.

evanh · 2014-09-05 03:33

evanh wrote: »

... Or maybe this could be another minor compromise and actually use the ALU's compare function at its slower rate. Half resolution isn't so bad. ...

That will also require a staging latch to make the CNT value stable for the adder's two clocks, giving the result a 3 clock lag alongside the 2 clock resolution.

evanh · 2014-09-05 04:00

jmg wrote: »

Counters within themselves have an adder, ...

I think unidirectional counters can cheat by using a few more latches to form a delay-line and have a lot less logic. Again, it creates lag if there is any feedback, eg: count reset. However, the Prop's CNT has no such controls so any lag is invisible.

evanh · 2014-09-06 03:33

Have a look at the attached. Ignoring the extra latch and tristate buffer, this looks like a good example of how compact a straight incremental counter can be. No fancy staging needed at all. Okay, a 32bit counter will need some big, up to 32 input, AND gates but I don't see that as a serious issue.

jmg · 2014-09-06 17:23

That uses toggle counters, which have the adder using ( Toggle & AND gates).
You can also add Up/Down using toggle FFs, with a second set of gates.

Counters are not really a big issue in P2, as they can 32b count faster than the core can run.
(I think the 64b CNT Chip had in one P2 variant, may have used a simple pipeline)

The COG 'counter' is a 32b full adder, so that also proves you can run a full adder at fSys speeds.

evanh · 2014-09-06 17:45

Here's the staged counter example using D-Type. The XOR achieves toggle function.

jmg · 2014-09-06 18:19

evanh wrote: »

Here's the staged counter example using D-Type. The XOR achieves toggle function.

That does not look quite right - the second block needs a ripple carry style enable.
You can also pipeline only the RCO, by making it (0FFH-1), which uses fewer FF's

evanh · 2014-09-07 01:02

Ah, of course, 0xFE is obvious now. That's a major saving on an incremental like this!

I've worked out how to run the simulation part of this Logisim software now too. You can see in the snapshot how I've just clocked the staged flipflop. The clock is still in high state and there is a high on the input to bit8 of the counter.

What's the suggested ripple enable for?

fki · 2014-09-23 06:11

Having a second look at it: the ALU does not need to constantly subtract.

suggested change in cog_alu.v:
wire add_sub = [..]
i[5:2] == 4'b1111 ? 1'b1 // waitcnt -- now calcing sub

suggested change in cog.v:
wire wait_2late = m[4] && (alu_r[31] || (alu_r[30:2]==29'b0));
wire waitx = i[oh:ol+2] == 4'b0000__ ? !bus_ack
: i[oh:ol+1] == 5'b11110_ ? !match
: i[oh:ol+0] == 6'b111110 ? ( !match && !wait_2late)
: i[oh:ol+0] == 6'b111111 ? !vidack
: 1'b0;

usage:

WAITCNT target, cnt
This should abort waiting whenever (target-cnt) was negative or less than 4 cycles at call time.

Beware: if you nearly hit the cnt match you might get out too early. This could of course be fixed by looking closely at the lower bits of alu_r.
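
A hedged Python model of the wait_2late term from the snippet above. It assumes alu_r holds (target - cnt) modulo 2^32 at entry to the wait and leaves out the m[4] term, which is taken here to be only a cycle-stage gate; wait_2late as a Python function is purely illustrative.

```python
MASK32 = 0xFFFF_FFFF

def wait_2late(target, cnt):
    """True if the wait should be skipped: target already passed or < 4 cycles away."""
    alu_r = (target - cnt) & MASK32
    bit31 = (alu_r >> 31) & 1                # alu_r[31]: difference is negative
    bits_30_2 = (alu_r >> 2) & 0x1FFF_FFFF   # alu_r[30:2]: all zero means 0..3
    return bool(bit31 or bits_30_2 == 0)

if __name__ == "__main__":
    print(wait_2late(1000, 1001))        # target missed by one cycle
    print(wait_2late(1003, 1000))        # only 3 cycles ahead
    print(wait_2late(1004, 1000))        # 4 cycles ahead, normal wait
    print(wait_2late(5, 0xFFFF_FFFE))    # 7 cycles ahead across the wrap
```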

evanh · 2014-09-23 14:20

Ya, that's the mask idea I've mentioned.

fki · 2014-09-24 01:46

evanh wrote: »

Ya, that's the mask idea I've mentioned.

IMHO there is an improvement over
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261442&viewfull=1#post1261442

A) the compare (wire "match" in cog.v) is not masked, it is still matching the exact cycle

B) the ALU (which in P1 calcs d+=s) is used to find out whether the deadline has already passed. In that case the P1 waits for cnt to overflow;
with the change above, wire "wait_2late" fires instead (because alu_r[31] is set), so the COG can continue.
C) the value of S (in this case CNT) is fetched a few cycles before "match" can trigger the next instruction. In these rare cases (e.g. ALU sees: target=cnt+1), the ALU would calc alu_r==1 but "match" would not fire. Therefore one of the following conditions is needed:
alu_r[30:2]==29'b0 // means alu_r < 4
alu_r[30:2]==29'b0 && !(alu_r[1:0]==2'b11) // means alu_r < 3
alu_r[30:1]==30'b0 // means alu_r < 2
Which line fits depends on the design of the COG cycles.

BTW: These calculations hold regardless whether a counter overflow lies between cnt and target or not.
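
The bit-slice conditions listed under C) can be checked with a short Python sketch. The helper names (bits, lt4, lt3, lt2, diff) are invented for this check. Since the slices ignore bit 31, the stated meanings apply to non-negative differences, i.e. alu_r below 2^31.

```python
MASK32 = 0xFFFF_FFFF

def bits(value, hi, lo):
    """Verilog-style slice value[hi:lo] of a non-negative integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def lt4(alu_r):
    return bits(alu_r, 30, 2) == 0                               # alu_r < 4

def lt3(alu_r):
    return bits(alu_r, 30, 2) == 0 and bits(alu_r, 1, 0) != 0b11  # alu_r < 3

def lt2(alu_r):
    return bits(alu_r, 30, 1) == 0                               # alu_r < 2

def diff(target, cnt):
    """32-bit wrapping difference, as the ALU subtraction would produce."""
    return (target - cnt) & MASK32

if __name__ == "__main__":
    for r in range(8):
        print(r, lt4(r), lt3(r), lt2(r))
    # Same difference with and without a counter overflow in between.
    print(diff(1003, 1000), diff(2, 0xFFFF_FFFF))
```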

evanh · 2014-09-24 02:08

Ah, okay, I have no idea what your snippet is doing. There were some ANDs and ORs and you described an amount of jitter which is just what one would expect from masking the lower bits.

As for using the ALU adder, as already highlighted, it's not a great idea. You'd have to latch the CNT before processing it to slow the update rate down to the speed of the ALU, which, of course, permanently lowers precision. And power consumption is also higher compared to the equality wait, which is a bit of a downer.

fki · 2014-09-24 05:18

evanh wrote: »

You'd have to latch the CNT before processing [..] lowers precision. And power consumption[..]

latch)
When using CNT as S it is already latched on its way to the ALU. Therefore S represents CNT at the beginning of the currently executed instruction.

power)
I use the ALU only for a single calculation per WAIT instruction. So power consumption is the same as in P1 where the ALU is used once to increment target by delta. (WAITCNT target, delta)

precision)
The halting circuit is unchanged; only the ALU checks whether the wait would be for more than 2^31 cycles. In that case execution continues.

effect A)
slightly missed targets won't halt the COG for a full overflow period
effect B)
the maximum wait time is halved

Does anyone have an FPGA board and is willing to test my files? Send me a private message!

evanh · 2014-09-24 14:59

Ah, so the extra part only tests on entry to the waiting. You should prolly start a topic in the P1V FPGA section.

fki · 2014-09-24 23:51

evanh wrote: »

Ah, so the extra part only tests on entry to the waiting.

Exactly. Sorry for my unclear description.
