WAITCNT vs. WAIT_CNT_PASSED
fki
The current asm op:
WAITCNT Target,Delta
halts until (cnt==Target). If the intended Target has already passed, it halts for a full period of cnt overflow.
I cannot imagine a use case where this is intended. I suppose most code will assume Target to be in the near future.
Therefore I suggest an operation:
WAIT_CNT_PASSED Target
to halt until (cnt-Target) & $8000_0000 == 0
The ALU can calculate the difference with NR. The Z flag can tell later whether the Target was missed or not.
Open Question: is it possible to wire Bit 31 of ALU output to the halting circuit?
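To make the difference between the two wait conditions concrete, here is a minimal Python sketch of the 32-bit arithmetic. It is only a behavioral model, not hardware; the function names target_passed and cycles_halted_by_waitcnt are made up for this illustration.

```python
MASK32 = 0xFFFF_FFFF
SIGN_BIT = 0x8000_0000


def target_passed(cnt: int, target: int) -> bool:
    """Proposed WAIT_CNT_PASSED condition: (cnt - target) & $8000_0000 == 0.

    The subtraction wraps at 32 bits, so the result is also valid when the
    counter overflows between cnt and target.
    """
    diff = (cnt - target) & MASK32
    return (diff & SIGN_BIT) == 0


def cycles_halted_by_waitcnt(cnt: int, target: int) -> int:
    """Cycles the current WAITCNT halts: until cnt == target exactly."""
    return (target - cnt) & MASK32


# Target missed by one cycle: WAITCNT halts for almost a full overflow period.
print(hex(cycles_halted_by_waitcnt(101, 100)))   # 0xffffffff
# The proposed condition is already true, so there would be no halt at all.
print(target_passed(101, 100))                   # True
```

The price is visible in the model as well: any Target more than 2^31 cycles ahead looks as if it has already passed.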
Fabian
Comments
How about an instruction that sets the c flag to (cnt-Target) & $8000_0000 == 0 but doesn't actually wait? This would be very useful for doing something until a timer expires.
Maybe one big instruction that does all of this: wc means set c and return immediately instead of waiting; wz determines whether it should go when cnt==Target or when (cnt-Target)>0.
electrodude
Halving the maximum wait time is no problem IMHO, because I can use several wait instructions in a row if I plan to wait for a long time.
Setting a flag would require a conditional jump afterwards, which means up to 8 cycles of jitter.
If jitter is acceptable, we can do:
mov cnt,cnt ' update shadow
sub cnt,Target 'calc diff
shl cnt,#1 WC 'extract Bit31
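The three instructions above can be modeled in Python to see what the C flag ends up holding and where the jitter comes from. This is a rough sketch: shl1_wc, timer_pending and poll_until are hypothetical helper names, and the fixed step per loop pass is an assumption standing in for the real loop timing.

```python
MASK32 = 0xFFFF_FFFF


def shl1_wc(value: int):
    """Model `shl value,#1 WC`: returns (result, C), where C is the old bit 31."""
    carry = (value >> 31) & 1
    return (value << 1) & MASK32, carry


def timer_pending(cnt: int, target: int) -> bool:
    shadow = cnt & MASK32                  # mov cnt,cnt     ' update shadow
    shadow = (shadow - target) & MASK32    # sub cnt,Target  ' calc diff
    _, c = shl1_wc(shadow)                 # shl cnt,#1 WC   ' extract bit 31
    return c == 1                          # C set: Target is still in the future


def poll_until(start: int, target: int, step: int):
    """Poll every `step` cycles; return (passes, cnt at which the loop exits)."""
    cnt, passes = start & MASK32, 0
    while timer_pending(cnt, target):
        passes += 1
        cnt = (cnt + step) & MASK32
    return passes, cnt


print(poll_until(0xFFFF_FFF0, 0x10, 8))    # (4, 16): exits exactly on Target
print(poll_until(0xFFFF_FFF0, 0x10, 12))   # (3, 20): exits 4 cycles late
```

The second call shows the jitter: the loop can only notice the deadline at its next pass, so the exit lands up to one loop period after Target.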
@ErNa:
waitpeq with timeout would be cool, but I suspect it would be difficult in the VHDL.
Fabian
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261428&viewfull=1#post1261428
#1193, #1195, #1198
I proposed a masking compromise at the bottom of that page. However, a programmable prescaler with equality compare would probably be the best improvement, since it provides the optional extended equality period while making full use of the 32-bit compare.
The downside is that a prescaler has a global effect. This probably won't fly, as it would break compatibility between too many shared objects. ... Back to the masking method then.
It is a very common requirement.
I think Chip was (looking at/had done) that, not sure if it is still in the latest variant.
It is not that hard to do, you just OR the exits, and map to flags so you can tell which exit occurred.
The WAITCNT would be queued somehow, so it allowed one opcode thru (the WAITPEQ) before applying.
Provided it has no impact on the system speed, the extra logic is not large in the scheme of things.
We could have a scaled timer per Cog on that basis ... but Chip wasn't talking about space issues.
With a little bit of thought put into the coding, I think a coarser-grain equality compare can accomplish what people are really wanting. And the 32-bit range becomes less important with an effectively slower tick rate. Of course, setting that mask will require another instruction or special register.
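One possible reading of a coarser-grain equality compare, sketched in Python. The details of the masking proposal are on the linked page and are not reproduced here, so everything in this sketch is an assumption: the low ignore_bits bits are simply left out of the compare, and the names masked_match and cycles_until_match are invented for the example.

```python
MASK32 = 0xFFFF_FFFF


def masked_match(cnt: int, target: int, ignore_bits: int) -> bool:
    """Equality compare that ignores the low `ignore_bits` bits of both values."""
    low = (1 << ignore_bits) - 1
    return ((cnt ^ target) & MASK32 & ~low) == 0


def cycles_until_match(cnt: int, target: int, ignore_bits: int) -> int:
    """Cycles until the masked compare first fires, starting at cnt."""
    if masked_match(cnt, target, ignore_bits):
        return 0
    low = (1 << ignore_bits) - 1
    window_start = target & ~low & MASK32
    return (window_start - cnt) & MASK32


# Exact compare (no bits ignored): one cycle late costs a full overflow period.
print(hex(cycles_until_match(0x1004, 0x1003, 0)))   # 0xffffffff
# Ignoring 4 bits gives a 16-cycle window, so the same miss matches at once.
print(cycles_until_match(0x1004, 0x1003, 4))        # 0
```

In this reading the window only moves the problem: arriving after the window (for example at 0x1010 with the values above) still costs nearly a full overflow period, and the release time is only accurate to the window size.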
I presume the timing issue of the subtraction compare that Chip mentions is for real: it would have to produce a result every clock, unlike the ALU compare, which produces one every second clock. Or maybe this could be another minor compromise and actually use the ALU's compare function at its slower rate. Half resolution isn't so bad. After all, reduced resolution is exactly what I'm thinking about with the masking idea.
Using an adder will make quite a difference to power usage though. Low power usage is a nicety of the simple equality compare.
That will also require a staging latch to keep the CNT value stable for the adder's two clocks, giving the result a 3-clock lag alongside the 2-clock resolution.
I think unidirectional counters can cheat by using a few more latches to form a delay-line and have a lot less logic. Again, it creates lag if there is any feedback, eg: count reset. However, the Prop's CNT has no such controls so any lag is invisible.
You can also add Up/Down using toggle FFs, with a second set of gates.
Counters are not really a big issue in P2, as they can 32b count faster than the core can run.
(I think the 64b CNT Chip had in one P2 variant, may have used a simple pipeline)
The COG 'counter' is a 32b full adder, so that also proves you can run a full adder at fSys speeds.
You can also pipeline only the RCO, by making it (0FFH-1), which uses fewer FFs.
I've worked out how to run the simulation part of this Logisim software now too. You can see in the snapshot how I've just clocked the staged flipflop. The clock is still in high state and there is a high on the input to bit8 of the counter.
What's the suggested ripple enable for?
suggested change in cog_alu.v:
wire add_sub = [..]
i[5:2] == 4'b1111 ? 1'b1 // waitcnt -- now calcing sub
suggested change in cog.v:
wire wait_2late = m[4] && (alu_r[31] || (alu_r[30:2]==29'b0));
wire waitx = i[oh:ol+2] == 4'b0000__ ? !bus_ack
: i[oh:ol+1] == 5'b11110_ ? !match
: i[oh:ol+0] == 6'b111110 ? ( !match && !wait_2late)
: i[oh:ol+0] == 6'b111111 ? !vidack
: 1'b0;
usage:
WAITCNT target, cnt
This should abort waiting whenever (target-cnt) was negative or less than 4 cycles at call time.
Beware: if you nearly hit the cnt match, you might get out too early. This could of course be fixed by looking closely at the lower bits of alu_r.
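For anyone who wants to play with the abort condition without an FPGA, here is a small Python model of the wait_2late expression. It assumes alu_r holds the 32-bit result of target - cnt, as described above, and takes m[4] as a plain boolean input, since its exact meaning in cog.v is not spelled out in this thread.

```python
MASK32 = 0xFFFF_FFFF


def wait_2late(target: int, cnt: int, m4: bool = True) -> bool:
    """Model of: m[4] && (alu_r[31] || (alu_r[30:2] == 29'b0))."""
    alu_r = (target - cnt) & MASK32                       # assumed ALU result
    negative = ((alu_r >> 31) & 1) == 1                   # alu_r[31]
    bits_30_2_zero = ((alu_r >> 2) & 0x1FFF_FFFF) == 0    # alu_r[30:2] == 0
    return m4 and (negative or bits_30_2_zero)


print(wait_2late(100, 101))               # True: target missed by one cycle
print(wait_2late(100, 97))                # True: only 3 cycles left
print(wait_2late(100, 96))                # False: 4 cycles left, wait normally
print(wait_2late(0x10, 0xFFFF_FFF0))      # False: 32 cycles left, across overflow
```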
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261442&viewfull=1#post1261442
A) The compare (wire "match" in cog.v) is not masked; it still matches the exact cycle.
B) The ALU (which in P1 calculates d+=s) is used to find out whether the deadline has already passed. In this case the P1 waits for cnt to overflow,
whereas in the above this would fire wire "wait_2late" (because alu_r[31] is set), so the COG can continue.
C) The value of S (in this case CNT) is fetched a few cycles before "match" can trigger the next instruction. In these rare cases (e.g. the ALU sees target=cnt+1), the ALU would calculate alu_r==1 but "match" would not fire. Therefore one of the following conditions is needed:
alu_r[30:2]==29'b0 // means alu_r < 4
alu_r[30:2]==29'b0 && !(alu_r[1:0]==2'b11) // means alu_r < 3
alu_r[30:1]==30'b0 // means alu_r < 2
Which line fits depends on the design of the COG cycles.
BTW: These calculations hold regardless whether a counter overflow lies between cnt and target or not.
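The three candidate conditions and the overflow remark can be checked with a few lines of Python. This is only a sanity check of the bit patterns, with made-up function names. Note that the bit-slice tests ignore bit 31, which is fine because alu_r[31] is handled separately in wait_2late.

```python
MASK32 = 0xFFFF_FFFF


def lt4(alu_r: int) -> bool:
    """alu_r[30:2] == 29'b0"""
    return ((alu_r >> 2) & 0x1FFF_FFFF) == 0


def lt3(alu_r: int) -> bool:
    """alu_r[30:2] == 29'b0 && !(alu_r[1:0] == 2'b11)"""
    return lt4(alu_r) and (alu_r & 0b11) != 0b11


def lt2(alu_r: int) -> bool:
    """alu_r[30:1] == 30'b0"""
    return ((alu_r >> 1) & 0x3FFF_FFFF) == 0


# For non-negative results (bit 31 clear) the slices match the plain compares.
for r in list(range(64)) + [0x7FFF_FFFF]:
    assert lt4(r) == (r < 4)
    assert lt3(r) == (r < 3)
    assert lt2(r) == (r < 2)

# A counter overflow between cnt and target does not change the difference.
alu_r = (0x0000_0001 - 0xFFFF_FFFF) & MASK32
print(alu_r, lt3(alu_r))   # 2 True
```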
As for using the ALU adder, as already highlighted, it's not a great idea. You'd have to latch the CNT before processing it to slow the update rate down to the speed of the ALU, which, of course, permanently lowers precision. And power consumption is also higher compared to the equality wait, which is a bit of a downer.
latch)
When using CNT as S, it is already latched on its way to the ALU. Therefore S represents CNT at the beginning of the currently executed instruction.
power)
I use the ALU only for a single calculation per WAIT instruction. So power consumption is the same as in P1 where the ALU is used once to increment target by delta. (WAITCNT target, delta)
precision)
The halting circuit is unchanged; only the ALU checks whether the wait would be for more than 2^31 cycles. In that case execution continues.
effect A)
Slightly missed targets won't halt the COG for a full overflow period.
effect B)
The maximum wait time is halved.
Does anyone have an FPGA board and is willing to test my files? Send me a private message!