(PASM) waitcnt and odd-clocks

ags · 2011-09-08 16:27

What would happen if I did something like this:

mov    clkCnt, cnt
add    clkCnt, someDelayValue
add    clkCnt, #3
waitcnt    clkCnt, #0

Basically, I'm asking if the waitcnt command actually does a compare for an exact equality between the D-parameter and the cnt register, and does it actually compare for each of the 4 clock cycles? If so, then that would mean I could cause execution to shift from a multiple of four.

I'm guessing that either using any D-parameter that is not a multiple of 4 will cause program execution to halt (never == cnt value when sampled by waitcnt) or there is some "magic" that happens and if the D-parameter is not %4 then execution will continue, still aligned with clk/4.

While I find this an interesting question just for educational purposes, I actually started to think about it because I'm using the cog counters, and I know that there are some edges that I would like to align with that are not on multiples of 4 clocks. When I started to think about how I could do that, it occurred to me that setting the waitcnt D-parameter to something that is not a multiple of 4 may be a "gotcha" to beware of.

Phil Pilgrim (PhiPi) · 2011-09-08 16:29

ags wrote:

does it actually compare for each of the 4 clock cycles? If so, then that would mean I could cause execution to shift from a multiple of four.

Yes and yes.

-Phil

kuroneko · 2011-09-08 16:54

Basic example:

mov     cnt, cnt
add     cnt, #9{14} + n
waitcnt cnt, #0

For n == 0 this sequence delays for 14 cycles. Adjusting n (up) gives you increasing delays with single cycle granularity.
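The arithmetic behind that statement can be checked off-chip. What follows is only a sketch in Python, not PASM: the names `sequence_delay`, `BASE_CYCLES`, `MIN_ADVANCE` and `MAX_IMMEDIATE` are made up for illustration, and the model simply takes the 14-cycle figure quoted above as given. The only outside fact it relies on is that a PASM immediate is 9 bits wide, which caps how large n can get in this form.

```python
BASE_CYCLES = 14      # delay of the mov/add/waitcnt sequence for n == 0 (figure from the post)
MIN_ADVANCE = 9       # the #9 in "add cnt, #9 + n"
MAX_IMMEDIATE = 511   # PASM immediates are 9 bits wide


def sequence_delay(n):
    """Return the total system clocks taken by the three instructions for a given n."""
    if n < 0 or MIN_ADVANCE + n > MAX_IMMEDIATE:
        raise ValueError("n must keep #9 + n within a 9-bit immediate")
    return BASE_CYCLES + n


if __name__ == "__main__":
    # Print n, the total delay, and the delay modulo 4 (the shift away from
    # the usual 4-clock instruction grid).
    for n in range(5):
        print(n, sequence_delay(n), sequence_delay(n) % 4)
```

Longer delays than the immediate allows would need the advance held in a register instead, which this sketch does not model.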

As for your counter issue, I wouldn't worry about that. The counter is started in sync with the following data instructions so global clock sync doesn't really matter here.

frank freedman · 2011-09-08 19:01

kuroneko wrote: »
Basic example:
mov     cnt, cnt
add     cnt, #9{14} + n
waitcnt cnt, #0
For n == 0 this sequence delays for 14 cycles. Adjusting n (up) gives you increasing delays with single cycle granularity.

As for your counter issue, I wouldn't worry about that. The counter is started in sync with the following data instructions so global clock sync doesn't really matter here.

Nicely stated.......

Frank

Not said in smart @ss way, just meant that this thread and specifically the exposition of the single cycle * n result helped me on an issue. Thx.

ags · 2011-09-12 13:45

I originally thought there was a typo in the example above, and that the "mov cnt, cnt" was a mistake. Now I'm realizing it's not a mistake, and the purpose is to synchronize the "cnt" register with a known value.

Am I correct that since the S-decode is done on baseline+1 clock cycle, then it would be the value of cnt at that "+1" clock that is stored? If that's correct, then I can't understand why the delay is 14 clocks, not 10. Since the difference is exactly 4 cycles, I must be missing a full instruction somewhere in my counting.

BTW, this means that unlike most (all?) other instructions, waitcnt can execute in as little as one clock cycle. Is that correct?

Thanks for the replies.

tonyp12 · 2011-09-12 13:55

>and the purpose is to synchronize the "cnt" register with a known value.

Not really. It's a trick to save on RAM: when CNT is used as a destination it refers to a shadow register, which is available for general use.

mov cnt,#5 'set shadow to 5
add cnt,cnt 'add the real cnt to shadow
waitcnt cnt, #0 'wait for the real cnt to match shadow, then add 0 to shadow.

5 is the overhead and results in a zero wait (you cannot use less than 5).
Though the waitcnt would be a nop in nature, it would still take 6 cycles, as waitcnt is not a 4-cycle instruction but a 6-cycle one.

If you came in at ...4, 8, 12 before the 3 lines of code, you would come out at 26, 30, 34..., in itself a shift of 2 cycles.

Using a 6 in the initial shadow setting, you would come out at 27, 31, 35, a shift of 3 cycles.

I could not see any good use for it though; it takes too many cycles per change to be useful for high-speed I/O.
If you needed cogs to be out of sync or in sync, you would just do that when you set the initial wait that you have to do with repeated newcog anyway.
And cogs get out of sync on their own after a hub read/write, as they have to wait their turn.
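To make the shift numbers concrete, here is a rough Python model of the three lines (mov cnt,#shadow / add cnt,cnt / waitcnt cnt,#0). It is a sketch only: `exit_count` and `phase_shift` are hypothetical helper names, and the model assumes the figures stated in this post (5 is the smallest usable shadow value, and the three lines take 14 cycles when the wait is zero).

```python
MIN_SHADOW = 5    # smallest usable value for "mov cnt, #5" (from the post)
MIN_CYCLES = 14   # cycles taken by the three lines when the wait is zero


def exit_count(entry, shadow=MIN_SHADOW):
    """System count at which the instruction after waitcnt starts."""
    if shadow < MIN_SHADOW:
        # A smaller value would miss the match and wait for the counter to
        # wrap around, so this model treats it as an error.
        raise ValueError("shadow value must be at least 5")
    return entry + MIN_CYCLES + (shadow - MIN_SHADOW)


def phase_shift(entry, shadow=MIN_SHADOW):
    """Shift of the exit point relative to the 4-clock instruction grid."""
    return (exit_count(entry, shadow) - entry) % 4


if __name__ == "__main__":
    for shadow in (5, 6, 7, 8):
        print(shadow, exit_count(12, shadow), phase_shift(12, shadow))
```

Entering at count 12 with a shadow value of 5 gives 26 (a shift of 2), and a shadow value of 6 gives 27 (a shift of 3), matching the numbers above.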

ags · 2011-09-12 16:05

OK, thanks for pointing out the use of the cnt "shadow register" as an additional "free" cog register.

I'm still not following the rest. Assuming that the clock starts at 0 (which I know isn't the case, and doesn't matter since it's all relative), what would be stored in "myReg" after this instruction?

mov myReg, cnt

Using the fetchSource/fetchDestination/eXecute-instruction/writeResults model (SDXR), during the four clock cycles for a typical instruction, wouldn't that mean that the value in the cnt shadow register would be #1 since Source register value is read in cycle 1?

kuroneko · 2011-09-12 16:27

ags wrote: »

Assuming that the clock starts at 0 (which I know isn't the case, and doesn't matter since it's all relative), what would be stored in "myReg" after this instruction?

mov myReg, cnt

Using the fetchSource/fetchDestination/eXecute-instruction/writeResults model (SDXR), during the four clock cycles for a typical instruction, wouldn't that mean that the value in the cnt shadow register would be #1 since Source register value is read in cycle 1?

Live registers (cnt, ina, phsx) are sampled in X (result of D). All operands (src/dst) are also latched in X. Meaning S would see cnt₀, D sees cnt₀ +1 and finally X will see and sample cnt₀ +2.

kuroneko · 2011-09-12 16:34

tonyp12 wrote: »

If you came in at ...4, 8, 12 before the 3 lines of code, you would come out at 26, 30, 34..., in itself a shift of 2 cycles.

Can you elaborate (no coffee yet)? How do you arrive at a difference of 22 cycles?

tonyp12 · 2011-09-12 17:28

4, 8, 12 are the counts before the first of the 3 lines of code is even reached, just to show a pattern.
The 3 lines of code take 14 cycles minimum.

kuroneko · 2011-09-12 17:50

tonyp12 wrote: »

4, 8, 12 are the counts before the first of the 3 lines of code is even reached, just to show a pattern.
The 3 lines of code take 14 cycles minimum.

OK, got it. I was assuming the sequence 4, 8, 12 designates 3 different reference counts (so I couldn't make the link from 4 -> delay -> 26) but what you mean is a sequence of 4 cycle instructions which end up at 12 then delay for 14 arriving at 26.

ags · 2011-09-13 16:38

Ugh. Will someone walk me through this? Is this a correct, detailed explanation of what is going on? The assumptions are:
* the value of cnt is sampled in the 3rd stage of instruction execution
* waitcnt "releases" as soon as the current count is equal to the stored count, even if it's not a multiple of 4

The sequence is:

* start at cnt=0
* at cnt=3 the value of cnt is sampled (ALU value=3)
* at cnt=4 that value is stored in cnt shadow register (cntShadow=3)
* at cnt=7 9 is added to the value of cntShadow (ALU value=12)
* at cnt=8 that value is stored in cnt shadow register (cntShadow=12)
* at cnt=9 the S register value is fetched (if it is not immediate; if it is immediate, the value is decoded from the instruction and stored)
* at cnt=10 the D register value is fetched (ALU value=12)
* at cnt=11 waitcnt compares ALU value (=12) to cnt (=11) -> not equal
* at cnt=12 waitcnt compares ALU value (=12) to cnt (=12) -> EQUAL
* at cnt=13 the value of S register (or immediate value) is added to D register value
* at cnt=14 D register value written

So this means that although waitcnt will "release" control at any clock cycle (even non-multiples of 4), it still requires 2 additional clock cycles after cnt==specifiedValue (to add the value of S register and write that value to D register) before execution continues with the next instruction. Is that correct?

If so, can someone explain in more detail why the "active registers" are sampled on the X stage (3) rather than the first (as all other S registers)? Is it "just the way things are wired" for those special registers (which are tied to the ALU execution stage, not actually fetching a value from a "normal" cog RAM register)?

Thanks for the insight into this.

ags · 2011-09-13 16:52

tonyp12 wrote: »

...Though the waitcnt would be a nop in nature it still would take 6 cycles as waitcnt is not a 4 cycle instruction but 6...

Not to be argumentative, but the datasheet says "5+" clocks. Is that incorrect?

...if you needed cogs to be out of sync or in sync, you would just do that when you set the initial wait that you have to do with repeated newcog anyway...

Why does one need to wait when starting/repeating newcogs? Doesn't that take just a normal hubop time (7-22 cycles)? Are you referring to a case when the new cog has to be completely initialized before the cog that started it resumes execution?

Thanks.

tonyp12 · 2011-09-13 17:02

5+ is from the old manual; look up the new one.

If you need to start two cogs in sync, for hi-res VGA etc., you would start both with a waitcnt in them that targets a count far enough in the future that both have had a chance to get loaded.

Or, if you are starting two cogs from the same DAT block but with different values inserted into them, you would need to wait before you change the value and do the next newcog.
see page 28 here: http://gadgetgangster.com/working_files/assembly_tutorial/desilva.pdf

kuroneko · 2011-09-13 17:42

Let's start with something slightly different:

movi    ctra, #%0_11111_000
mov     frqa, #1
mov     phsa, #0
mov     temp, phsa

This way the counter is set up to act similarly to cnt, i.e. increment unconditionally by #1 every system clock. In the last cycle of mov phsa, #0 (R-phase) phsa will be 0. This is followed by SDeR (or SDXR if you like) of the sample move instruction. During S frqa will have been added once, during D a second time. This means the 3rd phase sees a 2 (not a 3), which it is about to sample. That said, during the 3rd cycle phsa will reach 3. Think of an active clock edge for each cycle: it has inputs (here result of D, phsa = 2) and outputs (increment phsa by #1).

ags wrote: »

The sequence is:

As noted it's slightly off, also you show 2 compare cycles which would mean that #9 isn't the minimal advance to avoid waiting for a full wrap-around. Evidently #8 is not enough which means the first match is done in the 4th cycle of waitcnt (also true for waitpxx).

attachment.php?attachmentid=85068&d=1315985888

ags wrote: »

So this means that although waitcnt will "release" control at any clock cycle (even non-multiples of 4), it still requires 2 additional clock cycles after cnt==specifiedValue (to add the value of S register and write that value to D register) before execution continues with the next instruction. Is that correct?

Yes, it needs some time to wake up and finally write the result if specified (nr is possible).
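Put together, the corrected timeline for mov cnt, cnt / add cnt, #advance / waitcnt cnt, #0 can be sketched as a small Python model. This is a reading of the points made in this thread, not something taken from the datasheet, and the name `run_sequence` is invented for the example. The assumptions are: an ordinary instruction takes 4 clocks, live cnt is sampled in the 3rd cycle of the mov, waitcnt performs its first compare in its 4th cycle, and 2 more clocks pass after the matching cycle before the next instruction starts.

```python
WRAP = 1 << 32  # cnt is a 32-bit free-running counter


def run_sequence(start, advance):
    """Return the clocks consumed by the three instructions.

    start   -- system count at the first cycle of "mov cnt, cnt"
    advance -- immediate added by "add cnt, #advance"
    """
    sampled = (start + 2) % WRAP             # mov samples cnt in its 3rd cycle
    target = (sampled + advance) % WRAP      # add leaves this in the cnt shadow
    first_compare = start + 4 + 4 + 3        # 4th cycle of waitcnt
    wait = (target - first_compare) % WRAP   # clocks until cnt equals the target
    match = first_compare + wait
    next_instruction = match + 3             # matching cycle plus 2 more clocks
    return next_instruction - start


if __name__ == "__main__":
    print(run_sequence(0, 9))    # minimal advance, no waiting
    print(run_sequence(0, 10))   # one clock longer
    print(run_sequence(0, 8))    # target already missed: full wrap-around
```

Under these assumptions #9 yields 14 clocks, each extra unit adds one clock, and #8 misses the first compare by one clock and so waits for the counter to wrap.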

ags wrote: »

If so, can someone explain in more detail why the "active registers" are sampled on the X stage (3) rather than the first (as all other S registers)? Is it "just the way things are wired" for those special registers (which are tied to the ALU execution stage, not actually fetching a value from a "normal" cog RAM register)?

Let's put it this way, the final values for src and dst are latched in the 3rd cycle (for use by ALU or video h/w). Live registers (which are not part of cog RAM) are mux'ed in at this stage (not needed earlier) so you'll get whatever is in there at the time. So while source operands living in cog RAM and immediate values are processed earlier (there are only 4 cycles to access cog RAM) they don't become active until the 3rd cycle (but are squirreled away until needed).

As a mental exercise, where does jmp phsa go (with the setup above) and where does the next instruction come from (after the jump)?

Also, check out this thread: Call for Clarity from Chip or Beau (thread 118096).
From the thread:

        -----------------Cog RAM------------------
State   R/W   Address        Data In   Data Out          Other
---------------------------------------------------------------------------------------
0       R     source         -         -                 -
1       R     destination    -         source            S register is latched
2       R     instruction    -         destination       Final S is mux'd and latched, D register is latched
3       W     destination    result    instruction       ALU settles with S/D inputs, result is written to D register

ags · 2013-03-21 14:59

Two things:

First - I remembered this thread from over a year ago because I found it so helpful. I don't recall seeing this level of detail and accuracy (and readability) in any official Parallax document on precision timing. I just came back to it for a refresher and am able to proceed using just this bit of explanation. I want to thank kuroneko for his help and effort.

Second: I realized there was an embedded quiz which I never completed. I'll never earn my propeller beanie that way. So...

kuroneko wrote:
movi    ctra, #%0_11111_000
mov     frqa, #1
mov     phsa, #0
jmp     phsa
As a mental exercise, where does jmp phsa go (with the setup above) and where does the next instruction come from (after the jump)?

Well, it's not going to cog address #2 because: a) that's the obvious answer given the previous explanation and you wouldn't ask an obvious question and b) looking at the Propeller manual for the JMP instruction (although it doesn't explicitly say this) I note that the instruction works by setting the PC equal to the value of the source operand. However, instruction decode happens during the e cycle of the previous instruction, so I presume modification of the PC happens in the D cycle. My guess is that the value of the source field at the end of the S cycle (in this case #1) is loaded/input/sampled as input in the D cycle, and latched into the PC as the output of the D cycle. It is then sampled in the e/I cycle - so that would mean the jmp would go to cog address #1. That reinforces my belief in the non-obvious answer, because now that leaves me at a movi, and I know one has to be careful there. However, movi isn't modifying an instruction (self-modifying code) but instead the ctra register, so I'm going with the next instruction comes from PC+1, which in that case would be cog address #2.

Then again, I'm making a lot of assumptions here, and my brain is beginning to hurt...

kuroneko · 2013-03-21 16:43

Sometimes the (too) obvious answers are correct!

With base being the value loaded into phsx just before the jump (#0) you'll end up at base + 2*frqx (#2) and the next insn is fetched from base + 3*frqx + 1 (#4). The last cycle basically sees base + 3*frqx (one after live operand fetch) and treats that as the new PC (source value), IOW while the insn at base + 2*frqx is executed it thinks it's at base + 3*frqx so the next fetch is from +1.
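The two formulas above can be written down as a tiny Python helper for experimenting with other frqx values. It is only a sketch of the arithmetic as stated here: `jmp_phsx` is a hypothetical name, and wrapping of the result to the cog address range is deliberately not modelled.

```python
def jmp_phsx(base, frqx=1):
    """Return (jump target, address of the next instruction fetch).

    base -- value loaded into phsx just before the jump
    frqx -- amount added to phsx every system clock
    """
    target = base + 2 * frqx          # live operand as seen by the jump
    next_fetch = base + 3 * frqx + 1  # apparent PC one clock later, plus one
    return target, next_fetch


if __name__ == "__main__":
    print(jmp_phsx(0, 1))
```

For the setup in question (base 0, frqx 1) this gives a jump to #2 with the following fetch from #4.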
