(PASM) waitcnt and odd-clocks
ags
What would happen if I did something like this:
mov clkCnt, cnt
add clkCnt, someDelayValue
add clkCnt, #3
waitcnt clkCnt, #0
Basically, I'm asking whether the waitcnt command actually compares for exact equality between the D-parameter and the cnt register, and whether it does that compare on each of the 4 clock cycles. If so, then that would mean I could cause execution to shift off a multiple of four.
I'm guessing that either using any D-parameter that is not a multiple of 4 will cause program execution to halt (never == cnt value when sampled by waitcnt) or there is some "magic" that happens and if the D-parameter is not %4 then execution will continue, still aligned with clk/4.
While I find this an interesting question just for educational purposes, I actually started to think about it because I'm using the cog counters, and I know that there are some edges that I would like to align with that are not on clk*4 multiples. When I started to think about how I could do that, it occurred to me that setting the waitcnt D-parameter to something not %4 may be a "gotcha" to beware of.
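The two guesses in the question can be told apart with a toy model. This is only a sketch in plain Python arithmetic, not PASM and not a description of the hardware; `release_cycle`, `compare_every` and the cycle numbers are hypothetical names and values made up for illustration. It checks a target against a free-running counter either on every clock or only on every 4th clock.

```python
CNT_MASK = (1 << 32) - 1  # the system counter is 32 bits wide


def release_cycle(start, target, compare_every=1, limit=64):
    """Return the cnt value at which the compare fires, or None.

    start         -- cnt value when the (hypothetical) compare begins
    target        -- value being waited for
    compare_every -- 1 = compare on every clock, 4 = only on every 4th clock
    limit         -- how many clocks to look ahead before giving up
    """
    for step in range(0, limit, compare_every):
        cnt = (start + step) & CNT_MASK
        if cnt == (target & CNT_MASK):
            return cnt
    return None  # no match inside the look-ahead window


# Start on a multiple of 4 and aim 3 clocks past a multiple of 4.
print(release_cycle(8, 23, compare_every=1))  # 23
print(release_cycle(8, 23, compare_every=4))  # None
```

Under the "sampled every 4 clocks" guess the odd target is never seen (a 32-bit wrap does not help, since 2^32 is itself a multiple of 4), while under the "every clock" guess the wait ends exactly on the odd count. The later posts in this thread describe behaviour consistent with the per-clock compare.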
Comments
-Phil
As for your counter issue, I wouldn't worry about that. The counter is started in sync with the following data instructions so global clock sync doesn't really matter here.
Nicely stated.......
Frank
Not said in smart @ss way, just meant that this thread and specifically the exposition of the single cycle * n result helped me on an issue. Thx.
Am I correct that since the S-decode is done on baseline+1 clock cycle, then it would be the value of cnt at that "+1" clock that is stored? If that's correct, then I can't understand why the delay is 14 clocks, not 10. Since the difference is exactly 4 cycles, I must be missing a full instruction somewhere in my counting.
BTW, this means that unlike most (all?) other instructions, waitcnt can execute in as little as one clock cycle. Is that correct?
Thanks for the replies.
Not really, it's a trick to save on ram as destination-side use of CNT is a Shadow register and is available for general use.
mov cnt,#5 'set shadow to 5
add cnt,cnt 'add the real cnt to shadow
waitcnt cnt, #0 'wait for the real cnt to match shadow, then add 0 to shadow.
5 is the overhead and results in a zero wait (you cannot use less than 5).
Though the waitcnt would be a nop in nature, it would still take 6 cycles, as waitcnt is not a 4-cycle instruction but a 6-cycle one.
If you came in at ...4,8,12 before the 3 lines of code, you would come out at 26,30,34..., which is in itself a shift by 2 cycles.
Using a 6 in the initial shadow setting, you would come out at 27,31,35, a shift of 3 cycles.
I could not see any good use for it though; it takes too many cycles per change to be useful in high-speed I/O.
If you needed cogs to be out of sync/in sync, you would just do that when you set the initial wait that you have to do with repeated newcog anyway.
And cogs get out of sync on their own after you do a read/write to hub as they have to wait their turn.
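The phase-shift numbers in this post (5 gives a shift of 2, 6 gives a shift of 3) can be checked with a bit of arithmetic. This is a minimal sketch in Python rather than PASM, using only the cycle counts quoted in the post (4 + 4 + 6 clocks for the mov/add/waitcnt sequence); the function names are hypothetical, and reading "...4,8,12" and "26,30,34..." as entry 12 giving exit 26 is an assumption.

```python
MOV_CLOCKS = 4
ADD_CLOCKS = 4
WAITCNT_MIN_CLOCKS = 6  # minimum quoted in the post, not taken from a datasheet
MIN_OFFSET = 5          # smallest initial shadow value, per the post


def exit_cycle(entry, offset):
    """Clock at which the instruction after waitcnt would start."""
    if offset < MIN_OFFSET:
        # The post says less than 5 cannot be used; refuse rather than guess.
        raise ValueError("offset below the minimum quoted in the post")
    extra_wait = offset - MIN_OFFSET
    return entry + MOV_CLOCKS + ADD_CLOCKS + WAITCNT_MIN_CLOCKS + extra_wait


def phase_shift(offset):
    """Shift relative to the 4-clock grid the code was entered on."""
    return exit_cycle(0, offset) % 4


print([exit_cycle(e, 5) for e in (12, 16, 20)])  # [26, 30, 34]
print([exit_cycle(e, 6) for e in (12, 16, 20)])  # [27, 31, 35]
print(phase_shift(5), phase_shift(6))            # 2 3
```

On this reading, each extra count in the initial shadow value moves the exit point by one clock, so 7 would bring the code back onto the original 4-clock grid.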
I'm still not following the rest. Assuming that the clock starts at 0 (which I know isn't the case, and doesn't matter since it's all relative), what would be stored in "myReg" after this instruction?
mov myReg, cnt
Using the fetchSource/fetchDestination/eXecute-instruction/writeResults model (SDXR), during the four clock cycles for a typical instruction, wouldn't that mean that the value in the cnt shadow register would be #1 since Source register value is read in cycle 1?
The 3 lines of code take 14 cycles minimum
* the value of cnt is sampled in the 3rd stage of instruction execution
* waitcnt "releases" as soon as the current count is equal to the stored count, even if it's not a multiple of 4
The sequence is:
* start at cnt=0
* at cnt=3 the value of cnt is sampled (ALU value=3)
* at cnt=4 that value is stored in cnt shadow register (cntShadow=3)
* at cnt=7, 9 is added to the value of cntShadow (ALU value=12)
* at cnt=8 that value is stored in cnt shadow register (cntShadow=12)
* at cnt=9 the S register value is fetched (if it is not immediate; if it is immediate, the value is decoded from the instruction and stored)
* at cnt=10 the D register value is fetched (ALU value=12)
* at cnt=11 waitcnt compares ALU value (=12) to cnt (=11) -> not equal
* at cnt=12 waitcnt compares ALU value (=12) to cnt (=12) -> EQUAL
* at cnt=13 the value of S register (or immediate value) is added to D register value
* at cnt=14 D register value written
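The sequence above can be replayed as a small arithmetic model. This is a sketch in Python, not PASM, and it only encodes the timeline as posted (sample at +3, first compare at +11, two clocks after the match); `replay` and `first_compare_offset` are hypothetical names, and the posted timeline itself is something the thread goes on to question.

```python
CNT_PERIOD = 1 << 32  # the system counter is 32 bits wide


def replay(start=0, advance=9, first_compare_offset=11):
    """Replay the posted mov/add/waitcnt timeline.

    start                -- cnt when the mov begins
    advance              -- value added to the sampled cnt (9 in the post)
    first_compare_offset -- clock, relative to start, of the first compare
    """
    sampled = start + 3           # cnt sampled in the 3rd stage of the mov
    target = sampled + advance    # value waitcnt will wait for
    first_compare = start + first_compare_offset
    if target >= first_compare:
        match = target            # released on the first equal clock
    else:
        match = target + CNT_PERIOD  # target missed: wait for wrap-around
    return {"target": target, "match": match, "next_insn": match + 2}


print(replay())  # {'target': 12, 'match': 12, 'next_insn': 14}
```

With the defaults this reproduces the 14-clock total. The model also makes it easy to test which advance values are accepted: with the first compare at +11 an advance of 8 would still match, whereas moving the first compare to +12 (`first_compare_offset=12`) makes 8 miss and wait for a full wrap.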
So this means that although waitcnt will "release" control at any clock cycle (even non-multiples of 4), it still requires 2 additional clock cycles after cnt==specifiedValue (to add the value of S register and write that value to D register) before execution continues with the next instruction. Is that correct?
If so, can someone explain in more detail why the "active registers" are sampled on the X stage (3) rather than the first (as all other S registers)? Is it "just the way things are wired" for those special registers (which are tied to the ALU execution stage, not actually fetching a value from a "normal" cog RAM register)?
Thanks for the insight into this.
Not to be argumentative, but the datasheet says "5+" clocks. Is that incorrect?
Why does one need to wait when starting/repeating newcogs? Doesn't that take just a normal hubop time (7-22 cycles)? Are you referring to a case when the new cog has to be completely initialized before the cog that started it resumes execution?
Thanks.
If you need to start two cogs in sync, for hi-res VGA etc.,
you would start both of them with a waitcnt in them that is far enough away in the future that both have had a chance to get loaded.
Or if you are starting two cogs from the same DAT block, but with different values inserted into them,
you would need to wait before you change the value and do the next newcog.
see page 28 here: http://gadgetgangster.com/working_files/assembly_tutorial/desilva.pdf
As noted it's slightly off, also you show 2 compare cycles which would mean that #9 isn't the minimal advance to avoid waiting for a full wrap-around. Evidently #8 is not enough which means the first match is done in the 4th cycle of waitcnt (also true for waitpxx).
Yes, it needs some time to wake up and finally write the result if specified (nr is possible).
Let's put it this way, the final values for src and dst are latched in the 3rd cycle (for use by ALU or video h/w). Live registers (which are not part of cog RAM) are mux'ed in at this stage (not needed earlier) so you'll get whatever is in there at the time. So while source operands living in cog RAM and immediate values are processed earlier (there are only 4 cycles to access cog RAM) they don't become active until the 3rd cycle (but are squirreled away until needed).
As a mental exercise, where does jmp phsa go (with the setup above) and where does the next instruction come from (after the jump)?
Also, check out this thread: "Call for Clarity from Chip or Beau" (thread 118096).
From the thread:
First - I remembered this thread from over a year ago because I found it so helpful. I don't recall seeing this level of detail and accuracy (and readability) in any official Parallax document on precision timing. I just came back to it for a refresher and am able to proceed using just this bit of explanation. I want to thank kuroneko for his help and effort.
Second: I realized there was an embedded quiz which I never completed. I'll never earn my propeller beanie that way. So...
Well, it's not going to cog address #2 because: a) that's the obvious answer given the previous explanation and you wouldn't ask an obvious question and b) looking at the Propeller manual for the JMP instruction (although it doesn't explicitly say this) I note that the instruction works by setting the PC equal to the value of the source operand. However, instruction decode happens during the e cycle of the previous instruction, so I presume modification of the PC happens in the D cycle. My guess is that the value of the source field at the end of the S cycle (in this case #1) is loaded/input/sampled as input in the D cycle, and latched into the PC as the output of the D cycle. It is then sampled in the e/I cycle - so that would mean the jmp would go to cog address #1. That reinforces my belief in the non-obvious answer, because now that leaves me at a movi, and I know one has to be careful there. However, movi isn't modifying an instruction (self-modifying code) but instead the ctra register, so I'm going with the next instruction comes from PC+1, which in that case would be cog address #2.
Then again, I'm making a lot of assumptions here, and my brain is beginning to hurt...
With base being the value loaded into phsx just before the jump (#0) you'll end up at base + 2*frqx (#2) and the next insn is fetched from base + 3*frqx + 1 (#4). The last cycle basically sees base + 3*frqx (one after live operand fetch) and treats that as the new PC (source value), IOW while the insn at base + 2*frqx is executed it thinks it's at base + 3*frqx so the next fetch is from +1.
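The two expressions in this answer can be written out as a worked example. This is just the arithmetic from the post in Python form, not PASM; `jmp_phsx` is a hypothetical helper, and frqx = 1 is an assumption inferred from the #2 and #4 results quoted for base #0.

```python
def jmp_phsx(base, frqx):
    """Return (jump target, address of the next fetch) for jmp phsx.

    base -- value loaded into phsx just before the jump
    frqx -- amount added to phsx on each clock
    """
    target = base + 2 * frqx          # where the jump lands
    next_fetch = base + 3 * frqx + 1  # live value seen one clock later, plus 1
    return target, next_fetch


print(jmp_phsx(0, 1))  # (2, 4)
```

So with these values the instruction at address 2 is executed, and the one after it comes from address 4 rather than 3.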