My bugs with the pipeline and stack...
Cluso99
Posts: 18,069
Postedit: It is due to (my) 2 interacting bugs (see posts below about the problems)
I have also updated the title to better reflect my problem.
BTW Delayed branches/jumps execute 3 instructions, not 2 as I had originally thought.
Chip:
I have been chasing down a problem where reading the stack seems to read the last written stack value under some circumstances. Sometimes it affects the first and/or second read and is timing related.
Can you see anything wrong with this code? I am wondering if there might be a problem in the pipeline.
I am using your 30Sep code in DE0, pnut & F11 and PST.
The attached code includes my LMM Debugger.
usb_093p.spin
''==============================================
'' TEST #1
'Save to the stack:
test1         mov     data, #0
              mov     ctr1, #20               ' n samples
              mov     ctr2, #20               ' n samples
              setspa  #0                      ' zero stack ptr
:fill         add     data, addval
              pusha   data                    ' store samples
              djnzd   ctr1, #:fill            ' (executes next 2 instr before jump)
                                              '\\ no problems if these are 3x NOP instructions
              mov     lmm_x, #$1FF            '|| <--- removing (or not nop) causes failures
'             ror     lmm_x, $2               '|| <--- removing (or not nop) causes failures
'             nop                             '// <--- removing 1 causes failure (reads last written stack)
'-----------------------------------------------
' Display the stack:
              setspa  #0                      ' zero stack ptr
:loop         popar   lmm_x
              mov     lmm_f, #_HEX+0          '\ set hex mode with 8 digits (8=0=default)
              call    #LmmFun wz,wc           '/ call LmmFun routine (saves and restores Z & C flags)
              mov     lmm_x, #CR              '\\
              call    #LmmTx wz,wc            '//
              djnz    ctr2, #:loop
''==============================================
              mov     lmm_x, #CR              '\\
              call    #LmmTx wz,wc            '//
''==============================================
'' TEST #2
'Save to the stack:
test2         mov     data, #0
              mov     ctr1, #20               ' n samples
              mov     ctr2, #20               ' n samples
              setspa  #0                      ' zero stack ptr
:fill         add     data, addval
              pusha   data                    ' store samples
              djnz    ctr1, #:fill            '
'-----------------------------------------------
' Display the stack:
              setspa  #0                      ' zero stack ptr
:loop         popar   lmm_x
              mov     lmm_f, #_HEX+0          '\ set hex mode with 8 digits (8=0=default)
              call    #LmmFun wz,wc           '/ call LmmFun routine (saves and restores Z & C flags)
              mov     lmm_x, #CR              '\\
              call    #LmmTx wz,wc            '//
              djnz    ctr2, #:loop
''==============================================
              jmp     #EnterDebug             ' goto the debugger...
''==============================================
strptr_userid long    @userid                 ' pointer to userid/version string
wait          long    0
delay500ns    long    _xinfreq /2             ' 0.5 sec
delay5s       long    _xinfreq * 5            ' 5 sec
addval        long    $01010101
ctr1          long    0
ctr2          long    0
data          long    0

Here is the first test output (just the first 6 values):
01010101 14141414 03030303 04040404 05050505 06060606

And here is the second test output, which is correct (just the first 6 values). Note that I have seen errors using the djnz but haven't reproduced them yet.
01010101 02020202 03030303 04040404 05050505 06060606

Various incorrect results can be obtained by changing the 3 instructions (or omitting them) that follow the DJNZD instruction in test 1. Only 3 NOPs work correctly.
Comments
Edit: never mind. I see that it's part of the LMM debugger.
Also, what is the point of the two lmm_x-related instructions? Are these just "filler" instructions, or do they have a significant meaning?
Edit:
I just noticed that two of those post-DJNZD instructions were commented out. Can you run a version that has three non-NOP instructions and show those results? The example you show above is executing:
MOV lmm_x, #$1FF
SETSPA #0
POPAR lmm_x
after every DJNZD iteration, which would certainly mess up your stack results. I still don't see where the $14141414 is coming from.
Seairth
It may be one of those strange coincidences that always seem to occur, just to add a bit of random nonsense to our otherwise conscious conclusions, but if I have understood the problem correctly, the 14141414 is just a fourfold replay of the just-decremented counter value (#20 - 1) after the first loop iteration had executed, used as if it were the contents of data in the first instruction of the loop.
So this value appears to remain wrongly muxed (perhaps some latency problem) soon after the "djnzd ctr1, #:fill" executes its (dec ctr1) phase, and the just-decremented value (#20 - 1) propagates up to the stage where the pipeline is fetching data's value to execute (add data, addval). Then (#20 - 1) + 1 = 14.
EDIT: That is taking 14 in hex notation, of course, so I should have written 14h.
"Why does it appear repeated four times? I have had a bunch of insights into how Chip crafted the inner belly of many pipeline operations. If my thoughts are right, this behavior adds some new perspectives to my perception of the full landscape.
EDIT: Duhhh! It appears that the amount of sleeplessness I have been experiencing these last weeks is affecting a lot more than my reaction readiness.
Now it has just started messing with both my eyes and the left hemisphere of my brain.
Perhaps a severe case of fourfold vision of mine!
It appears fourfold because the output routine shows it that way.
Twice Duhhh!
Sorry for the cluttering garbage above."
Why does it happen only after the first iteration? Perhaps on the second one, pipeline flushing eliminates any instructions not resident in the loop, and stops them messing with the muxes.
But these are only guesses of mine.
To check the above, since I have no access to a DEn, I'm relying on a new test from Cluso99, changing the first count values a bit and telling us what happens after the change.
Here is part of the latest Prop2_Docs.txt that Chip has published, which may shed some light on the subject we are now discussing.
"When a branch instruction executes, that task's program counter is abruptly changed from what had been a
steadily incrementing course, requiring that the pipeline be reloaded, beginning at the new program counter
address. This can leave up to three instructions in the pipeline which were trailing the branch instruction
and belong to the same task as the branch.
Normally, these trailing instructions are incidental data which are not intended for execution, and therefore
must be cancelled within the pipeline, so that they pass through without doing anything. However, in the case
of a single-task program, it may be desirable to allow those instructions to execute, without cancellation, to
increase pipeline efficiency. This will result in the branch taking just 1 clock cycle, but three trailing
instructions will be executed before the branch appears to take effect:"
Yanomani
If this comment in your code is really what you expect, then here is your error. As Seairth said, after a delayed jump 3 following instructions are executed.
If you have fewer than 3, the stack pointer always gets reset to 0, and all the stack fills go to stackram[0]. The 20th ($14141414) is what you then see at position 0 or 1 (with 1 or 2 NOPs removed).
Here is my version of the test (with the new SERIN/SEROUT instructions instead of the debugger)
I can insert many types of instructions and the result is always right. Andy
Ahh! Combine that with the SETSPA and POPAR instruction, you'd get "01010101" for the first value (due to the first iteration of the loop) and "14141414" for the second value (due to the remaining iterations, which are overwriting the same register). Then this brings up the next question: how do the third value and up get populated properly (assuming the code above)? The code should only be setting the first two stack registers.
Edit: is it possible that the remaining "good" values are due to a prior run of test2 and the stack registers aren't being reset/cleared when test1 is re-run?
After some 3 hours of healthy sleep, I believe I have managed to flush my own brain pipelines of much of the idea-cluttering nonsense.
To get back on track with the insight that crossed my mind as soon as I read Cluso99's post, I had to re-read Chip's documentation, trying to understand each and every aspect of pipeline operation, for single-task and multitasking operation, and also for normal and delayed jumps.
Ariba's explanation cleared things up, pointing out exactly what happened when Cluso99's routine crossed the limits of straight-ahead pipeline operation.
But I was looking for more than one lone reason: I was trying to understand the underlying mechanics of the pipeline, and what exactly we can expect to extract, perhaps in the future, from such an exotic deterministic machine, which sometimes appears to show some chaos.
Cluso99's output was in fact the result of many tests, since he stated that:
"Various incorrect results can be obtained by changing the 3 instructions (or omitting them) that follow the DJNZD instruction in test 1. Only 3 NOPs work correctly."
And sure, I must agree with your conclusion that, just before the last results were obtained and displayed in the original forum post, test2 had passed in a previous run and correctly set the stack values shown.
But, to repeat so it is not forgotten: Chip's wonderful work is a pipeline mechanism driven by a deterministic state machine, originally intended for single-task operation, that was later adapted to accommodate the wonderful four-task threading we can enjoy now.
The working rules of the pipeline, from a programmer's standpoint, have changed, but its behavior relies on its state machine's operational rules, which have remained almost the same.
I just can't wait to begin experimenting with it, using extra loop code alterations to challenge its true potential, and our own ingenuity, in extracting the last drop of juice from it.
Years of enjoyment to come, if we can just survive in one piece, as beings not deprived of sanity.
Yanomani
1. The test only works correctly when 3 x nops follow the djnzd
2. The $14141414 is the result of the last iteration (#20) of the add loop. The repetition of the 14 is only because of the data I chose.
3. I have seen results where the first output is $14141414 followed by $01010101, then correctly $03030303 etc.
4. I have seen results where the first output is $01010101 followed by $14141414, then correctly.
5. I have seen the errors when I replaced DJNZD with DJNZ. This morning I will go back and dig out those codes and see if I can find out why I get errors.
At least I save my code regularly (as a history), so I can revert back. My original code was way more complex than this so I had to remove other code to get down to something simple to analyse the real problem.
Prior to posting this code I was clearing the stack, and only performing test 1 or test 2. Originally I noticed this problem because I put a single ASCII char on the stack at the start, followed by the saving (what I am saving is an actual snoop of the USB FS bus), and that ASCII char was getting corrupted. Originally I suspected it was my debugger, but I am now happy that it is not to blame.
One of the byproducts of the stack not clearing on reset is that it retains what was written previously. This distorts the results.
Here is a new post where I prefill the stack with $80808080 and then save it with +$01010101 continually added. I have added nops to the prefill section just in case.
What it shows is that Test #1 places the first value onto the stack and the last value into the second position. No other stack locations are written to.
So, DJNZD is indeed executing the SETSPA #0. Whether the +1 from the previous PUSHA takes effect before or after the stack pointer is reset to 0 depends on the SETSPA's position in the pipeline (changed by the instructions/NOPs preceding it). This makes sense.
usb_093q.spin
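For anyone without a DE0 to hand, here is a rough Python sketch of the behaviour described above. It is only a toy model, not Chip's pipeline: it assumes instructions take effect strictly in program order, that a taken DJNZD runs exactly the 3 trailing instructions, and that POPAR reads at the stack pointer and then increments it. The mnemonic strings, STACK_DEPTH and the function names are made up for illustration.

```python
ADDVAL = 0x01010101
DELAY_SLOTS = 3          # trailing instructions executed after a taken DJNZD
STACK_DEPTH = 32         # hypothetical depth, just enough for this test


def run_test1(after_djnzd, samples=20, prefill=0x80808080):
    """Toy model of the :fill loop from test 1.

    `after_djnzd` lists the mnemonics that follow DJNZD in program order;
    only the first DELAY_SLOTS of them run after a taken branch.
    Not cycle-accurate: every effect is applied in program order.
    """
    stack = [prefill] * STACK_DEPTH
    spa, data, ctr1 = 0, 0, samples
    while True:
        data = (data + ADDVAL) & 0xFFFFFFFF      # add   data, addval
        stack[spa % STACK_DEPTH] = data          # pusha data (write, then +1)
        spa += 1
        ctr1 -= 1                                # djnzd ctr1, #:fill
        if ctr1 == 0:
            break                                # not taken: fall through
        for op in after_djnzd[:DELAY_SLOTS]:
            if op == "setspa0":
                spa = 0                          # setspa #0
            elif op == "popar":
                spa += 1                         # popar: read, then +1 (assumed)
            # "mov", "ror" and "nop" leave the stack pointer alone
    return stack


def hexes(stack, n=3):
    """First n stack entries as 8-digit hex strings."""
    return [f"{v:08X}" for v in stack[:n]]


# Three NOPs keep SETSPA/POPAR out of the delay slots.
print(hexes(run_test1(["nop", "nop", "nop", "setspa0", "popar"])))
# One filler instruction: SETSPA and POPAR both land in the delay slots.
print(hexes(run_test1(["mov", "setspa0", "popar"])))
# Two filler instructions: only SETSPA lands in the delay slots.
print(hexes(run_test1(["mov", "ror", "setspa0", "popar"])))
```

With one filler instruction the model writes the first value to position 0 and the last value to position 1, leaving the prefill everywhere else, which matches what the prefilled test shows. It does not reproduce the case where the two values appear swapped, since that depends on real pipeline timing that this sketch ignores.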
Now to see what the circumstances were when I used DJNZ instead of DJNZD.
I just ran your code on my DE0 board with the latest FPGA binary with same results.
I then tried it on my DE2 board with previous FPGA binary with the same result.
That eliminates FPGA version differences.
I use the stack intensively in Invaders with no "apparent" issues.
I'll do some tests and see what transpires.
Cheers
Brian
I will mark the thread as "solved" in the meantime.
Meanwhile, here are some further tests that resulted in some fairly obscure program bugs...
Here is .93r
* I clear the PST screen
* I am filling 20 positions but only displaying the first 5
* I have prefilled the stack ($80800000 plus I add #1)
* TEST1 has only 1 NOP (now a MOV data, #$0F0) - this replaces the "data" ready for the second add.
* added TEST1A which adds 3 instructions following DJNZD, and all MOV data, #$0x0. Again this replaces the "data" stored in the second push. So all pushes following the first are wrong due to a different program bug.
Section of my offending code for Test1... Section of my offending code for Test1A...
Most likely, my original bug is a mix of these two bugs.
Certainly an interesting exercise! And a waste of more than a day
usb_093r.spin
What can basically be derived from your tests is that the loop can be coded in a new fashion that saves code space:
              SETSPA  #0
:fill         DJNZD   ctr1, #:fill
              NOP
              ADD     data, addval
              PUSHA   data
It appears to be a rather fancy way to code a loop, with the important instructions apparently outside the loop itself.
It can always be done this way, to remind us of the value of the new tools we now have at hand. As an added bonus, the NOPs might be replaced with something useful, right inside the loop, irrespective of where they are placed among the three instructions following the DJNZD.
Good lessons, every day.
Yanomani
Edit1: I'm having no luck posting the code with the right alignment.
Edit2: This is also a direct consequence of the documentation Chip provided on using delayed branches inside a pipelined execution scheme in single-task scenarios, from which I posted an excerpt in post #5.
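As a sanity check on the loop above, here is a small Python sketch that counts what it would push. It is a sequential toy model under stated assumptions: DJNZD decrements and tests ctr1 first, and the three instructions after it run on every pass, as delay slots when the branch is taken and as ordinary fall-through code when it is not. The function name is hypothetical, and nothing here has been run on real hardware.

```python
def hydroponic_fill(count, addval=0x01010101):
    """Model of:
                  setspa  #0
    :fill         djnzd   ctr1, #:fill
                  nop
                  add     data, addval
                  pusha   data
    Returns the list of values pushed, in order. Expects count >= 1.
    """
    stack, data, ctr1 = [], 0, count
    while True:
        ctr1 -= 1                                # djnzd: decrement, then test
        taken = ctr1 != 0
        # nop / add / pusha run either as delay slots (branch taken)
        # or as ordinary fall-through code (branch not taken)
        data = (data + addval) & 0xFFFFFFFF      # add   data, addval
        stack.append(data)                       # pusha data
        if not taken:
            return stack


values = hydroponic_fill(20)
print(len(values), f"{values[0]:08X}", f"{values[-1]:08X}")
```

Under these assumptions ctr1 = 20 still gives exactly 20 pushes, so the loop count does not need adjusting when the branch is moved to the top.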
Perhaps an explanation of what I was trying to do when I fell foul of this may help.
* Examine/log two Input pins to the stack
* Each sample must occur in 6.667 clocks (12MHz =FS USB)
* Therefore I was sampling every 7+6+7 clocks
* I unravelled this loop to sample 15 sets including a ROR data,#2
* This allowed me to store 15 samples per long
* I needed to execute the 2 (my mistake, should be 3) instructions after the DJNZD for the 15 sample loop
Thus the first bug was detected - I was resetting the SPA using the SETSPA #0 (3rd instruction).
I was not detecting this bug properly because my stack wasn't being cleared, so it had remnants of correct execution.
Then in simplifying what I was trying to do, to find the problem, I introduced the second bug that modified the data set (MOV data, #xx).
From then on, it was all downhill !
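The sample timing in the list above can be checked with a few lines of Python. This is only a sketch: the 80 MHz system clock is an assumption inferred from 6.667 clocks per bit at 12 MHz, and the function names are made up for illustration.

```python
from fractions import Fraction

SYS_CLOCK_HZ = 80_000_000     # assumption: implied by 6.667 clocks per 12 MHz bit
USB_FS_BIT_HZ = 12_000_000    # full-speed USB bit rate

clocks_per_bit = Fraction(SYS_CLOCK_HZ, USB_FS_BIT_HZ)   # exactly 20/3


def sample_offsets(n, pattern=(7, 6, 7)):
    """Clock offsets of the first n samples using the repeating gap pattern."""
    offsets, t = [], 0
    for i in range(n):
        offsets.append(t)
        t += pattern[i % len(pattern)]
    return offsets


def worst_drift(n):
    """Largest distance, in clocks, between a sample and its ideal position."""
    return max(abs(off - i * clocks_per_bit)
               for i, off in enumerate(sample_offsets(n)))


print(float(clocks_per_bit))      # clocks per USB bit
print(sample_offsets(7))          # where the first samples land
print(worst_drift(15))            # worst error over one 15-sample long
```

Because 7+6+7 = 20 clocks is exactly three bit times, the pattern realigns every third sample and the error never exceeds a third of a clock.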
Sure, I had understood what you explained before, when you marked the thread as solved.
What occurred to me, as an added bonus of studying the case as your own work progressed during the whole day, is the way to code the loop totally contained inside the pipeline, as I described in post #13.
Since I have no access to a DEn board, I must rely on your kind indulgence, or that of any other helpful person who has access to one, to test whether my insight is sound or not.
Thanks in advance for giving attention to my doubts.
Yanomani
P.S. In fact, since the loop stays totally contained inside the pipeline itself, like the roots of hydroponically grown vegetables, why not call them Propeller II Hydroponic Loops?
Hardly a waste of time! This pattern will be encountered again and again by people new to the P2 (and probably occasionally by people who aren't new to it). We all get to learn from the experience.
Nice! Note that I'd still rather read/comprehend this "out-of-order" code than the similar kind of stuff that you have to do for processors like the Itanium.
This also makes me realize that part of the new documentation effort should be a cookbook-style section for documenting exactly these kinds of design patterns.
Except of course that result in r1 actually becomes the result of an instruction you started 4 or 5 instructions previously and has only just popped out of the pipeline, the actual result of this instruction will come out later as well! Same for that multiply.
It turned out to be extremely hard for programmers to get full speed out of the thing in assembler, and impossible for compilers. As Wikipedia says: appalling. At one Intel presentation I attended at the launch of the i860, the guy was showing how to speed up an FFT in assembler using the pipelines, parallel execution etc. He could not get the thing up to full speed. It turned out some other company had figured that out. But really, even Intel did not know how to program their own chip!
Seairth, Yes. I do hope the PII architecture has not gone too far in the i860 and Itanium direction such that only the extreme gurus can get performance out of it or understand what others have written.
I'm sure compiler technology has improved, but not enough. I know little of these things, but the evidence would suggest the compiler writers have failed to make a silk purse out of a pig's ear. For example, the bulk of the supercomputers in the TOP500 are x86_64 or EM64T, which are your basic 64 bit x86 architecture as invented by AMD. http://en.wikipedia.org/wiki/File:Processor_families_in_TOP500_supercomputers.svg
Of course the Prop II is not at the level of complexity of the i860 or Itanium architectures so I hope you are right and a lot can be done in C/C++.
Heater
I fully agree with you, as most coders, me included, will tend to put the loop branching instruction at the very end of the loop itself.
All the more so when some instructions within the loop change the contents of registers that control the loop count.
e.g.
              SETSPA  #0
:fill         DJNZD   ctr1, #:fill
              SUB     ctr1, #1
              ADD     data, addval
              PUSHA   data
Originally intended to cut the loop count in half, avoiding a prior, time-consuming division or shift, this will lead to strange results if the coder doesn't pay full attention to properly setting the initial value of ctr1.
But since we can't expect to solve old problems by doing things in the same old way, there has to be a turning point where our minds must adapt and evolve.
Education is the best-known method to pursue this.
Propeller 2 will be a wonderful tool; it's up to us to learn how to extract its full potential.
I am truly grateful to Cluso99 for bringing us the opportunity to learn, test and share the important concepts expressed in this thread.
Yanomani
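To see why the initial value of ctr1 matters in the SUB example, here is a toy Python model of that loop. It assumes ctr1 is a 32-bit register that wraps below zero, that DJNZD tests for zero right after its own decrement, and that the SUB in the delay slot takes effect before the next DJNZD reads ctr1. The function name and the `limit` cut-off are hypothetical, and this has not been run on hardware.

```python
MASK32 = 0xFFFFFFFF


def pushes_with_sub(ctr1, limit=1000):
    """Count PUSHA executions for the loop with SUB ctr1, #1 in a delay slot.

    Returns None if the loop has still not ended after `limit` passes,
    which in this model means the counter stepped over zero and wrapped.
    """
    pushes = 0
    for _ in range(limit):
        ctr1 = (ctr1 - 1) & MASK32       # djnzd ctr1, #:fill (decrement, test)
        taken = ctr1 != 0
        ctr1 = (ctr1 - 1) & MASK32       # sub   ctr1, #1     (delay slot)
        pushes += 1                      # add data, addval / pusha data
        if not taken:
            return pushes
    return None


print(pushes_with_sub(39))   # odd start value
print(pushes_with_sub(40))   # even start value
```

In this model an odd starting value N ends after (N + 1) / 2 pushes, while an even one steps from 1 straight past 0, wraps around and keeps going, which is one concrete form the "strange results" could take.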
Known fixed repeat count:
Variable repeat count:
I think the P2 manual will need a VERY good section on pipelining for newbies, plus it should recommend that people new to pasm2 initially avoid the pipelined instructions.
The pipelined instructions make a huge difference in the speed of code I've been working on.
Yep. That means the newbie instructions need to tell how to avoid getting into unexpected trouble with the pipeline. Like where to put a load of NOPS.
rod1963,
I have high hopes for the propgcc guys. Currently propgcc can compile a C version of the fft_bench that runs in COG almost as fast as the hand made assembler version that has been tweaked by a few people on the forum.
My summarised instruction spreadsheet (posted in the documents thread) needs changing!
I had actually determined that these following instructions needed to be executed in my loop and would not cause a problem when executed when the jump failed.
I cannot use the REP instruction because a waitcnt also appears within my actual loop.
If we had an animated visualization tool that would let you see what happens when different code sequences execute, people would learn right away. Ahle????
Hi Chip,
Have you just developed the ability to read minds?
Is sleeplessness one of the contributing factors for this achievement?
I believe that we must write better documentation on the subject of pipeline operation when delayed versions of the branching instructions are used.
Although there is no warning in the latest version of Prop2_Docs.txt, the possibility of having two or more delayed branches in the pipeline simultaneously could cause a lot of trouble.
Yanomani
-Phil