My bugs with the pipeline and stack...

Cluso99 · 2013-10-08 00:54

Postedit: It is due to (my) 2 interacting bugs (see posts below about the problems)
I have also updated the title to better reflect my problem.

BTW: delayed branches/jumps execute the 3 instructions that follow them, not 2 as I had originally thought.

Chip:
I have been chasing down a problem where reading the stack seems to read the last written stack value under some circumstances. Sometimes it affects the first and/or second read and is timing related.

Can you see anything wrong with this code? I am wondering if there might be a problem in the pipeline.

I am using your 30Sep code in DE0, pnut & F11 and PST.
The attached code includes my LMM Debugger.

''==============================================
'' TEST #1
'Save to the stack:                                           
test1         mov       data, #0
              mov       ctr1, #20               ' n samples
              mov       ctr2, #20               ' n samples
              setspa    #0                      ' zero stack ptr
:fill         add       data, addval
              pusha     data                    ' store samples
              djnzd     ctr1, #:fill            ' (executes next 2 instr before jump)
                                                '\\ no problems if these are 3x NOP instructions
              mov       lmm_x, #$1FF            '|| <--- removing (or not nop) causes failures
'             ror       lmm_x, $2               '|| <--- removing (or not nop) causes failures
'             nop                               '// <--- removing 1 causes failure (reads last written stack)
'-----------------------------------------------
' Display the stack:
              setspa    #0                      ' zero stack ptr
:loop         popar     lmm_x 
                mov     lmm_f, #_HEX+0                  '\ set hex mode with 8 digits (8=0=default)
                call    #LmmFun         wz,wc           '/ call LmmFun  routine (saves and restores Z & C flags)
                mov     lmm_x, #CR                      '\\
                call    #LmmTx          wz,wc           '//
              djnz      ctr2, #:loop
''==============================================
                mov     lmm_x, #CR                      '\\
                call    #LmmTx          wz,wc           '//
''==============================================
'' TEST #2
'Save to the stack:
test2         mov       data, #0
              mov       ctr1, #20               ' n samples
              mov       ctr2, #20               ' n samples
              setspa    #0                      ' zero stack ptr
:fill         add       data, addval
              pusha     data                    ' store samples
              djnz      ctr1, #:fill            ' 
'-----------------------------------------------
' Display the stack:
              setspa    #0                      ' zero stack ptr
:loop         popar     lmm_x 
                mov     lmm_f, #_HEX+0                  '\ set hex mode with 8 digits (8=0=default)
                call    #LmmFun         wz,wc           '/ call LmmFun  routine (saves and restores Z & C flags)
                mov     lmm_x, #CR                      '\\
                call    #LmmTx          wz,wc           '//
              djnz      ctr2, #:loop
''==============================================
              jmp       #EnterDebug             ' goto the debugger...
''==============================================
strptr_userid   long      @userid               ' pointer to userid/version string
wait            long      0
delay500ns      long      _xinfreq /2           ' 0.5 sec
delay5s         long      _xinfreq * 5          ' 5 sec
addval          long      $01010101
ctr1            long      0
ctr2            long      0
data            long      0

Here is the first test output (just the first 6 values)

And the second test output, which is correct (just the first 6 values). Note that I have seen errors using the djnz but haven't reproduced them yet.

Various incorrect results can be obtained by changing the 3 instructions (or omitting them) that follow the DJNZD instruction in test 1. Only 3 NOPs work correctly.
usb_093p.spin

Gadgetman · 2013-10-08 03:08

Do you get the same error when using older code?

Seairth · 2013-10-08 06:02

where is the lmm_x label?

Edit: never mind. I see that it's part of the LMM debugger.

Seairth · 2013-10-08 06:40

The delayed jump will execute (assuming no multitasking) the next three instructions following it. So, for instance, when you remove the last NOP, you are instead executing the next instruction (SETSPA #0).

Also, what is the point of the two lmm_x-related instructions? Are these just "filler" instructions, or do they have a significant meaning?

Edit:

I just noticed that two of those post-DJNZD instructions were commented out. Can you run a version that has three non-NOP instructions and show those results? The example you show above is executing:

MOV lmm_x, #$1FF
SETSPA #0
POPAR lmm_x

after every DJNZD iteration, which would certainly mess up your stack results. I still don't see where the $14141414 is coming from.

Yanomani · 2013-10-08 08:28

Seairth wrote: »

The delayed jump will execute (assuming no multitasking) the next three instructions following it. So, for instance, when you remove the last NOP, you are instead executing the next instruction (SETSPA #0).

Also, what is the point of the two lmm_x-related instructions? Are these just "filler" instructions, or do they have a significant meaning?

Edit:

I just noticed that two of those post-DJNZD instructions were commented out. Can you run a version that has three non-NOP instructions and show those results? The example you show above is executing:

MOV lmm_x, #$1FF
SETSPA #0
POPAR lmm_x

after every DJNZD iteration, which would certainly mess up your stack results. I still don't see where the $14141414 is coming from.

Seairth

It may be one of those strange coincidences that always seem to occur, adding a bit of random nonsense to our otherwise reasoned conclusions, but if I have understood the problem correctly, the 14141414 is just a fourfold replay of the just-decremented counter value (#20 - 1), after the first loop iteration has executed, used as the contents of data in the first instruction of the loop.

So this value appears to remain wrongly muxed (perhaps a latency problem) right after "djnzd ctr1, #:fill" executes its (dec ctr1) phase, and the just-decremented value (#20 - 1) propagates to the stage where the pipeline fetches data's value to execute (add data, addval). Then (#20 - 1) + 1 = 14.

EDIT: That is 14 in hex notation, of course, so I should have written 14h.

"Why does it appear to be repeated four times? I had a bunch of insights into how Chip crafted the inner workings of many pipeline operations. If my thoughts are right, this behavior adds some new perspectives to my perception of the full landscape.

EDIT: Duhhh! It appears that the sleeplessness I have been experiencing over the last weeks is affecting a lot more than my reaction times.
Now it has started messing with both my eyes and the left hemisphere of my brain.
Perhaps a severe case of fourfold vision!

It appears fourfold because the output routine shows it that way.
Twice Duhhh!

Sorry for the clutter above."

Why does it happen only after the first iteration? Perhaps on the second one, pipeline flushing eliminates any instructions not resident in the loop, and stops them messing with the muxes.
But these are only guesses of mine.

To check the above, since I have no access to a DEn, I am relying on Cluso99 running a new test, changing the initial count values a bit and telling us what happens after the change.

Here is part of the latest Prop2_Docs.txt that Chip published, which can shed some light on the subject we are discussing.

"When a branch instruction executes, that task's program counter is abruptly changed from what had been a
steadily incrementing course, requiring that the pipeline be reloaded, beginning at the new program counter
address. This can leave up to three instructions in the pipeline which were trailing the branch instruction
and belong to the same task as the branch.

Normally, these trailing instructions are incidental data which are not intended for execution, and therefore
must be cancelled within the pipeline, so that they pass through without doing anything. However, in the case
of a single-task program, it may be desirable to allow those instructions to execute, without cancellation, to
increase pipeline efficiency. This will result in the branch taking just 1 clock cycle, but three trailing
instructions will be executed before the branch appears to take effect:"

Yanomani

Ariba · 2013-10-08 10:35

Cluso

(executes next 2 instr before jump)

If this comment in your code is really what you expect, then here is your error. As Seairth said, the 3 instructions following a delayed jump are executed.

If you have fewer than 3, the stack pointer always gets reset to 0, and all the stack fills go to stackram[0]. The 20th ($14141414) is what you then see at position 0 or 1 (with 1 or 2 removed nops).

Here is my version of the test (with the new SERIN/SEROUT instructions instead of the debugger).
I can insert many types of instructions and the result is always right.

CON
  TX = 90
  RX = 91
  clkfrq = 80_000_000
  baud = 115200

DAT
        org 0
        clkset  max_freq        'set 80MHz
        setpera baudrt          'init serial out
        setsera sermode
test1         serina  t1        'wait for keypress

              mov       txd, #0
              mov       ctr1, #20               ' n samples
              mov       ctr2, #20               ' n samples
              setspa    #0                      ' zero stack ptr
:fill         add       txd, addval
              pusha     txd                    ' store samples
               djnzd    ctr1, #:fill            ' (executes next 2 instr before jump)
                                                '\\ no problems if these are 3x NOP instructions
              mov       lmm_x, #$1FF            '|| <--- removing (or not nop) causes failures
              ror       lmm_x, $2               '|| <--- removing (or not nop) causes failures
              nop                               '// <--- removing 1 causes failure (reads last written stack)
'-----------------------------------------------
' Display the stack:
              setspa    #0                      ' zero stack ptr
:loop         popar     t2 
              call      #hexout
              mov       txd, #32
              call      #chout
              djnz      ctr2, #:loop

              mov       txd, #13
              call      #chout
              mov       txd, #10
              call      #chout
              jmp       #test1

'--- terminal functions ----

hexout  mov digcnt,#8           'print t2 as hex with n digits
:digit  rol t2,#4               'check digit
        mov txd,t2
        and txd,#$F
        cmp txd,#10  wc
        add txd,#"0"            '0..9
  if_ae add txd,#7              'A..F
        cmp digcnt,digits  wc,wz
  if_be call #chout
        djnz digcnt,#:digit
hexout_ret  ret

chout   clrp #TX
        serouta txd
chout_ret ret

exit    coginit monitor,monpins  'relaunch cog0 with monitor
monitor long $700
monpins long TX<<9 + RX

baudrt  long clkfrq/baud
sermode long (2<<7+RX)<<9 + 2<<7+TX
max_freq long %1111_11_11
digits  long 8

t1      long 0
t2      long 0
digcnt  long 0
txd     long 0

addval  long $01010101
lmm_x   long 0
ctr1    long 0
ctr2    long 0

Andy

Seairth · 2013-10-08 11:01

Yanomani wrote: »

Seairth

It may be one of those strange coincidences that always seem to occur, adding a bit of random nonsense to our otherwise reasoned conclusions, but if I have understood the problem correctly, the 14141414 is just a fourfold replay of the just-decremented counter value (#20 - 1), after the first loop iteration has executed, used as the contents of data in the first instruction of the loop.

So this value appears to remain wrongly muxed (perhaps a latency problem) right after "djnzd ctr1, #:fill" executes its (dec ctr1) phase, and the just-decremented value (#20 - 1) propagates to the stage where the pipeline fetches data's value to execute (add data, addval). Then (#20 - 1) + 1 = 14.

EDIT: That is 14 in hex notation, of course, so I should have written 14h.

Ahh! Combine that with the SETSPA and POPAR instructions, and you'd get "01010101" for the first value (due to the first iteration of the loop) and "14141414" for the second value (due to the remaining iterations, which are overwriting the same register). Then this brings up the next question: how do the third value and up get populated properly (assuming the code above)? The code should only be setting the first two stack registers.

Edit: is it possible that the remaining "good" values are due to a prior run of test2 and the stack registers aren't being reset/cleared when test1 is re-run?

Yanomani · 2013-10-08 15:15

Seairth wrote: »

Ahh! Combine that with the SETSPA and POPAR instructions, and you'd get "01010101" for the first value (due to the first iteration of the loop) and "14141414" for the second value (due to the remaining iterations, which are overwriting the same register). Then this brings up the next question: how do the third value and up get populated properly (assuming the code above)? The code should only be setting the first two stack registers.

Edit: is it possible that the remaining "good" values are due to a prior run of test2 and the stack registers aren't being reset/cleared when test1 is re-run?

After some 3 hours of healthy sleep, I believe I have managed to flush my own brain pipelines of much of the cluttering nonsense.

To get back on track with the insight that crossed my mind as soon as I read Cluso99's post, I had to re-read Chip's documentation, trying to understand every aspect of pipeline operation, for single-task and multitasking operation, and for normal and delayed jumps.

Ariba's explanation cleared things up, pointing out exactly what happened when Cluso99's routine went beyond the limits of straight-ahead pipeline operation.

But I was not looking for just one isolated reason; I was trying to understand the underlying mechanics of the pipeline, and what exactly we can expect to extract, perhaps in the future, from such an exotic deterministic machine that sometimes appears chaotic.

The output Cluso99 showed was in fact the result of many tests, since he stated that:

"Various incorrect results can be obtained by changing the 3 instructions (or omitting them) that follow the DJNZD instruction in test 1. Only 3 NOPs work correctly."

And I must agree with your conclusion that, just before the results shown in the original forum post were obtained, test2 had run in a previous iteration and correctly set the stack values shown.

But, to repeat it so it is not forgotten: Chip's wonderful work is a deterministic, state-machine-driven pipeline, originally intended for single-task operation, that was later adapted to accommodate the four-task operation we can enjoy now.

The working rules of the pipeline have changed from a programmer's standpoint, but its behavior relies on the rules of its state machine, which have remained almost the same.

I just can't wait to begin experimenting with it, trying out loop code variations to test its true potential, and our own ingenuity, in squeezing the last drop of juice from it.

Years of enjoyment to come, if we can just survive in one piece with our sanity intact.

Yanomani

Cluso99 · 2013-10-08 15:41

Just to quickly answer some questions/comments

1. The test only works correctly when 3 x nops follow the djnzd
2. The $14141414 is the result of the last iteration (#20) of the add loop. The repetition of the 14 is only because of the data I chose.
3. I have seen results where the first output is $14141414 followed by $01010101, then correctly $03030303 etc.
4. I have seen results where the first output is $01010101 followed by $14141414, then correctly.
5. I have seen the errors when I replaced DJNZD with DJNZ. This morning I will go back and dig out those codes and see if I can find out why I get errors.

At least I save my code regularly (as a history), so I can revert back. My original code was way more complex than this so I had to remove other code to get down to something simple to analyse the real problem.

Prior to posting this code I was clearing the stack, and only performing test 1 or test 2. Originally I noticed this problem because I put a single ASCII char on the stack at the start, followed by the saved data (what I am saving is an actual snoop of the USB FS bus), and that ASCII char was getting corrupted. Originally I suspected it was my debugger, but I am now happy that it is not to blame.

Cluso99 · 2013-10-08 16:53

Thanks for the suggestions. What I posted has become a little clearer.

One of the byproducts of the stack not clearing on reset is that it retains what was written previously. This distorts the results.

Here is a new version where I prefill the stack starting at $80808080, with $01010101 added for each successive entry, and then run the save loop. I have added nops to the prefill section just in case.

What it shows is that Test #1 places the first value onto the stack and the last value into the second position. No other stack locations are written to.
So, DJNZD is indeed executing the SETSPA #0. Its position in the pipeline (changed by the instructions/nops preceding it) determines whether the stack pointer has been reset to 0 or is at +1 when the next PUSHA executes. This makes sense.

'test 1 output (only part)
01010101
14141414
83838383
84848484
85858585
86868686

usb_093q.spin

Now to see what the circumstances were when I used DJNZ instead of DJNZD.

ozpropdev · 2013-10-08 17:45

Hi Cluso

I just ran your code on my DE0 board with the latest FPGA binary with same results.
I then tried it on my DE2 board with previous FPGA binary with the same result.
That eliminates FPGA version differences.
I use the stack intensively in Invaders with no "apparent" issues.
I'll do some tests and see what transpires.

Cheers
Brian

Cluso99 · 2013-10-08 17:55

Think I have resolved the problem. I will go back to my old code shortly to verify.
I will mark the thread as "solved" in the meantime.

Meanwhile, here are some further tests that resulted in some fairly obscure program bugs...

Here is .93r
* I clear the PST screen
* I am filling 20 positions but only displaying the first 5
* I have prefilled the stack ($80800000 plus I add #1)
* TEST1 has only 1 NOP (now a MOV data, #$0F0) - this replaces the "data" ready for the second add.
* added TEST1A which adds 3 instructions following DJNZD, and all MOV data, #$0x0. Again this replaces the "data" stored in the second push. So all pushes following the first are wrong due to a different program bug.

Section of my offending code for Test1...

              setspa    #0                      ' zero stack ptr
:fill         add       data, addval
              pusha     data                    ' store samples
              djnzd     ctr1, #:fill            ' (executes next 2 instr before jump)
                                                '\\ no problems if these are 3x NOP instructions
              mov       data, #$0F0            '|| <--- removing (or not nop) causes failures
'              nop       'ror       data, $2               '|| <--- removing (or not nop) causes failures
'              nop       'nop                               '// <--- removing 1 causes failure (reads last written stack)

Section of my offending code for Test1A...

              setspa    #0                      ' zero stack ptr
:fill         add       data, addval
              pusha     data                    ' store samples
              djnzd     ctr1, #:fill            ' (executes next 2 instr before jump)
                                                '\\ no problems if these are 3x NOP instructions
              mov       data, #$0F0             '|| <--- removing (or not nop) causes failures
              mov       data, #$0C0             '|| <--- removing (or not nop) causes failures
              mov       data, #$0A0             '// <--- removing 1 causes failure (reads last written stack)

80800001
80800002
80800003
80800004
80800005
80800006
-
01010101
010101F1 <-----
80800003
80800004
80800005
80800006
-
01010101
010101A1 <-----
010101A1
010101A1
010101A1
010101A1

Most likely, my original bug is a mix of these two bugs.
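
The two faulty listings above can be checked with a rough Python model of the fill loop. It is a sketch under assumptions, not a description of the real pipeline: a taken DJNZD is assumed to run exactly the three instructions that follow it, PUSHA is assumed to write and then increment the stack pointer, and POPAR is assumed to read and then increment it. The function `fill` and the slot tuples are hypothetical names used only here.

```python
ADDVAL = 0x01010101
MASK = 0xFFFFFFFF


def fill(slots, prefill, count=20):
    """Run the fill loop; `slots` are the 3 instructions run after a taken DJNZD.

    Each slot is a (name, argument) pair; only the kinds used here are modelled.
    """
    stack = list(prefill)
    spa, data, ctr1 = 0, 0, count
    while True:
        data = (data + ADDVAL) & MASK  # add data, addval
        stack[spa] = data              # pusha data
        spa += 1
        ctr1 -= 1                      # djnzd ctr1, #:fill
        if ctr1 == 0:
            return stack               # loop done; later code only reads the stack
        for name, arg in slots:
            if name == "mov_data":
                data = arg             # mov data, #arg clobbers the running sum
            elif name == "setspa":
                spa = arg
            elif name == "popar":
                spa += 1


# Prefill as described for .93r: $80800000 plus 1, 2, 3, ...
prefill = [0x80800000 + i + 1 for i in range(20)]

# TEST1: one MOV, then the display code's SETSPA #0 and POPAR fall into the slots.
test1 = fill([("mov_data", 0x0F0), ("setspa", 0), ("popar", None)], prefill)
# TEST1A: three MOVs fill all the slots.
test1a = fill([("mov_data", 0x0F0), ("mov_data", 0x0C0), ("mov_data", 0x0A0)], prefill)

for label, stack in (("TEST1 ", test1), ("TEST1A", test1a)):
    print(label, " ".join(f"{v:08X}" for v in stack[:6]))
```

In this model TEST1 shows both bugs at once (only two positions written, and the second holds $0F0 + $01010101), while TEST1A writes every position but with $0A0 + $01010101 after the first.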

Certainly an interesting exercise! And a waste of more than a day

usb_093r.spin

Yanomani · 2013-10-08 19:04

Hi Cluso99

What can basically be derived from your tests is that the loop can be coded in a new fashion that saves code space:

SETSPA #0

fill: DJNZD ctr1, #:fill
NOP
ADD data, addval
PUSHA data

It is a rather fancy way to code a loop, with the important instructions apparently outside the loop itself.
It can always be done this way, as a reminder of the value of the new tools we now have at hand. There is the added bonus that the NOP can perhaps be replaced by something useful inside the loop, wherever it is placed among the three instructions following the DJNZD.
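
As a sanity check on the loop above, here is a minimal Python sketch of it. It assumes that the three instructions after DJNZD execute on every pass, whether or not the jump is taken, and it has not been run against real hardware. The function name is hypothetical.

```python
ADDVAL = 0x01010101
MASK = 0xFFFFFFFF


def fill_with_djnzd_first(count):
    """Model: DJNZD at the top, with NOP, ADD, PUSHA in its three trailing slots."""
    if count == 0:
        raise ValueError("a count of 0 would wrap around instead of ending the loop")
    stack = []
    data = 0
    ctr1 = count
    while True:
        ctr1 = (ctr1 - 1) & MASK       # DJNZD ctr1, #:fill (decrement and test)
        taken = ctr1 != 0
        # The three trailing instructions run whether or not the jump is taken.
        pass                           # NOP
        data = (data + ADDVAL) & MASK  # ADD data, addval
        stack.append(data)             # PUSHA data
        if not taken:
            return stack


values = fill_with_djnzd_first(20)
print(len(values), f"{values[0]:08X}", f"{values[-1]:08X}")
```

Under these assumptions a count of 20 gives exactly 20 pushes, because the final, untaken DJNZD still falls through the same three instructions once.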

Good lessons, every day.

Yanomani

Edit1: I had no luck posting the code with the right alignment.

Edit2: This is also a direct consequence of the documentation Chip provided on using delayed branches in a pipelined execution scheme, in single-task scenarios; I posted an excerpt from it in post #5.

Cluso99 · 2013-10-08 20:03

Yanomani:
Perhaps an explanation of what I was trying to do when I fell foul of this may help.

* Examine/log two input pins to the stack
* Each sample must occur every 6.667 clocks (12 MHz = FS USB)
* Therefore I was sampling at intervals of 7+6+7 clocks (20 clocks per 3 samples)
* I unrolled this loop to sample 15 sets, including a ROR data,#2
* This allowed me to store 15 samples per long
* I needed to execute the 2 (my mistake, should be 3) instructions after the DJNZD for the 15 sample loop
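
The timing figures in the list above can be checked with a few lines of Python. This is only a worked arithmetic sketch: the system clock is derived from the 6.667 clocks per sample quoted above rather than stated in the post, and the variable names are made up.

```python
from fractions import Fraction

USB_FS_HZ = 12_000_000
CLOCKS_PER_BIT = Fraction(20, 3)         # the 6.667 clocks quoted above
SYS_CLK_HZ = USB_FS_HZ * CLOCKS_PER_BIT  # implied system clock

pattern = [7, 6, 7]                      # sampling intervals in clocks

# Drift of each sample point relative to the ideal bit spacing.
elapsed = 0
drift = []
for n, step in enumerate(pattern, start=1):
    elapsed += step
    drift.append(elapsed - n * CLOCKS_PER_BIT)

bits_per_sample = 2                      # two input pins
samples_per_long = 15
bits_used = bits_per_sample * samples_per_long

print("system clock:", int(SYS_CLK_HZ), "Hz")
print("drift per sample (clocks):", [str(d) for d in drift])
print("bits used per long:", bits_used)
```

The 7+6+7 pattern never drifts more than a third of a clock from the ideal spacing and returns to zero every 20 clocks, and 15 two-bit samples use 30 of the 32 bits in a long.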

Thus the first bug was detected - I was resetting the SPA using the SETSPA #0 (3rd instruction).
I was not detecting this bug properly because my stack wasn't being cleared, so it had remnants of correct execution.

Then in simplifying what I was trying to do, to find the problem, I introduced the second bug that modified the data set (MOV data, #xx).
From then on, it was all downhill!

Yanomani · 2013-10-08 20:16

Cluso99

Sure, I had understood what you explained earlier, when you marked the thread as solved.

What occurred to me, as an added bonus of studying the case as your own work progressed during the day, is a way to code the loop so that it is totally contained inside the pipeline, as I described in post #13.

Since I have no access to a DEn board, I must rely on your kind indulgence, or that of any other helpful person who has access to one, to test whether my insight is sound or not.

Thanks in advance for giving attention to my doubts.

Yanomani

P.S. In fact, since the loop stays totally contained inside the pipeline itself, like the roots of hydroponically grown vegetables, why not call them Propeller II Hydroponic Loops?

Heater. · 2013-10-09 01:43

Yanomani,

        SETSPA       #0

fill:   DJNZD        ctr1, #:fill
        NOP
        ADD          data, addval
        PUSHA        data

Oh boy. Looks like we might be seeing some very strange looking code turning up in the PII.

Seairth · 2013-10-09 04:05

Cluso99 wrote: »

Certainly an interesting exercise! And a waste of more than a day

Hardly a waste of time! This pattern will be encountered again and again by people new to the P2 (and probably occasionally by people who aren't new to it). We all get to learn from the experience.

Seairth · 2013-10-09 04:14

Heater. wrote: »
        SETSPA       #0

fill:   DJNZD        ctr1, #:fill
        NOP
        ADD          data, addval
        PUSHA        data
Oh boy. Looks like we might be seeing some very strange looking code turning up in the PII.

Nice! Note that I'd still rather read/comprehend this "out-of-order" code than the similar kind of stuff that you have to do for processors like the Itanium.

Seairth · 2013-10-09 04:20

Heater. wrote: »

fill:   DJNZD        ctr1, #:fill
        NOP
        ADD          data, addval
        PUSHA        data

This also makes me realize that part of the new documentation effort should be a cookbook-style section for documenting exactly these kinds of design patterns.

Heater. · 2013-10-09 04:49

Seairth,

Note that I'd still rather read/comprehend this "out-of-order" code than the similar kind of stuff that you have to do for processors like the Itanium.

You would have thought Intel would have learned from the failure of their i860 RISC machine years before. That machine had long pipelines and could run integer and floating point code at the same time. The result was that to get speed out of it you ended up with code like:

    add r1, r2, r3    ; fmul  r4, r5, r6

Looks harmless. Result r1 becomes the integer addition of r2 and r3. Result r4 becomes the floating point multiplication of r5 and r6.
Except of course that result in r1 actually becomes the result of an instruction you started 4 or 5 instructions previously and has only just popped out of the pipeline, the actual result of this instruction will come out later as well! Same for that multiply.

It turned out to be extremely hard for programmers to get full speed out of the thing in assembler and impossible for compilers. As Wikipedia says:

While theoretically capable of peaking at about 60-80 MFLOPS for both single precision and double precision for the XP versions, hand-coded assemblers managed to get only about up to 40 MFLOPS, and most compilers had difficulty getting even 10 MFLOPS.

Appalling. At one Intel presentation I attended on the launch of the i860 the guy was showing how to speed up an FFT in assembler using the pipelines and parallel execution etc. He could not get the thing up to full speed. It turned out some other company had figured that out. But really, even Intel did not know how to program their own chip!

Seairth,

This also makes me realize that part of the new documentation effort should be a cookbook-style section for documenting exactly these kinds of design patterns.

Yes. I do hope the PII architecture has not gone too far in the i860 and Itanium direction such that only the extreme gurus can get performance out of it or understand what others have written.

Dave Hein · 2013-10-09 05:02

I think compilers have improved since the i860, and they can optimize code to handle pipeline delays and other hardware restrictions. You may have to be a guru to write optimized P2 assembly, but optimal C code should not be as difficult. We may have to develop a set of rules for beginners and non-guru programmers for them to safely program in assembly.

Heater. · 2013-10-09 05:45

Dave,

I'm sure compiler technology has improved, but not enough. I know little of these things, but the evidence would suggest the compiler writers have failed to make a silk purse out of a pig's ear. For example, the bulk of the supercomputers in the TOP500 are x86_64 or EM64T, which are your basic 64 bit x86 architecture as invented by AMD. http://en.wikipedia.org/wiki/File:Processor_families_in_TOP500_supercomputers.svg

Of course the Prop II is not at the level of complexity of the i860 or Itanium architectures so I hope you are right and a lot can be done in C/C++.

Yanomani · 2013-10-09 08:54

Heater. wrote: »
Yanomani,
        SETSPA       #0

fill:   DJNZD        ctr1, #:fill
        NOP
        ADD          data, addval
        PUSHA        data
Oh boy. Looks like we might be seeing some very strange looking code turning up in the PII.

Heater

I fully agree with you: most coders, me included, will tend to put the loop branching instruction at the very end of the loop itself.
Even more so when some instructions within the loop change the contents of the registers that control the loop count.

e.g.

SETSPA #0

fill: DJNZD ctr1, #:fill
SUB ctr1, #1
ADD data, addval
PUSHA data

Originally intended to cut the loop count in half while avoiding a prior, time-consuming division or shift, this will lead to strange results if the coder does not pay full attention to setting the initial value of ctr1 properly.
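
Why the initial value of ctr1 matters can be seen in a small Python sketch of this loop. It is a rough model, not verified on hardware: it assumes the three instructions after DJNZD always execute, and that the result of the SUB is already visible to the next DJNZD. The function name and the `limit` guard are hypothetical.

```python
MASK = 0xFFFFFFFF


def pushes_with_sub_in_slot(count, limit=1000):
    """Count PUSHA executions for the DJNZD / SUB / ADD / PUSHA loop.

    Returns None if the loop has not ended after `limit` passes.
    """
    ctr1 = count
    pushes = 0
    while pushes < limit:
        ctr1 = (ctr1 - 1) & MASK  # DJNZD ctr1, #:fill
        taken = ctr1 != 0
        ctr1 = (ctr1 - 1) & MASK  # SUB ctr1, #1 (first trailing slot)
        pushes += 1               # ADD data, addval / PUSHA data
        if not taken:
            return pushes
    return None


for start in (19, 20, 21):
    print(start, "->", pushes_with_sub_in_slot(start))
```

In this model an odd start value N ends after (N + 1) / 2 pushes, while an even start value steps past zero without DJNZD ever seeing it, so the loop keeps running.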

But since we can't expect to solve old problems by doing things in the same old way, there must come a turning point where our minds have to adapt and evolve.

Education is the best-known method to pursue this.
Propeller 2 will be a wonderful tool; it is up to us to learn how to extract its full potential.

I am truly grateful to Cluso99 for bringing us the opportunity to learn, test and share the important concepts expressed in this thread.

Yanomani

Bill Henning · 2013-10-09 09:19

This should be faster & smaller:

Known fixed repeat count:

    reps    #20,#2      ' repeat count is always the same
    setspa  #0
    ' loop below
    add     data,addval
    pusha   data

Variable repeat count:

    repd    ctr1,#2
    nop                 ' filler - should be filled with instruction from before the loop
    nop                 ' filler - should be filled with instruction from before the loop
    setspa  #0
    ' loop below
    add     data,addval
    pusha   data

I think the P2 manual will need a VERY good section on pipelining for newbies, plus it should recommend that people new to pasm2 initially avoid the pipelined instructions.

The pipelined instructions make a huge difference in the speed of code I've been working on.

rod1963 · 2013-10-09 09:58

The big question is whether or not GCC can handle the pipeline ordering correctly so coders don't have to delve into PASM to get full speed from the P2. Otherwise, if only guru-level coders can eke out max MIPS from the P2, it's not a good thing.

Heater. · 2013-10-09 10:16

Bill,

..the P2 manual will need a VERY good section on pipelining for newbies, plus it should recommend that people new to pasm2 initially avoid the pipelined instructions

Yep. That means the newbie instructions need to tell how to avoid getting into unexpected trouble with the pipeline. Like where to put a load of NOPS.

rod1963,

I have high hopes for the propgcc guys. Currently propgcc can compile a C version of the fft_bench that runs in COG almost as fast as the hand made assembler version that has been tweaked by a few people on the forum.

Dave Hein · 2013-10-09 11:07

Here's a link that describes how GCC handles instruction pipelines -- http://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html . If you click on the "previous" link it describes how delayed jumps are handled. So I think GCC is capable of generating efficient code for P2. It's just a matter of setting up the architecture description correctly. Hopefully we'll get some of Eric's time when it comes to optimizing PropGCC for P2.

Cluso99 · 2013-10-09 17:48

Why I actually got caught was that I was under the wrong impression that the delayed jumps and branches executed only 2 instructions, not 3 following the jump/branch.
My summarised instruction spreadsheet (posted in the documents thread) needs changing!

I had actually determined that these following instructions needed to be executed in my loop and would cause no problem when they were executed after the jump was not taken.

I cannot use the REP instruction because a waitcnt also appears within my actual loop.

cgracey · 2013-10-09 20:30

rod1963 wrote: »

The big question is whether or not GCC can handle the pipeline ordering correctly so coders don't have to delve into PASM to get full speed from the P2. Otherwise, if only guru-level coders can eke out max MIPS from the P2, it's not a good thing.

If we had an animated visualization tool that would let you see what happens when different code sequences execute, people would learn right away. Ahle????

Yanomani · 2013-10-09 20:43

cgracey wrote: »

If we had an animated visualization tool that would let you see what happens when different code sequences execute, people would learn right away. Ahle????

Hi Chip,

Have you just developed the ability to read minds?
Is sleeplessness one of the contributing factors to this achievement?

I believe that we need better documentation of pipeline operation when the delayed versions of the branching instructions are used.
Although there is no warning in the latest version of Prop2_Docs.txt, the possibility of having two or more delayed branches in the pipeline simultaneously raises a lot of concerns.

Yanomani

Phil Pilgrim (PhiPi) · 2013-10-09 20:50

A source code editor that simply adds some sort of graphical annotation to show where the branch actually occurs is probably good enough.

-Phil
