Fast cog-to-cog copy with non-1 strides — Parallax Forums


Wuerfel_21 Posts: 4,687
edited 2021-08-25 22:23 in Propeller 2

Here's a little technique I just figured out.
If you want to do a copy in cogRAM where the source/destination pointers need to move by more than one each step, you'd usually do it with a 3-instruction loop, probably involving ALTI. But by harnessing the immense powers of self-modifying code and pipeline abuse, you can do a 2-instruction loop:

              ' will copy
              ' apart+0 -> together+0, apart+8 -> together+1, apart+16 -> together+2, etc.
              sets .gathermov,#apart
              setd .gathermov,#together
              add  .gathermov,gatherDelta
              rep @.loop,#9 ' <- copy length
.gathermov    mov 0-0,0-0
              add .gathermov,gatherDelta
.loop

gatherDelta         long (1<<9)+8
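A host-side sketch may make the arithmetic easier to see: in a PASM2 instruction word the S field occupies bits 0-8 and the D field bits 9-17, so one ADD of (1<<9)+8 steps the destination by 1 and the source by 8 at once. The Python below (the `APART`/`TOGETHER` addresses are made-up values, not from the original) just replays that field arithmetic:

```python
# Host-side sketch of the gatherDelta trick: a PASM2 instruction word
# keeps its source register in bits 0..8 and its destination register
# in bits 9..17, so one 32-bit ADD can step both fields at once.

APART    = 0x100   # hypothetical cogRAM address of the source table
TOGETHER = 0x040   # hypothetical cogRAM address of the destination block

def set_s(instr, s):           # SETS: replace bits 0..8
    return (instr & ~0x1FF) | (s & 0x1FF)

def set_d(instr, d):           # SETD: replace bits 9..17
    return (instr & ~(0x1FF << 9)) | ((d & 0x1FF) << 9)

gather_delta = (1 << 9) + 8    # +1 to the D field, +8 to the S field

mov = set_d(set_s(0, APART), TOGETHER)
pairs = []
for _ in range(9):             # REP count from the example
    s = mov & 0x1FF
    d = (mov >> 9) & 0x1FF
    pairs.append((s, d))
    mov = (mov + gather_delta) & 0xFFFFFFFF

# apart+0 -> together+0, apart+8 -> together+1, apart+16 -> together+2, ...
for i, (s, d) in enumerate(pairs):
    assert s == APART + 8 * i
    assert d == TOGETHER + i
print(pairs[:3])
```

This only shows the field arithmetic; the pre-REP ADD and the pipeline effect it exploits are discussed in the comments below.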

Comments

  • evanh Posts: 15,423

    That's six ticks of lag! I guess that shows how significant the pipeline is.

  • How does the pipeline work?
    Please explain in a little more detail.

  • evanh Posts: 15,423

    Chip did an ASCII art of it way back:

            |                   |                   |                   |                   |                   |                   |
    rdRAM Ib|------+            |           rdRAM Ic|------+            |           rdRAM Id|------+            |           rdRAM Ie|
            |      |            |                   |      |            |                   |      |            |                   |
    latch Da|--+   +--> rdRAM Db|---------> latch Db|--+   +--> rdRAM Dc|---------> latch Dc|--+   +--> rdRAM Dd|---------> latch Dd|
    latch Sa|--+   +--> rdRAM Sb|---------> latch Sb|--+   +--> rdRAM Sc|---------> latch Sc|--+   +--> rdRAM Sd|---------> latch Sd|
    latch Ia|--+   +--> latch Ib|---------> latch Ib|--+   +--> latch Ic|---------> latch Ic|--+   +--> latch Id|---------> latch Id|
            |  |                |                   |  |                |                   |  |                |                   |
            |  +---------------ALU--------> wrRAM Ra|  +---------------ALU--------> wrRAM Rb|  +---------------ALU--------> wrRAM Rc|
            |                   |                   |                   |                   |                   |                   |
            |                   |stall/done = 'gox' |                   |stall/done = 'gox' |                   |stall/done = 'gox' |
            |       'get'       |      done = 'go'  |       'get'       |      done = 'go'  |       'get'       |      done = 'go'  |
    
  • evanh Posts: 15,423
    edited 2021-08-27 08:50

    I'll try to explain how to read that diagram now:

    First point: "Pipelining" is a sort of parallelisation of execution without really changing the sequential nature of executing an instruction.

    Those sequential steps are:
    1: Use the Program Counter to generate an address to fetch the instruction from memory.
    2: Fetch that instruction - rdRAM Ib
    3: Decode the instruction into control lines and, instruction-dependent, operand fetches/immediates.
    4: Fetch the operands (parameters) - rdRAM Db, rdRAM Sb
    5: Execute: Control lines have prepared the ALU. Feed the operands into the ALU.
    6: Write result of the execute back, instruction-dependent - wrRAM Rb

    Okay, those steps are a generalisation. They can be expanded into many more steps, or packed into fewer. The AVR managed to pack them down to three steps, I think.
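    The overlap of those stages can be sketched generically (this is an illustration of classic in-order pipelining, not a cycle-accurate P2 model):

```python
# Simplified in-order pipeline model: on every tick each instruction
# advances one stage, so up to six instructions are in flight at once.
# Generic illustration only, not an exact P2 timing model.

STAGES = ["addr", "fetch", "decode", "operands", "execute", "writeback"]

def occupancy(n_instr, n_ticks):
    """Return, per tick, which instruction (or None) sits in each stage."""
    table = []
    for t in range(n_ticks):
        # instruction i enters stage 0 at tick i, stage k at tick i + k
        row = [t - k if 0 <= t - k < n_instr else None
               for k in range(len(STAGES))]
        table.append(row)
    return table

for t, row in enumerate(occupancy(n_instr=4, n_ticks=6)):
    cells = [f"{s}:I{i}" for s, i in zip(STAGES, row) if i is not None]
    print(f"tick {t}: " + "  ".join(cells))
```

    By tick 3 four instructions are in flight, one per stage, which is the "production line" effect described in the next comment.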

  • evanh Posts: 15,423
    edited 2021-08-27 09:50

    Okay, pipelining: RISC introduced the pipeline. As an aside, I think what allowed RISC to shine was automated circuit layout and the use of hardware description languages (HDL). This in turn changed the rules a little. The ALU stopped being a defined block and instead became more physically dispersed in the core.

    A CPU pipeline is just like a production line in a factory. There is a series of stations (stages), and each station performs the same action every time. So the six steps above each become one of the stations. Instead of the CPU doing just one of the steps at a time, it can do up to all six at once. Just like on a production line, each of those stations is working on a different individual product. So, each stage is now holding its specific step of adjacent instructions in sequence.

    And, of course, that line of thinking was naturally extended to having multiple production lines side by side. In CPU talk, this got the name "super-scalar". That, though, is very much parallel execution of instructions, fraught with potential for bugs.

  • Well, the 6502 had pipelining, and I’d be surprised if it wasn’t already present in some mainframes or minicomputers too.

  • @Simonius said:
    How does the pipeline work?
    Please explain in a little more detail.

    In a very general sense, the effect of the pipeline is that the instruction being executed was actually fetched before the 2 previous instructions executed.

    So for the gather copy loop as posted above, the first time the MOV executes, it was actually fetched before the first ADD (the one outside the loop) took effect, so it copies the first long. Then the ADD in the loop executes and advances the pointers to the 3rd long. Since REP looping is transparent to the pipeline, the MOV immediately executes again; it was fetched just after the REP instruction ran, i.e. after the first ADD, so it copies the 2nd long. Then the ADD advances the pointers to the 4th long, and the MOV executes as it was fetched before the previous ADD (that is, after the ADD before that), so it copies the 3rd long. This repeats until the copy is done.
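    That execution order can be replayed in a few lines of Python (a hypothetical host-side model, assuming the fixed two-instruction fetch lead described above; the offsets stand in for the apart/together pointers):

```python
# Model of the gather loop with a fetch lead of two instructions:
# a MOV's S/D fields are captured when it is fetched, two instruction
# slots before it executes.

def simulate(count=9, stride=8):
    apart, together = 0, 0        # offsets held in the MOV's S/D fields
    copies = []
    fetched = (apart, together)   # first MOV fetched before the pre-loop ADD
    apart += stride               # pre-loop ADD takes effect
    together += 1
    for _ in range(count):
        # the MOV now executing uses the fields it had when fetched
        copies.append(fetched)
        # the next MOV is fetched with the current (post-ADD) fields
        fetched = (apart, together)
        apart += stride           # in-loop ADD
        together += 1
    return copies

# apart+0 -> together+0, apart+8 -> together+1, ... as in the OP
print(simulate()[:3])
```

    Every MOV lags the pointer updates by one ADD, yet because the pre-loop ADD primed the sequence, each long still lands in the right slot.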

  • Thanks for the explanations.

    In general, when not using self-modifying code, does the pipeline present traps to the PASM2 programmer? Could a
    mov a»b
    mov b»c
    in assembly have undefined behaviour because I did not account for the lag of the instructions inside the pipeline?

  • evanh Posts: 15,423
    edited 2021-08-27 12:19

    Chip was careful to avoid such traps. Consecutive instructions can happily operate on a single register without losing "coherency". Much of the advantage comes from the short turnaround from operand fetch to execute. And the final piece involves the writeback being available as a "forwarded" operand while it writes the register.

    I think. :)

  • evanh Posts: 15,423

    @AJL said:
    Well, the 6502 had pipelining,

    I'm gonna dispute that one. I just wiki'd 6502 and got meself a scanned copy of the hardware manual. See attached.

    If that internal data bus is a correct representation of how the ALU is fed, then it can only operate one step at a time, because there are no other paths for data to flow through the ALU. It has to load the operands one at a time, and it also has to write results back through the same bus, which precludes any use of a pipeline.

    At best there could be a limited instruction prefetch using early address generation from the PC.

  • Yes, but instruction prefetch is still an example of pipelining.

  • Rayman Posts: 14,162
    edited 2021-08-27 16:40

    @Simonius said:
    In general, when not using self-modifying code, does the pipeline present traps to the PASM2 programmer? Could a
    mov a»b
    mov b»c
    in assembly have undefined behaviour because I did not account for the lag of the instructions inside the pipeline?

    I'd say it's only self-modifying code that has this particular trap... (and then see if anyone disagrees...)

  • @Rayman said:

    @Simonius said:
    In general, when not using self-modifying code, does the pipeline present traps to the PASM2 programmer? Could a
    mov a»b
    mov b»c
    in assembly have undefined behaviour because I did not account for the lag of the instructions inside the pipeline?

    I'd say it's only self-modifying code that has this particular trap... (and then see if anyone disagrees...)

    The pipeline is only apparent to the programmer in two situations: self-modifying code, and certain edge cases relating to ALTx prefix instructions (which are kind of self-modifying code, too), notably in relation to SKIPF (skipping over more than 7 instructions in a row inserts a NOP into the pipeline, and the ALT effect is applied to that NOP instead of the intended next instruction). In all other regards it is transparent to the programmer (though branches with delay slots, as found on many RISC architectures, would have been nice in some cases...)

    Very little hardware behaviour in modern chips is actually "undefined" (in the C sense of "anything can happen"). It is more often "undocumented" or "unintended". The pipeline behaviour is in fact defined, documented and (somewhat) intended, which is why the code in the OP works reliably. (Without interrupts, that is. Though that could be fixed by wrapping the beginning in a single REP.)

  • I take that as critique. I did indeed say undefined, but the more you understand of the P2 internals, the more well-defined its workings turn out to be. Nevertheless, a glitch-free pipeline is desirable and, as I understand it, measures have been taken to implement that. Self-modifying code is out of spec, by definition ;'D

  • pik33 Posts: 2,358

    @Simonius said:
    Self modifying code is out of spec, by definition ;'D

    In a Propeller? The P1 was designed for self-modifying code, including opcodes for changing the S/D/I fields. This legacy is still present in the P2, although you have to add 2 NOPs instead of 1.

    The code from the topic needs 4 cycles per loop where I expected 6, which means I still don't know how the P2 pipeline works.

  • Simonius Posts: 91
    edited 2021-08-27 22:47

    The code from the topic needs 4 cycles per loop

    Evanh said 6. Read again

    There is an ADD in the pipeline before the MOV is even issued... 2 ADDs are in the pipeline... the cycle takes 4, but the lag is still 6 cycles...

  • evanh Posts: 15,423

    Two NOPs is the same as 6 ticks in the way I calculated it, since I was looking at the interval of effect rather than the added padding.

  • evanh Posts: 15,423
    edited 2021-08-28 00:25

    @Simonius said:
    .....2 ADDS are in the pipeline....the cycle takes 4 but the lag still is 6 cycles....

    Correct. Well, maybe not all in the actual pipeline, but that's the simplified effect.

    Chip's timing diagram shows instruction fetch to result writeback as 4 ticks apart. But there is no forwarding on instruction fetch, therefore a result written to the instruction's address will be too late to update the fetch.

    The next fetch window is another two ticks away, so it becomes 6. In theory 5 ticks would suffice.
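    That arithmetic can be written out explicitly (a sketch assuming the figures above: writeback lands 4 ticks after an instruction's fetch, and fetches occur on a 2-tick cadence):

```python
# Sketch of the 6-tick figure above. Assumed numbers from the thread:
# writeback lands 4 ticks after an instruction's fetch, and a new
# instruction is fetched every 2 ticks.
FETCH_TO_WRITEBACK = 4
FETCH_PERIOD = 2

def ticks_until_usable():
    """First fetch window strictly after the writeback tick.

    The result written at tick 4 just misses the fetch happening at
    tick 4, so the earliest fetch that can see it is the next window.
    """
    return (FETCH_TO_WRITEBACK // FETCH_PERIOD + 1) * FETCH_PERIOD

# 6 ticks of separation at 2 ticks per instruction means the modified
# instruction sits 3 slots after the modifying one: 2 NOPs in between.
nops_needed = ticks_until_usable() // FETCH_PERIOD - 1

print(ticks_until_usable(), nops_needed)
```

    Which lines up with the "2 NOPs" figure mentioned elsewhere in the thread.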

  • evanh Posts: 15,423
    edited 2021-08-28 01:20

    The timing diagram has an ambiguity in the address-data phase relationship. It's not clear where the PC is used for fetching, and the same clarity is therefore lacking throughout the diagram. I had a hard time questioning Chip on that some years back; I wasn't clear enough on what information I wanted, so I didn't get an answer beyond him not understanding me.

    For example it may be better to rewrite my six steps from:

    1: Use the Program Counter to generate an address to fetch the instruction from memory.
    2: Fetch that instruction - rdRAM Ib
    3: Decode the instruction into control lines and, instruction-dependent, operand fetches/immediates.
    4: Fetch the operands (parameters) - rdRAM Db, rdRAM Sb
    5: Execute: Control lines have prepared the ALU. Feed the operands into the ALU.
    6: Write result of the execute back, instruction-dependent - wrRAM Rb

    to
    1: Use the Program Counter to generate an address to fetch the instruction from memory - rdRAM Ib
    2: Fetch that instruction - latch Ib
    3: Decode the instruction into control lines and, instruction-dependent, operand fetches/immediates - rdRAM Db, rdRAM Sb
    4: Fetch the operands (parameters) - latch Sb, latch Db
    5: Execute: Control lines have prepared the ALU. Feed the operands into the ALU
    6: Write result of the execute back, instruction-dependent - wrRAM Rb

    In particular, I've moved rdRAM Ib to step one. This changes the meaning of rdRAM Ib from a pipeline stage holding already-fetched data into a leading address to be fetched. That has knock-on meanings throughout the diagram, including making the writeback an action yet to be completed rather than one already written.
