P2 Tricks, Traps & Differences between P1 (general discussion)

Cluso99 · 2017-12-14 00:24

We had a similar document for P1, so thought it might be about time to start a P2 version.

Let's have any discussions below, and keep the reference thread
forums.parallax.com/discussion/169069/p2-tricks-traps-differences-between-p1-reference-material-only/p1?new=1
for reference material only.

Cluso99 · 2017-12-14 00:38

WAITCTn

In P1 we did this...

                mov     count,            #(96*2)
                mov     delay,            cnt           
                add     delay,            delay5us      
:loop           waitcnt delay,            delay5us      
                'do something
                djnz    count,            #:loop

In P2 we do this...

                mov     count,            #(96*2)
                getct   delay                             
.loop           addct1  delay,            delay5us                
                waitct1                                          
                'do something
                djnz    count,            #.loop

And here is the trap: Our P1 code incorrectly converted to P2

                mov     count,            #(96*2)
                getct   delay                             
                addct1  delay,            delay5us                
.loop           waitct1                                          
                'do something
                djnz    count,            #.loop

Note that we need to include the ADDCT1 instruction in the loop, because WAITCTn no longer adds to the count.
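To make the difference concrete, here is a small Python model of the three loops above. It only tracks the wake-up target each wait would use; the function names and the start/period values are made up for illustration, and it is a sketch of the semantics described in this post rather than a cycle-accurate simulation.

```python
MASK32 = 0xFFFFFFFF  # the system counters are 32 bits wide


def p1_waitcnt_targets(start, period, loops):
    """P1: 'waitcnt delay, period' waits for delay, then adds period to delay."""
    delay = (start + period) & MASK32          # mov delay,cnt / add delay,delay5us
    targets = []
    for _ in range(loops):
        targets.append(delay)                  # waitcnt waits for this count...
        delay = (delay + period) & MASK32      # ...and advances the target itself
    return targets


def p2_waitct1_targets(start, period, loops, addct_in_loop=True):
    """P2: 'addct1' moves the target, 'waitct1' only waits."""
    delay = start                              # getct delay
    if not addct_in_loop:
        delay = (delay + period) & MASK32      # addct1 once, before the loop (the trap)
    targets = []
    for _ in range(loops):
        if addct_in_loop:
            delay = (delay + period) & MASK32  # addct1 inside the loop
        targets.append(delay)                  # waitct1 waits for the current target
    return targets


print(p1_waitcnt_targets(1000, 400, 3))
print(p2_waitct1_targets(1000, 400, 3))
print(p2_waitct1_targets(1000, 400, 3, addct_in_loop=False))
```

In the incorrectly converted version the target never moves after the first pass, so the loop no longer runs at the intended period.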

evanh · 2017-12-14 01:09

The Prop2 has a WAITX instruction that is easier to use and does what's needed for the common case of just inserting a delay.

ADDCT1 only becomes worthwhile when a precise period or phase alignment is needed.

ozpropdev · 2017-12-14 02:09

Cluso99 wrote: »

Note we need to include the addct1 instruction into the loop because waitctn no longer adds to the count

I think the clue there for P1 users would be that WAITCTx is standalone and doesn't support D and S fields.

Cluso99 · 2018-01-09 02:34

RDWORD, RDLONG, WRWORD, WRLONG Hub non-aligned

The P1 ignores lowest hub address bit(s) for non-aligned hub reads and writes.
The P2 correctly accesses non-aligned word and long reads and writes.

In the P1, there were tricks associated with the lower hub address bit(s) being ignored. This code will fail on P2.
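A rough Python model of the difference, treating hub RAM as a little-endian byte array. The function names are invented for this sketch; the P1 versions mask off the low address bit(s) as described above, while the P2 versions read from the exact byte address.

```python
def rdword_p1(hub: bytes, addr: int) -> int:
    base = addr & ~1                     # P1 ignores address bit 0 for words
    return int.from_bytes(hub[base:base + 2], "little")


def rdword_p2(hub: bytes, addr: int) -> int:
    return int.from_bytes(hub[addr:addr + 2], "little")   # exact byte address


def rdlong_p1(hub: bytes, addr: int) -> int:
    base = addr & ~3                     # P1 ignores address bits 1..0 for longs
    return int.from_bytes(hub[base:base + 4], "little")


def rdlong_p2(hub: bytes, addr: int) -> int:
    return int.from_bytes(hub[addr:addr + 4], "little")   # exact byte address


hub = bytes(range(0x10, 0x18))           # bytes $10..$17 at addresses 0..7
print(hex(rdlong_p1(hub, 1)), hex(rdlong_p2(hub, 1)))
print(hex(rdword_p1(hub, 3)), hex(rdword_p2(hub, 3)))
```

Any P1 code that stashes flags in the low address bits and relies on them being ignored will read from a different location on the P2, so those bits have to be masked off explicitly first.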

Cluso99 · 2018-05-10 05:08

I noticed this very smart coding trick Chip used in checking for a hex character (0..9, A..F, a..f).
Code and table must be in cog.
I also found a shorter way to convert it to a binary nibble.

hexchrs         long    %00000000_00000000_00000000_00000000
                long    %00000011_11111111_00000000_00000000    '"0".."9"
                long    %00000000_00000000_00000000_01111110    '"A".."F"
                long    %00000000_00000000_00000000_01111110    '"a".."f"
                long    %00000000_00000000_00000000_00000000
                long    %00000000_00000000_00000000_00000000
                long    %00000000_00000000_00000000_00000000
                long    %00000000_00000000_00000000_00000000

.check          altb    x,#hexchrs              'check for hex
                testb   0,x             wc
        if_nc   ret                             'if not hex, c=0
                testbn  x,#6            wz      'hex, "0".."9"?
        if_nz   add     x,#9                    '..make low nibble $A..$F
        _ret_   and     x,#$F                   'extract low nibble, return c=1

The add x,#9 converts "A".."F" to "J".."O" and "a".."f" to "j".."o" ("J" is ASCII $4A and "j" is ASCII $6A).
Then we strip off the upper nibble from both the "0".."9" and the "J".."O"/"j".."o" characters, leaving $00..$0F.

The table could be cut from 8 longs to 4 by first checking for ASCII $80 and above. The check costs extra instructions, so the net saving would be 2 longs.
However, Chip is looking for the fastest code.
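For anyone who wants to check the table and the add-9 step off-chip, here is a Python model of the routine. The name check_hex is made up for this sketch; the table values are the same eight longs written in hex, the character's upper 3 bits pick the long and its lower 5 bits pick the bit, which is my reading of what the ALTB/TESTB pair does.

```python
HEXCHRS = [
    0x00000000,
    0x03FF0000,  # "0".."9" -> bits 16..25 of long 1
    0x0000007E,  # "A".."F" -> bits 1..6 of long 2
    0x0000007E,  # "a".."f" -> bits 1..6 of long 3
    0x00000000,
    0x00000000,
    0x00000000,
    0x00000000,
]


def check_hex(ch):
    """Return (is_hex, nibble) for a byte value 0..255."""
    x = ch & 0xFF
    # altb + testb: long index = x >> 5, bit index = x & 31
    if not (HEXCHRS[x >> 5] >> (x & 31)) & 1:
        return False, None        # the PASM version just returns with c=0
    if x & 0x40:                  # bit 6 set means a letter, not a digit
        x += 9                    # "A".."F" -> $4A..$4F, "a".."f" -> $6A..$6F
    return True, x & 0xF          # keep the low nibble


for c in "09AFaf":
    print(c, check_hex(ord(c)))
```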

Cluso99 · 2018-05-10 05:16

A couple of days ago I was looking to replace some code, but it required a condition code and I had both C and Z in use.
PeterJ suggested we look at SKIP.

Here is a solution to skipping instructions without using condition codes...

case1           mov     cmdtype,          #1            ' returns with /CS=0 (enabled)
                jmp     #cmdxx
case2           mov     cmdtype,          #0            ' returns with /CS=1 (disabled)
cmdxx           outl    #sd_cs                          '/ /CS=0 (enabled)
.....
                skip    cmdtype                         '| skips next instruction if cmdtype = 1
                outh    #sd_cs                          '/ /CS=1 (disabled) if required
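The mask idea can be sketched in Python as below. This assumes bit 0 of the SKIP value applies to the first instruction after the SKIP, bit 1 to the next, and so on, with a 1 meaning "skip it"; run_with_skip is a hypothetical helper, and the instructions are just strings.

```python
def run_with_skip(instructions, mask):
    """Return the instructions that would execute after 'skip mask'."""
    executed = []
    for i, instr in enumerate(instructions):
        if (mask >> i) & 1:       # a set bit cancels the matching instruction
            continue
        executed.append(instr)
    return executed


tail = ["outh #sd_cs"]
print(run_with_skip(tail, 1))     # cmdtype = 1: OUTH is skipped, /CS stays low
print(run_with_skip(tail, 0))     # cmdtype = 0: OUTH runs, /CS goes high
```

Because the mask lives in a register, the choice is made once at entry (case1 or case2) and no flag has to survive until the end of the routine.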

Cluso99 · 2018-05-26 07:31

JMP & CALL Relative & Direct COG/LUT/HUB Addressing

Here is the definitive example from Chip

https://forums.parallax.com/discussion/comment/1438380/#Comment_1438380

Dave Hein · 2018-05-26 11:52

PNut allows for forward references of CON symbols, but only one level of forward reference is allowed. I'm not sure if this is considered a trap, but I discovered this recently when working on a forward reference issue in p2asm. The following code fails with an "undefined symbol".

con
  symbol1 = symbol2
  symbol2 = symbol3
  symbol3 = 1

I verified that PropTool has the same limitation.
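One way to picture a "single level" limit is a resolver that makes exactly two passes over the CON block. The Python below is purely a guess at that kind of behaviour for illustration; it is not taken from PNut, PropTool or p2asm, and resolve_two_pass is an invented name.

```python
def resolve_two_pass(defs):
    """defs: list of (name, value); value is an int or another symbol's name."""
    table = {}
    for _ in range(2):                        # two passes = one level of forward reference
        for name, value in defs:
            if isinstance(value, int):
                table[name] = value
            elif value in table:              # only resolvable if already known
                table[name] = table[value]
    missing = [name for name, _ in defs if name not in table]
    if missing:
        raise NameError("undefined symbol: " + ", ".join(missing))
    return table


print(resolve_two_pass([("symbol2", "symbol3"), ("symbol3", 1)]))

try:
    resolve_two_pass([("symbol1", "symbol2"), ("symbol2", "symbol3"), ("symbol3", 1)])
except NameError as err:
    print(err)
```

With this model the two-symbol case resolves, while the three-symbol chain from the example leaves symbol1 undefined. Declaring the constants in dependency order avoids the issue either way.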

ersmith · 2018-05-27 12:41

A potential trap I found is that P2 "locktry" is very similar to P1 "lockset", but sets the carry flag the opposite way. There are a few other instructions as well where the P2 flag setting is different from P1; "neg" for example sets C to the MSB of the source in P1, but to the MSB of the result in P2.
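A small Python model of the two flag differences. The NEG part follows the description above directly. The lock part is a simplification based on my understanding that P1 "lockset" returns the previous lock state in C (so C=0 means you got it) while P2 "locktry" returns C=1 when you got it; lock ownership tracking is left out, and all function names are made up.

```python
MASK32 = 0xFFFFFFFF


def neg_p1(s):
    """P1 NEG: C = MSB of the source."""
    return (-s) & MASK32, (s >> 31) & 1


def neg_p2(s):
    """P2 NEG: C = MSB of the result."""
    result = (-s) & MASK32
    return result, (result >> 31) & 1


def lockset_p1(locks, n):
    """P1 model: C = previous state, so C=0 means the lock was acquired."""
    was_set = locks[n]
    locks[n] = True
    return int(was_set)


def locktry_p2(locks, n):
    """P2 model: C=1 means the lock was acquired."""
    got_it = not locks[n]
    locks[n] = True
    return int(got_it)


print(neg_p1(5), neg_p2(5))                        # same result, opposite C
print(lockset_p1([False], 0), locktry_p2([False], 0))
```

So a P1 loop that retries while C is set needs its condition inverted when ported to "locktry".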

Cluso99 · 2018-10-04 02:58

Thought I would bump this thread now we are testing real P2 silicon.

Cluso99 · 2018-10-06 13:17

Please use this thread for any discussions about tricks, traps, etc.
Anything finalised can be put in the "reference thread" where it can be updated by the author if required.

There are a few P1 instructions such as REV and WAITPEQ and WAITPNE that need more details.
The jumps JMPRET/CALL/JMP/RET and DJNZ, TJZ, TJNZ need discussion here on the best ways to replace them before they get added to the reference thread. Any good ideas???

evanh · 2018-10-07 02:39

I've been trying to make use of the cordic pipeline feature, but it quickly became apparent that any attempt to encode the hub rotation timings into the program will have compatibility issues later on, when there are different numbers of cogs in future editions of the Prop2.

If you miss your command slot then the instruction gets blocked until the subsequent slot arrives. If you're not collecting the results when they're ready then you're out of luck. These two interactions, when the timing is changed, have varying impacts on what else can be done between issuing and fetching. Not the least of which is managing the number of parallel cordic commands in flux for a given cog.

It's tricky to be confident of the long term usability of the pipelining.

EDIT: I suppose the way to do it is to always space the cordic commands at 16-clock intervals, skipping possible other slots. EDIT1.2: Another reason to follow this practise is that, in the 8-cog implementation of the Prop2, the clock-56 slot has a trickiness to it as well. If waiting for a result, the first GETQx returns at clock-55, so an immediate attempt to issue a new command will be stalled until the clock-64 slot, by which time the remainder result will have been overwritten. I struck this when running a recursive function that feeds the result straight back into the cordic. EDIT1.3: The recursion problem was my fault, so I'm not sure about this point now.

EDIT2: As soon as there are two parallel commands in the pipe, interrupts are not an option either. There is only room for one result at the end of the pipeline, so timing is critical with more than one on the go.

evanh · 2018-10-07 06:33

I'm thinking it would be more effective to move the cordic inside the cogs, drop the pipelining and just have an iterative unit per cog, taking the full 54 clocks for a result. That is time that can be reliably used for something else.

That should actually reduce hardware resources and will eliminate the initial slot wait time.

Cluso99 · 2018-10-07 07:27

IIRC the Cordic (and multiplier, SQRT, etc) are in the hub so that they share the one piece of silicon, which is effectively pipelined such that a new calculation can begin on each clock (i.e. hub) cycle. Think of a single pipelined ALU shared between a lot of users on a timeslice basis.

potatohead · 2018-10-07 07:41

First, that's a real change. Bad juju.

Second, the CORDIC ended up in the HUB, because the COGS got big enough to cost us RAM.

(Lots of RAM)

evanh · 2018-10-07 08:12

Yeah, the amount of changes needed makes it a no-go idea.

I'd be surprised if, particularly for eight cogs, silicon used didn't reduce substantially though. The cordic, as it stands, is built for 54 fully pipelined stages (which means 54 parallel operations), so doesn't scale down with reduced cog count.

potatohead · 2018-10-07 10:22

I'm pretty sure it did. That discussion is hard to find. However, now that I think about it, we were at 16 COGS.

Yeah, you might be right today.

cgracey · 2018-10-07 16:05

A per-cog implementation would require four 40-bit barrel shifters. It would certainly take more logic. What we have now is super fast in terms of feed rate. A cog can feed in 7 computations, 8 clocks apart, and then get 7 results back, 8 clocks apart. So, in about 112 clocks, you can do 7 operations. A per-cog CORDIC could only get two computations done in that time.

And, I haven't experimented, but it might be possible to continuously overlap operations in 8-clock windows.
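The figures above can be checked with a little arithmetic in Python. The 8-clock slot spacing and the 54-clock iterative time come from this thread; the 55-clock command-to-result latency is taken from the earlier "clock-55" remark and should be treated as an assumption, as should the function names.

```python
SLOT = 8        # clocks between one cog's CORDIC slots on the 8-cog P2
LATENCY = 55    # assumed clocks from accepted command to result


def pipelined_timeline(n, slot=SLOT, latency=LATENCY):
    """Issue n commands one slot apart; return (issue clocks, result clocks)."""
    issue = [i * slot for i in range(n)]
    ready = [t + latency for t in issue]
    return issue, ready


def iterative_count(total_clocks, clocks_per_result=54):
    """How many results a non-pipelined per-cog unit finishes in total_clocks."""
    return total_clocks // clocks_per_result


issue, ready = pipelined_timeline(7)
print(issue)                  # commands go in at clocks 0, 8, ... 48
print(ready)                  # results come out at clocks 55, 63, ... 103
print(iterative_count(112))   # per-cog unit in the same 112 clocks
```

The last pipelined result lands at clock 103 in this model, which leaves a few clocks for the initial slot wait and the final fetch inside the quoted 112.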

evanh · 2018-10-07 16:13

cgracey wrote: »

A per-cog implementation would require four 40-bit barrel shifters.

Without any pipelining though, wouldn't that reduce down to one barrel shifter, because it becomes a dedicated iterative calculation?

evanh · 2018-10-07 16:21

cgracey wrote: »

What we have now is super fast in terms of feed rate. A cog can feed in 7 computations, 8 clocks apart, and then get 7 results back, 8 clocks apart. So, in about 112 clocks, you can do 7 operations.

What's the plan for 4-cog Prop2's and lower? I've been pondering how to make code compatible. Is it 4-clocks per command? 2-clocks for dual cog? Impossible to fetch results quick enough then. Is it no cordic at all? No cordic is harsh.

cgracey · 2018-10-07 16:49

evanh wrote: »

cgracey wrote: »

A per-cog implementation would require four 40-bit barrel shifters.

Without any pipelining though, wouldn't that reduce down to one barrel shifter, because it becomes a dedicated iterative calculation?

Each iteration requires two barrel shifters: two for cross-adding and, in 16 special iterations, those same two for scale compensation. So, two, not four shifters would be needed.

cgracey · 2018-10-07 16:53

evanh wrote: »

cgracey wrote: »

What we have now is super fast in terms of feed rate. A cog can feed in 7 computations, 8 clocks apart, and then get 7 results back, 8 clocks apart. So, in about 112 clocks, you can do 7 operations.

What's the plan for 4-cog Prop2's and lower? I've been pondering how to make code compatible. Is it 4-clocks per command? 2-clocks for dual cog? Impossible to fetch results quick enough then. Is it no cordic at all? No cordic is harsh.

Whether every 8th, 4th, 2nd, or single clock, you could write your code to take 8 clocks, exactly, and it would always run on smaller configurations.

evanh · 2018-10-07 17:14

Yeah, I'd decided it was safest to pace at 16-clock intervals, in case there is ever a 16-cog Prop2. On that note, if 16-clock intervals were fixed, then I suspect there could be some kind of reuse of stages when there are fewer than 16 cogs, which should allow optimising to a smaller footprint and lower power for the smaller models.

But most important will be compatibility with a 1-cog Prop2 model.

evanh · 2018-10-07 17:24

Chip,
The problem is different to hubram rotations. It's not just a bother for I/O timings. It affects ordinary inline maths. Although, it's not actually inline when using the cordic.

As it stands, either you craft your program special to fit the particular Prop2 model or you're not making full use of the cordic.

evanh · 2018-10-07 17:41

This will fail on the 1-cog and 2-cog models:

		qdiv    length, #10
		qdiv    width, #10
		getqx   lenshort
		getqy   lenfrac
		getqx   widshort
		getqy   widfrac
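Here is a rough Python model of why the sequence above could break on small configurations. It rests on several assumptions that are not confirmed anywhere in this thread: a cog's CORDIC slot comes round once every N clocks on an N-cog chip, each instruction takes at least 2 clocks, the result appears 55 clocks after the command is accepted, and there is only one result buffer, so a result is lost if the next one arrives before both GETQX and GETQY have run. All names are invented for the sketch.

```python
INSTR = 2       # assumed minimum clocks per instruction
LATENCY = 55    # assumed clocks from accepted command to result


def next_slot(t, period):
    """First clock >= t that falls on this cog's slot (a multiple of period)."""
    return -(-t // period) * period


def result_times(n_cmds, period):
    """Clocks at which results arrive for n_cmds back-to-back commands."""
    t, ready = 0, []
    for _ in range(n_cmds):
        t = next_slot(t, period)     # the command stalls until the slot
        ready.append(t + LATENCY)
        t += INSTR                   # then the next instruction can start
    return ready


def results_survive(n_cmds, period, reads_per_result=2):
    """True if getqx + getqy fit between consecutive results."""
    ready = result_times(n_cmds, period)
    needed = reads_per_result * INSTR
    return all(b - a >= needed for a, b in zip(ready, ready[1:]))


for cogs in (1, 2, 4, 8, 16):
    print(cogs, result_times(2, cogs), results_survive(2, cogs))
```

Under these assumptions the two results arrive only 2 clocks apart on 1-cog and 2-cog models, too close to read both halves of the first, while 4 cogs and up leave enough room. Explicitly pacing the commands 8 or 16 clocks apart removes the dependence on the slot period.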

potatohead · 2018-10-07 19:07

How often is full use needed? Secondly, how often will actually running on a future model, yet to be made, be needed?

jmg · 2018-10-07 19:36

evanh wrote: »

This will fail on the 1-cog and 2-cog models:

Unless, of course, those in-the-future models take steps to ensure that it actually works.

We can expect future parts will try to be backward code compatible with existing parts.
E.g. they may include an option to match 8-cog timing, via ghost cog slots, to give an exact timing match.

potatohead · 2018-10-07 20:23

When we get there, those problems are all very nice problems to have.

Cluso99 · 2018-10-07 22:26

Personally, I don't ever see a single cog P2 being made. It will never compete with other micros that way. Similarly, I don't foresee the need for a 2 cog P2.
But a 4 cog with less I/O and less Hub ram might be a starter, as might be a variant with some cores having reduced instructions to reduce silicon. Perhaps a 2 core P2 plus a mix of reduced Instruction P2 cores and even p2hot style (1 clock/instruction) 4 port cog ram cores.
Nice to muse over the possibilities, but let's not worry about those, and instead make the P2 successful so we can get to discuss other variants.

evanh · 2018-10-08 01:22

And a less bulky cordic.
