P2 Tricks, Traps & Differences between P1 (general discussion)
Cluso99
We had a similar document for P1, so I thought it might be about time to start a P2 version.
Let's have any discussions below, and keep the reference thread
forums.parallax.com/discussion/169069/p2-tricks-traps-differences-between-p1-reference-material-only/p1?new=1
just for reference material only.
Comments
In P1 we did this...
In P2 we do this...
And here is the trap: our P1 code incorrectly converted to P2.
Note that we need to include the addct1 instruction in the loop, because waitctn (waitct1/2/3), unlike the P1 waitcnt, no longer adds the delay to the target count.
ADDCT1 becomes useful when a precise period or phase alignment is needed.
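To make the period point concrete, here is a rough Python model (not PASM2) of the two ways a timed loop can be re-armed. The function names and the clock figures are invented for illustration, and counter wraparound is ignored. The first function advances the target itself each pass, which is the role ADDCT1 plays inside the loop. The second re-arms each wait relative to "now", so the loop body time leaks into the period.

```python
def absolute_schedule(start, period, body_clocks):
    """Wake times when the target is advanced by a fixed period each pass."""
    target = start + period
    now = start
    wakes = []
    for work in body_clocks:
        now = max(now, target)   # wait until the counter reaches the target
        wakes.append(now)
        target += period         # ADDCT1-style step: advance the target itself
        now += work              # the loop body runs for `work` clocks
    return wakes


def relative_schedule(start, period, body_clocks):
    """Wake times when each wait is re-armed from the current counter value."""
    now = start
    wakes = []
    for work in body_clocks:
        now += period            # wait `period` clocks from "now"
        wakes.append(now)
        now += work              # body time is added on top of the period
    return wakes


print(absolute_schedule(0, 100, [10, 30, 20]))  # [100, 200, 300]
print(relative_schedule(0, 100, [10, 30, 20]))  # [100, 210, 340]
```

With the target advanced by a fixed amount, the wake times stay exactly one period apart as long as the body fits inside the period.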
The P1 ignores lowest hub address bit(s) for non-aligned hub reads and writes.
The P2 correctly accesses non-aligned word and long reads and writes.
In the P1, there were tricks associated with the lower hub address bit(s) being ignored. This code will fail on P2.
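A small Python model of the difference, for long reads only. It assumes little-endian hub memory and uses made-up helper names; it is a sketch of the addressing behaviour described above, not of any real instruction timing.

```python
def p1_rdlong(hub, addr):
    """P1-style long read: the two lowest address bits are ignored."""
    addr &= ~3                                   # force long alignment
    return int.from_bytes(hub[addr:addr + 4], "little")


def p2_rdlong(hub, addr):
    """P2-style long read: the address is used as given, aligned or not."""
    return int.from_bytes(hub[addr:addr + 4], "little")


hub = bytearray(range(16))                       # hub[n] == n, for easy reading
print(hex(p1_rdlong(hub, 5)))                    # 0x7060504 (reads bytes 4..7)
print(hex(p2_rdlong(hub, 5)))                    # 0x8070605 (reads bytes 5..8)
```

Any P1 code that stashed flags in the low address bits and relied on them being dropped would read from the wrong place under the second model.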
Code and table must be in cog.
I also found a shorter way to convert it to a binary nibble. The add x,#9 converts "A..F" to "J..O" and "a..f" to "j..o". "J" is ASCII $4A and "j" is ASCII $6A.
Then we strip off the upper nibble from both the "0..9" and "J..O"/"j..o", leaving $00..$0F.
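The arithmetic can be checked with a few lines of Python. This is only a model of the trick: the function name is made up, the input is assumed to already be a valid hex digit, and testing bit 6 to tell letters from digits is my assumption about how the add gets made conditional.

```python
def hex_char_to_nibble(ch):
    """Convert one valid hex digit character to its 4-bit value."""
    x = ord(ch)
    if x & 0x40:        # assumed test: letters have bit 6 set, "0".."9" do not
        x += 9          # "A".."F" -> "J".."O", "a".."f" -> "j".."o"
    return x & 0x0F     # strip the upper nibble


print([hex_char_to_nibble(c) for c in "09AFaf"])  # [0, 9, 10, 15, 10, 15]
```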
It would be shorter code to first check for ASCII $80 and above: then, rather than using the 8-long table, we could use a 4-long table. This would save 2 longs.
However, Chip is looking for the fastest code.
PeterJ suggested we look at skip
Here is a solution to skipping instructions without using condition codes...
Here is the definitive example from Chip
https://forums.parallax.com/discussion/comment/1438380/#Comment_1438380
Anything finalised can be put in the "reference thread" where it can be updated by the author if required.
There are a few P1 instructions such as REV and WAITPEQ and WAITPNE that need more details.
The jumps JMPRET/CALL/JMP/RET and DJNZ, TJZ, TJNZ need discussion here on the best ways to replace them before they get added to the reference thread. Any good ideas???
If you miss your command slot then the instruction gets blocked until the subsequent slot arrives. If you're not collecting the results when they're ready then you're out of luck. These two interactions, when the timing is changed, have varying impacts on what else can be done between issuing and fetching. Not the least of which is managing the number of parallel cordic commands in flux for a given cog.
It's tricky to be confident of the long term usability of the pipelining.
EDIT: I suppose the way to do it is to always space the cordic commands at 16-clock intervals, skipping other possible slots.
EDIT1.2: Another reason to follow this practise is that, in the 8-cog implementation of the Prop2, the clock-56 slot has a trickiness to it as well. If waiting for a result, the first GETQx returns at clock-55, so an immediate attempt to issue a new command will be stalled until the clock-64 slot, by which time the remainder result will have been overwritten. I struck this when running a recursive function that feeds the result straight back into the cordic.
EDIT1.3: The recursion problem was my fault, so I'm not sure about this point now.
EDIT2: As soon as there are two parallel commands in the pipe, interrupts are not an option either. There is only room for one result at the end of the pipeline, so timing is critical with more than one on the go.
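The slot arithmetic can be sketched in Python. This is a simplified model under an assumption of mine: that a cog's command slot comes round once every `ncogs` clocks, on the clocks where `clock % ncogs == cog`. The function name is hypothetical, and nothing here models the pipeline depth or result overwriting.

```python
def next_slot(clock, cog, ncogs):
    """First clock >= `clock` at which `cog` gets a cordic command slot.

    Assumes slots rotate round-robin, one cog per clock.
    """
    return clock + (cog - clock) % ncogs


# 8-cog part, cog 0: a command attempted just after clock 56 stalls to 64.
print(next_slot(57, 0, 8))    # 64
print(next_slot(56, 0, 8))    # 56 (arrived exactly on the slot, no stall)

# Commands spaced at 16-clock intervals land on a slot for 1 to 16 cogs.
print(all(next_slot(t, 0, n) == t
          for t in range(0, 160, 16)
          for n in (1, 2, 4, 8, 16)))   # True
```

Under this model, 16-clock spacing never stalls on any of the cog counts tried, which is the portability argument for skipping the other possible slots.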
That should actually reduce hardware resources and will eliminate the initial slot wait time.
Second, the CORDIC ended up in the HUB, because the COGS got big enough to cost us RAM.
(Lots of RAM)
I'd be surprised if, particularly for eight cogs, silicon used didn't reduce substantially though. The cordic, as it stands, is built for 54 fully pipelined stages (which means 54 parallel operations), so doesn't scale down with reduced cog count.
Yeah, you might be right today.
And, I haven't experimented, but it might be possible to continuously overlap operations in 8-clock windows.
Each iteration requires two barrel shifters: two for cross-adding and, in 16 special iterations, those same two for scale compensation. So, two, not four shifters would be needed.
Whether every 8th, 4th, 2nd, or single clock, you could write your code to take 8 clocks, exactly, and it would always run on smaller configurations.
But most important will be compatibility with a 1-cog Prop2 model.
The problem is different to hubram rotations. It's not just a bother for I/O timings. It affects ordinary inline maths. Although, it's not actually inline when using the cordic.
As it stands, either you craft your program special to fit the particular Prop2 model or you're not making full use of the cordic.
Unless, of course, those future models take steps to ensure that actually works.
We can expect future parts will try to be backward code compatible with existing parts.
E.g. they may include an option to match 8-cog timing, using ghost cog slots, to give an exact timing match.
But a 4 cog with less I/O and less Hub ram might be a starter, as might be a variant with some cores having reduced instructions to reduce silicon. Perhaps a 2 core P2 plus a mix of reduced Instruction P2 cores and even p2hot style (1 clock/instruction) 4 port cog ram cores.
Nice to muse over the possibilities, but let's not worry about those and instead make the P2 successful, so we can get to discuss other variants.