Is anyone using the JP/JNP instructions?
cgracey
Posts: 14,152
I've been spending time speeding up critical paths. One thing that has been sticking out like a sore thumb for a long time is the JP/JNP instruction combo. It takes a long time to mux one of 64 INA/INB pins and then propagate it through the branch logic, in order to gate several late circuits. It's like a tent pole lifting up the amount of time required for the cog to cycle.
I looked all through my code base and I've never even used JP/JNP. Not sure why, but I think I tend to code not so directly as to branch on a pin state.
If I get rid of JP/JNP, things speed up quite well and a lot of resultant paths become very tame. In fact, in asking Quartus to identify bottlenecks, which it rates with priority numbers, things really flatten out when the JP/JNP combo goes away. I mean, the bottlenecks all take on about the same value, which means there was one tent pole tending to hold up everything - JP/JNP.
So, would anyone be upset if I got rid of JP/JNP?
By the way, the next release is looking like no problem for 100MHz.
Comments
The alternative to this would be what?
Two opcodes: Pin -> C, and then IF_C JMP?
That's 2 words instead of 1, and 4 cycles instead of the current 3?
JP/JNP is preferable, but there is a backup plan here, so this is less of a 'drop dead' problem.
Varying SysCLK counts are not uncommon in MCUs; to match speeds with FLASH, wait states are often added.
Which keeps the humans happy
Do you release it at 100MHz, or at 96MHz to match USB speeds better?
100MHz is probably a little tougher than 96MHz, so you could try 100MHz and confirm USB, then do a 96MHz build later to see if 96MHz hits some sweet spot.
80MHz / 12MHz (USB full speed) = 6.667 SysCLKs per bit - I think this works OK now?
100MHz / 12MHz = 8.333 should also be OK? (more SysCLKs to finish each bit, and less % jitter)
Q: Waiting on a pin level/edge is still 1-SysCLK granular, right?
A TESTB and a JMP are fine as a substitute.
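That substitute might look something like this (a sketch only - `RX_PIN` and the label are hypothetical, and it assumes INA is readable as a cog register, per the TESTB suggestion above):

```
        testb   ina, #RX_PIN  wc    ' copy the state of pin RX_PIN into C
if_c    jmp     #pin_high           ' conditional jump replaces JP (use if_nc for JNP)
```

Two instructions instead of one, but the pin-to-C path no longer has to feed the branch logic directly in a single instruction.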
For the sake of slick timing I'd get rid of it.
3 clocks would also be ok.
We are more likely to wait for pin hi/low, although we will find uses for JP/JNP if they're there.
Ok. Thanks, Everyone.
Do you have a code snippet of the USB transmit waits?
Can a wait-on-pin opcode be used instead? What are the leading-edge delays on that?
Here are the granularities in SysCLKs at various cycle counts and clock speeds. Note that going from 2 SysCLKs to 3 SysCLKs is almost compensated for by going from 80MHz to 100MHz (25ns -> 30ns granularity), but not quite.
Going from 2 SysCLKs to 4 SysCLKs at 80MHz (25ns -> 50ns) is quite a degrade in granularity.
Super!
I just did a big compile and what was holding up the timing (with all 16 cogs instantiated) was the CORDIC output to the cogs. Those signals had fan-outs of 16 each and had to go all over the FPGA. So, I made 4 sets of intermediary registers to give each set of 4 cogs their own copies. This will allow the interconnect delays to be really reduced, as they'll now get split between the CORDIC system and each set of 4 cogs, but will cost one clock cycle when GETQX/GETQY must wait.
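The per-quad registering described above could be sketched roughly like this (illustrative Verilog only - the signal names and widths are assumptions, not the actual P2 source):

```verilog
// One set of intermediary registers per group of 4 cogs splits the 16-way
// fan-out of the CORDIC results, at the cost of one extra clock of latency
// on GETQX/GETQY.
reg [31:0] cordic_x_q [0:3];   // per-quad copies of the CORDIC X result
reg [31:0] cordic_y_q [0:3];   // per-quad copies of the CORDIC Y result

always @(posedge clk) begin
    integer q;
    for (q = 0; q < 4; q = q + 1) begin
        cordic_x_q[q] <= cordic_x;   // each quad latches its own copy
        cordic_y_q[q] <= cordic_y;
    end
end

// Cog n then reads cordic_x_q[n >> 2] / cordic_y_q[n >> 2], so each register
// copy drives only 4 cogs instead of 16, shortening the interconnect paths.
```

The design choice is a classic pipeline/fan-out trade: one added cycle of latency buys much shorter routing from the CORDIC outputs to each cog.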