Is anyone using the JP/JNP instructions?
cgracey
Posts: 14,152
I've been spending time speeding up critical paths. One thing that has been sticking out like a sore thumb for a long time is the JP/JNP instruction combo. It takes a long time to mux one of 64 INA/INB pins and then propagate it through the branch logic, in order to gate several late circuits. It's like a tent pole lifting up the amount of time required for the cog to cycle.
I looked all through my code base and I've never even used JP/JNP. Not sure why, but I think I tend to code not so directly as to branch on a pin state.
If I get rid of JP/JNP, things speed up quite well and a lot of resultant paths become very tame. In fact, in asking Quartus to identify bottlenecks, which it rates with priority numbers, things really flatten out when the JP/JNP combo goes away. I mean, the bottlenecks all take on about the same value, which means there was one tent pole tending to hold up everything - JP/JNP.
So, would anyone be upset if I got rid of JP/JNP?
By the way, the next release is looking like no problem for 100MHz.
Comments
The alternative to this would be what?
Two opcodes: Pin -> C, and then IF_C JMP?
That's 2 words instead of 1, and 4 cycles instead of the current 3?
JP/JNP is preferable, but there is a backup plan here, so this is less of a 'drop dead' problem.
Varying SysCLK counts are not uncommon in MCUs; to match speeds with FLASH, wait states are often added.
Which keeps the humans happy
Do you release it at 100MHz, or at 96MHz to match USB speeds better?
100MHz is probably a little tougher than 96MHz, so you could try 100MHz and confirm USB, then do a 96MHz build later to see if 96MHz hits some sweet spot.
80MHz / 12MHz (USB full speed) = 6.667 SysCLKs per bit - I think this works OK now?
100MHz / 12MHz = 8.333 should also be OK? (more SysCLKs to finish each bit, and less % jitter)
Q: Waiting on a pin level/edge is still 1-SysCLK granular, right?
A TESTB and a JMP are fine as a substitute.
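That substitute might look something like this (a sketch only - `RX_PIN` and the label are hypothetical, and it assumes INA is readable as a cog register, per the TESTB suggestion above):

```
        testb   ina, #RX_PIN  wc    ' copy the state of pin RX_PIN into C
if_c    jmp     #pin_high           ' conditional jump replaces JP (use if_nc for JNP)
```

Two instructions instead of one, but the pin-to-C path no longer has to feed the branch logic directly in a single instruction.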
For the sake of slick timing I'd get rid of it.
3 clocks would also be ok.
We are more likely to wait for pin hi/low, although we will find uses for JP/JNP if they're there.
Ok. Thanks, Everyone.
Do you have a code snippet of the USB transmit waits?
Can a wait-on-pin opcode be used instead? What are the leading-edge delays on that?
Here are the granularities in SysCLKs at various cycle counts and clock speeds. Note that going from 2 SysCLKs to 3 SysCLKs is almost compensated for by going from 80MHz to 100MHz (25ns -> 30ns granularity), but not quite.
Going from 2 SysCLKs to 4 SysCLKs at 80MHz (25ns -> 50ns) is quite a degrade in granularity.
Super!
I just did a big compile and what was holding up the timing (with all 16 cogs instantiated) was the CORDIC output to the cogs. Those signals had fan-outs of 16 each and had to go all over the FPGA. So, I made 4 sets of intermediary registers to give each set of 4 cogs their own copies. This will allow the interconnect delays to be really reduced, as they'll now get split between the CORDIC system and each set of 4 cogs, but will cost one clock cycle when GETQX/GETQY must wait.
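The per-quad registering described above could be sketched roughly like this (illustrative Verilog only - the signal names and widths are assumptions, not the actual P2 source):

```verilog
// One set of intermediary registers per group of 4 cogs splits the 16-way
// fan-out of the CORDIC results, at the cost of one extra clock of latency
// on GETQX/GETQY.
reg [31:0] cordic_x_q [0:3];   // per-quad copies of the CORDIC X result
reg [31:0] cordic_y_q [0:3];   // per-quad copies of the CORDIC Y result

always @(posedge clk) begin
    integer q;
    for (q = 0; q < 4; q = q + 1) begin
        cordic_x_q[q] <= cordic_x;   // each quad latches its own copy
        cordic_y_q[q] <= cordic_y;
    end
end

// Cog n then reads cordic_x_q[n >> 2] / cordic_y_q[n >> 2], so each register
// copy drives only 4 cogs instead of 16, shortening the interconnect paths.
```

The design choice is a classic pipeline/fan-out trade: one added cycle of latency buys much shorter routing from the CORDIC outputs to each cog.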