Understanding timing analysis

Seairth · 2014-09-23 18:06

I am guessing that most of my issues understanding TimeQuest timing analysis stem from a lack of experience. I've attached the SDC file that I created, but I am a bit unsure whether it's correct. Feedback would be appreciated.

DE0-Nano.sdc.7z

Assuming that it is correct, I see that the Video PLLs are consistently an issue. Which leads me to the following question: for those of you who use video on the P1, how often do you need more than one Video PLL? Would it be feasible to move it out of CTR (of every cog) and have a single one that's global and configured via a hubop? This would reduce per-cog logic (252 registers, to start) and get rid of seven global clock lines. All cogs would share the single, global Video PLL.

Note, however, that I'm still not sure if it would fix any timing issues. (To whoever it was that removed the video functions altogether, how much of a difference did it make to fMax?)

2014-09-25: I have updated the attachment with a version that appears to be correct. Well, it allows the Nano build to compile without any TimeQuest errors. My guess is that the same file will also work for the DE2-115 and the BEMicro CV (just rename and copy).

2014-09-26: Added documentation links that are also posted later in the thread:

Quartus II Vol 2-6: Timing Analysis Overview
Quartus II Vol 2-7: The Quartus II TimeQuest Timing Analyzer

SDC and TimeQuest API Reference Manual

TimeQuest User Guide.pdf

Cluso99 · 2014-09-24 03:32

I don't know how to check any timings. I get about 50 warnings on my compiles, so currently I am just ignoring them.

jac_goudsmit · 2014-09-24 08:52

I haven't been able to work on the code since last week (work got a little busy) but I'm putting the timing analysis and the idea of shared video on the to-do list. (Sorry for bumping the thread).

I'm happy to see you made your own fork, Seairth! Thanks for joining!

===Jac

Seairth · 2014-09-25 03:55

I came across this document yesterday:

TimeQuest User Guide.pdf

I'm still working my way through it. It's definitely worth the read if you are going to do any timing analysis at all.

ozpropdev · 2014-09-25 04:14

Thanks Seairth

A good find!
Unravelling the mysteries of Quartus is quite time consuming.
Not nearly enough spare Nanoseconds in the day!

pik33 · 2014-09-25 07:52

PDF saved. This is a thing we must learn to go further.

Maybe tomorrow I will publish a graphic VGA driver using SRAM. Today it started to work after a lot of experiments and using Polish equivalents of all English 4-letter words... and some other bad Polish words which (I think) have no one-word English equivalent. The timing problems were awful. The state machine hung, the addresses were wrong, and then the barrel shifter of the P1V was too slow. Trying to use optimizations made compilation several times slower (from 6..8 to 20..30 minutes) without any visible result. So I had to slow the Propeller to 106.25 MHz, using the VGA pixel clock and saving one PLL. Then I used some weird tricks to make the signal reach its destination in time, even if this time was one cycle later, so that if the state machine hung at counter==0, it could run again at counter==1.

So maybe there is a method to speed up what is critical using this tool. This barrel shifter starts to make random errors above 110 MHz, whereas the rest of the Propeller works at 140 MHz. This needs to be optimized. My first try failed: my version of the shifter using a case statement was even slower than Chip's (and he simply used the >> operator). So maybe timing constraints can make Quartus compile this barrel shifter to run within those 7 nanoseconds, and bring some signals together in less than 9 nanoseconds so that the state machine works with the 106 MHz pixel clock.
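
To put numbers on those targets: a period in nanoseconds is 1000 divided by the frequency in MHz, so 140 MHz is about 7.14 ns and 106.25 MHz is about 9.41 ns. Below is a minimal SDC sketch of asking for such limits. The register name patterns are hypothetical placeholders, not names from the P1V sources, and a constraint only states the goal; whether the fitter can meet it is a separate matter.

```tcl
# Sketch only. The *hypothetical_...* patterns are placeholders and would
# need to be replaced with real register names from the design.

# Period (ns) = 1000 / frequency (MHz)
set cog_period_ns   [expr {1000.0 / 140.0}]    ;# about 7.14 ns
set pixel_period_ns [expr {1000.0 / 106.25}]   ;# about 9.41 ns

# Ask for the shifter paths to settle within one 140 MHz period.
set_max_delay -from [get_registers {*hypothetical_shift_src*}] \
              -to   [get_registers {*hypothetical_shift_dst*}] \
              [format %.3f $cog_period_ns]

# Ask for the state-machine paths to settle within one pixel-clock period.
set_max_delay -from [get_registers {*hypothetical_fsm_src*}] \
              -to   [get_registers {*hypothetical_fsm_dst*}] \
              [format %.3f $pixel_period_ns]
```

In practice a correct period on the clock itself usually implies the same limits already; set_max_delay is used here only because it names the paths explicitly.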

pmrobert · 2014-09-25 13:51

Does anyone have a .sdc file for the P1V? I've tried creating one unsuccessfully and am pretty sure I'm missing some minor detail.

Seairth · 2014-09-25 17:12

pmrobert wrote: »

Does anyone have a .sdc file for the P1V? I've tried creating one unsuccessfully and am pretty sure I'm missing some minor detail.

Take a look at the first post in this thread. There is one in the attachment.

rogloh · 2014-09-25 18:50

Hi pik33,

Can you tell me, when you were getting these random barrel shifter errors, whether the Quartus FMAX reports for the hot/cold temperature range etc. said that the maximum frequency of the design was expected to be over 110MHz? If so, I am wondering whether we can have faith in these numbers without a detailed SDC constraint file setup... I was sort of hoping we wouldn't need to dive that deep into it.

I've just browsed through the TimeQuest stuff Seairth sent the link to. It looks somewhat complex, but I'd be surprised if we need to dig down into every little nitty-gritty detail just to get a basic first-order estimate of the maximum clock rate. I would have hoped, for example, that the setup/hold timing within the paths used internally by the P1V design was already fixed by the Altera device characteristics and the fitment of the RTL that Quartus generates for it, so that it could use that in its calculations. Unfortunately it's still early days, so I don't really know enough about it yet to make any informed decisions there. There are probably FPGA veterans saying no, it doesn't work like that at all and you need to put timing constraints everywhere to get even a moderately useful FMAX analysis. I was just hoping it wasn't that complicated, but often these days it really is.

Roger.
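
For a first-order fMax estimate without opening the GUI, the timing analyzer can also be scripted. Here is a minimal sketch, assuming a project that has already been compiled through the fitter. The project name and SDC file name are placeholders chosen only for illustration.

```tcl
# Run with: quartus_sta -t fmax_summary.tcl
# The project name and SDC file name below are placeholders.
project_open hypothetical_p1v

create_timing_netlist          ;# build the timing netlist from the fitted design
read_sdc hypothetical_p1v.sdc  ;# apply the constraints
update_timing_netlist          ;# compute timing with those constraints

report_clock_fmax_summary      ;# per-clock fMax for the current operating conditions

delete_timing_netlist
project_close
```

The reported figures are only as meaningful as the clock definitions in the SDC file: paths on clocks that are missing or wrongly related will not be judged against the intended period.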

pmrobert · 2014-09-25 18:53

Seairth wrote: »

Take a look at the first post in this thread. There is one in the attachment.

Apologies! That's what I get for perusing the forum on a phone...

Seairth · 2014-09-25 20:30

Here's a run-down of the SDC:

create_clock is used to define the 50MHz base clock and a virtual clock to be used with the I/O constraints.
derive_pll_clocks causes TimeQuest to find the PLLs and call create_generated_clock for each one (just one, in this case).
create_generated_clock is used to define the cog_clk.
create_generated_clock is also used to define the video clock (for each tap) that can be generated by each cog.
set_input_delay and set_output_delay provide constraints for the I/O. I just left the min/max values at 0, since I have no idea what is going to be externally connected to those pins (and the LEDs are irrelevant, I think). This is also why I defined a virtual clock earlier.
set_clock_groups is used to make sure that TimeQuest doesn't assume that there are paths between the defined clocks. Clocks in the same group are related, while clocks in different groups are not. For instance, the video clock taps are multiplexed, so there is no relationship between the taps.
set_false_path is an alternative approach to set_clock_groups. TimeQuest saw that there was a relationship between the virtual clock and the ctra outputs. I didn't want that relationship evaluated, so I indicated that it was a false path.
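
As a rough illustration of the run-down above, an SDC skeleton using the same commands might look like the following. This is not the attached file: every port, pin, and clock name is a placeholder, and the divide ratios are arbitrary.

```tcl
# Sketch only: every port, pin, and clock name here is a placeholder.

# 50 MHz base clock (20 ns period) and a virtual clock for the I/O constraints.
create_clock -name clk_50 -period 20.000 [get_ports {hypothetical_clock_50}]
create_clock -name clk_virtual -period 20.000

# Let TimeQuest find the PLLs and create their generated clocks.
derive_pll_clocks

# A clock derived in logic from a PLL output, standing in for cog_clk.
create_generated_clock -name clk_cog \
    -source [get_pins {hypothetical_pll|clk[0]}] \
    -divide_by 2 \
    [get_pins {hypothetical_divider_reg|q}]

# Two stand-in video clock taps derived from the cog clock.
create_generated_clock -name vid_tap0 \
    -source [get_pins {hypothetical_divider_reg|q}] \
    -divide_by 2 \
    [get_pins {hypothetical_vid_reg0|q}]
create_generated_clock -name vid_tap1 \
    -source [get_pins {hypothetical_divider_reg|q}] \
    -divide_by 4 \
    [get_pins {hypothetical_vid_reg1|q}]

# I/O delays relative to the virtual clock, left at 0 as described.
set_input_delay  -clock clk_virtual -max 0 [get_ports {hypothetical_io[*]}]
set_input_delay  -clock clk_virtual -min 0 [get_ports {hypothetical_io[*]}]
set_output_delay -clock clk_virtual -max 0 [get_ports {hypothetical_io[*]}]
set_output_delay -clock clk_virtual -min 0 [get_ports {hypothetical_io[*]}]

# The taps are multiplexed, so paths between them are not analyzed.
set_clock_groups -exclusive -group {vid_tap0} -group {vid_tap1}

# Cut one specific relationship instead of a whole group.
set_false_path -from [get_clocks {vid_tap*}] -to [get_clocks {clk_virtual}]
```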

There are probably some things that could be simplified. For instance, I might have been able to use wildcards for the video clocks, which would also make the SDC more flexible when changing the number of cogs. But my Tcl is weak, so I didn't get too crazy with this file.
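
Since an SDC file is ordinary Tcl, the per-cog repetition could in principle be folded into a loop, with a wildcard picking up the resulting clocks. Here is a minimal sketch of that idea. The hierarchy names are placeholders and the cog count is an assumed value, so none of this is taken from the attached file.

```tcl
# Sketch only: hierarchy names are placeholders, and the cog count is assumed.
set num_cogs 8

for {set cog 0} {$cog < $num_cogs} {incr cog} {
    # One generated clock per cog, named vid_clk_cog0, vid_clk_cog1, ...
    create_generated_clock -name "vid_clk_cog$cog" \
        -source [get_pins "hypothetical_top|cog${cog}|hypothetical_src_reg|q"] \
        -divide_by 2 \
        [get_pins "hypothetical_top|cog${cog}|hypothetical_vid_reg|q"]
}

# A wildcard then addresses all of those clocks in one constraint.
create_clock -name clk_virtual -period 20.000
set_false_path -from [get_clocks {vid_clk_cog*}] -to [get_clocks {clk_virtual}]
```

With this shape, changing the number of cogs means editing only the num_cogs line.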

rjo__ · 2014-09-25 20:31

Y'all know that I can't help with this, BUT I would like to reiterate (you know this but might have forgotten about it): on the official DE2-115 compile, there is only one clock with an FMAX less than 200MHz. For me, the problem is that when I ask for recommendations, I can't find the file and I don't understand what is said in the report on the screen.

Thanks for the PDF...I'm going to look, but I'm not expecting much:)

Seairth · 2014-09-26 12:52

I also see that the Quartus II Handbook has a couple chapters on Timing Analysis in Volume 2. Below are links to download the individual chapters. I suggest reading the overview chapter before reading the TimeQuest User Guide that I linked to earlier. It certainly cleared a few things up for me.

Quartus II Vol 2-6: Timing Analysis Overview
Quartus II Vol 2-7: The Quartus II TimeQuest Timing Analyzer

Edit: And if you want to get into the Tcl weeds:

SDC and TimeQuest API Reference Manual

Willy Ekerslyke · 2014-09-26 14:32

This timing stuff is a bit beyond me but I do remember this:

http://forums.parallax.com/showthread.php/156851-Some-overclocking-)?p=1284958&viewfull=1#post1284958

It seems to me that identifying these multicycle paths to TimeQuest is quite important to getting accurate timing results and, if I understand Chip's comment correctly, getting the best actual FMAX.
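
For reference, a multicycle path is declared to TimeQuest with set_multicycle_path. Here is a minimal sketch, assuming a path whose capturing register is only enabled every other clock. The register patterns are placeholders and do not identify any particular P1V path.

```tcl
# Sketch only: the patterns are placeholders for the launching and capturing
# registers of a path that is known to be sampled every other clock.
set src [get_registers {*hypothetical_src_reg*}]
set dst [get_registers {*hypothetical_dst_reg*}]

# Give the path two clock periods for the setup check...
set_multicycle_path -from $src -to $dst -setup -end 2
# ...and move the hold check back by one so it stays at the launch edge.
set_multicycle_path -from $src -to $dst -hold -end 1
```

Such an assignment is only safe when the design really guarantees that the capturing register ignores its input on the intermediate clock edge; otherwise the analysis will pass paths that fail in hardware.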

Seairth · 2014-09-26 14:50

Willy Ekerslyke wrote: »

This timing stuff is a bit beyond me but I do remember this:

http://forums.parallax.com/showthread.php/156851-Some-overclocking-)?p=1284958&viewfull=1#post1284958

It seems to me that identifying these multicycle paths to TimeQuest is quite important to getting accurate timing results and, if I understand Chip's comment correctly, getting the best actual FMAX.

Thanks! I was actually looking for that post!

rogloh · 2014-09-26 17:31

Thanks for this summary and the other information you provided Seairth. I'll want to try to look at this again sometime when I get a chance to understand it more and I can see ultimately it will be important.

Seairth · 2014-09-27 12:07

Willy Ekerslyke wrote: »

This timing stuff is a bit beyond me but I do remember this:

http://forums.parallax.com/showthread.php/156851-Some-overclocking-)?p=1284958&viewfull=1#post1284958

It seems to me that identifying these multicycle paths to TimeQuest is quite important to getting accurate timing results and, if I understand Chip's comment correctly, getting the best actual FMAX.

This is what that post stated:

cgracey wrote: »

Note that the cog ALU settles over two clocks and the hub gets its ena signal every other clock. If you were to make multicycle=2 assignments for those paths, the compiler could optimize the other stuff that really needs it and you could maybe get 200MHz on the FPGA, even though the compiled Fmax might only be 160MHz.

Only, I am not seeing this for the ALU. For instance, in cog.v, I see that the "s" and "d" registers are updated on "m[2]":

always @(posedge clk_cog)
if (m[2])
    s <= sx;

always @(posedge clk_cog)
if (m[2])
    d <= ram_q;

But the results from the ALU are then written on m[3]:

wire ram_w          = m[3] && alu_wr;

cog_ram cog_ram_  ( .clk    (clk_cog),
                    .ena    (ram_ena),
                    .w      (ram_w),
                    .a      (ram_a),
                    .d      (alu_r),
                    .q      (ram_q) );

I don't see where the ALU has two clocks to settle, as m[2] transitions to m[3] on the same clock that "s" and "d" are being written. What am I missing?

rogloh · 2014-09-27 17:33

From what I understood when deciphering the P1V code it doesn't settle in two clocks. Both S and D get latched at the end of m[2] and the ALU result such as (D+S) is latched at the end of m[3], so the ALU only gets a single clock for it to generate its result for the regular 4 clock cycle instructions. Don't want to disagree with what Chip said a while back but I am guessing he probably intended to mean something else there.

pik33 · 2014-09-27 22:13

If this is true, maybe it would be useful to add a fifth stage, giving the ALU 2 clocks to settle. This would slow the Prop per clock, but then a higher clock rate might be available.

Willy Ekerslyke · 2014-09-28 08:29

Could it be that Chip was saying the ALU has two HUB clocks to settle? Not sure if that changes anything, though.
