Switch to synchronous enable?
Seairth
Posts: 2,474
I notice that the cogs use an asynchronous enable. If these signals were switched to synchronous enables (remove all the "negedge ena"), what would be the repercussions, other than the cog enable/disable being delayed by one clock cycle? From an FPGA perspective, this change has little impact on Cyclone IV synthesis, but has a significant positive impact on Cyclone V synthesis. For the BEMicro CV, usage goes from 8522 ALMs (using async enable) to 8118 ALMs (using sync enable). Further, optimizing for size, you can get this down to 7097 ALMs!
Any thoughts?
Comments
I'm not quite following - "negedge ena" still sounds synchronous to me, just on the other edge?
Sometimes that is done as a means of pipelining, without a full clock period cost.
That said, it is rare to see both edges used in modern FPGA synth, where routing tends to dominate over register delays.
If the tools seem to have a preference to one construct, then certainly explore that construct.
In the following Verilog, ena is an "asynchronous" signal, as it causes the always block to execute as soon as ena goes low, regardless of the state of clk_cog.
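For illustration, here is a minimal sketch of an asynchronous enable of the kind described above. The module name and the run_next input are hypothetical stand-ins, not the actual P1V source; only clk_cog, ena, and run come from the discussion.

```verilog
// Sketch only: module name and run_next are hypothetical, not P1V source.
module run_async_ena (
    input  wire clk_cog,   // cog clock
    input  wire ena,       // cog enable; low forces run to 0
    input  wire run_next,  // hypothetical next-state value for run
    output reg  run
);
    // "negedge ena" in the sensitivity list makes the clear asynchronous:
    // run drops as soon as ena falls, without waiting for clk_cog.
    always @(posedge clk_cog or negedge ena)
        if (!ena)
            run <= 1'b0;
        else
            run <= run_next;
endmodule
```

A synthesis tool would typically map this onto the flip-flop's asynchronous clear input.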
Now, compare that to the following code. In this case, ena is "synchronous" because the always block doesn't change the state of run (based on the value of ena) until the rising edge of clk_cog.
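As a rough sketch of the synchronous form (again, the module name and run_next are hypothetical, not the actual P1V source), the only change is that ena no longer appears in the sensitivity list:

```verilog
// Sketch only: module name and run_next are hypothetical, not P1V source.
module run_sync_ena (
    input  wire clk_cog,   // cog clock
    input  wire ena,       // cog enable; sampled only on the clock edge
    input  wire run_next,  // hypothetical next-state value for run
    output reg  run
);
    // ena is examined only at the rising edge of clk_cog, so a falling
    // ena clears run on the next clock edge rather than immediately.
    always @(posedge clk_cog)
        if (!ena)
            run <= 1'b0;
        else
            run <= run_next;
endmodule
```

If ena falls between clock edges, run holds its old value until the next rising edge of clk_cog, which is the one-cycle delay mentioned in the opening post.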
The register that drives ena (hub:cog_ena) is updated by either clk_cog or the asynchronous chip reset (nres). By having the above block(s) use an asynchronous enable, cogs are enabled/disabled as soon as hub:cog_ena latches a new value (which can be either on a clock edge or on the global chip reset). By having the above block(s) use a synchronous enable, cogs are enabled/disabled on the next clock after hub:cog_ena latches a new value.
When targeting Cyclone IV, there is very little difference in the LE count. When targeting Cyclone V, the ALM savings are significant. I have not checked to see how this affects fMax. I also do not know how this affects ASIC design (in the off chance Parallax invests in a P1V variant, this is obviously an important concern).
In the end, it might not be worth making the change just for improved Cyclone V utilization. Just thinking out loud...
Looking at the netlist results, I notice that the above code inserts an additional mux to handle the synchronous enable. I don't know whether this is slower than the asynchronous enable driving the /clr pin on a flip-flop, though. I'm guessing it is. But I still don't know if it makes any difference overall.
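To make that mux concrete, here is a sketch of a synchronous enable written with the mux spelled out in front of the flip-flop's D input. The module name, run_next, and run_d are hypothetical; this shows the logical structure only, not what any particular fitter actually emits.

```verilog
// Sketch only: names other than clk_cog, ena, and run are hypothetical.
module run_sync_ena_mux (
    input  wire clk_cog,   // cog clock
    input  wire ena,       // cog enable; sampled only on the clock edge
    input  wire run_next,  // hypothetical next-state value for run
    output reg  run
);
    // 2:1 mux on the data path: pass run_next when enabled, else 0.
    wire run_d = ena ? run_next : 1'b0;

    // Plain clocked register with no asynchronous control inputs.
    always @(posedge clk_cog)
        run <= run_d;
endmodule
```

Whether that mux costs extra logic or is absorbed into a register's synchronous clear or load port depends on the target device and the tools.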
Might it be better to have a separate always to initialise the values like this (using cog.v as an example)... This way, these statements.. can all be simplified to..
I can't see how adding an additional layer of gating logic can make the CV compile a design with fewer gates. As I understand it, synchronous gating is .AND.ing the clock with the reset signal and storing the result in a flop for the next posedge. How can this make for less circuitry? Not saying you're wrong but it doesn't make much sense to me.
I think that is a resource-mapping artifact, it is not actually compiling to gates.
The Altera docs show a number of differences between the Cyclone IV and Cyclone V granular building blocks.
The IV does have a Clock Enable, but that is LAB (16 x LE) wide, so that could force more usage.
The V does not show an explicit enable, but does have routing to create one, so maybe a MUX is closer to what the fabric actually implements, so it reports like that.
It is still buried within the ALM, so it is not counted as another item consumed.
Cyclone V : Adaptive Logic Modules Resources
One ALM contains four programmable registers. Each register has the following ports:
• Data
• Clock
• Synchronous and asynchronous clear
• Synchronous load
Cyclone IV : Logic Elements
Each LE has the following features:
■ A four-input look-up table (LUT), which can implement any function of four variables
■ A programmable register
■ A carry chain connection
■ A register chain connection
Does that mean that the gate structures I see in the RTL view are not actually what's being done in the logic of the chip? Does the technology map view do a better job?