For complex, advanced-node designs, there’s a tug-of-war brewing between the oft-conflicting goals of performance, power, and area (PPA) and turnaround time (TAT). Both are essential to design success, yet it is difficult to achieve optimal PPA at the highest productivity without making tradeoffs. At the root of this problem, traditional place-and-route tools force designers to break systems on chip (SoCs) into many small blocks, which in turn makes it challenging to achieve optimal PPA and TAT. This paper discusses new digital implementation technology that equips you to handle larger blocks and meet stringent PPA and TAT goals for SoCs at advanced 16/14/10nm FinFET nodes as well as established process nodes.
If you’re designing SoCs for high-end applications—think cloud computing systems, mobility, networking, and the like—you can’t afford to sacrifice PPA or TAT. Your success depends on getting the best performing, lowest power chip to market before your competitors. Yet, the path toward this nirvana is bumpy at best, considering the new design challenges that emerge as process nodes shrink.
There are three key areas of challenge. The first is that design sizes are getting huge, as shown in Figure 1. Not long ago, “large” designs were in the 20M- to 30M-instance range, but today’s advanced-node designs can exceed 100M instances.
With traditional place-and-route tools, you end up with too many small blocks to deal with, perhaps 1M to 2M instances each. Designing with so many blocks takes too long, impacting TAT and overall time to market, as indicated in Figure 2.
A second challenge is that the blocks have tight PPA requirements. Demands for PPA are getting more intense. Instead of designing at 500MHz, many engineers today are designing at 3GHz or 4GHz. Also, power envelopes are getting smaller and power is getting more difficult to control. See Figure 3 for an illustration of increasing power demands.
A third challenge is that a lot of tool flows are fragmented between place and route and signoff, as shown in Figure 4. Different engines are used to handle IC layout and timing analysis, resulting in a lack of consistency and a lot of iterations. We need much more integrated flows for place and route, timing and power analysis, and RC extraction.
In this paper, we’ll dive deeper into new digital design capabilities, including innovations in placement, optimization, routing, and clocking, that are demonstrating production-proven PPA advantages of typically 10% to 20% along with up to a 10X TAT gain in advanced 16/14/10nm FinFET designs as well as in established process nodes.
New Design Challenges, Greater Complexity Require New Technology
At 20nm and below, wire dimensions and lithography start to reach their limits, and digital designers must employ double patterning or even triple patterning on the interconnect between transistor layers. In addition to consuming more mask layers, double patterning results in additional design rules and complicates layout verification. Layout designers use colors to determine which features go on which masks, and any IC implementation tool used at 20nm or below must be “color aware.”
At 28nm and 20nm, wire delays dominate timing over gate delays, because wires haven’t scaled with the transistors. Designers are using local interconnects, but they require new layers, rules, and connectivity models to manage. Also, at smaller nodes, devices generally have higher leakage current even in the off state, so the total power dissipated can be much higher than expected.
There are other concerns. Smaller nodes have about 1,000 new design rules to address, and more than 400 new advanced layout rules for the 1X layers. There are hundreds of multi-mode/multi-corner (MMMC) views across which to close timing. You also must account for variable thicknesses in the metal stack and the increasing wire resistance that emerges at the higher metal layers.
Along with the physical challenges, there are plenty of electrical challenges such as increased device parasitics and the complexity of FinFET transistors. In FinFET designs, for example, there are many more resistances than at 28nm, and the growth in parasitics is resulting in bigger netlists, which impacts performance of physical implementation tools.
A long-time, persistent design challenge, power reduction has become even more critical for ICs that go into mobile and wearable devices. At every process node, methods for reducing dynamic power and leakage power must be re-examined: what works effectively for one node might not be as effective for another because the ratio of dynamic to leakage power changes. The introduction of FinFETs at 20nm and below, for example, is delivering improvements in performance and area along with inherently good leakage characteristics. However, as performance in FinFET devices gets pushed to the limit, you’ll need to carefully manage leakage power to avoid pushing up total power dissipation.
Individual components of the design implementation system must comprehend the complexities of the downstream flows that follow. For example, placement engines, besides being aware of pin access and patterning, need to be aware of timing and routing impact as well. During clock synthesis, traditional clock mesh and tree structures have limitations when closing critical timing paths at the desired frequency, since they are optimized to balance for zero or fixed skew while standard datapath optimization techniques run independently.
In summary, digital placement, clocking, routing, and optimization tools must, somehow, address each of these challenges to help digital designers reach their optimal PPA and TAT/capacity targets. Technologies currently on the market aren’t suited to meet these objectives.
Introducing Next-Generation Digital Design Implementation Solution
The new Cadence® Innovus™ Implementation System meets designers’ needs by delivering a typical 10% to 20% PPA advantage along with up to a 10X gain in TAT and capacity. Providing the industry’s first massively parallel solution, the system can effectively handle blocks of 10 million instances or more.
The Innovus Implementation System delivers these results through several key capabilities:
- Massively parallel architectures that can handle huge designs and take advantage of multi-threading on multi-core workstations, as well as distributed processing over networks of computers
- New GigaPlace solver-based placement technology, which is slack driven and topology-, pin access-, and color-aware to provide optimal pipeline placement, wirelength, utilization, and PPA
- An advanced, multi-threaded, layer-aware timing- and power-driven optimization engine, which reduces dynamic and leakage power
- A unique concurrent clock and datapath optimization engine, which reduces cross-corner variability and boosts performance at reduced power
- Next-generation slack-driven routing with track-aware timing optimization, which addresses signal integrity early on and improves post-route correlation
- Full-flow multi-objective technology, which makes concurrent electrical and physical optimization possible
In the following sections, we’ll go into more detail about how these key capabilities result in PPA and TAT advantages.
A New Slack-Driven Technique for Placement
The new GigaPlace engine changes the way placement is performed and enhances PPA. Traditionally, placement has been “timing aware” and only lightly integrated with other engines in the implementation system, such as timing analysis and optimization. With the GigaPlace engine, placement is slack driven and tightly integrated: the engine places cells in a timing-driven mode by building up the slack profile of the paths and making placement adjustments based on these timing slacks. (For more information, read this white paper that discusses how the GigaPlace engine reduced wirelength in an ARM® Cortex®-M7 design.)
The GigaPlace engine models accurate electrical constraints and physical constraints (floorplan, route topology-based wire length, congestion). It also integrates the mathematical model of Cadence’s timing- and power-driven optimization engine, which is also embedded in the Innovus Implementation System. The engine enables concurrent, convergent optimization of electrical and physical metrics. More importantly, the designer’s intent can be extracted automatically from the electrical constraints, which in turn helps to achieve better optimization for physical metrics. A global optimization strategy and a novel numerical solver are employed to avoid the trap of local minima, resulting in the globally optimal PPA. This strategy avoids costly design iterations between different steps of the flow and results in a faster design closure with the best PPA.
The GigaPlace engine solves for overlap and wire length, as well as slack that is driven by gate delay, false/multi-cycle paths, layer assignment, and congestion timing effects. These results provide better total negative slack (TNS)/worst negative slack (WNS), wire length, congestion, spreading, and power. In summary, the GigaPlace engine is:
- Electrically driven, accounting for MMMC slack, skew, and power
- Physically driven, accounting for routing topology, layer, color, and pin access
- Optimization driven, accounting for gate sizing and buffering
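To make the idea concrete, the short Python sketch below shows one generic way a slack-driven placement objective can be formed: each net’s wirelength is weighted by the timing slack of the paths through it, so a solver minimizing the total cost naturally pulls critical cells closer together. This is an illustration of the concept only, not the GigaPlace algorithm, and every name and number in it is invented.

```python
# Conceptual sketch of a slack-driven placement objective (illustration only;
# this is not the GigaPlace algorithm, just the general idea it describes).

def hpwl(net, positions):
    """Half-perimeter wirelength of a net given cell positions {cell: (x, y)}."""
    xs = [positions[c][0] for c in net]
    ys = [positions[c][1] for c in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def slack_weight(slack_ps, base=1.0, alpha=0.02):
    """Nets with negative slack get a larger weight, pulling their cells together."""
    return base + alpha * max(0.0, -slack_ps)

def placement_cost(nets, net_slacks, positions):
    """Total cost = sum of slack-weighted wirelengths over all nets."""
    return sum(slack_weight(net_slacks[name]) * hpwl(cells, positions)
               for name, cells in nets.items())

# Example: the net on the failing path ("n2", -35 ps slack) is weighted more
# heavily, so a solver minimizing this cost will shorten it first.
positions = {"u1": (0, 0), "u2": (40, 10), "u3": (5, 60)}
nets = {"n1": ["u1", "u2"], "n2": ["u2", "u3"]}
net_slacks = {"n1": 120.0, "n2": -35.0}
print(placement_cost(nets, net_slacks, positions))
```

In the real engine, the electrical, congestion, and pin-access terms described above would be folded into the same concurrent objective.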
Pin access has become a new design closure metric since, even if design congestion is low, routing might still be impossible. The GigaPlace engine features an adaptive pin-access flow that automatically spaces cells based on the neighboring instance’s pin-access restrictions, not just high local pin density. A proprietary algorithm in the tool globally plans how the router will access each pin (the accessibility is based on instances, not library_cells).
The GigaPlace engine has a cell-spreading cost function that considers more DRC rules and pre-routes. An optimization cost function considers cell spreading in both horizontal and vertical directions, and there’s an in-row space juggling function during legalization.
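As a rough illustration of the adaptive spacing idea (a hypothetical data model, not the tool’s internals), the sketch below derives an extra placement gap from the pin-access difficulty of two neighboring instances rather than from local pin density alone:

```python
# Illustrative sketch of pin-access-driven cell spacing (hypothetical model,
# not Innovus internals): cells with hard-to-reach pins get extra whitespace.

def access_cost(instance):
    """Crude pin-access difficulty score for one placed instance:
    more pins served by fewer free routing tracks -> harder to reach."""
    return instance["pin_count"] / max(1, instance["free_tracks"])

def required_gap(left, right, site_width=0.064, threshold=2.0):
    """Insert one extra placement site between neighbors whose combined
    pin-access difficulty exceeds a threshold (values are assumptions)."""
    combined = access_cost(left) + access_cost(right)
    extra_sites = 1 if combined > threshold else 0
    return extra_sites * site_width

# Example usage with two neighboring instances in a row.
left = {"name": "u_and2", "pin_count": 5, "free_tracks": 2}
right = {"name": "u_mux4", "pin_count": 7, "free_tracks": 3}
print(required_gap(left, right))  # -> 0.064 (one extra site of spacing)
```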
Given the complex floorplans in today’s designs, resolving congestion can become very difficult. The GigaPlace engine, with its automatic density screen technology, simplifies the process by automatically adding density screens in floorplan-induced high-traffic areas. The algorithm analyzes floorplans, traffic patterns, and congestion maps to keep standard cells away from the congested area, such as narrow channels, notches, and macro boundaries. This floorplanning helps reduce congestion without requiring you to add these density screens yourself.
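Conceptually, such density screens could be derived from a congestion map as in the simplified sketch below; the grid model and thresholds are assumptions for illustration and not the actual algorithm:

```python
# Conceptual sketch: derive density screens from a routing-congestion map.
# Purely illustrative; the thresholds and the grid model are assumptions.

def density_screens(congestion, near_macro, max_util=0.5, hot=0.9):
    """Return {(gx, gy): utilization_cap} for congested bins that sit next to
    macros or in narrow channels, keeping standard cells out of them."""
    screens = {}
    for (gx, gy), demand_over_supply in congestion.items():
        if demand_over_supply >= hot and near_macro.get((gx, gy), False):
            screens[(gx, gy)] = max_util
    return screens

congestion = {(10, 4): 0.95, (10, 5): 0.97, (22, 7): 0.6}
near_macro = {(10, 4): True, (10, 5): True, (22, 7): False}
print(density_screens(congestion, near_macro))
# -> {(10, 4): 0.5, (10, 5): 0.5}
```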
Advanced Timing- and Power-Driven Optimization
There are different techniques for optimizing timing and power at an early stage. One technique involves making optimal use of higher metal layers for routing, since they have lower resistance and also help in meeting timing while reducing leakage and dynamic power. The upper layers of the metal stack have different default widths and spacing than the lower layers. As a result, the wire delays of the upper layers can be more than 10X lower than those of the lower layers, yielding a significant timing gain from routing long, critical nets on the upper layers. Of course, there are a limited number of routing resources available on the upper metal layers because of the presence of power nets. These limited resources can cause congestion and routing issues down the line if left unaddressed.
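The magnitude of that gain can be checked with a simple distributed-RC estimate, delay ≈ 0.5·r·c·L², using illustrative per-unit values (assumptions, not foundry data):

```python
# Back-of-the-envelope distributed RC delay: delay ~= 0.5 * r * c * L^2, where
# r and c are per-unit resistance and capacitance. The values below are
# assumptions chosen only to show why wide upper-layer wires are much faster.

def rc_delay_ps(r_ohm_per_um, c_ff_per_um, length_um):
    # ohm * fF = 1e-3 ps, hence the 1e-3 conversion factor
    return 0.5 * r_ohm_per_um * (c_ff_per_um * 1e-3) * length_um ** 2

length = 500.0                             # a long, timing-critical net (um)
lower = rc_delay_ps(8.0, 0.20, length)     # narrow lower-layer wire (assumed r, c)
upper = rc_delay_ps(0.6, 0.18, length)     # wide upper-layer wire (assumed r, c)
print(f"lower-layer delay ~{lower:.0f} ps, upper-layer ~{upper:.0f} ps, "
      f"ratio ~{lower / upper:.1f}x")      # ~200 ps vs. ~14 ps here
```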
Through its route-aware optimization capability, the next-generation, multi-threaded advanced timing- and power-driven optimization engine in the Innovus Implementation System can identify long timing-critical nets, query a new congestion-tracking infrastructure to ensure that there’s space available on the upper layers, and then rebuffer these nets on the upper layers in order to improve timing. With this capability, you can maintain critical layer assignments during the entire pre-route optimization flow. These assignments are passed on to the system’s next-generation massively parallel global routing engine so that the final routing will also have the correct layer assignment.
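A rough, self-contained sketch of that flow is shown below; the data structures, thresholds, and layer names are assumptions made for illustration and are not the engine’s actual interfaces:

```python
# Rough sketch of route-aware layer promotion for critical nets. All fields,
# thresholds, and layer names are assumptions, not the Innovus internal model.

def promote_critical_nets(nets, free_tracks, upper_layers=("M8", "M9"),
                          min_length_um=200.0):
    """Assign long, timing-critical nets to an upper layer that still has free
    tracks, and return the assignments to hand to global routing."""
    assignments = {}
    for net in nets:
        if net["slack_ps"] < 0 and net["length_um"] > min_length_um:
            for layer in upper_layers:
                if free_tracks.get(layer, 0) > 0:
                    assignments[net["name"]] = layer
                    free_tracks[layer] -= 1      # reserve a track on that layer
                    break
    return assignments

nets = [
    {"name": "cpu_clk_en", "slack_ps": -42.0, "length_um": 610.0},
    {"name": "scan_so[3]", "slack_ps": 85.0,  "length_um": 900.0},
]
print(promote_critical_nets(nets, {"M8": 1, "M9": 4}))
# -> {'cpu_clk_en': 'M8'}  (only the long, failing net is promoted)
```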
The smaller the process node, the more important coloring becomes. At 10nm, in fact, a full coloring flow is a must for placement and routing, extraction, and DRC. This flow is necessary due to the use of the M1/M2 layers as horizontal/vertical metal, which impacts cell architecture and routability, as well as a significant increase in critical rules for placement and routing. Since variability in wire width and spacing is also an issue at 10nm, tracks of different colors can show up to a 50% difference in resistance. Cadence has developed a color-driven digital flow for 10nm designs, including capabilities such as color-aware track assignment in the implementation tool and color-aware resistance handling in the parasitic extraction tool.
The optimization engine also helps reduce dynamic and leakage power while facilitating optimal performance. A decision engine inside the system makes use of a rich library of power-aware transforms to step through the available options and reclaim power without affecting timing. This reclamation minimizes leakage as well as internal and switching power at a global level. If switching activity data is unavailable, the engine employs probability-based propagation. The engine thus makes the best judgment in finding the optimal power solution, lowering the power of an SoC without compromising performance or area.
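Probability-based propagation can be illustrated with the textbook model below: static signal probabilities are pushed through the logic, a toggle rate is derived from them, and dynamic power is estimated from that toggle rate. This is a generic sketch, not the tool’s implementation:

```python
# Textbook-style sketch of probability-based activity propagation, used when
# no switching-activity file is available. Not the tool's implementation.

def and2_prob(p_a, p_b):
    """Probability that an AND2 output is 1, assuming independent inputs."""
    return p_a * p_b

def switching_activity(p_one):
    """Expected toggles per cycle for a signal with P(1) = p_one, under the
    usual temporal-independence assumption: 2 * p * (1 - p)."""
    return 2.0 * p_one * (1.0 - p_one)

def dynamic_power_uw(c_load_ff, vdd, freq_mhz, activity):
    """Dynamic power ~= activity * C * Vdd^2 * f, returned in microwatts."""
    return activity * (c_load_ff * 1e-15) * vdd ** 2 * (freq_mhz * 1e6) * 1e6

p_out = and2_prob(0.5, 0.3)                      # P(out = 1) = 0.15
act = switching_activity(p_out)                  # ~0.255 toggles per cycle
print(dynamic_power_uw(2.0, 0.8, 1000.0, act))   # ~0.33 uW for one 2 fF net
```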
Power optimization algorithms in the system honor the power intent definitions in both Common Power Format (CPF) and Unified Power Format (UPF). The algorithms use relevant information from all the standard power-saving design techniques, including power shutoff (PSO), multi-supply voltage (MSV), dynamic voltage frequency scaling (DVFS), and substrate biasing.
The Innovus Implementation System is integrated with the Cadence Voltus™ IC Power Integrity Solution, tapping into a power calculation engine that provides an activity propagation feature as well as a full dynamic current profile of the power delivery network. This capability helps to tune the power intent file and make the grid more robust, preventing peak current failures.
Reduced Cross-Corner Variability with Concurrent Clocking
The Innovus Implementation System features a next-generation clock concurrent optimization engine with true multi-threading, enhanced useful skew, and flow integration. It merges physical optimization with clock-tree synthesis (CTS), simultaneously building clocks and optimizing logic delays based directly on a propagated clocks model. All optimization decisions are based on true propagated clocks, taking into account clock gates, inter-clock paths, and on-chip variation (OCV) derates. Additional innovations in this engine deliver enhanced hold closure; support for fence regions, halos, multiple corners, and power domains; electromigration avoidance; post-route clock optimization; and improved routing correlation.
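The benefit of optimizing datapath delays against true propagated clocks can be seen in a simple setup-slack calculation with OCV derates; the latencies, delays, and derate values below are assumptions chosen only to show how a small amount of useful skew on the capture branch relaxes a failing path:

```python
# Illustrative setup-slack calculation on a propagated-clocks model with OCV
# derates and useful skew. All numbers and derate values are assumptions.

def setup_slack_ps(period, launch_latency, capture_latency, data_delay,
                   setup, late_derate=1.05, early_derate=0.95):
    """Slack = (T + early-derated capture latency - setup)
             - (late-derated launch latency + late-derated data delay)."""
    arrival = launch_latency * late_derate + data_delay * late_derate
    required = period + capture_latency * early_derate - setup
    return required - arrival

# Balanced clock branches (~300 ps insertion delay each): path fails by ~43 ps.
print(setup_slack_ps(333.0, 300.0, 300.0, 310.0, 20.0))   # ~ -42.5
# ~25 ps of useful skew on the capture branch recovers part of the violation.
print(setup_slack_ps(333.0, 300.0, 325.0, 310.0, 20.0))   # ~ -18.8
```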
The Innovus Implementation System introduces a new FlexH feature that delivers a structure topologically as close to an H-tree as possible while trading off among the different soft and hard constraints, bringing the H-tree approach to real-world SoC design environments. Without this capability, designers would typically use a mesh or a hand-created tree, but these methods consume a lot of power and have architectural limitations. FlexH employs an advanced heuristic search algorithm that explores millions of possible tree structures to find the best compromise between avoiding blockages and power rails, adhering to partition, module, and power domain constraints, and optimizing insertion delay, power, and skew.
FlexH can handle cloning of clock logic between the hybrid tree endpoints and auto-generate the clock-tree spec for the rest of the clock concurrent optimization flow. It reduces cross-corner variability, cutting hold TNS and improving setup timing. Compared with a hand-created tree, FlexH has demonstrated significantly better insertion delay and multi-corner skew reduction.
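The kind of trade-off FlexH is described as making can be pictured with the toy cost-driven selection below. This is not the FlexH algorithm; the candidate structures, metrics, and weights are invented for illustration:

```python
# Toy selection over candidate clock-tree structures, illustrating the trade-off
# the text describes (insertion delay vs. skew vs. power vs. hard constraints).
# This is NOT the FlexH algorithm; all candidates and weights are invented.

def tree_cost(c, w_delay=1.0, w_skew=2.0, w_power=0.5, hard_penalty=1e6):
    """Soft constraints are weighted; hard-constraint violations (crossing a
    blockage or a power-domain boundary) effectively disqualify a candidate."""
    return (w_delay * c["insertion_delay_ps"]
            + w_skew * c["skew_ps"]
            + w_power * c["power_mw"]
            + hard_penalty * c["violations"])

candidates = [
    {"name": "pure_h_tree", "insertion_delay_ps": 240, "skew_ps": 8,
     "power_mw": 14.0, "violations": 2},   # ideal shape, but crosses blockages
    {"name": "flex_h_like", "insertion_delay_ps": 265, "skew_ps": 12,
     "power_mw": 11.5, "violations": 0},   # slightly bent to respect the floorplan
    {"name": "hand_tree",   "insertion_delay_ps": 410, "skew_ps": 35,
     "power_mw": 13.0, "violations": 0},
]
best = min(candidates, key=tree_cost)
print(best["name"])   # -> flex_h_like
```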
Achieving Up to 10X TAT Gain
Let’s take a closer look at how the Innovus Implementation System is able to boost digital design TAT. First and foremost is its full-flow massively parallel architecture, which can run multi-threaded tasks simultaneously on multiple CPUs. The architecture is designed such that the system can produce best-in-class TAT with standard hardware, which is normally 8 to 16 CPUs per box. In addition, the flow can scale over a large number of CPUs for designs with a larger instance count. The architecture can also be described as “look ahead” in its approach, as it accounts for upstream and downstream steps and effects in the design flow, providing a runtime boost and minimizing design iterations between the placement, optimization, clocking, and routing engines. See Figure 7 for a comparison against a reference tool.
The system’s advanced timing- and power-driven optimization engine provides threaded MMMC timing. As the number of MMMC views increases, the engine’s runtime scales sub-linearly with the view count.
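The scaling idea, analyzing independent MMMC views on a pool of worker threads, can be sketched generically as follows (an illustration of the concept, not the engine’s internals):

```python
# Generic sketch of analyzing MMMC views on a thread pool. Illustration of the
# scaling idea only; the view names and results below are placeholders.

from concurrent.futures import ThreadPoolExecutor

def analyze_view(view_name):
    """Stand-in for per-view timing analysis; returns (view, WNS in ps).
    A real analysis would load the corner/mode and propagate arrivals."""
    return view_name, -12.0 if "ss_" in view_name else 45.0

views = ["ss_0p72v_125c_func", "ff_0p88v_m40c_func", "tt_0p80v_25c_test"]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(analyze_view, views))
print(results)
```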
With the explosion in the number of design rules at smaller process nodes, routing and post-route closure can become roadblocks. The system’s routing engine is designed such that these tasks are handled on additional CPUs, more than 100 if needed for larger designs. Backed by its processing speed, the routing engine simultaneously evaluates and optimizes interconnect topology based on the effects on timing, area, power, manufacturability, and yield. With its correct-by-construction approach, the engine can resolve potential double-patterning conflicts on the fly to create a routing topology that is correct for double patterning and DRC the first time and also more area efficient. The engine is equipped with a deterministic multi-threaded backplane, provides full-flow timing correlations, and offers a flexible 2D/3D congestion mode. It also features a track-based optimization algorithm that fixes signal integrity issues before detail routing, reduces the timing jump between pre-route and post-route, and enables faster design closure.
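Double-patterning legality itself can be viewed as two-coloring a conflict graph whose nodes are wire shapes and whose edges connect shapes spaced too closely to share a mask. The minimal check below illustrates that view; it is not the router’s actual conflict-resolution mechanism:

```python
# Minimal two-coloring check on a double-patterning conflict graph: nodes are
# wire shapes, edges connect shapes too close to share one mask. Illustration
# of the concept only, not the router's conflict-resolution mechanism.

def two_color(conflicts):
    """Iterative 2-coloring; returns {shape: 0 or 1} or None if an odd cycle
    makes the layout uncolorable (a real DP conflict the router must fix)."""
    colors = {}
    for start in conflicts:
        if start in colors:
            continue
        colors[start] = 0
        queue = [start]
        while queue:
            node = queue.pop()
            for nbr in conflicts[node]:
                if nbr not in colors:
                    colors[nbr] = 1 - colors[node]
                    queue.append(nbr)
                elif colors[nbr] == colors[node]:
                    return None
    return colors

print(two_color({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))
# -> a legal 0/1 mask assignment
print(two_color({"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}))
# -> None (odd cycle: the shapes must be re-routed or re-spaced)
```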
While the tools discussed here speed up both timing and power signoff, they can also contribute to a faster overall design closure process when used with complementary tools that form a complete signoff flow. This flow makes it possible to employ techniques such as early rail analysis, real-world peak power analysis, and unified electrical signoff. Under a traditional design flow, signoff analysis would usually happen after the design has been placed and routed, but the electrical integrity problems found at that stage either take much longer to resolve or are irreparable. Since Cadence’s Tempus™ static timing analysis, Quantus™ QRC extraction, and Voltus power integrity technologies are integrated with the Innovus Implementation System, you can accurately model parasitics, timing, signal integrity, and power integrity issues at the early stages of physical implementation and achieve faster convergence on these electrical metrics, resulting in faster design closure.
Familiar, Easy-to-Use Flow Fosters Productivity
Since multiple production-proven signoff engines are integrated into the Innovus Implementation System, it was essential to have a simplified user and scripting interface. The system fosters usability by simplifying command naming and aligning common implementation methods with other Cadence digital and signoff tools. The processes of design initialization, database access, command consistency, and metric collection have all been streamlined and simplified. In addition, updated and shared methods have been added to run, define, and deploy reference flows. These updated interfaces and reference flows increase productivity by delivering a familiar interface across core implementation and signoff products.
Summary
As digital design challenges continue to grow with each process node shrink, new placement, routing, and clocking capabilities are needed to meet PPA and TAT goals. Cadence’s Innovus Implementation System gives digital designers capabilities to enhance their designs in ways that weren’t possible before. With a common user interface and command set across a digital implementation flow spanning RTL to signoff, you can work productively to create differentiated SoCs. You can take advantage of robust visualization and reporting for enhanced debugging, root-cause analysis, and metric-driven design flow management. You’ll also be equipped to reach that long-sought-after design nirvana: meeting PPA and TAT targets without tradeoffs to either side. The Innovus Implementation System is optimized for industry-leading ARMv8 processors and for 16/14/10nm FinFET processes along with established processes, supporting earlier design starts with faster ramp-up.
For Further Information
Learn more about the Innovus Implementation System at http://www.cadence.com/products/di/innovus_implementation_system/pages/default.aspx.
Authors