Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002

Minimum Wirelength Zero Skew Clock Routing Trees with Buffer Insertion

John Thompson, Kurt Ting, and Simon Wong

Department of Electrical and Computer Engineering

University of Wisconsin - Madison

{jdthompson, kting, wangwong}@wisc.edu

W02-1

Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002

ABSTRACT

Zero skew clock routing is an issue of increasing importance in the realm of VLSI design. As a result of the increasing speeds of on-chip clocks, zero skew clock tree construction has become critical for the correct operation of high performance VLSI circuits. In addition, in an effort to both reduce power consumption and the deformation of clock signals at synchronizing elements on a chip, a minimum wirelength characteristic of clock tree networks is highly desirable.

In an effort to provide a solution to the current issues dealing with zero skew clock tree construction, we present an efficient two-phase algorithm based on the Elmore delay model, which successfully constructs zero skew clock routing trees with buffer insertion and minimum wirelength. The results of an implementation of this algorithm have been verified to display zero skew characteristics in conformance with the Elmore delay model equations. The first phase of the algorithm is a bottom-up delayed merge embedding (DME) with buffer insertion procedure which enumerates all of the possible zero skew clock trees for consideration in the second phase. In the second phase, a top-down procedure of merged embedding is performed with the objective of minimizing wirelength.

1.INTRODUCTION

In a synchronous sequential circuit, the clock signal is used to define a time reference for the movement of data from one storage element to another, through the circuit. Due to the extreme importance of the signal in synchronous circuits, much attention has been paid to the characteristics of these clock signals and the networks used to route them on-chip. While they are often regarded in the same light as any other control signals, further inspection makes it obvious that special consideration must be given when dealing with clock signals. These signals can typically be characterized as having the largest fan out, longest routing distances, and the highest operational speeds of any on-chip signal, control or data. Furthermore, since data

signals are driven with a temporal reference to the on-chip clock signals, these clock signals must be particularly clean and sharp. In addition, as technology scales down in feature size, clock signals in particular become affected by increased resistance due to their long interconnect lengths and decreasing line dimensions. The point is illustrated by the following equation for a wire’s resistance.

(1)

where R is the resistance of the wire, ρ is the resistivity of the wire in Ohms-meter, l is the length of the wire, and A is the cross sectional area of the wire is meters2.

From the above equation, (1), it can be seen that in general, as the feature size of VLSI chips decreases by a factor of fin a single dimension, the resistance of all wires increases by a factor off. This is because the length of the wire (l) decreases by a factor f, while the cross sectional area of the wire (A) also decreases, but by a factor of f2. This increased line resistance is one of the primary reasons for the growing importance of clock distribution on synchronous performance. Finally, the control of any differences in the delay of the clock signals can severely limit the maximum performance of the entire system as well as create catastrophic race conditions which may cause incorrect data values to be latched into the circuits registers.

In addition to the trend of decreasing feature sizes in VLSI circuits, a second trend of heightening clock speeds is also prevalent. Currently, clock speeds achievable in synchronous VLSI design are mainly limited by two factors, i) the longest delay path through any block of combinational logic and ii) clock skew. While the first contributing factor, that of the maximum delay through combinational logic can only be solved by further design considerations, the notion of clock skew can largely be dealt with using clock routing algorithmic techniques. Clock skew can be defined as the maximum difference in arrival times of a clocking signal at the synchronizing elements which it triggers in a design. Its limiting characteristics on clock cycle period can be observed from the following well-known inequality that governs the clock period of a clock signal net.

(2)

where td is the delay on the longest path through combinational logic, tskew is the maximum clock skew, tsu is the set-up time of the synchronizing elements (assuming they are edge triggered), and tds is the propagation delay within the synchronizing elements. Furthermore, the term td can further be broken down into two disjoint components according to the following equation.

(3)

where td-interconnect represents that portion of the delay through the longest path of combinational logic (td) that can be attributed to interconnect and td-gates represents that portion of the delay which can be attributed to the actual delay through the gates. Increased switching speeds in VLSI circuits due to decreasing feature sizes will decrease the td-gates term but not the td-interconnect term.

From the above analysis, it can be inferred that the two dominant terms determining the minimum clock period (maximum clock speed) are those associated with clock skew, tskew, and wire interconnect through the longest block of combinational logic, td-interrconnect. In a paper authored by Bakoglu [1], it was noted that clock skew may account for over 10% of the system’s cycle time in high-performance VLSI circuits. Working to lessen the effect of clock skew, a vast amount of research has been devoted solely to reducing skew in clock routing networks.

2.RELATED WORK

In the past few decades, much research has been devoted to the issue of minimizing clock skew to zero. Several algorithms have been proposed which all achieve zero clock skew at the cost of varying complexity, wire lengths, and other measures. The simplest approach was first proposed by Dhar, Franklin, and Wang [2]. Their proposed H-tree algorithm is based on the construction of a completely balanced clock tree as can be seen in Figure 1.

While the algorithm proposed by Dhar et al. will result in zero skew clock tree construction with a minimum amount of computational complexity, the approach results in unacceptably large lengths of wire that are required to route the resulting clock network. Therefore, this approach is no longer considered for clock tree construction.

More recently, several others have proposed much more complex algorithms which all succeed in constructing clock trees with a characteristic zero skew and varying costs in terms of complexity and resulting wirelengths. Examples of these algorithms are the Method of Means and Medians (MMM) algorithm, which was proposed by Jackson et al. [3], and the Geometric Matching Algorithm, which was proposed by Cong et al. [4].

Figure 1. H-tree over four points

In addition to development of new algorithms that achieve overall zero skew clock tree construction, other recent research has aimed at changing the most prominent view that the entire clock network should have a characteristic zero skew. The most notable research fitting into this category was conducted by Chen et al [5]. They researched and formulated the notion of associative skew. In a paper published in 1999, they proposed that it is not necessary that all synchronizing elements receive clock signals with relative zero skew. Instead, what is important is that closely related synchronizing elements receive clock signals with relative zero skew. For example, flip flops in a shift register, which are directly connected, should all receive identical clock signals; that is, clock signals with the same period and zero skew. On the other hand, synchronizing elements which are disjoint from one another in a circuit’s s-graph may not need to receive relative zero skew clock signals. Illustrations of these two scenarios are provided in Figures 2(a,b). In their paper, they also discuss strategies and algorithms to determine which synchronizing elements should be clustered together to have relative zero skew, and providea version of DME which will construct appropriate clock routing trees.

Figure 2(a). 4bit shift register (relative zero skew should be enforced for all of the synchronizing elements)

Figure 2(b): An unconnected s-graph (synchronizing element s2 may have clock skew relative to elements s0 and s1)

3.PROBLEM FORMULATION

As was stated previously, the problem considered in this paper is that of constructing a zero skew clock tree with buffer insertion and minimum wirelength. Stated more formally, the problem can be formulated as follows:

Given a set of clock sinks,{S}, construct a clock network which stems from a single root node,R, and terminates on all sinks,si{S}. In the construction of this network, all zero skew equations according to the Elmore delay model must be observed and satisfied. In addition, buffers must be inserted where needed, so as to limit the capacitance seen at any node to some upper threshold. Finally, wirelength should be minimized.

The Elmore delay model, which is used in order to assure the zero skew characteristics is based on the first-order moment of the impulse response, and is developed as follows. Let α and β respectively denote the resistance and capacitance per unit length of interconnect wire, so that the resistance, rev, and capacitance, cev, of a given interconnect wire, ev, are given by α|rev| and β|cev|, respectively. For each sink si in the tree T(S), there is a loading capacitance cL,which is the input capacitance of the functional unit driven by si.

We let Tv denote the subtree of T(S) rooted at v, and let cv denote the node capacitance of v. The tree capacitance of Tv is denoted by Cv and equals the sum of the capacitances in Tv. Cv is calculated using the following recursive formula:

(4)

The Elmore delay can be calculated as follows:

(5)

Finally, it should be noted that Elmore delay is additive. That is, if v is a vertex on the u-w path, then tED(u, w) = tED(u, v) + tED(v, w), and in particular, if v is a child of u on the u-si path, then tED(u, si) = rev((1/2)cev + Cv) + tED(v, si). A sink node si may be treated as a trivial zero skew subtree with capacitance cL, and delay zero.

4.DME WITH BUFFER INSERTION ALGORITHM

We propose an algorithm for zero skew clock tree routing which both minimizes wirelength and inserts buffers at the appropriate locations simultaneously. We accomplish this task by combining the well-known Deferred Merge Embedding (DME) algorithm with a simple buffer insertion heuristic. The DME algorithm, which was first proposed by Chao et al. [6], is based on the idea that finalized interconnect embedding for a clock network should be postponed as long as possible in hopes that better embedding choices will be able to be made given a greater view of the problem at hand. With this in mind, the DME algorithm is performed in two phases, one of which determines all possible zero skew embedding tapping point locations for the clock network by working bottom-up, from the network sinks to the root tapping point. The other phase is a top-down merge embedding procedure which acts to choose optimal tapping point locations among those determined to provide zero skew in the first phase of the algorithm. More formal descriptions of the two phases implemented in our research are provided in the sections that follow.

4.1Bottom-up Phase I

As was stated previously, the purpose of the first phase of our two phase algorithm is to determine all of the possible zero skew tapping locations which will later most likely result in a minimum wirelength tree construction outcome. This phase of our algorithm was first described by Masato Edahiro [7] in 1993. His proposed bottom-up phase, which is what we have chosen to use, is described below.

Let K be the current set of points and segments for consideration in an iteration of the algorithm; initially K will be the set of all clock sinking locations (K = {S}). On each iteration of the procedure, the nearest neighbor pair in the current set, K, is first found. Next, a merging segment is constructed between the two nearest neighbors. The process involved for merging segment construction was first described by Chao et al. [5]. In their work, two general cases of merging segment construction are discussed. These two cases correspond to two scenarios, the normal case and the case in which interconnect snaking is required.

Before determining which merging segment scenario is the one present, the zero skew tapping length between the two selected Manhattansegmentsmust be found. To do so, letTSa and TSb represent the two trees of merging segments to be merged. Let TSa and TSb, respectively, have capacitances C1 and C2 and delays t1 = tED(a) and t2 = tED(b). In addition, let pl(v) be a merging point with minimum merging cost.

From the Elmore delay model equation given by tED(v,a) = rea((½)cea + C1), it can be seen that pl(v) satisfies the following equation.

(6)

Now, let the Manhattan distance between the two selected merging segments, d(ms(a), ms(b)), be equal toκ. Supposing that TSa and TSb can be merged with merging cost κ, that is |ea| = x and |eb| = κ–x for 0 xκ, then we have the resistances rea = x and reb = (κ – x) and the capacitances cea = x and ceb= (κ – x). After substituting into (6) and solving for x, the following is obtained.

(7)

From here, there are two cases, the first of which requires no interconnect snaking, and occurs when 0 xκ. In this case, |ea| = x and |eb| = κ – x. In this case, the merging Manhattan segment can be determined following a procedure shown in Figure 3(a). It can be seen from the figure that two Manhattan radii are drawn, one around each Manhattan segment or core, with characteristic radii as determined by the values ea and eb obtained previously. In this case, the new merged Manhattan segment is the overlap of the two Manhattan radii.

In the other case, in which x 0 or x 1, the assumption of merging cost κ results in a negative edge length for either ea or eb. In this case, an extended distance κ’κ is required to balance the delays of the two trees. If x 0, which means that t1t2, then we choose pl(a) as the merging point and set |ea| = 0 and |eb| = κ’. Solving the following equation for κ’,

Figure 3(a). Merging Manhattan segments (Case #1)

Figure 3(b). Merging Manhattan segments (Case #2)

(8)

one obtains the following.

(9)

Similarly, if xκ, we set |eb| = 0 and

(10)

In the interconnect snaking case, only one Manhattan radius is drawn, for whichever of ea and eb is nonzero. In this case, the resulting merged Manhattan segment is the portion of the Manhattan core that is enclosed in the other Manhattan core’sManhattan radius. This idea is illustrated in Figure 3(b).

Now we present a formal description of our bottom-up phase of clock tree construction, which is the same algorithm originally used in Edahiro’s 1993 work [6], with modifications made to include a simple buffer insertion heuristic.

AlgorithmFind_Center([NS])

Input:Set of clock sinks {S}

Output:Tree of merged Manhattan segments, TS

K {S}

while |K| != 1 (if |K| = 1, then the element in K is the segment for the center vc)

Choose the nearest neighbor pair of Manhattan segments, v1 and v2, from K

Calculate the segment for vfromv1 and v2 using the zero skew merge

Delete v1 and v2 from K

Add vtoK

if v’s node capacitance > specified maximum node capacitance

Insert a buffer at node v

Set v’s capacitance to zero

endif

endwhile

In our implementation of this bottom-up phase, the nearest neighbor pair contained in a set of Manhattansegmentswas found by using the Delaunay triangulation graph method. This method will not be discussed any further in this paper, as it is out of the scope of the real research conducted; however, an excellent reference on the subject can be found in the book, Computational Geometry: An Introduction, by Preparata and Shamos [8]. Once the nearest neighbors have been found, the zero skew merge between the selected Manhattan segments is performed as was previously described. In addition, the consideration of a buffer insertion is performed at this point. This procedure of pairing and merging with buffer insertion is performed until only one node, the root node, remains. At this point, the top-down phase of our algorithm is performed.

4.2Top-down Phase II

At this point, the bottom-up phase has completed and a resulting tree of merged Manhattan segments has been obtained. The next step is to determine the exact final embeddings of internal nodes in the zero skew clock tree. This is accomplished in a top-down phase two of the algorithm.

For a node v in topology G, i) if v is the root node, then select any point in ms(v), the root nodes Manhattan segment, to be pl(v), the new tapping or merging point; or ii) if v is an internal node other than the root, choose pl(v) to be any point in ms(v) that is at a distance |ev| or less from the placement of v’s parent p. Because the merging segment ms(p) was constructed such that d(ms(v), ms(p))  |ev|, there must exist some choice of pl(v) satisfying the previously mentioned condition. In case ii), the algorithm first creates a square tilted rectangular region (TRR), trrp, with a radius of |ev| and a core equal to {pl(p)}; then pl(v) can be any point from ms(v) trrp. This is illustrated in Figure 4. A more formal definition of this procedure, termed Find_Exact_Placement by Chao et al. [6], is provided below.