Chapter 2

The Basic Concepts of Algorithms

The synthesis and analysis of algorithms is a rather complicated field in computer science. Since we shall model every problem in molecular biology as an algorithmic problem in this book, we give a brief introduction to some basic concepts of algorithms in this chapter, in the hope that the reader can gain a basic understanding of issues related to algorithms. It is quite difficult to present all of the terms in the field of algorithms formally, so we take an informal approach: all of the basic concepts will be presented in an informal way.

2.1 The Minimal Spanning Tree Problem

Every computer program is based upon some kind of algorithm, and efficient algorithms produce efficient programs. Let us consider the minimal spanning tree problem. We believe that the reader will quickly understand why it is so important to study algorithm design.

There are two versions of the spanning tree problem. We will introduce one of them first. Consider Figure 2.1.

Figure 2.1: A Set of Planar Points

A tree is a graph without cycles, and a spanning tree of a set of points is a tree consisting of all of the points. In Figure 2.2, we show three spanning trees of the set of points in Figure 2.1.

Figure 2.2: Three Spanning Trees of the Set of Points in Figure 2.1

Among the three spanning trees, the tree in Figure 2.2(a) is the shortest, and this is the one we are interested in. Thus the minimal spanning tree problem is defined as follows: we are given a set of points and we are asked to find a spanning tree with the shortest total length.

How can we find a minimal spanning tree? A very straightforward algorithm is to enumerate all possible spanning trees and one of them must be what we are looking for. Figure 2.3 shows all of the possible spanning trees for three points. As can be seen, there are only three of them.

Figure 2.3: All Possible Spanning Trees for Three Points

For four points, as shown in Figure 2.4, there are sixteen possible spanning trees. In general, it can be shown that given n points, there are n^(n-2) possible spanning trees for them. Thus if we have 5 points, there are already 5^3 = 125 possible spanning trees. If n = 100, there will be 100^98 possible spanning trees. Even if we have an algorithm to generate all spanning trees, time will not allow us to do so. No computer can finish this enumeration within any reasonable amount of time.

Figure 2.4: All Possible Spanning Trees for Four Points
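The count n^(n-2) is Cayley's formula for the number of spanning trees on n labeled points. The following one-function Python sketch (ours, not from the book) reproduces the numbers above:

```python
def spanning_tree_count(n):
    """Cayley's formula: the number of distinct spanning trees
    on n labeled points (n >= 2) is n raised to the power n - 2."""
    return n ** (n - 2)
```

For n = 100 this already yields a 196-digit number, which is why exhaustive enumeration is hopeless.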

Yet, there is an efficient algorithm to solve the minimal spanning tree problem. Let us first introduce Prim's algorithm.

2.1.1 Prim's Algorithm

Consider the points in Figure 2.1 again. Suppose we start with any point, say point b. The nearest neighbor of point b is point a. We now connect point a with point b, as shown in Figure 2.5(a). Let us denote the set {a, b} as X and the set of the rest of the points as Y. We now find the shortest distance between points in X and points in Y, which is that between b and e. We add e to the minimal spanning tree by connecting b with e, as shown in Figure 2.5(b). Now X = {a, b, e} and Y consists of the remaining points. During the whole process, we continuously maintain two sets, namely X and Y: X consists of all of the points in the partially created minimal spanning tree and Y consists of all of the remaining points. In each step of Prim's algorithm, we find a shortest distance between X and Y and add a new point to the tree, until Y is empty. For the points in Figure 2.1, the process of constructing a minimal spanning tree through this method is shown in Figure 2.5.

Figure 2.5: The Process of Constructing a Minimal Spanning Tree Based upon Prim's Algorithm

In the above, we assumed that the input is a set of planar points. We can generalize it so that the input is a connected graph where each edge is associated with a positive weight. It can be easily seen that a set of planar points corresponds to a graph in which there is an edge between every two vertices and the weight associated with each edge is simply the Euclidean distance between the two points. If there is an edge between every pair of vertices, we call this kind of graph a complete graph. Thus a set of planar points corresponds to a complete graph. Note that in a general graph, it is possible that there is no edge between two vertices. A typical general graph is shown in Figure 2.6.

Throughout the entire book, we shall always denote a graph by G = (V,E) where V is the set of vertices in G and E is the set of edges in G.

Figure 2.6: A General Graph

We now present Prim's algorithm as follows:

Algorithm 2.1 Prim's Algorithm to Construct a Minimal Spanning Tree

Input: A weighted, connected and undirected graph G = (V,E).

Output: A minimal spanning tree of G.

Step 1: Let x be any vertex in V. Let X = {x} and Y = V \ {x}.

Step 2: Select an edge (u, v) from E such that u ∈ X, v ∈ Y, and (u, v) has the smallest weight among all edges between X and Y.

Step 3: Connect u to v. Let X = X ∪ {v} and Y = Y \ {v}.

Step 4: If Y is empty, terminate and the resulting tree is a minimal spanning tree. Otherwise, go to Step 2.
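As a sketch only, Prim's algorithm for the planar-point version might be written in Python as follows. The function name and the point representation are our own choices, and for clarity the sketch simply rescans all pairs between X and Y in each step rather than using a priority queue:

```python
import math

def prim_mst(points):
    """Sketch of Prim's algorithm for planar points: repeatedly add the
    point of Y closest to the partially built tree X. Points are (x, y)
    tuples; the tree is returned as a list of index pairs."""
    n = len(points)

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    in_tree = [False] * n       # membership in X, the partial tree
    in_tree[0] = True           # Step 1: start from an arbitrary point
    edges = []
    for _ in range(n - 1):      # Steps 2-4: repeat until Y is empty
        # Step 2: shortest edge between X and Y
        _, u, v = min((dist(points[u], points[v]), u, v)
                      for u in range(n) if in_tree[u]
                      for v in range(n) if not in_tree[v])
        in_tree[v] = True       # Step 3: move v from Y to X
        edges.append((u, v))
    return edges
```

A priority-queue implementation would be faster, but this direct transcription mirrors the four steps above.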

Let us consider the graph in Figure 2.7. The process of applying Prim's algorithm to this graph is illustrated in Figure 2.8.

Figure 2.7: A General Graph

Figure 2.8: The Process of Applying Prim’s Algorithm to the Graph in Figure 2.7

We will not formally prove the correctness of Prim's algorithm; the reader can find the proof in almost every textbook on algorithms. Yet now we can easily see the importance of algorithms. If one does not know of such an efficient algorithm to construct minimal spanning trees, one can never construct them when the input size is large. It would be disastrous for anyone to use an exhaustive search method in this case.

In the following, we will introduce another algorithm to construct a minimal spanning tree. This algorithm is called Kruskal's algorithm.

2.1.2 Kruskal's Algorithm

Kruskal's algorithm to construct a minimal spanning tree is quite similar to Prim's algorithm. It first sorts all of the edges in the graph into an ascending sequence. Then edges are added into a partially constructed minimal spanning tree one by one. Each time an edge is added, we check whether a cycle is formed; if so, we discard the edge. The algorithm terminates when the tree contains n - 1 edges.

Algorithm 2.2 Kruskal's Algorithm to Construct a Minimal Spanning Tree

Input: A weighted, connected and undirected graph G = (V,E).

Output: A minimal spanning tree of G.

Step 1: T = ∅.

Step 2: while T contains less than n - 1 edges do

Choose an edge (v, w) from E of the smallest weight.

Delete (v, w) from E.

If the adding of (v, w) does not create a cycle in T then

Add (v, w) to T.

Else discard (v, w).

end while
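A Python sketch of Algorithm 2.2 follows; the union-find structure used for cycle detection is a standard implementation choice on our part, not something the algorithm above prescribes:

```python
def kruskal_mst(n, edges):
    """Sketch of Kruskal's algorithm: scan edges in ascending order of
    weight, keeping an edge only if it does not close a cycle.
    edges is a list of (weight, u, v) with vertices numbered 0..n-1."""
    parent = list(range(n))                  # union-find for cycle detection

    def find(x):
        """Root of x's component, with path compression."""
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):            # ascending weight
        ru, rv = find(u), find(v)
        if ru != rv:                         # no cycle: accept the edge
            parent[ru] = rv
            tree.append((w, u, v))
            if len(tree) == n - 1:           # spanning tree complete
                break
    return tree
```

Two vertices are in the same union-find component exactly when they are already connected in T, so joining them again would create a cycle.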

Let us consider the graph in Figure 2.7. The process of applying Kruskal's algorithm to this graph is illustrated in Figure 2.9.

Both algorithms presented above are efficient.

Figure 2.9: The Process of Applying Kruskal’s Algorithm to the Graph in Figure 2.7

2.2 The Longest Common Subsequence Problem

In this section, we will introduce another seemingly difficult problem and again show an efficient algorithm to solve this problem.

Consider the following sequence:

S: ABDDRGTY.

The following sequences are all subsequences of S:

ABDDRGT, DGTY, DDRGT, BDR, ABDTY, DDRT, GY, ADRG

ABTY, TY, ABG, DDY, AR, BTY, TY, Y

Suppose that we are given two sequences as follows:

S1: ABDDRGTY

S2: CDEDGRRT

Then we can see that the following sequences are subsequences of both S1 and S2:

DRT

DRGT

DDRGT

All of the above sequences are called common subsequences of S1 and S2. The longest common subsequence problem is defined as follows: given two sequences, find a longest common subsequence of them. If one is not familiar with algorithm design, one may be totally lost with this problem. Algorithm 2.3 is a naive algorithm to solve it.

Algorithm 2.3 A Naive Algorithm for the Longest Common Subsequence Problem

Step 1: Generate all of the subsequences of, say, S1.

Step 2: Starting with the longest one, check whether it is a subsequence of S2.

Step 3: If it is not, delete this subsequence and return to Step 2.

Step 4: Otherwise, return this subsequence as a longest common subsequence of S1 and S2.

The trouble is that it is exceedingly time-consuming to generate all of the subsequences of a sequence. A much better and much more efficient algorithm employs a strategy called the dynamic programming strategy. Since this strategy is used very often to solve problems in molecular biology, we shall describe it in the next section.

2.3 The Dynamic Programming Strategy

The dynamic programming strategy can be explained by considering the graph in Figure 2.10. Our problem is to find a shortest route from vertex S to vertex T. As can be seen, there are three branches from S. Thus we have at least three routes, namely going through A, going through B and going through C. We have no idea which one is the shortest, but we have the following principle: if we go through a vertex X, we should use a shortest route from X to T.

Figure 2.10: A Graph

Let d(x, y) denote the length of a shortest route between vertices x and y. We have the following equation:

d(S, T) = min{d(S, A) + d(A, T), d(S, B) + d(B, T), d(S, C) + d(C, T)} (2.1)

The question is: how do we find the shortest route from, say, vertex A to vertex T? Note that we can use the same principle to find a shortest route from A to T. That is, the problem of finding a shortest route from A to T is the same as the problem of finding a shortest route from S to T, except that the size of the problem is now smaller.

The shortest route finding problem can now be solved systematically as follows:

d(A, T) = 31 (2.2)

d(B, T) = 4 (2.3)

d(C, T) = 35 (2.4)

Substituting (2.2), (2.3) and (2.4) into (2.1), we obtain d(S, T) = min{15 + 31, 18 + 4, 3 + 35} = 22, which implies that a shortest route from S to T goes through vertex B. As shown above, the basic idea of the dynamic programming strategy is to decompose a large problem into several sub-problems. Each sub-problem is identical to the original problem except that its size is smaller. Thus the dynamic programming strategy always solves a problem recursively.
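The recursive principle can be sketched with memoization in Python. The graph below is hypothetical: it keeps the three edge lengths 15, 18 and 3 out of S that appear in the substitution above, and collapses each remaining route into a single edge of length 31, 4 or 35 respectively:

```python
from functools import lru_cache

# Hypothetical graph: each vertex maps to (neighbor, edge length) pairs.
# The routes from A, B and C onward to T are collapsed into single edges
# whose lengths equal the shortest distances 31, 4 and 35 from the text.
graph = {
    'S': [('A', 15), ('B', 18), ('C', 3)],
    'A': [('T', 31)],
    'B': [('T', 4)],
    'C': [('T', 35)],
}

@lru_cache(maxsize=None)
def d(x):
    """Shortest distance from x to T, computed recursively:
    d(T) = 0, and d(x) = min over neighbors y of (length(x, y) + d(y))."""
    if x == 'T':
        return 0
    return min(w + d(y) for y, w in graph[x])
```

The lru_cache decorator ensures each sub-problem is solved only once, which is exactly what distinguishes dynamic programming from naive recursion.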

In the next section, we go back to the longest common subsequence problem and show how the dynamic programming strategy can be applied to solve the problem.

2.4 Application of the Dynamic Programming Strategy to Solve the Longest Common Subsequence Problem

The longest common subsequence problem was presented in Section 2.2. It was also pointed out that we cannot solve the problem in any naive and unsophisticated way. In this section, we shall show that this problem can be solved elegantly by using the dynamic programming strategy.

We are given two sequences A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n. Consider the last characters a_m and b_n. There are two cases:

Case 1: a_m = b_n. In this case, a_m, which is equal to b_n, must be included in the longest common subsequence. The longest common subsequence of A and B is the longest common subsequence of a_1 a_2 ... a_{m-1} and b_1 b_2 ... b_{n-1}, plus a_m.

Case 2: a_m ≠ b_n. Then we find two longest common subsequences: that of a_1 a_2 ... a_{m-1} and b_1 b_2 ... b_n, and that of a_1 a_2 ... a_m and b_1 b_2 ... b_{n-1}. Among these two, we choose the longer one, and a longest common subsequence of A and B must be this longer one.

To summarize, the dynamic programming strategy decomposes the longest common subsequence problem into three identical sub-problems and each of them is of smaller size. Each sub-problem can now be solved recursively.

In the following, to simplify the discussion, let us concentrate on finding the length of a longest common subsequence. It will be obvious that our algorithm can easily be extended to find a longest common subsequence itself.

Let L(i, j) denote the length of a longest common subsequence of a_1 a_2 ... a_i and b_1 b_2 ... b_j. L(i, j) can be found by the following formula:

L(i, j) = 0 if i = 0 or j = 0;
L(i, j) = L(i - 1, j - 1) + 1 if i, j > 0 and a_i = b_j;
L(i, j) = max{L(i - 1, j), L(i, j - 1)} if i, j > 0 and a_i ≠ b_j.

The following is an algorithm to find the length of a longest common subsequence based upon the dynamic programming strategy:

Algorithm 2.4 An Algorithm to Find the Length of a Longest Common Subsequence Based upon the Dynamic Programming Strategy

Input: A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n.

Output: The length of a longest common subsequence of A and B, denoted as L(m, n).

Step 1: L(i, 0) = 0 for i = 0 to m and L(0, j) = 0 for j = 1 to n.

Step 2: for i = 1 to m do

for j = 1 to n do

if a_i = b_j then L(i, j) = L(i - 1, j - 1) + 1

else L(i, j) = max{L(i - 1, j), L(i, j - 1)}

end for

end for
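Algorithm 2.4 translates almost line by line into Python; the function and table names are our own:

```python
def lcs_length(A, B):
    """Length of a longest common subsequence of A and B, using the
    dynamic programming recurrence: match the last characters when
    equal, otherwise take the better of dropping one character."""
    m, n = len(A), len(B)
    # L[i][j] = LCS length of A[:i] and B[:j]; row 0 and column 0 stay 0
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]
```

Tracing back through the table L, as done in Tables 2.1 and 2.2, recovers the subsequences themselves rather than just their length.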

Let us consider an example.

A = AGCT

and B = CGT

The entire process of finding the length of a longest common subsequence of A and B is illustrated in Table 2.1. By tracing back, we can find two longest common subsequences, CT and GT.

Let us consider another example: A = aabcdec and B = badea. Table 2.2 illustrates the process.

Again, it can be seen that we have two longest common subsequences, namely bde and ade.

Table 2.1: The Process of Finding the Length of Longest Common Subsequence of AGCT and CGT

Table 2.2: The Process of Finding the Length of a Longest Common Subsequence of A = aabcdec and B = badea

2.5 The Time-Complexity of Algorithms

In the above sections, we showed that it is important to be able to design efficient algorithms. Or, to put it another way, many problems can hardly be solved if we cannot design efficient algorithms for them. Therefore, we now come to a critical question: how do we measure the efficiency of algorithms?

We usually say that an algorithm is efficient if the program based upon it runs very fast. But whether a program runs fast or not sometimes depends on the hardware and on the skill of the programmer, both of which are irrelevant to the algorithm itself.

In algorithm analysis, we always choose a particular step of the algorithm. Then we try to see how many such steps are needed to complete the program. For instance, in all sorting algorithms, the comparison of data can not be avoided. Therefore, we often use the number of comparisons of data as the time-complexity of a sorting algorithm.

Let us consider the straight insertion sort algorithm. We are given a sequence of numbers x_1, x_2, ..., x_n. The straight insertion sort algorithm scans this sequence; if x_j is found to be smaller than x_{j-1}, we put x_j to the left of x_{j-1}. This process is continued until the number to the left of x_j is not larger than x_j.

Algorithm 2.5 The Straight Insertion Sort Algorithm

Input: A sequence of numbers x_1, x_2, ..., x_n.

Output: The sorted sequence of x_1, x_2, ..., x_n.

for j = 2 to n do

t = x_j

i = j - 1

while i > 0 and x_i > t do

x_{i+1} = x_i

i = i - 1

end while

x_{i+1} = t

end for

Suppose that the input sequence is 9, 17, 1, 5, 10. The straight insertion sort sorts this sequence into a sorted sequence as follows:

9

9,17

1,9,17

1,5,9,17

1,5,9,10,17
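The straight insertion sort above can be sketched as runnable Python (the variable names t and i follow our reconstruction of the pseudocode; indices are zero-based as is idiomatic in Python):

```python
def straight_insertion_sort(x):
    """In-place straight insertion sort: for each x[j], shift the larger
    elements of the already-sorted prefix one place right, then insert
    x[j] into its proper slot."""
    for j in range(1, len(x)):
        t = x[j]                  # data movement: t = x_j
        i = j - 1
        while i >= 0 and x[i] > t:
            x[i + 1] = x[i]       # inner-loop data movement: x_{i+1} = x_i
            i -= 1
        x[i + 1] = t              # data movement: x_{i+1} = t
    return x
```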

In this sorting algorithm, the dominating steps are data movements. There are three data movement operations, namely t = x_j, x_{i+1} = x_i and x_{i+1} = t. We can use the number of data movements to measure the time-complexity of the algorithm. In the algorithm, there is one outer loop and one inner loop. For the outer loop, the data movement operations t = x_j and x_{i+1} = t are always executed, no matter what the input is. For the inner loop, there is only one data movement, namely x_{i+1} = x_i. This operation is executed only if the inner loop is executed; in other words, whether it is executed depends on the input data. Let us denote by d_j the number of data movements executed in the inner loop when x_j is being inserted. The total number of data movements for the straight insertion sort is

M = (2 + d_2) + (2 + d_3) + ... + (2 + d_n) = 2(n - 1) + (d_2 + d_3 + ... + d_n).

Best Case: The input sequence is already sorted. In such a case, d_j = 0 for all j and M = 2(n - 1).

Worst Case: The worst case occurs when the input sequence is sorted in reverse order. In such a case, d_j = j - 1 for 2 ≤ j ≤ n.

Thus, M = 2(n - 1) + (1 + 2 + ... + (n - 1)) = 2(n - 1) + n(n - 1)/2.

Average Case: To conduct an analysis of the average case, note that when x_j is being considered, x_1, x_2, ..., x_{j-1} have already been sorted. If x_j is the largest among the first j numbers, the inner loop will not be executed. If x_j is the kth largest among the first j numbers, there will be k - 1 data movements executed in the inner loop. The probability that x_j is the kth largest among the first j numbers is 1/j for 1 ≤ k ≤ j. Therefore the average number of data movements in the inner loop is

d_j = (0 + 1 + ... + (j - 1))/j = (j - 1)/2.

The average case total number of data movements for the straight insertion sort is

M = 2(n - 1) + (1/2)(1 + 2 + ... + (n - 1)) = 2(n - 1) + n(n - 1)/4.

In summary, the time-complexities of the straight insertion sort are as follows:

Straight Insertion Sort Algorithm

Best Case: 2(n - 1)

Average Case: 2(n - 1) + n(n - 1)/4

Worst Case: 2(n - 1) + n(n - 1)/2

In the above, we showed how an algorithm can be analyzed. For each algorithm, there are three time-complexities: the worst case time-complexity, the average case time-complexity and the best case time-complexity. For practical purposes, the average case time-complexity is the most important one. Unfortunately, it is usually very difficult to obtain an average case time-complexity. In this book, unless we specify clearly, whenever we mention the time-complexity, we mean the worst case time-complexity.

Now, suppose we have a time-complexity equal to n^2 + n. It can be easily seen that as n becomes very large, the term n^2 dominates. That is, we may ignore the term n when n is large enough. We now present a formal definition for this idea.

Definition 2.1: f(n) = O(g(n)) if and only if there exist two positive constants c and n_0 such that f(n) ≤ c·g(n) for all n ≥ n_0.

This function O is called the big-O function. If f(n) is O(g(n)), f(n) is bounded by c·g(n), in a certain sense, as n is large enough. If the time-complexity of an algorithm is O(g(n)), it will take fewer than c·g(n) steps to run this algorithm, for some c, as n is large enough. Note that n is the size of the input data.

Assume that the time-complexity of an algorithm is n^2 + n. Then

n^2 + n ≤ n^2 + n^2 = 2n^2 for all n ≥ 1.

Thus, we say that the time-complexity is O(n^2), because we can find c and n_0 to be 2 and 1 respectively.
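Such a bound can be checked numerically over a finite range of n; the helper below is our own illustration (a spot check, not a proof, since big-O is a statement about all sufficiently large n):

```python
def bounded_by(f, g, c, n0, n_max=1000):
    """Check that f(n) <= c * g(n) holds for every n with n0 <= n <= n_max."""
    return all(f(n) <= c * g(n) for n in range(n0, n_max + 1))
```

For example, bounded_by(lambda n: n * n + n, lambda n: n * n, c=2, n0=1) confirms the inequality above on the tested range.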

It is customary to use O(1) to represent a constant. Let us go back to the time-complexities of the straight insertion sort algorithm. Using the big-O function, we now have:

Straight Insertion Sort Algorithm

Best Case: O(n)

Average Case: O(n^2)

Worst Case: O(n^2)

Many other algorithms have been analyzed. Their time-complexities are as follows:

The Binary Search Algorithm

Best Case: O(1)

Average Case: O(log n)

Worst Case: O(log n)

The Straight Selection Sort

Best Case: O(n^2)