An Algorithm to Reconstruct a Target DNA Sequence from Its Spectrum Connected at a Given Level

Fang-Xiang Wu

Division of Biomedical Engineering

University of Saskatchewan

Saskatoon, SK, S7N 5A9, CANADA

Wen-Jun Zhang

Division of Biomedical Engineering

University of Saskatchewan

Saskatoon, SK, S7N 5A9, CANADA

Anthony J. Kusalik

Department of Computer Science

University of Saskatchewan

Saskatoon, SK, S7N 5A9, CANADA

Abstract

In order to sequence a target DNA, it is first cleaved into many shorter overlapping fragments by chemical or physical techniques. The nucleotide sequence of each fragment is thendetermined (read)by established methods. The set of all read fragments which cover the target DNA sequence is called its spectrum. It is believed that the shortest superstring of a spectrum is the best candidate for the target DNA sequence. The general problem of finding the shortest superstring for any given set of strings is NP-hard. Fortunately, the biological instance of this problem is easier. It is not likely that two read fragments, each consisting of several hundred letters, which come from consecutive locations on the target DNA sequence have an overlap ofonlya few letters; typically, the overlap will be longer. Thus one may reasonably assume that two strings in the spectrum have significant overlap (connectivity) if they come from consecutive locations on the target DNA sequence. A class of important instances satisfying this assumption are those whose spectraare from DNA microarrays. This assumption allows us to claim and show the following: if the spectrum of a target DNA sequence is substring-free and connected at level , and the target DNA squence has no repeats of size or larger, then there exists an algorithm to reconstruct the target DNA sequence in the linear time after an overlap graph of the spectrum is built.

1. Introduction

A DNA molecule is a very long polymer chain consisting typically of hundreds of thousands of nucleotides A, C, G, T. The sequence of nucleotides is represented by a string over the alphabet.At present, experimental techniques and conditions only allow accurate identification (“reading”) ofabout 600 bases out of a long single-stranded DNA molecule[1]. Therefore, reconstructing the whole sequence from fragments is an important step in sequencing a whole DNA molecule.This reconstruction is called fragment assembly.The set of all read fragments which cover a target DNA sequence is called its spectrum. It isbelieved that the shortest superstring of a spectrumis the best candidate for the target DNA sequence [1,7]. Such a shortest superstring representation is likely to be correct if the target DNA sequence does not have many long repeats (which is often the case in practice [2]).

The general problem of finding the shortest superstring for any given set of strings is NP-hard. Because of this complexity,much effort has been expended to find good approximation algorithms with guaranteed performance [1-5, 8] or to find efficient and exact algorithms for restricted problems [1, 6-8]. In this paper, we develop an efficient and exact algorithm for the class of problems arising from fragment assembly. Our algorithm improves Setubal and Meidanis’s [7] and may subsume Pevzner’s algorithm [6] to some extent.

We review some basic definition and concepts in Section 2. In Section3, two existing algorithms are described and our new algorithm is presented. Finally Section 4 discusses our results and future work.

2. Preliminaries

Let be a set of strings over some alphabet . A common superstring, or simply superstring, of is a string such that each in is a substring of . The shortest superstring problem is to find a superstring of the smallest possible length for any given set of strings . The central concept of most existing algorithms for this problem is a distance graph or overlapgraph on the set . Without loss of generality, assume that the set is substring-free in that no string is a substring of any other . For two strings and , let be the longest string such that and for some nonempty strings and . The string is called the overlap of string with respect to string, and denoted by. Aoverlapfunction is a mapping : , where is the set of all natural numbers. For any two strings and in , . The string is called the prefix of string with respect to string , and denoted by . A distancefunction is another mapping : .For any two strings and in , . In the paper, denotes the number of elements of a set. denotes the length of a string .

A distance graph of is defined as a weighted digraph , where the set of vertices is , the set of edges is , and the weight function is the distance function . Correspondingly, an overlap graph is defined as a weighted digraph , where the sets of vertices and edges in are the same as those in , but the weight function in is the overlap function .

It was proved [6] that a superstring of corresponds to a Hamiltonian path in the distance graph (or in the overlap graph ), and vice versa. Furthermore, either the shortest Hamiltonian path in or the longest Hamiltonian path in corresponds to the shortest superstring of . Given a nonnegative integer , we define a sub-graph of as a weighted digraph keeping only those edges with a weight at least in . For a directed path in the overlap graph , the string is denoted by . Obviously, is the shortest superstring generated along the path . The string is denoted by .

3. Algorithms

3.1 Two existing algorithms

Pevzner [6] presented an algorithm to reconstruct target DNA sequences whose spectra come fromDNA microarrays. A DNA microarray consists of all short known DNA fragments (called probes) with length .Obviously, such a DNA microarray has fragments (probes). This kind of spectra possesses the following properties: 1) All strings have the same length , and 2) Any two consecutive length strings in a target DNA sequence overlap by letters. In addition, Pevzner assumed that each length string appears only once in a target DNA sequence. Exploiting these properties and the assumption, he reduced the reconstructing of a target DNA sequence to the finding of an Eulerian path on a digraph , whose set of vertices is the set of all tuples, and whose set of edge is the spectrum of the target DNA sequence. Although finding an Eulerian path is simple [9], reconstructing a target DNA sequencebecomes complicated because Eulerian path in is not unique. Actually, the number of Eulerian paths in is exponential growth with respect to the degrees of vertices [9]. Gusfield et al. proved in [8] that finding the optimal Eulerian path in is NP-hard when the degrees of its vertices are greater than 3.

In [6], Setubal and Meidanis viewed a target DNA sequenceas an integer interval. A samplingof the target DNA sequence consists of some subintervals of the interval. Two intervals and in are said to be linked at level if . is said to be connected at level if for any two intervals and in there are a series of intervals for such that , and is linked to at level for . A sampling is said to be subinterval-free if there are no two intervals and in with .A spectrum is generated by if every string in corresponds to an interval in .By using the properties of an interval graph, Setubal and Meidanis obtained:

Thereom 1: Let be a subinterval-free, connected at level sampling of a target DNA sequence that covers . If a spectrum of the target DNA sequence is generated by , and has no repeats with size or larger, then the digraph has a unique Hamiltonian path and .

3.2 Improved Algorithm

There is an unreasonable restriction on Setubal-Meidanis algorithm. In theorem 1, the spectrum must be generated by a sampling consisting of some integer intervals. Such a spectrum is interval-dependent. This means that locations of all intervals in were known before is constructed byits spectrum . It seems controversial. In this section, we develop a new algorithm which improves theorem 1 in that it holds for interval-independent spectra.

Two read fragments and are linked at level if . A spectrum is said to be connected atlevel if for any two strings and in there is either a directed path from to or one from to in on which two consecutive fragments (vertices) are linked at level . The connected at level spectrum describes well the assumption that the two read fragments from consecutive locations in must have significant overlap and the assumption that all read fragments cover the target DNA. It is obvious that these two assumptions are biologically reasonable. Ourmain result is described as follows:

Theorem 2: Assume that the spectrum of a target DNA sequenceis substring-free and connected at level . If has no repeats with size or larger, then the digraph has a unique Hamiltonian path and

Comparing theorem 2to theorem 1, it is obvious that the restriction on theorem 1, whichthat the spectrum must be generated by a sampling , vanishes. Since the spectrum in theorem is substring-free, the target DNA is the shortest superstring of a Hamiltonian path in the digraph . To prove this theorem, we need the following Lemmas.

Lemma 1: If has no repeats with size or larger, then there is no directed cycle in , that is, is an acyclic digraph.

Proof: By contradiction. Let: be a directed cycle in , so is also a directed cycle in . Since corresponds to a Hamiltonian path , there exists some such that either string or string , say , doesn’t appear on . Thus the string appears on two distinct locations on. This means that there is a repeat with size or larger on, thusa contradiction has been concluded. Therefore, Lemma 1 has been proved.

Lemma 2: If has no repeats with size or larger, then every vertex in has both in-degree at most one and out-degree at most one.

Proof: We first prove by contradiction that every vertex in has in-degree at most one under the condition of Lemma. Assume that there exists some vertex with in-degree more than one. Thus there exist two edges and in . Similarly to the argument in Lemma 1, out of two strings and , at least oneis not in , say , so string is a repeat whose size is or larger. Thus the first conclusion is concluded with this contradiction.

Then the second conclusion that every vertex in has out-degree at most one may be proved similarly.

Lemma 3: If is substring-free and connected at level , and if has no repeats of size or larger, then in there exist exactly one vertex with in-degree zero, another vertex with out-degree zero, andthe other vertices except for these two vertices in withboth in-degree one and out-degree one.

Proof: First we notice that digraph is connected since is substring-free and connected at level . Therefore every vertex in has at least either in-degree one or out-degree one because otherwise there exist an isolated vertex. If there are two vertex and with in-degree zero, then there is neither a path from to nor one from to in . This contradicts with the condition that is substring-free and connected at level . Therefore there is at most one vertex with in-degree zero in . Furthermore, if there is no vertex with in-degree zero, this means that all vertices have in-degree one by Lemma 2. Thus there is a directed circle in , it contradicts with Lemma 1. So there is exactly one vertex with in-degree zero in .

Similarly, we can obtain that there is exactly one vertex with out-degree zero in . Finally we conclude that the other vertices in have both in-degree one and out-degree one by employing Lemma 2.

Proof of theorem 2: By Lemma 3, in the digraph exactly two vertices are semi-balanced and all other vertices are balanced [6, 9], so there exists an Eulerian path in . Furthermore, every balanced vertex has exactly in-degree one and out-degree one and the two semi-balanced vertices have exactly in-degree one or out-degree one, respectively. Therefore this Eulerian path is unique and goes through all vertices in exactly once. Therefore, the path is also a unique Hamiltonian path in the digraph .

For simplicity, let . We prove that by contradiction. Assume that , there exists an edge on path that doesn’t occur on the Hamiltonian path corresponding to in . This means that string occurs two times on. A contradiction is abstracted because of that. Therefore, we have completed the proof of Theorem 2.

Since finding an Eulerian path ina given digraph can be fulfilled in the linear time [9], where is the set of verticesof the digraph, we conclude by Theorem 2 and its proof:

Theorem 3: Assume that the spectrum of a target DNA is substring-free and connected at level . If has no repeats of size or larger, then there exists an algorithm that can exact reconstruct from its spectrum in the linear time after the digraph is built.

4. Discussions and future works

Comparing Setubal and Meidanis’s algorithm to ours, it is obvious that we have improved the former in that the interval-dependent condition of the spectrum is removed. Therefore, our algorithm is more useful. On the other hand, although Pevzner’s algorithm can efficiently find an Eulerian path in the digraph which can be used to construct a candidate sequence for the target DNA, it faces difficulties when the degrees of all vertices in the digraph are greater than one. When the degree of all vertices in the digraph is equal to one, our algorithm is more general than Pevzner’s in that ours is not restricted to spectra coming from DNA microarrays; i.e. allfragments in a spectrum are the same long.

When the degree of vertices in is greater than one, Pevzner’s algorithm will find a number of candidate sequences for a target DNA. To choose one string from them as closest to the actual DNA sequence, the problem is equivalent to finding the optimal Eulerian path in and requires additional experimental information to solve. In a recent paper [8], Gusfield et al. proved that if the degree of every vertex in is exactly two, there exists an algorithmfor finding an optimal target DNA sequence in linear time, and that if the degree of all vertices in is equal to or larger than four, finding the optimal candidate for a target DNA sequence isagain NP-hard.

The structure of repeats in a target DNA plays a crucial rule in designing algorithms for fragment assembly. Our algorithm requires that repeats are shorter than the degree of overlap between fragments. Although this condition may seem harsh, our algorithm may suggest other approaches which can overcome the restriction. Future work is to design exact or good approximation algorithms for fragment assembly in which the structure of repeats in the target DNA is known. This information may often be available before the target DNA is reconstructed.

5. Acknowledgements

We thank Natural Sciences and Engineering Research Council of Canada (NSERC) for a partial financial support to this research. The first author would like to thank the University of Saskatchewan for funding him through a graduate scholarship award.

6. References

[1] M. Pop, S.L. Salzberg, and M. Shumway,“Genome sequence assembly: algorithms and issues”, IEEE Computer, 35, 2002, pp.47-54.

[2] A. Blum, et al.,“Linear approximation of shortest superstring”, Journal of the Association for Computing Machinery, 41, 1994, pp. 630-647.

[3] D. Breslauer, T. Jiang, and Z. Jiang,“Rotations of periodic strings and short superstring”, Journal of Algorithms, 24, 1997, pp. 340-353.

[4] C. Armen and C. Stein, “A superstring approximation algorithm”, Discrete Applied Mathem-atics, 88, 1998, pp. 29-57.

[5]Z.Sweedyk,“A - approximation algorithm for shortest superstring”, SIAM Journal on Computing, 29, 1999, 954-986.

[6] Pevzner, P.A.,Computational Molecular Biology: An Algorithm Approach, The MIT Press,USA, 2000.

[7] Setubal, J. and J. Meidanis, Introduction to Computational Molecular Biology, International Thomson Publishing, 1997.

[8] D.Gusfield, et al.,“Graph traversals, genes and matroids: an efficient case of the traveling salesman problem”, Discrete Applied Mathematics, 88, 1998, pp. 167-180.

[9] Fleischner H., Eulerian Graphs and Related Topics, Elsevier Science Publishers, 1990.