11

Multiple Sequence Alignment Using Partial Order Graphs

Christopher Lee*, Catherine Grasso, Mark Sharlow

Department of Chemistry and Biochemistry

University of California, Los Angeles

Los Angeles, CA 90095-1570

TEL: 310-825-7374

FAX: 310-267-0248

EMAIL:

*Author to whom correspondence should be addressed.

Draft September 24, 2001

Printed September 24, 2001

ABSTRACT

Motivation: Progressive multiple sequence alignment (MSA) methods depend on reducing an MSA to a linear profile for each alignment step. However, this leads to loss of information needed for accurate alignment, and gap scoring artifacts.

Results: We present a graph representation of an MSA that can itself be aligned directly by pairwise dynamic programming, eliminating the need to reduce the MSA to a profile. This enables our algorithm (POA: partial order alignment) to guarantee that the optimal alignment of each new sequence versus each sequence in the MSA will be considered. Moreover, this algorithm introduces a new edit operator, homologous recombination, important for multidomain sequences. The algorithm has improved speed (linear time complexity) over existing MSA algorithms, enabling construction of massive and complex alignments (e.g. an alignment of 5000 sequences in 4 hours on a Pentium II). We demonstrate the utility of this algorithm on a family of multidomain SH2 proteins, and on EST assemblies containing alternative splicing and polymorphism.

Availability: The partial order alignment program POA is available at http://www.bioinformatics.ucla.edu/poa.

Contact: Christopher Lee, .

INTRODUCTION

Multiple sequence alignment (MSA) is one of the most important tools in bioinformatics, and has a long history of study in computer science and bioinformatics (Needleman & Wunsch, 1970; Smith & Waterman, 1981; Taylor, 1986; Barton & Sternberg, 1987; Higgins & Sharp, 1988; Altschul, 1989; Lipman et al., 1989; Subbiah & Harrison, 1989; Gotoh, 1993; Thompson et al., 1994; Gotoh, 1996; Notredame et al., 2000; Gordon et al., 2001). It continues to be an active field of research, because it poses computational challenges that are only partially solved by existing algorithms. For pairwise sequence alignment (of just two sequences), a globally optimal solution can be found in O(L2) time by dynamic programming, where L is the length of the sequences. This algorithm can be extended to align N sequences optimally, but requires O(LN) time. Whereas the pairwise alignment time of O(L2) is acceptable for typical biological sequences of genes and proteins (L<10,000), the exponential time required for aligning larger numbers of sequences by dynamic programming is impractical. Thus optimal alignment of pairs of sequences is performed routinely, but optimal alignment of multiple sequences is generally not possible.

Instead, a variety of heuristic MSA algorithms have been developed, nearly all of them based on progressive application of pairwise sequence alignment to build up alignments of larger numbers of sequences as proposed by Feng and Doolittle (Feng & Doolittle, 1987). An excellent example of the progressive alignment approach is CLUSTAL, initially released by Higgins, et al. in 1988 (Higgins & Sharp, 1988). This algorithm builds a multiple sequence alignment through a series of pair-wise alignments. Initially, all of the sequences are aligned pair-wise resulting in N(N-1)/2 alignments. The scores of these alignments are then used to construct a binary tree of their evolutionary relationships. Finally, the algorithm builds a multiple sequence alignment in the order dictated by the evolutionary tree: the most recently diverged sequences are aligned first resulting in N/2 alignment profiles; the N/2 alignment profiles are aligned to each other resulting in N/4 alignment profiles; and so forth, until all of the sequences have been aligned, resulting in a single alignment profile of the multiple sequence alignment of all N sequences. Finding the scores and constructing the binary evolutionary tree is O(NL2). Using the binary tree to build the multiple sequence alignment is O(L2logN). Consequently, the overall algorithm runs in polynomial time and can be used to align over a hundred sequences in a reasonable amount of time. As pointed out by Gibson and co-workers (Thompson et al., 1994), this algorithm is greedy by nature and may find a local minimum either because the guide tree is not correct or because alignment errors that happen early on in the process of building the multiple sequence alignment get locked in. As a consequence, this algorithm does not guarantee an optimal multiple sequence alignment. Many MSA methods have been developed with different advantages and disadvantages (Taylor, 1986; Barton & Sternberg, 1987; Altschul, 1989; Lipman et al., 1989; Subbiah & Harrison, 1989; Gotoh, 1993; Thompson et al., 1994; Gotoh, 1996; Notredame et al., 2000; Gordon et al., 2001), but globally optimal alignment of larger numbers of sequences remains impractical.

Gibson and coworkers (Thompson et al., 1994) highlighted two major problems with the progressive multiple sequence alignment approach: the local minimum problem, mentioned above, and the choice of appropriate alignment parameters. Their solution was to weight sequences according to their degree of similarity, and to vary gap penalties and substitution matrices during not only the dynamic programming but also the course of building up the full alignment. This strategy was implemented in CLUSTALW, which has been one of the most commonly used multiple sequence alignment programs since 1994 (Thompson et al., 1994).

While we agree with this analysis, we believe there are additional problems in progressive alignment, whose resolution can yield further improvements. Specifically, we view handling of gaps and insertions as one of the most troublesome and least rigorously formulated aspects of multiple sequence alignment. We wish to emphasize that this is not a criticism of any existing MSA method, but rather a way of highlighting opportunities for fundamental rethinking of the problem. We will use the term “progressive alignment” to designate the general class of MSA build-up algorithms, and our usages of this term should not be interpreted as references to any existing algorithm. Based on analysis of fundamental problems in gap / insertion representation in progressive alignment, we will present a novel approach based on graph theory that could have broad applicability in multiple sequence alignment and sequence analysis algorithms. Indeed, this approach complements existing MSA algorithms, and can be combined with them in many useful ways, since each method has specific advantages.

Progressive Alignment and the MSA Representation Problem

To identify new areas of potential improvement in progressive alignment, it is important to define its components clearly. Progressive alignment requires aligning pairs of MSAs, to build up larger MSAs. In practice, however, pairwise dynamic programming is not applied directly to align the pairs of MSAs. Instead, progressive alignment has relied on reducing each MSA to a one-dimensional sequence which can be used in pairwise dynamic programming sequence alignment. This one-dimensional sequence is a consensus sequence or profile of the MSA that gives position-specific residue weighting and scoring. It is assumed that the two cluster MSAs should be aligned exactly as their 1D profiles align, which makes sense if no information about each MSA is lost when it is reduced to a profile. From this point of view, gaps / insertions pose a special challenge (Altschul, 1989; Gotoh, 1996), since they raise the question of what should be included in the profile, and the threat that important information will be lost in the reduction process.

Unfortunately, this reduction of an MSA to a 1D profile inevitably involves loss of information. That is, while the MSA contains all the information to produce the profile, the profile does not contain all the information needed to reconstruct the original MSA. This is demonstrated by the fact that many different MSAs (e.g. with letters in a given column shuffled) would reduce to the same profile. Because a given profile does not map uniquely to a single MSA, it cannot be used to reconstruct the original MSA. This lack of uniqueness can also be described as degeneracy in the MSA ® profile mapping.

What kinds of information are lost? Let’s assume that all columns of the MSA are included in the 1D profile (if not, that would immediately constitute a loss of information). While the profile may keep residue and gap frequencies for each column, it has no information on which sequence a given letter comes from. This makes scoring of gaps / insertions especially problematic. Consider the simple MSA below. The profile will contain all columns from all sequences in the MSA, even though it is likely that no sequence in the MSA (or nature) actually contains all these columns. Thus, any sequence not containing all these inserts will now be charged artifactual gap penalties even if its “gaps” are exactly the same as those in one or more sequences in the alignment. Conversely, the only sequence that won’t be charged a gap penalty is the completely artificial sequence constructed by pasting together all columns from all sequences. Alignment of multidomain proteins, for example, would give rise to a gap of hundreds of residues wherever an extra domain was present or missing in one or more sequences, resulting in very large gap penalties. Although one may use position-specific gap penalties to try to diminish this problem (for example, reducing the gap penalty at a position if many sequences are gapped there), this just reveals more problems. For example, if only one sequence in the profile contained an extra domain, we might reduce the gap penalty greatly for those positions, to allow for sequences that lack this domain. However, very low gap penalties are extremely problematic because they cause an exponential increase in the number of higher scoring alignments, even for alignment of a random sequence. And if a new sequence happened to align to this extra domain, it would be wrong to use an extremely low gap penalty for its insertions / deletions within that domain. A 1D profile has no way to correct for this, because it does not know that these positions are all from the same source sequence.

.....ACATGTCGAT.....AGGTG

TGCAC.....TCGATACATAAGGTG

* ^

Furthermore, consider the two columns marked above. Technically, there is a gap in both columns. One position (^) is a true gap, i.e. a position that one sequence is missing, bracketed on both sides by positions where both sequences align. However, the other position (*) is not actually a gap in the alignment, because both sequences do not even begin to align until several residues further to the right. Aligning the sequence S = TGCACTCGAT… to a profile of this MSA would be charged a 5 residue gap penalty because S lacks ACATG. This gap penalty is purely an artifact of the one dimensional representation: it vanishes if we reorder the profile by swapping the residues 1-5 with residues 6-10. Now TGCACTCGAT… matches without a gap. This reordering of the consensus does not change the content of the original MSA, since the first five residues of both sequences were not aligned at all.

This reveals that there is tremendous degeneracy in the representation of gaps / insertions within this standard MSA format, which we will refer to as the tabular row-column MSA format (RC-MSA). For example, there are possible orderings of the first 10 columns of this MSA that differ only in the order in which these unaligned residues are given. These different RC-MSAs are all equivalent to the original alignment, but will each give rise to different gap penalties. There is no “right choice” of which to use; all of these gap penalties are artifactual.

We can develop a new approach to multiple sequence alignment based on these considerations. We desire a new MSA representation that eliminates both the degeneracy of the RC-MSA (which causes scoring artifacts), and the degeneracy of the 1D profile (which causes information loss). To be useful for progressive alignment, this MSA representation 1) should itself be alignable by pairwise dynamic programming alignment, just as a 1D sequence or profile can be; 2) should not lose any information from the MSA, nor introduce any degeneracy. This raises the interesting question, what is the real content of an MSA? We consider this to be just two pieces of information: what sequence positions are aligned to each other; and the ordering of these positions within the sequences themselves. We have constructed a data structure that purifies the idea of an MSA down to just these two properties, that satisfies all of the above criteria, and therefore has very interesting benefits for progressive MSA.

ALGORITHM

Partial Order Multiple Sequence Alignment (PO-MSA) Data Structure: Consider the simple alignment of two sequences depicted in RC-MSA format in Fig. 1a. We can instead represent it as a partially ordered graph in which individual sequence letters are represented by nodes, and directed edges are drawn between consecutive letters in each sequence (Fig. 1d). In this PO-MSA format, a single sequence is simply a linear series of nodes each connected by a single incoming edge and a single outgoing edge (Fig. 1b). Reconstructing the RC-MSA in Fig. 1a as a PO-MSA requires two steps. First, the two aligned sequences are redrawn in PO-MSA format, with dashed circles indicating aligned letters (Fig. 1c). Second, the letters that are aligned and identical are fused as a single node, while the letters that are aligned but not identical are represented as separate nodes that are recorded as being aligned to each other (Fig. 1d). When letters are fused as one node, the resulting node stores information about all of the individual sequence letters from which it was derived, specifically the ID of the original sequence(s) and the index of the letter’s position within that sequence. Thus it is possible to trace the path of each individual sequence through the PO-MSA. The PO-MSA format contains all the information of a traditional MSA, but represents it as a graph compacted for minimal node and edge counts. Redundant edges are removed, i.e. a given pair of nodes will be connected by at most one edge. A node may have any number of incoming or outgoing edges. These edges simply indicate the preceding or subsequent letters of every sequence that passes through that node.