Education and Computational Biology:

Incremental K-Leaf Powers, Cliques, and Phylogeny Reconstruction

Dean L. Zeller

Department of Computer Science

KentStateUniversity

Spring, 2006



Abstract – The following is an investigation in areas of phylogenetic reconstruction for the purpose of encouraging education of algorithm analysis. Discussed within is an alternate evolution model named the incremental k-leaf power, serving as an extension of the k-leaf power in regards to phylogenetic reconstruction. Class assignments requiring little prior knowledge of evolution trees have been created implementing these concepts. At this point, the discussion is purely theoretical in nature and not based on any existing genetic tests. The full text for the class assignments is available from the author.

Index terms – graph theory, computational biology, evolutionary tree, phylogeny reconstruction

…the great Tree of Life fills with its dead and broken branches the crust of the earth, and covers the surface with its ever-branching and beautiful ramifications.

Charles Darwin (1809-1882)

Father of Evolution

I. Introduction

Phylogenetic analysis is an important tool in evolutionary biology and bioinformatics. Reconstruction of phylogenetic trees and their subsequent analyses provide important insights into the evolutionary history of organisms, as well as provide clues about mechanisms governing evolutionary changes in genes and genomes. Generally, phylogenetic relationships among taxa are represented in a bifurcating tree format, with or without root, where terminal nodes represent extant taxonomic units and interior nodes represent inferred ancestral units. These nodes are connected via internal and external branches. A node is bifurcating when it has only two immediate descendant lineages, and multifurcating when it has more than two immediate descendants. It should be noted that a bifurcating representation is often insufficient to represent all aspects of evolution. Horizontal gene transfer and hybrid speciation are evolutionary occurrences do not follow a bifurcating pattern. There are other ways to represent these relationships such as phylogenetic networks.

Computational biology is a field commonly regarded as complicated. Some problems require a great deal of previous knowledge and experience to comprehend. But there are those problems in bioinformatics easy enough to analyze without requiring an advanced degree in biology, mathematics, or computer science. This paper dwells into developing lessons in computational biology that are easy to understand at all educational levels. These assignments can be used as part of a curriculum for the next generation of computer scientists.

Section II gives background in evolution trees, k-leaf powers, and cliques. Section III discusses the class assignments that have been tested with favorable results on introductory-level computer science students.

II. Background

A.Evolution trees

An evolutionary tree (or phylogeny) is a tree structure that represents the genetic relationships among species. Each leaf represents one species, while internal nodes represent the common ancestors between leaf species. As evolution can be represented using a tree model, it is a topic of particular interest to both biologists and computer scientists. Advancements in theory for one area can easily apply to the other. An evolutionary tree is a common method to organize inheritance within object oriented program design (OOD). It has been the topic of hundreds of papers published each year. As technology has advanced, man’s ability to build and represent evolutionary trees has increased in complexity. Computer programs are able to store and analyze evolutionary data faster than before. It is with technological and algorithmic advancements that geneticists were able to create the GENOME project, a complete genetic mapping of the human body. The importance of further development in this area is staggering. It has become a controversial topic among doctors, philosophers, and even religions. Figure 1 shows an example of an evolutionary tree using existing species.

Figure 1: Evolution Tree example [3]

One method for solving computational biology problems is maximum-parsimony. This method uses character-based tree-building algorithms to find the evolutionary tree that requires the fewest changes to explain evolution of the observed species. While this method is interesting and useful, it will be the topic of a future paper. A second method for solving computational biology problems is distance-based. With a distance matrix, one can represent dissimilarity of species. Using M as an nn diagonally symmetric matrix, M[i,j] is a measure of dissimilarity between species i and j.

B.K-Leaf Powers and K-Leaf Roots

As previously stated, a bifurcating tree is an incomplete model of evolution. An alternative model of evolution is the kleaf power graph and corresponding kleaf root. A tree T is converted into a kleaf power Gk for a specified kvalue by connecting an edge (u,v) in Gk for any pair of vertices not more than a specified kvalue. In mathematical terms, dT(x,y)k(x,y)Gk.

Figure 2: k-leaf power and k-leaf root

The problem of converting a given tree into a kleaf power is trivial. A brute-force algorithm simply calculates the distance between every vertex pair within T and assigns the edges in Gk accordingly. The interesting problem of research is to go the other direction: given a kleaf power Gk, generate the tree structure, called the kleaf root. This problem is far more complex, as not all trees have a corresponding k-leaf power. Brandstädt and Le wrote linear time solutions for k=3 [1] and k=4 [2], but k5 remains an open problem. It should be noted that these algorithms are sufficiently complex in nature to require advanced knowledge of algorithm analysis. Figure 2 shows a simple example of the conversion of a k-leaf power into a k-leaf root. The class assignments described below are a deeper look into the relationship between kleaf powers and kleaf roots through an incremental k-leaf power evolution model.

C.Cliques

A clique is defined as a subset of mutually adjacent vertices. It is observed that as a kleaf power builds incrementally for a given tree, new members to cliques are introduced. It is useful to indicate the cliques in the kleaf power diagrams. The incremental formation of cliques give information on the structure of the kleaf root. Figure 3 shows examples of cliques of up to six vertices.

Figure 3: Cliques

III. Class Assignments

The basis for this portion of the paper was inspired by a book named An Atlas of Graphs [5]. In 1999, Read and Wilson released this reference of every possible topology for a variety of graphs, including trees, binary trees, outerplanar graphs, polyhedron graphic representations, and many others. At first glance, it seemed to be a simple collection of graphs, along with some analysis. However, the message portrayed by the pages and pages of morphing graphs was far deeper and more complex than a basic reference. Studying the method used to create every possible model topology can teach volumes on the structure and nature of the graph type specified. It was a simple, yet elegant method of portraying information, as if the graphs themselves represented the words of a paragraph. If a picture is worth a thousand words, imagine the amount of information contained within a series of sequential graphs, all designed to show an idea. Assignments 1 and 2 mimic the format of Atlas to create a similar reference of k-leaf powers and k-leaf roots.

A.Assignment 1 – Evolution Trees

A few basic assumptions will greatly reduce the problem complexity, allowing for greater in-depth analysis. In a bifurcating tree it is assumed that the only possible point of evolution is a parent with two children. A single child (also called a redundant node) can be removed without losing any evolutionary data. It is given that the species is continually changing over time through evolution and mutation, and thus a single node in between two nodes does not add to the information portrayed in the phylogeny. It is also assumed a population will not split into more than two species. While there are cases in evolution where this happens, it greatly adds to the complexity of the problem. Multifurcating tree sections are replaced with isomorphic approximations. Unlike the first assumption, there is some loss of detail.

Figure 4: Assumptions

The third assumption is to consider only isomorphically unique trees. This analysis is currently at the theoretical level only, and thus the labels are abstract. As such, isomorphic trees are considered equal, and only one needs analysis. These assumptions greatly limit the number of tree patterns that can be created through reconstruction. Figure 4 shows these assumptions graphically. Assignment 1creates displays on all evolution trees of up to seven leaves, following the assumptions described above.

Example 1

Phylogeny T is converted in the k-leaf powers for k-values of 2 through 8.

B.Assignment 2 – Incremental K-leaf powers

Reconstruction of a phylogeny from a kleaf power requires connected vertices to form a clique. For k=2, cliques contain only two vertices. For k=3, cliques can have up to three vertices, forming a triangle structure. As the k-value increases, so does the maximum possible size of the clique. One key to phylogeny reconstruction is to identify the cliques.

In Example 1, a single phylogeny is converted incrementally into various k-leaf powers for k-values of 2 through 8, resulting in graphs G2 through G8. In G2 the edges are connected only for those cases where the vertices have the same parent. G3 has more edges than in G2, identifying close relations, but it is still not a connected graph. At G4, all leaf vertices from the phylogeny are connected. Note the left and right “clusters” of edges, and the similarity to the structure of T. G5 is similar to G4, with one additional edge. G6 is a fuller graph than G5, and G7 is a nearly complete graph. Since no two leaves in T are more than eight separations apart, G8 is a complete graph where all vertices form a clique. In this case, it appears that G4 has the most structural information useful to a geneticist. Observe the cliques described in Example 1. For k=2, the cliques are {ab} and {ef}. At k=3, the {ab} clique obtains c as a new member, and a similar situation with {ef} acquiring d. This process is repeated each time the k-value increases. Assignment 2 creates the incremental k-leaf powers for each tree generated in assignment 1.

C.Assignment 3 – Phylogeny Reconstruction

Conceivably, the genetic test could contain more accurate information than just a 1 or a 0, resulting in discrete values as categories of closeness. The k-leaf power reconstruction algorithms in [1] and [2] do not make any assumptions on how closely related two species are, only that they are genetically compatible to a certain degree. With a discrete test, the closeness value could be used to more quickly and accurately reconstruct the phylogeny. By connecting those species registering closer according to the test, complex algorithms may not be necessary for reconstruction. Example 2 demonstrates how results from a discrete-value test on all distinct pairs of six species collected in the summary table can be used to reconstruct a phylogeny incrementally. This is a contrived example that will convert into a phylogeny; there are many examples of data that do not convert as easily. Once the data is collected, the tree is built up from the lowest values first. When all species in the tree are connected, the tree is complete and no further analysis is necessary. Assignment 3 reconstructs a phylogeny from a variety of data tables using the incremental k-leaf power approach.

Conclusions

It is not enough to simply learn and discuss the methods other researchers have done in the past. Publishable results must be found to further develop work in this area. Few subjects have greater need to be explored than evolutionary trees. Biologists use evolutionary trees to represent the history of life on Earth, and as more genomic data become available, we need better and more efficient methods to do phylogenetic analyses. The evolutionary tree problems discussed in this paper need refinement, clarity, and further development. New methods of analyzing genetic data must be discovered. The field of computer science is also evolving. Fast, reliable biological computers will replace the slower electronic models. It is our job as educators to prepare the next generation of computer programmers.

Example 2

Input: Results from discrete similarity experiments

Output:Incremental k-leaf powers and k-leaf roots for values of 2 through 6.

Diff(a,b)  2Diff(c,d)  4

Diff(a,c)  3Diff(c,e)  5

Diff(a,d)  5Diff(c,f)  5

Diff(a,e)  6Diff(d,e)  3

Diff(a,f)  6Diff(d,f)  3

Diff(b,c)  3Diff(e,f)  2

Diff(b,d)  5

Diff(b,e)  6

Diff(b,f)  6

Difference Summary Table
a / b / c / d / e / f
a / 2 / 3 / 5 / 6 / 6
b / 3 / 5 / 6 / 6
c / 4 / 5 / 5
d / 3 / 3
e / 2
f

Acknowledgements

The author would like to appreciate the effort portrayed by the following CS10051 students for successfully completing all three parts of the class assignments described in this paper: Lindsey Braun, Rachel Holloman, Ryan Lundin, Daniel Rubin, Kelly Schobinger, and Thomas Zgonc.

References

[1]Brandstädt, A. and V.B. Le, Structure and Linear Time Recognition of 3-Leaf Powers, Information Processing Letters (98), 133-138, 2006.

[2]Brandstädt, A., V.B. Le, and R. Sritharan, Structure and Linear Time Recognition of 4-Leaf Powers, submitted for publication, 2006.

[3]Felsenstein, J, Inferring Phylogenies, Sinauer Associates, Inc., 2004.

[4]Gusfield, D, Algorithms on Strings, Trees, and Sequences, CambridgeUniversity Press, 1997.

[5]Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999.

[6]Wu, B.Y. and K.M. Chao, Spanning Trees and Optimization Problems, Chapman & Hall/CRC, 2004.

Dean L. Zeller is in the computer science Ph.D. program at KentStateUniversity. He achieved his Masters of Science in computer science from Bowling GreenStateUniversity in 1996, as well as his Bachelors of Science in mathematics and computer science in 1992.

He has ten years of teaching experience in the education industry. Eight of the ten years have been in higher education environments, teaching classes introducing new students to computers, mid-level programming classes such as data structures and computer architecture, and up to advanced courses on compiler design, artificial intelligence, and computability theory. He also has two years of experience teaching to high- and middle-school students. He has published an article on the Minimization statistical algorithm in 1997 in the Journal of Nursing Research, teaming up with the Department of Nursing at CaseWestern ReserveUniversity. His research interests include: graph theory, statistical algorithms, user interface design, and computer science education.

Manuscript received May 11th, 2006. This work is for presentation at the OCCBIO ’06 converence at OhioUniversity (June 28-30, 2006).

D. L. Zeller is in the computer science Ph.D. program at Kent State University, Kent, OH. (phone: 330-343-2877, email: )