The Florida State University

THE FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

DEVELOPING A BIOINFORMATICS UTILITY BELT TO ELIMINATE SEARCH REDUNDANCY FROM THE EVER-GROWING DATABASES

MISHA TAYLOR

A Thesis submitted to the

Department of Computer Science

in partial fulfillment of the

requirements for the degree of

Master of Science

Degree Awarded:

Spring Semester, 2003

Misha Taylor

The members of the Committee approve the thesis of Misha Taylor defended on April 1, 2003.

______

Robert van Engelen

Professor Directing Thesis

______

David Swofford

Committee Member

______

Theodore Baker

Committee Member

______

Steven M. Thompson

Committee Member

The Office of Graduate Studies has verified and approved the above named committee members

This thesis is dedicated to my dear friend Jane Sinagub, without whose support this work could not have been produced. We’ve been through some tough times, but I’m glad that we were able to make it through the years together as friends. I love you dearly Jane, and I’m happy that you decided to let me play a bit part in your life.

To Paul, “Brother” Max, and “Ma Petite” Flora, you’re as much as my family as Jane and I love you just as much.

Also, thanks to Deena Westbrook for keeping me company after bioinformatics class and entertained with chitchat while writing my thesis. It made the process a lot easier and more fun than it should have been.

ACKNOWLEDGMENTS

Steven Thompson taught me more about bioinformatics in six months than I could have learned on my own in six years. I am still amazed that he was willing to take on a computer science student with absolutely no background in biology whatsoever. He took me out to dinner and slowly introduced me to the field of bioinformatics. All this without me being enrolled in his classes or being one of his students, but just wanting to know more about his world. Steve is one of the most extraordinary teachers I’ve ever met. I also appreciate his patience in re-explaining biological concepts to me two or three times as it was very difficult for me to start acquiring a “biological mindset”.

Thanks to Theresa Thompson for helping me understand the “tao of Steve”, when things were getting difficult at one point, and also getting to the bottom of the mystery of the pink pants. Her advice on literature and politics was also very helpful and insightful.

Robert van Engelen was my major professor throughout this process. I appreciate his guidance and advice, and his willingness to work with an interdisciplinary project. I also would like to thank my original advisor, Ernest McDuffie, who sadly left The Florida State University before I was able to complete my original thesis.

I also appreciate having taken a semester of Dave Swofford’s bioinformatics class while completing this thesis. I think it really helped to give me a solid grounding in some of the fundamental algorithms. He is a very knowledgeable and accomplished instructor.

TABLE OF CONTENTS

List of Tables vii

List of Figures viii

Abstract ix

INTRODUCTION 1

BIOLOGICAL DATABASES 3

Biological Databases are massive 4

Biological data can be noisy and redundant, with unclear features 5

Another Artificial Source of Redundancy 6

A REAL-WORLD PROBLEM 7

Process of the human expert 8

Sorting out redundancies 8

Taxonomy Analysis 10

SIMILARITY SEARCHES 12

BLAST Output 12

Part 1 – Overview of the Query 12

Part 2 – Descriptions of each significant alignment. 13

Part 3 – Pairwise Alignments 13

Part 4 – Statistical Summary 14

Evolution 15

Three fundamental algorithm types 18

Dot plots 18

The need for quantitative methods 21

Pairwise sequence alignment 24

Global Alignment 28

Local Alignments 32

Multiple sequence alignment 32

GCG WISCONSIN PACKAGE 34

MUMS AND SUFFIX TREES 41

MUMs 42

Longest Increasing Subsubsequence 47

CONSENSUS ALGORITHMS 48

Clustering in MUMmer 2.1 49

The Union-Find Algorithm 49

IMPLEMENTATION RECOMMENTATIONS: C++ OR JAVA? 52

NINJA 53

The Cost of Being Object-Oriented 54

Further adventures of the Rice Research Team 55

CONCLUSION 57

EPILOGUE 61

REFERENCES 63

BIOGRAPHICAL SKETCH 67

List of Tables

Table 1 - Dot plot example 18

Table 2 - The 4 DNA Nucleotide Sequences and Their Official Codes 23

Table 3 - The 20 Amino Acids and Their Official Codes 23

List of Figures

Figure 1 - Growth of GenBank 5

Figure 2 - Align ESTs to cDNAs 9

Figure 3 - Example of a sequence region which contains a gene model 10

Figure 4 - BLAST Output, Part 1 - Overview of the Query 12

Figure 5 - BLAST Output, Part 2 - Description of each significant alignment 13

Figure 6 - BLAST Output, Part 3 – Pairwise Alignments 14

Figure 7 - BLAST Output, Part 4 - Statistical Summary 15

Figure 8 - Origin of genes having a similar sequence[1] 16

Figure 9 - Dot plot window example 1 19

Figure 10 - Dot plot window example 2 20

Figure 11 - Dot plot of two paralogues 21

Figure 12 - Key observation for dynamic programming 26

Figure 13 - Dynamic programming always looks back to previous diagonal 27

Figure 14 - Recurrence relation 28

Figure 15 - Global alignment, part 1 29

Figure 16 - Global alignment, part 2 30

Figure 17 - Global alignment, part 3 30

Figure 18 - Global alignment, part 4 31

Figure 19 - Global alignment, part 5 31

Figure 20 - Suffix tree 44

Figure 21 - Aligning Genome A and Genome B after locating MUMs 45

Figure 22 - Crossing anchors 46

Figure 23 - Aligned anchors 46

Figure 24 - Clustering two-dimensional data 48

Abstract

Biological databases are growing at an exponential rate. Designing algorithms to deal with the inherent redundancy in these databases which can cope with the overwhelming amount of data returned from similarity searches is an active area of research.

This paper presents an overview of a real-world problem related to biological database searching, outlining how a human expert solves this problem. Then, several bioinformatics approaches are presented from the literature, forming a “utility belt” which might be used to solve the problem computationally.

INTRODUCTION

Bioinformatics tools are basically a set of computer algorithms and procedures for analyzing and sifting through biological data[1]. The field of bioinformatics has been around for about 20 or 30 years. Biologists, realizing just how valuable a tool computers were, started applying computer technology to their scientific endeavors just as soon as computers were available to the general public. According to Stanford Professor, Mark A. Musen[2], it is only the term bioinformatics that is relatively new, not the actual field itself.

The bio-prefix of the word should be fairly straightforward to understand. Bioinformatics deals with biology and biological data.

It is the –informatics suffix that is more interesting. The word informatics is derived from the French word, informatique, which can be translated into English as “data processing”[3].

Why is an uncommon word like informatics used in the term bioinformatics? Mark Musen[2] claims that it is in the European tradition to incorporate elements of data processing into the fundamental computer science curriculum, whereas in the United States it is not. Instead, if a student wishes to pursue a serious study of “information, its structure, its acquisition, and its use,” that student historically enrolled in a business program, or more recently, an information sciences program. Apparently in Europe, it is more common that the equivalent of information sciences is integrated into the computer science curriculum and not as a separate field of study or departmental entity.

A key difference between the informatics perspective and the traditional computer science perspective is focus. In informatics, the information and knowledge is of central focus, whereas in computer science, the focus is mostly on algorithms and writing programs[2]. From an informatics perspective “you can’t do anything without the data – knowledge is power,” whereas from a computer science perspective one might say “you can’t do anything without a program – after all, it is the code actually performing the action.” Neither statement is invalid, depending on your perspective.

In fact, bioinformatics incorporates both data and algorithms. The term used for the field is a reminder that the biology and the data are paramount.

That is a brief explanation of the word bioinformatics and the etymology of the term. As mentioned earlier, it is a fairly new term for the field.

So, our task, should we choose to accept it, is to do the research necessary in order to prepare for the development of a bioinformatics utility belt. This virtual bioinformatics utility belt, explained herein in this document, will help us tackle the information overload from the ever-growing biological databases.

Biological databases are filled with redundant sequence information for various reasons, which will be covered in more detail in the next chapter. When searches are performed against these databases, the result sets also contain redundant “hits”. In this document, it is our goal to explore the nature of this redundancy problem more fully, putting together a list of possible solutions and approaches to address this issue – the utility belt. We won’t be able to offer a solution to the redundancy problem here, but rather it is our goal to offer a starting point for a potential solution.

CHAPTER 1

BIOLOGICAL DATABASES

While one can argue that even in the Internet age, most new biological information is still published as text in the form of journal articles, papers and annotations[4] – nothing is going to keep the academic engine away from the “publish or perish” mantra anytime soon – there has been an explosion in the amount of biological data “published” to computer databases over the past 20 years. Researchers routinely publish their biological findings to Internet databases such as GenBank, SWISS-PROT, Pfam, and SMART.

GenBank is one of the largest and oldest biological databases. It contains all publicly available DNA sequences[5]. Obviously, DNA sequences are important to biologists because DNA sequences contain the unique instructions that indicate how to create a particular organism. DNA consists of varying sequences (typically very long sequences of thousands or millions of molecules) out of 4 possible molecules called “bases” or “nucleotides”[6].

SWISS-PROT is another one of the older biological databases. It contains “annotated protein sequences,” including shape and sequence information. Proteins are important because they are responsible for performing most of the functions of life. Information about the three-dimensional (3D) shape of a protein is important, because shape determines the function of a protein within an organism. Proteins are sequences of 20 possible amino acids. As far as biologists know, proteins with identical sequences fold into the same three-dimensional shape.

The Pfam and SMART databases are newer, smaller, more specialized protein databases, classifying and categorizing proteins into families and domains. They contain much more detailed information, complementing the SWISS-PROT protein database.

GenBank, SWISS-PROT, Pfram, and SMART are just a few selected examples of popular biological databases. There are many others available that are also widely used. Biologists rarely perform research without referring to these biological databases. Even if results are published in the form of a report or paper, most journals require that researchers post their findings to at least one of these databases. In addition, most of these database efforts are funded by government-sponsored grants, so many of these databases are also publicly-accessible, including all of the databases mentioned above[5].

Biological Databases are massive

One characteristic of these biological databases is that they are massive. As of August 2002, there are approximately 18,197,000 sequence records in GenBank[7]. Release 40.44 of SWISS-PROT contains 122,214 annotated protein sequence records comprising 44,864,044 amino acids on February 22, 2003[8]. Pfam contains 3,735 alignments and 3-D models for protein families, and the SMART databases contains 500 domains encompassing more than 54,000 proteins[5].

Also, biological databases are rapidly growing. For example, GenBank, one of the oldest and largest biological databases, is doubling is size every 15 months[5]. The following diagram shows the almost exponential growth rate of GenBank from 1982 to 2002. The number of bases has been plotted against the release date[9].

Figure 1 - Growth of GenBank

These databases are enormous. Any researcher wanting to search through this data would have to scan through tens of thousands to millions of records to find data relevant to their research. At times, this can seem like looking for a needle in a haystack.

Biological data can be noisy and redundant, with unclear features

Another characteristic of biological data is that it can be fuzzy. Any quantitative measure of a biological system is merely an approximation. Biological researchers use a wide variety of procedures to collect their data, and the results are typically an “average value of several independent experiments”[10].

In addition to being fuzzy, biological data also tends to have a lot of noise in it. While some of this noise is due to experimental error, this noise can also be produced by nature itself in the form of mutations. As an organism evolves over time, physical changes get expressed in its genes and chromosomes. A gene is a specific sequence of DNA that produces a protein. A chromosome is a collection of genes and is the basic unit of how traits get passed from parents to children[11]. Two genes are matched based on their sequence similarity, not on the exact sequence, given that one must account for evolution in an organism, and the overall “fuzziness” in this whole process.

Another issue that contributes to noise in the data is that biologists do not yet understand the relevance of many components in DNA sequences and protein sequences. For example, certain sections in a DNA sequence code for certain proteins – these are called exons. Other components do not seem to affect an organism’s protein-making machinery in any way that biologists can understand. These sections are known as “spacer DNA” or “noncoding DNA”. Spacer DNA within a gene are also known as introns[12]. While looking for a sequence match, one must take into account only the exons (the coding regions) in a sequence. In humans, only 3% of a genetic sequence may consist of exons. As with mutations, this means that sequence matching involves similarity and inexact matching techniques rather than exact techniques[6].