Having a Blast: Analyzing Gene Sequence Data with BlastQuest

Where Do We Go from Here?

Abstract

In this paper, we pursue two main goals. First, we describe a new tool, called BlastQuest, for managing BLAST query results. BlastQuest provides interactive, Web-enabled query, analysis, and visualization facilities beyond what current BLAST interfaces offer. Specifically, the BLAST results, which are in XML format, are extracted, structured, and stored persistently in a relational database. The database supports a series of built-in analysis operations that can be used to select, filter, and order data from multiple BLAST results efficiently and without referring to the original result files. In addition, users have the option to interact with the BLAST data through a mask-oriented, non-SQL query interface.

Despite BlastQuest’s recognized benefits for biologists, its functionality is limited in several important ways. The second goal of this paper is to analyze these shortcomings and describe a new concept based on two main pillars: (1) a Genomics Algebra, which provides an extensible set of high-level genomic data types (GDTs) together with a comprehensive collection of appropriate genomic functions, and (2) a Unifying Database, which allows us to integrate and manage the semi-structured contents of publicly available genomic repositories and to transfer these data into GDT values.

1.Introduction

Biologists today face two main problems: the exponentially growing volume of biological data, which is highly varied, heterogeneous, and semi-structured, and the increasing complexity of biological applications and methods, compounded by an inherent lack of biological knowledge. As a result, many important challenges in biology and genomics have become challenges in computing, especially in advanced information management and algorithm design.

BLAST (Basic Local Alignment Search Tool) [1] is currently the most widely used and accepted tool for conducting similarity searches on gene sequences. BLAST comprises a set of similarity search programs that employ heuristic algorithms and techniques to detect relationships between gene sequences and rank the computed ‘hits’ statistically. An essential problem for the biologist is the processing and evaluation of BLAST query results, since a BLAST search yields its result exclusively in a textual format (e.g., ASCII, HTML, XML). This format has the benefit of being application-neutral but at the same time impedes direct analysis of the result. In this paper, we describe a powerful new tool, called BlastQuest, for managing BLAST results stemming from multiple individual queries. This tool provides the biologist with interactive, Web-enabled query, analysis, and visualization facilities beyond what current BLAST interfaces offer. In particular, BLAST results from multiple queries are imported, structured, and stored in a relational database to support a series of built-in analysis operations that can be used to select, filter, group, and order these data efficiently and without referring to the original BLAST result files. In addition, users have the option to interact with the data through a user-tailored, screen-mask-oriented, non-SQL query interface, which internally relies on a well-defined subset of SQL that remains hidden from the user.

Section 2 elaborates on the main current challenges in genomics and emphasizes the need for tools capable of processing BLAST results. In Section 3, we describe our BlastQuest system from the system architecture and user interface perspectives. Section 4 describes desired improvements to BlastQuest and explains why new, sophisticated concepts, tools, and non-standard database technology, which together should lead us far beyond BLAST technology, are indispensable for advancing biological and genomic research. Finally, Section 5 draws some conclusions.

2.The Challenge of Genomics and Its Effect on Computer Science

Genomics is a biological discipline focused on understanding living organisms at the level of the whole genome. It goes beyond a gene-by-gene approach and instead takes a global view of the complete genetic system. Genomic scientists examine the full catalog of genes, the processes that control them, gene inter-relationships and inter-dependencies, and how the organism responds to changes in environment through the expression of genetic information. In order to illustrate the challenges faced by scientists in this field, we first review the most important concepts underlying gene sequencing.

2.1.Gene Sequencing

DNA is an information storage macromolecule that encodes all of the heritable information passed from generation to generation of living organisms. In biological systems, genetic information flows from DNA (genes) to proteins, which are the molecules responsible for mediating or catalyzing biological processes. In other words, inherited information is selectively converted into active biomolecules in response to changing environmental conditions or demands. The molecular information pathway from gene to protein goes through an intermediate class of molecules known as messenger RNA (mRNA). The synthesis of mRNA is known as transcription, and the conversion of mRNA into protein is known as translation. Both transcription and translation are important regulatory steps used to control which genetic information is expressed, and when and where protein molecules will be made by the cell. The constellation of mRNA molecules in a cell at any moment represents the expressed genome, also referred to as the transcriptome. Identifying all the genes present in the transcriptome allows us to infer the proteins being utilized by the cell (also known as the proteome) and essentially defines the current biochemical processes of the cell. While characterizing the global cellular proteome would be most direct and informative, this is not possible using currently available technology. Instead, genomics scientists use high-throughput DNA sequencing to characterize the genome and the transcriptome. Genome sequencing involves determining the nucleotide sequence of extensive chromosomal regions or, in some cases, the complete nucleotide sequence of the whole genome. Characterization of the transcriptome, on the other hand, involves full or partial sequence characterization of mRNA molecules. Partial sequences of mRNA molecules are known as Expressed Sequence Tags (ESTs).

While the process of DNA sequencing is routine, nucleotide sequences do not directly reveal their biological meaning or function. The possible biological function of a gene sequence must be determined either through direct empirical experimentation or, more often, through inference of gene function using nucleotide sequence homology searches of gene databases such as GenBank [5].

2.2.Gene Homology Searches

Gene homology searches most often use the BLAST algorithm [1]. The BLAST search engine takes a query nucleotide sequence and searches the database for entries matching the query. The BLAST algorithm calculates statistical scores (bit scores and e-values) that make real sequence homology matches easier to distinguish from matches that might happen by chance. The BLAST result also includes a short text string summarizing the biological properties of the database match, and several unique identification numbers, the GI Number (unique ID for GenBank records) and the Accession Number, which link the matched sequence back to the GenBank database and to additional information stored in the full database record. Each nucleotide query sequence submitted to the BLAST search engine returns anywhere from zero (no matching homologous sequence) to hundreds of matching database records. Results of BLAST searches are usually interpreted by reviewing the text output.
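
To make the relationship between the two statistics concrete, the following minimal sketch uses the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the lengths of the query and the database. The function name and the example lengths are illustrative assumptions; actual BLAST programs apply effective-length corrections, so the e-values they report will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Uses E = m * n * 2**(-S') and ignores BLAST's effective-length
    corrections (a simplifying assumption made for this sketch).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Each additional bit of score halves the expected number of chance hits.
for score in (30, 40, 50):
    print(score, expected_hits(score, query_len=500, db_len=10**9))
```

A small e-value therefore corresponds to a high bit score relative to the size of the search space, which is why the same bit score is less convincing against a larger database.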

However, large-scale genomics projects often generate tens of thousands of nucleotide sequences and the prospect of manually manipulating, summarizing, and interpreting the thousands of BLAST output files is impractical at best. Scientists facing this informatics challenge may become discouraged or might overlook important information because they simply cannot find it. Clearly, methods or tools are needed to help manage the process of identifying and evaluating unknown nucleotide sequences and the sometimes-overwhelming information obtained in large-scale nucleotide sequence homology searches.

2.3.BlastQuest as an Answer to Tool Requirements from a Biologist’s Perspective

Genomics requires an information technology infrastructure on a scale previously unheard of and specifically adapted to the unique data collection and analysis demands of biomedical science. The BlastQuest system we describe demonstrates our current approach to management and visualization of genomics information. It is by no means a complete biological data management solution, but our first attempt to develop a prototype tool that can help us manage BLAST results through well-established relational database principles. We are using BlastQuest to test new functionalities and evaluate the strengths and limitations of relational databases as support tools for genomics research. Most important, we believe BlastQuest will lead us to a new integrating data model, language, and tool for processing and querying genomic information enabling scientists to synthesize biological insights through transparent access to genomics information. We have more to say about these planned improvements in Section 4.

The BlastQuest Project began with several modest goals:

A BLAST results viewing tool accessible to research groups at remote locations. Users should have access to their BLAST results from anywhere on the Web including the ability to share results with colleagues in other locations.
Selective browsing of BLAST homology search results. As a first step, biologists want a broad overview of the possible biological functions of the many gene sequences represented in their DNA sequence data. The ability to reduce and summarize BLAST data to only the most significant results is initially very informative.
Search capability on a variety of criteria, such as text terms on biological properties or gene functions. As biological scientists identify their most interesting gene sequences they need a way to focus and retrieve only those search results related to the precise topic of interest.
Selective data filtering on various BLAST statistical criteria such as e-value or bit score. These statistical parameters help discriminate between real sequence homology matches and matches that might happen by chance. There are no hard limits to the significance of these statistical parameters. The user will choose parameters giving either a more relaxed or restricted view as needed.
Selective data grouping on criteria such as GI number, or a defined number of top-scoring results. For example, viewing the three statistically best-scoring results for each query sequence is a convenient way to summarize and browse BLAST results for many query sequences. Grouping query sequences by GI number collects all of the query sequences having sequence homology matches with the same sequences from the database. Two or more query sequences sharing the same database homology match imply the query sequences are related to each other and suggest additional analysis of the relationship is warranted.
Privacy-constrained sharing of results among scientists. DNA sequence data is often proprietary and may constitute intellectual property. Such data should not be made public until properly protected.
A convenient interface for getting queries into and BLAST results out of the system. The interface must be attractive and logically implemented so users will be able to find and use the tools the system provides.
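
The filtering, grouping, and text-search goals above can be illustrated with a minimal in-memory sketch. This is not the BlastQuest implementation (which runs these operations in the relational database); the record fields, function names, and sample values are hypothetical and chosen only to show what each operation computes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HitRecord:
    query: str        # name of the query sequence (hypothetical field)
    gi: int           # GI number of the matched database record
    definition: str   # short text describing the match
    evalue: float
    bit_score: float


def filter_by_evalue(hits, max_evalue):
    """Keep only hits whose e-value is at or below the chosen cutoff."""
    return [h for h in hits if h.evalue <= max_evalue]


def top_n_per_query(hits, n):
    """Return the n best-scoring hits for each query sequence."""
    by_query = defaultdict(list)
    for h in hits:
        by_query[h.query].append(h)
    return {
        q: sorted(group, key=lambda h: h.bit_score, reverse=True)[:n]
        for q, group in by_query.items()
    }


def queries_by_gi(hits):
    """Map each GI number to the set of query sequences that matched it."""
    grouped = defaultdict(set)
    for h in hits:
        grouped[h.gi].add(h.query)
    return dict(grouped)


def search_definition(hits, term):
    """Case-insensitive substring search on the hit description."""
    return [h for h in hits if term.lower() in h.definition.lower()]


# Invented sample data, for illustration only.
hits = [
    HitRecord("seq1", 101, "heat shock protein 70", 1e-50, 200.0),
    HitRecord("seq1", 102, "hypothetical protein", 1e-5, 45.0),
    HitRecord("seq1", 103, "chaperone DnaK", 1e-30, 120.0),
    HitRecord("seq2", 101, "heat shock protein 70", 1e-40, 160.0),
    HitRecord("seq2", 104, "actin", 0.5, 22.0),
]

significant = filter_by_evalue(hits, 1e-10)
shared = {gi: qs for gi, qs in queries_by_gi(significant).items() if len(qs) > 1}
print(shared)  # GI numbers matched by more than one query sequence
```

In this sample, the two query sequences share a significant match to the same database record, which is the situation the GI grouping goal is meant to surface.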

We are unaware of an existing BLAST results management system incorporating all the goals stated above. To the best of our knowledge, the functionalities of WebBLAST 2.0 [3] and the Ontario Center for Genomic Computing OCGC BLAST [2] match many of our requirements but fall short in several important aspects. For example, there is no provision in WebBLAST for applying global filtering and grouping operations, or a mechanism for searching all BLAST results on user-supplied text terms. The OCGC BLAST results manager appears closest to BlastQuest in functionality, allowing selected viewing and data filtering on up to five criteria. However, OCGC BLAST is not available to genomics scientists outside of the Province of Ontario, Canada. The BlastQuest Project is designed to meet our immediate specific requirements but, most importantly, to provide a platform we can freely modify to test our notions of Genomics Algebra, an advanced query language for biological information.

3.The BlastQuest System

BlastQuest simplifies large-scale analysis in gene sequencing projects by providing scientists with a means to filter, summarize, sort, group, and search BLAST data. BlastQuest extracts gene data from XML files, which are returned as the result of homology searches from BLAST engines, and stores them in an underlying relational database. This allows the user to benefit from well-known relational concepts like transactions, controlled sharing, and query optimization.

The most frequently used user operations are hard-wired in the user interface and accessible via command buttons. Their execution rests on SQL that is hidden from the user. To enable data analysis that is not directly supported by the built-in user interface operations, BlastQuest offers a more flexible, mask-oriented, non-SQL query interface, since biologists object to SQL due to its complexity and low level of abstraction (see Section 4). This interface essentially allows the user to construct complex Boolean expressions as selection conditions, which include logical operators and substring search predicates. The underlying query execution is based on parameterized SQL queries, which are instantiated and automatically translated into executable SQL code by the DBMS.
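
As a rough illustration of how a mask-based Boolean condition could be turned into a parameterized SQL query, consider the sketch below. It is an assumption-laden example rather than the BlastQuest code: the nested-tuple representation of the mask, the function `build_where`, the column names, and the use of SQLite in place of MySQL are all hypothetical.

```python
import sqlite3

# Columns the mask may reference; anything else is rejected so that user
# input never reaches the SQL text itself. Column names are hypothetical.
ALLOWED_COLUMNS = {"hit_def", "evalue", "bit_score", "gi"}
COMPARISONS = {"le": "<=", "ge": ">=", "eq": "="}


def build_where(cond):
    """Translate a nested condition tuple into (sql_fragment, parameters)."""
    op = cond[0]
    if op in ("and", "or"):
        parts = [build_where(c) for c in cond[1:]]
        sql = "(" + f" {op.upper()} ".join(p[0] for p in parts) + ")"
        return sql, [v for p in parts for v in p[1]]
    if op == "not":
        sql, params = build_where(cond[1])
        return f"(NOT {sql})", params
    if op != "contains" and op not in COMPARISONS:
        raise ValueError(f"unknown operator: {op}")
    column, value = cond[1], cond[2]
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    if op == "contains":
        # Substring search predicate; the value is bound as a parameter.
        return f"{column} LIKE ?", [f"%{value}%"]
    return f"{column} {COMPARISONS[op]} ?", [value]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Hit (query_name TEXT, gi INTEGER, hit_def TEXT, "
    "evalue REAL, bit_score REAL)"
)
conn.executemany(
    "INSERT INTO Hit VALUES (?, ?, ?, ?, ?)",
    [  # invented sample rows
        ("seq1", 101, "heat shock protein 70", 1e-50, 200.0),
        ("seq2", 105, "protein kinase C", 1e-20, 90.0),
        ("seq3", 106, "protein kinase A", 0.01, 30.0),
        ("seq4", 104, "actin", 1e-60, 250.0),
    ],
)

# (hit_def contains "kinase" OR hit_def contains "heat shock") AND evalue <= 1e-10
mask = (
    "and",
    ("or", ("contains", "hit_def", "kinase"), ("contains", "hit_def", "heat shock")),
    ("le", "evalue", 1e-10),
)
where_sql, params = build_where(mask)
rows = conn.execute(
    f"SELECT query_name, hit_def FROM Hit WHERE {where_sql} ORDER BY bit_score DESC",
    params,
).fetchall()
print(where_sql, rows)
```

Only whitelisted column names and fixed operator strings are placed into the SQL text; every user-supplied value travels as a bound parameter, which keeps the hidden SQL layer safe from malformed input.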

Another interesting feature of BlastQuest is that it can be linked to SMART (Simple Modular Architecture Research Tool) [6] (see Section 3.1). The integration of BlastQuest output into SMART for querying is in direct response to the desire by scientists for new tools and interfaces capable of accessing and integrating external resources into one system. In Section 4, we describe our plans to develop Genomics Algebra query software that operates on a unifying database whose contents can include data from existing genomics repositories. Finally, BlastQuest enables users to manage BLAST data on a per-project or per-user basis using the security features of the underlying database, while at the same time allowing controlled sharing of this data in order to support collaboration.

3.1.Architectural Overview

Figure 1 depicts a conceptual overview of the 3-tiered BlastQuest system architecture. Tier 1 contains the database backend, which is implemented using an instance of the MySQL[1] RDBMS. Since BlastQuest is mainly a proof-of-concept prototype rather than a production-strength system, our choice for a DBMS was governed by availability of source code and platform compatibility rather than performance and richness in features. The database backend stores and manages BLAST and PHRAP (Phragment Assembly Program) [4] results, which are represented as XML and ACE[2] (ArChivE) documents and whose structure has been mapped into the relations Hit, NoHit, and Assemble shown in Figure 2.

Figure 1: Conceptual overview of the BlastQuest system architecture.

For each gene sequence that produced a match during the BLAST search, the relation Hit stores the XML file name where the original query sequence can be found as well as detailed hit information, such as hit definition, expect value, bit score, and so forth. The relation NoHit stores information about those sequences that have no database match by the homology search criteria. From a biological point of view, sequences with no homologous sequence match often lead to new genes and are analyzed in a different manner (outside of BlastQuest). In addition, the database also stores information about how related gene segments are assembled into single consensus DNA sequences by PHRAP, which is external to BlastQuest and invoked before the DNA sequence results are submitted to BLAST. PHRAP outputs its results in an ACE file, which is mapped into the relation called Assemble. By querying the Assemble relation with a specific consensus sequence name, one can retrieve all segments that are clustered into that consensus sequence.
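
The mapping from an ACE file to the Assemble relation can be sketched as follows. This is a simplified illustration, not the BlastQuest loader: it assumes the common ACE layout in which a `CO` line introduces a consensus (contig) and subsequent `AF` lines name the segments assembled into it, and the table columns, function names, and sample data are hypothetical. SQLite is used here in place of MySQL.

```python
import sqlite3

# Trimmed, invented ACE-style sample; real files carry many more record types.
SAMPLE_ACE = """\
AS 2 3

CO Contig1 12 2 1 U
ACGTACGTACGT

AF read_a U 1
AF read_b C 5

CO Contig2 8 1 1 U
ACGTACGT

AF read_c U 1
"""


def parse_ace(text):
    """Yield (consensus_name, segment_name) pairs found in ACE text."""
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "CO":          # start of a new consensus sequence
            current = fields[1]
        elif fields[0] == "AF" and current is not None:
            yield current, fields[1]   # segment belonging to that consensus


def segments_of(conn, pid, consensus):
    """Retrieve all segments clustered into one consensus sequence."""
    rows = conn.execute(
        "SELECT segment FROM Assemble WHERE pid = ? AND consensus = ? ORDER BY rowid",
        (pid, consensus),
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Assemble (pid INTEGER, consensus TEXT, segment TEXT)")
conn.executemany(
    "INSERT INTO Assemble VALUES (1, ?, ?)",  # project 1 is a made-up PID
    list(parse_ace(SAMPLE_ACE)),
)
print(segments_of(conn, 1, "Contig1"))
```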

Figure 2: Relational Schema of the BlastQuest database.

The database also maintains information about users and their corresponding gene sequencing projects, which are stored in the three remaining relations, User, Project, and UserProj. The relation UserProj represents the many-to-many relationship between scientists and the projects to which they belong. Since all sequence data is organized by project (using the PID foreign key in each of the relations Hit, NoHit, and Assemble), BlastQuest provides control over who has access to which data.
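
A minimal sketch of how the project-based organization can restrict access is shown below. Only the relation names and the PID foreign key come from the description above; all other column names, the helper `visible_hits`, and the sample rows are assumptions, and SQLite stands in for the MySQL backend.

```python
import sqlite3

# Hypothetical, reduced version of the schema: only User, Project,
# UserProj, and Hit are shown, with invented column names.
SCHEMA = """
CREATE TABLE "User"   (uid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Project  (pid INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE UserProj (uid INTEGER REFERENCES "User"(uid),
                       pid INTEGER REFERENCES Project(pid),
                       PRIMARY KEY (uid, pid));
CREATE TABLE Hit      (hid INTEGER PRIMARY KEY,
                       pid INTEGER NOT NULL REFERENCES Project(pid),
                       query_name TEXT, hit_def TEXT,
                       evalue REAL, bit_score REAL);
"""


def visible_hits(conn, uid):
    """Return only the hits from projects the given user belongs to."""
    return conn.execute(
        "SELECT h.query_name, h.hit_def "
        "FROM Hit h JOIN UserProj up ON up.pid = h.pid "
        "WHERE up.uid = ? ORDER BY h.hid",
        (uid,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany('INSERT INTO "User" VALUES (?, ?)', [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO Project VALUES (?, ?)", [(10, "maize"), (20, "citrus")])
# alice works on both projects, bob only on the second one.
conn.executemany("INSERT INTO UserProj VALUES (?, ?)", [(1, 10), (1, 20), (2, 20)])
conn.executemany(
    "INSERT INTO Hit VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 10, "seqA", "actin", 1e-30, 120.0),
     (2, 20, "seqB", "tubulin", 1e-25, 100.0)],
)
print(visible_hits(conn, 1), visible_hits(conn, 2))
```

Because every data relation carries the project identifier, a single join against UserProj is enough to scope any query to the projects a user may see.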

Tier 2 contains the multi-threaded BlastQuest application program, which is divided into four modules: the client interface module, which handles communication with the Web clients in tier 3; the two loader modules for extracting and loading data from the XML and ACE input files into the database; and the SQL constructor for assembling the queries and record sets to be sent to the database. The client interface module is implemented as a series of JavaServer Pages (JSPs) that execute inside a Tomcat server. The remaining three modules are implemented as Java classes.

The XML loader parses each BLAST result file into a Document Object Model (DOM) representation using the Xerces Java Parser 1.4.4. The XML loader then extracts the relevant data items needed to populate the Hit and NoHit tables. Specifically, the loader module contains two classes whose structures correspond to the Hit and NoHit tables in the database schema. When the loader collects data from an XML file, it populates the appropriate class objects with the extracted values. At the end, the objects are passed to the SQL constructor, which creates the SQL commands to insert the values into the relational database. The ACE loader works in a similar fashion. However, since no standard ACE parser was available, we created our own. Our event-based parser detects the presence of certain keywords in the ACE input file and extracts the information associated with each keyword. It is important to note that other, more efficient loading options are possible, for example using the bulk loading utilities of the DBMS. However, by making our loader modules part of the Web-based middleware, users can load BLAST results into their BlastQuest accounts from anywhere on the Web as long as they have access to a Web browser.
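
The DOM-based loading step can be sketched as follows. The actual loader is written in Java with Xerces; this Python version, built on the standard `xml.dom.minidom` module, only mirrors the described flow. The element names follow NCBI's BLAST XML output, while the classes `HitRow` and `NoHitRow`, the helper functions, the choice of the first HSP per hit, and the trimmed sample document (including its identifiers) are assumptions made for illustration.

```python
from dataclasses import dataclass
from xml.dom import minidom

# Trimmed, invented sample in the style of NCBI BLAST XML output.
SAMPLE_XML = """\
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>seq1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>heat shock protein 70</Hit_def>
          <Hit_accession>AAA00001</Hit_accession>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>200.5</Hsp_bit-score>
              <Hsp_evalue>1e-50</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>seq2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


@dataclass
class HitRow:          # mirrors a row of the Hit table (hypothetical fields)
    query: str
    hit_def: str
    accession: str
    evalue: float
    bit_score: float


@dataclass
class NoHitRow:        # mirrors a row of the NoHit table
    query: str


def text_of(node, tag):
    """Return the text of the first descendant element named `tag`."""
    element = node.getElementsByTagName(tag)[0]
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def load_blast_xml(xml_text):
    """Parse BLAST XML into lists of HitRow and NoHitRow objects."""
    hits, no_hits = [], []
    dom = minidom.parseString(xml_text)
    for iteration in dom.getElementsByTagName("Iteration"):
        query = text_of(iteration, "Iteration_query-def")
        hit_nodes = iteration.getElementsByTagName("Hit")
        if not hit_nodes:
            no_hits.append(NoHitRow(query))
        for hit in hit_nodes:
            hsp = hit.getElementsByTagName("Hsp")[0]  # first HSP only
            hits.append(HitRow(
                query=query,
                hit_def=text_of(hit, "Hit_def"),
                accession=text_of(hit, "Hit_accession"),
                evalue=float(text_of(hsp, "Hsp_evalue")),
                bit_score=float(text_of(hsp, "Hsp_bit-score")),
            ))
    return hits, no_hits


def to_insert(row):
    """Build a parameterized INSERT statement for one HitRow."""
    sql = ("INSERT INTO Hit (query_name, hit_def, accession, evalue, bit_score) "
           "VALUES (?, ?, ?, ?, ?)")
    return sql, (row.query, row.hit_def, row.accession, row.evalue, row.bit_score)


hits, no_hits = load_blast_xml(SAMPLE_XML)
print(hits, no_hits)
```

Populating plain objects first and generating the INSERT statements in a separate step keeps parsing independent of the database, which matches the division between the loader modules and the SQL constructor described above.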