BIO-116
Detecting Outliers over Gene Expression Data using a Vertical Data Representation
ABSTRACT
Staggering volume of data is collected in biological databases, which requires innovative computational approaches to extract knowledge from those sources of data. Outlier detection is attracting more and more attention in data mining and knowledge discovery. It is not only used to clean up datasets, but also to discover useful anomalies/exceptions. In this paper, we apply our vertical outlier detection algorithm in a biological context to detect interesting anomalies from gene data. Unlike current outlier detection approaches in biology which remove noise as a first step in other processes like clustering or classification, our vertical outlier detection method is used to directly detect interesting anomalies from gene data. Our method utilizes a pruning technology, “by-neighbor” labeling technology and adopts a vertical data representation model, P-trees[1], making the method more efficient over large datasets in terms of speed and scalable to data size. We explore our method over a variety of biological datasets such as brain data, breast cancer data, Diffuse Large B-cell Lymphoma Gene Expression Data and Diffuse Large B-cell Lymphoma Survival Data. It is shown experimentally that our method can efficiently detect anomalies which could be of interests to biologist. Furthermore, our vertical outlier detection method performs better than current approaches over the various biology datasets in term of speed and scalability.
Keywords
Outlier Detection, Gene Expression, P-Tree, Pruning, by-Neighbors Labeling
1. INTRODUCTION
The problem of mining rare events, deviant objects, and exceptions is critically important in many domains, such as electronic commerce, network, surveillance, and health monitoring. The current outlier mining approaches can be classified into five categories: statistic-based [4], distance-based [5][6] [7], density-based [8][9],clustering-based [10], and deviation-based [12][13]. Distance-based outlier detection approaches are attracting the most attention for knowledge discovery in large database.
Knorr and Ng proposed a unified outlier definition based on the distance notion. They show that for many discordance tests in statistics, if an object O is an outlier according to a specific discordance test, then O is also a distance-based outlier given an outlier percentage and a distance threshold. Knorr and Ng proposed a nested loop cell-based algorithm and Ramaswamy and Rastogi introduced a partition-based algorithm for this approach. The methods works well with uniform distributed datasets without the prior knowledge of the data distribution. However, the methods are not efficient and do not scale well over large datasets. In our previous work, we proposed a vertical outlier detection method with local pruning. Our method is based on a pruning technique, the “by-neighbor” labeling technique, and adopts a vertical data representation model, P-trees, which make the method work efficiently over large datasets in terms of speed and scalability to data size.
In biology and bioinformatics, analysis of data from large scale microarray experiments is hard because of its size and high dimensionality. The outlier detection in gene expression data has its importance but together with the difficulty of high dimensionality and the sparsity of data in high-dimensional space, the detection is a major challenge in the analysis and affects the quality of the interpretation of results which sometimes may even lead to false biological conclusions. Statistical models provide a conceptual framework for the analysis of such data. Multivariate statistical process control (MVSPC) coupled with robust principle component analysis is one method that is used to detect and remove the outliers and also reduce the high dimensionality of the microarray data [1]. Several statistical models have also been proposed to solve the problem of detecting outliers. Statistical model for oligonucleotide expression array data that addresses several important analysis issues like the automatic detection and handling of outliers was implemented [2]. The model estimates the level of the transcript of any given gene provided a given number of samples are used in an experiment. In this model, array outliers, probe outliers and single outliers were automatically identified through an iterative process based on arrays with large standard error and discarding the outlier arrays, probes and single data points during each iteration thereby reducing the cardinality and dimensionality of the data while fitting the model. Another approach employed to detect outliers is a three sample gene expression profiling measuring system with statistical framework and determine the precise gene expression measurement and the source of errors [3]. In summary, almost all of current work of outlier detections in biology focuses on removing noise from datasets to improve other processes like clustering, classification, or association rule mining. In this paper, we work on finding anomaly interests directly from data and improve the performance of the outlier detection process over very large biology datasets.
In this paper, we introduce our vertical outlier detection algorithm into the biology community to detect anomaly interests in gene expression data and other biology datasets. We make a comprehensive experimental design to explore our algorithms over various kinds of biology datasets. Our algorithm produces anomaly interests which are interesting to both biologist and data mining people. In view of performance, our outlier detection method beat current approaches in terms of speed and scalability to data size.
The paper is organized as follows. Section 2 reviews our vertical outlier detection method; we propose a comprehensive experimental design scheme, which explore the outlier detections over various biology datasets, using formal design methodology in section 3.1; in section 3.2, we show the experimental results and explain the results both in practice and in theory. The paper is concluded in section 4.
2. REVIEW OF OUR VERTICAL OUTLIER DETECTON METHOD
In our previous work, we proposed a vertical outlier detection algorithm with local pruning [14]. In the approach, we introduce a neighborhood based outlier definition, and two properties which can work as a pruning rule and a “by-neighbor” label rule respectively. Also, we adopt a vertical data representation, P-Tree, as our data structure. Based on the pruning, “by-neighbor” label and P-Trees technology, we proposed a novel our vertical distance-based outlier detection method with local pruning. The pruning technology and the “by-neighbor” labeling technology facilitate the outlier detection procedure significantly. P-Trees speed up the procedure further by its optimal logical operations AND and OR. Our method has an order of magnitude speed improvement over the current approaches and beats other approaches in terms of scalability to data size also
2.1. Definitions and Properties
Definition 1 (Neighborhood)
The Neighborhood of a data point O with the radius r is defined as a set Nbr (O, r) = {xÎ X | |O-x|£ r}, where |O-x| is the distance between O and x. It is also called r-neighborhood. The points in this neighborhood are called neighbors of O. The number of neighbors of O is defined as N (Nbr (O, r)).
Definition 2 (Outliers)
Based on neighborhood, a point O is considered as a DB (p, D) outlier if N (Nbr(O,D)) ≤ (1-p)*|T|, where |T| is the total size of the dataset T. We define outliers as a subset of the dataset T with N (Nbr(x,D)) ≤ (1-p)*|T|, where x denotes a point in the outlier set. The outlier set is denoted as Ols (T, p, D) = {xÎT | N (Nbr(x,D)) ≤ (1-p)*|T|}.
Property 1: If N ( Nbr (O, D/2)) ≥ (1-p)*|T|, then all the points in Nbr ((O, D/2)) are not outliers.
The proof of the property is omitted and the detail can be seen in [14]. Property 1 can be used as a pruning rule. Based on this property, all points in the Nbr(O, D/2) can be pruned off. Figure 1 show the pruning, where the Nbr (O, D/2) is pruned.
Figure 1 Pruning Nbr (O, D/2)
Property 2: If N ( Nbr (O, 2D)) < |T| * (1-p), then all the points in the Nbr (O, D) are outliers.
The proof of the property is omitted and the detail can be seen in [14]. According to property 2, all points in Nbr (O, D) can be labeled as outliers. Therefore, property 2 provides a powerful means to label outliers by neighborhood, instead of point by point, which will speed up the outlier detection process significantly. Figure 2 shows the “by-neighbor” process. All the points in the Nbr (O, D) can be considered as outliers.
Figure 2. Label Outliers by-neighbors
2.2. P-Trees Technology
Unlike traditional horizontal data representation approaches, in the P-Trees approach, we decompose attributes of relational tables into separate files by bit position and compress the vertical bit files using a data-mining-ready structure called the P-trees. Instead of processing horizontal data vertically, we process these vertical P-trees horizontally through fast logical operations. Since P-trees markedly compress the data and the P-trees logical operations scale extremely well, this vertical data structure can be of potential to address the non-scalability with respect to size. In this paper, P-Trees speed up the outlier detection process significantly by searching neighbors using inequality P-Trees and pruning using the P-Tree AND operation[18][19][20]. The vertical approach improves outlier detection dramatically in terms of speed.
2.3. Our Vertical Distance-based Outlier Detection Method with Local Pruning
The algorithm is described herein in short and the detail can be seen in [14]. Our algorithm is based on our neighborhood-based outlier definition, property 1 and property 2. The algorithm goes as follow.
Step 1: One point P in the dataset is selected arbitrarily;
Step 2: Search for its D-neighborhood and calculate the number of neighbors, N (Nbr(P,D);
In case N (Nbr(P,D) ≥ (1-p)*|T|, where |T| is the total size of the dataset, reduce the radius from D to D/2; then, calculate N(Nbr(P,D/2). If N(Nbr(P,D/2) > (1-p) * |T|, all the D/2-neighbors are not outliers, Therefore, those points need not undergo further testing, which is one of the points where the efficiency of our algorithm comes.
In case N (Nbr(P,D) (1-p)*|T|, the algorithm expands the neighborhood radius from D to 2D and compute N (Nbr(P,2*D). If N (Nbr(P,2*D) < (1-p)*|T|, all the D-neighbors can be claimed as outliers. Those points are inserted into the outlier set and need not undergo further detection either, which contribute to the speed improvement as well; otherwise, only point P is an outlier.
Step 3: The process is conducted iteratively until all points in the dataset T are examined.
The algorithm is implemented using the vertical data model, P-Trees. The neighborhood search is fast conducted by inequality P-Trees ANDing, computation of number of neighbors is just extracting the value of the root node of the neighborhood P-Trees[14], and algorithm prunes points by neighborhood using the optimal P-Trees operation, AND. The P-Tree based vertical outlier detection method is shown in figure 4.
Figure 3 the Vertical Outlier Detection Method with Local Pruning
3. DETECTING OUTLIERS FROM VARIOUS BIOLOGICAL DATA
In this section, we will explore current distance-based outlier detection methods and our vertical outlier detection approaches over various biology datasets. In section 3.1, we propose our experimental design scheme using formal design principles; in section 3.2, we show the experiment results and performance analysis is conducted with regard to computational complexity.
3.1. Experimental Design
We design our experiments using general factorial design approach [14][16]. In the input space, we consider three general factors, which are dataset type, data size and algorithm type and the other two factors, outlier percentage p and distance threshold D, which are critical for the performance of the distance-based outlier detection process in particular. We explore our vertical algorithm and the current state-of-the-art distance-based algorithms over different type of datasets from biology area. The percentage p and distance threshold D is varied to exam the performers further. In the output space, outlier sets together with biological description are given; also, we consider speed and scalability to data size as the other three output factors. In our experimental design, we assume that there is no interaction exiting in a two-factor model.
Our purpose lies in: first, we show that, unlike the current status of the outlier algorithms in biology area, where outlier detection is used to remove noise [1][2][3] , our method can find anomaly interests which are important to biologist. Second, we will show that our vertical outlier detection method performs better than the current algorithms in distance-based outlier detection areas over various biology datasets in terms of speed and scalability to data size. The accuracy of our algorithm is comparable to that of the best of the current approaches.
Algorithms: We will compare the performance of our vertical algorithms with that of the current approaches in this area. We choose Knorr and Ng’s Nest Loop Approach, Cell-based Approach and Sridhar Ramaswamy’s partition-based approach to compare with. Those three algorithms are the state-of-the-art in the distance-based outlier mining. The four algorithms are listed in table 1.
Table 1 Algorithms
Algorithm1 / Nested Loop AlgorithmAlgorithm 2 / Cell based Algorithm
Algorithm 3 / Partition-based Method
Algorithm 4 / Our Vertical Algorithm
Datasets: We will explore the above algorithms over a variety of biology datasets. Dataset 1 is brain data with 2 attribute and 28 instances. Dataset 2 is cancer data. It has 699 instances, 10 attributes. Each instance has one of 2 possible classes: benign or malignant. Among 699 instances, 458 (65.5%) are benign and 241(34.5%) are malignant. Dataset 3 is abalone dataset with 9 attribute and 4177 cases. 19 cases among 4177 are outliers. Dataset 4 and dataset 5 are Diffuse Large B-cell Lymphoma Gene Expression and Survival Data respectively. The data is from Eisen Lab. The gene expression data has 21696 rows and 21 columns, while the survival data has 6 columns and 41 rows. The datasets are listed in table 2.
Table 2 Biology Datasets
Dataset 1 / Brain Datahttp://devlab.cs.dartmouth.edu/IDM-REPORT/http://www.act.cmis.csiro.au/rohanb/outliers/brain/brain.dat
Dataset 2 / Breast Cancer Data http://www.act.cmis.csiro.au/rohanb/outliers/breast-cancer/
Dataset 3 / Abalone Data http://www.liafa.jussieu.fr/~amicheli/DataMining/abalone.html
Dataset 4 / Diffuse Large B-cell Lymphoma Gene Expression: http://rana.lbl.gov/EisenData.htm[17]
Dataset 5 / Diffuse Large B-cell Lymphoma Survival Data: http://rana.lbl.gov/EisenData.htm [17]
Data Size: To study how the four algorithms scale with dataset size, we varied the number of points from 4096 to 1048576. For the small datasets, we increase the data size by mirroring the distribution of the original datasets, thus large datasets still keeping the statistic feature of the original datasets.
