Data Mining of Microarray Data Using Multidimensional Analytic and Visualization Techniques
Patrick Hoffman, Dave Pinkney, Jennifer Wu
AnVil Informatics
4th Floor, 600 Suffolk Street
Lowell, MA 01854
http://www.anvilinformatics.com
The insights that result from analyses of large microarray datasets represent an important new focus in the drug discovery process. In this poster we demonstrate the application of two machine learning approaches, supervised and unsupervised learning, to microarray data. Additionally, we present new techniques that facilitate the comparison of clusterings using visual and analytical approaches.
The microarray datasets we used are publicly available and result from various yeast gene experiments. Our purpose was to demonstrate the value of applying high-dimensional analytic and visual data mining techniques to discover trends and patterns in the data.
In our analyses, we compare many classification and clustering techniques on both yeast diauxic shift data and yeast cell cycle data. Application of novel visualization techniques (Parallel Coordinates, Circle Segments, Radviz, etc.) to both datasets helps us gain insights into the gene expression data.
Supervised Clustering of Microarray Data
Microarray experiments typically lead to the analysis of thousands of gene expression profiles. Genes of similar function often have similar expression profiles. This attribute can be exploited by creating classifiers that are trained on the expression profiles of genes with known function, and applied to unknown genes in order to classify them based on expression profile.
We performed several experiments that built classifiers from 35 genes with 5 distinct expression profiles. The information came from publicly available yeast gene expression data that was generated from microarray experiments. Some of the results are shown in this poster.
Most classifiers, such as a Decision Tree, Neural Network and Naive Bayes, can classify the 35 “training” genes perfectly. Once trained, these tools can be used to automatically classify the 6000 remaining unclassified genes based on the characteristics of their expression profiles.
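The train-then-apply workflow described above can be sketched in a few lines. This is a minimal illustration only: it uses a nearest-centroid classifier as a simple stand-in for the Decision Tree, Neural Network and Naive Bayes classifiers named in the poster, and the expression values, class names and function names are all invented for the example.

```python
import math

def centroid(profiles):
    """Mean expression value at each time point across a list of profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def train(labeled):
    """labeled: list of (profile, class_label). Returns {class_label: centroid}."""
    by_class = {}
    for profile, label in labeled:
        by_class.setdefault(label, []).append(profile)
    return {label: centroid(profiles) for label, profiles in by_class.items()}

def classify(model, profile):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], profile))

# Toy training genes (hypothetical values), 7 time points each.
training = [
    ([0.1, 0.2, 0.5, 1.0, 1.8, 2.5, 3.0], "induced"),
    ([0.0, 0.3, 0.6, 1.1, 1.7, 2.4, 2.9], "induced"),
    ([0.0, -0.2, -0.6, -1.0, -1.9, -2.4, -3.1], "repressed"),
    ([0.1, -0.1, -0.5, -1.2, -1.8, -2.6, -2.8], "repressed"),
]
model = train(training)

# An "unknown" gene is assigned to the class with the most similar profile.
unknown = [0.0, 0.1, 0.4, 0.9, 1.5, 2.2, 2.7]
predicted = classify(model, unknown)
```

The same two steps apply whichever classifier is chosen: fit on the genes of known function, then call the classifier once for each unclassified gene.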
A Kohonen self-organizing map can be used to cluster the 35 “training” genes, and the resulting map can then be used to classify the unknown genes. The following four pictures show the expression profiles of the “training” genes, the Kohonen map built from the genes, and the expression profiles of the unknown genes after being classified with the Kohonen Map.
A Parallel Coordinates visualization displaying gene expression levels for 35 genes with distinct expression profiles. The genes were classified based on their expression profile, which is shown plotted over the 7 measured time intervals.
A Kohonen self-organizing map clusters the 35 genes of known class by projecting them onto a new pair of axes and placing genes with similar expression profiles close together. The Kohonen map can then be used as a classifier if the operator designates which clusters correspond to which gene function.
After classification based on the genes of known expression profile, the Kohonen self-organizing map shows the distribution of over 6000 microarray records (genes).
A Parallel Coordinates visualization shows the expression profiles of the 6000 genes after classification by the Kohonen self-organizing map.
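How a Kohonen map can double as a classifier may be easier to see in code. The sketch below is an assumption-laden illustration, not the tool used for the poster: it trains a tiny 2 x 2 map with a shrinking Gaussian neighbourhood, labels each node by the majority class of the training genes that land on it (standing in for the operator's designation), and classifies a new profile by its closest labelled node. The deterministic initialisation, fixed presentation order, parameter values and toy data are all choices made for this example.

```python
import math
from collections import Counter

def best_matching_unit(weights, x, nodes=None):
    """Grid coordinate of the node whose weight vector is closest to x."""
    candidates = weights if nodes is None else nodes
    return min(candidates, key=lambda node: math.dist(weights[node], x))

def train_som(data, rows, cols, epochs=100, lr0=0.5):
    """Fit a rows x cols map; learning rate and neighbourhood shrink over time."""
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Deterministic initialisation (an assumption): cycle through the samples.
    weights = {node: list(data[i % len(data)]) for i, node in enumerate(grid)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.5)
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node in grid:
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                w = weights[node]
                for i, xi in enumerate(x):
                    w[i] += lr * h * (xi - w[i])  # pull node toward the sample
    return weights

def label_nodes(weights, labeled):
    """Give each node hit by training genes the majority class of those genes."""
    hits = {}
    for profile, label in labeled:
        hits.setdefault(best_matching_unit(weights, profile), Counter())[label] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in hits.items()}

def classify(weights, node_labels, profile):
    """Class of the closest labelled node (unlabelled nodes are ignored)."""
    return node_labels[best_matching_unit(weights, profile, node_labels)]

# Toy training genes (hypothetical values), 7 time points each.
training = [
    ([0.1, 0.2, 0.5, 1.0, 1.8, 2.5, 3.0], "induced"),
    ([0.0, -0.2, -0.6, -1.0, -1.9, -2.4, -3.1], "repressed"),
    ([0.0, 0.3, 0.6, 1.1, 1.7, 2.4, 2.9], "induced"),
    ([0.1, -0.1, -0.5, -1.2, -1.8, -2.6, -2.8], "repressed"),
]
weights = train_som([profile for profile, _ in training], rows=2, cols=2)
node_labels = label_nodes(weights, training)
predicted = classify(weights, node_labels, [0.0, 0.1, 0.4, 0.9, 1.5, 2.2, 2.7])
```

A production map would normally use random initialisation, shuffled presentation order and a much larger grid; those were left out here to keep the example short and repeatable.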
Unsupervised Clustering of Microarray Data
There are many clustering techniques that can be applied to microarray data, such as Hierarchical, K-Means and Self-Organizing Maps. We applied several clustering techniques to publicly available microarray yeast gene expression data. The expression levels were measured over two cell cycles, and 800 genes were identified algorithmically as being cell cycle regulated. These genes were classified into 5 groups based on the cell cycle phase of their expression. We analyzed and visualized the expression levels of the 800 genes using several unsupervised clustering techniques; a few excerpts of these analyses are shown.
In the following pictures we show several traditional and novel techniques for visualizing data once it has been clustered or classified, and then present the results of two unsupervised clustering techniques.
If the data has already been clustered, graphs such as this average expression profile plot can be used to present summary information about the characteristics of each cluster. Here we see the average expression profile, with standard deviation bars added, plotted for the 5 Peak clusters in the cell cycle data. The clusters clearly demonstrate the cyclic nature of the data set.
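The numbers behind such a plot are simply a per-cluster mean and standard deviation at each time point. A minimal sketch, assuming the expression data is held as a dictionary of gene name to profile and the cluster assignment as a dictionary of gene name to cluster label (both layouts, and all names and values, are hypothetical):

```python
from statistics import mean, stdev

def cluster_profiles(expression, cluster_of):
    """Per-cluster mean and sample standard deviation at every time point."""
    members = {}
    for gene, profile in expression.items():
        members.setdefault(cluster_of[gene], []).append(profile)
    summary = {}
    for cluster, profiles in members.items():
        columns = list(zip(*profiles))  # one tuple of values per time point
        summary[cluster] = {
            "mean": [mean(col) for col in columns],
            # stdev needs at least two values; report 0.0 for singleton clusters
            "std": [stdev(col) if len(col) > 1 else 0.0 for col in columns],
        }
    return summary

# Toy data (invented): three genes, three time points, two clusters.
expression = {
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [3.0, 2.0, 1.0],
    "gene_c": [0.0, 0.0, 0.0],
}
cluster_of = {"gene_a": "phase_1", "gene_b": "phase_1", "gene_c": "phase_2"}
summary = cluster_profiles(expression, cluster_of)
```

Plotting `summary[cluster]["mean"]` against time, with `summary[cluster]["std"]` as error bars, gives one curve per cluster.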
A novel extension to the average expression profile plot is this Histogram Matrix visualization. The 5 Peak phase clusters are displayed as a sequence of sixteen histograms for each cell cycle. Rather than providing standard deviation bars, this visualization presents all of the distribution information using a histogram at each time point.
Another powerful way of examining classified or clustered data is with an interactive Parallel Coordinates visualization. This parallel coordinates visualization is being used to examine all of the gene expression values corresponding to two of the cell cycle clusters. The phase difference between the expression times of the two clusters can be clearly seen.
If the data has not been clustered, a common approach is to apply a Hierarchical Agglomerative Clustering method and to visualize the results with the familiar Dendrogram visualization. A colored patch grid corresponding to positive (green) and negative (red) data values enhances the visual analysis and comprehension of the clustering.
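For readers unfamiliar with the method, agglomerative clustering starts with every record in its own cluster and repeatedly merges the two closest clusters; the recorded merge order is what a dendrogram draws. The sketch below is a naive illustration that assumes Euclidean distance and average linkage, neither of which is specified in the poster, and uses invented two-dimensional points.

```python
import math
from itertools import combinations

def average_linkage(a, b, points):
    """Mean pairwise distance between the members of clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (tuples of point indices) and the merge
    history, which is the information a dendrogram displays.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return clusters, merges

# Toy data (invented): two tight groups that are far apart.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusters, merges = agglomerate(points, n_clusters=2)
```

This version recomputes every distance on every pass, so it is only suitable for small examples; library implementations cache the distance matrix.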
Another way to cluster data is to use Polyviz, a proprietary high-dimensional clustering technique based on a spring force paradigm. This Polyviz visualization clustered the microarray data using the expression values and is colored by the Peak phase classification.
The Kohonen self-organizing map is another powerful clustering technique that can be applied to unclustered data. This clustering of the microarray data shows the relationship between gene expression levels (based on cluster location) and the Peak classification column (used to color the points).
Cluster Comparison Techniques
Scientists have many clustering techniques at their disposal, each with its own set of advantages and disadvantages. How can scientists determine which clustering technique is best for their data? How can the results of two different clustering algorithms be meaningfully compared? The answers to these questions are ongoing research issues, but we present here three visual approaches and one analytic approach to answering them.
This custom visualization allows one to visually compare the results of a K-means clustering technique that generates 30 clusters (on the left) with a technique that produced 5 clusters (on the right). Poly-lines are used to identify an individual record’s location within each of the results. The visual comparison allows one to gain a meaningful understanding of how the cluster results differ.
This visualization of two clustering techniques uses a jittered scatter plot to enable the comparison of the clustering results. Five clusters from one technique (along the Y-axis) are compared with 12 clusters from another technique (along the X-axis). If the X-axis clusters were a pure subset of the Y-axis clusters then there would only be one clump per vertical line. In this case only the 12th cluster on the X-axis is pure while the 1st is nearly so.
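The clumps in such a scatter plot correspond to the cells of a cross-tabulation of the two sets of cluster labels, and a "pure" X-axis cluster is one whose records all fall in a single Y-axis cluster. A minimal sketch of both calculations, using invented labels and hypothetical function names:

```python
from collections import Counter

def cross_tabulate(labels_x, labels_y):
    """Count how many records fall in each (x cluster, y cluster) pair."""
    return Counter(zip(labels_x, labels_y))

def pure_clusters(labels_x, labels_y):
    """X clusters whose records all belong to a single Y cluster."""
    seen = {}
    for x, y in zip(labels_x, labels_y):
        seen.setdefault(x, set()).add(y)
    return sorted(x for x, ys in seen.items() if len(ys) == 1)

# Toy labels (invented): six records, clustered by two techniques.
labels_x = [1, 1, 2, 2, 3, 3]              # technique on the X-axis
labels_y = ["a", "a", "a", "b", "b", "b"]  # technique on the Y-axis

table = cross_tabulate(labels_x, labels_y)
pure = pure_clusters(labels_x, labels_y)
```

Here X clusters 1 and 3 are pure, while cluster 2 is split between the two Y clusters and would appear as two clumps on its vertical line.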
The Color Correlated Column visualization is another custom visualization for comparing clustering results. This visualization allows one to simultaneously compare the results of over 20 different clusterings of the data. The records are sorted vertically by the Peak class, which is represented by the colored bar on the right. The predicted class is represented with a grayscale. If the change in grayscale value corresponds to the change in color, then there is a strong correlation between the true and predicted class.
Comparing Clustering Techniques
# / Clustering Technique / Data / Number of Clusters / %correct method-1 / %correct method-2 / %correct method-3 / %correct method-4 / %correct maximum
1 / Kohonen 3 / Norm / 30 / 72.6 / 69.1 / 65.7 / 67.8 / 72.6
2 / Kohonen 1 / Norm / 30 / 72.3 / 69.5 / 65.2 / 67.7 / 72.3
3 / Kohonen 2 / Norm / 30 / 71.8 / 66.4 / 62.3 / 65.2 / 71.8
4 / C K-means 1 / Norm / 30 / 71.1 / 66.4 / 59.7 / 65.1 / 71.1
5 / SOM 4 / Original / 25 / 70.1 / 61.9 / 59.9 / 63.2 / 70.1
6 / SOM 12 / Original / 27 / 69.3 / 64.0 / 60.1 / 63.0 / 69.3
7 / Kohonen 2 / Original / 19 / 68.5 / 64.3 / 58.6 / 62.7 / 68.5
8 / C K-means 1 / Original / 30 / 67.2 / 63.6 / 55.0 / 61.9 / 67.2
9 / Kohonen 1 / Original / 19 / 67.1 / 59.8 / 53.6 / 58.8 / 67.1
10 / Kohonen 3 / Original / 18 / 66.8 / 65.5 / 56.4 / 63.9 / 66.8
11 / C K-means 2 / Norm / 5 / 66.8 / 61.1 / 56.4 / 58.6 / 66.8
12 / SOM 7 / Norm / 12 / 62.5 / 57.8 / 49.6 / 52.8 / 62.5
13 / M K-means 1 / Original / 5 / 59.7 / 51.8 / 48.4 / 54.7 / 59.7
14 / Dendrogram 2 / Original / 6 / 58.8 / 54.5 / 46.8 / 47.5 / 58.8
15 / K-means 2 / Original / 5 / 55.8 / 50.0 / 47.8 / 54.5 / 55.8
16 / SOM 7 / Original / 5 / 54.8 / 51.8 / 42.8 / 55.1 / 55.1
17 / Dendrogram 1 / Original / 6 / 45.6 / 43.1 / 32.7 / 33.4 / 45.6
18 / SOM 12 / Norm / 30 / 44.2 / 38.5 / 31.0 / 36.0 / 44.2
19 / M K-means 2 / Original / 30 / 43.7 / 36.6 / 29.3 / 35.9 / 43.7
20 / M K-means 3 / Original / 17 / 39.5 / 30.8 / 23.5 / 30.2 / 39.5
21 / random / Original / 6 / 37.5 / 16.3 / 20.0 / 22.9 / 37.5
The results of several clustering techniques are analytically compared with the Peak class in this example. For a given technique, each generated cluster was considered to be a subset of one of the true classes. The class chosen for each cluster was based on the majority of “truth” classes for the genes in that cluster. After each cluster was categorized, the resulting accuracies were calculated. The total percent correct and the average accuracy for each class were calculated and are presented in the method columns.
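The majority-vote scoring described above can be sketched as follows. This is an illustration under stated assumptions: the poster does not define how methods 1 to 4 differ, so the code only computes the two quantities named in the text (total percent correct and accuracy for each class), and the cluster ids and class labels are invented.

```python
from collections import Counter, defaultdict

def majority_vote_accuracy(cluster_ids, true_classes):
    """Score a clustering against known classes by majority vote.

    Each cluster is assigned the most common true class among its members;
    every gene then counts as correct if its cluster's class matches its own.
    Returns (total percent correct, {class: percent correct}).
    """
    members = defaultdict(list)
    for cluster, truth in zip(cluster_ids, true_classes):
        members[cluster].append(truth)
    assigned = {c: Counter(t).most_common(1)[0][0] for c, t in members.items()}
    predicted = [assigned[c] for c in cluster_ids]

    n_correct = sum(p == t for p, t in zip(predicted, true_classes))
    total = 100 * n_correct / len(true_classes)

    per_class = {}
    for cls in set(true_classes):
        idx = [i for i, t in enumerate(true_classes) if t == cls]
        per_class[cls] = 100 * sum(predicted[i] == cls for i in idx) / len(idx)
    return total, per_class

# Toy example (invented): six genes, three clusters, two true classes.
cluster_ids = [0, 0, 0, 1, 1, 2]
true_classes = ["G1", "G1", "S", "S", "S", "G1"]
total, per_class = majority_vote_accuracy(cluster_ids, true_classes)
```

In this toy case cluster 0 is labelled G1 by a 2-to-1 vote, so the one S gene in it is counted as an error: five of six genes are correct overall, every G1 gene is correct, and two of three S genes are correct. Note that this score tends to rise with the number of clusters, since smaller clusters are more likely to be dominated by one class.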
This Radviz visualization presents Mechanism of Action data from the NCI chemical structure database, clustered by the fingerprint of the chemicals.
This Polyviz visualization presents Mechanism of Action data from the NCI chemical structures database, clustered by the fingerprint of the chemicals.