Statistical modeling and machine learning for molecular biology
Alan Moses
Part II Clustering
Chapter 5 – distance-based clustering
The idea of clustering is to take some observations from a pool and find groups or “clusters” within these observations. A simple interpretation of this process is to imagine that there were really more than one kind of observations in the pool, but that we didn’t know that in advance. Although we sample randomly from the pool, we actually alternate between observations of different kinds.
Perhaps the most intuitive way to group the observations into the clusters is to compare each data point to each other, and put the data point that are closest together into the same cluster. For example, if the observations were 5.1, 5.3, 7.8, 7.9, 6.3, we probably want to put 5.1 and 4.9 being only 0.2 apart, in the same cluster, and 7.8 and 7.9 in the same cluster, as they are only 0.1 apart. We then have to decide what to do with 6.3: we could decide that it belongs in its own cluster, or put it in one of the other two clusters. If we decide to put it in one of the other clusters, we have to decide which one to put it in: it’s 1.0 and 1.2 away from the observations in the first cluster and 1.5 and 1.6 away from the observations in the second cluster. This simple example illustrates the major decisions we have to make in a distance-based clustering strategy. And, of course, we’re not actually going to do clustering by looking at the individual observations and thinking about it: we are going to decide a set of rules or “algorithm” and have a computer automatically perform the procedure.
- Multivariate distances for clustering
One important issue that is not obvious in this simple example is the choice of the distance used to compare the observations. The notion of the distance between data points is a key idea in this chapter, as all of the clustering methods we’ll discuss can be used with a variety of distances.In the example above, it makes sense to say that 5.1 and 4.9 are 0.2 apart. However, in a typical genomics problem, we are doing clustering to find groups in high-dimensional data: we need to know how far apart the observations are in the high-dimensional space.Figure 1a illustrates gene expression observations in two dimensions.
Figure 1 example of grouping genes based on expression in two cell types
a) illustrates expression data represented as points in two dimensions b) illustrates the differences between absolute distances, and the angle between observations. c) and d) show hierarchical clustering of gene expression data. i-v indicate the steps in the iterative process that builds the hierarchical tree. The first two step merge pairs of similar observations. The third step merges a cluster of two observations with a another observation. Based on the hierarchical tree, it looks like there are two clusters in the data, and we could choose a distance cutoff (dashed line) to define the two clusters.
In a typical gene-expression clustering problem, we might want to find clusters of genes that show similar expression patterns. In the case of the ImmGen data, there are thousands of genes, each of whose expression level has been measured in hundreds of cell types. To cluster these genes (or cell types) using a distance-based approach we first need to decide on the definition of “distance” we’ll use to compare the observations. Multivariate distances are most easily understood if the data are thought of as vectors in the high-dimensional space. For example, to cluster genesin the Immgen data, each cell type would represent a different dimension, and each gene’s expression levels would be a vector of length equal to the number of cell types. This is conveniently represented as a vector Xg = (Xg1, Xg2 … Xgn), whereXg is he expression vector for the gene g, for each of the n different cell types. We can then define a “distance” between two genes, g and h as some comparison of their two vectors, Xg and Xh. A typical distance that is used in machine learning applications is the “Euclidean distance” which is the sum of the squared distance between the components of two vectors.
The second last equation shows the definition using vector notation, and last equality uses special notation which indicates the length or “norm” of a vector. The distance is called Euclidean because in two or three dimensions, this is the natural definition of the distance between two points. Notice that in the Euclidean distance each of the n-dimensions (in this case cell types) is equally weighted. In general, one might want to consider weighting the dimensions because they are not always independent – for example, if two cell types have very similar expression patternsover all, we would not want to cluster genes based on similarity of expression patterns in those cell types. The correlation between dimensions becomes particularly problematic if we are trying to cluster observations where there are many correlated dimensions and only a small number of otherdimensions that have a lot of information. The large number of correlated dimensions will tend to dominate the Euclidean distance and we will not identify well-delineated clusters. This leads to the idea that distances should be weighted based on how correlated the dimensions are. An example of a popular distance that does this is the so-called Malhalanobis distance
Which weights the dimensions according to the inverse of the covariance matrix (S-1). In principle, this distance does exactly what we might want: it downweights the dimensions proportional to the inverse of their correlation.A major issue with using this distance for high-dimensional data is that you have to estimate the covariance matrix: if you have 200 cell types, the covariance is a (symmetric) 200 x 200 matrix (20,100 parameters), so you will need a lot of data!
In practice neither Euclidean distance or Malhalanobis distances work very well for clustering gene expression data because they consider the “absolute” distance between the vectors. Usually we are not interested in genes with similar expression levels, but rather genes with similar “patterns”. Therefore, we would like distances that don’t depend on the length of the vectors. An example of a widely-used distance for clustering gene expression is the so-called correlation distance which measures the similarity ofthe patterns.
Where r(Xg,Xh) is Pearson’s correlation between vectors Xg and Xh(which we will see again in Chapter 7) and I have used s(X) to indicate the standard deviation and m(X) to indicate the average (or mean) of the observation vectors. We put the “1 –“ in front because the correlation is 1 for very similar vectors and -1 for vectors with opposite patterns. So this distance is 0 if vectors have identical patterns and 2 for vectors that are opposite. Notice that the correlation is “normalized” by both the mean and the standard deviation, so it is insensitive to the location and the length of the vectors:it is related (in a complicated way) to the angle between two the vectors, rather than a geometric“distance” between them. However, perhaps a more geometrically intuitive distance is the so-called cosine-distance, which is related to the cosine of the angle between the vectors in the high-dimensional space.
Where I have used the ||x|| notation to indicate the norm of the observation vectors (as in the formula for Euclidean distance above. Once again we put in the “1 –“ because the cosine of 0 degrees (two vectors point at an identical angle) is 1 and 180 degrees (the two vectors point in opposite directions) is -1. So once again this distance goes between 0 and 2. Although this formula looks quite different to the one I gave for the correlation, it is actually very similar: it is a so-called “uncentered” correlation, where the mean (or expectation) has not been subtracted from the observations. So this distance does depend on the location of the vectors in the high-dimensional space, but is still independent of their lengths. Figure 1b illustrates an example where the correlation or cosine distances would be very small,but the Euclidean or Malhalanobis distances would be large. It’s important to note that although the correlation and cosine distance only considers the patterns and not the magnitude of the observations, they still treat each dimension equally and will not usually produce good clusters in practice if the data contains many highly correlated dimensions. In order to account for the possibility of correlated dimensions, we define a weighted cosine distance as follows:
Where w = (w1, w2 .. wn) specifies a weight for each dimension, and I have used the identity matrix I, and some awkward notation, tr(Iw) to show that the distance must be normalized by the sum of the weights. In gene cluster 3.0 these weights are defined heuristically, based on the intuition that genes that are similar to many other genes should be downweighted relative to genes that have few neighbours in the high dimensional space. Usually, these weights are important to getting good results on large datasets.
- Agglomerative clustering
Once the distance metric is chosen, there are still many different possible clustering strategies that can be used. We’ll first describe the one most commonly used in bioinformatics and genomics applications, agglomerative hierarchical clustering. The clustering procedure starts by assigning each datapointto its own cluster. Then, the clusters are searched for the pair with the smallest distance. These are then removed from the list of clusters, and merged into a new cluster containing two datapoints, and the total number of clusters decreases by 1. The process is repeated until the entire dataset has been merged into one giant cluster. Figure 1c illustrates the first three steps of hierarchical clustering in the simple example. Figure 1d illustrates a completed hierarchical tree from this data.
There is one technical subtlety here: how to define the distances between clusters with more than one datapoint inside them?When each datapoint is assigned to its own cluster, it’s clear that the distance between two clusters can be simply the distance between the two datapoints (using one of the distances defined above). However, once we have begun merging clusters, in order to decide what is the closest pair in the next round, we have to be able to calculate the distance between clusterswith arbitrary, different numbers of observations inside them. In practice there are a few different ways to do this, and they differ in the interpretation of the clusters and the speed it takes to compute them. Perhaps simplest is so-called “single linkage” where the distance between two clusters is defined as the *smallest* distance between any pair of datapoints where one is taken from each cluster. However, you might argue that the distance between clusters shouldn’t just reflect the two closest observations – those might not reflect the overall pattern in the cluster very well. A more popular (but a little bit harder to calculate) alternative is the so-called “average linkage” where the distance between two clusters is defined as the average distance between all pairs of datapoints where one is taken from each cluster. These distances are illustrated in figure 2. When average linkage is used to do agglomerative hierarchical clustering, the procedure is referred to as UPGMA, and this is bar far the most common way clustering is done. As long as the clustering is done by joining individual observations into groups and then merging those groups, the process is referred to as “agglomerative”. If, on the other hand, the clustering approach starts with the entire pool, and then tries to cut the dataset into successively smaller and smaller groups, it is known as divisive.
Figure 2 – distances between a cluster and a datapoint. Three observations (X2, X3, X6) have been merged into a cluster (iii) and we now calculate the distance of a datapoint (X5) from the cluster to decide whether it should be added or not. In single linkage (a) we use the distance to the closest point, in complete linkage (c) we use the distance to the furthest point, while for average linkage (b) we calculate the average of the points in the cluster m(x) and calculate the distance to that. Grey arrows indicate the vectors between which we are calculating distances. The Euclidean distance is indicated as a black line, while the angle between vectors is indicated by φ.
At first the whole agglomerative clustering procedure might seem a bit strange: the goal was to group the data into clusters, but we ended merging all the data into one giant cluster. However, as we do the merging, we keep track of the order that each point gets merged and the distance separating the two clusters or datapoints that were merged. This ordering defines a “hierarchy” that relates every observation in the sample to every other. We can then define “groups” by choosing a distance cutoff on how large a distance points within the same cluster can have. In complex datasets, we often don’t know how many clusters we’re looking for and we don’t know what this distance cutoff should be (or even if there will be one distance cutoff for the whole hierarchy). But the hierarchical clustering can still be used to define clusters in the data even in an ad hoc way, by searching the hierarchy for groups that “look good” according to some user-defined criteria. Hierarchies are very natural structures by which to group biological data because they capture the tree-like relationships that are very common. For example, the ribosome is a cluster of genes, but this cluster has within it a cluster that represents 26S particle and another for the 14S particle. T-cells represent a cluster of cells, but this cluster is made up of several sub-types of T-cells.
In the example below, I have hierarchically clustered both the genes and cell types: for the genes, this means finding the groups in a 214 dimensional space (each gene has a measurement in each cell type), while for the cell types this means finding the groups in an 8697 dimensional space (each cell type has a measurement for each gene). The tree (or dendogram) created for the genes and cell types is represented beside the data. In general, if possible it’s always preferential to represent a clustering result where the primary data are still visible. Often, visualizing the primary data makes immediately apparent that clusters (or other structure in the data) are driven by technical problems: missing data, outlierts, etc.
Figure 3 - hierarchical clustering of the ImmGen data using GeneCluster 3.0
Hierarchical clustering with average linkage for 8697 genes (that had the biggest relative expression differences) in 214 cell types. Both genes and cell types were weighted using default parameters. The data matrix is displayed using a heatmap where white corresponds to the highest relative expression level, grey corresponds to average expression and black corresponds to the lowest relative expression (gene expression was log-transformed and mean log-expression level was subtracted from each gene). Left shows a representation of the entire dataset. Right top panel shows a clear “cluster” where the distances (indicated by the depth and height of the dendogram) are small between genes and cells. Right bottom panel shows that this corresponds to a group of immunoglobulin genes that are highly expressed in B cells. The clustering results are visualized using Java Treeview.
- Clustering DNA sequences
To illustrate the power of thinking about molecular biology observations in high-dimensional spaces, I want to now show that a nearly identical clustering procedure is used in an even more “classical” problem in evolutionary genetics and molecular evolution. Here, the goal is to hierarchically organize sequences to form a tree. In this case, the hierarchical structure reflects the shared ancestry between the sequences. Although the task of organizing sequences hierarchically is widely referred to as ‘phylogenetic reconstruction’, in a sense, reconstructing a phylogenetic tree is a way of grouping similar sequences together – the more similar the sequences in the group, the more recently shared a common ancestor. To see the mathematical relationship between this problem and clustering gene expression data, we first need to think about the sequences as multivariate observations that have been drawn from a pool. In a sequence of length, L, each position, j, can be thought of as a vector Xj, where the b-th component of Xj is 1 if the sequence has the DNA base b at that position. For example, the DNA sequence CACGTG would be
Where the bases have been ordered arbitrarily from top to bottom as, A, C, G, T. A protein sequence might be represented as a 20 x L matrix. To compute the distance between two (aligned) gene sequences, Xg and Xh, we could use one of the distances defined above. For example, we could sum up the cosine distance at each position
Where to derive the simple formula on the right, I have used the fact that the norm of the vector at each position in each gene sequence is 1 (each position can only have one 1, all the others must be zeros)and the inner product between two matrices is the sum of the dot products over each of the component vectors. In sequence space, this distance metric is nothing but the number of different nucleotide (or protein) residues between the two vectors, and is a perfectly sensible way to define distances between closely related sequences. Noticethat just like with gene expression distances, this simple metric treats all of the dimensions (types of DNA or protein residues) identically. In practice, certain types of DNA (or protein) changes might be more likely than others, and ideally these should be downweighted when computing a distance.