Presented at Interface ’98, and published in Computing Science and Statistics, 30, 1998, 257-263.

Efficient hierarchical cluster analysis

for data mining and knowledge discovery

David Wishart

Department of Management

University of St. Andrews

St. Katharine’s West, The Scores, St. Andrews

KY16 9AL Scotland


Abstract

The paper compares hierarchical cluster analysis with decision trees for data mining and knowledge discovery applications. It is argued that “top-down” binary decision trees can force orthogonal partitions onto data whose shape, due to correlated variables, might indicate that a non-orthogonal partition is more appropriate, whereas “bottom-up” hierarchical cluster analysis is better at recovering the true shape. A fast algorithm is described for Ward’s method, capable of constructing clustering trees for thousands of observations and therefore suitable for KDD applications. A hybrid clustering method is proposed which combines the best features of Ward’s method and single linkage (nearest neighbor) to resolve the shape of clusters having non-zero covariance. The use of an agglomerative tree for identification is discussed, and the methods are illustrated by reference to the H-R diagram of visual stars. Finally, implementation for Windows is described.

1. Introduction

The data mining and knowledge discovery (KDD) field is characterized by the use of decision trees which derive binary keys that can be used to organize or segment large databases. A wide range of methods have been proposed, and are comprehensively reviewed elsewhere in this volume.

These algorithms generally proceed by selecting a sample or training set from which a decision tree is constructed. The resulting tree may not be optimum, because the initial split of the training set on a single predictor variable may not be the most efficient split for the whole database. Various techniques such as boosting or bagging, coupled with multiple sampling, tree validation and pruning are used to test different training sets and hence to find a consensus decision tree which best reflects the variability in the data.

By contrast, hierarchical agglomerative clustering methods are seldom used for segmentation in KDD applications because most algorithms for these methods are computationally intensive. However, we have devised an iterative least squares algorithm which is, by comparison, very efficient and can be used to construct agglomerative trees for many thousands of cases (Section 5).

We also explore the scope for combining least squares and nearest neighbor clustering to recover the shape of natural clusters and hence attempt an answer to the question “how many clusters are present?”. While an agglomerative tree may not be as easy to interpret as a binary decision tree, we discuss briefly a tree traversal method for identification.

2. The H-R Diagram

The binary decision tree approach makes the inherent assumption that classes can be efficiently defined in terms of orthogonal partitions by splitting, at each stage, on a single variable. This may have the descriptive advantage of providing a simple key to selecting the data in terms of predictor variables, for example when designing data retrieval tools to search a database. However, binary decision trees substantially limit the feasible range of segment definitions considered to orthogonal partitions of the scatter, and may therefore be seductively naive in a multivariate context.

To illustrate this point, we consider the “H-R diagram” shown in Fig. 1. This was first obtained by the astronomer H. N. Russell who plotted the temperature and luminosity of visual stars in 1914 (for details, see Struve and Zebergs, 1962). Most people interpreting this diagram would conclude that the points form two natural swarms or clusters; and indeed, early 20th century astronomers called them giants and dwarfs, respectively.

Whereas the giant star sequence is reasonably spherical and well-defined in terms of low temperature and luminosity (the axes are inverted), this is not true of the dwarf star sequence. To classify the dwarf stars it is necessary to transform the data to the ratio luminosity ÷ temperature, which is approximately constant along the dwarf sequence.

By splitting on a single variable, the decision tree approach fails to find the dwarf star sequence in its entirety. There is no single split on either luminosity (line A) or temperature (line B) which separates giants and dwarfs. The best split that can be achieved by a decision tree is on both luminosity (line A) and temperature (line B) to form the four quarters, as summarized in Fig. 3.

Figure 1: The H-R diagram for visual stars, showing the two recognised clusters of giants and dwarfs. Lines A and B indicate possible decision tree splits on luminosity or temperature, while line C indicates a partition which minimizes the Euclidean sum of squares for two clusters.

It can therefore be seen from this simple two-dimensional example that the decision tree approach forces partitions which are orthogonal to the co-ordinate axes, regardless of any shape in the data. The results are heavily influenced by any pre-processing or transformation of the variables.

However, the decision tree succeeds in resolving a fairly reasonable split of the H-R diagram at two levels, where the dwarf star sequence is partitioned into roughly two halves. Whether this is a satisfactory outcome would depend upon the objective of the analysis. If the objective is to partition the data efficiently into tight subsets, the decision tree is effective. It has the presentational advantage of providing a descriptive key with which to subdivide the data.

For a further discussion of the difficulty of recovering the structure within a dataset, see Rocke (1998).

3. Hierarchical Agglomerative Clustering versus Decision Trees

Hierarchical agglomerative clustering methods appear not to have been seriously used for segmentation in KDD applications, because of their computational complexity. However, there are a number of reasons why these methods should not be rejected.

The agglomerative or “bottom up” hierarchical approach has the advantage of finding, at a given level, any number of compact, spherically-shaped clusters which cover the sample density, as illustrated in Fig. 2. This shows how the H-R diagram comprising several thousand sample points can be mapped by a tessellation of about 80 small round clusters.

Figure 2: The H-R diagram of Fig. 1 has been simplified to a tessellation of 80 spherically-shaped clusters generated by Ward’s method, which map the densest parts of the scatter.

We contend that the “top down” decision tree approach has inherently greater risk of mis-classification by inefficiently splitting on a single variable than the “bottom up” approach. Each split in a decision tree is univariate by definition, and this limits the range of possible segments available for consideration. By comparison, the agglomerative approach is multivariate and exploratory, and allows a wider range of feasible segments to be investigated in terms of the actual distribution of the scatter.

It may aid interpretation to have a simple binary decision key for classifying data, expressed in terms of whether predictor variables are above or below certain threshold values. But data do not naturally conform to orthogonal partitions when the variables are correlated, as often happens in a multivariate context. When many variables are presented for analysis, there are more likely to be inter-correlations present and, therefore, any natural clusters are likely to have shapes which are not orthogonal to the co-ordinate axes. In this respect, decision trees may, in many instances, produce naive partitions as illustrated for the H-R diagram by the threshold values A and B in Fig. 1 and the binary tree of Fig. 3.

4. Ward’s Method

Users of cluster analysis will be familiar with the advantages and disadvantages of different hierarchical agglomerative methods. Ward’s method performs in a manner similar to the decision tree approach: at each stage it combines the two clusters whose union minimizes the squared error function, or Euclidean sum of squares, E. The objective function E is defined as follows:

E = kikj (xij - kj)2

where, for a given classification, observation i belongs to cluster k, x_{ij} is the value of variable j for observation i, and \bar{x}_{kj} is the mean of variable j in cluster k. The summations run, firstly, over all variables j to give the squared error of observation i in cluster k; secondly, over all observations i within cluster k; and finally, over all clusters k to total the error.
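As a worked illustration (a minimal sketch, not the paper's own code; the function name and the small data matrix are invented for the example), E can be computed directly from a data matrix and a vector of cluster assignments:

```python
import numpy as np

def euclidean_sum_of_squares(X, labels):
    """Total squared error E: for each cluster, sum the squared deviations
    of its members from the cluster mean, over all variables."""
    E = 0.0
    for k in np.unique(labels):
        members = X[labels == k]        # observations assigned to cluster k
        centre = members.mean(axis=0)   # mean of each variable in cluster k
        E += ((members - centre) ** 2).sum()
    return E

# Example: two obvious clusters in two variables
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
print(euclidean_sum_of_squares(X, labels))
```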

Standard Ward’s method clustering proceeds as follows:

a) First compute a matrix of squared Euclidean distances for all pairs of observations p and q:

d_{pq}^2 = \sum_{j} (x_{pj} - x_{qj})^2

b) Combine the two cases p and q whose union results in minimum E. This will initially be the two cases for which d_{pq}^2 is minimum.

c) Transform the distances d_{pk}^2 to E_{pk}, the squared error for the union of cluster pq with all other cases/clusters k.

d) Repeat steps b) and c), at each subsequent step combining the two cases or clusters whose union results in minimum E.

e) Finish when all the cases are grouped into 1 cluster.

Figure 3: Example of a binary decision tree for the H-R diagram which partitions the scatter into 4 segments. Note that the axes in Fig. 1 are inverted.

The matrix of squared distances {d_{pq}^2} is symmetric, and therefore requires storage of O(½n²). It will be evident that there are normally n-1 steps c) in the algorithm; and since each step involves searching the distance matrix for a new minimum, the computing time is O(n³). The algorithm is therefore computationally intensive for large n, and standard statistical packages which offer this method cannot perform efficiently for more than about 1500 observations.
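Purely for concreteness, the sketch below mirrors the standard procedure of steps a)–e), although it carries cluster means and sizes rather than the stored distance matrix; the quantity minimized at each union is the resulting increase in E. The deliberately naive rescan of all pairs at every step is what makes the standard approach O(n³):

```python
import numpy as np

def ward_naive(X):
    """Naive agglomerative Ward clustering: at every step merge the pair of
    clusters whose union gives the smallest increase in E (illustrative only)."""
    clusters = [(x.astype(float), 1) for x in X]   # (mean, size) for each cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, na), (mb, nb) = clusters[a], clusters[b]
                # Increase in E caused by merging clusters a and b
                delta = (na * nb) / (na + nb) * np.sum((ma - mb) ** 2)
                if best is None or delta < best[0]:
                    best = (delta, a, b)
        delta, a, b = best
        (ma, na), (mb, nb) = clusters[a], clusters[b]
        merged = ((na * ma + nb * mb) / (na + nb), na + nb)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        merges.append((a, b, delta))
    return merges                                   # n-1 unions in total
```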

5. Fast Cluster Algorithm

However, we have implemented a fast algorithm for Ward’s method by a series of iterations at each of which many unions are completed simultaneously (Murtagh, 1985; Wishart, 1985). Furthermore, the algorithm does not require the storage of a squared Euclidean distance matrix, but operates solely on the observational data. The storage constraint is therefore O(nv), where v is the number of variables; and the computing time is O(n²).

For example, when clustering 2,000 observations on 4 variables the algorithm required 32 iterations, compared to 1,999 fusion steps by the standard algorithm. The tree for Ward’s method on this dataset was obtained in about 5 seconds using a P133 laptop at Interface ’98.

It is possible to apply our algorithm to many thousands of cases because the main operational constraint is the requirement to store the observational data in memory. For example, a tree of 30,000 cases has been produced in a data mining application for a bank (Goulet and Wishart, 1996).

Further reductions in computing time are achieved using binary arithmetic instead of floating-point arithmetic. While this may result in approximate values for E due to truncation on binary division, this is of limited importance in the context of partitioning the scatter into an efficient tessellation of tight spherical clusters. What is important is that we can now achieve very efficient hierarchical agglomerative clustering on many thousands of cases, in a comparatively small number of iterations.
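The fast algorithm itself is implemented in Clustan. Purely as an indication of what Ward clustering at this scale looks like with off-the-shelf tools, SciPy's linkage function also builds a Ward tree directly from a data matrix, although, unlike the algorithm described above, it computes a pairwise distance matrix internally:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))        # e.g. 2,000 observations on 4 variables

Z = linkage(X, method='ward')         # (n-1) x 4 table of unions
labels = fcluster(Z, t=100, criterion='maxclust')   # cut the tree at ~100 clusters
```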

Figure 4: Partition of the H-R diagram by Ward’s method, final 80 clusters. The 2-cluster split corresponds to partition line C in Fig. 1.

6. Spherical Clusters

It is well known that Ward’s method tends to force observations into spherically-shaped clusters which all have minimum squared error E_k. Fig. 2 illustrates the result obtained by stopping Ward’s method at about 100 clusters for the data in the H-R diagram and discarding all lightly-populated clusters corresponding to noise. Since there are a large number of clusters and they are all relatively small, the resulting tessellation does a good job of mapping the densest part of the scatter.

However, this is not so evident from the remainder of the Ward’s method tree, that is, the fusion of the final 80 clusters reproduced in Fig. 4. The 2-cluster split corresponds to the minimum E partition line C in Fig. 1 which bisects the dwarf star sequence into two parts. The 3-cluster split corresponds closely to the decision tree result of Fig. 3 but without a dwarf star noise cluster. As with the decision tree approach, Ward’s method fails to separate the dwarf star sequence in its entirety.

7. Single Linkage

We therefore need an alternative method which recovers the structure within the dwarf star sequence, or indeed in any elongated cluster which has some internal covariance structure. A clustering method capable of doing this is single linkage, or nearest neighbor. This method scores the similarity between any two clusters as the single link distance between their nearest neighbors.

The results of single linkage can be displayed in the form of a minimum spanning tree. This is the graph of least overall length which connects all the points, and has no circuits. It is illustrated for the H-R diagram in Fig. 5, where the longest link in the tree corresponds to the distance between the dwarf and giant star sequences at their closest points.
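As an illustrative sketch (the array `centres` is an invented stand-in for the cluster means of the Ward tessellation), the minimum spanning tree and the corresponding single-linkage partition can be obtained as follows:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.cluster.hierarchy import linkage, fcluster

centres = np.random.default_rng(1).normal(size=(80, 2))   # stand-in for 80 cluster means

# Minimum spanning tree over the cluster centres: the longest edge in this
# graph is the single-link distance between the two final clusters.
mst = minimum_spanning_tree(squareform(pdist(centres)))

# Equivalent single-linkage clustering; cutting at 2 clusters removes that edge.
labels = fcluster(linkage(centres, method='single'), t=2, criterion='maxclust')
```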

Figure 5: Having reduced the H-R diagram to a tessellation of 80 representative clusters located in the densest regions of the scatter, the giant and dwarf star sequences can now be mapped by the minimum spanning tree. The dotted line indicates the separation of the final two clusters, and the single linkage tree for this graph is shown in Fig. 6.

Single linkage has two disadvantages: firstly, the similarity between two clusters is defined only in terms of the distance between two points; and secondly, as a result it can fail to recover clusters which are poorly separated because of chaining, whereby a chain of points is established across the saddle density between two modes. For the H-R diagram, single linkage would tend to merge giants and dwarfs by bridging the noise between the two scatters.

We have discussed elsewhere a kth nearest-neighbor method for reducing chaining (Wishart, 1969), which was later modified to a hierarchical density-seeking method (Wishart, 1973). Both of these methods are included in Clustan (Wishart, 1987). However, they are computationally intensive because, like most other hierarchical clustering methods, they require the calculation and storage of a proximity matrix from which kth nearest neighbors are derived. The computing time is O(n³) and the storage requirement is O(½n²), as for standard Ward’s method.

8. Hybrid Cluster Method

We therefore propose to combine the best features of Ward’s method and single linkage, avoiding forcing spherically-shaped final clusters where there is shape in the sample density, and at the same time reducing chaining by single linkage.

The first stage of this hybrid method is to complete a hierarchical agglomerative clustering by Ward’s method to obtain a representative tesselation of the sample space at a fairly low level of, say, 100 clusters. Because of the nature of our fast clustering algorithm for Ward’s method, this can now be computed efficiently for many thousands of cases.

Figure 6: Single linkage tree for the tesselation of cluster centres and the minimum spanning tree. The giant and dwarf star sequences are now clearly separated at the 2-cluster level.

We thus obtain a summary of the sample space expressed in terms of about 100 cluster means, each with an associated weight, the number of observations it contains. Clusters of low weight constitute the noise in the data, or the sparsely populated parts of the scatter, and are removed. The resulting tessellation covering the densest parts of the sample space is as illustrated for the H-R diagram in Fig. 2. In practice the tessellation does not consist of spheres but is, rather, composed of the subsets of points which make up the clusters. Their shapes are therefore typically polyhedral in v dimensions.

We then use single linkage to map the shapes of the underlying elongated clusters, as in Fig. 6. This finds the giant and dwarf star sequences of the H-R diagram at the 2-cluster level.
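A minimal sketch of the whole hybrid scheme, using SciPy in place of the Clustan implementation (the parameters `n_ward`, `min_weight` and `n_final` are invented for the example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hybrid_cluster(X, n_ward=100, min_weight=5, n_final=2):
    """Sketch of the hybrid scheme: Ward tessellation, noise removal,
    then single linkage on the surviving cluster centres."""
    # Stage 1: Ward's method down to a tessellation of about n_ward clusters
    ward_labels = fcluster(linkage(X, method='ward'), t=n_ward,
                           criterion='maxclust')

    # Summarize each cluster by its mean and weight (number of members)
    ids, weights = np.unique(ward_labels, return_counts=True)
    centres = np.array([X[ward_labels == k].mean(axis=0) for k in ids])

    # Discard lightly populated clusters as noise
    keep = weights >= min_weight
    centres, ids = centres[keep], ids[keep]

    # Stage 2: single linkage on the cluster centres
    centre_labels = fcluster(linkage(centres, method='single'),
                             t=n_final, criterion='maxclust')

    # Map each observation to the label of its Ward centre (noise -> 0)
    lookup = {k: lab for k, lab in zip(ids, centre_labels)}
    return np.array([lookup.get(k, 0) for k in ward_labels])
```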

9. Resolving Cluster Shape

Having thus simplified the bulk of a large dataset to a tessellation of tight clusters with associated means, we can now examine the clusters more easily for internal covariance structure. For example, a correlation or covariance matrix of the subset of points within each cluster could be obtained, and the points could be projected into the sub-space of principal components or principal co-ordinates.

In Fig. 7 we illustrate how regression lines could be fitted to the two clusters found by the single linkage step of the proposed hybrid clustering method.
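For instance (a sketch only; the helper name is invented), each cluster's covariance matrix, leading principal axis and, in two dimensions, a regression slope can be extracted as follows:

```python
import numpy as np

def cluster_shape(X, labels):
    """For each cluster, report its covariance matrix, its first principal
    axis and, for 2-D data, a least-squares regression slope."""
    shapes = {}
    for k in np.unique(labels):
        members = X[labels == k]
        if len(members) < 2:
            continue                               # too small to describe a shape
        cov = np.cov(members, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        shapes[k] = {
            'covariance': cov,
            'principal_axis': eigvecs[:, -1],      # direction of largest variance
            'slope': (np.polyfit(members[:, 0], members[:, 1], 1)[0]
                      if members.shape[1] == 2 else None),
        }
    return shapes
```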

10. Case Identification

Hierarchical agglomerative clustering trees are not as intuitively interpretable as binary decision trees. But we have argued that partitions which are orthogonal to the co-ordinate axes may be naive when the variables are inter-correlated. We have, therefore, to adopt a multivariate approach to case identification using a hierarchical agglomerative tree.

A stepwise identification method is provided by procedure Classify of Clustan/PC. It traverses a tree, such as the one shown in Fig. 4, from right to left to find the clusters of best fit at each level. Unlike standard Ward’s method, our fast hierarchical clustering algorithm computes and stores cluster means as a by-product, so the means needed for these comparisons are available at every level.

Figure 7: Having separated the two clusters of giant and dwarf stars by applying the tree in Fig. 6 to the data in Fig. 2, regression lines can be fitted to each cluster separately to reveal their shape.

Classify starts by comparing a new case with each of the last 2 clusters, in effect with partition line C in Fig. 1. Having found which cluster centre is nearest, it proceeds down that branch of the tree, examining at each stage the next two clusters in the current branch. To reduce the possibility of mis-classification, a look-ahead variation has been introduced (Wishart, 1986).

Classify is very fast. For example, it would require about 10-15 cluster comparisons to identify which sphere (if any) in the tessellation of Fig. 2 envelops a new case. It also has the benefit of finding unbiased estimates of E where there are missing values in the data (see Wishart, 1986).
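The Classify procedure itself belongs to Clustan; the sketch below reproduces the same top-down idea on a SciPy linkage tree (function names are invented). Cluster means are accumulated bottom-up as a by-product of the unions, and a new case then descends from the root, at each node moving to the child whose mean is nearer:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def node_means(Z, X):
    """Mean and size of every node in the tree (leaves 0..n-1, internal n..2n-2)."""
    n = X.shape[0]
    means = list(X.astype(float))
    sizes = [1] * n
    for left, right, _, _ in Z:
        l, r = int(left), int(right)
        size = sizes[l] + sizes[r]
        means.append((sizes[l] * means[l] + sizes[r] * means[r]) / size)
        sizes.append(size)
    return means

def classify(Z, means, case, levels=10):
    """Descend from the root, choosing the nearer child mean at each of
    `levels` steps; returns the node id of best fit."""
    n = len(means) - len(Z)                    # number of original observations
    node = len(means) - 1                      # root of the tree
    for _ in range(levels):
        if node < n:                           # reached a single observation
            break
        left, right = int(Z[node - n][0]), int(Z[node - n][1])
        d_left = np.sum((case - means[left]) ** 2)
        d_right = np.sum((case - means[right]) ** 2)
        node = left if d_left <= d_right else right
    return node

# Usage sketch: Z = linkage(X, method='ward'); classify(Z, node_means(Z, X), new_case)
```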