AMS575 Cluster Analysis

I. Introduction

Cluster analysis is the grouping of individuals in a population in order to discover structure in the data. Ideally, we would like the individuals within a group to be close or similar to one another, but dissimilar from individuals in other groups. Clustering is fundamentally a collection of methods for data exploration. However, the results of a cluster analysis may reveal identifiable structure that can be used to generate hypotheses (to be tested on a separate dataset) to account for the observed data.

Before attempting a classification, it is important to understand the problem you wish to address. Different classifications, with consequently different interpretations, can be imposed on a sample, and the choice of variables is very important.

Related to clustering is clumping, which allows an object to belong to more than one group.

We will discuss the following topics:

  1. Hierarchical methods: methods that derive a clustering from a given dissimilarity matrix;
  2. Non-hierarchical methods: K-means method, fuzzy K-means.

II. Hierarchical methods

Hierarchical clustering procedures are the most commonly used clustering methods. A hierarchical tree is a nested set of partitions represented by a tree diagram or dendrogram. Sectioning a tree at a particular level produces a partition into several disjoint groups. An example of a hierarchical classification is the classification of the animal kingdom: each species belongs to a series of nested clusters of increasing size with a decreasing number of common characteristics.

There are several different algorithms for finding a hierarchical tree. An agglomerative algorithm begins with n subclusters, each containing a single data point and at each stage merges the two most similar groups to form a new cluster. A divisive algorithm operates by successively splitting groups beginning with a single group and continuing until there are n groups, each of a single individual. Generally, divisive algorithms are computationally inefficient.
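To make the agglomerative scheme concrete, here is a minimal PROC IML sketch of the generic merge loop. It uses the single-linkage rule (defined below) to update distances after each merge; the 4x4 dissimilarity matrix and all names are made up for illustration.

proc iml;
/* Made-up symmetric dissimilarity matrix with zero diagonal */
D = {  0  2  6 10,
       2  0  5  9,
       6  5  0  4,
      10  9  4  0 };
n = nrow(D);
active = j(n, 1, 1);                /* 1 = cluster still in play */
do step = 1 to n - 1;
   /* find the closest pair of active clusters */
   best = 1e300;  bi = 0;  bj = 0;
   do i = 1 to n - 1;
      do j = i + 1 to n;
         if (active[i] = 1) & (active[j] = 1) & (D[i, j] < best) then do;
            best = D[i, j];  bi = i;  bj = j;
         end;
      end;
   end;
   print step bi bj best;           /* the merge made at this stage */
   /* single-linkage update: d(bi, k) = min(d(bi, k), d(bj, k)) */
   do k = 1 to n;
      if (active[k] = 1) & (k ^= bi) & (k ^= bj) then do;
         D[bi, k] = min(D[bi, k], D[bj, k]);
         D[k, bi] = D[bi, k];
      end;
   end;
   active[bj] = 0;                  /* cluster bj is absorbed into bi */
end;
quit;

Each pass finds the closest pair of remaining clusters, records the merge, and collapses one cluster into the other. By contrast, a divisive step would have to search over the 2^(n-1) - 1 ways of splitting a group into two, which is one reason divisive algorithms are computationally inefficient.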

We will study four major hierarchical methods: single-linkage, complete-linkage, average-linkage, and Ward’s method.


Single-Linkage (nearest-neighbor) Method. The distance between two groups A and B is the distance between their closest members, i.e.

    d(A, B) = min{ d(a, b) : a ∈ A, b ∈ B }.
Thus two objects a and b belong to the same single-link cluster at level d if there exists a chain of intermediate objects linking them such that all the distances along the chain are no more than d. The names “single-linkage” and “nearest neighbor” reflect the fact that it takes only a single link to join two distinct groups and that the distance between two groups is the distance between their closest members.

A consequence of this joining by a single link is that some groups can become elongated, with distant points that have little in common being grouped together because a chain of intermediate objects connects them (the chaining effect).

Ex1. Perform single-linkage clustering based on the following dissimilarity matrix:

        1     2     3     4     5     6
  1     0     4    13    24    12     8
  2           0    10    22    11    10
  3                 0     7     3     9
  4                       0     6    18
  5                             0   8.5
  6                                   0
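As a check on the hand computation in Ex1, the matrix can be fed to PROC CLUSTER directly. This is a minimal sketch; the data set and object names (ex1, o1-o6) are made up, and the matrix is entered in full symmetric form.

data ex1 (type = distance);
   input obj $ d1-d6;
   datalines;
o1   0    4   13   24   12    8
o2   4    0   10   22   11   10
o3  13   10    0    7    3    9
o4  24   22    7    0    6   18
o5  12   11    3    6    0    8.5
o6   8   10    9   18    8.5  0
;

proc cluster data = ex1 method = single nosquare outtree = tree;
   id obj;                          /* cluster history shows each merge and its level */
run;

proc tree data = tree;              /* dendrogram */
run;

The first merge should be {3, 5} at level 3, since d(3, 5) = 3 is the smallest off-diagonal entry; substituting method = complete, average, or ward in the same program addresses Ex2-Ex4.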


Complete-Linkage (furthest-neighbor) Method. The distance between two groups A and B is the distance between their two furthest members, i.e.

    d(A, B) = max{ d(a, b) : a ∈ A, b ∈ B }.
As before, the closest groups are merged at each stage; the difference between this method and the single-linkage method is the measure of distance between groups. The groups found by sectioning the complete-linkage dendrogram at level d have the property that every pair of members within a group has distance less than d. The method concentrates on the internal cohesion of groups, in contrast to the single-linkage method, which seeks isolated groups.

Ex2. Please perform complete-linkage clustering based on the dissimilarity matrix in Ex1.


Average-Linkage Method. The distance between two groups A and B is the average distance between all pairs of individuals, one from each group, i.e.

    d(A, B) = (1 / (n_A n_B)) Σ_{a ∈ A} Σ_{b ∈ B} d(a, b),

where n_A and n_B are the numbers of individuals in A and B.
Ex3. Please perform average-linkage clustering based on the dissimilarity matrix in Ex1.

Ward’s Hierarchical Clustering Method. At each stage of the algorithm, the two groups that produce the smallest increase in the total within-group sum of squares are merged. The dissimilarity between two groups is defined to be the increase in the total sum of squares that would result if they were joined.
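Although the closed form is not derived in these notes, it is standard and helps when computing the method by hand: for Euclidean data, merging groups A and B increases the total within-group sum of squares by

    I(A, B) = (n_A n_B / (n_A + n_B)) ||x̄_A − x̄_B||^2,

where n_A and n_B are the group sizes and x̄_A and x̄_B are the group mean vectors. Ward’s method therefore merges, at each stage, the two groups whose size-weighted means are closest.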

Ex4. Please perform clustering using Ward’s method based on the dissimilarity matrix in Ex1.

The complete-linkage, average-linkage, and Ward’s methods tend to concentrate on internal cohesion, producing homogeneous, compact (often spherical) groups.

III. Non-hierarchical methods

K-means Method. The aim of the k-means algorithm is to partition the data into k clusters so that the within-group sum of squares is minimized. The simplest form of the algorithm alternates two procedures: the first assigns objects to groups, where an object is usually assigned to the group whose mean is closest in the Euclidean sense; the second recalculates the group means based on the new assignment. The process terminates when no movement of an object to another group will reduce the within-group sum of squares.
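A minimal PROC FASTCLUS sketch of this alternating procedure; the data set, the variable names, and the choice k = 2 are made up for illustration.

data pts;
   input x1 x2;
   datalines;
1.0  1.0
1.5  2.0
3.0  4.0
5.0  7.0
3.5  5.0
4.5  5.0
3.5  4.5
;

/* k-means with k = 2: assign each point to the nearest cluster mean,
   then recompute the means; MAXITER allows the two steps to iterate */
proc fastclus data = pts maxclusters = 2 maxiter = 100 out = clus;
   var x1 x2;
run;

proc print data = clus;             /* the CLUSTER variable gives each point’s group */
run;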

Ex5. (Example 12.12 in J&W, 2nd edition, pages 695-696).

Fuzzy K-means Method. The partitioning methods described so far in this chapter have the property that each object belongs to one group only. The basic idea of fuzzy clustering is that patterns are allowed to belong to all clusters with different degrees of membership (probabilities). The first generalization of the k-means algorithm was presented by Dunn (1974). The fuzzy k-means algorithm attempts to find a solution for the parameters P_ji (i = 1, …, n; j = 1, …, g) for which

    J_r = Σ_{j=1}^{g} Σ_{i=1}^{n} (P_ji)^r ||x_i − m_j||^2

is minimized subject to the constraints

    Σ_{j=1}^{g} P_ji = 1 for each i = 1, …, n, and P_ji ≥ 0.
The parameter P_ji represents the degree (probability) of association of the ith pattern or object with the jth group. The parameter r (≥ 1) is a scalar termed the weighting exponent, which controls the ‘fuzziness’ of the resulting clusters, and m_j is the centroid of the jth group:

    m_j = ( Σ_{i=1}^{n} (P_ji)^r x_i ) / ( Σ_{i=1}^{n} (P_ji)^r ).
As r → 1, this algorithm tends to the basic k-means algorithm; in that case, we know that a minimum of J_r gives values for the P_ji that are either 0 or 1.
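Although the updates are not spelled out above, the standard fuzzy k-means iteration alternates two closed-form steps that each decrease J_r: with the centroids m_j fixed, the optimal memberships for r > 1 are

    P_ji = (1 / ||x_i − m_j||^2)^{1/(r−1)} / Σ_{k=1}^{g} (1 / ||x_i − m_k||^2)^{1/(r−1)},

and with the memberships fixed, the centroids are recomputed from the formula for m_j above. The two steps are repeated until the memberships stabilize.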

IV. SAS procedures for clustering

Hierarchical clustering – PROC CLUSTER; see the SAS on-line manual.

Non-hierarchical clustering – PROC FASTCLUS; see the SAS on-line manual.

The master index for the SAS on-line manual is available at MIT.

/* Jungnam’s SAS program for average-linkage analysis. */

data a (type = distance);
   infile "JN\Microarray\p_corr";
   input gene $ cor1-cor1606 @@;    /* one gene ID plus 1606 dissimilarity values per row */
run;

/* Average-linkage hierarchical clustering of the genes;
   PSEUDO prints pseudo F and t**2 statistics */
proc cluster data = a method = average
             outtree = output print = 50 pseudo nosquare;
   id gene;
run;

/* Cut the tree into 17 clusters */
proc tree data = output ncl = 17 out = outp;
   copy gene;
run;

References:

1. A. Webb. Statistical Pattern Recognition. Arnold / Oxford University Press, 1999.

2. J.C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32-57, 1974.

3. R.A. Johnson & D.W. Wichern. Applied Multivariate Statistical Analysis. 5th edition. Prentice Hall, 2002.