### Chapter 5-24. Cluster Analysis

Cluster analysis is one of the many multivariate techniques, where you can have one or more outcome variables.

It is a method to classify observations into natural underlying groups. The goal of the clustering is to have observations in the same cluster be more alike than observations in other clusters.

[Another multivariate analysis, factor analysis, works on variables, rather than observations, finding underlying factors that each subset of variables appear to have in common. This approach is used to find subscales when developing a psychometric test.]

One of many ways to do this is to use Euclidean distance as the (dis)similarity measure, where the distance between observations in r dimensions, formed by the r outcome variables, is closest for observations within the same cluster. Euclidean distance is defined as:

where to are the variables for individual i.

**Drug Dose Example**

If you were doing a study of the association of the opioids to mortality or to adverse drug events, a problem you would encounter is coming up with dose categories to include in a regression model. Established dose categories are not available. There is not even a consensus among physicians of what dose level is safe. In order to avoid the potential criticism of “you picked the dose categories that gave you significance”, you would like to use an “objective” categorization. One approach would be quartiles, but there is no reason why that would provide clinically meaningful cut-points. Possibly a better approach is to use cluster analysis, where the algorithm finds natural clusters of doses. In this sense, the dose categories are somewhat of a consensus of popular dose levels, at least among the physicians who prescribed the doses in your dataset.

**Morphine Equivalent Dose Dataset**

The dataset morphequivdose.dta contains n = 40 observations of various specific opioid prescriptions to treat severe pain. The doses are converted to “morphine equivalent of opioids (mg/day)”. The data are hypothetical but resemble the distribution of actual data.

Source: Greg Stoddard

______

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.

Reading in the data,

Open

Find the directory where you copied the course CD

Change to the subdirectory datasets & do-files

Single click on morphequivdose.dta

Open

use "C:\Documents and Settings\u0032770.SRVR\Desktop\

Biostats & Epi With Stata\datasets & do-files\morphequivdose.dta",

clear

* which must be all on one line, or use:

cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"

cd "Biostats & Epi With Stata\datasets & do-files"

use morphequivdose, clear

Obtaining a frequency table,

StatisticsSummaries, tables & tests

Tables

Oneway tables

Categorical Variable: morphequiv

OK

tabulate morphequiv

morphine |

equivalent |

of opioids |

(mg/day) | Freq. Percent Cum.

------+------

1.5 | 1 2.50 2.50

6 | 1 2.50 5.00

7.5 | 1 2.50 7.50

22.5 | 1 2.50 10.00

30 | 1 2.50 12.50

39 | 1 2.50 15.00

45 | 3 7.50 22.50

57 | 1 2.50 25.00

60 | 2 5.00 30.00

67.5 | 1 2.50 32.50

90 | 4 10.00 42.50

110 | 1 2.50 45.00

120 | 1 2.50 47.50

159 | 1 2.50 50.00

180 | 2 5.00 55.00

192 | 1 2.50 57.50

210 | 1 2.50 60.00

240 | 3 7.50 67.50

255 | 1 2.50 70.00

270 | 1 2.50 72.50

300 | 1 2.50 75.00

330 | 1 2.50 77.50

360 | 2 5.00 82.50

422 | 1 2.50 85.00

460 | 1 2.50 87.50

555 | 1 2.50 90.00

690 | 1 2.50 92.50

706 | 1 2.50 95.00

810 | 1 2.50 97.50

1455 | 1 2.50 100.00

------+------

Total | 40 100.00

We see a wide range of doses, where no real obvious natural gaps show up to permit classification into clusters.

Obtaining a histogram,

Easy graphs

Histogram

Main tab: Variable: morphequiv

Continuous data

OK

histogram morphequiv

This graph perhaps suggests cutpoints (low dose = 0-249)(moderate dose = 250-499)

(high dose = 500-999)(very high dose = 1000+), but we have to be careful because it has low resolution (too few bins, or bars).

Obtaining a histogram with more bars and better labels,

histogram morphequiv

, bin(30) percent

xtitle("morphine equivalent of opioids (mg/day)")

ytitle("percent times dose prescribed")

xlabels(0(50)1500,angle(vertical))

;

#delimit cr

Our prior cutpoints

(low dose = 0-249)(moderate dose = 250-499)

(high dose = 500-999)(very high dose = 1000+)

still seem okay, except it is not clear where to divide low dose and moderate dose.

When we look at the histogram, we are trying to find four sub-histograms, with as little overlap as possible.

That is the same as saying we are looking for four means, where the data around the means are closer to their own group mean than to any other group mean.

That is the approach behind k-means cluster analysis .

**K Means Cluster Analysis**

This approach uses the following iterative algorithm (Stata Multivariate Statistics Reference Manual Release 10, 2007, p.119),

“Kmeans and kmedians clustering are iterative procedures that partitions the data into k groups or clusters. The procedure begins with k initial group centers. Observations are assigned to the groups with the closest center. The mean or median of the observations assigned to each of the groups is computed, and the process is repeated. These steps continue until all observations remain in the same group form the previous iteration.”

This exact same algorithm is described, but not so succiently, in Shannon (2008, p.356). The Shannon citation is preferable for inclusion in an article, since textbooks are better citations than software user manuals.

Obtaining a k-means cluster analysis, requesting four dose groups,

StatisticsMultivariate analysis

Cluster analysis

Cluster data

Kmeans

Main tab: Variables: morphequiv

K (the number of groups): 4

(Dis)similarity measure: Continuous: L2 or Euclidean

Name this cluster analysis: kmeans4

Options tab: Initial group centers: K unique random observations

Random number seed: 999

Advanced tab: Generated grouping variable name: dosegrp

OK

capture cluster drop kmeans4

capture drop dosegrp

cluster kmeans morphequiv, k(4) measure(L2) ///

name(kmeans4) start(krandom(999)) generate(dosegrp)

This finishes running, but does not display anything.

To see the clusters produced, use

Summary tables & tests

Tables

Table of summary statistics (tabstat)

Main tab: Variables: morphequiv

Group statistics by variable: dosegrp

Statistics to display: minimum

Statistics to display: maximum

Options tab: Use columns as: statistics

No overall statistic

OK

tabstat morphequiv, statistics( min max ) by(dosegrp) ///

nototal columns(statistics)

Summary for variables: morphequiv

by categories of: dosegrp

dosegrp | min max

------+------

1 | 240 555

2 | 690 1455

3 | 1.5 67.5

4 | 90 210

------

Our prior cutpoints were:

Low dose = 0-249

Moderate dose = 250-499

High dose = 500-999

Very high dose = 1000+

Examining the minimum and maximums, and interpolating the cutpoints, we might use:

Low dose = 0-80

Moderate dose = 81-225

High dose = 226-600

Very high dose = 601+

Notice the Stata returned the groups out of order. To fix this, so that we have an ordinal-scaled group variable, we can use

recode dosegrp (3=1)(4=2)(1=3)(2=4) , gen(dosegroup)tabstat morphequiv, statistics( min max ) by(dosegroup) ///

nototal columns(statistics)

Summary for variables: morphequiv

by categories of: dosegroup (RECODE of dosegrp)

dosegroup | min max

------+------

1 | 1.5 67.5

2 | 90 210

3 | 240 555

4 | 690 1455

------

Obtaining a histogram with the cutpoints delineated,

Low dose = 0-80

Moderate dose = 81-225

High dose = 226-600

Very high dose = 601+

#delimit ;histogram morphequiv

, bin(30) percent

xtitle("morphine equivalent of opioids (mg/day)")

ytitle("percent times dose prescribed")

xlabels(0(50)1500,angle(vertical))

xline(80.5) xline(225.5) xline(600.5)

;

#delimit cr

From this graph, it is not convincing that the cutpoints for the clusters should be as shown. This is partly due to the bin size of the bars.

It is more convincing to look at the frequency table.

Low dose = 0-80

Moderate dose = 81-225

High dose = 226-600

Very high dose = 601+

morphine |

equivalent |

of opioids |

(mg/day) | Freq. Percent Cum.

------+------

1.5 | 1 2.50 2.50

6 | 1 2.50 5.00

7.5 | 1 2.50 7.50

22.5 | 1 2.50 10.00

30 | 1 2.50 12.50

39 | 1 2.50 15.00

45 | 3 7.50 22.50

57 | 1 2.50 25.00

60 | 2 5.00 30.00

67.5 | 1 2.50 32.50

90 | 4 10.00 42.50

110 | 1 2.50 45.00

120 | 1 2.50 47.50

159 | 1 2.50 50.00

180 | 2 5.00 55.00

192 | 1 2.50 57.50

210 | 1 2.50 60.00

240 | 3 7.50 67.50

255 | 1 2.50 70.00

270 | 1 2.50 72.50

300 | 1 2.50 75.00

330 | 1 2.50 77.50

360 | 2 5.00 82.50

422 | 1 2.50 85.00

460 | 1 2.50 87.50

555 | 1 2.50 90.00

690 | 1 2.50 92.50

706 | 1 2.50 95.00

810 | 1 2.50 97.50

1455 | 1 2.50 100.00

------+------

Total | 40 100.00

Notice that 67.5 is quite close to 60, while 90 is closer to 110. Similarly with the other cutpoints.

One drawback of the k-means approach is that you have to specify the number of groups in advance, rather than let the procedure find the optimal number of groups.

We could loop through several choices, and then decide which number of groups appeals to us.

forval i=2/6 {capture cluster drop kmeans`i’

capture drop dosegrp`i’

cluster kmeans morphequiv, k(`i’) measure(L2) ///

name(kmeans`i’) start(krandom(999)) generate(dosegrp`i’)

tabstat morphequiv, statistics( min max ) by(dosegrp`i’) ///

nototal columns(statistics)

}

Summary for variables: morphequiv

by categories of: dosegrp2

dosegrp2 | min max

------+------

1 | 1.5 460

2 | 555 1455

------

Summary for variables: morphequiv

by categories of: dosegrp3

dosegrp3 | min max

------+------

1 | 210 555

2 | 690 1455

3 | 1.5 192

------

Summary for variables: morphequiv

by categories of: dosegrp4

dosegrp4 | min max

------+------

1 | 240 555

2 | 690 1455

3 | 1.5 67.5

4 | 90 210

------

Summary for variables: morphequiv

by categories of: dosegrp5

dosegrp5 | min max

------+------

1 | 330 555

2 | 690 1455

3 | 1.5 45

4 | 159 300

5 | 57 120

------

Summary for variables: morphequiv

by categories of: dosegrp6

dosegrp6 | min max

------+------

1 | 180 300

2 | 690 1455

3 | 1.5 22.5

4 | 90 159

5 | 30 67.5

6 | 330 555

------

Stopping Rules

We do not have to just use intuition to decide on how many clusters to form. We can use a formal stopping rule.

As stated in the Stata manual (Stata Multivariate Statistics Version 9 Reference Manual, 2005, p.183),

“There are a large number of cluster stopping rules. Milligan and Cooper (1985) provide an evaluation of 30 stopping rules, singling out the Calińksi and Harabasz index and the Duda and Hart index as two of the best rules.”

The Duda and Hart index only works with hierachical cluster analysis, which we will see below. The Calińksi and Harabasz index works with both the nonhierarchical (k-means and k-medians) and hierarchical cluster analysis (single linkage, complete linkage, average linkage, centroidlinkage, waveragelinkage, medianlinkage, wardslinkage).

These rules provide a “distinct clustering” statistic that you can visually compare between cluster results for different numbers of clusters.

As stated in the Stata manual (Stata Multivariate Statistics Version 9 Reference Manual, 2005, p.188-186),

“Distinct clustering is characterized by large Calińksi and Harabasz pseudo-F values, large Duda and Hart Je(2)/Je(1) values, and small Duda and Hart pseudo-T-squared values.

The conventional wisdom for deciding the number of groups based on the Duda and Hart stopping rule table is to find one of the largest Je(2)/Je(1) values that corresponds to a low psuedo-T-squared value that has much larger T-squared values next to it.”

Applying the Calińksi and Harabasz index to the first of the k-means cluster analyzses computed above,

Multivariate analysis

Cluster analysis

Postclustering

Cluster analysis stopping rules

Cluster analysis: kmeans2

Stopping rule: Calinski/Harabasz pseudo-F index

OK

cluster stop kmeans2, rule(calinski)

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 2 | 74.18 |

+------+

Doing this for all six k-means cluster analyzses,

cluster stop kmeans`i' , rule(calinski)

}

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 2 | 74.18 |

+------+

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 3 | 78.62 |

+------+

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 4 | 57.73 |

+------+

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 5 | 50.91 |

+------+

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 6 | 40.19 |

+------+

We see that the 3 clusters model produced the best distinct clustering, as it has the largest Calińksi and Harabasz statistic.

We might still decide to maintain the 4 dose levels, however, if we decide it is more natural for physicians to think of opioid prescriptions in terms of (low, moderate, high, very high) than (low, moderate, high). It would be worth trying the 3-dose and 4-dose categorizations in a regression model for some other outcome, however, if that is the purpose of this classification exercise.

**K Medians Cluster Analysis**

This approach is the same as k-means, only the groups are formed around medians, rather than means.

Obtaining a k-medians cluster analysis, requesting four dose groups,

StatisticsMultivariate analysis

Cluster analysis

Cluster data

Kmedians

Main tab: Variables: morphequiv

K (the number of groups): 4

(Dis)similarity measure: Continuous: L2 or Euclidean

Name this cluster analysis: kmedians4

Options tab: Initial group centers: K unique random observations

Random number seed: 999

Advanced tab: Generated grouping variable name: mediansgrp

OK

capture cluster drop kmedians4

capture drop mediansgrp

cluster kmedians morphequiv, k(4) measure(L2) ///

name(kmedians4) start(krandom(999)) generate(mediansgrp)

Displaying the clusters,