Chapter 5-24. Cluster Analysis

Cluster analysis is one of the many multivariate techniques, where you can have one or more outcome variables.

It is a method to classify observations into natural underlying groups. The goal of the clustering is to have observations in the same cluster be more alike than observations in other clusters.

[Another multivariate analysis, factor analysis, works on variables, rather than observations, finding underlying factors that each subset of variables appear to have in common. This approach is used to find subscales when developing a psychometric test.]

One of many ways to do this is to use Euclidean distance as the (dis)similarity measure, where the distance between observations in r dimensions, formed by the r outcome variables, is closest for observations within the same cluster. Euclidean distance is defined as:

where to are the variables for individual i.

Drug Dose Example

If you were doing a study of the association of the opioids to mortality or to adverse drug events, a problem you would encounter is coming up with dose categories to include in a regression model. Established dose categories are not available. There is not even a consensus among physicians of what dose level is safe. In order to avoid the potential criticism of “you picked the dose categories that gave you significance”, you would like to use an “objective” categorization. One approach would be quartiles, but there is no reason why that would provide clinically meaningful cut-points. Possibly a better approach is to use cluster analysis, where the algorithm finds natural clusters of doses. In this sense, the dose categories are somewhat of a consensus of popular dose levels, at least among the physicians who prescribed the doses in your dataset.

Morphine Equivalent Dose Dataset

The dataset morphequivdose.dta contains n = 40 observations of various specific opioid prescriptions to treat severe pain. The doses are converted to “morphine equivalent of opioids (mg/day)”. The data are hypothetical but resemble the distribution of actual data.

Source: Greg Stoddard

______

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.


Reading in the data,

File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on morphequivdose.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\morphequivdose.dta",
clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use morphequivdose, clear

Obtaining a frequency table,

Statistics
Summaries, tables & tests
Tables
Oneway tables
Categorical Variable: morphequiv
OK
tabulate morphequiv

morphine |

equivalent |

of opioids |

(mg/day) | Freq. Percent Cum.

------+------

1.5 | 1 2.50 2.50

6 | 1 2.50 5.00

7.5 | 1 2.50 7.50

22.5 | 1 2.50 10.00

30 | 1 2.50 12.50

39 | 1 2.50 15.00

45 | 3 7.50 22.50

57 | 1 2.50 25.00

60 | 2 5.00 30.00

67.5 | 1 2.50 32.50

90 | 4 10.00 42.50

110 | 1 2.50 45.00

120 | 1 2.50 47.50

159 | 1 2.50 50.00

180 | 2 5.00 55.00

192 | 1 2.50 57.50

210 | 1 2.50 60.00

240 | 3 7.50 67.50

255 | 1 2.50 70.00

270 | 1 2.50 72.50

300 | 1 2.50 75.00

330 | 1 2.50 77.50

360 | 2 5.00 82.50

422 | 1 2.50 85.00

460 | 1 2.50 87.50

555 | 1 2.50 90.00

690 | 1 2.50 92.50

706 | 1 2.50 95.00

810 | 1 2.50 97.50

1455 | 1 2.50 100.00

------+------

Total | 40 100.00

We see a wide range of doses, where no real obvious natural gaps show up to permit classification into clusters.
Obtaining a histogram,

Graphics
Easy graphs
Histogram
Main tab: Variable: morphequiv
Continuous data
OK
histogram morphequiv

This graph perhaps suggests cutpoints (low dose = 0-249)(moderate dose = 250-499)

(high dose = 500-999)(very high dose = 1000+), but we have to be careful because it has low resolution (too few bins, or bars).


Obtaining a histogram with more bars and better labels,

#delimit ;
histogram morphequiv
, bin(30) percent
xtitle("morphine equivalent of opioids (mg/day)")
ytitle("percent times dose prescribed")
xlabels(0(50)1500,angle(vertical))
;
#delimit cr

Our prior cutpoints

(low dose = 0-249)(moderate dose = 250-499)

(high dose = 500-999)(very high dose = 1000+)

still seem okay, except it is not clear where to divide low dose and moderate dose.

When we look at the histogram, we are trying to find four sub-histograms, with as little overlap as possible.

That is the same as saying we are looking for four means, where the data around the means are closer to their own group mean than to any other group mean.

That is the approach behind k-means cluster analysis .


K Means Cluster Analysis

This approach uses the following iterative algorithm (Stata Multivariate Statistics Reference Manual Release 10, 2007, p.119),

“Kmeans and kmedians clustering are iterative procedures that partitions the data into k groups or clusters. The procedure begins with k initial group centers. Observations are assigned to the groups with the closest center. The mean or median of the observations assigned to each of the groups is computed, and the process is repeated. These steps continue until all observations remain in the same group form the previous iteration.”

This exact same algorithm is described, but not so succiently, in Shannon (2008, p.356). The Shannon citation is preferable for inclusion in an article, since textbooks are better citations than software user manuals.

Obtaining a k-means cluster analysis, requesting four dose groups,

Statistics
Multivariate analysis
Cluster analysis
Cluster data
Kmeans
Main tab: Variables: morphequiv
K (the number of groups): 4
(Dis)similarity measure: Continuous: L2 or Euclidean
Name this cluster analysis: kmeans4
Options tab: Initial group centers: K unique random observations
Random number seed: 999
Advanced tab: Generated grouping variable name: dosegrp
OK
capture cluster drop kmeans4
capture drop dosegrp
cluster kmeans morphequiv, k(4) measure(L2) ///
name(kmeans4) start(krandom(999)) generate(dosegrp)

This finishes running, but does not display anything.


To see the clusters produced, use

Statistics
Summary tables & tests
Tables
Table of summary statistics (tabstat)
Main tab: Variables: morphequiv
Group statistics by variable: dosegrp
Statistics to display: minimum
Statistics to display: maximum
Options tab: Use columns as: statistics
No overall statistic
OK
tabstat morphequiv, statistics( min max ) by(dosegrp) ///
nototal columns(statistics)

Summary for variables: morphequiv

by categories of: dosegrp

dosegrp | min max

------+------

1 | 240 555

2 | 690 1455

3 | 1.5 67.5

4 | 90 210

------

Our prior cutpoints were:

Low dose = 0-249

Moderate dose = 250-499

High dose = 500-999

Very high dose = 1000+

Examining the minimum and maximums, and interpolating the cutpoints, we might use:

Low dose = 0-80

Moderate dose = 81-225

High dose = 226-600

Very high dose = 601+

Notice the Stata returned the groups out of order. To fix this, so that we have an ordinal-scaled group variable, we can use

recode dosegrp (3=1)(4=2)(1=3)(2=4) , gen(dosegroup)
tabstat morphequiv, statistics( min max ) by(dosegroup) ///
nototal columns(statistics)


Summary for variables: morphequiv

by categories of: dosegroup (RECODE of dosegrp)

dosegroup | min max

------+------

1 | 1.5 67.5

2 | 90 210

3 | 240 555

4 | 690 1455

------

Obtaining a histogram with the cutpoints delineated,

Low dose = 0-80

Moderate dose = 81-225

High dose = 226-600

Very high dose = 601+

#delimit ;
histogram morphequiv
, bin(30) percent
xtitle("morphine equivalent of opioids (mg/day)")
ytitle("percent times dose prescribed")
xlabels(0(50)1500,angle(vertical))
xline(80.5) xline(225.5) xline(600.5)
;
#delimit cr

From this graph, it is not convincing that the cutpoints for the clusters should be as shown. This is partly due to the bin size of the bars.

It is more convincing to look at the frequency table.


Low dose = 0-80

Moderate dose = 81-225

High dose = 226-600

Very high dose = 601+

morphine |

equivalent |

of opioids |

(mg/day) | Freq. Percent Cum.

------+------

1.5 | 1 2.50 2.50

6 | 1 2.50 5.00

7.5 | 1 2.50 7.50

22.5 | 1 2.50 10.00

30 | 1 2.50 12.50

39 | 1 2.50 15.00

45 | 3 7.50 22.50

57 | 1 2.50 25.00

60 | 2 5.00 30.00

67.5 | 1 2.50 32.50

90 | 4 10.00 42.50

110 | 1 2.50 45.00

120 | 1 2.50 47.50

159 | 1 2.50 50.00

180 | 2 5.00 55.00

192 | 1 2.50 57.50

210 | 1 2.50 60.00

240 | 3 7.50 67.50

255 | 1 2.50 70.00

270 | 1 2.50 72.50

300 | 1 2.50 75.00

330 | 1 2.50 77.50

360 | 2 5.00 82.50

422 | 1 2.50 85.00

460 | 1 2.50 87.50

555 | 1 2.50 90.00

690 | 1 2.50 92.50

706 | 1 2.50 95.00

810 | 1 2.50 97.50

1455 | 1 2.50 100.00

------+------

Total | 40 100.00

Notice that 67.5 is quite close to 60, while 90 is closer to 110. Similarly with the other cutpoints.
One drawback of the k-means approach is that you have to specify the number of groups in advance, rather than let the procedure find the optimal number of groups.

We could loop through several choices, and then decide which number of groups appeals to us.

forval i=2/6 {
capture cluster drop kmeans`i’
capture drop dosegrp`i’
cluster kmeans morphequiv, k(`i’) measure(L2) ///
name(kmeans`i’) start(krandom(999)) generate(dosegrp`i’)
tabstat morphequiv, statistics( min max ) by(dosegrp`i’) ///
nototal columns(statistics)
}

Summary for variables: morphequiv

by categories of: dosegrp2

dosegrp2 | min max

------+------

1 | 1.5 460

2 | 555 1455

------

Summary for variables: morphequiv

by categories of: dosegrp3

dosegrp3 | min max

------+------

1 | 210 555

2 | 690 1455

3 | 1.5 192

------

Summary for variables: morphequiv

by categories of: dosegrp4

dosegrp4 | min max

------+------

1 | 240 555

2 | 690 1455

3 | 1.5 67.5

4 | 90 210

------

Summary for variables: morphequiv

by categories of: dosegrp5

dosegrp5 | min max

------+------

1 | 330 555

2 | 690 1455

3 | 1.5 45

4 | 159 300

5 | 57 120

------

Summary for variables: morphequiv

by categories of: dosegrp6

dosegrp6 | min max

------+------

1 | 180 300

2 | 690 1455

3 | 1.5 22.5

4 | 90 159

5 | 30 67.5

6 | 330 555

------


Stopping Rules

We do not have to just use intuition to decide on how many clusters to form. We can use a formal stopping rule.

As stated in the Stata manual (Stata Multivariate Statistics Version 9 Reference Manual, 2005, p.183),

“There are a large number of cluster stopping rules. Milligan and Cooper (1985) provide an evaluation of 30 stopping rules, singling out the Calińksi and Harabasz index and the Duda and Hart index as two of the best rules.”

The Duda and Hart index only works with hierachical cluster analysis, which we will see below. The Calińksi and Harabasz index works with both the nonhierarchical (k-means and k-medians) and hierarchical cluster analysis (single linkage, complete linkage, average linkage, centroidlinkage, waveragelinkage, medianlinkage, wardslinkage).

These rules provide a “distinct clustering” statistic that you can visually compare between cluster results for different numbers of clusters.

As stated in the Stata manual (Stata Multivariate Statistics Version 9 Reference Manual, 2005, p.188-186),

“Distinct clustering is characterized by large Calińksi and Harabasz pseudo-F values, large Duda and Hart Je(2)/Je(1) values, and small Duda and Hart pseudo-T-squared values.

The conventional wisdom for deciding the number of groups based on the Duda and Hart stopping rule table is to find one of the largest Je(2)/Je(1) values that corresponds to a low psuedo-T-squared value that has much larger T-squared values next to it.”


Applying the Calińksi and Harabasz index to the first of the k-means cluster analyzses computed above,

Statistics
Multivariate analysis
Cluster analysis
Postclustering
Cluster analysis stopping rules
Cluster analysis: kmeans2
Stopping rule: Calinski/Harabasz pseudo-F index
OK
cluster stop kmeans2, rule(calinski)

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 2 | 74.18 |

+------+


Doing this for all six k-means cluster analyzses,

forval i=2/6 {
cluster stop kmeans`i' , rule(calinski)
}

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 2 | 74.18 |

+------+

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 3 | 78.62 |

+------+

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 4 | 57.73 |

+------+

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 5 | 50.91 |

+------+

+------+

| | Calinski/ |

| Number of | Harabasz |

| clusters | pseudo-F |

|------+------|

| 6 | 40.19 |

+------+

We see that the 3 clusters model produced the best distinct clustering, as it has the largest Calińksi and Harabasz statistic.

We might still decide to maintain the 4 dose levels, however, if we decide it is more natural for physicians to think of opioid prescriptions in terms of (low, moderate, high, very high) than (low, moderate, high). It would be worth trying the 3-dose and 4-dose categorizations in a regression model for some other outcome, however, if that is the purpose of this classification exercise.


K Medians Cluster Analysis

This approach is the same as k-means, only the groups are formed around medians, rather than means.

Obtaining a k-medians cluster analysis, requesting four dose groups,

Statistics
Multivariate analysis
Cluster analysis
Cluster data
Kmedians
Main tab: Variables: morphequiv
K (the number of groups): 4
(Dis)similarity measure: Continuous: L2 or Euclidean
Name this cluster analysis: kmedians4
Options tab: Initial group centers: K unique random observations
Random number seed: 999
Advanced tab: Generated grouping variable name: mediansgrp
OK
capture cluster drop kmedians4
capture drop mediansgrp
cluster kmedians morphequiv, k(4) measure(L2) ///
name(kmedians4) start(krandom(999)) generate(mediansgrp)

Displaying the clusters,