R tutorial

Using general prediction strength methods to estimate the number of clusters in a dataset

Moira Regelson and Steve Horvath

University of California, Los Angeles

Correspondence:

http://www.ph.ucla.edu/biostat/people/horvath.htm

In this tutorial we describe the general prediction strength approach and provide R code for applying it. See the technical report for more detail:

Moira Regelson, Meghna Kamath, Steve Horvath (2005) General prediction strength methods for estimating the number of clusters with an application to gene co-expression networks. UCLA Technical Report. http://www.genetics.ucla.edu/horvath/GeneralPredictionStrength

SUMMARY

Several recent approaches for estimating the number of clusters in a data set make use of supervised learning methods for assessing cluster reproducibility. To apply supervised learning methods, the data is repeatedly split into a training and a test set. A clustering procedure is applied to both sets to arrive at an observed cluster outcome in each data set. Next, a classifier is constructed to predict the cluster label in the training data and is then applied to predict the cluster label in the test data. Studying the extent of agreement between the observed test set cluster outcome and its prediction gives a measure of cluster reproducibility. This measure is then used to estimate the number of clusters in the dataset. Tibshirani et al (2001) introduced a measure of cluster agreement, the prediction strength, which measures the proportion of observation pairs in the training data that co-cluster in the test data. We derive a simple formula for computing this prediction strength index that allows generalization to the co-clustering of multiplets of observations. The proposed general prediction strength criterion is useful when dealing with elongated clusters or with hierarchically organized clusters and compares favorably to classical methods of estimating the number of clusters in simulated datasets.
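To make this concrete, here is a minimal sketch of the pairwise (m=2) prediction strength for a single train/test split, assuming pam clustering and classification of test observations by their nearest training medoid. The function psPairwiseSketch and its internals are illustrative only; they are not the implementation in psFunctions.txt.

library(cluster) # provides pam()

psPairwiseSketch <- function(x, k, trainFrac=0.5) {
  xm <- as.matrix(x)
  n <- nrow(xm)
  trainIdx <- sample(n, floor(trainFrac * n))
  trainClus <- pam(xm[trainIdx, , drop=FALSE], k)
  testX <- xm[-trainIdx, , drop=FALSE]
  testClus <- pam(testX, k)
  ## "classify" each test observation by its nearest training medoid
  d <- as.matrix(dist(rbind(trainClus$medoids, testX)))[-(1:k), 1:k, drop=FALSE]
  predicted <- apply(d, 1, which.min)
  ## for each observed test cluster, compute the proportion of its pairs
  ## that the training-based classifier also puts together; the prediction
  ## strength is the worst (minimum) such proportion over the test clusters
  psPerCluster <- sapply(1:k, function(j) {
    members <- which(testClus$clustering == j)
    nj <- length(members)
    if (nj < 2) return(NA)
    same <- outer(predicted[members], predicted[members], "==")
    (sum(same) - nj) / (nj * (nj - 1))
  })
  min(psPerCluster, na.rm=TRUE)
}

The pamPsClNo function used in the demo below automates this kind of calculation over multiple cross-validation splits and over several values of m.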

For illustration, we will use an example of two circular clusters in 10 dimensions. The data are available in comma-separated (csv) format at http://www.genetics.ucla.edu/horvath/GeneralPredictionStrength/testData.csv.

To look at the data, simply open the file in Excel or a text editor.
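Alternatively, you can peek at the first few rows from within R (assuming the file is still hosted at the URL above):

head(read.csv("http://www.genetics.ucla.edu/horvath/GeneralPredictionStrength/testData.csv", row.names=1))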


######################################### DEMO CODE #################################################################

## 1) To install the R software, go to http://www.R-project.org

## 2) After installing R, you need to install an additional R

## package: scatterplot3d. At the R command prompt type

install.packages("scatterplot3d", repos="http://cran.r-project.org")

## 3) Download the following files from

## http://www.genetics.ucla.edu/horvath/GeneralPredictionStrength

## a) R function file: "psFunctions.txt",

## b) The test data file: "testData.csv"

## 4) Place all the files into the same directory, for example, "C:\temp\GeneralPredictionStrength"

## 5) Open the R software by double clicking its icon.

## 6) Open this tutorial file in a text editor

## 7) Copy and paste the R commands from this tutorial into the R session.

## Comments are preceded by "#" and are automatically ignored by R.

# Start copying and pasting the following

## change the working directory in R to where the data and functions are
## (edit as needed; use forward slashes or doubled backslashes in R paths)

setwd("C:/temp/GeneralPredictionStrength")

## load the prediction strength functions and ignore any warning message

source("psFunctions.txt")

## read in the data set

myData <- read.csv("testData.csv", row.names=1)

## each sample in this dataset is in a row: use "t()" to

## transpose if you want to apply this to another data set

## with samples in the columns
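
## for example (hypothetical file name, samples in columns):
## otherData <- as.data.frame(t(read.csv("samplesInColumns.csv", row.names=1)))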

dim(myData) # show the dimension of the dataset

## perform classical multi-dimensional scaling to reduce the data

## to three dimensions and display the result:

cmd1 <- cmdscale(dist(myData), 3)

library(scatterplot3d) # load the plotting package installed in step 2

scatterplot3d(cmd1, xlab="x", ylab="y", zlab="z")

## set the parameters to use in the prediction strength

maxK <- 5 # largest number of clusters to consider

mList <- c(2,5,10) # set of values for "m" (m-tuplets to co-cluster)
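
## Intuition for general m: if the training-based classifier splits a test
## cluster of size nj into groups of sizes nj1, ..., njk, the fraction of the
## cluster's m-tuplets that still co-cluster is sum_l choose(njl, m) / choose(nj, m).
## A sketch of this counting idea (an assumed formula, not the psFunctions.txt
## internals):

coClusterFrac <- function(groupSizes, m) sum(choose(groupSizes, m)) / choose(sum(groupSizes), m)

coClusterFrac(c(8, 2), 2) # a 10-member cluster split 8/2 keeps 29/45 of its pairs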

cvCount <- 10 # number of cross-validations to perform

myPSarray <- pamPsClNo(myData, klimit=maxK, cvCount=cvCount, m=mList)

print(myPSarray)

##> print(myPSarray)
##        m=2    m=5     m=10
## k=1  1.000 1.0000 1.00e+00
## k=2  1.000 1.0000 1.00e+00
## k=3  0.518 0.0846 4.63e-03
## k=4  0.493 0.0526 3.31e-04
## k=5  0.361 0.0209 7.62e-06

matplot(myPSarray, main="", ylab="", type="l",
        lty=1:dim(myPSarray)[2], lwd=3,
        col=gray((1:maxK)/(maxK + 1)))

lines(1:maxK, rep(0.8, maxK)) # the conventional 0.8 threshold

title(ylab="PSE(m,k)", xlab="k")

legend(1, 0.4, dimnames(myPSarray)[[2]],
       col=gray((1:maxK)/(maxK + 1)),
       lty=1:dim(myPSarray)[2], lwd=3, bty="n")
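
## one way to read the estimate off the curve programmatically: take the
## largest k whose prediction strength stays above the conventional 0.8
## threshold (the horizontal line drawn above); shown for the m=2 column

estK <- max(which(myPSarray[, "m=2"] >= 0.8))

estK # 2 for this dataset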

## since all values of m indicate two clusters, use pam clustering

## to determine which samples are assigned to which clusters

## Note that this is a trivial example:

## - in many cases varying m will change the number of clusters predicted
## - cluster assignment will not generally be so easy

## cluster the raw data into two clusters using pam (partitioning
## around medoids) clustering

library(cluster) # provides pam(); may already be loaded by psFunctions.txt

pamClus <- pam(myData, k=2)

## use the reduced (MDS) data to visualize the clustering: points

## are colored by their cluster assignment

scatterplot3d(cmd1, color=pamClus$clustering, xlab="x", ylab="y", zlab="z")
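
## as a quick sanity check, tabulate the cluster sizes and inspect the
## fitted medoids

table(pamClus$clustering)

pamClus$medoids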
