Random Forest Clustering and Other Statistical Methods

This manuscript contains the supplementary methods for the paper:

Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M and Kurdistani SK. Global histone modification patterns predict risk of prostate cancer recurrence. Nature (2005).

Section 1

Prostate Tissue Microarray (TMA). Under IRB approval, a prostate tissue microarray (TMA) was constructed using formalin-fixed, paraffin-embedded prostate tissue samples provided through the Department of Pathology and Laboratory Medicine at the UCLA Medical Center. Primary radical prostatectomy cases from 1984-1995 were randomly selected from the pathology database. The original H&E stained diagnostic slides were reviewed by a study pathologist (D.S.) utilizing the Gleason histological grading [1] and the 1997 AJCC/UICC TNM classification systems [2]. Case material from 246 prostatectomies was arrayed into 3 blocks encompassing a total of 1,364 individual tissue cores. All cases were of the histological type adenocarcinoma, conventional, not otherwise specified (NOS) [3].

TMAs were constructed as previously described [4]. At least 3 replicate tumor samples were taken from donor tissue blocks in a highly representative fashion. Matched morphologically normal, hypertrophic (BPH) and in situ neoplastic lesions (PIN) were also arrayed, when available. Twenty patients treated with neoadjuvant hormones were excluded from the study. Of the remaining 226 cases, 183 (81%) were informative for all 5 histone markers and 171 of those were also supported by complete recurrence data. In this group, the median age at the time of surgery was 65 (range 46 to 76). 104 (57%) patients were low grade (Gleason score 2-6); 79 (43%) were high grade (Gleason score 7-10). Half of the tumors (50%) were not confined to the prostate (organ confined = T2a or T2b with negative lymph nodes, no capsular extension and with negative surgical margins). Sixty (33%) patients were margin positive and 33 (18%) had seminal vesicle invasion (pT3b). Regarding capsular invasion, 38 (21%) had no invasion, 101 (55%) had invasion, and 44 (24%) had capsular extension. Concurrent regional lymphadenectomy accompanied 181 (99%) cases, 12 of which (7%) were positive for metastases. The maximum pre-operative serum PSA was known for 160 patients (87%), with a median value of 8.9 ng/ml (range 0.6-96.5).

Supplement Table 2 shows the clinicopathologic data for the subset of patients with low grade (Gleason score 2-6) tumors (n=104). The median age at the time of surgery of this group was 64 (range 46 to 75). In the low grade tumors, 36% were not confined to the prostate. Twenty-five (24%) patients were margin positive and 5 (5%) had seminal vesicle invasion (pT3b). Twenty-eight (27%) had no capsular invasion by tumor, 56 (54%) had capsular invasion, and 20 (19%) had capsular extension. Concurrent regional lymphadenectomy accompanied 102 (98%) cases, only 2 of which (2%) were positive for metastases. The maximum pre-operative serum PSA was known for 93 patients (89%), with a median value of 7.8 ng/ml (range 0.6-56.0).

A retrospective analysis for outcome assessment was based on detailed anonymized clinicopathologic information linked to the TMA tissue specimens. Recurrence, defined as a postoperative serum PSA of 0.2 ng/ml or greater, was seen in 61 (34%) of all study patients, and 20 (19%) of patients with low grade tumors. The median total follow-up, defined as the time to recurrence or to last contact in non-recurring patients, was 50 months (range 1.0-163) for all patients, and 60.0 months (range 2-163) for patients with low grade tumors.

The median follow-up time within the recurring and non-recurring patient groups was 22.0 (1.0-115.0) and 65.5 months (range 2.0-163.0), respectively, in all patients, and 30.5 (2.0-98.0) and 65.5 months (range 2.0-163.0), respectively, in patients with low grade tumors.

Immunohistochemistry. A standard 2-step indirect immunohistochemical staining method was used for all antibodies (DAKO, Carpinteria, CA). Tissue array sections (4 µm thick) were cut immediately prior to staining using a TMA sectioning aid (Instrumedics, NJ). Following deparaffinization in xylenes, the sections were rehydrated in graded alcohols. Endogenous peroxidase was quenched with 3% hydrogen peroxide in methanol at room temperature. The sections were placed in a 95 °C solution of 0.01 M sodium citrate buffer (pH 6.0) for antigen retrieval. 5% normal goat serum was next applied for 30 min to block non-specific protein binding sites. Primary rabbit anti-histone polyclonal antibodies were applied for 30 min at room temperature (H3K18Ac at 1:200; H4R3diMe at 1:25; H3K9Ac at 1:800; H4K12Ac at 1:100; and H3K4diMe at 1:800 dilution from stock). Detection was accomplished using the DAKO Envision System, followed by chromogen detection with diaminobenzidine (DAB). Incubations were performed in a humidity chamber. The sections were counterstained with Harris’ Hematoxylin, followed by dehydration and mounting. Negative controls were identical array sections stained minus the primary antibody.

Scoring of immunohistochemistry. Semi-quantitative assessment of antibody staining on the TMAs was performed by two study pathologists (H.Y. and D.B.S.) blinded to the clinicopathologic variables. Prostatic glandular epithelium was the scored target tissue; scoring of benign tissues did not include basal cells. Both tissue spot histology and grading were confirmed on Hematoxylin and Eosin (H&E) stained TMA slides, as well as on all of the counterstained study slides. The frequency of nuclear expression positive target cells (range 0-100%) was scored for each TMA spot.

REFERENCES

1. Gleason, D.F., Classification of prostatic carcinomas. Cancer Chemother Rep, 1966. 50(3): p. 125-8.

2. Sobin, L.H. and I.D. Fleming, TNM Classification of Malignant Tumors, fifth edition (1997). Union Internationale Contre le Cancer and the American Joint Committee on Cancer. Cancer, 1997. 80(9): p. 1803-4.

3. Young, R.H., Srigley, J.R., Amin, M.B., Ulbright, T.M., Cubilla, A., Tumors of the prostate gland, seminal vesicle, male urethra, and penis, in Atlas of Tumor Pathology. 2000, Armed Forces Institute of Pathology: Washington, DC.

4. Kononen, J., et al., Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med, 1998. 4(7): p. 844-7.

Section 2

Random Forest Clustering and Other Statistical Methods

In part A, we review random forest (RF) clustering. In part B, we compare RF clustering to a more standard clustering analysis based on the Euclidean distance.

A. Review of Random Forest Clustering

What are RF predictors?

RF predictors are a state-of-the-art prediction method that has been shown to work well with many different types of data (1). A random forest predictor is a collection of individual classification tree predictors. The random forest construction allows one to define a similarity measure between two samples by counting the number of times a tree predictor places them in the same terminal node. Random forests can also be used for unsupervised learning problems (class outcomes unknown) by first generating synthetic data, which are chosen to represent the null hypothesis of no dependence structure in the data. Here we generate synthetic observations by randomly sampling from the product of empirical marginal distributions of the observed data. Then a random forest predictor is constructed to distinguish observed from synthetic data. By restricting the resulting intrinsic similarity measure to the observed data, one obtains a similarity measure between the unlabeled observations. To protect against random fluctuations due to Monte Carlo sampling, we generate 100 synthetic data sets and average the resulting similarity measures. We define the RF dissimilarity measure as the square root of 1 minus the similarity measure (4). We use the RF dissimilarity as input for classical multidimensional scaling (MDS), which is related to principal component analysis. MDS takes the dissimilarities between samples and returns a set of points in a low dimensional Euclidean space such that the Euclidean distances between the points are approximately equal to the RF dissimilarities (2,3). To cluster the points, we grouped the samples along the arms of the resulting “U” shape. The results are extremely similar to those obtained by using the RF dissimilarity directly in partitioning around medoids (PAM) clustering (4).
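The two building blocks described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study: the function names, the toy staining scores and the terminal-node labels are all hypothetical, the forest itself is not fitted here, and the similarity is assumed to be the fraction of trees that place two samples in the same terminal node.

```python
import math
import random


def synthetic_from_marginals(observed, rng):
    """Sample each covariate independently from its empirical marginal.

    Drawing every column separately (with replacement) keeps the marginal
    distributions but removes any dependence between covariates.
    """
    n = len(observed)
    columns = list(zip(*observed))
    sampled_columns = [[rng.choice(col) for _ in range(n)] for col in columns]
    return [list(row) for row in zip(*sampled_columns)]


def rf_dissimilarity(leaf_ids):
    """Turn terminal-node assignments into an RF dissimilarity matrix.

    leaf_ids[t][i] is the terminal node of observed sample i in tree t.
    Similarity is taken as the fraction of trees that place samples i and j
    in the same node (an assumption); dissimilarity = sqrt(1 - similarity).
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    dissim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same = sum(1 for tree in leaf_ids if tree[i] == tree[j])
            d = math.sqrt(1.0 - same / n_trees)
            dissim[i][j] = dissim[j][i] = d
    return dissim


rng = random.Random(0)
observed = [[10, 80], [20, 70], [90, 15], [85, 5]]  # toy staining scores (%)
synthetic = synthetic_from_marginals(observed, rng)

# Hypothetical terminal-node labels of the 4 observed samples in 4 trees.
leaf_ids = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 2, 2], [0, 0, 0, 1]]
dissim = rf_dissimilarity(leaf_ids)
```

In a full analysis the terminal-node labels would come from a forest trained to separate `observed` from `synthetic`, and the similarity would be averaged over repeated synthetic data sets before taking the square root.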

Why Random Forests Clustering?

One major input of clustering analysis is the dissimilarity measure (4). We propose to use a random forest dissimilarity for TMA data since it has the following relevant theoretical advantages (5).

First, the clustering results do not change when one or more covariates are monotonically transformed since the dissimilarity only depends on the feature ranks. Thus, one does not need to worry about symmetrizing skewed covariate distributions.
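The rank argument can be checked on a toy example. The sketch below (hypothetical helper names and made-up scores) shows that a log transformation leaves the ranks unchanged and that a tree-style split sends the same samples to the left child on either scale, provided the threshold is transformed along with the data.

```python
import math


def ranks(values):
    """Rank of each value (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def split_left(values, threshold):
    """Indices a tree node would send left under the rule value <= threshold."""
    return {i for i, v in enumerate(values) if v <= threshold}


scores = [2.0, 35.0, 5.0, 90.0, 60.0]      # toy skewed staining scores
logged = [math.log(v) for v in scores]      # monotone transformation

same_ranks = ranks(scores) == ranks(logged)
# A split at threshold 20 on the raw scale corresponds to log(20) on the log scale.
same_partition = split_left(scores, 20.0) == split_left(logged, math.log(20.0))
```

Because a tree only ever asks which side of a threshold a value falls on, any split available on the raw scale has an equivalent split on the transformed scale.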

Second, the RF dissimilarity weighs the contributions of each covariate on the dissimilarity in a natural way: the more related the covariate is to other covariates, e.g. the more correlated a protein marker is with other markers, the more it will affect the definition of the RF dissimilarity.

Third, the RF dissimilarity does not require the user to specify threshold values for dichotomizing tumor expressions. It is standard practice in supervised analyses to dichotomize tumor marker expressions for ease of interpretation and reproducibility. But we caution against using external threshold values for dichotomizing expressions in unsupervised analyses since dichotomization may reduce the information content or even bias the results. In contrast, the RF dissimilarity is based on individual tree predictors, which dichotomize the expression values as part of their construction, so RF clustering automatically dichotomizes the staining scores in a principled, data-driven way.

For a technical description of the RF dissimilarity consult Breiman (1), Shi and Horvath (5) and a technical report and R tutorial that can be downloaded from

Analysis Steps and Other Statistical Methods

Our analyses of the data involved the following 3 general steps:

A) use RF clustering to group the patients based only on their tumor marker expression profiles;

B) assess the differences between the resultant clusters in terms of their survival distributions and other clinico-pathological variables, such as stage, grade etc.;

C) examine the difference in tumor marker expression between the clusters.

The statistical methods used in the analyses are described below.

We used several methods for describing the clusters in terms of clinical variables and tumor marker expressions. To test whether variables differed across groups, we used the Kruskal-Wallis test, which is a non-parametric multi-group comparison test. To visualize the survival distributions, we used Kaplan-Meier plots. Log-rank tests were used to test the difference between survival distributions. All p-values were two sided and p < 0.05 was considered significant. All statistical analyses were carried out with the freely available software R (6). R code that implements RF clustering can be found at the following webpage:
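For readers unfamiliar with the Kaplan-Meier estimator behind those plots, a minimal Python sketch follows. The analyses themselves were carried out in R; the function name and the toy follow-up data here are hypothetical and do not come from the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times  : follow-up time per patient (e.g. months)
    events : 1 if recurrence was observed at that time, 0 if censored
    """
    curve = []
    survival = 1.0
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Recurrences observed exactly at time t.
        recurred = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - recurred / at_risk
        curve.append((t, survival))
    return curve


# Toy data (not from the study): months of follow-up and recurrence flags.
times = [5, 8, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(times, events)
```

Each step multiplies the running recurrence-free probability by the fraction of at-risk patients who did not recur at that time, so censored patients contribute to the risk set until their last contact without counting as events.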

B. Comparing the results of the random forest dissimilarity with those of the Euclidean distance.

Here we report the results of comparing the random forest dissimilarity to the standard dissimilarity (Euclidean distance) when the latter is used as input of partitioning around medoids (PAM) clustering (4). Instead of PAM clustering, which is a variant of k-means clustering, one could also use hierarchical clustering to group samples on the basis of the RF dissimilarity. We find that our main message does not change when other standard clustering procedures that take a dissimilarity measure as input are used.
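To make the role of the dissimilarity matrix concrete, the sketch below clusters a toy data set by minimizing the k-medoids objective on a Euclidean distance matrix. Note the substitution: it searches all medoid sets exhaustively, which stands in for PAM's build-and-swap heuristic and is only feasible for very small samples. All names and data are hypothetical.

```python
from itertools import combinations


def k_medoids_exhaustive(dissim, k=2):
    """Pick the k medoids minimizing total dissimilarity to the nearest medoid.

    Exhaustive search over all medoid sets; only sensible for small n.
    """
    n = len(dissim)
    best_cost, best_medoids = float("inf"), None
    for medoids in combinations(range(n), k):
        cost = sum(min(dissim[i][m] for m in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Assign every sample to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda c: dissim[i][best_medoids[c]])
              for i in range(n)]
    return best_medoids, labels


def euclidean_matrix(points):
    """Pairwise Euclidean distances between rows of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in points]
            for p in points]


points = [[10, 80], [20, 70], [15, 75], [90, 15], [85, 5]]  # toy scores
medoids, labels = k_medoids_exhaustive(euclidean_matrix(points), k=2)
```

Since the function only sees a dissimilarity matrix, an RF dissimilarity matrix could be passed in place of `euclidean_matrix(points)` without any other change, which is the sense in which the two analyses differ only in their input.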

The table below cross tabulates the patients based on their cluster membership predicted by the 2 methods.

Table. Comparing the random forest clustering results with those of the Euclidean distance for the 104 low Gleason score prostate tumors. Rows: RF dissimilarity clustering. Columns: Euclidean distance clustering.

RF dissimilarity clustering / Euclidean Cluster 1 / Euclidean Cluster 2
RF Cluster 1 / 55 / 1
RF Cluster 2 / 8 / 40
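The agreement implied by this table can be tallied directly. The short Python check below assumes that cluster labels 1 and 2 are aligned between the two methods, as they are in the table.

```python
# Rows: RF clusters 1 and 2; columns: Euclidean clusters 1 and 2.
table = [[55, 1],
         [8, 40]]

total = sum(sum(row) for row in table)
agree = sum(table[i][i] for i in range(len(table)))   # diagonal cells
disagree = total - agree                               # off-diagonal cells
agreement_rate = agree / total

print(f"{agree}/{total} concordant ({agreement_rate:.1%}), {disagree} discordant")
# prints "95/104 concordant (91.3%), 9 discordant"
```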

Overall there is good agreement between the clustering results: the clusterings disagree on only 9 (8 + 1) of the 104 patient samples. There is indirect empirical evidence that RF clustering is superior: the figure shows Kaplan-Meier plots that visualize the recurrence-free time distributions for the different patient groups defined by the cells of the above table. Specifically, the figure shows the Kaplan-Meier plots for

i) patients in RF cluster 1 and Euclidean cluster 1,

ii) patients in RF cluster 2 and Euclidean cluster 2,

iii) patients in RF cluster 2 and Euclidean cluster 1.

Since there was only one patient in RF cluster 1 and Euclidean cluster 2, the corresponding Kaplan-Meier plot was omitted.

Figure. Kaplan-Meier (KM) plots corresponding to different clusterings of the low Gleason score prostate samples. The RF method clusters patients corresponding to the green and red KM curves into one cluster. In contrast, the Euclidean distance PAM analysis clusters patients corresponding to the green and black KM curves into one cluster. Clearly, the RF dissimilarity is superior to the Euclidean distance in this particular analysis.

In our empirical comparison, RF clustering leads to superior performance on these prostate tumor samples. We have listed several theoretical reasons why we expect RF clustering to be a useful method for typically skewed tumor marker data (5). We have found additional empirical evidence that RF clustering performs well with tumor marker data (7). But there is no substitute for using additional real data to show that this method performs well in practice. Future studies on additional real TMA data sets should aim to provide empirical evidence that the RF dissimilarity is indeed worthwhile for TMA data. Unfortunately, the TMA community does not (yet) offer a collection of benchmark data sets.

REFERENCES:

1. Breiman L: Random forests. Machine Learning 2001, 45:5-32

2. Venables WN, Ripley BD: Modern applied statistics with S-PLUS. New York, Springer-Verlag, 1999, pp xi, 501

3. Cox TF, Cox MAA: Multidimensional scaling. Boca Raton, Chapman & Hall/CRC, 2001

4. Kaufman L, Rousseeuw PJ: Finding groups in data: an introduction to cluster analysis. New York, Wiley, 1990, pp xiv, 342

5. Shi T, Horvath S: Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics 2005, in press

6. Ihaka R, Gentleman R: R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 1996, 5:299-314

7. Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S: Tumor Classification by Tissue Microarray Profiling: Random Forest Clustering Applied to Renal Cell Carcinoma. Mod Pathol 2005, 18(4):547-557