Supplementary material for “Random Forest Fishing: A Novel Approach to Identifying Organic Group of Risk Factors in Genome wide Association Studies” by Yang and Gu.
Table of Contents
1.Brief review of Random Forest (RF)
2.Cooling scheme used to control bait set sizes
3.SNP coverage
4.Tuning parameters in RF and RFF
5.Computational time of RFF tests
6.Power for detecting multiple risk SNPs using 500 cases and 500 controls
7.Functional annotation of genes assigned to the RFF-found SNPs
8.Overlap of RFF results with other tests in real GWAS data analyses
9.References
1.Brief review of Random Forest (RF)
RF is an ensemble method that combines the result of many classification and regression trees (CART) to make a prediction. The trees were built after introducing two levels of randomization1. Specifically, each tree in the forest is grown as follows:
- Randomly sampling subjects from the data to grow each tree. The same number of subjects are randomly sampled with replacement from the original data, and used as the training dataset to grow the trees. The sampling leaves out about one third of the subjects from the original. They are called out of bag (OOB) samples, and are used as the test (validation) dataset to get unbiased estimation of the prediction rate and variable importance, so RF requires no external testing samples.
- Randomly selecting candidate variables to determine splitting criteria at each node of the tree. If there are totally M variables in the dataset, a number m much less than M is specified such that at each node, only m variables are selected at random for evaluation and the one that most differentiate the predicting trait is chosen to split the node. The splitting procedure is repeated to get all nodes of the tree.
- The above steps are repeated to grow a pre-determined number of trees to form a “random forest”. The prediction results of all the trees are pooled to “vote” for the overall prediction by the random forest.
RF provides excellent prediction accuracy. The importance of each variable for predicting the trait could be measured using the difference in prediction accuracy before and after randomly permuting the values of that variable. This measure entails not only marginal effects of the variable, but also interaction effects with other factors. Using this feature precludes the need to explicitly model each possible interaction, desirable when analyzing large-scale GWAS datasets.
Direct application of RF to GWAS analysis poses a real challenge. First, to roughly cover all SNPs and their interactions, a large number of trees have to be grown. To give a simple example, if we grow a random forest in a dataset with 500,000 SNPs, and assume it involves ~700 SNPs per tree. The probability that two specific SNPs occur in the same tree is about 2×10-6. This means that at least 500,000 trees have to be grown in the forest to ensure that this combination is expected to be represented at least once. The requirement for a huge number of trees, plus the complexity to explore node-splitting variables when growing each tree, make it extremely computation intensive if not entirely impossible. Second, even if we managed to grow enough trees for a GWAS dataset, the variable importance measure based on these trees would not be reliable. Since the vast majority of the chromosomal SNPs are noise (irrelevant to the disease of interest), at each node of a tree, the handful of candidate splitting variables tend to be all noise. That means most splits within a tree are using noise variables, and estimations such as prediction rates and variable importance based on these trees are “fitting-to-noise” and unreliable.
To overcome these challenges, the new RFF method uses the idea of “fishing” to improve the power of RF to detect interacting SNPs, and autonomously determine the size of a globally important group of relevant SNPs. To avoid “fitting to noises”, RFF handles the dimensionality problem by using an iterative process to traverse the space of all variables and limiting the number of variables in each iteration of RF analysis.
2.Cooling scheme used to control bait set sizes
In all experiments reported herein, the rate of decreasing baits sizes is chosen so that over any interval of a given length, on average, RFF always evaluates about the same number of variables. Let the distribution of pool sizes (the same as bait set sizes)follows a density function. For any interval covering from to of a given length the average number of variables to be sampled by all pools over the interval equals , where is an arbitrary bait size. To make the integral only dependent on (constant for any given value of ), we used, where is a normalization constant. This cooling scheme ensures that more iterations are used for smaller bait set sizes.Figure S1 shows the distribution used by this cooling scheme in real data analysis that decreases bait set sizes from 2500 to 50.
Figure S1. Distribution of bait set sizes used in RFF analysis of real GWAS data of the HHD study. Bait sets decreases from 2500 to 50. The distribution is approximated by .
An added benefit of using decreasing bait set sizes is that it also gives the number of variables to output after fishing. In a typical RFF test, the prediction accuracy of bait sets tends to increase at first as more and more important factors are caught. After the set gets saturated with important factors, the important factors are beginning to be dropped off as bait set size decreases, which makes the prediction accuracy stop increasing and start dropping down. We select the bait set at the peak of the prediction rate curve as the output important variables from RFF, i.e., the smallest set with best prediction accuracy.
3.SNP coverage
For the iterative process employed by RFF, we set a targetSNPcoverage to ensure that, on average,all GWAS SNPs are evaluated for the expected number of times. The covered times in real data analysis were summarized and shown in Table S1. When the targeted coverage is 10, only 2 SNPs were not tested, and more than of the GWAS SNPs were tested at least 5 times. The actual average coverage in analysis is 12.1 times. The histograms of actual times a SNP was sampled (covered) were shown in Figure S2.
Table S1.SNP coverage in real analysis for target coverage of 10.
Covered times / Targeted coverage 10SNPs / %
/ 2 / 0%
/ 264 / 0.07%
/ 2970 / 0.76%
/ 17137 / 4.40%
/ 58932 / 15.14%
/ 132459 / 34.02%
1
4.Tuning parameters in RF and RFF
Table S2.TunablerandomForest2 /Random Jungle3 parameters and their default values usedbyRFF in evaluation tests
Parameter / Description / Value used in the reported RFF testsmtry / Number of variables to be randomly selectedand analyzed to choose a splitting variable at each node of a tree / Default value (square root of number of input variables) is used in RFF
ntree / Total number of trees to construct in RF / Default value (500 trees) is used in evaluating RFF.
weight / Weights for each subject in the dataset; to be tuned for unbalanced case/control datasets / No applicable because we useda balanced case-control design.
Tree type / Decision tree or regression tree depending on nature of the phenotype / Decision trees are used for binary traits.
Importance measure / Measure of variable importance, mean decrease in prediction accuracy, or node impurity (Gini index) / Default choice by the RF engine was used: in the simulation tests, mean decrease in prediction accuracy was used by randomForest; in the real data test, decrease in node Gini index was used by Random Jungle.
Table S3. Overview of RFF tuning parameters
Parameter / Description / Value used in the reported RFF testsNumber of populations / Number of populations(parallel bait sets) in genetic algorithm / 10
Initial bait set size / The bait sets size at the beginning / 100 in simulation; 2500 in real GWAS analysis
Last bait set size / The last bait set size / 3 in simulation; 50 in real GWAS analysis
Targeted SNP coverage / The targeted average times that each SNPs will be sampled for evaluation in fishing / 4-24 in simulation analysis; 10 in real GWAS analysis
Number of generations / Total number of generations or iterations / The values is automatically set after generating vector bait set sizes given the initial bait set size, the last bait set size, and the target SNP coverage times. The total number of generations is 603-3614 in simulations, and 624 in real GWAS analysis.
Pairwise interaction guidance / (Optional) if pairwise interaction guidance is to be used, the pairwise interaction test result should be given. / For both simulation and real data analysis, guidance is given by using PLINK fast-epistasis test result.
Initial bait sets / (Optional) the bait sets before RFF starts could be specified. It uses previous analysis results as starting point for fishing. Otherwise, if it is not specified, RFF starts by randomly sample GWAS SNPs to start. / Both simulated and real data analysis reported in the manuscript use random initial bait sets. We also tried using initial sets from top single SNP test results and compared to the random sets. Using top SNPs at start gives much better prediction rate at the beginning, but in the end, it has close best prediction and similar best bait set during the whole fishing process.
1
Tunable parameters used by RF(RJ) outside of RFF
The parameters in Table S2 and Table S3 were used in RFF when calling RF on subsets of variables. RFF was also compared with direct RF applications using RJ. In direct RJ analysis of simulated datasets, SNP importance is estimated based on 1,000 trees. To check if the number of trees is adequate, we tested a few randomly selected datasets with up to 10,000 trees and found no considerable change in either importance values or variable ranks. Other parameters in RJ tests were set using default values as in Table S2.
RJ application using tuned parameters as by Goldstein el. al.4 was performed with mtry set to 1/10 of total GWAS SNPs, ntree set to 8000, and SNPs in high LD pruned out (SNP <0.9).
5.Computational time of RFF tests
For the simulation tests, RFF using the R package and randomForest2 as RF engine took ~10 hours on a single CPU (Quad-Core AMD Opteron(tm) Processor 2354) to analyze the 50K dataset in 1000 subjects at coverage of 4x, and ~65 hours at 24x. For the real data tests, RFF using the C++ package and Random Jungle3 as RF engine took ~3 hours on a single CPU (Intel(R) Xeon(R) X5660 2.80GHz) to analyze the QC’ed 500K dataset in 140 individuals.
6.Power for detecting multiple risk SNPs using 500 cases and 500 controls
Figure S3. Power of detecting risk SNPs using 500 cases and 500 controls
The power is shown in all scenarios to detect any of the risk SNPs (“>=1 SNP”), any of the weak risk SNPs (“>= 1 weak SNPs”), and more weak SNPs. A weak risk SNP is one that has no marginal effect at all, and contributes to disease through interactions. The power to detect the 6 risk SNPs are shown for the 5 scenarios. “Chi2” represent χ2 tests; “RJ” for Random Jungle test; “Pairwise” for pairwise interaction test using PLINK fast epitasis. In these methods, we declare the top 31 SNPs as “detected” to measure the power. “RFF.nointx” and “RFF.intx” are RFF tests without and with interaction. “RFF.intx1” and “RFF.intx2” used empirical guidance based on the pairwise interaction tests, with SNP coverage of 4 and 24, respectively; and “RFF.intx3”used theoretical guidance based on synthetic interactions clustered over the 6 risk SNPs.
1
7.Functional annotation of genes assigned to the RFF-found SNPs
Table S4. Functional enrichment detected by DAVID5 as over representation of genes involved in known biological pathways or with known disease associations.Terms with p value are shown with their P values.
Category / Term / P-ValueOMIM_DISEASE / A Genome-Wide Association Study Identifies Protein Quantitative Trait Loci (pQTLs) / 0.0028
KEGG_PATHWAY / Arrhythmogenic right ventricular cardiomyopathy (ARVC) / 0.044
PANTHER_PATHWAY / P00053:T cell activation / 0.052
KEGG_PATHWAY / Androgen and estrogen metabolism / 0.056
GENETIC_ASSOCIATION_DB_DISEASE / atherosclerosis, coronary, lipoprotein / 0.071
KEGG_PATHWAY / Purine metabolism / 0.074
REACTOME_PATHWAY / REACT_11061:Signalling by NGF / 0.081
GENETIC_ASSOCIATION_DB_DISEASE / Post-transplantation diabetes mellitus (PTDM) / 0.084
PANTHER_PATHWAY / P00033:Insulin/IGF pathway-protein kinase B signaling cascade / 0.085
GENETIC_ASSOCIATION_DB_DISEASE / long QT syndrome / 0.097
1
8.Overlap of RFF results with other tests inreal GWAS data analyses
Table S5. Overlap of top 213 SNP from different tests
RFF* / Chi2 / epi / rj / rj.tunedRFF / 213 / 92 / 1 / 95 / 105
Chi2 / 92 / 213 / 0 / 69 / 85
epi / 1 / 0 / 213 / 0 / 0
rj / 95 / 69 / 0 / 213 / 84
rj.tuned / 105 / 85 / 0 / 84 / 213
*RFF: random forest fishing; Chi2: single SNP test; epi: pairwise interaction test using PLINK fast-epistasis; rj: direct application of Random Jungle using default parameters; rj.tuned: Random Jungle test by tuning parameters as in [Goldstein el. al.], mtry is set to of total GWAS SNPs, and ntree is set to 8000; SNPs in high LD are pruned out (). These parameters lead to the best results in Goldstein el. al.4
Table S6. Overlap genes assigned to the top 213 SNP from different tests as in Table S4.
RFF / Chi2 / epi / rj / rj.tunedRFF / 196 / 95 / 9 / 98 / 121
Chi2 / 95 / 159 / 5 / 73 / 98
epi / 9 / 5 / 138 / 8 / 9
rj / 98 / 73 / 8 / 185 / 103
rj.tuned / 121 / 98 / 9 / 103 / 204
9.References
1.Breiman L: Random Forest. Machine Learning 2001; 45: 5-32.
2.Liaw A, Wiener M: Classification and Regression by randomForest. R News 2002; 2: 18-22.
3.Schwarz DF, Konig IR, Ziegler A: On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics 2010; 26: 1752-1758.
4.Goldstein BA, Hubbard AE, Cutler A, Barcellos LF: An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings. BMC Genetics 2010; 11: 49.
1