Controlling the False Discovery Rate in Behavior Genetics Research

Yoav Benjamini1, Dan Drai2, Greg Elmer3, Neri Kafkafi4, Ilan Golani2

1) Department of Statistics and O.R., The Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv, Israel

2) Department of Zoology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel

3) Maryland Psychiatry Research Institute, MD

4) National Institute of Drug Abuse, Baltimore, MD.

Correspondence:

Yoav Benjamini

Department of Statistics and O.R.,

The Sackler Faculty of Exact Sciences,

Tel Aviv University, Tel Aviv

Israel

Fax (972)-3-6409357

Key Words: Multiple comparisons, exploratory behavior in mice

Summary

The screening of many endpoints when comparing groups from different strains, searching for some statistically significant difference, raises the multiple comparisons problem in its most severe form. When each endpoint’s difference is tested at some declared level of significance, say .05, the probability of making a type I error increases far beyond 0.05. The traditional approach to this problem has been to control the probability of erroneously rejecting even one of the true null hypotheses, with the familiar Bonferroni procedure achieving such control. However, the loss of power that this extra protection incurs in large problems has led many practitioners to neglect multiplicity control altogether. The False Discovery Rate (FDR), suggested by Benjamini and Hochberg (1995), is a new and different point of view on the errors in multiple testing, one that offers a compromise. The FDR is the expected proportion of erroneous rejections among the rejections, and controlling the FDR goes a long way towards controlling the increased error from multiplicity while losing less power. In this paper we explain the FDR criterion and present two simple procedures that control the FDR. We demonstrate their increased power on two studies discussed in the conference: the study of exploratory behavior by Drai et al (2000), and the study of the interaction with laboratory environment by Crabbe et al (1999).

1. Introduction

A quantifiable description of mouse behavior should promote the mapping of the mouse genome by characterizing the repertoires of inbred strains, congenic lines, knockouts, transgenic lines, and populations obtained by selective breeding. The need for such characterization has resulted in the design of batteries of behavioral and physiological tests. As was demonstrated by the papers presented at the “Behavioral Phenotyping of Mouse Mutants” conference (Cologne, February 2000), such studies never consist of a single, pre-specified measure that is compared between two strains of mice. Instead, the studies develop and explore many characteristics, also called behavioral endpoints, trying to identify those endpoints for which there is a significant strain difference. In the above conference the studies reported in the poster session involved the testing of many behavioral endpoints, ranging from 4 to 64 per study (the median being 13). The papers read from the floor were of about the same size in terms of the number of endpoints studied. We estimate that overall, the working list of behavioral endpoints is about a hundred endpoints long, and keeps growing.

The screening of many endpoints when comparing groups from different strains, searching for some statistically significant difference, raises the multiple comparisons problem in its most severe form. The search is conducted by testing each hypothesis of no strain difference in some endpoint, which is done at some declared level of type I error, say .05. Detecting such a difference as “statistically significant” amounts to making a statistical discovery. But when such a large family of hypotheses is screened simultaneously, the probability of making a type I error increases far beyond the declared level of the individual test. If no action is taken, the average number of errors in individual endpoints per study, for the above-mentioned group of studies, may be as high as 2, whether the endpoints are statistically independent or not.
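To make this inflation concrete, the following minimal Python sketch computes the expected number of type I errors and, under the added assumption of independent tests, the probability of at least one such error, when every one of the m tested hypotheses is in fact true. The function names are ours and purely illustrative; the family sizes 4, 13 and 64 are the smallest, median and largest numbers of endpoints quoted above.

```python
def expected_false_rejections(m_true_nulls, alpha=0.05):
    """Expected number of type I errors among m true null hypotheses.

    Follows from linearity of expectation, so it holds whether or not
    the tests are independent.
    """
    return m_true_nulls * alpha


def prob_at_least_one_error(m_true_nulls, alpha=0.05):
    """Probability of one or more type I errors.

    ASSUMPTION: the m tests are statistically independent.
    """
    return 1.0 - (1.0 - alpha) ** m_true_nulls


for m in (4, 13, 64):
    print(m, expected_false_rejections(m), round(prob_at_least_one_error(m), 3))
```

Only the first quantity is free of assumptions about dependence; the second would change if the endpoints were correlated.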

The traditional approach in multiple hypotheses testing to tackle this increased type I error has been to control the probability of erroneously rejecting even one of the true null hypotheses - the control of the familywise error-rate. The books by Hochberg and Tamhane (1987), Westfall and Young (1993), and Hsu (1996) all reflect this tradition. The control of this error-rate at some level α requires each of the individual m tests to be conducted at lower levels, as in the Bonferroni procedure where α/m is used.

The Bonferroni procedure is just an example, as more powerful familywise error-rate controlling procedures are currently available for many multiple testing problems. Many of the newer procedures are as flexible as the Bonferroni, making use of the p-values only. For a recent review see Hsu (1996). Still, there is a fundamental drawback to the traditional approach of controlling the probability of making even one erroneous rejection: the power to reject a specific hypothesis is greatly reduced in a large family of hypotheses. The incurred loss of power in large problems (even with the newer procedures) has led many practitioners to neglect multiplicity control altogether. While multiplicity control is mandatory in psychological research, most medical journals do not require an analysis of the effect of multiplicity on the statistical conclusions; only the leading New England Journal of Medicine does. In genetic research, the need for multiplicity control has been debated heavily. In QTL analysis the debate resulted in some unsatisfactory compromises: controlling the familywise error-rate at the .5 level and using a more limited confirmatory study (see Weller et al 1998 for background and further references). It should be emphasized that the recent trend away from hypothesis testing towards confidence statements does not solve the multiplicity problem, since simultaneous coverage should be of equal concern.

The False Discovery Rate (FDR), suggested by Benjamini and Hochberg (1995), is a new and different point of view on how the errors in multiple testing could be considered, one that offers a compromise. The FDR is the expected proportion of erroneous rejections among the rejections. In this paper we shall explain this notion, discuss some simple procedures that control the FDR while being less conservative, and stress the importance of controlling for the effect of multiplicity.

2. Two motivating examples

In a separate paper in this issue, Drai et al (2000) propose to study the open field behavior of mice using the approach developed in the study of rats. They describe an effort to augment the commonly used measures of the open field test with a set of new ethologically-relevant parameters. These parameters, which can be measured automatically and efficiently, reveal a natural structure that involves motivation, navigation, spatial memory and learning. Some 17 such parameters are identified in that study, and are presented in the leftmost column of Table 1. The values of these parameters were estimated and compared between 8 male C57BL/6Jtau (C57) and 8 male BALB/cJtau (BALB) mice from the Tel Aviv University medical school stocks. We shall use those results to motivate the approach and demonstrate the procedures involved. A more detailed description of the experimental setting is available there.

If the issue of multiplicity is ignored altogether, all nine hypotheses for which the observed p-value is less than 0.05 would be rejected. If instead the possibility of increased type I error is controlled using the Bonferroni procedure, each p-value is compared to 0.05/17 = 0.0029. In this case only six differences are statistically significant. The two approaches thus have quite different implications in this study.

Table 1 about here

A somewhat similar situation appears in the work of Crabbe et al (1999), who studied the possible confounding influences of laboratory environment on tests of mouse behavior. Some 56 statistical hypotheses are tested (48 by a different account), and only 14 would be declared statistically significant had a multiple comparison adjustment been made with the Bonferroni procedure (only 6 by the other account). The authors argue in the published paper that multiplicity corrected p-values should not be given, and their Web site gives a discussion of the rationale for doing so.

Nevertheless, they do take a partial step towards multiplicity correction: instead of the conventional 0.05 level used for statistical significance, they used a somewhat stricter level, namely the 0.01 level.

2. The False Discovery Rate Criterion (FDR)

Consider the case of a family of m null hypotheses being tested in a study. Some tested null hypotheses may be true - possibly even all - meaning no difference exists between the strains in the corresponding parameters. Other hypotheses may be false - meaning differences exist - and we wish to discover these false ones by rejecting the corresponding null hypotheses, thereby making statistical discoveries. Obviously we would like to discover as many of the false ones as possible, while making as few errors of rejecting true ones as possible.

Suppose a set of R hypotheses is rejected; some unknown number V of them may be true ones that were erroneously rejected. Let us “measure the harm” imposed by such errors by considering the ratio V/R, i.e. the proportion of erroneous rejections among the total number of rejections. If no hypothesis is rejected there is no harm from erroneous rejection, so the ratio is defined to be 0. It would make sense to control this proportion in each and every study, but this is impossible because V is unknown. Nevertheless, we can look at the average (expected) value of this proportion, which we define as the False Discovery Rate (FDR). This false discovery rate can be controlled at any desired level.
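The definition can be restated in a few lines of Python. This is only an illustration: in a real study the truth of each null hypothesis is unknown, so the realized proportion V/R cannot be computed. The function names and the two imagined study outcomes below are our own and hypothetical.

```python
def false_discovery_proportion(rejected, is_true_null):
    """Return V/R for one study, defined as 0 when nothing is rejected."""
    r = sum(1 for rej in rejected if rej)
    v = sum(1 for rej, true_null in zip(rejected, is_true_null)
            if rej and true_null)
    return v / r if r > 0 else 0.0


def mean_false_discovery_proportion(studies):
    """Average V/R over repeated studies; an empirical stand-in for the FDR."""
    proportions = [false_discovery_proportion(rej, truth)
                   for rej, truth in studies]
    return sum(proportions) / len(proportions)


# Hypothetical outcomes of two imagined repetitions of a 4-endpoint study
# in which the first two null hypotheses are true.
truth = [True, True, False, False]
studies = [
    ([False, True, True, True], truth),      # V = 1, R = 3
    ([False, False, False, False], truth),   # R = 0, so the proportion is 0
]
print(mean_false_discovery_proportion(studies))
```

The FDR itself is the expectation of this proportion over all possible outcomes of the study; the average over a finite list only approximates it.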

The false discovery rate criterion is a compromise between the uncorrected analysis of the multiple tests, and the traditionally corrected approaches. If all tested hypotheses are true, meaning that there is no behavioral difference between the strains whatsoever, controlling the FDR controls the traditional probability of making even one error. Therefore it also makes sense to use the conventional levels such as .05 or .01 for FDR control (though in some applications higher values may be justifiable). However, when many of the tested hypotheses are rejected, indicating that many hypotheses are not true, the error from a single erroneous rejection is not always as crucial for drawing conclusions from the family tested, and the proportion of errors among the rejections is controlled instead. Thus we are ready to bear more errors when many hypotheses are rejected, but fewer errors when fewer are rejected: two errors out of 40 established differences is bearable, two errors out of 4 is certainly not.

In many applied problems it has been argued that the control of the FDR at some specified level is the more appropriate response to the multiplicity concern: in educational research (Williams, Jones and Tukey, 1999), signal processing (Abramovich and Benjamini 1998), medical research (Mallet, 1998), psychology (Keselman et al 1999) and genetics (Weller et al., 1998). The practical difference between the two approaches is not small or trivial, and the larger the problem the more dramatic the difference is. In the following section we present two procedures that control the FDR at the desired level.

3. Two FDR controlling procedures

Benjamini and Hochberg (1995) gave a simple stepwise procedure that controls the FDR when the test statistics are independent. This procedure has recently been shown to control the FDR when the test statistics are positively correlated as well. It is available in SAS, but can easily be performed by hand, as we show below. We shall also present the procedure of Benjamini and Liu (1999), which always controls the FDR, even for generally correlated test statistics.

Both procedures make use of the p-values of the tested differences only, so the statistical test itself may be tailored to the problem at hand, be it t-test, binomial test, chi-square test, or some other non-parametric test. The individual p-values should then be sorted from smallest to largest, as is demonstrated in columns 2 and 3 of Table 1. Denote the i-th smallest p-value (in the i-th row) by p(i), for each i between 1 and m. The procedure in Benjamini and Hochberg (1995) runs as follows:

Starting from the largest p-value p(m), compare p(i) with 0.05*i/m. Continue to the next smaller p-value as long as p(i) > 0.05*i/m. Let k be the first index at which p(k) is less than or equal to 0.05*k/m, and reject the hypotheses corresponding to the smallest k p-values. If no such k is found, no hypothesis is rejected.

The procedure is demonstrated in column 5 of Table 1, controlling the FDR at level 0.05. Starting with the largest p-value, the 17th in order: 0.87 > 0.05*17/17 = 0.05. We therefore continue to the 16th p-value, which is 0.56, and compare it to 0.05*16/17 = 0.047. It is still larger, so we continue up in Table 1. The first time the inequality is reversed is at the 9th p-value, where 0.0148 is less than 0.05*9/17 = 0.0264. We thus reject all 9 hypotheses corresponding to the p-values less than .0264.

Note that in this procedure we start at the 0.05 level, and if all differences are statistically significant at this level, all are rejected, as if no correction for multiplicity were made. If we get all the way to the smallest p-value, that hypothesis will be rejected if its p-value is less than .05/17, as if the Bonferroni procedure were used. In between, the constants are linearly spaced. In our example, all hypotheses with p-value less than 0.05 are rejected by this procedure.
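The step-up rule described above can be sketched in Python as follows. This is a minimal illustration, not the SAS implementation mentioned earlier; the function name and the five p-values are hypothetical and are not taken from Table 1.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Step-up rule: return a list of booleans, True where the hypothesis
    is rejected, in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    # Walk from the largest p-value (i = m) down to the smallest (i = 1).
    for i in range(m, 0, -1):
        if p_values[order[i - 1]] <= q * i / m:
            k = i          # first index at which the inequality holds
            break
    rejected = [False] * m
    for j in order[:k]:    # reject the k smallest p-values
        rejected[j] = True
    return rejected


# Hypothetical p-values for a family of m = 5 endpoints.
example = [0.30, 0.004, 0.028, 0.025, 0.60]
print(benjamini_hochberg(example))
```

In this made-up family the comparison first succeeds at the third smallest p-value (0.028 against 0.05*3/5 = 0.03), so three hypotheses are rejected, including the one with p-value 0.025 even though it exceeds its own constant 0.05*2/5 = 0.02. The Bonferroni threshold 0.05/5 = 0.01 would reject only the smallest.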

The procedure in Benjamini and Liu (1999) runs as follows:

Starting from the smallest p-value p(1), compare p(i) with h(i) = min(0.05, 0.05*m/(m+1-i)^2). Reject the hypothesis corresponding to p(1) if it is less than or equal to the threshold h(1), and continue to reject as long as p(i) is less than or equal to h(i). Stop when p(i) > h(i) for the first time.

The procedure is demonstrated in column 6 of Table 1, controlling the FDR at level 0.05. Starting with the smallest p-value: 0.00063 < min(0.05, 0.05*17/17^2) = 0.0029. We therefore continue to the second p-value, which is 0.000013, and compare it to 0.05*17/16^2 = 0.0033. It is still smaller, so we continue down in Table 1. The last time the inequality holds is at the 8th p-value, where 0.0065 is less than 0.05*17/10^2 = 0.0085. For the 9th p-value the threshold is min(0.05, 0.05*17/9^2) = 0.0104, and the 9th p-value is 0.0148, which is bigger. We therefore stop, and reject the eight hypotheses whose p-values are smaller than 0.0085.

Note that the largest 4 constants are all 0.05 because for them 0.05*m/(m+1-i)^2 is larger than 0.05. Following the remark in Benjamini and Liu (1999), this modification of the procedure ensures that a hypothesis is not rejected unless: (a) within the tested family the FDR is less than .05; and (b) individually it is statistically significant at the 0.05 level.

Again, as with the first procedure, at the two extremes the p-values are compared to 0.05 and to 0.05/17. The progression of the thresholds, though, is not linear, and the stepping direction is reversed: from the smallest p-value to the largest.
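The step-down rule of this second procedure, with the thresholds h(i) as stated above, can be sketched in Python as follows. The function names and the five-value example are hypothetical; only the constants for m = 17 correspond to the numbers quoted in the text.

```python
def benjamini_liu_thresholds(m, q=0.05):
    """Thresholds h(1), ..., h(m), with h(i) = min(q, q*m/(m+1-i)^2)."""
    return [min(q, q * m / (m + 1 - i) ** 2) for i in range(1, m + 1)]


def benjamini_liu(p_values, q=0.05):
    """Step-down rule: return a list of booleans, True where the hypothesis
    is rejected, in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    h = benjamini_liu_thresholds(m, q)
    rejected = [False] * m
    # Walk from the smallest p-value upward; stop at the first failure.
    for position, j in enumerate(order):
        if p_values[j] > h[position]:
            break
        rejected[j] = True
    return rejected


h17 = benjamini_liu_thresholds(17)
print(round(h17[0], 4), round(h17[7], 4))   # thresholds for i = 1 and i = 8
print(sum(1 for x in h17 if x == 0.05))     # number of constants capped at 0.05

# Hypothetical p-values for a family of m = 5 endpoints.
print(benjamini_liu([0.30, 0.004, 0.028, 0.025, 0.60]))
```

On the hypothetical five-value family only the smallest p-value is rejected, since the second smallest, 0.025, already exceeds h(2) = 0.05*5/16. This is the price of making no assumption about the dependence among the test statistics.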

So far we have demonstrated both procedures on the same data. There is a difference between the results of the two procedures, the first one rejecting 9 hypotheses, the second only 8. Which one is more appropriate? The first procedure requires that the estimated differences be statistically independent or positively dependent. The second one requires no such assumption – in fact it requires no assumption at all. Checking the dependencies among our differences of endpoints, we found a few quite large and significant negative correlations. Therefore, the correct procedure to use in this example is the more general second procedure.

As to the second example from Crabbe et al (1999), the design of the study implies that the tests are almost independent (testing interactions and main effects in a balanced ANOVA). Therefore we may use the first procedure. Exact p-values are not given in that paper, but the available information allows some rough arithmetic: among the 56 tested hypotheses, 14 p-values are less than .00001, 6 are between 0.00001 and 0.001, and 4 are between 0.001 and 0.01. Thus the 24th p-value is less than 0.01. Since 0.01 < 0.05*24/56 = 0.021, at least the hypotheses corresponding to these 24 p-values should be rejected. Had we had the full information, it is quite possible that a few more hypotheses could be rejected.

That paper also demonstrates how to overcome a possible manipulation of the FDR criterion. One may make the FDR criterion less restrictive by “throwing in” among the tested hypotheses a few that are plainly false, and therefore sure rejections. In this study the strains and parameters were chosen so as to bring out strain differences as clearly as possible – here for a good reason and not merely in order to manipulate the FDR criterion. However, the information needed to tackle this difficulty also appears in the study: identifying the 8 “obvious” strain main effects, we can take them out of the analysis, thereby leaning towards a more conservative FDR analysis. We now have 16 p-values less than 0.01, which should be compared to 0.05*16/48 = 0.017, and these same 16 are still rejected. Skeptical readers of other studies can always perform the above sensitivity analysis if they suspect a problem, because the relevant information has to be included in the studies.
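The rough arithmetic of the last two paragraphs can be checked with a short Python sketch. It works from counts of p-values known to lie below given bounds, since exact p-values are not available. The function name is ours, and the second call rests on a labeled assumption: that the 8 “obvious” strain main effects are among the 14 p-values below .00001.

```python
def guaranteed_rejections(bins, m, q=0.05):
    """Lower bound on the number of rejections by the first (step-up)
    procedure when only binned p-values are known.

    bins: list of (count, upper_bound) pairs ordered by increasing
    upper_bound, meaning `count` p-values lie below `upper_bound`.
    """
    guaranteed = 0
    cumulative = 0
    for count, upper_bound in bins:
        cumulative += count
        # The cumulative-th smallest p-value is below upper_bound, so it
        # passes its comparison whenever upper_bound <= q * cumulative / m.
        if upper_bound <= q * cumulative / m:
            guaranteed = cumulative
    return guaranteed


# All 56 tested hypotheses, binned as described in the text.
all_tests = [(14, 0.00001), (6, 0.001), (4, 0.01)]
print(guaranteed_rejections(all_tests, m=56))

# Sensitivity analysis with the 8 "obvious" main effects removed.
# ASSUMPTION: all 8 come from the smallest bin (14 - 8 = 6 remain there).
without_obvious = [(6, 0.00001), (6, 0.001), (4, 0.01)]
print(guaranteed_rejections(without_obvious, m=48))
```

The result is a lower bound only: with the exact p-values, hypotheses whose p-values lie somewhat above 0.01 might also be rejected.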

4. Discussion

It is clear that the multiple comparisons problem has to be addressed in the comparison of behavioral endpoints between strains of mice. This is especially important in any automated screening tool designed to discover genetic differences, such as the effort of Golani et al (2000) in the study of exploratory behavior. The control of the false discovery rate seems to us the appropriate approach for that purpose, striking a balance between the concern about the type I error-rate and the concern about the type II error-rate that arises from being too conservative. This was evident in both examples analyzed, and in the analysis of Crabbe et al (1999) the procedure in fact arrived at the compromise level of strictness that the experimenters had chosen intuitively. We therefore recommend using FDR controlling procedures while screening an established list of endpoints or a potential pool of new ones.

Traditional multiple comparisons procedures offer even stricter control against the increased type I error of discovering a non-existing difference. Thus, if differences are found which pass the Bonferroni threshold, or that of other procedures that control the familywise error-rate, the evidence should be regarded as stronger. Williams et al (1999) suggest calling such differences “highly significant”, and those passing only the FDR threshold simply “significant”. We are not sure that such formalism is needed, but we do emphasize that if a difference is not found to be statistically significant after controlling for the FDR, it should not be declared “statistically significant” - even if individually its corresponding p-value is less than 0.05.