Fisher’s Exact Test

FISHER’S EXACT TEST FOR THE TWO-BY-TWO TABLE



Suppose that X ~ Bin(m, p1) and Y ~ Bin(n, p2). Further suppose that X and Y are independent. The data layout can be summarized in this two-by-two table:

Group / Successes / Failures / Total
1 / X = A / m - X = B / m = A+B
2 / Y = C / n - Y = D / n = C+D
Total / A+C / B+D / N = A+B+C+D

The display is overlaid with the A - B - C - D notation. We’ll refer to this notation at the end.

We want to test the null hypothesis H0: p1 = p2. For the moment, we will be intentionally vague about whether H1 is p1 > p2 or p1 < p2 or p1 ≠ p2. If m and n are reasonably large, then we can use the chi squared statistic

2 =

Here “reasonably large” means big enough to use the normal approximation. This would seem to require that m ≥ 30 and n ≥ 30. The approximation also requires that X not be too close to 0 or to m, and that Y not be too close to 0 or to n; in such cases a Poisson approximation would be more appropriate.

There is now some question about how to proceed if m and/or n is small. Let’s use the symbol T = X + Y for the total of the first column. In terms of the random variables, let’s write the table as

Group / Successes / Failures / Total
1 / X / / m
2 / Y / / n
Total / T / / N

The joint distribution, under H0 , of (X, Y) is

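$$f(x, y) = \binom{m}{x} p^{x} (1-p)^{m-x} \binom{n}{y} p^{y} (1-p)^{n-y} = \binom{m}{x} \binom{n}{y} p^{x+y} (1-p)^{m+n-x-y}$$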

This uses p as the common value for p1 and p2 . Let’s make the transformation

(X, Y) → (X, T)

Since these are all discrete random variables, we do not need to worry about the Jacobian of a multivariable transformation. We need only replace y by t − x. The transformed likelihood is

$$f(x, t) = \binom{m}{x} \binom{n}{t-x} p^{t} (1-p)^{m+n-t}$$

We could also be extra careful and note the ranges of the variables. The likelihood above could have the indicator product I(0 ≤ x ≤ m) I(0 ≤ y ≤ n). This would be transformed to I(0 ≤ x ≤ m) I(0 ≤ t − x ≤ n). We’ll have a little more to say about this below.

Since T = X + Y, its distribution under the null hypothesis would be binomial (m+n, p). That is, the probability law of T is

$$f(t) = \binom{m+n}{t} p^{t} (1-p)^{m+n-t}$$

Let’s now find the conditional probability law of X, given T = t. This is

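$$f(x \mid t) = \frac{\binom{m}{x} \binom{n}{t-x} p^{t} (1-p)^{m+n-t}}{\binom{m+n}{t} p^{t} (1-p)^{m+n-t}} = \frac{\binom{m}{x} \binom{n}{t-x}}{\binom{m+n}{t}}$$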

It’s quite amazing that all the factors in p cancel.

This is the conditional probability law of X, given a total number of successes t. This conditional law is hypergeometric, saying that the number of successes in group 1 is based on m random draws from a set of m + n values, of which t are successes.
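The cancellation of p can also be checked numerically. The sketch below is only an illustration: the function names and the sizes m = 10, n = 12, t = 10 are our own choices, not part of the derivation. It computes the conditional law as the ratio of the joint law of (X, T) to the binomial law of T for several values of p, and compares each ratio with the hypergeometric formula.

```python
from math import comb

def joint_pmf(x, t, m, n, p):
    """P(X = x, T = t) under H0, where T = X + Y and p is the common success probability."""
    y = t - x
    if not (0 <= x <= m and 0 <= y <= n):
        return 0.0
    return comb(m, x) * comb(n, y) * p**t * (1 - p)**(m + n - t)

def total_pmf(t, m, n, p):
    """P(T = t) under H0: binomial(m + n, p)."""
    return comb(m + n, t) * p**t * (1 - p)**(m + n - t)

def conditional_pmf(x, t, m, n, p):
    """P(X = x | T = t), computed directly as joint / marginal."""
    return joint_pmf(x, t, m, n, p) / total_pmf(t, m, n, p)

def hypergeom_pmf(x, t, m, n):
    """Hypergeometric form of the conditional law; p does not appear."""
    if not (0 <= x <= m and 0 <= t - x <= n):
        return 0.0
    return comb(m, x) * comb(n, t - x) / comb(m + n, t)

m, n, t = 10, 12, 10  # illustrative sizes, chosen arbitrarily
for p in (0.1, 0.5, 0.9):
    for x in range(max(0, t - n), min(t, m) + 1):
        ratio = conditional_pmf(x, t, m, n, p)
        # The ratio agrees with the hypergeometric value whatever p is.
        assert abs(ratio - hypergeom_pmf(x, t, m, n)) < 1e-12
```

Whatever value of p is used, the ratio agrees with the hypergeometric probability up to rounding error.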

This hypergeometric distribution forms the basis for Fisher’s exact test. No approximations are needed! However, the hypergeometric distribution is a little “gritty” in the sense to be noted below, and we may have trouble using it.

Here is an example which shows Fisher’s exact test in action. Suppose that we want to compare the success rates for two different brands of arthritis medication. Brand Q was tested on 10 subjects, and brand R was tested on 12 subjects. Each subject was asked whether the treatment provided any relief. The responses were these:

Brand / Successes / Failures / Total
Q / 2 / 8 / 10
R / 8 / 4 / 12
Total / 10 / 12 / 22

The obvious null hypothesis is H0: pQ = pR. We want α = 0.05 as the level of significance. We are tempted to use the chi squared statistic, but the sample sizes are too small to make the approximate distribution believable. Just for the record, the value would be χ² ≈ 4.7911.
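The quoted value can be reproduced with a few lines of code. This is a minimal sketch, assuming the statistic is the usual uncorrected Pearson form N(AD − BC)² / [(A+B)(C+D)(A+C)(B+D)] in the A - B - C - D notation; the function name is our own.

```python
def chi_squared_2x2(a, b, c, d):
    """Uncorrected Pearson chi squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n_total = a + b + c + d
    numerator = n_total * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Arthritis data: Q has 2 successes and 8 failures, R has 8 successes and 4 failures.
stat = chi_squared_2x2(2, 8, 8, 4)
print(round(stat, 4))  # 4.7911
```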

Suppose that the alternative hypothesis were H1: pQ > pR . That is, suppose that the experiment was set up by someone expecting to show that brand Q was better. An immediate look at the data shows that product Q was worse. There is no way that H1 has any chance at all. With no serious computation, we simply accept H0 and move on.

Suppose that the alternative hypothesis were H1: pQ < pR, and that the experiment was set up by someone expecting to show that brand R was better. Since the sample proportions are p̂Q = 2/10 = 0.20 and p̂R = 8/12 ≈ 0.67, the data suggest H1 and we must investigate whether this apparent superior performance is distinguishable from mere chance. If the null hypothesis holds, and if T = 10 is the total number of successes, then the conditional distribution of X, the value in the upper left box, is hypergeometric with parameters (22; 10, 10). These three parameters are (grand total; row total, column total). Thus

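$$P(X = x \mid T = 10) = \frac{\binom{10}{x} \binom{12}{10-x}}{\binom{22}{10}} = \frac{\binom{10}{x} \binom{12}{10-x}}{646646}$$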

We could invoke this reasoning for any of the four cells of the table. For this table, the first row has a smaller total than the second row (10 vs 12), and the first column has a smaller total than the second column (10 vs 12). It’s convenient to work with the cell corresponding to the smaller of the two row totals and also to the smaller of the two column totals. For this table, it’s the upper-left cell. If this strategy is used, the conditional hypergeometric distribution will start at zero, and the clerical issues are a little simpler.

Once t is fixed in the indicator product I(0 ≤ x ≤ m) I(0 ≤ t − x ≤ n), the range of x’s is given by max{ 0, t − n } ≤ x ≤ min{ t, m }. It’s really easiest if the lower limit is zero.
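The range rule is easy to put into code. In this small sketch (the function name is our own), the first call uses the upper-left cell of the arthritis table. The second call, shown only for contrast, treats brand R as group 1 and failures as the column of interest, which is the lower-right cell.

```python
def support(m, n, t):
    """Smallest and largest possible x, given group sizes m and n and column total t."""
    lower = max(0, t - n)
    upper = min(t, m)
    return lower, upper

print(support(10, 12, 10))  # (0, 10): the upper-left cell starts at zero
print(support(12, 10, 12))  # (2, 12): the lower-right cell cannot go below 2
```

The second range does not start at zero, which is the clerical nuisance that the smaller-row, smaller-column guideline avoids.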

Here, with the help of Minitab, is this distribution:

x / Prob / CumProb
0 / 0.000102 / 0.00010
1 / 0.003402 / 0.00350
2 / 0.034447 / 0.03795
3 / 0.146974 / 0.18492
4 / 0.300071 / 0.48500
5 / 0.308645 / 0.79364
6 / 0.160753 / 0.95439
7 / 0.040826 / 0.99522
8 / 0.004593 / 0.99981
9 / 0.000186 / 1.00000
10 / 0.000002 / 1.00000
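The same table can be produced without Minitab. Here is a minimal sketch using only the Python standard library; the function and variable names are our own.

```python
from math import comb

def hypergeom_pmf(x, m, n, t):
    """P(X = x | T = t) for group sizes m and n and t total successes."""
    if not (0 <= x <= m and 0 <= t - x <= n):
        return 0.0
    return comb(m, x) * comb(n, t - x) / comb(m + n, t)

m, n, t = 10, 12, 10  # brand Q size, brand R size, total successes
rows = []
cum = 0.0
print("x / Prob / CumProb")
for x in range(max(0, t - n), min(t, m) + 1):
    prob = hypergeom_pmf(x, m, n, t)
    cum += prob
    rows.append((x, prob, cum))
    print(f"{x} / {prob:.6f} / {cum:.5f}")
```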

The actual event was { X = 2 }. We define the p value as P[ X ≤ 2 ] = 0.03795. Since this is less than 0.05, we would reject H0 and conclude that brand R is better. The event { X ≤ 2 } would be described as “an outcome as extreme, or even more extreme, in support of H1 than the outcome actually observed.”

If we had been asked, before seeing the data, to formulate a 5% rejection rule, that rule would have to be “Reject H0 if X ≤ 2.” This would however not be a 5% rule, as the real probability of Type I error would be only 3.795%. The grittiness of a discrete distribution prevents us from hitting the 5% target.
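The grittiness can be seen by searching for the rejection rule directly. The sketch below (names are our own) finds the largest cutoff c for which P[ X ≤ c ] stays at or below the target level, and reports the level actually achieved.

```python
from math import comb

def lower_tail_rule(m, n, t, alpha):
    """Return (c, size): the largest c with P[X <= c] <= alpha, and that probability.

    Returns None if even the smallest possible x has probability above alpha.
    """
    best = None
    cum = 0.0
    for x in range(max(0, t - n), min(t, m) + 1):
        cum += comb(m, x) * comb(n, t - x) / comb(m + n, t)
        if cum > alpha:
            break
        best = (x, cum)
    return best

c, size = lower_tail_rule(10, 12, 10, 0.05)
print(c, round(size, 5))  # cutoff 2, with achieved level about 0.03795
```

Moving the cutoff up to 3 would push the Type I error probability to about 18.5%, so there is nothing available between 3.795% and that.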

Finally, suppose that the alternative hypothesis were H1: pQ ≠ pR. This would correspond to an experiment set up with no particular prejudice. The null hypothesis distribution is still the one above, but it’s not obvious how to compute a p value. Here are two common methods.

(1) Conduct the test as two separate one tail tests, each at level α/2. Equivalently, double the one tail p value and compare to α. In our case, this would be 2 × 0.03795 = 0.0759. As this is bigger than 0.05, we would accept H0. This is called the Clopper-Pearson method.

It can sometimes happen that the most extreme event in one of the tails has probability > α/2. In that case, the Clopper-Pearson method reverts to a one tail test at level α. This is not an issue in our case, because the low end event { X = 0 } has probability 0.000102 < 0.025 and also the upper end event { X = 10 } has probability 0.000002 < 0.025.

(2) Form the rejection set by collecting the outcomes { X = x } in increasing order of probability. For our data this would give the following:

x / Prob / CumProb
10 / 0.000002 / 0.000002
0 / 0.000102 / 0.000102
9 / 0.000186 / 0.000290
1 / 0.003402 / 0.003692
8 / 0.004593 / 0.008285
2 / 0.034447 / 0.042732
7 / 0.040826 / 0.083558
3 / 0.146974 / 0.230532
6 / 0.160753 / 0.391285
4 / 0.300071 / 0.691356
5 / 0.308645 / 1.000001

Our problem had { X = 2 }, and the cumulative through this value is 0.042732. As this is below 0.05, we would reject H0 .

This is the Wilson-Sterne rule. Other methods have been proposed as well. The Wilson-Sterne method is more powerful, meaning that it rejects H0 more often, but it has other problems.

Problem 1: The Wilson-Sterne method is more complicated than the Clopper-Pearson method.

Problem 2: One can invert hypothesis tests to get confidence intervals. The phrase “invert hypothesis tests” refers to finding the set { θ0 | with actual data x, the hypothesis H0: θ = θ0 is accepted }. The Wilson-Sterne rule can sometimes lead to disconnected confidence intervals!
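The two two-sided rules can be compared side by side on the arthritis data. This is a sketch under our own naming; the cap at 1 in the doubling rule and the small tolerance for tied probabilities are conventions we have assumed, not something stated above.

```python
from math import comb

def hypergeom_table(m, n, t):
    """Map each possible x to P(X = x | T = t)."""
    lo, hi = max(0, t - n), min(t, m)
    return {x: comb(m, x) * comb(n, t - x) / comb(m + n, t) for x in range(lo, hi + 1)}

def doubled_p_value(table, x_obs):
    """Method (1): double the smaller one tail p value (capped at 1, an assumed convention)."""
    lower = sum(p for x, p in table.items() if x <= x_obs)
    upper = sum(p for x, p in table.items() if x >= x_obs)
    return min(1.0, 2 * min(lower, upper))

def ordered_p_value(table, x_obs):
    """Method (2): add up every outcome that is no more probable than the observed one."""
    p_obs = table[x_obs]
    tolerance = 1 + 1e-9  # treat outcomes with numerically equal probabilities as ties
    return sum(p for p in table.values() if p <= p_obs * tolerance)

table = hypergeom_table(10, 12, 10)
print(round(doubled_p_value(table, 2), 4))  # 0.0759, above 0.05
print(round(ordered_p_value(table, 2), 4))  # 0.0427, below 0.05
```

With the same data and the same 5% level, method (1) accepts H0 and method (2) rejects it.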

Fisher’s exact test was illustrated with focus on the upper left cell. The procedure could have been carried out with any of the four cells of the table. All conclusions would be logically and numerically consistent. As a clerical guideline, it’s usually easiest to focus on the cell which has the smaller row total and also the smaller column total.
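The consistency claim can be checked on the arthritis table. In the sketch below (the function name and argument order are our own), the one tail event “X ≤ 2 in the upper left cell” is restated for each of the other three cells, and the probability is computed from that cell’s own row and column totals.

```python
from math import comb

def cell_pmf(x, row_total, col_total, grand_total):
    """Hypergeometric probability that a cell equals x, given its row and column totals."""
    other = col_total - x                      # same column, other row
    other_row_total = grand_total - row_total
    if x < 0 or x > row_total or other < 0 or other > other_row_total:
        return 0.0
    return comb(row_total, x) * comb(other_row_total, other) / comb(grand_total, col_total)

N = 22
# Upper left (row 10, column 10), observed 2: the event is "cell <= 2".
p_a = sum(cell_pmf(x, 10, 10, N) for x in range(0, 3))
# Upper right (row 10, column 12), observed 8: the same event is "cell >= 8".
p_b = sum(cell_pmf(x, 10, 12, N) for x in range(8, 11))
# Lower left (row 12, column 10), observed 8: the same event is "cell >= 8".
p_c = sum(cell_pmf(x, 12, 10, N) for x in range(8, 11))
# Lower right (row 12, column 12), observed 4: the same event is "cell <= 4".
p_d = sum(cell_pmf(x, 12, 12, N) for x in range(0, 5))

print(round(p_a, 5), round(p_b, 5), round(p_c, 5), round(p_d, 5))
```

All four sums come out to the same p value, about 0.03795.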

This discussion would not be complete without mention of Fisher’s “lady tasting tea” experiment. A certain aristocratic lady has taste so refined that she can tell whether the milk or the tea was placed first in her teacup. This unusual skill is to be put to a test. Eight teacups are set out. Four cups are selected at random, and for these selected cups, the milk is poured first. For the remaining cups, the tea is poured first. The lady, who has been discreetly kept away from the preparation, is now asked for her judgments. She is aware that there are exactly four cups in each category, and she will try to identify the four cups in which the milk was poured first. That is, she will supply data for this table:

Lady’s judgment
Actual / Milk poured first / Tea poured first / Total
Milk poured first / / / 4
Tea poured first / / / 4
Total / 4 / 4 / 8

It happens that the lady identifies three out of four correctly. If we use the 5% level of significance, how should we appraise her skill?

Note that she has filled out the table in this fashion:

Lady’s judgment
Actual / Milk poured first / Tea poured first / Total
Milk poured first / 3 / 1 / 4
Tea poured first / 1 / 3 / 4
Total / 4 / 4 / 8

The null hypothesis is that her guessing is random, and the alternative is that she has some skill. The alternative would support large numbers in the upper left cell. Let’s associate this cell with the random variable X. The null hypothesis distribution is hypergeometric, with these probabilities:

x / Prob / CumProb
0 / 0.014286 / 0.01429
1 / 0.228571 / 0.24286
2 / 0.514286 / 0.75714
3 / 0.228571 / 0.98571
4 / 0.014286 / 1.00000

The result was { X = 3 }. The event { X ≥ 3 } would be described as “an outcome as extreme, or even more extreme, in support of H1 than the outcome actually observed.” Then P[ X ≥ 3 ] = 1 − P[ X ≤ 2 ] = 1 − 0.75714 = 0.24286 gives us the p value. This is well in excess of 0.05, so that we would have to accept the null hypothesis that the lady is guessing. She would have to correctly identify all four of the milk-first teacups in order to be convincing!
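The tea-tasting numbers are small enough to verify by hand, but a short sketch makes the comparison with the 5% level explicit. The function name and its default argument are our own.

```python
from math import comb

def tea_pmf(x, cups_per_kind=4):
    """P(X = x): she correctly picks x of the milk-first cups when guessing at random."""
    total_cups = 2 * cups_per_kind
    ways = comb(cups_per_kind, x) * comb(cups_per_kind, cups_per_kind - x)
    return ways / comb(total_cups, cups_per_kind)

p_three_or_more = tea_pmf(3) + tea_pmf(4)  # 17/70, the p value for her actual result
p_all_four = tea_pmf(4)                    # 1/70, the p value had she been perfect

print(round(p_three_or_more, 5))  # 0.24286
print(round(p_all_four, 5))       # 0.01429
```

Only the perfect score falls below 0.05, which is why three out of four is not convincing.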

  gs2011