Multiple Comparison Handout

I. Multiple Comparison

A. What is it?

When your ANOVA has more than 2 groups and is significant, it only tells you that two or more groups are significantly different from one another, but not which ones.

- In order to determine which means differ, we compare pairs of means and test each pair for a significant difference. Because we do this for multiple pairs (e.g., X1 vs. X2, X1 vs. X3, X2 vs. X3), it is called a Multiple Comparison.

B. Why do we do it?

When we want to test the differences between a large number of groups, we could just use a series of t-tests and skip the ANOVA. However, each additional test adds to the accumulated Type I error rate.

- For example: if I compare 3 groups, I will have K(K-1)/2 comparisons to make (3 in this case). If each comparison is tested at the p < .05 level, then you end up with a final alpha (probability of a Type I error) of .15, i.e., a 15% Type I error rate.

- This is called the Per Experiment (PE) Error Rate. It represents the number of Type I errors we expect to make when the Null Hypothesis (Ho) is true.

- Per Comparison (PC) Error Rate = the alpha used for each individual test.

PE error rate = (# comparisons)*(PC).

(3) * (.05) = .15

- Familywise (FW) Error Rate = When we have a group of comparisons between means (called a Family of Comparisons), we can estimate the probability of making at least 1 Type I error in our family of comparisons.

FW = 1 - (1 - PC)^c (where c = number of comparisons)

FW for a variable with 3 groups (c = 3) = 1 - (1 - .05)^3 = 1 - .8574 = .1426

- PC ≤ FW ≤ PE
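
- As a quick check of these error-rate formulas, here is a minimal Python sketch (the function name and rounding are just for illustration) that reproduces the example above:

def error_rates(K, alpha=0.05):
    c = K * (K - 1) // 2            # number of pairwise comparisons
    pe = c * alpha                  # PE = (# comparisons) * (PC)
    fw = 1 - (1 - alpha) ** c       # FW = 1 - (1 - PC)^c
    return c, pe, fw

c, pe, fw = error_rates(3)
print(c, round(pe, 2), round(fw, 4))    # 3 comparisons, PE = .15, FW = .1426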

II. Planned Linear Contrasts (A priori)

A. What, When, & Why

1. What- Planned Linear Contrasts are comparisons that you plan on making between specific means or groups of means based on your hypotheses. They are tested using F statistics with df between = 1 and df error = N - K degrees of freedom, and each test is compared to the critical value of F at the p = .05 level.

2. When- Use them when you have specific hypotheses regarding the means of your groups. In general you should not use Linear Contrasts when the number of comparisons you are making exceeds the number of degrees of freedom you have between groups (i.e., K - 1). When the number of contrasts exceeds K - 1, the Bonferroni procedure is generally preferred.

3. Why- These planned contrasts hold the FW error rate at .05, because the comparisons do not exceed the Type I error rate allotted to the overall Alpha.

B. Simple Comparisons - Comparing two means at a time.

For example: an ANOVA with 4 levels; 1 vs. 2, 2 vs. 3, 3 vs. 4.

C. Complex Comparisons - Comparing groups of means or a group of means to a single mean.

For example, in a One-Way ANOVA with 4 levels you may be interested in testing the differences between group 1 and group 3, group 1 and group 4, group 2 and group 3, and group 2 and group 4, but have no reason to believe that group 1 and group 2 would differ or that group 3 and group 4 would differ. In such a case you could compare the groups of groups:

(Grp 1 + Grp 2)/2 vs. (Grp 3 + Grp 4)/2
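
- The handout does not spell out the contrast formula itself, but assuming equal group sizes n and the standard linear-contrast F, F = (sum of weight * mean)^2 / (MS error * sum of squared weights / n), the complex comparison above could be sketched in Python as follows (the means, MS error, and n are hypothetical):

from scipy.stats import f

def contrast_F(means, weights, ms_error, n):
    psi = sum(w * m for w, m in zip(weights, means))     # value of the contrast
    var_psi = ms_error * sum(w ** 2 for w in weights) / n
    return psi ** 2 / var_psi                            # F with df = 1 and N - K

means = [10.0, 11.0, 14.0, 15.0]          # hypothetical group means
weights = [0.5, 0.5, -0.5, -0.5]          # (Grp 1 + Grp 2)/2 vs. (Grp 3 + Grp 4)/2
n, K = 10, 4
F_obs = contrast_F(means, weights, ms_error=6.0, n=n)
F_crit = f.ppf(0.95, 1, n * K - K)        # critical F at p = .05, df = 1 and N - K
print(round(F_obs, 2), round(F_crit, 2))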

III. Post Hoc Analyses (A posteriori)

A. What, When & Why

1. What- Post Hoc tests allow us to make simple and complex comparisons between our group means when we do not have any a priori hypotheses about how the means should be related. Most of these tests use some form of the t statistic. The basic root of all the Post Hoc tests is:

t = (Xi - Xj) / sqrt(MS error * (1/ni + 1/nj))

Different tests use different degrees of freedom and different alpha levels for the critical value, depending on how they control the FW error rate.
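
- As a rough sketch (with hypothetical means, MS error, and group sizes), that basic t can be computed as:

import math

def posthoc_t(mean_i, mean_j, ms_error, n_i, n_j):
    # t = (Xi - Xj) / sqrt(MS error * (1/ni + 1/nj))
    return (mean_i - mean_j) / math.sqrt(ms_error * (1 / n_i + 1 / n_j))

t_obs = posthoc_t(14.2, 10.5, ms_error=6.0, n_i=10, n_j=10)
print(round(t_obs, 2))    # compare to whichever critical value the chosen test prescribes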

2. When- Use them when you don't have a priori hypotheses, or when you find interesting but unexpected results. Also, Post Hoc tests can only be examined and interpreted when we have a significant overall F; otherwise it is totally inappropriate to examine them.

3. Why- Because our hypotheses are not a priori, and because we will probably test all possible pairs of means (c comparisons, where c = K(K-1)/2), the likelihood of committing a Type I error increases greatly as the number of groups increases. Post Hoc tests employ various means of controlling the FW and PE error rates.

NOTE: In SPSS, post hoc tests will not allow you to make complex comparisons; only the Linear Contrasts option allows this. However, you can still make complex comparisons within the post hoc framework using hand calculations.

B. Choosing Tests

- Different Post Hoc tests use different methods to control FW and PE. Some tests are very conservative: they go to great lengths to prevent the user from committing a Type I error and use more stringent criteria for determining significance. Many of these tests become more and more stringent as the number of groups increases (directly limiting the FW and PE error rates). Although these tests buy you protection against Type I error, it comes at a cost: as the tests become more stringent, you lose power (1 - B). More liberal tests buy you power, but the cost is an increased chance of Type I error. There is no set rule for determining which test to use, but different researchers have offered some guidelines for choosing. Mostly it is an issue of pragmatics and of whether the number of comparisons exceeds K - 1.

C. Fisher’s LSD

- The Fisher LSD (Least Significant Difference) test is basically the Post Hoc equivalent of a Linear Contrast.

- This test sets the Alpha level per comparison: Alpha = .05 for every comparison, with df = df error (i.e., df within).

- This test is the most liberal of all Post Hoc tests. The critical t for significance is unaffected by the number of groups.

- This test is appropriate when you have only 3 means to compare. In general the Alpha is held at .05 because of the requirement that you cannot look at the LSDs unless the overall ANOVA is significant.

- This test is generally not considered appropriate if you have more than 3 means unless there is reason to believe that there is no more than one true Null Hypothesis hidden in the means.
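
- A minimal sketch of the LSD idea (assuming equal group sizes n; the MS error, n, and df error below are hypothetical): the smallest difference between two means that reaches significance when each comparison is tested at Alpha = .05 with df = df error.

import math
from scipy.stats import t

def fisher_lsd(ms_error, n, df_error, alpha=0.05):
    t_crit = t.ppf(1 - alpha / 2, df_error)      # two-tailed critical t at .05
    return t_crit * math.sqrt(2 * ms_error / n)  # least significant difference

print(round(fisher_lsd(ms_error=6.0, n=10, df_error=27), 2))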

D. Dunn's (Bonferroni)

- Dunn's t' test can actually be applied to both Post Hoc and a priori hypotheses. It does not require the overall ANOVA to be significant. It is sometimes referred to as the Bonferroni t because it uses the Bonferroni PE correction procedure in determining the critical value for significance.

- In general, this test should be used when the number of comparisons you are making exceeds the number of degrees of freedom you have between groups (i.e., K - 1), even if your comparisons are a priori.

- This test sets alpha per experiment; Alpha = (.05)/c for every comparison. df = df error.

- c = number of comparisons (K(K-1))/2

- For example: if c = 4, then Alpha = .05/4 = .0125. Thus the PE = .05.

K = 2, c = 1, Alpha = .05

K = 3, c = 3, Alpha = .0167

K = 4, c = 6, Alpha = .00833

K = 5, c = 10, Alpha = .005
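
- A minimal Python sketch that reproduces the per-comparison alphas above:

def bonferroni_alpha(K, alpha=0.05):
    c = K * (K - 1) // 2          # number of pairwise comparisons
    return c, alpha / c           # Alpha per comparison = .05 / c

for K in (2, 3, 4, 5):
    c, a = bonferroni_alpha(K)
    print(K, c, round(a, 5))      # .05, .01667, .00833, .005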

- When doing hand calculations you need to find the critical value in Dunn's table of critical values for t', which simply accounts for the fact that regular t tables do not display critical values for fractional alpha levels (e.g., t critical at Alpha = .0125).

- This test is extremely conservative and rapidly loses power as the number of comparisons being made increases.

E. Newman-Keuls

- Newman-Keuls is a step-down procedure that is not as conservative as Dunn's t test. First, the group means are ordered (ascending or descending), and then the largest and smallest means are tested for a significant difference. If those means are different, the smallest mean is then tested against the next-largest, and so on, until a test is not significant. Once you reach that point, you can only test differences between means that exceed the difference between the means that were found to be non-significant.

For example, for a test with 5 means:

X5 > X1, p < .05; X4 = X1, ns. You cannot test the differences between X1 and X3, X1 and X2, or X2 and X3. You can test the difference between X2 and X5 if the difference between those means exceeds the difference between the means of X1 and X4 (the non-significant pair).

The critical value of this test depends on the df error (N - K) and the number of steps between the means being compared (e.g., there are 5 steps between means 1 and 5, but only 2 steps between means 1 and 2).

- This test sets alpha for each comparison using a scaled-down FW error rate based on the number of comparisons possible within the range of r means being tested (c_r = r(r-1)/2):

Alpha = 1 - (1 - .05)^(1/c_r)

E.g., K = 5, c = 10:

r = 2, c_r = 1, Alpha = .05

r = 3, c_r = 3, Alpha = .0170

r = 4, c_r = 6, Alpha = .00851

r = 5, c_r = 10, Alpha = .00512
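
- In practice the per-range critical value comes from the studentized range (q) distribution described at the end of this handout. A rough sketch, assuming SciPy (1.7+) and equal group sizes n, with hypothetical MS error, n, and df error:

import math
from scipy.stats import studentized_range

def nk_critical_diff(r, df_error, ms_error, n, alpha=0.05):
    # critical difference for a range of r means: q(.05, r, df error) * sqrt(MS error / n)
    q_crit = studentized_range.ppf(1 - alpha, r, df_error)
    return q_crit * math.sqrt(ms_error / n)

for r in (2, 3, 4, 5):            # ranges within K = 5 ordered means
    print(r, round(nk_critical_diff(r, df_error=45, ms_error=6.0, n=10), 2))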

- Newman-Keuls is perhaps one of the most common Post Hoc tests, but it is a rather controversial one. The major problem with this test is that when there is more than one true Null Hypothesis in a set of means, the actual FW error rate becomes inflated above the nominal level.

- In general we would use this test when the number of comparisons we are making is larger than K - 1 and we don't want to be as conservative as Dunn's test.

F. Tukey's HSD

- Tukey's HSD (Honestly Significant Difference) is essentially like the Newman-Keuls, but every test between a pair of means is compared to the critical value set for the test of the means that are furthest apart (rmax; e.g., if there are 5 means, we use the critical value determined for the test of X1 vs. X5).

- This method corrects the problem found in the Newman-Keuls, where the FW error rate is inflated when there is more than one true Null Hypothesis in a set of means.

- This test buys protection against Type I error, but again at the cost of power.

- This test sets alpha for every comparison using the FW error rate for the full set of c = K(K-1)/2 comparisons:

Alpha = 1 - (1 - .05)^(1/c)

K = 2, c = 1, Alpha = .05

K = 3, c = 3, Alpha = .0170

K = 4, c = 6, Alpha = .00851

K = 5, c = 10, Alpha = .00512

- This tends to be the most common and preferred test, because it is very conservative with respect to Type I error when the Null Hypothesis is true. In general, HSD is preferred when you will make all the possible comparisons between a large set of means (six or more means).
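
- A rough sketch of the HSD criterion (same assumptions as the Newman-Keuls sketch above: SciPy, equal group sizes, hypothetical inputs), with the range fixed at the full set of K means:

import math
from scipy.stats import studentized_range

def tukey_hsd_diff(K, df_error, ms_error, n, alpha=0.05):
    # every pair is tested against the critical difference for the full range of K means
    q_crit = studentized_range.ppf(1 - alpha, K, df_error)
    return q_crit * math.sqrt(ms_error / n)

print(round(tukey_hsd_diff(K=5, df_error=45, ms_error=6.0, n=10), 2))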

G. Tukey's WSD

- Tukey's WSD (Wholly Significant Difference) is sometimes referred to as the Tukey's b test. This test is a compromise between the Newman-Keuls and the more conservative HSD. Here the alpha for each test is the average of the Newman-Keuls Alpha (for the range being tested) and the HSD Alpha (for the full range, rmax):

Alpha WSD = (Alpha NK + Alpha rmax) / 2

E.g., K = 5, c = 10:

r = 2, Alpha NK = .05, Alpha rmax = .00512, Alpha WSD = .02756

r = 3, Alpha NK = .0170, Alpha rmax = .00512, Alpha WSD = .0110

r = 4, Alpha NK = .00851, Alpha rmax = .00512, Alpha WSD = .00682

r = 5, Alpha NK = .00512, Alpha rmax = .00512, Alpha WSD = .00512

- Thus the WSD is better than the Newman-Keuls at preventing Type I error when more than one Null Hypothesis is true for your set of means, but it is not as complete as the HSD. However, with the WSD you do not lose as much power as you do with the HSD.

- The WSD is best to use when you are making more than K - 1 comparisons, you need more control of Type I error than the Newman-Keuls provides, and you are testing fewer than K(K-1)/2 comparisons.
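
- A small sketch of the averaging rule described above, using FW-style per-range alphas like those in the tables (the exact values are illustrative):

def range_alpha(c_r, alpha=0.05):
    # per-comparison alpha such that 1 - (1 - alpha_pc)^c_r = .05
    return 1 - (1 - alpha) ** (1 / c_r)

def wsd_alpha(r, K, alpha=0.05):
    c_r = r * (r - 1) // 2           # comparisons within the range of r means
    c_max = K * (K - 1) // 2         # comparisons within all K means
    return (range_alpha(c_r, alpha) + range_alpha(c_max, alpha)) / 2

for r in (2, 3, 4, 5):
    print(r, round(wsd_alpha(r, K=5), 5))   # .02756, .01104, .00682, .00512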

H. Scheffé

- The Scheffé test is designed to protect against a Type I error when all possible complex and simple comparisons are made. That is, we are not just looking at the possible comparisons between pairs of means; we are also looking at the possible comparisons between groups of means. Thus Scheffé is the most conservative of all these tests.

- Because this test gives us the capacity to make complex comparisons, it essentially uses the same statistic as the Linear Contrasts. However, Scheffé uses a different critical value (or rather, it makes an adjustment to the critical value of F).

- Scheffé sets a more conservative F critical to create an effective FW error rate.

- First, for each comparison find the F critical value at Alpha = .10 (so we start off more liberal), with df between = K - 1 and df error = N - K.

- Second, multiply that F critical value by K - 1 and use the product as the critical value for all comparisons (both simple and complex) in that family of means.
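
- A minimal sketch of those two steps (the K and N below are hypothetical):

from scipy.stats import f

def scheffe_crit(K, N, alpha=0.10):
    # F critical at Alpha = .10 with df = (K - 1, N - K), multiplied by K - 1
    return (K - 1) * f.ppf(1 - alpha, K - 1, N - K)

print(round(scheffe_crit(K=4, N=40), 2))   # compare every contrast F (simple or complex) to this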

- This test has less power than the HSD when you are making Pairwise (simple) comparisons, but it has more power than HSD when you are making Complex comparisons.

- In general, only use this when you want to make many Post Hoc complex comparisons (e.g. more than K-1).

I. q: the studentized range statistic

The Newman-Keuls, HSD, and WSD all use the q statistic, which is based on the studentized range (q is often referred to as the studentized range statistic). When finding the critical q, you need two pieces of information. First, you need the df; in the case of post hoc testing, use the df error (N - K) from the ANOVA. Second, you need r, the number of steps between the means you are testing (e.g., there are 5 steps between means 1 and 5, but only 2 steps between means 1 and 2).

- The formula for q:

q = (Xi - Xj) / sqrt(MS error / n), where n is the number of scores per group.

It is similar to the formula for t between two means:

t = (Xi - Xj) / sqrt(2 * MS error / n)

Thus q = t * sqrt(2), and t = q / sqrt(2).
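
- A tiny numerical check of that relationship (hypothetical means, MS error, and n):

import math

ms_error, n = 6.0, 10
mean_hi, mean_lo = 14.2, 10.5
q = (mean_hi - mean_lo) / math.sqrt(ms_error / n)         # studentized range statistic
t = (mean_hi - mean_lo) / math.sqrt(2 * ms_error / n)     # ordinary two-group t
print(round(q, 2), round(t, 2), round(t * math.sqrt(2), 2))   # q equals t * sqrt(2)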