Introduction to the ANOVA Model
and Planned Comparisons
1. A Note on Notation
2. What the ANOVA Model Tests
3. Assumptions of the ANOVA Model
4. One-Way ANOVA
5. Multiple Comparisons & Familywise Error Rates
A Note on Notation
· Symbols
o i refers to a specific subject. Saying “i = 1, …, n” means that subjects are indexed from 1 to n, where n is the total number of subjects.
o j refers to a specific treatment condition. Saying “j = 1, …, p” means that treatments are indexed from 1 to p, where p is the total number of treatment conditions.
o We’ll add more letters later as we get into more complicated designs.
· Notation to Describe ANOVA Designs:
o CR-p: refers to “completely randomized, with p conditions.” Thus, CR-4 means completely randomized with four levels of the treatment (i.e., an independent groups design).
o RB-p: refers to a “randomized block design.” The letter p still refers to the levels of the treatment (i.e., a repeated measures design).
o CRF-pq: refers to a “completely randomized factorial design,” that is, a design with two or more IVs. The “p” refers to the number of levels of the first IV, and “q” refers to the number of levels of the second IV.
What the ANOVA Model Does
· ANOVA stands for ANalysis Of VAriance
o Originated in agricultural field research (R. A. Fisher's work at the Rothamsted Experimental Station).
o Analyzes the variance from different sources (effects, or groups), so we are looking at mean differences between groups.
o There are four basic sources of variance:
§ Between Group (BG): also known as variance due to the treatment or manipulation.
§ Within Group (WG): also known as error variance; this reflects the variance in the DV not associated with group status.
§ Total variance = σ²BG + σ²WG
§ Interaction Variance (for non-additive models)
· With ANOVA, we are most interested in the means for each group, and how they may differ. Suppose we study 40 subjects, 20 in each condition:
o We have i = 1, …, n subjects; here n = 40 subjects.
o We also have j = 1, …, p treatments; here p = 2 treatments.
In this design, we want to test the following null hypothesis:
H0: Mean Control = Mean Goal, or Mean Control − Mean Goal = 0
H0: μ1 = μ2, or μ1 − μ2 = 0
That is, there are no mean differences (a null hypothesis). Alternatively, we propose:
Ha: μ1 ≠ μ2, or μ1 − μ2 ≠ 0; that is, there is a mean difference of some form.
· Of course, we could also use non-nil hypothesis tests, and/or directional hypothesis tests.
Clearly this simple two-group hypothesis is nothing more than a t-test, but ANOVA lets us extend it to any number of group means.
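The link to the t-test can be checked numerically. The sketch below uses made-up scores (the two lists are hypothetical, not data from these notes) and suggests that, with two groups, the one-way ANOVA F equals the square of the pooled-variance independent-samples t.

```python
from statistics import mean

# Hypothetical scores for two conditions (illustrative only).
control = [4.0, 5.0, 6.0, 5.0, 7.0]
goal = [6.0, 8.0, 7.0, 9.0, 8.0]

def sum_sq(xs):
    """Sum of squared deviations from the group's own mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(control), len(goal)
m1, m2 = mean(control), mean(goal)
grand = mean(control + goal)

# Pooled-variance t statistic for two independent groups.
pooled_var = (sum_sq(control) + sum_sq(goal)) / (n1 + n2 - 2)
t_stat = (m1 - m2) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA F ratio for the same two groups (p = 2).
ss_bg = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
ss_wg = sum_sq(control) + sum_sq(goal)
f_stat = (ss_bg / (2 - 1)) / (ss_wg / (n1 + n2 - 2))

print(round(t_stat ** 2, 4), round(f_stat, 4))  # the two values agree
```

With more than two groups there is no single t to square, which is where the F test earns its keep.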
Let’s take a look again at how ANOVA relates to regression:
General Form: SSTOT = SSEFFECT + SSERROR
Regression: SSTOT = SSREG + SSERROR
ANOVA: SSTOT = SSBG + SSWG
Or: SSTOT = SSTREATMENT + SSERROR
Let’s look at closer distinctions:
SSTOT: is the total amount of variance in the data.
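Regression: $SS_{TOT} = \sum_{i} (Y_i - \bar{Y})^2$
= each score - mean, squared and summed
ANOVA: $SS_{TOT} = \sum_{j} \sum_{i} (Y_{ij} - \bar{Y}_{..})^2$
= each score in each condition - grand mean, squared and summed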
So SSTOT reflects all of the variance in the DV
SSEFFECT: variance due to the treatment, predictor, or effect
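Regression: $SS_{REG} = \sum_{i} (\hat{Y}_i - \bar{Y})^2$
= each predicted score - mean, squared and summed
ANOVA: $SS_{BG} = \sum_{j} \sum_{i} (\bar{Y}_{.j} - \bar{Y}_{..})^2$
= each treatment mean - grand mean, squared and summed over every subject in that condition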
So SSBG reflects all of the variance explained by treatment.
SSERROR: variance due to all other unexplained effects
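Regression: $SS_{ERROR} = \sum_{i} (Y_i - \hat{Y}_i)^2$
= each observed score - predicted score, squared and summed
ANOVA: $SS_{WG} = \sum_{j} \sum_{i} (Y_{ij} - \bar{Y}_{.j})^2$
= each individual score - treatment mean, squared and summed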
So SSERROR reflects all of the within group variance.
Thus, ANOVA and regression partition variance the same way; the only difference lies in the continuous versus categorical nature of the IVs.
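The partition SSTOT = SSBG + SSWG can be verified directly. The following is a minimal sketch on hypothetical scores for three conditions (the group labels and values are invented for illustration); it computes each sum of squares from its verbal definition and checks that the pieces add up.

```python
from statistics import mean

# Hypothetical scores for p = 3 treatment conditions (illustrative only).
groups = {
    "a1": [3.0, 5.0, 4.0, 6.0],
    "a2": [6.0, 7.0, 8.0, 7.0],
    "a3": [9.0, 8.0, 10.0, 9.0],
}

all_scores = [y for ys in groups.values() for y in ys]
grand_mean = mean(all_scores)

# Each score minus the grand mean, squared and summed.
ss_tot = sum((y - grand_mean) ** 2 for y in all_scores)
# Each treatment mean minus the grand mean, weighted by group size.
ss_bg = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
# Each score minus its own treatment mean, squared and summed.
ss_wg = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)

print(round(ss_tot, 4), round(ss_bg + ss_wg, 4))  # the two totals match
```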
Remember the Experimental Design Equation:
Yij = μ + αj + εi(j)
Yij = each person's score within each condition.
μ = the grand mean.
αj = the treatment effect for condition j.
εi(j) = error; it is equal to Yij − μ − αj. Notice that it considers error for each subject nested within each treatment; that is, i(j) means “i is nested within j.”
So, each person’s score is a sum of three parameters: μ, αj, and εi(j).
Yij = μ + αj + εi(j)
Yij = Ȳ.. + (Ȳ.j − Ȳ..) + (Yij − Ȳ.j)
The experimental design equation describes our ANOVA model!
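As a small illustration of the design equation, the sketch below splits each score into an estimated grand mean, treatment effect, and error, then rebuilds the score from those three parts. The scores are hypothetical, and the names `alpha_hat` and `e_hat` are just labels chosen here for the sample estimates.

```python
from statistics import mean

# Hypothetical scores, keyed by treatment condition j (illustrative only).
scores = {1: [10.0, 12.0, 14.0], 2: [15.0, 17.0, 19.0]}

grand_mean = mean([y for ys in scores.values() for y in ys])

rows = []
for j, ys in scores.items():
    alpha_hat = mean(ys) - grand_mean      # estimated treatment effect for j
    for i, y in enumerate(ys, start=1):
        e_hat = y - mean(ys)               # estimated error for subject i in j
        rows.append((i, j, y, grand_mean, alpha_hat, e_hat))
        # Each score is rebuilt exactly from the three components.
        assert abs(y - (grand_mean + alpha_hat + e_hat)) < 1e-9

for row in rows:
    print(row)  # (i, j, score, grand mean, treatment effect, error)
```

Within each condition the estimated errors sum to zero, which is why the within-group term carries no information about the treatment.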
Effect Size Estimation in ANOVA Designs
In regression, effect size of the model was
R² = SSregression / SSTotal
In ANOVA, effect size can be given by
η² = SStreatment / SSTotal
So, R² = η²; they are the same! Historically, however, eta-squared is what gets reported in ANOVA.
Another popular effect size estimate in ANOVA is ω² (omega-squared). This is conceptually equal to:
ω² = treatment variance / (error variance + treatment variance)
Or computationally:
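$\hat{\omega}^2 = \frac{SS_{BG} - (p - 1)MS_{WG}}{SS_{TOT} + MS_{WG}} = \frac{(p - 1)(F - 1)}{(p - 1)(F - 1) + n}$
where p = number of levels or conditions, n = total number of subjects, and MSWG = SSWG / (n − p).
Note that ω² is the proportion of the population variance in the DV that is accounted for by specifying the treatment-level classification. It is a measure of association!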
· Effect Size Conventions:
ω² = .01 = small effect
ω² = .06 = moderate effect
ω² = .14 = large effect
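To see how the two effect sizes compare in practice, here is a short sketch that plugs in the sums of squares from the aggression example worked later in these notes. It assumes the commonly used omega-squared estimator (SSBG − (p − 1)MSWG) / (SSTOT + MSWG); the variable names are chosen here for illustration.

```python
# Values copied from the one-way ANOVA summary table in the worked example
# later in these notes (aggression study, three groups).
ss_bg = 1144.867   # between-groups sum of squares
ss_wg = 1614.600   # within-groups sum of squares
ss_tot = 2759.467  # total sum of squares
p = 3              # number of treatment levels
n_total = 30       # total number of subjects

ms_wg = ss_wg / (n_total - p)

eta_sq = ss_bg / ss_tot
# Assumed estimator: (SSBG - (p - 1) * MSWG) / (SSTOT + MSWG).
omega_sq = (ss_bg - (p - 1) * ms_wg) / (ss_tot + ms_wg)

print(round(eta_sq, 3), round(omega_sq, 3))
```

Omega-squared comes out smaller than eta-squared because it corrects for the upward bias of the sample estimate; both exceed the .14 convention for a large effect.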
Assumptions of the ANOVA Model
F-test assumptions:
1. Observations are drawn from normally distributed populations.
2. Observations are random samples from the populations, or are randomly assigned to conditions.
3. The numerator and denominator of the F statistic are independent.
4. The numerator and denominator of the F statistic are estimates of the same population variance, σ²ε.
Model Assumptions:
1. The model Yij = μ + αj + εi(j) contains all relevant sources of variation (like misspecification in regression).
2. The experiment contains all of the treatment levels of interest.
3. The errors are independent, normally distributed within each treatment population. This is just like regression!
What happens if the ANOVA assumptions are violated?
· The model Yij = μ + αj + εi(j) contains all relevant sources of variation.
o Violation influences Type I error and power.
· The experiment contains all of the treatment levels of interest.
o Violation influences the estimation of the between groups variance and the very hypothesis being tested. Very serious!
· The errors are independent and normally distributed, with equal variances, within each treatment population (i.e., homogeneous variances).
o Violation of normality is usually not serious unless it is severe.
o Violation of homogeneity of variance is more serious, especially with small sample sizes.
§ The consequence is an inflated Type I error rate.
§ There is no good reason not to always check it!
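One common homogeneity check is Levene's test. The sketch below computes only the test statistic, assuming the mean-centered version (a one-way ANOVA F on absolute deviations from each group mean); some software centers on the median instead. The data and function names are hypothetical, and a p-value would still need an F distribution with (p − 1, n − p) degrees of freedom.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F ratio for a list of score lists."""
    all_scores = [y for g in groups for y in g]
    grand = mean(all_scores)
    df_bg = len(groups) - 1
    df_wg = len(all_scores) - len(groups)
    ss_bg = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_wg = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return (ss_bg / df_bg) / (ss_wg / df_wg)

def levene_w(groups):
    """Levene statistic: ANOVA F on absolute deviations from each group mean."""
    abs_dev = [[abs(y - mean(g)) for y in g] for g in groups]
    return one_way_f(abs_dev)

# Hypothetical data: in `unequal`, the third group is far more spread out.
similar = [[4.0, 5.0, 6.0, 5.0], [7.0, 8.0, 9.0, 8.0], [1.0, 2.0, 3.0, 2.0]]
unequal = [[4.0, 5.0, 6.0, 5.0], [7.0, 8.0, 9.0, 8.0], [0.0, 10.0, 20.0, 10.0]]

print(levene_w(similar), round(levene_w(unequal), 3))
```

The statistic is zero when every group has the same spread and grows as the spreads diverge.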
Multiple Comparison Procedures:
Issues and Concerns
Multiple Comparisons
· Used to assess which group means differ from each other after the overall F test has demonstrated that at least one difference exists.
· The total possible number of pairwise comparisons is k(k-1)/2, where k is the number of groups. Multiple comparisons help specify the exact nature of the overall effect determined by the F test.
Multiple comparison procedures are methods for comparing group means:
1. Pairwise comparisons (between 2 different group means)
2. Contrasts (between 2 or more sets, with at least one set containing multiple groups)
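The difference between the two kinds of comparison can be sketched with the group means from the aggression example worked later in these notes. The contrast coefficients below (no exposure versus the average of the two exposure groups) are an illustrative choice, not a hypothesis stated in the notes; the standard error uses the usual equal-n formula sqrt(MSWG × Σc² / n).

```python
from itertools import combinations
from math import sqrt

# Group means, per-group n, and MS within from the worked example.
means = {"none": 59.6, "10 min": 66.3, "30 min": 74.7}
n_per_group = 10
ms_wg = 59.8

# Pairwise comparisons: every pair of groups, k(k-1)/2 in total.
pairs = list(combinations(means, 2))
k = len(means)
assert len(pairs) == k * (k - 1) // 2

# A contrast: coefficients are illustrative and must sum to zero.
coefs = {"none": 1.0, "10 min": -0.5, "30 min": -0.5}
assert abs(sum(coefs.values())) < 1e-12

psi = sum(coefs[g] * means[g] for g in means)
se_psi = sqrt(ms_wg * sum(c ** 2 for c in coefs.values()) / n_per_group)
t_psi = psi / se_psi

print(pairs)
print(round(psi, 2), round(se_psi, 3), round(t_psi, 2))
```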
But how do we know what group means to test? In an ANOVA/Experimental framework, we often discuss two types:
1. Planned tests (a priori): confirmatory, the hypothesis or reason the study was conducted
2. Post hoc tests (a posteriori, unplanned, etc.): more exploratory, used to learn more about the data than the hypotheses allow.
Which correction procedure should I use?
- Depends on the question:
- Compare only treatment groups to a control group: Dunnett
- All pairwise comparisons: Tukey
- Contrasts: Scheffé
· The Bonferroni test (also known as Dunn's test) is recommended for multiple planned comparisons, if the number of comparisons is not large.
o A .05 significance level for the first comparison means there is a 5% chance of a Type I error. However, as one runs additional significance tests for additional comparisons, the likelihood of at least one Type I error increases.
o The more comparisons to be tested, the more stringent the alpha level must become to obtain a total overall experimentwise Type I error rate of 5%. Dunn's test accomplishes this by setting the alpha rate at .05/c, where c is the number of comparisons.
o For instance, for 10 comparisons, in order to obtain an overall Type I error rate of .05, one should test each comparison against an alpha of .05/10 = .005.
· The Tukey method is preferred when the number of groups is large, as it is a very conservative pairwise comparison test, and researchers prefer to be conservative when a large number of groups threatens to inflate Type I errors.
o Some recommend it only when all pairwise comparisons are being tested. When all pairwise comparisons are being tested, the Tukey HSD test is more powerful than the Dunn test (Dunn may be more powerful for fewer than all comparisons).
· The Scheffé test maintains an experimentwise .05 significance level in the face of multiple comparisons; it does so at the cost of a loss in statistical power (more Type II errors may be made, that is, concluding there is no mean difference when there is one).
o That is, the Scheffé test is a conservative one (more conservative than Dunn or Tukey, for example), not appropriate for planned comparisons but rather restricted to post hoc comparisons.
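The inflation of the familywise error rate, and the effect of the Bonferroni correction, can be sketched with a few lines of arithmetic. This assumes the comparisons are independent, which real pairwise comparisons are not, so the figures are a rough guide only; `familywise_error` is a helper name made up for this example.

```python
alpha = 0.05

def familywise_error(alpha_per_test, c):
    """P(at least one Type I error) across c independent tests."""
    return 1 - (1 - alpha_per_test) ** c

for c in (1, 3, 10):
    uncorrected = familywise_error(alpha, c)
    bonferroni_alpha = alpha / c            # per-comparison alpha, .05/c
    corrected = familywise_error(bonferroni_alpha, c)
    print(c, round(uncorrected, 3), bonferroni_alpha, round(corrected, 3))
```

With 10 uncorrected comparisons the chance of at least one false positive is roughly 40%, while testing each at .005 keeps the familywise rate just under .05.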
Let’s run through an example:
Suppose we are interested in answering the question: How does exposure to violence influence aggressiveness? We set up a study in which 3 groups are exposed to different amounts of violence and then complete a survey measuring aggression.
Hypotheses???
Descriptives
aggressive behavior score
Group / N / Mean / Std. Deviation / Std. Error / 95% CI for Mean: Lower Bound / 95% CI for Mean: Upper Bound / Minimum / Maximum
no exposure to violence / 10 / 59.6000 / 6.85079 / 2.16641 / 54.6992 / 64.5008 / 49.00 / 67.00
exposed to 10 minutes of violence / 10 / 66.3000 / 9.14148 / 2.89079 / 59.7606 / 72.8394 / 49.00 / 78.00
exposed to 30 minutes of violence / 10 / 74.7000 / 6.99285 / 2.21133 / 69.6976 / 79.7024 / 65.00 / 83.00
Total / 30 / 66.8667 / 9.75469 / 1.78096 / 63.2242 / 70.5091 / 49.00 / 83.00
Test of Homogeneity of Variances
aggressive behavior score
Levene Statistic / df1 / df2 / Sig.
.215 / 2 / 27 / .808
ANOVA
aggressive behavior score
Source / Sum of Squares / df / Mean Square / F / Sig.
Between Groups / 1144.867 / 2 / 572.433 / 9.572 / .001
Within Groups / 1614.600 / 27 / 59.800
Total / 2759.467 / 29
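The ANOVA table can be rebuilt from the Descriptives table alone, which is a useful check on how the pieces fit together. The sketch below uses only the N, mean, and standard deviation of each group; small discrepancies in the last decimal are expected because the inputs are rounded.

```python
# Summary statistics copied from the Descriptives table: (n, mean, SD).
groups = {
    "no exposure": (10, 59.6000, 6.85079),
    "10 minutes": (10, 66.3000, 9.14148),
    "30 minutes": (10, 74.7000, 6.99285),
}

n_total = sum(n for n, _, _ in groups.values())
p = len(groups)
grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total

# Between groups: each group mean against the grand mean, weighted by n.
ss_bg = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
# Within groups: (n - 1) * variance recovers each group's sum of squares.
ss_wg = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

ms_bg = ss_bg / (p - 1)
ms_wg = ss_wg / (n_total - p)
f_ratio = ms_bg / ms_wg

print(round(ss_bg, 3), round(ss_wg, 3), round(ms_bg, 3), round(ms_wg, 3))
print(round(f_ratio, 3))
```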
So what’s our conclusion? What’s next?
Multiple Comparisons
Dependent Variable: aggressive behavior score
Test / (I) GROUP / (J) GROUP / Mean Difference (I-J) / Std. Error / Sig. / 95% CI Lower Bound / 95% CI Upper Bound
Tukey HSD / no exposure to violence / exposed to 10 minutes of violence / -6.7000 / 3.45832 / .148 / -15.2746 / 1.8746
Tukey HSD / no exposure to violence / exposed to 30 minutes of violence / -15.1000* / 3.45832 / .000 / -23.6746 / -6.5254
Tukey HSD / exposed to 10 minutes of violence / no exposure to violence / 6.7000 / 3.45832 / .148 / -1.8746 / 15.2746
Tukey HSD / exposed to 10 minutes of violence / exposed to 30 minutes of violence / -8.4000 / 3.45832 / .056 / -16.9746 / .1746
Tukey HSD / exposed to 30 minutes of violence / no exposure to violence / 15.1000* / 3.45832 / .000 / 6.5254 / 23.6746
Tukey HSD / exposed to 30 minutes of violence / exposed to 10 minutes of violence / 8.4000 / 3.45832 / .056 / -.1746 / 16.9746
Scheffé / no exposure to violence / exposed to 10 minutes of violence / -6.7000 / 3.45832 / .173 / -15.6572 / 2.2572
Scheffé / no exposure to violence / exposed to 30 minutes of violence / -15.1000* / 3.45832 / .001 / -24.0572 / -6.1428
Scheffé / exposed to 10 minutes of violence / no exposure to violence / 6.7000 / 3.45832 / .173 / -2.2572 / 15.6572
Scheffé / exposed to 10 minutes of violence / exposed to 30 minutes of violence / -8.4000 / 3.45832 / .069 / -17.3572 / .5572
Scheffé / exposed to 30 minutes of violence / no exposure to violence / 15.1000* / 3.45832 / .001 / 6.1428 / 24.0572
Scheffé / exposed to 30 minutes of violence / exposed to 10 minutes of violence / 8.4000 / 3.45832 / .069 / -.5572 / 17.3572
Bonferroni / no exposure to violence / exposed to 10 minutes of violence / -6.7000 / 3.45832 / .190 / -15.5272 / 2.1272
Bonferroni / no exposure to violence / exposed to 30 minutes of violence / -15.1000* / 3.45832 / .001 / -23.9272 / -6.2728
Bonferroni / exposed to 10 minutes of violence / no exposure to violence / 6.7000 / 3.45832 / .190 / -2.1272 / 15.5272
Bonferroni / exposed to 10 minutes of violence / exposed to 30 minutes of violence / -8.4000 / 3.45832 / .066 / -17.2272 / .4272
Bonferroni / exposed to 30 minutes of violence / no exposure to violence / 15.1000* / 3.45832 / .001 / 6.2728 / 23.9272
Bonferroni / exposed to 30 minutes of violence / exposed to 10 minutes of violence / 8.4000 / 3.45832 / .066 / -.4272 / 17.2272
* The mean difference is significant at the .05 level.
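The Std. Error column and the Bonferroni Sig. values can be approximated by hand. The sketch below assumes the Bonferroni Sig. is the unadjusted two-sided t-test p-value multiplied by the number of comparisons (capped at 1), and it gets the t-distribution tail area by simple numerical integration rather than a statistics library; `t_two_sided_p` is a helper written for this example. The Tukey and Scheffé values need other reference distributions and are not reproduced here.

```python
import math

ms_wg, df_wg, n = 59.8, 27, 10    # from the ANOVA table; n is per group
diffs = {                          # mean differences (I - J) from the table
    ("none", "10 min"): -6.7,
    ("none", "30 min"): -15.1,
    ("10 min", "30 min"): -8.4,
}

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for t, using Simpson's rule on the t density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    b = abs(t)
    h = b / steps                  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    area = total * h / 3           # area under the density from 0 to |t|
    return max(0.0, 1 - 2 * area)

# Standard error of a difference between two group means (equal n).
se_diff = math.sqrt(2 * ms_wg / n)

n_comparisons = len(diffs)
bonferroni_p = {}
for pair, d in diffs.items():
    p_raw = t_two_sided_p(d / se_diff, df_wg)
    bonferroni_p[pair] = min(1.0, n_comparisons * p_raw)

print(round(se_diff, 5))
for pair, p_adj in bonferroni_p.items():
    print(pair, round(p_adj, 3))
```

Under these assumptions the adjusted p-values should land close to the Bonferroni rows of the table.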
Conclusion:
A one-way analysis of variance (ANOVA) indicated that participants in the control, 10 minute exposure to violence, and 30 minute exposure to violence groups significantly differed on mean aggression scores, F(2, 27) = 9.572, p = .001.
In addition, all three multiple comparison procedures (Tukey HSD, Scheffé, and Bonferroni) revealed that the mean aggression score in the 30 minute exposure to violence group (M = 74.7, SD = 6.99) was significantly higher than the mean aggression score of the control group (M = 59.6, SD = 6.85). No significant differences were detected between the control and 10 minute exposure groups (M = 66.3, SD = 9.14) or between the 10 and 30 minute exposure groups, though the latter comparison approached significance (p = .056 to .069, depending on the procedure).
Randomized Block (RB) Designs and Repeated Measures ANOVA
1. Randomized Block Designs
2. Repeated Measures Analysis
3. Conditions Supporting Repeated Measures
4. Corrections for Violations
Randomized Block Design
Sometimes when we conduct an experiment, there is something called nuisance variation present in the data or design:
· Nuisance Variation: systematic variance in the DV that is not the focus of the experiment
o Cognitive ability in a training design
o Motivational differences in GPA