Week of October 16, 18
ANOVA example/Splus analysis
Polychlorinated biphenyls (PCBs), used in the manufacture of large electrical transformers and capacitors, are extremely hazardous contaminants when released into the environment. Samples of fish were taken from each of four rivers and analyzed for PCB concentration (in parts per million).
We are interested in whether the data provide sufficient evidence to indicate differences in the mean PCB concentration in fish among the four rivers.
Using Splus, the data can be summarized as follows:
*** Summary Statistics for data in: PCBfish ***
CODE:1
PCB
Mean: 4.11560
Total N: 25.00000
Std Dev.: 3.68525
------
CODE:2
PCB
Mean: 7.11600
Total N: 20.00000
Std Dev.: 7.28052
------
CODE:3
PCB
Mean: 9.70565
Total N: 23.00000
Std Dev.: 7.38819
------
CODE:4
PCB
Mean: 10.67542
Total N: 24.00000
Std Dev.: 6.83140
Boxplots of the Four Groups
Normal Probability Plots for the Four Groups
Splus commands to produce the grid of four normal probability plots all at once:
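# Arrange the plotting region as a 2 x 2 grid
par(mfrow=c(2,2))
for (i in 1:4) {
  # Column 1 holds PCB, column 2 holds the river code
  qqnorm(PCBfish[,1][PCBfish[,2]==i], ylab="Data quantiles")
  title(paste("River ", i, sep=""))
}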
To produce these plots, go to “File”, then “New”, then “Script File”. Paste the commands into the script file window and press “F10”; the four plots are drawn automatically.
The boxplots show skewness and large spread, and the amount of variation appears to differ across the four groups (the sample standard deviations range from 3.69 to 7.39). The normal probability plots also indicate a long right tail in the data.
If an ANOVA were run on these raw data, its assumptions (recall what these are) would not appear to be met (specifically, why?).
A natural log transformation is tried first. A reciprocal transformation would be another option.
Summary statistics: Log Transformed Data
CODE:1
log.PCB.
Mean: 1.09490
Total N: 25.00000
Std Dev.: 0.80603
------
CODE:2
log.PCB.
Mean: 1.51748
Total N: 20.00000
Std Dev.: 0.95615
------
CODE:3
log.PCB.
Mean: 1.88929
Total N: 23.00000
Std Dev.: 1.02278
------
CODE:4
log.PCB.
Mean: 2.09008
Total N: 24.00000
Std Dev.: 0.89714
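As a rough numerical check on whether the log transformation evened out the spread, the ratio of the largest to the smallest group standard deviation can be compared before and after transforming. The sketch below uses only the standard deviations printed in the two summaries above. The names sd.raw, sd.log and sd.ratio are made up for illustration, and the code is written in plain S syntax that should run in either Splus or R.

```r
# Group standard deviations copied from the summary output (rivers 1 to 4)
sd.raw <- c(3.68525, 7.28052, 7.38819, 6.83140)   # PCB, in ppm
sd.log <- c(0.80603, 0.95615, 1.02278, 0.89714)   # log.PCB.

# Hypothetical helper: largest SD divided by smallest SD
sd.ratio <- function(s) max(s) / min(s)

ratio.raw <- sd.ratio(sd.raw)
ratio.log <- sd.ratio(sd.log)
print(round(c(raw = ratio.raw, log = ratio.log), 2))
```

A ratio near 1 indicates similar spread in all groups. Here the ratio drops from about 2.0 on the raw scale to about 1.27 on the log scale.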
Boxplots – Log Transformed Data
Normal Probability Plots – Log Transformed Data
Now run the ANOVA. What is the null hypothesis?
*** Analysis of Variance Model ***
Short Output:
Call:
aov(formula = log.PCB. ~ CODE, data = PCBfish, na.action = na.omit)
Terms:
                    CODE Residuals
 Sum of Squares 14.01750  74.48842
Deg. of Freedom        3        88

Residual standard error: 0.9200322
Estimated effects may be unbalanced

          Df Sum of Sq  Mean Sq  F Value       Pr(F)
     CODE  3  14.01750 4.672501 5.520053 0.001608902
Residuals 88  74.48842 0.846459
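The entries of this table can be reproduced, up to rounding, from the log-scale group summaries alone, which may help show where each number comes from. The following is a sketch in plain S syntax (it should run in Splus or R). The variable names are illustrative and are not part of the original analysis.

```r
# Log-scale group summaries copied from the output above (rivers 1 to 4)
n <- c(25, 20, 23, 24)                       # group sample sizes
m <- c(1.09490, 1.51748, 1.88929, 2.09008)   # group means of log.PCB.
s <- c(0.80603, 0.95615, 1.02278, 0.89714)   # group standard deviations

N <- sum(n)                  # total sample size (92)
k <- length(n)               # number of groups (4)
grand <- sum(n * m) / N      # overall mean, weighted by group size

ss.between <- sum(n * (m - grand)^2)   # "CODE" sum of squares
ss.within  <- sum((n - 1) * s^2)       # "Residuals" sum of squares

ms.between <- ss.between / (k - 1)     # mean square on 3 df
ms.within  <- ss.within / (N - k)      # mean square on 88 df
f.stat  <- ms.between / ms.within
p.value <- 1 - pf(f.stat, k - 1, N - k)   # upper-tail area of F(3, 88)

print(round(c(SSB = ss.between, SSW = ss.within, F = f.stat, p = p.value), 4))
```

The results agree with the printed table to about three decimal places. The small discrepancies come from the rounding of the group means and standard deviations in the summary output.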
What are your conclusions?
Practice Problems:
- Find a 95% confidence interval for the difference in mean PCB concentrations in fish between rivers 1 and 2.
- Suppose you want to estimate the difference in mean PCB concentrations in fish between rivers 1 and 2 to within 0.5 ppm with confidence approximately equal to 0.95. How many fish would need to be included in each sample? Assume equal sample sizes and that your estimate of the standard deviation can be taken to be the population standard deviation.
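For the two practice problems, the usual formulas can be written as small functions. This is only a sketch of one possible approach, in plain S syntax (Splus or R). The function names ci.diff and n.per.group are hypothetical. The interval assumes a standard deviation pooled over the two rivers being compared; pooling over all four groups, or working on the log scale, would be other reasonable choices. The sample-size function treats the standard deviation as known, as the problem instructs.

```r
# Hypothetical helper: confidence interval for mu1 - mu2 using a pooled SD
ci.diff <- function(m1, m2, n1, n2, s1, s2, conf = 0.95) {
  df <- n1 + n2 - 2
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)   # pooled SD
  half <- qt(1 - (1 - conf) / 2, df) * sp * sqrt(1 / n1 + 1 / n2)
  c(lower = (m1 - m2) - half, upper = (m1 - m2) + half)
}

# Hypothetical helper: fish needed per river so that the half-width of the
# interval for mu1 - mu2 is at most `bound`, treating sigma as known
n.per.group <- function(sigma, bound, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(2 * (z * sigma / bound)^2)
}

# Raw-scale summaries for rivers 1 and 2, copied from the output above
ci <- ci.diff(4.11560, 7.11600, 25, 20, 3.68525, 7.28052)
print(ci)
```

For the second problem, n.per.group(sigma, 0.5) can be called with whichever standard deviation estimate you decide to adopt as sigma.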
Model Diagnostics
Under the ANOVA model, what assumptions do we make about the random variation around the group means?
Log Transformed Data
Raw Data