Week of October 16, 18

ANOVA example/Splus analysis

Polychlorinated biphenyls (PCBs), used in the manufacture of large electrical transformers and capacitors, are extremely hazardous contaminants when released into the environment. Samples of fish were taken from each of four rivers and analyzed for PCB concentration (in parts per million).

We are interested in whether the data provide sufficient evidence of differences in the mean PCB concentration in fish among the four rivers.

Using Splus the data can be summarized as follows:

*** Summary Statistics for data in: PCBfish ***

CODE   Mean PCB    Total N   Std Dev.
1       4.11560    25        3.68525
2       7.11600    20        7.28052
3       9.70565    23        7.38819
4      10.67542    24        6.83140

Boxplots of the Four Groups

Normal Probability Plots for the Four Groups

Splus commands to produce the grid of four normal probability plots all at once:

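# The indexing below takes column 1 of PCBfish as the PCB values
# and column 2 as the river code (1 to 4)
par(mfrow=c(2,2))   # arrange the plots in a 2 x 2 grid
for (i in 1:4){
  qqnorm(PCBfish[,1][PCBfish[,2]==i], ylab="Data quantiles")
  title(paste("River ",i,sep=""))
}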

To run these commands, choose “File”, then “New”, then “Script File”. Paste the commands into the script file window and press “F10”; the four plots are then produced automatically.

The boxplots show skew and large spread. The amount of variation also appears to differ among the four groups: the standard deviations range from about 3.7 (river 1) to about 7.4 (river 3). Further, the normal probability plots suggest that the data have a long right tail.

If an ANOVA were run on these data, its assumptions (recall what these are) would not appear to be met. Specifically, why not?

We therefore try a log transformation (natural log, the default for log() in Splus). A reciprocal transformation is another option.

Summary statistics: Log Transformed Data

CODE   Mean log.PCB.   Total N   Std Dev.
1      1.09490         25        0.80603
2      1.51748         20        0.95615
3      1.88929         23        1.02278
4      2.09008         24        0.89714
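One rough way to compare the spreads before and after the transformation is the ratio of the largest to the smallest group standard deviation. The sketch below uses only the standard deviations reported in the two summaries; the variable names are illustrative. It uses S syntax as implemented in R, so minor adjustments may be needed in Splus. A ratio below about 2 is a common rule of thumb for "similar enough" spreads, not a formal test.

```r
# Group standard deviations copied from the summary statistics (rivers 1 to 4)
sd.raw <- c(3.68525, 7.28052, 7.38819, 6.83140)   # PCB
sd.log <- c(0.80603, 0.95615, 1.02278, 0.89714)   # log.PCB.

# Ratio of the largest to the smallest standard deviation on each scale
ratio.raw <- max(sd.raw) / min(sd.raw)
ratio.log <- max(sd.log) / min(sd.log)

print(round(c(raw = ratio.raw, log = ratio.log), 2))
```

The ratio is about 2.0 on the raw scale and about 1.27 on the log scale, so the spreads are more alike after the transformation.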

Boxplots – Log Transformed Data

Normal Probability Plots – Log Transformed Data

Now run the ANOVA. What is the null hypothesis?

*** Analysis of Variance Model ***

Short Output:
Call:
   aov(formula = log.PCB. ~ CODE, data = PCBfish, na.action = na.omit)

Terms:
                    CODE Residuals
 Sum of Squares 14.01750  74.48842
Deg. of Freedom        3        88

Residual standard error: 0.9200322
Estimated effects may be unbalanced

          Df Sum of Sq  Mean Sq  F Value       Pr(F)
     CODE  3  14.01750 4.672501 5.520053 0.001608902
Residuals 88  74.48842 0.846459
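As a check on where the entries of this table come from, the sums of squares and the F statistic can be rebuilt from the group sizes, means and standard deviations of the log-transformed data alone. The following is a minimal sketch with illustrative variable names. It uses S syntax as implemented in R, so minor adjustments may be needed in Splus.

```r
# Summary statistics of log.PCB. by river, copied from the output above
n    <- c(25, 20, 23, 24)
xbar <- c(1.09490, 1.51748, 1.88929, 2.09008)
s    <- c(0.80603, 0.95615, 1.02278, 0.89714)

k <- length(n)   # number of groups
N <- sum(n)      # total number of fish

# The grand mean is the sample-size-weighted average of the group means
grand <- sum(n * xbar) / N

# Between-group (CODE) and within-group (Residuals) sums of squares
ss.between <- sum(n * (xbar - grand)^2)
ss.within  <- sum((n - 1) * s^2)

# Mean squares, F statistic and upper-tail p-value
ms.between <- ss.between / (k - 1)
ms.within  <- ss.within / (N - k)
F.stat     <- ms.between / ms.within
p.value    <- 1 - pf(F.stat, k - 1, N - k)

print(c(SS.CODE = ss.between, SS.Resid = ss.within,
        F = F.stat, p = p.value))
```

Up to rounding in the reported summaries, these values agree with the table: sums of squares of about 14.02 and 74.49, F of about 5.52 on 3 and 88 degrees of freedom, and a p-value of about 0.0016. The square root of the residual mean square, about 0.92, is the "Residual standard error" line.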

What are your conclusions?

Practice Problems:

  • Find a 95% confidence interval for the difference in mean PCB concentration in fish between rivers 1 and 2.
  • Suppose you want to estimate the difference in mean PCB concentration in fish between rivers 1 and 2 to within 0.5 ppm with confidence approximately equal to 0.95. How many fish would need to be included in each sample? Assume equal sample sizes and that your estimates of the standard deviations can be taken to be the population standard deviations.
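The two calculations can be sketched as small helper functions. This is one possible approach, assuming a large-sample normal approximation; a t-based interval is an alternative. The function names ci.diff and n.per.group are made up for illustration, and the inputs in the example calls are hypothetical values, not the river data. The code uses S syntax as implemented in R.

```r
# Large-sample confidence interval for a difference in two group means
ci.diff <- function(m1, m2, s1, s2, n1, n2, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)         # normal critical value
  se <- sqrt(s1^2 / n1 + s2^2 / n2)       # standard error of the difference
  (m1 - m2) + c(-1, 1) * z * se           # lower and upper limits
}

# Per-group sample size so that the margin of error is at most E,
# treating s1 and s2 as the population standard deviations
n.per.group <- function(s1, s2, E, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(z^2 * (s1^2 + s2^2) / E^2)
}

# Hypothetical inputs, for illustration only
print(ci.diff(m1 = 5, m2 = 8, s1 = 2, s2 = 3, n1 = 30, n2 = 30))
print(n.per.group(s1 = 2, s2 = 3, E = 0.5))
```

The sample size formula comes from setting the margin of error z * sqrt((s1^2 + s2^2) / n) equal to E and solving for n, then rounding up.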

Model Diagnostics

Under the ANOVA model, what assumptions do we make about the random variation around the group means?
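Residual plots are a common way to examine these assumptions. The sketch below fits the one-way model and draws a residuals-versus-fitted plot and a normal probability plot of the residuals. Because the raw measurements are not reproduced in these notes, it runs on simulated data that only mimic the layout of PCBfish; the data frame sim and its values are hypothetical. CODE is stored as a factor so that it is treated as a grouping variable, in line with the 3 degrees of freedom for CODE in the ANOVA output. The code uses S syntax as implemented in R, so minor adjustments may be needed in Splus.

```r
# Simulated, purely illustrative data: NOT the PCBfish measurements
set.seed(1)
sim <- data.frame(
  CODE = factor(rep(1:4, each = 20)),
  log.PCB. = rnorm(80, mean = rep(c(1.0, 1.5, 2.0, 2.0), each = 20), sd = 0.9)
)

# One-way ANOVA fit, then residuals and fitted group means
fit <- aov(log.PCB. ~ CODE, data = sim)
res <- residuals(fit)
fv  <- fitted(fit)

par(mfrow = c(1, 2))

# Similar vertical spread in each column of points suggests equal variances
plot(fv, res, xlab = "Fitted group mean", ylab = "Residual")
abline(h = 0, lty = 2)

# Points close to the line suggest roughly normal residuals
qqnorm(res, ylab = "Residual quantiles")
qqline(res)

# Residual standard deviation within each group, as a numeric companion
sd.by.group <- tapply(res, sim$CODE, sd)
print(round(sd.by.group, 3))
```

With the real data, replacing sim by PCBfish (with CODE as a factor) would give the corresponding plots for the log-transformed and raw responses.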

Log Transformed Data

Raw Data