Post Mortem on a Real Data Set:
1. An Example of an Unbalanced 1-Way ANOVA
2. Use of Bonferroni's Method for Multiple Comparisons
This data set is part of a study by a medical researcher to assess a new methodology for detecting cancerous cells in the tube from the cervix that leads to the uterus. The medical researcher selected specimens from tissues that had been removed because they were cancerous. These cells are of two grades of cancer: (1) low grade adenocarcinoma in situ (level 1 in the variable celltype in the data set) and (2) high grade adenocarcinoma (level 2). The goal is to compare them with normal cells. The normal cells (level 0 in the analysis below) were obtained from samples of hysterectomies that had been performed for reasons unrelated to any cancers. This may seem OK on the face of it, but in fact the cells in the female reproductive tract undergo age-related changes, so one cannot be sure that any differences found are due to cancer rather than simply to age. The goal here is to examine any age differences that may exist between the three groups.
We should note that this is an example of a bad use of hypothesis tests: we wish to find evidence that the ages do not differ "significantly", meaning that we want to find evidence in favor of the null hypothesis. One of the groups (the normals) has a small sample size of 13, so there is not much power for detecting departures from the null hypothesis, i.e. there is a high probability of type II error.
First Analysis: One Way ANOVA: Are there differences on average between the normals and other groups? To assess this question, we performed a 1-way ANOVA. Here is the basic output from Minitab:
One-way Analysis of Variance
Analysis of Variance for age
Source DF SS MS F P
celltype 2 753 377 2.91 0.061
Error 67 8675 129
Total 69 9428
Individual 95% CIs For Mean, Based on Pooled StDev
[character plot of the three intervals not recoverable; axis ticks at 35.0, 40.0, 45.0, 50.0]
Level N Mean StDev
0 13 46.08 15.28
1 38 37.84 8.99
2 19 42.58 12.66
Pooled StDev = 11.38
Tukey's pairwise comparisons
Family error rate = 0.0500
Individual error rate = 0.0193
Critical value = 3.39
Intervals for (column level mean) - (row level mean)
Column 0, row 1: (-0.53, 17.00)
Column 0, row 2: (-6.32, 13.32)
Column 1, row 2: (-12.40, 2.93)
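As a check on the ANOVA table, the F statistic and P-value can be recomputed from the group sizes, means, and standard deviations printed in the Minitab output. The sketch below is only an illustration, not how Minitab works internally; the function names are made up for this example. The P-value uses a closed form that holds only when the numerator degrees of freedom equal 2, which is the case here with three groups.

```python
# Summary statistics copied from the Minitab output: celltype -> (n, mean, sd).
groups = {0: (13, 46.08, 15.28), 1: (38, 37.84, 8.99), 2: (19, 42.58, 12.66)}


def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F) from summaries."""
    n_total = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total
    # Between-group sum of squares: weighted squared deviations of group means.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
    # Within-group (error) sum of squares: (n - 1) * variance for each group.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


def f_upper_tail_numerator_df_2(f_stat, df_within):
    """P(F > f_stat) for an F(2, df_within) distribution.

    Closed form valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_stat / df_within) ** (-df_within / 2.0)


ss_between, ss_within, df_between, df_within, f_stat = one_way_anova(groups)
p_value = f_upper_tail_numerator_df_2(f_stat, df_within)
pooled_sd = (ss_within / df_within) ** 0.5

print(f"SS between = {ss_between:.0f} on {df_between} df")
print(f"SS within  = {ss_within:.0f} on {df_within} df")
print(f"F = {f_stat:.2f}, P = {p_value:.3f}, pooled SD = {pooled_sd:.2f}")
```

Because the inputs are the rounded means and standard deviations from the printout, the sums of squares can differ from Minitab's in the last digit, while F, P, and the pooled standard deviation should agree with the table to the precision shown.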
The P-value for the analysis of variance is 0.061. At the “usual” 0.05 level of significance, the differences in the mean ages of the groups are not significant, but with a P-value this close to 0.05 that is hardly comforting. The normal probability plot of the residuals is shown below, and does not suggest any problem with the normality assumption.
However, boxplots of the age by celltype definitely suggest that the assumption of equal variances is not satisfied.
I selected “Basic Statistics” and then “Display Descriptive Statistics” for age by celltype. The results are shown below:
Descriptive Statistics
Variable celltype N Mean Median TrMean StDev
age 0 13 46.08 46.00 46.18 15.28
1 38 37.84 36.00 37.12 8.99
2 19 42.58 39.00 41.47 12.66
Variable celltype SE Mean Minimum Maximum Q1 Q3
age 0 4.24 23.00 68.00 30.50 58.00
1 1.46 25.00 62.00 31.75 40.50
2 2.90 26.00 78.00 34.00 50.00
The normal group has the largest sample standard deviation (15.28) and is the oldest by any measure of central tendency (mean, median, or trimmed mean). It is also the smallest group at 13 (the others have 38 and 19). We concluded that there are potential problems with the ANOVA assumptions. Note also that Tukey's method of multiple comparisons is definitely dubious in this case, as the sample sizes within groups are not nearly equal.
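To see how the unequal sample sizes enter the Tukey intervals, the sketch below recomputes them from the critical value (3.39) and pooled standard deviation (11.38) printed in the Minitab output. It assumes the Tukey-Kramer form, (difference in means) ± (q / sqrt(2)) × (pooled SD) × sqrt(1/n_i + 1/n_j); that Minitab uses exactly this form is an assumption here, supported only by the fact that the numbers agree. The function name is made up for this example.

```python
import math

Q_CRIT = 3.39       # "Critical value" from the Tukey output
POOLED_SD = 11.38   # "Pooled StDev" from the ANOVA output

# celltype -> (n, mean), copied from the Minitab output
groups = {0: (13, 46.08), 1: (38, 37.84), 2: (19, 42.58)}


def tukey_kramer_interval(col, row):
    """Interval for (column level mean) - (row level mean), Tukey-Kramer form."""
    n_col, mean_col = groups[col]
    n_row, mean_row = groups[row]
    diff = mean_col - mean_row
    # The sample sizes enter only through sqrt(1/n_col + 1/n_row).
    half_width = (Q_CRIT / math.sqrt(2)) * POOLED_SD * math.sqrt(1 / n_col + 1 / n_row)
    return diff - half_width, diff + half_width


for col, row in [(0, 1), (0, 2), (1, 2)]:
    lower, upper = tukey_kramer_interval(col, row)
    print(f"column {col} - row {row}: ({lower:.2f}, {upper:.2f})")
```

The three intervals should agree with the Minitab matrix up to rounding. All three contain 0, consistent with the non-significant ANOVA.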
Comment: How should the study have been done? Ideally, for each of the cancer patients we would have found a normal patient who (nearly) matched in important characteristics such as age, race, smoking, SES (socioeconomic status), etc. Then we could do paired sample t-tests to detect differences in the variables of interest (which haven’t been described here), and have some assurance that differences that we found were due to the cancer and not so-called confounding factors.
Bonferroni's Method of Multiple Comparisons: Now we perform a similar analysis to the one above but using Bonferroni's method instead of ANOVA + Tukey's method. Here is a quick summary of Bonferroni's method, which applies to any multiple comparisons problem:
- For simultaneous 1-α confidence intervals for k parameters, construct individual 1-α/k confidence intervals for each parameter separately.
- For testing k null hypotheses with a Family Wise Error Rate (FWER) of α, perform individual hypothesis tests at level α/k.
The bottom line is to divide the error probability by the number of confidence intervals or tests. One issue with Bonferroni's method is that it is not as powerful as a specially designed method, meaning that it has higher type II error probabilities for tests and gives wider confidence intervals. For instance, in an ANOVA setting, the ANOVA test is more likely to detect a difference (reject the null hypothesis of no difference), provided of course that the ANOVA assumptions are met. The beauty of Bonferroni's method is that it applies to any setting.
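The two rules above amount to very little arithmetic, as the following minimal sketch shows. The function names are hypothetical, chosen for this illustration. The example P-values are the three reported for the pairwise t-tests later in this handout, and the 98% individual confidence level is the one Minitab actually used there.

```python
def bonferroni_level(alpha, k):
    """Per-test significance level that keeps the family-wise error rate <= alpha."""
    return alpha / k


def bonferroni_reject(p_values, alpha=0.05):
    """For each P-value, True if that null hypothesis is rejected at level alpha / k."""
    level = bonferroni_level(alpha, len(p_values))
    return [p < level for p in p_values]


def simultaneous_confidence(individual_conf, k):
    """Bonferroni lower bound on the joint coverage of k intervals."""
    return 1.0 - k * (1.0 - individual_conf)


# P-values for 1 vs. 2, 0 vs. 1, and 0 vs. 2, as reported later in this handout.
p_values = [0.16, 0.087, 0.50]

print(f"per-test level: {bonferroni_level(0.05, len(p_values)):.4f}")
print(f"reject? {bonferroni_reject(p_values)}")
print(f"joint confidence with three 98% intervals: {simultaneous_confidence(0.98, 3):.2f}")
```

Note that the simultaneous confidence is a lower bound: the true joint coverage of the three intervals is at least this large.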
Recall that the two sample t-test is reasonably robust to departures from normality (so is ANOVA) and, in its unpooled form, doesn't require the assumption of equal variances (which is a bit of a problem for ANOVA). So, we reanalyzed the above data by performing 3 pairwise two sample t-tests, each at the 0.05/3 = 0.0167 level of significance since there are 3 pairwise comparisons (1 vs. 2, 1 vs. 0, and 2 vs. 0). To perform the tests at the 0.0167 level, we simply reject a null hypothesis if its P-value is below 0.0167. I also tried to get 1 - 0.0167 = 98.33% confidence intervals (so that the simultaneous level of confidence is at least 95%), but Minitab appears to have rounded off to just 98%, so I have only 1 - 3(0.02) = 94% simultaneous level of confidence. The results are shown below.
Two Sample T-Test and Confidence Interval
Two sample T for a2
ct2 N Mean StDev SE Mean
1 38 37.84 8.99 1.5
2 19 42.6 12.7 2.9
98% CI for mu (1) - mu (2): ( -13.0, 3.6)
T-Test mu (1) = mu (2) (vs not =): T = -1.46 P = 0.16 DF = 27
Two Sample T-Test and Confidence Interval
Two sample T for a3
ct3 N Mean StDev SE Mean
0 13 46.1 15.3 4.2
1 38 37.84 8.99 1.5
98% CI for mu (0) - mu (1): ( -3.9, 20.4)
T-Test mu (0) = mu (1) (vs not =): T = 1.84 P = 0.087 DF = 14
Two Sample T-Test and Confidence Interval
Two sample T for a4
ct4 N Mean StDev SE Mean
0 13 46.1 15.3 4.2
2 19 42.6 12.7 2.9
98% CI for mu (0) - mu (2): ( -9.8, 16.8)
T-Test mu (0) = mu (2) (vs not =): T = 0.68 P = 0.50 DF = 22
The smallest P-value was 0.087, for testing 0 (normal) vs. 1 (adenocarcinoma in situ). As this is not less than 0.0167, we cannot reject any of the three null hypotheses of no difference.
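The T statistics and degrees of freedom in the three outputs above can be checked from the summary statistics alone. The sketch below assumes the unpooled (Welch) two sample t-test with the Welch-Satterthwaite approximation for the degrees of freedom; that Minitab reports this approximation rounded down to an integer is an inference from the matching numbers, not something stated in the output. The function name is made up for this example, and P-values are not computed because the Python standard library has no t distribution.

```python
import math

# celltype -> (n, mean, sd), copied from the descriptive statistics
groups = {0: (13, 46.08, 15.28), 1: (38, 37.84, 8.99), 2: (19, 42.58, 12.66)}


def welch_t(first, second):
    """Return (t, df) for mu(first) - mu(second) without assuming equal variances."""
    n1, mean1, sd1 = groups[first]
    n2, mean2, sd2 = groups[second]
    v1 = sd1 ** 2 / n1   # squared standard error of the first mean
    v2 = sd2 ** 2 / n2   # squared standard error of the second mean
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximate degrees of freedom (generally not an integer).
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


for first, second in [(1, 2), (0, 1), (0, 2)]:
    t_stat, df = welch_t(first, second)
    print(f"mu({first}) - mu({second}): T = {t_stat:.2f}, DF = {math.floor(df)}")
```

The small, high-variance normal group drives the degrees of freedom down in both comparisons that involve it, which is part of why those two confidence intervals are so wide.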