Topic 8 ANOVA

Compare means when there are more than two groups

The t-test for independent samples can be used when there are two groups. What if we are interested in comparing more than two group means? We could carry out pair-wise comparisons between pairs of groups, but this is likely to be tedious when there are many groups. The other problem is that when many significance tests are carried out, we are more likely to find a few significant results by chance than when only one test is carried out.

The one-way analysis of variance (ANOVA) is a test suitable for testing the hypothesis that all the group means are equal.

The null hypothesis is that all group means are equal. Rejection of the null hypothesis means that at least one group mean differs from the others. One can regard one-way ANOVA as testing the equality of all group means simultaneously.

Typically, the variable for which the group mean is compared should be continuous, and the group variable is categorical. For example, compare mean mathematics achievement across 4 countries.
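To make the mechanics concrete, here is a small Python sketch (with made-up numbers, not the PISA data) that computes the one-way ANOVA F statistic by hand: the ratio of the between-group mean square to the within-group mean square.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: how far group means sit from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

    ms_between = ss_between / (k - 1)        # mean square between (df = k - 1)
    ms_within = ss_within / (n_total - k)    # mean square within (df = N - k)
    return ms_between / ms_within

# Three made-up 'countries' with clearly different means
f = one_way_anova_f([[10, 11, 12], [20, 21, 22], [30, 31, 32]])
print(f)  # a large F: between-group variation dwarfs within-group variation
```

A large F (compared against the F distribution with k − 1 and N − k degrees of freedom, which is what SPSS does for you) leads to rejection of the null hypothesis that all group means are equal.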

The dataset, PISA06SCHOOL_4CNT.sav, is a ‘cut-down’ copy of the PISA 2006 school questionnaire data. Only four countries are included in the file: Australia, Germany, New Zealand and Russia.

Example 1

We are interested in knowing whether the student-teacher ratio (STRATIO) is similar in the four countries. An ANOVA can provide such a statistical test for us. However, it is recommended that, before you jump straight to significance testing, some exploration of the data be carried out, to get a feel for the differences.

Compute the mean and standard error of the mean for the variable STRATIO separately for the four countries:

Select SPSS menu

Analyze → Descriptive Statistics → Explore;

Place STRATIO as dependent list variable, and COUNTRY as factor list variable.

Complete the following table:

Country / Average student-teacher ratio / Standard error of the average
Australia / ______ / ______
Germany / ______ / ______
New Zealand / ______ / ______
Russian Federation / ______ / ______

Looking at the mean and standard errors in the above table, which countries may have similar student-teacher ratios, and which countries may have different student-teacher ratios?

To answer these questions more formally through significance tests, carry out an ANOVA:

Select Analyze → Compare Means → One-Way ANOVA

Place STRATIO as Dependent list variable, and COUNTRY as Factor.

(If you also select Options → Descriptive, you will get means and standard errors as well.)

The output from this analysis is shown below.

While the ANOVA may answer our question of whether the four countries have similar student-teacher ratios, it is not, by itself, very helpful: we don’t know whether all four countries differ from one another, or whether the difference lies between just one pair of countries.

A pair-wise comparison may help answer the question of which countries are different.

Re-run the ANOVA; this time, click the Post Hoc button and check the box “Bonferroni”.

A multiple-comparisons table appears in the output, showing the significance level of pair-wise comparisons.

Fill in the table below.

Sig. level / Mean STRATIO / Australia / Germany / New Zealand / Russia
Mean STRATIO / / 13.4 / 16.9 / ______ / ______
Australia / 13.4 / / ______ / ______ / ______
Germany / 16.9 / ______ / / ______ / ______
New Zealand / ______ / ______ / ______ / / ______
Russia / ______ / .104 / ______ / ______ /

How do these significance levels differ from independent sample t-tests?

Carry out a t-test for Australia and Russia.

The significance level from a two-sample t-test for a comparison between Australia and Russia in student-teacher ratio is ______.

If the paired comparison from the ANOVA is used, you would accept the null hypothesis, at the 95% confidence level, that there is no difference between Australia and Russia.

If an independent-samples t-test is used, you would reject the null hypothesis and accept the alternative hypothesis that there is a difference between Australia and Russia.

You will find that, whenever Bonferroni adjustment is made, statistical tests become less significant (i.e., less likely to reject the null hypothesis).

Multiple comparisons and Bonferroni

The issue is this: if you carry out just one significance test, say at the 95% confidence level, there is a 5% chance of observing a value as extreme as the one you found even when the observation comes from the null distribution (the distribution under the null hypothesis). You would still reject the null hypothesis because the value is ‘extreme’. So there is a 5% chance that you will reject the null hypothesis incorrectly.

However, if 100 significance tests are carried out instead of just one significance test, one would expect 5 tests to be significant (at 95% confidence level) by chance. That is, out of 100 observations, one would expect 5 observations to have values outside the 95% region (when the null hypothesis is true). We shouldn’t reject the null hypothesis for these 5 cases.
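The arithmetic behind this can be sketched in a few lines of Python (assuming α = 0.05 and independent tests): the chance of at least one false rejection among n tests is 1 − (1 − α)^n.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false rejection) among n_tests independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 100):
    print(n, round(family_wise_error_rate(n), 3))
# With 100 tests, a false rejection somewhere is almost certain (about 0.994),
# and we expect about 100 * 0.05 = 5 individual tests to be 'significant' by chance.
```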

In general, when we carry out many significance tests, we want to be more cautious about rejecting the null hypothesis, because, out of so many tests, some will be significant just by chance (e.g., we expect 5 to be significant by chance out of 100). One way to be ‘more cautious’ is to lower the significance level. So, instead of α = 0.05 (for a 95% confidence level), we can use α = 0.005, for example, so that we will not falsely reject the null hypothesis too often.

The Bonferroni adjustment uses α/n, instead of α, as the significance level when n significance tests are carried out.

For example, if we carry out 100 significance tests, then we will reject the null hypothesis if “Sig. < 0.0005” (instead of “Sig. < 0.05”), but still say that the significance test is at the 95% confidence level.
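An equivalent way to apply the adjustment, sketched below in Python with made-up raw p-values (not SPSS output), is to multiply each raw pair-wise p-value by the number of tests (capping the result at 1) and then compare against the usual 0.05:

```python
def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.017, 0.030, 0.004, 0.650]          # hypothetical raw pair-wise p-values
adjusted = bonferroni_adjust(raw)
print([round(p, 3) for p in adjusted])       # [0.068, 0.12, 0.016, 1.0]
for p_raw, p_adj in zip(raw, adjusted):
    print(p_raw, "significant" if p_adj < 0.05 else "not significant after adjustment")
```

Note how a raw p-value of 0.017, which would be ‘significant’ on its own, is no longer significant once four tests are taken into account.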

An example of multiple comparisons with and without Bonferroni adjustment can be found on Page 59 of the PISA 2003 initial report 2003initialreportP59.pdf.

Discussion Points

(A)

In Topic 6, we posed a question about the number of rainy days. I will rephrase this question below:

In a region, the number of rainy days per year is approximately normally distributed, with a mean of 85 days and a standard deviation of 15 days (the distribution was established by collecting 200 years of data).

(1) Last year, the number of rainy days was 120. Was this an ‘abnormal’ year?

(2) Within the 200 years surveyed, 5 years had more than 115 rainy days. Were these 5 years ‘abnormal’ years?

(Hint: In (1), how many significance tests are carried out? In (2), how many significance tests are carried out? If multiple significance tests are carried out, what significance level should be used?)
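As a numerical companion to the hint (not a full answer), here is a Python sketch, assuming the normal model stated above, of the single-test calculation in (1) and the expected chance count in (2):

```python
import math

def normal_tail_prob(z):
    """P(Z > z) for a standard normal variable, via the error function."""
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

mean, sd, n_years = 85, 15, 200

# (1) One year with 120 rainy days: a single significance test
z = (120 - mean) / sd
print(z, normal_tail_prob(z))   # z is about 2.33, upper-tail p about 0.01

# (2) 200 years examined, i.e. 200 tests. How many years beyond 115 days
#     would we expect purely by chance under the null distribution?
z115 = (115 - mean) / sd        # z = 2.0
expected = n_years * normal_tail_prob(z115)
print(expected)                 # about 4.5 years expected by chance alone
```

The point of the hint: roughly the number of years observed beyond 115 days is what chance alone predicts when 200 years are examined, so a per-test significance level of 0.05 is too lenient in (2).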

(B)

A common (and somewhat worrying) practice is that schools are ranked based on their performance on a national test. The schools in the lowest-performing 5 percentiles are branded ‘under-performers’ (or, worse still, face threatened disciplinary action from education authorities).

Assume that there are no true differences between schools in performance, and that the observed variation across schools is simply due to random variation. When a distribution of school performance is formed, there will always be 5% of schools in the lowest-performing 5 percentiles (due to random variation). We can’t point the finger at any school and say that we have evidence it is a low performer. (This is a bit like saying that everyone has to be above average.) This is because we found the so-called ‘low-performing’ schools by examining many schools. That is, we carried out many significance tests. In carrying out many significance tests, we are bound to find some significant by chance.

The situation is different if we have reasons to believe a particular school is not performing well (e.g., teachers’ and parents’ complaints, etc.). We then look at that school’s performance and find that the school is performing in the lowest 5 percentile among other schools. We can then use this as further evidence that the school is likely to be underperforming. In this case, we have only carried out one significance test.

(C)

Another example discussed in the subject forum was about a person charged with a criminal offence based on statistical evidence, such as, statistically, the chance of the events happening is 1 in a million, etc.

Let’s say that the chance of 3 SIDS (sudden infant death syndrome) deaths happening in one family is 1 in a million (I am making these figures up). In the world, there are many millions of families, so there must be a few families where this has happened purely by chance. We can’t go around the world, look for these families, and lay criminal charges. This is because, by looking at all the families in the world, we are carrying out many significance tests, and some will be significant by chance.

On the other hand, when a person has already come under suspicion that there could be unnatural infant deaths (on evidence not based on statistics), we can then look at the statistics of having 3 SIDS deaths in one family and use this as one piece of supporting evidence for a possible criminal act. In this case, only one significance test is carried out.

More generally, don’t go out and ‘fish’ for ‘culprits’. If there are already suspected ‘culprits’, then statistics can be used as one additional piece of information to assess the likelihood of certain events.

Example 2

Use the data set PISA06SCHOOL_4CNT.sav to test the hypothesis that there are no differences in the mean scores of the “proportion of certified teachers” (variable PROPCERT) in the four countries.

Will you reject or accept the null hypothesis?

Also conduct pair-wise comparisons and fill in the table below.

Sig. level / Mean PROPCERT / Australia / Germany / New Zealand / Russia
Mean PROPCERT / / 0.98 / 0.91 / ______ / ______
Australia / 0.98 / / ______ / ______ / ______
Germany / 0.91 / ______ / / ______ / ______
New Zealand / ______ / ______ / ______ / / ______
Russia / ______ / ______ / ______ / ______ /

Repeat the exercise by sub-sampling only 10 percent of the schools in the data set.

Sig. level / Mean PROPCERT / Australia / Germany / New Zealand / Russia
Mean PROPCERT / / ______ / ______ / ______ / ______
Australia / ______ / / ______ / ______ / ______
Germany / ______ / ______ / / ______ / ______
New Zealand / ______ / ______ / ______ / / ______
Russia / ______ / ______ / ______ / ______ /

What have you noticed in relation to the number of significant results when the sample size is smaller?

What do you think you’d find in relation to the number of significant results if the sample size were much larger, e.g., if 5000 schools were sampled in each country?

Discussion Points

The ‘truth’ (or something close to it) is that the four countries are extremely unlikely to have identical (to many decimal places) proportions of certified teachers. Given a large enough sample size, the significance tests will show significance. This is because the null hypothesis is that the four country means are ‘identical’.

On the other hand, when the sample size is too small, the data set will not have enough ‘power’ to detect small differences.
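This point about power can be illustrated with a small Monte-Carlo sketch in Python (a made-up effect size and a simple two-sample z-test with known standard deviation, not the PISA data): the same small true difference is rarely detected with 10 observations per group, but almost always detected with 2000 per group.

```python
import random

def detection_rate(n_per_group, true_diff, n_sims=200, sd=1.0, seed=1):
    """Fraction of simulations in which a two-sample z-test (known sd)
    rejects H0 at the 5% level, when the true difference is true_diff."""
    rng = random.Random(seed)
    se = sd * (2 / n_per_group) ** 0.5       # standard error of the mean difference
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        diff = sum(b) / n_per_group - sum(a) / n_per_group
        if abs(diff) > 1.96 * se:            # two-sided test at alpha = 0.05
            rejections += 1
    return rejections / n_sims

small = detection_rate(10, true_diff=0.2)    # small sample: low power
large = detection_rate(2000, true_diff=0.2)  # large sample: near-certain rejection
print(small, large)
```

The true difference is the same in both runs; only the sample size changes the verdict of the test.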

So, whenever you accept the null hypothesis, it may simply be that you don’t have a large enough sample size to detect the difference (when the ‘truth’ is that there is a difference, however minute).

So be aware of this when you accept the null hypothesis.

Further, in designing a survey, be aware that the sample size can determine the outcome of the survey in terms of hypothesis testing.

In summary, statistical theory does not lead you to definitive truth. There are many caveats in relation to statistical testing. When you draw conclusions in a report, keep these caveats in mind. Statistical significance testing should not be carried out without a thorough understanding of the meaning of the tests.

Repeated-measures analysis of variance

As with the paired-samples t-test, when the subjects are matched (e.g., prices of the same textbooks at three universities), we can carry out a repeated-measures analysis of variance. An example is that students’ mathematics proficiency is measured on several occasions, with intervention strategies implemented between the measurements. In this case, a repeated-measures analysis of variance has more power to detect differences in mean scores, because we focus on the differences in scores for each individual, thereby removing the variation due to differing abilities across students.
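The reason for the extra power can be seen in a small Python sketch with made-up scores: subjects differ a lot in overall ability, but each one improves by roughly the same amount, so the per-subject differences vary far less than the raw scores do.

```python
import statistics

# Made-up proficiency scores for five students, before and after an intervention.
# Baselines vary a lot across students; each student gains roughly 2 points.
before = [48.0, 61.0, 70.0, 79.0, 92.0]
after  = [50.5, 62.5, 72.0, 81.5, 93.5]

gains = [post - pre for pre, post in zip(before, after)]

between_subject_sd = statistics.stdev(before)  # spread due to differing abilities
within_pair_sd = statistics.stdev(gains)       # spread of the individual gains

print(between_subject_sd, within_pair_sd)
# The gains are far less variable than the raw scores, so an analysis that
# works on the gains needs far less data to detect the same average effect.
```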

In SPSS, this analysis can be done through the General Linear Model menu (Analyze → General Linear Model → Repeated Measures). But this option is only available in the SPSS Advanced module; it is not available in the student version or the post-graduate version of SPSS.

Additional exercises

The data set y12survey2.sav contains survey results on student interests. See accompanying documents Year12 Survey_coding2.doc and Holland2fory12data.doc.

Answer the broad questions:

Are there differences between girls and boys in their interests?

Are students’ interest areas related to father’s education level?