13.4 Test of Independence: Contingency Tables
Motivating Example:
Objective:
We want to determine whether beer preference is independent of the gender of the beer drinker.
We want to test
$H_0$: Beer preference is independent of gender
vs.
$H_a$: Beer preference is not independent of gender
at level of significance $\alpha$.
We have the following data:
Rows: Gender. Columns: Beer Preference.
Gender / Light / Regular / Dark / Total / Proportion
Male / 20 / 40 / 20 / 80 / 0.533
Female / 30 / 30 / 10 / 70 / 0.467
Total / 50 / 70 / 30 / 150 / 1
Proportion / 0.333 / 0.467 / 0.200 / 1 /
The above table is called a contingency table.
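If $H_0$ is true, then the expected numbers under $H_0$ are (sample size) × (row proportion) × (column proportion). Writing $e_{ij}$ for the expected number in row $i$ (1 = Male, 2 = Female) and column $j$ (1 = Light, 2 = Regular, 3 = Dark),
$e_{11} = 150 \cdot \frac{80}{150} \cdot \frac{50}{150} = 26.67$, $e_{12} = 150 \cdot \frac{80}{150} \cdot \frac{70}{150} = 37.33$, $e_{13} = 150 \cdot \frac{80}{150} \cdot \frac{30}{150} = 16$,
$e_{21} = 150 \cdot \frac{70}{150} \cdot \frac{50}{150} = 23.33$, $e_{22} = 150 \cdot \frac{70}{150} \cdot \frac{70}{150} = 32.67$, $e_{23} = 150 \cdot \frac{70}{150} \cdot \frac{30}{150} = 14$.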
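The expected numbers under $H_0$ can be summarized by
Rows: Gender. Columns: Beer Preference.
Gender / Light / Regular / Dark / Proportion
Male / 26.67 / 37.33 / 16.00 / 0.533
Female / 23.33 / 32.67 / 14.00 / 0.467
Proportion / 0.333 / 0.467 / 0.200 / 1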
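Intuitively, if $H_0$ is true, the observed numbers and the expected numbers (under $H_0$) should be close, so small differences $f_{ij} - e_{ij}$ between the observed number $f_{ij}$ and the expected number $e_{ij}$ in each cell support $H_0$, while large differences are evidence against it. The following statistic can be used to reflect the difference between the observed numbers and the expected numbers:
$\chi^2 = \sum_{i}\sum_{j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}} = \frac{(20-26.67)^2}{26.67} + \frac{(40-37.33)^2}{37.33} + \frac{(20-16)^2}{16} + \frac{(30-23.33)^2}{23.33} + \frac{(30-32.67)^2}{32.67} + \frac{(10-14)^2}{14} = 6.12$.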
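The arithmetic for the beer-preference table can be checked with a few lines of Python. This is only a sketch using the counts from the table above; the variable names are chosen here for illustration and are not part of the notes.

```python
# Observed counts from the beer-preference table
# (rows: Male, Female; columns: Light, Regular, Dark).
observed = [
    [20, 40, 20],
    [30, 30, 10],
]

n = sum(sum(row) for row in observed)              # sample size
row_totals = [sum(row) for row in observed]        # one total per gender
col_totals = [sum(col) for col in zip(*observed)]  # one total per beer type

# Expected count under independence: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over all six cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_square, 2))
```

The expected counts are computed from the row and column totals only, so the same lines work for any table of counts.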
General Case:
Suppose there are two variables, a column variable (with $m$ categories) and a row variable (with $p$ categories). We want to test the hypothesis
$H_0$: Row variable is independent of column variable
vs.
$H_a$: Row variable is not independent of column variable.
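Suppose the sample size is $n$. Let $f_{ij}$ be the observed number in row $i$ and column $j$, $f_{i\cdot}$ the total of row $i$, and $f_{\cdot j}$ the total of column $j$. The contingency table is
Rows: Row Variable ($p$ rows). Columns: Column Variable ($m$ columns).
Row / 1 / ... / j / ... / m / proportions
1 / $f_{11}$ / ... / $f_{1j}$ / ... / $f_{1m}$ / $\frac{f_{1\cdot}}{n}$
i / $f_{i1}$ / ... / $f_{ij}$ / ... / $f_{im}$ / $\frac{f_{i\cdot}}{n}$
p / $f_{p1}$ / ... / $f_{pj}$ / ... / $f_{pm}$ / $\frac{f_{p\cdot}}{n}$
proportions / $\frac{f_{\cdot 1}}{n}$ / ... / $\frac{f_{\cdot j}}{n}$ / ... / $\frac{f_{\cdot m}}{n}$ / 1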
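If $H_0$ is true, then the expected numbers under $H_0$, with $e_{ij}$ denoting the expected number in row $i$ and column $j$, are
Rows: Row Variable ($p$ rows). Columns: Column Variable ($m$ columns).
Row / 1 / ... / j / ... / m / proportions
1 / $e_{11}$ / ... / $e_{1j}$ / ... / $e_{1m}$ / $\frac{f_{1\cdot}}{n}$
i / $e_{i1}$ / ... / $e_{ij}$ / ... / $e_{im}$ / $\frac{f_{i\cdot}}{n}$
p / $e_{p1}$ / ... / $e_{pj}$ / ... / $e_{pm}$ / $\frac{f_{p\cdot}}{n}$
proportions / $\frac{f_{\cdot 1}}{n}$ / ... / $\frac{f_{\cdot j}}{n}$ / ... / $\frac{f_{\cdot m}}{n}$ / 1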
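Note:
$e_{ij} = n \cdot \frac{f_{i\cdot}}{n} \cdot \frac{f_{\cdot j}}{n} = \frac{f_{i\cdot}\, f_{\cdot j}}{n}$,
where
$f_{i\cdot} = \sum_{j=1}^{m} f_{ij}$ is the total of row $i$
and
$f_{\cdot j} = \sum_{i=1}^{p} f_{ij}$ is the total of column $j$.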
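The formula for the expected numbers follows from the definition of independence. A short derivation, writing $f_{i\cdot}$ and $f_{\cdot j}$ for the row and column totals (notation assumed here) and estimating each probability by the corresponding sample proportion:

```latex
\begin{align*}
P(\text{row } i \text{ and column } j)
  &= P(\text{row } i)\, P(\text{column } j)
  && \text{(independence under } H_0\text{)} \\
  &\approx \frac{f_{i\cdot}}{n} \cdot \frac{f_{\cdot j}}{n}
  && \text{(sample proportions)} \\
e_{ij}
  &= n \cdot \frac{f_{i\cdot}}{n} \cdot \frac{f_{\cdot j}}{n}
   = \frac{f_{i\cdot}\, f_{\cdot j}}{n}.
\end{align*}
```

In the beer example this gives, for male drinkers who prefer light beer, $80 \cdot 50 / 150 = 26.67$.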
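Thus, the chi-square statistic used to reflect the difference between the observed numbers $f_{ij}$ and the expected numbers $e_{ij}$ is
$\chi^2 = \sum_{i=1}^{p}\sum_{j=1}^{m} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$.
Next question: how large must $\chi^2$ be to reject $H_0$?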
Chi-Square Test:
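Let
$\chi^2 = \sum_{i=1}^{p}\sum_{j=1}^{m} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$,
where $f_{ij}$ and $e_{ij}$ are the observed and expected numbers in row $i$ and column $j$.
Provided $e_{ij} \ge 5$ for every $i$ and $j$, the chi-square test with level of significance $\alpha$ for
$H_0$: Row variable is independent of column variable
vs.
$H_a$: Row variable is not independent of column variable
is to
reject $H_0$ if $\chi^2 > \chi^2_{(p-1)(m-1),\alpha}$,
where $\chi^2_{(p-1)(m-1),\alpha}$ can be obtained by
$P\left(\chi^2_{(p-1)(m-1)} > \chi^2_{(p-1)(m-1),\alpha}\right) = \alpha$.
In addition,
p-value $= P\left(\chi^2_{(p-1)(m-1)} > \chi^2\right)$.
Note: when $H_0$ is true, the random variable with sample value $\chi^2$ is approximately distributed as $\chi^2_{(p-1)(m-1)}$, the chi-square distribution with $(p-1)(m-1)$ degrees of freedom.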
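The test procedure can be sketched in Python as follows. The function names `chi_square_sf_even_df` and `independence_test` are hypothetical, introduced only for this illustration. To stay within the standard library, the tail probability uses a closed form that is valid only when the degrees of freedom are even; this is a stand-in for a general chi-square routine from a statistics library. The decision is made through the p-value, which is equivalent to comparing the statistic with the critical value.

```python
import math


def chi_square_sf_even_df(x, df):
    """P(X > x) for a chi-square variable X with an even number of degrees of freedom.

    Uses exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, which holds only for even df.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


def independence_test(observed, alpha):
    """Return (statistic, df, p_value, reject) for a p-by-m table of counts."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, f_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n  # expected number under H0
            statistic += (f_ij - e_ij) ** 2 / e_ij

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    p_value = chi_square_sf_even_df(statistic, df)
    return statistic, df, p_value, p_value < alpha


# Beer-preference table; alpha = 0.05 is an illustrative choice, not from the notes.
stat, df, p_value, reject = independence_test([[20, 40, 20], [30, 30, 10]], alpha=0.05)
print(round(stat, 2), df, round(p_value, 4), reject)
```

The sketch does not check the condition that every expected number is at least 5; that check would need to be added before relying on the result.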
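Example (continued)
Here $p = 2$ and $m = 3$, so the test has $(2-1)(3-1) = 2$ degrees of freedom. Since $\chi^2 = 6.12$ exceeds the critical value $\chi^2_{2,\alpha}$ (which is $5.99$ when $\alpha = 0.05$), we reject $H_0$. Also, since
p-value $= P(\chi^2_2 > 6.12) \approx 0.047 < \alpha$,
we also reject $H_0$ based on the p-value. Therefore, we conclude that beer preference is not independent of the gender of the beer drinker.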
Example:
The following data are the numbers of people who are in favor of, are not in favor of, and have no comment on some proposal:
Gender / Favor / Not Favor / No Comment
Male / 252 / 145 / 203
Female / 148 / 105 / 147
Test whether females and males differ in their opinions about the proposal at level of significance $\alpha$.
[solution:]
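The column totals are 400, 250, and 350, while the row totals are 600 and 400. In addition, the total number is 1000.
The table for the expected numbers is
Gender / Favor / Not Favor / No Comment / Row Total
Male / 240 / 150 / 210 / 600
Female / 160 / 100 / 140 / 400
Column Total / 400 / 250 / 350 / 1000
Thus,
$\chi^2 = \frac{(252-240)^2}{240} + \frac{(145-150)^2}{150} + \frac{(203-210)^2}{210} + \frac{(148-160)^2}{160} + \frac{(105-100)^2}{100} + \frac{(147-140)^2}{140} = 0.6 + 0.167 + 0.233 + 0.9 + 0.25 + 0.35 = 2.5$.
Since $\chi^2 = 2.5$ is below the critical value $\chi^2_{2,\alpha}$ with $(2-1)(3-1) = 2$ degrees of freedom (which is $5.99$ when $\alpha = 0.05$), we do not reject $H_0$.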
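To see which cells drive the statistic in this example, the contribution of each cell can be listed separately. The snippet below is a sketch using the counts given in the example; the dictionary layout and variable names are choices made here for readability.

```python
# Observed counts (rows: Male, Female; columns: Favor, Not Favor, No Comment).
observed = {
    "Male": {"Favor": 252, "Not Favor": 145, "No Comment": 203},
    "Female": {"Favor": 148, "Not Favor": 105, "No Comment": 147},
}
columns = ["Favor", "Not Favor", "No Comment"]

row_totals = {g: sum(cells.values()) for g, cells in observed.items()}
col_totals = {c: sum(observed[g][c] for g in observed) for c in columns}
n = sum(row_totals.values())

# Contribution of each cell: (observed - expected)^2 / expected.
contributions = {}
for g in observed:
    for c in columns:
        e = row_totals[g] * col_totals[c] / n
        contributions[(g, c)] = (observed[g][c] - e) ** 2 / e

chi_square = sum(contributions.values())

for cell, value in contributions.items():
    print(cell, round(value, 3))
print("chi-square:", round(chi_square, 3))
```

The two "Favor" cells contribute the most (0.6 and 0.9), yet the total of 2.5 remains well below the usual critical values for 2 degrees of freedom.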
Online Exercise:
Exercise 13.4.1
Exercise 13.4.2