
13 - ANALYSIS OF CATEGORICAL DATA
(Daniel, Ch. 12, Gerstman Ch. 18 & 19)

In this handout we cover several different methods for analyzing categorical data.

The methods we will examine are:

· 2 × 2 Contingency Tables (e.g., odds ratios and relative risks)

· r × c Contingency Tables

· McNemar’s Test

· Cochran-Mantel-Haenszel Method

COMPARING TWO POPULATION PROPORTIONS EXAMPLE -
AGE AT FIRST PREGNANCY AND CERVICAL CANCER (a Case-Control Study)

These data come from a case-control study examining the potential relationship between age at first pregnancy and cervical cancer. In a case-control study, random samples of cases (i.e. people with the disease in question) and controls (i.e. people similar to those in the case group, except that they do not have the disease) are taken, and the proportion of people with some potential risk factor is compared across the two groups. In this study we will compare the proportion of women who had their first pregnancy at or before the age of 25, because researchers suspected that an early age at first pregnancy leads to an increased risk of developing cervical cancer. The data are presented in the table below:

 / Age at First Pregnancy <= 25 (risk factor present) / Age at First Pregnancy > 25 (risk factor absent) / Row Totals (fixed)
Cervical Cancer (Case) / 42 / 7 / 49
Control / 203 / 114 / 317
Column Totals (random) / 245 / 121 / n = 366

Ho: There is no association between age at first pregnancy and cervical cancer, i.e. age at first pregnancy and cervical cancer are independent.

The distribution of the risk factor is the same for both cases and controls.

The proportion of women with the risk factor is the same for both groups, i.e. $P(\text{Risk} \mid \text{Case}) = P(\text{Risk} \mid \text{Control})$.

Ha: There IS an association between age at first pregnancy and cervical cancer, i.e. age at first pregnancy and cervical cancer are NOT independent.

The distribution of the risk factor is NOT the same for both cases and controls.

The proportion of women with the risk factor is not the same for both groups, i.e. $P(\text{Risk} \mid \text{Case}) \neq P(\text{Risk} \mid \text{Control})$.

Development of the Test Statistic

If the null hypothesis is true we expect the proportion of women with the risk factor in the case and control groups to be the same. We can also think of this in terms of conditional probabilities. Two events A and B are said to be independent if

$$P(A \mid B) = P(A)$$

i.e. knowledge about the occurrence of B tells you nothing about the occurrence of A.

Consider the following generic representation of the contingency table for this example.

 / 1st Preg. Age <= 25 (Risk Factor Present) / 1st Preg. Age > 25 (Risk Factor Absent) / ROW TOTALS
Case (Disease Present) / a / b / R1 = (a+b)
Control (Disease Absent) / c / d / R2 = (c+d)
COLUMN TOTALS / C1 = (a+c) / C2 = (b+d) / n

From this table we can calculate the conditional probability of having the risk factor given the disease status of the subject as follows:

$$P(\text{Risk} \mid \text{Disease}) = \frac{a}{a+b} = \frac{a}{R_1}$$

which, if risk and disease are independent, should be equal to

$$P(\text{Risk}) = \frac{a+c}{n} = \frac{C_1}{n}$$

Setting these two expressions equal to one another gives:

$$\frac{a}{R_1} = \frac{C_1}{n} \quad\Rightarrow\quad a = \frac{R_1 C_1}{n}$$

This gives what we expect a, the number of women that have the risk factor in the disease group, to be if the null hypothesis is true.

In a similar fashion we could find what we expect c, the number of women with the risk factor in the control group, to be. If the null hypothesis is true we expect $P(\text{Risk} \mid \text{No Disease}) = \frac{c}{R_2}$ to be equal to $P(\text{Risk}) = \frac{C_1}{n}$, so we expect $c = \frac{R_2 C_1}{n}$.

We can also look at absence of the risk factor in the same way, which gives the following expected values for b and d:

$$b = \frac{R_1 C_2}{n} \quad\text{and}\quad d = \frac{R_2 C_2}{n}$$

Notice there is a general pattern here: the expected value for the frequency in the ith row and the jth column of the table is found by taking the row total for that row ($R_i$) times the column total for that column ($C_j$) and then dividing by the total sample size (n), i.e.

$$E_{ij} = \frac{R_i C_j}{n}$$

Our test statistic looks at the difference or discrepancy between what we observe when our data is collected and what we expect to see if the null hypothesis of independence is true. Intuitively, if the observed frequencies are far away from what we expect to see if the variables in question were independent, then we will reject the null hypothesis and conclude a significant relationship between the variables exists.

Pearson’s Chi-Square Test Statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed frequency in row i and column j, and the expected frequencies for the cells are given by the formula:

$$E_{ij} = \frac{R_i C_j}{n}$$

If the observed frequencies differ substantially from the expected frequencies the test statistic will be “BIG”. How do we define “BIG”? We find the probability of getting a test statistic value as extreme or more extreme than the one observed if there were truly no association between the two variables in question, i.e. we find the p-value associated with our test statistic. If the null hypothesis is true the test statistic follows a Chi-squared distribution with degrees of freedom df = $(r - 1)(c - 1)$. Here, r = # of rows and c = # of columns in the contingency table.

[Figure: Chi-Square Distribution with p-value]

The larger the test statistic value, the smaller the p-value.

EXAMPLE (cont’d) CONDUCTING THE TEST OF INDEPENDENCE

1. State Hypotheses

Ho: There is no association between age at first pregnancy and cervical cancer, i.e. age at first pregnancy and cervical cancer are independent.

Ha: There is an association between age at first pregnancy and cervical cancer, i.e. age at first pregnancy and cervical cancer are NOT independent.

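2. Determine Test Criteria

Choose a significance level $\alpha$ (here $\alpha = 0.05$, the 0.050 column of the chi-square table).

Test Statistic

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad df = (r-1)(c-1) = (2-1)(2-1) = 1$$

where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies in row i and column j.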

3. Compute Test Statistic

a) Find expected frequencies and put them in the contingency table beneath the observed frequencies in parentheses.

b) Calculate the Chi-Square statistic.
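As a check on the hand calculation in step 3, here is a minimal Python sketch (not part of the JMP workflow used in these notes) that computes the expected frequencies as (row total × column total) / n and then Pearson's chi-square statistic for the observed 2 × 2 table. The function and variable names are illustrative assumptions.

```python
# Observed counts from the case-control table:
# rows = (case, control), columns = (age <= 25, age > 25).
observed = [[42, 7],
            [203, 114]]

def expected_counts(table):
    """Expected counts under independence: (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

expected = expected_counts(observed)
chi_sq = pearson_chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Print the expected table (rounded) and the statistic for comparison
# with the values worked out by hand.
for row in expected:
    print([round(e, 2) for e in row])
print("chi-square =", round(chi_sq, 2), "with df =", df)
```

The same two functions work unchanged for any r × c table, since they only rely on the row and column totals.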

4. Compute p-value

(use either the Chi-square Probability Calculator in JMP or the Chi-square Table at the end of these notes)

If our observed test statistic value exceeds the value in the Area in Upper Tail 0.050 column for our degrees of freedom, we reject the null hypothesis in favor of the alternative.

Area in Upper Tail

df    0.100   0.050   0.025   0.010   0.001
1     2.71    3.84    5.02    6.63    10.83
2     4.61    5.99    7.38    9.21    13.82
…     …       …       …       …       …

Because our test statistic value is greater than ______ we reject the null hypothesis and conclude that age at first pregnancy and cervical cancer status are not independent. We can use the table to put an upper bound on the p-value by noting that the largest tabled value our test statistic exceeds is 6.63. This says that our p-value < ______.

Chi-Square Probability Calculator in JMP

To use this, simply enter the test statistic value and the degrees of freedom (df) in the table and the p-value will be calculated. Here we see our p-value = .0026836.
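For readers without JMP, the same tail area can be reproduced with Python's standard library. This sketch assumes df = 1 only: a chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper-tail area is erfc(sqrt(x/2)). The statistic is recomputed from the 2 × 2 counts using the shortcut n(ad − bc)² / (R1·R2·C1·C2), which is algebraically the same as summing (O − E)²/E over the four cells. The function name is an illustrative assumption.

```python
import math

def chi_square_upper_tail_df1(x):
    """P(chi-square with 1 df >= x), via the standard normal: erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Cell counts from the case-control table (a, b = cases; c, d = controls).
a, b, c, d = 42, 7, 203, 114
n = a + b + c + d

# Shortcut form of Pearson's statistic for a 2 x 2 table.
chi_sq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

p_value = chi_square_upper_tail_df1(chi_sq)
print(f"chi-square = {chi_sq:.4f}, p-value = {p_value:.7f}")
```

The result should agree with the JMP calculator value quoted above up to rounding.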

Yates’ Correction for Continuity

When we have a 2 × 2 contingency table the Chi-square test outlined above is not really appropriate, particularly when the sample size (n) is “small”. When working with 2 × 2 tables it is often preferable to use Yates’ correction when calculating the test statistic, or to use Fisher’s Exact Test which, as the name suggests, is an exact test!

Test Statistic: Pearson’s Chi-Square Test with Yates’ Correction for Continuity in 2 × 2 Contingency Tables.

$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{\left(|O_{ij} - E_{ij}| - 0.5\right)^2}{E_{ij}}$$

where $O_{ij}$ is the observed frequency and $E_{ij} = \frac{R_i C_j}{n}$ is the expected frequency in row i and column j.
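To see how much the continuity correction changes the statistic for the cervical cancer data, the following sketch subtracts 0.5 from each absolute difference |O − E| before squaring. It is a hand-rolled illustration with assumed names, not the routine of any particular package, and recall that JMP reports the uncorrected statistic.

```python
# Observed counts: rows = (case, control), columns = (age <= 25, age > 25).
observed = [[42, 7],
            [203, 114]]

def yates_chi_square(table):
    """Continuity-corrected statistic: sum of (|O - E| - 0.5)^2 / E over the cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count for cell (i, j)
            total += (abs(o - e) - 0.5) ** 2 / e
    return total

chi_sq_yates = yates_chi_square(observed)
print("Yates-corrected chi-square =", round(chi_sq_yates, 2))
```

The corrected value is smaller than the uncorrected Pearson statistic, so the corrected test is the more conservative of the two.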

Conducting the Analysis in JMP
Enter these data into JMP as shown below.

The contingency table in JMP is obtained by selecting Fit Y by X from the Analyze menu and placing Disease in the X box and Preg. Age in the Y box. The resulting mosaic plot and contingency table are shown below.

The Row %'s give the proportion of women with the potential risk factor in each group. From these data we can see that the proportion of women who had their first pregnancy at or before age 25 is .8571 or 85.71% in the cervical cancer group vs. .6404 or 64.04% in the control group. It certainly appears that the proportion of women with the potential risk factor is higher in the cervical cancer (case) group, i.e. that there is a relationship between the potential risk factor and cervical cancer.

Chi-square test results are given automatically

There is strong evidence that the proportion of women with the risk factor differs significantly between the cases and the controls, i.e. we have strong evidence that cervical cancer and age at first pregnancy are not independent (p = .0027).

Note: JMP does not use Yates’ correction for continuity when calculating the chi-square test statistic.

Instead of the chi-square test we can use Fisher's Exact Test, which is included in the JMP output whenever we are working with a 2 × 2 table. The results are shown below.

The three p-values given are for testing the following:

(1) Left, p-value = .9996 is for testing if the proportion of women with the potential risk factor is larger for the control group. Had this been significant that would suggest that having a first pregnancy at or before the age of 25 reduces your risk of developing cervical cancer. This is clearly not supported as the p-value > .05.

(2) Right, p-value = .0014 is for testing if the proportion of women with the potential risk factor is larger for the cervical cancer (case) group. The fact this p-value is significant suggests that having a first pregnancy at or before the age of 25 increases your risk of developing cervical cancer. This was the research hypothesis for the doctors who conducted this study.

(3) 2-Tail, p-value = .0029 is for testing if the proportion of women with the potential risk factor differs between the two groups. The fact that this p-value is significant suggests that the proportion of women having a first pregnancy at or before the age of 25 is not the same for both groups. Because the sample proportion is larger for the case group we can again conclude that early age at first pregnancy increases risk of cervical cancer.
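The three Fisher p-values can be sketched from the hypergeometric distribution: with all margins held fixed, the number of cases with the risk factor determines the whole table. The code below is a minimal standard-library illustration with assumed names. The two-sided rule used here (summing the probabilities of all tables no more likely than the observed one) is one common convention and is an assumption about how the 2-Tail value is defined; which one-sided result a package labels "Left" or "Right" also depends on how it orders the levels.

```python
from math import comb

# a = cases with risk factor, b = cases without,
# c = controls with risk factor, d = controls without.
a, b, c, d = 42, 7, 203, 114
n_cases = a + b      # row total for the case group
n_risk = a + c       # column total for "risk factor present"
n = a + b + c + d

def hypergeom_pmf(k):
    """P(exactly k of the cases have the risk factor), with all margins fixed."""
    return comb(n_risk, k) * comb(n - n_risk, n_cases - k) / comb(n, n_cases)

# Possible values for the (case, risk factor present) cell.
lo = max(0, n_cases - (n - n_risk))
hi = min(n_cases, n_risk)
support = range(lo, hi + 1)
p_obs = hypergeom_pmf(a)

# One-sided: cases have a smaller / larger share with the risk factor.
p_fewer = sum(hypergeom_pmf(k) for k in support if k <= a)
p_more = sum(hypergeom_pmf(k) for k in support if k >= a)

# Two-sided: total probability of tables no more likely than the observed one
# (small relative tolerance guards against floating-point ties).
p_two_sided = sum(p for p in map(hypergeom_pmf, support) if p <= p_obs * (1 + 1e-9))

print(f"fewer: {p_fewer:.4f}  more: {p_more:.4f}  two-sided: {p_two_sided:.4f}")
```

Here `p_more` corresponds to the research hypothesis that the risk factor is more common among the cases.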

What other analyses could we perform for these data?

Example 2: Type of Skin Melanoma and Site on Body

Is there a relationship between the type of skin melanoma and where on the body the melanoma appeared? To answer this question n = 400 patients were cross-classified according to type of melanoma and where the melanoma appeared. The data collected are summarized in the contingency table below.

Site of Melanoma
Type of Melanoma / Head & Neck / Trunk / Extremities / Row Totals
Hutchinson’s Melanotic Freckle / 22 / 2 / 10 / R1 = 34
Superficial Spreading Melanoma / 16 / 54 / 115 / R2 = 185
Nodular / 19 / 33 / 73 / R3 = 125
Indeterminate / 11 / 17 / 28 / R4 = 56
Column Totals / C1 = 68 / C2 = 106 / C3 = ____ / n = 400

1. State Hypotheses:

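2. Determine Test Criteria

Choose a significance level $\alpha$ (e.g. $\alpha = 0.05$).

Test Statistic: Pearson Chi-Square Test for r × c Contingency Tables

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where r = # of rows, c = # of columns, $O_{ij}$ is the observed frequency in row i and column j, and $E_{ij} = \frac{R_i C_j}{n}$.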

Conditions for the Test Statistic to be Valid

We should have no cells with expected frequencies less than 1, and at least 80% of the expected frequencies should be greater than 5. If either of these conditions is violated you have two options:

· Increase the sample size (n) so that the expected frequencies increase, assuming additional data could be gathered under the same experimental conditions.

· Combine sparse categories, which increases the cell frequencies and the associated row/column totals and therefore increases the expected frequencies.
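The two validity conditions above can be checked mechanically once the expected frequencies are known. The following is a minimal sketch for the melanoma table; the variable names are illustrative assumptions.

```python
# Observed counts: rows = melanoma type, columns = (head & neck, trunk, extremities).
observed = [[22, 2, 10],
            [16, 54, 115],
            [19, 33, 73],
            [11, 17, 28]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency for every cell: (row total * column total) / n.
expected = [r * c / n for r in row_totals for c in col_totals]

smallest = min(expected)
share_above_5 = sum(e > 5 for e in expected) / len(expected)

# Condition 1: no expected frequency below 1.
# Condition 2: at least 80% of expected frequencies greater than 5.
conditions_met = smallest >= 1 and share_above_5 >= 0.80

print("smallest expected frequency:", round(smallest, 2))
print("share of cells with expected > 5:", share_above_5)
print("conditions met:", conditions_met)
```

If `conditions_met` were false, the two remedies listed above (more data, or combining sparse categories) would apply.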

3. Compute Test Statistic

a) Compute expected frequencies and place them in the table beneath the observed frequencies in parentheses.

Site of Melanoma
Type of Melanoma / Head & Neck / Trunk / Extremities / Row Totals
Hutchinson’s Melanotic Freckle / 22 (5.78) / 2 (9.01) / 10 (19.21) / R1 = 34
Superficial Spreading Melanoma / 16 ( ) / 54 (49.03) / 115 (104.53) / R2 = 185
Nodular / 19 (21.25) / 33 (33.13) / 73 (70.62) / R3 = 125
Indeterminate / 11 (9.52) / 17 ( ) / 28 (31.64) / R4 = 56
Column Totals / C1 = 68 / C2 = 106 / C3 = ____ / n = 400

b) Compute Test Statistic
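To check the hand computation for the 4 × 3 melanoma table, the sketch below lists each cell's contribution (O − E)²/E before summing them, which also shows which cells drive the statistic. It is an illustration with assumed names, not JMP output.

```python
# Observed counts: rows = melanoma type, columns = (head & neck, trunk, extremities).
observed = [[22, 2, 10],
            [16, 54, 115],
            [19, 33, 73],
            [11, 17, 28]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# contributions[i][j] = (O_ij - E_ij)^2 / E_ij with E_ij = R_i * C_j / n.
contributions = []
for i, row in enumerate(observed):
    contrib_row = []
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        contrib_row.append((o - e) ** 2 / e)
    contributions.append(contrib_row)

chi_sq = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)

for row in contributions:
    print([round(x, 2) for x in row])
print("chi-square =", round(chi_sq, 2), "with df =", df)
```

A large contribution from a single cell indicates that the observed count there is far from what independence would predict.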

4. Compute p-value
Using the chi-square table at the end of the notes we have

Area in Upper Tail

df    0.100   0.050   0.025   0.010   0.001
1     2.71    3.84    5.02    6.63    10.83
2     4.61    5.99    7.38    9.21    13.82
…     …       …       …       …       …
6     10.64   12.59   14.45   16.81   22.46
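The tail areas in the table can also be computed directly. For an even number of degrees of freedom the chi-square upper-tail probability has a closed form, exp(−x/2) · Σ (x/2)^k / k! for k = 0, …, df/2 − 1, so the df = 6 case needs only the standard library. This sketch, with an assumed function name, reproduces the df = 6 row of the table; the same function can be applied to a test statistic computed by hand.

```python
import math

def chi_square_upper_tail_even_df(x, df):
    """P(chi-square with even df >= x) = exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Critical values from the df = 6 row of the chi-square table.
critical_values_df6 = [10.64, 12.59, 14.45, 16.81, 22.46]
tail_areas = [chi_square_upper_tail_even_df(x, 6) for x in critical_values_df6]

for x, area in zip(critical_values_df6, tail_areas):
    print(f"x = {x:5.2f}  upper-tail area = {area:.4f}")
```

Each computed area should match the corresponding column heading of the table up to rounding.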

Chi-Square Probability Calculator in JMP

To use this, simply enter the test statistic value and the degrees of freedom (df) in the table and the p-value will be calculated. Here we see our p-value