ASSOCIATIONS IN CATEGORICAL DATA

(Statistics Course Notes by Alan Pickering)

Conventions Used in My Statistics Handouts

·  Text which appears in unshaded boxes can be treated as an aside. It may define a concept being used or provide a mathematical justification for something. Some of these relate to very important statistical ideas with which you SHOULD be familiar.

·  The names of SPSS menus will be written in shadow bold font (e.g., Analyze). The options on the menu will be called procedures and the name will be written in bold small capitals font (e.g., Descriptive Statistics). Procedures sometimes offer a family of related options, the name of the one selected will appear in italic small capitals font (e.g., Crosstabs). The resulting window will contain boxes to be filled or checked, and will offer buttons to access subwindows. Subwindow names will appear in italic font (e.g., Statistics). Subwindows also have boxes to be filled or checked. Thus the full path to the Crosstabs procedure will be written:

Analyze > Descriptive Statistics > Crosstabs

·  Questions to be completed will appear in shaded boxes at various points.

PART I – BASICS AND BACKGROUND

What Are Categorical Data?

Categorical or nominal data are those in which the classes within each variable do not have any meaningful numerical value. Common examples are: gender; presence or absence (of some sign, symptom, disease, or behaviour); ethnic group etc. It is sometimes useful to recode a numerical variable into a small number of categories, and sometimes this classification will retain ordinal information (e.g., carrying out a median-split on a personality scale to create high or low trait subgroups; or grouping subjects into broad age bands). The data can then be analysed using the techniques described below, but one has to be aware that these analyses usually have considerably reduced power, relative to using the full range of values on the variable.
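The following is a minimal sketch, outside the SPSS workflow used in this handout, of the kind of median split just described, written in Python using the pandas library; the trait scores and variable names are invented purely for illustration.

import pandas as pd

scores = pd.Series([12, 25, 31, 18, 40, 22, 27, 35])    # hypothetical trait scores
median = scores.median()
# Recode into a two-level categorical variable: 1 = low trait group, 2 = high trait group
trait_group = (scores > median).map({False: 1, True: 2})
print(trait_group.value_counts())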

Contingency Tables

One can carry out analyses on a single categorical variable to check whether the frequencies occurring for each level of that category are as predicted (check out SPSS Procedure: Analyze > Non-Parametric Tests > Chi-square). However, much more commonly, the basic data structure for categorical variables is the contingency (or classification) table, or crosstabulation, formed from the variables concerned. These are described as n-way contingency tables, where n refers to the number of variables involved. The contingency table documents the frequencies (i.e., counts) of data for each combination of the variables concerned. Hence contingency tables are also referred to as frequency tables. The small-scale example below is a 2-way table formed from the variables PDstatus (i.e., Parkinson’s Disease status: does the participant have Parkinson’s Disease?) and Smokehis (i.e., smoking history: has the participant smoked regularly at some point during their life?). Each variable in the table has two levels (yes vs. no).

                                   PDstatus
                          yes (=1)        no (=2)        Row Totals
Smokehis    yes (=1)      3  (5.73)      11  (8.27)          14
            no  (=2)      6  (3.27)       2  (4.73)           8
Column Totals                 9              13          Grand Total = 22

Table 1. Observed frequency counts of current Parkinson’s disease status by smoking history. Expected frequencies under an independence model are in parentheses.
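As an aside, for readers who work outside SPSS, the sketch below (Python with the pandas library) shows how the counts in Table 1 could be reproduced from hypothetical case-level data, with one row per subject and the same cell frequencies as above.

import pandas as pd

# Hypothetical case-level data matching Table 1: 3 smokers with PD, 11 smokers without PD,
# 6 non-smokers with PD, and 2 non-smokers without PD (codes: 1 = yes, 2 = no).
smokehis = [1] * 3 + [1] * 11 + [2] * 6 + [2] * 2
pdstatus = [1] * 3 + [2] * 11 + [1] * 6 + [2] * 2
data = pd.DataFrame({"Smokehis": smokehis, "PDstatus": pdstatus})

# Crosstabulate; margins=True adds the row, column, and grand totals.
print(pd.crosstab(data["Smokehis"], data["PDstatus"], margins=True))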

In SPSS, some of the analysis procedures reviewed below require that the categorical variables have numerical values, so it is recommended to always code categorical variables in this way. Remember that it is completely arbitrary how these values are assigned to the categories[1]. To aid memory, and to produce clearer analysis printouts, one should always use the variable labelling options for such variables and include verbal value labels for each numerically-coded categorical variable.
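A rough analogue of value labelling, outside SPSS, is sketched below in Python (pandas): keep the arbitrary numeric codes for analysis, but attach verbal labels whenever output is displayed. The codes and labels here are simply those of Table 1, applied to a few made-up cases.

import pandas as pd

# Numeric codes as recommended above (1 = yes, 2 = no), with verbal labels attached for display.
pdstatus = pd.Series([1, 2, 2, 1, 2], name="PDstatus")
pd_labels = {1: "PD: yes", 2: "PD: no"}
print(pdstatus.map(pd_labels).value_counts())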

The Names of the Techniques

Introductory stats courses familiarise students with a specific technique for analysing 2-way contingency tables: Pearson’s χ2 test. More general methods are required for analysing higher-order contingency tables (involving 3 or more variables), and some of these analytic methods are therefore described (e.g., in Tabachnick and Fidell) as multiway frequency table analyses (or MFA). However, there are a large number of alternative names for a family of closely-related procedures. Here are just some of the other names which one may come across (e.g., in the procedure names available in SPSS): multinomial (or binary) logistic regression; logit regression; logistic analysis; analysis of multinomial variance; and (hierarchical) loglinear modelling. As these names imply, the techniques are often comparable to the procedures used for handling numerical data, in particular multiple linear regression and analysis of variance (ANOVA). A reasonable simplification is that the categorical data analyses below can be thought of as extending the Pearson χ2 test to multiway tables. The techniques also share, with the χ2 test, the fact that the calculated statistics are tested for statistical significance against the χ2 distribution (just as multiple linear regression and ANOVA involve testing their statistics against the F distribution).

Association in Contingency Tables

There are many techniques in statistics for detecting an association between 2 variables. A significant association simply means that the values of one variable vary systematically (i.e., at a level greater than chance) with values of the other variable. The most well-known measures of association are probably the (various types of) correlation coefficient between two variables. Correlation coefficients can reveal the extent to which the score (or rank) of one variable is linearly related to the score (or rank) of another. In contingency tables, the values within each category have no intrinsic numerical value, but associations can still be detected. An association means that the distribution of frequencies across the levels of one category differs depending upon the particular level of another category. When there is no association between variables, they are described as being independent. Thus, independence in a 2-way table means that there is no association between the row and column variables.

It is a much-noted statistical fact that finding significant associations between variables, in itself, tells you nothing about the causal relationships at work. In association analyses one may therefore have no logical reason to treat variables as either dependent or independent. Sometimes research is entirely exploratory and, when significant associations are found, the search for a causal connection between the variables begins. In such research with categorical variables one might typically take a single sample of subjects and record the values on the variables of interest. For example, one might explore whether the political orientation reported by a subject (left; centre; right) was associated with the newspaper he or she reads. Neither variable here is obviously the dependent variable (DV), as the causal relationship could go in either direction. Very often, however, one has a causal model in mind. For example, one might be interested in whether a subject’s gender is associated with political orientation. Here, political orientation is the DV and a subject’s gender is the independent variable (IV), as it is not possible for political orientation to affect gender. The research here will usually adopt a different sampling scheme, by controlling the sampling of the subjects in terms of the IVs. In this example, two samples of subjects (males and females) would be tested, recording the DV (political orientation) in each sample. It would be typical to arrange for equal-sized samples of males and females in such research. In the example data of Table 1, we are interested in finding variables that predict whether a subject will develop Parkinson’s Disease (PD). Thus PD status is the DV and the IV (or predictor) is smoking history.

For contingency table data, the distinction between (exploratory) analyses, where all variables have a similar status, and analyses involving both IVs and DVs is important. We will see below that it affects the name of the analysis and the statistical procedure that one uses. In this handout, where the categorical analyses have both IVs and DVs, we will adopt the convention that the DV will be shown as the column variable.

It follows from the above that a pair of alternative hypotheses (H1 = independence; H2 = association) may be applied to a contingency table. In order to decide between these hypotheses, one can calculate a statistic that reflects the discrepancy between the actual frequencies obtained and the frequencies that would be expected under the independence model described above. If the discrepancies are within the limits of chance (i.e., the statistic is nonsignificant), then one cannot reject the hypothesis of independence. If the discrepancies are not within chance limits (i.e., the statistic is significant) then one can safely reject the independence hypothesis, which implies association between the variables in the table.

Estimating Expected Frequencies Under the Independence Model

We will use the data from Table 1 as an example. If the variables PDstatus and Smokehis are independent then the proportion of “Smokehis=yes” subjects with PD should be equal to the proportion of “Smokehis=no” subjects with PD, and both should be equal to the proportion of subjects who have PD overall. The overall proportion with PD is equal to 0.41 (i.e., 9/22). Therefore, the expected frequency with PD in the Smokehis=yes group should be 0.41 times the total number of subjects in the Smokehis=yes group; i.e., 0.41*14 (=5.73). This gives the expected frequency without PD in the Smokehis=yes group by subtraction (=14-5.73=8.27). Similar calculations give the expected frequencies for PD (0.41*8=3.27) and no PD (=8-3.27=4.73) in the Smokehis=no group. Another way to get the expected frequency for a cell in row R and column C is to multiply the row total for row R by the column total for column C and divide the result by the grand total for the whole table (e.g., [9*14]/22=5.73 for row 1 and column 1). This approach is easy to use with tables that have more than 2 variables (where the rows represent one variable, columns another, and separate subtables are used for other variables).
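The same arithmetic can be written very compactly outside SPSS. The sketch below (Python) applies the row-total × column-total / grand-total rule to the observed counts of Table 1; it is a bare illustration of the calculation rather than part of any SPSS procedure.

# Observed counts from Table 1 (rows: Smokehis yes/no; columns: PDstatus yes/no).
observed = [[3, 11],
            [6, 2]]

row_totals = [sum(row) for row in observed]               # 14 and 8
col_totals = [sum(col) for col in zip(*observed)]         # 9 and 13
grand_total = sum(row_totals)                             # 22

# Expected frequency for each cell = (row total * column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
print(expected)   # [[5.73, 8.27], [3.27, 4.73]] (to 2 decimal places)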

Testing Associations in 2-way Contingency Tables

There are several statistics that one can compute to test for association vs. independence in a 2-way contingency table (such statistics are thus sometimes referred to as “indices of association”). Three such statistics (described below) are worthy of attention, and two (G2; OR) are of particular significance for logistic regression analyses.

In this section we shall consider a 2-way table with R rows and C columns; this is therefore referred to as an RxC table. There are m cells in the table where m = R * C. The actual frequency in cell number i of the table is denoted by the symbol fi and the expected frequency (under the independence model) is denoted by ei.

(i)  Pearson’s χ2 statistic.

How to Compute using SPSS: Select the following procedure:

Analyze > Descriptive Statistics > Crosstabs

Click on the Statistics button to access the Statistics subwindow and then check the Chi-square box. (The Cells subwindow is also useful as it lets you display things other than just the actual frequencies in the contingency table.)

Key SPSS Output: Conducting a Pearson χ2 analysis on the data in Table 1 (which is available on the J drive as the dataset small parks data) produces the following SPSS output:

What Do These Results Mean?: The results of this analysis show that the χ2 test statistic was significantly greater than the value one would expect under independence between the variables (i.e., it exceeded the tabulated critical value of the χ2 distribution). Therefore, the independence model can be rejected and one concludes that the variables PDstatus and Smokehis are associated. If the p-value associated with the χ2 test statistic had been nonsignificant, then one would not be able to reject the hypothesis that PDstatus and Smokehis are independent.

Formula[2]:

χ2 = Σi=1 to m ([fi – ei]^2 / ei)

Degrees of Freedom (df):

df=(R-1)*(C-1)

Testing Significance: Under the independence model, the χ2 test statistic follows the χ2 distribution with the degrees of freedom given above. (Pearson could have been more helpful and given his statistic a name that differed from that of the distribution against which it is tested.) It has been shown that the χ2 distribution can be used to test the Pearson χ2 statistic as long as none of the expected frequencies is lower than 3.
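For readers who want to check the calculation outside SPSS, the sketch below (Python with the scipy library) computes the Pearson χ2 statistic for Table 1 directly from the formula, and then via scipy’s built-in routine; the statistic is referred to the χ2 distribution with (R-1)*(C-1) = 1 degree of freedom. This is an illustration of the arithmetic, not a substitute for the Crosstabs output described above.

from scipy.stats import chi2, chi2_contingency

observed = [[3, 11], [6, 2]]                  # Table 1 counts
expected = [[5.727, 8.273], [3.273, 4.727]]   # from the independence model above

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chisq = sum((f - e) ** 2 / e
            for obs_row, exp_row in zip(observed, expected)
            for f, e in zip(obs_row, exp_row))
dof = (2 - 1) * (2 - 1)                       # (R-1)*(C-1)
print(chisq, chi2.sf(chisq, dof))             # statistic and its p-value

# The same test via scipy; correction=False gives the uncorrected Pearson value.
stat, p, _dof, _exp = chi2_contingency(observed, correction=False)
print(stat, p)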

(ii)  Likelihood ratio statistic (usually abbreviated G2 or Y2).

“Computing using SPSS”, “Key SPSS Output”, “What Do These Results Mean?”, df, and “Testing Significance” are the same as for Pearson χ2. Because of the mathematical relationship between the two formulae, the values of the two statistics are, under many circumstances, approximately equal.

Formula[3]:

G2 = 2 * Σi=1 to m (fi * log[fi/ei])
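Again as an aside, the sketch below (Python) computes G2 for the Table 1 data, taking the logarithms in the formula to be natural logarithms (the usual convention for this statistic), and refers the result to the χ2 distribution with 1 df.

from math import log
from scipy.stats import chi2

observed = [3, 11, 6, 2]                   # Table 1 cell counts
expected = [5.727, 8.273, 3.273, 4.727]    # expected frequencies under independence

# Likelihood ratio statistic: 2 * sum of f * ln(f / e) over all cells.
g2 = 2 * sum(f * log(f / e) for f, e in zip(observed, expected))
print(g2, chi2.sf(g2, df=1))               # statistic and its p-value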

(iii)  Odds ratio (OR).

(Note: this applies only to a 2x2 table, or to a 2x2 comparison within a larger table.)

How to Compute Using SPSS: This is also available via SPSS Crosstabs. Follow the procedure for χ2 and G2 but now check the “Cochran’s and Mantel-Haenszel Statistics” box in the Statistics subwindow.

Key SPSS Output: Conducting an OR analysis for the data in Table 1 using SPSS Crosstabs gives the following additional output:

What Do These Results Mean?: The OR, like χ2 and G2, tests whether the independence hypothesis can be rejected. The significance test above reveals a p-value of 0.022, and so the row and column variables (smoking history and Parkinson’s Disease status) are not independent. An OR value of 1 corresponds to perfect independence; the OR can range from 0 to plus infinity. The value calculated for the Table 1 data lies well below one (=0.091; the p-value shows this to be significantly different from 1). This means that the odds of having Parkinson’s Disease (PD) if you are a smoker (row 1) are 0.091 times the odds of having PD if you are a nonsmoker. Smoking in these data is significantly protective with respect to PD. If one had predicted the direction of this relationship (based on the existing literature demonstrating a PD-protective effect for smoking), then one could justifiably make a one-tailed test: the p-value for the OR of 0.091 in this case would be 0.022/2 (=0.011).
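The arithmetic behind the OR is simple enough to verify by hand, as the sketch below (Python, outside SPSS) shows for the Table 1 data.

# Odds of PD among smokers (row 1) and among non-smokers (row 2), from Table 1.
odds_pd_smokers = 3 / 11
odds_pd_nonsmokers = 6 / 2
odds_ratio = odds_pd_smokers / odds_pd_nonsmokers
print(round(odds_ratio, 3))   # 0.091, matching the value reported above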

In the SPSS output, one sees that the natural logarithm of the odds ratio is also reported. For various reasons, this statistic is more useful within contingency tables than the raw OR. This use of log-transformed statistics explains why contingency table analyses are often described as logistic analyses, logistic regressions or loglinear modelling. The output also reports confidence intervals for the OR and log(OR) statistics. The next 2 boxes give a few basic facts about logarithms and confidence intervals.
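As a final aside, the sketch below (Python) computes the natural log of the OR for Table 1 and an approximate 95% confidence interval using the standard large-sample formula SE[log(OR)] = sqrt(1/a + 1/b + 1/c + 1/d), where a, b, c and d are the four cell counts. This simple Wald-type interval may differ slightly from the Mantel-Haenszel interval that SPSS reports, so it is shown purely to illustrate the logic of working on the log scale.

from math import log, exp, sqrt

a, b, c, d = 3, 11, 6, 2                      # Table 1 cell counts
log_or = log((a / b) / (c / d))               # natural log of the odds ratio
se = sqrt(1/a + 1/b + 1/c + 1/d)              # large-sample SE of log(OR)

# 95% CI on the log scale, then back-transformed to the OR scale.
ci_low, ci_high = exp(log_or - 1.96 * se), exp(log_or + 1.96 * se)
print(round(exp(log_or), 3), round(ci_low, 3), round(ci_high, 3))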