Analyzing Categorical Data

LECTURE 6 MAR 10, 2009

A. Introduction

Variables such as gender, sick or well, success or failure, and age group represent categories rather than numerical values. We use a number of statistical techniques such as tests of proportions and chi-square with variables of this type.

B. Questionnaire Design and Analysis

A common way to collect certain types of data is by using a questionnaire.

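DATA QUEST;
   /* Column input: each variable is read from fixed column positions */
   INPUT ID 1-3 AGE 4-5 GENDER $ 6 RACE $ 7 MARITAL $ 8 EDUC $ 9
         PRES 10 ARMS 11 CITIES 12;
DATALINES;
001091111232
002452222422
003351324442
004271111121
005682132333
006651243425
;
RUN;

/* Descriptive statistics for the one numeric variable, AGE */
PROC MEANS DATA=QUEST MAXDEC=2 N MEAN STD;
   TITLE 'QUESTIONNAIRE ANALYSIS';
   VAR AGE;
RUN;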

Notice that we have not left any spaces between the values for each variable. This is a common method of data entry since it saves space and extra typing. Because nothing separates one value from the next, we must specify the column location of each variable.

/* One-way frequency counts for each categorical variable */
PROC FREQ DATA=QUEST;
   TITLE 'FREQUENCY COUNTS FOR CATEGORICAL VARIABLE';
   TABLES GENDER RACE MARITAL EDUC PRES ARMS CITIES;
RUN;

C. Adding Variable Labels

We can improve considerably on the output. First, we have to refer back to our coding scheme to see the definition of each of the variable names. Some variable names like GENDER and RACE need no explanation; others like PRES and CITIES do. We can associate a variable label with each variable name by using a LABEL statement.

DATA QUEST;
   INPUT ID 1-3 AGE 4-5 GENDER $ 6 RACE $ 7 MARITAL $ 8 EDUC $ 9
         PRES 10 ARMS 11 CITIES 12;
   /* Labels are printed in place of the variable names on the output */
   LABEL MARITAL='MARITAL STATUS'
         EDUC='EDUCATION LEVEL'
         PRES='PRESIDENT DOING A GOOD JOB'
         ARMS='ARMS BUDGET INCREASE'
         CITIES='FEDERAL AID TO CITIES';
DATALINES;
001091111232
002452222422
003351324442
004271111121
005682132333
006651243425
;
RUN;

PROC MEANS DATA=QUEST MAXDEC=2 N MEAN STD;
   TITLE 'QUESTIONNAIRE ANALYSIS';
   VAR AGE;
RUN;

PROC FREQ DATA=QUEST;
   TITLE 'FREQUENCY COUNTS FOR CATEGORICAL VARIABLE';
   TABLES GENDER RACE MARITAL EDUC PRES ARMS CITIES;
RUN;

D. Adding “Value Labels” (Formats)

We would like to improve the readability of the output one step further. The remaining problem is that the coded values of our variables (1=male, 2=female, etc.) are printed on the output, not the names that we have assigned to these values. We would like the output to show the number of males and females, for example, not the number of 1’s and 2’s for the variable GENDER. We can supply the “value labels” in two steps.

The first step is to define our code values for each variable.

PROC FORMAT;
   /* Character formats (names start with $) for the character variables */
   VALUE $SEXFMT '1'='MALE' '2'='FEMALE';
   VALUE $RACE '1'='WHITE' '2'='AFRICAN AM.' '3'='HISPANIC' '4'='OTHER';
   VALUE $OSCAR '1'='SINGLE' '2'='MARRIED' '3'='WIDOWED' '4'='DIVORCED';
   VALUE $EDUC '1'='HIGH SCH OR LESS'
               '2'='TWO YR. COLLEGE'
               '3'='FOUR YR. COLLEGE'
               '4'='GRADUATE DEGREE';
   /* Numeric format shared by the three opinion items */
   VALUE LIKERT 1='STRONG DISAGREE'
                2='DISAGREE'
                3='NEUTRAL'
                4='AGREE'
                5='STR AGREE';
RUN;

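The second step is to associate a format with one or more variables by using a FORMAT statement. A FORMAT statement can be placed within a DATA step, where the association is stored with the data set, or within a PROC step, where it applies only to that procedure.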

FORMAT GENDER $SEXFMT. RACE $RACE. MARITAL $OSCAR. EDUC $EDUC. PRES ARMS CITIES LIKERT.;
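As an illustration of the second placement, the sketch below puts the FORMAT statement inside PROC FREQ rather than in the DATA step. It assumes the column layout and codes of the QUEST example; the data set name QUEST2 and the title are made up for this sketch, and only two of the formats are repeated to keep it short.

```sas
/* Sketch only: QUEST2 is a hypothetical data set name; the layout and
   codes are assumed to match the QUEST example above. */
DATA QUEST2;
   INPUT ID 1-3 AGE 4-5 GENDER $ 6 RACE $ 7 MARITAL $ 8 EDUC $ 9
         PRES 10 ARMS 11 CITIES 12;
DATALINES;
001091111232
002452222422
003351324442
;
RUN;

PROC FORMAT;
   VALUE $SEXFMT '1'='MALE' '2'='FEMALE';
   VALUE LIKERT 1='STRONG DISAGREE' 2='DISAGREE' 3='NEUTRAL'
                4='AGREE' 5='STR AGREE';
RUN;

PROC FREQ DATA=QUEST2;
   TITLE 'FORMATTED FREQUENCY COUNTS';
   TABLES GENDER PRES ARMS CITIES;
   /* The formats apply to this procedure only; QUEST2 itself is unchanged */
   FORMAT GENDER $SEXFMT. PRES ARMS CITIES LIKERT.;
RUN;
```

Because the FORMAT statement sits in the PROC step, a later procedure run on the same data set would print the raw codes again unless the statement is repeated there.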

E. Two-way Frequency Tables

Besides computing frequencies on individual variables, we may have occasion to count occurrences of one variable at each level of another variable. Suppose we took a poll of presidential preference and also recorded the gender of the respondent.

DATA ELECT;
   INPUT GENDER $ CANDID $;
DATALINES;
M BUSH
F KERRY
M KERRY
M BUSH
F KERRY
(More data)
;
RUN;

PROC FREQ DATA=ELECT;
   TABLES GENDER CANDID CANDID*GENDER;
RUN;

Notice that the first two tables are one-way frequency tables; the TABLES specification CANDID*GENDER requests a two-way table, with CANDID forming the rows and GENDER the columns.

F. Computing Chi-square from Frequency Counts

When you already have a contingency table and want to use SAS software to compute a chi-square statistic, the WEIGHT statement makes this possible. It tells PROC FREQ that each data line represents COUNT observations rather than one.

DATA CHISQ;
   INPUT GROUP $ OUTCOME $ COUNT;
DATALINES;
CONTROL DEAD 20
CONTROL ALIVE 80
DRUG DEAD 10
DRUG ALIVE 90
;
RUN;

PROC FREQ DATA=CHISQ;
   TABLES GROUP*OUTCOME / CHISQ;
   /* Each data line stands for COUNT subjects */
   WEIGHT COUNT;
RUN;
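As a rough hand check on the output, the uncorrected Pearson chi-square for a 2x2 table can be worked out directly from the four counts. The cell labels a, b, c, d are introduced here for illustration only (a=20 and b=80 for the control row, c=10 and d=90 for the drug row), with N=200.

```latex
\[
\chi^2 = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
       = \frac{200\,(20\cdot 90 - 80\cdot 10)^2}{100\cdot 100\cdot 30\cdot 170}
       = \frac{200{,}000{,}000}{51{,}000{,}000}
       \approx 3.92
\]
```

With 1 degree of freedom this is slightly above the 0.05 critical value of 3.84.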

G. McNemar’s Test for Paired Data

Suppose you want to determine the effect of an anti-cigarette advertisement on people’s attitudes towards smoking. In this hypothetical example, we ask 100 people their attitude towards smoking (either positive or negative). We then show them the anti-cigarette advertisement and again ask their smoking attitude. This experimental design is called a paired or matched design since the same subjects are responding to a question under two different conditions (before and after an advertisement).

DATA MCNEMAR;
   INPUT ATTITUDE_BEFORE $ ATTITUDE_AFTER $ COUNT;
DATALINES;
POSITIVE NEGATIVE 30
POSITIVE POSITIVE 23
NEGATIVE POSITIVE 15
NEGATIVE NEGATIVE 32
;
RUN;

PROC FREQ DATA=MCNEMAR;
   /* AGREE requests McNemar's test for the paired responses */
   TABLES ATTITUDE_BEFORE*ATTITUDE_AFTER / AGREE;
   WEIGHT COUNT;
RUN;
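McNemar’s statistic depends only on the subjects who changed their attitude. As a hand check, assuming the uncorrected form of the statistic and writing n_{PN}=30 for positive-to-negative changes and n_{NP}=15 for negative-to-positive changes (symbols introduced here for illustration):

```latex
\[
\chi^2_{\text{McNemar}} = \frac{(n_{PN} - n_{NP})^2}{n_{PN} + n_{NP}}
                        = \frac{(30 - 15)^2}{30 + 15}
                        = \frac{225}{45}
                        = 5.00
\]
```

With 1 degree of freedom this exceeds the 0.05 critical value of 3.84. The 55 subjects whose attitude did not change do not enter the calculation.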

H. Odds Ratios

Suppose we want to determine if people with a rare brain tumor are more likely to have been exposed to benzene than people without a brain tumor. One experimental design used to answer this question is called a case-control design. As the name implies, you start with cases (people with the disease or condition, here a brain tumor) and then find people who are as similar as possible but who do not have brain tumors. Those people are the controls.

DATA ODDS;
   INPUT OUTCOME $ EXPOSURE $ COUNT;
DATALINES;
CASE 1-YES 50
CASE 2-NO 100
CONTROL 1-YES 20
CONTROL 2-NO 130
;
RUN;

PROC FREQ DATA=ODDS;
   TITLE 'ODDS RATIO COMPUTING';
   /* The 1- and 2- prefixes put the exposed group in the first row */
   TABLES EXPOSURE*OUTCOME / CHISQ CMH;
   WEIGHT COUNT;
RUN;
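The odds ratio for this table can be checked by hand. With exposure as rows (1-YES, 2-NO) and outcome as columns (CASE, CONTROL), it is the odds of exposure among the cases divided by the odds of exposure among the controls:

```latex
\[
\text{OR} = \frac{50 / 100}{20 / 130}
          = \frac{50 \cdot 130}{100 \cdot 20}
          = \frac{6500}{2000}
          = 3.25
\]
```

On these counts, the odds of benzene exposure are about 3.25 times higher among cases than among controls.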

I. Relative Risk

Suppose we conducted a prospective cohort study to investigate the effect of aspirin on heart attacks. A group of patients who are at risk for a heart attack are randomly assigned to either a placebo or aspirin. At the end of one year, the number of patients suffering a heart attack is recorded.

DATA RR;
   /* Default length for list input is 8; the group names need 9 */
   LENGTH GROUP $ 9;
   INPUT GROUP $ OUTCOME $ COUNT;
DATALINES;
1-PLACEBO MI 20
1-PLACEBO NO-MI 80
2-ASPIRIN MI 15
2-ASPIRIN NO-MI 135
;
RUN;

PROC FREQ DATA=RR;
   TITLE 'RELATIVE RISK COMPUTING';
   TABLES GROUP*OUTCOME / CMH;
   WEIGHT COUNT;
RUN;

The Col1 relative risk is how much more or less likely you are to be in the column 1 category (in this case, MI) if you are in the row 1 group (in this case, placebo). The computed relative risk is 2.00, with a 95% confidence interval of 1.087 to 3.68.
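The point estimate can be reproduced by hand from the counts in the data lines: the risk of MI in each group is the number of MIs divided by the group size, and the relative risk is the ratio of the two risks.

```latex
\[
\text{RR} = \frac{20 / (20 + 80)}{15 / (15 + 135)}
          = \frac{0.20}{0.10}
          = 2.00
\]
```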

J. Chi-square Test for Trend

If the categories in a 2xN table represent ordinal levels, you may want to compute what is called a chi-square test for trend. That is, are the proportions in each of the N levels increasing or decreasing in a linear fashion?

DATA TREND;
   /* The trailing @@ reads several observations from each data line */
   INPUT RESULT $ GROUP $ COUNT @@;
DATALINES;
FAIL A 10 FAIL B 15 FAIL C 14 FAIL D 25
PASS A 90 PASS B 85 PASS C 86 PASS D 75
;
RUN;

PROC FREQ DATA=TREND;
   TITLE 'TREND TEST';
   TABLES RESULT*GROUP / CHISQ;
   WEIGHT COUNT;
RUN;
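With the CHISQ option, the Mantel-Haenszel chi-square line of the output (1 degree of freedom) is usually the one read as the test for linear trend. As a possible alternative, not part of the original program, PROC FREQ also has a TREND option that requests the closely related Cochran-Armitage test. The sketch below assumes the same counts; the data set name TREND2 and the title are made up for illustration.

```sas
/* Sketch only: TREND2 is a hypothetical data set name holding the same
   counts as the TREND example. */
DATA TREND2;
   INPUT RESULT $ GROUP $ COUNT @@;
DATALINES;
FAIL A 10 FAIL B 15 FAIL C 14 FAIL D 25
PASS A 90 PASS B 85 PASS C 86 PASS D 75
;
RUN;

PROC FREQ DATA=TREND2;
   TITLE 'COCHRAN-ARMITAGE TREND TEST';
   /* TREND requires a 2xC or Rx2 table; CHISQ is kept for comparison */
   TABLES RESULT*GROUP / CHISQ TREND;
   WEIGHT COUNT;
RUN;
```

Both tests treat the groups A through D as equally spaced ordered levels, so they are only meaningful when that ordering makes sense.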

K. Mantel-Haenszel Chi-square for Stratified Tables and Meta Analysis

You may have a series of 2x2 tables, one for each level of another factor. This may be a confounding factor such as age, or you may have a 2x2 table at each site in a multisite study. In any event, one way to analyze multiple 2x2 tables of this sort is to compute a Mantel-Haenszel chi-square. This technique is sometimes referred to as meta-analysis and is becoming quite popular in medicine, education, and psychology. The Mantel-Haenszel statistic is also used frequently for item bias research.

DATA ABILITY;
   INPUT GENDER $ RESULTS $ SLEEP $ COUNT;
DATALINES;
BOYS FAIL 1-LOW 20
BOYS FAIL 2-HIGH 15
BOYS PASS 1-LOW 100
BOYS PASS 2-HIGH 150
GIRLS FAIL 1-LOW 30
GIRLS FAIL 2-HIGH 25
GIRLS PASS 1-LOW 100
GIRLS PASS 2-HIGH 200
;
RUN;

PROC FREQ DATA=ABILITY;
   TITLE 'MANTEL-HAENSZEL CHISQ';
   /* GENDER is the stratum; each stratum is a SLEEP by RESULTS table */
   TABLES GENDER*SLEEP*RESULTS / ALL;
   WEIGHT COUNT;
RUN;

Notice that the Breslow-Day test for homogeneity of the odds ratio is not significant (p=.698), so we can be comfortable in combining these two tables.
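To see what is being combined, the stratum odds ratios and the Mantel-Haenszel pooled estimate can be worked out by hand from the data lines. This sketch takes SLEEP as rows (1-LOW, 2-HIGH) and RESULTS as columns (FAIL, PASS), and uses a, b, c, d for the cells of each 2x2 table and n for its total; these symbols are introduced here for illustration.

```latex
\[
\text{OR}_{\text{boys}} = \frac{20 \cdot 150}{100 \cdot 15} = 2.0,
\qquad
\text{OR}_{\text{girls}} = \frac{30 \cdot 200}{100 \cdot 25} = 2.4
\]
\[
\text{OR}_{MH} = \frac{\sum_k a_k d_k / n_k}{\sum_k b_k c_k / n_k}
               = \frac{3000/285 + 6000/355}{1500/285 + 2500/355}
               \approx \frac{27.43}{12.31}
               \approx 2.23
\]
```

The two stratum odds ratios are close to each other, which is consistent with the nonsignificant Breslow-Day test, and the pooled value falls between them.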