Statistics 475 Notes 21

Reading: Lohr, Chapter 8.4-8.5

I. Mechanisms for Nonresponse

Most surveys have some residual nonresponse even after careful design and follow-up of nonrespondents.

The methods for fixing up nonresponse are necessarily model-based. If we are to make any inferences about the nonrespondents, we must assume that they are related to the respondents in some way.

We consider whether or not unit i would respond if selected into the sample to be a random variable, denoted by $R_i$:

$R_i = 1$ if unit i would respond if selected into the sample, and $R_i = 0$ if not. The probability that unit i would respond if selected into the sample is

$\phi_i = P(R_i = 1)$.

Suppose that $y_i$ is a response of interest and that $x_i$ is a vector of information known about unit i in the sample (e.g., age, sex, race). Little and Rubin (1987, Statistical Analysis of Missing Data) defined three ways in which the nonresponse can be classified:

1. Missing Completely at Random: If the response probability $\phi_i$ does not depend on $x_i$ or $y_i$, the missing data are missing completely at random (MCAR). Such a situation occurs if, for example, someone at the laboratory drops a test tube containing the blood sample of one of the survey participants – there is no reason to think that the dropping of the test tube had anything to do with the white blood cell count. In the Current Population Survey, in which one response variable is employment ($y_i = 1$ if unit i is employed, 0 if not), the data are MCAR if the probability of nonresponse is completely unrelated to region of the United States, race, sex, age or any other variable measured for the sample and if the probability of nonresponse is unrelated to employment.

If the data are MCAR, nonrespondents are essentially selected at random from the sample and the respondents are a representative sample. If a simple random sample of size n is taken, then if the data are MCAR, the respondents will be a simple random subsample of variable size $n_R$. The sample mean of the respondents, $\bar{y}_R$, is approximately unbiased for the population mean. The MCAR mechanism is implicitly adopted when nonresponse is ignored.

2. Missing at Random Given Covariates, or Ignorable Nonresponse: If the response probability $\phi_i$ depends on $x_i$ but not on $y_i$, the data are missing at random (MAR); the nonresponse depends only on the observed variables. We can successfully model the nonresponse, since we know the value of $x_i$ for all sample units. For the Current Population Survey, the data would be missing at random if the probability of responding depended on region, race, sex and age (all known quantities) but not on the employment status within each region/age/race/sex class. This is sometimes termed ignorable nonresponse. Ignorable means that a model can explain the nonresponse mechanism and that the nonresponse can be ignored after the model accounts for it, not that the nonresponse can be completely ignored and complete data methods used.

3. Nonignorable Nonresponse: If the probability of nonresponse depends on the value of a response variable and cannot be completely explained by values of the $x_i$'s, then the nonresponse is nonignorable. In the Current Population Survey, the nonresponse would be nonignorable if the probability of responding depended on employment status even after taking into account region, age, race, sex and other known variables.
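The three mechanisms can be compared in a small simulation. The sketch below is illustrative only: the sample size, the two-class covariate, the employment rates and the response probabilities are all hypothetical values chosen for this example, not figures from the Current Population Survey. It compares the respondent mean of y with the full-sample mean, overall and within each covariate class.

```r
# Hypothetical simulation; every numeric value here is an assumption.
set.seed(475)
n <- 200000
x <- rbinom(n, 1, 0.5)                        # known covariate (two classes)
y <- rbinom(n, 1, ifelse(x == 1, 0.9, 0.7))   # response of interest (1 = employed)

# Response probabilities under the three mechanisms
phi.mcar <- rep(0.7, n)                       # same for every unit
phi.mar  <- ifelse(x == 1, 0.8, 0.5)          # depends on x only
phi.ni   <- ifelse(y == 1, 0.8, 0.5)          # depends on y itself

# Mean of y among respondents, overall and within each class of x
respondent.means <- function(phi) {
  r <- rbinom(n, 1, phi)                      # response indicator
  c(overall = mean(y[r == 1]),
    class0  = mean(y[r == 1 & x == 0]),
    class1  = mean(y[r == 1 & x == 1]))
}

full.sample <- c(overall = mean(y),
                 class0  = mean(y[x == 0]),
                 class1  = mean(y[x == 1]))

results <- rbind(full.sample  = full.sample,
                 MCAR         = respondent.means(phi.mcar),
                 MAR          = respondent.means(phi.mar),
                 nonignorable = respondent.means(phi.ni))
round(results, 3)
```

Under these assumptions the MCAR row should track the full-sample row closely. The MAR row should be off overall but close within each class, and the nonignorable row should be off even within classes. This is why adjusting for known covariates can help with MAR but not with nonignorable nonresponse.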

We can distinguish between MCAR and MAR by checking whether the observed response rates differ across subgroups defined by known covariates: if there is a significant relationship between response and a covariate, then the data are likely not MCAR.

Example: The data below come from a sample survey of college undergraduates taking an introductory statistics course. The variable to be analyzed here is the number of hours per week that the student devoted to study (outside of regular class time). Out of the 1500 students taking the course in an academic year, a simple random sample of 200 was selected. Of the 200, 135 answered this question about study hours. The gender of each of these 135 students is shown below (1=female, 2=male). Of the 200 selected in the sample, 130 were female and 70 were male.

hours=c(3,6.5,7,10,5,5,4,15,18,12,8,20,3,6,8,30,12,10,2,12,2,15,20,14,20,15,10,20,23,7,15,13,4,2,15,10,15,13,25,11,13,6,5,18,6.5,10,7,15,5,3,12,11,2,12,5.5,10,10,7,4,10,8,5,7,8,20,24,5,18,10,15,8,10,6,10,5,10,10,3,7,22,8,10,28,12,3,6,3,8,20,7,3,12,12,14,15,3.5,20,11,12,20,20,8,5,10,15,15,5,20,8,10,15,35,10,25,20,10,2,12,21,20,8,10,12,15,5,5,12,15,30,20,18,3,6,5,10);

gender=c(2,1,2,1,1,1,2,1,1,1,1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,2,1,2,1,1,1,2,1,2,1,2,1,1,1,1,1,1,1,1,1,2,2,1,1,1,1,1,2,2,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,2,2,1,2,1,2,1,1,1,2,1,2,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,1,1,1,1,2,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,1,1);

> sum(gender==1)
[1] 104
> sum(gender==2)
[1] 31

Thus, for the 200 students selected into the sample, we have the following cross-classification:

           Responded   Didn't Respond   Total
Female        104            26          130
Male           31            39           70
Total         135            65          200
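The response rates implied by this cross-classification can be computed directly. A minimal sketch in R follows; the object names `tab` and `rates` are illustrative and do not appear in the original notes.

```r
# Cross-classification of the 200 sampled students (counts from the table above)
tab <- matrix(c(104, 26,
                 31, 39),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender   = c("Female", "Male"),
                              response = c("Responded", "DidNotRespond")))

# Response rate within each gender and for the whole sample
rates <- tab[, "Responded"] / rowSums(tab)
overall.rate <- sum(tab[, "Responded"]) / sum(tab)

round(rates, 3)
overall.rate
```

The female response rate is 104/130 and the male response rate is 31/70, so the rates differ substantially between the two groups.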

If the data are MCAR, gender should be independent of responding. We can test this hypothesis with a chi-squared test:

# Gender of respondents

gender.respondents=c(rep(1,104),rep(2,31));

# Nonrespondents' gender

gender.nonrespondents=c(rep(1,26),rep(2,39));

# Gender of whole sample put together as a factor variable

gender.wholesample=as.factor(c(gender.respondents,gender.nonrespondents));

# Response indicator

response=c(rep(1,135),rep(0,65));

> chisq.test(response,gender.wholesample)

	Pearson's Chi-squared test with Yates' continuity correction

data:  response and gender.wholesample
X-squared = 24.8521, df = 1, p-value = 6.19e-07

There is strong evidence that the data are not MCAR (p-value < 0.0001). However, we cannot test whether the data are MAR or nonignorable, because that distinction depends on the unknown responses of the nonrespondents.
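As a check on the reported statistic, the Yates-corrected chi-squared value can be reproduced by hand from the 2x2 counts. This is a sketch; the names `obs`, `expected`, `x2` and `p.value` are illustrative and not part of the original notes.

```r
# Observed counts: rows = female, male; columns = responded, did not respond
obs <- matrix(c(104, 26,
                 31, 39),
              nrow = 2, byrow = TRUE)

# Expected counts under independence of gender and response
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic with Yates' continuity correction (subtract 0.5 from |O - E|)
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)

# Upper-tail p-value on 1 degree of freedom
p.value <- pchisq(x2, df = 1, lower.tail = FALSE)

c(X.squared = x2, p.value = p.value)

# Same test applied directly to the table of counts
chisq.test(obs)
```

Here every cell has |O - E| = 16.25, so the corrected statistic is 15.75^2 times the sum of the reciprocals of the expected counts, which agrees with the X-squared value reported above.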
