Workshop on Working with Missing Values: Supplement to PowerPoint Presentation*

Alan C. Acock

Oregon State University

This document and selected references, data, and programs are available at

Note to Readers

These are lecture notes for a presentation. They are not a self-contained, systematic treatment of the topic, are not intended for publication, and have not been carefully edited for publication purposes. Instead, these notes are intended to complement a one-day workshop presentation; the workshop will expand and clarify many of the points presented in this document. The intention of this document is to help workshop participants follow the presentation. The notes are much more detailed than a typical set of PowerPoint slides, but much less detailed than a self-contained treatment of the topics. Others may find these notes useful, but they are not a substitute for participation in the workshop.

Working with Missing Values

Types of Missing Values

Missing by definition of domain.

A survey participant is excluded from your analysis because they are not in the domain you are investigating.

  • If you are comparing the social networks of married women to unmarried, lesbian women, you would drop all men and unmarried women who are not lesbians because they do not fall into the domain of your study. This is not a problem.
  • An investigator needs to eliminate these people from the survey before talking about any type of missing values or attrition.
  • Note total sample size, then state number of participants who fit the definition of the study population.

Most surveys have several codes for missing values to distinguish among participants who refused to answer,

  • Those who answered that they don’t know,
  • Those who were valid skips, and
  • Those who were skipped by interviewer error.

If a researcher is imputing values, participants who have missing values but are not in the domain being investigated should not have their values imputed. For example:

  • Valid skips usually should not be imputed, but
  • A handful of people who were skipped by interviewer error should be imputed.
  • It simply makes no sense to impute the age of first menstruation for men!
  • The distinctions between types of missing values are lost when a dataset uses only a single code (-9, dot, etc.) regardless of the reason the value is missing, so care in setting up the coding is critical.

Imputing values for people who respond that they “don’t know” is especially challenging.

  • If asked to rate your marital satisfaction on a very satisfied to very dissatisfied scale, one researcher may say that “don’t know” is half way between satisfied and dissatisfied and assign a corresponding value.
  • A participant’s views may be bipolar, sometimes being extremely satisfied and other times being extremely dissatisfied. Giving them a score that is halfway between satisfied and dissatisfied, or imputing a value for them, may not make sense from this perspective; the scale itself does not make sense to them (e.g., family solidarity).
  • Another time the “don’t know” option is problematic is when answering the question requires special knowledge.
  • If a person in the U.S. were asked to rate the average marital satisfaction of women in Ukraine, the person may answer that they “don’t know” because they are not even sure where Ukraine is, much less know anything about gender roles and marital satisfaction there.
  • This does not mean they are half way between high and low; it does mean the scale is not meaningful to them. Imputing a value for them would be inappropriate.

Defining the domain, eliminating people who do not fit this domain, and assigning codes to missing values need to be done with great care. Your decisions need to be clear to the reader, and far too few papers are clear about how they do this.

Attrition Analysis

A quick review of major family journals indicates that many authors have done little or nothing in the way of attrition analysis. Once a dataset is reduced to those participants who should have data, some analysis of attrition due to missing values is necessary.

  • If a traditional solution is used such as listwise deletion (drop any participant who has a missing value on any item in your analysis), then those participants who have missing values should be compared to those that you choose to analyze.
  • This can be done by simple chi-square or t-tests of variables on which you have information.
  • For example, if a panel study has four waves and some people were missing for one or two of the waves then you should compare them to the participants you analyze for the waves on which you have data.
  • Is the percentage minority higher in the group you dropped using listwise deletion?
  • Are those you analyzed more educated?
  • Are women overrepresented among those you analyzed?
  • Attrition analysis alerts readers to potential biases in your analysis and the limits to the value your research has for generalizing.
  • If there are few statistically or substantively significant differences between those you drop and those you analyze, then this reassures the reader of the strength of your analysis.
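The group comparisons described above can be sketched with a short example. The education scores below are made up for illustration (they come from no real survey), and Welch's two-sample t statistic is computed by hand:

```python
from statistics import mean, variance

# Hypothetical years-of-education for two groups: cases retained after
# listwise deletion vs. cases dropped because of missing values.
kept = [12, 14, 16, 12, 13, 15, 16, 14, 12, 13]
dropped = [10, 12, 11, 9, 12, 10, 11, 12]

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

t = welch_t(kept, dropped)
print(f"mean kept = {mean(kept):.2f}, mean dropped = {mean(dropped):.2f}, t = {t:.2f}")
```

A large |t| here would warn the reader that the analyzed cases are more educated than the dropped cases, which limits how far the results generalize.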

MCAR, MAR, MNAR

Here is what we are talking about

Table 1

Patterns of Missing Values

Matrix D
(data with no missing values)
Y / X1 / X2 / X3 / X4
5 / 6 / 4 / 3 / 1
4 / 3 / 2 / 3 / 2
3 / 1 / 2 / 1 / 3
3 / 2 / 1 / 4 / 5
1 / 5 / 4 / 5 / 4
Matrix D’
(data with missing values) / Matrix M
(missing pattern)
Y / X1 / X2 / X3 / X4 / My / M1 / M2 / M3 / M4
5 / . / 4 / 3 / 1 / 0 / 1 / 0 / 0 / 0
4 / . / . / . / 2 / 0 / 1 / 1 / 1 / 0
. / 1 / 2 / 1 / . / 1 / 0 / 0 / 0 / 1
3 / 2 / 1 / 4 / 5 / 0 / 0 / 0 / 0 / 0
1 / 5 / 4 / . / . / 0 / 0 / 0 / 1 / 1

There are many explanations for patterns of missing values. The procedures we discuss here are appropriate for MCAR and MAR. We will mention procedures appropriate for MNAR.

MCAR is widely used in the statistical literature for data that is missing completely at random. This is rarely a realistic assumption. For example,

  • We know that men are more likely to have missing values than women. If this is true for your data, then the values that are missing are systematically related to gender and we could not say they were missing completely at random.
  • If you give each participant a random sample of 75% of the items on a questionnaire (each participant may answer a different subset of items this way), the values that are missing would be missing completely at random (MCAR).

MAR is the minimum condition, and it has a meaning quite different from what it sounds like. It should not be confused with MCAR. If your dataset includes variables that “explain” the patterns of missingness, then the values that remain missing, controlling for these variables, are missing at random. Suppose you know that missingness (not answering items about income) is related to gender, race, education, income, marital status, and occupational category.

  • You control for gender, race, education, marital status, and occupational category.
  • After doing this, the missing values on income are not related to income—there is no residual relationship to income after you control for the other “mechanisms” of missingness.
  • This would be missing at random (MAR).

MNAR means that the missing values are not missing at random. Suppose your study does not include education, race, gender, or occupational status, as in the last example. Without controlling for these, it is likely that those with missing values on income are either poor or rich. This would mean that not reporting income is related to income itself, and therefore the missingness would not be ignorable.

  • MNAR happens when your study does not include variables that “explain” patterns of missingness.
  • Consider a panel study of math test scores in grades 10-12. Those who are missing will likely overrepresent students who do very poorly at math and decide to drop out of school. Math scores would seem to get better each year because the students who are poor at math have a higher attrition rate.

MCAR says that R (pattern of missingness) is not related to X or Y, but only to some random process Z.

MAR says that R (pattern of missingness) is related to X (other variables in model or included in imputations) in part and to some random process Z in part. R may be correlated with Y, but there is no partial relationship controlling for X and Z.

MNAR says that R (pattern of missingness) is related to X (other variables in model or included in imputations) in part and to some random process Z in part, but is also still related to Y controlling for X and Z.

These can be illustrated with data from Schafer and Graham, Table 1

Blood Pressure Measurements in January (X) and February (Y) with Missing Values Imposed in Three Different Ways

January Measure (X) / February Measure (Y) / MCAR for February / MAR for February / MNAR for February
169 / 148 / 148 / 148 / 148
126 / 123 / - / - / -
132 / 149 / - / - / 149
160 / 169 / - / 169 / 169
105 / 138 / - / - / -
116 / 102 / - / - / -
125 / 88 / - / - / -
112 / 100 / - / - / -
133 / 150 / - / - / 150
94 / 113 / - / - / -
109 / 96 / - / - / -
109 / 78 / - / - / -
106 / 148 / - / - / 148
176 / 137 / - / 137 / -
128 / 155 / - / - / 155
131 / 131 / - / - / -
130 / 101 / 101 / - / -
145 / 155 / - / 155 / 155
136 / 140 / - / - / -
146 / 134 / - / 134 / -
111 / 129 / - / - / -
97 / 85 / 85 / - / -
134 / 124 / 124 / - / -
153 / 112 / - / 112 / -
118 / 118 / - / - / -
137 / 122 / 122 / - / -
101 / 119 / - / - / -
103 / 106 / 106 / - / -
78 / 74 / 74 / - / -
151 / 113 / - / 113 / -
M = 125.7 / M = 121.9 / M = 108.6 / M = 138.3 / M = 153.4
Sd = 23.0 / Sd = 24.7 / Sd = 25.1 / Sd = 21.1 / Sd = 7.5

MCAR is a random sample of 7 observations, so missingness is a completely random process.

MAR keeps the time-two score only for people who scored over 140 in January (X). R (the pattern of missingness) is partly explained by X, but there is no relationship between Y and R that is not “explained” by X.

MNAR keeps the measurement on Y only for people who have high scores on Y. Y itself “explains” R (the pattern of missingness).
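A small simulation can mimic the three mechanisms. The normal distributions, sample size, and the 140-point cutoff below are illustrative stand-ins, not Schafer and Graham's actual data:

```python
import random
from statistics import mean

random.seed(1)

# Simulate correlated January (x) and February (y) measurements.
n = 1000
x = [random.gauss(125, 25) for _ in range(n)]
y = [0.6 * xi + 50 + random.gauss(0, 15) for xi in x]

mcar = [yi for yi in y if random.random() < 0.25]   # keep a random 25% of Y
mar  = [yi for xi, yi in zip(x, y) if xi > 140]     # keep Y only when X > 140
mnar = [yi for yi in y if yi > 140]                 # keep Y only when Y > 140

print(f"full Y mean {mean(y):.1f}  MCAR {mean(mcar):.1f}  "
      f"MAR {mean(mar):.1f}  MNAR {mean(mnar):.1f}")
```

As in the table above, the MCAR mean stays near the full-data mean (apart from sampling noise), while the MAR and MNAR means are pulled upward because whether Y is observed depends on X (or on Y itself).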

Traditional Approaches

Listwise Deletion

Listwise deletion is the default in virtually all packages. It has the following problems:

  1. Reduces power because you are throwing away data; this inflates standard errors.
  2. If missing values are not missing completely at random, the parameter estimates will have a systematic bias (e.g., more minorities and men will be missing than white women). If race or gender is related to the outcome variable, the exclusion of minorities and men who have missing values will bias the estimates.
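Listwise deletion can be sketched in a few lines; the rows below are hypothetical, with None marking a missing value:

```python
# Listwise deletion sketch: drop any observation with a missing value.
rows = [
    {"y": 5, "x1": None, "x2": 4},
    {"y": 4, "x1": 3, "x2": 2},
    {"y": None, "x1": 1, "x2": 2},
    {"y": 3, "x1": 2, "x2": 1},
]

# Keep only observations with no missing value on any variable.
complete = [r for r in rows if None not in r.values()]
print(f"kept {len(complete)} of {len(rows)} rows")
```

Half the sample is discarded even though each dropped row is missing only a single value, which is exactly the power loss (and, when missingness is systematic, the bias) described above.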

Pairwise Deletion

Pairwise deletion includes everybody who answers both items in a pair to estimate the covariance for that pair. It uses everybody who answers an item to estimate the variance of the item. It then puts these variances and covariances together in a variance-covariance matrix and analyzes this matrix. Problems:

  1. Each covariance is based on a different subsample so what population is being represented is unclear.
  2. Since the covariance matrix does not represent a single population, it may not be possible to invert it and the program may fail.
  3. It is unclear what degrees of freedom are appropriate. Parts of the model use more information than other parts of the model.
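A sketch of pairwise deletion with hypothetical data: each covariance uses every case observed on both items, so each entry of the matrix rests on a different subsample.

```python
# None marks a missing value.
data = {
    "y":  [5, 4, None, 3, 1],
    "x1": [None, None, 1, 2, 5],
    "x2": [4, None, 2, 1, 4],
}

def pairwise_cov(a, b):
    """Covariance over cases observed on both variables; returns (cov, n)."""
    pairs = [(ai, bi) for ai, bi in zip(a, b) if ai is not None and bi is not None]
    n = len(pairs)
    ma = sum(ai for ai, _ in pairs) / n
    mb = sum(bi for _, bi in pairs) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in pairs) / (n - 1)
    return cov, n

for u, v in [("y", "x1"), ("y", "x2"), ("x1", "x2")]:
    cov, n = pairwise_cov(data[u], data[v])
    print(f"cov({u},{v}) = {cov:.2f}, n = {n}")
```

Here the covariances rest on n = 2 or n = 3 cases, which illustrates why the assembled matrix may not describe any single population and may not be invertible.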

Mean Substitution

Mean substitution replaces each missing value with the mean of that variable.

  1. This greatly reduces the variance of a predictor, which weakens explanatory power.
  2. It will attenuate parameter estimates in the bivariate case, but with several predictors it may make some bigger than they should be and others smaller.
  3. The mean is often a terrible choice for a missing value. Those missing income often have very low or very high incomes, but those with average incomes usually report it.
  4. This keeps the full N and degrees of freedom, but the imputed cases contribute no variance because they all receive the mean value.

Substituting a mean for a subgroup will mitigate these problems, but only to a degree. For example, divide a sample of women by marital status and replace missing values on income with the mean income for each marital status.
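The variance shrinkage can be shown directly. This sketch uses hypothetical incomes, with None marking a missing value:

```python
from statistics import mean, pvariance

# Mean substitution: replace each missing income with the observed mean.
income = [20, 35, None, 50, None, 90, 41, None, 62, 55]
observed = [v for v in income if v is not None]
m = mean(observed)
imputed = [v if v is not None else m for v in income]

print(f"variance of observed values = {pvariance(observed):.1f}")
print(f"variance after mean substitution = {pvariance(imputed):.1f}")
```

The imputed cases sit exactly at the mean, so they add squared deviations of zero; the variable's spread, and with it its explanatory power, is understated.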

Dummy Variable with Mean Substitution

Mean substitution replaces each missing value with the mean of that variable, and a dummy variable is added, coded 1 if the variable is missing and 0 if it is not. This keeps all the cases, but it is misleading. It will produce the same parameter estimates as listwise deletion, and the B’s for the dummy variables will show how much those missing a value on a variable deviate from the mean of those who have no missing values.

  1. The parameter estimates are still potentially biased.

  2. The degrees of freedom (cases) are exaggerated.

Regression Imputation

A multiple regression is estimated to predict each variable in the model, and these equations are then used to impute missing values.

  1. This approach does nothing about the uncertainty of the imputation process. If the R2 = .90, the imputed values might be pretty good; if the R2 = .10, the imputed missing values will not be very good.
  2. The predicted values are a function of the variables already in the model and hence these values are not really contributing an independent effect for the variables involving imputations.
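Regression imputation can be sketched for one variable. The data here are toy values, and the fit is an ordinary least squares line computed by hand on the complete cases:

```python
from statistics import mean

# Toy data: impute missing y from x using OLS fit on complete cases.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, None, 8.3, None, 12.2]

cc = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = mean(xi for xi, _ in cc)
my = mean(yi for _, yi in cc)
b = sum((xi - mx) * (yi - my) for xi, yi in cc) / sum((xi - mx) ** 2 for xi, _ in cc)
a = my - b * mx

# Replace each missing y with its bare predicted value.
y_imp = [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]
print([round(v, 2) for v in y_imp])
```

The imputed values fall exactly on the fitted line, so they carry none of the uncertainty of the imputation and contribute no information beyond the predictors already in the model.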

Transitional Approach: Single Imputation Using the EM Algorithm

Discuss single imputation and how uncertainty is introduced.

  1. This approach is better than the previous approaches and yields unbiased parameter estimates.
  2. This approach still has biased standard errors.

Readers should know that in an article in The American Statistician, von Hippel (2004) is highly critical of the way SPSS implements EM in the MVA module. He states:

The final method, expectation maximization (EM), produces asymptotically unbiased estimates, but EM’s implementation in MVA is limited to point estimates (without standard errors) of means, variances, and covariances. MVA can also impute values using the EM algorithm, but values are imputed without residual variation, so analyses that use the imputed values can be biased (von Hippel, 2004, p. 160).

von Hippel acknowledges that although SPSS does not add the residual variation appropriately, it makes an adjustment later in the process. If a researcher chooses to do single imputation, there are freeware programs available that may be superior to SPSS, although not as user friendly. An example is Graham’s program EMCOV available at

However, even Graham recommends that users should use multiple imputation when it is appropriate.
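The missing "residual variation" in the quotation can be made concrete. This generic sketch (toy data; not SPSS's or EMCOV's actual algorithm) adds a random draw from the estimated residual distribution to each imputed value, rather than using the bare prediction:

```python
import random
from statistics import mean

random.seed(7)

# Toy data: fit y = a + b*x on complete cases, as in regression imputation.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.2, 3.7, None, 8.5, 9.6, None, 14.1, 16.3]

cc = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = mean(xi for xi, _ in cc)
my = mean(yi for _, yi in cc)
b = sum((xi - mx) * (yi - my) for xi, yi in cc) / sum((xi - mx) ** 2 for xi, _ in cc)
a = my - b * mx

# Estimate the residual standard deviation from the complete cases.
resid = [yi - (a + b * xi) for xi, yi in cc]
sd = (sum(r * r for r in resid) / (len(resid) - 2)) ** 0.5

# Impute the prediction PLUS a random residual draw, restoring the
# scatter that a bare predicted value omits.
y_imp = [yi if yi is not None else a + b * xi + random.gauss(0, sd)
         for xi, yi in zip(x, y)]
print([round(v, 2) for v in y_imp])
```

Adding the residual draw keeps the imputed variable's variance honest; multiple imputation goes one step further by repeating such draws across several imputed datasets.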

Data Used for this Presentation

The dataset for this workshop is nlsy97missing.dta, a subset of the NLSY97 dataset. Here is a condensed codebook:

. codebook, compact

Variable Obs Unique Mean Min Max Label

------

pubid 8984 8984 4504.302 1 9022 youth public id code

sampwt97 8984 3920 215699.6 32330 1575942 round 1 sampling weight 1997

age97 8984 7 14.35363 12 18 age at interview date 1997

gender97 8984 2 1.48809 1 2 youth gender 1997

hhsize97 8984 16 4.548976 1 16 household size 1997

hhin97 6588 2242 46361.7 -48100 246474 household income 1997

dinner97 5356 8 5.07823 0 7 # days/wk dinner w/family 1997

fun97 5356 8 2.710045 0 7 # days/wk fun as a family 1997

psmoke97 8871 5 2.611656 1 5 % peers smoke 1997

pdrink97 8799 5 2.136152 1 5 % peers drunk 1+/month 1997

psport97 8943 5 3.688695 1 5 % peers sports, clubs 1997

pgang97 8812 5 1.594757 1 5 % peers belong to gang 1997

pcoll97 8866 5 3.568915 1 5 % peers plan college 1997

pvol97 8838 5 2.09131 1 5 % peers volunteer 1997

pdrug97 8758 5 2.307376 1 5 % peers use illegal drugs 1997

pcut97 8920 5 2.408184 1 5 % peers cut class/school 1997

hwwdy97 4717 5 3.830189 1 5 # weekdays do homework 1997

hwwenh97 4720 18 .828178 0 90 weekend hours do homework 1997

smday97 3497 31 6.942808 0 30 # days smoke last 30 days 1997

drday97 3819 27 1.834512 0 30 # days drank alc-30 days 1997

maday97 1785 31 4.0493 0 30 # days used marij-30 days 1997

------

We will focus on two software packages, Norm and Stata. Norm is a freeware program available at:

Working with Missing Items in a Scale

Stata

The spost commands discussed in the Workshop on Categorical and Count Dependent Variables include a command called misschk. This is good to run when first creating a scale. Suppose we want to assess negative peer influence. The NLSY97 has 8 items on which students, ages 12-18, rated the percentage of their schoolmates who did various things. A score of 1 reflects 0-19%, a 2 reflects 20-39%, a 3 reflects 40-59%, a 4 reflects 60-79%, and a 5 reflects 80-100%.

Here is the command (not available from a menu)

misschk psmoke97-pcut97, gen(m_) dummy help

misschk: the name of the command

psmoke97-pcut97: the variables to examine

gen(m_): generates a variable counting how many items each observation is missing

dummy: generates a dummy variable for whether each item is missing or not for each observation

help: prints out the names of the new variables and what they are

. misschk psmoke97-pcut97, gen(m_) dummy help

Variables examined for missing values

# Variable # Missing % Missing

------

1 psmoke97 113 1.3

2 pdrink97 185 2.1

3 psport97 41 0.5

4 pgang97 172 1.9

5 pcoll97 118 1.3

6 pvol97 146 1.6

7 pdrug97 226 2.5

8 pcut97 64 0.7

The columns in the table below correspond to the # in the table above.

If a column is blank, there were no missing cases for that variable.

Missing for |

which |

variables? | Freq. Percent Cum.

------+------

12345 678 | 17 0.19 0.19

123_5 678 | 1 0.01 0.20

123__ ___ | 1 0.01 0.21

12_45 678 | 1 0.01 0.22

12_45 67_ | 4 0.04 0.27

12_45 6__ | 1 0.01 0.28

12_45 _7_ | 1 0.01 0.29

12_4_ 678 | 1 0.01 0.30

12_4_ 67_ | 2 0.02 0.32

12_4_ _78 | 2 0.02 0.35

12_4_ _7_ | 9 0.10 0.45

12_4_ ___ | 3 0.03 0.48

12__5 67_ | 1 0.01 0.49

12__5 6__ | 1 0.01 0.50

12__5 _78 | 3 0.03 0.53

12__5 ___ | 1 0.01 0.55

12___ 678 | 1 0.01 0.56

12___ 67_ | 3 0.03 0.59

12___ 6_8 | 1 0.01 0.60

12___ _78 | 3 0.03 0.63

12___ _7_ | 6 0.07 0.70

12___ ___ | 15 0.17 0.87

1_345 678 | 1 0.01 0.88

1_3_5 67_ | 1 0.01 0.89

1__45 ___ | 1 0.01 0.90

1__4_ 67_ | 1 0.01 0.91

1__4_ _7_ | 3 0.03 0.95

1__4_ ___ | 1 0.01 0.96

1___5 _7_ | 1 0.01 0.97

1___5 ___ | 1 0.01 0.98

1____ _7_ | 6 0.07 1.05

1____ __8 | 2 0.02 1.07

1____ ___ | 17 0.19 1.26

_2345 678 | 3 0.03 1.29

_234_ 67_ | 1 0.01 1.30

_234_ _7_ | 1 0.01 1.31

_234_ __8 | 1 0.01 1.32

_23__ ___ | 1 0.01 1.34

_2_45 67_ | 2 0.02 1.36

_2_45 _78 | 1 0.01 1.37

_2_45 _7_ | 1 0.01 1.38

_2_4_ 678 | 1 0.01 1.39

_2_4_ 67_ | 3 0.03 1.42

_2_4_ 6__ | 1 0.01 1.44

_2_4_ _78 | 2 0.02 1.46

_2_4_ _7_ | 11 0.12 1.58

_2_4_ __8 | 1 0.01 1.59

_2_4_ ___ | 11 0.12 1.71

_2__5 67_ | 1 0.01 1.73

_2__5 _7_ | 1 0.01 1.74

_2__5 __8 | 1 0.01 1.75

_2__5 ___ | 3 0.03 1.78

_2___ 678 | 3 0.03 1.81

_2___ 6__ | 6 0.07 1.88

_2___ _7_ | 18 0.20 2.08

_2___ __8 | 1 0.01 2.09

_2___ ___ | 32 0.36 2.45

__345 ___ | 1 0.01 2.46

__34_ 678 | 1 0.01 2.47

__34_ 67_ | 1 0.01 2.48

__34_ _7_ | 2 0.02 2.50

__34_ ___ | 1 0.01 2.52

__3_5 6_8 | 1 0.01 2.53

__3__ 67_ | 1 0.01 2.54

__3__ _7_ | 1 0.01 2.55

__3__ ___ | 4 0.04 2.59

___45 67_ | 1 0.01 2.60

___45 6__ | 3 0.03 2.64

___45 _7_ | 2 0.02 2.66

___45 ___ | 3 0.03 2.69

___4_ 67_ | 1 0.01 2.70

___4_ 6__ | 8 0.09 2.79

___4_ _78 | 3 0.03 2.83

___4_ _7_ | 13 0.14 2.97

___4_ __8 | 1 0.01 2.98

___4_ ___ | 43 0.48 3.46

____5 678 | 2 0.02 3.48

____5 67_ | 1 0.01 3.50

____5 6__ | 7 0.08 3.57

____5 _7_ | 8 0.09 3.66

____5 __8 | 1 0.01 3.67

____5 ___ | 39 0.43 4.11

_____ 678 | 1 0.01 4.12

_____ 67_ | 9 0.10 4.22

_____ 6__ | 51 0.57 4.79

_____ _78 | 5 0.06 4.84

_____ _7_ | 57 0.63 5.48

_____ __8 | 2 0.02 5.50

_____ ___ | 8,490 94.50 100.00

------+------

Total | 8,984 100.00

Table indicates the number of variables for which an observation

has missing data.

Missing for |

how many |

variables? | Freq. Percent Cum.

------+------

0 | 8,490 94.50 94.50

1 | 245 2.73 97.23

2 | 122 1.36 98.59

3 | 46 0.51 99.10

4 | 35 0.39 99.49

5 | 18 0.20 99.69

6 | 5 0.06 99.74

7 | 6 0.07 99.81

8 | 17 0.19 100.00

------+------

Total | 8,984 100.00

Variables created:

m_pattern is a string variable showing the pattern of missing data.

m_number is the number of variables for which a case has missing data.

m_<varnm> is a binary variable indicating missing data for <varnm>.
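What misschk records in m_pattern can be sketched outside Stata. The items and values below are hypothetical, with a digit marking a missing item and an underscore marking an observed one, as in the output above:

```python
from collections import Counter

# Hypothetical observations; None marks a missing item response.
obs = [
    {"psmoke": 1, "pdrink": 2, "psport": None},
    {"psmoke": None, "pdrink": 2, "psport": 3},
    {"psmoke": 1, "pdrink": 2, "psport": 3},
    {"psmoke": 1, "pdrink": 2, "psport": 3},
]
items = ["psmoke", "pdrink", "psport"]

# Build one pattern string per observation: item number if missing, "_" if not.
patterns = Counter(
    "".join(str(i + 1) if o[k] is None else "_" for i, k in enumerate(items))
    for o in obs
)
for pat, freq in patterns.most_common():
    print(pat, freq)
```

The all-underscore pattern is the fully observed group, mirroring the large final row of the misschk table.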

. codebook m_*, compact

Variable Obs Unique Mean Min Max Label

------

m_psmoke97 8984 2 .0125779 0 1 Missing value for psmoke97?

m_pdrink97 8984 2 .0205922 0 1 Missing value for pdrink97?

m_psport97 8984 2 .0045637 0 1 Missing value for psport97?

m_pgang97 8984 2 .0191451 0 1 Missing value for pgang97?

m_pcoll97 8984 2 .0131345 0 1 Missing value for pcoll97?

m_pvol97 8984 2 .0162511 0 1 Missing value for pvol97?

m_pdrug97 8984 2 .0251558 0 1 Missing value for pdrug97?

m_pcut97 8984 2 .0071238 0 1 Missing value for pcut97?

m_pattern 8984 89 . . . Missing for which variables?

m_number 8984 9 .1185441 0 8 Missing for how many variables?

------

It appears that quite a few people skipped one or two of the items; very few skipped more than two. We could construct our scale when at least 6 of the 8 items (75%) are answered, and we would only lose about 1.4% of the observations.
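The 75% rule can be sketched as a small scoring function. This illustrates the logic only, not the alpha command itself, and it assumes no items need reverse coding:

```python
from statistics import mean

def scale_score(items, min_items=6):
    """Mean of answered items, or None if fewer than min_items answered."""
    answered = [v for v in items if v is not None]
    return mean(answered) if len(answered) >= min_items else None

print(scale_score([1, 2, 3, 4, 5, 1, 2, 3]))              # all 8 answered
print(scale_score([1, 2, None, 4, 5, 1, 2, None]))        # 6 answered: scored
print(scale_score([1, None, None, 4, None, 1, 2, None]))  # 4 answered: missing
```

Using the mean of the answered items (rather than their sum) keeps scores comparable across respondents who answered different numbers of items.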

In Stata we can use the alpha command (illustrate this using the menus). This is a much more capable command than the reliability procedure in SPSS.

  • We can let the program decide if any items need to be reverse coded.
  • We can indicate which ones need to be reverse coded. If it reverses any items, we lose the simple interpretation of the score.
  • We get the usual information we got from SPSS

alpha psmoke97-pcut97, detail generate(peers_neg) item label min(6)

Test scale = mean(unstandardized items)

Items | S it-cor ir-cor ii-cov alpha label

------+------

psmoke97 | + 0.745 0.616 .34491 0.710 % peers smoke 1997

pdrink97 | + 0.749 0.626 .34652 0.708 % peers drunk 1+/month 1997

psport97 | - 0.372 0.203 .47517 0.780 % peers sports, clubs 1997

pgang97 | + 0.549 0.413 .42652 0.748 % peers belong to gang 1997

pcoll97 | - 0.495 0.335 .43752 0.761 % peers plan college 1997

pvol97 | - 0.399 0.221 .46644 0.779 % peers volunteer 1997

pdrug97 | + 0.794 0.682 .3253 0.695 % peers use illegal drugs 1997

pcut97 | + 0.711 0.572 .35653 0.719 % peers cut class/school 1997

------+------

Test scale | .39735 0.765 mean(unstandardized items)

------

I will drop the positive items because they are measuring a different dimension.