What do we mean by missing data?
Missing data are simply observations that we intended to be made but did not. For example, an individual may only respond to certain questions in a survey, or may not respond at all to a particular wave of a longitudinal survey.
In the presence of missing data, our goal remains making inferences that apply to the population targeted by the complete sample - i.e. the goal remains what it was if we had seen the complete data.
However, both making inferences and performing the analysis are now more complex. We will see we need to make assumptions in order to draw inferences, and then use an appropriate computational approach for the analysis.
We will avoid adopting computationally simple solutions (such as just analysing complete data or carrying forward the last observation in a longitudinal study) which generally lead to misleading inferences.
In practice the data consist of (a) the observations actually made (where '?' denotes a missing observation):
Figure 1: Typical partially observed data set
and (b) the pattern of missing values:
Figure 2: Pattern of missing values for the data in Figure 1. A '1' indicates that an observation is seen, a '0' that it is missing.
Inferential framework
When it comes to analysis, whether we adopt a frequentist approach (Figure 3) or a Bayesian approach (Figure 4), the likelihood is central. In these notes, for convenience, we discuss issues from a frequentist perspective, although often we use appropriate Bayesian computational strategies to approximate frequentist analyses.
Figure 3: Schematic for the frequentist (sometimes termed traditional) paradigm of inference
The actual sampling process involves the 'selection' of the missing values, as well as the units. So to complete the process of inference in a justifiable way we need to take this into account.
Figure 4: Schematic for the Bayesian paradigm of inference
The likelihood is a measure of comparative support for different models given the data. It requires a model for the observed data, and as with classical inference this must involve aspects of the way in which the missing data have been selected (i.e. the missingness mechanism).
Assumptions
We distinguish between item and unit nonresponse (missingness). For item missingness, values can be missing on response (i.e. outcome) variables and/or on explanatory (i.e. design/covariate/exposure/confounder) variables.
Missing data can affect the properties of estimators (for example, means, percentages, percentiles, variances, ratios, regression parameters and so on). Missing data can also affect inferences, i.e. the properties of tests and confidence intervals, and Bayesian posterior distributions.
A critical determinant of these effects is the way in which the probability of an observation being missing (the missingness mechanism) depends on other variables (measured or not) and on its own value.
In contrast with the sampling process, which is usually known, the missingness mechanism is usually unknown.
The data alone cannot usually definitively tell us the sampling process.
Likewise, the missingness pattern, and its relationship to the observations, cannot definitively identify the missingness mechanism.
The additional assumptions needed to allow the observed data to be the basis of inferences that would have been available from the complete data can usually be expressed in terms of either
- the relationship between selection of missing observations and the values they would have taken, or
- the statistical behaviour of the unseen data.
These additional assumptions are not subject to assessment from the data under analysis; their plausibility cannot be definitively determined from the data at hand.
The issues surrounding the analysis of data sets with missing values therefore centre on assumptions. We have to
- decide which assumptions are reasonable and sensible in any given setting (contextual/subject-matter information will be central to this);
- ensure that the assumptions are transparent;
- explore the sensitivity of inferences/conclusions to the assumptions; and
- understand which assumptions are associated with particular analyses.
Getting computation out of the way
The above implies it is sensible to use approaches that make weak assumptions, and to seek computational strategies to implement them.
However, computationally simple strategies are often adopted which make strong assumptions that are subsequently hard to justify.
Classic examples are completers analysis (i.e. only including units with fully observed data in the analysis) and last observation carried forward. The latter is sometimes advocated in longitudinal studies, and replaces a unit's unseen observations at a particular wave with their last observed values, irrespective of the time that has elapsed between the two waves.
Simple, ad-hoc methods and their shortcomings
In contrast to principled methods, these usually create a single 'complete' dataset, which is analysed as if it were the fully observed data.
Unless certain, fairly strong, assumptions are true, the answers are invalid.
We briefly review the following methods:
- Analysis of completers only
- Imputation of simple mean
- Imputation of regression mean
- Last observation carried forward
Completers analysis
The data on the left below has one missing observation on variable 2, unit 10.
- Completers analysis deletes all units with incomplete data from the analysis (here unit 10).
- It is inefficient.
- It is problematic in regression when covariate values are missing and models with several sets of explanatory variables need to be compared. Either we keep changing the size of the data set, as we add/remove explanatory variables with missing observations, or we use the (potentially very small, and unrepresentative) subset of the data with no missing values.
- When the missing observations are not a completely random selection of the data, a completers analysis will give biased estimates and invalid inferences.
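The last point can be illustrated with a small simulation; the data and mechanism below are hypothetical, using numpy. Under MCAR the completers mean remains unbiased, but once the chance of being missing depends on the value itself, the completers mean is biased.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
y = rng.normal(loc=50, scale=10, size=n)  # true population mean is 50

# MCAR: 30% missing, independent of y -> completers mean stays unbiased
mcar_observed = rng.random(n) > 0.3

# Value-dependent missingness: larger values more likely to be missing
p_miss = 1 / (1 + np.exp(-(y - 50) / 5))  # increases with y
dep_observed = rng.random(n) > p_miss

print(y.mean())                 # close to 50
print(y[mcar_observed].mean())  # still close to 50
print(y[dep_observed].mean())   # noticeably below 50: biased
```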
Simple mean imputation
The data on the left below has one missing observation on variable 2, unit 10.
We replace this with the arithmetic average of the observed data for that variable. This value is shown in red in the table below.
- This approach is clearly inappropriate for categorical variables.
- It does not lead to proper estimates of measures of association or regression coefficients. Rather, associations tend to be diluted.
- In addition, variances will be wrongly estimated (typically underestimated) if the imputed values are treated as real. Thus inferences will be wrong too.
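The variance shrinkage is easy to see in a small simulated sketch (hypothetical data, using numpy): filling 40% of the values with the observed mean leaves roughly 60% of the true variance.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0, 1, size=10_000)
missing = rng.random(y.size) < 0.4     # 40% missing, completely at random

imputed = y.copy()
imputed[missing] = y[~missing].mean()  # fill with the observed mean

# Treating the imputed values as real shrinks the variance:
print(y[~missing].var())  # about 1: variance of the genuinely observed data
print(imputed.var())      # about 0.6: understated, because 40% of values
                          # now sit exactly at the mean
```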
Regression mean imputation
Here, we use the completers to calculate the regression of the incomplete variable on the other complete variables. Then, we substitute the predicted mean for each unit with a missing value. In this way we use information from the joint distribution of the variables to make the imputation.
Example
Consider again our dataset with two variables, which is missing variable 2 on unit 10:
To perform regression imputation, we first regress variable 2 on variable 1 (note, it doesn't matter which of these is the 'response' in the model of interest). In our example, we use simple linear regression:
V2 = β0 + β1 V1 + e.
Using units 1-9, we find that β0 = 6.56 and β1 = -0.366, so the regression relationship is
Expected value of V2 = 6.56 - 0.366 V1.
For unit 10, this gives
6.56 - 0.366 x 3.6 = 5.24.
This value is shown in red below:
Results of regression mean imputation.
Note:
- Regression mean imputation can generate unbiased estimates of means, associations and regression coefficients in a much wider range of settings than simple mean imputation.
- However, one important problem remains. The variability of the imputations is too small, so the estimated precision of regression coefficients will be wrong and inferences will be misleading.
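The procedure can be sketched in code. The two-variable dataset below is hypothetical (the source data are not reproduced here), so the fitted coefficients only roughly echo the 6.56 and -0.366 quoted above; the steps are the same: regress the incomplete variable on the complete one using the completers, then substitute the predicted mean.

```python
import numpy as np

# Hypothetical two-variable dataset; unit 10's V2 is missing (np.nan)
v1 = np.array([1.2, 2.5, 0.8, 3.1, 1.9, 2.2, 0.5, 2.8, 1.4, 3.6])
v2 = np.array([6.0, 5.7, 6.4, 5.3, 5.9, 5.6, 6.5, 5.5, 6.1, np.nan])

obs = ~np.isnan(v2)
# Fit the regression of V2 on V1 using the completers only
slope, intercept = np.polyfit(v1[obs], v2[obs], deg=1)

# Substitute the predicted mean for the missing value
v2_imp = v2.copy()
v2_imp[~obs] = intercept + slope * v1[~obs]
print(v2_imp[-1])  # the imputed value for unit 10
```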
Creating an extra category
When a categorical variable has missing values it is common practice to add an extra 'missing value' category. In the example below, the missing values, denoted '?', have been given the category 3.
This is bad practice because:
- the impact of this strategy depends on how the missing values are divided among the real categories, and on how the probability of a value being missing depends on other variables;
- very dissimilar classes can be lumped into one group;
- severe bias can arise, in any direction; and
- when used to stratify for adjustment (or to correct for confounding), the completed categorical variable will not do its job properly.
Last observation carried forward (LOCF)
This method is specific to longitudinal data problems.
For each individual, missing values are replaced by the last observed value of that variable. For example:
Here the three missing values for unit 1, at times 4, 5 and 6 are replaced by the value at time 3, namely 2.0. Likewise the two missing values for unit 3, at times 5 and 6, are replaced by the value at time 4, which is 3.5.
Using LOCF, once the data set has been completed in this way it is analysed as if it were fully observed.
For full longitudinal data analyses this is clearly disastrous: means and covariance structure are seriously distorted. For single time point analyses the means are still likely to be distorted, measures of precision are wrong and hence inferences are wrong. Note this is true even if the mechanism that causes the data to be missing is completely random. For a full discussion download the talk 'LOCF - time to stop carrying it forward' from the preprints page of this site.
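The LOCF mechanics amount to a row-wise forward fill. In the sketch below the values for units 1 and 3 match those quoted above (2.0 carried forward from time 3, and 3.5 from time 4); the remaining entries are hypothetical, and pandas is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical longitudinal data: rows are units, columns are times 1-6
waves = pd.DataFrame(
    {
        1: [1.5, 2.1, 3.0],
        2: [1.8, 2.4, 3.2],
        3: [2.0, 2.6, 3.4],
        4: [np.nan, 2.9, 3.5],
        5: [np.nan, 3.1, np.nan],
        6: [np.nan, 3.3, np.nan],
    },
    index=["unit 1", "unit 2", "unit 3"],
)

# LOCF: carry each unit's last observed value forward along its row
locf = waves.ffill(axis=1)
print(locf.loc["unit 1", 6])  # 2.0 -- the time-3 value, carried forward
print(locf.loc["unit 3", 6])  # 3.5 -- the time-4 value, carried forward
```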
Conclusions
Unless the proportion missing is so small as to be unlikely to affect inferences, these simple ad-hoc methods should be avoided. However, note that 'small' is hard to define: estimates of the chances of rare events can be very sensitive to just a few missing observations; likewise, a sample mean can be sensitive to missing observations which are in the tails of the distribution.
They usually conflict with the statistical model that underpins the analysis (however simple and implicit this might be), so they introduce bias.
As the assumptions about the reason for the data being missing that they implicitly make are often difficult to describe (e.g. with LOCF), they can make it very hard to know what assumptions are being made in the analysis.
They do not properly reflect statistical uncertainty: data are effectively 'made up' and no subsequent account is taken of this.
Some notation
The data
We denote the data we intended to collect by Y, and we partition this into
Y = {Yo, Ym},
where Yo is observed and Ym is missing.
Note that some variables in Y may be outcomes/responses, some may be explanatory variables/covariates. Depending on the context these may all refer to one unit, or to an entire dataset.
Missing value indicator
Corresponding to every observation Y, there is a missing value indicator R, defined as:
R = 1 if Y is observed, and R = 0 if Y is missing,
with each R corresponding to its Y.
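In code, with np.nan marking a missing entry (the values below are hypothetical), the indicator R is simply:

```python
import numpy as np

# Hypothetical partially observed variable; np.nan marks a missing value
y = np.array([3.2, np.nan, 5.1, 4.4, np.nan])

# The missing value indicator R: 1 where Y is observed, 0 where it is missing
r = (~np.isnan(y)).astype(int)
print(r)  # [1 0 1 1 0]
```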
Missing value mechanism
The key question for analyses with missing data is, under what circumstances, if any, do the analyses we would perform if the data set were fully observed lead to valid answers?
As before, 'valid' means that effects and their SE's are consistently estimated, tests have the correct size, and so on, so inferences are correct.
The answer depends on the missing value mechanism.
This is the probability that a set of values are missing given the values taken by the observed and missing observations, which we denote by
Pr(R | yo,ym)
Examples of missing value mechanisms
- The chance of nonresponse to questions about income usually depends on the person's income.
- Someone may not be at home for an interview because they are at work.
- The chance of a subject leaving a clinical trial may depend on their response to treatment.
- A subject may be removed from a trial if their condition is insufficiently controlled.
Missing Completely at Random (MCAR)
Suppose the probability of an observation being missing does not depend on observed or unobserved measurements. In mathematical terms, we write this as
Pr(r | yo,ym) = Pr(r)
Then we say that the observation is Missing Completely At Random, which is often abbreviated to MCAR.
Note that in a sample survey setting MCAR is sometimes called uniform non-response.
If data are MCAR, then consistent results with missing data can be obtained by performing the analyses we would have used had there been no missing data, although there will generally be some loss of information. In practice this means that, under MCAR, the analysis of only those units with complete data gives valid inferences.
An example of a MCAR mechanism would be that a laboratory sample is dropped, so the resulting observation is missing.
However, many mechanisms that initially seem to be MCAR may turn out not to be. For example, a patient in a clinical trial may be lost to follow up after 'falling' under a bus; however if it is a psychiatric trial, this may be an indication of poor response to treatment. Likewise, if a response to a postal questionnaire is missing because the questionnaire was lost or stolen in the post, this may not be random but rather reflect the area in which the sorting office is located.
As we have already said, under MCAR analyses of completers only (a short hand for including in the analysis only units with fully observed data) give valid inferences.
So do analyses based on moment based estimators (for example, generalised estimating equations), and other estimators derived from consistent estimating equations.
By consistent estimating equations we mean functions of the data and unknown parameters whose expectation, taken over the complete data at the population parameter values, is zero. Under MCAR, they still have expectation zero, and so still lead to valid inferences.
Saying the same thing mathematically, an estimating equation can be written as U(y, θ); at the estimate θ̂ we have U(y, θ̂) = 0. The estimating equation is consistent because E U(Y, θ) = 0, where θ is the population parameter value. It remains consistent if the data are missing completely at random (MCAR) because, even then, E U(Yo, θ) = 0.
A simple example of a consistent estimating equation is that for the sample mean, U(y, θ) = ȳ − θ, whose solution is the sample mean θ̂ = ȳ.
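A simulated sketch of this result, using the sample-mean estimating equation (the population mean of 10 and all other numbers are illustrative): dropping half the data completely at random leaves the solution of the estimating equation consistent, just noisier.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(loc=10, scale=2, size=50_000)  # population mean is 10

# Estimating equation for the mean: U(y, theta) = mean(y) - theta
def U(y, theta):
    return y.mean() - theta

theta_hat = y.mean()                   # solves U(y, theta_hat) = 0
observed = rng.random(y.size) > 0.5    # MCAR: drop half the data at random
theta_hat_mcar = y[observed].mean()    # solves U(y_o, theta) = 0

print(theta_hat)       # close to 10
print(theta_hat_mcar)  # also close to 10, just noisier
```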
Missing At Random (MAR)
After considering MCAR, a second question naturally arises. That is, what are the most general conditions under which a valid analysis can be done using only the observed data, and no information about the missing value mechanism, Pr(r | yo,ym)?
The answer to this is when, given the observed data, the missingness mechanism does not depend on the unobserved data. Mathematically,
Pr(r | yo,ym) = Pr(r | yo).
This is termed Missing At Random, abbreviated MAR.
This is equivalent to saying that two units who share the same observed values have the same statistical behaviour on the other variables, whether those are observed or not.
For example:
As units 1 and 2 have the same values where both are observed, given these observed values, under MAR, variables 3, 5 and 6 from unit 2 have the same distribution (NB not the same value!) as variables 3, 5 and 6 from unit 1.
Note that under MAR the probability of a value being missing will generally depend on observed values, so it does not correspond to the intuitive notion of 'random'. The important idea is that the missing value mechanism can be expressed solely in terms of the observed data.
Unfortunately, this can rarely be definitively determined from the data at hand!
Examples of MAR mechanisms
- A subject may be removed from a trial if his/her condition is not controlled sufficiently well (according to pre-defined criteria on the response).
- Two measurements of the same variable are made at the same time. If they differ by more than a given amount a third is taken. This third measurement is missing for those that do not differ by the given amount.
A special case of MAR is uniform non-response within classes. For example, suppose we seek to collect data on income and property tax band. Typically, those with higher incomes may be less willing to reveal them. Thus, a simple average of incomes from respondents will be downwardly biased.
However, now suppose we have everyone's property tax band, and given property tax band non-response to the income question is random. Then, the income data is missing at random; the reason, or mechanism, for it being missing depends on property band. Given property band, missingness does not depend on income itself.
Therefore, to get an unbiased estimate of income, we first average the observed income within each property band. As data are missing at random given property band, these estimates will be valid. To get an estimate of the overall income, we simply combine these estimates, weighting by the proportion in each property band.
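A simulated version of this example (the band proportions, incomes and response rates are invented): the simple average of observed incomes is biased downwards, while the band-wise average, weighted by the known band proportions, recovers the true mean.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
band = rng.choice([0, 1], size=n, p=[0.7, 0.3])  # property tax band
income = np.where(band == 1,
                  rng.normal(60_000, 5_000, n),  # higher band, higher income
                  rng.normal(30_000, 5_000, n))  # true overall mean: 39_000

# MAR: response depends only on band (higher band -> more nonresponse)
p_respond = np.where(band == 1, 0.4, 0.9)
responded = rng.random(n) < p_respond

naive = income[responded].mean()  # biased downwards

# Band-wise means of responders, weighted by the known band proportions
band_means = [income[responded & (band == b)].mean() for b in (0, 1)]
weights = [np.mean(band == b) for b in (0, 1)]
weighted = np.dot(weights, band_means)

print(naive)     # well below the true mean of 39_000
print(weighted)  # close to 39_000
```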
In this example, a simple summary statistic (average of observed incomes) was biased. Conversely, a simple model (estimate of income conditional on property band), where we condition on the variable that makes the data MAR, led to a valid result.
This is an example of a more general result. Methods based on the likelihood are valid under MAR. However, in general non-likelihood methods (e.g. based on completers, moments, estimating equations & including generalised estimating equations) are not valid under MAR, although some can be 'fixed up'. In particular, ordinary means, and other simple summary statistics from observed data, will be biased.
Finally, note that in a likelihood setting the term ignorable is often used to refer to an MAR mechanism. It is the mechanism (i.e. the model for Pr(R | yo)) which is ignorable - not the missing data!
Missing Not At Random (MNAR)
When neither MCAR nor MAR hold, we say the data are Missing Not At Random, abbreviated MNAR.
In the likelihood setting (see end of previous section) the missingness mechanism is termed non-ignorable.
What this means is
- Even accounting for all the available observed information, the reason for observations being missing still depends on the unseen observations themselves.
- To obtain valid inference, a joint model of both Y and R is required (that is, a joint model of the data and the missingness mechanism).
Unfortunately
- We cannot tell from the data at hand whether the missing observations are MCAR, MAR or MNAR (although we can distinguish between MCAR and MAR).
- In the MNAR setting it is very rare to know the appropriate model for the missingness mechanism.
Hence the central role of sensitivity analysis; we must explore how our inferences vary under assumptions of MAR, MNAR, and under various models. Unfortunately, this is often easier said than done, especially under the time and budgetary constraints of many applied projects.