Working With Missing Data[1]
It is often the case that the weakest point of a study is the quality of the data that can be brought to bear on the research problem. The more you can find out about why the data are "as they are," the stronger the case you can make about the patterns of missing data, as well as the rationale for why those patterns may or may not matter. In my experience, journal editors and reviewers are much more interested in the completeness of the data than they used to be, and they want authors to discuss the steps they took to deal with potential bias due to any missing data in their study. Eliminating cases with missing data before analysis is not viewed as a legitimate solution to the problem.
Traditional approaches for dealing with missing data (e.g., listwise deletion, mean imputation, regression imputation) have been found to lead to biased estimates of model parameters. Two currently accepted approaches for dealing with missing data are multiple imputation (MI) and full information maximum likelihood (ML) estimation, which includes the cases with missing data points in the analysis (Enders, 2011; Enders & Peugh, 2004). These approaches are well established for single-level analyses. MI involves imputing a range of random plausible values for the missing data, which results in several complete data sets that can then be analyzed. One of the advantages of this approach is that it introduces variability into the distribution of cases with missing data, which is more likely to represent the population than imputing a single value for each missing case.
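The contrast between imputing a single value and imputing random plausible values can be illustrated with a minimal sketch. The data, the 30% missing rate, and the simple normal imputation model below are all hypothetical assumptions chosen for illustration; real MI routines condition on other variables and are considerably more elaborate.

```python
import random
import statistics

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def random_draw_impute(values, rng):
    """Replace each missing value with a random draw from a normal distribution
    fitted to the observed values (a deliberately simplified imputation model)."""
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]

rng = random.Random(2024)
# Hypothetical scores: 200 cases, roughly 30% deleted completely at random.
scores = [rng.gauss(50, 10) for _ in range(200)]
with_missing = [None if rng.random() < 0.30 else s for s in scores]

observed_sd = statistics.stdev([v for v in with_missing if v is not None])
mean_imputed_sd = statistics.stdev(mean_impute(with_missing))

# Five imputed data sets; each one receives its own random draws.
imputed_sets = [random_draw_impute(with_missing, rng) for _ in range(5)]
random_draw_sds = [statistics.stdev(d) for d in imputed_sets]

print(f"SD of observed scores:    {observed_sd:.2f}")
print(f"SD after mean imputation: {mean_imputed_sd:.2f}")
print("SDs after random-draw imputation:", [round(s, 2) for s in random_draw_sds])
```

Mean imputation necessarily shrinks the standard deviation, because every filled-in case sits exactly at the mean, whereas the random draws differ from one imputed data set to the next.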
For software programs that support maximum likelihood (ML) estimation with missing data present, there is no need to remove subjects with incomplete data. Estimation is based on the available data points, so subjects do not need to have complete data. As Enders and Peugh (2004) note, partial data actually contribute to the estimation of the model’s parameters by implying probable values for missing scores via the correlations among variables. Expectation maximization (EM), a common method for obtaining ML estimates with incomplete data, treats the model parameters (rather than the data points themselves) as missing values to be estimated and borrows information from the existing data at successive iterations until the differences between successive iterations are trivial. The ML-based approach is available in specialized software routines that are typically used for examining multilevel data structures (e.g., Mplus, HLM).
Missing data can pose a number of additional problems in multilevel data structures, depending on the sampling design underlying the data set, the extent to which the data are missing at each level, and whether or not the data can be assumed to be missing at random. In some modeling situations, there may be considerable missing data. Compared with single-level analyses, the difficulty presented by multilevel analyses concerns the likelihood that the missing data at one level (e.g., Level 2) may be linked to the missing data at Level 1. In SPSS MIXED, this is because if a predictor is missing for a Level 2 unit (e.g., school-community socioeconomic status or school enrollment size), all of the individual-level data for that unit will be lost, even if the individuals in that unit have no missing data at Level 1. So, for example, if a single school is missing the data on enrollment size, perhaps 200 individuals may be lost from the Level 1 data set as a result. We can see, then, that the loss of individual data at Level 1 may not be “random” if there is some known (or unknown) process related to missingness at Level 2, since the selection of individuals within units is related to the selection of the units themselves. This is often referred to in survey research as a “two-stage” sample. Moreover, individuals within a particular unit may share similarities with each other that are stronger than the similarities they share with individuals in other units (e.g., similar socialization, certain educational experiences).
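How a single missing Level 2 value removes every individual in that unit can be sketched as follows. The schools, students, and variable names are hypothetical, and the function only mimics the effect of listwise deletion on merged two-level data; it is not SPSS code.

```python
# Hypothetical two-level data: students (Level 1) nested within schools (Level 2).
schools = {
    "A": {"enrollment": 850},
    "B": {"enrollment": None},  # Level 2 predictor is missing for school B
    "C": {"enrollment": 420},
}
students = [
    {"id": 1, "school": "A", "score": 61.0},
    {"id": 2, "school": "A", "score": None},  # missing at Level 1
    {"id": 3, "school": "A", "score": 58.5},
    {"id": 4, "school": "B", "score": 55.0},
    {"id": 5, "school": "B", "score": 49.5},
    {"id": 6, "school": "B", "score": 63.0},
    {"id": 7, "school": "C", "score": 52.0},
    {"id": 8, "school": "C", "score": 57.5},
]

def listwise_delete(students, schools):
    """Merge school data onto each student, then drop any row with a missing value.
    Returns the kept rows plus the rows lost at each level."""
    kept, lost_level1, lost_level2 = [], [], []
    for student in students:
        row = {**student, **schools[student["school"]]}
        if row["enrollment"] is None:
            lost_level2.append(row)   # lost only because the school value is missing
        elif row["score"] is None:
            lost_level1.append(row)   # lost because the student's own value is missing
        else:
            kept.append(row)
    return kept, lost_level1, lost_level2

kept, lost_level1, lost_level2 = listwise_delete(students, schools)
print("Kept student ids:", [r["id"] for r in kept])
print("Lost to Level 1 missingness:", [r["id"] for r in lost_level1])
print("Lost to Level 2 missingness:", [r["id"] for r in lost_level2])
```

In this toy example the three students in school B have complete Level 1 data, yet all of them are dropped because of one missing school-level value.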
Users should consider data preparation and data analysis as two separate steps. First, in preparing the data for analysis, it is often useful to determine the amount of missing data present, as well as the number of missing data patterns. In reality, there is no way to get missing data back (short of actually following up with subjects in a study). In a sense, then, we are always dealing with the problem of missing information to some extent when we use actual data. The quality of our analysis depends on the assumptions we make about the patterns of missing responses present and on what is reasonable to conclude about those patterns. Second, what to do about the missing data becomes a more pressing concern. There are a number of available options for dealing with missing data. It helps to know the defaults and options for handling missing data in the software programs that the analyst is considering using in each situation.
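The first step, tallying the amount of missing data per variable and the number of distinct missing data patterns, can be sketched in a few lines. The records and variable names (ses, math, reading) are hypothetical, and missing values are represented by None as an assumption of this sketch.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"ses": 0.2, "math": 51, "reading": 48},
    {"ses": None, "math": 47, "reading": 50},
    {"ses": -0.5, "math": None, "reading": 44},
    {"ses": 1.1, "math": 60, "reading": None},
    {"ses": None, "math": 55, "reading": 52},
    {"ses": 0.0, "math": 49, "reading": 47},
    {"ses": 0.7, "math": None, "reading": None},
    {"ses": -1.2, "math": 42, "reading": 40},
]
variables = ["ses", "math", "reading"]

def missing_per_variable(records, variables):
    """Proportion of cases with a missing value, for each variable."""
    n = len(records)
    return {v: sum(r[v] is None for r in records) / n for v in variables}

def missing_patterns(records, variables):
    """Count cases per missing data pattern; each pattern is a tuple of 0/1 flags
    in the order of `variables`, where 1 marks a missing value."""
    return Counter(tuple(int(r[v] is None) for v in variables) for r in records)

rates = missing_per_variable(records, variables)
patterns = missing_patterns(records, variables)
complete_share = patterns[(0, 0, 0)] / len(records)

print("Missing rate per variable:", rates)
for pattern, count in patterns.most_common():
    print(pattern, count)
print(f"Share of complete cases: {complete_share:.3f}")
```

Here each variable is missing for a quarter of the cases, but only three of the eight cases are complete, which is the figure that matters under listwise deletion.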
Some Background
IBM SPSS is limited in its ability to deal with various patterns of missing data. By default, the program uses listwise deletion of cases when there are any missing data. This means any individual with data missing on any variable will be dropped from the analysis. As a first step, we suggest examining the amount of missing data on each variable. Even with 5% or less missing per variable, listwise deletion can in some cases result in a tremendous loss of data and biased parameter estimation. The program provides the typical options for handling missing data, which include listwise deletion, pairwise deletion, and mean substitution. In most situations, however, none of these would be considered acceptable approaches (Peugh & Enders, 2004). For example, listwise deletion leads to inflated standard errors when the data are considered to be “missing completely at random” and biased parameter estimates when the data are assumed to be “missing at random” (Allison, 2002; Larsen, 2011). Mean substitution treats all individuals with missing data as if they were located at the “grand mean,” which introduces bias in most situations by reducing variance (since it is unlikely that all individuals with missing data would be located at the mean). The IBM SPSS Base Statistics program also provides a basic missing data routine with several replacement methods that are not applicable to multilevel data (e.g., series mean, mean of nearby points, median of nearby points). Note also that although “user-missing” values can be specified in IBM SPSS, this approach is typically used for categorical responses, where some possible responses are coded as missing (e.g., “not applicable” in survey questions). If these user-defined missing values are included in the analysis, they will also bias parameter estimates. The program does provide a multiple imputation routine for missing data as an add-on module. It is appropriate for imputing plausible values for single-level analyses.
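The claim that a small amount of missing data per variable can add up to a large loss under listwise deletion follows from simple arithmetic. The sketch below assumes, purely for illustration, that each variable is missing independently at the same rate; real data rarely satisfy this exactly.

```python
def expected_complete_fraction(missing_rate, n_variables):
    """Expected share of cases with no missing values, assuming each variable
    is missing independently at the same rate (an illustrative assumption)."""
    return (1.0 - missing_rate) ** n_variables

for k in (1, 5, 10, 20):
    share = expected_complete_fraction(0.05, k)
    print(f"{k:2d} variables at 5% missing each: {share:.1%} of cases remain")
```

Under these assumptions, roughly 60% of cases survive listwise deletion with 10 variables, and only about 36% with 20 variables.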
Multilevel Data
For multilevel data, there is less guidance available from previous studies on missing data (Larsen, 2011). Larsen conducted a study comparing MI and ML approaches in situations where individuals were nested within groups. He found that the two approaches were relatively similar in handling Level 1 estimates under the different conditions examined. More importantly, however, as missing data increased at Level 2, the estimates of the Level 2 predictor from the imputed data sets displayed increased parameter bias and decreased standard errors compared to the estimates obtained from the complete data set. For Level 2 estimation, Larsen found that ML estimation with the missing data performed much better than the MI approach. This is because the MI procedure used in his study did not account for random effects. More specifically, the student-level data (Level 1) were assumed to be randomly sampled from the same population (i.e., in this case, a classroom), rather than to come from different classrooms (Larsen, 2011). There are appropriate MI routines available, however, in multilevel software (e.g., Mplus, HLM) that can impute plausible values properly at Level 2. For Mplus, this is available beginning with Version 6 of the software.
Importantly, where data are missing at Level 2 (e.g., on a school covariate), programs that default to listwise deletion, such as IBM SPSS, will drop all the individuals within those units. As I noted, these individuals may not have any missing data on the outcome or the Level 1 predictors. It is important to reiterate that the sampling frame through which the data were generated may have an impact on the assumptions we make about the distribution of the data at each level. This implies that the nature of the missing data at Level 2, in relation to the manner in which the units were selected (whether the units themselves were randomly sampled from a population of units or were just an unspecified “collection” of available units), can further complicate interpretation of the Level 1 results.
For growth models, which can be considered a type of multilevel model, one of the advantages is that missing data and student mobility can be incorporated directly into the analyses using ML estimation, which reduces the parameter bias that would result from eliminating those cases (Peugh & Enders, 2004). These types of models are implemented in vertical format in SPSS (i.e., where a single individual has several lines comprising the repeated outcome measures), which allows the retention of cases with partial data on the outcome variable. This means that if there are three repeated measures and at least one is present, the partial data will be retained. This is important in maintaining the assumption that the data are missing at random and, therefore, tends to reduce the parameter bias that can result from eliminating these cases with partial data entirely (Enders & Peugh, 2004). If a covariate is missing, however, all data for that individual will be lost due to listwise deletion.
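The effect of the vertical format on case retention can be sketched by restructuring a small wide-format data set. The individuals, the covariate name ses, and the outcome names y1 to y3 are hypothetical; the function only imitates how rows with a missing outcome drop out one occasion at a time, while a missing covariate removes the whole individual.

```python
# Hypothetical wide-format data: one row per individual, three repeated measures.
wide = [
    {"id": 1, "ses": 0.3, "y1": 48.0, "y2": 52.0, "y3": 57.0},
    {"id": 2, "ses": -0.4, "y1": 45.0, "y2": None, "y3": 51.0},
    {"id": 3, "ses": 0.9, "y1": 50.0, "y2": None, "y3": None},
    {"id": 4, "ses": None, "y1": 47.0, "y2": 49.0, "y3": 53.0},  # covariate missing
    {"id": 5, "ses": 0.1, "y1": None, "y2": None, "y3": None},   # no outcome present
]
OUTCOMES = ["y1", "y2", "y3"]
COVARIATES = ["ses"]

def to_long(wide_rows, outcomes, covariates):
    """Restructure to vertical format: one row per individual per occasion.
    Occasions with a missing outcome are skipped; an individual with a missing
    covariate loses every row, mirroring listwise deletion on covariates."""
    long_rows = []
    for person in wide_rows:
        if any(person[c] is None for c in covariates):
            continue
        for time, name in enumerate(outcomes):
            if person[name] is None:
                continue
            row = {"id": person["id"], "time": time, "y": person[name]}
            row.update({c: person[c] for c in covariates})
            long_rows.append(row)
    return long_rows

long_rows = to_long(wide, OUTCOMES, COVARIATES)
retained_ids = sorted({row["id"] for row in long_rows})
complete_ids = [p["id"] for p in wide
                if all(p[k] is not None for k in OUTCOMES + COVARIATES)]

print("Individuals retained in vertical format:", retained_ids)
print("Individuals with fully complete data:   ", complete_ids)
print("Number of person-occasion rows:", len(long_rows))
```

Only one individual here has complete data, but three contribute at least one occasion to the vertical data set; the individual with the missing covariate is lost in either format.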
Types of Missing Data
Rubin (1976) introduced the notion of the distribution of missingness as a way to classify the conditions under which missing data can be ignored. Little and Rubin (2002) distinguish between data missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The last of these is sometimes also referred to as non-ignorable missing (NIM). MCAR implies that the missingness does not depend on other variables in the model (e.g., covariates or outcomes). For example, this is generally the case when a random sample is taken from a population. MAR implies that the missingness may depend on other variables in the model. In both cases, however, whether a value is missing is assumed to be independent of the unobserved (missing) value itself (Hox, 2010). MNAR suggests that the probability of dropping out is related to responses at the time of dropping out. An example would be if students who have high absences are considerably more likely to drop out than students in other attendance categories (e.g., perhaps ¾ of the dropouts are in the “high absence” category). In longitudinal studies, however, we typically have information about the individuals who participate (e.g., mobility, perceptions about school processes, student outcomes) from previous occasions. In such cases, it is generally reasonable to assume MAR, conditional on those variables, which also include scores on the outcomes at earlier times (Hox, 2010; Schafer, 2005). This makes it easier to check for possible relationships between the tendency to drop out of a treatment and certain predictors existing across earlier occasions in a study. This type of examination is not possible in cross-sectional studies (since the data are collected at only one point in time).
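The three mechanisms can be made concrete with a small simulation. Everything in this sketch is a hypothetical assumption made for illustration: a fully observed predictor x, an outcome y that may be missing, and the particular missingness probabilities. It reports how far the mean of the observed y values drifts from the mean of all y values, both overall and within the group of cases with low x.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=7):
    """Generate (x, y, observed) triples under one missingness mechanism."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)        # fully observed predictor
        y = x + rng.gauss(0, 1)    # outcome that may be missing
        if mechanism == "MCAR":
            p_missing = 0.3                        # unrelated to x or y
        elif mechanism == "MAR":
            p_missing = 0.6 if x < 0 else 0.1      # depends on observed x only
        elif mechanism == "MNAR":
            p_missing = 0.6 if y < 0 else 0.1      # depends on the missing value itself
        else:
            raise ValueError(mechanism)
        rows.append((x, y, rng.random() >= p_missing))
    return rows

def bias(rows):
    """Mean of the observed y values minus the mean of all y values."""
    all_y = [y for _, y, _ in rows]
    observed_y = [y for _, y, seen in rows if seen]
    return statistics.mean(observed_y) - statistics.mean(all_y)

results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    rows = simulate(mechanism)
    low_x = [r for r in rows if r[0] < 0]
    results[mechanism] = {"overall": bias(rows), "within_low_x": bias(low_x)}
    print(f"{mechanism}: overall bias {results[mechanism]['overall']:+.3f}, "
          f"bias within low-x group {results[mechanism]['within_low_x']:+.3f}")
```

Under MAR the overall mean is distorted, but the distortion largely disappears once we condition on the observed predictor; under MNAR it remains even within the low-x group, which is why conditioning on observed variables cannot repair it.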
Other techniques have been developed for data that are MNAR. Enders (2011) discusses the usefulness of two of the MNAR approaches for longitudinal data (i.e., selection models and pattern mixture models) and demonstrates their use on a real data set. More specifically, selection models for longitudinal data combine a substantive model (i.e., a growth curve model) with a set of regression equations that predict missingness, while a pattern mixture analysis stratifies the sample into subgroups that share the same missing data pattern and estimates a growth model separately within each pattern. Interested readers can consult Hedeker and Gibbons (1997) for one example of a pattern mixture modeling approach that uses the missing data pattern (represented by one or more dummy variables) as a predictor in the growth model. Their approach can be estimated with standard mixed modeling procedures (e.g., the MIXED procedures in SPSS and SAS).
For users who have access to the SPSS multiple imputation (MI) module, patterns of missing data can first be identified and then plausible values can be imputed. The imputed data sets can be saved as separate data sets and then analyzed. It is often the case, for example, that even with 25-35% of the data missing, one can impute plausible values into a number of data sets and do a reasonable job of creating complete data sets in which each missing value is given a set of random plausible values. One of the advantages of this approach is that other variables can also be used to supply information about the missing data, but they need not be included in the actual model estimation. This approach to missing data is recommended when the assumption that the data are MAR is plausible.
The analyst can generate a relatively large number of imputed data sets (Bodner, 2008), analyze each complete data set, and report the mean estimates and standard errors across the imputations. The values imputed through MI represent draws from a distribution; in other words, they inherently contain some variation. This parameter variation across multiple imputations is important for creating reasonable distributions of plausible values for variables with missing values. If we assume some degree of normality, we can average the parameter estimates over the imputed data sets. My practical experience with MI approaches suggests they do pretty well at estimating the total data set where missing values are randomly dispersed across a sizable number of individuals (100-200 or more), as found in most published studies. It is important to keep in mind, however, that the MI approach as implemented in SPSS does not assume missing values on group-level (Level 2) variables.
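One widely used way to combine results across imputed data sets is the set of pooling formulas usually called Rubin's rules. The text above does not spell out a pooling formula, so using these rules here is an assumption, and the five estimates and standard errors are hypothetical numbers rather than results from any study. The pooled estimate is the average of the estimates, while the pooled standard error combines the average within-imputation variance with the between-imputation variance.

```python
import math
import statistics

def pool_rubin(estimates, standard_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules.
    Returns (pooled estimate, pooled standard error)."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical regression coefficient estimated in five imputed data sets.
estimates = [0.52, 0.48, 0.55, 0.50, 0.45]
standard_errors = [0.10, 0.11, 0.09, 0.10, 0.10]

pooled_estimate, pooled_se = pool_rubin(estimates, standard_errors)
print(f"Pooled estimate: {pooled_estimate:.3f}")
print(f"Pooled standard error: {pooled_se:.3f}")
```

The pooled standard error is larger than a simple average of the five standard errors would suggest, because the between-imputation term carries the extra uncertainty due to the missing data.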
A Possible Approach to Implement
Much of this discussion suggests that dealing with missing data is not so much about "How much missing data is allowable?" but, rather, about how to develop a process to deal with the missing data. I favor a strategy of triangulating research results using the different approaches that are currently recommended for examining missing data. It is possible to do something like the following.
1. You can try running the model using listwise deletion (which assumes MCAR). This data set will likely be considerably smaller than the “partially complete” data, but it gives the analyst a baseline view (albeit likely a biased one) for comparing subsequent results. Where the data set is quite large, the listwise data set may still represent the population “reasonably well.” Of course, one never knows for sure. With MIXED, the results with listwise deletion should match the results of the existing (partially complete) data set, since cases with missing values are listwise deleted, unless the data are vertically arranged as in a growth model.
2. Second, if there is not too much missing data per variable, you can then compare the listwise results against a number of complete data sets generated using an MI program that can be applied to hierarchical data structures (e.g., Mplus, HLM). In some situations, I have also obtained comparable results using the MI routine available in SPSS (e.g., if the data are missing at Level 1 only, or perhaps are missing only at Level 2 with a large Level 2 data set). As I noted earlier, in multilevel analyses the more problematic situation is where data on a Level 2 covariate are missing for all members of a particular unit, since the focus is often on the between-group (Level 2) model after adjustment for differences among individuals at Level 1. In two-level designs, the students selected within a school cannot be considered independent observations, since they will likely have some common characteristics. I have achieved good results in recovering estimates from complete data sets using Mplus to impute values at both Level 2 and Level 1 in examples where the missing data are assumed to be MAR. I have also achieved good results in recapturing complete data sets using the SPSS MI routine when some basic assumptions about the nature of the Level 2 data are met (e.g., units represent a simple random sample from the population and the data can be assumed to be MAR). This approach seems to work well, for example, where the Level 1 data are relatively complete, so that the missing data at Level 1 result mainly from individuals being dropped from the analysis due to missing data on one or more covariates at Level 2.