An Empirical Comparison of Techniques for Handling Incomplete Data When Using Decision Trees

BHEKISIPHO TWALA,

Brunel Software Engineering Research Centre,

School of Information Systems, Computing and Mathematics,

Brunel University, Uxbridge, Middlesex UB8 3PH

______

OBJECTIVE: Increasing awareness of how incomplete data affects learning and classification accuracy has led to a growing number of missing data techniques. This paper investigates the robustness and accuracy of seven popular techniques for tolerating incomplete training and test data, with missing values either confined to a single attribute or distributed among all attributes, for different proportions and mechanisms of missing data, and their effect on the resulting tree-based models.

METHOD: The seven missing data techniques were compared by artificially simulating different proportions, patterns, and mechanisms of missing data using twenty-one complete (i.e. with no missing values) datasets obtained from the UCI repository of machine learning databases. A 4-way repeated measures design was employed to analyze the data.

RESULTS: The simulation results suggest important differences. All methods have their strengths and weaknesses. However, listwise deletion is substantially inferior to the other six techniques, while multiple imputation, which utilizes the Expectation Maximization algorithm, represents a superior approach to handling incomplete data. Decision tree single imputation and surrogate variable splitting are more severely affected by missing values distributed among all attributes than by missing values confined to a single attribute. Otherwise, the imputation- and model-based procedures gave reasonably good results, although some discrepancies remained.

CONCLUSIONS: Different techniques for addressing missing values when using decision trees can give substantially different results, and the choice must be considered carefully to protect against biases and spurious findings. Multiple imputation should always be used, especially if the data contain many missing values. If few values are missing, any of the missing data techniques might be considered. The choice of technique should be guided by the proportion, pattern, and mechanism of missing data. However, the use of older techniques such as listwise deletion and mean or mode single imputation is no longer justifiable given the accessibility and ease of use of more advanced techniques such as multiple imputation.

Keywords: incomplete data, machine learning, decision trees, classification accuracy

______

1. INTRODUCTION

Machine learning (ML) algorithms have proven to be of great practical value in a variety of application domains. Unfortunately, such algorithms generally operate in environments that are fraught with imperfections. One form of imperfection is incompleteness of data, an issue currently faced by machine learning and statistical pattern recognition researchers who use real-world databases. One primary concern of classifier learning is prediction accuracy. Handling incomplete data (data unavailable or unobserved for any number of reasons) is an important issue for classifier learning, since incomplete data in either the training data or the test (unknown) data may not only affect interpretations of the data or the models created from the data but may also affect the prediction accuracy of learned classifiers. Rates of less than 1% missing data are generally considered trivial and 1-5% manageable; however, 5-15% require sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation [Pyle, 1999].

There are two common solutions to the problem of incomplete data that are currently applied by researchers. The first is to omit the instances having missing values (i.e. listwise deletion), which not only seriously reduces the sample size available for analysis but also ignores the mechanism causing the missingness. The problem with a smaller sample size is that it increases the chance of a non-significant result, i.e., the larger the sample, the greater the statistical power of the test. The second is to impute (or estimate) missing values from the existing data. The major weakness of single imputation methods is that they underestimate uncertainty, and so yield invalid tests and confidence intervals, since the estimated values are derived from the ones actually present [Little and Rubin, 1987].

The two most common tasks when dealing with missing values, and thus when choosing a missing data technique, are to investigate the pattern and mechanism of missingness in order to understand the process that could have generated the missing data, and to produce sound estimates of the parameters of interest despite the fact that the data are incomplete. In other words, the potential impact of missing data depends on the pattern and mechanism leading to the nonresponse. In addition, the choice of how to deal with missing data should also be based on the percentage of data that are missing and the size of the sample.

Robustness has a twofold meaning when dealing with missing values in decision trees: the toleration of missing values in training data is one, and the toleration of missing values in test data is the other. Although the problem of incomplete data has been treated adequately for various real-world datasets, there are rather few published works or empirical studies concerning the task of assessing the learning and classification accuracy of missing data techniques (MDTs) using supervised ML algorithms such as decision trees [Breiman et al., 1984; Quinlan, 1993].

The following section briefly discusses the missing data patterns and mechanisms that lead to the introduction of missing values in datasets. Section 3 presents details of the seven MDTs that are used in this paper. Section 4 empirically evaluates the robustness and accuracy of the seven MDTs on twenty-one machine learning domains. We close with a discussion and conclusions, followed by directions for future research.

2. PATTERNS AND MECHANISMS OF MISSING DATA

The pattern simply defines which values in the dataset are observed and which are missing. The three most common patterns of nonresponse in data are univariate, monotonic and arbitrary. When missing values are confined to a single variable we have a univariate pattern. A monotonic pattern occurs when the variables can be ordered so that, if a variable, say Y_j, is missing for a unit, then the subsequent variables Y_{j+1}, ..., Y_p are missing as well, i.e., the data matrix can be divided into observed and missing parts with a "staircase" line dividing them. Arbitrary patterns occur when any set of variables may be missing for any unit.

The mechanism generating the missing values is the most important consideration, since it determines how the missing values can be estimated most efficiently. If data are missing completely at random (MCAR) or missing at random (MAR), we say that the missingness is ignorable. For example, suppose that you are modelling software defects as a function of development time. If missingness is related neither to the missing values of the defect rate itself nor to the values of development time, the data are MCAR: there may be no particular reason why some project managers told you their defect rates and others did not. If, instead, defects go unrecorded for projects with particular development times, the data are MAR. MAR essentially says that the cause of missing data (software defects) may depend on the observed data (development time) but must be independent of the missing value that would have been observed. It is a less restrictive model than MCAR, which says that the missingness cannot depend on either the observed or the missing data. MAR is also a more realistic assumption for data to meet, but it is not always tenable; the more relevant and related attributes one can include in statistical models, the more likely it is that the MAR assumption will be met. For data that are informatively missing (IM), also called not missing at random (NMAR), the mechanism is not only non-random and not predictable from the other variables in the dataset but cannot be ignored, i.e., we have non-ignorable missingness [Little and Rubin, 1987; Schafer, 1997]. In contrast to the MAR condition outlined above, IM arises when the probability that the defect rate is missing depends on the unobserved value of the defect rate itself; for example, software project managers may be less likely to reveal projects with high defect rates. Since the pattern of IM data is not random, it is not amenable to common MDTs and there are no statistical means to alleviate the problem.
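To make the three mechanisms concrete, the following sketch shows one way the software example above could be simulated; the data, probabilities and thresholds are illustrative assumptions, not taken from this study.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    dev_time = rng.uniform(1, 24, n)                 # fully observed attribute
    defects = 2.0 * dev_time + rng.normal(0, 5, n)   # attribute to be made incomplete

    # MCAR: every case has the same probability of being missing,
    # regardless of development time or the defect rate itself
    mcar_mask = rng.random(n) < 0.2

    # MAR: the probability of being missing depends only on the
    # observed development time
    mar_mask = rng.random(n) < np.where(dev_time > 12, 0.4, 0.05)

    # IM/NMAR: the probability of being missing depends on the unobserved
    # defect rate itself (high-defect projects are less likely to be reported)
    nmar_mask = rng.random(n) < np.where(defects > np.median(defects), 0.4, 0.05)

    defects_mcar = np.where(mcar_mask, np.nan, defects)
    defects_mar = np.where(mar_mask, np.nan, defects)
    defects_nmar = np.where(nmar_mask, np.nan, defects)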

MCAR is the most restrictive of the three conditions, and in practice it is usually difficult to meet the MCAR assumption. Generally you can test whether the MCAR condition holds by comparing the distribution of the observed data between the respondents and the non-respondents; in other words, the data can provide evidence against MCAR. However, the data cannot generally distinguish between MAR and IM without distributional assumptions, unless the mechanism is well understood. For example, right censoring (or suspension) is IM but is in some sense known: an item or unit that is removed from a reliability test prior to failure, or a unit that is still operating in the field at the time the reliability of these units is to be determined, is called a suspended item or right-censored instance.
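A minimal sketch of this informal check, reusing the illustrative arrays from the previous sketch: compare the distribution of the observed development times between the cases with a recorded defect rate (respondents) and those without (non-respondents); a two-sample test such as Kolmogorov-Smirnov is one possibility.

    from scipy.stats import ks_2samp

    respondents = dev_time[~np.isnan(defects_mar)]
    non_respondents = dev_time[np.isnan(defects_mar)]

    stat, p_value = ks_2samp(respondents, non_respondents)
    # A small p-value is evidence against MCAR; here it should be small,
    # since the missingness was generated to depend on development time.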

3. DECISION TREES AND MISSING DATA TECHNIQUES

Decision trees (DTs) are one of the most popular approaches for both classification and regression type predictions. They are generated on the basis of specific splitting rules. A DT is a classifier expressed as a tree structure: a leaf node gives the outcome, computed with respect to the existing attributes, while a decision node tests an attribute and branches for each possible outcome of that test. One common approach to building a DT is to use entropy, a fundamental quantity in information theory. The entropy value measures the level of uncertainty, and the degree of uncertainty is related to the success rate of predicting the result. Often the training dataset used for constructing a DT is not a fully representative sample of the real-life situation and may contain noise, in which case the DT is said to over-fit the training data. To overcome the over-fitting problem, DTs use a pruning strategy that minimizes the output variable variance in the validation data by selecting a tree that is not only simpler than the one obtained when the tree-building algorithm stopped, but equally accurate for predicting or classifying "new" instances.
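As an illustration of the entropy-based selection just described, the sketch below computes the information gain of a nominal attribute; the function and variable names are illustrative and not tied to any particular DT implementation.

    import numpy as np
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a collection of class labels
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(labels, attribute_values):
        # Reduction in entropy achieved by splitting on a nominal attribute
        total = entropy(labels)
        n = len(labels)
        remainder = 0.0
        for v in set(attribute_values):
            subset = [c for c, a in zip(labels, attribute_values) if a == v]
            remainder += len(subset) / n * entropy(subset)
        return total - remainder

    # The attribute with the largest information gain becomes the decision
    # node; a lower remaining entropy means less uncertainty about the class.
    labels = ['defect', 'ok', 'ok', 'defect', 'ok']
    attr = ['long', 'short', 'short', 'long', 'long']
    print(information_gain(labels, attr))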
Missing values can cause problems at two points when using DTs: 1) when deciding on a splitting point (when growing the tree), and 2) when deciding into which daughter node each instance goes (when classifying an unknown instance). Methods for taking advantage of unlabelled classes can also be developed, although we do not deal with them in this paper, i.e., we assume that the class labels are not missing.

This section describes several MDTs that have been proposed in the literature to treat missing data when using DTs. These techniques are also the ones used in the simulation study in the next section. They are divided into three categories: ignoring and discarding data, imputation and machine learning.

3.1 Ignoring and Discarding Missing Data

Over the years, the most common approach to dealing with missing data has been to pretend there are no missing data. Simply omitting any instances that have missing values on the relevant attributes and carrying out the statistical analysis on the complete instances only is called complete-case analysis or listwise deletion (LD). Due to its simplicity and ease of use, LD is the default analysis in many statistical packages. LD also rests on the assumption that the data are MCAR.
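A minimal sketch of LD before growing a tree, assuming a pandas DataFrame with hypothetical attribute names:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    data = pd.DataFrame({
        "dev_time": [12.0, 7.5, None, 9.0, 11.0],
        "team_size": [5, 3, 4, None, 6],
        "defect_class": ["high", "low", "low", "high", "low"],
    })

    # Listwise deletion: drop every instance with at least one missing value
    complete_cases = data.dropna()

    tree = DecisionTreeClassifier()
    tree.fit(complete_cases[["dev_time", "team_size"]],
             complete_cases["defect_class"])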

3.2 Imputation

Imputation methods involve replacing missing values with estimated ones based on information available in the dataset. Imputation methods can be divided into single and multiple imputation methods. In single imputation the missing value is replaced with one imputed value, and in multiple imputation, several values are used. Most imputation procedures for missing data are single imputation. In the following section we briefly describe how each of the imputation techniques works.

3.2.1 Single Imputation Techniques

3.2.1.1 Decision Tree Approach

One approach, suggested by Shapiro and described by Quinlan (1993), is to use a decision tree to impute the missing values of an attribute. Let S be the training set and X_j an attribute with missing values. The method considers the subset S' of S containing only the instances on which X_j is known. In S' the original class is regarded as just another attribute, while the value of X_j becomes the class to be determined. A classification tree is built from S' for predicting the value of X_j from the other attributes and the class, and this tree is then used to fill in the missing values. Decision tree single imputation (DTSI), which can also be considered a ML technique, is suitable for domains in which strong relations between attributes exist, especially between the class attribute and the attributes with unknown or missing values.
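A sketch of DTSI for a single numerical attribute is given below; it assumes a pandas DataFrame in which the remaining attributes (including an encoded class attribute) are numeric and fully observed, and the helper name dt_impute is purely illustrative.

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    def dt_impute(df, target_attr):
        # Use only the instances on which the target attribute is known
        predictors = [c for c in df.columns if c != target_attr]
        known = df[df[target_attr].notna()]
        unknown = df[df[target_attr].isna()]
        if unknown.empty:
            return df
        # The original class is treated here as just another predictor;
        # for a nominal target a DecisionTreeClassifier would be used instead
        tree = DecisionTreeRegressor().fit(known[predictors], known[target_attr])
        filled = df.copy()
        filled.loc[unknown.index, target_attr] = tree.predict(unknown[predictors])
        return filled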

3.2.1.2 Expectation Maximization

In brief, expectation maximization (EM) is an iterative procedure that creates a complete dataset by filling in (imputing) one or more plausible values for the missing data through repetition of the following steps: 1) In the E-step, the data are read in one instance at a time, and each instance adds to the calculation of the sufficient statistics (sums, sums of squares, sums of cross products). Observed values contribute to these sums directly; if a variable is missing for the instance, the current best guess for it is used in place of the missing value. 2) In the M-step, once all the sums have been collected, the covariance matrix can simply be recalculated. This two-step process continues until the change in the covariance matrix from one iteration to the next becomes trivially small. Details of the EM algorithm for covariance matrices are given in [Dempster et al., 1977; Little and Rubin, 1987]. EM requires that the data are MAR. As mentioned earlier, the EM algorithm (and its simulation-based variants) can be used to impute a single value for each missing value, which from now on we shall call EM single imputation (EMSI). The single imputations are drawn from the predictive distribution of the missing data given the observed data and the EM estimates of the model parameters. A DT is then grown using the completed dataset; the tree obtained depends on the values imputed.
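The sketch below illustrates the E- and M-steps for estimating the mean vector and covariance matrix under a multivariate normal model; it is a simplified illustration under the MAR assumption (each instance is assumed to have at least one observed value), not the exact routine of Dempster et al. (1977).

    import numpy as np

    def em_mean_cov(X, max_iter=100, tol=1e-6):
        # X is an (n, p) array with np.nan marking the missing values
        X = np.asarray(X, dtype=float)
        n, p = X.shape
        miss = np.isnan(X)

        # Start from the mean-filled data
        mu = np.nanmean(X, axis=0)
        sigma = np.cov(np.where(miss, mu, X), rowvar=False)

        for _ in range(max_iter):
            sum_x = np.zeros(p)
            sum_xx = np.zeros((p, p))
            for i in range(n):
                m, o = miss[i], ~miss[i]
                x = X[i].copy()
                C = np.zeros((p, p))
                if m.any():
                    # E-step: best guess = conditional mean of the missing part
                    reg = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                    x[m] = mu[m] + reg @ (X[i, o] - mu[o])
                    # Correction to the cross-product sums for the uncertainty
                    C[np.ix_(m, m)] = sigma[np.ix_(m, m)] - reg @ sigma[np.ix_(o, m)]
                sum_x += x
                sum_xx += np.outer(x, x) + C

            # M-step: recompute mean and covariance from the expected sums
            mu_new = sum_x / n
            sigma_new = sum_xx / n - np.outer(mu_new, mu_new)
            if np.max(np.abs(sigma_new - sigma)) < tol:
                return mu_new, sigma_new
            mu, sigma = mu_new, sigma_new
        return mu, sigma

For EMSI, a single value for each missing entry can then be drawn from the conditional distribution implied by the estimated mean and covariance before a DT is grown on the completed data.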

3.2.1.3 Mean or Mode

Mean or mode single imputation (MMSI) is one of the most common and simplest methods of imputing missing values. In MMSI, whenever a value is missing for one instance on a particular attribute, the mean (for a continuous or numerical attribute) or the modal value (for a nominal or categorical attribute), computed over all non-missing instances, is used in place of the missing value. Although this approach permits the inclusion of all instances in the final analysis, it can lead to invalid results. MMSI yields valid estimates of the mean or modal values only if the missing values are MCAR, and even then the estimates of the variance and covariance parameters (and hence correlations, regression coefficients, and other similar parameters) are invalid, because the method underestimates the variability among missing values by replacing them with the corresponding mean or modal value. In fact, the failure to account for the uncertainty behind imputed data is the general drawback of single imputation methods.
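A minimal sketch of MMSI with scikit-learn's SimpleImputer; the split into one numerical and one nominal attribute is an illustrative assumption.

    import numpy as np
    from sklearn.impute import SimpleImputer

    X_num = np.array([[12.0], [7.5], [np.nan], [9.0]])                        # numerical attribute
    X_cat = np.array([["gui"], ["batch"], [np.nan], ["gui"]], dtype=object)   # nominal attribute

    mean_imputer = SimpleImputer(strategy="mean")
    mode_imputer = SimpleImputer(strategy="most_frequent")

    X_num_filled = mean_imputer.fit_transform(X_num)   # missing value -> column mean
    X_cat_filled = mode_imputer.fit_transform(X_cat)   # missing value -> modal value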

3.2.2 Multiple Imputation

Multiple imputation (MI) is one of the most attractive methods for general-purpose handling of missing data in multivariate analysis. Rubin (1987) described MI as a three-step process. First, sets of plausible values for the missing instances are created using an appropriate model that reflects the uncertainty due to the missing data. Each of these sets of plausible values can be used to "fill in" the missing values and create a "completed" dataset. Second, each of these datasets is analyzed using complete-data methods. Finally, the results are combined. For example, replacing each missing value with a set of five plausible values or imputations would result in building five DTs, and the predictions of the five trees would then be averaged into a single result, i.e., multiple imputation yields the average over the five trees.
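The three steps can be sketched as follows, with scikit-learn's IterativeImputer (with sample_posterior=True) standing in for a proper imputation model; this is an illustrative approximation, not the NORM/data-augmentation procedure described below and used in this paper.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.tree import DecisionTreeClassifier

    def mi_tree_predict(X_train, y_train, X_test, m=5, seed=0):
        # Step 1: create m completed datasets; Step 2: grow one tree per
        # dataset; Step 3: combine (average) the m sets of predictions
        prob_sum = None
        for k in range(m):
            imputer = IterativeImputer(sample_posterior=True, random_state=seed + k)
            X_tr = imputer.fit_transform(X_train)   # one "completed" training set
            X_te = imputer.transform(X_test)        # impute the test data consistently
            tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_train)
            probs = tree.predict_proba(X_te)
            prob_sum = probs if prob_sum is None else prob_sum + probs
        # Indices into tree.classes_ of the pooled (averaged) prediction
        return (prob_sum / m).argmax(axis=1)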

There are various ways to generate imputations. Schafer (1997) has written a set of general-purpose programs for MI of continuous multivariate data (NORM), multivariate categorical data (CAT), mixed categorical and continuous data (MIX), and multivariate panel or clustered data (PAN). These programs were initially created as functions operating within the statistical languages S and SPLUS [SPLUS, 2003]. NORM includes an EM algorithm for maximum likelihood estimation of means, variances and covariances. NORM also adds regression-prediction variability by a procedure known as data augmentation [Tanner and Wong, 1987]. Although not absolutely necessary, it is almost always a good idea to run the EM algorithm before attempting to generate MIs: the parameter estimates from EM provide convenient starting values for data augmentation (DA), and the convergence behaviour of EM provides useful information on the likely convergence behaviour of DA. This is the approach we follow in this paper, which we shall from now on call EMMI.

3.3 Machine Learning Techniques