1

Imputation of Missing Data for input to Support Vector Machines

Bob L. Wall
Montana State University
Department of Computer Science
Bozeman, MT
/ Jeff K. Elser
Montana State University
Department of Computer Science
Bozeman, MT

ABSTRACT
The standard formulation of support vector machines (SVMs) does not allow for missing values for any of the attributes in an example being learned or classified. We examine preprocessing methods to prepare data sets containing missing values for use by an SVM. These methods can be separated into two categories: ignoring the missing data (discarding training examples with a missing attribute or discarding attributes that have missing values), or using a process known as imputation to generate missing values. A simple imputation method is just to use the average value for the attribute, but there are more robust techniques, including a k-nearest neighbors (KNN) approach and the use of a separate SVM to determine the likely value. We compare the performance of an SVM on test datasets containing missing values using each of the above techniques, and finally we propose a topic for further research - a possible method of handling the missing values directly within the SVM learning algorithm.

INTRODUCTION

The standard formulation of support vector machines (SVMs) does not allow for missing values for any of the attributes in an example being learned or classified. We examine methods by which data sets containing missing values can be processed using an SVM. This is typically accomplished by one of two means: ignoring the missing data (either by discarding examples with a missing attribute value or discarding an attribute that has missing values), or using a process generally referred to as imputation, by which a value is generated for the attribute. These techniques are typically carried out on the data prior to its being supplied to the learning algorithm. A simple imputation method is just to use the average value for the attribute, but there are more robust techniques that are likely to supply a value closer to the one that is missing, including a k-Nearest neighbors (KNN) approach and the use of a separate SVM to determine the likely value.

We compare the performance of an SVM on test datasets containing missing values using the following techniques: ignoring examples with missing values; ignoring attributes with missing values; imputing the missing values using the attribute’s mean value; imputing the missing values using KNN; and imputing the missing values using a separate SVM.

Data Sets

In order to have control over the variables in the experiments, we wrote a data set generator. The generator calculated the sum of the squares of all the values in a row of data. If the sum was less than or equal to one, the example belonged to the class. If the sum was greater than one, it did not belong to the class. This creates a hyper-sphere shape which is not linearly separable. This is necessary because the linearly classifiable data we started with was not causing any error. The error caused by the harder data sets allows us to see differences in performance better. The number of positive and negative training examples and test cases were controlled explicitly. The number of attributes was also explicitly stated for each set.

Two data sets had 20 examples, four had 50 examples, two had 100 examples, one had 1000, and one had 10,000. Each had an equal number of positive and negative examples. The number of attributes varied between two and fifteen.

Measuring SVM PerformancE

When we generated the training data sets, we also generated test data sets with the same distribution for each attribute. After training an SVM using the training data, we tested it against the test data set and measured the accuracy.

Introducing Missing Values

In the generated data, a random attribute was picked and then rows of that attribute were randomly picked to be missing. The number of columns and rows used were explicitly controlled by us, but the actual position of the columns and rows was random.

Handling Missing Values

Eliminating Missing Values

The first two methods by which we deal with missing values in the data set are very simple: just ignore training examples that contain missing values, or ignore attributes that have missing values in any of the training examples.

Note that the first method causes problems if any examples to be classified after learning is complete contain missing values; they must either be ignored, or the missing values must be generated using one of the following methods.

Replacing with Mean / Mode Value

This technique is not significantly more complicated than the first two. If a training example has a missing value, that value is replaced by the “most common” value for that attribute. For continuous attributes, we use the mean or average of all the specified values for that attribute in the training data set. (This does impose the requirement that two passes be made over the data, one to determine the mean and another to update any missing values.) For discrete attributes, we use the mode, or most commonly occuring, value for that attribute.

K-Nearest Neighbors (KNN) Imputation

For this implementation, if a training example contains one or more missing values, we measure the distance between the example and all other examples that contain no missing values. Our distance metric is a modified version of the Manhattan distance – the distance between two examples is the sum of the distances between the corresponding attribute values in each example. For discrete attributes, this distance is 0 if the values are the same, and 1 otherwise. In order to combine distances for discrete and continuous attributes, we perform a similar distance measurement for continuous attributes – if the absolute difference between the two values is less than half a standard deviation, the distance is 0; otherwise, it is 1.

The K complete examples closest to the example with missing values are used to choose a value. For a discrete attribute, the most frequently occurring value is used. For a continuous attribute, the average of the values from the K neighbors is used. For our experiments, we used K = 5.

SVM Imputation

An SVM can be used to impute the missing values as well. For each attribute in the training set that has missing values, we train an SVM using all of the training examples that have no missing values; we ignore the original classification value from the data set and use the value of the attribute being imputed is the target value. We also ignore any other attributes that have any missing values when generating this new training data.

If the attribute is a continuous value, we use SVM-Light in regression mode to learn the data. If the attribute is discrete with only two values, a standard SVM in classification mode can be used. For a discrete attribute with more than two values, special handling is necessary – standard SVMs, including SVM-Light, do not do multi-category classification. We use the standard technique of one-against-the-others to handle multi-valued discrete attributes; if the attribute has n values, we generate n separate training data sets; in data set i, the classification value for each example indicates whether or not the example had value i for the attribute.

After an SVM is trained on each data set, we then use that SVM to classify or perform regression on the examples with missing values for the attribute. Again, we ignore any other attributes that have missing values. If the attribute being imputed is continuous, the SVM will use regression to generate the value. If the attribute is continuous, we need to classify it with each of the n SVM models and select the value corresponding to the SVM that classifies the example as positive. If more than one SVM generates a positive classification, we randomly select one value.

Results

The results of processing each of our data sets containing missing data using each of the above methods is summarized below.

For the generated data, we averaged the accuracies for 10 varied data sets, for each imputation technique. As would be expected, the control where the actual data was left in had the best accuracy. The next best technique was using information from the model to pick values. This involved replacing missing values with the expected center of the model. The use of an SVM to impute the values was next best. Effectivly tieing for fourth were three more realistic techniques: discard attributes with missing values, k-nearest neighbor, and replacing values with the mean. Since it is would be uncommon to have information about the model to simply pick good values, these last three are probably the dominant choices. Replacing with a 1 or -1 were inferior as expected. The worst technique was discarding rows. This could be partly because we chose our missing values by focusing on single columns. That would cause the missing values to be distributed more evenly between the rows, and thus cause a greater loss of data upon removing those rows.

Method Average Accuracy (%)

Control93.19

Replace with 088.92

Replace with 180.02

Replace with -180.57

Discard Examples68.5

Discard Attributes86.66

Mean86.76

KNN86.75

SVM87.15

Table 1

Topics for Further Research

It is obviously of significant value in certain data sets to avoid throwing away examples because one or more of their attribute values are missing. Other attributes that do have values might be significant to learning the classification criteria. However, if possible, care should be taken to avoid introducing error to the classification by assigning an incorrect value for the missing attribute.

It may be possible to take advantage of the way in which the SVM computes the hyperplane that separates categories to better handle missing attribute values. Consider the theoretical basis for SVM learning – positive and negative examples are mapped to a higher-dimensional Euclidean space (i.e. a space in which the inner product is defined) in which it is possible to separate the positive and negative examples with a single hyperplane. The two groups of examples are each enclosed inside a convex hull, and the hyperplane is built by drawing a line between the closest points on the two convex hulls, and drawing a plane perpendicular to this line halfway between the convex hulls.

Consider adding a new data point which has a missing value for an attribute. When this data point is mapped to the higher-dimensional space, if it lies within the boundaries of its classification’s convex hull, regardless of the value which could be assigned to the missing attribute, or if it extends the convex hull in a direction away from the convex hull enclosing the other classification, the point can be ignored; it cannot be a support vector. If the data point is mapped in such a way that it extends the convex hull in a direction toward the other convex hull, it may be significant. If varying the value of the missing attribute within some range of values moves the point along a line that is parallel to the closest face of the other convex hull, then it is safe to assign any of those values to the missing attribute.

The difficulty with this analysis is that the SVM does not actually map the points to the higher-dimensional space; it evaluates a kernel function that is chosen such that its result is proportional to the inner product in the higher-dimensional space. The values obtained from the kernel function are supplied to a quadratic programming optimization algorithm to select the support vectors. In order to handle examples with missing values within the SVM as it is learning, it might be possible to somehow tag the values and generate a range of results from the kernel evaluation, then modify the quadratic optimization algorithm to determine whether the point should be a candidate for a support vector.

For a detailed introduction to the workings of SVMs, see (Chen, 2003). For more detailed analysis of SVMs and their applications, see (Cristianini, 2000) and (Schölkopf, 2002).

Conclusions

Surprisingly, the K-Nearest Neighbors and SVM imputation techniques did not perform significantly better than the technique of replacing the missing values with the attribute mean, at least for these data sets. For other more realistic data sets, we would expect to possibly see increased accuracy using the more complicated imputation techniques; however, experiments on a database of automobile data demonstrated similar results (the mean value and KNN techniques were slightly better than ignoring attributes with missing values). We encountered some other problems with using the SVM imputation technique that prevented a meaningful comparison of results on that data set. One problem was non-convergence in the SVM optimization algorithm; normalizing the values of each continuous attribute to the range [0, 1] helped somewhat, but normalizing to [-1,1] might provide better results.

All in all, more experimentation is probably required to better gauge whether the additional effort required to use an SVM to impute missing values is worthwhile. It would also be useful to attempt to find heuristics to characterize the data that would act as a guide for choosing the most appropriate imputation method.

It would definitely be worthwhile to pursue research into incorporating the handling of missing values directly into the SVM learning algorithm. However, there is no guarantee that this will lead to a computationally feasible algorithm.

Acknowledgements

The experiments conducted for this research used the SVMLight package, written by Thorsten Joachims (Joachims, 1999). For information about the package, and to download, visit

references

Chen, P.H., C.-J. Lin, and B. Schölkopf, 2003, “A Tutorial on Nu-Support Vector Machines,” http://www.csie.ntu.edu.tw/~cjlin/papers/nusvmtutorial.pdf.

Cristianini, N. and J. Shawe-Taylor, 2000, An Introduction to Support Vector Machines, Cambridge University Press, 2000.

Joachims, T., 1998, “Making Large-Scale SVM Learning Practical,” in B. Schölkopf, C. Burges, and A. Smola (eds.), Advances in Kernel Methods - Support Vector Learning, MIT-Press, Cambridge, MA, 1999.

Schölkopf, B. and A. J. Smola, 2002, Learning with Kernels, MIT Press, Cambridge, MA, 2002.