Class Imbalances versus Small Disjuncts

Taeho Jo

SITE, University of Ottawa
800 King Edward Ave., P.O. Box 450 Stn.A
Ottawa, Ontario, Canada K1N 6N5


Nathalie Japkowicz

SITE, University of Ottawa
800 King Edward Ave., P.O. Box 450 Stn.A
Ottawa, Ontario, Canada K1N 6N5

ABSTRACT

It is often assumed that class imbalances are responsible for significant losses of performance in standard classifiers. The purpose of this paper is to the question whether class imbalances are truly responsible for this degradation or whether it can be explained in some other way. Our experiments suggest that the problem is not directly caused by class imbalances, but rather, that class imbalances may yield small disjuncts which, in turn, will cause degradation. We argue that, in order to improve classifier performance, it may, then, be more useful to focus on the small disjuncts problem than it is to focus on the class imbalance problem. We experiment with a method that takes the small disjunct problem into consideration, and show that, indeed, it yields a performance superior to the performance obtained using standard or advanced solutions to the class imbalance problem.

Keywords

class imbalance, small disjuncts, rare cases, resampling, within-class imbalance, between-class imbalance.

1.  INTRODUCTION

Although class imbalances have been reported to hinder the performance of standard classifiers on many different types of problems[1], no study has made a point of linking the class imbalance problem directly to this loss. As a matter of fact, although the performance of standard classifiers may decrease on many class imbalanced domains, that does not necessarily demonstrate that it is the imbalance, per se, that causes this decrease. Rather, it is quite possible that class imbalances yield certain conditions that hamper classification, which would suggest 1) that class imbalances are not necessarily always a problem and, perhaps even more importantly, 2) that dealing with class imbalances alone will not always help improve performance.

The purpose of this paper is to question whether class imbalances are truly to blame for the reported losses of performance or whether these deficiencies can be explained in some other way. We show that class imbalances are, actually, often not a problem by themselves, but that, in small and complex data sets, they come accompanied with the problem of small disjuncts [5], which in turn causes a degradation in standard classifiers’ performance.[2] We conclude the paper by summarizing and testing an approach that we have previously designed (and described in greater length, elsewhere [7],[8]) that counters the effect of small disjuncts.

Though, the results presented in this paper may not be fully surprising, it is important to note that the paper’s main contribution rather lies in the shift of focus it proposes, which leads to new insights and solutions.

The remainder of the paper is divided into seven sections. Section 2 discusses the class imbalance, rare case and small disjunct problems. Section 3 describes the artificial and real domains on which our study is based. Section 4 shows the effect of class imbalances on these domains. The results suggest that class imbalances are often problematic, but not always. Section 5 shows the result of further experiments that contrast the class imbalance problem to the problem of small disjunct. These results show that it is the small disjunct problem rather than the class imbalance problem that is to blame for the loss of performance. Section 6 describes four standard methods designed for the class imbalance problem only (methods that ignore small disjuncts), a standard method for pruning small disjuncts (a method that completely eradicates the small disjuncts), as well as our newer approach, cluster-based oversampling [7], [8], that oversamples while addressing both the class imbalance and the small disjuncts problems (a method that inflates the small disjuncts). Section 7 shows the effects of cluster-based oversampling on artificial and real domains and contrasts them to those of the other methods described in Section 6. Section 8 concludes the paper.

2.  CLASS IMBALANCES, RARE CASES AND SMALL DISJUNCTS

In a concept-learning problem, the data set is said to present a class imbalance if it contains many more examples of one class than the other. Such a situation poses challenges for typical classifiers such as decision tree induction systems or multi-layer perceptrons that are designed to optimize overall accuracy without taking into account the relative distribution of each class. As a result, these classifiers tend to ignore small classes while concentrating on classifying the large ones accurately. Such a problem occurs in a large number of practical domains and is often dealt with by using re-sampling or cost-based methods. While cost-based methods were previously reported to perform better than random re-sampling approaches (e.g., [10]), they do not have the flexibility offered by re-sampling approaches. In particular, re-sampling approaches can generate novel examples or re-sample different parts of the space differently. Cost-based approaches, instead, take a monolithic approach to the re-balancing of data sets. This paper will explore whether the flexibility of re-sampling approaches can be exploited to the advantage of the class imbalanced problem. Its particular emphasis is on whether isolating rare cases and inflating them at different rates can be beneficial.

Rare or exceptional cases correspond to small numbers of training examples in particular areas of the feature space. When learning a concept, the presence of rare cases in the domain is an important consideration. The reason why rare cases are of interest is that they cause small disjuncts to occur, which are known to be more error prone than large disjuncts [5]. In more detail, learning systems usually create concept definitions that consist of several disjuncts. Each disjunct, in turn, is a conjunctive definition of a subconcept of the original concept. The coverage of a disjunct corresponds to the number of training examples it correctly classifies, and a disjunct is considered to be a “small disjunct” if that coverage is low. In fact, small disjuncts are not inherently more error prone than large disjuncts. What makes them more error prone are the bias of the classifiers [5] as well as the effect of attribute noise, missing attributes, class noise and training set size on the rare cases which cause them [13],[14].

Table 1, in the next section, illustrates very well how, in the two families of artificial domains, as the class imbalance and the concept complexity increases while the size of the training set decreases, the number of rare cases increases. For example, at concept complexity 3, imbalance level 1:3 and large training size, there are no rare cases since the smallest cases contain 100 examples. However, in the small setting, there are 8 rare/cases represented by 5 or 15 examples. The situation is even worse at class imbalance 1:9 where the smallest cases are represented by only 2 examples.

In the real world domains, rare cases are unknown since high dimensional data cannot be visualized to reveal areas of low coverage. In the remainder of this paper, when the rare cases of a real-world domain are necessary to consider, we will approximate them, using an unsupervised method (e.g., k-means). It is important to note that, in doing so, we make two assumptions:

1)  the small disjuncts constructed by our unsupervised algorithm do correspond fully to the (unknown) rare cases of the domain;

2)  there is a correspondence between the small disjuncts learned by the unsupervised method (the supposed rare cases of the domain) and those subsequently learned by the supervised method.

3.  DATA DOMAINS

This section describes the artificial and real domains on which our experiments are based. We generated two different families of artificial data. The first one uses a simple uni-dimensional backbone model while the second one uses a more complex, multi-dimensional model. The purpose of using artificial domains is to understand the nature of the observed degradation in domains that present a class imbalance. Using real-world domains only could have yielded results difficult to decipher. Nonetheless, artificial domains are not sufficient to fully test a hypothesis. We also need to see if this hypothesis applies to real data. In order to do so, we selected two data sets from the UCI Repository for Machine Learning: Wisconsin Breast Cancer and Pima Indian Diabetes, as well as one larger domain: Customer foreign exchange data for currency risk management. These three domains are highly challenging and quite imbalanced, especially the currency risk management domain whose minority class accounts for less than 3% of the data.

3.1  Artificial Domain with Single Dimension

For our first artificial family of domains, we created 27 data sets with various combinations of three parameters which we deemed significant for out study: concept complexity, training set size, and degreee of imbalance. This family of domains was already used in our previous work and explained at length there [10]. These 27 data sets were generated in the following way: each of the domains is one dimensional with input in the [0, 1] range associated with one of the two classes (1 or 0). The input range is divided into a number of regular intervals (i.e, intervals of the same size), each associated with a different class value. Contiguous intervals have opposite class values and the degree of concept complexity corresponds to the number of alternating intervals present in the domain. Actual training sets are generated from these backbone models by sampling points at random (using a uniform distribution), from each of the intervals. The number of points sampled from each interval depends of the size of the domain as well as on its degree of imbalance. An example of a backbone model is shown in Figure 1.

Figure 1. A Backbone Model of Complexity 3

Three different complexity levels were considered (c=1, 2, and 3) where each level, c, corresponds to a backbone model composed of 2c regular intervals. For example, the domains generated at complexity level c=1 are such that every point whose input is in range [0, 0.5) is associated with a class value of 1, while every point whose input is in range (0.5, 1] is associated with a class value of 0; At complexity level c=2, points in intervals [0, 0.25) and (0.5, 0.75) are associated with class value 1 while those in intervals (0.25, 0.5) and (0.75, 1] are associated with class value 0; etc., regardless of the size of the training set and its degree of imbalance.

Table 1 shows the distribution of training examples in each interval of the 27 domains. The top sub-table corresponds to c=1, the middle one corresponds to c=2, and the bottom one corresponds to c=3, where c is the level of complexity. In Table 1, each row indicates the combination of class imbalance and training set size and each column indicates each interval and its label. In the definition of class imbalance, 1:9 is defined as the high class imbalance, 1:3 corresponds to the middle class imbalance, and 1:1 corresponds to the low class imbalance (actually, it corresponds to a complete class balance). In the definition of the training set sizes, 80 is defined as the small training set size, 400 corresponds to the middle one, and 1600 corresponds to the large one. Each entry of Table 1 indicates the number of training examples in each interval. This table shows that the training examples are distributed uniformly into intervals.

Table 1. The Distribution of Training Examples with c=1 (Top), c=2 (middle), and c=3 (bottom)

Table 2. The Distribution of Test Examples

The test set size is fixed at 100 for each complexity level. Table 2 shows the distribution of the test examples per interval. In Table 2, the rows indicate the degree of complexity and the columns represent the actual intervals. Each entry indicates the number of test examples and the class label of each interval. In this domain as well as in all the others used for this study, we purposely chose to test our methods on perfectly balanced testing sets. This decision was made in order to put more emphasis on classifying the minority class examples correctly than accuracy on the “properly” distributed testing set (i.e., with the same class imbalance as in the training set) would. Separating the errors made on the positive and the negative data or reporting ROC curves are alternative approaches that we have used elsewhere [10], [12], but that were not practical here, given the large number of experiments conducted. We believe, however, that the performance measure used here is sufficiently indicative of the trend exhibited by the various methods we tested.

3.2  Artificial Domain with Multiple Dimensions

In the second artificial domains, the dimension of each feature vector is set to five and 27 domains are generated as in the previous subsection. The definitions of the training set size and the class imbalance degree are the same as those used in the previous section and they were summarized in Table 1. The concept complexity of this domain is defined as the number of clusters present in the domain.

Table 3. The Definition of Clusters in this Domain

#Clusters / Cluster Centers / Length
2 / [0.0348, 0.9225, 0.1262, 0.2717, 0.7858 +]
[0.5348, 0.4225, 0.6262, 0.7717, 0.2858 -] / 0.25
4 / [0.3241, 0.1006, 0.4903, 0.3863, 0.5487 +]
[0.5741, 0.3506, 0.7403, 0.6363, 0.8241 -]
[0.8241, 0.6006, 0.9903, 0.8861, 0.0487 +]
[0.0741, 0.8506, 0.2403, 0.1363, 0.2987 -] / 0.125
8 / [0.6819, 0.4511, 0.9610, 0.8550, 0.9932 +]
[0.8069, 0.5761, 0.0860, 0.9800, 0.1182 -]
[0.9319, 0.7011, 0.2110, 0.1050, 0.2432 +]
[0.0569, 0.8261, 0.3360, 0.2300, 0.3682 -]
[0.1819, 0.9511, 0.4610, 0.3550, 0.4932 +]
[0.3069, 0.0761, 0.5860, 0.4800, 0.6182 -]
[0.4319, 0.2011, 0.7110, 0.6050, 0.7432 +]
[0.5569, 0.3261, 0.8360, 0.7300, 0.8682 -] / 0.0625

Table 3 shows the definition of the clusters we used to generate the data for these domains. These clusters were designed so as to display an alternance of the classes in the five-dimensional space considered, and so as to avoid any overlap, which was previously shown to interfere with the class imbalance problem [11] and should be treated separately. In the column labeled “cluster centers”, in Table 3, each vector indicates the center of each cluster along with its class label. Each cluster is defined as a five dimensional hyper-cube with its center and its length (indicated in the last column of Table 3). The actual cluster points are distributed around their centers using a uniform distribution. In case some of the elements of the feature vectors are randomly assigned values smaller than zero or greater than one, their values are set to zero or one, respectively.