Paper Title:

Decision Tree Induction for Dynamic, High-dimensional Data Using P-Trees

Primary Contact: Anne Denton

Mailing Address:

Department of Computer Science

North Dakota State University

Fargo, ND 58105-5164, USA

E-mail:

Telephone:

(701) 231-6404

Fax:

(701) 231-8255

Authors:

Anne Denton and William Perrizo

Department of Computer Science, North Dakota State University

Category:

Information Systems

Decision Tree Induction for Dynamic, High-dimensional Data Using P-Trees[1]

Anne Denton and William Perrizo

Department of Computer Science,

North Dakota State University

Fargo, ND 58105-5164, USA

{Anne.Denton, William.Perrizo}@ndsu.nodak.edu

Abstract

Decision Tree Induction is a powerful classification tool that is widely used in practice and works well for static data with dozens of attributes. We adapt the decision tree concept to a setting where data changes rapidly and hundreds or thousands of attributes may be relevant. Decision tree branches are evaluated as needed, based on the most recent data, focusing entirely on the data that needs to be classified. Our algorithm is based on the P-tree data structure, which allows fast evaluation of counts of data points and results in scaling that is better than linear in the data set size.

1. Introduction

A typical data mining task consists of predicting a discrete property, such as whether a patient is sick, whether an e-mail message is spam, or whether a protein has a particular function. Traditionally, a handful of attributes were used for the prediction. When evaluating whether a patient has a particular disease, a doctor may run a dozen tests and use the results to reach a conclusion. A machine learning tool to assist in the task may use that data as input to a prediction. Decision tree algorithms, such as C4.5 [1] and its successors, have proven extremely useful in this setting. Modern data mining problems are, however, often different in nature. When predicting the function of a gene, researchers can draw on a wealth of information that has been collected in biological databases and may contain thousands of relevant properties. Traditional decision tree algorithms are not suitable in this setting because they are only able to use a small number of attributes to break up the data set. Using more attributes would result in decision tree branches that are represented by few or no training points and cannot provide statistically significant information, a problem known as the curse of dimensionality [2]. Our algorithm builds decision tree branches around the sample that is to be classified (lazy classification). That means that the data set is not broken up unnecessarily by attribute values that are irrelevant to the sample in question. We consistently follow the ideal of an entirely sample-driven classification even when continuous attributes are involved. Whereas traditional decision tree algorithms, including lazy decision tree induction [3], discretize the data based on the data set available at training time, our algorithm uses intervals that are based on the test sample.

This aspect of the presented technique is similar to instance-based approaches such as Parzen window [4], kernel Podium [5], and k-nearest neighbor classification. It is therefore important to highlight that other instance-based techniques lack the powerful attribute selection mechanism that is central to decision tree algorithms in general and ours in particular. Kernel- and window-based techniques do not normally weight attributes according to their importance. For categorical attributes, this limitation is particularly problematic because only two distances are possible for each attribute, namely distance 0 if the attribute values match and 1 otherwise. Solutions that have been suggested include weighting attribute dimensions according to their information gain [6]; optimizing attribute weighting using genetic algorithms [7]; selecting important attributes in a wrapper approach [8]; and, in a more general kernel formulation, boosting as applied to heterogeneous kernels [9]. All of these approaches increase the algorithm complexity and are therefore unsuitable for the typically large data set sizes encountered in data mining.

A further problem with traditional decision tree algorithms is that the classifier, i.e., the decision tree, has to be constructed every time the data changes. That is unacceptable in settings where new data arrives rapidly and old data may have to be discarded. Many current problems, such as predictions in computer networks, involve streaming data for which this problem is most pronounced. We use the P-tree data structure [10] to compute counts on the current data in a way that typically scales less than linearly with data set size. The favorable scaling is achieved by a bit-column-wise storage organization in which sections of columns that are purely composed of 0 or 1 values are eliminated from the calculation of counts. If the goal is, for example, to count data points that have a value of 1 for a particular binary attribute, and the first half of the data is 0 for that attribute, then processing only involves the second half. Compression is hierarchical and the speed benefit is exploited at every level. This leads to a best-case complexity that is logarithmic in the number of data points.

Section 2 discusses the algorithm, with Section 2.1 summarizing the P-tree data structure, Section 2.2 motivating the HOBbit distance [11], and Section 2.3 explaining how we map to binary attributes. Section 2.4 discusses our pruning technique and Section 2.5 introduces a new way of combining multiple classifiers. Section 3 presents our experimental setup and results, with Section 3.1 describing the data sets, Sections 3.2 and 3.3 discussing the accuracy of different classifiers, and Section 3.4 demonstrating performance. Section 4 concludes the paper.

2. Algorithm

Our algorithm is loosely based on decision tree induction [1] in selecting attributes successively based on their relevance to the classification task. Data points that match in all selected attributes are considered relevant to the prediction task and the class label of the sample of interest is determined from the plurality of votes by those points. Attribute selection is based on optimizing the gain of information as defined by Shannon [12]. In contrast to conventional decision tree induction [1], tree branches are constructed as needed based on the sample that is to be classified, similar to lazy decision tree induction [3]. The main differences compared with [3] lie in our treatment of continuous attributes using a window function and the efficient count calculation using P-trees.
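As a rough sketch of this control flow (not the authors' implementation; all function names, including the count helper, are hypothetical), a sample-driven selection loop might look as follows in Python:

```python
# Hypothetical sketch of the sample-driven (lazy) attribute selection loop.
# `candidate_tests(sample)` is assumed to yield binary tests centered on the
# test sample (exact match for categorical attributes, HOBbit neighborhoods
# for numerical ones); `count(conditions, label)` is assumed to return the
# number of training points satisfying the conditions, e.g., via P-tree ANDs.

def classify(sample, candidate_tests, count, is_significant, info_gain):
    path = []                          # binary conditions selected so far
    while True:
        best, best_gain = None, 0.0
        for test in candidate_tests(sample):
            if test in path:
                continue
            gain = info_gain(path, test, count)
            if gain > best_gain:
                best, best_gain = test, gain
        # Stop when no remaining test yields a statistically significant split.
        if best is None or not is_significant(path, best, count):
            break
        path.append(best)
    # Plurality vote among training points matching all selected conditions.
    pos = count(path, label=1)
    neg = count(path, label=0)
    return 1 if pos >= neg else 0
```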

2.1. P-trees

The P-tree data structure was originally developed for spatial data [10] but has been successfully applied in many contexts [11,13]. The key idea behind P-trees is that AND operations on bit vectors are fast and hierarchical compression further improves their scaling. AND operations between P-trees are employed to evaluate counts of data points that correspond to particular attribute combinations. Performance depends critically on the quality of compression, which in turn depends on how similar neighboring data points are. Some types of data have an inherent continuity. Spatial data, for example, will often show neighboring pixels that represent similar color values. Data that shows such continuity can easily be organized to optimize compression. The key to preserving continuity when moving from a high-dimensional space to the one-dimensional storage organization is the use of space-filling curves. P-trees for spatial data [10] use the Peano curve to determine the order of data points and optimize compression.

If data shows no natural continuity it is often beneficial to impose an order by sorting. Applying the Peano-ordering concept to sorting is straightforward. Rather than using the spatial coordinates to determine an ordering, the data-mining-relevant attributes themselves are used. That means that records are sorted according to the order in which they would be encountered on traversing the attribute space in Peano order. Figure 1 shows how two numerical attributes are traversed as well as the P-tree that is constructed from the sequence of the highest-order bit of x. Only integer data types are traversed in Peano order since the concept of proximity does not apply to categorical data. The relevance of categorical data for the ordering depends on the number of attributes needed to represent it.

Figure 1. Peano order sorting and P-tree construction.
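The following minimal Python sketch shows one way such an ordering can be generated by bit interleaving (Z-ordering); the 3-bit width and the toy records are illustrative assumptions, and the authors' implementation may differ in detail:

```python
# Sketch of Peano-order sorting on the data-mining-relevant attributes.
# The sort key interleaves the bits of all integer attribute values,
# highest-order bits first, so that records close in attribute space
# tend to end up close in the one-dimensional storage order.

def peano_key(values, bits=3):
    """Interleave the bits of the integer attribute values, highest order first."""
    key = 0
    for i in range(bits - 1, -1, -1):          # most to least significant bit
        for v in values:
            key = (key << 1) | ((v >> i) & 1)
    return key

records = [(3, 7), (0, 1), (5, 2), (6, 6)]     # two numerical attributes
records.sort(key=lambda r: peano_key(r, bits=3))
print(records)                                 # [(0, 1), (3, 7), (5, 2), (6, 6)]
```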

P-trees are data structures that use hierarchical compression. The P-tree graph shows a tree with fan-out 2 in which 1 stands for nodes that represent only 1 values, 0 stands for nodes that represent only 0 values, and m ("mixed") stands for nodes that represent a combination of 0 and 1 values. Only "mixed" nodes have children. For other nodes, the data sequence can be reconstructed based on the purity information and node level alone. The P-tree example is kept simple for demonstration purposes. The implementation has a fan-out of 16 and uses an array rather than pointers to represent the tree structure. It, furthermore, stores the count of 1-bits for all nodes to improve ANDing speed. For data mining purposes, we are mainly interested in the total number of 1-bits for the entire tree, which we will call root count in the following sections.
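For illustration, a minimal sketch of a fan-out-2 P-tree over a single bit column is given below; it uses a pointer-based node class for readability, unlike the fan-out-16, array-based implementation described above, and all names are illustrative:

```python
# Minimal sketch of a fan-out-2 P-tree over one bit column. Each node is
# either pure ('0' or '1') or mixed ('m'); only mixed nodes have children,
# and every node stores its 1-bit count.

class PTreeNode:
    def __init__(self, bits):
        self.size = len(bits)
        self.count = sum(bits)                    # number of 1-bits below this node
        if self.count == 0:
            self.tag, self.children = '0', None   # pure-0 node
        elif self.count == self.size:
            self.tag, self.children = '1', None   # pure-1 node
        else:
            half = self.size // 2                 # mixed node: recurse on both halves
            self.tag = 'm'
            self.children = (PTreeNode(bits[:half]), PTreeNode(bits[half:]))

bit_column = [0, 0, 1, 1, 1, 1, 0, 1]
root = PTreeNode(bit_column)
print(root.tag, root.count)                       # 'm' 5  -- root count = total 1-bits
```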

2.2. HOBbit Distance

The nature of a P-tree-based data representation with its bit-column structure has a strong impact on the kinds of algorithms that will be efficient. P-trees allow easy evaluation of the number of data points in intervals that can be represented by one bit pattern. For 8-bit numbers, the interval [0,128) can easily be specified by requiring the highest-order bit to be 0 and not putting any constraints on the lower-order bits. The number of points in this interval is obtained as the root count of the complement of the P-tree corresponding to the highest-order bit. The interval [128,192) can be defined by requiring the highest-order bit to be 1 and the next bit to be 0, and not putting constraints on any other bits. The number of points within this interval is evaluated by computing an AND of the basic P-tree corresponding to the highest-order bit and the complement of the P-tree for the next highest-order bit. Arbitrary intervals can be defined using P-trees but may require the evaluation of several ANDs.
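The sketch below mimics this interval count with ordinary Python integers as uncompressed bit vectors; the AND logic is the same as for P-trees, but the data and names are illustrative:

```python
# Count the 8-bit values that fall in [128, 192): highest-order bit is 1
# AND the complement of the next highest-order bit. basic[i] holds the bit
# vector of bit position i (7 = highest order) over all data points.

values = [5, 130, 200, 140, 90, 255, 128, 64]
n_bits = 8
basic = [0] * n_bits
for pos, v in enumerate(values):
    for i in range(n_bits):
        if (v >> i) & 1:
            basic[i] |= 1 << pos                 # set bit `pos` in bit column i

mask = (1 << len(values)) - 1                    # restrict complement to the data points
in_interval = basic[7] & (~basic[6] & mask)
print(bin(in_interval).count('1'))               # 3  (130, 140, and 128)
```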

This observation led to the definition of the HOBbit distance for numerical data [11]. Note that the HOBbit distance does not apply to categorical data, which can only take one of two possible distances: distance 0 if the attribute values are the same and 1 if they are different.

The HOBbit distance is defined as

$$
d_{\mathrm{HOBbit}}(a_s, a_t) =
\begin{cases}
0 & \text{if } a_s = a_t \\
\max\left\{\, j + 1 \;\middle|\; \left\lfloor a_s / 2^{j} \right\rfloor \neq \left\lfloor a_t / 2^{j} \right\rfloor \,\right\} & \text{otherwise,}
\end{cases}
\qquad (1)
$$

where $a_s$ and $a_t$ are attribute values and $\lfloor\cdot\rfloor$ denotes the floor function. The HOBbit distance is, thereby, the number of bits by which two values have to be right-shifted to make them equal. The numbers 32 (100000) and 37 (100101) have HOBbit distance 3 because only their first three bits are equal; the numbers, consequently, have to be right-shifted by 3 bits.
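A direct Python transcription of this definition (the function name is illustrative):

```python
# HOBbit distance: the number of right shifts needed to make two integer
# attribute values equal, per Eq. (1).

def hobbit_distance(a_s, a_t):
    d = 0
    while a_s != a_t:
        a_s >>= 1          # right-shift both values by one bit
        a_t >>= 1
        d += 1
    return d

assert hobbit_distance(32, 37) == 3
assert hobbit_distance(21, 21) == 0
```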

2.3. Virtual Binary Attributes

For the purpose of the calculation of information gain, all attributes are treated as binary. Any attribute can be mapped to a binary attribute using the following prescription. For a given test point and a given attribute, a virtual attribute is assumed to be constructed, which we will call a. For categorical attributes a is 1 for the attribute value of the test sample and 0 otherwise. Consider, as an example, a categorical attribute that represents color and a sample point that has attribute value "red." Attribute a would (virtually) be constructed to be 1 for all "red" items and 0 for all others, independently of how large the domain is. For continuous attributes, a is 1 for any value within the HOBbit neighborhood of interest and 0 otherwise. For a 6-bit integer attribute, a HOBbit neighborhood of $d_{\mathrm{HOBbit}} = 3$, and a sample value of 37, binary attribute a will be imagined as being 1 for values in the interval [32,40) and 0 for values outside this interval. The class label attribute is a binary attribute for all data sets that we discuss in this paper and therefore does not require special treatment.
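A small Python sketch of how such virtual binary attributes could be formed explicitly follows; in the P-tree setting the corresponding counts would come from ANDs of basic P-trees rather than from explicit 0/1 vectors, and the function names are illustrative:

```python
# Virtual binary attribute a for one test sample.

def virtual_attribute_categorical(column, sample_value):
    # 1 where the categorical value matches the sample, 0 otherwise
    return [1 if v == sample_value else 0 for v in column]

def virtual_attribute_numerical(column, sample_value, d_hobbit):
    # 1 inside the HOBbit neighborhood of the sample, 0 outside:
    # values that agree with the sample after right-shifting by d_hobbit bits
    return [1 if (v >> d_hobbit) == (sample_value >> d_hobbit) else 0
            for v in column]

print(virtual_attribute_numerical([31, 32, 37, 39, 40, 48], 37, 3))
# [0, 1, 1, 1, 0, 0]  -- the HOBbit neighborhood is the interval [32, 40)
```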

2.4. Pruning

It is well known that decision trees have to be pruned to work successfully [1]. Information gain alone is, therefore, not a sufficient criterion to decide which attributes to use for classification. We use statistical significance as a stopping criterion, which is similar to decision tree algorithms that prune during tree construction. The calculation of significance is closely related to that of information gain. Information gain is a function of the probability of a particular split under the assumption that a is unrelated to the class label [15], whereas significance is commonly derived as the probability that the observed split or any more extreme split would be encountered; i.e., it represents the p-value. This derivation explains why information gain is commonly used to determine the best among different possible splits, whereas significance determines whether a particular split should be done at all.

In our algorithm, significance is calculated on a different subset of the data than information gain to get a statistically sound estimate. The training set is split into two parts, with two-thirds of the data being used to determine information gain and one-third to test significance through Fisher's exact test, where the two-sided test was approximated by twice the value of the one-sided test [15]. A two-sided version was also implemented, but test runs showed that it was slower and did not lead to higher accuracy. An attribute is considered relevant only if it leads to a split that is significant, e.g., at the 1% level. The full set is then taken to determine the predicted class label through majority vote.
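As an illustration of the two criteria on a 2x2 split table (class counts for a = 0 and a = 1), the sketch below computes information gain and a doubled one-sided Fisher p-value using scipy.stats.fisher_exact; the table values and function names are made up for demonstration and do not reproduce the authors' code:

```python
import math
from scipy.stats import fisher_exact

def info(p, n):
    # entropy of a binary class distribution with p positive and n negative points
    total = p + n
    if p == 0 or n == 0:
        return 0.0
    return -(p / total) * math.log2(p / total) - (n / total) * math.log2(n / total)

def info_gain(table):
    # table = [[p0, n0], [p1, n1]]: class counts for a = 0 and a = 1
    (p0, n0), (p1, n1) = table
    total = p0 + n0 + p1 + n1
    before = info(p0 + p1, n0 + n1)
    after = ((p0 + n0) / total) * info(p0, n0) + ((p1 + n1) / total) * info(p1, n1)
    return before - after

def significant(table, alpha=0.01):
    # one-sided Fisher p-value in the direction of the observed association,
    # doubled to approximate the two-sided test
    _, p_greater = fisher_exact(table, alternative='greater')
    _, p_less = fisher_exact(table, alternative='less')
    return 2 * min(p_greater, p_less) < alpha

split = [[40, 10], [8, 42]]
print(info_gain(split), significant(split))
```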