Pariseau Correlation in Business Intelligence

William K. Perrizo

Computer Science Department
North Dakota State University
Fargo, ND 58105 USA

Gregory H. Wettstein

Computer Science Department
North Dakota State University
Fargo, ND 58105 USA

ABSTRACT

In this paper, we describe a new correlation measure used in a reliable, high-performance classification system that combines Nearest Neighbor Vote based classification with a vertical data structure (Predicate Tree, or P-tree [1]) for efficient vertical processing. The method was used to produce a solution to the now-famous Netflix problem. On the Probe Dataset provided by Netflix, the method produces a 19% improvement in Root Mean Square Error of rating prediction over the company's current algorithm, CineMatch, which predicts the rating a renter will give to a movie based on the entire history of previous ratings. A solution based on this correlation measure and algorithm has been submitted.

Keywords

Pariseau Correlation, Predicate-Tree, Nearest-Neighbor Classification

1.  INTRODUCTION

Business Intelligence (BI) problems are very interesting data mining problems. Typical BI training data sets are large and extremely unbalanced among classes. The Business Intelligence problem is a classification problem: given a history of customer events (purchases or rentals) and satisfaction levels (ratings), one classifies potential future customer events as to their most likely satisfaction level. This information is then used to suggest to customers what item they might like to purchase or rent next. Clearly this is a central need for any Internet-based company, and indeed for any company seeking to do targeted advertising (all companies?).

The classification training set, or Business Intelligence Data Set (BIDS), usually consists of the past history of customer satisfaction rating events (the rating of a product purchased or rented; in the Netflix case, a 1-5 rating of a rented movie). Thus, at its simplest, the training set looks something like BIDS(customerID, productID, rating), where the features are customerID and productID and the class label is rating. Very often there are other features, such as DateOfEvent. Of course, there are many other single-entity features that might be useful to the classification process, such as customer features (name, address, ethnicity, age, ...) and product features (type, DateOfCreation, creator, color, weight, ...). Together these form a Data Warehouse in the Star Model, in which the central fact file is BIDS and the dimension star points are the Customer and Product feature files. In this paper we keep it simple and work only with BIDS. In the more complex case, the dimension files are usually joined into the central fact file to form the classification training set.
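As a purely illustrative sketch of this schema, the following Python fragment builds a toy BIDS fact table and joins in one customer dimension file to form a wider training set; the table contents and the age feature are invented for illustration and are not the actual Netflix data.

```python
# Toy BIDS fact table and one dimension file (illustrative values only).
import pandas as pd

bids = pd.DataFrame({
    "customerID": [1, 1, 2, 3],
    "productID":  [10, 11, 10, 12],
    "rating":     [4, 2, 5, 3],        # 1-5 satisfaction rating (class label)
})

customers = pd.DataFrame({
    "customerID": [1, 2, 3],
    "age":        [34, 27, 51],        # a single-entity customer feature
})

# In the more complex case, dimension files are joined into the
# central fact file to form the classification training set.
training_set = bids.merge(customers, on="customerID")
print(training_set)
```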

Throughout this paper, it is assumed that the data is in vertical format. Horizontal format is the standard in the industry and is ubiquitous. However, it is the authors' contention (borne out by the success of the method) that vertical formatting is superior for many data mining applications, whereas for standard data processing, horizontal formatting is still best.

Horizontal data formatting simply means that the data on an entity type is collected into horizontal records (of fields), one for each entity instance (e.g., employee records, one for each employee). In order to process horizontal data, scans down the collection (file) of these horizontal records are required. Of course, these horizontal files can be indexed, providing efficient alternate entry points, but scans are still usually necessary (full inversion of the file would eliminate the need for scans, but that is extremely expensive to build and maintain, and nearly impossible for large, volatile horizontal files).

Vertical data formatting simply means that the data on an entity is collected into vertical slices, by feature or by bit position within a feature (or by some other vertical slicing of a coding scheme applied to one or more features of the entity, such as value slicing, in which each individual value in a feature column is bitmapped).
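For concreteness, the following minimal Python sketch shows bit-position slicing: a numeric feature column is decomposed into one vertical bit column per bit position. This is only an illustration of the formatting idea, not the authors' implementation.

```python
# Decompose a feature column into vertical bit slices (MSB first).
def bit_slices(column, width):
    """slices[i][j] is bit i (counting from the most significant bit)
    of the j-th value in the column."""
    return [[(v >> (width - 1 - i)) & 1 for v in column]
            for i in range(width)]

ratings = [5, 3, 1, 4]                   # values fit in 3 bits
print(bit_slices(ratings, width=3))      # [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]]
```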

In this BIDS classifier, variant forms of Nearest Neighbor Vote based classification were combined with a vertical data structure (Predicate Tree, or P-tree) for efficient processing. Even for small data sets, such classification would otherwise require a large number of scans of a horizontal training database to arrive at an acceptable solution. The use of a vertical data structure (the P-tree) provides acceptable computation times, as opposed to scanning horizontal record data sets. Each component of the proposed solution is discussed in subsequent sections of this paper.

2.  BACKGROUND

2.1  Similarity and Relevance Analysis

Classification models that consider all attributes equally, such as classical K-Nearest-Neighbor classification (KNN), work well when all attributes are similar in their relevance to the classification task [14]. This is, however, often not the case. The problem is particularly pronounced in situations with a large number of attributes, some of which are known to be irrelevant. Many solutions have been proposed that weight dimensions according to their relevance to the classification problem. The weighting can be derived as part of the algorithm [5]. In an alternative strategy, the attribute dimensions are scaled, using an evolutionary algorithm to optimize the classification accuracy of a separate algorithm such as KNN [21].

In an effort to reduce the number of classification voters and the dimension of the vector space over which class votes are calculated, judicious customer (also called "user" in this paper) and item (also called "movie" in this paper) selection proved extremely important. That is to say, if one lets all neighboring users vote over all neighboring movies, the resulting predicted class (rating) turns out to be grossly in error. It is also important to note that the vertical representation facilitates efficient attribute relevance analysis in situations with a large number of attributes. Relevance of a particular attribute with respect to the class attribute can be computed by reading only those two data columns, as opposed to reading all of the long rows of data in a horizontal approach.
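The following sketch illustrates the point that relevance of one attribute to the class requires reading only two vertical columns. Plain Pearson correlation is used here purely as a stand-in relevance measure (the paper's own measure differs); `statistics.correlation` requires Python 3.10+.

```python
# Relevance of one attribute to the class: only two columns are read.
from statistics import correlation      # Python 3.10+

attribute = [1, 2, 3, 4, 5]              # one vertical feature column
klass     = [1, 2, 2, 4, 5]              # the class-label column
print(correlation(attribute, klass))     # ~0.99: a highly relevant attribute
```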

2.2  P-tree Data Structure

The BIDS data set was structured into vertical Predicate-trees, or P-trees. A P-tree is a lossless, compressed, data-mining-ready vertical data structure, used for fast computation of counts and for masking specific phenomena. This vertical representation consists of set structures representing the data column-by-column rather than row-by-row (as in horizontal relational data). Predicate-trees are one choice of vertical data representation that can be used for data mining instead of the more common sets of relational records. The structure has been successfully applied in data mining applications ranging from classification and clustering with K-Nearest Neighbor, to classification with Decision Tree Induction, to Association Rule Mining [12][7][19][1][22][2][13][18][24][6].

A basic P-tree represents one attribute bit, reorganized into a tree structure by recursive sub-division, recording the predicate truth value for each division. Each level of the tree contains truth-bits representing sub-trees, which can then be used for phenomena masking and fast computation of counts. Construction continues recursively down each tree path until downward closure is reached. For example, if the predicate is "purely 1 bits", downward closure is reached when purity is attained (either purely 1 bits or purely 0 bits); a tree branch terminates when a sub-division is reached that is entirely pure (which may or may not be at the leaf level). These basic P-trees and their complements are combined using Boolean operators such as AND (&), OR (|), and NOT (') to produce mask P-trees for individual values, individual tuples, value intervals, tuple rectangles, or any other attribute pattern [3]. The root count of any P-tree indicates the occurrence count of its pattern. The P-tree data structure thus provides a means of counting patterns in an efficient, highly scalable manner.
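The following minimal Python sketch illustrates the construction idea with the "purely 1 bits" predicate and a binary fan-out (practical P-trees typically use a larger fan-out and compressed node encodings); it is an illustration of the principle, not the authors' implementation.

```python
# Build a basic P-tree over one bit column; branches terminate at purity.
def build_ptree(bits):
    if all(b == 1 for b in bits):
        return ("pure1", len(bits))      # purely 1 bits: downward closure
    if all(b == 0 for b in bits):
        return ("pure0", len(bits))      # purely 0 bits: downward closure
    mid = len(bits) // 2
    return ("mixed", build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(node):
    """Occurrence count of the '1' pattern represented by the (sub)tree."""
    if node[0] == "pure1":
        return node[1]
    if node[0] == "pure0":
        return 0
    return root_count(node[1]) + root_count(node[2])

tree = build_ptree([1, 1, 1, 1, 0, 1, 0, 0])
print(root_count(tree))                  # 5 ones, counted from pure nodes
```

AND, OR, and NOT over two such trees can be implemented by walking the two structures in parallel, short-circuiting whenever either node is pure.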

2.3  Pariseau Correlation Based Nearest Neighbor Pruning

The Pariseau Correlation for each pair of data points in a data set can be computed rapidly using vertical P-trees [2]. For example, for each pair of users (u, v), the Pariseau Correlation (also called the 1-Perpendicular Correlation), PC(u, v), is

$$PC(u,v) = \left[\,\sqrt{\frac{(v - u - \bar{v} + \bar{u}) \circ (v - u - \bar{v} + \bar{u})}{a + nb}}\,\right]^{2}$$

where $\circ$ denotes the inner product of the rating vectors of users u and v across the n selected movies, $\bar{u}$ and $\bar{v}$ are the corresponding mean ratings, and a and b are tunable parameters. This approach avoids the requirement of computing all Euclidean distances between the sample and all other points in the training data when arriving at the Nearest Neighbor set. More importantly, it produces a select set of neighbors which, when allowed to vote for the most likely rating, produces very low error. Later in this paper, the motivation for this correlation measure is explained and a pictorial description is provided.
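A direct transcription of the formula above into Python follows; the tunable parameters a and b are given placeholder values, and the square root followed by squaring is kept only to mirror the published form.

```python
# Pariseau Correlation PC(u, v) per the formula above (sketch).
import math

def pariseau_correlation(u, v, a=1.0, b=1.0):
    """u, v: rating vectors of two users over the same n selected movies."""
    n = len(u)
    u_bar = sum(u) / n                   # mean rating of user u
    v_bar = sum(v) / n                   # mean rating of user v
    d = [vi - ui - v_bar + u_bar for ui, vi in zip(u, v)]
    inner = sum(x * x for x in d)        # (v-u-vbar+ubar) o (v-u-vbar+ubar)
    return math.sqrt(inner / (a + n * b)) ** 2

print(pariseau_correlation([4, 3, 5], [5, 3, 4]))   # 0.5
```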

3.  CLASSIFICATION

3.1  Closed Nearest Neighbor Classification

Most of the classification algorithms applicable to pattern recognition problems are based on the k-nearest-neighbor (KNN) rule [24]. Each training data vector is labeled by the class it belongs to and is treated as a reference vector. During classification, the k nearest reference vectors to the unknown (query) vector X are found, and the class of X is determined by a 'majority rule' vote. Restricting the Nearest Neighbor set to exactly k can lead to wrong classifications if there are many ties for the kth neighbor. In a classical KNN approach the nearest neighborhood is not closed, owing to the algorithmic complexity of finding all neighbors within a given distance in one horizontal pass of the dataset: it takes two passes, one to produce exactly k nearest neighbors and thereby the distance ε of the kth nearest neighbor, and another to find all additional neighbors also at distance ε from the unclassified sample.
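For concreteness, the following sketch shows the 'closed' neighborhood idea in plain horizontal form: once the distance ε of the kth nearest neighbor is known, every training point within ε is kept, ties included. (The P-tree version described next avoids the second pass entirely.)

```python
# Closed k-nearest neighborhood: keep all ties at the k-th distance.
def closed_knn(distances, k):
    """distances: list of (distance, label) pairs; returns the labels of
    every point within the distance of the k-th nearest neighbor."""
    ordered = sorted(distances, key=lambda t: t[0])
    epsilon = ordered[k - 1][0]          # distance of the k-th neighbor
    return [label for d, label in ordered if d <= epsilon]

votes = closed_knn([(1.0, "A"), (2.0, "B"), (2.0, "B"), (3.0, "A")], k=2)
print(votes)                             # ['A', 'B', 'B']: the tie is kept
```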

Using P-tree vertical structures, however, all of these kth-nearest-neighbor ties can be included at no additional processing cost, using the 'closed k Nearest Neighbor' algorithm [12]. Rather than allowing a uniform vote for each neighbor, Gaussian-like weighted votes were used, with the weighting based on the distance from the voting training point to the sample. A training point closer to the test sample thus has a higher influence on the final classification outcome than one that is farther away. Each vote contribution is further adjusted by a weighting derived from the Pariseau Correlations of both the movies (as related to the movie whose rating is to be predicted) and the users (as related to the user whose rating is to be predicted), calculated as above.
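A minimal sketch of Gaussian-like distance weighting of the neighbor votes follows; sigma is a placeholder tuning parameter, and the further per-vote adjustment by movie and user Pariseau Correlations described above is omitted for brevity.

```python
# Gaussian-like distance weighting of neighbor votes (sketch).
import math
from collections import defaultdict

def gaussian_vote(neighbors, sigma=1.0):
    """neighbors: list of (distance, rating); returns the rating with
    the largest distance-weighted vote total."""
    totals = defaultdict(float)
    for d, rating in neighbors:
        totals[rating] += math.exp(-(d * d) / (2 * sigma * sigma))
    return max(totals, key=totals.get)

print(gaussian_vote([(0.5, 4), (1.0, 4), (2.0, 5)]))   # 4: nearer votes dominate
```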

4.  PARAMETER OPTIMIZATION AND CLASSIFICATION RESULTS

The multiple parameters involved in the classification algorithms were optimized via a hill-climbing algorithm. The method has some similarity to the winning method applied by one of the authors to the Association for Computing Machinery Knowledge Discovery and Data Mining Cup in 2006 (KDDCup06) [16], in which the Negative Predictive Value, NPV, was calculated as TN/(TN+FN). That fitness function encourages solutions with high TN, provided that NPV remains within a threshold value. Although the task-specified threshold for NPV was 1.0, with the very low number of negative cases it was unrealistic to expect multiple solutions that met the actual NPV threshold while also maintaining a high TN level. A genetic algorithm (GA) was used so that collections of quality solutions in each generation could potentially influence the set of solutions in the next generation.
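The following fragment sketches the NPV-constrained fitness idea described above: reward high TN, but only while NPV = TN/(TN+FN) meets the threshold. The exact fitness function used in KDDCup06 is not reproduced here.

```python
# NPV-constrained fitness (sketch): maximize TN subject to the NPV threshold.
def fitness(tn, fn, npv_threshold=1.0):
    npv = tn / (tn + fn) if (tn + fn) > 0 else 0.0
    return float(tn) if npv >= npv_threshold else 0.0

print(fitness(tn=95, fn=0))   # 95.0: NPV = 1.00 meets the threshold
print(fitness(tn=95, fn=5))   # 0.0:  NPV = 0.95 falls below it
```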

Since the training data set in the KDDCup case was small, patient-level bootstrapping was used to validate solutions. Here, bootstrapping means taking one sample out of the training set for testing, and repeating until all samples have been used for testing. In this specific task, which involves multiple candidates for the same patient, we removed all candidates for a given patient from the training set when any of them was used as a test sample; a sketch of this hold-out appears after Figure 1. The overall approach of the solution is summarized in the following diagram. Attribute relevance analysis is carried out on the training data to identify the top M attributes. Those attributes are subsequently used to optimize the parameter space with a standard Genetic Algorithm (GA) [9]. The set of weights on the parameters constitutes the solution space that is encoded for the GA. Multiple iterations, with feedback from the solution evaluation, are used to arrive at the best solution. Finally, the optimized parameters are used in the combined classification algorithm to classify the unknown samples.

Figure 1 Overview of the Classification Approach
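As a concrete illustration of the patient-level hold-out described above, the following sketch removes every candidate belonging to the test patient from the training set; the field names are illustrative.

```python
# Patient-level hold-out (sketch): all of a patient's candidates leave
# the training set together when that patient is used for testing.
def patient_level_splits(samples):
    """samples: list of (patientID, features, label) tuples; yields one
    (train, test) pair per distinct patient."""
    for patient in {p for p, _, _ in samples}:
        test  = [s for s in samples if s[0] == patient]
        train = [s for s in samples if s[0] != patient]
        yield train, test

data = [(1, "x1", 0), (1, "x2", 0), (2, "x3", 1)]
for train, test in patient_level_splits(data):
    print(len(train), "train /", len(test), "test")
```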

Attributes with zero weights at the end of the optimization run were omitted. The parameters optimized with the training data were able to achieve over a 19% improvement over the industry-leading Netflix CineMatch algorithm. For a given training data set, once the values are optimized they can be used repeatedly on unknown test data until the training data set is changed.

The solution quality of this approach is evident in its performance against the industry-leading algorithm, Netflix's CineMatch. This compares to the maximum improvement achieved by any of the other nearly 30,000 Netflix Prize competitors (approximately 8.7% as of November 19, 2007), according to the Netflix Leader Board. To further demonstrate the quality of the proposed solution, we applied it to the well-known Fisher's Iris data set [26]. On the linearly separable classification (easy), our approach scores very high, as expected; on the linearly non-separable classification (hard), it also scores very high.

5.  CONCLUSIONS AND FUTURE WORK

In this paper we describe a successful approach to a Business Intelligence classification problem. The approach takes into consideration typical Business Intelligence training data set (BIDS) characteristics, such as large and extremely unbalanced classes, noisy class labels, and a large number of potential descriptive features, including irrelevant and redundant ones. It also takes into consideration the extremely high performance thresholds that proposed systems must meet in order to achieve acceptance. The approach described in this paper was successfully used to submit an entry for the Netflix Problem as described on the company website, netflix.com.

In this classifier, a variant form of Nearest Neighbor Vote based classification was combined with a vertical data structure (Predicate Tree, or P-tree) for efficient processing. The use of the P-tree provided acceptable computation times and very high accuracy for this approach.