PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo
Google, Inc.
[bpanda, jsherbach, sugato]@google.com, bayardo@alum.mit.edu
ABSTRACT plexities such as data partitioning, scheduling tasks across many machines, handling machine failures, and performing inter-machine communication. These properties have motivated many technology companies to run MapReduce frameworks on their compute clusters for data analysis and other data management tasks. MapReduce has become in some sense an industry standard. For example, there are open source implementations such as Hadoop that can be run either in-house or on cloud computing services such as
Amazon EC2.1 Startups like Cloudera2 offer software and services to simplify Hadoop deployment, and companies including Google, IBM and Yahoo! have granted several universities access to Hadoop clusters to further cluster computing research.3
Despite the growing popularity of MapReduce [12], its application to certain standard data mining and machine learning tasks remains poorly understood. In this paper we focus on one such task: tree learning. We believe that a tree learner capable of exploiting a MapReduce cluster can effectively address many scalability issues that arise in building tree models on massive datasets. Our choice of focusing on tree models is motivated primarily by their popularity.
Tree models are used in many applications because they are interpretable, can model complex interactions, and can handle both ordered and unordered features. Recent studies have shown that tree models, when combined with ensemble techniques, provide excellent predictive performance across a wide variety of domains [8, 9].
This paper describes our experiences with developing and deploying a MapReduce based tree learner called PLANET, which stands for Parallel Learner for Assembling Numerous
Ensemble Trees. The development of PLANET was motivated by a real application in sponsored search advertising in which massive clickstreams are processed to develop a predictor of user experience following the click of a sponsored search ad [30]. We show how PLANET can be scaled effectively to large datasets, describe experiments that highlight the performance characteristics of PLANET, and demonstrate the benefits of various optimizations that we implemented within the system. We show that while MapReduce is not a panacea, it still provides a powerful basis on which scalable tree learning can be implemented.
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google’s computing infrastructure is based on commodity hardware.
In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLA-
NET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real world learning task from the domain of computational advertising.
1. INTRODUCTION
In this paper, we look at leveraging the MapReduce distributed computing framework for a complex data mining task of wide interest: learning ensembles of classification or regression trees. While there are other methods for parallel and distributed tree learning, building production-ready implementations remains complex and error-prone. With the wide and growing availability of MapReduce-capable compute infrastructures, it is natural to ask whether such infrastructures may be of use in parallelizing common data mining tasks such as tree learning. For many data mining operations, MapReduce may offer better scalability with vastly simplified deployment in a production setting.
MapReduce is a simple model for distributed computing that abstracts away many of the difficulties in parallelizing data management operations across a cluster of commodity machines. MapReduce reduces, if not eliminates, many com-
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data
Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
The rest of the paper is organized as follows. In Section
2 we describe the necessary background on which we build,
1
2
3For example, see
and http://www.nsf.gov/news/news summ.jsp?cntn id=111470
VLDB ‘09, August 24-28, 2009, Lyon, France
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00. Algorithm 1 InMemoryBuildNode including the formal problem definitions of classification and regression. We also review the process of solving these problems through tree induction, and describe the MapReduce paradigm for distributed computation. As a prelude to a more detailed description of our approach, in Section 3 we provide an example of how tree induction proceeds in PLA-
NET. This example describes the roles of each of the major components as well as their high level requirements. Section
4 provides a more formal algorithm description for the case of learning a single classification or regression tree, and Section 5 describes how PLANET can be generalized to produce ensembles of trees via boosting and/or bagging. In Section
6 we discuss several important details we had to address in our efforts to develop an efficient and production-ready deployment. We describe the performance of PLANET on our sponsored search derived clickstream dataset in Section 7.
We review related work in Section 8 and conclude with a discussion of future work in Section 9.
Require: Node n, Data D ⊆ D∗
1: (n →split,DL,DR)=FindBestSplit(D)
2: if StoppingCriteria(DL) then
3: n →left prediction=FindPrediction(DL)
4: else
5:
InMemoryBuildNode(n →left,DL)
6: if StoppingCriteria(DR) then
7:
8: else
9: InMemoryBuildNode(n →right,DR) n →right prediction=FindPrediction(DR)
In our example tree model, predicate evaluations at nonleaf nodes have only two outcomes, leading to binary splits.
While tree models can have non-binary splits, for the sake of simplicity we will focus on binary splits only for the remainder of this paper. All our techniques also apply to tree algorithms with non-binary splits with straightforward modifications.
2. PRELIMINARIES
Let X = {X1, X2, . . . XN } be a set of attributes with do-
Tree models are popular because they are interpretable, capable of modeling complex classification and regression tasks, and handle both ordered and categorical domains.
Recent work by Caruana et al. [9] has also shown that tree models, when combined with ensemble learning methods like bagging [4], boosting [14], and forests [5], outperform many other popular learning methods in terms of prediction accuracy. A thorough discussion of tree models and different ensemble methods is beyond the scope of this paper — see [29] for a good review. mains DX , DX , . . . DX respectively. Let Y be an output
12
Nwith domain DY . Consider a dataset D∗ = {(xi, yi)|xi ∈
DX × DX × . . . DX , yi ∈ DY } sampled from an unknown
12
Ndistribution, where the ith data vector xi has an output yi associated with it. Given the dataset D∗, the goal in supervised learning is to learn a function (or model) F :
DX × DX × . . . DX → DY that best approximates the 12
Ntrue distribution of D∗. If DY is continuous, the learning problem is a regression problem; if DY is categorical, it is a classification problem.
2.2 Learning Tree Models
Let L be a function that quantifies in some way the discrepancy between the function prediction F(xi) on xi and the actual output yi. A model that minimizes the net loss
Previous work on learning tree models is extensive. For a given training dataset D∗, finding the optimal tree is known to be NP-Hard; thus most algorithms use a greedy top-down approach to construct the tree (Algorithm 1) [13]. At the root of the tree, the entire training dataset D∗ is examined to find the best split predicate for the root. The dataset is then partitioned along the split predicate and the process is repeated recursively on the partitions to build the child nodes.
Finding the best split predicate for a node (Line 1) is the most important step in the greedy learning algorithm, and has been the subject of much of the research in tree learning. Numerous techniques have been proposed for finding the right split at a node, depending on the particular learning problem. The main idea is to reduce the impurity (I) in a node. Loosely defined, the impurity at a node is a measure of the dissimilarity in the Y values of the training records D that are input to the node. The general strategy is to pick a predicate that maximizes I(D) − (I(DL) + I(DR)), where
DL and DR are the datasets obtained after partitioning D on the chosen predicate. At each step the algorithm greedily partitions the data space to progressively reduce region impurity. The process continues until all Y values in the input dataset D to a node are the same, at which point the algorithm has isolated a pure region (Lines 2-3 and 6-7). Some algorithms do not continue splitting until regions are completely pure, and instead stop once the number of records in D falls below a predefined threshold.
P
L(F(xi), yi) on the training set D∗ may not
∗
(x ,y )∈D iigeneralize well (have low loss) when applied to future data [32].
Generalization is attained through controlling model complexity by various methods, e.g., pruning and ensemble learning for tree models [5]. The learned model is evaluated by measuring its net loss when applied to a holdout data set.
2.1 Tree Models
Classification and regression trees are one of the oldest and most popular data mining models [13]. Tree models represent F by recursively partitioning the data space DX
×
1
DX × . . . DX into non-overlapping regions, with a simple model in each region.
2
N
Figure 1 shows an example tree model. Non-leaf nodes in the tree define region boundaries in the data space. Each region boundary is represented as a predicate on an attribute in X. If the attribute is ordered, the predicate is of the form
X v, v ∈ DX (e.g., Node A in Figure 1). Unordered attributes have predicates of the form X ∈ {v1, v2, . . . vk}, v1 ∈ DX , v2 ∈ DX , . . . vk ∈ DX , (e.g., Node B in Figure 1).
The path from the root to a leaf node in the tree defines a region. Leaf nodes (e.g., the left child of A in Figure 1), contain a region prediction which in most cases is a constant value or some simple function. To make predictions on an unknown x, the tree is traversed to find the region containing x. The region containing x is the path from the root to a leaf in the tree along which all non-leaf predicates are true when evaluated on x. The prediction given by this leaf is used as the value for F(x).
Popular impurity measures that have been proposed are derived from measures such as entropy, Gini index, and variance [29], to name only a few. PLANET uses an impurity measure based on variance (V ar) to evaluate the quality of a split. The higher the variance in the Y values of a node, the greater the node’s impurity. Further details on the split criteria are discussed in Section 2.3. While we focus concretely on variance as our split criteria for the remainder of this presentation, as long as a split metric can be computed on subsets of the training data and later aggregated,
PLANET can be easily extended to support it. is sorted along Xi, and a split point is considered between each adjacent pair of values for Xi in the sorted list.
• For unordered domains, split predicates are of the form
Xi ∈ {v1, v2, . . . vk}, where {v1, v2, . . . vk} ∈ P(DX ), ithe power set of DX . Breiman [6] presents an algoirithm for finding the best split predicate for a categorical attribute without evaluating all possible subsets of 2.2.1 Scalability Challenge
DX . The algorithm is based on the observation that the optimal split predicate is a subsequence in the list i
The greedy tree induction algorithm we have described is simple and works well in practice. However, it does not scale well to large training datasets. FindBestSplit requires a full scan of the node’s input data, which can be large at higher levels of the tree. Large inputs that do not fit in main memory become a bottleneck because of the cost of scanning data from secondary storage. Even at lower levels of the tree where a node’s input dataset D is typically much smaller than D∗, loading D into memory still requires reading and writing partitions of D∗ to secondary storage multiple times.
Previous work has looked at problem of building tree models from datasets which are too large to fit completely in main memory. Some of the known algorithms are disk-based approaches that use clever techniques to optimize the number of reads and writes to secondary storage during tree construction (e.g., [26]). Other algorithms scan the training data in parallel using specialized parallel architectures (e.g.,
[3]). We defer a detailed discussion of these approaches and how they compare to PLANET to Section 8. As we will show in Section 8, some of the ideas used in PLANET have been proposed in the past; however, we are not aware of any efforts to build massively parallel tree models on commodity hardware using the MapReduce framework. of values for Xi sorted by the average Y value.
StoppingCriteria(D): A node in the tree is not expanded if the number of records in D falls below a threshold. Alternatively, the user can also specify the maximum depth to which a tree should be built.
FindPrediction(D): The prediction at a leaf is simply the average of the all the Y values in D.
2.4 MapReduce
PLANET uses MapReduce [12] to distribute and scale tree induction to very large datasets. MapReduce provides a framework for performing a two-phase distributed computation on large datasets, which in our case is the training dataset D∗. In the Map phase, the system partitions D∗ into a set of disjoint units which are assigned to workers, known as mappers. In parallel, each mapper scans through its assigned data and applies a user-specified map function to each record. The output of the user’s map function is a set of hkey, valuei pairs which are collected for MapReduce’s
Reduce phase. In the reduce phase, the key-value pairs are grouped by key and are distributed to a series of workers, called reducers. Each reducer then applies a user-specified reduce function to all the values for a key and outputs a final value for the key. The collection of final values from all of the reducers is the final output of MapReduce.
Post-pruning learned trees to prevent overfitting is also a well studied problem. However, with ensemble models (Section 5), post pruning is not always needed. Since PLANET is primarily used for building ensemble models, we do not discuss post pruning in this paper.
2.3 Regression Trees
3. EXAMPLE
Regression trees are a special case of tree models where the output attribute Y is continuous [5]. We focus primarily on regression trees within this presentation because most of our use cases require predictions on continuous outputs.
Note that any regression tree learner also supports binary
(0-1) classification tasks by modeling them as instances of logistic regression. The core operations of regression tree learning in Algorithm 1 are implemented as follows:
The PLANET framework breaks up the process of constructing a tree model into a set of MapReduce tasks. Dependencies exist between the different tasks, and PLANET uses clever scheduling methods to efficiently execute and manage them. Before delving into the technical details of the framework, we begin with a detailed example of how tree induction proceeds in PLANET.
The example introduces the different components in PLA-
NET, describes their roles, and provides a high level overview of the entire system. To keep the example simple we only discuss the construction of a single tree. The method extends naturally to ensembles of trees, as we discuss in Section 5.
FindBestSplit(D): In a regression tree, D is split using the predicate that results in the largest reduction in variance. Let V ar(D) be the variance of the output attribute
Y measured over all records in D. At each step the tree learning algorithm picks a split which maximizes
Example setup: Let us assume that we have a training dataset D∗ with 100 records. Further assume that tree induction stops once the number of training records at a node falls below 10. Let the tree in Figure 1 be the model that will be learned if we ran Algorithm 1 on a machine with suf-
ficient memory. Our goal in this example is to demonstrate how PLANET constructs the tree in Figure 1 when there is a memory constraint limiting Algorithm 1 to operating on inputs of size 25 records or less.
|D| × V ar(D) − (|DL| × V ar(DL) + |DR| × V ar(DR)), (1) where DL ⊂ D and DR ⊂ D are the training records in the left and right subtree after splitting D by a predicate.
Regression trees use the following policy to determine the set of predicates whose split quality will be evaluated:
• For ordered domains, split predicates are of the form
Xi v, for some v ∈ DX . To find the best split, D i• Nodes in InMemQ are processed using MR InMemory.
Recall that nodes in InMemQ have input data sets D that are small enough to fit in memory. Therefore, given a set of nodes N, MR InMemory completes tree induction at nodes in N using Algorithm 1.
X1 v1
A
|D|=90
|D|=10
X2 ∈ {v2, v3}
B
We defer details of the MapReduce jobs to the next section. In the remainder of this section, we will tie the above components together and walk through the example.
0.42266
|D|=45
|D|=45
D
C
3.2 Walkthrough
When tree induction begins, M, MRQ, and InMemQ are all empty. The only node the Controller can expand is the root (A). Finding the split for A requires a scan of the entire training dataset of 100 (≥ 25) records. Since this set is too large to fit in memory, A is pushed onto MRQ and InMemQ stays empty.
|D|=30
|D|=20 |D|=25 |D|=15
GH
EF
After initialization the Controller dequeues A from MRQ and schedules a job MR ExpandNodes({A}, M, D∗). This job computes a set of good splits for node A along with some additional information about each split. Specifically, for each split we compute (1) the quality of the split (i.e., the reduction in impurity), (2) the predictions in the left and right branches, and (3) the number of training records in the left and right branches.
Figure 1: Example Tree. Note that the labels on the nodes (in boxes) are the split predicates, while the labels on the edges are the sizes of the dataset in each branch (|D| denotes the dataset size in that branch in this figure).
The split information computed by MR ExpandNodes gets sent back to the Controller, which selects the best split for node A. In this example, the best split has 10 records in the left branch and 90 records in the right. The selected split information for node A is then added into the ModelFile.
The Controller next updates the queues with new nodes at which split predicates can be computed. The left branch of A has 10 records. This matches the stopping criteria and hence no new nodes are added for this branch. For the right branch with 90 records (≥ 25), node B can be expanded and is pushed onto MRQ.
Tree induction continues by dequeuing node B, and scheduling MR ExpandNodes({B}, M, D∗). Note that for expanding node B we only need the records that went down the right subtree of A, but to minimize book keeping, PLANET passes the entire training dataset to the MapReduce. As we describe in 4.3, MR ExpandNodes uses the current state of the ModelFile to determine the subset of D∗ that will be input to B.
Once the Controller has received the results for the MapReduce on node B and updated M with the split for B, it can now expand both C and D. Both of these nodes get 45 records as input and are therefore pushed on to MRQ. The Controller can now schedule a single MR ExpandNodes({C,
D}, M, D∗) job to find the best splits for both nodes C and D. Note that by expanding C and D in a single step, PLA-
NET expands trees breadth first as opposed to the depth
first process used by the in-memory Algorithm 1.
Once the Controller has the obtained the splits for C and D, it can schedule jobs to expand nodes E, F, G, and H. Of these, H uses 30 records, which still cannot fit in memory, and hence gets added to MRQ. The input sets to E, F, G are small enough to fit into memory and hence tree induction at these nodes can be completed in-memory. The Controller pushes these nodes into the InMemQueue.
3.1 Components
At the heart of PLANET is the Controller, a single machine that initiates, schedules and controls the entire tree induction process. The Controller has access to a compute cluster on which it schedules MapReduce jobs. In order to control and coordinate tree construction, the Controller maintains the following: