Improving the Performance of Data Cube Queries Using Familiesof Statistics Trees
1
Joachim Hammer and Lixin Fu
Computer & Information Science & Engineering
University of Florida
Gainesville, Florida 32611-6120
{jhammer, lfu}@cise.ufl.edu
1
Abstract. We propose a new approach to speeding up the evaluation of cube queries, an important class of OLAP queries which return aggregated values rather than sets of tuples. Specifically, in this paper we focus on those cube queries which navigate over the abstraction hierarchies of the dimensions. Abstraction hierarchies are the result of partitioning dimension values of a data cube into different levels of granularity. Our approach is based on our previous version of CubiST (Cubing with Statistics Trees) which represents a drastic departure from existing approaches in that it does not use the familiar view lattice to compute and materialize new views from existing views. Instead, CubiST computes and stores all possible aggregate views in the leaves of a statistics tree during a one-time scan of the detailed data. However, it uses a single statistics tree to answer all possible cube queries. The new version described here remedies this limitation by applying a greedy strategy to select a family of candidate trees which represent superviews over the different hierarchical levels of the dimensions. In addition, we have developed an algorithm to materialize the candidate trees in the family starting from the single statistics tree (base tree). Given an input query, our new query evaluation algorithm selects the smallest tree in the family which can provide the answer. This significantly reduces I/O time and improves in-memory performance over the original CubiST. Experimental evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems.
1Introduction
Most OLAP queries involve aggregates of measures over arbitrary regions of the dimensions. This ad-hoc “navigation” through the data cube helps users obtain a “big picture” view of the underlying data. However, on any but the smallest databases, these kinds of queries will tax even the most sophisticated query processors since the database size often grows to hundreds of GBs or TBs and the data in the tables is of high dimensionality with large domain sizes.
In [6], Fu and Hammer describe an efficient algorithm (CubiST) for evaluating ad-hoc OLAP queries using so-called Statistics Trees (STs). CubiST can be used to evaluate queries that return (aggregate) values (e.g., What is the percent difference in car sales between the southeast and northeast regions for the years 1990 through 2000?) rather than the records that satisfy the query (e.g., Find all car dealers in the southeast who sold more cars in 1999 than in 1998). We have termed the former type of query cube query. However, the original CubiST algorithm in [6] does not efficiently support complex cube queries which involve abstraction hierarchies for multiple dimensions. Abstraction hierarchies are the result of partitioning dimension values of a data cube into different levels of granularity (e.g., days into weeks, quarters, and years).
In this paper, we present a new methodology called CubiST++ to evaluate ad-hoc cube queries which is based on the original version of CubiST but greatly expands its usefulness and efficiency for all types of cube queries. CubiST++ specifically improves the evaluation of cube queries involving arbitrary navigation over the abstraction hierarchies in a data cube, an important and frequent class of queries. Instead of using a single (base) tree to compute all cube queries, our algorithms materialize a family of trees from which CubiST++ selects potentially smaller derived trees to compute the answer. This reduces memory requirements, I/O, and improves processing speed.
2Related Research
Research related to this work falls into three broad categories: OLAP servers including ROLAP and MOLAP, indexing and view materialization.
ROLAP servers use the familiar “row-and-column view” to store the data in relational tables using a star or snowflake schema design [4]. In the ROLAP approach, OLAP queries are translated into relational queries using the standard relational operators such as selection, projection, relational join, group-by, etc. However, directly executing translated SQL can be very inefficient. As a result, many commercial ROLAP servers extend SQL to support important OLAP operations directly inside the relational engine (e.g., RISQL from Redbrick Warehouse [24], the cube operator in Microsoft SQL Server [17]). MicroStrategy [18], Redbrick [23], and Informix's Metacube [12] are examples of ROLAP servers.
MOLAP servers use proprietary data structures such as multidimensional arrays to store the data cubes. Arbor software's Essbase [2], Oracle Express [21] and Pilot LightShip [22]are based on MOLAP technology. When the number of dimensions and their domain sizes increase, the data becomes very sparse. Storing sparse data in an array directly is inefficient. A popular technique to deal with sparse data is chunking [27] in which the full cube (array) is chunked into small pieces called cuboids.
Often specialized index structures are used in conjunction with OLAP servers to improve query performance. When the domain sizes are small, a bitmap index structure [20] can be used to help speed up OLAP queries. However, simple bitmap indexes are not efficient for large-cardinality domains and large range queries. In order to overcome this deficiency, an encoded bitmap scheme has been proposed [3]. A well-defined encoding can reduce the complexity of the retrieve function thus optimizing the computation. However, designing well-defined encoding algorithms remains an open problem. A good alternative to encoded bitmaps for large domain sizes is the B-Tree index structure [5]. O’Neil and Quass [19] provide an excellent overview of and detailed analyses for index structures which can be used to speed up OLAP queries.
View materialization refers to the pre-computing of partial query results used to derive the answer for frequently asked queries. Since it is impractical to materialize all possible views, view selection is an important research problem. For example, [11] introduced a greedy algorithm for choosing a near-optimal subset of views from a view materialization lattice. The computation of the materialized views, some of which depend on previously materialized views in the lattice, can be expensive when the views are stored on disk. More recently [9, 13, 14], for example, developed various algorithms for view selection in data warehouse environments.
Another optimization to processing OLAP queries using view materialization is to pipeline and overlap the computation of group-by operations to amortize the disk reads, as proposed by [1]. Other related research in this area has focused on indexing pre-computed aggregates [25] and incrementally maintaining them [16]. Also relevant is the work on maintenance of materialized views (see [15] for a summary of excellent papers) and processing of aggregation queries [8, 26]. However, in order to be able to support truly ad-hoc OLAP queries, indexing and pre-computation of results alone will not produce good results. For example, building an index for each attribute of the warehouse or pre-computing every sub-cube requires too much space and results in unacceptable maintenance costs. We believe that our approach which is based on a new data structure, view selection and query answering algorithm holds the answer to how to support ad-hoc OLAP queries efficiently.
3Answering Cube Queries Using Statistics Trees
In this Section, we briefly review CubiST and the statistics tree (ST) data structure on which our new approach is based. Complete details can be found in [6]. An ST tree is a multi-way, balanced tree whose leave nodes hold the aggregate values for a set of records consisting of one or more attributes. By aggregate value we are referring to the result of applying an aggregation function (e.g., count, min, max) to each combination of dimension values in the data set. Leaf nodes are connected to facilitate retrieval of multiple aggregates. Each level in the tree (except the leaf level) corresponds to an attribute. An internal node has one pointer for each domain value, and an additional “star” pointer representing aggregation along the attribute (the star has the intuitive meaning “any” analogously to the star in Gray’s cube operator [7]). Internal nodes contain no data values.
In order to represent a data cube, an empty ST is created which has as many levels as there are dimensions in the cube (plus one for the leaves). The size of the nodes on each level corresponds to the cardinality of the dimension (plus one for the star pointer). Initially, the values in the leave nodes are set to zero. Populating the tree is done during a one-time scan of the data set: for each record in the set, the tree is traversed based on the dimension values in the tuple and the leaf node at the end of the path is updated accordingly (e.g., incremented by one in the case of the count function).
Figure 1 depicts a statistics tree corresponding to a data cube with three dimensions with cardinalities d1=2, d2=3, and d3=4 respectively. The contents of the tree are shown after inserting the record (1,2,4) into the empty tree. Note that the leaf nodes are storing the result of applying the aggregation function count to the data set, i.e., counting all possible combination of domain values. Further note that each tuple results in updates to eight different leaf nodes: at each node the algorithm follows the pointer corresponding to the data value as well as the star pointer, resulting in 2n paths, where n is the height of the tree. The update paths relating to the sample record are shown as solid thick lines in Figure 1.
Figure 1: Statistics tree after processing input record (1,2,4)
Once all the records are inserted, the ST is ready to answer queries. An ad-hoc cube query can be viewed as an aggregation of the measure attributes over a region defined by a set of dimensional constraints. Formally, a cube queryq is a tuple defining the regions of the cube on which to aggregate on: , where agg describes the aggregation function to be used, {i1, i2, …, ir} {1,2, …,k},r is the number of dimensions specified in the query, and k is the total number of dimensions in the data set. Componentdescribes a constraint for dimension ij,specifying the selected values which can either be a singleton, a range, or a partial set. For instance, query q = count(s1=4, s4=[1,3], s5={1,3}), specifies three constraints on dimensions 1, 4, and 5 of the data set: “4” states that the selected value of the first dimension must be the singleton value 4, “[1,3]” states that the desired values for the fourth dimension fall in the range from 1 to 3, and “{1,3}” states that the selected values for the fifth dimension must be either 1 or 3 (partial selection). The function count specifies that the total number of records satisfying the constraints be returned as result. In order to satisfy the above query, CubiST uses the constraint values in q to follow the paths to the desired leaves in the underlying ST. In the case of s1=4 this means following the fourth pointer of the node representing the first dimension in the tree and so forth. There is a total of six different paths since the query is composed of 1*3*2=6 partial results, one for each combination of the three dimensions. The partial results have been pre-computed during the creation of the tree (using the function count). The total of the values at the six leaf nodes form the result. Details about CubiST and how it can be integrated with relational warehouse systems are provided in [6].
The original CubiST provides a useful framework for answering cube queries. However, CubiST does NOT efficiently answer queries on data cubes containing hierarchies on the dimensional values. In those cases, one must first transform the conditions of the query which are expressed using different levels of abstraction into conditions not involving hierarchies, and then treat them as range queries or partial queries. Given the amount of data to be stored in the tree, the underlying ST may be large and may not fit into memory all at once. In order to overcome this problem we have developed CubiST++ ("CubiST plus Family") to automatically support abstraction hierarchies in the underlying data set and to alleviate the memory problems of its predecessor. In the next sections we describe a suite of algorithms for generating families of ST's and for using them to efficiently compute ad-hoc cube queries.
4Generating Families of Statistics Trees
One way to reduce the cost of paging STs in and out of memory is to transmit only the leaves of the ST. The internal nodes can be generated in memory without additional I/O. A more effective approach is to partition an ST into multiple, smaller subtrees, each representing a certain part of the data. In most cases, cube queries can be answered using one of the smaller subtrees rather than the single large ST. We term this single large ST as defined in the original version of CubiST base tree. Using this base tree, one can compute and materialize other statistics trees with the same dimensions but containing different levels of the abstraction hierarchy for one or more of the dimensions. We term these smaller trees derived trees. A base tree and its derived trees form a family of statistics trees with respect to the base tree. Before we describe CubiST++, we first show how to formally represent hierarchies.
4.1Hierarchical Partitioning and Mapping of Domain Values
In relational systems, dimensions containing hierarchies are stored in the familiar row-column format in which different levels are represented as separate attributes. In our new framework, we regard the interdependent levels in a hierarchy as one hierarchical attribute and map the values at different levels into integers. In this representation, a value at a higher level of abstraction includes multiple values at a lower level. By scanning a dimension table, we can establish the mappings, hierarchical partitions, as well as the inclusion relationships.
To illustrate, consider Table 1 which represents the location information in a data source. We first group the records by columns Region, State and City, remove duplicates, and then map the domain values into integers using the sorted order. The result is shown in Table 2 (numbers in parentheses are the integer assignments for the strings).
Table 1: Original location data Table 2: Partitioning and mapping of location data
Table 3 is created by replacing the strings in Table 1 with their integer assignments. Mapping strings into integers saves space and facilitates data processing internally. The compressed representation of the transformed tables reduces the runtime of join operations during the initialization phase of the ST since most relational warehouses store measures and dimension values in separate tables. Note that we only need to store tables 2 and 3 because they allow us to deduce the contents of Table 1.
Table 3: Location data after transformation
4.2A Greedy Algorithm for Selecting Family Members
To generate a family of statistics tree, we first need an algorithm for choosing from the set of all possible candidate trees (i.e., trees containing all combinations of dimensions and corresponding hierarchy levels) those trees that will eventually make up the family. Second, we need an algorithm for deriving the selected trees from the base tree.
Potentially, any combination of hierarchy levels can be selected as a potential tree. However, we need to consider space limitations and maintenance overhead. We introduce a greedy algorithm to choose the members in a step-by-step fashion. During each step starting from the base tree, we roll-up the values of the largest dimension to its next level in the hierarchy, keeping other dimensions the same. This newly formed ST becomes a new member of the family. The iteration continues until each dimension is rolled-up to its highest level or until the size of the family exceeds the maximum available space. The pseudo-code for this greedy selection mechanism is shown in Figure 2. The total size of the derived trees is only a small fraction of the base tree. In line 1, the original ST initialization algorithm proposed in [6] is used. The roll-up algorithm of line 7 is discussed in Section 4.3.
Figure 2: Greedy algorithm for generating a family of statistics trees
Let us illustrate with an example. Suppose a base tree T0 represents a data set with three dimensions d1, d2, d3whose cardinalities are 100 each. Each dimension is organized into ten groups of ten elements to form a two-level hierarchy. We use the superscripted hash mark to indicate values corresponding to the second hierarchy level. For example, 0# represents values 0 through 9, 1# represents values 10 through 19 etc. The degrees of internal nodes of T0 are 101 for each level of the tree (the extra 1 accounts for the star pointer). The selected derived trees T1, T2, T3 that make up a possible family together with T0 are as follows: T1 is the result of rolling up d2 in T0 (note since all three dimensions are of the same size, we pick one at random). T1 has degrees 101, 101 and 11. T2 is the result of rolling up d1in T1. Its degrees are 101, 11, and 11. T3 is the result of rolling up d0in T2. Its degrees are 11, 11, and 11. The family generated from T0 is shown in Figure 3.