Aggregate Function Computation and Iceberg Querying in Vertical Databases

A Thesis

Submitted to the Graduate Faculty

of the

North Dakota State University

of Science and Mathematics

By

Yue Cui

In Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

Major Department:

Computer Science

June 2005

Fargo, North Dakota

ABSTRACT

Cui, Yue, M.S., Department of Computer Science, College of Science and Mathematics, North Dakota State University, June 2005. Aggregate Function Computation and Iceberg Querying in Vertical Databases. Major Professor: Dr. William Perrizo.

Aggregate function computation and iceberg querying are important and common in many applications of data mining and data warehousing because people are usually interested in looking for anomalies or unusual patterns by computing aggregate functions across many attributes or by finding aggregate values above some specified threshold. In this paper, we present efficient algorithms to compute aggregate functions and evaluate iceberg queries using P-trees, an innovative data structure for vertical databases. Our methods perform well for most of the common aggregate functions, such as Count, Sum, Average, Min, Max, Median, Rank, and Top-k, and in some situations they are the best of all. Our algorithms use very little memory and perform quickly because in most cases we pass over the data only once and use logical operations, AND/OR, to implement the entire task.

ACKNOWLEDGMENTS

I would like to thank the following people for making this thesis a success:

  • My adviser, Dr. William Perrizo for his patience, support, and guidance.
  • My husband, Dalong Li for his love, support, and encouragement.
  • My parents and sister for their love and support.

TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ALGORITHMS

CHAPTER 1. INTRODUCTION

1.1. Background

CHAPTER 2. LITERATURE REVIEW

2.1. Aggregate Functions

2.2. Iceberg Queries

CHAPTER 3. Review of P-trees

3.1. Structure of P-trees

3.2. P-tree Operations

3.3. Predicate P-trees

3.3.1. Value P-trees

3.3.2. Inequity P-trees

CHAPTER 4. Aggregate Function Computation using P-trees

4.1. Count Aggregate

4.2. Sum Aggregate

4.3. Average Aggregate

4.4. Max Aggregate

4.5. Min Aggregate

4.6. Median/Rank Aggregate

4.7. Top-k Aggregate

CHAPTER 5. Iceberg Query Operation using P-trees

5.1. Step One

5.2. Step Two

5.3. Step Three

CHAPTER 6. Performance Analysis

6.1. Computing Aggregate Functions

6.2. Computing the Example of Iceberg Query

CHAPTER 7. Conclusion

CHAPTER 8. References

LIST OF TABLES

Table

  1. Relation Sales
  2. Binary Form of Sales
  3. Summary Table of Attribute Loc
  4. Summary Table of Attribute Type
  5. Summary Table of Our Example

LIST OF FIGURES

Figure

  1. Construction of 1-D P-trees
  2. P1-trees for the Transaction Set
  3. AND, OR and NOT Operations
  4. Value P-trees of Attribute Loc
  5. Procedure of Calculating PNY
  6. Value P-trees of Attribute Type
  7. Procedure of Calculating PNY AND Notebook
  8. Procedure of Calculating PMN AND Notebook
  9. Sum Aggregate Performance Time Comparison
  10. Average Aggregate Performance Time Comparison
  11. Min Aggregate Performance Time Comparison
  12. Max Aggregate Performance Time Comparison
  13. Median or Rank-k Aggregate Performance Time Comparison
  14. Top-k Aggregate Performance Time Comparison
  15. Iceberg Query Performance Time Comparison

LIST OF ALGORITHMS

Algorithm

  1. Sum Aggregate
  2. Max Aggregate
  3. Min Aggregate
  4. Median/Rank(K) Aggregates


CHAPTER 1. INTRODUCTION

1.1. Background

Aggregation functions [1] across many attributes are widely used in queries in data mining and data warehousing. The commonly used aggregation functions include Count, Sum, Average, Min, Max, Median, Rank, and Top-k. Among the most common queries in data mining and data warehousing are iceberg queries [2], which perform an aggregate function across attributes and then eliminate aggregate values that are below some specified threshold. Iceberg queries are so called because the number of above-threshold results is often very small (the tip of an iceberg) relative to the large amount of input data (the iceberg). Iceberg queries are also common in many other applications, including information retrieval, clustering, and copy detection [2].

Fast implementation of aggregate functions and iceberg queries is always a challenging task in data warehouses, which may contain millions of records. One method in data warehousing that improves the speed of iceberg queries is to precompute an iceberg cube [3]. An iceberg cube stores all the above-threshold aggregate information of a dataset. When a customer issues an iceberg query, the system, instead of computing the result from the original dataset, looks into the iceberg cube for the answer. This helps accelerate the on-line processing speed. Efficient algorithms for computing iceberg cubes have been widely studied, and much research has been done in this area [3].

Recently, a new data structure, the Peano tree (P-tree) [10], has been introduced for large datasets. The P-tree is a lossless, quadrant-based, compressed data structure that provides fast and efficient logical operations. One of the P-tree variations, the predicate P-tree, is used to reduce data accesses by filtering out “bit holes,” which consist of consecutive 0’s.

In this paper, we investigate how to efficiently implement aggregate functions and iceberg queries using P-trees. We make two contributions: first, we develop efficient algorithms that compute the various aggregate functions using P-trees; second, we illustrate the procedure for implementing iceberg queries using P-trees. In our algorithms and procedures, the main operations are counting the number of 1s in P-trees and performing the logical operations, AND/OR, on P-trees. These operations can be executed quickly by hardware, and their costs are very low. As the experiments section shows, our methods are superior in computing aggregate functions and iceberg queries in many cases.

This paper is organized as follows. In Section 2, we review the literature and introduce the concepts of aggregate functions and iceberg queries. In Section 3, we review the P-tree technology. In Section 4, we give the algorithms for the aggregate functions using P-trees. In Section 5, we give an example of how to implement iceberg queries using P-trees. In Section 6, we describe our experimental results, where we compare our method with the bitmap index. Section 7 concludes the paper.

CHAPTER 2. LITERATURE REVIEW

2.1. Aggregate Functions

Frequently, people are interested in summarizing data to determine trends or produce top-level reports. For example, the purchasing manager may not be interested in a listing of all computer hardware sales, but may simply want to know the number of Notebooks sold in a specific month. Aggregate functions can assist with the summarization of large volumes of data.

Three types of aggregate functions were identified [1]. Consider aggregating a set of tuples, T. Let {Si | i = 1 … n} be any complete set of disjoint subsets of T such that ∪i Si = T and ∩i Si = {}.

Distributive: An aggregate function F is distributive if there is a function G such that F(T) = G({F(Si) | i = 1 … n}). SUM, MIN, and MAX are distributive with G = F. Count is distributive with G = SUM.

Algebraic: An aggregate function F is algebraic if there is an M-tuple valued function G and a function H such that F(T) = H({G(Si) | i = 1 … n}). Average, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic. For Average, the function G records the sum and count. The H function adds these two components and then divides to produce the global average. Similar techniques apply to finding the N largest values, the center of mass of a group of objects, and other algebraic functions. The key to algebraic functions is that a fixed-size result (an M-tuple) can summarize the sub-aggregation.

Holistic: An aggregate function F is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. That is, there is no constant M, such that an M-tuple characterizes the computation F. Median, MostFrequent (also called the Mode), and Rank are common examples of holistic functions.
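To make the distinction concrete, here is a small illustrative Python sketch (not part of the thesis; the partition values are hypothetical). A distributive function combines per-partition results directly, while an algebraic one carries a fixed-size tuple per partition:

```python
def distributive_sum(partitions):
    # F = SUM is distributive: G = F, applied to the per-partition results.
    return sum(sum(p) for p in partitions)

def algebraic_average(partitions):
    # F = AVERAGE is algebraic: G records a 2-tuple (sum, count) per
    # partition; H combines the tuples and divides.
    pairs = [(sum(p), len(p)) for p in partitions]   # G
    total, count = map(sum, zip(*pairs))             # H: combine
    return total / count                             # H: divide

parts = [[10, 5, 6], [7, 11], [9, 3]]   # disjoint subsets Si of T
print(distributive_sum(parts))          # 51
print(algebraic_average(parts))         # 51 / 7 ≈ 7.2857
```

A holistic function such as Median admits no such fixed-size per-partition summary, which is what makes it harder to compute.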

Efficient computation of all these aggregate functions is required in most large database applications. There are many efficient techniques for computing distributive and algebraic functions, but only a few for holistic functions such as Median. In this paper, our method for computing Median is one of the best among all the available techniques.

2.2. Iceberg Queries

Iceberg queries refer to a class of queries that compute aggregate functions across attributes to find aggregate values above some specified threshold. Given a relation R with attributes a1, a2… an, and m, an aggregate function AggF, and a threshold T, an iceberg query has the following form:

SELECT R.a1, R.a2… R.an, AggF (R.m)

FROM relation R

GROUP BY R.a1, R.a2… R.an

HAVING AggF (R.m) >= T

The number of tuples that satisfy the threshold in the HAVING clause is relatively small compared to the large amount of input data. The output can be seen as the “tip of the iceberg,” where the input data is the “iceberg.”

Suppose a purchasing manager is given a sales transaction dataset; he/she may want to know, for every type of product in each local store, the total number of products sold when that total is above a certain threshold T. To answer this question, we can use the iceberg query below:

SELECT Location, Product Type, Sum (# Product)

FROM Relation Sales

GROUP BY Location, Product Type

HAVING Sum (# Product) >= T

To implement an iceberg query, a common strategy in horizontal databases is first to apply hashing or sorting to all the data in the dataset, then to count all of the Location & Product Type pair groups, and finally to eliminate those groups that do not pass the threshold T. But these algorithms can generate significant I/O for intermediate results and require large amounts of main memory, leaving much room for improvement in efficiency. One method is to prune groups using an Apriori-like [8][9] method. But the Apriori-like method is not always simple to apply for every aggregate function. For instance, it is efficient for calculating Sum but of little use for implementing Average [4]. In our example, instead of counting the number of tuples in every Location & Product Type pair group as the first step, we can do the following:

Generate the Location-list: a list of local stores that sell more than T products. For example,

SELECT Location, Sum (# Product)

FROM Relation Sales

GROUP BY Location

HAVING Sum (# Product) >= T

Generate the Product Type-list: a list of categories that sell more than T products. For example,

SELECT Type, Sum (# Product)

FROM Relation Sales

GROUP BY Product Type

HAVING Sum (# Product) >= T

From the knowledge generated above, we can eliminate many of the Location & Product Type pair groups. That is, we generate candidate Location & Product Type pairs only for local stores and product types that appear in the Location-list and the Product Type-list. This approach improves efficiency by pruning many groups beforehand. Then, by ANDing value P-trees, we can calculate the results easily. We will illustrate our method in detail with an example in Section 5.
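The pruning idea can be sketched as follows (a hypothetical Python illustration, not the P-tree implementation; the data mirror the Sales example of Table 1, and the threshold T = 15 is chosen arbitrarily):

```python
from collections import defaultdict

sales = [  # (Location, Type, #Product), mirroring Table 1
    ("New York", "Notebook", 10), ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6),   ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11), ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]
T = 15

def above_threshold(pairs):
    # Sum the quantities per key and keep only keys whose sum passes T.
    sums = defaultdict(int)
    for key, qty in pairs:
        sums[key] += qty
    return {k for k, s in sums.items() if s >= T}

loc_list  = above_threshold((loc, q) for loc, _, q in sales)   # Location-list
type_list = above_threshold((typ, q) for _, typ, q in sales)   # Product Type-list

# Only candidate pairs drawn from the two lists need a full aggregation pass.
pair_sums = defaultdict(int)
for loc, typ, q in sales:
    if loc in loc_list and typ in type_list:
        pair_sums[(loc, typ)] += q
result = {k: s for k, s in pair_sums.items() if s >= T}
print(result)   # {('New York', 'Notebook'): 17}
```

With T = 15 only New York and Minneapolis survive the Location pass and only Notebook survives the Type pass, so just two candidate pairs remain instead of all nine, and a single pair makes the final answer: the tip of the iceberg.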

CHAPTER 3. Review of P-trees

A Peano tree (P-tree) is a lossless, bitwise tree. A P-tree can be 1-dimensional, 2-dimensional, 3-dimensional, etc. In this section, we focus on 1-dimensional P-trees. We give a brief review of the structure and operations of P-trees and introduce two variations of predicate P-trees that will be used in our iceberg query algorithm.

3.1. Structure of P-trees

Given a data set with d attributes, X = (A1, A2 … Ad), and the binary representation of the jth attribute Aj as bj,m bj,m-1 … bj,i … bj,1 bj,0, we decompose each attribute into bit files, one file for each bit position [1]. To build a P-tree, a bit file is recursively partitioned into halves, and each half into sub-halves, until each sub-half is pure (entirely 1-bits or entirely 0-bits).

The detailed construction of P-trees is illustrated by an example in Figure 1. The transaction set is shown in a). For simplicity, assume each transaction has one attribute. We represent the attribute as binary values, e.g., (7)10 = (111)2. Then we vertically decompose them into three separate bit files, one file for each bit, as shown in b). The corresponding basic P-trees, P1, P2 and P3, are constructed by recursive partition and are shown in c), d) and e).

As shown in c) of Figure 1, the root of the P1 tree is 3, which is the 1-bit count of the entire bit file. The second level of P1 contains the 1-bit counts of the two halves, 0 and 3. Since the first half is pure, there is no need to partition it. The second half is further partitioned recursively.

Figure 1. Construction of 1-D P-trees
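The recursive construction just described can be sketched in Python (an illustrative version, not the thesis implementation); the example bit file is chosen so that, as above, the first half is pure and the root count is 3:

```python
def build_ptree(bits):
    # Each node stores the 1-bit count of its segment; a mixed segment
    # (neither all 0s nor all 1s) is partitioned into two halves.
    count = sum(bits)
    node = {"count": count}
    if 0 < count < len(bits):        # mixed: recurse on the two halves
        mid = len(bits) // 2
        node["children"] = [build_ptree(bits[:mid]), build_ptree(bits[mid:])]
    return node                      # pure segments become leaves

bit_file = [0, 0, 0, 0, 1, 1, 1, 0]  # an assumed 8-bit vertical bit file
tree = build_ptree(bit_file)
print(tree["count"])                 # root holds the total 1-bit count: 3
```

The two children of the root carry the counts 0 and 3 of the halves; the pure first half is a leaf, while the second half is partitioned further, mirroring the figure.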

3.2. P-tree Operations

AND, OR and NOT logic operations are the most frequently used P-tree operations. For efficient implementation, we use a variation of P-trees called Pure-1 trees (P1-trees). A subtree is pure-1 if all the values in it are 1’s. A node in a P1-tree is a 1-bit if and only if the corresponding half is pure-1.

Figure 2 shows the P1-trees corresponding to the P-trees in c), d), and e) of Figure 1.

Figure 2. P1-trees for the Transaction Set

The P-tree logic operations are performed level by level, starting from the root level. They are commutative and distributive, since they are simply pruned bit-by-bit operations. For instance, ANDing a pure-0 node with anything results in a pure-0 node, and ORing a pure-1 node with anything results in a pure-1 node. In Figure 3, a) is the ANDing result of P1,1 and P1,2, b) is the ORing result of P1,1 and P1,3, and c) is the result of NOT P1,3 (or P1,3’), where P1,1, P1,2 and P1,3 are shown in Figure 2.

Figure 3. AND, OR and NOT Operations.
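On uncompressed P-trees (plain bit vectors), the three operations reduce to bitwise AND, OR, and complement; the sketch below is illustrative (the vectors are assumed examples, and a compressed implementation would instead work level by level, pruning at pure nodes as described above):

```python
def p_and(a, b): return [x & y for x, y in zip(a, b)]   # bitwise AND
def p_or(a, b):  return [x | y for x, y in zip(a, b)]   # bitwise OR
def p_not(a):    return [1 - x for x in a]              # complement

p1 = [0, 0, 0, 0, 1, 1, 1, 0]   # assumed example bit files
p2 = [1, 0, 1, 0, 1, 0, 1, 0]
print(p_and(p1, p2))   # [0, 0, 0, 0, 1, 0, 1, 0]
print(p_or(p1, p2))    # [1, 0, 1, 0, 1, 1, 1, 0]
print(p_not(p1))       # [1, 1, 1, 1, 0, 0, 0, 1]
```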

3.3. Predicate P-trees

There are many variations of predicate P-trees, such as value P-trees, tuple P-trees, and inequity P-trees. In this section, we describe value P-trees and inequity P-trees, which are used in evaluating range predicates.

3.3.1. Value P-trees

A value P-tree represents a data set X related to a specified value v, denoted by Px=v, where x ∈ X. Let v = bmbm-1…b0, where bi is the ith binary bit value of v. There are two steps to calculate Px=v: 1) get the bit P-tree Pb,i for each bit position of v according to the bit value: if bi = 1, Pb,i = Pi; otherwise, Pb,i = Pi’; 2) calculate Px=v by ANDing all the bit P-trees of v, i.e., Px=v = Pb,1 ∧ Pb,2 ∧ … ∧ Pb,m, where ∧ denotes the AND operation. For example, to get the value P-tree satisfying x = 101 in Figure 1, we have Px=101 = Pb,3 ∧ Pb,2 ∧ Pb,1 = P3 ∧ P2’ ∧ P1.
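As an illustrative sketch (not the thesis code), the two steps can be carried out on uncompressed P-trees; here we apply them to the bit files of attribute Type from Table 2 in Chapter 4, where Notebook is encoded as 001:

```python
def value_ptree(bit_ptrees, v_bits):
    # bit_ptrees[i] is the bit file for one position (high bit first) and
    # v_bits[i] the matching bit of v: use Pi if the bit is 1, Pi' if 0,
    # then AND all the chosen P-trees together.
    result = [1] * len(bit_ptrees[0])
    for ptree, bit in zip(bit_ptrees, v_bits):
        chosen = ptree if bit == 1 else [1 - x for x in ptree]
        result = [r & c for r, c in zip(result, chosen)]
    return result

# Bit files of attribute Type from Table 2 (high bit first):
p2_2 = [0, 0, 1, 0, 0, 0, 1]
p2_1 = [0, 1, 0, 0, 0, 1, 0]
p2_0 = [1, 0, 0, 1, 1, 0, 1]
notebook = value_ptree([p2_2, p2_1, p2_0], [0, 0, 1])  # Type = 001
print(notebook)   # [1, 0, 0, 1, 1, 0, 0]: transactions 1, 4 and 5
```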

3.3.2. Inequity P-trees

An inequity P-tree represents the data points within a data set X satisfying an inequity predicate, such as x > v, x ≥ v, x < v, or x ≤ v. Without loss of generality, we discuss the two inequity P-trees for x ≥ v and x ≤ v, denoted by Px≥v and Px≤v, where x ∈ X and v is a specified value. The calculation of Px≥v and Px≤v is as follows:

Calculation of Pxv: Let x be a data within a data set X, x be a m-bit data, and Pm, Pm-1, … P0 be the P-trees for the vertical bit files of X. Let v=bm…bi…b0, where bi is ith binary bit value of v, and Pxv be the predicate tree for the predicate xv, then Pxv = Pm opm … Pi opi Pi-1 … op1 P0, i = 0, 1 … m, where 1) opi is  if bi=1, opi is  otherwise, and 2) the operators are right binding. Here,  means AND operation,  means OR operation, right binding means operators are associated from right to left, e.g., P2 op2 P1 op1 P0 is equivalent to (P2 op2 (P1 op1 P0)). For example, the inequity tree Px 101 = (P2 (P1 P0)).

Calculation of Pxv: Calculation of Pxv is similar to Calculation of Pxv. Let x be a data within a data set X, x be a m-bit data set, and P’m, P’m-1, … P’0 be the complement P-trees for the vertical bit files of X. Let v=bm…bi…b0, where bi is ith binary bit value of v, and Pxv be the predicate tree for the predicate xv, then Pxv = P’mopm … P’i opi P’i-1 … opk+1P’k, kim, where 1). opi is  if bi=0, opi is  otherwise, 2) k is the rightmost bit position with value of “0”, i.e., bk=0, bj=1,j<k, and 3) the operators are right binding. For example, the inequity tree Px 101 = (P’2 P’1).

We will frequently use value P-trees and inequity P-trees in our calculation of aggregation functions and iceberg queries.

CHAPTER 4. Aggregate Function Computation using P-trees

In this section, we give algorithms that show how to use P-trees to compute various aggregation functions. We illustrate some of these algorithms with examples. For simplicity, the examples are computed on a single attribute. As you will see, our algorithms can easily be extended to evaluate aggregate functions over multiple attributes.

First, we present our example dataset; second, we illustrate the procedure for converting it into P-tree bit files.

Suppose we have a relation S with information about every transaction of a company. First, we transform relation S into binary representation. For numerical attributes, this step is simple: we just change the decimal values into binary numbers. For categorical attributes, there are two steps: first, we translate the categorical values into integers; second, we convert the integers into binary numbers. By changing categorical attribute values into integers, we save a great deal of memory space and, at the same time, make the processing procedure much easier, since our algorithms no longer need to process string values.
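The two-step encoding of categorical attributes can be sketched as follows (illustrative only; the integer codes assigned here are hypothetical and differ from the codes used in Table 2):

```python
def encode_categorical(values, width):
    # Step 1: map each category to an integer code (first-seen order, from 1).
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    # Step 2: write each code as a fixed-width binary string.
    return [format(codes[v], f"0{width}b") for v in values], codes

locs = ["New York", "Minneapolis", "New York", "New York",
        "Minneapolis", "Chicago", "Minneapolis"]
bits, codes = encode_categorical(locs, 3)
print(codes)   # {'New York': 1, 'Minneapolis': 2, 'Chicago': 3}
print(bits)    # ['001', '010', '001', '001', '010', '011', '010']
```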

Table 1. Relation Sales.

Id / Mon / Loc / Type / On line / # Product
1 / Jan / New York / Notebook / Y / 10
2 / Jan / Minneapolis / Desktop / N / 5
3 / Feb / New York / Printer / Y / 6
4 / Mar / New York / Notebook / Y / 7
5 / Mar / Minneapolis / Notebook / Y / 11
6 / Mar / Chicago / Desktop / Y / 9
7 / Apr / Minneapolis / Fax / N / 3

Table 2. Binary Form of Sales.

Id / Mon / Loc / Type / On line / # Product
P0,3 P0,2 P0,1 P0,0 / P1,4 P1,3 P1,2 P1,1 P1,0 / P2,2 P2,1 P2,0 / P3,0 / P4,3 P4,2 P4,1 P4,0
1 / 0001 / 00001 / 001 / 1 / 1010
2 / 0001 / 00101 / 010 / 0 / 0101
3 / 0010 / 00001 / 100 / 1 / 0110
4 / 0011 / 00001 / 001 / 1 / 0111
5 / 0011 / 00101 / 001 / 1 / 1011
6 / 0011 / 00110 / 010 / 1 / 1001
7 / 0100 / 00101 / 101 / 0 / 0011

Next, we vertically decompose the binary transaction table into bit files: one file for each bit position. In relation S, there are 17 bit files in total, so we build 17 basic P-trees for relation S. The detailed decomposition process is discussed in Section 3. For convenience, we use only uncompressed P-trees to illustrate the calculation processes that follow.

4.1. Count Aggregate

The COUNT function is probably the simplest and most useful of all the aggregate functions. It is not necessary to write a special function for Count because the P-tree RootCount function already provides the mechanism to implement it. Given a P-tree Pi, RootCount(Pi) returns the number of 1s in Pi. For example, if we want to know how many transactions in relation S were conducted on line, we just need to count the number of 1s in P3,0.

Count (# transactions on line) = RootCount (P3,0) = RootCount (1011110) = 5. Thus we know that 5 of the 7 transactions in our dataset were conducted on line.
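On an uncompressed P-tree, RootCount is simply a population count of the bit vector, as this illustrative sketch shows (using the On line bit file of Table 2):

```python
def root_count(ptree):
    # RootCount on an uncompressed P-tree: count the 1-bits.
    return sum(ptree)

p3_0 = [1, 0, 1, 1, 1, 1, 0]   # On line column of Table 2
print(root_count(p3_0))        # 5 of the 7 transactions were on line
```

On a compressed P-tree the same value is available directly at the root, with no scan at all, which is what makes Count essentially free.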

4.2. Sum Aggregate

The Sum function totals a field of numerical values.