Efficient Proximal Support Vector Machine with Boosting

Efficient OLAP Operations for Spatial Data Using Peano Cube

Baoying Wang, Fei Pan, Dongmei Ren, Yue Cui, Qiang Ding, William Perrizo

{baoying.wang, fei.pan, dongmei.ren, yue.cui, qiang.ding, william.perrizo}@ndsu.nodak.edu

Computer Science Department

North Dakota State University

Fargo, ND 58105

Tel: (701) 231-6257

Fax: (701) 231-8255

Abstract

Online Analytical Processing (OLAP) is an important application of data warehouses. With more and more spatial data being collected, such as remotely sensed images, geographical information, digital sky survey data, computer cartography, environmental assessment data and planning data, efficient OLAP for spatial data is in great demand. In this paper, we present a data warehousing architecture, the Peano Cube, to facilitate OLAP operations on spatial data, e.g., slice/dice, rollup, range queries, and nearest neighbor queries. Fast logical operations on Peano Trees (P-trees) are used to accomplish these OLAP operations. A natural distance metric for P-trees, the High Order Basis Bit (HOBBit) metric is employed to achieve efficient nearest neighborhood queries. Predicate P-trees are used to efficiently reduce data access by filtering out “bit holes” consisting of consecutive 0s using logical operations. Experiments show that OLAP operations can be executed much faster this way than with traditional data methods.

Keywords:Online Analytical Processing (OLAP), Peano Cube, P+-trees, HOBBit metric.

1.Introduction

Data warehouses (DWs) are collections of historical, summarized, non-volatile data, accumulated from transactional databases. They are optimized for Online Analytical Processing (OLAP) and have proven to be valuable for decision-making [29]. A copy of the relevant data is periodically appended to the DWs with a proper timestamp. Knowledge workers can analyze the data without having to wait for updates to be installed.

With more and more spatial data being collected, such as remote sensed images, geographical information, digital sky survey data, computer cartography, environmental assessment data and planning data, it is important to study on-line analytical processing of spatial data warehouses.

The data in a warehouse are conceptually modeled as data cubes. Gray et al. introduce the data cube operator as a generalization of the SQL group by operator [23]. The chunk-based multiway array aggregation method for data cube computation in OLAP was proposed in [24].

Typically, OLAP queries are complex and the size of the data warehouse is huge, which can cause the queries to take very long to complete, if executed directly on raw data. This delay is unacceptable in most data warehouse environments, as it severely limits productivity. The usual requirement is query execution times of a few seconds or a few minutes at the most.

Two major approaches have been proposed to efficiently process queries in a data warehouse: Speeding up the queries by using index structures, and speeding up queries by operating on compressed data. Many different index structures have been proposed for high-dimensional data [20][21][22]. Most multi-dimensional index structures have an exponential dependence (in space or time or both) upon the number of dimensions. Some special indexing techniques, such as Bitmap index, join index and Bitmap join index, have been introduced. The Bitmap index uses a bit vector to represent the membership of the tuples in a table, i.e., whether or not the tuples have an attribute with the same value. A significant advantage of bitmap indices is that complex logical selection operations can be performed very quickly, by performing bit-wise AND, OR, and NOT operators. However, bitmap indexes are space inefficient for high cardinality attributes. Bitmap indexes are only suitable for narrow domains, especially for membership functions with 0, or 1 as function values. The problem associated with sparse bitmap indexes is discussed in [25]. In order to overcome this problem, some improved bitmaps, e.g., encoded bitmap [27], have been proposed. The join indexes gained popularity from its use in relational database query processing. Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys, from the joinable relations.

Recently, a new lossless data structure, Peano Count Tree (P-tree), and a natural distance metrics for P-trees, the HOBBit metric, have been introduced [30][31]. Peano Count trees are a lossless, quadrant based compression data structure for spatial data. It includes many variations, such as Pure1 P-trees, Pure0 P-trees, Not Pure0 P-trees, Peano Mixed trees, etc. These variations are tree-based bitmap indexes for Peano quadrants. Achieving high speed using logical AND, OR, NOT, and fewer data scan using bit level indexing and compression provides an efficient data structure that is well suited to multi-dimensional spatial data.

In this paper, we present a data warehousing architecture, Peano Cube, to facilitate OLAP operations on spatial data, e.g., slice/dice, rollup, range queries, and nearest neighbor queries. These OLAP operations can be used to answer questions like: “find all galaxies brighter than magnitude 22”, “find gravitational lens candidates”, or “find other objects like this one.” Fast logical operations on Peano Trees (P-trees) are used to accomplish these OLAP operations. A natural distance metric for P-trees, the High Order Basis Bit (HOBBit) metric is employed to achieve efficient nearest neighborhood queries. Predicate P-trees are used to efficiently reduce data access by filtering out “bit holes” consisting of consecutive 0s using logical operations. Experiments show that OLAP operations can be executed much faster this way than with traditional data methods.

This paper is organized as follows. In section 2, we first review Peano count trees (P-trees) and its operations, and then describe a natural distance metrics for P-trees, HOBBit metrics. In section 3, we propose a new data warehouse architecture, Peano Cube, and efficient OLAP operations using HOBBit rings. Finally, we compare our Peano Cube with Microsoft SkyServer experimentally in section 4 and conclude the paper in section 5. The symbols used in this paper are collected in Table 1 and explained later.

Table 1. Symbols and Notations

Symbol / Definition
X / Spatial data, X = {x1, x2, …, xn}, n is the number of attributes
m / Maximal bit length of attributes
r / Radius of HOBBit ring
Pi,j / Basic P-tree for bit j of attribute i
Pi,j’ / Complement of Pi,j
bi,j / The jth bit of the ith attribute of x.
Pxi,j / Operator P-tree of jth bit of the ith attribute of x
Pvi,r / Value P-tree within ring r
Px,r / Tuple P-tree within ring r
Qid / Quadrant identification

2.Review of Peano Count Trees and HOBBit Metrics

In this section, we first review peano count trees (P-trees) and their variations, predicate P-trees, and then describe a natural distance metrics for P-trees, HOBBit metrics.

2.1Peano Count Trees

Peano Count trees are a lossless, quadrant based compression data structure for spatial data. It includes many variations, such as Pure1 P-trees, Pure0 P-trees, Not Pure0 P-trees, Peano Mixed trees, etc. These variations are tree-based bitmap indexes for Peano quadrants. Achieving high speed using logical AND, OR, NOT, and less data scan using bit level indexing and compression provides an efficient data structure that is well suited to multi-dimensional spatial data.

2.1.1Basic P-trees

The basic Peano Count Tree (P-tree) is a tree structure organizing multi-dimensional relational data set with numerical values. Each attribute is split into separate files, one for each bit position. Abasic P-tree, Pi, j, is a P-tree for the jth bit of the ith attribute. For attribute of m bit, there are m basic P-trees, one for each bit position. The complement of basic P-tree Pi, j is a P-tree with complement value for every bit, which is denoted as Pi, j ‘. Figure 1 shows an example of basic P-tree construction and its complement.

a) 8x8 bSQ file b) Basic P-tree c) Complement P-tree

Figure 1.Example of P-tree Construction

In Figure 1, the left is 88 bSQ file, and the corresponding P-tree is given in the middle. The root count is 36, and the count at the next level, 16, 7, 13, 0, are the 1-bit count for the four major quadrants (in Peano or Z order). Since the first and last quadrant is made up of entirely 1-bits and 0-bits respectively, we do not need sub-trees for these two quadrants. The complement is shown on the right. It provides the 0-bit’s counts for each quadrant. It provides the 0-bit counts for each quadrant. We note here that we identify quadrants using a Quadrant identifier, Qid - the string of successive sub-quadrant numbers (01,2 or 3 in Z or Peano order, separated by “.” (as in IP addresses). Thus, the Qid of the bolded and underlined quadrant in Figure 1 is 2.2.

2.1.2Predicate P-trees

Predicate P-trees are variations of P-trees. There are many predicate P-trees, such as Pure1 P-trees, Pure0 P-trees, Not Pure0 P-trees, Peano Mixed trees, etc. [30] Predicate P-trees are constructed such that set 1 if and only if condition is true throughout the quadrant, else set 0. In this paper, we exploit Pure1 P-trees (P1-tree), Not Pure0 P-trees (NP0-tree) and Peano Mixed trees (PM-tree), which are tree-based bitmap for Peano quadrants. Example of three predicate P-trees construction, Pure1 P-trees, Not Pure0 P-trees and Peano Mixed trees, are shown in Figure 2.

Figure 2.Three Predicate P-trees

Logic AND and OR are the most important and frequently used P-tree logical operation. Predict P-trees are tree-based bitmap for Peano quadrants. By using predicate P-trees operations, we filter the big holes of consecutive 0s and get the mixed quadrants. Then we load the mixed quadrants of Peano sequence into main memory, thus reduce the data access. The AND and OR operations using NP0-tree are illustrated in Figure 3.

a). NP0-tree1 b). NP0-tree2 c). AND Result d). OR Result

Figure 3.P1-tree Operations: AND and OR

In Figure 3, a) and b) are two NP0-trees, c) and d) are their AND result and OR result, respectively. From c), we know that bits at the range of position {17, 32} and {49, 64} are pure 0 since the bits in quadrant with Qid 2 and 3 are 0. Quadrants with Qid 1 and 2 are non-pure0 parts. We only need to load quadrants with Qid 1 and 2. The logical operation results calculated in such way are the exactly same as ANDing two bit sequence with reduced data access.

The logical operations can be executed among any predicate tree. The operation rules are independent of the predicate of P-trees. The operations of other predicate trees are the same. There are several ways to perform P-tree logical operations. The basic way is to perform operations level-by-level starting from the root level. The node operation is commutative. The rules of operations are summarized in Table 2.

It is natural for the operations of predicate P-trees to make use of parallel techniques, such as Disk Raid techniques, MPI, etc. They are natural extension of the basic P-tree algebra and the details are beyond the scope of this paper. Another important consideration for database query performance is the fact that Boolean operations, such as AND, OR, and NOT are extremely fast for P-trees. We treat the P-trees as arrays of bitsets and looping through them [32].

Table 2. P-tree Logical Operation Rules

Operand 1 / Operand 2 / AND Result / OR Result
0 / 0 / 0 / 0
0 / 1 / 0 / 1
1 / 1 / 1 / 1

By using HOBBit distance, P-trees logic operations can be computed very fast. The detailed algorithm for HOBBit ring based P-tree operations are detailed in next section.

2.2HOBBit Metrics

Many distance metrics have been used in clustering algorithms. Representative metrics include Euclidean distance, Manhattan distance, and Max distance. The High Order Basis Bit (HOBBit) metric is bit wise distance function [31]. It measures distance based on the most significant consecutive bit positions starting from the left. Bit at different position gives different contribution to similarity measurement. When comparing two values bitwise from left to right, once a difference is found, any further comparisons are not needed.

Let Ai be a non-negative fixed point attribute in data sets R(A1, A2, ..., An). Each Ai is represented as a fixed-point binary number x, i.e., x = x(m)x(m-1)---x(1)x(0).x(-1)---x(-n). Let X and Y be two values of x, the HOBBit similarity between X and Y is defined by

where and are the bits of X and Y respectively, denotes XOR. In another word, it is the left most position at which X and Y differ. The HOBBit distance between two tuples X and Y is defined by .

For two value X and Y of signed fixed binary attribute Ai, the HOBBit distance between X and Y are same as above if X and Y are same sign. If X and Y are opposite sign, then the distance is .

Definition 3.2.1. HOBBit Ring The HOBBit ring of radii, r1 and r2 , centered at c is defined as R(c, r1, r2) = {x X | r2 d(c,x)  r1}, where d(c,x) is HOBBit distance.Figure 4 shows a diagram of HOBBit ring R(c, r1, r2) in spatial data set, X.

Figure 4.Diagram of HOBBit RingR(c, r1, r2)

3.OLAP Operation Using Peano Cube

In this section, we present a new data warehouse architecture, Peano Cube, for multi-dimensional spatial data. OLAP operations, such as slice/dice, rollup and nearest neighbor queries, etc., are developed in section 3.2.

3.1Peano Cube Model

A Data Cube is a visual organization for multi-dimensional data. A data cube allows data to be modeled and viewed in multiple dimensions. It is defined in terms of dimensions and facts. In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, a traffic supervision system may have a spatial data warehouse of remotely sensed images. The data warehouse is used to keep records of the area’s traffic with respect to the dimensions of coordinates and time. These dimensions may allow the traffic department to keep track of traffic by areas and time.

A multidimensional data model is typically organized around a central theme. This theme is represented by a fact table. Facts have numerical measurements. The fact table contains the names of the facts, as well as keys to each of the related dimension tables. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.

The data warehouse is organized as a star, snowflake or galaxy schema. The star schema model includes a large central Fact table and a set of smaller dimension tables. The snowflake schema is a variant of star in which the dimension tables are normalized. It reduces redundancy and is a better design. The galaxy schema, also called the constellation, includes multiple fact tables that share dimension tables.

In this paper, we propose a new data warehouse architecture, the Peano Cube, to represents multi-dimensional spatial data. For spatial data, the Peano Cube can have many pure 1 and pure0 quadrants. Its predicate trees can have very high compression rates, essential for efficient data access. Its features include the following characteristics:

The Peano Cube is a bitwise cube. Each attribute is split into bits, with one basic Peano Cube for each bit position. For data with m-bit values, we have m basic Peano Cubes.
Peano Cubes take advantage of continuity and sparseness in spatial data, by using the Peano order instead of raster order. They balance between high compression rates and random data access for the sake of efficient OLAP operations, e.g. rollup and rotate. They provide fairness with respect to dimensions.
Peano Cubes use predicate P-trees, which are tree-based bitmap indexes built on quadrants. By means of fast logical AND, OR, and NOT operations of predicate P-trees, Peano Cubes achieves efficient data access, which is very important for massive spatial data set.

Suppose we have a 3-D data cube representing the crop yield in an area at certain time. It has three dimensions: X, Y, and T (time). Figure 5 shows a fact table of yield, its data cube and Peano cube for every bit. Since yield is a 4-bit value, we split yield into four Peano bit sequence files and then build 4 Peano Cubes correspondingly.

We also build predicates trees: NP0 and P, based on the Peano sequence files. The processes and algorithms are described in Section 2

X / Y / T / Yeild
1 / 1 / 1 / 15 (1111)
2 / 1 / 1 / 15 (1111)
1 / 2 / 1 / 15 (1111)
2 / 2 / 1 / 15 (1111)
1 / 1 / 2 / 15 (1111)
2 / 1 / 2 / 15 (1111)
1 / 2 / 2 / 15 (1111)
2 / 2 / 2 / 15 (1111)
3 / 1 / 1 / 15 (1111)
4 / 1 / 1 / 4(100)
3 / 2 / 1 / 1(001)
4 / 2 / 1 / 12 (1100)
3 / 1 / 2 / 12 (1100)
4 / 1 / 2 / 2(010)
3 / 2 / 2 / 12 (1100)
4 / 2 / 2 / 12 (1100)
1 / 3 / 1 / 15 (1111)
2 / 3 / 1 / 15 (1111)
1 / 4 / 1 / 2 (010)
2 / 4 / 1 / 0 (000)
1 / 3 / 2 / 15 (1111)
2 / 3 / 2 / 15 (1111)
1 / 4 / 2 / 2 (010)
2 / 4 / 2 / 0 (000)
3 / 3 / 1 / 12 (1100)
4 / 3 / 1 / 12 (1100)
3 / 4 / 1 / 12 (1100)
4 / 4 / 1 / 12 (1100)
3 / 3 / 2 / 12 (1100)
4 / 3 / 2 / 12 (1100)
3 / 4 / 2 / 12 (1100)
4 / 4 / 2 / 12 (1100)
1 / 1 / 3 / 15 (1111)
2 / 1 / 3 / 15 (1111)
1 / 2 / 3 / 15 (1111)
2 / 2 / 3 / 15 (1111)
1 / 1 / 4 / 15 (1111)
2 / 1 / 4 / 15 (1111)
1 / 2 / 4 / 15 (1111)
2 / 2 / 4 / 15 (1111)
3 / 1 / 3 / 15 (1111)
4 / 1 / 3 / 10 (1010)
3 / 2 / 3 / 1(001)
4 / 2 / 3 / 14 (1110)
3 / 1 / 4 / 14((1110)
4 / 1 / 4 / 2(010)
3 / 2 / 4 / 12 (1100)
4 / 2 / 4 / 12 (1100)
1 / 3 / 3 / 15 (1111)
2 / 3 / 3 / 15 (1111)
1 / 4 / 3 / 2 (010)
2 / 4 / 3 / 0 (000)
1 / 3 / 4 / 15 (1111)
2 / 3 / 4 / 15 (1111)
1 / 4 / 4 / 2 (010)
2 / 4 / 4 / 0 (000)
3 / 3 / 3 / 12 (1100)
4 / 3 / 3 / 12 (1100)
3 / 4 / 3 / 12 (1100)
4 / 4 / 3 / 12 (1100)
3 / 3 / 4 / 12 (1100)
4 / 3 / 4 / 12 (1100)
3 / 4 / 4 / 12 (1100)
4 / 4 / 4 / 12 (1100)

(b) Data Cube of Field Yield

(a) Fact Table of Yield in Peano Order (c) Peano Cubes for Each Bit of Yield

Figure 5.A Fact Table, Its Data Cube and Peano Cubes

3.2OLAP Operation

In this section, we develop the basic OLAP operations, including slice/dice, rollup and the nearest neighbor query, using our proposed Peano cubes and predicates P-trees. The OLAP operations are the basis for answering spatial data warehouse questions like: “find all galaxies brighter than magnitude 22”, “find gravitational lens candidates”, or “find the total traffic flow during certain time periods from remotely sensed images.”

3.2.1Slice/Dice

The Slice operation is a selection along one dimension, while the dice operation defines a sub-cube by performing a selection on two or more dimensions. Slice and Dice are very efficiently accomplished using the Peano Cube. Typical select statements may have a number of predicates in their “where” clause that must be combined in a Boolean manner. The predicates may include “=”, “<” and “>”. These predicates lead to two different query scenarios, where “=” clause results in an equal query, and “<” or “>” in a range query.