Efficient Density Clustering Approach to Large Spatial Data Using Hawaiian Metric

Fei Pan, Baoying Wang, Yi Zhang, Dongmei Ren, Xin Hu, William Perrizo

Computer Science Department

North Dakota State University

Fargo, ND 58105

Abstract

Data mining for spatial data becomes increasingly important as more and more organizations are exposed to spatial data from sources such as remote sensing, geographical information systems (GIS), astronomy, computer cartography, and environmental assessment and planning. Recently, density clustering methods, such as DENCLUE, DBSCAN, and OPTICS, have been recognized as powerful clustering methods for data mining. These approaches typically regard clusters as dense regions of objects in the data space that are separated by regions of low density. However, density clustering methods are slow. In this paper, we develop a new efficient density clustering algorithm using the Hawaiian metric and P-trees. The fast P-tree ANDing operation facilitates the calculation of the density function within a given Hawaiian ring neighborhood, and our algorithm has a low average time complexity for spatial data. Computational experiments indicate that our proposed method is faster and more scalable than other density-based clustering algorithms for large data sets, while achieving comparable clustering accuracy on data sets with a large amount of noise.

Keywords: Density Clustering, Hawaiian Metric, Peano Count Trees, Spatial Data

1.  Introduction

Many organizations have large quantities of data collected in various application areas, such as remote sensing, geographical information systems (GIS), astronomy, computer cartography, and environmental assessment and planning. These data collections are growing rapidly. Efficient data mining methods for spatial data are therefore becoming more and more important.

Generally speaking, density-based clustering algorithms partition the attribute metric space into a set of connected components. A cluster is defined as a connected dense component, which grows in any direction that density leads. Density-based methods are therefore capable of discovering clusters of arbitrary shape and of dealing with noise and outliers.

There are two major approaches among density-based methods. The first approach calculates the density of all data points and groups them based on density connectivity. Typical algorithms of this kind include DBSCAN [6], OPTICS [8], and DBCLASD [9].

DBSCAN first defines a core object as a point whose neighborhood contains at least a specified minimum number of data points. All the data points reachable from core objects form clusters. The run time complexity of DBSCAN is O(n log n) for spatial data when a spatial index is used; otherwise, it is O(n²) [10]. OPTICS can be considered an extension of DBSCAN that does not rely on a single global density. It assumes each cluster has its own density parameter and uses a random variable to learn its probability distribution. Its run time complexity is the same as DBSCAN's, that is, O(n log n) if a spatial index is used.

The second approach is represented by DENCLUE [3]. It exploits a density function, e.g., a step function or a Gaussian function, to measure the density in the attribute metric space. Clusters are identified by determining density attractors, and thus clusters of arbitrary shape can be easily described by the overall density function. The algorithm scales well by means of grid cell techniques.

Recently, a new distance metric, the Hawaiian metric, has been proposed [2]. The Hawaiian metric is more natural for spatial data than other distance metrics. It exploits a new lossless data structure, called the Peano Count Tree (P-tree) [1], and the calculation of the Hawaiian metric using P-trees is extremely fast. This provides opportunities for efficient data mining techniques.

In this paper, we propose an efficient density clustering algorithm for large spatial data using the Hawaiian metric. The basic idea is to exploit P-trees and the Hawaiian metric to calculate the density function efficiently: the fast P-tree ANDing operation is used to obtain the density function within a given Hawaiian ring neighborhood. We also adopt a look-around pruning method to combine the density calculation and hill climbing. The overall average run time complexity is analyzed in Section 3. Experimental results indicate that our algorithm works efficiently on large-scale spatial data, outperforming other density methods significantly.

This paper is organized as follows. In Section 2, the Hawaiian metric and P-tree techniques are briefly reviewed. In Section 3, we first propose a new efficient density clustering method using the Hawaiian metric, and then establish its efficiency by time complexity analysis. Finally, we compare our method with other density methods experimentally in Section 4 and conclude the paper in Section 5.

The symbols used in this paper are given in Table 1.

Table 1. Symbols and Notations

Symbol   Definition
X        Spatial pixel, X = {x1, x2, …, xn}, where n is the number of attributes
m        Maximal bit length of attributes
r        Radius of a Hawaiian ring
Pi,j     Basic P-tree for bit j of attribute i
P'i,j    Complement of Pi,j
bi,j     The jth bit of the ith attribute of x
Pxi,j    Operator P-tree for the jth bit of the ith attribute of x
Pvi,r    Value P-tree within ring r
Px,r     Tuple P-tree within ring r
Qid      Quadrant identifier

2.  Review of Hawaiian Metrics and P-trees

In this section, we first briefly review the Hawaiian metric, and then describe the Peano Count Tree (P-tree) [1] data structure and the related P-tree algebra.

2.1.  Hawaiian Metrics

Many distance metrics have been used in clustering algorithms. Representative metrics include the Euclidean distance, the Manhattan distance, and the Max distance. For two data points X = (x1, x2, …, xn) and Y = (y1, y2, …, yn), the Euclidean distance function is defined as d(X, Y) = sqrt(Σi (xi − yi)²). It can be generalized to the Minkowski distance function dq(X, Y) = (Σi |xi − yi|^q)^(1/q), where q is a natural number. If q = 2, this gives the Euclidean distance; if q = 1, it gives the Manhattan distance. The Max function is defined as d∞(X, Y) = maxi |xi − yi|.
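To make the relationships among these metrics concrete, here is a minimal Python sketch; the function names are ours, not from the paper:

```python
def minkowski(X, Y, q):
    """Minkowski distance of order q between two equal-length tuples."""
    return sum(abs(x - y) ** q for x, y in zip(X, Y)) ** (1.0 / q)

def max_distance(X, Y):
    """Max (Chebyshev) distance: the limit of the Minkowski distance as q grows."""
    return max(abs(x - y) for x, y in zip(X, Y))

X, Y = (1, 5, 3), (4, 1, 3)
print(minkowski(X, Y, 2))   # Euclidean: 5.0
print(minkowski(X, Y, 1))   # Manhattan: 7.0
print(max_distance(X, Y))   # Max: 4
```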

The Hawaiian metric, also called the HOBBit metric [2], is a bitwise distance function. It measures distance based on the most significant consecutive bit positions, starting from the left. Bits at different positions make different contributions to the similarity measurement; when comparing two values bitwise from left to right, once a difference is found, no further comparisons are needed. Let Ai be a non-negative fixed-point attribute in a data set R(A1, A2, ..., An). Each value of Ai is represented as a fixed-point binary number x, i.e., x = x(m)x(m-1)…x(1)x(0).x(-1)…x(-n). Let X and Y be two values of Ai. The Hawaiian similarity between X and Y is defined by

SH(X, Y) = max{ i | xi ⊕ yi = 1 }

where xi and yi are the bits of X and Y at position i, respectively, and ⊕ denotes XOR. In other words, it is the leftmost position at which X and Y differ. The Hawaiian distance between two tuples X and Y is defined by dH(X, Y) = max(i = 1..n) SH(xi, yi).

For two values X and Y of a signed fixed-point binary attribute Ai, the Hawaiian distance between X and Y is the same as above if X and Y have the same sign. If X and Y have opposite signs, they already differ in the leftmost (sign) bit, so the distance takes its maximal value. The Hawaiian metric uses a data structure, called a Peano Count Tree, to facilitate its computation for spatial data. More details about P-trees are given in the next subsection.
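As an illustration, the following Python sketch computes the Hawaiian (HOBBit) similarity and tuple distance for non-negative integer attributes (no fractional bits); the function names are ours, and bits are indexed from 0 (least significant), so the returned value is the leftmost differing position:

```python
def hobbit_similarity(a, b):
    """Leftmost bit position at which a and b differ (bits indexed from 0);
    returns -1 when the two values are identical."""
    diff = a ^ b                  # XOR: set bits mark differing positions
    return diff.bit_length() - 1  # position of the leftmost set bit

def hobbit_distance(X, Y):
    """Hawaiian distance between two tuples: the maximum of the
    per-attribute values, per the definition above."""
    return max(hobbit_similarity(x, y) for x, y in zip(X, Y))

# 12 = 1100b and 9 = 1001b first differ at bit position 2
print(hobbit_similarity(12, 9))          # -> 2
print(hobbit_distance((12, 7), (9, 7)))  # -> 2 (second attribute matches)
```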

2.2.  Peano Count Trees (P-trees)

The Peano Count Tree (P-tree) is a tree structure organizing a relational data set with numerical values. Each attribute is split into separate files, one for each bit position. A basic P-tree, Pi,j, is a P-tree for the jth bit of the ith attribute. For an attribute of m bits, there are m basic P-trees, one for each bit position. The complement of the basic P-tree Pi,j is a P-tree with the complemented value at every bit, denoted P'i,j. Figure 1 shows an example of basic P-tree construction and its complement.

a). 8×8 bSQ file b). Basic P-tree c). Complement P-tree

Figure 1.  Example of P-tree Construction

In Figure 1, the 8×8 bSQ file is shown on the left, and the corresponding P-tree is given in the middle. The root count is 36, and the counts at the next level, 16, 7, 13, and 0, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits and 0-bits respectively, we do not need sub-trees for these two quadrants. The complement P-tree is shown on the right; it provides the 0-bit counts for each quadrant.
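A simplified sketch of the construction, assuming a 2^k × 2^k bit plane and representing each node as a (1-bit count, children) pair; pure quadrants get no sub-trees, as described above. This is illustrative only and omits the compressed storage used in practice:

```python
def build_ptree(grid, r0=0, c0=0, size=None):
    """Build a (count, children) P-tree node over the size x size quadrant
    of `grid` with top-left corner (r0, c0). Pure-0 and pure-1 quadrants
    are leaves; mixed quadrants recurse into four sub-quadrants."""
    if size is None:
        size = len(grid)
    count = sum(grid[r][c] for r in range(r0, r0 + size)
                           for c in range(c0, c0 + size))
    if count in (0, size * size):
        return (count, None)          # pure quadrant: no sub-trees needed
    half = size // 2
    children = [build_ptree(grid, r0 + dr, c0 + dc, half)
                for dr in (0, half) for dc in (0, half)]
    return (count, children)

# A 4x4 toy bit plane; the root count is the total number of 1-bits.
grid = [[1, 1, 0, 0],
        [1, 1, 0, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]
print(build_ptree(grid)[0])           # root count -> 13
```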

P-tree ANDing is one of the most important and most frequently used algebraic operations. The ANDing operation is executed using PM-trees, which use masks instead of root counts. In a PM-tree, a three-valued logic, i.e., 0, 1, and m, is used to represent pure-0, pure-1, and mixed quadrants, respectively. The ANDing operation using PM-trees is illustrated in Figure 2.

a). P-tree-1 b). P-tree-2 c). AND-Result

Figure 2.  P-tree Operation

Figure 2 shows the ANDing of PM-tree1 (on the left) and PM-tree2 (in the middle); the result is given on the right. There are several ways to perform P-tree ANDing. The basic way is to perform the ANDing level by level, starting from the root level. The node ANDing operation is commutative. The rules are summarized in Table 2.

Table 2. P-tree AND rules

Operand 1   Operand 2   Result
0           0           0
0           1           0
0           m           0
1           1           1
1           m           m
m           m           0 if all four sub-quadrant ANDs result in 0; otherwise m

In Table 2, operand 1 and operand 2 are two P-trees (or sub-trees). ANDing a pure-0 tree with any P-tree results in a pure-0 tree. ANDing a pure-1 tree with any P-tree P2 results in P2. ANDing two mixed trees results in either a mixed tree or a pure-0 tree.
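The rules in Table 2 translate directly into a recursive procedure. Below is a minimal Python sketch in which a PM-tree node is '0', '1', or a pair ('m', children) with four sub-quadrants; this node representation is our assumption, not the paper's storage format:

```python
def pm_and(p1, p2):
    """AND two PM-tree nodes following the rules of Table 2."""
    if p1 == '0' or p2 == '0':
        return '0'                        # pure-0 AND anything -> pure-0
    if p1 == '1':
        return p2                         # pure-1 AND P2 -> P2
    if p2 == '1':
        return p1
    # both mixed: AND the four sub-quadrants pairwise
    children = [pm_and(c1, c2) for c1, c2 in zip(p1[1], p2[1])]
    if all(c == '0' for c in children):
        return '0'                        # all sub-quadrants pure-0 -> pure-0
    return ('m', children)

t1 = ('m', ['1', '0', ('m', ['1', '0', '0', '1']), '0'])
t2 = ('m', [('m', ['0', '1', '1', '0']), '1', '0', '1'])
print(pm_and(t1, t2))
# -> ('m', [('m', ['0', '1', '1', '0']), '0', '0', '0'])
```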

By using the P-tree logical AND and complement operations, the Hawaiian distance can be computed very fast. The algorithms for Hawaiian-ring-based P-tree operations are detailed in the following section.

3.  P-tree Hawaiian Density Based Clustering Algorithm

As mentioned earlier, the main drawbacks of existing density-based algorithms are slowness and lack of scalability. Typical density-based algorithms, such as DBSCAN, OPTICS, and DENCLUE, exploit different approaches to improve speed and scalability. In this paper, we propose a P-tree Hawaiian-ring-based density clustering algorithm, called PHDClustering. The basic idea is to exploit Hawaiian rings and P-trees to obtain the density function in one step: the fast P-tree ANDing operation is used to compute the density function within a given Hawaiian ring neighborhood. We also adopt a look-around pruning method to combine the density calculation and hill climbing. The detailed algorithm is presented in Sections 3.1 and 3.2.

In Section 3.1, we describe the calculation of the density function using P-trees and the Hawaiian ring metric. In Section 3.2, the algorithm for finding density attractors is discussed. Finally, the efficiency of our algorithm is analyzed in terms of time complexity.

3.1.  Calculation of the Density Function Using P-trees and Hawaiian Ring

The calculation of the density function using Hawaiian rings is accomplished by P-tree ANDing. The operator P-tree, Pxi,j, is defined as follows:

Pxi,j = Pi,j    if bi,j = 1

Pxi,j = P'i,j   otherwise

The value P-trees within Hawaiian ring r are calculated by

Pvi,r = Pxi,1 & Pxi,2 & Pxi,3 & … & Pxi,r ( i = 1, 2, 3, …, n)

Then the tuple P-trees are calculated by ANDing the value P-trees as

Px,r = Pv1,r & Pv2,r & Pv3,r & … & Pvn,r

We define the weighted summation of root counts over the Hawaiian rings r (r = 1, 2, …, m) as

Dx = Σ (r = 1..m) wr · rc(Px,r)

where Dx is the density function of point x, rc(Px,r) denotes the root count of the tuple P-tree Px,r, and wr is the weight assigned to ring r.

Note that we use the weighted summation of root counts for clustering accuracy: with this weighting, nearby neighbors contribute more to the density function than points far away.
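The following Python sketch puts the pieces of this subsection together. For clarity it uses plain integers as uncompressed bitmaps in place of compressed P-trees, so a root count becomes a popcount and P-tree ANDing becomes integer AND; the function names and the example weight wr = 2^-r are our assumptions, not the paper's choices:

```python
def basic_ptrees(data, n_attrs, m):
    """P[i][j]: bitmap of bit j (j = 1 is most significant) of attribute i;
    bit p of the bitmap is 1 iff data point p has a 1 in that bit plane."""
    P = [[0] * (m + 1) for _ in range(n_attrs)]
    for p, tup in enumerate(data):
        for i in range(n_attrs):
            for j in range(1, m + 1):
                if (tup[i] >> (m - j)) & 1:
                    P[i][j] |= 1 << p
    return P

def density(x, n_points, P, n_attrs, m, w=lambda r: 2.0 ** -r):
    """Weighted sum over Hawaiian rings of the tuple P-tree root counts."""
    full = (1 << n_points) - 1
    d = 0.0
    for r in range(1, m + 1):
        px_r = full                              # tuple P-tree Px,r
        for i in range(n_attrs):
            pv = full                            # value P-tree Pvi,r
            for j in range(1, r + 1):
                bit = (x[i] >> (m - j)) & 1      # bi,j of point x
                pv &= P[i][j] if bit else full ^ P[i][j]  # operator P-tree
            px_r &= pv
        d += w(r) * bin(px_r).count('1')         # weighted root count
    return d

data = [(12, 3), (13, 2), (4, 9), (12, 2)]
P = basic_ptrees(data, n_attrs=2, m=4)
print(density(data[0], len(data), P, n_attrs=2, m=4))  # -> 2.6875
```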

3.2.  Finding Density Attractors Using Look Around Pruning Technique

Given the density of each data point, we need to find the density attractors, i.e., the local maxima of the overall density function. The density-function-based algorithm can be summarized in three steps: 1) compute the density function for each point, 2) find the density attractors, 3) determine the cluster for each point. A point with a high density is not necessarily a density attractor; it must also have the highest density among its neighbors.

Here we adopt a look-around pruning technique to combine the density calculation and hill climbing. We first define a neighborhood as a range of radius s. After finding a data point with density Dx greater than a threshold, say ξ, we compare it with the density attractors within its neighborhood. If its density is greater than the density of all its neighbors, it is labeled as a new density attractor (cluster center), and the old density attractors are de-labeled and not considered any more.

The threshold ξ and the neighborhood s influence the clustering speed and accuracy. The threshold ξ is empirically defined as a function of the average density of all data points, ξ = a · Davg, where a is a non-negative coefficient and Davg is the average density. If ξ is large, fewer density attractors will be found; clustering speed will increase, but some cluster centers might be missed. The neighborhood s is defined as a fixed number of Hawaiian rings; it ranges from 0 to the maximal bit length of the attributes.

For example, suppose the threshold ξ = 200, the Qid of data point x is 0.3.2, and Dx = 250. The tuple P-tree Px,s within neighborhood s is shown in Figure 3. Since Dx > ξ, we need to compare Dx with its neighbors' densities. From Px,s, x has four neighbors, with Qids 0.0.2, 0.3.1, 2.3.0 and 2.3.3. Suppose the densities of these points are 300, 0, 80 and 0, respectively, and that 0.0.2 and 2.3.0 are labeled as density attractors. Comparing Dx with the maximal density of 0.0.2 and 2.3.0, we have 250 < max(300, 80), so x is not a density attractor. If instead Dx = 350, then 350 > max(300, 80), and x would be labeled as the new density attractor; the old density attractors 0.0.2 and 2.3.0 would be de-labeled and not considered later.
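A minimal sketch of the look-around pruning step, assuming densities have already been computed as in Section 3.1 and that a neighbor list per point is available from the tuple P-tree Px,s; the function and variable names here are illustrative, not the paper's:

```python
def find_attractors(points, D, neighbors, xi, attractors=()):
    """Look-around pruning: a point with density above the threshold xi
    becomes a new attractor only if it beats every current attractor in
    its neighborhood; beaten attractors are then de-labeled."""
    attractors = set(attractors)
    for p in points:
        if D[p] <= xi:
            continue                          # below threshold: pruned
        rivals = [q for q in neighbors[p] if q in attractors]
        if all(D[p] > D[q] for q in rivals):
            attractors -= set(rivals)         # de-label the old attractors
            attractors.add(p)                 # label p as a density attractor
    return attractors

# The numbers from the example above (Qids used as point labels):
D = {'0.3.2': 250, '0.0.2': 300, '0.3.1': 0, '2.3.0': 80, '2.3.3': 0}
neighbors = {'0.3.2': ['0.0.2', '0.3.1', '2.3.0', '2.3.3']}
out = find_attractors(['0.3.2'], D, neighbors, xi=200,
                      attractors={'0.0.2', '2.3.0'})
print(sorted(out))   # -> ['0.0.2', '2.3.0']: x = 0.3.2 is not labeled
```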