An Efficient Subspace Clustering Algorithm Based on Attribute Relativity and DBSCAN

Qiang Zeng, Yunyue Bai, Haitao He and Jiadong Ren

College of Information Science and Engineering

Yanshan University

Qinhuangdao, Hebei, P.R.China, 066004

The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao City, P.R.China, 066004

{haitao; jdren}@ysu.edu.cn

Abstract. In high-dimensional data, traditional clustering algorithms tend to break down because of the curse of dimensionality and high time cost. This paper proposes a novel algorithm, AReSUBCLU, an Efficient Subspace Clustering Algorithm based on Attribute Relativity and DBSCAN. An active attribute is defined to reduce the dimensionality of the data, and a subspace search tree, constructed from an attribute correlation matrix, is defined to generate subspaces. First, the method uses DBSCAN with a weighted radius threshold to cluster on every active attribute. Second, the correlation matrix is constructed by computing the covariances of attributes, and the subspace search tree, which measures the relativity of attributes, is built from this matrix. The sets of nodes of each root branch of the tree are the interesting subspaces that the algorithm discovers. Finally, the subspace clusters are obtained by merging the clusters of the active dimensions using cluster similarity. Evaluation on synthetic and real-world datasets demonstrates that AReSUBCLU has good clustering quality and the capability of discovering subspace clusters of arbitrary shape.

Keywords: Subspace cluster, Active attribute, Similarity, Subspace search tree

1. Introduction. Clustering analysis pays close attention to high-dimensional data [1], and subspace clustering methods arise naturally in this setting. In the following, previous work is reviewed, with no claim of completeness.

In [2], Agrawal et al. propose the CLIQUE algorithm, the first to combine grid-based and density-based methods to cluster the data. It can discover clusters of many types and shapes, but it depends heavily on the location and size of the grid units. In [3], Kailing et al. use the notion of density-connected clusters from DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which was proposed in [4] to find the clusters and the noise. Compared with existing grid-based methods, SUBCLU can find arbitrarily shaped clusters in subspaces, and its cluster-generation process is efficient because the monotonicity of density-connectivity is used to prune subspaces. In [5], Aziz et al. propose methods to overcome the problem of local neighborhood choice, but the number of clusters must be assigned in advance. In [6], Chu presents DENCOS, which adaptively uses different density thresholds to discover clusters across subspace cardinalities; it can tackle the problem that region densities differ in different subspace cardinalities. E. Müller proposed a scalable density-based subspace clustering algorithm [7]; it reduces the subspace processing by identifying, aggregating and merging promising subspaces.

Most density-based subspace clustering algorithms suffer from the problem of density divergence: clustering results are inaccurate when a global density threshold is set. In addition, the runtime of such algorithms grows exponentially with increasing data dimensionality and subspace cluster dimensionality. To address these problems, this paper, learning from the framework proposed by FIRES [8], proposes an efficient subspace clustering algorithm based on attribute relativity and DBSCAN. The paper is organized as follows. Section 2 describes the problem definitions; Section 3 gives a detailed description of the AReSUBCLU algorithm; Section 4 describes the experimental results, where the efficiency and quality of the algorithm are verified by comparative analysis; and Section 5 concludes the paper.

2. Problem Definitions. The key issue of subspace clustering is finding the subspaces that contain clusters. This paper uses the mean-square deviation of the attribute data distribution and designs a subspace search tree structure.

Our algorithm can detect arbitrarily shaped and positioned subspace clusters, and the time cost of searching subspaces decreases significantly with the use of the subspace search tree. Its ability to find subspace clusters is better than that of CLIQUE and of SCA, presented by Kun in [9]; SCA is designed to overcome exponential scaling with data dimensionality and sensitivity to input parameters.

Given a d-dimensional dataset DB, n is the size of DB. Let A = {a1, a2, ..., ad} be the set of all attributes of DB. S is a subspace of A if S ⊆ A; |S| denotes the dimensionality of the subspace.

Let xi denote the data of the ith dimension, μi denote the mean value of the ith dimension, and si denote the mean-square deviation of the ith dimension.

Definition 2.1. Active Attribute. Suppose attribute ai is a discrete random variable and si is the mean-square deviation of its data values. If si satisfies the activity condition determined by the factor λ together with smin and savg (see below), then ai is an active attribute; otherwise it is not.

λ = 0.1 is obtained by repeated tests in this paper; smin (savg) denotes the minimum (average) value of the mean-square deviation over all dimensions.

Property 2.1. If an attribute is not active, then it is not contained in the candidate set of subspace dimensions.
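As a minimal Python sketch of this filtering step, the inequality si ≥ smin + λ(savg − smin) below is an assumed form of the activity condition; Definition 2.1 fixes only the factor λ = 0.1 and the quantities smin and savg.

import numpy as np

def active_attributes(X, lam=0.1):
    # Mean-square deviation of every attribute (column) of dataset X.
    s = X.std(axis=0)
    s_min, s_avg = s.min(), s.mean()
    # Assumed form of the activity condition of Definition 2.1: an
    # attribute is active if its deviation clears a threshold
    # interpolated between s_min and s_avg by the factor lam.
    return np.where(s >= s_min + lam * (s_avg - s_min))[0]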

Definition 2.2. Tuple of Cluster Centers. T = (C1, C2, C3), where C1 and C2 denote the centers of the two clusters to be merged, and C3 denotes the center of the new cluster after merging.

The cluster center is computed as C = (1/m) Σ yi, where yi denotes a point of the cluster and m denotes the number of points in the cluster. The tuple center is TC = (C1 + C2 + C3)/3.
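For concreteness, the two center formulas can be written as a short Python sketch, representing a cluster as an array of its points:

import numpy as np

def cluster_center(points):
    # C = (1/m) * sum of the m points y_i of the cluster.
    return np.asarray(points).mean(axis=0)

def tuple_center(points1, points2):
    # T = (C1, C2, C3): C1 and C2 are the centers of the clusters to be
    # merged, C3 is the center of the merged cluster; TC = (C1+C2+C3)/3.
    c1, c2 = cluster_center(points1), cluster_center(points2)
    c3 = cluster_center(np.concatenate([points1, points2]))
    return (c1 + c2 + c3) / 3.0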

Definition 2.3. Distance of Clusters. The distance of two clusters is the distance between their tuple centers:

dist(C1, C2) = |t1 − t2|    (1)

where t1 and t2 denote the tuple centers of the clusters.

Definition 2.4. Similarity of Clusters. The similarity of any two clusters is defined as follows:

(2)

where m denotes the number of points of the cluster.

If the similarity of two clusters is greater than or equal to the threshold δ, the two clusters are close and will be merged. The parameter δ must be specified by the user.

Definition 2.5. Correlation Matrix of Attributes Arr (1 ≤ r ≤ number of active attributes).

cij = (1/n) Σk=1..n (xki − μi)(xkj − μj)    (3)

Here, cij denotes the covariance of attributes ai and aj, and μi, μj denote the means of the corresponding attributes.

The matrix is obviously symmetric, so only the lower triangular part is considered. This reduces the cost of data access and the runtime.
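A minimal Python sketch of this construction, using the 0/1 thresholding described in Section 3 (entry 1 when the covariance of two attributes is positive, 0 otherwise) and keeping only the lower triangle:

import numpy as np

def correlation_matrix(X):
    # c_ij: covariance of attributes a_i and a_j over the n data points.
    cov = np.cov(X, rowvar=False)
    # arr[i][j] = 1 if c_ij > 0, otherwise 0 (see Section 3).
    arr = (cov > 0).astype(int)
    # The matrix is symmetric, so only the lower triangle is kept.
    return np.tril(arr, k=-1)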

Definition 2.6. Subspace Search Tree. The subspace search tree (denoted SubTree) is a multi-way tree whose nodes consist of a value and a child list:

(1). the node value is an attribute of DB;

(2). the size of the child list is less than or equal to d, that is, every node can have at most d child nodes;

(3). a child node is relevant to its parent node according to the matrix.

Regarding attribute ai (1 ≤ i ≤ d) as the root of a tree, if aj in the ai column of the lower triangular matrix is relevant to ai, then aj becomes a child node of ai. After all attributes have been treated as root nodes, trees are joined whenever a root node of one tree is a leaf node of another tree. All trees are checked repeatedly until no more trees can be joined. If a tree has been joined into all other trees where possible, it is deleted. The joined tree is a Subspace Search Tree; if a root node is not contained in any leaf node, its tree is also a SubTree. In this way, the interesting subspaces are the sets of nodes of each root branch of a SubTree.
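The branch enumeration can be sketched in Python under a simplified reading of Definition 2.6: the children of attribute ai are the attributes aj with a 1 in the ai column of the lower triangle, joining a tree whose root is another tree's leaf is treated as recursively expanding child lists during the walk, and an attribute remains a root only if it is no other attribute's child.

def interesting_subspaces(arr):
    # arr is the lower triangular 0/1 correlation matrix; the children
    # of attribute i are the attributes j > i with arr[j][i] == 1.
    d = len(arr)
    children = {i: [j for j in range(i + 1, d) if arr[j][i] == 1]
                for i in range(d)}
    # An attribute stays a root if no earlier attribute lists it as a child.
    roots = [i for i in range(d)
             if not any(arr[i][k] == 1 for k in range(i))]
    subspaces = []

    def walk(node, path):
        path = path + [node]
        if not children[node]:        # a leaf closes one root branch
            subspaces.append(path)
        for child in children[node]:  # joining = expanding the child list
            walk(child, path)

    for root in roots:
        walk(root, [])
    return subspaces

Because the lower triangular structure only links an attribute to later ones, every path is strictly increasing and the walk always terminates.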

Definition 2.7. Radius Weighted Vector.

W = (w1, w2, ..., wm)    (4)

where wi is the radius weight of active attribute ai; when DBSCAN runs on ai, the weighted radius wi·ε is used as the neighborhood radius.

3. AReSUBCLU Algorithm. The algorithm first filters out the inactive attributes using the definition of active attributes. Second, the attribute correlation matrix is constructed by computing the covariances of attributes: if the covariance of two attributes is greater than 0, the corresponding matrix entry is 1, otherwise 0. Subspace search trees are then created from the matrix, and the interesting subspaces of the high-dimensional dataset are formed by combining the nodes of every root branch, which represent attributes of the dataset. DBSCAN with the weighted radius makes the one-dimensional clusters of each attribute more precise. Finally, the result is obtained by merging clusters in the interesting subspaces according to the similarity.

Algorithm: AReSUBCLU
Input: dataset D, ε, MinPts, δ
Output: subspace set S, subspace cluster set CS

(1).  compute the mean-square deviation v of every attribute; if v satisfies the activity condition (Definition 2.1), label the attribute active and put v into array var[];
(2).  for each active attribute ai do
(3).    for each active attribute aj do
(4).      if the covariance of ai and aj > 0 then
(5).        arr[i][j] = 1;
(6).      else
(7).        arr[i][j] = 0;
(8).      end if-else
(9).    end for
(10). end for
(11). for every active attribute ai do
(12).   construct a tree using ai as the root; check the other attributes by searching the ai column of the lower triangular matrix;
(13).   for every attribute aj in the ai column do
(14).     if arr[i][j] == 1 then
(15).       insert aj into the tree as a child node of ai;
(16).     end if
(17).   end for
(18). end for
(19). while trees can be joined do
(20).   check every tree;
(21).   if the root node of a tree is a leaf node of other trees then
(22).     join the root node into those leaf nodes;
(23).     delete this tree;
(24).   end if
(25). end while
(26). C = ∅; // set of all clusters in 1-D subspaces
(27). compute the weight wi for every active attribute;
(28). for each active attribute ai ∈ A do
(29).   Ci = DBSCAN(D, ai, wi·ε, MinPts);
(30).   if Ci ≠ ∅ then
(31).     C = C ∪ Ci;
(32).   end if
(33). end for
(34). S = ∅; CS = ∅;
(35). for every root branch of the subspace search trees do
(36).   subclu = ∅; sub = ∅;
(37).   put the root's clusters Croot into subclu; put the root's attribute into sub;
(38).   for every node from the root's child to the leaf of this branch do
(39).     if the clusters in subclu and Cnode meet the similarity threshold δ then
(40).       put the node into sub, merge the clusters, and let subclu = ∅;
(41).       compute the new cluster centers; put the new clusters into subclu;
(42).     end if
(43).   end for
(44).   put sub into S; put subclu into CS;
(45). end for
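Steps (26)-(33) can be sketched in Python with scikit-learn's DBSCAN. The weighting scheme wi = si / savg below is a hypothetical instance of the radius weighted vector of Definition 2.7, not the paper's exact formula:

import numpy as np
from sklearn.cluster import DBSCAN

def one_dim_clusters(X, active, eps, min_pts):
    # Hypothetical radius weights: each active attribute's mean-square
    # deviation normalized by the average over the active attributes.
    s = X[:, active].std(axis=0)
    w = s / s.mean()
    clusters = {}
    for i, a in enumerate(active):
        # Step (29): DBSCAN on the single attribute a, radius w_i * eps.
        labels = DBSCAN(eps=w[i] * eps,
                        min_samples=min_pts).fit_predict(X[:, [a]])
        # Steps (30)-(31): keep only non-noise clusters (label -1 is noise).
        clusters[a] = [np.where(labels == c)[0]
                       for c in np.unique(labels) if c != -1]
    return clusters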


4. Experimental Evaluation.

4.1. Efficiency Test. AReSUBCLU is tested on synthetic datasets. In all tests MinPts = 18, ε = 0.1, δ = 0.8. Figure 1 depicts the algorithm runtime against the size and the dimensionality of the dataset. Figure 1(a) shows at least nonlinear growth of runtime with dataset size: in the merging phase, the cluster distance function requires repeated traversals of the clusters to find cluster centers, which costs more time but guarantees good subspace clusters of arbitrary shape. Figure 1(b) shows that runtime also grows at least nonlinearly with increasing dimensionality, because of the large cost of computing the attribute correlation matrix.

FIGURE 1. Runtime against (a) dataset size and (b) dataset dimensionality

4.2. Quality Test. The Seeds dataset from the UCI machine learning repository is used to measure the quality of AReSUBCLU compared to CLIQUE. The test uses the average cluster quality to measure subspace clusters of different dimensionality:

quality = (1/n) Σi=1..n recall(i)    (5)

where n represents the number of clusters, and recall(i) is the ratio of points belonging to one class in the ith cluster.
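A minimal sketch of Equation (5), assuming the per-cluster recall values have already been computed:

def average_quality(recalls):
    # Equation (5): the mean of recall(i) over the n clusters.
    return sum(recalls) / len(recalls)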

Figure 2 shows the results of the two algorithms at different dimensionalities. AReSUBCLU has better quality than CLIQUE because the use of active attributes and the subspace search tree guarantees that the clusters lie in the interesting subspaces, whereas CLIQUE is restricted by the grid positions and density parameters, so its quality is relatively poor.

FIGURE 2. Average quality of clusters of different dimensionality

5. Conclusion. This paper presents a new subspace clustering algorithm to find clusters in high-dimensional data. According to the defined active attributes, the dimensionality of the data space is reduced. The correlation matrix of active attributes is established from attribute covariances, and the matrix shows whether each attribute is relevant to the others. The subspace search trees generated from the matrix reduce the hierarchical search time significantly, and the interesting subspaces consist of the sets of nodes of each root branch of the subspace search trees. DBSCAN combined with the weighted radius threshold is applied to the active attributes to produce one-dimensional clusters, and the weighting method improves the quality of the clusters. The subspace clusters are obtained by merging the clusters of each active attribute according to cluster similarity. Experimental results demonstrate the efficiency and cluster quality of AReSUBCLU: it performs well in quality and can find meaningful clusters compared to CLIQUE. Future work is to enhance the ability to discover subspace clusters, to improve the merging phase of clusters, and to raise efficiency while guaranteeing good quality.

Acknowledgment. This work is supported by the Natural Science Foundation of P. R. China under Grant No. 61170190 and the Natural Science Foundation of Hebei Province, P. R. China, under Grant No. F2012203062. The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

REFERENCES

[1] E. Müller, S. Günnemann, I. Assent and T. Seidl, Evaluating Clustering in Subspace Projections of High Dimensional Data, Proc. of the VLDB Endowment, vol.2, no.1, pp.1270-1281, 2009.

[2] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proc. of ACM SIGMOD Int. Conf. on Management of Data, vol.27, no.2, pp.94-105, 1998.

[3] K. Kailing, H. P. Kriegel and P. Kröger, Density-Connected Subspace Clustering for High-Dimensional Data, Proc. of 4th SIAM Int. Conf. on Data Mining, vol.4, pp.246-257, 2004.

[4] M. Ester, H. P. Kriegel, J. Sander and X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proc. of 2nd Int. Conf. on Knowledge Discovery and Data Mining, pp.226-231, 1996.

[5] M. S. Aziz and C. K. Reddy, A Robust Seedless Algorithm for Correlation Clustering, Proc. of PAKDD, Part I, pp.28-37, 2010.

[6] Y. H. Chu, J. W. Huang, K. T. Chuang, D. N. Yang and M. S. Chen, Density Conscious Subspace Clustering for High-Dimensional Data, IEEE Transactions on Knowledge and Data Engineering, vol.22, no.1, pp.16-30, 2010.

[7] E. Müller, I. Assent, S. Günnemann and T. Seidl, Scalable Density-Based Subspace Clustering, Proc. of 20th ACM Int. Conf. on Information and Knowledge Management, pp.1077-1086, 2011.

[8] H. P. Kriegel, P. Kröger, M. Renz and S. Wurst, A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data, Proc. of 5th IEEE Int. Conf. on Data Mining, 2005.

[9] K. Niu, S. Zhang and J. Chen, Subspace Clustering through Attribute Clustering, Journal of Beijing University of Posts and Telecommunications, vol.30, 2007.
