Vertical Set Squared Distance based Clustering without prior knowledge of K
Amal Perera, Taufik Abidin, Masum Serazi, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
{amal.perera, taufik.abidin, md.serazi, william.perrizo}@ndsu.edu
Abstract
Clustering is the automated identification of groups of objects based on similarity. Two major research issues in clustering are scalability and the requirement of domain knowledge to determine input parameters. Most approaches suggest the use of sampling to address the issue of scalability, but sampling does not guarantee the best solution. Most approaches also require domain knowledge, trial-and-error techniques, or exhaustive searching to determine the required input parameters.
In this paper we introduce a new clustering technique based on the set squared distance. Cluster membership is determined by the set squared distance to the respective cluster. Just as the mean represents a cluster in k-means and the medoid in k-medoids, here each cluster is represented by its entire set of points for each evaluation of membership. The set squared distance for all n items can be computed efficiently in O(n) using a vertical data structure and a few pre-computed values. An ordering of the set squared distance values is used to break the data into its "natural" clusters, in contrast to the known k required by k-means or k-medoids partition clustering.
Superior results are observed when the new clustering technique is compared with classical k-means clustering. To demonstrate the cluster quality and the resolution of the unknown k, data sets with known classes, such as the iris data, the uci_kdd network intrusion detection data, and synthetic data, are used. The scalability of the proposed technique is demonstrated using a large RSI data set.
1.0 Introduction
Clustering is a very important human activity. Built-in, trainable clustering models are continuously trained from early childhood, allowing us, for example, to separate cats from dogs; clustering allows us to distinguish between different objects. Given a set of points in a multidimensional space, the goal of clustering is to compute a partition of these points into sets, called clusters, such that points in the same cluster are more similar to each other than to points in different clusters. Clustering allows us to identify dense and sparse regions and, therefore, to discover the overall distribution of interesting patterns and correlations in the data. Automated clustering is very valuable for analyzing large data, and has thus found applications in many areas such as data mining, search engine indexing, pattern recognition, image processing, and trend analysis [2].
A large number of clustering algorithms exist. In the KDD literature these algorithms are grouped into four categories: partitioning methods, hierarchical methods, density-based (connectivity) methods, and grid-based methods [3]. In partitioning methods, the n objects of the original data set are broken into k partitions iteratively in order to achieve a certain optimal criterion. The most classical and popular partitioning methods are k-means [4] and k-medoids [5]. Each of the k clusters is represented by the center of gravity of the cluster in k-means, or by one representative object of the cluster in k-medoids. Each object in the space is assigned to the closest cluster in each iteration. All partitioning methods suffer from the requirement of providing k (the number of partitions) prior to clustering, are only able to identify spherical clusters, and may split large genuine clusters in order to optimize cluster quality [3].
A hierarchical clustering algorithm produces a representation of the nested grouping relationships among objects. If the clustering hierarchy is formed bottom-up, each data object starts as a cluster by itself; small clusters are then merged into bigger clusters at each level of the hierarchy based on similarity, until at the top of the hierarchy all data objects belong to one cluster. The major difference among hierarchical algorithms is how they measure the similarity between each pair of clusters. Hierarchical clustering algorithms require a termination condition to be set using some prior domain knowledge, and they typically have high computational complexity [3].
Density-based clustering methods attempt to separate the dense and sparse regions of objects in the data space. For each point of a cluster, the density of data points in its neighborhood has to exceed some threshold [10]. Density-based techniques allow the discovery of arbitrarily shaped clusters, but they suffer from the requirement of setting prior parameters, based on domain knowledge, to arrive at the best possible clustering.
A grid-based approach divides the data space into a finite set of multidimensional grid cells, performs clustering in each cell, and then groups neighboring dense cells into clusters. The choice of cell size and other parameters affects the final quality of the clustering.
Two of the most demanding challenges in clustering are scalability and a minimal requirement of domain knowledge to determine the input parameters. In this work we describe a new clustering mechanism that is scalable and operates without an initial parameter specifying the expected number of clusters in the data set. We describe an efficient vertical technique to compute the influence, based on the set squared distance, of each data point with respect to all other data points in the space. Natural partitions in the influence values are used to initially partition the data set into clusters. Subsequently, the cluster membership of each data point is confirmed or reassigned by efficiently recalculating the set squared distance with respect to each cluster.
2.0 Related Work
Many clustering algorithms work well only on small data sets containing fewer than 200 data objects, while real-world data sets are far larger. For example, the NASA Earth Observing System will deliver close to a terabyte of remote sensing data per day, and it is estimated that this coordinated series of satellites will generate petabytes of archived data in the next few years [12][13][14]. For real-world applications, the requirement is to cluster millions of records using scalable techniques [11].
A general strategy for scaling up clustering algorithms is to draw a sample or to apply some form of data compression before applying the clustering algorithm to the resulting representative objects. This may lead to biased results [14]. CLARA [15] addresses the scalability issue by choosing a representative sample of the data set and then continuing with the classical k-medoid method; its effectiveness depends on the size of the sample. CLARANS [6] is an example of a partition-based clustering technique that uses a randomized and bounded search strategy to achieve the optimal criterion. This is achieved by not fixing the sample to a specific subset of the data set for the entire clustering process, but an exhaustive traversal of the search space is not achieved in the final clustering. BIRCH [7] uses a tree structure that records sufficient statistics (summaries) for subsets of data that can be compressed and represented by the summary. Initial threshold parameters are required to obtain the best clustering and computational optimality in BIRCH. Most clustering algorithms require the user to input certain parameters, and the clustering results are sensitive to those parameters. For example, DENCLUE [10] requires the user to input the cell size used to compute the influence function. DBSCAN [16] needs the neighborhood radius and the minimum number of points required to mark a neighborhood as a core object with respect to density. To address the requirement for parameters, OPTICS [17] computes an augmented cluster ordering for automatic and interactive cluster analysis. OPTICS stores sufficient additional information to enable the user to extract any density-based clustering without having to re-scan the data set. This freedom from parameters comes at a cost: OPTICS has a time complexity of O(n log n) when used with a spatial index that allows it to walk through the search space easily. Less expensive partition-based techniques suffer from the requirement of specifying the expected number of partitions (k) prior to clustering [3][27]. X-means [27] attempts to find k by repeatedly searching through different values of k and testing them against a model based on the Bayesian Information Criterion (BIC). G-means [18] is another attempt to learn k, using repeated division of the data set until each individual cluster exhibits a Gaussian distribution at a user-specified significance level. ACE [14] maps the search space to a grid using a suitable weighting function, similar to the particle-mesh method used in physics, and then uses a few agents to heuristically search through the mesh to identify the natural clusters in the data. The initial weighting costs only O(n), but the success of the technique depends on the agent-based heuristic search and on the size of the grid cell. The authors suggest a linear weighting scheme based on neighboring grid cells and a variable grid cell size to avoid over-dependence on cell size for quality results; the linear weighting scheme adds more compute time to the process.
3.0 Our Approach
Our approach addresses the problem of scalability in clustering using a partition-based algorithm and a vertical data structure that enables fast computation of counts. Three major issues inherent in partition-based algorithms are the need to input K, the need to initialize the clusters in a way that leads to an optimal solution, and the representation (prototype) of each cluster together with the computation of membership in it. We solve the first two problems by formally modeling the influence of each data point using a function first proposed for DENCLUE [10], combined with an efficient technique to compute the total influence rapidly over the entire search space. Significantly large differences in total influence are used to identify the natural clusters in the data set. Data points with similar total influence are initially grouped together as initial clusters, giving a better initialization for the subsequent iterative search for an optimal clustering. Each cluster is represented by the entire cluster. Using a vertical data structure, we show an efficient technique that computes membership for each data item by comparing the total influence of the item with respect to each cluster.
The influence function can be interpreted as a function that describes the impact of a data point within its neighborhood [10]. Examples of influence functions are parabolic functions, the square wave function, and the Gaussian function. The influence function can be applied to each data point, and an indication of the overall density of the data space can be calculated as the sum of the influence functions of all data points [10]. The density function that results from a Gaussian influence function, for a point a and data points x_i in its neighborhood, is

\[
f_{Gauss}(a) \;=\; \sum_{i=1}^{n} e^{-\frac{d(a,\,x_i)^2}{2\sigma^2}},
\]

where d(a, x_i) is the distance between a and x_i and \sigma determines the width of the neighborhood.
The Gaussian influence function is used in DENCLUE [10]; since computing the resulting density for all n data points is O(n^2), DENCLUE uses a grid to compute the density locally. The influence function should be radially symmetric about either variable, continuous, and differentiable. Other influence functions include the square wave, parabolic, and power-2m functions. We note that the power-2m function is a truncation of the Maclaurin series of the Gaussian. [Figure: distribution of the density for the Gaussian and the parabolic influence functions.]
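To make the truncation remark concrete, here is a short supporting derivation (added by us, not part of the original text): truncating the Maclaurin series of the Gaussian after m terms yields a polynomial of degree 2m in the distance d = d(a, x_i), and keeping only the first two terms gives a parabolic influence.

```latex
% Maclaurin expansion of the Gaussian influence in the distance d = d(a, x_i).
% Truncating after k = m terms gives a degree-2m polynomial in d (a "power 2m"
% influence); keeping k = 0, 1 gives the parabolic influence 1 - d^2/(2*sigma^2).
\[
  e^{-\frac{d^{2}}{2\sigma^{2}}}
  \;=\; \sum_{k=0}^{\infty} \frac{(-1)^{k}}{k!}
        \left(\frac{d^{2}}{2\sigma^{2}}\right)^{k}
  \;\approx\; \sum_{k=0}^{m} \frac{(-1)^{k}}{k!}
        \left(\frac{d^{2}}{2\sigma^{2}}\right)^{k}.
\]
```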
Next we show how the density based on the parabolic influence function, which we denote the Set Squared Distance, can be computed efficiently using a vertical data structure.
A vertical data representation consists of set structures representing the data column-by-column rather than row-by-row (as in relational data). P-trees are one choice of vertical data representation that can be used for data mining instead of the more common sets of relational records. P-trees [25] are a lossless, compressed, and data-mining-ready data structure. This data structure has been successfully applied in data mining applications ranging from classification and clustering with K-Nearest Neighbor, to classification with decision tree induction, to association rule mining [22][24][26]. A basic P-tree represents one attribute bit, reorganized into a tree structure by recursive sub-division, recording a predicate truth value regarding purity for each division. Each level of the tree contains truth bits that represent pure sub-trees and can therefore be used for fast computation of counts. The construction continues recursively down each tree path until an entirely pure sub-division is reached (which may or may not be at the leaf level). Basic and complement P-trees are combined using Boolean algebra operations to produce P-trees for values, entire tuples, value intervals, or any other attribute pattern. The root count of any pattern tree indicates the occurrence count of that pattern. The P-tree data structure thus provides a structure for counting patterns in an efficient manner.
Binary representation is a fundamental concept in vertical data structures. Let x be a numeric value of attribute A_1. Then the representation of x in b bits is written as

\[
x \;=\; x_{b-1} 2^{b-1} + x_{b-2} 2^{b-2} + \cdots + x_1 2^{1} + x_0 2^{0},
\]

i.e., the bit string x_{b-1} x_{b-2} \cdots x_1 x_0, where x_{b-1} and x_0 are the highest and lowest order bits respectively.
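As an illustration (ours, not the authors' implementation) of how vertical bit slices support fast counts, the sketch below uses plain NumPy bit columns in place of compressed P-trees: bit_slices decomposes an attribute column into vertical bit slices, and rc realizes the root count as a popcount over the AND of the chosen slices.

```python
import numpy as np

# Illustrative sketch only: real P-trees are compressed, tree-structured bit
# slices; here plain NumPy 0/1 columns stand in for them, so rc() is simply a
# popcount over the AND of the chosen bit columns.

def bit_slices(column, nbits):
    """Decompose an integer attribute column into nbits vertical bit slices.
    slices[j][r] is bit j (0 = lowest order) of row r."""
    column = np.asarray(column, dtype=np.int64)
    return [(column >> j) & 1 for j in range(nbits)]

def rc(*slices):
    """Root count: number of rows whose AND over the given bit slices is 1."""
    acc = np.ones_like(slices[0])
    for s in slices:
        acc &= s
    return int(acc.sum())

# Example: one attribute with 3-bit values
a1 = [5, 3, 7, 1, 5]       # binary: 101, 011, 111, 001, 101
P1 = bit_slices(a1, 3)     # P1[j] plays the role of P_{1,j}
print(rc(P1[2]))           # rows with the highest-order bit set -> 3
print(rc(P1[2], P1[0]))    # rows with bits 2 and 0 both set     -> 3
```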
The Vertical Set Squared Distance (VSSD) of a point a with respect to a set of points X is defined as

\[
VSSD(a, X) \;=\; \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2
\;=\; \sum_{i=1}^{d} \Bigl[\, \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{\,j+k}\, rc(P_X \wedge P_{i,j} \wedge P_{i,k})
\;-\; 2\, a_i \sum_{j=0}^{b-1} 2^{\,j}\, rc(P_X \wedge P_{i,j})
\;+\; rc(P_X)\, a_i^{2} \,\Bigr],
\]

where
P_{i,j} indicates the P-tree for the j-th bit of the i-th attribute,
rc(P) denotes the root count of a P-tree, i.e., the number of '1' bits represented by the tree,
P_X denotes the P-tree (mask) for the set X, and d and b are the number of attributes and the number of bits per attribute, respectively.
The advantage of the VSSD formulation is that the root counts can be pre-computed and reused: they are independent of the point a, so they can be computed once for a given set X and then used to evaluate the VSSD of any number of data points, as long as the corresponding data set (the set X) does not change. This observation gives us a technique to compute the VSSD influence-based density for all data points in O(n). Furthermore, for a given cluster, the VSSD influence-based density of each data point can be computed efficiently to determine its cluster membership.
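The sketch below is our simplification of this idea, again using plain 0/1 NumPy columns in place of compressed P-trees and the P_X mask; the helper names precompute and vssd are hypothetical. It shows how the root counts that do not depend on a are computed once per set X and then reused for every query point.

```python
import numpy as np

def precompute(slices_per_attr, mask):
    """slices_per_attr[i][j] is bit j of attribute i as a NumPy 0/1 array;
    mask is the 0/1 membership vector for the set X (plays the role of P_X).
    Returns the a-independent counts used in the VSSD formula."""
    N = int(mask.sum())                               # rc(P_X) = |X|
    quad, lin = [], []
    for slices in slices_per_attr:
        b = len(slices)
        # sum over X of x_i^2 = sum_j sum_k 2^(j+k) rc(P_X ^ P_ij ^ P_ik)
        q = sum((1 << (j + k)) * int((mask & slices[j] & slices[k]).sum())
                for j in range(b) for k in range(b))
        # sum over X of x_i   = sum_j 2^j rc(P_X ^ P_ij)
        s = sum((1 << j) * int((mask & slices[j]).sum()) for j in range(b))
        quad.append(q)
        lin.append(s)
    return N, quad, lin

def vssd(a, N, quad, lin):
    """Set squared distance from point a to the set X, using only the
    precomputed counts: sum_i (quad_i - 2*a_i*lin_i + N*a_i^2)."""
    return sum(q - 2 * ai * s + N * ai * ai for ai, q, s in zip(a, quad, lin))
```

Once precompute has run for a set X, each call to vssd needs only O(d) arithmetic per point, independent of |X|, which is what makes evaluating the VSSD of every point against a fixed set or cluster a linear-time operation.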
Algorithm (VSSDClus)
The algorithm has two phases. In the initial phase the VSSD is computed for the entire data set; while the VSSD values are being computed, they are placed using an appropriate hash function to aid the sorting in the next step. Next, the VSSD values are sorted and the differences between consecutive values are computed. It is assumed that outlier differences indicate a separation between two clusters. Statistical outliers are identified using the standard mean + 3 × standard deviation rule on the differences, and the data set is partitioned at the outlier differences. [Figure: sample 2D plot showing a clear separation.]
Phase one proceeds as follows (a code sketch is given after the steps):
1. Compute the VSSD for every point in the data set and place the values in a hash map.
2. Sort the VSSD values within the hash map.
3. Compute the difference between VSSD_i and VSSD_{i+1} (i and i+1 in sorted order).
4. Identify the differences greater than mean(difference) + 3 × standard deviation(difference).
5. Break the data into clusters using the large differences as partition boundaries.
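A minimal sketch of phase one, assuming vssd_all already holds the VSSD of every point with respect to the entire data set (NumPy is used for the sorting and the mean + 3 × standard-deviation gap test):

```python
import numpy as np

def phase_one(vssd_all):
    """Partition points into initial clusters at outlier gaps in sorted VSSD."""
    v = np.asarray(vssd_all, dtype=float)
    order = np.argsort(v)                       # sorted order of VSSD values
    gaps = np.diff(v[order])                    # consecutive differences
    threshold = gaps.mean() + 3.0 * gaps.std()  # statistical outlier threshold
    cuts = np.where(gaps > threshold)[0]        # large gaps = cluster boundaries
    labels_sorted = np.zeros(len(v), dtype=int)
    for c in cuts:                              # everything after a cut starts a new cluster
        labels_sorted[c + 1:] += 1
    labels = np.empty(len(v), dtype=int)        # map back to the original point order
    labels[order] = labels_sorted
    return labels
```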
In phase two of the algorithm, each item in the data set is confirmed or re-assigned based on its VSSD with respect to each cluster. This is similar to the classical k-means algorithm, except that instead of the mean, each cluster is represented by all of the data points in the cluster, and instead of the squared distance to the mean, the set squared distance determines cluster membership. Phase two proceeds as follows (a code sketch is given after the steps):
Iterate until the maximum number of iterations is reached, the cluster sizes do not change, or oscillation is detected:
1. Compute the VSSD of every point with respect to each cluster C_i.
2. Re-assign each point a to the cluster with min_i VSSD(a, C_i).
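A minimal sketch of phase two under the same assumptions; it reuses the hypothetical precompute and vssd helpers from the earlier sketch, re-represents each cluster by all of its current members at every iteration, and stops when assignments or cluster sizes stop changing or the iteration limit is hit.

```python
import numpy as np

def phase_two(points, slices_per_attr, labels, max_iter=20):
    """Confirm or re-assign each point to the cluster with the smallest VSSD.
    Assumes precompute() and vssd() from the earlier sketch are in scope."""
    labels = list(labels)
    prev_sizes = None
    for _ in range(max_iter):
        stats = {}
        for c in set(labels):                              # per-cluster root counts
            mask = np.array([1 if l == c else 0 for l in labels])
            stats[c] = precompute(slices_per_attr, mask)
        new_labels = [min(stats, key=lambda c: vssd(p, *stats[c]))
                      for p in points]
        sizes = sorted(np.bincount(new_labels).tolist())
        if new_labels == labels or sizes == prev_sizes:    # converged or oscillating
            labels = new_labels
            break
        prev_sizes, labels = sizes, new_labels
    return labels
```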
4.0 Experimental Results
To show the practical relevance of the new clustering approach, we present comparative experimental results in this section. The approach aims to remove the need for the parameter K and to scale with the cardinality of the data. To show that K is successfully eliminated, we use a few synthetic data sets as well as real-world data sets with known clusters. We also compare the results with a classical K-means algorithm to show the relative difference in speed to obtain an optimal solution. To show linear scalability, we apply our approach to a large RSI image data set and report the actual computation time with respect to data size.
We use the following quality measure, used extensively in text mining, to compare the quality of the clusterings; note that this measure can only be used when the true clusters are known. Let C* = {C_1*, ..., C_i*, ..., C_l*} be the original clustering and C = {C_1, ..., C_j, ..., C_k} be some clustering of the data set. For each original cluster C_i* and each cluster C_j,

\[
P(i,j) = \frac{|C_i^* \cap C_j|}{|C_j|}, \qquad
R(i,j) = \frac{|C_i^* \cap C_j|}{|C_i^*|}, \qquad
F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j)+R(i,j)},
\]

and the overall measure is

\[
F \;=\; \sum_{i=1}^{l} \frac{|C_i^*|}{n} \max_{j} F(i,j),
\]

where n is the number of data points. F = 1 for a perfect clustering, and the F measure also indicates whether the selected number of clusters is appropriate.
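A short sketch of this F measure (our own implementation of the standard formula above, not the authors' code):

```python
import numpy as np

def f_measure(true_labels, pred_labels):
    """Overall F measure of a clustering against known classes (1.0 = perfect)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = len(true_labels)
    total = 0.0
    for ci in np.unique(true_labels):              # original classes C_i*
        members = true_labels == ci
        best = 0.0
        for cj in np.unique(pred_labels):          # discovered clusters C_j
            overlap = int(np.sum(members & (pred_labels == cj)))
            if overlap == 0:
                continue
            p = overlap / int(np.sum(pred_labels == cj))   # precision P(i, j)
            r = overlap / int(np.sum(members))             # recall R(i, j)
            best = max(best, 2 * p * r / (p + r))          # F(i, j)
        total += (int(np.sum(members)) / n) * best         # weighted by |C_i*| / n
    return total
```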
Synthetic Data
The following table shows the results for a few synthetically generated clustered data sets. The motivation is to show the capability of the algorithm to find the natural clusters in the data independently. The classical K-means clustering algorithm, with K given, is used for comparison. The number of database scans (i.e., iterations) required to achieve an F measure of 1.0 is shown in the following table.
Data set                 (1)    (2)    (3)
VSSDClus iterations       2      2      6
K-means iterations        8      8     14
Iris Data