EFFICIENT K‐MEDIAN CLUSTERING WITH BIT SLICED COLUMN DATA

AMAL PERERA1,

WILLIAM PERRIZO2

ABSTRACT.

Advances in earth science research have introduced cheaper data acquisition and storage mechanisms, leading to large repositories of valuable data. To take advantage of this large volume of data we need scalable tools and techniques that help us understand it. Clustering is the automated identification of groups of objects based on similarity. Scalability is a major issue in clustering that must be addressed to provide cheaper, timely analytical observations of data. Centroid-based partition clustering is widely used in many areas of data analysis. In current partition clustering approaches, computing the respective centroids and assigning cluster membership require multiple data scans, reducing the potential for scalable approaches. In this work we introduce a partition clustering technique that uses bit slices of the original data columns in order to eliminate the need for repeated horizontal database scans. Computation of the centroids, estimation of the total error, and assignment of cluster membership are done by manipulating bit slices for the entire data set, in contrast to scanning the data set item by item. The vector-of-medians is used as the cluster center and is identified efficiently, without having to partially sort the data items. Cluster membership is established through geometric reasoning and hyper-rectangular range queries. The error for each iteration is calculated by position-based manipulation of the bit slices. Superior results are observed when the new clustering technique is compared with classical k-means and k-medoid clustering. The use of bit-sliced column data extends the concept of column-store oriented databases, which are growing in usage for data mining and analytics due to their computational efficiency compared to traditional row-oriented databases.

1. INTRODUCTION

Many clustering algorithms work well on small data sets containing fewer than 200 data objects [1]. For real-world applications such as earth science applications, the requirement is to cluster millions of records using scalable techniques. Partition-based techniques are widely used for clustering [1][2]. In partition-based clustering methods, the n objects in the original data set are broken into k partitions iteratively, to heuristically achieve a certain optimality criterion, such as the lowest sum of squared error. The k clusters are represented by the center of gravity of the cluster in k-means, or by a specially selected representative of the cluster in k-medoid. Each object in the space is assigned to the closest cluster representative in each iteration [1]. In general, one of the most demanding challenges in clustering is scalability [1]. Partition clustering is widely used because of its simplicity [5]: it does not need any intermediate data structures or tuning parameters, except k. The simplicity comes with a drawback in scalability, due to the need to scan the entire data set at each iteration to arrive at some locally optimal solution.
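The generic partition-clustering loop described above (assign each object to its nearest representative, recompute the representatives, repeat) can be sketched as follows. This is a minimal horizontal k-median in Python for illustration only, using the vector-of-medians as the cluster representative; the data set and parameter values are made up:

```python
import random

def k_median(points, k, iters=20):
    """Generic partition clustering: assign each point to the nearest
    center, then recompute each center as the vector-of-medians."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: one full scan of the data set per iteration,
        # which is the scalability bottleneck discussed in the text.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: vector-of-medians, i.e. the per-dimension median.
        for i, members in enumerate(clusters):
            if members:
                dims = list(zip(*members))
                centers[i] = tuple(sorted(d)[len(d) // 2] for d in dims)
    return centers, clusters
```

Note that the assignment and update steps each touch every record in every iteration; this repeated full scan is precisely the cost the bit-sliced vertical approach is designed to avoid.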

We acknowledge financial support for this research came from a Department of Energy Award (award # DE-FG52-08NA28921).

1 Dept. of Computer Science and Engineering, University of Moratuwa, Sri Lanka,

2 Dept. of Computer Science, North Dakota State University, Fargo, ND,

2. RELATED WORK AND OUR APPROACH

The partition-based clustering problem is related to other clustering and location problems, such as Euclidean k-medians (or the multi-source Weber problem) [5] and the geometric k-center problem [5][6]. Efficient optimal solutions are not known for most of these problems, and some formulations are NP-hard [5]. In selecting the centroid for a partition-based clustering algorithm, it is important to consider the representation and the computational cost of identifying it. The most popular choice, the mean [5], has a very small breakdown point (1/N). The breakdown point is the fraction of data points that must be moved to infinity for the estimator itself to move to infinity; the breakdown point of the median is ½. The breakdown point influences clustering quality on data with outliers, as is the case with most natural-world data sets. In practice, however, the cluster mean is used as the centroid, due to the difficulty of finding the median compared to computing the mean. Exact and approximate solutions for median-based clustering are present in the literature [6]. With respect to clustering, the CLARANS [3] and CLARA [1] algorithms use sampling to arrive at an approximate solution, while PAM [1] does an exhaustive search. In 1902 Hayford suggested the use of the vector-of-medians to represent a sample of vector data [15]. The vector-of-medians is composed of the median values from each individual dimension in the data set, and it may not be an actual data point in the data set.

In this work we also introduce an efficient median selection technique that uses the self-indexing characteristic of bit slices of data. The binary pattern of the median for a given data attribute is built by comparing the counts of "1" and "0" bits, starting from the most significant bit. The median will be in the partition that holds more than 50% of the data points. Once the first bit is identified, we continue to combine the bit slices with an AND operation and a subsequent count operation until we reach the least significant bit. A naïve assignment of cluster membership requires a scan of all the data points in the data set. To avoid this expensive scan, kd-trees [7] can be used to efficiently assign membership to collections of data points and also to calculate the squared error [5][8]. In our approach, we propose the use of bit slices with range queries using geometric reasoning, thus eliminating the need to build intermediate data structures such as kd-trees as a pre-processing step to clustering, along with their extra space and complicated traversal time. The result set of a range query can be efficiently identified by looking at the binary pattern of the boundary value and combining the respective bit slices using AND and OR operations. Efficient error computation is achieved by manipulating the bit slices based on their respective bit positions. The concept of vertically partitioning database tables to improve data mining performance is not a new idea [9][10][12]. Recent work on the MonetDB/X100 [11] system has pioneered the design of modern column-oriented database systems.
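The bit-slice median search described above can be sketched as follows. This is a minimal Python illustration, not the paper's P-tree implementation: each bit slice is represented as a plain integer bitmap over record positions, and the "more than 50% partition" rule is expressed by tracking the target rank explicitly, which handles ties cleanly while following the same MSB-to-LSB AND-and-count procedure:

```python
def bit_slice_median(values, nbits):
    """Find the median of `values` by descending bit slices (MSB first).

    Bit i of `slices[b]` is 1 iff record i has a 1 in bit position b.
    """
    n = len(values)
    # Build the vertical bit slices from the horizontal values.
    slices = [0] * nbits
    for i, v in enumerate(values):
        for b in range(nbits):
            if (v >> b) & 1:
                slices[b] |= 1 << i
    candidates = (1 << n) - 1     # bitmap of all records
    rank = n // 2                 # target rank of the median
    median = 0
    for b in reversed(range(nbits)):
        ones = candidates & slices[b]            # AND of bitmaps
        c0 = bin(candidates & ~ones).count("1")  # size of the 0-partition
        if rank < c0:
            candidates &= ~slices[b]   # median lies in the 0-partition
        else:
            candidates = ones          # median lies in the 1-partition
            median |= 1 << b
            rank -= c0
    return median
```

Only AND, complement, and population-count operations on whole slices are needed, so no per-record scan or partial sort of the data is performed.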
These systems show that column-oriented designs can dramatically outperform commercial and open-source row-oriented databases on benchmarks such as TPC-H, due to superior CPU and cache performance and reduced I/O [13].
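The hyper-rectangular range queries used for membership assignment can be sketched in the same flat-bitmap representation (the helper names below are illustrative; a real P-tree implementation operates on compressed tree nodes rather than flat bitmaps). A one-sided predicate such as value <= bound is built by walking the binary pattern of the bound and combining slices with AND and OR; a hyper-rectangle is then the AND of one such range bitmap per dimension:

```python
def build_slices(values, nbits):
    """Vertical bit slices: bit i of slices[b] is bit b of values[i]."""
    slices = [0] * nbits
    for i, v in enumerate(values):
        for b in range(nbits):
            if (v >> b) & 1:
                slices[b] |= 1 << i
    return slices

def le_bitmap(slices, nbits, n, bound):
    """Bitmap of records whose value is <= bound, via AND/OR of slices."""
    full = (1 << n) - 1
    res, eq = 0, full        # eq: records equal to bound on bits seen so far
    for b in reversed(range(nbits)):
        if (bound >> b) & 1:
            res |= eq & ~slices[b] & full   # a 0 bit here => value < bound
            eq &= slices[b]
        else:
            eq &= ~slices[b] & full         # a 1 bit here => value > bound
    return res | eq                         # strictly-less plus equal

def rectangle_bitmap(columns, nbits, n, lows, highs):
    """Records inside the hyper-rectangle [lows, highs]: AND one
    range bitmap per dimension."""
    full = (1 << n) - 1
    res = full
    for col, lo, hi in zip(columns, lows, highs):
        s = build_slices(col, nbits)
        inside = le_bitmap(s, nbits, n, hi)
        if lo > 0:
            inside &= full ^ le_bitmap(s, nbits, n, lo - 1)
        res &= inside
    return res
```

The resulting bitmap identifies every record falling inside the rectangle at once, so cluster membership for a whole region is assigned without touching individual data points.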

4. RESULTS & CONCLUSIONS

The bit-sliced approach was implemented using the P-tree3 data structure (efficient, data mining ready, lossless, and compressed) [4]. Comparison results in Table 1 show that the new approach performs better than the classical k-means algorithm and as well as a k-medoid (PAM) implementation with respect to the quality of results. Partitioning Around Medoids (PAM) is the best medoid approach with respect to cluster quality [1]. Figures 1A and 1B show the linear scalability of the vertical approach. A relatively large RSI (Remote Sensed Image) data set is used with our approach to show the required horizontal and vertical computation time with respect to data set size. Figure 1C compares the individual compute times for each component of the algorithm, indicating the significant difference between the two approaches. We conclude that the vertical column-oriented k-median clustering approach performs better than the traditional horizontal approach for large data sets.

Table 1. Network Intrusion Data [14] Clustering Results (F-Measure).

Approach      k=2     k=4     k=6
K-median      0.91    0.86    0.77
K-means       0.91    0.75    0.72
PAMedoid      0.91    0.91    0.85

REFERENCES

[1] J. Han, M. Kamber, “Data Mining: Concepts and Techniques,” Morgan Kaufmann, San Francisco, CA, 2001.

[2] O.R. Zaïane, A. Foss, C-H. Lee, W. Wang, "On Data Clustering Analysis: Scalability, Constraints and Validation," Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD), pp. 28-39, Taipei, Taiwan, May 2002.

[3] R. Ng, J. Han, "Efficient and Effective Clustering Method for Spatial Data Mining," International Conf. on Very Large Data Bases (VLDB), pp. 144-155, Santiago de Chile, Chile, Sept. 1994.

[4] Q. Ding, M. Khan, A. Roy, W. Perrizo, “The P‐tree Algebra,” ACM Symposium on Applied Computing (SAC), pp. 426‐431, Madrid, Spain, March 2002.

[5] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, A.Y. Wu, "An Efficient K-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24 No. 7, pp. 881-892, July 2002.

[6] P.K. Agarwal, C.M. Procopiuc, "Exact and Approximation Algorithms for Clustering," ACM Symposium on Discrete Algorithms, pp. 658-667, San Francisco, CA, January 1998.

[7] J.L. Bentley, "Multidimensional Divide and Conquer," Communications of the ACM, Vol. 23 No. 4, pp. 214-229, April 1980.

[8] D. Pelleg, A. Moore, "Accelerating Exact K-Means Algorithms with Geometric Reasoning," ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), pp. 277-281, San Diego, CA, August 1999.

[9] D.S. Batory, "On Searching Transposed Files," ACM Trans. Database Syst., 4(4):531-544, 1979.

[10] S. Khoshafian, G. Copeland, T. Jagodis, H. Boral, P. Valduriez, "A Query Processing Strategy for the Decomposed Storage Model," ICDE, pp. 636-643, 1987.

[11] P. Boncz, M. Zukowski, N. Nes, "MonetDB/X100: Hyper-pipelining Query Execution," CIDR, 2005.

[12] M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S.R. Madden, E.J. O'Neil, P.E. O'Neil, A. Rasin, N. Tran, S.B. Zdonik, "C-Store: A Column-Oriented DBMS," VLDB, pp. 553-564, 2005.

[13] D.J. Abadi, S.R. Madden, N. Hachem, "Column-Stores vs. Row-Stores: How Different Are They Really?," ACM SIGMOD, Vancouver, BC, Canada, June 2008.

[14] UCI, "UCI Machine Learning Data Repository," http://www.ics.uci.edu/~mlearn/MLSummary.html, accessed June 2004.

[15]J. Hayford, “What is the Center of an Area, or the Center of a Population?” Journal of the American Statistical Association, Vol. 8 No. 58 pp. 47‐58, 1902.

3 United States Patent Number US 6,941,303 B2, Sept. 6, 2005