A Fast Greedy

K-Means Algorithm

N. Hussein

Master’s Thesis

Nr:9668098

University of Amsterdam

Faculty of Mathematics, Computer Sciences,

Physics and Astronomy

Euclides Building

Plantage Muidergracht 24

1018 TV Amsterdam

the Netherlands

November, 2002

First of all I would like to thank Nikos Vlassis for his guidance and support, both during the stage of developing ideas and during the writing of this thesis. Sjaak Verbeek has been more than helpful in pushing me forward at the beginning of the graduation project. I also thank Mike for correcting my English. Furthermore, I would like to thank my family for everything they had to cope with while I was writing this thesis and for their support and love throughout these years. They make my life meaningful. I am also grateful to the perfecthousing team for their support and understanding.

Table of Contents

Abstract

1 Introduction
2 Definitions and Related Work
2.1 Naïve k-means algorithm
2.2 Kd-trees
2.3 The Greedy K-means Algorithm
2.4 Accelerating the naïve K-Means Algorithm
2.4.1 The Blacklisting Algorithm
2.4.2 Variants of k-means using kd-tree
3 Accelerating the Global k-means
3.1 The overall algorithm
3.2 Restricted filtering Algorithm
3.2.1 Computing Distortion for the current set of centers
3.2.2 Computing the distortion for the updated list of centers
3.3 Distortion computation for the set of candidate center locations
4 Implementation
4.1 Computing the new Centroids
4.2 Computing the distortion
4.3 Computing distortion for added center
4.4 Maintaining the Tree
5 Experimental results
5.1 Dataset generated using MATLAB
5.1.1 Variation of number of insertion points
5.1.2 Variation of threshold1
5.1.3 Variation of threshold2
5.1.4 Effect of number of points
5.1.5 Effect of dimensionality
5.1.6 Effect of number of centers
5.2 Data generated using Dasgupta [15]
5.2.7 Variation in dimensionality
5.2.8 Variation in number of points
5.2.9 Variation in Cluster Separation
5.3 Dataset from UCI ML Repository
6 Conclusion
7 Further Research
8 References

Abstract

In clustering, we are given a set of N points in d-dimensional space R^d and we have to arrange them into a number of groups (called clusters). In k-means clustering, the groups are identified by a set of points that are called the cluster centers. Each data point belongs to the cluster whose center is closest to it. Existing algorithms for k-means clustering suffer from two main drawbacks: (i) the algorithms are slow and do not scale to large numbers of data points, and (ii) they converge to different local minima depending on the initialization. We present a fast greedy k-means algorithm that attacks both of these drawbacks. The algorithm is fast, deterministic, and is an approximation of a global strategy for obtaining the clusters. The method is also simple to implement, requiring only two variants of kd-trees as the major data structures. Assuming a global clustering for k−1 centers, we introduce an efficient method to compute the global clustering for k clusters. As the global clustering for k = 1 is obvious (the cluster center lies at the mean of all points), we can find the global solution for k clusters inductively. As a result, the algorithm we present can also be used for finding the minimum number of centers that satisfies a given criterion. We also present a number of empirical studies on both synthetic and real data.

1 Introduction

Clustering is a well-known problem in statistics and engineering, namely, how to arrange a set of vectors (measurements) into a number of groups (clusters). Clustering is an important area of application for a variety of fields including data mining, statistical data analysis and vector quantization. The problem has been formulated in various ways in the machine learning, pattern recognition, optimization and statistics literature. The fundamental clustering problem is that of grouping together (clustering) data items that are similar to each other. The most general approach to clustering is to view it as a density estimation problem. Because of its wide application, several algorithms have been devised to solve the problem, notable among which are the EM algorithm, neural nets, SVMs and k-means. Clustering acts as a way to parameterize the data so that one does not have to deal with the entire dataset in later analysis, but only with the parameters that describe it. Sometimes clustering is also used to reduce the dimensionality of the data so as to make its analysis simpler.

In one of its forms, the clustering problem can be defined as follows: given a dataset of N records, each of dimensionality d, partition the data into subsets such that a specific criterion is optimized. The most widely used criterion for optimization is the distortion criterion. Each record is assigned to a single cluster, and distortion is the average squared Euclidean distance between a record and the corresponding cluster center. Thus this criterion minimizes the sum of the squared distances of the records from their corresponding centers. K-means clustering minimizes the above-mentioned term by partitioning the data into k non-overlapping regions identified by their centers. K-means is defined only over numeric or continuous data, since it requires the ability to compute Euclidean distances as well as the mean of a set of records.

Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center. No efficient exact methods are known for any of these problems, and some formulations are NP-hard. An asymptotically efficient approximation for the k-means clustering problem is presented in [11], but its large constant factors suggest that it is not a good candidate for practical implementation.

The k-means algorithm is based on the simple observation that the optimal placement of a center is at the centroid of the associated cluster. Thus given any set of k centers C, for each center c ∈ C, let V(c) denote the Voronoi region of the point c, that is, the region of space that is closer to c than to any other center. At every stage of k-means, the algorithm moves each center in C to the centroid of the points in V(c) and then updates V(c) by recomputing the distance from each point to its nearest center. These steps are repeated until some convergence condition is met. For points in general position (in particular, if no point is equidistant from two centers), the algorithm will converge to a state where further stages do not change the position of any center. This would be an ideal convergence condition. However, converging to the point where no center is further updated might take very long. Thus further stages of the algorithm are stopped when the change in distortion is less than a threshold. This saves a lot of time, as in the last stages the centers move very little. Even without this approximate convergence criterion based on distortion, the results obtained are not necessarily a global minimum. The results vary greatly depending on the initial set of centers chosen. The algorithm is deterministic once the initial centers are determined.

The attractiveness of k-means lies in its simplicity and flexibility. In spite of other algorithms being available, k-means continues to be an attractive method because of its convergence properties. However, it suffers from major shortcomings that have prevented it from being applied to large datasets. The most important among these are:

(i) K-means is slow and scales poorly in running time for large numbers of points;

(ii) The algorithm might converge to a solution that is a local minimum of the objective function.

Much of the related work does not attempt to confront both of the aforementioned issues directly. Because of these shortcomings, k-means is at times used as a hill-climbing method rather than a complete clustering algorithm, with the centroids initialized to centers obtained from some other method.

The clusters obtained depend heavily on the initial centers. A discussion of how to choose the initial centers is given in [9]. To treat the problem of choosing the initial centers, several other techniques based on stochastic global optimization methods (e.g. simulated annealing, genetic algorithms) have been developed. Different methods of sub-sampling and approximation have also been proposed to counter this problem. However, none of these methods has gained wide acceptance, and in many practical applications the clustering method used is simply the k-means algorithm with multiple restarts.

There has been some related work in which one of the two issues mentioned above is addressed. The methods in [2,3] and [4] address the problem of efficient implementation of the k-means algorithm. The method in [1] addresses the problem of local convergence of the k-means algorithm and proposes an iterative algorithm that comes closer to the global solution by running a sequence of deterministic k-means algorithms one by one. However, the proposed strategy is very slow and has little practical applicability even for medium-sized datasets.

Before proceeding, we explicitly define the problem that we are looking at. In the traditional k-means algorithm (for which we develop an improved solution), each point is associated with exactly one partition, also called a cluster. The number of partitions is pre-specified. Each partition is identified by a cluster center, which is the mean of all the points in the partition. All the points in a partition are closer to its center than to any other cluster center. The accuracy of a clustering is measured by the distortion criterion, which is the mean squared distance of a point from its cluster center. Thus the objective is to obtain the clustering with minimum distortion for the specified number of clusters.

In this thesis we address the issues of scaling the algorithm to large numbers of points and of convergence to local optima. We present a fast greedy k-means algorithm. The proposed algorithm is entirely deterministic and so avoids the problem of having to initialize the centers. In addition to combining ideas from [2,3], [4], [1] and the standard k-means algorithm, the proposed algorithm also introduces some new techniques to make k-means more tractable. The implementation is simple and only requires variants of a kd-tree.

The rest of the thesis is organized as follows. In the next chapter, we introduce the various concepts that are related to this algorithm. In chapter 3 we outline our algorithm, its variants and the new ideas implemented. Chapter 4 explains the implementation issues so that similar results can be obtained in independent studies later. Experimental results on both synthetic and real data are reviewed in chapter 5. Chapter 6 gives the conclusions of the research and chapter 7 points to directions in which further research could be undertaken. The thesis closes with the relevant references in chapter 8.

2 Definitions and Related Work

In this chapter we look at relevant work and definitions that are important for the understanding of this thesis. The ideas covered are: the standard k-means algorithm, also referred to as Lloyd's algorithm in [4]; a brief introduction to kd-trees; the global k-means algorithm of [1], including the greedy approximations and other methods suggested in that paper for improving its performance; and the blacklisting/filtering algorithm of [4] and [2]. Several definitions given in these papers, which we will use in this thesis, are also included. We do not, however, give proofs of the theorems or lemmas that are already proved in these papers.

2.1 Naïve k-means algorithm

One of the most popular heuristics for solving the k-means problem is based on a simple iterative scheme for finding a locally optimal solution. This algorithm is often called the k-means algorithm. There are a number of variants of this algorithm, so to clarify which version we are using we will refer to it as the naïve k-means algorithm, as it is much simpler than the other algorithms described here. It is also referred to as Lloyd's algorithm in [4].

The naïve k-means algorithm partitions the dataset into 'k' subsets such that all records, from now on referred to as points, in a given subset "belong" to the same center. Also, the points in a given subset are closer to that center than to any other center. The partitioning can be compared to a Voronoi partitioning, except that a Voronoi partitioning divides the space based on distance, whereas here we partition the points based on distance.

The algorithm keeps track of the centroids of the subsets and proceeds in simple iterations. The initial partitioning is randomly generated, that is, we randomly initialize the centroids to some points in the region of the space. In each iteration step, a new set of centroids is generated from the existing set of centroids following two very simple steps. Let us denote the set of centroids after the ith iteration by C(i). The following operations are performed in the two steps:

(i) Partition the points based on the centroids C(i), that is, find the centroid to which each point in the dataset belongs. The points are partitioned based on their Euclidean distance from the centroids.

(ii) Set each new centroid c(i+1) ∈ C(i+1) to be the mean of all the points that are closest to c(i) ∈ C(i). The mean of the points in a particular partition is referred to as the new location of the old centroid.

The algorithm is said to have converged when recomputing the partitions does not result in a change in the partitioning. In the terminology that we are using, the algorithm has converged completely when C(i) and C(i−1) are identical. For configurations where no point is equidistant from more than one center, the above convergence condition can always be reached. This convergence property, along with its simplicity, adds to the attractiveness of the k-means algorithm.

The naïve k-means algorithm needs to perform a large number of "nearest-neighbor" queries for the points in the dataset. If the data is 'd'-dimensional and there are 'N' points in the dataset, the cost of a single iteration is O(kdN). As one would have to run several iterations, it is generally not feasible to run the naïve k-means algorithm for a large number of points.
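To make the above steps concrete, the following is a minimal sketch of the naïve k-means iteration, written here in Python with NumPy purely for illustration; the function name, the threshold parameter tol and the random initialization are our own choices for this sketch and not part of the implementation described in chapter 4.

```python
import numpy as np

def naive_kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    """Naive (Lloyd's) k-means: O(k*d*N) distance computations per iteration."""
    rng = np.random.default_rng(seed)
    # Randomly initialize the centroids to k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    prev_distortion = np.inf
    for _ in range(max_iters):
        # Step (i): assign every point to its closest centroid (squared Euclidean distance).
        sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step (ii): move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        # Average squared distance of each point to its assigned center (the distortion);
        # stop when the decrease over successive iterations falls below the threshold.
        distortion = sq_dists[np.arange(len(points)), labels].mean()
        if prev_distortion - distortion < tol:
            break
        prev_distortion = distortion
    return centers, labels
```

The double loop over points and centers in the assignment step is exactly the O(kdN) cost mentioned above, which is what the kd-tree-based methods discussed later aim to reduce.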

Sometimes the convergence of the centroids (i.e. C(i) and C(i+1) being identical) takes many iterations, and in the last several iterations the centroids move very little. As running the expensive iterations so many more times might not be efficient, we need a measure of convergence of the centroids so that we can stop the iterations when a convergence criterion is met. Distortion is the most widely accepted measure; clustering error measures the same criterion and is sometimes used instead of distortion. In fact, the k-means algorithm is designed to optimize distortion: placing a cluster center at the mean of all the points in its cluster minimizes the distortion for those points, and when another cluster center is closer to a point than its current cluster center, moving the point from its current cluster to the other one reduces the distortion further. These two steps are precisely the steps performed by the k-means algorithm. Thus k-means reduces the distortion locally in every step and terminates at a solution that is locally optimal for the distortion function. Hence, distortion is a natural choice as a convergence criterion. Among other measures of convergence used by other researchers, one can measure the sum of the Euclidean distances of the new centroids from the old centroids, as in [9]. In this thesis we always use the clustering error/distortion as the convergence criterion for all variants of the k-means algorithm.

Definition 1: Clustering error is the sum of the squared Euclidean distances from points to the centers of the partitions to which they belong.

Mathematically, given a clustering $\mathcal{C}$, we denote by $\mathcal{C}(x)$ the centroid this clustering associates with an arbitrary point $x$ (so for k-means, $\mathcal{C}(x)$ is simply the center closest to $x$). We then define a measure of quality for $\mathcal{C}$:

$$\mathrm{distortion}(\mathcal{C}) \;=\; \frac{1}{N}\sum_{x}\bigl|x - \mathcal{C}(x)\bigr|^{2} \qquad (1)$$

where |a| denotes the norm of a vector a and N is the number of points. The smaller the difference in distortion over successive iterations, the more the centroids have converged. Distortion is therefore used as a measure of the goodness of the partitioning.
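As an illustration, the distortion in equation (1) can be computed directly from a set of points and a set of centers. The sketch below assumes NumPy arrays and illustrative names; it is not the kd-tree-based computation used later in this thesis (chapter 4).

```python
import numpy as np

def distortion(points, centers):
    """Equation (1): average squared Euclidean distance of each point to its closest center."""
    # Squared Euclidean distance from every point to every center, shape (N, k).
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes the squared distance to its nearest center.
    return sq_dists.min(axis=1).mean()
```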

In spite of its simplicity, k-means often converges to local optima. The quality of the solution obtained depends heavily on the initial set of centroids, which is the only non-deterministic step in the algorithm. Note that although the starting centers can be selected arbitrarily, k-means is fully deterministic given the starting centers. A bad choice of initial centers can have a great impact on both performance and distortion, while a good choice of initial centroids reduces the number of iterations required for the solution to converge. Many algorithms have tried to improve the quality of the k-means solution by suggesting different ways of sampling the initial centers, but none has been able to avoid the problem of the solution converging to a local optimum. For example, [9] gives a discussion of how to choose the initial centers, and other techniques using stochastic global optimization methods (e.g. simulated annealing, genetic algorithms) have also been developed. None of these algorithms is widely accepted.

2.2 Kd-trees

A very important data structure that is used in our algorithm is the kd-tree. A kd-tree is a data structure for storing a finite set of points from a d-dimensional space. It was introduced and examined in detail in [12, 13].
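As a minimal illustration of the idea (and not of the specific kd-tree variants used later in this thesis), a kd-tree can be built by recursively splitting the points along one coordinate axis at a time, so that every node corresponds to an axis-aligned box containing a subset of the points. The class and parameter names below (e.g. leaf_size) are our own choices for this sketch, which again assumes NumPy.

```python
import numpy as np

class KdNode:
    """A node of a simple kd-tree over an (n, d) array of points."""
    def __init__(self, points, depth=0, leaf_size=10):
        self.count = len(points)
        self.points = points                    # points contained in this node's cell
        self.bbox_min = points.min(axis=0)      # bounding box of the cell
        self.bbox_max = points.max(axis=0)
        self.left = self.right = None
        if self.count > leaf_size:
            # Split on one axis, cycling through the d dimensions by depth,
            # at the median coordinate of the points along that axis.
            axis = depth % points.shape[1]
            median = np.median(points[:, axis])
            mask = points[:, axis] <= median
            # Only split if both children are non-empty (guards against degenerate data).
            if 0 < mask.sum() < self.count:
                self.left = KdNode(points[mask], depth + 1, leaf_size)
                self.right = KdNode(points[~mask], depth + 1, leaf_size)
```

Per-node statistics such as the point count and the bounding box of the cell are the kind of information that the filtering and blacklisting algorithms discussed in section 2.4 exploit in order to avoid visiting individual points.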