K-Means & K-Harmonic Means: A Comparison of Two Unsupervised Clustering Algorithms.
Douglas Turnbull
CSE 202 – Project
November 2002
- Introduction
Gaining a relational understanding of information is important to biology, human cognition, artificial intelligence, and many other data-intensive fields of research. In many instances finding relationships may not be obvious by inspection due to numerous data points or high dimensionality. In these cases, computation may be useful in order to divide subsets of the information into groups called clusters. To effectively divide the information we must first define a criterion for creating groups, and second, find an optimal grouping based on that criterion.
The problem can be generalized as follows: Given a set N of n data points in d dimensional space, we must determine how to assign a set K of k points, called centers, in N so as to optimize based on some criterion. In most cases, it is natural to assume that N is much greater than K and d is relatively small. This formulation is an example of unsupervised learning. The system will create groupings based only the criterion and the information contained in the n data point.
K-means and K-harmonic means are two center-based algorithm that have been developed to solve this problem. K-means (KM) is a popular algorithm that was first presented over three decades ago [1]. The criterion it uses minimizes the total mean-squared distance from each point in N to that point’s closest center in K. K-harmonic means (KHM) is a more recent algorithm presented by Zhang in 2000[2]. This algorithm minimizes the harmonic average from all points in N to all centers in K.
This paper will give a description of the general KM algorithm in section 2 and a description of KHM in section 3. Section 4 will compare KM and KHM by first looking first looking a test case where KHM outperforms KM and then discussing some of the fundamental differences between the two algorithms. Section 5 will present one current application involving the classification of music and how it can use unsupervised clustering. Lastly, section 6 will look at some potential research ideas for KHM.
- K-Means
2.1 The General KM Algorithm
An outline for this algorithm is based on an algorithm presented by Elkan [3].
Data Structures:
N: n by (d+1) array - contains static information about data set
and a dynamic pointer to closest center
K: k by d array that holds information about centers
M: k by n array that holds distance from all point in N to all points in K
Initialization
- Create an initial K:
Choose any k points from N
- Initialized pointers in N:
Assign pointers to any element in K
Main Loop
- Fill Matrix M:
Calculate distances from all points in N to all centers in K
- Update Array N:
Change the pointer N[i] to point to a new center if that new center is closer than pervious center
- If no pointer to a center is updated in step 4 then stop
The algorithm has converged
- Otherwise Recompute and Update K:
For each center c in K, compute average value for all points in N that point to c and replace c with new point c’ from data set that is closest to average value. This can all be done with a linear pass though N.
2.2 K-means Analysis
2.2.1 Objective Function and Membership Constraint
This algorithm minimizes the total mean squared distance for each point, xi, and the closest center, cj.
minimize [SUM over i = 0 to N [min( ||xi – cj||2 for all cj in K)]]
Because the objective function only uses the minimum distances, each point xi is implicitly assigned to exactly one center cj. The property of being assigned to one center is referred to as hard membership. (We will return to this notion of membership in section 4.)
2.2.2 Storage Requirements
Storage requirements assuming that n > k > d:
N + K + M = n * (d+1) + n * d + n * k = O(nk)
The size of matrix M dominates the storage
2.2.3 Running Time
The running time for each iteration algorithm is dominated by distance calculations in step 4. If we can assume that a d dimension distance calculation is takes O(d) time, than the running time per iteration is given by:
O(nkd)
The convergence rate in not just a function of n, k, and d but depends on the nature of the data set. Therefore, it is difficult to discuss a runtime for the algorithm. For this reason center-based clustering algorithms are usually compared by the runtime of a single iteration.
2.2.4 Correctness
It has been shown in [5] that the problem of finding an optimal grouping given a d dimensional set of n data points into k groups is NP-hard. K-means uses a greedy heuristic that seeks to find a local minimum given an initial condition.
In each iteration, we find a better local solution. We only include new centers that reduce the value of the objective function at each step. The algorithm converges when it is not possible to exchange any center such that the total cost can be reduced. Although this gives us a local minimum, we would have to rerun the algorithm to check every initial condition in order to get a global minimum. Since there are an exponential number of possible initialization combinations, rerunning with all possibilities would take exponential time.
- K-Harmonic Mean
KHM is similar to KM in that they are both center-based iterative algorithm. The main difference is how the centers are updated in each iteration. The formula for updating center comes from a derivation found in [2]. In his paper, Zhang introduces a class of KHM with parameter p that is power associated with the distance calculation. In the standard KM algorithm p would be 2 because the distance calculation is given by squared distance ||xi – cj||2. It was found that KHM works better with values of p > 2. This will be discussed more in section 4.
3.1 Harmonic Average Function
The harmonic average is defined as
HA({a1,..,aK}) = K / [SUM over k = 1 to K ( 1 / ak) ]
This function has the property that if any one element in a1..aK is small, the Harmonic Average will also be small. If there are no small values the harmonic average will be large. It behaves like a minimum function but also gives some weight to all the other values.
3.2 The KHM Algorithm
Data Structures:
N: n by d+1 array - contains static information about data set
K: k by d array that holds information about centers
Initialization
- Create an initial K:
Choose any k points from N
Main Loop
- Compute Harmonic Averages and Update K: See Appendix A
- If no center is updated in step 4 then stop
The algorithm has converged
(Note: That a more complete version of the pseudocode is given in Appendix A.)
3.3 K-harmonic Means Analysis
3.3.1 Objective Function and Membership Constraint
The objective function of KHM is given by:
minimize [SUM over i = 0 to N [HA( ||xi – cj||2 for all cj in K)]]
where HA() is the harmonic average for each data point. Unlike KM, this algorithm uses information from all of the centers in K to calculate the harmonic average for each point in N. This means that no center completely owns a point, but rather partially influences the harmonic average for each point. This condition is referred to as soft membership.
3.3.2 Space Requirements, Running Time, and Correctness
The space requirements, O(nd), and running time, O(nkd), for KHM are given in [2]. The space requirement for KHM is given as somewhat less than KM because the algorithm never requires the storage of large matrix M that stores the distance calculations from all centers to all point. However, the all of these calculations must me made can be stored temporarily.
Like KM, KHM finds a local optimal solution based on initial conditions using a greedy heuristic. Each iteration improves the objective function until convergence.
4. Comparison of K-Means with K-Harmonic means
A recent study by Hamerly and Elkan[4] found that KHM significantly outperforms KM in a number of experiments. They also found that KHM was much less sensitive to initial conditions. This section will give a specific case in which KHM will finds a better grouping. We will then discuss two properties of KHM that make it superior to KM.
4.1 Test Case
Given a one-dimensional set of 8 data points, we wish to find an optimal three clustering. In this case d=1, n = 8 and k = 3. The points are located at (1,2,3,4,6,7,9,10) and the initial centers are located at (1,4,9).
12345678910
x1x2x3x4x5x6x7x8
c1c2c3
4.1.1 KM on Test Case
Running KM, we find that after the first iteration, (x1,x2) is assigned to c1, (x3, x4, x5) is assigned to c2, and (x6, x7, x8) is assigned to c3. When recomputing the averages for each center in step 6 of the KM algorithm we would find that no centers would need to be changed and the algorithm would converge with an objective function to equal to 11. (We will assume that an average value that is equidistant from two data points will be assigned to point with lesser absolute value. This is an unimportant detail but a needed policy for the reassignment of c1. )
However, a better local minimum would have been found given an initialization in which the centers were assigned to (x2, x5, x9). This would yield and local objective of 8.
4.1.2 KHM on Test Case
Given the same initial set up, the KHM algorithm, with p =3, given in Appendix A will converge after 3 iterations. The weight of each data point on the objective function is calculated in each iteration.
The weight function is given as:
Weighti = 1 / [ Sum from l = 1 to K ( 1 / ||xi –cl||p) ] 2
Once calculated, these weights are normalized and used for calculating the new values for updating the centers.
The state after each iteration is as follows:
Iteration 1:
12345678910
x1x2x3x4x5x6x7x8
c1c2c3
Weight:0.79.78034360.99
Iteration 2:
12345678910
x1x2x3x4x5x6x7x8
c1c2c3
Weight:.910.240.76054524
Iteration 2:
12345678910
x1x2x3x4x5x6x7x8
c1c2c3
Weight:.990.92150.770.96
Note that the final set of center does find a good local (and in this case, global) optimal solution both in respect to the KHM objective and KM objective function. Some of the weights are relatively high (524, 54, 36) for points far from a center and low for point lose to a center (epsilon represented by 0, .24, .78).
4.2 Hard vs. Soft Membership
As we saw in section 2.2.1, KM imposes hard membership on it data points: each data point is assigned to exactly one center. This means that each data point only has an influence over the center to which it is assigned. In area with high local density of data points and centers, a center maybe unable move away from a data point despite the fact that there is a second center nearby. This second center might yield a slightly worse local solution, but the global effect that repositioning one of the two centers would be beneficial to the clustering. However, this swapping cannot be done in using KM.
KHM differs in that, for each data point, the objective function uses the distances to all centers. The harmonic average is sensitive to the fact that there exist two or more centers are close to a data point. The algorithm will naturally shift one or more of these centers away to areas where there data point that have no close center. This will create a lower value for the objective function.
4.3 Static vs. Dynamic Weights
In each iteration of KM, the objective function gives equal weight to all of the data points. KHM, on the other hand, assigns dynamic weights to each data points based on a harmonic average. As described in section 3.1, the harmonic average will assign a large weight to a data point that is not close to any centers and a small weight to data point that is close to one or more centers. This principle is important because we want to avoid creating densely packed area of multiple centers. By increasing the weight of centers that are not close to any center, the algorithm can attract centers away from those dense areas without increasing the weight of those data points that are in more dense areas. This principal is central to KHM being less sensitive to initialization than KM. In KM centers tend to get trapped in dense area yielding poor clustering results.
5 Application: Music Genre Classification
One application that involves the unstructured clustering is the automatic classification of music for information retrieval(IR). The problem involves large libraries of digitally recorded sound files. In [6], Tzanetakis and Cook use signal processing techniques and statistical pattern recognition algorithms to extract features from recorded sound files.
In this case, the sound files represent data point and the features represent the dimensionality of data. Much energy has been put into supervised clustering based on nearest neighbor algorithms with a predefined audio classification hierarchy and associated training sets. However, in this method, notions of human cognition and psychoacoustics play a large role in determining the success of a classification system.
Using a fully automatic clustering algorithm such as KHM could limit the subjective nature of musical classification. Although this new classification may not make sense to a human, it could be used to automatically store an retrieve sound information with a much higher degree of accuracy.
6 Future Work
6.1 Original Goal
Over the past 30 years, many papers have presented ways in which to improve KM. Some focus on creating good initialization methods while other look at finding optimal values for k. My thought was that using some of these ideas but with KHM in mind.
In [3], Elkan describes ways to improve KM using geometric insights from the triangle inequality. He presents two lemma, one for an upper bound and one for a lower bound, in order to reduce the number of distance calculations needed. Although the improvements do not reduce the theoretical computational complexity, they do greatly reduce the number of distance calculation and achieve significant speedups in practice.
My original goals in writing this paper was to try to use the same principles involving the triangle inequality and apply them to the KHM algorithm. This was found to be not possible due to fact that the number of calculation cannot be reduced since all calculations between centers and points are needed for soft membership algorithms.
6.2 Improving on KHM
Despite Zhang claim in [2] that KHM is “essentially insensitive to initialization,” both [2] and [4] show that there is still remains some variance in the local optimal solution given by different initialization.
One idea to further reduce this variance would be to use the centers found after one pass KHM as an initialization to a second pass through KHM. We could artificially move centers from more dense clusters to locations between sparse clusters.
The Algorithm is as follows:
- Run KHM
- Run KM to determine hard membership
- For each center, calculate variance of cluster.
- Transplant centers from the m densest clusters to the m locations between sparse clusters.
- Rerun KHM
This algorithm could be evaluated by testing to see if the variance of the solution set is less than that of one pass through KHM given a series of different initializations.
Appendix A: Complete Pseudo Code for KHM
Data Structures:
N: n by d+1 array - contains static information about data set
Nmin: n element array which holds the minimum distance to any center
K: k by d array that holds information about centers
M: n by k array that holds distance from all point in N to all points in K
Temporary Arrays (Could be reduced but shown for simplicity)
U: n element array
Q: n by k temporary array
QQ:k element array
R: n by k temporary array
RR:k element array
T: n by k
p: KHM parameter discussed in beginning of section 3
Initialization
- Create an initial K:
Choose any k points from N
Main Loop
- Fill Matrix M:
Calculate distances from all points in N to all centers in K
- Compute Nmin:
Find minimum distance for to any center for each point in N
- Recompute Harmonic Averages and Update K:
For each point (j = 0 to n)
For each center (i = 0 to k)
U[j] = U[j] + (Nmin[j]/N[j,I])
U[j] = U[j] – 1;
For each center (i = 0 to k)
For each point (j = 0 to n)
Q[j,I] = [(Nmin[i]^(p-2) * (Nmin[i]/N[j,i])^(p+2)] /
[(1 + U[j]^p)^2]
For each center (i = 0 to k)
For each point (j = 0 to n)
QQ[i] += Q[j,I]
For each center (i = 0 to k)
For each point (j = 0 to n)
R[j,I] = Q[j,i] / QQ[i]
For each center (i = 0 to k)
For each point (j = 0 to n)
K[i] = K[i] + R[j,i]*N[j]
- If no center is updated in step 4 then stop
The algorithm has converged
The algorithm above is given in a simplified form as to show all the temporary values that are calculated by taking partial derivative for the Harmonic Average Function. Each nested loop in Step 4 represents the one of the five decomposed equations from Equation 6 of [3].
References
[1] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceeding of the fifth Berkeley symposium on mathematical statistics and probability. Pp 281-297, 1967.
[2] B. Zhang. Generalized k-harmonic means – boosting in unsupervised learning. Technical Report HPL-2000-137, Hewlett-Packard Labs, 2000.
[3] C. Elkan. K-means: fast, faster, fastest. UCSD AI Seminar Talk, October 2002.
[4] G. Hamerly and C. Elkan. Alternatives to the k-means algorithm that find better clusterings (pdf). To appear in Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM'02), November 2002.
[5] Drineas, Frieze, Kannan, Vempala, and Vinay. “Clustering in large graphs and matrices.” ACM-SIAM Symposium on Discrete Algorithms, 1999.
[6] Tzanetakis, and Cook “Musical Genre Classification of Audio Signals.” IEEE Transactions on Speech and Audio Processing. 2002.