Decision Tree Classifiers

Fuzzy C-Means Clustering

Introduction to the Fuzzy C-Means Clustering

Quite often, we are faced with a problem where a collection of data-points are provided to us and we are called upon to devise a procedure (automated process) that groups these points in different categories based on the similarities that these data-points have with each other. This procedure is called clustering. The data-points that are available to us are referred to as training set.

The Fuzzy C-Means algorithm was developed by Bezdek and his colleagues and is one of the most popular clustering algorithms. In the Fuzzy C-Means algorithm we start by initializing the value of C (the number of groups or clusters that we assume that the training data-points belong to) and the C cluster centers. Then, the Fuzzy C-Means algorithm redefines these cluster centers and membership values of the training points in a way that takes into consideration the distances of the points in the training set from these cluster centers. The membership value of a training point with respect to a cluster center indicates how strongly we believe that this training point belongs to the cluster center under consideration. This redefinition of cluster centers and membership values continues until satisfactory performance is attained (i.e., until the membership values of data-points to cluster centers have converged to fixed values). The Fuzzy C-Means algorithm has two distinct phases. These phases are:

· Training Phase

· Performance Phase

In the training phase, the Fuzzy C-Means algorithm defines the locations of the C cluster centers and the values of the memberships of each training data-point to every one of these cluster centers. In the performance phase, a test point (different that the training point) is given and the equation that defines memberships to the C cluster centers is used to identify the values of these memberships.

An Example: Iris Data Set

The Iris data-set set is one of the most celebrated data-sets in the pattern recognition literature. It consists of 150 data-points of measurements of Iris plant types. The Iris plant types are Iris-Setosa, Iris-Versicolor, and Iris Virginica. The measurements are sepal length (feature 1) and sepal width (feature 2), petal length (feature 3) and petal width (feature 4) of Iris plants. A scatter plot of the Iris plant measurements for every combination of features (measurements) of the Iris plant is shown in Figure 1. As it can be seen from these plots for every combination of these features there are three groups of data corresponding to the three types of the Iris plants. So it is instructive to see if a clustering algorithm like Fuzzy C-Means is capable of separating these three groups of data. In this application (The Iris Plant recognition problem) we are fortunate that we know the type of the Iris plant for each of the 150 data-points that are available to us. Hence, we can assess the performance of the Fuzzy C-Means algorithm.

Figure 1: Scatter plot of every combination of two features of the Iris data-set. As it can be seen from the figure there are certain pairs of features that have higher discriminatory power in separating the Iris-plant classes than others

(e.g., the features petal length and petal width have higher discriminatory power than the features sepal length and sepal width). In the figure red colored points correspond to Iris Setosa, green colored points correspond to Iris Versicolor, and blue colored points correspond to Iris Virginica.

As it can be seen from these plots there are certain combinations of features of Iris plants that have a higher discriminatory power than others (e.g., the combination of petal width and petal length features can more easily discriminate the three types of Iris plants than any other combination of two features). In this example we have chosen 120 data-points from the available 150 data-points of the Iris data-set to define the groups (clusters) in the data. We have also chosen to focus on two features (petal length and petal width). These data-points constitute our training set. The remaining 30 points comprise our test set.

In Figures 2-11 we are showing the cluster points from the time of their initial creation (Figure 2; iteration 1) to the time at which we have iterated with the Fuzzy C-Means algorithm 10 times (Figure 11). An observation that can be extracted from Figures 2-11 is that iterating through the Fuzzy C-Means equations eventually lead us to cluster centers that are believably the cluster centers of the data-points corresponding to each of the three types of Iris plants. Furthermore, in Figures 12 and 13 we are showing (by using different shades of red, blue and green) the membership values of every point in the space of petal length and petal width features. When the point is painted very red it means that the most likely cluster that the point belongs to is the red cluster (representing the Iris Setosa group), when the point is painted very green it means that the most likely cluster that the point belongs to is the green cluster (representing the Iris Versicolor group), and when the point is painted very blue it means that the most likely cluster that the point belongs to is the blue cluster (representing the Iris Virginica group). In the same figure if a point is painted red, green, or blue, independently of the shade, it means that the most dominant membership value is the value corresponding to the red (Setosa), green (Versicolor), or blue (Virginica) group. In Figure 12 we are also showing the collection of the training points that were used by the Fuzzy C-Means algorithm for the identification of the location of the cluster centers and for the calculation of the membership values of these training points to each one of the three cluster centers. In this figure we can see that a few of the blue training points have green as their most dominant membership value, and a few of the green points have blue as their most dominant membership value. In Figure 13 we also see the collection of data-points designated as test points. We see in this figure that the group to which each one of these test points belongs to is correctly identified.

Figure 2: Cluster center locations of the Iris data after 1 iteration of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure 3: Cluster center locations of the Iris data after 2 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure 4: Cluster center locations of the Iris data after 3 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure 5: Cluster center locations of the Iris data after 4 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure 6: Cluster center locations of the Iris data after 5 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure 7: Cluster center locations of the Iris data after 6 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure 8: Cluster center locations of the Iris data after 7 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure 9: Cluster center locations of the Iris data after 8 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure 10: Cluster center locations of the Iris data after 9 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure11: Cluster center locations of the Iris data after 10 iterations of Fuzzy C-Means. The red colored points correspond to Iris Setosa, the green points correspond to Iris Versicolor and the blue points correspond to Iris Virginica. The black points identify the locations of the cluster centers that Fuzzy C-Means has computed.

Figure12: Membership values of the training points in the petal length/petal width feature space. Red points indicate that the dominant membership value is Iris Setosa, green points indicate that the dominant membership value is Iris Versicolor, and blue value indicates that the dominant membership value is Iris Virginica. The shade of red indicates the strength of the membership value, where stronger shade indicates higher membership value. In this figure the training points used to define the cluster centers are also shown. Two of the green training points are in a region of blue memberships, while two blue points are in a region of green membership values. In this figure the cluster points that Fuzzy C Means has discovered are designated by the black circles.

Figure13: Membership values of the points in the petal length/petal width feature space (test points are shown with the bigger symbol). Red points indicate that the dominant membership value is Iris Setosa, green points indicate that the dominant membership value is Iris Versicolor, and blue value indicates that the dominant membership value is Iris Virginica. The shade of red indicates the strength of the membership value, where stronger shade indicates higher membership value. In this figure the test points are also depicted. As it can be seen from the figure the test points correctly lie in the regions of membership values that matches their actual group (i.e., red test points in the red region of membership values, green points in the green region of membership values, etc.). In this figure the cluster points that Fuzzy C Means has discovered are designated by the black circles.