Clustering: partitioning data into similar groupings. A cluster is a grouping of 'similar' items! Clustering is a process that partitions a set of objects into equivalence classes.
In clustering, data is partitioned into classes. Members of each class are in some way more "similar" to each other than to members of different classes. So
a. intra-class variance is low
b. inter-class variance is high
Applications of clustering:
1. Pattern recognition
2. Data mining
3. Image processing
4. Market research & econometric assessment
5. In WWW: document classification and search on a similar ontology space
6. Land use registry
7. Insurance
8. Urban and regional planning
Clustering = Unsupervised learning with no notion of pre-defined classes (what they are, how many there are – no a priori knowledge)
Data preparation before data mining:
► Raw data to be mined is typically noisy and carries many unwanted attributes, etc.
► Discretization of continuous data
► Data normalization into the [-1 .. +1] or [0 .. 1] range
► Data smoothing to reduce noise, removal of outliers, etc.
► Relevance analysis: feature selection to retain only the relevant set of features
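A minimal sketch of a few of these preparation steps (normalization, discretization, outlier removal); the toy attribute values and thresholds are assumptions for illustration, not from the notes:

```python
import numpy as np

def min_max_normalize(x, lo=0.0, hi=1.0):
    """Rescale a numeric attribute into the [lo, hi] range (e.g. [0, 1] or [-1, 1])."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.full_like(x, lo)          # constant column: map everything to lo
    return lo + (hi - lo) * (x - x.min()) / span

def discretize(x, n_bins=4):
    """Discretize a continuous attribute into n_bins equal-width bins (0 .. n_bins-1)."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

def remove_outliers(x, k=3.0):
    """Simple smoothing: drop points farther than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return x[np.abs(x - mu) <= k * sigma]

ages = np.array([22, 25, 31, 40, 38, 29, 120])   # toy attribute; 120 is an outlier
print(min_max_normalize(ages, -1, 1))
print(discretize(ages, n_bins=3))
print(remove_outliers(ages, k=2.0))              # with k=2 the value 120 is dropped
```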
Clustering is an unsupervised partition of a given data set into equivalence classes. Ultimately, clustering is equivalent to classification.
A good clustering produces a partition with low within-group variance and high inter-group variance. The idea is that the centroid of a cluster (the geometric centre of the cluster in spatial data) and the variance about that centroid may be sufficient in most cases to describe the data.
Cluster point or centroid = exemplar
Variance = how "fuzzy" the centroid is as a summary of the cluster (the spread of points about it)
within-group = intra-cluster
between-groups = inter-cluster
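A small numerical illustration of these quantities; the 2-D points and the fixed cluster assignment below are invented for the example:

```python
import numpy as np

# Toy 2-D data and an assumed assignment of each point to a cluster (0 or 1).
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],    # cluster 0
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])   # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

centroids = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])

# Intra-cluster (within-group) variance: spread of points about their own centroid.
intra = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in np.unique(labels))

# Inter-cluster (between-group) variance: spread of the centroids about the global
# mean, weighted by cluster size.
grand_mean = X.mean(axis=0)
inter = sum((labels == k).sum() * ((centroids[k] - grand_mean) ** 2).sum()
            for k in np.unique(labels))

print("centroids:", centroids)
print("intra-cluster variance:", intra)   # low for a good clustering
print("inter-cluster variance:", inter)   # high for a good clustering
```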
The issue here is "similarity". How do we measure similarity? This is not easy to answer.
Secondly, if there are "hidden" patterns, does the clustering scheme discover them?
Requirements of good clustering:
1. Insensitivity to order of input data
2. Capable of cluster identification on a single pass over data
3. Works even in presence of noise and outliers
4. Scalability
5. Low dependence on domain knowledge
6. Ability to deal with different types of input: numerical, ordinal, categorical, etc.
7. Discovery of clusters with arbitrary shape
Two points or patterns are similar if the distance between them is lower than some threshold; if the distance between A and B is below the threshold, A and B are placed in the same cluster.
A metric space is a set on which a generalized distance function $d$ of some sort can be defined, satisfying non-negativity ($d(x, y) \ge 0$, with $d(x, y) = 0$ iff $x = y$), symmetry ($d(x, y) = d(y, x)$), and the triangular inequality ($d(x, z) \le d(x, y) + d(y, z)$).
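As a quick sanity check, the ordinary Euclidean distance can be verified against these properties numerically; the sample points below are arbitrary:

```python
import itertools
import numpy as np

def euclid(x, y):
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

points = [(0, 0), (3, 4), (6, 8), (1, -2)]

for x, y in itertools.combinations(points, 2):
    assert euclid(x, y) >= 0                     # non-negativity
    assert euclid(x, y) == euclid(y, x)          # symmetry
for x in points:
    assert euclid(x, x) == 0                     # d(x, x) = 0
for x, y, z in itertools.permutations(points, 3):
    assert euclid(x, z) <= euclid(x, y) + euclid(y, z) + 1e-12  # triangular inequality

print("Euclidean distance satisfies the metric axioms on these sample points.")
```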
How do we measure the distance? It depends!
Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
Bounded distance: on any given metric space, a measure that never exceeds a threshold,
e.g. $d_b(x, y) = \dfrac{d(x, y)}{1 + d(x, y)}$, where $d$ is a measure of distance between two points obeying the usual distance properties.
Maximum distance: $d(x, y) = \max_i |x_i - y_i|$
Minimum distance: $d(x, y) = \min_i |x_i - y_i|$
Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where $S$ is the covariance matrix of the data
String distance:
a. Hamming distance: the distance between two strings of equal length is the total number of positions at which they differ.
b. Levenshtein distance: Given a source string s and a target string t, the minimum number of insertions, deletions and substitutions required to transform s into t (see the code sketch below). Useful in
· Spell checking
· Speech recognition
· DNA analysis
· Plagiarism detection
The larger the distance between two data items, the less similar they are.
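A sketch implementing several of the distances above (Euclidean, Manhattan, maximum/Chebyshev, bounded, Mahalanobis, Levenshtein); the sample vectors, the toy data used to estimate the covariance matrix, and the example strings are assumptions for illustration:

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum()))

def manhattan(x, y):
    return float(np.abs(np.asarray(x) - np.asarray(y)).sum())

def maximum(x, y):                        # the "maximum" (Chebyshev) distance
    return float(np.abs(np.asarray(x) - np.asarray(y)).max())

def bounded(x, y, base=euclidean):        # d / (1 + d) never exceeds 1
    d = base(x, y)
    return d / (1.0 + d)

def mahalanobis(x, y, cov):
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution (or match)
        prev = cur
    return prev[-1]

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
sample = np.random.default_rng(0).normal(size=(50, 3))   # toy data for the covariance
print(euclidean(x, y), manhattan(x, y), maximum(x, y), bounded(x, y))
print(mahalanobis(x, y, np.cov(sample, rowvar=False)))
print(levenshtein("kitten", "sitting"))                   # 3
```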
Also, we need a way to measure the distance between two clusters. A number of possibilities exist.
Single linkage distance (nearest neighbours): $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$, the minimum distance between points in the two clusters.
Complete linkage distance (farthest neighbours): $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$
Average linkage distance (a compromise): $d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$, the average distance among all pairs of points
Centroid linkage (easiest): $d(A, B) = d(\bar{a}, \bar{b})$, the distance between the two cluster centroids
■ Single-linkage produces long and stringy clusters
■ Complete linkage produces compact and small clusters
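A small sketch of the four linkage distances for two clusters given as arrays of points; the toy clusters A and B are invented for illustration:

```python
import numpy as np

def pairwise(A, B):
    """Matrix of Euclidean distances between every point of A and every point of B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def single_linkage(A, B):    return pairwise(A, B).min()    # nearest neighbours
def complete_linkage(A, B):  return pairwise(A, B).max()    # farthest neighbours
def average_linkage(A, B):   return pairwise(A, B).mean()   # all-pairs average
def centroid_linkage(A, B):
    return float(np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0)))

A = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
B = [[4.0, 4.0], [5.0, 3.5], [4.5, 5.0]]
for f in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(f.__name__, round(float(f(A, B)), 3))
```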
What should be the criteria for good clustering? Again, it depends.
Classical criterion-set:
1. Number of distinct clusters should be as low as possible and yet accommodate a sense of classification.
2. Distance between two distinct clusters should be larger than some threshold. How big, how small?
3. A decent clustering scheme will yield a minimum total variance of the clusters over all possible schemes.
Not all of these are independent. The ideal clustering problem is NP-complete (akin to bin packing). One needs a workable heuristic to approach the problem.
Classification = Clustering = Learning
■ A set S is partitioned into a number of equivalence classes.
Supervised learning → classification (normally)
Unsupervised learning → clustering
Both clustering and classification induce a partition on the given data set S.
The clustering schemes:
a. Partitioning algorithms: Construct various partitions and then evaluate them by various algorithms.
If clusters are too close, perhaps they should be coalesced. If they are too voluminous (high variance) , perhaps they should be partitioned further.
b. Hierarchical algorithms: agglomerative and divisive approaches.
c. Density-based algorithms: based on connectivity and density functions.
d. Grid-based algorithms: quantize the object space into a finite number of cells (a grid) and cluster over the cells.
In general, a cluster-scheme is agglomerative if new clusters are formed one at a time from the existing cluster-set.
Let $C_n = \{c_1, c_2, \ldots, c_n\}$ be an n-cluster partition. We obtain $C_{n-1}$ from $C_n$ as follows:
a. Choose $c_i, c_j \in C_n$,
b. Merge them as $c_k = c_i \cup c_j$; $c_i$ and $c_j$ are erased from $C_n$ and the new cluster $c_k$ is inserted into the configuration.
The selection of $c_i$ and $c_j$ is predicated by some objective function. Some choices:
a. merge the closest pair: choose $(c_i, c_j)$ with the minimum inter-cluster distance $d(c_i, c_j)$
b. merge the pair that minimizes the variance of the merged cluster $V(c_i \cup c_j)$, where $V(c)$ is the variance of cluster $c$ about its centroid
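A naive sketch of this agglomerative scheme, using objective (a) above (merge the closest pair under single-linkage distance); the toy data and target cluster count are illustrative assumptions:

```python
import numpy as np

def agglomerate(points, target_k, cluster_dist):
    """Start with every point in its own cluster; repeatedly merge the closest pair
    (objective (a) above) until only target_k clusters remain."""
    clusters = [[np.asarray(p, float)] for p in points]
    while len(clusters) > target_k:
        # find the pair (i, j) with the smallest inter-cluster distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]     # c_k = c_i ∪ c_j
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

def single_link(A, B):
    return min(np.linalg.norm(a - b) for a in A for b in B)

data = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (9, 0), (9.2, 0.3)]
for c in agglomerate(data, target_k=3, cluster_dist=single_link):
    print([tuple(p) for p in c])
```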
Clustering/classification schemes can also be categorized as
► supervised learning (e.g. as in neural nets)
► unsupervised learning (e.g. as in Isodata, ...)
Example: Isodata.
Basic algorithm.
1. Given n patterns (points, objects, signatures, ...) $x_1, x_2, \ldots, x_n$.
2. Choose randomly any two points $x_i$ and $x_j$ such that the distance $d(x_i, x_j) > \theta_{\min}$, the minimum inter-cluster distance; these seed the first two clusters.
3. Take one of the remaining points $x_k$. Assign $x_k$ to an existing cluster $c$ if $d(x_k, c) \le D_{\max}$, the maximum cluster diameter.
4. If $d(x_k, c) > D_{\max}$ for every existing cluster $c$, assign $x_k$ to its own cluster.
5. For the remaining points, go to step 3 until no point is remaining.
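A minimal single-pass sketch of the scheme above in Python; the threshold names theta_min and d_max, the centroid-based distance to a cluster, and the toy data are assumptions for illustration, not part of the notes:

```python
import numpy as np

def isodata_pass(points, theta_min, d_max, rng=None):
    """One pass of the basic scheme above: seed two clusters from a random pair
    at least theta_min apart, then assign each remaining point to an existing
    cluster if it lies within d_max of that cluster's centroid, else start a new one."""
    rng = rng or np.random.default_rng()
    pts = [np.asarray(p, float) for p in points]

    # Step 2: pick a seed pair farther apart than the minimum inter-cluster distance.
    while True:
        i, j = rng.choice(len(pts), size=2, replace=False)
        if np.linalg.norm(pts[i] - pts[j]) > theta_min:
            break
    clusters = [[pts[i]], [pts[j]]]

    # Steps 3-5: a single pass over the remaining points.
    for k, p in enumerate(pts):
        if k in (i, j):
            continue
        dists = [np.linalg.norm(p - np.mean(c, axis=0)) for c in clusters]
        if min(dists) <= d_max:
            clusters[int(np.argmin(dists))].append(p)   # step 3: join nearest cluster
        else:
            clusters.append([p])                        # step 4: start a new cluster
    return clusters

data = [(0, 0), (0.3, 0.2), (5, 5), (5.2, 5.1), (10, 0)]
for c in isodata_pass(data, theta_min=2.0, d_max=1.0, rng=np.random.default_rng(1)):
    print([tuple(p) for p in c])
```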
Pros and cons of Isodata
Pros.
1. Clustering is not geographically biased to any particular region of data distribution.
2. A very efficient way of finding inherent clusters in a set.
Cons.
1. The clustering may require a considerable number of iterations.
2. The number of distinct clusters is not known a priori.
3. The scheme is insensitive to the variance/covariance structure of the data.