Rafal Ladysz: FINAL PROJECT PAPER

for INFS 795, Fall 2004 (Professor Carlotta Domeniconi)

CLUSTERING OF EVOLVING TIME SERIES DATA:

an attempt to optimize the number and initial positions of cluster centroids

for k-means clustering with regard to goodness and accuracy

using the Euclidean and Dynamic Time Warping distance measures

Contents

INTRODUCTION

OBJECTIVES AND MOTIVATION

THEORETICAL BACKGROUND AND OTHER WORKS

CHARACTERISTICS OF K-MEANS AND EM ALGORITHMS

SIMILARITY MEASURES

EXPERIMENTAL SETUP

OBJECTS EXTRACTION

INITIALIZATION

ALGORITHMS

SOFTWARE TOOLS

DATA

EXPERIMENTAL RESULTS

EUCLIDEAN VS. DTW SIMILARITY MEASURES

ROLE OF WEIGHTING

DIMENSIONALITY

DATA BIAS

FINAL ACCURACY EVALUATION

SUMMARY

FUTURE WORKS

REFERENCES

APPENDIX

INTRODUCTION

Clustering is an unsupervised data mining technique in which the objects under investigation are grouped into subsets – disjoint or overlapping – of the original set of objects.

The importance of clustering as a data mining method stems from the fact that it does not (in general) assume any knowledge of the objects being clustered, and hence does not use the labels (classes) required by classification techniques (training-testing-evaluation). Since this scenario of no prior knowledge is very common in real life, clustering is the method of choice when one attempts to gather general knowledge about the objects under scrutiny, usually focusing on their similarities and differences.

Clustering can also be applied when knowledge about the objects exists but the goal is to derive new knowledge. This is often done in subspaces of the “original” domain space in which the objects are described (in their full dimensionality). For instance, a geographical region can be divided into a number of climatological zones, but when only a subset of the characterizing parameters is taken into account (e.g. temperature and precipitation), a different partition with a different number of clusters and respective members might turn out to be a “better” choice.

The kind of data to be clustered matters. A time series is a widely used form of data, expressed as a sequence of real numbers in which each moment in the series’ history is assigned a numeric value. Put another way, a time series is a discrete function of time, often characterizing changes of some observable value(s) over time (as the independent variable). A time series can be considered a special case of a data stream, where the latter can be multidimensional as opposed to the scalar (one-dimensional) character of a time series.

So-called event data must not be confused with time series, where the latter “is (...) a sequence of real numbers representing measurements of a variable taken at equal time intervals. Techniques developed for time series require numerical features, and do not support target events within a time frame” [1].

Because of the above, clustering time series – an intersection of two widely visited areas – is an important data analysis activity which helps discover deeply hidden patterns, trends, features and indicators of the intrinsic dynamics of the underlying system (represented by the time series). The method is in some sense “similar” to correlation analysis. For instance – taking one more example from climatology – if a region can be divided into as many zones with respect to temperature and precipitation as with respect to temperature and pressure, this might suggest a higher correlation between pressure and precipitation (obviously, correlation does not imply causality).

In this project the focus is mainly on a single clustering technique, k-means, with only a rudimentary comparison made between it and Expectation-Maximization (EM). Two distance (similarity) measures have been applied: Euclidean and Dynamic Time Warping (DTW). Three groups of time series data have been used: climatological, medical and financial (each represented by two real data sets). Such data diversity should help obtain less biased results, as well as provide some insight into these mutually complementary domains, representing micro- and macro-scale, nature- and culture-driven scenarios characterized by their respective time series.

OBJECTIVES AND MOTIVATION

The core objectives of this project are two-fold:

  1. investigating the k-means clustering algorithm in terms of its sensitivity to initial conditions, i.e. the initial positions of the cluster centroids and the number of centroids,
  2. comparing the Euclidean and DTW distance measures in terms of how they influence what is investigated in 1.

Because of limitations of time and of the software tools available during the project, analogous research for EM clustering – initially included in the project scope – is considered possible future work.

The idea and motivation for this project originate from another project in which time series were investigated with respect to underlying chaos (see Appendix for explanation), using a dedicated software tool. A parameter called the largest Lyapunov exponent (LE, again see Appendix for explanation) was computed and all drastic changes were indexed, enabling the timing of those changes to be mapped back to the original time series (Figure 1), thus generating a set of time series subsequences which can be considered objects for clustering. Successful implementation of real-time clustering of such subsequences as evolving time series data could improve the accuracy of the aforementioned LE-computing algorithm by increasing its sensitivity to the detected signal (but not its predictive capability).

I believe this project is non-trivial in that it encounters two challenges: one is as old as partition-based clustering and concerns optimizing the number of centroids, which is considered an NP-hard problem; the second stems from the experimental setting, in which the number of objects to be clustered (time series subsequences) is also unknown. These experimental constraints make the project interesting and challenging. It can be considered a first step towards gaining experience in this area, so the project itself and its results should serve as a landmark and framework for further attempts in this direction.

THEORETICAL BACKGROUND AND OTHER WORKS

Clustering time series has a long experimental history. Most frequently, experimenters are interested in finding similar patterns among the objects, using either of two main approaches: whole clustering and subsequence clustering. In the first, given a set of individual time series, the objective is to group similar time series into the same cluster, whereas in the second, given a single time series, its subsequences are extracted by means of a “sliding window” [2].

Gunopulos et al. [3] use, besides the Euclidean similarity measure, a Longest Common Subsequence-like similarity (with global scaling).

Perng et al. [4] introduce a similarity definition congenial to human perception.

Clarkson [5] uses segmental k-means clustering with training. Segmental k-means operates with models, a number of samples per state, and a number of states per model. It initializes N models and resegments the cluster membership for each model; training is performed to estimate a new model for each segment, and resegmentation is repeated until convergence.

Vlachos et al. [6] suggest an approach based on multiple distance measures, including the Longest Common Subsequence and Dynamic Time Warping, criticizing the Euclidean distance measure as too vulnerable to noise.

Aggarwal et al. [7] describe an interesting approach to clustering evolving data streams (a generalization of scalar time series), introducing a stream clustering framework called CluStream that consists of separate online and offline processing.

Daewon et al. [8] introduce the so-called PseudoPCA for cluster analysis of time series, in which dimensionality reduction is performed after applying so-called phase space reconstruction (see Appendix for explanation) to one-dimensional scalar data.

Calinski et al. [9] were among those who addressed the fundamental issue of heuristics for the optimal number of clusters. Their pseudo-F statistic, also called the C-H index, is often used as the objective function for determining the optimal k. It can be described by the following formula:

C-H(k) = (B(k)/(k - 1)) / (W(k)/(n - k))

where B(k) is the between-cluster sum of squares, W(k) is the within-cluster sum of squares, k is the number of clusters, n is the number of clustered objects, and C-H(k) is to be maximized over k.
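For illustration, a minimal sketch of how the C-H index could be computed for a given partition is shown below (Python with NumPy; the function and variable names are illustrative and not taken from the referenced work):

import numpy as np

def calinski_harabasz(X, labels):
    # X: n x d matrix of objects; labels: cluster assignment for each row
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    B = 0.0  # between-cluster sum of squares
    W = 0.0  # within-cluster sum of squares
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        B += len(members) * np.sum((centroid - overall_mean) ** 2)
        W += np.sum((members - centroid) ** 2)
    return (B / (k - 1)) / (W / (n - k))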

Legendre [10] uses the C-H index in a cascade-like strategy, in which the user starts with k1 groups and ends with k2 groups (k1 > k2). Cascading from a larger to the next smaller number of groups, the two closest groups are identified and fused, followed by a two-step alternating least-squares algorithm run until convergence. The two steps are: (1) computing cluster centroids and using them as new cluster seeds, and (2) assigning each object to its nearest seed. The author also considers weighting the attributes.

It should be mentioned that there are many other methods of optimizing the number of clusters; for instance, Ebdon [11] suggests using the root mean square deviation (RMSD) to determine the optimum number of clusters, where the optimum solution usually has the smallest number of clusters that avoids placing “naturally” dissimilar points in the same partition. Higher values of the RMSD denote increased variation within clusters.

The approach taken in this project is different still, facing the brutal reality of an unknown number of objects and (of course) partitions, while taking into account one more target constraint: the requirement of on-the-fly processing. Testing this requirement is outside the scope of this project and can be considered closely related future work.

CHARACTERISTICS OF K-MEANS AND EM ALGORITHMS

The k-means algorithm was introduced by MacQueen in 1967 [12] and is probably the simplest and most popular clustering technique. It can be characterized briefly as follows:

- suboptimal and greedy (susceptible to local minima)

- sensitive to initial conditions and outliers; because of NP-hardness, initialization is typically random

- requires the number of clusters (k) as part of the input

- Euclidean distance is its most natural similarity measure, allowing spherical clusters only

- partitions are mutually disjoint (exclusive – no overlap allowed)

- guaranteed to terminate.

The k-means algorithm has several variants, e.g. k-modes (which processes categorical data), k-medoids (which operates with medoids, i.e. actual objects, rather than means, and is thus more robust in the presence of outliers), and EM, which is another widely used partition-based clustering algorithm.

EM can be considered a generalization of k-means to a probabilistic setting (it maintains membership probabilities for all clusters rather than assigning elements to clusters outright). It works iteratively: the means and covariance matrix are initialized and, while the convergence criterion is not met, the probability of each data point belonging to each cluster is computed and the cluster distributions are recomputed using the current membership probabilities; cluster probabilities are stored as instance weights using the means and standard deviations of the attributes, and the algorithm terminates when the likelihood saturates. EM is useful when the range of values differs significantly across dimensions, but it is also sensitive to initial conditions.
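As a reference point, a minimal sketch of the basic k-means loop characterized above (random initialization, Euclidean assignment, centroid update until assignments stop changing) might look as follows; this is a generic illustration, not the SPSS implementation actually used in the experiments:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # random initialization: pick k distinct objects as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # assign every object to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no reassignment: the algorithm has converged
        labels = new_labels
        # recompute each centroid as the mean of its current members
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

EM is structured similarly, but replaces the hard assignment step with membership probabilities and the centroid update with a weighted re-estimation of the cluster distributions.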

SIMILARITY MEASURES

The Euclidean distance between two objects A and B in N-dimensional space, defined as

DE = (i = 1N (xA - xB)2)1/2

is one of the most commonly used distance measures, where each object, i.e. time series subsequence of length N,is considered as a point in N-dimensional space.
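In code, this distance between two equal-length subsequences A and B is simply (a trivial sketch):

import numpy as np

def euclidean_distance(a, b):
    # each subsequence is treated as a point in N-dimensional space
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))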

Advantages of the Euclidean distance measure include its simplicity and natural, intuitive sense. Disadvantages include high sensitivity to noise and outliers (especially for sparse data), coverage of spherical regions of the domain space only, and the need for extensive data preprocessing when applied as a time series similarity measure (as was the case in this project). Clearly, in spite of its merits, the Euclidean distance is not the similarity measure of choice for time series.

In real-life scenarios small distortions very often take place, causing “delays” in time of phenomena that are reactions to, or are otherwise caused by, their direct or indirect stimuli. To account for such artifacts, so-called Dynamic Time Warping (DTW) was proposed by Sankoff and Kruskal in 1983. DTW as a distance measure minimizes dissimilarity by aligning the two sequences, as in the figure below.

Note.

This technique originally emerged in speech recognition in the 1970s, where alignment is necessary because different parts of words are spoken at different rates. The best alignment between the word to be recognized and a reference pattern identifies the word as the one represented by that reference pattern.

Fig. 2: Comparison of Euclidean (left) and Dynamic Time Warping (right) distance measures between two time series sequences (marked red and blue) in terms of relations between their respective points (one-to-one vs. many-to-many)

The DTW uses the following recurrence to compute the distance:

γ(i, j) = d(i, j) + min{ γ(i−1, j), γ(i, j−1), γ(i−1, j−1) }

where γ(i, j) is the cumulative distance, i.e. the local distance d(i, j) plus the minimum cumulative distance among the adjacent cells. The idea is depicted in Figure 3 below:

Fig. 3: Idea of the DTW distance measure - general case of different lengths (in this project all lengths were equal within each experimental run); for the Euclidean distance the red “path” would be a straight diagonal (for equal lengths)

The similarity between the already aligned sequences becomes the conventional Euclidean distance between the two series – thus DTW can be considered, in a sense, a generalization of the Euclidean distance measure.

Advantages of DTW include optimality and a relatively wide area of application, beyond handwriting and speech recognition. Among its disadvantages are O(N²) time and space complexity and the resulting limitation on dataset size (up to about 10,000 points). There are, however, methods to lower the time complexity of DTW: filling only part of the cost matrix, undersampling the data before time warping (used in this project), or reducing the number of times DTW is run by indexing.
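To make the recurrence concrete, a minimal dynamic-programming sketch of the DTW distance is given below; it is quadratic in time and space, as noted above, and does not include the partial cost-matrix or indexing shortcuts just mentioned (undersampling, as used in this project, would simply be applied to the inputs beforehand):

import numpy as np

def dtw_distance(a, b):
    # gamma(i, j) = d(i, j) + min(gamma(i-1, j), gamma(i, j-1), gamma(i-1, j-1))
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2  # local (squared) point distance
            gamma[i, j] = d + min(gamma[i - 1, j],      # step in a only
                                  gamma[i, j - 1],      # step in b only
                                  gamma[i - 1, j - 1])  # diagonal step
    return np.sqrt(gamma[n, m])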

EXPERIMENTAL SETUP

The experimental part concerning centroid optimization has been designed in the following structured way:

  1. use all 6 datasets preprocessed as follows:

- extract subsequences (initial points determined using the LET program)

- index each subsequence for unique identification, allowing for dimensions 30 and 100

- remove outliers (very few cases found, thus performed manually)

- mean-shift (subtract the subsequence-specific mean value from all points) [18]

- apply weights w = 1 and w = 1.05 geometrically to the first 20 points in each subsequence

- normalize both weighted (w = 1.05) and unweighted (w = 1) data (to a 0-9 scale)

- store each preprocessed dataset in its own matrix

- format input files for SPSS (and, for the EEG dataset, also for the Weka ARFF format)

  2. for each dataset, precompute the DTW distances pairwise, creating the distance matrices
  3. run centroid optimization using two merge-or-not radius (R) values, for:

- Euclidean distance: using the built-in distance computing function

- DTW distance: using the precomputed distance matrix from (2) as input

  4. record and store the resulting centroid positions (and, implicitly, their number K)
  5. convert the centroid positions file into SPSS-readable format (to utilize the “Centers initial file” option)
  6. for all of the above 6 datasets, run k-means clustering using three options:

(a) without initial cluster positions, trying k < K

(b) as above, but with k > K

(c) with the cluster positions from (4), i.e. using K

  7. compare the results for (a), (b) and (c), and in particular:

- compute the Calinski-Harabasz index (also called the pseudo-F statistic, based on ANOVA)

- perform visual accuracy evaluation (confusion matrix)

  8. draw and report conclusions.

Comments:

- normalization: it usually matters for clustering whether data is normalized (no dimension is favored when scales are incomparable, e.g. age and salary), as well as how the data are scaled; in this case, however, all dimensions were treated the same way and none was more important than another, except when weighting was applied after all other preprocessing steps; the 0-9 scale was forced by a limitation of the DTW software (the only one available at the time); a sketch of these preprocessing steps is given after these comments

- the Calinski-Harabasz index can be interpreted as the goodness (rather than accuracy) of a particular k-partition.
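The following is a compact sketch of the per-subsequence preprocessing listed in step (1) above, i.e. mean-shift, geometric weighting of the first 20 points with w = 1.05, and rescaling to the 0-9 range required by the DTW tool; reading “geometrically” as multiplying the i-th of the first 20 points by w**i is an assumption, since the exact weighting scheme is not spelled out:

import numpy as np

def preprocess(subseq, w=1.05, n_weighted=20):
    x = np.asarray(subseq, dtype=float)
    x = x - x.mean()  # mean-shift: subtract the subsequence-specific mean
    # geometric weighting of the first n_weighted points (assumed reading: w**i)
    weights = np.ones_like(x)
    m = min(n_weighted, len(x))
    weights[:m] = w ** np.arange(m)
    x = x * weights
    # rescale to the 0-9 range expected by the DTW software
    lo, hi = x.min(), x.max()
    return 9.0 * (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)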

OBJECTS EXTRACTION

Clustering evolving time series data is specific because it is not known upfront how many objects – i.e. time series subsequences – are to be dealt with. Another challenge is how to extract the subsequences so that they can be compared against each other (to evaluate their mutual similarities) with minimal bias. The latter has been resolved by using a proprietary software tool called Lyapunov Exponent Tracker (LET), designed and implemented to compute the largest Lyapunov exponent (LE) of time series data as a discrete function of time.

To better understand the role of LET in this experiment and its background, it may be useful to look at its output (graphed using Excel):

Fig. 1: The input data is visualized in the lower graph (EEG vs. time); the upper graph shows values of LE vs. time, used as the criterion for generating “cut-off” points for subsequences (when a sudden and substantial change of the LE occurs)

Extracting subsequences results from evaluation of the LE: whenever its change is evaluated as “drastic” (i.e. exceeding a preset threshold), the underlying time index is mapped back to the original input data (EEG) to determine the beginning of a subsequence. As the subsequence lengths in all experiments are assumed to be equal (within any run for any time series), the starting point uniquely determines the whole subsequence as an object to be clustered. The number of such objects (subsequences) depends on the number of sample points in the time series input data (from 6000 to 18000) and on the chosen subsequence length, which is also its dimensionality (30 or 100).
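As an illustration of this mapping, a minimal sketch of turning the cut-off indices produced by the LE evaluation into fixed-length objects for clustering might look as follows (the threshold logic inside LET itself is not reproduced; names are illustrative):

import numpy as np

def extract_subsequences(series, cut_indices, length):
    # each cut-off index marks the start of one fixed-length subsequence
    series = np.asarray(series, dtype=float)
    objects = []
    for start in cut_indices:
        if start + length <= len(series):  # drop cut-offs too close to the end
            objects.append(series[start:start + length])
    return np.vstack(objects)  # one subsequence (object) per row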

Comments on object extraction.

Obviously, such a method of extracting subsequences is somewhat “arbitrary”; nevertheless, the whole project originates from the idea of boosting the performance of the LET by means of on-the-fly time series clustering, so from that perspective it is the method of choice with regard to the LET-related objective. There may be alternative, “more general” methods of extracting time series subsequences, though. Another possibility is collecting the time series already as single subsequences. In this experiment the only method used was the LET.

Because of the above “connection”, the question of bias can be raised at this point. Without any rigorous statistical considerations, it can be said that the generation of cut-off points based on chaotic behavior takes place in a different space than the clustering itself. This circumstance, at least qualitatively, might be a tentative argument for limited bias in extracting the time series subsequences.