NII International Internship Project:

Correlation-Based Outlier Detection

Supervisor: Michael HOULE, Visiting Professor

The project will investigate the application of the relevant-set correlation (RSC) clustering model [1] to the detection of outlier elements in large, high-dimensional data sets. Developed at NII, RSC is a generic model for clustering that requires no direct knowledge of the nature or representation of the data. In lieu of such knowledge, the model relies solely on the existence of an oracle for queries-by-example, that accepts a reference to a data item and returns a ranked set of items relevant to the query. In principle, the role of the oracle could be played by any similarity search structure, or even a search engine whose internal ranking function and relevancy scores are secret. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. The RSC model provides a framework for the comparison of any two arbitrary subsets of a large dataset, as to which of the two subsets constitutes the more significantly coherent grouping of data.

Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC, has already been developed and demonstrated for very large, high-dimensional datasets, using a fast approximate similarity search structure (the SASH [2]) as the oracle [1]. The features of GreedyRSC include:

·  The ability to scale to large data sets, both in terms of the number of items and the size of the attribute sets.

·  Genericity, in its ability to deal with different types of attributes (categorical, ordinal, spatial).

·  Automatic determination of an appropriate number of clusters, with the user specifying as input parameters only the minimum desired cluster size and the maximum allowable correlation (proportion of overlap) between pairs of clusters.

·  Robustness with respect to noisy data.

·  The ability to identify clusters of any size (as small as three items).

For every point in the dataset, GreedyRSC uses the RSC model to quantify the coherence of groups of surrounding points, and greedily selects those with the most significant values. The GreedyRSC process can serve as the basis of an outlier detection strategy by first identifying and selecting regions which score poorly under the RSC model, rather than those which score strongly.

The goals of the OutlierRSC project are:

·  To adapt the RSC cluster candidate quality measures for the evaluation of outlier elements and outlier clusters. The simplest RSC-based approach is simply to select as outliers those sets that score poorest in terms of their RSC clustering quality. However, more sophisticated measures may lead to better performance.

·  To adapt GreedyRSC for detecting outlier elements in a variety of large, high-dimensional data sets.

·  To evaluate and benchmark the new dynamic categorization method against other prominent outlier detection methods.

The ideal duration of this project is 6 months, although visits of as short as 5 months will still be considered. Although it is possible to reduce the length of the internship after being accepted, it may be difficult to extend the duration beyond that which is stated in the candidate’s application. Therefore, candidates are strongly recommended to state in their application only the longest possible duration for their intended stay at NII.

References

[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta, GA, USA, 2008.

[2] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely high-dimensional data sets", in Proc. 21st IEEE International Conference on Data Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.