NII International Internship Project:

Classification Under the Relevant Set Correlation Model

Supervisor: Michael HOULE, Visiting Professor

One of the most well-known classification methods in machine learning is that of k-nearest-neighbor (k-NN) classification, a voting strategy in which each object is assigned to the class most common among its k closest neighbors within a training set of examples. Despite its simplicity and effectiveness, the k-NN method has had the reputation of being impractically slow for large data sets, due to the perceived cost of generating neighbor sets. However, in recent years very efficient data structures have been proposed for approximate similarity search [3], allowing useful approximations of the k-NN lists to be generated even for large, high-dimensional training sets.
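As a point of reference, the basic k-NN voting rule can be expressed in a few lines. The brute-force neighbor search below is purely illustrative; in practice, an approximate similarity search structure such as the SASH [3] would supply the neighbor lists.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    """Classify `query` by majority vote among its k nearest training points.

    Brute-force Euclidean search, standing in for an approximate
    similarity search index over large, high-dimensional data.
    """
    dists = np.linalg.norm(X_train - query, axis=1)   # distance from query to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest training points
    votes = Counter(y_train[i] for i in nearest)      # tally the class labels of those neighbors
    return votes.most_common(1)[0][0]                 # most frequent class wins
```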

Despite its successes, k-NN classification suffers from a number of drawbacks. The choice of the parameter k greatly affects performance: larger values of k tend to reduce the influence of noise, at the risk of degrading the boundaries of the classes produced. Fixing the value of k also favors the prediction of classes having many instances among the training examples over classes with fewer instances, particularly when a class has fewer than k training instances.

The goal of this project is to devise new “adaptive” variants of k-NN in which the number of neighbors k contributing to the class prediction depends on the distribution of the data in the vicinity of the object to be classified. In particular, we will consider the relevant-set correlation (RSC) clustering model [1] as a means of determining the most appropriate set of voting neighbors. Developed at NII, RSC is a generic model for clustering that requires no direct knowledge of the nature or representation of the data, but instead relies solely on the ranking of the data (the “relevant sets”) induced by the neighborhood relationship. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC, has already been developed and demonstrated for very large, high-dimensional datasets, using a fast approximate similarity search structure (the SASH [3]) as the oracle [1,2].
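To make the notion of correlation between sets more concrete, the sketch below computes the Pearson correlation between the 0/1 membership indicator vectors of two item sets over a common domain of n items. This is one natural instance of the "form of correlation" mentioned above, and is offered only as an illustration; the precise RSC definitions are given in [1].

```python
import math

def set_correlation(a, b, n):
    """Pearson correlation of the 0/1 membership vectors of sets `a` and `b`
    over a domain of n items (illustrative; see [1] for the RSC definitions).
    """
    inter = len(a & b)
    sa, sb = len(a), len(b)
    if sa in (0, n) or sb in (0, n):
        return 0.0  # degenerate sets have zero-variance indicator vectors
    num = inter - sa * sb / n                                    # covariance term (scaled by n)
    den = math.sqrt(sa * (1 - sa / n) * sb * (1 - sb / n))       # product of standard deviations (scaled by n)
    return num / den
```

For example, `set_correlation({0, 1, 2}, {1, 2, 3}, n=10)` returns a positive value reflecting the two sets' overlap relative to what chance would predict over a domain of 10 items.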

The specific goals of this project are:

  • To adapt the RSC cluster candidate quality measures to decide, for any object, the size k of the most consistent neighborhood from within the training set. Several variants should be proposed, some possibly making use of distance information as well as neighbor rankings (one illustrative direction is sketched after this list).
  • To implement, evaluate and benchmark the new classification methods against other prominent techniques, such as basic k-NN and SVM classification. The experimentation will target (but not necessarily be limited to) large document and image data sets.
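As a starting point for the first goal, one hypothetical adaptive variant could score each candidate neighborhood size by how strongly the relevant sets of its members agree with the relevant set of the query, and let each query select the k that maximizes this score before voting. The sketch below assumes a neighbor oracle `neighbors_of(i, k)` (hypothetical; a SASH-style index would serve this role) and reuses the illustrative `set_correlation` function from the earlier sketch; it is a placeholder for the RSC-based quality measures, not the project's actual method.

```python
import math
from collections import Counter

def adaptive_knn_predict(neighbors_of, y_train, query_id, k_values, n):
    """Pick, per query, the neighborhood size k whose members' relevant sets
    agree best (on average) with the query's own relevant set, then vote.

    `neighbors_of(i, k)` returns the k-NN id set of item i within the
    training set; `set_correlation` is the illustrative function above.
    """
    best_k, best_score = None, -math.inf
    for k in k_values:
        q_set = neighbors_of(query_id, k)
        # average correlation between the query's k-NN set and each member's k-NN set
        score = sum(set_correlation(q_set, neighbors_of(j, k), n) for j in q_set) / k
        if score > best_score:
            best_k, best_score = k, score
    votes = Counter(y_train[j] for j in neighbors_of(query_id, best_k))
    return votes.most_common(1)[0][0], best_k
```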

The ideal duration of this project is 6 months, although visits as short as 4 months will still be considered. Although it is possible to reduce the length of the internship after being accepted, it may be difficult to extend the duration beyond what is stated in the candidate's application. Candidates are therefore strongly encouraged to state in their application the longest possible duration for their intended stay at NII.

References

[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta, GA, USA, 2008.

[2] M. E. Houle, "Navigating massive data sets via local clustering", in Proc. 9th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 547-552, Washington DC, USA, 2003.

[3] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely high-dimensional data sets", in Proc. 21st IEEE International Conference on Data Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.