Feature selection for clustering categorical data with an embedded modeling approach

Cláudia Silvestre, Margarida Cardoso, and Mário Figueiredo

Escola Superior de Comunicação Social, Lisboa, Portugal

ISCTE Business School, Lisbon University Institute, Lisboa, Portugal

Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal

Abstract: Research on feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches focusing on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), in which a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization (EM) algorithm that simultaneously selects the subset of relevant features, using a minimum message length (MML) criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. Results on synthetic data illustrate the ability of the proposed EM method to recover ground truth, and an application to real data from official statistics shows its usefulness.

Keywords: Cluster analysis, finite mixture models, EM algorithm, feature selection, categorical features
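Although the full derivation appears in the body of the paper, the modeling idea can be made concrete. In the feature-saliency formulation (in the spirit of Law, Figueiredo, and Jain, 2004, on which this line of work builds), each observation follows p(x_n) = sum_k alpha_k prod_d [rho_d P(x_nd | theta_kd) + (1 - rho_d) Q(x_nd | theta_d)], where rho_d is the probability that feature d is relevant, P is a cluster-specific multinomial, and Q is a common multinomial shared by all clusters. The sketch below is a minimal NumPy illustration of the corresponding EM updates under these assumptions; it is not the authors' implementation, and it omits the MML-based selection step described in the abstract. The function and variable names and the synthetic demo are ours, for illustration only.

```python
# A minimal, illustrative EM sketch for a multinomial mixture with
# per-feature saliency weights (feature-saliency clustering).
# NOTE: not the authors' implementation; the MML-based selection step
# is omitted, and the demo data below are synthetic assumptions.
import numpy as np

rng = np.random.default_rng(0)

def em_feature_saliency(X, K, n_iter=200, eps=1e-9):
    """X: (N, D) integer-coded categorical data; K: number of clusters."""
    N, D = X.shape
    L = int(X.max()) + 1                        # number of category levels
    one_hot = np.eye(L)[X]                      # (N, D, L) indicator coding
    alpha = np.full(K, 1.0 / K)                 # mixing proportions
    rho = np.full(D, 0.5)                       # feature saliencies
    P = rng.dirichlet(np.ones(L), (K, D))       # cluster-specific multinomials
    Q = rng.dirichlet(np.ones(L), D)            # common "irrelevant" multinomials

    for _ in range(n_iter):
        # E-step: relevant (a) vs. irrelevant (b) part of each feature term.
        Px = np.einsum('ndl,kdl->nkd', one_hot, P)    # P_kd(x_nd)
        Qx = np.einsum('ndl,dl->nd', one_hot, Q)      # Q_d(x_nd)
        a = rho * Px
        b = (1.0 - rho) * Qx[:, None, :]
        c = a + b
        log_w = np.log(alpha) + np.log(c + eps).sum(axis=2)
        log_w -= log_w.max(axis=1, keepdims=True)     # stabilize before exp
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)             # responsibilities (N, K)
        u = w[:, :, None] * a / (c + eps)             # P(k, feature d relevant | x_n)
        v = w[:, :, None] - u                         # P(k, feature d irrelevant | x_n)

        # M-step: closed-form updates of all parameters.
        alpha = w.mean(axis=0)
        P = np.einsum('nkd,ndl->kdl', u, one_hot) + eps
        P /= P.sum(axis=2, keepdims=True)
        Q = np.einsum('nkd,ndl->dl', v, one_hot) + eps
        Q /= Q.sum(axis=1, keepdims=True)
        rho = u.sum(axis=(0, 1)) / N                  # expected saliency per feature
    return alpha, rho, P, Q, w

# Synthetic demo: features 0-1 depend on the cluster, features 2-3 do not.
N = 1000
z = rng.integers(0, 2, size=N)
probs = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
f0 = np.array([rng.choice(3, p=probs[k]) for k in z])
f1 = np.array([rng.choice(3, p=probs[k]) for k in z])
X = np.column_stack([f0, f1, rng.integers(0, 3, N), rng.integers(0, 3, N)])
alpha, rho, *_ = em_feature_saliency(X, K=2)
print(np.round(rho, 2))   # expect high saliency for features 0-1, low for 2-3
```

On this synthetic example, the estimated saliencies should be close to 1 for the two cluster-dependent features and noticeably lower for the two features drawn from a single common distribution, which is the behaviour the abstract describes the latent relevance variables as capturing.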
