Feature selection for clustering categorical data
with an embedded modeling approach
Cláudia Silvestre, Margarida Cardoso, and Mário Figueiredo
Escola Superior de Comunicação Social, Lisboa, Portugal
ISCTE, Business School, Lisbon University Institute, Lisboa, Portugal
Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal
Abstract: Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization (EM) algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the capability of the proposed EM method to recover ground truth. An application to real data, drawn from official statistics, shows its usefulness.
Keywords: Cluster analysis, finite mixture models, EM algorithm, feature selection, categorical features
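
To make the starting point of the abstract concrete, the following is a minimal sketch of plain EM for a finite mixture of independent categorical (multinomial) features. It is not the method proposed in this paper: it omits the latent feature-relevance variables and the minimum message length criterion, and only illustrates the base model that the proposed approach modifies. The function name `em_categorical_mixture`, its parameters, and the integer coding of categories are assumptions made for illustration.

```python
import numpy as np

def em_categorical_mixture(X, n_components, n_iter=100, seed=0, eps=1e-9):
    """Plain EM for a mixture of independent categorical features.

    Illustrative sketch only (no feature selection, no MML criterion).
    X is an (n, d) integer array; feature j takes values 0..C_j-1.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_cats = X.max(axis=0) + 1
    # One-hot encode each feature: list of (n, C_j) arrays.
    onehot = [np.eye(c)[X[:, j]] for j, c in enumerate(n_cats)]
    # Random soft initialization of the responsibilities (rows sum to 1).
    resp = rng.dirichlet(np.ones(n_components), size=n)
    for _ in range(n_iter):
        # M-step: mixing weights and per-component category probabilities.
        weights = resp.mean(axis=0)
        theta = []
        for oh in onehot:
            counts = resp.T @ oh + eps  # (K, C_j), eps avoids log(0)
            theta.append(counts / counts.sum(axis=1, keepdims=True))
        # E-step: posterior component probabilities, computed in log space.
        log_p = np.tile(np.log(weights + eps), (n, 1))
        for j in range(d):
            log_p += np.log(theta[j])[:, X[:, j]].T
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
    return weights, theta, resp
```

Calling `em_categorical_mixture(X, n_components=2)` returns the mixing weights, a list with one probability table per feature, and the soft cluster assignments. In this sketch every feature contributes to every component, so an irrelevant feature can only be recognized after the fact, for example by noticing that its probability table is nearly identical across components.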
