
A Feature Selection Algorithm Based on Pairwise Constraints for Linked Social Media Data *

Junyang Qiu1, Zhisong Pan1

PLA University of Science and Technology, Nanjing, China


Received **** 2015

Copyright © 2015 by author(s) and Scientific Research Publishing Inc.

This work is licensed under the Creative Commons Attribution International License (CC BY).

http://creativecommons.org/licenses/by/4.0/

Abstract

Pairwise constraints, which specify whether a pair of samples belongs to the same class (must-link constraints) or to different classes (cannot-link constraints), have been widely used in machine learning. The wide use of social media produces massive, high-dimensional and unlabeled data, and feature selection has been shown to be effective in dealing with such high-dimensional data. In this paper, we propose a new feature selection algorithm based on pairwise constraints for linked social media data, which integrates the linking information and the pairwise constraints simultaneously to select robust features. Experiments on the Flickr and BlogCatalog datasets indicate that our method outperforms conventional feature selection methods.

Keywords

Pairwise constraints; feature selection; social media data; semi-supervised learning

1. Introduction

With the rapid development of the Internet, the pervasive use of social media produces massive data at an unprecedented speed. This massive, unlabeled, high-dimensional social media data presents great challenges to data mining tasks such as clustering and classification [1]. Dimensionality reduction has been shown to be effective in removing irrelevant and redundant features, increasing learning accuracy, and enhancing the comprehensibility of learned models [2].

1.1. Background

Traditional dimensionality reduction methods can be divided into two categories: feature selection and feature extraction. Feature selection has proved effective in handling large-scale, high-dimensional data. It aims to select relevant features from high-dimensional data for an accurate and compact data representation [3]. Feature selection methods can be divided into three categories according to the number of labeled samples used in the training set:

1) Unsupervised feature selection;

2) Supervised feature selection;

3) Semi-supervised feature selection.

Traditional unsupervised feature selection algorithms include Information Gain [4], Laplacian Score [5] and SPEC [6]. Traditional supervised feature selection methods include Fisher Score [7] and Relief-F [8].
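To make the Laplacian Score criterion concrete: it scores each feature by how well it preserves the locality of a sample-similarity graph, with smaller scores indicating better features. The following NumPy sketch (the function name is illustrative, and a precomputed symmetric weight matrix S is assumed) follows the usual formulation:

```python
import numpy as np

def laplacian_scores(X, S):
    """Laplacian Score for each column (feature) of X; lower is better.

    S is a symmetric sample-similarity matrix, D = diag(row sums of S),
    and L = D - S is the graph Laplacian.  Each feature is mean-centered
    with respect to D before scoring, as in the Laplacian Score method.
    """
    D = np.diag(S.sum(axis=1))
    L = D - S
    ones = np.ones(X.shape[0])
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        # D-weighted centering of the feature vector
        f = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        scores.append((f @ L @ f) / (f @ D @ f))
    return np.array(scores)
```

A feature that is constant within each tightly connected group of samples gets a score of zero, while a feature that varies inside groups is penalized.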

With the advent of the era of Big Data and the development of data collection and storage technology, it is easy to collect large amounts of unlabeled data, but time-consuming and costly to label them. How to exploit abundant unlabeled samples to improve the generalization capability of a model learned from a small number of labeled samples has become one of the important issues in machine learning. For example, Zhao et al. [9] proposed a semi-supervised feature selection method based on spectral analysis, which takes advantage of both labeled and unlabeled samples through a regularization framework. Xu et al. [10] proposed a discriminative semi-supervised feature selection method based on the idea of manifold regularization. The above feature selection algorithms all assume that the data are independent and identically distributed. However, social media data violates this assumption because of the linkages between samples, which presents a new challenge to feature selection.

Utilizing domain knowledge has proved useful in guiding many learning tasks [11], [12], [13]. In general, domain knowledge can be expressed in diverse forms, for example, class labels, pairwise constraints and other prior information [14]. In this paper, we focus on domain knowledge in the form of pairwise constraints, which specify whether a pair of samples belongs to the same class (must-link constraints) or different classes (cannot-link constraints). Pairwise constraints arise naturally in many tasks such as image retrieval. In those applications, pairwise constraints are easier and more convenient to obtain than class labels, because the true labels may not be known a priori, while a user can often recognize whether a pair of samples belongs to the same class or not. Besides, pairwise constraints can be derived from labeled samples, but not vice versa.

Pairwise constraints are more practical and inexpensive than class labels and have proved effective in many tasks, such as dimensionality reduction and clustering. Tang and Zhong [15] used pairwise constraints to guide dimensionality reduction, taking into account both must-link and cannot-link constraints but overlooking the prior information provided by abundant unlabeled samples. Zhang et al. [14] proposed a simple but efficient semi-supervised dimensionality reduction algorithm called SSDR, which exploits both cannot-link and must-link constraints together with unlabeled samples; SSDR preserves the intrinsic structure of the unlabeled samples as well as the pairwise constraints in the projected low-dimensional space. Later, in 2008, Zhang et al. [2] introduced a new filter method for feature selection based on pairwise constraints, called Constraint Score. Sun et al. [16] imported Bagging into Constraint Score and proposed a new method called Bagging Constraint Score. Wang et al. [17] presented a method for actively selecting informative pairwise constraints, which corresponds to picking sample pairs that are far apart in the same cluster and those that are close in different clusters; an active semi-supervised spectral clustering (ASSC) algorithm was then developed by utilizing the selected pairwise constraints to adjust the distance matrix in spectral clustering. In 2012, Zeng et al. [18] introduced a pairwise constrained maximum margin clustering algorithm; based on the maximum margin idea in maximum margin clustering, a set of effective loss functions was proposed to discourage the violation of given pairwise constraints. Ding et al. [19] proposed an effective clustering algorithm based on pairwise constraints, in which the similarity matrix of samples is adjusted and optimized by the pairwise constraints.
Reference [20] introduced Pairwise Constrained Component Analysis, a new algorithm for learning distance metrics from sparse pairwise similarity/dissimilarity constraints in a high-dimensional input space, a problem for which most existing distance metric learning approaches are not suited.

1.2. Contributions and Novelty

Although pairwise constraints have proved effective in many tasks, to the best of our knowledge this topic has not yet been addressed in feature selection for linked social media data. Inspired by their effectiveness in guiding learning tasks, we introduce pairwise constraints into feature selection on linked social media data and obtain a new algorithm called LFSPC (Linked data Feature Selection with Pairwise Constraints). LFSPC learns from the unlabeled samples and the pairwise constraints simultaneously so as to select the most relevant features from the high-dimensional data. The motivation for exploiting unlabeled samples is to enhance performance when constraints are few; with the contribution of abundant unlabeled samples, the algorithm is expected to be more stable than one using only the constraints. Experiments are carried out on the Flickr and BlogCatalog datasets. Experimental results show that, with a proportion of pairwise constraints, LFSPC achieves similar or even higher performance than Fisher Score on the whole, and outperforms Laplacian Score. Besides, LFSPC performs better than LUFS (an unsupervised feature selection framework for linked social media data) [1] and SSLFS (a semi-supervised feature selection method for linked social media data) [21].

The rest of this paper is organized as follows. Section 2 introduces semi-supervised learning with pairwise constraints. We give the problem statement and notations in Section 3. In Section 4 we present the formal description of semi-supervised feature selection with pairwise constraints based on linked social media data. In Section 5, we give the framework of LFSPC and the optimization algorithm; the convergence analysis is also discussed in this section. Finally, experiments and conclusions are presented in Sections 6 and 7, respectively.

2. Semi-Supervised Learning Based on Pairwise Constraints

The motivation of semi-supervised methods is to utilize abundant unlabeled samples and a few labeled samples to improve learning performance. Among these methods, those based on linked social media data are quite novel: the graph nodes denote users, and the weights on the edges represent the closeness and similarity between adjacent user nodes. Moreover, in a social media network we can easily identify whether two users belong to the same circle of friends according to their interests, posts and tags; that is, it is convenient to obtain pairwise constraint samples from a social media network. We therefore propose a new semi-supervised feature selection method with pairwise constraints based on linked social media data. In the following, our work is to investigate how to integrate the prior information provided by the pairwise constraints into the proposed framework based on linked social media data.

3. Problem Statement and Notations

Given a graph G = (V, E), E is the set of edges describing the adjacency relationships between nodes, and V is the set of n nodes. The nodes belong to c classes, and the corresponding attribute-value matrix is X = [x_1, x_2, ..., x_n] ∈ R^{d×n}, where x_i ∈ R^d represents the i-th sample vector and d is the dimension of the samples. M denotes the set of must-link constraints, and C denotes the set of cannot-link constraints.

4. Feature Selection with Pairwise Constraints Based on Linked Social Media Data

4.1. The Social Dimension Regularization

In reference [1], Tang et al. proposed the concept of pseudo-class labels to guide the learning process. Given a mapping matrix W ∈ R^{d×c}, the attribute-value matrix X can be projected into Y = W^T X ∈ R^{c×n}. If and only if each column of Y has exactly one non-zero element, we define Y as the pseudo-class label matrix, and the position of the non-zero element in each column gives the class label of the corresponding sample.

Considering the graph representation of the samples, we can describe the sample set with the adjacency matrix A ∈ R^{n×n}, whose elements are formulated as Equation (1):

A_ij = 1 if (v_i, v_j) ∈ E, and A_ij = 0 otherwise.    (1)

Now we give the definitions of the social dimensions and the modularity matrix. In reference [22], social dimension regularization is introduced so as to combine the adjacency matrix A and the attribute-value matrix X. The modularity matrix is defined in Equation (2), and the modularity maximization algorithm can be used to obtain the social dimensions of the samples, as formulated in Equation (3):

B = A − (d d^T) / (2m)    (2)

max_H tr(H^T B H),  s.t. H^T H = I    (3)

Here d is the vector of node degrees and m = |E| is the number of edges. The matrix H ∈ R^{n×k} is the social dimension indicator matrix and is constituted of the eigenvectors corresponding to the top k eigenvalues of B [23]. The K-means algorithm can then be employed to acquire the discrete-valued social dimension indicator: H_ij = 1 if node i belongs to dimension j, and H_ij = 0 otherwise. The weighted social dimension indicator matrix F is obtained by normalizing the columns of H, i.e., F = H(H^T H)^{-1/2}.
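The construction above can be sketched in NumPy (the function name is illustrative; the K-means discretization step is omitted, so the returned matrix is the continuous relaxation of H):

```python
import numpy as np

def social_dimensions(A, k):
    """Social dimensions from an adjacency matrix A via modularity maximization.

    B = A - d d^T / (2m) is the modularity matrix (d: degree vector,
    m: number of edges); the relaxed solution of Equation (3) stacks the
    eigenvectors of B belonging to the k largest eigenvalues.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # node degrees
    m = d.sum() / 2.0                      # number of edges
    B = A - np.outer(d, d) / (2.0 * m)     # modularity matrix, Equation (2)
    eigvals, eigvecs = np.linalg.eigh(B)   # eigenvalues in ascending order
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors
```

On a toy graph made of two triangles joined by a single edge, the leading social dimension separates the two communities by sign.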

We can also define the social dimension scatter matrices: the within social dimension scatter matrix S_w, the between social dimension scatter matrix S_b, and the total scatter matrix S_t = S_w + S_b. Samples in the same social dimension are similar, while samples from different social dimensions are dissimilar, so we can exploit the link relations between samples to obtain a constraint. Here we use social dimension regularization to model the relations among linked samples; the objective function is:

max_W tr( (W^T S_t W)^{-1} (W^T S_b W) )    (4)

4.2. The Pairwise Constraints Regularization

We construct the must-link and cannot-link constraint regularization terms respectively. Intuitively, we want the average distance in the projected low-dimensional space between samples in the cannot-link set to be as large as possible, while the average distance between samples in the must-link set should be as small as possible. Here the squared Euclidean norm is used to measure distance, and the objective function is formulated as Equation (5):

min_W (λ_M / n_M) Σ_{(x_i, x_j) ∈ M} ||W^T x_i − W^T x_j||² − (λ_C / n_C) Σ_{(x_i, x_j) ∈ C} ||W^T x_i − W^T x_j||²    (5)

n_M and n_C are the numbers of must-link and cannot-link constraints, respectively. The two scaling parameters λ_M and λ_C are introduced to balance the contributions of the must-link and cannot-link constraints.
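A minimal sketch of this regularization term (illustrative names; note that in this sketch the rows of X are samples, the transpose of the convention used in the text):

```python
import numpy as np

def pairwise_constraint_term(X, W, must_link, cannot_link,
                             lam_m=1.0, lam_c=1.0):
    """Pairwise constraint regularization value for a projection W.

    Returns lam_m * (mean squared must-link distance)
          - lam_c * (mean squared cannot-link distance)
    in the projected space.  Minimizing this pulls must-link pairs
    together and pushes cannot-link pairs apart.
    """
    Y = X @ W  # projected samples, one per row
    ml = np.mean([np.sum((Y[i] - Y[j]) ** 2) for i, j in must_link])
    cl = np.mean([np.sum((Y[i] - Y[j]) ** 2) for i, j in cannot_link])
    return lam_m * ml - lam_c * cl
```

A must-link pair with identical projections contributes nothing, while any separation between cannot-link projections drives the value down.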

4.3. Manifold Learning Model

On the basis of graph spectral analysis, we can construct a weight matrix S according to the attribute-value matrix X. The weight between nodes i and j is defined, based on the RBF kernel, as Equation (6):

S_ij = exp( −||x_i − x_j||² / (2σ²) )    (6)

Under the standard assumption in graph spectral analysis, adjacent nodes are more likely to have the same class labels; that is, adjacent nodes should have similar low-dimensional embeddings. The objective can therefore be stated as minimizing the following expression:

(1/2) Σ_{i,j} S_ij ||y_i − y_j||²    (7)

Here y_i = W^T x_i is the embedding coordinate of sample x_i, and L = D − S is the Laplacian matrix, where D is a diagonal matrix with elements D_ii = Σ_j S_ij. The objective function can then be formulated as minimizing Equation (8):

tr(W^T X L X^T W)    (8)
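The weight matrix, Laplacian and trace term can be sketched as follows (illustrative names; rows of X are samples in this sketch, the transpose of the convention in the text):

```python
import numpy as np

def laplacian_smoothness(X, W, sigma=1.0):
    """Manifold regularization term for a row-sample matrix X.

    S_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) is the RBF weight matrix
    of Equation (6); L = D - S with D_ii = sum_j S_ij.  The identity
    tr(Y^T L Y) = (1/2) * sum_ij S_ij ||y_i - y_j||^2 links the trace
    form of Equation (8) to the pairwise sum of Equation (7).
    """
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    S = np.exp(-dist2 / (2.0 * sigma ** 2))              # RBF weights
    L = np.diag(S.sum(axis=1)) - S                       # graph Laplacian
    Y = X @ W                                            # projected samples
    return float(np.trace(Y.T @ L @ Y))
```

The trace form and the explicit pairwise sum agree, which is the identity the derivation relies on.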

5. The Framework of Feature Selection Algorithm Based on Pairwise Constraints for Linked Social Media Data

5.1. The Framework of the Model

Now we can integrate the manifold model, the social dimension regularization term and the pairwise constraint terms simultaneously to obtain the feature selection framework for linked social media data:

min_W tr(W^T X L X^T W) − α tr((W^T S_t W)^{-1} W^T S_b W) + (λ_M / n_M) Σ_{(x_i, x_j) ∈ M} ||W^T x_i − W^T x_j||² − (λ_C / n_C) Σ_{(x_i, x_j) ∈ C} ||W^T x_i − W^T x_j||² + β ||W||_{2,1}    (9)

Here, introducing the l_{2,1} norm of W, i.e., ||W||_{2,1} = Σ_{i=1}^{d} ||w^i||_2 where w^i is the i-th row of W, encourages sparse rows of W. Besides, εI is added to S_t to make it nonsingular. After this transformation, the optimization problem can be formulated as Equation (10):

(10)

Since tr((W^T S_t W)^{-1} W^T S_t W) is a constant, the above problem can be simplified as Equation (11):

(11)

5.2. Optimization Algorithm for LFSPC

Taking the derivative of the objective function with respect to W, the l_{2,1} regularization term contributes:

∂||W||_{2,1} / ∂W = 2 D W    (12)

D_ii = 1 / (2 ||w^i||_2)    (13)

where D is a diagonal matrix whose diagonal elements are given by Equation (13). Setting the derivative to zero, we can obtain an iterative optimization algorithm following [1]: W is solved with D held fixed, then D is updated from the new W, and the two steps alternate until convergence.
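The reweighting trick behind Equations (12)-(13) can be illustrated on a simpler surrogate problem: the sketch below (hypothetical names, with a least-squares data term standing in for the full LFSPC objective) alternates between solving the quadratic subproblem with D fixed and updating D from the rows of W:

```python
import numpy as np

def l21_regularized_fit(X, Y, beta=0.1, n_iter=50, eps=1e-8):
    """Iteratively reweighted solver for min_W ||XW - Y||_F^2 + beta*||W||_{2,1}.

    With D fixed, the subproblem is quadratic and has a closed-form
    solution; D is then refreshed as D_ii = 1 / (2 ||w^i||_2).
    Alternating the two updates yields a W with sparse rows, whose
    row norms rank the features.
    """
    d = X.shape[1]
    D = np.eye(d)  # start from an unweighted penalty
    for _ in range(n_iter):
        W = np.linalg.solve(X.T @ X + beta * D, X.T @ Y)   # quadratic subproblem
        row_norms = np.sqrt(np.sum(W ** 2, axis=1)) + eps  # guard zero rows
        D = np.diag(1.0 / (2.0 * row_norms))               # Equation (13) update
    return W
```

On synthetic data where the target depends on a single feature, the learned W concentrates its row norm on that feature.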

6. Experiments and Results Analysis

In this section, we evaluate the effectiveness of LFSPC on the Flickr and BlogCatalog datasets, two subsets of standard datasets from Flickr and BlogCatalog, which are social websites providing image hosting and blogging services respectively. Statistics of the datasets can be found in Table 1. We compare LFSPC with LUFS (Linked Unsupervised Feature Selection) [1] and SSLFS (Semi-Supervised Linked data Feature Selection) [21], and we use the 1-NN (1-Nearest Neighbor) algorithm to carry out the classification tasks.
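The evaluation pipeline can be sketched as follows (illustrative names): features are ranked by the l2 norms of the rows of the learned W, and 1-NN classification is then run on the selected features only.

```python
import numpy as np

def top_features(W, m):
    """Indices of the m features with the largest row norms ||w^i||_2 of W."""
    return np.argsort(-np.sqrt(np.sum(W ** 2, axis=1)))[:m]

def knn1_accuracy(X_train, y_train, X_test, y_test, feats):
    """1-NN classification accuracy using only the selected feature columns."""
    A, B = X_train[:, feats], X_test[:, feats]
    # squared Euclidean distances between every test and training sample
    d2 = np.sum(B ** 2, axis=1)[:, None] + np.sum(A ** 2, axis=1)[None, :] - 2.0 * B @ A.T
    pred = y_train[np.argmin(d2, axis=1)]  # label of nearest training sample
    return float(np.mean(pred == y_test))
```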

Table 1. Statistics of the datasets.

                       BlogCatalog    Flickr
Number of Samples      5198           7575
Number of Features     8189           12047
Number of Classes      6              9
Number of Links        27965          47344
Average Degree         5.38           6.25

In the experimental setup, 50% of each dataset is used as the classification samples; the remaining 50%, together with the pairwise constraint samples, is used in the feature selection process. We simulate the generation of pairwise constraints as follows: we randomly select pairs of samples from the training samples and create must-link or cannot-link constraints depending on whether the underlying classes of the two samples are the same or different. LFSPC has four parameters; two of them are set for Flickr and BlogCatalog according to [1]. The scaling parameters λ_M and λ_C are tuned by adjusting one while fixing the other over a predefined range. The resulting λ_M is far larger than λ_C for both Flickr and BlogCatalog, because the distance between samples in the same class is typically smaller than that between samples in different classes.
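The constraint-generation procedure described above can be sketched as follows (illustrative names):

```python
import numpy as np

def sample_constraints(labels, n_pairs, seed=0):
    """Randomly draw sample pairs and turn them into pairwise constraints.

    A pair becomes a must-link constraint when the two underlying class
    labels agree, and a cannot-link constraint otherwise, mirroring the
    simulation described in the experiments.
    """
    rng = np.random.default_rng(seed)
    must_link, cannot_link = [], []
    for _ in range(n_pairs):
        i, j = rng.choice(len(labels), size=2, replace=False)
        pair = (int(i), int(j))
        if labels[pair[0]] == labels[pair[1]]:
            must_link.append(pair)
        else:
            cannot_link.append(pair)
    return must_link, cannot_link
```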