Dynamic Clustering based on Minimum Spanning Tree and Context Similarity for Enhancing Document Classification

ABSTRACT: Document classification is the task of assigning a text document to one or more predefined categories according to its content and the labeled training samples. Traditional classification schemes use all training samples for classification, thereby increasing storage requirements and computational complexity as the number of features increases. Moreover, commonly used classification techniques assume that the number of categories is known in advance, which may not hold in practice. In practical scenarios, it is essential to determine the number of clusters for an unknown dataset dynamically. Addressing these limitations, the proposed work develops a text clustering algorithm in which clusters are generated dynamically, based on a minimum spanning tree that incorporates semantic features. The proposed model can efficiently find the significant matching concepts between documents and can perform multi-category classification. The formal analysis is supported by applications to email and cancer data sets. The cluster quality and accuracy values were compared with those of several widely used text clustering techniques, demonstrating the efficiency of the proposed approach.

Keywords- Minimum spanning tree; Spam filtering; Dynamic clustering; Context similarity.

1. INTRODUCTION

In the digital era, the rapid growth in the volume of text documents available from sources such as the Internet, digital libraries, and medical records has spurred users to effectively retrieve, navigate, and organize information. The ultimate goal is to help users find what they are looking for effortlessly and take decisions suitably. In this context, fast and high-quality document clustering algorithms play a major role. Most common techniques in text retrieval are based on the statistical analysis of terms, i.e., words or phrases. Such text retrieval methods are based on the vector space model (VSM), a widely used data representation. The VSM represents each document as a feature vector of the terms in the form of term frequency or term weight (Salton et al., 1975). The similarity between documents is measured by one of several similarity measures based on the feature vector; examples include the cosine measure and the Jaccard measure (Schaeffer, 2007). Metric distances such as the Euclidean distance are not appropriate for high-dimensional and sparse domains. Most conventional measures estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. To achieve a more accurate analysis, the underlying model should capture the semantics of the text. Conceptual information retrieval extracts information by processing the document on the semantic level, forming a concept base, and then retrieves related information to provide search results.
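As a minimal sketch of the surface-overlap limitation noted above (in Python, with illustrative toy documents), the following code computes the cosine similarity between two term-frequency vectors; two semantically related documents that share no words score zero.

```python
import math
from collections import Counter

def cosine_similarity(doc1_terms, doc2_terms):
    """Cosine similarity between two term-frequency vectors (VSM)."""
    tf1, tf2 = Counter(doc1_terms), Counter(doc2_terms)
    shared = set(tf1) & set(tf2)
    dot = sum(tf1[t] * tf2[t] for t in shared)
    norm1 = math.sqrt(sum(v * v for v in tf1.values()))
    norm2 = math.sqrt(sum(v * v for v in tf2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# The two documents share no words, so the cosine score is zero even
# though they are semantically related -- the limitation noted above.
d1 = "the patient received chemotherapy treatment".split()
d2 = "oncology drugs were administered".split()
print(cosine_similarity(d1, d2))  # 0.0
```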

Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories. Methods used for text clustering include decision trees, contextual clustering, clustering based on data summarization, statistical analysis, neural networks, and rule-based systems, among others (Nahm and Mooney, 2000; Talavera and Bejar, 2001; Jin et al., 2005). Most clustering techniques assume that the number of clusters is fixed, which can result in poor-quality clustering. Dynamic document clustering is the process of inserting newly arrived documents into the appropriate existing cluster, so clusters need not be relocated; thus the time and effort taken for clustering are drastically reduced (Wang et al., 2011; Nadig et al., 2008). Figure-1 below depicts a model for dynamic document clustering.

Figure-1: A model for Dynamic Document clustering

A spanning tree is an acyclic subgraph of a graph G that contains all vertices of G and is itself a tree. The minimum spanning tree (MST) of a weighted graph is the minimum-weight spanning tree of that graph (Edla and Jana, 2013). The MST clustering algorithm is known to be capable of detecting clusters with irregular boundaries. Moreover, the MST is relatively insensitive to small amounts of noise spread over the field (Zahn, 1971). Thus the shape of a cluster boundary has little impact on the performance of the algorithm. The proposed approach does not require a preset number of clusters. Edges that satisfy a predefined inconsistency measure are removed from the tree. The process is iterated until there is no change in the edge list and all data are clustered.
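The sketch below illustrates this general MST clustering idea using SciPy's graph routines; the fixed distance threshold is a simplified stand-in for the paper's inconsistency measure, and the toy distance matrix is illustrative only.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(distance_matrix, inconsistency_threshold):
    """Build an MST over the documents, then drop edges whose weight
    exceeds the threshold; the surviving connected components are clusters."""
    mst = minimum_spanning_tree(distance_matrix).toarray()
    mst[mst > inconsistency_threshold] = 0.0  # remove inconsistent edges
    n_clusters, labels = connected_components(mst, directed=False)
    return n_clusters, labels

# Toy pairwise distances for five documents forming two tight groups.
D = np.array([
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.8, 0.9],
    [0.2, 0.1, 0.0, 0.9, 0.8],
    [0.9, 0.8, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.8, 0.1, 0.0],
])
print(mst_clusters(D, inconsistency_threshold=0.5))  # (2, [0 0 0 1 1])
```

Note that the number of clusters is not preset: it emerges from which MST edges survive the inconsistency test.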

This paper suggests a context-based retrieval method at the sentence, document, and corpus levels for enhancing the quality of text retrieval. More specifically, it can quantify how closely concepts relate to each other and integrate this into a document similarity measure. As a result, documents do not have to mention the same words to be judged similar. The suggested clustering technique is applied to two different data sets, email messages and cancer data sets, to demonstrate its feasibility. A major contribution of this work is in developing clusters dynamically based on the areas of interest of email users; when applied to cancer data sets, the method can classify patients into different treatment clusters based on age groups. The work introduces a text classification algorithm that allows incremental and multi-label classification by comparing a document with the context pool, i.e., the most significant concepts in each cluster.

The subsequent sections of the paper are organized as follows. A brief literature review of related classification approaches is presented in Section-2. The problem definition follows in Section-3. The methodology and aim of the research work are described in Section-4. Section-5 presents some important concepts associated with the proposed model. Section-6 proposes an algorithm for developing the clusters using a minimum spanning tree and performs a time complexity analysis. In later sections, evaluation parameters are developed to compare the clustering results with other established state-of-the-art methods.

2. RELATED WORK

Previous research has shown that the applicability of existing classification techniques to email datasets was limited because of the large number of features, which makes most mails indistinguishable. Classification of email messages is essential to many tasks in web information retrieval, such as maintaining email directories, spam filtering, and focused crawling (Kevin, 2003). A rule-based induction algorithm was used for filtering emails, and a comparison showed that it outperforms k-NN (Payne and Edwards, 1997). A Naive Bayes based classifier was developed for constructing a real-world email filtering system that suggests the three most appropriate folders for each incoming message (Rennie, 2000). With very high precision the desired folder can be found among the three suggested ones, which dramatically simplifies the process of manual filtering. Another related effort applied the Support Vector Machine to the email classification task; its comparison with a Naive Bayes classifier demonstrated higher efficiency (Kiritchenko and Matwin, 2001). Later, an innovative method focused on reducing the classification error by discovering temporal relations in an email sequence, in the form of temporal sequence patterns, and embedding the discovered information into content-based methods (Kiritchenko and Matwin, 2004). The above research work mainly focused on classification approaches for email documents. Some research approaches for generating clusters optimized for text mining are described below.

Critical analysis of recent document clustering algorithms revealed that documents are represented based on phrase or pairwise context, wherein the similarity relationship between the sentences can be identified (Lam and Hwuang, 2009). Using a tree representation, the similarity between two objects was identified and clustered (Chehreghani and Abolhassani, 2008); however, this approach has significantly high time and computational complexity as the data size increases. A component-based clustering algorithm was developed that uses an object-based software representation for modeling the document, with the cosine and Euclidean measures for document clustering (Boris et al., 2012). A survey of clustering algorithms revealed that most clustering techniques use the term frequency method, which fails to differentiate the degree of semantic importance of each term and assigns weight without distinguishing between semantically important and unimportant words within the document (Prathima and Supreethi, 2011). A semantic-similarity-histogram-based incremental document clustering algorithm built on the Phrase Semantic Similarity Histogram was proposed (Gad and Kamel, 2010). This algorithm integrates text semantics into the incremental clustering process; the clusters are represented using a semantic histogram that measures the distribution of semantic similarities within each cluster. Leacock and Chodorow (1998) proposed a distance-based semantic similarity calculation model, which is simple and intuitive but heavily dependent on a pre-established ontology hierarchy.

Latent semantic indexing (LSI) is a statistical technique that derives correlations between all terms and documents in a corpus in order to overcome the problems inherent in lexical matching (Deerwester et al., 1990). LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. A truncated singular value decomposition (SVD) is used to estimate the structure in word usage across documents and to perform information retrieval. However, related work has shown that the newly reduced vectors are dense, so there is no guarantee of saving memory (Hull, 1994). LSI is a great improvement over earlier search technologies, but it lacks the ability to determine whether a document is correct according to the context and meaning intended by the creator. Since LSI relies on the singular value decomposition, every time documents are added or removed the SVD statistics of the entire collection are affected.
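As a concrete illustration of the truncated SVD step, the sketch below (using NumPy, with a toy term-document matrix) projects documents into a k-dimensional latent space; note that the resulting vectors are dense, as the memory caveat above points out.

```python
import numpy as np

def lsi_embed(term_doc_matrix, k):
    """Project documents into a k-dimensional latent space via truncated SVD,
    the core operation of Latent Semantic Indexing."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Keep the k largest singular values; each document becomes a column
    # of S_k @ Vt_k, i.e., a dense k-dimensional vector.
    return np.diag(s[:k]) @ Vt[:k, :]

# Tiny 4-term x 3-document count matrix.
A = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)
docs_2d = lsi_embed(A, k=2)
print(docs_2d.shape)  # (2, 3): each column is a dense document vector
```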

A recent research approach uses the combination of factors affecting the weights of concepts at the sentence, document, and corpus levels, enabling accurate calculation of pairwise document similarity. The quality of text clustering achieved by this model significantly surpasses traditional single-term based approaches (Shady Shehata, 2010). Since medical documents and emails from users consist of purely domain-specific medical and technical terms, clustering based on synonyms and hyponyms, as done by (Shady Shehata, 2010), may not always yield good results.

3. PROBLEM DEFINITION

A key element of most clustering algorithms is the availability of the number of clusters and a similarity measure capable of identifying the neighbors of a particular document to be clustered together. In many situations the number of clusters to be formed may not be known in advance. A major drawback of most text mining techniques is that the frequency of a term is computed to estimate its importance in the document, yet two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other. The related work in Section-2 clearly highlights the fact that context-based clustering algorithms achieve better accuracy than purely term-frequency-based approaches. Methodologies that use synonyms and hyponyms have been shown to be less effective for technical and medical data sets. For example, the synonym of the term 'Database' extracted from WordNet is "an organized body of related information", but it will not include related terms like file, storage, repository, and table, which can improve the accuracy of information retrieval. Hence, taking synonyms and hyponyms will not achieve good performance, as the meanings of these terms are not exactly the same. The research gap identified is that developing a search system that examines the meaning of every document in its collection would result in high time and computational complexity. A more feasible approach is to examine each document's domain context, as sketched below. This would require less processing power and provide a general overview of the content of the entire document. By providing a context-sensitive description of the document, the system can then verify that all documents found by its search are truly relevant.
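The sketch below illustrates the domain-context idea; the dictionary entries and function names are hypothetical, chosen only to show how related terms (rather than WordNet synonyms) can link a document to a context.

```python
# Hypothetical domain-specific dictionary: each context maps to the
# concept terms associated with it (entries here are illustrative only).
DOMAIN_DICTIONARY = {
    "database": {"file", "storage", "repository", "table", "query"},
    "human resource": {"resume", "training", "appraisal", "recruitment"},
}

def matched_contexts(document_terms):
    """Return the contexts whose concept terms appear in the document."""
    terms = set(document_terms)
    return {ctx for ctx, concepts in DOMAIN_DICTIONARY.items()
            if terms & concepts}

print(matched_contexts("the repository stores every table".split()))
# {'database'} -- matched through related concepts, not synonyms
```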

4. METHODOLOGY

Work Flow

Description

An outline of the proposed method is shown in Figure-2 above. Text pre-processing is carried out using techniques such as sentence segmentation, removal of stop words, and stemming. To extract contexts, a domain-specific dictionary consisting of several technical and medical terms is developed, unlike the work of (Shehata, 2009), where the WordNet lexical analyzer was used to extract synonyms and hyponyms. The clusters are formed dynamically using a minimum spanning tree clustering algorithm (MSTCA), as explained in Section-6. After formation of the clusters, a context store consisting of the top contexts from each cluster is created. A text classification algorithm (CSTCA) is suggested, which can identify a document from its content and classify it into an appropriate category.
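A minimal pre-processing sketch along these lines, assuming the NLTK library (its 'punkt' and 'stopwords' resources must be downloaded once), could look as follows:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads: nltk.download('punkt'); nltk.download('stopwords')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Sentence segmentation, stop-word removal, and stemming."""
    processed = []
    for sentence in sent_tokenize(text):
        tokens = [stemmer.stem(w.lower()) for w in word_tokenize(sentence)
                  if w.isalpha() and w.lower() not in stop_words]
        processed.append(tokens)
    return processed  # one token list per sentence

print(preprocess("The patients were treated early. Treatment improved outcomes."))
# [['patient', 'treat', 'earli'], ['treatment', 'improv', 'outcom']]
```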

5. Terminologies Associated with the Proposed Method

  1. Context: A context is the set of concepts in a document. A domain-specific dictionary is developed for concept extraction, in which the terms related to each context are kept along with the meaning of each term. For example, considering human resource as a context, the concepts are resume, training, appraisal, recruitment, and attendance. Documents containing these words are grouped together under the context node human resource.
  2. Noun Phrase and Verb: e.g., "Amit slams the door." "Slams" is the verb; "Amit" and "the door" are the noun phrases.
  3. Term: A term is either a word or a phrase, i.e., a sequence of words.
  4. Similarity Measure: The semantic-based similarity between two documents is a function of the following factors:

i. the number of matching concepts mc in the verb and noun phrases of each document;

ii. the total number of verbs V in each sentence s;

iii. the term frequency tfi of each concept ci in each document d, where i = 1, 2, ..., mc;

iv. the document frequency dfi of each concept ci, where i = 1, 2, ..., mc;

v. the corpus relatedness R(c1, c2) between two matching contexts c1 and c2 of a spanning tree of length L, as defined in Eq-(1); length(c1, c2) is the length of the shortest path between concepts c1 and c2.

$$R(c_1, c_2) = -\log\frac{\text{length}(c_1, c_2)}{2L} \qquad (1)$$

The contextual term frequency (ctf) is an important factor in calculating the context-based similarity measure between documents. The more frequently a context appears in the verb or noun phrases of the sentences, the more conceptually similar the documents are. A context c can have many ctf values in different sentences in the same document d. Thus, the overall ctf value is calculated by:

$$ctf = \frac{1}{S_n}\sum_{s=1}^{S_n} ctf_s \qquad (2)$$

Here Sn is the total number of sentences that contain context c in document d. Taking the average of the ctf values of context c over its sentences in document d measures the overall importance of context c to the meaning of its sentences in document d.

  5. Context-Based Similarity: The context-based similarity between two documents d1 and d2 is calculated by:

$$sim(d_1, d_2) = \sum_{i=1}^{mc} weight_{i1} \times weight_{i2} \qquad (3)$$

where

$$weight_i = \left(tfweight_i + ctfweight_i\right) \times idf_i \times R(i_1, i_2) \qquad (4)$$

In (4), the tfweighti value represents the weight of concept i in document d at the document level.

The ctfweighti value represents the weight of concept i in document d at the sentence level, based on the contribution of concept i to the semantics of the sentences in d. In (4), idfi is the inverse document frequency of concept i, and R(i1, i2) represents the corpus-level relatedness defined in Eq-(1). The sum of tfweighti and ctfweighti in (4) provides an accurate measure of the contribution of each context to the meaning of the topics mentioned in a document. In (5), the tfij value is normalized by the length of the document vector of the term frequency tfij in document d.

$$tfweight_i = \frac{tf_i}{\sqrt{\sum_{j=1}^{cn} \left(tf_j\right)^2}} \qquad (5)$$

Here cn is the total number of contexts that have a term frequency value in document d.

In (6), the ctfij value is normalized by the length of the document vector of the contextual term frequency ctfij in document d, where j = 1, 2, ..., cn, and cn is the total number of contexts that have a contextual term frequency value in document d.

$$ctfweight_i = \frac{ctf_i}{\sqrt{\sum_{j=1}^{cn} \left(ctf_j\right)^2}} \qquad (6)$$
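Putting Eqs (2)-(6) together, the following sketch mirrors the formulas as reconstructed above; it assumes that per-concept tf, ctf, idf, and relatedness values have already been extracted, and the helper names are illustrative.

```python
import math

def avg_ctf(ctf_per_sentence):
    """Eq (2): average ctf of a context over the sentences containing it."""
    return sum(ctf_per_sentence) / len(ctf_per_sentence)

def tf_weight(tf, doc_tfs):
    """Eq (5): tf normalized by the length of the document's tf vector."""
    return tf / math.sqrt(sum(v * v for v in doc_tfs))

def ctf_weight(ctf, doc_ctfs):
    """Eq (6): ctf normalized by the length of the document's ctf vector."""
    return ctf / math.sqrt(sum(v * v for v in doc_ctfs))

def concept_weight(tfw, ctfw, idf, relatedness):
    """Eq (4): combine document- and sentence-level weights, scaled by the
    inverse document frequency and the corpus-level relatedness R (Eq 1)."""
    return (tfw + ctfw) * idf * relatedness

def context_similarity(weights1, weights2):
    """Eq (3): sum the products of the weights of the matching concepts."""
    shared = set(weights1) & set(weights2)
    return sum(weights1[c] * weights2[c] for c in shared)

# Toy per-document concept weights (already computed via concept_weight).
w1 = {"treatment": 0.42, "therapy": 0.31}
w2 = {"treatment": 0.40, "surgery": 0.25}
print(round(context_similarity(w1, w2), 3))  # 0.168
```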

6. Proposed Minimum Spanning Tree Clustering Algorithm (MSTCA)

Construction of Spanning Tree from Connected Graph:

Step 1: Let E be an edge joining two text documents d1 and d2 in the connected graph G; the similarity between the documents is calculated from Eq-(3) as described in the previous section. Then, for each pair of documents, find the dissimilarity measure as given below.

$$Dis(d_1, d_2) = 1 - sim(d_1, d_2) \qquad (7)$$

This measure, called the semantic distance, is used to calculate the dissimilarity between two concept nodes. The semantic distance between two concept nodes is defined as the weight of the edge joining them.

Step 2: Construct a connected weighted graph in which the vertices are the documents; draw an edge between each pair of vertices and calculate its weight using Eq-(7).

Step 3: Identify the cycles in the graph. For each cycle, remove the highest-weighted edge involved in the cycle from the graph. At the end of this step, a spanning tree is formed.

Formation of Clusters using the Spanning Tree

The MSTCA algorithm comprises two parts, as given below.

In the ClusterForm algorithm, we first form a minimum spanning tree from the given data set and keep it as an edge list, initializing the status of all edges as 'Rejected'. The ScanEdgeList algorithm is then run on this edge list. Edges are added or removed in the process of cluster formation depending on a criterion over the threshold value. This process is repeated until there is no change in the edge list and all the clusters are fully formed. After formation of the clusters, the top related contexts are extracted from every cluster to form a context pool. The following parameters are taken as input to the algorithm.

(Pattern) n×d: pattern matrix of n data elements having d dimensions; Pat(i, j) gives the value of the jth dimension of the ith data point.
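A compact sketch of ClusterForm and ScanEdgeList under these definitions is shown below; it builds the MST from the semantic distances of Eq-(7) and, as a simplifying assumption, uses a fixed distance threshold as the acceptance criterion for edges.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def cluster_form(sim_matrix, threshold):
    """Sketch of ClusterForm: build the MST over the semantic distances
    (Eq 7) and keep its edges as a status-tagged edge list."""
    dis = 1.0 - sim_matrix            # Eq (7): Dis = 1 - sim
    np.fill_diagonal(dis, 0.0)
    mst = minimum_spanning_tree(dis).toarray()
    rows, cols = np.nonzero(mst)
    # Every MST edge starts out 'Rejected', as in the algorithm.
    edges = [{"u": u, "v": v, "w": mst[u, v], "status": "Rejected"}
             for u, v in zip(rows, cols)]
    return scan_edge_list(edges, len(sim_matrix), threshold)

def scan_edge_list(edges, n_docs, threshold):
    """Sketch of ScanEdgeList: accept edges below the threshold, repeating
    until a full pass leaves the edge list unchanged."""
    changed = True
    while changed:
        changed = False
        for e in edges:
            desired = "Accepted" if e["w"] <= threshold else "Rejected"
            if e["status"] != desired:
                e["status"] = desired
                changed = True
    # Accepted edges define the clusters as connected components.
    graph = np.zeros((n_docs, n_docs))
    for e in edges:
        if e["status"] == "Accepted":
            graph[e["u"], e["v"]] = graph[e["v"], e["u"]] = 1
    return connected_components(graph, directed=False)

# Toy similarity matrix for three documents: the first two are close.
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
print(cluster_form(S, threshold=0.5))  # (2, array([0, 0, 1]))
```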