Deriving Topics and Opinions from Microblogs

Research Proposal

Feng Jiang

fenjy009

110032208

10/06/2012

Supervisor: Jixue Liu

Associate Supervisor: Jiuyong Li

Contents

1 Statement of the Research Topic

1.1 Field of thesis

1.2 Context of research

1.3 Significance of research

1.4 Research questions

1.4.1 Topic extraction

1.4.2 Opinion extraction

1.5 Research sub questions

1.5.1 Updating problem

2 Review of literature

2.1 Keyword Extraction

2.2 Topic Extraction

2.3 Sentiment analysis

3 Significance and contributions

4 Methodologies

4.1 Text pre-processing

4.1.1 Word stemming

4.1.2 Feature filtering

4.2 Text Representation

4.2.1 Vector Space Model (VSM)

4.2.2 Feature item weighting

4.2.3 Text similarity

4.3 Dimensionality reduction

4.3.1 Feature selection

4.3.2 Feature Extraction

4.4 Classification

4.5 Clustering

5 Scope and limits

6 Project Plan

7 References

8 Trial Table of Contents of thesis


1 Statement of the Research Topic

1.1 Field of thesis

Text Mining

1.2 Context of research

In this day and age, microblogs play a pivotal role in people's daily lives. One famous microblog site is Twitter, which approximately 20 million visitors use every day [1]. Unlike traditional media such as TV, newspapers and magazines, it allows the public to freely express their own thoughts and ideas, discuss important events, share useful information and knowledge, and facilitates the dissemination of news. Although it contains plenty of useful information, it is very hard for individuals to manually seek and track important events, identify fashion trends and find popular products because of the enormous number of posts. It is very easy and convenient for bloggers to write and publish microblogs, so they often publish posts that are useless. Twitter, for example, allows users to send microposts of up to 140 characters and receives about 40 million microposts per day. Therefore, Twitter now contains an enormous number of posts, and that number is increasing every day [3].

Obviously, we could draw on existing web and text mining methods to analyse and mine microblog data. However, the characteristics of microblogs pose many problems and challenges, which means we cannot directly apply these techniques to extract hot topics and opinions. Firstly, the data on microblogs is semi-structured or unstructured. Microblogs do not follow a single format, so bloggers often write in a free style that may contain grammatical and spelling errors. Bloggers also tend to be willing to use new words, grammar and abbreviations to express their opinions and emotions. Secondly, bloggers update their microblogs frequently. Bloggers can now use smartphones to publish posts anywhere and at any time, so the update speed is increasingly fast. Thirdly, a post may have several comments, and it is difficult to judge the opinion of each comment [1,2,3]. For instance, some comments may not refer to the topic at all but relate to other events, so it is hard to filter them.

1.3 Significance of research

It is important to explore good methods for mining microblogs because they contain plenty of spam. Take Twitter for example: the majority of microposts are pointless, and only approximately 3.6% concern mainstream topics. Therefore, if we can find a good approach to detect topics and analyse sentiment, people can save a great deal of time and energy. They do not have to read similar microposts and comments; instead, they can quickly learn what is popular. On top of that, readers can seek and track important events, identify fashion trends and find popular products; governments can gather residents' opinions to improve public services; and companies can discover customers' real needs in order to produce better products.

1.4 Research questions

This minor thesis will propose a method to categorise microposts, detect topics within the same categories, and analyse the sentiment of microposts to identify personal tendencies.

1.4.1 Topic extraction

Although the number of posts in blogs is very large, we can classify them into different categories and find a method to summarise the blogs belonging to the same category with a brief sentence. At present there are many keyword detection methods but few sentence-level detection approaches. So the problem is how to find useful keywords to reduce dimensionality and create the topic sentence.

1.4.2 Opinion extraction

After generating the topic sentences, the question is how to analyse them and then conduct sentiment analysis to find out who supports a topic and who opposes it.

1.5 Research sub questions

1.5.1 Updating problem

There is a mass of data on microblogs. Information grows explosively; it is frequently and continuously published and updated, which generates large quantities of data. So the problem is how to generate stable clusters and deal with the latest posts after clustering.

2 Review of literature

2.1 Keyword Extraction

Keyword extraction is one of the most essential elements of this project because both topic detection and sentiment analysis use keywords to run their algorithms. In this part, I will review some classical methods using specific examples [4].

Position Weight (PW)

It considers the position of words in the article. However, it places too much emphasis on the position feature, so it suits structured documents that have a clear title, abstract, subtitles, conclusion and so on. It may not be appropriate for unstructured documents such as blogs and microblogs [5,6].

Word Frequency

Word frequency means how many times a word appears in a document. If a word's frequency is less than the threshold, the word can be ignored [7].

Disadvantage: There is plenty of information retrieval research confirming that words with small frequency can sometimes carry more information. Therefore, in the process of feature selection, words should not simply be deleted according to their frequency [7,8].

Document frequency thresholding (DF)

It is one of the simplest feature selection algorithms. It measures how many documents in the whole collection or corpus contain the word. The advantage of DF is that it requires very little calculation and runs fast, so it can be applied to large-scale classification tasks. In spite of this, it should be pointed out that the DF approach is not without its downside. Rare words may be concentrated in one kind of document and may carry important discriminative information; if we simply discard these words, the accuracy of the classifier may be affected [6].

For example:

Suppose we have 10,000 documents, so the document set contains 10,000 documents. If 1,000 documents include t1, 9,000 documents include t2, 4,000 documents include t3, 10 documents include t4, 5,500 documents include t5, and 200 documents include t6, then the document frequencies are calculated as

DF (t1) = 1,000/10,000 = 0.1

DF (t2) = 9,000/10,000 = 0.9

DF (t3) = 4,000/10,000 = 0.4

DF (t4) = 10/10,000 =0.001

DF (t5) = 5,500/10,000 = 0.55

DF (t6) = 200/10,000 = 0.02

Then we can set the threshold range to 0.3 < DF < 0.6.

So the keywords t3 and t5 are selected.
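As an illustrative sketch (the counts, threshold and variable names are simply the hypothetical values from this example), the DF calculation and threshold filtering above can be written in a few lines of Python:

# Document frequency (DF) thresholding: a minimal sketch using the
# hypothetical counts from the example above.
doc_count = 10_000
term_doc_counts = {"t1": 1_000, "t2": 9_000, "t3": 4_000,
                   "t4": 10, "t5": 5_500, "t6": 200}

df = {term: n / doc_count for term, n in term_doc_counts.items()}

# Keep only the terms whose DF falls inside the chosen band.
low, high = 0.3, 0.6
selected = [term for term, score in df.items() if low < score < high]
print(selected)  # ['t3', 't5']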

Information gain (IG)

It is widely used in machine learning. It considers both the frequency with which a term occurs and the number of documents in which it does not occur. However, information gain requires large-scale computation if the number of microposts is very big [9].

Formula:
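The standard form of information gain for a term t over classes c_i, written with natural logarithms so that it matches the worked example below (here \bar{t} denotes the absence of t), is:

IG(t) = -\sum_{i} P(c_i)\log P(c_i) + P(t)\sum_{i} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})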

For example:

Suppose the document set contains 10,000 documents and there are two classes named c1 and c2; c1 has 5,000 documents and c2 has 5,000 documents.

keyword    c1        c2        total
t1         5,000     2,000     7,000
t2         10        5,000     5,010
t3         5,000     5,000     10,000

This table means that 7,000 documents contain keyword t1; of these, 5,000 belong to class c1 and 2,000 belong to class c2.

IG (t1) = -(5,000/10,000*log(5,000/10,000) + 5,000/10,000*log(5,000/10,000)) + 7,000/10,000*(5,000/7,000*log(5,000/7,000) + 2,000/7,000*log(2,000/7,000)) + 3,000/10,000*(0/3,000*log(0/3,000) + 3,000/3,000*log(3,000/3,000)) = 0.693 - 0.7*(0.240 + 0.358) = 0.693 - 0.419 = 0.274

IG (t2) = -(5,000/10,000*log(5,000/10,000) + 5,000/10,000*log(5,000/10,000)) + 5,010/10,000*(10/5,010*log(10/5,010) + 5,000/5,010*log(5,000/5,010)) + 4,990/10,000*(4,990/4,990*log(4,990/4,990) + 0/4,990*log(0/4,990)) = 0.693 - 0.007 = 0.686

IG (t3) = -(5,000/10,000*log(5,000/10,000) + 5,000/10,000*log(5,000/10,000)) + 10,000/10,000*(5,000/10,000*log(5,000/10,000) + 5,000/10,000*log(5,000/10,000)) = 0

The result is that:

IG (t1) = 0.274

IG (t2) = 0.686

IG (t3) = 0

So the keyword t2 is the best one to use for classification.
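A minimal Python sketch of this calculation (the function and the counts below are illustrative, taken from the hypothetical table above; natural logarithms are used as in the example):

from math import log

def information_gain(n_with, n_without, total):
    # n_with[i]    -- documents of class i that contain the term
    # n_without[i] -- documents of class i that do not contain the term
    # total        -- total number of documents
    def entropy(counts):
        n = sum(counts)
        return -sum(c / n * log(c / n) for c in counts if c > 0) if n else 0.0

    class_counts = [w + wo for w, wo in zip(n_with, n_without)]
    p_t = sum(n_with) / total
    return (entropy(class_counts)
            - p_t * entropy(n_with)
            - (1 - p_t) * entropy(n_without))

# Counts from the example: c1 has 5,000 documents, c2 has 5,000.
print(round(information_gain([5_000, 2_000], [0, 3_000], 10_000), 3))  # ~0.274 (t1)
print(round(information_gain([10, 5_000], [4_990, 0], 10_000), 3))     # ~0.686 (t2)
print(round(information_gain([5_000, 5_000], [0, 0], 10_000), 3))      # 0.0   (t3)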

Mutual information (MI)

It evaluates the statistical dependence between a word and a category. Its time complexity is similar to that of information gain, and the average mutual information over terms is the information gain. However, it has a disadvantage: the score is easily influenced by the marginal probability of terms [10].

Formula:
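The pointwise mutual information between a term t and a class c, in the form used by the worked example below (natural logarithms), is:

MI(t, c) = \log \frac{P(t \mid c)}{P(t)} = \log \frac{P(t, c)}{P(t)\,P(c)}

and the score of a term is taken as MI_{max}(t) = \max_i MI(t, c_i).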

For example:

Suppose the document set contains 10,000 documents and there are two classes named c1 and c2; c1 has 5,000 documents and c2 has 5,000 documents.

This table means that 5,000 documents contain keyword t1; of these, 4,000 belong to class c1 and 1,000 belong to class c2.

keyword    c1        c2        total
t1         4,000     1,000     5,000
t2         1         5,000     5,001
t3         5,000     5,000     10,000
t4         10        1         11

MI(t1, c1) = log(4,000/5,000) – log(5,000/10,000) = -0.223+0.693 = 0.470

MI(t1, c2) = log(1,000/5,000) – log(5,000/10,000) = -1.609+ 0.693 = -0.916

MI(t2, c1) = log(1/5,000) – log(5,001/10,000) = -7.824

MI(t2, c2) = log(5,000/5,000) – log(5,001/10,000) = 0.693

MI(t3, c1) = log(5000/5,000) – log(10,000/10,000) = 0

MI(t3, c2) = log(5,000/5,000) – log(10,000/10,000) = 0

MI(t4, c1) = log(10/5,000) – log(11/10,000) = -6.215+6.812=0.597

MI(t4, c2) = log(1/5,000) – log(11/10,000) = -8.517+6.812=-1.705

MImax(t1) = 0.470,

MImax(t2) = 0.693

MImax(t3) = 0

MImax(t4) = 0.597

The result means that

MImax(t2) >MImax(t4) >MImax(t1) >MImax(t3)

So the keywords t2 and t4 are more important than the others and should be selected.

However, we can see the drawback: t4 appears in only 11 documents, but its MImax is higher than that of t1, which appears in 5,000. This method treats words with low document frequency as more important than words with high document frequency.

χ2 Statistic (CHI)

It assumes that the relationship between a word and a document category follows a χ² distribution with one degree of freedom, and then measures the degree of association between words and document categories [5].

Formula:
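With A the number of documents of class c containing t, B the number of documents outside c containing t, C the number of documents of class c without t, D the number of documents outside c without t, and N = A + B + C + D, the χ² statistic in the form consistent with the worked example below is:

\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}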

For example:

Suppose the document set contains 10,000 documents and there are two classes named c1 and c2; c1 has 5,000 documents and c2 has 5,000 documents.

keyword    c1        c2        total
t1         4,000     1,000     5,000
t2         1         5,000     5,001
t3         5,000     5,000     10,000
t4         10        1         11

Chi (t1,c1) = 10,000*[(4,000/10,000)*(4,000/10,000) - (1,000/10,000)*(1,000/10,000)]*[(4,000/10,000)*(4,000/10,000) - (1,000/10,000)*(1,000/10,000)]/{[(4,000+1,000)/10,000]*[(4,000+1,000)/10,000]*[(4,000+1,000)/10,000]*[(4,000+1,000)/10,000]} = 10,000*0.15*0.15/(0.5*0.5*0.5*0.5) = 3,600

Chi (t1,c2) = 10,000*[(1,000/10,000)*(1,000/10,000) - (4,000/10,000)*(4,000/10,000)]*[(1,000/10,000)*(1,000/10,000) - (4,000/10,000)*(4,000/10,000)]/{[(4,000+1,000)/10,000]*[(4,000+1,000)/10,000]*[(4,000+1,000)/10,000]*[(4,000+1,000)/10,000]} = 10,000*(-0.15)*(-0.15)/(0.5*0.5*0.5*0.5) = 3,600

Chi (t2,c1) = 10,000*0.25*0.25/(0.5*0.5*0.5*0.5) = 10,000

Chi (t2,c2) = 10,000

Chi (t3,c1) = 0

Chi (t3,c2) = 0

Chi (t4,c1) = 10,000*(0.0009*0.5)*(0.0009*0.5)/(0.0011*0.5*0.5*1) = 7.364

Chi (t4,c2)= 7.364

CHImax(t1) = 3,600, CHImax(t2) = 10,000, CHImax(t3) = 0, CHImax(t4) = 7.364

CHImax(t2) > CHImax(t1) > CHImax(t4) > CHImax(t3)

So the keyword t2 is the best one.

Expected Cross Entropy (ECE)

It is also an important method in machine learning. It considers the cross entropy between words and document categories [10].

Formula:
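The expected cross entropy of a term t over classes c_i, in the form consistent with the worked example below (natural logarithms), is:

ECE(t) = P(t) \sum_{i} P(c_i \mid t) \log \frac{P(c_i \mid t)}{P(c_i)}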

For example:

Suppose the document set contains 10,000 documents and there are two classes named c1 and c2; c1 has 5,000 documents and c2 has 5,000 documents.

keyword    c1        c2        total
t1         4,000     1,000     5,000
t2         1         5,000     5,001
t3         5,000     5,000     10,000
t4         10        1         11

ECE(t1) = 5000/10,000*[(4000/5000*log((4000/5000) /0.5)) + (1000/5000*log((1000/5000) /0.5))] = 0.5*(0.8*log1.6+0.2*log0.4) =0.5* (0.376-0.183) = 0.0965

ECE(t2) = 0.5*[(0.0002*log0.0004)+log2]= 0.3455

ECE(t3) = 0.5*[(0.5*log1)+0.5*log1]= 0

ECE(t4) = 0.00043

ECE(t2) >ECE(t1) >ECE(t4) >ECE(t3)

So the keyword t2 is the best one.

Comparison

IG and the χ² statistic (CHI) are the most effective in terms of term removal. Document frequency thresholding (DF) has similar accuracy to them. Compared with the other approaches, mutual information (MI) performs worst, because it prefers rare terms and is easily affected by probability estimation errors [6].
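As an illustrative sketch (the data structure and function names are my own, not taken from the cited work), the MI, CHI and ECE scores from the examples above can be reproduced from a simple term–class contingency table:

from math import log

N = 10_000                        # total documents
class_sizes = {"c1": 5_000, "c2": 5_000}
# documents of each class that contain the term (from the example tables)
term_counts = {"t1": {"c1": 4_000, "c2": 1_000},
               "t2": {"c1": 1,     "c2": 5_000},
               "t4": {"c1": 10,    "c2": 1}}

def mi(term, c):
    # pointwise mutual information between a term and a class
    n_tc = term_counts[term][c]
    n_t = sum(term_counts[term].values())
    return log((n_tc / class_sizes[c]) / (n_t / N)) if n_tc else float("-inf")

def chi2(term, c):
    a = term_counts[term][c]                      # in c, with the term
    b = sum(term_counts[term].values()) - a       # outside c, with the term
    c_ = class_sizes[c] - a                       # in c, without the term
    d = N - class_sizes[c] - b                    # outside c, without the term
    return N * (a * d - c_ * b) ** 2 / ((a + c_) * (b + d) * (a + b) * (c_ + d))

def ece(term):
    # expected cross entropy of the term over all classes
    n_t = sum(term_counts[term].values())
    return (n_t / N) * sum(
        (n_tc / n_t) * log((n_tc / n_t) / (class_sizes[c] / N))
        for c, n_tc in term_counts[term].items() if n_tc)

print(round(mi("t1", "c1"), 3), round(chi2("t1", "c1")), round(ece("t1"), 3))
# 0.47 3600 0.096  (matching the worked values for t1 above, up to rounding)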

2.2 Topic Extraction

Chung-Hong et al. [11] propose a novel method that not only detects hot topics but also ranks them in near real-time from microblogs. They first filter out microblog messages that include non-ASCII characters. Next, the topic creation module contains three major steps: dynamic term weighting, neighbourhood generation, and text clustering. In dynamic term weighting, they consider the updating problem, so when they assign weights to keywords, they add a burst score to the term frequency. In neighbourhood generation, they propose a new method to analyse the neighbourhood of keywords and find sentences that contain the keywords; they then calculate the similarity between the texts. In text clustering, they apply a density-based clustering method to filter spam, rather than a spam classifier that sometimes only subjectively filters the keywords.

Sharifi et al. [12] propose two primary methods to summarise microblogs, both based on an extractive approach rather than an abstractive one because microposts are so short. One method is a graph-based algorithm: it finds the most common words, chooses the microposts that contain those keywords, calculates a weight for each sentence, and selects the sentence with the highest score as the topic sentence. The other is the Hybrid TF-IDF algorithm, which is based on term frequency–inverse document frequency and finds keywords whose frequency within a document is high but whose frequency across the corpus is low. Finally, their experiments show that on large data sets the Hybrid TF-IDF algorithm performs better than the graph-based algorithm. In my view, the major problem is redundancy: many microposts are quite similar and may have the same similarity score or weight, so it is hard to select the best one.
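As a simplified sketch of the general extractive idea (this is an illustration only, not the exact algorithm of [12]; the sample posts are invented), one can pick the most frequent word as the topic keyword and then select the candidate post whose words have the highest average corpus frequency:

# Simplified extractive summarisation sketch: pick the most frequent word
# as the topic keyword, then choose the candidate post whose words have
# the highest average frequency in the collection.
from collections import Counter

posts = ["the phone battery is great",
         "great battery life on this phone",
         "battery lasts all day",
         "love the battery on my new phone",
         "i had lunch today"]

tokenised = [p.lower().split() for p in posts]
freq = Counter(w for toks in tokenised for w in toks)

topic_word = freq.most_common(1)[0][0]                 # 'battery' in this sample
candidates = [toks for toks in tokenised if topic_word in toks]

def weight(tokens):
    # average corpus frequency of the words in the post
    return sum(freq[w] for w in tokens) / len(tokens)

summary = max(candidates, key=weight)
print(topic_word, "->", " ".join(summary))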

Bossard et al. [13] develop an algorithm that first builds the similarities between all sentences. It then uses a clustering method, fast global k-means, to cluster the similarity matrix and generate clusters in which the sentences share a similar topic. They use the Jaccard measure to calculate similarity; if two words are not the same, they use WordNet to find synonyms and hypernyms of those words. After clustering, they choose one sentence from each cluster as the summary that covers the relevant information; simply, they can select the central sentence of each cluster, which optimises the similarities.
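A minimal sketch of the Jaccard measure used to build such a sentence similarity matrix (the WordNet-based synonym and hypernym matching is omitted here; the sample sentences are invented):

# Jaccard similarity between token sets, used to build a sentence
# similarity matrix before clustering.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

sentences = ["the battery life is great",
             "great battery life on this phone",
             "i had lunch today"]

matrix = [[jaccard(s, t) for t in sentences] for s in sentences]
for row in matrix:
    print([round(x, 2) for x in row])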

Lee [14] presents a density-based online clustering algorithm to detect emerging events that have temporal and geospatial characteristics. They point out that microposts lack semantic integrity, so it is hard to design a reliable weighting and clustering method. To resolve this problem, they propose a dynamic weighting algorithm for a real-time environment. Firstly, they apply an incremental DBSCAN clustering method to group the microposts, because the data is updated incrementally. Next, when calculating weights, they assign scores to different groups of words, such as uninformative words, topic words and common words. In my opinion, there are some limitations, as they concentrate only on temporal and spatial analysis and give few details about the density-based clustering algorithm.
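A non-incremental sketch of density-based clustering of microposts over TF-IDF vectors (this uses scikit-learn's standard DBSCAN rather than the incremental variant described in [14]; the eps and min_samples values and the sample posts are illustrative assumptions):

# Group microposts by density-based clustering of their TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

posts = ["earthquake reported downtown",
         "strong earthquake felt downtown this morning",
         "new phone released today",
         "excited about the new phone release"]

X = TfidfVectorizer().fit_transform(posts)
labels = DBSCAN(eps=0.9, metric="cosine", min_samples=2).fit_predict(X)
print(labels)  # posts with the same label fall into the same candidate topic cluster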

Feifei et al. [15] present a new extraction-based algorithm called FNODT to detect hot topics. It requires many input parameters, such as the number of users, the number of messages, word frequency, and events' occurrence and distribution over time. They sort the results and offer the top 50 as the hot topics. In this process, they use a fuzzy clustering algorithm to cluster the microposts and a classifier to classify the word set and generate the topic set. Besides, they use the topic set to re-retrieve the microposts, compare them using the above parameters, rank them, and finally output the hot topic set.

Zitao et al. [16] propose a new feature selection method to detect topics, based on part-of-speech information and HowNet. They select keywords that have different parts of speech but carry a considerable amount of information. In other words, the method can filter the candidate words and remove spam information from the microposts. They then use HowNet, a knowledge base, to describe the semantic features and split the concrete concepts. They discuss how to set the thresholds in the algorithm to boost classification quality. This method can also be used in many web applications.

He et al. [17] analyse the characteristics of microblogs and propose a new, lightweight algorithm called the LN algorithm to extract the latest hot topics. People or places are the input of this algorithm and are regarded as the subjects; the algorithm finds predicates and objects to describe the subjects and generates the hot topic sentence. Finally, they use a generation model tree to display the results. Specifically, they collect a set of rubbish information and use it to train a Naive Bayes classifier. Next, they pre-process the microblog corpus to remove hyperlinks and non-English words. Moreover, they use the trained Naive Bayes classifier to remove junk information, obtain a denoised microblog corpus, and use it as the input of the next step. They use a method based on named entity recognition and classification to find proper nouns such as people's names or place names; in it, they apply the Google Geocoder API to seek topic centre words. Finally, they create the generation tree based on word frequency to show the hot topic. In my view, it is very hard to find the correct keywords using entity detection methods because the number of microposts is too large; if the number were small, it could be a good approach.

Hutton et al. [18] present a novel method that can summarise microposts automatically by adopting the Phrase Reinforcement Algorithm. It detects the most common phrase containing the topic phrase, which is then regarded as the summary. In this process, they apply a filtering method, a Naïve Bayes classifier trained on data from Twitter, to derive the most relevant content. After obtaining the relevant sentences, they build a graph to display the common sequences of words that appear both before and after the topic words, and then give each node a weight to keep longer paths from dominating the output. Finally, they select the most heavily weighted path as the topic sentence. This method has some limitations. Firstly, it cannot find the keywords automatically and cannot categorise the microposts; instead, it adopts the Twitter HTTP-based API, so users must input a keyword and then obtain the cluster from Twitter. Secondly, the topic sentence may be too short because the number of nodes is too large and the frequency of many nodes is too small.

2.3 Sentiment analysis

Joshi and Belsare [19] propose a method to analyse the sentiment of microposts. They believe that adjectives play a pivotal role in sentiment analysis, so they first tag the adjectives in the microposts using suffixes. For example, the adjective "exquisite" can be marked with _JJ to become exquisite_JJ and then used in the following steps. They introduce QTag, a part-of-speech tagger for English, to mark the words. They describe a method that uses a seed list of adjectives; WordNet, an English lexical database, can provide this list. They also describe another approach that uses film reviews from IMDB to train classifiers. For example, we can use low-rating reviews to improve the negative classifier and, in contrast, high-rating reviews to enhance the positive classifier. Finally, we can use the negative and positive classifiers to judge the sentiment of microblogs.
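A minimal sketch of seed-list polarity scoring (the seed lists below are small, hand-made examples; [19] obtain theirs from WordNet and from classifiers trained on film reviews):

# Count positive and negative adjectives from small seed lists and
# assign a coarse polarity label to a micropost.
POSITIVE = {"great", "exquisite", "good", "excellent", "amazing"}
NEGATIVE = {"bad", "terrible", "awful", "poor", "disappointing"}

def polarity(post: str) -> str:
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("the new phone is exquisite"))   # positive
print(polarity("battery life is terrible"))     # negative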