Shengliang GAO et al. /Journal of Computational Information Systems 3:2 (2007) 203-213

A Comprehensive Review on Blog Mining under a Cross-Disciplinary Framework

Shengliang GAO[†], Conglei YAO

Lab of Computer Networks and Distributed Systems, Peking University

Beijing 100871, P.R. China

Abstract

Blogs have come to play an increasingly important role in everyday life as the Internet evolves. Since 2002, academia has seen a burst of research on blogs, not only in areas of computer science such as information retrieval (IR) and web mining, but also in the social sciences. We present a blog-centered framework that abstracts and generalizes current research on blog mining, with the intention of providing a clear picture of the blog-focused research area.

Keywords: blog; web mining

1. Introduction

Since definitions of a blog often differ slightly from study to study, we first give an operational definition, or more precisely, a list of the characteristics of a blog. In this paper, we take a blog to be a web element with the following five features: (1) Entries in reverse chronological order. (2) A time stamp attached to every entry. (3) Particular features for interaction between blogs: blogrolls, comments, trackbacks, etc. (4) RSS feeds, which make it easy to track updates with an RSS aggregator. (5) Socialized web text: a blog is tied to a person and is typically used to record everyday life or to express the owner's sentiment.

The remaining parts are organized as follows: Part 2 presents the blog-oriented framework as a meme map. Part 3 analyzes the framework in detail and places current studies in it. Part 4 is a concise conclusion.

2. Blog-oriented framework

The framework is shown as a meme map in Figure 1. The legend of the meme map is as follows:

Fig.1 Meme map of weblogging ecosystem

The whole meme map is divided by a time line in the middle. Below the line is the blogosphere, a collective noun that often denotes the whole blogspace; above it is the personosphere (a term we coin by analogy with blogosphere).

--Colors denote different objects/processes.

--Brown vertices denote single blogs in the blogosphere. The green edges between them denote the concrete links between blogs.

--Blue vertices denote single persons in the personosphere. The magenta edges between them denote the relationships between persons.

--Thin violet bidirectional edges between brown vertices and blue vertices denote the interactive processes between a blog and a person.

--Vertices and circles denote different granularities.

--Vertices denote single elements, either a blog or a person.

--Circles denote groups of elements, either a set of blogs or a group of people.

--Thick arrows denote the evolving processes of, and the interacting processes between, evolving systems.

--The brown arrow denotes the evolving process of the blogosphere.

--The blue arrow denotes the evolving process of the personosphere.

--Thick violet arrows denote the mapping and interacting processes between the blogosphere and the personosphere.

To express a complicated person (the blue vertex) properly in the following analysis, we formalize it as an n-tuple: <blogID, name, gender, age, geolocation, interests, friends, personality, mood, shopping tendency, political opinion, ……>

Here blogID is the primary key; the other attributes of the tuple may take the same value for different persons.
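As an illustration only, the n-tuple can be sketched as a small record type. The field names, the subset of attributes shown, and the helper `index_by_blog_id` are assumptions made for this sketch; the paper itself does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Person:
    """One blogger (a blue vertex). Only blog_id has to be unique."""
    blog_id: str                              # primary key
    name: Optional[str] = None                # None means "not yet derived"
    gender: Optional[str] = None
    age: Optional[int] = None
    geolocation: Optional[str] = None
    interests: List[str] = field(default_factory=list)
    friends: List[str] = field(default_factory=list)  # blog_ids of friends
    mood: Optional[str] = None
    political_opinion: Optional[str] = None


def index_by_blog_id(people: List[Person]) -> Dict[str, Person]:
    """Build a lookup table and reject duplicate primary keys."""
    table: Dict[str, Person] = {}
    for person in people:
        if person.blog_id in table:
            raise ValueError(f"duplicate blogID: {person.blog_id}")
        table[person.blog_id] = person
    return table


# Hypothetical data: two distinct persons that share every non-key attribute.
people = [
    Person(blog_id="blog-001", name="Alex", age=25, geolocation="Beijing"),
    Person(blog_id="blog-002", name="Alex", age=25, geolocation="Beijing"),
]
table = index_by_blog_id(people)
```

Attributes that have not been derived yet stay `None` or empty, which matches the view taken later in the paper that the tuple is filled in step by step through mining.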

3. Detailed analyses

We interpret our weblogging ecosystem framework through a complete generalization of existing research: we divide the studies into five categories and place representative studies in each category.

3.1. Single brown vertex as a blog

3.1.1. Blog identification, splog detection, blog page crawling

The first question that arises here is what a blog is and how to identify one. Blog identification and splog detection deal with this question; representative research appears in [1][2][3].

So far, three methods of collecting blog pages have been applied: (1) Using a traditional IR focused-crawling strategy (e.g. URL analysis). (2) Using RSS seeds. (3) Using a ping server's service.
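To make the second method more concrete, the following is a minimal sketch of reading entries from an RSS 2.0 feed with Python's standard library. The feed content, the URLs and the function name `entries_from_rss` are invented for illustration; a real collector would also have to fetch the feed over the network and cope with malformed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/2</link>
      <pubDate>Tue, 02 Jan 2007 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/1</link>
      <pubDate>Mon, 01 Jan 2007 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def entries_from_rss(xml_text):
    """Return one dict per <item>, in the order the feed lists them."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return entries


entries = entries_from_rss(SAMPLE_FEED)
```

Each returned entry carries the permalink and the time stamp, the two pieces of information that the later sections rely on.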

The first method is often used in published papers. The second and third are often used in blog search engines (e.g.

3.1.2. Blog genre analysis

In [4], Herring et al. found that blogs are more individualistic and more intimate forms of self-expression than scholars had expected. They based this conclusion on the blog authors' characteristics, the primary purpose of the blog, and so on. In [5], Mike Thelwall, after studying bloggers during the London attacks, proposed a dichotomy between externally focused, news-aware blogs updated approximately daily and internally focused, diary-like blogs updated approximately weekly.

3.2. Single thin violet edge as the relation between a blog and a person

In the direction from the blue vertex to the brown vertex, the edge concerns a social aspect of blog research: why do people write blogs? Through sampling and interviews, Nardi et al. [6] concluded that there are mainly five motivations behind people's blogging behavior. Since their interview method, carried out at Stanford University, limits the number of bloggers involved, there is still room for a large-scale investigation of this question.

In the direction from the brown vertex to the blue vertex, the edge concerns an even more social question: how does blogging behaviour change the blogger's life? Since this question belongs more to social science and lies a little beyond the scope of blog mining, we have not found any papers on it so far.

3.3. Single blue vertex as a formalized person

This aspect focuses on constructing an abstract representation of a person by mining blog text. A blog can be viewed as a socialized web unit, so we can draw a mapping between a person and a blog. The task can be regarded as reverting a blog text to a live person. In terms of our notation for a person in blogs, it is the process of deriving the n-tuple mentioned above.

In practice, many Blog Service Providers (BSPs) require users to supply some basic personal information upon registration. Unfortunately, not all BSPs do this, so researchers must extract further information about the author from clues in the text. Current research on this aspect includes:

3.3.1. Extracting basic personal information

E.g. deriving geolocations from blog text. In [7], Jia Lin and Alexander Halavais used five kinds of data to extract geographic information and derive the location from which the author blogs, and found that 3-digit zip codes are a better choice for geo-mapping bloggers in America.

3.3.2. Deriving non-factual attributes, especially subjective attributes, from blogs

In [8], Gilad Mishne and Maarten de Rijke used a combination of text analysis and an external knowledge source (amazon.com) to estimate the book tastes of bloggers from their text. They noted that although a particular book that the blogger wishes to buy is not easy to find, the category of such books is not difficult to deduce. At the level of individual posts, the algorithm performed quite well.

Since affective computing has become an important focus in computational linguistics, and blog text is by nature a resource of affective text, many studies focus on affect mining in blogs. In [9] and [10], Gilad Mishne and Maarten de Rijke proposed a methodology for mood tracking. First, they identified textual features that can be used to estimate mood prevalence at the aggregate level; second, they built models that predict the intensity of moods in a given time slot, using the features derived in the first step. The methodology is in essence an application of word-mood association. In [11], Gilly Leshed and Joseph Kaye explained the mechanism of word-mood association in detail and reported promising experimental results on LiveJournal.
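The idea of word-mood association can be illustrated with a toy sketch. It is not the model used in [9], [10] or [11]: it simply counts how often each word co-occurs with each mood label in a handful of invented, mood-tagged posts and lets every known word of a new post vote for the moods it was seen with. The function names and the training posts are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_word_mood(labeled_posts):
    """Count, for every word, how many posts of each mood contain it."""
    word_mood = defaultdict(Counter)
    for text, mood in labeled_posts:
        for word in set(text.lower().split()):
            word_mood[word][mood] += 1
    return word_mood


def estimate_mood(text, word_mood):
    """Let each known word vote with its share of occurrences per mood."""
    scores = Counter()
    for word in set(text.lower().split()):
        counts = word_mood.get(word)
        if not counts:
            continue  # word never seen in the tagged posts
        total = sum(counts.values())
        for mood, n in counts.items():
            scores[mood] += n / total
    if not scores:
        return None  # no evidence at all
    return scores.most_common(1)[0][0]


# Hypothetical mood-tagged posts (text, mood label).
posts = [
    ("so happy about the sunny trip today", "happy"),
    ("great party with friends feeling happy", "happy"),
    ("my cat is sick and i feel sad", "sad"),
    ("rainy day missed the bus so sad", "sad"),
]
model = train_word_mood(posts)
guess = estimate_mood("a sunny party", model)
```

Summing the same per-word votes over all posts in a time slot, instead of over a single post, would give a rough aggregate mood level for that slot.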

Fritz Heckel [12] classified blogs by their authors' political bias using a combination of BASILISK and Self-Organizing Maps, and obtained surprisingly good results. Another well-known study of bloggers' political bias is reported by Lada Adamic et al. [13]. They systematically studied the degree of interaction between liberal and conservative blogs and uncovered some differences between the two communities.

3.4. Blue circles and magenta edges as a group of people and the interactions between the people within it

3.4.1. Blue circles: descriptive statistics and correlation analysis

Through research aspect 3 (single blue vertex as a formalized person), we obtain a set of vectors, each denoting a blogger in the blogosphere. What can scholars do with them? An obvious suggestion for the next step comes from classic data mining techniques: we can apply descriptive analysis to obtain the distribution of one or more attributes, and perform correlation analysis or clustering based on the attributes. In [14], Ravi Kumar et al. used the vector data of 1.3 million bloggers to perform descriptive analysis and reported the bloggers' geolocation distribution, age distribution, interest clusters and friendships.
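A minimal sketch of such a descriptive analysis is given below. The four blogger records, the attribute names and the ten-year age buckets are invented for illustration and are unrelated to the data set analysed in [14].

```python
from collections import Counter

# Hypothetical blogger vectors; a real study would hold millions of them.
bloggers = [
    {"blogID": "b1", "age": 17, "geolocation": "CA", "interests": ["music", "movies"]},
    {"blogID": "b2", "age": 19, "geolocation": "CA", "interests": ["music"]},
    {"blogID": "b3", "age": 24, "geolocation": "NY", "interests": ["books", "music"]},
    {"blogID": "b4", "age": 35, "geolocation": "CA", "interests": ["books"]},
]


def attribute_distribution(records, attribute):
    """Relative frequency of each value of one attribute (missing values skipped)."""
    values = [r[attribute] for r in records if r.get(attribute) is not None]
    if not values:
        return {}
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}


def age_bucket(age, width=10):
    """Map an age to a label such as '10-19'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def interest_counts(records):
    """How many bloggers list each interest."""
    counts = Counter()
    for r in records:
        counts.update(set(r.get("interests", [])))
    return counts


geo = attribute_distribution(bloggers, "geolocation")
ages = Counter(age_bucket(r["age"]) for r in bloggers)
interests = interest_counts(bloggers)
```

The same per-attribute counts are the starting point for correlation analysis or clustering, for example by cross-tabulating age buckets against interests.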

3.4.2. Magenta edges: link analysis

Before going further into the magenta edges, we first illustrate the logical view of edges and vertices. As shown in the meme map (Figure 1), the magenta edges define relations between bloggers. Links in the blogosphere take diverse forms. The links that have been recognized and used in published papers are the following: (1) Mentions of other blogs and bloggers in entries (2) Blogrolls (3) Permalinks from one blog to another (4) Comments in response to other bloggers' entries (5) Trackbacks. Edges connect vertices, so it is also necessary to introduce the operational meaning of vertices and their different levels. Since a blog is a socialized web unit, blogspace is a rich and complex social environment that admits study at many levels. Vertices in blog studies can be divided by granularity into: (1) Macro level: the whole blogosphere (2) Middle level: group, community (3) Micro level: authoring person, page, article, named entity in text. We now turn to specific studies:

1. Power law and small world

Power laws and small-world structure are two well-known claims in web link analysis, and many researchers are testing them in the blogosphere. Conclusions differ under different circumstances. Wiktor Bachnik et al. described quantitative and sociological characteristics of the Polish blogosphere [15]. They found that sparsity is the first apparent property of the blog networks examined and that blogs are in fact scale-free networks. They also confirmed the existence of a small world by comparing their three different blog corpora. In [16], Fernando Tricas et al. tested the two claims in the Spanish-speaking blogosphere and concluded that at that stage of development (2003), the Spanish-speaking blogosphere had not yet reached the state where incoming links follow a power law, but it already showed “small world features”.
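The two quantities behind these claims can be computed from a list of blog-to-blog links. The sketch below is a simplified illustration, not the procedure of [15] or [16]: it tabulates the in-degree distribution (whose tail one would inspect for power-law behaviour) and the average shortest-path length with links treated as undirected (a small value relative to network size is one sign of a small world). The edge list and the function names are assumptions.

```python
from collections import Counter, deque


def in_degree_distribution(edges):
    """Map each in-degree k to the number of blogs that have exactly k in-links."""
    in_deg = Counter()
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        in_deg[dst] += 1
    return Counter(in_deg[n] for n in nodes)


def average_path_length(edges):
    """Mean shortest-path length over reachable ordered pairs (undirected view)."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    total, pairs = 0, 0
    for start in neighbours:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from start
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical links (source blog, target blog): three blogs cite one hub.
links = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
distribution = in_degree_distribution(links)
avg_len = average_path_length(links)
```

On real data the in-degree counts would be plotted on log-log axes; an approximately straight line is the usual informal evidence of a power law, although a proper statistical fit is needed to confirm it.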

2. Community detection

Several studies have looked into detecting virtual communities in blogs [17,18,19,20,21]. The approaches commonly used are content analysis, participant interviews and clustering algorithms. In [18], Wei recorded the statistics of a knitting blog in order to find norms that indicate membership rules as an indicator of community. In [17], Alvin Chin and Mark Chignell proposed a social hypertext model for finding communities in blogs. The study by Yu-Ru Lin et al. [21] posed an enlightening question: is simple link analysis sufficient evidence of a community's existence? They built a computational model of mutual awareness, on which a ranking-based clustering method is performed to detect communities. Sometimes researchers define objects at a granularity other than the blog. In [22], Xin Li et al. used named entities to map web documents into a graph. They showed that by using both the triangle geometry inside a graph and the mutual information between vertices, their clustering algorithm can effectively discover interesting communities.
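As a deliberately simple stand-in for these methods, the sketch below groups blogs that are connected through reciprocated links, on the assumption that a link in both directions is a crude sign that two bloggers are aware of each other. It is neither the mutual-awareness model of [21] nor the triangle-based clustering of [22]; the function name, the `min_size` parameter and the edge list are invented for illustration.

```python
def reciprocal_communities(edges, min_size=2):
    """Connected components of the graph that keeps only two-way links."""
    links = set(edges)
    neighbours = {}
    for src, dst in links:
        if src != dst and (dst, src) in links:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)

    seen = set()
    communities = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) >= min_size:
            communities.append(sorted(component))
    return communities


# Hypothetical links: a<->b and b<->c are mutual, d->a is one-way, e<->f is mutual.
links = [
    ("a", "b"), ("b", "a"),
    ("b", "c"), ("c", "b"),
    ("d", "a"),
    ("e", "f"), ("f", "e"),
]
groups = reciprocal_communities(links)
```

Blog `d` is left out because its link is never returned, which illustrates the question raised in [21]: a one-way link alone is weak evidence of community membership.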

3.5. Thick arrow group: blogosphere evolution, personosphere evolution and the interaction between them

Blogs can be viewed as an evolving system whose time stamps are easy to obtain, so we can use these time stamps to examine blogosphere evolution, personosphere evolution and the interaction between them. Specifically, in the meme map (Figure 1), the brown arrow denotes the stream of the blog corpus; the blue arrow denotes a stream in reality, such as a news stream or a stream of movie box-office income; the violet arrows denote the comparison and interactions between the two temporally evolving systems. Representative studies on this aspect are:

In [23], David Gurzick and Wayne G. Lutters revealed that blogs go through four stages in their evolution. Most research on this aspect of blogs involves time graph analysis and burst/peak detection. In [24], Ravi Kumar and Jasmine Novak extended Kleinberg's work on burst detection in text streams [25]. Kleinberg targeted email and research papers, trying to identify sharp rises in word frequencies in document streams; Kumar et al. adapted the approach from word frequency to link creation and applied it to the blogosphere. In [26], the study by Tapanee Tirapat et al. showed that there is indeed a significant correlation between the amount of buzz a movie generates and its critical or financial success. They extracted all blog entries posted in November and December 2005 from five regularly updated movie-related blogs and defined metrics to determine blog activity for a movie. Similar methods can also be seen in automated trend discovery in the blogosphere. In [27], Natalie S. Glance et al. proposed a trend discovery algorithm based on the correlation between key phrases and key entities.
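Two of the operations mentioned above can be sketched in a few lines. The burst detector below is a plain mean-plus-k-standard-deviations threshold, which is much cruder than the state-machine model of [25] and is used here only to show the input and output of burst detection. The correlation function is the standard Pearson coefficient, as one might apply to a blog-activity series and a real-world series. All data values and names are invented.

```python
from statistics import mean, pstdev


def detect_bursts(counts, k=2.0):
    """Indices of time slots whose count exceeds mean + k * standard deviation."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # a flat series has no bursts
    return [i for i, c in enumerate(counts) if c > mu + k * sigma]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (vx * vy) ** 0.5


# Hypothetical number of new links (or posts) per day.
daily_counts = [3, 4, 3, 5, 4, 30, 4, 3]
bursts = detect_bursts(daily_counts)

# Hypothetical blog mentions per movie and a matching real-world measure.
mentions = [1, 2, 3, 4]
income = [2, 4, 6, 8]
r = pearson(mentions, income)
```

A threshold of this kind flags only isolated spikes; detecting sustained bursts at several intensity levels is what motivates the hierarchical model of [25].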

4. Conclusion

In this paper, we have tried to combine two research perspectives. One is an “object-oriented” way of thinking: examining what characteristics the weblogging ecosystem has and deriving what kinds of research could be done on it based on these characteristics. The other is a research-oriented perspective: generalizing what has been done in blog mining. We make this combination by placing current studies at their proper positions in the meme map.

This research is supported by NSFC Grant 60435020: Theory and methods of question-and-answer information retrieval, NSFC Grant 60573166: Research on the correlation model and experimental computing methodology between web structure and social information, and NSFC Grant 60603056: Research and Application on Sampling the Web.

References

[1] Erik Elgersma, Maarten de Rijke. Learning to Recognize Blogs: A Preliminary Exploration. EACL 2006

[2] Pranam Kolari, et al. SVMs for the Blogosphere: Blog Identification and Splog Detection. AAAI 2006

[3] Tomoyuki Nanno, et al. Automatically Collecting, Monitoring, and Mining Japanese Weblogs. WWW 2004

[4] Susan C. Herring, et al. Bridging the Gap: A Genre Analysis of Weblogs. IEEE 2004

[5] Mike Thelwall. Bloggers during the London attacks: Top information sources and topics. WWW 2006

[6] Bonnie A. Nardi, et al. “I’m Blogging This”

[7] Jia Lin, Alexander Halavais. Mapping the Blogosphere in America. WWW 2004

[8] Gilad Mishne and Maarten de Rijke. Deriving Wishlists from Blogs. WWW 2006

[9] Gilad Mishne and Maarten de Rijke. MoodViews: Tools for Blog Mood Analysis. AAAI 2006

[10] Gilad Mishne and Maarten de Rijke. Capturing Global Mood Levels using Blog Posts. AAAI 2006

[11] Gilly Leshed, Joseph Kaye. Understanding How Bloggers Feel: Recognizing Affect in Blog Posts. CHI 2006

[12] Fritz Heckel, et al. Political Blog Analysis Using Bootstrapping Techniques. 05 seminar, Swarthmore College

[13] Lada Adamic. The Political Blogosphere and the 2004 U.S. Election: Divided They Blog.

[14] Ravi Kumar, et al. Structure and Evolution of Blogspace. Communications of the ACM, December 2004

[15] Wiktor Bachnik, et al. Quantitive and sociological analysis of blog networks. arxiv.org

[16] Fernando Tricas, et al. Do we live in a small world?

[17] Alvin Chin, Mark Chignell. A Social Hypertext Model for Finding Community in Blogs. HT 2006

[18] Carolyn Wei. Formation of Norms in a Blog Community.

[19] Efimova, L. et al. In search for a Virtual Settlement.

[20] Merelo-Guervos et al. Mapping weblog communities. 2004

[21] Yu-Ru Lin, et al. Discovery of Blog Communities based on Mutual Awareness. WWW 2006

[22] Xin Li, et al. Mining Community Structure of Named Entity from Web Pages and Blogs. AAAI 2006

[23] David Gurzick, et al. From the Personal to the Profound: Understanding the Blog Life Cycle. CHI 2006

[24] Ravi Kumar and Jasmine Novak. On the Bursty Evolution of Blogspace. WWW 2005

[25] J. Kleinberg. Bursty and hierarchical structure in streams. SIGKDD 2002

[26] Tapanee Tirapat, et al. Taking the Community’s Pulse, One Blog at a Time. ICWE 2006

[27] Natalie S. Glance, et al. BlogPulse: Automated Trend Discovery for Weblogs. WWW 2004

[†] Corresponding author.

Email address: (Shengliang Gao)