An OKAPI-Based Approach for Article Filtering
Benedict Lee and Cuong Van Than
Department of Computer Science
Rice University
{benlee, cvthan}@rice.edu
1 Introduction
Text classification is the task of assigning predefined categories to documents. This problem has become important, both because of the vast volume of online data produced these days, and of its various practical applications. Typical applications include organizing Web pages by topic, filtering out inappropriate online content for specific audiences, and detecting authorship of disputed works.
Text classification can also be used as a preprocessing step in other machine learning tasks. For example, the Ares Project [Devika] aims to predict international conflict based on events extracted from news articles from sources such as the AFP, Reuters, and the Associated Press. The proliferation of news in electronic form has made it possible to automate the forecasting of conflict outbreaks. However, the huge number of articles also imposes a challenge to the accuracy of the prediction process because news sources also contain articles that are irrelevant to political tensions. When building an event database, we might get, for example, sports or music-related stories, which have nothing to do with international conflict, so we do not want to include such articles in our database. The main goal of our project for Comp 540 is to develop a process that differentiates politically relevant articles from politically irrelevant articles.
2 Overview of Text Classification
In this section, we provide a brief overview for the text classification problem. The book “Learning to Classify Text Using Support Vector Machines” by Jaochims gives an in-depth discussion for what is being presented here.
2.1 Definition
The general problem of text classification can be formally stated as learning an optimal estimator, based on training examples, to the function , where D is the document domain and C is the set of predefined classes. We often assume that no additional knowledge or metadata is available, so documents are classified solely on their content. Depending on the set C and on different tasks, classification may be binary, multi-class or multi-label.
In the binary setting, each document is classified in exactly one of two categories. The two categories might be “relevant” and “irrelevant”, as in our article filter, “like” vs. “dislike”, or “spam” and “non-spam” in de-spam email programs. Though binary classification is the simplest task, it is the most important in the learning problem [Joachims].
In case C has more than two classes, we have the multi-class classification problem. A typical application is to assign books into different subjects such as “mathematics”, “physical sciences”, “engineering”, or “social sciences” by the book titles and short descriptions (e.g. editorial reviews as in Amazon.com).
We can always solve a multi-class classification task by a number of binary classifiers. The common strategy is one-against-the-rest. Label the first element of C as +1 and all the remaining elements as –1 and apply a binary algorithm to classify documents according to these two labels. For the class –1, we repeat to split into two labels as above until every label corresponds to exactly one element of C.
Another method for dealing with multi-class problem by using binary classifiers is the pair-wise strategy: We learn each classifier for every pair of classes of C (so we have O(|C|2) classifiers), and an unseen document is classified by majority vote of these classifiers.
In practice, a document can fall into more than one class. For example, the book “Elements of Statistical Learning” used in our course falls into both statistics and computer science subjects. When a document can be assigned to more than one class as in that example, we have the multi-label text classification problem. Currently, no algorithms directly handle it. However, one can still use binary classifiers to learn a model for it. The prediction for a document in this setting is now an l-dimensional binary vector in {–1, +1}l, where l is the total number of class. Under this representation, the multi-label problem gets solved by independently applying a binary classification algorithm for each category.
So a good binary classifier is of critical importance. Our project aims to develop a text classification program that has a better performance than that of the current naive Bayes classifier used in Ares.
2.2 Text Representation
The next step is to choose an appropriate representation for texts, which has a crucial influence on the performance of classification algorithms. In fact, the choice of a representation of articles also affects the choice of an algorithm and the design of the whole program. There are typically five types of text representation [Joachims]:
§ Sub-word level—decomposition of words. In n-grams, the most popular sub-word representation, the document is represented as a set of strings of n characters. The word “book” corresponds to “boo” and “ook” in 3-grams. See [Neumann].
§ Word level—words and lexical information. The text is split into a sequence of words. It is usually assumed that the order of words is independent, so only the frequencies of words are recorded, while the overall structure of the document is ignored.
§ Multi-word level. At word level representation, it would not be possible to distinguish between a verb and noun “book,” for example. Basic building blocks in multi-word level are generally phrases and sentences so that it can capture syntactic information.
§ Semantic level. This would be the most efficient level of representation because it captures the meaning of the whole text.
§ Pragmatic level—the meaning of the text with respect to context and situation. This level is often not applicable to online news articles, as such context information is barely available.
Our algorithm chooses to represent documents at the semantic level by computing two values—the relevance and irrelevance rank. Because these scores measure the similarity of the whole article against a training dataset, we expect our program to perform better than programs that choose the representation at lower levels.
2.3 Feature Selection
Before we learn a model for a text classification task, the training dataset, after being converted to an appropriate representation, needs refinement. The main purpose of this additional step is to remove irrelevant attributes, such as prepositions and conjunctions, from the representation. Another motivation is computational efficiency—eliminating unnecessary and very common words like “the” or “and” effectively reduces the dimension of the input space.
There are two kinds of feature selection [Jaochims]:
§ Feature subset selection. This technique retains a subset of the original attributes. As mentioned in the previous paragraph, stopwords like “the” or “and” are irrelevant and they should be removed from the feature vector. Yang and Pedersen [Yang97] also suggest to delete infrequent words from the text representation, based on an assumption that words rarely occur through the training dataset have negligible influence on classifying texts.
§ Feature construction. New features are introduced by combining original attributes. English words often have several variations. For example, words “defense” “defend” and “defensive” have the same stem. It would be more appropriate to consider different word forms of a stem altogether as a single attribute. The net effect is minimizing the number of attributes. Another feature construction technique is to group synonyms into equivalence classes. This results in a more robust text classifier, because different words having similar semantic meanings are treated equally and as a single feature. [Fisher94] explores the use of this approach in text classification.
2.4 Conventional Learning Methods
We do not aim to provide an exhaustive list and description of learning methods used in classification. Rather, we list here some of the most popular approaches. However, we will discuss the naive Bayes classifier at a greater depth in “Related problems” Section, as it is the current method employed by Ares.
§ Naive Bayes classifier. The Bayesian approach is to assign a document d to a class y that maximizes the probability Pr(y | d). From Bayes theorem, this posterior probability is computed by:
.
Because Pr(d) is independent of y, maximizing Pr(y | d) is equivalent to maximizing Pr(d | y) Pr(y). Pr(y) can be estimated as the frequency of documents in class y in the training dataset. With the assumption that attributes of d are independent, the likelihood Pr(d | y) can also be estimated in the same way as Pr(y).
§ k-nearest neighbors. The method classify a new document by computing the similarity between it and other documents, and then assigning it the majority class among k closest neighbors. k-nearest neighbor algorithms show a good performance on text categorization tasks [Yang97]. In this method, an appropriate similarity/distance metric is essential to a good classifier. Our text classification program is in this learning category and it makes use of the OKAPI similarity measure. We will discuss more about this metric in Section [OKAPI].
§ Decision trees. The most popular decision tree learning program is C4.5 by Quinlan [Quinlan93]. The approach is to pick up the best attribute, create tree nodes corresponding to possible values of this attribute, and divide training examples among these nodes. The process is then repeated until no examples are left. To decide which attribute to be used at each step, information gain is calculated for each attribute (based on the training dataset). The attribute with the highest gain is chosen to as the best classifier for this step. Informally, information gain measures how well an attribute separates the training set. Therefore, the higher the gain, the better separation resulting from classifying training examples on the associated attribute. For a formal concept, equations for computing information gain and an illustrative example see [Mitchell].
3 Related Work
Text classification is a long-established problem, so there has been much related work in this field. In this section, we look at some applications of text classification as well as the current naive Bayes classifier used Ares.
3.1 Ares and the Naive Bayes Classifier
Classification of news articles is an integral part of the Ares Project. The first stage in achieving the ultimate goal of predicting international conflicts is to build an event
database. The current version of Ares uses a naive Bayes classifier to classify articles. Figure 1 illustrates how the naive Bayes classifier is used to build an event database.
Figure 1: Naive Bayes classifier in Ares. [From Ares’ Website]
As mentioned in Subsection 2.4, a document d is assigned to class y if the probability Pr(y | d) is maximized, which is equivalent to maximizing Pr(d | y) Pr(y). To estimate Pr(d | y), we assume that words occurring in the document are independent and identically distributed. Even with this assumption, however, a rough calculation [Mitchell] shows that the total number of probabilities Pr(di | y) that must be estimated is about 2 ´ 50,000. To make that number more manageable, the naive Bayes classifier in Ares decided to only select the 2000 most relevant words. This strategy effectively reduces the number of terms Pr(di | y) by 25 times. However, it also makes the classifier sensitive to the selection of feature words. See Ares’ Website for experimental results of this naive Bayes classifier.
3.2 Support Vector Machines
A support vector machine (SVM) is a class of supervised machine learning techniques. It is based on the principle of structural risk minimization. In linear classification, support vector machines create a hyper plane that separates the data into two sets with the maximum-margin. (A hyper plane with the maximum-margin has the distances from the hyper plane to points in the two sides are equal.) Mathematically, SVMs learn the sign function , where w is a weighted vector in . SVMs find the hyper plane separating the space into two half-spaces with the maximum-margin. Linear SVMs can be generalized for non-linear problems. To do so, the data is mapped into another space and we perform the linear SVM algorithm over this new space.
3.3 Applications of Text Classification
There are a number of text classification applications. A well-known application is spam filtering; e-mail needs to be classified as either legitimate or unsolicited/unwanted (spam). In [Graham], Graham proposes the use of naive Bayesian filtering in the context of spam classification. While Bayesian filtering has proven to be fairly successful in this problem, it can also be used as a more general text classifier. Following is a short list of other applications [Fabrizio]:
§ Detecting authorship of disputed works. Text classification can be used for the purpose of detecting authorship of disputed works. For example, there are many works that are known to belong to Shakespeare, but there are other stories whose authorship is in doubt.
§ Text filtering. The Web is flourished with online articles and stories. Many of them contain adult contents, and we want to shield children from reading them. So, we can integrate a text classifier into Web browsers to filter out not-for-kids stories.
§ Filing classified ads into classes. For example, should we print an ads under the category “Car Rental,” or “Real Estate.”
§ Filing patents under predefined categories.
4 Approach
We believe that relevant articles are more similar to other relevant articles than irrelevant ones. With this idea in mind, we propose a method to classify articles based on “similarity.” Given a new article, we generate a relevance rating and an irrelevance rating based on how “similar” the given article is to the training set of pre-classified articles.
While there are many ways to define “similarity”, we choose to use the OKAPI BM25 ranking formula [10]. OKAPI is a well-known and widely used ranking formula. Furthermore, Microsoft SQL Server 2005, the database used in the Ares project, provides built-in support for OKAPI. For future work, we could see how well other ranking formulas performed.
Given two bodies of text, the OKAPI formula will generate a score that corresponds to the similarity of the two bodies of text; higher scores indicate better matches.
For the problem of classification, we first need to have a sufficiently large training set of pre-classified articles. It needs to be large in order to have the breadth necessary to make matches with any input article. These articles are separated into “Relevant” and “Irrelevant” training sets. To calculate a relevance rating, we use the OKAPI formula to get scores comparing an input article with every article in the “Relevant” training set. We only take the sum of the top N scores and set that as the relevance rating for the inputted article. Similarly, we use OKAPI to get similarity scores between the inputted article and every article in the “Irrelevant” training set and set the irrelevance rating as the sum of the top N scores.