Ruban S & Cijo Paul1

AN EFFECTIVE QUERY REFINING METHOD TO REFINE QUERIES USING WIKIPEDIA

Ruban S[1]Cijo Paul[2]

1. INTRODUCTION

Information retrieval is the process of gathering relevant information from a large collection of information sources. The number of people using Information Retrieval in their lives has grown significantly. A child performing a simple Google Search, to a teenager reading her Instagram Stories is now participating in Information Retrieval. Information Retrieval is the act of retrieving relevant information based on the expression of an Information need.

An information need is expressed in terms of a query. A query is a statement that requests particular information. Anything you enter into a web search form is a query. A query thus forms a crucial part of an InformationRetrieval (IR) system. It connects the IR system to the information need expressed by the User. Queries to most search engines aren’t efficient in describing an information need. They’re usually short and not written very carefully. To fill this gap between the user’s intent and the query we employ Query Expansion. Query Expansion is the process of reformulating a seed query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. Information retrieval systems, employ query expansion to enhance the quality of the results. Recall is the number of results returned by a query in an Information Retrieval System. The goal of Query Expansion is to increase Recall of the system. For example, if you searched for “spider bites” a good query expansion system, would also match documents that are related to spider bites but may not have they text “spider bites” in them.

In this paper we have proposed a simple but effective Pseudo Relevance Feedback based method for Query Expansion using Wikipedia. The paper is divided into V sections. In Section II we present an overview of existing work, in Query Expansion. Section III describes our proposed Framework for Query Expansion. Section IV shows the results obtained from our experiments on the Framework. Section V summarizes the main conclusions of this work.

2. EXISTING WORK IN QUERY EXPANSION

Voorhees [1] expanded queries using a combination of synonyms; hypernyms and hyponyms manually selected from WordNet, and achieved limited improvement on short queries. Stairmand[2] used WordNet for query expansion, but they concluded that the improvement was restricted by the coverage of the WordNet.

Among most query expansion techniques, Pseudo Relevance Feedback(PRF) algorithms are considered to be most successful. PRF algorithms assume that the top ranked documents from a seed query are relevant and contain good expansion terms. Lavrenko’s et AL’s RM model[3] selects terms based on their term frequency in top retrieved documents and weighs them by the documents ranking score. Later Metzler[4] multiplied Inverse Document Frequency of the term to the term frequency to drown the effect of highly frequent terms in the corpus. Robertson et AL’s BM25 query expansion[5] selects terms based on their appearance in pseudo relevant documents versus their appearance in irrelevant documents.

PRF algorithms rely highly on the first set of results from the seed query. Most datasets are noisy and may not return great results based on the query. A good solution to this is to use an external high quality data source that is not noisy. Using Wikipedia, Xu Et Al[6], proposed a PRF like method on top retrieved documents from Wikipedia. Chenyang Xiong and Jamie Callan[7] also employed Freebase, a large public dataset of real world entities for query expansion.

3. PROPOSED METHODOLOGY

In this paper we propose a PRF method using Wikipedia for Query Expansion. Wikipedia is domain independent. It is a very large source of information manually crafted by volunteers all over the world. The speed at which its kept updated is fascinating. This makes it one of the best assets in Information Retrieval.

Our approach consists of 4 key phases. 1. Generating possible base candidates. 2. Finding related child candidates. 3. Finding related parent candidates. 4. Ranking resultant concept candidates to find the n-best suited queries for query expansion.

3.1. Generating possible base candidates:

We search all of wikipedia for the given query and sort the results according to their tf-idf weights. From this list, a sub list of N highest tf-idf articles are extracted. We will use them as the base candidates to find related concepts.

3.2. Finding related child candidates:

Wikipedia represents relationships by linking two Documents. This is crucial for us since we’re trying to find related queries to improve our seed query. Mining links from each of the results gathered in the previous stage will give us a list of child concepts, the original query maybe related to. We can now sort each concept according to the number of base candidates that link to them. The greater the number of links, the higher its possibility of being related to the seed query.

3.3. Finding related parent candidates:

Parent candidates are documents that link to the base candidate. Wikipedia keeps a list of documents that link to a given document in the WhatLinksHere tool. This tool can be used to extract parent candidates that maybe related to the seed query. The greater the number of base candidates a parent concept links to, greater its possibility of being related to the seed query.

Fig 1: representation of expansion terms using initial results from seed query

The anchor text used to describe links in Wikipedia is precise in describing the pages they link to. Hence the anchor text itself can be used as expansion terms to refine the seed query.

3.4. Ranking potential queries

Parent candidates are ranked according to the total number of links each of them has to all the base candidates.

Child candidates are ranked according to the total number of links that point to it from the base candidates.

4. RESULTS

In this section, we discuss the performance of our expansion engine using Wikipedia. Table-1 below, shows

The seed queries as well as the expansion terms retrieved for them. We tested the results on Google, to find the number of relevant search results before and after expansion of the seed query.

Seed Query / Expansion Terms from Wikipedia / Relevant Search Results after Expansion / Relevant Search Results before Expansion
Carpenter bee / Insect, Animal, Apidae, Hymenoptera, Taxonomy (biology), Arthropod, Carpenter bee / 100 / 93
barrett’s espophagus / Gastrointestinal disease, Barrett's esophagus, Esophagus, Medical Subject Headings, Digital object identifier, Dysphagia / 98 / 98
Evidence for evolution / On the Origin of Species, Charles Darwin, Coevolution, International Standard Book Number, Transitional fossil, Human evolution, Evolution / 43 / 45
F5 Tornado / Tornado, Fujita scale, International Standard Book Number, Digital object identifier, List of tornadoes striking downtown areas of large cities, Bibcode / 96 / 96
Feliz navidad lyrics / MusicBrainz, Daniel Santos (singer), Album, Record producer, Record label, Bolero, Feliz Navidad (song) / 28 / 43
Folk remedies sore throat / Outline of health sciences,Taxonomy (biology), Angiosperms, Eudicots, Plantae / 27 / 92
Halloween activities for middle school / Halloween, International Standard Book Number, October 31, The Halloween Tree, The Halloween Tree (film), Samhain, The Haunted Mask / 72 / 66
Hip roof / Gable, Roof garden, Flat roof, Roof, Gablet roof, Gable roof, Purlin / 67 / 86
History of orcas island / Orca, Washington (state), Puget Sound, International Standard Book Number, Killer whale, SeaWorld, Iceland / 25 / 18
Holes by louis sachar / Louis Sachar, Holes (novel), International Standard Book Number, Newbery Medal, OCLC, Sequoyah Book Award, List of winners of the National Book Award / 94 / 84

Table-1: Results from Wikipedia Query Expansion on select Queries

From Table-1 we can observe that most the expansion terms are related concepts to the seed query. For example “holes by louis sachar” gives us “Louis Sachar” and “Holes (novel)” which are terms that could be used to expand “Holes by louis sachar”.

However a few loosely related terms like “International Standard Book” number also show up in the results for a lot of seed queries. The following graph(Figure II) illustrates the effectiveness of the expanded queries against the original queries on Google.

Fig 2: Effectiveness of the expanded queries against the original queries on Google.

5. CONCLUSION

In this paper, we presented a simple PRF method for Query Expansion using Wikipedia. The method takes advantage of relationships among concepts represented by Wikipedia links. We first identified base candidates from the seed query. We then proceeded to identify all the candidates that are related to the the N best base candidates. Using, documents that link to the base candidates, as well as documents that are linked to in the base documents. We then proceed to use the anchor text of the links as query expansion terms, since they accurately represent the relationship between documents.

6. REFERENCES

[1]Voorhees, E.M.: Query expansion using lexical semantic relations. In: Proceedings of the1994 ACM SIGIR Conference on Research and Development in Information Retrieval(1994)

[2]Stairmand, M.A.: Textual context analysis for information retrieval. In: Proceedings of the1997 ACM SIGIR Conference on Research and Development in Information Retrieval(1997)

[3]V. Lavrenko and W. B. Croft. Relevance based language models. In Proceedings of the 24th AnnualInternational ACM SIGIR Conference on Research and Development in Information Retrieval, (SIGIR2001), pages 120–127. ACM, 2001.

[4]D. Metzler. Beyond bags of words: effectively modeling dependence and features in information retrieval. PhD thesis, University of Massachusetts Amherst, September 2007.

[5]S. E. Robertson and S. Walker. Okapi/keenbow at TREC-8. In Proceedings of The 8th Text RetrievalConference, (TREC 1999), pages 151–162. NIST, 1999.

[6]Y. Xu, G. J. Jones, and B. Wang. Query dependent pseudo-relevance feedback based on wikipedia. InProceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development inInformation Retrieval, (SIGIR 2009), pages 59–66. ACM, 2009

[7]Query Expansion with Freebase Chenyan Xiong, Jamie Callan

[1]Assistant Professor, AIMIT, St. Alloysius College

[2]Student, MSC Software Technology, AIMIT, St. Alloysius College