

Comparing Performance of Different Search Engines through Experiments

Xiannong Meng

Department of Computer Science

Bucknell University

Lewisburg, PA 17837, U.S.A.

Song Xing

Department of Information Systems

California State University -- Los Angeles

Los Angeles, CA 90032, U.S.A.

Abstract

This chapter reports the results of a project attempting to assess the performance of a few major search engines from various perspectives. The search engines involved in the study include the Microsoft Search Engine (MSE) when it was in its beta test stage, AllTheWeb, and Yahoo. In a few comparisons, other search engines such as Google and Vivisimo are also included. The study collects statistics such as the average user response time, the average process time for a query reported by MSE, and the number of pages relevant to a query reported by all search engines involved. The project also studies the quality of search results generated by MSE and other search engines using RankPower as the metric. We found that MSE performs well in the speed and diversity of query results, but is weaker in other statistics compared to some other leading search engines. The contribution of this chapter is to review performance evaluation techniques for search engines and to use different measures to assess and compare the quality of different search engines, especially MSE.


Introduction

Search engines, since their inception in the early to mid-1990s, have gone through many stages of development. Early search engines were derived from work on two different, but related, fronts. One was to retrieve, organize, and make searchable the widely available, loosely formatted HTML documents on the Web. The other was the then-existing information access tools such as Archie (Emtage, 1992), Gopher (Anklesaria et al., 1993), and WAIS (Wide Area Information Servers) (Kahle, 1991). Archie collects information about numerous FTP sites and provides a searchable interface so users can easily retrieve files from different FTP sites. Gopher provides search tools for a large number of Gopher servers on the Internet. WAIS has functionality similar to that of Archie, except that it concentrated on a wide variety of information on the Internet, not just FTP sites.

With the rapid development of the Web, search engines designed specifically for the Web started to emerge. Examples include WWWW (the World Wide Web Worm), the then-most-powerful search engine AltaVista, NorthernLight, WebCrawler, Excite, InfoSeek, HotBot, AskJeeves, AlltheWeb, MSNSearch, and, of course, Google. Some of these search engines have since disappeared; others were re-tooled, re-designed, or simply merged; yet others have been able to stay at the front against all the competition. Google, since its inception in 1998, has been the most popular search engine, largely because of the early success of its core search algorithm, PageRank (Brin & Page, 1998). Search engines today are generally capable of searching not only free text, but also structured information such as databases, as well as multimedia such as audio and video. Some representative work can be found in (Datta et al., 2008) and (Kherfi et al., 2004). More recently, some academic search engines have started to focus on indexing the deeper web and producing knowledge based on the information available on the web, e.g., the KnowItAll project by Etzioni and his team; see for example (Banko et al., 2007).

In this relatively short history, many aspects of search engines, including software, hardware, management, and investment, have been researched and advanced. Microsoft, though a latecomer in the Web search business, tried very hard to compete with Google and other leading search engines. As a result, Microsoft unveiled its own search engine on November 11th, 2004 (Sherman, 2004). We refer to it as MSE in this discussion. The beta version of the search engine has since evolved into what is now called the Live search engine.

This chapter reports the results of a project attempting to assess the performance of the Microsoft search engine, when it was in its beta version, from various perspectives. Specifically, the study collects statistics such as the average user response time, the average process time for a query reported by MSE itself, the number of pages relevant to a query, the quality of the search in terms of RankPower, and comparisons with its competitors. The rest of the chapter is organized as follows. Section 2 provides an overview of search engine performance metrics. The goals and the metrics of this study are described in Section 3. Section 4 discusses the method of study and the experimental settings, followed by the results and their analysis in Section 5. Our thoughts and conclusions about the study are presented in Section 6.

Performance Metrics for Web Search Engines

While user perception is important in measuring the retrieval performance of search engines, quantitative analyses provide more "scientific evidence" that a particular search engine is "better" than another. Traditional measures of recall and precision (Baeza-Yates 1999) work well for laboratory studies of information retrieval systems. However, they do not capture the performance essence of today's web information systems, for three basic reasons. The first reason lies in the importance of the rank of retrieved documents in web search systems. A user of web search engines would not go through a list of hundreds or thousands of results; a user typically goes through a few pages of a few tens of results. The recall and precision measures do not explicitly reflect the ranks of the retrieved documents: a relevant document could be listed first or last in the collection, and the two cases mean the same as far as recall and precision are concerned at a given recall value. The second reason is that web search systems cannot practically identify and retrieve all the documents that are relevant to a search query in the whole collection of documents, which the recall/precision measure requires. The third reason is that the recall/precision measure is a pair of numbers, and it is not easy for ordinary users to read and quickly interpret what the pair means. Researchers (see a summary in (Korfhage 1997)) have proposed many single-value measures, such as the expected search length ESL (Cooper 1968), the average search length ASL (Losee 1998), the harmonic mean F measure, the E measure, and others, to tackle the third problem.

Meng (2006) compares, through a set of real-life web search data, the effectiveness of various single-value measures. The measures examined include ASL, ESL, average precision, the F measure, the E measure, and RankPower, applied against a set of web search results. The experimental data were collected by sending 72 randomly chosen queries to AltaVista (AltaVista, 2005) and MARS (Chen & Meng 2002, Meng & Chen 2005).

The classic measures of user-oriented performance of an IR system are precision and recall, which can be traced back to the 1960s (Cleverdon et al. 1966, Treu 1967). Assume a collection of N documents, of which Nr are relevant to the search query. When a query is issued, the IR system returns a list of L results, where L <= N, of which Lr are relevant to the query. Precision P and recall R are defined as follows.

P = Lr / L    (1)

and

R = Lr / Nr    (2)

Note that 0 <= P <= 1 and 0 <= R <= 1. Essentially, precision measures the portion of the retrieved results that are relevant to the query, and recall measures the percentage of relevant results retrieved out of the total number of relevant documents in the document set. A typical way of measuring precision and recall is to compute the precision at each recall level. A common method is to divide the recall range into 10 intervals with 11 points ranging from 0.0 to 1.0, and to calculate the precision at each of these recall levels. The goal is to have a high precision rate as well as a high recall rate. Several other measures are related to precision and recall. Average precision and recall (Korfhage 1997) computes the average of recall and precision over a set of queries. The average precision at seen relevant documents (Baeza-Yates 1999) takes the average of the precision values observed after each new relevant document is retrieved. The R-precision measure (Baeza-Yates 1999) assumes knowledge of the total number of relevant documents R in the document collection and computes the precision at the R-th retrieved document. The E measure, proposed in (Van Rijsbergen 1974), can vary the weight of precision and recall by adjusting the parameter β between 0 and 1: in the extreme case when β is 0, E = 1 - P and recall has the least effect, while when β is 1 recall has the most effect. The harmonic F measure (Shaw 1986) is essentially the complement of the E measure, F = 1 - E. The precision-recall measure and its variants are effective measures of performance of information retrieval systems in an environment where the total document collection is known and the subset of documents that are relevant to a given query is also known.
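To make these definitions concrete, the following sketch computes precision, recall, and the harmonic F measure for a single judged result list. It is illustrative only; the function and variable names are not from the original study, and relevance judgments are assumed to be available as a set of relevant document identifiers.

```python
def precision_recall(retrieved, relevant):
    """Precision P and recall R for one query.

    retrieved: ordered list of document ids returned by the engine (length L)
    relevant:  set of ids of all documents relevant to the query (size Nr)
    """
    L = len(retrieved)
    Lr = sum(1 for doc in retrieved if doc in relevant)   # relevant among retrieved
    P = Lr / L if L else 0.0                              # P = Lr / L
    R = Lr / len(relevant) if relevant else 0.0           # R = Lr / Nr
    return P, R


def f_measure(P, R):
    """Harmonic F measure; F = 1 - E when precision and recall are weighted equally."""
    return 2 * P * R / (P + R) if (P + R) else 0.0


# Example: 5 documents returned, 2 of them relevant, 4 relevant documents exist in total.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d7", "d4", "d2", "d8"}
P, R = precision_recall(retrieved, relevant)
print(f"P = {P:.2f}, R = {R:.2f}, F = {f_measure(P, R):.2f}")   # P = 0.40, R = 0.50, F = 0.44
```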

The drawbacks of the precision-recall based measures are multi-fold. Most noticeably, as Cooper pointed out in his seminal paper (Cooper 1968), the pair does not provide a single measure; it assumes a binary relevant-or-irrelevant judgment of documents, failing to provide any gradual order of relevance; it has no built-in capability for comparing system performance with purely random retrieval; and it does not take into account a crucial variable: the amount of material relevant to the query which the user actually needs. The expected search length (ESL) (Cooper 1968, Korfhage 1997) was proposed to counter these problems. The ESL is the average number of irrelevant documents that must be examined to retrieve a given number i of relevant documents. The weighted average of the individual expected search lengths can then be defined as follows,

ESL = ( Σ_{i=1}^{N} i * ei ) / ( Σ_{i=1}^{N} i )    (5)

where N is the maximum number of relevant documents, and ei the expected search length for i relevant documents.
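For a strictly ranked list with binary judgments (a simplification of Cooper's treatment, which also handles ties in weakly ordered output), the search length for i relevant documents and the weighted average of equation (5) can be sketched as follows. The choice of i as the weight for ei follows the reconstruction of equation (5) above and should be treated as an assumption; the names are illustrative.

```python
def search_length(ranked_rel, i):
    """Number of irrelevant documents examined before the i-th relevant one is found.

    ranked_rel: list of booleans, True where the document at that rank is relevant.
    Returns None if the list contains fewer than i relevant documents.
    """
    found, irrelevant_seen = 0, 0
    for is_rel in ranked_rel:
        if is_rel:
            found += 1
            if found == i:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    return None


def weighted_esl(ranked_rel):
    """Weighted average of e_i for i = 1..N, weighting e_i by i (equation (5), as
    reconstructed above; the weighting scheme is an assumption of this sketch)."""
    N = sum(ranked_rel)                  # total relevant documents in the list
    if N == 0:
        return 0.0
    lengths = [search_length(ranked_rel, i) for i in range(1, N + 1)]
    weights = range(1, N + 1)
    return sum(w * e for w, e in zip(weights, lengths)) / sum(weights)


# Example: relevant documents appear at ranks 2, 3, and 6.
judged = [False, True, True, False, False, True]
print(search_length(judged, 1))           # 1 irrelevant document before the first relevant one
print(round(weighted_esl(judged), 2))     # (1*1 + 2*1 + 3*3) / (1 + 2 + 3) = 2.0
```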

The average search length (ASL) (Losee 1998, Losee 1999, Losee 2000) is the expected position of a relevant document in the ordered list of all documents. For a binary judgment system (i.e., a document is either relevant or irrelevant), the average search length is given by the following relation,

ASL = N[QA + (1 - Q)(1 - A)]    (6)

where N is the total number of documents, Q is the probability that the ranking is optimal, and A is the expected proportion of documents examined in an optimal ranking if one examines all the documents up to the document in the average position of a relevant document. The key idea of ASL is that one can compute the quality of an IR system without actually measuring it, provided certain parameters can be learned in advance. On the other hand, if one examines the retrieved documents, the value A can be determined experimentally as the total number of retrieved relevant documents divided by the total number of retrieved documents, and thus the quality indicator Q can be computed.
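The relation in equation (6) can thus be used in both directions: ASL can be measured directly as the average position of the relevant documents, and Q can then be recovered once A is estimated as described above. The sketch below illustrates this; it assumes binary judgments and A != 0.5, and the names are illustrative.

```python
def measured_asl(ranked_rel):
    """Average position (1-based rank) of the relevant documents in a ranked list."""
    positions = [rank for rank, is_rel in enumerate(ranked_rel, start=1) if is_rel]
    return sum(positions) / len(positions)


def quality_from_asl(asl, N, A):
    """Solve ASL = N * [Q*A + (1 - Q)*(1 - A)] for the quality indicator Q.

    Rearranging: ASL / N = (1 - A) + Q * (2A - 1), so Q = (ASL/N - (1 - A)) / (2A - 1).
    Requires A != 0.5.
    """
    return (asl / N - (1 - A)) / (2 * A - 1)


# Example: 10 retrieved documents, with relevant ones at ranks 2, 4, and 6.
judged = [False, True, False, True, False, True, False, False, False, False]
N = len(judged)
A = sum(judged) / N                  # proportion of retrieved documents that are relevant
asl = measured_asl(judged)           # (2 + 4 + 6) / 3 = 4.0
print(asl, round(quality_from_asl(asl, N, A), 2))   # 4.0 0.75
```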

Except for the basic precision and recall measures, the rest of the aforementioned measures are single-value measures. They have the advantage of representing system performance in a single value, which makes it easier to understand and compare the performance of different systems. However, these single-value measures share a weakness in one of two areas: either they do not explicitly consider the positions of the relevant documents, or they do not explicitly consider the count of relevant documents. This makes the measures non-intuitive and difficult for users of interactive IR systems such as web search engines to grasp.

To alleviate these problems with other single-value measures for web search, Meng & Chen proposed a single-value measure called RankPower (Meng & Chen 2004) that combines the precision and the placements of the returned relevant documents. The measure is based on the average rank and the count of the returned relevant documents. A closed-form expression of the optimal RankPower can be found, so that comparisons of different web information retrieval systems can be made easily. The RankPower measure reaches its optimal value when all returned documents are relevant.

RankPower is defined as follows.

RankPower(N) = ( Σ_{i=1}^{n} Si ) / n^2    (7)

where N is the total number of documents retrieved, n is the number of relevant documents among N, and Si is the place (or the position) of the i-th relevant document.
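A minimal sketch of the RankPower computation in equation (7), assuming binary relevance judgments over the ranked result list; the helper name is illustrative.

```python
def rank_power(ranked_rel):
    """RankPower of a ranked result list (equation (7)): the average rank of the
    relevant documents divided by their count, i.e. (S1 + ... + Sn) / n**2.
    Smaller is better; with all N documents relevant the value is (N + 1) / (2N),
    which approaches the optimum of 0.5 as N grows."""
    ranks = [rank for rank, is_rel in enumerate(ranked_rel, start=1) if is_rel]
    n = len(ranks)
    if n == 0:
        return float("inf")          # no relevant document was retrieved
    return sum(ranks) / (n * n)


# Example: relevant documents at ranks 1, 3, and 4 out of 10 returned.
judged = [True, False, True, True, False, False, False, False, False, False]
print(round(rank_power(judged), 2))  # (1 + 3 + 4) / 3**2 = 0.89
```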

While the physical meaning of RankPower as defined above is clear -- the average rank divided by the count of relevant documents -- the range of values it can take is difficult to interpret. The optimal value (the minimum) approaches 0.5 when all returned documents are relevant, and it is not clear how to interpret this value in an intuitive way, i.e., why 0.5. The other issue is that RankPower is not bounded above: a single relevant document listed last in a list of m documents yields a RankPower value of m, and this value grows as the list size increases. In their recent work, (Tang et al. 2007) proposed a revised RankPower measure defined as follows.

RankPower_revised(N) = n(n + 1) / ( 2 Σ_{i=1}^{n} Si )    (8)

where N is the total number of documents retrieved, n is the number of relevant documents among the retrieved ones, and Si is the rank of the i-th retrieved relevant document. The beauty of this revision is that it constrains the values of RankPower to lie between 0 and 1, with 1 being the most favorable and 0 the least favorable. A minor drawback of this definition is that it loses the intuition of the original definition, namely the average rank divided by the count of relevant documents.
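Under the reconstruction of equation (8) above, the revised measure differs from the original only in the final expression, as the sketch below shows; the same binary-judgment assumption applies.

```python
def revised_rank_power(ranked_rel):
    """Revised RankPower (equation (8)): n * (n + 1) / (2 * (S1 + ... + Sn)).
    Values lie in (0, 1]; the value is 1 exactly when every relevant document is
    ranked ahead of every irrelevant one, and 0 is returned when nothing relevant
    appears in the list."""
    ranks = [rank for rank, is_rel in enumerate(ranked_rel, start=1) if is_rel]
    n = len(ranks)
    if n == 0:
        return 0.0                   # least favorable: no relevant document retrieved
    return n * (n + 1) / (2 * sum(ranks))


# Same example list as above: relevant documents at ranks 1, 3, and 4.
judged = [True, False, True, True, False, False, False, False, False, False]
print(round(revised_rank_power(judged), 2))   # 3 * 4 / (2 * 8) = 0.75
```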

The experiment and data analysis reported in (Meng 2006) compared the RankPower measure with a number of other measures. While the exact numerical results may no longer be very relevant because they are dated, the data do show that the RankPower measure is effective and easy to interpret. A similar approach to that discussed in (Korfhage 1997) was used in the study. A set of 72 randomly chosen queries was sent to the chosen search engines, AltaVista (AltaVista, 2005) and MARS (Chen & Meng, 2002). The first 200 returned documents for each query were used as the document set, and each of the 200 documents for each query was examined to determine the set of relevant documents. This process was repeated for all 72 queries. The average recall and precision were then computed at each of the recall intervals. The results are listed in Table 1.
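The averaging procedure just described can be sketched as follows. This is a generic 11-point interpolated-precision computation, not the code used in the original experiments; the interpolation rule (take the maximum precision at any recall at or above each level) is a common convention and is assumed here.

```python
def eleven_point_precision(ranked_rel, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one query.

    ranked_rel:     list of booleans over the ranked results (True = relevant)
    total_relevant: number of documents relevant to the query in the whole set
    """
    points, rel_seen = [], 0
    for rank, is_rel in enumerate(ranked_rel, start=1):
        if is_rel:
            rel_seen += 1
        points.append((rel_seen / total_relevant, rel_seen / rank))   # (recall, precision)
    levels = [i / 10 for i in range(11)]
    return [max([p for r, p in points if r >= level], default=0.0) for level in levels]


def average_over_queries(per_query):
    """Average the 11-point precision curves over a set of queries.

    per_query: list of (ranked_rel, total_relevant) pairs, one per query.
    """
    curves = [eleven_point_precision(rr, tr) for rr, tr in per_query]
    return [sum(values) / len(values) for values in zip(*curves)]


# Two toy queries; in the actual study each list would hold 200 judged results.
q1 = ([True, False, True, False], 2)
q2 = ([False, True, False, False], 1)
print([round(p, 2) for p in average_over_queries([q1, q2])])
```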

Table 1 Average Recall and Precision at the First 20 Returned Results

Recall \ Precision / 0.00 / 0.10 / 0.20 / 0.30 / 0.40 / 0.50 / 0.60 / 0.70 / 0.80 / 0.90 / 1.00 / Sum / Avg precision
0.00 / 4 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 4 / 0.00
0.10 / 0 / 2 / 1 / 1 / 3 / 0 / 0 / 1 / 1 / 1 / 1 / 11 / 0.48
0.20 / 0 / 6 / 4 / 1 / 1 / 4 / 2 / 5 / 0 / 3 / 4 / 30 / 0.52
0.30 / 0 / 0 / 1 / 2 / 8 / 4 / 1 / 1 / 0 / 0 / 0 / 17 / 0.43
0.40 / 0 / 1 / 0 / 0 / 2 / 1 / 0 / 0 / 1 / 1 / 0 / 6 / 0.52
0.50 / 0 / 0 / 0 / 0 / 0 / 0 / 1 / 0 / 0 / 0 / 0 / 1 / 0.60
0.60 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0.00
0.70 / 0 / 1 / 0 / 1 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 2 / 0.20
0.80 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0.00
0.90 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0.00
1.00 / 0 / 1 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 1 / 0.10
Sum / 4 / 11 / 6 / 5 / 14 / 9 / 4 / 7 / 2 / 5 / 5 / 72
Avg recall / 0.00 / 0.32 / 0.20 / 0.32 / 0.26 / 0.27 / 0.30 / 0.20 / 0.25 / 0.22 / 0.18

Shown in Table 2 are the numerical values of the various single-value measures collected from the same data set. Following the discussion in (Cooper 1968), five different types of ESL measures were studied. These five types are listed as follows.

1. Type 1: A user may just want the answer to a very specific factual question or a single statistic. Only one relevant document is needed to satisfy the search request.

2. Type 2: A user may actually want only a fixed number, for example six, of documents relevant to a query.

3. Type 3: A user may wish to see all documents relevant to the topic.

4. Type 4: A user may want to sample a subject area as in Type 2, but wish to specify the ideal size for the sample as some proportion, say one-tenth, of the relevant documents.

5. Type 5: A user may wish to read all relevant documents in case there are fewer than five, and exactly five in case more than five exist.

Notice that the various ESL measures count the number of irrelevant documents that must be examined in order to find a fixed number of relevant documents; ASL, on the other hand, is the average position of the relevant documents; and RankPower is the average rank divided by the number of relevant documents, with a lower bound of 0.5. In all these cases, the smaller the values are, the better the performance is. The revised RankPower takes values between 0 and 1, with 0 being the least favorable and 1 the most favorable.
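To make the five request types concrete, the sketch below evaluates each of them on a single judged result list (the search_length helper from the earlier ESL sketch is repeated so the block stands alone). Capping Type 2 at the number of relevant documents actually present and rounding the Type 4 sample size up with ceil are assumptions of this sketch, as is the strict-ranking simplification used earlier.

```python
import math


def search_length(ranked_rel, i):
    """Irrelevant documents examined before the i-th relevant one (None if impossible)."""
    found, irrelevant_seen = 0, 0
    for is_rel in ranked_rel:
        if is_rel:
            found += 1
            if found == i:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    return None


def esl_for_types(ranked_rel):
    """Search lengths for the five request types on one judged result list."""
    total = sum(ranked_rel)                                     # relevant documents present
    return {
        "Type 1": search_length(ranked_rel, 1),                 # a single relevant document
        "Type 2": search_length(ranked_rel, min(6, total)),     # six relevant documents (capped)
        "Type 3": search_length(ranked_rel, total),             # all relevant documents
        "Type 4": search_length(ranked_rel, max(1, math.ceil(total / 10))),  # one-tenth sample
        "Type 5": search_length(ranked_rel, min(5, total)),     # up to five relevant documents
    }


# Example: 12 judged results with relevant documents at ranks 1, 3, 4, 7, 8, 10, and 11.
judged = [True, False, True, True, False, False, True, True, False, True, True, False]
print(esl_for_types(judged))   # {'Type 1': 0, 'Type 2': 4, 'Type 3': 4, 'Type 4': 0, 'Type 5': 3}
```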

Table 2 Various Single-Value Measures Applied to the Experiment Data

Measure / AV / MARS
ESL Type 1 / 3.78 / 0.014
ESL Type 2 / 32.7 / 25.7
ESL Type 3 / 124 / 113
ESL Type 4 / 7.56 / 0.708
ESL Type 5 / 25.7 / 17.3
ASL (measured) / 82.2 / 77.6
ASL (estimated) / 29.8 / 29.8
RankPower / 3.29 / 2.53
Revised RankPower / 0.34 / 0.36

We can draw the following observations from the data shown in Table 2. Note that these observations demonstrate the effectiveness of single-value measures, especially RankPower. The focus was not on the comparison of the actual search engines, since the experimental data are a few years old.

  1. In the ESL Type 1 comparison, AltaVista has a value of 3.78, which means that on average one needs to go through 3.78 irrelevant documents before finding a relevant document. In contrast, the ESL Type 1 value for MARS is only 0.014, which means a relevant document can almost always be found at the beginning of the list. MARS performs much better in this comparison because of its relevance feedback feature.
  2. ESL Type 2 counts the number of irrelevant documents that a user has to go through if she wants to find six relevant documents. AltaVista has a value of 32.7, while MARS has a value of 25.7. Again, because of the relevance feedback feature of MARS, it performs better than AltaVista.
  3. It is very interesting to analyze the results for the ESL Type 3 request, which measures the number of irrelevant documents a user has to go through if she wants to find all relevant documents in a fixed document set. In our experiments, the document set is the 200 returned documents for a given query, and the result is averaged over the 72 queries used in the study. Although the average number of relevant documents is the same for AltaVista and MARS (see the values of the estimated ASL), the positions of these relevant documents differ because of the way MARS works, which results in different values of ESL Type 3. In order to find all relevant documents in the return set, whose average size is 29.8 documents, AltaVista would have to examine a total of 124 irrelevant documents while MARS would examine 113, because MARS has moved more relevant documents toward the beginning of the set.
  4. ESL Type 4 requests indicate that the user wants to examine one-tenth of all relevant documents, and the measure reports how many irrelevant documents the user has to examine in order to achieve this goal. In this case, all relevant documents in the returned set of 200 have to be identified before the 10 percent can be counted. On average, AltaVista would have to examine about 8 irrelevant documents before reaching the goal, while MARS requires fewer than one irrelevant document.
  5. ESL Type 5 requests examine up to a certain number of relevant documents. The example quoted in Cooper's paper (Cooper 1968) was five. For AltaVista, it takes about 26 irrelevant documents to find five relevant documents, while MARS requires only about 17.

Goals and Metrics of the Study