User Access Pattern Enhanced Small Web Search
Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen , Wei-Ying Ma, Hong-Jiang Zhang
2003.3.6
Technical Report
MSR-TR-2003-12
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
User Access Pattern Enhanced Small Web Search
Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen , Wei-Ying Ma, Hong-Jiang Zhang Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA, USA 98052
Abstract
Current search engines generally utilize link analysis techniques to improve the ranking of returned web-pages. However, the same techniques applied to a small Web, such as a website or an intranet Web, cannot achieve the same level of performance because the link structure is different from the global Web. In this paper we proposed a novel method of generating implicit link structure based on users’ access patterns, and then apply a modified PageRank algorithm to produce the ranking of web-pages. Our experimental results indicate that the proposed method outperforms keyword-based method by 16%, explicit link-based PageRank by 20% and DirectHit by 14%, respectively.
1. INTRODUCTION
Global search engines such as Google [6] and AltaVista [3] are widely used by Web users to find their desired information directly on the global Web. Nevertheless, to get more specific and up-to-date information, users usually go directly to websites as the very starting points and conduct search in those websites. Intranet search is another increasing application area which enables search on the documents of an organization. In both cases, search occurs in a closed sub-space of the Web, which is called small Web search in this paper.
Existing small Web search engines generally utilize the same search and ranking technologies as those widely used on global search engines. However, the performances of them are problematic. As reported in Forrester survey [24], current site-specific search engines fail to deliver all the relevant content, instead returning too much irrelevant content to meet the user’s information needs. In the survey, the search facilities of 50 websites were tested, but none of them received a satisfactory result. Furthermore, in the TREC-8 Small Web Task, little benefit is obtained from the use of link-based methods [4], which also demonstrates the failure of exiting search technologies in small Web search.
The reason of failure lies in several aspects. First, it is generally accepted that ranking by keyword-based similarity faces difficulties such as shortness and ambiguity of queries. Second, link analysis technologies could not be directly applied because the link structure of a small Web is different from the global Web. We will explain this difference in detail in Section 2. Third, users’ access log could be utilized to improve search performance. However, so far few efforts in this direction were made except DirectHit [5].
In this paper, we propose to generate implicit link structure based on user access pattern mining from Web logs. The link analysis algorithms then use the implicit links instead of the original explicit links to improve the search performance. The experimental results reveal that generated implicit links contain more recommendation links than explicit links. The search experiments on Berkeley website illustrate that our method outperforms the keyword-based method by 16%, explicit link based PageRank by 20% and DirectHit by 14%, respectively.
The rest of this paper is organized as follows. In Section 2, we present the basic link structure of a small Web. Section 3 analyzes user access pattern in a small Web. Section 4 describes the mining technology to construct implicit link structure. Section 5 discusses the ranking process. Our experimental results are presented in Section 6. In Section 7, we present related work on small Web search and link analysis. Finally, conclusions and future works are discussed in Section 8.
2. THE LINK STRUCTURE OF SMALL W EB S
The link analysis algorithms such as PageRank [14] and HITS [13] use eigenvector calculations to identify authoritative Web pages based on hyperlink structures. The intuition is that a page with high in-degree is highly recommended, and should have a high rank score.
Figure 1 . Link structure of the Global Web and a small Web
However, there is a basic assumption underlying those link analysis algorithms: the whole Web is a citation graph (see the left plot in Figure 1), and each hyperlink represents a citation or recommendation relationship. Formally, there is the following recommendation assumption.
R ecommendation assumption: a hyperlink in page X pointed to page Y stands for the recommendation of page Y by the author of page X.
Henzinger [16] also points out there is a similar-topic assumption underlying link analysis. In this paper, we consider similar-topic assumption to be roughly true because a small Web is often about a single domain. On the global Web, the recommendation assumption is generally correct because hyperlinks encode a considerable amount of authors’ judgment. Of course some hyperlinks are created not for the recommendation purpose, but the influence of them could be reduced to an ignorable level in the global Web.
However, this assumption is commonly invalid in the case of a small Web. As depicted in the right part of Figure 1, the majority of hyperlinks in a small Web are more “regular” than that in the global Web. For example, most links are from a parent node to children nodes, between sibling nodes or from leaves to the root (e.g. “Back to Home”). The reason is primarily because hyperlinks in a small Web are created by a small number of authors; and the purpose of the hyperlinks is usually to organize the content into a hierarchical or linear structure. Thus the in-degree measure does not reflect the authority of Web pages, making the existing link analysis algorithms not suitable for this domain.
In a small Web, hyperlinks could be divided into navigational link s and recommendation links. The latter is useful for link analysis. However, filtering out navigational links from all hyperlinks is not sufficient because the remaining recommendation links are incomplete and un-weighted. In other words, there are many implicit recommendation links (which is called “implicit links” for short) in a small Web.
We propose to use implicit links instead of explicit links to construct the graph of small Web. The pages and the implicit links is modeled as a connected weighted directed graph G = (V, E), where the set V of vertices consists of the pages in a small Web, and the set E represents the implicit links among pages, each link being associated by a numeric value w i30. The graph is completely connected because for each pair of pages there is an implicit link. The subsequent link analysis is based on this graph.
3. USER ACCESS PATTERN AND IMPLICIT LINK
User access patterns include straightforward statistics and more sophisticated forms such association rules or sequential patterns. DirectHit [5] is one attempt to utilize access patterns in search. It simply takes the hit count of each web-page as its rank score. However, the problem with DirectHit is that it could not reveal the real authoritativeness of web-pages, which is supposed to be weighted sum of authoritativeness of pages which cite them.
For the purpose of mining for implicit recommendation links among web-pages, we seek to find patterns that reflect such kind of recommendation relationship. In particular, we seek to find following two-item sequential pattern.
Two-item sequential pattern: an ordered pair of web-pages (A?B) with s%, which denote in a browsing session, if a user visit page A, he intend to visit page B after a few steps with probability s%.
To explain why two-item sequential patterns are implicit links that are required for link analysis, we should prove that (1)All implicit recommendation links could be generated by two-item sequential pattern mining, and (2)All two-item sequential patterns satisfy the recommendation assumption. The first hypothesis is commonly true because each implicit link will reflect many users’ intention from the source page to the destination page. Thus many users will traverse from the source page through different paths but finally arrive at the destination. This pattern could be discovered if we allow a large interval between page pairs in the mining process.
However the second hypothesis is not much intuitive. Many factors will affect the quality of generated links. Users’ browsing behaviors are often random and unintentional, which brings many noises into the results. Because user access patterns are statistically significant patterns mined from a large group of users instead of any individual, the impact of noise could be lowered down when there is sufficient log data.
Moreover, because traversal paths of users are composed of explicit links of the small Web, user access patterns will be subject to explicit link structures, which might cause the generated access patterns include unnecessary navigational links. In the following we will illustrate that explicit links have small influence on user access patterns when a small Web’s connectivity is high and users have no significant path selection bias.
The left graph of Figure 2 demonstrates a link structure in which there is no explicit links between A and Z. Suppose the user access patterns indicate that users often start from A and traverse through BC or D and arrive at Z at last. The two-item sequential patterns are A?B, B?C, C?Z, A?C, B?Z, A?D, D?Z, and A?Z, in which A?Z is the real implicit link while others are in fact navigational links. However, if users have no bias of paths selection, the support of A?Z will be two times of other pattern’s supports. If there are more paths between A and Z, each path will possess a tiny support, which could be easily removed by an appropriate minimum support. Thus the explicit links could be filtered out and only the pattern (A?Z) remains.
Another example in the right graph of Figure 2 considers there is an existing search page (or a site map equivalently). Users usually pose queries on the search page S and got result sets I, J, and K. If users who have visited I will also visit J for further information (maybe other search results are visited between them), there likely is an implicit link between I and J. In this case, explicit links between I and J (for example, there is a path I?X?Y?J) have no influence on access patterns too. Furthermore, the existence of search page of site map S dramatically increases the connectivity of a web-site.
Figure 2 . User access patterns
Of course there are cases that noise and explicit links may affect the link generation. The experiments show the resulting implicit links generated by our algorithms, as well as the quality of them, i.e. whether they are really recommendation links.
4. I MPLICIT R ECOMMENDATION L INK GENERATION
The data source of generating access patterns is Web logs collected from a small Web. We first preprocess the logs and segment them into browsing sessions. The preprocess procedure consists of three steps: data cleaning, session identification and consecutive repetitions elimination.
An entry in a Web log contains the following information: the timestamp of a traversal from a source page to a target page, the IP address of the originating host, the type of access (GET or POST) and other data. Since we do not care about the images or scripts embedded in the web-pages, we simply filter out the access entries for them. After cleaning the data, we simply distinguish the user by their IP address, i.e. we assume that consecutive accesses from the same IP during a certain time interval come from the same user. In order to handle the case of multi-users with the same IP, we remove IP addresses whose page hits count exceeds some threshold. Afterward, the consecutive entries are grouped into a browsing session. Different grouping criteria are modeled and compared in [29]. In this paper, we choose the criteria similar to [29], i.e., a new session starts when the duration of the whole group of traversals exceeds a time threshold. Consecutive repetitions within a session are then eliminated, e.g. compact the session ((A , A , B , C , C , C) to (A , B , C)). After preprocessing, we obtain a series of user browsing sessions S=(s1, s2, …, s m), where s i=(u i: p i1, p i2, …, p ik), u i is the user id and p ij is the web-pages in the browsing path.
The process of two-item sequential pattern mining is outlined below. Given a set of page-visiting transactions where each transaction is a list of web-pages ordered by access time, we find all ordered pairs of web-pages that have minimum support. After that each implicit links are created for each two-item user access pattern (i?j), with the weight being support of this pattern.
Existing algorithms for sequential mining like AprioriAll [27] do not fit our situation very well. For example, when we compute the frequency of the two-item candidate pattern (D?C), a long path such as (D?A?……?V?C), where the distance between D and C is large, will be taken into consideration. That is, a long path also contributes to the frequency of (D?C). According to the analysis in Berkhin et al. [23], when the interval of the page pairs in the path is larger than 6, the topics of those pages may not be relevant any more. Moreover, we only need to create the two-item sequential patterns.
Thus, we propose the following algorithm. First, a gliding window is used to create two-item ordered pairs. Second the frequency of each ordered pair is computed, and all the infrequent pairs are filtered out with a user-specified minimum support to generate the two-item sequential pattern.