Web Usage Mining and Pattern Discovery: a Survey Paper

By

Naresh Barsagade

CSE 8331

December 8, 2003

1. Introduction

Web technology is not evolving in comfortable, incremental steps; its growth is turbulent, erratic, and often rather uncomfortable. It is estimated that the Internet, arguably the most important part of the new technological environment, has expanded by about 2000% and is doubling in size every six to ten months. In recent years, advances in computer and web technologies and the decrease in their cost have expanded the means available to collect and store data. As an immediate consequence, the amount of information (meaningful data) stored has been increasing at a very fast pace. Traditional information analysis techniques are useful for creating informative reports from data and for confirming predefined hypotheses about the data. However, the huge volumes of data being collected create new challenges for such techniques as organizations look for ways to use the stored information to gain an edge over competitors. It is reasonable to believe that data collected over an extended period contains hidden knowledge about the business, or patterns characterizing customer profiles and behavior. With the rapid growth of the World Wide Web, the study of knowledge discovery on the web, including modeling and predicting the user’s access to a web site, has become very important [GO2003].

From the administration, business, and application points of view, knowledge obtained from Web usage patterns can be directly applied to efficiently manage activities related to e-Business, e-CRM, e-Services, e-Education, e-Newspapers, e-Government, Digital Libraries, and so on [AR2003]. The Web is becoming a necessity for businesses and organizations because of demand from their clients. Since web technology feeds largely on ideas and knowledge rather than depending on fixed assets, it gave birth to new companies such as Yahoo, Google, Netscape, e-Bay, e-Trade, Expedia, Amazon, and so on. With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web has become an important research area [JTP2002]. With the explosive growth of information sources available on the World Wide Web, it has become necessary for organizations to discover usage patterns and analyze them to gain an edge over competitors.

Jespersen et al. [JTB2002] proposed a hybrid approach for analyzing visitor click stream sequences. A combination of hypertext probabilistic grammar and a click fact table approach is used to mine Web logs, and it could also be used for general sequence mining tasks. Mobasher et al. [MCS1999] proposed a web personalization system, which consists of offline tasks related to the mining of usage data and an online process of automatic Web page customization based on the knowledge discovered. LOGSOM, proposed by Smith et al. [SN2003], uses Kohonen's self-organizing map (SOM) to organize web pages into a two-dimensional map based solely on the users’ navigation behavior, rather than on the content of the web pages. LumberJack, proposed by Chi et al. [CRHL2002], builds up user profiles by combining clustering of user sessions with traditional statistical traffic analysis, using the k-means algorithm. Joshi et al. [JJYK1999] used a relational online analytical processing approach for creating a Web log warehouse from access logs and mined logs. A comprehensive overview of web usage mining research is found in [SCDT2000, CMS1997, CMS1999, RWC2000].

Web mining can be divided into three areas, namely web content mining, web structure mining, and web usage mining [SCDT2000]. Web content mining focuses on the discovery of information stored on the Internet. Web structure mining focuses on improving the structural design of a website. Web usage mining, the main topic of this paper, focuses on knowledge discovery from the usage of individual web sites.

The tables below [NN2003] show current Internet usage around the globe and in the United States.

Global Internet Usage: Average Usage

Month of September 2003, Panel Type: Home

Metric / September / August / % Change
Number of Sessions per Month / 22 / 22 / 1.65
Number of Unique Domains Visited / 55 / 54 / 0.89
Page Views per Month / 901 / 899 / 0.3
Page Views per Surfing Session / 41 / 41 / 0
Time Spent per Month / 11:59:20 / 11:50:30 / 1.24
Time Spent During Surfing Session / 0:32:29 / 0:32:37 / -0.4
Duration of a Page Viewed / 0:00:48 / 0:00:47 / 0.94
Active Internet Universe / 252,672,070 / 253,054,814 / -0.15
Current Internet Universe Estimate / 419,054,724 / 416,339,888 / 0.65

United States: Average Web Usage

Month of October 2003, Panel Type: Home

Sessions/Visits Per Person / 71
Domains Visited Per Person / 103
PC Time Per Person / 80:46:37
Duration of a Web Page Viewed / 0:01:00
Active Digital Media Universe / 47,003,165
Current Digital Media Universe Estimate / 51,012,930

The remainder of the paper is organized as follows. Section 2 presents applications of web usage mining. Section 3 covers basic web mining terminology, the taxonomy of web mining, the architecture of web usage mining, and an explanation of the individual components in that architecture. Section 4 summarizes the paper and identifies several future research directions, and Section 5 contains the bibliography.

2. Applications of Web Usage Mining

Each of the applications can benefit from patterns that are ranked by subjective interestingness.

Web usage mining is used in the following areas:

Web usage mining offers users the ability to analyze massive volumes of clickstream or click flow data, integrate the data seamlessly with transaction and demographic data from offline sources and apply sophisticated analytics for web personalization, e-CRM and other interactive marketing programs.
Personalization for a user can be achieved by keeping track of previously accessed pages. These pages can be used to identify the typical browsing behavior of a user and subsequently to predict desired pages.
By determining frequent access behavior for users, needed links can be identified to improve the overall performance of future accesses.
Information concerning frequently accessed pages can be used for caching.
In addition to modifications to the linkage structure, identifying common access behaviors can be used to improve the actual design of Web pages and to make other modifications to the site.
Web usage patterns can be used to gather business intelligence to improve customer attraction, customer retention, sales, marketing and advertisement, and cross sales.
Mining of web usage patterns can help in the study of how browsers are used and the user’s interaction with a browser interface.
Usage characterization can also look into navigational strategy when browsing a particular site.
Web usage mining focuses on techniques that could predict user behavior while the user interacts with the Web.

Web usage mining helps in improving the attractiveness of a Web site, in terms of content and structure.
Performance and other service quality attributes are crucial to user satisfaction, and users expect high-quality performance from a web application.
Web usage mining of patterns provides a key to understanding Web traffic behavior, which can be used to deal with policies on web caching, network transmission, load balancing, or data distribution.
Web usage and data mining is also useful for detecting intrusion, fraud, and attempted break-ins to the system.
Web usage mining can be used in e-Learning, e-Business, e-Commerce, e-CRM, e-Services, e-Education, e-Newspapers, e-Government, and Digital Libraries.
Web usage mining can be used in Customer Relationship Management, Manufacturing and Planning, Telecommunications, and Financial Planning.
Web usage mining can be used in the Physical Sciences, Social Sciences, Engineering, Medicine, and Biotechnology.
Web usage mining can be used in Counter Terrorism and Fraud Detection, and in the detection of unusual accesses to secure data.

Web usage mining can be used in determination of common behaviors or traits of users who perform certain actions, such as purchasing merchandise.
Web usage mining can be used in usability studies to determine the interface quality.
Web usage mining can be used in network traffic analysis for determining equipment requirements and data distribution in order to efficiently handle site traffic.

3. Web Usage Mining and Pattern Discovery

Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications [CMS1997]. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. A high-level Web usage mining process is presented in Figure 1 [SCDT2000]. Mobasher et al. [CMS1997] propose that the web mining process can be divided into two main parts. The first part includes the domain-dependent processes of transforming the Web data into a suitable transaction form. This includes preprocessing, transaction identification, and data integration components. The second part includes data mining and pattern matching techniques such as association rules and sequential patterns. In the absence of cookies or dynamically embedded session IDs in the URIs, the combination of IP address and user agent can be used as a first-pass estimate of unique users. This estimate can be refined using the referrer field as described in [CMS1999]. Some authors have proposed global architectures to handle the web usage mining process. Cooley et al. [CTS1999] proposed a site information filter, named WebSIFT, that establishes a framework for web usage mining as shown in Figure 2.

The WebSIFT system divides the Web usage mining process into three main parts, as shown in Figure 1. For a particular Web site, the information used to construct the different information abstractions comes from the three server logs (access, referrer, and agent, often combined into a single log), the HTML files, template files, script files, or databases that make up the site content, and any optional data such as registration data or remote agent logs.

The preprocessing phase uses the input data to construct a server session file based on the methods and heuristics discussed in [CMS1999]. To preprocess a server log, the log must first be “cleaned”, which consists of removing unsuccessful requests, parsing relevant CGI name/value pairs, and rolling up file accesses into page views. Once the log is converted into a list of page views, users must be identified.
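The cleaning and user identification steps can be illustrated with a minimal sketch. It assumes the server writes the Combined Log Format, treats only 2xx responses as successful, and uses a hypothetical list of file suffixes to decide which requests are embedded resources rather than page views. The function names are illustrative and are not taken from [CMS1999].

```python
import re
from collections import defaultdict

# Assumed input: Combined Log Format
# host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical suffixes of embedded resources that belong to a page view.
EMBEDDED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def clean_log(lines):
    """Keep successful page requests; drop failures and embedded resources."""
    entries = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # line does not follow the assumed format
        entry = match.groupdict()
        if not entry["status"].startswith("2"):
            continue  # unsuccessful request (assumption: only 2xx counts)
        path = entry["url"].split("?", 1)[0].lower()
        if path.endswith(EMBEDDED_SUFFIXES):
            continue  # image, style sheet, or script rolled up into its page
        entries.append(entry)
    return entries


def group_by_user(entries):
    """First-pass user estimate: one user per (IP address, user agent) pair."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry)
    return dict(users)
```

Two requests from the same IP address but with different agent strings are counted as two users here. A refinement based on the referrer field is not shown.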

WebSIFT performs the mining in three stages. The first is a preprocessing stage in which user sessions are inferred from log data. The second searches for patterns in the data using standard data mining techniques, such as association rules or sequential pattern mining. In the third stage, an information filter based on domain knowledge and the web site structure is applied to the mined patterns in search of interesting patterns. The preprocessing phase allows the option of converting the server sessions into episodes prior to performing knowledge discovery.

Figure 2: A General Architecture for Web Usage Mining

In this case, episodes are either all of the page views in a server session that the user spent a significant amount of time viewing, or all of the navigation page views leading up to each content page view. The details of how a cutoff time is determined for classifying a page view as content or navigation are also contained in [CMS1999]. The click-stream or click-flow for each user is divided into sessions based on a simple thirty-minute timeout. The notion of what makes discovered knowledge interesting has been addressed in [PT1998]. A survey of methods that have been used to characterize the interestingness of discovered patterns is given in [HH1999]. The four dimensions used by [HH1999] to classify interestingness measures are pattern-form, representation, scope, and class. Pattern-form defines what type of patterns a measure is applicable to, such as association rules or classification rules. The representation dimension defines the nature of the framework, such as probabilistic or logical. Scope is a binary dimension that indicates whether the measure applies to a single pattern or to the entire discovered set. The final dimension, class, is also a binary dimension, labeled either subjective or objective.
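The thirty-minute timeout can be sketched as follows. This is a minimal illustration that assumes the timeout applies to the gap between two consecutive requests of the same user (rather than to the total session length), and that each request is a (timestamp, url) pair. The function name is hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(requests, timeout=SESSION_TIMEOUT):
    """Split one user's (timestamp, url) requests into sessions.

    A new session starts whenever the gap between two consecutive
    requests exceeds the timeout.
    """
    sessions = []
    previous_time = None
    for timestamp, url in sorted(requests):
        if previous_time is None or timestamp - previous_time > timeout:
            sessions.append([])  # open a new session
        sessions[-1].append((timestamp, url))
        previous_time = timestamp
    return sessions
```

With requests at 10:00, 10:10, and 11:00, the first two fall into one session and the third opens a new one, because the fifty-minute gap exceeds the timeout.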

Preprocessing for the content and structure of a site involves assembling each page view for parsing and/or analysis. Page views are accessed through HTTP requests by a “site crawler” to assemble the components of the page view. This handles both static and dynamic content. In addition to being used to derive a site topology, the site files are used to classify the pages of a site. Both the site topology and the page classification can then be fed into the information filter. The knowledge discovery phase uses existing data mining techniques to generate rules and patterns. Included in this phase is the generation of general usage statistics, such as the number of “hits” per page, the pages most frequently accessed, the most common starting page, and the average time spent on each page.
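The general usage statistics listed above are simple aggregates over the session file. The sketch below assumes each session is a list of (url, seconds_spent) pairs, which is a hypothetical representation chosen for illustration.

```python
from collections import Counter, defaultdict


def usage_statistics(sessions):
    """Compute basic usage statistics from sessions of (url, seconds) pairs."""
    hits = Counter()
    start_pages = Counter()
    total_time = defaultdict(float)
    for session in sessions:
        if not session:
            continue
        start_pages[session[0][0]] += 1  # first page of the session
        for url, seconds in session:
            hits[url] += 1
            total_time[url] += seconds
    return {
        "hits_per_page": dict(hits),
        "most_accessed_page": hits.most_common(1)[0][0] if hits else None,
        "most_common_start_page": (
            start_pages.most_common(1)[0][0] if start_pages else None
        ),
        "average_time_per_page": {
            url: total_time[url] / hits[url] for url in hits
        },
    }
```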

In the WebSIFT information filter, links between pages and the similarity between the contents of pages provide evidence that the pages are related. This information is used to identify interesting patterns; for example, itemsets that contain pages not directly connected are declared interesting. Mobasher et al. [MCS1999] propose to group the itemsets obtained by the mining stage into clusters of URL references. These clusters are aimed at real-time web page personalization. A hypergraph is inferred from the mined itemsets, where the nodes correspond to pages and the hyperedges connect the pages in an itemset. The weight of a hyperedge is given by the confidence of the rules involved. The graph is subsequently partitioned into clusters, and an active user session is matched against these clusters. For each URL in the matching clusters a recommendation score is computed, and the recommendation set is composed of all the URLs whose recommendation score is above a specified threshold.
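The idea of declaring an itemset interesting when its pages are not directly connected can be sketched for the simplest case of page pairs. This is an illustration under stated assumptions, not the WebSIFT implementation: sessions are lists of URLs, support is counted in sessions, and the site topology is given as a set of (source, target) link pairs. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, min_support):
    """Return page pairs that occur together in at least min_support sessions."""
    counts = Counter()
    for session in sessions:
        # Each pair is counted once per session, in a canonical sorted order.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def interesting_pairs(pairs, links):
    """Keep frequent pairs whose pages are not linked in either direction."""
    return {
        (a, b): n
        for (a, b), n in pairs.items()
        if (a, b) not in links and (b, a) not in links
    }
```

A frequent pair that survives this filter suggests that users often visit both pages even though the site offers no direct link between them.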

Buchner et al. [BBAMH1999] propose a new approach, in the form of a process, for finding marketing intelligence from Internet data. An n-dimensional web log data cube is created to store the collected data. Domain knowledge is incorporated into the data cube in order to reduce the pattern search space. They proposed an algorithm to extract navigation patterns from the data cube. The patterns conform to pre-specified navigation templates, which enable the analyst to express his knowledge about the field and to guide the mining process. This model does not store the log data in compact form, which can be a major drawback when handling very large daily log files. Information on how customers are using a Web site is critical for marketers of electronic commerce businesses. Buchner et al. [BM1998] have presented a knowledge discovery process for discovering marketing intelligence from Web data. They define a Web log data hypercube that consolidates Web usage data along with marketing data for electronic commerce applications. Four distinct steps are identified in the customer relationship life cycle that can be supported by their knowledge discovery techniques: customer attraction, customer retention, cross sales, and customer departure.
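As a rough illustration of what a web log data cube stores, the sketch below counts hits for every combination of dimension values and rolls the cube up onto a subset of dimensions. The three dimensions are hypothetical choices made for this example, and the code is not the data structure or algorithm of [BBAMH1999].

```python
from collections import Counter

# Hypothetical dimensions chosen for illustration only.
DIMENSIONS = ("page", "day", "referrer_domain")


def build_cube(records):
    """Count hits for each combination of dimension values."""
    cube = Counter()
    for record in records:
        cube[tuple(record[d] for d in DIMENSIONS)] += 1
    return cube


def roll_up(cube, keep):
    """Aggregate the cube onto the dimensions named in keep."""
    indices = [DIMENSIONS.index(d) for d in keep]
    result = Counter()
    for cell, count in cube.items():
        result[tuple(cell[i] for i in indices)] += count
    return result
```

Rolling up onto ("page",) gives hits per page, while rolling up onto ("page", "day") gives a daily breakdown for each page.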

Masseglia et al. [MPC1999] proposed an integrated tool for mining access patterns and association rules from log files. The techniques implemented pay particular attention to the handling of time constraints, such as the minimum and maximum time gap between adjacent requests in a pattern. The system provides a real-time generator of dynamic links, which aims at automatically modifying the hypertext organization when a user’s navigation matches a previously mined rule.
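The time constraints can be made concrete with a small check that tests whether a session supports a sequential pattern under minimum and maximum gap limits. This sketch assumes a session is a time-ordered list of (seconds, url) pairs and that the gaps are measured between consecutive matched requests. The function name is hypothetical and the code is not the algorithm of [MPC1999].

```python
def matches_with_gaps(session, pattern, min_gap, max_gap):
    """Check whether pattern occurs in session as a subsequence such that
    the time between consecutive matched requests lies in [min_gap, max_gap].

    session: list of (seconds, url) pairs sorted by time.
    pattern: list of urls.
    """
    def search(start, pattern_index, previous_time):
        if pattern_index == len(pattern):
            return True  # every pattern element has been matched
        for i in range(start, len(session)):
            time, url = session[i]
            if url != pattern[pattern_index]:
                continue
            if previous_time is not None:
                gap = time - previous_time
                if gap > max_gap:
                    break  # later requests are even further away
                if gap < min_gap:
                    continue  # too close; try a later occurrence
            if search(i + 1, pattern_index + 1, time):
                return True
        return False

    return search(0, 0, None)
```

The search backtracks over alternative occurrences of a page, so a pattern is still found when an early occurrence violates a gap limit but a later one satisfies it.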

Fundamental methods of data cleaning and preparation have been well studied by Srivastava et al. [SCDT2000]. The main techniques traditionally used for modeling usage patterns in a Web site are collaborative filtering (CF), clustering of pages or user sessions, association rule generation, sequential pattern generation, and Markov models. The prediction step is the real-time processing of the model, which considers the active user session and makes recommendations based on the discovered patterns. The time spent on a page is a good measure of the user’s interest in that page, providing an implicit rating for it [GO2003]. If a user is interested in the content of a page, she will likely spend more time there compared to the other pages in her session. The authors of [GO2003] presented a new model that uses both the sequence of visited pages and the time spent on those pages; it reflects the structural information of the user session and handles two-dimensional information.
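Of the techniques listed above, a first-order Markov model is the simplest to sketch. The example below assumes sessions are lists of URLs and predicts the next page from the last page of the active session only. It does not use the time spent on pages, so it is not the model of [GO2003]. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_markov(sessions):
    """Count page-to-page transitions observed in the training sessions."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            transitions[current_page][next_page] += 1
    return transitions


def recommend(transitions, active_session, top_n=1):
    """Return up to top_n (page, probability) pairs for the likely next page."""
    if not active_session:
        return []
    followers = transitions.get(active_session[-1])
    if not followers:
        return []  # last page was never seen with a successor
    total = sum(followers.values())
    return [
        (page, count / total) for page, count in followers.most_common(top_n)
    ]
```

Training corresponds to the offline model-building step, and recommend corresponds to the real-time prediction step applied to the active user session.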